The Collection of a Human Robot Collaboration Dataset for Cooperative Assembly in Glovebox Environments (2024)

Shivansh Sharma  Mathew Huang  Sanat Nair  Alan Wen
Christina Petlowany  Juston Moore  Selma Wanna  Mitch Pryor
Department of Mechanical Engineering
University of Texas at Austin, Austin, TX 78712
{shivansh.s, mathewh, sanatnair, alanwen,
cpetlowany, slwanna, mpryor}@utexas.edu
jmoore01@lanl.gov
Corresponding author.

Abstract

Industry 4.0 introduced AI as a transformative solution for modernizing manufacturing processes. Its successor, Industry 5.0, envisions humans as collaborators and experts guiding these AI-driven manufacturing solutions. Developing these techniques necessitates algorithms capable of safe, real-time identification of human positions in a scene, particularly their hands, during collaborative assembly. Although substantial efforts have curated datasets for hand segmentation, most focus on residential or commercial domains. Existing datasets targeting industrial settings predominantly rely on synthetic data, which we demonstrate does not effectively transfer to real-world operations. Moreover, these datasets lack the uncertainty estimations critical for safe collaboration. Addressing these gaps, we present HAGS: the Hand and Glove Segmentation Dataset. HAGS provides 1200 challenging examples for building hand and glove segmentation applications in industrial human-robot collaboration scenarios, as well as out-of-distribution images, constructed via green screen augmentations, for assessing ML-classifier robustness. We study state-of-the-art, real-time segmentation models to evaluate existing methods. Our dataset and baselines are publicly available at https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.18738/T8/85R7KQ and https://github.com/UTNuclearRoboticsPublic/assembly_glovebox_dataset.

1 Introduction

Gloveboxes are self-contained spaces that allow workers to handle hazardous materials through gloves affixed to sealed portholes (see Figure 1). This setup protects operators from exposure and prevents unfiltered material releases into the environment (13). Workers and researchers who handle hazardous materials in gloveboxes face issues such as ergonomic injuries and potential exposure through glove tears. The use of robots in these environments can mitigate many of these problems.

[Figure 1: glovebox collaborative assembly environment (three panels).]

Recent successes in the machine learning community have inspired robotics researchers to develop large-scale datasets aimed at achieving breakthroughs similar to ImageNet (12) for robotic research (5; 9). While there are innumerable possibilities for robotics applications, previous datasets prioritize instruction following in residential environments (5; 9). Unfortunately, these datasets often overlook elements of Human-Robot Interaction (HRI), including Human-Robot Collaboration (HRC): where human and robotic agents work together toward a shared goal. In particular, this behavior is desired in manufacturing tasks for collaborative assembly.

To perform collaborative assembly tasks safely, robots must understand where human operators’ hands are within a shared task space. This requires active safety systems which rely on hand segmentation algorithms to avoid or interact with human collaborators. Despite the plethora of openly available hand segmentation datasets, most do not prioritize operating in hazardous or industrial settings. Rather, these datasets leverage web-sourced data which are biased toward easily accessible objects and environments, e.g., common household items in residential settings (9). The public datasets that do target industrial domains often suffer from being either: (1) small-scale and lacking human subject diversity (30; 32) or (2) generated as synthetic data (18).

These shortcomings are highly consequential for active safety systems. For instance, most real-time segmentation algorithms leverage convolutional neural networks (CNNs) for hand classification. However, these architectures may over-rely on pixel color values for classification. Thus, underrepresentation in these datasets may unnecessarily imperil people of color.

In an effort to motivate work toward this overlooked task space, we present HAGS: the Hand and Glove Segmentation dataset (IRB ID: STUDY00003948), which is publicly available at https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.18738/T8/85R7KQ. This dataset contributes the following:

  1. The first publicly released human-robot collaboration glovebox hand and glove segmentation dataset. This dataset contains 191 videos of joint assembly tasks totaling 9 hours of content with 1728 frames of pixel-level labels.

  2. The inclusion of twelve diverse participants to maximize the relevance of the ungloved data.

  3. Multiclass segmentation for distinguishing between left and right hands, as needed by Human-Robot Collaboration applications.

  4. A report on comprehensive baselines for segmentation performance that include metrics for uncertainty quantification, which are largely missing in previous works.

2 Related Work

There are numerous hand segmentation datasets; however, most focus on aspects of daily life, with activities such as cooking or playing cards (1; 21; 23). For a fuller accounting of these prior works, see Appendix A.2. Although expansive, these datasets are not adequate for situations such as collaborative tasks with robots. Thus, we focus our dataset comparisons on the most relevant works in industrial domains (see Table 1).

Table 1: Comparison with related industrial datasets.

| Dataset | Mode | Activity | Labels | Subjects | Label Method | # Classes |
| WorkingHands, 2019 (32) | CD | Assembly | 7.8k | - | Pixel | 14 |
| MECCANO, 2020 (28) | C | Assembly | - | 20 | Bound. Box | 21 |
| HRC, 2022 (30) | C | Assembly | 1.3k | 2 | Pixel | 5 |
| HaDR, 2023 (18) | CD | Manipulation | 117k | - | Pixel | - |
| HAGS (ours), 2024 (31) | C | Assembly | 1.7k | 10 | Pixel | 2 |

Task Comparisons. Most industrial datasets are not tailored for HRC, barring recent work on uncertainty estimation for segmentation tasks pertaining to collaborative assembly (30). Unfortunately, that dataset is derived from only two participants. In an effort to expand on this work, we collect additional data on twelve participants for two new assembly tasks while incorporating glovebox environment data. Supplementing glovebox data is crucial because this metallic environment presents unique segmentation challenges such as shine and reflection.

The HaDR dataset (18) features robotic arms and closely resembles our goal to develop robust hand segmentation algorithms. However, no specific HRC task is defined in their dataset and it is entirely synthetic. Similarly, many industrial datasets synthetically augment background color and texture to assess robustness in segmentation models (18; 32). Additionally, while these datasets contain a plethora of tool and item classes, they do not distinguish right- and left-hand information in their labeling, which presents a challenge to extending to future HRI research and applications.

Participant Diversity. Well-constructed datasets are crucial to machine learning models’ reliability and task performance. In the context of active safety systems, participant diversity is of utmost importance. Our work surpasses previous industrial datasets (2; 6; 15; 25; 30) in this regard by including twelve diverse participants. Please refer to Appendix A.1 for additional details.

Color Invariance. Several datasets leverage depth data to reduce the influence of lighting and skin coloration on their segmentation models (18; 32). However, multi-modal models may still over-rely on RGB features. Other work seeks to remove this bias entirely by creating color-agnostic datasets, but potentially sacrifices generalizable segmentation performance by forgoing rich RGB signals (20).

Alternative methods used to address color-invariant segmentation include augmenting existing works with simulated data (32) or generating fully synthetic training datasets (18). Although most synthetic datasets strive to capture features of real-world data, others aim to build texture and lighting invariance into their models via unrealistic data augmentations (18).

Contributions of HAGS. Our work addresses the limitations of existing industrial domain datasets by focusing on a previously underserved area: glovebox environments. We present a large and diverse dataset, comprising real images from ten participants, in contrast to the synthetic images commonly used in other studies. Recognizing the potential application of this technology in active safety systems for human-robot interactions, we have developed challenging examples using out-of-distribution (OOD) scenarios, such as green screens with distracting images and bare hands instead of gloves. Furthermore, we encourage the evaluation of uncertainty quantification (UQ) metrics, such as expected calibration error, to enhance safety information. Our findings indicate that prior works do not transfer effectively to our challenging dataset, underscoring the need for further efforts targeting industrial domains.

3 HAGS Dataset

This section provides an overview of the dataset collection, preparation, and annotation processes for HAGS. Appendix A.1 documents the complete Datasheet (17).

3.1 Data Collection

Videos are collected in a standard Department of Energy (DOE) glovebox. Two camera angles are provided per video: one 1080p GoPro from a bird’s-eye view and one 1080p Intel RealSense Development Kit camera recording the right side of the participant. Twelve participants are included in the study, with 16 videos each, totaling more than 9 hours of content. Normally distributed frames are sampled and annotated from each video, amounting to over 1440 frames. A Universal Robots UR3e arm, with an attached gripper for handling objects, is used to aid the human subject. The robot is pre-programmed to assist the human participant with sequential assembly tasks.

3.2 Surrogate Joint-Assembly Tasks

In order to gather representative data for joint-assembly tasks within a glovebox, two surrogate tasks were designed for human participants to perform. The first task was to assemble a Jenga tower, and the second task was to deconstruct a toolbox.

Jenga Task. Jenga blocks, which loosely resemble the shape, size, and color of human fingers, were chosen to challenge hand segmentation models. The experiment involved three roles: a robot operator, a Jenga block placer, and a participant. The robot operator managed the robot’s actions, which included picking up a Jenga block and handing it to the participant for placement. This sequence was repeated until the participant successfully stacked six Jenga blocks.

Toolbox Task. In this experiment, three roles were defined: a robot operator, a tool adjuster, and a participant. The participant was given a closed toolbox secured by four screws. A tool adjuster was on standby while the robot operator oversaw the robot’s movements. The robot systematically picked up tools from a stand and handed each one to the participant, who used them to unscrew the toolbox. After using the first two tools, the participant “rejected” the selection and handed the third tool back to the robot, which then returned it to the tool adjuster and retrieved the fourth tool for the participant to continue opening the box. Finally, the robot retrieved a previously used and replaced tool from the tool adjuster, providing it to the participant to remove the last screw. Each screwdriver differed in shape and size. The toolbox was white, posing a challenge for segmentation models to distinguish it from the white gloves worn by the participant in the glovebox.

3.3 Additional Factors

Beyond task and participant diversity, two additional factors were altered during the collection process: green screen and glove use.

[Figure 2: example frame with the green screen background replaced by a synthetic texture.]

Green screen. A green screen was placed on the bottom of the glovebox for half of the participant videos. This green screen was used to later apply synthetic textures and colors to the background of the real-world images (see Fig. 2). The script incorporated in our code repository for generating synthetic images leverages Google Image searches. Notably, these images are not stored within our dataset; rather, we provide the capability for other researchers to independently construct analogous datasets. Frames extracted from videos containing a green screen background are placed into an OOD test set.
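For readers who want to reproduce this augmentation outside our repository, the snippet below is a minimal chroma-key sketch of the idea rather than the exact script we ship; the HSV thresholds and file paths are illustrative assumptions.

```python
# Minimal chroma-key sketch of the green screen replacement idea (not the exact
# script in the repository). HSV thresholds and file paths are illustrative
# assumptions and may need tuning for the actual footage.
import cv2
import numpy as np

def replace_green(frame_bgr: np.ndarray, background_bgr: np.ndarray) -> np.ndarray:
    """Swap green-screen pixels in `frame_bgr` with pixels from `background_bgr`."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Assumed hue/saturation/value range for the green screen material.
    mask = cv2.inRange(hsv, (35, 60, 60), (85, 255, 255))
    # Clean up speckle so glove/hand pixels are not accidentally replaced.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    background = cv2.resize(background_bgr, (frame_bgr.shape[1], frame_bgr.shape[0]))
    out = frame_bgr.copy()
    out[mask > 0] = background[mask > 0]
    return out

# Example usage with hypothetical file names:
# frame = cv2.imread("participant_01/top_view/jenga_gs/frame_0042.png")
# texture = cv2.imread("downloaded_textures/clutter_01.jpg")
# cv2.imwrite("ood/frame_0042_replaced.png", replace_green(frame, texture))
```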

Gloves. Despite operating in a glovebox, half of the recorded videos feature ungloved hands (see the right side of Fig. 1). We incorporated this additional OOD scenario due to the risk of glove tears. Thus, our dataset accounts for rare scenarios where the active safety system must still ensure safe HRC despite skin exposure. These videos are also placed in an OOD test set.

3.4 Data Preparation

The combination of the four factors below results in 16 videos per participant.

  • Top View / Side View

  • Toolbox task / Jenga task

  • Gloves worn / No gloves worn

  • Green screen included / No green screen included

We split sampled frames into an in-distribution (ID) and an OOD set. The ID set contains the most likely glove box operating scenarios. In total, 1440 frames were sampled for labeling. These were equally distributed across all videos, with 120 ID frames and 24 OOD frames sampled per participant.
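The routing rule implied by this split (gloved, non-green-screen frames are ID; everything else is OOD) can be made concrete with a small sketch; the factor names below are illustrative, not the dataset's actual directory layout.

```python
# Sketch of how the 2x2x2x2 factor grid maps to ID/OOD splits, assuming the
# rule stated above: only gloved, non-green-screen frames are in-distribution.
from itertools import product

views = ["top", "side"]
tasks = ["toolbox", "jenga"]
gloves = ["gloves", "no_gloves"]
backgrounds = ["plain", "green_screen"]

def split_for(view: str, task: str, glove: str, background: str) -> str:
    """ID requires gloved hands and no green screen; everything else is OOD."""
    return "ID" if (glove == "gloves" and background == "plain") else "OOD"

combos = list(product(views, tasks, gloves, backgrounds))
assert len(combos) == 16  # 16 videos per participant

id_combos = [c for c in combos if split_for(*c) == "ID"]
ood_combos = [c for c in combos if split_for(*c) == "OOD"]
print(len(id_combos), len(ood_combos))  # 4 ID combinations, 12 OOD combinations
```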

3.5 Data Annotation

Three classes were assigned to each image: left-hand, right-hand, and background. Human annotators were instructed to annotate each hand from the fingertips to the wrist and to provide their best estimate of the wrist location when the subject was wearing gloves. Using LabelStudio (34), annotators provided key point prompts to MobileSAM (40), a segmentation model that produced a coarse label for annotators to refine. Four of the researchers performed annotations. Two annotators labeled each image to establish inter-annotator agreement (IAA) for label quality. We calculate IAA in two ways: (1) as the average Cohen’s Kappa (0.916) and (2) as the average IoU (0.957) between annotator-provided labels across the full dataset, indicating strong agreement. Each frame’s annotations were converted to a single PNG file recording the three classes: left-hand, right-hand, and background.
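For clarity, the sketch below shows how the two agreement measures can be computed for a pair of annotator label maps; the integer class encoding and the choice to average IoU over the two hand classes are assumptions, not necessarily our exact procedure.

```python
# Sketch of the two inter-annotator agreement measures reported above, computed
# between two annotators' pixel-wise label maps for one frame.
# Class encoding (0=background, 1=left hand, 2=right hand) is an assumption.
import numpy as np
from sklearn.metrics import cohen_kappa_score

def pairwise_agreement(mask_a: np.ndarray, mask_b: np.ndarray):
    """Return (Cohen's kappa, mean per-class IoU) for two integer label maps."""
    kappa = cohen_kappa_score(mask_a.ravel(), mask_b.ravel())
    ious = []
    for cls in (1, 2):  # left and right hand; background ignored for IoU here
        a, b = mask_a == cls, mask_b == cls
        union = np.logical_or(a, b).sum()
        if union > 0:
            ious.append(np.logical_and(a, b).sum() / union)
    mean_iou = float(np.mean(ious)) if ious else 1.0
    return kappa, mean_iou
```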

4 Experiments

We conducted two experiments to assess the challenging nature of our dataset. Experiment A is a transfer learning experiment designed to demonstrate the limitations of prior works, listed in Table 1, when applied to our task space. This experiment highlights the lack of sufficient transferability of pretraining on existing datasets to our specific domain. Experiment B studies uncertainty quantification (UQ) involving ID and OOD testing using the HAGS dataset. The OOD frames for all participants include scenarios where gloves are not worn or a green screen is placed in the background. For the green screen frames, we overlay images of different objects to create diverse and challenging testing conditions. This experiment assesses the robustness and reliability of our models in handling diverse, challenging scenarios.

4.1 Experimental Setup

For training all models in both experiments, we use 256x256 input images and the Adam optimizer with a learning rate of 1e-3 for the UNet and BiSeNetv2 architectures and 8e-4 for MobileSAM. To train or perform inference with MobileSAM, a visual prompt is first required; we adopt the simple approach of selecting a bounding box prompt that contains the whole image. Dropout with p=0.1 is used when training all architectures. We apply a variety of data augmentations to the training set, including resize, color jitter, advanced blur, Gaussian noise, and random 90-degree rotation (p=0.5).
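As a concrete reference, the following sketch expresses this configuration assuming an albumentations-style augmentation pipeline and PyTorch optimizers; transform parameters are left at library defaults and are not necessarily the exact values in our training configs.

```python
# A minimal sketch of the training configuration described above, assuming an
# albumentations-style augmentation pipeline and PyTorch optimizers; the exact
# hyperparameters live in the repository's training_configs folder.
import albumentations as A
import torch

train_augs = A.Compose([
    A.Resize(256, 256),          # all models are trained on 256x256 inputs
    A.ColorJitter(),
    A.AdvancedBlur(),
    A.GaussNoise(),
    A.RandomRotate90(p=0.5),
])

def make_optimizer(model: torch.nn.Module, arch: str) -> torch.optim.Optimizer:
    """Adam with 1e-3 for UNet/BiSeNetv2 and 8e-4 for MobileSAM, as stated above."""
    lr = 8e-4 if arch == "mobilesam" else 1e-3
    return torch.optim.Adam(model.parameters(), lr=lr)
```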

Experiment A. To train, we use an internal cluster equipped with eight NVIDIA RTX A5000 GPUs. The models undergo initial pretraining using the WorkingHands (32), HaDR (18), and HRC (30) datasets. Following this pretraining phase, we fine-tune the models on a subset (10%) of our ID HAGS dataset. This approach is designed to test the assertion of prior work that strong performance can be achieved with minimal real-world examples when leveraging their synthetic datasets. We perform this fine-tuning for 20 epochs with batch sizes of 64 and 34 for UNet and BiSeNetv2, respectively. We did not include the MECCANO (28) dataset in our study because it only provides bounding box labels, and our task requires pixel-level segmentation. We evaluate the models on both ID and OOD data from the HAGS dataset.

Experiment B. For training, we utilize an internal cluster of three RTX A6000 GPUs. The training is conducted on ID video frames where participants are wearing gloves, and the glovebox background does not include a green screen. We exclude participant 2 from the training set, using their ID frames as the ID testing set. We then evaluate on ID and OOD testing splits. For UNet and MobileSAM we use a batch size of 64, while BiSeNetv2 utilizes a batch size of 128 for training.
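A minimal sketch of this split logic is shown below, assuming a hypothetical per-frame record with participant, glove, and green-screen fields; it is illustrative rather than the dataset's actual loading code.

```python
# Sketch of the Experiment B split: train on gloved, non-green-screen frames
# from all participants except participant 2, whose ID frames form the ID test
# set. The (participant, gloved, green_screen) record format is hypothetical.
from dataclasses import dataclass

@dataclass
class Frame:
    participant: int
    gloved: bool
    green_screen: bool
    path: str

def experiment_b_split(frames: list[Frame], held_out: int = 2):
    in_dist = [f for f in frames if f.gloved and not f.green_screen]
    train = [f for f in in_dist if f.participant != held_out]
    id_test = [f for f in in_dist if f.participant == held_out]
    ood_test = [f for f in frames if not f.gloved or f.green_screen]
    return train, id_test, ood_test
```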

4.2 Results

We monitor Intersection over Union (IoU) on the test sets as a measure of accuracy, the Expected Calibration Error (ECE) metric for calibration error, and average predictive entropy to analyze model uncertainty. Inference time per image is used to analyze real-time model capabilities. Models are also ensembled at test time to assess the impact on these metrics.
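For reference, the sketch below shows one standard way to compute the two uncertainty-related metrics over per-pixel predictions; the 15-bin equal-width binning for ECE is an assumption rather than a reported implementation detail.

```python
# Sketch of the evaluation metrics used below: Expected Calibration Error (ECE)
# over per-pixel confidences and mean predictive entropy. Binning choices
# (15 equal-width bins) are an assumption, not necessarily the paper's setting.
import numpy as np

def expected_calibration_error(probs: np.ndarray, labels: np.ndarray, n_bins: int = 15) -> float:
    """probs: (N, C) softmax outputs per pixel; labels: (N,) integer classes."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    correct = (pred == labels).astype(float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            ece += in_bin.mean() * abs(correct[in_bin].mean() - conf[in_bin].mean())
    return float(ece)

def mean_predictive_entropy(probs: np.ndarray) -> float:
    """Average per-pixel entropy of the predictive distribution."""
    eps = 1e-12
    return float(-(probs * np.log(probs + eps)).sum(axis=1).mean())
```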

Experiment A. As indicated in Table 2, the pretraining experiments posed significant challenges across all models and datasets. We identify two primary causes for these difficulties. First, the emphasis on real-time models led to the selection of lower-capacity models, which are potentially less suited for pretraining tasks. Second, the HaDR dataset (18), being the largest, may have been the only dataset with sufficient diversity and scale to facilitate effective pretraining. While these observations regarding the necessity of model and dataset scale align with current best practices, they conflict with the requirements of real-time segmentation tasks.

Table 2: Experiment A results on the ID and OOD (hands + replaced green screen) test sets after pretraining on existing datasets and fine-tuning on 10% of the ID HAGS data.

| Model | ID IOU ↑ | ID ECE ↓ | OOD IOU ↑ | OOD ECE ↓ |
| UNet+ensemble+dropout (HaDR) | 0.4475 | 0.0039 | 0.2984 | 0.0089 |
| UNet+ensemble+dropout (HRC) | - | - | - | - |
| UNet+ensemble+dropout (WH) | - | - | - | - |
| BiSeNetv2+ensemble+dropout (HaDR) | 0.0025 | 0.0078 | 0.0015 | 0.0038 |
| BiSeNetv2+ensemble+dropout (HRC) | 0.1142 | 0.0155 | 0.0737 | 0.0167 |
| BiSeNetv2+ensemble+dropout (WH) | - | - | - | - |

Table 3: Experiment B results on the ID test set and the OOD (hands) split.

| Model | ID IOU ↑ | ID ECE ↓ | OOD IOU ↑ | OOD ECE ↓ | OOD ΔPE ↑ |
| UNet | 0.8003 | 0.0034 | 0.5663 | 0.0100 | -0.0050 |
| UNet+dropout | 0.7897 | 0.0036 | 0.5709 | 0.0095 | -0.0041 |
| UNet+ensemble | 0.8017 | 0.0043 | 0.5936 | 0.0100 | -0.0032 |
| UNet+ensemble+dropout | 0.8019 | 0.0043 | 0.5753 | 0.0100 | -0.0027 |
| BiSeNetv2 | 0.7304 | 0.0052 | 0.5410 | 0.0103 | -0.0040 |
| BiSeNetv2+dropout | 0.7320 | 0.0059 | 0.5513 | 0.0102 | -0.0001 |
| BiSeNetv2+ensemble | 0.7427 | 0.0056 | 0.5321 | 0.0120 | -0.0038 |
| BiSeNetv2+ensemble+dropout | 0.7384 | 0.0058 | 0.4991 | 0.0128 | -0.0037 |
| MobileSAM | 0.5622 | 0.4020 | 0.5126 | 0.4059 | 0.0000 |
| MobileSAM+dropout | 0.5486 | 0.4011 | 0.5220 | 0.4057 | 0.0002 |
| MobileSAM+ensemble | 0.5168 | 0.4012 | 0.5128 | 0.4210 | 0.0009 |
| MobileSAM+ensemble+dropout | 0.4974 | 0.4004 | 0.5315 | 0.4223 | 0.0006 |

Experiment B. In Table 3, we present the performance metrics on the ID test set, along with an OOD split focused exclusively on hands. The results for additional OOD splits, as elaborated in Section 4, are detailed in Table 4. Our analysis reveals that the primary source of variability in IoU scores arises from the choice of evaluation set rather than other factors such as the application of dropout or variations in model architecture. Notably, model ensembling leads to improved IoU scores, particularly evident in the replaced green screen dataset.

Table 4: Experiment B results on the three OOD splits (IOU ↑, ECE ↓, ΔPE ↑ per split).

| Model | Hands IOU | Hands ECE | Hands ΔPE | Replaced GS IOU | Replaced GS ECE | Replaced GS ΔPE | Hands + Replaced GS IOU | Hands + Replaced GS ECE | Hands + Replaced GS ΔPE |
| UNet | 0.5663 | 0.0100 | -0.0050 | 0.5628 | 0.0114 | -0.0041 | 0.5694 | 0.0104 | -0.0046 |
| UNet+dropout | 0.5709 | 0.0095 | -0.0041 | 0.5556 | 0.0115 | -0.0028 | 0.5802 | 0.0098 | -0.0037 |
| UNet+ensemble | 0.5936 | 0.0100 | -0.0032 | 0.6458 | 0.0095 | -0.0022 | 0.6056 | 0.0099 | -0.0024 |
| UNet+ensemble+dropout | 0.5753 | 0.0100 | -0.0027 | 0.6484 | 0.0092 | -0.0015 | 0.6125 | 0.0096 | -0.0028 |
| BiSeNetv2 | 0.5410 | 0.0103 | -0.0040 | 0.5447 | 0.0114 | 0.0016 | 0.5335 | 0.0111 | -0.0021 |
| BiSeNetv2+dropout | 0.5513 | 0.0102 | -0.0001 | 0.5495 | 0.0127 | 0.0029 | 0.5392 | 0.0116 | 0.0016 |
| BiSeNetv2+ensemble | 0.5321 | 0.0120 | -0.0038 | 0.5857 | 0.0109 | -0.0011 | 0.5409 | 0.0119 | -0.0025 |
| BiSeNetv2+ensemble+dropout | 0.4991 | 0.0128 | -0.0037 | 0.5622 | 0.0121 | 0.0017 | 0.5429 | 0.0123 | -0.0027 |
| MobileSAM | 0.5126 | 0.4059 | 0.0000 | 0.5136 | 0.4067 | 0.0002 | 0.5196 | 0.4076 | 0.0004 |
| MobileSAM+dropout | 0.5220 | 0.4057 | 0.0002 | 0.5188 | 0.4065 | 0.0002 | 0.5263 | 0.4074 | 0.0004 |
| MobileSAM+ensemble | 0.5128 | 0.4210 | 0.0009 | 0.4802 | 0.4171 | 0.0013 | 0.4792 | 0.4070 | 0.0013 |
| MobileSAM+ensemble+dropout | 0.5315 | 0.4223 | 0.0006 | 0.5212 | 0.4193 | 0.0007 | 0.5112 | 0.4086 | 0.0012 |

Table 5: Average inference time per image and corresponding frame rate.

| Model | Time (s) ↓ | Frames per Second ↑ |
| UNet | 0.0003 | 3333 |
| UNet+ensemble | 0.0073 | 137 |
| BiSeNetv2 | 0.0008 | 1259 |
| BiSeNetv2+ensemble | 0.0018 | 56 |
| MobileSAM | 0.0146 | 68 |
| MobileSAM+ensemble | 0.0452 | 22 |

We see low ECE for the ID test set and a progressive increase in ECE for OOD sets, reaching the highest values in the replaced green screen split. When employing model ensembling, we observe the most significant reduction in ECE for the replaced green screen sets. Additionally, our findings indicate that ensembling tends to reduce predictive entropy. Lastly, we observe a counterintuitive pattern where predictive entropy is higher in ID sets compared to OOD sets. We hypothesize that this phenomenon arises because the model frequently misclassifies hands in the OOD set as background with high confidence, resulting in lower entropy for OOD frames.

Table 5 presents the average inference times per image for individual models. As anticipated, CNN-based models achieve real-time inference performance. The transformer model, MobileSAM, while inherently slower, still manages to attain near real-time performance, even when ensembled, at a rate of 22 frames per second.
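As a sanity check on Table 5, the frame rate is simply the reciprocal of the per-image latency: FPS = 1 / t_image, e.g., 1 / 0.0452 s ≈ 22 FPS for the ensembled MobileSAM.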

5 Discussion, Conclusions, and Future Work

We present a supervised dataset of real-world RGB image data for hand and glove segmentation, which includes challenging scenarios involving bare hands and green-screened backgrounds. Our experiments demonstrate that pretraining on existing datasets is insufficient to achieve the Intersection over Union (IoU) or uncertainty quantification (UQ) necessary for safe operations in active safety systems for joint assembly, human-robot collaboration tasks. Additionally, training models from scratch on our collected data still presents a challenge when generalized to OOD situations that active safety systems should be better equipped to handle. This underscores the need for further research and specialized datasets in this domain.

There are several areas in which this work can be improved. One significant enhancement would be to increase the diversity and quantity of images and labels, particularly by increasing the training set size to be comparable to other industrial domain datasets. Although the test subjects were diverse, the size of the study limited the range of diversity that could be included. Future work should further improve diversity with a larger participant pool that includes various age ranges, genders, and skin tones. We currently employ two static camera angles, which, while providing some diversity, resulted in hands being predominantly located in predictable, centered portions of the image, as indicated by the pixel occupancy heat maps (see Figure 3). This is a noted, but not significant, limitation since the rigid form of the glovebox limits operators to working in a known and relatively small reachable workspace. The study does not utilize depth sensors to acquire accompanying depth data, nor does it record robot trajectory information. Although both could be valuable for advancing other human-robot interaction (HRI) applications, their exclusion simplifies the experimental setup while mirroring real-world glovebox configurations. A final possible limitation is that the participants used a single brand of glove. Although gloves are typically similar in color, they can vary slightly and, more importantly, tend to yellow with age.

This work advocates for the application of modern machine learning methods in assisting industrial tasks for underserved use cases. This need is particularly acute for workers whose situation may not be served by typical datasets collected in more common domains. It is the more diverse workforce often employed in these hazardous environments (such as industrial glovebox environments) that can most benefit from the safe use of advanced automation.

[Figure 3: pixel occupancy heat maps of hand locations across the dataset.]

Acknowledgments and Disclosure of Funding

This manuscript has been approved for unlimited release and has been assigned LA-UR-24-25500. This research used resources provided by the Darwin testbed at Los Alamos National Laboratory (LANL) which is funded by the Computational Systems and Software Environments subprogram of LANL’s Advanced Simulation and Computing program (NNSA/DOE). This work was supported by the Laboratory Directed Research and Development program of LANL under project number 20210043DR and C1582/CW8217. LANL is operated by Triad National Security, LLC, for the National Nuclear Security Administration of the U.S. Department of Energy (Contract No. 89233218CNA000001).

References

  • [1] Sven Bambach, Stefan Lee, David J. Crandall, and Chen Yu. Lending A Hand: Detecting Hands and Recognizing Activities in Complex Egocentric Interactions. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1949–1957, 2015.
  • [2] Lorenzo Baraldi, Francesco Paci, Giuseppe Serra, Luca Benini, and Rita Cucchiara. Gesture Recognition in Ego-centric Videos Using Dense Trajectories and Hand Segmentation. In 2014 IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 702–707, 2014.
  • [3] Lorenzo Baraldi, Francesco Paci, Giuseppe Serra, Luca Benini, and Rita Cucchiara. Gesture Recognition Using Wearable Vision Sensors to Enhance Visitors’ Museum Experiences. IEEE Sensors Journal, 15(5):2705–2714, 2015.
  • [4] Alejandro Betancourt, Pietro Morerio, Emilia I. Barakova, Lucio Marcenaro, Matthias Rauterberg, and Carlo S. Regazzoni. A Dynamic Approach and a New Dataset for Hand-detection in First Person Vision. In George Azzopardi and Nicolai Petkov, editors, Computer Analysis of Images and Patterns, pages 274–287, Cham, 2015. Springer International Publishing.
  • [5] Anthony Brohan et al. RT-1: Robotics Transformer for real-world control at scale, 2023.
  • [6] Minjie Cai, Kris M. Kitani, and Yoichi Sato. An Ego-Vision System for Hand Grasp Analysis. IEEE Transactions on Human-Machine Systems, 47(4):524–535, 2017.
  • [7] Minjie Cai, Feng Lu, and Yue Gao. Desktop Action Recognition From First-Person Point-of-View. IEEE Transactions on Cybernetics, 49(5):1616–1628, 2019.
  • [8] Hyung Jin Chang, Guillermo Garcia-Hernando, Danhang Tang, and Tae-Kyun Kim. Spatio-Temporal Hough Forest for efficient detection–localisation–recognition of fingerwriting in egocentric camera. Computer Vision and Image Understanding, 148:87–96, July 2016.
  • [9] Open X-Embodiment Collaboration. Open X-Embodiment: Robotic learning datasets and RT-X models. https://arxiv.org/abs/2310.08864, 2023.
  • [10] Sergio Cruz and Antoni Chan. Is that my hand? An egocentric dataset for hand disambiguation. Image and Vision Computing, 89:131–143, 2019.
  • [11] Dima Damen, Hazel Doughty, Giovanni Maria Farinella, Sanja Fidler, Antonino Furnari, Evangelos Kazakos, Davide Moltisanti, Jonathan Munro, Toby Perrett, Will Price, and Michael Wray. Scaling Egocentric Vision: The EPIC-KITCHENS Dataset, 2018. arXiv:1804.02748.
  • [12] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [13] DOE. DOE-HDBK-1169-2003, Nuclear Air Cleaning Handbook, 2003.
  • [14] DOE. SRS enhances safety, creates efficiencies in plutonium downblend process, 2020.
  • [15] Alireza Fathi, Xiaofeng Ren, and James M. Rehg. Learning to recognize objects in egocentric activities. In CVPR 2011, pages 3281–3288, 2011.
  • [16] Guillermo Garcia-Hernando, Shanxin Yuan, Seungryul Baek, and Tae-Kyun Kim. First-Person Hand Action Benchmark with RGB-D Videos and 3D Hand Pose Annotations, 2018. arXiv:1704.02463.
  • [17] Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. Datasheets for datasets. Commun. ACM, 64(12):86–92, November 2021.
  • [18] Stefan Grushko, Aleš Vysocký, Jakub Chlebek, and Petr Prokop. HaDR: Applying Domain Randomization for Generating Synthetic Multimodal Dataset for Hand Instance Segmentation in Cluttered Industrial Environments, 2023. arXiv:2304.05826.
  • [19] Yichao Huang, Xiaorui Liu, Xin Zhang, and Lianwen Jin. A Pointing Gesture Based Egocentric Interaction System: Dataset, Approach and Application. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 370–377, 2016.
  • [20] Byeongkeun Kang, Kar-Han Tan, Nan Jiang, Hung-Shuo Tai, Daniel Tretter, and Truong Q. Nguyen. Hand Segmentation for Hand-Object Interaction from Depth Map, 2018. arXiv:1603.02345.
  • [21] Aisha Urooj Khan and Ali Borji. Analysis of Hand Segmentation in the Wild, 2018. arXiv:1803.03317.
  • [22] Cheng Li and Kris M. Kitani. Pixel-Level Hand Detection in Ego-centric Videos. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 3570–3577, 2013.
  • [23] Yin Li, Miao Liu, and James M. Rehg. In the Eye of the Beholder: Gaze and Actions in First Person Video, 2020. arXiv:2006.00626.
  • [24] Jirapat Likitlersuang, Elizabeth R. Sumitro, Tianshi Cao, Ryan J. Visee, Sukhvinder Kalsi-Ryan, and Jose Zariffa. Egocentric Video: A New Tool for Capturing Hand Use of Individuals with Spinal Cord Injury at Home, 2019. arXiv:1809.00928.
  • [25] Jirapat Likitlersuang and Jose Zariffa. Interaction Detection in Egocentric Video: Toward a Novel Outcome Measure for Upper Extremity Function. IEEE Journal of Biomedical and Health Informatics, 22(2):561–569, March 2018.
  • [26] Shreyash Mohatta, Ramakrishna Perla, Gaurav Gupta, Ehtesham Hassan, and Ramya Hebbalaguppe. Robust Hand Gestural Interaction for Smartphone Based AR/VR Applications. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 330–335, 2017.
  • [27] Hamed Pirsiavash and Deva Ramanan. Detecting activities of daily living in first-person camera views. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2847–2854, 2012.
  • [28] Francesco Ragusa, Antonino Furnari, Salvatore Livatino, and Giovanni Maria Farinella. The MECCANO Dataset: Understanding Human-Object Interactions from Egocentric Videos in an Industrial-like Domain, 2020. arXiv:2010.05654.
  • [29] Grégory Rogez, James S. Supancic, and Deva Ramanan. Understanding Everyday Hands in Action from RGB-D Images. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 3889–3897, 2015.
  • [30] Seyedomid Sajedi, Wansong Liu, Kareem Eltouny, Sara Behdad, Minghui Zheng, and Xiao Liang. Uncertainty-Assisted Image-Processing for Human-Robot Close Collaboration. IEEE Robotics and Automation Letters, 7(2):4236–4243, 2022.
  • [31] Shivansh Sharma, Mathew Huang, Sanat Nair, Alan Wen, Christina Petlowany, Selma Wanna, and Mitch Pryor. Hand and Glove Segmentation Dataset for Department of Energy Glovebox Environments, 2024. Available at https://doi.org/10.18738/T8/85R7KQ, version 1.0.
  • [32] Roy Shilkrot, Supreeth Narasimhaswamy, Saif Vazir, and Minh Hoai. WorkingHands: A Hand-Tool Assembly Dataset for Image Segmentation and Activity Mining. In British Machine Vision Conference, 2019.
  • [33] Yansong Tang, Zian Wang, Jiwen Lu, Jianjiang Feng, and Jie Zhou. Multi-Stream Deep Neural Networks for RGB-D Egocentric Action Recognition. IEEE Transactions on Circuits and Systems for Video Technology, 29(10):3001–3015, 2019.
  • [34] Maxim Tkachenko, Mikhail Malyuk, Andrey Holmanyuk, and Nikolai Liubimov. Label Studio: Data labeling software, 2020–2022. Open source software available from https://github.com/heartexlabs/label-studio.
  • [35] Shaohua Wan and J. K. Aggarwal. Mining discriminative states of hands and objects to recognize egocentric actions with a wearable RGBD camera. In 2015 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 36–43, 2015.
  • [36] Wei Wang, Kaicheng Yu, Joachim Hugonot, Pascal Fua, and Mathieu Salzmann. Recurrent U-Net for Resource-Constrained Segmentation, 2019. arXiv:1906.04913.
  • [37] Wenbin Wu, Chenyang Li, Zhuo Cheng, Xin Zhang, and Lianwen Jin. YOLSE: Egocentric Fingertip Detection from Single RGB Images. In 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), pages 623–630, 2017.
  • [38] Chi Xu, Lakshmi Narasimhan Govindarajan, and Li Cheng. Hand Action Detection from Ego-centric Depth Sequences with Error-correcting Hough Transform, 2016. arXiv:1606.02031.
  • [39] Shanxin Yuan, Qi Ye, Bjorn Stenger, Siddhant Jain, and Tae-Kyun Kim. BigHand2.2M Benchmark: Hand Pose Dataset and State of the Art Analysis. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2605–2613, Honolulu, HI, July 2017. IEEE.
  • [40] Chaoning Zhang, Dongshen Han, Yu Qiao, Jung Uk Kim, Sung-Ho Bae, Seungkyu Lee, and Choong Seon Hong. Faster Segment Anything: Towards lightweight SAM for mobile applications. arXiv preprint arXiv:2306.14289, 2023.
  • [41] Yifan Zhang, Congqi Cao, Jian Cheng, and Hanqing Lu. EgoGesture: A New Dataset and Benchmark for Egocentric Hand Gesture Recognition. IEEE Transactions on Multimedia, 20(5):1038–1050, 2018.
  • [42] Christian Zimmermann, Duygu Ceylan, Jimei Yang, Bryan Russell, Max Argus, and Thomas Brox. FreiHAND: A Dataset for Markerless Capture of Hand Pose and Shape from Single RGB Images, 2019. arXiv:1909.04349.

Checklist

  1. For all authors…

     (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
     (b) Did you describe the limitations of your work? [Yes] Please refer to Section 5.
     (c) Did you discuss any potential negative societal impacts of your work? [No]
     (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2. If you are including theoretical results…

     (a) Did you state the full set of assumptions of all theoretical results? [N/A]
     (b) Did you include complete proofs of all theoretical results? [N/A]

  3. If you ran experiments (e.g., for benchmarks)…

     (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes] Data is hosted in the public repository at https://dataverse.tdl.org/dataset.xhtml?persistentId=doi:10.18738/T8/85R7KQ and our code is available at https://github.com/UTNuclearRoboticsPublic/assembly_glovebox_dataset.
     (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes] Training details not explicitly stated in Section 4.1 are provided in the training/training_configs folder at the GitHub link.
     (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No]
     (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] See Section 4.1.

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

     (a) If your work uses existing assets, did you cite the creators? [N/A]
     (b) Did you mention the license of the assets? [N/A]
     (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
     (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]
     (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

  5. If you used crowdsourcing or conducted research with human subjects…

     (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [Yes] The IRB exemption document is available here: https://dataverse.tdl.org/file.xhtml?fileId=599918&version=1.0 and the instructions and risks are outlined here: https://dataverse.tdl.org/file.xhtml?fileId=599917&version=1.0
     (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [Yes] These instructions and warnings are available at https://dataverse.tdl.org/file.xhtml?fileId=599917&version=1.0
     (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [Yes] The hourly wage was $0.00 per hour barring winning a raffled gift card. These documents are available at https://dataverse.tdl.org/file.xhtml?fileId=599918&version=1.0 and https://dataverse.tdl.org/file.xhtml?fileId=599917&version=1.0

Appendix A Appendix

A.1 HAGS Datasheet

This section introduces the datasheet [17] for our Hand and Glove Segmentation dataset (HAGS). The HAGS dataset was collected by the Nuclear and Applied Robotics Group and funded by Los Alamos National Laboratory.

Hosting, Long Term Preservation, and Maintenance: The dataset will be maintained by Selma Wanna (email: slwanna@utexas.edu) and hosted on the Texas Data Repository (https://dataverse.tdl.org/).

Statement of Responsibility: We accept responsibility in case of violation of rights or other issues and confirm the data license.

Purpose

This dataset aims to mitigate limitations and biases in existing industrial domain datasets, including low diversity, static environments, and inadequate documentation of HRC datasets and their data collection methods. Additionally, this dataset tackles semantic segmentation of gloves and hands in DOE glovebox environments. We consider the semantic segmentation algorithm to be a component of an active safety system for human-robot collaborative assembly. As such, we construct out-of-distribution (OOD) scenarios to measure task performance and uncertainty quantification to measure algorithmic robustness to potential failures and exaggerated use cases in real-world scenarios.

This dataset serves as a foundational resource for training models to enhance working conditions in hazardous environments. It consists of videos capturing two distinct joint assembly tasks with diverse objects, including sampled frames and annotated frames for training and evaluating hand detection models.

The dataset consists of 191 videos, totaling approximately 9 hours of footage, along with 1728 annotated in-distribution and out-of-distribution frames, all originating from 12 diverse participants. Furthermore, the dataset emphasizes participants and procedures with various characteristics, such as differing skin tones, arm orientations, camera perspectives, and task execution methods. The two joint assembly tasks within this dataset include the assembly of a block tower and the disassembly of a tool box. In all, the diversity presented enables the dataset to cover a broad range of real-world scenarios.

The dataset’s final implementation includes the development of real-time baseline models to evaluate the challenging nature of the dataset. Ultimately, with this dataset, we aim to contribute to the ongoing efforts to enhance the safety and accuracy of deep learning models in human-robot collaboration environments.

Motivation

This dataset was created to enable research on real-time hand segmentation for Human Robot Collaboration (HRC) tasks in industrial domains. The focus is on joint assembly of objects in glovebox environments. We aim to mitigate limitations and biases in existing industrial domain datasets, including low diversity, static environments, and inadequate documentation of HRC datasets and their data collection methods. Because we view the real-time segmentation algorithm as an active safety-system, we construct out-of-distribution (OOD) scenarios to measure task performance and uncertainty quantification.

The dataset was created by Shivansh Sharma, Mathew Huang, Sanat Nair, Alan Wen, Christina Petlowany, Juston Moore, Selma Wanna, and Mitch Pryor at the University of Texas at Austin.

The funding of the dataset is from Los Alamos National Laboratory in the form of a Laboratory Directed Research and Development program (20210043DR) and contract C1582/CW8217.

Composition of the Dataset

The instances that comprise the dataset are videos of test subjects performing assembly tasks and images sampled from those videos. Additionally, supervised labels are provided for each sampled image by two annotators.

There are 16 videos per test subject, from which sampled frames and annotations were created over 2 experiments, 4 variables, 2 camera angles, and 12 subjects.

The dataset includes 12 diverse participants and is self-contained; it does not include external sources or assets.

There are 1,728 in-distribution (ID) and OOD frames in total.

The dataset is a sample of instances. No tests were run to determine how representative the frames are; however, pixel heatmaps provided in Figure 3 indicate bias in hand location. Finer sampling should result in a more diverse dataset.

Labels are provided only for the images we have sampled. It is possible for other users to sample more images using the provided script (https://github.com/sanatnair/Glovebox_Segmentation_Dataset_Tools). Thereafter, software like LabelStudio [34] can assist in creating ground truths for the new images.

Each label is a pixel-wise map of three classes: background, left hand, or right hand. These labels and images are saved in PNG formats.

There is information missing from two test subject participants. We retained their corrupted files but removed their data from the final dataset.

There are no recommended data splits for the ID dataset, i.e., images which contain gloved hands and no green screen. However, for the OOD portion, i.e., images which contain either ungloved hands or green screens with and without overlaid images, we strongly recommend using this data only for test sets.

Three instances of noise are identified in this dataset after quality inspection. Firstly, it is observed that only 22 OOD frames were sampled with Participant 6, as opposed to the expected 24. This discrepancy occurs specifically within the Side_View of the Jenga_task. Secondly, there is an issue with the orientation of the Top_View > Toolbox_GL video and its subsequent frames and annotations for Participant 3. Finally, the cropped video included in the dataset for Participant 3 > Side_View > Jenga_GL_G is not the same as the video used for generating sampled frames and annotations.

The dataset is self-contained barring the instances with green screens with overlaid images. These images were sourced from Google Images using a script (https://github.com/UTNuclearRoboticsPublic/assembly_glovebox_dataset/blob/main/data/convert_scripts/replace_green.py). There is no guarantee that the specific images will remain constant over time. There is no archival version of this portion of the dataset that we can distribute.

No confidential, sensitive, anxiety inducing, or offensive images are contained in the dataset. Our dataset does not identify subpopulations.

It is possible but unlikely to identify individuals from the datasets.

Data Collection

The following list contains the hardware and software used in the data acquisition:

  • GoPro Hero 7 and a RealSense Development Kit Camera SR300

  • 1080p, 30 fps, 0.5 Ultrawide Lens.

  • UR3e from Universal Robots, programmed using the pendant tablet.

  • Experiments conducted in the same location, with the same robot and glovebox.

A deterministic sampling strategy was used.

Students and researchers from the University of Texas at Austin were involved in this study. They were not paid, but were entered into a raffle for an Amazon Gift Card.

An ethical review process was conducted. An IRB exemption was granted for this study. This information is viewable online at https://dataverse.tdl.org/file.xhtml?fileId=599918&version=1.0.

We collected the data directly from the individual participants. The data was collected from Fall 2022 - Spring 2023.

All participants received an informed consent form (https://dataverse.tdl.org/file.xhtml?fileId=599917&version=1.0) and were aware of our study and data collection.

All participants consented to the collection of their data. However, participants can revoke their data by emailing the study organizer.

Data Processing

Removal of instances and processing of missing values was done for two participants. This was due to a data corruption of our hard drive making their runs unrecoverable.

The raw data is saved in addition to the preprocessed data because we retain the video files in our dataset.

The software for labeling the data is available from LabelStudio [34].

Frames are labeled as left and right hand with LabelStudio software [34] and the MobileSAM [40] backend. Models were trained on frames compressed to 256x256 px. Ground truth masks and original footage data are available, while raw unprocessed data is private.

Ethics

Participants with recognizable features are asked to cover them up through the use of makeup or other methods to prevent identifiability.

The only characteristic identified by the dataset is race.

The experimental procedure used to collect the data was reviewed by the Institutional Review Board of the University of Texas at Austin (IRB ID: STUDY00003948).

Individuals are allowed to revoke their consent.

Uses

At the time of publication, this dataset has only been used in our original work.

Our public repository (https://github.com/UTNuclearRoboticsPublic/assembly_glovebox_dataset/tree/main) will contain a running list of associated papers and publications.

This dataset could potentially be used for intention recognition in Human Robot Interaction research. An example use case could be classifying future tool use/need based on user behavior. However, our dataset could be improved for this use case with gesture annotations.

Future uses could be impacted because we did not record robot trajectories when developing this dataset. Additionally, we did not use depth sensors or RGB-D cameras for data collection.

This dataset was collected in a DOE glovebox and as such the data should not be used on its own to develop segmentation algorithms for markedly different scenarios. We note that generalization in niche industrial domains is a current problem in ML, so please proceed with caution if you plan to adapt this work to your use case.

The data may be impacted by future changes in demographics and the proportion of representation of groups may change. The dataset is unfit for applications outside of the glovebox and applications not related to hands.

Distribution

We plan to distribute this dataset to DOE laboratories.

The dataset was released February 29, 2024.

This dataset is available under a CC0 1.0 license. Both code repositories are released under BSD 3-Clause licenses.

No third parties imposed IP-based or other restrictions on the data associated with the instances.

This dataset does not violate export controls or other regulatory restrictions.

A.2 Overview of Hand Segmentation Datasets

| Dataset | Year | Mode | Device | Type of Activity | Setting |
| GTEA [15] | 2011 | C | GoPro | Cooking | Kitchen |
| ADL [27] | 2012 | C | GoPro | Daily Life | Home |
| EDSH [22] | 2013 | C | - | Daily Life | Home / Outdoors |
| Interactive Museum [2] | 2014 | C | - | Gesture | Museum |
| EgoHands [1] | 2015 | C | Google Glass | Manipulation | Social Setting |
| Maramotti [3] | 2015 | C | - | Gesture | Museum |
| UNIGE Hands [4] | 2015 | C | GoPro Hero3+ | Daily Life | Various Daily Locations |
| GUN-71 [29] | 2015 | CD | Creative Senz3D | Daily Life | Home |
| RGBD Egocentric Action [35] | 2015 | CD | Creative Senz3D | Daily Life | Home |
| Fingerwriting in mid-air [8] | 2016 | CD | Creative Senz3D | Writing | Office |
| Ego-Finger [19] | 2016 | C | - | Gesture | Various Daily Locations |
| ANS able-bodied [25] | 2016 | C | Looxie 2 | Daily Life | Home |
| UT Grasp [6] | 2016 | C | GoPro Hero2 | Manipulation | - |
| GestureAR [26] | 2017 | C | Nexus 6 and Moto G3 | Gesture | Various Backgrounds |
| EgoGesture [37] | 2017 | C | - | Gesture | - |
| Egocentric hand-action [38] | 2017 | D | Softkinetic DS325 | Gesture | Office |
| BigHand2.2M [39] | 2017 | D | Intel RealSense SR300 | Gesture/Pose | - |
| Desktop Action [7] | 2018 | C | GoPro Hero 2 | Daily Life | Office |
| Epic Kitchens [11] | 2018 | C | GoPro | Cooking | Kitchen |
| FPHA [16] | 2018 | CD | Intel RealSense SR300 | Gesture / Pose / Daily life | Home |
| EYTH [21] | 2018 | C | - | - | - |
| HandOverFace [21] | 2018 | C | - | - | - |
| EGTEA+ [23] | 2018 | C | SMI wearable eye-tracker | Cooking | Kitchen |
| THU-READ [33] | 2018 | CD | Primesense Carmine | Daily Life | Home |
| EgoGesture [41] | 2018 | CD | Intel RealSense SR300 | Gesture | - |
| HOI [20] | 2018 | D | Kinect V2 | Manipulation | - |
| EgoDaily [10] | 2019 | C | GoPro Hero5 | Daily Life | Various Daily Locations |
| ANS SCI [24] | 2019 | C | GoPro Hero4 | Daily Life | Home |
| KBH [36] | 2019 | C | HTC Vive | Manipulation / Keyboard | Office |
| WorkingHands [32] | 2019 | CD | Kinect V2 | Assembly | Industrial |
| Freihand [42] | 2019 | C | - | Grasp / Pose | Green Screen and Outdoors |
| MECCANO [28] | 2020 | C | Intel RealSense SR300 | Assembly | Desk |
| HRC [30] | 2022 | C | iPhone 11 Pro | Assembly | Industrial |
| HaDR [18] | 2023 | CD | Realsense L515 | Manipulation | Industrial |
| Our Dataset | 2024 | C | Intel Realsense SR300 and GoPro Hero 7 | Assembly | Industrial (glovebox) |

| Dataset | Frames | Annotated Frames | Videos | Duration | Subjects | Resolution | Annotation |
| GTEA [15] | 31K | 663 | 28 | 34 m | 4 | 1280 x 720 | act msk |
| ADL [27] | >1M | 1,000,000 | 20 | 10 h | 20 | 1280 x 720 | act obj |
| EDSH [22] | 20K | 442 | 3 | 10 m | - | 1280 x 720 | msk |
| Interactive Museum [2] | - | 700 | - | - | 5 | 800 x 450 | gst msk |
| Maramotti [3] | - | - | 700 | - | 5 | 800 x 450 | gst msk |
| UNIGE Hands [4] | 150K | - | - | 98 m | - | 1280 x 720 | det |
| GUN-71 [29] | 12K | 9100 | - | - | 8 | - | grs |
| RGBD Egocentric Action [35] | - | - | - | - | 20 | C: 640 x 480, D: 320 x 240 | act |
| Fingerwriting in mid-air [8] | 8K | 2500 | - | - | - | - | ftp gst |
| Ego-Finger [19] | 93K | - | 24 | - | - | 640 x 480 | det ftp |
| ANS able-bodied [25] | - | - | - | 44 m | 4 | 640 x 480 | det ftp |
| UT Grasp [6] | - | - | 50 | 4 h | 5 | 960 x 540 | grs |
| GestureAR [26] | 51K | - | 100 | - | 8 | 1280 x 720 | gst |
| EgoGesture [37] | 59K | - | - | - | - | - | det ftp gst |
| Egocentric hand-action [38] | 154K | - | 300 | - | 26 | 320 x 240 | gst |
| BigHand2.2M [39] | 290K | 290K | - | - | - | 640 x 480 | pos |
| Desktop Action [7] | 324K | 660 | 60 | 3 h | 6 | 1920 x 1080 | act msk |
| Epic Kitchens [11] | 11.5M | - | - | 55 h | 32 | 1920 x 1080 | act |
| FPHA [16] | 100K | 100K | 1175 | - | 6 | C: 1920 x 1080, D: 640 x 480 | act pos |
| EYTH [21] | 1290 | 1290 | 3 | - | - | 384 x 216 | msk |
| HandOverFace [21] | 300 | 300 | - | - | - | 384 x 216 | msk |
| EGTEA+ [23] | >3M | 15k | 86 | 28 h | 32 | 1280 x 960 | act gaz msk |
| THU-READ [33] | 343K | 652 | 1920 | - | 50 | 640 x 480 | act msk |
| EgoGesture [41] | 3M | - | 24161 | - | 50 | 640 x 480 | gst |
| HOI [20] | 27525 | - | - | - | 6 | - | msk |
| EgoDaily [10] | 50K | 50K | 50 | - | 10 | 1920 x 1080 | det hid |
| ANS SCI [24] | - | 33K | - | - | 17 | 480 x 854 | det int |
| KBH [36] | 12.5K | 12.5K | 161 | - | 50 | 230 x 306 | msk |
| WorkingHands [32] | 4.2K syn, 3.7K real | 7.8k | 39 | - | - | 1920 × 1080 | obj msk |
| Freihand [42] | 37000 | 37000 | - | - | 32 | - | pos |
| MECCANO [28] | 299K | - | 20 | 21 min | 20 | 1280 x 720 | act obj msk |
| HRC [30] | 598 | - | 13 | - | 2 | 380 x 180 (resized) | msk |
| HaDR [18] | 117000 | 117000 | - | - | - | 640 x 480 | msk |
| Our Dataset | 4320 | 2880 | 160 | 8 h | 10 | 64 x 64 (resized) | msk |

| Dataset | Method | Simulated Data | Total Classes | Non-Hand Object Classes |
| GTEA [15] | Pixel | N | 17 | 16 |
| ADL [27] | Bounding Box | N | 11 | 10 |
| EDSH [22] | Pixel | N | 1 | 0 |
| Interactive Museum [2] | Bounding Box | N | 1 | 0 |
| EgoHands [1] | Pixel | N | 4 | 0 |
| Maramotti [3] | Pixel | N | 1 | 0 |
| UNIGE Hands [4] | - | N | 1 | 0 |
| GUN-71 [29] | Key Points + Forces | N | 1 | 0 |
| RGBD Egocentric Action [35] | - | N | 1 | 0 |
| Fingerwriting in mid-air [8] | Key Point | N | 1 | 0 |
| Ego-Finger [19] | Bounding Box + Points | N | 1 | 0 |
| ANS able-bodied [25] | Bounding Box | N | 1 | 0 |
| UT Grasp [6] | Bounding Box | N | 1 | 0 |
| GestureAR [26] | Pixel | - | 1 | 0 |
| EgoGesture [37] | Bounding Box + Points | N | 1 | 0 |
| Egocentric hand-action [38] | Bounding Box | Y | 1 | 0 |
| BigHand2.2M [39] | Key Point | N | 1 | 0 |
| Desktop Action [7] | Pixel | N | 1 | 0 |
| Epic Kitchens [11] | Bounding Box | N | 324 | 323 |
| FPHA [16] | Key Point | N | 1 | 0 |
| EYTH [21] | Pixel | N | 1 | 0 |
| HandOverFace [21] | Pixel | N | 1 | 0 |
| EGTEA+ [23] | Pixel | N | 1 | 0 |
| THU-READ [33] | Pixel | N | 1 | 0 |
| EgoGesture [41] | Pixel | N | 1 | 0 |
| HOI [20] | Bounding Box | N | 1 | 0 |
| EgoDaily [10] | Bounding Box | N | 1 | 0 |
| ANS SCI [24] | Pixel | N | 1 | 0 |
| KBH [36] | Pixel | N | 1 | 0 |
| WorkingHands [32] | Pixel | Y | 14 | 13 |
| Freihand [42] | Key Point | N | 1 | 0 |
| MECCANO [28] | Bounding Box | N | 21 | 20 |
| HRC [30] | Pixel | N | 5 | 1 |
| HaDR [18] | Pixel | Y | 1 | 0 |
| Our Dataset | Pixel | N | 2 | 0 |
