Information

This page provides general information and a detailed overview of the data formats used in the Bimanual Actions Dataset. A manual for these data formats is also available for download as a PDF document here.

The Dataset in Numbers

Some facts about the Bimanual Actions Dataset.

Subjects     6 subjects (3 female, 3 male; 5 right-handed, 1 left-handed)
Tasks        9 tasks (5 in a kitchen context, 4 in a workshop context)
Recordings   540 recordings in total (6 subjects performed 9 tasks with 10 repetitions)
Playtime     2 hours and 18 minutes, or 221 000 RGB-D image frames
Quality      640 px × 480 px image resolution; 30 fps (83 recordings are at 15 fps due to technical issues)
Actions      14 actions (idle, approach, retreat, lift, place, hold, stir, pour, cut, drink, wipe, hammer, saw, and screw)
Objects      12 objects (cup, bowl, whisk, bottle, banana, cutting board, knife, sponge, hammer, saw, wood, and screwdriver)
Annotations  Actions fully labelled for both hands individually; 5 413 frames labelled with object bounding boxes

Action Label Mapping

Refer to the following table for a mapping of action label IDs to their symbolic names and descriptions.

#    Action    Description
0    idle      The hand does nothing semantically meaningful
1    approach  The hand approaches an object which is going to be relevant
2    retreat   The hand retreats from an object after interacting with it
3    lift      The hand lifts an object to allow using it
4    place     The hand places an object after using it
5    hold      The hand holds an object to ease using it with the other
6    pour      The hand pours something from the grasped object
7    cut       The hand cuts something with the grasped object
8    hammer    The hand hammers something with the grasped object
9    saw       The hand saws something with the grasped object
10   stir      The hand stirs something with the grasped object
11   screw     The hand screws something with the grasped object
12   drink     The hand is used to drink with the grasped object
13   wipe      The hand wipes something with the grasped object

Object Class Label Mapping

Refer to the following table for a mapping of object class label IDs to their symbolic names and descriptions.

#    Object         Description
0    bowl           Either a small green bowl or a bigger orange bowl. Used only in the kitchen
1    knife          A black knife. Used only in the kitchen
2    screwdriver    A screwdriver. Used only in the workshop
3    cutting board  A wooden cutting board. Used only in the kitchen
4    whisk          A whisk. Used only in the kitchen
5    hammer         A hammer. Used only in the workshop
6    bottle         Either a white bottle, a smaller black bottle, or a green bottle. Used only in the kitchen
7    cup            Either a yellow, blue or red cup. Used only in the kitchen
8    banana         A banana. Used only in the kitchen
9    cereals        A pack of cereals. Used only in the kitchen
10   sponge         Either a big yellow sponge, or a smaller green one. Used only in the kitchen
11   wood           A piece of wood. Either a long one, or a smaller one. Used only in the workshop
12   saw            A saw. Used only in the workshop
13   hard drive     A hard drive. Used only in the workshop
14   left hand      The subject's left hand
15   right hand     The subject's right hand

Object Relations Label Mapping

Refer to the following table for a mapping of object relation IDs to their symbolic names and descriptions.

#    Relation               Description
0    contact                Spatial relation. Objects are in contact
1    above                  Spatial relation (static). One object is above the other
2    below                  Spatial relation (static). One object is below the other
3    left of                Spatial relation (static). One object is left of the other
4    right of               Spatial relation (static). One object is right of the other
5    behind of              Spatial relation (static). One object is behind of the other
6    in front of            Spatial relation (static). One object is in front of the other
7    inside                 Spatial relation (static). One object is inside of another
8    surround               Spatial relation (static). One object is surrounded by another
9    moving together        Spatial relation (dynamic). Two objects are in contact and move together
10   halting together       Spatial relation (dynamic). Two objects are in contact but do not move
11   fixed moving together  Spatial relation (dynamic). Two objects are in contact and only one moves
12   getting close          Spatial relation (dynamic). Two objects are not in contact and move towards each other
13   moving apart           Spatial relation (dynamic). Two objects are not in contact and move apart from each other
14   stable                 Spatial relation (dynamic). Two objects are not in contact and their distance stays the same
15   temporal               Temporal relation. Connects observations of one object instance over consecutive frames

RGB-D Camera Normalisation

The camera angles vary slightly, depending on when the recordings were taken. The file bimacs_rgbd_data_cam_norm.json contains the necessary data to perform a normalisation, i.e. to rotate the point clouds to account for the tilted camera, and offsets to center the world frame on the table.

The normalisations are stored in a JSON file whose integer array key_indices holds the key indices. These key indices denote at which recording numbers the camera parameters changed. The parameters for a given recording number are obtained by finding the next biggest key index and then looking up that index in the map. For example, for recording number 42, the parameters of key index 90 would be the correct ones. The key index 0 is just a dummy to ease automatic processing.

The point cloud is rotated about the X-axis by the negative of the angle stored in angle. The offsets offset_rl (right/left), offset_h (height) and offset_d (depth) center the world frame on the table. The angle is given in degrees; all offsets are in millimetres.
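
As an illustration, the lookup and normalisation could be implemented as in the following Python sketch. The layout of the per-index parameter map (assumed here to be keyed by the stringified key indices, next to key_indices), the rotation convention, and the sign of the offsets are assumptions that should be checked against the actual file.

import json
import math

with open("bimacs_rgbd_data_cam_norm.json") as f:
    cam_norm = json.load(f)

def lookup_params(recording_number):
    # "Next biggest" key index, e.g. recording number 42 -> key index 90.
    key = min(k for k in cam_norm["key_indices"] if k > recording_number)
    return cam_norm[str(key)]  # assumed: parameters keyed by stringified key index

def normalise_point(point, params):
    # Rotate a point (in mm) about the X-axis by the negative camera angle and
    # shift the world frame onto the table (offset signs are assumed).
    angle = math.radians(-params["angle"])
    x, y, z = point
    y_rot = y * math.cos(angle) - z * math.sin(angle)
    z_rot = y * math.sin(angle) + z * math.cos(angle)
    return (x - params["offset_rl"],
            y_rot - params["offset_h"],
            z_rot - params["offset_d"])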

RGB-D Video Format

The RGB and depth videos were recorded with a PrimeSense Carmine 1.09 and are divided into separate folders, namely rgb and depth. The recordings are organised in subfolders: the first level contains all recordings of a specific subject (i.e. subject_x), the second a specific task (e.g. task_4_k_wiping or task_9_w_sawing), and the third a specific take or repetition (i.e. take_x).

Each recording is a folder which contains a file metadata.csv and one or more chunk folders chunk_x. These chunks, in turn, contain a certain number of frames (image files) from the recording. The metadata.csv file contains several variables relevant to the recording and may look like this:

name,type,value
fps,unsigned int,30
framesPerChunk,unsigned int,100
frameCount,unsigned int,427
frameWidth,unsigned int,640
frameHeight,unsigned int,480
extension,string,.png

The first column denotes the variable name, the second the type of the variable, and the third its value. The variable fps denotes the frame rate of the recording, frameWidth and frameHeight denote its resolution, and frameCount the total number of frames. The extension of the image files is encoded in extension. Because the image frames are chunked, it is also important to know the number of frames per chunk, which is stored in framesPerChunk.

For this dataset, the variables extension, framesPerChunk, frameWidth and frameHeight can be assumed constant, as the frame dimensions did not change and the other parameters were not changed during recording.
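
For example, the chunk folder that holds a given frame can be derived from metadata.csv as in the following Python sketch; the concrete path and the zero-based chunk numbering are assumptions for illustration.

import csv

def read_metadata(path):
    # Parse metadata.csv (columns: name, type, value) into a dictionary.
    meta = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            value = row["value"]
            if row["type"] == "unsigned int":
                value = int(value)
            meta[row["name"]] = value
    return meta

meta = read_metadata("rgb/subject_1/task_4_k_wiping/take_0/metadata.csv")  # hypothetical path
frame = meta["frameCount"] - 1                   # last frame of the recording
chunk_index = frame // meta["framesPerChunk"]    # which chunk_x folder holds it (assuming chunk_0 is first)
frame_in_chunk = frame % meta["framesPerChunk"]  # position of the frame within that chunk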

Action Ground Truth

This data format is used for the action ground truth. For each recording, there is a JSON file containing the action segmentation for both hands. Let's consider a simple example segmentation for a recording with 1015 frames:

{
  "right_hand": [0, 0, 1015],
  "left_hand": [0, 0, 1015]
}

For each hand, there is an array whose length is always odd. All elements at even indices are integers and denote key frames. All elements at odd indices are either an integer or null: an integer denotes the action label ID of the segment between the surrounding key frames, while null means that no action is associated with that segment.

The example above therefore translates to: "For both hands, there is an action segment with the ID '0' beginning from frame 0 and ending before frame 1015" ('0' is the action label ID of 'idle'). A graphical representation with action label IDs substituted with the actual action labels would look like this:

Frame number:   0        ...       1015
                |                    |
  Right hand:   [       idle        )[
   Left hand:   [       idle        )[

Now let's consider that the right hand begins holding an object at frame 500 (the action label ID of 'hold' is '5'). This would change the previous example to:

{
  "right_hand": [0, 0, 500, 5, 1015],
  "left_hand": [0, 0, 1015]
}

Or graphically:

Frame number:   0    ...    500    ...    1015
                |            |              |
  Right hand:   [   idle    )[    hold     )[
   Left hand:   [           idle           )[
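
The following Python sketch decodes such a segmentation into (start, end, action label ID) triples and looks up the action that is active at a given frame; the file name is hypothetical.

import json

def decode_segments(segmentation):
    # Turn [f0, a0, f1, a1, ..., fn] into (start, end, action_id) triples,
    # where each segment covers frames start..end-1 and action_id may be None.
    segments = []
    for i in range(0, len(segmentation) - 2, 2):
        start, action, end = segmentation[i], segmentation[i + 1], segmentation[i + 2]
        segments.append((start, end, action))
    return segments

def action_at(segments, frame):
    for start, end, action in segments:
        if start <= frame < end:
            return action
    return None

with open("take_0.json") as f:                       # hypothetical file name
    ground_truth = json.load(f)

right = decode_segments(ground_truth["right_hand"])  # e.g. [(0, 500, 0), (500, 1015, 5)]
print(action_at(right, 750))                         # -> 5 ('hold')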

2D Human Body Pose Data

The human body pose data is organised in subfolders: the first level contains all pose data of a specific subject (i.e. subject_x), the second a specific task (e.g. task_4_k_wiping or task_9_w_sawing), and the third a specific take or repetition (i.e. take_x). The pose data for an individual frame is saved in a corresponding JSON file in the folder body_pose.

The root element of each JSON file is a list containing a single object. The properties of this object are again objects, each encoding the confidence, the label (e.g. RAnkle or Neck), and the pixel coordinates of one key point of the pose. If the confidence is 0.0, that key point was not found in the image.
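
A minimal Python sketch for reading such a file could look like the following; the file name and the property names confidence, label, x and y inside each key-point object are assumptions and may need to be adapted to the actual files.

import json

with open("body_pose/frame_0.json") as f:  # hypothetical file name
    root = json.load(f)

pose = root[0]                             # the root list holds a single object
for keypoint in pose.values():
    if keypoint["confidence"] == 0.0:
        continue                           # key point not found in the image
    print(keypoint["label"], keypoint["x"], keypoint["y"], keypoint["confidence"])

The hand pose files described in the next section share the same structure; only the key point labels and the coordinate convention (relative instead of pixel coordinates) differ.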

2D Human Hand Pose Data

The human hand pose data is organised in subfolders: the first level contains all pose data of a specific subject (i.e. subject_x), the second a specific task (e.g. task_4_k_wiping or task_9_w_sawing), and the third a specific take or repetition (i.e. take_x). The pose data for an individual frame is saved in a corresponding JSON file in the folder hand_pose.

The root element of each JSON file is a list containing a single object. The properties of this object are again objects, each encoding the confidence, the label (e.g. LHand_15 or RHand_4), and the coordinates (relative to the image width and height) of one key point of the pose. If the confidence is 0.0, that key point was not found in the image.

2D Object Bounding Boxes

The 2D object bounding box data is organised in subfolders: the first level contains all data of a specific subject (i.e. subject_x), the second a specific task (e.g. task_4_k_wiping or task_9_w_sawing), and the third a specific take or repetition (i.e. take_x). The 2D object bounding box data for an individual frame is saved in a corresponding JSON file in the folder 2d_objects.

The root element of each JSON file is a list of JSON objects, and each JSON object represents the bounding box of a detected object. Such an object looks like this:

{
    "bounding_box": {
        "h": 0.22833597660064697,
        "w": 0.07694989442825317,
        "x": 0.7088799476623535,
        "y": 0.6917555928230286
    },
    "candidates": [
        {
            "certainty": 0.9995384812355042,
            "class_name": "cereals",
            "colour": [
                0,
                255,
                111
            ]
        }
    ],
    "class_count": 16,
    "object_name": ""
}

The property bounding_box denotes the bounding box, with x and y being the coordinates of the center of the bounding box, and w and h its width and height, respectively. These values are relative to the input image's width and height.

Sometimes there are several object class candidates for a detected object, but most of the time there is only one. The candidates are listed in the candidates property. Here, the property class_name stores the possible object class for that bounding box, and the property certainty the certainty as estimated by Yolo. The property colour encodes an RGB colour unique to the given class ID for visualisation purposes.

The total number of classes is stored in the class_count property.
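
Since the values are relative, converting a bounding box to pixel coordinates only requires the frame resolution (640 px × 480 px for this dataset). A minimal Python sketch:

def to_pixel_box(bounding_box, image_width=640, image_height=480):
    # Convert a relative center/width/height box to pixel corner coordinates
    # (x_min, y_min, x_max, y_max).
    cx = bounding_box["x"] * image_width
    cy = bounding_box["y"] * image_height
    w = bounding_box["w"] * image_width
    h = bounding_box["h"] * image_height
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)

# With the (rounded) values from the example above:
box = {"h": 0.228336, "w": 0.076950, "x": 0.708880, "y": 0.691756}
print(to_pixel_box(box))  # roughly (429.1, 277.2, 478.3, 386.8)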

3D Object Bounding Boxes

The 3D object bounding box data is organised in subfolders: the first level contains all data of a specific subject (i.e. subject_x), the second a specific task (e.g. task_4_k_wiping or task_9_w_sawing), and the third a specific take or repetition (i.e. take_x). The 3D object bounding box data for an individual frame is saved in a corresponding JSON file in the folder 3d_objects.

The root element of each JSON file is a list of JSON objects, and each JSON object represents the bounding box of a detected object. Such an object looks like this:

{
    "bounding_box": {
        "x0": -78.30904388427734,
        "x1": 36.15547561645508,
        "y0": -789.953125,
        "y1": -749.12744140625,
        "z0": -1120.7742919921875,
        "z1": -976.8253784179688
    },
    "certainty": 0.9998389482498169,
    "class_name": "banana",
    "colour": [
        0,
        255,
        31
    ],
    "instance_name": "banana_2",
    "past_bounding_box": {
        "x0": -78.34809112548828,
        "x1": 36.506187438964844,
        "y0": -789.953125,
        "y1": -749.7930908203125,
        "z0": -1120.77294921875,
        "z1": -976.9805297851563
    }
}

The extents of the 3D bounding box are defined in the bounding_box property, where x0, x1, y0, y1, z0, and z1 denote the minimum and maximum extents along the x, y, and z axes, respectively. Similarly, the property past_bounding_box contains the bounding box of the same object 333 ms in the past, which makes it possible to compute dynamic spatial relations (cf. [1]).

The properties certainty, class_name, and colour are adopted from the corresponding 2D bounding boxes.

The property instance_name holds a unique identifier within the recording, so that the same object instance can be tracked across several frames.
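
For illustration, the center of a box and the displacement of an object over the last 333 ms (which underlies the dynamic spatial relations) can be computed with a short Python sketch like this:

def box_center(box):
    # Center of an axis-aligned 3D box given by its x0/x1, y0/y1, z0/z1 extents.
    return tuple((box[axis + "0"] + box[axis + "1"]) / 2 for axis in ("x", "y", "z"))

def displacement(detected_object):
    # Movement of the box center relative to its position 333 ms earlier.
    now = box_center(detected_object["bounding_box"])
    past = box_center(detected_object["past_bounding_box"])
    return tuple(n - p for n, p in zip(now, past))

# For the example object above, displacement(...) is close to (0, 0, 0),
# i.e. the banana barely moved during the last 333 ms.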

3D Spatial Relations

The spatial relations data is organised in subfolders: the first level contains all data of a specific subject (i.e. subject_x), the second a specific task (e.g. task_4_k_wiping or task_9_w_sawing), and the third a specific take or repetition (i.e. take_x). The spatial relations data for an individual frame is saved in a corresponding JSON file in the folder spatial_relations.

The root element of each JSON file is a list of JSON objects, and each JSON object represents one spatial relation between a pair of objects. Such an object looks like this:

{
    "object_index": 0,
    "relation_name": "behind of",
    "subject_index": 1
}

The properties object_index and subject_index identify the object and the subject of a relation, for example: "The bowl (object) is behind of the cup (subject)".

The property relation_name is the label of the relation, in plain text.

The list of relations is explicit, i.e. relations that could be derived from other relations are nevertheless listed as well.
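
A minimal Python sketch for resolving such a relation to object names is given below; the file names are hypothetical, and the assumption that object_index and subject_index refer to positions in the 3d_objects list of the same frame should be verified against the data.

import json

with open("spatial_relations/frame_42.json") as f:  # hypothetical file name
    relations = json.load(f)
with open("3d_objects/frame_42.json") as f:         # hypothetical file name
    objects = json.load(f)

for relation in relations:
    # Assumption: the indices point into the 3D object list of the same frame.
    obj = objects[relation["object_index"]]["instance_name"]
    subj = objects[relation["subject_index"]]["instance_name"]
    print(obj, relation["relation_name"], subj)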

Object Detection Data

The file bimacs_object_detection_data.zip is structured like this:

bimacs_object_detection_data/
 |- images/
 |   `- *.jpg
 |
 |- labels/
 |   `- *.txt
 |
 |- images_index.txt
 |- labels_index.txt
 `- object_class_names.txt

The folder images/ contains all training images in the dataset, and the corresponding ground truth can be found in the labels/ folder. The ground truth files have the same file name as the corresponding image; only the file extension differs. The files images_index.txt and labels_index.txt contain a list of all files inside the images/ and labels/ folders, respectively.

The ground truth files use the format Darknet expects: each ground truth bounding box is given on its own line as a 5-tuple separated by whitespace:

<object class ID> <box center X> <box center Y> <box width> <box height>

The object class name corresponding to the ID can be found in object_class_names.txt. Please note that this file contains two unused object classes (unused1 and unused2); you can safely ignore them. The extents of the bounding box, given by the coordinates of its center together with its width and height, are relative to the image width and height. That is, these values range from 0 to 1.
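
Reading such a label file and converting the relative values back to pixel coordinates could look like the following Python sketch; the image dimensions are passed in explicitly.

def parse_darknet_labels(path, image_width, image_height):
    # Parse a Darknet ground truth file into (class_id, x_min, y_min, x_max, y_max)
    # tuples in pixel coordinates.
    boxes = []
    with open(path) as f:
        for line in f:
            class_id, cx, cy, w, h = line.split()
            cx, cy, w, h = (float(value) for value in (cx, cy, w, h))
            x_min = (cx - w / 2) * image_width
            y_min = (cy - h / 2) * image_height
            x_max = (cx + w / 2) * image_width
            y_max = (cy + h / 2) * image_height
            boxes.append((int(class_id), x_min, y_min, x_max, y_max))
    return boxes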

Yolo Training Environment

The file bimacs_yolo_train_setup.zip contains the training environment for Yolo, to be used together with the object detection dataset, and is structured like this:

bimacs_object_detection_data/
 |- images/
 |   `- *.jpg
 |
 |- labels/
 |   `- *.txt
 |
 |- weightfiles/
 |   `- darknet53.conv.74
 |
 |- images_index.txt
 |- labels_index.txt
 |- object_class_names.txt
 |- net.cfg
 |- train_start.sh
 `- train_resume.sh

The folder images/ contains all training images in the dataset, and the corresponding ground truth can be found in the labels/ folder. The ground truth files have the same file name as the corresponding image; only the file extension differs. The files images_index.txt and labels_index.txt contain a list of all files inside the images/ and labels/ folders, respectively.

The file weightfiles/darknet53.conv.74 is a pre-trained weights file supplied by pjreddie.com (Joseph Redmon's homepage).

The file train_setup.cfg contains the training setup for Yolo, while the file net.cfg configures the network architecture used within Darknet.

Two script files are also provided: train_start.sh to easily start the training, and train_resume.sh to resume it (loading a backup).

The object class name corresponding to the ID can be found in object_class_names.txt. Please note that this file contains two unused object classes (unused1 and unused2); you can safely ignore them.

To use this training setup, extract the zip file and move the folder bimacs_object_detection_data into the root folder of your Darknet installation. Then, from the root of your Darknet installation, run ./bimacs_object_detection_data/train_start.sh to begin training. Backup and milestone weight files will be written into the folder ./bimacs_object_detection_data/backup/. To resume the training, run ./bimacs_object_detection_data/train_resume.sh.

Yolo Execution Environment

The file bimacs_yolo_run_setup.zip contains the runtime environment to be used with Yolo and is structured like this:

bimacs_object_detection_data/
 |- weightfiles/
 |   `- net_120000.weights
 |
 |- object_class_names.txt
 `- net.cfg

The file weightfiles/net_120000.weights is the weight file pre-trained on the objects used in the dataset.

The file net.cfg is the configuration of the used network architecture within Darknet.

The object class name corresponding to the ID can be found in object_class_names.txt. Please note that this file contains two unused object classes (unused1 and unused2); you can safely ignore them.

References

[1] F. Ziaeetabar, T. Kulvicius, M. Tamosiunaite, and F. Wörgötter, “Recognition and Prediction of Manipulation Actions Using Enriched Semantic Event Chains,” Robotics and Autonomous Systems (RAS), vol. 110, pp. 173–188, Dec. 2018.