WAD 2020 Challenge

We are hosting a multi-object tracking challenge based on BDD100K, the largest open driving video dataset, as part of the CVPR 2020 Workshop on Autonomous Driving. This is a large-scale tracking challenge under the most diverse driving conditions. Understanding the temporal association of objects within videos is one of the fundamental yet challenging tasks for autonomous driving. The BDD100K MOT dataset provides diverse driving scenarios with complicated occlusion and reappearance patterns, which makes it a great testbed for the reliability of MOT algorithms in real scenes. We provide 2,000 fully annotated 40-second sequences under different weather conditions, times of day, and scene types. We encourage participants from both academia and industry, and the winning teams will be awarded certificates. The evaluation server is hosted on CodaLab.

Submission Deadline: 11:59 PM PST, June 12, 2020

Winner prizes

Organizers: Haofeng Chen, Xin Wang, Fisher Yu
If you have any questions, please email bdd100k@googlegroups.com.


BDD100K Dataset


The tasks are based on BDD100K, the largest driving video dataset to date supporting heterogeneous multi-task learning. It contains 100,000 videos representing more than 1,000 hours of driving experience with more than 100 million frames. The videos come with GPS/IMU data for trajectory information. The BDD100K dataset now provides annotations for 10 tasks: image tagging, lane detection, drivable area segmentation, object detection, semantic segmentation, instance segmentation, multi-object detection tracking, multi-object segmentation tracking, domain adaptation, and imitation learning. These diverse tasks make the study of heterogeneous multi-task learning possible.

For the CVPR 2020 Workshop on Autonomous Driving, we host the multi-object detection tracking challenge on CodaLab detailed below. Challenges on the other tasks will be announced on our dataset website.

BDD100K MOT Dataset


To advance the study of multiple object tracking, we introduce the BDD100K MOT Dataset. We provide 1,400 video sequences for training, 200 video sequences for validation, and 400 video sequences for testing. Each video sequence is about 40 seconds long and sampled at 5 fps, resulting in approximately 200 frames per video.

The BDD100K MOT Dataset is diverse not only in the visual scale of objects among and within tracks, but also in the temporal range of each track. Objects in the BDD100K MOT dataset also present complicated occlusion and reappearing patterns. An object may be fully occluded or move out of the frame, and then reappear later. The BDD100K MOT Dataset thus presents the real challenges of object re-identification for tracking in autonomous driving. Details about the MOT dataset can be found in the BDD100K paper. Visit the BDD100K data website to download the data.

Folder Structure

bdd100k/
├── images/
|   ├── track/
|   |   ├── train/
|   |   |   ├── $VIDEO_NAME/
|   |   |   |   ├── $VIDEO_NAME-$FRAME_INDEX.jpg
|   |   ├── val/
|   |   ├── test/
├── labels-20/
|   ├── box-track/
|   |   ├── train/
|   |   |   ├── $VIDEO_NAME.json
|   |   ├── val/

The frames for each video are stored in a folder in the images directory. The labels for each video are stored in a json file with the format detailed below.
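
For illustration, the short Python sketch below walks the layout above and pairs each training video's frame folder with its label file. The dataset root path is an assumption; adjust it to wherever the archives were extracted.

import glob
import os

# Assumed location of the extracted dataset; not part of the official toolkit.
DATA_ROOT = "bdd100k"

train_image_dir = os.path.join(DATA_ROOT, "images", "track", "train")
train_label_dir = os.path.join(DATA_ROOT, "labels-20", "box-track", "train")

# Each training video is a folder of frames plus one JSON label file of the same name.
for video_dir in sorted(glob.glob(os.path.join(train_image_dir, "*"))):
    video_name = os.path.basename(video_dir)
    frames = sorted(glob.glob(os.path.join(video_dir, "*.jpg")))
    label_file = os.path.join(train_label_dir, video_name + ".json")
    print(video_name, len(frames), "frames, labels found:", os.path.isfile(label_file))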

Label Format

Each json file contains a list of frame objects, and each frame object has the format below. The format follows the schema of the BDD100K data format.

- name: string
- videoName: string
- index: int
- labels: [ ]
    - id: string
    - category: string
    - attributes:
        - Crowd: boolean
        - Occluded: boolean
        - Truncated: boolean
    - box2d:
        - x1: float
        - y1: float
        - x2: float
        - y2: float
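
As a minimal sketch of reading this schema, the snippet below loads one label file and prints the box size of every annotated object; the file name is a placeholder.

import json

# Placeholder path following the folder structure above; substitute a real video name.
with open("bdd100k/labels-20/box-track/train/VIDEO_NAME.json") as f:
    frames = json.load(f)  # a list of frame objects

for frame in frames:
    for label in frame.get("labels", []):
        box = label["box2d"]
        width = box["x2"] - box["x1"]
        height = box["y2"] - box["y1"]
        print(frame["videoName"], frame["index"], label["id"], label["category"],
              round(width, 1), round(height, 1))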

There are 11 object categories in this release:

pedestrian
rider
other person
car
bus
truck
train
trailer
other vehicle
motorcycle
bicycle

Notes:

  • The same instance shares "id" across frames.
  • The "pedestrian", "bicycle", and "motorcycle" classes correspond to the "person", "bike", and "motor" classes in the BDD100K Detection dataset.
  • We consider "other person", "trailer", and "other vehicle" as distractors, which are ignored during evaluation. We only evaluate the multi-object tracking of the other 8 categories.
  • We set three super-categories for the purpose of evaluation: "person" (with classes "pedestrian" and "rider"), "vehicle" ("car", "bus", "truck", and "train"), and "bike" ("motorcycle" and "bicycle"); see the sketch after this list.
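
To make the notes above concrete, one way to encode the distractor set and the super-category grouping is shown below; the variable and function names are our own, not part of the official evaluation code.

# Categories treated as distractors and ignored during evaluation.
DISTRACTORS = {"other person", "trailer", "other vehicle"}

# Super-categories used for the reference evaluation.
SUPER_CATEGORIES = {
    "person": {"pedestrian", "rider"},
    "vehicle": {"car", "bus", "truck", "train"},
    "bike": {"motorcycle", "bicycle"},
}

def super_category(category):
    """Return the super-category of an evaluated class, or None for distractors."""
    for name, members in SUPER_CATEGORIES.items():
        if category in members:
            return name
    return None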

Submission Format

The submission file for each of the two phases is a json file compressed into a zip archive. Each json file is a list of frame objects with the format detailed below. The format also follows the schema of the BDD100K data format.

- name: string
- labels: [ ]
    - id: string
    - category: string
    - box2d:
        - x1: float
        - y1: float
        - x2: float
        - y2: float

Note that objects with the same identity share the same id across frames within a given video, and ids should be unique across different videos. Our evaluation matches the category strings, so you may assign your own integer IDs to the categories inside your model. However, we recommend encoding the 8 relevant categories in the following order so that it is easier for the research community to share models.

pedestrian
rider
car
truck
bus
train
motorcycle
bicycle

The evaluation server will perform evaluation for each category and aggregate the results to compute the overall metrics. The server will then merge both the ground-truth and predicted labels into super-categories and evaluate each super-category.
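
As a rough sketch of packaging a submission in this format, the snippet below writes tracker output to a json file and compresses it into a zip archive; the example frame, box values, and file names are purely illustrative.

import json
import zipfile

# Suppose `results` is the list of frame objects produced by your tracker,
# following the schema above. A single made-up frame is shown here.
results = [
    {
        "name": "VIDEO_NAME-0000001.jpg",
        "labels": [
            {"id": "0", "category": "car",
             "box2d": {"x1": 100.0, "y1": 200.0, "x2": 180.0, "y2": 260.0}},
        ],
    },
]

with open("submission.json", "w") as f:
    json.dump(results, f)

# The server expects the json file compressed as a zip archive.
with zipfile.ZipFile("submission.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write("submission.json")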

Evaluation

  • Evaluation platform: We host our evaluation server on CodaLab. There are two phases for the challenge: val phase and test phase. The final ranking will be based on the test phase.
  • Pre-training: It is fair game to pre-train your network with ImageNet or COCO, but if other datasets are used, please note this in the submission description. We will rank methods that use no external datasets other than ImageNet and COCO.
  • Ignoring distractors: As a preprocessing step, all predicted boxes are matched and the ones matched to distractor ground-truth boxes ("other person", "trailer", and "other vehicle") are ignored.
  • Crowd region: After bounding-box matching, we ignore all detected false-positive boxes that have more than 50% overlap with a crowd region (ground-truth boxes with the "Crowd" attribute); a rough sketch of this rule follows the list below.
  • Super-category: In addition to the evaluation of all 8 classes, we merge ground truth and prediction categories into 3 super-categories specified above, and evaluate the results for each super-category. The super-category evaluation results will be provided only for the purpose of reference.
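
The sketch below restates the crowd-region rule for unmatched false-positive boxes. It is our own illustration rather than the evaluation server code, and it assumes the overlap is measured as intersection over the area of the predicted box.

def box_intersection(a, b):
    """Intersection area of two boxes given as (x1, y1, x2, y2)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0.0) * max(height, 0.0)

def keep_false_positive(pred_box, crowd_boxes):
    """Keep an unmatched predicted box only if it does not fall mostly inside a crowd region."""
    area = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
    for crowd_box in crowd_boxes:
        if area > 0 and box_intersection(pred_box, crowd_box) / area > 0.5:
            return False  # ignored as a crowd-region false positive
    return True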

Metrics

We employ Multiple Object Tracking Accuracy (MOTA) as our primary evaluation metric for ranking. All metrics are detailed below, and a small numeric sketch follows the list. See this paper for more details.

  • Multiple Object Tracking Accuracy (MOTA): MOTA is a commonly used evaluation metric for multiple object tracking. It penalizes missed boxes, false positive boxes, and identity switches: their sum, divided by the total number of ground-truth boxes, is subtracted from one. We report MOTA as a percentage.
  • MOTA = 1 - (Misses + FP + Switches) / (total ground-truth boxes)
  • Multiple Object Tracking Precision (MOTP): the sum of the overlaps of matched boxes divided by the total number of matches. It is an indicator for localization precision. We report percentage MOTP for evaluation.
  • MOTP = (sum of overlaps of matched boxes) / (total number of matches)
  • Number of missed boxes (Misses): The total number of missed ground-truth boxes.
  • Number of false positives (FP): The total number of predicted boxes that are not matched with any ground-truth box.
  • Identity switch (Switch): An identity switch is counted when a ground-truth object is matched with a track that is different from the last known assigned track.
  • Number of mostly tracked objects (Mostly Tracked): The number of tracks that have at least 80% of their lifespan tracked.
  • Number of mostly lost objects (Mostly Lost): The number of tracks that have less than 20% of their lifespan tracked.
  • Number of partially tracked objects (Partially Tracked): The number of tracks that have at least 20% and less than 80% of their lifespan tracked.
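
The two helpers below restate the MOTA and MOTP definitions above as code with a toy numeric example; they are a sketch of the formulas, not the evaluation server implementation.

def mota(misses, false_positives, switches, num_gt_boxes):
    """Multiple Object Tracking Accuracy, reported as a percentage."""
    return 100.0 * (1.0 - (misses + false_positives + switches) / num_gt_boxes)

def motp(sum_of_overlaps, num_matches):
    """Multiple Object Tracking Precision, reported as a percentage."""
    return 100.0 * sum_of_overlaps / num_matches

# Toy example: 50 misses, 30 false positives, and 5 identity switches
# over 1,000 ground-truth boxes, with 800.0 total overlap across 900 matches.
print(mota(50, 30, 5, 1000))  # 91.5
print(motp(800.0, 900))       # about 88.9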