Abstract

This paper presents a new task named weakly-supervised group activity recognition (GAR), which differs from conventional GAR tasks in that only video-level labels are available and the important persons within each frame are not annotated, even in the training data. This makes it easier to collect and annotate a large-scale NBA dataset and thus raises new challenges for GAR. To mine useful information from weak supervision, we present a key insight that key instances are likely to be related to each other, and thus design a social adaptive module (SAM) to reason about key persons and frames from noisy data. Experiments show significant improvement on the NBA dataset as well as the popular volleyball dataset. In particular, our model trained on video-level annotation achieves comparable accuracy to prior algorithms that required strong labels.

NBA Dataset (NUST-NBA181)

We collected a total of 181 NBA game videos at a high resolution of 1920 × 1080. We then divided each video into 6-second clips and sub-sampled them to 12 fps. This yields a total of 9,172 video clips, each of which belongs to one of 9 activities. We used 7,624 clips for training and 1,548 clips for testing. The train-test split of this dataset is performed at video level, rather than at frame level.

Limited by the copyright, this dataset (~135GB) is available upon request and only for academic research. Please complete this form to download.

Group Activity Class:

Group Activity	# clips (Train/Test)
2p-succ.	798/163
2p-fail.-off.	434/107
2p-fail.-def.	1316/234
2p-layup-succ.	822/172
2p-layup-fail.-off.	455/89
2p-layup-fail.-def.	702/157
3p-succ.	728/183
3p-fail.-off.	519/83
3p-fail.-def.	1850/360
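
As a quick sanity check on the table above, the per-class counts can be summed and compared with the totals quoted earlier (7,624 training and 1,548 test clips). The following is a minimal Python sketch; the dictionary and function names are illustrative and not part of the dataset release.

```python
# Per-class clip counts as (train, test), copied from the table above.
CLIP_COUNTS = {
    "2p-succ.": (798, 163),
    "2p-fail.-off.": (434, 107),
    "2p-fail.-def.": (1316, 234),
    "2p-layup-succ.": (822, 172),
    "2p-layup-fail.-off.": (455, 89),
    "2p-layup-fail.-def.": (702, 157),
    "3p-succ.": (728, 183),
    "3p-fail.-off.": (519, 83),
    "3p-fail.-def.": (1850, 360),
}


def split_totals(counts):
    """Return (train_total, test_total) summed over all classes."""
    train_total = sum(train for train, _ in counts.values())
    test_total = sum(test for _, test in counts.values())
    return train_total, test_total


def train_shares(counts):
    """Return each class's fraction of the training clips."""
    train_total, _ = split_totals(counts)
    return {name: train / train_total for name, (train, _) in counts.items()}


if __name__ == "__main__":
    train_total, test_total = split_totals(CLIP_COUNTS)
    print(train_total, test_total, train_total + test_total)
    for name, share in sorted(train_shares(CLIP_COUNTS).items(), key=lambda kv: -kv[1]):
        print(f"{name:22s} {share:.1%}")
```

The counts also show that the classes are not balanced: 3p-fail.-def. accounts for roughly 24% of the training clips, while 2p-fail.-off. accounts for under 6%.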

Train-test split: provided in the train_video_ids and test_video_ids
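
The layout of train_video_ids and test_video_ids is not described here. The sketch below assumes each one holds video IDs separated by commas and/or whitespace, and checks that no video appears in both splits. The function names and all sample IDs other than 21800909 are hypothetical.

```python
import re


def parse_video_ids(text):
    """Split a string of video IDs on commas and/or whitespace.

    Assumption: IDs are plain tokens such as "21800909"; the real
    files may use a different layout.
    """
    return [token for token in re.split(r"[,\s]+", text.strip()) if token]


def check_disjoint(train_ids, test_ids):
    """Raise ValueError if any video ID appears in both splits."""
    overlap = set(train_ids) & set(test_ids)
    if overlap:
        raise ValueError(f"videos in both splits: {sorted(overlap)}")


if __name__ == "__main__":
    # Hypothetical contents, for illustration only.
    train_ids = parse_video_ids("21800909, 21800919\n21801058")
    test_ids = parse_video_ids("21801123,21801141")
    check_disjoint(train_ids, test_ids)
    print(len(train_ids), len(test_ids))
```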

Further information:

Inside each video directory, a set of directories corresponds to annotated clips (e.g. "NBA/21800909/17" represents "Video ID 21800909, Clip ID 17"). Each clip directory has 72 images (frames). We release high-resolution (1920 × 1080) images here, but used only the low-resolution version (224 × 224) in our experiments due to limited computational resources.
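
The figure of 72 frames per clip follows from the clip length and sampling rate: 6 seconds × 12 fps = 72. Below is a minimal Python sketch that builds a clip path in the layout described above and checks the frame count. The helper names and the set of image extensions are assumptions, since the frame file naming is not specified here.

```python
from pathlib import Path

CLIP_SECONDS = 6
SAMPLED_FPS = 12
FRAMES_PER_CLIP = CLIP_SECONDS * SAMPLED_FPS  # 6 * 12 = 72

# Assumption: frames are stored with one of these extensions.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def clip_dir(root, video_id, clip_id):
    """Return the clip directory, e.g. NBA/21800909/17."""
    return Path(root) / str(video_id) / str(clip_id)


def count_frames(clip_path):
    """Count image files directly inside a clip directory."""
    return sum(
        1
        for p in Path(clip_path).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )


def has_expected_frames(clip_path):
    """True if the clip directory holds exactly 72 frames."""
    return count_frames(clip_path) == FRAMES_PER_CLIP


if __name__ == "__main__":
    path = clip_dir("NBA", 21800909, 17)
    print(path.as_posix())
    if path.is_dir():
        print(has_expected_frames(path))
```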

Each video directory has an annotations.txt file that contains annotations for the selected clips. Each annotation line has the format: {Clip ID} {Group Activity Class}
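
The annotation format above can be read with a few lines of Python. This is a minimal sketch that assumes the two fields are separated by whitespace and keeps the class label as the raw string found in the file; the function name and the sample lines are hypothetical.

```python
def parse_annotations(text):
    """Parse '{Clip ID} {Group Activity Class}' lines into a dict.

    Returns a mapping from clip ID (str) to the class label (str),
    exactly as written in the file. Blank lines are skipped.
    """
    labels = {}
    for line_number, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line:
            continue
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"line {line_number}: expected two fields, got {line!r}")
        clip_id, label = parts
        labels[clip_id] = label
    return labels


if __name__ == "__main__":
    # Hypothetical file contents, for illustration only.
    sample = "17 2p-succ.\n18 3p-fail.-def.\n"
    print(parse_annotations(sample))
```

For a real video directory, the text would come from something like `open("NBA/21800909/annotations.txt").read()`.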

The normalized_detections.pkl file contains all candidate bounding boxes for people in the scene, generated by Faster R-CNN pre-trained on MS-COCO.
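
The internal structure of normalized_detections.pkl is not documented here, so the sketch below only loads the file with Python's standard pickle module and reports what kind of object it contains, without assuming a schema. The function names and the example path are hypothetical. Since unpickling can execute arbitrary code, the file should only be loaded when it comes from a trusted source.

```python
import pickle


def load_detections(path):
    """Load a pickled detections file and return the stored object."""
    with open(path, "rb") as f:
        return pickle.load(f)


def summarize(obj, max_items=3):
    """Describe a loaded object without assuming its schema."""
    if isinstance(obj, dict):
        sample_keys = list(obj)[:max_items]
        return f"dict with {len(obj)} keys, e.g. {sample_keys}"
    if isinstance(obj, (list, tuple)):
        return f"{type(obj).__name__} with {len(obj)} items"
    return type(obj).__name__


if __name__ == "__main__":
    import os

    # Hypothetical location; adjust to where the file actually lives.
    path = "NBA/21800909/normalized_detections.pkl"
    if os.path.exists(path):
        print(summarize(load_detections(path)))
```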

This dataset is easy to extend. You can collect more NBA videos from the Internet and play-by-play data from the NBA's official website. If you are interested in this, please feel free to contact me.

Acknowledgements

This work was supported by the National Key Research and Development Program of China under Grant 2018AAA0102002, and by the National Natural Science Foundation of China under Grants 61732007, 61702265, and 61932020.

Citation

If you use our NBA dataset or wish to refer to the baseline results, please cite the following publications.
@article{yan2020social,
title={Social Adaptive Module for Weakly-supervised Group Activity Recognition},
author={Yan, Rui and Xie, Lingxi and Tang, Jinhui and Shu, Xiangbo and Tian, Qi},
journal={arXiv preprint arXiv:2007.09470},
year={2020}
}
@article{tang2019coherence,
title={Coherence constrained graph LSTM for group activity recognition},
author={Tang, Jinhui and Shu, Xiangbo and Yan, Rui and Zhang, Liyan},
journal={IEEE transactions on pattern analysis and machine intelligence},
year={2019}
}
@inproceedings{yan2018participation,
title={Participation-contributed temporal dynamic model for group activity recognition},
author={Yan, Rui and Tang, Jinhui and Shu, Xiangbo and Li, Zechao and Tian, Qi},
booktitle={Proceedings of the 26th ACM international conference on Multimedia},
pages={1292--1300},
year={2018}
}