Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos

Liu, C.; Li, P.P.; Yu, Q.; Sheng, H.; Wang, D.; Li, L.; Yu, X.

doi:10.1109/CVPR52733.2024.02143

Date

2024

Authors

Liu, C.

Li, P.P.

Yu, Q.

Sheng, H.

Wang, D.

Li, L.

Yu, X.

Type

Conference paper

Citation

Proceedings / CVPR, IEEE Computer Society Conference on Computer Vision and Pattern Recognition. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2024, pp. 22712-22722

Statement of Responsibility

Chen Liu, Peike Patrick Li, Qingtao Yu, Hongwei Sheng, Dadong Wang, Lincheng Li, Xin Yu

Conference Name

IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (16 Jun 2024 - 22 Jun 2024 : Seattle, United States)

DOI

10.1109/CVPR52733.2024.02143

Abstract

Existing audio-visual segmentation (AVS) datasets typically focus on short, trimmed videos with only one pixel-map annotation per one-second video clip. In contrast, for untrimmed videos, the sound duration, the start and end times of sounding, and the visual deformation of audible objects vary significantly. We therefore observe that current AVS models trained on trimmed videos may struggle to segment sounding objects in long videos. To investigate the feasibility of grounding audible objects in videos along both temporal and spatial dimensions, we introduce the Long-Untrimmed Audio-Visual Segmentation dataset (LU-AVS), which includes precise frame-level annotations of sound emission times and provides exhaustive mask annotations for all frames. Because pixel-level annotations are difficult to obtain in some complex scenes, we also provide bounding boxes to indicate the sounding regions. Specifically, LU-AVS contains 10M mask annotations across 6.6K videos and 11M bounding box annotations across 7K videos. Compared with existing datasets, LU-AVS videos are on average 4 to 8 times longer, with silent durations 3 to 15 times greater. Furthermore, we adapt several baseline models originally designed for audio-visual-related tasks to examine the challenges of the newly curated LU-AVS. Through comprehensive evaluation, we demonstrate that LU-AVS is more challenging than datasets containing trimmed videos. LU-AVS therefore provides an ideal yet challenging platform for evaluating audio-visual segmentation and localization on long, untrimmed videos.

Grant ID

http://purl.org/au-research/grants/arc/DP220100800
http://purl.org/au-research/grants/arc/DE230100477

Published Version

https://doi.org/10.1109/cvpr52733.2024.02143

Persistent link to this record

https://hdl.handle.net/2440/149345
