Sketch, ground, and refine: top-down dense video captioning

dc.contributor.authorDeng, C.
dc.contributor.authorChen, S.
dc.contributor.authorChen, D.
dc.contributor.authorHe, Y.
dc.contributor.authorWu, Q.
dc.contributor.conference2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (19 Jun 2021 - 25 Jun 2021 : virtual online)
dc.date.issued2021
dc.description.abstractThe dense video captioning task aims to detect and describe a sequence of events in a video for detailed and coherent storytelling. Previous works mainly adopt a "detect-then-describe" framework, which first detects event proposals in the video and then generates descriptions for the detected events. However, the definitions of events are diverse: an event could be as simple as a single action or as complex as a set of events, depending on the semantic context. Therefore, directly detecting events based on video information is ill-defined and hurts the coherency and accuracy of generated dense captions. In this work, we reverse the predominant "detect-then-describe" fashion, proposing a top-down way to first generate paragraphs from a global view and then ground each event description to a video segment for detailed refinement. It is formulated as a Sketch, Ground, and Refine process (SGR). The sketch stage first generates a coarse-grained multi-sentence paragraph to describe the whole video, where each sentence is treated as an event and gets localised in the grounding stage. In the refining stage, we improve captioning quality via refinement-enhanced training and dual-path cross attention on both coarse-grained event captions and aligned event segments. The updated event caption can further adjust its segment boundaries. Our SGR model outperforms state-of-the-art methods on the ActivityNet Captioning benchmark under both traditional and story-oriented dense caption evaluations. Code will be released at github.com/bearcatt/SGR.
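To make the top-down pipeline described in the abstract concrete, the Python skeleton below traces the Sketch → Ground → Refine data flow: paragraph first, then per-sentence grounding, then joint refinement. Every function name, signature, and placeholder return value here is a hypothetical illustration for readability, not the authors' released code (see github.com/bearcatt/SGR for the actual implementation).

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # hypothetical (start_sec, end_sec) within the video


def sketch(video_feats) -> List[str]:
    """Sketch stage: generate a coarse multi-sentence paragraph for the
    whole video; each sentence is treated as one event description.
    (Placeholder output; a real model would decode from video_feats.)"""
    return ["a person walks into a kitchen", "the person chops vegetables"]


def ground(video_feats, sentence: str) -> Segment:
    """Grounding stage: localise one coarse event sentence to a video
    segment. (Placeholder boundaries.)"""
    return (0.0, 5.0)


def refine(video_feats, sentence: str, seg: Segment) -> Tuple[str, Segment]:
    """Refining stage: per the abstract, dual-path cross attention over the
    coarse caption and its aligned segment yields an improved caption, which
    can in turn adjust the segment boundaries. (Identity placeholder.)"""
    return sentence, seg


def sgr_caption(video_feats) -> List[Tuple[str, Segment]]:
    """Top-down dense captioning: generate the paragraph globally, then
    ground and refine each sentence, reversing detect-then-describe."""
    return [refine(video_feats, s, ground(video_feats, s))
            for s in sketch(video_feats)]
```

Contrast with the "detect-then-describe" baseline the paper argues against: there, segment proposals would be produced first and captions decoded per proposal, whereas here every event boundary is conditioned on a globally coherent paragraph.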
dc.description.statementofresponsibilityChaorui Deng, Shizhe Chen, Da Chen, Yuan He, Qi Wu
dc.identifier.citationProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 234-243
dc.identifier.doi10.1109/CVPR46437.2021.00030
dc.identifier.isbn9781665445092
dc.identifier.issn1063-6919
dc.identifier.issn2575-7075
dc.identifier.orcidDeng, C. [0000-0002-8587-9047]
dc.identifier.orcidWu, Q. [0000-0003-3631-256X]
dc.identifier.urihttps://hdl.handle.net/2440/134309
dc.language.isoen
dc.publisherIEEE
dc.publisher.placeonline
dc.relation.granthttp://purl.org/au-research/grants/arc/DE190100539
dc.relation.ispartofseriesIEEE Conference on Computer Vision and Pattern Recognition
dc.rights© 2021 IEEE.
dc.source.urihttps://ieeexplore.ieee.org/xpl/conhome/9577055/proceeding
dc.titleSketch, ground, and refine: top-down dense video captioning
dc.typeConference paper
pubs.publication-statusPublished
