TokenBinder: Text-Video Retrieval with One-to-Many Alignment Paradigm

Zhang, B.; Cao, Z.; Du, H.; Yu, X.; Li, X.; Liu, J.; Wang, S.

doi:10.1109/wacv61041.2025.00485

Date

2025

Authors

Zhang, B.

Cao, Z.

Du, H.

Yu, X.

Li, X.

Liu, J.

Wang, S.

Type

Conference paper

Citation

Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2025, pp. 4957-4967

Statement of Responsibility

Bingqing Zhang, Zhuo Cao, Heming Du, Xin Yu, Xue Li, Jiajun Liu

Conference Name

IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) (26 Feb 2025 - 6 Mar 2025 : Tucson, AZ, USA)

DOI

10.1109/wacv61041.2025.00485

Abstract

Text-Video Retrieval (TVR) methods typically match query-candidate pairs by aligning text and video features in coarse-grained, fine-grained, or combined (coarse-to-fine) manners. However, these frameworks predominantly employ a one(query)-to-one(candidate) alignment paradigm, which struggles to discern nuanced differences among candidates, leading to frequent mismatches. Inspired by Comparative Judgement in human cognitive science, where decisions are made by directly comparing items rather than evaluating them independently, we propose TokenBinder. This innovative two-stage TVR framework introduces a novel one-to-many coarse-to-fine alignment paradigm, imitating the human cognitive process of identifying specific items within a large collection. Our method employs a Focused-view Fusion Network with a sophisticated cross-attention mechanism, dynamically aligning and comparing features across multiple videos to capture finer nuances and contextual variations. Extensive experiments on six benchmark datasets confirm that TokenBinder substantially outperforms existing state-of-the-art methods. These results demonstrate its robustness and the effectiveness of its fine-grained alignment in bridging intra- and inter-modality information gaps in TVR tasks. Code is available at https://github.com/bingqingzhang/TokenBinder.
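
The two-stage, one-to-many idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the Focused-view Fusion Network and its cross-attention are replaced here by a much simpler stand-in (token-wise max similarity followed by a softmax over the shortlist), and every function name, parameter, and the temperature value are hypothetical, chosen only for illustration. The sketch shows only the overall shape: stage 1 scores each candidate independently, and stage 2 scores the shortlisted candidates relative to one another.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def coarse_rank(query_vec, video_vecs, k):
    """Stage 1 (one-to-one): score every video independently, keep the top-k indices."""
    scores = [cosine(query_vec, v) for v in video_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def fine_score(query_tokens, frame_vecs):
    """Stand-in fine-grained score: match each query token to its best frame, then average."""
    best = [max(cosine(t, f) for f in frame_vecs) for t in query_tokens]
    return sum(best) / len(best)


def rerank_jointly(query_tokens, frames_per_video, shortlist, temperature=0.1):
    """Stage 2 (one-to-many): normalise fine scores across the shortlist with a softmax,
    so each candidate's final score depends on the other candidates."""
    raw = [fine_score(query_tokens, frames_per_video[i]) for i in shortlist]
    peak = max(raw)  # subtract the max for numerical stability
    exps = [math.exp((s - peak) / temperature) for s in raw]
    total = sum(exps)
    scored = [(idx, e / total) for idx, e in zip(shortlist, exps)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy data (assumed): 2-D embeddings for three videos and a two-token query.
query_vec = [1.0, 0.0]
video_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
frames_per_video = [[[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0]]]

shortlist = coarse_rank(query_vec, video_vecs, k=2)
ranked = rerank_jointly(query_tokens, frames_per_video, shortlist)
```

In this toy setup the coarse stage prefers video 0, while the joint re-ranking promotes video 1 because its frames cover both query tokens. A real system would use learned encoders and a learned fusion module in place of these hand-written scores.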

Grant ID

http://purl.org/au-research/grants/arc/DP230101753

Published Version

https://doi.org/10.1109/wacv61041.2025.00485

Persistent link to this record

https://hdl.handle.net/2440/148896
