Cluster-aware prompt ensemble learning for few-shot vision-language model adaptation

Chen, Z.; Yu, X.; Tao, X.; Li, Y.; Huang, Z.

doi:10.1016/j.patcog.2025.112596

Files

hd_149921.pdf (8.35 MB)

(Published version)

Date

2026

Authors

Chen, Z.

Yu, X.

Tao, X.

Li, Y.

Huang, Z.

Type:

Journal article

Citation

Pattern Recognition, 2026; 172:112596-1-112596-13

Statement of Responsibility

Zhi Chen, Xin Yu, Xiaohui Tao, Yan Li, Zi Huang

DOI

10.1016/j.patcog.2025.112596

Abstract

Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.
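
The contrast the abstract draws between averaging prompt features and ensembling in logit space can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the temperature `scale`, the uniform prompt `weights`, and the log-sum-exp aggregation over a class's prompts are all assumptions chosen for illustration, and the toy vectors stand in for CLIP text and image embeddings.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def feature_space_ensemble(image_feats, prompt_feats, scale=100.0):
    """Conventional ensembling: average each class's prompt features first.

    image_feats:  (N, D) unit-norm image embeddings
    prompt_feats: (C, K, D) unit-norm text embeddings, K prompts per class
    returns:      (N, C) class logits
    """
    class_feats = l2_normalize(prompt_feats.mean(axis=1))  # (C, D) centroids
    return scale * image_feats @ class_feats.T

def logit_space_ensemble(image_feats, prompt_feats, weights=None, scale=100.0):
    """Score every prompt separately, then aggregate logits per class.

    The weighted log-sum-exp used here is an assumed aggregation rule.
    weights: optional (C, K) positive prompt weights summing to 1 per class.
    """
    num_classes, num_prompts, _ = prompt_feats.shape
    # Per-prompt logits, shape (N, C, K).
    logits = scale * np.einsum("nd,ckd->nck", image_feats, prompt_feats)
    if weights is None:
        weights = np.full((num_classes, num_prompts), 1.0 / num_prompts)
    z = logits + np.log(weights)[None]
    m = z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    return (m + np.log(np.exp(z - m).sum(axis=-1, keepdims=True))).squeeze(-1)

# Toy data: class 0 has two well-separated prompt clusters, class 1 has one.
prompt_feats = np.array([
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],   # class 0: two distinct clusters
    [[0.8, 0.0, 0.6], [0.8, 0.0, 0.6]],   # class 1: a single direction
])
image_feats = np.array([[1.0, 0.0, 0.0]])  # lies exactly on class 0's first prompt

pred_avg = feature_space_ensemble(image_feats, prompt_feats).argmax(axis=1)
pred_logit = logit_space_ensemble(image_feats, prompt_feats).argmax(axis=1)
print(pred_avg, pred_logit)  # [1] [0]
```

In this toy case, averaging pulls the class 0 centroid halfway between its two clusters, so the image scores higher against class 1 even though it coincides with a class 0 prompt. Scoring each prompt separately keeps the match with the first cluster intact. Passing non-uniform `weights` gives a rough stand-in for down-weighting flawed or ambiguous prompts, though the paper's adaptive weighting may differ.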

Grant ID

http://purl.org/au-research/grants/arc/DP240101814

Published Version

https://doi.org/10.1016/j.patcog.2025.112596

Persistent link to this record

https://hdl.handle.net/2440/149921
