Cluster-aware prompt ensemble learning for few-shot vision-language model adaptation

Chen, Z.; Yu, X.; Tao, X.; Li, Y.; Huang, Z.

doi:10.1016/j.patcog.2025.112596

dc.contributor.author	Chen, Z.
dc.contributor.author	Yu, X.
dc.contributor.author	Tao, X.
dc.contributor.author	Li, Y.
dc.contributor.author	Huang, Z.
dc.date.issued	2026
dc.description.abstract	Vision-language models (VLMs) such as CLIP achieve zero-shot transfer across various tasks by pre-training on numerous image-text pairs. These models often benefit from using an ensemble of context prompts to represent a class. Despite being effective, conventional prompt ensembling that averages textual features of context prompts often yields suboptimal results. This is because feature averaging shifts the class centroids away from the true class distribution. To address this issue, we propose the Cluster-Aware Prompt Ensemble Learning (CAPEL) framework, which preserves the cluster nature of context prompts. CAPEL classifies images into one of several class clusters, each represented by a distinct prompt. Instead of ensembling prompts in the feature space, we perform ensembling in the classification logits space, aligning better with the visual feature distribution. To further optimize prompt fine-tuning while maintaining cluster-specific discriminative power, we introduce a cluster-preserving regularization term. This ensures that prompts remain distinct and specialized for different clusters, preventing collapse into a uniform direction. Additionally, we integrate an adaptive prompt weighting technique to dynamically adjust the attention weights for flawed or ambiguous prompts, ensuring robust performance across diverse datasets and tasks.
dc.description.statementofresponsibility	Zhi Chen, Xin Yu, Xiaohui Tao, Yan Li, Zi Huang
dc.identifier.citation	Pattern Recognition, 2026; 172:112596-1-112596-13
dc.identifier.doi	10.1016/j.patcog.2025.112596
dc.identifier.issn	0031-3203
dc.identifier.issn	1873-5142
dc.identifier.orcid	Yu, X. [0000-0001-9890-5489] [0000-0002-0269-5649] [0000-0002-3388-9606] [0000-0002-6265-9519]
dc.identifier.uri	https://hdl.handle.net/2440/149921
dc.language.iso	en
dc.publisher	Elsevier
dc.relation.grant	http://purl.org/au-research/grants/arc/DP240101814
dc.rights	© 2025 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
dc.source.uri	https://doi.org/10.1016/j.patcog.2025.112596
dc.subject	ensemble learning; vision-language models; conditional entropy; logits ensemble
dc.title	Cluster-aware prompt ensemble learning for few-shot vision-language model adaptation
dc.type	Journal article
pubs.publication-status	Published
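
The contrast the abstract draws between ensembling in feature space and ensembling in logits space can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the 2-D vectors, the scale of 10, the function names, and the choice of log-sum-exp as the logit aggregation rule (the abstract does not state which rule CAPEL uses). The cluster-preserving regularization and adaptive prompt weighting are left out.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    u, v = normalize(u), normalize(v)
    return sum(a * b for a, b in zip(u, v))

def feature_average_logits(image, class_prompts, scale):
    """Conventional ensembling: average prompt features, then score once per class."""
    logits = []
    for prompts in class_prompts:
        unit = [normalize(p) for p in prompts]
        mean = [sum(col) / len(unit) for col in zip(*unit)]
        logits.append(scale * cosine(image, mean))
    return logits

def logit_ensemble_logits(image, class_prompts, scale):
    """Score every prompt separately, then combine the per-prompt logits.

    Log-sum-exp is an assumed aggregation rule, chosen here as a smooth maximum.
    """
    logits = []
    for prompts in class_prompts:
        per_prompt = [scale * cosine(image, p) for p in prompts]
        m = max(per_prompt)  # subtract the max for numerical stability
        logits.append(m + math.log(sum(math.exp(z - m) for z in per_prompt)))
    return logits

# Hypothetical prompt features: two prompts (clusters) per class.
class_prompts = [
    [[1.0, 0.0], [0.0, 1.0]],  # class 0: two well-separated clusters
    [[0.8, 0.6], [0.6, 0.8]],  # class 1: two nearby clusters
]
image = [1.0, 0.0]  # image feature lying on the first cluster of class 0
scale = 10.0        # assumed logit scale

avg = feature_average_logits(image, class_prompts, scale)
ens = logit_ensemble_logits(image, class_prompts, scale)
print("feature averaging:", avg)
print("logit ensembling: ", ens)
```

In this constructed case both averaged class centroids point in the same direction, so feature averaging cannot separate the two classes, while the per-prompt logits still favour class 0. It is a hand-built example, not evidence about the method's performance.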

Files

Original bundle

Name: hd_149921.pdf
Size: 8.35 MB
Format: Adobe Portable Document Format
Description: Published version

Collections

Research Outputs