T2ICount: enhancing cross-modal understanding for zero-shot counting

Yifei Qian, Zhongliang Guo, Bowen Deng, Chun Tong Lei, Shuai Zhao, Chun Pong Lau, Xiaopeng Hong, Michael Pound*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Zero-shot object counting aims to count instances of arbitrary object categories specified by text descriptions. Existing methods typically rely on vision-language models like CLIP, but often exhibit limited sensitivity to text prompts. We present T2ICount, a diffusion-based framework that leverages rich prior knowledge and fine-grained visual understanding from pretrained diffusion models. While one-step denoising ensures efficiency, it leads to weakened text sensitivity. To address this challenge, we propose a Hierarchical Semantic Correction Module that progressively refines text-image feature alignment, and a Representational Regional Coherence Loss that provides reliable supervision signals by leveraging the cross-attention maps extracted from the denoising U-Net. Furthermore, we observe that current benchmarks mainly focus on majority objects in images, potentially masking models' text sensitivity. To address this, we contribute a challenging re-annotated subset of FSC147 for better evaluation of text-guided counting ability. Extensive experiments demonstrate that our method achieves superior performance across different benchmarks. Code is available at https://github.com/chal5yq/T2lCount.
Original language: English
Title of host publication: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Place of Publication: Los Alamitos
Publisher: IEEE Computer Society
Pages: 25336-25345
Number of pages: 10
ISBN (Electronic): 9798331543648
ISBN (Print): 9798331543655
Publication status: Published - 13 Aug 2025

Publication series

Name: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
Publisher: IEEE/CVF
ISSN (Print): 1063-6919
ISSN (Electronic): 2575-7075
