Automated crowd counting has made remarkable progress recently in computer vision thanks to the development of CNNs. However, this application area has run into bottlenecks since CNNs, by their nature, are limited by locally attentive receptive fields and are incapable of modelling larger-scale dependencies. To address this problem, we introduce a multi-scale transformer-based crowd-counting network, termed Crowd U-Transformer (CUT) which extracts and aggregates semantic and spatial features from multiple levels. In this design, we use crowd segmentation as an attention module to gain fine-grained features. Also, we propose a loss function that better focuses on the counting performance in the foreground area. Experimental results on four widely used benchmarks are presented and our method shows state-of-the-art performances.
|Published - 2022
|33rd British Machine Vision Conference Proceedings, BMVC 2022 - London, United Kingdom
Duration: 21 Nov 2022 → 24 Nov 2022
|33rd British Machine Vision Conference Proceedings, BMVC 2022
|21/11/22 → 24/11/22