HORUS: multimodal large language models framework for video retrieval at VBS 2025

Tai Nguyen, Vo Ngoc Minh Anh, Duc Dat Pham, Tran Quang Vinh, Nhu Duong Thi Quynh, Le Anh Tien, Tan Duy Le, Binh T. Nguyen*

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

In the dynamic field of video retrieval, precise and effective search methods are crucial for managing complex datasets. We present HORUS, a novel approach based on multimodal Large Language Models (mLLMs) that advances video retrieval capabilities through two key innovations: (1) advanced multi-modal feature aggregation, integrating text-to-image search with CLIP, free-text search from captions generated by Video-LLaMA2, and visual features from Video-LLaMA to capture temporal dynamics; and (2) GPT-based query expansion combined with an advanced filter, which addresses low-quality open-ended text queries and refines item searches by type and location within a scene. This work provides cutting-edge solutions for the VBS 2025 challenge and offers valuable insights into enhancing video search techniques.
Original language: English
Title of host publication: MultiMedia modeling
Subtitle of host publication: 31st international conference on multimedia modeling, MMM 2025, Nara, Japan, January 8–10, 2025, proceedings, part V
Editors: Ichiro Ide, Ioannis Kompatsiaris, Changsheng Xu, Keiji Yanai, Wei-Ta Chu, Naoko Nitta, Michael Riegler, Toshihiko Yamasaki
Place of Publication: Singapore
Publisher: Springer Nature Singapore
Pages: 286-293
Number of pages: 8
ISBN (Electronic): 9789819620746
ISBN (Print): 9789819620739
DOIs
Publication status: Published - 1 Jan 2025

Publication series

Name: Lecture notes in computer science
Publisher: Springer Singapore
Volume: 15524
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • Video browser showdown
  • Video retrieval
  • Multi-modal feature aggregation
