Multi-View Captioning with Semantic Delta Re-Ranking for Zero-Shot Composed Video Retrieval

Abstract

Composed Video Retrieval (CVR) aims to retrieve videos relevant to a query video while incorporating the specific changes described in a modification text. For Zero-Shot Composed Video Retrieval (ZS-CVR), current methods use vision-language models to convert the query video into a single caption, which is then merged with the modification text to generate an edited caption for retrieval. However, the modification text does not clearly specify which elements of the query video to preserve, which can lead to misalignment between the edited caption and the target video. Additionally, the final retrieval result should not be determined solely by the similarity between the edited caption and the candidate videos; it should also incorporate the semantic delta arising from the modification text. To address these issues, we propose the Multi-View Captioning with Semantic Delta Re-Ranking (MCSD) method for ZS-CVR. Specifically, the Multi-View Captioning Module generates captions covering the potential semantics of the target video, and the Semantic Delta Re-Ranking Module computes the semantic delta between the original and edited captions to adjust similarity scores and re-rank the retrieval results. Extensive experiments on two benchmarks demonstrate that the proposed MCSD method achieves state-of-the-art performance in ZS-CVR.

Demo

This is a data demo for our MCSD method.

Overview

Motivation: Video content is inherently dense in semantic information. A single caption often fails to capture the full semantics of a target video, whereas captions generated from multiple perspectives can provide more comprehensive coverage of its potential meanings.

Figure 1: Illustration of multiple perspectives vs. single caption.

MCSD

To address these issues, we propose Multi-View Captioning with Semantic Delta Re-Ranking (MCSD) for ZS-CVR. Our method features:

(1) a Multi-View Captioning Module that generates captions covering the potential semantics of the target video;

(2) a Semantic Delta Re-Ranking Module that computes the semantic delta between the original and edited captions to adjust similarity scores and re-rank the retrieval results.

Figure 2: Architecture of the proposed MCSD method.
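
The two modules can be illustrated with a minimal sketch. It is not the released implementation: this README does not give the scoring formula, so the code makes assumptions. It assumes that caption and video embeddings are precomputed in a shared space, that per-view similarities are averaged, that the semantic delta is the mean edited-caption embedding minus the original-caption embedding, and that the delta term is added with a weight alpha. The names rerank_with_semantic_delta and alpha are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def rerank_with_semantic_delta(
    original_caption_emb: np.ndarray,  # shape (d,): caption of the query video
    edited_caption_embs: np.ndarray,   # shape (k, d): k multi-view edited captions
    candidate_embs: np.ndarray,        # shape (n, d): candidate video embeddings
    alpha: float = 0.5,                # hypothetical weight of the delta term
):
    """Return (ranking, scores) for n candidates. Illustrative only."""
    orig = l2_normalize(original_caption_emb)
    edited = l2_normalize(edited_caption_embs)
    cand = l2_normalize(candidate_embs)

    # Base score: cosine similarity to each edited caption, averaged over views.
    # (Assumption: the README does not say how views are aggregated.)
    base = (cand @ edited.T).mean(axis=1)

    # Semantic delta: direction from the original caption to the edited ones.
    # (Assumption: a simple embedding difference stands in for the delta.)
    delta = edited.mean(axis=0) - orig
    delta_norm = np.linalg.norm(delta)
    if delta_norm > 0:
        delta_score = cand @ (delta / delta_norm)
    else:
        # Identical original and edited captions: no adjustment.
        delta_score = np.zeros(cand.shape[0])

    # Adjust the similarity scores with the delta term, then re-rank.
    scores = base + alpha * delta_score
    ranking = np.argsort(-scores)  # candidate indices, best first
    return ranking, scores
```

In this sketch, a candidate that matches the edited captions and also lies along the direction of change from the original caption is promoted, while a candidate that resembles the unmodified query is pushed down.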

Citation

If you find our work useful, please consider citing it!

@inproceedings{ding2025multi,
  title={Multi-view Captioning with Semantic Delta Re-ranking for Zero-Shot Composed Video Retrieval},
  author={Ding, Zhixiang and Liu, Lilong and Yang, Zhenyu and Qian, Shengsheng},
  booktitle={International Conference on Image and Graphics},
  pages={80--91},
  year={2025},
  organization={Springer}
}