xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs
Michael S. Ryoo, Honglu Zhou, Shrikant Kendre, Can Qin, Le Xue, Manli Shu, Silvio Savarese, Ran Xu, Caiming Xiong, Juan Carlos Niebles
Salesforce AI Research
We introduce xGen-MM-Vid (BLIP-3-Video): a multimodal language model for videos, designed to efficiently capture temporal information over multiple frames. In addition to the conventional visual tokenizer, it uses a 'temporal encoder' that maps the sequence of tokens over multiple frames into a compact set of visual tokens. It achieves video QA and captioning accuracies comparable to much larger state-of-the-art models, while being much smaller (e.g., 4B vs. 34B parameters) and more efficient through its use of fewer visual tokens (e.g., 32 vs. 4608 tokens).
Introduction

xGen-MM-Vid (BLIP-3-Video) is an efficient, compact vision-language model (VLM) with an explicit temporal encoder, specifically designed to understand videos. Its key aspect is the incorporation of a learnable temporal encoder module into the original (image-based) BLIP-3 architecture. With BLIP-3-Video, we explore different types of temporal encoders, including conventional pooling and Transformer-based encoders, as well as more advanced learnable spatio-temporal pooling and sequential models such as Token Turing Machines. Experimental results demonstrate that our 4B model performs comparably to larger 7B and 34B models on various complicated video question-answering and video captioning tasks, while utilizing a much smaller number of visual tokens.
Different types of temporal encoders explored in BLIP-3-Video.
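To make the token-reduction idea concrete, here is a minimal NumPy sketch of one of the simplest temporal-encoder variants described above: attention pooling with a fixed set of query vectors that collapses per-frame visual tokens into 32 video tokens. This is an illustrative toy, not the released implementation; the queries are random here, whereas in the actual model they would be learnable parameters trained end-to-end, and the frame/token counts are assumed for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_pool(frame_tokens, queries):
    """Collapse per-frame visual tokens into a fixed set of video tokens.

    frame_tokens: (T, N, D) -- T frames, N visual tokens per frame, D channels
    queries:      (K, D)    -- K query vectors (K = 32 to match the paper's budget)
    returns:      (K, D)    -- K pooled video tokens, regardless of T
    """
    T, N, D = frame_tokens.shape
    tokens = frame_tokens.reshape(T * N, D)          # flatten time into one sequence
    attn = softmax(queries @ tokens.T / np.sqrt(D))  # (K, T*N) attention weights
    return attn @ tokens                             # (K, D) weighted sum of tokens

rng = np.random.default_rng(0)
frames = rng.standard_normal((8, 576, 64))   # e.g., 8 frames x 576 tokens each (toy sizes)
queries = rng.standard_normal((32, 64))      # 32 queries (random stand-in for learned ones)
video_tokens = temporal_attention_pool(frames, queries)
print(video_tokens.shape)  # (32, 64): the LLM sees 32 tokens instead of 8 * 576
```

The point of the sketch is that the output size is fixed by the number of queries, not by the number of frames, which is why the downstream language model can operate on 32 tokens no matter how long the clip is.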
Releases
10/22/2024
- arXiv paper release: 📄 arXiv
01/16/2025
Examples
Example model outputs of BLIP-3-Video.
Citation
@misc{blip3video-xgenmmvid,
author = {Michael S. Ryoo and Honglu Zhou and Shrikant Kendre and Can Qin and Le Xue and Manli Shu and Silvio Savarese and Ran Xu and Caiming Xiong and Juan Carlos Niebles},
title = {xGen-MM-Vid (BLIP-3-Video): You Only Need 32 Tokens to Represent a Video Even in VLMs},
year = {2024},
eprint = {2410.16267},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2410.16267},
}