xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue1,♦, Manli Shu1,♦, Anas Awadalla1,2,♢, Jun Wang1,♢, An Yan1,♢,
Senthil Purushwalkam1,♢, Honglu Zhou1,♢, Viraj Prabhu1,♢, Yutong Dai1,♢,
Michael S Ryoo1,♢, Shrikant Kendre1,♢, Jieyu Zhang1,2,♢,
Can Qin1, Shu Zhang1, Chia-Chih Chen1, Ning Yu1,
Juntao Tan1, Tulika Manoj Awalgaonkar1, Shelby Heinecke1, Huan Wang1,
Yejin Choi2,†, Ludwig Schmidt2,†, Zeyuan Chen1,†, Silvio Savarese1,†, Juan Carlos Niebles1,†, Caiming Xiong1,†, Ran Xu1,†
1 Salesforce AI Research 2 University of Washington
♦ First Authors, ♢ Core Authors, † Senior Authors
arXiv · Code · 🤗 Model Cards · 🤗 Dataset Cards · Release Notes
We introduce xGen-MM (BLIP-3), a framework for developing Large Multimodal Models (LMMs). Our framework improves upon BLIP-2 by (1) increasing the richness, scale, and diversity of the training data, (2) replacing the Q-Former layers with a more scalable vision token sampler, and (3) simplifying the training process by unifying the training objectives into a single loss at every training stage. The resulting suite of LMMs can perform a variety of visual-language tasks and achieves competitive performance across benchmarks.
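To give a concrete picture of item (2), the sketch below shows a generic perceiver-resampler-style vision token sampler in PyTorch: a fixed set of learnable latent queries cross-attends to the patch embeddings from the vision encoder, so an arbitrary number of patches is compressed into a fixed number of visual tokens for the LLM. This is an illustrative sketch only; the class name, hyperparameters, and layer details are assumptions and do not reproduce the released implementation.

```python
import torch
import torch.nn as nn

class VisionTokenSampler(nn.Module):
    """Minimal perceiver-resampler-style token sampler (illustrative sketch).

    Learnable latent queries cross-attend to image patch embeddings from a
    frozen vision encoder, compressing any number of input patches into
    `num_latents` visual tokens that are projected into the LLM embedding
    space. Dimensions below are illustrative, not the released configuration.
    """

    def __init__(self, vision_dim=1152, llm_dim=3072, num_latents=128,
                 num_heads=16, depth=6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, vision_dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "norm_q": nn.LayerNorm(vision_dim),
                "norm_kv": nn.LayerNorm(vision_dim),
                "attn": nn.MultiheadAttention(vision_dim, num_heads, batch_first=True),
                "ffn": nn.Sequential(
                    nn.LayerNorm(vision_dim),
                    nn.Linear(vision_dim, 4 * vision_dim),
                    nn.GELU(),
                    nn.Linear(4 * vision_dim, vision_dim),
                ),
            })
            for _ in range(depth)
        ])
        # Project the sampled tokens into the LLM embedding space.
        self.to_llm = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_embeds):  # patch_embeds: (B, N_patches, vision_dim)
        x = self.latents.unsqueeze(0).expand(patch_embeds.size(0), -1, -1)
        for layer in self.layers:
            q = layer["norm_q"](x)
            kv = layer["norm_kv"](patch_embeds)
            attn_out, _ = layer["attn"](q, kv, kv)  # latents attend to patches
            x = x + attn_out
            x = x + layer["ffn"](x)
        return self.to_llm(x)  # (B, num_latents, llm_dim)
```

Because the number of output tokens is fixed by `num_latents` rather than by the input resolution, this kind of sampler scales more gracefully than the Q-Former when training on many high-resolution or multi-image inputs.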
Introduction
BLIP-3 is a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a safety-tuned model trained with DPO (direct preference optimization), aiming to mitigate harmful behaviors such as hallucination and to improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research.
BLIP-3 model architecture overview.
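The introduction mentions a safety-tuned model trained with DPO. For reference, the snippet below sketches the standard DPO objective (Rafailov et al., 2023) over summed log-probabilities of preferred and rejected responses. The function name and `beta` value are illustrative, and the released xGen-MM-instruct-dpo recipe may differ in its preference data and hyperparameters.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities of a response
    under either the trainable policy or the frozen reference model.
    """
    pi_logratios = policy_chosen_logps - policy_rejected_logps
    ref_logratios = ref_chosen_logps - ref_rejected_logps
    logits = pi_logratios - ref_logratios
    # Encourage the policy to prefer the chosen response more than the reference does.
    return -F.logsigmoid(beta * logits).mean()

if __name__ == "__main__":
    # Toy usage with random log-probabilities for a batch of 4 preference pairs.
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b), torch.randn(b), torch.randn(b))
    print(loss.item())
```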
Releases
08/19/2024
- xGen-MM-v1.5
🤗 xGen-MM-base 🤗 xGen-MM-instruct 🤗 xGen-MM-instruct-interleave 🤗 xGen-MM-instruct-dpo
- BLIP-3 datasets
🤗 BLIP3-OCR-200M 🤗 BLIP3-GROUNDING-50M
- Code
xGen-MM Fine-tuning code
07/24/2024
- MINT-1T dataset (released by University of Washington)
🍃 MINT-1T
05/07/2024
- xGen-MM-v1.0
🤗 xGen-MM-base-v1 🤗 xGen-MM-instruct-v1
Demo
Example model outputs from xGen-MM-instruct-interleave. The model understands interleaved image-text input and user queries about multiple images, while maintaining performance on single-image QA.
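As a starting point for reproducing outputs like these, the sketch below loads one of the released checkpoints with Hugging Face transformers AutoClasses and `trust_remote_code`. The repo id is assumed from the release notes above; consult the model card on the Hub for the authoritative prompt template and generation call.

```python
# Minimal loading sketch, assuming the checkpoint is packaged for the
# transformers AutoClasses with trust_remote_code (repo id assumed below).
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-interleave-r-v1.5"  # assumed repo id

model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True).eval()
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True,
                                          use_fast=False, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)

# Images are preprocessed with `image_processor`, and the prompt references them
# with <image> placeholders; multi-image queries interleave several placeholders
# with text. See the model card for the exact chat template and generate() kwargs.
```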
Citation
@misc{blip3-xgenmm,
author = {Le Xue and Manli Shu and Anas Awadalla and Jun Wang and An Yan and Senthil Purushwalkam and Honglu Zhou and Viraj Prabhu and Yutong Dai and Michael S Ryoo and Shrikant Kendre and Jieyu Zhang and Can Qin and Shu Zhang and Chia-Chih Chen and Ning Yu and Juntao Tan and Tulika Manoj Awalgaonkar and Shelby Heinecke and Huan Wang and Yejin Choi and Ludwig Schmidt and Zeyuan Chen and Silvio Savarese and Juan Carlos Niebles and Caiming Xiong and Ran Xu},
title = {xGen-MM (BLIP-3): A Family of Open Large Multimodal Models},
year = {2024},
eprint = {2408.08872},
archivePrefix = {arXiv},
primaryClass = {cs.CV},
url = {https://arxiv.org/abs/2408.08872},
}