xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Le Xue1, Manli Shu1, Anas Awadalla1,2, Jun Wang1, An Yan1,
Senthil Purushwalkam1, Honglu Zhou1, Viraj Prabhu1, Yutong Dai1,
Michael S Ryoo1, Shrikant Kendre1, Jieyu Zhang1,2,
Can Qin1, Shu Zhang1, Chia-Chih Chen1, Ning Yu1,
Juntao Tan1, Tulika Manoj Awalgaonkar1, Shelby Heinecke1, Huan Wang1,
Yejin Choi2, Ludwig Schmidt2, Zeyuan Chen1, Silvio Savarese1, Juan Carlos Niebles1, Caiming Xiong1, Ran Xu1


1 Salesforce AI Research 2 University of Washington
First Authors, Core Authors, Senior Authors

arXiv | Code | 🤗 Model Cards | 🤗 Dataset Cards | Release Notes

We introduce xGen-MM (BLIP-3), a framework for developing Large Multimodal Models (LMMs). Our framework improves upon BLIP-2 by (1) increasing the richness, scale, and diversity of training data, (2) replacing the Q-Former layers with a more scalable vision token sampler, and (3) simplifying the training process by unifying the training objectives into a single loss at every training stage. The resulting suite of LMMs can perform a variety of visual language tasks and achieves competitive performance across benchmarks.
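To make changes (2) and (3) concrete, below is a minimal sketch (not the released implementation) of a perceiver-resampler-style vision token sampler feeding an LLM trained with a single autoregressive loss. All dimensions, layer counts, and the HuggingFace-style LM interface are illustrative assumptions.

# Minimal sketch of the design described above: learnable latent queries
# cross-attend to vision-encoder patch features (the "vision token sampler"),
# and the resulting tokens are prepended to the text embeddings so the model
# can be trained with a single next-token prediction loss. Sizes, depth, and
# the HuggingFace-style causal-LM interface below are illustrative assumptions.
import torch
import torch.nn as nn


class VisionTokenSampler(nn.Module):
    """Perceiver-resampler-style sampler: maps a variable number of patch
    features to a fixed, small set of visual tokens in the LLM embedding space."""

    def __init__(self, vis_dim=1024, lm_dim=3072, num_latents=128, num_heads=8, depth=6):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, vis_dim) * 0.02)
        self.layers = nn.ModuleList(
            [nn.MultiheadAttention(vis_dim, num_heads, batch_first=True) for _ in range(depth)]
        )
        self.proj = nn.Linear(vis_dim, lm_dim)  # project into the LLM embedding space

    def forward(self, patch_feats):                     # (B, N_patches, vis_dim)
        x = self.latents.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        for attn in self.layers:                        # cross-attention + residual
            x = x + attn(x, patch_feats, patch_feats, need_weights=False)[0]
        return self.proj(x)                             # (B, num_latents, lm_dim)


def single_stage_loss(lm, sampler, patch_feats, text_embeds, labels):
    """Single training objective: autoregressive cross-entropy on the text,
    with the visual-token positions masked out of the loss (label -100).
    Assumes `lm` is a HuggingFace-style causal LM accepting inputs_embeds/labels."""
    vis_tokens = sampler(patch_feats)
    inputs = torch.cat([vis_tokens, text_embeds], dim=1)
    ignore = labels.new_full(vis_tokens.shape[:2], -100)
    return lm(inputs_embeds=inputs, labels=torch.cat([ignore, labels], dim=1)).loss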

Introduction

BLIP-3 is a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a model safety-tuned with DPO (direct preference optimization), aiming to mitigate harmful behaviors such as hallucination and to improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research.
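As a pointer to the released artifacts, here is a hedged sketch of loading one of the open-sourced checkpoints with Hugging Face transformers. The repository name is an assumption taken from the 🤗 Model Cards and may differ across releases.

# Hedged sketch: load an xGen-MM checkpoint with Hugging Face transformers.
# The model ID below is an assumption (check the 🤗 Model Cards for the exact
# name); trust_remote_code=True is assumed because the models ship custom
# modeling and preprocessing code.
from transformers import AutoImageProcessor, AutoModelForVision2Seq, AutoTokenizer

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"  # assumed checkpoint name

model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)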

Figure: BLIP-3 model architecture overview.


Releases

08/19/2024

07/24/2024

  • MINT-1T dataset (released by the University of Washington)
    🍃 MINT-1T

05/07/2024

Demo

Example model outputs of xGen-MM-instruct-interleave. The model understands interleaved image-text input and user queries about multiple images while maintaining performance on single-image QA.
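For illustration, here is a hedged sketch of how an interleaved multi-image query might be assembled. The "<image>" placeholder token and the file names are assumptions; the exact prompt template, preprocessing, and generation call are documented in the 🤗 Model Cards.

# Hedged sketch of assembling an interleaved multi-image query for
# xGen-MM-instruct-interleave. The "<image>" placeholder and the file names
# are assumptions, not the released prompt template.
from PIL import Image

images = [Image.open("receipt_page1.png"), Image.open("receipt_page2.png")]  # hypothetical inputs
prompt = (
    "<image> This is the first page of a receipt.\n"
    "<image> This is the second page of the same receipt.\n"
    "Question: What is the combined total across both pages?"
)
# `images` and `prompt` are then passed through the released image processor
# and tokenizer before calling the model's generate() method.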

Citation

@misc{blip3-xgenmm,
  author          = {Le Xue and Manli Shu and Anas Awadalla and Jun Wang and An Yan and Senthil Purushwalkam and Honglu Zhou and Viraj Prabhu and Yutong Dai and Michael S Ryoo and Shrikant Kendre and Jieyu Zhang and Can Qin and Shu Zhang and Chia-Chih Chen and Ning Yu and Juntao Tan and Tulika Manoj Awalgaonkar and Shelby Heinecke and Huan Wang and Yejin Choi and Ludwig Schmidt and Zeyuan Chen and Silvio Savarese and Juan Carlos Niebles and Caiming Xiong and Ran Xu},
  title           = {xGen-MM (BLIP-3): A Family of Open Large Multimodal Models},
  year            = {2024},
  eprint          = {2408.08872},
  archivePrefix   = {arXiv},
  primaryClass    = {cs.CV},
  url             = {https://arxiv.org/abs/2408.08872},
}