xgen-mm-phi3-mini-instruct-dpo-r-v1.5 Open-Source Multimodal Model - Achieving High-Quality Image Caption Generation

Developed by Salesforce

xGen-MM is a series of multimodal foundation models developed by Salesforce AI Research. It builds on the BLIP series and is trained on high-quality image caption datasets and interleaved image-text data.

Image-to-Text

Safetensors

English

Open Source License: Apache-2.0

#Multimodal Instruction Fine-tuning #Safety Enhancement #Interleaved Image-Text Understanding

Downloads 305

Release Time: 8/9/2024

Model Overview

This model is the DPO (Direct Preference Optimization) version of the xGen-MM series, focusing on enhancing multimodal understanding capabilities and safety, suitable for image-text generation and interactive tasks.

Model Features

Multimodal Understanding

Performs well on single-image and multi-image benchmarks and supports complex multimodal interactive tasks.

Safety Optimization

Reduces the probability of harmful content generation through DPO training, with a VLGuard score of 5.2 (lower is better) against 9.1 for Phi-3-vision.

Comprehensive Performance

Outscores Phi-3-vision on POPE (86.8 vs. 83.5), MMBench dev (76.4 vs. 74.2), and SEED-IMG (72.1 vs. 71.0).

Model Capabilities

Image Caption Generation

Multi-image Reasoning

Safe Content Filtering

Visual Question Answering

Cross-modal Understanding

Use Cases

Content Moderation

Harmful Content Detection

Automatically identifies potentially harmful content in images and text

VLGuard score of 5.2 (lower is better)

Education

Multimodal Learning Assistant

Parses and explains image-text content in educational materials

MMBench development set score of 76.4

🚀 xGen-MM

xGen-MM is a series of the latest foundational Large Multimodal Models (LMMs) developed by Salesforce AI Research. It targets the challenges of multimodal understanding and provides high-quality multimodal processing capabilities.

✨ Features

xGen-MM is an advancement upon the successful designs of the BLIP series, with fundamental enhancements for a more robust and superior foundation.
These models are trained at scale on high-quality image caption datasets and interleaved image-text data.
In the v1.5 (08/2024) release, multiple xGen-MM models are presented, including xgen-mm-phi3-mini-instruct-interleave-r-v1.5, xgen-mm-phi3-mini-base-r-v1.5, xgen-mm-phi3-mini-instruct-singleimg-r-v1.5, and xgen-mm-phi3-mini-instruct-dpo-r-v1.5.

📦 Installation

If any required packages are missing, install them with the following commands:

pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install open_clip_torch==2.24.0
pip install einops
pip install einops-exts
pip install transformers==4.41.1
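
To confirm that an environment matches the versions listed above, a small check along the following lines may help. This is a minimal sketch, not part of the official release: the helper name `check_environment` is hypothetical, and it assumes the distribution names are the same ones passed to `pip install`. The local build suffix of CUDA wheels (for example `+cu121`) is ignored when comparing.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions pinned in the installation commands above.
PINNED = {
    "torch": "2.2.1",
    "torchvision": "0.17.1",
    "torchaudio": "2.2.1",
    "open_clip_torch": "2.24.0",
    "transformers": "4.41.1",
}
# Packages installed without a version pin.
UNPINNED = ("einops", "einops-exts")


def check_environment(pinned=PINNED, unpinned=UNPINNED, get_version=version):
    """Return a list of human-readable problems; an empty list means no mismatch."""
    problems = []
    for name, wanted in pinned.items():
        try:
            found = get_version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed (expected {wanted})")
            continue
        # Drop a local build tag such as "+cu121" before comparing.
        if found.split("+")[0] != wanted:
            problems.append(f"{name}: found {found}, expected {wanted}")
    for name in unpinned:
        try:
            get_version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: not installed")
    return problems


if __name__ == "__main__":
    for line in check_environment() or ["all listed packages look consistent"]:
        print(line)
```

The `get_version` parameter exists only so the check can be exercised without touching the real environment.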

💻 Usage Examples

Basic Usage

Please check out our inference notebook for example code to use our model.
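
As a rough orientation before opening the notebook, loading the checkpoint with the pinned `transformers` release might look like the sketch below. Several things here are assumptions rather than statements from this page: the Hub repository ID `Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5`, the need for `trust_remote_code=True`, and the helper name `load_xgen_mm`. Prompt formatting and generation are not shown; the inference notebook remains the reference for those steps.

```python
# Assumed Hub repository ID, derived from the model name on this page.
MODEL_ID = "Salesforce/xgen-mm-phi3-mini-instruct-dpo-r-v1.5"


def load_xgen_mm(model_id: str = MODEL_ID):
    """Load the model, tokenizer, and image processor (downloads the weights)."""
    # Imported lazily so this file can be imported without transformers installed.
    from transformers import (
        AutoImageProcessor,
        AutoModelForVision2Seq,
        AutoTokenizer,
    )

    # trust_remote_code=True lets transformers execute model code shipped in
    # the repository. Assumed necessary here; review that code before enabling.
    model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(
        model_id, trust_remote_code=True, use_fast=False
    )
    image_processor = AutoImageProcessor.from_pretrained(
        model_id, trust_remote_code=True
    )
    return model, tokenizer, image_processor


# Example call (requires network access and enough memory for the weights):
# model, tokenizer, image_processor = load_xgen_mm()
```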

Advanced Usage

We also provide an example script for batch inference.

📚 Documentation

For more details, check out our tech report, [fine-tuning code](https://github.com/salesforce/LAVIS/tree/xgen-mm), and project page (coming soon).
Our evaluation is implemented based on [open-compass/VLMEvalKit](https://github.com/open-compass/VLMEvalKit). We will create a PR to that repo to support xGen-MM evaluation.

🔧 Technical Details

The main data sources are from the internet, including webpages, image stock sites, and curated datasets released by the research community. We have excluded certain data, such as LAION, due to known CSAM concerns. The model may be subject to bias from the original data source, as well as bias from LLMs and commercial APIs.

📄 License

Our code and weights are released under the [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0.txt) license.

📊 DPO Model Results

| Property | Details |
| --- | --- |
| Model Type | `xGen-MM` is a series of large multimodal models. |
| Training Data | High-quality image caption datasets and interleaved image-text data. |

| Model | VLGuard (↓) | HallusionBench (↑) | POPE (↑) | MMBench (dev) (↑) | SEED-IMG (↑) | MMStar (↑) | MME (norm) (↑) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Phi-3-vision* | 9.1 | - | 83.5 | 74.2 | 71.0 | 47.9 | 55.3 |
| xgen-mm-phi3-mini-instruct-dpo-r-v1 (Ours) | 5.2 | 56.6 | 86.8 | 76.4 | 72.1 | 47.1 | 64.4 |

(* = our eval)
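
To read the table as per-benchmark differences, the short sketch below subtracts the Phi-3-vision score from the xGen-MM DPO score and flags whether the change is an improvement, given each metric's direction. The numbers are copied from the table; HallusionBench is left out because no Phi-3-vision value is reported. The function name `compare` is only illustrative.

```python
# VLGuard is the only metric in the table where lower is better.
LOWER_IS_BETTER = {"VLGuard"}

PHI3_VISION = {
    "VLGuard": 9.1, "POPE": 83.5, "MMBench (dev)": 74.2,
    "SEED-IMG": 71.0, "MMStar": 47.9, "MME (norm)": 55.3,
}
XGEN_MM_DPO = {
    "VLGuard": 5.2, "POPE": 86.8, "MMBench (dev)": 76.4,
    "SEED-IMG": 72.1, "MMStar": 47.1, "MME (norm)": 64.4,
}


def compare(ours, baseline, lower_is_better=LOWER_IS_BETTER):
    """Map each benchmark to (delta, improved), where delta = ours - baseline."""
    rows = {}
    for name, base in baseline.items():
        delta = round(ours[name] - base, 1)
        improved = delta < 0 if name in lower_is_better else delta > 0
        rows[name] = (delta, improved)
    return rows


if __name__ == "__main__":
    for name, (delta, improved) in compare(XGEN_MM_DPO, PHI3_VISION).items():
        print(f"{name:14s} {delta:+.1f} {'better' if improved else 'worse'}")
```

On these numbers the DPO model is ahead on five of the six shared benchmarks, with MMStar as the exception (47.1 vs. 47.9).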

We include some qualitative examples below of the safety features that complement our model's multimodal understanding capabilities.

📖 Citation

@misc{blip3-xgenmm,
  author          = {Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu},
  title           = {xGen-MM (BLIP-3): A Family of Open Large Multimodal Models},
  year            = {2024},
  eprint          = {2408.08872},
  archivePrefix   = {arXiv},
  primaryClass    = {cs.CV},
  url             = {https://arxiv.org/abs/2408.08872}, 
}

⚠️ Important Note

The model may be subject to bias from the original data source, as well as bias from LLMs and commercial APIs. We strongly recommend users assess safety and fairness before applying to downstream applications.

💡 Usage Tip

This release is for research purposes only in support of an academic paper. Our models, datasets, and code are not specifically designed or evaluated for all downstream purposes. We strongly recommend users evaluate and address potential concerns related to accuracy, safety, and fairness before deploying this model. We encourage users to consider the common limitations of AI, comply with applicable laws, and leverage best practices when selecting use cases, particularly for high-risk scenarios where errors or misuse could significantly impact people's lives, rights, or safety. For further guidance on use cases, refer to our AUP and AI AUP.
