XGen-MM: Salesforce's Latest Multimodal Model Series
Salesforce AI Research has rebranded the BLIP series to XGen-MM, offering state-of-the-art Large Multimodal Models (LMMs) with enhanced performance and features.
Quick Start
The XGen-MM series represents a significant advancement in multimodal technologies. To get started quickly, follow the code example in the Usage Examples section below, which demonstrates zero-shot inference on an image, including model loading, image processing, and text generation.
⨠Features
- Rebranding and Alignment: The rebranding from the BLIP series to XGen-MM aligns with Salesforce's unified XGen initiative for large foundation models.
- State-of-the-Art Performance:
  - The pretrained foundation model xgen-mm-phi3-mini-base-r-v1 achieves top performance under 5B parameters and shows strong in-context learning capabilities.
  - The instruct fine-tuned model xgen-mm-phi3-mini-instruct-r-v1 outperforms other open-source and closed-source VLMs under 5B parameters.
- Flexible Image Encoding: xgen-mm-phi3-mini-instruct-r-v1 supports flexible high-resolution image encoding with efficient visual token sampling.
Installation
If any packages are missing, you can install them with the following commands:
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install open_clip_torch==2.24.0
pip install einops
pip install einops-exts
pip install transformers==4.41.1
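As a quick sanity check (a minimal sketch, assuming the pinned versions above are installed; nearby versions may also work), you can verify the key packages and CUDA availability before loading the model:

import torch
import transformers
import einops

print("torch:", torch.__version__)                # pinned to 2.2.1 above
print("transformers:", transformers.__version__)  # pinned to 4.41.1 above
print("einops:", einops.__version__)
print("CUDA available:", torch.cuda.is_available())
# The usage example below runs generation under bfloat16 autocast, so check support:
print("bfloat16 supported:", torch.cuda.is_available() and torch.cuda.is_bf16_supported())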
Usage Examples
Basic Usage
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor
import requests
from PIL import Image
import IPython.display as display
import torch
model_name_or_path = "Salesforce/xgen-mm-phi3-mini-base-r-v1"
model = AutoModelForVision2Seq.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, use_fast=True, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = model.update_special_tokens(tokenizer)  # register the model's special tokens (e.g., image placeholders) with the tokenizer
model = model.to('cuda')
tokenizer.padding_side = "left"
def apply_prompt_template(prompt, num_images=1, num_tokens_per_vis=128, in_context=False, output=None):
    """
    num_tokens_per_vis: model.vlm.num_tokens_per_vis
    """
    # Each image is represented by one <image> token followed by placeholder tokens.
    placeholder_image_tokens = "<image placeholder>" * (num_tokens_per_vis - 1)
    if in_context:
        # In-context (few-shot) examples include the expected output and an end-of-chunk marker.
        formatted_prompt = f"<image>{placeholder_image_tokens}" + f"{prompt}" + f"{output}" + "<|endofchunk|>"
    else:
        formatted_prompt = f"<image>{placeholder_image_tokens}" * num_images + f"{prompt}"
    return formatted_prompt
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
instruction = "Describe what is the dog doing in this image in one sentence:"
print("==> Instruction: ", instruction)
print("==> Image: ")
display.display(raw_image.resize((int(raw_image.width*0.3), int(raw_image.height*0.3))))
inputs = image_processor([raw_image], return_tensors="pt")
prompt = apply_prompt_template(instruction)
language_inputs = tokenizer([prompt], return_tensors="pt")
inputs.update(language_inputs)
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    generated_text = model.generate(**inputs,
                                    pad_token_id=tokenizer.pad_token_id,
                                    do_sample=False, max_new_tokens=64, top_p=None, num_beams=1,
                                    length_penalty=1.0, repetition_penalty=3.0)
prediction = tokenizer.decode(generated_text[0], skip_special_tokens=True)
print("==> prediciton: ", prediction)
print("-"*120)
Advanced Usage
More comprehensive examples, covering both zero-shot and few-shot inference, can be found in the accompanying notebook.
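To make the few-shot format concrete, here is a minimal sketch of how an in-context prompt can be assembled with the apply_prompt_template helper from the Basic Usage example above. The question/answer pairs below are made-up placeholders, and only the text side is shown; how the corresponding context images are batched for the model is best taken from the notebook.

# Hypothetical in-context examples: one (question, answer) pair per context image.
# The questions and answers here are placeholders for illustration only.
context_examples = [
    ("Question: What animal is in the photo? Answer:", " A dog."),
    ("Question: What animal is in the photo? Answer:", " A cat."),
]
query_question = "Question: What animal is in the photo? Answer:"

# Each context example is rendered with its expected output and closed by
# <|endofchunk|>; the query is rendered without an answer so the model completes it.
few_shot_prompt = "".join(
    apply_prompt_template(q, in_context=True, output=a) for q, a in context_examples
) + apply_prompt_template(query_question)

# The prompt now contains one <image> slot (plus placeholder tokens) per example,
# in the same order as the images passed to the image processor.
print(few_shot_prompt.count("<image>"), "image slots in the few-shot prompt")
language_inputs = tokenizer([few_shot_prompt], return_tensors="pt")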
Documentation
Results
Pretrain (base model without instruction tuning)
| Property | Details |
|----------|---------|
| Model | Flamingo-3B, MM1-3B, xgen-mm-phi3-mini-base-r-v1 (Ours) |
| Shot | 0, 4, 8 |
| Datasets | COCO (val), NoCaps (val), TextCaps (val), OKVQA (val), TextVQA (val), VizWiz (testdev), VQAv2 (testdev) |
The table summarizes the performance of these models on the datasets above under 0-, 4-, and 8-shot settings. The xgen-mm-phi3-mini-base-r-v1 model achieves competitive results.
Instruct (after instruction tuning)
| Property | Details |
|----------|---------|
| Model | MM1-3B-Chat, openbmb/MiniCPM-V-2, VILA1.5-3B, xtuner/llava-phi-3-mini-hf, xgen-mm-phi3-mini-instruct-r-v1 (Ours) |
| Datasets | SEED-IMG, MMBench (dev), MME-total, MME-P, MME-C, MMStar, MMMU (val), MMVet, MathVista (mini), ScienceQA (test), POPE, AI2D |
The table summarizes the performance of these models on multiple multimodal benchmarks after instruction tuning. The xgen-mm-phi3-mini-instruct-r-v1 model performs strongly on many of these benchmarks.
Reproducibility
Our SFT evaluation is based on VLMEvalKit, in which we fixed some inconsistencies with the official benchmarks (e.g., the LLM judge API). We also found that the raw resolution of the input image can affect the model output in some cases.
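If you observe such resolution sensitivity, one possible (purely illustrative) mitigation is to normalize the input resolution before preprocessing; the helper below is a hypothetical sketch, and the 768-pixel target is an arbitrary example value rather than a setting recommended by the model card.

from PIL import Image

def resize_longest_side(image: Image.Image, target: int = 768) -> Image.Image:
    """Resize so the longer side equals `target`, preserving the aspect ratio."""
    scale = target / max(image.size)
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.BICUBIC)

# Example: normalize the demo image before handing it to the image processor.
normalized_image = resize_longest_side(raw_image)
inputs = image_processor([normalized_image], return_tensors="pt")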
Bias, Risks, Limitations, and Ethical Considerations
The main data sources are from the internet, including webpages, image stock sites, and curated datasets released by the research community. Some data, such as LAION, is excluded due to CSAM concerns. The model may inherit biases from the original data sources, LLMs, and commercial APIs. Users are advised to assess safety and fairness before applying the model to downstream applications.
Ethical Considerations
This release is for research purposes to support an academic paper. The models, datasets, and code are not designed or evaluated for all downstream purposes. Users should evaluate and address potential concerns related to accuracy, safety, and fairness before deployment. They are also encouraged to consider AI limitations, comply with laws, and follow best practices, especially in high-risk scenarios.
Technical Details
The model is for research purposes, and more technical details will be provided in a technical report soon.
License
Our code and weights are released under the Apache-2.0 license. The copyright of the training data remains with the original data owner.
Code Acknowledgment
We would like to acknowledge the following projects:
Citation
@misc{xue2024xgenmmblip3familyopen,
title={xGen-MM (BLIP-3): A Family of Open Large Multimodal Models},
author={Le Xue and Manli Shu and Anas Awadalla and Jun Wang and An Yan and Senthil Purushwalkam and Honglu Zhou and Viraj Prabhu and Yutong Dai and Michael S Ryoo and Shrikant Kendre and Jieyu Zhang and Can Qin and Shu Zhang and Chia-Chih Chen and Ning Yu and Juntao Tan and Tulika Manoj Awalgaonkar and Shelby Heinecke and Huan Wang and Yejin Choi and Ludwig Schmidt and Zeyuan Chen and Silvio Savarese and Juan Carlos Niebles and Caiming Xiong and Ran Xu},
year={2024},
eprint={2408.08872},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.08872},
}
Troubleshooting
If you encounter any issues with missing packages, refer to the installation section for the necessary commands.