XGen-MM: Salesforce's Latest Multimodal Model Series
Salesforce AI Research has rebranded the BLIP series to XGen-MM, offering state-of-the-art Large Multimodal Models (LMMs) with enhanced performance and features.
Quick Start
The XGen-MM series represents a significant advancement in multimodal technologies. To get started quickly, follow the code example in the Usage Examples section below, which demonstrates zero-shot inference on an image, including model loading, image processing, and text generation.
⨠Features
- Rebranding and Alignment: The rebranding from the BLIP series to XGen-MM aligns with Salesforce's unified XGen initiative for large foundation models.
- State-of-the-Art Performance:
  - The pretrained foundation model xgen-mm-phi3-mini-base-r-v1 achieves top performance under 5B parameters and shows strong in-context learning capabilities.
  - The instruct fine-tuned model xgen-mm-phi3-mini-instruct-r-v1 outperforms other open-source and closed-source VLMs under 5B parameters.
- Flexible Image Encoding: xgen-mm-phi3-mini-instruct-r-v1 supports flexible high-resolution image encoding with efficient visual token sampling.
Installation
If any packages are missing, you can install them with the following commands:
pip install torch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 --index-url https://download.pytorch.org/whl/cu121
pip install open_clip_torch==2.24.0
pip install einops
pip install einops-exts
pip install transformers==4.41.1
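As a quick sanity check (a minimal sketch, assuming the pinned versions above are installed; nearby versions may also work), you can verify the key packages and CUDA availability before loading the model:

import torch
import transformers
import einops

print("torch:", torch.__version__)                # pinned to 2.2.1 above
print("transformers:", transformers.__version__)  # pinned to 4.41.1 above
print("einops:", einops.__version__)
print("CUDA available:", torch.cuda.is_available())
# The usage example below runs generation under bfloat16 autocast, so check support:
print("bfloat16 supported:", torch.cuda.is_available() and torch.cuda.is_bf16_supported())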
Usage Examples
Basic Usage
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor
import requests
from PIL import Image
import IPython.display as display
import torch
model_name_or_path = "Salesforce/xgen-mm-phi3-mini-base-r-v1"
model = AutoModelForVision2Seq.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, use_fast=True, legacy=False)
image_processor = AutoImageProcessor.from_pretrained(model_name_or_path, trust_remote_code=True)
tokenizer = model.update_special_tokens(tokenizer)  # register the model's special tokens (e.g., image placeholders) with the tokenizer
model = model.to('cuda')
tokenizer.padding_side = "left"
def apply_prompt_template(prompt, num_images=1, num_tokens_per_vis=128, in_context=False, output=None):
    """
    num_tokens_per_vis: model.vlm.num_tokens_per_vis
    """
    # Each image is represented by one <image> token followed by placeholder tokens.
    placeholder_image_tokens = "<image placeholder>" * (num_tokens_per_vis - 1)
    if in_context:
        # In-context (few-shot) examples include the expected output and an end-of-chunk marker.
        formatted_prompt = f"<image>{placeholder_image_tokens}" + f"{prompt}" + f"{output}" + "<|endofchunk|>"
    else:
        formatted_prompt = f"<image>{placeholder_image_tokens}" * num_images + f"{prompt}"
    return formatted_prompt
img_url = 'https://storage.googleapis.com/sfr-vision-language-research/BLIP/demo.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')
instruction = "Describe what is the dog doing in this image in one sentence:"
print("==> Instruction: ", instruction)
print("==> Image: ")
display.display(raw_image.resize((int(raw_image.width*0.3), int(raw_image.height*0.3))))
inputs = image_processor([raw_image], return_tensors="pt")
prompt = apply_prompt_template(instruction)
language_inputs = tokenizer([prompt], return_tensors="pt")
inputs.update(language_inputs)
inputs = {name: tensor.cuda() for name, tensor in inputs.items()}
with torch.cuda.amp.autocast(dtype=torch.bfloat16):
    generated_text = model.generate(**inputs,
                                    pad_token_id=tokenizer.pad_token_id,
                                    do_sample=False, max_new_tokens=64, top_p=None, num_beams=1,
                                    length_penalty=1.0, repetition_penalty=3.0)
prediction = tokenizer.decode(generated_text[0], skip_special_tokens=True)
print("==> prediciton: ", prediction)
print("-"*120)
Advanced Usage
More comprehensive examples, covering both zero-shot and few-shot inference, can be found in the accompanying notebook.
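To make the few-shot format concrete, here is a minimal sketch of how an in-context prompt can be assembled with the apply_prompt_template helper from the Basic Usage example above. The question/answer pairs below are made-up placeholders, and only the text side is shown; how the corresponding context images are batched for the model is best taken from the notebook.

# Hypothetical in-context examples: one (question, answer) pair per context image.
# The questions and answers here are placeholders for illustration only.
context_examples = [
    ("Question: What animal is in the photo? Answer:", " A dog."),
    ("Question: What animal is in the photo? Answer:", " A cat."),
]
query_question = "Question: What animal is in the photo? Answer:"

# Each context example is rendered with its expected output and closed by
# <|endofchunk|>; the query is rendered without an answer so the model completes it.
few_shot_prompt = "".join(
    apply_prompt_template(q, in_context=True, output=a) for q, a in context_examples
) + apply_prompt_template(query_question)

# The prompt now contains one <image> slot (plus placeholder tokens) per example,
# in the same order as the images passed to the image processor.
print(few_shot_prompt.count("<image>"), "image slots in the few-shot prompt")
language_inputs = tokenizer([few_shot_prompt], return_tensors="pt")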
Documentation
Results
Pretrain (base model without instruction tuning)
| Property | Details |
|----------|---------|
| Model | Flamingo-3B, MM1-3B, xgen-mm-phi3-mini-base-r-v1 (Ours) |
| Shot | 0, 4, 8 |
| Datasets | COCO (val), NoCaps (val), TextCaps (val), OKVQA (val), TextVQA (val), VizWiz (testdev), VQAv2 (testdev) |
The table summarizes the performance of these models on the datasets above under 0-, 4-, and 8-shot settings. The xgen-mm-phi3-mini-base-r-v1 model achieves competitive results.
Instruct (after instruction tuning)
| Property | Details |
|----------|---------|
| Model | MM1-3B-Chat, openbmb/MiniCPM-V-2, VILA1.5-3B, xtuner/llava-phi-3-mini-hf, xgen-mm-phi3-mini-instruct-r-v1 (Ours) |
| Datasets | SEED-IMG, MMBench (dev), MME-total, MME-P, MME-C, MMStar, MMMU (val), MMVet, MathVista (mini), ScienceQA (test), POPE, AI2D |
The table summarizes the performance of these models on multiple multimodal benchmarks after instruction tuning. The xgen-mm-phi3-mini-instruct-r-v1 model performs strongly on many of these benchmarks.
Reproducibility
Our SFT evaluation is based on VLMEvalKit, in which we fixed some inconsistencies with the official benchmarks (e.g., the LLM judge API). We also found that the raw resolution of the input image can affect the model output in some cases.
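If you observe such resolution sensitivity, one possible (purely illustrative) mitigation is to normalize the input resolution before preprocessing; the helper below is a hypothetical sketch, and the 768-pixel target is an arbitrary example value rather than a setting recommended by the model card.

from PIL import Image

def resize_longest_side(image: Image.Image, target: int = 768) -> Image.Image:
    """Resize so the longer side equals `target`, preserving the aspect ratio."""
    scale = target / max(image.size)
    new_size = (round(image.width * scale), round(image.height * scale))
    return image.resize(new_size, Image.BICUBIC)

# Example: normalize the demo image before handing it to the image processor.
normalized_image = resize_longest_side(raw_image)
inputs = image_processor([normalized_image], return_tensors="pt")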
Bias, Risks, Limitations, and Ethical Considerations
The main data sources are from the internet, including webpages, image stock sites, and curated datasets released by the research community. Some data, such as LAION, is excluded due to CSAM concerns. The model may inherit biases from the original data sources, LLMs, and commercial APIs. Users are advised to assess safety and fairness before applying the model to downstream applications.
Ethical Considerations
This release is for research purposes to support an academic paper. The models, datasets, and code are not designed or evaluated for all downstream purposes. Users should evaluate and address potential concerns related to accuracy, safety, and fairness before deployment. They are also encouraged to consider AI limitations, comply with laws, and follow best practices, especially in high-risk scenarios.
Technical Details
The model is for research purposes, and more technical details will be provided in a technical report soon.
License
Our code and weights are released under the Apache-2.0 license. The copyright of the training data remains with the original data owner.
Code Acknowledgment
We would like to acknowledge the following projects:
Citation
@misc{xue2024xgenmmblip3familyopen,
title={xGen-MM (BLIP-3): A Family of Open Large Multimodal Models},
author={Le Xue and Manli Shu and Anas Awadalla and Jun Wang and An Yan and Senthil Purushwalkam and Honglu Zhou and Viraj Prabhu and Yutong Dai and Michael S Ryoo and Shrikant Kendre and Jieyu Zhang and Can Qin and Shu Zhang and Chia-Chih Chen and Ning Yu and Juntao Tan and Tulika Manoj Awalgaonkar and Shelby Heinecke and Huan Wang and Yejin Choi and Ludwig Schmidt and Zeyuan Chen and Silvio Savarese and Juan Carlos Niebles and Caiming Xiong and Ran Xu},
year={2024},
eprint={2408.08872},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2408.08872},
}
Troubleshooting
If you encounter any issues with missing packages, refer to the installation section for the necessary commands.