InfiMM-Zephyr
InfiMM
InfiMM, inspired by the Flamingo architecture, differentiates itself through unique training data and a diverse set of large language models (LLMs). This approach lets InfiMM retain the core strengths of Flamingo while offering enhanced capabilities. As a leading open-sourced variant in this field, InfiMM excels in accessibility and adaptability, driven by community collaboration. It is not just an imitation of Flamingo; it is an innovation in visual language processing.
Our model is another attempt to reproduce the results reported in DeepMind's Flamingo paper, "Flamingo: a Visual Language Model for Few-Shot Learning". Compared with previous open-sourced attempts (OpenFlamingo and IDEFICS), InfiMM offers more flexible models, allowing for a wide range of applications. In particular, InfiMM integrates the latest LLMs into the VLM domain and reveals the impact of LLMs with different scales and architectures.
Please note that InfiMM is currently in the beta stage, and we are continuously working on improving it.
Quick Start
Use the code below to get started with the base model:
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the processor that prepares both images and text for the model
processor = AutoProcessor.from_pretrained("Infi-MM/infimm-zephyr", trust_remote_code=True)

prompts = [
    {
        "role": "user",
        "content": [
            {"image": "assets/infimm-logo.webp"},
            "Please explain this image to me.",
        ],
    }
]
inputs = processor(prompts)

# use bf16
model = AutoModelForCausalLM.from_pretrained(
    "Infi-MM/infimm-zephyr",
    local_files_only=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

inputs = inputs.to(model.device)
inputs["batch_images"] = inputs["batch_images"].to(torch.bfloat16)
generated_ids = model.generate(
    **inputs,
    min_generation_length=0,
    max_generation_length=256,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
Features
- Inspired by the Flamingo architecture, with unique training data and diverse LLMs.
- More flexible than previous open-sourced attempts, suitable for a wide range of applications.
- Integrates the latest LLMs into the VLM domain, revealing the impact of LLMs with different scales and architectures.
Documentation
Model Details
- Developed by: Institute of Automation, Chinese Academy of Sciences and ByteDance
- Model Type: Visual Language Model (VLM)
- Language(s) (NLP): English
- LLMs: Zephyr, LLaMA2-13B, Vicuna-13B
- Vision Model: [EVA CLIP](https://huggingface.co/QuanSun/EVA-CLIP)
- License: see the License section
Model Family
The InfiMM family consists of several model variants; please see the details below.

| Model | LLM | Vision Encoder | IFT |
|---|---|---|---|
| InfiMM-Zephyr | Zephyr-7B-beta | ViT-L-336 | No |
| InfiMM-Llama-13B | Llama2-13B | ViT-G-224 | No |
| InfiMM-Vicuna-13B | Vicuna-13B | ViT-E-224 | No |
| InfiMM-Zephyr-Chat | Zephyr-7B-beta | ViT-L-336 | Yes |
| InfiMM-Llama-13B-Chat | Llama2-13B | ViT-G-224 | Yes |
| InfiMM-Vicuna-13B-Chat | Vicuna-13B | ViT-E-224 | Yes |
Demo
Will be released soon.
Architecture
Our model adopts the Flamingo architecture, leveraging EVA CLIP as the visual encoder and employing LLaMA2, Vicuna, and Zephyr as language models. The visual and language modalities are connected through a cross-attention module.
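To make the connection concrete, here is a minimal sketch of a Flamingo-style gated cross-attention block in which text hidden states attend to visual features; the module structure, default dimensions, and tanh gating are assumptions in the spirit of Flamingo, not the released InfiMM code.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Illustrative cross-attention block: text hidden states attend to visual features.

    Sketch in the spirit of Flamingo's gated cross-attention; all names and
    defaults here are assumptions, not the InfiMM implementation.
    """

    def __init__(self, dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # tanh gate initialized at zero so the frozen LLM is unchanged at the start
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states, visual_features, attn_mask=None):
        # queries come from the language side, keys/values from the vision side
        attended, _ = self.attn(
            query=self.norm(text_states),
            key=visual_features,
            value=visual_features,
            attn_mask=attn_mask,
        )
        return text_states + torch.tanh(self.gate) * attended

block = CrossAttentionBlock(dim=512, num_heads=8)
text = torch.randn(1, 16, 512)    # (batch, text_len, dim)
vision = torch.randn(1, 64, 512)  # (batch, num_visual_tokens, dim)
print(block(text, vision).shape)  # torch.Size([1, 16, 512])
```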
Training Details
Pretraining (PT)
We follow training procedures similar to those used in IDEFICS.
The model is trained on a mixture of image-text pairs and unstructured multimodal web documents. All data come from public sources. Many image URLs have expired, so we were only able to download a subset of the samples. After filtering out low-quality data, the data we used are as follows:
| Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Number of Samples | Epochs |
|---|---|---|---|---|---|
| OBELICS | Unstructured Multimodal Web Documents | - | - | 101M | 1 |
| MMC4 | Unstructured Multimodal Web Documents | - | - | 53M | 1 |
| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | - | 115M | 115M | 1 |
| [COYO](https://github.com/kakaobrain/coyo-dataset) | Image-Text Pairs | - | 238M | 238M | 1 |
| [LAION-COCO](https://laion.ai/blog/laion-coco/) | Image-Text Pairs | - | 140M | 140M | 1 |
| PMD* | Image-Text Pairs | - | 20M | 20M | 1 |
*PMD is only used in models with 13B LLMs, not the 7B Zephyr model.
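As a rough illustration of how such a mixture can be traversed about once per epoch, the sketch below samples sources in proportion to their size; the counts come from the table above, but the sampling scheme itself is an assumption rather than the actual InfiMM data pipeline.

```python
import random

# Illustrative only: approximate sample counts (in millions) from the table above,
# used as weights so each source is traversed roughly once per pass over the mixture.
source_sizes_millions = {
    "OBELICS": 101, "MMC4": 53, "LAION": 115,
    "COYO": 238, "LAION-COCO": 140, "PMD": 20,  # PMD only for the 13B models
}

def sample_source(rng: random.Random) -> str:
    """Pick the next data source with probability proportional to its size."""
    names, weights = zip(*source_sizes_millions.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```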
During pretraining on interleaved image-text samples, we apply masked cross-attention. However, we did not strictly follow Flamingo, which, with a probability of 0.5, associates each image with either the text preceding it or the text following it.
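For illustration, the sketch below builds an image-to-text cross-attention mask under one common convention, in which each text token attends only to the most recent preceding image placeholder; this convention is an assumption made for the example and is not necessarily the exact masking used in training.

```python
import torch

def media_cross_attention_mask(input_ids: torch.Tensor, image_token_id: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, num_images): True where a text position may
    attend to an image. Illustration only: each position attends to the most recent
    preceding image placeholder (no 0.5-probability alternation as in Flamingo).
    """
    seq_len = input_ids.shape[0]
    image_positions = (input_ids == image_token_id).nonzero(as_tuple=True)[0]
    mask = torch.zeros(seq_len, len(image_positions), dtype=torch.bool)
    for img_idx, img_pos in enumerate(image_positions):
        # this image covers tokens up to (but excluding) the next image placeholder
        next_pos = image_positions[img_idx + 1] if img_idx + 1 < len(image_positions) else seq_len
        mask[img_pos:next_pos, img_idx] = True
    return mask

# Example: "<image> A cat. <image> A dog." with 32000 standing in for the image token id
ids = torch.tensor([32000, 5, 6, 32000, 7, 8])
print(media_cross_attention_mask(ids, image_token_id=32000).int())
```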
We use the following hyperparameters:
| Categories | Parameters | Value |
|---|---|---|
| Perceiver Resampler | Number of Layers | 6 |
| | Number of Latents | 64 |
| | Number of Heads | 16 |
| | Resampler Head Dimension | 96 |
| Training | Sequence Length | 384 (13B) / 792 (7B) |
| | Effective Batch Size | 40*128 |
| | Max Images per Sample | 6 |
| | Weight Decay | 0.1 |
| | Optimizer | Adam(0.9, 0.999) |
| | Gradient Accumulation Step | 2 |
| Learning Rate | Initial Max | 1e-4 |
| | Decay Schedule | Constant |
| | Warmup Step Rate | 0.005 |
| Large-scale Optimization | Gradient Checkpointing | False |
| | Precision | bf16 |
| | ZeRO Optimization | Stage 2 |
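The sketch below shows what a Perceiver Resampler with the dimensions listed above (6 layers, 64 latents, 16 heads, head dimension 96) might look like; the internal layer structure follows the Flamingo-style resampler and is an assumption, not the released implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress a variable number of visual tokens into a fixed set of latents.

    Sketch only: hyperparameters mirror the table above; the layer internals
    follow the Flamingo-style resampler and are assumptions.
    """

    def __init__(self, num_layers=6, num_latents=64, num_heads=16, head_dim=96):
        super().__init__()
        dim = num_heads * head_dim  # 16 * 96 = 1536
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.GELU(), nn.Linear(4 * dim, dim)),
            })
            for _ in range(num_layers)
        ])

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim) -> latents: (batch, num_latents, dim)
        batch = visual_tokens.shape[0]
        x = self.latents.unsqueeze(0).expand(batch, -1, -1)
        for layer in self.layers:
            # latents attend to the visual tokens and to themselves (Flamingo-style)
            kv = torch.cat([visual_tokens, x], dim=1)
            attended, _ = layer["attn"](layer["norm"](x), kv, kv)
            x = x + attended
            x = x + layer["ffn"](x)
        return x

resampler = PerceiverResampler()
print(resampler(torch.randn(1, 577, 16 * 96)).shape)  # torch.Size([1, 64, 1536])
```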
Multi-Task Training (MTT)
Here we use mix_cap_vqa to denote the mixed training set drawn from COCO Caption, TextCaps, VizWiz Caption, VQAv2, OK-VQA, VizWiz VQA, TextVQA, OCR-VQA, ST-VQA, DocVQA, GQA, and ScienceQA-image. For captioning samples, we prepend an instruction such as "Please describe the image." For QA samples, we append "Answer the question using a single word or phrase." Specifically, for VizWiz VQA we use "When the provided information is insufficient, respond with 'Unanswerable'. Answer the question using a single word or phrase.", and for ScienceQA-image we use "Answer with the option's letter from the given choices directly."
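The snippet below sketches how these instructions might be attached to raw samples when building mix_cap_vqa prompts; the sample dictionary fields and task labels are hypothetical, while the instruction strings are the ones quoted above.

```python
# Instruction strings quoted above; the sample structure and task labels are illustrative.
CAPTION_PREFIX = "Please describe the image."
VQA_SUFFIX = "Answer the question using a single word or phrase."
VIZWIZ_SUFFIX = ("When the provided information is insufficient, respond with 'Unanswerable'. "
                 "Answer the question using a single word or phrase.")
SCIENCEQA_SUFFIX = "Answer with the option's letter from the given choices directly."

def build_prompt(sample: dict) -> str:
    """Attach the task-specific instruction to one hypothetical mix_cap_vqa sample."""
    task = sample["task"]
    if task == "caption":
        return CAPTION_PREFIX
    if task == "vizwiz_vqa":
        return f"{sample['question']} {VIZWIZ_SUFFIX}"
    if task == "scienceqa_image":
        return f"{sample['question']} {SCIENCEQA_SUFFIX}"
    # VQAv2, OK-VQA, TextVQA, OCR-VQA, ST-VQA, DocVQA, GQA
    return f"{sample['question']} {VQA_SUFFIX}"

print(build_prompt({"task": "vqa", "question": "What color is the bus?"}))
```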
Instruction Fine-Tuning (IFT)
For the instruction fine-tuning stage, we use the recently released [LLaVA-MIX-665k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main).
We use the following hyperparameters:
| Categories | Parameters | Value |
|---|---|---|
| Perceiver Resampler | Number of Layers | 6 |
| | Number of Latents | 64 |
| | Number of Heads | 16 |
| | Resampler Head Dimension | 96 |
| Training | Sequence Length | 384 (13B) / 792 (7B) |
| | Effective Batch Size | 64 |
| | Max Images per Sample | 6 |
| | Weight Decay | 0.1 |
| | Optimizer | Adam(0.9, 0.999) |
| | Gradient Accumulation Step | 2 |
| Learning Rate | Initial Max | 1e-5 |
| | Decay Schedule | Constant |
| | Warmup Step Rate | 0.005 |
| Large-scale Optimization | Gradient Checkpointing | False |
| | Precision | bf16 |
| | ZeRO Optimization | Stage 2 |
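For reference, a constant learning-rate schedule with a warmup step rate of 0.005 (as in the table) can be expressed as below; the linear warmup shape and the helper name are assumptions.

```python
def learning_rate(step: int, total_steps: int, max_lr: float = 1e-5,
                  warmup_rate: float = 0.005) -> float:
    """Warm up over warmup_rate * total_steps steps, then stay constant.

    Sketch only: the table specifies a constant schedule with warmup step rate
    0.005; the linear warmup shape is an assumption.
    """
    warmup_steps = max(1, int(warmup_rate * total_steps))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return max_lr

print([round(learning_rate(s, total_steps=1000), 7) for s in (0, 2, 4, 100)])
```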
During IFT, as in pretraining, we keep the ViT and the LLM frozen for both chat-based LLMs (Vicuna and Zephyr). For the Llama model, the LLM remains trainable during the IFT stage. We also apply a chat template to process the training samples.
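The freezing and chat-template formatting described above might look roughly like the sketch below; `apply_chat_template` is the standard Hugging Face tokenizer utility, but the attribute names (`vision_encoder`, `lang_model`) and helper functions are assumptions, not the actual InfiMM training code.

```python
# Freeze the vision encoder and (for Vicuna/Zephyr) the LLM, training only the
# connector modules. Attribute names are illustrative, not the model's real ones.
def freeze_for_ift(model, llm_trainable: bool = False):
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.lang_model.parameters():
        p.requires_grad = llm_trainable  # True only for the Llama-based model

# Format one IFT conversation with the LLM's chat template (standard HF tokenizer API).
def format_sample(tokenizer, question: str, answer: str) -> str:
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)
```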
Evaluation
Pretraining Evaluation
We evaluate the pretrained models on the following downstream tasks: Image Captioning and VQA. We also compare our results with IDEFICS.
| Model | Shots | COCO CIDEr | Flickr30K CIDEr | VQA v2 Acc | TextVQA Acc | OK-VQA Acc |
|---|---|---|---|---|---|---|
| IDEFICS-9B | 0 | 46 | 27.3 | 50.9 | 25.9 | 38.4 |
| | 4 | 93 | 59.7 | 55.4 | 27.6 | 45.5 |
| IDEFICS-80B | 0 | 91.8 | 53.7 | 60 | 30.9 | 45.2 |
| | 4 | 110.3 | 73.7 | 64.6 | 34.4 | 52.4 |
| InfiMM-Zephyr-7B | 0 | 78.8 | 60.7 | 33.7 | 15.2 | 17.1 |
| | 4 | 108.6 | 71.9 | 59.1 | 34.3 | 50.5 |
| InfiMM-Llama2-13B | 0 | 85.4 | 54.6 | 51.6 | 24.2 | 26.4 |
| | 4 | 125.2 | 87.1 | 66.1 | 38.2 | 55.5 |
| InfiMM-Vicuna-13B | 0 | 69.6 | 49.6 | 60.4 | 32.8 | 49.2 |
| | 4 | 118.1 | 81.4 | 64.2 | 38.4 | 53.7 |
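In the few-shot (4-shot) setting, in-context examples are prepended to the query. The sketch below assembles such a prompt in the same message format as the Quick Start example; the exact prompt wording and the helper function are illustrative assumptions, not the evaluation harness.

```python
def build_few_shot_prompt(support_examples, query_image, query_question):
    """Assemble a few-shot VQA prompt in the processor's message format.

    `support_examples` is a list of (image_path, question, answer) triples.
    Illustrative only; the actual evaluation prompts may differ.
    """
    content = []
    for image, question, answer in support_examples:
        content += [{"image": image}, f"Question: {question} Short answer: {answer}"]
    content += [{"image": query_image}, f"Question: {query_question} Short answer:"]
    return [{"role": "user", "content": content}]

prompt = build_few_shot_prompt(
    [("shot1.jpg", "What is the animal?", "cat")],  # up to 4 in-context shots
    "query.jpg",
    "What color is the car?",
)
print(prompt)
```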
IFT Evaluation
In our analysis, we concentrate on two primary benchmark categories for evaluating MLLMs: 1) Multiple-choice Question Answering (QA) and 2) Open-ended Evaluation.
Technical Details
The model adopts the Flamingo architecture. It uses EVA CLIP as the visual encoder and LLaMA2, Vicuna, and Zephyr as language models. The connection between the visual and language modalities is achieved through a cross-attention module. The training process consists of three stages: pretraining, multi-task training, and instruction fine-tuning, each with specific data sources and hyperparameters.
License
Please refer to the relevant license information for detailed usage terms.