InfiMM-Zephyr
InfiMM
InfiMM, inspired by the Flamingo architecture, differentiates itself through unique training data and a diverse set of large language models (LLMs). This approach lets InfiMM retain the core strengths of Flamingo while offering enhanced capabilities. As a leading open-sourced variant in this field, InfiMM excels in accessibility and adaptability, driven by community collaboration. It is not just an imitation of Flamingo; it is an innovation in visual language processing.
Our model is another attempt to reproduce the results reported in DeepMind's Flamingo paper, "Flamingo: a Visual Language Model for Few-Shot Learning". Compared with previous open-sourced attempts (OpenFlamingo and IDEFICS), InfiMM offers more flexible models, allowing for a wide range of applications. In particular, InfiMM integrates the latest LLMs into the VLM domain and reveals the impact of LLMs with different scales and architectures.
Please note that InfiMM is currently in the beta stage, and we are continuously working on improving it.
Quick Start
Use the code below to get started with the base model:
import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load the processor that prepares both images and text for the model
processor = AutoProcessor.from_pretrained("Infi-MM/infimm-zephyr", trust_remote_code=True)

prompts = [
    {
        "role": "user",
        "content": [
            {"image": "assets/infimm-logo.webp"},
            "Please explain this image to me.",
        ],
    }
]
inputs = processor(prompts)

# use bf16
model = AutoModelForCausalLM.from_pretrained(
    "Infi-MM/infimm-zephyr",
    local_files_only=True,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

inputs = inputs.to(model.device)
inputs["batch_images"] = inputs["batch_images"].to(torch.bfloat16)
generated_ids = model.generate(
    **inputs,
    min_generation_length=0,
    max_generation_length=256,
)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_text)
Features
- Inspired by the Flamingo architecture, with unique training data and diverse LLMs.
- More flexible than previous open-sourced attempts, suitable for a wide range of applications.
- Integrates the latest LLMs into the VLM domain, revealing the impact of LLMs with different scales and architectures.
Documentation
Model Details
- Developed by: Institute of Automation, Chinese Academy of Sciences and ByteDance
- Model Type: Visual Language Model (VLM)
- Language(s) (NLP): English
- LLMs: Zephyr, LLaMA2-13B, Vicuna-13B
- Vision Model: [EVA CLIP](https://huggingface.co/QuanSun/EVA-CLIP)
- License: see the License section
Model Family
The InfiMM family consists of several model variants; please see the details below.

| Model | LLM | Vision Encoder | IFT |
|---|---|---|---|
| InfiMM-Zephyr | Zephyr-7B-beta | ViT-L-336 | No |
| InfiMM-Llama-13B | Llama2-13B | ViT-G-224 | No |
| InfiMM-Vicuna-13B | Vicuna-13B | ViT-E-224 | No |
| InfiMM-Zephyr-Chat | Zephyr-7B-beta | ViT-L-336 | Yes |
| InfiMM-Llama-13B-Chat | Llama2-13B | ViT-G-224 | Yes |
| InfiMM-Vicuna-13B-Chat | Vicuna-13B | ViT-E-224 | Yes |
Demo
Will be released soon.
Architecture
Our model adopts the Flamingo architecture, leveraging EVA CLIP as the visual encoder and employing LLaMA2, Vicuna, and Zephyr as language models. The visual and language modalities are connected through a cross-attention module.
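To make the connection concrete, here is a minimal sketch of a Flamingo-style gated cross-attention block in which text hidden states attend to visual features; the module structure, default dimensions, and tanh gating are assumptions in the spirit of Flamingo, not the released InfiMM code.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Illustrative cross-attention block: text hidden states attend to visual features.

    Sketch in the spirit of Flamingo's gated cross-attention; all names and
    defaults here are assumptions, not the InfiMM implementation.
    """

    def __init__(self, dim: int = 4096, num_heads: int = 32):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # tanh gate initialized at zero so the frozen LLM is unchanged at the start
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states, visual_features, attn_mask=None):
        # queries come from the language side, keys/values from the vision side
        attended, _ = self.attn(
            query=self.norm(text_states),
            key=visual_features,
            value=visual_features,
            attn_mask=attn_mask,
        )
        return text_states + torch.tanh(self.gate) * attended

block = CrossAttentionBlock(dim=512, num_heads=8)
text = torch.randn(1, 16, 512)    # (batch, text_len, dim)
vision = torch.randn(1, 64, 512)  # (batch, num_visual_tokens, dim)
print(block(text, vision).shape)  # torch.Size([1, 16, 512])
```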
Training Details
Pretraining (PT)
We follow training procedures similar to those used in IDEFICS.
The model is trained on a mixture of image-text pairs and unstructured multimodal web documents. All data come from public sources. Many image URLs have expired, so we were only able to download a subset of the samples. After filtering out low-quality data, the data we used are as follows:
| Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Number of Samples | Epochs |
|---|---|---|---|---|---|
| OBELICS | Unstructured Multimodal Web Documents | - | - | 101M | 1 |
| MMC4 | Unstructured Multimodal Web Documents | - | - | 53M | 1 |
| [LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | - | 115M | 115M | 1 |
| [COYO](https://github.com/kakaobrain/coyo-dataset) | Image-Text Pairs | - | 238M | 238M | 1 |
| [LAION-COCO](https://laion.ai/blog/laion-coco/) | Image-Text Pairs | - | 140M | 140M | 1 |
| PMD* | Image-Text Pairs | - | 20M | 20M | 1 |
*PMD is only used in models with 13B LLMs, not the 7B Zephyr model.
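As a rough illustration of how such a mixture can be traversed about once per epoch, the sketch below samples sources in proportion to their size; the counts come from the table above, but the sampling scheme itself is an assumption rather than the actual InfiMM data pipeline.

```python
import random

# Illustrative only: approximate sample counts (in millions) from the table above,
# used as weights so each source is traversed roughly once per pass over the mixture.
source_sizes_millions = {
    "OBELICS": 101, "MMC4": 53, "LAION": 115,
    "COYO": 238, "LAION-COCO": 140, "PMD": 20,  # PMD only for the 13B models
}

def sample_source(rng: random.Random) -> str:
    """Pick the next data source with probability proportional to its size."""
    names, weights = zip(*source_sizes_millions.items())
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
print([sample_source(rng) for _ in range(5)])
```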
During pretraining on interleaved image-text samples, we apply masked cross-attention. However, we did not strictly follow Flamingo, which, with a probability of 0.5, associates each image with either the text preceding it or the text following it.
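For illustration, the sketch below builds an image-to-text cross-attention mask under one common convention, in which each text token attends only to the most recent preceding image placeholder; this convention is an assumption made for the example and is not necessarily the exact masking used in training.

```python
import torch

def media_cross_attention_mask(input_ids: torch.Tensor, image_token_id: int) -> torch.Tensor:
    """Boolean mask of shape (seq_len, num_images): True where a text position may
    attend to an image. Illustration only: each position attends to the most recent
    preceding image placeholder (no 0.5-probability alternation as in Flamingo).
    """
    seq_len = input_ids.shape[0]
    image_positions = (input_ids == image_token_id).nonzero(as_tuple=True)[0]
    mask = torch.zeros(seq_len, len(image_positions), dtype=torch.bool)
    for img_idx, img_pos in enumerate(image_positions):
        # this image covers tokens up to (but excluding) the next image placeholder
        next_pos = image_positions[img_idx + 1] if img_idx + 1 < len(image_positions) else seq_len
        mask[img_pos:next_pos, img_idx] = True
    return mask

# Example: "<image> A cat. <image> A dog." with 32000 standing in for the image token id
ids = torch.tensor([32000, 5, 6, 32000, 7, 8])
print(media_cross_attention_mask(ids, image_token_id=32000).int())
```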
We use the following hyperparameters:
| Categories | Parameters | Value |
|---|---|---|
| Perceiver Resampler | Number of Layers | 6 |
| | Number of Latents | 64 |
| | Number of Heads | 16 |
| | Resampler Head Dimension | 96 |
| Training | Sequence Length | 384 (13B) / 792 (7B) |
| | Effective Batch Size | 40*128 |
| | Max Images per Sample | 6 |
| | Weight Decay | 0.1 |
| | Optimizer | Adam(0.9, 0.999) |
| | Gradient Accumulation Step | 2 |
| Learning Rate | Initial Max | 1e-4 |
| | Decay Schedule | Constant |
| | Warmup Step Rate | 0.005 |
| Large-scale Optimization | Gradient Checkpointing | False |
| | Precision | bf16 |
| | ZeRO Optimization | Stage 2 |
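The sketch below shows what a Perceiver Resampler with the dimensions listed above (6 layers, 64 latents, 16 heads, head dimension 96) might look like; the internal layer structure follows the Flamingo-style resampler and is an assumption, not the released implementation.

```python
import torch
import torch.nn as nn

class PerceiverResampler(nn.Module):
    """Compress a variable number of visual tokens into a fixed set of latents.

    Sketch only: hyperparameters mirror the table above; the layer internals
    follow the Flamingo-style resampler and are assumptions.
    """

    def __init__(self, num_layers=6, num_latents=64, num_heads=16, head_dim=96):
        super().__init__()
        dim = num_heads * head_dim  # 16 * 96 = 1536
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.layers = nn.ModuleList([
            nn.ModuleDict({
                "attn": nn.MultiheadAttention(dim, num_heads, batch_first=True),
                "norm": nn.LayerNorm(dim),
                "ffn": nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim),
                                     nn.GELU(), nn.Linear(4 * dim, dim)),
            })
            for _ in range(num_layers)
        ])

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, num_patches, dim) -> latents: (batch, num_latents, dim)
        batch = visual_tokens.shape[0]
        x = self.latents.unsqueeze(0).expand(batch, -1, -1)
        for layer in self.layers:
            # latents attend to the visual tokens and to themselves (Flamingo-style)
            kv = torch.cat([visual_tokens, x], dim=1)
            attended, _ = layer["attn"](layer["norm"](x), kv, kv)
            x = x + attended
            x = x + layer["ffn"](x)
        return x

resampler = PerceiverResampler()
print(resampler(torch.randn(1, 577, 16 * 96)).shape)  # torch.Size([1, 64, 1536])
```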
Multi-Task Training (MTT)
Here we use mix_cap_vqa to denote the mixed training set drawn from COCO Caption, TextCaps, VizWiz Caption, VQAv2, OK-VQA, VizWiz VQA, TextVQA, OCR-VQA, ST-VQA, DocVQA, GQA, and ScienceQA-image. For captioning samples, we prepend an instruction such as "Please describe the image." For QA samples, we append "Answer the question using a single word or phrase." Specifically, for VizWiz VQA we use "When the provided information is insufficient, respond with 'Unanswerable'. Answer the question using a single word or phrase.", and for ScienceQA-image we use "Answer with the option's letter from the given choices directly."
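The snippet below sketches how these instructions might be attached to raw samples when building mix_cap_vqa prompts; the sample dictionary fields and task labels are hypothetical, while the instruction strings are the ones quoted above.

```python
# Instruction strings quoted above; the sample structure and task labels are illustrative.
CAPTION_PREFIX = "Please describe the image."
VQA_SUFFIX = "Answer the question using a single word or phrase."
VIZWIZ_SUFFIX = ("When the provided information is insufficient, respond with 'Unanswerable'. "
                 "Answer the question using a single word or phrase.")
SCIENCEQA_SUFFIX = "Answer with the option's letter from the given choices directly."

def build_prompt(sample: dict) -> str:
    """Attach the task-specific instruction to one hypothetical mix_cap_vqa sample."""
    task = sample["task"]
    if task == "caption":
        return CAPTION_PREFIX
    if task == "vizwiz_vqa":
        return f"{sample['question']} {VIZWIZ_SUFFIX}"
    if task == "scienceqa_image":
        return f"{sample['question']} {SCIENCEQA_SUFFIX}"
    # VQAv2, OK-VQA, TextVQA, OCR-VQA, ST-VQA, DocVQA, GQA
    return f"{sample['question']} {VQA_SUFFIX}"

print(build_prompt({"task": "vqa", "question": "What color is the bus?"}))
```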
Instruction Fine-Tuning (IFT)
For the instruction fine-tuning stage, we use the recently released [LLaVA-MIX-665k](https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K/tree/main).
We use the following hyperparameters:
| Categories | Parameters | Value |
|---|---|---|
| Perceiver Resampler | Number of Layers | 6 |
| | Number of Latents | 64 |
| | Number of Heads | 16 |
| | Resampler Head Dimension | 96 |
| Training | Sequence Length | 384 (13B) / 792 (7B) |
| | Effective Batch Size | 64 |
| | Max Images per Sample | 6 |
| | Weight Decay | 0.1 |
| | Optimizer | Adam(0.9, 0.999) |
| | Gradient Accumulation Step | 2 |
| Learning Rate | Initial Max | 1e-5 |
| | Decay Schedule | Constant |
| | Warmup Step Rate | 0.005 |
| Large-scale Optimization | Gradient Checkpointing | False |
| | Precision | bf16 |
| | ZeRO Optimization | Stage 2 |
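For reference, a constant learning-rate schedule with a warmup step rate of 0.005 (as in the table) can be expressed as below; the linear warmup shape and the helper name are assumptions.

```python
def learning_rate(step: int, total_steps: int, max_lr: float = 1e-5,
                  warmup_rate: float = 0.005) -> float:
    """Warm up over warmup_rate * total_steps steps, then stay constant.

    Sketch only: the table specifies a constant schedule with warmup step rate
    0.005; the linear warmup shape is an assumption.
    """
    warmup_steps = max(1, int(warmup_rate * total_steps))
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    return max_lr

print([round(learning_rate(s, total_steps=1000), 7) for s in (0, 2, 4, 100)])
```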
During IFT, as in pretraining, we keep the ViT and the LLM frozen for both chat-based LLMs (Vicuna and Zephyr). For the Llama model, the LLM remains trainable during the IFT stage. We also apply a chat template to process the training samples.
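The freezing and chat-template formatting described above might look roughly like the sketch below; `apply_chat_template` is the standard Hugging Face tokenizer utility, but the attribute names (`vision_encoder`, `lang_model`) and helper functions are assumptions, not the actual InfiMM training code.

```python
# Freeze the vision encoder and (for Vicuna/Zephyr) the LLM, training only the
# connector modules. Attribute names are illustrative, not the model's real ones.
def freeze_for_ift(model, llm_trainable: bool = False):
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.lang_model.parameters():
        p.requires_grad = llm_trainable  # True only for the Llama-based model

# Format one IFT conversation with the LLM's chat template (standard HF tokenizer API).
def format_sample(tokenizer, question: str, answer: str) -> str:
    messages = [
        {"role": "user", "content": question},
        {"role": "assistant", "content": answer},
    ]
    return tokenizer.apply_chat_template(messages, tokenize=False)
```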
Evaluation
Pretraining Evaluation
We evaluate the pretrained models on the following downstream tasks: Image Captioning and VQA. We also compare our results with IDEFICS.
| Model | Shots | COCO CIDEr | Flickr30K CIDEr | VQA v2 Acc | TextVQA Acc | OK-VQA Acc |
|---|---|---|---|---|---|---|
| IDEFICS-9B | 0 | 46 | 27.3 | 50.9 | 25.9 | 38.4 |
| | 4 | 93 | 59.7 | 55.4 | 27.6 | 45.5 |
| IDEFICS-80B | 0 | 91.8 | 53.7 | 60 | 30.9 | 45.2 |
| | 4 | 110.3 | 73.7 | 64.6 | 34.4 | 52.4 |
| InfiMM-Zephyr-7B | 0 | 78.8 | 60.7 | 33.7 | 15.2 | 17.1 |
| | 4 | 108.6 | 71.9 | 59.1 | 34.3 | 50.5 |
| InfiMM-Llama2-13B | 0 | 85.4 | 54.6 | 51.6 | 24.2 | 26.4 |
| | 4 | 125.2 | 87.1 | 66.1 | 38.2 | 55.5 |
| InfiMM-Vicuna-13B | 0 | 69.6 | 49.6 | 60.4 | 32.8 | 49.2 |
| | 4 | 118.1 | 81.4 | 64.2 | 38.4 | 53.7 |
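In the few-shot (4-shot) setting, in-context examples are prepended to the query. The sketch below assembles such a prompt in the same message format as the Quick Start example; the exact prompt wording and the helper function are illustrative assumptions, not the evaluation harness.

```python
def build_few_shot_prompt(support_examples, query_image, query_question):
    """Assemble a few-shot VQA prompt in the processor's message format.

    `support_examples` is a list of (image_path, question, answer) triples.
    Illustrative only; the actual evaluation prompts may differ.
    """
    content = []
    for image, question, answer in support_examples:
        content += [{"image": image}, f"Question: {question} Short answer: {answer}"]
    content += [{"image": query_image}, f"Question: {query_question} Short answer:"]
    return [{"role": "user", "content": content}]

prompt = build_few_shot_prompt(
    [("shot1.jpg", "What is the animal?", "cat")],  # up to 4 in-context shots
    "query.jpg",
    "What color is the car?",
)
print(prompt)
```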
IFT Evaluation
In our analysis, we concentrate on two primary benchmark categories for evaluating MLLMs: 1) Multiple-choice Question Answering (QA) and 2) Open-ended Evaluation.
Technical Details
The model adopts the Flamingo architecture. It uses EVA CLIP as the visual encoder and LLaMA2, Vicuna, and Zephyr as language models. The connection between the visual and language modalities is achieved through a cross-attention module. The training process consists of three stages: pretraining, multi-task training, and instruction fine-tuning, each with specific data sources and hyperparameters.
License
Please refer to the relevant license information for detailed usage terms.