🌍 Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. It can answer questions about images, describe visual content, create stories based on multiple images, or act as a pure language model without visual inputs. It significantly improves upon Idefics1, enhancing capabilities in OCR, document understanding, and visual reasoning.
🚀 Quick Start
This section shows code snippets for generation with `idefics2-8b-base` and `idefics2-8b`. The two snippets differ only in how the inputs are formatted. First, let's define some common imports and inputs.
```python
import requests
import torch
from PIL import Image
from io import BytesIO

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
```
Basic Usage
For `idefics2-8b-base`
```python
# Load the processor and the base model
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b-base",
).to(DEVICE)

# The base model uses raw prompts with <image> placeholders marking image positions
prompts = [
    "<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
    "In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate and decode
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```
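For `idefics2-8b`

The instruction-tuned checkpoint expects chat-formatted inputs built with the processor's chat template instead of raw `<image>` prompts. The snippet below is a sketch that reuses the imports, `DEVICE`, and images defined above; the question text is illustrative.

```python
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
).to(DEVICE)

# Chat-style messages: {"type": "image"} marks where each image is inserted
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```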
✨ Features
- Accepts arbitrary sequences of image and text inputs and produces text outputs.
- Can answer questions about images, describe visual content, create stories based on multiple images, or act as a pure language model without visual inputs.
- Significantly improves upon Idefics1, enhancing capabilities in OCR, document understanding, and visual reasoning.
📦 Installation
Idefics2 is available in Transformers starting from the 4.40.0 release. Upgrade with `pip install transformers --upgrade`, and avoid versions 4.41.0 through 4.43.3, which are incompatible (see the Important Note below).
📚 Documentation
Model Summary
Uses
`idefics2-8b-base` and `idefics2-8b` can be used for inference on multimodal (image + text) tasks where the input consists of a text query and one (or multiple) images. Text and images can be interleaved arbitrarily. This includes image captioning, visual question answering, etc. These models do not support image generation.
For optimal results, it is recommended to fine-tune `idefics2-8b` on specific use cases and data. The instruction-fine-tuned model (`idefics2-8b`) is better at following user instructions and should be preferred for out-of-the-box use or as a starting point for fine-tuning.
`idefics2-8b` usually generates very short answers. For long generations, use `idefics2-8b-chatty`, which was further fine-tuned on long conversations.
As a starting point, fine-tuning scripts are provided that can be adapted to specific scenarios.
Technical summary
Idefics2 shows strong performance for a model of its size (8B parameters) compared to other open multimodal models and is often competitive with closed-source systems. It serves as a strong foundation for various use-case specific fine-tunings.
The table below gives detailed benchmark results.
| Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-VL | ✅ | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | - | 49.6 |
| LLaVa-NeXT-Mistral-7B | ✅ | 7B | 2880 | 35.3/- | 37.7 | 65.7 | 68.7 | 82.2 | - |
| LLaVa-NeXT-13B | ✅ | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | 82.8 | - |
| LLaVa-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | - |
| MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | - | - |
| MM1-Chat-30B | ❌ | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | 83.7 | |
| Gemini 1.0 Pro | ❌ | 🤷‍♂️ | 🤷‍♂️ | 47.9/- | 45.2 | 74.6 | - | 71.2 | 88.1 |
| Gemini 1.5 Pro | ❌ | 🤷‍♂️ | 🤷‍♂️ | 58.5/- | 52.1 | 73.5 | - | 73.2 | 86.5 |
| Claude 3 Haiku | ❌ | 🤷‍♂️ | 🤷‍♂️ | 50.2/- | 46.4 | - | - | - | 88.8 |
| Idefics1 instruct (32-shots) | ✅ | 80B | - | - | - | 39.3 | - | 68.8 | - |
| Idefics2 (w/o im. split) | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
| Idefics2 (w/ im. split) | ✅ | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 81.2 | 74.0 |
Idefics2 introduces several carefully ablated improvements over Idefics1:
- Manipulates images in their native resolutions (up to 980 x 980) and native aspect ratios by following the NaViT strategy, avoiding the need to resize images to fixed-size squares. Optionally, it allows sub-image splitting and passing images of very large resolution, following the SPHINX strategy (a processor configuration sketch follows this list).
- Significantly enhances OCR abilities by integrating data that requires the model to transcribe text in an image or a document. Also improves abilities in answering questions on charts, figures, and documents with appropriate training data.
- Departs from Idefics1's architecture (gated cross-attentions) and simplifies the integration of visual features into the language backbone. Images are fed to the vision encoder, followed by a learned Perceiver pooling and an MLP modality projection. The pooled sequence is then concatenated with the text embeddings to obtain an (interleaved) sequence of image(s) and text(s).
- All these improvements, along with better pre-trained backbones, result in a significant performance jump over Idefics1 for a model that is 10x smaller.
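To illustrate the resolution and image-splitting controls mentioned above, the sketch below uses processor options exposed by the Idefics2 processor; the specific values are examples for trading accuracy against memory, not recommended settings from this card.

```python
from transformers import AutoProcessor

# Sketch: control sub-image splitting and image resolution at inference time.
# With do_image_splitting=False each image costs 64 visual tokens instead of 320.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
    # Optionally cap the longest edge (the supported range is roughly 378-980 pixels)
    size={"longest_edge": 448, "shortest_edge": 378},
)
```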
Idefics2 is trained in 2 stages for maximum efficiency. In the first stage, images are fed to the model at SigLIP's native resolution (squares of 384 x 384). In the second stage, images are fed to the model at their native resolution (with a maximum of 980 and a minimum of 378) and native aspect ratio. Since high resolution is necessary for OCR data, PDFA, Rendered-Text, and IDL are added to OBELICS, LAION Coco, and PMD during the second stage.
Following this, instruction fine-tuning is performed on The Cauldron, a collection of 50 manually curated vision-language datasets, along with 9 text-only instruction fine-tuning datasets.
LoRA is used to train the parameters initialized from pre-trained backbones, while full fine-tuning is used for newly initialized parameters (the modality connector), as this strategy was found to be more stable and computationally efficient.
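A minimal sketch of this strategy with the `peft` library is shown below; the target module names and the `connector` identifier are assumptions about the model layout for illustration, not the exact training configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # LoRA adapters on the pre-trained backbone projections (assumed module names)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Fully train the newly initialized modality connector (assumed module name)
    modules_to_save=["connector"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```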
More details (training procedure, data selection, hyper-parameters, etc.) along with lessons learned from ablations will be available in an upcoming technical report.
🔧 Technical Details
Idefics2 is trained in two stages. In the first stage, images are fed at SigLIP's native resolution (384x384 squares). In the second stage, images are fed at their native resolution (max 980, min 378) and aspect ratio. High-resolution data like PDFA, Rendered-Text, and IDL are added to OBELICS, LAION Coco, and PMD for OCR training.
Instruction fine-tuning is done on The Cauldron and 9 text-only datasets. LoRA is used for pre-trained backbone parameters, and full fine-tuning for newly initialized parameters.
📄 License
The model is released under the Apache 2.0 license. We release three checkpoints:
- idefics2-8b-base: the base model
- idefics2-8b: the base model fine-tuned on a mixture of supervised and instruction datasets (text-only and multimodal datasets)
- idefics2-8b-chatty: idefics2-8b further fine-tuned on long conversations
⚠️ Important Note
Idefics2 will NOT work with `transformers` versions 4.41.0 through 4.43.3 (inclusive). See the issue https://github.com/huggingface/transformers/issues/32271 and the fix https://github.com/huggingface/transformers/pull/32275.
💡 Usage Tip
As of April 18th, 2024, Idefics2 is part of the 4.40.0 Transformers PyPI release. Please upgrade your Transformers version (`pip install transformers --upgrade`).
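As a convenience (not part of the original card), a quick runtime check for a compatible version:

```python
import transformers
from packaging import version

v = version.parse(transformers.__version__)
# Idefics2 requires >= 4.40.0 and does not work on 4.41.0 through 4.43.3
assert v >= version.parse("4.40.0"), "Upgrade: pip install transformers --upgrade"
assert not (version.parse("4.41.0") <= v <= version.parse("4.43.3")), \
    "This transformers version is incompatible with Idefics2"
print(f"transformers {transformers.__version__} looks compatible with Idefics2")
```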