🚀 Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and generates text outputs. It can answer image-related questions, describe visual content, create stories based on multiple images, or function as a pure language model without visual inputs. It builds on Idefics1, significantly enhancing OCR, document understanding, and visual reasoning capabilities.
✨ Features
- Multimodal Capabilities: Handles both image and text inputs, enabling tasks like image captioning, visual question answering, and story creation from images.
- Enhanced Performance: Improves upon Idefics1, especially in OCR, document understanding, and visual reasoning.
- Multiple Checkpoints: Available as several checkpoints (idefics2-8b-base, idefics2-8b, idefics2-8b-chatty) for different use cases.
📦 Installation
Idefics2 is supported in the Transformers library; the examples below additionally use torch, Pillow, and requests. Avoid the incompatible Transformers versions listed in the Important Note at the end of this document.
💻 Usage Examples
Basic Usage
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
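# Target device for the model and inputs; change to "cpu" if no CUDA GPU is available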
DEVICE = "cuda:0"
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
For idefics2-8b-base:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b-base",
).to(DEVICE)
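# The base model is prompted with plain text in which <image> placeholders mark where each image goes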
prompts = [
"<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
"In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
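# Generate up to 500 new tokens and decode the resulting sequences back to text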
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
📚 Documentation
Model Summary
Uses
idefics2-8b-base and idefics2-8b can be used for inference on multimodal (image + text) tasks where the input consists of a text query and one or several images. Text and images can be interleaved arbitrarily. These tasks include image captioning, visual question answering, etc.; the models do not support image generation.
For optimal results, it is recommended to fine-tune idefics2-8b on one's specific use case and data. The instruction-fine-tuned model (idefics2-8b) is better at following user instructions and is preferred for out-of-the-box use or as a starting point for fine-tuning. idefics2-8b usually generates short answers; for long generations, use idefics2-8b-chatty, which is further fine-tuned on long conversations.
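For the instruction-tuned checkpoints (idefics2-8b and idefics2-8b-chatty), prompts are typically built with the processor's chat template rather than raw <image> strings. The following is a minimal sketch that reuses the imports, DEVICE, and image1 from the Usage Examples section above; the question text is illustrative.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b").to(DEVICE)
# Each user turn interleaves {"type": "image"} placeholders with text content
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
generated_ids = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))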
Fine-tuning code is provided for a range of scenarios.
Technical summary
Idefics2 shows strong performance for its size (8B parameters) compared to other open multimodal models and is often competitive with closed-source systems. It serves as a solid foundation for use-case-specific fine-tuning.
The table below compares Idefics2 against other open and closed multimodal models.
| Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-VL | ✅ | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | - | 49.6 |
| LLaVa-NeXT-Mistral-7B | ✅ | 7B | 2880 | 35.3/- | 37.7 | 65.7 | 68.7 | 82.2 | - |
| LLaVa-NeXT-13B | ✅ | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | 82.8 | - |
| LLaVa-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | - |
| MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | - | - |
| MM1-Chat-30B | ❌ | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | 83.7 | - |
| Gemini 1.0 Pro | ❌ | unknown | unknown | 47.9/- | 45.2 | 74.6 | - | 71.2 | 88.1 |
| Gemini 1.5 Pro | ❌ | unknown | unknown | 58.5/- | 52.1 | 73.5 | - | 73.2 | 86.5 |
| Claude 3 Haiku | ❌ | unknown | unknown | 50.2/- | 46.4 | - | - | - | 88.8 |
| Idefics1 instruct (32-shots) | ✅ | 80B | - | - | - | 39.3 | - | 68.8 | - |
| Idefics2 (w/o im. split) | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
| Idefics2 (w/ im. split) | ✅ | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 81.2 | 74.0 |
Idefics2 introduces several carefully ablated improvements over Idefics1:
- It manipulates images in their native resolutions (up to 980 x 980) and native aspect ratios, following the NaViT strategy, which avoids resizing images to fixed-size squares. It also optionally allows sub-image splitting and passing images of very large resolution, following the SPHINX strategy.
- It significantly enhances OCR abilities by integrating data for transcribing text in images and documents, and improves question answering on charts, figures, and documents with appropriate training data.
- It simplifies the integration of visual features into the language backbone by departing from Idefics1's gated cross-attention architecture: images are fed to the vision encoder, followed by a learned Perceiver pooling and an MLP modality projection, and the pooled sequence is concatenated with the text embeddings (an illustrative sketch of this connector follows the list).
- These improvements, together with better pre-trained backbones, yield a significant performance boost over Idefics1 with a model that is 10x smaller.
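The connector described above can be pictured with the following illustrative pseudocode; module names and dimensions are assumptions chosen for exposition, not the actual Transformers implementation.
import torch
import torch.nn as nn

class ToyIdefics2Connector(nn.Module):
    """Schematic: Perceiver-style pooling followed by an MLP modality projection."""
    def __init__(self, vision_dim, text_dim, num_latents=64, num_heads=8):
        super().__init__()
        # A fixed set of learned latent queries cross-attends to the vision-encoder outputs,
        # pooling each image into num_latents visual tokens (64 without image splitting).
        self.latents = nn.Parameter(torch.randn(num_latents, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        # MLP projection from the vision width into the language model's embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_hidden_states, text_embeds):
        batch = vision_hidden_states.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.cross_attn(queries, vision_hidden_states, vision_hidden_states)
        image_embeds = self.proj(pooled)
        # The pooled, projected image tokens are concatenated with the text embeddings
        # before entering the language backbone (placement here is schematic).
        return torch.cat([image_embeds, text_embeds], dim=1)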
Idefics2 is trained in two stages for maximum efficiency. In the first stage, images are fed at SigLIP's native resolution (384 x 384 squares). In the second stage, images are fed at their native resolution (maximum 980, minimum 378) and native aspect ratio. For OCR data, PDFA, Rendered-Text, and IDL are added to OBELICS, LAION Coco, and PMD during the second stage.
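Resolution handling and sub-image splitting are controlled through the image processor. The sketch below assumes the do_image_splitting and size options of the Idefics2 image processor; verify the exact option names against the Transformers version you have installed.
# Trade a little accuracy for speed and memory: disable sub-image splitting and lower
# the maximum resolution (defaults allow up to 980 pixels on the longest edge).
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
    size={"longest_edge": 448, "shortest_edge": 378},
)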
Instruction fine-tuning is performed on The Cauldron, a collection of 50 manually curated vision-language datasets, along with 9 text-only instruction fine-tuning datasets.
LoRA is used to train the parameters initialized from the pre-trained backbones, while full fine-tuning is used for the newly initialized parameters (the modality connector); this strategy proved more stable and more computationally efficient.
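As a sketch of that strategy (not the authors' actual training code), the peft library can apply LoRA to the backbone attention projections while keeping the connector fully trainable. It assumes a model already loaded with AutoModelForVision2Seq as in the examples above; the module names are assumptions and should be checked against model.named_modules() for the loaded checkpoint.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    # LoRA adapters on the attention projections of the pre-trained vision and language backbones
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Train the newly initialized modality connector in full (module name is an assumption)
    modules_to_save=["connector"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()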
More details (training procedure, data selection, hyperparameters, etc.) and lessons learned from ablations will be available in an upcoming technical report.
How to Get Started
Code snippets for generation with idefics2-8b-base and idefics2-8b are shown in the Usage Examples section above; the two differ only in how the inputs are formatted (raw <image> prompts for the base model, chat-formatted messages for the instruction-tuned checkpoints) and share the same imports, device setup, and example images.
📄 License
The model is released under the Apache 2.0 license. Three checkpoints are released:
- idefics2-8b-base: the base model
- idefics2-8b: the base model fine-tuned on a mixture of supervised and instruction datasets (text-only and multimodal)
- idefics2-8b-chatty: idefics2-8b further fine-tuned on long conversations
⚠️ Important Note
Idefics2 will NOT work with Transformers versions 4.41.0 through 4.43.3 (inclusive). See the issue https://github.com/huggingface/transformers/issues/32271 and the fix https://github.com/huggingface/transformers/pull/32275.
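As a quick sanity check, the installed version can be verified before loading the model (a small illustrative snippet, not part of the original note):
import transformers
from packaging import version

v = version.parse(transformers.__version__)
# Versions 4.41.0 through 4.43.3 (inclusive) are affected by the issue linked above
if version.parse("4.41.0") <= v <= version.parse("4.43.3"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is incompatible with Idefics2; "
        "install a version outside the 4.41.0-4.43.3 range"
    )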