🚀 Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and generates text outputs. It can answer image-related questions, describe visual content, create stories based on multiple images, or function as a pure language model without visual inputs. It builds on Idefics1, significantly enhancing OCR, document understanding, and visual reasoning capabilities.
✨ Features
- Multimodal Capabilities: Handles both image and text inputs, enabling tasks like image captioning, visual question answering, and story creation from images.
- Enhanced Performance: Improves upon Idefics1, especially in OCR, document understanding, and visual reasoning.
- Multiple Checkpoints: Available as several checkpoints (idefics2-8b-base, idefics2-8b, idefics2-8b-chatty) for different use cases.
📦 Installation
Idefics2 is supported in the Transformers library; the examples below additionally use torch, Pillow, and requests. Avoid the incompatible Transformers versions listed in the Important Note at the end of this document.
💻 Usage Examples
Basic Usage
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
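# Target device for the model and inputs; change to "cpu" if no CUDA GPU is available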
DEVICE = "cuda:0"
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
For idefics2-8b-base:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b-base",
).to(DEVICE)
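# The base model is prompted with plain text in which <image> placeholders mark where each image goes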
prompts = [
"<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
"In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
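# Generate up to 500 new tokens and decode the resulting sequences back to text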
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
📚 Documentation
Model Summary
Uses
idefics2-8b-base and idefics2-8b can be used for inference on multimodal (image + text) tasks where the input consists of a text query and one or several images. Text and images can be interleaved arbitrarily. These tasks include image captioning, visual question answering, etc.; the models do not support image generation.
For optimal results, it is recommended to fine-tune idefics2-8b on one's specific use case and data. The instruction-fine-tuned model (idefics2-8b) is better at following user instructions and is preferred for out-of-the-box use or as a starting point for fine-tuning. idefics2-8b usually generates short answers; for long generations, use idefics2-8b-chatty, which is further fine-tuned on long conversations.
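For the instruction-tuned checkpoints (idefics2-8b and idefics2-8b-chatty), prompts are typically built with the processor's chat template rather than raw <image> strings. The following is a minimal sketch that reuses the imports, DEVICE, and image1 from the Usage Examples section above; the question text is illustrative.
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b").to(DEVICE)
# Each user turn interleaves {"type": "image"} placeholders with text content
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
generated_ids = model.generate(**inputs, max_new_tokens=500)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))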
Fine-tuning code is provided for a range of scenarios.
Technical summary
Idefics2 shows strong performance for its size (8B parameters) compared to other open multimodal models and is often competitive with closed-source systems. It serves as a solid foundation for use-case-specific fine-tuning.
The table below compares Idefics2 against other open and closed multimodal models.
| Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-VL | ✅ | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | - | 49.6 |
| LLaVa-NeXT-Mistral-7B | ✅ | 7B | 2880 | 35.3/- | 37.7 | 65.7 | 68.7 | 82.2 | - |
| LLaVa-NeXT-13B | ✅ | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | 82.8 | - |
| LLaVa-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | - |
| MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | - | - |
| MM1-Chat-30B | ❌ | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | 83.7 | - |
| Gemini 1.0 Pro | ❌ | unknown | unknown | 47.9/- | 45.2 | 74.6 | - | 71.2 | 88.1 |
| Gemini 1.5 Pro | ❌ | unknown | unknown | 58.5/- | 52.1 | 73.5 | - | 73.2 | 86.5 |
| Claude 3 Haiku | ❌ | unknown | unknown | 50.2/- | 46.4 | - | - | - | 88.8 |
| Idefics1 instruct (32-shots) | ✅ | 80B | - | - | - | 39.3 | - | 68.8 | - |
| Idefics2 (w/o im. split) | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
| Idefics2 (w/ im. split) | ✅ | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 81.2 | 74.0 |
Idefics2 introduces several carefully ablated improvements over Idefics1:
- It manipulates images in their native resolutions (up to 980 x 980) and native aspect ratios, following the NaViT strategy, which avoids resizing images to fixed-size squares. It also optionally allows sub-image splitting and passing images of very large resolution, following the SPHINX strategy.
- It significantly enhances OCR abilities by integrating data for transcribing text in images and documents, and improves question answering on charts, figures, and documents with appropriate training data.
- It simplifies the integration of visual features into the language backbone by departing from Idefics1's gated cross-attention architecture: images are fed to the vision encoder, followed by a learned Perceiver pooling and an MLP modality projection, and the pooled sequence is concatenated with the text embeddings (an illustrative sketch of this connector follows the list).
- These improvements, together with better pre-trained backbones, yield a significant performance boost over Idefics1 with a model that is 10x smaller.
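The connector described above can be pictured with the following illustrative pseudocode; module names and dimensions are assumptions chosen for exposition, not the actual Transformers implementation.
import torch
import torch.nn as nn

class ToyIdefics2Connector(nn.Module):
    """Schematic: Perceiver-style pooling followed by an MLP modality projection."""
    def __init__(self, vision_dim, text_dim, num_latents=64, num_heads=8):
        super().__init__()
        # A fixed set of learned latent queries cross-attends to the vision-encoder outputs,
        # pooling each image into num_latents visual tokens (64 without image splitting).
        self.latents = nn.Parameter(torch.randn(num_latents, vision_dim))
        self.cross_attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
        # MLP projection from the vision width into the language model's embedding space.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, vision_hidden_states, text_embeds):
        batch = vision_hidden_states.size(0)
        queries = self.latents.unsqueeze(0).expand(batch, -1, -1)
        pooled, _ = self.cross_attn(queries, vision_hidden_states, vision_hidden_states)
        image_embeds = self.proj(pooled)
        # The pooled, projected image tokens are concatenated with the text embeddings
        # before entering the language backbone (placement here is schematic).
        return torch.cat([image_embeds, text_embeds], dim=1)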
Idefics2 is trained in two stages for maximum efficiency. In the first stage, images are fed at SigLIP's native resolution (384 x 384 squares). In the second stage, images are fed at their native resolution (maximum 980, minimum 378) and native aspect ratio. For OCR data, PDFA, Rendered-Text, and IDL are added to OBELICS, LAION Coco, and PMD during the second stage.
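Resolution handling and sub-image splitting are controlled through the image processor. The sketch below assumes the do_image_splitting and size options of the Idefics2 image processor; verify the exact option names against the Transformers version you have installed.
# Trade a little accuracy for speed and memory: disable sub-image splitting and lower
# the maximum resolution (defaults allow up to 980 pixels on the longest edge).
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
    size={"longest_edge": 448, "shortest_edge": 378},
)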
Instruction fine-tuning is performed on The Cauldron, a collection of 50 manually curated vision-language datasets, along with 9 text-only instruction fine-tuning datasets.
LoRA is used to train the parameters initialized from the pre-trained backbones, while full fine-tuning is used for the newly initialized parameters (the modality connector); this strategy proved more stable and more computationally efficient.
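As a sketch of that strategy (not the authors' actual training code), the peft library can apply LoRA to the backbone attention projections while keeping the connector fully trainable. It assumes a model already loaded with AutoModelForVision2Seq as in the examples above; the module names are assumptions and should be checked against model.named_modules() for the loaded checkpoint.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    # LoRA adapters on the attention projections of the pre-trained vision and language backbones
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Train the newly initialized modality connector in full (module name is an assumption)
    modules_to_save=["connector"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()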
More details (training procedure, data selection, hyperparameters, etc.) and lessons learned from ablations will be available in an upcoming technical report.
How to Get Started
Code snippets for generation with idefics2-8b-base and idefics2-8b are shown in the Usage Examples section above; the two differ only in how the inputs are formatted (raw <image> prompts for the base model, chat-formatted messages for the instruction-tuned checkpoints) and share the same imports, device setup, and example images.
📄 License
The model is released under the Apache 2.0 license. Three checkpoints are released:
- idefics2-8b-base: the base model
- idefics2-8b: the base model fine-tuned on a mixture of supervised and instruction datasets (text-only and multimodal)
- idefics2-8b-chatty: idefics2-8b further fine-tuned on long conversations
⚠️ Important Note
Idefics2 will NOT work with Transformers versions 4.41.0 through 4.43.3 (inclusive). See the issue https://github.com/huggingface/transformers/issues/32271 and the fix https://github.com/huggingface/transformers/pull/32275.
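As a quick sanity check, the installed version can be verified before loading the model (a small illustrative snippet, not part of the original note):
import transformers
from packaging import version

v = version.parse(transformers.__version__)
# Versions 4.41.0 through 4.43.3 (inclusive) are affected by the issue linked above
if version.parse("4.41.0") <= v <= version.parse("4.43.3"):
    raise RuntimeError(
        f"transformers {transformers.__version__} is incompatible with Idefics2; "
        "install a version outside the 4.41.0-4.43.3 range"
    )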