🌍 Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. It can answer questions about images, describe visual content, create stories based on multiple images, or act as a pure language model without visual inputs. It significantly improves upon Idefics1, enhancing capabilities in OCR, document understanding, and visual reasoning.
🚀 Quick Start
This section shows code snippets for generation with `idefics2-8b-base` and `idefics2-8b`. The two snippets differ only in how the inputs are formatted. First, let's define some common imports and inputs.
```python
import requests
import torch
from PIL import Image
from io import BytesIO

from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda:0"

image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
```
Basic Usage
For `idefics2-8b-base`
```python
# Load the processor and the base model
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b-base",
).to(DEVICE)

# The base model uses raw prompts with <image> placeholders marking image positions
prompts = [
    "<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
    "In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

# Generate and decode
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```
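For `idefics2-8b`

The instruction-tuned checkpoint expects chat-formatted inputs built with the processor's chat template instead of raw `<image>` prompts. The snippet below is a sketch that reuses the imports, `DEVICE`, and images defined above; the question text is illustrative.

```python
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
).to(DEVICE)

# Chat-style messages: {"type": "image"} marks where each image is inserted
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What do we see in this image?"},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
```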
✨ Features
- Accepts arbitrary sequences of image and text inputs and produces text outputs.
- Can answer questions about images, describe visual content, create stories based on multiple images, or act as a pure language model without visual inputs.
- Significantly improves upon Idefics1, enhancing capabilities in OCR, document understanding, and visual reasoning.
📦 Installation
Idefics2 is available in Transformers starting from the 4.40.0 release. Upgrade with `pip install transformers --upgrade`, and avoid versions 4.41.0 through 4.43.3, which are incompatible (see the Important Note below).
📚 Documentation
Model Summary
Uses
`idefics2-8b-base` and `idefics2-8b` can be used for inference on multimodal (image + text) tasks where the input consists of a text query and one (or multiple) images. Text and images can be interleaved arbitrarily. This includes image captioning, visual question answering, etc. These models do not support image generation.
For optimal results, it is recommended to fine-tune `idefics2-8b` on specific use cases and data. The instruction-fine-tuned model (`idefics2-8b`) is better at following user instructions and should be preferred for out-of-the-box use or as a starting point for fine-tuning.
`idefics2-8b` usually generates very short answers. For long generations, use `idefics2-8b-chatty`, which was further fine-tuned on long conversations.
As a starting point, fine-tuning scripts are provided that can be adapted to specific scenarios.
Technical summary
Idefics2 shows strong performance for a model of its size (8B parameters) compared to other open multimodal models and is often competitive with closed-source systems. It serves as a strong foundation for various use-case specific fine-tunings.
The table below gives detailed benchmark results.
| Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-VL | ✅ | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | - | 49.6 |
| LLaVa-NeXT-Mistral-7B | ✅ | 7B | 2880 | 35.3/- | 37.7 | 65.7 | 68.7 | 82.2 | - |
| LLaVa-NeXT-13B | ✅ | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | 82.8 | - |
| LLaVa-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | - |
| MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | - | - |
| MM1-Chat-30B | ❌ | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | 83.7 | |
| Gemini 1.0 Pro | ❌ | 🤷‍♂️ | 🤷‍♂️ | 47.9/- | 45.2 | 74.6 | - | 71.2 | 88.1 |
| Gemini 1.5 Pro | ❌ | 🤷‍♂️ | 🤷‍♂️ | 58.5/- | 52.1 | 73.5 | - | 73.2 | 86.5 |
| Claude 3 Haiku | ❌ | 🤷‍♂️ | 🤷‍♂️ | 50.2/- | 46.4 | - | - | - | 88.8 |
| Idefics1 instruct (32-shots) | ✅ | 80B | - | - | - | 39.3 | - | 68.8 | - |
| Idefics2 (w/o im. split) | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
| Idefics2 (w/ im. split) | ✅ | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 81.2 | 74.0 |
Idefics2 introduces several carefully ablated improvements over Idefics1:
- Manipulates images in their native resolutions (up to 980 x 980) and native aspect ratios by following the NaViT strategy, avoiding the need to resize images to fixed-size squares. Optionally, it allows sub-image splitting and passing images of very large resolution, following the SPHINX strategy (a processor configuration sketch follows this list).
- Significantly enhances OCR abilities by integrating data that requires the model to transcribe text in an image or a document. Also improves abilities in answering questions on charts, figures, and documents with appropriate training data.
- Departs from Idefics1's architecture (gated cross-attentions) and simplifies the integration of visual features into the language backbone. Images are fed to the vision encoder, followed by a learned Perceiver pooling and an MLP modality projection. The pooled sequence is then concatenated with the text embeddings to obtain an (interleaved) sequence of image(s) and text(s).
- All these improvements, along with better pre-trained backbones, result in a significant performance jump over Idefics1 for a model that is 10x smaller.
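To illustrate the resolution and image-splitting controls mentioned above, the sketch below uses processor options exposed by the Idefics2 processor; the specific values are examples for trading accuracy against memory, not recommended settings from this card.

```python
from transformers import AutoProcessor

# Sketch: control sub-image splitting and image resolution at inference time.
# With do_image_splitting=False each image costs 64 visual tokens instead of 320.
processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=False,
    # Optionally cap the longest edge (the supported range is roughly 378-980 pixels)
    size={"longest_edge": 448, "shortest_edge": 378},
)
```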
Idefics2 is trained in 2 stages for maximum efficiency. In the first stage, images are fed to the model at SigLIP's native resolution (squares of 384 x 384). In the second stage, images are fed to the model at their native resolution (with a maximum of 980 and a minimum of 378) and native aspect ratio. Since high resolution is necessary for OCR data, PDFA, Rendered-Text, and IDL are added to OBELICS, LAION Coco, and PMD during the second stage.
Following this, instruction fine-tuning is performed on The Cauldron, a collection of 50 manually curated vision-language datasets, along with 9 text-only instruction fine-tuning datasets.
LoRA is used to train the parameters initialized from pre-trained backbones, while full fine-tuning is used for newly initialized parameters (the modality connector), as this strategy was found to be more stable and computationally efficient.
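A minimal sketch of this strategy with the `peft` library is shown below; the target module names and the `connector` identifier are assumptions about the model layout for illustration, not the exact training configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("HuggingFaceM4/idefics2-8b")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # LoRA adapters on the pre-trained backbone projections (assumed module names)
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    # Fully train the newly initialized modality connector (assumed module name)
    modules_to_save=["connector"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```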
More details (training procedure, data selection, hyper-parameters, etc.) along with lessons learned from ablations will be available in an upcoming technical report.
🔧 Technical Details
Idefics2 is trained in two stages. In the first stage, images are fed at SigLIP's native resolution (384x384 squares). In the second stage, images are fed at their native resolution (max 980, min 378) and aspect ratio. High-resolution data like PDFA, Rendered-Text, and IDL are added to OBELICS, LAION Coco, and PMD for OCR training.
Instruction fine-tuning is done on The Cauldron and 9 text-only datasets. LoRA is used for pre-trained backbone parameters, and full fine-tuning for newly initialized parameters.
📄 License
The model is released under the Apache 2.0 license. We release three checkpoints:
- idefics2-8b-base: the base model
- idefics2-8b: the base model fine-tuned on a mixture of supervised and instruction datasets (text-only and multimodal datasets)
- idefics2-8b-chatty: idefics2-8b further fine-tuned on long conversations
⚠️ Important Note
Idefics2 will NOT work with `transformers` versions 4.41.0 through 4.43.3 (inclusive). See the issue https://github.com/huggingface/transformers/issues/32271 and the fix https://github.com/huggingface/transformers/pull/32275.
💡 Usage Tip
As of April 18th, 2024, Idefics2 is part of the 4.40.0 Transformers PyPI release. Please upgrade your Transformers version (`pip install transformers --upgrade`).
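As a convenience (not part of the original card), a quick runtime check for a compatible version:

```python
import transformers
from packaging import version

v = version.parse(transformers.__version__)
# Idefics2 requires >= 4.40.0 and does not work on 4.41.0 through 4.43.3
assert v >= version.parse("4.40.0"), "Upgrade: pip install transformers --upgrade"
assert not (version.parse("4.41.0") <= v <= version.parse("4.43.3")), \
    "This transformers version is incompatible with Idefics2"
print(f"transformers {transformers.__version__} looks compatible with Idefics2")
```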