Idefics-80B Open-Source Multimodal Model - Process Images and Text for Free and Generate Useful Text Outputs

Idefics 80b

Developed by HuggingFaceM4

IDEFICS-80B is an 80-billion-parameter multimodal model capable of processing both image and text inputs to generate text outputs. It is an open-source replication of DeepMind's Flamingo model.

Image-to-Text

Transformers

English

Open Source License: Other

#Multimodal Interaction #Image-Text Generation #Few-Shot Learning

Downloads 70

Release Time : 7/5/2023

Model Overview

IDEFICS is a multimodal model that accepts arbitrary sequences of images and text as input and generates text outputs. It can answer questions about images, describe visual content, create stories based on multiple images, or function as a pure language model.

Model Features

Multimodal Understanding

Capable of processing both image and text inputs and understanding the relationship between them.

Few-Shot Learning in Context

Demonstrates strong learning capabilities with minimal examples.

Open-Source Replication

Built entirely on publicly available data and models, replicating the functionality of the closed-source Flamingo model.

Model Capabilities

Visual Question Answering

Image Captioning

Multi-Image Story Creation

Pure Text Generation

Use Cases

Content Creation

Story Creation Based on Multiple Images

Generates coherent storylines based on multiple provided images.

Produces coherent and creative narrative content.

Visual Understanding

Image Question Answering

Answers open-ended questions about image content.

Accurately describes the content and details within images.

🚀 IDEFICS

IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) is an open-access multimodal model. It accepts sequences of image and text inputs and generates text outputs, similar to GPT-4. Built on publicly available data and models, it performs well on various image-text benchmarks.


How do I pronounce the model's name? Watch a YouTube tutorial.

IDEFICS is an open-access reproduction of Flamingo, a closed-source visual language model developed by DeepMind. Like GPT-4, this multimodal model accepts arbitrary sequences of image and text inputs and produces text outputs. IDEFICS is built solely on publicly available data and models.

The model can answer questions about images, describe visual contents, create stories grounded on multiple images, or simply behave as a pure language model without visual inputs.

IDEFICS is on par with the original closed-source model on various image-text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification when evaluated with in-context few-shot learning. It comes in two variants: a large 80-billion-parameter version and a 9-billion-parameter version.

We also fine-tune the base models on a mixture of supervised and instruction fine-tuning datasets, which boosts downstream performance while making the models more usable in conversational settings. The resulting models are idefics-80b-instruct and idefics-9b-instruct. As they reach higher performance, we recommend using these instructed versions first.

Learn more about some of the technical challenges we encountered while training IDEFICS here.

Try out the demo!

✨ Features

Multimodal Input: Accepts arbitrary sequences of image and text inputs and generates text outputs.
Strong Few-Shot Learning: Shows strong in-context few-shot learning capabilities and performs well on various image-text benchmarks.
Two Variants: Available in 80-billion and 9-billion parameter versions.
Instruction-Tuned: Instruction-fine-tuned models have better downstream performance and are more suitable for conversations.

📚 Documentation

Model Details

Property	Details
Developed by	Hugging Face
Model Type	Multimodal model (image+text)
Language(s) (NLP)	en
License	see License section
Parent Models	[laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)
Resources for more information	Description of OBELICS: "OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents"; Original paper: "Flamingo: a Visual Language Model for Few-Shot Learning"

IDEFICS is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs. The model shows strong in-context few-shot learning capabilities and is on par with the closed-source Flamingo model. This makes IDEFICS a robust starting point for fine-tuning multimodal models on custom data.

IDEFICS is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image-text pairs and unstructured multimodal web documents.

IDEFICS-instruct is the model obtained by further training IDEFICS on Supervised Fine-Tuning and Instruction Fine-Tuning datasets. This improves downstream performance significantly (making idefics-9b-instruct a very strong model at its 9-billion-parameter scale), while making the model better suited to conversation.

Uses

The model can be used to perform inference on multimodal (image + text) tasks in which the input is composed of a text query/instruction along with one or multiple images. This model does not support image generation.

It is possible to fine-tune the base model on custom data for a specific use case. We note that the instruction-fine-tuned models are significantly better at following instructions from users and thus should be preferred when using the models out of the box.

The following screenshot is an example of interaction with the instructed model:

[Screenshot: Guarding baguettes]

🚀 Quick Start

These resources showcase how to perform inference with IDEFICS (including 4-bit quantized inference) along with how to fine-tune the models. In particular, this Colab notebook shows how to fine-tune the 9-billion-parameter model on a single Google Colab GPU with LoRA and 4-bit quantization.
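
To give a feel for why 4-bit quantization matters here, the following back-of-the-envelope sketch estimates the memory needed for the weights alone. It assumes the nominal parameter counts (9 billion and 80 billion) and standard bytes-per-parameter figures; the function name and the dtype table are illustrative, not part of any library, and real usage will be higher once activations and quantization metadata are included.

```python
# Rough weight-only memory estimate. This ignores activations, the KV cache,
# and quantization overhead (scales, zero points), so real usage is higher.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "4bit": 0.5}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    # Nominal parameter counts taken from the model names.
    for name, params in [("idefics-9b", 9e9), ("idefics-80b", 80e9)]:
        for dtype in ("bfloat16", "4bit"):
            print(f"{name} {dtype}: ~{weight_memory_gb(params, dtype):.1f} GB")
```

Under these assumptions the 9-billion-parameter weights shrink from roughly 18 GB in bfloat16 to roughly 4.5 GB in 4-bit, which is what makes a single Colab GPU plausible.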

We provide quick-start code for both the base and the instruct models.

💻 Usage Examples

Basic Usage (Base Model)

import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "HuggingFaceM4/idefics-9b"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# We feed to the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
prompts = [
    [
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "In this picture from Asterix and Obelix, we can see"
    ],
]

# --batched mode
inputs = processor(prompts, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

# Generation args
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")

To quickly test your software without waiting for the huge model to download and load, you can use HuggingFaceM4/tiny-random-idefics. It hasn't been trained and has random weights, but it is very useful for quick testing.
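
One way to wire this into a project is to pick the checkpoint name from an environment flag, so that tests run against the tiny random model while normal runs use the real one. This is only a sketch: the variable name `IDEFICS_SMOKE_TEST` and the helper `pick_checkpoint` are hypothetical and not part of transformers.

```python
import os

FULL_CHECKPOINT = "HuggingFaceM4/idefics-9b"
TINY_CHECKPOINT = "HuggingFaceM4/tiny-random-idefics"  # random weights, for plumbing tests only


def pick_checkpoint(env=None) -> str:
    """Return the tiny checkpoint when the (hypothetical) smoke-test flag is set."""
    env = os.environ if env is None else env
    # IDEFICS_SMOKE_TEST is an assumed variable name chosen for this example.
    if env.get("IDEFICS_SMOKE_TEST") == "1":
        return TINY_CHECKPOINT
    return FULL_CHECKPOINT


if __name__ == "__main__":
    print(pick_checkpoint())
```

The returned string can be passed as `checkpoint` in the examples above. Outputs from the tiny model are meaningless, so it is only suitable for checking that the code path runs.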

Advanced Usage (Instruct Model)

import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# We feed to the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
prompts = [
    [
        "User: What is in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "<end_of_utterance>",

        "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",

        "\nUser:",
        "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
        "And who is that?<end_of_utterance>",

        "\nAssistant:",
    ],
]

# --batched mode
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], add_end_of_utterance_token=False, return_tensors="pt").to(device)

# Generation args
exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")

Text generation inference

The hosted inference API is powered by Text Generation Inference. To query the model, you can use the following code snippet. The key is to pass images as fetchable URLs with the markdown syntax:

from text_generation import Client

API_TOKEN = "<YOUR_API_TOKEN>"
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceM4/idefics-80b-instruct"
DECODING_STRATEGY = "Greedy"
QUERY = "User: What is in this image?![](https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG)<end_of_utterance>\nAssistant:"

client = Client(
    base_url=API_URL,
    headers={"x-use-cache": "0", "Authorization": f"Bearer {API_TOKEN}"},
)
generation_args = {
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "stop_sequences": ["<end_of_utterance>", "\nUser:"],
}

if DECODING_STRATEGY == "Greedy":
    generation_args["do_sample"] = False
elif DECODING_STRATEGY == "Top P Sampling":
    generation_args["temperature"] = 1.
    generation_args["do_sample"] = True
    generation_args["top_p"] = 0.95
    
generated_text = client.generate(prompt=QUERY, **generation_args)  
print(generated_text)

Note that we currently only host the inference for the instructed models.
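
For multi-turn conversations, the query string can be assembled programmatically. The sketch below builds a prompt in the same shape as `QUERY` above, with images written in markdown syntax and each turn closed by `<end_of_utterance>`. The helper names `image_md` and `build_query` are hypothetical, and the multi-turn layout is inferred from the transformers instruct example rather than from documentation of the hosted API.

```python
END = "<end_of_utterance>"


def image_md(url: str) -> str:
    """Wrap a fetchable image URL in markdown image syntax."""
    return f"![]({url})"


def build_query(turns) -> str:
    """Join (role, content) pairs and leave an open Assistant turn at the end."""
    lines = [f"{role}: {content}{END}" for role, content in turns]
    return "\n".join(lines) + "\nAssistant:"


if __name__ == "__main__":
    url = "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG"
    query = build_query([("User", "What is in this image?" + image_md(url))])
    print(query)
```

With a single user turn this reproduces the `QUERY` string used in the snippet above; earlier assistant replies can be passed as additional `("Assistant", text)` pairs.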

🔧 Technical Details

Training Details

IDEFICS

We closely follow the training procedure laid out in Flamingo. We combine two open-access pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.
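
In the Flamingo paper, the new cross-attention blocks are added to the frozen language model through a tanh gate whose learnable scalar starts at zero, so that at initialization the combined model behaves exactly like the original language model. The toy sketch below illustrates that gating idea on plain Python lists. It is a simplification based on the Flamingo description, and it is an assumption that IDEFICS matches it in every detail; the function name and the toy values are made up for the example.

```python
import math


def gated_residual(hidden, cross_attn_out, alpha):
    """Add a tanh-gated cross-attention output to the frozen LM's hidden state."""
    gate = math.tanh(alpha)  # alpha is the learnable gate scalar
    return [h + gate * c for h, c in zip(hidden, cross_attn_out)]


hidden = [0.5, -1.0, 2.0]    # toy hidden state from the frozen language model
visual = [10.0, 10.0, 10.0]  # toy output of a newly initialized cross-attention block

at_init = gated_residual(hidden, visual, alpha=0.0)  # gate closed: visual path has no effect
later = gated_residual(hidden, visual, alpha=1.0)    # gate partly open after training

if __name__ == "__main__":
    print(at_init)
    print(later)
```

Because tanh(0) is 0, `at_init` equals `hidden`, which is why training can start from the frozen backbones without disturbing their outputs.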

The model is trained on the following data mixture of openly accessible English data:

Data Source	Type of Data	Number of Tokens in Source	Number of Images in Source	Epochs	Effective Proportion in Number of Tokens
OBELICS	Unstructured Multimodal Web Documents	114.9B	353M	1	73.85%
Wikipedia	Unstructured Multimodal Web Documents	3.192B	39M	3	6.15%
[LAION](https://huggingface.co/datasets/laion/laion2B-en)	Image-Text Pairs	29.9B	1.120B	1	17.18%
PMD	Image-Text Pairs	1.6B	70M	3	2.82%
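
To make the "Effective Proportion" column concrete, the sketch below treats those percentages as sampling weights and draws data sources accordingly. This is not the authors' data loader, and drawing one source name per sample is a simplification, since the proportions in the table are measured in tokens. The names `MIXTURE` and `sample_sources` are illustrative.

```python
import random
from collections import Counter

# Effective token proportions (percent) copied from the table above.
MIXTURE = {"OBELICS": 73.85, "Wikipedia": 6.15, "LAION": 17.18, "PMD": 2.82}


def sample_sources(n: int, seed: int = 0):
    """Draw n data-source names with probability proportional to MIXTURE."""
    rng = random.Random(seed)
    names = list(MIXTURE)
    weights = [MIXTURE[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


counts = Counter(sample_sources(10_000))

if __name__ == "__main__":
    for name, count in counts.most_common():
        print(f"{name}: {count / 10_000:.1%}")
```

Over many draws the empirical frequencies approach the table's percentages, with OBELICS dominating the mixture.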

OBELICS is an open, massive and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens and 353M images. An interactive visualization of the dataset content is available [here](https://atlas.nomic.ai/map/f2fba2aa-3647-4f49-a0f3-9347daeee499/ee4a84bd-f125-4bcc-a683-1b4e231cb10f). We use Common Crawl dumps between February 2020 and February 2023.

Wikipedia. We used the English dump of Wikipedia created on February 20th, 2023.

LAION is a collection of image-text pairs gathered from Common Crawl web pages, with the texts taken from the alternative text of each image. We deduplicated it (following Webster et al., 2023), filtered it, and removed the opted-out images using the [Spawning API](https://api.spawning.ai/spawning-api).

PMD is a collection of publicly available image-text pair datasets. The dataset contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome and a subset of the YFCC100M dataset. Due to a server failure at the time of the pre-processing, we did not include SBU Captions.

For multimodal web documents, we feed the model sequences corresponding to the succession of text paragraphs and images. For image-text pairs, we form the training sequences by packing images w

📄 License

The license is of type "other". For more details, please refer to the relevant information in the model repository.
