IDEFICS
IDEFICS (Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS) is an open-access multimodal model that can handle arbitrary sequences of image and text inputs, producing text outputs. It's a great alternative to closed-source models, built on publicly available data.
How do I pronounce the model's name? Watch a YouTube tutorial
⨠Features
IDEFICS is an open-access reproduction of Flamingo, a closed-source visual language model developed by DeepMind. Like GPT-4, this multimodal model accepts arbitrary sequences of image and text inputs and generates text outputs. It is built solely on publicly available data and models.
- Multifunctional Usage: The model can answer questions about images, describe visual contents, create stories based on multiple images, or act as a pure language model without visual inputs.
- Strong Performance: It performs on par with the original closed-source model on various image-text benchmarks, including visual question answering (open-ended and multiple choice), image captioning, and image classification when evaluated with in-context few-shot learning.
- Two Variants: The model comes in two sizes: an 80-billion-parameter version and a 9-billion-parameter version.
- Instruction-Fine-Tuned Models: The base models are fine-tuned on a mixture of supervised and instruction fine-tuning datasets, resulting in idefics-80b-instruct and idefics-9b-instruct. These models have better downstream performance and are more suitable for conversational settings.
Documentation
Model Details
Property | Details |
---|---|
Developed by | Hugging Face |
Model Type | Multimodal model (image+text) |
Language(s) (NLP) | en |
License | see License section |
Parent Models | [laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b) |
Resources for more information | Description of OBELICS: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents; Original Paper: Flamingo: a Visual Language Model for Few-Shot Learning |
IDEFICS is a large multimodal English model that takes sequences of interleaved images and texts as inputs and generates text outputs. It shows strong in-context few-shot learning capabilities and is comparable to the closed-source model, making it a robust starting point for fine-tuning multimodal models on custom data.
IDEFICS is built on top of two unimodal open-access pre-trained models to connect the two modalities. Newly initialized parameters in the form of Transformer blocks bridge the gap between the vision encoder and the language model. The model is trained on a mixture of image-text pairs and unstructured multimodal web documents.
IDEFICS-instruct is obtained by further training IDEFICS on Supervised Fine-Tuning and Instruction Fine-Tuning datasets. This significantly improves downstream performance (making idefics-9b-instruct a very strong model at the 9-billion-parameter scale) and makes the model more suitable for conversations.
Uses
The model can be used to perform inference on multimodal (image + text) tasks where the input consists of a text query/instruction along with one or multiple images. Note that this model does not support image generation.
It is possible to fine-tune the base model on custom data for a specific use case. The instruction-fine-tuned models are much better at following user instructions, so they are preferred for out-of-the-box usage.
The following screenshot is an example of interaction with the instructed model:
Training Details
IDEFICS
We closely follow the training procedure of Flamingo. We combine two open-access pre-trained models ([laion/CLIP-ViT-H-14-laion2B-s32B-b79K](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and [huggyllama/llama-65b](https://huggingface.co/huggyllama/llama-65b)) by initializing new Transformer blocks. The pre-trained backbones are frozen while we train the newly initialized parameters.
The model is trained on the following data mixture of openly accessible English data:
Data Source | Type of Data | Number of Tokens in Source | Number of Images in Source | Epochs | Effective Proportion in Number of Tokens |
---|---|---|---|---|---|
OBELICS | Unstructured Multimodal Web Documents | 114.9B | 353M | 1 | 73.85% |
Wikipedia | Unstructured Multimodal Web Documents | 3.192B | 39M | 3 | 6.15% |
[LAION](https://huggingface.co/datasets/laion/laion2B-en) | Image-Text Pairs | 29.9B | 1.120B | 1 | 17.18% |
PMD | Image - Text Pairs | 1.6B | 70M | 3 | 2.82% |
OBELICS is an open, massive, and curated collection of interleaved image-text web documents, containing 141M documents, 115B text tokens, and 353M images. An interactive visualization of the dataset content is available [here](https://atlas.nomic.ai/map/f2fba2aa-3647-4f49-a0f3-9347daeee499/ee4a84bd-f125-4bcc-a683-1b4e231cb10f). We use Common Crawl dumps between February 2020 and February 2023.
Wikipedia: We used the English dump of Wikipedia created on February 20th, 2023.
LAION is a collection of image-text pairs gathered from Common Crawl web pages, with the text taken from each image's alternative text. We deduplicated it (following Webster et al., 2023), filtered it, and removed the opted-out images using the [Spawning API](https://api.spawning.ai/spawning-api).
PMD is a collection of publicly available image-text pair datasets. The dataset contains pairs from Conceptual Captions, Conceptual Captions 12M, WIT, Localized Narratives, RedCaps, COCO, SBU Captions, Visual Genome, and a subset of the YFCC100M dataset. Due to a server failure at the time of pre-processing, we did not include SBU Captions.
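As a rough illustration of what the "Effective Proportion" column in the table above implies, the sketch below checks that the four proportions sum to 100% and draws data sources at random with those weights. The sampling mechanism is an assumption made purely for illustration; the source does not describe how the mixture is implemented, and `sample_sources` is a hypothetical helper.

```python
import random
from collections import Counter

# Effective proportions (in % of tokens) copied from the data mixture table.
EFFECTIVE_PROPORTIONS = {
    "OBELICS": 73.85,
    "Wikipedia": 6.15,
    "LAION": 17.18,
    "PMD": 2.82,
}

# Sanity check: the four proportions should cover the whole mixture.
total = sum(EFFECTIVE_PROPORTIONS.values())
assert abs(total - 100.0) < 1e-6


def sample_sources(n: int, seed: int = 0) -> Counter:
    """Draw n data sources, weighted by their effective proportion.

    Hypothetical helper for illustration only; it is not the training code.
    """
    rng = random.Random(seed)
    names = list(EFFECTIVE_PROPORTIONS)
    weights = [EFFECTIVE_PROPORTIONS[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n))


if __name__ == "__main__":
    print(sample_sources(10_000))
```

With enough draws, the observed counts should approach the proportions listed in the table.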
Quick Start
These resources showcase how to perform inference with IDEFICS (including 4-bit quantized inference) and how to fine-tune the models. In particular, this colab notebook shows how to fine-tune the 9-billion-parameter model on a single Google Colab GPU with LoRA and 4-bit quantization.
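For the 4-bit quantized inference mentioned above, a minimal loading sketch might look like the following. It assumes the `bitsandbytes` package and a CUDA GPU are available. The helper name `load_idefics_4bit` and the chosen settings are illustrative and are not taken from the linked resources.

```python
def load_idefics_4bit(checkpoint: str = "HuggingFaceM4/idefics-9b"):
    """Load an IDEFICS checkpoint with 4-bit weights (sketch; needs a CUDA GPU)."""
    # Imported lazily so that defining this helper does not itself require
    # torch, transformers, or bitsandbytes to be installed.
    import torch
    from transformers import (
        AutoProcessor,
        BitsAndBytesConfig,
        IdeficsForVisionText2Text,
    )

    # Assumed settings: 4-bit weights with float16 compute.
    quantization_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.float16,
    )
    model = IdeficsForVisionText2Text.from_pretrained(
        checkpoint,
        quantization_config=quantization_config,
        device_map="auto",  # let accelerate place the quantized weights
    )
    processor = AutoProcessor.from_pretrained(checkpoint)
    return model, processor
```

A quantized model is placed on the GPU at load time, so it should not be moved again with `.to(device)`; only the processed inputs need to be sent to the device.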
Base Model
Use the code below to get started with the base model:
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "HuggingFaceM4/idefics-9b"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# We feed the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
prompts = [
    [
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "In this picture from Asterix and Obelix, we can see"
    ],
]

# --batched mode
inputs = processor(prompts, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

# Generation args: prevent the model from emitting image placeholder tokens
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
Instruct Model
To quickly test your software without waiting for the huge model to download and load, you can use HuggingFaceM4/tiny-random-idefics. It hasn't been trained and has random weights, but it is very useful for quick testing.
Use the following code to get started with the instruct model:
import torch
from transformers import IdeficsForVisionText2Text, AutoProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

checkpoint = "HuggingFaceM4/idefics-9b-instruct"
model = IdeficsForVisionText2Text.from_pretrained(checkpoint, torch_dtype=torch.bfloat16).to(device)
processor = AutoProcessor.from_pretrained(checkpoint)

# We feed the model an arbitrary sequence of text strings and images. Images can be either URLs or PIL Images.
prompts = [
    [
        "User: What is in this image?",
        "https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG",
        "<end_of_utterance>",
        "\nAssistant: This picture depicts Idefix, the dog of Obelix in Asterix and Obelix. Idefix is running on the ground.<end_of_utterance>",
        "\nUser:",
        "https://static.wikia.nocookie.net/asterix/images/2/25/R22b.gif/revision/latest?cb=20110815073052",
        "And who is that?<end_of_utterance>",
        "\nAssistant:",
    ],
]

# --batched mode
inputs = processor(prompts, add_end_of_utterance_token=False, return_tensors="pt").to(device)
# --single sample mode
# inputs = processor(prompts[0], return_tensors="pt").to(device)

# Generation args: stop at <end_of_utterance> and never emit image placeholder tokens
exit_condition = processor.tokenizer("<end_of_utterance>", add_special_tokens=False).input_ids
bad_words_ids = processor.tokenizer(["<image>", "<fake_token_around_image>"], add_special_tokens=False).input_ids

generated_ids = model.generate(**inputs, eos_token_id=exit_condition, bad_words_ids=bad_words_ids, max_length=100)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
for i, t in enumerate(generated_text):
    print(f"{i}:\n{t}\n")
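The instruct prompt above follows a regular pattern: each turn starts with a role label, may carry an image, and ends with `<end_of_utterance>`, and the list closes with an open `\nAssistant:` cue. The sketch below builds such a list from chat turns. `build_instruct_prompt` and the `Turn` tuple are hypothetical names introduced here, and the helper always places the text before the image, which is only one of the two orderings used in the example.

```python
from typing import List, Optional, Tuple

END_OF_UTTERANCE = "<end_of_utterance>"

# One chat turn: (role, text, optional image URL).
Turn = Tuple[str, str, Optional[str]]


def build_instruct_prompt(turns: List[Turn]) -> List[str]:
    """Flatten chat turns into an interleaved text/image prompt list.

    Hypothetical helper: it mirrors the "User:" / "Assistant:" layout of the
    example and ends with an open "\nAssistant:" cue for generation.
    """
    prompt: List[str] = []
    for index, (role, text, image_url) in enumerate(turns):
        prefix = "" if index == 0 else "\n"
        header = f"{prefix}{role}: {text}"
        if image_url is None:
            prompt.append(header + END_OF_UTTERANCE)
        else:
            # The image follows the text, then the utterance is closed.
            prompt.extend([header, image_url, END_OF_UTTERANCE])
    prompt.append("\nAssistant:")
    return prompt


if __name__ == "__main__":
    turns = [
        ("User", "What is in this image?", "https://example.com/a.jpg"),
        ("Assistant", "A small white dog.", None),
        ("User", "And who is that?", "https://example.com/b.gif"),
    ]
    print(build_instruct_prompt(turns))
```

Because the helper inserts `<end_of_utterance>` itself, the resulting list would be passed to the processor with `add_end_of_utterance_token=False`, as in the batched call above.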
Text generation inference
The hosted inference API is powered by [Text Generation Inference](https://github.com/huggingface/text-generation-inference). To query the model, you can use the following code snippet. The key is to pass images as fetchable URLs with the markdown syntax:
from text_generation import Client

API_TOKEN = "<YOUR_API_TOKEN>"
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceM4/idefics-80b-instruct"
DECODING_STRATEGY = "Greedy"
# The image is passed as a fetchable URL using markdown image syntax.
QUERY = "User: What is in this image?![](https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG)<end_of_utterance>\nAssistant:"

client = Client(
    base_url=API_URL,
    headers={"x-use-cache": "0", "Authorization": f"Bearer {API_TOKEN}"},
)
generation_args = {
    "max_new_tokens": 256,
    "repetition_penalty": 1.0,
    "stop_sequences": ["<end_of_utterance>", "\nUser:"],
}

if DECODING_STRATEGY == "Greedy":
    generation_args["do_sample"] = False
elif DECODING_STRATEGY == "Top P Sampling":
    generation_args["temperature"] = 1.
    generation_args["do_sample"] = True
    generation_args["top_p"] = 0.95

generated_text = client.generate(prompt=QUERY, **generation_args)
print(generated_text)
Note that we currently only host the inference for the instructed models.
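Since the query is a plain string, image URLs have to be embedded in it as text. The sketch below builds such a query, assuming that "the markdown syntax" refers to the standard markdown image form `![](url)`. `format_tgi_query` is a hypothetical helper, not part of the `text_generation` client.

```python
from typing import Sequence


def format_tgi_query(question: str, image_urls: Sequence[str] = ()) -> str:
    """Build a single-turn query string with images embedded as markdown.

    Hypothetical helper: assumes each image is written as ![](url) directly
    after the user's text, followed by the end-of-utterance marker.
    """
    images = "".join(f"![]({url})" for url in image_urls)
    return f"User: {question}{images}<end_of_utterance>\nAssistant:"


if __name__ == "__main__":
    query = format_tgi_query(
        "What is in this image?",
        ["https://upload.wikimedia.org/wikipedia/commons/8/86/Id%C3%A9fix.JPG"],
    )
    print(query)
```

The returned string can be passed as the `prompt` argument of `client.generate` in place of a hand-written query.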
License
The license for this model is listed as "other". For more details, please refer to the relevant information in the official repository.






