🚀 Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. It can answer questions about images, describe visual content, create stories grounded in multiple images, or simply behave as a pure language model without visual inputs. Compared with Idefics1, it is significantly stronger at optical character recognition (OCR), document understanding, and visual reasoning.
🚀 Quick Start
Environment Setup
Before using Idefics2, install the required libraries:
pip install transformers requests torch pillow
Code Examples
Below are text-generation examples for `idefics2-8b-base` and `idefics2-8b`:
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda:0"
# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
Example with `idefics2-8b-base`:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b-base",
).to(DEVICE)
# Create inputs
prompts = [
"<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
"In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of Chicago, and more specifically the skyscrapers of the city.', 'In which city is that bridge located? The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and']
Example with `idefics2-8b` and `idefics2-8b-chatty`:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
).to(DEVICE)
# Create inputs
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What do we see in this image?"},
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
]
},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "And how about this image?"},
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']
Text Generation Inference
Idefics2 is integrated into Text Generation Inference (TGI), and API endpoints are available for `idefics2-8b` and `idefics2-8b-chatty`.
from text_generation import Client
API_TOKEN="<YOUR_API_TOKEN>"
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceM4/idefics2-8b-chatty"
# System prompt used in the playground for `idefics2-8b-chatty`
SYSTEM_PROMPT = "System: The following is a conversation between Idefics2, a highly knowledgeable and intelligent visual AI assistant created by Hugging Face, referred to as Assistant, and a human user called User. In the following interactions, User and Assistant will converse in natural language, and Assistant will do its best to answer User’s questions. Assistant has the ability to perceive images and reason about them, but it cannot generate images. Assistant was built to be respectful, polite and inclusive. It knows a lot, and always tells the truth. When prompted with an image, it does not make up facts.<end_of_utterance>\nAssistant: Hello, I'm Idefics2, Huggingface's latest multimodal assistant. How can I help you?<end_of_utterance>\n"
QUERY = "User:Describe this image.<end_of_utterance>\nAssistant:"
client = Client(
base_url=API_URL,
headers={"x-use-cache": "0", "Authorization": f"Bearer {API_TOKEN}"},
)
generation_args = {
"max_new_tokens": 512,
"repetition_penalty": 1.1,
"do_sample": False,
}
generated_text = client.generate(prompt=SYSTEM_PROMPT + QUERY, **generation_args)
print(generated_text)
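If you prefer to stream tokens as they are generated, the same `text_generation` client also provides `generate_stream`. The sketch below reuses the `client`, `SYSTEM_PROMPT`, `QUERY`, and `generation_args` defined above; it is an illustrative addition rather than part of the official example:
# Stream the reply token by token instead of waiting for the full generation
for response in client.generate_stream(prompt=SYSTEM_PROMPT + QUERY, **generation_args):
    if not response.token.special:  # skip special tokens such as <end_of_utterance>
        print(response.token.text, end="")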
✨ Key Features
- Multimodal processing: handles arbitrary combinations of image and text inputs, supporting tasks such as image captioning and visual question answering.
- High-resolution image support: processes images at resolutions up to 980x980 without resizing them to fixed-size squares.
- Stronger OCR: integrating OCR-related data significantly improves the ability to recognize and transcribe text in images and documents.
- Simplified visual-feature integration: a new architecture simplifies how visual features are integrated into the language model.
- Multi-stage training: a two-stage training procedure improves the model's efficiency and performance.
📦 Installation
No additional installation steps are documented; follow the environment setup in the Quick Start section above.
💻 Usage Examples
Basic Usage
The code examples in the Quick Start section above show how to generate text with Idefics2 using both `idefics2-8b-base` and `idefics2-8b`.
Advanced Usage
If you want to fine-tune the model, the following resources may help (a minimal LoRA setup sketch follows the list):
- Fine-tuning script based on the TRL library: Script
- Tutorial notebook for fine-tuning with the Hugging Face Trainer: Tutorial notebook
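For parameter-efficient fine-tuning, a common approach is to wrap the model with LoRA adapters via the peft library before handing it to TRL or the Trainer. The sketch below is an assumed minimal setup, not the official recipe; in particular, the `target_modules` names assume the Mistral-style attention projections of Idefics2's language backbone and the hyperparameters are illustrative only:
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
)

# Hypothetical LoRA hyperparameters, for illustration only
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # Assumption: Mistral-style attention projection layers in the language backbone
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
# The wrapped model can then be passed to the TRL script or Trainer notebook linked above.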
📚 Detailed Documentation
Model Overview
Attribute | Details |
---|---|
Developed by | Hugging Face |
Model type | Multimodal model (image + text) |
Language | English |
License | Apache 2.0 |
Parent models | google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1 |
Resources for more information | Description of OBELICS: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents; Paper: What matters when building vision-language models? |
Model Uses
`idefics2-8b-base` and `idefics2-8b` can be used for inference on multimodal (image + text) tasks such as image captioning and visual question answering. For a specific use case and data, fine-tuning `idefics2-8b` is recommended for best results. `idefics2-8b-chatty` is further fine-tuned for long conversations.
Technical Details
For its size (8 billion parameters), Idefics2 shows strong performance compared with other open multimodal models and is often competitive with closed-source systems. It therefore provides a solid foundation for fine-tuning on a wide range of specific use cases.
The results table below provides more details.
Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
---|---|---|---|---|---|---|---|---|---|
DeepSeek-VL | ✅ | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | - | 49.6 |
LLaVa-NeXT-Mistral-7B | ✅ | 7B | 2880 | 35.3/- | 37.7 | 65.7 | 68.7 | 82.2 | - |
LLaVa-NeXT-13B | ✅ | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | 82.8 | - |
LLaVa-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | - |
MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | - | - |
MM1-Chat-30B | ❌ | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | 83.7 | |
Gemini 1.0 Pro | ❌ | 🤷♂️ | 🤷♂️ | 47.9/- | 45.2 | 74.6 | - | 71.2 | 88.1 |
Gemini 1.5 Pro | ❌ | 🤷♂️ | 🤷♂️ | 58.5/- | 52.1 | 73.5 | - | 73.2 | 86.5 |
Claude 3 Haiku | ❌ | 🤷♂️ | 🤷♂️ | 50.2/- | 46.4 | - | - | - | 88.8 |
Idefics1 instruct (32-shots) | ✅ | 80B | - | - | - | 39.3 | - | 68.8 | - |
Idefics2 (w/o image splitting) | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
Idefics2 (w/ image splitting) | ✅ | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 81.2 | 74.0 |
Idefics2 introduces several improvements over Idefics1:
- High-resolution image handling: following the NaViT strategy, images are processed at their native resolution and aspect ratio, avoiding the usual resizing step. Following SPHINX, the model can also split images into sub-images to handle very high resolutions.
- Stronger OCR: integrating OCR-related data significantly improves the ability to recognize and transcribe text in images and documents, and the model answers questions about charts, figures, and documents better.
- Simplified visual-feature integration: a new architecture simplifies how visual features are fed into the language model, making the model more efficient.
- Better performance: Idefics2 significantly outperforms Idefics1 while being 10x smaller.
Training Procedure
Idefics2 is trained in two stages:
- Stage 1: images are fed to the model at SigLIP's native resolution (384x384).
- Stage 2: images are fed at their native resolution (up to 980, at least 378) and native aspectT ratio, and PDFA, Rendered-Text, and IDL data are added.
The model is then instruction-tuned on The Cauldron together with 9 text-only instruction fine-tuning datasets.
Model Optimizations
Half-Precision Loading
If your GPU supports it, we recommend loading and running the model in half precision (`torch.float16` or `torch.bfloat16`):
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
+ torch_dtype=torch.float16,
).to(DEVICE)
Vision Encoder Efficiency
If your GPU memory is limited, you can:
- Disable image splitting by passing `do_image_splitting=False` when initializing the processor:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
- Lower the maximum image resolution by passing `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", size= {"longest_edge": 448, "shortest_edge": 378})
Speeding Up Generation with Flash Attention 2
First, make sure the `flash-attn` library is installed. Then pass `_attn_implementation="flash_attention_2"` when loading the model:
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
+ torch_dtype=torch.float16,
+ _attn_implementation="flash_attention_2",
).to(DEVICE)
4-Bit Quantization
The model can be quantized to 4 bits with AWQ or `bitsandbytes`; see the code examples in the documentation for details. A bitsandbytes sketch is shown below.
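As a reference, here is a minimal sketch of 4-bit loading with `bitsandbytes`; the quantization settings (nf4, double quantization, fp16 compute dtype) are illustrative choices rather than values recommended in this card, and the `bitsandbytes` and `accelerate` packages must be installed:
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",  # quantized weights are placed automatically; do not call .to(DEVICE)
)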
Optimization Comparison
Flash Attention 2 | Image splitting | Float type | 4-bit quantization | Peak GPU memory (GB) | Time for 20 generations (s) |
---|---|---|---|---|---|
No | Yes | fp32 | No | 54.9 | 55.6 |
No | Yes | bf16 | No | 41.3 | 34.3 |
No | Yes | fp16 | No | 36.7 | 33.3 |
Yes | Yes | fp16 | No | 21.0 | 13.3 |
Yes | Yes | fp16 | bitsandbytes (entire model) | 8.9 | 19.9 |
No | Yes | fp16 | bitsandbytes (entire model) | 24.7 | 40.4 |
No | Yes | fp16 | AWQ (LLM only) | 26.4 | 37.1 |
Yes | Yes | fp16 | AWQ (LLM only) | 10.7 | 16.3 |
No | Yes | fp16 | AWQ + fusing (LLM only) | 26.0 | 38.4 |
No | No | fp32 | No | 38.8 | 17.5 |
No | No | bf16 | No | 22.2 | 14.4 |
No | No | fp16 | No | 21.3 | 13.9 |
Yes | No | fp16 | No | 18.1 | 10.4 |
Yes | No | fp16 | bitsandbytes (entire model) | 6.0 | 17.3 |
No | No | fp16 | bitsandbytes (entire model) | 9.2 | 20.9 |
No | No | fp16 | AWQ (LLM only) | 10.9 | 15.9 |
Yes | No | fp16 | AWQ (LLM only) | 7.8 | 12.3 |
No | No | fp16 | AWQ + fusing (LLM only) | 10.5 | 19.5 |
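Putting these options together, one of the lower-memory configurations from the table (Flash Attention 2, fp16, image splitting disabled) simply combines the snippets shown earlier:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    _attn_implementation="flash_attention_2",  # requires the flash-attn library
).to(DEVICE)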
🔧 Technical Details
Model Architecture
Idefics2 is built on two pre-trained models, google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1, with a new architecture that simplifies the integration of visual features into the language model.
Training Data
Idefics2 was trained on the following datasets:
- HuggingFaceM4/OBELICS
- laion/laion-coco
- wikipedia
- facebook/pmd
- pixparse/idl-wds
- pixparse/pdfa-eng-wds
- wendlerc/RenderedText
- HuggingFaceM4/the_cauldron
- teknium/OpenHermes-2.5
- GAIR/lima
- databricks/databricks-dolly-15k
- meta-math/MetaMathQA
- TIGER-Lab/MathInstruct
- microsoft/orca-math-word-problems-200k
- camel-ai/math
- AtlasUnified/atlas-math-sets
- tiedong/goat
- Lin-Chen/ShareGPT4V
- jxu124/llava_conversation_58k
Training Procedure
Idefics2 is trained in two stages; see the detailed documentation section above for the full procedure.
📄 License
Idefics2 is released under the Apache 2.0 license; its two pre-trained backbone models, google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1, use the same license.
⚠️ Important Notes
- Like all large multimodal models, Idefics2 may exhibit biases and limitations; avoid using it in high-risk scenarios.
💡 Usage Tips
- For best results, fine-tune `idefics2-8b` on your specific use case and data.
- If your GPU supports it, load and run the model in half precision for better efficiency.
📖 Citation
If you use Idefics2, please cite:
@misc{laurencon2023obelics,
title={OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
author={Hugo Laurençon and Lucile Saulnier and Léo Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M. Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
year={2023},
eprint={2306.16527},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
@misc{laurençon2024matters,
title={What matters when building vision-language models?},
author={Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh},
year={2024},
eprint={2405.02246},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
🙏 Acknowledgements
Thanks to @yjernite, @sasha, @meg, @giadap, @jack-kumar, and @frimelle for their help with red-teaming the model.