🚀 Idefics2
Idefics2 is an open multimodal model that accepts arbitrary sequences of image and text inputs and produces text outputs. It can answer questions about images, describe visual content, create stories grounded in multiple images, or simply behave as a pure language model without visual inputs. Compared with Idefics1, it is significantly stronger at optical character recognition (OCR), document understanding, and visual reasoning.
🚀 Quick Start
Environment Setup
Before using Idefics2, install the required libraries:
pip install transformers requests torch pillow
Code Examples
Below are text-generation examples for `idefics2-8b-base` and `idefics2-8b`:
import requests
import torch
from PIL import Image
from io import BytesIO
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
DEVICE = "cuda:0"
# Note that passing the image urls (instead of the actual pil images) to the processor is also possible
image1 = load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg")
image2 = load_image("https://cdn.britannica.com/59/94459-050-DBA42467/Skyline-Chicago.jpg")
image3 = load_image("https://cdn.britannica.com/68/170868-050-8DDE8263/Golden-Gate-Bridge-San-Francisco.jpg")
Example with `idefics2-8b-base`:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b-base")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b-base",
).to(DEVICE)
# Create inputs
prompts = [
"<image>In this image, we can see the city of New York, and more specifically the Statue of Liberty.<image>In this image,",
"In which city is that bridge located?<image>",
]
images = [[image1, image2], [image3]]
inputs = processor(text=prompts, images=images, padding=True, return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['In this image, we can see the city of New York, and more specifically the Statue of Liberty. In this image, we can see the city of Chicago, and more specifically the skyscrapers of the city.', 'In which city is that bridge located? The Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and the United States. It has been declared one of the Wonders of the Modern World by the American Society of Civil Engineers.\n\nThe Golden Gate Bridge is a suspension bridge spanning the Golden Gate, the one-mile-wide (1.6 km) strait connecting San Francisco Bay and the Pacific Ocean. The structure links the American city of San Francisco, California — the northern tip of the San Francisco Peninsula — to Marin County, carrying both U.S. Route 101 and California State Route 1 across the strait. The bridge is one of the most internationally recognized symbols of San Francisco, California, and']
Example with `idefics2-8b` and `idefics2-8b-chatty`:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
).to(DEVICE)
# Create inputs
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "What do we see in this image?"},
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "In this image, we can see the city of New York, and more specifically the Statue of Liberty."},
]
},
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "And how about this image?"},
]
},
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image1, image2], return_tensors="pt")
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}
# Generate
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts)
# ['User: What do we see in this image? \nAssistant: In this image, we can see the city of New York, and more specifically the Statue of Liberty. \nUser: And how about this image? \nAssistant: In this image we can see buildings, trees, lights, water and sky.']
Text Generation Inference
Idefics2 is integrated into Text Generation Inference (TGI), and API endpoints are available for `idefics2-8b` and `idefics2-8b-chatty`.
from text_generation import Client
API_TOKEN="<YOUR_API_TOKEN>"
API_URL = "https://api-inference.huggingface.co/models/HuggingFaceM4/idefics2-8b-chatty"
# System prompt used in the playground for `idefics2-8b-chatty`
SYSTEM_PROMPT = "System: The following is a conversation between Idefics2, a highly knowledgeable and intelligent visual AI assistant created by Hugging Face, referred to as Assistant, and a human user called User. In the following interactions, User and Assistant will converse in natural language, and Assistant will do its best to answer User’s questions. Assistant has the ability to perceive images and reason about them, but it cannot generate images. Assistant was built to be respectful, polite and inclusive. It knows a lot, and always tells the truth. When prompted with an image, it does not make up facts.<end_of_utterance>\nAssistant: Hello, I'm Idefics2, Huggingface's latest multimodal assistant. How can I help you?<end_of_utterance>\n"
QUERY = "User:Describe this image.<end_of_utterance>\nAssistant:"
client = Client(
base_url=API_URL,
headers={"x-use-cache": "0", "Authorization": f"Bearer {API_TOKEN}"},
)
generation_args = {
"max_new_tokens": 512,
"repetition_penalty": 1.1,
"do_sample": False,
}
generated_text = client.generate(prompt=SYSTEM_PROMPT + QUERY, **generation_args)
print(generated_text)
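If you prefer to stream tokens as they are generated, the same `text_generation` client also provides `generate_stream`. The sketch below reuses the `client`, `SYSTEM_PROMPT`, `QUERY`, and `generation_args` defined above; it is an illustrative addition rather than part of the official example:
# Stream the reply token by token instead of waiting for the full generation
for response in client.generate_stream(prompt=SYSTEM_PROMPT + QUERY, **generation_args):
    if not response.token.special:  # skip special tokens such as <end_of_utterance>
        print(response.token.text, end="")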
✨ Key Features
- Multimodal processing: handles arbitrary combinations of image and text inputs, supporting tasks such as image captioning and visual question answering.
- High-resolution image support: processes images at resolutions up to 980x980 without resizing them to fixed-size squares.
- Stronger OCR: integrating OCR-related data significantly improves the ability to recognize and transcribe text in images and documents.
- Simplified visual-feature integration: a new architecture simplifies how visual features are integrated into the language model.
- Multi-stage training: a two-stage training procedure improves the model's efficiency and performance.
📦 Installation
No additional installation steps are documented; follow the environment setup in the Quick Start section above.
💻 Usage Examples
Basic Usage
The code examples in the Quick Start section above show how to generate text with Idefics2 using both `idefics2-8b-base` and `idefics2-8b`.
Advanced Usage
If you want to fine-tune the model, the following resources may help (a minimal LoRA setup sketch follows the list):
- Fine-tuning script based on the TRL library: Script
- Tutorial notebook for fine-tuning with the Hugging Face Trainer: Tutorial notebook
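For parameter-efficient fine-tuning, a common approach is to wrap the model with LoRA adapters via the peft library before handing it to TRL or the Trainer. The sketch below is an assumed minimal setup, not the official recipe; in particular, the `target_modules` names assume the Mistral-style attention projections of Idefics2's language backbone and the hyperparameters are illustrative only:
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForVision2Seq, AutoProcessor

processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b")
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
)

# Hypothetical LoRA hyperparameters, for illustration only
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # Assumption: Mistral-style attention projection layers in the language backbone
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
# The wrapped model can then be passed to the TRL script or Trainer notebook linked above.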
📚 Detailed Documentation
Model Overview
Attribute | Details |
---|---|
Developed by | Hugging Face |
Model type | Multimodal model (image + text) |
Language | English |
License | Apache 2.0 |
Parent models | google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1 |
Resources for more information | Description of OBELICS: OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents; Paper: What matters when building vision-language models? |
Model Uses
`idefics2-8b-base` and `idefics2-8b` can be used for inference on multimodal (image + text) tasks such as image captioning and visual question answering. For a specific use case and data, fine-tuning `idefics2-8b` is recommended for best results. `idefics2-8b-chatty` is further fine-tuned for long conversations.
Technical Details
For its size (8 billion parameters), Idefics2 shows strong performance compared with other open multimodal models and is often competitive with closed-source systems. It therefore provides a solid foundation for fine-tuning on a wide range of specific use cases.
The results table below provides more details.
Model | Open weights | Size | # tokens per image | MMMU (val/test) | MathVista (testmini) | TextVQA (val) | MMBench (test) | VQAv2 (test-dev) | DocVQA (test) |
---|---|---|---|---|---|---|---|---|---|
DeepSeek-VL | ✅ | 7B | 576 | 36.6/- | 36.1 | 64.4 | 73.2 | - | 49.6 |
LLaVa-NeXT-Mistral-7B | ✅ | 7B | 2880 | 35.3/- | 37.7 | 65.7 | 68.7 | 82.2 | - |
LLaVa-NeXT-13B | ✅ | 13B | 2880 | 36.2/- | 35.3 | 67.1 | 70.0 | 82.8 | - |
LLaVa-NeXT-34B | ✅ | 34B | 2880 | 51.1/44.7 | 46.5 | 69.5 | 79.3 | 83.7 | - |
MM1-Chat-7B | ❌ | 7B | 720 | 37.0/35.6 | 35.9 | 72.8 | 72.3 | - | - |
MM1-Chat-30B | ❌ | 30B | 720 | 44.7/40.3 | 39.4 | 73.5 | 75.1 | 83.7 | |
Gemini 1.0 Pro | ❌ | 🤷♂️ | 🤷♂️ | 47.9/- | 45.2 | 74.6 | - | 71.2 | 88.1 |
Gemini 1.5 Pro | ❌ | 🤷♂️ | 🤷♂️ | 58.5/- | 52.1 | 73.5 | - | 73.2 | 86.5 |
Claude 3 Haiku | ❌ | 🤷♂️ | 🤷♂️ | 50.2/- | 46.4 | - | - | - | 88.8 |
Idefics1 instruct (32-shots) | ✅ | 80B | - | - | - | 39.3 | - | 68.8 | - |
Idefics2 (w/o image splitting) | ✅ | 8B | 64 | 43.5/37.9 | 51.6 | 70.4 | 76.8 | 80.8 | 67.3 |
Idefics2 (w/ image splitting) | ✅ | 8B | 320 | 43.0/37.7 | 51.4 | 73.0 | 76.7 | 81.2 | 74.0 |
Idefics2 introduces several improvements over Idefics1:
- High-resolution image handling: following the NaViT strategy, images are processed at their native resolution and aspect ratio, avoiding the usual resizing step. Following SPHINX, the model can also split images into sub-images to handle very high resolutions.
- Stronger OCR: integrating OCR-related data significantly improves the ability to recognize and transcribe text in images and documents, and the model answers questions about charts, figures, and documents better.
- Simplified visual-feature integration: a new architecture simplifies how visual features are fed into the language model, making the model more efficient.
- Better performance: Idefics2 significantly outperforms Idefics1 while being 10x smaller.
Training Procedure
Idefics2 is trained in two stages:
- Stage 1: images are fed to the model at SigLIP's native resolution (384x384).
- Stage 2: images are fed at their native resolution (up to 980, at least 378) and native aspectT ratio, and PDFA, Rendered-Text, and IDL data are added.
The model is then instruction-tuned on The Cauldron together with 9 text-only instruction fine-tuning datasets.
Model Optimizations
Half-Precision Loading
If your GPU supports it, we recommend loading and running the model in half precision (`torch.float16` or `torch.bfloat16`):
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
+ torch_dtype=torch.float16,
).to(DEVICE)
Vision Encoder Efficiency
If your GPU memory is limited, you can:
- Disable image splitting by passing `do_image_splitting=False` when initializing the processor:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
- Lower the maximum image resolution by passing `size={"longest_edge": 448, "shortest_edge": 378}` when initializing the processor:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", size= {"longest_edge": 448, "shortest_edge": 378})
Speeding Up Generation with Flash Attention 2
First, make sure the `flash-attn` library is installed. Then pass `_attn_implementation="flash_attention_2"` when loading the model:
model = AutoModelForVision2Seq.from_pretrained(
"HuggingFaceM4/idefics2-8b",
+ torch_dtype=torch.float16,
+ _attn_implementation="flash_attention_2",
).to(DEVICE)
4-Bit Quantization
The model can be quantized to 4 bits with AWQ or `bitsandbytes`; see the code examples in the documentation for details. A bitsandbytes sketch is shown below.
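As a reference, here is a minimal sketch of 4-bit loading with `bitsandbytes`; the quantization settings (nf4, double quantization, fp16 compute dtype) are illustrative choices rather than values recommended in this card, and the `bitsandbytes` and `accelerate` packages must be installed:
import torch
from transformers import AutoModelForVision2Seq, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    quantization_config=quantization_config,
    device_map="auto",  # quantized weights are placed automatically; do not call .to(DEVICE)
)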
Optimization Comparison
Flash Attention 2 | Image splitting | Float type | 4-bit quantization | Peak GPU memory (GB) | Time for 20 generations (s) |
---|---|---|---|---|---|
No | Yes | fp32 | No | 54.9 | 55.6 |
No | Yes | bf16 | No | 41.3 | 34.3 |
No | Yes | fp16 | No | 36.7 | 33.3 |
Yes | Yes | fp16 | No | 21.0 | 13.3 |
Yes | Yes | fp16 | bitsandbytes (entire model) | 8.9 | 19.9 |
No | Yes | fp16 | bitsandbytes (entire model) | 24.7 | 40.4 |
No | Yes | fp16 | AWQ (LLM only) | 26.4 | 37.1 |
Yes | Yes | fp16 | AWQ (LLM only) | 10.7 | 16.3 |
No | Yes | fp16 | AWQ + fusing (LLM only) | 26.0 | 38.4 |
No | No | fp32 | No | 38.8 | 17.5 |
No | No | bf16 | No | 22.2 | 14.4 |
No | No | fp16 | No | 21.3 | 13.9 |
Yes | No | fp16 | No | 18.1 | 10.4 |
Yes | No | fp16 | bitsandbytes (entire model) | 6.0 | 17.3 |
No | No | fp16 | bitsandbytes (entire model) | 9.2 | 20.9 |
No | No | fp16 | AWQ (LLM only) | 10.9 | 15.9 |
Yes | No | fp16 | AWQ (LLM only) | 7.8 | 12.3 |
No | No | fp16 | AWQ + fusing (LLM only) | 10.5 | 19.5 |
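Putting these options together, one of the lower-memory configurations from the table (Flash Attention 2, fp16, image splitting disabled) simply combines the snippets shown earlier:
processor = AutoProcessor.from_pretrained("HuggingFaceM4/idefics2-8b", do_image_splitting=False)
model = AutoModelForVision2Seq.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    torch_dtype=torch.float16,
    _attn_implementation="flash_attention_2",  # requires the flash-attn library
).to(DEVICE)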
🔧 Technical Details
Model Architecture
Idefics2 is built on two pre-trained models, google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1, with a new architecture that simplifies the integration of visual features into the language model.
Training Data
Idefics2 was trained on the following datasets:
- HuggingFaceM4/OBELICS
- laion/laion-coco
- wikipedia
- facebook/pmd
- pixparse/idl-wds
- pixparse/pdfa-eng-wds
- wendlerc/RenderedText
- HuggingFaceM4/the_cauldron
- teknium/OpenHermes-2.5
- GAIR/lima
- databricks/databricks-dolly-15k
- meta-math/MetaMathQA
- TIGER-Lab/MathInstruct
- microsoft/orca-math-word-problems-200k
- camel-ai/math
- AtlasUnified/atlas-math-sets
- tiedong/goat
- Lin-Chen/ShareGPT4V
- jxu124/llava_conversation_58k
Training Procedure
Idefics2 is trained in two stages; see the detailed documentation section above for the full procedure.
📄 License
Idefics2 is released under the Apache 2.0 license; its two pre-trained backbone models, google/siglip-so400m-patch14-384 and mistralai/Mistral-7B-v0.1, use the same license.
⚠️ Important Notes
- Like all large multimodal models, Idefics2 may exhibit biases and limitations; avoid using it in high-risk scenarios.
💡 Usage Tips
- For best results, fine-tune `idefics2-8b` on your specific use case and data.
- If your GPU supports it, load and run the model in half precision for better efficiency.
📖 Citation
If you use Idefics2, please cite:
@misc{laurencon2023obelics,
title={OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents},
author={Hugo Laurençon and Lucile Saulnier and Léo Tronchon and Stas Bekman and Amanpreet Singh and Anton Lozhkov and Thomas Wang and Siddharth Karamcheti and Alexander M. Rush and Douwe Kiela and Matthieu Cord and Victor Sanh},
year={2023},
eprint={2306.16527},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
@misc{laurençon2024matters,
title={What matters when building vision-language models?},
author={Hugo Laurençon and Léo Tronchon and Matthieu Cord and Victor Sanh},
year={2024},
eprint={2405.02246},
archivePrefix={arXiv},
primaryClass={cs.CV}
}
🙏 Acknowledgements
Thanks to @yjernite, @sasha, @meg, @giadap, @jack-kumar, and @frimelle for their help with red-teaming the model.