gme-Qwen2-VL-2B-Instruct: open-source multimodal embedding model with English and Chinese text support

Gme Qwen2 VL 2B Instruct

Developed by Alibaba-NLP

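gme-Qwen2-VL-2B-Instruct is a multimodal embedding model built on the Qwen2-VL architecture. It handles English and Chinese text and targets embedding-based tasks such as retrieval, classification, and sentence similarity.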

Multimodal embedding (text and image retrieval)

Transformers

Multilingual | License: Apache-2.0 | Tags: #multimodal-understanding #chinese-english-processing #sentence-similarity

Downloads: 31.18k

Released: 12/21/2024

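Model overview

The model is a multimodal embedding model built on a vision-language backbone. It handles text and image tasks and accepts an optional task instruction when embedding queries.

Model features

Multilingual support

Handles English and Chinese text. Training used English data only, so multilingual multimodal embedding quality is not guaranteed.

Multi-task processing

Covers sentence similarity, classification, retrieval, and other embedding-based natural language processing tasks.

Vision-language capability

Combines visual and language processing, which suits multimodal tasks.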

Model capabilities

Text classification

Sentence similarity

Information retrieval

Clustering

Reranking

Multimodal processing

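Use cases

Text analysis

Sentiment analysis

Classify the sentiment polarity of Amazon reviews.

Accuracy: 96.75%

Intent recognition

Identify user intent in banking customer-service conversations.

Accuracy: 80.24%

Information retrieval

Document retrieval

Retrieve documents on the ArguAna dataset.

MAP@10: 52.78

Multimodal applications

Image-text matching

Match images and text by combining visual and language information.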

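🚀 GME-Qwen2-VL-2B: a general multimodal embedding model

GME-Qwen2-VL-2B is a unified multimodal embedding model built on the Qwen2-VL multimodal large language model (MLLM). It accepts three input types (text, image, and image-text pair) and produces universal vector representations with strong retrieval performance.

✨ Key features

Unified multimodal representation: Handles both single-modal and combined-modal inputs and produces a unified vector representation, enabling retrieval scenarios such as text retrieval, text-to-image retrieval, and image-to-image retrieval (Any2Any search).
High performance: Achieves state-of-the-art (SOTA) results on the Universal Multimodal Retrieval Benchmark (UMRB) and strong scores on the Massive Text Embedding Benchmark (MTEB).
Dynamic image resolution: Thanks to Qwen2-VL and the training data, the model accepts image inputs of dynamic resolution.
Strong visual retrieval performance: Excels at visual document retrieval, particularly in complex document-understanding scenarios that require a detailed reading of document screenshots, such as multimodal retrieval-augmented generation (RAG) over academic papers.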
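
The Any2Any search described above amounts to nearest-neighbour lookup in one shared vector space. The following is a minimal, dependency-free sketch of that idea. The three-dimensional vectors and item names are made-up placeholders standing in for real GME embeddings, and `cosine` and `search` are hypothetical helpers, not part of the model's API.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, index, top_k=2):
    """Return the top_k (item_id, score) pairs, best match first."""
    scored = [(item_id, cosine(query_vec, vec)) for item_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Placeholder index that mixes modalities in a single vector space.
index = {
    "text:cybertruck-article": [0.9, 0.1, 0.0],
    "image:cybertruck-photo": [0.8, 0.2, 0.1],
    "image:cat-photo": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # placeholder query embedding

results = search(query, index)
for item_id, score in results:
    print(f"{item_id}: {score:.3f}")
```

Because text and image items live in the same index, the query can be of any modality and the ranking code does not change.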

📦 Installation and usage

To call the model with custom code, download the gme_inference.py script and run the following:

# gme_inference.py is available at https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct/blob/main/gme_inference.py
from gme_inference import GmeQwen2VL

texts = [
    "What kind of car is this?",
    "The Tesla Cybertruck is a battery electric pickup truck built by Tesla, Inc. since 2023."
]
images = [
    'https://en.wikipedia.org/wiki/File:Tesla_Cybertruck_damaged_window.jpg',
    'https://en.wikipedia.org/wiki/File:2024_Tesla_Cybertruck_Foundation_Series,_front_left_(Greenwich).jpg',
]

gme = GmeQwen2VL("Alibaba-NLP/gme-Qwen2-VL-2B-Instruct")

# Single-modal embeddings; each score pairs text i with image i
e_text = gme.get_text_embeddings(texts=texts)
e_image = gme.get_image_embeddings(images=images)
print((e_text * e_image).sum(-1))
## tensor([0.2281, 0.6001], dtype=torch.float16)

# Set an embedding instruction for queries
e_query = gme.get_text_embeddings(texts=texts, instruction='Find an image that matches the given text.')
# With is_query=False, the default instruction is always used.
e_corpus = gme.get_image_embeddings(images=images, is_query=False)
print((e_query * e_corpus).sum(-1))
## tensor([0.2433, 0.7051], dtype=torch.float16)

# Fused-modal embeddings (one vector per text-image pair)
e_fused = gme.get_fused_embeddings(texts=texts, images=images)
print((e_fused[0] * e_fused[1]).sum())
## tensor(0.6108, dtype=torch.float16)
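
The expression `(e_text * e_image).sum(-1)` above scores text i against image i only. In retrieval you usually want every query scored against every candidate. The sketch below shows the difference on tiny placeholder vectors written as plain Python lists. The helper names are hypothetical, and using an all-pairs score matrix with the real tensors is an assumption about typical usage rather than something the source shows.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rowwise_scores(queries, candidates):
    """Score query i against candidate i (elementwise product, then sum)."""
    return [dot(q, c) for q, c in zip(queries, candidates)]


def all_pairs_scores(queries, candidates):
    """Score every query against every candidate: result[i][j] = q_i . c_j."""
    return [[dot(q, c) for c in candidates] for q in queries]


# Unit-length placeholder vectors standing in for text and image embeddings.
text_vecs = [[1.0, 0.0], [0.0, 1.0]]
image_vecs = [[0.6, 0.8], [0.0, 1.0]]

paired = rowwise_scores(text_vecs, image_vecs)
matrix = all_pairs_scores(text_vecs, image_vecs)

# Index of the best-scoring image for each text.
best_image_per_text = [max(range(len(row)), key=row.__getitem__) for row in matrix]

print(paired)
print(matrix)
print(best_image_per_text)
```

The row-wise scores are the diagonal of the all-pairs matrix, so the paired form is only enough when inputs are already aligned.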

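📚 Detailed documentation

Model list

| Model | Model size | Max sequence length | Dimension | MTEB-en | MTEB-zh | UMRB |
|---|---|---|---|---|---|---|
| `gme-Qwen2-VL-2B` | 2.21B | 32768 | 1536 | 65.27 | 66.92 | 64.45 |
| `gme-Qwen2-VL-7B` | 8.29B | 32768 | 3584 | 67.48 | 69.73 | 67.44 |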
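
The embedding dimension in the table determines how much storage a vector index needs. As a rough worked example, the sketch below estimates raw index size assuming float16 storage (the dtype printed in the usage example) and a hypothetical corpus of one million items. It ignores any index overhead or compression.

```python
BYTES_PER_FLOAT16 = 2  # assumption: vectors are stored as float16

# Embedding dimensions taken from the model list.
DIMS = {"gme-Qwen2-VL-2B": 1536, "gme-Qwen2-VL-7B": 3584}

NUM_DOCS = 1_000_000  # hypothetical corpus size


def index_size_bytes(num_vectors, dim, bytes_per_value=BYTES_PER_FLOAT16):
    """Raw size of num_vectors dense vectors, with no index overhead."""
    return num_vectors * dim * bytes_per_value


sizes_gb = {name: index_size_bytes(NUM_DOCS, dim) / 1e9 for name, dim in DIMS.items()}

for name, gb in sizes_gb.items():
    print(f"{name}: {gb:.2f} GB for {NUM_DOCS:,} vectors")
```

Under these assumptions the 7B model's vectors take a little over twice the space of the 2B model's, which is worth weighing against its higher benchmark scores.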

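Evaluation results

We validated the model on the Universal Multimodal Retrieval Benchmark (UMRB) and other benchmarks. In the table below, T stands for text, I for image, VD for visual document, and IT for an image-text pair. T→T and I→I are single-modal tasks; T→I, T→VD, and I→T are cross-modal; the four task types involving IT are fused-modal. The number in parentheses is the number of tasks in each group (47 in total).

| Model | Size | T→T (16) | I→I (1) | T→I (4) | T→VD (10) | I→T (4) | T→IT (2) | IT→T (5) | IT→I (2) | IT→IT (3) | Avg. (47) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| VISTA | 0.2B | 55.15 | 31.98 | 32.88 | 10.12 | 31.23 | 45.81 | 53.32 | 8.97 | 26.26 | 37.32 |
| CLIP-SF | 0.4B | 39.75 | 31.42 | 59.05 | 24.09 | 62.95 | 66.41 | 53.32 | 34.9 | 55.65 | 43.66 |
| One-Peace | 4B | 43.54 | 31.27 | 61.38 | 42.9 | 65.59 | 42.72 | 28.29 | 6.73 | 23.41 | 42.01 |
| DSE | 4.2B | 48.94 | 27.92 | 40.75 | 78.21 | 52.54 | 49.62 | 35.44 | 8.36 | 40.18 | 50.04 |
| E5-V | 8.4B | 52.41 | 27.36 | 46.56 | 41.22 | 47.95 | 54.13 | 32.9 | 23.17 | 7.23 | 42.52 |
| [GME-Qwen2-VL-2B](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-2B-Instruct) | 2.2B | 55.93 | 29.86 | 57.36 | 87.84 | 61.93 | 76.47 | 64.58 | 37.02 | 66.47 | 64.45 |
| [GME-Qwen2-VL-7B](https://huggingface.co/Alibaba-NLP/gme-Qwen2-VL-7B-Instruct) | 8.3B | 58.19 | 31.89 | 61.35 | 89.92 | 65.83 | 80.94 | 66.18 | 42.56 | 73.62 | 67.44 |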
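
The final column can be sanity-checked against the per-group scores. The sketch below recomputes it for the two GME rows as a mean weighted by the task counts in parentheses. That weighting is an inference from the numbers, not something the source states, and the same calculation does not reproduce every baseline row as closely.

```python
# Task counts per column, in table order: T→T, I→I, T→I, T→VD, I→T, T→IT, IT→T, IT→I, IT→IT.
TASK_COUNTS = [16, 1, 4, 10, 4, 2, 5, 2, 3]

# Per-group scores copied from the evaluation table.
SCORES = {
    "GME-Qwen2-VL-2B": [55.93, 29.86, 57.36, 87.84, 61.93, 76.47, 64.58, 37.02, 66.47],
    "GME-Qwen2-VL-7B": [58.19, 31.89, 61.35, 89.92, 65.83, 80.94, 66.18, 42.56, 73.62],
}
REPORTED_AVG = {"GME-Qwen2-VL-2B": 64.45, "GME-Qwen2-VL-7B": 67.44}


def weighted_mean(scores, weights):
    """Mean of scores weighted by the number of tasks in each group."""
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


recomputed = {name: weighted_mean(vals, TASK_COUNTS) for name, vals in SCORES.items()}

for name, value in recomputed.items():
    print(f"{name}: recomputed {value:.2f}, reported {REPORTED_AVG[name]:.2f}")
```

Small differences in the last digit are expected because the per-group scores in the table are already rounded.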

🔧 Technical details

Fine-tuning

GME models can be fine-tuned with SWIFT:

pip install ms-swift -U

# Set MAX_PIXELS to reduce memory usage
# See: https://swift.readthedocs.io/en/latest/BestPractices/Embedding.html
nproc_per_node=8
MAX_PIXELS=1003520 \
USE_HF=1 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model Alibaba-NLP/gme-Qwen2-VL-2B-Instruct \
    --train_type lora \
    --dataset 'HuggingFaceM4/TextCaps:emb' \
    --torch_dtype bfloat16 \
    --num_train_epochs 1 \
    --per_device_train_batch_size 2 \
    --per_device_eval_batch_size 2 \
    --gradient_accumulation_steps $(expr 64 / $nproc_per_node) \
    --eval_steps 100 \
    --save_steps 100 \
    --eval_strategy steps \
    --save_total_limit 5 \
    --logging_steps 5 \
    --output_dir output \
    --lazy_tokenize true \
    --warmup_ratio 0.05 \
    --learning_rate 5e-6 \
    --deepspeed zero3 \
    --dataloader_num_workers 4 \
    --task_type embedding \
    --loss_type infonce \
    --dataloader_drop_last true
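
Two numbers in the command above are easier to read once unpacked. The sketch below works through them in Python. It assumes a single node with 8 processes and assumes each visual token covers a 28 x 28 pixel area (14 x 14 patches merged 2 x 2, as in Qwen2-VL). Neither assumption is stated in the command itself.

```python
# Values copied from the fine-tuning command.
NPROC_PER_NODE = 8
PER_DEVICE_TRAIN_BATCH_SIZE = 2
MAX_PIXELS = 1003520

# `expr 64 / $nproc_per_node` is integer division in the shell.
grad_accum_steps = 64 // NPROC_PER_NODE

# Samples contributing to one optimizer step (assumption: one node).
effective_batch = PER_DEVICE_TRAIN_BATCH_SIZE * NPROC_PER_NODE * grad_accum_steps

# Assumption: one visual token per 28 x 28 pixel area.
PIXELS_PER_VISUAL_TOKEN = 28 * 28
max_visual_tokens = MAX_PIXELS // PIXELS_PER_VISUAL_TOKEN

print(f"gradient accumulation steps: {grad_accum_steps}")
print(f"effective batch size: {effective_batch}")
print(f"visual token budget per image: {max_visual_tokens}")
```

Changing `nproc_per_node` leaves the effective batch size unchanged as long as it divides 64, because the accumulation steps scale inversely.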

📄 License

This project is released under the Apache 2.0 license.

🔖 Citation

If you find our paper or models helpful, please consider citing:

@misc{zhang2024gme,
      title={GME: Improving Universal Multimodal Retrieval by Multimodal LLMs}, 
      author={Zhang, Xin and Zhang, Yanzhao and Xie, Wen and Li, Mingxin and Dai, Ziqi and Long, Dingkun and Xie, Pengjun and Zhang, Meishan and Li, Wenjie and Zhang, Min},
      year={2024},
      eprint={2412.16855},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={http://arxiv.org/abs/2412.16855}, 
}

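⚠️ Important notes

Single image input: In Qwen2-VL, one image can be converted into a very large number of visual tokens. To keep training efficient, we limit the number of visual tokens to 1024. Because relevant data is lacking, our models and evaluations use a single image per input.
English-only training: Our models are trained on English data only. Although the underlying Qwen2-VL models are multilingual, multilingual multimodal embedding performance is not guaranteed.

We will extend to multi-image input, interleaved image-text data, and multilingual data in future versions.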
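
To get a feel for the 1024-token limit, the sketch below estimates how many visual tokens an image of a given size would produce under such a cap. It is loosely modelled on the resizing done by Qwen2-VL's image processor and assumes one token per 28 x 28 pixel area. `estimate_visual_tokens` is a hypothetical helper, and the real processor's rounding may differ.

```python
import math

FACTOR = 28  # assumption: each visual token covers a 28 x 28 pixel area


def estimate_visual_tokens(height, width, max_tokens=1024, factor=FACTOR):
    """Approximate visual token count after resizing to fit the token cap."""
    # Snap both sides to a multiple of the token size.
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)

    max_pixels = max_tokens * factor * factor
    if h * w > max_pixels:
        # Shrink both sides by the same ratio, then round down to stay under the cap.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)

    return (h // factor) * (w // factor)


small = estimate_visual_tokens(224, 224)    # already under the cap
large = estimate_visual_tokens(1000, 1500)  # needs downscaling

print(f"224 x 224 image: {small} tokens")
print(f"1000 x 1500 image: {large} tokens")
```

Large images are scaled down until they fit the budget, so fine detail in high-resolution document screenshots may be lost at a 1024-token cap.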