Qwen2.5-VL-3B-Instruct-GGUF Open-Source Vision-Language Model - Free, Powerful Visual Understanding and Multimodal Processing

Qwen2.5 VL 3B Instruct GGUF

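Developed by unsloth

Qwen2.5-VL is the latest vision-language model in the Qwen family, with strong visual understanding and multimodal processing capabilities.

Image-to-Text · English · #Multimodal visual understanding #Video temporal analysis #Structured data extraction

Downloads: 4,645

Release date: 5/11/2025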

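Model overview

Qwen2.5-VL is a multimodal vision-language model focused on improving visual understanding, agent functionality, and structured output generation.

Model features

Enhanced visual understanding

Accurately recognizes common objects and is good at analyzing text, charts, icons, graphics, and page layouts within images

Agent functionality

Can act directly as a visual agent that reasons and dynamically calls tools, supporting computer and phone operation scenarios

Long video understanding

Can parse videos longer than 1 hour and can capture events by pinpointing the relevant video segments

Multi-format visual localization

Localizes objects in images by generating bounding boxes or coordinate points, and produces stable JSON output for coordinates and attributes

Structured output generation

Supports structured output of the content of scanned invoices, forms, tables, and similar data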

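Model capabilities

Image and text understanding

Visual object localization

Video content analysis

Structured data extraction

Multimodal reasoning

Tool calling

Use cases

Business applications

Invoice processing

Automatically recognizes and extracts structured data from invoices

Improves the efficiency of financial processing

Form analysis

Parses the content of various business forms

Streamlines data entry

Intelligent assistants

Visual agent

Acts as an agent that reasons over visual input and calls tools

Enables automated operation

Content analysis

Video content understanding

Parses long videos and locates key events

Improves the efficiency of video analysis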

🚀 Qwen2.5-VL-3B-Instruct

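Qwen2.5-VL-3B-Instruct is the latest vision-language model in the Qwen family. It can understand image and video content, localize objects visually, and generate structured output, which makes it applicable to fields such as finance and commerce.

🚀 Quick start

Install dependencies

The code for Qwen2.5-VL is included in the latest Hugging Face transformers. We recommend installing from source:

pip install git+https://github.com/huggingface/transformers accelerate

Otherwise you may encounter the following error:

KeyError: 'qwen2_5_vl'

We also provide a toolkit that lets you handle various kinds of visual input as conveniently as calling an API. It supports base64, URLs, and interleaved images and videos. Install it with:

# Using the `[decord]` extra is strongly recommended for faster video loading
pip install qwen-vl-utils[decord]==0.0.8

If you are not on Linux, you may be unable to install decord from PyPI. In that case, use pip install qwen-vl-utils, which falls back to torchvision for video processing. You can still install decord from source so that it is used when loading videos.

Using 🤗 Transformers to chat

The following snippet shows how to use the chat model with transformers and qwen_vl_utils: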

import torch  # needed only if you enable the bfloat16 / flash_attention_2 variant below
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from qwen_vl_utils import process_vision_info

# Default: load the model onto the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype="auto", device_map="auto"
)

# Enabling flash_attention_2 is recommended for better speed and memory savings,
# especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-3B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )

# Default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# By default, the number of visual tokens per image ranges from 4 to 16384.
# You can set min_pixels and max_pixels as needed, for example a token range of 256 - 1280, to balance performance and cost.
# min_pixels = 256*28*28
# max_pixels = 1280*28*28
# processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels)

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference: generate the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
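
The slicing step in the snippet above exists because `generate` returns each prompt followed by the newly generated tokens, so the prompt has to be cut off before decoding. A minimal sketch of that step, using plain Python lists with made-up token ids in place of real tensors (the helper name `trim_prompt_tokens` is hypothetical):

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Toy token ids standing in for real tensors (the values are made up).
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 102]]

new_tokens = trim_prompt_tokens(prompt_batch, generated_batch)
print(new_tokens)
```

Only the tokens after the prompt length survive, which is what gets passed to `batch_decode`.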

Multi-image inference

# Messages containing multiple images and a text query
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

Video inference

# Messages containing a list of images treated as a video, plus a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": [
                    "file:///path/to/frame1.jpg",
                    "file:///path/to/frame2.jpg",
                    "file:///path/to/frame3.jpg",
                    "file:///path/to/frame4.jpg",
                ],
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a local video path and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "file:///path/to/video1.mp4",
                "max_pixels": 360 * 420,
                "fps": 1.0,
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

# Messages containing a video URL and a text query
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "video",
                "video": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-VL/space_woaudio.mp4",
            },
            {"type": "text", "text": "Describe this video."},
        ],
    }
]

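# In Qwen2.5-VL, frame-rate information is also passed to the model so that frames align with absolute time.
# Prepare inputs for inference
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# video_kwargs carries the per-video fps returned by process_vision_info,
# so no separate fps argument is passed to the processor.
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
    **video_kwargs,
)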
inputs = inputs.to("cuda")

# Inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)

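Video URL compatibility depends largely on the versions of third-party libraries, as shown in the table below. If you do not want the default backend, switch it by setting FORCE_QWENVL_VIDEO_READER=torchvision or FORCE_QWENVL_VIDEO_READER=decord.

Backend	HTTP	HTTPS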
torchvision >= 0.19.0	✅	✅
torchvision < 0.19.0	❌	❌
decord	✅	❌
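
The table can be turned into a small selection rule. The sketch below is illustrative only: `pick_video_backend` is a hypothetical helper, the torchvision version is passed in as a tuple rather than detected, and treating non-HTTP paths as safe for decord is an assumption the table does not cover. Only the `FORCE_QWENVL_VIDEO_READER` variable and its two values come from the text above.

```python
import os
from urllib.parse import urlparse


def pick_video_backend(video_url: str, torchvision_version: tuple) -> str:
    """Choose a video reader for a URL following the compatibility table (hypothetical helper)."""
    scheme = urlparse(video_url).scheme.lower()
    if scheme not in ("http", "https"):
        # Assumption: local paths are not covered by the table; default to decord.
        return "decord"
    if torchvision_version >= (0, 19, 0):
        return "torchvision"  # handles both HTTP and HTTPS
    if scheme == "http":
        return "decord"  # decord handles HTTP but not HTTPS
    raise ValueError("HTTPS video URLs need torchvision >= 0.19.0")


backend = pick_video_backend("https://example.com/clip.mp4", (0, 19, 0))
# Set the variable before any video is loaded so the choice takes effect.
os.environ["FORCE_QWENVL_VIDEO_READER"] = backend
print(backend)
```

Setting the environment variable as early as possible, ideally before importing qwen_vl_utils, avoids depending on when the library reads it.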

Batch inference

# Example messages for batch inference
messages1 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "What are the common elements in these pictures?"},
        ],
    }
]
messages2 = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]
# Combine the messages for batch processing
messages = [messages1, messages2]

# Prepare inputs for batch inference
texts = [
    processor.apply_chat_template(msg, tokenize=False, add_generation_prompt=True)
    for msg in messages
]
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=texts,
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Batch inference
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_texts = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_texts)

🤖 ModelScope

We strongly recommend that users, especially those in mainland China, use ModelScope. snapshot_download can help you work around problems when downloading checkpoints.
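
A minimal sketch of that download step, assuming the `modelscope` package is installed and that the checkpoint is published on ModelScope under the same id as on Hugging Face (both are assumptions; `download_checkpoint` is a hypothetical wrapper):

```python
def download_checkpoint(model_id: str = "Qwen/Qwen2.5-VL-3B-Instruct") -> str:
    """Download a checkpoint from ModelScope and return the local directory."""
    # Imported lazily so this module loads even when modelscope is not installed.
    from modelscope import snapshot_download

    return snapshot_download(model_id)


# Usage (needs `pip install modelscope` and network access):
# model_dir = download_checkpoint()
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     model_dir, torch_dtype="auto", device_map="auto"
# )
```

The returned directory can be passed to `from_pretrained` in place of the hub id.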

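✨ Key features

Visual understanding

Qwen2.5-VL is not only proficient at recognizing common objects such as flowers, birds, fish, and insects, but can also analyze text, charts, icons, graphics, and layouts within images with high accuracy.

Agent capability

Qwen2.5-VL can act directly as a visual agent that reasons and dynamically calls tools, and it is capable of computer use and phone use.

Long video understanding and event capture

Qwen2.5-VL can understand videos longer than 1 hour, and this release adds the ability to capture events by pinpointing the relevant video segments.

Multi-format visual localization

Qwen2.5-VL can accurately localize objects in an image by generating bounding boxes or points, and it provides stable JSON output for coordinates and attributes.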
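
To consume such JSON output in code, the reply usually has to be parsed defensively. The sketch below assumes a reply that may be wrapped in a Markdown code fence and uses a hypothetical schema with `bbox_2d` and `label` keys; neither the schema nor the helper name `parse_grounding_reply` is specified by this document.

```python
import json
import re

FENCE = "`" * 3  # Markdown code-fence marker
FENCED_JSON = re.compile(
    re.escape(FENCE) + r"(?:json)?\s*(.*?)" + re.escape(FENCE), re.DOTALL
)


def parse_grounding_reply(reply: str) -> list:
    """Extract a list of detections from a model reply (hypothetical schema)."""
    match = FENCED_JSON.search(reply)
    payload = match.group(1) if match else reply
    detections = json.loads(payload)
    # Normalize a single object into a one-element list.
    if isinstance(detections, dict):
        detections = [detections]
    return detections


# Made-up reply for illustration; real model output may differ.
sample_reply = FENCE + 'json\n[{"bbox_2d": [10, 20, 110, 220], "label": "cat"}]\n' + FENCE
boxes = parse_grounding_reply(sample_reply)
print(boxes[0]["label"], boxes[0]["bbox_2d"])
```

`json.loads` raises `json.JSONDecodeError` on malformed replies, so callers may want to catch that and retry the prompt.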

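Structured output generation

For scanned data such as invoices, forms, and tables, Qwen2.5-VL supports structured output of the content, which benefits applications in finance, commerce, and other fields.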

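📚 Detailed documentation

Supported input image and video formats

Images can be supplied as local files, base64 data, or URLs. Videos can be supplied as local files; video URLs also work, subject to the backend compatibility table in the video inference section.

# Insert a local file path, a URL, or a base64-encoded image wherever you need it in the message.
## Local file path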
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Image URL
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "http://path/to/your/image.jpg"},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
## Base64-encoded image
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "data:image;base64,/9j/..."},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
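
For the base64 variant, the string can be built from raw image bytes with the standard library. This is a minimal sketch: `to_data_uri` is a hypothetical helper, and the three bytes used here are just the JPEG magic number standing in for a real file's contents.

```python
import base64


def to_data_uri(image_bytes: bytes) -> str:
    """Encode raw image bytes in the data-URI form used in the message above."""
    return "data:image;base64," + base64.b64encode(image_bytes).decode("ascii")


# In practice, read the bytes from disk: open(path, "rb").read()
jpeg_magic = b"\xff\xd8\xff"
uri = to_data_uri(jpeg_magic)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": uri},
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
print(uri)
```

JPEG files start with these bytes, which is why base64-encoded JPEGs begin with `/9j/` as in the example message.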

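Tuning image resolution for performance

The model supports a wide range of input resolutions. By default it uses the native resolution of the input; higher resolutions can improve performance at the cost of more computation. You can set minimum and maximum pixel counts to get the configuration that best fits your needs, for example a token range of 256 - 1280, to balance speed and memory usage.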

min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28
processor = AutoProcessor.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", min_pixels=min_pixels, max_pixels=max_pixels
)
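
The arithmetic behind those two values can be spelled out. The sketch assumes, as the formulas above imply, that one visual token corresponds to a 28 x 28 pixel area; the helper names are hypothetical and the token estimate ignores any special tokens added around the image.

```python
PATCH = 28  # pixels per side covered by one visual token, per the formulas above


def pixels_for_tokens(num_tokens: int) -> int:
    """Pixel budget that corresponds to a given number of visual tokens."""
    return num_tokens * PATCH * PATCH


def tokens_for_image(height: int, width: int) -> int:
    """Rough visual-token count for an image whose sides are multiples of 28."""
    return (height // PATCH) * (width // PATCH)


min_pixels = pixels_for_tokens(256)
max_pixels = pixels_for_tokens(1280)
print(min_pixels, max_pixels)      # 200704 1003520
print(tokens_for_image(280, 420))  # 150
```

Under the same assumption, a per-image setting of 50176 pixels corresponds to 64 tokens.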

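In addition, there are two ways to control the size of the image passed to the model more precisely:

Define min_pixels and max_pixels: the image is resized, keeping its aspect ratio, so that its pixel count falls between min_pixels and max_pixels.
Specify exact dimensions: set resized_height and resized_width directly. These values are rounded to the nearest multiple of 28.

# resized_height and resized_width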
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "resized_height": 280,
                "resized_width": 420,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
# min_pixels and max_pixels
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "file:///path/to/your/image.jpg",
                "min_pixels": 50176,
                "max_pixels": 50176,
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]
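
The two sizing methods can be illustrated with a simplified sketch. This is not the toolkit's own implementation: `round_to_factor` and `fit_to_pixel_budget` are hypothetical helpers that only mirror the behavior described above (round to multiples of 28, keep the aspect ratio roughly constant, land inside the pixel budget).

```python
import math

FACTOR = 28  # sides are snapped to multiples of 28, as described above


def round_to_factor(value: int, factor: int = FACTOR) -> int:
    """Round to the nearest multiple of factor, never below one factor."""
    return max(factor, round(value / factor) * factor)


def fit_to_pixel_budget(height, width, min_pixels, max_pixels, factor=FACTOR):
    """Resize (height, width) into [min_pixels, max_pixels], keeping aspect ratio."""
    h, w = round_to_factor(height, factor), round_to_factor(width, factor)
    if h * w > max_pixels:
        # Shrink: round down so the result stays under the budget.
        scale = math.sqrt(height * width / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Enlarge: round up so the result reaches the minimum.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


print(round_to_factor(300))  # 308
print(fit_to_pixel_budget(3000, 4000, 256 * 28 * 28, 1280 * 28 * 28))
```

Rounding down when shrinking and up when enlarging is a design choice in this sketch that keeps the result inside the budget after snapping to the 28-pixel grid.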

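Processing long texts

The current config.json sets the maximum context length to 32,768 tokens. To handle inputs longer than 32,768 tokens, we use YaRN, a technique that improves the model's length extrapolation and helps maintain performance on long texts. For supported frameworks, add the following to config.json to enable YaRN: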

{
    ...,
    "type": "yarn",
    "mrope_section": [
        16,
        24,
        24
    ],
    "factor": 4,
    "original_max_position_embeddings": 32768
}

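Note that this approach has a significant impact on temporal and spatial localization tasks and is therefore not recommended. For long video inputs, because MRoPE itself uses position ids more economically, you can instead change max_position_embeddings directly to a larger value, such as 64k.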
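
Raising that limit amounts to editing one field in config.json. The sketch below assumes the key sits at the top level of the file and interprets "64k" as 65536; both are assumptions, and `raise_position_limit` is a hypothetical helper. It runs against a throwaway file rather than a real checkpoint.

```python
import json
import tempfile
from pathlib import Path


def raise_position_limit(config_path, new_limit: int = 65536) -> dict:
    """Rewrite max_position_embeddings in a config.json file and return the new config."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["max_position_embeddings"] = new_limit
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Demonstrate on a temporary file so no real checkpoint is modified.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"max_position_embeddings": 32768}), encoding="utf-8")
    updated = raise_position_limit(demo_path)
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))

print(updated["max_position_embeddings"])
```

Keeping a backup copy of the original config.json before editing a real checkpoint is advisable.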

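🔧 Technical details

Model architecture updates

Dynamic resolution and frame rate training for video understanding

By adopting dynamic FPS sampling, dynamic resolution is extended to the temporal dimension, so the model can understand videos sampled at different rates. Correspondingly, mRoPE is updated in the temporal dimension with ids aligned to absolute time, which lets the model learn temporal order and speed and ultimately gain the ability to pinpoint specific moments.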
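
The idea of aligning frames to absolute time can be shown with a toy example. This is only an illustration of why timestamps, rather than frame indices, stay comparable across sampling rates; it is not the model's actual position-id computation, and `frame_timestamps` is a hypothetical helper.

```python
def frame_timestamps(duration_s: float, fps: float) -> list:
    """Absolute time in seconds of each frame sampled at a given rate."""
    num_frames = int(duration_s * fps)
    return [index / fps for index in range(num_frames)]


slow = frame_timestamps(4.0, 1.0)  # one frame per second
fast = frame_timestamps(4.0, 2.0)  # two frames per second
print(slow)
print(fast)
```

Frame index 2 means second 2.0 at 1 FPS but second 1.0 at 2 FPS, so an index alone is ambiguous, while the timestamp identifies the same moment at either rate.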

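Streamlined and efficient vision encoder

Training and inference speed are improved by strategically introducing window attention into the ViT. The ViT architecture is further optimized with SwiGLU and RMSNorm, aligning it with the structure of the Qwen2.5 LLM.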
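
For readers unfamiliar with the two components named above, here is a dependency-free sketch of the standard RMSNorm and SwiGLU formulations. It uses plain Python lists and tiny identity weights for illustration; it is not the model's code, and the shapes and weights are made up.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the inverse of its root mean square, then by a learned weight."""
    mean_square = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_square + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def silu(v):
    """SiLU (swish) activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block: down(silu(gate(x)) * up(x))."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


normed = rms_norm([3.0, 4.0], [1.0, 1.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
out = swiglu_ffn(normed, identity, identity, identity)
print(normed, out)
```

Unlike LayerNorm, RMSNorm skips mean subtraction, and SwiGLU replaces a single activation with a gated product of two projections.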

📄 License

This project is released under the qwen-research license.

📈 Evaluation

Image benchmarks

Benchmark	InternVL2.5-4B	Qwen2-VL-7B	Qwen2.5-VL-3B
MMMU_val	52.3	54.1	53.1
MMMU-Pro_val	32.7	30.5	31.6
AI2D_test	81.4	83.0	81.5
DocVQA_test	91.6	94.5	93.9
InfoVQA_test	72.1	76.5	77.1
TextVQA_val	76.8	84.3	79.3
MMBench-V1.1_test	79.3	80.7	77.6
MMStar	58.3	60.7	55.9
MathVista_testmini	60.5	58.2	62.3
MathVision_full	20.9	16.3	21.2

Video benchmarks

Benchmark	InternVL2.5-4B	Qwen2-VL-7B	Qwen2.5-VL-3B
MVBench	71.6	67.0	67.0
VideoMME	63.6/62.3	69.0/63.3	67.6/61.5
MLVU	48.3	-	68.2
LVBench	-	-	43.3
MMBench-Video	1.73	1.44	1.63
EgoSchema	-	-	64.8
PerceptionTest	-	-	66.9
TempCompass	-	-	64.4
LongVideoBench	55.2	55.6	54.2
CharadesSTA/mIoU	-	-	38.8

Agent benchmarks

Benchmark	Qwen2.5-VL-3B
ScreenSpot	55.5
ScreenSpot Pro	23.9
AITZ_EM	76.9
Android Control High_EM	63.7
Android Control Low_EM	22.2
AndroidWorld_SR	90.8
MobileMiniWob++_SR	67.9

📖 Citation

If you find our work helpful, please cite the following:

@misc{qwen2.5-VL,
    title = {Qwen2.5-VL},
    url = {https://qwenlm.github.io/blog/qwen2.5-vl/},
    author = {Qwen Team},
    month = {January},
    year = {2025}
}

@article{Qwen2VL,
  title={Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution},
  author={Wang, Peng and Bai, Shuai and Tan, Sinan and Wang, Shijie and Fan, Zhihao and Bai, Jinze and Chen, Keqin and Liu, Xuejing and Wang, Jialin and Ge, Wenbin and Fan, Yang and Dang, Kai and Du, Mengfei and Ren, Xuancheng and Men, Rui and Liu, Dayiheng and Zhou, Chang and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2409.12191},
  year={2024}
}

@article{Qwen-VL,
  title={Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond},
  author={Bai, Jinze and Bai, Shuai and Yang, Shusheng and Wang, Shijie and Tan, Sinan and Wang, Peng and Lin, Junyang and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2308.12966},
  year={2023}
}