Open-Qwen2VL: an open-source multimodal model that takes image and text input and generates text

Developed by weizhiwang

Open-Qwen2VL is a multimodal model that takes images and text as input and generates text as output.

Image-to-text | English | License: CC | #multimodal image-text understanding #open academic model #efficient pre-training

Downloads: 568

Released: 3/27/2025

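Model overview

A fully open multimodal large language model pre-trained compute-efficiently on academic resources. It takes image and text input and generates text output.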

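Model features

Multimodal input

Accepts images and text together as input and processes them jointly.

Compute-efficient

Pre-training is designed to be compute-efficient on academic-scale resources, which suits research settings with limited compute.

Fully open

The model, code, and data are fully open, which makes further research and derivative work easier.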

Model capabilities

Image understanding

Text generation

Multimodal reasoning

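Use cases

Image captioning

Image content description

Describes the input image in detail as natural-language text.

Produces accurate, detailed image descriptions.

Visual question answering

Image-based question answering

Answers questions about the content of an image.

Provides accurate answers grounded in the image content.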

🚀 Introduction to Open-Qwen2VL

Open-Qwen2VL is a multimodal model that takes images and text as input and outputs text. It fuses image and text information to support multimodal tasks.

🚀 Quick start

Open-Qwen2VL is described in the paper Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources. The code is available at https://github.com/Victorwz/Open-Qwen2VL.

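✨ Key features

Multimodal processing: handles image and text input together and outputs text.
Open source: the code, model, data, and paper have all been released.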

📦 Installation

First install Open-Qwen2VL with the following command:

pip install git+https://github.com/Victorwz/Open-Qwen2VL.git#subdirectory=prismatic-vlms
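
Before loading any weights, it can help to confirm that the install succeeded. The following is a minimal sketch that uses only the standard library; the helper name `is_installed` is hypothetical, and the module name `prismatic` is assumed from the import in the usage example.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None


# `prismatic` is the module name imported in the usage example (an assumption
# about what the pip command above installs).
print("prismatic importable:", is_installed("prismatic"))
```

If this prints `False`, re-running the pip command in the active environment is the first thing to try.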

💻 Usage example

Basic usage

import requests
import torch
from PIL import Image
from prismatic import load

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Load a pretrained VLM (either local path, or ID to auto-download from the HF Hub)
vlm = load("Open-Qwen2VL")
vlm.to(device, dtype=torch.bfloat16)

# Download an image and specify a prompt
image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
# image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
image = [vlm.vision_backbone.image_transform(Image.open(requests.get(image_url, stream=True).raw).convert("RGB")).unsqueeze(0)]
user_prompt = "<image>\nDescribe the image."

# Generate!
generated_text = vlm.generate_batch(
    image,
    [user_prompt],
    do_sample=False,
    max_new_tokens=512,
    min_length=1,
)
print(generated_text[0])

The resulting image description:

The image depicts a blue and orange bus parked on the side of a street. ...
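
The same call pattern could be wrapped for visual question answering. The sketch below is a convenience wrapper written under assumptions, not part of the Open-Qwen2VL codebase: `build_prompt` and `ask` are hypothetical names, the `<image>` prefix mirrors the prompt in the example above, and `vlm` is assumed to expose `vision_backbone.image_transform` and `generate_batch` exactly as they are used there.

```python
from typing import Any

IMAGE_TOKEN = "<image>"  # image placeholder, as used in the prompt above


def build_prompt(question: str) -> str:
    """Prefix a question with the image placeholder on its own line."""
    question = question.strip()
    if not question:
        raise ValueError("question must not be empty")
    return f"{IMAGE_TOKEN}\n{question}"


def ask(vlm: Any, pil_image: Any, question: str, max_new_tokens: int = 512) -> str:
    """Ask one question about one image.

    Assumes `vlm` is the object returned by `prismatic.load` and
    `pil_image` is an RGB PIL image (both assumptions, see the lead-in).
    """
    # Preprocess the image and add a leading batch dimension,
    # following the basic usage example.
    pixel_values = vlm.vision_backbone.image_transform(pil_image).unsqueeze(0)
    outputs = vlm.generate_batch(
        [pixel_values],
        [build_prompt(question)],
        do_sample=False,  # no sampling, as in the basic example
        max_new_tokens=max_new_tokens,
        min_length=1,
    )
    return outputs[0]
```

With the model loaded as above and an RGB PIL image `img`, `ask(vlm, img, "What color is the bus?")` would return the generated answer as a string.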

📚 Documentation

Model information

Property	Details
Base models	Qwen/Qwen2.5-1.5B-Instruct, google/siglip-so400m-patch14-384
Datasets	weizhiwang/Open-Qwen2VL-Data, MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
Language	English
License	cc
Task type	Image-text-to-text

Changelog

[April 1, 2025] Codebase, model, data, and paper released.

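Acknowledgements

This work was partially supported by the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487.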

📄 License

This project is released under the cc license.

Citation

@article{Open-Qwen2VL,
    title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
    author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
    journal={arXiv preprint arXiv:2504.00595},
    year={2025}
}