Open-Qwen2VL Open-Source Multimodal Model - Supports Image and Text Inputs and Generates Text Content

Open-Qwen2VL

Developed by weizhiwang

Open-Qwen2VL is a multimodal model capable of receiving both images and text as input and generating text output.

Task: Image-to-Text
Language: English
Open Source License: CC
Tags: #Multimodal Image-Text Understanding, #Academic Open-Source Model, #Efficient Pre-training

Downloads 568

Release Time: 3/27/2025

Model Overview

A fully open multimodal large language model, pre-trained efficiently on academic-scale compute resources. It accepts image and text input and produces text output.

Model Features

Multimodal Input

Supports simultaneous image and text input for joint understanding and processing.

Efficient Computation

Pre-trained efficiently on academic-scale resources, which makes it suitable for research environments with limited compute.

Fully Open

The model, code, and data are fully open, facilitating research and secondary development.

Model Capabilities

Image Understanding

Text Generation

Multimodal Reasoning

Use Cases

Image Captioning

Image Content Description

Generates detailed natural-language descriptions of input images.

Visual Question Answering

Image-Based Question Answering

Answers questions about the content of an input image.

🚀 Open-Qwen2VL

Open-Qwen2VL is a multimodal model that accepts images and text as input and generates text as output. It is aimed at image-text scenarios such as image captioning and visual question answering.

🚀 Quick Start

Prerequisites

Ensure you have Python and pip installed on your system.

Installation

You can install Open-Qwen2VL using the following command:

pip install git+https://github.com/Victorwz/Open-Qwen2VL.git#subdirectory=prismatic-vlms
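
After installing, it can help to confirm that the package is importable before loading any weights. The sketch below assumes the install exposes a top-level module named `prismatic` (the name imported in the inference example). The helper `is_installed` is hypothetical and uses only the standard library.

```python
import importlib.util


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "prismatic" is the module name used by the inference example.
    if is_installed("prismatic"):
        print("prismatic is importable")
    else:
        print("prismatic not found; re-run the pip install command")
```

This only checks that the module can be located. It does not download or validate model weights.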

Inference

Here is an example of loading the model and performing inference:

import requests
import torch
from PIL import Image
from prismatic import load

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

# Load a pretrained VLM (either local path, or ID to auto-download from the HF Hub)
vlm = load("Open-Qwen2VL")
vlm.to(device, dtype=torch.bfloat16)

# Download an image and specify a prompt
image_url = "https://huggingface.co/adept/fuyu-8b/resolve/main/bus.png"
raw_image = Image.open(requests.get(image_url, stream=True).raw).convert("RGB")
# Apply the vision backbone's image transform and add a leading batch dimension
image = [vlm.vision_backbone.image_transform(raw_image).unsqueeze(0)]
user_prompt = "<image>\nDescribe the image."

# Generate!
generated_text = vlm.generate_batch(
    image,
    [user_prompt],
    do_sample=False,
    max_new_tokens=512,
    min_length=1,
)
print(generated_text[0])

The generated caption looks like:

The image depicts a blue and orange bus parked on the side of a street. ...
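
The same call can serve the visual question answering use case by changing the prompt. The helper below is a hypothetical convenience, not part of the `prismatic` package. It assumes, based only on the example above, that a prompt consists of the `<image>` placeholder, a newline, and then the instruction.

```python
IMAGE_PLACEHOLDER = "<image>"  # token copied from the example prompt above


def build_prompt(instruction: str) -> str:
    """Prefix an instruction or question with the image placeholder."""
    instruction = instruction.strip()
    if not instruction:
        raise ValueError("instruction must not be empty")
    return f"{IMAGE_PLACEHOLDER}\n{instruction}"


if __name__ == "__main__":
    print(repr(build_prompt("Describe the image.")))
    print(repr(build_prompt("What color is the bus?")))
```

The returned string would be passed in place of `user_prompt` in the `generate_batch` call.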

✨ Features

Multimodal Input: Accepts both images and text as input, enabling comprehensive multimodal information processing.
Text Output: Generates high-quality text based on the input image and text.

📦 Model Information

Property	Details
Base Model	Qwen/Qwen2.5-1.5B-Instruct, google/siglip-so400m-patch14-384
Datasets	weizhiwang/Open-Qwen2VL-Data, MAmmoTH-VL/MAmmoTH-VL-Instruct-12M
Language	en
License	cc
Pipeline Tag	image-text-to-text

📚 Documentation

This model is described in the paper Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources. The code is available at https://github.com/Victorwz/Open-Qwen2VL.

Updates

[4/1/2025] The codebase, model, data, and paper are released.

📄 License

The model is released under the cc license.

Acknowledgement

This work was partially supported by the BioPACIFIC Materials Innovation Platform of the National Science Foundation under Award No. DMR-1933487.

Citation

@article{Open-Qwen2VL,
    title={Open-Qwen2VL: Compute-Efficient Pre-Training of Fully-Open Multimodal LLMs on Academic Resources},
    author={Wang, Weizhi and Tian, Yu and Yang, Linjie and Wang, Heng and Yan, Xifeng},
    journal={arXiv preprint arXiv:2504.00595},
    year={2025}
  }
