Yi-VL-6B-hf Open-Source Multimodal Vision-Language Model - Supports Chinese-English Bilingual Visual Question-Answering Tasks

Developed by BUAADreamer

Yi-VL-6B is a multimodal vision-language model developed by 01-AI, supporting both Chinese and English, suitable for tasks like visual question answering.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: Other

#Multimodal Visual Question Answering #Bilingual Support (Chinese-English) #Efficient Fine-Tuning Adaptation

Downloads 55

Release Time: 5/14/2024

Model Overview

Yi-VL-6B is a multimodal vision-language model based on the Yi series, capable of handling joint tasks involving images and text, such as visual question answering and image caption generation.

Model Features

Multimodal Capability

Capable of processing both image and text inputs to achieve joint understanding of vision and language.

Efficient Fine-Tuning Support

The LLaMA-Factory toolkit is recommended for efficient fine-tuning when adapting the model to downstream tasks.

Bilingual Support (Chinese-English)

Natively supports visual-language task processing in both Chinese and English.

Model Capabilities

Visual Question Answering

Image Understanding

Multimodal Reasoning

Use Cases

Education

Visual Q&A for Learning Assistance

Lets students ask questions about an image and receive explanations of the relevant concepts.

Content Understanding

Image Caption Generation

Automatically generates textual descriptions for images.

🚀 Hugging Face Version of the Yi-VL-6B Model

This is the Hugging Face version of the Yi-VL-6B model. It can be fine-tuned for downstream tasks. We recommend using our efficient fine-tuning toolkit, LLaMA-Factory.

✨ Features

Developed by: 01-AI
Language(s) (NLP): Chinese/English
License: Yi Series Model License

📦 Installation

The original model card gives no specific installation steps. The usage example below imports the torch, transformers, PIL (Pillow), and requests packages.

💻 Usage Examples

Basic Usage

import requests
from PIL import Image

import torch
from transformers import AutoProcessor, AutoModelForVision2Seq, LlavaConfig
import transformers
from torch import nn

class LlavaMultiModalProjectorYiVL(nn.Module):
    def __init__(self, config: "LlavaConfig"):
        super().__init__()
        self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
        self.linear_2 = nn.LayerNorm(config.text_config.hidden_size, bias=True)
        self.linear_3 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)
        self.linear_4 = nn.LayerNorm(config.text_config.hidden_size, bias=True)
        self.act = nn.GELU()

    def forward(self, image_features):
        hidden_states = self.linear_1(image_features)
        hidden_states = self.linear_2(hidden_states)
        hidden_states = self.act(hidden_states)
        hidden_states = self.linear_3(hidden_states)
        hidden_states = self.linear_4(hidden_states)
        return hidden_states
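
# This monkey patch is mandatory: Yi-VL's projector adds LayerNorm layers that the
# stock LlavaMultiModalProjector lacks, so the class must be swapped in before the
# model is loaded.
transformers.models.llava.modeling_llava.LlavaMultiModalProjector = LlavaMultiModalProjectorYiVL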

model_id = "BUAADreamer/Yi-VL-6B-hf"

messages = [
  { "role": "user", "content": "<image>What's in the picture?" }
]
image_file = "http://images.cocodataset.org/val2017/000000039769.jpg"

model = AutoModelForVision2Seq.from_pretrained(
    model_id, 
    torch_dtype=torch.float16, 
    low_cpu_mem_usage=True, 
).to(0)
processor = AutoProcessor.from_pretrained(model_id)

# add_generation_prompt=True appends the assistant turn marker so the model answers the question
text = [processor.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)]
images = [Image.open(requests.get(image_file, stream=True).raw)]
inputs = processor(text=text, images=images, return_tensors='pt').to(0, torch.float16)

output = model.generate(**inputs, max_new_tokens=200)
# batch_decode returns a list of strings, one per generated sequence
output = processor.batch_decode(output, skip_special_tokens=True)
print(output[0].split("Assistant:")[-1].strip())
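
When several prompts are decoded at once, processor.batch_decode returns one string per sequence, and each string still contains the prompt text. The sketch below shows one way to keep only the model's reply from each string. The helper name extract_replies and the sample strings are hypothetical, and the "Assistant:" marker is an assumption taken from the split used in the basic usage example; adjust it if the chat template labels the assistant turn differently.

```python
from typing import List


def extract_replies(decoded: List[str], marker: str = "Assistant:") -> List[str]:
    """Return the text after the last `marker` in each decoded sequence.

    `marker` is an assumption based on the basic usage example; if a string
    does not contain it, the whole string is returned (stripped).
    """
    return [text.split(marker)[-1].strip() for text in decoded]


# Hypothetical decoded strings standing in for processor.batch_decode(...) output.
sample = [
    "Human: What's in the picture?\nAssistant: Two cats lying on a couch.",
    "Human: Describe the image.\nAssistant: A dog running on a beach.",
]
replies = extract_replies(sample)
for reply in replies:
    print(reply)
```

Splitting on the last occurrence of the marker keeps the helper usable for multi-turn prompts, where earlier assistant turns also appear in the decoded text.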

Advanced Usage

Alternatively, you can launch a web demo with the LLaMA-Factory CLI:

llamafactory-cli webchat \
--model_name_or_path BUAADreamer/Yi-VL-6B-hf \
--template yivl \
--visual_inputs
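
If the demo needs to be started from a script rather than typed by hand, the same command can be assembled as an argument list. This is a minimal sketch: the function name build_webchat_command is hypothetical, and the flags are copied verbatim from the command above, so they may not match every LLaMA-Factory version.

```python
import shlex
from typing import List


def build_webchat_command(model_path: str = "BUAADreamer/Yi-VL-6B-hf") -> List[str]:
    """Build the argument list for the web demo command shown above.

    The flags are taken from the model card as-is; they are not checked
    against any particular LLaMA-Factory release.
    """
    return [
        "llamafactory-cli", "webchat",
        "--model_name_or_path", model_path,
        "--template", "yivl",
        "--visual_inputs",
    ]


command = build_webchat_command()
# Print a shell-safe, copy-pasteable form of the command.
print(shlex.join(command))
# To launch the demo from Python, pass the list to subprocess.run(command, check=True).
```

Passing a list instead of a single shell string avoids quoting problems when the model path is a local directory containing spaces.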

📚 Documentation

lmms-eval Evaluation Results

Benchmark	Score
MMMU_val	36.8
CMMMU_val	32.2

📄 License

The model is under the Yi Series Model License.
