chat-vector-llava-v1.5-7b-ja Open-Source Vision-Language Model - Supports Japanese Image Dialogue and Communication

Chat Vector Llava V1.5 7b Ja

Developed by toshi456

A vision-language model that can hold conversations in Japanese about input images, created with the Chat Vector method by combining the weights of multiple models.

Image-to-Text

Transformers

Japanese #Japanese Visual Question Answering #Image Dialogue Generation #Multi-model Weight Fusion

Downloads 26

Release date: 5/6/2024

Model Overview

This model understands image content and converses about it in Japanese, which makes it suitable for tasks such as image caption generation and visual question answering.

Model Features

Japanese Visual Dialogue

A vision-language model adapted to Japanese that can understand images and discuss them in Japanese

Multi-model Fusion

Uses the Chat Vector method to combine the weights of llava-v1.5-7b, Llama-2-7b-hf, and ELYZA-japanese-Llama-2-7b

Multi-task Support

Supports various visual-language tasks including image caption generation and visual question answering

Model Capabilities

Image content understanding

Japanese dialogue generation

Visual question answering

Image caption generation

Use Cases

Visual Question Answering

Japanese Image Q&A

Ask a question about an input image, and the model answers in Japanese

Achieved a ROUGE-L score of 18.64 on the JA-VG-VQA-500 dataset

Image Captioning

Japanese Image Caption Generation

Generate Japanese descriptions for input images

Scored 53.61 on the Heron-Bench (Detail) task

🚀 Chat-Vector-LLaVA-v1.5-7b-JA Model Card

Chat-Vector-LLaVA-v1.5-7b-JA is a vision-language model that enables conversations about input images in Japanese. It can be used for vision-language tasks such as image captioning and visual question answering (VQA).

✨ Features

Model Detail

Model type: Chat-Vector-LLaVA-v1.5-7b-JA is a vision-language model that can converse about input images in Japanese. This model was created by adding and subtracting the weights of the llava-v1.5-7b, Llama-2-7b-hf, and ELYZA-japanese-Llama-2-7b models using the Chat Vector method as follows:

ELYZA-japanese-Llama-2-7b + (llava-v1.5-7b - Llama-2-7b-hf)
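The formula above can be read as per-parameter arithmetic over the three checkpoints. The following is a minimal sketch that uses toy dictionaries of float lists in place of real state dicts. The function name `apply_chat_vector` and the rule of copying parameters that exist only in the fine-tuned model (for example, a vision projector) are assumptions made for illustration, not the author's actual merge script.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def apply_chat_vector(target: Weights, tuned: Weights, base: Weights) -> Weights:
    """Return target + (tuned - base), computed parameter by parameter."""
    merged: Weights = {}
    for name, tuned_values in tuned.items():
        if name in base and name in target:
            # Chat vector for this parameter is (tuned - base); add it to the target.
            merged[name] = [
                t + (v - b)
                for t, v, b in zip(target[name], tuned_values, base[name])
            ]
        else:
            # Assumption: parameters with no counterpart in the other models
            # are copied from the fine-tuned model unchanged.
            merged[name] = list(tuned_values)
    return merged


# Toy values only; real checkpoints hold tensors with millions of entries.
base = {"layer.weight": [1.0, 2.0]}                               # plays the role of Llama-2-7b-hf
tuned = {"layer.weight": [1.5, 1.0], "projector.weight": [0.25]}  # plays the role of llava-v1.5-7b
target = {"layer.weight": [3.0, 4.0]}                             # plays the role of ELYZA-japanese-Llama-2-7b

merged = apply_chat_vector(target, tuned, base)
print(merged)
```

In a real merge the same arithmetic would run over tensors, and layers whose shapes differ between models (such as embeddings after a vocabulary change) would need separate handling, which this sketch does not cover.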

| Property | Details |
|----------|---------|
| Model Type | Vision-language model for Japanese conversations about images |
| Training Data | Not provided |

Comparing VLMs

| Model | JA-VG-VQA-500 (ROUGE-L) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | Heron-Bench (Detail) | Heron-Bench (Conv) | Heron-Bench (Complex) | Heron-Bench (Average) |
|-------|------|------|------|------|------|------|
| Japanese Stable VLM | - | 40.50 | 25.15 | 51.23 | 37.84 | 38.07 |
| EvoVLM-JP-v1-7B | 19.70 | 51.25 | 50.31 | 44.42 | 40.47 | 45.07 |
| Heron BLIP Japanese StableLM Base 7B llava-620k | 14.51 | 33.26 | 49.09 | 41.51 | 45.72 | 45.44 |
| Heron GIT Japanese StableLM Base 7B | 15.18 | 37.82 | 42.77 | 54.20 | 43.53 | 46.83 |
| llava-jp-1.3b-v1.0-620k | 12.69 | 44.58 | 51.21 | 41.05 | 45.95 | 44.84 |
| llava-jp-1.3b-v1.1 | 13.33 | 44.40 | 50.00 | 51.83 | 48.98 | 50.39 |
| chat-vector-llava-v1.5-7b-ja | 18.64 | 42.23 | 53.61 | 44.36 | 44.48 | 46.10 |
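Two of the columns above report ROUGE-L, which scores an answer by the longest common subsequence (LCS) it shares with a reference. The sketch below shows the general idea only. The card does not state how the benchmarks tokenize Japanese text or which ROUGE-L variant they use, so the character-level tokens and the plain F1 combination here are assumptions, and the example sentences are invented for illustration.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    previous = [0] * (len(b) + 1)
    for token_a in a:
        current = [0]
        for j, token_b in enumerate(b, start=1):
            if token_a == token_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def rouge_l_f1(reference: Sequence[str], candidate: Sequence[str]) -> float:
    """ROUGE-L as the F1 of LCS-based precision and recall (range 0 to 1)."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Assumption: character-level tokens; a real evaluation may use a
# morphological tokenizer for Japanese instead.
reference = list("猫の隣にパソコンがあります")
candidate = list("猫の隣にはパソコンがあります")
print(f"ROUGE-L (x100): {100 * rouge_l_f1(reference, candidate):.2f}")
```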


📦 Installation

1. Download the dependencies by cloning the repository:

git clone https://github.com/tosiyuki/vlm-chat-vector-ja.git

💻 Usage Examples

Basic Usage

import requests
import torch
import transformers
from PIL import Image

from transformers.generation.streamers import TextStreamer
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.language_model.llava_llama import LlavaLlamaForCausalLM
from llava.mm_utils import tokenizer_image_token, process_images


if __name__ == "__main__":
    model_path = 'toshi456/chat-vector-llava-v1.5-7b-ja'
    device = "cuda" if torch.cuda.is_available() else "cpu"
    # Note: the model weights and image tensors below are loaded as torch.float16.

    model = LlavaLlamaForCausalLM.from_pretrained(
        model_path, 
        device_map=device,
        low_cpu_mem_usage=True,
        use_safetensors=True,
        torch_dtype=torch.float16,
    ).eval()
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path,
        model_max_length=1024,
        padding_side="right",
        use_fast=False,
    )
    model.get_model().vision_tower.load_model()
    model = model.to(device)

    # Stop generation on either the EOS or the BOS token.
    eos_token_id_list = [
        tokenizer.eos_token_id,
        tokenizer.bos_token_id,
    ]

    # image pre-process
    image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

    if not isinstance(image, list):
        image = [image]
    
    image_tensor = process_images(image, model.get_model().vision_tower.image_processor, model.config)
    if isinstance(image_tensor, list):
        image_tensor = [img.to(model.device, dtype=torch.float16) for img in image_tensor]
    else:
        image_tensor = image_tensor.to(model.device, dtype=torch.float16)

    # create prompt
    # Prompt format: ユーザー: <image>\n{prompt}  ("ユーザー" means "User")
    conv_mode = "llava_llama_2"
    conv = conv_templates[conv_mode].copy()
    prompt = "猫の隣には何がありますか？"  # "What is next to the cat?"
    inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt, 
        tokenizer, 
        IMAGE_TOKEN_INDEX, 
        return_tensors='pt'
    ).unsqueeze(0)
    if device == "cuda":
        input_ids = input_ids.to(device)

    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    streamer = TextStreamer(tokenizer, skip_prompt=True, timeout=20.0)

    # generation parameters
    temperature = 0.0
    top_p = 1.0
    max_new_tokens = 256

    # predict
    with torch.inference_mode():
        model.generate(
            inputs=input_ids,
            images=image_tensor,
            do_sample=temperature > 0,
            temperature=temperature,
            top_p=top_p,
            max_new_tokens=max_new_tokens,
            streamer=streamer,
            use_cache=True,
            eos_token_id=eos_token_id_list,
        )

    # Example output (in English: "Next to the cat, there is a computer (PC)."):
    # 猫の隣には、コンピューター（パソコン）があります。<s>

📚 Documentation

Acknowledgement

📄 License

cc-by-nc-4.0

⚠️ Important Note

The demo code worked with transformers 4.34.1 but did not work properly with 4.37.2. Versions in between and later versions have not been tested.
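One way to act on this note is to check the installed transformers version before running the demo. The sketch below is only an illustration: the helper names `describe_transformers_version` and `installed_transformers_version` are hypothetical, and the only version facts it encodes are the two stated in the note above.

```python
from importlib import metadata
from typing import Optional

TESTED_VERSION = "4.34.1"        # reported as working with the demo code
KNOWN_BROKEN_VERSION = "4.37.2"  # reported as not working properly


def describe_transformers_version(installed: str) -> str:
    """Classify a version string against the two versions named in the note."""
    if installed == TESTED_VERSION:
        return "tested"
    if installed == KNOWN_BROKEN_VERSION:
        return "known-broken"
    # Anything else has not been tested according to the note.
    return "untested"


def installed_transformers_version() -> Optional[str]:
    """Return the installed transformers version, or None if it is not installed."""
    try:
        return metadata.version("transformers")
    except metadata.PackageNotFoundError:
        return None


version = installed_transformers_version()
if version is None:
    print("transformers is not installed")
else:
    print(f"transformers {version}: {describe_transformers_version(version)}")
```

Pinning the dependency to the tested version at install time achieves the same result without a runtime check.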
