ConvLLaVA-JP-1.3b-1280: Open-Source Japanese Vision-Language Model - Supports High-Resolution Image Input for Conversations

Developed by toshi456

ConvLLaVA-JP is a Japanese vision-language model that supports high-resolution input and can engage in conversations about input images.

Image-to-Text

Transformers

Japanese #Japanese Visual Question Answering #High-Resolution Image Understanding #Multi-Stage Joint Training

Downloads 31

Release Date: June 14, 2024

Model Overview

This model combines an image encoder and text decoder, supports high-resolution input up to 1280x1280, and can perform tasks such as image caption generation and visual question answering.

Model Features

High-Resolution Support

Supports high-resolution image input up to 1280x1280, capable of capturing richer visual details.

Multi-Stage Training

Uses a three-stage training strategy: first training the vision projector and Stage 5, then jointly training the image encoder, vision projector, Stage 5, and language model, and finally fine-tuning the vision projector and language model on instruction data.

Japanese Optimization

Trained on Japanese datasets (LLaVA-Pretrain-JA and LLaVA-v1.5-Instruct-620K-JA) and intended for Japanese vision-language tasks.

Model Capabilities

Image Caption Generation

Visual Question Answering

Image Dialogue

High-Resolution Image Understanding

Use Cases

Image Understanding

Image Content Description

Generates detailed Japanese descriptions of input images.

Can identify objects in an image and describe how they relate to one another, for example what is next to a given object.

Visual Question Answering

Answers Japanese questions about image content.

Scores 11.88 ROUGE-L on JA-VG-VQA-500 and 43.64 ROUGE-L on JA-VLM-Bench-In-the-Wild (see Comparing VLMs for results from other models).

Human-Computer Interaction

Image-Based Dialogue System

Engages in natural language conversations with users about image content.

Can understand complex questions and provide relevant answers.

🚀 ConvLLaVA-JP Model Card

ConvLLaVA-JP is a vision-language model designed to hold conversations about input images, with support for high-resolution input.

🚀 Quick Start

To use the ConvLLaVA-JP model, follow the steps below:

Clone the LLaVA-JP repository (see Installation), which provides the llava package imported by the example.
Run inference using the code example under Usage Examples.

📦 Installation

git clone https://github.com/tosiyuki/LLaVA-JP.git

💻 Usage Examples

Basic Usage

import requests
import torch
import transformers
from PIL import Image

from transformers.generation.streamers import TextStreamer
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.llava_gpt2 import LlavaGpt2ForCausalLM
from llava.train.dataset import tokenizer_image_token


if __name__ == "__main__":
    model_path = 'toshi456/ConvLLaVA-JP-1.3b-1280'
    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.bfloat16 if device=="cuda" else torch.float32

    model = LlavaGpt2ForCausalLM.from_pretrained(
        model_path, 
        low_cpu_mem_usage=True,
        use_safetensors=True,
        torch_dtype=torch_dtype,
        device_map=device,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path,
        model_max_length=1532,
        padding_side="right",
        use_fast=False,
    )
    model.eval()

    conv_mode = "v1"
    conv = conv_templates[conv_mode].copy()

    # image pre-process
    image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')
    
    # The vision tower's image processor returns a tensor; add a batch
    # dimension, then move it to the model's device and dtype.
    image_tensor = model.get_model().vision_tower.image_processor(image).unsqueeze(0)
    image_tensor = image_tensor.to(device=device, dtype=torch_dtype)

    # create prompt
    # Prompt format: ユーザー: <image>\n{prompt}  ("ユーザー" means "User")
    prompt = "猫の隣には何がありますか？"  # "What is next to the cat?"
    inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt, 
        tokenizer, 
        IMAGE_TOKEN_INDEX, 
        return_tensors='pt'
    ).unsqueeze(0)
    if device == "cuda":
        input_ids = input_ids.to(device)

    input_ids = input_ids[:, :-1]  # drop the trailing </sep> token that is appended to the input
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    streamer = TextStreamer(tokenizer, skip_prompt=True, timeout=20.0)

    # predict
    with torch.inference_mode():
        output_id = model.generate(
            inputs=input_ids,
            images=image_tensor,
            do_sample=False,
            temperature=1.0,
            top_p=1.0,
            max_new_tokens=256,
            streamer=streamer,
            use_cache=True,
        )
    # Example output: 猫の隣にはノートパソコンがあります。
    # ("There is a laptop next to the cat.")

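The example above computes `stop_str` and `keywords` but never applies them to the generated text. One way to use them is to cut the decoded output at the first stop string that appears. The following is a minimal sketch of that idea, assuming the decoded output is available as a plain string. The helper name `trim_at_stop` and the sample string are hypothetical and not part of LLaVA-JP.

```python
def trim_at_stop(text: str, stop_strs: list[str]) -> str:
    """Return `text` truncated at the earliest occurrence of any stop string."""
    cut = len(text)
    for stop in stop_strs:
        if not stop:
            continue  # an empty stop string would match at index 0
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].strip()


# Hypothetical decoded text; "</sep>" stands in for whatever stop_str resolves to.
decoded = "猫の隣にはノートパソコンがあります。</sep>ユーザー: "
print(trim_at_stop(decoded, ["</sep>"]))
```

In the script above, the string would typically come from decoding `output_id` with the tokenizer, and `keywords` would be passed as `stop_strs`.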
📚 Documentation

Model Details

Model Type: ConvLLaVA-JP is a vision-language model capable of conversing about input images. This LVLM uses [laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft](https://huggingface.co/laion/CLIP-convnext_large_d_320.laion2B-s29B-b131K-ft) as the image encoder and [llm-jp/llm-jp-1.3b-v1.0](https://huggingface.co/llm-jp/llm-jp-1.3b-v1.0) as the text decoder, and supports an input resolution of 1280 x 1280.
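To give a rough sense of what a 1280 x 1280 input means for the text decoder, the sketch below estimates how many visual tokens the encoder would hand over. It assumes a total spatial downsampling factor of 64, as described for the five-stage ConvNeXt encoder in ConvLLaVA. That factor is an assumption and is not stated in this model card, and the function name is hypothetical.

```python
def visual_token_count(height: int, width: int, downsample: int = 64) -> int:
    """Estimate the number of visual tokens for an image of the given size.

    `downsample` is the assumed total stride of the image encoder
    (64 is an assumption, not a value taken from this model card).
    """
    if height % downsample or width % downsample:
        raise ValueError("image size must be divisible by the downsampling factor")
    return (height // downsample) * (width // downsample)


# Compare the two resolutions that appear in the ConvLLaVA-JP model names.
for side in (768, 1280):
    print(f"{side} x {side}: {visual_token_count(side, side)} visual tokens")
```

Under this assumption the token count grows with the square of the side length, so a higher input resolution costs more of the decoder's context window.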

Training: The model was trained in three phases. In the first phase, the Vision Projector and Stage 5 were trained on LLaVA-Pretrain-JA. In the second phase, the Image Encoder, Vision Projector, Stage 5, and LLM were trained on LLaVA-Pretrain-JA. In the third phase, the Vision Projector and LLM were fine-tuned on LLaVA-v1.5-Instruct-620K-JA.
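The training schedule can be summarized as a small table of which components are updated in each phase. The sketch below transcribes the description above into a plain data structure. The component labels (`image_encoder`, `stage5`, `vision_projector`, `llm`) are hypothetical names chosen for illustration, and treating every component not listed for a phase as frozen is an assumption.

```python
# Components trained in each phase, transcribed from the training description.
TRAINING_PHASES = {
    1: {"dataset": "LLaVA-Pretrain-JA",
        "trainable": {"vision_projector", "stage5"}},
    2: {"dataset": "LLaVA-Pretrain-JA",
        "trainable": {"image_encoder", "vision_projector", "stage5", "llm"}},
    3: {"dataset": "LLaVA-v1.5-Instruct-620K-JA",
        "trainable": {"vision_projector", "llm"}},
}

# Hypothetical labels for the model's parts, in processing order.
COMPONENTS = ("image_encoder", "stage5", "vision_projector", "llm")


def frozen_components(phase: int) -> set[str]:
    """Return the components assumed to stay frozen in the given phase."""
    return set(COMPONENTS) - TRAINING_PHASES[phase]["trainable"]


for phase, spec in TRAINING_PHASES.items():
    trainable = ", ".join(c for c in COMPONENTS if c in spec["trainable"])
    frozen = ", ".join(c for c in COMPONENTS if c not in spec["trainable"]) or "none"
    print(f"phase {phase} ({spec['dataset']}): train [{trainable}]; freeze [{frozen}]")
```

In a PyTorch training script, a table like this would usually be applied by setting `requires_grad` on the parameters of each component before building the optimizer.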

For more information, refer to: https://github.com/tosiyuki/LLaVA-JP/tree/main

Comparing VLMs

| Model | JA-VG-VQA-500 (ROUGE-L) | JA-VLM-Bench-In-the-Wild (ROUGE-L) | Heron-Bench (Detail) | Heron-Bench (Conv) | Heron-Bench (Complex) | Heron-Bench (Average) |
| --- | --- | --- | --- | --- | --- | --- |
| [Japanese Stable VLM](https://huggingface.co/stabilityai/japanese-stable-vlm) | - | 40.50 | 25.15 | 51.23 | 37.84 | 38.07 |
| [EvoVLM-JP-v1-7B](https://huggingface.co/SakanaAI/EvoVLM-JP-v1-7B) | 19.70 | 51.25 | 50.31 | 44.42 | 40.47 | 45.07 |
| [Heron BLIP Japanese StableLM Base 7B llava-620k](https://huggingface.co/turing-motors/heron-chat-blip-ja-stablelm-base-7b-v1-llava-620k) | 14.51 | 33.26 | 49.09 | 41.51 | 45.72 | 45.44 |
| [Heron GIT Japanese StableLM Base 7B](https://huggingface.co/turing-motors/heron-chat-git-ja-stablelm-base-7b-v1) | 15.18 | 37.82 | 42.77 | 54.20 | 43.53 | 46.83 |
| [llava-jp-1.3b-v1.0-620k](https://huggingface.co/toshi456/llava-jp-1.3b-v1.0-620k) | 12.69 | 44.58 | 51.21 | 41.05 | 45.95 | 44.84 |
| [llava-jp-1.3b-v1.1](https://huggingface.co/toshi456/llava-jp-1.3b-v1.1) | 13.33 | 44.40 | 50.00 | 51.83 | 48.98 | 50.39 |
| [ConvLLaVA-JP-1.3b-768](https://huggingface.co/toshi456/ConvLLaVA-JP-1.3b-768) | 12.05 | 42.80 | 44.24 | 40.00 | 48.16 | 44.96 |
| [ConvLLaVA-JP-1.3b-1280](https://huggingface.co/toshi456/ConvLLaVA-JP-1.3b-1280) | 11.88 | 43.64 | 38.95 | 44.79 | 41.24 | 42.31 |
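Two of the benchmark columns report ROUGE-L, which scores a generated answer by the longest common subsequence (LCS) it shares with a reference answer. The sketch below shows the standard F1 form of the metric on pre-tokenized sequences. It is an illustration only: the tokenizer and exact ROUGE-L variant used for the table are not stated here, and splitting Japanese text into single characters is a simplifying assumption.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """ROUGE-L F1 between two token sequences, in the range 0.0 to 1.0."""
    if not candidate or not reference:
        return 0.0
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Character-level tokens are an assumption made for this illustration.
candidate = list("猫の隣にノートパソコンがあります")
reference = list("猫の隣にはノートパソコンがあります。")
print(f"ROUGE-L F1: {rouge_l_f1(candidate, reference):.3f}")
```

The scores in the table appear to be on a 0 to 100 scale, so a value from this function would be multiplied by 100 before comparing.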

Training Dataset

Stage 1 and Stage 2 Pretraining:

[LLaVA-Pretrain-JA](https://huggingface.co/datasets/turing-motors/LLaVA-Pretrain-JA)

Stage 3 Fine-tuning:

[LLaVA-v1.5-Instruct-620K-JA](https://huggingface.co/datasets/turing-motors/LLaVA-v1.5-Instruct-620K-JA)

Acknowledgement

ConvLLaVA
[LLM-jp](https://llm-jp.nii.ac.jp/)
Open CLIP

📄 License

This project is licensed under the CC-BY-NC-4.0 license.
