LLaVA-JP 1.3B v1.1

Developed by toshi456

LLaVA-JP is a multimodal vision-language model that supports Japanese, capable of understanding and generating descriptions and dialogues about input images.

Image-to-Text

Transformers

Language: Japanese

Tags: Japanese Visual Question Answering, High-Resolution Image Processing, Multimodal Dialogue

Downloads 90

Release Time: 4/17/2024

Model Overview

This model combines a visual encoder and a text decoder, supports high-resolution image input, and is specifically optimized for Japanese visual language tasks.

Model Features

High-Resolution Support

Supports 768x768 high-resolution image input through the scaling_on_scales method

Japanese Optimization

Specifically trained and optimized for Japanese visual language tasks

Two-Stage Training

Pre-trains the visual projector first, followed by instruction fine-tuning

Model Capabilities

Image understanding

Japanese image caption generation

Japanese visual question answering

Multimodal dialogue

Use Cases

Assistive Technology

Visual Assistance

Provides image content descriptions for visually impaired individuals

Content Analysis

Social Media Analysis

Automatically analyzes social media image content and generates descriptions

license: cc-by-nc-4.0
datasets:
- turing-motors/LLaVA-Pretrain-JA
- turing-motors/LLaVA-v1.5-Instruct-620K-JA
language:
- ja
pipeline_tag: image-to-text
tags:
- vision
- image-captioning
- VQA

LLaVA-JP Model Card

Model detail

Model type:

LLaVA-JP is a vision-language model that can converse about input images.
It is a large vision-language model (LVLM) trained with google/siglip-so400m-patch14-384 as the image encoder and llm-jp/llm-jp-1.3b-v1.0 as the text decoder. It supports 768 x 768 high-resolution image input via the scaling_on_scales method.
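The relationship between the encoder's base resolution and the 768 x 768 input can be sketched in a few lines. This is only an illustration of the arithmetic used in the inference script further down (base size multiplied by the number of scales) plus the general multi-scale idea of cutting the enlarged image into base-sized tiles. The base size of 384 is inferred from the encoder name, the count of two scales is an assumption, and `multiscale_input_size` and `tile_boxes` are hypothetical helpers, not functions from the LLaVA-JP repository.

```python
def multiscale_input_size(base_size: int, num_scales: int) -> int:
    """Side length of the square input: base size times the number of scales."""
    return base_size * num_scales


def tile_boxes(image_size: int, tile_size: int) -> list[tuple[int, int, int, int]]:
    """Split a square image into non-overlapping (left, top, right, bottom) tiles."""
    if image_size % tile_size != 0:
        raise ValueError("image_size must be a multiple of tile_size")
    steps = range(0, image_size, tile_size)
    return [(x, y, x + tile_size, y + tile_size) for y in steps for x in steps]


base = 384  # assumed from the encoder name siglip-so400m-patch14-384
size = multiscale_input_size(base, num_scales=2)  # two scales is an assumption
boxes = tile_boxes(size, base)
print(size, len(boxes))  # prints: 768 4
```

Each tile has the resolution the encoder was trained on, which is why a multi-scale scheme can accept a larger image without changing the encoder itself.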

Training:

In the first stage, the Vision Projector was pre-trained on LLaVA-Pretrain-JA.
In the second stage, the model was fine-tuned on LLaVA-v1.5-Instruct-620K-JA.
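The two-stage schedule can be written down as a small configuration sketch. The dataset names come from this card. Which components are updated in the second stage is an assumption borrowed from the original LLaVA recipe (projector and text decoder trained, image encoder frozen); the card itself only states that the projector is trained first and that instruction fine-tuning follows. `Stage`, `STAGES`, and `is_trainable` are hypothetical names used for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    dataset: str
    trainable: frozenset  # component names whose weights are updated


STAGES = [
    # Stage 1: only the projector between image encoder and text decoder learns.
    Stage("pretrain", "turing-motors/LLaVA-Pretrain-JA", frozenset({"projector"})),
    # Stage 2: ASSUMPTION (LLaVA-style): projector and text decoder both learn.
    Stage(
        "finetune",
        "turing-motors/LLaVA-v1.5-Instruct-620K-JA",
        frozenset({"projector", "text_decoder"}),
    ),
]


def is_trainable(component: str, stage: Stage) -> bool:
    """Return True if the component's weights are updated in this stage."""
    return component in stage.trainable


for stage in STAGES:
    frozen = [c for c in ("image_encoder", "projector", "text_decoder")
              if not is_trainable(c, stage)]
    print(stage.name, "frozen:", frozen)
```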

Resources for more information: https://github.com/tosiyuki/LLaVA-JP/tree/main

Comparing VLMs

Model	JA-VG-VQA-500 (ROUGE-L)	JA-VLM-Bench-In-the-Wild (ROUGE-L)	Heron-Bench(Detail)	Heron-Bench(Conv)	Heron-Bench(Complex)	Heron-Bench(Average)
Japanese Stable VLM	-	40.50	25.15	51.23	37.84	38.07
EvoVLM-JP-v1-7B	19.70	51.25	50.31	44.42	40.47	45.07
Heron BLIP Japanese StableLM Base 7B llava-620k	14.51	33.26	49.09	41.51	45.72	45.44
Heron GIT Japanese StableLM Base 7B	15.18	37.82	42.77	54.20	43.53	46.83
llava-jp-1.3b-v1.0-620k	12.69	44.58	51.21	41.05	45.95	44.84
llava-jp-1.3b-v1.1	13.33	44.40	50.00	51.83	48.98	50.39
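Two of the columns above report ROUGE-L, which scores a generated answer by the longest common subsequence (LCS) it shares with a reference answer. The sketch below computes a ROUGE-L F1 value in the range 0 to 1 from two token lists. The tokenization (individual characters here) and the choice of F1 are assumptions; the exact settings used by the benchmarks are not stated in this card, so this will not necessarily reproduce the numbers in the table.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between a reference and a candidate token sequence."""
    if not reference or not candidate:
        return 0.0
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return 2 * precision * recall / (precision + recall)


# Character-level tokens are an assumption made for this illustration.
reference = list("猫の隣にはノートパソコンがあります。")
candidate = list("猫の隣にノートパソコンがあります。")
print(round(rouge_l_f1(reference, candidate), 3))
```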


How to use the model

1. Download dependencies

git clone https://github.com/tosiyuki/LLaVA-JP.git

2. Inference

import requests
import torch
import transformers
from PIL import Image

from transformers.generation.streamers import TextStreamer
from llava.constants import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates, SeparatorStyle
from llava.model.llava_gpt2 import LlavaGpt2ForCausalLM
from llava.train.arguments_dataclass import ModelArguments, DataArguments, TrainingArguments
from llava.train.dataset import tokenizer_image_token


if __name__ == "__main__":
    model_path = 'toshi456/llava-jp-1.3b-v1.1'
    device = "cuda" if torch.cuda.is_available() else "cpu"
    torch_dtype = torch.bfloat16 if device=="cuda" else torch.float32

    model = LlavaGpt2ForCausalLM.from_pretrained(
        model_path, 
        low_cpu_mem_usage=True,
        use_safetensors=True,
        torch_dtype=torch_dtype,
        device_map=device,
    )
    tokenizer = transformers.AutoTokenizer.from_pretrained(
        model_path,
        model_max_length=1532,
        padding_side="right",
        use_fast=False,
    )
    model.eval()

    conv_mode = "v1"
    conv = conv_templates[conv_mode].copy()

    # image pre-process
    image_url = "https://huggingface.co/rinna/bilingual-gpt-neox-4b-minigpt4/resolve/main/sample.jpg"
    image = Image.open(requests.get(image_url, stream=True).raw).convert('RGB')

    # When scales are configured, the input side length is the encoder's
    # base size multiplied by the number of scales.
    vision_tower = model.get_model().vision_tower
    image_size = vision_tower.image_processor.size["height"]
    if vision_tower.scales is not None:
        image_size *= len(vision_tower.scales)

    # Resize to the target size, then move to the model's device and dtype.
    image_tensor = vision_tower.image_processor(
        image,
        return_tensors='pt',
        size={"height": image_size, "width": image_size}
    )['pixel_values'].to(device=device, dtype=torch_dtype)

    # create prompt
    # template: ユーザー: <image>\n{prompt}  ("ユーザー" means "User")
    prompt = "猫の隣には何がありますか？"  # "What is next to the cat?"
    inp = DEFAULT_IMAGE_TOKEN + '\n' + prompt
    conv.append_message(conv.roles[0], inp)
    conv.append_message(conv.roles[1], None)
    prompt = conv.get_prompt()

    input_ids = tokenizer_image_token(
        prompt, 
        tokenizer, 
        IMAGE_TOKEN_INDEX, 
        return_tensors='pt'
    ).unsqueeze(0)
    if device == "cuda":
        input_ids = input_ids.to(device)

    input_ids = input_ids[:, :-1] # drop the trailing </sep> token that is appended to the input
    stop_str = conv.sep if conv.sep_style != SeparatorStyle.TWO else conv.sep2
    keywords = [stop_str]
    streamer = TextStreamer(tokenizer, skip_prompt=True, timeout=20.0)

    # predict
    with torch.inference_mode():
        model.generate(
            inputs=input_ids,
            images=image_tensor,
            do_sample=True,
            temperature=0.1,
            top_p=1.0,
            max_new_tokens=256,
            streamer=streamer,
            use_cache=True,
        )
    # Example output: 猫の隣にはノートパソコンがあります。 ("There is a laptop next to the cat.")
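The script above relies on tokenizer_image_token to turn the `<image>` marker in the prompt into a placeholder id that the model later replaces with image features. The following is a simplified sketch of that idea, not the repository's implementation: it splits the prompt on the marker, tokenizes the text chunks, and inserts one placeholder id between them. The placeholder value of -200 is an assumption taken from the original LLaVA code, `toy_tokenize` is a hypothetical stand-in for a real tokenizer, and special-token handling is omitted.

```python
IMAGE_TOKEN = "<image>"
IMAGE_TOKEN_INDEX = -200  # ASSUMPTION: placeholder id as in the original LLaVA code


def toy_tokenize(text):
    """Hypothetical tokenizer stand-in: one id per character."""
    return [ord(ch) for ch in text]


def splice_image_token(prompt, tokenize=toy_tokenize, image_index=IMAGE_TOKEN_INDEX):
    """Tokenize the text around each <image> marker and insert a placeholder id."""
    ids = []
    for i, chunk in enumerate(prompt.split(IMAGE_TOKEN)):
        if i > 0:
            ids.append(image_index)  # one placeholder per <image> marker
        ids.extend(tokenize(chunk))
    return ids


ids = splice_image_token("ユーザー: <image>\n猫の隣には何がありますか？")
print(ids.count(IMAGE_TOKEN_INDEX))  # prints: 1
```

A negative id cannot collide with a real vocabulary entry, which makes the placeholder easy to locate when the image features are merged into the input sequence.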

Training dataset

Stage1 Pretrain

LLaVA-Pretrain-JA

Stage2 Fine-tuning

LLaVA-v1.5-Instruct-620K-JA

Acknowledgement

License

cc-by-nc-4.0
