
llm-jp-3-vila-14b

Developed by llm-jp
A large-scale vision-language model developed by the llm-jp project at Japan's National Institute of Informatics. It supports Japanese and English, with strong image-understanding and text-generation capabilities.
Downloads 106
Release Date: 10/26/2024

Model Overview

This is a vision-language model that combines a visual encoder with a large language model. It understands image content and can generate relevant textual descriptions or answer questions about what it sees.
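To make the architecture concrete, here is a minimal, purely illustrative sketch of how such a model wires a visual encoder into an LLM through a projection layer. None of this is the model's actual code: the function names, the toy "pixel" input, and the arithmetic are all hypothetical stand-ins for the real components.

```python
# Illustrative sketch (NOT the actual model code) of a vision-language
# pipeline: image -> visual encoder -> projection -> LLM.
# All names and computations here are hypothetical.

def visual_encoder(image_pixels):
    """Stand-in for the visual encoder: maps an image to patch features.
    Here each group of 4 toy "pixels" becomes one patch feature."""
    return [sum(image_pixels[i:i + 4]) / 4.0
            for i in range(0, len(image_pixels), 4)]

def projection(patch_features, scale=0.5):
    """Stand-in for the projection layer that maps visual features
    into the LLM's embedding space."""
    return [f * scale for f in patch_features]

def llm_generate(visual_tokens, prompt_tokens):
    """Stand-in for the language model: consumes the projected image
    embeddings together with the text prompt."""
    return (f"description based on {len(visual_tokens)} visual tokens "
            f"and prompt {prompt_tokens!r}")

image = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]  # toy "pixels"
visual_tokens = projection(visual_encoder(image))
answer = llm_generate(visual_tokens, ["What", "is", "this?"])
print(answer)
```

The point of the sketch is the data flow: visual features are projected into the same embedding space as text tokens, so the LLM can attend over image and prompt jointly.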

Model Features

Multilingual Support
Supports both Japanese and English for vision-language understanding and generation
Three-Stage Training
Adopts a phased training strategy: first training only the projection layer, then jointly training the projection layer and the LLM, and finally fine-tuning
High-Performance Visual Encoder
Uses siglip-so400m-patch14-384 as the visual encoder, providing powerful image understanding capabilities
Leading Evaluation
Outperforms comparable models on multiple Japanese vision-language benchmarks
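The three-stage schedule above can be sketched as a freeze/unfreeze plan over the model's components. This is an assumption-laden toy: the stage names and module names are invented for illustration, and it assumes the visual encoder stays frozen throughout, which the description above does not specify.

```python
# Hypothetical sketch of the three-stage training schedule: which
# modules are trainable (True) vs frozen (False) at each stage.
# Stage and module names are illustrative, not the project's config.

ALL_MODULES = ["visual_encoder", "projection", "llm"]

STAGES = [
    {"name": "stage1_projection_only", "trainable": {"projection"}},
    {"name": "stage2_joint",           "trainable": {"projection", "llm"}},
    {"name": "stage3_finetune",        "trainable": {"projection", "llm"}},
]

def freeze_plan(stage):
    """Return a module -> trainable? map for one training stage."""
    return {m: m in stage["trainable"] for m in ALL_MODULES}

for stage in STAGES:
    print(stage["name"], freeze_plan(stage))
```

Training only the projection first aligns visual features with the frozen LLM's embedding space cheaply; later stages then adapt the LLM itself.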

Model Capabilities

Image content understanding
Image caption generation
Visual question answering
Multimodal dialogue

Use Cases

Content Understanding & Generation
Image Captioning
Generates detailed textual descriptions for images
Achieved 57.2% LLM score on the Heron benchmark
Visual Question Answering
Answers natural language questions about image content
Achieved 3.62/5.0 LLM score on JA-VG-VQA500 test
Multimodal Applications
Image-Text Dialogue
Engages in natural language dialogue based on image content
Achieved 3.69/5.0 LLM score on JA-VLM in-the-wild benchmark