🚀 Llama-3-EZO-VLM-1
Based on the Llama-3 architecture, this model is enhanced for Japanese usage and suitable for diverse global needs.

Based on SakanaAI/Llama-3-EvoVLM-JP-v2, it has been enhanced for Japanese usage through additional pre-training and instruction tuning.
This model is based on Llama-3-8B-Instruct and is subject to the Llama-3 Terms of Use. For detailed information, please refer to the official Llama-3 license page.
🚀 Quick Start
DEMO
https://huggingface.co/spaces/HODACHI/Llama-3-EZO-VLM-1
[Usage]
First, install the Mantis library:
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
import requests
from PIL import Image
import torch
from mantis.models.conversation import Conversation, SeparatorStyle
from mantis.models.mllava import chat_mllava, LlavaForConditionalGeneration, MLlavaProcessor
from mantis.models.mllava.utils import conv_templates
# Conversation template with a Japanese system prompt
# (English: "You are a sincere and excellent Japanese assistant. Unless otherwise instructed, always answer in Japanese.")
conv_llama_3_elyza = Conversation(
    system="<|start_header_id|>system<|end_header_id|>\n\nあなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。",
    roles=("user", "assistant"),
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3,
    sep="<|eot_id|>",
)
conv_templates["llama_3"] = conv_llama_3_elyza
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "HODACHI/Llama-3-EZO-VLM-1"
processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-8B-siglip-llama3")
processor.tokenizer.pad_token = processor.tokenizer.eos_token
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map=device).eval()
# Deterministic decoding settings; these are forwarded to model.generate
generation_kwargs = {
    "max_new_tokens": 256,
    "num_beams": 1,
    "do_sample": False,
    "no_repeat_ngram_size": 3,
}
text = "<image>の信号は何色ですか?"
url_list = [
"https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
"https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
Image.open(requests.get(url_list[0], stream=True).raw).convert("RGB")
]
response, history = chat_mllava(text, images, model, processor, **generation_kwargs)
print(response)
text = "では、<image>の信号は?"
images += [
Image.open(requests.get(url_list[1], stream=True).raw).convert("RGB")
]
# Passing history keeps the previous turn (and its image) in context
response, history = chat_mllava(text, images, model, processor, history=history, **generation_kwargs)
print(response)
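The example above streams two images from the web; the same call also works with local files or with sampled decoding. A minimal variant continuing from the snippet above (the file name, question, and sampling settings below are illustrative, not part of the original example):

# Hypothetical variant: a local image and sampled decoding
local_image = Image.open("my_photo.jpg").convert("RGB")  # any local RGB image
sampling_kwargs = {
    "max_new_tokens": 256,
    "do_sample": True,   # sample instead of greedy decoding
    "temperature": 0.7,
    "no_repeat_ngram_size": 3,
}
# "What is shown in <image>?"
response, history = chat_mllava("<image>には何が写っていますか?", [local_image], model, processor, **sampling_kwargs)
print(response)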
✨ Features
This model is based on Llama-3-8B-Instruct, enhanced with multiple tuning techniques to improve its general performance. While it excels in Japanese language tasks, it's designed to meet diverse needs globally.
[Benchmark Results]
ElyzaTasks100
The model scores 0.7 points higher than the base model on this benchmark, a significant improvement.
Image Description Ability
In all four examples, the model shows improved recognition and description ability compared to the base model.
The following shows GPT-4o's evaluation of the outputs of GPT-4, SakanaAI's base model, and the EZO model for the same image and the same prompt.
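The exact judging prompt is not published; the sketch below only illustrates how such an LLM-as-judge comparison can be wired up with the standard OpenAI Python client. The image_url and answers arguments are placeholders, not artifacts of this evaluation.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(image_url: str, prompt: str, answers: dict) -> str:
    """Ask GPT-4o to rate each candidate description of the same image."""
    listing = "\n\n".join(f"[{name}]\n{text}" for name, text in answers.items())
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Prompt given to each model: {prompt}\n\n"
                         f"Candidate answers:\n{listing}\n\n"
                         "Rate each answer from 1 to 5 for accuracy and descriptiveness, with a brief justification."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return result.choices[0].message.content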

📚 Documentation
Model Details
[Model Data]
Training Dataset
We extracted high-quality data from Japanese Wikipedia and FineWeb to create instruction data. Our innovative training approach allows for performance improvements across various languages and domains, making the model suitable for global use despite its focus on Japanese data.
https://huggingface.co/datasets/legacy-datasets/wikipedia
https://huggingface.co/datasets/HuggingFaceFW/fineweb
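A minimal sketch of pulling raw text from these two sources with the Hugging Face datasets library. The dump dates, configs, and filtering actually used for this model are not published, so the config names below are examples only.

from datasets import load_dataset

# FineWeb: stream a public sample config instead of downloading the full corpus
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

# Japanese Wikipedia: example snapshot; the legacy dataset is script-based, and some
# language dumps require extra processing, so adjust to whichever snapshot you can access
wikipedia_ja = load_dataset("legacy-datasets/wikipedia", "20220301.ja", split="train", trust_remote_code=True)

for example in fineweb.take(3):
    print(example["text"][:200])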
Data Preprocessing
We used a plain instruction tuning method to train the model on exemplary responses. This approach enhances the model's ability to understand and generate high-quality responses across various languages and contexts.
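As an illustration of what tuning on exemplary responses looks like at the data level, the sketch below formats one instruction/response pair with the Llama-3 chat template via transformers. The example pair and the template choice are assumptions for illustration, not the actual training code.

from transformers import AutoTokenizer

# Gated repository; requires accepting the Llama 3 license on the Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# A hypothetical instruction / exemplary-response pair (illustrative only)
example = {
    "instruction": "信号機の色とその意味を簡潔に説明してください。",  # "Briefly explain traffic light colors and their meanings."
    "response": "赤は停止、黄は注意、青(緑)は進めを意味します。",    # "Red means stop, yellow means caution, green means go."
}

messages = [
    {"role": "user", "content": example["instruction"]},
    {"role": "assistant", "content": example["response"]},
]

# Render the pair into Llama-3 chat format; the language-model loss during
# instruction tuning is computed over text like this
training_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(training_text)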
Implementation Information
[Pre-Instruction Training]
https://huggingface.co/instruction-pretrain/instruction-synthesizer
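The instruction-synthesizer linked above converts raw text into instruction-response pairs for pre-instruction training. The snippet below only sketches loading it with transformers and feeding it a raw passage; the exact input format and post-processing it expects are defined on its own model card, so treat the prompt string here as a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

synth_id = "instruction-pretrain/instruction-synthesizer"
synth_tokenizer = AutoTokenizer.from_pretrained(synth_id)
synth_model = AutoModelForCausalLM.from_pretrained(synth_id, torch_dtype=torch.bfloat16, device_map="auto")

# Placeholder passage standing in for a paragraph extracted from Wikipedia or FineWeb
raw_text = "A raw paragraph extracted from Japanese Wikipedia or FineWeb would go here."

inputs = synth_tokenizer(raw_text, return_tensors="pt").to(synth_model.device)
outputs = synth_model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(synth_tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))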
[Disclaimer]
This model is provided for research and development purposes only and should be regarded as an experimental prototype. It is not intended for commercial use or deployment in mission-critical environments. The use of this model is at the user's own risk, and its performance and results are not guaranteed. Axcxept Co., Ltd. shall not be liable for any direct, indirect, special, incidental, consequential damages, or any losses arising from the use of this model, regardless of the results obtained. Users should fully understand the risks associated with using this model and use it at their own discretion.
[Note]
Although we use a model from SakanaAI, our company, this model, and this Space have no direct relationship with SakanaAI. Please be respectful and refrain from contacting SakanaAI about this model.
[Hardware]
A100 × 8 (approximately 4 hours of training)
[Acknowledgment]
We would like to express our gratitude and respect to Meta for developing the base model, to SakanaAI for their customization, to the developers on each team, and to the many individuals who provided the automatic evaluation methods.
[We are.]

📄 License
This model is subject to the META LLAMA 3 COMMUNITY LICENSE.