🚀 Llama-3-EZO-VLM-1
Based on the Llama-3 architecture, this model is enhanced for Japanese usage and suitable for diverse global needs.

Based on SakanaAI/Llama-3-EvoVLM-JP-v2, it has been enhanced for Japanese usage through additional pre-training and instruction tuning.
This model is based on Llama-3-8B-Instruct and is subject to the Llama-3 Terms of Use. For detailed information, please refer to the official Llama-3 license page.
🚀 Quick Start
DEMO
https://huggingface.co/spaces/HODACHI/Llama-3-EZO-VLM-1
[Usage]
First, install the Mantis library:
pip install git+https://github.com/TIGER-AI-Lab/Mantis.git
import requests
from PIL import Image
import torch
from mantis.models.conversation import Conversation, SeparatorStyle
from mantis.models.mllava import chat_mllava, LlavaForConditionalGeneration, MLlavaProcessor
from mantis.models.mllava.utils import conv_templates
# Conversation template with a Japanese system prompt
# (English: "You are a sincere and excellent Japanese assistant. Unless otherwise instructed, always answer in Japanese.")
conv_llama_3_elyza = Conversation(
    system="<|start_header_id|>system<|end_header_id|>\n\nあなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。",
    roles=("user", "assistant"),
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3,
    sep="<|eot_id|>",
)
conv_templates["llama_3"] = conv_llama_3_elyza
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "HODACHI/Llama-3-EZO-VLM-1"
processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-8B-siglip-llama3")
processor.tokenizer.pad_token = processor.tokenizer.eos_token
model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map=device).eval()
# Deterministic decoding settings; these are forwarded to model.generate
generation_kwargs = {
    "max_new_tokens": 256,
    "num_beams": 1,
    "do_sample": False,
    "no_repeat_ngram_size": 3,
}
text = "<image>の信号は何色ですか?"
url_list = [
"https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
"https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
Image.open(requests.get(url_list[0], stream=True).raw).convert("RGB")
]
response, history = chat_mllava(text, images, model, processor, **generation_kwargs)
print(response)
text = "では、<image>の信号は?"
images += [
Image.open(requests.get(url_list[1], stream=True).raw).convert("RGB")
]
# Passing history keeps the previous turn (and its image) in context
response, history = chat_mllava(text, images, model, processor, history=history, **generation_kwargs)
print(response)
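The example above streams two images from the web; the same call also works with local files or with sampled decoding. A minimal variant continuing from the snippet above (the file name, question, and sampling settings below are illustrative, not part of the original example):

# Hypothetical variant: a local image and sampled decoding
local_image = Image.open("my_photo.jpg").convert("RGB")  # any local RGB image
sampling_kwargs = {
    "max_new_tokens": 256,
    "do_sample": True,   # sample instead of greedy decoding
    "temperature": 0.7,
    "no_repeat_ngram_size": 3,
}
# "What is shown in <image>?"
response, history = chat_mllava("<image>には何が写っていますか?", [local_image], model, processor, **sampling_kwargs)
print(response)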
✨ Features
This model is based on Llama-3-8B-Instruct, enhanced with multiple tuning techniques to improve its general performance. While it excels in Japanese language tasks, it's designed to meet diverse needs globally.
[Benchmark Results]
ElyzaTasks100
The model scores 0.7 points higher than the base model on this benchmark, a significant improvement.
Image Description Ability
In all four examples, the model shows improved recognition and description ability compared to the base model.
The following shows GPT-4o's evaluation of the outputs of GPT-4, SakanaAI's base model, and the EZO model for the same image and the same prompt.
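The exact judging prompt is not published; the sketch below only illustrates how such an LLM-as-judge comparison can be wired up with the standard OpenAI Python client. The image_url and answers arguments are placeholders, not artifacts of this evaluation.

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(image_url: str, prompt: str, answers: dict) -> str:
    """Ask GPT-4o to rate each candidate description of the same image."""
    listing = "\n\n".join(f"[{name}]\n{text}" for name, text in answers.items())
    result = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": f"Prompt given to each model: {prompt}\n\n"
                         f"Candidate answers:\n{listing}\n\n"
                         "Rate each answer from 1 to 5 for accuracy and descriptiveness, with a brief justification."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    )
    return result.choices[0].message.content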

📚 Documentation
Model Details
[Model Data]
Training Dataset
We extracted high-quality data from Japanese Wikipedia and FineWeb to create instruction data. Our innovative training approach allows for performance improvements across various languages and domains, making the model suitable for global use despite its focus on Japanese data.
https://huggingface.co/datasets/legacy-datasets/wikipedia
https://huggingface.co/datasets/HuggingFaceFW/fineweb
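A minimal sketch of pulling raw text from these two sources with the Hugging Face datasets library. The dump dates, configs, and filtering actually used for this model are not published, so the config names below are examples only.

from datasets import load_dataset

# FineWeb: stream a public sample config instead of downloading the full corpus
fineweb = load_dataset("HuggingFaceFW/fineweb", name="sample-10BT", split="train", streaming=True)

# Japanese Wikipedia: example snapshot; the legacy dataset is script-based, and some
# language dumps require extra processing, so adjust to whichever snapshot you can access
wikipedia_ja = load_dataset("legacy-datasets/wikipedia", "20220301.ja", split="train", trust_remote_code=True)

for example in fineweb.take(3):
    print(example["text"][:200])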
Data Preprocessing
We used a plain instruction tuning method to train the model on exemplary responses. This approach enhances the model's ability to understand and generate high-quality responses across various languages and contexts.
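As an illustration of what tuning on exemplary responses looks like at the data level, the sketch below formats one instruction/response pair with the Llama-3 chat template via transformers. The example pair and the template choice are assumptions for illustration, not the actual training code.

from transformers import AutoTokenizer

# Gated repository; requires accepting the Llama 3 license on the Hub
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

# A hypothetical instruction / exemplary-response pair (illustrative only)
example = {
    "instruction": "信号機の色とその意味を簡潔に説明してください。",  # "Briefly explain traffic light colors and their meanings."
    "response": "赤は停止、黄は注意、青(緑)は進めを意味します。",    # "Red means stop, yellow means caution, green means go."
}

messages = [
    {"role": "user", "content": example["instruction"]},
    {"role": "assistant", "content": example["response"]},
]

# Render the pair into Llama-3 chat format; the language-model loss during
# instruction tuning is computed over text like this
training_text = tokenizer.apply_chat_template(messages, tokenize=False)
print(training_text)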
Implementation Information
[Pre-Instruction Training]
https://huggingface.co/instruction-pretrain/instruction-synthesizer
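The instruction-synthesizer linked above converts raw text into instruction-response pairs for pre-instruction training. The snippet below only sketches loading it with transformers and feeding it a raw passage; the exact input format and post-processing it expects are defined on its own model card, so treat the prompt string here as a placeholder.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

synth_id = "instruction-pretrain/instruction-synthesizer"
synth_tokenizer = AutoTokenizer.from_pretrained(synth_id)
synth_model = AutoModelForCausalLM.from_pretrained(synth_id, torch_dtype=torch.bfloat16, device_map="auto")

# Placeholder passage standing in for a paragraph extracted from Wikipedia or FineWeb
raw_text = "A raw paragraph extracted from Japanese Wikipedia or FineWeb would go here."

inputs = synth_tokenizer(raw_text, return_tensors="pt").to(synth_model.device)
outputs = synth_model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(synth_tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))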
[Disclaimer]
This model is provided for research and development purposes only and should be regarded as an experimental prototype. It is not intended for commercial use or deployment in mission-critical environments. The use of this model is at the user's own risk, and its performance and results are not guaranteed. Axcxept Co., Ltd. shall not be liable for any direct, indirect, special, incidental, consequential damages, or any losses arising from the use of this model, regardless of the results obtained. Users should fully understand the risks associated with using this model and use it at their own discretion.
[Note]
Although we use a model from SakanaAI, our company, this model, and this Space have no direct relationship with SakanaAI. Please be respectful and refrain from contacting SakanaAI about this model.
[Hardware]
A100 × 8 (approximately 4 hours of training)
[Acknowledgment]
We would like to express our gratitude and respect to Meta for developing the base model, to SakanaAI for their customization, to the developers on each team, and to the many individuals who provided the automatic evaluation methods.
[We are.]

📄 License
This model is subject to the META LLAMA 3 COMMUNITY LICENSE.