Llama-3-EZO-VLM-1: Open-Source Japanese Vision-Language Model with Enhanced Japanese Capabilities for Diverse Applications

Llama 3 EZO VLM 1

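Developed by AXCXEPT

A Japanese vision-language model based on Llama-3-8B-Instruct, with Japanese capabilities strengthened through additional pre-training and instruction tuning

Image-to-Text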

Safetensors

Japanese #Japanese vision-language #Multimodal enhancement #Instruction tuning optimization

Downloads: 19

Released: 8/3/2024

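Model Overview

This model is based on Llama-3-8B-Instruct and uses several tuning techniques to improve its general performance. It performs well on Japanese tasks while also being designed to meet diverse global needs.

Model Features

Enhanced Japanese capability

Additional pre-training and instruction tuning strengthen Japanese language handling

Multimodal understanding

Combines vision and language capabilities to process both image and text input

Global applicability

Designed with diverse global needs in mind, not limited to Japanese tasks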

Model Capabilities

Image captioning

Visual question answering

Multi-turn dialogue

Cross-modal understanding

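Use Cases

Intelligent assistants

Image content Q&A

Answers a wide range of questions about image content

Performs well on tasks such as identifying traffic-light colors

Content understanding

Image captioning

Generates detailed text descriptions of images

Improved recognition and description ability compared with the base model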

🚀 Llama-3-EZO-VLM-1

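Llama-3-EZO-VLM-1 is based on the Llama-3-8B-Instruct model and uses several tuning techniques to improve general performance. Building on SakanaAI/Llama-3-EvoVLM-JP-v2, it strengthens Japanese usage through additional pre-training and instruction tuning. It performs well on Japanese tasks while also being designed to meet diverse global needs.

🚀 Quick Start

Install Dependencies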

pip install git+https://github.com/TIGER-AI-Lab/Mantis.git

Usage Example

import requests
from PIL import Image

import torch
from mantis.models.conversation import Conversation, SeparatorStyle
from mantis.models.mllava import chat_mllava, LlavaForConditionalGeneration, MLlavaProcessor
from mantis.models.mllava.utils import conv_templates
from transformers import AutoTokenizer

# 1. Set the system prompt
conv_llama_3_elyza = Conversation(
    system="<|start_header_id|>system<|end_header_id|>\n\nあなたは誠実で優秀な日本人のアシスタントです。特に指示が無い場合は、常に日本語で回答してください。",
    roles=("user", "assistant"),
    messages=(),
    offset=0,
    sep_style=SeparatorStyle.LLAMA_3,
    sep="<|eot_id|>",
)
conv_templates["llama_3"] = conv_llama_3_elyza

# 2. Load model
device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "HODACHI/Llama-3-EZO-VLM-1"

processor = MLlavaProcessor.from_pretrained("TIGER-Lab/Mantis-8B-siglip-llama3")
processor.tokenizer.pad_token = processor.tokenizer.eos_token

model = LlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map=device).eval()

# 3. Prepare a generate config
generation_kwargs = {
    "max_new_tokens": 256,
    "num_beams": 1,
    "do_sample": False,
    "no_repeat_ngram_size": 3,
}

# 4. Generate
text = "<image>の信号は何色ですか？"
url_list = [
    "https://images.unsplash.com/photo-1694831404826-3400c48c188d?q=80&w=2070&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D",
    "https://images.unsplash.com/photo-1693240876439-473af88b4ed7?q=80&w=1974&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
]
images = [
   Image.open(requests.get(url_list[0], stream=True).raw).convert("RGB")
]

response, history = chat_mllava(text, images, model, processor, **generation_kwargs)

print(response)
# 信号の色は、青色です。

# 5. Multi-turn conversation
text = "では、<image>の信号は？"
images += [
   Image.open(requests.get(url_list[1], stream=True).raw).convert("RGB")
]
response, history = chat_mllava(text, images, model, processor, history=history, **generation_kwargs)

print(response)
# 赤色
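
The template registered in step 1 relies on the Llama 3 chat layout, in which each turn is wrapped in header tokens and closed with `<|eot_id|>`. The following is a minimal sketch of how such a prompt string could be assembled. It illustrates the format only and is not Mantis's own implementation; the helper name `build_llama3_prompt` is hypothetical, and the `<|begin_of_text|>` token is left out on the assumption that the tokenizer prepends it.

```python
from typing import List, Tuple

EOT = "<|eot_id|>"  # end-of-turn separator, as in the template above


def _header(role: str) -> str:
    """Return the Llama 3 header that opens a turn for the given role."""
    return f"<|start_header_id|>{role}<|end_header_id|>\n\n"


def build_llama3_prompt(system: str, turns: List[Tuple[str, str]]) -> str:
    """Assemble a Llama-3-style prompt string (hypothetical helper).

    `turns` holds (role, text) pairs in order and should end with a user
    turn; a trailing assistant header is appended so that the model
    continues with its reply.
    """
    parts = [_header("system"), system, EOT]
    for role, text in turns:
        parts += [_header(role), text, EOT]
    parts.append(_header("assistant"))
    return "".join(parts)


prompt = build_llama3_prompt(
    "あなたは誠実で優秀な日本人のアシスタントです。",
    [("user", "<image>の信号は何色ですか？")],
)
print(prompt)
```

For a multi-turn exchange, earlier user and assistant turns would simply be added to `turns` in order before the newest user message.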

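✨ Key Features

Enhanced from SakanaAI/Llama-3-EvoVLM-JP-v2, with Japanese usage improved through additional pre-training and instruction tuning.
Uses several tuning techniques to improve general text-processing performance without degrading the original visual performance.
Focused on Japanese tasks, but designed to meet diverse global needs.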

📚 Detailed Documentation

Model Details

Developer: Axcxept co., ltd.
Model type: Autoregressive language model
Supported language: Japanese
License: META LLAMA 3 COMMUNITY LICENSE

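Model Data

Training Dataset

High-quality data was extracted from Japanese Wikipedia and FineWeb to create the instruction data. This training approach is intended to improve performance across languages and domains, so the model remains suitable for global use despite its focus on Japanese data.

Japanese Wikipedia: https://huggingface.co/datasets/legacy-datasets/wikipedia
FineWeb: https://huggingface.co/datasets/HuggingFaceFW/fineweb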
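
The card does not state the record format used for the instruction data. Purely as an illustration, one widely used layout stores each example as a list of chat messages. The sketch below packs a source passage and a question/answer pair into such a record; the helper name `to_chat_record`, the field names, and the sample text are assumptions, not the authors' pipeline.

```python
from typing import Dict, List


def to_chat_record(
    passage: str, question: str, answer: str, system: str = ""
) -> Dict[str, List[Dict[str, str]]]:
    """Pack a passage and a Q/A pair into a chat-style record (hypothetical layout)."""
    messages: List[Dict[str, str]] = []
    if system:
        messages.append({"role": "system", "content": system})
    # The passage is supplied as context ahead of the question.
    messages.append({"role": "user", "content": f"{passage}\n\n{question}"})
    messages.append({"role": "assistant", "content": answer})
    return {"messages": messages}


record = to_chat_record(
    passage="富士山は日本で最も高い山である。",
    question="日本で最も高い山は何ですか？",
    answer="富士山です。",
)
print(record)
```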

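Data Preprocessing

A plain instruction-tuning method is used to have the model learn from example responses. This approach improves the model's ability to understand and generate high-quality responses across languages and contexts.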
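
In supervised instruction tuning, the model is commonly trained to reproduce the example response given the prompt, with the loss computed only on the response tokens. The sketch below shows that masking step on plain token-ID lists. The `-100` value follows the default `ignore_index` of PyTorch's cross-entropy loss; the helper name and the toy token IDs are hypothetical, and the card does not say whether the authors masked prompt tokens.

```python
from typing import Dict, List

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_sft_example(
    prompt_ids: List[int], response_ids: List[int], eos_id: int
) -> Dict[str, List[int]]:
    """Build input IDs and labels so that only response tokens contribute to the loss."""
    input_ids = prompt_ids + response_ids + [eos_id]
    # Prompt positions are masked out; response and EOS positions keep their IDs.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids + [eos_id]
    return {"input_ids": input_ids, "labels": labels}


example = build_sft_example([11, 12, 13], [21, 22], eos_id=2)
print(example["input_ids"])  # [11, 12, 13, 21, 22, 2]
print(example["labels"])     # [-100, -100, -100, 21, 22, 2]
```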

Implementation Information

[Pre-Instruction Training] https://huggingface.co/instruction-pretrain/instruction-synthesizer

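Benchmark Results

ElyzaTasks100

[image] Performance improved by 0.7 points over the base model.

Image Captioning Ability

[image] In all four examples, both recognition and captioning ability improved over the base model.

The outputs of GPT-4, SakanaAI's base model, and the EZO model for the same image and the same prompt were evaluated by GPT-4o, with the following results: [image]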

DEMO

https://huggingface.co/spaces/HODACHI/Llama-3-EZO-VLM-1

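Disclaimer

This model is intended for research and development purposes only and should be regarded as an experimental prototype. It is not intended for commercial use or for deployment in mission-critical environments. Use of this model is at the user's own risk, and no guarantee is given regarding its performance or results. Axcxept co., ltd. accepts no liability for any direct, indirect, special, incidental, or consequential damages, or for any loss arising from use of this model, regardless of the outcome. Users should fully understand the risks involved in using this model and decide for themselves whether to use it.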