llm-jp-3-vila-14b: an open-source, free-to-use vision-language model for image understanding and Japanese/English text generation

Llm Jp 3 Vila 14b

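Developed by llm-jp

A large vision-language model developed by Japan's National Institute of Informatics. It supports Japanese and English, takes images as input, and generates text about them.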

Image-to-Text

Safetensors

#Japanese #Japanese visual question answering #Multimodal large model #SigLIP vision encoder

Downloads: 106

Released: 10/26/2024

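Model Overview

This vision-language model combines a vision encoder with a large language model. It interprets image content and generates text that describes the image or answers questions about it.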

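Model Features

Bilingual support

Handles vision-language understanding and generation in both Japanese and English.

Three-stage training

Training proceeds in stages: first the projector is tuned, then the projector and the LLM are trained jointly, and finally both are fine-tuned.

SigLIP vision encoder

Uses siglip-so400m-patch14-384 (428M parameters) as the vision encoder.

Benchmark results

On LLM-judge scores, the model outperforms the other open models it is compared against on Heron Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA500.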

Model Capabilities

Image content understanding

Image caption generation

Visual question answering

Multimodal dialogue

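Use Cases

Content understanding and generation

Image captioning

Generates a detailed text description of an image.

Scores 57.2% (LLM judge) on Heron Bench.

Visual question answering

Answers natural-language questions about image content.

Scores 3.62/5.0 (LLM judge) on JA-VG-VQA500.

Multimodal applications

Image-grounded dialogue

Holds a natural-language conversation based on image content.

Scores 3.69/5.0 (LLM judge) on JA-VLM-Bench-In-the-Wild.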

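🚀 LLM-jp-3 VILA 14B

This repository provides a large vision-language model (VLM) developed by the Research and Development Center for Large Language Models at the National Institute of Informatics, Japan. The model takes images and text as input and supports tasks such as image understanding and text generation.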

🚀 Quick Start

Requirements

Python version: 3.10.12

Installation

Clone the repository and install the dependencies.

```bash
git clone git@github.com:llm-jp/llm-jp-VILA.git
cd llm-jp-VILA
```

```bash
python3 -m venv venv
source venv/bin/activate
```

```bash
pip install --upgrade pip
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"
```

```bash
pip install git+https://github.com/huggingface/transformers@v4.36.2
cp -rv ./llava/train/transformers_replace/* ./venv/lib/python3.10/site-packages/transformers/
```

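The flash-attention wheel used above is a prebuilt binary, and its filename encodes the environment it was built for. The sketch below is only an illustration, not part of the repository: it splits that filename into its build tags and compares the Python tag with the running interpreter. The helper name `parse_flash_attn_wheel` is hypothetical, and the regular expression assumes the naming pattern seen in this one filename.

```python
import re
import sys

# Wheel filename copied from the installation commands above.
WHEEL = "flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"


def parse_flash_attn_wheel(filename):
    """Split the wheel filename into the build tags it encodes (hypothetical helper)."""
    pattern = (
        r"flash_attn-(?P<version>[\d.]+)"
        r"\+cu(?P<cuda>\d+)torch(?P<torch>[\d.]+)"
        r"cxx11abi(?P<cxx11abi>TRUE|FALSE)"
        r"-(?P<python>cp\d+)-(?P<abi>cp\d+)-(?P<platform>.+)\.whl"
    )
    match = re.fullmatch(pattern, filename)
    if match is None:
        raise ValueError(f"unexpected wheel name: {filename}")
    return match.groupdict()


tags = parse_flash_attn_wheel(WHEEL)
print(tags)

# cp310 means CPython 3.10; compare it with the interpreter running this script.
running = f"cp{sys.version_info.major}{sys.version_info.minor}"
print("interpreter tag:", running, "| matches wheel:", running == tags["python"])
```

Read this way, the wheel targets flash-attn 2.4.2 built against CUDA 11.8 and PyTorch 2.0 for CPython 3.10 on Linux x86_64, which agrees with the Python 3.10.12 requirement and the `python3.10` path in the `cp` command.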
Run the Python script. Replace `image_path` and `query` with your own values.

```python
from io import BytesIO

import requests
import torch
from PIL import Image

from llava.constants import IMAGE_TOKEN_INDEX
from llava.conversation import conv_templates
from llava.mm_utils import (get_model_name_from_path, process_images,
                            tokenizer_image_token)
from llava.model.builder import load_pretrained_model
from llava.utils import disable_torch_init


def load_image(image_file):
    """Load an image from a URL or a local path and convert it to RGB."""
    if image_file.startswith(("http://", "https://")):
        response = requests.get(image_file)
        response.raise_for_status()
        image = Image.open(BytesIO(response.content)).convert("RGB")
    else:
        image = Image.open(image_file).convert("RGB")
    return image


def load_images(image_files):
    """Load every image in the list."""
    return [load_image(image_file) for image_file in image_files]


disable_torch_init()

# Load the tokenizer, model, and image processor from the checkpoint.
model_checkpoint_path = "llm-jp/llm-jp-3-vila-14b"
model_name = get_model_name_from_path(model_checkpoint_path)
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_checkpoint_path, model_name
)

image_path = "path/to/image"
image_files = [image_path]
images = load_images(image_files)

# The <image> placeholder marks where the image features enter the prompt.
# The Japanese text asks: "Please describe this image."
query = "<image>\nこの画像について説明してください。"

# Build the prompt with the llmjp_v3 conversation template.
conv_mode = "llmjp_v3"
conv = conv_templates[conv_mode].copy()
conv.append_message(conv.roles[0], query)
conv.append_message(conv.roles[1], None)
prompt = conv.get_prompt()

images_tensor = process_images(images, image_processor, model.config).to(
    model.device, dtype=torch.float16
)
input_ids = tokenizer_image_token(
    prompt, tokenizer, IMAGE_TOKEN_INDEX, return_tensors="pt"
).unsqueeze(0).cuda()

# Greedy decoding, up to 256 new tokens.
with torch.inference_mode():
    output_ids = model.generate(
        input_ids,
        images=[images_tensor],
        do_sample=False,
        num_beams=1,
        max_new_tokens=256,
        use_cache=True,
    )

outputs = tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
print(outputs)
```

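✨ Key Features

Model Architecture

| Property | Details |
|---|---|
| Model type | Large vision-language model (VLM) |
| Vision encoder | siglip-so400m-patch14-384 (428M parameters) |
| Projector | 2-layer MLP (32M parameters) |
| LLM | llm-jp-3-13b-instruct (13B parameters) |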
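
As a rough check on the component sizes listed above, the sketch below adds up the three rounded parameter counts and shows each component's share. It also derives the patch grid implied by the encoder name (384-pixel input, 14-pixel patches). The `Component` dataclass is a hypothetical illustration, and because the table figures are rounded, the total is approximate. The source does not say how many visual tokens the projector passes to the LLM, so the patch count describes the encoder output only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    params: float  # parameter count as rounded in the architecture table


# Figures copied from the architecture table; they are rounded values.
COMPONENTS = [
    Component("vision encoder (siglip-so400m-patch14-384)", 428e6),
    Component("projector (2-layer MLP)", 32e6),
    Component("LLM (llm-jp-3-13b-instruct)", 13e9),
]

total = sum(c.params for c in COMPONENTS)
for c in COMPONENTS:
    print(f"{c.name:45s} {c.params / 1e9:6.3f}B  ({c.params / total:.1%})")
print(f"{'total':45s} {total / 1e9:6.3f}B")

# The encoder name encodes a 384x384 input split into 14x14-pixel patches.
patch_size, resolution = 14, 384
patches_per_side = resolution // patch_size
num_patches = patches_per_side ** 2
print(f"patch grid: {patches_per_side}x{patches_per_side} = {num_patches} patches per image")
```

With these rounded inputs the sum is about 13.46B parameters, nearly all of it in the LLM, and the encoder produces a 27x27 grid of 729 patches per image.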

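Training Data

The model is trained in three stages.

Stage 0

Datasets used to tune the projector parameters:

| Language | Dataset | Images |
|---|---|---|
| Japanese | Japanese image text pairs | 558K |
| English | LLaVA-Pretrain | 558K |

Stage 1

Datasets used to tune the projector and LLM parameters:

| Language | Dataset | Images |
|---|---|---|
| Japanese | Japanese image text pairs | 6M |
| Japanese | Japanese interleaved data | 6M |
| English | coyo (subset) | 6M |
| English | mmc4-core (subset) | 6M |

Stage 2

Datasets used to tune the projector and LLM parameters:

| Language | Dataset | Images |
|---|---|---|
| Japanese | llava-instruct-ja | 156K |
| Japanese | japanese-photos-conv | 12K |
| Japanese | ja-vg-vqa | 99K |
| Japanese | synthdog-ja (subset) | 102K |
| English | LLaVA | 158K |
| English | VQAv2 | 53K |
| English | GQA | 46K |
| English | OCRVQA | 80K |
| English | TextVQA | 22K |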
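
To make the three-stage recipe easier to compare at a glance, the sketch below records which components each stage tunes and totals the image counts per language. The numbers are copied from the stage tables above. The `STAGES` structure and the `images_by_language` helper are hypothetical names for illustration, and the sums treat every table row as independent, since the source does not say whether datasets share images.

```python
# Per stage: which components are tuned, and (language, dataset, images in thousands).
STAGES = {
    0: {
        "trains": ("projector",),
        "data": [
            ("ja", "Japanese image text pairs", 558),
            ("en", "LLaVA-Pretrain", 558),
        ],
    },
    1: {
        "trains": ("projector", "LLM"),
        "data": [
            ("ja", "Japanese image text pairs", 6000),
            ("ja", "Japanese interleaved data", 6000),
            ("en", "coyo (subset)", 6000),
            ("en", "mmc4-core (subset)", 6000),
        ],
    },
    2: {
        "trains": ("projector", "LLM"),
        "data": [
            ("ja", "llava-instruct-ja", 156),
            ("ja", "japanese-photos-conv", 12),
            ("ja", "ja-vg-vqa", 99),
            ("ja", "synthdog-ja (subset)", 102),
            ("en", "LLaVA", 158),
            ("en", "VQAv2", 53),
            ("en", "GQA", 46),
            ("en", "OCRVQA", 80),
            ("en", "TextVQA", 22),
        ],
    },
}


def images_by_language(stage):
    """Sum the image counts (in thousands) per language for one stage."""
    totals = {}
    for lang, _dataset, thousands in STAGES[stage]["data"]:
        totals[lang] = totals.get(lang, 0) + thousands
    return totals


for stage, spec in STAGES.items():
    per_lang = images_by_language(stage)
    total = sum(per_lang.values())
    print(f"stage {stage}: tunes {' + '.join(spec['trains'])}; "
          f"{total}K images {per_lang}")
```

By this count, Japanese and English data are balanced in stages 0 and 1 (558K each, then 12M each) and close to balanced in stage 2 (369K Japanese, 359K English).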

Evaluation Results

The model was evaluated on Heron Bench, JA-VLM-Bench-In-the-Wild, and JA-VG-VQA500, with gpt-4o-2024-05-13 as the LLM judge.

Heron Bench

| Model | LLM-judge score (%) |
|---|---|
| Japanese InstructBLIP Alpha | 14.0 |
| Japanese Stable VLM | 24.2 |
| Llama-3-EvoVLM-JP-v2 | 39.3 |
| LLaVA-CALM2-SigLIP | 43.3 |
| llm-jp-3-vila-14b (this model) | 57.2 |
| GPT-4o | 87.6 |

JA-VLM-Bench-In-the-Wild

| Model | ROUGE-L | LLM-judge score (/5.0) |
|---|---|---|
| Japanese InstructBLIP Alpha | 20.8 | 2.42 |
| Japanese Stable VLM | 23.3 | 2.47 |
| Llama-3-EvoVLM-JP-v2 | 41.4 | 2.92 |
| LLaVA-CALM2-SigLIP | 47.2 | 3.15 |
| llm-jp-3-vila-14b (this model) | 52.3 | 3.69 |
| GPT-4o | 37.6 | 3.85 |

JA-VG-VQA500

| Model | ROUGE-L | LLM-judge score (/5.0) |
|---|---|---|
| Japanese InstructBLIP Alpha | -- | -- |
| Japanese Stable VLM | -- | -- |
| Llama-3-EvoVLM-JP-v2 | 23.5 | 2.96 |
| LLaVA-CALM2-SigLIP | 17.4 | 3.21 |
| llm-jp-3-vila-14b (this model) | 16.2 | 3.62 |
| GPT-4o | 12.1 | 3.58 |
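
The ROUGE-L columns above measure token overlap with a reference answer, based on the longest common subsequence (LCS). The sketch below is a minimal, generic implementation for illustration only. It is not the evaluation code behind these tables, and the source does not state which tokenizer or F-measure weighting was used. Whitespace tokenization and `beta=1.0` are assumptions here; Japanese text would need a morphological tokenizer or character-level tokens instead.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, candidate, beta=1.0):
    """ROUGE-L F-measure from LCS-based precision and recall (beta is an assumption)."""
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    precision = lcs / len(candidate)
    recall = lcs / len(reference)
    return (1 + beta**2) * precision * recall / (recall + beta**2 * precision)


reference = "a cat sits on the mat".split()
candidate = "the cat sits on a mat".split()
print(f"LCS length: {lcs_length(reference, candidate)}")
print(f"ROUGE-L: {rouge_l(reference, candidate):.3f}")
```

Because ROUGE-L rewards wording that matches the reference, an answer can be judged well by an LLM and still score low on overlap. That may be one reason the two columns rank the models differently, for example GPT-4o on JA-VG-VQA500.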