TinyLLaVA Open-Source Multimodal Model: Free Deployment for Efficient Vision-Language Tasks


Tinyllava OpenELM 450M SigLIP 0.89B

Developed by jiajunlong

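TinyLLaVA is a family of small-scale multimodal models. This model combines the OpenELM-450M language model with a SigLIP vision encoder, for about 0.89B parameters in total, and targets efficient vision-language tasks.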

Image-Text-to-Text

Transformers

License: Apache-2.0 #small-scale multimodal #efficient visual question answering #lightweight LLM integration

Downloads: 102

Released: 4/29/2024

Model Overview

TinyLLaVA is a lightweight multimodal model that pairs a language model with a vision model to handle tasks that combine images and text.

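Model Features

Lightweight and efficient

The small parameter count suits resource-constrained environments, while performance still exceeds that of some larger models.

Multimodal support

Accepts image and text input together to perform tasks such as visual question answering.

Modular design

Supports many combinations of language models and vision models, which makes it highly flexible.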
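
To make "resource-constrained" more concrete, here is a rough back-of-the-envelope estimate of the memory needed for the weights alone. It assumes about 0.89B total parameters (read off the model name), and it ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name and the precision table are illustrative and not part of TinyLLaVA.

```python
# Rough weight-memory estimate: weights only, no activations or KV cache.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Return the approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    total_params = 0.89e9  # assumption: ~0.89B parameters, taken from the model name
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: ~{weight_memory_gb(total_params, dtype):.2f} GB")
```

Under these assumptions, half-precision weights come to a little under 2 GB, which is the kind of footprint that fits on a small consumer GPU.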

Model Capabilities

Visual question answering

Image captioning

Multimodal understanding

Text generation

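Use Cases

Education

Visual question answering

Answers questions about image content; suited to interactive learning in educational settings.

Reaches 71.74 accuracy on the VQAv2 dataset.

Content generation

Image captioning

Generates detailed text descriptions of images; suited to accessibility services or content annotation.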

🚀 TinyLLaVA

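TinyLLaVA has released a family of small-scale large multimodal models (LMMs), ranging from 0.55B to 3.1B parameters. Our best model, TinyLLaVA-Phi-2-SigLIP-3.1B, achieves better overall performance than existing 7B models such as LLaVA-1.5 and Qwen-VL.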

🚀 Quick Start

Model Introduction

Here we introduce TinyLLaVA-OpenELM-450M-SigLIP-0.89B, which was trained with the TinyLLaVA Factory codebase. For the large language model (LLM) and the vision tower, we chose [OpenELM-450M-Instruct](apple/OpenELM-450M-Instruct) and [siglip-so400m-patch14-384](https://huggingface.co/google/siglip-so400m-patch14-384), respectively. The model was trained on the [LLaVA](https://github.com/haotian-liu/LLaVA/blob/main/docs/Data.md) dataset.
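
The vision tower's name encodes its input geometry: `patch14` and `384` indicate 14-pixel patches on a 384x384 image. As a small worked example, the sketch below computes how many patch embeddings an encoder with that geometry produces per image. It assumes a non-overlapping patch embedding that drops the partial patch at the image border, and it makes no claim about how many of these tokens TinyLLaVA actually passes on to the language model. The helper name `patch_grid` is hypothetical.

```python
from typing import Tuple


def patch_grid(image_size: int, patch_size: int) -> Tuple[int, int]:
    """Return (patches per side, total patches) for a square image."""
    per_side = image_size // patch_size  # partial border patch is dropped
    return per_side, per_side * per_side


if __name__ == "__main__":
    per_side, total = patch_grid(image_size=384, patch_size=14)
    print(per_side, total)  # 27 729
```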

Usage Examples

Basic usage

Run the following test code:

from transformers import AutoTokenizer, AutoModelForCausalLM

hf_path = 'jiajunlong/TinyLLaVA-OpenELM-450M-SigLIP-0.89B'
# trust_remote_code=True loads the custom model code bundled with the checkpoint
model = AutoModelForCausalLM.from_pretrained(hf_path, trust_remote_code=True)
model.cuda()  # move the model to the GPU; requires a CUDA device
config = model.config
tokenizer = AutoTokenizer.from_pretrained(
    hf_path,
    use_fast=False,
    model_max_length=config.tokenizer_model_max_length,
    padding_side=config.tokenizer_padding_side,
)
prompt = "What are these?"
image_url = "http://images.cocodataset.org/test-stuff2017/000000000001.jpg"
# chat() returns the generated text and the generation time
output_text, generation_time = model.chat(prompt=prompt, image=image_url, tokenizer=tokenizer)
print('model output:', output_text)
print('running time:', generation_time)
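
For visual question answering you will often want to ask several questions about the same image. The sketch below wraps that loop in a small helper. It assumes only what the snippet above shows: a chat callable that accepts `prompt` and `image` keyword arguments and returns a `(text, time)` pair. The names `ask_many`, `total_time`, and `fake_chat` are hypothetical, and `fake_chat` is a stand-in so the example runs without downloading the model.

```python
def ask_many(chat_fn, image, prompts, **chat_kwargs):
    """Ask each prompt about the same image and collect (prompt, text, seconds)."""
    results = []
    for prompt in prompts:
        text, seconds = chat_fn(prompt=prompt, image=image, **chat_kwargs)
        results.append((prompt, text, seconds))
    return results


def total_time(results):
    """Sum the per-question generation times."""
    return sum(seconds for _, _, seconds in results)


def fake_chat(prompt, image, tokenizer=None):
    """Stand-in for model.chat; returns a canned answer and a fixed time."""
    return f"answer to: {prompt}", 0.5


if __name__ == "__main__":
    questions = ["What are these?", "What color are they?"]
    results = ask_many(fake_chat, "example.jpg", questions, tokenizer=None)
    for prompt, text, seconds in results:
        print(prompt, "->", text, f"({seconds}s)")
    print("total:", total_time(results))
```

With the real model loaded as above, the call would presumably look like `ask_many(model.chat, image_url, questions, tokenizer=tokenizer)`.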

Results

| Model | GQA | TextVQA | SQA | VQAv2 | MME | MMB | MM-VET |
| --- | --- | --- | --- | --- | --- | --- | --- |
| [TinyLLaVA-1.5B](https://huggingface.co/bczhou/TinyLLaVA-1.5B) | 60.3 | 51.7 | 60.3 | 76.9 | 1276.5 | 55.2 | 25.8 |
| [TinyLLaVA-0.89B](https://huggingface.co/jiajunlong/TinyLLaVA-OpenELM-450M-SigLIP-0.89B) | 53.87 | 44.02 | 54.09 | 71.74 | 1118.75 | 37.8 | 20 |
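
Because the benchmarks use different scales (MME is not a percentage, for example), a ratio is an easier way to compare the two rows than raw differences. The sketch below computes, per benchmark, the fraction of the 1.5B model's score that the 0.89B model reaches. The scores are copied from the results table; the helper name `retained_fraction` is illustrative.

```python
BENCHMARKS = ["GQA", "TextVQA", "SQA", "VQAv2", "MME", "MMB", "MM-VET"]

# Scores copied from the results table.
SCORES = {
    "TinyLLaVA-1.5B": [60.3, 51.7, 60.3, 76.9, 1276.5, 55.2, 25.8],
    "TinyLLaVA-0.89B": [53.87, 44.02, 54.09, 71.74, 1118.75, 37.8, 20.0],
}


def retained_fraction(small, large):
    """Return small/large per benchmark as a dict keyed by benchmark name."""
    return {name: s / l for name, s, l in zip(BENCHMARKS, small, large)}


if __name__ == "__main__":
    ratios = retained_fraction(SCORES["TinyLLaVA-0.89B"], SCORES["TinyLLaVA-1.5B"])
    for name, ratio in ratios.items():
        print(f"{name:8s} {ratio:.1%}")
```

On these numbers the smaller model stays closest to the 1.5B model on VQAv2 and falls furthest behind on MMB.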