🚀 HelpingAI-Vision
HelpingAI-Vision is an image-to-text generation model built on HelpingAI-Lite with a LLaVA adapter. It understands image scenes in finer detail, making it well suited to chat and similar scenarios.
🚀 Quick Start
You can click the button below to open the project in Google Colab:
✨ Key Features
The core idea behind HelpingAI-Vision is to generate one token embedding for each of the N parts of an image, rather than N visual token embeddings for the image as a whole. This approach, built on HelpingAI-Lite with a LLaVA adapter, aims to improve scene understanding by capturing more detailed information.
For each crop of the image, the full SigLIP encoder produces one embedding of shape [1, 1152]. All N embeddings are then passed through the LLaVA adapter, yielding token embeddings of shape [N, 2560]. At present these tokens carry no explicit information about their position within the original image; positional information is planned for a later update.
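As a rough illustration of the shapes described above, here is a minimal sketch. The random tensor stands in for the SigLIP per-crop embeddings, the `nn.Linear` is only a stand-in for the actual LLaVA adapter, and `N` is chosen arbitrarily:

```python
import torch
import torch.nn as nn

N = 6  # hypothetical number of image crops

# Stand-in for SigLIP: one [1, 1152] embedding per crop, stacked to [N, 1152]
siglip_crop_embeddings = torch.randn(N, 1152)

# Toy stand-in for the LLaVA adapter, mapping crop embeddings to token embeddings
adapter = nn.Linear(1152, 2560)

token_embeddings = adapter(siglip_crop_embeddings)
print(token_embeddings.shape)  # torch.Size([6, 2560])
```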
HelpingAI-Vision was fine-tuned from MC-LLaVA-3b. The model uses the ChatML prompt format, which makes it a good fit for chat-based applications.
📦 Installation
Install dependencies
```python
!pip install -q open_clip_torch timm einops
```
Download model files
```python
from huggingface_hub import hf_hub_download

# Fetch the model's custom configuration, modeling, and processing code
# so the classes can be imported locally.
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="processing_llava.py", local_dir="./", force_download=True)
```
💻 Usage Example
Basic usage
```python
import torch
import requests
from PIL import Image
from transformers import AutoTokenizer, TextStreamer

from modeling_llava import LlavaForConditionalGeneration
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

# Load the model in half precision and move it to the GPU
model = LlavaForConditionalGeneration.from_pretrained(
    "OEvortex/HelpingAI-Vision", torch_dtype=torch.float16
)
model = model.to("cuda")

# Build the processor from the model's preprocessing config
tokenizer = AutoTokenizer.from_pretrained("OEvortex/HelpingAI-Vision")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)

# Fetch an example image
image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

# ChatML prompt; <image> marks where the image tokens are inserted
prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""

with torch.inference_mode():
    inputs = processor(prompt, raw_image, model, return_tensors='pt')

inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)

# Stream tokens to stdout as they are generated
streamer = TextStreamer(tokenizer)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        top_p=0.9,
        temperature=1.2,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )

print(tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", ""))
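```

The call above samples with `temperature=1.2` and `top_p=0.9`, so each run produces a different description. If you prefer reproducible output, sampling can be disabled; a minimal greedy variant of the same `generate` call:

```python
# Greedy, deterministic variant of the generation call above
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )
```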
📚 Documentation
The model uses the following ChatML prompt format, which can be used in chat scenarios:
```
<|im_start|>system
You are Vortex, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
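A small helper can fill this template in. `build_chatml_prompt` below is a hypothetical function, not part of the repository; for image inputs, add an `<image>` placeholder before the question, as in the usage example above:

```python
# Hypothetical helper (not shipped with the model) that fills the
# ChatML template shown above.
def build_chatml_prompt(user_message: str,
                        system_message: str = "You are Vortex, a helpful AI assistant.") -> str:
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("Describe the image."))
```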
📄 License
This project is released under an "other" license named hsul; detailed license information is available via this link.
Model Information

| Property | Details |
|----------|---------|
| Model type | Image-to-text generation model |
| Training data | Not specified |
| Fine-tuned from | MC-LLaVA-3b |
| Prompt format | ChatML |