🚀 HelpingAI-Vision
HelpingAI-Vision is an image-to-text generation model built on HelpingAI-Lite with a LLaVA adapter. It understands image scenes in finer detail, making it well suited to chat and similar scenarios.
🚀 Quick Start
You can click the button below to open the project in Google Colab:
✨ Key Features
The core idea behind HelpingAI-Vision is to generate one token embedding for each of the N parts of an image, rather than N visual token embeddings for the image as a whole. This approach, built on HelpingAI-Lite with a LLaVA adapter, aims to improve scene understanding by capturing more detailed information.
For each crop of the image, the full SigLIP encoder produces one embedding of shape [1, 1152]. All N embeddings are then passed through the LLaVA adapter, yielding token embeddings of shape [N, 2560]. At present these tokens carry no explicit information about their position within the original image; positional information is planned for a later update.
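As a rough illustration of the shapes described above, here is a minimal sketch. The random tensor stands in for the SigLIP per-crop embeddings, the `nn.Linear` is only a stand-in for the actual LLaVA adapter, and `N` is chosen arbitrarily:

```python
import torch
import torch.nn as nn

N = 6  # hypothetical number of image crops

# Stand-in for SigLIP: one [1, 1152] embedding per crop, stacked to [N, 1152]
siglip_crop_embeddings = torch.randn(N, 1152)

# Toy stand-in for the LLaVA adapter, mapping crop embeddings to token embeddings
adapter = nn.Linear(1152, 2560)

token_embeddings = adapter(siglip_crop_embeddings)
print(token_embeddings.shape)  # torch.Size([6, 2560])
```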
HelpingAI-Vision was fine-tuned from MC-LLaVA-3b. The model uses the ChatML prompt format, which makes it a good fit for chat-based applications.
📦 Installation
Install dependencies
```python
!pip install -q open_clip_torch timm einops
```
Download model files
```python
from huggingface_hub import hf_hub_download

# Fetch the model's custom configuration, modeling, and processing code
# so the classes can be imported locally.
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="processing_llava.py", local_dir="./", force_download=True)
```
💻 Usage Example
Basic usage
```python
import torch
import requests
from PIL import Image
from transformers import AutoTokenizer, TextStreamer

from modeling_llava import LlavaForConditionalGeneration
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

# Load the model in half precision and move it to the GPU
model = LlavaForConditionalGeneration.from_pretrained(
    "OEvortex/HelpingAI-Vision", torch_dtype=torch.float16
)
model = model.to("cuda")

# Build the processor from the model's preprocessing config
tokenizer = AutoTokenizer.from_pretrained("OEvortex/HelpingAI-Vision")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)

# Fetch an example image
image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

# ChatML prompt; <image> marks where the image tokens are inserted
prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""

with torch.inference_mode():
    inputs = processor(prompt, raw_image, model, return_tensors='pt')

inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)

# Stream tokens to stdout as they are generated
streamer = TextStreamer(tokenizer)

with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=True,
        top_p=0.9,
        temperature=1.2,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )

print(tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", ""))
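```

The call above samples with `temperature=1.2` and `top_p=0.9`, so each run produces a different description. If you prefer reproducible output, sampling can be disabled; a minimal greedy variant of the same `generate` call:

```python
# Greedy, deterministic variant of the generation call above
with torch.inference_mode():
    output = model.generate(
        **inputs,
        max_new_tokens=200,
        do_sample=False,
        eos_token_id=tokenizer.eos_token_id,
        streamer=streamer,
    )
```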
📚 Documentation
The model uses the following ChatML prompt format, which can be used in chat scenarios:
```
<|im_start|>system
You are Vortex, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
```
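A small helper can fill this template in. `build_chatml_prompt` below is a hypothetical function, not part of the repository; for image inputs, add an `<image>` placeholder before the question, as in the usage example above:

```python
# Hypothetical helper (not shipped with the model) that fills the
# ChatML template shown above.
def build_chatml_prompt(user_message: str,
                        system_message: str = "You are Vortex, a helpful AI assistant.") -> str:
    return (
        f"<|im_start|>system\n{system_message}<|im_end|>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        f"<|im_start|>assistant\n"
    )

print(build_chatml_prompt("Describe the image."))
```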
📄 License
This project is released under an "other" license named hsul; detailed license information is available via this link.
Model Information

| Property | Details |
|----------|---------|
| Model type | Image-to-text generation model |
| Training data | Not specified |
| Fine-tuned from | MC-LLaVA-3b |
| Prompt format | ChatML |