🚀 HelpingAI-Vision
HelpingAI-Vision is a model designed to enhance scene understanding by generating token embeddings for image parts, based on HelpingAI-Lite and incorporating the LLaVA adapter.
✨ Features
The fundamental concept behind HelpingAI-Vision is to generate one token embedding per N parts of an image, as opposed to producing N visual token embeddings for the entire image. This approach, based on the HelpingAI-Lite and incorporating the LLaVA adapter, aims to enhance scene understanding by capturing more detailed information.
For every crop of the image, an embedding is generated using the full SigLIP encoder (size [1, 1152]). Subsequently, all N embeddings undergo processing through the LLaVA adapter, resulting in a token embedding of size [N, 2560]. Currently, these tokens lack explicit information about their position in the original image, with plans to incorporate positional information in a later update.
HelpingAI-Vision was fine - tuned from MC-LLaVA-3b.
The model adopts the ChatML prompt format, suggesting its potential application in chat - based scenarios. If you have specific queries or would like further details, feel free ask
<|im_start|>system
You are Vortex, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
📦 Installation
Install dependencies
!pip install -q open_clip_torch timm einops
Download modeling files
from huggingface_hub import hf_hub_download
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="processing_llava.py", local_dir="./", force_download=True)
💻 Usage Examples
Basic Usage
from modeling_llava import LlavaForConditionalGeneration
import torch
model = LlavaForConditionalGeneration.from_pretrained("OEvortex/HelpingAI-Vision", torch_dtype=torch.float16)
model = model.to("cuda")
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor
tokenizer = AutoTokenizer.from_pretrained("OEvortex/HelpingAI-Vision")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)
from PIL import Image
import requests
image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)
prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""
with torch.inference_mode():
inputs = processor(prompt, raw_image, model, return_tensors='pt')
inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)
from transformers import TextStreamer
streamer = TextStreamer(tokenizer)
%%time
with torch.inference_mode():
output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9, temperature=1.2, eos_token_id=tokenizer.eos_token_id, streamer=streamer)
print(tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", ""))
📄 License
The model is released under the hsul license.
Property |
Details |
Library Name |
transformers |
Base Model |
visheratin/MC-LLaVA-3b |
Pipeline Tag |
image-text-to-text |