HelpingAI-Vision Open-Source Vision-Language Model - Enhance Scene Understanding and Boost Visual Scene Analysis Applications

HelpingAI-Vision

Developed by OEvortex

HelpingAI-Vision is a vision-language model that aims to improve scene understanding by generating a separate visual token embedding for each partition of an image.

Image-to-Text

Transformers

English

Open Source License: Other

#Partitioned Visual Embedding #Multimodal Dialogue #Fine-grained Scene Understanding

Downloads 23

Release Time: 1/19/2024

Model Overview

This model is fine-tuned from MC-LLaVA-3b and integrates the LLaVA adapter. It accepts image and text inputs and generates text outputs.

Model Features

Partitioned Visual Token Embedding

Generates a separate token embedding for each partition of an image, rather than embedding the whole image at once, with the aim of capturing finer detail

LLaVA Adapter Integration

Passes the visual embeddings through the LLaVA adapter, producing token embeddings of shape [N, 2560]

ChatML Dialogue Format

Uses the ChatML prompt format, which suits chatbot applications

Model Capabilities

Image Understanding

Visual Question Answering

Image Caption Generation

Multimodal Dialogue

Use Cases

Intelligent Assistant

Visual Q&A Assistant

Answers various user questions about image content

Accurately identifies image content and provides relevant answers

Content Understanding

Image Caption Generation

Generates detailed textual descriptions for images

Produces natural language descriptions that match image content

🚀 HelpingAI-Vision

HelpingAI-Vision is a model designed to enhance scene understanding by generating token embeddings for image parts, based on HelpingAI-Lite and incorporating the LLaVA adapter.

✨ Features

The fundamental concept behind HelpingAI-Vision is to generate one token embedding for each of the N parts of an image, as opposed to producing N visual token embeddings for the entire image. This approach, based on HelpingAI-Lite and incorporating the LLaVA adapter, aims to enhance scene understanding by capturing more detailed information.

For every crop of the image, an embedding is generated using the full SigLIP encoder (size [1, 1152]). All N embeddings are then passed through the LLaVA adapter, producing token embeddings of size [N, 2560]. Currently, these tokens carry no explicit information about their position in the original image; positional information is planned for a later update.
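
The shape bookkeeping described above can be sketched in plain Python. This is only an illustration: the source does not say how the image is partitioned, so the even grid used by `grid_crop_boxes` is an assumption, and both function names are hypothetical. Only the sizes 1152 and 2560 come from the description.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), as used by PIL's Image.crop

SIGLIP_DIM = 1152   # per-crop encoder embedding size stated in the text
ADAPTER_DIM = 2560  # per-token size after the LLaVA adapter stated in the text


def grid_crop_boxes(width: int, height: int, rows: int, cols: int) -> List[Box]:
    """Split a width x height image into rows * cols crop boxes.

    Assumption: an even grid. The actual partitioning scheme is not documented here.
    """
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be positive")
    boxes = []
    for r in range(rows):
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


def embedding_shapes(n_crops: int) -> Dict[str, Tuple[int, int]]:
    """Tensor shapes at each stage for n_crops crops, following the description."""
    return {
        "per_crop_encoder_output": (1, SIGLIP_DIM),
        "stacked_encoder_output": (n_crops, SIGLIP_DIM),
        "adapter_output": (n_crops, ADAPTER_DIM),
    }


boxes = grid_crop_boxes(width=640, height=480, rows=2, cols=2)
shapes = embedding_shapes(len(boxes))
print(boxes)
print(shapes["adapter_output"])
```

With a 2 x 2 grid there are four crops, so the language model would receive four visual tokens of width 2560 for the image.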

HelpingAI-Vision was fine-tuned from MC-LLaVA-3b.

The model uses the ChatML prompt format, which suits chat-based applications. If you have specific questions or would like further details, feel free to ask. The prompt template is:

<|im_start|>system
You are Vortex, a helpful AI assistant.<|im_end|>
<|im_start|>user
{prompt}<|im_end|>
<|im_start|>assistant
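
Filling this template by hand is error-prone, so a small helper can assemble it. The sketch below is not part of the model's code: `build_chatml_prompt` and its parameters are hypothetical names, and the `<image>` placeholder position is copied from the usage example in this document.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def build_chatml_prompt(
    user_message: str,
    system_message: str = "You are Vortex, a helpful AI assistant.",
    include_image: bool = False,
) -> str:
    """Fill the ChatML template and leave the assistant turn open for generation."""
    # The usage example puts an <image> placeholder on its own line before the user text.
    user_content = f"<image>\n{user_message}" if include_image else user_message
    return (
        f"{IM_START}system\n{system_message}{IM_END}\n"
        f"{IM_START}user\n{user_content}{IM_END}\n"
        f"{IM_START}assistant\n"
    )


print(build_chatml_prompt("Describe the image.", include_image=True))
```

The returned string ends right after the assistant marker, so the model's continuation is the assistant's reply.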

📦 Installation

Install dependencies

!pip install -q open_clip_torch timm einops

Download modeling files

from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="configuration_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_llava.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="modeling_phi.py", local_dir="./", force_download=True)
hf_hub_download(repo_id="OEvortex/HelpingAI-Vision", filename="processing_llava.py", local_dir="./", force_download=True)

💻 Usage Examples

Basic Usage

# Create a model
from modeling_llava import LlavaForConditionalGeneration
import torch

model = LlavaForConditionalGeneration.from_pretrained("OEvortex/HelpingAI-Vision", torch_dtype=torch.float16)
model = model.to("cuda")

# Create processors
from transformers import AutoTokenizer
from processing_llava import LlavaProcessor, OpenCLIPImageProcessor

tokenizer = AutoTokenizer.from_pretrained("OEvortex/HelpingAI-Vision")
image_processor = OpenCLIPImageProcessor(model.config.preprocess_config)
processor = LlavaProcessor(image_processor, tokenizer)

# Set image and text
from PIL import Image
import requests

image_file = "https://images.unsplash.com/photo-1439246854758-f686a415d9da"
raw_image = Image.open(requests.get(image_file, stream=True).raw)

prompt = """<|im_start|>system
A chat between a curious human and an artificial intelligence assistant.
The assistant gives helpful, detailed, and polite answers to the human's questions.
The assistant does not hallucinate and pays very close attention to the details.<|im_end|>
<|im_start|>user
<image>
Describe the image.<|im_end|>
<|im_start|>assistant
"""

# Process inputs
with torch.inference_mode():
  inputs = processor(prompt, raw_image, model, return_tensors='pt')

inputs['input_ids'] = inputs['input_ids'].to(model.device)
inputs['attention_mask'] = inputs['attention_mask'].to(model.device)

from transformers import TextStreamer

streamer = TextStreamer(tokenizer)

# Generate the response
# (%%time is a Jupyter cell magic and only works as the first line of a cell, so it is omitted here)
with torch.inference_mode():
  output = model.generate(**inputs, max_new_tokens=200, do_sample=True, top_p=0.9, temperature=1.2, eos_token_id=tokenizer.eos_token_id, streamer=streamer)
print(tokenizer.decode(output[0]).replace(prompt, "").replace("<|im_end|>", ""))
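
The final `print` recovers the reply by deleting the prompt text from the decoded output, which only works if the decoded string reproduces the prompt exactly. A possible alternative, shown here as a sketch, is to split on the ChatML markers instead. `extract_reply` is a hypothetical helper, and it assumes the decoded string still contains the `<|im_start|>` and `<|im_end|>` tokens as text.

```python
def extract_reply(decoded: str) -> str:
    """Return the text of the last assistant turn in a decoded ChatML string."""
    marker = "<|im_start|>assistant"
    # rpartition returns the whole string as its last element when the marker is absent,
    # so unmarked text passes through unchanged.
    tail = decoded.rpartition(marker)[2]
    # Cut at the first end-of-turn token, if any.
    reply = tail.split("<|im_end|>", 1)[0]
    return reply.strip()


example = (
    "<|im_start|>system\nS<|im_end|>\n"
    "<|im_start|>user\nQ<|im_end|>\n"
    "<|im_start|>assistant\nA cat on a sofa.<|im_end|>"
)
print(extract_reply(example))
```

In the usage example above, this would be called as `extract_reply(tokenizer.decode(output[0]))`.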

📄 License

The model is released under the HSUL license.

Property	Details
Library Name	transformers
Base Model	visheratin/MC-LLaVA-3b
Pipeline Tag	image-text-to-text
