🚀 LLaVA-Saiga-8b
LLaVA-Saiga-8b is a Vision-Language Model (VLM) based on the IlyaGusev/saiga_llama3_8b model and trained in the original LLaVA setup. It is mainly adapted for Russian but can also handle English.
🚀 Quick Start
The model can be easily used via the transformers API.
💻 Usage Examples
Basic Usage
```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration

model_name = "deepvk/llava-saiga-8b"
model = LlavaForConditionalGeneration.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Download an example image
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# "Опиши картинку несколькими словами." = "Describe the picture in a few words."
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(images=[img], text=text, return_tensors="pt")
generate_ids = model.generate(**inputs, max_new_tokens=30)

# Decode only the newly generated tokens
answer = tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```
Use the `<image>` tag to mark where an image appears in the text, and follow the chat template for multi-turn conversations. The model can also chat without images or handle multiple images in a conversation, though this behavior has not been tested; a sketch of a multi-turn request is shown below.
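As an illustration, a follow-up turn can be sent by appending the assistant's reply and the next user message before re-applying the chat template. This is a minimal, untested sketch that continues the snippet above; the follow-up question is our own example, not part of the original card:

```python
# Continues the snippet above: model, processor, tokenizer, img and `answer`
# are assumed to already exist. Multi-turn behavior is untested (see note above).
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."},
    {"role": "assistant", "content": answer},  # reply from the first turn
    {"role": "user", "content": "Is there a stop sign in the picture?"},  # follow-up question
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# The single image is passed again, matching the one <image> tag in the conversation.
inputs = processor(images=[img], text=text, return_tensors="pt")
generate_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```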
The model's format allows it to be used directly in popular frameworks. For example, you can evaluate it with lmms-eval; see the Results section for details.
🔧 Technical Details
Training
To train this model, we followed the original LLaVA pipeline and reused the haotian-liu/LLaVA framework.
The model was trained in two stages:
- The adapter was trained on the pre-training data from ShareGPT4V.
- Instruction tuning involved training both the LLM and the adapter. We used:
  - deepvk/LLaVA-Instruct-ru: our new dataset of VLM instructions in Russian.
  - deepvk/GQA-ru: the training part of the popular GQA benchmark, translated into Russian. We used the post-prompt "Ответь одним словом." ("Answer in one word.").
  - Instruction data from ShareGPT4V.
The entire training process took 3-4 days on 8×A100 80GB GPUs.
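For reference, the instruction-tuning datasets are published on the Hugging Face Hub and can be inspected with the datasets library. This is a minimal sketch for browsing them; the exact configurations, splits, and column layouts are assumptions, so check the dataset cards:

```python
from datasets import load_dataset

# Load the Russian instruction-tuning data for inspection.
# Configuration and split names may require extra arguments; see the dataset cards.
llava_instruct_ru = load_dataset("deepvk/LLaVA-Instruct-ru")
gqa_ru = load_dataset("deepvk/GQA-ru")

print(llava_instruct_ru)
print(gqa_ru)
```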
📚 Documentation
Results
The model's performance was evaluated using the lmms-eval framework:
```bash
accelerate launch -m lmms_eval --model llava_hf --model_args pretrained="deepvk/llava-saiga-8b" \
  --tasks gqa-ru,mmbench_ru_dev,gqa,mmbench_en_dev --batch_size 1 \
  --log_samples --log_samples_suffix llava-saiga-8b --output_path ./logs/
```
Note: For MMBench, we did not use the OpenAI API to extract the answer choice from the generated string. The reported score is therefore closer to Exact Match, as in the GQA benchmark.
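In effect, the generated string is compared directly against the reference answer rather than being mapped to an option by a judge model. Roughly, the scoring reduces to something like the following sketch (illustrative only, not the actual lmms-eval implementation):

```python
# Illustrative exact-match scoring, not the actual lmms-eval code.
def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def exact_match_score(predictions: list[str], references: list[str]) -> float:
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_score(["A stop sign."], ["a stop sign"]))  # 1.0
```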
📄 License
This project is licensed under the apache-2.0 license.
📖 Citation
```bibtex
@misc{liu2023llava,
    title={Visual Instruction Tuning},
    author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
    publisher={NeurIPS},
    year={2023},
}

@misc{deepvk2024llava-saiga-8b,
    title={LLaVA-Saiga-8b},
    author={Belopolskih, Daniil and Spirin, Egor},
    url={https://huggingface.co/deepvk/llava-saiga-8b},
    publisher={Hugging Face},
    year={2024},
}
```
📋 Information Table
| Property | Details |
|---|---|
| Library Name | transformers |
| Model Type | Vision-Language Model (VLM) |
| Base Model | IlyaGusev/saiga_llama3_8b |
| Pipeline Tag | image-text-to-text |
| Training Data | deepvk/LLaVA-Instruct-ru, Lin-Chen/ShareGPT4V, deepvk/GQA-ru |
| Language | ru, en |
| License | apache-2.0 |