🚀 LLaVA-Saiga-8b
LLaVA-Saiga-8b is a Vision-Language Model (VLM) based on the IlyaGusev/saiga_llama3_8b model and trained in the original LLaVA setup. It is mainly adapted for Russian but can also handle English.
🚀 Quick Start
The model can be easily used via the transformers API.
💻 Usage Examples
Basic Usage
```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoTokenizer, LlavaForConditionalGeneration

model_name = "deepvk/llava-saiga-8b"
model = LlavaForConditionalGeneration.from_pretrained(model_name)
processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Download an example image
url = "https://www.ilankelman.org/stopsigns/australia.jpg"
img = Image.open(requests.get(url, stream=True).raw)

# "Опиши картинку несколькими словами." = "Describe the picture in a few words."
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = processor(images=[img], text=text, return_tensors="pt")
generate_ids = model.generate(**inputs, max_new_tokens=30)

# Decode only the newly generated tokens
answer = tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True)
print(answer)
```
Use the `<image>` tag to mark where an image appears in the text, and follow the chat template for multi-turn conversations. The model can also chat without images or handle multiple images in a conversation, though this behavior has not been tested; a sketch of a multi-turn request is shown below.
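As an illustration, a follow-up turn can be sent by appending the assistant's reply and the next user message before re-applying the chat template. This is a minimal, untested sketch that continues the snippet above; the follow-up question is our own example, not part of the original card:

```python
# Continues the snippet above: model, processor, tokenizer, img and `answer`
# are assumed to already exist. Multi-turn behavior is untested (see note above).
messages = [
    {"role": "user", "content": "<image>\nОпиши картинку несколькими словами."},
    {"role": "assistant", "content": answer},  # reply from the first turn
    {"role": "user", "content": "Is there a stop sign in the picture?"},  # follow-up question
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# The single image is passed again, matching the one <image> tag in the conversation.
inputs = processor(images=[img], text=text, return_tensors="pt")
generate_ids = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(generate_ids[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
```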
The model's format allows it to be used directly in popular frameworks. For example, you can evaluate it with lmms-eval; see the Results section for details.
🔧 Technical Details
Training
To train this model, we followed the original LLaVA pipeline and reused the haotian-liu/LLaVA framework.
The model was trained in two stages:
- The adapter was trained on the pre-training data from ShareGPT4V.
- Instruction tuning involved training both the LLM and the adapter. We used:
  - deepvk/LLaVA-Instruct-ru: our new dataset of VLM instructions in Russian.
  - deepvk/GQA-ru: the training part of the popular GQA benchmark, translated into Russian. We used the post-prompt "Ответь одним словом." ("Answer in one word.").
  - Instruction data from ShareGPT4V.
The entire training process took 3-4 days on 8×A100 80GB GPUs.
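For reference, the instruction-tuning datasets are published on the Hugging Face Hub and can be inspected with the datasets library. This is a minimal sketch for browsing them; the exact configurations, splits, and column layouts are assumptions, so check the dataset cards:

```python
from datasets import load_dataset

# Load the Russian instruction-tuning data for inspection.
# Configuration and split names may require extra arguments; see the dataset cards.
llava_instruct_ru = load_dataset("deepvk/LLaVA-Instruct-ru")
gqa_ru = load_dataset("deepvk/GQA-ru")

print(llava_instruct_ru)
print(gqa_ru)
```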
📚 Documentation
Results
The model's performance was evaluated using the lmms-eval framework:
```bash
accelerate launch -m lmms_eval --model llava_hf --model_args pretrained="deepvk/llava-saiga-8b" \
  --tasks gqa-ru,mmbench_ru_dev,gqa,mmbench_en_dev --batch_size 1 \
  --log_samples --log_samples_suffix llava-saiga-8b --output_path ./logs/
```
Note: For MMBench, we did not use the OpenAI API to extract the answer choice from the generated string. The reported score is therefore closer to Exact Match, as in the GQA benchmark.
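In effect, the generated string is compared directly against the reference answer rather than being mapped to an option by a judge model. Roughly, the scoring reduces to something like the following sketch (illustrative only, not the actual lmms-eval implementation):

```python
# Illustrative exact-match scoring, not the actual lmms-eval code.
def normalize(text: str) -> str:
    return text.strip().lower().rstrip(".")

def exact_match_score(predictions: list[str], references: list[str]) -> float:
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

print(exact_match_score(["A stop sign."], ["a stop sign"]))  # 1.0
```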
📄 License
This project is licensed under the apache-2.0 license.
📖 Citation
```bibtex
@misc{liu2023llava,
    title={Visual Instruction Tuning},
    author={Liu, Haotian and Li, Chunyuan and Wu, Qingyang and Lee, Yong Jae},
    publisher={NeurIPS},
    year={2023},
}

@misc{deepvk2024llava-saiga-8b,
    title={LLaVA-Saiga-8b},
    author={Belopolskih, Daniil and Spirin, Egor},
    url={https://huggingface.co/deepvk/llava-saiga-8b},
    publisher={Hugging Face},
    year={2024},
}
```
📋 Information Table
| Property | Details |
|---|---|
| Library Name | transformers |
| Model Type | Vision-Language Model (VLM) |
| Base Model | IlyaGusev/saiga_llama3_8b |
| Pipeline Tag | image-text-to-text |
| Training Data | deepvk/LLaVA-Instruct-ru, Lin-Chen/ShareGPT4V, deepvk/GQA-ru |
| Language | ru, en |
| License | apache-2.0 |