Turkish-LLaVA-v0.1 Open-Source Vision-Language Model - Free to Process Image and Text Inputs and Execute Turkish Instructions

Turkish LLaVA V0.1

Developed by ytu-ce-cosmos

A Turkish visual-language model specifically designed for multimodal visual instruction-following tasks, capable of processing both visual (image) and text inputs to understand and execute instructions provided in Turkish.

Image-to-Text

Safetensors

OtherOpen Source License:MIT #Turkish Visual Question Answering #Multimodal Instruction Following #OCR Enhancement

Downloads 86

Release Time : 10/31/2024

Model Overview

This model adopts the LLaVA architecture, integrating a Turkish Llama language model, enabling it to process image and text inputs for visual reasoning and instruction-following tasks.

Model Features

Multimodal Processing Capability

Capable of processing both visual (image) and text inputs for cross-modal understanding.

Turkish Language Support

A visual-language model optimized specifically for Turkish, suitable for Turkish-speaking users.

Instruction Following

Can understand and execute user-provided visual and text instructions.

OCR Enhancement

Improved performance on OCR-related tasks through training on 110K rounds of multi-turn instruction data including book covers.

Model Capabilities

Image Understanding

Text Generation

Visual Reasoning

Multimodal Dialogue

Instruction Following

Use Cases

Visual Question Answering

Image Content Description

Generate detailed Turkish descriptions based on user-provided images.

Example successfully described a scene of a puppy in the garden.

Visual Reasoning

Answer user questions based on image content.

Education

Book Cover Recognition

Identify book covers and provide related information.

🚀 Llava-CosmosLlama

A Turkish visual language model for multi-modal visual instruction-following tasks, leveraging the LLaVA architecture and integrating the ytucosmos/Turkish-Llama-8b-Instruct-v0.1 language model.

✨ Features

Designed for multi-modal visual instruction-following tasks.
Utilizes the LLaVA architecture and integrates the ytucosmos/Turkish-Llama-8b-Instruct-v0.1 language model.
Capable of processing both visual (image) and textual inputs, and understanding and executing instructions in Turkish.

📦 Installation

Using lmdeploy

Install requirements:

conda create -n lmdeploy python=3.8 -y
conda activate lmdeploy
pip install lmdeploy

💻 Usage Examples

Basic Usage

from lmdeploy import pipeline, ChatTemplateConfig
from lmdeploy.vl import load_image

pipe = pipeline("ytu-ce-cosmos/Turkish-LLaVA-v0.1",
                chat_template_config=ChatTemplateConfig(model_name='llama3'))

url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/idefics-im-captioning.jpg"
image = load_image(url)

response = pipe(('Bu resimde öne çıkan ögeler nelerdir?', image))

print(response)

"""
Resimde, çiçeklerle dolu bir bahçede yavru bir köpek ve arka planda bir ağaç yer alıyor.
Köpek, çiçeklerin arasında otururken ve etrafını saran çiçeklerin arasından bakarken görülebiliyor.
Bu sahne, köpeğin bahçede geçirdiği zamanın tadını çıkardığı ve çevresini keşfettiği sakin ve huzurlu bir atmosferi yansıtıyor.
"""

Image used in this example:

📚 Documentation

Model Details

The model was pretrained on LLaVA-CC3M-Pretrain-595K dataset, which was translated to Turkish using DeepL Translate. It was further fine-tuned using subsets the following datasets to enhance its visual reasoning and understanding capabilities:

Stanford GQA
VisualGenome
COCO
110K multi-turn instruction following data consisting of book covers, to enhance models capabilities on tasks regarding OCR.

Property	Details
Model Type	Turkish visual language model
Training Data	Pretrained on LLaVA-CC3M-Pretrain-595K (translated to Turkish), fine-tuned on subsets of Stanford GQA, VisualGenome, COCO, and 110K multi-turn instruction following data of book covers

📄 License

This project is licensed under the MIT license.

Acknowledgments

Computing resources used in this work were provided by the National Center for High Performance Computing of Turkey (UHeM).
Thanks to the generous support from the Hugging Face team, it is possible to download models from their S3 storage 🤗

Citation

@inproceedings{zeer2024cosmos,
  title={Cosmos-LLaVA: Chatting with the Visual},
  author={Zeer, Ahmed and Dogan, Eren and Erdem, Yusuf and {\.I}nce, Elif and Shbib, Osama and Uzun, M Egemen and Uz, Atahan and Yuce, M Kaan and Kesgin, H Toprak and Amasyali, M Fatih},
  booktitle={2024 8th International Artificial Intelligence and Data Processing Symposium (IDAP)},
  pages={1--7},
  year={2024},
  organization={IEEE}
}

Contact

COSMOS AI Research Group, Yildiz Technical University Computer Engineering Department
https://cosmos.yildiz.edu.tr/
cosmos@yildiz.edu.tr

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご