🚀 TraVisionLM: The First Turkish Visual Language Model
🌟 TraVisionLM is a lightning-fast and compact (only 875M parameters) visual language model on Hugging Face. It can respond to Turkish instructions when given an image input! 🌟
✨ Developed to be compatible with the Transformers library, TraVisionLM is incredibly easy to load, fine-tune, and use for fast inference, all without the need for any external libraries! ⚡️
Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖
You can try the model here: TraVisionLM-Demo
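For a quick start, here is a minimal loading-and-inference sketch using only the Transformers library. The repo ID `ucsahin/TraVisionLM-base`, the `trust_remote_code=True` flag, and the sample image URL are assumptions for illustration; adjust them to the actual checkpoint published on the Hub.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repo ID -- replace with the actual checkpoint name on the Hub.
model_id = "ucsahin/TraVisionLM-base"

# trust_remote_code may be needed if the checkpoint ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any RGB image works; this COCO image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "Kısaca açıkla"  # "Describe briefly"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```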
✨ Features
- A fast and compact visual language model on Hugging Face that responds to Turkish instructions about a given image.
- Compatible with the Transformers library, enabling easy loading, fine-tuning, and fast inference.
📚 Documentation
Model Details
This model is a multimodal large language model that combines SigLIP as its vision encoder with GPT2-large as its language model. A vision projector connects the two modalities.
Its architecture closely resembles PaliGemma, with some refinements to the vision projector and the causal language modeling.
Here's the summary of the development process:
- Unimodal pretraining: Instead of pretraining both modalities from scratch, it leverages the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
- Feature Alignment: Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), it trains only the vision projector using 500K image-text pairs to align visual and textual features (see the freezing sketch after this list).
- Task Specific Training: The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
- Finetuning on Downstream Tasks: Finally, the model is fine-tuned for object detection to demonstrate its versatility in various downstream tasks. See [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for details.
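To illustrate the feature-alignment stage, the sketch below (continuing from the loading sketch above) freezes the vision encoder and the language model and leaves only the projector trainable. The attribute names `vision_tower`, `language_model`, and `multi_modal_projector` follow PaliGemma-style naming and are assumptions about this model's internals; they may differ in the actual implementation.

```python
# Feature-alignment setup sketch: train only the vision projector while the
# pretrained image encoder and language model stay frozen.
# Attribute names are assumed (PaliGemma-style) and may differ.
for param in model.vision_tower.parameters():
    param.requires_grad = False
for param in model.language_model.parameters():
    param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True  # only the projector is updated

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```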
Model Description
- Developed by: ucsahin
- Model type: [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- Language(s) (NLP): Turkish
- License: Apache License 2.0
💻 Usage Examples
Direct Use
- Short Captioning
You can give the model task instructions like
"Açıkla" ("Describe"), "Kısaca açıkla" ("Describe briefly"), "Görseli özetle" ("Summarize the image"), "Çok kısa özetle" ("Summarize very briefly"),
and the model will generate a short description of the image you provide.
⚠️ Important Note
The model tends to hallucinate less for this task. You can try adjusting the generation parameters to produce the most useful answer for your needs.
- Detailed Captioning
You can give the model task instructions like
"Detaylı açıkla" ("Describe in detail"), "Çok detaylı açıkla" ("Describe in great detail"), "Görseli detaylı anlat" ("Explain the image in detail"), "Görseli çok detaylı anlat" ("Explain the image in great detail"),
and the model will generate a very detailed description of the image you provide.
⚠️ Important Note
The model tends to hallucinate more for this task. Although it generally produces responses related to the image, it may provide details and information that are not present in the image. You can try adjusting the generation parameters to produce the most useful answer for your needs.
- Visual Question Answering
You can ask the model open-ended questions like
"Resmin odağında ne var?" ("What is the focus of the image?"), "Görselde adam ne yapıyor?" ("What is the man doing in the image?"), "Kaç zürafa var?" ("How many giraffes are there?"), "Görselle ilgili ne söylenir?" ("What can be said about the image?"), "Görseldeki *obje* ne renk?" ("What color is the *object* in the image?"),
and the model will generate responses that answer your question.
⚠️ Important Note
As with detailed captioning, the model tends to hallucinate more for this task. Although it generally produces responses related to the image and the question, it may provide details and information that are not present in the image. Again, you can try adjusting the generation parameters (see the sketch below) to produce the most useful answer for your needs.
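Since all three notes above suggest tuning the generation parameters, here is a sketch of the kind of settings one might experiment with; the values are illustrative assumptions, not recommendations from the model's training. It reuses `model`, `processor`, and `image` from the loading sketch above.

```python
# Illustrative generation settings: greedy decoding with a repetition penalty
# tends to hallucinate less, while sampling yields more varied, detailed text.
inputs = processor(text="Detaylı açıkla", images=image, return_tensors="pt").to("cuda")

conservative = model.generate(
    **inputs, max_new_tokens=256, do_sample=False, repetition_penalty=1.2
)
creative = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.95, top_k=50
)

print(processor.batch_decode(conservative, skip_special_tokens=True)[0])
print(processor.batch_decode(creative, skip_special_tokens=True)[0])
```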
Downstream Use
- (Video-Text-to-Text) The model can be adapted to question answering over videos. By sampling video frames and generating an answer for each frame, the model can be used without any changes to the architecture (see the frame-sampling sketch after this list).
- (Image/Text Retrieval conditioned on Text/Image) For retrieving the most relevant image given a text query, or the most relevant text given an image, the model can be used directly without any modifications.
- (Fine-tuning) For all other tasks that the model's architecture supports, such as visual classification, the model can be fine-tuned using the Transformers library. For an example, check out [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft).
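Here is the frame-sampling sketch referenced in the first item. It assumes OpenCV (`cv2`) is installed for decoding video, reuses `model` and `processor` from the loading sketch above, and leaves the model itself unchanged; the function name and sampling rate are illustrative.

```python
import cv2  # pip install opencv-python
from PIL import Image

def answer_over_video(video_path, question, every_n_frames=30):
    """Sample every Nth frame and generate an answer per sampled frame."""
    cap = cv2.VideoCapture(video_path)
    answers = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            # OpenCV decodes to BGR; convert to RGB for the processor.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(text=question, images=image, return_tensors="pt").to("cuda")
            out = model.generate(**inputs, max_new_tokens=64)
            answers.append(processor.batch_decode(out, skip_special_tokens=True)[0])
        idx += 1
    cap.release()
    return answers

# answers = answer_over_video("video.mp4", "Görselde adam ne yapıyor?")
```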
💡 Usage Tip
"As time permits, I plan to share more applications for these indirect uses. Meanwhile, I eagerly await support or collaboration requests from the community" 🤝💪
Out - of - Scope Use
This model is not suitable for the following scenarios:
- Although the model can answer simple questions about your images, it is not suitable for multi-turn, complex chat scenarios. Past information is not retained; the model does not use previously asked questions as context. However, you can train the model for this task by preparing a suitable chat template (see the sketch after this list).
- The model does not accept multiple image inputs. For instance, it is not suitable for answering questions that compare two different images. Modifications to the architecture would be necessary to add this feature. For such a model, you can check [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) (English only).
- The model has not been trained for tasks such as character and text recognition (OCR), segmentation, or multi-object detection. To achieve acceptable performance on these tasks, visual language models like [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) and [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large) have been trained on billions of documents and images.
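To illustrate the chat-template point from the first item: one simple approach is to concatenate past turns into a single prompt when building fine-tuning examples. The template below is entirely hypothetical; the role markers and separators are assumptions, not a format the model was trained on.

```python
# Hypothetical multi-turn template for fine-tuning data; role markers and
# separators are illustrative, not part of the model's training format.
def build_chat_prompt(turns, new_question):
    """turns: list of (question, answer) pairs from earlier in the chat."""
    history = "".join(f"KULLANICI: {q}\nASISTAN: {a}\n" for q, a in turns)
    return f"{history}KULLANICI: {new_question}\nASISTAN:"

prompt = build_chat_prompt(
    [("Görselde ne var?", "Bir köpek var.")],  # "What's in the image?" / "There is a dog."
    "Köpek ne renk?",  # "What color is the dog?"
)
```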
📄 License
This model is licensed under the Apache License 2.0.