🚀 TraVisionLM: The First Turkish Visual Language Model
🌟 TraVisionLM is a lightning-fast and compact (only 875M parameters) visual language model on Hugging Face. It can respond to Turkish instructions when given an image input! 🌟
✨ Developed to be compatible with the Transformers library, TraVisionLM is incredibly easy to load, fine-tune, and use for fast inference, all without the need for any external libraries! ⚡️
Ready to experience the Turkish visual language model? Let's go! 🇹🇷🖼️🤖
You can try the model here: TraVisionLM-Demo
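For a quick start, here is a minimal loading-and-inference sketch using only the Transformers library. The repo ID `ucsahin/TraVisionLM-base`, the `trust_remote_code=True` flag, and the sample image URL are assumptions for illustration; adjust them to the actual checkpoint published on the Hub.

```python
import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Hypothetical repo ID -- replace with the actual checkpoint name on the Hub.
model_id = "ucsahin/TraVisionLM-base"

# trust_remote_code may be needed if the checkpoint ships custom modeling code.
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Any RGB image works; this COCO image is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

prompt = "Kısaca açıkla"  # "Describe briefly"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda")
outputs = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(outputs, skip_special_tokens=True)[0])
```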
✨ Features
- A fast and compact visual language model on Hugging Face that responds to Turkish instructions about a given image.
- Compatible with the Transformers library, enabling easy loading, fine-tuning, and fast inference.
📚 Documentation
Model Details
This model is a multimodal large language model that combines SigLIP as its vision encoder with GPT2-large as its language model. A vision projector connects the two modalities.
Its architecture closely resembles PaliGemma, with some refinements to the vision projector and the causal language modeling.
Here's the summary of the development process:
- Unimodal pretraining: Instead of pretraining both modalities from scratch, it leverages the image encoder from [google/siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) and the language model from [ytu-ce-cosmos/turkish-gpt2-large](https://huggingface.co/ytu-ce-cosmos/turkish-gpt2-large).
- Feature Alignment: Following the [LLaVA training recipe](https://github.com/haotian-liu/LLaVA?tab=readme-ov-file#train), it trains only the vision projector using 500K image-text pairs to align visual and textual features (see the freezing sketch after this list).
- Task Specific Training: The aligned model undergoes further training for tasks such as short captioning, detailed captioning, and simple visual question answering, using over 1M image-prompt-completion triplets.
- Finetuning on Downstream Tasks: Finally, the model is fine-tuned for object detection to demonstrate its versatility in various downstream tasks. See [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft) for details.
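To illustrate the feature-alignment stage, the sketch below (continuing from the loading sketch above) freezes the vision encoder and the language model and leaves only the projector trainable. The attribute names `vision_tower`, `language_model`, and `multi_modal_projector` follow PaliGemma-style naming and are assumptions about this model's internals; they may differ in the actual implementation.

```python
# Feature-alignment setup sketch: train only the vision projector while the
# pretrained image encoder and language model stay frozen.
# Attribute names are assumed (PaliGemma-style) and may differ.
for param in model.vision_tower.parameters():
    param.requires_grad = False
for param in model.language_model.parameters():
    param.requires_grad = False
for param in model.multi_modal_projector.parameters():
    param.requires_grad = True  # only the projector is updated

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable:,}")
```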
Model Description
- Developed by: ucsahin
- Model type: [Image-Text-to-Text](https://huggingface.co/tasks/image-text-to-text)
- Language(s) (NLP): Turkish
- License: Apache License 2.0
💻 Usage Examples
Direct Use
- Short Captioning
You can give the model task instructions like
"Açıkla" ("Describe"), "Kısaca açıkla" ("Describe briefly"), "Görseli özetle" ("Summarize the image"), "Çok kısa özetle" ("Summarize very briefly"),
and the model will generate a short description of the image you provide.
⚠️ Important Note
The model tends to hallucinate less for this task. You can try adjusting the generation parameters to produce the most useful answer for your needs.
- Detailed Captioning
You can give the model task instructions like
"Detaylı açıkla" ("Describe in detail"), "Çok detaylı açıkla" ("Describe in great detail"), "Görseli detaylı anlat" ("Explain the image in detail"), "Görseli çok detaylı anlat" ("Explain the image in great detail"),
and the model will generate a very detailed description of the image you provide.
⚠️ Important Note
The model tends to hallucinate more for this task. Although it generally produces responses related to the image, it may provide details and information that are not present in the image. You can try adjusting the generation parameters to produce the most useful answer for your needs.
- Visual Question Answering
You can ask the model open-ended questions like
"Resmin odağında ne var?" ("What is the focus of the image?"), "Görselde adam ne yapıyor?" ("What is the man doing in the image?"), "Kaç zürafa var?" ("How many giraffes are there?"), "Görselle ilgili ne söylenir?" ("What can be said about the image?"), "Görseldeki *obje* ne renk?" ("What color is the *object* in the image?"),
and the model will generate responses that answer your question.
⚠️ Important Note
As with detailed captioning, the model tends to hallucinate more for this task. Although it generally produces responses related to the image and the question, it may provide details and information that are not present in the image. Again, you can try adjusting the generation parameters (see the sketch below) to produce the most useful answer for your needs.
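Since all three notes above suggest tuning the generation parameters, here is a sketch of the kind of settings one might experiment with; the values are illustrative assumptions, not recommendations from the model's training. It reuses `model`, `processor`, and `image` from the loading sketch above.

```python
# Illustrative generation settings: greedy decoding with a repetition penalty
# tends to hallucinate less, while sampling yields more varied, detailed text.
inputs = processor(text="Detaylı açıkla", images=image, return_tensors="pt").to("cuda")

conservative = model.generate(
    **inputs, max_new_tokens=256, do_sample=False, repetition_penalty=1.2
)
creative = model.generate(
    **inputs, max_new_tokens=256, do_sample=True, temperature=0.9, top_p=0.95, top_k=50
)

print(processor.batch_decode(conservative, skip_special_tokens=True)[0])
print(processor.batch_decode(creative, skip_special_tokens=True)[0])
```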
Downstream Use
- (Video-Text-to-Text) The model can be adapted to question answering over videos. By sampling video frames and generating an answer for each frame, the model can be used without any changes to the architecture (see the frame-sampling sketch after this list).
- (Image/Text Retrieval conditioned on Text/Image) For retrieving the most relevant image given a text query, or the most relevant text given an image, the model can be used directly without any modifications.
- (Fine-tuning) For all other tasks that the model's architecture supports, such as visual classification, the model can be fine-tuned using the Transformers library. For an example, check out [ucsahin/TraVisionLM-Object-Detection-ft](https://huggingface.co/ucsahin/TraVisionLM-Object-Detection-ft).
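Here is the frame-sampling sketch referenced in the first item. It assumes OpenCV (`cv2`) is installed for decoding video, reuses `model` and `processor` from the loading sketch above, and leaves the model itself unchanged; the function name and sampling rate are illustrative.

```python
import cv2  # pip install opencv-python
from PIL import Image

def answer_over_video(video_path, question, every_n_frames=30):
    """Sample every Nth frame and generate an answer per sampled frame."""
    cap = cv2.VideoCapture(video_path)
    answers = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            # OpenCV decodes to BGR; convert to RGB for the processor.
            image = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            inputs = processor(text=question, images=image, return_tensors="pt").to("cuda")
            out = model.generate(**inputs, max_new_tokens=64)
            answers.append(processor.batch_decode(out, skip_special_tokens=True)[0])
        idx += 1
    cap.release()
    return answers

# answers = answer_over_video("video.mp4", "Görselde adam ne yapıyor?")
```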
💡 Usage Tip
"As time permits, I plan to share more applications for these indirect uses. Meanwhile, I eagerly await support or collaboration requests from the community" 🤝💪
Out - of - Scope Use
This model is not suitable for the following scenarios:
- Although the model can answer simple questions about your images, it is not suitable for multi-turn, complex chat scenarios. Past information is not retained; the model does not use previously asked questions as context. However, you can train the model for this task by preparing a suitable chat template (see the sketch after this list).
- The model does not accept multiple image inputs. For instance, it is not suitable for answering questions that compare two different images. Modifications to the architecture would be necessary to add this feature. For such a model, you can check [HuggingFaceM4/idefics2-8b](https://huggingface.co/HuggingFaceM4/idefics2-8b) (English only).
- The model has not been trained for tasks such as character and text recognition (OCR), segmentation, or multi-object detection. To achieve acceptable performance on these tasks, visual language models like [google/paligemma-3b-pt-224](https://huggingface.co/google/paligemma-3b-pt-224) and [microsoft/Florence-2-large](https://huggingface.co/microsoft/Florence-2-large) have been trained on billions of documents and images.
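To illustrate the chat-template point from the first item: one simple approach is to concatenate past turns into a single prompt when building fine-tuning examples. The template below is entirely hypothetical; the role markers and separators are assumptions, not a format the model was trained on.

```python
# Hypothetical multi-turn template for fine-tuning data; role markers and
# separators are illustrative, not part of the model's training format.
def build_chat_prompt(turns, new_question):
    """turns: list of (question, answer) pairs from earlier in the chat."""
    history = "".join(f"KULLANICI: {q}\nASISTAN: {a}\n" for q, a in turns)
    return f"{history}KULLANICI: {new_question}\nASISTAN:"

prompt = build_chat_prompt(
    [("Görselde ne var?", "Bir köpek var.")],  # "What's in the image?" / "There is a dog."
    "Köpek ne renk?",  # "What color is the dog?"
)
```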
📄 License
This model is licensed under the Apache License 2.0.