# Multimodal Text Generation
Vintern 1B V3 5 GGUF Ext
MIT
Vintern-1B-v3_5 is a 1-billion-parameter vision-language model supporting image-text generation tasks.
Text-to-Image
rootonchair
242
1
Mistral Small 3.1 24B Instruct 2503 GGUF
Apache-2.0
This is a vision-enhanced version based on Mistral-Small-3.1-24B-Instruct-2503, supporting image-to-text generation tasks.
Image-to-Text
ggml-org
670
3
Gemma 3 4b It Int8 Asym Ov
Apache-2.0
A Gemma 3 4B-parameter model optimized with OpenVINO, supporting text-to-text and visual-text inference.
Image-to-Text
Echo9Zulu
152
1
Gemma 3 4b It Llamafile
Gemma 3 is a lightweight open-source model series launched by Google, built on Gemini technology, supporting multimodal input and text output.
Text-to-Image
Mozilla
751
3
Gemma 3 1b Pt Qat Q4 0 Gguf
Gemma is a family of lightweight, cutting-edge open models from Google, built on the same research and technology as the Gemini models. The 1B version is a pretrained base model in GGUF format with Quantization-Aware Training (QAT).
Image-to-Text
google
97
6
Qwen2 VL 7B Latex OCR
Apache-2.0
A fine-tuned version of the Qwen2-VL-7B model, trained using Unsloth and the Hugging Face TRL library, achieving a 2x inference speed improvement.
Text-to-Image
Transformers, English

erickrus
35
3
Llava NeXT Video 34B DPO
Llama 2 is a series of open-source large language models developed by Meta, supporting various natural language processing tasks.
Video-to-Text
Transformers

lmms-lab
214
10
Ko Deplot
Apache-2.0
ko-deplot is a Korean visual question answering model based on Google's Pix2Struct architecture, fine-tuned from the Deplot model, supporting chart image question-answering tasks in Korean and English.
Image-to-Text
Transformers, Multilingual

nuua
252
5