# Image-to-Text

Qari OCR 0.3 SNAPSHOT VL 2B Instruct Merged GGUF
A statically quantized GGUF version of the Qari-OCR-0.3-SNAPSHOT-VL-2B-Instruct-merged model, mainly used for image-to-text conversion tasks.
Image-to-Text Transformers English
mradermacher
188
0
Vintern 1B V3 5 GGUF Ext
MIT
Vintern-1B-v3_5 is a 1-billion-parameter vision-language model supporting image-text generation tasks.
Image-to-Text
rootonchair
242
1
Mixtex Finetune
MIT
MixTex base_ZhEn is an image-to-text model supporting both Chinese and English, released under the MIT License.
Image-to-Text Supports Multiple Languages
wzmmmm
27
0
Sarashina2 Vision 8b
MIT
Sarashina2-Vision-8B is a large Japanese vision-language model trained by SB Intuitions, built on the Sarashina2-7B language model and the image encoder of Qwen2-VL-7B, achieving excellent performance on multiple benchmarks.
Image-to-Text Transformers Supports Multiple Languages
sbintuitions
1,233
4
Trocr Nepali
A Devanagari optical character recognition model based on the TrOCR architecture, fine-tuned specifically for Nepali text in Devanagari script.
Text Recognition Transformers Other
syubraj
175
0
Trocr Math Handwritten
A Transformer-based OCR model built on TrOCR and designed for recognizing handwritten mathematical formulas.
Image-to-Text Transformers
fhswf
290
6
Florence 2 Large
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, using a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text Transformers
Binaryy
24
0
Florence 2 Large
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of visual and vision-language tasks.
Image-to-Text Transformers
lodestone-horizon
14
0
Horus OCR
Donut is a Transformer-based image-to-text model capable of extracting and generating textual content from images.
Image-to-Text Transformers
TeeA
21
0
Libra 11b Base
Apache-2.0
Libra is a decoupled vision system built upon large language models, possessing fundamental multimodal understanding capabilities.
Image-to-Text Transformers
YifanXu
18
0
Llava Phi 3 Mini Gguf
LLaVA-Phi-3-mini is a fine-tuned LLaVA model based on Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, specializing in image-to-text tasks.
Image-to-Text
xtuner
1,676
133
Infimm Hd
InfiMM-HD is a high-resolution multimodal model capable of understanding and generating content that combines images and text.
Image-to-Text Transformers English
Infi-MM
17
27
Git Base Next Refined
MIT
An image-to-text model fine-tuned from microsoft/git-base.
Large Language Model Transformers Other
swaroopajit
24
0
Vit Gpt2 Verifycode Caption
Apache-2.0
A captcha recognition model with a ViT-GPT2 architecture, fine-tuned on a dataset of 60,000 images to identify the text in captcha images.
Image-to-Text Transformers
AIris-Channel
28
1
Pix2struct Refexp Base
Apache-2.0
Pix2Struct is an image encoder-text decoder model trained for multiple vision-language tasks, including image captioning and visual question answering.
Image-to-Text Transformers Supports Multiple Languages
gitlost-murali
20
0
Trocr Small Korean
Apache-2.0
A Korean TrOCR image-to-text model based on a vision encoder-decoder architecture, using DeiT as the image encoder and RoBERTa as the text decoder.
Image-to-Text Korean
team-lucid
342
17
Git 20
MIT
A multimodal model based on Microsoft's GIT framework, focused on extracting text from student homework images and generating teacher feedback.
Image-to-Text Transformers Supports Multiple Languages
uf-aice-lab
18
1
Mangaocr Hoogberta V2
A Japanese manga text recognition model based on the TrOCR architecture, specifically designed for extracting text content from manga images.
Image-to-Text Transformers
dsupa
39
0
Trocr Base Handwritten OCR Handwriting Recognition V2
A handwritten OCR model fine-tuned from Microsoft's trocr-base-handwritten, achieving a character error rate (CER) of 0.0360 on the evaluation set.
Text Recognition Transformers English
DunnBC22
269
16
Vit Gpt2 Image Captioning
Apache-2.0
An image captioning model based on the Vision Encoder-Decoder architecture, capable of generating natural language descriptions for input images.
Image-to-Text Transformers
baseplate
55
2
Sky Scribe
A satellite image captioning model fine-tuned from Microsoft's GIT-base, generating brief descriptions for NASA Earth Observatory images.
Image-to-Text Transformers Other
nkasmanoff
16
0
Pix2struct Large
Apache-2.0
Pix2Struct is an image encoder-text decoder model trained on image-text pairs, suitable for various vision-language tasks.
Image-to-Text Transformers Supports Multiple Languages
google
6,601
34
Pix2struct Ai2d Base
Apache-2.0
Pix2Struct is a vision-language understanding model, here fine-tuned for visual question answering (VQA) on scientific diagrams.
Image-to-Text Transformers Supports Multiple Languages
google
1,575
42
Pix2struct Base
Apache-2.0
Pix2Struct is an image encoder-text decoder model trained on various image-text pairs for tasks including image captioning and visual question answering.
Image-to-Text Transformers Supports Multiple Languages
google
6,390
71
Git Large Vatex
MIT
GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens, designed for tasks such as image and video captioning and visual question answering.
Image-to-Text Transformers Supports Multiple Languages
microsoft
267
1
Dof Invoice 1
MIT
An invoice processing model fine-tuned from naver-clova-ix/donut-base.
Image-to-Text Transformers
Sebabrata
13
0
Image Caption Generator
A vision-language model trained on the Flickr8k dataset, capable of generating natural language descriptions for input images.
Image-to-Text Transformers
bipin
177
15
Vit Gpt2 Coco En
An image-to-text model combining a ViT encoder with a GPT-2 decoder, capable of generating reasonable English descriptions for input images.
Image-to-Text
ydshieh
5,177
38
Trocr Large Handwritten
TrOCR is a Transformer-based optical character recognition model specifically designed for handwritten text recognition, fine-tuned on the IAM dataset.
Text Recognition Transformers
microsoft
59.17k
115
© 2025 AIbase