The Best 897 Image-to-Text Tools in 2025
Clip Vit Large Patch14
CLIP is a vision-language model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, supporting zero-shot image classification.
Image-to-Text
openai · 44.7M downloads · 1,710 likes
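The shared embedding space described above can be sketched with toy vectors: both modalities are L2-normalized, image-text cosine similarities are scaled by a temperature, and a softmax over candidate captions yields zero-shot class probabilities. The embeddings and temperature below are illustrative stand-ins, not outputs of the real model:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot classification: cosine similarity between a
    normalized image embedding and normalized text embeddings, scaled by a
    temperature and softmaxed into per-label probabilities."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # one logit per candidate caption
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

# Toy embeddings standing in for CLIP's image and text encoders.
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
    [0.0, 1.0, 0.0],   # e.g. "a photo of a dog"
])
probs = zero_shot_classify(image_emb, text_embs)
```

Because neither encoder is retrained per task, new classes are added simply by writing new candidate captions.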
Clip Vit Base Patch32
CLIP is a multimodal model developed by OpenAI that can understand the relationship between images and text, supporting zero-shot image classification tasks.
Image-to-Text
openai · 14.0M downloads · 666 likes
Siglip So400m Patch14 384
Apache-2.0
SigLIP is a vision-language model pre-trained on the WebLi dataset, employing an improved sigmoid loss function to optimize image-text matching tasks.
Image-to-Text
Transformers
google · 6.1M downloads · 526 likes
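The improved sigmoid loss that distinguishes SigLIP from CLIP can be sketched in a few lines: every image-text pair in a batch is an independent binary classification (+1 for matching pairs, -1 otherwise), instead of CLIP's batch-wide softmax. The scale and bias values below are illustrative assumptions, not the model's learned parameters:

```python
import numpy as np

def siglip_loss(img_embs, txt_embs, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss (SigLIP-style): each image-text pair is scored
    independently, labeled +1 on the diagonal (matching pairs) and -1
    elsewhere -- no batch-wide softmax normalization as in CLIP."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = scale * (img @ txt.T) + bias        # [N, N] pair scores
    labels = 2.0 * np.eye(len(img)) - 1.0        # +1 diagonal, -1 off-diagonal
    # -log sigmoid(label * logit), via log1p(exp(.)) for numerical stability
    return np.mean(np.log1p(np.exp(-labels * logits)))

rng = np.random.default_rng(0)
# Toy batch: matched pairs are nearly identical, mismatches are random.
imgs = rng.normal(size=(4, 8))
texts = imgs + 0.01 * rng.normal(size=(4, 8))
aligned = siglip_loss(imgs, texts)
shuffled = siglip_loss(imgs, texts[::-1].copy())  # break the pairing
```

A correctly paired batch should score a lower loss than a shuffled one; since each pair is scored independently, the loss also decouples from batch size in a way the softmax formulation does not.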
Clip Vit Base Patch16
CLIP is a multimodal model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, enabling zero-shot image classification capabilities.
Image-to-Text
openai · 4.6M downloads · 119 likes
Blip Image Captioning Base
BSD-3-Clause
BLIP is an advanced vision-language pretrained model, excelling in image captioning tasks and supporting both conditional and unconditional text generation.
Image-to-Text
Transformers
Salesforce · 2.8M downloads · 688 likes
Blip Image Captioning Large
BSD-3-Clause
BLIP is a unified vision-language pretraining framework that excels at image captioning, supporting both conditional and unconditional caption generation.
Image-to-Text
Transformers
Salesforce · 2.5M downloads · 1,312 likes
Openvla 7b
MIT
OpenVLA 7B is an open-source vision-language-action model trained on the Open X-Embodiment dataset, capable of generating robot actions based on language instructions and camera images.
Image-to-Text
Transformers · English
openvla · 1.7M downloads · 108 likes
Llava V1.5 7b
LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna, supporting image-and-text conversation.
Image-to-Text
Transformers
liuhaotian · 1.4M downloads · 448 likes
Vit Gpt2 Image Captioning
Apache-2.0
This is an image captioning model based on ViT and GPT2 architectures, capable of generating natural language descriptions for input images.
Image-to-Text
Transformers
nlpconnect · 939.88k downloads · 887 likes
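The ViT-plus-GPT2 setup above is a standard encoder-decoder captioner: the image is encoded once, and the decoder then emits the caption token by token, at each step conditioning on the image features and the tokens generated so far. A minimal greedy-decoding sketch, with a hypothetical toy step function standing in for the GPT2 decoder:

```python
import numpy as np

def greedy_decode(image_features, step_fn, bos_id, eos_id, max_len=16):
    """Greedy autoregressive decoding as used in encoder-decoder captioners:
    the encoder output (image_features) is fixed, and step_fn returns a
    vocabulary logit vector given the features and tokens so far."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(image_features, tokens)
        next_id = int(np.argmax(logits))   # greedy: pick the top token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical toy "decoder": deterministically walks a tiny vocabulary
# (0=BOS, 1="a", 2="cat", 3=EOS), ignoring the image features.
def toy_step(feats, toks):
    logits = np.zeros(4)
    logits[min(toks[-1] + 1, 3)] = 1.0     # always favor the next id
    return logits

caption_ids = greedy_decode(np.zeros(8), toy_step, bos_id=0, eos_id=3)
# caption_ids == [0, 1, 2, 3]
```

Real captioners typically swap the argmax for beam search or sampling, but the loop structure is the same.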
Blip2 Opt 2.7b
MIT
BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation tasks.
Image-to-Text
Transformers · English
Salesforce · 867.78k downloads · 359 likes
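BLIP-2's "image encoder plus large language model" combination can be sketched as follows, simplifying its Q-Former bridge to a plain linear projection for illustration: frozen image features are mapped into the language model's embedding space and prepended to the text embeddings as soft visual tokens. All dimensions and weights below are toy assumptions:

```python
import numpy as np

def project_visual_tokens(image_feats, W, b):
    """BLIP-2-style bridging sketch: features from a frozen image encoder are
    mapped by a small trained projection into the language model's embedding
    space, then prepended to the text embeddings as soft visual tokens."""
    return image_feats @ W + b

rng = np.random.default_rng(1)
num_patches, vis_dim, lm_dim = 32, 768, 2048            # assumed toy sizes
image_feats = rng.normal(size=(num_patches, vis_dim))   # from frozen encoder
W = rng.normal(size=(vis_dim, lm_dim)) * 0.01           # trained projection
b = np.zeros(lm_dim)
visual_tokens = project_visual_tokens(image_feats, W, b)
text_embs = rng.normal(size=(5, lm_dim))                # embedded prompt tokens
lm_input = np.concatenate([visual_tokens, text_embs])   # fed to the frozen LLM
```

Only the bridge is trained; keeping both the image encoder and the LLM frozen is what makes this recipe cheap relative to end-to-end multimodal training.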
Siglip2 So400m Patch14 384
Apache-2.0
SigLIP 2 is a vision-language model based on the SigLIP pre-training objective, integrating multiple technologies to enhance semantic understanding, localization, and dense feature extraction capabilities.
Image-to-Text
Transformers
google · 622.54k downloads · 20 likes
Gemma 3 4b It
Gemma is a lightweight, advanced open model series launched by Google, built on the same research and technology as Gemini. Gemma 3 is a multimodal model capable of processing both text and image inputs to generate text outputs.
Image-to-Text
Transformers
google · 608.22k downloads · 477 likes
Llava Llama 3 8b V1 1 Transformers
A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-text-to-text tasks.
Image-to-Text
xtuner · 454.61k downloads · 78 likes
Phi 3.5 Vision Instruct
MIT
Phi-3.5-vision is a lightweight, cutting-edge open multimodal model supporting a 128K context length, trained on high-quality, reasoning-rich text and visual data.
Image-to-Text
Transformers · Other
microsoft · 397.38k downloads · 679 likes
Gemma 3 27b It
Gemma is a lightweight cutting-edge open model series launched by Google, built on the same technology as Gemini, supporting multimodal input and text output.
Image-to-Text
Transformers
google · 371.46k downloads · 1,274 likes
Git Base
MIT
GIT is a dual-conditional Transformer decoder based on CLIP image tokens and text tokens, designed for image-to-text generation tasks.
Image-to-Text
Transformers · Multilingual
microsoft · 365.74k downloads · 93 likes
Gemma 3 12b It
Gemma is a lightweight cutting-edge open-source multimodal model series launched by Google, built on the technology used to create Gemini models, supporting text and image inputs to generate text outputs.
Image-to-Text
Transformers
google · 364.65k downloads · 340 likes
Siglip Base Patch16 224
Apache-2.0
SigLIP is a vision-language model pretrained on the WebLi dataset, utilizing an improved sigmoid loss function to optimize image-text matching tasks.
Image-to-Text
Transformers
google · 250.28k downloads · 43 likes
Siglip Large Patch16 384
Apache-2.0
SigLIP is a multimodal model pretrained on the WebLi dataset, utilizing an improved Sigmoid loss function, suitable for zero-shot image classification and image-text retrieval tasks.
Image-to-Text
Transformers
google · 245.21k downloads · 6 likes
Blip2 Opt 6.7b Coco
MIT
BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation and visual question answering tasks.
Image-to-Text
Transformers · English
Salesforce · 216.79k downloads · 33 likes
Trocr Base Handwritten
MIT
TrOCR is a Transformer-based optical character recognition model specifically designed for handwritten text recognition.
Image-to-Text
Transformers
microsoft · 206.74k downloads · 405 likes
Moondream2
Apache-2.0
Moondream is a lightweight vision-language model designed for efficient operation across all platforms.
Image-to-Text
vikhyatk · 184.93k downloads · 1,120 likes
Kosmos 2 Patch14 224
MIT
Kosmos-2 is a multimodal large language model capable of understanding and generating text descriptions related to images, and establishing associations between text and image regions.
Image-to-Text
Transformers
microsoft · 171.99k downloads · 162 likes
Donut Base Finetuned Docvqa
MIT
Donut is an OCR-free document understanding Transformer model, fine-tuned on the DocVQA dataset, capable of directly extracting and comprehending text information from images.
Image-to-Text
Transformers
naver-clova-ix · 167.80k downloads · 231 likes
Biomedclip PubMedBERT 256 Vit Base Patch16 224
MIT
BiomedCLIP is a biomedical vision-language foundation model pre-trained via contrastive learning on the PMC-15M dataset, supporting cross-modal retrieval, image classification, visual question answering, and other tasks.
Image-to-Text · English
microsoft · 137.39k downloads · 296 likes
Donut Base Finetuned Rvlcdip
MIT
Donut is an OCR-free document understanding Transformer model that combines a visual encoder and text decoder to process document images.
Image-to-Text
Transformers
naver-clova-ix · 125.36k downloads · 13 likes
Minicpm V 2 6 Int4
MiniCPM-V 2.6 is a multimodal vision-language model supporting image-to-text conversion with multilingual processing capabilities.
Image-to-Text
Transformers · Other
openbmb · 122.58k downloads · 79 likes
Blip2 Flan T5 Xl
MIT
BLIP-2 is a vision-language model based on Flan T5-xl, pre-trained by freezing the image encoder and large language model, supporting tasks such as image captioning and visual question answering.
Image-to-Text
Transformers · English
Salesforce · 91.77k downloads · 68 likes
Minicpm V 2 6
MiniCPM-V is a mobile GPT-4V-level multimodal large language model that supports single-image, multi-image, and video understanding, equipped with visual and optical character recognition capabilities.
Image-to-Text
Transformers · Other
openbmb · 91.52k downloads · 969 likes
H2ovl Mississippi 2b
Apache-2.0
H2OVL-Mississippi-2B is a high-performance general-purpose vision-language model developed by H2O.ai, capable of handling a wide range of multimodal tasks. With 2 billion parameters, it performs strongly on image captioning, visual question answering (VQA), and document understanding.
Image-to-Text
Transformers · English
h2oai · 91.28k downloads · 34 likes
Clip Flant5 Xxl
Apache-2.0
A vision-language generative model fine-tuned from google/flan-t5-xxl, designed specifically for image-text retrieval tasks.
Image-to-Text
Transformers · English
zhiqiulin · 86.23k downloads · 2 likes
Florence 2 SD3 Captioner
Apache-2.0
Florence-2-SD3-Captioner is an image captioning model based on the Florence-2 architecture, designed to generate high-quality image captions.
Image-to-Text
Transformers · Multilingual
gokaygokay · 80.06k downloads · 34 likes
H2ovl Mississippi 800m
Apache-2.0
An 800M-parameter vision-language model from H2O.ai, specializing in OCR and document understanding with excellent performance.
Image-to-Text
Transformers · English
h2oai · 77.67k downloads · 33 likes
Moondream1
A 1.6B-parameter multimodal model combining the SigLIP and Phi-1.5 architectures, supporting image understanding and Q&A tasks.
Image-to-Text
Transformers · English
vikhyatk · 70.48k downloads · 487 likes
Gemma 3 27b It Qat Q4 0 Gguf
Gemma is a lightweight open multimodal model series from Google. It accepts text and image inputs and generates text outputs, with a 128K-token context window and support for over 140 languages.
Image-to-Text
google · 69.29k downloads · 251 likes
Smolvlm2 2.2B Instruct
Apache-2.0
SmolVLM2-2.2B is a lightweight multimodal model designed for analyzing video content. It can process video, image, and text inputs and generate text outputs.
Image-to-Text
Transformers · English
HuggingFaceTB · 62.56k downloads · 164 likes
Pix2struct Tiny Random
MIT
An image-to-text model capable of converting image content into descriptive text.
Image-to-Text
Transformers
fxmarty · 60.87k downloads · 2 likes
Florence 2 Base Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text
Transformers
microsoft · 56.78k downloads · 110 likes
Gemma 3 4b Pt
Gemma is a series of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create Gemini models.
Image-to-Text
Transformers
google · 55.03k downloads · 68 likes
Gemma 3 12b Pt
Gemma is a lightweight open-source multimodal model series launched by Google, built on the same technology as Gemini, supporting text and image inputs and generating text outputs.
Image-to-Text
Transformers
google · 54.36k downloads · 46 likes
Chexpert Mimic Cxr Findings Baseline
MIT
This is a medical imaging report generation model based on the VisionEncoderDecoder architecture, specifically designed to generate radiology report texts from chest X-ray images.
Image-to-Text
Transformers · English
IAMJB · 53.27k downloads · 1 like
Chexpert Mimic Cxr Impression Baseline
MIT
This is a text generation model based on chest X-ray images, capable of generating radiology impression reports from medical imaging.
Image-to-Text
Transformers · English
IAMJB · 52.87k downloads · 0 likes