The Best 897 Image-to-Text Tools in 2025
Clip Vit Large Patch14
CLIP is a vision-language model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, supporting zero-shot image classification.
Image-to-Text
openai · 44.7M downloads · 1,710 likes
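The shared embedding space described above can be sketched with toy vectors: both modalities are L2-normalized, image-text cosine similarities are scaled by a temperature, and a softmax over candidate captions yields zero-shot class probabilities. The embeddings and temperature below are illustrative stand-ins, not outputs of the real model:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """CLIP-style zero-shot classification: cosine similarity between a
    normalized image embedding and normalized text embeddings, scaled by a
    temperature and softmaxed into per-label probabilities."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # one logit per candidate caption
    logits -= logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return probs

# Toy embeddings standing in for CLIP's image and text encoders.
image_emb = np.array([0.9, 0.1, 0.0])
text_embs = np.array([
    [1.0, 0.0, 0.0],   # e.g. "a photo of a cat"
    [0.0, 1.0, 0.0],   # e.g. "a photo of a dog"
])
probs = zero_shot_classify(image_emb, text_embs)
```

Because neither encoder is retrained per task, new classes are added simply by writing new candidate captions.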
Clip Vit Base Patch32
CLIP is a multimodal model developed by OpenAI that can understand the relationship between images and text, supporting zero-shot image classification tasks.
Image-to-Text
openai · 14.0M downloads · 666 likes
Siglip So400m Patch14 384
Apache-2.0
SigLIP is a vision-language model pre-trained on the WebLi dataset, employing an improved sigmoid loss function to optimize image-text matching tasks.
Image-to-Text
Transformers
google · 6.1M downloads · 526 likes
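The improved sigmoid loss that distinguishes SigLIP from CLIP can be sketched in a few lines: every image-text pair in a batch is an independent binary classification (+1 for matching pairs, -1 otherwise), instead of CLIP's batch-wide softmax. The scale and bias values below are illustrative assumptions, not the model's learned parameters:

```python
import numpy as np

def siglip_loss(img_embs, txt_embs, scale=10.0, bias=-10.0):
    """Pairwise sigmoid loss (SigLIP-style): each image-text pair is scored
    independently, labeled +1 on the diagonal (matching pairs) and -1
    elsewhere -- no batch-wide softmax normalization as in CLIP."""
    img = img_embs / np.linalg.norm(img_embs, axis=1, keepdims=True)
    txt = txt_embs / np.linalg.norm(txt_embs, axis=1, keepdims=True)
    logits = scale * (img @ txt.T) + bias        # [N, N] pair scores
    labels = 2.0 * np.eye(len(img)) - 1.0        # +1 diagonal, -1 off-diagonal
    # -log sigmoid(label * logit), via log1p(exp(.)) for numerical stability
    return np.mean(np.log1p(np.exp(-labels * logits)))

rng = np.random.default_rng(0)
# Toy batch: matched pairs are nearly identical, mismatches are random.
imgs = rng.normal(size=(4, 8))
texts = imgs + 0.01 * rng.normal(size=(4, 8))
aligned = siglip_loss(imgs, texts)
shuffled = siglip_loss(imgs, texts[::-1].copy())  # break the pairing
```

A correctly paired batch should score a lower loss than a shuffled one; since each pair is scored independently, the loss also decouples from batch size in a way the softmax formulation does not.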
Clip Vit Base Patch16
CLIP is a multimodal model developed by OpenAI that maps images and text into a shared embedding space through contrastive learning, enabling zero-shot image classification capabilities.
Image-to-Text
openai · 4.6M downloads · 119 likes
Blip Image Captioning Base
BSD-3-Clause
BLIP is an advanced vision-language pretrained model, excelling in image captioning tasks and supporting both conditional and unconditional text generation.
Image-to-Text
Transformers
Salesforce · 2.8M downloads · 688 likes
Blip Image Captioning Large
BSD-3-Clause
BLIP is a unified vision-language pretraining framework that excels at image captioning, supporting both conditional and unconditional caption generation.
Image-to-Text
Transformers
Salesforce · 2.5M downloads · 1,312 likes
Openvla 7b
MIT
OpenVLA 7B is an open-source vision-language-action model trained on the Open X-Embodiment dataset, capable of generating robot actions based on language instructions and camera images.
Image-to-Text
Transformers · English
openvla · 1.7M downloads · 108 likes
Llava V1.5 7b
LLaVA is an open-source multimodal chatbot, fine-tuned from LLaMA/Vicuna, supporting image-and-text conversation.
Image-to-Text
Transformers
liuhaotian · 1.4M downloads · 448 likes
Vit Gpt2 Image Captioning
Apache-2.0
This is an image captioning model based on ViT and GPT2 architectures, capable of generating natural language descriptions for input images.
Image-to-Text
Transformers
nlpconnect · 939.88k downloads · 887 likes
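The ViT-plus-GPT2 setup above is a standard encoder-decoder captioner: the image is encoded once, and the decoder then emits the caption token by token, at each step conditioning on the image features and the tokens generated so far. A minimal greedy-decoding sketch, with a hypothetical toy step function standing in for the GPT2 decoder:

```python
import numpy as np

def greedy_decode(image_features, step_fn, bos_id, eos_id, max_len=16):
    """Greedy autoregressive decoding as used in encoder-decoder captioners:
    the encoder output (image_features) is fixed, and step_fn returns a
    vocabulary logit vector given the features and tokens so far."""
    tokens = [bos_id]
    for _ in range(max_len):
        logits = step_fn(image_features, tokens)
        next_id = int(np.argmax(logits))   # greedy: pick the top token
        tokens.append(next_id)
        if next_id == eos_id:
            break
    return tokens

# Hypothetical toy "decoder": deterministically walks a tiny vocabulary
# (0=BOS, 1="a", 2="cat", 3=EOS), ignoring the image features.
def toy_step(feats, toks):
    logits = np.zeros(4)
    logits[min(toks[-1] + 1, 3)] = 1.0     # always favor the next id
    return logits

caption_ids = greedy_decode(np.zeros(8), toy_step, bos_id=0, eos_id=3)
# caption_ids == [0, 1, 2, 3]
```

Real captioners typically swap the argmax for beam search or sampling, but the loop structure is the same.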
Blip2 Opt 2.7b
MIT
BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation tasks.
Image-to-Text
Transformers · English
Salesforce · 867.78k downloads · 359 likes
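BLIP-2's "image encoder plus large language model" combination can be sketched as follows, simplifying its Q-Former bridge to a plain linear projection for illustration: frozen image features are mapped into the language model's embedding space and prepended to the text embeddings as soft visual tokens. All dimensions and weights below are toy assumptions:

```python
import numpy as np

def project_visual_tokens(image_feats, W, b):
    """BLIP-2-style bridging sketch: features from a frozen image encoder are
    mapped by a small trained projection into the language model's embedding
    space, then prepended to the text embeddings as soft visual tokens."""
    return image_feats @ W + b

rng = np.random.default_rng(1)
num_patches, vis_dim, lm_dim = 32, 768, 2048            # assumed toy sizes
image_feats = rng.normal(size=(num_patches, vis_dim))   # from frozen encoder
W = rng.normal(size=(vis_dim, lm_dim)) * 0.01           # trained projection
b = np.zeros(lm_dim)
visual_tokens = project_visual_tokens(image_feats, W, b)
text_embs = rng.normal(size=(5, lm_dim))                # embedded prompt tokens
lm_input = np.concatenate([visual_tokens, text_embs])   # fed to the frozen LLM
```

Only the bridge is trained; keeping both the image encoder and the LLM frozen is what makes this recipe cheap relative to end-to-end multimodal training.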
Siglip2 So400m Patch14 384
Apache-2.0
SigLIP 2 is a vision-language model based on the SigLIP pre-training objective, integrating multiple technologies to enhance semantic understanding, localization, and dense feature extraction capabilities.
Image-to-Text
Transformers
google · 622.54k downloads · 20 likes
Gemma 3 4b It
Gemma is a lightweight, advanced open model series launched by Google, built on the same research and technology as Gemini. Gemma 3 is a multimodal model capable of processing both text and image inputs to generate text outputs.
Image-to-Text
Transformers
google · 608.22k downloads · 477 likes
Llava Llama 3 8b V1 1 Transformers
A LLaVA model fine-tuned from Meta-Llama-3-8B-Instruct and CLIP-ViT-Large-patch14-336, supporting image-text-to-text tasks.
Image-to-Text
xtuner · 454.61k downloads · 78 likes
Phi 3.5 Vision Instruct
MIT
Phi-3.5-vision is a lightweight, cutting-edge open multimodal model supporting a 128K context length, trained on high-quality, reasoning-rich text and visual data.
Image-to-Text
Transformers · Other
microsoft · 397.38k downloads · 679 likes
Gemma 3 27b It
Gemma is a lightweight cutting-edge open model series launched by Google, built on the same technology as Gemini, supporting multimodal input and text output.
Image-to-Text
Transformers
google · 371.46k downloads · 1,274 likes
Git Base
MIT
GIT is a dual-conditional Transformer decoder based on CLIP image tokens and text tokens, designed for image-to-text generation tasks.
Image-to-Text
Transformers · Multilingual
microsoft · 365.74k downloads · 93 likes
Gemma 3 12b It
Gemma is a lightweight cutting-edge open-source multimodal model series launched by Google, built on the technology used to create Gemini models, supporting text and image inputs to generate text outputs.
Image-to-Text
Transformers
google · 364.65k downloads · 340 likes
Siglip Base Patch16 224
Apache-2.0
SigLIP is a vision-language model pretrained on the WebLi dataset, utilizing an improved sigmoid loss function to optimize image-text matching tasks.
Image-to-Text
Transformers
google · 250.28k downloads · 43 likes
Siglip Large Patch16 384
Apache-2.0
SigLIP is a multimodal model pretrained on the WebLi dataset, utilizing an improved Sigmoid loss function, suitable for zero-shot image classification and image-text retrieval tasks.
Image-to-Text
Transformers
google · 245.21k downloads · 6 likes
Blip2 Opt 6.7b Coco
MIT
BLIP-2 is a vision-language model that combines an image encoder with a large language model for image-to-text generation and visual question answering tasks.
Image-to-Text
Transformers · English
Salesforce · 216.79k downloads · 33 likes
Trocr Base Handwritten
MIT
TrOCR is a Transformer-based optical character recognition model specifically designed for handwritten text recognition.
Image-to-Text
Transformers
microsoft · 206.74k downloads · 405 likes
Moondream2
Apache-2.0
Moondream is a lightweight vision-language model designed for efficient operation across all platforms.
Image-to-Text
vikhyatk · 184.93k downloads · 1,120 likes
Kosmos 2 Patch14 224
MIT
Kosmos-2 is a multimodal large language model capable of understanding and generating text descriptions related to images, and establishing associations between text and image regions.
Image-to-Text
Transformers
microsoft · 171.99k downloads · 162 likes
Donut Base Finetuned Docvqa
MIT
Donut is an OCR-free document understanding Transformer model, fine-tuned on the DocVQA dataset, capable of directly extracting and comprehending text information from images.
Image-to-Text
Transformers
naver-clova-ix · 167.80k downloads · 231 likes
Biomedclip PubMedBERT 256 Vit Base Patch16 224
MIT
BiomedCLIP is a biomedical vision-language foundation model pre-trained via contrastive learning on the PMC-15M dataset, supporting cross-modal retrieval, image classification, visual question answering, and other tasks.
Image-to-Text · English
microsoft · 137.39k downloads · 296 likes
Donut Base Finetuned Rvlcdip
MIT
Donut is an OCR-free document understanding Transformer model that combines a visual encoder and text decoder to process document images.
Image-to-Text
Transformers
naver-clova-ix · 125.36k downloads · 13 likes
Minicpm V 2 6 Int4
MiniCPM-V 2.6 is a multimodal vision-language model supporting image-to-text conversion with multilingual processing capabilities.
Image-to-Text
Transformers · Other
openbmb · 122.58k downloads · 79 likes
Blip2 Flan T5 Xl
MIT
BLIP-2 is a vision-language model based on Flan T5-xl, pre-trained by freezing the image encoder and large language model, supporting tasks such as image captioning and visual question answering.
Image-to-Text
Transformers · English
Salesforce · 91.77k downloads · 68 likes
Minicpm V 2 6
MiniCPM-V is a mobile GPT-4V-level multimodal large language model that supports single-image, multi-image, and video understanding, equipped with visual and optical character recognition capabilities.
Image-to-Text
Transformers · Other
openbmb · 91.52k downloads · 969 likes
H2ovl Mississippi 2b
Apache-2.0
H2OVL-Mississippi-2B is a high-performance general-purpose vision-language model developed by H2O.ai, capable of handling a wide range of multimodal tasks. With 2 billion parameters, it performs strongly on image captioning, visual question answering (VQA), and document understanding.
Image-to-Text
Transformers · English
h2oai · 91.28k downloads · 34 likes
Clip Flant5 Xxl
Apache-2.0
A vision-language generative model fine-tuned from google/flan-t5-xxl, designed specifically for image-text retrieval tasks.
Image-to-Text
Transformers · English
zhiqiulin · 86.23k downloads · 2 likes
Florence 2 SD3 Captioner
Apache-2.0
Florence-2-SD3-Captioner is an image captioning model based on the Florence-2 architecture, designed to generate high-quality image captions.
Image-to-Text
Transformers · Multilingual
gokaygokay · 80.06k downloads · 34 likes
H2ovl Mississippi 800m
Apache-2.0
An 800M-parameter vision-language model from H2O.ai, specializing in OCR and document understanding with excellent performance.
Image-to-Text
Transformers · English
h2oai · 77.67k downloads · 33 likes
Moondream1
A 1.6B-parameter multimodal model combining the SigLIP and Phi-1.5 architectures, supporting image understanding and Q&A tasks.
Image-to-Text
Transformers · English
vikhyatk · 70.48k downloads · 487 likes
Gemma 3 27b It Qat Q4 0 Gguf
Gemma is a lightweight open multimodal model series from Google. It accepts text and image inputs and generates text outputs, with a 128K-token context window and support for over 140 languages.
Image-to-Text
google · 69.29k downloads · 251 likes
Smolvlm2 2.2B Instruct
Apache-2.0
SmolVLM2-2.2B is a lightweight multimodal model designed for analyzing video content. It can process video, image, and text inputs and generate text outputs.
Image-to-Text
Transformers · English
HuggingFaceTB · 62.56k downloads · 164 likes
Pix2struct Tiny Random
MIT
An image-to-text model capable of converting image content into descriptive text.
Image-to-Text
Transformers
fxmarty · 60.87k downloads · 2 likes
Florence 2 Base Ft
MIT
Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of vision and vision-language tasks.
Image-to-Text
Transformers
microsoft · 56.78k downloads · 110 likes
Gemma 3 4b Pt
Gemma is a series of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create Gemini models.
Image-to-Text
Transformers
google · 55.03k downloads · 68 likes
Gemma 3 12b Pt
Gemma is a lightweight open-source multimodal model series launched by Google, built on the same technology as Gemini, supporting text and image inputs and generating text outputs.
Image-to-Text
Transformers
google · 54.36k downloads · 46 likes
Chexpert Mimic Cxr Findings Baseline
MIT
This is a medical imaging report generation model based on the VisionEncoderDecoder architecture, specifically designed to generate radiology report texts from chest X-ray images.
Image-to-Text
Transformers · English
IAMJB · 53.27k downloads · 1 like
Chexpert Mimic Cxr Impression Baseline
MIT
This is a text generation model based on chest X-ray images, capable of generating radiology impression reports from medical imaging.
Image-to-Text
Transformers · English
IAMJB · 52.87k downloads · 0 likes