# Image-to-Text

Qari OCR 0.3 SNAPSHOT VL 2B Instruct Merged GGUF

A statically quantized version of the Qari-OCR-0.3-SNAPSHOT-VL-2B-Instruct-merged model, mainly used for image-to-text conversion tasks.

Transformers English

Vintern 1B V3 5 GGUF Ext

Vintern-1B-v3_5 is a 1-billion-parameter vision-language model supporting image-text generation tasks.

Mixtex Finetune

MixTex base_ZhEn is an image-to-text model supporting both Chinese and English, released under the MIT License.

Image-to-Text Supports Multiple Languages

Sarashina2 Vision 8b

Sarashina2-Vision-8B is a large Japanese vision-language model trained by SB Intuitions. It is built on the Sarashina2-7B language model and the image encoder of Qwen2-VL-7B, and reports strong results on multiple benchmarks.

Transformers Supports Multiple Languages

A Devanagari optical character recognition model based on the TrOCR architecture, fine-tuned for Nepali text in Devanagari script.

Text Recognition

Transformers Other

Trocr Math Handwritten

A Transformer-based TrOCR model designed to recognize handwritten mathematical formulas.

Florence 2 Large

Florence-2 is an advanced vision foundation model developed by Microsoft, using a prompt-based approach to handle a wide range of vision and vision-language tasks.

Florence 2 Large

Florence-2 is an advanced vision foundation model developed by Microsoft, employing a prompt-based approach to handle a wide range of visual and vision-language tasks.

lodestone-horizon

Donut is a Transformer-based image-to-text model capable of extracting and generating textual content from images.

Libra is a decoupled vision system built upon large language models, possessing fundamental multimodal understanding capabilities.

Llava Phi 3 Mini Gguf

LLaVA-Phi-3-mini is a fine-tuned LLaVA model based on Phi-3-mini-4k-instruct and CLIP-ViT-Large-patch14-336, specializing in image-to-text tasks.

InfiMM-HD is a high-resolution multimodal model capable of understanding and generating content that combines images and text.

Transformers English

Git Base Next Refined

An image-to-text model fine-tuned from microsoft/git-base.

Large Language Model

Transformers Other

Vit Gpt2 Verifycode Caption

A CAPTCHA recognition model with a ViT-GPT2 architecture, fine-tuned on a dataset of 60,000 images to read the text in CAPTCHA images.

Pix2struct Refexp Base

Pix2Struct is an image encoder-text decoder model trained for multiple vision-language tasks, including image captioning and visual question answering.

Transformers Supports Multiple Languages

Trocr Small Korean

A Korean TrOCR image-to-text model built on a vision encoder-decoder architecture, with DeiT as the image encoder and RoBERTa as the text decoder.

Image-to-Text Korean

A multimodal model based on Microsoft's GIT framework that extracts text from images of student homework and generates teacher feedback.

Transformers Supports Multiple Languages

Mangaocr Hoogberta V2

A Japanese manga text recognition model based on the TrOCR architecture, specifically designed for extracting text content from manga images.

Trocr Base Handwritten OCR Handwriting Recognition V2

A handwritten OCR model fine-tuned from Microsoft's trocr-base-handwritten, achieving a character error rate (CER) of 0.0360 on the evaluation set.

Text Recognition

Transformers English
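
The character error rate (CER) quoted for the handwriting recognition model above is commonly defined as the character-level edit distance between the recognized text and the reference text, divided by the number of reference characters. The listing does not say how its evaluation normalized text or aggregated over samples, so the following is only a minimal sketch under that common definition; the function names `levenshtein` and `cer` are illustrative and not part of any model's API.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters.

    Assumption: edits and lengths are pooled over all samples; some
    evaluations average per-sample CER instead.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    print(cer(["kitten"], ["sitting"]))
```

Under this definition, a CER of 0.0360 corresponds to roughly 3.6 character edits per 100 reference characters.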

Vit Gpt2 Image Captioning

An image captioning model based on the Vision Encoder-Decoder architecture that generates natural language descriptions of input images.

A satellite image captioning model fine-tuned from Microsoft's GIT-base that generates brief descriptions of NASA Earth Observatory images.

Transformers Other

Pix2struct Large

Pix2Struct is an image encoder-text decoder model trained on image-text pairs, suitable for various vision-language tasks.

Transformers Supports Multiple Languages

Pix2struct Ai2d Base

A Pix2Struct vision-language model fine-tuned on AI2D for visual question answering (VQA) over science diagrams.

Transformers Supports Multiple Languages

Pix2struct Base

Pix2Struct is an image encoder-text decoder model trained on various image-text pairs for tasks including image captioning and visual question answering.

Transformers Supports Multiple Languages

Git Large Vatex

GIT is a Transformer decoder conditioned on CLIP image tokens and text tokens, designed for tasks like image and video caption generation and visual question answering.

Transformers Supports Multiple Languages

An invoice processing model fine-tuned from naver-clova-ix/donut-base.

Image Caption Generator

A vision-language model trained on the Flickr8k dataset that generates natural language descriptions of input images.

Vit Gpt2 Coco En

An image-to-text model combining ViT and GPT-2 that generates English descriptions of input images.

Trocr Large Handwritten

TrOCR is a Transformer-based optical character recognition model specifically designed for handwritten text recognition, fine-tuned on the IAM dataset.

Text Recognition


© 2025 AIbase