My-model Open-source Image-to-Text Tool - Free Deployment, Generate Descriptive Text Based on Images

My Model

Developed by anoushhka

GIT is a Transformer-based image-to-text generation model capable of generating descriptive text from input images.

Supports Multiple Languages
Open Source License: MIT
#Image Caption Generation #Visual Question Answering #Multimodal Transformer

Downloads 87

Release Time: 4/8/2025

Model Overview

GIT (short for GenerativeImage2Text) is a Transformer decoder conditioned on both CLIP image tokens and text tokens. It is trained with teacher forcing on a large number of image-text pairs and can perform tasks such as image caption generation and visual question answering.

Model Features

Dual-Conditional Transformer Architecture

Processes both image tokens and text tokens simultaneously to achieve image-to-text generation.

Multi-Task Capability

Supports various vision-language tasks such as image caption generation, visual question answering, and image classification.

Large-Scale Pretraining

Pretrained on 10 million image-text pairs and fine-tuned on the COCO dataset.

Model Capabilities

Image Caption Generation

Visual Question Answering (VQA)

Image Classification

Video Caption Generation

Use Cases

Content Generation

Automatic Image Tagging

Generates descriptive text for images

Can be used for social media content management or accessibility.

Intelligent Q&A

Visual Question Answering System

Answers natural language questions about image content

Can be used in educational or customer service scenarios.

🚀 GIT (GenerativeImage2Text), base-sized, fine-tuned on COCO

GIT (GenerativeImage2Text) is a base-sized model fine-tuned on the COCO dataset. It addresses the challenge of generating text from images, offering high-quality image captioning and enabling various vision-language tasks.

🚀 Quick Start

You can use the raw model for image captioning. Check the model hub for fine-tuned versions for the tasks that interest you. For code examples, refer to the documentation.
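As a starting point, captioning with the Hugging Face transformers library might look like the sketch below. It assumes that the checkpoint is available under the ID microsoft/git-base-coco (the original GIT-base COCO release; substitute the ID of the checkpoint you actually deploy) and that transformers, torch, and Pillow are installed. The function name caption_image is illustrative and not part of any library.

```python
def caption_image(image_path, checkpoint="microsoft/git-base-coco", max_length=50):
    """Return a caption for the image stored at image_path.

    The default checkpoint ID is an assumption; replace it with your own.
    """
    # Imported lazily so this module loads even without the heavy dependencies.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    # The processor resizes, crops, and normalizes the image into a tensor.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressive decoding conditioned on the image alone (no text prompt).
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling caption_image("photo.jpg"), where photo.jpg is a hypothetical local file, returns a single caption string. The first call downloads the checkpoint, so expect it to be slow.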

✨ Features

Versatile Task Support: Can be used for image and video captioning, visual question answering (VQA) on images and videos, and even image classification.
Unique Architecture: A Transformer decoder conditioned on both CLIP image tokens and text tokens, with specific attention mask mechanisms for image and text tokens.

📚 Documentation

Model description

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a large number of (image, text) pairs.
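Teacher forcing means that, during training, the decoder is fed the ground-truth caption rather than its own earlier predictions, and is asked to predict each next token. A minimal sketch of how one caption is split into inputs and targets follows; the token IDs and the helper name teacher_forcing_pairs are hypothetical and do not come from the GIT codebase.

```python
def teacher_forcing_pairs(token_ids):
    """Split a ground-truth caption into (decoder inputs, next-token targets)."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens, e.g. [BOS] and [EOS]")
    inputs = token_ids[:-1]   # what the decoder sees at each step
    targets = token_ids[1:]   # what it is trained to predict at each step
    return inputs, targets


# Hypothetical token IDs: 101 = [BOS], 102 = [EOS], the rest are caption words.
caption = [101, 7, 42, 13, 102]
inputs, targets = teacher_forcing_pairs(caption)
```

Here inputs[k] is paired with targets[k], so the model learns to produce token k+1 of the caption from the image tokens plus tokens 0 through k.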

The goal of the model is to predict the next text token given the image tokens and previous text tokens. When predicting the next text token, the model has full access to the image patch tokens (using a bidirectional attention mask), but only access to the previous text tokens (using a causal attention mask for the text tokens).
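The combined mask described above can be sketched in plain Python. The sketch assumes that image tokens come first in the sequence and that image tokens do not attend to text tokens; the function name git_attention_mask is illustrative only.

```python
def git_attention_mask(num_image_tokens, num_text_tokens):
    """Return mask[i][j] == 1 if position i may attend to position j.

    Positions 0..num_image_tokens-1 are image tokens; the rest are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position sees all image tokens (bidirectional).
                mask[i][j] = 1
            elif i >= num_image_tokens and j <= i:
                # A text token sees itself and earlier text only (causal).
                mask[i][j] = 1
    return mask


mask = git_attention_mask(num_image_tokens=2, num_text_tokens=3)
for row in mask:
    print(row)
```

With two image tokens and three text tokens, the two left-hand columns are all ones (full image access), while the lower-right 3x3 block is lower-triangular (causal text access).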

GIT architecture

This architecture enables the model to handle tasks such as:

Image and video captioning
Visual question answering (VQA) on images and videos
Image classification (by conditioning the model on the image and asking it to generate a class in text)

Intended uses & limitations

You can use the raw model for image captioning. Look for fine-tuned versions on the model hub for tasks that interest you.

Training data

From the paper:

We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a).

However, this is for the model referred to as "GIT" in the paper, which is not open-sourced. This checkpoint is "GIT-base", a smaller variant of GIT trained on 10 million image-text pairs and then fine-tuned on COCO. See Table 11 in the paper for more details.

Preprocessing

Refer to the original repo for details on preprocessing during training. During validation, the shorter edge of each image is resized, followed by center cropping to a fixed-size resolution. Then, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
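The geometry and arithmetic of the validation steps can be sketched without any image library. The 224-pixel target and crop size are assumptions (the card does not state the resolution), and the mean and standard deviation are the commonly quoted ImageNet statistics; check the checkpoint's preprocessor configuration for the values actually used.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)  # commonly quoted ImageNet RGB means
IMAGENET_STD = (0.229, 0.224, 0.225)   # commonly quoted ImageNet RGB std devs


def resize_shorter_edge(width, height, target):
    """Scale (width, height) so the shorter edge equals target, keeping aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop):
    """Return (left, top, right, bottom) of a centered crop-by-crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Map an 8-bit RGB pixel to per-channel normalized floats."""
    return tuple(
        (c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A 640x480 image with an assumed target size of 224.
new_w, new_h = resize_shorter_edge(640, 480, target=224)
box = center_crop_box(new_w, new_h, crop=224)
```

For the 640x480 example the image is resized to 299x224 and the crop window is (37, 0, 261, 224), so only the left and right margins are discarded.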

Evaluation results

For evaluation results, refer to the paper.

🔧 Technical Details

GIT was introduced in the paper "GIT: A Generative Image-to-text Transformer for Vision and Language" by Wang et al. and first released in this repository.

Disclaimer: The team releasing GIT did not write a model card for this model, so this model card has been written by the Hugging Face team.

📄 License

This model is released under the MIT license.

Property	Details
Model Type	GIT (GenerativeImage2Text), base-sized, fine-tuned on COCO
Training Data	10 million image-text pairs for GIT-base, then fine-tuned on COCO. The original GIT was pre-trained on 0.8B image-text pairs including COCO, CC3M, SBU, VG, CC12M, ALT200M, etc.
