GIT (GenerativeImage2Text), large-sized, fine-tuned on COCO, R*
The GIT model, a large-sized variant fine-tuned on COCO, is designed for image-to-text tasks. The R* suffix marks a checkpoint that was re-trained to address offensive captions in the training data.
Quick Start
The GIT (GenerativeImage2Text) model, specifically the large-sized version fine-tuned on COCO, was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository.
Disclaimer: The team releasing GIT did not write a model card for this model, so this model card has been written by the Hugging Face team.
Features
- Versatile Applications: Can be used for image and video captioning, visual question answering (VQA) on images and videos, and even image classification.
- Unique Architecture: A Transformer decoder conditioned on both CLIP image tokens and text tokens, trained using "teacher forcing" on numerous (image, text) pairs.
Documentation
Model description
GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a large number of (image, text) pairs.
The goal of the model is to predict the next text token, given the image tokens and previous text tokens.
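As a minimal sketch of what teacher forcing means for a single caption, the snippet below shifts a token sequence by one position to obtain decoder inputs and next-token targets, then averages the negative log-likelihood of the ground-truth tokens. It is an illustration under stated assumptions, not the training code: the token ids and probabilities are made up, the helper names are hypothetical, and the image tokens that GIT prepends to the text are left out for brevity.

```python
import math


def teacher_forcing_pairs(token_ids):
    """Split one tokenized caption into decoder inputs and next-token targets."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an (input, target) pair")
    # Position t is fed ground-truth token t and must predict token t + 1,
    # regardless of what the model itself predicted at earlier positions.
    return token_ids[:-1], token_ids[1:]


def next_token_loss(step_probs, targets):
    """Mean negative log-likelihood of the ground-truth next tokens.

    step_probs[t] maps token ids to the probability assigned at position t
    (a stand-in for the softmax output of a real decoder).
    """
    if len(step_probs) != len(targets):
        raise ValueError("one probability table is required per target token")
    nll = [-math.log(probs[target]) for probs, target in zip(step_probs, targets)]
    return sum(nll) / len(nll)


# Hypothetical token ids: 101 = start, 102 = end, 7 and 8 = caption words.
caption = [101, 7, 8, 102]
inputs, targets = teacher_forcing_pairs(caption)

# Made-up probabilities for the correct token at each of the three positions.
step_probs = [{7: 0.5}, {8: 0.25}, {102: 1.0}]
loss = next_token_loss(step_probs, targets)
```

Here `inputs` is `[101, 7, 8]` and `targets` is `[7, 8, 102]`: each position is scored against the true next token.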
The model has full access to (i.e., a bidirectional attention mask is used for) the image patch tokens, but only has access to the previous text tokens (i.e., a causal attention mask is used for the text tokens) when predicting the next text token.

This allows the model to be used for tasks such as:
- Image and video captioning
- Visual question answering (VQA) on images and videos
- Even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).
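The attention pattern from the model description (bidirectional over image patch tokens, causal over text tokens) can be sketched as a boolean mask. This is an illustrative reconstruction rather than the library's implementation: the function name is hypothetical, the sequence is assumed to be laid out as image tokens followed by text tokens, a text token is assumed to attend to its own position as in a standard causal mask, and image tokens are assumed not to attend to text tokens.

```python
def git_attention_mask(num_image_tokens, num_text_tokens):
    """Build mask[query][key]; True means the query may attend to the key.

    Assumed layout: image tokens first, then text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for query in range(total):
        for key in range(total):
            if key < num_image_tokens:
                # Every token, image or text, sees all image tokens.
                mask[query][key] = True
            elif query >= num_image_tokens:
                # Text query, text key: only the current and earlier positions.
                mask[query][key] = key <= query
            # Image query, text key: left False (assumption, see lead-in).
    return mask


mask = git_attention_mask(num_image_tokens=2, num_text_tokens=3)
rows = [" ".join("1" if allowed else "0" for allowed in row) for row in mask]
# rows, one query per line:
# 1 1 0 0 0
# 1 1 0 0 0
# 1 1 1 0 0
# 1 1 1 1 0
# 1 1 1 1 1
```

The top-left block is fully connected (image to image), while the bottom-right block is lower triangular (text to earlier text).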
Intended uses & limitations
You can use the raw model for image captioning. See the model hub to look for fine-tuned versions on a task that interests you.
How to use
For code examples, we refer to the documentation.
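For orientation, a captioning call with the Hugging Face `transformers` library might look like the sketch below. Two things are assumptions rather than facts stated in this card: the Hub id `microsoft/git-large-r-coco` for this checkpoint, and the `max_length` value of 50. The helper name `generate_caption` is hypothetical. The code needs `transformers`, `torch`, and `Pillow` installed.

```python
MODEL_ID = "microsoft/git-large-r-coco"  # assumed Hub id for this checkpoint


def generate_caption(image_path, model_id=MODEL_ID, max_length=50):
    """Return a generated caption for the image stored at image_path."""
    # Imported here so that defining the helper does not require the
    # heavy dependencies until a caption is actually requested.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # The processor resizes, crops, and normalizes the image into pixel_values.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # With only image input, the decoder generates the caption token by token.
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `generate_caption("photo.jpg")`, where `photo.jpg` is a placeholder for a local file, downloads the weights on first use and returns the caption as a string.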
Training data
From the paper:
We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a).
However, this describes the model referred to as "GIT" in the paper, which is not open-sourced.
This checkpoint is "GIT-large", which is a smaller variant of GIT trained on 20 million image-text pairs.
Next, the model was fine-tuned on COCO.
See table 11 in the paper for more details.
Preprocessing
We refer to the original repo regarding details for preprocessing during training.
During validation, the shorter edge of each image is resized, after which the image is center-cropped to a fixed-size resolution. Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
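The validation-time steps (resize the shorter edge, center crop, normalize) can be sketched in plain Python on image dimensions and a single pixel. This is a rough illustration, not the original preprocessing code: the 224-pixel target and the helper names are assumptions, and the mean and standard deviation are the commonly used ImageNet statistics for inputs scaled to [0, 1].

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_shorter_edge(width, height, target):
    """Scale (width, height) so the shorter edge equals target, keeping aspect ratio."""
    scale = target / min(width, height)
    # max() guards against a rounding result just below the target size.
    return max(target, round(width * scale)), max(target, round(height * scale))


def center_crop_box(width, height, crop):
    """Return (left, top, right, bottom) of a centered crop-by-crop window."""
    if crop > width or crop > height:
        raise ValueError("crop size must not exceed the image size")
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Map 8-bit RGB values to [0, 1], then standardize each channel."""
    return tuple(
        (value / 255 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A 640x480 image with an assumed target resolution of 224.
new_width, new_height = resize_shorter_edge(640, 480, 224)
box = center_crop_box(new_width, new_height, 224)
white = normalize_pixel((255, 255, 255))
```

For the 640x480 example, the resize yields 299x224 and the crop box is `(37, 0, 261, 224)`, so only the horizontal margins are trimmed.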
Evaluation results
For evaluation results, we refer readers to the paper.
License
This model is released under the MIT license.