# GIT (GenerativeImage2Text), base-sized, fine-tuned on VATEX
GIT (GenerativeImage2Text) is a base-sized model fine-tuned on VATEX. It generates text descriptions from images and videos, bridging the gap between vision and language.
## Quick Start
The raw model can be used for video captioning; a usage sketch follows below. You can also search the model hub for fine-tuned versions suited to your task. For full code examples, refer to the documentation.
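A minimal sketch of video captioning with the Transformers library, assuming this checkpoint is published on the hub as `microsoft/git-base-vatex` and that PyAV (`av`) is installed. The video path is a placeholder, and sampling 6 frames is an assumption chosen to match the temporal embeddings used by the GIT video checkpoints in Transformers:

```python
import av
import numpy as np
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-vatex")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vatex")

# Uniformly sample 6 frames from the clip (placeholder path).
container = av.open("video.mp4")
total = container.streams.video[0].frames
indices = set(np.linspace(0, total - 1, num=6, dtype=int).tolist())

frames = [
    frame.to_ndarray(format="rgb24")
    for i, frame in enumerate(container.decode(video=0))
    if i in indices
]

# The processor resizes, center-crops, and normalizes each frame
# (see the "Preprocessing" section below).
pixel_values = processor(images=frames, return_tensors="pt").pixel_values

# Add a batch dimension: for video input the model expects pixel values of
# shape (batch_size, num_frames, channels, height, width).
generated_ids = model.generate(pixel_values=pixel_values.unsqueeze(0), max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```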
## Features
- Versatile Applications: Suitable for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification.
- Transformer-based Design: A Transformer decoder conditioned on both CLIP image tokens and text tokens, trained with "teacher forcing" on a large number of (image, text) pairs.
## Documentation
### Model description
GIT is a Transformer decoder that takes both CLIP image tokens and text tokens as input. It is trained on a large number of (image, text) pairs using "teacher forcing". The model aims to predict the next text token given the image tokens and previous text tokens. It has full access to image patch tokens (using a bidirectional attention mask) and only accesses previous text tokens (using a causal attention mask) when predicting the next text token.

This design enables the model to handle various tasks such as image and video captioning, visual question answering (VQA) on images and videos, and image classification.
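To make the masking scheme concrete, here is a small illustrative sketch (not the library's internal implementation) of a combined attention mask: every position may attend to all image patch tokens, while text positions may attend only to themselves and earlier text tokens.

```python
import torch

def git_attention_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """mask[i, j] == True means position i may attend to position j."""
    total = num_image_tokens + num_text_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)
    # All positions see all image tokens (bidirectional over image patches).
    mask[:, :num_image_tokens] = True
    # Text positions additionally see themselves and previous text tokens (causal).
    causal = torch.tril(torch.ones(num_text_tokens, num_text_tokens, dtype=torch.bool))
    mask[num_image_tokens:, num_image_tokens:] = causal
    return mask

# Tiny example: 3 image tokens followed by 4 text tokens.
print(git_attention_mask(num_image_tokens=3, num_text_tokens=4).int())
```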
### Intended uses & limitations
You can use the raw model for video captioning. Check the model hub for fine-tuned versions suited to your specific task.
### Training data
The paper reports collecting 0.8B image-text pairs for pre-training, including COCO, Conceptual Captions (CC3M and CC12M), SBU, Visual Genome (VG), ALT200M, and an additional 0.6B pairs. However, that refers to the full "GIT" model in the paper, which is not open-sourced. This "GIT-base" checkpoint is a smaller variant trained on 10 million image-text pairs and then fine-tuned on VATEX. For more details, see Table 11 of the paper.
### Preprocessing
For preprocessing details during training, refer to the original repository. During validation, the shorter edge of each frame is resized, followed by a center crop to a fixed resolution. Frames are then normalized across the RGB channels using the ImageNet mean and standard deviation.
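A sketch of this validation-time pipeline using torchvision; the 224-pixel resolution is an assumed default, as the card does not state the exact crop size:

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(224),       # resize the shorter edge (assumed 224 px)
    transforms.CenterCrop(224),   # center-crop to a fixed resolution
    transforms.ToTensor(),        # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(         # ImageNet channel statistics
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225],
    ),
])

# Usage: apply per frame, e.g. tensors = [preprocess(img) for img in pil_frames]
```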
### Evaluation results
For evaluation results, please refer to the paper.
## License
This model is released under the MIT license.
## Additional Information
| Property | Details |
|----------|---------|
| Model Type | GIT (GenerativeImage2Text), base-sized, fine-tuned on VATEX |
| Training Data | 10 million image-text pairs, fine-tuned on VATEX |
## Important Note
The team releasing GIT did not write a model card for this model, so this model card has been written by the Hugging Face team.