GIT (GenerativeImage2Text), large-sized, fine-tuned on TextCaps, R*
The GIT model is designed for image-to-text tasks, offering high-quality captioning and related capabilities.
🚀 Quick Start
The GIT (GenerativeImage2Text) model, in its large-sized version, is fine-tuned on TextCaps. It was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository.
Disclaimer: The team releasing GIT did not write a model card for this model, so this model card has been written by the Hugging Face team.
✨ Features
GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a large number of (image, text) pairs. Its goal is to predict the next text token given the image tokens and previous text tokens. The model has full access to the image patch tokens (using a bidirectional attention mask) and only access to the previous text tokens (using a causal attention mask) when predicting the next text token.
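As a minimal illustrative sketch of this attention pattern (not the actual GIT implementation), the combined mask for a toy sequence of image tokens followed by text tokens could be built as follows; the token counts are arbitrary:

```python
import torch

# Toy sizes, chosen only for illustration.
num_image_tokens, num_text_tokens = 4, 3
total = num_image_tokens + num_text_tokens

# mask[i, j] = True means position i may attend to position j.
mask = torch.zeros(total, total, dtype=torch.bool)

# Image tokens attend to all image tokens (bidirectional).
mask[:num_image_tokens, :num_image_tokens] = True

# Text tokens attend to all image tokens ...
mask[num_image_tokens:, :num_image_tokens] = True

# ... and causally to text tokens up to and including the current position.
mask[num_image_tokens:, num_image_tokens:] = torch.tril(
    torch.ones(num_text_tokens, num_text_tokens, dtype=torch.bool)
)

print(mask.int())
```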

This enables the model to be used for various tasks:
- Image and video captioning
- Visual question answering (VQA) on images and videos (see the sketch after this list)
- Even image classification (by conditioning the model on the image and asking it to generate a class in text)
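As an illustration of the VQA use case above via the 🤗 Transformers API, here is a minimal sketch; the checkpoint name (a GIT variant fine-tuned for VQA) and the example image and question are assumptions, not part of this card:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Assumed checkpoint: a GIT variant fine-tuned for visual question answering.
checkpoint = "microsoft/git-large-textvqa"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Example image (arbitrary COCO validation image).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The question is tokenized, prefixed with the CLS token, and the model
# generates the answer as a continuation of that text.
question = "what is shown in the image?"
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([[processor.tokenizer.cls_token_id] + input_ids])

generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```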
📚 Documentation
Intended uses & limitations
You can use the raw model for image captioning. Check the model hub to find fine-tuned versions for tasks that interest you.
How to use
For code examples, refer to the documentation and the Usage Examples section below.
📦 Installation
No dedicated installation steps are provided. The model can be used through the Hugging Face Transformers library (see the Usage Examples section below).
💻 Usage Examples
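A minimal image-captioning sketch using the 🤗 Transformers library; the hub checkpoint id and the example image URL below are assumptions, so substitute the identifier of this model when running it:

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Assumed checkpoint id for this card (large GIT fine-tuned on TextCaps, R variant).
checkpoint = "microsoft/git-large-r-textcaps"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Example image (arbitrary COCO validation image).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Encode the image and generate a caption token by token.
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
caption = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(caption)
```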
🔧 Technical Details
Training data
From the paper:
We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a).
However, this is for the model referred to as "GIT" in the paper, which is not open-sourced. This checkpoint is "GIT-large", a smaller variant of GIT trained on 20 million image-text pairs. The model was then fine-tuned on TextCaps. See Table 11 in the paper for more details.
Preprocessing
Refer to the original repo for details on preprocessing during training. During validation, the shorter edge of each image is resized, followed by center cropping to a fixed-size resolution. Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
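A minimal sketch of this validation-time preprocessing written with torchvision; the target resolution below is an assumption, and the image processor bundled with the checkpoint applies the correct values automatically:

```python
from torchvision import transforms

# Assumed target resolution; the exact value depends on the checkpoint's image processor.
image_size = 224

val_transform = transforms.Compose([
    transforms.Resize(image_size),        # resize the shorter edge to image_size
    transforms.CenterCrop(image_size),    # center crop to a fixed-size square
    transforms.ToTensor(),                # convert to a CHW float tensor in [0, 1]
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],       # ImageNet channel means
        std=[0.229, 0.224, 0.225],        # ImageNet channel standard deviations
    ),
])
```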
📄 License
This model is released under the MIT license.
| Property | Details |
|----------|---------|
| Model Type | GIT (GenerativeImage2Text), large-sized, fine-tuned on TextCaps |
| Training Data | 20 million image-text pairs for GIT-large, followed by fine-tuning on TextCaps. The original GIT model was pre-trained on 0.8B image-text pairs, including COCO, CC3M, SBU, VG, CC12M, and ALT200M, among others. |