git-base Open-source Image-to-Text Generation Model - Free Deployment for Accurate Text Descriptions of Images

Git Base

Developed by Microsoft

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, designed for image-to-text generation tasks.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Image Captioning #Visual Question Answering (VQA) #Dual-Modality Transformer

Downloads 365.74k

Release Time: 12/6/2022

Model Overview

GIT is a generative image-to-text Transformer model capable of producing descriptive text based on image content, supporting tasks such as image captioning and visual question answering.

Model Features

Dual-Conditional Transformer Architecture

Processes both image tokens and text tokens simultaneously to achieve image-to-text generation.

Multi-Task Support

Applicable to various vision-language tasks such as image captioning, visual question answering, and image classification.

Large-Scale Pretraining

Pretrained on 10 million image-text pairs (base version).

Model Capabilities

Image Captioning

Visual Question Answering

Image Classification

Video Captioning

Use Cases

Content Generation

Automatic Image Description

Generates accurate textual descriptions for images

Can be used to assist visually impaired individuals or content management

Question Answering Systems

Visual Question Answering

Answers natural language questions about image content

Can be used in smart customer service or educational applications

🚀 GIT (GenerativeImage2Text), base-sized

This is the base-sized version of GIT (short for GenerativeImage2Text). The model was introduced in the paper "GIT: A Generative Image-to-text Transformer for Vision and Language" by Wang et al. and first released in this repository.

Disclaimer: The team releasing GIT did not write a model card for this model, so this model card has been written by the Hugging Face team.

✨ Features

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens.
It can be used for tasks like image and video captioning, visual question answering (VQA) on images and videos, and even image classification.

📚 Documentation

📋 Model description

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a large number of (image, text) pairs.

The goal for the model is simply to predict the next text token, given the image tokens and the previous text tokens.
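To make the teacher-forcing objective concrete, here is a minimal sketch of how one caption could be split into decoder inputs and next-token targets. The function name and the token ids are hypothetical and chosen only for illustration; this is not the original training code.

```python
def teacher_forcing_pairs(text_token_ids):
    """Split one caption into (decoder inputs, prediction targets).

    With teacher forcing, the decoder is fed the ground-truth tokens
    and, at every position, must predict the token that comes next.
    """
    if len(text_token_ids) < 2:
        raise ValueError("need at least two tokens (e.g. BOS plus one word)")
    decoder_inputs = text_token_ids[:-1]  # everything except the last token
    targets = text_token_ids[1:]          # everything except the first token
    return decoder_inputs, targets


# Hypothetical token ids: 101 = [BOS], 102 = [EOS], the rest are caption words.
caption = [101, 7, 42, 13, 102]
inputs, targets = teacher_forcing_pairs(caption)
for step, target in enumerate(targets):
    context = inputs[: step + 1]
    print(f"step {step}: image tokens + {context} -> predict {target}")
```

At every step the context consists of ground-truth tokens rather than the model's own earlier predictions, which is what distinguishes teacher forcing from free-running generation.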

The model has full access to (i.e. a bidirectional attention mask is used for) the image patch tokens, but only has access to the previous text tokens (i.e. a causal attention mask is used for the text tokens) when predicting the next text token.
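The masking scheme described above can be sketched as a small 0/1 matrix. This is an illustrative construction, not the library's implementation. It assumes the image tokens come first in the sequence and that image positions do not attend to text positions, which the description above does not state explicitly. The function name is hypothetical.

```python
def git_style_attention_mask(num_image_tokens, num_text_tokens):
    """Return a square 0/1 mask; mask[i][j] == 1 means position i may attend to j.

    Positions 0..num_image_tokens-1 are image tokens, the rest are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position (image or text) sees all image tokens.
                mask[i][j] = 1
            elif i >= num_image_tokens and j <= i:
                # Text positions see themselves and earlier text tokens only.
                mask[i][j] = 1
            # Assumption: image positions never attend to text tokens.
    return mask


mask = git_style_attention_mask(num_image_tokens=2, num_text_tokens=3)
for row in mask:
    print(row)
```

The upper-left block (image to image) is fully populated, while the lower-right block (text to text) is lower-triangular, which corresponds to the bidirectional and causal parts of the mask respectively.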

GIT architecture

This allows the model to be used for tasks like:

Image and video captioning
Visual question answering (VQA) on images and videos
Even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).

📌 Intended uses & limitations

You can use the raw model for image captioning. See the model hub to look for fine-tuned versions on a task that interests you.

💻 Usage Examples

For code examples, we refer to the documentation.
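As a rough illustration of captioning, the sketch below uses the Hugging Face `transformers` library. It assumes the checkpoint is available on the hub under the id `microsoft/git-base` and that `transformers`, `torch`, and `Pillow` are installed. The helper name `caption_image` and the default `max_length` of 50 are choices made for this example, not part of the model card.

```python
def caption_image(image_path, checkpoint="microsoft/git-base", max_length=50):
    """Generate a caption for the image stored at image_path."""
    # Heavy dependencies are imported here so that merely defining the
    # function does not require them to be installed.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Assumption: the checkpoint id resolves on the Hugging Face hub.
    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # The processor resizes, crops, and normalizes the image into a tensor.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressively generate text token ids conditioned on the image.
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
```

Calling `caption_image("photo.jpg")` with a local image file (the file name here is a placeholder) should return a short caption string. The first call downloads the checkpoint, so it needs network access.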

📊 Training data

From the paper:

We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a).

However, this describes the model referred to as "GIT" in the paper, which is not open-sourced.

This checkpoint is "GIT-base", a smaller variant of GIT trained on 10 million image-text pairs.

See Table 11 in the paper for more details.

⚙️ Preprocessing

For details on preprocessing during training, see the original repo.

During validation, the shorter edge of each image is resized, and the image is then center-cropped to a fixed-size resolution. Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
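The validation-time steps can be sketched with plain arithmetic. The target size of 224 used in the example is an assumption made for illustration, since the text above only says "fixed-size". The mean and standard deviation are the commonly used ImageNet statistics, and the helper names are hypothetical.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_shape(width, height, shorter_edge):
    """Scale (width, height) so that the shorter side equals shorter_edge."""
    scale = shorter_edge / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop_size):
    """Return (left, top, right, bottom) of a centered square crop."""
    left = (width - crop_size) // 2
    top = (height - crop_size) // 2
    return left, top, left + crop_size, top + crop_size


def normalize_pixel(rgb):
    """Map 0-255 RGB values to ImageNet-normalized floats, channel by channel."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# Example: a 640x480 image with a hypothetical target size of 224.
width, height = resized_shape(640, 480, shorter_edge=224)
box = center_crop_box(width, height, crop_size=224)
print((width, height), box)
print(normalize_pixel((255, 255, 255)))
```

In practice an image processor from the model's library would perform these steps on whole tensors. The sketch only shows the geometry and the per-channel normalization.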

📈 Evaluation results

For evaluation results, we refer readers to the paper.

📄 License

This model is released under the MIT license.

Property	Details
Model Type	GIT (GenerativeImage2Text), base-sized
Training Data	10 million image-text pairs for "GIT-base"; 0.8B image-text pairs (including COCO, CC3M, SBU, VG, CC12M, ALT200M, and an extra 0.6B data) for "GIT" in the paper
