git-large Open Source Model - A Practical Tool for Free Image-to-Text Generation

Git Large

Developed by Microsoft

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, used for image-to-text generation tasks.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Image Captioning #Visual Question Answering #Multimodal Transformer

Downloads 1,404

Release Time: 1/2/2023

Model Overview

GIT is a generative image-to-text Transformer model capable of performing tasks such as image captioning, visual question answering, and image classification. It processes image tokens with bidirectional attention and text tokens with causal attention.

Model Features

Dual-Modal Processing

Processes image and text tokens together, using bidirectional attention over the image tokens and causal attention over the text tokens

Multi-Task Capability

A single model capable of performing multiple vision-language tasks

Large-Scale Pretraining

This checkpoint (GIT-large) was trained on 20 million image-text pairs; the full GIT model described in the paper, which is not open-sourced, was trained on 0.8 billion pairs

Model Capabilities

Image Captioning

Visual Question Answering

Image Classification

Video Captioning

Video Question Answering

Use Cases

Content Generation

Automatic Image Description

Generates natural language descriptions for images

Can generate text that accurately describes image content

Visual Understanding

Image Question Answering System

Answers natural language questions about image content

Can correctly answer various questions about image content

Content Classification

Zero-shot Image Classification

Classifies images by generating category text

Can perform classification without specific training

🚀 GIT (GenerativeImage2Text), large-sized

The large-sized version of the GIT (GenerativeImage2Text) model. It addresses the need for high-quality image-to-text generation, offering a powerful solution for various vision-language tasks.

🚀 Quick Start

The GIT (GenerativeImage2Text) model, in its large-sized version, was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and first released in this repository. Note that the team releasing GIT did not write a model card for this model, and this one has been written by the Hugging Face team.

✨ Features

Model description

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a large number of (image, text) pairs. Its goal is to predict the next text token, given the image tokens and previous text tokens.
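To make "teacher forcing" concrete, here is a minimal sketch of how inputs and targets line up for one caption. The `[BOS]` and `[EOS]` markers and the function name are placeholders chosen for illustration, not the tokens or code used by the actual model.

```python
# Teacher forcing: the decoder is fed the ground-truth caption shifted right
# and is trained to predict the same caption shifted left.
BOS, EOS = "[BOS]", "[EOS]"  # placeholder special tokens (assumption)


def teacher_forcing_pairs(caption_tokens):
    """Return (decoder_inputs, targets) for one ground-truth caption."""
    decoder_inputs = [BOS] + caption_tokens
    targets = caption_tokens + [EOS]
    return decoder_inputs, targets


inputs, targets = teacher_forcing_pairs(["a", "cat", "sleeping"])
for step, target in enumerate(targets):
    # At every step the model sees the image tokens plus the TRUE text prefix,
    # never its own earlier predictions.
    print(f"image tokens + {inputs[:step + 1]} -> {target}")
```

During training, the prediction at each step is compared with the corresponding target; at inference time the model's own outputs replace the ground-truth prefix.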

The model has full access to the image patch tokens (using a bidirectional attention mask), but only accesses the previous text tokens (using a causal attention mask for text tokens) when predicting the next text token.
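The masking pattern described above can be sketched in a few lines of plain Python. This is an illustration of the idea under the assumption that image tokens come first in the sequence and do not attend to text; the function name is hypothetical and this is not the library's implementation.

```python
def build_git_style_mask(num_image_tokens, num_text_tokens):
    """Return a square mask where mask[i][j] == 1 means position i may attend to j.

    Assumes the sequence is laid out as [image tokens..., text tokens...].
    """
    total = num_image_tokens + num_text_tokens
    mask = [[0] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position, image or text, sees all image tokens.
                mask[i][j] = 1
            elif i >= num_image_tokens and j <= i:
                # Text positions see themselves and earlier text tokens only.
                mask[i][j] = 1
    return mask


mask = build_git_style_mask(num_image_tokens=2, num_text_tokens=3)
for row in mask:
    print(row)
```

The top-left block is fully filled (bidirectional attention among image tokens), while the bottom-right block is lower-triangular (causal attention among text tokens).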

GIT architecture

This enables the model to be used for tasks such as:

Image and video captioning
Visual question answering (VQA) on images and videos
Even image classification (by conditioning the model on the image and asking it to generate a class in text)
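For question answering, the question is supplied as the text prefix and the model continues it with the answer. The sketch below follows the pattern used with the Hugging Face `transformers` library; it assumes `transformers`, `torch`, and `Pillow` are installed and that `checkpoint` names a GIT checkpoint fine-tuned for VQA (the raw captioning checkpoint is unlikely to answer questions well). The function name and the file path in the usage comment are hypothetical.

```python
def answer_question(image_path, question, checkpoint, max_length=50):
    """Condition a GIT checkpoint on an image plus a question and decode the output."""
    # Imported lazily so that defining the function does not need the heavy dependencies.
    import torch
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # The question becomes the text prefix: [CLS] followed by the question tokens.
    question_ids = processor(text=question, add_special_tokens=False).input_ids
    input_ids = torch.tensor([[processor.tokenizer.cls_token_id] + question_ids])

    generated_ids = model.generate(
        pixel_values=pixel_values, input_ids=input_ids, max_length=max_length
    )
    # The decoded string contains the question followed by the generated answer.
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example call (hypothetical file and checkpoint):
# print(answer_question("bus.png", "what does the sign say?", "<a VQA fine-tuned GIT checkpoint>"))
```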

Intended uses & limitations

You can use the raw model for image captioning. Check the model hub to find fine-tuned versions for tasks that interest you.

How to use

For code examples, refer to the documentation.
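As a starting point alongside the documentation, here is a minimal captioning sketch using the Hugging Face `transformers` library. It assumes `transformers`, `torch`, and `Pillow` are installed and that the checkpoint is published on the Hub as `microsoft/git-large`; the function name and the image path in the usage comment are hypothetical.

```python
MODEL_ID = "microsoft/git-large"  # assumed Hub identifier for this checkpoint


def caption_image(image_path, max_length=50):
    """Generate one caption for the image stored at image_path."""
    # Imported lazily so that defining the function does not need the heavy dependencies.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

    # The processor resizes, crops, and normalizes the image into a tensor.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # With no text prompt, generation starts from the image alone.
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example call (hypothetical local file):
# print(caption_image("example.jpg"))
```

Loading the processor and model inside the function keeps the sketch short; in a real application you would load them once and reuse them across images.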

Training data

From the paper:

We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a).

However, this is for the model referred to as "GIT" in the paper, which is not open-sourced. This checkpoint is "GIT-large", a smaller variant of GIT trained on 20 million image-text pairs. See Table 11 in the paper for more details.

Preprocessing

Refer to the original repo for details on preprocessing during training. During validation, one resizes the shorter edge of each image, then performs center cropping to a fixed-size resolution. Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
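The validation-time steps can be sketched as plain arithmetic. The target size of 224 and the mean and standard deviation values below are the commonly cited ImageNet statistics, used here as assumptions for illustration; the checkpoint's own preprocessor configuration is the authoritative source for the real values.

```python
# Commonly cited ImageNet channel statistics (assumption, for illustration only).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)
TARGET_SIZE = 224  # illustrative crop resolution (assumption)


def resized_dimensions(width, height, target=TARGET_SIZE):
    """Scale so the shorter edge equals target, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, target=TARGET_SIZE):
    """Return (left, top, right, bottom) of a centered target x target crop."""
    left = (width - target) // 2
    top = (height - target) // 2
    return left, top, left + target, top + target


def normalize_pixel(rgb):
    """Map 0-255 RGB values to per-channel normalized floats."""
    return tuple(
        (value / 255.0 - mean) / std
        for value, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


w, h = resized_dimensions(640, 480)  # shorter edge (480) is scaled to 224
box = center_crop_box(w, h)          # crop the central 224 x 224 region
print((w, h), box, normalize_pixel((0, 0, 0)))
```

In practice the model's processor performs these steps for you; the sketch only shows what each step computes.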

Evaluation results

For evaluation results, refer to the paper.

📄 License

This model is released under the MIT license.
