Git-large-textcaps Open-source Model - Free Support for Image Caption Generation and Visual Question Answering Tasks

Git Large Textcaps

Developed by microsoft

GIT is a dual-conditional decoder model based on Transformer, designed for tasks such as image caption generation and visual question answering.

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Image Caption Generation #Text-Enhanced Vision #Multimodal Transformer

Downloads 1,749

Release Time: 1/2/2023

Model Overview

The GIT model utilizes a dual-conditional Transformer decoder with CLIP image tokens and text tokens, enabling tasks like image caption generation, visual question answering, and image classification.

Model Features

Dual-Conditional Transformer Decoder

Combines CLIP image tokens and text tokens for efficient image-to-text conversion.

Multi-Task Support

Capable of performing various tasks such as image caption generation, visual question answering, and image classification.

Large-Scale Pre-training

Trained on 20 million image-text pairs and fine-tuned on TextCaps.

Model Capabilities

Image Caption Generation

Visual Question Answering

Image Classification

Use Cases

Image Understanding

Image Caption Generation

Generates detailed textual descriptions for input images.

Visual Question Answering

Answers natural language questions about image content.

Image Classification

Text Category Generation

Generates a class label for the input image as text.

🚀 GIT (GenerativeImage2Text), large-sized, fine-tuned on TextCaps

GIT is a large-sized Generative Image-to-text model fine-tuned on TextCaps. It can effectively convert images into text, offering solutions for various vision and language tasks.

🚀 Quick Start

You can use the raw model for image captioning. Check the model hub to find fine-tuned versions for tasks that interest you. For code examples, refer to the documentation.
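As a starting point, captioning with the Hugging Face `transformers` library might look like the sketch below. It assumes the checkpoint is published on the Hub as `microsoft/git-large-textcaps` (an ID inferred from this page, not stated on it) and that `transformers`, `torch`, and `Pillow` are installed. `caption_image` is a hypothetical helper name, and `max_length=50` is an arbitrary choice.

```python
# Assumed Hub ID for this checkpoint; verify it on the model hub before use.
CHECKPOINT = "microsoft/git-large-textcaps"


def caption_image(image, checkpoint=CHECKPOINT, max_length=50):
    """Return one generated caption for a PIL image (hypothetical helper)."""
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # The processor turns the image into the pixel_values tensor the model expects.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # No text prompt is passed: the caption is generated from the image alone.
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Usage (the file name is a placeholder for any local RGB image):
#   from PIL import Image
#   print(caption_image(Image.open("photo.jpg").convert("RGB")))
```

The helper reloads the processor and model on every call to keep the sketch short; for repeated use, load them once and reuse them.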

✨ Features

Versatile Applications: Can be used for image and video captioning, visual question answering (VQA) on images and videos, and even image classification.
Transformer Decoder: GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, trained using "teacher forcing" on numerous (image, text) pairs.

📚 Documentation

Model description

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a large number of (image, text) pairs. Its only training objective is to predict the next text token, given the image tokens and the previous text tokens.
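The next-token objective can be illustrated with a small, framework-free sketch. It assumes the decoder has already produced one vector of vocabulary scores (logits) per text position, conditioned on the image tokens and the ground-truth previous tokens. `next_token_loss` is a hypothetical name, and this is a generic cross-entropy formulation rather than the training code used for GIT.

```python
import math


def next_token_loss(logits, token_ids):
    """Mean cross-entropy of predicting token t+1 from position t.

    logits: one list of vocabulary scores per text position.
    token_ids: ground-truth caption tokens, same length as logits.
    """
    total = 0.0
    # Teacher forcing: position t is scored against the true token at t+1,
    # so the last position has no target and is skipped.
    for scores, target in zip(logits[:-1], token_ids[1:]):
        peak = max(scores)  # subtract the max for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[target]
    return total / (len(token_ids) - 1)
```

With uniform scores over a vocabulary of size V the loss equals ln V, and it approaches zero as the score of each correct next token dominates.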

The model has full access to (i.e., a bidirectional attention mask is used for) the image patch tokens, but only has access to the previous text tokens (i.e., a causal attention mask is used for the text tokens) when predicting the next text token.
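The two masking rules can be combined into a single attention mask. The sketch below assumes image tokens come first in the sequence, followed by text tokens, and that image positions do not attend to text positions; the latter is an assumption, since the text above only describes what the text positions can see. `git_attention_mask` is a hypothetical name.

```python
def git_attention_mask(num_image_tokens, num_text_tokens):
    """Return mask[i][j] == True when position i may attend to position j.

    Positions are ordered as image tokens first, then text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for i in range(total):
        for j in range(total):
            if j < num_image_tokens:
                # Every position sees all image tokens (bidirectional block).
                mask[i][j] = True
            elif i >= num_image_tokens:
                # A text position sees text up to and including itself (causal block).
                mask[i][j] = j <= i
            # Otherwise: an image row and a text column stay False (assumption).
    return mask
```

For two image tokens and three text tokens, the first text row is `[True, True, True, False, False]`: both image tokens and the current text token are visible, later text tokens are not.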

[Figure: GIT architecture]

This allows the model to be used for tasks like:

Image and video captioning
Visual question answering (VQA) on images and videos
Even image classification (by simply conditioning the model on the image and asking it to generate a class for it in text).
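For the classification use case, the generated text still has to be mapped onto a fixed label set. One naive way to do that is sketched below; `match_label` is a hypothetical helper and the matching rule (exact match first, then longest contained class name) is an illustrative assumption, not a procedure described for GIT.

```python
def match_label(generated_text, class_names):
    """Map free-form generated text onto a fixed label set, or None if no match."""
    # Lowercase and collapse whitespace so trivial differences do not matter.
    normalized = " ".join(generated_text.lower().split())
    for name in class_names:
        if normalized == name.lower():
            return name
    # Fall back to the longest class name contained in the text.
    contained = [n for n in class_names if n.lower() in normalized]
    return max(contained, key=len) if contained else None
```

Substring matching is crude (for example, "cat" also matches inside "category"), so a real evaluation would likely need stricter rules.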

Intended uses & limitations

You can use the raw model for image captioning. See the model hub for versions fine-tuned on a task that interests you.

Training data

From the paper:

We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a).

However, this describes the model referred to as "GIT" in the paper, which is not open-sourced.

This checkpoint is "GIT-large", a smaller variant of GIT trained on 20 million image-text pairs and then fine-tuned on TextCaps. See Table 11 in the paper for more details.

Preprocessing

For details on preprocessing during training, see the original repo. During validation, the shorter edge of each image is resized, and the image is then center-cropped to a fixed-size resolution. Next, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
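The validation-time steps can be sketched as plain geometry and arithmetic. The mean and standard deviation below are the commonly cited ImageNet statistics, and the 224-pixel target in the usage note is an illustrative assumption; neither is stated on this page, so the checkpoint's own processor configuration remains the authoritative source. All function names are hypothetical.

```python
# Commonly cited ImageNet channel statistics (assumed, not taken from this page).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resize_shorter_edge(width, height, target):
    """New (width, height) with the shorter edge at target, aspect ratio kept."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, crop):
    """(left, top, right, bottom) of a centered crop-by-crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop


def normalize_pixel(rgb):
    """Scale 0-255 RGB values to [0, 1], then standardize each channel."""
    return tuple(
        (c / 255.0 - m) / s for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )
```

For a 640x480 image and an assumed target of 224, `resize_shorter_edge(640, 480, 224)` gives `(299, 224)`, and `center_crop_box(299, 224, 224)` gives `(37, 0, 261, 224)`.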

Evaluation results

For evaluation results, we refer readers to the paper.

📄 License

This model is released under the MIT license.

Property	Details
Model Type	GIT (GenerativeImage2Text), large-sized, fine-tuned on TextCaps
Training Data	20 million image-text pairs for "GIT-large", fine-tuned on TextCaps. The original "GIT" used 0.8B image-text pairs for pre-training from multiple sources.
