
GIT Base TextVQA

Developed by Microsoft
GIT is a Transformer-based vision-language model that generates textual descriptions from images; this checkpoint is specifically fine-tuned for TextVQA tasks.
Downloads 1,182
Release Date: 12/6/2022

Model Overview

The model generates text conditioned on CLIP image tokens and previously generated text tokens, enabling tasks such as image captioning and visual question answering. This base version was trained on 10 million image-text pairs and then fine-tuned on TextVQA.
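A minimal inference sketch for the visual question answering use described above, assuming the Hugging Face `transformers` GIT integration and the checkpoint name `microsoft/git-base-textvqa` implied by this card (the prompt construction with a leading CLS token follows the usual GIT convention; verify against the checkpoint's documentation):

```python
def answer_question(image_path: str, question: str) -> str:
    """Answer a text-based question about an image with GIT (sketch).

    Assumes `torch`, `Pillow`, and `transformers` are installed and the
    checkpoint `microsoft/git-base-textvqa` is reachable on the Hub.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    checkpoint = "microsoft/git-base-textvqa"  # checkpoint name from this card
    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Encode the image into pixel values for the CLIP vision encoder.
    image = Image.open(image_path)
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # GIT is prompted with the question, prefixed by the CLS token.
    input_ids = processor(text=question, add_special_tokens=False).input_ids
    input_ids = torch.tensor(
        [processor.tokenizer.cls_token_id] + input_ids
    ).unsqueeze(0)

    # The answer is generated causally, continuing the question prompt.
    generated = model.generate(
        pixel_values=pixel_values, input_ids=input_ids, max_length=50
    )
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

The generated sequence echoes the question followed by the answer, so a caller may want to strip the question prefix from the decoded string.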

Model Features

Bidirectional Image Attention
The model has full access to all image patch tokens via a bidirectional attention mechanism.
Causal Text Generation
When predicting the next text token, the model attends only to previous text tokens, enforced by a causal attention mask.
Multi-task Adaptability
Applicable to a range of tasks, including image captioning, visual question answering, and image classification.
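The two attention patterns above combine into a single mask: image tokens attend bidirectionally among themselves, while text tokens attend to all image tokens plus only earlier (and current) text tokens. An illustrative pure-Python sketch of that mask layout (token counts are made up; this is not the library's implementation):

```python
def git_attention_mask(n_img: int, n_txt: int) -> list[list[bool]]:
    """Build a GIT-style attention mask as a boolean matrix.

    Row q is the querying token, column k the attended token.
    Tokens 0..n_img-1 are image patch tokens, the rest are text tokens.
    """
    n = n_img + n_txt
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_img:
                # Image token: bidirectional access to every image token.
                mask[q][k] = k < n_img
            else:
                # Text token: all image tokens, plus causal access to text.
                mask[q][k] = k < n_img or k <= q
    return mask
```

For example, with 3 image tokens and 4 text tokens, an image token can attend to a later image token, but a text token can never attend to a later text token.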

Model Capabilities

Image Captioning
Visual Question Answering
Image Classification
Text Generation

Use Cases

Visual Question Answering
TextVQA
Answering questions based on text content within images
Specifically fine-tuned for TextVQA tasks
Image Understanding
Image Captioning
Generating descriptive text for images