GIT Large TextVQA

Developed by Microsoft
GIT is a vision-language model built on a Transformer decoder, conditioned on both CLIP image tokens and text tokens, and fine-tuned for TextVQA tasks.
Downloads 62
Release date: 1/2/2023

Model Overview

The GIT model applies bidirectional attention to image tokens and a causal attention mask to text tokens during generation, making it suitable for tasks such as image captioning, visual question answering, and image classification.

Model Features

Multimodal Processing Capability
Simultaneously processes image and text inputs to achieve cross-modal understanding and generation.
Bidirectional Image Attention
Employs bidirectional attention mechanisms for image tokens to fully capture visual features.
Causal Text Generation
Uses causal attention masks during text generation, so each token is predicted only from the image tokens and preceding text tokens, as autoregressive decoding requires.
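The mixed attention pattern above (bidirectional over image tokens, causal over text tokens) can be sketched as a mask matrix. This is an illustrative construction, not GIT's actual implementation; the function name and token counts are made up for the example.

```python
import numpy as np

def git_attention_mask(num_image_tokens: int, num_text_tokens: int) -> np.ndarray:
    """Build a combined attention mask (1 = may attend, 0 = masked).

    Image tokens attend to all image tokens (bidirectional); text tokens
    attend to all image tokens plus earlier text tokens (causal).
    """
    n = num_image_tokens + num_text_tokens
    mask = np.zeros((n, n), dtype=np.int64)
    # Image rows: bidirectional attention within the image block only.
    mask[:num_image_tokens, :num_image_tokens] = 1
    # Text rows: full view of all image tokens ...
    mask[num_image_tokens:, :num_image_tokens] = 1
    # ... plus lower-triangular (causal) attention over text tokens.
    causal = np.tril(np.ones((num_text_tokens, num_text_tokens), dtype=np.int64))
    mask[num_image_tokens:, num_image_tokens:] = causal
    return mask

# 2 image tokens followed by 3 text tokens.
m = git_attention_mask(2, 3)
```

Row `i` of the mask lists which positions token `i` may attend to: both image rows see each other fully, while each text row sees the image block plus only earlier text.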

Model Capabilities

Image Caption Generation
Visual Question Answering
Image Classification (via text generation)
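"Image classification via text generation" means the decoder generates or scores a label string rather than emitting a class logit. A minimal sketch of the scoring view, using a toy stand-in for the decoder (the `token_log_prob` callable and all token sequences here are assumptions for illustration, not GIT's API):

```python
import math

def classify_by_generation(token_log_prob, label_token_seqs):
    """Rank candidate labels by the summed log-probability the decoder
    assigns to generating each label's token sequence.

    token_log_prob(prefix, token) -> float is a stand-in for the model's
    next-token log-probability given the image and the text prefix.
    """
    scores = {}
    for label, tokens in label_token_seqs.items():
        prefix: tuple = ()
        total = 0.0
        for tok in tokens:
            total += token_log_prob(prefix, tok)
            prefix += (tok,)
        scores[label] = total
    # Highest-likelihood label wins.
    return max(scores, key=scores.get), scores

# Toy decoder that prefers generating the sequence ("a", "cat").
def toy_log_prob(prefix, token):
    preferred = {(): "a", ("a",): "cat"}
    return math.log(0.9) if preferred.get(prefix) == token else math.log(0.05)

label, scores = classify_by_generation(
    toy_log_prob,
    {"cat": ("a", "cat"), "dog": ("a", "dog")},
)
# label == "cat"
```

In practice the same idea applies with the real model: tokenize each candidate label, score it under the decoder conditioned on the image, and pick the best-scoring one.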

Use Cases

Visual Understanding
Image Content Question Answering
Answers questions about text appearing within images
Performs strongly on the TextVQA benchmark (see the GIT paper for specific metrics)
Assistive Technology
Visual Impairment Assistance
Generates textual descriptions of image content for visually impaired users