🚀 GIT (GenerativeImage2Text), base-sized
GIT (GenerativeImage2Text) is a base-sized model. It addresses the challenge of generating text from images, offering a versatile solution for a variety of vision-language tasks. This model was introduced in the paper GIT: A Generative Image-to-text Transformer for Vision and Language by Wang et al. and was first released in this repository.
Disclaimer: The team releasing GIT did not write a model card for this model, so this model card has been written by the Hugging Face team.
✨ Features
- Versatile Task Support: Can be used for image and video captioning, visual question answering (VQA) on images and videos, and even image classification.
- Transformer-based Design: A Transformer decoder conditioned on both CLIP image tokens and text tokens.
📚 Documentation
Model description
GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a large number of (image, text) pairs.
The goal of the model is to predict the next text token, given the image tokens and previous text tokens.
The model has full access to (i.e., a bidirectional attention mask is used for) the image patch tokens, but only has access to the previous text tokens (i.e., a causal attention mask is used for the text tokens) when predicting the next text token.
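To make the masking scheme concrete, here is a minimal, conceptual PyTorch sketch of such a combined attention mask (an illustration only, not the actual GIT implementation): image tokens attend to each other bidirectionally, while text tokens attend to all image tokens and causally to earlier text tokens.

```python
import torch

def build_git_style_attention_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Illustrative mask: rows are query positions, columns are key positions.
    True means "may attend". Image tokens come first, then text tokens."""
    total = num_image_tokens + num_text_tokens
    mask = torch.zeros(total, total, dtype=torch.bool)

    # Image tokens attend to all image tokens (bidirectional).
    mask[:num_image_tokens, :num_image_tokens] = True

    # Text tokens attend to all image tokens ...
    mask[num_image_tokens:, :num_image_tokens] = True

    # ... and causally to text tokens up to and including their own position.
    causal = torch.tril(torch.ones(num_text_tokens, num_text_tokens)).bool()
    mask[num_image_tokens:, num_image_tokens:] = causal

    return mask

print(build_git_style_attention_mask(num_image_tokens=4, num_text_tokens=3).int())
```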

This allows the model to be used for tasks such as:
- Image and video captioning
- Visual question answering (VQA) on images and videos (a sketch follows this list)
- Image classification (by simply conditioning the model on the image and asking it to generate a class for it in text)
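For example, VQA reduces to conditioning the decoder on the image tokens plus the question and letting it generate the answer. Below is a minimal sketch using the 🤗 Transformers API; the checkpoint name microsoft/git-base-textvqa, the example image URL, and the exact pre/post-processing calls are assumptions based on the library documentation, so adapt them as needed.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Assumed fine-tuned VQA checkpoint; swap in whichever GIT checkpoint you use.
checkpoint = "microsoft/git-base-textvqa"
processor = AutoProcessor.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

# Any RGB image works here; this COCO image URL is just an example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Condition the decoder on the image tokens plus the question, then generate the answer.
question = "how many cats are there?"
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([processor.tokenizer.cls_token_id] + input_ids).unsqueeze(0)

generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True))
```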
Intended uses & limitations
You can use the raw model for image captioning. Check the model hub to find fine-tuned versions for tasks that interest you.
How to use
For code examples, refer to the documentation.
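As a starting point, here is a minimal image-captioning sketch with the 🤗 Transformers library; the checkpoint name microsoft/git-base and the example image URL are assumptions, and the linked documentation remains the canonical reference.

```python
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Load the processor and the base captioning checkpoint (name assumed).
processor = AutoProcessor.from_pretrained("microsoft/git-base")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base")

# Example COCO image; replace the URL with your own image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The processor resizes, center-crops and normalizes the image into pixel values.
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Generate a caption token by token, conditioned on the image tokens only.
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Because the model is a plain causal decoder, the same `generate` call covers captioning, VQA, and classification; only the text prompt changes.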
Training data
From the paper:
We collect 0.8B image-text pairs for pre-training, which include COCO (Lin et al., 2014), Conceptual Captions (CC3M) (Sharma et al., 2018), SBU (Ordonez et al., 2011), Visual Genome (VG) (Krishna et al., 2016), Conceptual Captions (CC12M) (Changpinyo et al., 2021), ALT200M (Hu et al., 2021a), and an extra 0.6B data following a similar collection procedure in Hu et al. (2021a).
However, this is for the model referred to as "GIT" in the paper, which is not open-sourced.
This checkpoint is "GIT-base", a smaller variant of GIT trained on 10 million image-text pairs.
See table 11 in the paper for more details.
Preprocessing
Refer to the original repo for details on preprocessing during training.
During validation, the shorter edge of each image is resized, after which a center crop to a fixed resolution is taken. Next, the frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
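For illustration only, that validation-time pipeline might be expressed with torchvision as follows; the 224×224 resolution and the exact normalization constants (standard ImageNet values) are assumptions, so consult the original repo for the values actually used.

```python
from torchvision import transforms

# Standard ImageNet per-channel statistics (assumed; see the original repo).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

val_transform = transforms.Compose([
    transforms.Resize(224),        # resize the shorter edge
    transforms.CenterCrop(224),    # center crop to a fixed resolution
    transforms.ToTensor(),         # HWC uint8 -> CHW float in [0, 1]
    transforms.Normalize(IMAGENET_MEAN, IMAGENET_STD),  # per-channel normalization
])
```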
Evaluation results
For evaluation results, refer to the paper.
📄 License
This project is licensed under the MIT license.