Git-large-coco Open-source Vision-Language Model - Free Deployment for Image Caption Generation and Visual Question Answering

Git Large Coco

Developed by microsoft

GIT is a Transformer decoder-based vision-language model capable of generating image captions and performing visual question answering

Image-to-Text

Transformers

Supports Multiple Languages

Open Source License: MIT

#Image Caption Generation #Visual Question Answering #Multimodal Transformer

Downloads 6,582

Release Time: 1/2/2023

Model Overview

The GIT (GenerativeImage2Text) model is a Transformer decoder conditioned on CLIP image tokens and text tokens. It applies bidirectional attention to the image tokens and causal attention to the text tokens it generates. It is suitable for tasks such as image/video caption generation and visual question answering.

Model Features

Bidirectional Image Attention

The model applies a bidirectional attention mask to the image patch tokens, so every patch can attend to every other patch.

Causal Text Generation

Applies a causal attention mask during text generation, so each text token is predicted only from the image tokens and the text tokens that precede it.

Multi-task Support

A single model can simultaneously support multiple tasks including image caption generation, visual question answering, and image classification

Model Capabilities

Image Caption Generation

Visual Question Answering (VQA)

Image Classification

Video Caption Generation

Use Cases

Content Generation

Automatic Image Tagging

Generate natural language descriptions for images

Can be used in scenarios like social media and content management systems

Assistive Technology

Visual Assistance

Describe image contents for visually impaired individuals

Improves information accessibility

Education

Educational Material Generation

Automatically generate text descriptions for textbook illustrations

Reduces teachers' lesson preparation workload

🚀 GIT (GenerativeImage2Text), large-sized, fine-tuned on COCO

GIT (GenerativeImage2Text) is a large-sized model fine-tuned on COCO. It is designed to generate text from images and supports tasks such as image captioning and visual question answering.

🚀 Quick Start

You can use the raw model for image captioning. To find versions fine-tuned for other tasks, explore the model hub. For code examples, refer to the documentation.
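As a starting point, captioning with the Hugging Face Transformers library might look like the sketch below. It assumes that `transformers`, `torch`, and `Pillow` are installed and that the checkpoint is published on the hub as `microsoft/git-large-coco`; the helper name `caption_image` and the `max_length` default are illustrative choices, not part of the model card.

```python
# Assumed hub identifier for this checkpoint.
MODEL_ID = "microsoft/git-large-coco"


def caption_image(image, model_id=MODEL_ID, max_length=50):
    """Generate one caption for a PIL image (hypothetical helper)."""
    # Imported lazily so this module can be loaded without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # The processor resizes, crops, and normalizes the image into a tensor.
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressively generate caption token ids conditioned on the image.
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example usage (downloads the model weights on first call):
#   from PIL import Image
#   print(caption_image(Image.open("photo.jpg").convert("RGB")))
```

Loading the processor and model inside the function keeps the sketch short; for repeated calls you would probably load them once and reuse them.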

✨ Features

Versatile Applications: Can be used for image and video captioning, visual question answering (VQA) on images and videos, and even image classification.
Transformer-based: GIT is a Transformer decoder that uses both CLIP image tokens and text tokens, trained on a large number of (image, text) pairs.

📚 Documentation

Model description

GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens. The model is trained using "teacher forcing" on a large number of (image, text) pairs. Its goal is to predict the next text token given the image tokens and previous text tokens.
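To make the teacher-forcing objective concrete, here is a minimal sketch in plain Python. It is not the training code from the original repository: the function names and the toy token ids are hypothetical, and the image tokens that would also condition each prediction are omitted for brevity.

```python
import math


def teacher_forcing_pairs(token_ids):
    """Pair every ground-truth prefix with the token that follows it.

    With teacher forcing the model always sees the true previous tokens,
    never its own earlier predictions.
    """
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]


def next_token_loss(step_probs, targets):
    """Mean negative log-likelihood of the ground-truth next tokens.

    step_probs[i] maps token id -> probability predicted at step i.
    """
    total = sum(math.log(probs[target]) for probs, target in zip(step_probs, targets))
    return -total / len(targets)


# Toy caption: hypothetical ids for [BOS], "a", "cat", [EOS].
pairs = teacher_forcing_pairs([101, 7, 8, 102])
targets = [target for _, target in pairs]
```

Each pair corresponds to one prediction step, and the loss averages the per-step cross-entropy over the caption.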

The model has full access to the image patch tokens (using a bidirectional attention mask), but only accesses the previous text tokens (using a causal attention mask) when predicting the next text token.
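The combined mask described above can be sketched as a boolean matrix in which entry `[q][k]` says whether query position `q` may attend to key position `k`. This is an illustration rather than the library's implementation: the function name is hypothetical, image tokens are assumed to come first in the sequence, and image tokens are assumed not to attend to text tokens.

```python
def git_attention_mask(num_image_tokens, num_text_tokens):
    """Build a [query][key] mask: True means attention is allowed."""
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_image_tokens:
                # Every position sees all image tokens (bidirectional block).
                mask[q][k] = True
            elif q >= num_image_tokens and k <= q:
                # Text positions see themselves and earlier text only (causal block).
                mask[q][k] = True
    return mask


# 2 image tokens followed by 3 text tokens.
mask = git_attention_mask(2, 3)
```

In this layout the top-left block is fully open, the bottom-right block is lower-triangular, and the top-right block stays closed so image tokens never look at text.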

GIT architecture

Intended uses & limitations

You can use the raw model for image captioning. Check the model hub for fine-tuned versions on tasks that interest you.

Training data

The paper mentions collecting 0.8B image-text pairs for pre-training, including COCO, Conceptual Captions (CC3M), SBU, Visual Genome (VG), Conceptual Captions (CC12M), ALT200M, and an additional 0.6B image-text pairs. However, this is for the model referred to as "GIT" in the paper, which is not open-sourced.

This checkpoint, "GIT-large", is a smaller variant trained on 20 million image-text pairs and then fine-tuned on COCO. See Table 11 in the paper for more details.

Preprocessing

For preprocessing details during training, refer to the original repo. During validation, the shorter edge of each image is resized, followed by center cropping to a fixed-size resolution. Then, frames are normalized across the RGB channels with the ImageNet mean and standard deviation.
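The validation-time steps can be sketched in plain Python as follows. The target size of 224 pixels is an assumption for illustration only, since the text does not state the resolution, and the helper names are hypothetical. The mean and standard deviation are the standard ImageNet statistics for inputs scaled to the range 0 to 1.

```python
# Standard ImageNet channel statistics (R, G, B).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def resized_dims(width, height, target):
    """Scale so the shorter edge equals `target`, keeping the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def center_crop_box(width, height, size):
    """Return (left, top, right, bottom) of a centered size x size crop."""
    left = (width - size) // 2
    top = (height - size) // 2
    return left, top, left + size, top + size


def normalize_pixel(rgb):
    """Map 0-255 RGB values to ImageNet-normalized floats."""
    return tuple(
        (channel / 255.0 - mean) / std
        for channel, mean, std in zip(rgb, IMAGENET_MEAN, IMAGENET_STD)
    )


# A 640x480 image with an assumed target size of 224.
new_w, new_h = resized_dims(640, 480, 224)
box = center_crop_box(new_w, new_h, 224)
```

In practice the processor bundled with the checkpoint performs these steps, so a sketch like this is mainly useful for understanding or reproducing the pipeline outside the library.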

Evaluation results

For evaluation results, refer to the paper.

📄 License

This model is released under the MIT license.
