
GIT Base VQAv2

Developed by Microsoft
GIT is a Transformer-decoder vision-language model conditioned on both CLIP image tokens and text tokens, suited to tasks such as image captioning and visual question answering.
Downloads: 199
Release Time: 12/6/2022

Model Overview

GIT (short for GenerativeImage2Text) is a Transformer decoder; this checkpoint is the base-sized variant fine-tuned on the VQAv2 dataset. The model attends bidirectionally over image patch tokens and applies a causal attention mask when generating text tokens, as sketched below.
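To make that attention pattern concrete, the following is a minimal, illustrative PyTorch sketch of such a combined mask. It is a simplification of the design described above, not GIT's actual implementation; the function name and token counts are hypothetical.

```python
import torch

def git_style_attention_mask(num_image_tokens: int, num_text_tokens: int) -> torch.Tensor:
    """Illustrative mask: image tokens attend bidirectionally to all image
    tokens; text tokens attend to all image tokens and, causally, to
    earlier text tokens. True means attention is allowed."""
    n, m = num_image_tokens, num_text_tokens
    mask = torch.zeros(n + m, n + m, dtype=torch.bool)
    mask[:n, :n] = True   # image -> image: full bidirectional access
    mask[n:, :n] = True   # text  -> image: full access to image tokens
    mask[n:, n:] = torch.tril(torch.ones(m, m, dtype=torch.bool))  # text -> text: causal
    return mask

print(git_style_attention_mask(4, 3))
```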

Model Features

Bidirectional Image Attention Mechanism
The model applies bidirectional attention over image patch tokens, so every image token has full access to the image information.
Causal Text Generation
When predicting the next text token, a causal attention mask restricts the model to the image tokens and previously generated text tokens.
Multi-task Adaptability
The same architecture can be applied to tasks such as image captioning, visual question answering, and image classification.

Model Capabilities

Image Captioning
Visual Question Answering
Image Classification

Use Cases

Visual Question Answering
VQAv2 Dataset Q&A
Fine-tuned on the VQAv2 dataset, the model can answer natural-language questions about image content; a usage sketch follows below.
Refer to the original paper for specific evaluation results.
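The snippet below is a minimal VQA inference sketch using the Hugging Face Transformers library. The checkpoint id microsoft/git-base-vqav2 is inferred from this card's title, and the image path example.jpg is hypothetical.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

# Checkpoint id inferred from the card title; verify before use.
processor = AutoProcessor.from_pretrained("microsoft/git-base-vqav2")
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vqav2")

image = Image.open("example.jpg")  # hypothetical local image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# The question is tokenized and prefixed with [CLS]; the answer is
# generated as a continuation of the question tokens.
question = "what is in the picture?"
input_ids = processor(text=question, add_special_tokens=False).input_ids
input_ids = torch.tensor([processor.tokenizer.cls_token_id] + input_ids).unsqueeze(0)

generated_ids = model.generate(pixel_values=pixel_values, input_ids=input_ids, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```

Note that the decoded output contains the question followed by the predicted answer, so downstream code typically strips the question prefix.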
Image Captioning
Automatic Image Annotation
Given only an image, the model can generate a free-form text description; see the sketch below.
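As a hedged sketch: when no text prompt is supplied, GIT's decoder generates a description conditioned on the image tokens alone. This checkpoint is VQA-tuned, so a caption-tuned GIT variant may be a better fit for pure captioning; the checkpoint id and image path below are assumptions, as above.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained("microsoft/git-base-vqav2")  # assumed id
model = AutoModelForCausalLM.from_pretrained("microsoft/git-base-vqav2")

image = Image.open("example.jpg")  # hypothetical local image
pixel_values = processor(images=image, return_tensors="pt").pixel_values

# Without input_ids, generation starts from the BOS token and the decoder
# produces a caption conditioned only on the image tokens.
generated_ids = model.generate(pixel_values=pixel_values, max_length=50)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```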