Git Base Coco

Developed by Microsoft
GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens, used for tasks such as image caption generation and visual question answering.
Downloads 5,461
Release Time: 12/6/2022

Model Overview

GIT is a Transformer-based model trained with teacher forcing on a large number of image-text pairs. Its objective is to predict the next text token given the image tokens and the preceding text tokens, which makes it suitable for tasks such as image caption generation, visual question answering, and image classification.
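The teacher-forcing objective described above can be illustrated with a minimal sketch. It is not the model's training code: the token IDs and probabilities are made up, the image tokens that GIT also conditions on are left out, and the helper names `shift_for_teacher_forcing` and `mean_next_token_loss` are hypothetical.

```python
import math


def shift_for_teacher_forcing(token_ids):
    """Split a ground-truth sequence into decoder inputs and next-token targets.

    At step t the decoder is fed the true token t (not its own prediction)
    and is asked to predict the true token t + 1.
    """
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/target pair")
    return token_ids[:-1], token_ids[1:]


def mean_next_token_loss(step_probs, targets):
    """Average negative log-likelihood of the target tokens.

    step_probs[t] maps token id -> probability predicted at step t.
    """
    total = 0.0
    for probs, target in zip(step_probs, targets):
        total -= math.log(probs[target])
    return total / len(targets)


# Illustrative caption tokens: [start, "a", "cat", end] with made-up ids.
inputs, targets = shift_for_teacher_forcing([101, 7, 8, 102])

# Pretend the model assigns probability 0.5 to each correct next token.
fake_probs = [{target: 0.5} for target in targets]
loss = mean_next_token_loss(fake_probs, targets)
```

Because the true previous tokens are always fed in, every position of a caption can be trained in a single forward pass.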

Model Features

Bidirectional Image Attention
Image patch tokens are covered by a bidirectional attention mask, so the model has full access to all of them.
Causal Text Attention
When predicting the next text token, the model can attend only to the image tokens and the previous text tokens, which is enforced by a causal attention mask over the text.
Multi-task Support
Can be used for various tasks such as image caption generation, visual question answering, and image classification.
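The two masking rules above can be combined into one attention mask. The sketch below is illustrative only. It assumes the sequence is laid out as image tokens followed by text tokens, that image tokens do not attend to text tokens, and that a text token may also attend to itself. The function name `build_git_attention_mask` is hypothetical.

```python
def build_git_attention_mask(num_image_tokens, num_text_tokens):
    """Return a square 0/1 matrix where mask[q][k] == 1 means query q may attend to key k."""
    total = num_image_tokens + num_text_tokens
    mask = [[0] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_image_tokens:
                # Bidirectional part: every query, image or text, sees every image token.
                mask[q][k] = 1
            elif q >= num_image_tokens and k <= q:
                # Causal part: a text query sees text keys at or before its own position.
                mask[q][k] = 1
    return mask


# Two image tokens followed by three text tokens.
mask = build_git_attention_mask(2, 3)
for row in mask:
    print(row)
```

In the printed matrix, the first two columns are all ones (image tokens visible to everyone), while the text block in the lower right is lower triangular.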

Model Capabilities

Image Caption Generation
Visual Question Answering
Image Classification

Use Cases

Image Understanding
Image Caption Generation
Generate natural language descriptions for input images.
Visual Question Answering
Answer natural language questions about image content.
Image Classification
Zero-shot Image Classification
Generate the category label as text, conditioned on the input image.
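As a rough illustration of the image caption generation use case, the sketch below uses the Hugging Face `transformers` library. It assumes the checkpoint is published on the Hugging Face Hub as `microsoft/git-base-coco`, that `transformers`, `torch`, and `Pillow` are installed, and that the image path is supplied by the caller. The wrapper name `generate_caption` and the `max_length` default are illustrative choices.

```python
def generate_caption(image_path, checkpoint="microsoft/git-base-coco", max_length=50):
    """Generate one caption for the image at image_path (sketch, not tuned)."""
    # Imports live inside the function so this file can be loaded
    # even when the heavy dependencies are not installed.
    from PIL import Image
    from transformers import AutoModelForCausalLM, AutoProcessor

    # Assumed Hub identifier; the weights are downloaded on first use.
    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Convert the image into the pixel tensor the model expects.
    image = Image.open(image_path).convert("RGB")
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # Autoregressively generate text tokens conditioned on the image.
    generated_ids = model.generate(pixel_values=pixel_values, max_length=max_length)
    return processor.batch_decode(generated_ids, skip_special_tokens=True)[0]


# Example call (requires network access and a local image file):
# print(generate_caption("example.jpg"))
```

Visual question answering generally relies on a checkpoint fine-tuned for that task, so this sketch covers captioning only.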