GIT Large MSRVTT-QA

Developed by Microsoft
GIT is a Transformer decoder conditioned on both CLIP image tokens and text tokens; this checkpoint is fine-tuned for the MSRVTT-QA task.
Downloads 108
Release Time: 1/2/2023

Model Overview

The GIT model is trained with teacher forcing on a large number of image-text pairs: given the image tokens and the preceding text tokens, it learns to predict the next text token. This makes it suitable for tasks such as image and video caption generation, visual question answering, and image classification.
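
To make the teacher-forcing objective concrete, here is a minimal sketch in plain Python. It is not GIT's training code: `next_token_probs` and `uniform_model` are hypothetical stand-ins for the model, and image conditioning is left out for brevity. At each position the model is given the ground-truth prefix, rather than its own earlier predictions, and is scored on the true next token.

```python
import math

def teacher_forcing_pairs(token_ids):
    """Return (ground-truth prefix, next token) pairs for one sequence."""
    return [(token_ids[:i], token_ids[i]) for i in range(1, len(token_ids))]

def teacher_forced_loss(token_ids, next_token_probs):
    """Mean negative log-likelihood of each true next token given the true prefix."""
    pairs = teacher_forcing_pairs(token_ids)
    losses = [-math.log(next_token_probs(prefix)[target]) for prefix, target in pairs]
    return sum(losses) / len(losses)

# Hypothetical toy model: uniform distribution over a 4-token vocabulary.
def uniform_model(prefix):
    return [0.25, 0.25, 0.25, 0.25]

loss = teacher_forced_loss([0, 2, 1, 3], uniform_model)
```

With a uniform model over four tokens, the loss equals log(4); a trained model would assign higher probability to the true next token and so reach a lower loss.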

Model Features

Dual-Condition Transformer Decoder
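Conditions on both CLIP image tokens and text tokens. The model has full access to the image tokens (bidirectional attention) but sees only the preceding text tokens when predicting the next one (causal attention mask).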
Multi-Task Adaptability
Applicable to various tasks such as image and video caption generation, visual question answering, and image classification.
Large-Scale Pretraining
Trained on 10 million image-text pairs and fine-tuned on MSRVTT-QA.
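
The attention pattern named under Dual-Condition Transformer Decoder can be sketched as a boolean mask over a sequence of image tokens followed by text tokens. This is an illustrative sketch, not the model's actual implementation: the function name is hypothetical, and it assumes that image tokens attend only to other image tokens, while each text token attends to all image tokens plus the text tokens up to and including itself.

```python
def build_git_attention_mask(num_image_tokens, num_text_tokens):
    """Return mask[q][k] == True where query position q may attend to key position k.

    Positions 0..num_image_tokens-1 are image tokens; the rest are text tokens.
    """
    total = num_image_tokens + num_text_tokens
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            if k < num_image_tokens:
                # Every token sees all image tokens (bidirectional over the image).
                mask[q][k] = True
            elif q >= num_image_tokens and k <= q:
                # Text tokens see only themselves and earlier text tokens (causal).
                mask[q][k] = True
    return mask

# Example: 2 image tokens followed by 3 text tokens.
mask = build_git_attention_mask(2, 3)
```

In this example the first text token (row 2) can see both image tokens and itself, but not the two text tokens that follow it.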

Model Capabilities

Image Caption Generation
Video Caption Generation
Visual Question Answering
Image Classification

Use Cases

Video Understanding
Video Question Answering
Answering questions based on video content.
This checkpoint is fine-tuned on MSRVTT-QA, a video question answering benchmark.
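
One way a next-token decoder can be adapted to question answering is to treat the question as a text prefix and train only on the answer. The sketch below assumes exactly that setup, which this page does not state: the question and answer token ids are concatenated, an end-of-sequence id is appended, and the loss mask covers only the answer and end-of-sequence positions. The function name and the token ids are hypothetical.

```python
def build_qa_example(question_ids, answer_ids, eos_id):
    """Concatenate question and answer; mark which positions contribute to the loss."""
    input_ids = list(question_ids) + list(answer_ids) + [eos_id]
    # 0 = ignored by the loss (question prefix), 1 = scored (answer and EOS).
    loss_mask = [0] * len(question_ids) + [1] * (len(answer_ids) + 1)
    return input_ids, loss_mask

# Hypothetical token ids for a question and a short answer.
input_ids, loss_mask = build_qa_example([11, 12, 13, 14], [21, 22], eos_id=2)
```

At inference time under the same assumption, the question tokens would be supplied as the prefix and the model would generate the answer tokens until it emits the end-of-sequence id.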
Image Understanding
Image Caption Generation
Generating natural language descriptions for images.
Image Classification
Classifying images by generating text categories.
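
Classification by text generation needs a step that maps the generated string back to a class. A minimal sketch of that step follows, assuming exact matching after case and whitespace normalization; the function names and class list are hypothetical, and other matching rules are possible.

```python
def normalize(text):
    """Lowercase and collapse whitespace so minor formatting differences do not matter."""
    return " ".join(text.lower().split())

def label_from_generation(generated_text, class_names):
    """Return the index of the class whose name matches the generated text, else None."""
    target = normalize(generated_text)
    for index, name in enumerate(class_names):
        if normalize(name) == target:
            return index
    return None

# Hypothetical class list and model outputs.
classes = ["tabby cat", "golden retriever"]
matched = label_from_generation("  Golden  Retriever ", classes)
unmatched = label_from_generation("dog", classes)
```

Because the model generates free text, an output that matches no class name returns `None` here and would count as a misclassification.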