
GIT Base TextVQA

Developed by Microsoft
GIT is a Transformer-based vision-language model that generates textual descriptions from images; this checkpoint is specifically fine-tuned for TextVQA tasks.
Downloads 1,182
Release Date: 12/6/2022

Model Overview

The model generates text conditioned on CLIP image tokens and previously generated text tokens, enabling tasks such as image captioning and visual question answering. This base version was trained on 10 million image-text pairs and then fine-tuned on TextVQA.
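A minimal inference sketch for the visual question answering use described above, assuming the Hugging Face `transformers` GIT integration and the checkpoint name `microsoft/git-base-textvqa` implied by this card (the prompt construction with a leading CLS token follows the usual GIT convention; verify against the checkpoint's documentation):

```python
def answer_question(image_path: str, question: str) -> str:
    """Answer a text-based question about an image with GIT (sketch).

    Assumes `torch`, `Pillow`, and `transformers` are installed and the
    checkpoint `microsoft/git-base-textvqa` is reachable on the Hub.
    """
    import torch
    from PIL import Image
    from transformers import AutoProcessor, AutoModelForCausalLM

    checkpoint = "microsoft/git-base-textvqa"  # checkpoint name from this card
    processor = AutoProcessor.from_pretrained(checkpoint)
    model = AutoModelForCausalLM.from_pretrained(checkpoint)

    # Encode the image into pixel values for the CLIP vision encoder.
    image = Image.open(image_path)
    pixel_values = processor(images=image, return_tensors="pt").pixel_values

    # GIT is prompted with the question, prefixed by the CLS token.
    input_ids = processor(text=question, add_special_tokens=False).input_ids
    input_ids = torch.tensor(
        [processor.tokenizer.cls_token_id] + input_ids
    ).unsqueeze(0)

    # The answer is generated causally, continuing the question prompt.
    generated = model.generate(
        pixel_values=pixel_values, input_ids=input_ids, max_length=50
    )
    return processor.batch_decode(generated, skip_special_tokens=True)[0]
```

The generated sequence echoes the question followed by the answer, so a caller may want to strip the question prefix from the decoded string.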

Model Features

Bidirectional Image Attention
The model has full access to all image patch tokens via a bidirectional attention mechanism.
Causal Text Generation
When predicting the next text token, the model attends only to previous text tokens, enforced by a causal attention mask.
Multi-task Adaptability
Applicable to a range of tasks, including image captioning, visual question answering, and image classification.
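The two attention patterns above combine into a single mask: image tokens attend bidirectionally among themselves, while text tokens attend to all image tokens plus only earlier (and current) text tokens. An illustrative pure-Python sketch of that mask layout (token counts are made up; this is not the library's implementation):

```python
def git_attention_mask(n_img: int, n_txt: int) -> list[list[bool]]:
    """Build a GIT-style attention mask as a boolean matrix.

    Row q is the querying token, column k the attended token.
    Tokens 0..n_img-1 are image patch tokens, the rest are text tokens.
    """
    n = n_img + n_txt
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            if q < n_img:
                # Image token: bidirectional access to every image token.
                mask[q][k] = k < n_img
            else:
                # Text token: all image tokens, plus causal access to text.
                mask[q][k] = k < n_img or k <= q
    return mask
```

For example, with 3 image tokens and 4 text tokens, an image token can attend to a later image token, but a text token can never attend to a later text token.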

Model Capabilities

Image Captioning
Visual Question Answering
Image Classification
Text Generation

Use Cases

Visual Question Answering
TextVQA
Answering questions based on text content within images
Specifically fine-tuned for TextVQA tasks
Image Understanding
Image Captioning
Generating descriptive text for images