Taiyi-vit-87M-D

Developed by IDEA-CCNL
An English MAP visual encoder based on the ViT-base architecture, specially pretrained on the COCO and Visual Genome datasets
Downloads: 24
Release Time: 5/4/2022

Model Overview

This model is a visual encoder based on the CLIP-ViT-base architecture. Multimodal information is injected through specialized pretraining tasks, making the encoder suitable for visual tasks such as image classification (see the usage sketch below).
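A minimal usage sketch for image classification, assuming the checkpoint is published on the Hugging Face Hub under the identifier IDEA-CCNL/Taiyi-vit-87M-D and loads with the standard transformers ViT classes (these loading details are assumptions, not confirmed by this page):

```python
# Minimal sketch: ImageNet-1k classification with the Taiyi-vit-87M-D checkpoint.
# The Hub identifier and the use of the standard ViT classes are assumptions.
import requests
from PIL import Image
from transformers import AutoImageProcessor, ViTForImageClassification

model_id = "IDEA-CCNL/Taiyi-vit-87M-D"  # assumed Hub identifier
processor = AutoImageProcessor.from_pretrained(model_id)
model = ViTForImageClassification.from_pretrained(model_id)

# Sample image from the COCO validation set.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

inputs = processor(images=image, return_tensors="pt")
logits = model(**inputs).logits
predicted_class = logits.argmax(-1).item()
print("Predicted class:", model.config.id2label[predicted_class])
```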

Model Features

Special Pretraining Scheme
Uses the novel pretraining method "D" to inject multimodal information through specialized training tasks
High Performance
Outperforms the original CLIP-ViT-base model on benchmarks like CIFAR10 and ImageNet1k
Multimodal Representation
Pretrained on MSCOCO and VG datasets, enabling multimodal understanding capabilities

Model Capabilities

Image Classification
Visual Feature Extraction
Multimodal Representation Learning

Use Cases

Computer Vision
Image Classification
Classifies input images, supporting ImageNet 1000-class tasks
Achieves 82.4% accuracy on ImageNet1k
Visual Feature Extraction
Extracts high-level visual features from images for downstream tasks (a sketch follows below)
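A minimal sketch of visual feature extraction, assuming the same checkpoint can also be loaded as a plain ViT backbone without the classification head (the model identifier and loading classes are assumptions):

```python
# Minimal sketch: pooled visual features for downstream tasks.
# Assumes the checkpoint loads as a plain ViTModel backbone (no classification head).
import torch
from PIL import Image
from transformers import AutoImageProcessor, ViTModel

model_id = "IDEA-CCNL/Taiyi-vit-87M-D"  # assumed Hub identifier
processor = AutoImageProcessor.from_pretrained(model_id)
backbone = ViTModel.from_pretrained(model_id)

image = Image.open("example.jpg")  # any local RGB image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = backbone(**inputs)

# Use the [CLS] token embedding as a single feature vector for the image.
cls_embedding = outputs.last_hidden_state[:, 0]  # shape: (1, hidden_size)
print(cls_embedding.shape)
```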