
Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese

Developed by: IDEA-CCNL
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a RoBERTa-large text encoder and a ViT-H visual encoder.
Downloads: 108
Release date: 2022-09-26

Model Overview

This model is a vision-language representation model that encodes images and Chinese text into a shared embedding space, enabling joint feature extraction, zero-shot image classification, and text-image retrieval.
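The sketch below shows how such a two-tower setup can be loaded and used for joint feature extraction. It follows the usage pattern typical of the Taiyi-CLIP family, but the exact checkpoint names (the IDEA-CCNL text tower and a laion ViT-H vision tower), the use of BertForSequenceClassification logits as text features, and the image path are assumptions, not confirmed by this page.

```python
# A minimal sketch of joint feature extraction with a two-tower setup.
# Checkpoint names, the logits-as-text-features pattern, and the image
# path are assumptions based on typical Taiyi-CLIP usage.
import torch
from PIL import Image
from transformers import (BertForSequenceClassification, BertTokenizer,
                          CLIPModel, CLIPProcessor)

# Text tower: RoBERTa-large fine-tuned to project Chinese text into the
# CLIP embedding space (assumed checkpoint name).
text_tokenizer = BertTokenizer.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese").eval()

# Vision tower: frozen ViT-H CLIP visual encoder (assumed checkpoint name).
clip_model = CLIPModel.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K").eval()
processor = CLIPProcessor.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

texts = ["一只猫", "一只狗"]        # "a cat", "a dog"
image = Image.open("example.jpg")  # any local image (hypothetical path)

with torch.no_grad():
    # Text features: the classification head outputs the CLIP-space vector.
    tokens = text_tokenizer(texts, return_tensors="pt", padding=True)
    text_features = text_encoder(**tokens).logits
    # Image features from the frozen visual encoder.
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
    image_features = clip_model.get_image_features(pixel_values=pixel_values)

    # L2-normalize, then cosine similarity is a plain dot product.
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    similarity = image_features @ text_features.T  # shape (1, num_texts)

print(similarity)
```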

Model Features

Chinese Multimodal Understanding
Joint vision-language representations optimized specifically for Chinese-language scenarios
Large-Scale Pretraining
Pre-trained on 123 million Chinese image-text pairs, covering a wide range of visual concepts
Efficient Architecture Design
Freezes the visual encoder's parameters and fine-tunes only the language encoder, improving training efficiency (see the sketch after this list)
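To illustrate the design, here is a minimal sketch of how the frozen-vision setup might look in PyTorch. The checkpoint names, optimizer, and learning rate are illustrative assumptions, and the contrastive training loop itself is omitted.

```python
# A minimal sketch of the frozen-vision training setup described above.
# Checkpoint names and optimizer choice are illustrative assumptions;
# the contrastive loss and data loading are omitted.
import torch
from transformers import BertForSequenceClassification, CLIPModel

text_encoder = BertForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese")
vision_encoder = CLIPModel.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

# Freeze every visual-encoder parameter so gradients only flow through
# the Chinese text encoder.
for param in vision_encoder.parameters():
    param.requires_grad = False
vision_encoder.eval()

# Hand only the trainable text-tower parameters to the optimizer.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-5)
```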

Model Capabilities

Zero-shot image classification
Text-image retrieval
Multimodal feature extraction
Cross-modal similarity calculation

Use Cases

Image Understanding
Zero-shot image classification: classify images into arbitrary Chinese label sets without task-specific training. Achieves 54.35% Top-1 accuracy on ImageNet1k-CN (see the sketch after this section).

Cross-modal Retrieval
Text-to-image retrieval: retrieve relevant images from Chinese text descriptions. Achieves 60.82% Top-1 accuracy on the Flickr30k-CNA test set.
Image-to-text retrieval: retrieve relevant text descriptions for a given image. Achieves 60.02% Top-1 accuracy on the COCO-CN test set.
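As a concrete illustration of the zero-shot classification use case, the sketch below scores an image against Chinese prompt templates and applies a softmax over the similarities. The label set, prompt wording, and checkpoints are illustrative assumptions; this is not the evaluation protocol behind the 54.35% figure.

```python
# A minimal zero-shot classification sketch (assumed checkpoints as above;
# labels and prompt template are illustrative).
import torch
from PIL import Image
from transformers import (BertForSequenceClassification, BertTokenizer,
                          CLIPModel, CLIPProcessor)

labels = ["猫", "狗", "汽车", "飞机"]                  # cat, dog, car, airplane
prompts = [f"一张{label}的照片" for label in labels]  # "a photo of a {label}"

text_tokenizer = BertTokenizer.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-RoBERTa-326M-ViT-H-Chinese").eval()
clip_model = CLIPModel.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K").eval()
processor = CLIPProcessor.from_pretrained(
    "laion/CLIP-ViT-H-14-laion2B-s32B-b79K")

image = Image.open("example.jpg")  # hypothetical input image

with torch.no_grad():
    tokens = text_tokenizer(prompts, return_tensors="pt", padding=True)
    text_feat = text_encoder(**tokens).logits
    pixel_values = processor(images=image, return_tensors="pt")["pixel_values"]
    image_feat = clip_model.get_image_features(pixel_values=pixel_values)

    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    # Softmax over scaled similarities yields per-label probabilities.
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```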