
Taiyi-CLIP-Roberta-large-326M-Chinese

Developed by IDEA-CCNL
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs and supporting Chinese image-text feature extraction and zero-shot classification.
Downloads: 10.37k
Release date: 7/19/2022

Model Overview

A Chinese multimodal CLIP model that uses RoBERTa-large as the text encoder and ViT-L/14 as the visual encoder, designed specifically for Chinese image-text tasks.
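The dual-encoder layout described above can be sketched in PyTorch. This is a minimal illustration, not the model's actual code: the placeholder linear layers stand in for RoBERTa-large and ViT-L/14, and the dimensions (1024 for RoBERTa-large's hidden size, 768 for the shared embedding space) are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class DualEncoderSketch(nn.Module):
    """Toy stand-in for a CLIP-style dual encoder: a text tower and a
    vision tower projected into one shared embedding space. In the real
    model the towers are RoBERTa-large and ViT-L/14; here they are
    placeholder linear layers so only the shape flow is visible."""

    def __init__(self, text_dim=1024, vision_dim=1024, embed_dim=768):
        # text_dim=1024 matches RoBERTa-large's hidden size; vision_dim
        # and embed_dim are illustrative assumptions, not checked values.
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)
        self.vision_proj = nn.Linear(vision_dim, embed_dim)

    def forward(self, text_hidden, image_hidden):
        # Map each modality into the shared space where similarity is computed.
        return self.text_proj(text_hidden), self.vision_proj(image_hidden)

model = DualEncoderSketch()
t, v = model(torch.randn(2, 1024), torch.randn(2, 1024))
```

Both outputs land in the same 768-dimensional space, which is what makes cross-modal similarity comparisons possible.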

Model Features

Chinese Multimodal Support
The first CLIP model optimized specifically for Chinese, supporting joint representation learning of Chinese text and images
Large-scale Pre-training
Pre-trained on 123 million Chinese image-text pairs (Wukong + Zero dataset), learning rich cross-modal associations
Stable Training Strategy
Adopts a strategy of freezing the visual encoder and fine-tuning only the text encoder to enhance training stability
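The freezing strategy can be sketched as follows. This is a generic PyTorch pattern for freezing one tower of a dual encoder, not the authors' training code; the tiny linear layers are placeholders for the real ViT-L/14 and RoBERTa-large modules.

```python
import torch
import torch.nn as nn

def freeze(module: nn.Module) -> None:
    """Turn off gradient updates for every parameter in the module."""
    for p in module.parameters():
        p.requires_grad = False

# Placeholder encoders standing in for ViT-L/14 (frozen) and
# RoBERTa-large (fine-tuned); the real modules are far larger.
vision_encoder = nn.Linear(16, 8)
text_encoder = nn.Linear(16, 8)

freeze(vision_encoder)

# Only the text encoder's parameters are handed to the optimizer, so the
# visual features stay fixed throughout fine-tuning.
trainable = [p for p in text_encoder.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-5)
```

Keeping the visual encoder fixed means the text encoder only has to learn to align Chinese text with an already-stable visual embedding space, which is the source of the training stability the card describes.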

Model Capabilities

Zero-shot image classification
Image-text feature extraction
Cross-modal retrieval
Image-text similarity calculation
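The similarity calculation and zero-shot classification listed above both reduce to the same CLIP-style scoring step, sketched below. The temperature of 100 is an illustrative value close to CLIP's learned logit scale, and the random tensors stand in for encoder outputs.

```python
import torch
import torch.nn.functional as F

def zero_shot_probs(image_feats, text_feats, temperature=100.0):
    """CLIP-style scoring: L2-normalize both sides so the dot product is a
    cosine similarity, scale by a temperature, and softmax over the
    candidate texts to obtain classification probabilities."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = temperature * image_feats @ text_feats.T
    return logits.softmax(dim=-1)

# Toy example: one image embedding scored against three label embeddings
# (in practice these would come from the vision and text encoders).
probs = zero_shot_probs(torch.randn(1, 768), torch.randn(3, 768))
```

For zero-shot classification, each candidate label is typically wrapped in a short Chinese prompt before encoding, and the highest-probability row entry is taken as the prediction; cross-modal retrieval uses the same cosine scores to rank images against a text query.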

Use Cases

Content Retrieval
Chinese Image Search
Retrieve relevant images using Chinese text queries
Top-1 accuracy of 54.36% on the Chinese Flickr30k test set
Content Classification
Zero-shot Image Classification
Classify images directly without fine-tuning
Top-1 accuracy of 53.05% on the Chinese version of ImageNet-1k