
Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese

Developed by IDEA-CCNL
The first open-source Chinese CLIP model, pre-trained on 123 million text-image pairs, with a text encoder based on the RoBERTa-base architecture.
Downloads: 668
Released: 2022-09-27

Model Overview

A Chinese vision-language joint representation model supporting image classification and text-image retrieval tasks.

Model Features

Chinese Multimodal Support
The first CLIP model specifically optimized for Chinese, supporting joint representation of Chinese text and images.
Efficient Training Strategy
Freezes the visual encoder's parameters and fine-tunes only the language encoder, improving training efficiency and stability (see the sketch after this list).
Large-scale Pre-training Data
Integrates the Wukong dataset (100 million samples) and the Zero dataset (23 million samples) for pre-training.
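
A minimal PyTorch sketch of this freezing strategy, assuming the Hugging Face transformers CLIPModel as the frozen image tower and the BERT-style interface the Taiyi checkpoints expose; the checkpoint id follows this card's naming, and the optimizer choice and learning rate are illustrative, not the project's actual training script:

```python
import torch
from transformers import BertForSequenceClassification, CLIPModel

# Image tower: OpenAI CLIP ViT-L/14, kept frozen throughout training.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14")
for param in clip_model.parameters():
    param.requires_grad = False

# Text tower: a Chinese RoBERTa whose classification head projects to the
# CLIP embedding dimension; only these weights receive gradients.
# (Pre-training would start from a plain Chinese RoBERTa checkpoint; the
# released Taiyi checkpoint is loaded here only to fix the shapes.)
text_encoder = BertForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese"
)

# The optimizer only ever sees the language encoder's parameters, so the
# visual representation space stays anchored to the original OpenAI CLIP.
optimizer = torch.optim.AdamW(text_encoder.parameters(), lr=1e-5)  # lr illustrative
```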

Model Capabilities

Zero-shot Image Classification
Text-Image Retrieval
Multimodal Feature Extraction

Use Cases

Image Understanding
Zero-shot Image Classification
Classify images against a set of Chinese candidate labels without any task-specific fine-tuning (example below).
Achieves 55.04% top-1 accuracy on ImageNet1k-CN.
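
A hedged usage sketch following the pattern published on the IDEA-CCNL model card: Chinese candidate labels go through the Taiyi text encoder, the image through the original OpenAI ViT-L/14 image encoder, and the label with the highest cosine similarity wins. The image URL and label set are placeholders, and the checkpoint id is assumed from this card's naming:

```python
import requests
import torch
from PIL import Image
from transformers import BertForSequenceClassification, BertTokenizer
from transformers import CLIPModel, CLIPProcessor

# Chinese candidate labels (placeholders): "a cat", "a dog", "a tiger".
query_texts = ["一只猫", "一只狗", "一只老虎"]

# Taiyi text encoder: RoBERTa whose classification logits act as text embeddings.
tokenizer = BertTokenizer.from_pretrained("IDEA-CCNL/Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese")
text_encoder = BertForSequenceClassification.from_pretrained(
    "IDEA-CCNL/Taiyi-CLIP-RoBERTa-102M-ViT-L-Chinese"
).eval()

# Frozen OpenAI CLIP ViT-L/14 image encoder and its preprocessor.
clip_model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # any image URL
image = processor(images=Image.open(requests.get(url, stream=True).raw), return_tensors="pt")
text = tokenizer(query_texts, return_tensors="pt", padding=True)["input_ids"]

with torch.no_grad():
    image_features = clip_model.get_image_features(**image)
    text_features = text_encoder(text).logits
    # L2-normalize both sides, then score with the learned temperature.
    image_features = image_features / image_features.norm(dim=1, keepdim=True)
    text_features = text_features / text_features.norm(dim=1, keepdim=True)
    logits = clip_model.logit_scale.exp() * image_features @ text_features.t()
    probs = logits.softmax(dim=-1)

print(dict(zip(query_texts, probs[0].tolist())))  # probability per candidate label
```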
Cross-modal Retrieval
Text-to-Image Retrieval
Retrieve relevant images from Chinese text descriptions (sketch below).
Achieves 58.32% top-1 accuracy on the Flickr30k-CNA test set.
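
Retrieval reuses the same two towers: embed the Chinese query, embed (or pre-index) the candidate images, and rank by cosine similarity. A minimal sketch, assuming the tokenizer, text_encoder, clip_model, and processor loaded in the classification example above and a Python list of PIL images; retrieve_images is a hypothetical helper, not part of the released code:

```python
import torch

def retrieve_images(query, images, tokenizer, text_encoder, clip_model, processor, top_k=5):
    """Rank candidate PIL images by cosine similarity to a Chinese text query."""
    with torch.no_grad():
        input_ids = tokenizer([query], return_tensors="pt", padding=True)["input_ids"]
        text_feat = text_encoder(input_ids).logits
        text_feat = text_feat / text_feat.norm(dim=1, keepdim=True)

        pixels = processor(images=images, return_tensors="pt")
        img_feats = clip_model.get_image_features(**pixels)
        img_feats = img_feats / img_feats.norm(dim=1, keepdim=True)

        # One similarity score per candidate image; higher is better.
        scores = (text_feat @ img_feats.t()).squeeze(0)
        best = scores.topk(min(top_k, len(images)))
    return list(zip(best.indices.tolist(), best.values.tolist()))

# e.g. retrieve_images("两只猫躺在沙发上", candidate_images, tokenizer,
#                      text_encoder, clip_model, processor)
# query means "two cats lying on a sofa"; candidate_images is a list of PIL images
```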