
Taiyi-CLIP-Roberta-102M-Chinese

Developed by IDEA-CCNL
The first open-source Chinese CLIP model, pre-trained on 123 million image-text pairs, with a text encoder based on the RoBERTa-base architecture.
Downloads 558
Release Time: 7/9/2022

Model Overview

This model is a Chinese vision-language representation model that understands the relationship between images and text, supporting zero-shot image classification and image-text retrieval tasks.
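CLIP-style zero-shot classification embeds the image and a set of candidate label texts into a shared space, then takes a softmax over their (temperature-scaled) cosine similarities. A minimal sketch of that scoring step, with random NumPy vectors standing in for the encoders' outputs (the 512-dim embedding size and temperature of 100 follow CLIP's conventions, but everything here is illustrative, not Taiyi's actual inference code):

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image_emb, text_embs):
    """Return class probabilities for one image against N candidate label texts."""
    image_emb = l2_normalize(image_emb)
    text_embs = l2_normalize(text_embs)
    logits = 100.0 * (text_embs @ image_emb)  # temperature-scaled cosine similarities
    exp = np.exp(logits - logits.max())       # numerically stable softmax
    return exp / exp.sum()

# Toy embeddings standing in for encoder outputs.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(3, 512))
text_embs[1] += 0.5 * image_emb  # make label 1 the closest match by construction
probs = zero_shot_classify(image_emb, text_embs)
print(probs.argmax())  # → 1
```

In the real model, `image_emb` comes from the frozen CLIP vision tower and `text_embs` from the Chinese RoBERTa text tower; only the similarity-plus-softmax step is shown here.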

Model Features

Chinese Support
The first CLIP model specifically optimized for Chinese, using a Chinese RoBERTa-wwm architecture for the text encoder.
Large-scale Pre-training
Pre-trained on 123 million Chinese image-text pairs, drawn from the Noah-Wukong dataset (100M pairs) and the Zero dataset (23M pairs).
Efficient Training Strategy
Freezes the visual encoder's parameters and fine-tunes only the language encoder, improving training efficiency and stability.
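The freezing strategy above can be sketched in PyTorch. This is a toy illustration with stand-in linear layers, not Taiyi's training code; the module names are assumptions:

```python
import torch
import torch.nn as nn

# Stand-ins for the CLIP vision tower and the Chinese RoBERTa text tower.
vision_encoder = nn.Linear(768, 512)
text_encoder = nn.Linear(768, 512)

# Freeze the vision tower: its weights keep the pre-trained visual alignment,
# so only the text tower receives gradients during contrastive training.
for p in vision_encoder.parameters():
    p.requires_grad = False

# The optimizer only sees the trainable (text) parameters.
optimizer = torch.optim.AdamW(
    (p for p in text_encoder.parameters() if p.requires_grad), lr=1e-4
)

trainable = sum(p.numel() for p in text_encoder.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in vision_encoder.parameters() if not p.requires_grad)
print(trainable, frozen)
```

Because gradients never flow into the frozen tower, each step updates far fewer parameters, which is what makes the strategy cheaper and more stable than training both encoders from scratch.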

Model Capabilities

Zero-shot image classification
Image-text feature extraction
Cross-modal retrieval
Image-text similarity calculation

Use Cases

Image Understanding
Zero-shot Image Classification
Classify images without fine-tuning
Top-1 accuracy of 42.85% on Chinese ImageNet-1k
Information Retrieval
Image-Text Retrieval
Search for relevant images based on text or relevant text based on images
Top-1 accuracy of 46.32% on the Chinese Flickr30k test set
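Top-1 retrieval reduces to an argmax over the pairwise cosine-similarity matrix between text and image embeddings. A minimal sketch with toy paired embeddings (not the actual Flickr30k evaluation; the helper name and dimensions are illustrative):

```python
import numpy as np

def retrieve_top1(text_embs, image_embs):
    """For each text, index of the most similar image; and the reverse direction."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    i = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    sim = t @ i.T                      # (num_texts, num_images) cosine similarities
    text_to_image = sim.argmax(axis=1)  # text -> image retrieval
    image_to_text = sim.argmax(axis=0)  # image -> text retrieval
    return text_to_image, image_to_text

# Toy paired data: text k is a lightly perturbed copy of image k,
# so the correct match for each query is its own index.
rng = np.random.default_rng(1)
images = rng.normal(size=(4, 64))
texts = images + 0.1 * rng.normal(size=(4, 64))
t2i, i2t = retrieve_top1(texts, images)
print(t2i.tolist())  # → [0, 1, 2, 3]
```

A reported top-1 accuracy is then just the fraction of queries whose argmax lands on the ground-truth pairing.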