
Chinese CLIP ViT-Large Patch14 336px

Developed by OFA-Sys
Chinese CLIP is a simple implementation of CLIP trained on approximately 200 million Chinese image-text pairs, using ViT-L/14@336px as the image encoder and RoBERTa-wwm-base as the text encoder.
Downloads: 713
Release Time: 11/9/2022

Model Overview

A large-scale Chinese vision-language pre-training model that supports tasks such as image-text similarity calculation, cross-modal retrieval, and zero-shot image classification.
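A minimal usage sketch is shown below: it loads the model through Hugging Face Transformers and scores one image against several Chinese captions. The checkpoint id OFA-Sys/chinese-clip-vit-large-patch14-336px, the image file name, and the candidate captions are assumptions for illustration.

```python
# Minimal image-text similarity sketch, assuming the checkpoint is published
# on Hugging Face as "OFA-Sys/chinese-clip-vit-large-patch14-336px".
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-large-patch14-336px"  # assumed checkpoint id
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")               # any local image (placeholder name)
texts = ["一只猫", "一只狗", "一辆自行车"]         # illustrative candidate captions

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the scaled similarity between the image and each
# caption; softmax turns the scores into a probability distribution.
probs = outputs.logits_per_image.softmax(dim=-1)
print(probs)
```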

Model Features

Large-scale Chinese Pre-training
Trained on approximately 200 million Chinese image-text pairs, giving it strong understanding of Chinese-language scenarios.
High-performance Cross-modal Retrieval
Achieves state-of-the-art performance on Chinese cross-modal benchmarks such as MUGE and Flickr30K-CN.
Zero-shot Transfer Capability
Supports zero-shot image classification and cross-modal retrieval without task-specific fine-tuning.
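As a sketch of the zero-shot transfer described above, the example below classifies an image by scoring it against prompt-wrapped label texts. The label set, prompt template, and file name are hypothetical; the checkpoint id is the same assumption as before.

```python
# Zero-shot image classification sketch with prompt-wrapped Chinese labels.
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-large-patch14-336px"  # assumed checkpoint id
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

labels = ["猫", "狗", "汽车", "飞机"]                  # hypothetical label set
prompts = [f"一张{label}的照片" for label in labels]   # simple prompt template

image = Image.open("example.jpg")                      # placeholder image path
inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image         # shape: (1, num_labels)

# Pick the label whose prompt scores highest against the image.
pred = labels[logits.argmax(dim=-1).item()]
print(f"predicted label: {pred}")
```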

Model Capabilities

Image-text similarity calculation
Text-to-image retrieval (see the retrieval sketch after this list)
Image-to-text retrieval
Zero-shot image classification
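For the retrieval capabilities above, a common pattern is to embed texts and images separately, L2-normalize the embeddings, and rank by cosine similarity. The sketch below assumes a small hypothetical image gallery and text query; get_image_features and get_text_features are the feature-extraction methods of the Transformers Chinese-CLIP model class.

```python
# Text-to-image retrieval sketch: embed query and gallery separately, then rank.
import torch
from PIL import Image
from transformers import ChineseCLIPModel, ChineseCLIPProcessor

model_id = "OFA-Sys/chinese-clip-vit-large-patch14-336px"  # assumed checkpoint id
model = ChineseCLIPModel.from_pretrained(model_id)
processor = ChineseCLIPProcessor.from_pretrained(model_id)

images = [Image.open(p) for p in ["img_0.jpg", "img_1.jpg"]]  # hypothetical gallery
query = "红色的连衣裙"                                          # hypothetical text query

with torch.no_grad():
    img_inputs = processor(images=images, return_tensors="pt")
    img_feats = model.get_image_features(**img_inputs)
    txt_inputs = processor(text=[query], return_tensors="pt", padding=True)
    txt_feats = model.get_text_features(**txt_inputs)

# Normalize so the dot product equals cosine similarity, then rank the gallery.
img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
scores = txt_feats @ img_feats.T            # shape: (1, num_images)
ranking = scores.argsort(dim=-1, descending=True)
print(ranking)
```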

Use Cases

E-commerce
Product Image-Text Matching
Automatically matches product images with their descriptive text.
Improves product search accuracy.
Content Moderation
Inappropriate Content Detection
Detects inconsistent or inappropriate image-text content.
Enhances moderation efficiency