
Chinese-CLIP ViT-Base-Patch16

Developed by Xenova
A Chinese CLIP model based on the ViT architecture, supporting multimodal understanding of images and text
Downloads 264
Release Time: 12/13/2023

Model Overview

This is a Chinese CLIP model based on the Vision Transformer (ViT) architecture, specifically optimized for Chinese scenarios, capable of cross-modal understanding and matching between images and text.

Model Features

Chinese Optimization
A CLIP model optimized specifically for Chinese scenarios, supporting cross-modal understanding between Chinese text and images
Zero-shot Learning
Classifies images into new categories without task-specific training
ONNX Compatibility
Provides ONNX-format weights for easy deployment in web environments (e.g., with Transformers.js)

Model Capabilities

Zero-shot image classification
Image-text similarity calculation
Cross-modal retrieval
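
All three capabilities rest on the same mechanism: the ViT image encoder and the Chinese text encoder map their inputs into a shared embedding space, and matching reduces to comparing embedding directions. The sketch below illustrates that scoring step only, in plain Python; the embeddings here are toy vectors (real ones come from the model's encoders), and the `logit_scale` of 100 is an assumption reflecting CLIP's typical learned temperature.

```python
import math

def normalize(v):
    """L2-normalize a vector, as CLIP does to image and text embeddings."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clip_zero_shot_scores(image_emb, text_embs, logit_scale=100.0):
    """Score one image embedding against candidate text-label embeddings:
    cosine similarity of normalized vectors, scaled, then softmax to
    obtain per-label probabilities."""
    img = normalize(image_emb)
    logits = []
    for t in text_embs:
        t = normalize(t)
        cos = sum(a * b for a, b in zip(img, t))  # cosine similarity
        logits.append(logit_scale * cos)
    # numerically stable softmax over the candidate labels
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy usage: the first "label" embedding points roughly the same way
# as the image embedding, so it receives nearly all the probability mass.
probs = clip_zero_shot_scores(
    [1.0, 0.1, 0.0],
    [[0.9, 0.2, 0.0],   # e.g. a matching Chinese label
     [0.0, 0.0, 1.0]],  # e.g. an unrelated label
)
```

Cross-modal retrieval uses the same scores without the softmax: rank all images (or all texts) by cosine similarity to the query embedding.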

Use Cases

Content Understanding
Chinese Image Classification
Classify images with Chinese labels
An example shows a 99.2% confidence score when classifying a Pikachu image
Content Moderation
Inappropriate Content Detection
Detect inappropriate images through text descriptions
© 2025 AIbase