Chinese-CLIP ViT-Base-Patch16
Chinese CLIP model based on ViT architecture, supporting multimodal understanding of images and text
Downloads 264
Release Date: 12/13/2023
Model Overview
This is a Chinese CLIP model based on the Vision Transformer (ViT) architecture, specifically optimized for Chinese scenarios, capable of cross-modal understanding and matching between images and text.
Model Features
Chinese Optimization
CLIP model specifically optimized for Chinese scenarios, supporting cross-modal understanding between Chinese text and images
Zero-shot Learning
Can classify images into new categories without any task-specific training
ONNX Compatibility
Provides weights in ONNX format for easy deployment in web environments
Model Capabilities
Zero-shot image classification
Image-text similarity calculation
Cross-modal retrieval
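The capabilities above all reduce to comparing image and text embeddings in a shared space. A minimal sketch of the scoring step, with toy embeddings standing in for the model's encoder outputs (a real pipeline would obtain 512-dimensional vectors from the image and text encoders; the temperature value here is illustrative):

```python
import math

def normalize(v):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    # Convert raw similarity scores into a probability distribution over labels.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(image_emb, label_embs, temperature=100.0):
    # CLIP-style zero-shot classification: cosine similarity between the image
    # embedding and each candidate label embedding, scaled and softmax-normalized.
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in label_embs]
    return softmax([temperature * s for s in sims])

# Toy embeddings; a real model would encode the image and the Chinese labels.
image = [0.9, 0.1, 0.0]
labels = [[1.0, 0.0, 0.0],   # e.g. embedding of "皮卡丘" (Pikachu)
          [0.0, 1.0, 0.0],   # e.g. embedding of "猫" (cat)
          [0.0, 0.0, 1.0]]   # e.g. embedding of "狗" (dog)
probs = classify(image, labels)
print(probs.index(max(probs)))  # → 0: the first label scores highest
```

The same similarity scores can be sorted instead of softmax-normalized, which yields the cross-modal retrieval behavior listed above.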
Use Cases
Content Understanding
Chinese Image Classification
Classify images with Chinese labels
In a demo example, the model assigns 99.2% confidence to the correct label when classifying a Pikachu image
Content Moderation
Inappropriate Content Detection
Detect inappropriate images through text descriptions
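One way to build this is to compare each image embedding against text embeddings of policy descriptions and flag matches above a threshold. A hedged sketch with placeholder embeddings; the threshold value is an assumption and would need calibration on labeled moderation data:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flag_image(image_emb, policy_embs, threshold=0.3):
    # Flag an image if it matches any text description of disallowed content
    # more strongly than the threshold. The threshold is a tunable assumption,
    # not a value from the model card.
    scores = [cosine(image_emb, p) for p in policy_embs]
    return max(scores) >= threshold, scores

# Toy embeddings standing in for encoder outputs of an image and of Chinese
# policy descriptions of disallowed content.
image = [0.2, 0.8, 0.1]
policies = [[0.1, 0.9, 0.0],
            [0.9, 0.0, 0.1]]
flagged, scores = flag_image(image, policies)
print(flagged)  # → True: the image closely matches the first description
```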