Chinese CLIP ViT Large Patch14
A Chinese CLIP model based on the Vision Transformer architecture, supporting cross-modal understanding and generation tasks across images and text.
Downloads: 14
Release date: 12/13/2023
Model Overview
This is a Chinese vision-language pre-trained model that understands the relationship between images and text, supporting cross-modal tasks such as image classification and image caption generation.
Model Features
Cross-modal understanding
Capable of processing both image and text information to understand semantic relationships between them
Chinese optimization
Specifically optimized for the Chinese language and Chinese-context scenarios
Web deployment
Converted to ONNX format, supporting browser-based execution via Transformers.js
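Below is a minimal usage sketch with the Transformers.js zero-shot image classification pipeline. The repository id is a placeholder for the ONNX-converted checkpoint, and the Chinese prompt template is an assumption to be adapted to the actual model.

```javascript
import { pipeline } from '@xenova/transformers';

// Load a zero-shot image classification pipeline backed by Chinese CLIP.
// 'your-namespace/chinese-clip-vit-large-patch14' is a placeholder repository id;
// substitute the actual ONNX-converted checkpoint.
const classifier = await pipeline(
  'zero-shot-image-classification',
  'your-namespace/chinese-clip-vit-large-patch14'
);

// Score an image URL against candidate Chinese labels, using a Chinese prompt template.
const imageUrl = 'https://example.com/cat.jpg';
const labels = ['猫', '狗', '汽车'];
const output = await classifier(imageUrl, labels, { hypothesis_template: '一张{}的照片' });
console.log(output); // e.g. [{ score: 0.97, label: '猫' }, ...]
```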
Model Capabilities
Image feature extraction
Text feature extraction
Image-text similarity calculation (see the sketch after this list)
Cross-modal retrieval
Image caption generation
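The similarity-calculation and cross-modal-retrieval capabilities above come down to comparing feature vectors. The sketch below assumes you already have image and text embeddings from the model's encoders (all names are illustrative) and ranks candidate images against a text query by cosine similarity.

```javascript
// Cosine similarity between a text feature vector and an image feature vector.
// The embeddings are assumed to come from the model's text/image encoders;
// all names below are illustrative.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cross-modal retrieval: rank candidate image embeddings against a text query embedding.
function rankImagesByText(textEmbed, imageEmbeds) {
  return imageEmbeds
    .map((embed, index) => ({ index, score: cosineSimilarity(textEmbed, embed) }))
    .sort((a, b) => b.score - a.score);
}
```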
Use Cases
E-commerce
Product search
Search for relevant product images using text descriptions
Improves search accuracy and user experience
Content moderation
Image-text consistency check
Verify whether image content matches its descriptive text, as sketched below
Reduces false or misleading content
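A minimal sketch of such a consistency check, reusing the Transformers.js pipeline with a placeholder model id; the distractor captions and the 0.5 threshold are illustrative assumptions that would need tuning in practice.

```javascript
import { pipeline } from '@xenova/transformers';

// Placeholder repository id; substitute the actual ONNX-converted checkpoint.
const classifier = await pipeline(
  'zero-shot-image-classification',
  'your-namespace/chinese-clip-vit-large-patch14'
);

// Flag an image/text pair as inconsistent when the claimed description does not
// clearly win against a few generic distractor captions (illustrative heuristic).
async function isConsistent(imageUrl, claimedDescription, threshold = 0.5) {
  const distractors = ['一张无关的图片', '纯文字截图', '空白图片'];
  const results = await classifier(imageUrl, [claimedDescription, ...distractors]);
  const best = results.reduce((a, b) => (b.score > a.score ? b : a));
  return best.label === claimedDescription && best.score >= threshold;
}

console.log(await isConsistent('https://example.com/product.jpg', '一条红色连衣裙'));
```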