Chinese CLIP ViT Large Patch14
A Chinese CLIP model based on the Vision Transformer architecture, supporting cross-modal understanding and generation tasks across images and text.
Downloads: 14
Release date: 12/13/2023
Model Overview
This is a Chinese vision-language pre-trained model that understands the relationship between images and text, supporting cross-modal tasks such as image classification and image caption generation.
Model Features
Cross-modal understanding
Capable of processing both image and text information to understand semantic relationships between them
Chinese optimization
Specifically optimized for the Chinese language and Chinese-context scenarios
Web deployment
Converted to ONNX format, supporting browser-based execution via Transformers.js
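Below is a minimal usage sketch with the Transformers.js zero-shot image classification pipeline. The repository id is a placeholder for the ONNX-converted checkpoint, and the Chinese prompt template is an assumption to be adapted to the actual model.

```javascript
import { pipeline } from '@xenova/transformers';

// Load a zero-shot image classification pipeline backed by Chinese CLIP.
// 'your-namespace/chinese-clip-vit-large-patch14' is a placeholder repository id;
// substitute the actual ONNX-converted checkpoint.
const classifier = await pipeline(
  'zero-shot-image-classification',
  'your-namespace/chinese-clip-vit-large-patch14'
);

// Score an image URL against candidate Chinese labels, using a Chinese prompt template.
const imageUrl = 'https://example.com/cat.jpg';
const labels = ['猫', '狗', '汽车'];
const output = await classifier(imageUrl, labels, { hypothesis_template: '一张{}的照片' });
console.log(output); // e.g. [{ score: 0.97, label: '猫' }, ...]
```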
Model Capabilities
Image feature extraction
Text feature extraction
Image-text similarity calculation (see the sketch after this list)
Cross-modal retrieval
Image caption generation
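The similarity-calculation and cross-modal-retrieval capabilities above come down to comparing feature vectors. The sketch below assumes you already have image and text embeddings from the model's encoders (all names are illustrative) and ranks candidate images against a text query by cosine similarity.

```javascript
// Cosine similarity between a text feature vector and an image feature vector.
// The embeddings are assumed to come from the model's text/image encoders;
// all names below are illustrative.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Cross-modal retrieval: rank candidate image embeddings against a text query embedding.
function rankImagesByText(textEmbed, imageEmbeds) {
  return imageEmbeds
    .map((embed, index) => ({ index, score: cosineSimilarity(textEmbed, embed) }))
    .sort((a, b) => b.score - a.score);
}
```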
Use Cases
E-commerce
Product search
Search for relevant product images using text descriptions
Improves search accuracy and user experience
Content moderation
Image-text consistency check
Verify whether image content matches its descriptive text, as sketched below
Reduces false or misleading content
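A minimal sketch of such a consistency check, reusing the Transformers.js pipeline with a placeholder model id; the distractor captions and the 0.5 threshold are illustrative assumptions that would need tuning in practice.

```javascript
import { pipeline } from '@xenova/transformers';

// Placeholder repository id; substitute the actual ONNX-converted checkpoint.
const classifier = await pipeline(
  'zero-shot-image-classification',
  'your-namespace/chinese-clip-vit-large-patch14'
);

// Flag an image/text pair as inconsistent when the claimed description does not
// clearly win against a few generic distractor captions (illustrative heuristic).
async function isConsistent(imageUrl, claimedDescription, threshold = 0.5) {
  const distractors = ['一张无关的图片', '纯文字截图', '空白图片'];
  const results = await classifier(imageUrl, [claimedDescription, ...distractors]);
  const best = results.reduce((a, b) => (b.score > a.score ? b : a));
  return best.label === claimedDescription && best.score >= threshold;
}

console.log(await isConsistent('https://example.com/product.jpg', '一条红色连衣裙'));
```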