Chinese-CLIP ViT-Base-Patch16
Chinese CLIP model based on ViT architecture, supporting multimodal understanding of images and text
Downloads 264
Release Date: 12/13/2023
Model Overview
This is a Chinese CLIP model based on the Vision Transformer (ViT) architecture, specifically optimized for Chinese scenarios, capable of cross-modal understanding and matching between images and text.
Model Features
Chinese Optimization
CLIP model specifically optimized for Chinese scenarios, supporting cross-modal understanding between Chinese text and images
Zero-shot Learning
Can classify images into new categories without any task-specific training
ONNX Compatibility
Provides weights in ONNX format for easy deployment in web environments
Model Capabilities
Zero-shot image classification
Image-text similarity calculation
Cross-modal retrieval
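The capabilities above all reduce to comparing image and text embeddings in a shared space. A minimal sketch of the scoring step, with toy embeddings standing in for the model's encoder outputs (a real pipeline would obtain 512-dimensional vectors from the image and text encoders; the temperature value here is illustrative):

```python
import math

def normalize(v):
    # Scale a vector to unit length so the dot product equals cosine similarity.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores):
    # Convert raw similarity scores into a probability distribution over labels.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(image_emb, label_embs, temperature=100.0):
    # CLIP-style zero-shot classification: cosine similarity between the image
    # embedding and each candidate label embedding, scaled and softmax-normalized.
    img = normalize(image_emb)
    sims = [sum(a * b for a, b in zip(img, normalize(t))) for t in label_embs]
    return softmax([temperature * s for s in sims])

# Toy embeddings; a real model would encode the image and the Chinese labels.
image = [0.9, 0.1, 0.0]
labels = [[1.0, 0.0, 0.0],   # e.g. embedding of "皮卡丘" (Pikachu)
          [0.0, 1.0, 0.0],   # e.g. embedding of "猫" (cat)
          [0.0, 0.0, 1.0]]   # e.g. embedding of "狗" (dog)
probs = classify(image, labels)
print(probs.index(max(probs)))  # → 0: the first label scores highest
```

The same similarity scores can be sorted instead of softmax-normalized, which yields the cross-modal retrieval behavior listed above.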
Use Cases
Content Understanding
Chinese Image Classification
Classify images with Chinese labels
In a demo example, the model assigns 99.2% confidence to the correct label when classifying a Pikachu image
Content Moderation
Inappropriate Content Detection
Detect inappropriate images through text descriptions
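One way to build this is to compare each image embedding against text embeddings of policy descriptions and flag matches above a threshold. A hedged sketch with placeholder embeddings; the threshold value is an assumption and would need calibration on labeled moderation data:

```python
import math

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def flag_image(image_emb, policy_embs, threshold=0.3):
    # Flag an image if it matches any text description of disallowed content
    # more strongly than the threshold. The threshold is a tunable assumption,
    # not a value from the model card.
    scores = [cosine(image_emb, p) for p in policy_embs]
    return max(scores) >= threshold, scores

# Toy embeddings standing in for encoder outputs of an image and of Chinese
# policy descriptions of disallowed content.
image = [0.2, 0.8, 0.1]
policies = [[0.1, 0.9, 0.0],
            [0.9, 0.0, 0.1]]
flagged, scores = flag_image(image, policies)
print(flagged)  # → True: the image closely matches the first description
```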