
vit_xsmall_patch16_clip_224.tinyclip_yfcc15m

Developed by timm
A compact vision-language model based on the CLIP architecture, designed for efficient zero-shot image classification
Downloads 444
Release date: 3/20/2024

Model Overview

This model is a lightweight variant of the CLIP architecture, trained on the YFCC15M dataset and suited to zero-shot image classification tasks.

Model Features

Lightweight design
Uses an XSmall-scale ViT backbone, requiring less compute and memory than full-size CLIP models
Zero-shot learning
Classifies images into arbitrary label sets described in natural language, without task-specific fine-tuning
Multimodal understanding
Embeds images and text in a shared space, enabling cross-modal matching between visual and textual information

Model Capabilities

Zero-shot image classification
Image-text matching
Cross-modal retrieval
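All three capabilities rest on the same mechanism: the model embeds an image and a set of candidate texts into one vector space, then ranks the texts by cosine similarity to the image. A minimal sketch of that matching step with toy NumPy embeddings (the function, the 256-dim vectors, and the temperature value are illustrative stand-ins, not this model's actual API or outputs):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=100.0):
    """CLIP-style ranking: L2-normalize both sides, take cosine
    similarity of each text embedding with the image embedding,
    and softmax over the temperature-scaled scores."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = temperature * (txt @ img)   # one cosine score per candidate text
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Toy vectors standing in for the model's image/text encoder outputs.
rng = np.random.default_rng(0)
image = rng.normal(size=256)
labels = np.stack([
    image + 0.1 * rng.normal(size=256),  # near-duplicate of the image
    rng.normal(size=256),                # unrelated
    rng.normal(size=256),                # unrelated
])
probs = zero_shot_classify(image, labels)
```

In the real pipeline the image embedding comes from the ViT encoder and each text embedding from the text encoder applied to a label prompt; classification, image-text matching, and retrieval differ only in which side of the similarity matrix is ranked.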

Use Cases

Content management
Automatic image tagging
Automatically generates descriptive tags for unlabeled images
Improves image library management efficiency
E-commerce
Product categorization
Classifies product images against category names expressed as natural-language descriptions
Supports new product categories without additional training, since a category is just a new text prompt
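Adding a category without retraining amounts to writing new text prompts for it. A common CLIP practice is to expand each label into several prompt templates and average their text embeddings for a more robust class representation; a small sketch of the prompt-building step (the template strings and labels are illustrative, not taken from this model):

```python
def build_prompts(labels, templates=("a photo of a {}.",
                                     "a product photo of a {}.")):
    """Expand each candidate category name into several natural-language
    prompts. Each prompt is embedded by the text encoder, and the
    per-label embeddings are typically averaged before matching."""
    return {label: [t.format(label) for t in templates] for label in labels}

# Introducing a new product category needs no training data,
# only its name:
prompts = build_prompts(["sneaker", "backpack"])
```

Each returned string would be tokenized and fed to the text encoder; the resulting averaged embedding then competes in the same cosine-similarity ranking used for zero-shot classification.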