DFN Public

Developed by Apple
This is a CLIP-based ViT-B/32 model trained using a Data Filtering Network (DFN) on datasets including CC12M, CC3M, and Shutterstock 15M, suitable for zero-shot image classification tasks.
Downloads 3,822
Release Time: 7/8/2024

Model Overview

This model is a vision-language Transformer based on Contrastive Language-Image Pre-training (CLIP). Its training data is filtered automatically by a Data Filtering Network, and the model can perform zero-shot image classification and image-text matching.

Model Features

Data Filtering Network Training
Uses a small Data Filtering Network (DFN) to automatically filter large-scale uncurated datasets, improving training data quality
Multi-dataset Joint Training
Trains on three datasets combined: Conceptual Captions 12M, Conceptual Captions 3M, and Shutterstock 15M
Zero-shot Classification Capability
Can be directly applied to new image classification tasks without task-specific fine-tuning
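
The data-filtering idea described above can be illustrated with a minimal sketch. It assumes the filtering network scores each image-text pair by the cosine similarity of its image and text embeddings and keeps the highest-scoring fraction of the pool. The embeddings, the pair names, the `filter_pairs` helper, and the `keep_fraction` value are all hypothetical toy values for illustration, not outputs or settings of this model.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_pairs(pairs, keep_fraction):
    """Keep the best-aligned fraction of (pair_id, image_emb, text_emb) triples.

    Hypothetical helper: a real pipeline would obtain the embeddings from the
    filtering network's image and text encoders.
    """
    scored = [(cosine_similarity(img, txt), pid) for pid, img, txt in pairs]
    scored.sort(reverse=True)  # highest alignment score first
    n_keep = max(1, int(len(scored) * keep_fraction))
    return [pid for _, pid in scored[:n_keep]]

# Toy 2-D embeddings (assumed values, not real model outputs).
toy_pool = [
    ("cat_photo", [1.0, 0.0], [0.9, 0.1]),      # caption matches image
    ("spam_caption", [1.0, 0.0], [0.0, 1.0]),   # caption unrelated to image
    ("dog_photo", [0.0, 1.0], [0.2, 0.8]),      # caption matches image
    ("mismatched", [0.5, 0.5], [1.0, -1.0]),    # caption unrelated to image
]
kept = filter_pairs(toy_pool, keep_fraction=0.5)
print(kept)
```

With these toy values the two well-aligned pairs survive and the two unrelated captions are dropped.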

Model Capabilities

Zero-shot image classification
Image-text matching
Cross-modal retrieval
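
As a rough illustration of how zero-shot classification typically works with CLIP-style models, the sketch below compares one image embedding against the text embeddings of candidate label prompts and applies a softmax over the scaled cosine similarities. The embeddings are made-up 3-D toy vectors, and `zero_shot_probs` and the `logit_scale` of 100 are assumptions for illustration; in practice the embeddings would come from the model's image and text encoders.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def zero_shot_probs(image_emb, label_embs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between an image and label prompts."""
    img = normalize(image_emb)
    logits = []
    for emb in label_embs:
        txt = normalize(emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    # Subtract the max logit before exponentiating for numerical stability.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]
# Toy embeddings (assumed values, not real model outputs).
toy_label_embs = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
toy_image_emb = [0.8, 0.2, 0.1]

probs = zero_shot_probs(toy_image_emb, toy_label_embs)
best_label = labels[probs.index(max(probs))]
print(best_label)
```

The same similarity scores serve image-text matching and retrieval: ranking many images against one text embedding (or the reverse) instead of taking a softmax over labels.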

Use Cases

Content Management
Automatic Image Tagging
Automatically generates descriptive labels for unlabeled images
E-commerce
Product Image Classification
Automatically classifies product images based on descriptions
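
For the tagging use case, one possible approach is to treat each candidate tag as a text prompt and keep every tag whose similarity to the image clears a threshold, rather than picking a single best class. The sketch below assumes precomputed embeddings; the `tag_image` helper, the tag vocabulary, the `min_score` threshold, and the toy vectors are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def tag_image(image_emb, tag_embs, min_score=0.5, top_k=3):
    """Return up to top_k tags whose similarity to the image is at least min_score."""
    scored = [(cosine_similarity(image_emb, emb), tag) for tag, emb in tag_embs.items()]
    scored = [(score, tag) for score, tag in scored if score >= min_score]
    scored.sort(reverse=True)  # most similar tag first
    return [tag for _, tag in scored[:top_k]]

# Toy embeddings (assumed values, not real model outputs).
toy_tag_embs = {
    "beach": [0.9, 0.1, 0.0],
    "sunset": [0.6, 0.8, 0.0],
    "office": [0.0, 0.0, 1.0],
}
toy_image_emb = [0.8, 0.6, 0.0]

tags = tag_image(toy_image_emb, toy_tag_embs)
print(tags)
```

A suitable threshold depends on the model and the tag prompts, so it would need to be tuned on labeled examples rather than taken from this sketch.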