
vit_huge_patch14_clip_224.dfn5b

Developed by timm
A ViT-Huge image encoder using the CLIP architecture, released by Apple as part of the DFN5B-CLIP model and suited to visual feature extraction tasks.
Release Time: 12/26/2024

Model Overview

This model is the Vision Transformer (ViT) image encoder of a CLIP model, packaged for use as a standalone image feature extractor. It uses the ViT-Huge architecture with 14x14 pixel patches and a 224x224 input resolution.
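A minimal feature-extraction sketch with timm is shown below. The model name is taken from this card, the image path is a placeholder, and downloading the pretrained weights requires network access.

```python
import timm
import torch
from PIL import Image

# Create the model as a pure feature extractor (num_classes=0 removes
# any classification head so the forward pass returns pooled features).
model = timm.create_model(
    'vit_huge_patch14_clip_224.dfn5b',  # name as listed on this card
    pretrained=True,
    num_classes=0,
)
model.eval()

# Build the preprocessing pipeline that matches the pretrained config
# (resize/crop to 224x224 plus CLIP-style normalization).
data_config = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**data_config, is_training=False)

img = Image.open('example.jpg').convert('RGB')  # placeholder image path
with torch.no_grad():
    features = model(transform(img).unsqueeze(0))

print(features.shape)  # ViT-Huge pooled features: (1, 1280)
```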

Model Features

Large-scale Vision Transformer
Uses the ViT-Huge architecture, giving it strong image feature extraction capacity
CLIP-compatible design
Built on the CLIP framework, so it can be paired with a matching text encoder for cross-modal tasks (a pairing sketch follows this list)
Fixed 224x224 input resolution
Processes 224x224 pixel inputs split into 14x14 pixel patches (a 16x16 grid of patch tokens)
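As a sketch of that pairing, the snippet below loads the full DFN5B-CLIP model (image and text towers together) through OpenCLIP. The hub id 'hf-hub:apple/DFN5B-CLIP-ViT-H-14' is an assumption based on the Apple release this card names; the image path and text prompts are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Assumed hub id for the Apple DFN5B-CLIP release named on this card.
model, _, preprocess = open_clip.create_model_and_transforms(
    'hf-hub:apple/DFN5B-CLIP-ViT-H-14'
)
tokenizer = open_clip.get_tokenizer('hf-hub:apple/DFN5B-CLIP-ViT-H-14')
model.eval()

image = preprocess(Image.open('example.jpg')).unsqueeze(0)  # placeholder
text = tokenizer(['a photo of a cat', 'a photo of a dog'])  # placeholder

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, then compare: higher cosine similarity = better match.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(probs)  # probability of each text prompt matching the image
```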

Model Capabilities

Image feature extraction
Visual representation learning

Use Cases

Computer vision
Image classification
Extract image features to feed a downstream classifier such as a linear probe
Visual search
Generate feature vectors for image retrieval systems (a retrieval sketch follows this list)
Multimodal applications
Image-text matching
Pairs with a text encoder to perform cross-modal retrieval
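A minimal retrieval sketch over precomputed features is shown below; the random tensors stand in for embeddings produced by the timm extractor above, and the helper name is illustrative.

```python
import torch
import torch.nn.functional as F

def top_k_matches(query: torch.Tensor, gallery: torch.Tensor, k: int = 5):
    """Return the k gallery rows most similar to the query (cosine)."""
    query = F.normalize(query, dim=-1)
    gallery = F.normalize(gallery, dim=-1)
    scores = gallery @ query.squeeze(0)  # one similarity per gallery image
    return torch.topk(scores, k)

# Random stand-ins for real features (1280-dim for ViT-Huge).
gallery_feats = torch.randn(1000, 1280)
query_feat = torch.randn(1, 1280)

values, indices = top_k_matches(query_feat, gallery_feats)
print(indices)  # indices of the 5 closest gallery images
```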