
CLIP ViT L Rho50 K1 Constrained FARE2

Developed by LEAF-CLIP
A feature extraction model fine-tuned from openai/clip-vit-large-patch14, with both the image and text encoders optimized
Downloads 253
Release Time: 4/16/2025

Model Overview

This is a feature extraction model based on the CLIP architecture. Its image encoder is fine-tuned with FARE and its text encoder with LEAF, making it suitable for multimodal feature extraction tasks.
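Below is a minimal loading sketch. The Hub id is an assumption derived from the model name; check the LEAF-CLIP organization page for the exact repository. The model is assumed to load with the standard CLIPModel/CLIPProcessor classes from transformers.

```python
# Minimal usage sketch; the model id below is an assumption, not confirmed.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"  # assumed Hub id
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

image = Image.open("example.jpg")
inputs = processor(text=["a photo of a cat"], images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

image_features = outputs.image_embeds  # shape (1, projection_dim)
text_features = outputs.text_embeds    # shape (1, projection_dim)
```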

Model Features

Adversarial fine-tuning
The image encoder is fine-tuned using FARE at ε = 2/255, improving robustness against adversarial attacks
Semantic-constrained fine-tuning
The text encoder is fine-tuned using LEAF with k = 1, ρ = 50, and semantic constraints
Multimodal feature extraction
Supports both image and text feature extraction, retaining the multimodal capabilities of the original CLIP (see the sketch after this list)
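A sketch of extracting each modality separately, using transformers' standard get_image_features and get_text_features helpers; the model id is assumed as in the overview above:

```python
# Extract image and text features independently; model_id is an assumption.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model_id = "LEAF-CLIP/CLIP-ViT-L-rho50-k1-constrained-FARE2"  # assumed Hub id
model = CLIPModel.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)

# Image branch only: the FARE-fine-tuned ViT-L image encoder.
image_inputs = processor(images=Image.open("example.jpg"), return_tensors="pt")
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)

# Text branch only: the LEAF-fine-tuned text encoder.
text_inputs = processor(text=["a photo of a cat"],
                        return_tensors="pt", padding=True)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)
```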

Model Capabilities

Image feature extraction
Text feature extraction
Multimodal feature alignment

Use Cases

Computer vision
Image retrieval
Use the extracted image features for similar image retrieval
Natural language processing
Cross-modal retrieval
Implement cross-modal retrieval from text to image or from image to text, as sketched below
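A minimal retrieval sketch: embeddings from both encoders are L2-normalized and ranked by cosine similarity, which is how CLIP-style models typically score image-text pairs. The feature tensors are assumed to come from the extraction sketches above.

```python
# Rank N candidate image embeddings against one text query embedding.
# image_features: (N, d); text_features: (1, d). Cosine similarity equals
# the dot product after L2 normalization.
import torch
import torch.nn.functional as F

def rank_images(text_features: torch.Tensor,
                image_features: torch.Tensor) -> torch.Tensor:
    text = F.normalize(text_features, dim=-1)
    images = F.normalize(image_features, dim=-1)
    scores = (images @ text.T).squeeze(-1)   # (N,) cosine similarities
    return scores.argsort(descending=True)   # indices, best match first
```

Image-to-text retrieval works the same way with the roles of the two tensors swapped.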