CLIP-ViT-L-14-spectrum-icons-20k Open-Source Model - For Abstract Image and Text Retrieval Tasks

CLIP ViT L 14 Spectrum Icons 20k

Developed by JianLiao

A vision-language model fine-tuned based on CLIP ViT-L/14, optimized for abstract image-text retrieval tasks

Text-to-Image

TensorBoard

Language: English
Open Source License: MIT
Tags: Zero-shot Image Classification, Abstract Visual Retrieval, Text-Image Alignment

Downloads: 1,576

Release Time: 1/5/2025

Model Overview

This model is fine-tuned on 23,000 abstract image-text pairs to improve text-to-image and image-to-text retrieval. It is particularly suited to abstract visual content such as icons and symbols.

Model Features

Abstract Visual Feature Understanding

Enhanced understanding of abstract icons and symbols through fine-tuning on a dedicated dataset

Efficient Retrieval Capability

Achieves R@1 of about 70% and R@5 of about 96% on the validation split, in both text-to-image and image-to-text retrieval

Domain Adaptability

Adapted to a specific domain (abstract icons) with the aim of keeping the generalization capability of the base model

Model Capabilities

Zero-shot image classification

Text-to-image retrieval

Image-to-text retrieval

Abstract visual feature matching

Use Cases

Information Retrieval

Icon Library Search

Retrieve matching icon images through natural language descriptions

R@1 accuracy approximately 70%
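
As a rough illustration of how icon library search works, the sketch below ranks a small library by cosine similarity to a query embedding. The vectors are made-up 3-dimensional toy values, and the names `cosine`, `search`, `icon_embeddings`, and `query_embedding` are hypothetical. In practice the vectors would come from the model's image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_embedding, icon_embeddings, top_k=2):
    """Return the top_k (icon_name, score) pairs, best match first."""
    scored = [
        (name, cosine(query_embedding, embedding))
        for name, embedding in icon_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for image embeddings (assumption: real ones come from the image encoder).
icon_embeddings = {
    "arrow.png": [0.9, 0.1, 0.0],
    "gear.png": [0.1, 0.9, 0.1],
    "cloud.png": [0.0, 0.2, 0.9],
}
# Toy stand-in for the text embedding of a query such as "a settings cog".
query_embedding = [0.2, 0.8, 0.1]

results = search(query_embedding, icon_embeddings)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because image embeddings do not depend on the query, a real library could embed every icon once, store the vectors, and only encode the text at search time.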

Content Management

Automatic Image Tagging

Assign descriptive text labels to abstract icons by matching each icon against a set of candidate labels

🚀 CLIP-ViT-L-14-spectrum-icons-20k

A fine-tuned CLIP ViT-L/14 model for improved text-to-image and image-to-text retrieval tasks.

🚀 Quick Start

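Install the required dependencies (the open_clip package, PyTorch, and Pillow), then load the fine-tuned model:

import torch
from PIL import Image
from open_clip import create_model_and_transforms, get_tokenizer

# create_model_and_transforms returns (model, train preprocess, eval preprocess)
model, _, preprocess = create_model_and_transforms(
    model_name="hf-hub:JianLiao/CLIP-ViT-L-14-spectrum-icons-20k"
)
model.eval()

tokenizer = get_tokenizer("ViT-L-14")

# Example: score one image against candidate descriptions
text_inputs = tokenizer(["a description of the image", "another description of the image"])
# preprocess expects a PIL image, not a file path
image = preprocess(Image.open("/path/to/image.png")).unsqueeze(0)

with torch.no_grad():
    # Encode both modalities and L2-normalize so the dot product is cosine similarity
    image_features = model.encode_image(image)
    text_features = model.encode_text(text_inputs)
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits_per_image = model.logit_scale.exp() * image_features @ text_features.T
    probs = logits_per_image.softmax(dim=-1).numpy()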

✨ Features

Direct Use

Zero-shot image classification.
Text-to-image and image-to-text retrieval.
Improving text-image alignment in abstract visual contexts.
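
Zero-shot classification with a CLIP-style model reduces to a softmax over scaled similarities between one image embedding and one text embedding per candidate label. The sketch below shows that step in isolation, under the assumption that embeddings are already unit-normalized. The 2-dimensional vectors, the label prompts, the function names, and the `scale=100.0` temperature are all illustrative assumptions, not values taken from this model.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(score - peak) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def classify(image_embedding, label_embeddings, scale=100.0):
    """Map each label to a probability, assuming unit-normalized embeddings."""
    labels = list(label_embeddings)
    logits = [
        scale * sum(x * y for x, y in zip(image_embedding, label_embeddings[label]))
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Toy unit-length embeddings (assumption: real ones come from the two encoders).
image_embedding = [0.6, 0.8]
label_embeddings = {
    "an icon of a gear": [0.8, 0.6],
    "an icon of a cloud": [0.6, 0.8],
    "an icon of an arrow": [1.0, 0.0],
}

probabilities = classify(image_embedding, label_embeddings)
best_label = max(probabilities, key=probabilities.get)
print(best_label, round(probabilities[best_label], 3))
```

The same routine covers automatic tagging: keep every label whose probability clears a chosen threshold instead of only the top one.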

Downstream Use

Fine-tuning for domain-specific image-text retrieval tasks.
Integration into applications requiring enhanced semantic search.

📦 Installation

The model card gives no installation steps beyond the Quick Start code. That code needs the open_clip package (distributed on PyPI as open_clip_torch) together with PyTorch.

📚 Documentation

Model Details

Model Description

This is a fine-tuned CLIP ViT-L/14 model based on the pretrained laion/CLIP-ViT-L-14-laion2B-s32B-b82K checkpoint from LAION. It was adapted for text-to-image and image-to-text retrieval using a custom dataset of 23,000 PNG-caption pairs (JianLiao/spectrum-icons). Fine-tuning used the OpenCLIP library on NVIDIA GPUs to specialize the model for abstract visual features and to improve retrieval-augmented generation (RAG) performance.

The base model was originally trained on the LAION-2B dataset, leveraging natural language supervision to align visual and textual embeddings. This fine-tuning task aimed to adapt the model further for specific domains while maintaining generalization.

Training Details

Training Data

The model was fine-tuned on 23,000 image-text caption pairs. The dataset was designed to include diverse and abstract visual elements paired with detailed textual descriptions to enhance the model's capability in handling abstract queries and retrieval tasks.

Training Procedure

The fine-tuning was conducted using the OpenCLIP library on a machine with 6 NVIDIA RTX-3090 GPUs. Key hyperparameters include:

Learning Rate: 5e-6 with cosine decay.
Batch Size: 64 per GPU, effective global batch size of 384.
Epochs: 40.
Precision: Mixed precision (amp_bf16) for improved efficiency.
Augmentations:
- Color Jitter: (0.2, 0.2, 0.1, 0.0) with a probability of 0.7.
- Grayscale Probability: 0.2.

The training incorporated gradient checkpointing, distributed data parallelism (NCCL), and regular evaluations for zero-shot performance. Validation was performed after each epoch.
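
The arithmetic behind these hyperparameters can be sketched as follows. The global batch size follows directly from the listed values. The schedule function is a plain cosine decay from the base learning rate to zero with no warmup, which is an assumption: the card does not state warmup steps or a minimum learning rate, and real schedulers usually step per optimizer update rather than per epoch. The names `cosine_lr`, `BASE_LR`, and the other constants are hypothetical.

```python
import math

BASE_LR = 5e-6        # learning rate from the list above
PER_GPU_BATCH = 64    # batch size per GPU
NUM_GPUS = 6          # RTX-3090 cards
EPOCHS = 40


def cosine_lr(step, total_steps, base_lr=BASE_LR):
    """Cosine decay from base_lr at step 0 down to 0 at total_steps.

    Assumption: no warmup and a final learning rate of zero.
    """
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Effective global batch size under data parallelism.
global_batch = PER_GPU_BATCH * NUM_GPUS
print("global batch size:", global_batch)

# Learning rate sampled at a few epoch boundaries (per-epoch granularity for brevity).
for epoch in (0, 10, 20, 30, 40):
    print(f"epoch {epoch:2d}: lr = {cosine_lr(epoch, EPOCHS):.3e}")
```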

Evaluation

Testing Data, Factors & Metrics

Testing Data

The model was evaluated on the validation set split from the 23,000 image-text pairs. Metrics were computed for both image-to-text and text-to-image retrieval tasks.

Metrics

Recall at K:
- R@1, R@5, R@10 for image-to-text and text-to-image retrieval.
Mean Rank and Median Rank:
- Average and median positions of the correct match in retrieval.
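
These metrics can all be derived from the rank of the correct match for each query. The sketch below computes them from a toy similarity matrix, assuming that query i's correct candidate is candidate i and that ranks are 1-based. The matrix values and the names `retrieval_ranks` and `recall_at_k` are hypothetical, and OpenCLIP's own evaluation code may handle ties differently.

```python
from statistics import mean, median


def retrieval_ranks(similarity):
    """Return the 1-based rank of the correct candidate for every query.

    similarity[i][j] is the score of query i against candidate j, and the
    correct candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for index, row in enumerate(similarity):
        correct_score = row[index]
        # Rank is 1 plus the number of candidates scoring strictly higher.
        ranks.append(1 + sum(1 for score in row if score > correct_score))
    return ranks


def recall_at_k(ranks, k):
    """Fraction of queries whose correct candidate is ranked within the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Toy image-to-text similarity matrix: rows are images, columns are captions.
similarity = [
    [0.9, 0.1, 0.3, 0.2],
    [0.4, 0.8, 0.1, 0.3],
    [0.7, 0.2, 0.5, 0.6],
    [0.1, 0.3, 0.2, 0.6],
]

ranks = retrieval_ranks(similarity)
print("ranks:", ranks)
print("R@1:", recall_at_k(ranks, 1))
print("R@5:", recall_at_k(ranks, 5))
print("mean rank:", mean(ranks))
print("median rank:", median(ranks))
```

Transposing the matrix before calling `retrieval_ranks` gives the text-to-image direction.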

Results

Image-to-Text Retrieval:
- R@1: ~70.0%
- R@5: ~96.0%
- R@10: ~97.8%
- Mean Rank: ~2.24
- Median Rank: ~1.0
Text-to-Image Retrieval:
- R@1: ~70.3%
- R@5: ~96.4%
- R@10: ~98.1%
- Mean Rank: ~2.17
- Median Rank: ~1.0

The results demonstrate robust alignment between visual and textual embeddings, with strong performance on both retrieval tasks.

🔧 Technical Details

The base model was originally trained on the LAION-2B dataset, leveraging natural language supervision to align visual and textual embeddings. The fine-tuning process utilized the OpenCLIP library and NVIDIA GPUs to specialize the model for handling abstract visual features and enhancing RAG performance. The fine-tuning was conducted on a machine with 6 NVIDIA RTX-3090 GPUs, using the hyperparameters listed under Training Procedure and incorporating techniques like gradient checkpointing and distributed data parallelism.

📄 License

This project is licensed under the MIT license.

Acknowledgements

The pretrained base model was developed by LAION and trained on the LAION-2B dataset.

Citation

BibTeX:

@inproceedings{cherti2023reproducible,
  title={Reproducible scaling laws for contrastive language-image learning},
  author={Cherti, Mehdi and Beaumont, Romain and Wightman, Ross and Wortsman, Mitchell and Ilharco, Gabriel and Gordon, Cade and Schuhmann, Christoph and Schmidt, Ludwig and Jitsev, Jenia},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  pages={2818--2829},
  year={2023}
}

OpenAI CLIP paper

@inproceedings{Radford2021LearningTV,
  title={Learning Transferable Visual Models From Natural Language Supervision},
  author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
  booktitle={ICML},
  year={2021}
}

OpenCLIP software

@software{ilharco_gabriel_2021_5143773,
  author       = {Ilharco, Gabriel and
                  Wortsman, Mitchell and
                  Wightman, Ross and
                  Gordon, Cade and
                  Carlini, Nicholas and
                  Taori, Rohan and
                  Dave, Achal and
                  Shankar, Vaishaal and
                  Namkoong, Hongseok and
                  Miller, John and
                  Hajishirzi, Hannaneh and
                  Farhadi, Ali and
                  Schmidt, Ludwig},
  title        = {OpenCLIP},
  month        = jul,
  year         = 2021,
  note         = {If you use this software, please cite it as below.},
  publisher    = {Zenodo},
  version      = {0.1},
  doi          = {10.5281/zenodo.5143773},
  url          = {https://doi.org/10.5281/zenodo.5143773}
}

📋 Information Table

Property	Details
Model Type	A fine-tuned CLIP ViT-L/14 model
Training Data	23,000 image-text caption pairs (JianLiao/spectrum-icons)
