🚀 CLIP ViT-L-14 trained on DataComp-1B
This model is a research output for zero-shot, arbitrary image classification. It is intended to enable researchers to better understand and explore zero-shot image classification and to support interdisciplinary studies of the potential impact of such models.
🚀 Quick Start
To get started with the model, see the 📦 Installation and 💻 Usage Examples sections below.
✨ Features
- Direct Use: Zero-shot image classification, image and text retrieval, etc.
- Downstream Use: Image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, etc.
📦 Installation
The model is loaded through OpenCLIP (https://github.com/mlfoundations/open_clip); installing the `open_clip_torch` package alongside PyTorch (for example, `pip install open_clip_torch`) should be sufficient to run the example below.
💻 Usage Examples
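The original card does not include code, so the following is a minimal zero-shot classification sketch using the OpenCLIP API. The Hugging Face hub identifier and the image path are assumptions/placeholders: substitute the actual repository name for this checkpoint and a local image of your own.

```python
import torch
from PIL import Image
import open_clip

# Assumed hub identifier for this checkpoint; replace it if the repository name differs.
MODEL_ID = "hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"

model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

# Placeholder image path; any RGB image works.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled similarities turned into a probability distribution over the prompts.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

In practice you would tokenize one prompt per class in your taxonomy (for example, "a photo of a {class name}") rather than the three arbitrary prompts above.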
📚 Documentation
Model Details
Model Description
A CLIP ViT-L/14 model trained with the DataComp-1B dataset (https://github.com/mlfoundations/datacomp) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training was done on the stability.ai cluster.
Uses
As per the original [OpenAI CLIP model card](https://github.com/openai/CLIP/blob/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1/model-card.md), this model is intended as a research output for research communities.
- Direct Use: Zero-shot image classification, image and text retrieval, among others.
- Downstream Use: Image classification and other image task fine-tuning, linear probe image classification (see the sketch after this list), image generation guiding and conditioning, among others.
- Out-of-Scope Use:
  - Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy.
  - Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of the performance of the model.
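As a concrete illustration of the linear-probe downstream use mentioned above, the sketch below extracts frozen image embeddings and fits a logistic-regression probe on them. The hub identifier is assumed as in the Usage Examples section, CIFAR-10 is only a stand-in dataset, and scikit-learn is an assumed extra dependency; small subsets keep the sketch tractable without a GPU.

```python
import numpy as np
import torch
import torchvision
import open_clip
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader, Subset

MODEL_ID = "hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"  # assumed identifier
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
model.eval()

def embed(dataset, batch_size=64):
    """Encode every image with the frozen CLIP image tower."""
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=batch_size):
            feats.append(model.encode_image(images).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# CIFAR-10 as a stand-in; subsets keep the example quick on CPU.
train = torchvision.datasets.CIFAR10(".", train=True, download=True, transform=preprocess)
test = torchvision.datasets.CIFAR10(".", train=False, download=True, transform=preprocess)
x_train, y_train = embed(Subset(train, range(2000)))
x_test, y_test = embed(Subset(test, range(1000)))

# The "probe" is just a linear classifier trained on the frozen features.
probe = LogisticRegression(max_iter=1000).fit(x_train, y_train)
print("Linear-probe accuracy:", probe.score(x_test, y_test))
```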
Training Details
Training Data
This model was trained with 1.4 billion samples of the DataComp-1B dataset (https://arxiv.org/abs/2304.14108).
⚠️ Important Note
The dataset is uncurated, and the collected links may lead to strongly discomforting and disturbing content. It is possible to extract a “safe” subset by filtering out samples based on safety tags. However, we cannot entirely exclude the possibility of harmful content. The dataset is intended for research purposes and is not recommended for creating ready-to-go industrial products.
Training Procedure
Please see https://arxiv.org/abs/2304.14108.
Evaluation
Evaluation was done on 38 datasets, using the DataComp repo and the [LAION CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark).
Testing Data, Factors & Metrics
- Testing Data: The testing is performed on a suite of 38 datasets. See our paper for more details (https://arxiv.org/abs/2304.14108).
Results
The model achieves a 79.2% zero-shot top-1 accuracy on ImageNet-1k. See our paper for more details and results (https://arxiv.org/abs/2304.14108).
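For intuition only, the sketch below shows how zero-shot top-1 accuracy is computed in principle: each class name becomes a text prompt, and an image counts as correct when its embedding is closest to the prompt embedding of its true class. The reported 79.2% was produced by the DataComp and CLIP Benchmark evaluation suites, not by this snippet, and the hub identifier is assumed as in the Usage Examples section.

```python
import torch
import open_clip
from torch.utils.data import DataLoader

MODEL_ID = "hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"  # assumed identifier
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

def zero_shot_top1(dataset, class_names, batch_size=64):
    """Top-1 accuracy of nearest-text-embedding classification."""
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        text_feats = model.encode_text(tokenizer(prompts))
        text_feats /= text_feats.norm(dim=-1, keepdim=True)
        correct, total = 0, 0
        for images, targets in DataLoader(dataset, batch_size=batch_size):
            img_feats = model.encode_image(images)
            img_feats /= img_feats.norm(dim=-1, keepdim=True)
            preds = (img_feats @ text_feats.T).argmax(dim=-1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / total

# Usage: wrap any labeled image dataset with `preprocess` as its transform, e.g.
#   acc = zero_shot_top1(dataset, dataset.classes)
```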
Acknowledgements
We acknowledge stability.ai for providing the compute used to train this model.
Citation
BibTeX:
@article{datacomp,
title={DataComp: In search of the next generation of multimodal datasets},
author={Samir Yitzhak Gadre and Gabriel Ilharco and Alex Fang and Jonathan Hayase and Georgios Smyrnis and Thao Nguyen and Ryan Marten and Mitchell Wortsman and Dhruba Ghosh and Jieyu Zhang and Eyal Orgad and Rahim Entezari and Giannis Daras and Sarah Pratt and Vivek Ramanujan and Yonatan Bitton and Kalyani Marathe and Stephen Mussmann and Richard Vencu and Mehdi Cherti and Ranjay Krishna and Pang Wei Koh and Olga Saukh and Alexander Ratner and Shuran Song and Hannaneh Hajishirzi and Ali Farhadi and Romain Beaumont and Sewoong Oh and Alex Dimakis and Jenia Jitsev and Yair Carmon and Vaishaal Shankar and Ludwig Schmidt},
journal={arXiv preprint arXiv:2304.14108},
year={2023}
}
@inproceedings{Radford2021LearningTV,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
booktitle={ICML},
year={2021}
}
@software{ilharco_gabriel_2021_5143773,
author = {Ilharco, Gabriel and
Wortsman, Mitchell and
Wightman, Ross and
Gordon, Cade and
Carlini, Nicholas and
Taori, Rohan and
Dave, Achal and
Shankar, Vaishaal and
Namkoong, Hongseok and
Miller, John and
Hajishirzi, Hannaneh and
Farhadi, Ali and
Schmidt, Ludwig},
title = {OpenCLIP},
month = jul,
year = 2021,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {0.1},
doi = {10.5281/zenodo.5143773},
url = {https://doi.org/10.5281/zenodo.5143773}
}
🔧 Technical Details
Training used the OpenCLIP codebase on the stability.ai cluster; see the DataComp paper (https://arxiv.org/abs/2304.14108) for the full training setup and hyperparameters.
📄 License
The model is released under the MIT license.