🚀 CLIP ViT-L-14 trained on DataComp-1B
This model is a research output for zero-shot, arbitrary image classification. It is intended to enable researchers to better understand and explore zero-shot image classification and to support interdisciplinary studies of the potential impact of such models.
🚀 Quick Start
To get started with the model, see the 📦 Installation and 💻 Usage Examples sections below.
✨ Features
- Direct Use: Zero-shot image classification, image and text retrieval, etc.
- Downstream Use: Image classification and other image task fine-tuning, linear probe image classification, image generation guiding and conditioning, etc.
📦 Installation
The model is loaded through OpenCLIP (https://github.com/mlfoundations/open_clip); installing the `open_clip_torch` package alongside PyTorch (for example, `pip install open_clip_torch`) should be sufficient to run the example below.
💻 Usage Examples
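The original card does not include code, so the following is a minimal zero-shot classification sketch using the OpenCLIP API. The Hugging Face hub identifier and the image path are assumptions/placeholders: substitute the actual repository name for this checkpoint and a local image of your own.

```python
import torch
from PIL import Image
import open_clip

# Assumed hub identifier for this checkpoint; replace it if the repository name differs.
MODEL_ID = "hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"

model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

# Placeholder image path; any RGB image works.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Scaled similarities turned into a probability distribution over the prompts.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```

In practice you would tokenize one prompt per class in your taxonomy (for example, "a photo of a {class name}") rather than the three arbitrary prompts above.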
📚 Documentation
Model Details
Model Description
A CLIP ViT-L/14 model trained with the DataComp-1B dataset (https://github.com/mlfoundations/datacomp) using OpenCLIP (https://github.com/mlfoundations/open_clip). Model training was done on the stability.ai cluster.
Uses
As per the original [OpenAI CLIP model card](https://github.com/openai/CLIP/blob/d50d76daa670286dd6cacf3bcd80b5e4823fc8e1/model-card.md), this model is intended as a research output for research communities.
- Direct Use: Zero-shot image classification, image and text retrieval, among others.
- Downstream Use: Image classification and other image task fine-tuning, linear probe image classification (see the sketch after this list), image generation guiding and conditioning, among others.
- Out-of-Scope Use:
  - Any deployed use case of the model, whether commercial or not, is currently out of scope. Non-deployed use cases such as image search in a constrained environment are also not recommended unless there is thorough in-domain testing of the model with a specific, fixed class taxonomy.
  - Certain use cases which would fall under the domain of surveillance and facial recognition are always out-of-scope regardless of the performance of the model.
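As a concrete illustration of the linear-probe downstream use mentioned above, the sketch below extracts frozen image embeddings and fits a logistic-regression probe on them. The hub identifier is assumed as in the Usage Examples section, CIFAR-10 is only a stand-in dataset, and scikit-learn is an assumed extra dependency; small subsets keep the sketch tractable without a GPU.

```python
import numpy as np
import torch
import torchvision
import open_clip
from sklearn.linear_model import LogisticRegression
from torch.utils.data import DataLoader, Subset

MODEL_ID = "hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"  # assumed identifier
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
model.eval()

def embed(dataset, batch_size=64):
    """Encode every image with the frozen CLIP image tower."""
    feats, labels = [], []
    with torch.no_grad():
        for images, targets in DataLoader(dataset, batch_size=batch_size):
            feats.append(model.encode_image(images).cpu().numpy())
            labels.append(targets.numpy())
    return np.concatenate(feats), np.concatenate(labels)

# CIFAR-10 as a stand-in; subsets keep the example quick on CPU.
train = torchvision.datasets.CIFAR10(".", train=True, download=True, transform=preprocess)
test = torchvision.datasets.CIFAR10(".", train=False, download=True, transform=preprocess)
x_train, y_train = embed(Subset(train, range(2000)))
x_test, y_test = embed(Subset(test, range(1000)))

# The "probe" is just a linear classifier trained on the frozen features.
probe = LogisticRegression(max_iter=1000).fit(x_train, y_train)
print("Linear-probe accuracy:", probe.score(x_test, y_test))
```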
Training Details
Training Data
This model was trained with 1.4 billion samples of the DataComp-1B dataset (https://arxiv.org/abs/2304.14108).
⚠️ Important Note
The dataset is uncurated, and the collected links may lead to strongly discomforting and disturbing content. It is possible to extract a “safe” subset by filtering out samples based on safety tags. However, we cannot entirely exclude the possibility of harmful content. The dataset is intended for research purposes and is not recommended for creating ready-to-go industrial products.
Training Procedure
Please see https://arxiv.org/abs/2304.14108.
Evaluation
Evaluation was done on 38 datasets, using the DataComp repo and the [LAION CLIP Benchmark](https://github.com/LAION-AI/CLIP_benchmark).
Testing Data, Factors & Metrics
- Testing Data: The testing is performed on a suite of 38 datasets. See our paper for more details (https://arxiv.org/abs/2304.14108).
Results
The model achieves a 79.2% zero-shot top-1 accuracy on ImageNet-1k. See our paper for more details and results (https://arxiv.org/abs/2304.14108).
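For intuition only, the sketch below shows how zero-shot top-1 accuracy is computed in principle: each class name becomes a text prompt, and an image counts as correct when its embedding is closest to the prompt embedding of its true class. The reported 79.2% was produced by the DataComp and CLIP Benchmark evaluation suites, not by this snippet, and the hub identifier is assumed as in the Usage Examples section.

```python
import torch
import open_clip
from torch.utils.data import DataLoader

MODEL_ID = "hf-hub:laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K"  # assumed identifier
model, _, preprocess = open_clip.create_model_and_transforms(MODEL_ID)
tokenizer = open_clip.get_tokenizer(MODEL_ID)
model.eval()

def zero_shot_top1(dataset, class_names, batch_size=64):
    """Top-1 accuracy of nearest-text-embedding classification."""
    prompts = [f"a photo of a {name}" for name in class_names]
    with torch.no_grad():
        text_feats = model.encode_text(tokenizer(prompts))
        text_feats /= text_feats.norm(dim=-1, keepdim=True)
        correct, total = 0, 0
        for images, targets in DataLoader(dataset, batch_size=batch_size):
            img_feats = model.encode_image(images)
            img_feats /= img_feats.norm(dim=-1, keepdim=True)
            preds = (img_feats @ text_feats.T).argmax(dim=-1)
            correct += (preds == targets).sum().item()
            total += targets.numel()
    return correct / total

# Usage: wrap any labeled image dataset with `preprocess` as its transform, e.g.
#   acc = zero_shot_top1(dataset, dataset.classes)
```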
Acknowledgements
We acknowledge stability.ai for providing the compute used to train this model.
Citation
BibTeX:
@article{datacomp,
title={DataComp: In search of the next generation of multimodal datasets},
author={Samir Yitzhak Gadre and Gabriel Ilharco and Alex Fang and Jonathan Hayase and Georgios Smyrnis and Thao Nguyen and Ryan Marten and Mitchell Wortsman and Dhruba Ghosh and Jieyu Zhang and Eyal Orgad and Rahim Entezari and Giannis Daras and Sarah Pratt and Vivek Ramanujan and Yonatan Bitton and Kalyani Marathe and Stephen Mussmann and Richard Vencu and Mehdi Cherti and Ranjay Krishna and Pang Wei Koh and Olga Saukh and Alexander Ratner and Shuran Song and Hannaneh Hajishirzi and Ali Farhadi and Romain Beaumont and Sewoong Oh and Alex Dimakis and Jenia Jitsev and Yair Carmon and Vaishaal Shankar and Ludwig Schmidt},
journal={arXiv preprint arXiv:2304.14108},
year={2023}
}
@inproceedings{Radford2021LearningTV,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
booktitle={ICML},
year={2021}
}
@software{ilharco_gabriel_2021_5143773,
author = {Ilharco, Gabriel and
Wortsman, Mitchell and
Wightman, Ross and
Gordon, Cade and
Carlini, Nicholas and
Taori, Rohan and
Dave, Achal and
Shankar, Vaishaal and
Namkoong, Hongseok and
Miller, John and
Hajishirzi, Hannaneh and
Farhadi, Ali and
Schmidt, Ludwig},
title = {OpenCLIP},
month = jul,
year = 2021,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {0.1},
doi = {10.5281/zenodo.5143773},
url = {https://doi.org/10.5281/zenodo.5143773}
}
🔧 Technical Details
Training used the OpenCLIP codebase on the stability.ai cluster; see the DataComp paper (https://arxiv.org/abs/2304.14108) for the full training setup and hyperparameters.
📄 License
The model is released under the MIT license.