🚀 Model Card for CLIP ViT-B/32 - LAION-2B
This model card presents a CLIP ViT-B/32 model trained on the LAION-2B English subset. It aims to facilitate research in zero-shot, arbitrary image classification and related fields.
🚀 Quick Start
Use the following code to get started with the model.
**TODO** - Hugging Face transformers, OpenCLIP, and timm getting-started snippets
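Until the official snippets land, here is a minimal sketch of zero-shot classification with OpenCLIP. The pretrained tag `laion2b_s34b_b79k` and the example image path are assumptions (they are not stated in this card); verify the tag against `open_clip.list_pretrained()` or the model listing before use.

```python
import torch
from PIL import Image
import open_clip

# Assumption: 'laion2b_s34b_b79k' is the OpenCLIP pretrained tag for this release.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder image path
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    # Cosine similarities scaled and normalized into a distribution over the prompts.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probabilities:", probs)
```

With Hugging Face transformers, the same kind of zero-shot classification can be driven through the `zero-shot-image-classification` pipeline; the repo id below is likewise an assumption to be checked on the Hub:

```python
from transformers import pipeline

# Assumption: Hub repo id for this checkpoint; verify before use.
classifier = pipeline(
    "zero-shot-image-classification",
    model="laion/CLIP-ViT-B-32-laion2B-s34B-b79K",
)
print(classifier("example.jpg", candidate_labels=["a diagram", "a dog", "a cat"]))
```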
✨ Features
Model Details
- A CLIP ViT-B/32 model trained with the LAION-2B English subset of LAION-5B (https://laion.ai/blog/laion-5b/) using OpenCLIP (https://github.com/mlfoundations/open_clip).
- Model training was done by Romain Beaumont on the stability.ai cluster.
Uses
- Research Output: Intended as a research output for research communities, to help them better understand and explore zero-shot, arbitrary image classification and to support interdisciplinary studies of its potential impact.
- Direct Use: Zero-shot image classification, image and text retrieval, etc.
- Downstream Use: Image classification and other image-task fine-tuning, linear-probe image classification, image generation guiding and conditioning, etc. (a minimal linear-probe sketch follows this list).
- Out-of-Scope Use:
  - Any deployed use case (commercial or not) is currently out of scope. Non-deployed use cases in a constrained environment are not recommended without thorough in-domain testing.
  - Use cases in surveillance and facial recognition are always out of scope.
  - Since the model is trained and evaluated only in English, its use should be limited to English-language use cases.
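To make the linear-probe item above concrete, here is a minimal sketch under stated assumptions: frozen CLIP image features feed a scikit-learn logistic regression. The pretrained tag and the `train_samples`/`test_samples` iterables of (PIL image, label) pairs are placeholders, not part of this card.

```python
import numpy as np
import torch
import open_clip
from sklearn.linear_model import LogisticRegression

# Assumption: same pretrained tag as in the Quick Start sketch.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
model.eval()

def extract_features(samples):
    """Encode (PIL image, label) pairs with the frozen image tower."""
    feats, labels = [], []
    with torch.no_grad():
        for image, label in samples:
            x = preprocess(image).unsqueeze(0)
            f = model.encode_image(x)
            f = f / f.norm(dim=-1, keepdim=True)
            feats.append(f.squeeze(0).cpu().numpy())
            labels.append(label)
    return np.stack(feats), np.array(labels)

# train_samples / test_samples are hypothetical iterables you supply.
X_train, y_train = extract_features(train_samples)
X_test, y_test = extract_features(test_samples)

probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)
print("Linear-probe accuracy:", probe.score(X_test, y_test))
```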
Training Details
- Training Data: Trained with the 2-billion-sample English subset of LAION-5B (https://laion.ai/blog/laion-5b/).
⚠️ Important Note
The dataset is uncurated, and collected links may lead to discomforting content. It's recommended for research purposes only. A “safe” subset can be extracted, but the presence of harmful content cannot be entirely excluded.
- Training Procedure: See the training notes and [wandb logs](https://wandb.ai/rom1504/eval_openclip/reports/B-32-2B--VmlldzoyNDkwNDMy).
Evaluation
- Evaluation was done with the code in the [LAION CLIP Benchmark suite](https://github.com/LAION-AI/CLIP_benchmark); a sketch of the zero-shot evaluation loop follows this list.
- Testing Data: VTAB+ for classification and COCO and Flickr for retrieval.
- Results: The model achieves 66.6% zero-shot top-1 accuracy on ImageNet-1k. Initial benchmarks on a wider range of datasets are viewable at https://github.com/LAION-AI/CLIP_benchmark/blob/main/benchmark/results.ipynb.
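The benchmark suite linked above is the reference implementation; for orientation only, the sketch below shows the general shape of a zero-shot top-1 evaluation (prompt-ensembled class embeddings, cosine similarity, argmax). `classnames`, `templates`, and `dataloader` are placeholders you would supply, and the pretrained tag is the same assumption as in the Quick Start sketch.

```python
import torch
import open_clip

# Assumption: 'laion2b_s34b_b79k' is the pretrained tag for this release.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

@torch.no_grad()
def build_zero_shot_classifier(classnames, templates):
    """One L2-normalized text embedding per class, averaged over prompt templates."""
    weights = []
    for name in classnames:
        tokens = tokenizer([t.format(name) for t in templates])
        emb = model.encode_text(tokens)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        mean = emb.mean(dim=0)
        weights.append(mean / mean.norm())
    return torch.stack(weights, dim=1)  # shape: (embed_dim, num_classes)

@torch.no_grad()
def zero_shot_top1(dataloader, classifier):
    """Top-1 accuracy of image features scored against the class embedding matrix."""
    correct, total = 0, 0
    for images, labels in dataloader:
        feats = model.encode_image(images)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        preds = (feats @ classifier).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```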
Acknowledgements
We acknowledge stability.ai for providing the compute resources for training this model.
Citation
In addition to the forthcoming LAION-5B (https://laion.ai/blog/laion-5b/) paper, please cite:
OpenAI CLIP paper
@inproceedings{Radford2021LearningTV,
title={Learning Transferable Visual Models From Natural Language Supervision},
author={Alec Radford and Jong Wook Kim and Chris Hallacy and A. Ramesh and Gabriel Goh and Sandhini Agarwal and Girish Sastry and Amanda Askell and Pamela Mishkin and Jack Clark and Gretchen Krueger and Ilya Sutskever},
booktitle={ICML},
year={2021}
}
OpenCLIP software
@software{ilharco_gabriel_2021_5143773,
author = {Ilharco, Gabriel and
Wortsman, Mitchell and
Wightman, Ross and
Gordon, Cade and
Carlini, Nicholas and
Taori, Rohan and
Dave, Achal and
Shankar, Vaishaal and
Namkoong, Hongseok and
Miller, John and
Hajishirzi, Hannaneh and
Farhadi, Ali and
Schmidt, Ludwig},
title = {OpenCLIP},
month = jul,
year = 2021,
note = {If you use this software, please cite it as below.},
publisher = {Zenodo},
version = {0.1},
doi = {10.5281/zenodo.5143773},
url = {https://doi.org/10.5281/zenodo.5143773}
}
📄 License
This model is released under the MIT license.