# Model card for Recap-CLIP-ViT-L-16-Txt-Huge-2.56B
A CLIPA model trained on Recap-DataComp-1B, designed for zero-shot image classification.
## Quick Start
The Recap-CLIP-ViT-L-16-Txt-Huge-2.56B model is a contrastive image-text model for zero-shot image classification. It is trained on the Recap-DataComp-1B dataset and can be loaded directly through OpenCLIP, as shown in the usage example below.
⨠Features
- Model Type: Contrastive Image - Text, Zero - Shot Image Classification.
- Original: https://github.com/UCSC-VLAA/Recap-DataComp-1B
- Dataset: https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B
- Papers:
- What If We Recaption Billions of Web Images with LLaMA - 3?: https://arxiv.org/abs/2406.08478
| Property | Details |
|---|---|
| Model Type | Contrastive Image-Text, Zero-Shot Image Classification |
| Training Data | https://huggingface.co/datasets/UCSC-VLAA/Recap-DataComp-1B |
## Installation
The usage example below relies on PyTorch, Pillow, and OpenCLIP (the `open_clip_torch` package).
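If these dependencies are not already available, they can typically be installed from PyPI (package names assumed to be the standard distributions for each library):

```
pip install torch pillow open_clip_torch
```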
## Usage Examples
### Basic Usage
```python
import torch
import torch.nn.functional as F
from urllib.request import urlopen
from PIL import Image
from open_clip import create_model_from_pretrained, get_tokenizer

# Load the pretrained model, its preprocessing transform, and the matching tokenizer from the Hub.
model, preprocess = create_model_from_pretrained('hf-hub:UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP')
tokenizer = get_tokenizer('hf-hub:UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP')

# Download an example image and turn it into a batch of one preprocessed tensor.
image = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image = preprocess(image).unsqueeze(0)

# Tokenize the candidate labels for zero-shot classification.
text = tokenizer(["a diagram", "a dog", "a cat", "a beignet"], context_length=model.context_length)

with torch.no_grad(), torch.cuda.amp.autocast():
    # Encode both modalities and L2-normalize the embeddings.
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Scaled cosine similarity followed by a softmax over the candidate labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
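The printed tensor contains one probability per candidate label. As a small follow-up sketch that continues directly from the snippet above (the label list simply repeats the example prompts), the best-matching label can be read off with `argmax`:

```python
# Continuing from the example above: report the most likely label.
labels = ["a diagram", "a dog", "a cat", "a beignet"]
top_idx = text_probs.argmax(dim=-1).item()  # index of the highest-probability label
print(f"Predicted label: {labels[top_idx]} (p={text_probs[0, top_idx].item():.3f})")
```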
## Documentation
### Bias, Risks, and Limitations
This model is trained on an image-text dataset with captions generated by LLaVA-1.5-LLaMA3-8B, which may still contain biases and inaccuracies inherent in the original web-crawled data. Users should be aware of these biases, risks, and limitations when using the model. Check the dataset card page for more details.
### Important Note
This model may have biases and inaccuracies due to the original web-crawled data. Check the dataset card for more details.
## License
This model is licensed under CC-BY-4.0.
## Citation
```bibtex
@article{li2024recaption,
  title   = {What If We Recaption Billions of Web Images with LLaMA-3?},
  author  = {Xianhang Li and Haoqin Tu and Mude Hui and Zeyu Wang and Bingchen Zhao and Junfei Xiao and Sucheng Ren and Jieru Mei and Qing Liu and Huangjie Zheng and Yuyin Zhou and Cihang Xie},
  journal = {arXiv preprint arXiv:2406.08478},
  year    = {2024}
}
```
## Model Contact
zwang615@ucsc.edu