Model Card for TiC-CLIP-bestpool-sequential
This repository offers TiC-CLIP models trained on TiC-DataComp-Yearly (xlarge, bestpool filtering) with data from 2014 to 2022, using modified OpenCLIP code. For more details, visit our GitHub repo.
Features
- Vision and Zero-Shot Image Classification: Ideal for vision-related tasks and zero-shot image classification.
- Time-Continual Benchmarks: Introduces web-scale Time-Continual (TiC) benchmarks for vision-language model training.
- Efficient Training Approach: Demonstrates a rehearsal-based method to reduce compute costs.
Installation
The models are compatible with the DataComp evaluation suite and with our patched version of it for evaluation on TiC-DataComp-Retrieval and TiC-DataCompNet. Follow the instructions in our GitHub repo to create the evaluation sets, or use DataComp directly for the standard evaluations on 38 datasets.
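If you want to check which yearly checkpoints the repository provides before downloading one, here is a minimal sketch using the Hugging Face Hub API; the exact file list is an assumption based on the checkpoints/$YEAR.pt pattern used in the usage examples below.

from huggingface_hub import list_repo_files

# List the repository files and keep the yearly checkpoints.
files = list_repo_files("apple/TiC-CLIP-bestpool-sequential")
checkpoints = sorted(f for f in files if f.startswith("checkpoints/") and f.endswith(".pt"))
print(checkpoints)  # expected to follow the checkpoints/<YEAR>.pt pattern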
Usage Examples
Basic Usage
YEAR=2016
REPO="apple/TiC-CLIP-bestpool-sequential"
huggingface-cli download $REPO checkpoints/$YEAR.pt
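# Train on the chosen year's TiC-DataComp-Yearly data with the DataComp training script.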
pushd datacomp
final_data_dir=$TIC_DATACOMP_Y_PATH/train/$YEAR/
torchrun --nproc_per_node 8 --nnodes 1 \
    train.py \
    --scale "tic_medium" \
    --dataset_resampled \
    --data_dir $final_data_dir \
    --output_dir "./results/" \
    --exp_name "datacomp_medium-basic_cumulative" \
    --imagenet_val $IMAGENET_VAL_PATH \
    --save_frequency 1 \
    --resume
popd
Advanced Usage
pushd datacomp
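# Generate the evaluation task list for the yearly retrieval and DataCompNet splits, then evaluate the downloaded checkpoint.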
python ../dataset_creation/tic-datacomp/generate_tasklist.py --yaml-path tasklist.yml --sample-eval --eval-tasks retrieval/yearly,datacompnet/yearly
python evaluate.py --data_dir data/ --train_output_dir ./results --use_model "ViT-B-16 $YEAR.pt" --skip_hf --skip_db --skip_notification
OpenCLIP Load and Inference Example
import torch
from PIL import Image
import open_clip
from huggingface_hub import hf_hub_download

filename = hf_hub_download(repo_id="apple/TiC-CLIP-bestpool-sequential", filename="checkpoints/2016.pt")
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16', filename)
tokenizer = open_clip.get_tokenizer('ViT-B-16')

image = preprocess(Image.open("image.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
Documentation
Model Details
Model Description
Keeping large foundation models up to date with the latest data is inherently expensive. To avoid the prohibitive cost of constantly retraining from scratch, it is essential to train these models continually, yet large-scale benchmarks and baselines for continual training have been lacking. We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. TiC-DataComp, our largest dataset, contains over 12.7B timestamped image-text pairs spanning 2014 - 2022. We first use these benchmarks for dynamic evaluations that measure the temporal robustness of existing models, and show that OpenAI's CLIP (trained on data up to 2020) loses approximately 8% zero-shot accuracy on our curated retrieval task from 2021 - 2022 compared to more recently trained models in the OpenCLIP repository. We then study efficient ways to train models on time-continuous data and demonstrate that a simple rehearsal-based approach, which continues training from the last checkpoint and replays old data, reduces compute by 2.5x compared to retraining from scratch. Code is available in the ml-tic-clip GitHub repo.
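As a rough, illustrative sketch of the rehearsal idea above (not the exact recipe from the paper; the replay_fraction hyperparameter below is a placeholder), continual training resumes from the last checkpoint and mixes the new time step's data with a replayed subset of older data:

import random

def build_rehearsal_mix(old_samples, new_samples, replay_fraction=0.5, seed=0):
    # Combine the new time step's samples with a random subset of previously
    # seen samples. The actual data mixing and training schedule used for
    # TiC-CLIP are described in Sections 2 - 3 of the paper.
    rng = random.Random(seed)
    n_replay = min(int(len(new_samples) * replay_fraction), len(old_samples))
    mix = list(new_samples) + rng.sample(list(old_samples), n_replay)
    rng.shuffle(mix)
    return mix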
- Developed by: Apple
- License: See LICENSE
Model Sources
- Repository: ml-tic-clip GitHub repo
- Paper: TiC-CLIP: Continual Training of CLIP Models, Garg, S., Farajtabar, M., Pouransari, H., Vemulapalli, R., Mehta, S., Tuzel, O., Shankar, V. and Faghri, F., International Conference on Learning Representations (ICLR), 2024.
Uses
Researchers can use the TiC-CLIP pretrained models to prototype continual learning methods more quickly by starting from a pretrained checkpoint and training only on the data from subsequent years or months.
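For example, a minimal sketch of warm-starting from one of the yearly checkpoints with OpenCLIP (the optimizer, data loading, and training loop are omitted and would follow your own continual-learning setup):

import open_clip
from huggingface_hub import hf_hub_download

# Download a yearly checkpoint and use it to initialize a ViT-B-16 model.
ckpt = hf_hub_download(
    repo_id="apple/TiC-CLIP-bestpool-sequential",
    filename="checkpoints/2016.pt",
)
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-16", pretrained=ckpt)
model.train()  # continue training on data from subsequent years or months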
Training Details
Training Data
Refer to TiC-DataComp.
Training Procedure
Refer to Sections 2 - 3 of our TiC-CLIP paper.
Citation
TiC-CLIP: Continual Training of CLIP Models. (ICLR 2024)
Garg, S., Farajtabar, M., Pouransari, H., Vemulapalli, R., Mehta, S., Tuzel, O., Shankar, V. and Faghri, F.
@inproceedings{garg2024tic,
  title={TiC-CLIP: Continual Training of CLIP Models},
  author={Garg, Saurabh and Farajtabar, Mehrdad and Pouransari, Hadi and Vemulapalli, Raviteja and Mehta, Sachin and Tuzel, Oncel and Shankar, Vaishaal and Faghri, Fartash},
  booktitle={The Twelfth International Conference on Learning Representations (ICLR)},
  year={2024},
  url={https://openreview.net/forum?id=TLADT8Wrhn}
}
License
This project is licensed under the custom-apple-license; see the LICENSE file for details.
| Property | Details |
| --- | --- |
| Model Type | TiC-CLIP models trained on TiC-DataComp-Yearly (xlarge, bestpool filtering) |
| Training Data | apple/TiC-DataComp |
| Library Name | tic-clip |
| Tags | vision, zero-shot-image-classification |