TiC-CLIP-basic-cumulative Open-source Vision-Language Model - Solve the Dilemma of Model-Data Synchronization at Low Cost

Tic CLIP Basic Cumulative

Developed by Apple

TiC-CLIP is a continually trained vision-language model focused on addressing the high cost of synchronizing base models with the latest data.

Text-to-Image · Open Source · License: Other · #Continual Learning Vision Models · #Zero-shot Image Classification · #Temporal Data Adaptation

Downloads 259

Release Time : 6/5/2024

Model Overview

TiC-CLIP maintains model performance on temporally continuous data through continual training strategies, avoiding the overhead of frequent retraining.

Model Features

Continual Training Strategy

Uses a replay-based continual training approach, reducing computation by 2.5x compared to traditional full retraining

Temporal Robustness

Specifically designed to handle temporally continuous data, maintaining performance on new data

Large-scale Benchmark

Trained on the TiC-DataComp dataset containing 12.7 billion timestamped image-text pairs from 2014-2022

Model Capabilities

Zero-shot image classification

Cross-modal retrieval

Continual learning

Use Cases

Computer Vision

Time-sensitive Image Classification

Classifying content that changes over time (e.g., pop culture, fashion trends)

OpenAI's CLIP, trained on data up to 2020, scores about 8% lower in zero-shot accuracy on the 2021-2022 retrieval task than more recently trained OpenCLIP models

Cross-modal Retrieval

Temporal Continuous Retrieval

Performing cross-modal retrieval across different time periods

🚀 Model Card for TiC-CLIP-basic-cumulative

This repository houses TiC-CLIP models trained on TiC-DataComp-Yearly (xlarge, basic filtering) with data from 2014 to 2022 using our modified OpenCLIP code. For more details, check out our GitHub repo.

📚 Documentation

Model Details

Model Description

Keeping large foundation models up to date with the latest data is inherently expensive. To avoid the cost of constantly retraining from scratch, these models need to be trained continually. The problem is made worse by the lack of large-scale continual learning benchmarks and baselines.

We introduce the first set of web-scale Time-Continual (TiC) benchmarks for training vision-language models: TiC-DataComp, TiC-YFCC, and TiC-Redcaps. Our largest dataset, TiC-DataComp, contains over 12.7B timestamped image-text pairs spanning 9 years (2014 - 2022).

First, we use our benchmarks to conduct various dynamic evaluations to measure the temporal robustness of existing models. We find that OpenAI's CLIP (trained on data up to 2020) loses approximately 8% zero-shot accuracy on our curated retrieval task from 2021 - 2022 compared to more recently trained models in the OpenCLIP repository.
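One way to summarise a dynamic evaluation is a matrix of scores indexed by training time step and evaluation time step. The sketch below illustrates that idea under stated assumptions and is not the benchmark's evaluation code: `summarize` is a hypothetical helper, the scores are invented, and the three averages (same step, earlier steps, later steps) are simply a common way to separate in-domain performance from backward and forward transfer.

```python
from statistics import mean
from typing import Dict, List


def summarize(scores: List[List[float]]) -> Dict[str, float]:
    """Summarise a square score matrix.

    scores[i][j] is the score of the model trained through time step i
    when evaluated on data from time step j.
    """
    n = len(scores)
    if n < 2 or any(len(row) != n for row in scores):
        raise ValueError("expected a square matrix with at least 2 time steps")
    return {
        # Evaluated on the same step the model was last trained on.
        "in_domain": mean(scores[i][i] for i in range(n)),
        # Evaluated on steps older than the training step.
        "backward": mean(scores[i][j] for i in range(n) for j in range(i)),
        # Evaluated on steps newer than the training step.
        "forward": mean(scores[i][j] for i in range(n) for j in range(i + 1, n)),
    }


# Invented numbers, for illustration only.
example = [
    [0.6, 0.5, 0.4],
    [0.6, 0.6, 0.5],
    [0.6, 0.6, 0.6],
]
print(summarize(example))
```

A model that is not temporally robust would show a forward average well below its in-domain average, which is the pattern the 2021 - 2022 retrieval result describes.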

Then, we explore how to efficiently train models on time-continuous data. We show that a simple rehearsal-based approach, which continues training from the last checkpoint and replays old data, reduces compute by 2.5× compared to the standard practice of retraining from scratch. The code is available in our GitHub repo.
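The rehearsal idea can be sketched in a few lines. This is a minimal illustration, not the training code from the repo: `cumulative_pool` and `continual_schedule` are hypothetical names, and the shard lists are placeholder strings. It assumes the cumulative variant, in which each step resumes from the previous checkpoint and trains on all data seen so far.

```python
from typing import Dict, Iterator, List, Optional, Tuple


def cumulative_pool(shards_by_year: Dict[int, List[str]], year: int) -> List[str]:
    """Return every shard timestamped up to and including `year`."""
    pool: List[str] = []
    for y in sorted(shards_by_year):
        if y <= year:
            pool.extend(shards_by_year[y])
    return pool


def continual_schedule(
    shards_by_year: Dict[int, List[str]],
) -> Iterator[Tuple[int, Optional[int], List[str]]]:
    """Yield (year, year of checkpoint to resume from, training pool) per step."""
    previous: Optional[int] = None  # the first step has no checkpoint to resume
    for year in sorted(shards_by_year):
        yield year, previous, cumulative_pool(shards_by_year, year)
        previous = year


# Placeholder shard names, for illustration only.
shards = {2016: ["a.tar"], 2017: ["b.tar"], 2018: ["c.tar", "d.tar"]}
for year, resume_from, pool in continual_schedule(shards):
    print(year, resume_from, len(pool))
```

Retraining from scratch would instead restart at every step with no checkpoint, so the savings come from reusing the previous step's weights.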

Developed by: Apple
License: See LICENSE

Model Sources

Uses

Researchers can utilize TiC-CLIP pretrained models to design continual learning methods more quickly. They can start from a pretrained checkpoint and continue training on the next year or month's data.
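As a rough sketch of that workflow, the helper below picks the checkpoint to start from and the data year to train on next. The `checkpoints/<year>.pt` naming follows the download command in the Training section. The assumption that a checkpoint exists for every year from 2016 to 2022 is ours and should be checked against the repo's file list. `checkpoint_path` and `next_step` are hypothetical helpers.

```python
from typing import Tuple

FIRST_YEAR = 2016  # earliest checkpoint: 2014-2016 data were combined into one step
LAST_YEAR = 2022   # last year covered by TiC-DataComp


def checkpoint_path(year: int) -> str:
    """Path of a yearly checkpoint inside the model repo (assumed layout)."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"no checkpoint expected for {year}")
    return f"checkpoints/{year}.pt"


def next_step(last_trained_year: int) -> Tuple[str, int]:
    """Return (checkpoint to resume from, year of data to train on next)."""
    return checkpoint_path(last_trained_year), last_trained_year + 1


print(next_step(2016))
```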

How to Get Started with the Model

The models are compatible with the DataComp evaluation suite and our patched version of DataComp for evaluation on TiC-DataComp-Retrieval and TiC-DataCompNet. They can also be used to resume training or initialize new training using OpenCLIP code.

Follow the instructions in our GitHub repo to create the evaluation sets or refer to DataComp for standard evaluations on 38 datasets.

The following code snippets assume that the TiC-DataComp data has been prepared according to the instructions in the GitHub repo.

Training

YEAR=2016 # There are no models before 2016 since data from 2014-2016 were combined into one year
REPO="apple/TiC-CLIP-basic-cumulative"
huggingface-cli download $REPO checkpoints/$YEAR.pt

## Train
pushd datacomp
final_data_dir=$TIC_DATACOMP_Y_PATH/train/$YEAR/
torchrun --nproc_per_node 8 --nnodes 1 \
    train.py \
    --scale "tic_medium" \
    --dataset_resampled \
    --data_dir $final_data_dir \
    --output_dir "./results/" \
    --exp_name "datacomp_medium-basic_cumulative" \
    --imagenet_val  $IMAGENET_VAL_PATH  \
    --save_frequency 1 \
    --resume
popd

Evaluation

# Evaluate a ViT-B/16 model on TiC/Retrieval/Yearly/$YEAR and
# TiC/DataCompNet/Yearly/$YEAR
pushd datacomp
python ../dataset_creation/tic-datacomp/generate_tasklist.py --yaml-path tasklist.yml --sample-eval --eval-tasks retrieval/yearly,datacompnet/yearly
python evaluate.py --data_dir data/ --train_output_dir ./results --use_model "ViT-B-16 $YEAR.pt" --skip_hf --skip_db --skip_notification

OpenCLIP Load and Inference Example

import open_clip
import torch
from huggingface_hub import hf_hub_download
from PIL import Image

# Download the 2016 checkpoint and load it into a ViT-B/16 OpenCLIP model.
filename = hf_hub_download(repo_id="apple/TiC-CLIP-basic-cumulative", filename="checkpoints/2016.pt")
model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-16', filename)
tokenizer = open_clip.get_tokenizer('ViT-B-16')

# Prepare one image and three candidate captions.
image = preprocess(Image.open("image.png").convert('RGB')).unsqueeze(0)
text = tokenizer(["a diagram", "a dog", "a cat"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities and turn them into per-caption probabilities.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
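The last step of the example, turning normalized features into label probabilities, can be reproduced without any model. The sketch below uses plain Python and made-up 2-D feature vectors purely for illustration. `normalize` and `zero_shot_probs` are hypothetical helper names, and the scale of 100 mirrors the constant in the snippet above.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(
    image_feat: Sequence[float],
    text_feats: Sequence[Sequence[float]],
    scale: float = 100.0,
) -> List[float]:
    """Softmax over scaled cosine similarities between one image and N texts."""
    img = normalize(image_feat)
    logits = [
        scale * sum(a * b for a, b in zip(img, normalize(t))) for t in text_feats
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Made-up features: one image vector and one vector per candidate label.
labels = ["a diagram", "a dog", "a cat"]
probs = zero_shot_probs([1.0, 0.2], [[0.1, 1.0], [1.0, 0.1], [0.7, 0.7]])
best = max(range(len(labels)), key=probs.__getitem__)
print(labels[best], [round(p, 3) for p in probs])
```

The predicted label is the one whose text feature has the highest cosine similarity with the image feature. The large scale factor makes the softmax sharply favor that label.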

Training Details

Training Data

Refer to TiC-DataComp.

Training Procedure

Refer to Sections 2 - 3 of our TiC-CLIP paper.

Citation

TiC-CLIP: Continual Training of CLIP Models. (ICLR 2024) Garg, S., Farajtabar, M., Pouransari, H., Vemulapalli, R., Mehta, S., Tuzel, O., Shankar, V. and Faghri, F.

@inproceedings{garg2024tic,
  title={TiC-CLIP: Continual Training of CLIP Models},
  author={Garg, Saurabh and Farajtabar, Mehrdad and Pouransari, Hadi and Vemulapalli, Raviteja and Mehta, Sachin and Tuzel, Oncel and Shankar, Vaishaal and Faghri, Fartash},
  booktitle={The Twelfth International Conference on Learning Representations (ICLR)},
  year={2024},
  url={https://openreview.net/forum?id=TLADT8Wrhn}
}

## 📄 License
- **License Type**: other
- **License Name**: custom-apple-license
- **License Link**: [https://github.com/apple/ml-tic-clip/blob/main/LICENSE](https://github.com/apple/ml-tic-clip/blob/main/LICENSE)

| Property | Details |
|----------|---------|
| Model Type | Vision, Zero-shot Image Classification |
| Training Data | apple/TiC-DataComp |
| Library Name | tic-clip |
