AltCLIP-m18
Following the bilingual model AltCLIP and the nine-language model AltCLIP-m9, we developed an 18-language CLIP model that offers broader multilingual support for text-image tasks.
Quick Start
The AltCLIP-m18 model can support the AltDiffusion-m18 model. Details on the AltDiffusion model can be found in this tutorial.
The model code is open-sourced on FlagAI, and the weights are available on modelhub. We also provide scripts for fine-tuning, inference, and validation, which you are welcome to try out.
Features
- Multilingual Support: AltCLIP-m18 supports 18 languages: English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, and Russian.
- Model Compatibility: It can support the AltDiffusion-m18 model, enabling more powerful text-image generation capabilities.
Usage Examples
Basic Usage
# CIFAR10 dataset evaluation code
# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import torch
from flagai.auto_model.auto_loader import AutoLoader
import zeroshot_classification
import json
import os
from torchvision.datasets import CIFAR10

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

maxlen = 256

dataset_root = "./clip_benchmark_datasets/"
dataset_name = "cifar10"

auto_loader = AutoLoader(
    task_name="txt_img_matching",
    model_dir="./checkpoints/",
    model_name="AltCLIP-XLMR-L-m18"  # Load the checkpoints from Modelhub (model.baai.ac.cn/models)
)

model = auto_loader.get_model()
model.to(device)
model.eval()
tokenizer = auto_loader.get_tokenizer()
transform = auto_loader.get_transform()

dataset = CIFAR10(root=os.path.join(dataset_root, dataset_name),
                  transform=transform,
                  download=True)
batch_size = 128
num_workers = 4

# Prompt templates; {c} is replaced with each class name.
template = {"cifar10": [
    "a photo of a {c}.",
    "a blurry photo of a {c}.",
    "a black and white photo of a {c}.",
    "a low contrast photo of a {c}.",
    "a high contrast photo of a {c}.",
    "a bad photo of a {c}.",
    "a good photo of a {c}.",
    "a photo of a small {c}.",
    "a photo of a big {c}.",
    "a photo of the {c}.",
    "a blurry photo of the {c}.",
    "a black and white photo of the {c}.",
    "a low contrast photo of the {c}.",
    "a high contrast photo of the {c}.",
    "a bad photo of the {c}.",
    "a good photo of the {c}.",
    "a photo of the small {c}.",
    "a photo of the big {c}."
],
}


def evaluate():
    if dataset:
        dataloader = torch.utils.data.DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=False,
            num_workers=num_workers,
        )

        zeroshot_templates = template["cifar10"]
        classnames = dataset.classes if hasattr(dataset, "classes") else None

        metrics = zeroshot_classification.evaluate(
            model,
            dataloader,
            tokenizer,
            classnames,
            zeroshot_templates,
            device=device,
            amp=True,
        )

        dump = {
            "dataset": dataset_name,
            "metrics": metrics
        }

        print(dump)
        with open("./result.txt", "w") as f:
            json.dump(dump, f)
        return metrics


if __name__ == "__main__":
    evaluate()
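
The evaluation script above passes a list of prompt templates to `zeroshot_classification.evaluate`. A common way CLIP-style evaluation uses such templates is to embed every filled-in prompt for a class, average the embeddings, and classify an image by its highest cosine similarity to the averaged class vectors. The sketch below illustrates that idea only; whether FlagAI's helper does exactly this is an assumption, and `toy_embed` is a hypothetical stand-in for the real text encoder.

```python
import hashlib
import math

# Three of the templates from the script above; {c} is the class name.
TEMPLATES = [
    "a photo of a {c}.",
    "a blurry photo of a {c}.",
    "a black and white photo of a {c}.",
]


def normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def toy_embed(text, dim=16):
    """Hypothetical text encoder: a deterministic pseudo-embedding from a hash."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return normalize([(b - 127.5) / 127.5 for b in digest[:dim]])


def class_embedding(classname, templates, embed=toy_embed):
    """Average the embeddings of all prompts for one class, then re-normalize."""
    vectors = [embed(t.format(c=classname)) for t in templates]
    mean = [sum(column) / len(vectors) for column in zip(*vectors)]
    return normalize(mean)


def classify(image_vec, classnames, templates, embed=toy_embed):
    """Return the class whose averaged prompt embedding is most similar to the image."""
    weights = [class_embedding(c, templates, embed) for c in classnames]
    scores = [sum(a * b for a, b in zip(image_vec, w)) for w in weights]
    best = max(range(len(classnames)), key=scores.__getitem__)
    return classnames[best], scores


# Usage: pretend the image embedding coincides with the "cat" class vector.
fake_image = class_embedding("cat", TEMPLATES)
label, scores = classify(fake_image, ["dog", "cat", "ship"], TEMPLATES)
print(label)
```

Averaging over several phrasings makes the class vector less sensitive to the wording of any single prompt, which is why template lists like the one above tend to be long.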
Advanced Usage
# Inference script
import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(
    task_name="txt_img_matching",
    model_name="AltCLIP-XLMR-L-m18",  # Load the checkpoints from Modelhub (model.baai.ac.cn/models)
    model_dir="./checkpoints"
)

model = loader.get_model()
tokenizer = loader.get_tokenizer()
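
For text-image matching, CLIP-style models typically compare one image embedding against several candidate text embeddings and turn the scaled cosine similarities into probabilities with a softmax. The sketch below shows only that scoring step on plain Python lists. It is not the FlagAI API; `match_probabilities` is a hypothetical helper, and the `logit_scale` of 100 is an assumed value.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_probabilities(image_vec, text_vecs, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and N texts."""
    logits = [logit_scale * cosine(image_vec, t) for t in text_vecs]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Usage with made-up 3-dimensional embeddings.
image = [0.9, 0.1, 0.0]
texts = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(match_probabilities(image, texts))
```

In a real pipeline the vectors would come from the model's image and text encoders, and the probabilities would rank the candidate captions for the image.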
Documentation
Model Information
Property | Details |
---|---|
Name | AltCLIP-m18 |
Task | Text-Image |
Language(s) | Multilingual (English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, Russian) |
Model | CLIP |
Github | FlagAI |
Training Datasets
No | Language | Stage1(LAION400M)(MIT) |
---|---|---|
1 | en | |
2 | th | CCAligned |
3 | ko | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)) |
4 | hi | CCAligned |
5 | uk | CCMatrix |
6 | ar | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), [OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles-v2018.php) |
7 | tr | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), CCMatrix |
8 | vi | CCMatrix |
9 | pl | CCMatrix, WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)) |
10 | nl | CCMatrix |
11 | pt | CCAligned |
12 | it | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), Wikipedia |
13 | ja | MultiParaCrawl ([Creative Commons CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/)) |
14 | zh | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), TSL2019 |
15 | es | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)) |
16 | de | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), EUbookshop |
17 | fr | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), EuroPat ([Creative Commons CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/)) |
18 | ru | WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), CCMatrix |
[1] The WuDaoMM dataset is for academic research only. Any use of the dataset must follow these requirements: WuDaoMM does not own the copyright of the images; use of the images is subject to the Flickr terms of use; users take full responsibility for their use of the dataset and must not distribute the images privately. If an image's copyright is violated, please contact us and it will be removed immediately.
Stage 1 uses parallel corpus data. Stages 2 and 3 mainly use a subset of LAION-Aesthetics. The WuDaoMM dataset (CC-BY-SA 4.0) is used as the Chinese dataset.
Evaluation Results
ImageNet
Model | ImageNet-adv | ImageNet-adv-cn | ImageNet-adv-es | ImageNet-adv-fr | ImageNet-adv-it | ImageNet-adv-jp | ImageNet-adv-ko | ImageNet-adv-ru | ImageNet-ren | ImageNet-ren-cn | ImageNet-ren-es | ImageNet-ren-fr | ImageNet-ren-it | ImageNet-ren-jp | ImageNet-ren-ko | ImageNet-ren-ru | ImageNet-ske | ImageNet-ske-cn | ImageNet-ske-es | ImageNet-ske-fr | ImageNet-ske-it | ImageNet-ske-jp | ImageNet-ske-ko | ImageNet-ske-ru | ImageNet-1k | ImageNet-1k-cn | ImageNet-1k-es | ImageNet-1k-fr | ImageNet-1k-it | ImageNet-1k-jp | ImageNet-1k-ko | ImageNet-1k-ru | ImageNet-v2 | ImageNet-v2-cn | ImageNet-v2-es | ImageNet-v2-fr | ImageNet-v2-it | ImageNet-v2-jp | ImageNet-v2-ko | ImageNet-v2-ru |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AltCLIP-M18 | 58 | 50.35 | 43.56 | 44.07 | 48.25 | 36.48 | 38.48 | 40.57 | 89.53 | 81.36 | 71.78 | 74.96 | 76.44 | 67.68 | 69.27 | 75.53 | 65.42 | 51.26 | 97.44 | 84.83 | 30.52 | 68.62 | 67.46 | 54.4 | 76.71 | 57.12 | 54.22 | 54.84 | 52.29 | 51.71 | 53.65 | 51.53 | 65.45 | 51.76 | 48.91 | 49.24 | 47.27 | 46.76 | 48.1 | 46.53 |
Other Classification
Model | caltech101 | cars | cifar10 | cifar100 | country211 | dtd | eurosat | fer2013 | Fgvc-aircraft | flowers | food101 | gtsrb | Hateful-memes | Kitti-distance | Mnist | pcam | pets | Renderedsst2 | Resisc45 | Voc2007 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AltCLIP-M18 | 88.25 | 92.75 | 97.44 | 84.83 | 30.52 | 68.62 | 67.46 | 54.4 | 40.41 | 71.64 | 92.49 | 56.35 | 50.8 | 14.91 | 78.46 | 54.76 | 94.11 | 65.95 | 70.83 | 81.62 |
Retrieval
Model | Multi30k-de-I2T | Multi30k-de-T2I | Multi30k-en-I2T | Multi30k-en-T2I | Multi30k-fr-I2T | Multi30k-fr-T2I | Xtd-de-I2T | Xtd-de-T2I | Xtd-en-I2T | Xtd-en-T2I | Xtd-es-I2T | Xtd-es-T2I | Xtd-fr-I2T | Xtd-fr-T2I | Xtd-it-I2T | Xtd-it-T2I | Xtd-jp-I2T | Xtd-jp-T2I | Xtd-ko-I2T | Xtd-ko-T2I | Xtd-pl-I2T | Xtd-pl-T2I | Xtd-ru-I2T | Xtd-ru-T2I | Xtd-tr-I2T | Xtd-tr-T2I | Xtd-zh-I2T | Xtd-zh-T2I |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
AltCLIP-M18 | 84.4 | 65.82 | 91.1 | 77.76 | 74.5 | 75.4 | 64.76 | 66.57 | 72.17 | 72.67 | 65.83 | 65.03 | 67.17 | 67.47 | 66.63 | 66.03 | 58.96 | 62.96 | 61.42 | 64.43 | 67.23 | 69.14 | 60.22 | 61.02 | 65.03 | 64.23 | 64.53 | 65.43 |
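
The retrieval columns come in two directions: I2T (image-to-text) and T2I (text-to-image). The table does not say which metric is reported, so the sketch below assumes recall@k, a common choice for retrieval benchmarks, and assumes one matching caption per image (real benchmarks may pair several captions with each image). `recall_at_k` and the toy similarity matrix are illustrative, not taken from the FlagAI scripts.

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose true match (same index) ranks in the top k.

    similarity[i][j] is the score of query i against candidate j.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


def transpose(matrix):
    """Swap queries and candidates (rows and columns)."""
    return [list(column) for column in zip(*matrix)]


# Toy image-by-text similarity matrix: rows are images, columns are captions.
image_text_sim = [
    [0.9, 0.1, 0.0],
    [0.2, 0.8, 0.3],
    [0.7, 0.1, 0.6],
]

i2t = recall_at_k(image_text_sim, k=1)             # images query captions
t2i = recall_at_k(transpose(image_text_sim), k=1)  # captions query images
print(i2t, t2i)
```

The two directions can differ, as in the table above, because ranking captions for a fixed image and ranking images for a fixed caption read the same similarity matrix along different axes.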
Citation
If you find this work helpful, please consider citing:
@article{https://doi.org/10.48550/arxiv.2211.06679,
  doi = {10.48550/ARXIV.2211.06679},
  url = {https://arxiv.org/abs/2211.06679},
  author = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}







