AltCLIP-m18 Open-Source Image-Text Matching Model - Free Deployment, Supports Image-Text Matching in 18 Languages

Developed by BAAI

AltCLIP-m18 is a CLIP model supporting 18 languages for image-text matching tasks.

Release Time: 3/27/2023

Model Overview

AltCLIP-m18 is a multilingual CLIP model supporting 18 languages, including English, Chinese, and Japanese. It is primarily used for image-text matching and also provides support for the AltDiffusion-m18 model.

Model Features

Multilingual support

Supports image-text matching tasks in 18 languages.

Support for AltDiffusion

Can serve as the foundational model for AltDiffusion-m18.

Multi-stage training

Adopts a three-stage training strategy including parallel corpus training and Laion-Aesthetics subset training.

Model Capabilities

Image-text matching

Multilingual text understanding

Image classification

Use Cases

Multilingual applications

Multilingual image search

Use different language queries to match relevant images.

Generative model support

AltDiffusion support

Provides multilingual text encoding capabilities for AltDiffusion-m18.

🚀 AltCLIP-m18

Following the bilingual model AltCLIP and the nine-language model AltCLIP-m9, we developed an 18-language CLIP model, offering enhanced multilingual support for text-image tasks.

🚀 Quick Start

The AltCLIP-m18 model can provide support for the AltDiffusion-m18 model. Specific information on the AltDiffusion model can be found in this tutorial.

The model code has been open-sourced on FlagAI, and the weights are located on Modelhub. We also provide scripts for fine-tuning, inference, and validation. You're welcome to try them out.

✨ Features

Multilingual Support: AltCLIP-m18 supports 18 languages: English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, and Russian.
Model Compatibility: It can support the AltDiffusion-m18 model, enabling more powerful text-image generation capabilities.

💻 Usage Examples

Basic Usage

# Cifar10 dataset evaluation code
# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import torch
from flagai.auto_model.auto_loader import AutoLoader
import zeroshot_classification
import json 
import os 
from torchvision.datasets import CIFAR10

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
maxlen = 256

dataset_root = "./clip_benchmark_datasets/"
dataset_name = "cifar10"

auto_loader = AutoLoader(
    task_name="txt_img_matching",
    model_dir="./checkpoints/",
    model_name="AltCLIP-XLMR-L-m18"   # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
)

model = auto_loader.get_model()
model.to(device)
model.eval()
tokenizer = auto_loader.get_tokenizer()
transform = auto_loader.get_transform()

dataset = CIFAR10(root=os.path.join(dataset_root, dataset_name), 
                transform=transform,   
                download=True)
batch_size = 128
num_workers = 4

template = {"cifar10": [
        "a photo of a {c}.",
        "a blurry photo of a {c}.",
        "a black and white photo of a {c}.",
        "a low contrast photo of a {c}.",
        "a high contrast photo of a {c}.",
        "a bad photo of a {c}.",
        "a good photo of a {c}.",
        "a photo of a small {c}.",
        "a photo of a big {c}.",
        "a photo of the {c}.",
        "a blurry photo of the {c}.",
        "a black and white photo of the {c}.",
        "a low contrast photo of the {c}.",
        "a high contrast photo of the {c}.",
        "a bad photo of the {c}.",
        "a good photo of the {c}.",
        "a photo of the small {c}.",
        "a photo of the big {c}."
    ],
}
def evaluate():
    if dataset:
        dataloader = torch.utils.data.DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=False,
            num_workers=num_workers,
        )

        zeroshot_templates = template["cifar10"]
        classnames = dataset.classes if hasattr(dataset, "classes") else None

        metrics = zeroshot_classification.evaluate(
            model,
            dataloader,
            tokenizer,
            classnames, 
            zeroshot_templates,
            device=device,
            amp=True,
        )
       
        dump = {
            "dataset": dataset_name,
            "metrics": metrics
        }

        print(dump)
        with open("./result.txt", "w") as f:
            json.dump(dump, f)
        return metrics

if __name__ == "__main__":
    evaluate()
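The evaluation script hands the prompt templates to `zeroshot_classification.evaluate`, whose internals are not shown here. A common way such templates are used in CLIP-style zero-shot classification is prompt ensembling: every template is filled with the class name, the resulting text embeddings are averaged into one class embedding, and an image is assigned to the class with the highest cosine similarity. The sketch below illustrates that idea only. It assumes this is roughly what the helper does, and `toy_encode_text` is a hypothetical stand-in for the real text encoder.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def toy_encode_text(prompt):
    # Hypothetical stand-in for the real text encoder, for illustration only:
    # axis 0 fires on "cat", axis 1 on "dog", axis 2 adds template-dependent noise.
    return [float("cat" in prompt), float("dog" in prompt), 0.01 * len(prompt)]

def build_class_embeddings(classnames, templates, encode_text):
    """Return one unit-length embedding per class, averaged over all templates."""
    class_embeddings = []
    for name in classnames:
        prompts = [t.format(c=name) for t in templates]
        embeddings = [l2_normalize(encode_text(p)) for p in prompts]
        mean = [sum(column) / len(embeddings) for column in zip(*embeddings)]
        class_embeddings.append(l2_normalize(mean))
    return class_embeddings

def classify(image_embedding, class_embeddings):
    """Return the index of the class whose embedding is most similar to the image."""
    image_embedding = l2_normalize(image_embedding)
    scores = [sum(a * b for a, b in zip(image_embedding, ce)) for ce in class_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)

classnames = ["cat", "dog"]
templates = ["a photo of a {c}.", "a blurry photo of a {c}."]
class_embeddings = build_class_embeddings(classnames, templates, toy_encode_text)
predicted = classify([0.1, 0.9, 0.0], class_embeddings)  # made-up image embedding
print(classnames[predicted])
```

Averaging over many phrasings reduces the influence of any single prompt wording, which is presumably why the script lists 18 variants per class.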

Advanced Usage

# Inference script
import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(
    task_name="txt_img_matching",
    model_name="AltCLIP-XLMR-L-m18",   # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
    model_dir="./checkpoints"
)

model = loader.get_model()
model.to(device)
model.eval()
tokenizer = loader.get_tokenizer()
transform = loader.get_transform()
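Once the model has produced an image embedding and one text embedding per candidate caption, image-text matching usually reduces to comparing them. The following is a minimal sketch of that scoring step in plain Python, not the FlagAI implementation: the embeddings are made-up two-dimensional vectors, the captions are hypothetical, and the `logit_scale` of 100 is an assumption borrowed from common CLIP usage.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def match_probabilities(image_embedding, text_embeddings, logit_scale=100.0):
    """Turn image-to-caption similarities into a probability per caption."""
    logits = [logit_scale * cosine_similarity(image_embedding, t)
              for t in text_embeddings]
    return softmax(logits)

# Hypothetical captions and embeddings, for illustration only.
captions = ["a rat", "a dog", "a cat"]
text_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
image_embedding = [0.1, 0.9]

probs = match_probabilities(image_embedding, text_embeddings)
best = captions[probs.index(max(probs))]
print(best, probs)
```

Because the text encoder is multilingual, the same scoring applies unchanged when the candidate captions are written in any of the 18 supported languages.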

📚 Documentation

Model Information

Property	Details
Name	AltCLIP-m18
Task	Text-Image
Language(s)	Multilingual (English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, Russian)
Model	CLIP
Github	FlagAI

Training Datasets

No	Language	Stage1(LAION400M)(MIT)
1	en
2	th	CCAligned
3	ko	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode))
4	hi	CCAligned
5	uk	CCMatrix
6	ar	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), [OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles-v2018.php)
7	tr	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), CCMatrix
8	vi	CCMatrix
9	pl	CCMatrix, WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode))
10	nl	CCMatrix
11	pt	CCAligned
12	it	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), Wikipedia
13	ja	MultiParaCrawl ([Creative Commons CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/))
14	zh	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), TSL2019
15	es	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode))
16	de	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), EUbookshop
17	fr	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), EuroPat ([Creative Commons CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/))
18	ru	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), CCMatrix

[1] The WuDaoMM dataset is for academic research only. Any use of the dataset must follow these requirements: WuDaoMM does not own the copyright of the images; use of the images is subject to the Flickr terms of use; users take full responsibility for their use of the dataset and must not distribute the images privately. If an image's copyright is violated, please contact us and it will be removed immediately.

Stage 1 uses parallel corpus data. Stages 2 and 3 mainly use a subset of Laion-Aesthetics. The WuDaoMM dataset (CC-BY-SA 4.0) is used as the Chinese dataset.

Evaluation Results

ImageNet

	ImageNet-adv	ImageNet-adv-cn	ImageNet-adv-es	ImageNet-adv-fr	ImageNet-adv-it	ImageNet-adv-jp	ImageNet-adv-ko	ImageNet-adv-ru
AltCLIP-M18	58	50.35	43.56	44.07	48.25	36.48	38.48	40.57

	ImageNet-ren	ImageNet-ren-cn	ImageNet-ren-es	ImageNet-ren-fr	ImageNet-ren-it	ImageNet-ren-jp	ImageNet-ren-ko	ImageNet-ren-ru
AltCLIP-M18	89.53	81.36	71.78	74.96	76.44	67.68	69.27	75.53

	ImageNet-ske	ImageNet-ske-cn	ImageNet-ske-es	ImageNet-ske-fr	ImageNet-ske-it	ImageNet-ske-jp	ImageNet-ske-ko	ImageNet-ske-ru
AltCLIP-M18	65.42	51.26	97.44	84.83	30.52	68.62	67.46	54.4

	ImageNet-1k	ImageNet-1k-cn	ImageNet-1k-es	ImageNet-1k-fr	ImageNet-1k-it	ImageNet-1k-jp	ImageNet-1k-ko	ImageNet-1k-ru
AltCLIP-M18	76.71	57.12	54.22	54.84	52.29	51.71	53.65	51.53

	ImageNet-v2	ImageNet-v2-cn	ImageNet-v2-es	ImageNet-v2-fr	ImageNet-v2-it	ImageNet-v2-jp	ImageNet-v2-ko	ImageNet-v2-ru
AltCLIP-M18	65.45	51.76	48.91	49.24	47.27	46.76	48.1	46.53

Other Classification

	caltech101	cars	cifar10	cifar100	country211	dtd	eurosat	fer2013	Fgvc-aircraft	flowers	food101	gtsrb	Hateful-memes	Kitti-distance	Mnist	pcam	pets	Renderedsst2	Resisc45	Voc2007
AltCLIP-M18	88.25	92.75	97.44	84.83	30.52	68.62	67.46	54.4	40.41	71.64	92.49	56.35	50.8	14.91	78.46	54.76	94.11	65.95	70.83	81.62

Retrieval

	Multi30k-de-I2T	Multi30k-de-T2I	Multi30k-en-I2T	Multi30k-en-T2I	Multi30k-fr-I2T	Multi30k-fr-T2I
AltCLIP-M18	84.4	65.82	91.1	77.76	74.5	75.4

	Xtd-de-I2T	Xtd-de-T2I	Xtd-en-I2T	Xtd-en-T2I	Xtd-es-I2T	Xtd-es-T2I	Xtd-fr-I2T	Xtd-fr-T2I	Xtd-it-I2T	Xtd-it-T2I	Xtd-jp-I2T	Xtd-jp-T2I	Xtd-ko-I2T	Xtd-ko-T2I	Xtd-pl-I2T	Xtd-pl-T2I	Xtd-ru-I2T	Xtd-ru-T2I	Xtd-tr-I2T	Xtd-tr-T2I	Xtd-zh-I2T	Xtd-zh-T2I
AltCLIP-M18	64.76	66.57	72.17	72.67	65.83	65.03	67.17	67.47	66.63	66.03	58.96	62.96	61.42	64.43	67.23	69.14	60.22	61.02	65.03	64.23	64.53	65.43
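The retrieval table does not say which metric is reported. Recall@K is a common choice for image-to-text (I2T) and text-to-image (T2I) benchmarks, so the sketch below shows how it could be computed from a similarity matrix. It is an illustration under stated assumptions, not the evaluation code behind the numbers above: it assumes each query has exactly one correct candidate, located at the same index, and the similarity values are made up.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

def transpose(matrix):
    """Swap queries and candidates, e.g. to go from I2T to T2I."""
    return [list(column) for column in zip(*matrix)]

# Made-up image-to-text similarities: rows are images, columns are captions.
i2t = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]

i2t_r1 = recall_at_k(i2t, k=1)
i2t_r2 = recall_at_k(i2t, k=2)
t2i_r1 = recall_at_k(transpose(i2t), k=1)
print(i2t_r1, i2t_r2, t2i_r1)
```

The same similarity matrix serves both directions: reading it row-wise ranks captions for each image, and transposing it ranks images for each caption.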

📚 Citation

If you find this work helpful, please consider citing:

@article{https://doi.org/10.48550/arxiv.2211.06679,
  doi = {10.48550/ARXIV.2211.06679},
  url = {https://arxiv.org/abs/2211.06679},
  author = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
