AltCLIP-m18: an open-source image-text matching model. Free to deploy, with support for image-text matching in 18 languages.

Developed by BAAI

AltCLIP-m18 is a CLIP model that supports 18 languages and is used for image-text matching tasks.

Release date: 3/27/2023

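Model overview

AltCLIP-m18 is a multilingual CLIP model covering 18 languages, including English, Chinese, and Japanese. It is mainly used for image-text matching, and it also provides the text encoding that the AltDiffusion-m18 model builds on.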

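Model features

Multilingual support

Supports image-text matching tasks in 18 languages.

AltDiffusion support

Can serve as the base model for AltDiffusion-m18.

Multi-stage training

Trained with a three-stage strategy: parallel-corpus training first, followed by training on a subset of Laion-Aesthetics.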

Model capabilities

Image-text matching

Multilingual text understanding

Image classification

Use cases

Multilingual applications

Multilingual image search

Run queries in different languages and match them against relevant images.
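
As a rough illustration of the search use case, the sketch below ranks a small gallery of image embeddings against one query embedding by cosine similarity. The vectors are made-up placeholders; in practice they would come from the model's image and text encoders. `cosine` and `rank_images` are hypothetical helper names, not part of FlagAI.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_embedding, image_embeddings, top_k=3):
    """Return (image_id, score) pairs sorted by descending similarity."""
    scored = [(image_id, cosine(query_embedding, emb))
              for image_id, emb in image_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Placeholder embeddings (assumption: real ones come from the image encoder).
gallery = {
    "dog.jpeg": [0.9, 0.1, 0.0],
    "cat.jpeg": [0.1, 0.9, 0.0],
    "car.jpeg": [0.0, 0.1, 0.9],
}
# Placeholder query embedding (assumption: real one comes from the text encoder,
# whatever language the query was written in).
query = [0.8, 0.2, 0.1]

results = rank_images(query, gallery, top_k=2)
print(results)
```

Because queries in every supported language are mapped into the same embedding space, the ranking step itself does not need to know which language the query was written in.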

Generative model support

AltDiffusion support

Provides multilingual text encoding for AltDiffusion-m18.

🚀 AltCLIP-m18

AltCLIP-m18 is a CLIP model for multilingual text-image tasks. It supports 18 languages and also supports the AltDiffusion-m18 model. The code is open source, and the model weights are available.

Property	Details
Name	AltCLIP-m18
Task	Text-Image
Languages	English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, Russian
Model type	CLIP
GitHub	FlagAI

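🚀 Quick start

AltCLIP-m18 is a multilingual text-image model. It scores how well an image and a piece of text match, and it also supports the AltDiffusion-m18 model.

✨ Key features

Supports 18 languages for text-image tasks in multilingual settings.
Supports the AltDiffusion-m18 model, so it can be used in applications such as image generation.
The code is open source, which makes fine-tuning and inference straightforward.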

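📚 Documentation

Overview

Following the bilingual model AltCLIP and the nine-language model AltCLIP-m9, we trained an 18-language CLIP model and named it AltCLIP-m18. It supports English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, and Russian.

AltCLIP-m18 supports the AltDiffusion-m18 model. For details on the AltDiffusion models, see the AltDiffusion tutorial.

The model code is open-sourced in FlagAI, and the weights are hosted on modelhub. Scripts for fine-tuning, inference, and validation are also provided, so feel free to try them.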

Training datasets

No	Language	Stage 1 (LAION400M) (MIT)
1	English
2	Thai	CCAligned
3	Korean	WikiMatrix (CC-BY-SA 4.0)
4	Hindi	CCAligned
5	Ukrainian	CCMatrix
6	Arabic	WikiMatrix (CC-BY-SA 4.0), OpenSubtitles
7	Turkish	WikiMatrix (CC-BY-SA 4.0), CCMatrix
8	Vietnamese	CCMatrix
9	Polish	CCMatrix, WikiMatrix (CC-BY-SA 4.0)
10	Dutch	CCMatrix
11	Portuguese	CCAligned
12	Italian	WikiMatrix (CC-BY-SA 4.0), Wikipedia
13	Japanese	MultiParaCrawl (Creative Commons CC0 license)
14	Chinese	WikiMatrix (CC-BY-SA 4.0), TSL2019
15	Spanish	WikiMatrix (CC-BY-SA 4.0)
16	German	WikiMatrix (CC-BY-SA 4.0), EUbookshop
17	French	WikiMatrix (CC-BY-SA 4.0), EuroPat (Creative Commons CC0 license)
18	Russian	WikiMatrix (CC-BY-SA 4.0), CCMatrix

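Stage 1 uses parallel corpus data. Stages 2 and 3 mainly use a subset of Laion-Aesthetics. The Chinese data comes from the WuDaoMM dataset (CC-BY-SA 4.0) [1].

[1] The WuDaoMM dataset is for academic research only, and its use is subject to the following requirements. WuDaoMM does not own the copyright of these images. Use of the images must comply with the Flickr terms of use. Users of the images accept full responsibility for their use of the dataset and must not privately distribute the images. If the copyright of any image has been infringed, please let us know and we will remove it immediately.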

Citation

We have published a detailed report on AltCLIP. If it is useful for your research, please consider citing it.

@article{https://doi.org/10.48550/arxiv.2211.06679,
  doi = {10.48550/ARXIV.2211.06679},
  url = {https://arxiv.org/abs/2211.06679},
  author = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}

AltCLIP-m18 evaluation

ImageNet

AltCLIP-M18 scores. Each row is a language variant of the benchmark; the first row is the unsuffixed benchmark.

Variant	ImageNet-adv	ImageNet-ren	ImageNet-ske	ImageNet-1k	ImageNet-v2
(no suffix)	58	89.53	65.42	76.71	65.45
cn	50.35	81.36	51.26	57.12	51.76
es	43.56	71.78	97.44	54.22	48.91
fr	44.07	74.96	84.83	54.84	49.24
it	48.25	76.44	30.52	52.29	47.27
jp	36.48	67.68	68.62	51.71	46.76
ko	38.48	69.27	67.46	53.65	48.1
ru	40.57	75.53	54.4	51.53	46.53

Other classification

	caltech101	cars	cifar10	cifar100	country211	dtd	eurosat	fer2013	Fgvc-aircraft	flowers	food101	gtsrb	Hateful-memes	Kitti-distance	Mnist	pcam	pets	Renderedsst2	Resisc45	Voc2007
AltCLIP-M18	88.25	92.75	97.44	84.83	30.52	68.62	67.46	54.4	40.41	71.64	92.49	56.35	50.8	14.91	78.46	54.76	94.11	65.95	70.83	81.62

Retrieval

AltCLIP-M18 scores for image-to-text (I2T) and text-to-image (T2I) retrieval.

Benchmark	I2T	T2I
Multi30k-de	84.4	65.82
Multi30k-en	91.1	77.76
Multi30k-fr	74.5	75.4
Xtd-de	64.76	66.57
Xtd-en	72.17	72.67
Xtd-es	65.83	65.03
Xtd-fr	67.17	67.47
Xtd-it	66.63	66.03
Xtd-jp	58.96	62.96
Xtd-ko	61.42	64.43
Xtd-pl	67.23	69.14
Xtd-ru	60.22	61.02
Xtd-tr	65.03	64.23
Xtd-zh	64.53	65.43
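
The table does not say which retrieval metric is reported. A common choice for I2T/T2I benchmarks is Recall@K, so the sketch below shows how that could be computed from a similarity matrix. It assumes one matching caption per image, with query `i` matching candidate `i`; `recall_at_k` is a hypothetical helper, and none of this is taken from the evaluation script behind the numbers above.

```python
import numpy as np

def recall_at_k(similarity, k):
    """Fraction of queries whose ground-truth match (same index) is in the top k.

    similarity[i, j] is the score between query i and candidate j.
    """
    # Indices of the k highest-scoring candidates for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    targets = np.arange(similarity.shape[0])[:, None]
    hits = (top_k == targets).any(axis=1)
    return float(hits.mean())

# Toy similarity matrix: rows are images, columns are captions.
sim = np.array([
    [0.9, 0.1, 0.2],
    [0.3, 0.2, 0.8],  # image 1 scores caption 2 highest, so it misses at k=1
    [0.1, 0.4, 0.7],
])

i2t_r1 = recall_at_k(sim, k=1)    # image-to-text: query with images
t2i_r1 = recall_at_k(sim.T, k=1)  # text-to-image: query with captions
i2t_r2 = recall_at_k(sim, k=2)
print(i2t_r1, t2i_r1, i2t_r2)
```

Transposing the matrix is all it takes to switch between the two directions, which is why I2T and T2I scores are usually reported side by side.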

💻 Usage examples

Basic usage

CIFAR10 evaluation code

# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import json
import os

import torch
from flagai.auto_model.auto_loader import AutoLoader
from torchvision.datasets import CIFAR10

# Evaluation helper module; it must be importable (e.g. placed next to this script).
import zeroshot_classification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset_root = "./clip_benchmark_datasets/"
dataset_name = "cifar10"

auto_loader = AutoLoader(
    task_name="txt_img_matching",
    model_dir="./checkpoints/",
    model_name="AltCLIP-XLMR-L-m18"   # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
)

model = auto_loader.get_model()
model.to(device)
model.eval()
tokenizer = auto_loader.get_tokenizer()
transform = auto_loader.get_transform()

# CIFAR10 defaults to train=True; pass train=False to evaluate on the test split.
dataset = CIFAR10(root=os.path.join(dataset_root, dataset_name),
                  transform=transform,
                  download=True)
batch_size = 128
num_workers = 4

template = {"cifar10": [
        "a photo of a {c}.",
        "a blurry photo of a {c}.",
        "a black and white photo of a {c}.",
        "a low contrast photo of a {c}.",
        "a high contrast photo of a {c}.",
        "a bad photo of a {c}.",
        "a good photo of a {c}.",
        "a photo of a small {c}.",
        "a photo of a big {c}.",
        "a photo of the {c}.",
        "a blurry photo of the {c}.",
        "a black and white photo of the {c}.",
        "a low contrast photo of the {c}.",
        "a high contrast photo of the {c}.",
        "a bad photo of the {c}.",
        "a good photo of the {c}.",
        "a photo of the small {c}.",
        "a photo of the big {c}."
    ],
}
def evaluate():
    if dataset:
        dataloader = torch.utils.data.DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=False,
            num_workers=num_workers,
        )

        zeroshot_templates = template["cifar10"]
        classnames = dataset.classes if hasattr(dataset, "classes") else None

        metrics = zeroshot_classification.evaluate(
            model,
            dataloader,
            tokenizer,
            classnames, 
            zeroshot_templates,
            device=device,
            amp=True,
        )
       
        dump = {
            "dataset": dataset_name,
            "metrics": metrics
        }

        print(dump)
        with open("./result.txt", "w") as f:
            json.dump(dump, f)
        return metrics

if __name__ == "__main__":
    evaluate()
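
The `zeroshot_classification` helper is not shown on this page. The sketch below illustrates the usual CLIP-style prompt ensembling that the `template` list suggests: each class name is substituted into every template, the prompt embeddings are L2-normalized and averaged into one class vector, and each image is assigned to the class with the highest cosine similarity. `embed_text` is a hypothetical stand-in for the text encoder (a simple letter-count vector), and the real helper may work differently.

```python
import numpy as np

def embed_text(text):
    """Hypothetical stand-in for a text encoder: a 26-dim letter-count vector."""
    vec = np.zeros(26)
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def build_zeroshot_classifier(classnames, templates):
    """Average the normalized embeddings of all prompts for each class."""
    class_vectors = []
    for name in classnames:
        prompts = [t.format(c=name) for t in templates]
        embeddings = l2_normalize(np.stack([embed_text(p) for p in prompts]))
        class_vectors.append(l2_normalize(embeddings.mean(axis=0)))
    return np.stack(class_vectors)  # shape: (num_classes, dim)

def classify(image_embeddings, classifier):
    """Return the index of the most similar class for each image embedding."""
    logits = l2_normalize(image_embeddings) @ classifier.T
    return logits.argmax(axis=1)

templates = ["a photo of a {c}.", "a blurry photo of a {c}."]
classnames = ["cat", "dog", "ship"]
classifier = build_zeroshot_classifier(classnames, templates)

# Toy "image" embeddings: reuse the text stand-in so the example is self-contained.
images = np.stack([embed_text("a photo of a dog."),
                   embed_text("a photo of a ship.")])
predictions = classify(images, classifier)
print([classnames[i] for i in predictions])
```

Averaging over several templates makes the class vector less sensitive to the wording of any single prompt, which is the reason the evaluation script lists 18 variations for CIFAR10.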

Inference script

import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(
    task_name="txt_img_matching",
    model_name="AltCLIP-XLMR-L-m18",   # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
    model_dir="./checkpoints"
)

model = loader.get_model()
tokenizer = loader.get_tokenizer()
transform = loader.get_transform()

model.eval()
model.to(device)

def inference():
    image = Image.open("./dog.jpeg")
    image = transform(image)
    image = torch.tensor(image["pixel_values"]).to(device)
    tokenizer_out = tokenizer(["a rat", "a dog", "a cat"], 
                                padding=True,
                                truncation=True,
                                max_length=77,
                                return_tensors='pt')

    text = tokenizer_out["input_ids"].to(device)
    attention_mask = tokenizer_out["attention_mask"].to(device)
    with torch.no_grad():
        image_features = model.get_image_features(image)
        text_features = model.get_text_features(text, attention_mask=attention_mask)
        text_probs = (image_features @ text_features.T).softmax(dim=-1)

    print(text_probs.cpu().numpy()[0].tolist())

if __name__=="__main__":
    inference()
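
The script above applies softmax directly to the dot products of the two feature sets. CLIP-style models usually compare L2-normalized embeddings and multiply by a logit scale before the softmax. Whether `get_image_features` and `get_text_features` already return normalized vectors is not stated here, so treat the following as a sketch: `match_probabilities` is a hypothetical helper, and the scale of 100.0 and the toy vectors are assumptions chosen for illustration.

```python
import numpy as np

def match_probabilities(image_features, text_features, logit_scale=100.0):
    """Softmax over scaled cosine similarities; one row per image."""
    image_features = image_features / np.linalg.norm(
        image_features, axis=-1, keepdims=True)
    text_features = text_features / np.linalg.norm(
        text_features, axis=-1, keepdims=True)
    logits = logit_scale * (image_features @ text_features.T)
    # Subtract the row maximum for numerical stability before exponentiating.
    logits = logits - logits.max(axis=-1, keepdims=True)
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy features standing in for one image and three candidate captions.
image = np.array([[0.2, 0.9, 0.1]])
texts = np.array([
    [0.9, 0.1, 0.1],  # "a rat"
    [0.1, 0.8, 0.2],  # "a dog"
    [0.3, 0.3, 0.9],  # "a cat"
])

probs = match_probabilities(image, texts)
print(probs[0].tolist())
```

Normalizing first keeps the scores independent of embedding magnitude, and the scale controls how sharply the softmax favors the best-matching caption.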

Fine-tuning

CIFAR10 dataset

# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import torch
from flagai.auto_model.auto_loader import AutoLoader
import os 
from flagai.trainer import Trainer
from torchvision.datasets import (
    CIFAR10
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset_root = "./clip_benchmark_datasets"
dataset_name = "cifar10"

batch_size = 4
classes = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

auto_loader = AutoLoader(
    task_name="txt_img_matching",
    model_dir="./checkpoints",
    model_name="AltCLIP-XLMR-L-m18"   # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
)

model = auto_loader.get_model()
model.to(device)
model.eval()
tokenizer = auto_loader.get_tokenizer()
transform = auto_loader.get_transform()

trainer = Trainer(env_type="pytorch",
                pytorch_device=device,
                experiment_name="clip_finetuning",