AltCLIP-m18: Open-Source Image-Text Matching Model Supporting 18 Languages

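AltCLIP-m18

Developed by BAAI

AltCLIP-m18 is a CLIP model that supports 18 languages and is used for image-text matching.

Text-to-Image

Transformers

#MultilingualImageTextMatching #18LanguageSupport #CrossModalRetrieval

Downloads: 58

Released: 3/27/2023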

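Model Overview

AltCLIP-m18 is a multilingual CLIP model that supports 18 languages, including English, Chinese, and Japanese. It is mainly used for image-text matching and also backs the AltDiffusion-m18 model.

Model Features

Multilingual support

Supports image-text matching in 18 languages.

Support for AltDiffusion

Can serve as the base model for AltDiffusion-m18.

Multi-stage training

Uses a three-stage training strategy: stage 1 trains on parallel corpora, and stages 2 and 3 mainly train on a subset of Laion-Aesthetics.

Model Capabilities

Image-text matching

Multilingual text understanding

Image classification

Use Cases

Multilingual applications

Multilingual image search

Match relevant images to queries written in different languages.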
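
The search use case can be sketched as ranking by cosine similarity. This is a minimal, dependency-free illustration, not the model's own code: it assumes image and query embeddings have already been computed (for example with `get_image_features` and `get_text_features` from the inference script in this document). The function `rank_images` and the toy three-dimensional vectors are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_vec, image_vecs, top_k=3):
    """Return the top_k (image_name, score) pairs, best match first."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in image_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy embeddings standing in for real image features (assumption).
image_vecs = {
    "dog.jpeg": [0.9, 0.1, 0.0],
    "cat.jpeg": [0.1, 0.9, 0.0],
    "car.jpeg": [0.0, 0.1, 0.9],
}
# Stand-in for the text embedding of a query in any supported language,
# e.g. "一隻狗" (Chinese for "a dog").
query_vec = [0.8, 0.2, 0.1]

results = rank_images(query_vec, image_vecs, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Because a multilingual text encoder maps queries from different languages into the same embedding space, the ranking step itself does not need to know which language the query was written in.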

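Generative model support

AltDiffusion support

Provides multilingual text encoding for AltDiffusion-m18.

🚀 AltCLIP-m18

AltCLIP-m18 is a CLIP model that supports 18 languages and performs well on text-image tasks. It backs the AltDiffusion-m18 model. The model code is open source, and fine-tuning, inference, and evaluation scripts are provided.

🚀 Quick Start

The model code is available on FlagAI, and the weights are hosted on modelhub. Fine-tuning, inference, and evaluation scripts are also provided for you to try.

✨ Key Features

Multilingual support: English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, and Russian.
Model support: backs the AltDiffusion-m18 model.
Open source: the model code is open-sourced on FlagAI.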

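📦 Installation

The documentation gives no specific installation steps; refer to the code on FlagAI.

💻 Usage Examples

Basic usage

The following example evaluates the model on the CIFAR-10 dataset: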

# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
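import torch
from flagai.auto_model.auto_loader import AutoLoader
# zeroshot_classification is assumed to be a local helper module (not shown
# here) that must be on the import path when this script runs.
import zeroshot_classification
import json
import os
from torchvision.datasets import CIFAR10

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")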

dataset_root = "./clip_benchmark_datasets/"
dataset_name = "cifar10"

auto_loader = AutoLoader(
    task_name="txt_img_matching",
    model_dir="./checkpoints/",
    model_name="AltCLIP-XLMR-L-m18"   # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
)

model = auto_loader.get_model()
model.to(device)
model.eval()
tokenizer = auto_loader.get_tokenizer()
transform = auto_loader.get_transform()

dataset = CIFAR10(root=os.path.join(dataset_root, dataset_name), 
                transform=transform,   
                download=True)
batch_size = 128
num_workers = 4

template = {"cifar10": [
        "a photo of a {c}.",
        "a blurry photo of a {c}.",
        "a black and white photo of a {c}.",
        "a low contrast photo of a {c}.",
        "a high contrast photo of a {c}.",
        "a bad photo of a {c}.",
        "a good photo of a {c}.",
        "a photo of a small {c}.",
        "a photo of a big {c}.",
        "a photo of the {c}.",
        "a blurry photo of the {c}.",
        "a black and white photo of the {c}.",
        "a low contrast photo of the {c}.",
        "a high contrast photo of the {c}.",
        "a bad photo of the {c}.",
        "a good photo of the {c}.",
        "a photo of the small {c}.",
        "a photo of the big {c}."
    ],
}
def evaluate():
    if dataset:
        dataloader = torch.utils.data.DataLoader(
            dataset,
            batch_size=batch_size,
            shuffle=False,
            num_workers=num_workers,
        )

        zeroshot_templates = template["cifar10"]
        classnames = dataset.classes if hasattr(dataset, "classes") else None

        metrics = zeroshot_classification.evaluate(
            model,
            dataloader,
            tokenizer,
            classnames, 
            zeroshot_templates,
            device=device,
            amp=True,
        )
       
        dump = {
            "dataset": dataset_name,
            "metrics": metrics
        }

        print(dump)
        with open("./result.txt", "w") as f:
            json.dump(dump, f)
        return metrics

if __name__ == "__main__":
    evaluate()
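
The `zeroshot_classification` module used above is not shown in this document. The sketch below illustrates the prompt-ensembling approach commonly used for CLIP zero-shot evaluation, which is what a template list like the one above is typically for: each template is filled with a class name, the resulting text embeddings are averaged and normalized, and an image is assigned to the class with the highest similarity. It is an assumption that the helper module works this way. `toy_encode_text`, `build_class_embeddings`, and `classify` are hypothetical names, and the letter-count encoder only stands in for a real text encoder.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def toy_encode_text(text):
    """Hypothetical stand-in for a text encoder: normalized letter counts."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return normalize(vec)


def build_class_embeddings(classnames, templates, encode=toy_encode_text):
    """Average the embeddings of all filled-in templates for each class."""
    class_embs = []
    for name in classnames:
        embs = [encode(t.format(c=name)) for t in templates]
        mean = [sum(col) / len(embs) for col in zip(*embs)]
        class_embs.append(normalize(mean))
    return class_embs


def classify(image_emb, class_embs):
    """Return the index of the class whose embedding best matches the image."""
    scores = [sum(a * b for a, b in zip(image_emb, c)) for c in class_embs]
    return max(range(len(scores)), key=scores.__getitem__)


templates = ["a photo of a {c}.", "a blurry photo of a {c}."]
classnames = ["dog", "cat", "ship"]
class_embs = build_class_embeddings(classnames, templates)

# Pretend the image encoder produced an embedding identical to the "cat" class.
predicted = classify(class_embs[1], class_embs)
print(classnames[predicted])
```

Averaging over several templates makes the class embedding less sensitive to the wording of any single prompt.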

Advanced usage

The following example is the inference script:

import torch
from PIL import Image
from flagai.auto_model.auto_loader import AutoLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

loader = AutoLoader(
    task_name="txt_img_matching",
    model_name="AltCLIP-XLMR-L-m18",   # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
    model_dir="./checkpoints"
)

model = loader.get_model()
tokenizer = loader.get_tokenizer()
transform = loader.get_transform()

model.eval()
model.to(device)

def inference():
    image = Image.open("./dog.jpeg")
    image = transform(image)
    image = torch.tensor(image["pixel_values"]).to(device)
    tokenizer_out = tokenizer(["a rat", "a dog", "a cat"], 
                                padding=True,
                                truncation=True,
                                max_length=77,
                                return_tensors='pt')

    text = tokenizer_out["input_ids"].to(device)
    attention_mask = tokenizer_out["attention_mask"].to(device)
    with torch.no_grad():
        image_features = model.get_image_features(image)
        text_features = model.get_text_features(text, attention_mask=attention_mask)
        text_probs = (image_features @ text_features.T).softmax(dim=-1)

    print(text_probs.cpu().numpy()[0].tolist())

if __name__=="__main__":
    inference()
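
The script above applies softmax directly to the dot products of the two feature sets. CLIP-style models are often scored with L2-normalized features and a logit scale before the softmax. Whether `get_image_features` and `get_text_features` already return normalized vectors is not stated here, so the following is a sketch under that uncertainty rather than a description of the model's API. `match_probabilities`, the default `logit_scale` of 100.0, and the toy vectors are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def match_probabilities(image_emb, text_embs, logit_scale=100.0):
    """Probability of each text matching the image via scaled cosine similarity."""
    img = l2_normalize(image_emb)
    logits = []
    for text_emb in text_embs:
        txt = l2_normalize(text_emb)
        logits.append(logit_scale * sum(a * b for a, b in zip(img, txt)))
    return softmax(logits)


# Toy vectors standing in for real features (assumption).
image_emb = [2.0, 0.5, 0.1]
text_embs = [
    [0.1, 1.0, 0.0],  # "a rat"
    [1.8, 0.6, 0.0],  # "a dog"
    [0.0, 0.2, 1.0],  # "a cat"
]

probs = match_probabilities(image_emb, text_embs)
print([round(p, 4) for p in probs])
```

Normalizing first makes the scores depend on direction only, so a text embedding with a large norm cannot dominate the softmax simply because of its magnitude.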

Fine-tuning

The following example fine-tunes the model on the CIFAR-10 dataset:

# Copyright © 2022 BAAI. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License")
import torch
from flagai.auto_model.auto_loader import AutoLoader
import os 
from flagai.trainer import Trainer
from torchvision.datasets import (
    CIFAR10
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

dataset_root = "./clip_benchmark_datasets"
dataset_name = "cifar10"

batch_size = 4
classes = ['airplane', 'automobile', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']

auto_loader = AutoLoader(
    task_name="txt_img_matching",
    model_dir="./checkpoints",
    model_name="AltCLIP-XLMR-L-m18"   # Load the checkpoints from Modelhub(model.baai.ac.cn/models)
)

model = auto_loader.get_model()
model.to(device)
model.eval()
tokenizer = auto_loader.get_tokenizer()
transform = auto_loader.get_transform()

trainer = Trainer(env_type="pytorch",
                pytorch_device=device,
                experiment_name="clip_finetuning",
                batch_size=4,
                lr=1e-4,
                epochs=10,
                log_interval=10)

dataset = CIFAR10(root=os.path.join(dataset_root, dataset_name), 
                transform=transform,   
                download=True)

def cifar10_collate_fn(batch):
    # Each item is (transformed_image, label_index).
    # image shape is (batch, 3, 224, 224)
    images = torch.tensor([b[0]["pixel_values"][0] for b in batch])
    # CIFAR10 yields integer labels, so map them to class names for the prompt.
    prompts = [f"a photo of a {classes[b[1]]}" for b in batch]
    # Tokenize the whole batch once; padding=True pads to the longest prompt,
    # so input_ids and attention_mask both have shape (batch, n).
    tokenizer_out = tokenizer(prompts,
                                padding=True,
                                truncation=True,
                                max_length=77,
                                return_tensors="pt")

    return {
        "pixel_values": images,
        "input_ids": tokenizer_out["input_ids"],
        "attention_mask": tokenizer_out["attention_mask"],
    }
    
if __name__ == "__main__":
    trainer.train(model=model, train_dataset=dataset, collate_fn=cifar10_collate_fn)
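
The script above does not show which loss `Trainer` optimizes. CLIP-style models are usually trained with a symmetric contrastive loss over the image-text similarity matrix, where the i-th image and the i-th text form the positive pair. The sketch below illustrates that general technique in plain Python. It is an assumption that fine-tuning here uses it, and `clip_contrastive_loss` and the `logit_scale` value are hypothetical.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(image_embs, text_embs, logit_scale=10.0):
    """Symmetric cross-entropy; inputs are assumed to be unit-normalized."""
    n = len(image_embs)
    # logits[i][j] is the scaled similarity between image i and text j.
    logits = [
        [logit_scale * sum(a * b for a, b in zip(img, txt)) for txt in text_embs]
        for img in image_embs
    ]
    # Image-to-text direction: each row should peak on its diagonal entry.
    loss_i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    loss_t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return (loss_i2t + loss_t2i) / 2


# Toy unit vectors: aligned pairs versus deliberately swapped pairs.
image_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = clip_contrastive_loss(image_embs, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = clip_contrastive_loss(image_embs, [[0.0, 1.0], [1.0, 0.0]])
print(aligned_loss, swapped_loss)
```

The loss is close to zero when each image is most similar to its own caption and grows when the pairs are mismatched, which is the signal that pulls matching image and text embeddings together.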

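📚 Documentation

Model information

Property	Details
Name	AltCLIP-m18
Task	Text-Image
Languages	Multilingual (English, Chinese, Japanese, Thai, Korean, Hindi, Ukrainian, Arabic, Turkish, Vietnamese, Polish, Dutch, Portuguese, Italian, Spanish, German, French, and Russian)
Model type	CLIP
Code	FlagAI
Weights	modelhub

Training datasets

Stage 1 uses parallel corpus data. Stages 2 and 3 mainly use a subset of Laion-Aesthetics. The Chinese data comes from the WuDaoMM dataset (CC-BY-SA 4.0). The datasets used for each language are listed below: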

No	Language	Stage 1 (LAION400M) (MIT)
1	English
2	Thai	CCAligned
3	Korean	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode))
4	Hindi	CCAligned
5	Ukrainian	CCMatrix
6	Arabic	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), [OpenSubtitles](https://opus.nlpl.eu/OpenSubtitles-v2018.php)
7	Turkish	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), CCMatrix
8	Vietnamese	CCMatrix
9	Polish	CCMatrix, WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode))
10	Dutch	CCMatrix
11	Portuguese	CCAligned
12	Italian	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), Wikipedia
13	Japanese	MultiParaCrawl ([Creative Commons CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/))
14	Chinese	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), TSL2019
15	Spanish	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode))
16	German	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), EUbookshop
17	French	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), EuroPat ([Creative Commons CC0 license](https://creativecommons.org/share-your-work/public-domain/cc0/))
18	Russian	WikiMatrix ([CC-BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/legalcode)), CCMatrix

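⚠️ Important notice

The WuDaoMM dataset is for academic research only, and any use of it must meet the following requirements. WuDaoMM does not own the copyright of the images. Use of the images must comply with the Flickr Terms of Use. Users of the images bear full responsibility for their use of the dataset and must not redistribute the images privately. If the copyright of an image is infringed, please contact us and we will remove it immediately.

Evaluation results

ImageNet evaluation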

AltCLIP-M18 scores by ImageNet variant (rows) and language suffix (columns). The first score column is the benchmark without a language suffix.

Benchmark	(no suffix)	cn	es	fr	it	jp	ko	ru
ImageNet-adv	58	50.35	43.56	44.07	48.25	36.48	38.48	40.57
ImageNet-ren	89.53	81.36	71.78	74.96	76.44	67.68	69.27	75.53
ImageNet-ske	65.42	51.26	97.44	84.83	30.52	68.62	67.46	54.4
ImageNet-1k	76.71	57.12	54.22	54.84	52.29	51.71	53.65	51.53
ImageNet-v2	65.45	51.76	48.91	49.24	47.27	46.76	48.1	46.53

Other classification benchmarks

	caltech101	cars	cifar10	cifar100	country211	dtd	eurosat	fer2013	Fgvc-aircraft	flowers	food101	gtsrb	Hateful-memes	Kitti-distance	Mnist	pcam	pets	Renderedsst2	Resisc45	Voc2007
AltCLIP-M18	88.25	92.75	97.44	84.83	30.52	68.62	67.46	54.4	40.41	71.64	92.49	56.35	50.8	14.91	78.46	54.76	94.11	65.95	70.83	81.62

Retrieval benchmarks

	Multi30k-de-I2T	Multi30k-de-T2I	Multi30k-en-I2T	Multi30k-en-T2I	Multi30k-fr-I2T	Multi30k-fr-T2I	Xtd-de-I2T	Xtd-de-T2I	Xtd-en-I2T	Xtd-en-T2I	Xtd-es-I2T	Xtd-es-T2I	Xtd-fr-I2T	Xtd-fr-T2I	Xtd-it-I2T	Xtd-it-T2I	Xtd-jp-I2T	Xtd-jp-T2I	Xtd-ko-I2T	Xtd-ko-T2I	Xtd-pl-I2T	Xtd-pl-T2I	Xtd-ru-I2T	Xtd-ru-T2I	Xtd-tr-I2T	Xtd-tr-T2I	Xtd-zh-I2T	Xtd-zh-T2I
AltCLIP-M18	84.4	65.82	91.1	77.76	74.5	75.4	64.76	66.57	72.17	72.67	65.83	65.03	67.17	67.47	66.63	66.03	58.96	62.96	61.42	64.43	67.23	69.14	60.22	61.02	65.03	64.23	64.53	65.43
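
The table does not name the metric behind these scores. Retrieval benchmarks of this kind are commonly reported as Recall@K, so the sketch below shows how image-to-text (I2T) and text-to-image (T2I) Recall@K could be computed from a similarity matrix. It assumes that image i and text i form the only correct pair. `recall_at_k` and the toy matrix are hypothetical and do not reproduce the scores above.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose correct item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix: rows are images, columns are texts (assumption).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

# I2T: each image queries the texts, so rank along rows.
i2t_r1 = recall_at_k(sim, 1)
i2t_r2 = recall_at_k(sim, 2)

# T2I: each text queries the images, so rank along the transposed matrix.
sim_t = [list(col) for col in zip(*sim)]
t2i_r1 = recall_at_k(sim_t, 1)

print(i2t_r1, i2t_r2, t2i_r1)
```

The two directions can differ on the same matrix, which is why benchmarks report I2T and T2I as separate columns.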

Citation

More details are available in our AltCLIP report. If it helps your work, please cite it.

@article{https://doi.org/10.48550/arxiv.2211.06679,
  doi = {10.48550/ARXIV.2211.06679},
  url = {https://arxiv.org/abs/2211.06679},
  author = {Chen, Zhongzhi and Liu, Guang and Zhang, Bo-Wen and Ye, Fulong and Yang, Qinghong and Wu, Ledell},
  keywords = {Computation and Language (cs.CL), FOS: Computer and information sciences},
  title = {AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}