clip-vit-base-patch32-ko open-source model - supports Korean-English bilingual image-text matching

Clip Vit Base Patch32 Ko

Developed by Bingsu

A Korean CLIP model trained with knowledge distillation; supports Korean-English bilingual image-text matching.

Text-to-Image

Transformers

Korean

License: MIT #Korean CLIP model #zero-shot image classification #multimodal understanding

Downloads: 3,147

Released: 9/16/2022

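Model overview

This is a Korean version of the CLIP model, built on the ViT-Base-Patch32 architecture and trained with knowledge distillation. It is intended for cross-modal retrieval in Korean and English.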

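Model features

Korean optimization

Optimized specifically for Korean; trained on Korean-English parallel data from the AIHUB platform.

Knowledge distillation training

Transfers knowledge from the original CLIP model via knowledge distillation.

Bilingual support

Accepts text input in both Korean and English.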

Model capabilities

Zero-shot image classification

Image-text matching

Cross-modal retrieval

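Use cases

Image classification

Animal recognition

Identifies the type of animal in an image.

Can distinguish common animals such as cats and dogs.

Content moderation

Prohibited-content detection

Detects whether an image contains prohibited content.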

🚀 clip-vit-base-patch32-ko

This is a Korean CLIP model trained with the method from Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation. It can be used for tasks such as image classification.
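The distillation objective from the cited paper can be sketched roughly as follows. A frozen teacher (presumably the original CLIP text encoder) embeds the English sentence, and the student is trained so that its embeddings of both the English sentence and its Korean translation land close to that teacher embedding, measured by mean squared error. The plain-Python snippet below only illustrates that idea: the vectors are made-up numbers, `mse` and `distillation_loss` are hypothetical helper names, and the actual training script for this model is not described on this page.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    assert len(a) == len(b)
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(teacher_en, student_en, student_ko):
    """Pull both student embeddings toward the teacher's English embedding."""
    return mse(student_en, teacher_en) + mse(student_ko, teacher_en)


# Made-up 4-dimensional embeddings for one English/Korean sentence pair.
teacher_en = [0.2, -0.1, 0.7, 0.4]  # teacher("two cats")
student_en = [0.2, -0.1, 0.7, 0.4]  # student("two cats"), already aligned
student_ko = [0.0, -0.1, 0.7, 0.4]  # student("고양이 두 마리"), slightly off

loss = distillation_loss(teacher_en, student_en, student_ko)
print(loss)  # small but non-zero: only the Korean embedding is off target
```

In a real setup the lists would be encoder outputs and the loss would be minimized with an optimizer; only the student's weights would change.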

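🚀 Quick start

The model can be used directly for image classification; usage examples follow.

📦 Installation

The documentation does not give specific installation steps; see the model repository for instructions.

💻 Usage examples

Basic usage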

import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the model and its processor (tokenizer plus image preprocessor).
repo = "Bingsu/clip-vit-base-patch32-ko"
model = AutoModel.from_pretrained(repo)
processor = AutoProcessor.from_pretrained(repo)

# Download a COCO validation image.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
# Candidate captions: "two cats" and "two dogs".
inputs = processor(text=["고양이 두 마리", "개 두 마리"], images=image, return_tensors="pt", padding=True)
with torch.inference_mode():
    outputs = model(**inputs)
# Image-to-text similarity scores, normalized over the candidate captions.
logits_per_image = outputs.logits_per_image
probs = logits_per_image.softmax(dim=1)

>>> probs
tensor([[0.9926, 0.0074]])

Advanced usage

from transformers import pipeline

repo = "Bingsu/clip-vit-base-patch32-ko"
pipe = pipeline("zero-shot-image-classification", model=repo)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
# Candidate labels: "one cat", "two cats", "cat friends sprawled on a pink sofa".
result = pipe(images=url, candidate_labels=["고양이 한 마리", "고양이 두 마리", "분홍색 소파에 드러누운 고양이 친구들"], hypothesis_template="{}")

>>> result
[{'score': 0.9456236958503723, 'label': '분홍색 소파에 드러누운 고양이 친구들'},
 {'score': 0.05315302312374115, 'label': '고양이 두 마리'},
 {'score': 0.0012233294546604156, 'label': '고양이 한 마리'}]

📚 Documentation

Tokenizer

The tokenizer was trained with the original CLIP tokenizer's .train_new_from_iterator method on a mix of Korean and English data in a 7:3 ratio.
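A minimal sketch of how such a tokenizer could be built is shown below. It is not the author's script. The sampling helper `mix_corpora`, the wrapper `train_tokenizer`, the base repository `openai/clip-vit-base-patch32`, and the sample sentences are all assumptions made for illustration; the page does not say how the 7:3 mix was drawn or which vocabulary size was used. Only `train_new_from_iterator` itself comes from the description above.

```python
import random


def mix_corpora(korean, english, total, parts=(7, 3), seed=0):
    """Sample `total` sentences with Korean:English in the given ratio."""
    ko_parts, en_parts = parts
    n_ko = total * ko_parts // (ko_parts + en_parts)
    n_en = total - n_ko
    rng = random.Random(seed)
    mixed = rng.sample(korean, n_ko) + rng.sample(english, n_en)
    rng.shuffle(mixed)
    return mixed


def train_tokenizer(texts, vocab_size, base_repo="openai/clip-vit-base-patch32"):
    """Retrain the base tokenizer's algorithm and settings on new text."""
    # Imported lazily so the mixing helper above runs without transformers.
    from transformers import AutoTokenizer

    base = AutoTokenizer.from_pretrained(base_repo)  # needs network access
    return base.train_new_from_iterator(texts, vocab_size=vocab_size)


# Placeholder corpora; real training would use far more text.
korean = [f"한국어 문장 {i}" for i in range(20)]
english = [f"English sentence {i}" for i in range(20)]
mixed = mix_corpora(korean, english, total=10)
print(sum(s in korean for s in mixed), "Korean of", len(mixed))  # 7 Korean of 10
```

Calling `train_tokenizer(mixed, vocab_size=...)` would then produce the new tokenizer; the vocabulary size is left open here because the page does not state it.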

Reference code: https://github.com/huggingface/transformers/blob/bc21aaca789f1a366c05e8b5e111632944886393/src/transformers/models/clip/modeling_clip.py#L661-L666

        # text_embeds.shape = [batch_size, sequence_length, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
        pooled_output = last_hidden_state[
            torch.arange(last_hidden_state.shape[0]), input_ids.to(torch.int).argmax(dim=-1)
        ]

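Because the CLIP model computes pooled_output from the token with the largest id in each sequence, the eos token must be the last token in the vocabulary, that is, the one with the highest id.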
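The effect of that argmax can be checked with a toy example. The snippet below is a plain-Python stand-in for the tensor indexing in the reference code, not the library implementation; the token ids, the pad id 0, and the helper name `pooled_positions` are hypothetical.

```python
def pooled_positions(input_ids):
    """Per sequence, index of the largest token id (first occurrence on ties)."""
    return [max(range(len(seq)), key=seq.__getitem__) for seq in input_ids]


# Toy vocabulary of 10 ids where EOS has the highest id (9); BOS is 8, pad is 0.
eos_is_highest = [
    [8, 3, 5, 9, 0, 0],
    [8, 4, 9, 0, 0, 0],
]
print(pooled_positions(eos_is_highest))  # [3, 2] -> the EOS positions

# Same sentence, but with BOS = 1 and EOS = 2, so ordinary tokens outrank EOS.
eos_is_low = [[1, 3, 5, 2, 0, 0]]
print(pooled_positions(eos_is_low))  # [2] -> token 5 is picked, not EOS at index 3
```

With a low EOS id the pooled feature would be read from an ordinary word position, which is why a retrained tokenizer has to keep EOS at the top of the id range.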