japanese-clip-vit-b-16 open-source model: contrastive learning for Japanese text and images


Japanese Clip Vit B 16

Developed by rinna

A Japanese CLIP model trained by rinna Co., Ltd. that supports contrastive learning between Japanese text and images.

Image-text matching

Transformers

Language: Japanese. Open-source license: Apache-2.0. Tags: #Japanese multimodal, #Zero-shot image classification, #ViT-B/16 architecture

Downloads: 26.12k

Release date: 4/27/2022

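Model overview

This is a multimodal model based on the CLIP architecture. It maps Japanese text and images into the same feature space, which enables cross-modal retrieval and classification tasks.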
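To make the idea of a shared feature space concrete, the following minimal sketch compares toy vectors with cosine similarity. The vectors and their values are hypothetical stand-ins, not output of this model; in practice they would come from the image and text encoders.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical 4-dimensional embeddings (illustrative values only).
image_vec = [0.9, 0.1, 0.0, 0.4]  # stands in for an image of a dog
text_dog = [0.8, 0.2, 0.1, 0.5]   # stands in for the text "犬" (dog)
text_cat = [0.1, 0.9, 0.3, 0.0]   # stands in for the text "猫" (cat)

sim_dog = cosine_similarity(image_vec, text_dog)
sim_cat = cosine_similarity(image_vec, text_cat)

# A matching image-text pair should score higher than a mismatched one.
print("dog:", round(sim_dog, 3), "cat:", round(sim_cat, 3))
```

Because both modalities live in one space, the same comparison works in either direction: an image against many texts, or a text against many images.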

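Model features

Japanese-specific

A CLIP model optimized for Japanese that learns associations between Japanese text and images

Multimodal capability

Processes image and text inputs together for cross-modal feature extraction and matching

Pretrained model

Pretrained on a large-scale dataset (CC12M), so it can be used directly for downstream tasks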

Model capabilities

Image feature extraction

Japanese text feature extraction

Image-text similarity computation

Cross-modal retrieval

Use cases

Image classification

Zero-shot image classification

Classify images using Japanese labels

Can output a probability distribution over the labels
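As a rough illustration of how a probability distribution over labels can be derived, the sketch below applies a softmax to scaled image-text similarity scores. The similarity values and the helper name `label_probabilities` are hypothetical; the scale of 100.0 mirrors the factor used in the usage example on this page.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def label_probabilities(similarities, labels, scale=100.0):
    """Map each label to a probability from its image-text similarity."""
    probs = softmax([scale * s for s in similarities])
    return dict(zip(labels, probs))


labels = ["犬", "猫", "象"]         # dog, cat, elephant
similarities = [0.31, 0.22, 0.18]  # hypothetical scores, not model output

probs = label_probabilities(similarities, labels)
best = max(probs, key=probs.get)
print(best, probs)
```

The large scale sharpens the distribution: even small gaps between similarity scores turn into a near one-hot result, which is why the top label can receive almost all of the probability mass.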

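Cross-modal retrieval

Text-to-image retrieval

Search for relevant images using a Japanese text description

Image-to-text retrieval

Search for matching Japanese text descriptions using an image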
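Both retrieval directions reduce to ranking candidates by their similarity to a query embedding. The sketch below shows text-to-image ranking on toy data; the file names, vectors, and the helper `rank_by_similarity` are hypothetical, and the vectors are assumed to be unit length so that a dot product equals cosine similarity.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_by_similarity(query_vec, candidates, top_k=2):
    """Return the top_k (name, score) pairs, most similar first."""
    scored = [(name, dot(query_vec, vec)) for name, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical unit-length image embeddings keyed by file name.
image_index = {
    "street.jpg": [1.0, 0.0],
    "beach_dog.jpg": [0.6, 0.8],
    "forest.jpg": [0.0, 1.0],
}

# Hypothetical embedding of a Japanese text query.
query = [0.8, 0.6]

results = rank_by_similarity(query, image_index, top_k=2)
print(results)
```

Image-to-text retrieval follows the same pattern with the roles swapped: the query is an image embedding and the index holds text embeddings.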

🚀 rinna/japanese-clip-vit-b-16

This is a Japanese CLIP (Contrastive Language-Image Pre-Training) model trained by rinna Co., Ltd. For other available models, see japanese-clip.

🚀 Quick start

Follow the steps below to use the model.

📦 Installation

Install the package.

$ pip install git+https://github.com/rinnakk/japanese-clip.git

💻 Usage example

Basic usage

import io
import requests
from PIL import Image
import torch
import japanese_clip as ja_clip

# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model, its image preprocessing transform, and the tokenizer.
model, preprocess = ja_clip.load("rinna/japanese-clip-vit-b-16", cache_dir="/tmp/japanese_clip", device=device)
tokenizer = ja_clip.load_tokenizer()

# Download a sample image and turn it into a batch of one tensor.
img = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = preprocess(img).unsqueeze(0).to(device)

# Tokenize the candidate labels: dog, cat, elephant.
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,  # optional; if omitted, the tokenizer is loaded on every call
)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)

    # Scaled image-text similarities, converted to probabilities over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1.0, 0.0, 0.0]]

🔧 Technical details

Model architecture

This model uses a ViT-B/16 Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The image encoder was initialized from the AugReg vit-base-patch16-224 model.
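The name vit-base-patch16-224 encodes the input geometry: a 224x224 image cut into 16x16 patches. The short sketch below works out the resulting token count. The helper name `vit_sequence_length` is hypothetical, and the extra class token is an assumption based on the standard ViT design rather than something stated on this page.

```python
def vit_sequence_length(image_size=224, patch_size=16, class_token=True):
    """Number of tokens a ViT image encoder sees for a square input."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size   # 224 // 16 = 14
    num_patches = patches_per_side ** 2           # 14 * 14 = 196
    # Assumption: one extra class token, as in the standard ViT design.
    return num_patches + (1 if class_token else 0)


print(vit_sequence_length())                   # 197
print(vit_sequence_length(class_token=False))  # 196
```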

Training data

The model was trained on the CC12M dataset with its captions translated into Japanese.
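This page does not describe the training objective beyond naming CLIP. As background, the objective commonly used for CLIP-style training is a symmetric cross-entropy over a batch of image-text similarities, where matched pairs sit on the diagonal. The sketch below illustrates that general idea only; the function names, the fixed scale of 100.0, and the toy matrices are assumptions, not rinna's training code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_sum for x in row]


def clip_contrastive_loss(sim, scale=100.0):
    """Symmetric contrastive loss; sim[i][j] compares image i with text j."""
    n = len(sim)
    logits = [[scale * v for v in row] for row in sim]
    # Image-to-text: row i should assign high probability to column i.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image: the same check on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (i2t + t2i) / 2


# Hypothetical similarity matrices for a batch of two image-caption pairs.
aligned = [[1.0, 0.0], [0.0, 1.0]]     # each image matches its own caption
misaligned = [[0.0, 1.0], [1.0, 0.0]]  # each image matches the wrong caption

loss_aligned = clip_contrastive_loss(aligned)
loss_misaligned = clip_contrastive_loss(misaligned)
print(loss_aligned < loss_misaligned)
```

Minimizing a loss of this shape pulls each image toward its own caption and pushes it away from the other captions in the batch.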

Release date

May 12, 2022

How to cite

@misc{rinna-japanese-clip-vit-b-16,
    title = {rinna/japanese-clip-vit-b-16},
    author = {Shing, Makoto and Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-clip-vit-b-16}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}