japanese-cloob-vit-b-16 open-source model: cross-modal understanding of Japanese text and images

Japanese Cloob Vit B 16

Developed by rinna

A Japanese CLOOB (Contrastive Leave One Out Boost) model trained by rinna Co., Ltd. for cross-modal understanding of images and text.

Category: Text-to-Image

Library: Transformers

Language: Japanese · Open-source license: Apache-2.0 · #JapaneseMultimodal #ImageTextMatching #ZeroShotClassification

Downloads: 229.51k

Released: 4/27/2022

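Model Overview

The model is based on the CLOOB architecture. It learns how Japanese text relates to images and supports tasks such as image classification and text-image matching.

Model Features

Japanese cross-modal understanding

A vision-language model designed specifically for Japanese, so it can relate Japanese text to image content directly.

CLOOB architecture

Uses the Contrastive Leave One Out Boost (CLOOB) method to improve cross-modal representation learning.

Pretrained ViT model

The image encoder is initialized from the AugReg vit-base-patch16-224 model.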
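The "leave one out" part of CLOOB can be illustrated with a toy comparison. The sketch below contrasts the InfoNCE objective used by CLIP-style models with the InfoLOOB objective from the CLOOB paper, in which the matching pair is left out of the denominator. It is a minimal illustration and not rinna's training code: the similarity matrix, the temperature values, and the function names are hypothetical, only one matching direction is computed, and the Hopfield-network retrieval step that the CLOOB paper also uses is omitted.

```python
import math


def info_nce(sim, tau=1.0):
    """Mean InfoNCE loss over rows; the denominator includes the positive pair."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        denom = sum(math.exp(sim[i][j] / tau) for j in range(n))
        total += -math.log(math.exp(sim[i][i] / tau) / denom)
    return total / n


def info_loob(sim, tau=1.0):
    """Mean InfoLOOB loss over rows; the positive pair is left out of the denominator."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        denom = sum(math.exp(sim[i][j] / tau) for j in range(n) if j != i)
        total += -math.log(math.exp(sim[i][i] / tau) / denom)
    return total / n


# Hypothetical similarities: row i is image i, column j is caption j,
# so the diagonal holds the matching (positive) pairs.
toy_sim = [
    [0.9, 0.1],
    [0.2, 0.8],
]

print("InfoNCE :", info_nce(toy_sim, tau=0.1))
print("InfoLOOB:", info_loob(toy_sim, tau=0.1))
```

InfoNCE is bounded below by zero and flattens out once the positive pair dominates its row, while InfoLOOB keeps decreasing as the positive pulls further away from the negatives. The CLOOB paper gives this saturation of InfoNCE as the motivation for the change.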

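Model Capabilities

Image feature extraction

Text feature extraction

Image-text matching

Cross-modal retrieval

Use Cases

Image classification

Animal image classification

Identify the kind of animal in an image (for example: dog, cat, elephant).

Example: in the Quick Start code on this page, a dog photo is assigned probability 1.0 for the label 犬 (dog).

Cross-modal retrieval

Text-to-image retrieval

Retrieve relevant images from a Japanese text description.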
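Text-to-image retrieval reduces to ranking image feature vectors by their similarity to a text feature vector. The sketch below shows that ranking step in isolation with cosine similarity. The three-dimensional vectors are made-up stand-ins for real model features, and `cosine` and `rank_images` are hypothetical helper names, not part of the japanese-clip package.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_vec, image_vecs, top_k=3):
    """Return (image index, score) pairs, best match first."""
    scored = [(idx, cosine(text_vec, vec)) for idx, vec in enumerate(image_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical features: one text query and a gallery of three images.
query = [1.0, 0.0, 0.0]
gallery = [
    [0.1, 0.9, 0.0],
    [0.8, 0.1, 0.1],
    [0.7, 0.6, 0.2],
]

top = rank_images(query, gallery, top_k=2)
print(top)  # image 1 ranks first, image 2 second
```

In practice the gallery features would be computed once and cached, so that each new text query only costs one text encoding plus the similarity ranking.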

🚀 rinna/japanese-cloob-vit-b-16

This is a Japanese CLOOB (Contrastive Leave One Out Boost) model trained by rinna Co., Ltd. For other available models, see japanese-clip.

🚀 Quick Start

📦 Installation

$ pip install git+https://github.com/rinnakk/japanese-clip.git

💻 Usage Examples

Basic usage

import io
import requests
from PIL import Image
import torch
import japanese_clip as ja_clip

# Use the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model with its image preprocessing transform, then the tokenizer.
model, preprocess = ja_clip.load("rinna/japanese-cloob-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

# Download a sample photo and turn it into a batch containing one image tensor.
img = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = preprocess(img).unsqueeze(0).to(device)
# Tokenize the candidate labels: 犬 (dog), 猫 (cat), 象 (elephant).
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,  # optional; if omitted, the tokenizer is loaded on every call
)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)

    # Scale the image-text similarities by 100, then apply softmax over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1.0, 0.0, 0.0]]
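The last step of the example, scaling the similarities by 100 and applying softmax, is what turns modest differences in similarity into a near one-hot result such as `[[1.0, 0.0, 0.0]]`. The sketch below reproduces that step on toy vectors in plain Python so the effect of the scale factor is visible. The vectors and the helper names `l2_normalize` and `zero_shot_probs` are hypothetical. The sketch normalizes the vectors explicitly, which is an assumption made for illustration; this page does not state whether the model's feature methods already return unit-length vectors.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several labels."""
    img = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        txt = l2_normalize(text_vec)
        logits.append(scale * sum(a * b for a, b in zip(img, txt)))
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical features for one image and three label texts.
image_vec = [1.0, 0.0]
label_vecs = [[0.9, 0.1], [0.1, 0.9], [-1.0, 0.0]]

sharp = zero_shot_probs(image_vec, label_vecs, scale=100.0)
soft = zero_shot_probs(image_vec, label_vecs, scale=1.0)
print("scale=100:", sharp)
print("scale=1  :", soft)
```

With `scale=100.0` the first label takes essentially all of the probability mass, while `scale=1.0` leaves the distribution much flatter. A probability of 1.0 in the printed output therefore reflects the sharpness of the scaled softmax as well as the quality of the match.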

📚 Documentation

🔧 Technical Details

Model architecture

The model was trained with a ViT-B/16 Transformer as the image encoder and a 12-layer BERT as the text encoder. The image encoder was initialized from the AugReg vit-base-patch16-224 model.
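The name vit-base-patch16-224 encodes the input geometry: 224 x 224 pixel images cut into 16 x 16 pixel patches. The short sketch below works out how many tokens the image encoder processes per image under those numbers. `vit_sequence_length` is a hypothetical helper, and the extra class token is an assumption based on the standard ViT design rather than something stated on this page.

```python
def vit_sequence_length(image_size=224, patch_size=16, class_token=True):
    """Number of tokens a ViT sees for a square image split into square patches."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side * patches_per_side
    # Assumption: one learned class token is prepended, as in the standard ViT.
    return num_patches + (1 if class_token else 0)


print(vit_sequence_length())                    # 14 * 14 patches + 1 class token = 197
print(vit_sequence_length(class_token=False))   # 14 * 14 patches = 196
```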

Training data

The model was trained on CC12M with the captions translated into Japanese.

Property	Details
Model type	Japanese CLOOB (Contrastive Leave One Out Boost)
Training data	CC12M (captions translated into Japanese)

Release date

May 12, 2022

How to cite

@misc{rinna-japanese-cloob-vit-b-16,
    title = {rinna/japanese-cloob-vit-b-16},
    author = {Shing, Makoto and Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-cloob-vit-b-16}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}