llm-jp-clip-vit-large-patch14: an open-source Japanese CLIP model for zero-shot image classification and image-text retrieval


Developed by llm-jp

A Japanese CLIP model trained with the OpenCLIP framework on a dataset of 1.45 billion Japanese image-text pairs. It supports zero-shot image classification and image-text retrieval.

Safetensors

Language: Japanese. Open-source license: Apache-2.0. Tags: #Japanese-CLIP #zero-shot-classification #image-text-retrieval

Release date: 12/27/2024

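Model overview

This is a Japanese vision-language model that maps images and Japanese text into a shared embedding space, enabling zero-shot image classification and cross-modal retrieval.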

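Model features

Large-scale Japanese training data

Trained on roughly 1.45 billion Japanese image-text pairs, obtained by machine-translating English captions with gemma-2-9b-it.

Competitive vision-language performance

Achieves competitive average scores across several benchmarks compared with models of similar size. The authors report relatively low performance on tasks specific to Japanese culture, which they attribute to the limits of translation-based data.

Zero-shot classification

Classifies images without task-specific fine-tuning.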

Model capabilities

Zero-shot image classification

Image-text similarity scoring

Cross-modal retrieval

Semantic image understanding

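Use cases

Content moderation

Policy-violation detection

Detect policy-violating content in images by matching them against text descriptions.

E-commerce

Product search

Retrieve relevant product images from natural-language descriptions.

Media analysis

Image labeling

Assign Japanese descriptive labels to images by scoring a set of candidate labels against each image.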

🚀 llm-jp-clip-vit-large-patch14

This is a CLIP model that associates images with Japanese text. It was trained with OpenCLIP on a large-scale dataset of Japanese image-text pairs.

🚀 Quick start

The sections below describe how to install and use the model.

✨ Key features

Associates images with Japanese text.
Supports tasks such as zero-shot image classification and image-text retrieval.
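
The image-text retrieval task mentioned above comes down to ranking candidates by cosine similarity in the shared embedding space. The following is a minimal sketch of that ranking step only, assuming the embeddings have already been computed. The 3-dimensional vectors and the helper names `l2_normalize` and `rank_by_similarity` are hypothetical and chosen for illustration; real embeddings would come from the model's image and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_by_similarity(query, candidates):
    """Return candidate indices ordered from most to least similar, plus the scores."""
    q = l2_normalize(query)
    scores = [
        sum(a * b for a, b in zip(q, l2_normalize(c)))
        for c in candidates
    ]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Hypothetical toy embeddings standing in for real encoder outputs
text_query = [0.9, 0.1, 0.0]
image_embeddings = [
    [0.0, 1.0, 0.0],  # image 0
    [1.0, 0.2, 0.0],  # image 1
    [0.5, 0.5, 0.5],  # image 2
]

order, scores = rank_by_similarity(text_query, image_embeddings)
print("ranking:", order)
```

Text-to-image and image-to-text retrieval use the same ranking; only the roles of query and candidates are swapped.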

📦 Installation

To use this model, install open_clip_torch by running the following command.

$ pip install open_clip_torch

💻 Usage example

Basic usage

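import open_clip
import requests
import torch
from PIL import Image

# Load the model, its image preprocessing transform, and the matching tokenizer from the Hugging Face Hub
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')

# Download a sample image and turn it into a batch of one tensor
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)

# Candidate labels in Japanese: cat, dog, bird
text = tokenizer(["猫", "犬", "鳥"])

# torch.autocast replaces the deprecated torch.cuda.amp.autocast; it is enabled only when CUDA is available
with torch.no_grad(), torch.autocast("cuda", enabled=torch.cuda.is_available()):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so that the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities by 100 and apply softmax over the labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])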
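
The `100.0 *` factor in the example above acts as a temperature: cosine similarities between normalized embeddings usually lie close together, and scaling them before the softmax makes the resulting distribution much sharper. The sketch below illustrates this effect in plain Python. The similarity values are hypothetical and are not outputs of the model.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between one image and the labels ["猫", "犬", "鳥"]
cosine_sims = [0.30, 0.25, 0.22]

plain = softmax(cosine_sims)                        # no scaling: nearly uniform
scaled = softmax([100.0 * s for s in cosine_sims])  # scaled by 100: sharply peaked

print("without scaling:", [round(p, 3) for p in plain])
print("with 100x scaling:", [round(p, 3) for p in scaled])
```

Without scaling, the three labels receive almost equal probability even though the first has the highest similarity; with scaling, the top label takes nearly all of the probability mass.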

📚 Documentation

Model details

This is a Japanese CLIP model trained with OpenCLIP on relaion2B-en-research-safe-japanese-translation, a Japanese translation of the English subset of ReLAION-5B produced with gemma-2-9b-it.

The model has 467M parameters in total.

Training details

Model architecture

Attribute	Details
Model type	Japanese CLIP model
Text encoder	RoBERTa base with llm-jp-tokenizer
Image encoder	ViT-L/14
Training data	relaion2B-en-research-safe-japanese-translation

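Training data

The model was trained on relaion2B-en-research-safe-japanese-translation. Because only 70% of the images could be downloaded, the usable dataset contained 1.45 billion samples. Training ran for 9 epochs, or about 13 billion samples in total.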
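
The figures above can be cross-checked with a little arithmetic. This is only a sanity check on the rounded numbers quoted in the text, so the results are approximate; the variable names are illustrative.

```python
# Figures quoted in the text above; both are rounded, so results are approximate
dataset_size = 1_450_000_000   # samples available after image download
epochs = 9
download_success_rate = 0.70

# Total samples processed over all epochs
samples_seen = dataset_size * epochs

# Number of pairs the source dataset would need for a 70% success rate to leave 1.45 billion
implied_source_pairs = dataset_size / download_success_rate

print(f"samples seen in training: {samples_seen / 1e9:.2f} billion")
print(f"implied pairs before download: {implied_source_pairs / 1e9:.2f} billion")
```

Nine passes over 1.45 billion samples give 13.05 billion, consistent with the roughly 13 billion stated, and the 70% success rate implies a source set of a little over 2 billion pairs.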

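Evaluation

The evaluation code is publicly available.

The table below reports zero-shot image classification and image-text retrieval performance for each model. Among the models listed, llm-jp-ViT-L/14 scores highest on Food101 and second highest on ImageNet, Caltech101, XM3600 T → I, and the average.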

Model	Params (M)	ImageNet	Recruit	CIFAR10	CIFAR100	Food101	Caltech101	XM3600 I → T	XM3600 T → I	Average
Japanese CLIP
Rinna ViT-B/16	196	50.6	39.9	90.7	64.0	53.2	84.6	53.8	54.0	61.4
Rinna ViT-B/16 cloob	196	54.6	41.6	88.2	60.3	57.2	80.2	53.4	53.4	61.1
LY ViT-B/16	196	52.0	83.8	96.3	76.7	73.9	88.4	76.9	78.0	78.3
llm-jp-ViT-B/16	248	54.2	59.4	91.8	69.2	82.2	85.6	73.6	72.7	73.6
StabilityAI ViT-L/16	414	62.4	70.5	97.6	84.1	74.0	86.7	67.3	66.0	76.1
llm-jp-ViT-L/14	467	59.5	62.9	96.4	77.0	88.2	87.8	74.1	74.1	77.5
Multilingual CLIP
SigLIP B/16-256 multi	370	51.9	71.2	92.4	65.8	78.6	85.6	45.9	43.0	66.8
jina-clip-v2	865	35.8	48.1	95.1	58.3	52.0	69.4	67.3	66.4	61.6
LAION ViT-H/14 multi	1193	53.0	74.5	97.9	78.4	74.3	85.1	75.0	72.0	76.3
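
The final column of the table appears to be the unweighted mean of the eight task scores in each row. That reading is an assumption, since the text does not state how the average is computed, but it can be checked against the llm-jp-ViT-L/14 row:

```python
# Scores for llm-jp-ViT-L/14 copied from the table above
llm_jp_vit_l14 = {
    "ImageNet": 59.5,
    "Recruit": 62.9,
    "CIFAR10": 96.4,
    "CIFAR100": 77.0,
    "Food101": 88.2,
    "Caltech101": 87.8,
    "XM3600 I->T": 74.1,
    "XM3600 T->I": 74.1,
}

# Unweighted mean over the eight columns (assumed definition of the average column)
average = sum(llm_jp_vit_l14.values()) / len(llm_jp_vit_l14)
print(round(average, 1))
```

The result matches the 77.5 reported in the table, which supports the unweighted-mean reading.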

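📄 License

This model is released under the Apache License, Version 2.0.

Because the training data was translated with gemma-2-9b-it, please refer to the Gemma Terms of Use. We use Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), our model does not fall into the category of models created to behave similarly to Gemma. We therefore conclude that it does not need to inherit the Gemma license.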

Citation

@inproceedings{sugiura-etal-2025-developing,
    title = "Developing {J}apanese {CLIP} Models Leveraging an Open-weight {LLM} for Large-scale Dataset Translation",
    author = "Sugiura, Issa  and
      Kurita, Shuhei  and
      Oda, Yusuke  and
      Kawahara, Daisuke  and
      Okazaki, Naoaki",
    editor = "Ebrahimi, Abteen  and
      Haider, Samar  and
      Liu, Emmy  and
      Haider, Sammar  and
      Leonor Pacheco, Maria  and
      Wein, Shira",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)",
    month = apr,
    year = "2025",
    address = "Albuquerque, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-srw.15/",
    pages = "162--170",
    ISBN = "979-8-89176-192-6",
    abstract = "CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models. However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models. In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset. The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available."
}