llm-jp-clip-vit-base-patch16: a free, open-source Japanese CLIP model for zero-shot image classification

Developed by llm-jp

A Japanese CLIP model trained with the OpenCLIP framework that supports zero-shot image classification.

Safetensors

Language: Japanese · License: Apache-2.0 · #JapaneseCLIP #ZeroShotClassification #ImageTextRetrieval

Downloads: 40

Released: 12/17/2024

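Model overview

This is a Japanese vision-language model that associates images with Japanese text and is particularly suited to zero-shot image classification. It was trained on a dataset of 1.45 billion Japanese image-text pairs and has 248M parameters in total.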

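Model features

Japanese-specific

A CLIP model optimized for Japanese: its text encoder uses llm-jp-tokenizer and is trained on Japanese captions.

Large-scale training data

Trained on 1.45 billion Japanese image-text pairs covering a wide range of visual concepts.

Zero-shot capability

Classifies images into new categories without any task-specific training.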

Model capabilities

Zero-shot image classification

Image-text matching

Cross-modal retrieval

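Use cases

Image classification

Image classification with Japanese labels

Classifies images using Japanese text labels.

Reaches 54.2% accuracy on the Japanese ImageNet classification task.

Cross-modal retrieval

Image retrieval

Retrieves relevant images from Japanese text queries.

Scores 73.6 on XM3600 image-to-text retrieval and 72.7 on text-to-image retrieval.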
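Retrieval with a CLIP-style model usually comes down to ranking candidates by cosine similarity between embeddings. The following is a minimal sketch of that ranking step only. The 3-dimensional embeddings and file names are hypothetical stand-ins; real embeddings would come from the model's image and text encoders, and `rank_images` is an illustrative helper, not part of OpenCLIP.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_images(query_embedding, image_embeddings, top_k=2):
    """Return the top_k (image_id, score) pairs, best match first."""
    scored = [(image_id, cosine(query_embedding, emb))
              for image_id, emb in image_embeddings.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical embeddings (assumption): real ones come from the image encoder.
image_embeddings = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "bird.jpg": [0.0, 0.2, 0.9],
}
# Hypothetical text embedding standing in for the query "猫" (cat).
query_embedding = [1.0, 0.2, 0.0]

results = rank_images(query_embedding, image_embeddings)
print([image_id for image_id, _ in results])
```

In practice the image embeddings would be computed once and cached, so that each new text query only costs one text-encoder pass plus the similarity ranking.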

🚀 llm-jp-clip-vit-base-patch16

This is a Japanese CLIP model trained with OpenCLIP on a large-scale dataset of Japanese image-text pairs. It associates images with text and is useful for tasks such as zero-shot image classification.

🚀 Quick start

This CLIP model specializes in associating images with Japanese text. The sections below cover installation and a usage example.

📦 Installation

$ pip install open_clip_torch

💻 Usage examples

Basic usage

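import open_clip
import requests
import torch
from PIL import Image

# Load the model, its image preprocessing transform, and the tokenizer from the Hugging Face Hub.
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Use the GPU when one is available; otherwise run on the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()

# Download a sample image and turn it into a batch of one.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenizer(["猫", "犬", "鳥"]).to(device)  # candidate labels: cat, dog, bird

# Mixed precision is only enabled on CUDA.
with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == 'cuda')):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so that the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities by 100, then apply softmax over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs.cpu())
# Output reported in the original model card (exact values may vary with device and precision):
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])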
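To make the last step of the example easier to follow, here is a small sketch of the same computation in plain Python: L2-normalize the embeddings, scale the cosine similarities by 100, and apply softmax over the labels. The toy 3-dimensional vectors are made up for illustration and `zero_shot_probs` is a hypothetical helper, not an OpenCLIP function.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Softmax over scaled cosine similarities between one image and several labels."""
    image_vec = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_vec = l2_normalize(text_vec)
        logits.append(scale * sum(a * b for a, b in zip(image_vec, text_vec)))
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (assumption): the image vector is closest to the first label.
image_vec = [0.8, 0.3, 0.1]
text_vecs = [
    [0.9, 0.2, 0.1],  # stands in for "猫" (cat)
    [0.2, 0.9, 0.1],  # stands in for "犬" (dog)
    [0.1, 0.1, 0.9],  # stands in for "鳥" (bird)
]

probs = zero_shot_probs(image_vec, text_vecs)
print(probs)
```

Because cosine similarities lie in a narrow range, the factor of 100 is what turns modest similarity gaps into a sharply peaked distribution, which is why the top label in the example above receives nearly all of the probability mass.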

References

Using OpenCLIP at Hugging Face (Hugging Face Docs)
OpenCLIP repository

📚 Documentation

Model details

This Japanese CLIP model was trained with OpenCLIP on relaion2B-en-research-safe-japanese-translation, a Japanese translation of the English subset of ReLAION-5B produced with gemma-2-9b-it.

The model has 248M parameters in total.

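Training details

Model architecture

Text encoder: RoBERTa base with llm-jp-tokenizer
Image encoder: ViT-B/16

Training data

The model was trained on relaion2B-en-research-safe-japanese-translation. Because the image download success rate was 70%, the training set contains 1.45 billion samples. Training ran for 9 epochs over this set, for a total of about 13 billion samples seen.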
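The figures above can be checked with a few lines of arithmetic. The total number of samples seen follows directly from the stated numbers. The implied size of the source dataset is an inference that assumes the 70% success rate applies uniformly, and it is roughly in line with the "2B" in the dataset name.

```python
# Figures stated in the training-data description.
SAMPLES_PER_EPOCH = 1.45e9      # image-text pairs that downloaded successfully
EPOCHS = 9
DOWNLOAD_SUCCESS_RATE = 0.70    # stated success rate, treated as approximate

# Total samples processed during training.
total_samples_seen = SAMPLES_PER_EPOCH * EPOCHS

# Inferred (not stated in the source): number of pairs before download failures.
implied_source_pairs = SAMPLES_PER_EPOCH / DOWNLOAD_SUCCESS_RATE

print(f"samples seen: {total_samples_seen / 1e9:.2f}B")            # samples seen: 13.05B
print(f"implied source size: {implied_source_pairs / 1e9:.2f}B")   # implied source size: 2.07B
```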

Evaluation

Evaluation code: https://github.com/llm-jp/clip-eval

Table: Performance of each model on zero-shot image classification and image-text retrieval tasks. Bold indicates first place and _underline_ indicates second place.

Model	Params (M)	ImageNet	Recruit	CIFAR10	CIFAR100	Food101	Caltech101	XM3600 I→T	XM3600 T→I	Average
Japanese CLIP
Rinna ViT-B/16	196	50.6	39.9	90.7	64.0	53.2	84.6	53.8	54.0	61.4
Rinna ViT-B/16 cloob	196	54.6	41.6	88.2	60.3	57.2	80.2	53.4	53.4	61.1
LY ViT-B/16	196	52.0	83.8	96.3	76.7	73.9	88.4	76.9	78.0	78.3
llm-jp-ViT-B/16	248	54.2	59.4	91.8	69.2	82.2	85.6	73.6	72.7	73.6
StabilityAI ViT-L/16	414	62.4	70.5	97.6	84.1	74.0	86.7	67.3	66.0	76.1
llm-jp-ViT-L/14	467	59.5	62.9	96.4	77.0	88.2	87.8	74.1	74.1	77.5
Multilingual CLIP
SigLIP B/16-256 multi	370	51.9	71.2	92.4	65.8	78.6	85.6	45.9	43.0	66.8
jina-clip-v2	865	35.8	48.1	95.1	58.3	52.0	69.4	67.3	66.4	61.6
LAION ViT-H/14 multi	1193	53.0	74.5	97.9	78.4	74.3	85.1	75.0	72.0	76.3
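The source does not say how the last column is computed, but the numbers are consistent with an unweighted mean of the eight task columns. The sketch below recomputes it for the two llm-jp rows, with the scores copied from the table. Treat the averaging rule as an inference rather than a documented definition.

```python
from statistics import mean

# The eight task columns of the evaluation table, in order.
TASKS = ["ImageNet", "Recruit", "CIFAR10", "CIFAR100",
         "Food101", "Caltech101", "XM3600 I→T", "XM3600 T→I"]

# Per-task scores and the reported average, copied from the table.
ROWS = {
    "llm-jp-ViT-B/16": ([54.2, 59.4, 91.8, 69.2, 82.2, 85.6, 73.6, 72.7], 73.6),
    "llm-jp-ViT-L/14": ([59.5, 62.9, 96.4, 77.0, 88.2, 87.8, 74.1, 74.1], 77.5),
}

recomputed = {}
for name, (scores, reported) in ROWS.items():
    assert len(scores) == len(TASKS)
    # Assumption: the average is a plain mean over all eight tasks.
    recomputed[name] = mean(scores)
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {reported}")
```

Both recomputed values agree with the reported averages to one decimal place, so classification and retrieval tasks appear to carry equal weight in the summary score.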

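📄 License

The Apache License, Version 2.0

Because the training data was translated with gemma-2-9b-it, please refer to the Gemma Terms of Use. We use Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), our model does not qualify as a model made to perform similarly to Gemma. We therefore conclude that it does not need to inherit the Gemma license.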

Citation

BibTeX:

@inproceedings{sugiura-etal-2025-developing,
    title = "Developing {J}apanese {CLIP} Models Leveraging an Open-weight {LLM} for Large-scale Dataset Translation",
    author = "Sugiura, Issa  and
      Kurita, Shuhei  and
      Oda, Yusuke  and
      Kawahara, Daisuke  and
      Okazaki, Naoaki",
    editor = "Ebrahimi, Abteen  and
      Haider, Samar  and
      Liu, Emmy  and
      Haider, Sammar  and
      Leonor Pacheco, Maria  and
      Wein, Shira",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)",
    month = apr,
    year = "2025",
    address = "Albuquerque, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-srw.15/",
    pages = "162--170",
    ISBN = "979-8-89176-192-6",
    abstract = "CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models. However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models. In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset. The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available."
}