llm-jp-clip-vit-base-patch16: Open-Source Japanese CLIP Model for Zero-Shot Image Classification

Developed by llm-jp

A Japanese CLIP model trained with the OpenCLIP framework. It supports zero-shot image classification.

Tags: zero-shot image classification · Japanese · Safetensors · License: Apache-2.0 · #JapaneseCLIP #ZeroShotClassification #ImageTextRetrieval

Released: 12/17/2024

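Model Overview

This is a Japanese vision-language model that relates images to Japanese text and is particularly suited to zero-shot image classification. It was trained on a dataset of 1.45 billion Japanese image-text pairs and has 248M parameters in total.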

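Model Features

Japanese-specific

A CLIP model trained specifically on Japanese text, so labels and queries can be written directly in Japanese

Large-scale training data

Trained on a dataset of 1.45 billion Japanese image-text pairs covering a wide range of visual concepts

Zero-shot capability

Classifies images into new categories without task-specific training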

Model Capabilities

Zero-shot image classification

Image-text matching

Cross-modal retrieval

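Use Cases

Image classification

Image classification with Japanese labels

Classify images using Japanese text labels

Reaches 54.2% accuracy on the Japanese ImageNet classification task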
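As a rough illustration of how a classification accuracy figure like this can be computed, the sketch below scores a handful of images against Japanese labels and counts how often the highest-scoring label matches the ground truth. It assumes the reported number is top-1 accuracy, which this page does not state. The score matrix and ground-truth indices are made up for illustration and are not model outputs.

```python
def top1_accuracy(score_rows, true_indices):
    """Fraction of images whose highest-scoring label matches the ground truth."""
    correct = 0
    for scores, truth in zip(score_rows, true_indices):
        # Index of the label with the highest score for this image.
        predicted = max(range(len(scores)), key=scores.__getitem__)
        correct += predicted == truth
    return correct / len(true_indices)


labels = ["猫", "犬", "鳥"]  # cat, dog, bird

# Hypothetical per-image label scores (one row per image, one column per label).
scores = [
    [0.90, 0.07, 0.03],
    [0.20, 0.70, 0.10],
    [0.50, 0.30, 0.20],  # ground truth is "鳥", but "猫" scores highest: a miss
    [0.10, 0.10, 0.80],
]
truth = [0, 1, 2, 2]  # hypothetical ground-truth label indices

accuracy = top1_accuracy(scores, truth)
print(f"top-1 accuracy: {accuracy:.1%}")
```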

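Cross-modal retrieval

Image search

Retrieve relevant images using Japanese text queries

Scores 73.6 on XM3600 image-to-text retrieval and 72.7 on text-to-image retrieval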
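Text-to-image search with a CLIP-style model usually comes down to ranking precomputed image embeddings by cosine similarity to the embedding of the text query. The sketch below shows only that ranking step. The three-dimensional vectors and file names are hypothetical stand-ins for the outputs of `model.encode_image` and `model.encode_text`, and `rank_images` is an illustrative helper, not part of OpenCLIP.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_vec, image_vecs, top_k=2):
    """Return the ids of the top_k images most similar to the query embedding."""
    scored = sorted(
        image_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [image_id for image_id, _ in scored[:top_k]]


# Hypothetical precomputed image embeddings (not real model outputs).
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "bird.jpg": [0.0, 0.1, 0.9],
}
# Stands in for the encoded Japanese query, e.g. "猫の写真" (a photo of a cat).
query = [0.8, 0.2, 0.1]

results = rank_images(query, gallery)
print(results)
```

For a large gallery, the image embeddings would typically be computed once and stored, so that only the text query needs to be encoded at search time.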

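🚀 llm-jp-clip-vit-base-patch16 Model

This is a Japanese CLIP model trained with OpenCLIP on a large-scale dataset of Japanese image-text pairs. It can be used for vision-language tasks such as zero-shot image classification.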

🚀 Quick Start

Installation

$ pip install open_clip_torch

Zero-shot image classification example

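import open_clip
import requests
import torch
from PIL import Image

# Load the model, its image preprocessing pipeline, and the tokenizer from the Hugging Face Hub.
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Run on the GPU when one is available; otherwise fall back to the CPU.
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)
text = tokenizer(["猫", "犬", "鳥"]).to(device)  # Japanese labels: cat, dog, bird

# Mixed precision is enabled only on CUDA; on the CPU the block runs in full precision.
with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == 'cuda')):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so that the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities and convert them to probabilities over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Example output (exact values may vary with hardware and precision):
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])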
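To make the scoring step in the example above concrete, here is a minimal pure-Python sketch of what `(100.0 * image_features @ text_features.T).softmax(dim=-1)` computes for a single image: a cosine similarity per label, scaled by 100 and passed through a softmax. The three-dimensional embeddings are invented for illustration and are not model outputs. The helper names (`l2_normalize`, `softmax`, `label_probs`) are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def label_probs(image_vec, text_vecs, scale=100.0):
    """Scaled cosine similarities between one image and each label, as probabilities."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)
    return softmax(logits)


# Toy embeddings, made up for illustration only.
image_embedding = [0.9, 0.1, 0.2]
label_embeddings = {
    "猫": [1.0, 0.0, 0.1],
    "犬": [0.1, 1.0, 0.0],
    "鳥": [0.0, 0.2, 1.0],
}

probs = label_probs(image_embedding, list(label_embeddings.values()))
best_label = max(zip(label_embeddings, probs), key=lambda pair: pair[1])[0]
print(best_label, [round(p, 4) for p in probs])
```

The large scale factor sharpens the distribution, so even a moderate gap in cosine similarity puts most of the probability mass on the best-matching label.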

References:

Using OpenCLIP at Hugging Face, Hugging Face documentation
OpenCLIP repository

✨ Key Features

A Japanese CLIP model trained with OpenCLIP.
Trained on the relaion2B-en-research-safe-japanese-translation dataset.
The model has 248M parameters in total.

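📚 Detailed Documentation

Model Details

This Japanese CLIP model was trained with OpenCLIP on the relaion2B-en-research-safe-japanese-translation dataset. The dataset is a Japanese translation of the English subset of ReLAION-5B (https://huggingface.co/datasets/laion/relaion2B-en-research-safe), produced with gemma-2-9b-it.

The model has 248M parameters in total.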

Training Details

Model Architecture

Text encoder: RoBERTa base with llm-jp-tokenizer.
Image encoder: ViT-B/16.
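The "/16" in ViT-B/16 is the patch size: the image encoder splits each input into 16×16-pixel patches and treats each patch as a token. The short sketch below works out the patch grid for a 224×224 input. The 224-pixel resolution is an assumption (it is the common default for ViT-B/16 and is not stated on this page), and `patch_grid` is a hypothetical helper.

```python
def patch_grid(image_size, patch_size=16):
    """Return (patches per side, total patches) for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side, per_side * per_side


# Assumed input resolution; not confirmed by this page.
per_side, num_patches = patch_grid(224)
print(f"{per_side} x {per_side} grid, {num_patches} patch tokens")
```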

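Training Data

The model was trained on the relaion2B-en-research-safe-japanese-translation dataset. Because only about 70% of the images could be downloaded, the resulting dataset contains 1.45 billion samples. Training ran for 9 epochs, processing about 13 billion samples in total.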
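The training figures above can be cross-checked with a few lines of arithmetic. The dataset size, download success rate, and epoch count come from the text. The implied number of source pairs is a derived estimate that assumes the 70% rate applies uniformly to the source dataset, so treat it as approximate.

```python
dataset_size = 1.45e9          # samples left after image download (from the text)
download_success_rate = 0.70   # fraction of images downloaded (from the text)
epochs = 9                     # training epochs (from the text)

# Total samples processed over the whole training run.
samples_seen = dataset_size * epochs

# Derived estimate of the source dataset size before download failures.
implied_source_pairs = dataset_size / download_success_rate

print(f"samples seen: {samples_seen / 1e9:.2f}B")
print(f"implied source pairs: {implied_source_pairs / 1e9:.2f}B")
```

The product comes to roughly 13 billion samples, which matches the total quoted in the text.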

Evaluation

Evaluation code: https://github.com/llm-jp/clip-eval

Table: Performance of each model on zero-shot image classification and image-text retrieval tasks.

Model	Params (M)	ImageNet	Recruit	CIFAR10	CIFAR100	Food101	Caltech101	XM3600 I→T	XM3600 T→I	Average
Japanese CLIP
Rinna ViT-B/16	196	50.6	39.9	90.7	64.0	53.2	84.6	53.8	54.0	61.4
Rinna ViT-B/16 cloob	196	54.6	41.6	88.2	60.3	57.2	80.2	53.4	53.4	61.1
LY ViT-B/16	196	52.0	83.8	96.3	76.7	73.9	88.4	76.9	78.0	78.3
llm-jp-ViT-B/16	248	54.2	59.4	91.8	69.2	82.2	85.6	73.6	72.7	73.6
StabilityAI ViT-L/16	414	62.4	70.5	97.6	84.1	74.0	86.7	67.3	66.0	76.1
llm-jp-ViT-L/14	467	59.5	62.9	96.4	77.0	88.2	87.8	74.1	74.1	77.5
Multilingual CLIP
SigLIP B/16-256 multi	370	51.9	71.2	92.4	65.8	78.6	85.6	45.9	43.0	66.8
jina-clip-v2	865	35.8	48.1	95.1	58.3	52.0	69.4	67.3	66.4	61.6
LAION ViT-H/14 multi	1193	53.0	74.5	97.9	78.4	74.3	85.1	75.0	72.0	76.3
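The last column of the table appears to be the unweighted mean of the eight task scores in each row. This is inferred from the numbers, not stated on this page. As a quick check under that assumption, the sketch below recomputes the value for the llm-jp ViT-B/16 row from the scores listed in the table.

```python
# Scores for the llm-jp ViT-B/16 row, copied from the table.
llm_jp_vit_b16 = {
    "ImageNet": 54.2,
    "Recruit": 59.4,
    "CIFAR10": 91.8,
    "CIFAR100": 69.2,
    "Food101": 82.2,
    "Caltech101": 85.6,
    "XM3600 I->T": 73.6,
    "XM3600 T->I": 72.7,
}

# Unweighted mean over the eight tasks (assumed definition of the average column).
average = sum(llm_jp_vit_b16.values()) / len(llm_jp_vit_b16)
print(f"average score: {average:.1f}")
```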

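📄 License

Apache License, Version 2.0

Please refer to the Gemma Terms of Use, because the training data was translated with gemma-2-9b-it. We used Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), our model does not fall into the category of a model created "to cause that model to perform similarly to Gemma". We therefore concluded that the Gemma license does not need to be inherited.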

Citation

@inproceedings{sugiura-etal-2025-developing,
    title = "Developing {J}apanese {CLIP} Models Leveraging an Open-weight {LLM} for Large-scale Dataset Translation",
    author = "Sugiura, Issa  and
      Kurita, Shuhei  and
      Oda, Yusuke  and
      Kawahara, Daisuke  and
      Okazaki, Naoaki",
    editor = "Ebrahimi, Abteen  and
      Haider, Samar  and
      Liu, Emmy  and
      Haider, Sammar  and
      Leonor Pacheco, Maria  and
      Wein, Shira",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)",
    month = apr,
    year = "2025",
    address = "Albuquerque, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-srw.15/",
    pages = "162--170",
    ISBN = "979-8-89176-192-6",
    abstract = "CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models. However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models. In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset. The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available."
}