llm-jp-clip-vit-large-patch14: an open-source Japanese CLIP model for zero-shot image classification and image-text retrieval

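Developed by llm-jp

A Japanese CLIP model trained with the OpenCLIP framework on a dataset of 1.45 billion Japanese image-text pairs. It supports zero-shot image classification and image-text retrieval.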

Language: Japanese | License: Apache-2.0 | Format: Safetensors

Tags: Japanese CLIP, zero-shot classification, image-text retrieval

Release date: 12/27/2024

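Model overview

This is a Japanese vision-language model that maps images and Japanese text into a shared embedding space, which enables zero-shot image classification and cross-modal retrieval.

Model features

Large-scale Japanese training data

Trained on roughly 1.5 billion Japanese image-text pairs (1.45 billion after image download), obtained by machine-translating English captions with gemma-2-9b-it.

Competitive vision-language performance

Achieves competitive average scores against models of similar size on the benchmarks listed in the evaluation table, although the paper's abstract reports relatively low performance on tasks specific to Japanese culture.

Zero-shot classification

Classifies images without any task-specific fine-tuning.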

Model capabilities

Zero-shot image classification

Image-text similarity scoring

Cross-modal retrieval

Semantic image understanding

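Use cases

Content moderation

Policy-violation detection

Detect policy-violating content in images by matching them against text descriptions.

E-commerce

Product search

Find relevant product images from a natural-language description.

Media analysis

Image tagging

Assign Japanese tags to images by scoring each image against a list of candidate labels.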

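🚀 llm-jp-clip-vit-large-patch14 model

This project is a Japanese CLIP model trained with OpenCLIP on large-scale Japanese image-text pairs. It can be used for vision-language tasks such as zero-shot image classification and image-text retrieval.

🚀 Quick start

Installation

$ pip install open_clip_torch

Zero-shot image classification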

import open_clip
import requests
import torch
from PIL import Image

# Load the model, its image preprocessing transform, and the tokenizer from the Hugging Face Hub.
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')

# Download a sample COCO image and turn it into a batch of size 1.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
# Candidate labels in Japanese: cat, dog, bird.
text = tokenizer(["猫", "犬", "鳥"])

with torch.no_grad(), torch.cuda.amp.autocast():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so that the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities by 100 and apply a softmax over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])
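
The last step of the example turns cosine similarities into label probabilities by scaling them by 100 and applying a softmax. The sketch below reproduces that arithmetic in plain Python. The similarity values are made-up illustrative numbers, not outputs of the model, and `softmax_probs` is a hypothetical helper name.

```python
import math

def softmax_probs(similarities, scale=100.0):
    """Scale cosine similarities and normalize them with a softmax."""
    logits = [scale * s for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Illustrative cosine similarities for three labels (assumed values).
sims = [0.25, 0.20, 0.18]
probs = softmax_probs(sims)
print([round(p, 4) for p in probs])
```

Because of the factor of 100, a gap of only 0.05 in cosine similarity already pushes almost all of the probability mass onto the top label, which is why the output of the example above is so peaked.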

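References:

Using OpenCLIP at Hugging Face, Hugging Face documentation
OpenCLIP repository

✨ Key features

Trained with OpenCLIP; usable for zero-shot image classification and image-text retrieval.
Trained on large-scale Japanese image-text pairs, with 467 million parameters in total.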

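📚 Detailed documentation

Model details

This Japanese CLIP model was trained with OpenCLIP on relaion2B-en-research-safe-japanese-translation, a Japanese translation of the English subset of ReLAION-5B produced with gemma-2-9b-it. The model has 467 million parameters in total.

Training details

Model architecture

Text encoder: RoBERTa base with llm-jp-tokenizer
Image encoder: ViT-L/14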
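
The "/14" in ViT-L/14 is the side length of each image patch in pixels. As a rough sketch, the number of patches the image encoder processes can be computed as follows. The 224×224 input resolution is an assumption (a common setting for ViT-L/14 models, not stated in this document), and `num_patches` is a hypothetical helper.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Number of non-overlapping square patches in a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    per_side = image_size // patch_size
    return per_side * per_side

# Assumed input resolution of 224 pixels; patch size 14 comes from "ViT-L/14".
print(num_patches(224, 14))  # 256
```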

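Training data

The model was trained on relaion2B-en-research-safe-japanese-translation. Because the image download success rate was 70%, the dataset contains 1.45 billion samples. Training ran for 9 epochs over it (13 billion samples seen in total).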
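
The figures above can be checked with a few lines of arithmetic. This is a sketch using only the rounded numbers quoted in the text, so the implied size of the source caption set is an estimate rather than a reported value.

```python
dataset_size = 1_450_000_000  # samples available after image download
epochs = 9
download_success_rate = 0.70

# Total samples seen during training: 1.45B x 9, which the text rounds to 13B.
samples_seen = dataset_size * epochs
print(f"{samples_seen / 1e9:.2f}B samples seen")  # 13.05B samples seen

# Estimated number of captions before download failures (derived, not reported).
implied_source_size = dataset_size / download_success_rate
print(f"{implied_source_size / 1e9:.2f}B source captions (estimate)")  # 2.07B source captions (estimate)
```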

Evaluation

Evaluation code: https://github.com/llm-jp/clip-eval

Table: Performance of each model on zero-shot image classification and image-text retrieval tasks. Bold indicates the best score and underline the second best.

Model	Params (M)	ImageNet	Recruit	CIFAR10	CIFAR100	Food101	Caltech101	XM3600 I→T	XM3600 T→I	Average
Japanese CLIP
Rinna ViT-B/16	196	50.6	39.9	90.7	64.0	53.2	84.6	53.8	54.0	61.4
Rinna ViT-B/16 cloob	196	54.6	41.6	88.2	60.3	57.2	80.2	53.4	53.4	61.1
LY ViT-B/16	196	52.0	83.8	96.3	76.7	73.9	88.4	76.9	78.0	78.3
llm-jp-ViT-B/16	248	54.2	59.4	91.8	69.2	82.2	85.6	73.6	72.7	73.6
StabilityAI ViT-L/16	414	62.4	70.5	97.6	84.1	74.0	86.7	67.3	66.0	76.1
llm-jp-ViT-L/14	467	59.5	62.9	96.4	77.0	88.2	87.8	74.1	74.1	77.5
Multilingual CLIP
SigLIP B/16-256 multi	370	51.9	71.2	92.4	65.8	78.6	85.6	45.9	43.0	66.8
jina-clip-v2	865	35.8	48.1	95.1	58.3	52.0	69.4	67.3	66.4	61.6
LAION ViT-H/14 multi	1193	53.0	74.5	97.9	78.4	74.3	85.1	75.0	72.0	76.3
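
The last column of the table appears to be the unweighted mean of the eight task scores in each row. The sketch below checks that reading against three rows copied from the table. The interpretation is an assumption inferred from the numbers, not something the document states.

```python
# Eight task scores per model, copied from the table (ImageNet ... XM3600 T→I).
task_scores = {
    "llm-jp-ViT-L/14": [59.5, 62.9, 96.4, 77.0, 88.2, 87.8, 74.1, 74.1],
    "StabilityAI ViT-L/16": [62.4, 70.5, 97.6, 84.1, 74.0, 86.7, 67.3, 66.0],
    "LAION ViT-H/14 multi": [53.0, 74.5, 97.9, 78.4, 74.3, 85.1, 75.0, 72.0],
}
# Average column as reported in the table.
reported_average = {
    "llm-jp-ViT-L/14": 77.5,
    "StabilityAI ViT-L/16": 76.1,
    "LAION ViT-H/14 multi": 76.3,
}

for name, scores in task_scores.items():
    mean = sum(scores) / len(scores)
    # Allow for the table's rounding to one decimal place.
    matches = abs(mean - reported_average[name]) < 0.05
    print(f"{name}: mean={mean:.3f}, reported={reported_average[name]}, matches={matches}")
```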

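🔧 Technical details

The model is trained with the OpenCLIP framework and pairs a text encoder with an image encoder so that it learns the association between images and text.
Training on large-scale Japanese image-text pairs lets the model perform well on zero-shot image classification and image-text retrieval.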
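
Since both encoders map into the same embedding space, image-text retrieval reduces to ranking images by cosine similarity to a text embedding. The following is a minimal sketch of that ranking step in plain Python. The 3-dimensional vectors and file names are toy values chosen for illustration (real embeddings would come from `model.encode_image` and `model.encode_text`), and `rank_images` is a hypothetical helper.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_images(text_embedding, image_embeddings):
    """Return (image_id, cosine_similarity) pairs, best match first."""
    query = l2_normalize(text_embedding)
    scored = []
    for image_id, embedding in image_embeddings.items():
        unit = l2_normalize(embedding)
        score = sum(q * e for q, e in zip(query, unit))
        scored.append((image_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy gallery of image embeddings (assumed values, hypothetical file names).
gallery = {
    "cat.jpg": [0.9, 0.1, 0.0],
    "dog.jpg": [0.1, 0.9, 0.1],
    "bird.jpg": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # stands in for an encoded text query

ranking = rank_images(query_embedding, gallery)
print([image_id for image_id, _ in ranking])  # ['cat.jpg', 'dog.jpg', 'bird.jpg']
```

For a large gallery, the image embeddings would normally be normalized once and stored, so that each query costs a single matrix-vector product.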

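📄 License

Apache License, Version 2.0

Because the training data was translated with gemma-2-9b-it, please refer to the Gemma Terms of Use. We used Gemma only for translation. Under the definition of "Model Derivatives" in Section 1.1(e), our model does not fall into the category of "models created in order to cause that model to perform similarly to Gemma". We therefore concluded that it is not necessary to inherit the Gemma license.

Citation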

@inproceedings{sugiura-etal-2025-developing,
    title = "Developing {J}apanese {CLIP} Models Leveraging an Open-weight {LLM} for Large-scale Dataset Translation",
    author = "Sugiura, Issa  and
      Kurita, Shuhei  and
      Oda, Yusuke  and
      Kawahara, Daisuke  and
      Okazaki, Naoaki",
    editor = "Ebrahimi, Abteen  and
      Haider, Samar  and
      Liu, Emmy  and
      Haider, Sammar  and
      Leonor Pacheco, Maria  and
      Wein, Shira",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)",
    month = apr,
    year = "2025",
    address = "Albuquerque, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-srw.15/",
    pages = "162--170",
    ISBN = "979-8-89176-192-6",
    abstract = "CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models. However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models. In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset. The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available."
}