llm-jp-clip-vit-base-patch16: Open-Source Japanese CLIP Model for Zero-Shot Image Classification

Developed by llm-jp

A Japanese CLIP model trained with the OpenCLIP framework that supports zero-shot image classification.

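Zero-shot image classification

Safetensors

Language: Japanese | License: Apache-2.0 | #JapaneseCLIP #ZeroShotClassification #ImageTextRetrieval

Downloads: 40

Released: December 17, 2024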

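Model Overview

This is a Japanese vision-language model that associates images with Japanese text and is particularly suited to zero-shot image classification. It was trained on a dataset of 1.45 billion Japanese image-text pairs and has 248M parameters in total.

Model Features

Japanese-specific

A CLIP model built specifically for Japanese text, with competitive scores on Japanese benchmarks (see the evaluation table).

Large-scale training data

Trained on 1.45 billion Japanese image-text pairs covering a wide range of visual concepts.

Zero-shot capability

Classifies images into new categories without task-specific training.

Model Capabilities

Zero-shot image classification

Image-text matching

Cross-modal retrieval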

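Use Cases

Image classification

Image classification with Japanese labels

Classify images using Japanese text labels.

Reaches 54.2% accuracy on the Japanese ImageNet classification task.

Cross-modal retrieval

Image search

Retrieve relevant images with Japanese text queries.

On XM3600, scores 73.6 on image-to-text retrieval and 72.7 on text-to-image retrieval.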
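Text-to-image search with a CLIP-style model comes down to ranking image embeddings by cosine similarity to the query's text embedding. The sketch below illustrates only that ranking step in plain Python. The helper names `l2_normalize` and `rank_images` and the toy vectors are hypothetical; in practice the embeddings would come from `model.encode_text` and `model.encode_image`.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def rank_images(text_embedding, image_embeddings, top_k=3):
    """Return (image_index, cosine_similarity) pairs, best match first."""
    query = l2_normalize(text_embedding)
    scored = []
    for index, embedding in enumerate(image_embeddings):
        candidate = l2_normalize(embedding)
        similarity = sum(q * c for q, c in zip(query, candidate))
        scored.append((index, similarity))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional embeddings (hypothetical values, not model outputs)
toy_images = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.2], [0.5, 0.5, 0.5]]
toy_query = [1.0, 0.0, 0.0]

ranking = rank_images(toy_query, toy_images, top_k=2)
print([index for index, _ in ranking])  # → [0, 2]
```

For a real image collection, the image embeddings would typically be computed once and cached, so that each query costs only one text encoding plus the similarity ranking.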

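🚀 llm-jp-clip-vit-base-patch16 Model

This is a Japanese CLIP model trained with OpenCLIP on a large-scale dataset of Japanese image-text pairs. It can be used for vision-language tasks such as zero-shot image classification.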

🚀 Quick Start

Installation

$ pip install open_clip_torch

Zero-shot image classification example

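import open_clip
import requests
import torch
from PIL import Image

# Load the model, its image preprocessing transform, and the tokenizer from the Hugging Face Hub
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-base-patch16')

# Use the GPU when available; mixed precision is only enabled on CUDA
device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = model.to(device).eval()

url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0).to(device)
# Candidate labels: cat, dog, bird
text = tokenizer(["猫", "犬", "鳥"]).to(device)

with torch.no_grad(), torch.autocast(device_type=device, enabled=(device == 'cuda')):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so the dot product below is a cosine similarity
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities by 100 and convert them to probabilities over the labels
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Example output (exact values may vary with device and precision):
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])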
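The last step of the example multiplies the cosine similarities by 100 before the softmax, which makes small similarity gaps decisive: a gap of 0.05 becomes a logit gap of 5, a probability ratio of roughly e^5 ≈ 148. The following is a minimal pure-Python sketch of that step. The function name `label_probabilities` and the similarity values are hypothetical and are not outputs of the model.

```python
import math


def label_probabilities(cosine_similarities, scale=100.0):
    """Softmax over scaled cosine similarities, one probability per label."""
    logits = [scale * s for s in cosine_similarities]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities for the labels 猫, 犬, 鳥 (not model outputs)
probs = label_probabilities([0.30, 0.25, 0.22])
print([round(p, 4) for p in probs])  # → [0.993, 0.0067, 0.0003]
```

With a scale of 1 instead of 100, the same similarities would give nearly uniform probabilities, which is why the scaling matters when reading the output as a classification.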

References:

Using OpenCLIP at Hugging Face, Hugging Face docs
OpenCLIP repository

✨ Key Features

A Japanese CLIP model trained with OpenCLIP.
Trained on the relaion2B-en-research-safe-japanese-translation dataset.
248M parameters in total.

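📚 Documentation

Model Details

This Japanese CLIP model was trained with OpenCLIP on the relaion2B-en-research-safe-japanese-translation dataset, a Japanese translation of the English subset of ReLAION-5B (https://huggingface.co/datasets/laion/relaion2B-en-research-safe) produced with gemma-2-9b-it.

The model has 248M parameters in total.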

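Training Details

Model Architecture

Text encoder: RoBERTa base with llm-jp-tokenizer.
Image encoder: ViT-B/16.

Training Data

The model was trained on the relaion2B-en-research-safe-japanese-translation dataset. Because the image download success rate was 70%, the dataset contains 1.45 billion samples. Training ran for 9 epochs, processing 13 billion samples in total.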
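The figures in this paragraph can be cross-checked with a few lines of arithmetic. The snippet below is only a sanity check: the three inputs are taken from the text, while the pre-download pair count is an estimate implied by the 70% success rate, not a number stated in the source.

```python
dataset_size = 1.45e9          # image-text pairs after download (from the text)
epochs = 9                     # from the text
download_success_rate = 0.70   # from the text

# Total samples processed during training
samples_seen = dataset_size * epochs

# Implied number of pairs before download failures (an estimate, not stated in the text)
implied_source_pairs = dataset_size / download_success_rate

print(f"samples seen: {samples_seen / 1e9:.2f}B")                 # → samples seen: 13.05B
print(f"implied source pairs: {implied_source_pairs / 1e9:.2f}B")  # → implied source pairs: 2.07B
```

The product of 13.05 billion agrees with the rounded total of 13 billion samples.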

Evaluation

Evaluation code: https://github.com/llm-jp/clip-eval

Table: Performance of each model on zero-shot image classification and image-text retrieval tasks.

Model	Params (M)	ImageNet	Recruit	CIFAR10	CIFAR100	Food101	Caltech101	XM3600 I→T	XM3600 T→I	Average
Japanese CLIP
Rinna ViT-B/16	196	50.6	39.9	90.7	64.0	53.2	84.6	53.8	54.0	61.4
Rinna ViT-B/16 cloob	196	54.6	41.6	88.2	60.3	57.2	80.2	53.4	53.4	61.1
LY ViT-B/16	196	52.0	83.8	96.3	76.7	73.9	88.4	76.9	78.0	78.3
llm-jp-ViT-B/16	248	54.2	59.4	91.8	69.2	82.2	85.6	73.6	72.7	73.6
StabilityAI ViT-L/16	414	62.4	70.5	97.6	84.1	74.0	86.7	67.3	66.0	76.1
llm-jp-ViT-L/14	467	59.5	62.9	96.4	77.0	88.2	87.8	74.1	74.1	77.5
Multilingual CLIP
SigLIP B/16-256 multi	370	51.9	71.2	92.4	65.8	78.6	85.6	45.9	43.0	66.8
jina-clip-v2	865	35.8	48.1	95.1	58.3	52.0	69.4	67.3	66.4	61.6
LAION ViT-H/14 multi	1193	53.0	74.5	97.9	78.4	74.3	85.1	75.0	72.0	76.3
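The average column appears to be the unweighted mean of the eight task columns, rounded to one decimal place. The short check below recomputes it from the scores in the table; it is an assumption about how the average was derived, and the variable names are illustrative only.

```python
# Scores copied from the table, in column order: ImageNet, Recruit, CIFAR10,
# CIFAR100, Food101, Caltech101, XM3600 I→T, XM3600 T→I; then the reported average.
rows = {
    "Rinna ViT-B/16": ([50.6, 39.9, 90.7, 64.0, 53.2, 84.6, 53.8, 54.0], 61.4),
    "Rinna ViT-B/16 cloob": ([54.6, 41.6, 88.2, 60.3, 57.2, 80.2, 53.4, 53.4], 61.1),
    "LY ViT-B/16": ([52.0, 83.8, 96.3, 76.7, 73.9, 88.4, 76.9, 78.0], 78.3),
    "llm-jp-ViT-B/16": ([54.2, 59.4, 91.8, 69.2, 82.2, 85.6, 73.6, 72.7], 73.6),
    "StabilityAI ViT-L/16": ([62.4, 70.5, 97.6, 84.1, 74.0, 86.7, 67.3, 66.0], 76.1),
    "llm-jp-ViT-L/14": ([59.5, 62.9, 96.4, 77.0, 88.2, 87.8, 74.1, 74.1], 77.5),
    "SigLIP B/16-256 multi": ([51.9, 71.2, 92.4, 65.8, 78.6, 85.6, 45.9, 43.0], 66.8),
    "jina-clip-v2": ([35.8, 48.1, 95.1, 58.3, 52.0, 69.4, 67.3, 66.4], 61.6),
    "LAION ViT-H/14 multi": ([53.0, 74.5, 97.9, 78.4, 74.3, 85.1, 75.0, 72.0], 76.3),
}


def mean(scores):
    """Unweighted arithmetic mean of a list of scores."""
    return sum(scores) / len(scores)


for name, (scores, reported) in rows.items():
    print(f"{name}: computed {mean(scores):.2f}, reported {reported}")

# Largest difference between a recomputed mean and the reported average
max_gap = max(abs(mean(scores) - reported) for scores, reported in rows.values())
```

Every recomputed mean lands within rounding distance (0.05) of the reported value, which supports reading the column as a plain mean across tasks.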

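📄 License

Apache License, Version 2.0

Please refer to the Gemma Terms of Use, since the training data was translated using gemma-2-9b-it. We used Gemma solely for translation. According to the definition of "Model Derivatives" in Section 1.1(e), our model does not fall under the category of a model created "in order to cause that model to perform similarly to Gemma." We therefore conclude that the model does not need to inherit the Gemma license.

Citation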

@inproceedings{sugiura-etal-2025-developing,
    title = "Developing {J}apanese {CLIP} Models Leveraging an Open-weight {LLM} for Large-scale Dataset Translation",
    author = "Sugiura, Issa  and
      Kurita, Shuhei  and
      Oda, Yusuke  and
      Kawahara, Daisuke  and
      Okazaki, Naoaki",
    editor = "Ebrahimi, Abteen  and
      Haider, Samar  and
      Liu, Emmy  and
      Haider, Sammar  and
      Leonor Pacheco, Maria  and
      Wein, Shira",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)",
    month = apr,
    year = "2025",
    address = "Albuquerque, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-srw.15/",
    pages = "162--170",
    ISBN = "979-8-89176-192-6",
    abstract = "CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models. However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models. In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset. The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available."
}