llm-jp-clip-vit-large-patch14: Open-Source Japanese CLIP Model for Zero-Shot Image Classification and Image-Text Retrieval


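Developed by llm-jp

A Japanese CLIP model trained with the OpenCLIP framework on a dataset of 1.45 billion Japanese image-text pairs. It supports zero-shot image classification and image-text retrieval.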


Language: Japanese | License: Apache-2.0 | Tags: #JapaneseCLIP #ZeroShotClassification #ImageTextRetrieval


Release date: 12/27/2024

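Model Overview

A Japanese vision-language model that maps images and Japanese text into a shared embedding space, enabling zero-shot image classification and cross-modal retrieval.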

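Model Features

Large-scale Japanese training data

Trained on a dataset of roughly 1.5 billion Japanese image-text pairs produced by machine translation with an open-weight LLM (1.45 billion pairs after image download).

Competitive vision-language performance

Achieves competitive average scores across benchmarks compared with models of similar size. Performance on tasks specific to Japanese culture is relatively low, which reflects a limitation of translation-based training data.

Zero-shot classification

Classifies images without task-specific fine-tuning.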

Model Capabilities

Zero-shot image classification

Image-text similarity scoring

Cross-modal retrieval

Semantic image understanding

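Use Cases

Content moderation

Policy-violating content detection

Flag images whose content matches text descriptions of prohibited material.

E-commerce

Product search

Find relevant product images from a natural-language description.

Media analysis

Image tagging

Assign Japanese tags to images automatically by scoring each image against a list of candidate labels.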
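
The product-search use case above amounts to ranking stored image embeddings by cosine similarity to a text-query embedding. The following is a minimal sketch of that ranking step only. The file names, the three-dimensional vectors, and the helper `top_k_images` are hypothetical; in practice the vectors would come from the model's image and text encoders.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_images(query_vec, image_index, k=2):
    """Return the k (image_id, score) pairs most similar to the query embedding."""
    scored = [(image_id, cosine(query_vec, vec)) for image_id, vec in image_index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical catalogue: toy vectors standing in for image embeddings.
image_index = {
    "red_sneakers.jpg": [0.9, 0.1, 0.1],
    "blue_sneakers.jpg": [0.6, 0.7, 0.1],
    "leather_bag.jpg": [0.1, 0.2, 0.9],
}
# Toy stand-in for the text embedding of a query such as "赤いスニーカー".
query_embedding = [1.0, 0.0, 0.1]

results = top_k_images(query_embedding, image_index, k=2)
print([image_id for image_id, _ in results])
```

The same ranking can serve the tagging use case by swapping roles: embed one image, score it against the embeddings of candidate tags, and keep the best matches.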

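🚀 llm-jp-clip-vit-large-patch14

This is a Japanese CLIP model trained with OpenCLIP on large-scale Japanese image-text pairs. It can be used for vision-language tasks such as zero-shot image classification and image-text retrieval.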

🚀 Quick Start

Installation

$ pip install open_clip_torch

Zero-shot image classification

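import open_clip
import requests
import torch
from PIL import Image

# Load the pretrained model, its image preprocessing transform, and the tokenizer from the Hugging Face Hub.
model, preprocess = open_clip.create_model_from_pretrained('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')
tokenizer = open_clip.get_tokenizer('hf-hub:llm-jp/llm-jp-clip-vit-large-patch14')

# Download a sample image and turn it into a batch of one.
url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
image = Image.open(requests.get(url, stream=True).raw)
image = preprocess(image).unsqueeze(0)
# Candidate labels in Japanese: cat, dog, bird.
text = tokenizer(["猫", "犬", "鳥"])

# Inference only, so gradients are disabled. The autocast context affects CUDA ops only.
with torch.no_grad(), torch.autocast(device_type="cuda"):
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # L2-normalize so that the dot product below is a cosine similarity.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)

    # Scale the similarities and convert them to per-label probabilities.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
# Label probs: tensor([[9.9425e-01, 5.2273e-03, 5.2600e-04]])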
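
To make the scoring step above concrete, here is a minimal pure-Python sketch of what `(100.0 * image_features @ text_features.T).softmax(dim=-1)` computes for a single image. The three-dimensional vectors are made-up stand-ins for real embeddings, and the helper names (`l2_normalize`, `softmax`, `zero_shot_probs`) are illustrative, not part of OpenCLIP.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def zero_shot_probs(image_vec, text_vecs, scale=100.0):
    """Return one probability per candidate label for a single image."""
    image_unit = l2_normalize(image_vec)
    logits = []
    for text_vec in text_vecs:
        text_unit = l2_normalize(text_vec)
        cosine = sum(a * b for a, b in zip(image_unit, text_unit))
        logits.append(scale * cosine)
    return softmax(logits)

# Toy embeddings (hypothetical values, not produced by the model).
image_embedding = [0.9, 0.1, 0.0]
label_embeddings = {
    "猫": [1.0, 0.0, 0.0],
    "犬": [0.0, 1.0, 0.0],
    "鳥": [0.0, 0.0, 1.0],
}

probs = zero_shot_probs(image_embedding, list(label_embeddings.values()))
best_label = max(zip(label_embeddings, probs), key=lambda pair: pair[1])[0]
print(best_label, probs)
```

Because the similarities are multiplied by 100 before the softmax, even a modest gap in cosine similarity turns into a very peaked distribution, which is why the top label in the example output sits above 0.99.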

References:

Using OpenCLIP at Hugging Face (Hugging Face documentation)
OpenCLIP repository

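✨ Key Features

Trained with OpenCLIP; usable for zero-shot image classification and image-text retrieval.
Trained on large-scale Japanese image-text pairs, with 467M parameters in total.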

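📚 Documentation

Model Details

This Japanese CLIP model was trained with OpenCLIP on relaion2B-en-research-safe-japanese-translation, a Japanese translation of the English subset of ReLAION-5B produced with gemma-2-9b-it. The model has 467M parameters in total.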

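Training Details

Model Architecture

Text encoder: RoBERTa base with llm-jp-tokenizer
Image encoder: ViT-L/14

Training Data

The model was trained on relaion2B-en-research-safe-japanese-translation. Because the image download success rate was 70%, the dataset contains 1.45 billion samples. Training ran for 9 epochs (13 billion samples in total).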
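
The training-data figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in the text. The estimate of the pre-download dataset size is an assumption (it treats the 70% success rate as exact and uniform) and is not a figure stated by the authors.

```python
# Figures quoted in the text above.
samples_per_epoch = 1.45e9    # image-text pairs that downloaded successfully
epochs = 9
download_success_rate = 0.70

# Total samples processed over the whole training run.
samples_seen = samples_per_epoch * epochs

# Assumption: if 70% of the listed images downloaded, the source list
# held roughly samples_per_epoch / 0.70 entries.
estimated_source_pairs = samples_per_epoch / download_success_rate

print(f"samples seen: {samples_seen / 1e9:.2f} billion")
# samples seen: 13.05 billion
print(f"estimated pairs before download: {estimated_source_pairs / 1e9:.2f} billion")
# estimated pairs before download: 2.07 billion
```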

Evaluation

Evaluation code: https://github.com/llm-jp/clip-eval

Table: Performance of each model on zero-shot image classification and image-text retrieval tasks. **Bold** indicates first place and _underline_ indicates second place.

Model	Params (M)	ImageNet	Recruit	CIFAR10	CIFAR100	Food101	Caltech101	XM3600 I→T	XM3600 T→I	Avg.
Japanese CLIP
Rinna ViT-B/16	196	50.6	39.9	90.7	64.0	53.2	84.6	53.8	54.0	61.4
Rinna ViT-B/16 cloob	196	54.6	41.6	88.2	60.3	57.2	80.2	53.4	53.4	61.1
LY ViT-B/16	196	52.0	**83.8**	96.3	76.7	73.9	**88.4**	**76.9**	**78.0**	**78.3**
llm-jp-ViT-B/16	248	54.2	59.4	91.8	69.2	_82.2_	85.6	73.6	72.7	73.6
StabilityAI ViT-L/16	414	**62.4**	70.5	_97.6_	**84.1**	74.0	86.7	67.3	66.0	76.1
llm-jp-ViT-L/14	467	_59.5_	62.9	96.4	77.0	**88.2**	_87.8_	74.1	_74.1_	_77.5_
Multilingual CLIP
SigLIP B/16-256 multi	370	51.9	71.2	92.4	65.8	78.6	85.6	45.9	43.0	66.8
jina-clip-v2	865	35.8	48.1	95.1	58.3	52.0	69.4	67.3	66.4	61.6
LAION ViT-H/14 multi	1193	53.0	_74.5_	**97.9**	_78.4_	74.3	85.1	_75.0_	72.0	76.3
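
The final average-score column appears to be the unweighted mean of the eight task scores in each row. That is an inference from the numbers, not something the table states. A quick check for the llm-jp ViT-L/14 row:

```python
# Scores for the llm-jp ViT-L/14 row, copied from the table in column order.
scores = {
    "ImageNet": 59.5,
    "Recruit": 62.9,
    "CIFAR10": 96.4,
    "CIFAR100": 77.0,
    "Food101": 88.2,
    "Caltech101": 87.8,
    "XM3600 I->T": 74.1,
    "XM3600 T->I": 74.1,
}

# Unweighted mean over all eight tasks (assumed aggregation rule).
mean_score = sum(scores.values()) / len(scores)
print(round(mean_score, 1))  # 77.5, matching the value reported in the table
```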

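🔧 Technical Details

The model is trained with the OpenCLIP framework and pairs a text encoder with an image encoder to learn associations between images and text.
Training on large-scale Japanese image-text pairs yields competitive performance on zero-shot image classification and image-text retrieval.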
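
As a rough illustration of how a text encoder and an image encoder can learn these associations, the sketch below implements the symmetric contrastive loss commonly used for CLIP-style training. It assumes the standard CLIP objective; the source does not spell out the loss or its settings. The function names and the toy similarity matrices are hypothetical, and the scale is fixed at 100 for simplicity, whereas in training it is normally a learned parameter.

```python
import math

def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    peak = max(row)
    log_total = peak + math.log(sum(math.exp(z - peak) for z in row))
    return [z - log_total for z in row]

def clip_contrastive_loss(similarity, scale=100.0):
    """Symmetric cross-entropy over a square image-by-text cosine-similarity matrix.

    similarity[i][j] is the cosine similarity of image i and text j. The
    matching caption for image i is assumed to be text i (the diagonal).
    """
    n = len(similarity)
    logits = [[scale * s for s in row] for row in similarity]
    # Image-to-text direction: each row should put its mass on the diagonal.
    image_to_text = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: the same computation on the transposed matrix.
    columns = [list(col) for col in zip(*logits)]
    text_to_image = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return (image_to_text + text_to_image) / 2

# Toy batches: in `aligned` each image is most similar to its own caption,
# in `shuffled` the first two captions are swapped.
aligned = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.1], [0.0, 0.1, 0.7]]
shuffled = [[0.1, 0.9, 0.0], [0.8, 0.2, 0.1], [0.0, 0.1, 0.7]]

loss_aligned = clip_contrastive_loss(aligned)
loss_shuffled = clip_contrastive_loss(shuffled)
print(loss_aligned, loss_shuffled)
```

Minimizing this loss pulls matching image-text pairs together in the shared embedding space and pushes mismatched pairs apart, which is what makes the similarity-based classification and retrieval described earlier possible.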

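📄 License

Apache License, Version 2.0

Because the training data was translated with gemma-2-9b-it, please refer to the Gemma Terms of Use. We used Gemma solely for translation. Under the definition of "Model Derivatives" in Section 1.1(e), our model does not fall under the category of a "model in order to cause that model to perform similarly to Gemma". We therefore believe it is not necessary to inherit the Gemma license.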

Citation

@inproceedings{sugiura-etal-2025-developing,
    title = "Developing {J}apanese {CLIP} Models Leveraging an Open-weight {LLM} for Large-scale Dataset Translation",
    author = "Sugiura, Issa  and
      Kurita, Shuhei  and
      Oda, Yusuke  and
      Kawahara, Daisuke  and
      Okazaki, Naoaki",
    editor = "Ebrahimi, Abteen  and
      Haider, Samar  and
      Liu, Emmy  and
      Haider, Sammar  and
      Leonor Pacheco, Maria  and
      Wein, Shira",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 4: Student Research Workshop)",
    month = apr,
    year = "2025",
    address = "Albuquerque, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-srw.15/",
    pages = "162--170",
    ISBN = "979-8-89176-192-6",
    abstract = "CLIP is a foundational model that bridges images and text, widely adopted as a key component in numerous vision-language models. However, the lack of large-scale open Japanese image-text pairs poses a significant barrier to the development of Japanese vision-language models. In this study, we constructed a Japanese image-text pair dataset with 1.5 billion examples using machine translation with open-weight LLMs and pre-trained Japanese CLIP models on the dataset. The performance of the pre-trained models was evaluated across seven benchmark datasets, achieving competitive average scores compared to models of similar size without the need for extensive data curation. However, the results also revealed relatively low performance on tasks specific to Japanese culture, highlighting the limitations of translation-based approaches in capturing cultural nuances. Our dataset, models, and code are publicly available."
}