
llm-jp-clip-vit-base-patch16

Developed by llm-jp
A Japanese CLIP model trained with the OpenCLIP framework, supporting zero-shot image classification tasks
Downloads: 40
Release Time: 12/17/2024

Model Overview

This is a Japanese vision-language model that associates images with Japanese text, making it particularly well suited to zero-shot image classification. The model has 248M parameters and was trained on a dataset of 1.45 billion Japanese image-text pairs.

Model Features

Japanese-specific
A CLIP model optimized specifically for Japanese, with strong Japanese text understanding
Large-scale training data
Trained on 1.45 billion Japanese image-text pairs, covering a wide range of visual concepts
Zero-shot capability
Can classify images into new categories without any task-specific training

Model Capabilities

Zero-shot image classification
Image-text matching
Cross-modal retrieval

Use Cases

Image classification
Japanese-labeled image classification
Classify images using Japanese text labels, as sketched in the example below
Achieves 54.2% accuracy on the Japanese ImageNet classification task
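
A minimal zero-shot classification sketch in Python. It assumes the model can be loaded through open_clip's Hugging Face Hub interface, with the hub id llm-jp/llm-jp-clip-vit-base-patch16 inferred from the model name; the image path and label set are placeholders, so check the official model card for the exact loading recipe.

import torch
import open_clip
from PIL import Image

# Load model and tokenizer from the Hub; hub id is an assumption
# based on the model name, not confirmed by this page.
model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:llm-jp/llm-jp-clip-vit-base-patch16")
tokenizer = open_clip.get_tokenizer("hf-hub:llm-jp/llm-jp-clip-vit-base-patch16")
model.eval()

labels = ["犬", "猫", "鳥"]  # dog, cat, bird; any Japanese labels work
text = tokenizer(labels)
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # placeholder path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize so dot products become cosine similarities
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")

Because the labels are plain Japanese strings supplied at inference time, the same code classifies into entirely new categories by simply editing the list.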
Cross-modal retrieval
Image search
Retrieve relevant images using Japanese text queries, as sketched in the example below
Achieves 73.6% accuracy on the XM3600 image-to-text retrieval task
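
A minimal text-to-image retrieval sketch under the same assumptions as above (open_clip Hub loading, hub id inferred from the model name); the gallery paths and query are placeholders. It ranks a small set of candidate images by cosine similarity to a Japanese query.

import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "hf-hub:llm-jp/llm-jp-clip-vit-base-patch16")
tokenizer = open_clip.get_tokenizer("hf-hub:llm-jp/llm-jp-clip-vit-base-patch16")
model.eval()

paths = ["img1.jpg", "img2.jpg", "img3.jpg"]  # placeholder image gallery
query = "夕焼けの海岸"  # "a beach at sunset"

with torch.no_grad():
    images = torch.stack([preprocess(Image.open(p)) for p in paths])
    image_feats = model.encode_image(images)
    text_feats = model.encode_text(tokenizer([query]))
    image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    sims = (image_feats @ text_feats.T).squeeze(1)  # one score per image

# Print gallery images ranked by similarity to the query, best first
for idx in sims.argsort(descending=True).tolist():
    print(f"{paths[idx]}: {sims[idx]:.3f}")

For a real gallery, the image embeddings would be computed once and cached, since only the query side changes between searches.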