
Japanese CLIP ViT-B/32 RoBERTa Base

Developed by recruit-jp
A Japanese version of the CLIP model that maps Japanese text and images into the same embedding space, suitable for zero-shot image classification, text-image retrieval, and other tasks.
Downloads: 384
Released: 12/20/2023

Model Overview

This model is a Japanese version of CLIP (Contrastive Language-Image Pretraining), built on a ViT-B/32 image encoder and a RoBERTa-base text encoder and optimized specifically for Japanese.

Model Features

Japanese Optimization
Specifically optimized for Japanese text and images, outperforming general-purpose multilingual CLIP models on Japanese tasks.
Dual-Modal Embedding
Capable of mapping images and text into the same embedding space, enabling cross-modal retrieval and comparison.
Zero-shot Learning
Performs image classification and retrieval tasks without task-specific training.

Model Capabilities

Zero-shot image classification
Text-image retrieval
Image feature extraction
Text feature extraction
Cross-modal similarity calculation
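The capabilities above all reduce to comparing image and text vectors in the shared embedding space. A minimal sketch of zero-shot classification via cosine similarity, using random vectors as stand-ins for the model's real encoder outputs (the 512-dimensional size and the example captions are illustrative assumptions, not taken from the model card):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pre-computed embeddings: in practice these would come from the
# model's image and text encoders. The 512-dim shared space is an assumption
# for illustration.
image_emb = rng.normal(size=512)
label_texts = ["犬の写真", "猫の写真", "車の写真"]  # candidate Japanese captions
text_embs = rng.normal(size=(len(label_texts), 512))

def l2_normalize(x, axis=-1):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

image_emb = l2_normalize(image_emb)
text_embs = l2_normalize(text_embs)

# Cosine similarity between the image and each candidate caption.
sims = text_embs @ image_emb

# Zero-shot classification: pick the caption most similar to the image.
pred = label_texts[int(np.argmax(sims))]
print(pred)
```

The same dot products, computed image-against-many-texts or text-against-many-images, drive text-image retrieval: rank the candidates by similarity instead of taking only the argmax.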

Use Cases

E-commerce
Product Image Search: search for relevant product images using Japanese text descriptions, improving search accuracy and user experience.

Content Management
Automatic Image Tagging: automatically generate Japanese tags for images, reducing manual labeling costs.
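Automatic tagging differs from classification in that several tags may apply to one image, so a similarity threshold replaces the argmax. A sketch under the same assumptions as before (random stand-in embeddings, an illustrative tag vocabulary, and an arbitrary 0.2 threshold that would need tuning per dataset):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical embeddings: one image compared against a Japanese tag vocabulary.
tag_vocab = ["屋外", "食べ物", "動物", "建物"]
image_emb = rng.normal(size=512)
tag_embs = rng.normal(size=(len(tag_vocab), 512))

# Normalize so dot products are cosine similarities.
image_emb /= np.linalg.norm(image_emb)
tag_embs /= np.linalg.norm(tag_embs, axis=-1, keepdims=True)

# Multi-label tagging: keep every tag whose similarity clears the threshold.
# With random vectors most similarities sit near zero, so real encoder
# outputs are needed for meaningful tags.
sims = tag_embs @ image_emb
tags = [t for t, s in zip(tag_vocab, sims) if s > 0.2]
```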