🚀 rinna/japanese-clip-vit-b-16
This is a Japanese CLIP (Contrastive Language-Image Pre-Training) model trained by rinna Co., Ltd. It embeds images and Japanese text into a shared feature space, enabling applications such as zero-shot image classification and image-text retrieval.
🚀 Quick Start
This section guides you through the steps to use the rinna/japanese-clip-vit-b-16 model.
📦 Installation
First, you need to install the necessary package.
```bash
$ pip install git+https://github.com/rinnakk/japanese-clip.git
```
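Note: the example below also imports torch, Pillow, and requests; if they are not already available in your environment, you may need to install them separately.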
💻 Usage Examples
Basic Usage
The following code demonstrates how to load the model, preprocess an image, and perform inference.
```python
import io

import requests
import torch
from PIL import Image

import japanese_clip as ja_clip

# Use a GPU if one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load the model with its image preprocessing pipeline, plus the matching tokenizer.
model, preprocess = ja_clip.load(
    "rinna/japanese-clip-vit-b-16",
    cache_dir="/tmp/japanese_clip",
    device=device,
)
tokenizer = ja_clip.load_tokenizer()

# Fetch a sample image, apply the CLIP preprocessing, and add a batch dimension.
img = Image.open(io.BytesIO(requests.get(
    "https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260"
).content))
image = preprocess(img).unsqueeze(0).to(device)

# Tokenize the candidate Japanese labels: "dog", "cat", and "elephant".
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,
)

with torch.no_grad():
    # Encode the image and the texts into the shared embedding space.
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)
    # Scaled cosine similarities over the labels, normalized with softmax.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
🔧 Technical Details
Model Architecture
The model employs a ViT-B/16 Transformer architecture as the image encoder and a 12-layer BERT as the text encoder. The image encoder was initialized from the AugReg vit-base-patch16-224 model.
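To see the two encoders' outputs concretely, the hedged sketch below reuses image_features and text_features from the Quick Start snippet; the embedding width shown in the comments is an assumption, so rely on the printed shapes rather than the comment.

```python
# Hedged sketch, reusing tensors from the Quick Start snippet above.
# ViT-B/16 splits each input image into 16x16 patches; both encoders
# project into a single shared embedding space of the same width.
print(image_features.shape)  # one vector per image, e.g. torch.Size([1, 512])
print(text_features.shape)   # one vector per text,  e.g. torch.Size([3, 512])
```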
Training
The model was trained on the CC12M dataset, with its captions translated into Japanese.
📄 License
This model is released under the Apache 2.0 license.
📚 Documentation
Release Date
The model was released on May 12, 2022.
How to Cite
If you use this model in your research, please cite it as follows:
```bibtex
@misc{rinna-japanese-clip-vit-b-16,
    title = {rinna/japanese-clip-vit-b-16},
    author = {Shing, Makoto and Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-clip-vit-b-16}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}
```
Other Available Models
Please see japanese-clip for the other available models.