Japanese-cloob-vit-b-16 Open-source Model - Empowering Cross-modal Understanding of Japanese Images and Texts

Japanese Cloob Vit B 16

Developed by rinna

Japanese CLOOB (Contrastive Leave-One-Out Boost) model trained by rinna Co., Ltd. for cross-modal understanding of images and text

Language: Japanese | License: Apache-2.0 | Tags: Japanese Multimodal, Image-Text Matching, Zero-Shot Classification

Release Time: 4/27/2022

Model Overview

This model is trained with the CLOOB method and relates Japanese text to images, supporting tasks such as zero-shot image classification and text-image matching.

Model Features

Japanese Cross-Modal Understanding

A vision-language model specifically designed for Japanese, effectively understanding the relationship between Japanese text and images

CLOOB Architecture

Uses the Contrastive Leave-One-Out Boost (CLOOB) method to improve cross-modal representation learning

Pre-trained ViT Model

Image encoder initialized from the AugReg vit-base-patch16-224 model
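
The "leave-one-out" idea in the CLOOB name can be illustrated with a small sketch of the InfoLOOB objective described in the CLOOB paper: for each image-text pair, the matching pair is left out of the denominator, so only mismatched pairs act as negatives. This is an illustration under stated assumptions, not rinna's training code. It omits the modern Hopfield network retrieval step that CLOOB also uses, and the function name `info_loob_loss` and the default `inv_tau=30.0` are hypothetical choices made for this example.

```python
import math


def _normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def _logsumexp(values):
    """Numerically stable log(sum(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def info_loob_loss(image_embs, text_embs, inv_tau=30.0):
    """InfoLOOB-style loss over paired embeddings (needs at least 2 pairs).

    Hypothetical helper for illustration; inv_tau is an assumed default.
    """
    x = [_normalize(v) for v in image_embs]
    y = [_normalize(v) for v in text_embs]
    n = len(x)
    # logits[i][j]: scaled cosine similarity between image i and text j
    logits = [[inv_tau * _dot(x[i], y[j]) for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        pos = logits[i][i]
        # Leave-one-out: the matching pair is excluded from both denominators.
        neg_texts = [logits[i][j] for j in range(n) if j != i]   # image i vs other texts
        neg_images = [logits[j][i] for j in range(n) if j != i]  # text i vs other images
        total += (_logsumexp(neg_texts) - pos) + (_logsumexp(neg_images) - pos)
    return total / n
```

Correctly paired embeddings should give a lower loss than the same embeddings with the pairing shuffled, which is the signal the objective trains on.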

Model Capabilities

Image Feature Extraction

Text Feature Extraction

Image-Text Matching

Cross-Modal Retrieval

Use Cases

Image Classification

Animal Image Classification

Identify animal categories in images (e.g., dogs, cats, elephants)

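In the usage example, the model assigns a probability of 1.0 to the label 犬 (dog) for one sample image; this is a single demonstration, not a measured accuracy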

Cross-Modal Retrieval

Text-to-Image Retrieval

Retrieve relevant images based on Japanese text descriptions
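
Text-to-image retrieval of this kind usually reduces to ranking stored image features by their similarity to a text feature. The sketch below shows that ranking step in plain Python, assuming the features have already been extracted (for example with the `get_image_features` and `get_text_features` calls in the usage example). The function name `rank_images` and the choice of cosine similarity are assumptions made for illustration, not part of the japanese-clip package.

```python
import math


def rank_images(text_feature, image_features, top_k=3):
    """Return (index, cosine similarity) for the top_k best-matching images.

    Hypothetical helper: text_feature is one vector, image_features is a
    list of vectors of the same length.
    """
    def normalize(v):
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    query = normalize(text_feature)
    scores = []
    for index, feature in enumerate(image_features):
        candidate = normalize(feature)
        # Cosine similarity is the dot product of unit-length vectors.
        scores.append((index, sum(a * b for a, b in zip(query, candidate))))
    # Highest similarity first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]
```

For a large image collection, the image features would typically be computed once and cached, so that each new text query only costs one text encoding plus this ranking.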

🚀 rinna/japanese-cloob-vit-b-16

This is a Japanese CLOOB (Contrastive Leave One Out Boost) model trained by rinna Co., Ltd. It can be used to extract image and text features for vision-language tasks.

Please see japanese-clip for the other available models.

🚀 Quick Start

📦 Installation

Install the necessary package:

$ pip install git+https://github.com/rinnakk/japanese-clip.git

💻 Usage Examples

Basic Usage

import io
import requests
from PIL import Image
import torch
import japanese_clip as ja_clip

device = "cuda" if torch.cuda.is_available() else "cpu"

model, preprocess = ja_clip.load("rinna/japanese-cloob-vit-b-16", device=device)
tokenizer = ja_clip.load_tokenizer()

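# Download a sample image and preprocess it into a model-ready tensor.
img_url = 'https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260'
response = requests.get(img_url, timeout=30)
response.raise_for_status()  # fail early if the download did not succeed
img = Image.open(io.BytesIO(response.content))
image = preprocess(img).unsqueeze(0).to(device)  # add a batch dimension

# Tokenize the candidate labels: dog, cat, elephant.
encodings = ja_clip.tokenize(
    texts=["犬", "猫", "象"],
    max_seq_len=77,
    device=device,
    tokenizer=tokenizer,  # optional; if omitted, the tokenizer is loaded on every call
)

with torch.no_grad():
    image_features = model.get_image_features(image)
    text_features = model.get_text_features(**encodings)

    # Scale the image-text similarities and convert them to probabilities over the labels.
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)  # prints: [[1.0, 0.0, 0.0]]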
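
The near one-hot output above follows from the `100.0` scale applied before the softmax. The sketch below reproduces that last step on plain Python lists to show the effect. It assumes unit-length features, which it enforces by normalizing explicitly (the source does not state whether the model's features are already normalized), and the function name `label_probs` is hypothetical.

```python
import math


def label_probs(image_feature, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between one image and each label.

    Hypothetical helper mirroring `(100.0 * image @ text.T).softmax(dim=-1)`.
    """
    def normalize(v):
        norm = math.sqrt(sum(a * a for a in v))
        return [a / norm for a in v]

    img = normalize(image_feature)
    logits = [
        scale * sum(a * b for a, b in zip(img, normalize(text)))
        for text in text_features
    ]
    # Subtract the maximum logit before exponentiating for numerical stability.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With `scale=100.0`, a cosine-similarity gap of only 0.05 between two labels becomes a logit gap of 5, so the better label is weighted roughly 148 times more heavily; with `scale=1.0` the same gap barely changes the probabilities.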

🔧 Technical Details

Model architecture

The model uses a ViT-B/16 Transformer as the image encoder and a 12-layer BERT as the text encoder. The image encoder was initialized from the AugReg vit-base-patch16-224 model.
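
As a quick illustration of what "B/16" at a 224-pixel input implies, the sketch below counts the tokens the image encoder processes per image. It assumes the standard ViT layout (non-overlapping square patches plus one class token), which the source does not spell out, and the function name `vit_sequence_length` is hypothetical.

```python
def vit_sequence_length(image_size=224, patch_size=16, class_token=True):
    """Number of tokens a standard ViT sees for a square input image.

    Assumes non-overlapping square patches and an optional class token.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be a multiple of patch_size")
    patches_per_side = image_size // patch_size  # 224 // 16 = 14
    num_patches = patches_per_side * patches_per_side  # 14 * 14 = 196
    return num_patches + (1 if class_token else 0)


print(vit_sequence_length())  # 197
```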

Training

The model was trained on CC12M with the captions translated to Japanese.

Release date

May 12, 2022

📚 Documentation

How to cite

@misc{rinna-japanese-cloob-vit-b-16,
    title = {rinna/japanese-cloob-vit-b-16},
    author = {Shing, Makoto and Zhao, Tianyu and Sawada, Kei},
    url = {https://huggingface.co/rinna/japanese-cloob-vit-b-16}
}

@inproceedings{sawada2024release,
    title = {Release of Pre-Trained Models for the {J}apanese Language},
    author = {Sawada, Kei and Zhao, Tianyu and Shing, Makoto and Mitsui, Kentaro and Kaga, Akio and Hono, Yukiya and Wakatsuki, Toshiaki and Mitsuda, Koh},
    booktitle = {Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)},
    month = {5},
    year = {2024},
    pages = {13898--13905},
    url = {https://aclanthology.org/2024.lrec-main.1213},
    note = {\url{https://arxiv.org/abs/2404.01657}}
}

📄 License

The Apache 2.0 license
