Clip Japanese Base
A Japanese CLIP model developed by LY Corporation, trained on approximately 1 billion web-collected image-text pairs and suitable for a range of vision-language tasks.
Downloads: 14.31k
Release Date: 4/24/2024
Model Overview
This model is a Japanese version of Contrastive Language-Image Pre-training (CLIP), suitable for tasks such as zero-shot image classification and text-to-image or image-to-text retrieval.
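The sketch below shows how such a model is typically loaded and used to compute image and text embeddings. It assumes the model is published on the Hugging Face Hub under an identifier like `line-corporation/clip-japanese-base` and exposes the standard CLIP-style `get_image_features` / `get_text_features` interface via `trust_remote_code`; the local image path and Japanese captions are illustrative.

```python
# Minimal sketch: load the model and compare one image against Japanese captions.
# The Hub identifier and the get_*_features API are assumptions, not confirmed here.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

MODEL_ID = "line-corporation/clip-japanese-base"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

image = Image.open("example.jpg")                 # any local image (hypothetical path)
image_inputs = processor(image, return_tensors="pt")
text_inputs = tokenizer(["犬", "猫", "象"])        # "dog", "cat", "elephant"

with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)
    text_features = model.get_text_features(**text_inputs)

# Similarity between the image and each Japanese caption, as class probabilities
probs = (image_features @ text_features.T).softmax(dim=-1)
print(probs)
```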
Model Features
Powerful Japanese Vision-Language Understanding
A CLIP model specifically optimized for Japanese, capable of understanding the relationships between Japanese text and images.
Efficient Architecture Design
Uses Eva02-B as the image encoder, which is more efficient than conventional ViT architectures.
Large-scale Pretraining Data
Trained on approximately 1 billion web-collected image-text pairs, covering diverse scenarios.
Model Capabilities
Zero-shot Image Classification
Text-to-Image Retrieval
Image-to-Text Retrieval
Cross-modal Feature Extraction
Use Cases
Image Retrieval
Japanese Description-based Image Search
Retrieve relevant images using Japanese text queries
Achieves an R@1 of 0.30 on the STAIR Captions dataset (see the retrieval sketch below)
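A minimal text-to-image retrieval sketch: embed a small image gallery once, then rank the images against a Japanese query by cosine similarity. The Hub identifier, the `get_*_features` API, the gallery file names, and the query text are assumptions for illustration.

```python
# Sketch: rank a gallery of images against a Japanese text query.
# Hub identifier, API surface, and file names are illustrative assumptions.
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

MODEL_ID = "line-corporation/clip-japanese-base"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

gallery_paths = ["beach.jpg", "ramen.jpg", "shrine.jpg"]  # hypothetical files
images = [Image.open(p) for p in gallery_paths]

with torch.no_grad():
    image_inputs = processor(images, return_tensors="pt")
    image_emb = F.normalize(model.get_image_features(**image_inputs), dim=-1)

    query = tokenizer(["夕暮れの海辺"])               # "seaside at dusk"
    text_emb = F.normalize(model.get_text_features(**query), dim=-1)

# Cosine similarity is the dot product of L2-normalized embeddings
scores = (text_emb @ image_emb.T).squeeze(0)
best = scores.argmax().item()
print(f"Top-1 match: {gallery_paths[best]} (score={scores[best].item():.3f})")
```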
Image Classification
Zero-shot Japanese Image Classification
Classify images without fine-tuning
Achieves 89% accuracy on the Recruit Datasets benchmark (see the classification sketch below)
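A zero-shot classification sketch: score one image against Japanese class prompts and take the argmax, with no task-specific fine-tuning. The Hub identifier, the prompt template, the class names, and the image path are illustrative assumptions.

```python
# Sketch: zero-shot classification with Japanese class prompts.
# Hub identifier, prompt template, and class names are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

MODEL_ID = "line-corporation/clip-japanese-base"  # assumed Hub identifier
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = AutoModel.from_pretrained(MODEL_ID, trust_remote_code=True)

class_names = ["犬", "猫", "鳥"]                    # "dog", "cat", "bird"
prompts = [f"{c}の写真" for c in class_names]       # "a photo of a {c}"

image_inputs = processor(Image.open("pet.jpg"), return_tensors="pt")
text_inputs = tokenizer(prompts)

with torch.no_grad():
    image_emb = model.get_image_features(**image_inputs)
    text_emb = model.get_text_features(**text_inputs)

# Softmax over prompt similarities gives per-class probabilities
probs = (image_emb @ text_emb.T).softmax(dim=-1).squeeze(0)
for name, p in zip(class_names, probs.tolist()):
    print(f"{name}: {p:.3f}")
```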