🚀 Japanese CLIP Base Model
This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. The model was trained on roughly one billion web-collected image-text pairs and can be applied to various vision tasks, including zero-shot image classification, text-to-image retrieval, and image-to-text retrieval.
🚀 Quick Start
Installation
```bash
pip install pillow requests sentencepiece transformers torch timm
```
Usage Example
Basic Usage
```python
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer

HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"

# The model ships custom code, so trust_remote_code=True is required.
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)

# Download a sample image and prepare the image and text inputs.
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)  # Japanese labels: dog, cat, elephant

# Encode both modalities and turn the similarities into label probabilities.
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print("Label probs:", text_probs)
```
📚 Documentation
Model Architecture
The model uses an Eva02-B Transformer as the image encoder and a 12-layer BERT as the text encoder. The text encoder is initialized from rinna/japanese-clip-vit-b-16.
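As a quick sanity check of the encoder sizes quoted in the results table below, the parameter counts of the loaded model can be inspected. This is a minimal sketch that iterates over the top-level submodules rather than assuming specific attribute names, since the exact module layout of the custom model class is not documented here.

```python
def count_parameters(module):
    """Total number of parameters in a torch.nn.Module, in millions."""
    return sum(p.numel() for p in module.parameters()) / 1e6

# Overall size is safe to query on any torch model.
print(f"total: {count_parameters(model):.0f}M parameters")

# Per-submodule breakdown; inspect the printed names to identify the
# image and text encoders in the actual implementation.
for name, child in model.named_children():
    print(f"{name}: {count_parameters(child):.0f}M parameters")
```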
Evaluation
Datasets
Evaluation covers STAIR Captions for image-text retrieval (reported as R@1) and Recruit Datasets and ImageNet-1K for zero-shot image classification (reported as acc@1).
Results
| Model | Image encoder params | Text encoder params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
| --- | --- | --- | --- | --- | --- |
| Our model | 86M (Eva02-B) | 100M (BERT) | 0.30 | 0.89 | 0.58 |
| Stable-ja-clip | 307M (ViT-L) | 100M (BERT) | 0.24 | 0.77 | 0.68 |
| Rinna-ja-clip | 86M (ViT-B) | 100M (BERT) | 0.13 | 0.54 | 0.56 |
| Laion-clip | 632M (ViT-H) | 561M (XLM-RoBERTa) | 0.30 | 0.83 | 0.58 |
| Hakuhodo-ja-clip | 632M (ViT-H) | 100M (BERT) | 0.21 | 0.82 | 0.46 |
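For reference, both R@1 and acc@1 reduce to checking whether the top-1 ranked item is correct. The sketch below is a generic illustration of how such a metric can be computed from a similarity matrix; it is not the evaluation script used to produce the table above.

```python
import torch

def top1_accuracy(similarity, labels):
    """acc@1 / R@1: fraction of rows whose highest-scoring column is correct.

    `similarity` is an (N, M) matrix of query-candidate scores and `labels`
    holds the index of the correct column for each of the N queries.
    """
    predictions = similarity.argmax(dim=-1)
    return (predictions == labels).float().mean().item()

# Toy usage with a 3x3 similarity matrix where the diagonal is correct.
sim = torch.tensor([[0.9, 0.1, 0.0],
                    [0.2, 0.7, 0.1],
                    [0.3, 0.3, 0.4]])
print(top1_accuracy(sim, torch.arange(3)))  # 1.0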
📄 License
Apache License, Version 2.0
Citation
```bibtex
@misc{clip-japanese-base,
    title = {CLIP Japanese Base},
    author = {Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama},
    url = {https://huggingface.co/line-corporation/clip-japanese-base},
}
```