🚀 Japanese CLIP Base Model
This is a Japanese CLIP (Contrastive Language-Image Pre-training) model developed by LY Corporation. It was trained on approximately one billion image-text pairs collected from the web and is suitable for a range of vision tasks, including zero-shot image classification and text-to-image or image-to-text retrieval.
🚀 Quick Start
Installation
pip install pillow requests sentencepiece transformers torch timm
Usage
Basic usage
import io
import requests
from PIL import Image
import torch
from transformers import AutoImageProcessor, AutoModel, AutoTokenizer
HF_MODEL_PATH = 'line-corporation/clip-japanese-base'
device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
processor = AutoImageProcessor.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
model = AutoModel.from_pretrained(HF_MODEL_PATH, trust_remote_code=True).to(device)
# Download and preprocess an example image, and tokenize the candidate labels.
image = Image.open(io.BytesIO(requests.get('https://images.pexels.com/photos/2253275/pexels-photo-2253275.jpeg?auto=compress&cs=tinysrgb&dpr=3&h=750&w=1260').content))
image = processor(image, return_tensors="pt").to(device)
text = tokenizer(["犬", "猫", "象"]).to(device)  # "dog", "cat", "elephant"

# Encode the image and candidate labels, then turn similarities into label probabilities.
with torch.no_grad():
    image_features = model.get_image_features(**image)
    text_features = model.get_text_features(**text)
    text_probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print("Label probs:", text_probs)
📚 Documentation
Model architecture
The model uses an Eva02-B Transformer as the image encoder and a 12-layer BERT as the text encoder. The text encoder is initialized from rinna/japanese-clip-vit-b-16.
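As a quick sanity check on these sizes, the loaded model can be inspected with generic PyTorch introspection. The sketch below assumes the model has been loaded as in the usage example; submodule names depend on the remote-code implementation, so it simply iterates over the top-level children.

# Approximate per-submodule parameter counts (generic PyTorch introspection;
# submodule names depend on the remote-code implementation).
for name, module in model.named_children():
    n_params = sum(p.numel() for p in module.parameters())
    print(f"{name}: {n_params / 1e6:.0f}M parameters")
print(f"total: {sum(p.numel() for p in model.parameters()) / 1e6:.0f}M parameters")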
Evaluation
Datasets
Evaluation covers STAIR Captions (image-text retrieval, R@1), Recruit Datasets (zero-shot classification, acc@1), and ImageNet-1K (zero-shot classification, acc@1); see the results table below.
Results
| Model | Image encoder params | Text encoder params | STAIR Captions (R@1) | Recruit Datasets (acc@1) | ImageNet-1K (acc@1) |
| --- | --- | --- | --- | --- | --- |
| Our model | 86M (Eva02-B) | 100M (BERT) | 0.30 | 0.89 | 0.58 |
| Stable-ja-clip | 307M (ViT-L) | 100M (BERT) | 0.24 | 0.77 | 0.68 |
| Rinna-ja-clip | 86M (ViT-B) | 100M (BERT) | 0.13 | 0.54 | 0.56 |
| Laion-clip | 632M (ViT-H) | 561M (XLM-RoBERTa) | 0.30 | 0.83 | 0.58 |
| Hakuhodo-ja-clip | 632M (ViT-H) | 100M (BERT) | 0.21 | 0.82 | 0.46 |
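For reference, R@1 and acc@1 are standard nearest-neighbour metrics over the image-text similarity matrix. The sketch below illustrates how such metrics are commonly computed from precomputed, L2-normalized features; it is an illustration, not the exact evaluation protocol used for the table above.

# Illustrative metric computation from precomputed, L2-normalized feature tensors.
def recall_at_1(text_feats, image_feats):
    # Text-to-image retrieval: is the top-ranked image the paired one for each caption?
    sims = text_feats @ image_feats.T  # (N, N) similarity matrix
    targets = torch.arange(len(text_feats), device=sims.device)
    return (sims.argmax(dim=-1) == targets).float().mean().item()

def zero_shot_acc_at_1(image_feats, class_text_feats, labels):
    # Zero-shot classification: assign each image the class with the most similar text embedding.
    sims = image_feats @ class_text_feats.T  # (num_images, num_classes)
    return (sims.argmax(dim=-1) == labels).float().mean().item()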
📄 License
Apache License, Version 2.0
Citation
@misc{clip-japanese-base,
title = {CLIP Japanese Base},
author = {Shuhei Yokoo and Shuntaro Okada and Peifei Zhu and Shuhei Nishimura and Naoki Takayama},
url = {https://huggingface.co/line-corporation/clip-japanese-base},
}