clip-vit-base-patch32-ko Open Source Model - Supports Korean-English Bilingual Image-Text Matching Tasks


Developed by Bingsu

Korean CLIP model trained via knowledge distillation, supporting Korean-English bilingual image-text matching tasks

Text-to-Image

Transformers

Language: Korean | Open Source License: MIT | Tags: #Korean CLIP model #Zero-shot image classification #Multimodal understanding

Downloads 3,147

Release Time: 9/16/2022

Model Overview

This is a Korean version of the CLIP model based on the ViT-Base-Patch32 architecture. It was trained with knowledge distillation and is intended for Korean and English cross-modal retrieval tasks.

Model Features

Korean optimization

Optimized for Korean; trained on Korean-English parallel corpora from the AIHUB platform

Knowledge distillation training

Uses knowledge distillation to transfer knowledge from the original CLIP model

Bilingual support

Supports both Korean and English text inputs

Model Capabilities

Zero-shot image classification

Image-text matching

Cross-modal retrieval

Use Cases

Image classification

Animal recognition

Identify animal types in images

Distinguishes common animals such as cats and dogs; in the basic usage example on this page, "two cats" scores 0.9926 against "two dogs"

Content moderation

Inappropriate content detection

Detect if images contain inappropriate content

🚀 clip-vit-base-patch32-ko

A Korean CLIP model trained using the method described in Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation

Widget

src: https://huggingface.co/datasets/mishig/sample_images/resolve/main/cat-dog-music.png
candidate_labels: 기타치는 고양이, 피아노 치는 강아지 (English: a cat playing the guitar, a puppy playing the piano)
example_title: Guitar, cat and dog

License

This project is licensed under the MIT license.

🚀 Quick Start

This Korean CLIP model was trained following the approach described in the paper Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation.

Training code: https://github.com/Bing-su/KoCLIP_training_code

Data used: All Korean-English parallel data from AIHUB
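
The distillation recipe from the cited paper can be sketched roughly as follows: a frozen teacher (here presumably the original CLIP text encoder) embeds the English sentence, and the student is trained so that its embeddings of both the English sentence and its Korean translation land close to the teacher's embedding, measured by mean squared error. The snippet below is a framework-free illustration with made-up vectors; the function names and numbers are hypothetical and are not taken from the training repository.

```python
from typing import Sequence


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(
    teacher_en: Sequence[float],
    student_en: Sequence[float],
    student_ko: Sequence[float],
) -> float:
    """Pull both student embeddings toward the teacher's English embedding."""
    return mse(student_en, teacher_en) + mse(student_ko, teacher_en)


# Hypothetical 4-dimensional embeddings, for illustration only.
teacher_en = [0.2, -0.1, 0.7, 0.4]  # teacher("two cats")
student_en = [0.2, -0.1, 0.7, 0.4]  # student("two cats"), already matching
student_ko = [0.0, -0.1, 0.7, 0.4]  # student("고양이 두 마리"), slightly off

loss = distillation_loss(teacher_en, student_en, student_ko)
print(f"distillation loss: {loss:.4f}")
```

In a real training loop the loss would be averaged over a batch of sentence pairs and minimized with respect to the student's parameters only; the teacher stays fixed.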

💻 Usage Examples

Basic Usage

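import requests
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

# Load the Korean CLIP model and its processor (tokenizer + image preprocessing)
repo = "Bingsu/clip-vit-base-patch32-ko"
model = AutoModel.from_pretrained(repo)
processor = AutoProcessor.from_pretrained(repo)

# Download a sample image from the COCO validation set
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Candidate captions: "two cats", "two dogs"
inputs = processor(text=["고양이 두 마리", "개 두 마리"], images=image, return_tensors="pt", padding=True)
with torch.inference_mode():
    outputs = model(**inputs)

# Image-to-text similarity logits, one column per caption
logits_per_image = outputs.logits_per_image
# Softmax over the captions turns the logits into probabilities
probs = logits_per_image.softmax(dim=1)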

>>> probs
tensor([[0.9926, 0.0074]])
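
For readers who want to see what happens between the embeddings and the probabilities above, here is a minimal framework-free sketch. CLIP scores each caption by the scaled cosine similarity between the image embedding and the text embedding, then applies a softmax over the captions. The vectors and the scale value of 100.0 are assumptions chosen for illustration; the actual model learns its own scale and uses much larger embeddings.

```python
import math
from typing import List, Sequence


def normalize(v: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def clip_probs(
    image_embed: Sequence[float],
    text_embeds: Sequence[Sequence[float]],
    scale: float = 100.0,  # assumed temperature, not read from the model
) -> List[float]:
    """Softmax over scaled cosine similarities between one image and N texts."""
    img = normalize(image_embed)
    logits = [
        scale * sum(i * t for i, t in zip(img, normalize(txt)))
        for txt in text_embeds
    ]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - top) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy 2-dimensional embeddings: the first caption points the same way as the image.
probs = clip_probs([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9]])
print(probs)
```

Because of the large scale factor, even a modest gap in cosine similarity produces a sharply peaked distribution, which is consistent with the near-one-hot output shown in the example.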

Advanced Usage

from transformers import pipeline

repo = "Bingsu/clip-vit-base-patch32-ko"
pipe = pipeline("zero-shot-image-classification", model=repo)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
# Labels: "one cat", "two cats", "cat friends sprawled on a pink sofa"
# hypothesis_template="{}" passes each label to the model unchanged
result = pipe(images=url, candidate_labels=["고양이 한 마리", "고양이 두 마리", "분홍색 소파에 드러누운 고양이 친구들"], hypothesis_template="{}")

>>> result
[{'score': 0.9456236958503723, 'label': '분홍색 소파에 드러누운 고양이 친구들'},
 {'score': 0.05315302312374115, 'label': '고양이 두 마리'},
 {'score': 0.0012233294546604156, 'label': '고양이 한 마리'}]
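
The hypothesis_template argument controls how each candidate label is turned into the text the model sees. The pipeline's default template is an English sentence ("This is a photo of {}."), so passing "{}" keeps the Korean labels unchanged. A minimal sketch of that formatting step follows; build_prompts is a hypothetical helper written for this illustration, not a function of the pipeline.

```python
from typing import List

DEFAULT_TEMPLATE = "This is a photo of {}."  # English default of the pipeline


def build_prompts(labels: List[str], template: str = DEFAULT_TEMPLATE) -> List[str]:
    """Fill each label into the template, as the pipeline does before tokenizing."""
    return [template.format(label) for label in labels]


labels = ["고양이 한 마리", "고양이 두 마리"]  # "one cat", "two cats"
print(build_prompts(labels))        # Korean labels wrapped in an English sentence
print(build_prompts(labels, "{}"))  # labels passed through verbatim
```

A Korean template would also be a reasonable choice; the point is only that the English default mixes languages inside a single prompt.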

🔧 Technical Details

Tokenizer

The tokenizer was trained on Korean and English data mixed at a ratio of 7:3, by calling .train_new_from_iterator on the original CLIP tokenizer.
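
The source does not say how the 7:3 mix was produced. One possible approach, shown here purely as an assumption, is to interleave the two corpora in fixed blocks of seven Korean and three English sentences before handing the stream to the tokenizer trainer. The helper mix_corpora and the sample sentences are hypothetical.

```python
from itertools import islice
from typing import Iterable, Iterator


def mix_corpora(
    korean: Iterable[str],
    english: Iterable[str],
    ko_share: int = 7,
    en_share: int = 3,
) -> Iterator[str]:
    """Yield sentences in repeating blocks of ko_share Korean and en_share English.

    Stops as soon as either corpus cannot fill a complete block, so the
    ratio holds exactly over everything that is yielded.
    """
    ko_iter, en_iter = iter(korean), iter(english)
    while True:
        ko_block = list(islice(ko_iter, ko_share))
        en_block = list(islice(en_iter, en_share))
        if len(ko_block) < ko_share or len(en_block) < en_share:
            return
        yield from ko_block
        yield from en_block


# Placeholder corpora, for illustration only.
korean = [f"한국어 문장 {i}" for i in range(14)]
english = [f"english sentence {i}" for i in range(6)]

mixed = list(mix_corpora(korean, english))
print(len(mixed), "sentences in the mixed stream")
```

A stream like this could then be grouped into batches of texts and passed to train_new_from_iterator together with a target vocabulary size.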

https://github.com/huggingface/transformers/blob/bc21aaca789f1a366c05e8b5e111632944886393/src/transformers/models/clip/modeling_clip.py#L661-L666

        # text_embeds.shape = [batch_size, sequence_length, transformer.width]
        # take features from the eot embedding (eot_token is the highest number in each sequence)
        # casting to torch.int for onnx compatibility: argmax doesn't support int64 inputs with opset 14
        pooled_output = last_hidden_state[
            torch.arange(last_hidden_state.shape[0]), input_ids.to(torch.int).argmax(dim=-1)
        ]

Because the CLIP model computes pooled_output from the position of the token with the largest ID, the eos token must be the last token in the vocabulary, that is, the one with the highest ID.
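
To see why the ID of the eos token matters, the following framework-free sketch mimics the argmax-based pooling from the excerpt above using plain Python lists. All token IDs are invented for illustration and do not correspond to the real vocabulary.

```python
from typing import Sequence


def pooled_index(input_ids: Sequence[int]) -> int:
    """Position of the largest token ID, like input_ids.argmax(dim=-1)."""
    return max(range(len(input_ids)), key=lambda i: input_ids[i])


# Hypothetical vocabulary A: eos has the largest ID (49), padding is 0.
ids_a = [48, 12, 7, 49, 0, 0]  # bos, two word tokens, eos, padding
# Hypothetical vocabulary B: eos has a small ID (2), so a word token wins.
ids_b = [1, 12, 7, 2, 0, 0]    # bos, two word tokens, eos, padding

print(pooled_index(ids_a))  # position of eos
print(pooled_index(ids_b))  # position of a word token, not eos
```

With vocabulary B the pooled feature would be taken from an arbitrary word token, so a retrained tokenizer needs to keep eos at the top of the ID range for this pooling code to behave as intended.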
