distilbert-base-turkish-cased-clip open-source model: a Turkish text encoder that pairs with an image encoder

Distilbert Base Turkish Cased Clip

Developed by mys

A Turkish text encoder fine-tuned from dbmdz/distilbert-base-turkish-cased, intended for use with CLIP's ViT-B/32 image encoder.

Text-to-Image

Transformers

#TurkishCLIP #MultimodalAlignment #TextEncoder

Downloads: 2,354

Released: 3/2/2022

Model Overview

This model is a text encoder optimized for Turkish. It is designed to be paired with the CLIP image encoder for cross-modal text-image matching tasks.

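Model Features

Turkish optimization

Fine-tuned specifically on Turkish text.

CLIP compatible

Designed to be used with CLIP's ViT-B/32 image encoder.

Lightweight architecture

Based on DistilBERT, a distilled variant of BERT that reduces model size while retaining most of its performance.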

Model Capabilities

Turkish text encoding

Cross-modal text-image matching

Multimodal representation learning

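Use Cases

Cross-modal retrieval

Turkish image search

Retrieve relevant images using Turkish text queries.

Content recommendation

Turkish content recommendation

Recommend relevant visual content based on text descriptions.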
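
The image-search use case can be sketched as a nearest-neighbor lookup over precomputed embeddings. This is a minimal illustration, not part of the model's repository: `rank_images` is a hypothetical helper, and the 4-dimensional vectors are toy stand-ins for the 512-dimensional embeddings that CLIP's ViT-B/32 encoder and this text encoder would produce.

```python
import numpy as np

def rank_images(text_emb, image_embs, top_k=3):
    """Return (index, cosine similarity) pairs for the top_k images closest to one text query."""
    # Normalize defensively so the dot product equals cosine similarity
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    scores = image_embs @ text_emb        # shape: (num_images,)
    order = np.argsort(-scores)[:top_k]   # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

# Toy embeddings (assumption): real ones come from the image and text encoders
image_embs = np.array([
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0, 0.0],
])
query_emb = np.array([0.9, 0.1, 0.0, 0.0])

results = rank_images(query_emb, image_embs, top_k=2)
print(results)  # image 0 ranks first, image 2 second
```

In a real index the image embeddings would be computed once and stored, so each Turkish query costs only one text-encoder pass plus a matrix-vector product.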

🚀 Turkish Text Encoder Model

This project is a model fine-tuned from dbmdz/distilbert-base-turkish-cased. It serves as a Turkish text encoder and can be used together with CLIP's ViT-B/32 image encoder.

🚀 Quick Start

This model must be used together with clip_head.h5 from [the related repository on my GitHub]. Visit that repository for a complete working example. A brief usage example follows.

from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf
import numpy as np
from PIL import Image
import torch
import clip

# Fine-tuned Turkish DistilBERT encoder and its tokenizer
model_name = "mys/distilbert-base-turkish-cased-clip"
base_model = TFAutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Projection head; clip_head.h5 must be obtained from the GitHub repository first
head_model = tf.keras.models.load_model("./clip_head.h5")

def encode_text(base_model, tokenizer, head_model, texts):
    """Encode Turkish texts into L2-normalized vectors in CLIP's embedding space."""
    tokens = tokenizer(texts, padding=True, return_tensors='tf')
    embs = base_model(**tokens)[0]  # per-token hidden states

    # Masked mean pooling: average over real tokens only, ignoring padding
    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    base_embs = tf.reduce_sum(masked_embs, axis=1) / sample_length
    # Project into CLIP's space, then normalize to unit length
    clip_embs = head_model(base_embs)
    clip_embs /= tf.norm(clip_embs, axis=-1, keepdims=True)
    return clip_embs

# Turkish caption -> image file name
demo_images = {
    "bilgisayarda çalışan bir insan": "myspc.jpeg",
    "sahilde bir insan ve bir heykel": "mysdk.jpeg"
    }

# Load CLIP on the CPU so the model and the CPU input tensors share a device
clip_model, preprocess = clip.load("ViT-B/32", device="cpu")
images = {key: Image.open(f"images/{value}") for key, value in demo_images.items()}
img_inputs = torch.stack([preprocess(image).to('cpu') for image in images.values()])

with torch.no_grad():
    image_embs = clip_model.encode_image(img_inputs).float().to('cpu')

image_embs /= image_embs.norm(dim=-1, keepdim=True)
image_embs = image_embs.detach().numpy()
text_embs = encode_text(base_model, tokenizer, head_model, list(images.keys())).numpy()
# Cosine similarities: rows are images, columns are captions
similarities = image_embs @ text_embs.T
# Softmax over the captions for each image (applied to raw cosine similarities)
probs = tf.nn.softmax(tf.convert_to_tensor(similarities)).numpy()
idxs = np.argmax(probs, axis=-1).tolist()
labels = list(demo_images.keys())
for i, (key, value) in enumerate(demo_images.items()):
    print("path: ", value, "true label: ", key, "prediction: ", labels[idxs[i]], "score: ", probs[i, idxs[i]])

The sample images referenced in the code snippet above are in the images directory of the GitHub repository.
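
The pooling step inside encode_text() can be checked in isolation. The sketch below reproduces the same masked mean in NumPy on hand-made inputs; `masked_mean_pool` is a hypothetical name used only for this illustration, and the tiny tensors stand in for real DistilBERT hidden states.

```python
import numpy as np

def masked_mean_pool(hidden_states, attention_mask):
    """Average token vectors per sequence, ignoring padded positions."""
    mask = attention_mask.astype(np.float32)[..., None]  # (batch, seq, 1)
    summed = (hidden_states * mask).sum(axis=1)           # padded tokens contribute zero
    lengths = mask.sum(axis=1)                            # (batch, 1) real-token counts
    return summed / lengths

# Two sequences of length 3 with hidden size 2; the second has one padding token
hidden = np.array([
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [100.0, 100.0]],  # last position is padding
], dtype=np.float32)
mask = np.array([[1, 1, 1], [1, 1, 0]])

pooled = masked_mean_pool(hidden, mask)
print(pooled)  # rows are [3, 4] and [3, 1]; the padding vector is ignored
```

Dividing by the count of real tokens rather than the padded length keeps short captions from being pulled toward zero when they are batched with longer ones.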

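🔧 Technical Details

The encode_text() function aggregates the per-token hidden states output by DistilBERT into a single vector per sequence, taking the mean over non-padding tokens. The clip_head.h5 model then projects this vector through a fully connected layer into the same vector space as CLIP's text encoder.

Training proceeded in two stages. First, all DistilBERT layers were frozen and only the head's fully connected layer was trained for a few epochs. Then the DistilBERT layers were unfrozen and trained together with the head for a few more epochs.

I created the dataset by machine-translating COCO captions into Turkish. During training, the vector representations of the English captions produced by the original CLIP text encoder served as targets, and the mean squared error (MSE) between those vectors and the output of clip_head.h5 was minimized.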
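
The two-stage schedule can be illustrated with a toy model. The sketch below is not the author's training script: it replaces DistilBERT and the head with two small linear maps, uses random data in place of translated COCO captions and CLIP targets, and runs plain gradient descent in NumPy. All names, sizes, and learning rates are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): random features and random "teacher" targets
x = rng.normal(size=(64, 8))                  # stand-in for Turkish caption inputs
teacher = rng.normal(size=(8, 4)) * 0.5
targets = x @ teacher                         # stand-in for CLIP text-encoder vectors

w_enc_init = rng.normal(size=(8, 6)) * 0.5    # stand-in for the DistilBERT layers
w_head_init = rng.normal(size=(6, 4)) * 0.5   # stand-in for the projection head

def mse(w_enc, w_head):
    """Mean squared error between student outputs and teacher targets."""
    return float(np.mean((x @ w_enc @ w_head - targets) ** 2))

def train(w_enc, w_head, steps, lr, train_encoder):
    """Gradient descent on the MSE; the encoder is updated only if train_encoder is True."""
    for _ in range(steps):
        hidden = x @ w_enc
        grad_out = 2.0 * (hidden @ w_head - targets) / targets.size
        grad_head = hidden.T @ grad_out
        if train_encoder:
            # Gradient uses the head weights from before this step's update
            grad_enc = x.T @ (grad_out @ w_head.T)
            w_enc = w_enc - lr * grad_enc
        w_head = w_head - lr * grad_head
    return w_enc, w_head

loss_init = mse(w_enc_init, w_head_init)

# Stage 1: encoder frozen, only the head is trained
w_enc_1, w_head_1 = train(w_enc_init, w_head_init, steps=300, lr=0.02, train_encoder=False)
loss_head_only = mse(w_enc_1, w_head_1)

# Stage 2: encoder unfrozen, both parts are trained together
w_enc_2, w_head_2 = train(w_enc_1, w_head_1, steps=300, lr=0.02, train_encoder=True)
loss_joint = mse(w_enc_2, w_head_2)

print(loss_init, loss_head_only, loss_joint)
```

Training the head first gives the randomly initialized projection time to settle before its gradients reach the pretrained layers, which is a common motivation for this kind of freeze-then-unfreeze schedule.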