distilbert-base-turkish-cased-clip open-source model - a Turkish text encoder adapted to a CLIP image encoder

Distilbert Base Turkish Cased Clip

Developed by mys

A Turkish text encoder fine-tuned from dbmdz/distilbert-base-turkish-cased, for use with CLIP's ViT-B/32 image encoder

Text-to-Image

Transformers

#TurkishCLIP #MultimodalAlignment #TextEncoder

Downloads: 2,354

Released: 3/2/2022

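Model Overview

This model is a text encoder optimized for Turkish. It is designed to work with the image encoder of the CLIP model for cross-modal text-image matching.

Model Features

Turkish optimization

Fine-tuned specifically on Turkish text

CLIP compatible

Designed for use with CLIP's ViT-B/32 image encoder

Lightweight architecture

Based on DistilBERT, a distilled BERT variant that reduces model size while aiming to retain most of the larger model's performance

Model Capabilities

Turkish text encoding

Cross-modal text-image matching

Multimodal representation learning

Use Cases

Cross-modal retrieval

Turkish image search

Search for relevant images using Turkish text queries

Content recommendation

Turkish content recommendation

Recommend relevant visual content based on text descriptions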

🚀 Turkish Text Encoder Model

This is a fine-tuned version of dbmdz/distilbert-base-turkish-cased that serves as a Turkish text encoder for use with CLIP's ViT-B/32 image encoder.

🚀 Quick Start

The model must be used together with clip_head.h5 from [my accompanying repo on GitHub], where a complete working example is also available. A minimal usage example follows:

from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf
import numpy as np
from PIL import Image
import torch
import clip

model_name = "mys/distilbert-base-turkish-cased-clip"
base_model = TFAutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
head_model = tf.keras.models.load_model("./clip_head.h5")

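def encode_text(base_model, tokenizer, head_model, texts):
    """Encode Turkish texts into L2-normalized vectors in CLIP's embedding space."""
    tokens = tokenizer(texts, padding=True, return_tensors='tf')
    embs = base_model(**tokens)[0]  # per-token hidden states: (batch, seq_len, hidden)

    # Mean-pool over real tokens only, so padding does not dilute the average.
    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    base_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)
    # Project into CLIP's space with the dense head, then L2-normalize.
    clip_embs = head_model(base_embs)
    clip_embs /= tf.norm(clip_embs, axis=-1, keepdims=True)
    return clip_embs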

demo_images = {
    "bilgisayarda çalışan bir insan": "myspc.jpeg",
    "sahilde bir insan ve bir heykel": "mysdk.jpeg"
    }

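# Load CLIP on the CPU explicitly: clip.load() defaults to CUDA when it is
# available, which would not match the CPU image tensors built below.
clip_model, preprocess = clip.load("ViT-B/32", device="cpu")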
images = {key: Image.open(f"images/{value}") for key, value in demo_images.items()}
img_inputs = torch.stack([preprocess(image).to('cpu') for image in images.values()])

with torch.no_grad():
    image_embs = clip_model.encode_image(img_inputs).float().to('cpu')

image_embs /= image_embs.norm(dim=-1, keepdim=True)
image_embs = image_embs.detach().numpy()
text_embs = encode_text(base_model, tokenizer, head_model, list(images.keys())).numpy()
similarities = image_embs @ text_embs.T
# Softmax over the texts for each image; the results are probabilities, not logits.
probs = tf.nn.softmax(tf.convert_to_tensor(similarities)).numpy()
idxs = np.argmax(probs, axis=-1).tolist()
for i, (key, value) in enumerate(demo_images.items()):
    print("path: ", value, "true label: ", key, "prediction: ", list(demo_images.keys())[idxs[i]], "score: ", probs[i, idxs[i]])

The example images referenced in the snippet above are available in the images directory of the GitHub repo.
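
For the image search use case, the same normalized embeddings can be used to rank a gallery of images against a single Turkish query. The sketch below uses NumPy only and small made-up 2-D vectors in place of real embeddings. The helper names `rank_images` and `softmax` are hypothetical and not part of the repo. It also illustrates one caveat: the snippet above applies softmax directly to cosine similarities, which lie in [-1, 1], so the resulting scores stay close to uniform. The original CLIP model multiplies similarities by a learned scale (capped at 100) before the softmax. Whether a scale of 100 is appropriate for this Turkish encoder is an assumption here, not something the model card states.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rank_images(text_emb, image_embs, top_k=3):
    """Return (indices, scores) of the top_k images by cosine similarity."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    image_embs = image_embs / np.linalg.norm(image_embs, axis=-1, keepdims=True)
    sims = image_embs @ text_emb            # one cosine similarity per image
    order = np.argsort(-sims)[:top_k]       # highest similarity first
    return order, sims[order]

# Toy stand-ins for real embeddings (illustrative values only).
image_embs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([1.0, 0.1])
order, scores = rank_images(query, image_embs, top_k=3)
print("ranking:", order.tolist())

# Effect of scaling similarities before the softmax.
sims = np.array([0.30, 0.25])
probs_raw = softmax(sims)                   # nearly uniform
probs_scaled = softmax(100.0 * sims)        # assumed CLIP-style scale of 100
print("raw:", probs_raw, "scaled:", probs_scaled)
```

The ranking itself does not depend on the scale, since scaling preserves the order of similarities. Only the reported scores change.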

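🔧 Technical Details

The encode_text() function aggregates the per-token hidden states output by the DistilBERT model into a single vector per sequence by averaging over the non-padding tokens. The clip_head.h5 model then projects this vector through a single dense layer into the same vector space as CLIP's text encoder.

Training proceeded in two stages. First, all DistilBERT layers were frozen and only the dense head was trained for a few epochs. Then the layers were unfrozen and trained together with the dense layer for a few more epochs. I created the dataset by machine-translating COCO captions into Turkish. During training, the vector representations of the English captions produced by the original CLIP text encoder were used as targets, and the mean squared error (MSE) between these vectors and the outputs of clip_head.h5 was minimized.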
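
The two-stage schedule described above can be illustrated with a toy NumPy sketch. This is not the author's training code. A single linear map stands in for DistilBERT, random vectors stand in for the CLIP embeddings of the English captions, and all names (`masked_mean`, `train`, `params`) and hyperparameters are hypothetical. It only shows the shape of the procedure: pool token vectors with the attention mask, regress onto fixed targets with MSE, train the head alone first, then train the head and encoder together.

```python
import numpy as np

def masked_mean(hidden, mask):
    """Average token vectors over non-padding positions (mask == 1)."""
    weights = mask[..., None].astype(hidden.dtype)           # (B, T, 1)
    return (hidden * weights).sum(axis=1) / weights.sum(axis=1)

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def train(params, pooled_tokens, targets, steps, lr, train_encoder):
    """Gradient descent on the MSE; the encoder is updated only if requested."""
    for _ in range(steps):
        pooled = pooled_tokens @ params["enc"]                # (B, D_hid)
        pred = pooled @ params["head_w"] + params["head_b"]   # (B, D_clip)
        grad_pred = 2.0 * (pred - targets) / pred.size        # dMSE/dpred
        grad_head_w = pooled.T @ grad_pred
        grad_head_b = grad_pred.sum(axis=0)
        if train_encoder:
            grad_enc = pooled_tokens.T @ (grad_pred @ params["head_w"].T)
            params["enc"] = params["enc"] - lr * grad_enc
        params["head_w"] = params["head_w"] - lr * grad_head_w
        params["head_b"] = params["head_b"] - lr * grad_head_b
    pred = pooled_tokens @ params["enc"] @ params["head_w"] + params["head_b"]
    return mse(pred, targets)

rng = np.random.default_rng(0)
B, T, D_in, D_hid, D_clip = 32, 5, 6, 4, 3
tokens = rng.normal(size=(B, T, D_in))       # stand-in token features
mask = np.ones((B, T))
mask[:, 3:] = 0.0                            # last two positions are padding
targets = rng.normal(size=(B, D_clip))       # stand-in for CLIP text embeddings

# The toy encoder is linear, so pooling its inputs equals pooling its outputs.
pooled_tokens = masked_mean(tokens, mask)

params = {
    "enc": rng.normal(scale=0.5, size=(D_in, D_hid)),  # "pretrained" encoder
    "head_w": np.zeros((D_hid, D_clip)),               # dense head, untrained
    "head_b": np.zeros(D_clip),
}
enc_before = params["enc"].copy()
loss_init = mse(np.zeros_like(targets), targets)       # zero head predicts zeros

# Stage 1: encoder frozen, head only.
loss_stage1 = train(params, pooled_tokens, targets, steps=200, lr=0.1, train_encoder=False)
enc_after_stage1 = params["enc"].copy()

# Stage 2: encoder unfrozen, trained jointly with the head at a smaller step size.
loss_stage2 = train(params, pooled_tokens, targets, steps=200, lr=0.01, train_encoder=True)
print(loss_init, loss_stage1, loss_stage2)
```

Training the head first gives the randomly initialized projection a chance to settle before gradients flow into the pretrained encoder. That is a common motivation for this kind of schedule, although the model card does not state the author's reasoning.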