🚀 Turkish Text Encoder Model
This project provides a fine-tuned version of dbmdz/distilbert-base-turkish-cased that serves as a Turkish text encoder for use with CLIP's ViT-B/32 image encoder.
🚀 Quick Start
This model is a fine-tuned version of dbmdz/distilbert-base-turkish-cased that acts as a text encoder for Turkish, paired with CLIP's ViT-B/32 image encoder. It must be used together with the `clip_head.h5` file from [my companion repository on GitHub]. See that repository for a complete working example; a simple usage snippet follows:
```python
from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf
import numpy as np
from PIL import Image
import torch
import clip

# Load the fine-tuned DistilBERT text encoder and its projection head.
model_name = "mys/distilbert-base-turkish-cased-clip"
base_model = TFAutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
head_model = tf.keras.models.load_model("./clip_head.h5")

def encode_text(base_model, tokenizer, head_model, texts):
    tokens = tokenizer(texts, padding=True, return_tensors='tf')
    embs = base_model(**tokens)[0]
    # Masked mean pooling: average the token embeddings, ignoring padding.
    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    base_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)
    # Project into CLIP's embedding space and L2-normalize.
    clip_embs = head_model(base_embs)
    clip_embs /= tf.norm(clip_embs, axis=-1, keepdims=True)
    return clip_embs

demo_images = {
    "bilgisayarda çalışan bir insan": "myspc.jpeg",
    "sahilde bir insan ve bir heykel": "mysdk.jpeg"
}

# Encode the images with the original CLIP ViT-B/32 image encoder.
clip_model, preprocess = clip.load("ViT-B/32")
images = {key: Image.open(f"images/{value}") for key, value in demo_images.items()}
img_inputs = torch.stack([preprocess(image).to('cpu') for image in images.values()])
with torch.no_grad():
    image_embs = clip_model.encode_image(img_inputs).float().to('cpu')
    image_embs /= image_embs.norm(dim=-1, keepdim=True)
    image_embs = image_embs.detach().numpy()

# Match each image against every caption via cosine similarity.
text_embs = encode_text(base_model, tokenizer, head_model, list(images.keys())).numpy()
similarities = image_embs @ text_embs.T
logits = tf.nn.softmax(tf.convert_to_tensor(similarities)).numpy()
idxs = np.argmax(logits, axis=-1).tolist()

for i, (key, value) in enumerate(demo_images.items()):
    print("path: ", value, "true label: ", key, "prediction: ", list(demo_images.keys())[idxs[i]], "score: ", logits[i, idxs[i]])
```
The example images referenced in the snippet above can be found in the `images` directory of the GitHub repository.
🔧 Technical Details
The `encode_text()` function aggregates the per-token hidden states output by the DistilBERT model into a single vector per sequence. The `clip_head.h5` model then projects that vector into the same vector space as CLIP's text encoder through a single fully connected layer. Training proceeded in two phases: first, all DistilBERT layers were frozen and the head layer was trained for a few epochs; then the layers were unfrozen and trained jointly with the head for a few more epochs. I created the dataset by machine-translating COCO captions into Turkish. During training, the vector representations that the original CLIP text encoder produces for the English captions were used as target values, and the mean squared error (MSE) between these vectors and the `clip_head.h5` outputs was minimized.
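For illustration, here is a minimal sketch of that two-phase training setup. This is not the released training code: the caption list, the target array, the learning rate, and the single-dense-layer head definition are assumptions made for the example, and `turkish_captions` / `clip_text_targets` are hypothetical placeholders for the machine-translated captions and the precomputed CLIP text-encoder targets.

```python
import numpy as np
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

# Hypothetical placeholders (the real dataset is not yet released): Turkish
# captions machine-translated from COCO, and the precomputed CLIP ViT-B/32
# text-encoder embeddings of the corresponding English captions as targets.
turkish_captions = ["bilgisayarda çalışan bir insan"]
clip_text_targets = np.zeros((1, 512), dtype=np.float32)

tokenizer = AutoTokenizer.from_pretrained("dbmdz/distilbert-base-turkish-cased")
base_model = TFAutoModel.from_pretrained("dbmdz/distilbert-base-turkish-cased")

# Assumed head architecture: one dense layer mapping DistilBERT's 768-dim
# pooled output into CLIP's 512-dim text embedding space.
head_model = tf.keras.Sequential([tf.keras.layers.Dense(512, input_shape=(768,))])

optimizer = tf.keras.optimizers.Adam(1e-4)  # assumed learning rate
mse = tf.keras.losses.MeanSquaredError()

def mean_pool(tokens):
    # Same masked mean pooling as encode_text() above.
    embs = base_model(**tokens)[0]
    mask = tf.cast(tokens["attention_mask"], tf.float32)
    lengths = tf.reduce_sum(mask, axis=-1, keepdims=True)
    return tf.reduce_sum(embs * tf.expand_dims(mask, -1), axis=1) / lengths

def train_step(texts, targets, train_base=False):
    # Phase 1: train_base=False (DistilBERT frozen, head only).
    # Phase 2: train_base=True (head and DistilBERT fine-tuned jointly).
    tokens = tokenizer(texts, padding=True, return_tensors="tf")
    with tf.GradientTape() as tape:
        preds = head_model(mean_pool(tokens))
        loss = mse(targets, preds)
    variables = list(head_model.trainable_variables)
    if train_base:
        variables += base_model.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss

loss = train_step(turkish_captions, clip_text_targets, train_base=False)
```

Because the head distills DistilBERT's pooled output directly into CLIP's text embedding space, the original CLIP image encoder can be used unchanged at inference time.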
📚 Documentation
Dataset
The dataset and training notebook will be released soon.
📄 Acknowledgements
This work was supported by Google, who provided Google Cloud credits. Thanks to Google for supporting open source! 🎉