🚀 Turkish Text Encoder Model
This project is a fine-tuned model based on [dbmdz/distilbert-base-turkish-cased](https://huggingface.co/dbmdz/distilbert-base-turkish-cased), used as a Turkish text encoder paired with CLIP's ViT-B/32 image encoder.
🚀 Quick Start
This model is a fine-tuned version of [dbmdz/distilbert-base-turkish-cased](https://huggingface.co/dbmdz/distilbert-base-turkish-cased) for use as a Turkish text encoder, paired with CLIP's ViT-B/32 image encoder. It should be used together with clip_head.h5 from my accompanying GitHub repo; visit that repo for a fully working example. A simple usage example is provided below:
```python
from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf
import numpy as np
from PIL import Image
import torch
import clip

# Load the Turkish text encoder, its tokenizer, and the CLIP projection head.
model_name = "mys/distilbert-base-turkish-cased-clip"
base_model = TFAutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
head_model = tf.keras.models.load_model("./clip_head.h5")

def encode_text(base_model, tokenizer, head_model, texts):
    # Mean-pool the token embeddings (ignoring padding), project them into
    # CLIP's embedding space, then L2-normalize.
    tokens = tokenizer(texts, padding=True, return_tensors='tf')
    embs = base_model(**tokens)[0]

    attention_masks = tf.cast(tokens['attention_mask'], tf.float32)
    sample_length = tf.reduce_sum(attention_masks, axis=-1, keepdims=True)
    masked_embs = embs * tf.expand_dims(attention_masks, axis=-1)
    base_embs = tf.reduce_sum(masked_embs, axis=1) / tf.cast(sample_length, tf.float32)
    clip_embs = head_model(base_embs)
    clip_embs /= tf.norm(clip_embs, axis=-1, keepdims=True)
    return clip_embs

demo_images = {
    "bilgisayarda çalışan bir insan": "myspc.jpeg",   # "a person working at a computer"
    "sahilde bir insan ve bir heykel": "mysdk.jpeg"   # "a person and a statue at the beach"
}

# Encode the demo images with CLIP's ViT-B/32 image encoder.
clip_model, preprocess = clip.load("ViT-B/32")
images = {key: Image.open(f"images/{value}") for key, value in demo_images.items()}
img_inputs = torch.stack([preprocess(image).to('cpu') for image in images.values()])
with torch.no_grad():
    image_embs = clip_model.encode_image(img_inputs).float().to('cpu')
image_embs /= image_embs.norm(dim=-1, keepdim=True)
image_embs = image_embs.detach().numpy()

# Encode the Turkish captions and match each image to its most similar caption.
text_embs = encode_text(base_model, tokenizer, head_model, list(images.keys())).numpy()
similarities = image_embs @ text_embs.T
logits = tf.nn.softmax(tf.convert_to_tensor(similarities)).numpy()
idxs = np.argmax(logits, axis=-1).tolist()

for i, (key, value) in enumerate(demo_images.items()):
    print("path:", value, "| true label:", key,
          "| prediction:", list(demo_images.keys())[idxs[i]],
          "| score:", logits[i, idxs[i]])
```
The sample images referenced in the code snippet above can be found in the images directory of the GitHub repo.
💻 Usage Examples
Basic Usage
```python
from transformers import AutoTokenizer, TFAutoModel
import tensorflow as tf
import numpy as np
from PIL import Image
import torch
import clip

model_name = "mys/distilbert-base-turkish-cased-clip"
base_model = TFAutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
head_model = tf.keras.models.load_model("./clip_head.h5")
```
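Once the model, tokenizer, and projection head are loaded, the encode_text() helper defined in the Quick Start section can be reused to embed standalone Turkish sentences. The snippet below is a minimal sketch (the example sentences are illustrative placeholders); because the embeddings are L2-normalized, dot products directly give cosine similarities in CLIP's embedding space.

```python
# Minimal sketch: reuse encode_text() from the Quick Start section to embed
# Turkish sentences and compare them in CLIP space.
# The example sentences below are illustrative placeholders.
texts = [
    "sahilde yürüyen bir köpek",     # "a dog walking on the beach"
    "masada duran bir fincan kahve"  # "a cup of coffee on the table"
]
text_embs = encode_text(base_model, tokenizer, head_model, texts).numpy()

# Embeddings are L2-normalized, so the dot product equals cosine similarity.
cosine_sim = text_embs @ text_embs.T
print(cosine_sim)
```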
🔧 Technical Details
The encode_text() function aggregates the per-token hidden states produced by the DistilBERT model into a single vector per sequence by mean-pooling over non-padding tokens. The clip_head.h5 model then projects this vector into the same vector space as CLIP's text encoder using a single dense layer. Training proceeded in two stages: first, all DistilBERT layers were frozen and only the head dense layer was trained for several epochs; then the layers were unfrozen and the dense layer was trained together with the DistilBERT layers for a few more epochs. The dataset was created by machine-translating COCO captions into Turkish. During training, the vector representations of the English captions produced by the original CLIP text encoder were used as target values, and the Mean Squared Error (MSE) between these targets and the clip_head.h5 outputs was minimized.
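To make the two-stage procedure concrete, here is a minimal, hedged sketch of how such a projection head could be trained. The projection size (512, CLIP ViT-B/32's embedding dimension), optimizer, learning rate, and the train_step helper are illustrative assumptions, not the exact training configuration used for this model.

```python
import tensorflow as tf
from transformers import AutoTokenizer, TFAutoModel

# Illustrative sketch of the two-stage training described above.
# Hyperparameters and helper names are assumptions, not the exact setup.
base_model = TFAutoModel.from_pretrained("dbmdz/distilbert-base-turkish-cased")
tokenizer = AutoTokenizer.from_pretrained("dbmdz/distilbert-base-turkish-cased")

# Single dense layer projecting mean-pooled DistilBERT features (768-d)
# into CLIP ViT-B/32's embedding space (512-d).
head_model = tf.keras.Sequential([
    tf.keras.layers.Dense(512, input_shape=(768,), name="clip_projection")
])

def mean_pool(tokens):
    # Attention-mask-weighted mean over token embeddings.
    embs = base_model(**tokens)[0]
    mask = tf.cast(tokens["attention_mask"], tf.float32)
    summed = tf.reduce_sum(embs * tf.expand_dims(mask, -1), axis=1)
    return summed / tf.reduce_sum(mask, axis=-1, keepdims=True)

mse = tf.keras.losses.MeanSquaredError()
optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(turkish_texts, clip_text_targets, train_base):
    # clip_text_targets: embeddings of the corresponding English captions,
    # precomputed with CLIP's original text encoder.
    tokens = tokenizer(turkish_texts, padding=True, return_tensors="tf")
    with tf.GradientTape() as tape:
        preds = head_model(mean_pool(tokens))
        loss = mse(clip_text_targets, preds)
    variables = head_model.trainable_variables
    if train_base:  # stage 2: DistilBERT layers are unfrozen and trained too
        variables = variables + base_model.trainable_variables
    grads = tape.gradient(loss, variables)
    optimizer.apply_gradients(zip(grads, variables))
    return loss
```

In stage 1, train_step would be called with train_base=False so only the dense head is updated; in stage 2, train_base=True updates DistilBERT as well.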
Acknowledgement
Google supported this work by providing Google Cloud credits. Thank you, Google, for supporting open source! 🎉