Turkish Mini BERT Open-Source Model: a free Turkish sentence and paragraph embedding model for clustering and semantic search

Turkish Mini Bert Uncased Mean Nli Stsb Tr

Developed by atasoglu

This is a small BERT model for Turkish, designed to produce vector representations of sentences and paragraphs for tasks such as clustering and semantic search.

Text Embedding

Transformers

Other Open Source License: MIT #Turkish semantic similarity #Lowercase text optimization #Lightweight BERT

Downloads 22

Release Time: 2/15/2024

Model Overview

The model maps sentences and paragraphs into a 256-dimensional dense vector space, primarily used for sentence similarity calculation and feature extraction.

Model Features

Turkish language optimization

Adapted from a Turkish base model (ytu-ce-cosmos/turkish-mini-bert-uncased) and fine-tuned on Turkish NLI and STS data.

Lowercase conversion

Input text must be lowercased manually before encoding (with "I" mapped to "ı"), as required by the model's authors.

Efficient vector representation

Maps sentences and paragraphs into a 256-dimensional dense vector space, suitable for resource-limited environments.

Model Capabilities

Sentence similarity calculation

Feature extraction

Text clustering

Semantic search

Use Cases

Text processing

Semantic search

Used for building semantic search engines for Turkish.

Text clustering

Performs clustering analysis on Turkish text.

🚀 Turkish Mini BERT Uncased Mean NLI STSB TR

This is a sentence-transformers model. It can transform sentences and paragraphs into a 256-dimensional dense vector space, which can be used for tasks such as clustering or semantic search.

This model is adapted from ytu-ce-cosmos/turkish-mini-bert-uncased and fine-tuned on the following datasets: nli_tr and emrecan/stsb-mt-turkish.

⚠️ Important Note

As stated by the model's authors, all texts need to be manually lowercased:

text.replace("I", "ı").lower()
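The one-liner above can be wrapped in a small helper so that every input passes through the same normalization. This is a minimal sketch: the name `turkish_lower` is hypothetical and is not part of the model or of sentence-transformers. The explicit replacement matters because Python's built-in `str.lower()` maps the capital "I" to the dotted "i", whereas Turkish lowercases it to the dotless "ı".

```python
def turkish_lower(text: str) -> str:
    """Lowercase text with the recipe given by the model's authors.

    `turkish_lower` is a hypothetical helper name used for illustration.
    """
    # Replace the capital "I" first; plain lower() would turn it into "i".
    return text.replace("I", "ı").lower()


sentences = ["IŞIK Açık", "Bu örnek bir cümle"]
normalized = [turkish_lower(s) for s in sentences]
print(normalized)
```

Applying the helper to queries and to stored documents alike keeps both sides of a comparison in the same form.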

🚀 Quick Start

✨ Features

Maps sentences and paragraphs to a 256-dimensional dense vector space.
Can be used for clustering and semantic search tasks.

📦 Installation

Using this model becomes easy when you have sentence-transformers installed:

pip install -U sentence-transformers

💻 Usage Examples

Basic Usage (Sentence-Transformers)

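from sentence_transformers import SentenceTransformer

# Lowercase manually first, as the model's authors require
sentences = ["Bu örnek bir cümle", "Her cümle dönüştürülür"]
sentences = [s.replace("I", "ı").lower() for s in sentences]

model = SentenceTransformer('atasoglu/turkish-mini-bert-uncased-mean-nli-stsb-tr')
embeddings = model.encode(sentences)  # one 256-dimensional vector per sentence
print(embeddings)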
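Once sentences are encoded, semantic search reduces to ranking stored vectors by their cosine similarity to a query vector. The sketch below illustrates that step in plain Python. The 3-dimensional vectors are made-up stand-ins for the 256-dimensional output of `model.encode`, and the names `cosine_similarity` and `search` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vec, corpus_vecs, top_k=2):
    """Return (index, score) pairs for the top_k most similar corpus vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy vectors standing in for model.encode(...) output (real ones have 256 dimensions).
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
query = [0.9, 0.1, 0.0]

for index, score in search(query, corpus):
    print(index, round(score, 3))
```

For large collections a vector index would normally replace the linear scan; the ranking criterion stays the same.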

Advanced Usage (HuggingFace Transformers)

Without sentence-transformers, you can use the model like this: first, pass your input through the transformer model, then apply the right pooling operation on top of the contextualized word embeddings.

from transformers import AutoTokenizer, AutoModel
import torch


# Mean pooling: take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


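# Sentences we want sentence embeddings for (lowercased manually, as the model's authors require)
sentences = ["Bu örnek bir cümle", "Her cümle dönüştürülür"]
sentences = [s.replace("I", "ı").lower() for s in sentences]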

# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('atasoglu/turkish-mini-bert-uncased-mean-nli-stsb-tr')
model = AutoModel.from_pretrained('atasoglu/turkish-mini-bert-uncased-mean-nli-stsb-tr')

# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)

# Perform pooling. In this case, mean pooling.
sentence_embeddings = mean_pooling(model_output, encoded_input['attention_mask'])

print("Sentence embeddings:")
print(sentence_embeddings)
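To see why `mean_pooling` multiplies by the attention mask, it can help to run the same arithmetic on a tiny example without PyTorch. The sketch below restates the idea in plain Python for a single sentence, using made-up 2-dimensional token vectors; `masked_mean` is a hypothetical name.

```python
def masked_mean(token_embeddings, attention_mask):
    """Average token vectors, counting only positions where the mask is 1."""
    dim = len(token_embeddings[0])
    sums = [0.0] * dim
    for vector, keep in zip(token_embeddings, attention_mask):
        for d in range(dim):
            sums[d] += vector[d] * keep
    # Guard against division by zero, mirroring the clamp in mean_pooling.
    count = max(sum(attention_mask), 1e-9)
    return [s / count for s in sums]


# Two real tokens followed by one padding position (mask 0).
tokens = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
mask = [1, 1, 0]

print(masked_mean(tokens, mask))       # padding vector is ignored
print(masked_mean(tokens, [1, 1, 1]))  # naive average, skewed by padding
```

Without the mask, padded positions would pull the sentence vector toward whatever the model emits for padding, and the result would depend on batch composition.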

📚 Documentation

Evaluation Results

Results on the STS-b test split:

Metric	Pearson	Spearman
Cosine similarity	0.8117	0.8074
Manhattan distance	0.8029	0.7972
Euclidean distance	0.8028	0.7977
Dot-product similarity	0.7563	0.7435
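The Pearson and Spearman figures measure how well the model's similarity scores track human-annotated gold scores: Pearson on the raw values, Spearman on their ranks. The sketch below shows both computations in plain Python on made-up scores. It is not the evaluator's own code, and the simple ranking assumes there are no tied values.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank positions (1 = smallest). Assumes no ties, for brevity."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Made-up data: gold similarity labels and predicted cosine similarities.
gold = [0.0, 1.0, 2.5, 4.0, 5.0]
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]

print("Pearson: ", round(pearson(gold, predicted), 4))
print("Spearman:", round(spearman(gold, predicted), 4))
```

Spearman only cares about ordering, so it is the more relevant figure when the embeddings are used for ranking, as in semantic search.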

Training

The model was trained with the following parameters:

DataLoader: torch.utils.data.dataloader.DataLoader of length 45 with parameters:

{'batch_size': 128, 'sampler': 'torch.utils.data.sampler.RandomSampler', 'batch_sampler': 'torch.utils.data.sampler.BatchSampler'}

Loss: sentence_transformers.losses.CosineSimilarityLoss.CosineSimilarityLoss

Parameters of the fit() method:

{
    "epochs": 10,
    "evaluation_steps": 4,
    "evaluator": "sentence_transformers.evaluation.EmbeddingSimilarityEvaluator.EmbeddingSimilarityEvaluator",
    "max_grad_norm": 1,
    "optimizer_class": "<class 'torch.optim.adamw.AdamW'>",
    "optimizer_params": {
        "lr": 2e-05
    },
    "scheduler": "WarmupLinear",
    "steps_per_epoch": null,
    "warmup_steps": 45,
    "weight_decay": 0.01
}
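To make the schedule concrete: with a DataLoader of length 45 and 10 epochs, training runs for 450 optimizer steps (assuming that `steps_per_epoch: null` means one full pass over the DataLoader per epoch), so the 45 warmup steps cover the first 10%. The sketch below models a linear warmup followed by a linear decay to zero, which is how a `WarmupLinear` schedule is commonly described. Treat it as an approximation of the training setup, not the library's code; `warmup_linear_lr` is a hypothetical name.

```python
def warmup_linear_lr(step, base_lr=2e-5, warmup_steps=45, total_steps=450):
    """Learning rate at a given optimizer step: linear warmup, then linear decay."""
    if step < warmup_steps:
        # Ramp up from 0 to base_lr over the warmup steps.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 45 * 10  # DataLoader length times epochs (assumption stated above)

for step in (0, 22, 45, 225, 450):
    print(step, warmup_linear_lr(step, total_steps=total_steps))
```

Under these assumptions the learning rate peaks at 2e-05 at step 45 and reaches zero at the end of the tenth epoch.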

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 256, 'do_lower_case': False}) with Transformer model: BertModel 
  (1): Pooling({'word_embedding_dimension': 256, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False})
)

📄 License

This model is licensed under the MIT license.

📋 Information Table

Property	Details
Pipeline Tag	sentence-similarity
Tags	sentence-transformers, feature-extraction, sentence-similarity, transformers
Datasets	nli_tr, emrecan/stsb-mt-turkish
Library Name	sentence-transformers
Base Model	ytu-ce-cosmos/turkish-mini-bert-uncased
License	MIT
