# 🚀 Artiwise ModernBERT - Base Turkish Uncased
We present Artiwise ModernBERT for Turkish 🎉. It is a BERT model with a modernized architecture and an increased context size: older BERT models have a context size of 512, while ModernBERT supports 8192. This model is a Turkish adaptation of ModernBERT, fine-tuned from answerdotai/ModernBERT-base using only the Turkish part of CulturaX.
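Since the extended context window is the main architectural change, you can confirm it directly from the checkpoint's configuration. This is a small sketch we added for illustration; it assumes the checkpoint exposes the standard `max_position_embeddings` field:

```python
from transformers import AutoConfig

# Load only the config (no weights) to inspect the context window.
config = AutoConfig.from_pretrained("artiwise-ai/modernbert-base-tr-uncased")
print(config.max_position_embeddings)  # expected: 8192
```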

## 📦 Installation
Note: Torch version must be >= 2.6.0 and transformers version >= 4.50.0 for the model to function properly.
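A minimal install matching those version constraints (exact pinning is left to your environment):

```bash
pip install "torch>=2.6.0" "transformers>=4.50.0"
```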
Also, don't use the `do_lower_case=True` flag with the tokenizer. Instead, convert your text to lowercase as follows:

```python
text = text.replace("I", "ı").lower()
```

This is due to a known issue with the tokenizer.
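For context on why the extra `replace` is needed: Python's built-in `str.lower()` maps the capital `I` to `i`, which is wrong for Turkish, where `I` lowercases to the dotless `ı`. A small illustration (the example word is ours, not from the model card):

```python
# Default lowercasing maps "I" -> "i"; Turkish needs "I" -> "ı".
print("ISPARTA".lower())                    # "isparta"  (wrong for Turkish)
print("ISPARTA".replace("I", "ı").lower())  # "ısparta"  (correct)
```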
## 💻 Usage Examples

### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
import torch

tokenizer = AutoTokenizer.from_pretrained("artiwise-ai/modernbert-base-tr-uncased")
model = AutoModelForMaskedLM.from_pretrained("artiwise-ai/modernbert-base-tr-uncased")

text = "Türkiye'nin başkenti [MASK]'dır."
# Turkish-aware lowercasing (see the note above). Note that `.lower()` also
# lowercases "[MASK]", so restore the tokenizer's mask token afterwards.
text = text.replace("I", "ı").lower().replace("[mask]", tokenizer.mask_token)

inputs = tokenizer(text, return_tensors="pt")
mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Scores for the masked position(s)
mask_token_logits = logits[0, mask_token_index, :]
top_5_tokens = torch.topk(mask_token_logits, 5, dim=1).indices[0].tolist()

print(f"Original text: {text}")
print("Top 5 predictions for [MASK]:")
for token in top_5_tokens:
    print(f"- {tokenizer.decode([token])}")
```
## 📚 Documentation

### Stats

| Property | Details |
|----------|---------|
| Model Type | Artiwise ModernBERT - Base Turkish Uncased |
| Training Data | CulturaX 192GB (tr) |
| Base Model | answerdotai/ModernBERT-base |
### Benchmark
The benchmark results below demonstrate that Artiwise ModernBERT consistently outperforms existing Turkish BERT variants across multiple domains and masking levels, highlighting its superior generalization capabilities.
Dataset & Mask Level |
Artiwise Modern Bert |
ytu - ce - cosmos/turkish - base - bert - uncased |
dbmdz/bert - base - turkish - uncased |
QA Dataset (5% mask) |
74.50 |
60.84 |
48.57 |
QA Dataset (10% mask) |
72.18 |
58.75 |
46.29 |
QA Dataset (15% mask) |
69.46 |
56.50 |
44.30 |
Review Dataset (5% mask) |
62.67 |
48.57 |
35.38 |
Review Dataset (10% mask) |
59.60 |
45.77 |
33.04 |
Review Dataset (15% mask) |
56.51 |
43.05 |
31.05 |
Biomedical Dataset (5% mask) |
58.11 |
50.78 |
40.82 |
Biomedical Dataset (10% mask) |
55.55 |
48.37 |
38.51 |
Biomedical Dataset (15% mask) |
52.71 |
45.82 |
36.44 |
Our experiments used three datasets: the [Turkish Biomedical Corpus](https://huggingface.co/datasets/hazal/Turkish-Biomedical-corpus-trM), the Turkish Product Reviews dataset, and the general-domain QA corpus turkish_v2.
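The card does not spell out the evaluation procedure behind these numbers. As a point of reference only, here is a minimal sketch of how masked-token top-1 accuracy at a given mask ratio could be computed; the random-masking strategy and the metric itself are our assumptions, not a documented protocol:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("artiwise-ai/modernbert-base-tr-uncased")
model = AutoModelForMaskedLM.from_pretrained("artiwise-ai/modernbert-base-tr-uncased")

def masked_accuracy(sentences, mask_ratio=0.05):
    """Randomly mask `mask_ratio` of non-special tokens; report top-1 accuracy (%)."""
    correct, total = 0, 0
    for sentence in sentences:
        enc = tokenizer(sentence, return_tensors="pt")
        ids = enc["input_ids"].clone()
        # Candidate positions: every token except special tokens.
        special = torch.tensor(
            tokenizer.get_special_tokens_mask(
                ids[0].tolist(), already_has_special_tokens=True
            )
        ).bool()
        candidates = (~special).nonzero(as_tuple=True)[0]
        n_mask = max(1, int(len(candidates) * mask_ratio))
        picked = candidates[torch.randperm(len(candidates))[:n_mask]]
        labels = ids[0, picked].clone()
        ids[0, picked] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=ids, attention_mask=enc["attention_mask"]).logits
        preds = logits[0, picked].argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += n_mask
    return 100.0 * correct / total

# Example (lowercased per the tokenizer note above):
print(masked_accuracy(["türkiye'nin başkenti ankara'dır."], mask_ratio=0.15))
```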
## 📄 License
This project is under the MIT license.