Bert2D-cased-Turkish-128K-WWM-NSW2 open-source model: efficient handling of Turkish's complex morphology

Bert2d Cased Turkish 128K WWM NSW2

Developed by yigitbekir

Bert2DModel is a new take on the classic BERT architecture, designed for languages with complex morphology such as Turkish.

Large language model

PyTorch

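License: Apache-2.0 #Turkish-optimized #2D-embeddings #complex-morphology

Downloads: 610

Released: 5/22/2025

Model overview

Bert2DModel uses a "two-dimensional embedding" system: it tracks not only each word's position in the sentence but also the position of each subword within its word, giving the model more signal about grammar and meaning. The first release of the model was trained on Turkish.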

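Model features

Two-dimensional embedding system

Encodes both the position of a word in the sentence and the position of each subword inside that word.

Optimized for Turkish

Designed specifically for Turkish, a language with complex morphology.

Custom configuration parameters

Adds configuration parameters that do not exist in standard BERT, such as max_word_position_embeddings and max_intermediate_subword_position_embeddings.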
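
The two-dimensional idea described above can be illustrated with a small sketch. It is not the model's actual implementation: the function name two_d_positions, the "##" continuation prefix, the zero-based indexing, and the sample segmentation are all assumptions made for illustration. It only shows how each token can carry one index for its word and another for its place inside that word.

```python
from typing import List, Tuple


def two_d_positions(tokens: List[str]) -> Tuple[List[int], List[int]]:
    """Return (word_positions, subword_positions) for WordPiece-style tokens.

    Assumption: continuation pieces carry the conventional "##" prefix.
    """
    word_positions: List[int] = []
    subword_positions: List[int] = []
    word_index = -1
    subword_index = 0
    for token in tokens:
        if token.startswith("##") and word_index >= 0:
            # Same word as the previous token, next piece inside it
            subword_index += 1
        else:
            # A new word starts here
            word_index += 1
            subword_index = 0
        word_positions.append(word_index)
        subword_positions.append(subword_index)
    return word_positions, subword_positions


# Hypothetical segmentation of "evlerimizden geldik", for illustration only
tokens = ["ev", "##ler", "##imiz", "##den", "geldik"]
word_pos, sub_pos = two_d_positions(tokens)
for token, w, s in zip(tokens, word_pos, sub_pos):
    print(f"{token:8} word={w} subword={s}")
```

In this sketch all four pieces of the first word share word index 0 while their subword indices run from 0 to 3, so the suffix chain of a heavily inflected word stays tied to a single word slot.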

Model capabilities

Turkish text understanding

Fill-mask

Text classification

Token classification

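Use cases

Text understanding

Occupation prediction

Predicts a missing occupation in a sentence.

For example, 'Adamın mesleği [MASK] midir acaba?' may be completed with 'mühendis' (engineer) or 'doktor' (doctor).

Grammar analysis

Parsing complex morphology

Parses complex morphological structures in Turkish.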

🚀 Bert2DModel

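Bert2DModel is a new take on the classic BERT architecture, designed for languages with complex morphology such as Turkish. Its "two-dimensional embedding" system tracks not only each word's position in the sentence but also the position of each subword within its word, giving the model more signal about grammar and meaning. The first release of the model was trained on Turkish.

🚀 Quick start

The examples below show how to use Bert2DModel with the fill-mask pipeline, or how to load it directly with the AutoModel classes.

💻 Usage examples

Basic usage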

from transformers import pipeline

# 1. Define your model repository ID
repo_id = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# 2. Create the pipeline for the "fill-mask" task
# trust_remote_code=True permits loading the custom model code from the repository.
fill_masker = pipeline(
    "fill-mask",
    model=repo_id,
    use_fast=True,
    trust_remote_code=True
)

# 3. Prepare the input and get predictions
masked_sentence = "Adamın mesleği [MASK] midir acaba?"
predictions = fill_masker(masked_sentence)

# 4. Print the results in a user-friendly format
print(f"Predictions for: '{masked_sentence}'")
for prediction in predictions:
    print(f"  Sequence: {prediction['sequence']}")
    print(f"  Token: {prediction['token_str']}")
    print(f"  Score: {prediction['score']:.4f}")
    print("-" * 20)

# Expected output (abridged: the header line and the Token lines are not shown):
# Sequence: Adamın mesleği mühendis midir acaba?
# Score: 0.2393
# --------------------
# Sequence: Adamın mesleği doktor midir acaba?
# Score: 0.1698
# --------------------
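
The fill-mask pipeline returns a list of dictionaries, so the predictions can be post-processed with ordinary Python. The sketch below ranks the candidates and keeps only those above a score cutoff. The helper name keep_confident and the 0.2 cutoff are hypothetical, and the sample list is hand-written from the two results shown above (assuming token_str holds the bare predicted word) rather than produced by a live model call.

```python
from typing import Dict, List


def keep_confident(predictions: List[Dict], min_score: float) -> List[str]:
    """Return predicted tokens whose score is at least min_score, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"] for p in ranked if p["score"] >= min_score]


# Hand-written sample shaped like fill-mask output, using the scores listed above
sample = [
    {"sequence": "Adamın mesleği doktor midir acaba?", "token_str": "doktor", "score": 0.1698},
    {"sequence": "Adamın mesleği mühendis midir acaba?", "token_str": "mühendis", "score": 0.2393},
]

print(keep_confident(sample, min_score=0.2))  # ['mühendis']
```

With a real pipeline, the same helper can be applied to the predictions list directly.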

Advanced usage

import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# Load the tokenizer and model (custom code, so trust_remote_code is required)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# Example text
text = "Türkiye'nin başkenti Ankara'dır."
inputs = tokenizer(text, return_tensors="pt")

# Get model outputs without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state

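🔧 Technical details

Configuration notes

Bert2D introduces configuration parameters that standard BERT does not have. When training or fine-tuning, you must use Bert2DConfig and set these parameters deliberately; otherwise the model may behave unexpectedly. The two key new parameters are max_word_position_embeddings and max_intermediate_subword_position_embeddings.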

from transformers import AutoConfig

# Load the custom config from a pretrained model
config = AutoConfig.from_pretrained("yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2", trust_remote_code=True)

# Access new parameters
print(f"Max Word Positions: {config.max_word_position_embeddings}")
# Expected output: Max Word Positions: 512

print(f"Intermediate Subword Position: {config.max_intermediate_subword_position_embeddings}")
# Expected output: Intermediate Subword Position: 2
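
Because a missing Bert2D setting may only surface as odd behavior later, a small pre-flight check before fine-tuning can help. The following is a minimal sketch, not part of the library: the helper name missing_bert2d_fields is hypothetical, the rule that each value must be a positive integer is an assumption, and SimpleNamespace objects stand in for real config objects so the snippet runs without downloading anything.

```python
from types import SimpleNamespace

# The two Bert2D-specific parameter names mentioned above
REQUIRED_FIELDS = (
    "max_word_position_embeddings",
    "max_intermediate_subword_position_embeddings",
)


def missing_bert2d_fields(config) -> list:
    """Return the required field names that are absent or not positive integers."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = getattr(config, name, None)
        # Assumption: a usable value is a positive integer
        if not isinstance(value, int) or value <= 0:
            missing.append(name)
    return missing


# Stand-ins for real config objects; 512 and 2 are the values printed above
bert2d_like = SimpleNamespace(
    max_word_position_embeddings=512,
    max_intermediate_subword_position_embeddings=2,
)
plain_bert_like = SimpleNamespace(max_position_embeddings=512)

print(missing_bert2d_fields(bert2d_like))      # []
print(missing_bert2d_fields(plain_bert_like))  # lists both required names
```

The same function can be called on a config loaded with AutoConfig, since it only reads attributes by name.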