Bert2D-cased-Turkish-128K-WWM-NSW2 open-source model - efficient handling of Turkish's complex morphology

Bert2d Cased Turkish 128K WWM NSW2

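Developed by yigitbekir

Bert2DModel is a new take on the classic BERT architecture, designed for languages with complex morphology such as Turkish.

Large language model

PyTorch

License: Apache-2.0 #Turkish-optimized #2D-embeddings #complex-morphology

Downloads: 610

Released: 5/22/2025

Model overview

Through its "2D embedding" system, Bert2DModel tracks not only each word's position in the sentence but also the position of each subword within its word, which is intended to give it a deeper understanding of grammar and semantics. The first release of the model was trained on Turkish.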

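Model features

2D embedding system

Encodes both a word's position in the sentence and each subword's position within the word, for a deeper understanding of grammar and semantics.

Optimized for Turkish

Designed specifically for Turkish, a language with complex morphology.

Custom configuration parameters

Introduces configuration parameters that standard BERT models do not have, such as max_word_position_embeddings and max_intermediate_subword_position_embeddings.

Model capabilities

Turkish text understanding

Fill-mask

Text classification

Token classification

Use cases

Text understanding

Occupation prediction

Predicts a missing occupation in a sentence.

Example: 'Adamın mesleği [MASK] midir acaba?' may be completed with 'mühendis' (engineer) or 'doktor' (doctor).

Grammatical analysis

Complex morphology parsing

Parses complex morphological structures in Turkish.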

🚀 Bert2DModel

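Bert2DModel is a new take on the classic BERT architecture, designed for languages with complex morphology such as Turkish. Its "2D embedding" system tracks not only each word's position in the sentence but also the position of each subword within its word, which is intended to give it a deeper understanding of grammar and semantics. The first release of the model was trained on Turkish.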
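
To make the two position axes concrete, here is a minimal sketch that assigns each token a word position and a within-word subword position. It is an illustration only, not the model's actual indexing scheme: the function name two_d_positions, the "##" continuation-prefix convention, and the sample segmentation are all assumptions.

```python
from typing import List, Tuple


def two_d_positions(tokens: List[str]) -> Tuple[List[int], List[int]]:
    """Return (word_positions, subword_positions) for WordPiece-style tokens.

    Assumption: a token starting with "##" continues the previous word,
    as in standard BERT WordPiece vocabularies.
    """
    word_positions: List[int] = []
    subword_positions: List[int] = []
    word_index = -1
    subword_index = 0
    for token in tokens:
        if token.startswith("##") and word_index >= 0:
            subword_index += 1  # next piece of the same word
        else:
            word_index += 1     # a new word starts here
            subword_index = 0
        word_positions.append(word_index)
        subword_positions.append(subword_index)
    return word_positions, subword_positions


# Hypothetical segmentation of "evlerimizden geldik", for illustration only;
# the real tokenizer may split these words differently.
tokens = ["ev", "##ler", "##imiz", "##den", "geldik"]
word_pos, subword_pos = two_d_positions(tokens)
print(word_pos)     # [0, 0, 0, 0, 1]
print(subword_pos)  # [0, 1, 2, 3, 0]
```

A standard BERT model would see these five tokens only as positions 0 to 4. The second axis is what lets suffix pieces of one word share a word position while still being ordered among themselves.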

🚀 Quick start

The examples below show how to use Bert2DModel with the fill-mask pipeline, or how to load it directly with the AutoModel classes.

💻 Usage examples

Basic usage

from transformers import pipeline

# 1. Define your model repository ID
repo_id = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# 2. Create the pipeline for the "fill-mask" task
# trust_remote_code=True allows transformers to load the custom Bert2D code from the model repository;
# use_fast=True requests the fast tokenizer implementation.
fill_masker = pipeline(
    "fill-mask",
    model=repo_id,
    use_fast=True,
    trust_remote_code=True
)

# 3. Prepare the input and get predictions
masked_sentence = "Adamın mesleği [MASK] midir acaba?"
predictions = fill_masker(masked_sentence)

# 4. Print the results in a user-friendly format
print(f"Predictions for: '{masked_sentence}'")
for prediction in predictions:
    print(f"  Sequence: {prediction['sequence']}")
    print(f"  Token: {prediction['token_str']}")
    print(f"  Score: {prediction['score']:.4f}")
    print("-" * 20)

# Expected output (top predictions; the header and Token lines are omitted here):
# Sequence: Adamın mesleği mühendis midir acaba?
# Score: 0.2393
# --------------------
# Sequence: Adamın mesleği doktor midir acaba?
# Score: 0.1698
# --------------------
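
The pipeline returns a list of dictionaries with the 'sequence', 'token_str' and 'score' keys used in the loop above, so post-processing is plain list handling. The sketch below keeps only predictions above a score threshold. The helper name confident_tokens and the threshold value are hypothetical, and the sample list is a stand-in built from the two scores quoted above, not fresh model output.

```python
from typing import Dict, List, Tuple


def confident_tokens(predictions: List[Dict], min_score: float) -> List[Tuple[str, float]]:
    """Keep (token_str, score) pairs whose score is at least min_score, best first."""
    kept = [(p["token_str"], p["score"]) for p in predictions if p["score"] >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Stand-in for the pipeline output, reusing the two scores quoted above.
sample_predictions = [
    {"sequence": "Adamın mesleği doktor midir acaba?", "token_str": "doktor", "score": 0.1698},
    {"sequence": "Adamın mesleği mühendis midir acaba?", "token_str": "mühendis", "score": 0.2393},
]
print(confident_tokens(sample_predictions, min_score=0.2))  # [('mühendis', 0.2393)]
```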

Advanced usage

import torch
from transformers import AutoTokenizer, AutoModel

repo_id = "yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2"

# Load the tokenizer and model (trust_remote_code is needed for the custom Bert2D classes)
tokenizer = AutoTokenizer.from_pretrained(repo_id, trust_remote_code=True)
model = AutoModel.from_pretrained(repo_id, trust_remote_code=True)

# Example text
text = "Türkiye'nin başkenti Ankara'dır."
inputs = tokenizer(text, return_tensors="pt")

# Get model outputs without tracking gradients (inference only)
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size)
last_hidden_states = outputs.last_hidden_state

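🔧 Technical details

Configuration notes

Bert2D introduces configuration parameters that standard BERT models do not have. When training or fine-tuning, you must use Bert2DConfig and pay attention to these settings; otherwise you may see unexpected behavior. The two key new parameters are max_word_position_embeddings and max_intermediate_subword_position_embeddings.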

from transformers import AutoConfig

# Load the custom config from a pretrained model
config = AutoConfig.from_pretrained("yigitbekir/Bert2D-cased-Turkish-128K-WWM-NSW2", trust_remote_code=True)
  
# Access new parameters
print(f"Max Word Positions: {config.max_word_position_embeddings}")
# Expected output: Max Word Positions: 512
  
print(f"Intermediate Subword Position: {config.max_intermediate_subword_position_embeddings}")
# Expected output: Intermediate Subword Position: 2
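
Since a wrong config class can lead to unexpected behavior, a quick guard before fine-tuning may help. The sketch below checks that a config object exposes both Bert2D-specific parameters. The helper name missing_bert2d_fields is hypothetical, and SimpleNamespace objects stand in for real configs so the example runs offline; with a loaded config you would pass that object instead.

```python
from types import SimpleNamespace

# The two Bert2D-specific parameter names come from the section above.
REQUIRED_BERT2D_FIELDS = (
    "max_word_position_embeddings",
    "max_intermediate_subword_position_embeddings",
)


def missing_bert2d_fields(config) -> list:
    """Return the Bert2D-specific fields that are absent from a config object."""
    return [name for name in REQUIRED_BERT2D_FIELDS if not hasattr(config, name)]


# Stand-in objects so the sketch runs offline; the values mirror the output quoted above.
bert2d_like = SimpleNamespace(
    max_word_position_embeddings=512,
    max_intermediate_subword_position_embeddings=2,
)
plain_bert_like = SimpleNamespace(max_position_embeddings=512)

print(missing_bert2d_fields(bert2d_like))      # []
print(missing_bert2d_fields(plain_bert_like))  # both field names are reported
```

An empty list means both parameters are present. A non-empty list suggests that a standard BERT config was loaded by mistake.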