bert-base-romanian-cased-v1: a free, open-source BERT model for Romanian

Bert Base Romanian Cased V1

Developed by dumitrescustefan

A cased BERT base model for Romanian, trained on a 15GB corpus.

Category: large language model (other). License: MIT. Tags: #RomanianBERT #Cased #NLP

Downloads: 6,466

Released: 3/2/2022

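Model overview

This is a pre-trained Romanian model based on the BERT architecture. It is suitable for a range of natural language processing tasks.

Model features

Romanian-specific

Trained specifically on Romanian text. On the evaluation tasks reported in this page it outperforms the multilingual BERT baseline.

Cased

The model distinguishes between uppercase and lowercase letters.

Large-scale training data

Trained on a 15GB Romanian corpus drawn from OPUS, OSCAR, and Wikipedia.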

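Model capabilities

Text encoding

Language understanding

Named entity recognition

Part-of-speech tagging

Use cases

Natural language processing

Part-of-speech tagging

Assign part-of-speech tags to Romanian text.

Scores 98.00 on the UPOS task.

Named entity recognition

Identify named entities in Romanian text.

Scores 85.88 F1 on the RONEC dataset.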

🚀 Romanian BERT base, cased, v1

This is the BERT base, cased model for Romanian, trained on a 15GB corpus. It can be used for Romanian natural language processing tasks such as part-of-speech tagging and named entity recognition.

🚀 Quick start

How to use

from transformers import AutoTokenizer, AutoModel
import torch
# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
# Tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)
# Get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple

⚠️ Important

Always sanitize your text! Replace the cedilla forms of s and t with their comma-below forms, using the following code:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

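This is necessary because the model was not trained on the cedilla forms of s and t. Without the replacement, performance drops because of <UNK> tokens, and the number of tokens per word increases.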
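The replacement above can also be wrapped in a small reusable helper. The sketch below is one possible way to do it, using only the Python standard library. The names `CEDILLA_TO_COMMA` and `sanitize_romanian` are hypothetical and not part of the model or of the transformers library. The characters are written as Unicode escapes so that the two visually similar forms cannot be confused in an editor.

```python
# Hypothetical helper: map the cedilla forms of s and t to the comma-below
# forms that the model was trained on.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def sanitize_romanian(text: str) -> str:
    """Return `text` with cedilla s/t replaced by comma-below s/t."""
    return text.translate(CEDILLA_TO_COMMA)
```

Calling `sanitize_romanian(text)` before `tokenizer.encode(text, ...)` has the same effect as the chained `replace` calls, in a single pass over the string.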

💻 Usage examples

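Advanced usage

from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
# Sanitize the text (cedilla forms -> comma-below forms), then encode it
text = "Acesta este un test."
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
input_ids = torch.tensor(tokenizer.encode(text, add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)
last_hidden_states = outputs[0]  # last hidden state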

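📚 Detailed documentation

Evaluation

Evaluation is performed on Universal Dependencies Romanian RRT for UPOS, XPOS, and LAS, and on a named entity recognition (NER) task based on RONEC. Details, along with further in-depth tests not shown here, are available on the dedicated evaluation page.

The baseline is the multilingual BERT model bert-base-multilingual-(un)cased, which at the time of writing was the only available BERT model that worked on Romanian.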

Model	UPOS	XPOS	NER	LAS
bert-base-multilingual-cased	97.87	96.16	84.13	88.04
bert-base-romanian-cased-v1	98.00	96.46	85.88	89.69
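To make the size of the gains explicit, the short sketch below computes the absolute difference between the two rows of the table. The scores are copied from the table; the variable names are illustrative only.

```python
# Scores copied from the evaluation table.
baseline = {"UPOS": 97.87, "XPOS": 96.16, "NER": 84.13, "LAS": 88.04}  # bert-base-multilingual-cased
romanian = {"UPOS": 98.00, "XPOS": 96.46, "NER": 85.88, "LAS": 89.69}  # bert-base-romanian-cased-v1

# Absolute gain in points for each task, rounded to two decimals
gains = {task: round(romanian[task] - baseline[task], 2) for task in baseline}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f}")
```

The largest gains are on NER (+1.75) and LAS (+1.65); UPOS and XPOS improve by 0.13 and 0.30 points respectively.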

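Corpus

The model was trained on the following corpora (the statistics in the table are after cleaning):

Corpus	Lines (millions)	Words (millions)	Characters (billions)	Size (GB)
OPUS	55.05	635.04	4.045	3.8
OSCAR	33.56	1725.82	11.411	11
Wikipedia	1.54	60.47	0.411	0.4
Total	90.15	2421.33	15.867	15.2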
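As a quick consistency check, the sketch below sums each column of the corpus table and computes each corpus's share of the total size. The figures are copied from the table; the variable names are illustrative only.

```python
# Statistics copied from the corpus table, per corpus:
# (lines in millions, words in millions, characters in billions, size in GB)
corpora = {
    "OPUS": (55.05, 635.04, 4.045, 3.8),
    "OSCAR": (33.56, 1725.82, 11.411, 11.0),
    "Wikipedia": (1.54, 60.47, 0.411, 0.4),
}
reported_total = (90.15, 2421.33, 15.867, 15.2)

# Column-wise sums, rounded to the table's precision
computed_total = tuple(round(sum(col), 3) for col in zip(*corpora.values()))

# Each corpus's share of the total size, in percent
size_share = {
    name: round(100 * stats[3] / reported_total[3], 1)
    for name, stats in corpora.items()
}

print(computed_total)
print(size_share)
```

The column sums match the reported totals. By size, OSCAR makes up roughly 72% of the cleaned corpus, OPUS 25%, and Wikipedia under 3%.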

Citation

If you use this model in a research paper, please cite the following paper:

Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.

Or, in BibTeX format:

@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan  and
      Avram, Andrei-Marius  and
      Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}