🚀 Sundanese RoBERTa Base Model
Sundanese RoBERTa Base is a masked language model based on the RoBERTa architecture. It was trained on four datasets: the unshuffled_deduplicated_su subset of OSCAR, the Sundanese subset of mC4, the Sundanese subset of CC100, and Sundanese Wikipedia. The model provides Sundanese-specific pre-training for natural language processing and helps improve performance on downstream tasks.
✨ Key Features
- Built on the RoBERTa architecture, giving it strong language-understanding capability.
- Trained on several large-scale Sundanese datasets, providing broad data coverage.
- Trained from scratch, reaching an evaluation loss of 1.952 and an accuracy of 63.98%.
📦 Installation
Specific installation steps are not documented; follow Hugging Face's general setup, for example installing the transformers library (plus a backend such as PyTorch) with `pip install transformers`.
💻 Usage Examples
Basic Usage
# Load a fill-mask pipeline backed by the Sundanese RoBERTa checkpoint
from transformers import pipeline

pretrained_name = "w11wo/sundanese-roberta-base"
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

# Predict likely tokens for the masked position
fill_mask("Budi nuju <mask> di sakola.")
Advanced Usage
# Load the raw encoder and fast tokenizer for feature extraction
from transformers import RobertaModel, RobertaTokenizerFast

pretrained_name = "w11wo/sundanese-roberta-base"
model = RobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

prompt = "Budi nuju diajar di sakola."
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
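The output exposes per-token hidden states; a common follow-up step (our assumption, not shown in the original card) is to mean-pool them into a single sentence embedding:

# last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# averaging over the token axis yields one fixed-size vector per sentence.
sentence_embedding = output.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)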
📚 Documentation
Model Information
| Attribute | Details |
|------|------|
| Model type | sundanese-roberta-base |
| No. of parameters | 124M |
| Architecture | RoBERTa |
| Training/validation data (text) | OSCAR, mC4, CC100, Wikipedia (758MB) |
Evaluation Results
The model was trained for 50 epochs; the final results at the end of training are as follows:
| Training loss | Validation loss | Validation accuracy | Total time |
|------|------|------|------|
| 1.965 | 1.952 | 0.6398 | 6:24:51 |
🔧 Technical Details
- The model was trained with Hugging Face's Flax framework (see the loading sketch after this list).
- All scripts needed for training can be found in the Files and versions tab.
- Training metrics logged via TensorBoard are also available for viewing.
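Because the checkpoint was trained with Flax, it can presumably also be loaded directly through the Flax classes in transformers. A minimal sketch, assuming Flax weights are published in the w11wo/sundanese-roberta-base repository (if only PyTorch weights exist, from_pretrained accepts from_pt=True):

from transformers import FlaxRobertaModel, RobertaTokenizerFast

pretrained_name = "w11wo/sundanese-roberta-base"
# Load the Flax version of the encoder together with the fast tokenizer
model = FlaxRobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

# Encode a Sundanese sentence as NumPy arrays and run the Flax forward pass
encoded_input = tokenizer("Budi nuju diajar di sakola.", return_tensors="np")
output = model(**encoded_input)
print(output.last_hidden_state.shape)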
📄 License
This project is released under the MIT License.
⚠️ Important Notes
Please keep in mind that biases present in the four training datasets may carry over into this model's outputs.
📖 Citation
@article{rs-907893,
author = {Wongso, Wilson
and Lucky, Henry
and Suhartono, Derwin},
journal = {Journal of Big Data},
year = {2022},
month = {Feb},
day = {26},
abstract = {The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.},
issn = {2693-5015},
doi = {10.21203/rs.3.rs-907893/v1},
url = {https://doi.org/10.21203/rs.3.rs-907893/v1}
}
👨‍💻 Author
The Sundanese RoBERTa Base model was trained and evaluated by Wilson Wongso.