🚀 Sundanese RoBERTa Base Model
Sundanese RoBERTa Base is a masked language model based on the RoBERTa architecture. It was trained on four datasets: the unshuffled_deduplicated_su subset of OSCAR, the Sundanese subset of mC4, the Sundanese subset of CC100, and Sundanese Wikipedia. The model provides Sundanese-specific pre-training for natural language processing and helps improve performance on downstream tasks.
✨ Key Features
- Built on the RoBERTa architecture, giving it strong language-understanding capability.
- Trained on several large-scale Sundanese datasets, providing broad data coverage.
- Trained from scratch, reaching an evaluation loss of 1.952 and an accuracy of 63.98%.
📦 Installation
Specific installation steps are not documented; follow Hugging Face's general setup, for example installing the transformers library (plus a backend such as PyTorch) with `pip install transformers`.
💻 Usage Examples
Basic Usage
# Load a fill-mask pipeline backed by the Sundanese RoBERTa checkpoint
from transformers import pipeline

pretrained_name = "w11wo/sundanese-roberta-base"
fill_mask = pipeline(
    "fill-mask",
    model=pretrained_name,
    tokenizer=pretrained_name
)

# Predict likely tokens for the masked position
fill_mask("Budi nuju <mask> di sakola.")
Advanced Usage
# Load the raw encoder and fast tokenizer for feature extraction
from transformers import RobertaModel, RobertaTokenizerFast

pretrained_name = "w11wo/sundanese-roberta-base"
model = RobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

prompt = "Budi nuju diajar di sakola."
encoded_input = tokenizer(prompt, return_tensors='pt')
output = model(**encoded_input)
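The output exposes per-token hidden states; a common follow-up step (our assumption, not shown in the original card) is to mean-pool them into a single sentence embedding:

# last_hidden_state has shape (batch_size, sequence_length, hidden_size);
# averaging over the token axis yields one fixed-size vector per sentence.
sentence_embedding = output.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)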
📚 Documentation
Model Information
| Attribute | Details |
|------|------|
| Model type | sundanese-roberta-base |
| No. of parameters | 124M |
| Architecture | RoBERTa |
| Training/validation data (text) | OSCAR, mC4, CC100, Wikipedia (758MB) |
Evaluation Results
The model was trained for 50 epochs; the final results at the end of training are as follows:
| Training loss | Validation loss | Validation accuracy | Total time |
|------|------|------|------|
| 1.965 | 1.952 | 0.6398 | 6:24:51 |
🔧 Technical Details
- The model was trained with Hugging Face's Flax framework (see the loading sketch after this list).
- All scripts needed for training can be found in the Files and versions tab.
- Training metrics logged via TensorBoard are also available for viewing.
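Because the checkpoint was trained with Flax, it can presumably also be loaded directly through the Flax classes in transformers. A minimal sketch, assuming Flax weights are published in the w11wo/sundanese-roberta-base repository (if only PyTorch weights exist, from_pretrained accepts from_pt=True):

from transformers import FlaxRobertaModel, RobertaTokenizerFast

pretrained_name = "w11wo/sundanese-roberta-base"
# Load the Flax version of the encoder together with the fast tokenizer
model = FlaxRobertaModel.from_pretrained(pretrained_name)
tokenizer = RobertaTokenizerFast.from_pretrained(pretrained_name)

# Encode a Sundanese sentence as NumPy arrays and run the Flax forward pass
encoded_input = tokenizer("Budi nuju diajar di sakola.", return_tensors="np")
output = model(**encoded_input)
print(output.last_hidden_state.shape)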
📄 License
This project is released under the MIT License.
⚠️ Important Notes
Please keep in mind that biases present in the four training datasets may carry over into this model's outputs.
📖 Citation
@article{rs-907893,
author = {Wongso, Wilson
and Lucky, Henry
and Suhartono, Derwin},
journal = {Journal of Big Data},
year = {2022},
month = {Feb},
day = {26},
abstract = {The Sundanese language has over 32 million speakers worldwide, but the language has reaped little to no benefits from the recent advances in natural language understanding. Like other low-resource languages, the only alternative is to fine-tune existing multilingual models. In this paper, we pre-trained three monolingual Transformer-based language models on Sundanese data. When evaluated on a downstream text classification task, we found that most of our monolingual models outperformed larger multilingual models despite the smaller overall pre-training data. In the subsequent analyses, our models benefited strongly from the Sundanese pre-training corpus size and do not exhibit socially biased behavior. We released our models for other researchers and practitioners to use.},
issn = {2693-5015},
doi = {10.21203/rs.3.rs-907893/v1},
url = {https://doi.org/10.21203/rs.3.rs-907893/v1}
}
👨‍💻 Author
The Sundanese RoBERTa Base model was trained and evaluated by Wilson Wongso.