indobert-large-p1: a free, open-source Indonesian language model for Indonesian language understanding and applications

Indobert Large P1

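Developed by indobenchmark

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model, trained with masked language modeling and next sentence prediction objectives.

Category: large language model, other. License: MIT. Tags: #Indonesian-pretraining #large-scale-language-model #contextual-representation-extraction

Downloads: 1,686

Release date: 3/2/2022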

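Model overview

IndoBERT is a pretrained language model optimized for Indonesian, suitable for a range of natural language processing tasks.

Model features

Large-scale pretraining

Pretrained on the Indo4B dataset (23.43 GB of text)

Uncased

The model does not distinguish between uppercase and lowercase when processing text

Two-phase training

The model is trained in two phases (P1 and P2)

Model capabilities

Text representation learning

Language understanding

Text classification

Question answering

Named entity recognition

Use cases

Natural language processing

Text classification

Classify Indonesian text

Question answering

Build Indonesian question answering systems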
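
For the text classification use case listed above, the following is a minimal sketch of how a classification head might be attached to the pretrained encoder with the transformers library. The three sentiment labels and the helper names `build_classifier` and `predict_label` are hypothetical and chosen only for illustration. The classification head is randomly initialized, so the model would need fine-tuning on labeled Indonesian data before its predictions mean anything.

```python
MODEL_NAME = "indobenchmark/indobert-large-p1"

# Hypothetical label set, used only for illustration.
LABELS = ["negative", "neutral", "positive"]
ID2LABEL = dict(enumerate(LABELS))
LABEL2ID = {label: idx for idx, label in ID2LABEL.items()}


def build_classifier(model_name=MODEL_NAME):
    """Load the tokenizer and the encoder with an untrained classification head."""
    # Imported lazily so the label mappings above can be used without transformers.
    from transformers import AutoModelForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(LABELS),
        id2label=ID2LABEL,
        label2id=LABEL2ID,
    )
    return tokenizer, model


def predict_label(text, tokenizer, model):
    """Return the label with the highest logit for a single text."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits  # shape: (1, num_labels)
    return ID2LABEL[int(logits.argmax(dim=-1))]
```

After fine-tuning, `predict_label("...", *build_classifier())` would return one of the label strings; before fine-tuning the output is effectively arbitrary.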

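🚀 IndoBERT Large Model (Phase 1 - Uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model is trained using a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.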
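
To make the MLM objective concrete, the sketch below shows how BERT-style masking can turn a token sequence into a training example. It is a minimal illustration, assuming the 15% selection rate and the 80/10/10 replacement split from the original BERT recipe; the source does not state that IndoBERT used exactly these ratios. The function name `mask_tokens`, the toy vocabulary, and the example word `sekolah` are hypothetical, and each position is sampled independently for simplicity.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) using BERT-style 80/10/10 masking."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)  # None marks positions the MLM loss ignores
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN  # 80%: replace with the mask token
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: leave the original token in place
    return masked, labels


# Toy example; a real pipeline would operate on WordPiece token IDs.
vocab = ["aku", "adalah", "anak", "sekolah"]
masked, labels = mask_tokens(["aku", "adalah", "anak", "sekolah"], vocab)
print(masked, labels)
```

During pretraining, the loss is computed only at positions whose label is not `None`, which is why the labels list is returned alongside the masked sequence.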

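✨ Key features

IndoBERT is based on the BERT model and optimized for Indonesian. Its MLM and NSP training objectives are intended to improve language understanding.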

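📦 Installation

The original documentation does not give specific installation steps. To use the model, follow the installation instructions for the transformers library (for example, `pip install transformers torch`).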

💻 Usage examples

Basic usage

# Load the tokenizer and the pretrained encoder from the Hugging Face Hub.
from transformers import BertTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p1")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p1")

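Advanced usage (extracting contextual representations)

import torch
# Encode the sentence as token IDs shaped (1, sequence_length), then sum the last hidden states.
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1,-1)
print(x, model(x)[0].sum())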
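
The snippet above prints a single scalar, which confirms the model runs but is not directly useful as a representation. One common way to obtain a fixed-size sentence vector is to average the last hidden states over non-padding tokens. The sketch below assumes that mean pooling is an acceptable choice here (the source does not prescribe a pooling strategy); the helper names `embed` and `cosine_similarity` are hypothetical.

```python
import math

MODEL_NAME = "indobenchmark/indobert-large-p1"


def embed(sentences, model_name=MODEL_NAME):
    """Return one mean-pooled vector (list of floats) per input sentence."""
    # Imported lazily so cosine_similarity can be used without torch installed.
    import torch
    from transformers import AutoModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(sentences, padding=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Zero out padding positions, then average over the real tokens only.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.tolist()


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm
```

For example, `v1, v2 = embed(["aku adalah anak", "aku adalah anak [MASK]"])` followed by `cosine_similarity(v1, v2)` would compare the two sentence vectors. Calling `embed` downloads the model weights on first use.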

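📚 Detailed documentation

All pretrained models

Model	Parameters	Architecture	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)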
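
The parameter counts listed above can guide which checkpoint to load when memory is limited. The following sketch records those counts in a dictionary and picks the largest checkpoint that fits a given budget. The helper `largest_model_within` and the idea of a parameter budget are hypothetical conveniences, not part of the IndoBERT release.

```python
# Parameter counts in millions, as listed for the pretrained models above.
INDOBERT_PARAMS_M = {
    "indobenchmark/indobert-base-p1": 124.5,
    "indobenchmark/indobert-base-p2": 124.5,
    "indobenchmark/indobert-large-p1": 335.2,
    "indobenchmark/indobert-large-p2": 335.2,
    "indobenchmark/indobert-lite-base-p1": 11.7,
    "indobenchmark/indobert-lite-base-p2": 11.7,
    "indobenchmark/indobert-lite-large-p1": 17.7,
    "indobenchmark/indobert-lite-large-p2": 17.7,
}


def largest_model_within(budget_m, phase="p1"):
    """Return the largest checkpoint of the given phase with at most budget_m
    million parameters, or None if no checkpoint fits."""
    candidates = {
        name: size
        for name, size in INDOBERT_PARAMS_M.items()
        if name.endswith("-" + phase) and size <= budget_m
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)


print(largest_model_within(150))
```

The returned name can be passed directly to `from_pretrained` in the usage examples.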

📄 License

This project is released under the MIT license.

👥 Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti.

📚 Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}