indobert-base-p1: open-source Indonesian language model for text understanding and prediction tasks

Indobert Base P1

Developed by indobenchmark

IndoBERT is an advanced Indonesian language model based on BERT, pretrained with masked language modeling (MLM) and next sentence prediction (NSP) objectives.

Category: Large Language Model, Other · License: MIT · #IndonesianPretraining #MultiStageTraining #MaskedLanguageModeling

Downloads: 261.95k

Released: 3/2/2022

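Model overview

IndoBERT is a pretrained language model for Indonesian. It uses the BERT architecture and serves as a base for fine-tuning on a range of natural language processing tasks.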

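Model features

Optimized for Indonesian

Trained specifically on Indonesian text, so it suits Indonesian natural language processing tasks.

Based on the BERT architecture

Uses the BERT encoder architecture, which produces contextual representations suited to language understanding tasks.

Large-scale training data

Trained on the Indo4B dataset (23.43 GB of text), which covers a broad range of Indonesian content.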

Model capabilities

Text understanding

Masked-token prediction

Language model pretraining

Sentence relationship prediction

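Use cases

Natural language processing

Text classification

Fine-tune the model to classify Indonesian text.

Question answering

Build Indonesian question answering systems on top of the fine-tuned model.

Masked-token prediction

Predict masked tokens in Indonesian text. As an encoder-only BERT model, it is not designed for open-ended text generation.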

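🚀 IndoBERT Base Model (phase 1 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model is trained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.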
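To make the MLM objective concrete, here is a minimal sketch of how input tokens can be corrupted before the model is asked to recover them. It assumes the standard BERT recipe (select about 15% of positions, then replace 80% of those with `[MASK]`, 10% with a random token, and leave 10% unchanged); the source does not state the exact masking settings used for IndoBERT. The function name `mask_tokens`, the toy sentence, and the toy vocabulary are illustrative only.

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, vocab, mask_prob=0.15, rng=None):
    """Return (corrupted_tokens, labels) following the standard BERT masking recipe.

    labels[i] holds the original token at positions selected for prediction
    and None everywhere else.
    """
    rng = rng or random.Random()
    corrupted = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = MASK               # 80%: replace with [MASK]
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return corrupted, labels

# Toy example; a high mask_prob is used only so that a short sentence shows some masking
tokens = "aku adalah anak baik".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.5, rng=random.Random(0))
print(corrupted)
print(labels)
```

During pretraining, the loss is computed only at positions where `labels` is not `None`.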

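✨ Key features

Based on the BERT architecture and targeted at Indonesian.
Pretrained with masked language modeling (MLM) and next sentence prediction (NSP) objectives.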
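The NSP objective trains the model to decide whether the second sentence of a pair really follows the first. The sketch below shows one way such training pairs can be built, assuming the standard BERT setup in which roughly half of the pairs use the true next sentence and the rest use a sentence drawn from a different document. The helper names `make_nsp_examples` and `format_pair` and the toy documents are hypothetical, not taken from the IndoBERT training code.

```python
import random

def make_nsp_examples(documents, rng=None):
    """Build (sentence_a, sentence_b, is_next) triples.

    documents is a list of non-empty sentence lists. At least two documents
    are needed so that negatives can come from a different document.
    """
    rng = rng or random.Random()
    examples = []
    for d, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: the sentence that actually follows
                examples.append((doc[i], doc[i + 1], True))
            else:
                # Negative example: a random sentence from another document
                other = rng.choice([k for k in range(len(documents)) if k != d])
                examples.append((doc[i], rng.choice(documents[other]), False))
    return examples

def format_pair(sentence_a, sentence_b):
    # BERT-style layout for a sentence pair
    return f"[CLS] {sentence_a} [SEP] {sentence_b} [SEP]"

# Toy documents for illustration
docs = [
    ["saya pergi ke pasar", "saya membeli buah"],
    ["hujan turun deras", "jalanan menjadi basah"],
]
for a, b, is_next in make_nsp_examples(docs, rng=random.Random(0)):
    print(is_next, format_pair(a, b))
```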

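📦 Installation

The source provides no installation steps. The usage examples import `transformers` and `torch`, so both packages must be installed.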

💻 Usage examples

Basic usage

from transformers import BertTokenizer, AutoModel
# Load the tokenizer and the pretrained encoder from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-base-p1")
model = AutoModel.from_pretrained("indobenchmark/indobert-base-p1")

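Advanced usage

import torch  # needed for torch.LongTensor
# Encode to token ids with a batch dimension; model(x)[0] is the last hidden state
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1,-1)
print(x, model(x)[0].sum())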
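The snippet above yields one vector per token. When a single vector per sentence is needed, one common option is to average the token vectors while ignoring padding. The sketch below shows that pooling step; mean pooling is a choice made here for illustration, not something the source prescribes, and the helper name `mean_pool` is hypothetical. Stand-in tensors are used so the block runs without downloading the model.

```python
import torch

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors over the sequence, ignoring padded positions.

    last_hidden_state: float tensor of shape (batch, seq_len, hidden)
    attention_mask:    0/1 tensor of shape (batch, seq_len)
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid division by zero
    return summed / counts

# Stand-in for model output: one sequence of three positions, the last one is padding
hidden = torch.tensor([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = torch.tensor([[1, 1, 0]])
print(mean_pool(hidden, mask))  # tensor([[2., 3.]])
```

With the tokenizer and model loaded as in the basic usage, the same function can be applied as `enc = tokenizer(sentences, padding=True, return_tensors="pt")` followed by `mean_pool(model(**enc)[0], enc["attention_mask"])`, where `sentences` is a list of strings.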

📚 Detailed documentation

All pretrained models

Property	Details
Available models	`indobenchmark/indobert-base-p1`, `indobenchmark/indobert-base-p2`, `indobenchmark/indobert-large-p1`, `indobenchmark/indobert-large-p2`, `indobenchmark/indobert-lite-base-p1`, `indobenchmark/indobert-lite-base-p2`, `indobenchmark/indobert-lite-large-p1`, `indobenchmark/indobert-lite-large-p2`
Training data	Indo4B (23.43 GB of text)

Model	Parameters	Architecture	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
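The parameter counts in the table above can guide the choice of checkpoint under a size constraint. The sketch below copies those counts into a dictionary and picks the largest listed checkpoint within a parameter budget. The helpers `largest_under` and `fp32_megabytes` are hypothetical, and the memory figure is a rough estimate that assumes 4 bytes per parameter (float32 weights only, excluding activations and optimizer state).

```python
# Parameter counts in millions, copied from the table above
PARAMS_M = {
    "indobenchmark/indobert-base-p1": 124.5,
    "indobenchmark/indobert-base-p2": 124.5,
    "indobenchmark/indobert-large-p1": 335.2,
    "indobenchmark/indobert-large-p2": 335.2,
    "indobenchmark/indobert-lite-base-p1": 11.7,
    "indobenchmark/indobert-lite-base-p2": 11.7,
    "indobenchmark/indobert-lite-large-p1": 17.7,
    "indobenchmark/indobert-lite-large-p2": 17.7,
}

def largest_under(budget_millions, phase="p1"):
    """Return the largest listed checkpoint of the given phase within the budget, or None."""
    candidates = [
        (count, name)
        for name, count in PARAMS_M.items()
        if name.endswith(phase) and count <= budget_millions
    ]
    return max(candidates)[1] if candidates else None

def fp32_megabytes(name):
    """Rough weight size in MB (10^6 bytes), assuming 4 bytes per parameter."""
    return PARAMS_M[name] * 4

print(largest_under(150))  # indobenchmark/indobert-base-p1
print(fp32_megabytes("indobenchmark/indobert-base-p1"))  # 498.0
```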

📄 License

This project is released under the MIT license.

📄 Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti.

📄 Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}