indobert-base-p2: Open-Source Indonesian Language Model for Indonesian Text Processing and Applications

Indobert Base P2

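Developed by indobenchmark

IndoBERT is a state-of-the-art Indonesian language model based on BERT, trained with masked language modeling and next sentence prediction objectives.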

Category: Large Language Model, Other. License: MIT. Tags: #IndonesianPretraining #Uncased #LargeScaleCorpus

Downloads: 25.89k

Released: 3/2/2022

Model Overview

IndoBERT is a pretrained language model optimized for Indonesian and suited to a range of natural language understanding tasks.

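Model Features

Optimized for Indonesian

Pretrained and optimized specifically for Indonesian

Large-scale training data

Trained on 23.43 GB of Indonesian text (Indo4B)

Uncased

The phase 2 model is uncased, so differently capitalized spellings of a word are treated as the same input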
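
To make "uncased" concrete, here is a minimal sketch of the normalization that standard uncased BERT tokenizers apply before tokenizing: lowercasing and accent stripping. That IndoBERT's tokenizer performs exactly these steps is an assumption, and `normalize_uncased` is a hypothetical helper rather than part of any library.

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Lowercase text and strip combining accents (hypothetical helper)."""
    lowered = text.lower()
    # NFD splits an accented character into a base character plus a combining mark.
    decomposed = unicodedata.normalize("NFD", lowered)
    # Drop the combining marks (Unicode category "Mn").
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


# Differently cased spellings collapse to the same string.
print(normalize_uncased("Jakarta") == normalize_uncased("JAKARTA"))
print(normalize_uncased("Café di Bandung"))
```

One consequence of this design is that information carried by capitalization, such as the difference between a proper noun and a common noun, is not visible to the model.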

Model Capabilities

Text representation learning

Contextual understanding

Language modeling

Sentence relationship prediction

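Use Cases

Natural Language Processing

Text classification

Can be applied to Indonesian text classification tasks

Named entity recognition

Identifies named entities in Indonesian text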

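🚀 IndoBERT Base Model (phase 2 - uncased)

IndoBERT is a state-of-the-art Indonesian language model based on BERT. The pretrained model is trained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.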
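
As an illustration of the MLM objective, the sketch below applies the masking rule from the original BERT recipe: about 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% are left unchanged. Whether IndoBERT used exactly these ratios is an assumption; `mask_tokens` is a hypothetical helper that works on whitespace-split words rather than real WordPiece tokens.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels) following the BERT-style 80/10/10 rule.

    labels[i] holds the original token where a prediction is required,
    and None everywhere else.
    """
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)  # the model must recover this token
            roll = rng.random()
            if roll < 0.8:
                masked.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                masked.append(rng.choice(vocab))  # 10%: replace with a random token
            else:
                masked.append(token)              # 10%: keep the original token
        else:
            masked.append(token)
            labels.append(None)
    return masked, labels


tokens = "aku adalah anak yang rajin belajar setiap hari".split()
masked, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=1)
print(masked)
print(labels)
```

During pretraining, the loss is computed only at positions where the label is not `None`.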

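🚀 Quick Start

IndoBERT is an Indonesian language model that can be loaded with the transformers library. The steps below show how to get started.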

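✨ Key Features

Based on BERT; a state-of-the-art language model for Indonesian.
Pretrained with masked language modeling (MLM) and next sentence prediction (NSP) objectives.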
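
The NSP objective can be sketched as building sentence pairs and labeling whether the second sentence really follows the first. The example below is a simplified illustration, not IndoBERT's actual data pipeline: `make_nsp_pairs` is a hypothetical helper, and it draws negative examples from the same document, whereas the original BERT recipe samples them from a different document in the corpus.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered document."""
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample a negative example")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: the sentence that actually follows.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: any sentence other than the current or the true next one.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


doc = [
    "Budi pergi ke pasar.",
    "Dia membeli sayur dan buah.",
    "Setelah itu dia pulang ke rumah.",
    "Ibunya memasak makan malam.",
]
pairs = make_nsp_pairs(doc, seed=0)
for a, b, is_next in pairs:
    # BERT-style input layout: [CLS] sentence A [SEP] sentence B [SEP]
    print(f"[CLS] {a} [SEP] {b} [SEP] -> is_next={is_next}")
```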

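📦 Installation

The original documentation gives no specific installation steps. The model is used through the transformers library, so follow that library's installation instructions.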
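
Since the source gives no installation steps, the following sketch only checks whether the two packages used by the usage examples (`transformers` and `torch`) can be imported, and prints a suggested `pip` command if they cannot. The helper name `missing_packages` is hypothetical, and using the usual PyPI package names is an assumption about your environment.

```python
import importlib.util


def missing_packages(names=("transformers", "torch")):
    """Return the packages from `names` that cannot be imported."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("transformers and torch are available.")
```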

💻 Usage Examples

Basic usage

# Load the tokenizer and the pretrained encoder from the Hugging Face Hub
from transformers import BertTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-base-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-base-p2")

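Advanced usage

import torch
# Encode a sentence into a (1, seq_len) tensor of token IDs; model(x)[0] is the last hidden state
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1,-1)
print(x, model(x)[0].sum())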

📚 Detailed Documentation

All Pretrained Models

Property	Details
Model type	`indobenchmark/indobert-base-p1`, `indobenchmark/indobert-base-p2`, `indobenchmark/indobert-large-p1`, `indobenchmark/indobert-large-p2`, `indobenchmark/indobert-lite-base-p1`, `indobenchmark/indobert-lite-base-p2`, `indobenchmark/indobert-lite-large-p1`, `indobenchmark/indobert-lite-large-p2`
Training data	Indo4B (23.43 GB of text)

Model	Parameters	Architecture	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
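
The parameter counts in the table above give a rough idea of how much memory the weights alone need. The sketch below is back-of-the-envelope arithmetic, assuming 4 bytes per parameter in float32 and 2 bytes in float16. It ignores activations, optimizer state, and file-format overhead, so actual download sizes and runtime memory will differ. The short model names and `weight_size_mb` are illustrative, not library identifiers.

```python
# Parameter counts (in millions) as listed in the table above.
PARAMS_MILLIONS = {
    "indobert-base": 124.5,
    "indobert-large": 335.2,
    "indobert-lite-base": 11.7,
    "indobert-lite-large": 17.7,
}


def weight_size_mb(params_millions: float, bytes_per_param: int = 4) -> float:
    """Approximate size of the weights in megabytes (1 MB = 10**6 bytes)."""
    # (millions * 10**6 params) * bytes / 10**6 bytes-per-MB = millions * bytes
    return params_millions * bytes_per_param


for name, millions in PARAMS_MILLIONS.items():
    fp32 = weight_size_mb(millions)     # 4 bytes per float32 parameter
    fp16 = weight_size_mb(millions, 2)  # 2 bytes per float16 parameter
    print(f"{name}: ~{fp32:.1f} MB (fp32), ~{fp16:.1f} MB (fp16)")
```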

📄 License

This project is released under the MIT License.

👨‍💻 Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti.

📚 Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}