indobert-large-p2: an open-source Indonesian language model for understanding and processing Indonesian text

Indobert Large P2

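Developed by indobenchmark

IndoBERT is a state-of-the-art language model for Indonesian based on BERT, trained with masked language modeling (MLM) and next sentence prediction (NSP) objectives.

Category: Large language model, Other. License: MIT. Tags: #Indonesian pre-training #uncased #multi-task learning

Downloads: 2,272

Released: 3/2/2022

Model overview

IndoBERT is a pre-trained language model optimized for Indonesian. It is mainly used for natural language understanding tasks, and supports extracting contextual representations from Indonesian text.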

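Model features

Optimized for Indonesian

Built specifically for Indonesian and suited to Indonesian natural language processing tasks.

Large-scale pre-training

Pre-trained on the Indo4B dataset (23.43 GB of text).

Uncased

The phase 2 model is uncased, so it accepts input text regardless of capitalization.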

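Model capabilities

Indonesian text understanding

Contextual representation extraction

Masked language modeling

Next sentence prediction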
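To make the next sentence prediction capability concrete, here is a minimal sketch of how NSP training pairs can be built. It is a simplification and not IndoBERT's actual data pipeline: the function name `make_nsp_pairs` and the sample sentences are hypothetical, and negatives are drawn from the same list, whereas BERT-style pre-training typically draws them from other documents.

```python
import random


def make_nsp_pairs(sentences, seed=0):
    """Build (sentence_a, sentence_b, is_next) examples from an ordered list.

    Roughly half the pairs keep the true next sentence (is_next=True);
    the rest swap in a random other sentence (is_next=False).
    """
    if len(sentences) < 3:
        raise ValueError("need at least 3 sentences to sample negatives")
    rng = random.Random(seed)
    pairs = []
    for i in range(len(sentences) - 1):
        if rng.random() < 0.5:
            # Positive example: B really follows A.
            pairs.append((sentences[i], sentences[i + 1], True))
        else:
            # Negative example: B is neither A itself nor A's true successor.
            candidates = [s for j, s in enumerate(sentences) if j not in (i, i + 1)]
            pairs.append((sentences[i], rng.choice(candidates), False))
    return pairs


# Hypothetical sample sentences, for illustration only.
sentences = [
    "Saya pergi ke pasar.",
    "Saya membeli buah.",
    "Cuaca hari ini cerah.",
    "Kami bermain di taman.",
]
pairs = make_nsp_pairs(sentences, seed=0)
for a, b, is_next in pairs:
    print(is_next, "|", a, "|", b)
```

The model then sees each pair as one input and is trained to predict the `is_next` flag.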

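Use cases

Natural language processing

Text classification

Classifying Indonesian text, for example sentiment analysis or topic classification.

Named entity recognition

Identifying named entities in Indonesian text, such as person, location, and organization names.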
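A model fine-tuned for named entity recognition usually emits one tag per token, and those tags still have to be grouped into entities. The sketch below assumes a BIO tagging scheme with `PER` and `LOC` labels. The scheme, the label names, and the helper `bio_to_spans` are illustrative assumptions, not something the model card specifies.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) spans.

    An I- tag that does not continue a matching open entity is dropped.
    """
    spans = []
    current_tokens, current_type = [], None

    def flush():
        # Close the entity being built, if there is one.
        if current_tokens:
            spans.append((" ".join(current_tokens), current_type))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            flush()
            current_tokens, current_type = [], None
    flush()
    return spans


# Hypothetical tagger output for an Indonesian sentence.
tokens = ["Joko", "Widodo", "lahir", "di", "Surakarta"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
spans = bio_to_spans(tokens, tags)
print(spans)  # [('Joko Widodo', 'PER'), ('Surakarta', 'LOC')]
```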

Language model fine-tuning

Downstream task fine-tuning

The model can be fine-tuned to adapt it to specific Indonesian NLP tasks.
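As a rough sketch of what fine-tuning involves, the example below runs a single training step of a BERT sequence classifier with the `transformers` and `torch` libraries. To stay small and runnable offline it uses a tiny, randomly initialized `BertConfig` and random token IDs. The sizes, the three-label setup, and the learning rate are all assumptions for illustration. For real fine-tuning you would load the pre-trained weights instead, for example with `BertForSequenceClassification.from_pretrained("indobenchmark/indobert-large-p2", num_labels=3)`, and feed tokenized Indonesian text.

```python
import torch
from transformers import BertConfig, BertForSequenceClassification

torch.manual_seed(0)

# Tiny random model standing in for the real checkpoint (assumption:
# illustration only; these sizes are not IndoBERT's).
config = BertConfig(
    vocab_size=1000,
    hidden_size=32,
    num_hidden_layers=2,
    num_attention_heads=2,
    intermediate_size=64,
    num_labels=3,
)
model = BertForSequenceClassification(config)

# Fake batch: 2 sequences of 8 token IDs, each with a class label.
input_ids = torch.randint(0, config.vocab_size, (2, 8))
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 2])

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# One training step: forward pass, loss, backward pass, parameter update.
model.train()
outputs = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()

logits_shape = tuple(outputs.logits.shape)
print(logits_shape, outputs.loss.item())
```

Passing `labels` makes the model compute a cross-entropy loss internally. In practice this step runs inside a loop over batches of a labeled dataset.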

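🚀 IndoBERT Large Model (phase 2 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pre-trained model is trained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.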
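To illustrate the MLM objective, here is a minimal sketch of BERT-style token masking in plain Python. It assumes the rates from the original BERT recipe: 15% of tokens are selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. Whether IndoBERT used exactly these rates is not stated here, and the function `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random

MASK = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (corrupted, labels) for one MLM training example.

    labels[i] holds the original token at selected positions and None
    elsewhere; the model is trained to predict only the selected positions.
    """
    rng = random.Random(seed)
    corrupted, labels = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            labels.append(token)
            roll = rng.random()
            if roll < 0.8:
                corrupted.append(MASK)               # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted.append(rng.choice(vocab))  # 10%: random token
            else:
                corrupted.append(token)              # 10%: keep unchanged
        else:
            labels.append(None)
            corrupted.append(token)
    return corrupted, labels


# Toy example: whitespace tokens, with the sentence itself as the vocabulary.
tokens = "aku adalah anak gembala yang selalu riang".split()
corrupted, labels = mask_tokens(tokens, vocab=tokens, mask_prob=0.15, seed=1)
print(corrupted)
print(labels)
```

A real pipeline would operate on WordPiece token IDs from the tokenizer, not on whitespace-split words.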

💻 Usage examples

Basic usage

# Load the tokenizer and the pre-trained encoder from the Hugging Face Hub
from transformers import BertTokenizer, AutoModel

tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")

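Advanced usage

import torch  # needed for LongTensor; assumes tokenizer and model from the basic usage above
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1,-1)  # shape (1, seq_len)
print(x, model(x)[0].sum())  # token IDs and the sum of the last hidden states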

📚 Documentation

All pre-trained models

Attribute	Details
Available models	`indobenchmark/indobert-base-p1`, `indobenchmark/indobert-base-p2`, `indobenchmark/indobert-large-p1`, `indobenchmark/indobert-large-p2`, `indobenchmark/indobert-lite-base-p1`, `indobenchmark/indobert-lite-base-p2`, `indobenchmark/indobert-lite-large-p1`, `indobenchmark/indobert-lite-large-p2`
Training data	Indo4B (23.43 GB of text)

Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti.

Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}