indobert-lite-large-p2: an open-source Indonesian language model

Indobert Lite Large P2

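Developed by indobenchmark

IndoBERT is a state-of-the-art language model built specifically for Indonesian and based on the BERT model. It is trained with masked language modeling and next sentence prediction objectives.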

Large language model

Transformers

License: MIT #Indonesian-specific #Lightweight-BERT #Large-scale-pretraining

Downloads: 117

Released: 3/2/2022

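Model overview

IndoBERT is a pretrained language model designed for Indonesian. It supports natural language understanding tasks on Indonesian text.

Model features

Optimized for Indonesian

The model is trained and optimized specifically for Indonesian, which helps it understand and process Indonesian text.

Lightweight design

The Lite version has fewer parameters (17.7M for this model), making it suitable for resource-constrained environments.

Uncased

The model does not distinguish between uppercase and lowercase, so it handles input text in any capitalization.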
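
As a rough illustration of what "uncased" means in practice, the sketch below lowercases text before it would be tokenized, so case variants collapse to one form. It assumes the uncased behaviour amounts to lowercasing, as is typical for uncased BERT-style vocabularies; `normalize_uncased` is a hypothetical helper, not part of the model's tokenizer.

```python
def normalize_uncased(text: str) -> str:
    """Hypothetical helper: lowercase the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


variants = [
    "Aku Adalah Anak Sekolah",
    "AKU ADALAH ANAK SEKOLAH",
    "aku adalah anak sekolah",
]
# All three spellings map to the same normalized string.
normalized = {normalize_uncased(v) for v in variants}
print(normalized)
```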

Model capabilities

Indonesian text understanding

Masked language modeling

Next sentence prediction
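
Next sentence prediction takes a pair of sentences packed into one sequence. A minimal sketch of the standard BERT pair layout follows, using whitespace splitting in place of the real subword tokenizer; `pack_sentence_pair` is a hypothetical helper and the example sentences are illustrative.

```python
from typing import List, Tuple


def pack_sentence_pair(sentence_a: str, sentence_b: str) -> Tuple[List[str], List[int]]:
    """Pack two sentences as [CLS] A [SEP] B [SEP] with segment ids 0 for A and 1 for B."""
    tokens_a = sentence_a.lower().split()  # stand-in for the real subword tokenizer
    tokens_b = sentence_b.lower().split()
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # [CLS], sentence A and the first [SEP] belong to segment 0; the rest to segment 1.
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    return tokens, segment_ids


tokens, segment_ids = pack_sentence_pair("aku adalah anak sekolah", "aku suka membaca buku")
print(tokens)
print(segment_ids)
```

The segment ids are what let the model tell the two sentences apart when it predicts whether the second one follows the first.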

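Use cases

Natural language processing

Text classification

Classify Indonesian text.

Named entity recognition

Identify named entities in Indonesian text.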
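
For named entity recognition, a fine-tuned model typically emits one tag per token, which then has to be grouped into entity spans. The sketch below decodes BIO tags; the tag names (such as `B-PER`, `I-PER` and `B-LOC`), the example tokens, and the `decode_bio` helper are assumptions for illustration and do not come from the model card.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities: List[Tuple[str, Optional[str]]] = []
    current_tokens: List[str] = []
    current_type: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush any entity in progress.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O" or an inconsistent I- tag closes the current entity.
            if current_tokens:
                entities.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


tokens = ["joko", "widodo", "lahir", "di", "surakarta"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]  # hypothetical model output
entities = decode_bio(tokens, tags)
print(entities)
```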

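🚀 IndoBERT-Lite Large Model (phase 2 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model is trained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.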
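
To make the MLM objective concrete, here is a minimal sketch of BERT-style input corruption: a fraction of tokens is selected, and of those 80% become `[MASK]`, 10% become a random token, and 10% stay unchanged. The 15% default selection rate and the 80/10/10 split come from the original BERT recipe and are an assumption here; the model card does not state the rates used for IndoBERT. `mask_tokens` and the toy vocabulary are hypothetical.

```python
import random
from typing import List, Optional, Tuple


def mask_tokens(tokens: List[str], vocab: List[str], rng: random.Random,
                select_prob: float = 0.15) -> Tuple[List[str], List[Optional[str]]]:
    """Return (corrupted tokens, labels); labels hold the original token at selected positions."""
    corrupted = list(tokens)
    labels: List[Optional[str]] = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # position not selected for prediction
        labels[i] = token
        roll = rng.random()
        if roll < 0.8:
            corrupted[i] = "[MASK]"
        elif roll < 0.9:
            corrupted[i] = rng.choice(vocab)
        # otherwise keep the original token unchanged
    return corrupted, labels


rng = random.Random(0)
tokens = "aku adalah anak sekolah yang suka membaca buku".split()
# A high selection rate is used here only so the short example shows some masking.
corrupted, labels = mask_tokens(tokens, vocab=tokens, rng=rng, select_prob=0.5)
print(corrupted)
print(labels)
```

During pretraining the model sees `corrupted` as input and is scored only on the positions where `labels` is not `None`.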

🚀 Quick start

You can use the model with the following steps:

Load the model and tokenizer

from transformers import BertTokenizer, AutoModel
# Download (or load from the local cache) the tokenizer and the pretrained weights
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-lite-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-lite-large-p2")

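Extract contextual representations

import torch
# Encode one sentence, add a batch dimension, then print the ids and the sum of the last hidden state
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)
print(x, model(x)[0].sum())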
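
The model's first output holds one vector per token. A common way to reduce these to a single sentence vector is masked mean pooling, sketched below in plain Python on made-up numbers so that it runs without downloading the model. The `mean_pool` helper, the toy vectors, and the mask are illustrative assumptions; the model card itself only sums the hidden states.

```python
from typing import List


def mean_pool(hidden_states: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average token vectors where attention_mask is 1, ignoring padding positions."""
    kept = [vec for vec, keep in zip(hidden_states, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy example: 3 real tokens plus 1 padding token, hidden size 2.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [100.0, 100.0]]
attention_mask = [1, 1, 1, 0]
sentence_vector = mean_pool(hidden_states, attention_mask)
print(sentence_vector)
```

With the real model, the same averaging would be applied to the token vectors in `model(x)[0]`, typically with tensor operations instead of Python lists.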

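✨ Key features

IndoBERT is based on the BERT model, optimized for Indonesian, and trained with masked language modeling (MLM) and next sentence prediction (NSP) objectives.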

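📦 Installation

The model card does not give installation steps. Install the transformers library, plus PyTorch for the examples here, by following their installation instructions.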

📚 Documentation

All pretrained models

Property	Details
Model type	Several models, including `indobenchmark/indobert-base-p1`, `indobenchmark/indobert-base-p2`, `indobenchmark/indobert-large-p1`, `indobenchmark/indobert-large-p2`, `indobenchmark/indobert-lite-base-p1`, `indobenchmark/indobert-lite-base-p2`, `indobenchmark/indobert-lite-large-p1`, and `indobenchmark/indobert-lite-large-p2`
Training data	Indo4B (23.43 GB of text)

The parameter count and architecture of each model are as follows:

Model	Parameters	Architecture	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
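
The parameter counts in the table translate into a rough size comparison. The sketch below computes how many times smaller each Lite model is than its non-Lite counterpart and estimates weight storage, assuming 4 bytes per parameter (fp32) and ignoring activations and optimizer state. That storage assumption is not from the model card, and `weight_megabytes` is a hypothetical helper.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "indobert-base": 124.5,
    "indobert-large": 335.2,
    "indobert-lite-base": 11.7,
    "indobert-lite-large": 17.7,
}


def weight_megabytes(params_millions: float, bytes_per_param: int = 4) -> float:
    """Approximate weight storage in MB (10^6 bytes), assuming fp32 by default."""
    return params_millions * bytes_per_param


for name, params in PARAMS_M.items():
    print(f"{name}: {params}M parameters, about {weight_megabytes(params):.1f} MB of weights")

# How many times more parameters the non-Lite models have.
large_ratio = PARAMS_M["indobert-large"] / PARAMS_M["indobert-lite-large"]
base_ratio = PARAMS_M["indobert-base"] / PARAMS_M["indobert-lite-base"]
print(f"large / lite-large: {large_ratio:.1f}x, base / lite-base: {base_ratio:.1f}x")
```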

📄 License

This project is released under the MIT license.

👨‍💻 Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti.

📚 Citation

If you use this work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}