indobert-large-p2: an open-source Indonesian language model for understanding and processing Indonesian-language content

Indobert Large P2

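Developed by indobenchmark

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. It is pretrained with the masked language modeling (MLM) and next sentence prediction (NSP) objectives.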

Large language model · Other · Open source · License: MIT · #Indonesian pretraining #Uncased #Multi-task learning

Downloads: 2,272

Release date: 3/2/2022

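Model Overview

IndoBERT is a pretrained language model optimized for Indonesian. It is used mainly for natural language understanding tasks, where it supports language understanding and the extraction of contextual representations from Indonesian text.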

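Model Features

Optimized for Indonesian

The model is optimized specifically for Indonesian, which makes it well suited to Indonesian natural language processing tasks.

Large-scale pretraining

The model is pretrained on the Indo4B dataset (23.43 GB of text), which gives it strong language understanding capability.

Uncased

The phase-2 model is uncased: it does not distinguish upper case from lower case, so it handles input text regardless of capitalization.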
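As a rough illustration of what "uncased" means in practice, the sketch below case-folds text the way BERT-style uncased tokenizers commonly do (lowercasing, then stripping accents). The function name `normalize_uncased` is hypothetical, and the exact normalization applied by this checkpoint's tokenizer is an assumption; check its tokenizer configuration before relying on this behavior.

```python
import unicodedata


def normalize_uncased(text: str) -> str:
    """Lowercase text and strip accents (assumed uncased-BERT-style normalization)."""
    lowered = text.lower()
    # Decompose characters so accents become separate combining marks, then drop the marks.
    decomposed = unicodedata.normalize("NFD", lowered)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


if __name__ == "__main__":
    # Differently cased spellings collapse to the same string before tokenization.
    print(normalize_uncased("Aku Adalah Anak Indonesia"))
    print(normalize_uncased("AKU ADALAH ANAK INDONESIA"))
```

Because both spellings normalize to the same string, an uncased model sees identical input for them. The trade-off is that capitalization cues, which can help with tasks such as named entity recognition, are not available to the model.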

Model Capabilities

Indonesian text understanding

Contextual representation extraction

Masked language modeling

Next sentence prediction
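To make the masked language modeling capability more concrete, here is a minimal sketch of how training inputs are typically corrupted for MLM. It follows the recipe from the original BERT paper (select about 15% of tokens; of those, 80% become `[MASK]`, 10% become a random token, 10% stay unchanged). Whether IndoBERT used exactly these rates is an assumption, the function name `mask_tokens` is hypothetical, and whole words stand in for the subword tokens a real tokenizer would produce.

```python
import random

MASK_TOKEN = "[MASK]"
SPECIAL_TOKENS = {"[CLS]", "[SEP]", "[PAD]"}


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels); labels are None where no prediction is required."""
    rng = random.Random(seed)
    masked, labels = [], []
    for token in tokens:
        # Special tokens are never masked; other tokens are selected with probability mask_prob.
        if token in SPECIAL_TOKENS or rng.random() >= mask_prob:
            masked.append(token)
            labels.append(None)
            continue
        labels.append(token)  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK_TOKEN)         # 80%: replace with [MASK]
        elif roll < 0.9:
            masked.append(rng.choice(vocab))  # 10%: replace with a random token
        else:
            masked.append(token)              # 10%: keep the token unchanged
    return masked, labels


if __name__ == "__main__":
    sentence = ["[CLS]", "aku", "adalah", "anak", "sekolah", "[SEP]"]
    toy_vocab = ["aku", "adalah", "anak", "sekolah", "guru"]  # illustrative vocabulary
    print(mask_tokens(sentence, toy_vocab, seed=0))
```

The loss is computed only at positions where the label is not `None`, which is why the function returns the labels alongside the corrupted sequence.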

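Use Cases

Natural language processing

Text classification

Classifying Indonesian text, for example sentiment analysis or topic classification.

Named entity recognition

Identifying named entities in Indonesian text, such as person names, place names, and organization names.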
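A fine-tuned token classifier usually emits one tag per token, and those tags still have to be grouped into entities. The sketch below assumes the common BIO scheme with `PER`, `ORG`, and `LOC` labels; the label set, the function name `decode_bio`, and the hand-written tags in the example are illustrative assumptions, not output from this model.

```python
def decode_bio(tokens, tags):
    """Group tokens into (entity_type, text) spans from BIO tags."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new span, closing any open one first.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span, closes it.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


if __name__ == "__main__":
    words = ["Budi", "bekerja", "di", "Bank", "Indonesia", "di", "Jakarta"]
    tags = ["B-PER", "O", "O", "B-ORG", "I-ORG", "O", "B-LOC"]  # hand-written example tags
    print(decode_bio(words, tags))
    # [('PER', 'Budi'), ('ORG', 'Bank Indonesia'), ('LOC', 'Jakarta')]
```

With a subword tokenizer, the per-token tags would first need to be aligned back to whole words; that step is omitted here.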

Language model fine-tuning

Downstream task fine-tuning

The model can be fine-tuned to fit a specific Indonesian NLP task.

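🚀 IndoBERT Large Model (phase2 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pretrained model is trained with a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.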
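The NSP objective trains the model to decide whether the second sentence of a pair really follows the first. The sketch below shows how such a pair is commonly assembled in BERT-style pretraining: a 50/50 choice between the true next sentence and a random one, joined as `[CLS] A [SEP] B [SEP]` with segment ids marking each half. The 50/50 ratio comes from the original BERT paper and is an assumption for IndoBERT; `build_nsp_example` and the toy sentences are hypothetical.

```python
import random


def build_nsp_example(first, true_next, random_pool, rng):
    """Return (tokens, segment_ids, is_next) for one sentence pair."""
    if rng.random() < 0.5:
        second, is_next = true_next, True                 # positive: the actual next sentence
    else:
        second, is_next = rng.choice(random_pool), False  # negative: a sentence from elsewhere
    tokens = ["[CLS]", *first, "[SEP]", *second, "[SEP]"]
    # Segment 0 covers [CLS], sentence A and its [SEP]; segment 1 covers sentence B and its [SEP].
    segment_ids = [0] * (len(first) + 2) + [1] * (len(second) + 1)
    return tokens, segment_ids, is_next


if __name__ == "__main__":
    rng = random.Random(0)
    example = build_nsp_example(
        first=["aku", "lapar"],
        true_next=["aku", "makan", "nasi"],
        random_pool=[["hujan", "turun"]],  # stand-in for sentences from other documents
        rng=rng,
    )
    print(example)
```

The `is_next` flag is the binary label the NSP head is trained to predict from the `[CLS]` representation.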

📚 Documentation

All Pretrained Models

Model	Parameters	Architecture	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
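When memory or latency is constrained, the parameter counts in the table above can guide which checkpoint to try first. The following is a small sketch of that selection; the counts are copied from the table, while the helper name `checkpoints_within` and the idea of a parameter budget are illustrative assumptions.

```python
# Parameter counts in millions, copied from the table above.
PARAMS_M = {
    "indobenchmark/indobert-base-p1": 124.5,
    "indobenchmark/indobert-base-p2": 124.5,
    "indobenchmark/indobert-large-p1": 335.2,
    "indobenchmark/indobert-large-p2": 335.2,
    "indobenchmark/indobert-lite-base-p1": 11.7,
    "indobenchmark/indobert-lite-base-p2": 11.7,
    "indobenchmark/indobert-lite-large-p1": 17.7,
    "indobenchmark/indobert-lite-large-p2": 17.7,
}


def checkpoints_within(budget_m):
    """Return checkpoint names whose parameter count fits the budget, smallest first."""
    fitting = [(size, name) for name, size in PARAMS_M.items() if size <= budget_m]
    return [name for size, name in sorted(fitting)]


if __name__ == "__main__":
    # Checkpoints with at most 150M parameters.
    print(checkpoints_within(150))
```

Parameter count is only a proxy for cost and says nothing about accuracy on a given task, so the shortlist is a starting point for evaluation rather than a recommendation.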

💻 Usage Examples

Basic usage

# Load model and tokenizer
from transformers import BertTokenizer, AutoModel
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-large-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-large-p2")

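Advanced usage

# Extract contextual representations (reuses `tokenizer` and `model` loaded above)
import torch

x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1, -1)
# model(x)[0] is the last hidden state, shape (1, sequence_length, hidden_size)
print(x, model(x)[0].sum())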

👥 Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti.

📄 Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}