indobert-lite-base-p2: an open-source Indonesian language model that can be used for a range of Indonesian text-processing tasks.

Indobert Lite Base P2

Developed by indobenchmark

IndoBERT is a leading language model for Indonesian. It is based on the BERT architecture and trained with masked language modeling and next sentence prediction objectives.

Large language model

Transformers

Open-source license: MIT #Indonesian-only #Lightweight BERT #Uncased

Downloads: 2,498

Release date: 3/2/2022

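Model overview

IndoBERT-Lite is a lightweight version of IndoBERT, optimized specifically for Indonesian and suited to natural language understanding tasks.

Model features

Lightweight design

The model has about 11.7 million parameters, roughly a tenth of the 124.5 million in indobert-base, which makes it suitable for resource-constrained environments.

Optimized for Indonesian

The model is pre-trained specifically on Indonesian text (the Indo4B corpus) and performs well on Indonesian tasks.

Uncased

The model is uncased: it does not distinguish upper-case from lower-case letters, so it can be applied to text regardless of capitalization.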

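Model capabilities

Text representation extraction

Masked language modeling

Next sentence prediction

Use cases

Natural language processing

Text classification

Can be used for sentiment analysis and topic classification of Indonesian text.

Question answering

Suitable for building Indonesian question-answering systems.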
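As a rough illustration of the text-classification use case, the sketch below attaches a classification head to the checkpoint through the Transformers `AutoModelForSequenceClassification` class. The label names and the helper functions `build_label_maps` and `load_classifier` are hypothetical and exist only for this example. The classification head is newly initialized, so the model would need fine-tuning on labeled Indonesian data before its predictions mean anything.

```python
def build_label_maps(labels):
    """Build the id<->label dictionaries stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_classifier(labels, model_name="indobenchmark/indobert-lite-base-p2"):
    """Load the tokenizer and the encoder with an untrained classification head.

    Downloads the checkpoint from the Hugging Face Hub on first call.
    """
    # Imported lazily so build_label_maps can be used without transformers installed.
    from transformers import AutoModelForSequenceClassification, BertTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(
        model_name,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model


# Hypothetical sentiment labels, chosen only for illustration.
SENTIMENT_LABELS = ["positive", "neutral", "negative"]
id2label, label2id = build_label_maps(SENTIMENT_LABELS)
```

Calling `load_classifier(SENTIMENT_LABELS)` would return a tokenizer and a model ready to be passed to a fine-tuning loop.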

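🚀 IndoBERT-Lite Base Model (phase 2 - uncased)

IndoBERT is a state-of-the-art language model for Indonesian based on the BERT model. The pre-trained model is trained using a masked language modeling (MLM) objective and a next sentence prediction (NSP) objective.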

💻 Usage examples

Basic usage

from transformers import BertTokenizer, AutoModel
# Load the tokenizer and the pre-trained weights from the Hugging Face Hub
tokenizer = BertTokenizer.from_pretrained("indobenchmark/indobert-lite-base-p2")
model = AutoModel.from_pretrained("indobenchmark/indobert-lite-base-p2")

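Advanced usage

import torch
# Encode a sentence into token IDs shaped (1, sequence_length), then run the model
x = torch.LongTensor(tokenizer.encode('aku adalah anak [MASK]')).view(1,-1)
print(x, model(x)[0].sum())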
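The snippet above sums the whole output tensor, which mainly confirms that the model runs. For downstream use, a single vector per sentence is often more practical. The following is a minimal sketch of one common option, masked mean pooling over the last hidden state. The helper names `mean_pool` and `embed_sentences` are illustrative and not part of the IndoBERT release, and mean pooling is an assumption here rather than a method the authors prescribe.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average the token vectors of each sentence, ignoring padding positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for rows that contain no real tokens
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed_sentences(sentences, model_name="indobenchmark/indobert-lite-base-p2"):
    """Return one vector per sentence. Downloads the checkpoint on first call."""
    # Imported lazily so mean_pool can be used without transformers installed.
    from transformers import AutoModel, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        output = model(**batch)
    # output[0] is the last hidden state: (batch, seq_len, hidden_size)
    return mean_pool(output[0], batch["attention_mask"])
```

For example, `embed_sentences(["aku adalah anak baik", "saya suka membaca"])` would return a tensor with one row per input sentence.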

📚 Documentation

All pre-trained models

Property	Details
Available models	`indobenchmark/indobert-base-p1`, `indobenchmark/indobert-base-p2`, `indobenchmark/indobert-large-p1`, `indobenchmark/indobert-large-p2`, `indobenchmark/indobert-lite-base-p1`, `indobenchmark/indobert-lite-base-p2`, `indobenchmark/indobert-lite-large-p1`, `indobenchmark/indobert-lite-large-p2`
Training data	Indo4B (23.43 GB of text)

Model	#params	Arch.	Training data
`indobenchmark/indobert-base-p1`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-base-p2`	124.5M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p1`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-large-p2`	335.2M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p1`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-base-p2`	11.7M	Base	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p1`	17.7M	Large	Indo4B (23.43 GB of text)
`indobenchmark/indobert-lite-large-p2`	17.7M	Large	Indo4B (23.43 GB of text)
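To put the parameter counts in the table into perspective, the short calculation below compares the lite and full-size variants and estimates the memory needed for the weights alone. It assumes the weights are stored as 32-bit floats (4 bytes each) and ignores activations, optimizer state, and file-format overhead, so the figures are rough lower bounds rather than measured sizes. The function names are illustrative.

```python
# Parameter counts in millions, taken from the table above.
PARAMS_MILLIONS = {
    "indobert-base": 124.5,
    "indobert-large": 335.2,
    "indobert-lite-base": 11.7,
    "indobert-lite-large": 17.7,
}

BYTES_PER_FP32 = 4  # assumption: weights stored as 32-bit floats


def fp32_weight_megabytes(params_millions: float) -> float:
    """Approximate size of the weights alone, in MB (10^6 bytes)."""
    return params_millions * BYTES_PER_FP32


def size_ratio(full: str, lite: str) -> float:
    """How many times more parameters the full model has than the lite one."""
    return PARAMS_MILLIONS[full] / PARAMS_MILLIONS[lite]


for name, millions in PARAMS_MILLIONS.items():
    print(f"{name}: {millions}M parameters, ~{fp32_weight_megabytes(millions):.1f} MB in fp32")

print(f"base / lite-base:   {size_ratio('indobert-base', 'indobert-lite-base'):.1f}x")
print(f"large / lite-large: {size_ratio('indobert-large', 'indobert-lite-large'):.1f}x")
```

By this estimate the lite-base weights take about 46.8 MB, and the full-size base model has roughly 10.6 times as many parameters.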

📄 License

This project is released under the MIT License.

📝 Authors

IndoBERT was trained and evaluated by Bryan Wilie*, Karissa Vincentio*, Genta Indra Winata*, Samuel Cahyawijaya*, Xiaohong Li, Zhi Yuan Lim, Sidik Soleman, Rahmad Mahendra, Pascale Fung, Syafri Bahar, and Ayu Purwarianti.

📚 Citation

If you use our work, please cite:

@inproceedings{wilie2020indonlu,
  title={IndoNLU: Benchmark and Resources for Evaluating Indonesian Natural Language Understanding},
  author={Bryan Wilie and Karissa Vincentio and Genta Indra Winata and Samuel Cahyawijaya and X. Li and Zhi Yuan Lim and S. Soleman and R. Mahendra and Pascale Fung and Syafri Bahar and A. Purwarianti},
  booktitle={Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing},
  year={2020}
}