IndicBERTv2-MLM-only: an open-source multilingual model supporting text processing in 23 Indic languages and English

Indicbertv2 MLM Only

Developed by ai4bharat

IndicBERT is a multilingual language model supporting 23 Indic languages and English. It has 278 million parameters, was trained on IndicCorp v2, and is evaluated on the IndicXTREME benchmark.

Large language model

Transformers

Multilingual support | Open-source license: MIT | #Multilingual, including Hindi | #Masked token prediction | #Trained on a large corpus

Downloads: 87.60k

Release date: 11/13/2022

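Model Overview

IndicBERT is a multilingual BERT-style model specialized for Indic languages. Its variants are trained with several objectives and datasets, and the model supports masked token prediction.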

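Model Features

Multilingual support

Supports 23 Indic languages and English, covering multiple language families.

Multiple training objectives

Trained with masked language modeling (MLM) and translation language modeling (TLM), the latter on parallel and back-translated data, to improve model performance.

Vocabulary-sharing optimization

The IndicBERT-SS variant improves vocabulary sharing across languages by converting the scripts of Indic languages to Devanagari.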
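
To make the script-conversion idea concrete, here is a minimal sketch that maps text from several Indic scripts to Devanagari. It relies on the fact that the Unicode blocks for these scripts are laid out in parallel, so a fixed code point offset gives a rough character-level mapping. This is an illustrative assumption, not the conversion pipeline used to train IndicBERT-SS, and the names `SCRIPT_BLOCK_STARTS` and `to_devanagari` are hypothetical. Script-specific characters and digits do not always align, so treat the output as approximate.

```python
# Rough script conversion by Unicode block offset (illustrative only;
# NOT the conversion actually used for IndicBERT-SS).
DEVANAGARI_START = 0x0900
BLOCK_SIZE = 0x80  # each of these script blocks spans 128 code points

# First code point of each source script's Unicode block.
SCRIPT_BLOCK_STARTS = {
    "Beng": 0x0980,  # Bengali
    "Guru": 0x0A00,  # Gurmukhi
    "Gujr": 0x0A80,  # Gujarati
    "Orya": 0x0B00,  # Odia
    "Taml": 0x0B80,  # Tamil
    "Telu": 0x0C00,  # Telugu
    "Knda": 0x0C80,  # Kannada
    "Mlym": 0x0D00,  # Malayalam
}


def to_devanagari(text, script):
    """Shift every character of `script` into the Devanagari block."""
    start = SCRIPT_BLOCK_STARTS[script]
    converted = []
    for ch in text:
        code_point = ord(ch)
        if start <= code_point < start + BLOCK_SIZE:
            # Same position inside the block, different script.
            converted.append(chr(DEVANAGARI_START + (code_point - start)))
        else:
            # Leave spaces, punctuation, and Latin text untouched.
            converted.append(ch)
    return "".join(converted)


print(to_devanagari("বাংলা", "Beng"))
```

After such a conversion, cognate words written in different scripts are more likely to share subword tokens, which is the motivation the model card gives for the SS variant.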

Model Capabilities

Multilingual text understanding

Masked token prediction

Cross-lingual transfer learning

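Use Cases

Natural language understanding

Named entity recognition

Identifies named entities in text written in multiple Indic languages.

Sentiment analysis

Analyzes the sentiment of Indic-language text.

Machine translation support

Parallel corpus augmentation

TLM training on parallel data can help improve the performance of machine translation models.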

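🚀 IndicBERT

IndicBERT is a multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark. The model has 278 million parameters and supports 23 Indic languages and English. The models are trained with a variety of objectives and datasets.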

Supported Languages

Attribute	Details
Supported language codes	as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur
Language and script tags	asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab
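
The tags in the second row appear to combine a three-letter language code with a four-letter script code, separated by an underscore. As a small sketch under that assumption, the snippet below splits each tag and groups the languages by script. The helper name `group_by_script` is hypothetical and is not part of the IndicBERT repository.

```python
from collections import defaultdict

# Tags copied from the table above.
LANGUAGE_TAGS = (
    "asm_Beng ben_Beng brx_Deva doi_Deva eng_Latn gom_Deva guj_Gujr "
    "hin_Deva kan_Knda kas_Arab kas_Deva mai_Deva mal_Mlym mar_Deva "
    "mni_Beng mni_Mtei npi_Deva ory_Orya pan_Guru san_Deva sat_Olck "
    "snd_Arab snd_Deva tam_Taml tel_Telu urd_Arab"
).split()


def group_by_script(tags):
    """Map each script code to the language codes written in it."""
    groups = defaultdict(list)
    for tag in tags:
        language, script = tag.split("_")
        groups[script].append(language)
    return dict(groups)


for script, languages in sorted(group_by_script(LANGUAGE_TAGS).items()):
    print(f"{script}: {', '.join(languages)}")
```

Grouping the tags this way makes it easy to see which languages already share a script and which appear in more than one script.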

Model Tags

indicbert2
ai4bharat
multilingual

License

This project is released under the MIT license.

Evaluation Metric

Accuracy

Task Type

Fill-mask
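
As a hedged illustration of the fill-mask task, the sketch below uses the `fill-mask` pipeline from the transformers library. The Hub identifier `ai4bharat/IndicBERTv2-MLM-only` is assumed from the model name and developer shown on this page, and the helper names `mask_word` and `predict_masked` are hypothetical. The mask token is read from the tokenizer rather than hard-coded, since its exact form is not stated here.

```python
# Assumed Hub id, inferred from the model name and developer on this page.
DEFAULT_MODEL_ID = "ai4bharat/IndicBERTv2-MLM-only"


def mask_word(sentence, target, mask_token):
    """Replace the first occurrence of `target` in `sentence` with `mask_token`."""
    if target not in sentence:
        raise ValueError(f"{target!r} does not occur in the sentence")
    return sentence.replace(target, mask_token, 1)


def predict_masked(sentence, target, model_id=DEFAULT_MODEL_ID, top_k=5):
    """Mask `target` and return the model's top (token, score) candidates.

    Requires the transformers library and downloads the model on first use.
    """
    # Imported lazily so mask_word stays usable without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    masked = mask_word(sentence, target, fill.tokenizer.mask_token)
    predictions = fill(masked, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]
```

For example, `predict_masked("भारत की राजधानी दिल्ली है।", "दिल्ली")` masks the word for Delhi in a Hindi sentence meaning "The capital of India is Delhi." and returns the candidate tokens with their scores.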

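🚀 Quick Start

Model List

IndicBERT-MLM [Model] - A vanilla BERT-style model trained on IndicCorp v2 with the MLM objective
- +Samanantar [Model] - Adds TLM as an additional objective, trained on the Samanantar parallel corpus [Paper] | [Dataset]
- +Back-Translation [Model] - Adds TLM as an additional objective, trained on data obtained by translating the Indic parts of the IndicCorp v2 dataset into English with the IndicTrans model [Model]
IndicBERT-SS [Model] - A BERT-style model trained with the MLM objective after converting the scripts of Indic languages to Devanagari, to encourage better lexical sharing across languages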
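
The TLM objective used by the +Samanantar and +Back-Translation variants can be sketched as follows. In translation language modeling, a sentence and its translation are concatenated into one input and tokens are masked on both sides, so the model can use either language to recover a masked token. The function below is a simplified illustration under stated assumptions: it works on pre-split tokens, uses BERT-style `[CLS]`, `[SEP]`, and `[MASK]` strings, and always substitutes the mask token instead of the usual mix of mask, random, and unchanged tokens. The name `build_tlm_example` is hypothetical and this is not the project's training code.

```python
import random


def build_tlm_example(src_tokens, tgt_tokens, mask_prob=0.15, seed=0,
                      mask_token="[MASK]", cls_token="[CLS]",
                      sep_token="[SEP]"):
    """Concatenate a parallel sentence pair and mask tokens on both sides.

    Returns (inputs, labels). labels[i] holds the original token where
    inputs[i] was masked, and None everywhere else.
    """
    rng = random.Random(seed)  # seeded for reproducible masking
    tokens = ([cls_token] + list(src_tokens) + [sep_token]
              + list(tgt_tokens) + [sep_token])
    special_tokens = {cls_token, sep_token}

    inputs, labels = [], []
    for token in tokens:
        if token not in special_tokens and rng.random() < mask_prob:
            inputs.append(mask_token)
            labels.append(token)  # the model must predict this token
        else:
            inputs.append(token)
            labels.append(None)   # position is ignored by the loss
    return inputs, labels


english = ["I", "like", "tea"]
hindi = ["मुझे", "चाय", "पसंद", "है"]
inputs, labels = build_tlm_example(english, hindi, mask_prob=0.3, seed=1)
print(inputs)
print(labels)
```

Plain MLM corresponds to the same procedure applied to a single monolingual sentence. The only change TLM makes is that the input carries both sides of a translation pair.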

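📦 Installation

The fine-tuning scripts are based on the transformers library. Create a new conda environment, activate it, and install the dependencies as follows:

conda create -n finetuning python=3.9
conda activate finetuning
pip install -r requirements.txt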

💻 Usage Examples

Basic Usage

All tasks follow the same structure. See the individual task files for the detailed hyperparameter choices. The following command runs fine-tuning for a task:

python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train

Parameters

MODEL_NAME: the name of the model to fine-tune, either a local path or a model on the HuggingFace model hub.
TASK_NAME: one of the following tasks: [ner, paraphrase, qa, sentiment, xcopa, xnli, flores]
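
If the fine-tuning command is launched from a Python script, a small helper can validate the task name and assemble the arguments before running it. The sketch below mirrors the command and task list given above. The helper name `build_finetune_command` is hypothetical, and the Hub id in the example is assumed from the model name on this page.

```python
# Task names copied from the TASK_NAME description above.
TASKS = ["ner", "paraphrase", "qa", "sentiment", "xcopa", "xnli", "flores"]


def build_finetune_command(task_name, model_name):
    """Return the fine-tuning command as an argument list for subprocess."""
    if task_name not in TASKS:
        raise ValueError(
            f"Unknown task {task_name!r}; expected one of {', '.join(TASKS)}"
        )
    script = f"IndicBERT/fine-tuning/{task_name}/{task_name}.py"
    return [
        "python",
        script,
        f"--model_name_or_path={model_name}",
        "--do_train",
    ]


# Assumed Hub id; a local checkpoint path works the same way.
command = build_finetune_command("ner", "ai4bharat/IndicBERTv2-MLM-only")
print(" ".join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` from the directory that contains the IndicBERT checkout. Passing a list instead of a shell string avoids quoting problems when the model path contains spaces.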

⚠️ Important Note

For the MASSIVE task, follow the instructions provided in the official repository.

📚 Citation

@inproceedings{doddapaneni-etal-2023-towards,
    title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
    author = "Doddapaneni, Sumanth  and
      Aralikatte, Rahul  and
      Ramesh, Gowtham  and
      Goyal, Shreya  and
      Khapra, Mitesh M.  and
      Kunchukuttan, Anoop  and
      Kumar, Pratyush",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.693",
    doi = "10.18653/v1/2023.acl-long.693",
    pages = "12402--12426",
    abstract = "Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes (i) monolingual corpora (ii) NLU testsets (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at \url{https://github.com/AI4Bharat/IndicBERT}.",
}