🚀 IndicBERT
IndicBERT is a multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark. The model has 278 million parameters and supports 23 Indic languages as well as English. It is trained with multiple objectives and datasets.
Supported Languages
| Property | Details |
|----------|---------|
| Supported languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
Model Tags
- indicbert2
- ai4bharat
- multilingual
License
This project is licensed under the MIT License.
Evaluation Metrics
Task type: fill-mask
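Before fine-tuning, you can sanity-check a checkpoint with the standard transformers fill-mask pipeline. A minimal sketch, assuming the MLM checkpoint is published on the Hub as ai4bharat/IndicBERTv2-MLM-only (substitute the id of whichever checkpoint from the model list below you actually use):

```python
# Minimal fill-mask sketch using the Hugging Face transformers pipeline.
# Assumption: the MLM checkpoint is available as "ai4bharat/IndicBERTv2-MLM-only";
# replace the id with the checkpoint (or local path) you actually use.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ai4bharat/IndicBERTv2-MLM-only")

# Read the mask token from the tokenizer instead of hard-coding "[MASK]".
mask = fill_mask.tokenizer.mask_token
for pred in fill_mask(f"India is a {mask} country."):
    print(pred["token_str"], pred["score"])
```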
🚀 Quick Start
Model List
- IndicBERT-MLM [model] - a vanilla BERT-style model trained on IndicCorp v2 with the MLM objective
- +Samanantar [model] - TLM as an additional objective, using the Samanantar parallel corpus [paper] | [dataset]
- +Back-Translation [model] - TLM as an additional objective, with the Indic portions of the IndicCorp v2 dataset translated into English using the IndicTrans model [model]
- IndicBERT-SS [model] - to encourage better lexical sharing among languages, the scripts of the Indic languages are converted to Devanagari and a BERT-style model is trained with the MLM objective
📦 Installation
The fine-tuning scripts are based on the transformers library. Create a new conda environment and set it up as follows:
```bash
conda create -n finetuning python=3.9
conda activate finetuning
pip install -r requirements.txt
```
💻 Usage Examples
Basic Usage
All tasks follow the same structure; check the individual files for detailed hyperparameter choices. The following command runs fine-tuning for a task:
```bash
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```
Arguments
- MODEL_NAME: the model to fine-tune; either a local path or a model from the HuggingFace Model Hub
- TASK_NAME: one of [ner, paraphrase, qa, sentiment, xcopa, xnli, flores]
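For example, a hypothetical end-to-end invocation that fine-tunes a checkpoint on NER (the Hub id below is an assumption; any model from the list above, or a local path, works):

```bash
# Hypothetical example; the Hub id is an assumption, substitute your checkpoint.
export MODEL_NAME="ai4bharat/IndicBERTv2-MLM-only"
export TASK_NAME="ner"
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```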
⚠️ Important Note
For the MASSIVE task, follow the instructions provided in the official repository.
📚 Citation
```bibtex
@inproceedings{doddapaneni-etal-2023-towards,
title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
author = "Doddapaneni, Sumanth and
Aralikatte, Rahul and
Ramesh, Gowtham and
Goyal, Shreya and
Khapra, Mitesh M. and
Kunchukuttan, Anoop and
Kumar, Pratyush",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.693",
doi = "10.18653/v1/2023.acl-long.693",
pages = "12402--12426",
abstract = "Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes (i) monolingual corpora (ii) NLU testsets (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at \url{https://github.com/AI4Bharat/IndicBERT}.",
}
```