🚀 IndicBERT
IndicBERT is a multilingual language model trained on IndicCorp v2 and evaluated on the IndicXTREME benchmark. The model has 278 million parameters and supports 23 Indic languages as well as English. It is trained with multiple objectives and datasets.
Supported Languages
| Property | Details |
|----------|---------|
| Supported languages | as, bn, brx, doi, en, gom, gu, hi, kn, ks, kas, mai, ml, mr, mni, mnb, ne, or, pa, sa, sat, sd, snd, ta, te, ur |
| Language details | asm_Beng, ben_Beng, brx_Deva, doi_Deva, eng_Latn, gom_Deva, guj_Gujr, hin_Deva, kan_Knda, kas_Arab, kas_Deva, mai_Deva, mal_Mlym, mar_Deva, mni_Beng, mni_Mtei, npi_Deva, ory_Orya, pan_Guru, san_Deva, sat_Olck, snd_Arab, snd_Deva, tam_Taml, tel_Telu, urd_Arab |
Model Tags
- indicbert2
- ai4bharat
- multilingual
License
This project is licensed under the MIT License.
Evaluation Metrics
Task type: fill-mask
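Before fine-tuning, you can sanity-check a checkpoint with the standard transformers fill-mask pipeline. A minimal sketch, assuming the MLM checkpoint is published on the Hub as ai4bharat/IndicBERTv2-MLM-only (substitute the id of whichever checkpoint from the model list below you actually use):

```python
# Minimal fill-mask sketch using the Hugging Face transformers pipeline.
# Assumption: the MLM checkpoint is available as "ai4bharat/IndicBERTv2-MLM-only";
# replace the id with the checkpoint (or local path) you actually use.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ai4bharat/IndicBERTv2-MLM-only")

# Read the mask token from the tokenizer instead of hard-coding "[MASK]".
mask = fill_mask.tokenizer.mask_token
for pred in fill_mask(f"India is a {mask} country."):
    print(pred["token_str"], pred["score"])
```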
🚀 Quick Start
Model List
- IndicBERT-MLM [model] - a vanilla BERT-style model trained on IndicCorp v2 with the MLM objective
- +Samanantar [model] - TLM as an additional objective, using the Samanantar parallel corpus [paper] | [dataset]
- +Back-Translation [model] - TLM as an additional objective, with the Indic portions of the IndicCorp v2 dataset translated into English using the IndicTrans model [model]
- IndicBERT-SS [model] - to encourage better lexical sharing among languages, the scripts of the Indic languages are converted to Devanagari and a BERT-style model is trained with the MLM objective
📦 Installation
The fine-tuning scripts are based on the transformers library. Create a new conda environment and set it up as follows:
```bash
conda create -n finetuning python=3.9
conda activate finetuning
pip install -r requirements.txt
```
💻 Usage Examples
Basic Usage
All tasks follow the same structure; check the individual files for detailed hyperparameter choices. The following command runs fine-tuning for a task:
```bash
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```
Arguments
- MODEL_NAME: the model to fine-tune; either a local path or a model from the HuggingFace Model Hub
- TASK_NAME: one of [ner, paraphrase, qa, sentiment, xcopa, xnli, flores]
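For example, a hypothetical end-to-end invocation that fine-tunes a checkpoint on NER (the Hub id below is an assumption; any model from the list above, or a local path, works):

```bash
# Hypothetical example; the Hub id is an assumption, substitute your checkpoint.
export MODEL_NAME="ai4bharat/IndicBERTv2-MLM-only"
export TASK_NAME="ner"
python IndicBERT/fine-tuning/$TASK_NAME/$TASK_NAME.py \
    --model_name_or_path=$MODEL_NAME \
    --do_train
```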
⚠️ Important Note
For the MASSIVE task, follow the instructions provided in the official repository.
📚 Citation
```bibtex
@inproceedings{doddapaneni-etal-2023-towards,
title = "Towards Leaving No {I}ndic Language Behind: Building Monolingual Corpora, Benchmark and Models for {I}ndic Languages",
author = "Doddapaneni, Sumanth and
Aralikatte, Rahul and
Ramesh, Gowtham and
Goyal, Shreya and
Khapra, Mitesh M. and
Kunchukuttan, Anoop and
Kumar, Pratyush",
editor = "Rogers, Anna and
Boyd-Graber, Jordan and
Okazaki, Naoaki",
booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2023",
address = "Toronto, Canada",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2023.acl-long.693",
doi = "10.18653/v1/2023.acl-long.693",
pages = "12402--12426",
abstract = "Building Natural Language Understanding (NLU) capabilities for Indic languages, which have a collective speaker base of more than one billion speakers is absolutely crucial. In this work, we aim to improve the NLU capabilities of Indic languages by making contributions along 3 important axes (i) monolingual corpora (ii) NLU testsets (iii) multilingual LLMs focusing on Indic languages. Specifically, we curate the largest monolingual corpora, IndicCorp, with 20.9B tokens covering 24 languages from 4 language families - a 2.3x increase over prior work, while supporting 12 additional languages. Next, we create a human-supervised benchmark, IndicXTREME, consisting of nine diverse NLU tasks covering 20 languages. Across languages and tasks, IndicXTREME contains a total of 105 evaluation sets, of which 52 are new contributions to the literature. To the best of our knowledge, this is the first effort towards creating a standard benchmark for Indic languages that aims to test the multilingual zero-shot capabilities of pretrained language models. Finally, we train IndicBERT v2, a state-of-the-art model supporting all the languages. Averaged across languages and tasks, the model achieves an absolute improvement of 2 points over a strong baseline. The data and models are available at \url{https://github.com/AI4Bharat/IndicBERT}.",
}
```