🚀 IndicNER
IndicNER is a model designed to identify named entities in Indian languages. It is fine-tuned on millions of sentences across 11 Indian languages and benchmarked on multiple datasets.
🚀 Quick Start
IndicNER is a model trained to identify named entities in sentences written in Indian languages. It is fine-tuned on millions of sentences across the 11 Indian languages listed below, and it is benchmarked on a human-annotated test set as well as several other publicly available Indian NER datasets.
The 11 languages covered by IndicNER are: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, Telugu.
✨ Features
- Multilingual Support: Covers 11 Indian languages: Assamese, Bengali, Gujarati, Hindi, Kannada, Malayalam, Marathi, Oriya, Punjabi, Tamil, and Telugu.
- Trained on Large Datasets: Fine-tuned on millions of sentences and benchmarked on a human-annotated test set and other public NER datasets.
📦 Installation
No separate installation steps are documented for IndicNER. The model is distributed through its Hugging Face repository (see Downloads).
📚 Documentation
Training Corpus
Our model was trained on a dataset that we mined from the existing Samanantar Corpus. We used bert-base-multilingual-uncased as the starting point and fine-tuned it on this NER dataset.
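The setup described above can be sketched roughly as follows. This is a minimal illustration and not the authors' training code: the tag set in `LABELS` is an assumption (the README does not list the labels), and the helper names `build_label_maps`, `align_labels`, and `load_base_model` are hypothetical. It shows how word-level NER labels are commonly aligned to subword tokens before fine-tuning a BERT-style checkpoint with the `transformers` library.

```python
from typing import Dict, List, Optional, Tuple

# Checkpoint named in the text above.
BASE_CHECKPOINT = "bert-base-multilingual-uncased"

# Assumption: an illustrative BIO tag set; the README does not list the labels.
LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

# PyTorch's cross-entropy loss ignores targets equal to -100 by default.
IGNORE_INDEX = -100


def build_label_maps(labels: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Return (label2id, id2label) dictionaries for a list of tag names."""
    label2id = {label: index for index, label in enumerate(labels)}
    id2label = {index: label for label, index in label2id.items()}
    return label2id, id2label


def align_labels(word_ids: List[Optional[int]], word_label_ids: List[int]) -> List[int]:
    """Give each word's label to its first subword; mask everything else.

    `word_ids` maps each token to the index of the word it came from, with
    None for special tokens such as [CLS] and [SEP].
    """
    aligned: List[int] = []
    previous: Optional[int] = None
    for word_index in word_ids:
        if word_index is None:
            aligned.append(IGNORE_INDEX)  # special token
        elif word_index != previous:
            aligned.append(word_label_ids[word_index])  # first subword of a word
        else:
            aligned.append(IGNORE_INDEX)  # continuation subword
        previous = word_index
    return aligned


def load_base_model():
    """Load the base checkpoint with a fresh token-classification head."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    label2id, id2label = build_label_maps(LABELS)
    tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
    model = AutoModelForTokenClassification.from_pretrained(
        BASE_CHECKPOINT,
        num_labels=len(LABELS),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Masking continuation subwords is one common convention; labelling every subword with its word's tag is an equally valid alternative.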
Downloads
Download the model from this Hugging Face repository.
Update 20 Dec 2022: We released a new paper documenting IndicNER and Naamapadam. We have a different model reported in the paper. We will update the repo here soon with this model.
Usage
You can use [this Colab notebook](https://colab.research.google.com/drive/1sYa-PDdZQ_c9SzUgnhyb3Fl7j96QBCS8?usp=sharing) for samples of using IndicNER, or for fine-tuning a pre-trained model on the Naamapadam dataset to build your own NER models.
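For readers who prefer a local script, the following is a minimal inference sketch using the `transformers` library. It is not taken from the notebook. The repository ID in `MODEL_ID` is an assumption (this README does not state it), the helper names `predict_word_tags` and `group_entities` are hypothetical, and the code assumes the model emits BIO-style labels such as `B-PER` / `I-PER` and ships with a fast tokenizer.

```python
from typing import List, Optional, Tuple

# Assumption: the repository ID is not stated in this README; replace it with
# the ID of the Hugging Face repository you are downloading from.
MODEL_ID = "ai4bharat/IndicNER"


def group_entities(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Merge word-level BIO tags into (entity text, entity type) pairs."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_type: Optional[str] = None
    for word, tag in zip(words, tags):
        prefix, _, entity_type = tag.partition("-")
        if prefix == "B" or (prefix == "I" and entity_type != current_type):
            # A new entity starts here; flush the one in progress, if any.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [word], entity_type
        elif prefix == "I":
            current_words.append(word)  # continue the open entity
        else:
            # "O" (or an unrecognised tag) closes the open entity.
            if current_words:
                entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
    if current_words:
        entities.append((" ".join(current_words), current_type))
    return entities


def predict_word_tags(sentence: str) -> Tuple[List[str], List[str]]:
    """Tag each whitespace-separated word using its first subword's prediction."""
    # Imported lazily so group_entities can be used without these packages.
    import torch
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForTokenClassification.from_pretrained(MODEL_ID)
    model.eval()

    words = sentence.split()
    encoded = tokenizer(
        words, is_split_into_words=True, return_tensors="pt", truncation=True
    )
    with torch.no_grad():
        logits = model(**encoded).logits[0]
    label_ids = logits.argmax(dim=-1).tolist()

    tags = ["O"] * len(words)
    seen = set()
    # word_ids() requires a fast tokenizer; None marks special tokens.
    for position, word_index in enumerate(encoded.word_ids(batch_index=0)):
        if word_index is None or word_index in seen:
            continue
        seen.add(word_index)
        tags[word_index] = model.config.id2label[label_ids[position]]
    return words, tags


if __name__ == "__main__":
    # Downloads the model on first run; the output depends on the model.
    words, tags = predict_word_tags("सचिन तेंदुलकर मुंबई में रहते हैं")
    print(group_entities(words, tags))
```

`group_entities` is independent of the model, so it can also be applied to gold annotations when inspecting a dataset.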
Citing
If you are using IndicNER, please cite the following article:
@misc{mhaske2022naamapadam,
doi = {10.48550/ARXIV.2212.10168},
url = {https://arxiv.org/abs/2212.10168},
author = {Mhaske, Arnav and Kedia, Harshit and Doddapaneni, Sumanth and Khapra, Mitesh M. and Kumar, Pratyush and Murthy, Rudra and Kunchukuttan, Anoop},
title = {Naamapadam: A Large-Scale Named Entity Annotated Data for Indic Languages},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
We would like to hear from you if:
- You are using our resources. Please let us know how you are putting these resources to use.
- You have any feedback on these resources.
License
The IndicNER code (and models) are released under the MIT License.
Contributors
This work is the outcome of a volunteer effort as part of the AI4Bharat initiative.
Contact