# roberta-medium-amharic

The roberta-medium-amharic model is designed for Amharic language processing, offering high-performance solutions for tasks such as sentiment classification and named entity recognition.
## Quick Start
This model has the same architecture as [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) and was pretrained from scratch on the Amharic subsets of the oscar, mc4, and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, totaling 290 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 32k.
The model was trained for 15 hours on an A100 40GB GPU.
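As a quick sanity check, the tokenizer and model can be loaded with the standard `transformers` auto classes; the printed values should match the figures above up to rounding (a minimal sketch, not part of the original training code):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-medium-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/roberta-medium-amharic")

print(len(tokenizer))          # vocabulary size, ~32k
print(model.num_parameters())  # ~42 million parameters
```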
It achieves the following results on the evaluation set:
- Loss: 2.446
- Perplexity: 11.59
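For reference, perplexity here is simply the exponential of the evaluation cross-entropy loss, so the two figures above are consistent up to rounding:

```python
import math

# Perplexity = exp(cross-entropy loss); the small gap from the reported 11.59
# comes from rounding of the reported loss value.
print(math.exp(2.446))  # ~11.54
```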
Even though this model has only 42 million parameters, it outperforms the roughly 7x larger, 279-million-parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) multilingual model on Amharic sentiment classification and named entity recognition tasks.
## Usage Examples

### Basic Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/roberta-medium-amharic')
>>> unmasker("á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° <mask> á°ááĨá¯ááĸ")

[{'score': 0.7755730152130127,
  'token': 137,
  'token_str': 'áááĩ',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° áááĩ á°ááĨá¯ááĸ'},
 {'score': 0.09340856224298477,
  'token': 346,
  'token_str': 'á ááĩ',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° á ááĩ á°ááĨá¯ááĸ'},
 {'score': 0.08586721867322922,
  'token': 217,
  'token_str': 'áááŗáĩ',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° áááŗáĩ á°ááĨá¯ááĸ'},
 {'score': 0.011987944133579731,
  'token': 733,
  'token_str': 'á ááŗáĩ',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° á ááŗáĩ á°ááĨá¯ááĸ'},
 {'score': 0.010042797774076462,
  'token': 1392,
  'token_str': 'áááą',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° áááą á°ááĨá¯ááĸ'}]
```
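The fill-mask pipeline returns the top 5 candidates by default; the standard `top_k` argument controls how many are returned (here `text` stands for any sentence containing the `<mask>` token):

```python
>>> unmasker(text, top_k=2)  # return only the 2 highest-scoring candidates
```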
## Documentation

### Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks; a minimal fine-tuning sketch follows the list.
- Sentiment Classification
  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
  - Code: https://github.com/rasyosef/amharic-sentiment-classification
- Named Entity Recognition
  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
  - Code: https://github.com/rasyosef/amharic-named-entity-recognition
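The exact training setups live in the linked repositories. As a rough illustration only, a sentiment classification fine-tune could look like the sketch below; the dataset column names (`text`, `label`), split names, and hyperparameters are assumptions, not taken from the original code:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed schema: "text" and binary "label" columns with "train"/"test" splits
dataset = load_dataset("rasyosef/amharic-sentiment")

tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-medium-amharic")
model = AutoModelForSequenceClassification.from_pretrained(
    "rasyosef/roberta-medium-amharic", num_labels=2
)

def tokenize(batch):
    # 128 tokens is an arbitrary cap chosen for this sketch
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="roberta-medium-amharic-sentiment",  # hypothetical output path
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```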
### Finetuned Model Performance

The reported F1 scores are macro averages.
| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|-------|-----------------|------------|----------------|-------------------------------|
| roberta-base-amharic | 110M | 8.08 | 0.88 | 0.78 |
| roberta-medium-amharic | 42.2M | 11.59 | 0.84 | 0.75 |
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | | 0.83 | 0.73 |
| afro-xlmr-base | 278M | | 0.83 | 0.75 |
| afro-xlmr-large | 560M | | 0.86 | 0.76 |
| am-roberta | 443M | | 0.82 | 0.69 |
## Model Information

| Property | Details |
|----------|---------|
| Library Name | transformers |
| Datasets | oscar, mc4, rasyosef/amharic-sentences-corpus |
| Language | am |
| Metrics | perplexity |
| Pipeline Tag | fill-mask |
## Widget Examples

- Example 1:
  - Text: á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° á°ááĨá¯ááĸ
- Example 2:
  - Text: áŖáááĩ á ááĩáĩ áááŗáĩ á¨á ááŽáŗ áááĢáĩ á¨áĻá ááĸ á áĨá á á¨áá¯ááĸ
- Example 3:
  - Text: áŦááĢááĢá á¨áŗá áĨáĩá¨áŗá á á ááĩ ááá á¨á°ááá áĩááģá¸áá áá°ááŗá¸áá á°á¨áĩá á¨áááŊá ááŖ á¨áá°áá°á á¨áá¨áĨ áááĒ áá áĩáááĩ á ááŦáá°ááĩ áááĢá áŠáļ áĸá°á¨áá ááŦá áá á¨á°áááá áĨáá áĩáá´ ááá á áĨá¨á°ááᨠáááĸ
- Example 4:
  - Text: á°ááĒááš á ááĩáĩአáĢá¸ááá áĩ á¨áá áĢ áĩáĢ ááĢá¨á áĨá á áááá áĨáá°á á¨á áááŗá á¨ááĢáĩá°áĢáá ááŦáĩ á ááą áááĸ