# bert-medium-amharic

This bert-medium-amharic model shares the same architecture as bert-medium. It was pretrained from scratch using the Amharic subsets of the oscar, mc4, and amharic-sentences-corpus datasets, with a total of 290 million tokens. The tokenizer was also trained from scratch on the same text corpus and has a vocabulary size of 28k.
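As a quick sanity check, a minimal sketch (assuming the rasyosef/bert-medium-amharic checkpoint used in the examples below) loads the model and tokenizer and prints the vocabulary size and parameter count quoted above:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the tokenizer and masked-LM model from the Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-medium-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-medium-amharic")

print(len(tokenizer))                                    # subword vocabulary size, ~28k
print(sum(p.numel() for p in model.parameters()) / 1e6)  # parameter count in millions, ~40.5M
```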
## Quick Start
### Model Information

| Property | Details |
|----------|---------|
| Library Name | transformers |
| Datasets | oscar, mc4, rasyosef/amharic-sentences-corpus |
| Language | am |
| Metrics | perplexity |
| Pipeline Tag | fill-mask |
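The same metadata can be read programmatically from the Hub. A small sketch using huggingface_hub (an assumption; this card only lists the properties above):

```python
from huggingface_hub import HfApi

# Fetch the repo metadata for this model from the Hugging Face Hub
info = HfApi().model_info("rasyosef/bert-medium-amharic")
print(info.pipeline_tag)  # "fill-mask"
print(info.tags)          # includes the language and dataset tags listed above
```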
### Widget Examples

- Example 1:
  - Text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
- Example 2:
  - Text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
- Example 3:
  - Text: ኬንያውያን ከዳር እስከዳር በአንድ ሆነው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
- Example 4:
  - Text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
## Usage Examples
### Basic Usage

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-medium-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
[{'score': 0.5135582089424133,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.2923661470413208,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.09527599066495895,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.06960058212280273,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.019061630591750145,
  'token': 28157,
  'token_str': '##ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተዓመት ተቆጥሯል ።'}]
```
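The same top-5 predictions can be reproduced without the pipeline helper by scoring the [MASK] position with the masked-LM head directly. A sketch, assuming a PyTorch backend:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-medium-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-medium-amharic")

text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።"
inputs = tokenizer(text, return_tensors="pt")

# Locate the [MASK] token in the input sequence
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the vocabulary at the masked position, then take the top 5
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for score, token_id in zip(top.values[0], top.indices[0]):
    print(f"{tokenizer.decode(token_id.item())}\t{score.item():.4f}")
```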
## Technical Details
This model was fine-tuned and evaluated on the following Amharic NLP tasks:

### Sentiment Classification

- Dataset: amharic-sentiment
- Code: https://github.com/rasyosef/amharic-sentiment-classification
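For the sentiment task, the sketch below shows how a classification head would typically be attached to this checkpoint; the label count and training details are assumptions, see the linked repository for the actual setup:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-medium-amharic")

# Attach a randomly initialized classification head on top of the pretrained encoder
model = AutoModelForSequenceClassification.from_pretrained(
    "rasyosef/bert-medium-amharic",
    num_labels=2,  # assumed binary positive/negative labels
)
# `model` can then be fine-tuned on the amharic-sentiment dataset,
# e.g. with the transformers Trainer API.
```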
### Named Entity Recognition
### Fine-tuned Model Performance

The reported F1 scores are macro averages.

| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|-------|-----------------|------------|----------------|-------------------------------|
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | | 0.83 | 0.73 |
| am-roberta | 443M | | 0.82 | 0.69 |
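For reference, the macro average is the unweighted mean of the per-class F1 scores. A toy illustration with scikit-learn (made-up labels, not the actual evaluation code):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Computes F1 per class, then averages the classes with equal weight
print(f1_score(y_true, y_pred, average="macro"))
```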