# bert-medium-amharic

This bert-medium-amharic model shares the same architecture as bert-medium. It was pretrained from scratch using the Amharic subsets of the oscar, mc4, and amharic-sentences-corpus datasets, with a total of 290 million tokens. The tokenizer was also trained from scratch on the same text corpus and has a vocabulary size of 28k.
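As a quick sanity check, a minimal sketch (assuming the rasyosef/bert-medium-amharic checkpoint used in the examples below) loads the model and tokenizer and prints the vocabulary size and parameter count quoted above:

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer

# Load the tokenizer and masked-LM model from the Hub
tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-medium-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-medium-amharic")

print(len(tokenizer))                                    # subword vocabulary size, ~28k
print(sum(p.numel() for p in model.parameters()) / 1e6)  # parameter count in millions, ~40.5M
```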
## Quick Start
### Model Information

| Property | Details |
|----------|---------|
| Library Name | transformers |
| Datasets | oscar, mc4, rasyosef/amharic-sentences-corpus |
| Language | am |
| Metrics | perplexity |
| Pipeline Tag | fill-mask |
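The same metadata can be read programmatically from the Hub. A small sketch using huggingface_hub (an assumption; this card only lists the properties above):

```python
from huggingface_hub import HfApi

# Fetch the repo metadata for this model from the Hugging Face Hub
info = HfApi().model_info("rasyosef/bert-medium-amharic")
print(info.pipeline_tag)  # "fill-mask"
print(info.tags)          # includes the language and dataset tags listed above
```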
### Widget Examples

- Example 1:
  - Text: ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።
- Example 2:
  - Text: ባለፉት አምስት ዓመታት የአውሮጳ ሀገራት የጦር [MASK] ግዢ በእጅጉ ጨምሯል።
- Example 3:
  - Text: ኬንያውያን ከዳር እስከዳር በአንድ ሆነው የተቃውሞ ድምጻቸውን ማሰማታቸውን ተከትሎ የዜጎችን ቁጣ የቀሰቀሰው የቀረጥ ጭማሪ ሕግ ትናንት በፕሬዝደንት ዊልያም ሩቶ [MASK] ቢደረግም ዛሬም ግን የተቃውሞው እንቅስቃሴ መቀጠሉ እየተነገረ ነው።
- Example 4:
  - Text: ተማሪዎቹ በውድድሩ ካሸነፉበት የፈጠራ ስራ መካከል [MASK] እና ቅዝቃዜን እንደአየር ሁኔታው የሚያስተካክል ጃኬት አንዱ ነው።
## Usage Examples
### Basic Usage

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/bert-medium-amharic')
>>> unmasker("ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።")
[{'score': 0.5135582089424133,
  'token': 9345,
  'token_str': 'ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመት ተቆጥሯል ።'},
 {'score': 0.2923661470413208,
  'token': 9617,
  'token_str': 'ዓመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ ዓመታት ተቆጥሯል ።'},
 {'score': 0.09527599066495895,
  'token': 9913,
  'token_str': 'አመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመት ተቆጥሯል ።'},
 {'score': 0.06960058212280273,
  'token': 10898,
  'token_str': 'አመታት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ አመታት ተቆጥሯል ።'},
 {'score': 0.019061630591750145,
  'token': 28157,
  'token_str': '##ዓመት',
  'sequence': 'ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተዓመት ተቆጥሯል ።'}]
```
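The same top-5 predictions can be reproduced without the pipeline helper by scoring the [MASK] position with the masked-LM head directly. A sketch, assuming a PyTorch backend:

```python
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-medium-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/bert-medium-amharic")

text = "ከሀገራቸው ከኢትዮጵያ ከወጡ ግማሽ ምዕተ [MASK] ተቆጥሯል።"
inputs = tokenizer(text, return_tensors="pt")

# Locate the [MASK] token in the input sequence
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]

with torch.no_grad():
    logits = model(**inputs).logits

# Softmax over the vocabulary at the masked position, then take the top 5
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for score, token_id in zip(top.values[0], top.indices[0]):
    print(f"{tokenizer.decode(token_id.item())}\t{score.item():.4f}")
```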
## Technical Details
This model was fine-tuned and evaluated on the following Amharic NLP tasks:

### Sentiment Classification

- Dataset: amharic-sentiment
- Code: https://github.com/rasyosef/amharic-sentiment-classification
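For the sentiment task, the sketch below shows how a classification head would typically be attached to this checkpoint; the label count and training details are assumptions, see the linked repository for the actual setup:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("rasyosef/bert-medium-amharic")

# Attach a randomly initialized classification head on top of the pretrained encoder
model = AutoModelForSequenceClassification.from_pretrained(
    "rasyosef/bert-medium-amharic",
    num_labels=2,  # assumed binary positive/negative labels
)
# `model` can then be fine-tuned on the amharic-sentiment dataset,
# e.g. with the transformers Trainer API.
```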
### Named Entity Recognition
### Fine-tuned Model Performance

The reported F1 scores are macro averages.

| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|-------|-----------------|------------|----------------|-------------------------------|
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | | 0.83 | 0.73 |
| am-roberta | 443M | | 0.82 | 0.69 |
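For reference, the macro average is the unweighted mean of the per-class F1 scores. A toy illustration with scikit-learn (made-up labels, not the actual evaluation code):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

# Computes F1 per class, then averages the classes with equal weight
print(f1_score(y_true, y_pred, average="macro"))
```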