# roberta-medium-amharic

The roberta-medium-amharic model is designed for Amharic language processing, offering high-performance solutions for tasks such as sentiment classification and named entity recognition.
## Quick Start
This model has the same architecture as [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) and was pretrained from scratch on the Amharic subsets of the oscar, mc4, and [amharic-sentences-corpus](https://huggingface.co/datasets/rasyosef/amharic-sentences-corpus) datasets, totaling 290 million tokens. The tokenizer was trained from scratch on the same text corpus and has a vocabulary size of 32k.
The model was trained for 15 hours on an A100 40GB GPU.
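As a quick sanity check, the tokenizer and model can be loaded with the standard `transformers` auto classes; the printed values should match the figures above up to rounding (a minimal sketch, not part of the original training code):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-medium-amharic")
model = AutoModelForMaskedLM.from_pretrained("rasyosef/roberta-medium-amharic")

print(len(tokenizer))          # vocabulary size, ~32k
print(model.num_parameters())  # ~42 million parameters
```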
It achieves the following results on the evaluation set:
- Loss: 2.446
- Perplexity: 11.59
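For reference, perplexity here is simply the exponential of the evaluation cross-entropy loss, so the two figures above are consistent up to rounding:

```python
import math

# Perplexity = exp(cross-entropy loss); the small gap from the reported 11.59
# comes from rounding of the reported loss value.
print(math.exp(2.446))  # ~11.54
```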
Even though this model has only 42 million parameters, it outperforms the roughly 7x larger, 279-million-parameter [xlm-roberta-base](https://huggingface.co/FacebookAI/xlm-roberta-base) multilingual model on Amharic sentiment classification and named entity recognition tasks.
## Usage Examples

### Basic Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='rasyosef/roberta-medium-amharic')
>>> unmasker("á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° <mask> á°ááĨá¯ááĸ")

[{'score': 0.7755730152130127,
  'token': 137,
  'token_str': 'áááĩ',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° áááĩ á°ááĨá¯ááĸ'},
 {'score': 0.09340856224298477,
  'token': 346,
  'token_str': 'á ááĩ',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° á ááĩ á°ááĨá¯ááĸ'},
 {'score': 0.08586721867322922,
  'token': 217,
  'token_str': 'áááŗáĩ',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° áááŗáĩ á°ááĨá¯ááĸ'},
 {'score': 0.011987944133579731,
  'token': 733,
  'token_str': 'á ááŗáĩ',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° á ááŗáĩ á°ááĨá¯ááĸ'},
 {'score': 0.010042797774076462,
  'token': 1392,
  'token_str': 'áááą',
  'sequence': 'á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° áááą á°ááĨá¯ááĸ'}]
```
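The fill-mask pipeline returns the top 5 candidates by default; the standard `top_k` argument controls how many are returned (here `text` stands for any sentence containing the `<mask>` token):

```python
>>> unmasker(text, top_k=2)  # return only the 2 highest-scoring candidates
```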
## Documentation

### Finetuning

This model was finetuned and evaluated on the following Amharic NLP tasks; a minimal fine-tuning sketch follows the list.
- Sentiment Classification
  - Dataset: [amharic-sentiment](https://huggingface.co/datasets/rasyosef/amharic-sentiment)
  - Code: https://github.com/rasyosef/amharic-sentiment-classification
- Named Entity Recognition
  - Dataset: [amharic-named-entity-recognition](https://huggingface.co/datasets/rasyosef/amharic-named-entity-recognition)
  - Code: https://github.com/rasyosef/amharic-named-entity-recognition
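The exact training setups live in the linked repositories. As a rough illustration only, a sentiment classification fine-tune could look like the sketch below; the dataset column names (`text`, `label`), split names, and hyperparameters are assumptions, not taken from the original code:

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Assumed schema: "text" and binary "label" columns with "train"/"test" splits
dataset = load_dataset("rasyosef/amharic-sentiment")

tokenizer = AutoTokenizer.from_pretrained("rasyosef/roberta-medium-amharic")
model = AutoModelForSequenceClassification.from_pretrained(
    "rasyosef/roberta-medium-amharic", num_labels=2
)

def tokenize(batch):
    # 128 tokens is an arbitrary cap chosen for this sketch
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="roberta-medium-amharic-sentiment",  # hypothetical output path
        num_train_epochs=3,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default data collator
)
trainer.train()
```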
### Finetuned Model Performance

The reported F1 scores are macro averages.
| Model | Size (# params) | Perplexity | Sentiment (F1) | Named Entity Recognition (F1) |
|-------|-----------------|------------|----------------|-------------------------------|
| roberta-base-amharic | 110M | 8.08 | 0.88 | 0.78 |
| roberta-medium-amharic | 42.2M | 11.59 | 0.84 | 0.75 |
| bert-medium-amharic | 40.5M | 13.74 | 0.83 | 0.68 |
| bert-small-amharic | 27.8M | 15.96 | 0.83 | 0.68 |
| bert-mini-amharic | 10.7M | 22.42 | 0.81 | 0.64 |
| bert-tiny-amharic | 4.18M | 71.52 | 0.79 | 0.54 |
| xlm-roberta-base | 279M | | 0.83 | 0.73 |
| afro-xlmr-base | 278M | | 0.83 | 0.75 |
| afro-xlmr-large | 560M | | 0.86 | 0.76 |
| am-roberta | 443M | | 0.82 | 0.69 |
## Model Information

| Property | Details |
|----------|---------|
| Library Name | transformers |
| Datasets | oscar, mc4, rasyosef/amharic-sentences-corpus |
| Language | am |
| Metrics | perplexity |
| Pipeline Tag | fill-mask |
## Widget Examples

- Example 1:
  - Text: á¨áááĢá¸á á¨áĸáĩáŽáĩáĢ á¨áᥠáááŊ ááá° á°ááĨá¯ááĸ
- Example 2:
  - Text: áŖáááĩ á ááĩáĩ áááŗáĩ á¨á ááŽáŗ áááĢáĩ á¨áĻá ááĸ á áĨá á á¨áá¯ááĸ
- Example 3:
  - Text: áŦááĢááĢá á¨áŗá áĨáĩá¨áŗá á á ááĩ ááá á¨á°ááá áĩááģá¸áá áá°ááŗá¸áá á°á¨áĩá á¨áááŊá ááŖ á¨áá°áá°á á¨áá¨áĨ áááĒ áá áĩáááĩ á ááŦáá°ááĩ áááĢá áŠáļ áĸá°á¨áá ááŦá áá á¨á°áááá áĨáá áĩáá´ ááá á áĨá¨á°ááᨠáááĸ
- Example 4:
  - Text: á°ááĒááš á ááĩáĩአáĢá¸ááá áĩ á¨áá áĢ áĩáĢ ááĢá¨á áĨá á áááá áĨáá°á á¨á áááŗá á¨ááĢáĩá°áĢáá ááŦáĩ á ááą áááĸ