🚀 AI-translator-eng-to-9ja
AI-translator-eng-to-9ja is a 418-million-parameter translation model that translates English into Yoruba, Igbo, and Hausa. Trained on a dataset of 1,500,000 sentence pairs (500,000 per language), it produces high-quality translations for all three languages. The broader aim is a system that makes it easier to interact with LLMs in Igbo, Hausa, and Yoruba.
✨ Features
- Multilingual Translation: Translates from English to Yoruba, Igbo, and Hausa.
- High-Quality Output: Trained on 1.5 million sentence pairs for accurate translations.
- LLM Communication: Facilitates interaction with LLMs in local languages (see the pipeline sketch after the usage examples below).
📦 Installation
No installation script is required beyond the standard Hugging Face stack: install `transformers`, `torch`, and `huggingface_hub` (for example, `pip install transformers torch huggingface_hub`), then load the model as shown below.
💻 Usage Examples
Basic Usage
```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
import huggingface_hub

# Authenticate with the Hugging Face Hub (only needed if your environment requires it)
huggingface_hub.login()

# Load the model and tokenizer
model = M2M100ForConditionalGeneration.from_pretrained("HelpMumHQ/AI-translator-eng-to-9ja")
tokenizer = M2M100Tokenizer.from_pretrained("HelpMumHQ/AI-translator-eng-to-9ja")

# The source language is always English; the target language is selected
# per call via forced_bos_token_id.
tokenizer.src_lang = "en"

# English -> Igbo
eng_text = "Healthcare is an important field in virtually every society because it directly affects the well-being and quality of life of individuals. It encompasses a wide range of services and professions, including preventive care, diagnosis, treatment, and management of diseases and conditions."
encoded_eng = tokenizer(eng_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_eng, forced_bos_token_id=tokenizer.get_lang_id("ig"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))

# English -> Yoruba
eng_text = "Healthcare is an important field in virtually every society because it directly affects the well-being and quality of life of individuals. It encompasses a wide range of services and professions, including preventive care, diagnosis, treatment, and management of diseases and conditions. Effective healthcare systems aim to improve health outcomes, reduce the incidence of illness, and ensure that individuals have access to necessary medical services."
encoded_eng = tokenizer(eng_text, return_tensors="pt")
generated_tokens = model.generate(**encoded_eng, forced_bos_token_id=tokenizer.get_lang_id("yo"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))

# English -> Hausa (same source text as the Yoruba example)
generated_tokens = model.generate(**encoded_eng, forced_bos_token_id=tokenizer.get_lang_id("ha"))
print(tokenizer.batch_decode(generated_tokens, skip_special_tokens=True))
```
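Since the model's stated goal is to ease interaction with LLMs in local languages, here is a minimal sketch of that pipeline: an English answer produced by any LLM is translated before being shown to the user. The `get_llm_answer` function is a hypothetical stand-in for whatever LLM client you use; only the translation calls follow the model's actual API.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("HelpMumHQ/AI-translator-eng-to-9ja")
tokenizer = M2M100Tokenizer.from_pretrained("HelpMumHQ/AI-translator-eng-to-9ja")

def translate(text: str, target_lang: str) -> str:
    """Translate English text into the given target language ('yo', 'ig', or 'ha')."""
    tokenizer.src_lang = "en"
    encoded = tokenizer(text, return_tensors="pt")
    tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id(target_lang))
    return tokenizer.batch_decode(tokens, skip_special_tokens=True)[0]

def get_llm_answer(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM client call; returns a canned reply here.
    return "Vaccination protects individuals and communities by preventing the spread of disease."

english_answer = get_llm_answer("Explain why vaccination matters.")
print(translate(english_answer, "yo"))  # deliver the LLM's answer in Yoruba
```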
📚 Documentation
Model Details
| Property | Details |
|----------|---------|
| Languages Supported | Source language: English; target languages: Yoruba, Igbo, Hausa |
| Model Type | 418-million-parameter translation model |
| Training Data | 1,500,000 translation pairs from open-source parallel corpora and curated datasets for Yoruba, Igbo, and Hausa |
Supported Language Codes
- English: `en`
- Yoruba: `yo`
- Igbo: `ig`
- Hausa: `ha`
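These are the codes accepted by `tokenizer.src_lang` and `tokenizer.get_lang_id`. As a quick illustration, the sketch below translates a single sentence into all three target languages:

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("HelpMumHQ/AI-translator-eng-to-9ja")
tokenizer = M2M100Tokenizer.from_pretrained("HelpMumHQ/AI-translator-eng-to-9ja")
tokenizer.src_lang = "en"  # source is always English

# Translate one sentence into each supported target language
encoded = tokenizer("Drink clean water every day.", return_tensors="pt")
for code, name in {"yo": "Yoruba", "ig": "Igbo", "ha": "Hausa"}.items():
    tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id(code))
    print(name, "->", tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
```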
Training Dataset
The training dataset consists of 1,500,000 translation pairs, sourced from a combination of open-source parallel corpora and curated datasets specific to Yoruba, Igbo, and Hausa.
Limitations
- Although the model performs well on English-to-Yoruba, English-to-Igbo, and English-to-Hausa translation, performance may vary with the complexity and domain of the text.
- Translation quality may decline for extremely long sentences or ambiguous contexts; one mitigation is to translate long passages sentence by sentence, as in the sketch below.
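A minimal sketch of that mitigation, assuming the input is plain English prose: split the passage into sentences and translate each one separately. The period-based splitter here is illustrative only; a proper sentence segmenter would be more robust.

```python
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model = M2M100ForConditionalGeneration.from_pretrained("HelpMumHQ/AI-translator-eng-to-9ja")
tokenizer = M2M100Tokenizer.from_pretrained("HelpMumHQ/AI-translator-eng-to-9ja")
tokenizer.src_lang = "en"

def translate_long(text: str, target_lang: str) -> str:
    """Translate a long English passage sentence by sentence (naive period split)."""
    outputs = []
    for sentence in (s.strip() for s in text.split(".")):
        if not sentence:
            continue
        encoded = tokenizer(sentence + ".", return_tensors="pt")
        tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id(target_lang))
        outputs.append(tokenizer.batch_decode(tokens, skip_special_tokens=True)[0])
    return " ".join(outputs)

long_text = "First sentence of a long passage. Second sentence. Third sentence."  # stand-in input
print(translate_long(long_text, "ig"))
```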
Training Hyperparameters
The following hyperparameters were used during training (the sketch after this list shows how they would map onto `Seq2SeqTrainingArguments`):
- learning_rate: 2e-05
- train_batch_size: 64
- eval_batch_size: 64
- seed: 42
- optimizer: Adam with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- num_epochs: 3
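For reference, here is how these values would map onto `Seq2SeqTrainingArguments` from transformers. This is a reconstruction under the assumption that the standard `Seq2SeqTrainer` setup was used, not the authors' actual training script; the `output_dir` is a placeholder.

```python
from transformers import Seq2SeqTrainingArguments

# Reconstruction of the reported hyperparameters; output_dir is a placeholder.
training_args = Seq2SeqTrainingArguments(
    output_dir="eng-to-9ja-checkpoints",  # placeholder path
    learning_rate=2e-5,
    per_device_train_batch_size=64,
    per_device_eval_batch_size=64,
    seed=42,
    adam_beta1=0.9,      # Adam betas=(0.9, 0.999)
    adam_beta2=0.999,
    adam_epsilon=1e-8,   # Adam epsilon
    lr_scheduler_type="linear",
    num_train_epochs=3,
)
```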
Framework Versions
- Transformers 4.44.2
- PyTorch 2.4.0+cu121
- Datasets 2.21.0
- Tokenizers 0.19.1
📄 License
This model is licensed under the MIT license.