🚀 Faseeh
A machine translation model designed to translate into true Classical Arabic, addressing the dominance of Modern Standard Arabic (Arabized English) in current translations.
🚀 Quick Start
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

model_name = "Abdulmohsena/Faseeh"

# Load the tokenizer with English as the source language and Arabic as the target
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="arb_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
generation_config = GenerationConfig.from_pretrained(model_name)

# Example English sentence to translate
dummy = "And the Saudi Arabian Foreign Minister assured the visitors of the importance of seeking security."

encoded_en = tokenizer(dummy, return_tensors="pt")
generated_tokens = model.generate(**encoded_en, generation_config=generation_config)
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```
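Note: NLLB-style models normally need the target-language token forced at the start of generation. If Faseeh's bundled generation config does not already do this, the sketch below shows one way to force Arabic output explicitly; it assumes the standard NLLB tokenizer behavior, where language codes such as `arb_Arab` exist as vocabulary tokens.

```python
# Sketch: explicitly force the Arabic language token at the start of generation.
# Skip this if the bundled generation config already forces the target language.
generated_tokens = model.generate(
    **encoded_en,
    generation_config=generation_config,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("arb_Arab"),
)
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```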
✨ Features
A language model designed for translation into true Classical Arabic, since the form that currently dominates translation is Modern Standard Arabic (Arabized English).
What is Arabized English?
It is a language with an Arabic appearance but a Western essence. There are many examples of this, such as "lifestyle" instead of "life", "common ground" instead of "equality", "inner peace" instead of "tranquility or serenity", and "negatives and positives" instead of "virtues and vices".
📚 Documentation
Model Details
- Fine-tuned version of Facebook's NLLB-200 Distilled 600M model
Model Sources
- Repository: https://github.com/AbdulmohsenA/Faseeh
Bias, Risks, and Limitations
- The language pairs outside of the Quran were mostly translated with Google Translate, so translation quality depends on the quality of Google's translations from Classical Arabic to English.
- The evaluation metric used for this model is BERTScore / an E5-based embedding score. Its alignment with human judgment is imperfect, but it is the best available metric for semantic translation, so it remains the main evaluation metric until a better substitute appears (see the sketch after this list).
- The metrics generally used to evaluate translation quality into Arabic are trained on Modern Standard Arabic, which makes them misaligned with the goals of this model.
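As a rough illustration of the semantic-similarity evaluation mentioned above, the sketch below uses the `bertscore` metric from the Hugging Face `evaluate` library. The example sentences are placeholders, and the E5-based variant would swap in an E5 embedding model instead of BERT.

```python
# Sketch: semantic-similarity evaluation with BERTScore via the `evaluate` library.
# The sentences here are placeholders, not real model outputs.
import evaluate

bertscore = evaluate.load("bertscore")
predictions = ["<model translation in Arabic>"]
references = ["<reference Classical Arabic sentence>"]

results = bertscore.compute(predictions=predictions, references=references, lang="ar")
print(results["f1"])  # one F1 score per prediction/reference pair
```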
Improvements
- A much better approach to generating language pairs from Classical Arabic text is to use GPT-4o (at the time of writing, the only model capable of understanding complex Arabic sentences); a minimal sketch of this idea follows this list.
- Evaluation metrics designed specifically for the goal of this model are still needed. Currently, only a binary classifier has been built to judge whether a sentence is Classical Arabic; it produces a score from 0 to 1, but it is neither sufficient nor flexible, so more work on evaluation is needed.
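A minimal sketch of the GPT-4o idea above, using the OpenAI Python SDK. The prompt wording and the helper name `translate_classical_sentence` are illustrative assumptions, not part of the Faseeh pipeline.

```python
# Sketch: generating an English side for a Classical Arabic sentence with GPT-4o.
# Assumes the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def translate_classical_sentence(arabic_sentence: str) -> str:
    """Illustrative helper: ask GPT-4o for an English rendering of a Classical Arabic sentence."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert translator of Classical Arabic."},
            {"role": "user", "content": f"Translate this Classical Arabic sentence into English:\n{arabic_sentence}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```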
Training Data
- Arabic text outside of Hugging Face datasets is scraped from the Shamela Library.
Metrics
- COMET: emphasizes preserving the same meaning rather than matching individual words (semantic rather than syntactic translation).
- Fluency Score: a custom-built metric that classifies whether a sentence is Classical Arabic or not (see the sketch below).
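The sketch below illustrates how these two metrics could be computed: COMET through the Hugging Face `evaluate` library (which wraps the `unbabel-comet` package), and the fluency score through a text-classification pipeline. The checkpoint name `classical-arabic-classifier` is a hypothetical placeholder, since the classifier is not named here.

```python
# Sketch: computing COMET and a Classical-Arabic fluency score.
# The classifier checkpoint name below is a hypothetical placeholder.
import evaluate
from transformers import pipeline

sources = ["And the minister stressed the importance of seeking security."]
predictions = ["<Faseeh translation in Arabic>"]
references = ["<reference Classical Arabic sentence>"]

# COMET compares source, translation, and reference at the semantic level.
comet = evaluate.load("comet")
comet_scores = comet.compute(sources=sources, predictions=predictions, references=references)
print(comet_scores["mean_score"])

# Fluency score: a binary classifier that rates how "Classical" a sentence sounds (0 to 1).
fluency_classifier = pipeline("text-classification", model="classical-arabic-classifier")  # hypothetical id
print(fluency_classifier(predictions[0]))
```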
📄 License
This project is licensed under the MIT license.