🚀 Faseeh
A machine translation model designed to translate into true Classical Arabic, addressing the dominance of Modern Standard Arabic (Arabized English) in current translations.
🚀 Quick Start
Use the code below to get started with the model.
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, GenerationConfig

model_name = "Abdulmohsena/Faseeh"

# Load the tokenizer with English as the source language and Arabic as the target
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="eng_Latn", tgt_lang="arb_Arab")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
generation_config = GenerationConfig.from_pretrained(model_name)

# Example English sentence to translate
dummy = "And the Saudi Arabian Foreign Minister assured the visitors of the importance of seeking security."

encoded_en = tokenizer(dummy, return_tensors="pt")
generated_tokens = model.generate(**encoded_en, generation_config=generation_config)
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```
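Note: NLLB-style models normally need the target-language token forced at the start of generation. If Faseeh's bundled generation config does not already do this, the sketch below shows one way to force Arabic output explicitly; it assumes the standard NLLB tokenizer behavior, where language codes such as `arb_Arab` exist as vocabulary tokens.

```python
# Sketch: explicitly force the Arabic language token at the start of generation.
# Skip this if the bundled generation config already forces the target language.
generated_tokens = model.generate(
    **encoded_en,
    generation_config=generation_config,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("arb_Arab"),
)
print(tokenizer.decode(generated_tokens[0], skip_special_tokens=True))
```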
✨ Features
A language model designed for translation into true Classical Arabic, since the form that currently dominates translation is Modern Standard Arabic (Arabized English).
What is Arabized English?
It is a language with an Arabic appearance but a Western essence. There are many examples of this, such as "lifestyle" instead of "life", "common ground" instead of "equality", "inner peace" instead of "tranquility or serenity", and "negatives and positives" instead of "virtues and vices".
📚 Documentation
Model Details
- Fine-tuned version of Facebook's NLLB-200 Distilled 600M model
Model Sources
- Repository: https://github.com/AbdulmohsenA/Faseeh
Bias, Risks, and Limitations
- The language pairs outside of the Quran were mostly translated with Google Translate, so translation quality depends on the quality of Google's translations from Classical Arabic to English.
- The evaluation metric used for this model is BERTScore / an E5-based embedding score. Its alignment with human judgment is imperfect, but it is the best available metric for semantic translation, so it remains the main evaluation metric until a better substitute appears (see the sketch after this list).
- The metrics generally used to evaluate translation quality into Arabic are trained on Modern Standard Arabic, which makes them misaligned with the goals of this model.
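As a rough illustration of the semantic-similarity evaluation mentioned above, the sketch below uses the `bertscore` metric from the Hugging Face `evaluate` library. The example sentences are placeholders, and the E5-based variant would swap in an E5 embedding model instead of BERT.

```python
# Sketch: semantic-similarity evaluation with BERTScore via the `evaluate` library.
# The sentences here are placeholders, not real model outputs.
import evaluate

bertscore = evaluate.load("bertscore")
predictions = ["<model translation in Arabic>"]
references = ["<reference Classical Arabic sentence>"]

results = bertscore.compute(predictions=predictions, references=references, lang="ar")
print(results["f1"])  # one F1 score per prediction/reference pair
```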
Improvements
- A much better approach to generating language pairs from Classical Arabic text is to use GPT-4o (at the time of writing, the only model capable of understanding complex Arabic sentences); a minimal sketch of this idea follows this list.
- Evaluation metrics designed specifically for the goal of this model are still needed. Currently, only a binary classifier has been built to judge whether a sentence is Classical Arabic; it produces a score from 0 to 1, but it is neither sufficient nor flexible, so more work on evaluation is needed.
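A minimal sketch of the GPT-4o idea above, using the OpenAI Python SDK. The prompt wording and the helper name `translate_classical_sentence` are illustrative assumptions, not part of the Faseeh pipeline.

```python
# Sketch: generating an English side for a Classical Arabic sentence with GPT-4o.
# Assumes the OpenAI Python SDK (>= 1.0) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def translate_classical_sentence(arabic_sentence: str) -> str:
    """Illustrative helper: ask GPT-4o for an English rendering of a Classical Arabic sentence."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "You are an expert translator of Classical Arabic."},
            {"role": "user", "content": f"Translate this Classical Arabic sentence into English:\n{arabic_sentence}"},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content.strip()
```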
Training Data
- Arabic text outside of Hugging Face datasets is scraped from the Shamela Library.
Metrics
- COMET: emphasizes preserving the same meaning rather than matching individual words (semantic rather than syntactic translation).
- Fluency Score: a custom-built metric that classifies whether a sentence is Classical Arabic or not (see the sketch below).
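The sketch below illustrates how these two metrics could be computed: COMET through the Hugging Face `evaluate` library (which wraps the `unbabel-comet` package), and the fluency score through a text-classification pipeline. The checkpoint name `classical-arabic-classifier` is a hypothetical placeholder, since the classifier is not named here.

```python
# Sketch: computing COMET and a Classical-Arabic fluency score.
# The classifier checkpoint name below is a hypothetical placeholder.
import evaluate
from transformers import pipeline

sources = ["And the minister stressed the importance of seeking security."]
predictions = ["<Faseeh translation in Arabic>"]
references = ["<reference Classical Arabic sentence>"]

# COMET compares source, translation, and reference at the semantic level.
comet = evaluate.load("comet")
comet_scores = comet.compute(sources=sources, predictions=predictions, references=references)
print(comet_scores["mean_score"])

# Fluency score: a binary classifier that rates how "Classical" a sentence sounds (0 to 1).
fluency_classifier = pipeline("text-classification", model="classical-arabic-classifier")  # hypothetical id
print(fluency_classifier(predictions[0]))
```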
📄 License
This project is licensed under the MIT license.