# HPLT MT release v1.0
This repository contains an Arabic-to-English translation model trained on OPUS and HPLT data. It is available in both Marian and Hugging Face formats.
## Features
- Trained on OPUS and HPLT data for Arabic-English translation.
- Available in both Marian and Hugging Face formats, ensuring compatibility with different frameworks.
## Installation
The model was trained with MarianNMT and the weights are in the Marian format. It has also been converted to the Hugging Face format for use with the `transformers` library.
### Using Marian
To run inference with MarianNMT, refer to the Inference/Decoding/Translation section of our GitHub repository. You will need the model file `model.npz.best-chrf.npz` and the vocabulary file `model.ar-en.spm` from this repository.
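As a rough sketch, a typical `marian-decoder` invocation with these files might look like the following. The exact flags depend on your Marian build and decoding preferences, so treat this as an illustration rather than the project's official command:

```shell
# Hypothetical invocation; the shared SentencePiece vocabulary is passed
# once for the source side and once for the target side.
echo "مثال على جملة عربية." | marian-decoder \
  --models model.npz.best-chrf.npz \
  --vocabs model.ar-en.spm model.ar-en.spm \
  --beam-size 6
```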
### Using `transformers`
We have also converted this model to the Hugging Face format. Note that, due to a known issue in weight conversion, the checkpoint does not work with `transformers` versions <4.26 or >4.30. We tested with, and suggest, `pip install transformers==4.28`.
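A minimal environment setup might look like the following. The `sentencepiece` and `torch` packages are assumptions on our part: Marian-derived tokenizers in `transformers` typically require SentencePiece, and the usage example below uses PyTorch tensors:

```shell
# Pin transformers to the tested version; sentencepiece and torch are
# assumed dependencies for the tokenizer and model, respectively.
pip install transformers==4.28 sentencepiece torch
```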
## Usage Examples
### Basic Usage
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("HPLT/translate-ar-en-v1.0-hplt_opus")
model = AutoModelForSeq2SeqLM.from_pretrained("HPLT/translate-ar-en-v1.0-hplt_opus")

# Batch of source-language (Arabic) sentences.
inputs = ["Input goes here.", "Make sure the language is right."]
batch_tokenized = tokenizer(inputs, return_tensors="pt", padding=True)
model_output = model.generate(
    **batch_tokenized, num_beams=6, max_new_tokens=512
)
batch_detokenized = tokenizer.batch_decode(
    model_output,
    skip_special_tokens=True,
)
print(batch_detokenized)
```
## Documentation
### Model Info
| Property | Details |
|---|---|
| Source Language | Arabic |
| Target Language | English |
| Dataset | OPUS and HPLT data |
| Model Architecture | Transformer-base |
| Tokenizer | SentencePiece (Unigram) |
| Cleaning | We used OpusCleaner with a set of basic rules. Details can be found in the filter files here. |
You can check out our deliverable report, GitHub repository, and website for more details.
### Benchmarks
When decoded using Marian, the model has the following test scores.

| Test set | BLEU | chrF++ | COMET22 |
|---|---|---|---|
| FLORES200 | 40.1 | 63.1 | 0.8645 |
| NTREX | 34.7 | 58.9 | 0.8426 |
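For reference, BLEU and chrF++ scores of this kind are commonly computed with `sacrebleu`. The sketch below assumes `hyp.en` holds detokenized model output and `ref.en` the corresponding reference translations; the exact evaluation setup used for the table above is described in the deliverable report:

```shell
# chrF++ is sacrebleu's chrF metric with word order 2.
sacrebleu ref.en -i hyp.en -m bleu chrf --chrf-word-order 2
```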
## License
This project is licensed under CC-BY-4.0.
## Acknowledgements
This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10052546].
Brought to you by researchers from the University of Edinburgh and Charles University in Prague with support from the whole HPLT consortium.