Arabic-English-BGE-M3 Open-Source Model - Achieve Efficient Arabic-English Bilingual Processing with Low Memory Usage

Arabic-English BGE-M3

Developed by sayed0am

This is a pruned version of the BAAI/bge-m3 model optimized for Arabic, retaining approximately 98% of the original model's quality while using less memory.

Text Embedding

Safetensors

Supports Multiple Languages

Open Source License: MIT

#Arabic-English Bilingual Embedding #Lightweight Pruning #Passage Retrieval Optimization

Downloads 257

Release Time: 2/19/2025

Model Overview

This model is optimized for Arabic sentence similarity calculations, supporting passage retrieval and sentence similarity computation in both Arabic and English.

Model Features

Efficient Pruning

The pruned model is 36.2% smaller than BAAI/bge-m3, and the ONNX quantized version is approximately 75% smaller again (363 MB), while retaining about 98% of the original model's quality.

Bilingual Support

Specially optimized for Arabic and English.

ONNX Quantization Support

Provides an ONNX quantized version to further reduce model size.

Model Capabilities

Compute sentence similarity

Passage retrieval

Cross-lingual text matching

Use Cases

Information Retrieval

Arabic Document Retrieval

Search for relevant documents in Arabic document collections

Efficiently and accurately retrieves relevant Arabic content

Multilingual Applications

Arabic-English Bilingual Matching

Match similar content in Arabic and English

Enables cross-lingual association between Arabic and English content

🚀 🇸🇦 Arabic-English BGE-M3

This model is designed for sentence similarity tasks. It offers a solution for comparing the similarity between Arabic and English sentences, which is highly valuable in passage retrieval scenarios.

✨ Features

It is a 36.2% smaller version of BAAI/bge-m3 specifically tailored for the Arabic language.
The ONNX quantized version is approximately 75% smaller (363 MB) than the pruned model, while retaining about 98% of the original model's quality.
This pruned model performs similarly to the original model on Arabic language tasks with a significantly smaller memory footprint. It may perform poorly on the other languages covered by the original multilingual model, because tokens not commonly used in Arabic were removed from the vocabulary.
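The card does not describe how the vocabulary was pruned. As a rough illustration of the general idea only, and not the author's actual procedure, the sketch below keeps the tokens that occur often enough in a sample corpus, remaps their ids, and keeps only the matching embedding rows. The function name `prune_vocabulary`, the `min_count` threshold, and the toy data are all hypothetical.

```python
from collections import Counter


def prune_vocabulary(token_id_corpus, embedding_rows, always_keep=(), min_count=1):
    """Illustrative vocabulary pruning (hypothetical helper, not the author's code).

    token_id_corpus: iterable of token-id lists from tokenizing a sample corpus.
    embedding_rows:  list of embedding vectors, indexed by original token id.
    always_keep:     ids (e.g. special tokens) kept regardless of frequency.
    Returns (old_to_new_id, pruned_rows).
    """
    counts = Counter(tid for ids in token_id_corpus for tid in ids)
    keep = {tid for tid, n in counts.items() if n >= min_count} | set(always_keep)
    # Keep the original relative order so the id mapping is deterministic.
    kept_sorted = sorted(tid for tid in keep if 0 <= tid < len(embedding_rows))
    old_to_new = {old: new for new, old in enumerate(kept_sorted)}
    pruned_rows = [embedding_rows[old] for old in kept_sorted]
    return old_to_new, pruned_rows


# Toy example: a 6-token vocabulary with 2-dimensional embeddings.
rows = [[float(i), float(i) + 0.5] for i in range(6)]
corpus = [[0, 3, 3, 5], [3, 5, 2]]
mapping, pruned = prune_vocabulary(corpus, rows, always_keep=[0, 1], min_count=2)
```

In a scheme like this, text that relies on dropped tokens would be split into smaller pieces or mapped to the unknown token, which is consistent with the card's warning about other languages.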

📦 Installation

This model can be used with several libraries. Loading and usage steps for each are shown below.

Transformers Library

from transformers import AutoModel, AutoTokenizer

model_name = "sayed0am/arabic-english-bge-m3"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=True)
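The loading snippet above stops before producing embeddings. The following is a minimal sketch of one way to continue, assuming the same first-token pooling and L2 normalization that the ONNX example in this card uses. The helper name `embed` is hypothetical, and the heavy imports sit inside the function so that defining it does not load the model.

```python
def embed(sentences, model_name="sayed0am/arabic-english-bge-m3"):
    """Return L2-normalized dense embeddings as a torch tensor (hypothetical helper)."""
    import torch
    from transformers import AutoModel, AutoTokenizer

    # Loading arguments mirror the card's own snippet.
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=True)
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
    model.eval()

    encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**encoded).last_hidden_state  # shape: (batch, seq_len, dim)

    # Assumption: the first token's hidden state serves as the sentence vector.
    return torch.nn.functional.normalize(hidden[:, 0], dim=-1)
```

Because the returned vectors are unit length, `float(vecs[0] @ vecs[1])` on the result of `embed([...])` gives the cosine similarity of the first two sentences.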

Sentence-Transformers Library

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sayed0am/arabic-english-bge-m3")
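Once the model is loaded, passage retrieval comes down to encoding a query and the candidate passages, then sorting by cosine similarity. The sketch below shows one way to do that. `rank_passages` and `retrieve` are hypothetical helper names, and `normalize_embeddings=True` is used so that a plain dot product equals cosine similarity.

```python
def rank_passages(query_vec, passage_vecs, top_k=3):
    """Rank passages by dot product with the query (assumes unit-length vectors).

    Returns a list of (passage_index, score) pairs, best first.
    """
    scores = [
        (i, sum(q * p for q, p in zip(query_vec, vec)))
        for i, vec in enumerate(passage_vecs)
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


def retrieve(query, passages, top_k=3, model_name="sayed0am/arabic-english-bge-m3"):
    """Encode with Sentence-Transformers and return the top passages (hypothetical helper)."""
    from sentence_transformers import SentenceTransformer  # imported lazily

    model = SentenceTransformer(model_name)
    # encode() returns one row per input; row 0 is the query.
    vecs = model.encode([query] + list(passages), normalize_embeddings=True)
    ranked = rank_passages(vecs[0].tolist(), [v.tolist() for v in vecs[1:]], top_k)
    return [(passages[i], score) for i, score in ranked]
```

The query and passages may be in either language, for example an Arabic query against English passages. For larger collections, encoding the passages once and reusing the vectors would avoid repeated work.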

Using ONNX

# pip install huggingface-hub

from huggingface_hub import snapshot_download

# Download the repository, including the onnx/ subfolder, to a local directory.
snapshot_download(repo_id="sayed0am/arabic-english-bge-m3", local_dir="arabic-english-bge-m3")

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer
import torch

# Make sure the model weights have been downloaded locally to `arabic-english-bge-m3` (the local_dir used above)
model = ORTModelForFeatureExtraction.from_pretrained("arabic-english-bge-m3", subfolder="onnx", provider="CUDAExecutionProvider") # omit provider for CPU usage.
tokenizer = AutoTokenizer.from_pretrained("arabic-english-bge-m3")
sentences = [
    "English: The quick brown fox jumps over the lazy dog.",
    "Arabic: الثعلب البني السريع يقفز فوق الكلب الكسول."
]
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt').to("cuda") # For CPU remove .to("cuda")

# Get the token-level hidden states
out = model(**encoded_input, return_dict=True).last_hidden_state

# Take the first token's hidden state as the sentence embedding and L2-normalize it
dense_vecs = torch.nn.functional.normalize(out[:, 0], dim=-1)
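The example above hard-codes the CUDA provider and leaves the two vectors unused. The following is a minimal sketch of two small additions, assuming `onnxruntime` is installed. `pick_provider` falls back to CPU when CUDA is not available, and `cosine` scores two embedding vectors given as plain lists of floats, for example `dense_vecs[0].tolist()`. All three helper names are hypothetical.

```python
import math


def pick_provider(available):
    """Prefer CUDA when ONNX Runtime reports it, otherwise use CPU."""
    if "CUDAExecutionProvider" in available:
        return "CUDAExecutionProvider"
    return "CPUExecutionProvider"


def cosine(a, b):
    """Cosine similarity of two equal-length vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def available_providers():
    """Ask ONNX Runtime which execution providers this install supports."""
    import onnxruntime  # imported lazily so the helpers above have no dependencies

    return onnxruntime.get_available_providers()
```

The value of `pick_provider(available_providers())` can be passed as `provider=` to `ORTModelForFeatureExtraction.from_pretrained`, and the `.to("cuda")` call on the tokenizer output applies only when the CUDA provider is chosen. Calling `cosine(dense_vecs[0].tolist(), dense_vecs[1].tolist())` then gives the similarity between the English and Arabic sentences.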

📄 License

This model is released under the MIT license.

Property	Details
Pipeline Tag	Sentence Similarity
Languages	Arabic, English
License	MIT
Tags	Passage Retrieval, Sentence Similarity, Pruned
Library Name	Sentence-Transformers
Base Model	BAAI/bge-m3
Base Model Relation	Quantized
