🚀 🇸🇦 Arabic-English BGE-M3
This model is designed for sentence similarity tasks. It embeds Arabic and English sentences so that their similarity can be compared, which is particularly useful for passage retrieval.
✨ Features
- It is a 36.2% smaller version of BAAI/bge-m3 specifically tailored for the Arabic language.
- The ONNX quantized version is approximately 75% smaller (363 MB) than the pruned model, while retaining about 98% of the original model's quality.
- For Arabic-language tasks, the pruned model performs similarly to the original model with a significantly smaller memory footprint. It may perform poorly on the other languages supported by the original multilingual model, because tokens not commonly used in Arabic were removed from the vocabulary.
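The percentages above can be turned into rough absolute sizes. The sketch below is only back-of-the-envelope arithmetic on the figures quoted in this list. It assumes that "X% smaller" refers to file size and that 363 MB is the size of the ONNX quantized model, so the results are derived estimates, not measurements. The helper name `size_before_reduction` is made up for this illustration.

```python
# Figures quoted in the feature list above.
ONNX_QUANTIZED_MB = 363    # stated size of the ONNX quantized model
ONNX_REDUCTION = 0.75      # "approximately 75% smaller" than the pruned model
PRUNED_REDUCTION = 0.362   # pruned model is 36.2% smaller than BAAI/bge-m3


def size_before_reduction(size_after_mb: float, reduction: float) -> float:
    """Invert 'X% smaller': size_after = size_before * (1 - reduction)."""
    return size_after_mb / (1.0 - reduction)


# Work backwards from the quantized model to the pruned and original models.
pruned_mb = size_before_reduction(ONNX_QUANTIZED_MB, ONNX_REDUCTION)
original_mb = size_before_reduction(pruned_mb, PRUNED_REDUCTION)

print(f"implied pruned model size:   ~{pruned_mb:.0f} MB")
print(f"implied original model size: ~{original_mb:.0f} MB")
```

Under these assumptions the pruned model comes out at roughly 1,450 MB and the original at roughly 2,280 MB. Because the 75% figure is itself approximate, treat both numbers as order-of-magnitude checks.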
📦 Installation
The model can be loaded with several libraries. Loading and usage snippets for each one follow:
Transformers Library
from transformers import AutoModel, AutoTokenizer
model_name = "sayed0am/arabic-english-bge-m3"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, use_fast=True)
Sentence-Transformers Library
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sayed0am/arabic-english-bge-m3")
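Since the model is aimed at passage retrieval, a minimal ranking sketch may help. It is not part of the model's API: `rank_passages` is a hypothetical helper, and it takes the encoder as a plain callable so that it stays independent of any particular library. It assumes the callable returns one L2-normalized vector per input text, in which case the dot product equals cosine similarity.

```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]


def rank_passages(
    query: str,
    passages: List[str],
    encode: Callable[[List[str]], Sequence[Vector]],
) -> List[Tuple[str, float]]:
    """Return (passage, score) pairs sorted from most to least similar.

    `encode` is assumed to return one L2-normalized vector per input text,
    so the dot product computed below equals cosine similarity.
    """
    # Encode the query and all passages in a single batch.
    vectors = encode([query] + passages)
    query_vec, passage_vecs = vectors[0], vectors[1:]

    # Score each passage by its dot product with the query vector.
    scored = [
        (passage, sum(float(q) * float(p) for q, p in zip(query_vec, vec)))
        for passage, vec in zip(passages, passage_vecs)
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

With the model loaded as shown above, passing `lambda texts: model.encode(texts, normalize_embeddings=True)` as `encode` should satisfy the normalization assumption, since `normalize_embeddings` is a parameter of `SentenceTransformer.encode`.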
Using ONNX
import torch
from huggingface_hub import snapshot_download
from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

# Download the repository (including the onnx/ subfolder) to a local directory.
snapshot_download(repo_id="sayed0am/arabic-english-bge-m3", local_dir="arabic-english-bge-m3")

# Load the ONNX model on the GPU, plus the matching tokenizer.
model = ORTModelForFeatureExtraction.from_pretrained("arabic-english-bge-m3", subfolder="onnx", provider="CUDAExecutionProvider")
tokenizer = AutoTokenizer.from_pretrained("arabic-english-bge-m3")

sentences = [
    "English: The quick brown fox jumps over the lazy dog.",
    "Arabic: الثعلب البني السريع يقفز فوق الكلب الكسول.",
]

# Tokenize, run the model, then L2-normalize the first token's hidden state.
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt").to("cuda")
out = model(**encoded_input, return_dict=True).last_hidden_state
dense_vecs = torch.nn.functional.normalize(out[:, 0], dim=-1)
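The snippet above stops at the normalized `dense_vecs`. A natural next step is to compare the two sentence vectors. The following is a small pure-Python sketch of that comparison. The helper names are hypothetical and the vectors are made-up three-dimensional stand-ins, not real model output. It illustrates why normalization matters: once both vectors have unit length, their dot product is their cosine similarity.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity, computed as the dot product of the unit vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


# Toy stand-ins for two rows of `dense_vecs` (made-up numbers).
english_vec = [3.0, 4.0, 0.0]
arabic_vec = [4.0, 3.0, 0.0]

score = cosine_similarity(english_vec, arabic_vec)
print(round(score, 2))  # 0.96
```

With the real tensors from the snippet above, which are already normalized, the equivalent comparison would be `dense_vecs[0] @ dense_vecs[1]`.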
📄 License
This model is released under the MIT license.

| Property | Details |
|---|---|
| Pipeline Tag | Sentence Similarity |
| Languages | Arabic, English |
| License | MIT |
| Tags | Passage Retrieval, Sentence Similarity, Pruned |
| Library Name | Sentence-Transformers |
| Base Model | BAAI/bge-m3 |
| Base Model Relation | Quantized |