AraEuroBert-610M open-source model: Arabic semantic text embeddings with long-sequence support

Araeurobert 610M

Developed by Omartificial-Intelligence-Space

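An Arabic semantic text embedding model fine-tuned from EuroBERT-610m. It produces 1152-dimensional dense vectors and supports a maximum sequence length of 8192 tokens.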

Text Embedding

Safetensors

Arabic · Open Source License: MIT · #ArabicSemanticEmbedding #LongText(8k) #MatryoshkaEmbeddings

Downloads 160

Release Time : 3/19/2025

Model Overview

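A semantic text embedding model optimized for Arabic, suited to tasks such as semantic similarity scoring, semantic search, and text classification.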

Model Features

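Arabic optimization

Fine-tuned specifically on Arabic text, which substantially improves performance on Arabic semantic tasks

Long-text support

Supports sequences of up to 8192 tokens, making it suitable for long documents

Nested embeddings

Supports Matryoshka embeddings at 1152, 960, 768, and 512 dimensions, so the dimension can be chosen to fit the use case

High performance

Outperforms the standard EuroBERT models on the STS17 and STS22.v2 benchmarks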

Model Capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Text clustering

Use Cases

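Information retrieval

Arabic semantic search

Build an Arabic search engine that scores the semantic similarity between queries and documents

Improves the relevance of search results

Text analysis

Arabic text classification

Automatically classify Arabic news articles, reviews, and similar text

Reported to be more accurate than traditional methods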

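🚀 Ara-EuroBERT: Arabic Semantic Text Embedding Model

Ara-EuroBERT is a model optimized specifically for Arabic semantic text embeddings. It is fine-tuned from [EuroBERT/EuroBERT-610m](https://huggingface.co/EuroBERT/EuroBERT-610m) and maps sentences and paragraphs to a 1152-dimensional dense vector space, with a maximum sequence length of 8192 tokens. The model can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and related tasks.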

🚀 Quick Start

Direct usage (Sentence Transformers)

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model and run inference:

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/AraEuroBert-610M")

# Run inference
sentences = [
    'لاعبة كرة ناعمة ترمي الكرة إلى زميلتها في الفريق',
    'شخصان يلعبان كرة البيسبول',
    'لاعبين لكرة البيسبول يجلسان على مقعد',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1152)

# Compute pairwise similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# torch.Size([3, 3])
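
The `similarity` call above returns a pairwise score matrix, and the model description lists cosine similarity as the similarity function. As a rough sketch of what that computation amounts to, the snippet below builds the same kind of matrix in plain Python. The toy 3-dimensional vectors and the helper names `cosine` and `similarity_matrix` are illustrative assumptions, not part of the library.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors):
    """Pairwise cosine similarities as an n-by-n list of lists."""
    return [[cosine(u, v) for v in vectors] for u in vectors]

# Toy stand-ins for real 1152-dimensional embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.0, 0.0],
    [0.8, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
scores = similarity_matrix(toy_embeddings)
for row in scores:
    print([round(s, 2) for s in row])
```

The diagonal is always 1 because every vector is compared with itself; off-diagonal entries are the scores that matter when ranking sentence pairs.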

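✨ Key Features

Semantic text embeddings: maps Arabic sentences and paragraphs to a 1152-dimensional dense vector space for semantic textual similarity and related tasks.
Long-sequence handling: supports a maximum sequence length of 8192 tokens, so longer texts can be processed.
Nested embedding support: supports Matryoshka (nested) embeddings at 1152, 960, 768, and 512 dimensions, which can be chosen to balance quality against computational cost.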

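📚 Detailed Documentation

Model details and benchmark performance

![Benchmark results](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/Kv78q7NmI3hhOXkRv30s9.png)

The benchmark results above show that the AraEuroBERT models substantially outperform the standard EuroBERT models:

STS17 benchmark: AraEuroBERT-610M scores 83, far ahead of the standard EuroBERT-610M (14) and even the larger EuroBERT-2.1B (12).
STS22.v2 benchmark: AraEuroBERT-210M scores 61, ahead of the larger AraEuroBERT-610M (53) and all standard EuroBERT variants.

These results highlight the effectiveness of our specialized fine-tuning for Arabic text embeddings: even the smaller 210M-parameter model performs well on Arabic semantic tasks.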

Metrics

Semantic similarity

Datasets: sts-dev-1152, sts-dev-960, sts-dev-768, and sts-dev-512
Evaluation method: evaluated with EmbeddingSimilarityEvaluator

| Metric | sts-dev-1152 | sts-dev-960 | sts-dev-768 | sts-dev-512 |
|---|---|---|---|---|
| pearson_cosine | 0.8264 | 0.8259 | 0.8244 | 0.8238 |
| spearman_cosine | 0.8307 | 0.8302 | 0.8293 | 0.8293 |
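
The `pearson_cosine` and `spearman_cosine` rows are correlations between the model's cosine similarities and gold similarity labels. The sketch below shows how the two statistics differ, using plain Python and made-up numbers; it does not reproduce `EmbeddingSimilarityEvaluator` itself, and the sample scores and helper names are assumptions for illustration only.

```python
import math

def pearson(x, y):
    """Pearson correlation: linear agreement between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def ranks(values):
    """Average ranks (1-based), so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical data: gold similarity labels vs. cosine scores from a model.
gold_scores = [0.1, 0.4, 0.5, 0.9]
cosine_scores = [0.20, 0.35, 0.60, 0.70]
print(round(pearson(gold_scores, cosine_scores), 4))
print(round(spearman(gold_scores, cosine_scores), 4))
```

Spearman only looks at ordering, so it reaches 1 whenever the model ranks pairs in the same order as the labels, even if the raw scores are not linearly related.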

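Model description

| Attribute | Details |
|---|---|
| Model type | Sentence Transformer |
| Base model | [EuroBERT/EuroBERT-610m](https://huggingface.co/EuroBERT/EuroBERT-610m) |
| Maximum sequence length | 8192 tokens |
| Output dimensions | 1152 (nested dimensions supported: 1152, 960, 768, 512) |
| Similarity function | Cosine similarity |
| Training data | 2.28 million training samples consisting of Arabic text triplets |
| Language | Arabic |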

Full model architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: EuroBertModel 
  (1): Pooling({'word_embedding_dimension': 1152, 'pooling_mode_cls_token': False, 'pooling_mode_mean_tokens': True, 'pooling_mode_max_tokens': False, 'pooling_mode_mean_sqrt_len_tokens': False, 'pooling_mode_weightedmean_tokens': False, 'pooling_mode_lasttoken': False, 'include_prompt': True})
)
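
The architecture listing shows `pooling_mode_mean_tokens: True`, meaning the sentence vector is the average of the token vectors. A minimal sketch of that idea follows, assuming an attention mask that marks padding positions with 0; the function name `mean_pool` and the toy 2-dimensional token vectors are hypothetical.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_embeddings[0])
    totals = [0.0] * dim
    count = 0
    for vector, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for d in range(dim):
                totals[d] += vector[d]
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [t / count for t in totals]

# Hypothetical example: three real tokens followed by one padding token.
tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
sentence_vector = mean_pool(tokens, mask)
print(sentence_vector)  # [3.0, 4.0]
```

Excluding padded positions keeps short sentences in a batch from being pulled toward the padding vector.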

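Nested embeddings

The model supports Matryoshka (nested) embeddings at the following dimensions:

Full dimension: 1152
Reduced dimensions: 960, 768, 512

You can choose the embedding dimension that fits your needs, trading quality against computational cost.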
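
One common way to use a smaller Matryoshka dimension is to keep the leading components of the full vector and rescale the result to unit length. The sketch below shows that idea on a synthetic vector; the helper `truncate_embedding` and the stand-in values are assumptions for illustration. Sentence Transformers also accepts a `truncate_dim` argument when constructing a `SentenceTransformer`, which performs the truncation at encode time.

```python
import math

def truncate_embedding(vector, dim):
    """Keep the first `dim` components and rescale to unit length."""
    if dim > len(vector):
        raise ValueError("dim exceeds the embedding size")
    head = vector[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

# Hypothetical stand-in for a 1152-dimensional embedding.
full = [math.sin(i + 1) for i in range(1152)]

for dim in (1152, 960, 768, 512):
    small = truncate_embedding(full, dim)
    length = math.sqrt(sum(x * x for x in small))
    print(dim, len(small), round(length, 6))
```

Rescaling after truncation keeps dot products and cosine similarities interchangeable at every dimension.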

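🔧 Technical Details

Model fine-tuning

The model is fine-tuned from [EuroBERT/EuroBERT-610m](https://huggingface.co/EuroBERT/EuroBERT-610m) and optimized specifically for Arabic semantic text embeddings. Training on a large set of Arabic text triplets helps the model capture Arabic semantics more accurately.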
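
The text above says training used triplets but does not state the loss function. Purely as an illustration of how a triplet objective can be combined with nested dimensions, the sketch below sums a generic margin-based triplet loss over several prefix lengths. The cosine distance, the margin of 0.5, the equal weighting, and the toy 4-dimensional vectors are all assumptions, not details of how this model was trained.

```python
import math

def cosine_distance(u, v):
    """1 minus cosine similarity; 0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Zero once the positive is closer than the negative by at least `margin`."""
    gap = cosine_distance(anchor, positive) - cosine_distance(anchor, negative)
    return max(0.0, gap + margin)

def nested_triplet_loss(anchor, positive, negative, dims, margin=0.5):
    """Sum the triplet loss over each prefix length in `dims`."""
    return sum(
        triplet_loss(anchor[:d], positive[:d], negative[:d], margin)
        for d in dims
    )

# Hypothetical triplet: the positive is near the anchor, the negative is not.
anchor = [1.0, 0.0, 0.5, 0.5]
positive = [0.9, 0.1, 0.4, 0.6]
negative = [0.0, 1.0, 0.5, 0.5]

well_separated = nested_triplet_loss(anchor, positive, negative, dims=(4, 2))
swapped = nested_triplet_loss(anchor, negative, positive, dims=(4, 2))
print(round(well_separated, 4), round(swapped, 4))
```

Applying the loss to each prefix is what encourages the leading dimensions to be useful on their own, which is the property that nested embeddings rely on.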

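Vector space mapping

The model maps sentences and paragraphs to a 1152-dimensional dense vector space in which semantically similar texts lie closer together. It also supports sequences of up to 8192 tokens, so it can process longer texts.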
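
Because similar texts lie close together, semantic search reduces to ranking document vectors by their similarity to a query vector. The following is a minimal sketch of that ranking step with hypothetical 3-dimensional vectors; in practice the vectors would come from `model.encode`, and the function names here are illustrative.

```python
import math

def normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def search(query_vec, doc_vecs, top_k=2):
    """Return (doc_index, score) pairs, best match first."""
    q = normalize(query_vec)
    scored = []
    for index, doc in enumerate(doc_vecs):
        d = normalize(doc)
        scored.append((index, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

# Hypothetical vectors standing in for model embeddings.
documents = [
    [0.9, 0.1, 0.0],  # doc 0: nearly parallel to the query
    [0.0, 1.0, 0.0],  # doc 1: orthogonal to the query
    [0.7, 0.7, 0.0],  # doc 2: partially aligned
]
query = [1.0, 0.0, 0.0]
results = search(query, documents)
for index, score in results:
    print(index, round(score, 3))
```

For large collections the document vectors would normally be normalized once and stored, so that only the query is processed at search time.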

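Nested embedding mechanism

The model supports Matryoshka (nested) embeddings and offers several embedding dimensions. Users can pick the dimension that suits their application, balancing quality against computational cost.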
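
To make the cost side of that trade-off concrete, here is a rough storage estimate for each supported dimension. It assumes embeddings are stored as float32 (4 bytes per value) and a hypothetical corpus of one million vectors; index overhead is ignored.

```python
BYTES_PER_FLOAT32 = 4      # assumption: embeddings stored as float32
NUM_VECTORS = 1_000_000    # hypothetical corpus size

def index_size_mib(dim, num_vectors=NUM_VECTORS):
    """Raw storage for `num_vectors` embeddings of width `dim`, in MiB."""
    return dim * BYTES_PER_FLOAT32 * num_vectors / (1024 ** 2)

sizes = {dim: index_size_mib(dim) for dim in (1152, 960, 768, 512)}
for dim, mib in sizes.items():
    saving = 1 - dim / 1152
    print(f"{dim:>4} dims: {mib:8.1f} MiB ({saving:.0%} smaller than full size)")
```

Under these assumptions the 512-dimension setting needs less than half the raw storage of the full 1152 dimensions, while the metrics table shows its Spearman correlation dropping by only about 0.0014 (0.8307 to 0.8293).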

📄 License

This model is released under the MIT license.

📖 Citation

If you use this model in your research, please cite the following works:

@misc{boizard2025eurobertscalingmultilingualencoders,
      title={EuroBERT: Scaling Multilingual Encoders for European Languages}, 
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
      year={2025},
      eprint={2503.05500},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.05500}, 
}

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}