AraEuroBert-2.1B open-source model: an Arabic semantic embedding tool with long-text input support

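AraEuroBert 2.1B

Developed by Omartificial-Intelligence-Space

An Arabic semantic embedding model fine-tuned from EuroBERT-2.1B. It produces 2304-dimensional dense vectors and accepts inputs of up to 8192 tokens.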

Text Embedding

Safetensors

Arabic · License: MIT · #Arabic semantic embedding #2304-dimensional vectors #8192-token long-text support

Downloads: 45

Release date: 3/20/2025

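Model Overview

A semantic text embedding model optimized for Arabic, suitable for NLP tasks such as semantic similarity scoring, semantic search, and text classification.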

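Model Features

High-dimensional semantic embeddings

Produces 2304-dimensional dense vectors that capture rich semantic information

Long-text support

Accepts inputs of up to 8192 tokens, making it suitable for long texts

Nested dimension selection

Offers four embedding dimensions (2304, 1152, 960, 580) to balance performance against computational efficiency

Arabic optimization

Optimized specifically for Arabic, scoring 79 on the STS17 benchmark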

Model Capabilities

Semantic textual similarity

Semantic search

Paraphrase mining

Text classification

Clustering

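Use Cases

Information retrieval

Arabic document similarity search

Find semantically similar documents in a collection of Arabic documents

High-accuracy semantic matching

Content analysis

Arabic text clustering

Cluster Arabic news or social media content by topic

Effective topic identification and grouping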

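🚀 Ara-EuroBERT: A Large-Scale Arabic Semantic Text Embedding Model

Ara-EuroBERT-2.1B is a sentence-transformers model fine-tuned from [EuroBERT/EuroBERT-2.1B](https://huggingface.co/EuroBERT/EuroBERT-2.1B) and optimized specifically for Arabic semantic embeddings.

The model maps sentences and paragraphs to a 2304-dimensional dense vector space and can process up to 8192 tokens in a single input sequence.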

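Model Tags and Information

| Attribute | Details |
| --- | --- |
| Model type | Sentence Transformer |
| Base model | [EuroBERT/EuroBERT-2.1B](https://huggingface.co/EuroBERT/EuroBERT-2.1B) |
| Training data | Not specified |
| Loss functions | MatryoshkaLoss, MultipleNegativesRankingLoss |
| Supported language | Arabic |
| Evaluation metrics | Pearson Cosine, Spearman Cosine |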
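The two loss functions named above can be pictured with a small pure-Python sketch. It assumes the usual formulation: a ranking loss that treats every other positive in the batch as a negative and applies cross-entropy to scaled cosine similarities, plus a nested wrapper that sums that loss over truncated prefixes of each vector. The helper names (`ranking_loss`, `nested_loss`), the scale of 20, and the toy vectors are illustrative assumptions; the actual training configuration of this model is not documented here.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def ranking_loss(anchors, positives, scale=20.0):
    """In-batch negatives: positives[i] is the match for anchors[i];
    every other positives[j] acts as a negative. Returns mean cross-entropy."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        peak = max(logits)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_sum_exp - logits[i]
    return total / len(anchors)

def nested_loss(anchors, positives, dims):
    """Sum the ranking loss over each truncated prefix of the vectors."""
    return sum(
        ranking_loss([a[:d] for a in anchors], [p[:d] for p in positives])
        for d in dims
    )

# Toy 4-dimensional "embeddings" (hypothetical values, not model output).
anchors = [[1.0, 0.0, 0.2, 0.1], [0.0, 1.0, 0.1, 0.3]]
positives = [[0.9, 0.1, 0.3, 0.0], [0.1, 0.8, 0.0, 0.4]]

aligned = nested_loss(anchors, positives, dims=(4, 2))
shuffled = nested_loss(anchors, positives[::-1], dims=(4, 2))
print(aligned < shuffled)
```

Correctly paired batches give a lower loss than mismatched ones at every prefix length, which is what pushes the leading components of the vector to stay useful on their own.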

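Model Characteristics

Multi-dimensional embedding support: the model supports Matryoshka (nested) embeddings at the following dimensions:
- Full dimension: 2304
- Reduced dimensions: 1152, 960, 580

You can choose the embedding dimension that fits your needs, trading off performance against computational efficiency.
Broad applicability: suitable for Arabic NLP tasks including semantic textual similarity, semantic search, paraphrase mining, text classification, and clustering.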
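Nested (Matryoshka) embeddings are typically used by keeping the first k components of the full vector and re-normalizing. The sketch below shows that step on random stand-in vectors rather than real model output; `truncate_embedding` and `cosine` are hypothetical helpers, not library functions. With the sentence-transformers library, passing `truncate_dim` when constructing `SentenceTransformer` should give the same effect, assuming a version that supports that argument.

```python
import math
import random

FULL_DIM = 2304
NESTED_DIMS = (2304, 1152, 960, 580)  # dimensions listed for this model

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and L2-normalize the result."""
    if dim not in NESTED_DIMS:
        raise ValueError(f"unsupported dimension: {dim}")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm > 0 else head

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Random stand-ins for two full-size embeddings (not real model output).
rng = random.Random(0)
emb_a = [rng.gauss(0, 1) for _ in range(FULL_DIM)]
emb_b = [rng.gauss(0, 1) for _ in range(FULL_DIM)]

for dim in NESTED_DIMS:
    a = truncate_embedding(emb_a, dim)
    b = truncate_embedding(emb_b, dim)
    print(dim, len(a), round(cosine(a, b), 4))
```

Smaller dimensions reduce index size and similarity cost roughly in proportion to the vector length; the evaluation scores reported for each dimension indicate how much quality is given up.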

Benchmark Performance

![Model benchmark performance](https://cdn-uploads.huggingface.co/production/uploads/628f7a71dd993507cfcbe587/qcT6MrSY1RY_RX9lSJuQl.png)

Benchmark Highlights

STS17 benchmark: AraEuroBERT-2.1B scores 79, far above the standard EuroBERT-2.1B (12).
STS22.v2 benchmark: a score of 55, competitive with smaller, more efficient models.

Semantic Similarity Metrics

Datasets: sts-dev-2304, sts-dev-1152, sts-dev-960, sts-dev-580
Evaluation method: EmbeddingSimilarityEvaluator

| Metric | sts-dev-2304 | sts-dev-1152 | sts-dev-960 | sts-dev-580 |
| --- | --- | --- | --- | --- |
| Pearson (cosine) | 0.7268 | 0.7267 | 0.7263 | 0.7246 |
| Spearman (cosine) | 0.7298 | 0.7299 | 0.7297 | 0.7286 |
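Both metrics compare the cosine similarity the model assigns to each sentence pair with a human-annotated gold score: Pearson measures linear correlation, and Spearman is Pearson computed on ranks. A minimal pure-Python sketch of the two statistics follows; the numbers are toy values invented for illustration, not taken from the evaluation above, and the function names are hypothetical.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(xs), average_ranks(ys))

# Toy data: predicted cosine similarities and gold scores for five pairs.
predicted = [0.91, 0.35, 0.62, 0.10, 0.78]
gold = [4.8, 2.0, 3.5, 0.5, 3.0]

print(round(pearson(predicted, gold), 4))
print(round(spearman(predicted, gold), 4))
```

Because Spearman depends only on ordering, it is unaffected by the gold scores and the cosine values living on different scales.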

Full Model Architecture

SentenceTransformer(
  (0): Transformer({'max_seq_length': 8192, 'do_lower_case': False}) with Transformer model: EuroBertModel 
  (1): Pooling({
        'word_embedding_dimension': 2304,
        'pooling_mode_cls_token': False,
        'pooling_mode_mean_tokens': True,
        'pooling_mode_max_tokens': False,
        'include_prompt': True
  })
)
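The Pooling block above has `pooling_mode_mean_tokens` set to True, so the sentence vector is the average of the token vectors. The sketch below illustrates that averaging with an attention mask on toy 3-dimensional vectors (the real token vectors have 2304 dimensions); `mean_pool` is a hypothetical helper, and the library itself operates on batched tensors rather than Python lists.

```python
def mean_pool(token_embeddings, attention_mask):
    """Average the token vectors whose mask is 1; padding is ignored."""
    dim = len(token_embeddings[0])
    total = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            count += 1
            for i, x in enumerate(vec):
                total[i] += x
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [t / count for t in total]

# Two real tokens followed by one padding position.
tokens = [[1.0, 2.0, 3.0], [3.0, 4.0, 5.0], [9.0, 9.0, 9.0]]
mask = [1, 1, 0]
print(mean_pool(tokens, mask))  # [2.0, 3.0, 4.0]
```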

Usage Examples

Basic Usage

First, install the Sentence Transformers library:

pip install -U sentence-transformers

Then load the model and run inference:

from sentence_transformers import SentenceTransformer

# Download the model from the 🤗 Hub
model = SentenceTransformer("Omartificial-Intelligence-Space/AraEuroBert-2.1B")

# Run inference
sentences = [
    'لاعبة كرة ناعمة ترمي الكرة إلى زميلتها في الفريق',
    'شخصان يلعبان كرة البيسبول',
    'لاعبين لكرة البيسبول يجلسان على مقعد',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 2304]

# Compute similarity scores between the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
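The pairwise scores above can drive a simple semantic search. Assuming the library's default cosine similarity, the sketch below ranks documents against a query using plain Python lists; the 3-dimensional vectors are toy stand-ins for `model.encode` output, and `top_k` is a hypothetical helper rather than a library function.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def top_k(query_vec, doc_vecs, k=2):
    """Return (document index, score) for the k most similar documents."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(i, score) for score, i in scored[:k]]

# Toy vectors standing in for one query embedding and three documents.
query = [1.0, 0.0, 1.0]
docs = [[1.0, 0.1, 0.9], [0.0, 1.0, 0.0], [0.5, 0.5, 0.5]]

for idx, score in top_k(query, docs):
    print(idx, round(score, 4))
```

For large collections, an approximate nearest-neighbor index would normally replace this exhaustive loop.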

Citation

If you use this model in your research, please cite the following works:

@misc{boizard2025eurobertscalingmultilingualencoders,
      title={EuroBERT: Scaling Multilingual Encoders for European Languages}, 
      author={Nicolas Boizard and Hippolyte Gisserot-Boukhlef and Duarte M. Alves and André Martins and Ayoub Hammal and Caio Corro and Céline Hudelot and Emmanuel Malherbe and Etienne Malaboeuf and Fanny Jourdan and Gabriel Hautreux and João Alves and Kevin El-Haddad and Manuel Faysse and Maxime Peyrard and Nuno M. Guerreiro and Patrick Fernandes and Ricardo Rei and Pierre Colombo},
      year={2025},
      eprint={2503.05500},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2503.05500}, 
}

@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}

@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}