gte-modernbert-base: an open-source text embedding model with long-text support and strong benchmark results

Gte Modernbert Base

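Developed by Alibaba-NLP

A text embedding model built on the ModernBERT pre-trained encoder. It accepts long inputs of up to 8192 tokens and performs well on the MTEB, LoCo, and CoIR benchmarks.

Text Embedding

Transformers

English

License: Apache-2.0

#long-text-embedding #efficient-retrieval #multi-task-optimization

Downloads: 74.52k

Released: 1/20/2025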

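Model Overview

Developed by Tongyi Lab at Alibaba Group, this text embedding model targets English text and is suited to tasks such as information retrieval and semantic similarity computation.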

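Model Features

Long-text processing

Accepts inputs of up to 8192 tokens, which makes it suitable for long documents.

Efficiency

Supports Flash Attention 2 acceleration for efficient inference on GPUs.

Broad applicability

Performs well across several benchmarks, including MTEB, LoCo, and CoIR.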

Model Capabilities

Text embedding

Semantic similarity computation

Information retrieval

Long-document processing

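Use Cases

Information retrieval

Document retrieval

Quickly retrieve relevant content from large document collections.

Reaches an average NDCG@10 of 88.88 on the LoCo benchmark.

Semantic similarity

Question-answer matching

Compute the semantic similarity between a question and its candidate answers.

Scores 81.57 on the MTEB semantic textual similarity (STS) tasks.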

🚀 gte-modernbert-base

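We are excited to introduce the gte-modernbert series of models, which are built on the latest ModernBERT pre-trained encoder-only foundation models. The series includes both text embedding models and reranking models.

Compared with similar-scale models from the current open-source community, the gte-modernbert models show competitive performance on several text embedding and text retrieval evaluations, including MTEB, LoCo, and CoIR.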

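🚀 Quick Start

Model Overview

Developed by: Tongyi Lab, Alibaba Group
Model type: Text embedding
Primary language: English
Model size: 149M parameters
Max input length: 8192 tokens
Output dimension: 768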
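
The 8192-token limit applies to a single input. One common way to handle a longer document is to split its token IDs into windows that each fit the limit and embed every window separately. The following is a minimal sketch of that idea, not something the model card prescribes: `split_into_windows` and its `overlap` parameter are hypothetical names, and the integer list stands in for real tokenizer output.

```python
from typing import List

MAX_TOKENS = 8192  # maximum input length stated in the model overview


def split_into_windows(
    token_ids: List[int], max_len: int = MAX_TOKENS, overlap: int = 0
) -> List[List[int]]:
    """Split a token-id sequence into windows of at most max_len tokens.

    Hypothetical helper for illustration only. Consecutive windows share
    `overlap` tokens. A real pipeline would also reserve room for the
    special tokens that the tokenizer adds to each window.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    step = max_len - overlap
    windows = []
    for start in range(0, len(token_ids), step):
        windows.append(token_ids[start:start + max_len])
        if start + max_len >= len(token_ids):
            break  # the current window already reaches the end
    return windows


# Stand-in for real token ids: a document of 20000 tokens.
fake_ids = list(range(20000))
windows = split_into_windows(fake_ids, overlap=128)
print([len(w) for w in windows])
```

How the per-window embeddings are combined afterwards (for example by averaging, or by indexing each window on its own) is a design choice this sketch leaves open.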

Model List

Model	Language	Model type	Model size	Max sequence length	Dimension	MTEB-en	BEIR	LoCo	CoIR
gte-modernbert-base	English	text embedding	149M	8192	768	64.38	55.33	87.57	79.31
gte-reranker-modernbert-base	English	text reranker	149M	8192	-	-	56.19	90.68	79.99

Usage

⚠️ Important

With transformers and sentence-transformers, the efficient Flash Attention 2 is used automatically if your GPU supports it and flash_attn is installed. It is not required.

```bash
pip install flash_attn
```
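
Because flash_attn is optional, it can be useful to check whether it is importable in the current environment before expecting the speed-up. A minimal sketch using only the standard library follows; `flash_attn_installed` is a hypothetical helper name, and the check says nothing about whether the GPU itself supports Flash Attention 2.

```python
import importlib.util


def flash_attn_installed() -> bool:
    """Return True if the optional flash_attn package can be found.

    Hypothetical helper: this only checks that the package is installed,
    not that the current GPU supports Flash Attention 2.
    """
    return importlib.util.find_spec("flash_attn") is not None


print("flash_attn installed:", flash_attn_installed())
```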

Using `transformers`

```python
# Requires transformers>=4.48.0

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = "Alibaba-NLP/gte-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Tokenize the input texts
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**batch_dict)

# Use the hidden state of the first ([CLS]) token as the text embedding
embeddings = outputs.last_hidden_state[:, 0]

# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
# Similarity of the first text against the others, scaled by 100
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
# [[42.89073944091797, 71.30911254882812, 33.664554595947266]]
```
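
The example above L2-normalizes the embeddings before taking the matrix product, which makes each product a cosine similarity (here scaled by 100). The sketch below illustrates that equivalence in plain Python on two made-up 3-dimensional vectors; the vectors and helper names are assumptions for illustration and have nothing to do with real model output.

```python
import math
from typing import List


def dot(a: List[float], b: List[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(dot(vec, vec))
    return [x / norm for x in vec]


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity computed directly from its definition."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, made up for illustration.
query = [1.0, 2.0, 2.0]
doc = [2.0, 0.0, 1.0]

score_via_normalize = dot(l2_normalize(query), l2_normalize(doc)) * 100
score_via_cosine = cosine(query, doc) * 100
print(round(score_via_normalize, 2), round(score_via_cosine, 2))
```

Both routes give the same score up to floating-point error, which is why normalizing once and then using a plain matrix product is a convenient way to score many pairs.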

Using `sentence-transformers`

```python
# Requires transformers>=4.48.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
embeddings = model.encode(input_texts)
print(embeddings.shape)
# (4, 768)

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
# tensor([[0.4289, 0.7131, 0.3366]])
```
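
For retrieval, similarity scores like these are typically used to rank the candidates for a query. A minimal sketch follows, reusing the three scores printed above for the query "what is the capital of China?"; `rank_candidates` and `top_k` are hypothetical names, not part of either library.

```python
from typing import List, Tuple


def rank_candidates(
    candidates: List[str], scores: List[float], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Return the top_k (candidate, score) pairs, highest score first.

    Hypothetical helper for illustration only.
    """
    paired = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return paired[:top_k]


candidates = [
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms",
]
# Cosine similarities copied from the example output above.
scores = [0.4289, 0.7131, 0.3366]

ranked = rank_candidates(candidates, scores)
for text, score in ranked:
    print(f"{score:.4f}  {text}")
```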

Using `transformers.js`

```javascript
// npm i @huggingface/transformers
import { pipeline, matmul } from "@huggingface/transformers";

// Create a feature extraction pipeline
const extractor = await pipeline(
  "feature-extraction",
  "Alibaba-NLP/gte-modernbert-base",
  { dtype: "fp32" }, // Supported options: "fp32", "fp16", "q8", "q4", "q4f16"
);

// Embed queries and documents
const embeddings = await extractor(
  [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms",
  ],
  { pooling: "cls", normalize: true },
);

// Compute similarity scores
const similarities = (await matmul(embeddings.slice([0, 1]), embeddings.slice([1, null]).transpose(1, 0))).mul(100);
console.log(similarities.tolist()); // [[42.89077377319336, 71.30916595458984, 33.66455841064453]]
```

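Training Details

The gte-modernbert series follows the training scheme of the previous GTE models. The only difference is that the pre-trained language model base has been changed from GTE-MLM to ModernBERT. For more training details, please refer to our paper: mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

Evaluation

MTEB

Results for the other models are taken from the MTEB leaderboard. Because every model in the gte-modernbert series has fewer than 1B parameters, we only compare against MTEB leaderboard results for models with fewer than 1B parameters.

Model name	Param size (M)	Dimension	Sequence length	Average (56)	Classification (12)	Clustering (11)	Pair classification (3)	Reranking (4)	Retrieval (15)	STS (10)	Summarization (1)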
mxbai-embed-large-v1	335	1024	512	64.68	75.64	46.71	87.2	60.11	54.39	85	32.71
multilingual-e5-large-instruct	560	1024	514	64.41	77.56	47.1	86.19	58.58	52.47	84.78	30.39
bge-large-en-v1.5	335	1024	512	64.23	75.97	46.08	87.12	60.03	54.29	83.11	31.61
gte-base-en-v1.5	137	768	8192	64.11	77.17	46.82	85.33	57.66	54.09	81.97	31.17
bge-base-en-v1.5	109	768	512	63.55	75.53	45.77	86.55	58.86	53.25	82.4	31.07
gte-large-en-v1.5	409	1024	8192	65.39	77.75	47.95	84.63	58.50	57.91	81.43	30.91
modernbert-embed-base	149	768	8192	62.62	74.31	44.98	83.96	56.42	52.89	81.78	31.39
nomic-embed-text-v1.5	-	768	8192	62.28	73.55	43.93	84.61	55.78	53.01	81.94	30.4
gte-multilingual-base	305	768	8192	61.4	70.89	44.31	84.24	57.47	51.08	82.11	30.58
jina-embeddings-v3	572	1024	8192	65.51	82.58	45.21	84.01	58.13	53.88	85.81	29.71
gte-modernbert-base	149	768	8192	64.38	76.99	46.47	85.93	59.24	55.33	81.57	30.68
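
The numbers in parentheses in the table header are task counts per category. Assuming the "Average (56)" column is the mean over all 56 tasks, it can be approximated as the task-count-weighted mean of the category scores. The sketch below checks this for the gte-modernbert-base row; the values are copied from the table, and the weighting scheme is an assumption rather than something stated in the source.

```python
# Category score and task count for gte-modernbert-base, copied from the table above.
categories = {
    "Classification": (76.99, 12),
    "Clustering": (46.47, 11),
    "Pair classification": (85.93, 3),
    "Reranking": (59.24, 4),
    "Retrieval": (55.33, 15),
    "STS": (81.57, 10),
    "Summarization": (30.68, 1),
}

total_tasks = sum(count for _, count in categories.values())
# Weight each category average by its number of tasks (assumed scheme).
weighted_avg = sum(score * count for score, count in categories.values()) / total_tasks

reported_avg = 64.38  # "Average (56)" from the table
print(total_tasks, round(weighted_avg, 2), reported_avg)
```

The recomputed value lands within a few hundredths of the reported 64.38; a small gap is expected because the category scores in the table are already rounded.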

LoCo (long-document retrieval) (NDCG@10)

Model name	Dimension	Sequence length	Average (5)	QsmsumRetrieval	SummScreenRetrieval	QasperAbstractRetrieval	QasperTitleRetrieval	GovReportRetrieval
gte-qwen1.5-7b	4096	32768	87.57	49.37	93.10	99.67	97.54	98.21
gte-large-v1.5	1024	8192	86.71	44.55	92.61	99.82	97.81	98.74
gte-base-v1.5	768	8192	87.44	49.91	91.78	99.82	97.13	98.58
gte-modernbert-base	768	8192	88.88	54.45	93.00	99.82	98.03	98.70
gte-reranker-modernbert-base	-	8192	90.68	70.86	94.06	99.73	99.11	89.67
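
The retrieval tables report NDCG@10 multiplied by 100. As a reminder of what the metric measures, here is a minimal sketch of NDCG@k for a single query. It uses the linear-gain form of DCG and assumes the ranked list contains every judged document, so the ideal ordering can be derived from the list itself; benchmark toolkits may differ in these details, and the function names are hypothetical.

```python
import math
from typing import List


def dcg_at_k(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: List[float], k: int = 10) -> float:
    """NDCG@k for one query.

    `relevances` lists the relevance labels of the results in ranked order.
    Assumption: the list covers all judged documents, so sorting it gives
    the ideal ranking.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One relevant document, retrieved at rank 1 versus rank 2.
print(ndcg_at_k([1, 0, 0]))
print(ndcg_at_k([0, 1, 0]))
```

A benchmark score is then the mean of this per-query value over all queries in a dataset.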

CoIR (code retrieval tasks) (NDCG@10)

Model name	Dimension	Sequence length	Average (20)	CodeSearchNet-ccr-go	CodeSearchNet-ccr-java	CodeSearchNet-ccr-javascript	CodeSearchNet-ccr-php	CodeSearchNet-ccr-python	CodeSearchNet-ccr-ruby	CodeSearchNet-go	CodeSearchNet-java	CodeSearchNet-javascript	CodeSearchNet-php	CodeSearchNet-python	CodeSearchNet-ruby	apps	codefeedback-mt	codefeedback-st	codetrans-contest	codetrans-dl	cosqa	stackoverflow-qa	synthetic-text2sql
gte-modernbert-base	768	8192	79.31	94.15	93.57	94.27	91.51	93.93	90.63	88.32	83.27	76.05	85.12	88.16	77.59	57.54	82.34	85.95	71.89	35.46	43.47	91.2	61.87
gte-reranker-modernbert-base	-	8192	79.99	96.43	96.88	98.32	91.81	97.7	91.96	88.81	79.71	76.27	89.39	98.37	84.11	47.57	83.37	88.91	49.66	36.36	44.37	89.58	64.21

BEIR (NDCG@10)

Model name	Dimension	Sequence length	Average (15)	ArguAna	ClimateFEVER	CQADupstackAndroidRetrieval	DBPedia	FEVER	FiQA2018	HotpotQA	MSMARCO	NFCorpus	NQ	QuoraRetrieval	SCIDOCS	SciFact	Touche2020	TRECCOVID
gte-modernbert-base	768	8192	55.33	72.68	37.74	42.63	41.79	91.03	48.81	69.47	40.9	36.44	57.62	88.55	21.29	77.4	21.68	81.95
gte-reranker-modernbert-base	-	8192	56.73	69.03	37.79	44.68	47.23	94.54	49.81	78.16	45.38	30.69	64.57	87.77	20.60	73.57	27.36	79.89

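Hiring

Tongyi Lab is hiring research interns and full-time researchers. We are looking for passionate people with expertise in representation learning, LLM-driven information retrieval, retrieval-augmented generation (RAG), and agent-based systems. Our team is based in Beijing and Hangzhou. If you are curious and eager to make a meaningful impact through your work, we would love to hear from you. Please send your resume and a brief introduction to dingkun.ldk@alibaba-inc.com.

Citation

If you find our paper or models helpful, please consider citing: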

@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}