Gte Modernbert Base

Developed by Alibaba-NLP

A text embedding model built on the ModernBERT pre-trained encoder. It supports inputs of up to 8192 tokens and performs competitively on the MTEB, LoCo, and CoIR benchmarks.

Text Embedding

Transformers

English
Open Source License: Apache-2.0
#Long Text Embedding #Efficient Retrieval #Multi-task Optimization

Downloads 74.52k

Release Time: 1/20/2025

Model Overview

This model is a text embedding model developed by Alibaba Group's Tongyi Lab, specializing in English text processing and suitable for tasks such as information retrieval and semantic similarity calculation.

Model Features

Long Text Processing Capability

Supports input lengths of up to 8192 tokens, suitable for processing long documents

High Efficiency

Supports Flash Attention 2 acceleration, with high operational efficiency on GPUs

Multi-scenario Applicability

Performs competitively against similar-scale open-source models on the MTEB, LoCo, and CoIR benchmarks

Model Capabilities

Text Embedding

Semantic Similarity Calculation

Information Retrieval

Long Document Processing

Use Cases

Information Retrieval

Document Retrieval

Quickly retrieve relevant content from large-scale document libraries

Achieved an average NDCG@10 of 88.88 on the LoCo benchmark

Semantic Similarity

Question-Answer Matching

Calculate the semantic similarity between questions and candidate answers

Scored 81.57 on the MTEB STS (semantic textual similarity) tasks

license: apache-2.0
language: en
base_model: answerdotai/ModernBERT-base
base_model_relation: finetune
pipeline_tag: sentence-similarity
library_name: transformers
tags: sentence-transformers, mteb, embedding, transformers.js

gte-modernbert-base

We are excited to introduce the gte-modernbert series of models, which are built on the latest ModernBERT pre-trained encoder-only foundation models. The series includes both text embedding models and rerank models.

The gte-modernbert models demonstrate competitive performance on several text embedding and text retrieval evaluation tasks when compared with similar-scale models from the current open-source community. These evaluations include MTEB, LoCo, and CoIR.

Model Overview

Developed by: Tongyi Lab, Alibaba Group
Model Type: Text Embedding
Primary Language: English
Model Size: 149M
Max Input Length: 8192 tokens
Output Dimension: 768

Model list

Models	Language	Model Type	Model Size	Max Seq. Length	Dimension	MTEB-en	BEIR	LoCo	CoIR
`gte-modernbert-base`	English	text embedding	149M	8192	768	64.38	55.33	87.57	79.31
`gte-reranker-modernbert-base`	English	text reranker	149M	8192	-	-	56.19	90.68	79.99

Usage

Tip: For transformers and sentence-transformers, Flash Attention 2 is used automatically when flash_attn is installed and your GPU supports it. Installing it is optional.
pip install flash_attn
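
Before relying on the automatic Flash Attention 2 path, it can help to confirm that the package is importable in the current environment. The following is a minimal sketch using only the Python standard library. The helper name `flash_attn_available` is hypothetical, and the check only covers installation, not whether your GPU supports the kernels.

```python
import importlib.util


def flash_attn_available() -> bool:
    """Return True if the `flash_attn` package can be found on the import path."""
    # find_spec returns None for a top-level package that is not installed.
    return importlib.util.find_spec("flash_attn") is not None


print("flash_attn installed:", flash_attn_available())
```

If this prints `False`, the examples below should still run, since Flash Attention 2 is optional.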

Use with transformers

# Requires transformers>=4.48.0

import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model_path = "Alibaba-NLP/gte-modernbert-base"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModel.from_pretrained(model_path)

# Tokenize the input texts, truncating anything beyond the 8192-token limit
batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

# Inference only, so skip gradient tracking
with torch.no_grad():
    outputs = model(**batch_dict)

# Use the hidden state of the first ([CLS]) token as the sentence embedding
embeddings = outputs.last_hidden_state[:, 0]
 
# (Optionally) normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
scores = (embeddings[:1] @ embeddings[1:].T) * 100
print(scores.tolist())
# [[42.89073944091797, 71.30911254882812, 33.664554595947266]]
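
The scoring step above works because the dot product of two L2-normalized vectors equals their cosine similarity; the example then scales that value by 100. The sketch below illustrates this in plain Python. The 3-dimensional vectors are invented for illustration (real embeddings from this model have 768 dimensions), and the helper names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (the role F.normalize plays above)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity computed directly from the unnormalized vectors."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


# Toy vectors, invented for illustration only.
query = [1.0, 2.0, 2.0]
doc = [2.0, 0.0, 1.0]

# Normalize first, then take the dot product and scale by 100.
score = dot(l2_normalize(query), l2_normalize(doc)) * 100
print(round(score, 2), round(cosine(query, doc) * 100, 2))
```

Both printed values agree, which is why the normalization step can be skipped only if you compute cosine similarity explicitly elsewhere.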

Use with sentence-transformers:

# Requires transformers>=4.48.0
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

input_texts = [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms"
]

model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
embeddings = model.encode(input_texts)
print(embeddings.shape)
# (4, 768)

similarities = cos_sim(embeddings[0], embeddings[1:])
print(similarities)
# tensor([[0.4289, 0.7131, 0.3366]])
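
For retrieval, the similarity scores are typically used to order candidates. The sketch below ranks the three candidate texts using the scores printed above. The function name `rank_candidates` and its `top_k` parameter are hypothetical, not part of sentence-transformers.

```python
def rank_candidates(candidates, scores, top_k=None):
    """Pair each candidate with its score and sort from most to least similar."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return ranked if top_k is None else ranked[:top_k]


# Candidates and scores copied from the example output above
# (query: "what is the capital of China?").
candidates = [
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms",
]
scores = [0.4289, 0.7131, 0.3366]

for text, score in rank_candidates(candidates, scores):
    print(f"{score:.4f}  {text}")
```

With these scores, "Beijing" ranks first for the query, as expected.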

Use with transformers.js:

// npm i @huggingface/transformers
import { pipeline, matmul } from "@huggingface/transformers";

// Create a feature extraction pipeline
const extractor = await pipeline(
  "feature-extraction",
  "Alibaba-NLP/gte-modernbert-base",
  { dtype: "fp32" }, // Supported options: "fp32", "fp16", "q8", "q4", "q4f16"
);

// Embed queries and documents
const embeddings = await extractor(
  [
    "what is the capital of China?",
    "how to implement quick sort in python?",
    "Beijing",
    "sorting algorithms",
  ],
  { pooling: "cls", normalize: true },
);

// Compute similarity scores
const similarities = (await matmul(embeddings.slice([0, 1]), embeddings.slice([1, null]).transpose(1, 0))).mul(100);
console.log(similarities.tolist()); // [[42.89077377319336, 71.30916595458984, 33.66455841064453]]

Training Details

The gte-modernbert series follows the training scheme of the previous GTE models; the only difference is that the pre-trained language model base was changed from GTE-MLM to ModernBERT. For more training details, please refer to our paper: mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval

Evaluation

MTEB

The results for other models are taken from the MTEB leaderboard. Because all models in the gte-modernbert series have fewer than 1B parameters, we compare only against leaderboard models under 1B parameters.

Model Name	Param Size (M)	Dimension	Sequence Length	Average (56)	Class. (12)	Clust. (11)	Pair Class. (3)	Reran. (4)	Retr. (15)	STS (10)	Summ. (1)
mxbai-embed-large-v1	335	1024	512	64.68	75.64	46.71	87.2	60.11	54.39	85	32.71
multilingual-e5-large-instruct	560	1024	514	64.41	77.56	47.1	86.19	58.58	52.47	84.78	30.39
bge-large-en-v1.5	335	1024	512	64.23	75.97	46.08	87.12	60.03	54.29	83.11	31.61
gte-base-en-v1.5	137	768	8192	64.11	77.17	46.82	85.33	57.66	54.09	81.97	31.17
bge-base-en-v1.5	109	768	512	63.55	75.53	45.77	86.55	58.86	53.25	82.4	31.07
gte-large-en-v1.5	409	1024	8192	65.39	77.75	47.95	84.63	58.50	57.91	81.43	30.91
modernbert-embed-base	149	768	8192	62.62	74.31	44.98	83.96	56.42	52.89	81.78	31.39
nomic-embed-text-v1.5		768	8192	62.28	73.55	43.93	84.61	55.78	53.01	81.94	30.4
gte-multilingual-base	305	768	8192	61.4	70.89	44.31	84.24	57.47	51.08	82.11	30.58
jina-embeddings-v3	572	1024	8192	65.51	82.58	45.21	84.01	58.13	53.88	85.81	29.71
gte-modernbert-base	149	768	8192	64.38	76.99	46.47	85.93	59.24	55.33	81.57	30.68
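
The numbers in parentheses in the column headers are task counts (12 + 11 + 3 + 4 + 15 + 10 + 1 = 56). As a sanity check, the sketch below recomputes the overall average for gte-modernbert-base as a task-count-weighted mean of its category scores. This assumes the Average (56) column is the mean over all 56 tasks; because the category columns are already rounded, the result (about 64.40) only approximates the reported 64.38.

```python
# Task counts per category, taken from the column headers of the table above.
task_counts = {
    "Class.": 12, "Clust.": 11, "Pair Class.": 3,
    "Reran.": 4, "Retr.": 15, "STS": 10, "Summ.": 1,
}

# Category scores for gte-modernbert-base, copied from its row in the table.
category_scores = {
    "Class.": 76.99, "Clust.": 46.47, "Pair Class.": 85.93,
    "Reran.": 59.24, "Retr.": 55.33, "STS": 81.57, "Summ.": 30.68,
}


def weighted_average(scores, counts):
    """Mean of category scores, weighted by the number of tasks per category."""
    total_tasks = sum(counts.values())
    return sum(scores[name] * counts[name] for name in counts) / total_tasks


avg = weighted_average(category_scores, task_counts)
print(f"{avg:.2f}")
```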

LoCo (Long Document Retrieval) (NDCG@10)

Model Name	Dimension	Sequence Length	Average (5)	QsmsumRetrieval	SummScreenRetrieval	QasperAbstractRetrieval	QasperTitleRetrieval	GovReportRetrieval
gte-qwen1.5-7b	4096	32768	87.57	49.37	93.10	99.67	97.54	98.21
gte-large-v1.5	1024	8192	86.71	44.55	92.61	99.82	97.81	98.74
gte-base-v1.5	768	8192	87.44	49.91	91.78	99.82	97.13	98.58
gte-modernbert-base	768	8192	88.88	54.45	93.00	99.82	98.03	98.70
gte-reranker-modernbert-base	-	8192	90.68	70.86	94.06	99.73	99.11	89.67
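
The retrieval tables report NDCG@10 (scaled to 0-100). As a reminder of what the metric measures, here is a minimal sketch of NDCG@k with linear gain. It is an illustration under stated assumptions, not the benchmarks' evaluation code: the function names are hypothetical, the relevance labels are made up, and the ideal ranking is derived only from the labels passed in rather than from every relevant document in a corpus.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k results (linear gain)."""
    # Result at 0-based position `rank` is discounted by log2(rank + 2).
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG normalized by the DCG of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical relevance labels for a ranked list (1 = relevant, 0 = not relevant).
relevant_first = [1, 0, 0, 0]
relevant_second = [0, 1, 0, 0]

print(ndcg_at_k(relevant_first), ndcg_at_k(relevant_second))
```

A ranking that places the only relevant document first scores 1.0, while placing it second scores lower, which is the behavior the NDCG@10 columns summarize per dataset.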

CoIR (Code Retrieval Task) (NDCG@10)

Model Name	Dimension	Sequence Length	Average(20)	CodeSearchNet-ccr-go	CodeSearchNet-ccr-java	CodeSearchNet-ccr-javascript	CodeSearchNet-ccr-php	CodeSearchNet-ccr-python	CodeSearchNet-ccr-ruby	CodeSearchNet-go	CodeSearchNet-java	CodeSearchNet-javascript	CodeSearchNet-php	CodeSearchNet-python	CodeSearchNet-ruby	apps	codefeedback-mt	codefeedback-st	codetrans-contest	codetrans-dl	cosqa	stackoverflow-qa	synthetic-text2sql
gte-modernbert-base	768	8192	79.31	94.15	93.57	94.27	91.51	93.93	90.63	88.32	83.27	76.05	85.12	88.16	77.59	57.54	82.34	85.95	71.89	35.46	43.47	91.2	61.87
gte-reranker-modernbert-base	-	8192	79.99	96.43	96.88	98.32	91.81	97.7	91.96	88.81	79.71	76.27	89.39	98.37	84.11	47.57	83.37	88.91	49.66	36.36	44.37	89.58	64.21

BEIR (NDCG@10)

Model Name	Dimension	Sequence Length	Average(15)	ArguAna	ClimateFEVER	CQADupstackAndroidRetrieval	DBPedia	FEVER	FiQA2018	HotpotQA	MSMARCO	NFCorpus	NQ	QuoraRetrieval	SCIDOCS	SciFact	Touche2020	TRECCOVID
gte-modernbert-base	768	8192	55.33	72.68	37.74	42.63	41.79	91.03	48.81	69.47	40.9	36.44	57.62	88.55	21.29	77.4	21.68	81.95
gte-reranker-modernbert-base	-	8192	56.73	69.03	37.79	44.68	47.23	94.54	49.81	78.16	45.38	30.69	64.57	87.77	20.60	73.57	27.36	79.89

Hiring

We have open positions for Research Interns and Full-Time Researchers to join our team at Tongyi Lab. We are seeking passionate individuals with expertise in representation learning, LLM-driven information retrieval, Retrieval-Augmented Generation (RAG), and agent-based systems. Our team is located in the vibrant cities of Beijing and Hangzhou. If you are driven by curiosity and eager to make a meaningful impact through your work, we would love to hear from you. Please submit your resume along with a brief introduction to dingkun.ldk@alibaba-inc.com.

Citation

If you find our paper or models helpful, please consider citing them.

@inproceedings{zhang2024mgte,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Wen and Dai, Ziqi and Tang, Jialong and Lin, Huan and Yang, Baosong and Xie, Pengjun and Huang, Fei and others},
  booktitle={Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track},
  pages={1393--1412},
  year={2024}
}

@article{li2023towards,
  title={Towards general text embeddings with multi-stage contrastive learning},
  author={Li, Zehan and Zhang, Xin and Zhang, Yanzhao and Long, Dingkun and Xie, Pengjun and Zhang, Meishan},
  journal={arXiv preprint arXiv:2308.03281},
  year={2023}
}
