gte-multilingual-mlm-base
We introduce the mGTE series, a new set of generalized text encoder, embedding, and reranking models. The models support 75 languages and handle context lengths of up to 8192 tokens. They are built on the transformer++ encoder backbone (BERT + RoPE + GLU; code available at Alibaba-NLP/new-impl) and use the vocabulary of XLM-R. This text encoder (mGTE-MLM-8192 in our paper) outperforms the previous state-of-the-art XLM-R-base of the same size on both GLUE and XTREME-R.
Quick Start
This section provides an overview of the mGTE series models. For detailed usage, please refer to the official documentation or the relevant code repositories.
Features
- Multilingual Support: Supports 75 languages, making it suitable for a wide range of multilingual applications.
- Long Context Handling: Handles context lengths of up to 8192 tokens, enabling better modeling of long texts.
- High Performance: Outperforms the previous state-of-the-art XLM-R-base of the same size on both GLUE and XTREME-R.
Installation
No installation steps are listed on the original card; a standard Hugging Face transformers environment (e.g. `pip install torch transformers`) should be sufficient.
Usage Examples
No usage code is provided on the original card; a hedged masked-language-modeling sketch is given below.
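As a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub as Alibaba-NLP/gte-multilingual-mlm-base, that it exposes the standard transformers masked-LM interface, and that trust_remote_code is needed for the custom transformer++ implementation (all assumptions, not statements from the original card), filling a masked token could look like this:

```python
# Minimal fill-mask sketch. Assumed: the Hub id
# "Alibaba-NLP/gte-multilingual-mlm-base", the standard AutoModelForMaskedLM
# interface, and trust_remote_code=True for the custom transformer++ code.
# Environment: pip install torch transformers
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "Alibaba-NLP/gte-multilingual-mlm-base"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# The XLM-R vocabulary uses "<mask>" as its mask token.
text = f"Paris is the {tokenizer.mask_token} of France."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the most likely token at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

The same checkpoint can also serve as a starting point for fine-tuning embedding or reranking models, which is how it is used in the mGTE paper.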
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Text Encoder |
| Training Data | Masked language modeling (MLM): c4-en, mc4, skypile, Wikipedia, CulturaX, etc. (see paper Appendix A.1) |
Training Details
Training Data
- Masked language modeling (MLM): c4-en, mc4, skypile, Wikipedia, CulturaX, etc. (refer to paper Appendix A.1)
Training Procedure
To enable the backbone to support a context length of 8192, a multi-stage training strategy is adopted. The model first undergoes MLM pre-training at a sequence length of 2048. The data is then resampled to reduce the proportion of short texts, the RoPE base is enlarged, and MLM pre-training continues at a sequence length of 8192.
The entire training process is as follows (a short sketch of the RoPE-base change follows the list):
- MLM-2048: lr 2e-4, mlm_probability 0.3, batch_size 8192, num_steps 250k, rope_base 10000
- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 2048, num_steps 30k, rope_base 160000
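The jump in rope_base from 10000 to 160000 between the two stages is what lets positions beyond 2048 remain distinguishable at 8192 tokens. A rough illustration, assuming the standard RoPE parameterization θ_i = base^(−2i/d) and a head dimension of 64 (typical for base-size encoders, but not stated above):

```python
# Illustrative only: why a larger RoPE base helps at longer contexts.
# Assumed: standard RoPE inverse frequencies theta_i = base**(-2i/d) and a
# head dimension of 64 (typical for a base-size encoder; not stated above).
import numpy as np

def rope_inv_freq(base: float, head_dim: int = 64) -> np.ndarray:
    """Inverse rotary frequencies theta_i = base**(-2i/d) for i = 0, 2, ..., d-2."""
    i = np.arange(0, head_dim, 2)
    return base ** (-i / head_dim)

freq_stage1 = rope_inv_freq(10_000.0)   # MLM-2048 stage (rope_base 10000)
freq_stage2 = rope_inv_freq(160_000.0)  # MLM-8192 stage (rope_base 160000)

# Wavelength (in positions) of the slowest-rotating dimension pair: a larger
# base stretches the wavelengths, so rotations stay distinguishable well
# beyond the original 2048-token range.
print(2 * np.pi / freq_stage1[-1])  # ~4.7e4 positions
print(2 * np.pi / freq_stage2[-1])  # ~6.9e5 positions
```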
Evaluation
GLUE and XTREME-R results are reported in the paper; mGTE-MLM-8192 outperforms the previous state-of-the-art XLM-R-base of the same size on both benchmarks.
Technical Details
The models are built on the transformer++ encoder backbone (BERT + RoPE + GLU; code available at Alibaba-NLP/new-impl) and use the vocabulary of XLM-R. A multi-stage training strategy, described under Training Procedure above, enables the model to support a context length of 8192.
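For reference, the GLU feed-forward block mentioned above can be sketched as a gated feed-forward layer. This is a generic illustration, not the exact module from Alibaba-NLP/new-impl; the GELU activation and the intermediate size are assumptions:

```python
# Generic GLU-style feed-forward block (illustrative; the exact variant in
# Alibaba-NLP/new-impl may differ in activation and sizes).
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)  # gating branch
        self.up_proj = nn.Linear(hidden_size, intermediate_size)    # value branch
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gating: act(gate(x)) * up(x), then project back down.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))

# Example: a batch of 2 sequences, 8 tokens each, hidden size 768.
y = GLUFeedForward()(torch.randn(2, 8, 768))
print(y.shape)  # torch.Size([2, 8, 768])
```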
License
The model is released under the Apache-2.0 license.
Citation
If you find our paper or models helpful, please consider citing them as follows:
@misc{zhang2024mgtegeneralizedlongcontexttext,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}