gte-multilingual-mlm-base
We introduce the mGTE series, a new set of generalized text encoder, embedding, and reranking models. The models support 75 languages and handle context lengths of up to 8192 tokens. They are built on the transformer++ encoder backbone (BERT + RoPE + GLU; code available at Alibaba-NLP/new-impl) and use the vocabulary of XLM-R. This text encoder (mGTE-MLM-8192 in our paper) outperforms the previous state-of-the-art XLM-R-base of the same size on both GLUE and XTREME-R.
Quick Start
This section provides an overview of the mGTE series models. For detailed usage, please refer to the official documentation or the relevant code repositories.
Features
- Multilingual Support: Supports 75 languages, making it suitable for a wide range of multilingual applications.
- Long Context Handling: Handles context lengths of up to 8192 tokens, enabling better modeling of long texts.
- High Performance: Outperforms the previous state-of-the-art XLM-R-base of the same size on both GLUE and XTREME-R.
Installation
No installation steps are listed on the original card; a standard Hugging Face transformers environment (e.g. `pip install torch transformers`) should be sufficient.
Usage Examples
No usage code is provided on the original card; a hedged masked-language-modeling sketch is given below.
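As a minimal sketch, assuming the checkpoint is published on the Hugging Face Hub as Alibaba-NLP/gte-multilingual-mlm-base, that it exposes the standard transformers masked-LM interface, and that trust_remote_code is needed for the custom transformer++ implementation (all assumptions, not statements from the original card), filling a masked token could look like this:

```python
# Minimal fill-mask sketch. Assumed: the Hub id
# "Alibaba-NLP/gte-multilingual-mlm-base", the standard AutoModelForMaskedLM
# interface, and trust_remote_code=True for the custom transformer++ code.
# Environment: pip install torch transformers
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

model_id = "Alibaba-NLP/gte-multilingual-mlm-base"  # assumed Hub id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# The XLM-R vocabulary uses "<mask>" as its mask token.
text = f"Paris is the {tokenizer.mask_token} of France."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

# Take the most likely token at the masked position.
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.decode(predicted_id))
```

The same checkpoint can also serve as a starting point for fine-tuning embedding or reranking models, which is how it is used in the mGTE paper.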
Documentation
Model Information
| Property | Details |
|----------|---------|
| Model Type | Text Encoder |
| Training Data | Masked language modeling (MLM): c4-en, mc4, skypile, Wikipedia, CulturaX, etc. (see paper Appendix A.1) |
Training Details
Training Data
- Masked language modeling (MLM): c4-en, mc4, skypile, Wikipedia, CulturaX, etc. (refer to paper Appendix A.1)
Training Procedure
To enable the backbone to support a context length of 8192, a multi-stage training strategy is adopted. The model first undergoes MLM pre-training at a sequence length of 2048. The data is then resampled to reduce the proportion of short texts, the RoPE base is enlarged, and MLM pre-training continues at a sequence length of 8192.
The entire training process is as follows (a short sketch of the RoPE-base change follows the list):
- MLM-2048: lr 2e-4, mlm_probability 0.3, batch_size 8192, num_steps 250k, rope_base 10000
- MLM-8192: lr 5e-5, mlm_probability 0.3, batch_size 2048, num_steps 30k, rope_base 160000
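The jump in rope_base from 10000 to 160000 between the two stages is what lets positions beyond 2048 remain distinguishable at 8192 tokens. A rough illustration, assuming the standard RoPE parameterization θ_i = base^(−2i/d) and a head dimension of 64 (typical for base-size encoders, but not stated above):

```python
# Illustrative only: why a larger RoPE base helps at longer contexts.
# Assumed: standard RoPE inverse frequencies theta_i = base**(-2i/d) and a
# head dimension of 64 (typical for a base-size encoder; not stated above).
import numpy as np

def rope_inv_freq(base: float, head_dim: int = 64) -> np.ndarray:
    """Inverse rotary frequencies theta_i = base**(-2i/d) for i = 0, 2, ..., d-2."""
    i = np.arange(0, head_dim, 2)
    return base ** (-i / head_dim)

freq_stage1 = rope_inv_freq(10_000.0)   # MLM-2048 stage (rope_base 10000)
freq_stage2 = rope_inv_freq(160_000.0)  # MLM-8192 stage (rope_base 160000)

# Wavelength (in positions) of the slowest-rotating dimension pair: a larger
# base stretches the wavelengths, so rotations stay distinguishable well
# beyond the original 2048-token range.
print(2 * np.pi / freq_stage1[-1])  # ~4.7e4 positions
print(2 * np.pi / freq_stage2[-1])  # ~6.9e5 positions
```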
Evaluation
GLUE and XTREME-R results are reported in the paper; mGTE-MLM-8192 outperforms the previous state-of-the-art XLM-R-base of the same size on both benchmarks.
Technical Details
The models are built on the transformer++ encoder backbone (BERT + RoPE + GLU; code available at Alibaba-NLP/new-impl) and use the vocabulary of XLM-R. A multi-stage training strategy, described under Training Procedure above, enables the model to support a context length of 8192.
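For reference, the GLU feed-forward block mentioned above can be sketched as a gated feed-forward layer. This is a generic illustration, not the exact module from Alibaba-NLP/new-impl; the GELU activation and the intermediate size are assumptions:

```python
# Generic GLU-style feed-forward block (illustrative; the exact variant in
# Alibaba-NLP/new-impl may differ in activation and sizes).
import torch
import torch.nn as nn

class GLUFeedForward(nn.Module):
    def __init__(self, hidden_size: int = 768, intermediate_size: int = 2048):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)  # gating branch
        self.up_proj = nn.Linear(hidden_size, intermediate_size)    # value branch
        self.down_proj = nn.Linear(intermediate_size, hidden_size)
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gating: act(gate(x)) * up(x), then project back down.
        return self.down_proj(self.act(self.gate_proj(x)) * self.up_proj(x))

# Example: a batch of 2 sequences, 8 tokens each, hidden size 768.
y = GLUFeedForward()(torch.randn(2, 8, 768))
print(y.shape)  # torch.Size([2, 8, 768])
```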
License
The model is released under the Apache-2.0 license.
Citation
If you find our paper or models helpful, please consider citing them as follows:
@misc{zhang2024mgtegeneralizedlongcontexttext,
  title={mGTE: Generalized Long-Context Text Representation and Reranking Models for Multilingual Text Retrieval},
  author={Xin Zhang and Yanzhao Zhang and Dingkun Long and Wen Xie and Ziqi Dai and Jialong Tang and Huan Lin and Baosong Yang and Pengjun Xie and Fei Huang and Meishan Zhang and Wenjie Li and Min Zhang},
  year={2024},
  eprint={2407.19669},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2407.19669},
}