gte-en-mlm-large
A large English text encoder in the GTE-v1.5 series, built on an improved BERT architecture and supporting context lengths of up to 8,192 tokens.
Downloads: 171
Release date: 8/6/2024
Model Overview
This model is a general-purpose text encoder developed by the Institute for Intelligent Computing at Alibaba Group. It is used primarily for English text embedding and re-ranking tasks, and it supports long-context processing.
Model Features
Long-context support
Supports context lengths of up to 8,192 tokens, making it suitable for long documents and complex texts.
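Long-context support in this model family rests on rotary position embeddings (RoPE), under which attention scores depend only on the relative offset between positions rather than on absolute positions. The sketch below is a minimal, generic illustration of that property with numpy; the rotation scheme follows the standard RoPE formulation, not this model's exact implementation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector x at position pos.

    Pairs of dimensions (2i, 2i+1) are rotated by an angle that depends
    on the position and on a per-pair frequency, as in standard RoPE.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)  # one frequency per dim pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal(64)
k = rng.standard_normal(64)

# The query-key dot product depends only on the *relative* offset
# (here 4 in both cases), which is what lets RoPE-based encoders
# generalize to positions beyond those seen at a given training length.
s1 = rope(q, 3) @ rope(k, 7)      # positions 3 and 7
s2 = rope(q, 100) @ rope(k, 104)  # positions 100 and 104, same offset
print(abs(s1 - s2) < 1e-9)
```

Because each dimension pair undergoes a pure rotation, the transform also preserves vector norms, so it changes only the directional (positional) information in queries and keys.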
Improved BERT architecture
The BERT backbone is enhanced with rotary position embeddings (RoPE) and gated linear unit (GLU) feed-forward layers, improving model performance.
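A GLU feed-forward block replaces BERT's plain Linear → activation → Linear FFN with a gated variant, in which one projection modulates another elementwise. The numpy sketch below shows a GEGLU-style block; the specific gating activation used by this model is an assumption here, and the weight names are illustrative.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

def glu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward block: the activated 'gate' path scales the
    'up' path elementwise before the down-projection back to d_model."""
    return (gelu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
x = rng.standard_normal((2, d_model))          # 2 token positions
w_gate = rng.standard_normal((d_model, d_ff)) * 0.1
w_up = rng.standard_normal((d_model, d_ff)) * 0.1
w_down = rng.standard_normal((d_ff, d_model)) * 0.1
y = glu_ffn(x, w_gate, w_up, w_down)
print(y.shape)  # same shape as the input: (2, 8)
```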
Phased training strategy
Adopts a phased training strategy that increases the context length from 512 to 8,192 tokens, effectively supporting long-context learning.
Model Capabilities
Text embedding
Text re-ranking
Long-text processing
Masked language modeling
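The last capability above, masked language modeling, is the objective this encoder was pretrained with: hidden states at masked positions are projected to vocabulary logits, and the mask is filled with the highest-probability token. The toy sketch below shows only that mechanic; the vocabulary, hidden size, and random "MLM head" are stand-ins for the real model's learned components.

```python
import numpy as np

# Toy stand-ins: in the real model, the encoder produces contextual
# hidden states and a learned head maps them to vocabulary logits.
vocab = ["[MASK]", "the", "cat", "sat", "mat", "dog"]
rng = np.random.default_rng(42)
hidden_dim = 12
mlm_head = rng.standard_normal((hidden_dim, len(vocab)))

def fill_mask(hidden_states, mask_index):
    """Predict the token at the masked position from its hidden state."""
    logits = hidden_states[mask_index] @ mlm_head
    probs = np.exp(logits - logits.max())  # stable softmax
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return vocab[best], float(probs[best])

hidden = rng.standard_normal((4, hidden_dim))  # 4 token positions
token, p = fill_mask(hidden, mask_index=2)
```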
Use Cases
Information retrieval
Document retrieval
Used for semantic retrieval and ranking of long documents
Provides more accurate retrieval results in long-context scenarios
Natural language processing
Text representation learning
Generates high-quality text embedding representations
Can be used for feature extraction in downstream NLP tasks
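For feature extraction, an encoder's token-level hidden states are typically pooled into a single fixed-size vector, which downstream tasks then compare with cosine similarity. The numpy sketch below uses masked mean pooling; whether this model uses mean or CLS-token pooling is not stated on this page, so the pooling choice is an assumption.

```python
import numpy as np

def mean_pool(hidden_states, attention_mask):
    """Average token embeddings, ignoring padding positions."""
    mask = attention_mask[..., None].astype(float)
    return (hidden_states * mask).sum(axis=1) / mask.sum(axis=1)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(1)
# Two "documents" of 5 token positions; the last 2 positions of the
# second one are padding and must not contribute to its embedding.
hidden = rng.standard_normal((2, 5, 32))
mask = np.array([[1, 1, 1, 1, 1],
                 [1, 1, 1, 0, 0]])
emb = mean_pool(hidden, mask)          # one 32-dim vector per document
sim = cosine(emb[0], emb[1])           # similarity score in [-1, 1]
```

In a retrieval setting, the same pooling is applied to the query and to each candidate document, and candidates are ranked by this similarity score.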
© 2025 AIbase