
MrT5 Large

Developed by stanfordnlp
MrT5 is an efficient byte-level language model that improves on ByT5, reducing input sequence length by roughly 50% through dynamic token merging.
Downloads 33
Release Time: 3/23/2025

Model Overview

MrT5 is an efficient variant of ByT5 that dynamically shortens input sequences by integrating a token deletion mechanism into the encoder, offering a more efficient approach for byte-level models.

Model Features

Dynamic Token Merging
A learnable deletion gate decides, per token, whether to keep or delete it, substantially shortening the sequence the rest of the model must process.
Efficient Byte Processing
Operates directly on UTF-8 byte streams with no tokenizer, which makes multilingual text straightforward to handle.
Soft Deletion Training
Uses a softmax1 attention mechanism and a PI controller to keep the deletion rate stable during training.
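The three features above can be illustrated together. The following is a minimal sketch, not MrT5's actual implementation: the gate weights, threshold, and PI gains are all hypothetical stand-ins, and only the general shapes of the mechanisms (a sigmoid deletion gate, the softmax1 variant that can sum to less than one, and a PI update on a deletion-loss weight) follow the description.

```python
import math
import random

def softmax1(scores):
    # softmax1 adds 1 to the denominator, so the weights can sum to less
    # than one ("attend to nothing"); this lets soft-deleted positions
    # receive near-zero attention. Computed with a max-shift for stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    denom = math.exp(-m) + sum(exps)  # exp(-m) is the shifted "+1" term
    return [e / denom for e in exps]

def delete_gate(hidden, w, b=0.0, threshold=0.5):
    # Hypothetical deletion gate: a sigmoid over a linear score of each
    # token's hidden state gives a keep-probability; tokens below the
    # threshold are dropped, shortening the sequence.
    kept = []
    for h in hidden:
        score = sum(hi * wi for hi, wi in zip(h, w)) + b
        p_keep = 1.0 / (1.0 + math.exp(-score))
        if p_keep >= threshold:
            kept.append(h)
    return kept

def pi_update(alpha, target_rate, observed_rate, integral, kp=0.5, ki=0.1):
    # Hypothetical PI controller: nudges the deletion-loss weight alpha so
    # the observed deletion rate tracks the target rate during training.
    error = target_rate - observed_rate
    integral += error
    return max(0.0, alpha + kp * error + ki * integral), integral

random.seed(0)
hidden = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
w = [random.gauss(0, 1) for _ in range(4)]
kept = delete_gate(hidden, w)
print(f"{len(kept)} of {len(hidden)} tokens kept")
```

In the real model the gate is trained jointly with the rest of the network, and deletion is "soft" during training (via attention down-weighting) before becoming hard at inference.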

Model Capabilities

Multilingual Text Generation
Sequence-to-Sequence Transformation
Efficient Byte-Level Processing
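Byte-level processing means the "vocabulary" is just the 256 possible byte values, so any language encodable as UTF-8 is supported without a learned tokenizer. A minimal sketch of ByT5-style byte tokenization (the offset of 3 for reserved special tokens follows ByT5's convention; treat the details as illustrative):

```python
def text_to_byte_ids(text, offset=3):
    # Encode text to UTF-8 and shift each byte by a small offset reserved
    # for special tokens (e.g. pad/eos/unk), ByT5-style.
    return [b + offset for b in text.encode("utf-8")]

ids = text_to_byte_ids("héllo")
# 'é' encodes to two UTF-8 bytes, so the sequence is longer than 5 characters
print(len(ids))  # → 6
```

The cost of this simplicity is long sequences, which is exactly what MrT5's token merging is designed to offset.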

Use Cases

Academic Research
Language Model Efficiency Research
Used to study the impact of dynamic token merging on model efficiency
Sequence length reduced by an average of 50%
Natural Language Processing
Multilingual Text Generation
Supports text generation tasks in 15 languages