Open-source ByT5-base Model - Supports Multilingual Text Processing and Has Strong Robustness to Noisy Data

ByT5 Base

Developed by Google

ByT5 is a tokenizer-free version of Google's T5 that directly processes UTF-8 byte sequences, supporting multilingual text processing with robustness to noisy data.

Large Language Model, Supports Multiple Languages, Open Source License: Apache-2.0
#Byte-level processing #Multilingual support #Noise robustness

Downloads 24.17k

Release Time: 3/2/2022

Model Overview

ByT5 is a pre-trained language model that operates directly on raw byte sequences without tokenization, suitable for multilingual text generation and understanding tasks.

Model Features

Tokenizer-free processing

Directly processes UTF-8 byte sequences without relying on tokenizers, reducing preprocessing complexity.

Multilingual support

Natively supports over 100 languages and can immediately process text in any language.

Noise robustness

More robust than token-based models on noisy text, such as text with spelling errors or non-standard formatting.

Unified architecture

Based on a standard Transformer architecture, with only minimal modifications needed to handle byte sequences.

Model Capabilities

Multilingual text generation

Text understanding

Machine translation

Text summarization

Use Cases

Natural Language Processing

Multilingual text generation

Generates coherent text in different languages

Outperforms token-based models on tasks like TweetQA

Noisy text processing

Handles text with spelling errors or non-standard formats

Demonstrates stronger robustness to noisy data

🚀 ByT5 - Base

ByT5 is a tokenizer-free version of Google's T5 that generally follows the architecture of mT5. It addresses the limitations of traditional token-based models by operating directly on raw text, which makes it more robust to noisy data.

🚀 Quick Start

ByT5 is a tokenizer-free alternative to Google's T5 that follows the architecture of mT5. It was pre-trained on mC4 with an average span-mask of 20 UTF-8 characters and requires fine-tuning for downstream tasks.

✨ Features

Tokenizer-free: Operates directly on raw UTF-8 bytes, eliminating the need for a tokenizer.
Robust to noisy data: Significantly outperforms [mt5-base](https://huggingface.co/google/mt5-base) on tasks like TweetQA.
Multilingual support: Can process text in multiple languages out of the box.
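
As a rough illustration of what operating on raw UTF-8 bytes implies, the sketch below counts the bytes that a few strings expand to. Any text in any script can be encoded this way, but non-ASCII characters take more than one byte, so byte sequences are longer than character sequences. The helper name `byte_length` and the sample strings are illustrative assumptions, not part of the model card.

```python
def byte_length(text: str) -> int:
    """Return the number of UTF-8 bytes in `text`.

    For a byte-level model this is the input length before any
    special tokens (such as an end-of-sequence marker) are added.
    """
    return len(text.encode("utf-8"))


# Illustrative samples: ASCII, Latin with a diacritic, and CJK text.
samples = ["hello", "boîte", "日本語"]
for s in samples:
    print(f"{s!r}: {len(s)} characters, {byte_length(s)} bytes")
```

The extra length for non-Latin scripts is the cost side of the trade-off: no vocabulary is needed, but the model has to process longer sequences.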

💻 Usage Examples

Basic Usage

ByT5 works on raw UTF-8 bytes and can be used without a tokenizer:

from transformers import T5ForConditionalGeneration
import torch

model = T5ForConditionalGeneration.from_pretrained('google/byt5-base')

input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + 3  # add 3 for special tokens
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + 3  # add 3 for special tokens

loss = model(input_ids, labels=labels).loss # forward pass
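
The `+ 3` in the snippet above shifts every byte value past three reserved IDs. As a minimal sketch, assuming the ID layout used by the Hugging Face ByT5 tokenizer (0 for padding, 1 for end of sequence, 2 for unknown), the mapping and its inverse can be written in plain Python. The helper names `encode_bytes` and `decode_bytes` are hypothetical and exist only for this illustration.

```python
from typing import List

SPECIAL_OFFSET = 3  # assumption: IDs 0, 1, 2 are reserved for pad, EOS, unk
NUM_BYTES = 256     # number of distinct byte values


def encode_bytes(text: str) -> List[int]:
    """Map text to byte-level IDs by shifting each UTF-8 byte by the offset."""
    return [b + SPECIAL_OFFSET for b in text.encode("utf-8")]


def decode_bytes(ids: List[int]) -> str:
    """Invert encode_bytes, skipping any ID outside the byte range."""
    raw = bytes(
        i - SPECIAL_OFFSET
        for i in ids
        if SPECIAL_OFFSET <= i < SPECIAL_OFFSET + NUM_BYTES
    )
    # Invalid byte sequences are dropped rather than raising an error.
    return raw.decode("utf-8", errors="ignore")


ids = encode_bytes("Life is like a box of chocolates.")
print(ids[:4])            # IDs for "Life"
print(decode_bytes(ids))  # round-trips to the original string
```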

Advanced Usage

For batched inference and training, it is recommended to use a tokenizer class for padding:

from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('google/byt5-base')
tokenizer = AutoTokenizer.from_pretrained('google/byt5-base')

model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids

loss = model(**model_inputs, labels=labels).loss # forward pass
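
Conceptually, padding to the longest sequence appends a pad ID to the shorter sequences and records the real positions in an attention mask. The plain-Python sketch below mimics that behaviour under the assumption that the pad ID is 0, the end-of-sequence ID is 1, and bytes are offset by 3, as in the Hugging Face ByT5 tokenizer. The helpers `pad_batch` and `mask_label_padding` are hypothetical. The second helper shows a common practice with the Transformers T5 classes, whose cross-entropy loss ignores label positions set to -100, so padded label positions do not contribute to the loss.

```python
from typing import List, Tuple

PAD_ID, EOS_ID, OFFSET = 0, 1, 3  # assumed ByT5 ID layout
IGNORE_INDEX = -100               # label value ignored by the loss


def pad_batch(texts: List[str]) -> Tuple[List[List[int]], List[List[int]]]:
    """Encode texts as byte IDs plus EOS, then right-pad to the longest one."""
    seqs = [[b + OFFSET for b in t.encode("utf-8")] + [EOS_ID] for t in texts]
    longest = max(len(s) for s in seqs)
    input_ids = [s + [PAD_ID] * (longest - len(s)) for s in seqs]
    attention_mask = [[1] * len(s) + [0] * (longest - len(s)) for s in seqs]
    return input_ids, attention_mask


def mask_label_padding(label_ids: List[List[int]]) -> List[List[int]]:
    """Replace pad positions in labels so the loss skips them."""
    return [[IGNORE_INDEX if i == PAD_ID else i for i in row] for row in label_ids]


input_ids, attention_mask = pad_batch(["Life is like a box of chocolates.", "Today is Monday."])
print(len(input_ids[0]), len(input_ids[1]))  # both rows share one length
```

With tensors, the same label masking is often written as `labels[labels == tokenizer.pad_token_id] = -100` before the forward pass.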

📚 Documentation

Abstract

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. Since byte or character sequences are longer than token sequences, past work on token-free models has often introduced new model architectures designed to amortize the cost of operating directly on raw text. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.

🔧 Technical Details

The model was pre-trained only on mC4, without any supervised training, using an average span-mask of 20 UTF-8 characters. This pre-training approach allows the model to handle raw text effectively, but the model must be fine-tuned before it is usable on downstream tasks.
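
To illustrate what span-mask pre-training means, the following is a simplified sketch of T5-style span corruption applied to byte IDs: each selected span is replaced by a single sentinel ID in the input, and the target lists each sentinel followed by the IDs it replaced. The function `corrupt_spans`, the hand-picked spans, and the sentinel value are all hypothetical. Actual pre-training samples spans at random and uses the model's own sentinel IDs, neither of which is reproduced here.

```python
from typing import List, Sequence, Tuple


def corrupt_spans(
    ids: Sequence[int],
    spans: Sequence[Tuple[int, int]],
    first_sentinel: int,
) -> Tuple[List[int], List[int]]:
    """Replace each [start, end) span with a sentinel and build the target.

    Spans must not overlap. Sentinels count down from `first_sentinel`,
    which is a placeholder value chosen by the caller (an assumption,
    not the real ByT5 sentinel scheme).
    """
    inputs: List[int] = []
    targets: List[int] = []
    cursor = 0
    for n, (start, end) in enumerate(sorted(spans)):
        sentinel = first_sentinel - n
        inputs.extend(ids[cursor:start])  # keep text before the span
        inputs.append(sentinel)           # mark the masked span
        targets.append(sentinel)          # target: sentinel, then hidden IDs
        targets.extend(ids[start:end])
        cursor = end
    inputs.extend(ids[cursor:])           # keep text after the last span
    return inputs, targets


byte_ids = [b + 3 for b in "Life is like a box of chocolates.".encode("utf-8")]
corrupted, target = corrupt_spans(byte_ids, spans=[(5, 12)], first_sentinel=999)
print(len(byte_ids), len(corrupted), len(target))
```

The model is trained to produce the target from the corrupted input, which is why the raw checkpoint fills in masked spans rather than following task instructions.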

📄 License

This model is licensed under the Apache 2.0 license.

Information Table

Property	Details
Languages	multilingual, af, am, ar, az, be, bg, bn, ca, ceb, co, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fil, fr, fy, ga, gd, gl, gu, ha, haw, hi, hmn, ht, hu, hy, ig, is, it, iw, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tr, uk, und, ur, uz, vi, xh, yi, yo, zh, zu
Datasets	mc4
Model Type	ByT5-Base
License	apache-2.0

Paper Information

Paper: ByT5: Towards a token-free future with pre-trained byte-to-byte models
Authors: Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel
