Open-source ByT5-small Model - Free Deployment, Supports Multilingual Text Processing, Excels at Handling Noisy Data

ByT5 Small

Developed by Google

ByT5 is a tokenizer-free version of Google's T5 that directly processes raw UTF-8 bytes, supporting multilingual text processing with excellent performance on noisy data.

Large Language Model · Supports Multiple Languages · Open Source License: Apache-2.0

#Byte-level processing #Multilingual support #Noise robustness

Downloads 1.4M

Release Time: 3/2/2022

Model Overview

ByT5 is a tokenizer-free pre-trained model based on the T5 architecture that directly processes byte sequences instead of tokens, supports multiple languages, and is particularly suitable for handling noisy text data.

Model Features

Tokenizer-free design

Directly processes raw UTF-8 bytes without a tokenizer, simplifying text processing workflows.

Multilingual support

Pre-trained on the multilingual mC4 corpus, covering more than 100 languages (listed in the information table below).

Noise robustness

Handles noisy text, such as misspellings and non-standard spelling, better than comparable token-based models; for example, google/byt5-small outperforms mt5-small on TweetQA.

Unified architecture

Based on the standard Transformer architecture, requiring minimal modifications to process byte sequences.
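The tokenizer-free design described above comes down to a fixed mapping between bytes and input IDs. The following is a minimal sketch of that mapping, assuming the offset of 3 used in the usage examples on this page (IDs 0, 1 and 2 reserved for special tokens). The helper names `encode_bytes` and `decode_bytes` are hypothetical and are not part of any library.

```python
from typing import List

# Assumption: IDs 0, 1, 2 are reserved for special tokens, so byte values are shifted by 3.
NUM_SPECIAL_TOKENS = 3


def encode_bytes(text: str) -> List[int]:
    """Map each UTF-8 byte of `text` to an input ID by shifting it past the reserved IDs."""
    return [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")]


def decode_bytes(ids: List[int]) -> str:
    """Invert `encode_bytes`, skipping reserved IDs and any ID above the byte range."""
    low, high = NUM_SPECIAL_TOKENS, NUM_SPECIAL_TOKENS + 256
    raw = bytes(i - NUM_SPECIAL_TOKENS for i in ids if low <= i < high)
    return raw.decode("utf-8", errors="ignore")


ids = encode_bytes("boîte")
print(ids)                # the non-ASCII "î" occupies two positions
print(decode_bytes(ids))
```

Because the mapping is a fixed arithmetic shift, there is no vocabulary file to train or ship, and any UTF-8 string is representable.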

Model Capabilities

Text generation

Text understanding

Multilingual translation

Noisy text processing

Use Cases

Text generation

Multilingual text generation

Once fine-tuned, generates text content in multiple languages, which suits international applications.

Can be fine-tuned to produce fluent text across languages.

Text translation

Multilingual translation

Translates text from one language to another.

Performs well across multiple language pairs.

Noisy text processing

Social media text processing

Processes social media text containing spelling errors and non-standard usage.

Outperforms token-based models in tasks like TweetQA.

🚀 ByT5 - Small

ByT5 is a tokenizer-free version of Google's T5 that generally follows the architecture of mT5. Because it needs no tokenizer, it can process text in many languages and copes well with noisy text data.

📋 Information Table

Property	Details
Supported Languages	multilingual, af, am, ar, az, be, bg, bn, ca, ceb, co, cs, cy, da, de, el, en, eo, es, et, eu, fa, fi, fil, fr, fy, ga, gd, gl, gu, ha, haw, hi, hmn, ht, hu, hy, ig, is, it, iw, ja, jv, ka, kk, km, kn, ko, ku, ky, la, lb, lo, lt, lv, mg, mi, mk, ml, mn, mr, ms, mt, my, ne, nl, no, ny, pa, pl, ps, pt, ro, ru, sd, si, sk, sl, sm, sn, so, sq, sr, st, su, sv, sw, ta, te, tg, th, tr, uk, und, ur, uz, vi, xh, yi, yo, zh, zu
Datasets	mc4
License	apache-2.0

🚀 Quick Start

ByT5 was pre-trained only on mC4, without any supervised training, using span masking with an average span length of 20 UTF-8 characters. The model therefore has to be fine-tuned before it is usable on a downstream task.
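To make the span-masking objective mentioned above more concrete, here is a rough sketch of how one masked span turns a byte sequence into an (input, target) pair. It is an illustration under stated assumptions, not the actual pre-training code: the function `corrupt_spans` and the sentinel numbering (`FIRST_SENTINEL_ID`) are hypothetical and are not claimed to match the released vocabulary, and real pre-training samples span positions and lengths randomly rather than taking them as arguments.

```python
BYTE_OFFSET = 3           # assumption: IDs 0-2 are reserved, bytes start at 3
FIRST_SENTINEL_ID = 259   # hypothetical: first ID after the 256 byte values


def corrupt_spans(text, spans):
    """Replace each non-overlapping (start, length) byte span with a sentinel ID.

    Returns (inputs, targets): `inputs` keeps the unmasked bytes with one sentinel
    per masked span; `targets` lists each sentinel followed by the bytes it replaced.
    """
    ids = [b + BYTE_OFFSET for b in text.encode("utf-8")]
    inputs, targets, cursor = [], [], 0
    for n, (start, length) in enumerate(sorted(spans)):
        sentinel = FIRST_SENTINEL_ID + n
        inputs.extend(ids[cursor:start])           # bytes before the span stay visible
        inputs.append(sentinel)                    # the span collapses to one sentinel
        targets.append(sentinel)
        targets.extend(ids[start:start + length])  # the model must reconstruct these
        cursor = start + length
    inputs.extend(ids[cursor:])
    return inputs, targets


# Mask one 20-byte span starting at byte 8 ("like a box of chocol").
inputs, targets = corrupt_spans("Life is like a box of chocolates.", [(8, 20)])
print(len(inputs), len(targets))
```

A model trained only on this reconstruction task has never seen task-specific inputs and outputs, which is why fine-tuning is required before downstream use.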

✨ Features

Tokenizer-free: ByT5 operates directly on raw UTF-8 bytes, eliminating the need for a tokenizer.
Multilingual Support: It supports a wide range of languages, making it suitable for multilingual tasks.
Robust to Noisy Data: ByT5 works especially well on noisy text data. For example, google/byt5-small significantly outperforms mt5-small on TweetQA.
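One intuition for the noise robustness listed above is that a typo changes only the bytes it touches, so the rest of the input sequence is untouched. The sketch below illustrates this for a transposed pair of letters. It assumes the same byte offset of 3 as the usage examples, and the helper names are hypothetical. It only shows the input representation and does not measure model accuracy.

```python
def byte_ids(text):
    """Byte-level input IDs, assuming an offset of 3 for reserved IDs."""
    return [b + 3 for b in text.encode("utf-8")]


def changed_positions(a, b):
    """Indices at which two equal-length byte-ID sequences differ."""
    ids_a, ids_b = byte_ids(a), byte_ids(b)
    if len(ids_a) != len(ids_b):
        raise ValueError("this sketch only compares sequences of equal length")
    return [i for i, (x, y) in enumerate(zip(ids_a, ids_b)) if x != y]


clean = "box of chocolates"
noisy = "box of chocolatse"  # last two letters transposed
print(changed_positions(clean, noisy))  # [15, 16]
```

A subword tokenizer, by contrast, may split the misspelled word into different pieces than the correctly spelled one, so a small typo can change a larger part of the model input.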

💻 Usage Examples

Basic Usage

ByT5 works on raw UTF-8 bytes and can be used without a tokenizer:

from transformers import T5ForConditionalGeneration
import torch

model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')

# The IDs 0, 1 and 2 are reserved for special tokens,
# so raw UTF-8 byte values are shifted by 3 before being passed to the model.
num_special_tokens = 3

input_ids = torch.tensor([list("Life is like a box of chocolates.".encode("utf-8"))]) + num_special_tokens
labels = torch.tensor([list("La vie est comme une boîte de chocolat.".encode("utf-8"))]) + num_special_tokens

loss = model(input_ids, labels=labels).loss  # forward pass

Advanced Usage

For batched inference and training, however, it is recommended to use a tokenizer class, which handles padding:

from transformers import T5ForConditionalGeneration, AutoTokenizer

model = T5ForConditionalGeneration.from_pretrained('google/byt5-small')
tokenizer = AutoTokenizer.from_pretrained('google/byt5-small')

model_inputs = tokenizer(["Life is like a box of chocolates.", "Today is Monday."], padding="longest", return_tensors="pt")
labels = tokenizer(["La vie est comme une boîte de chocolat.", "Aujourd'hui c'est lundi."], padding="longest", return_tensors="pt").input_ids
labels[labels == tokenizer.pad_token_id] = -100  # ignore padded positions when computing the loss

loss = model(**model_inputs, labels=labels).loss # forward pass
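To show what padding a batch involves at the byte level, here is a plain-Python sketch of roughly what `padding="longest"` produces. It assumes pad ID 0, end-of-sequence ID 1 and a byte offset of 3. The function `pad_batch` is hypothetical and is not part of the tokenizer class; in practice the tokenizer should be used as shown above.

```python
PAD_ID, EOS_ID, BYTE_OFFSET = 0, 1, 3  # assumed special-token layout


def pad_batch(texts):
    """Encode texts as byte IDs, append EOS, and right-pad to the longest sequence."""
    encoded = [[b + BYTE_OFFSET for b in t.encode("utf-8")] + [EOS_ID] for t in texts]
    longest = max(len(seq) for seq in encoded)
    input_ids = [seq + [PAD_ID] * (longest - len(seq)) for seq in encoded]
    # 1 marks real positions, 0 marks padding the model should not attend to.
    attention_mask = [[1] * len(seq) + [0] * (longest - len(seq)) for seq in encoded]
    return input_ids, attention_mask


input_ids, attention_mask = pad_batch(
    ["Life is like a box of chocolates.", "Today is Monday."]
)
print([len(row) for row in input_ids])    # both rows share the longest length
print([sum(row) for row in attention_mask])
```

The attention mask is what lets the model tell real bytes from padding, which the tokenizer-free basic example does not need because it processes a single sequence.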

📚 Documentation

Paper

ByT5: Towards a token-free future with pre-trained byte-to-byte models

Authors

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, Colin Raffel

Abstract

Most widely-used pre-trained language models operate on sequences of tokens corresponding to word or subword units. Encoding text as a sequence of tokens requires a tokenizer, which is typically created as an independent artifact from the model. Token-free models that instead operate directly on raw text (bytes or characters) have many benefits: they can process text in any language out of the box, they are more robust to noise, and they minimize technical debt by removing complex and error-prone text preprocessing pipelines. In this paper, we show that a standard Transformer architecture can be used with minimal modifications to process byte sequences. We carefully characterize the trade-offs in terms of parameter count, training FLOPs, and inference speed, and show that byte-level models are competitive with their token-level counterparts. We also demonstrate that byte-level models are significantly more robust to noise and perform better on tasks that are sensitive to spelling and pronunciation. As part of our contribution, we release a new set of pre-trained byte-level Transformer models based on the T5 architecture, as well as all code and data used in our experiments.
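The trade-offs mentioned in the abstract are tied to sequence length: a byte-level model sees one position per UTF-8 byte, and the number of bytes per character depends on the script. The sketch below measures this for a few sample sentences. The Russian and Japanese sentences are illustrative additions not taken from this page, and `bytes_per_char` is a hypothetical helper.

```python
def bytes_per_char(text):
    """Average number of UTF-8 bytes (model input positions) per character."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "English": "Today is Monday.",
    "French": "Aujourd'hui c'est lundi.",
    "Russian": "Сегодня понедельник.",   # illustrative sentence, not from the source
    "Japanese": "今日は月曜日です。",        # illustrative sentence, not from the source
}

for language, sentence in samples.items():
    n_bytes = len(sentence.encode("utf-8"))
    print(f"{language}: {len(sentence)} chars, {n_bytes} bytes, "
          f"{bytes_per_char(sentence):.1f} bytes/char")
```

Longer input sequences for non-Latin scripts are one reason to check maximum sequence length and inference cost when fine-tuning on such languages.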


📄 License

This project is licensed under the Apache-2.0 license.
