
Pile-T5 XXL

Developed by EleutherAI
Pile-T5 XXL is an encoder-decoder model trained on The Pile dataset using the T5x library with a masked language modeling (MLM) objective similar to that of the original T5. It was trained for 2 million steps, approximately 2 trillion tokens.
Release Time: 1/16/2024

Model Overview

Pile-T5 is primarily intended for research purposes, and the internal representations it learns from English text can be used to extract features for downstream tasks. Beyond research, users can fine-tune and deploy the model under the Apache 2.0 license.
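As an illustration of feature extraction, the sketch below runs only the encoder and mean-pools its hidden states into a sentence-level vector. It assumes the checkpoint is published on the Hugging Face Hub as EleutherAI/pile-t5-xxl (the Hub ID is an assumption based on this page) and uses the generic transformers Auto classes:

```python
# Minimal feature-extraction sketch; the Hub ID "EleutherAI/pile-t5-xxl"
# is an assumption based on this page, not confirmed by it.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-xxl")

inputs = tokenizer("The Pile is a large, diverse text corpus.", return_tensors="pt")
with torch.no_grad():
    # Run only the encoder to get token-level representations.
    encoder_out = model.get_encoder()(**inputs)

# Mean-pool token states into one feature vector for a downstream task.
features = encoder_out.last_hidden_state.mean(dim=1)
print(features.shape)  # (1, hidden_size)
```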

Model Features

Large-scale Training
Trained for 2 million steps (approximately 2 trillion tokens) on The Pile dataset, giving it strong language understanding capabilities.
Efficient Architecture
Built on the scalable T5x model architecture, drawing from UMT5's implementation, and uses the LlamaTokenizer (see the loading sketch after this list).
Research-Oriented
Primarily intended for research purposes, suitable for extracting downstream task features and conducting fine-tuning experiments.
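A minimal loading sketch, again assuming the Hub ID EleutherAI/pile-t5-xxl. The Auto classes resolve the concrete architecture and tokenizer from the checkpoint's config, so the UMT5-derived implementation and Llama-based tokenizer mentioned above need not be named explicitly; the expected names in the comments are assumptions:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-xxl")

print(type(tokenizer).__name__)  # expected: a LlamaTokenizer(Fast) variant
print(model.config.model_type)   # expected: reflects the UMT5-derived architecture
```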

Model Capabilities

Text Generation
Text Mask Prediction (see the sketch after this list)
Downstream Task Feature Extraction
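The mask-prediction capability can be exercised as in the sketch below, which corrupts a span with a sentinel token and asks the model to fill it in. The T5-style `<extra_id_N>` sentinel naming and the Hub ID are assumptions, not confirmed by this page:

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pile-t5-xxl")
model = AutoModelForSeq2SeqLM.from_pretrained("EleutherAI/pile-t5-xxl")

# Replace a span with a sentinel token; the model predicts the masked span.
text = "The Pile is a <extra_id_0> dataset for training language models."
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=False))
```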

Use Cases

Academic Research
Language Model Research
Used to study the internal representations and behavioral characteristics of large-scale language models.
Downstream Task Fine-Tuning
As a pre-trained model, it can be fine-tuned for specific tasks (a minimal sketch follows).
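A single-step fine-tuning sketch, for illustration only; a real run on an XXL-scale model would need a dataset, a Trainer or accelerate setup, and substantial GPU memory. The task prefix, example text, and hyperparameters are placeholders:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "EleutherAI/pile-t5-xxl"  # Hub ID assumed from this page
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

inputs = tokenizer("summarize: The Pile is a large, diverse text corpus "
                   "assembled for language model training.",
                   return_tensors="pt")
labels = tokenizer("A large, diverse training corpus.",
                   return_tensors="pt").input_ids

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
loss = model(**inputs, labels=labels).loss  # cross-entropy over target tokens
loss.backward()
optimizer.step()
print(float(loss))
```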