🚀 BigBird large model
BigBird is a sparse-attention based transformer that extends Transformer-based models such as BERT to much longer sequences. It also comes with a theoretical understanding of the capabilities of a complete transformer that the sparse model can handle. This model is pretrained on English text with a masked language modeling (MLM) objective. It was introduced in the paper Big Bird: Transformers for Longer Sequences and first released in this repository.
Disclaimer: The team releasing BigBird did not write a model card for this model, so this model card has been written by the Hugging Face team.
🚀 Quick Start
BigBird is designed for long input sequences. The examples below show how to load the model and extract features from text.
✨ Features
- Block Sparse Attention: BigBird relies on block sparse attention instead of the full attention used in models like BERT. This lets it handle sequences of up to 4096 tokens at a much lower compute cost than BERT.
- SOTA Performance: It has achieved state-of-the-art results on tasks involving very long sequences, such as long document summarization and question answering with long contexts.
💻 Usage Examples
Basic Usage
from transformers import BigBirdModel, BigBirdTokenizer
# load the tokenizer that matches the checkpoint
tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-large")
# by default the model runs in block sparse attention mode
model = BigBirdModel.from_pretrained("google/bigbird-roberta-large")
# alternatively, switch to full attention (like BERT)
model = BigBirdModel.from_pretrained("google/bigbird-roberta-large", attention_type="original_full")
# or change the block size and number of random blocks used by block sparse attention
model = BigBirdModel.from_pretrained("google/bigbird-roberta-large", block_size=16, num_random_blocks=2)
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
Advanced Usage
You can further customize the model according to your needs. For example, adjusting block_size and num_random_blocks trades off attended context against compute for different sequence lengths and hardware budgets, as sketched below.
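A minimal sketch of such a setup; the block_size and num_random_blocks values and the repeated input text below are purely illustrative, not tuned recommendations:
import torch
from transformers import BigBirdModel, BigBirdTokenizer
tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-large")
# larger blocks and more random blocks increase the attended context (and the compute cost)
model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-large",
    attention_type="block_sparse",
    block_size=64,
    num_random_blocks=3,
)
# a long (here artificially repeated) document, truncated to the 4096-token limit
long_text = "BigBird can read long documents. " * 500
encoded = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    features = model(**encoded).last_hidden_state
print(features.shape)  # (1, sequence_length, hidden_size)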
📚 Documentation
Model Description
BigBird uses block sparse attention, enabling it to handle long sequences more efficiently than full-attention models. It can process sequences of up to 4096 tokens, a significant improvement over the 512-token limit of models like BERT.
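As a quick sanity check, these limits can be read from the checkpoint's configuration; the values noted in the comments are what we would expect from the published config, not guaranteed output:
from transformers import BigBirdConfig
config = BigBirdConfig.from_pretrained("google/bigbird-roberta-large")
print(config.max_position_embeddings)  # should report 4096 for this checkpoint
print(config.attention_type)           # should report "block_sparse" by default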
Training Data
This model is pre-trained on four publicly available datasets: Books, CC-News, Stories, and Wikipedia. It uses the same sentencepiece vocabulary as RoBERTa (which is in turn borrowed from GPT2).
Training Procedure
Documents longer than 4096 tokens were split into multiple documents, and documents much shorter than 4096 tokens were joined. As in the original BERT training, 15% of tokens were masked and the model was trained to predict them. The model was warm-started from RoBERTa's checkpoint.
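As an illustrative sketch of the MLM objective described above, the released checkpoint's MLM head can be used to predict a masked token (the example sentence is made up for illustration):
import torch
from transformers import BigBirdTokenizer, BigBirdForMaskedLM
tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-large")
model = BigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-large")
# mask a single token and let the model predict it
text = f"BigBird uses block sparse {tokenizer.mask_token} to handle long sequences."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))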
🔧 Technical Details
BigBird's key innovation is its block sparse attention mechanism: each query attends to a small set of sliding-window, random, and global blocks rather than to every other token, reducing the cost of attention from quadratic to roughly linear in sequence length. This makes it well suited to tasks that require processing long texts, such as document summarization and long-context question answering.
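For intuition, here is a rough back-of-the-envelope comparison of the number of attention entries computed by full attention versus block sparse attention; the per-block counts are illustrative assumptions, not the exact layout used by the checkpoint:
# rough comparison of attention "cells" computed per layer
seq_len = 4096
block_size = 64
num_blocks = seq_len // block_size
# full attention: every token attends to every token
full_cells = seq_len * seq_len
# block sparse attention: assume each query block attends to a handful of blocks
# (a sliding window, a few random blocks and a few global blocks -- illustrative counts)
attended_blocks_per_query_block = 3 + 3 + 2
sparse_cells = num_blocks * attended_blocks_per_query_block * block_size * block_size
print(f"full attention:   {full_cells:,} cells")
print(f"block sparse:     {sparse_cells:,} cells")
print(f"reduction factor: {full_cells / sparse_cells:.1f}x")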
📄 License
This model is released under the Apache-2.0 license.
📚 BibTeX entry and citation info
@misc{zaheer2021big,
      title={Big Bird: Transformers for Longer Sequences},
      author={Manzil Zaheer and Guru Guruganesh and Avinava Dubey and Joshua Ainslie and Chris Alberti and Santiago Ontanon and Philip Pham and Anirudh Ravula and Qifan Wang and Li Yang and Amr Ahmed},
      year={2021},
      eprint={2007.14062},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}