🚀 BigBird large model
BigBird is a sparse-attention based transformer that extends Transformer-based models such as BERT to much longer sequences. It also comes with a theoretical understanding of the capabilities of a complete transformer that the sparse model can handle. This model is pretrained on English text with a masked language modeling (MLM) objective. It was introduced in the paper Big Bird: Transformers for Longer Sequences and first released in this repository.
Disclaimer: The team releasing BigBird did not write a model card for this model, so this model card has been written by the Hugging Face team.
🚀 Quick Start
BigBird is designed for long input sequences. The examples below show how to load the model and extract features from text.
✨ Features
- Block Sparse Attention: BigBird relies on block sparse attention instead of the full attention used in models like BERT. This lets it handle sequences of up to 4096 tokens at a much lower compute cost than BERT.
- SOTA Performance: It has achieved state-of-the-art results on tasks involving very long sequences, such as long document summarization and question answering with long contexts.
💻 Usage Examples
Basic Usage
from transformers import BigBirdModel, BigBirdTokenizer
# load the tokenizer that matches the checkpoint
tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-large")
# by default the model runs in block sparse attention mode
model = BigBirdModel.from_pretrained("google/bigbird-roberta-large")
# alternatively, switch to full attention (like BERT)
model = BigBirdModel.from_pretrained("google/bigbird-roberta-large", attention_type="original_full")
# or change the block size and number of random blocks used by block sparse attention
model = BigBirdModel.from_pretrained("google/bigbird-roberta-large", block_size=16, num_random_blocks=2)
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
Advanced Usage
You can further customize the model according to your needs. For example, adjusting block_size and num_random_blocks trades off attended context against compute for different sequence lengths and hardware budgets, as sketched below.
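A minimal sketch of such a setup; the block_size and num_random_blocks values and the repeated input text below are purely illustrative, not tuned recommendations:
import torch
from transformers import BigBirdModel, BigBirdTokenizer
tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-large")
# larger blocks and more random blocks increase the attended context (and the compute cost)
model = BigBirdModel.from_pretrained(
    "google/bigbird-roberta-large",
    attention_type="block_sparse",
    block_size=64,
    num_random_blocks=3,
)
# a long (here artificially repeated) document, truncated to the 4096-token limit
long_text = "BigBird can read long documents. " * 500
encoded = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=4096)
with torch.no_grad():
    features = model(**encoded).last_hidden_state
print(features.shape)  # (1, sequence_length, hidden_size)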
📚 Documentation
Model Description
BigBird uses block sparse attention, enabling it to handle long sequences more efficiently than full-attention models. It can process sequences of up to 4096 tokens, a significant improvement over the 512-token limit of models like BERT.
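As a quick sanity check, these limits can be read from the checkpoint's configuration; the values noted in the comments are what we would expect from the published config, not guaranteed output:
from transformers import BigBirdConfig
config = BigBirdConfig.from_pretrained("google/bigbird-roberta-large")
print(config.max_position_embeddings)  # should report 4096 for this checkpoint
print(config.attention_type)           # should report "block_sparse" by default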
Training Data
This model is pre-trained on four publicly available datasets: Books, CC-News, Stories, and Wikipedia. It uses the same sentencepiece vocabulary as RoBERTa (which is in turn borrowed from GPT2).
Training Procedure
Documents longer than 4096 tokens were split into multiple documents, and documents much shorter than 4096 tokens were joined. As in the original BERT training, 15% of tokens were masked and the model was trained to predict them. The model was warm-started from RoBERTa's checkpoint.
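As an illustrative sketch of the MLM objective described above, the released checkpoint's MLM head can be used to predict a masked token (the example sentence is made up for illustration):
import torch
from transformers import BigBirdTokenizer, BigBirdForMaskedLM
tokenizer = BigBirdTokenizer.from_pretrained("google/bigbird-roberta-large")
model = BigBirdForMaskedLM.from_pretrained("google/bigbird-roberta-large")
# mask a single token and let the model predict it
text = f"BigBird uses block sparse {tokenizer.mask_token} to handle long sequences."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_index = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))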
🔧 Technical Details
BigBird's key innovation is its block sparse attention mechanism: each query attends to a small set of sliding-window, random, and global blocks rather than to every other token, reducing the cost of attention from quadratic to roughly linear in sequence length. This makes it well suited to tasks that require processing long texts, such as document summarization and long-context question answering.
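For intuition, here is a rough back-of-the-envelope comparison of the number of attention entries computed by full attention versus block sparse attention; the per-block counts are illustrative assumptions, not the exact layout used by the checkpoint:
# rough comparison of attention "cells" computed per layer
seq_len = 4096
block_size = 64
num_blocks = seq_len // block_size
# full attention: every token attends to every token
full_cells = seq_len * seq_len
# block sparse attention: assume each query block attends to a handful of blocks
# (a sliding window, a few random blocks and a few global blocks -- illustrative counts)
attended_blocks_per_query_block = 3 + 3 + 2
sparse_cells = num_blocks * attended_blocks_per_query_block * block_size * block_size
print(f"full attention:   {full_cells:,} cells")
print(f"block sparse:     {sparse_cells:,} cells")
print(f"reduction factor: {full_cells / sparse_cells:.1f}x")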
📄 License
This model is released under the Apache-2.0 license.
📚 BibTeX entry and citation info
@misc{zaheer2021big,
      title={Big Bird: Transformers for Longer Sequences},
      author={Manzil Zaheer and Guru Guruganesh and Avinava Dubey and Joshua Ainslie and Chris Alberti and Santiago Ontanon and Philip Pham and Anirudh Ravula and Qifan Wang and Li Yang and Amr Ahmed},
      year={2021},
      eprint={2007.14062},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}