🚀 BigBird in 🤗Transformers
This repository focuses on the BigBird model in the 🤗Transformers library. BigBird addresses the limitations of traditional Transformer-based models in handling long sequences, offering a more efficient solution for tasks like long-document summarization and question answering with long contexts.
✨ Features
- Low-complexity Attention: BigBird uses block sparse attention instead of the full attention mechanism, reducing the O(n^2) time and memory complexity of traditional Transformers. This allows it to handle sequences of up to 4096 tokens at a much lower computational cost (a configuration sketch follows this list).
- SOTA Performance: BigBird has achieved state-of-the-art results on various tasks involving very long sequences, such as long-document summarization and question answering with long contexts.
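Once 🤗Transformers is installed (see below), the sparse attention can be configured directly when loading a checkpoint. The sketch below assumes PyTorch is installed; attention_type, block_size and num_random_blocks are BigBird configuration options, shown here with their commonly used values:
from transformers import BigBirdModel
# Sketch: enable block sparse attention explicitly when loading the encoder.
model = BigBirdModel.from_pretrained(
    'google/bigbird-roberta-base',
    attention_type='block_sparse',   # sparse pattern instead of full attention
    block_size=64,                   # tokens per block
    num_random_blocks=3,             # random key blocks per query block
)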
📦 Installation
The BigBird RoBERTa-like model is available in 🤗Transformers. You can install the 🤗Transformers library using the following command:
pip install transformers
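The usage examples below rely on PyTorch, so you may also need:
pip install torch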
💻 Usage Examples
Basic Usage
from transformers import BigBirdTokenizer, BigBirdForSequenceClassification
import torch
tokenizer = BigBirdTokenizer.from_pretrained('google/bigbird-roberta-base')
model = BigBirdForSequenceClassification.from_pretrained('google/bigbird-roberta-base')
text = "This is an example sentence for BigBird."
inputs = tokenizer(text, return_tensors='pt')
outputs = model(**inputs)
logits = outputs.logits
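Note that google/bigbird-roberta-base is a pretrained encoder without a task head, so the classification head above is randomly initialized and the logits only become meaningful after fine-tuning. As a small follow-up sketch, the logits can be turned into a predicted class id:
# Pick the class with the highest logit (meaningful only after fine-tuning).
predicted_class_id = logits.argmax(dim=-1).item()
print(predicted_class_id)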
Advanced Usage
from transformers import BigBirdTokenizer, BigBirdForQuestionAnswering
import torch
tokenizer = BigBirdTokenizer.from_pretrained('google/bigbird-roberta-base')
model = BigBirdForQuestionAnswering.from_pretrained('google/bigbird-roberta-base')
context = "This is a very long context. It contains a lot of information that can be used to answer questions. BigBird is designed to handle such long sequences efficiently."
question = "What is designed to handle long sequences efficiently?"
inputs = tokenizer(question, context, return_tensors='pt')
outputs = model(**inputs)
start_logits = outputs.start_logits
end_logits = outputs.end_logits
answer_start = torch.argmax(start_logits)
answer_end = torch.argmax(end_logits) + 1
answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(inputs['input_ids'][0][answer_start:answer_end]))
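As with the classification example, the base checkpoint has no trained question-answering head, so the extracted span is only meaningful after fine-tuning (or when using a fine-tuned checkpoint). For genuinely long contexts, a sketch of running up to the checkpoint's 4096-token limit follows; note that the implementation may fall back to full attention when the input is too short for the block sparse pattern:
# Sketch: tokenize a long question/context pair up to the 4096-token limit.
inputs = tokenizer(question, context, return_tensors='pt', truncation=True, max_length=4096)
outputs = model(**inputs)
answer_start = torch.argmax(outputs.start_logits)
answer_end = torch.argmax(outputs.end_logits) + 1
# tokenizer.decode is a convenient alternative to convert_tokens_to_string.
answer = tokenizer.decode(inputs['input_ids'][0][answer_start:answer_end])
print(answer)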
📚 Documentation
Model Details
- Model Type: BigBird is a Transformer-based model that uses block sparse attention.
- Training Data: The google/bigbird-roberta-base checkpoint is pretrained on large English text corpora, much like RoBERTa; task-specific checkpoints are obtained by fine-tuning it on datasets for long-document summarization and question answering.
Attention Mechanism
BigBird's block sparse attention approximates the full attention matrix with a combination of global attention (a few tokens that attend to, and are attended by, the whole sequence), sliding-window attention over neighboring tokens, and random attention to a handful of other tokens. This lets each token attend to only a small, well-chosen subset of the sequence, reducing computational cost while maintaining performance.
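As an illustrative sketch (not the actual 🤗Transformers implementation), the block-level pattern can be pictured as a boolean mask over pairs of blocks that combines a sliding window, global blocks, and a few random blocks:
import numpy as np

def block_sparse_pattern(num_blocks, num_random_blocks=3, seed=0):
    # True at [i, j] means query block i may attend to key block j.
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_blocks, num_blocks), dtype=bool)
    for i in range(num_blocks):
        # Sliding window: each block attends to itself and its immediate neighbours.
        for j in (i - 1, i, i + 1):
            if 0 <= j < num_blocks:
                mask[i, j] = True
        # Random attention: a few extra key blocks chosen at random.
        mask[i, rng.choice(num_blocks, size=num_random_blocks, replace=False)] = True
    # Global attention: the first and last blocks attend everywhere and are attended by all.
    mask[[0, -1], :] = True
    mask[:, [0, -1]] = True
    return mask

print(block_sparse_pattern(8).astype(int))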
Key Questions and Answers
- Do all tokens really have to attend to all other tokens? No. BigBird's sparse attention lets each token attend to only a subset of the sequence, and empirically this loses little quality.
- How to decide which tokens are important? BigBird does not learn a per-example selection; it relies on a fixed pattern of global tokens, a sliding window over neighbors, and a few randomly chosen tokens, which together cover the sequence well in practice.
- How to attend to just a few tokens in a very efficient way? BigBird gathers keys and values into blocks, so the sparse pattern can be computed with dense, hardware-friendly matrix multiplications rather than scattered lookups.
🔧 Technical Details
The main technical innovation of BigBird is its block sparse attention mechanism. Instead of computing attention between all pairs of tokens in a sequence (which has O(n^2) complexity), BigBird splits the sequence into fixed-size blocks; each query block attends to a sliding window of neighboring blocks, a small number of global blocks, and a few randomly selected blocks. Because the number of attended blocks per query block stays roughly constant, time and memory grow approximately linearly with sequence length, making long sequences feasible to process.
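A rough back-of-the-envelope comparison makes the saving concrete. Assuming 64-token blocks and, per query block, roughly 3 window blocks, 2 global blocks and 3 random blocks (a typical configuration), a 4096-token sequence needs about 8x fewer block-pair attention computations than full attention, and the gap widens as sequences grow:
# Rough cost comparison for a 4096-token sequence with 64-token blocks.
seq_len, block_size = 4096, 64
num_blocks = seq_len // block_size            # 64 blocks
full_pairs = num_blocks * num_blocks          # full attention: 4096 block pairs
sparse_pairs = num_blocks * (3 + 2 + 3)       # ~3 window + 2 global + 3 random per query block
print(full_pairs / sparse_pairs)              # 8.0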
📄 License
This project is licensed under multiple licenses, including BSD-3-Clause and Apache-2.0.