🚀 RoBERTa base model
A model pretrained on English text with a masked language modeling (MLM) objective. It learns bidirectional sentence representations and is useful as a starting point for downstream tasks.
🚀 Quick Start
This RoBERTa base model is pretrained on English-language data. You can use the raw model for masked language modeling, but it is mainly intended to be fine-tuned on a downstream task. Check the model hub for fine-tuned versions on tasks that interest you.
✨ Features
- Bidirectional Representation: Learns a bidirectional understanding of sentences through masked language modeling, unlike traditional RNNs that read tokens one after another or autoregressive models (such as GPT) that internally mask future tokens.
- Feature Extraction: Can extract useful features for downstream tasks such as sequence classification, token classification, or question answering.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("Hello I'm a <mask> model.")
[{'sequence': "<s>Hello I'm a male model.</s>",
'score': 0.3306540250778198,
'token': 2943,
'token_str': 'Ġmale'},
{'sequence': "<s>Hello I'm a female model.</s>",
'score': 0.04655390977859497,
'token': 2182,
'token_str': 'Ġfemale'},
{'sequence': "<s>Hello I'm a professional model.</s>",
'score': 0.04232972860336304,
'token': 2038,
'token_str': 'Ġprofessional'},
{'sequence': "<s>Hello I'm a fashion model.</s>",
'score': 0.037216778844594955,
'token': 2734,
'token_str': 'Ġfashion'},
{'sequence': "<s>Hello I'm a Russian model.</s>",
'score': 0.03253649175167084,
'token': 1083,
'token_str': 'ĠRussian'}]
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
And in TensorFlow:
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
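In both cases the returned object exposes a last_hidden_state tensor with one vector per input token. One common pattern (not prescribed by this model card) is to take the representation of the first token (<s>) as a simple sentence-level feature; the following is a minimal PyTorch sketch of that idea:

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

with torch.no_grad():  # inference only, no gradients needed
    output = model(**encoded_input)

# One hidden vector per token: shape (batch_size, sequence_length, 768)
token_features = output.last_hidden_state
# Representation of the first token (<s>) as a simple sentence-level feature
sentence_feature = token_features[:, 0, :]
print(sentence_feature.shape)  # torch.Size([1, 768])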
📚 Documentation
Model Description
RoBERTa is a transformers model pretrained on a large English corpus in a self-supervised way, using the masked language modeling (MLM) objective: the model randomly masks 15% of the words in a sentence and then predicts those masked words, which lets it learn a bidirectional representation of the sentence. This inner representation of the English language can then be used to extract features useful for downstream tasks.
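For illustration, here is a minimal sketch of the MLM objective at inference time: mask one token, run RobertaForMaskedLM, and read off the most likely replacement at the masked position. The example sentence is arbitrary; only the classes and tokenizer attributes used are part of the transformers API.

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

# Arbitrary example sentence with one masked position
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Locate the masked position and take the highest-scoring token
mask_index = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. " Paris"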
Intended Uses & Limitations
- Intended Uses: Primarily for fine-tuning on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering (see the fine-tuning sketch after this list).
- Limitations: Not suitable for text generation tasks. For such tasks, consider autoregressive models like GPT2.
- Bias: The model can produce biased predictions because its training data is largely unfiltered text from the internet. This bias carries over to all fine-tuned versions of the model.
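As a rough illustration of the intended fine-tuning workflow, the sketch below attaches a sequence-classification head to the pretrained encoder and runs a single training step. The texts, labels, and hyperparameters are placeholders, not recommendations.

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# Pretrained encoder plus a new, randomly initialized classification head
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

# Hypothetical toy data; replace with a real labeled dataset
texts = ["I loved this movie.", "This was a waste of time."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # loss (cross-entropy) is computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))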
Training Data
The RoBERTa model was pretrained on the combination of five datasets:
- BookCorpus, 11,038 unpublished books.
- English Wikipedia (excluding lists, tables and headers).
- CC-News, 63 million English news articles crawled from September 2016 to February 2019.
- OpenWebText, an open-source recreation of the WebText dataset used for GPT-2.
- Stories, a subset of CommonCrawl data with a story-like style.
Together, these datasets contain 160GB of text.
Training Procedure
Preprocessing
- Tokenization: Uses a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,000.
- Input Format: The model takes pieces of 512 contiguous tokens that may span documents. The beginning of a new document is marked with <s> and its end with </s>.
- Masking Procedure: 15% of the tokens are masked. Of those, 80% are replaced by <mask>, 10% by a random token, and 10% are left unchanged. The masking is done dynamically during pretraining rather than fixed in advance (see the tokenizer sketch after this list).
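The special tokens above are exposed directly by the tokenizer, so the input format can be inspected from Python; dynamic masking is typically implemented in transformers with a masking data collator. A small sketch (the example sentence is arbitrary):

from transformers import RobertaTokenizer, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Document-boundary and mask tokens used in the input format
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.mask_token)
# <s> </s> <mask>

# Encoding wraps the byte-level BPE tokens in <s> ... </s>
ids = tokenizer("Hello world")['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))
# ['<s>', 'Hello', 'Ġworld', '</s>']

# Dynamic masking: each batch is re-masked on the fly with probability 0.15
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)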
Pretraining
The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. It used the Adam optimizer with a learning rate of \(6\times10^{-4}\), \(\beta_{1}=0.9\), \(\beta_{2}=0.98\), \(\epsilon=10^{-6}\), a weight decay of 0.01, learning rate warmup for 24,000 steps, and linear decay of the learning rate afterwards.
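These optimizer settings map onto standard PyTorch and transformers components. The following is only a sketch of an equivalent configuration (AdamW stands in here for Adam with decoupled weight decay; a real pretraining run additionally needs the data pipeline and 8K batching described above):

import torch
from transformers import RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM.from_pretrained('roberta-base')

# Adam with the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

# 24,000 warmup steps, then linear decay over 500K total steps
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=24_000,
    num_training_steps=500_000,
)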
Evaluation Results
When fine-tuned on downstream tasks, this model achieves the following GLUE test results:
Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE
---- | ---- | --- | ---- | ----- | ---- | ----- | ---- | ---
Score | 87.6 | 91.9 | 92.8 | 94.8 | 63.6 | 91.2 | 90.2 | 78.7
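If you reproduce numbers like these, GLUE metrics can be computed with the evaluate library; the sketch below scores placeholder predictions on MRPC (accuracy and F1). The predictions and references are made up for illustration.

import evaluate  # pip install evaluate

# Load the official GLUE metric for a given task, e.g. MRPC
metric = evaluate.load("glue", "mrpc")

# Placeholder predictions and gold labels; in practice these come from
# running the fine-tuned model over the task's validation or test split
predictions = [1, 0, 1, 1]
references = [1, 0, 0, 1]

print(metric.compute(predictions=predictions, references=references))
# {'accuracy': ..., 'f1': ...}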
🔧 Technical Details
- Model Architecture: Based on the Transformer architecture, trained using the MLM objective.
- Training Setup: Large-scale training on 1024 V100 GPUs with the Adam optimizer hyperparameters described in the Pretraining section.
📄 License
This model is released under the MIT license.
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-1907-11692,
author = {Yinhan Liu and
Myle Ott and
Naman Goyal and
Jingfei Du and
Mandar Joshi and
Danqi Chen and
Omer Levy and
Mike Lewis and
Luke Zettlemoyer and
Veselin Stoyanov},
title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
journal = {CoRR},
volume = {abs/1907.11692},
year = {2019},
url = {http://arxiv.org/abs/1907.11692},
archivePrefix = {arXiv},
eprint = {1907.11692},
timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}