🚀 RoBERTa base model
A model pretrained on English text with a masked language modeling (MLM) objective. It learns bidirectional sentence representations and is useful as a starting point for downstream tasks.
🚀 Quick Start
This RoBERTa base model is pretrained on English-language data. You can use the raw model for masked language modeling, but it is mainly intended to be fine-tuned on a downstream task. Check the model hub for fine-tuned versions on tasks that interest you.
✨ Features
- Bidirectional Representation: Learns a bidirectional understanding of sentences through masked language modeling, unlike traditional RNNs that read tokens one after another or autoregressive models (such as GPT) that internally mask future tokens.
- Feature Extraction: Can extract useful features for downstream tasks such as sequence classification, token classification, or question answering.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='roberta-base')
>>> unmasker("Hello I'm a <mask> model.")
[{'sequence': "<s>Hello I'm a male model.</s>",
'score': 0.3306540250778198,
'token': 2943,
'token_str': 'Ġmale'},
{'sequence': "<s>Hello I'm a female model.</s>",
'score': 0.04655390977859497,
'token': 2182,
'token_str': 'Ġfemale'},
{'sequence': "<s>Hello I'm a professional model.</s>",
'score': 0.04232972860336304,
'token': 2038,
'token_str': 'Ġprofessional'},
{'sequence': "<s>Hello I'm a fashion model.</s>",
'score': 0.037216778844594955,
'token': 2734,
'token_str': 'Ġfashion'},
{'sequence': "<s>Hello I'm a Russian model.</s>",
'score': 0.03253649175167084,
'token': 1083,
'token_str': 'ĠRussian'}]
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
from transformers import RobertaTokenizer, RobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
And in TensorFlow:
from transformers import RobertaTokenizer, TFRobertaModel
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = TFRobertaModel.from_pretrained('roberta-base')
text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='tf')
output = model(encoded_input)
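In both cases the returned object exposes a last_hidden_state tensor with one vector per input token. One common pattern (not prescribed by this model card) is to take the representation of the first token (<s>) as a simple sentence-level feature; the following is a minimal PyTorch sketch of that idea:

import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaModel.from_pretrained('roberta-base')

text = "Replace me by any text you'd like."
encoded_input = tokenizer(text, return_tensors='pt')

with torch.no_grad():  # inference only, no gradients needed
    output = model(**encoded_input)

# One hidden vector per token: shape (batch_size, sequence_length, 768)
token_features = output.last_hidden_state
# Representation of the first token (<s>) as a simple sentence-level feature
sentence_feature = token_features[:, 0, :]
print(sentence_feature.shape)  # torch.Size([1, 768])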
📚 Documentation
Model Description
RoBERTa is a transformers model pretrained on a large English corpus in a self-supervised way, using the masked language modeling (MLM) objective: the model randomly masks 15% of the words in a sentence and then predicts those masked words, which lets it learn a bidirectional representation of the sentence. This inner representation of the English language can then be used to extract features useful for downstream tasks.
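For illustration, here is a minimal sketch of the MLM objective at inference time: mask one token, run RobertaForMaskedLM, and read off the most likely replacement at the masked position. The example sentence is arbitrary; only the classes and tokenizer attributes used are part of the transformers API.

import torch
from transformers import RobertaTokenizer, RobertaForMaskedLM

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')

# Arbitrary example sentence with one masked position
text = f"The capital of France is {tokenizer.mask_token}."
inputs = tokenizer(text, return_tensors='pt')

with torch.no_grad():
    logits = model(**inputs).logits  # (batch, seq_len, vocab_size)

# Locate the masked position and take the highest-scoring token
mask_index = (inputs['input_ids'] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
predicted_id = logits[0, mask_index].argmax(dim=-1)
print(tokenizer.decode(predicted_id))  # e.g. " Paris"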
Intended Uses & Limitations
- Intended Uses: Primarily for fine-tuning on tasks that use the whole sentence (potentially masked) to make decisions, such as sequence classification, token classification, or question answering (see the fine-tuning sketch after this list).
- Limitations: Not suitable for text generation tasks. For such tasks, consider autoregressive models like GPT2.
- Bias: The model can produce biased predictions because its training data is largely unfiltered text from the internet. This bias carries over to all fine-tuned versions of the model.
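As a rough illustration of the intended fine-tuning workflow, the sketch below attaches a sequence-classification head to the pretrained encoder and runs a single training step. The texts, labels, and hyperparameters are placeholders, not recommendations.

import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
# Pretrained encoder plus a new, randomly initialized classification head
model = RobertaForSequenceClassification.from_pretrained('roberta-base', num_labels=2)

# Hypothetical toy data; replace with a real labeled dataset
texts = ["I loved this movie.", "This was a waste of time."]
labels = torch.tensor([1, 0])

batch = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

model.train()
outputs = model(**batch, labels=labels)  # loss (cross-entropy) is computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
print(float(outputs.loss))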
Training Data
The RoBERTa model was pretrained on the combination of five datasets:
- BookCorpus, 11,038 unpublished books.
- English Wikipedia (excluding lists, tables and headers).
- CC-News, 63 million English news articles crawled from September 2016 to February 2019.
- OpenWebText, an open-source recreation of the WebText dataset used for GPT-2.
- Stories, a subset of CommonCrawl data with a story-like style.
Together, these datasets contain 160GB of text.
Training Procedure
Preprocessing
- Tokenization: Uses a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 50,000.
- Input Format: The model takes pieces of 512 contiguous tokens that may span documents. The beginning of a new document is marked with <s> and its end with </s>.
- Masking Procedure: 15% of the tokens are masked. Of those, 80% are replaced by <mask>, 10% by a random token, and 10% are left unchanged. The masking is done dynamically during pretraining rather than fixed in advance (see the tokenizer sketch after this list).
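The special tokens above are exposed directly by the tokenizer, so the input format can be inspected from Python; dynamic masking is typically implemented in transformers with a masking data collator. A small sketch (the example sentence is arbitrary):

from transformers import RobertaTokenizer, DataCollatorForLanguageModeling

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')

# Document-boundary and mask tokens used in the input format
print(tokenizer.bos_token, tokenizer.eos_token, tokenizer.mask_token)
# <s> </s> <mask>

# Encoding wraps the byte-level BPE tokens in <s> ... </s>
ids = tokenizer("Hello world")['input_ids']
print(tokenizer.convert_ids_to_tokens(ids))
# ['<s>', 'Hello', 'Ġworld', '</s>']

# Dynamic masking: each batch is re-masked on the fly with probability 0.15
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)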
Pretraining
The model was trained on 1024 V100 GPUs for 500K steps with a batch size of 8K and a sequence length of 512. It used the Adam optimizer with a learning rate of \(6\times10^{-4}\), \(\beta_{1}=0.9\), \(\beta_{2}=0.98\), \(\epsilon=10^{-6}\), a weight decay of 0.01, learning rate warmup for 24,000 steps, and linear decay of the learning rate afterwards.
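These optimizer settings map onto standard PyTorch and transformers components. The following is only a sketch of an equivalent configuration (AdamW stands in here for Adam with decoupled weight decay; a real pretraining run additionally needs the data pipeline and 8K batching described above):

import torch
from transformers import RobertaForMaskedLM, get_linear_schedule_with_warmup

model = RobertaForMaskedLM.from_pretrained('roberta-base')

# Adam with the hyperparameters listed above
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=6e-4,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.01,
)

# 24,000 warmup steps, then linear decay over 500K total steps
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=24_000,
    num_training_steps=500_000,
)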
Evaluation Results
When fine-tuned on downstream tasks, this model achieves the following GLUE test results:
Task | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE
---- | ---- | --- | ---- | ----- | ---- | ----- | ---- | ---
Score | 87.6 | 91.9 | 92.8 | 94.8 | 63.6 | 91.2 | 90.2 | 78.7
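If you reproduce numbers like these, GLUE metrics can be computed with the evaluate library; the sketch below scores placeholder predictions on MRPC (accuracy and F1). The predictions and references are made up for illustration.

import evaluate  # pip install evaluate

# Load the official GLUE metric for a given task, e.g. MRPC
metric = evaluate.load("glue", "mrpc")

# Placeholder predictions and gold labels; in practice these come from
# running the fine-tuned model over the task's validation or test split
predictions = [1, 0, 1, 1]
references = [1, 0, 0, 1]

print(metric.compute(predictions=predictions, references=references))
# {'accuracy': ..., 'f1': ...}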
🔧 Technical Details
- Model Architecture: Based on the Transformer architecture, trained using the MLM objective.
- Training Setup: Large-scale training on 1024 V100 GPUs with the Adam optimizer hyperparameters described in the Pretraining section.
📄 License
This model is released under the MIT license.
BibTeX entry and citation info
@article{DBLP:journals/corr/abs-1907-11692,
author = {Yinhan Liu and
Myle Ott and
Naman Goyal and
Jingfei Du and
Mandar Joshi and
Danqi Chen and
Omer Levy and
Mike Lewis and
Luke Zettlemoyer and
Veselin Stoyanov},
title = {RoBERTa: {A} Robustly Optimized {BERT} Pretraining Approach},
journal = {CoRR},
volume = {abs/1907.11692},
year = {2019},
url = {http://arxiv.org/abs/1907.11692},
archivePrefix = {arXiv},
eprint = {1907.11692},
timestamp = {Thu, 01 Aug 2019 08:59:33 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1907-11692.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}