# roberta-news
This project presents `roberta-news`, a model similar to `roberta-base` but pre-trained on a news-only dataset. It can be used for masked language modeling tasks, offering insights based on news-specific language patterns.
## Quick Start
The model can be used with the Hugging Face `pipeline` API like so:
### Basic Usage
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='andyreas/roberta-gen-news')
>>> print(unmasker("The weather forecast for <mask> is rain.", top_k=5))
[{'score': 0.06107175350189209,
  'token': 1083,
  'token_str': ' Friday',
  'sequence': 'The weather forecast for Friday is rain.'},
 {'score': 0.04649643227458,
  'token': 1359,
  'token_str': ' Saturday',
  'sequence': 'The weather forecast for Saturday is rain.'},
 {'score': 0.04370906576514244,
  'token': 1772,
  'token_str': ' weekend',
  'sequence': 'The weather forecast for weekend is rain.'},
 {'score': 0.04101456701755524,
  'token': 1133,
  'token_str': ' Wednesday',
  'sequence': 'The weather forecast for Wednesday is rain.'},
 {'score': 0.03785591572523117,
  'token': 1234,
  'token_str': ' Sunday',
  'sequence': 'The weather forecast for Sunday is rain.'}]
```
## Features
- Similar to `roberta-base` in size, architecture, tokenizer algorithm, and Masked Language Modeling objective.
- Randomly initialized and pre-trained from scratch on a news-only dataset.
## Installation
No specific installation steps are provided; the model can be loaded directly through the Hugging Face `transformers` library as shown above.
## Documentation

### Model Description
The model is similar to `roberta-base` in that it shares its size, architecture, tokenizer algorithm, and Masked Language Modeling objective.
The parameters of a `RobertaForMaskedLM` model were randomly initialized and pre-trained from scratch on a dataset consisting only of news.
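Because the checkpoint uses the standard `RobertaForMaskedLM` architecture, it can also be queried directly rather than through the pipeline. The following is a minimal sketch of that lower-level usage; the top-5 selection logic is illustrative and not part of the original card:

```python
# Minimal sketch: querying the model directly with RobertaForMaskedLM
# instead of the fill-mask pipeline shown in the Quick Start.
import torch
from transformers import AutoTokenizer, RobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("andyreas/roberta-gen-news")
model = RobertaForMaskedLM.from_pretrained("andyreas/roberta-gen-news")

inputs = tokenizer("The weather forecast for <mask> is rain.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the five most likely replacement tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print([tokenizer.decode(token_id).strip() for token_id in top5])
```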
### Training Data
The model's training data consists of almost 13,000,000 English articles from ~90 outlets, each consisting of a headline (title) and a subheading (description). The articles were collected from the Sciride News Mine, after which additional cleaning was performed, such as removing duplicate articles and stripping repeated "outlet tags" that appear before or after headlines (e.g. "| Daily Mail Online").
The cleaned dataset can be found on Hugging Face here. roberta-news was pre-trained on a large subset (12,928,029 of 13,118,041 articles) of the linked dataset, after repacking the data slightly to avoid abrupt truncation.
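For illustration, a small sketch of the kind of cleaning described above. The actual cleaning script is not part of this card; the outlet-tag pattern, data layout, and function name are assumptions:

```python
# Hypothetical sketch of the cleaning step: drop duplicate articles and strip
# repeated outlet tags such as "| Daily Mail Online" from headlines.
import re

OUTLET_TAG = re.compile(r"\s*\|\s*Daily Mail Online\s*$")  # example tag only

def clean_articles(articles):
    seen = set()
    cleaned = []
    for article in articles:
        title = OUTLET_TAG.sub("", article["title"]).strip()
        key = (title, article["description"])
        if key in seen:  # skip exact duplicates after tag removal
            continue
        seen.add(key)
        cleaned.append({"title": title, "description": article["description"]})
    return cleaned

print(clean_articles([
    {"title": "Storm warning issued | Daily Mail Online", "description": "Heavy rain expected."},
    {"title": "Storm warning issued", "description": "Heavy rain expected."},
]))
```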
### Training
Training ran for ~3 epochs using a learning rate of 2e-5 and 50K warm-up steps out of ~2450K total steps.
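The original training script is not included in this card, but a rough sketch of an equivalent setup with the Hugging Face `Trainer` might look as follows. Only the learning rate, warm-up steps, and approximate step count come from the card; the batch size, masking probability, output path, and the tiny in-memory stand-in dataset are assumptions:

```python
# Hypothetical sketch of the pre-training run described above.
# From the card: randomly initialized roberta-base-sized RobertaForMaskedLM,
# learning rate 2e-5, 50K warm-up steps, ~2450K total steps.
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM(RobertaConfig())  # random init, roberta-base-sized

# Tiny stand-in for the real headline + subheading corpus (~13M articles).
texts = ["Storm warning issued. Heavy rain expected across the region."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-news",        # assumption
    learning_rate=2e-5,               # from the card
    warmup_steps=50_000,              # from the card
    max_steps=2_450_000,              # ~total steps reported in the card
    per_device_train_batch_size=16,   # assumption
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
# trainer.train()
```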
### Bias
Like any other model, roberta-news is subject to bias from the data it was trained on.
## Technical Details
- Model Type: Similar to `roberta-base`, using the `RobertaForMaskedLM` architecture.
- Training Data: Almost 13,000,000 English articles from ~90 outlets, cleaned and pre-processed.
- Training Process: ~3 epochs, learning rate of 2e-5, 50K warm-up steps out of ~2450K total steps.
| Property | Details |
| --- | --- |
| Model Type | Similar to `roberta-base`, using the `RobertaForMaskedLM` architecture |
| Training Data | Almost 13,000,000 English articles from ~90 outlets, cleaned and pre-processed. The cleaned dataset is available here |
## License
This project is licensed under the MIT license.