# roberta-news
This project presents `roberta-news`, a model similar to `roberta-base` but pre-trained on a news-only dataset. It can be used for masked language modeling tasks, offering insights based on news-specific language patterns.
## Quick Start
The model can be used with the Hugging Face `pipeline` API like so:
### Basic Usage
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='andyreas/roberta-gen-news')
>>> print(unmasker("The weather forecast for <mask> is rain.", top_k=5))
[{'score': 0.06107175350189209,
  'token': 1083,
  'token_str': ' Friday',
  'sequence': 'The weather forecast for Friday is rain.'},
 {'score': 0.04649643227458,
  'token': 1359,
  'token_str': ' Saturday',
  'sequence': 'The weather forecast for Saturday is rain.'},
 {'score': 0.04370906576514244,
  'token': 1772,
  'token_str': ' weekend',
  'sequence': 'The weather forecast for weekend is rain.'},
 {'score': 0.04101456701755524,
  'token': 1133,
  'token_str': ' Wednesday',
  'sequence': 'The weather forecast for Wednesday is rain.'},
 {'score': 0.03785591572523117,
  'token': 1234,
  'token_str': ' Sunday',
  'sequence': 'The weather forecast for Sunday is rain.'}]
```
## Features
- Similar to `roberta-base` in size, architecture, tokenizer algorithm, and Masked Language Modeling objective.
- Randomly initialized and pre-trained from scratch on a news-only dataset.
## Installation
No specific installation steps are provided; the model can be loaded directly through the Hugging Face `transformers` library as shown above.
## Documentation

### Model Description
The model is similar to `roberta-base` in that it shares its size, architecture, tokenizer algorithm, and Masked Language Modeling objective.
The parameters of a `RobertaForMaskedLM` model were randomly initialized and pre-trained from scratch on a dataset consisting only of news.
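Because the checkpoint uses the standard `RobertaForMaskedLM` architecture, it can also be queried directly rather than through the pipeline. The following is a minimal sketch of that lower-level usage; the top-5 selection logic is illustrative and not part of the original card:

```python
# Minimal sketch: querying the model directly with RobertaForMaskedLM
# instead of the fill-mask pipeline shown in the Quick Start.
import torch
from transformers import AutoTokenizer, RobertaForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("andyreas/roberta-gen-news")
model = RobertaForMaskedLM.from_pretrained("andyreas/roberta-gen-news")

inputs = tokenizer("The weather forecast for <mask> is rain.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Locate the <mask> position and take the five most likely replacement tokens.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
top5 = logits[0, mask_pos].topk(5, dim=-1).indices[0]
print([tokenizer.decode(token_id).strip() for token_id in top5])
```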
### Training Data
The model's training data consists of almost 13,000,000 English articles from ~90 outlets, each consisting of a headline (title) and a subheading (description). The articles were collected from the Sciride News Mine, after which additional cleaning was performed, such as removing duplicate articles and stripping repeated "outlet tags" that appear before or after headlines (e.g. "| Daily Mail Online").
The cleaned dataset can be found on Hugging Face here. roberta-news was pre-trained on a large subset (12,928,029 of 13,118,041 articles) of the linked dataset, after repacking the data slightly to avoid abrupt truncation.
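For illustration, a small sketch of the kind of cleaning described above. The actual cleaning script is not part of this card; the outlet-tag pattern, data layout, and function name are assumptions:

```python
# Hypothetical sketch of the cleaning step: drop duplicate articles and strip
# repeated outlet tags such as "| Daily Mail Online" from headlines.
import re

OUTLET_TAG = re.compile(r"\s*\|\s*Daily Mail Online\s*$")  # example tag only

def clean_articles(articles):
    seen = set()
    cleaned = []
    for article in articles:
        title = OUTLET_TAG.sub("", article["title"]).strip()
        key = (title, article["description"])
        if key in seen:  # skip exact duplicates after tag removal
            continue
        seen.add(key)
        cleaned.append({"title": title, "description": article["description"]})
    return cleaned

print(clean_articles([
    {"title": "Storm warning issued | Daily Mail Online", "description": "Heavy rain expected."},
    {"title": "Storm warning issued", "description": "Heavy rain expected."},
]))
```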
### Training
Training ran for ~3 epochs using a learning rate of 2e-5 and 50K warm-up steps out of ~2450K total steps.
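The original training script is not included in this card, but a rough sketch of an equivalent setup with the Hugging Face `Trainer` might look as follows. Only the learning rate, warm-up steps, and approximate step count come from the card; the batch size, masking probability, output path, and the tiny in-memory stand-in dataset are assumptions:

```python
# Hypothetical sketch of the pre-training run described above.
# From the card: randomly initialized roberta-base-sized RobertaForMaskedLM,
# learning rate 2e-5, 50K warm-up steps, ~2450K total steps.
from datasets import Dataset
from transformers import (
    DataCollatorForLanguageModeling,
    RobertaConfig,
    RobertaForMaskedLM,
    RobertaTokenizerFast,
    Trainer,
    TrainingArguments,
)

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = RobertaForMaskedLM(RobertaConfig())  # random init, roberta-base-sized

# Tiny stand-in for the real headline + subheading corpus (~13M articles).
texts = ["Storm warning issued. Heavy rain expected across the region."]
dataset = Dataset.from_dict({"text": texts}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="roberta-news",        # assumption
    learning_rate=2e-5,               # from the card
    warmup_steps=50_000,              # from the card
    max_steps=2_450_000,              # ~total steps reported in the card
    per_device_train_batch_size=16,   # assumption
)

trainer = Trainer(model=model, args=args, data_collator=collator, train_dataset=dataset)
# trainer.train()
```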
### Bias
Like any other model, roberta-news is subject to bias from the data it was trained on.
## Technical Details
- Model Type: Similar to `roberta-base`, using the `RobertaForMaskedLM` architecture.
- Training Data: Almost 13,000,000 English articles from ~90 outlets, cleaned and pre-processed.
- Training Process: ~3 epochs, learning rate of 2e-5, 50K warm-up steps out of ~2450K total steps.
| Property | Details |
| --- | --- |
| Model Type | Similar to `roberta-base`, using the `RobertaForMaskedLM` architecture |
| Training Data | Almost 13,000,000 English articles from ~90 outlets, cleaned and pre-processed. The cleaned dataset is available here |
## License
This project is licensed under the MIT license.