🚀 roberta-large-1160k
This model can be used for masked language modeling and is mainly intended for fine-tuning on downstream tasks.
🚀 Quick Start
✨ Features
- Supports masked language modeling.
- Can be fine-tuned on downstream tasks.
- Trained on the Scandinavian subset of the Nordic Pile.
📦 Installation
The model is used through the 🤗 Transformers library; installing it together with PyTorch (for example `pip install transformers torch`) is enough to run the examples below.
💻 Usage Examples
Basic Usage
You can use this model directly with a pipeline for masked language modeling:
```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='AI-Sweden-Models/roberta-large-1160k')
>>> unmasker("Huvudstaden i Sverige är <mask>.")
[{'score': 0.5841221213340759,
'token': 1945,
'token_str': ' Stockholm',
'sequence': 'Huvudstaden i Sverige är Stockholm.'},
{'score': 0.06775698810815811,
'token': 5007,
'token_str': ' Göteborg',
'sequence': 'Huvudstaden i Sverige är Göteborg.'},
{'score': 0.05057400465011597,
'token': 5761,
'token_str': ' Malmö',
'sequence': 'Huvudstaden i Sverige är Malmö.'},
{'score': 0.021936343982815742,
'token': 21449,
'token_str': ' Norrköping',
'sequence': 'Huvudstaden i Sverige är Norrköping.'},
{'score': 0.017798304557800293,
'token': 5658,
'token_str': ' Uppsala',
'sequence': 'Huvudstaden i Sverige är Uppsala.'}]
>>> unmasker("Hovedstaden i Norge er <mask>.")
[{'score': 0.6792309284210205,
'token': 5158,
'token_str': ' Oslo',
'sequence': 'Hovedstaden i Norge er Oslo.'},
{'score': 0.09379775077104568,
'token': 15456,
'token_str': ' Trondheim',
'sequence': 'Hovedstaden i Norge er Trondheim.'},
{'score': 0.052535850554704666,
'token': 11370,
'token_str': ' Bergen',
'sequence': 'Hovedstaden i Norge er Bergen.'},
{'score': 0.03465486690402031,
'token': 29407,
'token_str': ' hovedstaden',
'sequence': 'Hovedstaden i Norge er hovedstaden.'},
{'score': 0.03017985075712204,
'token': 33311,
'token_str': ' Kristiansand',
'sequence': 'Hovedstaden i Norge er Kristiansand.'}]
>>> unmasker("Danmarks hovedstad er <mask>.")
[{'score': 0.11624140292406082,
'token': 4794,
'token_str': ' København',
'sequence': 'Danmarks hovedstad er København.'},
{'score': 0.045051511377096176,
'token': 7680,
'token_str': ' død',
'sequence': 'Danmarks hovedstad er død.'},
{'score': 0.02936543896794319,
'token': 10795,
'token_str': ' lukket',
'sequence': 'Danmarks hovedstad er lukket.'},
{'score': 0.026030730456113815,
'token': 13580,
'token_str': ' Odense',
'sequence': 'Danmarks hovedstad er Odense.'},
{'score': 0.02130937948822975,
'token': 16347,
'token_str': ' Roskilde',
'sequence': 'Danmarks hovedstad er Roskilde.'}]
```
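Under the hood, the fill-mask pipeline runs a single forward pass and takes a softmax over the vocabulary at the masked position. The snippet below is a minimal sketch of that equivalent manual flow; the pipeline itself remains the supported interface, this is only illustrative:

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
model = AutoModelForMaskedLM.from_pretrained('AI-Sweden-Models/roberta-large-1160k')

inputs = tokenizer("Huvudstaden i Sverige är <mask>.", return_tensors='pt')
with torch.no_grad():
    logits = model(**inputs).logits

# Find the position of the <mask> token and take the five highest-probability fillers
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_pos].softmax(dim=-1)
top = probs.topk(5)
for score, token_id in zip(top.values[0], top.indices[0]):
    print(f"{score.item():.4f}  {tokenizer.decode(token_id.item())}")
```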
Advanced Usage
Here is how to use this model to get the features of a given text in PyTorch:
```python
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained('AI-Sweden-Models/roberta-large-1160k')
model = RobertaModel.from_pretrained('AI-Sweden-Models/roberta-large-1160k')

text = "Replace me by any text you'd like."
# Tokenize to PyTorch tensors, then run a forward pass to obtain the features
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
```
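`output.last_hidden_state` contains one feature vector per token. If a single sentence-level vector is needed, one common (but not prescribed by this model card) choice is mean pooling over the non-padding tokens. A short sketch continuing the example above:

```python
# Token-level features: shape (batch_size, sequence_length, hidden_size)
token_embeddings = output.last_hidden_state
mask = encoded_input['attention_mask'].unsqueeze(-1).float()

# Average only over real tokens; padding positions are masked out
sentence_embedding = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 1024]) for this large-sized model
```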
📚 Documentation
Training data
The model was trained on the Scandinavian subset of the Nordic Pile (Swedish, Norwegian, and Danish), consisting of 414 962 688 text samples.
Training procedure
The model was trained with the optimum-habana framework on 8x Intel® Gaudi® 2 AI accelerators, managed by Intel Sweden AB.
The weights from https://huggingface.co/FacebookAI/roberta-large were used as initialization, and the tokenizer was trained from scratch.
This model is an intermediate checkpoint at step 1 160 000 of 1 350 790; the full run spans 5 epochs, and this checkpoint corresponds to epoch 4.29.
A batch size of 1536 was used.
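These figures are mutually consistent: 414 962 688 samples over 5 epochs at batch size 1536 gives 1 350 790 steps, and step 1 160 000 falls at epoch 4.29. A quick check:

```python
samples = 414_962_688        # text samples in the Scandinavian subset of the Nordic Pile
epochs = 5                   # length of the full training run
batch_size = 1536
checkpoint_step = 1_160_000  # step at which this checkpoint was taken

total_steps = samples * epochs / batch_size
print(total_steps)                              # 1350790.0
print(checkpoint_step / total_steps * epochs)   # ≈ 4.29
```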
Evaluation results
When fine-tuned on downstream tasks, this model achieves the following results (an illustrative fine-tuning sketch follows the table):
| rank | da_rank | no_rank | sv_rank | dansk | angry_tweets | scala_da | scandiqa_da | norne_nb | norne_nn | norec | scala_nb | scala_nn | norquad | suc3 | swerec | scala_sv | scandiqa_sv |
|------|---------|---------|---------|-------|--------------|----------|-------------|----------|----------|-------|----------|----------|---------|------|--------|----------|-------------|
| 1.3 | 1.33 | 1.34 | 1.23 | 74.16 | 51.2 | 73.87 | 49.34 | 92.01 | 87.17 | 60.11 | 72.85 | 65.56 | 60.38 | 82.65 | 77.25 | 77.9 | 49.64 |
As of 2024/03/26, it is ranked #2 on ScandEval, after gpt-4-0613.
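The scores above come from ScandEval's own fine-tuning and evaluation protocol. Purely as an illustration of how the checkpoint can be fine-tuned on a downstream classification task, a minimal sketch with the 🤗 Trainer might look as follows; the CSV files, label count, and hyperparameters are placeholders, not the ScandEval setup:

```python
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

model_name = "AI-Sweden-Models/roberta-large-1160k"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# num_labels is task-specific; 2 is just a placeholder for a binary task
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Placeholder dataset with "text" and "label" columns; swap in your own task data
dataset = load_dataset("csv", data_files={"train": "train.csv", "validation": "val.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="roberta-large-1160k-finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```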
🔧 Technical Details
The model was trained with the optimum-habana framework on 8x Intel® Gaudi® 2 AI accelerators managed by Intel Sweden AB. The weights from https://huggingface.co/FacebookAI/roberta-large were used as initialization, and the tokenizer was trained from scratch.
📄 License
This model is released under the MIT license.