NorBERT 2
NorBERT 3 is a series of Norwegian pre-trained language models trained on a large-scale Norwegian corpus, supporting a range of natural language processing tasks.
Release Time : 3/2/2022
Model Overview
NorBERT 3 is a Norwegian pre-trained language model based on the BERT architecture, designed specifically for Norwegian natural language processing and suited to scenarios such as text classification, named entity recognition, and question answering.
Model Features
Large-scale corpus training
Trained on a very large Norwegian corpus (C4 + NCC, approximately 15 billion tokens)
Whole word masking technique
Uses whole word masking during pre-training, masking entire words rather than individual subword tokens, to strengthen the model's understanding of Norwegian
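The idea behind whole word masking can be sketched in a few lines: whenever a word is selected for masking, every subword piece of that word is masked together. This is a simplified illustration, not NorBERT 3's actual data pipeline, and the subword segmentation shown is hypothetical:

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, mask_token="[MASK]", seed=0):
    """Mask whole words: if any subword of a word is chosen,
    all subwords of that word are replaced with the mask token.

    `tokens` are subword tokens where a leading "##" marks the
    continuation of the previous word (WordPiece convention).
    """
    rng = random.Random(seed)
    # Group subword indices into whole words.
    words = []
    for i, tok in enumerate(tokens):
        if tok.startswith("##") and words:
            words[-1].append(i)
        else:
            words.append([i])
    masked = list(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for i in word:
                masked[i] = mask_token
    return masked

# Hypothetical WordPiece segmentation of a Norwegian sentence;
# "bo" + "##lig" form one word and are always masked as a unit.
toks = ["Nå", "ønsker", "de", "seg", "en", "ny", "bo", "##lig", "."]
out = whole_word_mask(toks, mask_prob=0.5, seed=1)
```

Compared with plain token masking, this prevents the model from trivially recovering a masked subword from its unmasked neighbors within the same word.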
Multiple version options
Available in several parameter sizes, from ultra-lightweight to larger variants, to match different computational budgets
Model Capabilities
Text understanding
Text generation
Fill-mask
Named entity recognition
Text classification
Use Cases
Text processing
Text completion
Automatically completes missing parts of Norwegian sentences
Example input: 'Nå ønsker de seg en [MASK] bolig.' The model can predict suitable words such as 'ny' (new)
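A minimal fill-mask sketch using the Hugging Face `transformers` pipeline. The model identifier `ltg/norbert3-base` and the `trust_remote_code=True` flag are assumptions about how the model is published; adjust them to the actual checkpoint you use. The model call itself is wrapped in a function so the prompt construction can be shown without downloading weights:

```python
from transformers import pipeline

# Assumed Hub identifier; substitute the NorBERT 3 size you need.
MODEL_ID = "ltg/norbert3-base"

def build_prompt(sentence: str, word: str, mask_token: str = "[MASK]") -> str:
    """Replace the first occurrence of `word` with the mask token."""
    return sentence.replace(word, mask_token, 1)

def predict_mask(text: str, top_k: int = 5):
    """Return top-k fill-mask predictions (downloads the model on first call).

    trust_remote_code=True is assumed here because the checkpoint may ship
    a custom model class; drop it if the architecture is natively supported.
    """
    fill = pipeline("fill-mask", model=MODEL_ID, trust_remote_code=True)
    return [(p["token_str"], p["score"]) for p in fill(text, top_k=top_k)]

prompt = build_prompt("Nå ønsker de seg en ny bolig.", "ny")
# prompt == "Nå ønsker de seg en [MASK] bolig."
# predict_mask(prompt) would typically rank words like "ny" (new) highly.
```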
Text classification
Classifies Norwegian text
Information extraction
Named entity recognition
Identifies entities such as person names and locations in Norwegian text