BERT Base Swedish Cased NER
Swedish BERT Models
The National Library of Sweden / KBLab has released three pre-trained language models based on BERT and ALBERT. The models are trained on approximately 15-20 GB of text (200M sentences, 3,000M tokens) from a variety of sources such as books, news, government publications, the Swedish Wikipedia, and internet forums, with the goal of providing a representative language model for Swedish text. A more comprehensive description will be published later.
The following three models are currently available:
- bert-base-swedish-cased (v1): A BERT model trained with the same hyperparameters as the one first published by Google.
- bert-base-swedish-cased-ner (experimental): A BERT model fine-tuned for Named Entity Recognition (NER) using the SUC 3.0 dataset.
- albert-base-swedish-cased-alpha (alpha): An initial attempt at an ALBERT model for the Swedish language.

All models are case-sensitive and trained with whole-word masking.
Files
| Model | Files |
|---|---|
| bert-base-swedish-cased | config, vocab, pytorch_model.bin |
| bert-base-swedish-cased-ner | config, vocab, pytorch_model.bin |
| albert-base-swedish-cased-alpha | config, sentencepiece model, pytorch_model.bin |
TensorFlow model weights will be released soon.
Usage requirements / installation instructions
The examples below require Huggingface Transformers 2.4.1 and PyTorch 1.3.1 or greater. For Transformers < 2.4.0, the tokenizer must be instantiated manually, with the `do_lower_case` flag set to `False` and `keep_accents` set to `True` (for ALBERT).
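For older Transformers versions, a minimal sketch of the manual instantiation might look like the following (the flag values are the ones described above; the concrete tokenizer classes are an assumption based on the standard Transformers API):

```python
from transformers import BertTokenizer, AlbertTokenizer

# For Transformers < 2.4.0: instantiate the tokenizers explicitly and pass the
# flags described above instead of relying on AutoTokenizer defaults.
bert_tok = BertTokenizer.from_pretrained('KB/bert-base-swedish-cased',
                                         do_lower_case=False)
albert_tok = AlbertTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha',
                                             do_lower_case=False,
                                             keep_accents=True)
```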
To create an environment where the examples can be run, execute the following commands in a terminal on your preferred operating system:
```bash
git clone https://github.com/Kungbib/swedish-bert-models
cd swedish-bert-models
python3 -m venv venv
source venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```
Usage Examples
Basic Usage
BERT Base Swedish
A standard BERT base for Swedish trained on a variety of sources. The vocabulary size is approximately 50k. Using Huggingface Transformers, the model can be loaded in Python as follows:
```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('KB/bert-base-swedish-cased')
model = AutoModel.from_pretrained('KB/bert-base-swedish-cased')
```
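As a quick sanity check (a sketch, not part of the original examples; the tuple-style output assumes the Transformers 2.x API), you can encode a sentence and inspect the resulting hidden states:

```python
import torch

# Encode a sentence and run a forward pass; outputs[0] is the last hidden
# state with shape (batch_size, sequence_length, hidden_size).
input_ids = tok.encode('Idag slÀpper KB tre sprÄkmodeller.', return_tensors='pt')
with torch.no_grad():
    outputs = model(input_ids)
print(outputs[0].shape)
```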
BERT base fine-tuned for Swedish NER
This model is fine-tuned on the SUC 3.0 dataset. Using the Huggingface pipeline, the model can be easily instantiated. For Transformers < 2.4.1, it seems the tokenizer must be loaded separately to disable lower-casing of input strings:
```python
from transformers import pipeline

nlp = pipeline('ner', model='KB/bert-base-swedish-cased-ner', tokenizer='KB/bert-base-swedish-cased-ner')

nlp('Idag slÀpper KB tre sprÄkmodeller.')
```
Running the Python code above should produce a result similar to the one below. The entity types used are `TME` for time, `PRS` for personal names, `LOC` for locations, `EVN` for events, and `ORG` for organizations. These labels are subject to change.
```
[ { 'word': 'Idag', 'score': 0.9998126029968262, 'entity': 'TME' },
  { 'word': 'KB',   'score': 0.9814832210540771, 'entity': 'ORG' } ]
```
The BERT tokenizer often splits words into multiple tokens, with the sub-parts starting with `##`. For example, the string `Engelbert kör Volvo till HerrÀngens fotbollsklubb` gets tokenized as `Engel ##bert kör Volvo till Herr ##Àngens fotbolls ##klubb`. To glue the parts back together, you can use the following code:
```python
text = 'Engelbert tar Volvon till Tele2 Arena för att titta pÄ DjurgÄrden IF ' +\
       'som spelar fotboll i VM klockan tvÄ pÄ kvÀllen.'

l = []
for token in nlp(text):
    if token['word'].startswith('##'):
        l[-1]['word'] += token['word'][2:]
    else:
        l += [ token ]
print(l)
```
This should result in something like the following (though less cleanly formatted):
```
[ { 'word': 'Engelbert',  'score': 0.99..., 'entity': 'PRS'},
  { 'word': 'Volvon',     'score': 0.99..., 'entity': 'OBJ'},
  { 'word': 'Tele2',      'score': 0.99..., 'entity': 'LOC'},
  { 'word': 'Arena',      'score': 0.99..., 'entity': 'LOC'},
  { 'word': 'DjurgÄrden', 'score': 0.99..., 'entity': 'ORG'},
  { 'word': 'IF',         'score': 0.99..., 'entity': 'ORG'},
  { 'word': 'VM',         'score': 0.99..., 'entity': 'EVN'},
  { 'word': 'klockan',    'score': 0.99..., 'entity': 'TME'},
  { 'word': 'tvÄ',        'score': 0.99..., 'entity': 'TME'},
  { 'word': 'pÄ',         'score': 0.99..., 'entity': 'TME'},
  { 'word': 'kvÀllen',    'score': 0.54..., 'entity': 'TME'} ]
```
ALBERT base
The easiest way to use this model is, again, with Huggingface Transformers:
```python
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('KB/albert-base-swedish-cased-alpha')
model = AutoModel.from_pretrained('KB/albert-base-swedish-cased-alpha')
```
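The tokenizer and model can then be used in the same way as the BERT example above. The following sketch (not from the original card) simply inspects how the SentencePiece-based tokenizer splits a sentence:

```python
# Inspect the SentencePiece tokenization of a Swedish sentence.
tokens = tok.tokenize('Idag slÀpper KB tre sprÄkmodeller.')
print(tokens)
```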
Acknowledgements
- Resources from Stockholm University, UmeÄ University, and the Swedish Language Bank at Gothenburg University were used when fine-tuning BERT for NER.
- Model pre-training was partly conducted in-house at the KBLab and partly (for material without active copyright) with the support of Cloud TPUs from Google's TensorFlow Research Cloud (TFRC).
- The models are hosted on S3 by Huggingface.

