nb-whisper-large Open-source Speech Recognition Model - Free Norwegian and English Speech Transcription and Translation

Nb Whisper Large

Developed by NbAiLabBeta

An automatic speech recognition model developed by the National Library of Norway, based on the Whisper architecture, supporting speech transcription and translation of Norwegian and English.

Speech Recognition

Transformers

Open Source License: Apache-2.0 #Norwegian speech recognition #Multi-dialect support #Long audio processing

Release Time : 1/9/2024

Model Overview

NB-Whisper Large is a cutting-edge automatic speech recognition (ASR) and speech translation model based on OpenAI Whisper. It has been trained on 66,000 hours of Norwegian data and is suitable for high-precision speech-to-text tasks.

Model Features

Multi-size model series

Provides five models with different parameter scales from Tiny (39M) to Large (1550M) to meet the needs of different computing resources

Professional variant versions

Provides two professional variants, verbatim and semantic, suitable for precise transcription and content summarization scenarios respectively

Norwegian optimization

Trained on 66,000 hours of Norwegian data, with special optimization for Norwegian dialects and accents

Multi-format support

Provides PyTorch, TensorFlow, ONNX, and whisper.cpp (ggml) formats, supporting multiple deployment methods

Model Capabilities

Norwegian speech transcription

English speech transcription

Speech translation

Speaker separation

Timestamp alignment

Use Cases

Media processing

Broadcast content transcription

Automatically convert the broadcast programs of the Norwegian Broadcasting Corporation (NRK) into manuscripts

Supports generating subtitle files with timestamps

Meeting minutes

Automatically record the speech content of the Norwegian Parliament

The semantic version can generate a concise meeting summary

Education and research

Linguistic research

Analyze the speech characteristics of Norwegian dialects

The verbatim version provides precise phoneme-level transcription

🚀 NB-Whisper Large (Release Candidate)

Introducing the Norwegian NB-Whisper Large model, developed by the National Library of Norway. NB-Whisper is a state-of-the-art series of models for automatic speech recognition (ASR) and speech translation, based on OpenAI's Whisper. Each model in the series has been trained for 250,000 steps with 8 million samples, equivalent to 66,000 hours of speech. Stay tuned for our upcoming article for more details on the training methodology and dataset.
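As a rough sanity check on these figures, the sketch below derives the average clip length implied by 8 million samples and 66,000 hours. The names are illustrative, and the comparison with Whisper's 30-second input window is our own observation, not a statement from the authors.

```python
# Figures quoted in the paragraph above.
TOTAL_HOURS = 66_000
NUM_SAMPLES = 8_000_000


def average_clip_seconds(total_hours: float, num_samples: int) -> float:
    """Return the mean duration per training sample, in seconds."""
    return total_hours * 3600 / num_samples


if __name__ == "__main__":
    # Roughly 29.7 s per sample, close to Whisper's 30-second window.
    print(f"{average_clip_seconds(TOTAL_HOURS, NUM_SAMPLES):.1f} s per sample")
```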

IMPORTANT: These models are currently Release Candidates. We are in the final stages of testing. If everything goes smoothly, we plan to officially release the models later this month.

✨ Features

Model Variants

Main Models: Suitable for most transcription tasks. There are different sizes available, including Tiny, Base, Small, Medium, and Large.

| Model Size | Parameters | Model |
|------------|------------|------------|
| Tiny | 39M | NB-Whisper Tiny |
| Base | 74M | NB-Whisper Base |
| Small | 244M | NB-Whisper Small |
| Medium | 769M | NB-Whisper Medium |
| Large | 1550M | NB-Whisper Large |

Specialised Models: Trained for an additional 250 steps from the main models.
- Verbatim version: More literal, suitable for tasks requiring detailed transcription, such as linguistic analysis.
- Semantic version: Focuses on capturing the essence of content, ideal for meeting minutes and subtitling.

| Model Size | Parameters | Verbatim version | Semantic version |
|------------|------------|------------|------------------|
| Tiny | 39M | Tiny - verbatim | Tiny - semantic |
| Base | 74M | Base - verbatim | Base - semantic |
| Small | 244M | Small - verbatim | Small - semantic |
| Medium | 769M | Medium - verbatim | Medium - semantic |
| Large | 1550M | Large - verbatim | Large - semantic |
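When switching between sizes and variants in scripts, a small helper can keep the repository IDs in one place. The sketch below assumes that every model follows the pattern `NbAiLabBeta/nb-whisper-<size>[-<variant>]`. Only `NbAiLabBeta/nb-whisper-large` is confirmed by this page, so check the other names on the hub before relying on them.

```python
SIZES = ("tiny", "base", "small", "medium", "large")
VARIANTS = ("main", "verbatim", "semantic")


def model_id(size: str, variant: str = "main", org: str = "NbAiLabBeta") -> str:
    """Build a repository ID from a size and variant.

    ASSUMPTION: the naming pattern is inferred from the single ID shown on
    this page and is not guaranteed for every size/variant combination.
    """
    size, variant = size.lower(), variant.lower()
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}; expected one of {SIZES}")
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant {variant!r}; expected one of {VARIANTS}")
    suffix = "" if variant == "main" else f"-{variant}"
    return f"{org}/nb-whisper-{size}{suffix}"


if __name__ == "__main__":
    print(model_id("large"))
    print(model_id("small", "verbatim"))
```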

Model Description

Property	Details
Developed by	NB AI-Lab
Shared by	NB AI-Lab
Model Type	`whisper`
Language(s) (NLP)	Norwegian, Norwegian Bokmål, Norwegian Nynorsk, English
License	Apache 2.0
Trained from model	openai/whisper-large
Code Repository	https://github.com/NbAiLab/nb-whisper/
Paper	Coming soon
Demo	See Spaces on this page

🚀 Quick Start

Online Demos

You can try the models directly through the HuggingFace Inference API on the right side of this page. Note that the model first needs to load and then runs on limited CPU capacity, so it may be slow. For a better experience, some models are temporarily hosted on TPUs for a few days, which significantly improves performance. You can find them under the Spaces section on the Main Page.

Local Setup with HuggingFace

You can also run the models locally. The Tiny, Base, and Small models are optimized for CPU execution. For the Medium and Large models, a GPU-equipped system is recommended. With Python installed, setting up and using these models with HuggingFace's Transformers is straightforward. The examples below use this sample mp3 file.

# Download the sample file
$ wget -N https://github.com/NbAiLab/nb-whisper/raw/main/audio/king.mp3

# Install the necessary libraries. The quotes stop the shell from treating ">=" as a redirection.
$ pip install "transformers>=4.35.2"

After that, you can run the following in Python:

from transformers import pipeline

# Load the model
asr = pipeline("automatic-speech-recognition", "NbAiLabBeta/nb-whisper-large")

# Transcribe
asr("king.mp3", generate_kwargs={'task': 'transcribe', 'language': 'no'})

Expected output

{
  {'text': ' Nordmenn er nordlendinger, trøndere, sørlendinger og folk fra alle andre regioner. Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra.'}
}
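The setup notes above recommend a GPU for the Medium and Large models. One way to act on that is to pick the device explicitly when building the pipeline. This is a minimal sketch: `pick_device` and `load_asr` are hypothetical helper names, and it assumes PyTorch is installed alongside Transformers.

```python
import warnings

# Sizes the setup notes describe as optimized for CPU execution.
CPU_FRIENDLY = {"tiny", "base", "small"}


def pick_device(model_size: str, cuda_available: bool) -> str:
    """Return a device string for the pipeline, warning on slow combinations."""
    if cuda_available:
        return "cuda:0"
    if model_size.lower() not in CPU_FRIENDLY:
        warnings.warn(f"{model_size} on CPU is likely to be slow; a GPU is recommended.")
    return "cpu"


def load_asr(repo_id: str = "NbAiLabBeta/nb-whisper-large", model_size: str = "large"):
    """Create the ASR pipeline on the selected device (downloads the model)."""
    # Imported lazily so the helper above works without the heavy dependencies.
    import torch
    from transformers import pipeline

    device = pick_device(model_size, torch.cuda.is_available())
    return pipeline("automatic-speech-recognition", repo_id, device=device)


if __name__ == "__main__":
    asr = load_asr()
    print(asr("king.mp3", generate_kwargs={"task": "transcribe", "language": "no"}))
```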

💻 Usage Examples

Extended HuggingFace

The output may contain repetitions at the end if the audio is longer than 30 seconds. Use the chunk_length_s argument to transcribe longer files. Setting it to 28 seconds instead of the default 30 seconds may give slightly better results. Setting the beam size to 5 can also improve accuracy considerably, though it takes somewhat longer and requires more memory.

# Long Transcripts
asr("king.mp3", chunk_length_s=28, generate_kwargs={'task': 'transcribe', 'language': 'no'})

# Increase accuracy by setting beam size to 5
asr("king.mp3", chunk_length_s=28, return_timestamps=True, generate_kwargs={'num_beams': 5, 'task': 'transcribe', 'language': 'no'})

# Return Timestamps
asr("king.mp3", chunk_length_s=28, return_timestamps=True, generate_kwargs={'task': 'transcribe', 'language': 'no'})

# Return Word Level Timestamps
asr("king.mp3", chunk_length_s=28, return_timestamps="word", generate_kwargs={'task': 'transcribe', 'language': 'no'})

# Transcribe to Nynorsk
asr("king.mp3", chunk_length_s=28, generate_kwargs={'task': 'transcribe', 'language': 'nn'})

# Transcribe to English
asr("king.mp3", chunk_length_s=28, generate_kwargs={'task': 'transcribe', 'language': 'en'})

Expected output

Long transcripts:

{
  {'text': ' Nordmenn er nordlendinger, trøndere, sørlendinger og folk fra alle andre regioner. Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra, hvilken nasjonalitet vi tilhører. Det vi kaller hjem, er der hjertet vårt er, og det kan ikke alltid plasseres innenfor landegrenser. Nordmenn er jenter som er glad i jenter, gutter som er glad i gutter, og jenter og gutter som er glad i hverandre. Nordmenn trommer på Gud, Allah, Altet og ingenting. Nordmenn liker Grieg, Kygo, Helbilis og Kari Bremnes. Med andre ord, Norge er dere. Norge er oss. Mitt største håp for Norge er at vi skal klare å ta vare på hverandre, at vi skal bygge dette landet videre på tillit, fellesskap og raushet.'}
}
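Some repeated sentences can survive even with chunking, as in the transcript above. If that matters for your application, a simple post-processing pass can drop consecutive duplicate sentences. This is a heuristic of our own, not part of the model or of Transformers, and `collapse_repeats` is a hypothetical helper name. It also removes repetitions the speaker actually said, so use it with care.

```python
import re


def collapse_repeats(text: str) -> str:
    """Drop sentences that exactly repeat the sentence immediately before them."""
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        # Compare case-insensitively against the last sentence we kept.
        if not kept or sentence.casefold() != kept[-1].casefold():
            kept.append(sentence)
    return " ".join(kept)


if __name__ == "__main__":
    sample = "Hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. Norge er oss."
    print(collapse_repeats(sample))
```

Only exact sentence-level repeats are removed. A repeated clause inside a longer sentence is left untouched.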

Timestamps:

{
  {'text': ' Nordmenn er nordlendinger, trøndere, sørlendinger og folk fra alle andre regioner. Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi er fra. Hvilken nasjonalitet vi er fra. hvilken nasjonalitet vi tilhører. Det vi kaller hjem, er der hjertet vårt er, og det kan ikke alltid plasseres innenfor landegrenser. Nordmenn er jenter som er glad i jenter, gutter som er glad i gutter, og jenter og gutter som er glad i hverandre. Nordmenn trommer på Gud, Allah, Altet og ingenting. Nordmenn liker Grieg, Kygo, Helbiles og Kari Bremnes. Med andre ord, Norge er dere. Norge er oss. Mitt største håp for Norge er at vi skal klare å ta vare på hverandre, at vi skal bygge dette landet videre på tillit, fellesskap og raushet.',
 'chunks': [{'timestamp': (0.0, 5.46),
   'text': ' Nordmenn er nordlendinger, trøndere, sørlendinger'},
  {'timestamp': (5.52, 8.68), 'text': ' og folk fra alle andre regioner.'},
  {'timestamp': (8.68, 16.64),
   'text': ' Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria.'},
  {'timestamp': (16.64, 13.3),
   'text': ' Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi er fra.'},
  {'timestamp': (13.32, 30.28),
   'text': ' Hvilken nasjonalitet vi er fra. hvilken nasjonalitet vi tilhører.'},
  {'timestamp': (32.52, 39.16),
   'text': ' Det vi kaller hjem, er der hjertet vårt er, og det kan ikke alltid plasseres'},
  {'timestamp': (39.16, 42.0), 'text': ' innenfor landegrenser.'},
  {'timestamp': (42.0, 46.74),
   'text': ' Nordmenn er jenter som er glad i jenter, gutter som er glad i gutter,'},
  {'timestamp': (46.74, 51.12),
   'text': ' og jenter og gutter som er glad i hverandre.'},
  {'timestamp': (51.16, 57.42),
   'text': ' Nordmenn trommer på Gud, Allah, Altet og ingenting.'},
  {'timestamp': (57.42, 64.3),
   'text': ' Nordmenn liker Grieg, Kygo, Helbiles og Kari Bremnes.'},
  {'timestamp': (64.34, 71.24),
   'text': ' Med andre ord, Norge er dere. Norge er oss.'},
  {'timestamp': (71.24, 78.04),
   'text': ' Mitt største håp for Norge er at vi skal klare å ta vare på hverandre,'},
  {'timestamp': (78.12, 84.68),
   'text': ' at vi skal bygge dette landet videre på tillit, fellesskap og raushet.'}]}
}
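The `chunks` list in this output maps naturally onto subtitle entries. The following sketch converts it to SubRip (SRT) text. `format_ts` and `chunks_to_srt` are hypothetical helper names, and the code assumes each chunk has a `timestamp` pair and a `text` field, as in the output above.

```python
def format_ts(seconds: float) -> str:
    """Format seconds as an SRT timestamp, HH:MM:SS,mmm."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def chunks_to_srt(chunks) -> str:
    """Turn pipeline chunks into SRT text, skipping chunks with missing times."""
    entries = []
    for chunk in chunks:
        start, end = chunk["timestamp"]
        if start is None or end is None:
            continue  # an open-ended chunk cannot be expressed as an SRT cue
        index = len(entries) + 1
        entries.append(
            f"{index}\n{format_ts(start)} --> {format_ts(end)}\n{chunk['text'].strip()}\n"
        )
    return "\n".join(entries)


if __name__ == "__main__":
    demo = [
        {"timestamp": (0.0, 5.46), "text": " Nordmenn er nordlendinger, trøndere, sørlendinger"},
        {"timestamp": (5.52, 8.68), "text": " og folk fra alle andre regioner."},
    ]
    print(chunks_to_srt(demo))
```

The sketch does not validate the timestamps. Cues whose end precedes their start, which can occur around chunk boundaries, are passed through unchanged and may need manual correction.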

Word Level Timestamps:

{
  {"text": "Nordmenn er nordlendinger, trøndere, sørlendinger og folk fra alle andre regioner. Nordmenn er også innvandret fra Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikke alltid så lett å si hvor vi er fra, hvilken nasjonalitet vi tilhører. Det vi kaller hjem, er der hjertet vårt er, og det kan ikke alltid plasseres innenfor landegrenser. Nordmenn er jenter som er glad i jenter, gutter som er glad i gutter, og jenter og gutter som er glad i hverandre. Nordmenn trommer på Gud, Allah, Altet og ingenting. Nordmenn liker Grieg, Kygo, Helbilis og Kari Bremnes. Med andre ord, Norge er dere. Norge er oss. Mitt største håp for Norge er at vi skal klare å ta vare på hverandre, at vi skal bygge dette landet videre på tillit, fellesskap og raushet.",
  "chunks": [
    {"text": "Nordmenn", "timestamp": [0.72, 1.42]},
    {"text": "er", "timestamp": [1.42, 1.74]},
    // ... more chunks ...
    {"text": "raushet.", "timestamp": [83.1, 84.88]}
  ]
  }
}

Nynorsk:

{
  {"text": "Nordmenn er nordlendingar, trøndarar, sørlendingar og folk frå alle andre regionar. Nordmenn er også innvandra frå Afghanistan, Pakistan, Polen, Sverige, Somalia og Syria. Det er ikkje alltid så lett å seie kvar vi er frå, kva nasjonalitet vi tilhøyrer. Det vi kallar heim, er der hjartet vårt er, og det kan ikkje alltid plasserast innanfor landegrenser. Nordmenn er jenter som er glad i jenter, gutar som erade i gutar, og jenter og gutar som er glade i kvarandre. Nordmenn trommar på Gud, Allah, Altet og ingenting. Nordmenn liker Grieg, Kygo, Helbiles og Kari Bremnes. Med andre ord, Noreg er dere! Noreg er oss. Mitt største håp for Noreg er at vi skal klare å ta vare på kvarandre, at vi skal byggje dette landet vidare på tillit, fellesskap og raushet."}
}

English:

{
  {"text": "Norwegians are Norwegians, trønders, southerners and people from all other regions. Norwegians are also invaded from Afghanistan, Pakistan, Poland, Sweden, Somalia and Suria. It is not always so easy to say where we are from, what nationality we belong to. What we call home is where our heart is, and it cannot always be placed within national borders. Norwegians are girls who like girls, boys who like boys, and girls and boys who like each other. Norwegians thrump on God, Allah, Altet and nothing. Norwegians like Grieg, Kygo, Helbilis and Kari Bremnes. In other words, Norway is you. Norway is us. My biggest hope for Norway is that we should be able to take care of each other, that we should build this country on trust, community and generosity."}
}

Whisper CPP

Whisper CPP is a C++ implementation of the Whisper model, offering the same functionalities with C++ efficiency and performance optimizations. It allows embedding any Whisper model into a binary file, facilitating real-world application development. However, some knowledge of compiling C++ programs is required. Their homepage provides application-building examples, including real-time transcription.

We have converted this model to the ggml-format model used by Whisper CPP binaries. The file can be downloaded here, and a q5_0 quantized version is also available here.

# We can download and compile whisper.cpp
$ git clone --depth 1 https://github.com/ggerganov/whisper.cpp --branch v1.5.1
$ cd whisper.cpp/
$ make

# We also need to convert the audio to WAV as that is the only format supported by whisper.cpp
$ wget -N https://github.com/NbAiLab/nb-whisper/raw/main/audio/king.mp3
$ ffmpeg -i king.mp3 -ar 16000 -ac 1 -c:a pcm_s16le king.wav                                        

# Let's download the two ggml files from this site
$ wget -N https://huggingface.co/NbAiLabBeta/nb-whisper-large/resolve/main/ggml-model.bin -O models/nb-large-ggml-model.bin
$ wget -N https://huggingface.co/NbAiLabBeta/nb-whisper-large/resolve/main/ggml-model-q5_0.bin -O models/nb-large-ggml-model-q5_0.bin

# And run it with the f16 default model
$ ./main -l no -m models/nb-large-ggml-model.bin king.wav

# Or the quantized version
$ ./main -l no -m models/nb-large-ggml-model-q5_0.bin king.wav
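If you want to drive the conversion and transcription steps above from a script, the same commands can be wrapped with `subprocess`. This is a sketch under a few assumptions: `ffmpeg` is on the PATH, the script runs from the `whisper.cpp` directory after `make`, and the transcript is read from standard output. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_to_wav_cmd(src: str) -> List[str]:
    """Build the ffmpeg command that converts audio to 16 kHz mono 16-bit WAV."""
    dst = str(Path(src).with_suffix(".wav"))
    return ["ffmpeg", "-i", src, "-ar", "16000", "-ac", "1", "-c:a", "pcm_s16le", dst]


def whisper_cpp_cmd(wav: str, model: str, language: str = "no",
                    binary: str = "./main") -> List[str]:
    """Build the whisper.cpp command line with the flags used above."""
    return [binary, "-l", language, "-m", model, wav]


def transcribe(src: str, model: str) -> str:
    """Convert the file, run whisper.cpp, and return whatever it prints to stdout."""
    subprocess.run(ffmpeg_to_wav_cmd(src), check=True)
    wav = str(Path(src).with_suffix(".wav"))
    result = subprocess.run(whisper_cpp_cmd(wav, model), check=True,
                            capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    print(transcribe("king.mp3", "models/nb-large-ggml-model-q5_0.bin"))
```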

WhisperX and Speaker Diarization

Speaker diarization identifies and separates different speakers in an audio recording, enhancing the quality of transcribing meetings or phone calls. WhisperX is an easy way to use our models for diarizing speech. It uses phoneme-based Wav2Vec-models for better timestamp alignment and, as of December 2023, has native support for nb-wav2vec-models. It currently uses PyAnnote-audio for diarization, which has a strict licence.

# Follow the install instructions on https://github.com/m-bain/whisperX
# Make sure you have a HuggingFace account and have agreed to the pyannote terms

# Log in (or supply HF Token in command line)
huggingface-cli login

# Download a test file
wget -N https://github.com/NbAiLab/nb-whisper/raw/main/audio/knuthamsun.mp3

# Optional. If you get complaints about missing support for Norwegian, do:
pip uninstall whisperx && pip install git+https://github.com/m-bain/whisperx.git@8540ff5985fceee764acbed94f656063d7f56540

# Transcribe the test file. All transcripts will end up in the directory of the mp3-file
whisperx knuthamsun.mp3 --model NbAiLabBeta/nb-whisper-large --language no --diarize

You can also run WhisperX from Python. Check the instructions on the WhisperX homepage.
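Diarized output is often easier to read once consecutive segments from the same speaker are merged into turns. The sketch below does this for a list of segment dictionaries. The field names (`speaker`, `start`, `end`, `text`) are an assumption about the shape of the diarized result, so adjust them to match what your WhisperX version actually writes. `merge_speaker_turns` is a hypothetical helper name.

```python
def merge_speaker_turns(segments):
    """Merge adjacent segments that share a speaker label into single turns.

    ASSUMPTION: each segment is a dict with 'start', 'end', 'text' and an
    optional 'speaker' key; segments without a label are grouped as UNKNOWN.
    """
    turns = []
    for seg in segments:
        speaker = seg.get("speaker", "UNKNOWN")
        text = seg["text"].strip()
        if turns and turns[-1]["speaker"] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["text"] += " " + text
            turns[-1]["end"] = seg["end"]
        else:
            turns.append({"speaker": speaker, "start": seg["start"],
                          "end": seg["end"], "text": text})
    return turns


def format_turns(turns) -> str:
    """Render turns as 'SPEAKER [start-end]: text' lines."""
    return "\n".join(
        f"{t['speaker']} [{t['start']:.2f}-{t['end']:.2f}]: {t['text']}" for t in turns
    )
```

Merging by label alone ignores long pauses. If turn boundaries matter, you may also want to split when the gap between segments exceeds a threshold.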

API

Instructions for accessing the models via a simple API are in the demos under Spaces. Note that these demos are temporary and will only be available for a few weeks.

📚 Documentation

Training Data

The training data comes from Språkbanken and the National Library of Norway's digital collection, including:

NST Norwegian ASR Database (16 kHz) and its corresponding dataset
Transcribed speeches from the Norwegian Parliament by Språkbanken
TV broadcast (NRK) subtitles (NLN digital collection)
Audiobooks (NLN digital collection)

Downstream Use

The models, especially the smaller ones, may occasionally hallucinate and may miss parts of the transcript. They are designed to convert spoken language into grammatically correct written sentences, which may not be word-for-word transcriptions. We provide two extra model variants for different transcription styles. We encourage users to try the models themselves.

Bias, Risks, and Limitations

Using these models without proper risk assessment and mitigation is irresponsible. They may contain biases or other undesirable distortions. Users deploying or integrating these models are responsible for risk mitigation and compliance with AI regulations. The National Library of Norway, as the model owner, disclaims liability for third-party use outcomes.

Software

The model was trained using Jax/Flax and converted to PyTorch, TensorFlow, whisper.cpp, and ONNX formats, available under Files and versions. We welcome requests for conversion to other formats. All training code and scripts are released under the Apache License 2.0 in the GitHub repository nb-whisper.

📄 License

The models are released under the Apache 2.0 license.

🔧 Technical Details

The NB-Whisper Large model is a product of the NoSTram project led by Per Egil Kummervold (@pere) at the National Library of Norway. Key contributors include Javier de la Rosa (@versae), Freddy Wetjen (@freddyw), and Rolv-Arild Braaten ([@Rolv-Arild](https://huggingface.co/Rolv-Arild)). NB AI-Lab, under the direction of Svein Arne Brygfjeld (@Brygfjeld), supported the project's successful completion. A detailed paper on our process and findings is forthcoming.

Disclaimer

The models in this repository are for general use and available to third parties. They may have biases and/or other undesirable distortions. When third parties deploy or provide systems and/or services...
