Whisper-NER-v1 Open-Source Model - Free Speech Transcription and Open-Type Entity Recognition

Whisper-NER-v1

Developed by aiola

WhisperNER is a novel model capable of simultaneous speech transcription and entity recognition, supporting open-type named entity recognition (NER).

Speech Recognition

Safetensors

Supports Multiple Languages

Open Source License: MIT

#Speech Entity Recognition #Open Type NER #Multi-task ASR

Downloads 174

Release Date: 9/23/2024

Model Overview

WhisperNER is a powerful foundational model suitable for downstream tasks in automatic speech recognition (ASR) with NER, and its performance can be enhanced by fine-tuning on specific datasets.

Model Features

Joint Speech Transcription and Entity Recognition

Capable of simultaneous speech transcription and entity recognition, supporting open-type named entity recognition (NER).

Open-Type NER Support

Entity types are supplied as a prompt at inference time, so the model can recognize diverse and evolving entity types.

Fine-Tunable Foundational Model

Suitable for downstream tasks in automatic speech recognition (ASR) with NER, and its performance can be enhanced by fine-tuning on specific datasets.

Model Capabilities

Speech Transcription

Named Entity Recognition

Open-Type Entity Recognition

Use Cases

Speech-to-Text and Entity Extraction

Meeting Minutes and Entity Extraction

Convert meeting recordings into text and extract key entities (e.g., names, companies, locations).

Enhances the efficiency and searchability of meeting records.

News Audio Analysis

Analyze news broadcast audio to extract key figures, organizations, and locations.

Quickly generates news summaries and entity indexes.

🚀 Whisper-NER

Whisper-NER is a novel model that enables joint speech transcription and entity recognition, supporting open-type NER for recognizing diverse entities during inference.

Demo: https://huggingface.co/spaces/aiola/whisper-ner-v1
Paper: WhisperNER: Unified Open Named Entity and Speech Recognition
Code: https://github.com/aiola-lab/whisper-ner

🚀 Quick Start

We introduce WhisperNER, a novel model that allows joint speech transcription and entity recognition. WhisperNER supports open-type NER, enabling recognition of diverse and evolving entities at inference. The WhisperNER model is designed as a strong base model for the downstream task of ASR with NER, and can be fine-tuned on specific datasets for improved performance.

🔧 Technical Details

Training Details

aiola/whisper-ner-v1 was trained on the NuNER dataset to perform joint audio transcription and NER tagging. The model was trained and evaluated only on English data. Check out the paper for full details.

💻 Usage Examples

Basic Usage

import torch
import torchaudio  # required for torchaudio.load and the Resample transform below
from transformers import WhisperProcessor, WhisperForConditionalGeneration

model_path = "aiola/whisper-ner-v1"
audio_file_path = "path/to/audio/file"
prompt = "person, company, location"  # comma-separated entity tags

# load model and processor from pre-trained
processor = WhisperProcessor.from_pretrained(model_path)
model = WhisperForConditionalGeneration.from_pretrained(model_path)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# load audio file: user is responsible for loading the audio files themselves
target_sample_rate = 16000
signal, sampling_rate = torchaudio.load(audio_file_path)
resampler = torchaudio.transforms.Resample(sampling_rate, target_sample_rate)
signal = resampler(signal)
# convert to mono or remove first dim if needed
if signal.ndim == 2:
    signal = torch.mean(signal, dim=0)
# pre-process to get the input features
input_features = processor(
    signal, sampling_rate=target_sample_rate, return_tensors="pt"
).input_features
input_features = input_features.to(device)

prompt_ids = processor.get_prompt_ids(prompt.lower(), return_tensors="pt")
prompt_ids = prompt_ids.to(device)

# generate token ids by running model forward sequentially
with torch.no_grad():
    predicted_ids = model.generate(
        input_features,
        prompt_ids=prompt_ids,
        generation_config=model.generation_config,
        language="en",
    )

# post-process token ids to text, remove prompt
transcription = processor.batch_decode(
    predicted_ids, skip_special_tokens=True
)[0]
print(transcription)

Advanced Usage

Inference can be done using the above code. For more inference code and details, see the whisper-ner repo: https://github.com/aiola-lab/whisper-ner
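
Because the entity types are passed as a lowercase, comma-separated prompt string, it can be convenient to build that string from a list of type names. The following is a minimal sketch only: `build_entity_prompt` is a hypothetical helper that is not part of the whisper-ner repo or of transformers, and the trimming and de-duplication it performs are assumptions about sensible input cleanup, not documented model requirements.

```python
def build_entity_prompt(entity_types):
    """Join entity type names into a lowercase, comma-separated prompt string.

    Hypothetical helper (not part of whisper-ner or transformers).
    Trims whitespace, lowercases, and drops empty or repeated names
    while keeping the original order.
    """
    seen = set()
    tags = []
    for name in entity_types:
        tag = name.strip().lower()
        if tag and tag not in seen:
            seen.add(tag)
            tags.append(tag)
    return ", ".join(tags)


prompt = build_entity_prompt(["Person", " Company ", "person", "Location"])
print(prompt)  # person, company, location
```

The resulting string can be used as the `prompt` value in the Basic Usage code, where it is passed to `processor.get_prompt_ids`.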

📄 License

This project is licensed under the MIT License.
