# Frame classification for filled pauses
This model classifies individual 20ms frames of audio based on the presence of filled pauses ("eee", "errm", ...).
## Quick Start
This model is designed for audio classification, specifically classifying 20ms frames of audio based on the presence of filled pauses.
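A minimal loading sketch (the full inference example, including conversion of frame predictions to time intervals, is given under Usage Examples below):

```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification

model_name = "classla/wav2vecbert2-filledPause"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name)
```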
## Features
- Classifies 20ms audio frames for filled pauses.
- Trained on a human-annotated Slovenian speech corpus.
- Evaluated on multiple corpora, with post-processing options for better results.
## Installation
No model-specific installation is required: the model is loaded through the Hugging Face `transformers` library. The usage example below additionally uses `datasets`, `torch`, `numpy`, and `pandas` (e.g. `pip install transformers datasets torch numpy pandas`).
## Documentation

### Model Information
| Property | Details |
|---|---|
| Supported Languages | Slovenian (sl), Croatian (hr), Serbian (sr), Czech (cs), Polish (pl) |
| Base Model | facebook/w2v-bert-2.0 |
| Pipeline Tag | audio-classification |
| Metrics | f1, recall, precision |
### Training Data
The model was trained on the human-annotated Slovenian speech corpus ROG-Artur. Recordings from the train split were segmented into chunks of at most 30 seconds.
### Evaluation
Although the model outputs a sequence of 0s and 1s, one label per 20ms frame, evaluation was done on the event level: spans of consecutive 1s were bundled into a single event, and a predicted event that at least partially overlaps a true event counts as a true positive. We report precision, recall, and F1-score of the positive class.
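As an illustration, a minimal sketch of this event-level scoring, assuming gold and predicted filled pauses are given as `(start_s, end_s)` intervals. The function names are illustrative, and the released evaluation scripts may count edge cases (e.g. one prediction overlapping several gold events) differently:

```python
# Event-level scoring sketch: a prediction is a true positive if it overlaps
# any gold event at least partially; names and conventions are illustrative.
def overlaps(a, b):
    """True if intervals a and b share any time span."""
    return a[0] < b[1] and b[0] < a[1]

def event_level_scores(gold, pred):
    tp = sum(any(overlaps(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(overlaps(g, p) for p in pred) for g in gold)
    precision = tp / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(event_level_scores(gold=[(0.5, 0.9), (2.0, 2.4)], pred=[(0.6, 1.0)]))
# -> precision 1.0, recall 0.5, F1 ~0.667
```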
#### Evaluation on ROG corpus

| postprocessing | recall | precision | F1 |
|---|---|---|---|
| none | 0.981 | 0.955 | 0.968 |
#### Evaluation on ParlaSpeech corpora
For every language in the [ParlaSpeech collection](https://huggingface.co/collections/classla/parlaspeech-670923f23ab185f413d40795), 400 instances were sampled and annotated by human annotators.
Since the ParlaSpeech corpora are too large to be manually segmented the way ROG is, a few failure modes were observed at inference time, and post-processing was found to improve the results. Some false positives are caused by improper audio segmentation, which is why discarding predictions that start at the very beginning or end at the very end of the audio can be beneficial. Another failure mode is the prediction of very short events, which is why very short predictions can also be safely discarded. These filters correspond to the `drop_initial`, `drop_final`, and `drop_short` options of the `frames_to_intervals` helper shown under Usage Examples below.
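A minimal standalone sketch of this filtering, assuming predictions have already been converted to `(start_s, end_s)` intervals in seconds. The function name, the `audio_end_s` argument, and the example values are illustrative; the 0.08 s cutoff mirrors the default of the `frames_to_intervals` helper below:

```python
# Sketch of the "drop_short_initial_and_final" post-processing over intervals.
def drop_short_initial_and_final(intervals, audio_end_s, short_cutoff_s=0.08):
    kept = []
    for start_s, end_s in intervals:
        if end_s - start_s < short_cutoff_s:  # very short events are usually false positives
            continue
        if start_s == 0.0:  # events starting at the very beginning of the chunk
            continue
        if end_s == audio_end_s:  # events ending at the very end of the chunk
            continue
        kept.append((start_s, end_s))
    return kept

print(drop_short_initial_and_final([(0.0, 0.3), (1.2, 1.24), (2.0, 2.5)], audio_end_s=3.0))
# -> [(2.0, 2.5)]
```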
With added post - processing, the model achieves the following metrics:
| lang | postprocessing | recall | precision | F1 |
|---|---|---|---|---|
| CZ | drop_short_initial_and_final | 0.889 | 0.859 | 0.874 |
| HR | drop_short_initial_and_final | 0.94 | 0.887 | 0.913 |
| PL | drop_short_initial_and_final | 0.903 | 0.947 | 0.924 |
| RS | drop_short_initial_and_final | 0.966 | 0.915 | 0.94 |
## Usage Examples

### Basic Usage
```python
from transformers import AutoFeatureExtractor, Wav2Vec2BertForAudioFrameClassification
from datasets import Dataset, Audio
import torch
import numpy as np
from pathlib import Path

device = torch.device("cuda")
model_name = "classla/wav2vecbert2-filledPause"
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = Wav2Vec2BertForAudioFrameClassification.from_pretrained(model_name).to(device)

# Build a dataset with the audio file(s) to be processed; replace the path below
# with your own recording. Audio is decoded as 16 kHz mono.
ds = Dataset.from_dict(
    {
        "audio": [
            "/cache/peterr/mezzanine_resources/filled_pauses/data/dev/Iriss-J-Gvecg-P500001-avd_2082.293_2112.194.wav"
        ],
    }
).cast_column("audio", Audio(sampling_rate=16_000, mono=True))


def frames_to_intervals(
    frames: list[int],
    drop_short=True,
    drop_initial=True,
    drop_final=True,
    short_cutoff_s=0.08,
) -> list[tuple[float]]:
    """Transforms a list of ones or zeros, corresponding to annotations on frame
    levels, to a list of intervals ([start second, end second]).

    Allows for additional filtering on duration (false positives are often
    short) and start times (false positives starting at 0.0 are often an
    artifact of poor segmentation).

    :param list[int] frames: Input frame labels
    :param bool drop_short: Drop everything shorter than short_cutoff_s,
        defaults to True
    :param bool drop_initial: Drop predictions starting at 0.0, defaults to True
    :param bool drop_final: Drop predictions ending at audio end, defaults to True
    :param float short_cutoff_s: Duration in seconds of shortest allowable
        prediction, defaults to 0.08
    :return list[tuple[float]]: List of intervals [start_s, end_s]
    """
    from itertools import pairwise
    import pandas as pd

    results = []
    ndf = pd.DataFrame(
        data={
            "time_s": [0.020 * i for i in range(len(frames))],
            "frames": frames,
        }
    )
    ndf = ndf.dropna()
    # Indices where the 0/1 label changes delimit candidate event boundaries.
    indices_of_change = ndf.frames.diff()[ndf.frames.diff() != 0].index.values
    for si, ei in pairwise(indices_of_change):
        # Keep only spans whose majority label is 1 (filled pause).
        if ndf.loc[si : ei - 1, "frames"].mode()[0] == 0:
            pass
        else:
            results.append(
                (
                    round(ndf.loc[si, "time_s"], 3),
                    round(ndf.loc[ei, "time_s"], 3),
                )
            )
    # Optional post-processing: drop short events and events touching either edge.
    if drop_short and (len(results) > 0):
        results = [i for i in results if (i[1] - i[0] >= short_cutoff_s)]
    if drop_initial and (len(results) > 0):
        results = [i for i in results if i[0] != 0.0]
    if drop_final and (len(results) > 0):
        results = [i for i in results if i[1] != 0.02 * len(frames)]
    return results


def evaluator(chunks):
    sampling_rate = chunks["audio"][0]["sampling_rate"]
    with torch.no_grad():
        inputs = feature_extractor(
            [i["array"] for i in chunks["audio"]],
            return_tensors="pt",
            sampling_rate=sampling_rate,
        ).to(device)
        logits = model(**inputs).logits
    # Per-frame class predictions (0 = no filled pause, 1 = filled pause).
    y_pred = np.array(logits.cpu()).argmax(axis=-1)
    intervals = [frames_to_intervals(i) for i in y_pred]
    return {"y_pred": y_pred.tolist(), "intervals": intervals}


ds = ds.map(evaluator, batched=True)
print(ds["y_pred"][0])
print(ds["intervals"][0])
```
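The first printout is the list of per-frame 0/1 labels; the second lists the detected filled-pause intervals in seconds, with the default post-processing (`drop_short`, `drop_initial`, `drop_final`) applied.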
## License

This project is licensed under the Apache-2.0 license.
## Citation
Coming soon.