whisper-large-v3-speech-flow Open-source Model - Free Detection of Speech Fluency and Disfluency Types

Whisper Large V3 Speech Flow

Developed by tiantiaf

A speech fluency classification model based on Whisper Large v3, capable of detecting speech fluency and disfluency types

Audio Classification

Safetensors

Language: English
Open Source License: Apache-2.0
Tags: #Speech Fluency Detection #Disfluency Type Identification #Multi-window Analysis

Downloads 157

Release Time: 5/22/2025

Model Overview

This model implements a two-stage speech fluency classification method: it first detects whether speech is fluent and, if not, classifies the disfluency type (block, prolongation, sound repetition, word repetition, interjection).

Model Features

Fluency Detection

Classifies each speech segment as fluent or disfluent

Disfluency Type Classification

Further classifies disfluent speech into 5 specific types

Windowed Processing

Uses a 3-second window size and a 1-second step size to process long speech
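The windowing arithmetic can be sketched in a few lines. This is a minimal illustration, assuming 16 kHz audio (the rate implied by the usage example further down); `window_bounds` is a hypothetical helper name and is not part of the released package.

```python
def window_bounds(num_samples, sample_rate=16000, window_s=3, step_s=1):
    """Return (start, end) sample indices for each analysis window.

    Hypothetical helper: mirrors the 3-second window / 1-second step
    arithmetic described here, and always yields at least one window.
    """
    window = window_s * sample_rate
    step = step_s * sample_rate
    # Number of full windows that fit; clips shorter than one window get one.
    num_windows = max((num_samples - window) // step + 1, 1)
    bounds = []
    for idx in range(num_windows):
        start = idx * step
        end = min(start + window, num_samples)
        bounds.append((start, end))
    return bounds


# A 10-second clip at 16 kHz yields 8 overlapping windows.
bounds = window_bounds(16000 * 10)
print(len(bounds), bounds[0], bounds[-1])
```

Under these assumptions, a clip shorter than 3 seconds produces a single, shorter window rather than none.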

Model Capabilities

Speech Fluency Detection

Disfluency Type Classification

Long Speech Segmentation Processing

Use Cases

Speech Therapy

Stuttering Assessment

Assists speech therapists in evaluating the severity and types of stuttering in patients

Quantitative analysis of the frequency and type distribution of disfluent speech

Speech Quality Analysis

Speech Fluency Scoring

Provides fluency metrics for speech quality assessment systems

Automatically generates speech fluency reports

🚀 Whisper Large v3 for Speech Flow (Fluency) Classification

This model is designed for speech fluency classification, offering a solution to accurately assess speech flow and identify disfluency types.

🚀 Quick Start

✨ Features

This model includes the implementation of speech fluency classification described in Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits.
The model first classifies the speech as one of ["fluent", "disfluent"], using a 3-second window size and a 1-second step size.
If disfluent speech is detected, it then predicts the disfluency types from ["Block", "Prolongation", "Sound Repetition", "Word Repetition", "Interjection"].
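Because each prediction covers one 3-second window and windows start 1 second apart, per-window labels can be mapped back to time spans in the recording. The sketch below is one possible way to do that and is not taken from the released code: `disfluent_spans` is a hypothetical helper, and it assumes a list of per-window "fluent"/"disfluent" labels ordered by window index.

```python
def disfluent_spans(fluency_labels, window_s=3.0, step_s=1.0):
    """Merge overlapping or touching 'disfluent' windows into time spans.

    Hypothetical helper. Returns a list of (start_seconds, end_seconds).
    Span ends are nominal: for clips shorter than one window they can
    exceed the actual audio duration.
    """
    spans = []
    for idx, label in enumerate(fluency_labels):
        if label != "disfluent":
            continue
        start = idx * step_s
        end = start + window_s
        if spans and start <= spans[-1][1]:
            # This window overlaps (or touches) the previous span: extend it.
            spans[-1] = (spans[-1][0], end)
        else:
            spans.append((start, end))
    return spans


labels = ["fluent", "disfluent", "disfluent", "fluent",
          "fluent", "fluent", "fluent", "disfluent"]
print(disfluent_spans(labels))
```

Merging is a design choice for readability. Keeping the raw per-window labels is preferable when downstream code needs window-level detail.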

📦 Installation

Download repo

git clone git@github.com:tiantiaf0627/vox-profile-release.git

Install the package

conda create -n vox_profile python=3.8
conda activate vox_profile
cd vox-profile-release
pip install -e .

💻 Usage Examples

Basic Usage

# Load libraries
import torch
import torch.nn.functional as F
from src.model.fluency.whisper_fluency import WhisperWrapper

# Find device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load model from Huggingface
model = WhisperWrapper.from_pretrained("tiantiaf/whisper-large-v3-speech-flow").to(device)
model.eval()

Advanced Usage

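# Prepare input: 10 seconds of silent placeholder audio sampled at 16 kHz
audio_data = torch.zeros([1, 16000*10]).float().to(device)
# Number of 3-second windows with a 1-second step (at least one window)
audio_segment = max((audio_data.shape[1] - 3*16000) // 16000 + 1, 1)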
input_audio = list()
input_audio_length = list()
for idx in range(audio_segment): 
    input_audio.append(audio_data[0, 16000*idx:16000*idx+3*16000])
    input_audio_length.append(torch.tensor(len(audio_data[0, 16000*idx:16000*idx+3*16000])))
input_audio = torch.stack(input_audio, dim=0)
input_audio_length = torch.stack(input_audio_length, dim=0)

# Prediction
fluency_outputs, disfluency_type_outputs = model(input_audio, length=input_audio_length)
fluency_prob   = F.softmax(fluency_outputs, dim=1).detach().cpu().numpy().astype(float).tolist()

disfluency_type_prob = torch.sigmoid(disfluency_type_outputs)
# A higher threshold can be set in practice
disfluency_type_predictions = (disfluency_type_prob > 0.7).int().detach().cpu().numpy().tolist()
disfluency_type_prob = disfluency_type_prob.detach().cpu().numpy().astype(float).tolist()

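# Disfluency type labels, in the order listed in the Features section
disfluency_type_labels = ["Block", "Prolongation", "Sound Repetition", "Word Repetition", "Interjection"]

# Now let's gather the predictions for the utterance
utterance_fluency_list = list()
utterance_disfluency_list = list()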
for audio_idx in range(audio_segment):
    disfluency_type = list()
    if fluency_prob[audio_idx][0] > 0.5:
        utterance_fluency_list.append("fluent")
    else:
        # If the prediction is disfluent, then which disfluency type
        utterance_fluency_list.append("disfluent")
        predictions = disfluency_type_predictions[audio_idx]
        for label_idx in range(len(predictions)):
            if predictions[label_idx] == 1:
                disfluency_type.append(disfluency_type_labels[label_idx])
    utterance_disfluency_list.append(disfluency_type)

# Print the per-window fluency labels and disfluency types
print(utterance_fluency_list)
print(utterance_disfluency_list)
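For the quantitative use cases mentioned earlier (frequency and type distribution of disfluent speech), the two lists can be reduced to a small summary. This is a rough sketch rather than part of the released code: `summarize_fluency` is a hypothetical helper, and its inputs are assumed to have the same shape as `utterance_fluency_list` and `utterance_disfluency_list`.

```python
from collections import Counter


def summarize_fluency(fluency_list, disfluency_list):
    """Summarize per-window predictions for one utterance.

    Hypothetical helper. Counts are per window, not per disfluency event:
    overlapping windows may count the same event more than once.
    """
    total = len(fluency_list)
    disfluent = sum(1 for label in fluency_list if label == "disfluent")
    # Flatten the per-window type lists and count each type.
    type_counts = Counter(t for types in disfluency_list for t in types)
    return {
        "num_windows": total,
        "disfluent_ratio": disfluent / total if total else 0.0,
        "type_counts": dict(type_counts),
    }


# Example with hand-written placeholder predictions (not model output).
example_fluency = ["fluent", "disfluent", "disfluent", "fluent"]
example_types = [[], ["Block"], ["Block", "Interjection"], []]
print(summarize_fluency(example_fluency, example_types))
```

Because windows overlap by 2 seconds, the ratio is best read as the share of windows flagged as disfluent, not as the share of speaking time.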

📚 Documentation

If you have any questions, please contact: Tiantian Feng (tiantiaf@usc.edu)

📄 License

This model is licensed under the Apache 2.0 license.

📖 Citation

Please cite our paper if you use our model or find it useful in your work.

@article{feng2025vox,
  title={Vox-Profile: A Speech Foundation Model Benchmark for Characterizing Diverse Speaker and Speech Traits},
  author={Feng, Tiantian and Lee, Jihwan and Xu, Anfeng and Lee, Yoonjeong and Lertpetchpun, Thanathai and Shi, Xuan and Wang, Helin and Thebaud, Thomas and Moro-Velazquez, Laureano and Byrd, Dani and others},
  journal={arXiv preprint arXiv:2505.14648},
  year={2025}
}

Property	Details
Model Type	Whisper Large v3 for Speech Flow (Fluency) Classification
Base Model	openai/whisper-large-v3
Pipeline Tag	audio-classification
Metrics	accuracy
License	apache-2.0
