Smart Turn v2: Open-Source Semantic Voice Activity Detection Model. Analyzes the raw waveform to determine whether the speaker has finished speaking

Smart Turn v2

Developed by pipecat-ai

Smart Turn v2 is an open-source semantic voice activity detection (VAD) model that determines whether the speaker has finished speaking by analyzing the raw waveform.

Speech Recognition

Safetensors

Other: #Multilingual Voice Endpoint Detection #Real-time Voice Interaction #Low-latency VAD

Release date: 7/11/2025

Model Overview

The model supports 14 languages, is compact (about 360 MB), and is fast enough for real-time use. It suits scenarios such as voice assistants and real-time transcription.

Model Features

Multilingual Support

Supports 14 languages, so a single model can handle turn detection across different language environments.

Small Model Size

The model is about 6 times smaller than v1 (roughly 360 MB versus 2.3 GB), which makes it easier to deploy.

Fast Inference

Inference is about 3 times faster than v1: analyzing 8 seconds of audio takes about 12 milliseconds on an NVIDIA L40S.

Model Capabilities

Semantic Voice Activity Detection

Multilingual Voice Analysis

Real-time Voice Processing

Use Cases

Voice Assistant/Chatbot

Avoid Interrupting Users

Waits until the user has actually finished speaking before replying, so the assistant does not cut them off.

Improve the user experience

Real-time Transcription + Text-to-Speech (TTS)

Trigger TTS

Triggers TTS only when the user's turn has ended, avoiding "double-talk" (the system speaking over the user).

Improve transcription accuracy

Call Center Assistance and Analysis

Speaker Diarization and Sentiment Analysis

Provides accurate turn segmentation for speaker diarization and sentiment analysis pipelines.

Improve analysis efficiency

🚀 Smart Turn v2

Smart Turn v2 is an open‑source semantic Voice Activity Detection (VAD) model. It analyzes the raw waveform, not the transcript, to determine whether a speaker has finished their turn. Compared with v1, it offers the following improvements:

Multilingual – Supports 14 languages (EN, FR, DE, ES, PT, ZH, JA, HI, IT, KO, NL, PL, RU, TR).
6 × smaller – Approximately 360 MB, compared to 2.3 GB in v1.
3 × faster – It takes about 12 ms to analyze 8 s of audio on an NVIDIA L40S.

🚀 Quick Start

from transformers import pipeline
import soundfile as sf

pipe = pipeline(
    "audio-classification",
    model="pipecat-ai/smart-turn-v2",
    feature_extractor="facebook/wav2vec2-base"
)

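speech, sr = sf.read("user_utterance.wav", dtype="float32")
if speech.ndim > 1:
    speech = speech.mean(axis=1)  # downmix to mono; the pipeline expects 1-D audio
if sr != 16_000:
    raise ValueError("Resample to 16 kHz")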

result = pipe(speech, top_k=None)[0]
print(f"Completed turn? {result['label']}  Prob: {result['score']:.3f}")
# label == 'complete' → user has finished speaking

✨ Features

Intended Use & Task

Use‑case	Why this model helps
Voice agents / chatbots	Wait to reply until the user has actually finished speaking.
Real‑time transcription + TTS	Avoid “double‑talk” by triggering TTS only when the user turn ends.
Call‑centre assist & analytics	Accurate segmentation for diarisation and sentiment pipelines.
Any project needing semantic VAD	Detects incomplete thoughts, filler words (“um …”, “えーと …”) and intonation cues ignored by classic energy‑based VAD.

The model outputs a single probability; values ≥ 0.5 indicate the speaker has completed their utterance.
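As a minimal sketch of how that threshold might be applied to the Quick Start output, the helper below turns one pipeline result into a completion probability and a yes/no decision. It assumes a two-class head whose positive label is 'complete' (as in the Quick Start comment). The function names and the `threshold` parameter are illustrative and not part of the model's API.

```python
def completion_probability(result: dict) -> float:
    """Return P(turn complete) from one {'label', 'score'} dict.

    Assumption: the classifier is binary, so a score reported for any
    label other than 'complete' equals 1 - P(complete).
    """
    score = float(result["score"])
    return score if result["label"] == "complete" else 1.0 - score


def is_turn_complete(result: dict, threshold: float = 0.5) -> bool:
    """Apply the documented cut-off: values >= 0.5 mean the turn has ended."""
    return completion_probability(result) >= threshold


if __name__ == "__main__":
    example = {"label": "complete", "score": 0.87}  # made-up value for illustration
    print(is_turn_complete(example))
```

Raising `threshold` above 0.5 would make an agent more reluctant to take its turn, trading a slower response for fewer interruptions.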

Model Architecture

Backbone: wav2vec2 encoder
Head: shallow linear classifier
Params: 94.8 M (float32)
Checkpoint: 360 MB Safetensors (compressed)
The wav2vec2 + linear configuration outperformed LSTM and deeper transformer variants in ablation studies.
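The parameter count and the checkpoint size listed above can be cross-checked with a quick back-of-the-envelope calculation. The sketch assumes 4 bytes per float32 parameter and ignores file metadata; the variable names are illustrative.

```python
PARAMS = 94.8e6        # parameter count from the list above
BYTES_PER_PARAM = 4    # float32

size_bytes = PARAMS * BYTES_PER_PARAM
size_mb = size_bytes / 1e6       # decimal megabytes
size_mib = size_bytes / 2**20    # binary mebibytes

print(f"{size_mb:.1f} MB, {size_mib:.1f} MiB")
```

The raw weights come to roughly 379 MB (about 362 MiB), which is in line with the "approximately 360 MB" figure.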

Training Data

Source	Type	Split	Languages
`human_5_all`	Human‑recorded	Train / Dev / Test	EN
`chirp3_1`	Synthetic (Google Chirp3 TTS)	Train / Dev / Test	14 langs

Sentences were cleaned with Gemini 2.5 Flash to remove ungrammatical, controversial or written‑only text.
Filler‑word lists per language (e.g., “um”, “えーと”) were built with Claude & GPT‑o3 and injected near sentence ends to teach the model about interrupted speech.
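To make the filler-injection idea concrete, here is a hypothetical sketch of how an incomplete sentence could be derived from a complete one: cut the sentence short and end it on a filler word. The function name, the `keep_ratio` parameter, and the truncation rule are assumptions for illustration only; the actual data-generation code is not described here.

```python
def make_incomplete(sentence: str, filler: str, keep_ratio: float = 0.6) -> str:
    """Truncate `sentence` and end it on `filler` to mimic an interrupted turn.

    Hypothetical helper: keeps the first `keep_ratio` of the words (at least
    one), drops trailing sentence punctuation, and appends the filler.
    """
    words = sentence.rstrip(".!?").split()
    if not words:
        raise ValueError("sentence must contain at least one word")
    keep = max(1, int(len(words) * keep_ratio))
    return " ".join(words[:keep]) + ", " + filler + " ..."


if __name__ == "__main__":
    complete = "I would like to book a table for two tomorrow evening."
    print(make_incomplete(complete, "um"))
```

Text produced this way would then be synthesized to audio and labeled as incomplete, while the untouched sentence serves as a complete example.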

All audio/text pairs are released on the pipecat‑ai/datasets hub.

Evaluation & Performance

Accuracy on unseen synthetic test set (50 % complete / 50 % incomplete)

Lang	Acc %	Lang	Acc %
EN	94.3	IT	94.4
FR	95.5	KO	95.5
ES	92.1	PT	95.5
DE	95.8	TR	96.8
NL	96.7	PL	94.6
RU	93.0	HI	91.2
ZH	87.2	–	–

Human English benchmark (human_5_all): 99 % accuracy.
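For a quick summary of the per-language synthetic results, the sketch below computes an unweighted mean and the extremes from the table values. The numbers are copied from the table; the variable names are illustrative.

```python
# Per-language accuracy (%) on the synthetic test set, copied from the table above.
ACCURACY = {
    "EN": 94.3, "IT": 94.4, "FR": 95.5, "KO": 95.5, "ES": 92.1,
    "PT": 95.5, "DE": 95.8, "TR": 96.8, "NL": 96.7, "PL": 94.6,
    "RU": 93.0, "HI": 91.2, "ZH": 87.2,
}

mean_acc = sum(ACCURACY.values()) / len(ACCURACY)  # unweighted (macro) mean
weakest = min(ACCURACY, key=ACCURACY.get)
strongest = max(ACCURACY, key=ACCURACY.get)

print(f"{len(ACCURACY)} languages, mean {mean_acc:.1f} %")
print(f"lowest: {weakest} ({ACCURACY[weakest]} %), highest: {strongest} ({ACCURACY[strongest]} %)")
```

The table lists 13 of the 14 supported languages (there is no JA row), so the mean covers only those 13.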

Inference latency for 8 s audio

Device	Time
NVIDIA L40S	12 ms
NVIDIA A100	19 ms
NVIDIA T4 (AWS g4dn.xlarge)	75 ms
16‑core x86 CPU (Modal)	410 ms
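One way to read these figures is as a real-time factor: seconds of audio analyzed per second of compute. The sketch below derives it from the table; the helper name is illustrative.

```python
AUDIO_SECONDS = 8.0  # clip length used in the latency table

# Latency in milliseconds, copied from the table above.
LATENCY_MS = {
    "NVIDIA L40S": 12,
    "NVIDIA A100": 19,
    "NVIDIA T4 (AWS g4dn.xlarge)": 75,
    "16-core x86 CPU (Modal)": 410,
}


def realtime_factor(latency_ms: float, audio_seconds: float = AUDIO_SECONDS) -> float:
    """Seconds of audio analyzed per second of compute."""
    return audio_seconds / (latency_ms / 1000.0)


for device, ms in LATENCY_MS.items():
    print(f"{device}: {realtime_factor(ms):.0f}x real time")
```

Even the CPU figure works out to about 20 times faster than real time. In a live agent, the number that matters more is the latency itself, since it is added to the response delay after each pause.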

📄 License

This project is licensed under the BSD 2-Clause License.

📦 Additional Information

Property	Details
Pipeline Tag	voice-activity-detection
Tags	speech-processing, semantic-vad, multilingual
Datasets	pipecat-ai/chirp3_1, pipecat-ai/orpheus_midfiller_1, pipecat-ai/orpheus_grammar_1, pipecat-ai/orpheus_endfiller_1, pipecat-ai/human_convcollector_1, pipecat-ai/rime_2, pipecat-ai/human_5_all
Languages	en, fr, de, es, pt, zh, ja, hi, it, ko, nl, pl, ru, tr