
Frame VAD Multilingual MarbleNet V2.0

Developed by NVIDIA
A lightweight multilingual voice activity detection (VAD) model with only 91.5K parameters. It supports six languages (Chinese, English, French, German, Russian, Spanish) and is suited to real-time speech processing.
Downloads 75
Release Time: 5/8/2025

Model Overview

A convolutional neural network for voice activity detection (VAD). It serves as a pre-processing module for speech recognition and speaker diarization systems and outputs a speech probability for every 20 ms audio frame.
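The 20 ms frame size implies 50 probability values per second of audio. The following is a minimal sketch of that arithmetic, not part of the model's API: the helper names `frame_count` and `frame_span` are hypothetical, and it assumes the frames are contiguous, non-overlapping, and start at time zero.

```python
FRAME_MS = 20  # frame duration stated in the model description


def frame_count(duration_s: float) -> int:
    """Number of whole 20 ms frames that fit in duration_s seconds (assumed layout)."""
    return int(round(duration_s * 1000)) // FRAME_MS


def frame_span(index: int) -> tuple:
    """Start and end time, in seconds, of the frame at the given index."""
    start_ms = index * FRAME_MS
    return start_ms / 1000, (start_ms + FRAME_MS) / 1000


if __name__ == "__main__":
    print(frame_count(1.0))   # frames in one second of audio
    print(frame_span(50))     # time span of the 51st frame
```

The exact number of frames a real model returns can differ by a frame or two at the edges because of padding, so treat the count as an estimate.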

Model Features

Lightweight Design
Only 91.5K parameters, ideal for real-time applications
Strong False Alarm Resistance
Training with noise perturbation and volume adjustment reduces the false alarm rate
Multilingual Support
Supports six languages: Chinese, English, French, German, Russian, and Spanish
Frame-Level Detection
Outputs a speech probability for every 20 ms audio frame
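Frame-level probabilities are usually turned into speech segments before downstream use. The sketch below shows one simple way to do that, by thresholding each frame and merging consecutive speech frames. The function name `probs_to_segments` and the 0.5 threshold are illustrative assumptions, not values taken from the model's documentation, and the input is a plain list of probabilities rather than actual model output.

```python
from typing import List, Tuple

FRAME_MS = 20  # one probability per 20 ms frame, as described above


def probs_to_segments(probs: List[float], threshold: float = 0.5) -> List[Tuple[float, float]]:
    """Merge consecutive frames with probability >= threshold into (start_s, end_s) segments."""
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # speech begins
        elif p < threshold and start is not None:
            segments.append((start * FRAME_MS / 1000, i * FRAME_MS / 1000))
            start = None  # speech ends
    if start is not None:
        # Close a segment that runs to the end of the input.
        segments.append((start * FRAME_MS / 1000, len(probs) * FRAME_MS / 1000))
    return segments


if __name__ == "__main__":
    print(probs_to_segments([0.1, 0.9, 0.8, 0.2, 0.7]))
```

In practice the threshold is tuned on held-out data, and short gaps or very short segments are often smoothed away afterwards.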

Model Capabilities

Voice Activity Detection
Real-Time Audio Processing
Multilingual Speech Recognition Preprocessing

Use Cases

Speech Processing
ASR Pre-processing
Serves as speech/non-speech segmentation module for ASR systems
Improves speech recognition system efficiency
Speaker Diarization System
Used for speaker segmentation in conference recordings
Achieves 96.65 AUC on the VoxConverse test set
Smart Devices
Voice Wake-up Detection
Low-power voice detection for smart speakers
Lightweight design suitable for edge device deployment
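For the ASR pre-processing use case listed above, the efficiency gain comes from passing only the detected speech regions to the recognizer. A minimal sketch of that step follows, assuming speech segments are already available as (start, end) pairs in seconds. The helper `keep_speech` is hypothetical, and the 16 kHz default sample rate is an assumption that should be checked against the actual audio.

```python
from typing import List, Sequence, Tuple


def keep_speech(
    samples: Sequence[float],
    segments: List[Tuple[float, float]],
    sample_rate: int = 16000,  # assumed sample rate, not taken from this page
) -> List[list]:
    """Return one chunk of samples per speech segment, dropping non-speech audio."""
    chunks = []
    for start_s, end_s in segments:
        # Convert seconds to sample indices and clamp to the available audio.
        lo = max(0, int(round(start_s * sample_rate)))
        hi = min(len(samples), int(round(end_s * sample_rate)))
        if hi > lo:
            chunks.append(list(samples[lo:hi]))
    return chunks


if __name__ == "__main__":
    audio = [0.0] * 16000  # one second of placeholder audio
    chunks = keep_speech(audio, [(0.02, 0.06), (0.5, 2.0)])
    print([len(c) for c in chunks])
```

Each returned chunk can then be sent to the recognizer on its own, so silence and background noise are never decoded.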