# 🎙 Voice Detection AI - Real vs AI Audio Classifier
This model is an audio classifier based on a fine-tuned Wav2Vec2. It distinguishes between real human voices and AI-generated voices, and was trained on a dataset with samples from various TTS models and real human audio recordings.
## 🚀 Quick Start
To start using this model, first install the necessary dependencies. Then load the model and processor and run audio classification.
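If you just want a quick prediction, the `transformers` pipeline API wraps the loading and inference steps shown in the usage example below into a few lines. This is a minimal sketch, assuming the Hub model id `Mrkomiljon/voiceGUARD` from that example:

```python
from transformers import pipeline

# Audio-classification pipeline; downloads the model from the Hub on first use
classifier = pipeline("audio-classification", model="Mrkomiljon/voiceGUARD")

# Accepts a path to an audio file and returns labels with confidence scores
# (decoding some formats may require ffmpeg to be installed)
print(classifier("path_to_audio_file.wav"))
```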
## ✨ Features
- Accurate Classification: Distinguishes between real human voices and AI-generated voices.
- Multiple Audio Formats Supported: Supports common audio formats such as `.wav` and `.mp3`.
- Robust Performance: Classifies reliably across multiple AI-generation models.
## 📦 Installation
Make sure you have `transformers`, `torch`, and `torchaudio` installed:
```bash
pip install transformers torch torchaudio
```
## 💻 Usage Examples
### Basic Usage
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor

# Load the fine-tuned model and its processor from the Hugging Face Hub
model_name = "Mrkomiljon/voiceGUARD"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)

# Load the audio file and resample to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("path_to_audio_file.wav")
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert the raw waveform into model inputs
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit to its class label
predicted_ids = torch.argmax(logits, dim=-1)
labels = ["Real Human Voice", "AI-generated"]
prediction = labels[predicted_ids.item()]
print(f"Prediction: {prediction}")
```
## 📚 Documentation
### Model Overview
This model is a fine-tuned Wav2Vec2-based audio classifier capable of distinguishing between real human voices and AI-generated voices. It was trained on a dataset containing samples from various TTS models and real human audio recordings.
### Model Details
| Property | Details |
|----------|---------|
| Model Type | Wav2Vec2ForSequenceClassification |
| Fine-tuned on | Custom dataset with real and AI-generated audio |
| Classes | 1. Real Human Voice 2. AI-generated (e.g., MelGAN, DiffWave, etc.) |
| Input Requirements | Audio format: `.wav`, `.mp3`, etc.; sample rate: 16 kHz; max duration: 10 seconds (longer audio is truncated, shorter audio is padded) |
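The duration constraint can also be enforced before calling the processor. Below is a minimal sketch; the `prepare_waveform` helper, the mono downmix, and the zero-padding choice are illustrative assumptions rather than the model's published preprocessing:

```python
import torch
import torchaudio

TARGET_SR = 16000
MAX_SAMPLES = 10 * TARGET_SR  # 10 seconds at 16 kHz

def prepare_waveform(path: str) -> torch.Tensor:
    """Load audio, downmix to mono, resample to 16 kHz, and pad/truncate to 10 s."""
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0)  # stereo -> mono
    if sr != TARGET_SR:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=TARGET_SR)(waveform)
    if waveform.numel() > MAX_SAMPLES:
        waveform = waveform[:MAX_SAMPLES]  # longer audio is truncated
    else:
        # shorter audio is zero-padded to the full 10-second window
        waveform = torch.nn.functional.pad(waveform, (0, MAX_SAMPLES - waveform.numel()))
    return waveform
```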
### Performance
- Robustness: Classifies successfully across multiple AI-generation models.
- Limitations: Struggles with certain unseen AI-generation models (e.g., ElevenLabs).
### Training Procedure
- Data Collection: Compiled a balanced dataset of real human voices and AI-generated samples from various TTS models.
- Preprocessing: Standardized audio formats, resampled to 16 kHz, and adjusted durations to 10 seconds.
- Fine-Tuning: Used the Wav2Vec2 architecture for sequence classification, training for 3 epochs with a learning rate of 1e-5 (see the sketch below).
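The card does not publish the training script, but a run with these hyperparameters could look roughly like the following Hugging Face `Trainer` sketch. The base checkpoint (`facebook/wav2vec2-base`), the batch size, and the placeholder dataset are assumptions; only the epoch count and learning rate come from the list above:

```python
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments, Wav2Vec2ForSequenceClassification

class PlaceholderAudioDataset(Dataset):
    """Stands in for the unpublished real/AI-generated training data."""
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return {
            "input_values": torch.randn(16000 * 10),  # 10 s of 16 kHz audio
            "labels": idx % 2,                         # 0 = real, 1 = AI-generated
        }

# Assumption: a base Wav2Vec2 checkpoint with a fresh two-class head
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="voiceguard-finetune",
    num_train_epochs=3,             # from the procedure above
    learning_rate=1e-5,             # from the procedure above
    per_device_train_batch_size=8,  # assumption: not documented
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=PlaceholderAudioDataset(),
)
trainer.train()
```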
### Evaluation
- Metrics: Accuracy, Precision, Recall
- Results: Achieved 99.8% accuracy on the held-out validation set.
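These metrics can be computed from saved predictions with scikit-learn. A minimal sketch with illustrative label arrays (not the actual evaluation data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# y_true / y_pred: 0 = Real Human Voice, 1 = AI-generated (illustrative values)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
```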
### Limitations and Future Work
- While VoiceGUARD performs robustly across known AI-generation models, it may encounter challenges with novel or unseen models.
- Future work includes expanding the training dataset with samples from emerging TTS technologies to enhance generalization.
## 📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgements
- Special thanks to the developers of the Wav2Vec2 model and the contributors to the datasets used in this project.
- View the complete project on GitHub