# 🎙 Voice Detection AI - Real vs AI Audio Classifier
This model is an audio classifier based on a fine-tuned Wav2Vec2. It distinguishes between real human voices and AI-generated voices, and was trained on a dataset with samples from various TTS models and real human audio recordings.
## 🚀 Quick Start
To start using this model, first install the necessary dependencies. Then load the model and processor and run audio classification.
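If you just want a quick prediction, the `transformers` pipeline API wraps the loading and inference steps shown in the usage example below into a few lines. This is a minimal sketch, assuming the Hub model id `Mrkomiljon/voiceGUARD` from that example:

```python
from transformers import pipeline

# Audio-classification pipeline; downloads the model from the Hub on first use
classifier = pipeline("audio-classification", model="Mrkomiljon/voiceGUARD")

# Accepts a path to an audio file and returns labels with confidence scores
# (decoding some formats may require ffmpeg to be installed)
print(classifier("path_to_audio_file.wav"))
```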
## ✨ Features
- Accurate Classification: Distinguishes between real human voices and AI-generated voices.
- Multiple Audio Formats Supported: Supports common audio formats such as `.wav` and `.mp3`.
- Robust Performance: Classifies reliably across multiple AI-generation models.
## 📦 Installation
Make sure you have `transformers`, `torch`, and `torchaudio` installed:
```bash
pip install transformers torch torchaudio
```
## 💻 Usage Examples
### Basic Usage
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2Processor

# Load the fine-tuned model and its processor from the Hugging Face Hub
model_name = "Mrkomiljon/voiceGUARD"
model = Wav2Vec2ForSequenceClassification.from_pretrained(model_name)
processor = Wav2Vec2Processor.from_pretrained(model_name)

# Load the audio file and resample to the 16 kHz rate the model expects
waveform, sample_rate = torchaudio.load("path_to_audio_file.wav")
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Convert the raw waveform into model inputs
inputs = processor(waveform.squeeze().numpy(), sampling_rate=16000, return_tensors="pt", padding=True)

# Run inference without tracking gradients
with torch.no_grad():
    logits = model(**inputs).logits

# Map the highest-scoring logit to its class label
predicted_ids = torch.argmax(logits, dim=-1)
labels = ["Real Human Voice", "AI-generated"]
prediction = labels[predicted_ids.item()]
print(f"Prediction: {prediction}")
```
## 📚 Documentation
### Model Overview
This model is a fine-tuned Wav2Vec2-based audio classifier capable of distinguishing between real human voices and AI-generated voices. It was trained on a dataset containing samples from various TTS models and real human audio recordings.
### Model Details
| Property | Details |
|----------|---------|
| Model Type | Wav2Vec2ForSequenceClassification |
| Fine-tuned on | Custom dataset with real and AI-generated audio |
| Classes | 1. Real Human Voice 2. AI-generated (e.g., MelGAN, DiffWave, etc.) |
| Input Requirements | Audio format: `.wav`, `.mp3`, etc.; sample rate: 16 kHz; max duration: 10 seconds (longer audio is truncated, shorter audio is padded) |
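The duration constraint can also be enforced before calling the processor. Below is a minimal sketch; the `prepare_waveform` helper, the mono downmix, and the zero-padding choice are illustrative assumptions rather than the model's published preprocessing:

```python
import torch
import torchaudio

TARGET_SR = 16000
MAX_SAMPLES = 10 * TARGET_SR  # 10 seconds at 16 kHz

def prepare_waveform(path: str) -> torch.Tensor:
    """Load audio, downmix to mono, resample to 16 kHz, and pad/truncate to 10 s."""
    waveform, sr = torchaudio.load(path)
    waveform = waveform.mean(dim=0)  # stereo -> mono
    if sr != TARGET_SR:
        waveform = torchaudio.transforms.Resample(orig_freq=sr, new_freq=TARGET_SR)(waveform)
    if waveform.numel() > MAX_SAMPLES:
        waveform = waveform[:MAX_SAMPLES]  # longer audio is truncated
    else:
        # shorter audio is zero-padded to the full 10-second window
        waveform = torch.nn.functional.pad(waveform, (0, MAX_SAMPLES - waveform.numel()))
    return waveform
```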
### Performance
- Robustness: Classifies successfully across multiple AI-generation models.
- Limitations: Struggles with certain unseen AI-generation models (e.g., ElevenLabs).
### Training Procedure
- Data Collection: Compiled a balanced dataset of real human voices and AI-generated samples from various TTS models.
- Preprocessing: Standardized audio formats, resampled to 16 kHz, and adjusted durations to 10 seconds.
- Fine-Tuning: Used the Wav2Vec2 architecture for sequence classification, training for 3 epochs with a learning rate of 1e-5 (see the sketch below).
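The card does not publish the training script, but a run with these hyperparameters could look roughly like the following Hugging Face `Trainer` sketch. The base checkpoint (`facebook/wav2vec2-base`), the batch size, and the placeholder dataset are assumptions; only the epoch count and learning rate come from the list above:

```python
import torch
from torch.utils.data import Dataset
from transformers import Trainer, TrainingArguments, Wav2Vec2ForSequenceClassification

class PlaceholderAudioDataset(Dataset):
    """Stands in for the unpublished real/AI-generated training data."""
    def __len__(self):
        return 16

    def __getitem__(self, idx):
        return {
            "input_values": torch.randn(16000 * 10),  # 10 s of 16 kHz audio
            "labels": idx % 2,                         # 0 = real, 1 = AI-generated
        }

# Assumption: a base Wav2Vec2 checkpoint with a fresh two-class head
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=2
)

training_args = TrainingArguments(
    output_dir="voiceguard-finetune",
    num_train_epochs=3,             # from the procedure above
    learning_rate=1e-5,             # from the procedure above
    per_device_train_batch_size=8,  # assumption: not documented
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=PlaceholderAudioDataset(),
)
trainer.train()
```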
### Evaluation
- Metrics: Accuracy, Precision, Recall
- Results: Achieved 99.8% accuracy on the held-out validation set.
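These metrics can be computed from saved predictions with scikit-learn. A minimal sketch with illustrative label arrays (not the actual evaluation data):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# y_true / y_pred: 0 = Real Human Voice, 1 = AI-generated (illustrative values)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
print(f"Precision: {precision_score(y_true, y_pred):.3f}")
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")
```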
### Limitations and Future Work
- While VoiceGUARD performs robustly across known AI-generation models, it may encounter challenges with novel or unseen models.
- Future work includes expanding the training dataset with samples from emerging TTS technologies to enhance generalization.
## 📄 License
This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgements
- Special thanks to the developers of the Wav2Vec2 model and the contributors to the datasets used in this project.
- View the complete project on GitHub