# AST Fine-tuned for Fake Audio Detection
This model is fine-tuned for detecting fake/synthetic audio, offering high-accuracy binary classification.
## Quick Start
This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer optimized for fake audio detection.
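For a quick smoke test, the checkpoint should also work with the generic `transformers` pipeline API (a minimal sketch; the label names come from the checkpoint's `id2label` mapping):

```python
from transformers import pipeline

# Minimal sketch: the generic audio-classification pipeline handles decoding,
# resampling, and feature extraction (non-wav formats may require ffmpeg).
detector = pipeline("audio-classification", model="WpythonW/ast-fakeaudio-detector")
print(detector("audio1.wav"))  # list of {'label': ..., 'score': ...} dicts
```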
## Features
- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- Task: Binary classification (fake/real audio detection)
- Input: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames)
- Output: Probabilities `[fake_prob, real_prob]` (see the sanity check after this list)
- Training Hardware: 2x NVIDIA T4 GPUs
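The input shape and binary head listed above can be verified from the checkpoint itself (a quick sanity check; `num_mel_bins` and `max_length` are standard `ASTFeatureExtractor` attributes):

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

extractor = AutoFeatureExtractor.from_pretrained("WpythonW/ast-fakeaudio-detector")
model = AutoModelForAudioClassification.from_pretrained("WpythonW/ast-fakeaudio-detector")

print(extractor.num_mel_bins, extractor.max_length)  # expected: 128 1024
print(model.config.num_labels)                       # expected: 2 (fake/real)
```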
## Installation
The original model card lists no dedicated installation steps. The usage example below relies on `transformers`, `torch`, `torchaudio`, `soundfile`, and `numpy`, which can typically be installed with pip:
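```bash
pip install transformers torch torchaudio soundfile numpy
```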
## Usage Examples

### Basic Usage
```python
import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned checkpoint and its matching feature extractor
model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)
model.eval()

audio_files = ["audio1.wav", "audio2.mp3", "audio3.ogg"]

# Preprocess: downmix to mono and resample to the 16 kHz rate the model expects
processed_batch = []
for audio_path in audio_files:
    audio_data, sr = sf.read(audio_path)
    if audio_data.ndim > 1 and audio_data.shape[1] > 1:
        audio_data = np.mean(audio_data, axis=1)
    if sr != 16000:
        waveform = torch.from_numpy(audio_data).float()
        if waveform.ndim == 1:
            waveform = waveform.unsqueeze(0)
        resample = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
        waveform = resample(waveform)
        audio_data = waveform.squeeze().numpy()
    processed_batch.append(audio_data)

# The feature extractor pads the batch and builds the Mel spectrograms
inputs = extractor(
    processed_batch,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Run inference and convert logits to probabilities
with torch.no_grad():
    logits = model(**inputs).logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Index 0 is the fake class, index 1 the real class
for filename, probs in zip(audio_files, probabilities):
    fake_prob = float(probs[0].cpu())
    real_prob = float(probs[1].cpu())
    prediction = "FAKE" if fake_prob > real_prob else "REAL"
    print(f"\nFile: {filename}")
    print(f"Fake probability: {fake_prob:.2%}")
    print(f"Real probability: {real_prob:.2%}")
    print(f"Verdict: {prediction}")
```
## Documentation
### Limitations

Important considerations when using this model:

- The model expects 16 kHz audio input.
- Performance may vary with types of audio manipulation not present in the training data.
- The model was trained on audio samples of 4 to 10 seconds in duration; for longer recordings, see the windowing sketch below.
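Since training clips were 4 to 10 seconds long, one reasonable way to handle longer recordings is to score fixed-length windows and average the probabilities. The helper below is a hypothetical sketch, not part of the model card; the 10-second window and mean aggregation are illustrative choices:

```python
import numpy as np
import torch

def score_long_audio(audio_data, extractor, model, device, sr=16000, chunk_s=10):
    """Hypothetical helper: split mono 16 kHz audio into chunk_s-second
    windows and average the per-window softmax probabilities."""
    chunk = sr * chunk_s
    windows = [audio_data[i:i + chunk] for i in range(0, len(audio_data), chunk)]
    windows = [w for w in windows if len(w) >= sr]  # drop sub-second tails
    inputs = extractor(windows, sampling_rate=sr, padding=True, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs.mean(dim=0)  # averaged [fake_prob, real_prob]
```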
## License

The model is licensed under the Apache-2.0 license.
## Model Information
| Property | Details |
|----------|---------|
| Datasets | WpythonW/real-fake-voices-dataset2, mozilla-foundation/common_voice_17_0 |
| Language | en |
| Metrics | accuracy, f1, recall, precision |
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Pipeline Tag | audio-classification |
| Library Name | transformers |
| Tags | audio, audio-classification, fake-audio-detection, ast |
| Inference Parameters | sampling_rate: 16000, audio_channel: mono |
## Model Results
| Task | Dataset | Metric | Value |
|------|---------|--------|-------|
| Audio Classification | real-fake-voices-dataset2 | accuracy | 0.9662 |
| Audio Classification | real-fake-voices-dataset2 | f1 | 0.971 |
| Audio Classification | real-fake-voices-dataset2 | precision | 0.9692 |
| Audio Classification | real-fake-voices-dataset2 | recall | 0.9728 |