# AST Fine-tuned for Fake Audio Detection
This model is fine-tuned for detecting fake/synthetic audio, offering high-accuracy binary classification.
## Quick Start
This model is a fine-tuned version of [MIT/ast-finetuned-audioset-10-10-0.4593](https://huggingface.co/MIT/ast-finetuned-audioset-10-10-0.4593) for detecting fake/synthetic audio. The original AST (Audio Spectrogram Transformer) classification head was replaced with a binary classification layer optimized for fake audio detection.
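For a quick smoke test, the checkpoint should also work with the generic `transformers` pipeline API (a minimal sketch; the label names come from the checkpoint's `id2label` mapping):

```python
from transformers import pipeline

# Minimal sketch: the generic audio-classification pipeline handles decoding,
# resampling, and feature extraction (non-wav formats may require ffmpeg).
detector = pipeline("audio-classification", model="WpythonW/ast-fakeaudio-detector")
print(detector("audio1.wav"))  # list of {'label': ..., 'score': ...} dicts
```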
## Features
- Base Model: MIT/ast-finetuned-audioset-10-10-0.4593 (AST pretrained on AudioSet)
- Task: Binary classification (fake/real audio detection)
- Input: Audio converted to a Mel spectrogram (128 mel bins, 1024 time frames)
- Output: Probabilities `[fake_prob, real_prob]` (see the sanity check after this list)
- Training Hardware: 2x NVIDIA T4 GPUs
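The input shape and binary head listed above can be verified from the checkpoint itself (a quick sanity check; `num_mel_bins` and `max_length` are standard `ASTFeatureExtractor` attributes):

```python
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

extractor = AutoFeatureExtractor.from_pretrained("WpythonW/ast-fakeaudio-detector")
model = AutoModelForAudioClassification.from_pretrained("WpythonW/ast-fakeaudio-detector")

print(extractor.num_mel_bins, extractor.max_length)  # expected: 128 1024
print(model.config.num_labels)                       # expected: 2 (fake/real)
```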
## Installation
The original model card lists no dedicated installation steps. The usage example below relies on `transformers`, `torch`, `torchaudio`, `soundfile`, and `numpy`, which can typically be installed with pip:
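```bash
pip install transformers torch torchaudio soundfile numpy
```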
## Usage Examples

### Basic Usage
```python
import torch
import torchaudio
import soundfile as sf
import numpy as np
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load the fine-tuned checkpoint and its matching feature extractor
model_name = "WpythonW/ast-fakeaudio-detector"
extractor = AutoFeatureExtractor.from_pretrained(model_name)
model = AutoModelForAudioClassification.from_pretrained(model_name).to(device)
model.eval()

audio_files = ["audio1.wav", "audio2.mp3", "audio3.ogg"]

# Preprocess: downmix to mono and resample to the 16 kHz rate the model expects
processed_batch = []
for audio_path in audio_files:
    audio_data, sr = sf.read(audio_path)
    if audio_data.ndim > 1 and audio_data.shape[1] > 1:
        audio_data = np.mean(audio_data, axis=1)
    if sr != 16000:
        waveform = torch.from_numpy(audio_data).float()
        if waveform.ndim == 1:
            waveform = waveform.unsqueeze(0)
        resample = torchaudio.transforms.Resample(orig_freq=sr, new_freq=16000)
        waveform = resample(waveform)
        audio_data = waveform.squeeze().numpy()
    processed_batch.append(audio_data)

# The feature extractor pads the batch and builds the Mel spectrograms
inputs = extractor(
    processed_batch,
    sampling_rate=16000,
    padding=True,
    return_tensors="pt"
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Run inference and convert logits to probabilities
with torch.no_grad():
    logits = model(**inputs).logits
probabilities = torch.nn.functional.softmax(logits, dim=-1)

# Index 0 is the fake class, index 1 the real class
for filename, probs in zip(audio_files, probabilities):
    fake_prob = float(probs[0].cpu())
    real_prob = float(probs[1].cpu())
    prediction = "FAKE" if fake_prob > real_prob else "REAL"
    print(f"\nFile: {filename}")
    print(f"Fake probability: {fake_prob:.2%}")
    print(f"Real probability: {real_prob:.2%}")
    print(f"Verdict: {prediction}")
```
## Documentation
### Limitations

Important considerations when using this model:

- The model expects 16 kHz audio input.
- Performance may vary with types of audio manipulation not present in the training data.
- The model was trained on audio samples of 4 to 10 seconds in duration; for longer recordings, see the windowing sketch below.
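Since training clips were 4 to 10 seconds long, one reasonable way to handle longer recordings is to score fixed-length windows and average the probabilities. The helper below is a hypothetical sketch, not part of the model card; the 10-second window and mean aggregation are illustrative choices:

```python
import numpy as np
import torch

def score_long_audio(audio_data, extractor, model, device, sr=16000, chunk_s=10):
    """Hypothetical helper: split mono 16 kHz audio into chunk_s-second
    windows and average the per-window softmax probabilities."""
    chunk = sr * chunk_s
    windows = [audio_data[i:i + chunk] for i in range(0, len(audio_data), chunk)]
    windows = [w for w in windows if len(w) >= sr]  # drop sub-second tails
    inputs = extractor(windows, sampling_rate=sr, padding=True, return_tensors="pt")
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    return probs.mean(dim=0)  # averaged [fake_prob, real_prob]
```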
## License

The model is licensed under the Apache-2.0 license.
## Model Information
| Property | Details |
|----------|---------|
| Datasets | WpythonW/real-fake-voices-dataset2, mozilla-foundation/common_voice_17_0 |
| Language | en |
| Metrics | accuracy, f1, recall, precision |
| Base Model | MIT/ast-finetuned-audioset-10-10-0.4593 |
| Pipeline Tag | audio-classification |
| Library Name | transformers |
| Tags | audio, audio-classification, fake-audio-detection, ast |
| Inference Parameters | sampling_rate: 16000, audio_channel: mono |
## Model Results
| Task | Dataset | Metric | Value |
|------|---------|--------|-------|
| Audio Classification | real-fake-voices-dataset2 | accuracy | 0.9662 |
| Audio Classification | real-fake-voices-dataset2 | f1 | 0.971 |
| Audio Classification | real-fake-voices-dataset2 | precision | 0.9692 |
| Audio Classification | real-fake-voices-dataset2 | recall | 0.9728 |