# Whisper Large V3 Turbo: Fine-Tuned for the ATC Domain
This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for transcribing Air Traffic Control (ATC) communications.
## 🚀 Quick Start

The model was fine-tuned on the ATCOSIM corpus and can be used directly with the Hugging Face `transformers` library, as shown in the usage examples below.
## ✨ Features
- **Designed for ATC**: Optimized for transcribing ATC radio communications, supporting aviation safety research, analysis of congestion patterns, and data-driven decision-making in airspace management.
- **Improved Performance**: Achieves better transcription accuracy on aviation communications than the base Whisper model, with particular gains in ATC terminology recognition, callsign transcription accuracy, handling of radio transmission noise, and recognition of standardized phraseology.
## 📦 Installation
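The original card lists no installation steps. The usage examples below assume `torch`, `torchaudio`, and `transformers` are available, e.g.:

```bash
pip install torch torchaudio transformers
```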
## 💻 Usage Examples

### Basic Usage
```python
import torch
from transformers import pipeline

# Use GPU with half precision when available; otherwise fall back to CPU/FP32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
    chunk_length_s=30,      # split long recordings into 30-second chunks
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

result = transcriber("path_to_atc_audio.wav")
print(f"Transcription: {result['text']}")
```
### Advanced Usage
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the audio and resample to the 16 kHz rate Whisper expects
audio_path = "path_to_atc_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Downmix multi-channel audio to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
waveform_np = waveform.squeeze().cpu().numpy()

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the fine-tuned model and move it to the target device and dtype
model = WhisperForConditionalGeneration.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")
model = model.to(device=device, dtype=torch_dtype)
processor = WhisperProcessor.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")

# Convert the waveform to log-mel input features on the same device and dtype
input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device=device, dtype=torch_dtype)

generated_ids = model.generate(input_features, max_new_tokens=128)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")
```

Alternatively, wrap the loaded model and processor in a pipeline, which handles chunking for longer recordings:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(waveform_np)
print(f"Transcription: {result['text']}")
```
## ⚠️ Important Notes

- Always resample audio to 16 kHz before processing.
- Explicitly set both device and dtype when using a GPU: `model.to(device=device, dtype=torch_dtype)`.
- For longer audio files, use the `chunk_length_s` parameter so the pipeline transcribes in chunks.
- The model performs best on clean ATC communications with standard phraseology.
## 📚 Documentation

### Model Description

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for Air Traffic Control (ATC) communications transcription. It was fine-tuned on the ATCOSIM corpus, which contains English ATC operator speech recorded during real-time simulations.
### Intended Use

The model is designed for transcribing ATC radio communications, supporting aviation safety research, analyzing ATC communications for congestion patterns, and enabling data-driven decision-making in airspace management.
### Training Methodology

The model was fine-tuned using a partial-freezing approach to balance efficiency and adaptability (see the sketch after this list):
- The first 24 encoder layers were frozen.
- All convolution layers and positional embeddings were frozen.
- The later encoder layers and the decoder were fine-tuned.
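A minimal sketch of this freezing scheme, assuming the standard `WhisperForConditionalGeneration` module layout in `transformers` (the attribute names come from that library, not from the original card):

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

# Freeze the convolutional front end and the encoder's positional embeddings
for module in (model.model.encoder.conv1,
               model.model.encoder.conv2,
               model.model.encoder.embed_positions):
    for param in module.parameters():
        param.requires_grad = False

# Freeze the first 24 encoder layers; later encoder layers and the decoder stay trainable
for layer in model.model.encoder.layers[:24]:
    for param in layer.parameters():
        param.requires_grad = False
```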
Training hyperparameters:
- Learning rate: 1e-5
- Training steps: 5000
- Warmup steps: 500
- Gradient checkpointing: enabled
- Precision: FP16
- Batch size: 16 per device
- Evaluation metric: Word Error Rate (WER)
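These settings map naturally onto the `Seq2SeqTrainingArguments` API in `transformers`. The following is a hedged reconstruction, not the author's actual configuration; the output path and the 1000-step evaluation interval (inferred from the metrics table below) are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-atcosim-finetune",  # illustrative path
    learning_rate=1e-5,
    max_steps=5000,
    warmup_steps=500,
    gradient_checkpointing=True,
    fp16=True,
    per_device_train_batch_size=16,
    evaluation_strategy="steps",
    eval_steps=1000,             # assumed from the 1000-step metric rows below
    metric_for_best_model="wer",
    greater_is_better=False,     # lower WER is better
)
```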
### Performance

The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular improvements in ATC terminology recognition, callsign transcription accuracy, handling of radio transmission noise, and recognition of standardized phraseology.
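Reported WER can be checked with the Hugging Face `evaluate` library; a minimal sketch, where the transcript strings are hypothetical examples of ATC phraseology rather than corpus samples:

```python
import evaluate

wer_metric = evaluate.load("wer")

# Hypothetical reference/hypothesis pair in ATC phraseology
references  = ["lufthansa four five two contact rhein radar one three two decimal four"]
predictions = ["lufthansa four five two contact rhein radar one three two decimal four"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")  # 0.00% for an exact match
```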
### Training Metrics
Training progress over 5000 steps (10 epochs):

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 1000 | 0.090100 | 0.081074 | 5.81697 |
| 2000 | 0.021100 | 0.080030 | 4.00939 |
| 3000 | 0.010000 | 0.080892 | 5.67438 |
| 4000 | 0.002500 | 0.080460 | 3.88357 |
| 5000 | 0.001400 | 0.080753 | 3.73678 |
The final model achieves a Word Error Rate (WER) of 3.73678%, reflecting substantial improvement over the course of training and strong performance on ATC communications.
### Limitations
- The model is specifically optimized for English ATC communications.
- Performance may vary across different accents and regional phraseologies.
- Not optimized for general speech recognition outside the aviation domain.
- May struggle with extremely noisy transmissions or overlapping communications.
### Broader Application

This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes:
- Audio-to-text transcription (this model).
- Domain-specific text reformatting using contextual knowledge.
- Congestion analysis based on transcribed communications.
## 🔧 Technical Details

The model uses a partial-freezing approach during fine-tuning: the first 24 encoder layers, all convolution layers, and the positional embeddings are frozen, while the later encoder layers and the decoder are fine-tuned. Training hyperparameters (learning rate, training steps, warmup steps, etc.) are set to balance efficiency and adaptability.
## 📄 License
This model is released under the MIT license.
## 📝 Citation

If you use this model in your research, please cite:

```bibtex
@misc{ta-chun_lin_2025,
  author    = {Ta-Chun Lin},
  title     = {whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400)},
  year      = 2025,
  url       = {https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune},
  doi       = {10.57967/hf/5272},
  publisher = {Hugging Face}
}
```
## 🙏 Acknowledgments

- OpenAI for the base Whisper model.
- The ATCOSIM dataset for providing high-quality ATC communications data.
- The open-source community for the tools and frameworks that made this fine-tuning possible.