# Whisper Large V3 Turbo: Fine-Tuned for the ATC Domain
This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for transcribing Air Traffic Control (ATC) communications.
## 🚀 Quick Start

The model was fine-tuned on the ATCOSIM corpus and can be used directly with the Hugging Face `transformers` library, as shown in the usage examples below.
## ✨ Features
- **Designed for ATC**: Optimized for transcribing ATC radio communications, supporting aviation safety research, analysis of congestion patterns, and data-driven decision-making in airspace management.
- **Improved Performance**: Achieves better transcription accuracy on aviation communications than the base Whisper model, with particular gains in ATC terminology recognition, callsign transcription accuracy, handling of radio transmission noise, and recognition of standardized phraseology.
## 📦 Installation
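The original card lists no installation steps. The usage examples below assume `torch`, `torchaudio`, and `transformers` are available, e.g.:

```bash
pip install torch torchaudio transformers
```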
## 💻 Usage Examples

### Basic Usage
```python
import torch
from transformers import pipeline

# Use GPU with half precision when available; otherwise fall back to CPU/FP32
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

transcriber = pipeline(
    "automatic-speech-recognition",
    model="tclin/whisper-large-v3-turbo-atcosim-finetune",
    chunk_length_s=30,      # split long recordings into 30-second chunks
    max_new_tokens=128,
    torch_dtype=torch_dtype,
    device=device,
)

result = transcriber("path_to_atc_audio.wav")
print(f"Transcription: {result['text']}")
```
### Advanced Usage
```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Load the audio and resample to the 16 kHz rate Whisper expects
audio_path = "path_to_atc_audio.wav"
waveform, sample_rate = torchaudio.load(audio_path)
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Downmix multi-channel audio to mono
if waveform.shape[0] > 1:
    waveform = waveform.mean(dim=0, keepdim=True)
waveform_np = waveform.squeeze().cpu().numpy()

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load the fine-tuned model and move it to the target device and dtype
model = WhisperForConditionalGeneration.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")
model = model.to(device=device, dtype=torch_dtype)
processor = WhisperProcessor.from_pretrained("tclin/whisper-large-v3-turbo-atcosim-finetune")

# Convert the waveform to log-mel input features on the same device and dtype
input_features = processor(waveform_np, sampling_rate=16000, return_tensors="pt").input_features
input_features = input_features.to(device=device, dtype=torch_dtype)

generated_ids = model.generate(input_features, max_new_tokens=128)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(f"Transcription: {transcription}")
```

Alternatively, wrap the loaded model and processor in a pipeline, which handles chunking for longer recordings:

```python
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    torch_dtype=torch_dtype,
    device=device,
)

result = pipe(waveform_np)
print(f"Transcription: {result['text']}")
```
## ⚠️ Important Notes

- Always resample audio to 16 kHz before processing.
- Explicitly set both device and dtype when using a GPU: `model.to(device=device, dtype=torch_dtype)`.
- For longer audio files, use the `chunk_length_s` parameter so the pipeline transcribes in chunks.
- The model performs best on clean ATC communications with standard phraseology.
## 📚 Documentation

### Model Description

This model is a fine-tuned version of OpenAI's Whisper Large V3 Turbo, optimized for Air Traffic Control (ATC) communications transcription. It was fine-tuned on the ATCOSIM corpus, which contains English ATC operator speech recorded during real-time simulations.
### Intended Use

The model is designed for transcribing ATC radio communications, supporting aviation safety research, analyzing ATC communications for congestion patterns, and enabling data-driven decision-making in airspace management.
### Training Methodology

The model was fine-tuned using a partial-freezing approach to balance efficiency and adaptability (see the sketch after this list):
- The first 24 encoder layers were frozen.
- All convolution layers and positional embeddings were frozen.
- The later encoder layers and the decoder were fine-tuned.
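A minimal sketch of this freezing scheme, assuming the standard `WhisperForConditionalGeneration` module layout in `transformers` (the attribute names come from that library, not from the original card):

```python
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v3-turbo")

# Freeze the convolutional front end and the encoder's positional embeddings
for module in (model.model.encoder.conv1,
               model.model.encoder.conv2,
               model.model.encoder.embed_positions):
    for param in module.parameters():
        param.requires_grad = False

# Freeze the first 24 encoder layers; later encoder layers and the decoder stay trainable
for layer in model.model.encoder.layers[:24]:
    for param in layer.parameters():
        param.requires_grad = False
```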
Training hyperparameters:
- Learning rate: 1e-5
- Training steps: 5000
- Warmup steps: 500
- Gradient checkpointing: enabled
- Precision: FP16
- Batch size: 16 per device
- Evaluation metric: Word Error Rate (WER)
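These settings map naturally onto the `Seq2SeqTrainingArguments` API in `transformers`. The following is a hedged reconstruction, not the author's actual configuration; the output path and the 1000-step evaluation interval (inferred from the metrics table below) are assumptions:

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./whisper-large-v3-turbo-atcosim-finetune",  # illustrative path
    learning_rate=1e-5,
    max_steps=5000,
    warmup_steps=500,
    gradient_checkpointing=True,
    fp16=True,
    per_device_train_batch_size=16,
    evaluation_strategy="steps",
    eval_steps=1000,             # assumed from the 1000-step metric rows below
    metric_for_best_model="wer",
    greater_is_better=False,     # lower WER is better
)
```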
### Performance

The model achieves improved transcription accuracy on aviation communications compared to the base Whisper model, with particular improvements in ATC terminology recognition, callsign transcription accuracy, handling of radio transmission noise, and recognition of standardized phraseology.
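Reported WER can be checked with the Hugging Face `evaluate` library; a minimal sketch, where the transcript strings are hypothetical examples of ATC phraseology rather than corpus samples:

```python
import evaluate

wer_metric = evaluate.load("wer")

# Hypothetical reference/hypothesis pair in ATC phraseology
references  = ["lufthansa four five two contact rhein radar one three two decimal four"]
predictions = ["lufthansa four five two contact rhein radar one three two decimal four"]

wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}%")  # 0.00% for an exact match
```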
### Training Metrics
Training progress over 5000 steps (10 epochs):

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 1000 | 0.090100 | 0.081074 | 5.81697 |
| 2000 | 0.021100 | 0.080030 | 4.00939 |
| 3000 | 0.010000 | 0.080892 | 5.67438 |
| 4000 | 0.002500 | 0.080460 | 3.88357 |
| 5000 | 0.001400 | 0.080753 | 3.73678 |
The final model achieves a Word Error Rate (WER) of 3.73678%, reflecting substantial improvement over the course of training and strong performance on ATC communications.
### Limitations
- The model is specifically optimized for English ATC communications.
- Performance may vary across different accents and regional phraseologies.
- Not optimized for general speech recognition outside the aviation domain.
- May struggle with extremely noisy transmissions or overlapping communications.
### Broader Application

This model serves as a component in a larger speech-to-analysis pipeline for ATC communications that includes:
- Audio-to-text transcription (this model).
- Domain-specific text reformatting using contextual knowledge.
- Congestion analysis based on transcribed communications.
## 🔧 Technical Details

The model uses a partial-freezing approach during fine-tuning: the first 24 encoder layers, all convolution layers, and the positional embeddings are frozen, while the later encoder layers and the decoder are fine-tuned. Training hyperparameters (learning rate, training steps, warmup steps, etc.) are set to balance efficiency and adaptability.
## 📄 License
This model is released under the MIT license.
## 📝 Citation

If you use this model in your research, please cite:

```bibtex
@misc{ta-chun_lin_2025,
  author    = {Ta-Chun Lin},
  title     = {whisper-large-v3-turbo-atcosim-finetune (Revision 4b2d400)},
  year      = 2025,
  url       = {https://huggingface.co/tclin/whisper-large-v3-turbo-atcosim-finetune},
  doi       = {10.57967/hf/5272},
  publisher = {Hugging Face}
}
```
## 🙏 Acknowledgments

- OpenAI for the base Whisper model.
- The ATCOSIM dataset for providing high-quality ATC communications data.
- The open-source community for the tools and frameworks that made this fine-tuning possible.