TIGER-speech Open-source Speech Separation Model - Free Deployment and Effective Extraction of Key Acoustic Features

TIGER Speech

Developed by JusperLee

TIGER is a lightweight speech separation model that extracts key acoustic features through frequency band partitioning, multi-scale modeling, and full-band frame modeling.

Sound Separation

Safetensors

English | Open Source License: Apache-2.0 | #Lightweight Speech Separation #Multi-scale Attention #Time-Frequency Interleaved Modeling

Downloads 1,286

Release Time: 1/22/2025

Model Overview

TIGER is a speech separation model designed to cut parameter count and computational cost. It partitions the spectrum into frequency bands and interleaves modeling along the time and frequency axes, which keeps separation quality high at a fraction of the size and compute.
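Parameter counts and MACs (multiply-accumulate operations) are the two quantities behind the cost claims above. As a rough illustration of how they are counted and compared, the sketch below works through a single dense 1-D convolution. The function names and the layer sizes are hypothetical and chosen only for the arithmetic; they do not describe any layer in TIGER or TF-GridNet.

```python
def conv1d_params(in_channels: int, out_channels: int, kernel_size: int,
                  bias: bool = True) -> int:
    """Trainable parameters of a dense 1-D convolution (groups=1)."""
    weights = in_channels * out_channels * kernel_size
    return weights + (out_channels if bias else 0)


def conv1d_macs(in_channels: int, out_channels: int, kernel_size: int,
                out_length: int) -> int:
    """Multiply-accumulates of a dense 1-D convolution (bias ignored).

    Each output sample of each output channel needs
    in_channels * kernel_size multiply-accumulates.
    """
    return in_channels * out_channels * kernel_size * out_length


def relative_reduction(baseline: float, reduced: float) -> float:
    """Percentage by which `reduced` is smaller than `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - reduced) / baseline


# Hypothetical layer sizes, for illustration only.
wide_macs = conv1d_macs(256, 256, 3, out_length=1000)    # 196_608_000
narrow_macs = conv1d_macs(64, 64, 3, out_length=1000)    # 12_288_000
print(relative_reduction(wide_macs, narrow_macs))        # 93.75
```

In this toy case, shrinking the channel width by a factor of 4 cuts the MACs by a factor of 16, which is why narrower, band-wise processing can reduce cost so sharply.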

Model Features

Lightweight Design

Reduces parameter count by 94.3% and MACs (multiply-accumulate operations) by 95.3% relative to TF-GridNet while maintaining high performance.

Frequency Band Partitioning and Compression

Utilizes prior knowledge to partition frequency bands and compress frequency information for improved efficiency.

Multi-scale Selective Attention

Employs Multi-scale Selective Attention (MSA) modules to extract contextual features.

Full-band Frame Attention

Introduces Full-band Frame Attention (F^3A) modules to capture time and frequency contextual information.
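The frequency band partitioning and compression listed above can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration: the band layout (narrow bands at low frequencies, wider ones higher up), the 257-bin spectrum (what a 512-point FFT yields), the helper names `split_bands` and `band_means`, and the per-band mean, which merely stands in for the learned compression. None of it is TIGER's actual configuration.

```python
from typing import List, Sequence, Tuple


def split_bands(n_bins: int,
                layout: Sequence[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Split `n_bins` frequency bins into half-open (start, end) ranges.

    `layout` is a list of (band_width, band_count) pairs applied from low
    to high frequency. Bins left over after the layout form one last band.
    """
    bands: List[Tuple[int, int]] = []
    start = 0
    for width, count in layout:
        if width <= 0:
            raise ValueError("band width must be positive")
        for _ in range(count):
            if start >= n_bins:
                return bands
            end = min(start + width, n_bins)
            bands.append((start, end))
            start = end
    if start < n_bins:
        bands.append((start, n_bins))
    return bands


def band_means(frame: Sequence[float],
               bands: Sequence[Tuple[int, int]]) -> List[float]:
    """Collapse each band of one spectral frame to a single value.

    A plain mean is used as a stand-in for a learned projection.
    """
    return [sum(frame[s:e]) / (e - s) for s, e in bands]


# Hypothetical layout: 16 bands of 4 bins, 8 of 16 bins, 2 of 32 bins.
bands = split_bands(257, [(4, 16), (16, 8), (32, 2)])
print(len(bands))   # 27 (26 from the layout plus one leftover bin)
print(bands[:2])    # [(0, 4), (4, 8)]
```

The point of the sketch is the size change: one 257-bin frame becomes 27 band-level values, so every later layer operates on a much shorter frequency axis.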

Model Capabilities

Speech Separation

Efficient Computation

Multi-scale Feature Extraction

Use Cases

Speech Processing

Speech Separation in Complex Acoustic Environments

Separates overlapping speech recorded in environments with noise and realistic reverberation.

Significantly outperforms TF-GridNet in inference speed and separation quality on the EchoSet dataset.

🚀 TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation

TIGER is a lightweight model for speech separation. It extracts key acoustic features through frequency band-split, multi-scale, and full-frequency-frame modeling, offering a high-performance solution with reduced parameters and computational cost.

✨ Features

Lightweight Design: Significantly reduces parameter size and computational cost.
Innovative Modeling: Leverages frequency band-split, multi-scale, and full-frequency-frame modeling to extract key acoustic features.
New Dataset: Introduces the EchoSet dataset for more realistic evaluation in complex acoustic environments.

📦 Installation

git clone https://github.com/JusperLee/TIGER.git
cd TIGER
pip install -r requirements.txt

🚀 Quick Start

Test with Pre-trained Model

# Test using speech
python inference_speech.py --audio_path test/mix.wav

# Test using DnR
python inference_dnr.py --audio_path test/test_mixture_466.wav
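To run either inference script over several files, a small wrapper can assemble the same commands shown above. This is only a sketch: `build_inference_command` and `run_inference` are hypothetical helpers, not part of the TIGER repository. Only the script names and the `--audio_path` flag come from the commands above. The sketch assumes it is run from the repository root, and it defaults to a dry run that returns the commands without executing them.

```python
import subprocess
import sys
from typing import Iterable, List


def build_inference_command(audio_path: str,
                            script: str = "inference_speech.py") -> List[str]:
    """Build the argument list for one inference call."""
    return [sys.executable, script, "--audio_path", audio_path]


def run_inference(audio_paths: Iterable[str],
                  script: str = "inference_speech.py",
                  dry_run: bool = True) -> List[List[str]]:
    """Build one command per input file and optionally execute them.

    With dry_run=True nothing is executed; the commands are only returned.
    """
    commands = [build_inference_command(p, script) for p in audio_paths]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if the script fails.
            subprocess.run(cmd, check=True)
    return commands


for cmd in run_inference(["test/mix.wav"]):
    print(" ".join(cmd))
```

Passing `script="inference_dnr.py"` builds the DnR variant of the command, and `dry_run=False` executes each one in turn.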

Train with EchoSet

python audio_train.py --conf_dir configs/tiger.yml

Evaluate with EchoSet

python audio_test.py --conf_dir configs/tiger.yml

📚 Documentation

💥 News

[2025-01-23] We release the code and pre-trained model of TIGER! 🚀
[2025-01-23] We release the TIGER model and the EchoSet dataset! 🚀

📜 Abstract

In this paper, we propose a speech separation model with significantly reduced parameter size and computational cost: Time-Frequency Interleaved Gain Extraction and Reconstruction Network (TIGER). TIGER leverages prior knowledge to divide frequency bands and applies compression on frequency information. We employ a multi-scale selective attention (MSA) module to extract contextual features, while introducing a full-frequency-frame attention (F^3A) module to capture both temporal and frequency contextual information. Additionally, to more realistically evaluate the performance of speech separation models in complex acoustic environments, we introduce a novel dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., considering object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results demonstrated that TIGER significantly outperformed state-of-the-art (SOTA) model TF-GridNet on the EchoSet dataset in both inference speed and separation quality, while reducing the number of parameters by 94.3% and the MACs by 95.3%. These results indicate that by utilizing frequency band-split and interleaved modeling structures, TIGER achieves a substantial reduction in parameters and computational costs while maintaining high performance. Notably, TIGER is the first speech separation model with fewer than 1 million parameters that achieves performance close to the SOTA model.
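The abstract describes EchoSet mixtures as two speakers overlapping at random proportions. The sketch below shows one simple way to build such a mixture from two utterances. It is not the EchoSet generation pipeline: noise and reverberation are left out, the uniform sampling of the overlap ratio is an assumption, and `mix_with_overlap` is a hypothetical helper.

```python
import random
from typing import List, Sequence, Tuple


def mix_with_overlap(a: Sequence[float], b: Sequence[float],
                     overlap_ratio: float) -> Tuple[List[float], int]:
    """Place `b` after `a` so that part of the two signals overlaps.

    `overlap_ratio` is the overlapped fraction of the shorter signal.
    Returns the mixture and the sample index at which `b` starts.
    """
    if not 0.0 <= overlap_ratio <= 1.0:
        raise ValueError("overlap_ratio must lie in [0, 1]")
    overlap = int(round(overlap_ratio * min(len(a), len(b))))
    offset = len(a) - overlap
    total = max(len(a), offset + len(b))
    mixture = [0.0] * total
    for i, x in enumerate(a):
        mixture[i] += x
    for j, y in enumerate(b):
        mixture[offset + j] += y
    return mixture, offset


# Toy "utterances": constant signals make the overlap easy to see.
speaker_a = [1.0] * 8
speaker_b = [2.0] * 6
mixture, offset = mix_with_overlap(speaker_a, speaker_b, overlap_ratio=0.5)
print(offset)    # 5
print(mixture)   # five 1.0s, three 3.0s, three 2.0s

# Assumption: the overlap proportion is drawn uniformly at random.
rng = random.Random(0)
random_mix, _ = mix_with_overlap(speaker_a, speaker_b, rng.uniform(0.0, 1.0))
```

A separation model receives only the mixture and must recover the two sources, so the overlapped region (the 3.0 samples in the toy case) is where the task is hardest.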

🖼️ TIGER

Overall pipeline of the model architecture of TIGER and its modules.

TIGER Model Architecture

📊 Results

Performance comparisons of TIGER and other existing separation models on Libri2Mix, LRS2-2Mix, and EchoSet. Bold indicates optimal performance, and italics indicate suboptimal performance.


Efficiency comparisons of TIGER and other models.


Comparison of performance and efficiency of cinematic sound separation models on DnR. '*' means the result comes from the original paper of DnR.


📄 License

This project is licensed under the Apache 2.0 license.

📑 Citation

@article{xu2024tiger,
  title={TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation},
  author={Xu, Mohan and Li, Kai and Chen, Guo and Hu, Xiaolin},
  journal={arXiv preprint arXiv:2410.01469},
  year={2024}
}

📧 Contact

If you have any questions, please feel free to contact us via tsinghua.kaili@gmail.com.
