🚀 Lyric alignment
A framework for aligning Vietnamese song lyrics.
This project aims to build a model that aligns song lyrics with music audio. It takes a music segment (including vocals) and its lyrics as input and outputs the start and end times of each word in the lyrics.
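For illustration, the input and output can be pictured roughly as follows. The field names below are hypothetical and do not necessarily match the project's actual JSON schema.
# Hypothetical input: lyric segments, each holding the words to be aligned.
lyric_data = [
    {
        "segment": "Em ơi có bao nhiêu",
        "words": ["Em", "ơi", "có", "bao", "nhiêu"],
    },
]
# Hypothetical output: start/end times (in milliseconds) for every word.
lyric_alignment = [
    {"word": "Em", "start_ms": 1200, "end_ms": 1450},
    {"word": "ơi", "start_ms": 1450, "end_ms": 1700},
]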
🚀 Quick Start
To reproduce the model from scratch, run the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python -m torch.distributed.launch --nproc_per_node 5 train.py
The train.py script will automatically download the dataset from nguyenvulebinh/song_dataset and the pre-trained model nguyenvulebinh/wav2vec2-large-vi-vlsp2020, and then start the training process.
To load the model, use the following code:
from transformers import AutoTokenizer, AutoFeatureExtractor
from model_handling import Wav2Vec2ForCTC
model_path = 'nguyenvulebinh/lyric-alignment'
model = Wav2Vec2ForCTC.from_pretrained(model_path).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
vocab = [tokenizer.convert_ids_to_tokens(i) for i in range(len(tokenizer.get_vocab()))]
The code for lyric alignment is in the predict.py file. Make sure your audio file is 16kHz and single-channel; you can use the preprocessing.py file to convert the audio to meet this requirement (a minimal conversion sketch follows the example below).
from predict import handle_sample
import torchaudio
import json
# wav_path: path to the audio file; it must be 16 kHz and single-channel.
# path_lyric: path to the lyric data in JSON format, containing the list of segments and words.
wav, _ = torchaudio.load(wav_path)
with open(path_lyric, 'r', encoding='utf-8') as file:
    lyric_data = json.load(file)
lyric_alignment = handle_sample(wav, lyric_data)
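If your audio is not already 16 kHz and mono, a minimal torchaudio-based conversion is sketched below. This is only an assumption-level alternative to preprocessing.py, whose exact interface is not shown in this document.
import torch
import torchaudio
def to_16k_mono(wav_path: str) -> torch.Tensor:
    # Load the audio file (shape: [channels, samples]).
    wav, sample_rate = torchaudio.load(wav_path)
    # Mix down to a single channel by averaging.
    if wav.size(0) > 1:
        wav = wav.mean(dim=0, keepdim=True)
    # Resample to 16 kHz if needed.
    if sample_rate != 16000:
        wav = torchaudio.functional.resample(wav, sample_rate, 16000)
    return wav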
✨ Features
- Lyric Alignment: Align Vietnamese song lyrics with music audio.
- High Accuracy: Achieved a high $IoU$ score on the public leaderboard.
- Data Crawling: Crawled additional data from public sources to improve model performance.
📦 Installation
The project uses the following dependencies:
transformers
torchaudio
torch
You can install these dependencies using pip:
pip install transformers torchaudio torch
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoFeatureExtractor
from model_handling import Wav2Vec2ForCTC
model_path = 'nguyenvulebinh/lyric-alignment'
model = Wav2Vec2ForCTC.from_pretrained(model_path).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
vocab = [tokenizer.convert_ids_to_tokens(i) for i in range(len(tokenizer.get_vocab()))]
Advanced Usage
from predict import handle_sample
import torchaudio
import json
# wav_path: path to the audio file; it must be 16 kHz and single-channel.
# path_lyric: path to the lyric data in JSON format, containing the list of segments and words.
wav, _ = torchaudio.load(wav_path)
with open(path_lyric, 'r', encoding='utf-8') as file:
    lyric_data = json.load(file)
lyric_alignment = handle_sample(wav, lyric_data)
📚 Documentation
Task description (Zalo AI challenge 2022)
The goal is to build a model that aligns lyrics with music audio. The input is a music segment (including vocals) and its lyrics, and the output is the start and end times of each word in the lyrics. The accuracy of the prediction is evaluated using the Intersection over Union ($IoU$) metric.
Data description
Zalo public dataset
- Training data: 1057 music segments from approximately 480 songs. Each segment is provided with a WAV audio file and a ground-truth JSON file containing lyrics and the aligned time frame (in milliseconds) of each word.
- Testing data:
- Public test: 264 music segments from approximately 120 songs.
- Private test: 464 music segments from approximately 200 songs.
Crawling public dataset
Since the Zalo dataset is small and noisy, additional data was crawled from other public sources. The data crawling and processing details are in the data_preparation folder. In total, 30,000 songs (approximately 1,500 hours of audio) were crawled from https://zingmp3.vn.
Methodology
The strategies are based on the CTC-Segmentation study by Ludwig Kürzinger and the PyTorch tutorial on Forced Alignment with Wav2Vec2.
The alignment process involves the following steps (a simplified code sketch follows the list):
- Estimate the frame-wise label probability from the audio waveform.
- Generate the trellis matrix representing the probability of labels aligned at each time step.
- Find the most likely path from the trellis matrix.
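A minimal sketch of steps 2 and 3 is given below, in the spirit of the PyTorch forced-alignment tutorial. The inputs emission (frame-wise log-probabilities from step 1) and tokens (label ids of the lyric text) are assumed, and the actual predict.py implementation may differ.
import torch
def get_trellis(emission, tokens, blank_id=0):
    # emission: [num_frames, num_labels] log-probabilities; tokens: label ids of the lyrics.
    num_frames, num_tokens = emission.size(0), len(tokens)
    trellis = torch.full((num_frames, num_tokens), -float("inf"))
    trellis[0, 0] = emission[0, tokens[0]]
    # Staying on the first token is modelled by emitting blanks.
    trellis[1:, 0] = trellis[0, 0] + torch.cumsum(emission[1:, blank_id], dim=0)
    for t in range(1, num_frames):
        trellis[t, 1:] = torch.maximum(
            trellis[t - 1, 1:] + emission[t, blank_id],     # stay (emit blank)
            trellis[t - 1, :-1] + emission[t, tokens[1:]],  # advance to the next token
        )
    return trellis
def backtrack(trellis, emission, tokens, blank_id=0):
    # Walk backwards through the trellis and record the frame at which each token is active.
    t, j = trellis.size(0) - 1, trellis.size(1) - 1
    path = [(j, t)]
    while j > 0 and t > 0:
        stayed = trellis[t - 1, j] + emission[t, blank_id]
        changed = trellis[t - 1, j - 1] + emission[t, tokens[j]]
        if changed > stayed:
            j -= 1
        t -= 1
        path.append((j, t))
    return path[::-1]  # list of (token_index, frame_index) pairs
Consecutive frames that map to the same token index are then merged into one time span per token, and frame indices are converted to milliseconds using the model's frame stride.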
To ensure good alignment, a robust acoustic model is needed to obtain reliable frame-wise probabilities, and the labels need to be in spoken form. English words are converted to Vietnamese pronunciation using the nguyenvulebinh/spelling-oov model, number formats are handled using Vinorm, and special characters are removed.
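For illustration only, a stripped-down version of the special-character cleanup might look like the sketch below; the Vinorm and spelling-oov calls are left out because their exact APIs are not shown in this document.
import re
def clean_word(word):
    # Illustrative cleanup for a single lyric word, not the project's exact code.
    # In the real pipeline, numbers are expanded with Vinorm and English words are
    # converted to Vietnamese pronunciation with the nguyenvulebinh/spelling-oov model.
    w = word.lower()
    # Drop special characters; \w is Unicode-aware, so Vietnamese letters are kept.
    w = re.sub(r"[^\w\s]", "", w)
    return w.strip()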
Evaluation setup
Acoustic model
The final model is based on nguyenvulebinh/wav2vec2-large-vi-vlsp2020. It was pre-trained on 13,000 hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of the labeled VLSP ASR dataset (16 kHz speech audio). The model was trained for around 50 epochs, which took 78 hours.
The word error rate (WER) of the model after training is as follows:
- Zalo public dataset - test set: 0.2267
- Crawling public dataset - test set: 0.1427
Alignment process
The alignment process involves the following steps (a small sketch of the final re-alignment step follows the list):
- Format and convert the input to the spoken form.
- Force align the spoken form text and the audio using the CTC-Segmentation algorithm.
- Adjust the time frame of each word based on the behavior of the labeled data.
- Re-align the spoken form to the raw word.
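The re-alignment step is sketched below, under the assumption that each raw word is known to expand into a fixed number of spoken-form tokens; the data layout is hypothetical.
def realign_to_raw(raw_to_spoken, spoken_timings):
    # raw_to_spoken: list of (raw_word, number_of_spoken_tokens_it_expanded_to).
    # spoken_timings: list of (start_ms, end_ms), one entry per spoken-form token, in order.
    result = []
    idx = 0
    for raw_word, n_tokens in raw_to_spoken:
        group = spoken_timings[idx: idx + n_tokens]
        # The raw word spans from the first to the last of its spoken-form tokens.
        result.append({"word": raw_word, "start_ms": group[0][0], "end_ms": group[-1][1]})
        idx += n_tokens
    return result
For example, a raw token like "1975" that was expanded into several Vietnamese words for alignment gets back a single time span covering all of them.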
🔧 Technical Details
Evaluation metric
The accuracy of the prediction is evaluated using the Intersection over Union ($IoU$) metric. The $IoU$ of an audio segment $S_i$ is calculated as:
$IoU(S_i) = \frac{1}{m} \sum_{j=1}^{m}{\frac{|G_j \cap P_j|}{|G_j \cup P_j|}}$
where $m$ is the number of tokens of $S_i$, and $G_j$ and $P_j$ are the ground-truth and predicted time intervals of the $j$-th token. The final $IoU$ across all $n$ audio segments is the average of their per-segment scores:
$IoU_{final} = \frac{1}{n} \sum_{i=1}^{n}{IoU(S_i)}$
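As a concrete check of the formula, the sketch below computes the word-level and segment-level $IoU$ from [start, end] intervals given in milliseconds.
def interval_iou(gt, pred):
    # Intersection over union of two [start, end] time intervals.
    inter = max(0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0
def segment_iou(gt_words, pred_words):
    # Average word-level IoU over the m words of one segment.
    return sum(interval_iou(g, p) for g, p in zip(gt_words, pred_words)) / len(gt_words)
gt = [(0, 500), (500, 1000)]
pred = [(100, 500), (500, 1000)]
print(segment_iou(gt, pred))  # (0.8 + 1.0) / 2 = 0.9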
Heuristic rules for alignment
- A word is not longer than 3s.
- If a word is shorter than 1.4s, 20ms is added to its start and end; if it is shorter than 140ms, 40ms is added instead.
- All timestamps of each word are shifted to the left by 120ms.
These heuristic rules are implemented in the add_pad function in the utils.py file.
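A rough sketch of these rules as a post-processing pass is shown below; it is illustrative only, and the actual add_pad implementation may differ in details.
def apply_heuristics(words):
    # words: list of (start_ms, end_ms) tuples, one per word, in order.
    adjusted = []
    for start, end in words:
        duration = end - start
        # A word is capped at 3 seconds.
        if duration > 3000:
            end = start + 3000
        # Pad very short words.
        if duration < 140:
            start, end = start - 40, end + 40
        elif duration < 1400:
            start, end = start - 20, end + 20
        # Shift every timestamp 120 ms to the left.
        adjusted.append((start - 120, end - 120))
    return adjusted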
📄 License
This project is licensed under the CC BY-NC 4.0 license.
Acknowledgment
We would like to thank the organizers of the Zalo AI Challenge 2022 for this exciting challenge.
Contact
nguyenvulebinh@gmail.com

