🚀 Lyric alignment
A framework for aligning Vietnamese song lyrics.
This project aims to build a model that aligns song lyrics with music audio. It takes a music segment (including vocals) and its lyrics as input and outputs the start and end times of each word in the lyrics.
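For illustration, the input and output can be pictured roughly as follows. The field names below are hypothetical and do not necessarily match the project's actual JSON schema.
# Hypothetical input: lyric segments, each holding the words to be aligned.
lyric_data = [
    {
        "segment": "Em ơi có bao nhiêu",
        "words": ["Em", "ơi", "có", "bao", "nhiêu"],
    },
]
# Hypothetical output: start/end times (in milliseconds) for every word.
lyric_alignment = [
    {"word": "Em", "start_ms": 1200, "end_ms": 1450},
    {"word": "ơi", "start_ms": 1450, "end_ms": 1700},
]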
🚀 Quick Start
To reproduce the model from scratch, run the following command:
CUDA_VISIBLE_DEVICES=0,1,2,3,4 python -m torch.distributed.launch --nproc_per_node 5 train.py
The train.py script will automatically download the dataset from nguyenvulebinh/song_dataset and the pre-trained model nguyenvulebinh/wav2vec2-large-vi-vlsp2020, and then start the training process.
To load the model, use the following code:
from transformers import AutoTokenizer, AutoFeatureExtractor
from model_handling import Wav2Vec2ForCTC
model_path = 'nguyenvulebinh/lyric-alignment'
model = Wav2Vec2ForCTC.from_pretrained(model_path).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
vocab = [tokenizer.convert_ids_to_tokens(i) for i in range(len(tokenizer.get_vocab()))]
The code for lyric alignment is in the predict.py file. Make sure your audio file is 16kHz and single-channel; you can use the preprocessing.py file to convert the audio to meet this requirement (a minimal conversion sketch follows the example below).
from predict import handle_sample
import torchaudio
import json
# wav_path: path to the audio file; it must be 16 kHz and single-channel.
# path_lyric: path to the lyric data in JSON format, containing the list of segments and words.
wav, _ = torchaudio.load(wav_path)
with open(path_lyric, 'r', encoding='utf-8') as file:
    lyric_data = json.load(file)
lyric_alignment = handle_sample(wav, lyric_data)
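If your audio is not already 16 kHz and mono, a minimal torchaudio-based conversion is sketched below. This is only an assumption-level alternative to preprocessing.py, whose exact interface is not shown in this document.
import torch
import torchaudio
def to_16k_mono(wav_path: str) -> torch.Tensor:
    # Load the audio file (shape: [channels, samples]).
    wav, sample_rate = torchaudio.load(wav_path)
    # Mix down to a single channel by averaging.
    if wav.size(0) > 1:
        wav = wav.mean(dim=0, keepdim=True)
    # Resample to 16 kHz if needed.
    if sample_rate != 16000:
        wav = torchaudio.functional.resample(wav, sample_rate, 16000)
    return wav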
✨ Features
- Lyric Alignment: Align Vietnamese song lyrics with music audio.
- High Accuracy: Achieved a high $IoU$ score on the public leaderboard.
- Data Crawling: Crawled additional data from public sources to improve model performance.
📦 Installation
The project uses the following dependencies:
transformers
torchaudio
torch
You can install these dependencies using pip:
pip install transformers torchaudio torch
💻 Usage Examples
Basic Usage
from transformers import AutoTokenizer, AutoFeatureExtractor
from model_handling import Wav2Vec2ForCTC
model_path = 'nguyenvulebinh/lyric-alignment'
model = Wav2Vec2ForCTC.from_pretrained(model_path).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_path)
vocab = [tokenizer.convert_ids_to_tokens(i) for i in range(len(tokenizer.get_vocab()))]
Advanced Usage
from predict import handle_sample
import torchaudio
import json
# wav_path: path to the audio file; it must be 16 kHz and single-channel.
# path_lyric: path to the lyric data in JSON format, containing the list of segments and words.
wav, _ = torchaudio.load(wav_path)
with open(path_lyric, 'r', encoding='utf-8') as file:
    lyric_data = json.load(file)
lyric_alignment = handle_sample(wav, lyric_data)
📚 Documentation
Task description (Zalo AI challenge 2022)
The goal is to build a model that aligns lyrics with music audio. The input is a music segment (including vocals) and its lyrics, and the output is the start and end times of each word in the lyrics. The accuracy of the prediction is evaluated using the Intersection over Union ($IoU$) metric.
Data description
Zalo public dataset
- Training data: 1057 music segments from approximately 480 songs. Each segment is provided with a WAV audio file and a ground-truth JSON file containing lyrics and the aligned time frame (in milliseconds) of each word.
- Testing data:
- Public test: 264 music segments from approximately 120 songs.
- Private test: 464 music segments from approximately 200 songs.
Crawling public dataset
Since the Zalo dataset is small and noisy, additional data was crawled from other public sources. The data crawling and processing details are in the data_preparation folder. In total, 30,000 songs (approximately 1,500 hours of audio) were crawled from https://zingmp3.vn.
Methodology
The strategies are based on the CTC-Segmentation study by Ludwig Kürzinger and the PyTorch tutorial on Forced Alignment with Wav2Vec2.
The alignment process involves the following steps (a simplified code sketch follows the list):
- Estimate the frame-wise label probability from the audio waveform.
- Generate the trellis matrix representing the probability of labels aligned at each time step.
- Find the most likely path from the trellis matrix.
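A minimal sketch of steps 2 and 3 is given below, in the spirit of the PyTorch forced-alignment tutorial. The inputs emission (frame-wise log-probabilities from step 1) and tokens (label ids of the lyric text) are assumed, and the actual predict.py implementation may differ.
import torch
def get_trellis(emission, tokens, blank_id=0):
    # emission: [num_frames, num_labels] log-probabilities; tokens: label ids of the lyrics.
    num_frames, num_tokens = emission.size(0), len(tokens)
    trellis = torch.full((num_frames, num_tokens), -float("inf"))
    trellis[0, 0] = emission[0, tokens[0]]
    # Staying on the first token is modelled by emitting blanks.
    trellis[1:, 0] = trellis[0, 0] + torch.cumsum(emission[1:, blank_id], dim=0)
    for t in range(1, num_frames):
        trellis[t, 1:] = torch.maximum(
            trellis[t - 1, 1:] + emission[t, blank_id],     # stay (emit blank)
            trellis[t - 1, :-1] + emission[t, tokens[1:]],  # advance to the next token
        )
    return trellis
def backtrack(trellis, emission, tokens, blank_id=0):
    # Walk backwards through the trellis and record the frame at which each token is active.
    t, j = trellis.size(0) - 1, trellis.size(1) - 1
    path = [(j, t)]
    while j > 0 and t > 0:
        stayed = trellis[t - 1, j] + emission[t, blank_id]
        changed = trellis[t - 1, j - 1] + emission[t, tokens[j]]
        if changed > stayed:
            j -= 1
        t -= 1
        path.append((j, t))
    return path[::-1]  # list of (token_index, frame_index) pairs
Consecutive frames that map to the same token index are then merged into one time span per token, and frame indices are converted to milliseconds using the model's frame stride.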
To ensure good alignment, a robust acoustic model is needed to obtain reliable frame-wise probabilities, and the labels need to be in spoken form. English words are converted to Vietnamese pronunciation using the nguyenvulebinh/spelling-oov model, number formats are handled using Vinorm, and special characters are removed.
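For illustration only, a stripped-down version of the special-character cleanup might look like the sketch below; the Vinorm and spelling-oov calls are left out because their exact APIs are not shown in this document.
import re
def clean_word(word):
    # Illustrative cleanup for a single lyric word, not the project's exact code.
    # In the real pipeline, numbers are expanded with Vinorm and English words are
    # converted to Vietnamese pronunciation with the nguyenvulebinh/spelling-oov model.
    w = word.lower()
    # Drop special characters; \w is Unicode-aware, so Vietnamese letters are kept.
    w = re.sub(r"[^\w\s]", "", w)
    return w.strip()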
Evaluation setup
Acoustic model
The final model is based on nguyenvulebinh/wav2vec2-large-vi-vlsp2020. It was pre-trained on 13,000 hours of unlabeled Vietnamese YouTube audio and fine-tuned on 250 hours of the labeled VLSP ASR dataset (16 kHz speech audio). The model was trained for around 50 epochs, which took 78 hours.
The word error rate (WER) of the model after training is as follows:
- Zalo public dataset - test set: 0.2267
- Crawling public dataset - test set: 0.1427
Alignment process
The alignment process involves the following steps (a small sketch of the final re-alignment step follows the list):
- Format and convert the input to the spoken form.
- Force align the spoken form text and the audio using the CTC-Segmentation algorithm.
- Adjust the time frame of each word based on the behavior of the labeled data.
- Re-align the spoken form to the raw word.
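The re-alignment step is sketched below, under the assumption that each raw word is known to expand into a fixed number of spoken-form tokens; the data layout is hypothetical.
def realign_to_raw(raw_to_spoken, spoken_timings):
    # raw_to_spoken: list of (raw_word, number_of_spoken_tokens_it_expanded_to).
    # spoken_timings: list of (start_ms, end_ms), one entry per spoken-form token, in order.
    result = []
    idx = 0
    for raw_word, n_tokens in raw_to_spoken:
        group = spoken_timings[idx: idx + n_tokens]
        # The raw word spans from the first to the last of its spoken-form tokens.
        result.append({"word": raw_word, "start_ms": group[0][0], "end_ms": group[-1][1]})
        idx += n_tokens
    return result
For example, a raw token like "1975" that was expanded into several Vietnamese words for alignment gets back a single time span covering all of them.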
🔧 Technical Details
Evaluation metric
The accuracy of the prediction is evaluated using the Intersection over Union ($IoU$) metric. The $IoU$ of an audio segment $S_i$ is calculated as:
$IoU(S_i) = \frac{1}{m} \sum_{j=1}^{m}{\frac{|G_j \cap P_j|}{|G_j \cup P_j|}}$
where $m$ is the number of tokens of $S_i$, and $G_j$ and $P_j$ are the ground-truth and predicted time intervals of the $j$-th token. The final $IoU$ across all $n$ audio segments is the average of their per-segment scores:
$IoU_{final} = \frac{1}{n} \sum_{i=1}^{n}{IoU(S_i)}$
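As a concrete check of the formula, the sketch below computes the word-level and segment-level $IoU$ from [start, end] intervals given in milliseconds.
def interval_iou(gt, pred):
    # Intersection over union of two [start, end] time intervals.
    inter = max(0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0
def segment_iou(gt_words, pred_words):
    # Average word-level IoU over the m words of one segment.
    return sum(interval_iou(g, p) for g, p in zip(gt_words, pred_words)) / len(gt_words)
gt = [(0, 500), (500, 1000)]
pred = [(100, 500), (500, 1000)]
print(segment_iou(gt, pred))  # (0.8 + 1.0) / 2 = 0.9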
Heuristic rules for alignment
- A word is not longer than 3s.
- If a word is shorter than 1.4s, 20ms is added to its start and end; if it is shorter than 140ms, 40ms is added instead.
- All timestamps of each word are shifted to the left by 120ms.
These heuristic rules are implemented in the add_pad function in the utils.py file.
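A rough sketch of these rules as a post-processing pass is shown below; it is illustrative only, and the actual add_pad implementation may differ in details.
def apply_heuristics(words):
    # words: list of (start_ms, end_ms) tuples, one per word, in order.
    adjusted = []
    for start, end in words:
        duration = end - start
        # A word is capped at 3 seconds.
        if duration > 3000:
            end = start + 3000
        # Pad very short words.
        if duration < 140:
            start, end = start - 40, end + 40
        elif duration < 1400:
            start, end = start - 20, end + 20
        # Shift every timestamp 120 ms to the left.
        adjusted.append((start - 120, end - 120))
    return adjusted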
📄 License
This project is licensed under the CC BY-NC 4.0 license.
Acknowledgment
We would like to thank the organizers of the Zalo AI Challenge 2022 for this exciting challenge.
Contact
nguyenvulebinh@gmail.com

