MMS-300M-1130 Forced Aligner: An Open-Source Tool for Multilingual Forced Alignment of Text and Audio

MMS-300M-1130 Forced Aligner

Developed by MahmoudAshraf

A text-to-audio forced alignment tool based on Hugging Face pre-trained models, supporting multiple languages with high memory efficiency

Speech Recognition

Transformers

#Supports multiple languages #Multilingual speech alignment #Low memory consumption #Audio-text synchronization

Downloads 2.5M

Release Date: 5/2/2024

Model Overview

This model uses Hugging Face's pretrained CTC models to force-align audio with its transcript, and consumes significantly less memory than TorchAudio's forced alignment API. It is suitable for speech recognition pipelines, speech annotation, and similar scenarios.

Model Features

Efficient Memory Usage

Significantly reduces memory consumption compared to TorchAudio's forced alignment API

Multilingual Support

Supports forced alignment for over 100 languages

Based on wav2vec2 Architecture

Builds on the wav2vec2 model architecture for accurate alignment

Easy to Use

Provides a clear Python API that integrates easily into existing workflows

Model Capabilities

Audio-text forced alignment

Speech recognition

Speech annotation

Multilingual processing

Use Cases

Speech Processing

Subtitle Generation

Generate precise time-aligned subtitles for video content

Improves synchronization accuracy between subtitles and speech

Speech Annotation

Generate precise word-level time annotations for speech datasets

Enhances the quality of training data for speech recognition models

Linguistic Research

Speech Analysis

Analyze speech characteristics and pronunciation patterns across different languages

Supports multilingual phonetic research

🚀 Forced Alignment with Hugging Face CTC Models

This Python package offers an efficient solution for performing forced alignment between text and audio using Hugging Face's pretrained models. It also features an improved implementation that consumes significantly less memory compared to the TorchAudio forced alignment API.
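To make the idea of CTC forced alignment concrete, here is a minimal pure-Python sketch: a Viterbi search over the transcript tokens interleaved with blanks, followed by a conversion from frames to seconds. It is not the package's implementation, which is optimized for memory use. The function names `ctc_forced_align` and `token_spans`, the toy three-symbol vocabulary, and the 20 ms frame duration are assumptions made for illustration.

```python
import math


def ctc_forced_align(log_probs, tokens, blank=0):
    """Map each frame to an index into `tokens`, or None for a blank frame.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    tokens:    list of symbol ids that must appear in this order.
    """
    if not tokens or not log_probs:
        raise ValueError("need at least one token and one frame")

    # Interleave blanks: [blank, t1, blank, t2, ..., tn, blank]
    states = [blank]
    for tok in tokens:
        states += [tok, blank]

    T, S = len(log_probs), len(states)
    neg_inf = -math.inf
    score = [[neg_inf] * S for _ in range(T)]
    back = [[0] * S for _ in range(T)]

    # A path may start in the leading blank or in the first token.
    score[0][0] = log_probs[0][states[0]]
    score[0][1] = log_probs[0][states[1]]

    for t in range(1, T):
        for s in range(S):
            # Allowed moves: stay, advance one state, or skip a blank
            # between two different tokens.
            best_prev, best = s, score[t - 1][s]
            if s >= 1 and score[t - 1][s - 1] > best:
                best_prev, best = s - 1, score[t - 1][s - 1]
            can_skip = s >= 2 and states[s] != blank and states[s] != states[s - 2]
            if can_skip and score[t - 1][s - 2] > best:
                best_prev, best = s - 2, score[t - 1][s - 2]
            if best > neg_inf:
                score[t][s] = best + log_probs[t][states[s]]
                back[t][s] = best_prev

    # A path must end in the last token or the trailing blank.
    s = S - 1 if score[T - 1][S - 1] >= score[T - 1][S - 2] else S - 2
    if score[T - 1][s] == neg_inf:
        raise ValueError("too few frames to fit the token sequence")

    # Backtrack from the last frame to the first.
    path = [0] * T
    for t in range(T - 1, -1, -1):
        path[t] = s
        s = back[t][s]

    # Odd state indices hold tokens; even ones hold blanks.
    return [None if i % 2 == 0 else (i - 1) // 2 for i in path]


def token_spans(frame_to_token, frame_ms=20.0):
    """Collapse a frame-level alignment into (token_index, start_s, end_s)."""
    spans = {}
    for frame, tok in enumerate(frame_to_token):
        if tok is None:
            continue
        start, _ = spans.get(tok, (frame, frame))
        spans[tok] = (start, frame + 1)
    return [
        (tok, start * frame_ms / 1000, end * frame_ms / 1000)
        for tok, (start, end) in sorted(spans.items())
    ]


# Toy example: vocabulary {0: blank, 1: "a", 2: "b"}, transcript "a b".
probs = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
log_probs = [[math.log(p) for p in frame] for frame in probs]
alignment = ctc_forced_align(log_probs, tokens=[1, 2])
print(alignment)
print(token_spans(alignment))
```

The dynamic-programming table here has one row per frame and one column per state, which is fine for a toy input but grows quickly for long recordings; that growth is the memory cost the package is designed to reduce.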

📦 Installation

pip install git+https://github.com/MahmoudAshraf97/ctc-forced-aligner.git

💻 Usage Examples

Basic Usage

import torch
from ctc_forced_aligner import (
    load_audio,
    load_alignment_model,
    generate_emissions,
    preprocess_text,
    get_alignments,
    get_spans,
    postprocess_results,
)

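audio_path = "your/audio/path"
text_path = "your/text/path"
language = "iso"  # replace with the ISO 639-3 code of the transcript language, e.g. "eng"
device = "cuda" if torch.cuda.is_available() else "cpu"
batch_size = 16  # batch size used when generating emissions; lower it if memory is tight

# Load the alignment model and tokenizer (half precision on GPU, full precision on CPU)
alignment_model, alignment_tokenizer = load_alignment_model(
    device,
    dtype=torch.float16 if device == "cuda" else torch.float32,
)

# Load the audio with the same dtype and device as the model
audio_waveform = load_audio(audio_path, alignment_model.dtype, alignment_model.device)

# Read the transcript and flatten it onto a single line
with open(text_path, "r", encoding="utf-8") as f:
    text = f.read().replace("\n", " ").strip()

# Run the model over the audio to obtain frame-level emissions
emissions, stride = generate_emissions(
    alignment_model, audio_waveform, batch_size=batch_size
)

# Prepare the transcript for alignment (romanized, for the given language)
tokens_starred, text_starred = preprocess_text(
    text,
    romanize=True,
    language=language,
)

# Align the transcript tokens to the emission frames
segments, scores, blank_token = get_alignments(
    emissions,
    tokens_starred,
    alignment_tokenizer,
)

# Group the aligned frames into spans
spans = get_spans(tokens_starred, segments, blank_token)

# Convert the spans into timestamps for the transcript text
word_timestamps = postprocess_results(text_starred, spans, stride, scores)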
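
For the subtitle use case, the word-level result can be grouped into SRT cues. The sketch below assumes `word_timestamps` is a list of dictionaries with `start` and `end` (in seconds) and `text` keys; that shape is an assumption here, so inspect the actual output before relying on it. The helpers `format_srt_time` and `words_to_srt`, and the fixed words-per-cue grouping, are hypothetical and not part of the package.

```python
def format_srt_time(seconds):
    """Format a time in seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def words_to_srt(word_timestamps, max_words=7):
    """Group consecutive words into cues of at most `max_words` words.

    Assumes each item looks like {"start": float, "end": float, "text": str}.
    """
    cues = []
    for i in range(0, len(word_timestamps), max_words):
        chunk = word_timestamps[i:i + max_words]
        start = format_srt_time(chunk[0]["start"])
        end = format_srt_time(chunk[-1]["end"])
        text = " ".join(word["text"] for word in chunk)
        cues.append(f"{len(cues) + 1}\n{start} --> {end}\n{text}\n")
    return "\n".join(cues)


# Hand-written sample data standing in for real aligner output.
sample = [
    {"start": 0.0, "end": 0.5, "text": "hello"},
    {"start": 0.6, "end": 1.2, "text": "world"},
    {"start": 1.5, "end": 2.0, "text": "again"},
]
print(words_to_srt(sample, max_words=2))
```

Splitting by a fixed word count keeps the example short; a production subtitle tool would more likely break cues on pauses or punctuation.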

📄 License

This project is licensed under the CC-BY-NC-4.0 license.

📚 Documentation

Supported Languages

The following languages are supported by this package:

ab, af, ak, am, ar, as, av, ay, az, ba, bm, be, bn, bi, bo, sh, br, bg, ca, cs, ce, cv, ku, cy, da, de, dv, dz, el, en, eo, et, eu, ee, fo, fa, fj, fi, fr, fy, ff, ga, gl, gn, gu, zh, ht, ha, he, hi, hu, hy, ig, ia, ms, is, it, jv, ja, kn, ka, kk, kr, km, ki, rw, ky, ko, kv, lo, la, lv, ln, lt, lb, lg, mh, ml, mr, mk, mg, mt, mn, mi, my, nl, no, ne, ny, oc, om, or, os, pa, pl, pt, ps, qu, ro, rn, ru, sg, sk, sl, sm, sn, sd, so, es, sq, su, sv, sw, ta, tt, te, tg, tl, th, ti, ts, tr, uk, vi, wo, xh, yo, zu, za
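
The list above uses two-letter tags, while the usage example describes `language` as an ISO-639-3 (three-letter) code. A small lookup can bridge the two. The table below is a hypothetical, hand-written subset and `to_iso_639_3` is not part of the package; whether the aligner accepts each of these three-letter codes is an assumption to verify.

```python
# Hypothetical subset: two-letter tags and their ISO 639-3 equivalents.
ISO_639_1_TO_3 = {
    "ar": "ara",
    "de": "deu",
    "en": "eng",
    "es": "spa",
    "fr": "fra",
    "hi": "hin",
    "ja": "jpn",
    "ru": "rus",
    "zh": "zho",
}


def to_iso_639_3(code):
    """Return a three-letter code, passing three-letter input through unchanged."""
    code = code.strip().lower()
    if len(code) == 3:
        return code
    try:
        return ISO_639_1_TO_3[code]
    except KeyError:
        raise ValueError(f"No ISO 639-3 mapping known for {code!r}") from None


language = to_iso_639_3("en")
print(language)
```

For full coverage, a maintained language-code library would be a safer choice than extending this table by hand.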

Pipeline Tag

automatic-speech-recognition

Model Checkpoint

The model checkpoint uploaded here is a conversion from torchaudio to HF Transformers of the MMS-300M checkpoint trained on a forced alignment dataset.
