A low-latency TTS model based on the VITS architecture, achieving Chilean Spanish speech synthesis through rapid fine-tuning (around 20 minutes) on a small dataset (80-150 samples)
Model Features
Rapid Fine-tuning
Requires only about 20 minutes of training time and 80-150 samples to adapt to a specific accent
Lightweight & Low-latency
Designed with VITS architecture for efficient inference performance
Accent Adaptation
Specifically optimized for Chilean Spanish accent
Model Capabilities
Spanish Text-to-Speech
Chilean Accent Speech Synthesis
Real-time Speech Generation
Use Cases
Voice Interaction
Chilean Dialect Voice Assistant
Provides localized accent voice interaction experience for Chilean users
Sample audio demonstrates natural and fluent Chilean accent synthesis
Content Creation
Audio Content Production
Quickly generates narrations or dubbing with regional characteristics
🚀 Transformers Text-to-Speech Model
This project provides a text-to-speech solution using a fine-tuned MMS model. It generates high-quality Spanish speech with low latency and was trained on a Chilean Spanish dataset.
✨ Features
Lightweight and low-latency: based on the VITS architecture, it offers efficient text-to-speech conversion.
Fast training: can be fine-tuned in around 20 minutes with as few as 80 to 150 samples.
Multi-platform support: usable in both Python (Transformers library) and JavaScript (Transformers.js).
📦 Installation
Transformers.js
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
npm i @xenova/transformers
💻 Usage Examples
Basic Usage - Python (Transformers)
from transformers import pipeline
import scipy.io.wavfile
model_id = "ylacombe/mms-spa-finetuned-chilean-monospeaker"
synthesiser = pipeline("text-to-speech", model_id) # add device=0 if you want to use a GPU
speech = synthesiser("Hola, ¿cómo estás hoy?")
scipy.io.wavfile.write("finetuned_output.wav", rate=speech["sampling_rate"], data=speech["audio"])
Advanced Usage - JavaScript (Transformers.js)
import { pipeline } from '@xenova/transformers';

// Create a text-to-speech pipeline
const synthesizer = await pipeline('text-to-speech', 'ylacombe/mms-spa-finetuned-chilean-monospeaker', {
    quantized: false, // Remove this line to use the quantized version (default)
});

// Generate speech
const output = await synthesizer('Hola, ¿cómo estás hoy?');
console.log(output);
// {
//   audio: Float32Array(69888) [ ... ],
//   sampling_rate: 16000
// }
Optionally, save the audio to a wav file (Node.js):
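One way to do this without extra dependencies is to write a 16-bit PCM WAV header by hand using only Node built-ins. The sketch below is illustrative: the `saveWav` helper is our own (not part of Transformers.js), and a short synthetic tone stands in for the pipeline's `output.audio` / `output.sampling_rate`.

```javascript
import fs from 'node:fs';

// Minimal mono 16-bit PCM WAV writer using only Node built-ins.
// `audio` is a Float32Array with values in [-1, 1], as returned by the pipeline.
function saveWav(path, audio, samplingRate) {
  const numSamples = audio.length;
  const dataSize = numSamples * 2; // 2 bytes per 16-bit sample
  const buf = Buffer.alloc(44 + dataSize);
  buf.write('RIFF', 0);
  buf.writeUInt32LE(36 + dataSize, 4);   // remaining chunk size
  buf.write('WAVE', 8);
  buf.write('fmt ', 12);
  buf.writeUInt32LE(16, 16);             // fmt chunk size
  buf.writeUInt16LE(1, 20);              // audio format: PCM
  buf.writeUInt16LE(1, 22);              // channels: mono
  buf.writeUInt32LE(samplingRate, 24);   // sample rate
  buf.writeUInt32LE(samplingRate * 2, 28); // byte rate = rate * channels * 2
  buf.writeUInt16LE(2, 32);              // block align
  buf.writeUInt16LE(16, 34);             // bits per sample
  buf.write('data', 36);
  buf.writeUInt32LE(dataSize, 40);
  for (let i = 0; i < numSamples; i++) {
    const s = Math.max(-1, Math.min(1, audio[i])); // clamp before quantizing
    buf.writeInt16LE(Math.round(s * 32767), 44 + i * 2);
  }
  fs.writeFileSync(path, buf);
}

// Stand-in audio: 0.1 s of a 440 Hz tone at 16 kHz.
// With the pipeline, pass `output.audio` and `output.sampling_rate` instead.
const sr = 16000;
const audio = new Float32Array(1600).map((_, i) => Math.sin(2 * Math.PI * 440 * i / sr));
saveWav('finetuned_output.wav', audio, sr);
```

Alternatively, the `wavefile` package on NPM handles the header details for you.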
This is a fine-tuned version of the Spanish Massively Multilingual Speech (MMS) model, a lightweight, low-latency TTS model based on the VITS architecture.
It was trained in around 20 minutes with as few as 80 to 150 samples, on this Chilean Spanish dataset.