🚀 Speechless
Speechless is a compact, open-source text-to-semantics model with 1B parameters. It directly generates semantic representations of audio as discrete tokens, eliminating the need for a text-to-speech (TTS) model. By bypassing the traditional TTS → ASR pipeline, it simplifies training, saves resources, and scales well, especially to low-resource languages. Trained on about 400 hours of English and 1000 hours of Vietnamese data, it is a core part of the Ichigo v0.5 family.
🚀 Quick Start
You can use the following example code to load the model.
```python
import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
>>> [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
```
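The generated string interleaves `<|sound_NNNN|>` and `<|duration_NN|>` tokens. A minimal sketch for extracting the numeric IDs from such output (the regex and function name are illustrative, not part of the model's library):

```python
import re


def parse_semantic_tokens(generated_text: str) -> list[tuple[str, int]]:
    """Extract (kind, id) pairs from <|sound_NNNN|> / <|duration_NN|> tokens."""
    return [
        (kind, int(num))
        for kind, num in re.findall(r"<\|(sound|duration)_(\d+)\|>", generated_text)
    ]


sample = "<|sound_1968|><|sound_0464|><|duration_02|>"
print(parse_semantic_tokens(sample))  # [('sound', 1968), ('sound', 464), ('duration', 2)]
```

The integer IDs could then be fed to a downstream decoder or vocoder that consumes this token vocabulary.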
✨ Features
- Direct Semantic Generation: Generates direct semantic representations of audio as discrete tokens, eliminating the need for a TTS model.
- Resource-Efficient: Simplifies training and saves resources by bypassing the traditional TTS → ASR pipeline.
- Scalable: Suitable for low-resource languages.
📦 Installation
No specific installation steps are provided in the original document; the Quick Start example assumes a recent `torch` and `transformers` installation.
💻 Usage Examples
Basic Usage
```python
import torch
from transformers import pipeline

model_id = "homebrewltd/Speechless-llama3.2-v0.1"
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")
```
📚 Documentation
Model Summary
| Property | Details |
|---|---|
| Developed by | Homebrew Research |
| Model Architecture | Llama |
| Model Type | Text-to-Semantics |
| Language(s) | English and Vietnamese |
| License | Apache 2.0 |
Resources
Intended Use
- Intended Use Cases: This model is primarily designed for research purposes. This version focuses on generating direct semantic representations of audio as discrete tokens, eliminating the need for a text-to-speech (TTS) model.
- Out-of-scope: The use of this model in any manner that violates applicable laws or regulations is strictly prohibited.
Training Specs
| Parameter | Value |
|---|---|
| Epochs | 2 |
| Global Batch Size | 144 |
| Learning Rate | 3e-4 |
| Learning Scheduler | Cosine |
| Optimizer | AdamW |
| Warmup Ratio | 0.05 |
| Weight Decay | 0.01 |
| Max Sequence Length | 512 |
| Clip Grad Norm | 1.0 |
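For readers reproducing the setup, the table above maps roughly onto the config dict below. The key names follow common Hugging Face `TrainingArguments` conventions and are an assumption; the actual training script is not included in this document.

```python
# Hypothetical config mirroring the Training Specs table (key names assumed).
training_config = {
    "num_train_epochs": 2,
    "global_batch_size": 144,     # effective batch size across all devices
    "learning_rate": 3e-4,
    "lr_scheduler_type": "cosine",
    "optimizer": "adamw",
    "warmup_ratio": 0.05,
    "weight_decay": 0.01,
    "max_seq_length": 512,
    "max_grad_norm": 1.0,         # gradient clipping threshold
}
print(training_config["learning_rate"])  # 0.0003
```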
Evaluation
Vietnamese
| Model Name | Test Dataset | Test Samples | WER |
|---|---|---|---|
| Speechless v0.1 | viet_bud500 | 7500 | 3.99 |
English
| Model Name | Test Dataset | Test Samples | WER |
|---|---|---|---|
| Speechless v0.1 | librispeech_asr | 2620 | 3.27 |
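The WER figures above are word error rates: the word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words. A minimal reference implementation (not the evaluation script the authors used):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,      # deletion
                dp[i][j - 1] + 1,      # insertion
                dp[i - 1][j - 1] + cost,  # substitution or match
            )
    return dp[len(ref)][len(hyp)] / len(ref)


print(wer("the cat sat", "the cat sat"))  # 0.0
print(wer("a b c d", "a x c d"))          # 0.25
```

Reported WER is usually scaled to a percentage, so a score of 3.27 corresponds to `wer(...) == 0.0327`.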
Citation Information
BibTeX:
```bibtex
@article{speechless2024,
  title={Speechless},
  author={Homebrew Research},
  year={2024},
  month={December},
  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
}
```
Acknowledgement
🔧 Technical Details
The model is trained on datasets including homebrewltd/Ichigo-tokenized-v0.1, comprising about 400 hours of English and 1000 hours of Vietnamese data. The training parameters are listed in the "Training Specs" section.
📄 License
This model is licensed under the Apache 2.0 license.