Speechless-llama3.2-v0.1 Open Source Model - Convert Text Directly to Semantic Speech Tokens without TTS

Speechless Llama3.2 V0.1

Developed by Menlo

Speechless is a compact open-source text-to-semantic model (1 billion parameters) designed to convert text directly into discrete semantic tokens that represent audio, without relying on a text-to-speech (TTS) model.

Speech Recognition

Safetensors

Supports Multiple Languages

Open Source License: Apache-2.0

Tags: #Audio Semantic Tagging #Cross-language Support #End-to-end Speech Processing


Release Time : 12/28/2024

Model Overview

Speechless removes the complexity of traditional TTS→ASR pipelines by converting text directly into semantic speech tokens. This simplifies training, saves resources, and scales more easily, especially for low-resource languages.

Model Features

Direct Text-to-Semantic Tokenization

Converts text directly into discrete semantic tokens that represent audio, without relying on a TTS model.

Multilingual Support

Supports English and Vietnamese; the approach is particularly suited to low-resource languages.

Efficient Training

Simplifies the training process and saves computational resources.

Model Capabilities

Text-to-Semantic Tokenization

Multilingual Processing

Efficient Resource Utilization

Use Cases

Speech Processing

Text-to-Semantic Tokenization

Converts text directly into semantic speech tokens for subsequent processing or analysis.

Reported word error rates (WER): 3.27 on librispeech_asr (English) and 3.99 on viet_bud500 (Vietnamese).

Research

Speech Model Research

Used to study new methods for direct text-to-semantic tokenization.

🚀 Speechless

Speechless is a compact, open-source text-to-semantics model that generates semantic representations of audio directly as discrete tokens. It bypasses the need for a TTS model, which simplifies training and saves resources, especially for low-resource languages.


🚀 Quick Start

The example below loads the model with the transformers text-generation pipeline and generates semantic tokens for a text prompt.

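import torch
from transformers import pipeline

# Hugging Face model repository to load
model_id = "homebrewltd/Speechless-llama3.2-v0.1"

# Build a text-generation pipeline in bfloat16, placing weights on the available device(s)
pipe = pipeline(
    "text-generation",
    model=model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# The prompt is the input text prefixed with the special token shown here
pipe("<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research")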

>>> [{'generated_text': '<|reserved_special_token_69|>I’m Speechless – A Model Developed by Homebrew Research.assistant\n\n<|sound_1968|><|sound_0464|><|sound_0642|><|duration_02|><|sound_0634|><|sound_0105|><|duration_02|><|sound_1745|><|duration_02|><|sound_1345|><|sound_0210|><|sound_1312|><|sound_1312|>'}]
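The generated text mixes the echoed prompt with `<|sound_NNNN|>` and `<|duration_NN|>` tokens. As a minimal sketch of post-processing, the helper below pulls those tokens out of the string in order. The names `build_prompt`, `parse_semantic_tokens`, `PROMPT_PREFIX`, and `TOKEN_PATTERN` are hypothetical, not part of the model's API, and the token format is inferred only from the sample output above. What a duration token means relative to its neighbouring sound tokens is not documented here, so the sketch does not interpret it.

```python
import re

# Prefix copied from the Quick Start prompt (assumption: it is always required).
PROMPT_PREFIX = "<|reserved_special_token_69|>"

# Matches tokens such as <|sound_1968|> or <|duration_02|> (format inferred from the sample output).
TOKEN_PATTERN = re.compile(r"<\|(sound|duration)_(\d+)\|>")


def build_prompt(text: str) -> str:
    """Prepend the special token that the Quick Start example places before the text."""
    return PROMPT_PREFIX + text


def parse_semantic_tokens(generated_text: str) -> list[tuple[str, int]]:
    """Return (kind, index) pairs in order of appearance, e.g. ("sound", 1968)."""
    return [(kind, int(index)) for kind, index in TOKEN_PATTERN.findall(generated_text)]


# Shortened stand-in for the 'generated_text' field returned by the pipeline.
sample = "assistant\n\n<|sound_1968|><|sound_0464|><|duration_02|><|sound_0634|>"
print(parse_semantic_tokens(sample))
```

For the shortened sample string this prints `[('sound', 1968), ('sound', 464), ('duration', 2), ('sound', 634)]`. The prompt prefix itself is not matched, because it is neither a sound nor a duration token.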

✨ Features

Speechless is a compact, open-source text-to-semantics model (1B parameters). It is designed to generate semantic representations of audio directly as discrete tokens, bypassing the need for a text-to-speech (TTS) model. Unlike traditional pipelines that rely on generating and processing audio (TTS → ASR), Speechless converts text directly into semantic speech tokens, which simplifies training, saves resources, and enables scalability, especially for low-resource languages. Trained on ~400 hours of English and ~1000 hours of Vietnamese data, it is a core component of the Ichigo v0.5 family.

📚 Documentation

Model Summary

Property	Details
Developed by	Homebrew Research
Model Architecture	Llama
Model Type	Text to Semantics
Language(s)	English and Vietnamese
License	Apache 2.0

Resources

Blog: Blog post

Intended Use

⚠️ Important Note

This model is primarily designed for research purposes. This version focuses on generating semantic representations of audio directly as discrete tokens, eliminating the need for a text-to-speech (TTS) model. Any use of Speechless that violates applicable laws or regulations is strictly prohibited.

🔧 Technical Details

Training Specs

Parameter	Value
Epochs	2
Global Batch Size	144
Learning Rate	3e-4
Learning Scheduler	Cosine
Optimizer	AdamW
Warmup Ratio	0.05
Weight Decay	0.01
Max Sequence Length	512
Clip Grad Norm	1.0
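To make the learning-rate settings concrete, here is a minimal sketch of a cosine schedule with warmup that uses the peak rate (3e-4) and warmup ratio (0.05) from the table. The table only states "Cosine" and a warmup ratio, so the linear warmup shape, the decay to zero, and the function name `learning_rate` are assumptions. The total step count is also hypothetical, since the source does not give it.

```python
import math

PEAK_LR = 3e-4       # "Learning Rate" from the table
WARMUP_RATIO = 0.05  # "Warmup Ratio" from the table


def learning_rate(step: int, total_steps: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed schedule shape)."""
    warmup_steps = int(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to the peak rate.
        return PEAK_LR * step / max(1, warmup_steps)
    # Decay phase: progress runs from 0.0 (end of warmup) to 1.0 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


# Hypothetical run length, chosen only for illustration.
TOTAL_STEPS = 1000
for step in (0, 25, 50, 525, 1000):
    print(step, learning_rate(step, TOTAL_STEPS))
```

With 1000 total steps, warmup covers the first 50 steps, the peak of 3e-4 is reached at step 50, and the rate falls to half the peak midway through the decay phase.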

Evaluation

Vietnamese

Model Name	Test dataset	Test samples	WER
Speechless v0.1	viet_bud500	7500	3.99

English

Model Name	Test dataset	Test samples	WER
Speechless v0.1	librispeech_asr	2620	3.27
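For readers unfamiliar with the metric, word error rate is the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a generic implementation, not the evaluation script used for the numbers above. The function name `word_error_rate` is hypothetical, and reporting the result as a percentage is an assumption about how the table values are scaled. The source also does not describe how hypothesis transcripts are obtained from the generated semantic tokens.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance over reference length, as a percentage."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference prefix so far and hyp_words[:j]
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return 100.0 * prev[-1] / len(ref_words)


# One substitution (sat -> sit) and one deletion (the) against 6 reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

In the usage example, two edits against six reference words give a WER of about 33.3 on the percentage scale.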

📄 License

The model is licensed under Apache 2.0.

📖 Citation Information

BibTeX:

@article{Speechless2024,
  title={Speechless},
  author={Homebrew Research},
  year=2024,
  month={December},
  url={https://huggingface.co/homebrewltd/Speechless-llama3.2-v0.1}
}

👏 Acknowledgement

WhisperSpeech
Llama3.2
