DiVA-llama-3-v0-8b Open-source Voice Assistant Model - Supports Voice and Text Input, Practical and Free

DiVA Llama 3 v0 8B

Developed by WillHeld

DiVA Llama 3 is an end-to-end voice assistant model capable of processing both speech and text inputs, trained using distillation loss.

Text-to-Audio

Transformers

#End-to-end voice assistant #Multimodal input #Distillation training

Downloads 2,596

Release Time : 6/20/2024

Model Overview

This model is an end-to-end voice assistant based on the Llama 3 architecture. It combines speech and text processing and can understand and respond to voice commands.

Model Features

End-to-end voice assistant

Can directly process speech input without a separate speech recognition module.

Distillation training

Trained using distillation loss, which, per the title of the accompanying paper, avoids the need for instruction training data.

Multimodal input

Supports both speech and text input simultaneously.

Model Capabilities

Speech understanding

Text generation

Multi-turn dialogue

Stylized responses (e.g., pirate style, New Yorker style)

Use Cases

Smart assistant

Voice interaction assistant

Interact with devices through voice commands

Can understand and respond to natural voice commands.

Multilingual applications

Multilingual voice assistant

Supports voice input and responses in different languages

🚀 Model Card for Diva Llama 3

This is an end-to-end voice assistant model that handles both speech and text inputs. It is trained using distillation loss. More details are in the pre-print (arXiv:2410.02678).

See the model in action at diva-audio.github.io or look at the full training logs on Weights&Biases.

✨ Features

Handles both speech and text as inputs.
Trained using distillation loss.

💻 Usage Examples

Basic Usage

from transformers import AutoModel
import librosa
import wget

# Download a sample question recording from the SD-QA dataset.
filename = wget.download(
    "https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-1008642825401516622.wav"
)

# Load the audio as a waveform resampled to 16 kHz.
speech_data, _ = librosa.load(filename, sr=16_000)

# trust_remote_code=True is required because the model class ships with the checkpoint.
model = AutoModel.from_pretrained("WillHeld/DiVA-llama-3-v0-8b", trust_remote_code=True)

# Speech only, then speech plus a text instruction.
print(model.generate([speech_data]))
print(model.generate([speech_data], ["Reply Briefly Like A Pirate"]))

# Download and load a second recording.
filename = wget.download(
    "https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-2426554427049983479.wav"
)

speech_data2, _ = librosa.load(filename, sr=16_000)

# Batched call: one text instruction per recording, in the same order.
print(
    model.generate(
        [speech_data, speech_data2],
        ["Reply Briefly Like A Pirate", "Reply Briefly Like A New Yorker"],
    )
)
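The batched call above passes two parallel lists: recordings and text instructions. A minimal sketch of a helper that builds those lists from (audio, instruction) pairs is shown below. `build_batch` is a hypothetical name and not part of the model's API, and the short Python lists are stand-ins for real 16 kHz waveforms. It assumes each recording gets exactly one instruction, as in the example above.

```python
def build_batch(pairs):
    """Split (audio, instruction) pairs into the two parallel lists
    used by the batched generate call in the usage example.

    Hypothetical helper for illustration; not part of the DiVA API.
    """
    audios, prompts = [], []
    for index, (audio, prompt) in enumerate(pairs):
        # Reject empty waveforms early instead of failing inside the model.
        if len(audio) == 0:
            raise ValueError(f"item {index}: audio is empty")
        # Assumption: every recording is paired with a non-empty instruction.
        if not isinstance(prompt, str) or not prompt.strip():
            raise ValueError(f"item {index}: instruction must be a non-empty string")
        audios.append(audio)
        prompts.append(prompt.strip())
    return audios, prompts


# Stand-in waveforms; in practice these would come from librosa.load(..., sr=16_000).
clip_a = [0.0, 0.1, -0.1]
clip_b = [0.2, 0.0]

audios, prompts = build_batch(
    [
        (clip_a, "Reply Briefly Like A Pirate"),
        (clip_b, "Reply Briefly Like A New Yorker"),
    ]
)
print(len(audios), prompts)
```

With the model loaded as in the usage example, the result would be passed as `model.generate(audios, prompts)`.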

📚 Documentation

Training Details

Training Data

This model was trained on the CommonVoice corpus.

Training Procedure

This model was trained for 7k gradient steps with a batch size of 512 recordings and a learning rate that decays linearly from 5e-5 to zero, with a linear warmup of 70 steps.
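The schedule described above can be sketched as a plain Python function. This is an illustration, not the Levanter configuration used for training. It assumes that warmup ramps from zero to the peak rate, that the decay begins immediately after warmup and reaches zero at the final step, and that "7k" means exactly 7,000 steps. The function name and its parameters are hypothetical.

```python
def learning_rate(step, peak_lr=5e-5, warmup_steps=70, total_steps=7_000):
    """Linear warmup to peak_lr, then linear decay to zero at total_steps."""
    if step < 0 or step > total_steps:
        raise ValueError("step must be in [0, total_steps]")
    if step < warmup_steps:
        # Assumption: warmup starts at zero and rises linearly to the peak.
        return peak_lr * step / warmup_steps
    # Assumption: decay starts right after warmup and hits zero at the last step.
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - warmup_steps)


# Inspect the schedule at the start, mid-warmup, the peak, mid-decay, and the end.
for step in (0, 35, 70, 3_535, 7_000):
    print(step, learning_rate(step))

# Recordings processed over the whole run: steps times batch size.
total_recordings = 7_000 * 512
print(total_recordings)
```

Under these assumptions the run processes 3,584,000 recordings in total. The source does not say how many of them are distinct.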

Environmental Impact

Property	Details
Hardware Type	V4-256 TPU
Hours used	11 Hours
Cloud Provider	Google Cloud
Compute Region	US Central C

Technical Specifications

Compute Infrastructure

Hardware

This model was trained on a V4-256 TPU on Google Cloud.

Software

This model was trained with Levanter.

📄 License

This project is licensed under the MPL-2.0 license.

📄 Citation

BibTeX:

@misc{DiVA,
      title={{D}istilling an {E}nd-to-{E}nd {V}oice {A}ssistant {W}ithout {I}nstruction {T}raining {D}ata}, 
      author={William Held and Ella Li and Michael Ryan and Weiyan Shi and Yanzhe Zhang and Diyi Yang},
      year={2024},
      eprint={2410.02678},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.02678}, 
}

📞 Model Card Contact

held@stanford.edu
