Model Card for DiVA Llama 3
An end-to-end voice assistant model that accepts both speech and text as input, trained with a distillation loss. More details are in the pre-print (arXiv:2410.02678).
See the model in action at diva-audio.github.io or look at the full training logs on Weights&Biases.
Features
- Handles both speech and text as inputs.
- Trained using distillation loss.
Usage Examples
Basic Usage
from transformers import AutoModel
import librosa
import wget

# Download a sample recording from the SD-QA dev set and load it at 16 kHz.
filename = wget.download(
    "https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-1008642825401516622.wav"
)
speech_data, _ = librosa.load(filename, sr=16_000)

# trust_remote_code=True lets transformers run the custom model code from the Hub repository.
model = AutoModel.from_pretrained("WillHeld/DiVA-llama-3-v0-8b", trust_remote_code=True)

# Speech input only.
print(model.generate([speech_data]))
# Speech input plus a text instruction.
print(model.generate([speech_data], ["Reply Briefly Like A Pirate"]))

# Batched input: two recordings, each paired with its own text instruction.
filename = wget.download(
    "https://github.com/ffaisal93/SD-QA/raw/refs/heads/master/dev/eng/irl/wav_eng/-2426554427049983479.wav"
)
speech_data2, _ = librosa.load(filename, sr=16_000)
print(
    model.generate(
        [speech_data, speech_data2],
        ["Reply Briefly Like A Pirate", "Reply Briefly Like A New Yorker"],
    )
)
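When there are more recordings than fit comfortably in one call, one option is to pass them to the model in fixed-size chunks. The sketch below is not part of the DiVA code: `generate_in_chunks` and its `chunk_size` parameter are hypothetical. It assumes, based only on the calls shown above, that the generate function takes a list of clips plus an equally long list of prompts and returns one output per clip.

```python
from typing import Any, Callable, List, Sequence


def generate_in_chunks(
    generate_fn: Callable[[List[Any], List[str]], List[Any]],
    clips: Sequence[Any],
    prompts: Sequence[str],
    chunk_size: int = 2,
) -> List[Any]:
    """Call generate_fn on consecutive chunks and concatenate the outputs.

    Hypothetical helper. Assumes generate_fn(clips, prompts) returns a list
    with one entry per clip, which the model card does not state explicitly.
    """
    if len(clips) != len(prompts):
        raise ValueError("each clip needs exactly one prompt")
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")

    outputs: List[Any] = []
    for start in range(0, len(clips), chunk_size):
        end = start + chunk_size
        # Slice clips and prompts with the same bounds so pairs stay aligned.
        outputs.extend(generate_fn(list(clips[start:end]), list(prompts[start:end])))
    return outputs
```

With the model loaded as above, a call might look like `generate_in_chunks(model.generate, [speech_data, speech_data2], ["Reply Briefly Like A Pirate", "Reply Briefly Like A New Yorker"], chunk_size=1)`. Taking the generate function as an argument keeps the helper testable with a stub and independent of the model class.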
Documentation
Training Details
Training Data
This model was trained on the CommonVoice corpus.
Training Procedure
This model was trained for 7k gradient steps with a batch size of 512 recordings. The learning rate was warmed up linearly for 70 steps and then decayed linearly from 5e-5 to zero.
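The schedule described above can be sketched as a small function. This is an illustration under stated assumptions, not the Levanter implementation: it takes "7k" to mean exactly 7,000 steps, starts the warmup from zero, and begins the decay at the end of warmup so that the rate reaches zero at the final step. The names `PEAK_LR`, `WARMUP_STEPS`, `TOTAL_STEPS`, and `learning_rate` are made up for this example.

```python
PEAK_LR = 5e-5        # peak learning rate stated in the model card
WARMUP_STEPS = 70     # linear warmup length stated in the model card
TOTAL_STEPS = 7_000   # assumption: "7k" means exactly 7,000 gradient steps
BATCH_SIZE = 512      # recordings per gradient step


def learning_rate(step: int) -> float:
    """Linear warmup from 0 to PEAK_LR, then linear decay to 0.

    Assumes the decay starts right after warmup and ends at TOTAL_STEPS;
    the actual training configuration may differ in these details.
    """
    if step < 0:
        raise ValueError("step must be non-negative")
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step >= TOTAL_STEPS:
        return 0.0
    return PEAK_LR * (TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)


# Upper bound on recordings processed, counting repeats across steps.
recordings_seen = TOTAL_STEPS * BATCH_SIZE
```

Under these assumptions the model processes 7,000 × 512 = 3,584,000 recordings in total, counting any recording that is seen more than once each time it appears.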
Environmental Impact
| Property | Details |
|---|---|
| Hardware Type | V4-256 TPU |
| Hours used | 11 Hours |
| Cloud Provider | Google Cloud |
| Compute Region | US Central C |
Technical Specifications
Compute Infrastructure
Hardware
This model was trained on a V4-256 TPU on Google Cloud.
Software
This model was trained with Levanter.
License
This project is licensed under the MPL-2.0 license.
Citation
BibTeX:
@misc{DiVA,
  title={{D}istilling an {E}nd-to-{E}nd {V}oice {A}ssistant {W}ithout {I}nstruction {T}raining {D}ata},
  author={William Held and Ella Li and Michael Ryan and Weiyan Shi and Yanzhe Zhang and Diyi Yang},
  year={2024},
  eprint={2410.02678},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.02678},
}
Model Card Contact
held@stanford.edu