Ichigo-llama3s Family Model
Ichigo-llama3s is a family of models that natively understand audio and text input, aiming to improve the sound understanding capabilities of LLMs.
Quick Start
Try this model using the Google Colab Notebook.
First, convert the audio file to sound tokens:
```python
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ quantizer checkpoint if it is not already present.
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # Load the audio and resample to the 16 kHz rate the quantizer expects.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    # Encode the waveform into discrete WhisperVQ codes.
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()
    # Wrap the codes in the special sound tokens the model was trained on.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'
```
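For example, assuming a local recording saved as `audio.wav` (a hypothetical path used only for illustration), the helper returns a single prompt-ready string wrapped in the sound boundary tokens:

```python
# "audio.wav" is a hypothetical input file; replace it with your own recording.
sound_tokens = audio_to_sound_tokens("audio.wav")
# The result looks like '<|sound_start|><|sound_0012|><|sound_0345|>...<|sound_end|>'
print(sound_tokens[:80])
```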
Then, run inference with the model just as you would with any other LLM:
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_kwargs = {"device_map": "auto"}
    if use_4bit:
        # 4-bit NF4 quantization with bfloat16 compute and double quantization.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        # 8-bit quantization; the bnb_4bit_* options above do not apply here.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16
    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)
```
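Putting the two pieces together, below is a minimal end-to-end sketch. It assumes the snippets above have been run, reuses the hypothetical `audio.wav` file, and passes the sound-token string as the user turn of a chat message (this prompt layout is an assumption for illustration, not a format mandated by this card):

```python
# Minimal end-to-end sketch: sound tokens in, text out.
sound_tokens = audio_to_sound_tokens("audio.wav")  # hypothetical input file
messages = [{"role": "user", "content": sound_tokens}]  # assumed chat layout
response = generate_text(pipe, messages, max_new_tokens=64)
print(response)
```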
Features
- The Ichigo-llama3s family natively understands audio and text input.
- The model is a supervised fine-tuned (SFT) version, trained on over 1 billion tokens, adding multi-turn speech conversations and noise rejection capabilities.
- It demonstrates improved robustness against noisy environmental inputs and enhanced multi-turn conversation capabilities.
Documentation
Model Details
We have developed and released the Ichigo-llama3s family of models, which natively understand audio and text input.
This model is a supervised fine-tuned (SFT) version of homebrewltd/Ichigo-llama3.1-s-base-v0.3, trained on over 1 billion tokens from the Instruction Speech WhisperVQ v4 dataset, which builds upon Instruction Speech WhisperVQ v3 by adding multi-turn speech conversations and noise rejection capabilities. As a result, the model demonstrates improved robustness against noisy environmental inputs and enhanced multi-turn conversation capabilities, making it more reliable in real-world applications.
| Property | Details |
|---|---|
| Model developers | Homebrew Research |
| Input | Text and sound |
| Output | Text |
| Model Architecture | Llama-3 |
| Language(s) | English |
Intended Use
Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.
Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
Technical Details
Training process
Training Metrics: Below is a snapshot of the training loss curve.

MMLU:
| Model | MMLU Score |
|---|---|
| llama3.1-instruct-8b | 69.40 |
| ichigo-llama3.1-s-v0.4 | 64.66 |
| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
| ichigo-llama3.1-s-base-v0.3 | 42.11 |
| llama3.1-s-instruct-v0.2 | 50.27 |
AudioBench Eval:
| Model Bench | Open-hermes Instruction Audio (GPT-4-O judge 0:5) | Alpaca Instruction Audio (GPT-4-O judge 0:5) |
|---|---|---|
| [Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2) | 3.45 | 3.53 |
| [Ichigo-llama3.1-s v0.4](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.4) | 3.5 | 3.52 |
| [Ichigo-llama3.1-s v0.3-phase2-cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2) | 3.42 | 3.62 |
| [Ichigo-llama3.1-s v0.3-phase2-cplast](https://huggingface.co/jan-hq/llama3-s-instruct-v0.3-checkpoint-last) | 3.31 | 3.6 |
| [Ichigo-llama3.1-s v0.3-phase3](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-3) | 3.64 | 3.68 |
| [Qwen2-audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B) | 2.63 | 2.24 |
Hardware
GPU Configuration: Cluster of 8x NVIDIA H100-SXM-80GB.
GPU Usage:
- Continual Training: 12 hours.
Training Arguments
We utilize the torchtune library for its FSDP2 training implementation; an illustrative sketch of how the hyperparameters below map to a plain PyTorch optimizer and scheduler follows the table.
| Parameter | Instruction Fine-Tuning |
|---|---|
| Epoch | 1 |
| Global batch size | 256 |
| Learning Rate | 7e-5 |
| Learning Scheduler | Cosine with warmup |
| Optimizer | Adam torch fused |
| Warmup Ratio | 0.01 |
| Weight Decay | 0.005 |
| Max Sequence Length | 4096 |
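The sketch below shows one way the table's settings could be expressed with plain `torch.optim` and a hand-rolled cosine-with-warmup schedule. It is illustrative only and not the actual torchtune FSDP2 recipe; `model` and `total_steps` are hypothetical placeholders.

```python
import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(10, 10).to(device)  # hypothetical stand-in for the LLM
total_steps = 1_000                         # hypothetical; depends on dataset and batch size
warmup_steps = int(0.01 * total_steps)      # warmup ratio 0.01

# "Adam torch fused" with lr 7e-5 and weight decay 0.005; fused kernels require CUDA params.
optimizer = torch.optim.Adam(
    model.parameters(), lr=7e-5, weight_decay=0.005, fused=(device == "cuda")
)

def cosine_with_warmup(step):
    # Linear warmup, then cosine decay to zero over the remaining steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
```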
Examples
- Good example: Example 1, Example 2
- Misunderstanding example: Example 3
- Off-tracked example: Example 4
License
The model is released under the Apache-2.0 license.
Citation Information
BibTeX:
```bibtex
@article{homebrewresearch2024llama3s,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
```
Acknowledgement
- WhisperSpeech
- [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)