đ Ichigo-llama3s Family
The Ichigo-llama3s family is a model that natively understands audio and text input, aiming to improve the sound understanding capabilities of LLMs.

đ Quick Start
Try this model using Google Colab Notebook.
đģ Usage Examples
Basic Usage
First, we need to convert the audio file to sound tokens
device = "cuda" if torch.cuda.is_available() else "cpu"
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
hf_hub_download(
repo_id="jan-hq/WhisperVQ",
filename="whisper-vq-stoks-medium-en+pl-fixed.model",
local_dir=".",
)
vq_model = RQBottleneckTransformer.load_model(
"whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)
def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
wav, sr = torchaudio.load(audio_path)
if sr != 16000:
wav = torchaudio.functional.resample(wav, sr, 16000)
with torch.no_grad():
codes = vq_model.encode_audio(wav.to(device))
codes = codes[0].cpu().tolist()
result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
return f'<|sound_start|>{result}<|sound_end|>'
Advanced Usage
Then, we can inference the model the same as any other LLM.
def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
tokenizer = AutoTokenizer.from_pretrained(model_path)
model_kwargs = {"device_map": "auto"}
if use_4bit:
model_kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
)
elif use_8bit:
model_kwargs["quantization_config"] = BitsAndBytesConfig(
load_in_8bit=True,
bnb_8bit_compute_dtype=torch.bfloat16,
bnb_8bit_use_double_quant=True,
)
else:
model_kwargs["torch_dtype"] = torch.bfloat16
model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
return pipeline("text-generation", model=model, tokenizer=tokenizer)
def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
generation_args = {
"max_new_tokens": max_new_tokens,
"return_full_text": False,
"temperature": temperature,
"do_sample": do_sample,
}
output = pipe(messages, **generation_args)
return output[0]['generated_text']
llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)
⨠Features
- The Ichigo-llama3s family natively understands audio and text input.
- Expand the Semantic tokens experiment with WhisperVQ as a tokenizer for audio files.
đĻ Installation
Not provided in the original document.
đ Documentation
Model Details
We have developed and released the family Ichigo-llama3s. This family is natively understanding audio and text input.
We expand the Semantic tokens experiment with WhisperVQ as a tokenizer for audio files from homebrewltd/mini-Ichigo-llama3.2-3B-s-base with nearly 1B tokens from Instruction Speech WhisperVQ v3 dataset.
Property |
Details |
Model developers |
Homebrew Research |
Input |
Text and sound |
Output |
Text |
Model Architecture |
Llama - 3 |
Language(s) |
English |
Intended Use
- Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM on sound understanding capabilities.
- Out - of - scope: The use of llama3 - s in any manner that violates applicable laws or regulations is strictly prohibited.
Training process
-
Training Metrics Image: Below is a snapshot of the training loss curve visualized.

-
MMLU:
| Model | MMLU Score |
| --- | --- |
| llama3.1 - instruct - 8b | 69.40 |
| ichigo - llama3.1 - s - v0.3: phase 3 | 63.79 |
| ichigo - llama3.1 - s - v0.3: phase 2 | 63.08 |
| ichigo - llama3.1 - s - base - v0.3 | 42.11 |
| mini - ichigo - llama3.2 - 3B - s - instruct | 58.60 |
| mini - ichigo - llama3.2 - 3B - s - base | 59.61 |
| llama3.1 - s - instruct - v0.2 | 50.27 |
-
AudioBench Eval:
| Model Bench | Open - hermes Instruction Audio (GPT - 4 - O judge 0:5) | Alpaca Instruction Audio (GPT - 4 - O judge 0:5) |
| --- | --- | --- |
| [Llama3.1 - s - v2](https://huggingface.co/homebrewltd/llama3 - s - instruct - v0.2) | 3.45 | 3.53 |
| [Ichigo - llama3.1 - s v0.3 - phase2 - cp7000](https://huggingface.co/homebrewltd/Ichigo - llama3.1 - s - instruct - v0.3 - phase - 2) | 3.42 | 3.62 |
| [Ichigo - llama3.1 - s v0.3 - phase2 - cplast](https://huggingface.co/jan - hq/llama3 - s - instruct - v0.3 - checkpoint - last) | 3.31 | 3.6 |
| [Ichigo - llama3.1 - s v0.3 - phase3](https://huggingface.co/homebrewltd/Ichigo - llama3.1 - s - instruct - v0.3 - phase - 3) | 3.64 | 3.68 |
| [mini - Ichigo - llama3.2 - 3B - s - instruct](https://huggingface.co/homebrewltd/mini - Ichigo - llama3.2 - 3B - s - instruct) | 2.58 | 2.07 |
| [Qwen2 - audio - 7B](https://huggingface.co/Qwen/Qwen2 - Audio - 7B) | 2.63 | 2.24 |
-
Hardware:
- GPU Configuration: Cluster of 10x NVIDIA A6000 - 48GB.
- GPU Usage:
-
Training Arguments:
We utilize torchtune library for the latest FSDP2 training code implementation.
| Parameter | Instruction Fine - Tuning |
|----------------------------|-------------------------|
| Epoch | 1 |
| Global batch size | 360 |
| Learning Rate | 7e - 5 |
| Learning Scheduler | LambdaLR with warmup |
| Optimizer | Adam torch fused |
| Warmup Ratio | 0.01 |
| Weight Decay | 0.005 |
| Max Sequence Length | 4096 |
Examples
- Good example:
Click to toggle Example 1
```
```
Click to toggle Example 2
```
```
- Misunderstanding example:
Click to toggle Example 3
```
```
- Off - tracked example:
Click to toggle Example 4
```
```
Citation Information
BibTeX:
@article{Llama3-S: Sound Instruction Language Model 2024,
title={Llama3-S},
author={Homebrew Research},
year=2024,
month=August,
url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
Acknowledgement
- WhisperSpeech
- [Meta - Llama - 3.1 - 8B - Instruct](https://huggingface.co/meta - llama/Meta - Llama - 3.1 - 8B - Instruct)
đ License
The model is released under the apache - 2.0 license.