🍓 Ichigo-llama3s Family Model
The Ichigo-llama3s family of models natively understands audio and text input, expanding the semantic-token experiment by using WhisperVQ to tokenize audio files.

📚 Documentation
✨ Features
- The Ichigo-llama3s family natively understands both audio and text input.
- It expands the semantic-token experiment, using WhisperVQ as a tokenizer for audio files and nearly 1B tokens from the Instruction Speech WhisperVQ v3 dataset.
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
Basic Usage
First, convert the audio file to sound tokens:
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ quantizer checkpoint if it is not already present locally.
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # Load the audio and resample to the 16 kHz rate Whisper expects.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    # Quantize the waveform into discrete WhisperVQ codes.
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()
    # Wrap the codes in the special sound tokens the model was trained on.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'
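As a quick check, the helper can be applied to a local recording; the file path below is a hypothetical example, and the actual token IDs depend on the audio clip:

# "question.wav" is a placeholder path; substitute any local audio file.
sound_tokens = audio_to_sound_tokens("question.wav")
print(sound_tokens[:80])  # e.g. <|sound_start|><|sound_0123|>... (IDs vary per clip)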
Then, run inference with the model:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_kwargs = {"device_map": "auto"}
    if use_4bit:
        # 4-bit NF4 quantization to reduce GPU memory usage.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        # 8-bit quantization as a lighter-weight alternative.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
            bnb_8bit_use_double_quant=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16
    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)
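Finally, the sound tokens can be passed to the pipeline as an ordinary user message. The single-turn chat structure below is an assumption based on the helper signatures above, not an excerpt from the original document:

# sound_tokens comes from audio_to_sound_tokens() above.
messages = [
    {"role": "user", "content": sound_tokens},
]
print(generate_text(pipe, messages, max_new_tokens=64))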
🔧 Technical Details
Model Details
- Model developers: Homebrew Research.
- Input: Text and sound.
- Output: Text.
- Model Architecture: Llama-3.
- Language(s): English.
Intended Use
- Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM on sound understanding capabilities.
- Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
Training process
Hardware
- GPU Configuration: Cluster of 10x NVIDIA A6000-48GB.
- GPU Usage:
Training Arguments
We utilize the torchtune library for the latest FSDP2 training code implementation.
| Parameter           | Instruction Fine-tuning |
|---------------------|-------------------------|
| Epoch               | 1                       |
| Global batch size   | 360                     |
| Learning Rate       | 7e-5                    |
| Learning Scheduler  | LambdaLR with warmup    |
| Optimizer           | Adam torch fused        |
| Warmup Ratio        | 0.01                    |
| Weight Decay        | 0.005                   |
| Max Sequence Length | 4096                    |
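The exact optimizer and scheduler setup lives in the torchtune configuration and is not reproduced in this document; the snippet below is only an illustrative PyTorch sketch of an Adam (torch fused) optimizer with a LambdaLR linear warmup over the first 1% of steps, where the model and total step count are placeholders:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)  # placeholder module standing in for the LLM
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=7e-5,                     # Learning Rate from the table above
    weight_decay=0.005,          # Weight Decay from the table above
    fused=(device == "cuda"),    # "Adam torch fused" requires CUDA tensors
)

total_steps = 10_000                            # placeholder, not taken from the source
warmup_steps = max(1, int(0.01 * total_steps))  # Warmup Ratio = 0.01

def warmup_lambda(step):
    # Ramp the learning rate linearly to its base value, then hold it constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# Call scheduler.step() once per optimizer step during training.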
📄 License
The license for this project is Apache-2.0.
📚 Citation Information
BibTeX:
@article{Llama3-S: Sound Instruction Language Model 2024,
  title={Llama3-S},
  author={Homebrew Research},
  year=2024,
  month=August,
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
🙏 Acknowledgement