Ichigo-llama3.1-s-instruct-v0.3-phase-3 Open-source Model - Handle Ambiguous Inputs, Support Multi-turn Conversations, and Accept Audio and Text Inputs

Ichigo Llama3.1 S Instruct V0.3 Phase 3

Developed by Menlo

One of the Ichigo-llama3s series models. It focuses on handling ambiguous inputs and multi-turn dialogues, and supports both audio and text inputs.

Audio-Text-to-Text

Safetensors

English

Open Source License: Apache-2.0

#Speech-Text Dual Modality #Fuzzy Instruction Optimization #Multi-turn Dialogue Enhancement

Downloads 20

Release Time : 9/25/2024

Model Overview

This model is a large language model based on the Llama-3 architecture, specifically optimized for speech understanding and multi-turn dialogues, supporting English speech and text inputs, with text output.

Model Features

Multimodal Input Support

Natively supports audio and text inputs, capable of handling mixed speech and text inputs.

Optimized Speech Understanding

Specially optimized for speech understanding, better at handling ambiguous speech inputs.

Multi-turn Dialogue Capability

Enhanced ability to handle multi-turn dialogues, suitable for complex conversational scenarios.

Model Capabilities

Speech-to-Text

Text Generation

Multi-turn Dialogue Processing

Use Cases

Voice Assistants

Smart Voice Assistant

Used to build intelligent assistants capable of understanding voice commands and generating responses.

Scored 3.64 in the Open-hermes Instruction Audio test (GPT-4-O judge, scale 0:5).

Speech Transcription

Meeting Minutes Transcription

Converts meeting recordings into text transcripts, supporting subsequent text analysis and processing.

🚀 Ichigo-llama3s Family

The Ichigo-llama3s family is a series of models that natively understand both audio and text input, so users can interact with a language model by speaking, by typing, or both.

🚀 Quick Start

Try this model using Google Colab Notebook.

First, convert the audio file to sound tokens:

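import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
# Import path assumed from the WhisperSpeech package
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ quantizer checkpoint if it is not already present
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # target_bandwidth is not used below; it is kept so existing callers still work
    wav, sr = torchaudio.load(audio_path)
    # The quantizer expects 16 kHz audio
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()

    # Render each codebook index as a zero-padded sound token
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'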
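The last two lines of the function above are what define the text format the model sees. As a minimal sketch of just that formatting step, with no audio or model dependencies, the snippet below uses a hypothetical helper name (`format_sound_tokens`) and made-up code values; only the token pattern itself is taken from the code above.

```python
def format_sound_tokens(codes):
    """Render integer codebook indices as a sound-token string.

    Mirrors the formatting at the end of audio_to_sound_tokens:
    each index becomes <|sound_NNNN|>, zero-padded to four digits,
    and the sequence is wrapped in start/end markers.
    """
    body = "".join(f"<|sound_{code:04d}|>" for code in codes)
    return f"<|sound_start|>{body}<|sound_end|>"


# Illustrative indices only; real values come from vq_model.encode_audio
example = format_sound_tokens([7, 42, 511])
print(example)
# <|sound_start|><|sound_0007|><|sound_0042|><|sound_0511|><|sound_end|>
```

Because the result is an ordinary string, it can be placed in a chat message like any other text.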

Then run inference on the model as you would with any other LLM:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, pipeline

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    if use_4bit:
        # 4-bit NF4 quantization with bf16 compute and double quantization
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        # 8-bit loading has no compute-dtype or double-quant options
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
        )
    else:
        # No quantization: load the weights in bf16
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "do_sample": do_sample,
    }
    # temperature only applies when sampling; greedy decoding ignores it
    if do_sample:
        generation_args["temperature"] = temperature

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

# Usage
llm_path = "homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-3"
pipe = setup_pipeline(llm_path, use_8bit=True)
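Since this release targets multi-turn conversations, it may help to see how a conversation could be assembled before it is passed to `generate_text`. The sketch below assumes the standard role/content chat message list accepted by the `transformers` text-generation pipeline and assumes the sound-token string is placed directly in the user turn. The helper `add_turn`, the placeholder reply, and the sample token string are hypothetical and exist only for illustration.

```python
def add_turn(history, role, content):
    """Return a new message list with one more chat turn appended."""
    if role not in ("system", "user", "assistant"):
        raise ValueError(f"unexpected role: {role!r}")
    return history + [{"role": role, "content": content}]


# Placeholder standing in for the output of audio_to_sound_tokens(...)
sound_tokens = "<|sound_start|><|sound_0007|><|sound_0042|><|sound_end|>"

# Turn 1: the user speaks
messages = add_turn([], "user", sound_tokens)

# With the pipeline loaded, the reply would come from:
#   reply = generate_text(pipe, messages)
reply = "(model reply goes here)"  # placeholder so the sketch runs without a GPU
messages = add_turn(messages, "assistant", reply)

# Turn 2: the user follows up in text; audio and text turns can be mixed
messages = add_turn(messages, "user", "Can you say that more briefly?")

for message in messages:
    print(message["role"], "->", message["content"][:40])
```

Keeping the full history in `messages` and passing it on every call is what lets the model condition on earlier turns.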

✨ Features

Multimodal Input: Natively understand both audio and text input, enhancing user interaction.
Fine-tuned Performance: Focused on fine-tuning to handle inaudible inputs and multi-turn conversations better.

📦 Model Information

Property	Details
Datasets	homebrewltd/instruction-speech-whispervq-v2
Model Type	Llama-3
Input	Text and sound
Output	Text
Language	English
License	apache-2.0
Pipeline Tag	audio-text-to-text
Tags	sound language model
Model developers	Homebrew Research

📚 Documentation

Intended Use

Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.

Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

Training Process

Training Metrics Image

Below is a snapshot of the training loss curve.

[Figure: training loss curve]

MMLU

Model	MMLU Score
llama3.1-instruct-8b	69.40
ichigo-llama3.1-s-v0.3: phase 3	63.79
ichigo-llama3.1-s-v0.3: phase 2	63.08
ichigo-llama3.1-s-base-v0.3	42.11
llama3.1-s-instruct-v0.2	50.27

AudioBench Eval

Model Bench	Open-hermes Instruction Audio (GPT-4-O judge 0:5)	Alpaca Instruction Audio (GPT-4-O judge 0:5)
[Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2)	3.45	3.53
[Ichigo-llama3.1-s v0.3-phase2-cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2)	3.42	3.62
[Ichigo-llama3.1-s v0.3-phase2-cplast](https://huggingface.co/jan-hq/llama3-s-instruct-v0.3-checkpoint-last)	3.31	3.6
[Ichigo-llama3.1-s v0.3-phase3](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-3)	3.64	3.68
[Qwen2-audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B)	2.63	2.24

Hardware

GPU Configuration: Cluster of 8x NVIDIA H100-SXM-80GB.

GPU Usage:

Continual Training: 3 hours.

Training Arguments

We use the torchtune library for its latest FSDP2 training code implementation.

Parameter	Continual Training
Epoch	1
Global batch size	256
Learning Rate	1.5e-5
Learning Scheduler	LambdaLR with warmup
Optimizer	AdamW Fused
Warmup Steps	8
Weight Decay	0.005
Max length	4096
Precision	bf16
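To make the "LambdaLR with warmup" row concrete, here is a minimal sketch of a learning-rate multiplier using the values from the table (peak learning rate 1.5e-5, 8 warmup steps). The table does not say what happens after warmup, so holding the rate constant is an assumption made here for simplicity; the actual run may have used a decay such as cosine. The function name `warmup_multiplier` is hypothetical.

```python
BASE_LR = 1.5e-5      # "Learning Rate" from the table
WARMUP_STEPS = 8      # "Warmup Steps" from the table


def warmup_multiplier(step, warmup_steps=WARMUP_STEPS):
    """Multiplier applied to the base learning rate at a 0-indexed optimizer step.

    Linear ramp during warmup, then constant (the post-warmup shape
    is an assumption; the source only states "LambdaLR with warmup").
    """
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    return 1.0


for step in (0, 3, 7, 8, 100):
    print(f"step {step:>3}: lr = {BASE_LR * warmup_multiplier(step):.3e}")

# In PyTorch, a function like this can be passed as lr_lambda:
#   scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_multiplier)
```

With only 8 warmup steps, the learning rate reaches its peak almost immediately, which is common for short continual-training runs that start from an already trained checkpoint.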

More Detail

Paper: http://arxiv.org/abs/2410.15316

🔧 Technical Details

The model is based on the Llama-3 architecture and is fine-tuned from [homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2). The fine-tuning process focuses on improving user interaction, especially in handling inaudible inputs and multi-turn conversations.

📄 License

The model is released under the Apache-2.0 license.

📖 Citation Information

BibTeX:

@article{Llama3S2024,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}

🙏 Acknowledgement

WhisperSpeech
[Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)
