Ichigo-llama3s Model
The Ichigo-llama3s model family, developed by Homebrew Research, natively understands both audio and text inputs. It offers improved handling of user interactions, especially multi-turn conversations and inaudible inputs.
Quick Start
Try this model using the Google Colab Notebook.
First, convert the audio file to sound tokens:
```python
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ quantizer checkpoint if it is not already present locally.
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # Load the audio and resample it to 16 kHz if necessary.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    # Quantize the waveform into discrete sound codes.
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()
    # Wrap the codes in the model's sound-token format.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'
```
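As a quick sanity check (not part of the original snippet), you can call the helper on any audio file readable by torchaudio; the file name below is just a placeholder:

```python
# "sample.wav" is a placeholder path; substitute any speech recording.
sound_prompt = audio_to_sound_tokens("sample.wav")
print(sound_prompt[:80])  # e.g. <|sound_start|><|sound_0012|><|sound_0345|>...
```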
Then, you can run inference with the model, just as with any other LLM:
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    # Optional 4-bit / 8-bit quantization via bitsandbytes.
    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
            bnb_8bit_use_double_quant=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)
```
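As a minimal end-to-end sketch (not part of the original snippet), the sound-token string can be placed directly in a user turn; the exact chat format is an assumption and may need to be adapted to the model's chat template:

```python
# Minimal sketch: "question.wav" is a placeholder path, and putting the raw
# sound-token string in a single user turn assumes the tokenizer's default
# chat template handles it as intended.
sound_tokens = audio_to_sound_tokens("question.wav")
messages = [{"role": "user", "content": sound_tokens}]
print(generate_text(pipe, messages, max_new_tokens=128))
```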
Features
- Native Audio and Text Understanding: The model family can directly process both audio and text inputs.
- Enhanced Interaction: Fine-tuned to improve user interaction, especially in multi-turn conversations and when handling inaudible inputs.
Documentation
Model Details
We have developed and released the Ichigo-llama3s family, which natively understands audio and text input.
This version fine-tunes homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2 to improve user interaction, particularly the handling of inaudible inputs and multi-turn conversations.
| Property | Details |
|---|---|
| Model Developers | Homebrew Research |
| Input | Text and sound |
| Output | Text |
| Model Architecture | Llama-3 |
| Language(s) | English |
Intended Use
Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.
Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
Technical Details
Training process
Training Metrics: Below is a snapshot of the training loss curve.

MMLU:
| Model | MMLU Score |
|---|---|
| llama3.1-instruct-8b | 69.40 |
| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
| ichigo-llama3.1-s-base-v0.3 | 42.11 |
| llama3.1-s-instruct-v0.2 | 50.27 |
AudioBench Eval:
| Model Bench | Open-hermes Instruction Audio (GPT-4-O judge 0:5) | Alpaca Instruction Audio (GPT-4-O judge 0:5) |
|---|---|---|
| [Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2) | 3.45 | 3.53 |
| [Ichigo-llama3.1-s v0.3-phase2-cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2) | 3.42 | 3.62 |
| [Ichigo-llama3.1-s v0.3-phase2-cplast](https://huggingface.co/jan-hq/llama3-s-instruct-v0.3-checkpoint-last) | 3.31 | 3.6 |
| [Ichigo-llama3.1-s v0.3-phase3](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-3) | 3.64 | 3.68 |
| [Qwen2-audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B) | 2.63 | 2.24 |
Hardware
GPU Configuration: Cluster of 8x NVIDIA H100-SXM-80GB.
GPU Usage:
- Continual Training: 3 hours.
Training Arguments
We use the torchtune library for its latest FSDP2 training implementation.
| Parameter | Continual Training |
|---|---|
| Epoch | 1 |
| Global batch size | 256 |
| Learning Rate | 1.5e-5 |
| Learning Scheduler | LambdaLR with warmup |
| Optimizer | AdamW Fused |
| Warmup Steps | 8 |
| Weight Decay | 0.005 |
| Max length | 4096 |
| Precision | bf16 |
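For illustration only, the optimizer and scheduler rows above map onto plain PyTorch roughly as follows. This is a sketch, not the actual torchtune recipe; in particular, holding the learning rate constant after warmup is an assumption, since the table only says "LambdaLR with warmup".

```python
import torch

model = torch.nn.Linear(8, 8)  # placeholder module standing in for the LLM

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1.5e-5,                        # Learning Rate from the table
    weight_decay=0.005,               # Weight Decay from the table
    fused=torch.cuda.is_available(),  # "AdamW Fused" (fused kernels need CUDA)
)

warmup_steps = 8  # Warmup Steps from the table

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak learning rate, then hold it constant (assumption).
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```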
More detail
Paper: http://arxiv.org/abs/2410.15316
License
The model is released under the apache-2.0 license.
Citation Information
BibTeX:
```bibtex
@article{Llama3-S: Sound Instruction Language Model 2024,
  title={Llama3-S},
  author={Homebrew Research},
  year=2024,
  month=August,
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
```
Acknowledgement
- WhisperSpeech
- [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)