Ichigo Llama3.1 S Instruct V0.4

Developed by Menlo
A multimodal language model based on the Llama-3 architecture that understands both audio and text input, with improved robustness to noisy audio and stronger multi-turn conversation capabilities.
Downloads: 44
Release date: 11/8/2024

Model Overview

This model is part of the Ichigo-llama3s series developed by Homebrew Research. It features improved audio understanding through supervised fine-tuning and is intended for research applications.

Model Features

Multimodal input support
Natively accepts both audio and text input
Robustness to noise
Maintains comprehension when the audio input contains significant background noise
Enhanced multi-turn conversation
Improved multi-turn conversation ability through augmented training data

Model Capabilities

Audio understanding
Text generation
Multi-turn conversation
Speech processing in noisy environments
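The Ichigo-llama3s series handles audio by quantizing it into discrete sound tokens that are interleaved with ordinary text tokens. The sketch below illustrates that idea only; the token names (`<|sound_start|>`, `<|sound_NNNN|>`, `<|sound_end|>`) and formatting are assumptions for illustration, not the model's documented template.

```python
# Hypothetical sketch of a sound-token prompt for an Ichigo-style
# multimodal model. Token names are assumptions, not the real template.

def build_prompt(sound_token_ids, question):
    """Wrap quantized audio token IDs in sound-boundary markers and
    append the user's text question as one multimodal turn."""
    audio = "".join(f"<|sound_{i:04d}|>" for i in sound_token_ids)
    return f"<|sound_start|>{audio}<|sound_end|> {question}"

prompt = build_prompt([12, 345], "What did the speaker say?")
print(prompt)
# → <|sound_start|><|sound_0012|><|sound_0345|><|sound_end|> What did the speaker say?
```

In a real pipeline, the sound-token IDs would come from an audio quantizer (such as a WhisperVQ-style codec) rather than being hand-written.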

Use Cases

Voice interaction research
Speech understanding in noisy environments
Accurately comprehends voice commands despite significant background noise
Roughly 10% higher recognition accuracy than previous versions
Multi-turn voice conversation systems
Builds voice conversation systems that maintain conversational context across turns
Scored 64.66 on the MMLU benchmark
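A multi-turn conversation system works by resending the accumulated dialogue history to the model on each round, so references like "them" can be resolved against earlier turns. A minimal sketch of that bookkeeping, using a generic role/content message format as an assumption (not the model's actual chat template):

```python
# Minimal sketch of multi-turn context tracking; the message schema
# is a generic assumption, not Ichigo's documented chat template.

def add_turn(history, role, content):
    """Append one turn; the full history is resent to the model each
    round so it can resolve references to earlier turns."""
    return history + [{"role": role, "content": content}]

history = add_turn([], "user", "Turn the lights on.")
history = add_turn(history, "assistant", "Done. Anything else?")
history = add_turn(history, "user", "Now dim them.")  # "them" needs prior context
print(len(history))  # → 3
```

In practice the user turns would carry sound tokens from the audio front end rather than plain text.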