Ichigo-llama3s Family Model
Ichigo-llama3s is a family of models that natively understand audio and text input, aiming to improve the sound understanding capabilities of LLMs.
Quick Start
Try this model using the Google Colab Notebook.
First, convert the audio file to sound tokens:
```python
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ quantizer checkpoint if it is not already present.
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # Load the audio and resample to the 16 kHz rate the quantizer expects.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    # Encode the waveform into discrete WhisperVQ codes.
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()
    # Wrap the codes in the special sound tokens the model was trained on.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'
```
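For example, assuming a local recording saved as `audio.wav` (a hypothetical path used only for illustration), the helper returns a single prompt-ready string wrapped in the sound boundary tokens:

```python
# "audio.wav" is a hypothetical input file; replace it with your own recording.
sound_tokens = audio_to_sound_tokens("audio.wav")
# The result looks like '<|sound_start|><|sound_0012|><|sound_0345|>...<|sound_end|>'
print(sound_tokens[:80])
```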
Then, run inference with the model just as you would with any other LLM:
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_kwargs = {"device_map": "auto"}
    if use_4bit:
        # 4-bit NF4 quantization with bfloat16 compute and double quantization.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        # 8-bit quantization; the bnb_4bit_* options above do not apply here.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16
    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)
```
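Putting the two pieces together, below is a minimal end-to-end sketch. It assumes the snippets above have been run, reuses the hypothetical `audio.wav` file, and passes the sound-token string as the user turn of a chat message (this prompt layout is an assumption for illustration, not a format mandated by this card):

```python
# Minimal end-to-end sketch: sound tokens in, text out.
sound_tokens = audio_to_sound_tokens("audio.wav")  # hypothetical input file
messages = [{"role": "user", "content": sound_tokens}]  # assumed chat layout
response = generate_text(pipe, messages, max_new_tokens=64)
print(response)
```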
Features
- The Ichigo-llama3s family natively understands audio and text input.
- The model is a supervised fine-tuned (SFT) version, trained on over 1 billion tokens, adding multi-turn speech conversations and noise rejection capabilities.
- It demonstrates improved robustness against noisy environmental inputs and enhanced multi-turn conversation capabilities.
Documentation
Model Details
We have developed and released the Ichigo-llama3s family of models, which natively understand audio and text input.
This model is a supervised fine-tuned (SFT) version of homebrewltd/Ichigo-llama3.1-s-base-v0.3, trained on over 1 billion tokens from the Instruction Speech WhisperVQ v4 dataset, which builds upon Instruction Speech WhisperVQ v3 by adding multi-turn speech conversations and noise rejection capabilities. As a result, the model demonstrates improved robustness against noisy environmental inputs and enhanced multi-turn conversation capabilities, making it more reliable in real-world applications.
| Property | Details |
|---|---|
| Model developers | Homebrew Research |
| Input | Text and sound |
| Output | Text |
| Model Architecture | Llama-3 |
| Language(s) | English |
Intended Use
Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.
Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
Technical Details
Training process
Training Metrics: Below is a snapshot of the training loss curve.

MMLU:
| Model | MMLU Score |
|---|---|
| llama3.1-instruct-8b | 69.40 |
| ichigo-llama3.1-s-v0.4 | 64.66 |
| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
| ichigo-llama3.1-s-base-v0.3 | 42.11 |
| llama3.1-s-instruct-v0.2 | 50.27 |
AudioBench Eval:
| Model Bench | Open-hermes Instruction Audio (GPT-4-O judge 0:5) | Alpaca Instruction Audio (GPT-4-O judge 0:5) |
|---|---|---|
| [Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2) | 3.45 | 3.53 |
| [Ichigo-llama3.1-s v0.4](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.4) | 3.5 | 3.52 |
| [Ichigo-llama3.1-s v0.3-phase2-cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2) | 3.42 | 3.62 |
| [Ichigo-llama3.1-s v0.3-phase2-cplast](https://huggingface.co/jan-hq/llama3-s-instruct-v0.3-checkpoint-last) | 3.31 | 3.6 |
| [Ichigo-llama3.1-s v0.3-phase3](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-3) | 3.64 | 3.68 |
| [Qwen2-audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B) | 2.63 | 2.24 |
Hardware
GPU Configuration: Cluster of 8x NVIDIA H100-SXM-80GB.
GPU Usage:
- Continual Training: 12 hours.
Training Arguments
We utilize the torchtune library for its FSDP2 training implementation; an illustrative sketch of how the hyperparameters below map to a plain PyTorch optimizer and scheduler follows the table.
| Parameter | Instruction Fine-Tuning |
|---|---|
| Epoch | 1 |
| Global batch size | 256 |
| Learning Rate | 7e-5 |
| Learning Scheduler | Cosine with warmup |
| Optimizer | Adam torch fused |
| Warmup Ratio | 0.01 |
| Weight Decay | 0.005 |
| Max Sequence Length | 4096 |
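The sketch below shows one way the table's settings could be expressed with plain `torch.optim` and a hand-rolled cosine-with-warmup schedule. It is illustrative only and not the actual torchtune FSDP2 recipe; `model` and `total_steps` are hypothetical placeholders.

```python
import math
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(10, 10).to(device)  # hypothetical stand-in for the LLM
total_steps = 1_000                         # hypothetical; depends on dataset and batch size
warmup_steps = int(0.01 * total_steps)      # warmup ratio 0.01

# "Adam torch fused" with lr 7e-5 and weight decay 0.005; fused kernels require CUDA params.
optimizer = torch.optim.Adam(
    model.parameters(), lr=7e-5, weight_decay=0.005, fused=(device == "cuda")
)

def cosine_with_warmup(step):
    # Linear warmup, then cosine decay to zero over the remaining steps.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, cosine_with_warmup)
```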
Examples
- Good example: Example 1, Example 2
- Misunderstanding example: Example 3
- Off-tracked example: Example 4
License
The model is released under the Apache-2.0 license.
Citation Information
BibTeX:
```bibtex
@article{homebrewresearch2024llama3s,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
```
Acknowledgement
- WhisperSpeech
- [Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)