# Ichigo-llama3s Family: Sound and Text Understanding Model
The Ichigo-llama3s family of models natively understands both audio and text input, opening new possibilities for research applications.
## Quick Start
Try this model in a Google Colab notebook.
First, we need to convert the audio file to sound tokens:
```python
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
# RQBottleneckTransformer is provided by the WhisperSpeech package.
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ quantizer checkpoint if it is not already present.
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # Load the audio and resample to the 16 kHz rate WhisperVQ expects.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    # Encode the waveform into discrete semantic codes.
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()
    # Serialize the codes as the special sound tokens the model was trained on.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'
```
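For example (using a hypothetical local file `sample.wav`), the helper returns the serialized token string that is later placed in the prompt:

```python
# "sample.wav" is a hypothetical input file; any speech recording readable by torchaudio works.
sound_tokens = audio_to_sound_tokens("sample.wav")
print(sound_tokens[:72])  # e.g. '<|sound_start|><|sound_0012|><|sound_0345|>...'
```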
Then, we can run inference on the model just like any other LLM.
```python
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_kwargs = {"device_map": "auto"}
    if use_4bit:
        # 4-bit NF4 quantization with double quantization to reduce memory use.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        # 8-bit quantization as a lighter alternative.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
            bnb_8bit_use_double_quant=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16
    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)
```
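As a minimal sketch of how the pieces fit together (the message layout below is an assumption; adjust it to the prompt format the checkpoint was trained with), the sound-token string from the previous step is passed as the user turn of a chat-style prompt:

```python
# Sketch: feed the serialized sound tokens to the pipeline as a chat message.
messages = [
    {"role": "user", "content": sound_tokens},
]
response = generate_text(pipe, messages, max_new_tokens=64)
print(response)
```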
## Features
- Natively understands audio and text input.
- Extends the Semantic tokens experiment, using WhisperVQ as a tokenizer for audio files.
## Documentation
### Model Details
We have developed and released the Ichigo-llama3s family of models. This family natively understands audio and text input.

We expand the Semantic tokens experiment, using WhisperVQ as a tokenizer for audio files, continuing from homebrewltd/Ichigo-llama3.1-s-base-v0.3 with nearly 1B tokens from the Instruction Speech WhisperVQ v3 dataset.
This is the model checkpoint from step 7000. Due to some noise in the training data, it has an artificially higher score on the Speech Instruction benchmark.
**Model developers:** Homebrew Research.

**Input:** Text and sound.

**Output:** Text.

**Model Architecture:** Llama-3.

**Language(s):** English.
### Intended Use
**Intended Use Cases:** This family is primarily intended for research applications. This version aims to further improve the LLM's sound-understanding capabilities.

**Out-of-scope:** Use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
## Technical Details
### Training process
**Training Metrics Image:** Below is a snapshot of the training loss curve.

**MMLU:**

| Model | MMLU Score |
|-------|------------|
| llama3.5-instruct-8b | 69.40 |
| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
| ichigo-llama3.1-s-base-v0.3 | 42.11 |
| llama3.5-instruct-v0.2 | 50.27 |
**AudioBench Eval:**

| Model Bench | Open-hermes Instruction Audio (GPT-4-O judge 0:5) | Alpaca Instruction Audio (GPT-4-O judge 0:5) |
|-------------|----------------------------------------------------|----------------------------------------------|
| [Llama3.1-s-v2](https://huggingface.co/homebrewltd/llama3-s-instruct-v0.2) | 3.45 | 3.53 |
| [Ichigo-llama3.1-s v0.3-phase2-cp7000](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2) | 3.42 | 3.62 |
| [Ichigo-llama3.1-s v0.3-phase2-cplast](https://huggingface.co/jan-hq/llama3-s-instruct-v0.3-checkpoint-last) | 3.31 | 3.6 |
| [Ichigo-llama3.1-s v0.3-phase3](https://huggingface.co/homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-3) | 3.64 | 3.68 |
| [Qwen2-audio-7B](https://huggingface.co/Qwen/Qwen2-Audio-7B) | 2.63 | 2.24 |
### Hardware

**GPU Configuration:** Cluster of 8x NVIDIA H100-SXM-80GB.

**GPU Usage:**
- **Continual Training:** 12 hours.
### Training Arguments
We utilize the torchtune library for its latest FSDP2 training implementation.
| Parameter | Instruction Fine-Tuning |
|-----------|-------------------------|
| Epoch | 1 |
| Global batch size | 256 |
| Learning Rate | 7e-5 |
| Learning Scheduler | Cosine with warmup |
| Optimizer | Adam torch fused |
| Warmup Ratio | 0.01 |
| Weight Decay | 0.005 |
| Max Sequence Length | 4096 |
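For illustration only (the actual run uses torchtune recipes, and the model and step count below are placeholders), the optimizer and scheduler settings in the table roughly correspond to the following PyTorch setup:

```python
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(8, 8)   # placeholder for the fine-tuned LLM
total_steps = 1_000             # placeholder; depends on dataset size and global batch size

optimizer = torch.optim.Adam(
    model.parameters(),
    lr=7e-5,                              # Learning Rate
    weight_decay=0.005,                   # Weight Decay
    fused=torch.cuda.is_available(),      # "Adam torch fused" (fused kernel needs CUDA)
)
scheduler = get_cosine_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.01 * total_steps),  # Warmup Ratio
    num_training_steps=total_steps,            # cosine decay over the remaining steps
)
```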
## Usage Examples
### Good example

<details>
<summary>Click to toggle Example 1</summary>
</details>
<details>
<summary>Click to toggle Example 2</summary>
</details>
### Misunderstanding example
<details>
<summary>Click to toggle Example 3</summary>
</details>
### Off-tracked example
<details>
<summary>Click to toggle Example 4</summary>
</details>
## License
The model is released under the Apache 2.0 license.
## Citation Information
**BibTeX**:

```bibtex
@article{llama3s2024,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
```
## Acknowledgement
- **[WhisperSpeech](https://github.com/collabora/WhisperSpeech)**
- **[Meta-Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct)**