Free and deployable! The open-source model mini-Ichigo-llama3.2-3B-s-instruct supports audio and text input understanding.

Mini Ichigo Llama3.2 3B S Instruct

Developed by Menlo

The Ichigo-llama3s series model is a multimodal language model developed by Homebrew Research, natively supporting audio and text input comprehension. Based on the Llama-3 architecture, it is trained using WhisperVQ as an audio file tokenizer, enhancing its audio understanding capabilities.

Text-to-Audio

Safetensors

EnglishOpen Source License:Apache-2.0 #Multimodal Audio Understanding #WhisperVQ Tokenization #Instruction Fine-tuning Optimization

Downloads 22

Release Time : 10/8/2024

Model Overview

This model is primarily designed for research applications, aiming to improve large language models' ability to understand audio. It supports English language processing and can be used for tasks such as audio-to-text conversion.

Model Features

Multimodal Input Support

Natively supports audio and text input comprehension, capable of handling complex multimodal tasks.

Audio Semantic Tokenization

Uses WhisperVQ as an audio file tokenizer, expanding experiments in audio semantic tokenization.

Research-oriented Design

Primarily aimed at research applications, with a special focus on enhancing large language models' understanding of audio.

Model Capabilities

Audio Understanding

Text Generation

Multimodal Processing

Use Cases

Research Applications

Audio Semantic Understanding Research

Used to study large language models' ability to comprehend audio content.

Achieved a GPT-4-O score of 2.58-3.68 in the AudioBench evaluation

Educational Applications

Voice-assisted Learning

Can serve as a foundational model for voice-assisted learning tools.

🚀 Ichigo-llama3s Family

The Ichigo-llama3s family is a model that natively understands audio and text input, aiming to improve the sound understanding capabilities of LLMs.

🚀 Quick Start

Try this model using Google Colab Notebook.

💻 Usage Examples

Basic Usage

First, we need to convert the audio file to sound tokens

device = "cuda" if torch.cuda.is_available() else "cpu"
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
        "whisper-vq-stoks-medium-en+pl-fixed.model"
    ).to(device)
vq_model.ensure_whisper(device)
def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):

    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()

    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'

Advanced Usage

Then, we can inference the model the same as any other LLM.

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
            bnb_8bit_use_double_quant=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

# Usage
llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)

✨ Features

The Ichigo-llama3s family natively understands audio and text input.
Expand the Semantic tokens experiment with WhisperVQ as a tokenizer for audio files.

📦 Installation

Not provided in the original document.

📚 Documentation

Model Details

We have developed and released the family Ichigo-llama3s. This family is natively understanding audio and text input.

We expand the Semantic tokens experiment with WhisperVQ as a tokenizer for audio files from homebrewltd/mini-Ichigo-llama3.2-3B-s-base with nearly 1B tokens from Instruction Speech WhisperVQ v3 dataset.

Property	Details
Model developers	Homebrew Research
Input	Text and sound
Output	Text
Model Architecture	Llama - 3
Language(s)	English

Intended Use

Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM on sound understanding capabilities.
Out - of - scope: The use of llama3 - s in any manner that violates applicable laws or regulations is strictly prohibited.

Training process

Training Metrics Image: Below is a snapshot of the training loss curve visualized.
MMLU: | Model | MMLU Score | | --- | --- | | llama3.1 - instruct - 8b | 69.40 | | ichigo - llama3.1 - s - v0.3: phase 3 | 63.79 | | ichigo - llama3.1 - s - v0.3: phase 2 | 63.08 | | ichigo - llama3.1 - s - base - v0.3 | 42.11 | | mini - ichigo - llama3.2 - 3B - s - instruct | 58.60 | | mini - ichigo - llama3.2 - 3B - s - base | 59.61 | | llama3.1 - s - instruct - v0.2 | 50.27 |
AudioBench Eval: | Model Bench | Open - hermes Instruction Audio (GPT - 4 - O judge 0:5) | Alpaca Instruction Audio (GPT - 4 - O judge 0:5) | | --- | --- | --- | | [Llama3.1 - s - v2](https://huggingface.co/homebrewltd/llama3 - s - instruct - v0.2) | 3.45 | 3.53 | | [Ichigo - llama3.1 - s v0.3 - phase2 - cp7000](https://huggingface.co/homebrewltd/Ichigo - llama3.1 - s - instruct - v0.3 - phase - 2) | 3.42 | 3.62 | | [Ichigo - llama3.1 - s v0.3 - phase2 - cplast](https://huggingface.co/jan - hq/llama3 - s - instruct - v0.3 - checkpoint - last) | 3.31 | 3.6 | | [Ichigo - llama3.1 - s v0.3 - phase3](https://huggingface.co/homebrewltd/Ichigo - llama3.1 - s - instruct - v0.3 - phase - 3) | 3.64 | 3.68 | | [mini - Ichigo - llama3.2 - 3B - s - instruct](https://huggingface.co/homebrewltd/mini - Ichigo - llama3.2 - 3B - s - instruct) | 2.58 | 2.07 | | [Qwen2 - audio - 7B](https://huggingface.co/Qwen/Qwen2 - Audio - 7B) | 2.63 | 2.24 |
Hardware:
- GPU Configuration: Cluster of 10x NVIDIA A6000 - 48GB.
- GPU Usage:
  - Fine - tuning: 12 hours.
Training Arguments: We utilize torchtune library for the latest FSDP2 training code implementation. | Parameter | Instruction Fine - Tuning | |----------------------------|-------------------------| | Epoch | 1 | | Global batch size | 360 | | Learning Rate | 7e - 5 | | Learning Scheduler | LambdaLR with warmup | | Optimizer | Adam torch fused | | Warmup Ratio | 0.01 | | Weight Decay | 0.005 | | Max Sequence Length | 4096 |

Examples

Good example:

Click to toggle Example 1

``` ```

Click to toggle Example 2

``` ```

Misunderstanding example:

Click to toggle Example 3

``` ```

Off - tracked example:

Click to toggle Example 4

``` ```

Citation Information

BibTeX:

@article{Llama3-S: Sound Instruction Language Model 2024,
  title={Llama3-S},
  author={Homebrew Research},
  year=2024,
  month=August,
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}

Acknowledgement

WhisperSpeech
[Meta - Llama - 3.1 - 8B - Instruct](https://huggingface.co/meta - llama/Meta - Llama - 3.1 - 8B - Instruct)

📄 License

The model is released under the apache - 2.0 license.

Featured Recommended AI Models

Empowering the Future, Your AI Solution Knowledge Base

English 简体中文繁體中文にほんご