🍓 Ichigo-llama3s Family Model
The Ichigo-llama3s family of models natively understands audio and text input, expanding the semantic-token experiment by using WhisperVQ to tokenize audio files.

📚 Documentation
✨ Features
- The Ichigo-llama3s family natively understands both audio and text input.
- It expands the semantic-token experiment, using WhisperVQ as a tokenizer for audio files and nearly 1B tokens from the Instruction Speech WhisperVQ v3 dataset.
📦 Installation
No specific installation steps are provided in the original document.
💻 Usage Examples
Basic Usage
First, convert the audio file to sound tokens:
import os

import torch
import torchaudio
from huggingface_hub import hf_hub_download
from whisperspeech.vq_stoks import RQBottleneckTransformer

device = "cuda" if torch.cuda.is_available() else "cpu"

# Download the WhisperVQ quantizer checkpoint if it is not already present locally.
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
    "whisper-vq-stoks-medium-en+pl-fixed.model"
).to(device)
vq_model.ensure_whisper(device)

def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):
    # Load the audio and resample to the 16 kHz rate Whisper expects.
    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    # Quantize the waveform into discrete WhisperVQ codes.
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()
    # Wrap the codes in the special sound tokens the model was trained on.
    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'
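As a quick check, the helper can be applied to a local recording; the file path below is a hypothetical example, and the actual token IDs depend on the audio clip:

# "question.wav" is a placeholder path; substitute any local audio file.
sound_tokens = audio_to_sound_tokens("question.wav")
print(sound_tokens[:80])  # e.g. <|sound_start|><|sound_0123|>... (IDs vary per clip)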
Then, run inference with the model:
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    pipeline,
)

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model_kwargs = {"device_map": "auto"}
    if use_4bit:
        # 4-bit NF4 quantization to reduce GPU memory usage.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        # 8-bit quantization as a lighter-weight alternative.
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
            bnb_8bit_use_double_quant=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16
    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)
    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }
    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)
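Finally, the sound tokens can be passed to the pipeline as an ordinary user message. The single-turn chat structure below is an assumption based on the helper signatures above, not an excerpt from the original document:

# sound_tokens comes from audio_to_sound_tokens() above.
messages = [
    {"role": "user", "content": sound_tokens},
]
print(generate_text(pipe, messages, max_new_tokens=64))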
🔧 Technical Details
Model Details
- Model developers: Homebrew Research.
- Input: Text and sound.
- Output: Text.
- Model Architecture: Llama-3.
- Language(s): English.
Intended Use
- Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM on sound understanding capabilities.
- Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
Training process
Hardware
- GPU Configuration: Cluster of 10x NVIDIA A6000-48GB.
- GPU Usage:
Training Arguments
We utilize the torchtune library for the latest FSDP2 training code implementation.
| Parameter           | Instruction Fine-tuning |
|---------------------|-------------------------|
| Epoch               | 1                       |
| Global batch size   | 360                     |
| Learning Rate       | 7e-5                    |
| Learning Scheduler  | LambdaLR with warmup    |
| Optimizer           | Adam torch fused        |
| Warmup Ratio        | 0.01                    |
| Weight Decay        | 0.005                   |
| Max Sequence Length | 4096                    |
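The exact optimizer and scheduler setup lives in the torchtune configuration and is not reproduced in this document; the snippet below is only an illustrative PyTorch sketch of an Adam (torch fused) optimizer with a LambdaLR linear warmup over the first 1% of steps, where the model and total step count are placeholders:

import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(8, 8).to(device)  # placeholder module standing in for the LLM
optimizer = torch.optim.Adam(
    model.parameters(),
    lr=7e-5,                     # Learning Rate from the table above
    weight_decay=0.005,          # Weight Decay from the table above
    fused=(device == "cuda"),    # "Adam torch fused" requires CUDA tensors
)

total_steps = 10_000                            # placeholder, not taken from the source
warmup_steps = max(1, int(0.01 * total_steps))  # Warmup Ratio = 0.01

def warmup_lambda(step):
    # Ramp the learning rate linearly to its base value, then hold it constant.
    return min(1.0, (step + 1) / warmup_steps)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=warmup_lambda)
# Call scheduler.step() once per optimizer step during training.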
📄 License
The license for this project is Apache-2.0.
📚 Citation Information
BibTeX:
@article{Llama3-S: Sound Instruction Language Model 2024,
  title={Llama3-S},
  author={Homebrew Research},
  year=2024,
  month=August,
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}
}
🙏 Acknowledgement