Ichigo-llama3.1-s-instruct-v0.3-phase-3開源大模型 - 支持音文輸入，增強聲音理解交互體驗

首頁

Ichigo Llama3.1 S Instruct V0.3 Phase 3

由homebrewltd開發

Ichigo-llama3s是一個支持音頻和文本輸入的大語言模型系列，專注於提升聲音理解能力和用戶交互體驗。

文本生成音頻

Safetensors

英語開源協議:Apache-2.0 #音頻文本雙模態 #多輪對話優化 #高精度語音理解

下載量 43

發布時間 : 9/25/2024

模型概述

該模型基於Llama-3架構開發，原生支持音頻和文本輸入，專注於提升處理聽不清輸入和多輪對話的能力，主要用於研究應用。

模型特點

多模態輸入支持

原生支持音頻和文本兩種輸入方式，能夠處理聲音標記和文本標記的混合輸入。

增強的聲音理解能力

特別優化了處理聽不清輸入和多輪對話的能力，提升了用戶交互體驗。

高效訓練

使用torchtune庫實現最新的FSDP2訓練代碼，訓練效率高。

模型能力

音頻理解

文本生成

多輪對話處理

聽不清輸入處理

使用案例

研究應用

聲音語言模型研究

用於探索大語言模型的聲音理解能力

在AudioBench評估中獲得3.64-3.68的GPT-4-O評分

人機交互研究

用於研究更自然的人機對話系統

優化了處理聽不清輸入和多輪對話的能力

🚀 Ichigo-llama3s 模型

Ichigo-llama3s 是一個原生支持音頻和文本輸入的模型家族，可用於研究應用，在音頻理解能力上有出色表現。

🚀 快速開始

你可以通過 Google Colab Notebook 來試用該模型。

首先，需要將音頻文件轉換為聲音標記：

device = "cuda" if torch.cuda.is_available() else "cpu"
if not os.path.exists("whisper-vq-stoks-medium-en+pl-fixed.model"):
    hf_hub_download(
        repo_id="jan-hq/WhisperVQ",
        filename="whisper-vq-stoks-medium-en+pl-fixed.model",
        local_dir=".",
    )
vq_model = RQBottleneckTransformer.load_model(
        "whisper-vq-stoks-medium-en+pl-fixed.model"
    ).to(device)
vq_model.ensure_whisper(device)
def audio_to_sound_tokens(audio_path, target_bandwidth=1.5, device=device):

    wav, sr = torchaudio.load(audio_path)
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    with torch.no_grad():
        codes = vq_model.encode_audio(wav.to(device))
        codes = codes[0].cpu().tolist()

    result = ''.join(f'<|sound_{num:04d}|>' for num in codes)
    return f'<|sound_start|>{result}<|sound_end|>'

然後，就可以像使用其他大語言模型一樣對該模型進行推理：

def setup_pipeline(model_path, use_4bit=False, use_8bit=False):
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    model_kwargs = {"device_map": "auto"}

    if use_4bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_4bit=True,
            bnb_4bit_compute_dtype=torch.bfloat16,
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_type="nf4",
        )
    elif use_8bit:
        model_kwargs["quantization_config"] = BitsAndBytesConfig(
            load_in_8bit=True,
            bnb_8bit_compute_dtype=torch.bfloat16,
            bnb_8bit_use_double_quant=True,
        )
    else:
        model_kwargs["torch_dtype"] = torch.bfloat16

    model = AutoModelForCausalLM.from_pretrained(model_path, **model_kwargs)

    return pipeline("text-generation", model=model, tokenizer=tokenizer)

def generate_text(pipe, messages, max_new_tokens=64, temperature=0.0, do_sample=False):
    generation_args = {
        "max_new_tokens": max_new_tokens,
        "return_full_text": False,
        "temperature": temperature,
        "do_sample": do_sample,
    }

    output = pipe(messages, **generation_args)
    return output[0]['generated_text']

# Usage
llm_path = "homebrewltd/llama3.1-s-instruct-v0.2"
pipe = setup_pipeline(llm_path, use_8bit=True)

✨ 主要特性

原生支持音頻和文本輸入。
專注於微調模型，以改善用戶交互，特別是在處理聽不見的輸入和多輪對話方面。

📚 詳細文檔

模型詳情

數據集：
- homebrewltd/instruction-speech-whispervq-v2
語言：英語
許可證：apache - 2.0
模型類型：音頻文本轉文本
標籤：聲音語言模型

我們開發併發布了 Ichigo-llama3s 模型家族。該家族原生支持理解音頻和文本輸入。

此模型專注於對 homebrewltd/Ichigo-llama3.1-s-instruct-v0.3-phase-2 進行微調，以改善用戶交互，特別是在處理聽不見的輸入和多輪對話方面。

屬性	詳情
模型開發者	Homebrew Research
輸入	文本和聲音
輸出	文本
模型架構	Llama - 3
語言	英語

預期用途

預期用例：該模型家族主要用於研究應用。此版本旨在進一步提高大語言模型的聲音理解能力。
非預期用途：嚴禁以任何違反適用法律法規的方式使用 llama3 - s。

訓練過程

訓練指標圖像：以下是訓練損失曲線的可視化快照。
MMLU 評估： | 模型 | MMLU 得分 | | --- | --- | | llama3.5 - instruct - 8b | 69.40 | | ichigo - llama3.1 - s - v0.3: phase 3 | 63.79 | | ichigo - llama3.1 - s - v0.3: phase 2 | 63.08 | | ichigo - llama3.1 - s - base - v0.3 | 42.11 | | llama3.5 - instruct - v0.2 | 50.27 |
AudioBench 評估： | 模型基準 | Open - hermes Instruction Audio (GPT - 4 - O judge 0:5) | Alpaca Instruction Audio (GPT - 4 - O judge 0:5) | | --- | --- | --- | | [Llama3.1 - s - v2](https://huggingface.co/homebrewltd/llama3 - s - instruct - v0.2) | 3.45 | 3.53 | | [Ichigo - llama3.1 - s v0.3 - phase2 - cp7000](https://huggingface.co/homebrewltd/Ichigo - llama3.1 - s - instruct - v0.3 - phase - 2) | 3.42 | 3.62 | | [Ichigo - llama3.1 - s v0.3 - phase2 - cplast](https://huggingface.co/jan - hq/llama3 - s - instruct - v0.3 - checkpoint - last) | 3.31 | 3.6 | | [Ichigo - llama3.1 - s v0.3 - phase3](https://huggingface.co/homebrewltd/Ichigo - llama3.1 - s - instruct - v0.3 - phase - 3) | 3.64 | 3.68 | | [Qwen2 - audio - 7B](https://huggingface.co/Qwen/Qwen2 - Audio - 7B) | 2.63 | 2.24 |

硬件

GPU 配置：8 個 NVIDIA H100 - SXM - 80GB 組成的集群。
GPU 使用情況：
- 持續訓練：3 小時。

訓練參數

我們使用 torchtune 庫來實現最新的 FSDP2 訓練代碼。

參數	持續訓練
輪數	1
全局批量大小	256
學習率	1.5e - 5
學習率調度器	帶熱身的 LambdaLR
優化器	AdamW Fused
熱身步數	8
權重衰減	0.005
最大長度	4096
精度	bf16

📄 許可證

本項目採用 apache - 2.0 許可證。

📖 引用信息

BibTeX：

@article{Llama3-S: Sound Instruction Language Model 2024,
  title={Llama3-S},
  author={Homebrew Research},
  year=2024,
  month=August,
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-20}