# 🚀 Llama3-S: Sound Instruction Language Model
A family of models that natively understand audio and text input, developed to improve sound understanding capabilities for research applications.
## 🚀 Quick Start
This README covers the Llama3-S family of models: model details, intended use, training process, citation information, and acknowledgements.
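As a starting point, the snippet below is a minimal, text-only sketch of loading the released checkpoint with Hugging Face `transformers`. It is illustrative rather than an official recipe: the repo id is taken from the citation below, and audio input additionally requires encoding sound into the model's WhisperVQ sound tokens, which is not covered here.

```python
# Minimal text-only sketch (assumed usage; audio input requires WhisperVQ
# sound-token preprocessing that is not shown here).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "homebrewltd/llama3.1-s-2024-08-15"  # repo id from the citation below

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

inputs = tokenizer("Describe the sound of rain.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```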
## ✨ Features
- Natively understands audio and text input.
- Continual pretraining on an expanded vocabulary.
- Primarily intended for research applications to improve sound understanding capabilities.
## 📚 Documentation
### 📖 Model Details
We have developed and released the Llama3-S family of models, which natively understands both audio and text input.
We continually pretrain the expanded-vocabulary checkpoint homebrewltd/llama3.1-s-whispervq-init on 900M tokens from the homebrewltd/raw-speech-whispervq-v1 dataset.
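For intuition, expanding a text LLM's vocabulary with discrete sound tokens typically looks like the sketch below. The base model id, token naming (`<|sound_0000|>` etc.), and codebook size are illustrative assumptions; the actual expanded vocabulary is the one defined in homebrewltd/llama3.1-s-whispervq-init.

```python
# Illustrative sketch of vocabulary expansion with discrete sound tokens.
# Base model, token names, and count are assumptions, not the official setup.
from transformers import AutoModelForCausalLM, AutoTokenizer

base_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(base_id)

# One new token per WhisperVQ codebook entry (512 here, as an assumption).
sound_tokens = [f"<|sound_{i:04d}|>" for i in range(512)]
tokenizer.add_tokens(sound_tokens, special_tokens=True)

# Grow the embedding and LM-head matrices to cover the new token ids,
# then continually pretrain on interleaved speech-token and text data.
model.resize_token_embeddings(len(tokenizer))
```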
| Property | Details |
|---|---|
| Model Developers | Homebrew Research |
| Input | Text and sound |
| Output | Text |
| Model Architecture | Llama-3 |
| Language(s) | English |
### 🎯 Intended Use
Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.
### ⚠️ Important Note
The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.
### ⚙️ Training Process
Training Metrics: below is a snapshot of the training loss curve.

MMLU:
| Model | MMLU Score |
|---|---|
| llama3.1-instruct-8b | 69.40 |
| ichigo-llama3.1-s-v0.3: phase 3 | 63.79 |
| ichigo-llama3.1-s-v0.3: phase 2 | 63.08 |
| ichigo-llama3.1-s-base-v0.3 | 42.11 |
| llama3.1-s-instruct-v0.2 | 50.27 |
### 💻 Hardware
- GPU Configuration: Cluster of 10x NVIDIA A6000-48GB.
- GPU Usage:
  - Continual Training: 30 hours.
### ⚙️ Training Arguments
We use the torchtune library for its up-to-date FSDP2 distributed training implementation.
| Parameter | Continual Training |
|---|---|
| Epoch | 1 |
| Global batch size | 480 |
| Learning Rate | 2e-4 |
| Learning Scheduler | Cosine with warmup |
| Optimizer | AdamW fused |
| Warmup Steps | 50 |
| Weight Decay | 0.01 |
| Max Sequence Length | 512 |
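To make the schedule concrete, here is a minimal PyTorch sketch of the optimizer and learning-rate schedule implied by the table (fused AdamW, lr 2e-4, weight decay 0.01, 50 warmup steps, cosine decay). It illustrates the hyperparameters only; the actual run uses torchtune's FSDP2 recipe, and `total_steps` and the model are placeholders.

```python
# Illustrative PyTorch sketch of the optimizer/schedule from the table above;
# the real training uses torchtune's FSDP2 recipe, not this snippet.
import math
import torch

model = torch.nn.Linear(8, 8)          # placeholder for the Llama-3 model
total_steps, warmup_steps = 1000, 50   # total_steps is a placeholder

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,
    weight_decay=0.01,
    fused=torch.cuda.is_available(),   # fused AdamW requires a CUDA device
)

def lr_lambda(step: int) -> float:
    # Linear warmup over the first 50 steps, then cosine decay to zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```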
## 📝 Citation Information
BibTeX:
@article{llama3s2024,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}
}
## 🙏 Acknowledgement
## 📄 License
This project is licensed under the Apache-2.0 license.