Ichigo-llama3.1-s-base-v0.3 Open-Source Multimodal Model - Supports Audio and Text Input Understanding

Ichigo Llama3.1 S Base V0.3

Developed by homebrewltd

Ichigo-llama3.1-s-base-v0.3 belongs to the Llama3-S series, a family of multimodal language models developed by Homebrew Research. The models natively understand both audio and text input, extending the Llama-3 architecture with speech understanding.

Audio-to-Text

Safetensors

Language: English
Open Source License: Apache-2.0
Tags: Speech-Text Dual Modality, English Speech Comprehension, Llama3 Architecture Extension


Release Time: 9/9/2024

Model Overview

This model was continually pre-trained on a 900-million-token speech dataset using an extended vocabulary, with the aim of improving the speech comprehension of large language models.

Model Features

Multimodal Input Support

Natively supports audio and text input comprehension, expanding the capability boundaries of traditional language models.

Speech Comprehension Optimization

Targets improved speech comprehension through continual pre-training on a dedicated speech dataset.

Efficient Training

Uses the torchtune library's latest FSDP2 training code for efficient distributed training.

Model Capabilities

Audio Comprehension

Text Generation

Multimodal Input Processing

Use Cases

Speech Research

Speech Command Comprehension

Parses and understands voice input commands

The phase 3 model of the same family achieved a 63.79 MMLU score; this base checkpoint scores 42.11 (see the MMLU table in the training section).

Educational Research

Language Learning Assistance

Helps learners comprehend English speech input

🚀 Llama3-S Family

The Llama3-S family is a set of models natively capable of understanding both audio and text inputs, aiming to enhance large language models' sound understanding capabilities.

🚀 Quick Start

This README provides detailed information about the Llama3-S family, including model details, intended use, training process, citation information, and acknowledgments.

✨ Features

Natively understands audio and text input.
Continual pretraining on an expanded vocabulary.
Aims to improve the LLM's sound understanding capabilities.

📚 Documentation

🔍 Model Details

We have developed and released the llama3s family. This family is natively capable of understanding audio and text input.

We conducted continual pretraining on the expanded vocabulary homebrewltd/llama3.1-s-whispervq-init with 900M tokens from the homebrewltd/raw-speech-whispervq-v1 dataset.
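One way to picture the expanded vocabulary: discrete audio codes from a WhisperVQ-style quantizer are mapped to dedicated token strings that sit alongside the text vocabulary. The sketch below is illustrative only. The token spellings (`<|sound_start|>`, `<|sound_0003|>`, `<|sound_end|>`), the codebook size of 512, and the helper names are all assumptions and are not taken from this model card.

```python
from typing import List

# Assumptions for illustration: the codebook size and the token spellings
# are hypothetical and are not specified in the model card.
CODEBOOK_SIZE = 512
SOUND_START = "<|sound_start|>"
SOUND_END = "<|sound_end|>"


def build_sound_vocab(codebook_size: int = CODEBOOK_SIZE) -> List[str]:
    """Return the extra token strings to append to a text vocabulary."""
    codes = [f"<|sound_{i:04d}|>" for i in range(codebook_size)]
    return [SOUND_START, SOUND_END] + codes


def codes_to_prompt(codes: List[int], codebook_size: int = CODEBOOK_SIZE) -> str:
    """Render a sequence of quantizer codes as a string of sound tokens."""
    for c in codes:
        if not 0 <= c < codebook_size:
            raise ValueError(f"code {c} is outside a codebook of size {codebook_size}")
    body = "".join(f"<|sound_{c:04d}|>" for c in codes)
    return f"{SOUND_START}{body}{SOUND_END}"


if __name__ == "__main__":
    extra_tokens = build_sound_vocab()
    print(len(extra_tokens))              # tokens that would be added to the vocabulary
    print(codes_to_prompt([3, 17, 511]))  # audio codes rendered as a token string
```

With Hugging Face transformers, tokens like these would typically be registered through `tokenizer.add_tokens(...)` followed by `model.resize_token_embeddings(len(tokenizer))`, after which continual pretraining teaches the model what the new embeddings mean.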

Model developers: Homebrew Research.
Input: Text and sound.
Output: Text.
Model Architecture: Llama-3.
Language(s): English.

🎯 Intended Use

Intended Use Cases: This family is primarily intended for research applications. This version aims to further improve the LLM's sound understanding capabilities.

Out-of-scope: The use of llama3-s in any manner that violates applicable laws or regulations is strictly prohibited.

⚙️ Training process

[Image: snapshot of the training loss curve]

MMLU:

Model	MMLU Score
llama3.1-instruct-8b	69.40
ichigo-llama3.1-s-v0.3: phase 3	63.79
ichigo-llama3.1-s-v0.3: phase 2	63.08
ichigo-llama3.1-s-base-v0.3	42.11
llama3.1-s-instruct-v0.2	50.27

💻 Hardware

GPU Configuration: Cluster of 10x NVIDIA A6000-48GB.

GPU Usage: Continual Training: 30 hours.
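For a rough sense of scale, the hardware figures above can be combined with the 900M-token dataset size stated under Model Details. This back-of-the-envelope calculation assumes all 10 GPUs were busy for the full 30 hours, which the card does not state explicitly.

```python
NUM_GPUS = 10            # from the GPU configuration
WALL_CLOCK_HOURS = 30    # continual training duration
TOKENS = 900_000_000     # dataset size stated under Model Details


def gpu_hours(num_gpus: int, hours: float) -> float:
    """Total GPU-hours, assuming every GPU runs for the whole duration."""
    return num_gpus * hours


def tokens_per_gpu_hour(tokens: int, num_gpus: int, hours: float) -> float:
    """Average training throughput per GPU-hour under the same assumption."""
    return tokens / gpu_hours(num_gpus, hours)


if __name__ == "__main__":
    print(gpu_hours(NUM_GPUS, WALL_CLOCK_HOURS))
    print(tokens_per_gpu_hour(TOKENS, NUM_GPUS, WALL_CLOCK_HOURS))
```

Under these assumptions the run amounts to 300 GPU-hours, or about 3 million tokens per GPU-hour.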

📝 Training Arguments

We use the torchtune library for its latest FSDP2 training code implementation.

Parameter	Continual Training
Epoch	1
Global batch size	480
Learning Rate	2e-4
Learning Scheduler	Cosine with warmup
Optimizer	AdamW fused
Warmup Steps	50
Weight Decay	0.01
Max Sequence Length	512
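The learning-rate settings in the table can be illustrated with a small schedule function. This is a generic linear-warmup-plus-cosine-decay sketch, not torchtune's implementation. The total step count is a hypothetical value because the card does not report it, and decaying all the way to zero is also an assumption.

```python
import math

PEAK_LR = 2e-4           # "Learning Rate" from the table
WARMUP_STEPS = 50        # "Warmup Steps" from the table
GLOBAL_BATCH_SIZE = 480  # "Global batch size" from the table
MAX_SEQ_LEN = 512        # "Max Sequence Length" from the table
TOTAL_STEPS = 3_000      # hypothetical; the card does not state the step count


def lr_at(step: int,
          peak_lr: float = PEAK_LR,
          warmup_steps: int = WARMUP_STEPS,
          total_steps: int = TOTAL_STEPS) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


def max_tokens_per_step(batch_size: int = GLOBAL_BATCH_SIZE,
                        seq_len: int = MAX_SEQ_LEN) -> int:
    """Upper bound on tokens per optimizer step (every sequence at full length)."""
    return batch_size * seq_len


if __name__ == "__main__":
    for step in (0, 25, 50, 1_500, 3_000):
        print(step, lr_at(step))
    print(max_tokens_per_step())
```

The rate climbs linearly to 2e-4 over the first 50 steps and then follows a half cosine downward. With a global batch of 480 sequences of at most 512 tokens, each optimizer step sees at most 245,760 tokens.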

📖 Citation Information

BibTeX:

@article{llama3s2024,
  title={Llama3-S: Sound Instruction Language Model},
  author={Homebrew Research},
  year={2024},
  month={August},
  url={https://huggingface.co/homebrewltd/llama3.1-s-2024-08-15}
}

🙏 Acknowledgement

WhisperSpeech
Meta-Llama-3.1-8B-Instruct

📄 License

The model is released under the Apache-2.0 license.

Property	Details
Datasets	homebrewltd/instruction-speech-whispervq-v2
Language	English
License	apache-2.0
Tags	sound language model
Pipeline Tag	audio-text-to-text
