Mini Ichigo Llama3.2 3B S Instruct

Developed by homebrewltd
A multimodal language model based on the Llama-3 architecture that natively accepts both audio and text input, with a focus on improving large language models' understanding of audio.
Downloads: 14
Release date: 10/8/2024

Model Overview

This model series extends experiments with audio semantic tokens, using WhisperVQ as the audio tokenizer to convert audio files into discrete tokens; English is the supported language.
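The core idea behind a VQ-style audio tokenizer such as WhisperVQ is vector quantization: continuous audio encoder features are mapped to the indices of their nearest codebook entries, yielding discrete tokens a language model can consume. The sketch below illustrates that mapping with a toy random codebook; it is a conceptual illustration, not the real WhisperVQ codebook or model.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to the index of its nearest codebook entry.

    This mirrors the core step of a VQ audio tokenizer: continuous
    encoder features become discrete token ids. The codebook here is
    toy random data, not the actual WhisperVQ codebook.
    """
    # Pairwise squared distances, shape (num_frames, codebook_size)
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 8))   # toy codebook: 512 entries, dim 8
# Synthetic "audio features" near codebook rows 3, 41 and 7
features = codebook[[3, 41, 7]] + 0.01 * rng.normal(size=(3, 8))
print(quantize(features, codebook))    # -> [ 3 41  7]
```

In the real pipeline, the resulting token ids are what the Llama backbone sees in place of raw audio.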

Model Features

Multimodal Input Support
Natively accepts both audio and text input, processing the semantic tokens converted from audio files.
Efficient Audio Processing
Integrates the WhisperVQ audio tokenizer to extract audio features and convert them into discrete tokens efficiently.
Instruction Fine-tuning Optimization
Fine-tuned on nearly 1 billion tokens of instruction-following speech data to strengthen audio comprehension.

Model Capabilities

Audio Understanding
Text Generation
Multimodal Reasoning
Instruction Following
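Models in the llama3-s family typically interleave discrete audio tokens with text by wrapping them in special sound tokens inside the prompt. The helper below sketches that formatting; the exact token strings (`<|sound_start|>`, `<|sound_NNNN|>`, `<|sound_end|>`) follow the llama3-s convention but should be treated as an assumption rather than this model's documented vocabulary.

```python
def sound_prompt(token_ids, question):
    """Wrap audio token ids in sound-token markup, then append a text question.

    Assumption: the <|sound_NNNN|> naming follows the llama3-s convention;
    verify the special tokens against the model's actual tokenizer.
    """
    sound = "".join(f"<|sound_{i:04d}|>" for i in token_ids)
    return f"<|sound_start|>{sound}<|sound_end|> {question}"

print(sound_prompt([3, 41, 7], "Transcribe the audio."))
# -> <|sound_start|><|sound_0003|><|sound_0041|><|sound_0007|><|sound_end|> Transcribe the audio.
```

The formatted string would then be tokenized and passed to the model like any other text prompt.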

Use Cases

Voice Interaction Research
Voice Command Understanding: parses and executes complex instructions that include audio input. Scored 3.68 on the AudioBench evaluation (judged with GPT-4o).
Educational Technology
Language Learning Assistance: provides real-time language-learning feedback from audio input.