The 30 Best Audio-to-Text Tools in 2025

Qwen2 Audio 7B · Qwen · Apache-2.0 · 28.26k · 114
Qwen2-Audio is the Tongyi Qianwen large audio language model series, supporting both voice chat and audio analysis interaction modes.
Audio-to-Text · Transformers · English

Qwen2 Audio 7B GGUF · NexaAIDev · Apache-2.0 · 5,001 · 153
Qwen2-Audio is an advanced small-scale multimodal model that supports audio and text input, enabling voice interaction without relying on speech recognition modules.
Audio-to-Text · English

Ultravox v0.5 Llama 3.3 70B · fixie-ai · MIT · 3,817 · 26
Ultravox is a multimodal voice large language model built upon Llama3.3-70B and Whisper, supporting both voice and text inputs, suitable for scenarios like voice agents and translation.
Audio-to-Text · Transformers · Multilingual

Ultravox v0.4 · fixie-ai · MIT · 1,851 · 48
Ultravox is a multimodal voice large language model based on Llama3.1-8B-Instruct and Whisper-medium, capable of processing both voice and text inputs simultaneously.
Audio-to-Text · Transformers · Multilingual

Aero 1 Audio · lmms-lab · MIT · 1,348 · 74
A lightweight audio model that excels at diverse tasks including speech recognition, audio understanding, and audio instruction following.
Audio-to-Text · Transformers · English

Ultravox v0.4.1 Mistral Nemo · fixie-ai · MIT · 1,285 · 25
Ultravox is a multimodal model based on Mistral-Nemo and Whisper, capable of processing both speech and text inputs, suitable for tasks like voice agents and speech translation.
Audio-to-Text · Transformers · Multilingual

Ultravox v0.6 Qwen 3 32B · fixie-ai · MIT · 1,240 · 0
Ultravox is a large multimodal speech language model capable of understanding and processing speech input, supporting multiple languages and noisy environments.
Audio-to-Text · Transformers · Multilingual

Omniaudio 2.6B · NexaAIDev · Apache-2.0 · 1,149 · 265
The world's fastest and most efficient edge-deployable audio language model, a 2.6B-parameter multimodal model capable of processing both text and audio inputs.
Audio-to-Text · English

Qwen2 Audio 7B Instruct 4bit · alicekyting · 1,090 · 6
The 4-bit quantized version of Qwen2-Audio-7B-Instruct, an audio-text multimodal large language model developed from Alibaba Cloud's original Qwen model.
Audio-to-Text · Transformers

Ultravox v0.5 Llama 3.2 1B ONNX · onnx-community · MIT · 1,088 · 3
Ultravox is a multilingual audio-to-text model optimized from the Llama-3.2-1B architecture, supporting speech recognition and transcription in multiple languages.
Audio-to-Text · Transformers · Multilingual

Ultravox v0.2 · fixie-ai · MIT · 792 · 51
Ultravox is a multimodal voice large language model built upon Llama3-8B-Instruct and Whisper-small, capable of processing both speech and text inputs.
Audio-to-Text · Transformers · English

R1-AQA · mispeech · Apache-2.0 · 791 · 14
R1-AQA is an audio question answering model based on Qwen2-Audio-7B-Instruct, optimized with the Group Relative Policy Optimization (GRPO) algorithm and achieving state-of-the-art performance on the MMAU benchmark.
Audio-to-Text · Transformers

Ultravox v0.4.1 Llama 3.1 8B · fixie-ai · MIT · 747 · 97
Ultravox is a multimodal speech large language model built on Llama3.1-8B-Instruct and whisper-large-v3-turbo, capable of processing both speech and text inputs.
Audio-to-Text · Transformers · Multilingual

Shuka 1 · sarvamai · 729 · 54
Shuka v1 is a language model that natively supports audio understanding for Indian languages, combining an in-house audio encoder with the Llama3-8B-Instruct decoder to enable zero-shot multilingual question answering.
Audio-to-Text · Transformers · Multilingual

AV-HuBERT · nguyenvulebinh · 683 · 3
A multilingual audio-visual speech recognition model based on the MuAViC dataset, combining audio and visual modalities for robust performance.
Audio-to-Text · Transformers

SeaLLMs Audio 7B · SeaLLMs · Other · 539 · 10
SeaLLMs-Audio is a large-scale audio language model targeting Southeast Asia. It supports five major languages: Indonesian, Thai, Vietnamese, English, and Chinese, and has capabilities such as audio analysis and voice interaction.
Audio-to-Text · Safetensors · Multilingual

Gemma 3 4B IT Speech · junnei · 383 · 12
Gemma-3-MM is a multimodal instruction model extended from Gemma-3-4b-it with added speech processing capabilities, capable of handling text, image, and audio inputs to generate text outputs.
Audio-to-Text · Transformers

Pathumma LLM Audio 1.0.0 · nectec · Apache-2.0 · 333 · 7
Pathumma-llm-audio-1.0.0 is an 8-billion-parameter Thai large language model specifically designed for audio comprehension tasks, capable of processing various audio inputs including speech, general audio, and music.
Audio-to-Text · Transformers · Multilingual

Llama 3 Typhoon v1.5 8B Audio Preview · scb10x · 218 · 12
Typhoon-Audio Preview is a Thai and English audio-language model that processes text and audio inputs and produces text output.
Audio-to-Text · Transformers

Qwen2 Audio 7B Instruct GGUF · mradermacher · Apache-2.0 · 146 · 0
A static quantized version of the Qwen2-Audio-7B-Instruct model, supporting English audio-to-text conversion tasks.
Audio-to-Text · Transformers · English

Qwen Audio NF4 · Ostixe360 · 134 · 1
Qwen-Audio-nf4 is a quantized version of Qwen-Audio, supporting multiple audio inputs and text outputs.
Audio-to-Text · Transformers · Multilingual

AV-HuBERT MuAViC Ru · nguyenvulebinh · 91 · 1
AV-HuBERT is an audio-visual speech recognition model trained on the MuAViC multilingual audio-visual corpus, combining audio and visual modalities for robust performance.
Audio-to-Text · Transformers

Ultravox v0.4 Llama 3.1 70B · fixie-ai · MIT · 79 · 4
Ultravox is a multimodal speech large language model, built upon the pre-trained Llama3.1-70B-Instruct and Whisper-medium backbones, capable of receiving both speech and text as input simultaneously.
Audio-to-Text · Transformers · Multilingual

Phi-4 MM Inst ASR Singlish · mjwong · MIT · 61 · 0
A multimodal speech recognition model optimized for Singapore English (Singlish), fine-tuned from Microsoft's Phi-4 multimodal instruction model to significantly improve recognition of Singlish's distinctive phonetic features.
Audio-to-Text · Transformers · Multilingual

Ichigo Llama3.1 S Base V0.3 · homebrewltd · Apache-2.0 · 33 · 4
Llama3-S is a multimodal language model series developed by Homebrew Research that natively understands audio and text input, extending the Llama-3 architecture with speech understanding capabilities.
Audio-to-Text · English

Phi-4 Multimodal Instruct CommonVoice zh-TW · JacobLinCool · MIT · 28 · 1
A Taiwanese Mandarin speech recognition model fine-tuned from microsoft/Phi-4-multimodal-instruct, trained on the Common Voice 19.0 zh-TW dataset.
Audio-to-Text · Transformers · Chinese

Ultravox v0.4.1 Llama 3.3 70B · fixie-ai · MIT · 26 · 10
Ultravox is a multimodal speech large language model based on Llama3.3-70B-Instruct and whisper-large-v3-turbo, capable of processing both speech and text inputs.
Audio-to-Text · Transformers · Multilingual

Mistral Speech To Text · 0-hero · Apache-2.0 · 20 · 1
This is an experimental model that converts audio waveforms into ASCII art and then fine-tunes the Mistral model to predict text.
Audio-to-Text · Transformers

Ultravox v0.3 · FriendliAI · MIT · 20 · 1
Ultravox is a multimodal speech large language model based on Llama3.1-8B-Instruct and Whisper-small, capable of processing both speech and text inputs.
Audio-to-Text · Transformers · English

Ichigo Llama3.1 S Base V0.3 · Menlo · Apache-2.0 · 18 · 4
Llama3-S is a multimodal language model supporting both audio and text inputs, built on the Llama-3 architecture with a focus on enhanced speech understanding.
Audio-to-Text · English

AIbase
Empowering the Future, Your AI Solution Knowledge Base
© 2025 AIbase