The 30 Best Audio-to-Text Tools in 2025

Qwen2 Audio 7B · Qwen · Apache-2.0 · 28.26k · 114
Qwen2-Audio is the Tongyi Qianwen large audio language model series, supporting both voice chat and audio analysis interaction modes.
Audio-to-Text · Transformers · English

Qwen2 Audio 7B GGUF · NexaAIDev · Apache-2.0 · 5,001 · 153
Qwen2-Audio is an advanced small-scale multimodal model that supports audio and text input, enabling voice interaction without relying on speech recognition modules.
Audio-to-Text · English

Ultravox v0.5 Llama 3.3 70B · fixie-ai · MIT · 3,817 · 26
Ultravox is a multimodal voice large language model built upon Llama3.3-70B and Whisper, supporting both voice and text inputs, suitable for scenarios like voice agents and translation.
Audio-to-Text · Transformers · Multilingual

Ultravox v0.4 · fixie-ai · MIT · 1,851 · 48
Ultravox is a multimodal voice large language model based on Llama3.1-8B-Instruct and Whisper-medium, capable of processing both voice and text inputs simultaneously.
Audio-to-Text · Transformers · Multilingual

Aero 1 Audio · lmms-lab · MIT · 1,348 · 74
A lightweight audio model that excels at diverse tasks including speech recognition, audio understanding, and audio instruction following.
Audio-to-Text · Transformers · English

Ultravox v0.4.1 Mistral Nemo · fixie-ai · MIT · 1,285 · 25
Ultravox is a multimodal model based on Mistral-Nemo and Whisper, capable of processing both speech and text inputs, suitable for tasks like voice agents and speech translation.
Audio-to-Text · Transformers · Multilingual

Ultravox v0.6 Qwen 3 32B · fixie-ai · MIT · 1,240 · 0
Ultravox is a large multimodal speech language model capable of understanding and processing speech input, supporting multiple languages and noisy environments.
Audio-to-Text · Transformers · Multilingual

Omniaudio 2.6B · NexaAIDev · Apache-2.0 · 1,149 · 265
The world's fastest and most efficient edge-deployable audio language model, a 2.6B-parameter multimodal model capable of processing both text and audio inputs.
Audio-to-Text · English

Qwen2 Audio 7B Instruct 4bit · alicekyting · 1,090 · 6
The 4-bit quantized version of Qwen2-Audio-7B-Instruct, an audio-text multimodal large language model developed from Alibaba Cloud's original Qwen model.
Audio-to-Text · Transformers

Ultravox v0.5 Llama 3.2 1B ONNX · onnx-community · MIT · 1,088 · 3
Ultravox is a multilingual audio-to-text model optimized from the Llama-3.2-1B architecture, supporting speech recognition and transcription in multiple languages.
Audio-to-Text · Transformers · Multilingual

Ultravox v0.2 · fixie-ai · MIT · 792 · 51
Ultravox is a multimodal voice large language model built upon Llama3-8B-Instruct and Whisper-small, capable of processing both speech and text inputs.
Audio-to-Text · Transformers · English

R1-AQA · mispeech · Apache-2.0 · 791 · 14
R1-AQA is an audio question answering model based on Qwen2-Audio-7B-Instruct, optimized with the Group Relative Policy Optimization (GRPO) algorithm and achieving state-of-the-art performance on the MMAU benchmark.
Audio-to-Text · Transformers

Ultravox v0.4.1 Llama 3.1 8B · fixie-ai · MIT · 747 · 97
Ultravox is a multimodal speech large language model built on Llama3.1-8B-Instruct and whisper-large-v3-turbo, capable of processing both speech and text inputs.
Audio-to-Text · Transformers · Multilingual

Shuka 1 · sarvamai · 729 · 54
Shuka v1 is a language model that natively supports audio understanding for Indian languages, combining an in-house audio encoder with the Llama3-8B-Instruct decoder to enable zero-shot multilingual question answering.
Audio-to-Text · Transformers · Multilingual

AV-HuBERT · nguyenvulebinh · 683 · 3
A multilingual audio-visual speech recognition model based on the MuAViC dataset, combining audio and visual modalities for robust performance.
Audio-to-Text · Transformers

SeaLLMs Audio 7B · SeaLLMs · Other · 539 · 10
SeaLLMs-Audio is a large-scale audio language model targeting Southeast Asia. It supports five major languages: Indonesian, Thai, Vietnamese, English, and Chinese, and has capabilities such as audio analysis and voice interaction.
Audio-to-Text · Safetensors · Multilingual

Gemma 3 4B IT Speech · junnei · 383 · 12
Gemma-3-MM is a multimodal instruction model extended from Gemma-3-4b-it with added speech processing capabilities, capable of handling text, image, and audio inputs to generate text outputs.
Audio-to-Text · Transformers

Pathumma LLM Audio 1.0.0 · nectec · Apache-2.0 · 333 · 7
Pathumma-llm-audio-1.0.0 is an 8-billion-parameter Thai large language model specifically designed for audio comprehension tasks, capable of processing various audio inputs including speech, general audio, and music.
Audio-to-Text · Transformers · Multilingual

Llama 3 Typhoon v1.5 8B Audio Preview · scb10x · 218 · 12
Typhoon-Audio Preview is a Thai and English audio-language model that processes text and audio inputs and produces text output.
Audio-to-Text · Transformers

Qwen2 Audio 7B Instruct GGUF · mradermacher · Apache-2.0 · 146 · 0
A static quantized version of the Qwen2-Audio-7B-Instruct model, supporting English audio-to-text conversion tasks.
Audio-to-Text · Transformers · English

Qwen Audio NF4 · Ostixe360 · 134 · 1
Qwen-Audio-nf4 is a quantized version of Qwen-Audio, supporting multiple audio inputs and text outputs.
Audio-to-Text · Transformers · Multilingual

AV-HuBERT MuAViC Ru · nguyenvulebinh · 91 · 1
AV-HuBERT is an audio-visual speech recognition model trained on the MuAViC multilingual audio-visual corpus, combining audio and visual modalities for robust performance.
Audio-to-Text · Transformers

Ultravox v0.4 Llama 3.1 70B · fixie-ai · MIT · 79 · 4
Ultravox is a multimodal speech large language model, built upon the pre-trained Llama3.1-70B-Instruct and Whisper-medium backbones, capable of receiving both speech and text as input simultaneously.
Audio-to-Text · Transformers · Multilingual

Phi-4 MM Inst ASR Singlish · mjwong · MIT · 61 · 0
A multimodal speech recognition model optimized for Singapore English (Singlish), fine-tuned from Microsoft's Phi-4 multimodal instruction model to significantly improve recognition of Singlish's distinctive phonetic features.
Audio-to-Text · Transformers · Multilingual

Ichigo Llama3.1 S Base V0.3 · homebrewltd · Apache-2.0 · 33 · 4
Llama3-S is a multimodal language model series developed by Homebrew Research that natively understands audio and text input, extending the Llama-3 architecture with speech understanding capabilities.
Audio-to-Text · English

Phi-4 Multimodal Instruct CommonVoice zh-TW · JacobLinCool · MIT · 28 · 1
A Taiwanese Mandarin speech recognition model fine-tuned from microsoft/Phi-4-multimodal-instruct, trained on the Common Voice 19.0 zh-TW dataset.
Audio-to-Text · Transformers · Chinese

Ultravox v0.4.1 Llama 3.3 70B · fixie-ai · MIT · 26 · 10
Ultravox is a multimodal speech large language model based on Llama3.3-70B-Instruct and whisper-large-v3-turbo, capable of processing both speech and text inputs.
Audio-to-Text · Transformers · Multilingual

Mistral Speech To Text · 0-hero · Apache-2.0 · 20 · 1
This is an experimental model that converts audio waveforms into ASCII art and then fine-tunes the Mistral model to predict text.
Audio-to-Text · Transformers

Ultravox v0.3 · FriendliAI · MIT · 20 · 1
Ultravox is a multimodal speech large language model based on Llama3.1-8B-Instruct and Whisper-small, capable of processing both speech and text inputs.
Audio-to-Text · Transformers · English

Ichigo Llama3.1 S Base V0.3 · Menlo · Apache-2.0 · 18 · 4
Llama3-S is a multimodal language model supporting both audio and text inputs, built on the Llama-3 architecture with a focus on enhanced speech understanding.
Audio-to-Text · English

AIbase
Empowering the Future, Your AI Solution Knowledge Base
© 2025 AIbase