Qwen-Audio-nf4: Open-Source Audio-Language Model with Multiple Audio Inputs and Text Output

Developed by Ostixe360

Qwen-Audio-nf4 is the quantized version of Qwen-Audio, supporting multiple audio inputs and text outputs

Audio-to-Text

Transformers

Supports Multiple Languages #Multitask Audio Understanding #Multilingual Audio Processing #Audio-Text Interaction

Downloads 134

Release Time: 4/25/2024

Model Overview

Qwen-Audio-nf4 is the quantized version of Qwen-Audio, Alibaba Cloud's large-scale audio-language model. It accepts diverse audio (human speech, natural sounds, music, and singing) together with text as input and produces text as output.

Model Features

Multi-type Audio Support

Supports processing various audio types including human voice, natural sounds, music, and songs

Multitask Learning Framework

Adopts a multitask training framework supporting over 30 different audio tasks

No Fine-tuning Required

Achieves leading performance on multiple benchmark tasks without task-specific fine-tuning

Multi-turn Dialogue Support

Supports multi-turn audio and text dialogues, including scenarios like sound understanding and music appreciation

Model Capabilities

Audio-to-text conversion

Multilingual audio understanding

Music analysis

Sound reasoning

Multi-turn audio-text dialogue

Tool usage for speech editing

Use Cases

Speech Recognition

Speech Transcription

Convert spoken language into text

Achieves SOTA on Aishell1 test set

Environmental Sound Analysis

Natural Sound Recognition

Identify types of natural sounds in the environment

Achieves SOTA on CochlScene test set

Music Understanding

Music Description Generation

Generate descriptive text based on music

Achieves SOTA on ClothoAQA test set

🚀 Qwen-Audio-nf4

This is the quantized version of Qwen-Audio. It handles several audio types (human speech, natural sounds, music, and singing) and generates the corresponding text output.
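The page does not say how this checkpoint was quantized. As a minimal sketch, assuming "nf4" refers to the 4-bit NormalFloat format provided by bitsandbytes, the snippet below shows how a model can be quantized to NF4 at load time with the BitsAndBytesConfig class from 🤗 Transformers. The helper names nf4_quant_kwargs and load_nf4 are hypothetical, the choice of bfloat16 as compute dtype is an assumption, and whether Qwen-Audio's remote code loads cleanly this way has not been verified here.

```python
from typing import Any, Dict


def nf4_quant_kwargs(compute_dtype: str = "bfloat16") -> Dict[str, Any]:
    """Keyword arguments for transformers.BitsAndBytesConfig that select NF4."""
    return {
        "load_in_4bit": True,                     # store linear-layer weights in 4 bits
        "bnb_4bit_quant_type": "nf4",             # NormalFloat4 instead of plain fp4
        "bnb_4bit_compute_dtype": compute_dtype,  # dtype name, resolved in load_nf4
    }


def load_nf4(model_id: str = "Qwen/Qwen-Audio"):
    """Quantize `model_id` to NF4 while loading (assumes a CUDA GPU and bitsandbytes)."""
    # Heavy dependencies are imported lazily so nf4_quant_kwargs works without them.
    import torch
    from transformers import AutoModelForCausalLM, BitsAndBytesConfig

    kwargs = nf4_quant_kwargs()
    # Turn the dtype name into the actual torch dtype object.
    kwargs["bnb_4bit_compute_dtype"] = getattr(torch, kwargs["bnb_4bit_compute_dtype"])
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=BitsAndBytesConfig(**kwargs),
        device_map="auto",
        trust_remote_code=True,
    )
    return model.eval()
```

Calling load_nf4() downloads the full-precision weights and quantizes them on the fly; a pre-quantized checkpoint such as this one would normally be loaded directly by its own repository ID instead.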

🚀 Quick Start

Below, we provide simple examples to show how to use Qwen-Audio with 🤗 Transformers.

Before running the code, make sure you have set up the environment so that it meets the requirements listed under Installation, then install the dependent libraries.

pip install -r requirements.txt

For more details, please refer to the tutorial.

💻 Usage Examples

Basic Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
import torch
torch.manual_seed(1234)

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)

# use bf16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, bf16=True).eval()
# use fp16
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="auto", trust_remote_code=True, fp16=True).eval()
# use cpu only
# model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cpu", trust_remote_code=True).eval()
# use cuda device
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-Audio", device_map="cuda", trust_remote_code=True).eval()

# Specify hyperparameters for generation (No need to do this if you are using transformers>4.32.0)
# model.generation_config = GenerationConfig.from_pretrained("Qwen/Qwen-Audio", trust_remote_code=True)
audio_url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac"
sp_prompt = "<|startoftranscript|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>"
query = f"<audio>{audio_url}</audio>{sp_prompt}"
audio_info = tokenizer.process_audio(query)
inputs = tokenizer(query, return_tensors='pt', audio_info=audio_info)
inputs = inputs.to(model.device)
pred = model.generate(**inputs, audio_info=audio_info)
response = tokenizer.decode(pred.cpu()[0], skip_special_tokens=False, audio_info=audio_info)
print(response)
# <audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/1272-128104-0000.flac</audio><|startoftranscription|><|en|><|transcribe|><|en|><|notimestamps|><|wo_itn|>mister quilting is the apostle of the middle classes and we are glad to welcome his gospel<|endoftext|>
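Because the response is decoded with skip_special_tokens=False, it still contains the <audio>...</audio> wrapper and the <|...|> control tokens around the transcript. The following is a small sketch of one way to recover just the transcribed text with the standard library. The helper names build_query and extract_transcript are illustrative and not part of the Qwen-Audio API.

```python
import re

# Matches the <audio>...</audio> wrapper that carries the audio URL or path.
_AUDIO_TAG = re.compile(r"<audio>.*?</audio>")
# Matches control tokens such as <|en|>, <|transcribe|> or <|endoftext|>.
_SPECIAL_TOKEN = re.compile(r"<\|[^|]*\|>")


def build_query(audio_url: str, sp_prompt: str) -> str:
    """Assemble the query string in the same layout as the example above."""
    return f"<audio>{audio_url}</audio>{sp_prompt}"


def extract_transcript(response: str) -> str:
    """Strip the audio wrapper and all control tokens from a decoded response."""
    text = _AUDIO_TAG.sub("", response)
    text = _SPECIAL_TOKEN.sub("", text)
    return text.strip()


sample = (
    "<audio>https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-Audio/"
    "1272-128104-0000.flac</audio><|startoftranscription|><|en|><|transcribe|>"
    "<|en|><|notimestamps|><|wo_itn|>mister quilting is the apostle of the "
    "middle classes and we are glad to welcome his gospel<|endoftext|>"
)
print(extract_transcript(sample))
# mister quilting is the apostle of the middle classes and we are glad to welcome his gospel
```

Keeping the special tokens in the decoded string and stripping them afterwards leaves the raw model output available for debugging.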

✨ Features

Qwen-Audio (Qwen Large Audio Language Model) is the multimodal version of Qwen (abbr. Tongyi Qianwen), the large model series proposed by Alibaba Cloud. Qwen-Audio accepts diverse audio (human speech, natural sound, music, and song) and text as inputs, and outputs text. The contributions of Qwen-Audio include:

Foundational audio model: Qwen-Audio is a foundational multi-task audio-language model that supports various tasks, languages, and audio types, serving as a universal audio understanding model. Building upon Qwen-Audio, we develop Qwen-Audio-Chat through instruction fine-tuning, enabling multi-turn dialogues and supporting diverse audio-oriented scenarios.
Multi-task learning framework for all types of audio: To scale up audio-language pre-training, we address the challenge of variation in textual labels associated with different datasets by proposing a multi-task training framework, enabling knowledge sharing and avoiding one-to-many interference. Our model incorporates more than 30 tasks, and extensive experiments show that it achieves strong performance.
Strong performance: Experimental results show that Qwen-Audio achieves impressive performance across diverse benchmark tasks without requiring any task-specific fine-tuning, surpassing its counterparts. Specifically, Qwen-Audio achieves state-of-the-art results on the test sets of Aishell1, CochlScene, ClothoAQA, and VocalSound.
Flexible multi-turn chat from audio and text input: Qwen-Audio supports multiple-audio analysis, sound understanding and reasoning, music appreciation, and tool usage for speech editing.

📦 Installation

Requirements

Python 3.8 and above
PyTorch 1.12 and above (2.0 and above recommended)
CUDA 11.4 and above recommended (for GPU users)
FFmpeg
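
The requirements above can be checked before installing anything heavy. The script below is a rough sketch that uses only the standard library. The function names are illustrative, it looks for FFmpeg simply by searching PATH for an executable named ffmpeg, and it does not check the CUDA version.

```python
import re
import shutil
import sys
from typing import Dict, Optional, Tuple


def version_tuple(version: str) -> Tuple[int, ...]:
    """Turn a version string such as '2.1.0+cu118' into (2, 1, 0)."""
    parts = []
    for piece in version.split("+")[0].split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break
        parts.append(int(match.group()))
    return tuple(parts)


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it cannot be found."""
    try:
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:  # importlib.metadata needs Python 3.8 or newer
        return None
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def check_environment() -> Dict[str, bool]:
    """Report which of the listed requirements the current environment meets."""
    torch_version = installed_version("torch")
    return {
        "python>=3.8": sys.version_info >= (3, 8),
        "torch>=1.12": torch_version is not None
        and version_tuple(torch_version) >= (1, 12),
        "ffmpeg on PATH": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'ok' if ok else 'missing'}")
```

Versions are compared as integer tuples because plain string comparison would rank "1.9" above "1.12".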

📚 Documentation

We release Qwen-Audio and Qwen-Audio-Chat, which are the pretrained model and the chat model, respectively. For more details about Qwen-Audio, please refer to our GitHub repo. This repo is the one for Qwen-Audio.

📄 License

Researchers and developers are free to use the code and model weights of Qwen-Audio. We also allow commercial use. See LICENSE for more details.

🔧 Technical Details

Citation

If you find our paper and code useful in your research, please consider giving us a star and citing our work:

@article{Qwen-Audio,
  title={Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models},
  author={Chu, Yunfei and Xu, Jin and Zhou, Xiaohuan and Yang, Qian and Zhang, Shiliang and Yan, Zhijie and Zhou, Chang and Zhou, Jingren},
  journal={arXiv preprint arXiv:2311.07919},
  year={2023}
}

📞 Contact Us

If you would like to leave a message for either our research team or our product team, feel free to send an email to qianwen_opensource@alibabacloud.com.
