Qwen2-Audio-7B-Instruct-4bit: Open-Source Audio-Text Model for Multimodal Interactive Applications

Qwen2 Audio 7B Instruct 4bit

Developed by alicekyting

This is the 4-bit quantized version of Qwen2-Audio-7B-Instruct, developed based on Alibaba Cloud's original Qwen model. It is an audio-text multimodal large language model.

Audio-to-Text

Transformers

#Audio Understanding #Multimodal Dialogue #4-bit Quantization

Downloads 1,090

Release Time: 8/22/2024

Model Overview

This model accepts audio and text input and generates text responses about the audio content. Its 4-bit quantization reduces memory usage, which makes it suitable for hardware with limited resources.

Model Features

4-bit Quantization Technology

Reduces memory usage, enabling more efficient inference on hardware with limited resources

Multimodal Understanding

Processes both audio and text inputs simultaneously, achieving cross-modal understanding

Conversational Interaction

Supports multi-turn dialogues while maintaining contextual consistency

Model Capabilities

Audio content understanding

Text generation

Multi-turn dialogue

Cross-modal reasoning

Use Cases

Smart Assistants

Audio Content Q&A

Users upload audio files and ask questions about the content

The model accurately understands the audio content and provides relevant answers

Educational Applications

Language Learning Assistance

Analyzes speech pronunciation and provides feedback

🚀 Model Card for Qwen2-Audio-7B-Instruct-4bit

This model is a 4-bit quantized version of Qwen2-Audio-7B-Instruct (https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct). It offers reduced memory usage and potentially faster inference, especially on hardware with limited resources.

✨ Features

Model Details

Model Description

This model card was automatically generated for a 🤗 transformers model pushed to the Hub.

Developed by: alicekyting, based on the original Qwen2-Audio-7B-Instruct model by Alibaba Cloud.
Model type: Audio-Text Multimodal Large Language Model

Model Sources

Repository: https://huggingface.co/Qwen/Qwen2-Audio-7B-Instruct

Uses

The 4-bit quantization allows for reduced memory usage and potentially faster inference times, especially on hardware with limited resources. However, there might be a slight degradation in performance compared to the full-precision model.
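The memory saving can be illustrated with rough arithmetic. The sketch below is a back-of-the-envelope estimate, not a measurement: it assumes a nominal parameter count of 7 billion (taken from the "7B" in the model name, not an exact figure) and counts only weight storage. Activations, the KV cache, quantization metadata, and any layers left unquantized are ignored, so real usage will be higher.

```python
def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the model weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


# Assumption: nominal size from the "7B" in the model name, not an exact count.
NOMINAL_PARAMS = 7e9

fp16_gib = weight_memory_gib(NOMINAL_PARAMS, 16)  # 16-bit weights
int4_gib = weight_memory_gib(NOMINAL_PARAMS, 4)   # 4-bit quantized weights

print(f"16-bit weights: ~{fp16_gib:.1f} GiB")
print(f"4-bit weights:  ~{int4_gib:.1f} GiB")
```

Under these assumptions the 4-bit weights take about a quarter of the space of 16-bit weights.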

Bias, Risks, and Limitations

A GPU is needed for this model.
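A quick check before loading the model can confirm that a GPU is visible. This is a minimal sketch that assumes a CUDA device is the target; the helper name `cuda_available` is hypothetical, and it returns False if torch is not installed at all.

```python
import importlib.util


def cuda_available() -> bool:
    """Return True if torch is installed and reports a usable CUDA device."""
    if importlib.util.find_spec("torch") is None:
        return False  # torch is not installed, so no GPU can be used
    import torch
    return torch.cuda.is_available()


if cuda_available():
    print("CUDA GPU detected.")
else:
    print("No CUDA GPU detected; loading the 4-bit model is likely to fail.")
```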

📦 Installation

To use this model, install the transformers library along with bitsandbytes for 4-bit quantization support. The usage example below also imports torch and librosa.
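As a rough way to verify the environment, the sketch below reports which packages are not importable. The package list is an assumption inferred from the imports in the usage example; accelerate is included because `device_map="auto"` relies on it. The helper name `missing_packages` is hypothetical.

```python
import importlib.util

# Assumption: this list is inferred from the usage example's imports;
# accelerate is added because device_map="auto" depends on it.
REQUIRED = ["torch", "transformers", "bitsandbytes", "accelerate", "librosa"]


def missing_packages(names):
    """Return the subset of top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages(REQUIRED)
if missing:
    print("Missing packages; try: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```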

💻 Usage Examples

Basic Usage

import torch
from io import BytesIO
from urllib.request import urlopen
import librosa
from transformers import Qwen2AudioForConditionalGeneration, AutoProcessor, BitsAndBytesConfig

processor = AutoProcessor.from_pretrained("alicekyting/Qwen2-Audio-7B-Instruct-4bit")
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16
)
model = Qwen2AudioForConditionalGeneration.from_pretrained(
    "alicekyting/Qwen2-Audio-7B-Instruct-4bit",
    device_map="auto",
    quantization_config=bnb_config
)

conversation = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3"},
        {"type": "text", "text": "What's that sound?"},
    ]},
    {"role": "assistant", "content": "It is the sound of glass shattering."},
    {"role": "user", "content": [
        {"type": "text", "text": "What can you do when you hear that?"},
    ]},
    {"role": "assistant", "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property."},
    {"role": "user", "content": [
        {"type": "audio", "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac"},
        {"type": "text", "text": "What does the person say?"},
    ]},
]
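# Render the conversation into a single prompt string using the model's chat template
text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
# Download and decode every audio clip referenced in the conversation, in order
audios = []
for message in conversation:
    if isinstance(message["content"], list):
        for ele in message["content"]:
            if ele["type"] == "audio":
                audios.append(
                    librosa.load(
                        BytesIO(urlopen(ele['audio_url']).read()),
                        sr=processor.feature_extractor.sampling_rate,  # resample to the rate the model expects
                        mono=True
                    )[0]
                )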

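# Tokenize the prompt and extract audio features in one call
inputs = processor(text=text, audios=audios, return_tensors="pt", padding=True)
# Move every tensor to the device the model was loaded on
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# max_new_tokens limits only the generated tokens; max_length would also count the prompt,
# which can already be long once the audio placeholders are expanded
generate_ids = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens so that only the new reply is decoded
generate_ids = generate_ids[:, inputs['input_ids'].size(1):]

response = processor.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)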
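For multi-turn use, the conversation list from the example above has to grow with each exchange. The helpers below are a minimal sketch of that bookkeeping and do not load the model; the function names and the example.com URL are hypothetical, and the message layout simply mirrors the one used in the example.

```python
def audio_urls(conversation):
    """Collect audio URLs in the order they appear in the conversation."""
    urls = []
    for message in conversation:
        content = message["content"]
        if isinstance(content, list):  # system/assistant turns use plain strings
            urls.extend(p["audio_url"] for p in content if p.get("type") == "audio")
    return urls


def add_user_turn(conversation, text, audio_url=None):
    """Return a new conversation with a user turn (optional audio plus text) appended."""
    parts = []
    if audio_url is not None:
        parts.append({"type": "audio", "audio_url": audio_url})
    parts.append({"type": "text", "text": text})
    return conversation + [{"role": "user", "content": parts}]


def add_assistant_turn(conversation, reply):
    """Return a new conversation with the model's reply appended."""
    return conversation + [{"role": "assistant", "content": reply}]


# Hypothetical example: the URL is a placeholder, not a real audio file.
chat = [{"role": "system", "content": "You are a helpful assistant."}]
chat = add_user_turn(chat, "What's that sound?", audio_url="https://example.com/clip.mp3")
chat = add_assistant_turn(chat, "It is the sound of glass shattering.")
chat = add_user_turn(chat, "What can you do when you hear that?")
print(audio_urls(chat))
```

Each helper returns a new list rather than mutating its input, so earlier conversation states stay available for retries.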

Advanced Usage

Refer to the Qwen2-Audio-7B-Instruct model page on Hugging Face for more advanced usage examples and code snippets.
