Dasheng-1.2B Open-Source Audio Encoder - Capturing Audio Information in Multiple Domains Such as Speech, Music, and Environmental Sounds

Dasheng 1.2B

Developed by mispeech

Dasheng is a general-purpose audio encoder trained with large-scale self-supervised learning. It captures rich audio information across multiple domains such as speech, music, and environmental sounds.

Audio Classification

Transformers

Open Source License: Apache-2.0

#Large-scale Audio Encoding #Multi-domain Audio Classification #Self-supervised Learning

Release Time: 6/6/2024

Model Overview

Dasheng is a general-purpose audio encoder with 1.2 billion parameters, trained on 272,356 hours of diverse audio data. It performs well on speech, music, and environmental sound classification tasks.

Model Features

Large-scale Training

Trained with 272,356 hours of diverse audio data

Multi-domain Applicability

Capable of processing various audio types including speech, music, and environmental sounds

High Performance

Outperforms previous work on the HEAR benchmark tasks CREMA-D, LibriCount, Speech Commands, and VoxLingua

General-purpose Encoder

Extracts audio embedding features suitable for various downstream tasks

Model Capabilities

Audio Feature Extraction

Speech Classification

Music Classification

Environmental Sound Classification

Audio Embedding Generation

Use Cases

Speech Processing

Speech Command Recognition

Recognize short speech commands

Outperforms previous work on the Speech Commands task of the HEAR benchmark

Speaker Counting

Count the number of speakers in audio

Outperforms previous work on the LibriCount task of the HEAR benchmark

Music Analysis

Music Classification

Classify music clips

Competitive with previous work on music classification tasks in the HEAR benchmark

Environmental Sound Analysis

Environmental Sound Recognition

Identify various sounds in the environment

Competitive with previous work on environmental sound classification tasks in the HEAR benchmark

🚀 Dasheng: a large-scale general-purpose audio encoder

Dasheng (Deep Audio-Signal Holistic Embeddings), or “大声” ("great sound"), is a general-purpose audio encoder trained with large-scale self-supervised learning. It captures rich audio information across domains such as speech, music, and environmental sounds. The model has 1.2 billion parameters and was trained on 272,356 hours of diverse audio data, and it shows significant performance gains on the HEAR benchmark. It outperforms previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and is competitive on music and environmental sound classification tasks.

🚀 Quick Start

This section walks through the basic steps of using Dasheng: installation, inference, and fine-tuning.

✨ Features

Large-scale Training: 1.2 billion parameters, trained on 272,356 hours of diverse audio data.
Multi-domain Adaptability: Captures audio information in speech, music, and environmental sounds.
High Performance: Shows significant performance gains on the HEAR benchmark and outperforms previous work on multiple datasets.

📦 Installation

You can install Dasheng using the following command:

pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_dasheng.git
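
After installation, it can help to confirm that the package is visible to the interpreter you plan to use. The sketch below assumes the top-level module is named `dasheng_model`, as the imports in the usage example suggest; the helper name `is_installed` is hypothetical and not part of the package.

```python
import importlib.util


def is_installed(module_name: str = "dasheng_model") -> bool:
    """Return True if `module_name` can be found on the current import path."""
    # find_spec returns None for a top-level module that is not installed.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed():
        print("dasheng_model is importable")
    else:
        print("dasheng_model not found; re-run the pip command above")
```

If the check fails inside a virtual environment, the usual cause is that `pip` and `python` point at different environments.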

💻 Usage Examples

Basic Usage

The following code demonstrates how to perform inference using Dasheng:

>>> model_name = "mispeech/dasheng-1.2B"

>>> from dasheng_model.feature_extraction_dasheng import DashengFeatureExtractor
>>> from dasheng_model.modeling_dasheng import DashengModel

>>> feature_extractor = DashengFeatureExtractor.from_pretrained(model_name)
>>> model = DashengModel.from_pretrained(model_name, outputdim=None)  # no linear output layer if `outputdim` is `None`

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> audio.shape
torch.Size([1, 16000])   # mono audio of 1 second

>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
>>> inputs.input_values.shape
torch.Size([1, 64, 101])   # 64 mel-filterbanks, 101 frames

>>> import torch
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> outputs.hidden_states.shape
torch.Size([1, 25, 768])   # 25 T-F patches (patch size 64x4, no overlap), before mean-pooling

>>> outputs.logits.shape
torch.Size([1, 768])   # mean-pooled embedding (would be logits from a linear layer if `outputdim` was set)
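
The frame and patch counts in the example follow from simple arithmetic. The sketch below reproduces them under one assumption the source does not state: a hop of 160 samples (10 ms at 16 kHz) with centered framing, which is consistent with the 101 frames shown. The patch width of 4 frames without overlap comes from the comment in the example. The function names are hypothetical, and the toy vectors only stand in for real hidden states.

```python
def num_frames(num_samples: int, hop_length: int = 160) -> int:
    """Frames produced by centered framing (assumed hop of 160 samples)."""
    return 1 + num_samples // hop_length


def num_patches(frames: int, patch_width: int = 4) -> int:
    """Non-overlapping patches along time; leftover frames are dropped."""
    return frames // patch_width


def mean_pool(patches: list[list[float]]) -> list[float]:
    """Average patch vectors over the time axis into one embedding."""
    count = len(patches)
    return [sum(column) / count for column in zip(*patches)]


frames = num_frames(16000)       # 1 second of 16 kHz audio
patches = num_patches(frames)
print(frames, patches)           # 101 25

# Toy stand-in for hidden states: 25 patches with 3 features each.
toy_hidden = [[float(i), 1.0, 2.0 * i] for i in range(patches)]
print(mean_pool(toy_hidden))     # [12.0, 1.0, 24.0]
```

Under the same assumptions, a 10-second clip would give 1001 frames and 250 patches, so the pooled embedding keeps a fixed size regardless of clip length.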

Advanced Usage

You can fine-tune Dasheng on your own dataset. A Colab notebook, linked from the original model card, demonstrates how to train a linear head on the ESC-50 dataset with the Dasheng encoder frozen.
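
As a rough illustration of what training a linear head on a frozen encoder involves (this is not the notebook's code), the sketch below fits a softmax classifier on fixed vectors using only the Python standard library. The 2-dimensional toy vectors are stand-ins for Dasheng embeddings, which in practice would be computed once with the encoder under `torch.no_grad()` and never updated. All function names and the toy data are hypothetical.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logits(weights, biases, x):
    """Linear head: one dot product plus bias per class."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def train_linear_head(embeddings, labels, num_classes, epochs=50, lr=0.5):
    """Fit a linear classifier with plain SGD on fixed embeddings."""
    dim = len(embeddings[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    biases = [0.0] * num_classes
    for _ in range(epochs):
        for x, y in zip(embeddings, labels):
            probs = softmax(logits(weights, biases, x))
            for c in range(num_classes):
                # Gradient of cross-entropy with respect to the class-c logit.
                grad = probs[c] - (1.0 if c == y else 0.0)
                biases[c] -= lr * grad
                for j in range(dim):
                    weights[c][j] -= lr * grad * x[j]
    return weights, biases


def predict(weights, biases, x) -> int:
    """Index of the highest-scoring class."""
    scores = logits(weights, biases, x)
    return scores.index(max(scores))


# Toy "frozen embeddings": two well-separated classes.
train_x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
train_y = [0, 0, 1, 1]
weights, biases = train_linear_head(train_x, train_y, num_classes=2)
print([predict(weights, biases, x) for x in train_x])  # [0, 0, 1, 1]
```

Because only the head's weights and biases change, embeddings can be cached to disk after a single pass through the encoder, which keeps training cheap even for a 1.2-billion-parameter model.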

📚 Documentation

Original Repository: https://github.com/XiaoMi/dasheng
The model's performance on the HEAR benchmark is shown in the following image:

📄 License

This project is licensed under the Apache-2.0 license.

📖 Citation

If you find Dasheng useful in your research, please consider citing the following paper:

@inproceedings{dinkel2023scaling,
  title={Scaling up masked audio encoder learning for general audio classification},
  author={Dinkel, Heinrich and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun and Wang, Bin},
  booktitle={Interspeech 2024},
  year={2024}
}
