Dasheng Base: Open-Source Audio Encoder for Speech, Music, and Ambient Sound

Dasheng Base

Developed by mispeech

Large-scale general-purpose audio encoder trained via self-supervised learning, capable of processing multi-domain audio information including speech, music, and environmental sounds

Audio Classification

Transformers

Open Source License: Apache-2.0

Tags: Multi-domain audio encoding, Self-supervised pre-training, 1.2 billion parameter large model

Downloads 273

Release Time: 6/6/2024

Model Overview

Dasheng is a general-purpose audio encoder trained on large-scale self-supervised learning tasks, designed to capture rich audio information across multiple domains such as speech, music, and environmental sounds.

Model Features

Large-scale training

Training data covers 272,356 hours of diverse audio

Multi-domain applicability

Capable of processing various audio types including speech, music, and environmental sounds

High performance

Shows significant performance gains on the HEAR benchmark, outperforming previous work on CREMA-D, LibriCount, Speech Commands, and VoxLingua

Model Capabilities

Audio feature extraction

Speech classification

Music classification

Environmental sound classification

Audio embedding generation

Use Cases

Speech processing

Speech command recognition

Used for identifying speech commands

Outperforms previous work on the Speech Commands task

Spoken language identification

Used for identifying the language spoken in an utterance

Outperforms previous work on the VoxLingua task

Music analysis

Music classification

Classifying music genres

Competitive performance on music classification tasks

Environmental sound analysis

Environmental sound classification

Classifying environmental sounds

Competitive performance on environmental sound classification tasks

🚀 Dasheng: a large scale general-purpose audio encoder

Dasheng (Deep Audio-Signal Holistic Embeddings), or "大声" ("great sound"), is a general-purpose audio encoder. It is trained on a large-scale self-supervised learning task, aiming to capture rich audio information across various domains such as speech, music, and environmental sounds. Trained on 272,356 hours of diverse audio data with 1.2 billion parameters, the model shows significant performance gains on the HEAR benchmark. Dasheng outperforms previous works on CREMA-D, LibriCount, Speech Commands, and VoxLingua, and also competes well in music and environmental sound classification tasks.

🚀 Quick Start

✨ Features

Rich Audio Information Capture: Dasheng can capture rich audio information across multiple domains, including speech, music, and environmental sounds.
Large-scale Training: Trained on 272,356 hours of diverse audio data with 1.2 billion parameters.
High Performance: Shows significant performance gains on the HEAR benchmark and outperforms previous works on multiple datasets.

📦 Installation

pip install git+https://github.com/jimbozhang/hf_transformers_custom_model_dasheng.git

💻 Usage Examples

Basic Usage

>>> model_name = "mispeech/dasheng-base"

>>> from dasheng_model.feature_extraction_dasheng import DashengFeatureExtractor
>>> from dasheng_model.modeling_dasheng import DashengModel

>>> feature_extractor = DashengFeatureExtractor.from_pretrained(model_name)
>>> model = DashengModel.from_pretrained(model_name, outputdim=None)  # no linear output layer if `outputdim` is `None`

>>> import torchaudio
>>> audio, sampling_rate = torchaudio.load("resources/JeD5V5aaaoI_931_932.wav")
>>> assert sampling_rate == 16000
>>> audio.shape
torch.Size([1, 16000])   # mono audio of 1 second

>>> inputs = feature_extractor(audio, sampling_rate=sampling_rate, return_tensors="pt")
>>> inputs.input_values.shape
torch.Size([1, 64, 101])   # 64 mel-filterbanks, 101 frames

>>> import torch
>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> outputs.hidden_states.shape
torch.Size([1, 25, 768])   # 25 T-F patches (patch size 64x4, no overlap), before mean-pooling

>>> outputs.logits.shape
torch.Size([1, 768])   # mean-pooled embedding (would be logits from a linear layer if `outputdim` was set)
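
The shapes in the session above can be checked by hand. The following is a minimal sketch of that arithmetic, not the library's own code. It assumes a hop length of 160 samples (10 ms at 16 kHz) with a centered STFT, which is consistent with the 101 frames shown above but is not stated in the source. The function names `num_frames` and `num_patches` are hypothetical helpers introduced for illustration.

```python
def num_frames(num_samples: int, hop_length: int = 160) -> int:
    """Frame count for a centered STFT: one frame per hop, plus one.

    hop_length=160 is an assumption (10 ms at 16 kHz), not taken from the source.
    """
    return num_samples // hop_length + 1


def num_patches(frames: int, patch_width: int = 4) -> int:
    """Non-overlapping patches along time; leftover frames are dropped.

    Each patch spans all 64 mel bins (patch size 64x4), so there is a single
    patch along the frequency axis and the total equals the count along time.
    """
    return frames // patch_width


frames = num_frames(16000)     # 1 second of audio at 16 kHz
patches = num_patches(frames)
print(frames, patches)         # 101 25
```

Under the same assumptions, a 2-second clip (32,000 samples) would give 201 frames and 50 patches.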

Advanced Usage

You can fine-tune the model on your own dataset. The Colab notebook example_finetune_esc50.ipynb demonstrates how to train a linear head on the ESC-50 dataset with the Dasheng encoder frozen.
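
The idea of training a linear head on a frozen encoder can be sketched as follows. This is an illustrative outline under stated assumptions, not the contents of the notebook: `FrozenEncoderStub` is a hypothetical stand-in for the Dasheng encoder so that the example runs without downloading weights, and the input batch is random data rather than ESC-50 audio. The names `build_linear_probe` and `train_step` are also hypothetical.

```python
import torch
from torch import nn


class FrozenEncoderStub(nn.Module):
    """Hypothetical stand-in for the encoder: (batch, 64, frames) -> (batch, 768)."""

    def __init__(self, n_mels: int = 64, embed_dim: int = 768):
        super().__init__()
        self.proj = nn.Linear(n_mels, embed_dim)

    def forward(self, input_values: torch.Tensor) -> torch.Tensor:
        # Move mel bins to the last axis, project, then mean-pool over time.
        return self.proj(input_values.transpose(1, 2)).mean(dim=1)


def build_linear_probe(encoder: nn.Module, embed_dim: int, num_classes: int) -> nn.Linear:
    """Freeze every encoder parameter and return a trainable linear head."""
    for param in encoder.parameters():
        param.requires_grad = False
    encoder.eval()  # keep layers such as dropout in inference mode
    return nn.Linear(embed_dim, num_classes)


def train_step(encoder, head, optimizer, features, labels) -> float:
    """One optimization step that updates only the linear head."""
    with torch.no_grad():  # no gradients flow into the frozen encoder
        embeddings = encoder(features)
    loss = nn.functional.cross_entropy(head(embeddings), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


torch.manual_seed(0)
encoder = FrozenEncoderStub()
head = build_linear_probe(encoder, embed_dim=768, num_classes=50)  # ESC-50 has 50 classes
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

features = torch.randn(8, 64, 101)   # random stand-in for a batch of mel-spectrograms
labels = torch.randint(0, 50, (8,))  # random stand-in for class labels
loss = train_step(encoder, head, optimizer, features, labels)
```

With the real model, the embeddings would presumably come from `model(**inputs).logits` with `outputdim=None`, as in the basic usage example, in place of the stub's output.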

📚 Documentation

Original Repository

https://github.com/RicherMans/Dasheng

📄 License

This project is licensed under the Apache-2.0 license.

📚 Citation

If you find Dasheng useful in your research, please consider citing the following paper:

@inproceedings{dinkel2023scaling,
  title={Scaling up masked audio encoder learning for general audio classification},
  author={Dinkel, Heinrich and Yan, Zhiyong and Wang, Yongqing and Zhang, Junbo and Wang, Yujun and Wang, Bin},
  booktitle={Interspeech 2024},
  year={2024}
}
