🚀 Model for Age and Gender Recognition based on Wav2vec 2.0 (6 layers)
This model recognizes age and gender from audio signals. It is based on Wav2vec 2.0, reduced to six transformer layers.
The model takes a raw audio signal as input and outputs an age prediction on a 0...1 scale (corresponding to 0...100 years) together with gender probabilities for child, female, and male. In addition, it provides the pooled states of the last transformer layer.
The model was created by fine-tuning Wav2Vec2-Large-Robust on aGender, Mozilla Common Voice, TIMIT, and VoxCeleb2. For this version of the model, only the first six transformer layers were trained.
An ONNX export of the model is available from doi:10.5281/zenodo.7761387. Further details are given in the associated paper and tutorial.
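To run the ONNX export directly, a minimal sketch with onnxruntime could look as follows. The file name model.onnx and the assumption that the graph takes the raw 16 kHz signal as its single input are not confirmed by the original card; inspect session.get_inputs() and session.get_outputs() for the actual names, and apply the same normalization as the Wav2Vec2Processor if the graph expects it.

```python
import numpy as np
import onnxruntime

# Path to the model file downloaded from the Zenodo record (name assumed)
session = onnxruntime.InferenceSession('model.onnx')

signal = np.zeros((1, 16000), dtype=np.float32)  # 1 s of silence at 16 kHz
input_name = session.get_inputs()[0].name
# Outputs assumed to mirror the PyTorch model: hidden states, age, gender
hidden_states, age, gender = session.run(None, {input_name: signal})
```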
📦 Installation
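The usage example below needs numpy, torch, and transformers; a typical setup (assuming pip) is:

```bash
pip install numpy torch transformers
```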
✨ Features
- Input: Raw audio signal.
- Output: Age prediction (0 - 1 representing 0 - 100 years), gender probability (child, female, male), and pooled states of the last transformer layer.
- Training: Fine-tuned from Wav2Vec2-Large-Robust on multiple datasets; only the first six transformer layers were trained.
- Export: ONNX export available.
💻 Usage Examples
Basic Usage
```python
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)


class ModelHead(nn.Module):
    r"""Classification head."""

    def __init__(self, config, num_labels):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class AgeGenderModel(Wav2Vec2PreTrainedModel):
    r"""Age and gender classifier."""

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.age = ModelHead(config, 1)     # one output: age on a 0...1 scale
        self.gender = ModelHead(config, 3)  # three outputs: child, female, male
        self.init_weights()

    def forward(
            self,
            input_values,
    ):
        outputs = self.wav2vec2(input_values)
        hidden_states = outputs[0]
        # Average pool over time to get one embedding per signal
        hidden_states = torch.mean(hidden_states, dim=1)
        logits_age = self.age(hidden_states)
        logits_gender = torch.softmax(self.gender(hidden_states), dim=1)
        return hidden_states, logits_age, logits_gender


# Load model from hub
device = 'cpu'
model_name = 'audeering/wav2vec2-large-robust-6-ft-age-gender'
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = AgeGenderModel.from_pretrained(model_name)

# Dummy signal: one second of silence at 16 kHz
sampling_rate = 16000
signal = np.zeros((1, sampling_rate), dtype=np.float32)


def process_func(
    x: np.ndarray,
    sampling_rate: int,
    embeddings: bool = False,
) -> np.ndarray:
    r"""Predict age and gender or extract embeddings from raw audio signal."""
    # Normalize the signal and convert to a batched tensor
    y = processor(x, sampling_rate=sampling_rate)
    y = y['input_values'][0]
    y = y.reshape(1, -1)
    y = torch.from_numpy(y).to(device)

    # Run through the model
    with torch.no_grad():
        y = model(y)
        if embeddings:
            y = y[0]  # pooled states of the last transformer layer
        else:
            # Stack predictions: [age, child, female, male]
            y = torch.hstack([y[1], y[2]])

    y = y.detach().cpu().numpy()
    return y


print(process_func(signal, sampling_rate))
print(process_func(signal, sampling_rate, embeddings=True))
```
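The first call returns an array of shape (1, 4): the age estimate followed by the gender probabilities in the order child, female, male. Multiplying the age value by 100 gives years. A small unpacking sketch:

```python
age, child, female, male = process_func(signal, sampling_rate)[0]
print(f'age: {100 * age:.0f} years, '
      f'child: {child:.2f}, female: {female:.2f}, male: {male:.2f}')
```

The second call returns the pooled states of the last transformer layer, which can serve as utterance-level embeddings for downstream tasks.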
Advanced Usage
The basic usage example covers most scenarios. You can further customize the model by modifying the ModelHead or AgeGenderModel classes according to your specific requirements, for instance with a replacement head as sketched below.
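As an illustration, a hypothetical deeper head might look like this. The extra hidden layer is an assumption for demonstration, not part of the released model; swapping in such a head discards the pretrained head weights, so it would need to be retrained on labeled data.

```python
class DeepModelHead(nn.Module):
    r"""Hypothetical two-layer head (illustration only)."""

    def __init__(self, config, num_labels):
        super().__init__()
        self.dense1 = nn.Linear(config.hidden_size, config.hidden_size)
        self.dense2 = nn.Linear(config.hidden_size, config.hidden_size // 2)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size // 2, num_labels)

    def forward(self, features, **kwargs):
        x = self.dropout(features)
        x = torch.tanh(self.dense1(x))
        x = self.dropout(x)
        x = torch.tanh(self.dense2(x))
        return self.out_proj(x)
```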
📚 Documentation
Datasets
- aGender
- Mozilla Common Voice
- TIMIT
- VoxCeleb2
Inference
Inference is supported (inference: true).
Tags
- speech
- audio
- wav2vec2
- audio-classification
- age-recognition
- gender-recognition
📄 License
This model is licensed under the CC BY-NC-SA 4.0 license.