Model for Age and Gender Recognition based on Wav2vec 2.0 (24 layers)
This model takes a raw audio signal as input and outputs predictions for age (roughly 0...1, corresponding to 0...100 years) and gender, expressed as the probability of the speaker being a child, female, or male. In addition, it provides the pooled states of the last transformer layer.
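To make the outputs concrete: an age score of 0.42 corresponds to roughly 42 years, and the predicted gender is simply the most probable of the three classes. A minimal sketch of how the outputs might be interpreted (the values are made up, and the class order follows the description above):

```python
import numpy as np

# Hypothetical model outputs: an age score in 0...1 and three gender probabilities
age_score = 0.42                             # roughly 42 years
gender_probs = np.array([0.05, 0.80, 0.15])  # assumed order: child, female, male

age_years = age_score * 100
labels = ['child', 'female', 'male']
print(f'age: ~{age_years:.0f} years, gender: {labels[int(np.argmax(gender_probs))]}')
```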
🚀 Quick Start
The model was created by fine-tuning Wav2Vec2-Large-Robust on aGender, Mozilla Common Voice, Timit and Voxceleb 2. For this version of the model, all 24 transformer layers were trained. An ONNX export of the model is available from doi:10.5281/zenodo.7761387. Further details are given in the associated paper and tutorial.
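With the ONNX export, inference can be run via ONNX Runtime. A minimal sketch, not taken from the model card: the file name is a placeholder for whatever the Zenodo archive contains, and the graph's input name is queried rather than assumed. Depending on how the model was exported, the normalization otherwise done by `Wav2Vec2Processor` may still have to be applied first.

```python
import numpy as np
import onnxruntime as ort

# 'model.onnx' is a placeholder for the file from the Zenodo archive
session = ort.InferenceSession('model.onnx')

# Ask the exported graph for its actual input name instead of guessing it
input_name = session.get_inputs()[0].name

# One second of silence at 16 kHz as a dummy input
signal = np.zeros((1, 16000), dtype=np.float32)
outputs = session.run(None, {input_name: signal})
print([o.shape for o in outputs])  # hidden states, age, gender
```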
✨ Features
- Input & Output: Expects a raw audio signal as input and outputs age and gender predictions along with the pooled states of the last transformer layer.
- Fine-Tuning: Fine-tuned on multiple datasets: aGender, Mozilla Common Voice, Timit, and Voxceleb 2.
- ONNX Export: An ONNX export of the model is available.
📦 Installation
No installation instructions are provided in the original model card.
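That said, the usage example below only imports `numpy`, `torch`, and `transformers`, so a standard `pip install numpy torch transformers` should be sufficient (plus `onnxruntime` if you want to run the ONNX export).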
💻 Usage Examples
Basic Usage

```python
import numpy as np
import torch
import torch.nn as nn
from transformers import Wav2Vec2Processor
from transformers.models.wav2vec2.modeling_wav2vec2 import (
    Wav2Vec2Model,
    Wav2Vec2PreTrainedModel,
)


class ModelHead(nn.Module):
    r"""Classification head."""

    def __init__(self, config, num_labels):
        super().__init__()
        self.dense = nn.Linear(config.hidden_size, config.hidden_size)
        self.dropout = nn.Dropout(config.final_dropout)
        self.out_proj = nn.Linear(config.hidden_size, num_labels)

    def forward(self, features, **kwargs):
        x = features
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x


class AgeGenderModel(Wav2Vec2PreTrainedModel):
    r"""Age and gender classifier."""

    def __init__(self, config):
        super().__init__(config)
        self.config = config
        self.wav2vec2 = Wav2Vec2Model(config)
        self.age = ModelHead(config, 1)     # single regression output for age
        self.gender = ModelHead(config, 3)  # child / female / male
        self.init_weights()

    def forward(self, input_values):
        outputs = self.wav2vec2(input_values)
        # Pool the last transformer layer by averaging over time
        hidden_states = torch.mean(outputs[0], dim=1)
        logits_age = self.age(hidden_states)
        logits_gender = torch.softmax(self.gender(hidden_states), dim=1)
        return hidden_states, logits_age, logits_gender


# Load the model from the hub
device = 'cpu'
model_name = 'audeering/wav2vec2-large-robust-24-ft-age-gender'
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = AgeGenderModel.from_pretrained(model_name)

# Dummy signal: one second of silence at 16 kHz
sampling_rate = 16000
signal = np.zeros((1, sampling_rate), dtype=np.float32)


def process_func(
    x: np.ndarray,
    sampling_rate: int,
    embeddings: bool = False,
) -> np.ndarray:
    r"""Predict age and gender or extract embeddings from raw audio signal."""
    # Normalize the signal and convert it to a batched tensor
    y = processor(x, sampling_rate=sampling_rate)
    y = y['input_values'][0]
    y = y.reshape(1, -1)
    y = torch.from_numpy(y).to(device)

    # Run through the model
    with torch.no_grad():
        y = model(y)
        if embeddings:
            y = y[0]
        else:
            y = torch.hstack([y[1], y[2]])

    # Convert to numpy
    y = y.detach().cpu().numpy()

    return y


# Age score (0...1) followed by the three gender probabilities
print(process_func(signal, sampling_rate))

# Pooled states of the last transformer layer, shape (1, 1024)
print(process_func(signal, sampling_rate, embeddings=True))
```
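The zero signal above is just a placeholder. For real audio, the input has to be mono and sampled at 16 kHz; a minimal sketch using `librosa` (an assumption, any loader that resamples works, and the file path is illustrative):

```python
import librosa

# Load an audio file as mono 16 kHz; librosa resamples automatically
signal, _ = librosa.load('speech.wav', sr=16000, mono=True)
signal = signal.reshape(1, -1)

print(process_func(signal, 16000))                   # age and gender
print(process_func(signal, 16000, embeddings=True))  # embeddings
```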
📚 Documentation
| Property | Details |
|----------|---------|
| Datasets | agender, mozillacommonvoice, timit, voxceleb2 |
| Inference | true |
| Tags | speech, audio, wav2vec2, audio-classification, age-recognition, gender-recognition |
| License | cc-by-nc-sa-4.0 |
| Base Model | facebook/wav2vec2-large-robust |
📄 License
The model is licensed under cc-by-nc-sa-4.0.