# Voice Model for Dental Click Identification

This model uses the Wav2vec2 architecture to identify dental click utterances in speech, offering high accuracy on a limited dataset.
## 🚀 Quick Start

This model can be used via the `transformers` library or the Hugging Face hosted Inference API.
## ⚠️ Important Note

Do not use the 'Record from browser' option, as the model may misidentify mouse clicks as speech utterances. Audio files for upload should be 1 second long, in WAV format, with 16-bit signed integer PCM encoding.
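To check a clip against these constraints before uploading, something like the following can be used (a minimal sketch assuming the `soundfile` package; the file name is a placeholder):

```python
import soundfile as sf

# Inspect the container format, sample encoding, and duration of the clip.
info = sf.info("your_audio_file.wav")  # placeholder path

assert info.format == "WAV", "file must be a WAV container"
assert info.subtype == "PCM_16", "samples must be 16-bit signed integer PCM"
assert abs(info.duration - 1.0) < 0.01, "clip should be 1 second long"
```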
## ✨ Features

- Specific Task: Trained for keyword spotting, specifically identifying dental click utterances in speech.
- Limited Training: Trained on a limited quantity of speech (~1.5 hours) from a single speaker.
- High Accuracy: Achieved 97% accuracy on a 20% hold-out test set.
## 📦 Installation

No specific installation steps are provided in the original README.
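The usage examples below assume the following packages (an assumed minimal set, since the original README lists none):

```bash
pip install transformers torch soundfile datasets
```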
## 💻 Usage Examples

### Basic Usage
```python
from transformers import AutoModelForAudioClassification, AutoFeatureExtractor
import soundfile as sf
import torch

model_name = "your_model_name"  # replace with the model's Hub ID
model = AutoModelForAudioClassification.from_pretrained(model_name)
feature_extractor = AutoFeatureExtractor.from_pretrained(model_name)

# The feature extractor expects a waveform array, not a file path;
# soundfile is one way to load it (loading method not specified in the original).
audio_path = "your_audio_file.wav"
speech, sampling_rate = sf.read(audio_path)

inputs = feature_extractor(speech, sampling_rate=sampling_rate, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

predicted_class_id = logits.argmax(-1).item()
predicted_label = model.config.id2label[predicted_class_id]
print(f"Predicted label: {predicted_label}")
```
### Advanced Usage

```python
from transformers import TrainingArguments, Trainer

# Hyperparameters as given in the original README.
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    warmup_steps=500,
    weight_decay=0.01,
    logging_dir='./logs',
    logging_steps=10,
)

# Placeholders: the labeled click/non-click datasets are not published here.
train_dataset = ...
eval_dataset = ...

# `model` is the audio classification model loaded in the Basic Usage example.
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
```
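The README does not show how the reported accuracy was computed; one conventional approach is to pass a `compute_metrics` callback to the `Trainer` (a minimal NumPy sketch, not taken from the original):

```python
import numpy as np

def compute_metrics(eval_pred):
    # eval_pred.predictions holds the logits; eval_pred.label_ids the true labels.
    predictions = np.argmax(eval_pred.predictions, axis=-1)
    return {"accuracy": float((predictions == eval_pred.label_ids).mean())}

# Usage: Trainer(..., compute_metrics=compute_metrics), then trainer.evaluate().
```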
## 📚 Documentation

### Model Description

The model utilizes the Wav2vec2 architecture trained on the SUPERB dataset for the keyword spotting task. It was fine-tuned to identify [dental click utterances](https://en.wikipedia.org/wiki/Dental_click) in speech.

The model was trained for 10 epochs on a limited quantity of speech (~1.5 hours) from a single speaker. It should therefore not be assumed to generalize to other speakers or languages without further training data or rigorous testing.

The model was evaluated for accuracy on a hold-out test set of 20% of the available data and scored 97%.
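For reference, a 20% hold-out split like the one described above can be produced with the `datasets` library (a sketch; the toy data is illustrative, since the actual dataset is not published):

```python
from datasets import Dataset

# Toy stand-in for the labeled clips; the real data is not released with this model.
dataset = Dataset.from_dict({
    "file": ["clip_0.wav", "clip_1.wav", "clip_2.wav", "clip_3.wav", "clip_4.wav"],
    "label": [1, 0, 1, 0, 1],
})

# 80/20 split, matching the hold-out evaluation used for the reported accuracy.
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_dataset, eval_dataset = splits["train"], splits["test"]
```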
## 🔧 Technical Details

The model is based on the Wav2vec2 architecture. It was trained on the SUPERB dataset and then fine-tuned for the specific task of identifying dental click utterances. The limited training data (in both quantity and number of speakers) may affect its generalizability.
## 📄 License

No license information is provided in the original README.
| Property | Details |
|----------|---------|
| Model Type | Utilizes the Wav2vec2 architecture for keyword spotting and dental click identification |
| Training Data | SUPERB dataset, plus limited speech data (~1.5 hours) from one speaker |