# 🚀 Automatic Speech Recognition with Moonshine Base
This project leverages the Moonshine base model for automatic speech recognition, using the Transformers.js library and ONNXRuntime.
## 🚀 Quick Start
This guide shows how to use the Moonshine base model for automatic speech recognition. You can choose between the Transformers.js library and ONNXRuntime, depending on your needs.
## ✨ Features
- Automatic Speech Recognition: Utilize the Moonshine base model to transcribe speech from audio files.
- Multiple Libraries Support: Use either JavaScript (Transformers.js) or Python (ONNXRuntime), depending on your development environment.
## 📦 Installation
If you haven't already, you can install the Transformers.js JavaScript library from NPM using:
```bash
npm i @huggingface/transformers
```
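For the Python/ONNXRuntime example further down, you will also need the Python packages it imports. A typical install (package names assumed from the imports shown in that example) is:

```bash
pip install onnxruntime transformers librosa numpy
```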
## 💻 Usage Examples
### Basic Usage with Transformers.js
```js
import { pipeline } from "@huggingface/transformers";

// Create an automatic speech recognition pipeline backed by the Moonshine base ONNX model
const transcriber = await pipeline("automatic-speech-recognition", "onnx-community/moonshine-base-ONNX");

// Transcribe an audio file (here, a remote WAV file)
const output = await transcriber("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav");
console.log(output);
```
### Advanced Usage with ONNXRuntime

If you prefer to run the ONNX models directly with Python, the script below runs the encoder once over the audio and then greedily decodes tokens one at a time, reusing the decoder's key/value cache between steps.
```python
import numpy as np
import onnxruntime as ort
from transformers import AutoConfig, AutoTokenizer
import librosa

# Load the model configuration and tokenizer
model_id = 'onnx-community/moonshine-base-ONNX'
config = AutoConfig.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Create ONNXRuntime sessions for the encoder and the merged decoder
encoder_session = ort.InferenceSession('./onnx/encoder_model_quantized.onnx')
decoder_session = ort.InferenceSession('./onnx/decoder_model_merged_quantized.onnx')

eos_token_id = config.eos_token_id
num_key_value_heads = config.decoder_num_key_value_heads
dim_kv = config.hidden_size // config.decoder_num_attention_heads

# Load the audio at 16 kHz and add a batch dimension -> shape (1, num_samples)
audio_file = 'jfk.wav'
audio = librosa.load(audio_file, sr=16_000)[0][None]

# Run the encoder once over the whole audio clip
encoder_outputs = encoder_session.run(None, dict(input_values=audio))[0]

# Start decoding from the decoder start token with an empty key/value cache
batch_size = encoder_outputs.shape[0]
input_ids = np.array([[config.decoder_start_token_id]] * batch_size)
past_key_values = {
    f'past_key_values.{layer}.{module}.{kv}': np.zeros([batch_size, num_key_value_heads, 0, dim_kv], dtype=np.float32)
    for layer in range(config.decoder_num_hidden_layers)
    for module in ('decoder', 'encoder')
    for kv in ('key', 'value')
}

# Cap generation at roughly 6 tokens per second of audio
max_len = min((audio.shape[-1] // 16_000) * 6, config.max_position_embeddings)

# Greedy autoregressive decoding
generated_tokens = input_ids
for i in range(max_len):
    use_cache_branch = i > 0
    logits, *present_key_values = decoder_session.run(None, dict(
        input_ids=generated_tokens[:, -1:],
        encoder_hidden_states=encoder_outputs,
        use_cache_branch=[use_cache_branch],
        **past_key_values,
    ))

    # Pick the most likely next token for each sequence in the batch
    next_tokens = logits[:, -1].argmax(-1, keepdims=True)

    # Update the cache: cross-attention ('encoder') entries are only set on the
    # first step; self-attention ('decoder') entries are updated every step.
    for j, key in enumerate(past_key_values):
        if not use_cache_branch or 'decoder' in key:
            past_key_values[key] = present_key_values[j]

    generated_tokens = np.concatenate([generated_tokens, next_tokens], axis=-1)
    if (next_tokens == eos_token_id).all():
        break

result = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print(result)
```
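The script above reads the quantized encoder and decoder ONNX files from a local `./onnx/` directory. As a minimal sketch, assuming the model repository stores the files under the same `onnx/` paths used above, you can fetch them with `huggingface_hub` and point the sessions at the returned cache paths:

```python
from huggingface_hub import hf_hub_download

model_id = 'onnx-community/moonshine-base-ONNX'

# Download (or reuse from the local cache) the quantized ONNX files.
# hf_hub_download returns the local filesystem path of each file.
encoder_path = hf_hub_download(model_id, 'onnx/encoder_model_quantized.onnx')
decoder_path = hf_hub_download(model_id, 'onnx/decoder_model_merged_quantized.onnx')

print(encoder_path)
print(decoder_path)
```

You can then pass `encoder_path` and `decoder_path` to `ort.InferenceSession(...)` instead of the hard-coded relative paths.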
## 📄 License
This project is licensed under the MIT License.