Hardware

Two hardware offerings:

- i.MX 93 Evaluation Kit (Evaluation and Development Boards) — Active
- i.MX 91 Evaluation Kit (Evaluation and Development Boards) — Active
Speech To Text (or automatic speech recognition) models can transcribe and translate speech into text. They can be used for AI assistants and conversational AI applications. We provide wrapping code for efficient streaming with Speech To Text models. For the i.MX 95 microprocessor unit (MPU), we support Neutron neural processing unit (NPU) acceleration.
Models supported: Moonshine-tiny, Moonshine-base, Whisper-tiny.en, Whisper-base.en, Whisper-small.en, and Whisper-medium.en.
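For streaming, audio is typically captured as 16 kHz mono PCM and fed to the model in fixed-length windows (the latency tables below use 3, 6, and 9 second inputs). A minimal chunking sketch, assuming a flat sequence of samples; `chunk_audio` is a hypothetical helper, not part of the shipped wrapping code:

```python
SAMPLE_RATE = 16_000  # Hz; Whisper and Moonshine both expect 16 kHz mono audio


def chunk_audio(samples, chunk_seconds=3.0, sample_rate=SAMPLE_RATE):
    """Split a flat sequence of PCM samples into fixed-length chunks for
    streaming transcription. The final chunk may be shorter than the rest."""
    step = int(chunk_seconds * sample_rate)
    return [samples[i:i + step] for i in range(0, len(samples), step)]


# Example: 7 seconds of silence splits into two 3 s chunks plus a 1 s remainder.
chunks = chunk_audio([0.0] * (7 * SAMPLE_RATE))
```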
| Model | Model size [parameters] | Weights format | WER* [%] | Cores | Time To First Token** [seconds] | Full transcription latency after 3 s of speech [seconds] | After 6 s [seconds] | After 9 s [seconds] | Library dependency |
|---|---|---|---|---|---|---|---|---|---|
| Moonshine-tiny | 27M | Q8 | 5.79 | 6x Cortex-A55 | 0.17 | 0.38 | 0.58 | 0.84 | ONNX |
| Moonshine-base | 61M | Q8 | 4.34 | 6x Cortex-A55 | 0.3 | 0.56 | 0.89 | 1.52 | ONNX |
| Whisper-tiny.en | 39M | Q8 | 7.11 | 6x Cortex-A55 | 0.17 | 0.34 | 0.53 | 0.83 | ONNX |
| Whisper-base.en | 74M | Q8 | 5.32 | 6x Cortex-A55 | 0.29 | 0.53 | 0.87 | 1.32 | ONNX |
| Whisper-small.en | 244M | Q8 | 3.77 | 6x Cortex-A55 | 0.61 | 1.16 | 2.1 | 2.76 | ONNX |
| Whisper-medium.en | 769M | Q8 | 3.93 | 6x Cortex-A55 | 1.73 | 3.08 | 4.85 | 7.34 | ONNX |
Profiling covers only the Speech To Text model and does not account for any voice activity detection (VAD) overhead.
*Computed in streaming mode on the LibriSpeech test-clean set.
**Time until text starts printing after 3 seconds of speech.
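The WER column above is the standard word error rate: the word-level edit distance between the reference and the hypothesis transcript, divided by the number of reference words. A self-contained sketch of the metric (an illustrative implementation, not the exact scoring code used for these tables):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by the
    number of words in the reference transcript."""
    ref = reference.split()
    hyp = hypothesis.split()
    # Rolling-row dynamic program over the edit-distance table.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev = d[0]          # d[i-1][j-1] from the previous row
        d[0] = i
        for j, h in enumerate(hyp, 1):
            cur = d[j]       # d[i-1][j] before overwriting
            d[j] = min(cur + 1,            # deletion
                       d[j - 1] + 1,       # insertion
                       prev + (r != h))    # substitution (or match)
            prev = cur
    return d[-1] / len(ref)


# One substitution out of three reference words gives WER = 1/3.
score = wer("he sat by the fire", "he sat by the choir")
```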
| Model | Model size [parameters] | Weights format | WER* [%] | Cores | Time To First Token** [seconds] | Full transcription latency after 3 s of speech [seconds] | After 6 s [seconds] | After 9 s [seconds] | Library dependency |
|---|---|---|---|---|---|---|---|---|---|
| Moonshine-tiny | 27M | Q8 | 5.79 | 2x Cortex-A55 | 0.29 | 0.53 | 0.89 | 1.39 | ONNX |
| Moonshine-base | 61M | Q8 | 4.34 | 2x Cortex-A55 | 0.54 | 0.94 | 1.58 | 2.76 | ONNX |
| Whisper-tiny.en | 39M | Q8 | 7.11 | 2x Cortex-A55 | 0.27 | 0.53 | 0.93 | 1.38 | ONNX |
| Whisper-base.en | 74M | Q8 | 5.32 | 2x Cortex-A55 | 0.5 | 0.9 | 1.62 | 2.43 | ONNX |
| Whisper-small.en | 244M | Q8 | 3.77 | 2x Cortex-A55 | 1.45 | 2.47 | 4.75 | 6.45 | ONNX |
Profiling covers only the Speech To Text model and does not account for any potential VAD overhead.
*Computed in streaming mode on the LibriSpeech test-clean set.
**Time until text starts printing after 3 seconds of speech.
| Model | Model size [parameters] | Weights format | WER* [%] | Cores | Time To First Token** [seconds] | Full transcription latency after 3 s of speech [seconds] | After 6 s [seconds] | After 9 s [seconds] | Library dependency |
|---|---|---|---|---|---|---|---|---|---|
| Moonshine-tiny | 27M | Q8 | 5.79 | 4x Cortex-A53 | 0.23 | 0.45 | 0.78 | 1.12 | ONNX |
| Moonshine-base | 61M | Q8 | 4.34 | 4x Cortex-A53 | 0.42 | 0.76 | 1.28 | 1.99 | ONNX |
| Whisper-tiny.en | 39M | Q8 | 7.11 | 4x Cortex-A53 | 0.21 | 0.44 | 0.73 | 0.99 | ONNX |
| Whisper-base.en | 74M | Q8 | 5.32 | 4x Cortex-A53 | 0.42 | 0.73 | 1.27 | 1.74 | ONNX |
| Whisper-small.en | 244M | Q8 | 3.77 | 4x Cortex-A53 | 1.4 | 2.11 | 3.83 | 5.51 | ONNX |
Profiling covers only the Speech To Text model and does not account for any potential VAD overhead.
*Computed in streaming mode on the LibriSpeech test-clean set.
**Time until text starts printing after 3 seconds of speech.
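Time To First Token can be measured as the wall-clock time from submitting the audio until the decoder emits its first text token. A minimal timing sketch; `transcribe_stream` is a hypothetical generator that yields tokens as they decode, standing in for whichever streaming API you use:

```python
import time


def measure_ttft(transcribe_stream, audio_chunks):
    """Return (seconds until the first token, the first token), or
    (None, None) if the stream produces no tokens at all.

    `transcribe_stream` is assumed to be a generator function that
    consumes an iterable of audio chunks and yields text tokens.
    """
    start = time.perf_counter()
    for token in transcribe_stream(audio_chunks):
        return time.perf_counter() - start, token
    return None, None
```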