# EIQ FOR I.MX8

## **PUBLIC**

Raluca Popa, i.MX Systems Engineer, July 2020



SECURE CONNECTIONS FOR A SMARTER WORLD


NXP, THE NXP LOGO AND NXP SECURE CONNECTIONS FOR A SMARTER WORLD ARE TRADEMARKS OF NXP B.V. ALL OTHER PRODUCT OR SERVICE NAMES ARE THE PROPERTY OF THEIR RESPECTIVE OWNERS. © 2020 NXP B.V.





#### eIQ (2H20-2021)

- Optimize cloud vendor ecosystem (incl. model optimization)
- Model transfer learning and optimizations
- Model zoos: standard network models for ease of training/deployment

#### eIQ (2H20-2021)

- PyeIQ: PyArmNN, PyTFLite, PyONNX (now)
- Sample applications for rapid customer evaluation

#### eIQ (2020)

- Enhanced Open source inference engines
- MPU: TensorFlow Lite, Arm NN, ONNX Runtime, OpenCV
- MCU: TensorFlow Lite, Glow
- Optimized targeting of CPU, GPU, NPU, DSP



#### MACHINE LEARNING USE CASES

[Figure: machine learning use cases plotted against required performance (GOPS to TOPS), grouped into Computer Vision, Speech Analysis, Image/Video Processing and Sequence Analysis, and mapped to the compute engines Cortex-M7, GPU, 4x Cortex-A53, 2x Cortex-A72 and ML Accelerator (NPU).]



#### EDGE COMPUTE ENABLER - SCALABLE INFERENCE

[Figure: inference time (log scale) across compute tiers: general-purpose MCU (e.g. Cortex®-Mx), high-compute MCU (e.g. Cortex®-M7), applications processor (e.g. Cortex®-Ax), multi-core (GHz+), GPU (OpenCL) / DSP complexes, and ML accelerators (incl. neural nets). Step improvements labelled in the figure: 5-10x, 6-8x, 5x and >10x.]

# i.MX Applications Processor Scalability



# i.MX 8 Series: Target Applications

## Advanced graphics, video, image processing, vision, audio and voice

## i.MX 8M Family

Advanced Computing, Audio/Video & Voice

















## i.MX 8X Family

Safety Certifiable & Efficient Performance









## i.MX 8 Family

Advanced Graphics, Vision & Performance















#### I.MX 8M PLUS MACHINE LEARNING COMPUTE ENGINES

#### Machine Learning Accelerator (1GHz)

Primary Use: Multi-camera classification/detection

#### Quad Arm® Cortex-A53 (1.8GHz)

Primary Use: Speech command recognition, object detection/classification

#### Cortex-M7+HiFi4 DSP (800MHz)

Primary Use: Keyword detection, sensor fusion

#### Bonus: 2-channel Image Signal Processor (ISP)

Primary Use: Scaling, dewarping, image enhancement



#### I.MX 8M PLUS NPU

#### **Programmable Engine Unit**

(1 instance on i.MX8M Plus)

128-bit vector processing.
INT 8/16/32b, FLOAT 16/32b.
Most flexible programmable unit, used to handle everything else.

#### **Vision Engine**

Provides advanced image processing functions.

#### **Universal Storage Cache**

(1 instance on i.MX8M Plus)

Local memory and L1 cache to pass data among NPU modules.



#### **Tensor Processing Core**

(3 instances on i.MX8M Plus) INT 8/16b, FLOAT 16b

Non-convolution layers.

Multi-lane processing for data shuffling, normalization, pooling/unpooling, LUT, etc.

Network pruning support, zero skipping, compression

#### **Neural Network Core**

(6 instances on i.MX8M Plus) 2.3 TOPS INT8

Handles convolution layers + ReLU + max pooling, and compute-bound fully connected layers.



# Comparison Between CPU and NPU Performance










#### INFERENCE EXAMPLE WITH TFLITE USING CPU

```
$ ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --max_num_runs=10
STARTING!
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Min warmup runs duration (seconds): [0.5]
Graph: [mobilenet_v1_1.0_224_quant.tflite]
. . .
[Overall] - Memory usage: max resident set size = 0.52422 MB, total malloc-ed size = 0.059494 MB
Average inference timings in us: Warmup: 155178, Init: 8951, no stats: 154802
```

[ONCE – init phase] Init time on CPU: ~9 ms

[ONCE – init phase] Warm-up time on CPU: ~155 ms

[Each inference run] Inference performance on CPU: ~155 ms

The benchmark application reports an average inference time of about 155 ms on the CPU.

#### INFERENCE EXAMPLE WITH TFLITE USING NPU

```
$ ./benchmark_model --graph=mobilenet_v1_1.0_224_quant.tflite --max_num_runs=10 --use_nnapi=true
STARTING!
Min num runs: [50]
Min runs duration (seconds): [1]
Max runs duration (seconds): [150]
Inter-run delay (seconds): [-1]
Num threads: [1]
Min warmup runs: [1]
Min warmup runs duration (seconds): [0.5]
[Overall] - Memory usage: max resident set size = 27.8477 MB, total malloc-ed size = 7.60637 MB
Average inference timings in us: Warmup: 7.8273e+06, Init: 29893, no stats: 3059.06
```

[ONCE – init phase] Init time on NPU: ~30 ms

[ONCE – init phase] Warm-up time on NPU: ~7.8 s (the log reports 7.8273e+06 us)

[Each inference run] Inference performance on NPU: ~3.1 ms

The benchmark application reports an average inference time of about 3.1 ms on the NPU.
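The timings quoted on the two slides above can be recovered from the `Average inference timings in us` summary lines. The following is a minimal sketch, assuming the summary line keeps the format shown in the logs; the helper name `parse_timings_us` is illustrative and not part of any NXP or TensorFlow tool.

```python
import re

# Summary lines copied from the CPU and NPU benchmark_model runs above.
CPU_LINE = "Average inference timings in us: Warmup: 155178, Init: 8951, no stats: 154802"
NPU_LINE = "Average inference timings in us: Warmup: 7.8273e+06, Init: 29893, no stats: 3059.06"


def parse_timings_us(line):
    """Return warm-up, init and inference times in microseconds (hypothetical helper)."""
    pattern = r"Warmup: ([\d.eE+-]+), Init: ([\d.eE+-]+), no stats: ([\d.eE+-]+)"
    match = re.search(pattern, line)
    if match is None:
        raise ValueError("not a benchmark_model summary line")
    warmup, init, inference = (float(group) for group in match.groups())
    return {"warmup": warmup, "init": init, "inference": inference}


cpu = parse_timings_us(CPU_LINE)
npu = parse_timings_us(NPU_LINE)

for name, t in (("CPU", cpu), ("NPU", npu)):
    print(f"{name}: init {t['init'] / 1e3:.1f} ms, "
          f"warm-up {t['warmup'] / 1e3:.1f} ms, "
          f"inference {t['inference'] / 1e3:.2f} ms")

# Ratio of steady-state inference times; init and warm-up are one-off costs.
speedup = cpu["inference"] / npu["inference"]
print(f"Steady-state speedup (CPU / NPU): {speedup:.1f}x")
```

Note that the NPU warm-up value is in microseconds like the others, so 7.8273e+06 us is several seconds, a one-off cost that is paid before the 3.1 ms steady-state inferences.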

#### NPU PERFORMANCE CLARIFICATION

### eIQ performance is measured as:

1. NPU inferences per second (hardware only)
   - Purely NPU execution time
2. eIQ SW inferences per second (includes SW stack overhead)
   - ML stack execution time
3. Sample end-to-end FPS (camera capture to display)
   - A measurement of SoC system performance

TFLite SW inference time = 3.1 ms per inference (roughly 327 inferences per second)
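Latency and inferences per second are reciprocal views of the same measurement. The sketch below converts the software-level latency from the NPU run into an IPS figure; the function name is illustrative, and the same conversion would apply to the hardware-only and end-to-end levels if their per-frame times were measured.

```python
def inferences_per_second(latency_us):
    """Convert an average per-inference latency in microseconds to inferences/s."""
    if latency_us <= 0:
        raise ValueError("latency must be positive")
    return 1e6 / latency_us


# "no stats" average from the TFLite NPU benchmark run (software-level timing).
sw_latency_us = 3059.06
sw_ips = inferences_per_second(sw_latency_us)
print(f"{sw_latency_us / 1e3:.1f} ms per inference = {sw_ips:.0f} inferences/s")
```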

#### **NPU PROFILING**

#### **U-Boot config**

Update mmcargs by adding `galcore.showArgs=1 galcore.gpuProfiler=1`:

```
u-boot=> editenv mmcargs
edit: setenv bootargs ${jh_clk} console=${console} root=${mmcroot} galcore.showArgs=1 galcore.gpuProfiler=1
u-boot=> boot
```

#### Yocto environment variables

```
export VSI_NN_LOG_LEVEL=0
export CNN_PERF=1
export NN_EXT_SHOW_PERF=1
export VIV_VX_DEBUG_LEVEL=1
export VIV_VX_PROFILE=1
```
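When profiling runs are scripted, the same variables can be set from Python for a single child process rather than exported in the shell. This is a minimal sketch: the variable names and values come from the list above, while `profiling_env`, `run_profiled` and the log file name are hypothetical, and it assumes the profiler writes its report to the process's stdout or stderr.

```python
import os
import subprocess

# Variables and values copied from the export list above.
PROFILING_VARS = {
    "VSI_NN_LOG_LEVEL": "0",
    "CNN_PERF": "1",
    "NN_EXT_SHOW_PERF": "1",
    "VIV_VX_DEBUG_LEVEL": "1",
    "VIV_VX_PROFILE": "1",
}


def profiling_env(base=None):
    """Return a copy of the environment with the profiling variables set."""
    env = dict(os.environ if base is None else base)
    env.update(PROFILING_VARS)
    return env


def run_profiled(command, log_path="npu_profile.log"):
    """Run `command` with profiling enabled; save stdout and stderr to log_path.

    Assumption: the profiling report appears on stdout/stderr of the process.
    """
    with open(log_path, "w") as log:
        result = subprocess.run(command, env=profiling_env(),
                                stdout=log, stderr=subprocess.STDOUT, check=False)
    return result.returncode


print(sorted(PROFILING_VARS))
```

On the board, a call such as `run_profiled(["./benchmark_model", "--graph=mobilenet_v1_1.0_224_quant.tflite", "--use_nnapi=true"])` would reuse the benchmark command shown earlier in this deck.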

#### Output

- Execution time
- Operators list and the NPU compute blocks where they were executed
- DDR bandwidth



# eIQ Performance Overview








## eIQ Target Device Details



|                 | Cortex-M        | DSP             | Cortex-A                           | GPU                 | NPU                |
|-----------------|-----------------|-----------------|------------------------------------|---------------------|--------------------|
| i.MX 8M Plus    | M7 (800 MHz)    | HiFi4 (800 MHz) | 4xA53 (1800 MHz)                   | GC7000UL (1000 MHz) | VIP9000 (1000 MHz) |
| i.MX 8QuadMax   | 2xM4F (266 MHz) | HiFi4 (x MHz)   | 2xA72 (1600 MHz); 4xA53 (1200 MHz) | 2xGC7000X (996 MHz) |                    |
| i.MX 8QuadXPlus | M4 (266 MHz)    | HiFi4           | 4xA35 (1200 MHz)                   | GC7000L (850 MHz)   |                    |
| i.MX 8M Quad    | M4 (266 MHz)    |                 | 4xA53 (1500 MHz); x32DDR           | GC7000L (800 MHz)   |                    |
| i.MX 8M Mini    | M4 (400 MHz)    |                 | 4xA53 (1800 MHz); x32DDR           |                     |                    |
| i.MX 8M Nano    | M7 (600 MHz)    |                 | 4xA53 (1500 MHz); x16DDR           | GC7000UL (600 MHz)  |                    |



#### I.MX 8M PLUS NPU COMPARED TO CPU PERFORMANCE

Quantized results – NPU is 5-15x faster than CPUs



#### NOTES:

1. FP results (not shown): the NPU is 5-12x slower than the CPUs.
2. SSD has post-processing overhead that is not part of the tail end of the model. After objects are detected, all the bounding-box information has to be processed and identified. SSD typically reports many boxes for the same object, so post-processing consumes CPU time.
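The duplicate-box cleanup described in note 2 is commonly done with non-maximum suppression (NMS). The slide does not name the algorithm used in the benchmarked pipeline, so the following is only an illustrative sketch of that common technique in plain Python; the boxes, scores and threshold are made-up values.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with a kept box.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Made-up detections: three overlapping boxes on one object, one box on another.
boxes = [(10, 10, 50, 50), (12, 12, 52, 52), (11, 9, 49, 51), (100, 100, 140, 140)]
scores = [0.90, 0.80, 0.75, 0.60]
print(non_max_suppression(boxes, scores))  # prints [0, 3]
```

The pairwise overlap checks grow with the number of candidate boxes, which is one reason this stage shows up as CPU time outside the accelerated part of the model.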


#### MOBILENET PERFORMANCE ACROSS I.MX8 COMPUTE UNITS



#### I.MX 8MQ 4XA53 COMPARED TO GC7000L



- CPUs are 1.4-6.3x faster than the GPU (8M Nano CPUs are 4.4-9.3x faster than the GPU; graph not shown)
- Use the GPU as an offload engine, not as a performance accelerator
- TF Lite is faster on quantized workloads; Arm NN is faster on floating-point workloads



# eIQ Demos - PyeIQ








#### **PYEIQ OVERVIEW**

#### PyeIQ - A Python Framework for eIQ on i.MX Processors

Easy to install

`$ pip3 install eiq-<version>.tar.gz`

Easy to run

```
root@imx8:~# cd /opt/eiq/demos
root@imx8:/opt/eiq/demos# python3 <demo_name>.py
root@imx8:/opt/eiq/demos# python3 <demo_name>.py --help
```

- Supports demos based on TensorFlow Lite (2.1.0) for image classification and object detection.
- Supports inference on the GPU/NPU and on the CPU.
- Currently supports file and camera input.
- Allows easy benchmarking.
- Sources are available on the Code Aurora Forum: https://source.codeaurora.org/external/imxsupport/pyeiq/



#### SAMPLE EXAMPLE - COMPARE PERFORMANCE BETWEEN CPU AND NPU/GPU

To run the demo app on the target, with a display and camera connected to the board, run `cd /opt/eiq/apps` followed by `python3 switch-demo.py`.

- [Input] Select compute unit: CPU/NPU.
- [Input] Select the image for inference.
- [Output] Model Name.
- [Output] Inference time.
- [Output] Top 5 Accuracy.
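The top-5 output listed above amounts to ranking the classifier's per-class scores and keeping the five highest. The sketch below shows that ranking step in plain Python; it is not taken from the PyeIQ sources, and the labels and scores are made-up values.

```python
def top_k(scores, labels, k=5):
    """Return the k (label, score) pairs with the highest scores, best first."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return [(label, score) for score, label in ranked[:k]]


# Made-up classifier output for a seven-class model.
labels = ["cat", "dog", "car", "tree", "bird", "boat", "cup"]
scores = [0.02, 0.61, 0.05, 0.01, 0.20, 0.08, 0.03]

for label, score in top_k(scores, labels):
    print(f"{label}: {score:.2f}")
```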



# Resources








#### **RESOURCES**

- Product page: i.MX 8M Plus applications processor
- 4K MIPI Camera for i.MX 8M Plus applications processor
- i.MX 8M Plus applications processor Fact Sheet
- Technology Blog: Why Add an ISP and Machine Learning to the i.MX 8M Family
- eIQ™ Machine Learning Software Development Environment
- Community: eIQ Software Community
- eIQ Security Toolkit
- Demos: PyeIQ




