# Benchmark
## Test Models
### mnist
mnist: 28x28x1 input, 4->8->16, pad valid
mnist_q_valid.h: 2.4KB Flash, 1.4KB RAM
Suitable for MCUs with >=16KB Flash, >=2KB RAM
### cifar
cifar: 32x32x3 input, 32->32->64->1024->10, 5x5 conv
cifar10_q.h: 89KB Flash, 11KB RAM
Suitable for MCUs with >=128KB Flash, >=20KB RAM
### vww96
vww96: VWW (Visual Wake Words) model based on MobileNet v1 0.25, 96x96x3 input
vww96_q.tmdl: 227KB Flash, 54KB RAM
Suitable for MCUs with >=256KB Flash, >=64KB RAM
https://mlcommons.org/en/inference-tiny-07/
### mbnet128
mbnet128: MobileNet v1 0.25, 128x128x3 input
mbnet128_0.25_q.tmdl: 485KB Flash, 96KB RAM
Suitable for MCUs with >=512KB Flash, >=128KB RAM
https://github.com/fchollet/deep-learning-models/releases
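
The suggested minimums above can be turned into a quick first-pass filter for a given chip. The sketch below is illustrative only: the function name `candidate_models` and the table layout are assumptions made for this example, not part of TinyMaix, and the thresholds are guidelines, so the actual fit still depends on how much Flash and RAM the rest of the firmware uses.

```python
# Suggested minimum resources per test model, in KB, as listed above:
# (model name, min Flash KB, min RAM KB)
MODEL_MINIMUMS = [
    ("mnist", 16, 2),
    ("cifar", 128, 20),
    ("vww96", 256, 64),
    ("mbnet128", 512, 128),
]


def candidate_models(flash_kb, ram_kb):
    """Return the test models whose suggested minimums the chip meets.

    Hypothetical helper for illustration; not a TinyMaix API.
    """
    return [
        name
        for name, min_flash, min_ram in MODEL_MINIMUMS
        if flash_kb >= min_flash and ram_kb >= min_ram
    ]


print(candidate_models(64, 20))     # a 64KB Flash / 20KB RAM part
print(candidate_models(1024, 320))  # a 1MB Flash / 320KB RAM part
```

With these thresholds, the first call lists only `mnist` and the second lists all four models.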
## Test Record
Model inference times are in ms.
Rows are sorted by performance, compared in priority order: mbnet128 > vww96 > cifar > mnist
> Note1: Arduino runs another, smaller mnist model due to limited memory.
> Note2: each entry records the inference time of the fastest model type on that chip; for example, the C906 entry uses the FP16 result.
> Note3: XXX means the model cannot run on that chip.
|Chip/Board |Core |Flash|RAM |Freq |mbnet|vww96|cifar|mnist|Note|
|--- |--- |--- |--- |--- |--- |--- |--- |--- |---|
|BL808's NPU |BLAI | 16MB|0.8+64MB| 320M| 5 | 3 | <1 | <1 ||
|i5-4590T |AMD64 |256GB| 8GB|2000M|7/24 |5/17 |0.9/4|0.04/<1|native/wasm|
|RK3399's A72|ARM A72 | 32GB| 4GB|1800M| 15 | 10 | 3 | 0.07||
|TH1520 |RV64V(C910)|16GB| 4GB|1848M| 17 | 10.4| 3.6 | 0.15|fp16|
|RK3399's A53|ARM A53 | 32GB| 4GB|1600M| 29 | 19 | 5 | 0.14||
|D1-H |RV64V |128GB| 2GB|1008M| 43 | 22 | 3.5 | 0.29||
|ZYNQ7010's A9|ARM A9 | 32GB| 1GB| 667M| 66 | 44 | 7.8 |0.34 ||
|BL808's C906|RV64V | 16MB|0.8+64MB| 480M| 81 | 57 | 10 | <1 ||
|HPM6750 |Andes D45<br>RV32IMAFDCP| 8MB|2MB+32MB| 816M| 174 | 112 | 33 | 0.45 ||
|STM32H750 |ARM CM7 | 1MB| 1024KB| 480M| 94 | 64 | 15 | <1 ||
|BL808's E907|RV32P | 16MB|0.8+64MB| 320M| 188 | 149 | 35 | <1 |mdl in psram|
|F1C200S |ARM926EJ-S| 16MB| 64MB| 608M| 213 | 145 | 38.5| 0.75||
|MT7621A |MIPS1004Kc|128MB| 128MB| 880M| 413 | 244 | 41 | 1 |need confirm|
|AT32F403A |ARM CM4 | 1MB| 96KB| 240M| 477 | 136 | 30 | 0.6 |mbnet in 224k ram mode|
|STM32G474RE |ARM CM4 |512KB| 128KB| 170M| XXX | 195 | 43 | 1 ||
|CH32V307 |RV32F<br>QingKe V4F|480KB| 128KB| 144M| XXX | 357 | 64 | 1 ||
|STM32F411CE |ARM CM4 |512KB| 128KB| 150M| 558 | 366 | 75 | 2 ||
| W801 |CSKYV2 ck804ef|2MB|160+128KB+8MB| 240M|606|364|52|<2| mdl in psram/(or sram if fit) |
|ESP32-S3 |Xtensa LX7| 8MB| 512KB| 240M| 610 | 381 | 86 | 5 |mdl in flash|
|LPC4337 |ARM CM4F | 1MB| 136KB| 204M| 654 | 627 | 91 | 3 |need confirm|
|XR806 |ARMv8-M<br>Star-MC1|2MB|288KB|160M|712|453| 104 | 1 ||
|ESP32 |Xtensa LX6| 4MB| 520KB| 240M| 755 | 476 | 132 | 2 |mdl in flash|
|ACM32F403 |ARM CM33 |512KB| 192KB| 180M| XXX | 458 | 139 | 2 ||
|STM32F767 |ARM CM7 | 2MB| 512KB| 216M| 869 | 640 | 185 | 3 |need confirm|
|SC5864B's DSP|HIFI3 | 32MB|8MB+192KB|160M| 805 |--- |--- |15 |need confirm|
|STM32L496 |ARM CM4 | 1MB| 320KB| 80M| 809 | 695 | 162 | 3 ||
|NRF52832 |ARM CM4 |512KB| 64KB| 64M|XXX |XXX | 217 | 4 ||
|ESP32-S2 |Xtensa LX7| 4MB| 320KB| 240M|1079 | 659 | 241 | 2.5|need confirm|
|RP2040 |ARM CM0+ | 16MB| 264KB| 280M|1211 | 716 | 200 | 2 |overclock 280M|
|CH32V203G6 |RV32<br>QingKe V4B| 32KB| 10KB| 144M| XXX | XXX | XXX | 2.5 ||
|ESP32-C3 |RV32 | 4MB| 400KB| 160M|2370 |1430 | 127 | 6 |mdl in flash|
|MM32F3270 |ARM CM3 |512KB| 128KB| 96M|3126 |2919 | 257 | 11 ||
|Nuclei N300 |RV32IMAFDC| 16MB| 512KB| 16M|XXX |2946 | 795 | 11 |need confirm|
|STM32F103RC |ARM CM3 |256KB| 48KB| 72M|XXX |XXX | 445 | 7 ||
|STM32F103C8 |ARM CM3 | 64KB| 20KB| 72M| XXX | XXX | XXX | 7 ||
|CH32V103 |RV32<br>QingKe V3A| 64KB| 20KB| 72M| XXX | XXX | XXX | 13 ||
|SAMD21G18 |ARM CM0+ |256KB| 32KB| 48M| XXX | XXX | 700 | 14 |seeed XIAO|
|APM32F072CB |ARM CM0+ | 128KB| 16KB| 48M| XXX | XXX | XXX | 17 ||
|STM32G030F6 |ARM CM0+ | 32KB| 8KB| 64M| XXX | XXX | XXX | 18 ||
|CM0(Kintex-7)|ARM CM0 | ---| 1024KB| 50M| XXX | XXX | 1362| 23 |Kintex-7|
|CH582 |RV32<br>QingKe V4A|448KB|32KB|60M|XXX|XXX |XXX | 31 ||
|STC32G12K128|80251 |128KB| 12KB| 35M| XXX | XXX | XXX | 37 ||
|PicoRV32(GW2A)|RV32 | 1MB| 64KB| 54M| XXX | XXX |26935| 385 |Tang Primer 20K|
|Atmega328 |AVR | 32KB| 2KB| 16M| XXX | XXX | XXX |50(*)||
Normalized to a 100 MHz clock to compare CPU efficiency, using the cifar model (normalized time = measured time * Freq in MHz / 100):
|Chip/Board |Core |cifar(ms)|
|--- |--- |--- |
|BL808's NPU |BLAI | 2 |
|D1-H |RV64V | 35 |
|BL808's C906|RV64V | 48 |
|RK3399's A72|ARM A72 | 52 |
|ZYNQ7010's A9|ARM A9 | 52 |
|TH1520 |RV64V(C910)| 66 |
|STM32H750 |ARM CM7 | 72 |
|AT32F403A |ARM CM4 | 72 |
|STM32G474RE |ARM CM4 | 73 |
|RK3399's A53|ARM A53 | 79 |
|CH32V307 |RV32 IMAC | 92 |
|BL808's E907|RV32P | 112 |
|STM32F411CE |ARM CM4 | 113 |
|W801 |CSKYV2 ck804ef| 125 |
|Nuclei N300 |RV32IMAFDC| 127 |
|STM32L496 |ARM CM4 | 130 |
|NRF52832 |ARM CM4 | 139 |
|XR806 |ARMv8-M<br>Star-MC1| 166 |
|ESP32-C3 |RV32 | 203 |
|ESP32-S3 |Xtensa LX7| 206 |
|F1C200S |ARM926EJ-S| 234 |
|MM32F3270 |ARM CM3 | 247 |
|ACM32F403 |ARM CM33 | 250 |
|ESP32 |Xtensa LX6| 317 |
|STM32F103RC |ARM CM3 | 320 |
|SAMD21G18 |ARM CM0+ | 336 |
|MT7621A |MIPS1004Kc| 360 |
|RP2040 |ARM CM0+ | 560 |
|ESP32-S2 |Xtensa LX7| 578 |
|CM0(Kintex-7)|ARM CM0 | 681 |
|PicoRV32(GW2A)|RV32 |14545|
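
The normalized figures in this table are consistent with scaling each measured cifar time by the chip's clock relative to 100 MHz. The sketch below reproduces a few rows under that assumption; the helper name `normalize_to_100mhz` is made up for this example, and the approach assumes inference time scales linearly with clock frequency, which ignores memory-speed effects.

```python
def normalize_to_100mhz(infer_ms, freq_mhz):
    """Scale a measured inference time to an assumed 100 MHz clock.

    Assumes time is inversely proportional to clock frequency.
    """
    return infer_ms * freq_mhz / 100.0


# (chip, measured cifar time in ms, clock in MHz), taken from the Test Record table
rows = [
    ("D1-H", 3.5, 1008),
    ("BL808's C906", 10, 480),
    ("STM32H750", 15, 480),
    ("RP2040", 200, 280),
]

for chip, infer_ms, freq_mhz in rows:
    print(f"{chip}: {normalize_to_100mhz(infer_ms, freq_mhz):.0f} ms")
```

For these four rows the results round to 35, 48, 72 and 560 ms, matching the table.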
# Infer Time & Input Size
mbnet inference time at different input sizes
BL808 C906 core at 480M, using RV64V with the FP16 model
|input size|infer time|
|--- |--- |
| 96x 96| 60ms|
|128x128| 81ms|
|160x160| 156ms|
|192x192| 183ms|
|224x224| 296ms|
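
One way to read this table is to divide each time by the number of input pixels, which shows how closely inference time tracks input area. This is only an illustrative calculation on the figures above; the helper name `ms_per_kilopixel` is an assumption for this example.

```python
def ms_per_kilopixel(side, infer_ms):
    """Inference time per 1000 input pixels for a square side x side input."""
    return infer_ms / (side * side / 1000.0)


# (input side length, inference time in ms), taken from the table above
measurements = [(96, 60), (128, 81), (160, 156), (192, 183), (224, 296)]

for side, infer_ms in measurements:
    print(f"{side}x{side}: {ms_per_kilopixel(side, infer_ms):.2f} ms per 1000 pixels")
```

The per-pixel cost stays within a fairly narrow band (roughly 5 to 6.5 ms per 1000 pixels), so total time grows approximately with input area.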
# Optimization
## TM_FASTSCALE
Optimization for MCUs without an FPU
STM32F103C8 running mnist
|Options|infer time|
|---|---|
|TM_FASTSCALE=0| 16ms|
|TM_FASTSCALE=1| 10ms|
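
The kind of technique that helps on FPU-less parts is replacing a floating-point rescale with an integer multiply and shift. The sketch below shows that general idea only; it is not taken from the TinyMaix source, and the 16-bit shift and all function names are assumptions made for this example.

```python
SHIFT = 16  # assumed number of fractional bits in the fixed-point scale


def to_fixed(scale):
    """Convert a float scale factor to a fixed-point integer (done once, offline)."""
    return int(round(scale * (1 << SHIFT)))


def rescale_float(acc, scale):
    """Reference path: rescale an integer accumulator with a float multiply."""
    return acc * scale


def rescale_fixed(acc, scale_fixed):
    """Integer-only path: multiply, then shift right.

    The shift rounds toward negative infinity, so results can differ
    from the float path by up to one unit.
    """
    return (acc * scale_fixed) >> SHIFT


scale = 0.0123
acc = 1234
print(rescale_float(acc, scale))            # float reference
print(rescale_fixed(acc, to_fixed(scale)))  # integer-only approximation
```

The integer path trades a small amount of accuracy for avoiding software floating-point emulation on every output value.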
## TM_ARCH_ARM_SIMD
Optimization for ARM MCUs with DSP instructions (Cortex-M4, M7, etc.); supports INT8 acceleration
STM32F411CE running mbnet 0.25, 128x128x3 input
|Options|infer time|
|---|---|
|TM_ARCH_OPT0 && INT8| 1199ms|
|TM_ARCH_ARM_SIMD && INT8| 840ms|
## TM_ARCH_ARM_MVEI
Optimization for ARM MCUs with MVEI instructions (Cortex-M55, etc.); supports INT8 acceleration.
> Experimental, no test data yet.
## TM_ARCH_ARM_NEON
Optimization for ARM MPUs with NEON instructions (Cortex-A7 and newer); supports INT8/FP32 acceleration
Raspberry Pi 4, single core, running mbnet 1.0, 224x224x3 input
(NEON INT8 is not yet well optimized)
|ARCH|MDL_TYPE|OPT0 time|OPT1 time|
|---|---|---|---|
|TM_ARCH_CPU | INT8| 860ms| 821ms|
|TM_ARCH_CPU | FP32| 2307ms| 2271ms|
|TM_ARCH_ARM_NEON| FP32| 1275ms| 1223ms|
|TM_ARCH_ARM_NEON| INT8| 959ms| 923ms|
## TM_ARCH_RV32P
Optimization for RISC-V MCUs with the P extension (like T-Head E907); supports INT8 acceleration
BL808 E907 core running mbnet 0.25, 128x128x3 input (mdl in psram, CPU running at 320M, O2)
|ARCH|MDL_TYPE|OPT0 time|OPT1 time|
|---|---|---|---|
|TM_ARCH_CPU | INT8| 443ms| 283ms|
|TM_ARCH_RV32P| INT8| 345ms| 188ms|
## TM_ARCH_RV64V
Optimization for RISC-V MCUs with the V extension (like T-Head C906); supports INT8/FP32 acceleration
BL808 C906 core running mbnet 0.25, 128x128x3 input (mdl in psram, VLEN=128, CPU running at 480M, O2)
|ARCH|MDL_TYPE|OPT0 time|OPT1 time|
|---|---|---|---|
|TM_ARCH_CPU | INT8| 153ms| 125ms|
|TM_ARCH_CPU | FP32| 215ms| 177ms|
|TM_ARCH_RV64V| INT8| 123ms| 95ms|
|TM_ARCH_RV64V| FP32| 160ms| 121ms|
|TM_ARCH_RV64V| FP16| 129ms| 81ms|
## TM_ARCH_CSKYV2
Optimization for CskyV2 MCUs with DSP instructions (like ck804ef); supports INT8 acceleration
W801 ck804ef core running mbnet 0.25, 128x128x3 input (mdl in psram, CPU running at 240M, psram overclocked to 120M, O3, with data cache on)
|ARCH|MDL_TYPE|OPT0 time|OPT1 time|
|---|---|---|---|
|TM_ARCH_CPU | INT8| 950ms| 816ms|
|TM_ARCH_CSKYV2 | INT8| 724ms| 606ms|
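
To compare the architecture-specific options in this section, each optimized INT8 time can be expressed as a speedup over the plain CPU path at the same optimization level. The sketch below does that arithmetic on figures copied from the tables above; the helper name `speedup` is an assumption for this example.

```python
def speedup(baseline_ms, optimized_ms):
    """How many times faster the optimized run is than the baseline."""
    return baseline_ms / optimized_ms


# (option, baseline INT8 ms, optimized INT8 ms), taken from the tables above.
# The RV32P, RV64V and CSKYV2 rows use the OPT1 column.
int8_results = [
    ("TM_ARCH_ARM_SIMD", 1199, 840),
    ("TM_ARCH_RV32P", 283, 188),
    ("TM_ARCH_RV64V", 125, 95),
    ("TM_ARCH_CSKYV2", 816, 606),
]

for option, baseline_ms, optimized_ms in int8_results:
    print(f"{option}: {speedup(baseline_ms, optimized_ms):.2f}x")
```

On these figures the INT8 speedups fall between about 1.3x and 1.5x, with each measurement taken on a different chip, so the ratios are not directly comparable across architectures.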
# Comparison with Other Inference Libraries
Uses the SmallCifar model. The TinyMaix figure is its stride=2 time multiplied by 4.
NNoM and TinyMaix ran on STM32H750@218M; the others ran on STM32F746@216M
|InferLib|time(ms)|
|---|---|
|TFlite-micro|393|
|MicroTVM untuned|294|
|TinyMaix CPU O0|224|
|TinyMaix CPU O1|204|
|TinyMaix SIMD O0|176|
|NNoM|159|
|MicroTVM tuned|157|
|CMSIS-NN|136|
|TinyMaix SIMD O1|132|
|tinyengine|129|