I have a Framework Laptop 13 with a Ryzen AI 7 350 CPU, which includes an NPU. I have started playing with this NPU to understand how to develop software for it. While NPUs are mainly intended as accelerators for inference of ML models, they are fundamentally hardware accelerators for matrix multiplication and similar linear algebra operations, so they are also useful for signal processing and other compute applications, which is why I am interested in them. Another reason why I am interested in this NPU is that, as I will explain below, it is very similar to the AIE-ML v2 AI engine in Versal FPGA SoCs, so this laptop is a great platform for learning how to use this AI engine.
NPUs use the concept of TOPS (tera operations per second) as a high-level marketing figure for their capabilities. An operation is generally understood to be an int8 addition or multiplication, since the amount of parallelization that can be achieved depends on the width of the data type. The NPU in the Ryzen AI 7 350 is marketed as a 50 TOPS NPU. The main goal of this post is to understand where this number comes from in terms of hardware execution units and capabilities, to understand under which conditions it can be reached, and to write a small application that reaches this TOPS value.
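To make concrete the kind of accounting this involves, here is a minimal sketch of how a peak TOPS figure is derived from hardware parameters: multiply the number of compute tiles by the MACs that each tile can issue per clock cycle, count each MAC as two operations (a multiplication and an addition), and multiply by the clock frequency. The parameter values in the example below are placeholder assumptions chosen only to illustrate the arithmetic; finding the real values for this NPU is exactly what the rest of the post does.

```python
def peak_tops(tiles: int, macs_per_tile_per_cycle: int, clock_ghz: float) -> float:
    """Peak TOPS, counting each MAC as 2 operations (multiply + add)."""
    ops_per_cycle = tiles * macs_per_tile_per_cycle * 2
    return ops_per_cycle * clock_ghz * 1e9 / 1e12

# Placeholder numbers for illustration only, not confirmed specifications:
# 32 compute tiles, 512 int8 MACs per tile per cycle, ~1.5 GHz clock.
print(peak_tops(tiles=32, macs_per_tile_per_cycle=512, clock_ghz=1.526))  # ~50.0
```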
I think this is a good way of gaining an in-depth understanding of a compute architecture. Most typical real-world use cases are going to be slower than this, because the algorithms will have bottlenecks that result in hardware underutilization. By understanding how the hardware needs to be used to reach peak performance, we get a better idea of the gap that these algorithms leave, and also of how to rewrite them to reduce the gap where possible. In a post last year about NEON kernels on the ARM Cortex-A53 I worked in a similar way, choosing a simple kernel to accelerate and comparing performance benchmarks against the peak performance allowed by the hardware.