Explore AMX instructions: Unlock the performance of Apple Silicon
Since 2020, Apple has released the M1, M2, and M3 chips. They offer at least four different ways to perform compute-intensive tasks:
Standard ARM NEON instructions.
Undocumented AMX (Apple Matrix Co-processor) instructions, issued by the CPU and executed on the co-processor.
The Apple Neural Engine.
The Metal GPU.
If we use ARM NEON instructions to accelerate an sgemm kernel on a single core of the M1 Max, it can achieve around 102 GFLOPS. But if we use AMX instructions, it can achieve 1475 GFLOPS!
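For readers unfamiliar with how such figures are derived: sgemm throughput is conventionally reported as 2·M·N·K floating-point operations divided by elapsed time. The sketch below illustrates that convention with a naive reference kernel. It is an assumption about how numbers like these are typically measured, not the benchmark code behind the figures above, and the names sgemm_ref and sgemm_gflops are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference SGEMM: C = A * B, all matrices row-major.
   A is m x k, B is k x n, C is m x n.
   (Hypothetical helper for illustration; not an optimized kernel.) */
void sgemm_ref(size_t m, size_t n, size_t k,
               const float *a, const float *b, float *c)
{
    for (size_t i = 0; i < m; i++) {
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += a[i * k + p] * b[p * n + j];
            c[i * n + j] = acc;
        }
    }
}

/* GFLOPS for one m x n x k SGEMM that took `seconds` to run.
   Each of the m*n outputs needs k multiplies and k adds,
   so the total work is counted as 2*m*n*k FLOPs. */
double sgemm_gflops(size_t m, size_t n, size_t k, double seconds)
{
    assert(seconds > 0.0);
    double flops = 2.0 * (double)m * (double)n * (double)k;
    return flops / seconds / 1e9;
}
```

Under this convention, a 1024 x 1024 x 1024 sgemm is about 2.15 billion FLOPs, so it would take roughly 21 ms at 102 GFLOPS and roughly 1.5 ms at 1475 GFLOPS.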
In this article, I will show how you can leverage AMX instructions to unlock the performance potential of Apple Silicon. All the code I used is here (verified on an M2 Pro). This article draws on the work of Peter Cawley et al., which covers more instructions and usage methods.