A Primer on Machine Learning Compilation
This article aims to build a basic mental model of machine learning compilation.
Since Apple introduced its own silicon in 2020 (M1/M2/M3), the chips have offered at least four different ways to execute compute-heavy workloads:
Standard ARMv8 SIMD/NEON vector instructions.
The undocumented AMX (Apple Matrix Co-processor) instructions, issued by the CPU and executed on a dedicated accelerator.
The ANE (Apple Neural Engine) neural network processor.
The Metal GPU.
On a single core of the M1 Max, single-precision matrix multiplication reaches roughly 102 GFLOPS with SIMD instructions, but up to 1475 GFLOPS with AMX instructions! This article explores the AMX instruction set and shows how to unlock that remaining 14x of compute.
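Figures like "102 GFLOPS" come from timing an sgemm kernel and dividing the operation count by the elapsed time: a matrix multiplication of shape (m, k) x (k, n) performs 2·m·n·k floating-point operations. As a minimal sketch of such a measurement harness (using NumPy's matmul as a stand-in for the NEON or AMX kernel under test; the function name and sizes are illustrative, not from the original article):

```python
import time
import numpy as np

def sgemm_gflops(m, n, k, reps=10):
    """Time C = A @ B in single precision and report GFLOPS.

    An (m, k) x (k, n) sgemm performs 2*m*n*k floating-point
    operations: one multiply and one add per inner-product term.
    """
    a = np.random.rand(m, k).astype(np.float32)
    b = np.random.rand(k, n).astype(np.float32)
    a @ b  # warm-up: trigger any lazy initialization before timing
    start = time.perf_counter()
    for _ in range(reps):
        a @ b
    elapsed = (time.perf_counter() - start) / reps
    return 2.0 * m * n * k / elapsed / 1e9  # GFLOPS

print(f"{sgemm_gflops(1024, 1024, 1024):.1f} GFLOPS")
```

On Apple Silicon, NumPy linked against the Accelerate framework will already route large matmuls through AMX, which is one easy way to observe the gap between the two figures quoted above.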
In this article, I will introduce how to leverage the AMX instructions to unlock the hidden performance of Apple Silicon. All the code used in this article has been verified on an M2 Pro. This article builds on the work of Peter Cawley et al., which documents more instructions and usage patterns.
To study: how the TileFlow paper performs tiling across multiple levels of the memory hierarchy.
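The core idea behind multi-level tiling is that each loop level keeps a working set small enough for one level of the memory hierarchy. As a rough illustration only (the tile sizes and the `tiled_matmul` helper are made up for this sketch; a real kernel would also tile for registers and pack its operands), a two-level tiled matrix multiplication might look like:

```python
import numpy as np

def tiled_matmul(a, b, tile_l2=64, tile_l1=16):
    """Two-level tiled matmul: outer tiles sized for L2, inner for L1.

    Each (i2, j2, p2) iteration touches only tile_l2 x tile_l2 blocks
    of A, B, and C; the inner (i1, j1, p1) loops subdivide those into
    tile_l1-sized blocks. NumPy slicing clips at matrix edges, so
    non-multiple sizes are handled automatically.
    """
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    c = np.zeros((m, n), dtype=a.dtype)
    for i2 in range(0, m, tile_l2):              # L2-level row tiles
        for j2 in range(0, n, tile_l2):          # L2-level column tiles
            for p2 in range(0, k, tile_l2):      # L2-level depth tiles
                for i1 in range(i2, min(i2 + tile_l2, m), tile_l1):
                    for j1 in range(j2, min(j2 + tile_l2, n), tile_l1):
                        for p1 in range(p2, min(p2 + tile_l2, k), tile_l1):
                            c[i1:i1+tile_l1, j1:j1+tile_l1] += (
                                a[i1:i1+tile_l1, p1:p1+tile_l1]
                                @ b[p1:p1+tile_l1, j1:j1+tile_l1]
                            )
    return c
```

TileFlow's contribution is searching over such tiling structures automatically; the nest above fixes one particular choice by hand.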