Explore AMX instructions: Unlock the performance of Apple Silicon

2024-04-23

Since 2020, Apple has released the M1, M2, and M3 chips. They offer at least four different ways to perform compute-intensive tasks.

Standard ARM NEON instructions.
Undocumented AMX (Apple Matrix Co-processor) instructions. Issued by the CPU and performed on the co-processor.
Apple Neural Engine
Metal GPU

If we use ARM NEON instructions to accelerate the sgemm kernel on a single core of the M1 Max, it achieves around 102 GFLOPS. With AMX instructions, it achieves 1475 GFLOPS!

In this article, I will show how you can leverage AMX instructions to unlock the potential performance of Apple Silicon. All the code I used is available here (verified on M2 Pro). This article draws on the work of Peter Cawley et al., which covers more instructions and usage methods.

Using bundles on macOS

2024-03-13

A look at how to compile a bundle file on macOS, then load and run it dynamically.

A Brief Analysis of the Affine Fusion Pass

2024-01-11

Notes on studying the Affine Fusion Pass in MLIR, focusing mainly on the dependence analysis.

TileFlow: A Framework for Modeling Fusion Dataflow via Tree-based Analysis

2023-12-29

Notes on how the TileFlow paper performs tiling across multiple memory levels.

Using Hugging Face LLaMA

2023-12-26

A record of the problems I ran into when running inference with Hugging Face LLaMA.

Tensor DSL Summary

2023-12-20

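This post summarizes how several DSLs for tensor optimization are designed and tries to find what they have in common. Throughout, I use the same example, Matmul(Transpose(Conv(lhs)),rhs), to test each framework.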

MLIRSharp

2023-11-25

A summary of my work developing MLIRSharp.

Studying TVM dynamic shape

2023-11-15

Exploring how TVM implements dynamic shape.

A Brief Analysis of mlc-llm

2023-11-01

Studying how TVM tackles the problem of LLM inference.

A Brief Analysis of Alibaba EasyDist

2023-09-19

A code walkthrough of Alibaba's open-source EasyDist: Automated Parallelization System and Infrastructure for Multiple Ecosystems, focusing mainly on IR design and search space construction.