Explore AMX instructions: Unlock the performance of Apple Silicon

Since 2020, Apple has shipped the M1, M2, and M3 families of Apple Silicon. These chips offer at least four different ways to perform compute-intensive tasks:

  1. Standard ARM NEON instructions.

  2. Undocumented AMX (Apple Matrix Co-processor) instructions, issued by the CPU and executed on the co-processor.

  3. Apple Neural Engine

  4. Metal GPU

Using ARM NEON instructions to accelerate an sgemm kernel on a single core of the M1 Max achieves around 102 GFLOPS. Using AMX instructions, the same kernel can reach 1475 GFLOPS!
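As a quick sanity check of numbers like these, you can measure the sgemm throughput your own machine reaches through its BLAS backend. This is a minimal sketch, assuming NumPy is installed; on Apple Silicon, the result depends on whether NumPy is linked against Apple's Accelerate framework (which can route sgemm to the AMX units) or against OpenBLAS (which uses NEON).

```python
# Rough sgemm throughput check via NumPy's BLAS-backed matmul.
# The absolute number depends on the BLAS backend NumPy was built with.
import time
import numpy as np

def sgemm_gflops(n: int = 1024, repeats: int = 10) -> float:
    a = np.random.rand(n, n).astype(np.float32)
    b = np.random.rand(n, n).astype(np.float32)
    a @ b  # warm-up run so one-time setup costs are excluded
    start = time.perf_counter()
    for _ in range(repeats):
        a @ b
    elapsed = time.perf_counter() - start
    # An n x n sgemm performs 2 * n^3 floating-point operations
    return 2 * n**3 * repeats / elapsed / 1e9

print(f"{sgemm_gflops():.1f} GFLOPS")
```

`np.show_config()` reports which BLAS library your NumPy build links against, which explains the gap you will see between Accelerate- and OpenBLAS-backed installs.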

In this article, I will show how to leverage the AMX instructions to unlock the potential performance of Apple Silicon. All of the code used here has been verified on an M2 Pro. This article builds on the work of Peter Cawley et al., which documents more instructions and usage patterns.

Read more

Using Bundles on macOS

A look at how to compile a bundle on macOS, then load and run it dynamically.

Read more

A Brief Analysis of the Affine Fusion Pass

Notes on MLIR's Affine Fusion pass, focusing on the dependence-analysis portion.

Read more

TileFlow: A Framework for Modeling Fusion Dataflow via Tree-based Analysis

Notes on how the TileFlow paper performs tiling across multiple memory levels.

Read more

Using Hugging Face Llama

Notes on problems encountered when running inference with Hugging Face Llama.

Read more

A Summary of Tensor DSLs

This article summarizes how several tensor-optimization DSLs are designed and tries to identify their common patterns. Throughout, I will use the same example, Matmul(Transpose(Conv(lhs)), rhs), to test each framework.

Read more

MLIRSharp

Development notes for MLIRSharp.

Read more

A Study of TVM Dynamic Shapes

An exploration of how TVM implements dynamic shapes.

Read more

A Brief Analysis of mlc-llm

Notes on how TVM tackles the LLM inference problem.

Read more

A Brief Analysis of Alibaba EasyDist

A code walkthrough of Alibaba's open-source EasyDist (Automated Parallelization System and Infrastructure for Multiple Ecosystems), focusing on the IR design and search-space construction.

Read more