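An Introduction to Machine Learning Compilation Concepts

A primer to help readers build a basic understanding of machine learning compilation.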


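Benchmarking: Lessons and Tips

Comparing performance fairly is not easy. Each framework's runtime may be configured differently, so all of them need to be brought to a common baseline before the comparison means anything.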
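As a rough illustration of what a common baseline can look like, the sketch below times every candidate with the same warmup count, the same number of repeats, and the same statistic (the median). The names `bench`, `runtime_a`, and `runtime_b` are hypothetical stand-ins, not part of any framework, and a real comparison would also need to align thread counts, data types, and input shapes.

```python
import statistics
import time
from typing import Callable


def bench(fn: Callable[[], object], warmup: int = 3, repeats: int = 10) -> float:
    """Return the median wall-clock time of fn() in seconds."""
    # Warmup runs are discarded so one-time costs (caches, lazy init) do not skew results.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# Hypothetical workloads standing in for two frameworks' runtimes.
candidates = {
    "runtime_a": lambda: sum(i * i for i in range(10_000)),
    "runtime_b": lambda: sum(map(lambda i: i * i, range(10_000))),
}

# Every candidate goes through the identical measurement procedure.
results = {name: bench(fn) for name, fn in candidates.items()}
for name, seconds in results.items():
    print(f"{name}: {seconds * 1e6:.1f} us (median of 10 runs)")
```

The median is used here because it is less sensitive to occasional slow runs than the mean; that is a design choice of this sketch, not a claim about the post's method.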


Learning AMPL

Getting familiar with AMPL syntax.


Constraints Solver Internals

On the internal logic of the Constraints Solver in ortools.


Model Driven Optimization

Reading notes on the paper Model-Driven Optimization For Tensor Computations.


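Exploring AMX: Unlocking the Hidden Performance of Apple Silicon

The M1/M2/M3 chips that Apple has released since 2020 provide at least four different ways to run compute-heavy workloads:

  1. The standard ARMv8 SIMD/NEON vector instruction set.

  2. The AMX (Apple Matrix Co-processor) instruction set, which Apple has not publicly documented. The instructions are issued by the CPU and run on a dedicated accelerator.

  3. The ANE (Apple Neural Engine) neural network processor.

  4. The Metal GPU.

On a single core of the M1 Max, single-precision floating-point matrix multiplication reaches about 102 GFLOPS with SIMD instructions, while AMX instructions can reach up to 1475 GFLOPS! This article explores the AMX instruction set and shows how to unlock that remaining 14x of compute.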
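For readers who want to check the arithmetic, here is a minimal sketch of how such figures are usually derived. It assumes the common convention of counting 2·M·N·K floating-point operations for an M×K by K×N matrix multiply; the helper name `gemm_gflops` is hypothetical, and the two throughput numbers are simply the ones quoted above.

```python
def gemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """GFLOPS of an (m x k) @ (k x n) matrix multiply that took `seconds`.

    Assumes the usual convention of one multiply and one add per inner-product term.
    """
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


# Figures quoted in the text for a single core of the M1 Max.
simd_gflops = 102.0
amx_gflops = 1475.0

# Ratio between the two quoted peak figures.
speedup = amx_gflops / simd_gflops
print(f"AMX over SIMD: {speedup:.1f}x")
```

The ratio comes out at roughly 14.5, which matches the "14x" figure in the text.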


Explore AMX instructions: Unlock the performance of Apple Silicon

Since 2020, Apple has released the M1/M2/M3 chips, which offer at least four different ways to perform compute-intensive tasks:

  1. Standard ARM NEON instructions.

  2. Undocumented AMX (Apple Matrix Co-processor) instructions, issued by the CPU and executed on the co-processor.

  3. Apple Neural Engine

  4. Metal GPU

If we use ARM NEON instructions to accelerate the sgemm kernel on a single core of the M1 Max, it can reach around 102 GFLOPS. With AMX instructions, it can reach 1475 GFLOPS!

In this article, I will show how you can leverage the AMX instructions to unlock the potential performance of Apple Silicon. All the code I used was verified on an M2 Pro. This article builds on the work of Peter Cawley et al., which covers more instructions and usage methods.


Using Bundles on macOS

A look at how to compile a bundle file on macOS, then load and run it dynamically.


A Brief Analysis of the Affine Fusion Pass

A study of the MLIR Affine Fusion Pass, focusing mainly on the dependence analysis.


TileFlow: A Framework for Modeling Fusion Dataflow via Tree-based Analysis

A study of how the TileFlow paper performs tiling across multiple memory levels.
