Giving triton cpu a try, to see whether it lives up to what I imagined.

1. Download

Because triton depends on one specific llvm revision, you have to clone the full llvm repo and then check out that commit; the download alone cost me half a day = =

❯ git clone https://github.com/triton-lang/triton-cpu.git                                                       
❯ git clone https://github.com/llvm/llvm-project.git
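
triton records the llvm commit it expects in cmake/llvm-hash.txt (true for upstream triton; I am assuming the cpu fork keeps the same layout), so the checkout can be scripted. A minimal sketch:

# Check out the llvm commit pinned by triton-cpu. The cmake/llvm-hash.txt
# location is assumed from upstream triton's repo layout.
import pathlib
import subprocess

llvm_hash = pathlib.Path("triton-cpu/cmake/llvm-hash.txt").read_text().strip()
subprocess.run(["git", "-C", "llvm-project", "checkout", llvm_hash], check=True)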

2. Build

The build method triton documents is pip install -e ., but I still wanted to build it myself with cmake, and then ran into a pile of problems...

2.1 Building llvm

For llvm you can follow the build recipe from triton's CI, which spells out detailed build commands for every platform.

2.2 triton build options

If you only need the cpu backend, the following options are enough:

{
    "LLVM_LIBRARY_DIR": "xxxx/llvm-project/install/lib",
    "TRITON_BUILD_PYTHON_MODULE": true,
    "TRITON_BUILD_PROTON": false,
    "TRITON_CODEGEN_BACKENDS": "cpu",
    "TRITON_BUILD_UT": false
}

2.3 Fixing a cmake bug

This line in the cpu backend errors out during configure; comment it out for now:

# Override sleef's output directory with our own
# set_target_properties(sleef PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_LIBRARY_OUTPUT_DIRECTORY})

2.4 Fixing C++ build errors

Because triton's opt tool registers the passes of every backend, but we did not build the nv/amd ones, you need to edit bin/RegisterTritonDialects.h and comment out everything that fails to compile. On top of that, third_party/nvidia/include/TritonNVIDIAGPUToLLVM/Utility.h references an nv function with no definition, so write a stub implementation yourself:

// Stub for the missing nv helper: conservatively never allow skipping the barrier.
bool canSkipBarSync(Operation *before, Operation *after) {
  return false;
}

2.5 Fixing a segfault when loading from python

I am on python3.8, which on mac seems to hit some strange problem with pybind; the cmake needs this change:

# Only link libpython when not on macOS; linking it there caused the segfault.
if(NOT APPLE)
    link_libraries(${Python3_LIBRARIES})
endif()

2.6 Symlinks

triton's python build script normally copies the backend files into the right place automatically; since we built by hand, we have to create the symlinks ourselves (a quick sanity check follows the commands):

cd python/triton/backends
mkdir cpu
cd cpu
ln -s xxxx/triton-cpu/third_party/cpu/backend/driver.py driver.py
ln -s xxxx/triton-cpu/third_party/cpu/backend/compiler.py compiler.py
cd triton-cpu/python/triton/_C
ln -s xxxx/triton-cpu/out/build/debug/libtriton.so libtriton.so
ln -s xxxx/triton-cpu/out/build/debug/third_party/cpu/libTritonCPURuntime.dylib libTritonCPURuntime.dylib
ln -s xxxx/triton-cpu/out/build/debug/third_party/cpu/sleef/lib/libsleef.dylib libsleef.dylib
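
To check the links, import triton and force the native module to load; a wrong symlink fails right at import time. A minimal sketch, assuming the PYTHONPATH from 2.7 below:

# Smoke test: errors out immediately if libtriton.so or the backend
# files are not where the symlinks claim.
import triton
from triton._C import libtriton  # the pybind module we just linked

print("triton loaded from:", triton.__file__)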

2.7 Environment variables

PYTHONPATH is required to run at all; the three below it are for debugging.

PYTHONPATH="xxxx/triton-cpu/out/build/debug:xxxx/triton-cpu/python"
TRITON_KERNEL_DUMP=1
TRITON_DEBUG=1
LLVM_IR_ENABLE_DUMP=1
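
They can also be set from the test script itself, as long as it happens before any kernel gets compiled; a minimal sketch:

import os

# Set before the first kernel is compiled so the dumps actually trigger.
os.environ["TRITON_KERNEL_DUMP"] = "1"
os.environ["TRITON_DEBUG"] = "1"
os.environ["LLVM_IR_ENABLE_DUMP"] = "1"

import triton  # noqa: E402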

2.8 Fixing python runtime errors

I am on python3.8, but the code uses a feature that only exists in 3.11, so rewrite it:

def _init_slots(self):
    """ Initialize the slots of this class """
    for name, val in self.arg_properties.items():
        # strip the "tt." prefix by hand instead of using the newer str method
        setattr(self, (name[3:] if name.startswith("tt.") else name) + '_' + str(self.property_values[name]), val)

There is also the problem of picking the compiler on mac; edit python/triton/runtime/build.py:

if system == 'Darwin' and clang is not None:
    cc = clang
else:
    cc = gcc if gcc is not None else clang
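
Alternatively, if this copy of build.py honors the CC environment variable the way upstream triton's runtime/build.py does (an assumption worth checking in the fork), the patch can be skipped entirely:

import os

# Assumption: build.py checks CC before falling back to gcc/clang discovery,
# as upstream triton does. The path here is hypothetical; point it at your clang.
os.environ["CC"] = "/opt/homebrew/opt/llvm/bin/clang"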

3. Running

3.1 vector add

I tried running the simplest example, but it hung while generating asm, and the dumped llvm ir looked bizarre, with 32768 literal values in it. Then I spotted this in vector add:

CPU_ST_THRESHOLD = 65536
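
The obvious experiment is to shrink it; a sketch of the one-line edit in the tutorial (8192 is just an arbitrary smaller value, not a tuned one):

# In triton-cpu's 01-vector-add.py: cap the single-launch tile size so the
# heavily unrolled IR stays small enough to compile in reasonable time.
CPU_ST_THRESHOLD = 8192  # was 65536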

Sure enough, with a smaller threshold it compiles normally = =. But there is still an omp problem, because apple's bundled apple-clang has no openmp support.

You can install omp with conda and then adjust the mac branch of python/triton/runtime/build.py: add the compile option / include dir / library dir, and comment out the openmp flag that gets appended further down.

    if system == "Darwin":
cc_cmd += ["-undefined", "dynamic_lookup"]
cc_cmd += ['-Xpreprocessor', '-fopenmp', '-lomp']
include_dirs += ['xxxxx/miniforge3/envs/ci/include']
library_dirs += ['xxxxx/miniforge3/envs/ci/lib']
# Don't use libgcc on clang + macos
if "clang" in cc:
libraries.remove("gcc")
.
.
.
# if not os.environ.get("TRITON_DISABLE_OPENMP", None):
# cc_cmd += ["-fopenmp"]

Now vector add finally runs correctly:

❯ python python/tutorials/01-vector-add.py
tensor([0.5151, 1.6826, 0.9153, ..., 0.9852, 1.2714, 1.8192])
tensor([0.5151, 1.6826, 0.9153, ..., 0.9852, 1.2714, 1.8192])
The maximum difference between torch-cpu and triton-cpu is 0.0
The maximum difference between torch-cpu-tiled and triton-cpu is 0.0

Looking at it more carefully, I realized that CPU_ST_THRESHOLD is really just the maximum BLOCK_SIZE, which means compilation already struggles once BLOCK_SIZE goes above 512; this probably has a lot to do with how much triton unrolls the code it generates.
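
It is easy to reproduce this by timing the first launch of the tutorial kernel at different BLOCK_SIZE values. A sketch, with add_kernel written out as in the upstream tutorial (the sizes and the timing harness are my own additions):

import time

import torch
import triton
import triton.language as tl


@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Standard tutorial vector add: each program handles one BLOCK_SIZE tile.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


x, y = torch.rand(98432), torch.rand(98432)
out = torch.empty_like(x)
n = x.numel()
for bs in (128, 512, 1024, 2048):
    start = time.time()
    # Each distinct BLOCK_SIZE is a new constexpr, so this forces a fresh compile.
    add_kernel[(triton.cdiv(n, bs),)](x, y, out, n, BLOCK_SIZE=bs)
    print(f"BLOCK_SIZE={bs}: {time.time() - start:.1f}s (first call includes compile)")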

4. Overall flow

The whole launch process is easy to understand through the following example:

import triton
import triton.language as tl
import torch


@triton.jit
def program_id_kernel(a, ConstArg: tl.constexpr):
    # There are multiple 'programs' processing different data. We identify which program
    # we are here:
    xid = tl.program_id(axis=0)
    yid = tl.program_id(axis=1)
    zid = tl.program_id(axis=2)
    tl.device_print("id:", xid, yid, zid)

    xsize = tl.num_programs(axis=0)
    ysize = tl.num_programs(axis=1)
    zsize = tl.num_programs(axis=2)
    tl.device_print("nums:", xsize, ysize, zsize)


def grid_callback(arguments):
    print(arguments)
    return [2, 2, 2]


a = torch.rand([10])
program_id_kernel[grid_callback](a, ConstArg=123)

4.1 How is it parallelized?

First, the kernel triton compiles is just a library function, and it always takes a 3-dimensional index as extra input parameters. The driver part of the runtime then hand-writes a launcher to call the triton kernel: at launch time, given the configured gridX, gridY, gridZ, it flattens all the index tuples into one dimension and simply parallelizes the loop over i with openmp. There is a subtlety here: the iteration order of the three ids is fixed regardless of the program, yet the program's actual performance does depend on that order (see the python sketch after the launcher code).

auto all_grids = get_all_grids(gridX, gridY, gridZ);
// For now, use the default chunk size, total iterations / max_threads.
#ifdef _OPENMP
#pragma omp parallel for schedule(static) num_threads(max_threads)
#endif // _OPENMP
for (size_t i = 0; i < N; ++i) {{
    const auto [x, y, z] = all_grids[i];
    (*kernel_ptr)({kernel_fn_args_list + ', ' if len(kernel_fn_args) > 0 else ''} x, y, z, gridX, gridY, gridZ);
}}
}}
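
The flattening itself is trivial; in python terms the launcher effectively does the following. The point about id order is that the launcher, not the kernel, fixes which coordinate varies fastest across neighboring iterations (the x-outermost ordering below is an assumption of this sketch):

from itertools import product

def get_all_grids(gridX, gridY, gridZ):
    # One (x, y, z) triple per program instance. z varies fastest with this
    # ordering, so programs with adjacent z land on adjacent loop iterations;
    # whether that is cache-friendly depends entirely on the kernel.
    return list(product(range(gridX), range(gridY), range(gridZ)))

print(get_all_grids(2, 2, 2))
# [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), ...]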

4.2 How are gridX, gridY, gridZ determined?

The launcher lives in the runtime, but the runtime itself is driven by the python-side jit function. When the jit function launches a kernel, it hands all of the caller's python arguments to the grid_callback shown earlier, and that callback's return value becomes gridX, gridY, gridZ. In other words, we can choose an appropriate number of workers based on the sizes of the input tensors (a sketch follows the code below).

class JITFunction(KernelInterface[T]):
    ...
    def run(self, *args, grid, warmup, **kwargs):
        ...
        if not warmup:
            # canonicalize grid
            assert grid is not None
            if callable(grid):
                # Arguments are passed as a dict to `grid`, by contract.
                # TODO(jlebar): In the new launch API, pass the compiler flags as a
                # second parameter to `grid`.
                grid = grid(bound_args)
            grid_size = len(grid)
            grid_0 = grid[0]
            grid_1 = grid[1] if grid_size > 1 else 1
            grid_2 = grid[2] if grid_size > 2 else 1

            # launch kernel
            launch_metadata = kernel.launch_metadata(grid, stream, *non_constexpr_vals)
            kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
                       self.CompiledKernel.launch_enter_hook, self.CompiledKernel.launch_exit_hook, *non_constexpr_vals)
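
So a size-aware grid callback just computes the grid from the bound arguments it receives; a minimal sketch reusing the example kernel from above (BLOCK = 1024 is an arbitrary work-per-program choice):

import triton

def grid_from_size(arguments):
    # `arguments` is the dict of bound kernel arguments, per the contract
    # noted in run() above; "a" is the tensor parameter's name.
    BLOCK = 1024
    return [triton.cdiv(arguments["a"].numel(), BLOCK), 1, 1]

# program_id_kernel[grid_from_size](a, ConstArg=123)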