First impressions of triton-cpu
Trying out triton-cpu to see whether it delivers the results I imagined.
1. Downloading
Triton depends on a specific LLVM revision, so you have to clone the full llvm-project repo and then check out that commit; the download alone cost me half a day.
❯ git clone https://github.com/triton-lang/triton-cpu.git
❯ git clone https://github.com/llvm/llvm-project.git
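Triton pins the exact LLVM commit it needs (upstream records it in cmake/llvm-hash.txt, and I assume the triton-cpu fork keeps the same layout), so switching over looks something like:
❯ cd llvm-project
❯ git checkout "$(cat ../triton-cpu/cmake/llvm-hash.txt)"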
2. Building
The build method triton documents is pip install -e ., but I wanted to drive CMake myself, and then hit a pile of problems.
2.1 Building LLVM
For LLVM you can follow the build recipe in triton's CI, which spells out detailed build commands for every platform.
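For reference, the CI recipe boils down to an MLIR-enabled LLVM build. A rough sketch (the exact flags differ per platform, so treat .github/workflows in the repo as the source of truth):
❯ cd llvm-project
❯ cmake -G Ninja -S llvm -B build -DCMAKE_BUILD_TYPE=Release \
    -DLLVM_ENABLE_PROJECTS=mlir -DLLVM_TARGETS_TO_BUILD=host \
    -DCMAKE_INSTALL_PREFIX=install
❯ ninja -C build install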
2.2 Triton build options
If you only need the CPU backend, use the following options:
{
"LLVM_LIBRARY_DIR": "xxxx/llvm-project/install/lib",
"TRITON_BUILD_PYTHON_MODULE": true,
"TRITON_BUILD_PROTON": false,
"TRITON_CODEGEN_BACKENDS": "cpu",
"TRITON_BUILD_UT": false
}
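These are ordinary CMake cache variables, so a manual configure (matching the out/build/debug directory the symlinks below point into) looks roughly like:
❯ cmake -G Ninja -S . -B out/build/debug -DCMAKE_BUILD_TYPE=Debug \
    -DLLVM_LIBRARY_DIR=xxxx/llvm-project/install/lib \
    -DTRITON_BUILD_PYTHON_MODULE=ON -DTRITON_BUILD_PROTON=OFF \
    -DTRITON_CODEGEN_BACKENDS=cpu -DTRITON_BUILD_UT=OFF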
2.3 Fixing a CMake bug
The following line in the CPU backend's CMake errors out; comment it out for now:
# Override sleef's output directory with our own
# set_target_properties(sleef PROPERTIES LIBRARY_OUTPUT_DIRECTORY ${CMAKE_LIBRARY_OUTPUT_DIRECTORY})
2.4 Fixing C++ build errors
Triton's opt tool registers the passes of every backend, but since we didn't build the NV/AMD backends, we need to edit bin/RegisterTritonDialects.h and comment out everything that fails to compile. On top of that, third_party/nvidia/include/TritonNVIDIAGPUToLLVM/Utility.h references an NV function whose definition can't be found, so write a stub by hand:
bool canSkipBarSync(Operation *before, Operation *after) {
  return false;
}
2.5 Fixing a segfault on Python import
I'm on Python 3.8, which on macOS seems to hit a strange problem with pybind; the CMake needs this change:
if(NOT APPLE)
  link_libraries(${Python3_LIBRARIES})
endif()
2.6 Symlinks
Triton's Python build script normally copies the backend files into the right places automatically; since we bypassed it, we create the symlinks by hand:
❯ cd python/triton/backends
❯ mkdir cpu
❯ cd cpu
❯ ln -s xxxx/triton-cpu/third_party/cpu/backend/driver.py driver.py
❯ ln -s xxxx/triton-cpu/third_party/cpu/backend/compiler.py compiler.py
❯ cd triton-cpu/python/triton/_C
❯ ln -s xxxx/triton-cpu/out/build/debug/libtriton.so libtriton.so
❯ ln -s xxxx/triton-cpu/out/build/debug/third_party/cpu/libTritonCPURuntime.dylib libTritonCPURuntime.dylib
❯ ln -s xxxx/triton-cpu/out/build/debug/third_party/cpu/sleef/lib/libsleef.dylib libsleef.dylib
2.7 Environment variables
PYTHONPATH points at our manual build; the three flags after it are for debugging.
PYTHONPATH="xxxx/triton-cpu/out/build/debug:xxxx/triton-cpu/python"
TRITON_KERNEL_DUMP=1
TRITON_DEBUG=1
LLVM_IR_ENABLE_DUMP=1
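A debug run then looks like this (xxxx abbreviates the real paths, as above):
❯ PYTHONPATH="xxxx/triton-cpu/out/build/debug:xxxx/triton-cpu/python" TRITON_KERNEL_DUMP=1 TRITON_DEBUG=1 LLVM_IR_ENABLE_DUMP=1 python python/tutorials/01-vector-add.py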
2.8 Fixing Python runtime errors
I'm on Python 3.8, but the code uses a feature that only exists in 3.11, so I rewrote it:
def _init_slots(self):
    """Initialize the slots of this class."""
    for name, val in self.arg_properties.items():
        setattr(self, (name[3:] if name.startswith("tt.") else name) + '_' + str(self.property_values[name]), val)
There is also a compiler-selection problem on macOS; modify python/triton/runtime/build.py:
if system == 'Darwin' and clang is not None:
    cc = clang
else:
    cc = gcc if gcc is not None else clang
3. Running
3.1 vector add
I tried running the simplest example, but it hung while generating asm. The dumped LLVM IR looked bizarre, with 32768 values in it. Then I spotted this in the vector-add tutorial:
CPU_ST_THRESHOLD = 65536
Shrinking this threshold did indeed make it compile normally. But there was still an OpenMP problem, because Apple's bundled apple-clang has no OpenMP support.
The fix is to install OpenMP via conda and adjust the macOS branch of python/triton/runtime/build.py: add the OpenMP compile options, include directory, and library directory, and comment out the unconditional -fopenmp flag that gets appended further down.
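The install step should be something like this (assuming the conda-forge llvm-openmp package; the include/lib paths below then live under the conda env prefix):
❯ conda install -c conda-forge llvm-openmp
The modified section of build.py: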
if system == "Darwin":
cc_cmd += ["-undefined", "dynamic_lookup"]
cc_cmd += ['-Xpreprocessor', '-fopenmp', '-lomp']
include_dirs += ['xxxxx/miniforge3/envs/ci/include']
library_dirs += ['xxxxx/miniforge3/envs/ci/lib']
# Don't use libgcc on clang + macos
if "clang" in cc:
libraries.remove("gcc")
.
.
.
# if not os.environ.get("TRITON_DISABLE_OPENMP", None):
# cc_cmd += ["-fopenmp"]现在终于可以正常运行vector add:
❯ python python/tutorials/01-vector-add.py
tensor([0.5151, 1.6826, 0.9153, ..., 0.9852, 1.2714, 1.8192])
tensor([0.5151, 1.6826, 0.9153, ..., 0.9852, 1.2714, 1.8192])
The maximum difference between torch-cpu and triton-cpu is 0.0
The maximum difference between torch-cpu-tiled and triton-cpu is 0.0
Looking closer, CPU_ST_THRESHOLD is really just the maximum allowed BLOCK_SIZE, and in practice compilation already struggles once BLOCK_SIZE exceeds 512. This probably has a lot to do with how much triton unrolls the code it generates.
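For context, here is the core of the kernel from that tutorial (paraphrased from triton's 01-vector-add; BLOCK_SIZE is exactly the quantity that CPU_ST_THRESHOLD caps, and each program instance handles one BLOCK_SIZE-wide chunk, which is presumably what gets unrolled):
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, output_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)  # each program handles one block of the input
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the tail when n_elements % BLOCK_SIZE != 0
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(output_ptr + offsets, x + y, mask=mask)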
4. Overall flow
The whole launch process is easy to understand through the following example:
import triton
import triton.language as tl
import torch

@triton.jit
def program_id_kernel(a, ConstArg: tl.constexpr):
    # There are multiple 'programs' processing different data. We identify which program
    # we are here:
    xid = tl.program_id(axis=0)
    yid = tl.program_id(axis=1)
    zid = tl.program_id(axis=2)
    tl.device_print("id:", xid, yid, zid)
    xsize = tl.num_programs(axis=0)
    ysize = tl.num_programs(axis=1)
    zsize = tl.num_programs(axis=2)
    tl.device_print("nums:", xsize, ysize, zsize)

def grid_callback(arguments):
    print(arguments)
    return [2, 2, 2]

a = torch.rand([10])
program_id_kernel[grid_callback](a, ConstArg=123)
4.1 How does it parallelize?
First, a triton-compiled kernel is just a library function which, by default, takes a 3-D index as extra parameters. The driver part of the runtime emits a hand-written launcher that calls the kernel: given the preconfigured gridX, gridY, gridZ, it flattens the index tuples into one dimension and then simply runs the loop over i in parallel with OpenMP. There is a catch here: the order in which the three ids are enumerated is fixed by the runtime and independent of the program, yet the program's performance does depend on that order (see the sketch after the launcher code below).
auto all_grids = get_all_grids(gridX, gridY, gridZ);
// For now, use the default chunk size, total iterations / max_threads.
#ifdef _OPENMP
#pragma omp parallel for schedule(static) num_threads(max_threads)
#endif // _OPENMP
for (size_t i = 0; i < N; ++i) {{
const auto [x, y, z] = all_grids[i];
(*kernel_ptr)({kernel_fn_args_list + ', ' if len(kernel_fn_args) > 0 else ''} x, y, z, gridX, gridY, gridZ);
}}
}}
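A minimal Python sketch of what the flattening amounts to (the nesting order here is an assumption for illustration; the real order is hard-coded in get_all_grids, and the point is precisely that the kernel has no say in it):
def get_all_grids(gridX, gridY, gridZ):
    # One fixed enumeration order chosen by the runtime, not by the program.
    # With x innermost, consecutive iterations share the same (y, z), which is
    # why performance can depend on which axis you map your data onto.
    return [(x, y, z)
            for z in range(gridZ)
            for y in range(gridY)
            for x in range(gridX)]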
4.2 How are gridX, gridY, gridZ chosen?
The runtime provides the launcher, but the runtime itself is driven by the Python-side JIT function. When the JIT function launches a kernel, it hands all of the current Python-side arguments to the grid_callback we saw above, and that callback's return value becomes gridX, gridY, gridZ. In other words, we can pick an appropriate number of workers based on the sizes of the input tensors.
class JITFunction(KernelInterface[T]):
    ...
    def run(self, *args, grid, warmup, **kwargs):
        ...
        if not warmup:
            # canonicalize grid
            assert grid is not None
            if callable(grid):
                # Arguments are passed as a dict to `grid`, by contract.
                # TODO(jlebar): In the new launch API, pass the compiler flags as a
                # second parameter to `grid`.
                grid = grid(bound_args)
            grid_size = len(grid)
            grid_0 = grid[0]
            grid_1 = grid[1] if grid_size > 1 else 1
            grid_2 = grid[2] if grid_size > 2 else 1
            # launch kernel
            launch_metadata = kernel.launch_metadata(grid, stream, *non_constexpr_vals)
            kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
                       self.CompiledKernel.launch_enter_hook, self.CompiledKernel.launch_exit_hook, *non_constexpr_vals)
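This is also why the usual tutorial idiom works: the callback receives the bound arguments (constexprs included) and derives the grid from them. A sketch, reusing the add_kernel and n_elements names from the vector-add section:
# 1-D grid: grid_1 and grid_2 default to 1
grid = lambda meta: (triton.cdiv(n_elements, meta['BLOCK_SIZE']), )
add_kernel[grid](x, y, output, n_elements, BLOCK_SIZE=512)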