学习TileFlow这篇论文中是如何进行多个内存层级的tiling.

1. 输入格式描述

输入需要如下几个文件, 每个文件描述了不同的内容.

tileflow arch/arch.yaml prob/prob.yaml map/map.yaml macro.yaml

1.1 arch.yaml

描述了整个芯片的架构层级.

architecture: 
  version: 0.2 

  subtree:
  - name: System
    
    local: 
    - name: MainMemory
      class: DRAM 
      attributes:
        block-size: 16384
        depth: 1
        word-bits: 16
        read_bandwidth: 4.3
        write_bandwidth: 2.9
      
    subtree: 
    - name: Buffer 
    
      local:  
      - name: Cache 
        class: SRAM
        attributes:
          word-bits: 16
          block_size: 16384
          depth: 3
          read_bandwidth: 52
          write_bandwidth: 20 # 16 


      subtree:
      - name: PE

        local: 
        - name: RegFile[0..255] 
          class: regfile
          attributes:
            meshX: 16
            meshY: 16
            depth: 1
            block_size: 3
            word-bits: 16
            read_bandwidth: 3.2
            write_bandwidth: 3.2

        - name: mac[0..255] 
          class: intmac 
          attributes: 
            word-bits: 16
            meshX: 16
            meshY: 16

1.2 prob.yaml

这里其实类似halide, 需要列出所有共享的迭代变量, 然后下面每个算子都是用这些维度来构建. 具体可以参考这里.

problem:
  io:
    ins: A, B, D
    outs: E
  dimensions: [M,N,K,L]
  instance:
    M: M 
    N: N
    L: L
    K: K

  ops:
  - name: GEMM1
    dimensions: [M,L,K] 
    data-spaces:
    - name: C 
      projection:
        - [[M]] 
        - [[L]] 
      read-write: True 
    - name: A 
      projection:
        - [[M]]
        - [[K]]
    - name: B
      projection:
        - [[K]]
        - [[L]]
    ins: A, B
    out: C
    
  - name: GEMM2 
    dimensions: [M,L,N]
    data-spaces:
    - name: E 
      projection: 
        - [[M]]
        - [[N]]
      read-write: True 
    - name: C
      projection: 
        - [[M]]
        - [[L]]
    - name: D 
      projection: 
        - [[L]]
        - [[N]]
    ins: C, D
    out: E

对应的计算如下所示:

M = 512
N = 64
K = 64 
L = 512
C[M,L] = A[M,K] @ B[K,L]
E[M,N] = C[M,L] @ D[L,N]

1.3 map.yaml

map是一个比较重要的配置, 这里重点说明一下. type分为temporal表示顺序执行和spatial表示并行执行. factors: M = MO N = NO K= KO表示它将M/N/K维度分别分为MO/NO/KO块.permutation: NMK表示循环从内到外分别为NMK. 这里的factors: M=MM K=KM N=NI表示再次切分这里的三个维度. multicast: true表示多播. split: 1表示映射到硬件xy.

原始文档参考这里.

mapping:
  node-type: Tile 
  type: temporal 
  factors: M=MO 
  target: MainMemory 

  subtree: 
  - node-type: Scope 
    type: Sequential 

    subtree:
    - node-type: Tile 
      factors: K=KO L=LO
      type: temporal 
      bypass: [C]
      target: MainMemory
      profile: False
      tag: op1
      

      subtree: 
      - node-type: Tile 
        type: temporal 
        factors: K=KM L=LI M=MM
        permutation: LMK
        target: Cache 
        tag: op1 
        
        subtree:
        - node-type: Tile
          type: Spatial 
          factors: M=SX K=SY
          split: 1
          permutation: MK
          target: Cache
          tag: op1 
          
          subtree: 
          - node-type: Tile 
            type: temporal  
            factors: M=1 L=1 K=1
            permutation: MLK 
            target: RegFile
            tag: op1 
            
            subtree:
            - node-type: Op
              name: GEMM1 

      # A common spatial tile
    - node-type: Tile
      type: temporal 
      factors: L=LO N=NO
      target: MainMemory
      profile: False
      bypass: [C]
      tag: op2

      subtree: 
      - node-type: Tile 
        type: temporal 
        factors: M=MM L=LM N=NI
        permutation: NML 
        target: Cache
        tag: op2
        
        subtree:
        - node-type: Tile
          type: Spatial 
          factors: M=SX L=SY
          split: 1
          permutation: ML
          target: Cache
          tag: op2

          subtree:
          - node-type: Tile 
            type: temporal 
            factors: M=1 L=1 N=1 
            permutation: MLN 
            target: RegFile
            tag: op2
            
            subtree: 
            - node-type: Op
              name: GEMM2

之前tile flow, 可以得到初始化时的一些关键信息:

-----------------Mapping---------------
root: 0xa23020
read: A B E D update: E 
for M in [0:MO), MainMemory
  read: A B E D update: E fill: A B E D write-back: E 
  Scope: Sequential{
    read: A B fill: A B 
    for K in [0:KO), MainMemory
      for L in [0:LO), MainMemory
        read: C A B update: C fill: A B 
        for K in [0:KM), Cache
          for M in [0:MM), Cache
            for L in [0:LI), Cache
              read: C A B update: C fill: C A B write-back: C 
              for K in [0:16) (Spatial-Y), Cache
                for M in [0:16) (Spatial-X), Cache
                  read: C A B update: C fill: C A B write-back: C 
                  for K in [0:1), RegFile
                    for L in [0:1), RegFile
                      for M in [0:1), RegFile
                        read: C A B update: C fill: C A B write-back: C 
                        Op: GEMM1(A,B,)->C

    read: E D update: E fill: E D write-back: E 
    for L in [0:LO), MainMemory
      for N in [0:NO), MainMemory
        read: C E D update: E fill: E D write-back: E 
        for L in [0:LM), Cache
          for M in [0:MM), Cache
            for N in [0:NI), Cache
              read: C E D update: E fill: C E D write-back: E 
              for L in [0:16) (Spatial-Y), Cache
                for M in [0:16) (Spatial-X), Cache
                  read: C E D update: E fill: C E D write-back: E 
                  for N in [0:1), RegFile
                    for L in [0:1), RegFile
                      for M in [0:1), RegFile
                        read: C E D update: E fill: C E D write-back: E 
                        Op: GEMM2(C,D,)->E

  }
---------------------------------------
constraints:
        KO*KM==4        #  loopcount constraint for tiling of dim K
        LO*LI==512      #  loopcount constraint for tiling of dim L
        MO*MM==32       #  loopcount constraint for tiling of dim M
        LO*LM==32       #  loopcount constraint for tiling of dim L
        MO*MM==32       #  loopcount constraint for tiling of dim M
        NO*NI==64       #  loopcount constraint for tiling of dim N
        (1*1*1*1+1*1*1*1+1*1*1*1)<=3    # Memory constraint at Tile::op1::RegFile::Temporal
        (Max(MM*16*1*1,MM*16*1*1)*Max(LO*LI*1*1,LO*LM*16*1*1)+MM*16*1*1*KM*16*1*1+KM*16*1*1*LI*1*1)<=131072     # Memory constraint at Tile::op1::Cache::Temporal
        (1*1*1*1+1*1*1*1+1*1*1*1)<=3    # Memory constraint at Tile::op2::RegFile::Temporal
        (Max(MM*16*1*1,MM*16*1*1)*Max(LO*LI*1*1,LO*LM*16*1*1)+MM*16*1*1*NI*1*1+LM*16*1*1*NI*1*1)<=131072        # Memory constraint at Tile::op2::Cache::Temporal
        (MO*Max(MM*16*1*1,MM*16*1*1)*Max(KO*KM*16*1*1,1)+Max(KO*KM*16*1*1,1)*Max(LO*LI*1*1,LO*LM*16*1*1)+MO*Max(MM*16*1*1,MM*16*1*1)*Max(1,NO*NI*1*1)+Max(LO*LI*1*1,LO*LM*16*1*1)*Max(1,NO*NI*1*1))<=524288   # Memory constraint at Tile::MainMemory::Temporal
        <16, 16> <= <16, 16>      # Resource constraint at Tile::op1::Cache::Spatial
        <16, 16> <= <16, 16>      # Resource constraint at Tile::op2::Cache::Spatial
        Max(<1, 1>,<1, 1>) <= <1, 1>      # Resource constraint at Scope::Sequential
==============Checker END================

其实我对于tileflow最好奇的一点是 temporal buffer是在哪个内存层级申请的, 通过上面这个constraints很好的解释了这一点. 他应该是在每个tile node上都会开buffer, 对于op1在cache上的这个node上设定了c为pypass, 因此他的大小计算为C[MM*16,L], 而其他两个buffer就是根据factor来计算的:A[MM*16,KM*16]以及B[KM*16,LI].

1.4 执行结果

这里应该是只搜索了buffer size.

***Optimal Mapping:
-----------------Nest Analysis----------------
Tile::MainMemory::Temporal,
strides,low,high:MNKL[4]: 128 64 64 512 ,[4]: 0 0 0 0 ,[4]: 511 63 63 511 
read: A B E D update: E 
for M in [0:MO(4)), MainMemory
   read: A B E D update: E fill: A B E D write-back: E 
   Scope: Sequential
   {
      Tile::op1::MainMemory::Temporal,
      strides,low,high:MNKL[4]: 128 1 32 512 ,[4]: 0 0 0 0 ,[4]: 127 0 63 511 
      read: A B fill: A B 
      for K in [0:KO(2)), MainMemory
        for L in [0:LO(1)), MainMemory
            Tile::op1::Cache::Temporal,
            strides,low,high:MNKL[4]: 128 1 16 512 ,[4]: 0 0 0 0 ,[4]: 127 0 31 511 
            read: C A B update: C fill: A B 
            for K in [0:KM(2)), Cache
              for M in [0:MM(8)), Cache
                for L in [0:LI(512)), Cache
                     Tile::op1::Cache::Spatial,
                     strides,low,high:MNKL[4]: 16 1 1 1 ,[4]: 0 0 0 0 ,[4]: 15 0 15 0 
                     read: C A B update: C fill: C A B write-back: C 
                     for K in [0:16) (Spatial-Y), Cache
                       for M in [0:16) (Spatial-X), Cache
                           Tile::op1::RegFile::Temporal,
                           strides,low,high:MNKL[4]: 1 1 1 1 ,[4]: 0 0 0 0 ,[4]: 0 0 0 0 
                           read: C A B update: C fill: C A B write-back: C 
                           for K in [0:1), RegFile
                             for L in [0:1), RegFile
                               for M in [0:1), RegFile
                                    read: C A B update: C fill: C A B write-back: C 
                                    Op: GEMM1(A,B,)->C

                                    repFactor:0
                                    accesses:0
                                    expanison:16,16
      Tile::op2::MainMemory::Temporal,
      strides,low,high:MNKL[4]: 128 64 1 512 ,[4]: 0 0 0 0 ,[4]: 127 63 0 511 
      read: E D update: E fill: E D write-back: E 
      for L in [0:LO(1)), MainMemory
        for N in [0:NO(1)), MainMemory
            Tile::op2::Cache::Temporal,
            strides,low,high:MNKL[4]: 128 64 1 16 ,[4]: 0 0 0 0 ,[4]: 127 63 0 511 
            read: C E D update: E fill: E D write-back: E 
            for L in [0:LM(32)), Cache
              for M in [0:MM(8)), Cache
                for N in [0:NI(64)), Cache
                     Tile::op2::Cache::Spatial,
                     strides,low,high:MNKL[4]: 16 1 1 1 ,[4]: 0 0 0 0 ,[4]: 15 0 0 15 
                     read: C E D update: E fill: C E D write-back: E 
                     for L in [0:16) (Spatial-Y), Cache
                       for M in [0:16) (Spatial-X), Cache
                           Tile::op2::RegFile::Temporal,
                           strides,low,high:MNKL[4]: 1 1 1 1 ,[4]: 0 0 0 0 ,[4]: 0 0 0 0 
                           read: C E D update: E fill: C E D write-back: E 
                           for N in [0:1), RegFile
                             for L in [0:1), RegFile
                               for M in [0:1), RegFile
                                    read: C E D update: E fill: C E D write-back: E 
                                    Op: GEMM2(C,D,)->E

                                    repFactor:0
                                    accesses:0
                                    expanison:16,16
   }
Cycle: 140288, Energy: 4.18771e+08
--------------END Nest Analysis---------------

TileFlow: A Framework for Modeling Fusion Dataflow via Tree-based Analysis