Each warpgroup (WG) has two lanes. SIMT lane = what the warpgroup's threads execute, in program order. Tensor-core lane = each wgmma's async in-flight span [issue → the wait that drains it]. Intra-warpgroup async = a tensor-core bar overlapping the SIMT lane above it (e.g. PV·(k−1) runs under softmax·k). Violet arrows = named-barrier ping-pong: when a WG finishes its wgmma, it arrives at the barrier, which releases the other WG's sync (the tiny "wait peer" block).
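The ping-pong hand-off can be imitated on the CPU with a toy model. This is only a sketch under loud assumptions: two Python threads stand in for the two warpgroups, one `threading.Semaphore` per WG stands in for its named barrier, and `run_ping_pong` and `gemm_order` are hypothetical names, not taken from any kernel. It shows just the ordering the violet arrows enforce, not real wgmma or softmax work.

```python
import threading


def run_ping_pong(n_iters):
    """Toy model of the ping-pong: returns the order in which GEMM steps ran."""
    # Assumption: one semaphore per WG stands in for that WG's named barrier.
    # WG0 starts with one token so that it goes first.
    barriers = [threading.Semaphore(1), threading.Semaphore(0)]
    gemm_order = []  # (wg, k) pairs, in the order the GEMM steps ran

    def warpgroup(wg):
        peer = 1 - wg
        for k in range(n_iters):
            barriers[wg].acquire()       # "sync": wait until the peer has arrived
            gemm_order.append((wg, k))   # stand-in for this iteration's wgmma
            barriers[peer].release()     # "arrive": free the peer's sync
            # softmax for iteration k would run here, while the peer does its GEMM

    threads = [threading.Thread(target=warpgroup, args=(wg,)) for wg in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return gemm_order
```

Calling `run_ping_pong(2)` returns `[(0, 0), (1, 0), (0, 1), (1, 1)]`: the single token passed back and forth forces the two WGs' GEMM steps to alternate strictly, while whatever follows the arrive (the softmax placeholder) is free to overlap with the peer's GEMM.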