High-Performance GEMM + AllReduce Fusion Example¶

Overview¶

This example shows how to implement a multi-rank GEMM + AllReduce fused operator with PTO. It uses a dual-stream design for communication-compute overlap: a Compute Stream runs the GEMM kernel, and a Comm Stream runs the communication kernel. PTO communication instructions operate directly on the HCCL RDMA window to complete the AllReduce.

Supported AI Processors¶

Ascend950PR

Directory Layout¶

kernels/manual/a5/gemm_ar/
├── CMakeLists.txt              # Build configuration (3 targets: cube kernel, vec kernel, host executable)
├── run.sh                      # One-click build + run script (auto-computes HCCL_BUFFSIZE and locates MPI)
├── gemm_ar_config.h            # Global configuration (matrix shape, tile sizes, block counts)
├── main.cpp                    # Entry: MPI init, data generation, HCCL init, window allocation, perf measurement, verification
├── gemm_compute_kernel.cpp     # GEMM compute kernel (Cube side, L0C FP32 -> GM FP16 auto cast)
├── comm_kernel.cpp             # Communication kernel (Vector side, overlapped RS/AG AllReduce in one kernel)
├── kernel_launchers.h          # Host-side kernel launcher declarations
├── common.hpp                  # Device-side HcclRemotePtr wrapper (RDMA window address translation)
├── hccl_context.h              # HcclDeviceContext structure (RDMA window addresses for each rank)
├── ready_queue.hpp             # Multi-block lock-free tile queue (compute -> comm signaling)
└── comm_mpi.h                  # MPI dynamic loading wrapper (dlopen/dlsym, no hard link dependency)

Operator Description¶

Functionality¶

This example implements multi-rank GEMM + AllReduce:

\[ C_{final} = \sum_{i=0}^{nranks-1} A_i \times B \]

Where:

A_i is M x K and is private to each rank
B is K x N and shared by all ranks
C_i is the local GEMM result of shape M x N
C_final is the final M x N output after AllReduce

The default reference configuration in gemm_ar_config.h is M=5416, K=6144, N=1408 with 2 ranks.

Specification¶

Item	Value
OpType	`GEMM + AllReduce`
Input	`A_i`: `M x K`, `float16`, `ND` (private to each rank); `B`: `K x N`, `float16`, `DN` (shared)
Output	`C_final`: `M x N`, `float16`, `ND` (AllReduce result)
Compute kernel name	`GemmComputeKernel` (Cube architecture, `dav-c220-cube`)
Comm kernel name	`GemmCommAllKernel` (Vector architecture, `dav-c220-vec`)

Optimization Notes¶

This example uses a 2-rank Ascend950PR platform as the performance validation target. Ascend950PR (DAV_3510 / arch35) uses a split architecture where Cube (AIC) and Vector (AIV) are physically separate, which makes dual-stream communication-compute overlap practical.

Use the CANN platform_config as the source of truth for core counts. For example, on 950PR_958b:

cube_core_cnt=32 (Cube / AIC parallelism)

vector_core_cnt=64 (Vector / AIV parallelism)

Dual-stream overlap: the compute kernel runs on the Compute Stream (Cube) and the communication kernel runs on the Comm Stream (Vector). Tile-level signaling allows communication and computation to run concurrently.
Logical RS + AG in one mixed loop: RS reduces into the owner rank and AG broadcasts owner-local results, with both roles executing inside one subtile-driven loop and handing off through ready counters.
Block Swizzle: the compute kernel uses a zigzag tile traversal order (odd rows reversed) to improve L1 reuse of neighboring B matrix tiles.
Two-level double-buffer pipeline: L1 cache (stepK=4 batched TLOAD) plus L0 ping/pong buffering lets DMA movement overlap with Cube compute as much as possible.
Lock-free Ready Queue: each AIC writes one queue, and each communication block drains the queue subset {block_idx, block_idx + num_comm_blocks, ...}. AIV first probes with TTEST and only blocks with TWAIT when needed.
RS double buffering: the RS producer path uses ping/pong tiles so the TLOAD of the current subtile overlaps with the TSTORE<AtomicAdd> of the previous subtile.
Owner-local subtile executor: each owner-local tile is split into fixed-height subtiles (G_COMM_SUB_M, default 64 rows). AG blocks claim reversed-stripe subsets of those subtiles to smooth combined RS + AG load.
Publish / consume fences: the queue producer publishes the doorbell only after the slot payload is visible, and RS publishes subtile-ready / ag-summary doorbells only after pipe_barrier(PIPE_ALL) + dsb(DSB_DDR).

Tiling Parameters¶

Parameter	Value
`M` (raw)	5416
`K`	6144
`N` (raw)	1408
`M` (padded)	5504
`N` (padded)	1536
`baseM`	128
`baseK`	64
`baseN`	256
`stepKa`	4
`stepKb`	4
`commSubM`	64
`subtilesPerTile`	2
Number of tiles	258 (`43 x 6`)
`COMPUTE_BLOCK_NUM`	24
`COMM_BLOCK_NUM`	24

Overall Architecture¶

┌──────────────────────────────────────────────────────────────────────────────┐
│  Compute Stream (24 logical AIC)       Comm Stream (24 logical AIV)          │
│                                                                              │
│  GemmComputeKernel:                     GemmCommAllKernel:                   │
│  ┌─────────────────────────┐            ┌──────────────────────────────┐     │
│  │ for each tile:          │            │ RS/AG overlap loop           │     │
│  │   K-loop (L1 -> L0 -> Cube)          │   poll Ready Queue           │     │
│  │   TSTORE -> gemm_output │──Ready──→ │   TLOAD tile from gemm_output│     │
│  │   pipe_barrier(ALL)     │  Queue    │   TSTORE<AtomicAdd> -> owner │     │
│  │   Enqueue tile_idx      │            │   subtile-ready / summary++  │     │
│  └─────────────────────────┘            │   drain ready subtiles for AG│     │
│                                          │   TLOAD -> TSTORE to remote │     │
│                                          │   ready-driven AG handoff    │     │
│                                          │   subtile-level overlap      │     │
│                                          └──────────────────────────────┘     │
└──────────────────────────────────────────────────────────────────────────────┘

Compute Kernel Details¶

Time ->
L1 (MTE2):  [TLOAD A0,B0]                [TLOAD A1,B1]              ...
L0 (MTE1):       [TEXTRACT k0] [k1] [k2] [k3] [TEXTRACT k0'] ...
Cube (M):             [TMATMUL k0] [ACC k1] [ACC k2] [ACC k3] [TMATMUL k0'] ...
                      ^ full three-stage overlap ^

Each AIC is responsible for a subset of tiles assigned by block_idx x tiles_per_block. For each tile:

Block Swizzle mapping: remap the linear tile index into a zigzag traversal order, reversing odd rows so adjacent tiles reuse columns of matrix B in L1.
K-loop: every stepKa=4 iterations, perform one batched TLOAD into L1. Each iteration then uses TEXTRACT to pull one K-slice into L0, followed by TMATMUL / TMATMUL_ACC accumulation.
TSTORE: FP32 values in L0C are automatically cast to FP16 by the FixPipe and stored to gemm_output.
pipe_barrier(PIPE_ALL): guarantees that the GM write is complete.
MultiBlockEnqueueFast: enqueue tile_idx to notify the communication kernel.

Communication Kernel Details¶

The launched communication kernel follows the mixed subtile pipeline implemented in GemmCommAllImpl(): RS production and AG consumption are interleaved inside one loop, and the synchronization point is a per-subtile counter rather than a device-wide barrier.

RS Producer Path¶

Each communication block owns the queue subset:

queues(block b) = { b, b + num_comm_blocks, b + 2*num_comm_blocks, ... }

With the default COMPUTE_BLOCK_NUM = COMM_BLOCK_NUM = 24, this degenerates to 1:1. When fewer communication blocks are used, one block drains multiple compute queues round-robin via RsPollQueues() / RsWaitOnQueue().

For every dequeued tile:

The tile is split along M into G_COMM_SUBTILES_PER_TILE = G_BASE_M / G_COMM_SUB_M fixed-height subtiles.
RsPipelineStep() uses ping/pong UB tiles so the current subtile TLOAD overlaps with the previous subtile TSTORE<AtomicAdd>.
The RS destination is the owner rank owner = tile_idx % nranks, so reduction is completed directly in that rank's reduced_output.

RS/AG Overlap Synchronization¶

The overlap protocol uses two counters in the owner rank's signal_matrix:

subtile-ready[local_subtile_id]: counts how many ranks have completed RS for that owner-local subtile.
ag-summary[summary_block]: a coarser wakeup doorbell for the AG block responsible for that subtile.

Publishing follows RsPublishSubtileReady():

pipe_barrier(PIPE_ALL) flushes the local pipeline.
dsb(DSB_DDR) makes the reduced_output store globally visible.
RsNotifySubtileReady() increments the owner-local ready counter.
RsNotifyAgSummary() increments the AG wakeup counter selected by AgSummaryBlockForSubtile().

Consumption follows AgDrainReadyAssignedSubtiles():

Probe assigned subtile-ready counters with TTEST(..., nranks, GE).
On the first hit of a drain pass, execute one acquire fence (pipe_barrier + dsb) for all ready subtiles consumed in that pass.
Transfer each ready subtile to all remote ranks.
If no progress is possible, AgWaitAssignedSummary() blocks on summary_ack_count + 1 and waits for the next assigned wakeup.

AgSummaryBlockForSubtile() uses a reversed-stripe mapping so AG-heavy blocks land on RS-light blocks, which flattens the combined rs_work + ag_work load.

AG Executor Path¶

AG work is assigned in owner-local subtile space:

total_local_subtiles = my_tile_count * G_COMM_SUBTILES_PER_TILE
assigned_ids(block b) = { num_comm_blocks - 1 - b + k*num_comm_blocks }

For each ready assigned subtile:

AgDecodeLocalSubtile() maps the owner-local subtile id back to a global row offset in reduced_output.
AgTransferSubtileToAll() broadcasts exactly G_COMM_SUB_M rows to every remote rank.
The first remote peer is rotated by local_subtile_id % (nranks - 1) so not every block hammers the same destination first.

This design lets AG start as soon as a specific owner-local subtile is fully reduced across all ranks.

Ready Queue Mechanism¶

┌─────────────┐         ┌─────────────┐
│  AIC 0      │         │  AIV 0      │
│  (Compute)  │──Queue──│  (Comm)     │
│  block_idx=0│   0     │  block_idx=0│
└─────────────┘         └─────────────┘
┌─────────────┐         ┌─────────────┐
│  AIC 1      │         │  AIV 1      │
│  (Compute)  │──Queue──│  (Comm)     │
│  block_idx=1│   1     │  block_idx=1│
└─────────────┘         └─────────────┘
      ...                     ...
┌─────────────┐         ┌─────────────┐
│  AIC 23     │         │  AIV 23     │
│  (Compute)  │──Queue──│  (Comm)     │
│  block_idx=23│  23    │  block_idx=23│
└─────────────┘         └─────────────┘

This illustration uses the default logical block range 0...23 when COMPUTE_BLOCK_NUM = COMM_BLOCK_NUM = 24. These logical block IDs do not need to equal the physical cube_core_cnt and vector_core_cnt from Ascend950PR_958b.ini. If you use another SoC such as Ascend950PR_9599, follow the corresponding counts from its .ini.

Each queue is a 64-byte-aligned PerBlockQueue metadata block followed by cache-line-isolated PerBlockQueueSlot payloads.
Producer (AIC): PerBlockQueueEnqueueFast writes the target slot via GetQueueSlot(), flushes that slot with dcci, executes a release fence, and only then increments count.
Consumer (AIV): PerBlockQueueTryDequeue first checks count >= head+1 with TTEST, then executes an acquire fence and re-fetches the target slot via GetQueueSlot() until the payload becomes visible. If no tile is ready, it returns -1; after a prolonged idle period it falls back to hardware TWAIT.
The design is single-producer single-consumer, so no atomic operation is required inside the queue.

Memory Layout and HCCL Window¶

Only buffers written by remote TPUT or TNOTIFY need to live in the HCCL RDMA window. Buffers used only for local read/write can be allocated with plain aclrtMalloc.

Buffer	Size	Location	Why
`reduced_output`	`M x N x 2B`	HCCL window	RS `AtomicAdd` and AG remote `TPUT` writes (`FP16`)
`signal_matrix`	`G_SIGNAL_TOTAL_SLOTS x 4B`, aligned to 64B	HCCL window	`subtile-ready` and `ag-summary` counters (plus reserved legacy barrier slots)
`gemm_output`	`M x N x 2B`	aclrtMalloc	Local read/write only (`FP16`)
`src0_dev`, `src1_dev`	input matrices (`FP16`)	aclrtMalloc	Local read/write only

Window size is controlled by the HCCL_BUFFSIZE environment variable. run.sh sizes it from the padded reduced_output footprint and adds a large safety margin:

pad(M, G_BASE_M) x pad(N, G_BASE_N) x 2 / 1MB + 64MB

signal_matrix lives in the same window but is tiny compared with the added 64MB margin.

Measured Performance (Reference)¶

The following numbers were collected on 2-card Ascend950PR with M=5416, K=6144, N=1408 (padded to 5504 x 1536), 258 tiles (43 x 6), compute_blocks=32, and comm_blocks=24. Each rank computes a full GEMM C_i = A_i x B, and AllReduce sums the two C_i tensors.

Metric	Value
Compute-only	`323.2 us` (`289926 GFLOPS`)
Sequential	`856.7 us` (compute `325.8 us` + comm `530.8 us @ 29.7 GB/s`)
Pipelined	`580.2 us` (compute done `360.4 us`, comm done `580.2 us @ 27.1 GB/s`)
Speedup	`1.476x`
Time saved	`276.5 us` (`32.3%`)
Overlap eff	`84.8%`
Throughput	`322996 GFLOPS` (total)

What These Numbers Mean¶

Compute-only: pure GEMM execution time with no communication. It reflects the upper bound of single-card Cube utilization. The current pure-compute result is 323.2 us, or 289926 GFLOPS.
Sequential: compute followed by communication with no overlap. The current sequential path takes 856.7 us, split into 325.8 us of compute and 530.8 us of communication.
Pipelined: compute and communication run concurrently on two streams. The current Pipelined = 580.2 us; versus Sequential = 856.7 us, that is a 1.476x speedup with 84.8% overlap efficiency.
Speedup: Sequential / Pipelined. A larger value means communication-compute overlap is more effective.
Time saved: total wall-clock time saved relative to the sequential path. The current run saves 276.5 us, or 32.3%.
Overlap efficiency: the fraction of the shorter phase that is hidden by overlap. 84.8% means most of the shorter phase is now hidden by overlap.

Optimization History¶

The rows below are historical optimization checkpoints; the last row is the latest end-to-end result from the current subtile-ready / AG-summary overlap path on Ascend950PR. Treat the older rows as context, not as a literal decomposition of the live path.

Optimization	Pipelined (us)	Gain	Conclusion
Baseline	808	-	-
Block Swizzle	793	`-1.8%`	Kept
RS `AtomicAdd` removes the separate Reduce stage	736	`-6.6%`	Kept
AG row-level flattened scheduling	623	`-15.4%`	Historical checkpoint
48 AIV (`RS` skip + `AG` participate)	639	RS only on 24 AIV, AG on 48 AIV	Reverted (`AIC` interference)
48 AIV dual-queue (`1 AIC : 2 AIV`)	667	both RS and AG on 48 AIV	Reverted (`AIC` interference)
Current `subtile-ready / AG-summary overlap` path	580.2	current 2-rank Ascend950PR result	Current result

Performance Tuning Guide¶

1. Prioritize Multi-Core Partitioning¶

Each AIC receives a tile subset according to block_idx x tiles_per_block, and blocks do not interfere with one another.

Checklist:

Tune COMPUTE_BLOCK_NUM so each block gets a similar number of tiles.
For different matrix shapes, recompute the total tile count as G_NUM_TILES = (M_padded/128) x (N_padded/256).

2. Choose a Proper Base Tile¶

L0A and L0B use ping/pong double buffering, and each buffer is limited to 32 KiB.

For FP16 input (2 bytes/elem):

L0A tile bytes ~= baseM x baseK x 2 = 128 x 64 x 2 = 16 KiB
L0B tile bytes ~= baseK x baseN x 2 = 64 x 256 x 2 = 32 KiB

The communication tile size is:

baseM x baseN x sizeof(FP16) = 128 x 256 x 2 = 64 KB

3. Use L1 `stepK` Caching to Increase Reuse¶

With stepKa=stepKb=4, one TLOAD brings 4 K-slices into L1, and subsequent TEXTRACT operations pull them into L0 one by one.

L1 usage:

2 x 64KB (A) + 2 x 128KB (B) = 384KB <= 1024KB

Increasing stepK can reduce DMA launch overhead, but the total must still fit in L1.

4. Preserve Pipeline Overlap¶

The key to performance is the combination of:

double buffering inside the compute kernel (L1 / L0A / L0B)
dual-stream overlap between compute and communication

When you observe:

communication time >> compute time: the compute side is already efficient, so focus on improving communication or increasing overlap.
compute time >> communication time: communication is fully hidden, so focus on the compute side.

5. Tune the Number of Communication Blocks¶

COMM_BLOCK_NUM controls AIV parallelism in the communication kernel and can be adjusted via --comm-blocks.

On Ascend910B, measurements showed that increasing COMM_BLOCK_NUM from 24 to 48 caused a significant increase in AIC compute time (about +24%) because of HBM bandwidth contention and TSCH scheduling overhead. A more stable default was therefore 24. After moving to Ascend950PR, the upper bound should be reconsidered based on the SoC-specific vector_core_cnt in the corresponding .ini file, for example 64 on 958b and 72 on 9599. Do not assume the old "24 best, 48 worse" conclusion still holds without profiling on the target SoC.

6. Constraints¶

K must be divisible by G_BASE_K x G_STEP_KA (default 64 x 4 = 256).
M is padded automatically to a multiple of 128, and N is padded automatically to a multiple of 256.
All HCCL-window buffers must be allocated at the same offset on every rank.
signal_matrix must be reset with aclrtMemset before each iteration.

Build and Run¶

Configure the Ascend CANN environment:

export ASCEND_CANN_PATH=/usr/local/Ascend/cann-<version>/set_env.sh
source "${ASCEND_CANN_PATH}"

Activate a conda environment that provides Python and NumPy:

conda activate <your-conda-env>

Run the example with 2 ranks:

cd ${git_clone_path}/kernels/manual/a5/gemm_ar
./run.sh --nranks 2 --soc-version Ascend950PR_958b

Specify the starting device index:

FIRST_DEVICE=0 ./run.sh --nranks 2 --soc-version Ascend950PR_958b

Use custom compute/communication block counts:

./run.sh --nranks 2 --soc-version Ascend950PR_958b --compute-blocks 20 --comm-blocks 4

When successful, the program prints:

GEMM AllReduce demo completed successfully.

Environment Variables¶

Environment Variable	Purpose	Default Behavior
`ASCEND_CANN_PATH`	Full path to the CANN `set_env.sh` script	Auto-globs `/usr/local/Ascend/cann-*/set_env.sh` and picks the latest one
`MPI_SEARCH_DIRS`	Search paths for MPI `bin/` directories (space-separated)	Searches common locations such as `/usr/local/mpich/bin` and `/home/mpich/bin`
`ASCEND_DRIVER_PATH`	Ascend driver path used by CMake	Defaults to `/usr/local/Ascend/driver`
`MPI_LIB_PATH`	Absolute path to `libmpi.so` for runtime dynamic loading	Auto-set by `run.sh` according to the discovered MPI installation
`HCCL_BUFFSIZE`	HCCL RDMA window size in MB	Auto-computed by `run.sh` from `M` and `N`
`FIRST_DEVICE`	Starting NPU device index	Defaults to `0`

Changing Matrix Dimensions¶

Update CONFIG_G_M, CONFIG_G_K, and CONFIG_G_N in gemm_ar_config.h. All source files share the configuration through includes. You can also pass them from CMake:

cmake -DCONFIG_G_M=8192 -DCONFIG_G_K=8192 -DCONFIG_G_N=2048 ..

Constraint: K must be divisible by G_BASE_K x G_STEP_KA (default 64 x 4 = 256). HCCL_BUFFSIZE is computed automatically by run.sh.

FAQ¶

Problem	Cause and Fix
`HCCL window too small`	The window must cover the padded `reduced_output` footprint plus `signal_matrix`. Check whether `HCCL_BUFFSIZE` was manually overridden; `run.sh` auto-raises it from `pad(M) x pad(N) x 2 / 1MB + 64MB`
`HcclGetRootInfo failed: 7`	Leftover dirty state from a previous run. Execute `rm -rf /dev/shm/sem.hccl*; ipcrm -a` or wait about 30 seconds and retry
Hangs after HCCL initialization	Usually a rank synchronization problem. Check that all ranks reached `CommMpiBarrier`
Segmentation fault in the communication kernel	Usually caused by an invalid window address. Verify that `windowsIn[]` entries are non-zero
Signal-wait deadlock or AG stall	`signal_matrix` was not cleared between iterations, or the subtile-ready / AG-summary ownership mapping is wrong. Check whether `resetState` calls `memset` on `signal_matrix`
Verification fails with large `max_diff`	FP16 precision is limited. The validation tolerance is `atol=1.0, rtol=0.01`. If the diff is abnormally large, check subtile-ready / AG-summary synchronization and owner mapping
`aclInit repeat init` (`100002`)	Harmless. The code already guards against repeated `aclInit` in one process
`--allow-run-as-root` fails	This project uses MPICH. That option is specific to OpenMPI

Build System¶

Compiler: bisheng (CANN-bundled clang 15.0.5)
Cube kernel flags: --cce-aicore-arch=dav-c220-cube -DMEMORY_BASE
Vector kernel flags: --cce-aicore-arch=dav-c220-vec -DMEMORY_BASE
Host executable: standard -xc++ compilation
Linked libraries: runtime, ascendcl, hcomm, tiling_api
The include path for pto-comm-isa must come first so it overrides the pto_tile.hpp bundled with CANN

Changelog¶

Date	Change
2026-04-15	Added the A5 adaptation of `gemm_ar`
2026-04-21	Communication mode changed from `RS -> DeviceBarrier -> AG` to `subtile-ready / AG-summary overlap`
2026-04-24	Ready-queue transport was hardened with explicit slot addressing and the obsolete `debug_state` diagnostics were removed