High-Performance GEMM + AllReduce Fusion Example¶
Overview¶
This example shows how to implement a multi-rank GEMM + AllReduce fused operator with PTO on A2/A3-class chips. It uses a dual-stream design for communication-compute overlap: a Compute Stream runs the GEMM kernel, and a Comm Stream runs the communication kernel. PTO communication instructions operate directly on the HCCL RDMA window to complete the AllReduce.
Supported AI Processors¶
- A2/A3
Directory Layout¶
kernels/manual/a2a3/gemm_ar/
├── CMakeLists.txt # Build configuration (3 targets: cube kernel, vec kernel, host executable)
├── run.sh # One-click build + run script (auto-computes HCCL_BUFFSIZE and locates MPI)
├── gemm_ar_config.h # Global configuration (matrix shape, tile sizes, block counts)
├── main.cpp # Entry: MPI init, data generation, HCCL init, window allocation, perf measurement, verification
├── gemm_compute_kernel.cpp # GEMM compute kernel (Cube side, L0C FP32 -> GM FP16 auto cast)
├── comm_kernel.cpp # Communication kernel (Vector side, overlapped RS/AG AllReduce in one kernel)
├── kernel_launchers.h # Host-side kernel launcher declarations
├── common.hpp # Device-side HcclRemotePtr wrapper (RDMA window address translation)
├── hccl_context.h # HcclDeviceContext structure (RDMA window addresses for each rank)
├── ready_queue.hpp # Multi-block lock-free tile queue (compute -> comm signaling)
└── comm_mpi.h # MPI dynamic loading wrapper (dlopen/dlsym, no hard link dependency)
Operator Description¶
Functionality¶
This example implements multi-rank GEMM + AllReduce:
Where:
A_iisM x Kand is private to each rankBisK x Nand is shared by all ranksC_iis the local GEMM result of shapeM x NC_finalis the finalM x Noutput after AllReduce
The default matrix configuration in gemm_ar_config.h is M=5416, K=6144, N=1408. The performance section below uses 8-card Ascend 910B as the reference platform.
Specification¶
| Item | Value |
|---|---|
| OpType | GEMM + AllReduce |
| Input | A_i: M x K, float16, ND (private to each rank); B: K x N, float16, DN (shared) |
| Output | C_final: M x N, float16, ND (AllReduce result) |
| Compute kernel name | GemmComputeKernel (Cube architecture, dav-c220-cube) |
| Comm kernel name | GemmCommAllKernel (Vector architecture, dav-c220-vec) |
Optimization Notes¶
This example uses 8-card Ascend 910B as the main performance-validation target. On this platform, the reference setup uses 24 AIC blocks for compute and 24 AIV blocks for communication.
- Dual-stream overlap: the compute kernel runs on the Compute Stream and the communication kernel runs on the Comm Stream. Tile-level signaling allows communication and computation to run concurrently.
- Logical RS + AG in one mixed loop: RS reduces into the owner rank and AG broadcasts owner-local results, with both roles executing inside one subtile-driven loop and handing off through ready counters.
- Block Swizzle: the compute kernel uses a zigzag tile traversal order (odd rows reversed) to improve L1 reuse of neighboring
Bmatrix tiles. - Two-level double-buffer pipeline: L1 cache (
stepK=4batchedTLOAD) plus L0 ping/pong buffering lets DMA movement overlap with Cube compute as much as possible. - Lock-free Ready Queue: each AIC writes one queue, and each communication block drains the queue subset
{block_idx, block_idx + num_comm_blocks, ...}. AIV first probes withTTESTand only blocks withTWAITwhen needed. - RS double buffering: the RS producer path uses ping/pong tiles so the
TLOADof the current subtile overlaps with theTSTORE<AtomicAdd>of the previous subtile. - Owner-local subtile executor: each owner-local tile is split into fixed-height subtiles (
G_COMM_SUB_M, default64rows). AG blocks claim reversed-stripe subsets of those subtiles to smooth combined RS + AG load. - Publish / consume fences: RS publishes
subtile-readyandag-summarydoorbells only afterpipe_barrier(PIPE_ALL) + dsb(DSB_DDR), and AG performs one acquire fence before consuming a batch of ready subtiles.
Tiling Parameters¶
| Parameter | Value |
|---|---|
M (raw) |
5416 |
K |
6144 |
N (raw) |
1408 |
M (padded) |
5504 |
N (padded) |
1536 |
baseM |
128 |
baseK |
64 |
baseN |
256 |
stepKa |
4 |
stepKb |
4 |
commSubM |
64 |
subtilesPerTile |
2 |
| Number of tiles | 258 (43 x 6) |
COMPUTE_BLOCK_NUM |
24 |
COMM_BLOCK_NUM |
24 |
Overall Architecture¶
┌──────────────────────────────────────────────────────────────────────────────┐
│ Compute Stream (24 AIC) Comm Stream (24 AIV) │
│ │
│ GemmComputeKernel: GemmCommAllKernel: │
│ ┌─────────────────────────┐ ┌──────────────────────────────┐ │
│ │ for each tile: │ │ RS/AG overlap loop │ │
│ │ K-loop (L1 -> L0 -> Cube) │ poll Ready Queue │ │
│ │ TSTORE -> gemm_output │──Ready──→ │ TLOAD tile from gemm_output│ │
│ │ pipe_barrier(ALL) │ Queue │ TSTORE<AtomicAdd> -> owner │ │
│ │ Enqueue tile_idx │ │ subtile-ready / summary++ │ │
│ └─────────────────────────┘ │ drain ready subtiles for AG│ │
│ │ TLOAD -> TSTORE to remote │ │
│ │ ready-driven AG handoff │ │
│ │ subtile-level overlap │ │
│ │ │ │
│ └──────────────────────────────┘ │
└──────────────────────────────────────────────────────────────────────────────┘
Compute Kernel Details¶
Time ->
L1 (MTE2): [TLOAD A0,B0] [TLOAD A1,B1] ...
L0 (MTE1): [TEXTRACT k0] [k1] [k2] [k3] [TEXTRACT k0'] ...
Cube (M): [TMATMUL k0] [ACC k1] [ACC k2] [ACC k3] [TMATMUL k0'] ...
^ full three-stage overlap ^
Each AIC is responsible for a subset of tiles assigned by block_idx x tiles_per_block. For each tile:
- Block Swizzle mapping: remap the linear tile index into a zigzag traversal order, reversing odd rows so adjacent tiles reuse columns of matrix
Bin L1. - K-loop: every
stepKa=4iterations, perform one batchedTLOADinto L1. Each iteration then usesTEXTRACTto pull one K-slice into L0, followed byTMATMUL/TMATMUL_ACCaccumulation. - TSTORE: FP32 values in L0C are automatically cast to FP16 by the FixPipe and stored to
gemm_output. pipe_barrier(PIPE_ALL): guarantees that the GM write is complete.MultiBlockEnqueueFast: enqueuetile_idxto notify the communication kernel.
Communication Kernel Details¶
The launched communication kernel follows the mixed subtile pipeline implemented in GemmCommAllImpl(): RS production and AG consumption are interleaved inside one loop, and the synchronization point is a per-subtile counter rather than a device-wide barrier.
RS Producer Path¶
Each communication block owns the queue subset:
queues(block b) = { b, b + num_comm_blocks, b + 2*num_comm_blocks, ... }
With the default COMPUTE_BLOCK_NUM = COMM_BLOCK_NUM = 24, this degenerates to 1:1. When fewer communication blocks are used, one block drains multiple compute queues round-robin via RsPollQueues() / RsWaitOnQueue().
For every dequeued tile:
- The tile is split along
MintoG_COMM_SUBTILES_PER_TILE = G_BASE_M / G_COMM_SUB_Mfixed-height subtiles. RsPipelineStep()uses ping/pong UB tiles so the current subtileTLOADoverlaps with the previous subtileTSTORE<AtomicAdd>.- The RS destination is the owner rank
owner = tile_idx % nranks, so reduction is completed directly in that rank'sreduced_output.
RS/AG Overlap Synchronization¶
The overlap protocol uses two counters in the owner rank's signal_matrix:
subtile-ready[local_subtile_id]: counts how many ranks have completed RS for that owner-local subtile.ag-summary[summary_block]: a coarser wakeup doorbell for the AG block responsible for that subtile.
Publishing follows RsPublishSubtileReady():
pipe_barrier(PIPE_ALL)flushes the local pipeline.dsb(DSB_DDR)makes thereduced_outputstore globally visible.RsNotifySubtileReady()increments the owner-local ready counter.RsNotifyAgSummary()increments the AG wakeup counter selected byAgSummaryBlockForSubtile().
Consumption follows AgDrainReadyAssignedSubtiles():
- Probe assigned
subtile-readycounters withTTEST(..., nranks, GE). - On the first hit of a drain pass, execute one acquire fence (
pipe_barrier + dsb) for all ready subtiles consumed in that pass. - Transfer each ready subtile to all remote ranks.
- If no progress is possible,
AgWaitAssignedSummary()blocks onsummary_ack_count + 1and waits for the next assigned wakeup.
AgSummaryBlockForSubtile() uses a reversed-stripe mapping so AG-heavy blocks land on RS-light blocks, which flattens the combined rs_work + ag_work load.
AG Executor Path¶
AG work is assigned in owner-local subtile space:
total_local_subtiles = my_tile_count * G_COMM_SUBTILES_PER_TILE
assigned_ids(block b) = { num_comm_blocks - 1 - b + k*num_comm_blocks }
For each ready assigned subtile:
AgDecodeLocalSubtile()maps the owner-local subtile id back to a global row offset inreduced_output.AgTransferSubtileToAll()broadcasts exactlyG_COMM_SUB_Mrows to every remote rank.- The first remote peer is rotated by
local_subtile_id % (nranks - 1)so not every block hammers the same destination first.
This design lets AG start as soon as a specific owner-local subtile is fully reduced across all ranks.
Ready Queue Mechanism¶
┌─────────────┐ ┌─────────────┐
│ AIC 0 │ │ AIV 0 │
│ (Compute) │──Queue──│ (Comm) │
│ block_idx=0│ 0 │ block_idx=0│
└─────────────┘ └─────────────┘
┌─────────────┐ ┌─────────────┐
│ AIC 1 │ │ AIV 1 │
│ (Compute) │──Queue──│ (Comm) │
│ block_idx=1│ 1 │ block_idx=1│
└─────────────┘ └─────────────┘
... ...
┌─────────────┐ ┌─────────────┐
│ AIC 23 │ │ AIV 23 │
│ (Compute) │──Queue──│ (Comm) │
│ block_idx=23│ 23 │ block_idx=23│
└─────────────┘ └─────────────┘
- Each queue is a 64-byte-aligned
PerBlockQueuestructure containingcount(producer-side monotonically increasing counter) anddata[](tile index array). Queue slots are addressed through theGetQueueSlot()helper instead of relying on implicitdata[idx]layout assumptions. - Producer (AIC):
PerBlockQueueEnqueueFastwrites the target slot throughGetQueueSlot(), then incrementscount, and usesdccito flush cache state so the entry becomes visible to AIV. - Consumer (AIV):
PerBlockQueueTryDequeueuses hardwareTTESTto check whethercount >= head+1, refreshes the target slot throughGetQueueSlot(), and returns the tile id. If no tile is ready, it returns-1; after a prolonged idle period it falls back to hardwareTWAIT. - When
COMM_BLOCK_NUM < COMPUTE_BLOCK_NUM, one communication block drains multiple queues in round-robin order. The queue assignment is static, so no cross-block atomic arbitration is required. - The design is single-producer single-consumer, so no atomic operation is required inside the queue.
Memory Layout and HCCL Window¶
Only buffers written by remote TPUT or TNOTIFY need to live in the HCCL RDMA window. Buffers used only for local read/write can be allocated with plain aclrtMalloc.
| Buffer | Size | Location | Why |
|---|---|---|---|
reduced_output |
M x N x 2B |
HCCL window | RS AtomicAdd and AG remote TPUT writes (FP16) |
signal_matrix |
G_SIGNAL_TOTAL_SLOTS x 4B, aligned to 64B |
HCCL window | Subtile-ready and AG-summary counters (plus reserved legacy barrier slots) |
gemm_output |
M x N x 2B |
aclrtMalloc | Local read/write only (FP16) |
src0_dev, src1_dev |
input matrices (FP16) |
aclrtMalloc | Local read/write only |
Window size is controlled by the HCCL_BUFFSIZE environment variable. run.sh sizes it from the padded reduced_output footprint and adds a large safety margin:
pad(M, G_BASE_M) x pad(N, G_BASE_N) x 2 / 1MB + 64MB
signal_matrix lives in the same window but is tiny compared with the added 64MB margin.
Measured Performance (Reference)¶
The following numbers were collected from the current subtile-ready / AG-summary overlap implementation on 8-card Ascend 910B with M=5416, K=6144, N=1408 (padded to 5504 x 1536), 258 tiles (43 x 6), compute_blocks=24, and comm_blocks=24. Each rank computes a full GEMM C_i = A_i x B, and AllReduce sums the eight C_i tensors.
| Metric | Value |
|---|---|
| Compute-only | 368.1 us (254546 GFLOPS) |
| Sequential | 808.9 us (compute 371.6 us + comm 437.3 us @ 63.6 GB/s) |
| Pipelined | 560.6 us (compute done 367.2 us, comm done 560.6 us @ 49.7 GB/s) |
| Speedup | 1.443x |
| Time saved | 248.4 us (30.7%) |
| Overlap eff | 66.8% |
| Throughput | 1337307 GFLOPS (total) |
What These Numbers Mean¶
- Compute-only: pure GEMM execution time with no communication. It reflects the upper bound of single-card Cube utilization. The current pure-compute result is
368.1 us, or254546 GFLOPS. - Sequential: compute followed by communication with no overlap. The current sequential path takes
808.9 us, split into371.6 usof compute and437.3 usof communication. - Pipelined: compute and communication run concurrently on two streams. The current
Pipelined = 560.6 us; versusSequential = 808.9 us, that is a1.443xspeedup with66.8%overlap efficiency. - Speedup:
Sequential / Pipelined. A larger value means communication-compute overlap is more effective. - Time saved: total wall-clock time saved relative to the sequential path. The current run saves
248.4 us, or30.7%. - Overlap efficiency: the fraction of the shorter phase that is hidden by overlap.
66.8%means roughly two thirds of the shorter phase is now hidden by overlap.
Optimization History¶
The rows below are historical optimization checkpoints; the last row is the latest end-to-end result from the current
subtile-ready / AG-summary overlappath. Treat the older rows as context, not as a literal decomposition of the live path.
| Optimization | Pipelined (us) | Gain | Conclusion |
|---|---|---|---|
| Baseline | 808 | - | - |
| Block Swizzle | 793 | -1.8% |
Kept |
RS AtomicAdd removes the separate Reduce stage |
736 | -6.6% |
Kept |
| AG row-level flattened scheduling | 623 | -15.4% |
Historical checkpoint |
48 AIV (RS skip + AG participate) |
639 | RS only on 24 AIV, AG on 48 AIV | Reverted (AIC interference) |
48 AIV dual-queue (1 AIC : 2 AIV) |
667 | both RS and AG on 48 AIV | Reverted (AIC interference) |
Current subtile-ready / AG-summary overlap path |
560.6 | about -10.0% versus the 623 us historical checkpoint |
Current result |
Performance Tuning Guide¶
1. Prioritize Multi-Core Partitioning¶
Each AIC receives a tile subset according to block_idx x tiles_per_block, and blocks do not interfere with one another.
Checklist:
- Tune
COMPUTE_BLOCK_NUMso each block gets a similar number of tiles. - For different matrix shapes, recompute the total tile count as
G_NUM_TILES = (M_padded/128) x (N_padded/256).
2. Choose a Proper Base Tile¶
L0A and L0B use ping/pong double buffering, and each buffer is limited to 32 KiB.
For FP16 input (2 bytes/elem):
- L0A tile bytes ~=
baseM x baseK x 2=128 x 64 x 2 = 16 KiB - L0B tile bytes ~=
baseK x baseN x 2=64 x 256 x 2 = 32 KiB
The communication tile size is:
baseM x baseN x sizeof(FP16) = 128 x 256 x 2 = 64 KB
3. Use L1 stepK Caching to Increase Reuse¶
With stepKa=stepKb=4, one TLOAD brings 4 K-slices into L1, and subsequent TEXTRACT operations pull them into L0 one by one.
L1 usage:
2 x 64KB (A) + 2 x 128KB (B) = 384KB <= 1024KB
Increasing stepK can reduce DMA launch overhead, but the total must still fit in L1.
4. Preserve Pipeline Overlap¶
The key to performance is the combination of:
- double buffering inside the compute kernel (
L1/L0A/L0B) - dual-stream overlap between compute and communication
When you observe:
- communication time >> compute time: the compute side is already efficient, so focus on improving communication or increasing overlap.
- compute time >> communication time: communication is fully hidden, so focus on the compute side.
5. Tune the Number of Communication Blocks¶
COMM_BLOCK_NUM controls AIV parallelism in the communication kernel and can be adjusted via --comm-blocks.
On Ascend910B, measurements showed that increasing COMM_BLOCK_NUM from 24 to 48 caused a significant increase in AIC compute time (about +24%) because of HBM bandwidth contention and TSCH scheduling overhead. A more stable default was therefore 24.
6. Constraints¶
Kmust be divisible byG_BASE_K x G_STEP_KA(default64 x 4 = 256).Mis padded automatically to a multiple of 128, andNis padded automatically to a multiple of 256.- All HCCL-window buffers must be allocated at the same offset on every rank.
signal_matrixmust be reset withaclrtMemsetbefore each iteration.
Build and Run¶
- Configure the Ascend CANN environment:
export ASCEND_CANN_PATH=/usr/local/Ascend/cann-<version>/set_env.sh
source "${ASCEND_CANN_PATH}"
- Activate a conda environment that provides Python and NumPy:
conda activate <your-conda-env>
- Run the example with 8 ranks:
cd ${git_clone_path}/kernels/manual/a2a3/gemm_ar
./run.sh --nranks 8 --soc-version Ascend910B1
- Specify the starting device index:
FIRST_DEVICE=0 ./run.sh --nranks 8 --soc-version Ascend910B1
- Use custom compute/communication block counts:
./run.sh --nranks 8 --compute-blocks 20 --comm-blocks 4
When successful, the program prints:
GEMM AllReduce demo completed successfully.
Environment Variables¶
| Environment Variable | Purpose | Default Behavior |
|---|---|---|
ASCEND_CANN_PATH |
Full path to the CANN set_env.sh script |
Auto-globs /usr/local/Ascend/cann-*/set_env.sh and picks the latest one |
MPI_SEARCH_DIRS |
Search paths for MPI bin/ directories (space-separated) |
Searches common locations such as /usr/local/mpich/bin and /home/mpich/bin |
ASCEND_DRIVER_PATH |
Ascend driver path used by CMake | Defaults to /usr/local/Ascend/driver |
MPI_LIB_PATH |
Absolute path to libmpi.so for runtime dynamic loading |
Auto-set by run.sh according to the discovered MPI installation |
HCCL_BUFFSIZE |
HCCL RDMA window size in MB | Auto-computed by run.sh from the padded M / N footprint |
FIRST_DEVICE |
Starting NPU device index | Defaults to 0 |
Changing Matrix Dimensions¶
Update CONFIG_G_M, CONFIG_G_K, and CONFIG_G_N in gemm_ar_config.h. All source files share the configuration through includes. You can also pass them from CMake:
cmake -DCONFIG_G_M=8192 -DCONFIG_G_K=8192 -DCONFIG_G_N=2048 ..
Constraint: K must be divisible by G_BASE_K x G_STEP_KA (default 64 x 4 = 256). HCCL_BUFFSIZE is computed automatically by run.sh.
FAQ¶
| Problem | Cause and Fix |
|---|---|
HCCL window too small |
The window must cover the padded reduced_output footprint plus signal_matrix. Check whether HCCL_BUFFSIZE was manually overridden; run.sh auto-raises it from pad(M) x pad(N) x 2 / 1MB + 64MB |
HcclGetRootInfo failed: 7 |
Leftover dirty state from a previous run. Execute rm -rf /dev/shm/sem.hccl*; ipcrm -a or wait about 30 seconds and retry |
| Hangs after HCCL initialization | Usually a rank synchronization problem. Check that all ranks reached CommMpiBarrier |
| Segmentation fault in the communication kernel | Usually caused by an invalid window address. Verify that windowsIn[] entries are non-zero |
| Signal-wait deadlock or AG stall | signal_matrix was not cleared between iterations, or the subtile-ready / AG-summary ownership mapping is wrong. Check whether resetState calls memset on signal_matrix |
Verification fails with large max_diff |
FP16 precision is limited. The validation tolerance is atol=1.0, rtol=0.01. If the diff is abnormally large, check subtile-ready / AG-summary synchronization and owner mapping |
aclInit repeat init (100002) |
Harmless. The code already guards against repeated aclInit in one process |
--allow-run-as-root fails |
This project uses MPICH. That option is specific to OpenMPI |
Build System¶
- Compiler:
bisheng(CANN-bundled clang 15.0.5) - Cube kernel flags:
--cce-aicore-arch=dav-c220-cube -DMEMORY_BASE - Vector kernel flags:
--cce-aicore-arch=dav-c220-vec -DMEMORY_BASE - Host executable: standard
-xc++compilation - Linked libraries:
runtime,ascendcl,hcomm,tiling_api - The include path for
pto-comm-isamust come first so it overrides thepto_tile.hppbundled with CANN
Changelog¶
| Date | Change |
|---|---|
| 2025-12-15 | Initial version with dual-stream GEMM + AllReduce fusion |
| 2026-04-01 | Adapted to CANN 9.0.0 (removed the deprecated hccl/hccl.h dependency) |
| 2026-04-02 | RS AtomicAdd removed the standalone Reduce stage; AG flattening improved load balance |
| 2026-04-21 | Communication mode changed from RS -> DeviceBarrier -> AG to subtile-ready / AG-summary overlap |