High-Performance GEMM Operator Example
Overview
This example demonstrates how to implement a high-performance GEMM operator using PTO and common optimization techniques (core partitioning, base-block selection, L1 caching, and double buffering).
Supported AI Processors
- A2/A3
Directory Layout
kernels/manual/a2a3/gemm_performance/
├── scripts/
│ └── gen_data.py # Generates input and golden output
├── CMakeLists.txt # Build configuration
├── gemm_performance_kernel.cpp # Kernel implementation
├── main.cpp # Host-side entry point
└── run.sh # Convenience script
Operator Description
Function
This example implements GEMM:

$$
C = A \times B
$$

Where:
- $A$ is $m \times k$
- $B$ is $k \times n$
- $C$ is $m \times n$
The default reference configuration in main.cpp uses m=k=n=6144.
Specification
| Item | Value |
|---|---|
| OpType | GEMM |
| Inputs | a: m×k, float16, ND; b: n×k, float16, ND |
| Output | c: m×n, float, ND |
| Kernel name | GEMMPerformance |
Optimization Notes
Performance for this example was validated on a 24-core A3 platform.
- Core partitioning: maximize parallelism by splitting work across Cube cores. Because `m`, `n`, and `k` are equal, keep `k` unsplit on each core and split `m` and `n` across the 24 cores. A `4 × 6` grouping yields `singleCoreM=1536`, `singleCoreK=6144`, `singleCoreN=1024` (chosen by this example).
- Base block selection: choose base blocks that maximize the compute-to-memory ratio. For FP16, a common choice is `[baseM, baseN, baseK] = [128, 256, 64]`, which improves arithmetic intensity versus `[128, 128, 128]` while maintaining 512-byte-aligned GM writes.
- L1 caching: move multiple base blocks from GM to L1 per transfer to improve bandwidth utilization. This example sets `stepKa=stepKb=4` to cache four `k` blocks at a time.
- Double buffering: overlap DMA and compute by enabling double buffering in L1, L0A, and L0B.
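
The arithmetic-intensity comparison between the two base-block shapes can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper name `block_intensity` is hypothetical, and it assumes intensity is measured as FLOPs per input byte for one base-block multiply (A block plus B block, 2 bytes per fp16 element), ignoring the output tile.

```python
def block_intensity(base_m: int, base_n: int, base_k: int, in_bytes: int = 2) -> float:
    """FLOPs per input byte for one base-block multiply (hypothetical helper)."""
    flops = 2 * base_m * base_n * base_k  # one multiply and one add per MAC
    bytes_in = (base_m * base_k + base_k * base_n) * in_bytes  # A block + B block
    return flops / bytes_in


# Shapes are given as (baseM, baseN, baseK), matching the order used in the text.
for shape in [(128, 256, 64), (128, 128, 128)]:
    print(shape, round(block_intensity(*shape), 1))
```

Under this model both shapes perform the same number of FLOPs per block, but `[128, 256, 64]` reads fewer input bytes, which is where the higher ratio comes from.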
Tiling Parameters
| Parameter | Value |
|---|---|
| `m` | 6144 |
| `k` | 6144 |
| `n` | 6144 |
| `singleCoreM` | 1536 |
| `singleCoreK` | 6144 |
| `singleCoreN` | 1024 |
| `baseM` | 128 |
| `baseK` | 64 |
| `baseN` | 256 |
| `stepM` | 1 |
| `stepKa` | 4 |
| `stepKb` | 4 |
| `stepN` | 1 |
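
As a sanity check on the table, the loop trip counts that follow from these parameters can be derived on the host. This is a minimal sketch, not the kernel's code: the `Tiling` class and `derive` function are hypothetical, and it assumes `mLoop`, `nLoop`, and `kLoop` are the per-core block counts along each dimension and that every level divides the level above it with no tail handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tiling:
    # Defaults copied from the tiling table above.
    m: int = 6144
    k: int = 6144
    n: int = 6144
    single_core_m: int = 1536
    single_core_k: int = 6144
    single_core_n: int = 1024
    base_m: int = 128
    base_k: int = 64
    base_n: int = 256
    step_ka: int = 4
    step_kb: int = 4


def derive(t: Tiling) -> dict:
    """Derive core count and loop trip counts (assumed definitions, no tails)."""
    assert t.m % t.single_core_m == 0 and t.n % t.single_core_n == 0
    assert t.single_core_m % t.base_m == 0
    assert t.single_core_n % t.base_n == 0
    assert t.single_core_k % t.base_k == 0
    k_loop = t.single_core_k // t.base_k
    assert k_loop % t.step_ka == 0 and k_loop % t.step_kb == 0
    return {
        "cores": (t.m // t.single_core_m) * (t.n // t.single_core_n),
        "m_loop": t.single_core_m // t.base_m,
        "n_loop": t.single_core_n // t.base_n,
        "k_loop": k_loop,
        "a_loads_per_output_tile": k_loop // t.step_ka,
        "b_loads_per_output_tile": k_loop // t.step_kb,
    }


print(derive(Tiling()))
```

With the reference values this gives 24 cores, and each core walks 12 × 4 output base blocks with 96 `k` steps per block, staged into L1 in 24 groups of four.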
Measured Performance (Reference)
The following measurements were collected on Ascend A3 (24 cores) for several m=k=n sizes (fp16 inputs → fp32 output).
| Problem size | TMATMUL (Cube) Ratio | TEXTRACT Ratio | TLOAD Ratio | TSTORE Ratio | Execution time (ms) |
|---|---|---|---|---|---|
| `m=1536 k=1536 n=1536` | 54.5% | 42.2% | 72.2% | 7.7% | 0.0388 |
| `m=3072 k=3072 n=3072` | 79.0% | 62.0% | 90.9% | 5.8% | 0.2067 |
| `m=6144 k=6144 n=6144` | 86.7% | 68.1% | 95.2% | 3.1% | 1.5060 |
| `m=7680 k=7680 n=7680` | 80.6% | 63.0% | 98.4% | 2.4% | 3.1680 |
What the numbers suggest
These metrics are most useful for answering a single question: which engine is limiting the end-to-end pipeline.
- Scaling behavior: execution time grows super-linearly with `m=k=n` (as expected for `O(n^3)` work), and throughput typically improves from small sizes to mid sizes before flattening.
- TMATMUL utilization rises, then drops: TMATMUL (Cube) Ratio increases from 54.5% → 86.7% as the problem grows (better amortization and steadier pipelines), then drops to 80.6% at `7680³`. This pattern usually indicates the compute pipeline is no longer the only limiter at the largest size.
- TLOAD is near-saturated at large sizes: TLOAD Ratio grows to 98.4% at `7680³`, suggesting the GM feed path is close to its limit and starts throttling compute (TMATMUL Ratio decreases).
- TSTORE is small and keeps shrinking: output writeback is a small fraction of total time for GEMM, especially at larger sizes (one write for many FMAs).
- TEXTRACT is meaningful overhead: the 42%→68% range suggests L1→L0 extract/layout costs are not negligible; optimizing this stage (and overlapping it cleanly) directly impacts overall performance.
If you want a single rule of thumb: when TLOAD Ratio approaches ~100%, you are usually memory-feed limited (even if TMATMUL still looks “busy”), and further speedups come from reducing bytes moved per FLOP and improving overlap.
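
To make the scaling observation concrete, the execution times in the table can be converted to effective throughput. The sketch below uses only the measured times from the table and assumes the usual `2·m·n·k` FLOP count for GEMM; the function name `tflops` is illustrative.

```python
# (n, execution time in ms) copied from the measurement table above, with m = k = n.
measurements = [(1536, 0.0388), (3072, 0.2067), (6144, 1.5060), (7680, 3.1680)]


def tflops(n: int, time_ms: float) -> float:
    """Effective whole-chip throughput in TFLOP/s, assuming 2*n^3 FLOPs."""
    flops = 2 * n ** 3
    return flops / (time_ms * 1e-3) / 1e12


for n, ms in measurements:
    print(f"n={n}: {tflops(n, ms):.1f} TFLOP/s")
```

The derived throughput climbs from the smallest size to `6144³` and then dips at `7680³`, which lines up with the TMATMUL ratio falling while TLOAD approaches saturation.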
Performance Optimization Guide (How to Tune This Kernel)
This example is intentionally structured around a standard GEMM pipeline:
- TLOAD stage: GM → L1 (`TLOAD` into `aMatTile[]` / `bMatTile[]`)
- TEXTRACT stage: L1 → L0A/L0B (`TEXTRACT` into `aTile[]` / `bTile[]`)
- TMATMUL stage: L0A/L0B → L0C (`TMATMUL` / `TMATMUL_ACC` into `cTile`)
- TSTORE stage: L0C → GM (`TSTORE` of `cTile`)
The core kernel implementation is in kernels/manual/a2a3/gemm_performance/gemm_performance_kernel.cpp, with the critical control points below.
1) Partition work across cores first
Look at InitGMOffsets(...):
- The kernel splits the global `C[m,n]` into `blockDim` independent tiles.
- For square problems (`m≈n`), splitting across both `m` and `n` usually gives better balance than splitting only one dimension.
Checklist:
- Ensure `m % singleCoreM == 0` and `n % singleCoreN == 0`.
- Choose a 2D grid decomposition (`m`-tiles × `n`-tiles) that matches `blockDim` so each core gets a contiguous `A` panel and `B` panel.
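
The checklist can be turned into a small search over grid shapes. The following is a sketch under stated assumptions rather than the logic in `InitGMOffsets(...)`: the function `choose_grid` is hypothetical, and its selection rule (minimize `singleCoreM + singleCoreN`, a proxy for the A-panel plus B-panel data each core must read) is one plausible heuristic, not something the source prescribes.

```python
def choose_grid(m: int, n: int, block_dim: int, base_m: int, base_n: int):
    """Pick (gridM, gridN, singleCoreM, singleCoreN) for a 2D split of C.

    Hypothetical heuristic: among grids with gridM * gridN == block_dim that
    divide m and n evenly (and leave whole base blocks per core), prefer the
    one with the smallest singleCoreM + singleCoreN.
    """
    best = None
    for grid_m in range(1, block_dim + 1):
        if block_dim % grid_m:
            continue
        grid_n = block_dim // grid_m
        if m % grid_m or n % grid_n:
            continue  # violates m % singleCoreM == 0 or n % singleCoreN == 0
        sc_m, sc_n = m // grid_m, n // grid_n
        if sc_m % base_m or sc_n % base_n:
            continue  # per-core tile is not a whole number of base blocks
        cost = sc_m + sc_n  # proportional to per-core A + B panel size (times k)
        if best is None or cost < best[0]:
            best = (cost, grid_m, grid_n, sc_m, sc_n)
    if best is None:
        raise ValueError("no valid 2D grid for this shape and blockDim")
    return best[1:]


print(choose_grid(6144, 6144, 24, 128, 256))
```

For the reference shape this selects the `4 × 6` grid with `singleCoreM=1536` and `singleCoreN=1024`. The `6 × 4` grid has the same cost under this heuristic; the first candidate found wins the tie.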
2) Choose base tiles that fit L0A/L0B cleanly
Look at InitBuffers(...):
- L0A and L0B are explicitly double-buffered with a 32 KiB ping/pong split (`0x0` and `0x0 + 32768`).
- This implies an important constraint: the per-buffer tile footprint must be ≤ 32 KiB.
For fp16 inputs (2 bytes/elem):
- L0A tile bytes ≈ `baseM * baseK * 2`
- L0B tile bytes ≈ `baseK * baseN * 2`
The reference uses:
- `baseM=128, baseK=64` → `128*64*2 = 16 KiB` (fits comfortably)
- `baseK=64, baseN=256` → `64*256*2 = 32 KiB` (fills the budget)
Guidelines:
- Prefer tile sizes that fully utilize the 32 KiB budget (especially for `B`), but do not exceed it.
- Keep `baseK` aligned to the Cube’s preferred K granularity (often 32/64/128 depending on data type and layout).
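
The footprint rule above is easy to check before changing any base-tile constant. This is a minimal sketch with hypothetical helper names; the 32 KiB per-buffer budget comes from the ping/pong offsets described above, and fp16 inputs (2 bytes per element) are assumed.

```python
L0_BUFFER_BYTES = 32 * 1024  # per ping/pong buffer, from the 32768-byte offset


def l0_footprint(base_m: int, base_k: int, base_n: int, elem_bytes: int = 2) -> dict:
    """Bytes occupied by one A tile in L0A and one B tile in L0B."""
    return {
        "L0A": base_m * base_k * elem_bytes,
        "L0B": base_k * base_n * elem_bytes,
    }


def fits_l0(base_m: int, base_k: int, base_n: int, elem_bytes: int = 2) -> bool:
    """True if both tiles fit within one buffer of the double-buffered L0A/L0B."""
    fp = l0_footprint(base_m, base_k, base_n, elem_bytes)
    return all(size <= L0_BUFFER_BYTES for size in fp.values())


print(l0_footprint(128, 64, 256), fits_l0(128, 64, 256))   # reference tiles
print(l0_footprint(128, 128, 256), fits_l0(128, 128, 256))  # doubling baseK overflows L0B
```

Running a candidate shape through `fits_l0` first avoids discovering an L0 overflow only after a rebuild.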
3) Increase reuse with L1 “stepK” caching (without overflowing)
Look at ProcessKIteration(...) and the kModstepKa logic:
- `stepKa` / `stepKb` control how many `K`-slices are staged into L1 per DMA.
- The example uses `stepKa=stepKb=4`: one `TLOAD` transfer brings in 4 micro-panels that are later `TEXTRACT`’d.
Guidelines:
- Increase `stepK` to reduce DMA launch overhead and improve burst efficiency until L1 capacity or overlap breaks down.
- If TLOAD is near 100% and TMATMUL drops, try:
  - increasing `stepK` (more reuse per fetch), or
  - increasing compute intensity (e.g., larger `baseN`/`baseM` if L0 allows), or
  - improving overlap (next section).
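
How far `stepK` can grow before L1 overflows can be estimated with a footprint calculation. The sketch below makes several assumptions that should be checked against the kernel and the target hardware: an L1 capacity of 512 KiB (a placeholder, not stated in this document), double-buffered A and B staging areas, `stepM=stepN=1`, and fp16 inputs. The helper name `l1_bytes` is hypothetical.

```python
L1_CAPACITY_BYTES = 512 * 1024  # ASSUMPTION: placeholder capacity, verify for your device


def l1_bytes(base_m: int, base_k: int, base_n: int, step_ka: int, step_kb: int,
             buffers: int = 2, elem_bytes: int = 2) -> int:
    """Estimated L1 bytes for staged A and B panels across all buffers."""
    a_panel = base_m * (base_k * step_ka) * elem_bytes  # baseM x (stepKa * baseK)
    b_panel = (base_k * step_kb) * base_n * elem_bytes  # (stepKb * baseK) x baseN
    return buffers * (a_panel + b_panel)


for step_k in (1, 2, 4, 8):
    used = l1_bytes(128, 64, 256, step_k, step_k)
    print(f"stepK={step_k}: {used // 1024} KiB, fits={used <= L1_CAPACITY_BYTES}")
```

Under these assumptions the reference `stepKa=stepKb=4` uses 384 KiB, while `stepK=8` would need 768 KiB and no longer fit, so gains from a larger `stepK` would have to come with smaller base tiles or single buffering.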
4) Keep the pipeline overlapped (avoid bubbles)
The double-buffering flags (mte2DBFlag, mte1DBFlag) and event flow are the performance heart of this kernel:
- TLOAD loads the next `aMatTile[]`/`bMatTile[]` while
- TEXTRACT extracts the next `aTile[]`/`bTile[]` while
- TMATMUL computes the current `TMATMUL[_ACC]`.
If you see:
- high TLOAD but low TMATMUL → the Cube is starving; overlap is insufficient or TLOAD is truly saturated.
- high TEXTRACT but low TMATMUL → extract/layout is the limiter; reduce `TEXTRACT` cost or increase compute per extract.
Practical tuning steps:
- Make sure the “first-iteration warmup” and “last-iteration drain” do not serialize the steady-state loop. This file already includes “supplement first/last sync instr”; keep them if you refactor.
- Keep compute and data movement in separate phases per buffer index (ping/pong), and only `wait_flag` at true dependency boundaries.
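
Why overlap matters can be seen with a toy timing model. This is an idealized sketch, not a model of the actual hardware: the stage costs are hypothetical numbers in arbitrary units, and the pipelined formula assumes perfect overlap with one fill phase followed by a steady state paced by the slowest stage.

```python
def serialized_time(stage_costs: tuple, iterations: int) -> int:
    """Total time if every stage of every iteration runs back to back."""
    return iterations * sum(stage_costs)


def pipelined_time(stage_costs: tuple, iterations: int) -> int:
    """Ideal pipeline: fill once, then one iteration finishes per slowest-stage time."""
    return sum(stage_costs) + (iterations - 1) * max(stage_costs)


# Hypothetical per-iteration costs for (TLOAD, TEXTRACT, TMATMUL), arbitrary units.
stages = (10, 7, 9)
iters = 96

print("serialized:", serialized_time(stages, iters))
print("pipelined: ", pipelined_time(stages, iters))
print("steady-state TMATMUL busy fraction:", stages[2] / max(stages))
```

In this toy case the slowest stage is TLOAD, so even a perfectly overlapped loop keeps the Cube busy only 90% of the time, which mirrors the "high TLOAD but low TMATMUL" symptom described above.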
5) When scaling to new shapes, re-tune the core tile first
For different m/k/n, do not only change the constants:
- Recompute `singleCoreM`/`singleCoreN` so each core gets a similar amount of work.
- Recheck `mLoop`, `nLoop`, and `kLoop` (in `RunGemmE2E`), because loop trip counts strongly affect overlap efficiency.
Common failure mode:
- Very large `kLoop` with insufficient `stepK` can make TLOAD dominate; very small `kLoop` can make overhead dominate.
6) Use the utilization ratios to decide what to optimize
From the measurements above:
- `7680³` has TLOAD=98.4% and TMATMUL down to 80.6% → focus on reducing GM traffic (higher reuse, better cache staging) and improving overlap rather than micro-optimizing `TMATMUL`.
- Mid sizes (`3072³`, `6144³`) show strong TMATMUL and TLOAD simultaneously → pipeline is close to balanced; improvements require careful end-to-end changes.
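
The reading of the ratios above can be written down as a rough triage function. This is only a sketch: the function `diagnose` and both thresholds (`near_full=0.97`, `busy=0.75`) are illustrative assumptions chosen so that the result matches the interpretation given in this section, not values taken from the kernel or a profiler.

```python
def diagnose(tmatmul: float, textract: float, tload: float,
             near_full: float = 0.97, busy: float = 0.75) -> str:
    """Classify a profile from utilization ratios in [0, 1] (hypothetical thresholds)."""
    if tload >= near_full:
        return "memory-feed limited"   # reduce GM traffic, improve overlap
    if textract >= near_full:
        return "extract limited"       # reduce TEXTRACT cost per unit of compute
    if tmatmul >= busy and tload >= busy:
        return "close to balanced"     # only end-to-end changes will help
    return "under-filled"              # fixed overheads dominate at this size


# Ratios copied from the measurement table: (TMATMUL, TEXTRACT, TLOAD).
profiles = {
    1536: (0.545, 0.422, 0.722),
    3072: (0.790, 0.620, 0.909),
    6144: (0.867, 0.681, 0.952),
    7680: (0.806, 0.630, 0.984),
}
for n, ratios in profiles.items():
    print(n, diagnose(*ratios))
```

Treat the labels as a starting point for investigation; the thresholds would need recalibrating for other kernels or devices.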
Build and Run
- Configure your Ascend CANN environment:
source ${ASCEND_INSTALL_PATH}/bin/setenv.bash
- Generate input + golden output:
cd ${git_clone_path}/kernels/manual/a2a3/gemm_performance
python3 scripts/gen_data.py
- Run the example:
bash run.sh -r npu -v Ascend910B1
If the run succeeds, the output prints:
test success
Changelog
| Date | Change |
|---|---|
| 2025-12-15 | Adjusted example directory and added this README |