High-Performance GEMM Operator Example

Overview

This example demonstrates how to implement a high-performance GEMM operator using PTO and common optimization techniques (core partitioning, base-block selection, L1 caching, and double buffering).

Supported AI Processors

  • A2/A3

Directory Layout

kernels/manual/a2a3/gemm_performance/
├── scripts/
│   └── gen_data.py                  # Generates input and golden output
├── CMakeLists.txt                   # Build configuration
├── gemm_performance_kernel.cpp      # Kernel implementation
├── main.cpp                         # Host-side entry point
└── run.sh                           # Convenience script

Operator Description

Function

This example implements GEMM:

\[ C = A \times B \]

Where:

  • A is m×k
  • B is k×n
  • C is m×n

The default reference configuration in main.cpp uses m=k=n=6144.
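
For reference, the computation itself is a plain dense matrix product; the minimal host-side sketch below shows what the golden output from scripts/gen_data.py verifies. It is illustration only: inputs are modelled as float and layouts are assumed row-major here, whereas the kernel consumes float16 and accumulates in float32.

    // Minimal reference GEMM in plain C++ (illustration only; the example's
    // golden output is generated by scripts/gen_data.py).
    #include <cstddef>
    #include <vector>

    void ReferenceGemm(const std::vector<float>& a,   // m x k, row-major (assumed)
                       const std::vector<float>& b,   // k x n, row-major (assumed)
                       std::vector<float>& c,         // m x n, row-major
                       std::size_t m, std::size_t k, std::size_t n) {
        for (std::size_t i = 0; i < m; ++i) {
            for (std::size_t j = 0; j < n; ++j) {
                float acc = 0.0f;
                for (std::size_t p = 0; p < k; ++p) {
                    acc += a[i * k + p] * b[p * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }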

Specification

Item Value
OpType GEMM
Inputs a: m×k, float16, ND; b: n×k, float16, ND
Output c: m×n, float32, ND
Kernel name GEMMPerformance

Optimization Notes

Performance was validated on a 24-core A3 platform.

  • Core partitioning: maximize parallelism by splitting work across Cube cores. Since m, n, and k are equal, prefer not splitting k within a single core, and split m and n across 24 cores. A 4 × 6 grouping yields singleCoreM=1536, singleCoreK=6144, singleCoreN=1024 (chosen by this example).
  • Base block selection: choose base blocks that maximize the compute-to-memory ratio. For FP16, a common choice is [baseM, baseN, baseK] = [128, 256, 64], which improves arithmetic intensity versus [128, 128, 128] while keeping GM writes 512-byte aligned; the sketch after this list works through this arithmetic.
  • L1 caching: move multiple base blocks from GM to L1 per transfer to improve bandwidth utilization. This example sets stepKa=stepKb=4 to cache four k blocks at a time.
  • Double buffering: overlap DMA and compute by enabling double buffering in L1, L0A, and L0B.
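
The arithmetic behind the core split and the base-block choice can be checked with a few lines of host-side code. This is a standalone sketch: the constants mirror the tiling table below, and none of the names belong to the kernel.

    // Sanity-check the 4 x 6 core split and the base-block footprints.
    // Standalone host-side arithmetic; all names are local to this sketch.
    #include <cstdio>

    int main() {
        const int m = 6144, k = 6144, n = 6144;
        const int coresM = 4, coresN = 6;              // 4 x 6 grouping over 24 cores

        const int singleCoreM = m / coresM;            // 1536
        const int singleCoreN = n / coresN;            // 1024
        const int singleCoreK = k;                     // k is not split within a core

        const int baseM = 128, baseN = 256, baseK = 64;
        const int bytesFp16 = 2;

        // Bytes staged into L0A/L0B per base block versus MACs performed on them.
        const long long l0aBytes = 1LL * baseM * baseK * bytesFp16;   // 16 KiB
        const long long l0bBytes = 1LL * baseK * baseN * bytesFp16;   // 32 KiB
        const long long macs     = 1LL * baseM * baseN * baseK;

        std::printf("singleCoreM=%d singleCoreK=%d singleCoreN=%d\n",
                    singleCoreM, singleCoreK, singleCoreN);
        std::printf("L0A tile = %lld B, L0B tile = %lld B\n", l0aBytes, l0bBytes);
        std::printf("MACs per byte loaded into L0 = %.1f\n",
                    static_cast<double>(macs) / (l0aBytes + l0bBytes));
        return 0;
    }

Repeating the last computation for [128, 128, 128] gives a lower MACs-per-byte ratio, which is the arithmetic-intensity argument in the base-block bullet above.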

Tiling Parameters

Parameter Value
m 6144
k 6144
n 6144
singleCoreM 1536
singleCoreK 6144
singleCoreN 1024
baseM 128
baseK 64
baseN 256
stepM 1
stepKa 4
stepKb 4
stepN 1

Measured Performance (Reference)

The following measurements were collected on Ascend A3 (24 cores) for several m=k=n sizes (fp16 inputs → fp32 output).

Shape                   TMATMUL (Cube) Ratio   TEXTRACT Ratio   TLOAD Ratio   TSTORE Ratio   Execution time (ms)
m=1536 k=1536 n=1536    54.5%                  42.2%            72.2%         7.7%           0.0388
m=3072 k=3072 n=3072    79.0%                  62.0%            90.9%         5.8%           0.2067
m=6144 k=6144 n=6144    86.7%                  68.1%            95.2%         3.1%           1.5060
m=7680 k=7680 n=7680    80.6%                  63.0%            98.4%         2.4%           3.1680

What the numbers suggest

These metrics are most useful for answering a single question: which engine is limiting the end-to-end pipeline.

  • Scaling behavior: execution time grows super-linearly with m=k=n (as expected for O(n^3) work), and throughput typically improves from small sizes to mid sizes before flattening.
  • TMATMUL utilization rises, then drops: TMATMUL (Cube) Ratio increases from 54.5% → 86.7% as the problem grows (better amortization and steadier pipelines), then drops to 80.6% at 7680³. This pattern usually indicates the compute pipeline is no longer the only limiter at the largest size.
  • TLOAD is near-saturated at large sizes: TLOAD Ratio grows to 98.4% at 7680³, suggesting the GM feed path is close to its limit and starts throttling compute (TMATMUL Ratio decreases).
  • TSTORE is small and keeps shrinking: output writeback is a small fraction of total time for GEMM, especially at larger sizes (one write for many FMAs).
  • TEXTRACT is meaningful overhead: the 42%→68% range suggests L1→L0 extract/layout costs are not negligible; optimizing this stage (and overlapping it cleanly) directly impacts overall performance.

If you want a single rule of thumb: when TLOAD Ratio approaches ~100%, you are usually memory-feed limited (even if TMATMUL still looks “busy”), and further speedups come from reducing bytes moved per FLOP and improving overlap.
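
One convenient way to read the table is to convert the measured times into effective throughput (2·m·n·k floating-point operations per GEMM). The sketch below does only that arithmetic on the values above.

    // Convert the measured execution times into effective throughput.
    // 2*m*n*k floating-point operations per GEMM; times are from the table above.
    #include <cstdio>

    int main() {
        const struct { long long n; double ms; } runs[] = {
            {1536, 0.0388}, {3072, 0.2067}, {6144, 1.5060}, {7680, 3.1680},
        };
        for (const auto& r : runs) {
            const double flops  = 2.0 * r.n * r.n * r.n;          // m = k = n
            const double tflops = flops / (r.ms * 1e-3) / 1e12;   // effective TFLOPS
            std::printf("n=%lld: %.4f ms -> %.1f TFLOPS\n", r.n, r.ms, tflops);
        }
        return 0;
    }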

Performance Optimization Guide (How to Tune This Kernel)

This example is intentionally structured around a standard GEMM pipeline:

  1. TLOAD stage: GM → L1 (TLOAD into aMatTile[] / bMatTile[])
  2. TEXTRACT stage: L1 → L0A/L0B (TEXTRACT into aTile[] / bTile[])
  3. TMATMUL stage: L0A/L0B → L0C (TMATMUL / TMATMUL_ACC into cTile)
  4. TSTORE stage: L0C → GM (TSTORE of cTile)

The core kernel implementation is in kernels/manual/a2a3/gemm_performance/gemm_performance_kernel.cpp, with the critical control points below.
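
As a mental model only (the placeholder function below is not the kernel code, and the stage bodies are comments), the steady-state per-core structure follows the four stages above:

    // Schematic per-core control flow; placeholder names, not the real PTO API.
    // The real kernel issues TLOAD / TEXTRACT / TMATMUL / TSTORE at these points.
    void PerCoreGemmSkeleton(int mLoop, int nLoop, int kLoop) {
        for (int mi = 0; mi < mLoop; ++mi) {
            for (int ni = 0; ni < nLoop; ++ni) {
                for (int ki = 0; ki < kLoop; ++ki) {
                    // 1) TLOAD:    GM -> L1      (aMatTile[] / bMatTile[], stepKa/stepKb K-slices per DMA)
                    // 2) TEXTRACT: L1 -> L0A/L0B (aTile[] / bTile[])
                    // 3) TMATMUL:  L0A/L0B -> L0C (TMATMUL on the first K-slice, TMATMUL_ACC afterwards)
                }
                // 4) TSTORE: L0C -> GM for this tile of C
            }
        }
    }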

1) Partition work across cores first

Look at InitGMOffsets(...):

  • The kernel splits the global C[m,n] into blockDim independent tiles.
  • For square problems (m≈n), splitting across both m and n usually gives better balance than splitting only one dimension.

Checklist:

  • Ensure m % singleCoreM == 0 and n % singleCoreN == 0.
  • Choose a 2D grid decomposition (m-tiles × n-tiles) that matches blockDim so each core gets a contiguous A panel and B panel.
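
A purely illustrative sketch of that decomposition follows. The real logic lives in InitGMOffsets(...); the function name, the CoreOffsets struct, the coresN parameter, and the row-major ND layout below are all assumptions of this sketch, not the kernel's actual code.

    // Hypothetical per-core GM offset computation for row-major ND tensors.
    // The actual implementation is InitGMOffsets(...) in gemm_performance_kernel.cpp.
    #include <cstdint>

    struct CoreOffsets {
        uint64_t aOffset;  // start of this core's A panel (elements)
        uint64_t bOffset;  // start of this core's B panel (elements)
        uint64_t cOffset;  // start of this core's C tile  (elements)
    };

    CoreOffsets ComputeCoreOffsets(int blockIdx, int coresN,
                                   uint64_t k, uint64_t n,
                                   uint64_t singleCoreM, uint64_t singleCoreN) {
        const int mIdx = blockIdx / coresN;   // row of the core grid (e.g. 4 x 6)
        const int nIdx = blockIdx % coresN;   // column of the core grid
        CoreOffsets off{};
        off.aOffset = static_cast<uint64_t>(mIdx) * singleCoreM * k;  // A panel: contiguous rows
        off.bOffset = static_cast<uint64_t>(nIdx) * singleCoreN;      // B panel: column block
        off.cOffset = static_cast<uint64_t>(mIdx) * singleCoreM * n + nIdx * singleCoreN;
        return off;
    }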

2) Choose base tiles that fit L0A/L0B cleanly

Look at InitBuffers(...):

  • L0A and L0B are explicitly double-buffered with a 32 KiB ping/pong split (0x0 and 0x0 + 32768).
  • This implies an important constraint: the per-buffer tile footprint must be ≤ 32 KiB.

For fp16 inputs (2 bytes/elem):

  • L0A tile bytes ≈ baseM * baseK * 2
  • L0B tile bytes ≈ baseK * baseN * 2

The reference uses:

  • baseM=128, baseK=64 → 128 * 64 * 2 = 16 KiB (fits comfortably)
  • baseK=64, baseN=256 → 64 * 256 * 2 = 32 KiB (fills the budget)
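
These two products can be turned into compile-time checks; the sketch assumes only the 32 KiB per-buffer budget described above.

    // Compile-time check that the base tiles fit the 32 KiB per-buffer L0 budget
    // implied by the double-buffered split in InitBuffers(...).
    #include <cstddef>

    constexpr std::size_t kBytesPerFp16  = 2;
    constexpr std::size_t kL0BufferBytes = 32 * 1024;   // one ping or pong buffer

    constexpr std::size_t kBaseM = 128, kBaseK = 64, kBaseN = 256;

    constexpr std::size_t kL0ATileBytes = kBaseM * kBaseK * kBytesPerFp16;  // 16 KiB
    constexpr std::size_t kL0BTileBytes = kBaseK * kBaseN * kBytesPerFp16;  // 32 KiB

    static_assert(kL0ATileBytes <= kL0BufferBytes, "L0A tile exceeds the per-buffer budget");
    static_assert(kL0BTileBytes <= kL0BufferBytes, "L0B tile exceeds the per-buffer budget");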

Guidelines:

  • Prefer tile sizes that fully utilize the 32 KiB budget (especially for B), but do not exceed it.
  • Keep baseK aligned to the Cube’s preferred K granularity (often 32/64/128 depending on data type and layout).

3) Increase reuse with L1 “stepK” caching (without overflowing)

Look at ProcessKIteration(...) and the kModstepKa logic:

  • stepKa / stepKb control how many K-slices are staged into L1 per DMA.
  • The example uses stepKa=stepKb=4: one TLOAD transfer brings in 4 micro-panels that are later TEXTRACT’d.

Guidelines:

  • Increase stepK to reduce DMA launch overhead and improve burst efficiency until L1 capacity or overlap breaks down; the capacity sketch after this list estimates the L1 footprint.
  • If TLOAD is near 100% and TMATMUL drops, try:
      • increasing stepK (more reuse per fetch), or
      • increasing compute intensity (e.g., larger baseN/baseM if L0 allows), or
      • improving overlap (next section).
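
A rough footprint estimate helps when raising stepK. The sketch below assumes a 512 KiB L1 (verify the capacity for your target SoC) and double buffering of the staged panels; everything else mirrors the tiling table.

    // Rough L1 footprint check for the stepK staging.
    // Assumption: 512 KiB L1 capacity and double buffering in L1.
    #include <cstdio>

    int main() {
        const long long baseM = 128, baseK = 64, baseN = 256;
        const long long stepKa = 4, stepKb = 4;
        const long long bytesFp16 = 2;
        const long long l1Capacity = 512 * 1024;   // assumption: 512 KiB L1
        const int dbFactor = 2;                    // L1 double buffering

        const long long aStage = baseM * baseK * stepKa * bytesFp16;  // A micro-panels per TLOAD
        const long long bStage = baseK * baseN * stepKb * bytesFp16;  // B micro-panels per TLOAD
        const long long total  = (aStage + bStage) * dbFactor;

        std::printf("A stage: %lld B, B stage: %lld B, total (x%d buffers): %lld B\n",
                    aStage, bStage, dbFactor, total);
        std::printf("fits in assumed L1? %s\n", total <= l1Capacity ? "yes" : "no");
        return 0;
    }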

4) Keep the pipeline overlapped (avoid bubbles)

The double-buffering flags (mte2DBFlag, mte1DBFlag) and event flow are the performance heart of this kernel:

  • TLOAD fills the next aMatTile[]/bMatTile[], while
  • TEXTRACT prepares the next aTile[]/bTile[], while
  • TMATMUL computes on the current tiles (TMATMUL / TMATMUL_ACC).

If you see:

  • high TLOAD but low TMATMUL → the Cube is starving; overlap is insufficient or TLOAD is truly saturated.
  • high TEXTRACT but low TMATMUL → extract/layout is the limiter; reduce TEXTRACT cost or increase compute per extract.

Practical tuning steps:

  • Make sure the “first-iteration warmup” and “last-iteration drain” do not serialize the steady-state loop. This file already includes “supplement first/last sync instr”; keep them if you refactor.
  • Keep compute and data movement in separate phases per buffer index (ping/pong), and only wait_flag at true dependency boundaries.
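
A generic ping/pong skeleton is sketched below. It is not the kernel's actual event code: loadTile / extractTile / matmul are placeholders for TLOAD / TEXTRACT / TMATMUL, and the real kernel expresses the dependencies with set/wait flags rather than callbacks.

    // Generic double-buffer (ping/pong) skeleton; placeholder callbacks stand in
    // for the PTO stages. On real hardware, the fetch issued for the next buffer
    // overlaps with compute on the current one.
    #include <functional>

    void DoubleBufferedKLoop(int kLoop,
                             const std::function<void(int buf, int ki)>& loadTile,     // GM -> L1
                             const std::function<void(int buf, int ki)>& extractTile,  // L1 -> L0A/L0B
                             const std::function<void(int buf, int ki)>& matmul) {     // L0A/L0B -> L0C
        loadTile(0, 0);                    // warm-up: fill buffer 0 before the loop
        for (int ki = 0; ki < kLoop; ++ki) {
            const int cur  = ki & 1;       // buffer holding this iteration's data
            const int next = cur ^ 1;      // buffer to refill for the next iteration
            if (ki + 1 < kLoop) {
                loadTile(next, ki + 1);    // issue the next fetch before consuming current data
            }
            extractTile(cur, ki);          // in the real kernel, a wait_flag here guards `cur`
            matmul(cur, ki);               // accumulate into L0C
        }
        // Drain: the final matmul already consumed the last buffer; only the
        // store of L0C remains outside this loop.
    }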

5) When scaling to new shapes, re-tune the core tile first

For a different m/k/n, changing the shape constants alone is not enough:

  • Recompute singleCoreM/singleCoreN so each core gets a similar amount of work.
  • Recheck mLoop, nLoop, and kLoop (RunGemmE2E), because loop trip counts strongly affect overlap efficiency.

Common failure mode:

  • Very large kLoop with insufficient stepK can make TLOAD dominate; very small kLoop can make overhead dominate.
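
When re-tuning, the trip counts can be estimated up front. The sketch below is only a planning aid (the kernel derives its own counts in RunGemmE2E); the names mirror the tiling table.

    // Derive per-core loop trip counts for a new shape; planning aid only.
    #include <cstdio>

    int main() {
        const int m = 6144, k = 6144, n = 6144;        // change these for a new shape
        const int coresM = 4, coresN = 6;
        const int baseM = 128, baseK = 64, baseN = 256;

        const int singleCoreM = m / coresM;
        const int singleCoreN = n / coresN;
        const int singleCoreK = k;

        const int mLoop = singleCoreM / baseM;          // 12 for the reference shape
        const int nLoop = singleCoreN / baseN;          // 4
        const int kLoop = singleCoreK / baseK;          // 96

        std::printf("mLoop=%d nLoop=%d kLoop=%d\n", mLoop, nLoop, kLoop);
        std::printf("divisibility ok? %s\n",
                    (m % (coresM * baseM) == 0 && n % (coresN * baseN) == 0 && k % baseK == 0)
                        ? "yes" : "no");
        return 0;
    }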

6) Use the utilization ratios to decide what to optimize

From the measurements above:

  • 7680³ has TLOAD=98.4% and TMATMUL down to 80.6% → focus on reducing GM traffic (higher reuse, better cache staging) and improving overlap rather than micro-optimizing TMATMUL.
  • Mid sizes (3072³, 6144³) show strong TMATMUL and TLOAD simultaneously → pipeline is close to balanced; improvements require careful end-to-end changes.

Build and Run

  1. Configure your Ascend CANN environment:
source ${ASCEND_INSTALL_PATH}/bin/setenv.bash
  2. Generate input + golden output:
cd ${git_clone_path}/kernels/manual/a2a3/gemm_performance
python3 scripts/gen_data.py
  3. Run the example:
bash run.sh -r npu -v Ascend910B1

If the run succeeds, the output prints:

test success

Changelog

Date Change
2025-12-15 Adjusted example directory and added this README