2D Convolution Forward Operator Example

Overview

This sample implements the forward pass of a 2D convolution on the PTO framework and demonstrates common optimization techniques: multi-core partitioning, base-block selection, L1 caching, and double buffering.

Supported AI Processors

A2/A3

Directory Structure

kernels/manual/a2a3/conv2d_forward/
├── scripts/
│   └── gen_data.py                      # Generate input and golden output
├── CMakeLists.txt                       # Build configuration
├── conv2d_forward_kernel.cpp            # Kernel implementation
├── main.cpp                             # Host-side entry point
└── run.sh                               # Convenience script

Operator Description

Computational Function

This example implements 2D convolution forward calculation:

\[ Y = X * K \]

where * denotes the convolution operation. The tensor layouts and output dimensions are as follows:

X is of shape [batch, cin, hin, win, c0]
K is in Fractal_Z format: [cin * hk * wk, n/16, 16, c0]
Y is of shape [batch, n/c0, hout, wout, c0]
hout = (hin + padTop + padBottom - dilationH * (hk - 1) - 1) / strideH + 1
wout = (win + padLeft + padRight - dilationW * (wk - 1) - 1) / strideW + 1

The default reference configuration in main.cpp is X=[4, 32, 16, 96, 16], K=[288, 384, 16, 16], Y=[4, 384, 16, 96, 16], strideH=strideW=1, dilationH=dilationW=1, padTop=padBottom=padLeft=padRight=1.
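As a quick check of the formulas above, the following Python sketch recomputes the K and Y shapes for the default configuration. The function name `conv2d_shapes` and its argument names are illustrative and are not taken from main.cpp or gen_data.py. The sketch assumes the division in the hout/wout formulas is integer (floor) division, and it takes hk=wk=3 and n=6144 from the tiling parameters of this example.

```python
def conv2d_shapes(x_shape, hk, wk, n, stride=(1, 1), dilation=(1, 1), pad=(1, 1, 1, 1)):
    """Return (k_shape, y_shape) for the layouts described above.

    x_shape is [batch, cin, hin, win, c0]; pad is (top, bottom, left, right).
    All names here are hypothetical helpers for illustration only.
    """
    batch, cin, hin, win, c0 = x_shape
    stride_h, stride_w = stride
    dil_h, dil_w = dilation
    pad_top, pad_bottom, pad_left, pad_right = pad

    # Output spatial size; floor division is an assumption of this sketch.
    hout = (hin + pad_top + pad_bottom - dil_h * (hk - 1) - 1) // stride_h + 1
    wout = (win + pad_left + pad_right - dil_w * (wk - 1) - 1) // stride_w + 1

    k_shape = [cin * hk * wk, n // 16, 16, c0]   # Fractal_Z weight layout
    y_shape = [batch, n // c0, hout, wout, c0]   # output layout
    return k_shape, y_shape


k_shape, y_shape = conv2d_shapes([4, 32, 16, 96, 16], hk=3, wk=3, n=6144)
print(k_shape)  # [288, 384, 16, 16]
print(y_shape)  # [4, 384, 16, 96, 16]
```

With stride 1, dilation 1, a 3 × 3 kernel, and padding 1 on every side, hout and wout equal hin and win, which is why Y keeps the 16 × 96 spatial size of X.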

Specifications

Item	Value
OpType	`Conv`
Input	`X`: `[batch,cin,hin,win,c0]`, `half`; `K`: `Fractal_Z`, `half`
Output	`Y`: `[batch,n/c0,hout,wout,c0]`, `half`
Kernel Name	`Conv`

Optimization Details

This example uses the 24-core A3 platform for performance verification.

Multi-core partitioning: Split the workload among Cube cores to maximize parallelism. The k dimension is generally left unsplit, so each core processes the full k (singleCoreK = k), while m and n are distributed across the 24 cores. This example uses a 4 × 6 grouping (m split 4 ways, n split 6 ways), which gives singleCoreM=1536, singleCoreK=4608, and singleCoreN=1024.
Base-block selection: Choose a base block that offers a higher compute-to-memory-access ratio and fits the on-chip capacity and alignment constraints. For FP16, a common choice is [baseM, baseN, baseK] = [128, 256, 48]. Compared to [128, 128, 128], it provides higher arithmetic intensity and makes it easier to maintain 512-byte alignment for GM write-back.
L1 caching: Load multiple base blocks from GM to L1 in a single transfer to improve bandwidth utilization. In this example, stepKa=stepKb=3, meaning 3 k blocks are cached at a time.
Double buffering: Enable double buffering in L1/L0A/L0B to overlap DMA transfers with computation as much as possible.
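The partitioning and base-block arguments above can be checked with a few lines of arithmetic. The Python sketch below is illustrative only: the helper names are hypothetical, the row-major mapping from core id to output block is an assumption rather than something the kernel source is known to do, and the intensity metric (flops per input element read for one base-block multiply) is one simple way to compare base blocks.

```python
def split_across_cores(m, k, n, m_groups, n_groups):
    """Split m and n evenly over an m_groups x n_groups core grid; keep k whole."""
    if m % m_groups or n % n_groups:
        raise ValueError("m and n must divide evenly across the core grid")
    return m // m_groups, k, n // n_groups


def core_offsets(core_id, n_groups, single_m, single_n):
    """Assumed row-major mapping from a core id to the (m, n) start of its block."""
    m_idx, n_idx = divmod(core_id, n_groups)
    return m_idx * single_m, n_idx * single_n


def flops_per_loaded_element(base_m, base_n, base_k):
    """Flops per input element read for one base-block multiply."""
    flops = 2 * base_m * base_n * base_k          # multiply + add
    loaded = base_m * base_k + base_k * base_n    # A tile plus B tile
    return flops / loaded


single_m, single_k, single_n = split_across_cores(6144, 4608, 6144, m_groups=4, n_groups=6)
print(single_m, single_k, single_n)               # 1536 4608 1024
print(core_offsets(23, 6, single_m, single_n))    # (4608, 5120)
print(flops_per_loaded_element(128, 256, 48))     # about 170.7
print(flops_per_loaded_element(128, 128, 128))    # 128.0
```

Under this metric baseK cancels out, so the gain of [128, 256, 48] over [128, 128, 128] comes from the wider baseN; a smaller baseK mainly affects how the tiles fit into the on-chip buffers.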

Tiling Parameters

Parameter	Value
`m`	6144
`k`	4608
`n`	6144
`batch`	4
`cin`	32
`hin`	16
`win`	96
`c0`	16
`hk`	3
`wk`	3
`strideH`	1
`strideW`	1
`dilationH`	1
`dilationW`	1
`padTop`	1
`padBottom`	1
`padLeft`	1
`padRight`	1
`singleCoreM`	1536
`singleCoreK`	4608
`singleCoreN`	1024
`baseM`	128
`baseK`	48
`baseN`	256
`stepM`	1
`stepKa`	3
`stepKb`	3
`stepN`	1
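The m, k, and n values in the table are consistent with viewing the convolution as a matrix multiplication. The following Python sketch shows one such reading: m = batch × hout × wout, k = cin × hk × wk × c0, with n taken directly from the table. It also counts how many base blocks and L1 loads a single core would iterate over. The function names are hypothetical and the loop structure of the actual kernel may differ.

```python
def gemm_dims(batch, cin, hout, wout, c0, hk, wk, n):
    """Map convolution parameters to GEMM dimensions (assumed im2col-style view)."""
    m = batch * hout * wout
    k = cin * hk * wk * c0
    return m, k, n


def per_core_iterations(single_m, single_k, single_n, base_m, base_k, base_n, step_k):
    """Count base blocks along m, n, k and the number of L1 loads along k."""
    m_blocks = single_m // base_m
    n_blocks = single_n // base_n
    k_blocks = single_k // base_k
    k_loads = -(-k_blocks // step_k)   # ceiling division: step_k base blocks per load
    return m_blocks, n_blocks, k_blocks, k_loads


print(gemm_dims(batch=4, cin=32, hout=16, wout=96, c0=16, hk=3, wk=3, n=6144))
# (6144, 4608, 6144)
print(per_core_iterations(1536, 4608, 1024, base_m=128, base_k=48, base_n=256, step_k=3))
# (12, 4, 96, 32)
```

With stepKa=stepKb=3, the 96 base blocks along k are fetched in 32 L1 loads instead of 96, which is the bandwidth benefit that L1 caching targets.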

Measured Performance (Reference)

The following data was measured on an Ascend A3 (24 cores) across several input and kernel sizes (fp16 input → fp32 output).

Parameters	TMATMUL (Cube) %	TLOAD %	TEXTRACT %	TSTORE %	Execution Time (ms)
`X=[4,8,8,48,16]` `K=[72,96,16,16]`	52.9%	43.9%	59.4%	5.6%	0.0322
`X=[4,16,12,64,16]` `K=[144,192,16,16]`	86.1%	75.1%	57.3%	4.0%	0.1473
`X=[4,24,12,96,16]` `K=[216,288,16,16]`	89.3%	78.8%	61.6%	2.8%	0.4709
`X=[4,32,16,96,16]` `K=[288,384,16,16]`	90.8%	80.4%	61.1%	2.1%	1.0979
`X=[4,40,15,128,16]` `K=[360,480,16,16]`	91.3%	80.7%	62.4%	1.7%	2.1312

For the meaning of the columns in the table and for performance optimization approaches, refer to the Measured Performance section of the gemm_performance example.
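To put the execution times in perspective, the sketch below converts each row into an approximate achieved throughput using 2 × m × k × n flops. This is a derived estimate, not a measured figure: it assumes every row uses the same 3 × 3 kernel, stride 1, dilation 1, and padding 1 as the default configuration (so hout = hin and wout = win), and the helper name `achieved_tflops` is hypothetical.

```python
# (X shape, K shape, execution time in ms), copied from the table above.
ROWS = [
    ([4, 8, 8, 48, 16], [72, 96, 16, 16], 0.0322),
    ([4, 16, 12, 64, 16], [144, 192, 16, 16], 0.1473),
    ([4, 24, 12, 96, 16], [216, 288, 16, 16], 0.4709),
    ([4, 32, 16, 96, 16], [288, 384, 16, 16], 1.0979),
    ([4, 40, 15, 128, 16], [360, 480, 16, 16], 2.1312),
]


def achieved_tflops(x_shape, k_shape, time_ms):
    """Estimate throughput in TFLOPS from shapes and time (assumes hout=hin, wout=win)."""
    batch, _cin, hin, win, _c0 = x_shape
    k_rows, n1, n0, c0 = k_shape
    m = batch * hin * win      # output positions, under the same-size assumption
    k = k_rows * c0            # cin * hk * wk * c0
    n = n1 * n0                # output channels
    flops = 2 * m * k * n      # one multiply and one add per term
    return flops / (time_ms * 1e-3) / 1e12


for x_shape, k_shape, time_ms in ROWS:
    print(x_shape, k_shape, round(achieved_tflops(x_shape, k_shape, time_ms), 1))
```

The estimate rises with problem size in step with the TMATMUL utilization column, since larger problems amortize fixed launch and data-movement overheads better.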

Build and Run

Configure the Ascend CANN environment:

source ${ASCEND_INSTALL_PATH}/bin/setenv.bash

Generate the input and golden output:

cd ${git_clone_path}/kernels/manual/a2a3/conv2d_forward
python3 scripts/gen_data.py

Run the example:

bash run.sh -r npu -v Ascend910B1

If the run succeeds, the following is printed:

test success