2D Convolution Forward Operator Example
Overview
This example implements the 2D convolution forward computation on the PTO framework and demonstrates common optimization techniques: multi-core partitioning, base-block selection, L1 caching, and double buffering.
Supported AI Processors
- A2/A3
Directory Structure
    kernels/manual/a2a3/conv2d_forward/
    ├── scripts/
    │   └── gen_data.py            # Generate input and golden output
    ├── CMakeLists.txt             # Build configuration
    ├── conv2d_forward_kernel.cpp  # Kernel implementation
    ├── main.cpp                   # Host-side entry point
    └── run.sh                     # Convenience script
Operator Description
Computational Function
This example implements the 2D convolution forward computation:

$$Y = X * K$$

where `*` denotes the convolution operation. The tensor formats and output dimensions are as follows:

- `X` is of shape `[batch, cin, hin, win, c0]`
- `K` is in `Fractal_Z` format: `[cin * hk * wk, n/16, 16, c0]`
- `Y` is of shape `[batch, n/c0, hout, wout, c0]`
- `hout = (hin + padTop + padBottom - dilationH * (hk - 1) - 1) / strideH + 1`
- `wout = (win + padLeft + padRight - dilationW * (wk - 1) - 1) / strideW + 1`
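To make the layouts above concrete, the following is a minimal pure-Python reference of the forward computation on nested lists. It is a sketch only, not the kernel and not the contents of `gen_data.py`. The names `out_size` and `conv2d_forward_ref` are hypothetical, and the code assumes that the first `Fractal_Z` dimension is ordered as `(cin, hk, wk)` with `wk` varying fastest, that the trailing `c0` of `K` indexes input channels, and that padding is zero-filled.

```python
def out_size(size, pad_lo, pad_hi, dilation, kernel, stride):
    """Output extent along one spatial axis (integer division)."""
    return (size + pad_lo + pad_hi - dilation * (kernel - 1) - 1) // stride + 1


def conv2d_forward_ref(x, k, hk, wk, stride=(1, 1), dilation=(1, 1), pad=(1, 1, 1, 1)):
    """Naive reference convolution.

    x:   [batch][cin][hin][win][c0]
    k:   [cin * hk * wk][n/16][16][c0]  (assumed ordering: cin, then hk, then wk)
    pad: (padTop, padBottom, padLeft, padRight), zero padding assumed
    returns y: [batch][n/c0][hout][wout][c0]
    """
    batch, cin, hin, win = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    c0 = len(x[0][0][0][0])
    n = len(k[0]) * 16
    (sh, sw), (dh, dw) = stride, dilation
    pt, pb, pl, pr = pad
    hout = out_size(hin, pt, pb, dh, hk, sh)
    wout = out_size(win, pl, pr, dw, wk, sw)

    y = [[[[[0.0] * c0 for _ in range(wout)] for _ in range(hout)]
          for _ in range(n // c0)] for _ in range(batch)]
    for b in range(batch):
        for co in range(n):
            for ho in range(hout):
                for wo in range(wout):
                    acc = 0.0
                    for c1 in range(cin):
                        for kh in range(hk):
                            hi = ho * sh - pt + kh * dh
                            if not 0 <= hi < hin:
                                continue  # row lies in the zero padding
                            for kw in range(wk):
                                wi = wo * sw - pl + kw * dw
                                if not 0 <= wi < win:
                                    continue  # column lies in the zero padding
                                row = (c1 * hk + kh) * wk + kw
                                weights = k[row][co // 16][co % 16]
                                pixel = x[b][c1][hi][wi]
                                acc += sum(p * w for p, w in zip(pixel, weights))
                    y[b][co // c0][ho][wo][co % c0] = acc
    return y
```

The loop nest is far too slow for the default configuration, but on a tiny input it is a convenient way to check an index mapping by hand before comparing against the golden data.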
The default reference configuration in main.cpp is X=[4, 32, 16, 96, 16], K=[288, 384, 16, 16], Y=[4, 384, 16, 96, 16], strideH=strideW=1, dilationH=dilationW=1, padTop=padBottom=padLeft=padRight=1.
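The default shapes can be cross-checked against the formulas above with a few lines of arithmetic. The sketch below is illustrative: `gemm_view` is a hypothetical helper, `hk = wk = 3` and `n = 6144` are taken from the tiling parameters of this example, and the mapping `m = batch * hout * wout`, `k = cin * hk * wk * c0` is an assumption about the equivalent matrix multiplication that happens to agree with those tiling parameters.

```python
def out_size(size, pad_lo, pad_hi, dilation, kernel, stride):
    """Output extent along one spatial axis (integer division)."""
    return (size + pad_lo + pad_hi - dilation * (kernel - 1) - 1) // stride + 1


def gemm_view(x_shape, hk, wk, n, stride=(1, 1), dilation=(1, 1), pad=(1, 1, 1, 1)):
    """Derive output shapes and the assumed GEMM-equivalent sizes m, k, n."""
    batch, cin, hin, win, c0 = x_shape
    pt, pb, pl, pr = pad
    hout = out_size(hin, pt, pb, dilation[0], hk, stride[0])
    wout = out_size(win, pl, pr, dilation[1], wk, stride[1])
    return {
        "k_shape": [cin * hk * wk, n // 16, 16, c0],   # Fractal_Z weights
        "y_shape": [batch, n // c0, hout, wout, c0],
        "m": batch * hout * wout,                      # assumed: one row per output pixel
        "k": cin * hk * wk * c0,                       # assumed: reduction length
        "n": n,
    }


view = gemm_view([4, 32, 16, 96, 16], hk=3, wk=3, n=6144)
for name, value in view.items():
    print(name, value)
```

With the default configuration this reproduces `K=[288, 384, 16, 16]` and `Y=[4, 384, 16, 96, 16]`, and gives `m=6144`, `k=4608`, `n=6144`.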
Specifications
| Item | Value |
|---|---|
| OpType | Conv |
| Input | X: [batch,cin,hin,win,c0], half; K: Fractal_Z, half |
| Output | Y: [batch,n/c0,hout,wout,c0], half |
| Kernel Name | Conv |
Optimization Details
This example uses the 24-core A3 platform for performance verification.
- Multi-core partitioning: Split the workload among the Cube cores to maximize parallelism. It is generally not recommended to further split `k` within a single core. Instead, `m` and `n` are distributed across the 24 cores. This example uses a `4 × 6` grouping, corresponding to `singleCoreM=1536`, `singleCoreK=4608`, and `singleCoreN=1024`.
- Base block selection: Choose a base block that offers a higher compute-to-memory-access ratio and better fits the on-chip capacity and alignment constraints. For FP16, a common choice is `[baseM, baseN, baseK] = [128, 256, 48]`. Compared to `[128, 128, 128]`, it provides higher arithmetic intensity and makes it easier to maintain 512-byte alignment for GM write-back.
- L1 caching: Load multiple base blocks from GM to L1 in a single transfer to improve bandwidth utilization. In this example, `stepKa=stepKb=3`, meaning 3 `k` blocks are cached at a time.
- Double buffering: Enable double buffering in L1/L0A/L0B to overlap DMA operations with computation as much as possible.
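The partitioning and base-block arguments above can be checked numerically. The following sketch is a simplified model, not the kernel's scheduling code: the helper names are hypothetical, the row-major mapping from core index to output block is an assumption, and arithmetic intensity is modelled as FLOPs per byte of input loaded for one base block, ignoring reuse across blocks and output traffic.

```python
def core_grid(m, n, single_core_m, single_core_n):
    """Number of cores along m and n when each core owns one output block."""
    if m % single_core_m or n % single_core_n:
        raise ValueError("m and n must be multiples of the per-core sizes")
    return m // single_core_m, n // single_core_n


def core_origin(core_id, grid, single_core_m, single_core_n):
    """Top-left (m, n) offset of a core's output block, assuming row-major order."""
    _grid_m, grid_n = grid
    row, col = divmod(core_id, grid_n)
    return row * single_core_m, col * single_core_n


def arithmetic_intensity(base_m, base_n, base_k, bytes_per_elem=2):
    """FLOPs per input byte for one base block (2 bytes per element for FP16)."""
    flops = 2 * base_m * base_n * base_k
    bytes_in = bytes_per_elem * (base_m * base_k + base_k * base_n)
    return flops / bytes_in


grid = core_grid(6144, 6144, 1536, 1024)
print("grid:", grid, "cores:", grid[0] * grid[1])
print("last core origin:", core_origin(23, grid, 1536, 1024))
print("intensity [128, 256, 48]:  ", arithmetic_intensity(128, 256, 48))
print("intensity [128, 128, 128]: ", arithmetic_intensity(128, 128, 128))
```

Under this model the intensity reduces to `baseM * baseN / (baseM + baseN)` for 2-byte elements, so it does not depend on `baseK`: the `[128, 256, 48]` block gives roughly 85 FLOPs per byte versus 64 for `[128, 128, 128]`, which is consistent with the claim above.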
Tiling Parameters
| Parameter | Value |
|---|---|
| m | 6144 |
| k | 4608 |
| n | 6144 |
| batch | 4 |
| cin | 32 |
| hin | 16 |
| win | 96 |
| c0 | 16 |
| hk | 3 |
| wk | 3 |
| strideH | 1 |
| strideW | 1 |
| dilationH | 1 |
| dilationW | 1 |
| padTop | 1 |
| padBottom | 1 |
| padLeft | 1 |
| padRight | 1 |
| singleCoreM | 1536 |
| singleCoreK | 4608 |
| singleCoreN | 1024 |
| baseM | 128 |
| baseK | 48 |
| baseN | 256 |
| stepM | 1 |
| stepKa | 3 |
| stepKb | 3 |
| stepN | 1 |
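The tiling parameters imply per-core loop counts and buffer footprints that can be estimated as follows. This is a rough sketch under several assumptions that are not stated in this document: the buffer capacities in `ASSUMED_CAPACITY` are illustrative values for this processor family, the accumulator in L0C is taken to be 4 bytes per element and single-buffered, and the L1 estimate uses the matrix-multiplication view of both operands, whereas a convolution kernel may keep the raw feature map in L1 instead. The helper `tiling_report` is hypothetical.

```python
KIB = 1024

# Assumed on-chip capacities in bytes; NOT taken from this document.
ASSUMED_CAPACITY = {"L1": 512 * KIB, "L0A": 64 * KIB, "L0B": 64 * KIB, "L0C": 128 * KIB}

TILING = {
    "singleCoreM": 1536, "singleCoreK": 4608, "singleCoreN": 1024,
    "baseM": 128, "baseK": 48, "baseN": 256,
    "stepM": 1, "stepKa": 3, "stepKb": 3, "stepN": 1,
}


def tiling_report(t, in_bytes=2, acc_bytes=4, buffers=2):
    """Return (loop counts, estimated bytes per buffer) for one core."""
    loops = {
        "m_blocks": t["singleCoreM"] // t["baseM"],
        "n_blocks": t["singleCoreN"] // t["baseN"],
        "k_blocks": t["singleCoreK"] // t["baseK"],
        "l1_loads_along_k": t["singleCoreK"] // (t["baseK"] * t["stepKa"]),
    }
    l1_a = t["stepM"] * t["baseM"] * t["stepKa"] * t["baseK"] * in_bytes
    l1_b = t["stepN"] * t["baseN"] * t["stepKb"] * t["baseK"] * in_bytes
    usage = {
        "L1": buffers * (l1_a + l1_b),                            # double-buffered
        "L0A": buffers * t["baseM"] * t["baseK"] * in_bytes,      # double-buffered
        "L0B": buffers * t["baseK"] * t["baseN"] * in_bytes,      # double-buffered
        "L0C": t["baseM"] * t["baseN"] * acc_bytes,               # assumed single-buffered
    }
    return loops, usage


loops, usage = tiling_report(TILING)
print(loops)
for name, used in usage.items():
    print(f"{name}: {used // KIB} KiB of assumed {ASSUMED_CAPACITY[name] // KIB} KiB")
```

If the assumed capacities hold, every buffer fits, and the `[128, 256]` accumulator tile exactly fills the assumed L0C, which would explain why double buffering is listed for L1/L0A/L0B only.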
Measured Performance (Reference)
The following data was measured on Ascend A3 (24 cores), covering multiple input/output matrix sizes (fp16 input → fp32 output).
| Parameters | TMATMUL (Cube) % | TLOAD % | TEXTRACT % | TSTORE % | Execution Time (ms) |
|---|---|---|---|---|---|
| X=[4,8,8,48,16] K=[72,96,16,16] | 52.90% | 43.90% | 59.4% | 5.60% | 0.0322 |
| X=[4,16,12,64,16] K=[144,192,16,16] | 86.1% | 75.1% | 57.3% | 4% | 0.1473 |
| X=[4,24,12,96,16] K=[216,288,16,16] | 89.3% | 78.8% | 61.6% | 2.8% | 0.4709 |
| X=[4,32,16,96,16] K=[288,384,16,16] | 90.8% | 80.4% | 61.1% | 2.1% | 1.0979 |
| X=[4,40,15,128,16] K=[360,480,16,16] | 91.3% | 80.7% | 62.4% | 1.7% | 2.1312 |
For the meaning of the columns in the table and for performance optimization strategies, see the Measured Performance section of the gemm_performance example.
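The execution times can be turned into an approximate throughput figure. The sketch below is a back-of-the-envelope estimate rather than a measured result: it assumes every row uses the default stride, dilation, and padding, so that `hout = hin` and `wout = win` (the source states these settings only for the default configuration), and it counts GEMM-equivalent FLOPs as `2 * m * k * n`, which includes multiplications against zero padding. The helper names are hypothetical.

```python
# (X shape, K shape, execution time in ms), copied from the table above.
ROWS = [
    ([4, 8, 8, 48, 16], [72, 96, 16, 16], 0.0322),
    ([4, 16, 12, 64, 16], [144, 192, 16, 16], 0.1473),
    ([4, 24, 12, 96, 16], [216, 288, 16, 16], 0.4709),
    ([4, 32, 16, 96, 16], [288, 384, 16, 16], 1.0979),
    ([4, 40, 15, 128, 16], [360, 480, 16, 16], 2.1312),
]


def gemm_flops(x_shape, k_shape):
    """GEMM-equivalent FLOPs, assuming the output keeps the input's spatial size."""
    batch, _cin, hin, win, _c0 = x_shape
    k_rows, n1, n0, c0 = k_shape
    m = batch * hin * win      # assumes hout == hin and wout == win
    k = k_rows * c0
    n = n1 * n0
    return 2 * m * k * n


def estimated_tflops(x_shape, k_shape, time_ms):
    """Approximate throughput in TFLOPS from one table row."""
    return gemm_flops(x_shape, k_shape) / (time_ms * 1e-3) / 1e12


for x_shape, k_shape, time_ms in ROWS:
    print(x_shape, k_shape, f"{estimated_tflops(x_shape, k_shape, time_ms):.1f} TFLOPS (estimate)")
```

Comparing rows this way separates the effect of problem size from the per-instruction utilization percentages, which helps when judging whether a small shape is limited by launch and data-movement overhead rather than by Cube throughput.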
Build and Run
- Configure the Ascend CANN environment:

      source ${ASCEND_INSTALL_PATH}/bin/setenv.bash

- Generate the input and golden output:

      cd ${git_clone_path}/kernels/manual/a2a3/conv2d_forward
      python3 scripts/gen_data.py

- Run the example:

      bash run.sh -r npu -v Ascend910B1

If the run succeeds, the output prints:

    test success