Basic Topk Operator Example¶
Overview¶
This example demonstrates how to implement a Topk operator using PTO, including project setup, build, and execution.
Supported AI Processors¶
- A2/A3
Directory Layout¶
kernels/topk/
├── scripts/
│ └── gen_data.py # Generates input and golden output
├── CMakeLists.txt # Build configuration
├── topk_kernel.cpp # Kernel implementation
├── main.cpp # Host-side entry point
└── run.sh # Convenience script
Operator Description¶
Function¶
This example implements topk with fixed dimensions [rows, cols] = [4800, 1024]:
Specification¶
| Item | Value |
|---|---|
| OpType | topk |
| Inputs | [rows, cols] = [4800, 1024] |
| Output | data, index |
| Kernel name | topk_kernel |
Tiling Parameters¶
The validation platform has 48 cores. The workload is split across cores.
Per-core shape:
rows = 100,cols = 1024
Implementation Notes¶
Type definitions¶
The implementation defines topk representations. Load input data and index in GM to UB, use TSort32 to sort each 32 data, use TMrgsort for each tile. Extract data and index, then store back to gm seperately.
// data
using DynShapeDim5 = Shape<1, 1, 1, singleLoopRow, validCol>;
using DynStridDim5 = Stride<singleLoopRow * Cols, singleLoopRow * Cols, singleLoopRow * Cols, Cols, 1>;
using GlobalData = GlobalTensor<T, DynShapeDim5, DynStridDim5>;
// index
using IndexShapeDim5 = Shape<1, 1, 1, 1, validCol>;
using IndexStridDim5 = Stride<validCol, validCol, validCol, validCol, 1>;
using IndexGlobalData = GlobalTensor<indexT, IndexShapeDim5, IndexStridDim5>;
// sorted data and index
using DstShapeDim5 = Shape<1, 1, 1, singleLoopRow, topk>;
using DstStridDim5 = Stride<singleLoopRow * topk, singleLoopRow * topk, singleLoopRow * topk, topk, 1>;
using DstDataGlobalData = GlobalTensor<T, DstShapeDim5, DstStridDim5>;
using DstIdxGlobalData = GlobalTensor<indexT, DstShapeDim5, DstStridDim5>;
Pipeline scheduling¶
This example overlaps data movement and compute using double buffering in UB to improve utilization. In each iteration, two sets of operation are performed,TLOAD->TSORT32->TMRGSORT(include MRGSORT and MOV operation)->TSTORE. The pipeline dependece in each set is MTE2->V->MTE1->V->MTE3. TLOAD in the second sets can be performed before TSTORE in the first set is finished, so as others. Extra dependence V->MTE2is added to ensure that TLOAD in next iteration is performed after VEC operation is done in corresonding set.
Measured Performance (Reference)¶
The following measurements were collected on Ascend A3 (48 VEC core) for several sizes and different type.
| Parameter | aiv_vec_ratio | aiv_scalar_ratio | aiv_mte2_ratio | aiv_mte3_ratio | task_duration(us) |
|---|---|---|---|---|---|
type=float validRow=rows=4800 validCol=1024 cols=1280 topk=1000 |
94% | 3.2% | 11.7% | 10.4% | 324.106 |
type=float validRow=rows=3456 validCol=1024 cols=1280 topk=1000 |
91.5% | 4.6% | 12.3% | 10.5% | 238.819 |
type=float validRow=rows=2304 validCol=1024 cols=1280 topk=1000 |
88.7% | 6% | 12.4% | 10.1% | 161.375 |
type=half validRow=rows=4800 validCol=1024 cols=1280 topk=1008 |
93.7% | 2.4% | 11.5% | 9.6% | 326.886 |
Build and Run¶
- Configure your Ascend CANN environment (example path):
source ${ASCEND_INSTALL_PATH}/bin/setenv.bash
- Run the example:
cd ${git_clone_path}/kernels/manual/a2a3/topk
bash run.sh -r npu -v Ascend910B1
If the run succeeds, the output prints:
test success