Custom PyTorch Operator (KERNEL_LAUNCH) Example¶
This example shows how to implement a custom PTO-based kernel and expose it as a PyTorch operator via torch_npu.
Directory Layout¶
demos/baseline/add/
├── op_extension/ # Python package entry (module loader)
├── csrc/
│ ├── kernel/ # PTO kernel implementation
│ └── host/ # Host-side PyTorch operator registration
├── test/ # Minimal Python test
├── CMakeLists.txt # Build configuration
├── setup.py # Wheel build script
└── README.md # This document
1. Implement the kernel¶
Add a kernel source file under demos/baseline/add/csrc/kernel/ and include it in the build. For example, to build add_custom.cpp, add it to demos/baseline/add/CMakeLists.txt:
ascendc_library(no_workspace_kernel STATIC
    csrc/kernel/add_custom.cpp
)
For build options and details, refer to the Ascend community documentation: https://www.hiascend.com/ascend-c
2. Integrate with PyTorch (torch_npu)¶
The host-side implementation lives under demos/baseline/add/csrc/host/.
2.1 Define the operator schema (Aten IR)¶
PyTorch uses TORCH_LIBRARY / TORCH_LIBRARY_FRAGMENT to declare operator schemas that can be called from Python via torch.ops.<namespace>.<op_name>.
Example: register a custom my_add operator in the npu namespace:
TORCH_LIBRARY_FRAGMENT(npu, m)
{
    m.def("my_add(Tensor x, Tensor y) -> Tensor");
}
After this, Python can call torch.ops.npu.my_add.
2.2 Implement the operator¶
- Include the generated kernel launch header aclrtlaunch_<kernel_name>.h (generated by the build system).
- Allocate output tensors/workspace as needed.
- Enqueue the kernel via ACLRT_LAUNCH_KERNEL (wrapped by EXEC_KERNEL_CMD in this example).
#include "utils.h"
#include "aclrtlaunch_add_custom.h"

at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y)
{
    // Allocate the output tensor with the same shape and dtype as the input.
    at::Tensor z = at::empty_like(x);
    // Number of blocks to launch the kernel with (example value).
    uint32_t blockDim = 20;
    // x.sizes() yields int64_t extents; accumulate the total element count.
    uint32_t totalLength = 1;
    for (int64_t size : x.sizes()) {
        totalLength *= static_cast<uint32_t>(size);
    }
    // Enqueue the kernel on the current NPU stream.
    EXEC_KERNEL_CMD(add_custom, blockDim, x, y, z, totalLength);
    return z;
}
2.3 Register the implementation¶
Register the implementation with TORCH_LIBRARY_IMPL. For NPU execution, torch_npu uses the PrivateUse1 dispatch key; a detailed introduction to PrivateUse1 is available in the official PyTorch documentation: https://docs.pytorch.org/tutorials/advanced/privateuseone.html
TORCH_LIBRARY_IMPL(npu, PrivateUse1, m)
{
    m.impl("my_add", TORCH_FN(run_add_custom));
}
3. Build and run¶
This example requires PTO Tile Lib, PyTorch, torch_npu, and CANN. Follow the official torch_npu installation guide:
https://gitcode.com/ascend/pytorch#%E5%AE%89%E8%A3%85
or install the Python dependencies directly:
python3 -m pip install -r requirements.txt
3.1 Set the target SoC¶
Edit demos/baseline/add/CMakeLists.txt and set SOC_VERSION to your target SoC (for example, A2/A3 series machines use Ascend910B1):
set(SOC_VERSION "Ascendxxxyy" CACHE STRING "system on chip type")
You can query the chip name on the target machine via npu-smi info and use Ascend<Chip Name> as the value.
3.2 Build the wheel¶
Set the PTO Tile Lib path and build a wheel:
export ASCEND_HOME_PATH=/usr/local/Ascend/
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
rm -rf build op_extension.egg-info
python3 setup.py bdist_wheel
3.3 Install the wheel¶
cd dist
pip uninstall -y op_extension
pip install *.whl
3.4 Run the test¶
cd test
python3 test.py