Custom PyTorch Operator (KERNEL_LAUNCH) Example

This example shows how to implement a custom PTO-based kernel and expose it as a PyTorch operator via torch_npu.

Directory Layout

demos/baseline/add/
├── op_extension/              # Python package entry (module loader)
├── csrc/
│   ├── kernel/                # PTO kernel implementation
│   └── host/                  # Host-side PyTorch operator registration
├── test/                      # Minimal Python test
├── CMakeLists.txt             # Build configuration
├── setup.py                   # Wheel build script
└── README.md                  # This document

1. Implement the kernel

Add a kernel source file under demos/baseline/add/csrc/kernel/ and include it in the build. For example, to build add_custom.cpp, add it to demos/baseline/add/CMakeLists.txt:

ascendc_library(no_workspace_kernel STATIC
    csrc/kernel/add_custom.cpp
)

For build options and details, refer to the Ascend community documentation: https://www.hiascend.com/ascend-c

2. Integrate with PyTorch (torch_npu)

The host-side implementation lives under demos/baseline/add/csrc/host/.

2.1 Define the operator schema (ATen IR)

PyTorch uses TORCH_LIBRARY / TORCH_LIBRARY_FRAGMENT to declare operator schemas that can be called from Python via torch.ops.<namespace>.<op_name>.

Example: register a custom my_add operator in the npu namespace:

TORCH_LIBRARY_FRAGMENT(npu, m)
{
    m.def("my_add(Tensor x, Tensor y) -> Tensor");
}

After this, Python can call torch.ops.npu.my_add.

2.2 Implement the operator

The host-side implementation follows three steps:

  1. Include the kernel launch header aclrtlaunch_<kernel_name>.h, which the build system generates.
  2. Allocate output tensors/workspace as needed.
  3. Enqueue the kernel via ACLRT_LAUNCH_KERNEL (wrapped by EXEC_KERNEL_CMD in this example).

#include "utils.h"
#include "aclrtlaunch_add_custom.h"

at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y)
{
    // Allocate the output with the same shape, dtype, and device as x.
    at::Tensor z = at::empty_like(x);
    // Number of blocks the kernel is launched with; hard-coded for this example.
    uint32_t blockDim = 20;
    // Total number of elements; numel() is the product of all dimensions.
    uint32_t totalLength = static_cast<uint32_t>(x.numel());
    EXEC_KERNEL_CMD(add_custom, blockDim, x, y, z, totalLength);
    return z;
}
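
The kernel receives the element count as a 32-bit value, while tensor dimensions are 64-bit, so a very large input could silently wrap around. The following is a minimal, standalone sketch of a guarded conversion. The helper name checked_total_length is hypothetical and not part of this example's sources, and it takes a plain std::vector so that it compiles without the PyTorch headers.

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not part of the example sources): multiplies the
// dimensions of a shape and verifies that the result fits in uint32_t.
inline uint32_t checked_total_length(const std::vector<int64_t> &sizes)
{
    constexpr uint64_t kMax = std::numeric_limits<uint32_t>::max();
    uint64_t total = 1;  // An empty shape (scalar) holds one element.
    for (int64_t dim : sizes) {
        if (dim < 0) {
            throw std::invalid_argument("negative dimension");
        }
        const uint64_t d = static_cast<uint64_t>(dim);
        // Check before multiplying so that total * d can never exceed kMax.
        if (d != 0 && total > kMax / d) {
            throw std::overflow_error("element count does not fit in uint32_t");
        }
        total *= d;
    }
    return static_cast<uint32_t>(total);
}
```

In the host function, a helper like this could be called as checked_total_length(x.sizes().vec()) in place of the unchecked conversion, assuming the kernel interface keeps its 32-bit length parameter.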

2.3 Register the implementation

Register the implementation with TORCH_LIBRARY_IMPL. For NPU execution, torch_npu uses the PrivateUse1 dispatch key. For a detailed introduction to PrivateUse1, see the official PyTorch tutorial: https://docs.pytorch.org/tutorials/advanced/privateuseone.html

TORCH_LIBRARY_IMPL(npu, PrivateUse1, m)
{
    m.impl("my_add", TORCH_FN(run_add_custom));
}

3. Build and run

This example requires PTO Tile Lib, PyTorch, torch_npu, and CANN. To set up torch_npu, either follow the official installation guide:

https://gitcode.com/ascend/pytorch#%E5%AE%89%E8%A3%85

or install the Python dependencies listed in requirements.txt:

python3 -m pip install -r requirements.txt

3.1 Set the target SoC

Edit demos/baseline/add/CMakeLists.txt and set SOC_VERSION to your target (example: A2A3 uses Ascend910B1):

set(SOC_VERSION "Ascendxxxyy" CACHE STRING "system on chip type")

You can query the chip name on the target machine with npu-smi info and use Ascend<Chip Name> as the value.

3.2 Build the wheel

Set the PTO Tile Lib path and build a wheel:

export ASCEND_HOME_PATH=/usr/local/Ascend/
source /usr/local/Ascend/ascend-toolkit/set_env.sh
export PTO_LIB_PATH=[YOUR_PATH]/pto-isa
rm -rf build op_extension.egg-info
python3 setup.py bdist_wheel

3.3 Install the wheel

cd dist
pip uninstall *.whl
pip install *.whl

3.4 Run the test

cd ../test    # from dist/; use "cd test" if you are in demos/baseline/add/
python3 test.py