Framework Integration¶

This document outlines integration patterns for exposing PTO-based kernels to framework runtimes. The exact registration APIs and backend names depend on the framework version, torch_npu integration model, and product release, so treat the snippets below as implementation patterns rather than drop-in code.

1. Integration Overview¶

1.1 Integration Architecture¶

Application layer
    ↓
Framework runtime and operator registration
    ↓
PTO-based kernel implementation
    ↓
Target Ascend AI Core execution

1.2 Integration Methods¶

Method	Pros	Cons	Use Case
Python Extension	Fast development, easy debugging	Python/C++ boundary overhead	Prototyping, quick validation
C++ Extension	High performance, type safety	More complex build and registration	Production, performance-critical
Framework plugin / custom backend path	Closer to deployment path	Higher maintenance cost	Stable product integration

1.3 Integration Workflow¶

1. Define operator interface
   ├─ Input/Output tensor specification
   ├─ Parameter types and defaults
   └─ Operator attributes (inplace, deterministic)

2. Implement operator logic
   ├─ Forward computation
   ├─ Backward propagation (training)
   └─ Shape inference

3. Register operator
   ├─ Framework operator registration
   ├─ Backend binding
   └─ Type inference

4. Test and validate
   ├─ Unit tests
   ├─ Numerical correctness
   └─ Performance benchmarks

5. Documentation and examples
   ├─ API documentation
   ├─ Usage examples
   └─ Performance reports

2. PyTorch Integration¶

2.1 Integration via torch_npu¶

Step 1: Define Operator Schema¶

// my_ops.cpp
#include <torch/extension.h>
#include <torch_npu/csrc/framework/utils/OpAdapter.h>

// Define operator schema
TORCH_LIBRARY_FRAGMENT(npu, m) {
  // Basic operator
  m.def("my_add(Tensor x, Tensor y) -> Tensor");

  // With scalar parameter
  m.def("my_mul(Tensor x, Scalar alpha) -> Tensor");

  // Multiple outputs
  m.def("my_split(Tensor x, int dim) -> (Tensor, Tensor)");

  // Inplace operator
  m.def("my_relu_(Tensor(a!) self) -> Tensor(a!)");

  // Optional parameters
  m.def("my_conv(Tensor input, Tensor weight, Tensor? bias=None, "
        "int stride=1, int padding=0) -> Tensor");
}

Step 2: Implement Operator¶

Simple operator implementation:

#include <pto/pto-inst.hpp>

// PTO Kernel implementation
__global__ __aicore__ void MyAddKernel(
    __gm__ float* out,
    __gm__ const float* x,
    __gm__ const float* y,
    uint32_t length) {

  int block_idx = get_block_idx();
  int block_num = get_block_num();

  int elements_per_block = (length + block_num - 1) / block_num;
  int start = block_idx * elements_per_block;
  int end = min(start + elements_per_block, length);

  using TileT = Tile<TileType::Vec, float, 16, 256>;

  for (int i = start; i < end; i += 16 * 256) {
    TileT tile_x, tile_y, tile_out;

    TLOAD(tile_x, GlobalTensor(x + i));
    TLOAD(tile_y, GlobalTensor(y + i));
    TADD(tile_out, tile_x, tile_y);
    TSTORE(GlobalTensor(out + i), tile_out);
  }
}

// PyTorch operator implementation
at::Tensor my_add_impl(const at::Tensor& x, const at::Tensor& y) {
  // Check inputs
  TORCH_CHECK(x.device() == y.device(), "Inputs must be on same device");
  TORCH_CHECK(x.sizes() == y.sizes(), "Inputs must have same shape");
  TORCH_CHECK(x.scalar_type() == at::kFloat, "Only float32 supported");

  // Allocate output
  at::Tensor out = at::empty_like(x);

  // Get data pointers
  float* out_ptr = out.data_ptr<float>();
  const float* x_ptr = x.data_ptr<float>();
  const float* y_ptr = y.data_ptr<float>();
  uint32_t length = x.numel();

  // Launch kernel
  int block_num = 24;  // A3 core count
  EXEC_KERNEL_CMD(MyAddKernel, block_num, out_ptr, x_ptr, y_ptr, length);

  return out;
}

Complex operator implementation (with backward pass):

// Forward
class MyConvFunction : public torch::autograd::Function<MyConvFunction> {
 public:
  static at::Tensor forward(
      torch::autograd::AutogradContext* ctx,
      const at::Tensor& input,
      const at::Tensor& weight,
      const at::Tensor& bias,
      int stride,
      int padding) {

    // Save tensors for backward pass
    ctx->save_for_backward({input, weight, bias});
    ctx->saved_data["stride"] = stride;
    ctx->saved_data["padding"] = padding;

    // Call PTO kernel
    at::Tensor output = run_conv_forward(input, weight, bias, stride, padding);

    return output;
  }

  static std::vector<at::Tensor> backward(
      torch::autograd::AutogradContext* ctx,
      std::vector<at::Tensor> grad_outputs) {

    // Restore saved tensors
    auto saved = ctx->get_saved_variables();
    auto input = saved[0];
    auto weight = saved[1];
    auto bias = saved[2];

    int stride = ctx->saved_data["stride"].toInt();
    int padding = ctx->saved_data["padding"].toInt();

    auto grad_output = grad_outputs[0];

    // Compute gradients
    at::Tensor grad_input = run_conv_backward_input(
        grad_output, weight, stride, padding);
    at::Tensor grad_weight = run_conv_backward_weight(
        grad_output, input, stride, padding);
    at::Tensor grad_bias = run_conv_backward_bias(grad_output);

    return {grad_input, grad_weight, grad_bias, 
            at::Tensor(), at::Tensor()};  // stride, padding have no gradient
  }
};

// Wrapper function
at::Tensor my_conv(
    const at::Tensor& input,
    const at::Tensor& weight,
    const at::Tensor& bias,
    int stride,
    int padding) {
  return MyConvFunction::apply(input, weight, bias, stride, padding);
}

Step 3: Register Implementation¶

// Register to NPU backend
TORCH_LIBRARY_IMPL(npu, PrivateUse1, m) {
  m.impl("my_add", TORCH_FN(my_add_impl));
  m.impl("my_mul", TORCH_FN(my_mul_impl));
  m.impl("my_conv", TORCH_FN(my_conv));
}

// Register autograd
TORCH_LIBRARY_IMPL(npu, Autograd, m) {
  m.impl("my_conv", TORCH_FN(my_conv));
}

Step 4: Build Python Extension¶

setup.py:

from setuptools import setup
from torch.utils.cpp_extension import BuildExtension, CppExtension

setup(
    name='my_pto_ops',
    ext_modules=[
        CppExtension(
            name='my_pto_ops',
            sources=['my_ops.cpp'],
            include_dirs=[
                '/path/to/pto-isa/include',
                '/path/to/torch_npu/include',
            ],
            library_dirs=[
                '/path/to/pto-isa/lib',
            ],
            libraries=['pto'],
            extra_compile_args=['-std=c++20', '-O3'],
        )
    ],
    cmdclass={'build_ext': BuildExtension}
)

Build:

python setup.py install

Step 5: Python Usage¶

import torch
import torch_npu
import my_pto_ops

# Create inputs
x = torch.randn(1024, 1024).npu()
y = torch.randn(1024, 1024).npu()

# Call custom operator
z = torch.ops.npu.my_add(x, y)

# Verify result
expected = x + y
assert torch.allclose(z, expected, rtol=1e-5)

print("✓ Custom op works correctly!")

2.2 Integration via torch.library (PyTorch 2.0+)¶

Simpler registration approach:

import torch
from torch.library import custom_op

@custom_op("mylib::my_add", mutates_args=())
def my_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Custom add operator"""
    return torch.ops.mylib.my_add_impl(x, y)

@my_add.register_fake
def _(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Shape inference"""
    assert x.shape == y.shape
    return torch.empty_like(x)

# Usage
x = torch.randn(10, 10)
y = torch.randn(10, 10)
z = torch.ops.mylib.my_add(x, y)

2.3 Complete Example: Add Operator¶

For a detailed tutorial, refer to: demos/baseline/add/README.md

3. TensorFlow Integration¶

3.1 Custom Op¶

Step 1: Define Op¶

// my_ops.cc
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape_inference.h"

REGISTER_OP("MyAdd")
    .Input("x: float")
    .Input("y: float")
    .Output("z: float")
    .SetShapeFn([](::tensorflow::shape_inference::InferenceContext* c) {
      // Shape inference
      c->set_output(0, c->input(0));
      return tensorflow::Status::OK();
    })
    .Doc(R"doc(
Custom add operator

Args:
  x: First input tensor
  y: Second input tensor

Returns:
  z: x + y
)doc");

Step 2: Implement Kernel¶

#include "tensorflow/core/framework/op_kernel.h"
#include <pto/pto-inst.hpp>

class MyAddOp : public tensorflow::OpKernel {
 public:
  explicit MyAddOp(tensorflow::OpKernelConstruction* context)
      : OpKernel(context) {}

  void Compute(tensorflow::OpKernelContext* context) override {
    // Get inputs
    const tensorflow::Tensor& x = context->input(0);
    const tensorflow::Tensor& y = context->input(1);

    // Check shapes
    OP_REQUIRES(context, x.shape() == y.shape(),
                tensorflow::errors::InvalidArgument(
                    "Inputs must have same shape"));

    // Allocate output
    tensorflow::Tensor* z = nullptr;
    OP_REQUIRES_OK(context, context->allocate_output(0, x.shape(), &z));

    // Call PTO kernel
    const float* x_ptr = x.flat<float>().data();
    const float* y_ptr = y.flat<float>().data();
    float* z_ptr = z->flat<float>().data();
    uint32_t length = x.NumElements();

    EXEC_KERNEL_CMD(MyAddKernel, 24, z_ptr, x_ptr, y_ptr, length);
  }
};

// Register kernel
REGISTER_KERNEL_BUILDER(
    Name("MyAdd").Device(tensorflow::DEVICE_NPU),
    MyAddOp);

Step 3: Build¶

# Use TensorFlow build utilities
TF_CFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_compile_flags()))') )
TF_LFLAGS=( $(python -c 'import tensorflow as tf; print(" ".join(tf.sysconfig.get_link_flags()))') )

g++ -std=c++17 -shared my_ops.cc -o my_ops.so \
    ${TF_CFLAGS[@]} ${TF_LFLAGS[@]} \
    -I/path/to/pto-isa/include \
    -L/path/to/pto-isa/lib -lpto \
    -fPIC -O3

Step 4: Python Usage¶

import tensorflow as tf

# Load custom op
my_ops = tf.load_op_library('./my_ops.so')

# Use
x = tf.constant([[1.0, 2.0], [3.0, 4.0]])
y = tf.constant([[5.0, 6.0], [7.0, 8.0]])
z = my_ops.my_add(x, y)

print(z.numpy())
# [[6. 8.]
#  [10. 12.]]

3.2 Register Gradient¶

@tf.RegisterGradient("MyAdd")
def _my_add_grad(op, grad):
    """Gradient for MyAdd"""
    return grad, grad  # ∂z/∂x = 1, ∂z/∂y = 1

4. ONNX Runtime Integration¶

4.1 Custom Execution Provider¶

Step 1: Define Kernel¶

// my_onnx_ops.cc
#include "onnxruntime/core/framework/op_kernel.h"

class MyAddKernel : public onnxruntime::OpKernel {
 public:
  MyAddKernel(const onnxruntime::OpKernelInfo& info) : OpKernel(info) {}

  onnxruntime::Status Compute(onnxruntime::OpKernelContext* context) const override {
    // Get inputs
    const onnxruntime::Tensor* X = context->Input<onnxruntime::Tensor>(0);
    const onnxruntime::Tensor* Y = context->Input<onnxruntime::Tensor>(1);

    // Allocate output
    onnxruntime::Tensor* Z = context->Output(0, X->Shape());

    // Call PTO kernel
    const float* x_data = X->Data<float>();
    const float* y_data = Y->Data<float>();
    float* z_data = Z->MutableData<float>();
    size_t length = X->Shape().Size();

    EXEC_KERNEL_CMD(MyAddKernel, 24, z_data, x_data, y_data, length);

    return onnxruntime::Status::OK();
  }
};

Step 2: Register Kernel¶

ONNX_OPERATOR_KERNEL_EX(
    Add,
    kOnnxDomain,
    7,  // opset version
    kNpuExecutionProvider,
    MyAddKernel);

Step 3: Create Execution Provider¶

class NpuExecutionProvider : public onnxruntime::IExecutionProvider {
 public:
  NpuExecutionProvider() : IExecutionProvider(kNpuExecutionProvider) {}

  std::vector<std::unique_ptr<onnxruntime::ComputeCapability>>
  GetCapability(const onnxruntime::GraphViewer& graph,
                const std::vector<const onnxruntime::KernelRegistry*>& registries) const override {
    // Return supported operators
    // ...
  }
};

Step 4: Python Usage¶

import onnxruntime as ort

# Register custom EP
session_options = ort.SessionOptions()
session_options.register_custom_ops_library('my_onnx_ops.so')

# Create session
session = ort.InferenceSession(
    'model.onnx',
    session_options,
    providers=['NpuExecutionProvider', 'CPUExecutionProvider']
)

# Inference
outputs = session.run(None, {'input': input_data})

5. Inference Framework Integration¶

5.1 MindSpore Lite Integration¶

// Register custom operator
#include "include/registry/register_kernel.h"

class MyAddKernel : public mindspore::kernel::Kernel {
 public:
  int Prepare() override { return RET_OK; }

  int Execute() override {
    auto input0 = in_tensors_[0];
    auto input1 = in_tensors_[1];
    auto output = out_tensors_[0];

    // Call PTO kernel
    // ...

    return RET_OK;
  }
};

// Register
REGISTER_CUSTOM_KERNEL(NPU, MyProvider, kNumberTypeFloat32, Add, MyAddKernel);

6. Performance Optimization¶

6.1 Operator Fusion¶

# PyTorch example: Fuse Add + ReLU
@torch.jit.script
def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    return torch.relu(x + y)

# Replace with custom fused operator
torch.ops.npu.fused_add_relu(x, y)

6.2 Memory Optimization¶

// Inplace operator
at::Tensor& my_add_inplace(at::Tensor& x, const at::Tensor& y) {
  // Modify x directly, avoid allocating new memory
  float* x_ptr = x.data_ptr<float>();
  const float* y_ptr = y.data_ptr<float>();
  uint32_t length = x.numel();

  EXEC_KERNEL_CMD(MyAddInplaceKernel, 24, x_ptr, y_ptr, length);

  return x;
}

6.3 Asynchronous Execution¶

// Use CUDA Stream (or NPU Stream)
at::Tensor my_add_async(const at::Tensor& x, const at::Tensor& y) {
  at::Tensor out = at::empty_like(x);

  // Get current stream
  auto stream = at::cuda::getCurrentCUDAStream();

  // Launch kernel asynchronously
  EXEC_KERNEL_ASYNC(MyAddKernel, 24, stream, 
                    out.data_ptr<float>(),
                    x.data_ptr<float>(),
                    y.data_ptr<float>(),
                    x.numel());

  return out;
}

7. Debugging and Testing¶

7.1 Unit Tests¶

import unittest
import torch
import my_pto_ops

class TestMyOps(unittest.TestCase):
    def test_my_add(self):
        x = torch.randn(100, 100).npu()
        y = torch.randn(100, 100).npu()

        # Custom operator
        z_custom = torch.ops.npu.my_add(x, y)

        # Reference implementation
        z_ref = x + y

        # Verify
        self.assertTrue(torch.allclose(z_custom, z_ref, rtol=1e-5))

    def test_my_add_backward(self):
        x = torch.randn(100, 100, requires_grad=True).npu()
        y = torch.randn(100, 100, requires_grad=True).npu()

        z = torch.ops.npu.my_add(x, y)
        loss = z.sum()
        loss.backward()

        # Verify gradients
        self.assertIsNotNone(x.grad)
        self.assertIsNotNone(y.grad)
        self.assertTrue(torch.allclose(x.grad, torch.ones_like(x)))

if __name__ == '__main__':
    unittest.main()

7.2 Performance Benchmarking¶

import torch
import time

def benchmark(func, *args, warmup=10, iterations=100):
    # Warmup
    for _ in range(warmup):
        func(*args)

    # Synchronize
    torch.npu.synchronize()

    # Measure
    start = time.time()
    for _ in range(iterations):
        func(*args)
    torch.npu.synchronize()
    end = time.time()

    avg_time = (end - start) / iterations * 1000  # ms
    return avg_time

# Performance comparison
x = torch.randn(1024, 1024).npu()
y = torch.randn(1024, 1024).npu()

time_custom = benchmark(lambda: torch.ops.npu.my_add(x, y))
time_builtin = benchmark(lambda: x + y)

print(f"Custom op: {time_custom:.3f} ms")
print(f"Built-in op: {time_builtin:.3f} ms")
print(f"Speedup: {time_builtin / time_custom:.2f}x")

8. Best Practices¶

8.1 Design Principles¶

DO: - Keep operator interfaces simple and clear - Provide complete type support (float32, float16, int32, etc.) - Implement shape inference and type inference - Provide thorough documentation and examples - Write comprehensive unit tests

DON'T: - Don't allocate large temporary memory inside the operator - Don't assume inputs are always contiguous (use contiguous()) - Don't ignore edge cases (empty tensors, single-element tensors) - Don't use global state inside the operator

8.2 Performance Checklist¶

[ ] Does the operator support inplace operations?
[ ] Is operator fusion implemented?
[ ] Is asynchronous execution used?
[ ] Are unnecessary memory copies avoided?
[ ] Are multiple data types supported?
[ ] Have performance benchmarks been run?

8.3 Compatibility Checklist¶

[ ] Are dynamic shapes supported?
[ ] Are broadcast semantics supported?
[ ] Is gradient computation supported (for training)?
[ ] Is JIT compilation supported?
[ ] Is ONNX export supported?
[ ] Is a CPU fallback provided?

Framework Integration¶

1. Integration Overview¶

1.1 Integration Architecture¶

1.2 Integration Methods¶

1.3 Integration Workflow¶

2. PyTorch Integration¶

2.1 Integration via torch_npu¶

Step 1: Define Operator Schema¶

Step 2: Implement Operator¶

Step 3: Register Implementation¶

Step 4: Build Python Extension¶

Step 5: Python Usage¶

2.2 Integration via torch.library (PyTorch 2.0+)¶

2.3 Complete Example: Add Operator¶

3. TensorFlow Integration¶

3.1 Custom Op¶

Step 1: Define Op¶

Step 2: Implement Kernel¶

Step 3: Build¶

Step 4: Python Usage¶

3.2 Register Gradient¶

4. ONNX Runtime Integration¶

4.1 Custom Execution Provider¶

Step 1: Define Kernel¶

Step 2: Register Kernel¶

Step 3: Create Execution Provider¶

Step 4: Python Usage¶

5. Inference Framework Integration¶

5.1 MindSpore Lite Integration¶

6. Performance Optimization¶

6.1 Operator Fusion¶

6.2 Memory Optimization¶

6.3 Asynchronous Execution¶

7. Debugging and Testing¶

7.1 Unit Tests¶

7.2 Performance Benchmarking¶

8. Best Practices¶

8.1 Design Principles¶

8.2 Performance Checklist¶

8.3 Compatibility Checklist¶

References¶