PTO vs Other Operator Development Approaches

This document compares PTO with other mainstream operator development approaches to help developers choose the most suitable solution.

Comparison Overview

| Feature | PTO | AscendC | TBE | CUDA |
| --- | --- | --- | --- | --- |
| Abstraction Level | Medium (Tile-level) | Low (Register-level) | High (Operator-level) | Low (Thread-level) |
| Cross-generation Compatibility | ✅ Excellent | ⚠️ Needs adaptation | ✅ Good | ❌ Platform-bound |
| Performance Control | ✅ High | ✅ Highest | ⚠️ Medium | ✅ High |
| Development Efficiency | ✅ High | ⚠️ Low | ✅ High | ⚠️ Medium |
| Learning Curve | Medium | Steep | Gentle | Steep |
| Debugging Difficulty | Medium | Hard | Easy | Hard |
| Use Cases | High-performance custom ops | Extreme optimization | Rapid prototyping | NVIDIA GPUs |

1. PTO vs AscendC

PTO Advantages

Higher Abstraction Level

  • PTO operates on Tiles (2D data blocks), while AscendC requires manual register management
  • Automatic handling of data alignment and layout conversion
  • Easier to understand and maintain

Cross-generation Compatibility

// PTO code runs on A2/A3/A5 without modification
using TileT = Tile<TileType::Vec, float, 16, 16>;
TLOAD(tile, globalTensor);
TADD(result, tile1, tile2);

Development Efficiency

  • Fewer lines of code (typically a 30-50% reduction)
  • Faster development cycle
  • Easier performance tuning

AscendC Advantages

Ultimate Performance Control

  • Direct control of hardware registers
  • Can achieve optimal instruction scheduling
  • Suitable for scenarios requiring extreme performance

Lower-level Hardware Access

  • Access to all hardware features
  • Finer-grained pipeline control

Selection Recommendation

  • Choose PTO: Most custom operator development, need cross-generation compatibility
  • Choose AscendC: Need to squeeze last 5-10% performance, targeting specific hardware only

2. PTO vs TBE

PTO Advantages

Better Performance Control

// PTO allows precise control of tiling and pipeline
for (int k = 0; k < K; k += tileK) {
  TLOAD(tileA, ...);  // Explicit data transfer control
  TLOAD(tileB, ...);
  TMATMUL(acc, tileA, tileB);  // Explicit compute control
}

More Flexible Operator Implementation

  • Can implement complex custom logic
  • Supports dynamic shapes and masks
  • Easier to implement operator fusion

TBE Advantages

Higher Development Efficiency

  • Based on TensorFlow/PyTorch high-level APIs
  • Automatic optimization and scheduling
  • Faster prototyping

Simpler Learning Curve

  • Python-like programming model
  • Rich operator library
  • Comprehensive documentation and examples

Selection Recommendation

  • Choose PTO: Need high-performance custom operators with clear performance requirements
  • Choose TBE: Rapid prototyping, standard operator implementation

3. PTO vs CUDA

PTO Advantages

Cross-platform Portability

// PTO code runs on different Ascend generations
// A2/A3/A5 without modification

// CUDA code is NVIDIA-specific
// Needs rewrite for AMD/Intel GPUs

Higher Abstraction Level

  • Tile-based programming vs thread-based
  • Automatic memory hierarchy management
  • Less boilerplate code

Better Compiler Optimization

  • Compiler understands high-level semantics
  • Automatic pipeline optimization
  • Better instruction scheduling

CUDA Advantages

Mature Ecosystem

  • Extensive libraries (cuBLAS, cuDNN, Thrust)
  • Rich community resources
  • Comprehensive tooling (Nsight, nvprof)

Fine-grained Control

  • Thread-level control
  • Shared memory management
  • Warp-level primitives

Wider Hardware Support

  • Runs on all NVIDIA GPUs
  • Large installed base

Selection Recommendation

  • Choose PTO: Developing for Ascend NPU, need portability across generations
  • Choose CUDA: Developing for NVIDIA GPU, need mature ecosystem

4. Code Comparison Examples

4.1 Vector Addition

PTO:

__global__ __aicore__ void VecAdd(
    __gm__ float* out,
    __gm__ const float* in0,
    __gm__ const float* in1,
    uint32_t length) {

  using TileT = Tile<TileType::Vec, float, 16, 256>;
  TileT a, b, c;

  // Assumes length is a multiple of 16 * 256 (one full tile per iteration)
  for (uint32_t i = 0; i < length; i += 16 * 256) {
    TLOAD(a, GlobalTensor(in0 + i));
    TLOAD(b, GlobalTensor(in1 + i));
    TADD(c, a, b);
    TSTORE(GlobalTensor(out + i), c);
  }
}

CUDA:

__global__ void VecAdd(
    float* out,
    const float* in0,
    const float* in1,
    int length) {

  int idx = blockIdx.x * blockDim.x + threadIdx.x;

  if (idx < length) {
    out[idx] = in0[idx] + in1[idx];
  }
}

Comparison:

  • PTO: tile-based; processes 4096 elements per iteration
  • CUDA: thread-based; processes 1 element per thread
  • PTO: fewer memory transactions, better bandwidth utilization
  • CUDA: more flexible thread organization

4.2 Matrix Multiplication

PTO:

__global__ __aicore__ void MatMul(
    __gm__ float* C,
    __gm__ const half* A,
    __gm__ const half* B,
    int M, int K, int N) {

  using TileA = TileLeft<half, 128, 64>;
  using TileB = TileRight<half, 64, 256>;
  using TileC = TileAcc<float, 128, 256>;

  // m, n: origin of this core's 128x256 output tile
  // (derived from the block index; computation omitted here)
  TileC acc;
  TFILL(acc, 0);

  for (int k = 0; k < K; k += 64) {
    TileA tileA;
    TileB tileB;

    TLOAD(tileA, A[m:m+128, k:k+64]);   // slice notation is illustrative
    TLOAD(tileB, B[k:k+64, n:n+256]);
    TMATMUL_ACC(acc, tileA, tileB);
  }

  TSTORE(C[m:m+128, n:n+256], acc);
}

CUDA:

constexpr int TILE_SIZE = 16;

__global__ void MatMul(
    float* C,
    const float* A,
    const float* B,
    int M, int K, int N) {
  // Assumes M, N, K are multiples of TILE_SIZE (no boundary guards)
  __shared__ float As[TILE_SIZE][TILE_SIZE];
  __shared__ float Bs[TILE_SIZE][TILE_SIZE];

  int row = blockIdx.y * TILE_SIZE + threadIdx.y;
  int col = blockIdx.x * TILE_SIZE + threadIdx.x;

  float sum = 0.0f;

  for (int k = 0; k < K; k += TILE_SIZE) {
    // Load to shared memory
    As[threadIdx.y][threadIdx.x] = A[row * K + k + threadIdx.x];
    Bs[threadIdx.y][threadIdx.x] = B[(k + threadIdx.y) * N + col];
    __syncthreads();

    // Compute
    for (int i = 0; i < TILE_SIZE; i++) {
      sum += As[threadIdx.y][i] * Bs[i][threadIdx.x];
    }
    __syncthreads();
  }

  C[row * N + col] = sum;
}

Comparison:

  • PTO: hardware matrix-multiply instruction (TMATMUL)
  • CUDA: manual loop-based multiplication
  • PTO: simpler code, better performance
  • CUDA: more explicit memory management


5. Performance Comparison

5.1 Development Time

| Task | PTO | AscendC | TBE | CUDA |
| --- | --- | --- | --- | --- |
| Simple element-wise op | 1 hour | 2 hours | 30 min | 1 hour |
| GEMM optimization | 1 day | 3 days | N/A | 2 days |
| Complex fused op | 2 days | 5 days | 1 day | 3 days |

5.2 Runtime Performance

Relative Performance (normalized to PTO = 1.0):

| Operator | PTO | AscendC | TBE | CUDA (on GPU) |
| --- | --- | --- | --- | --- |
| Vector Add | 1.0 | 1.05 | 0.8 | 1.2 |
| GEMM | 1.0 | 1.1 | 0.7 | 1.3 |
| Softmax | 1.0 | 1.05 | 0.75 | 1.1 |
| Custom Fusion | 1.0 | 1.15 | 0.6 | N/A |

Notes:

  • AscendC can achieve 5-15% better performance with expert optimization
  • TBE carries 20-40% overhead due to its higher abstraction level
  • CUDA numbers are measured on different hardware (NVIDIA GPU) and are not directly comparable


6. Selection Decision Tree

Start
  │
  ├─ Need cross-generation compatibility?
  │   ├─ Yes → PTO ✅
  │   └─ No → Continue
  │
  ├─ Need extreme performance (last 5-10%)?
  │   ├─ Yes → AscendC
  │   └─ No → Continue
  │
  ├─ Rapid prototyping?
  │   ├─ Yes → TBE
  │   └─ No → Continue
  │
  ├─ Targeting NVIDIA GPU?
  │   ├─ Yes → CUDA
  │   └─ No → PTO ✅
  │
  └─ Default → PTO ✅

7. Migration Guide

7.1 CUDA to PTO

Key Differences:

  • Thread → Tile
  • __shared__ memory → L1 Tile
  • __syncthreads() → Event-based sync
  • Manual loops → Tile operations

Example:

// CUDA
__global__ void kernel(float* output, const float* input) {
  int idx = threadIdx.x;
  __shared__ float shared[256];
  shared[idx] = input[idx];
  __syncthreads();
  output[idx] = shared[idx] * 2.0f;
}

// PTO
__global__ __aicore__ void kernel(__gm__ float* output,
                                  __gm__ const float* input) {
  using TileT = Tile<TileType::Vec, float, 1, 256>;
  TileT tile;
  TLOAD(tile, GlobalTensor(input));
  TMULS(tile, tile, 2.0f);
  TSTORE(GlobalTensor(output), tile);
}

7.2 TBE to PTO

Key Differences:

  • High-level ops → Low-level Tile ops
  • Automatic scheduling → Manual pipeline
  • Python → C++

