PTO vs Other Operator Development Approaches¶
This document compares PTO with other mainstream operator development approaches, helping developers choose the most suitable development solution.
Comparison Overview¶
| Feature | PTO | AscendC | TBE | CUDA |
|---|---|---|---|---|
| Abstraction Level | Medium (Tile-level) | Low (Register-level) | High (Operator-level) | Low (Thread-level) |
| Cross-generation Compatibility | ✅ Excellent | ⚠️ Needs adaptation | ✅ Good | ❌ Platform-bound |
| Performance Control | ✅ High | ✅ Highest | ⚠️ Medium | ✅ High |
| Development Efficiency | ✅ High | ⚠️ Low | ✅ High | ⚠️ Medium |
| Learning Curve | Medium | Steep | Gentle | Steep |
| Debugging Difficulty | Medium | Hard | Easy | Hard |
| Use Cases | High-performance custom ops | Extreme optimization | Rapid prototyping | NVIDIA GPU |
1. PTO vs AscendC¶
PTO Advantages¶
Higher Abstraction Level - PTO operates on Tiles (2D data blocks), while AscendC requires manual register management - Automatic handling of data alignment and layout conversion - Easier to understand and maintain
Cross-generation Compatibility
// PTO code runs on A2/A3/A5 without modification
using TileT = Tile<TileType::Vec, float, 16, 16>;
TLOAD(tile, globalTensor);
TADD(result, tile1, tile2);
Development Efficiency - Fewer lines of code (typically 30-50% reduction) - Faster development cycle - Easier performance tuning
AscendC Advantages¶
Ultimate Performance Control - Direct control of hardware registers - Can achieve optimal instruction scheduling - Suitable for scenarios requiring extreme performance
Lower-level Hardware Access - Can use all hardware features - Finer-grained pipeline control
Selection Recommendation¶
- Choose PTO: Most custom operator development, need cross-generation compatibility
- Choose AscendC: Need to squeeze last 5-10% performance, targeting specific hardware only
2. PTO vs TBE¶
PTO Advantages¶
Better Performance Control
// PTO allows precise control of tiling and pipeline
for (int k = 0; k < K; k += tileK) {
TLOAD(tileA, ...); // Explicit data transfer control
TLOAD(tileB, ...);
TMATMUL(acc, tileA, tileB); // Explicit compute control
}
More Flexible Operator Implementation - Can implement complex custom logic - Supports dynamic shapes and masks - Easier to implement operator fusion
TBE Advantages¶
Higher Development Efficiency - Based on TensorFlow/PyTorch high-level APIs - Automatic optimization and scheduling - Faster prototyping
Simpler Learning Curve - Python-like programming model - Rich operator library - Comprehensive documentation and examples
Selection Recommendation¶
- Choose PTO: Need high-performance custom operators with clear performance requirements
- Choose TBE: Rapid prototyping, standard operator implementation
3. PTO vs CUDA¶
PTO Advantages¶
Cross-platform Portability
// PTO code runs on different Ascend generations
// A2/A3/A5 without modification
// CUDA code is NVIDIA-specific
// Needs rewrite for AMD/Intel GPUs
Higher Abstraction Level - Tile-based programming vs thread-based - Automatic memory hierarchy management - Less boilerplate code
Better Compiler Optimization - Compiler understands high-level semantics - Automatic pipeline optimization - Better instruction scheduling
CUDA Advantages¶
Mature Ecosystem - Extensive libraries (cuBLAS, cuDNN, Thrust) - Rich community resources - Comprehensive tooling (Nsight, nvprof)
Fine-grained Control - Thread-level control - Shared memory management - Warp-level primitives
Wider Hardware Support - Runs on all NVIDIA GPUs - Large installed base
Selection Recommendation¶
- Choose PTO: Developing for Ascend NPU, need portability across generations
- Choose CUDA: Developing for NVIDIA GPU, need mature ecosystem
4. Code Comparison Examples¶
4.1 Vector Addition¶
PTO:
__global__ __aicore__ void VecAdd(
__gm__ float* out,
__gm__ const float* in0,
__gm__ const float* in1,
uint32_t length) {
using TileT = Tile<TileType::Vec, float, 16, 256>;
TileT a, b, c;
for (int i = 0; i < length; i += 16 * 256) {
TLOAD(a, GlobalTensor(in0 + i));
TLOAD(b, GlobalTensor(in1 + i));
TADD(c, a, b);
TSTORE(GlobalTensor(out + i), c);
}
}
CUDA:
__global__ void VecAdd(
float* out,
const float* in0,
const float* in1,
int length) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < length) {
out[idx] = in0[idx] + in1[idx];
}
}
Comparison: - PTO: Tile-based, processes 4096 elements per iteration - CUDA: Thread-based, processes 1 element per thread - PTO: Fewer memory transactions, better bandwidth utilization - CUDA: More flexible thread organization
4.2 Matrix Multiplication¶
PTO:
__global__ __aicore__ void MatMul(
__gm__ float* C,
__gm__ const float* A,
__gm__ const float* B,
int M, int K, int N) {
using TileLeft = TileLeft<half, 128, 64>;
using TileRight = TileRight<half, 64, 256>;
using TileAcc = TileAcc<float, 128, 256>;
TileAcc acc;
TFILL(acc, 0);
for (int k = 0; k < K; k += 64) {
TileLeft tileA;
TileRight tileB;
TLOAD(tileA, A[m:m+128, k:k+64]);
TLOAD(tileB, B[k:k+64, n:n+256]);
TMATMUL_ACC(acc, tileA, tileB);
}
TSTORE(C[m:m+128, n:n+256], acc);
}
CUDA:
__global__ void MatMul(
float* C,
const float* A,
const float* B,
int M, int K, int N) {
__shared__ float As[TILE_SIZE][TILE_SIZE];
__shared__ float Bs[TILE_SIZE][TILE_SIZE];
int row = blockIdx.y * TILE_SIZE + threadIdx.y;
int col = blockIdx.x * TILE_SIZE + threadIdx.x;
float sum = 0.0f;
for (int k = 0; k < K; k += TILE_SIZE) {
// Load to shared memory
As[threadIdx.y][threadIdx.x] = A[row * K + k + threadIdx.x];
Bs[threadIdx.y][threadIdx.x] = B[(k + threadIdx.y) * N + col];
__syncthreads();
// Compute
for (int i = 0; i < TILE_SIZE; i++) {
sum += As[threadIdx.y][i] * Bs[i][threadIdx.x];
}
__syncthreads();
}
C[row * N + col] = sum;
}
Comparison: - PTO: Hardware matrix multiply instruction (TMATMUL) - CUDA: Manual loop-based multiplication - PTO: Simpler code, better performance - CUDA: More explicit memory management
5. Performance Comparison¶
5.1 Development Time¶
| Task | PTO | AscendC | TBE | CUDA |
|---|---|---|---|---|
| Simple element-wise op | 1 hour | 2 hours | 30 min | 1 hour |
| GEMM optimization | 1 day | 3 days | N/A | 2 days |
| Complex fused op | 2 days | 5 days | 1 day | 3 days |
5.2 Runtime Performance¶
Relative Performance (normalized to PTO = 1.0):
| Operator | PTO | AscendC | TBE | CUDA (on GPU) |
|---|---|---|---|---|
| Vector Add | 1.0 | 1.05 | 0.8 | 1.2 |
| GEMM | 1.0 | 1.1 | 0.7 | 1.3 |
| Softmax | 1.0 | 1.05 | 0.75 | 1.1 |
| Custom Fusion | 1.0 | 1.15 | 0.6 | N/A |
Notes: - AscendC can achieve 5-15% better performance with expert optimization - TBE has 20-40% overhead due to abstraction - CUDA performance on different hardware (not directly comparable)
6. Selection Decision Tree¶
Start
│
├─ Need cross-generation compatibility?
│ ├─ Yes → PTO ✅
│ └─ No → Continue
│
├─ Need extreme performance (last 5-10%)?
│ ├─ Yes → AscendC
│ └─ No → Continue
│
├─ Rapid prototyping?
│ ├─ Yes → TBE
│ └─ No → Continue
│
├─ Targeting NVIDIA GPU?
│ ├─ Yes → CUDA
│ └─ No → PTO ✅
│
└─ Default → PTO ✅
7. Migration Guide¶
7.1 CUDA to PTO¶
Key Differences:
- Thread → Tile
- __shared__ memory → L1 Tile
- __syncthreads() → Event-based sync
- Manual loops → Tile operations
Example:
// CUDA
__global__ void kernel() {
int idx = threadIdx.x;
__shared__ float shared[256];
shared[idx] = input[idx];
__syncthreads();
output[idx] = shared[idx] * 2;
}
// PTO
__global__ __aicore__ void kernel() {
using TileT = Tile<TileType::Vec, float, 1, 256>;
TileT tile;
TLOAD(tile, input);
TMULS(tile, tile, 2.0f);
TSTORE(output, tile);
}
7.2 TBE to PTO¶
Key Differences: - High-level ops → Low-level Tile ops - Automatic scheduling → Manual pipeline - Python → C++