Performance Best Practices
This document summarizes practical performance-tuning guidance for PTO operators. All numeric examples should be treated as analysis heuristics, not guaranteed hardware values, because achievable performance depends on chip generation, clocking, memory hierarchy, compiler behavior, workload shape, and the surrounding runtime.
1. Optimization Workflow
1.1 Standard Optimization Process
Correctness Verification → Performance Baseline → Bottleneck Analysis →
Targeted Optimization → Verification → Iteration
Detailed Steps:
Step 1: Ensure Correctness
# CPU simulation verification
python3 tests/run_cpu.py --testcase your_op --verbose
# NPU verification
python3 tests/script/run_st.py -r npu -v a3 -t your_op
Checkpoints:
- ✅ Numerical error < 1e-5 (fp32) or < 1e-3 (fp16)
- ✅ All test cases pass
- ✅ Boundary conditions handled correctly
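The numerical checkpoint above can be scripted on the host. The following is a minimal sketch, not part of the PTO test suite: the function names are hypothetical, and reading "numerical error" as the maximum absolute difference against a golden reference is an assumption.

```python
# Tolerances taken from the checkpoints above. Interpreting "numerical error"
# as the maximum absolute difference is an assumption of this sketch.
TOLERANCE = {"fp32": 1e-5, "fp16": 1e-3}


def max_abs_error(actual, expected):
    """Return the largest element-wise absolute difference."""
    if len(actual) != len(expected):
        raise ValueError("actual and expected must have the same length")
    return max((abs(a - e) for a, e in zip(actual, expected)), default=0.0)


def within_tolerance(actual, expected, dtype):
    """True if the maximum absolute error is below the tolerance for dtype."""
    return max_abs_error(actual, expected) < TOLERANCE[dtype]


if __name__ == "__main__":
    golden = [1.0, 2.0, 3.0]      # hypothetical reference values
    output = [1.0, 2.0005, 3.0]   # hypothetical operator output
    print("fp16 check:", within_tolerance(output, golden, "fp16"))
    print("fp32 check:", within_tolerance(output, golden, "fp32"))
```

If a relative-error criterion is more appropriate for the operator, only `max_abs_error` needs to change.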
Step 2: Establish Performance Baseline
# Use msprof to collect performance data
msprof --application="your_app" --output=./baseline
Record Metrics:
- Total execution time
- Time proportion of each stage (TLOAD/TMATMUL/TSTORE)
- Memory bandwidth utilization
- Compute unit utilization
Step 3: Identify Bottlenecks
Analyze profiler output:
TLOAD: 45% ← Memory transfer
TEXTRACT: 10% ← Layout conversion
TMATMUL: 40% ← Computation
TSTORE: 5% ← Write back
Bottleneck Types:
- Memory Bound: TLOAD/TSTORE proportion > 60%
- Compute Bound: TMATMUL proportion > 70%
- Conversion Bound: TEXTRACT/TMOV proportion > 20%
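The thresholds above can be applied mechanically to a stage breakdown. This is an illustrative sketch only: `classify_bottleneck` is a hypothetical helper, and summing the paired stages (TLOAD + TSTORE, TEXTRACT + TMOV) is an assumed reading of the "X/Y proportion" wording.

```python
def classify_bottleneck(stage_pct):
    """Return the bottleneck labels whose heuristic threshold is exceeded.

    stage_pct maps a stage name (for example "TLOAD") to its share of total
    time in percent. Stages missing from the mapping count as 0.
    """
    labels = []
    # Assumption: the paired stages are summed before comparing.
    if stage_pct.get("TLOAD", 0.0) + stage_pct.get("TSTORE", 0.0) > 60:
        labels.append("memory-bound")
    if stage_pct.get("TMATMUL", 0.0) > 70:
        labels.append("compute-bound")
    if stage_pct.get("TEXTRACT", 0.0) + stage_pct.get("TMOV", 0.0) > 20:
        labels.append("conversion-bound")
    return labels


if __name__ == "__main__":
    # The sample profiler breakdown shown above.
    sample = {"TLOAD": 45, "TEXTRACT": 10, "TMATMUL": 40, "TSTORE": 5}
    print(classify_bottleneck(sample))
```

For the sample breakdown above, no threshold is exceeded and the result is an empty list. A profile like that is fairly balanced, so the largest single stage (TLOAD at 45%) is a reasonable place to start.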
Step 4: Targeted Optimization
Choose an optimization strategy based on the bottleneck type (see Sections 3 to 5).
Step 5: Verify Optimization Effect
Compare Metrics:
- Performance improvement percentage
- Time changes in each stage
- Numerical correctness maintained
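A before/after comparison of the first two metrics can be as simple as the sketch below. The helper names and the sample timings are hypothetical; the inputs would come from the baseline and the post-optimization profiling runs.

```python
def improvement_pct(baseline_us, optimized_us):
    """Percentage reduction in execution time relative to the baseline."""
    if baseline_us <= 0:
        raise ValueError("baseline time must be positive")
    return (baseline_us - optimized_us) / baseline_us * 100.0


def stage_deltas(baseline, optimized):
    """Per-stage time change (optimized minus baseline), in the input unit.

    A stage missing from one run is treated as taking zero time in that run.
    """
    stages = sorted(set(baseline) | set(optimized))
    return {s: optimized.get(s, 0.0) - baseline.get(s, 0.0) for s in stages}


if __name__ == "__main__":
    # Hypothetical per-stage timings in microseconds.
    before = {"TLOAD": 90.0, "TMATMUL": 80.0, "TSTORE": 30.0}
    after = {"TLOAD": 50.0, "TMATMUL": 80.0, "TSTORE": 20.0}
    print(improvement_pct(sum(before.values()), sum(after.values())))
    print(stage_deltas(before, after))
```

Negative deltas indicate stages that became faster. A stage that grows after an optimization often points to work that was shifted rather than removed.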
Step 6: Iterate Optimization
Repeat Steps 3-5 until the performance target is reached or no further worthwhile optimizations remain.
2. Performance Analysis Methods
2.1 Using msprof Tool
Basic Usage:
# Collect performance data
msprof --application="./your_app" \
    --output=./profiling_data \
    --ai-core=on \
    --task-time=on
# View report
msprof --export=on \
    --output=./profiling_data
Key Metrics:
| Metric | Meaning | Target Value |
|---|---|---|
| TMATMUL Proportion | Cube unit utilization | > 50% |
| TLOAD Proportion | Memory transfer time | < 40% |
| MTE Bandwidth | Memory bandwidth utilization | > 70% |
| Pipeline Bubbles | Idle time | < 10% |
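2.2 Manual Timing
Insert timing code around critical paths:
#include <chrono>
#include <cstdio>
// steady_clock is monotonic, which makes it the safer choice for intervals
auto start = std::chrono::steady_clock::now();
// Critical code section
for (int i = 0; i < N; i++) {
    TLOAD(tile, ...);
}
// If the operations are issued asynchronously, wait for them to complete
// before taking the end timestamp; otherwise only issue time is measured.
auto end = std::chrono::steady_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
std::printf("TLOAD time: %lld us\n", static_cast<long long>(duration.count()));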
2.3 Theoretical Performance Calculation
Theoretical peak numbers are useful only for rough upper-bound analysis. Prefer comparing kernels under the same measurement setup, and rely on profiler evidence before drawing conclusions.
Example reasoning pattern:
Theoretical throughput upper bound = peak compute capability × estimated utilization
Achieved throughput = measured workload FLOPs / measured execution time
Use this comparison to decide whether a kernel is primarily compute-bound or memory-bound; avoid hard-coding platform figures in design conclusions unless they come from the target platform's official specifications.
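As a worked instance of this reasoning pattern, the sketch below computes achieved throughput for a GEMM and compares it with a peak figure. The FLOP count 2·M·N·K (one multiply and one add per multiply-accumulate) is the standard convention; the function names are hypothetical, and the peak value is a placeholder that should come from the target platform's official specifications.

```python
def gemm_flops(m, n, k):
    """FLOPs of an (M x K) by (K x N) matrix multiply: 2 per multiply-accumulate."""
    return 2 * m * n * k


def achieved_tflops(flops, seconds):
    """Achieved throughput in TFLOPS from a FLOP count and a measured time."""
    if seconds <= 0:
        raise ValueError("measured time must be positive")
    return flops / seconds / 1e12


def utilization(achieved, peak):
    """Fraction of the peak throughput actually achieved (same unit for both)."""
    return achieved / peak


if __name__ == "__main__":
    flops = gemm_flops(4096, 4096, 4096)
    measured_seconds = 0.02       # hypothetical measurement
    peak_tflops = 100.0           # placeholder, not a real hardware figure
    achieved = achieved_tflops(flops, measured_seconds)
    print(f"achieved: {achieved:.2f} TFLOPS, "
          f"utilization: {utilization(achieved, peak_tflops):.1%}")
```

A utilization far below the expected range, combined with a high TLOAD/TSTORE share in the profile, suggests the kernel is limited by data movement rather than by the Cube unit.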
3. Common Performance Issues
3.1 Memory Bandwidth Limited
Symptoms:
- TLOAD/TSTORE proportion > 60%
- TMATMUL proportion < 30%
Causes:
- Tile too small, insufficient data reuse
- Frequent GM ↔ L1 transfers
- Not using pipeline overlap
Solutions:
✅ Increase Tile Size
// Before optimization: Small Tile
using TileT = Tile<TileType::Vec, float, 8, 64>; // 2KB
// After optimization: Large Tile
using TileT = Tile<TileType::Vec, float, 16, 256>; // 16KB
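The size comments in the snippet above follow from rows × cols × element size. A small helper such as the following (hypothetical, host-side only) makes it easy to check candidate shapes before editing the kernel:

```python
# Element sizes in bytes for the data types discussed in this document.
ELEM_BYTES = {"fp32": 4, "fp16": 2}


def tile_bytes(rows, cols, dtype):
    """Storage needed by a dense rows x cols tile, in bytes."""
    return rows * cols * ELEM_BYTES[dtype]


def tile_kib(rows, cols, dtype):
    """Same as tile_bytes, expressed in KiB."""
    return tile_bytes(rows, cols, dtype) / 1024


if __name__ == "__main__":
    for shape in [(8, 64), (16, 256)]:
        print(shape, tile_kib(*shape, "fp32"), "KiB")
```

For the two shapes above this gives 2 KiB and 16 KiB respectively, matching the comments in the snippet.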
✅ Improve Data Reuse
// GEMM: K-dimension blocking
for (int k = 0; k < K; k += TILE_K) {
    TLOAD(tileA, ...);           // Load the A tile once per K step
    TLOAD(tileB, ...);           // Load the B tile once per K step
    TMATMUL(acc, tileA, tileB);  // Each A element is reused for every column of tileB (and vice versa)
}
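The amount of reuse in one step of this loop can be estimated as FLOPs per byte loaded. The sketch below is a simplified model with a hypothetical helper name: it counts only the operand loads of a single baseM×baseK by baseK×baseN step and ignores accumulator traffic and any caching effects.

```python
def gemm_tile_intensity(base_m, base_k, base_n, elem_bytes):
    """FLOPs per operand byte loaded for one (M x K) by (K x N) tile step.

    Simplified model: only the A and B tile loads are counted.
    """
    flops = 2 * base_m * base_n * base_k
    bytes_loaded = (base_m * base_k + base_k * base_n) * elem_bytes
    return flops / bytes_loaded


if __name__ == "__main__":
    # fp16 operands (2 bytes), using the tile shapes listed in this document.
    print(gemm_tile_intensity(128, 64, 256, 2))
    print(gemm_tile_intensity(256, 128, 512, 2))
```

Under this model the 128×64 by 64×256 shape yields about 85.3 FLOPs per byte and the 256×128 by 128×512 shape about 170.7. The ratio simplifies to 2·M·N / ((M + N) · elem_bytes), so it grows with baseM and baseN but does not depend on baseK.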
✅ Use double buffering or staged overlap when applicable
// Preload the first tile
TLOAD(tile[0], ...);
for (int i = 0; i < N; i++) {
    int curr = i % 2;
    int next = (i + 1) % 2;
    // Issue the load for the next tile first so it can overlap with compute
    if (i + 1 < N) {
        TLOAD(tile[next], ...);
    }
    // Compute the current tile (its TLOAD must have completed)
    process_tile(result[curr], tile[curr]);
}
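The benefit of the double-buffering loop above can be estimated with a simple two-stage pipeline model. This is an idealized sketch with hypothetical function names: it assumes constant per-tile load and compute times, perfect overlap between the two stages, and no synchronization overhead.

```python
def serial_time(n, load, compute):
    """Total time when every tile is loaded and then computed with no overlap."""
    return n * (load + compute)


def double_buffered_time(n, load, compute):
    """Total time when the load of tile i+1 overlaps the compute of tile i.

    The first load and the last compute cannot be hidden; each of the
    n - 1 overlapped steps costs the slower of the two stages.
    """
    if n <= 0:
        return 0.0
    return load + (n - 1) * max(load, compute) + compute


if __name__ == "__main__":
    n, load, compute = 4, 3.0, 5.0   # hypothetical per-tile times
    print("serial:         ", serial_time(n, load, compute))
    print("double-buffered:", double_buffered_time(n, load, compute))
```

With 4 tiles, a load time of 3 units and a compute time of 5 units, the model gives 32 units serially and 23 units with overlap. The gain is largest when load and compute times are similar, and it shrinks when one stage dominates.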
3.2 Low Compute Unit Utilization
Symptoms:
- TMATMUL proportion < 40%
- Many pipeline bubbles
Causes:
- Data transfer can't keep up with computation speed
- Too frequent synchronization
- Tile shape doesn't match hardware
Solutions:
✅ Optimize Pipeline Overlap
// Use events instead of global sync
Event<Op::TLOAD, Op::TMATMUL> e;
e = TLOAD(tile, ...);
TMATMUL(acc, tile, ..., e); // Only wait for TLOAD
✅ Adjust Tile Shape
// A2/A3 recommended:
// Left: 128×64, Right: 64×256, Acc: 128×256
// A5 recommended:
// Left: 256×128, Right: 128×512, Acc: 256×512
4. Optimization Techniques Checklist
4.1 Tiling Optimization
✅ Choose Appropriate Tile Size
- Balance on-chip capacity and data reuse
- A2/A3: Single Tile typically 2-32 KB
- A5: Single Tile can be larger (4-64 KB)
✅ Multi-level Tiling
// Global → Core-level → Block-level
// M×K×N → singleCoreM×singleCoreK×singleCoreN → baseM×baseK×baseN
✅ Consider Hardware Alignment Requirements
- Row-major: Cols × sizeof(T) aligned to 32 bytes
- Column-major: Rows × sizeof(T) aligned to 32 bytes
- NZ layout: Special fractal alignment requirements
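The row-major and column-major rules above reduce to a check on the byte size of the contiguous dimension. The following sketch uses hypothetical helper names and deliberately leaves out the NZ layout, whose fractal requirements are not specified here.

```python
ALIGN_BYTES = 32  # alignment stated in the requirements above


def is_aligned(rows, cols, elem_bytes, layout):
    """Check the 32-byte rule for a row-major or column-major tile."""
    if layout == "row_major":
        inner = cols          # contiguous dimension is a row
    elif layout == "col_major":
        inner = rows          # contiguous dimension is a column
    else:
        raise ValueError("only row_major and col_major are handled here")
    return (inner * elem_bytes) % ALIGN_BYTES == 0


def pad_to_alignment(count, elem_bytes):
    """Smallest element count >= count whose byte size is a multiple of 32.

    Assumes elem_bytes divides 32 (true for 1-, 2- and 4-byte types).
    """
    per_block = ALIGN_BYTES // elem_bytes
    return -(-count // per_block) * per_block  # ceiling division


if __name__ == "__main__":
    print(is_aligned(8, 64, 4, "row_major"))   # 64 fp32 columns
    print(is_aligned(8, 10, 2, "row_major"))   # 10 fp16 columns
    print(pad_to_alignment(10, 2))             # padded fp16 column count
```

For example, 10 fp16 columns occupy 20 bytes and fail the check; padding to 16 columns (32 bytes) satisfies it.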
4.2 Memory Access Optimization
✅ Contiguous Access
// Good: Contiguous access
for (int i = 0; i < M; i++) {
    TLOAD(tile, A[i, :]); // Row contiguous
}
// Bad: Strided access
for (int i = 0; i < M; i++) {
    TLOAD(tile, A[:, i]); // Column access, may not be contiguous
}
✅ Data Prefetch
// Preload next batch of data
TPREFETCH(next_data, ...);
✅ Reduce GM Access Count
// Cache frequently accessed data in L1
TLOAD(cached_tile, ...); // Load once
for (int i = 0; i < N; i++) {
    TCOMPUTE(result, cached_tile, ...); // Reuse multiple times
}
4.3 Computation Optimization
✅ Use Appropriate Data Types
// fp16 computation faster but lower precision
// fp32 higher precision but slower
// Choose based on requirements
// Mixed precision: fp16 input, fp32 accumulation
// Alias names differ from the template names to avoid a conflicting declaration
using LeftTile = TileLeft<half, 128, 64>;
using AccTile = TileAcc<float, 128, 256>;
✅ Vectorized Operations
// Use Tile operations instead of scalar loops
TADD(c, a, b); // Process all elements in parallel
// Avoid:
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        c[i][j] = a[i][j] + b[i][j]; // Serial
    }
}
✅ Operator Fusion
// Fuse multiple operations to reduce intermediate result storage
// Example: Softmax = exp(x - max) / sum(exp(x - max))
// Can be fused into one kernel
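When fusing an operator such as Softmax, a host-side reference is useful for confirming that the fused kernel still produces the same values. The following is a plain Python reference of the formula above, not a PTO kernel; it makes the three logical passes (max, exponentiate and sum, normalize) explicit, which are the intermediate results a fused kernel would keep on chip instead of writing back.

```python
import math


def softmax(xs):
    """Numerically stable softmax: exp(x - max) / sum(exp(x - max))."""
    if not xs:
        raise ValueError("softmax of an empty sequence is undefined")
    m = max(xs)                              # pass 1: row maximum
    exps = [math.exp(x - m) for x in xs]     # pass 2: shifted exponentials
    total = sum(exps)                        #         and their sum
    return [e / total for e in exps]         # pass 3: normalization


if __name__ == "__main__":
    print(softmax([1.0, 2.0, 3.0]))
```

Subtracting the maximum keeps every exponent at or below zero, so large inputs do not overflow.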
4.4 Synchronization Optimization
✅ Use Fine-grained Events
// Good: Only wait for necessary dependencies
Event<Op::TLOAD, Op::TADD> e;
e = TLOAD(tile, ...);
TADD(result, tile, ..., e);
// Bad: Global synchronization
TLOAD(tile, ...);
TSYNC<Op::TLOAD>(); // Wait for all TLOAD
TADD(result, tile, ...);
✅ Avoid Drain in Steady-state Loops
// Bad: Drain every iteration
for (int i = 0; i < N; i++) {
    TLOAD(tile, ...);
    TCOMPUTE(result, tile);
    TSYNC(); // Wait for all operations to complete
}
// Good: Only drain at loop end
for (int i = 0; i < N; i++) {
    TLOAD(tile, ...);
    TCOMPUTE(result, tile);
}
TSYNC(); // Only sync once at the end
5. Platform-Specific Optimization
5.1 A2/A3 Optimization Points
Hardware Characteristics:
- 24 cores
- L1 capacity: ~512 KB/core
- Cube peak: ~50 TFLOPS/core (fp16)
Recommended Configuration:
// GEMM Tile size
constexpr int baseM = 128;
constexpr int baseK = 64;
constexpr int baseN = 256;
// Fractal size
constexpr int fractalABSize = 512; // A/B operands
constexpr int fractalCSize = 1024; // Accumulator
Optimization Focus:
- Prioritize optimizing K-dimension data reuse
- Use double buffering to overlap TLOAD and TMATMUL
- Pay attention to L1 capacity limits
5.2 A5 Optimization Points
Hardware Characteristics:
- More cores
- Larger L1 capacity: ~1 MB/core
- Higher Cube peak
Recommended Configuration:
// GEMM Tile size (can be larger)
constexpr int baseM = 256;
constexpr int baseK = 128;
constexpr int baseN = 512;
Optimization Focus:
- Utilize larger L1 capacity to increase Tile size
- More aggressive pipeline optimization
- Consider using MXFP4/MXFP8 mixed precision
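Whether a GEMM configuration leaves headroom in L1 can be estimated before running anything. The sketch below is a rough model with hypothetical helper names. It assumes fp16 operands, that only the left and right operand tiles occupy L1, and that each is double-buffered; the capacity figures are the approximate values quoted in this document, not official specifications.

```python
def l1_operand_bytes(base_m, base_k, base_n, elem_bytes, buffers=2):
    """Bytes needed in L1 for the left and right operand tiles.

    Assumption: the accumulator lives elsewhere and is not counted.
    """
    left = base_m * base_k * elem_bytes
    right = base_k * base_n * elem_bytes
    return buffers * (left + right)


def fits_in_l1(base_m, base_k, base_n, elem_bytes, l1_bytes, buffers=2):
    """True if the double-buffered operand tiles fit in the given capacity."""
    return l1_operand_bytes(base_m, base_k, base_n, elem_bytes, buffers) <= l1_bytes


if __name__ == "__main__":
    KIB = 1024
    # (baseM, baseK, baseN, approximate L1 capacity) from Sections 5.1 and 5.2.
    configs = {
        "A2/A3": (128, 64, 256, 512 * KIB),
        "A5": (256, 128, 512, 1024 * KIB),
    }
    for name, (m, k, n, l1) in configs.items():
        used = l1_operand_bytes(m, k, n, elem_bytes=2)
        print(f"{name}: {used // KIB} KiB of {l1 // KIB} KiB")
```

Under these assumptions the A2/A3 configuration uses 96 KiB and the A5 configuration 384 KiB, so both leave room for deeper buffering or larger tiles, subject to the alignment and shape constraints described earlier.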