Performance Best Practices¶
This document summarizes performance tuning best practices for PTO operators, providing a systematic optimization methodology and practical guidance.
Contents¶
- 1. Optimization Workflow
- 2. Performance Analysis Methods
- 3. Common Performance Issues
- 4. Optimization Techniques Checklist
- 5. Platform-Specific Optimization
1. Optimization Workflow¶
1.1 Standard Optimization Process¶
Correctness Verification → Performance Baseline → Bottleneck Analysis →
Targeted Optimization → Verification → Iteration
Detailed Steps:
Step 1: Ensure Correctness¶
# CPU simulation verification
python3 tests/run_cpu.py --testcase your_op --verbose
# NPU verification
python3 tests/script/run_st.py -r npu -v a3 -t your_op
Checkpoints:
- ✅ Numerical error < 1e-5 (fp32) or < 1e-3 (fp16)
- ✅ All test cases pass
- ✅ Boundary conditions handled correctly
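The numerical checkpoint above can be sketched as a small host-side helper (standard C++; `max_abs_error` is an illustrative name, not part of the PTO test harness):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Maximum absolute error between a kernel result and a reference.
// Compare against the tolerances above: 1e-5 for fp32, 1e-3 for fp16.
double max_abs_error(const float* out, const float* ref, std::size_t n) {
    double max_err = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double err = std::fabs(static_cast<double>(out[i]) - static_cast<double>(ref[i]));
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```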
Step 2: Establish Performance Baseline¶
# Use msprof to collect performance data
msprof --application="your_app" --output=./baseline
Record Metrics:
- Total execution time
- Time proportion of each stage (TLOAD/TMATMUL/TSTORE)
- Memory bandwidth utilization
- Compute unit utilization
Step 3: Identify Bottlenecks¶
Analyze profiler output:
TLOAD: 45% ← Memory transfer
TEXTRACT: 10% ← Layout conversion
TMATMUL: 40% ← Computation
TSTORE: 5% ← Write back
Bottleneck Types:
- Memory Bound: TLOAD/TSTORE proportion > 60%
- Compute Bound: TMATMUL proportion > 70%
- Conversion Bound: TEXTRACT/TMOV proportion > 20%
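The thresholds above can be captured in a small classifier (standard C++; the function name and return strings are illustrative, not part of any PTO API):

```cpp
#include <string>

// Classify the bottleneck from profiler stage proportions (0.0 - 1.0),
// applying the thresholds listed above in order.
std::string classify_bottleneck(double load_store, double matmul, double convert) {
    if (load_store > 0.60) return "memory-bound";      // TLOAD/TSTORE > 60%
    if (matmul > 0.70)     return "compute-bound";     // TMATMUL > 70%
    if (convert > 0.20)    return "conversion-bound";  // TEXTRACT/TMOV > 20%
    return "balanced";
}
```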
Step 4: Targeted Optimization¶
Choose optimization strategy based on bottleneck type (see subsequent sections).
Step 5: Verify Optimization Effect¶
Compare Metrics:
- Performance improvement percentage
- Time changes in each stage
- Numerical correctness maintained
Step 6: Iterate Optimization¶
Repeat Steps 3-5 until the performance target is reached or no further optimization headroom remains.
2. Performance Analysis Methods¶
2.1 Using msprof Tool¶
Basic Usage:
# Collect performance data
msprof --application="./your_app" \
--output=./profiling_data \
--ai-core=on \
--task-time=on
# View report
msprof --export=on \
--output=./profiling_data
Key Metrics:
| Metric | Meaning | Target Value |
|---|---|---|
| TMATMUL Proportion | Cube unit utilization | > 50% |
| TLOAD Proportion | Memory transfer time | < 40% |
| MTE Bandwidth | Memory bandwidth utilization | > 70% |
| Pipeline Bubbles | Idle time | < 10% |
2.2 Manual Timing¶
Insert timing code in critical paths:
#include <chrono>
#include <cstdio>

auto start = std::chrono::high_resolution_clock::now();

// Critical code section
for (int i = 0; i < N; i++) {
    TLOAD(tile, ...);
}

auto end = std::chrono::high_resolution_clock::now();
auto duration = std::chrono::duration_cast<std::chrono::microseconds>(end - start);
printf("TLOAD time: %lld us\n", static_cast<long long>(duration.count()));
2.3 Theoretical Performance Calculation¶
GEMM Theoretical Peak:
Theoretical TFLOPS = Hardware Peak × Core Count × Utilization
Example A3 (24 cores):
- Hardware peak: ~50 TFLOPS/core (fp16)
- Theoretical peak: 50 × 24 = 1200 TFLOPS
- Achievable: ~70-80% = 840-960 TFLOPS
Memory Bandwidth Theoretical Value:
Theoretical Bandwidth = Hardware Bandwidth × Utilization
Example A3:
- Hardware bandwidth: ~900 GB/s
- Achievable: ~70-80% = 630-720 GB/s
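The two calculations above are plain arithmetic and can be written down directly (standard C++; the A3 figures are the approximate values quoted above, so the results are estimates, not measurements):

```cpp
#include <cassert>

// Theoretical TFLOPS = per-core peak x core count, achievable at ~70-80%.
constexpr double kPeakPerCoreTflops = 50.0;  // A3, fp16 (approximate)
constexpr int    kCores             = 24;
constexpr double kPeakTflops        = kPeakPerCoreTflops * kCores;  // 1200
constexpr double kTflopsLow         = kPeakTflops * 0.70;           // ~840
constexpr double kTflopsHigh        = kPeakTflops * 0.80;           // ~960

// Memory bandwidth: ~900 GB/s hardware, ~70-80% achievable.
constexpr double kHwBandwidthGBs    = 900.0;
constexpr double kBwLow             = kHwBandwidthGBs * 0.70;       // ~630
constexpr double kBwHigh            = kHwBandwidthGBs * 0.80;       // ~720
```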
3. Common Performance Issues¶
3.1 Memory Bandwidth Limited¶
Symptoms:
- TLOAD/TSTORE proportion > 60%
- TMATMUL proportion < 30%
Causes:
- Tile too small, insufficient data reuse
- Frequent GM ↔ L1 transfers
- Not using pipeline overlap
Solutions:
✅ Increase Tile Size
// Before optimization: Small Tile
using TileT = Tile<TileType::Vec, float, 8, 64>; // 2KB
// After optimization: Large Tile
using TileT = Tile<TileType::Vec, float, 16, 256>; // 16KB
✅ Improve Data Reuse
// GEMM: K-dimension blocking
for (int k = 0; k < K; k += TILE_K) {
    TLOAD(tileA, ...);            // Load once
    TLOAD(tileB, ...);            // Load once
    TMATMUL(acc, tileA, tileB);   // Reuse multiple times
}
✅ Use Double Buffering
// Preload
TLOAD(tile[0], ...);

for (int i = 0; i < N; i++) {
    int curr = i % 2;
    int next = (i + 1) % 2;

    // Compute current
    TCOMPUTE(result[curr], tile[curr]);

    // Load next simultaneously
    if (i + 1 < N) {
        TLOAD(tile[next], ...);
    }
}
3.2 Low Compute Unit Utilization¶
Symptoms:
- TMATMUL proportion < 40%
- Many pipeline bubbles
Causes:
- Data transfer cannot keep up with computation
- Overly frequent synchronization
- Tile shape does not match the hardware
Solutions:
✅ Optimize Pipeline Overlap
// Use events instead of global sync
Event<Op::TLOAD, Op::TMATMUL> e;
e = TLOAD(tile, ...);
TMATMUL(acc, tile, ..., e); // Only wait for TLOAD
✅ Adjust Tile Shape
// A2/A3 recommended:
// Left: 128×64, Right: 64×256, Acc: 128×256
// A5 recommended:
// Left: 256×128, Right: 128×512, Acc: 256×512
4. Optimization Techniques Checklist¶
4.1 Tiling Optimization¶
✅ Choose Appropriate Tile Size
- Balance on-chip capacity and data reuse
- A2/A3: single Tile typically 2-32 KB
- A5: single Tile can be larger (4-64 KB)
✅ Multi-level Tiling
// Global → Core-level → Block-level
// M×K×N → singleCoreM×singleCoreK×singleCoreN → baseM×baseK×baseN
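The hierarchy above can be sketched as loop nests in plain C++. This scalar reference shows only the base-level blocking (the core-level `singleCore*` split across cores is omitted), with an inner product standing in for TMATMUL; sizes and the function name are illustrative:

```cpp
#include <algorithm>
#include <vector>

// Blocked reference GEMM: the M x N x K problem is walked in
// baseM x baseN x baseK blocks. In the real operator, each base block
// is one TLOAD/TMATMUL unit of work; C accumulates across k-blocks.
void tiled_gemm(const std::vector<float>& A, const std::vector<float>& B,
                std::vector<float>& C, int M, int K, int N,
                int baseM, int baseK, int baseN) {
    for (int m0 = 0; m0 < M; m0 += baseM)
        for (int n0 = 0; n0 < N; n0 += baseN)
            for (int k0 = 0; k0 < K; k0 += baseK)
                // One base block of work.
                for (int m = m0; m < std::min(m0 + baseM, M); ++m)
                    for (int n = n0; n < std::min(n0 + baseN, N); ++n) {
                        float acc = 0.0f;
                        for (int k = k0; k < std::min(k0 + baseK, K); ++k)
                            acc += A[m * K + k] * B[k * N + n];
                        C[m * N + n] += acc;
                    }
}
```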
✅ Consider Hardware Alignment Requirements
- Row-major: Cols × sizeof(T) aligned to 32 bytes
- Column-major: Rows × sizeof(T) aligned to 32 bytes
- NZ layout: special fractal alignment requirements
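The first two alignment rules can be checked at compile time (standard C++; the helper names are illustrative, and the NZ fractal case is not covered here):

```cpp
#include <cstddef>

// 32-byte alignment checks from the rules above: for a row-major tile
// the row byte length (cols x sizeof(T)) must be a multiple of 32;
// for column-major, the column byte length (rows x sizeof(T)).
template <typename T>
constexpr bool row_major_aligned(std::size_t cols) {
    return (cols * sizeof(T)) % 32 == 0;
}

template <typename T>
constexpr bool col_major_aligned(std::size_t rows) {
    return (rows * sizeof(T)) % 32 == 0;
}
```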
4.2 Memory Access Optimization¶
✅ Contiguous Access
// Good: Contiguous access
for (int i = 0; i < M; i++) {
    TLOAD(tile, A[i, :]);  // Row contiguous
}

// Bad: Strided access
for (int i = 0; i < M; i++) {
    TLOAD(tile, A[:, i]);  // Column access, may not be contiguous
}
✅ Data Prefetch
// Preload next batch of data
TPREFETCH(next_data, ...);
✅ Reduce GM Access Count
// Cache frequently accessed data in L1
TLOAD(cached_tile, ...);  // Load once
for (int i = 0; i < N; i++) {
    TCOMPUTE(result, cached_tile, ...);  // Reuse multiple times
}
4.3 Computation Optimization¶
✅ Use Appropriate Data Types
// fp16 computation faster but lower precision
// fp32 higher precision but slower
// Choose based on requirements
// Mixed precision: fp16 input, fp32 accumulation
using LeftT = TileLeft<half, 128, 64>;
using AccT = TileAcc<float, 128, 256>;
✅ Vectorized Operations
// Use Tile operations instead of scalar loops
TADD(c, a, b); // Process all elements in parallel
// Avoid:
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        c[i][j] = a[i][j] + b[i][j];  // Serial
    }
}
✅ Operator Fusion
// Fuse multiple operations to reduce intermediate result storage
// Example: Softmax = exp(x - max) / sum(exp(x - max))
// Can be fused into one kernel
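As a scalar reference for the formula above (plain C++, not the fused NPU kernel), the max, exponential, and normalization stages collapse into one routine, so no intermediate tensor is written back between them:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Numerically stable softmax: exp(x - max) / sum(exp(x - max)),
// computed in a single routine with no stored intermediates.
std::vector<float> softmax(const std::vector<float>& x) {
    float mx = *std::max_element(x.begin(), x.end());
    float sum = 0.0f;
    std::vector<float> y(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        y[i] = std::exp(x[i] - mx);  // subtract max for stability
        sum += y[i];
    }
    for (auto& v : y) v /= sum;      // normalize
    return y;
}
```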
4.4 Synchronization Optimization¶
✅ Use Fine-grained Events
// Good: Only wait for necessary dependencies
Event<Op::TLOAD, Op::TADD> e;
e = TLOAD(tile, ...);
TADD(result, tile, ..., e);
// Bad: Global synchronization
TLOAD(tile, ...);
TSYNC<Op::TLOAD>(); // Wait for all TLOAD
TADD(result, tile, ...);
✅ Avoid Drain in Steady-state Loops
// Bad: Drain every iteration
for (int i = 0; i < N; i++) {
    TLOAD(tile, ...);
    TCOMPUTE(result, tile);
    TSYNC();  // Wait for all operations to complete
}

// Good: Only drain at loop end
for (int i = 0; i < N; i++) {
    TLOAD(tile, ...);
    TCOMPUTE(result, tile);
}
TSYNC();  // Only sync once at the end
5. Platform-Specific Optimization¶
5.1 A2/A3 Optimization Points¶
Hardware Characteristics:
- 24 cores
- L1 capacity: ~512 KB/core
- Cube peak: ~50 TFLOPS/core (fp16)
Recommended Configuration:
// GEMM Tile size
constexpr int baseM = 128;
constexpr int baseK = 64;
constexpr int baseN = 256;
// Fractal size
constexpr int fractalABSize = 512; // A/B operands
constexpr int fractalCSize = 1024; // Accumulator
Optimization Focus:
- Prioritize K-dimension data reuse
- Use double buffering to overlap TLOAD and TMATMUL
- Watch the L1 capacity limit
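A quick sanity check that the recommended tile sizes respect the ~512 KB per-core L1 quoted above (standard C++; fp16 inputs with an fp32 accumulator are assumed, double buffering supplies the ×2 on the input tiles, and the simple sum is an approximation of the real on-chip budget):

```cpp
#include <cassert>
#include <cstddef>

constexpr int baseM = 128, baseK = 64, baseN = 256;
constexpr std::size_t kFp16 = 2, kFp32 = 4;

// Per-tile footprints in bytes (fp16 inputs, fp32 accumulator).
constexpr std::size_t leftBytes  = baseM * baseK * kFp16;  // 16 KB
constexpr std::size_t rightBytes = baseK * baseN * kFp16;  // 32 KB
constexpr std::size_t accBytes   = baseM * baseN * kFp32;  // 128 KB

// Double-buffered inputs plus one accumulator must stay under ~512 KB.
constexpr std::size_t totalBytes = 2 * (leftBytes + rightBytes) + accBytes;
static_assert(totalBytes <= 512 * 1024, "tiles exceed L1 capacity");
```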
5.2 A5 Optimization Points¶
Hardware Characteristics:
- More cores
- Larger L1 capacity: ~1 MB/core
- Higher Cube peak
Recommended Configuration:
// GEMM Tile size (can be larger)
constexpr int baseM = 256;
constexpr int baseK = 128;
constexpr int baseN = 512;
Optimization Focus:
- Use the larger L1 capacity to increase Tile size
- More aggressive pipeline optimization
- Consider MXFP4/MXFP8 mixed precision