Pipeline and Parallel Execution¶
This document introduces PTO's pipeline model and parallel execution mechanisms, helping developers fully utilize hardware resources for high-performance operators.
Contents¶
- 1. Pipeline Overview
- 2. Hardware Pipeline
- 3. Software Pipeline
- 4. Parallel Execution Model
- 5. Performance Optimization Tips
1. Pipeline Overview¶
1.1 What Is a Pipeline¶
A pipeline is a parallelism technique that decomposes a task into multiple stages, allowing different stages to process different data items simultaneously.
Analogy: a car assembly line
- Stage 1: Install chassis
- Stage 2: Install engine
- Stage 3: Install body
- Stage 4: Paint
When Stage 2 processes car B, Stage 1 can simultaneously process car C.
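The assembly-line gain can be put in numbers. Below is a minimal sketch in plain C++ (the stage count and per-stage time are illustrative, not PTO quantities): with s equal-duration stages, pipelining approaches an s-fold speedup as the number of items grows.

```cpp
// Sequential: each item passes through all s stages (duration t each)
// before the next item starts.
long sequential_time(long n, long s, long t) { return n * s * t; }

// Pipelined: after an (s - 1)-stage fill, one item completes every t.
long pipelined_time(long n, long s, long t) { return (s + n - 1) * t; }
```

For 100 cars through 4 stages of 1 time unit each, sequential execution takes 400 units while the pipeline takes 103, close to the ideal 4× speedup.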
1.2 Pipeline in PTO¶
PTO operators typically include the following stages:
TLOAD  →  Transform  →  Compute  →  TSTORE
  ↓           ↓            ↓           ↓
 MTE2        MTE1      CUBE/VEC      MTE1
Key Idea: Overlap execution of different stages to improve hardware utilization.
2. Hardware Pipeline¶
2.1 Ascend Hardware Pipeline¶
Ascend AI processors contain multiple independent execution units:
| Pipeline | Function | Typical Instructions |
|---|---|---|
| MTE2 | GM → L1 data transfer | TLOAD |
| MTE1 | L1 → L0 data transfer | TEXTRACT, TMOV |
| CUBE | Matrix multiplication | TMATMUL |
| VECTOR | Element-wise operations | TADD, TEXP, TMAX |
| SCALAR | Scalar operations and control flow | Address calculation, loop control |
2.2 Pipeline Parallelism¶
Different pipelines can execute simultaneously:
// Time T0: TLOAD executes on MTE2
TLOAD(tileA[0], ...);
// Time T1: TLOAD continues, TEXTRACT executes on MTE1 simultaneously
TLOAD(tileA[1], ...);
TEXTRACT(tileLeft[0], tileA[0]);
// Time T2: Three pipelines work simultaneously
TLOAD(tileA[2], ...);
TEXTRACT(tileLeft[1], tileA[1]);
TMATMUL(acc, tileLeft[0], tileRight[0]);
Performance Gain: Ideally, overlapping these pipelines can achieve a 3-4× throughput improvement over fully serial execution.
3. Software Pipeline¶
3.1 Double Buffering¶
Double buffering is the most commonly used software pipeline technique:
// Basic version (no pipeline)
for (int i = 0; i < N; i++) {
    TLOAD(tile, ...);        // Wait for load
    TCOMPUTE(result, tile);  // Wait for compute
    TSTORE(..., result);     // Wait for store
}
// Total time = N × (T_load + T_compute + T_store)
// Double buffering version (with pipeline)
TLOAD(tile[0], ...);  // Preload first
for (int i = 0; i < N; i++) {
    int curr = i % 2;
    int next = (i + 1) % 2;
    // Compute current iteration
    TCOMPUTE(result[curr], tile[curr]);
    // Load next iteration simultaneously
    if (i + 1 < N) {
        TLOAD(tile[next], ...);
    }
    // Store previous result
    if (i > 0) {
        TSTORE(..., result[1 - curr]);
    }
}
TSTORE(..., result[(N - 1) % 2]);  // Store last result (computed at i = N - 1)
// Total time ≈ max(T_load, T_compute, T_store) × N
// Speedup ≈ (T_load + T_compute + T_store) / max(...)
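The timing comments above can be checked with a small model. This is plain C++, not the PTO API; the stage times passed in are hypothetical, in arbitrary units.

```cpp
#include <algorithm>

// No pipeline: each iteration pays load + compute + store in full.
double time_serial(long n, double t_load, double t_compute, double t_store) {
    return n * (t_load + t_compute + t_store);
}

// Double buffering: in steady state the loop is limited by the slowest stage.
double time_double_buffered(long n, double t_load, double t_compute,
                            double t_store) {
    return n * std::max({t_load, t_compute, t_store});
}
```

For example, t_load = 3, t_compute = 2, t_store = 1 gives a speedup of (3 + 2 + 1) / 3 = 2×: the loop runs at the pace of the slowest stage, the load.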
3.2 Triple Buffering¶
For more complex scenarios, use triple buffering:
TileT tile[3];
TileT result[3];
Event load_event[3], compute_event[3];
// Preload first two
TLOAD(tile[0], ..., load_event[0]);
TLOAD(tile[1], ..., load_event[1]);
for (int i = 0; i < N; i++) {
    int curr = i % 3;
    int prev = (i + 2) % 3;  // slot of iteration i - 1, reused for i + 2
    // Store previous result, freeing its slot
    if (i > 0) {
        WAIT(compute_event[prev]);
        TSTORE(..., result[prev]);
    }
    // Load data for iteration i + 2 into the freed slot
    if (i + 2 < N) {
        TLOAD(tile[prev], ..., load_event[prev]);
    }
    // Compute current
    WAIT(load_event[curr]);
    TCOMPUTE(result[curr], tile[curr], compute_event[curr]);
}
// Drain: store the final result
WAIT(compute_event[(N + 2) % 3]);
TSTORE(..., result[(N + 2) % 3]);
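The index rotation is easy to get wrong, so it helps to simulate it. The plain C++ sketch below (not PTO API) models three slots where slots 0 and 1 are preloaded and each iteration reloads slot (i + 2) % 3 with the data for iteration i + 2, then records which data each iteration computes on.

```cpp
#include <vector>

// Simulate a three-slot rotation: slots 0 and 1 are preloaded, and each
// iteration i reloads slot (i + 2) % 3 with the data for iteration i + 2.
// Returns the data index each iteration computes on.
std::vector<int> simulate_rotation(int n) {
    std::vector<int> slot(3, -1);
    if (n > 0) slot[0] = 0;  // preload data 0
    if (n > 1) slot[1] = 1;  // preload data 1
    std::vector<int> consumed;
    for (int i = 0; i < n; ++i) {
        if (i + 2 < n) slot[(i + 2) % 3] = i + 2;  // load two ahead
        consumed.push_back(slot[i % 3]);           // compute on current slot
    }
    return consumed;
}
```

Each iteration consumes exactly its own data index, confirming that the load two iterations ahead must target slot (i + 2) % 3, the slot just vacated by iteration i - 1.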
3.3 Event-based Synchronization¶
Use events for fine-grained synchronization:
// Define event types
Event<Op::TLOAD, Op::TADD> load_event;
Event<Op::TADD, Op::TSTORE> compute_event;
// Load with event
load_event = TLOAD(tile, ...);
// Compute depends on load
compute_event = TADD(result, tile, ..., load_event);
// Store depends on compute
TSTORE(..., result, compute_event);
4. Parallel Execution Model¶
4.1 Multi-core Parallelism¶
SPMD (Single Program, Multiple Data):
__global__ __aicore__ void ParallelKernel(...) {
    int block_idx = get_block_idx();
    int block_num = get_block_num();
    // Each core processes different data
    int start = block_idx * elements_per_core;
    int end = min(start + elements_per_core, total_elements);
    for (int i = start; i < end; i++) {
        process(i);
    }
}
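The per-core range computation generalizes to a ceiling-divide partition. Below is a sketch in plain C++ (core_range is an illustrative helper, not a PTO function), where elements_per_core is derived from the total element count and the core count:

```cpp
#include <algorithm>
#include <utility>

// Half-open [start, end) range for one core, matching the kernel above:
// elements_per_core is the ceiling of total / block_num, and trailing
// cores may get a clipped or empty range.
std::pair<int, int> core_range(int block_idx, int block_num, int total) {
    int per_core = (total + block_num - 1) / block_num;  // ceiling divide
    int start = std::min(block_idx * per_core, total);
    int end = std::min(start + per_core, total);
    return {start, end};
}
```

For 10 elements on 3 cores this yields [0, 4), [4, 8), [8, 10): the ranges are disjoint and together cover every element.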
4.2 Instruction-level Parallelism¶
Different instruction types can execute in parallel:
// These can execute simultaneously
TLOAD(tile_a, ...); // MTE2
TEXTRACT(tile_b, ...); // MTE1
TMATMUL(acc, ...); // CUBE
TADD(vec_result, ...); // VECTOR
4.3 Data-level Parallelism¶
Process multiple data elements simultaneously:
// Tile operations process all elements in parallel
using TileT = Tile<TileType::Vec, float, 16, 256>;
TileT a, b, c;
TLOAD(a, ...);
TLOAD(b, ...);
TADD(c, a, b); // All 16×256 elements computed in parallel
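Semantically, that single TADD is an element-wise loop over every tile element; the difference is that the VECTOR unit processes them in parallel rather than one at a time. A plain C++ model of the computation (not the PTO API):

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kRows = 16, kCols = 256;
using TileData = std::array<float, kRows * kCols>;

// What TADD(c, a, b) computes: c[i] = a[i] + b[i] for every tile element.
TileData tile_add(const TileData& a, const TileData& b) {
    TileData c{};
    for (std::size_t i = 0; i < a.size(); ++i)
        c[i] = a[i] + b[i];
    return c;
}
```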
5. Performance Optimization Tips¶
5.1 Maximize Pipeline Overlap¶
Strategy 1: Preload Data
// Preload first batch
TLOAD(tile[0], data[0]);
for (int i = 0; i < N; i++) {
    // Load next while processing current
    if (i + 1 < N) {
        TLOAD(tile[(i + 1) % 2], data[i + 1]);
    }
    TCOMPUTE(result, tile[i % 2]);
    TSTORE(output[i], result);
}
Strategy 2: Use Events Instead of Global Sync
// Bad: Global synchronization
TLOAD(tile, ...);
TSYNC<Op::TLOAD>(); // Wait for all TLOAD
TADD(result, tile, ...);
// Good: Event-based synchronization
Event e = TLOAD(tile, ...);
TADD(result, tile, ..., e); // Only wait for this TLOAD
5.2 Balance Pipeline Stages¶
Identify Bottleneck:
TLOAD:    40%  ← Bottleneck
TCOMPUTE: 20%
TSTORE:   10%
Idle:     30%
Solution: Increase compute intensity
// Increase computation per load
for (int k = 0; k < K; k += TILE_K) {
    TLOAD(tileA, ...);  // Load once
    TLOAD(tileB, ...);
    // Reuse multiple times
    for (int sub_k = 0; sub_k < TILE_K; sub_k++) {
        TMATMUL(acc, tileA[sub_k], tileB[sub_k]);
    }
}
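The payoff can be counted: per k-block the loop issues two loads but TILE_K matmuls, so each loaded tile is reused TILE_K times. A rough model of the load counts (plain C++; it counts issued loads only and ignores tile sizes):

```cpp
// Loads issued over the whole K dimension when each pair of tiles
// feeds tile_k matmuls, versus reloading the pair for every matmul.
long loads_with_reuse(long k, long tile_k) { return (k / tile_k) * 2; }
long loads_without_reuse(long k) { return k * 2; }
```

With K = 1024 and TILE_K = 64, reuse cuts the load count from 2048 to 32, a 64× reduction in pressure on the MTE2 pipeline.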
5.3 Reduce Synchronization Overhead¶
Minimize WAIT Calls:
// Bad: Frequent synchronization
for (int i = 0; i < N; i++) {
    Event e = TLOAD(tile, ...);
    WAIT(e);  // Sync every iteration
    TCOMPUTE(result, tile);
}
// Good: Batch synchronization
Event events[BATCH_SIZE];
for (int i = 0; i < N; i += BATCH_SIZE) {
    // Load batch
    for (int j = 0; j < BATCH_SIZE; j++) {
        events[j] = TLOAD(tiles[j], ...);
    }
    // Process batch
    for (int j = 0; j < BATCH_SIZE; j++) {
        WAIT(events[j]);
        TCOMPUTE(results[j], tiles[j]);
    }
}
5.4 Optimize Memory Access Pattern¶
Contiguous Access:
// Good: Sequential access
for (int i = 0; i < M; i++) {
    TLOAD(tile, A[i, :]);  // Row-major, contiguous
}
// Bad: Strided access
for (int j = 0; j < N; j++) {
    TLOAD(tile, A[:, j]);  // Column access, may be strided
}
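The cost difference comes from how many distinct transfer segments each pattern touches. A plain C++ sketch, assuming a row-major layout and an illustrative 32-element transfer granularity (not a documented PTO parameter):

```cpp
#include <set>

// Count distinct transfer segments touched when reading `count` elements
// starting at `base` with the given element stride (row-major layout).
int segments_touched(long base, long stride, long count, long seg_elems) {
    std::set<long> segs;
    for (long i = 0; i < count; ++i)
        segs.insert((base + i * stride) / seg_elems);
    return static_cast<int>(segs.size());
}
```

Reading a 256-element row touches 8 segments, while reading a 256-element column of a 256-wide matrix touches 256: every element lands in its own segment.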
Prefetch:
// Prefetch next data
TPREFETCH(next_data, ...);
TCOMPUTE(current_data);