PTO ISA Quickstart (C++ Tile Intrinsics)¶
This quickstart is for operator/kernel developers who want to get a first PTO kernel running quickly and understand the core mental model.
It is not a full instruction encyclopedia. For detailed instruction semantics and constraints, see docs/isa/README.md.
0. What you will learn¶
After reading this document, you should be able to:
- Recognize the key concepts in PTO code:
GlobalTensor,Tile,TileType::Vec, events, andTSYNC. - Write a simple PTO-Auto style kernel:
TLOAD → compute → TSTORE. - Write a PTO-Manual style kernel: explicit tile buffer binding (
TASSIGN) and explicit ordering (events/flags). - Understand the typical shape of “bigger” kernels like row-softmax and GEMM at a high level.
1. Where PTO code lives (what you are writing)¶
In this repository, you write kernels in C++ + PTO intrinsics. A minimal kernel looks like:
#include <pto/pto-inst.hpp>
using namespace pto;
template <typename T>
__global__ AICORE void MyKernel(__gm__ T* out, __gm__ T* in0, __gm__ T* in1) {
// Use GlobalTensor + Tile + TLOAD/T* ops + TSTORE here.
}
Terminology:
__gm__: global memory pointer (GM).AICORE: runs on a single “core” on the device backend (CPU simulation defines it as a normal function annotation).GlobalTensor: a view of GM data with shape/stride/layout metadata (seedocs/coding/GlobalTensor.md).Tile: an on-chip tile object, conceptually a 2-D buffer in tile storage (seedocs/coding/Tile.md).
2. One-page cheat sheet (core concepts)¶
2.1 GlobalTensor: “the big tensor in GM”¶
GlobalTensor is the operand type for memory instructions such as TLOAD/TSTORE.
Recommended 2-D “syntax sugar” (shape + stride helpers):
template <typename T, int rows, int cols>
using Shape2D = TileShape2D<T, rows, cols, Layout::ND>;
template <typename T, int rows, int cols>
using Stride2D = BaseShape2D<T, rows, cols, Layout::ND>;
template <typename T, int rows, int cols>
using GT2D = GlobalTensor<T, Shape2D<T, rows, cols>, Stride2D<T, rows, cols>, Layout::ND>;
Mental model: GlobalTensor = “interpret this GM pointer as a (rows × cols) matrix view”.
2.2 Tile: “the small 2-D block you compute on”¶
A Tile is the core compute unit. A typical vector tile looks like:
template <typename T, int rows, int cols>
using VecTile = Tile<TileType::Vec, T, rows, cols, BLayout::RowMajor>;
Common tile locations:
TileType::Vec: elementwise / reduction style operations.TileType::Mat: general matrix tiles for movement/transform paths.TileType::Left,TileType::Right,TileType::Acc: matmul tiles.
2.3 Two styles: PTO-Auto vs PTO-Manual¶
PTO-Auto (high level):
- You describe the dataflow:
TLOAD → compute → TSTORE. - Tile buffer management and some synchronization can be handled by the compiler/runtime.
- In the API model,
TASSIGN(tile, addr)is a no-op when__PTO_AUTO__is enabled (seedocs/isa/TASSIGN.md).
PTO-Manual (expert mode):
- You bind explicit tile buffer addresses with
TASSIGN. - You express ordering explicitly (events or low-level flags).
- You can build double-buffer pipelines and overlap load/compute/store.
3. Your first kernel: vector add (PTO-Auto style)¶
Goal: out = in0 + in1 for a single tile.
#include <pto/pto-inst.hpp>
using namespace pto;
template <typename T, int kRows, int kCols>
AICORE void VecAddAutoOneTile(__gm__ T* out, __gm__ T* in0, __gm__ T* in1) {
using GT = GT2D<T, kRows, kCols>;
using TileT = Tile<TileType::Vec, T, kRows, kCols, BLayout::RowMajor, DYNAMIC, DYNAMIC>;
GT g0(in0), g1(in1), gout(out);
TileT t0(kRows, kCols), t1(kRows, kCols), tout(kRows, kCols);
TLOAD(t0, g0);
TLOAD(t1, g1);
TADD(tout, t0, t1);
TSTORE(gout, tout);
}
Why this is “Auto style”:
- No explicit
TASSIGN. - No explicit flags/events in the source code.
- The kernel is written as a direct dataflow.
On CPU simulation (python3 tests/run_cpu.py), this style is typically enough to validate correctness.
4. The same kernel: vector add (PTO-Manual style)¶
Now we write the same logic but explicitly:
- Bind tile buffers (
TASSIGN) - Express ordering (events or flags)
4.1 Manual ordering with events (recommended)¶
This matches the programming model in docs/coding/Event.md (device-only Event types).
template <typename T, int kRows, int kCols>
__global__ AICORE void VecAddManual(__gm__ T* out, __gm__ T* in0, __gm__ T* in1) {
using GT = GT2D<T, kRows, kCols>;
using TileT = Tile<TileType::Vec, T, kRows, kCols, BLayout::RowMajor, DYNAMIC, DYNAMIC>;
GT g0(in0), g1(in1), gout(out);
TileT t0(kRows, kCols), t1(kRows, kCols), tout(kRows, kCols);
#ifndef __PTO_AUTO__
constexpr uint32_t kT0Addr = 0x0000;
constexpr uint32_t kT1Addr = 0x4000;
constexpr uint32_t kOutAddr = 0x8000;
TASSIGN(t0, kT0Addr);
TASSIGN(t1, kT1Addr);
TASSIGN(tout, kOutAddr);
#endif
#ifdef __CCE_AICORE__
Event<Op::TLOAD, Op::TADD> e_load_to_add;
Event<Op::TADD, Op::TSTORE_VEC> e_add_to_store;
TLOAD(t0, g0);
e_load_to_add = TLOAD(t1, g1);
e_add_to_store = TADD(tout, t0, t1, e_load_to_add);
TSTORE(gout, tout, e_add_to_store);
#else
TLOAD(t0, g0);
TLOAD(t1, g1);
TADD(tout, t0, t1);
TSTORE(gout, tout);
#endif
}
4.2 Manual ordering with low-level flags (legacy style)¶
Some existing device kernels use set_flag/wait_flag directly. This is more hardware-coupled than events, but useful to recognize when reading older kernels.
On CPU simulation, these are stubs (no-ops).
See also: tests/cpu/st/testcase/tadd/tadd_kernel.cpp.
5. A slightly bigger pattern: row-wise softmax (Auto style)¶
Row-softmax is a common pattern behind attention kernels. For a tile X shaped [M, N]:
row_max = TROWMAX(X)→[M, 1]X = X - expand(row_max)(TROWEXPAND+TSUB)X = exp(X)(TEXP)row_sum = TROWSUM(X)→[M, 1]X = X / expand(row_sum)(TROWEXPAND+TDIV)
Example (single tile):
template <typename T, int M, int N>
AICORE void RowSoftmaxAutoOneTile(__gm__ T* out, __gm__ T* in) {
using GT = GT2D<T, M, N>;
using XTile = Tile<TileType::Vec, T, M, N, BLayout::RowMajor, DYNAMIC, DYNAMIC>;
using Col1 = Tile<TileType::Vec, T, M, 1, BLayout::RowMajor, DYNAMIC, DYNAMIC>;
GT gin(in), gout(out);
XTile x(M, N), tmp(M, N);
Col1 row_max(M, 1), row_sum(M, 1);
TLOAD(x, gin);
TROWMAX(row_max, x);
TROWEXPAND(tmp, row_max);
TSUB(x, x, tmp);
TEXP(x, x);
TROWSUM(row_sum, x);
TROWEXPAND(tmp, row_sum);
TDIV(x, x, tmp);
TSTORE(gout, x);
}
For tiled softmax over a larger tensor, the same structure applies; you iterate over tiles and set valid regions on edge tiles when needed.
6. GEMM skeleton: TMATMUL + TMOV (high-level shape)¶
A minimal GEMM at the tile level typically looks like:
TLOADA and B tiles (often intoMattiles).TMOVintoLeft/Righttiles (to satisfy boxed/fractal layout requirements).TMATMULinto an accumulator tile.- Convert/move/store the result (depending on backend and destination layout).
Skeleton (single-tile shape, illustrative):
template <typename A, typename B, typename Acc, int TM, int TK, int TN>
__global__ AICORE void GemmAutoOneTile(__gm__ A* a, __gm__ B* b, __gm__ Acc* c) {
using GA = GT2D<A, TM, TK>;
using GB = GT2D<B, TK, TN>;
using GC = GT2D<Acc, TM, TN>;
GA gA(a);
GB gB(b);
GC gC(c);
Tile<TileType::Mat, A, TM, TK, BLayout::RowMajor> a_mat;
Tile<TileType::Mat, B, TK, TN, BLayout::RowMajor> b_mat;
TileLeft<A, TM, TK> a_l;
TileRight<B, TK, TN> b_r;
TileAcc<Acc, TM, TN> acc;
TLOAD(a_mat, gA);
TLOAD(b_mat, gB);
TMOV(a_l, a_mat);
TMOV(b_r, b_mat);
TMATMUL(acc, a_l, b_r);
// Result writeback can be backend-specific; see the GEMM demos/kernels in this repo.
// For example, some flows move `acc` to a vec/mat tile before `TSTORE`.
}
To build real GEMM/attention kernels you add:
- Tiling of M/K/N across blocks and loops.
TMATMUL_ACCto accumulate across K tiles.TEXTRACT/TRESHAPE/TTRANSas needed for slicing/layout.- Events for overlapping load/compute/store.
7. Next steps¶
- Learn the models precisely:
docs/coding/Tile.mddocs/coding/GlobalTensor.mddocs/coding/Event.md- Debugging and assertion lookup:
docs/coding/debug.md- Browse more examples (expanded walkthroughs):
docs/coding/tutorials/README.md- Run CPU simulation first:
python3 tests/run_cpu.py --verbose- Use the instruction reference as needed:
docs/isa/README.md
Appendix: compiling PTO-Auto with Bisheng (Ascend CANN)¶
PTO-Auto has two parts:
- Library-level auto semantics: compile with
-D__PTO_AUTO__so Tiles use compiler-managed storage andTASSIGN(tile, addr)becomes a no-op (seedocs/isa/TASSIGN.md). - Compiler pipeline: enable the PTO lowering/bufferization passes in the Bisheng CCE toolchain.
Finding the correct “enable PTO passes” flag (CANN toolchain)¶
The exact flag name is toolchain-version dependent. On some releases it is a driver flag (e.g. --cce-enable-pto-passes); on others it is exposed as an LLVM option (passed via -mllvm).
Use your installed Bisheng to discover the supported spelling:
bisheng --help | rg -n "pto|PTO" || true
bisheng -mllvm --help | rg -n "pto|PTO" || true
Example (device compilation)¶
This compiles a single CCE kernel source into an object file. Adjust:
--cce-aicore-arch=...for your SoC (examples in this repo usedav-c220-vec,dav-c310-vec, etc.).-DMEMORY_BASEvs-DREGISTER_BASEfor the target backend used in this repo.- the “enable PTO passes” flag spelling based on your Bisheng help output.
source /usr/local/Ascend/ascend-toolkit/latest/bin/setenv.bash
bisheng -c -xcce -O2 --cce-aicore-only \
--cce-aicore-arch=dav-c310-vec \
-std=c++17 \
-I"$ASCEND_HOME_PATH/include" -I./include \
-DREGISTER_BASE -D__PTO_AUTO__ \
-mllvm -cce-aicore-stack-size=0x8000 \
-mllvm -cce-aicore-function-stack-size=0x8000 \
-mllvm -cce-aicore-record-overflow=true \
-mllvm -cce-aicore-addr-transform \
-mllvm -cce-aicore-dcci-insert-for-scalar=false \
<ENABLE_PTO_PASSES_FLAG> \
tadd.cpp -o tadd.o
If your toolchain uses a driver-style flag, <ENABLE_PTO_PASSES_FLAG> might look like:
--cce-enable-pto-passes
If your toolchain exposes it as an LLVM option, it might look like:
-mllvm -cce-enable-pto-passes
When in doubt, prefer “what bisheng --help and bisheng -mllvm --help show” over hard-coding a particular spelling in docs.