PTO Communication ISA Reference¶
This directory contains the per-instruction reference for the PTO Communication ISA.
- Source of truth (C++ intrinsics): include/pto/comm/pto_comm_inst.hpp
- Type definitions: include/pto/comm/comm_types.hpp
Point-to-Point Communication (Synchronous)¶
Point-to-Point Communication (Asynchronous)¶
- TPUT_ASYNC: Asynchronous remote write (GM → DMA engine → GM)
- TGET_ASYNC: Asynchronous remote read (GM → DMA engine → GM)
Signal-Based Synchronization¶
- TNOTIFY: Send notification to remote NPU
- TWAIT: Blocking wait for signal condition
- TTEST: Non-blocking test of a signal condition
Collective Communication¶
- TGATHER: Gather data from all ranks
- TSCATTER: Scatter data to all ranks
- TREDUCE: Reduce data from all ranks into the local NPU
- TBROADCAST: Broadcast from current NPU to all ranks
Type Definitions¶
NotifyOp¶
Operation type for TNOTIFY:
| Value | Description |
|---|---|
| NotifyOp::Set | Direct set (signal = value) |
| NotifyOp::AtomicAdd | Atomic add (signal += value) |
WaitCmp¶
Comparison operators for TWAIT and TTEST:
| Value | Description |
|---|---|
| WaitCmp::EQ | Equal (==) |
| WaitCmp::NE | Not equal (!=) |
| WaitCmp::GT | Greater than (>) |
| WaitCmp::GE | Greater or equal (>=) |
| WaitCmp::LT | Less than (<) |
| WaitCmp::LE | Less or equal (<=) |

```cpp
// Usage (unified runtime parameter style):
comm::TNOTIFY(signal, 1, comm::NotifyOp::Set);
comm::TWAIT(signal, 1, comm::WaitCmp::EQ);
comm::TTEST(signal, 1, comm::WaitCmp::GE);
```
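
The NotifyOp and WaitCmp semantics listed above can be modeled on the host with an ordinary atomic integer. The following is a minimal sketch, not the PTO implementation: the `signal_model` namespace and its `Notify`, `Test`, and `Wait` functions are hypothetical stand-ins for TNOTIFY, TTEST, and TWAIT, the 32-bit signal width is an assumption, and the enums are redeclared locally so the block compiles without the PTO headers.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

namespace signal_model {  // hypothetical host-side model, not part of PTO

enum class NotifyOp { Set, AtomicAdd };
enum class WaitCmp { EQ, NE, GT, GE, LT, LE };

using Signal = std::atomic<int32_t>;  // ASSUMPTION: 32-bit signal word

// Models TNOTIFY: overwrite the signal, or add to it atomically.
inline void Notify(Signal &signal, int32_t value, NotifyOp op) {
    if (op == NotifyOp::Set) {
        signal.store(value);      // signal = value
    } else {
        signal.fetch_add(value);  // signal += value
    }
}

// Models TTEST: evaluate `signal <cmp> value` once and return immediately.
inline bool Test(const Signal &signal, int32_t value, WaitCmp cmp) {
    const int32_t current = signal.load();
    switch (cmp) {
        case WaitCmp::EQ: return current == value;
        case WaitCmp::NE: return current != value;
        case WaitCmp::GT: return current > value;
        case WaitCmp::GE: return current >= value;
        case WaitCmp::LT: return current < value;
        case WaitCmp::LE: return current <= value;
    }
    assert(false && "unknown WaitCmp");
    return false;
}

// Models TWAIT: spin until the condition holds.
inline void Wait(const Signal &signal, int32_t value, WaitCmp cmp) {
    while (!Test(signal, value, cmp)) {
        // busy-wait; another thread is expected to call Notify
    }
}

}  // namespace signal_model
```

In this model, a counting pattern falls out naturally: if each of N producers calls `Notify(signal, 1, NotifyOp::AtomicAdd)`, a consumer can call `Wait(signal, N, WaitCmp::GE)` to proceed once all of them have arrived.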
ReduceOp¶
Reduction operators for TREDUCE:
| Value | Description |
|---|---|
| ReduceOp::Sum | Element-wise sum |
| ReduceOp::Max | Element-wise maximum |
| ReduceOp::Min | Element-wise minimum |
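
To make "element-wise" concrete, the sketch below combines one buffer per rank into a single result on the host. It is a reference model only: the `reduce_model` namespace and its `Reduce` function are hypothetical, the `int` element type and `std::vector` layout are assumptions, and nothing here reflects how TREDUCE moves data between NPUs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

namespace reduce_model {  // hypothetical host-side model, not part of PTO

enum class ReduceOp { Sum, Max, Min };

// Combine one equally sized buffer per rank into a single result,
// applying `op` independently at every element index.
inline std::vector<int> Reduce(const std::vector<std::vector<int>> &perRank, ReduceOp op) {
    assert(!perRank.empty());
    std::vector<int> result = perRank.front();  // start from rank 0's data
    for (std::size_t r = 1; r < perRank.size(); ++r) {
        assert(perRank[r].size() == result.size());
        for (std::size_t i = 0; i < result.size(); ++i) {
            switch (op) {
                case ReduceOp::Sum: result[i] += perRank[r][i]; break;
                case ReduceOp::Max: result[i] = std::max(result[i], perRank[r][i]); break;
                case ReduceOp::Min: result[i] = std::min(result[i], perRank[r][i]); break;
            }
        }
    }
    return result;
}

}  // namespace reduce_model
```

For three ranks holding `{1, 5, 3}`, `{4, 2, 6}`, and `{0, 7, 3}`, this model yields `{5, 14, 12}` for Sum, `{4, 7, 6}` for Max, and `{0, 2, 3}` for Min.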
AtomicType¶
Atomic operation type for TPUT (defined in include/pto/common/constants.hpp):
| Value | Description |
|---|---|
| AtomicType::AtomicNone | No atomic operation (default) |
| AtomicType::AtomicAdd | Atomic add operation |
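
The difference between the two modes can be illustrated with a small host-side model of a remote write. This is a sketch under assumptions: `put_model::Put` is a hypothetical stand-in for TPUT, the `int` element type is chosen for illustration, and the loop only shows the resulting destination values (a real atomic add is performed atomically by the hardware, which a single-threaded loop does not demonstrate).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace put_model {  // hypothetical host-side model, not part of PTO

enum class AtomicType { AtomicNone, AtomicAdd };

// Write `src` into `dst`. With AtomicNone the destination is overwritten;
// with AtomicAdd the source is accumulated into the destination.
inline void Put(std::vector<int> &dst, const std::vector<int> &src,
                AtomicType atomic = AtomicType::AtomicNone) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        if (atomic == AtomicType::AtomicAdd) {
            dst[i] += src[i];  // accumulate
        } else {
            dst[i] = src[i];   // plain overwrite
        }
    }
}

}  // namespace put_model
```

Accumulation is the mode that lets several ranks contribute to the same destination buffer without one write discarding another.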
DmaEngine¶
DMA backend selection for TPUT_ASYNC and TGET_ASYNC:
| Value | Description |
|---|---|
| DmaEngine::SDMA | SDMA engine (supports 2D transfer) |
| DmaEngine::URMA | URMA engine (supports 1D transfer, todo) |
AsyncEvent¶
Returned by TPUT_ASYNC / TGET_ASYNC. Use it to wait for, or poll for, completion of the transfer:
```cpp
struct AsyncEvent {
    uint64_t handle;
    DmaEngine engine;
    bool valid() const;                            // true if handle != 0
    bool Wait(const AsyncSession &session) const;  // block until transfer completes
    bool Test(const AsyncSession &session) const;  // non-blocking completion check
};
```
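
Because `Test` does not block, it can be used to overlap independent work with an in-flight transfer, with `Wait` as the final barrier. The sketch below shows that polling pattern against mock types. Everything in the `async_model` namespace is hypothetical: the mock `AsyncSession` simply reports completion after a fixed number of polls so that the example is deterministic, the `engine` field is omitted, and `OverlapUntilDone` is an illustrative helper, not a PTO function.

```cpp
#include <cassert>
#include <cstdint>

namespace async_model {  // hypothetical mocks, not the PTO types

// Mock session: pretends the transfer finishes after a fixed number of polls.
struct AsyncSession {
    mutable int pollsUntilDone = 3;
};

struct AsyncEvent {
    uint64_t handle = 0;

    bool valid() const { return handle != 0; }

    // Non-blocking: consume one poll and report whether the transfer is done.
    bool Test(const AsyncSession &session) const {
        if (session.pollsUntilDone > 0) {
            --session.pollsUntilDone;
        }
        return session.pollsUntilDone == 0;
    }

    // Blocking: poll until the transfer is done.
    bool Wait(const AsyncSession &session) const {
        while (!Test(session)) {
        }
        return true;
    }
};

// Run one unit of independent work after every unsuccessful poll.
// Returns the number of work units executed while the transfer was in flight.
template <typename Work>
int OverlapUntilDone(const AsyncEvent &event, const AsyncSession &session, Work doWork) {
    if (!event.valid()) {
        return 0;  // nothing was issued, so there is nothing to overlap with
    }
    int units = 0;
    while (!event.Test(session)) {
        doWork();
        ++units;
    }
    return units;
}

}  // namespace async_model
```

Checking `valid()` first matters in this pattern: an event whose handle is 0 does not correspond to an issued transfer, so polling it would be meaningless.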
AsyncSession¶
Engine-agnostic session for async DMA operations. Build once, pass to all async calls:
```cpp
comm::AsyncSession session;
comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, workspace, session);
```
Defined in include/pto/comm/async/async_types.hpp. See TPUT_ASYNC for construction details and parameters.
ParallelGroup¶
Wrapper for collective communication across multiple NPUs:
```cpp
template <typename GlobalData>
struct ParallelGroup {
    // Pointer to an array of `GlobalData` objects (each wraps a GM address).
    // The array itself is local metadata; the wrapped addresses may refer to local or remote GM,
    // depending on the collective instruction.
    GlobalData *tensors;
    int nranks;   // Number of ranks
    int rootIdx;  // Root NPU's rank index

    // Factory function (recommended): build from an existing tensor array.
    static ParallelGroup Create(GlobalData *tensorArray, int size, int rank_id);
};
```
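
As an illustration of how the factory might be used, the sketch below reimplements the struct on the host with a stand-in tensor type. It rests on assumptions that the declaration alone does not confirm: that `size` populates `nranks`, that `rank_id` populates `rootIdx`, and that the factory performs no other setup. `group_model` and `FakeGlobal` are hypothetical names; consult include/pto/comm/comm_types.hpp for the actual behavior.

```cpp
#include <cassert>

namespace group_model {  // hypothetical host-side model, not part of PTO

// Hypothetical stand-in for a GlobalData wrapper around a GM address.
struct FakeGlobal {
    int *addr;
};

template <typename GlobalData>
struct ParallelGroup {
    GlobalData *tensors;
    int nranks;
    int rootIdx;

    // ASSUMPTION: size -> nranks and rank_id -> rootIdx.
    static ParallelGroup Create(GlobalData *tensorArray, int size, int rank_id) {
        assert(tensorArray != nullptr && size > 0);
        assert(rank_id >= 0 && rank_id < size);
        return ParallelGroup{tensorArray, size, rank_id};
    }
};

}  // namespace group_model
```

A caller in this model would build one `FakeGlobal` per rank, place them in a local array, and pass that array together with the rank count and root index to `Create`. The array must outlive the group, since the group stores only a pointer to it.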