PTO Communication ISA Reference

This directory contains the per-instruction reference for the PTO Communication ISA.

Source of truth (C++ intrinsics): include/pto/comm/pto_comm_inst.hpp
Type definitions: include/pto/comm/comm_types.hpp

Point-to-Point Communication (Synchronous)

TPUT: Remote write (local GM → UB → remote GM)
TGET: Remote read (remote GM → UB → local GM)

Point-to-Point Communication (Asynchronous)

TPUT_ASYNC: Asynchronous remote write (local GM → DMA engine → remote GM)
TGET_ASYNC: Asynchronous remote read (remote GM → DMA engine → local GM)

Signal-Based Synchronization

TNOTIFY: Send a notification to a remote NPU
TWAIT: Blocking wait on a signal condition
TTEST: Non-blocking test of a signal condition

Collective Communication

TGATHER: Gather data from all ranks
TSCATTER: Scatter data to all ranks
TREDUCE: Reduce data from all ranks into the local NPU
TBROADCAST: Broadcast from current NPU to all ranks
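
As a rough illustration of the data movement behind these collectives, the following host-side sketch models gather, scatter, and broadcast on plain vectors. It assumes the conventional MPI-style meaning of each operation (rank-ordered concatenation, equal-sized chunks); the names `RankBuffer`, `GatherModel`, `ScatterModel`, and `BroadcastModel` are hypothetical and are not part of the PTO API. The per-instruction pages remain the authority on actual layouts.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model only: each RankBuffer stands in for one rank's GM buffer.
using RankBuffer = std::vector<int>;

// Gather (assumed semantics): the root receives every rank's buffer,
// concatenated in rank order.
inline RankBuffer GatherModel(const std::vector<RankBuffer> &perRank) {
    RankBuffer out;
    for (const RankBuffer &b : perRank) {
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}

// Scatter (assumed semantics): the root's buffer is split into equal chunks,
// and rank r receives chunk r.
inline std::vector<RankBuffer> ScatterModel(const RankBuffer &rootData, std::size_t nranks) {
    assert(nranks > 0 && rootData.size() % nranks == 0);
    const std::size_t chunk = rootData.size() / nranks;
    std::vector<RankBuffer> out;
    for (std::size_t r = 0; r < nranks; ++r) {
        auto first = rootData.begin() + static_cast<std::ptrdiff_t>(r * chunk);
        auto last = first + static_cast<std::ptrdiff_t>(chunk);
        out.emplace_back(first, last);
    }
    return out;
}

// Broadcast: every rank ends up with a copy of the root's buffer.
inline std::vector<RankBuffer> BroadcastModel(const RankBuffer &rootData, std::size_t nranks) {
    return std::vector<RankBuffer>(nranks, rootData);
}
```

In this model, scattering a buffer and gathering the resulting chunks returns the original buffer, which is a convenient sanity check when reasoning about buffer sizes.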

Type Definitions

NotifyOp

Operation type for TNOTIFY:

Value	Description
`NotifyOp::Set`	Direct set (`signal = value`)
`NotifyOp::AtomicAdd`	Atomic add (`signal += value`)

WaitCmp

Comparison operators for TWAIT and TTEST:

Value	Description
`WaitCmp::EQ`	Equal (`==`)
`WaitCmp::NE`	Not equal (`!=`)
`WaitCmp::GT`	Greater than (`>`)
`WaitCmp::GE`	Greater or equal (`>=`)
`WaitCmp::LT`	Less than (`<`)
`WaitCmp::LE`	Less or equal (`<=`)

// Usage (unified runtime parameter style):
comm::TNOTIFY(signal, 1, comm::NotifyOp::Set);
comm::TWAIT(signal, 1, comm::WaitCmp::EQ);
comm::TTEST(signal, 1, comm::WaitCmp::GE);
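
To make the semantics of the two enums concrete, here is a small host-side model of how a notify operation updates a signal word and how a wait comparison is evaluated against it. This is a sketch under assumptions: the enums are re-declared locally as hypothetical mirrors of the documented ones, the signal is assumed to be a 32-bit integer, and `ApplyNotify` and `Satisfied` are illustrative helpers, not PTO functions.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side mirrors of the enums documented above; the real
// definitions live in the PTO headers.
enum class NotifyOp { Set, AtomicAdd };
enum class WaitCmp { EQ, NE, GT, GE, LT, LE };

// Models the effect of TNOTIFY on the signal word (signal width is assumed).
inline void ApplyNotify(std::int32_t &signal, std::int32_t value, NotifyOp op) {
    if (op == NotifyOp::Set) {
        signal = value;   // NotifyOp::Set: signal = value
    } else {
        signal += value;  // NotifyOp::AtomicAdd: signal += value
    }
}

// Models the predicate that TWAIT blocks on and that TTEST evaluates once.
inline bool Satisfied(std::int32_t signal, std::int32_t value, WaitCmp cmp) {
    switch (cmp) {
        case WaitCmp::EQ: return signal == value;
        case WaitCmp::NE: return signal != value;
        case WaitCmp::GT: return signal > value;
        case WaitCmp::GE: return signal >= value;
        case WaitCmp::LT: return signal < value;
        case WaitCmp::LE: return signal <= value;
    }
    return false;  // unreachable for valid enum values
}
```

One pattern this model suggests is a counting barrier: each of N producers notifies with `AtomicAdd` and value 1, and the consumer waits with `GE` and value N, so the wait is satisfied once all producers have arrived.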

ReduceOp

Reduction operators for TREDUCE:

Value	Description
`ReduceOp::Sum`	Element-wise sum
`ReduceOp::Max`	Element-wise maximum
`ReduceOp::Min`	Element-wise minimum
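
The element-wise meaning of the three operators can be sketched on the host as follows. This is an illustrative model only: the enum is re-declared locally as a hypothetical mirror, `ReduceModel` is not a PTO function, and each inner vector stands in for one rank's contribution.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side mirror of the documented enum.
enum class ReduceOp { Sum, Max, Min };

// Combines one equally sized buffer per rank into a single result,
// applying the operator independently at each element position.
inline std::vector<int> ReduceModel(const std::vector<std::vector<int>> &perRank, ReduceOp op) {
    assert(!perRank.empty());
    std::vector<int> acc = perRank.front();
    for (std::size_t r = 1; r < perRank.size(); ++r) {
        assert(perRank[r].size() == acc.size());
        for (std::size_t i = 0; i < acc.size(); ++i) {
            switch (op) {
                case ReduceOp::Sum: acc[i] += perRank[r][i]; break;
                case ReduceOp::Max: acc[i] = std::max(acc[i], perRank[r][i]); break;
                case ReduceOp::Min: acc[i] = std::min(acc[i], perRank[r][i]); break;
            }
        }
    }
    return acc;
}
```

For two ranks holding `{1, 5, 3}` and `{4, 2, 6}`, this model yields `{5, 7, 9}` for `Sum`, `{4, 5, 6}` for `Max`, and `{1, 2, 3}` for `Min`.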

AtomicType

Atomic operation type for TPUT (defined in include/pto/common/constants.hpp):

Value	Description
`AtomicType::AtomicNone`	No atomic operation (default)
`AtomicType::AtomicAdd`	Atomic add operation

DmaEngine

DMA backend selection for TPUT_ASYNC and TGET_ASYNC:

Value	Description
`DmaEngine::SDMA`	SDMA engine (supports 2D transfer)
`DmaEngine::URMA`	URMA engine (supports 1D transfer; marked TODO)

AsyncEvent

Returned by TPUT_ASYNC / TGET_ASYNC. Use it to wait for, or poll for, completion of the transfer:

struct AsyncEvent {
    uint64_t handle;
    DmaEngine engine;

    bool valid() const;                            // true if handle != 0
    bool Wait(const AsyncSession &session) const;  // block until transfer completes
    bool Test(const AsyncSession &session) const;  // non-blocking completion check
};
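
A common way to use a non-blocking completion check is to overlap independent work with the transfer. The sketch below shows that polling pattern against mock types so that it compiles on a host. `MockSession`, `MockEvent`, and `OverlapWithTransfer` are hypothetical; the sketch also assumes that `Test` returns `true` once the transfer has completed, which the struct above suggests but does not state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for AsyncSession and AsyncEvent (host-side only).
struct MockSession {};

struct MockEvent {
    std::uint64_t handle;
    mutable int pollsLeft;  // number of Test() calls that still report "not done"

    bool valid() const { return handle != 0; }
    // Assumption: returns true once the transfer has completed.
    bool Test(const MockSession &) const {
        if (pollsLeft > 0) {
            --pollsLeft;
            return false;
        }
        return true;
    }
    bool Wait(const MockSession &s) const {
        while (!Test(s)) {
        }
        return true;
    }
};

// Polls Test() and runs one slice of independent work between polls.
// Returns the number of slices executed, or -1 if the event is invalid.
template <typename Event, typename Session, typename Work>
int OverlapWithTransfer(const Event &ev, const Session &session, Work doSlice) {
    if (!ev.valid()) {
        return -1;
    }
    int slices = 0;
    while (!ev.Test(session)) {
        doSlice();
        ++slices;
    }
    return slices;
}
```

When there is no useful work to overlap, calling `Wait` directly is the simpler choice; the polling loop only pays off if `doSlice` does not depend on the data being transferred.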

AsyncSession

Engine-agnostic session for async DMA operations. Build once, pass to all async calls:

comm::AsyncSession session;
comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, workspace, session);

Defined in include/pto/comm/async/async_types.hpp. See TPUT_ASYNC for construction details and parameters.

ParallelGroup

Wrapper for collective communication across multiple NPUs:

template <typename GlobalData>
struct ParallelGroup {
    // Pointer to an array of `GlobalData` objects (each wraps a GM address).
    // The array itself is local metadata; the wrapped addresses may refer to local or remote GM,
    // depending on the collective instruction.
    GlobalData *tensors;
    int nranks;   // Number of ranks
    int rootIdx;  // Root NPU's rank index

    // Factory function (recommended): build from an existing tensor array.
    static ParallelGroup Create(GlobalData *tensorArray, int size, int rank_id);
};
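
To show how the factory might be used, the following host-side sketch mirrors the struct with a mock tensor type. It rests on two assumptions that the declaration above does not confirm: that `size` populates `nranks` and that `rank_id` populates `rootIdx`. `MockGlobal` and `ParallelGroupModel` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

// Hypothetical stand-in for a GlobalData type: wraps a single address.
struct MockGlobal {
    int *addr;
};

// Host-side mirror of the documented ParallelGroup layout.
template <typename GlobalData>
struct ParallelGroupModel {
    GlobalData *tensors;  // local array of per-rank tensor descriptors
    int nranks;           // number of ranks
    int rootIdx;          // root NPU's rank index

    // Assumption: `size` becomes nranks and `rank_id` becomes rootIdx.
    static ParallelGroupModel Create(GlobalData *tensorArray, int size, int rank_id) {
        assert(tensorArray != nullptr);
        assert(size > 0 && rank_id >= 0 && rank_id < size);
        return ParallelGroupModel{tensorArray, size, rank_id};
    }
};

// Builds a group over `n` descriptors with rank 0 as root and returns
// the descriptor belonging to the root rank.
inline MockGlobal RootTensor(MockGlobal *descriptors, int n) {
    auto group = ParallelGroupModel<MockGlobal>::Create(descriptors, n, 0);
    return group.tensors[group.rootIdx];
}
```

Only the descriptor array is local metadata in this sketch; as noted in the struct comments, the addresses it wraps may refer to local or remote GM depending on the collective.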