PTO Communication ISA Reference

This directory contains the per-instruction reference for the PTO Communication ISA.

  • Source of truth (C++ intrinsics): include/pto/comm/pto_comm_inst.hpp
  • Type definitions: include/pto/comm/comm_types.hpp

Point-to-Point Communication (Synchronous)

  • TPUT: Remote write (GM → UB → GM)
  • TGET: Remote read (GM → UB → GM)

Point-to-Point Communication (Asynchronous)

  • TPUT_ASYNC: Asynchronous remote write (GM → DMA engine → GM)
  • TGET_ASYNC: Asynchronous remote read (GM → DMA engine → GM)

Signal-Based Synchronization

  • TNOTIFY: Send notification to remote NPU
  • TWAIT: Blocking wait for signal condition
  • TTEST: Non-blocking test signal condition

Collective Communication

  • TGATHER: Gather data from all ranks
  • TSCATTER: Scatter data to all ranks
  • TREDUCE: Reduce data from all ranks to local
  • TBROADCAST: Broadcast from current NPU to all ranks

Type Definitions

NotifyOp

Operation type for TNOTIFY:

Value Description
NotifyOp::Set Direct set (signal = value)
NotifyOp::AtomicAdd Atomic add (signal += value)

WaitCmp

Comparison operators for TWAIT and TTEST:

Value Description
WaitCmp::EQ Equal (==)
WaitCmp::NE Not equal (!=)
WaitCmp::GT Greater than (>)
WaitCmp::GE Greater or equal (>=)
WaitCmp::LT Less than (<)
WaitCmp::LE Less or equal (<=)
// Usage (unified runtime parameter style):
comm::TNOTIFY(signal, 1, comm::NotifyOp::Set);
comm::TWAIT(signal, 1, comm::WaitCmp::EQ);
comm::TTEST(signal, 1, comm::WaitCmp::GE);

ReduceOp

Reduction operators for TREDUCE:

Value Description
ReduceOp::Sum Element-wise sum
ReduceOp::Max Element-wise maximum
ReduceOp::Min Element-wise minimum

AtomicType

Atomic operation type for TPUT (defined in include/pto/common/constants.hpp):

Value Description
AtomicType::AtomicNone No atomic operation (default)
AtomicType::AtomicAdd Atomic add operation

DmaEngine

DMA backend selection for TPUT_ASYNC and TGET_ASYNC:

Value Description
DmaEngine::SDMA SDMA engine (supports 2D transfer)
DmaEngine::URMA URMA engine (supports 1D transfer, todo)

AsyncEvent

Returned by TPUT_ASYNC / TGET_ASYNC. Use to synchronize completion:

struct AsyncEvent {
    uint64_t handle;
    DmaEngine engine;

    bool valid() const;                        // true if handle != 0
    bool Wait(const AsyncSession &session) const; // block until transfer completes
    bool Test(const AsyncSession &session) const; // non-blocking completion check
};

AsyncSession

Engine-agnostic session for async DMA operations. Build once, pass to all async calls:

comm::AsyncSession session;
comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, workspace, session);

Defined in include/pto/comm/async/async_types.hpp. See TPUT_ASYNC for construction details and parameters.

ParallelGroup

Wrapper for collective communication across multiple NPUs:

template <typename GlobalData>
struct ParallelGroup {
    // Pointer to an array of `GlobalData` objects (each wraps a GM address).
    // The array itself is local metadata; the wrapped addresses may refer to local or remote GM,
    // depending on the collective instruction.
    GlobalData *tensors;
    int nranks;   // Number of ranks
    int rootIdx;  // Root NPU's rank index

    // Factory function (recommended): build from an existing tensor array.
    static ParallelGroup Create(GlobalData *tensorArray, int size, int rank_id);
};