TPUT

Introduction

Remote write operation: writes local data to a remote NPU's memory. Data is transferred through a UB (Unified Buffer) tile that serves as an intermediate staging buffer.

When the GlobalTensor exceeds the UB tile capacity, TPUT automatically performs 2D sliding: it chunks rows (DIM_3) and columns (DIM_4) so that each chunk fits into the tile, and iterates over all outer dimensions (DIM_0, DIM_1, DIM_2).
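
The 2D sliding described above can be pictured with a small host-side sketch. This is not the PTO implementation; `Chunk` and `planChunks` are hypothetical names, and the row-major visiting order is an assumption. It only shows how a DIM_3 x DIM_4 plane could be covered by tile-sized chunks, with smaller chunks at the edges. The same plan would repeat for every (DIM_0, DIM_1, DIM_2) index.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical description of one chunk of the rows x cols plane.
struct Chunk {
    std::size_t rowStart, colStart; // origin inside the DIM_3 x DIM_4 plane
    std::size_t rows, cols;         // extent; edge chunks may be smaller than the tile
};

// Enumerate the chunks needed to cover a rows x cols plane with a
// tileRows x tileCols tile (assumed row-major visiting order).
std::vector<Chunk> planChunks(std::size_t rows, std::size_t cols,
                              std::size_t tileRows, std::size_t tileCols) {
    assert(tileRows > 0 && tileCols > 0);
    std::vector<Chunk> plan;
    for (std::size_t r = 0; r < rows; r += tileRows) {
        for (std::size_t c = 0; c < cols; c += tileCols) {
            plan.push_back({r, c,
                            std::min(tileRows, rows - r),
                            std::min(tileCols, cols - c)});
        }
    }
    return plan;
}
```

Under this model, a 16 x 16 tensor with a 16 x 16 tile needs a single chunk, while a 4096 x 4096 tensor with the same tile needs 256 x 256 chunks.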

Math Interpretation

For each element (i, j) in the valid region:

\[ \mathrm{dst}^{\mathrm{remote}}_{i,j} = \mathrm{src}^{\mathrm{local}}_{i,j} \]

Data flow: srcGlobalData (local GM) → stagingTileData (UB) → dstGlobalData (remote GM)
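
As a rough illustration of this two-hop data flow, the sketch below models it on the host with plain vectors. It is a sketch under stated assumptions, not PTO code: `stagedCopy` is a hypothetical helper, `src` and `dst` stand in for local and remote GM, and `staging` stands in for the UB tile. After the call, every element of `dst` equals the corresponding element of `src`, matching the formula above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side model: copy a rows x cols row-major matrix from src to dst through
// a staging buffer that holds at most tileRows x tileCols elements.
template <typename T>
void stagedCopy(const std::vector<T>& src, std::vector<T>& dst,
                std::size_t rows, std::size_t cols,
                std::size_t tileRows, std::size_t tileCols) {
    assert(src.size() == rows * cols && dst.size() == rows * cols);
    assert(tileRows > 0 && tileCols > 0);
    std::vector<T> staging(tileRows * tileCols); // stands in for the UB tile
    for (std::size_t r0 = 0; r0 < rows; r0 += tileRows) {
        const std::size_t h = std::min(tileRows, rows - r0);
        for (std::size_t c0 = 0; c0 < cols; c0 += tileCols) {
            const std::size_t w = std::min(tileCols, cols - c0);
            // Hop 1: "local GM" -> staging buffer
            for (std::size_t i = 0; i < h; ++i)
                for (std::size_t j = 0; j < w; ++j)
                    staging[i * tileCols + j] = src[(r0 + i) * cols + (c0 + j)];
            // Hop 2: staging buffer -> "remote GM"
            for (std::size_t i = 0; i < h; ++i)
                for (std::size_t j = 0; j < w; ++j)
                    dst[(r0 + i) * cols + (c0 + j)] = staging[i * tileCols + j];
        }
    }
}
```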

Assembly Syntax

PTO-AS form: see PTO-AS Specification.

Synchronous form:

tput %dst_remote, %src_local : (!pto.memref<...>, !pto.memref<...>)

Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit stagingTileData (or pingTile / pongTile) operand(s).

C++ Intrinsic

Declared in include/pto/comm/pto_comm_inst.hpp

Single-tile (auto-chunking)

template <AtomicType atomicType = AtomicType::AtomicNone,
          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                          TileData &stagingTileData, WaitEvents&... events);

Ping-pong double buffering

Uses two staging tiles to overlap TLOAD and TSTORE for adjacent chunks, hiding one DMA transfer behind the other.

template <AtomicType atomicType = AtomicType::AtomicNone,
          typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
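
To see why two staging tiles help, the following sketch builds an idealized schedule. It assumes that one load and one store each take one time step and that chunks alternate between the two tiles; both are simplifications, and `Step` and `pingPongSchedule` are hypothetical names rather than part of PTO.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One time step of the idealized pipeline; -1 means the unit is idle.
struct Step {
    long load;  // index of the chunk being loaded into a staging tile
    long store; // index of the chunk being stored from the other staging tile
};

// Chunk i is assumed to use tile (i % 2): 0 = ping, 1 = pong.
// At step t, chunk t is loaded while chunk t - 1 is stored, so the two
// transfers always touch different tiles.
std::vector<Step> pingPongSchedule(std::size_t chunks) {
    std::vector<Step> steps;
    if (chunks == 0) return steps;
    for (std::size_t t = 0; t <= chunks; ++t) {
        Step s;
        s.load = (t < chunks) ? static_cast<long>(t) : -1;
        s.store = (t > 0) ? static_cast<long>(t - 1) : -1;
        steps.push_back(s);
    }
    return steps;
}
```

In this model, N chunks finish in N + 1 steps instead of the 2N steps a single staging tile would need, because every load except the first is hidden behind the previous chunk's store.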

Runtime atomic type

template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
PTO_INST RecordEvent TPUT(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                          TileData &stagingTileData, AtomicType atomicType, WaitEvents&... events);
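
The relationship between the compile-time and runtime forms can be sketched on the host as follows. This is an illustrative model only: `AtomicKind` and `putModel` are hypothetical stand-ins, the code is single-threaded (so it shows the accumulate-versus-overwrite result, not atomicity), and it does not claim to mirror how PTO implements the runtime overload.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class AtomicKind { None, Add }; // hypothetical stand-in for AtomicType

// Compile-time form: the mode is a template parameter.
template <AtomicKind kind, typename T>
void putModel(std::vector<T>& dst, const std::vector<T>& src) {
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (kind == AtomicKind::Add) {
            dst[i] += src[i]; // accumulate into the destination
        } else {
            dst[i] = src[i];  // plain overwrite
        }
    }
}

// Runtime form: the mode is an ordinary argument, dispatched to the
// compile-time form.
template <typename T>
void putModel(std::vector<T>& dst, const std::vector<T>& src, AtomicKind kind) {
    switch (kind) {
        case AtomicKind::Add:
            putModel<AtomicKind::Add>(dst, src);
            break;
        case AtomicKind::None:
            putModel<AtomicKind::None>(dst, src);
            break;
    }
}
```

The runtime form is convenient when the mode is only known from a configuration value; the template form lets the compiler drop the unused branch.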

Constraints

  • Type constraints:
    • GlobalSrcData::RawDType must equal GlobalDstData::RawDType.
    • TileData::DType must equal GlobalSrcData::RawDType.
    • GlobalSrcData::layout must equal GlobalDstData::layout.
  • Memory constraints:
    • dstGlobalData must point to a remote address (on the target NPU).
    • srcGlobalData must point to a local address (on the current NPU).
    • stagingTileData / pingTile / pongTile must be pre-allocated in the Unified Buffer.
  • Valid region:
    • Transfer size is determined by the GlobalTensor shape (auto-chunked to fit the tile).
  • Atomic operation:
    • atomicType supports AtomicNone and AtomicAdd.
  • Ping-pong:
    • pingTile and pongTile must have the same type and dimensions.
    • pingTile and pongTile must reside at non-overlapping UB offsets.
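
The type and ping-pong constraints above can be expressed as small checks. The sketch below uses mock types (`MockGlobal`, `MockTile`, `MockLayout`) that only imitate the member names listed in the constraints (`RawDType`, `DType`, `layout`); they are assumptions for illustration, not the real PTO classes, and `tputTypesOk` and `ubRangesDisjoint` are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

enum class MockLayout { ND, Other }; // hypothetical layout tags

template <typename T, MockLayout L>
struct MockGlobal {
    using RawDType = T;
    static constexpr MockLayout layout = L;
};

template <typename T>
struct MockTile {
    using DType = T;
};

// True when the three type constraints hold: matching element types for
// source, destination, and tile, and matching layouts for source and destination.
template <typename Dst, typename Src, typename TileT>
constexpr bool tputTypesOk() {
    return std::is_same<typename Src::RawDType, typename Dst::RawDType>::value &&
           std::is_same<typename TileT::DType, typename Src::RawDType>::value &&
           Src::layout == Dst::layout;
}

// True when the byte ranges [pingOff, pingOff + tileBytes) and
// [pongOff, pongOff + tileBytes) do not overlap.
constexpr bool ubRangesDisjoint(std::size_t pingOff, std::size_t pongOff,
                                std::size_t tileBytes) {
    return pingOff + tileBytes <= pongOff || pongOff + tileBytes <= pingOff;
}
```

A mismatch such as a `float` source with an `int` tile, or two 16384-byte tiles placed 8192 bytes apart, fails these checks.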

Examples

Basic Usage

#include <pto/comm/pto_comm_inst.hpp>
#include <pto/pto-inst.hpp>

using namespace pto;

template <typename T>
void example_tput(__gm__ T* local_data, __gm__ T* remote_addr) {
    using TileT = Tile<TileType::Vec, T, 16, 16>;
    using GShape = Shape<1, 1, 1, 16, 16>;
    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
    /* 
    If the GlobalTensor is larger than the UB tile, TPUT performs 2D sliding automatically, for example:
    using GShape = Shape<1, 1, 1, 4096, 4096>;
    using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
    */
    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;

    GTensor srcG(local_data);
    GTensor dstG(remote_addr);
    TileT stagingTile;
    TASSIGN(stagingTile, 0);

    // Basic remote write
    comm::TPUT(dstG, srcG, stagingTile);

    // Remote write with atomic add
    comm::TPUT<AtomicType::AtomicAdd>(dstG, srcG, stagingTile);
}

Ping-pong Double Buffering

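// TileT, srcG, and dstG are assumed to be declared beforehand; TileT must be able to hold a 64 x 64 float tile.
// Tile footprint in UB, rounded up to a multiple of 1024 bytes.
constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;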
TileT pingTile(64, 64);
TileT pongTile(64, 64);
TASSIGN(pingTile, 0);
TASSIGN(pongTile, tileUBBytes);  // Non-overlapping UB region

// Overlaps TLOAD[i+1] with TSTORE[i] for better pipeline utilization
comm::TPUT(dstG, srcG, pingTile, pongTile);
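
The `tileUBBytes` expression above rounds the tile footprint up to a multiple of 1024 bytes. A minimal sketch of that arithmetic follows; `alignUp` and `pongOffsetAfter` are hypothetical helper names, and the 1024-byte granularity is taken from the example rather than from a stated hardware requirement.

```cpp
#include <cassert>
#include <cstddef>

// Round bytes up to the next multiple of alignment (alignment must be > 0).
constexpr std::size_t alignUp(std::size_t bytes, std::size_t alignment) {
    return ((bytes + alignment - 1) / alignment) * alignment;
}

// A 64 x 64 tile of 4-byte elements is already a multiple of 1024 bytes.
static_assert(alignUp(64 * 64 * 4, 1024) == 16384, "64x64x4 bytes is 16 KiB");

// First aligned UB offset at or after the end of a tile that starts at pingOffset.
inline std::size_t pongOffsetAfter(std::size_t pingOffset, std::size_t tileBytes,
                                   std::size_t alignment) {
    assert(alignment > 0);
    return alignUp(pingOffset + tileBytes, alignment);
}
```

With `pingOffset = 0` and a 16384-byte tile, the second tile lands at offset 16384, which matches the `TASSIGN(pongTile, tileUBBytes)` call in the example.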

Runtime Atomic Type

// Select atomic type at runtime instead of compile-time template parameter
comm::TPUT(dstG, srcG, stagingTile, AtomicType::AtomicAdd);