TGET¶
Introduction¶
Remote read operation: reads data from a remote NPU into local memory. Data is transferred through a UB (Unified Buffer) tile that serves as an intermediate staging buffer.
When the GlobalTensor exceeds the UB tile capacity, TGET automatically performs 2D sliding: it chunks rows (DIM_3) and columns (DIM_4) so that each chunk fits into the tile, and iterates over all outer dimensions (DIM_0, DIM_1, DIM_2).
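The chunk arithmetic behind 2D sliding can be sketched on the host as follows. This is only an illustration: `Chunk`, `ceilDiv`, `chunksPerPlane`, `chunkAt`, and `totalChunks` are hypothetical helpers, not part of the PTO API, and the row-major chunk order is an assumption, since the order in which TGET visits chunks is not stated here.

```cpp
#include <cstddef>

// Hypothetical helper type for illustration; not part of the PTO API.
struct Chunk {
    std::size_t rowStart, colStart;  // top-left corner inside the DIM_3 x DIM_4 plane
    std::size_t rows, cols;          // extent of this chunk (edge chunks may be smaller)
};

constexpr std::size_t ceilDiv(std::size_t a, std::size_t b) { return (a + b - 1) / b; }

// Number of chunks needed to cover one rows x cols plane with a tileRows x tileCols tile.
constexpr std::size_t chunksPerPlane(std::size_t rows, std::size_t cols,
                                     std::size_t tileRows, std::size_t tileCols) {
    return ceilDiv(rows, tileRows) * ceilDiv(cols, tileCols);
}

// The k-th chunk of a plane, assuming row-major chunk order (an assumption).
constexpr Chunk chunkAt(std::size_t k, std::size_t rows, std::size_t cols,
                        std::size_t tileRows, std::size_t tileCols) {
    return Chunk{
        (k / ceilDiv(cols, tileCols)) * tileRows,
        (k % ceilDiv(cols, tileCols)) * tileCols,
        (rows - (k / ceilDiv(cols, tileCols)) * tileRows < tileRows)
            ? rows - (k / ceilDiv(cols, tileCols)) * tileRows
            : tileRows,
        (cols - (k % ceilDiv(cols, tileCols)) * tileCols < tileCols)
            ? cols - (k % ceilDiv(cols, tileCols)) * tileCols
            : tileCols};
}

// Total number of staged transfers across the outer dimensions DIM_0..DIM_2.
constexpr std::size_t totalChunks(std::size_t d0, std::size_t d1, std::size_t d2,
                                  std::size_t rows, std::size_t cols,
                                  std::size_t tileRows, std::size_t tileCols) {
    return d0 * d1 * d2 * chunksPerPlane(rows, cols, tileRows, tileCols);
}
```

Under these assumptions, a 4096 x 4096 plane staged through a 16 x 16 tile needs 256 x 256 = 65536 chunks, while a 40 x 40 plane needs 9 chunks, the last of which is only 8 x 8.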
Math Interpretation¶
For each element (i, j) in the valid region:
$$ \mathrm{dst}^{\mathrm{local}}_{i,j} = \mathrm{src}^{\mathrm{remote}}_{i,j} $$
Data flow: srcGlobalData (remote GM) → stagingTileData (UB) → dstGlobalData (local GM)
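The element-wise identity and the two-hop data flow can be modeled on the host with plain arrays. The sketch below is not the device implementation: `stagedCopy`, `makeRamp`, and `sameElements` are hypothetical names, the arrays merely stand in for remote GM, the UB tile, and local GM, and C++17 is assumed for `constexpr` writes to `std::array`.

```cpp
#include <array>
#include <cstddef>

// Host-side model only: srcRemote stands in for remote GM, staging for the UB tile,
// and dstLocal for local GM. All names here are hypothetical.
template <typename T, std::size_t Rows, std::size_t Cols,
          std::size_t TileRows, std::size_t TileCols>
constexpr std::array<T, Rows * Cols> stagedCopy(const std::array<T, Rows * Cols>& srcRemote) {
    std::array<T, Rows * Cols> dstLocal{};
    std::array<T, TileRows * TileCols> staging{};
    for (std::size_t r0 = 0; r0 < Rows; r0 += TileRows) {
        for (std::size_t c0 = 0; c0 < Cols; c0 += TileCols) {
            // Edge chunks are clipped to the valid region.
            const std::size_t h = (Rows - r0 < TileRows) ? Rows - r0 : TileRows;
            const std::size_t w = (Cols - c0 < TileCols) ? Cols - c0 : TileCols;
            // Hop 1 (models the load): remote GM -> staging tile.
            for (std::size_t i = 0; i < h; ++i)
                for (std::size_t j = 0; j < w; ++j)
                    staging[i * TileCols + j] = srcRemote[(r0 + i) * Cols + (c0 + j)];
            // Hop 2 (models the store): staging tile -> local GM.
            for (std::size_t i = 0; i < h; ++i)
                for (std::size_t j = 0; j < w; ++j)
                    dstLocal[(r0 + i) * Cols + (c0 + j)] = staging[i * TileCols + j];
        }
    }
    return dstLocal;
}

// Test data: a[i] = i.
template <typename T, std::size_t N>
constexpr std::array<T, N> makeRamp() {
    std::array<T, N> a{};
    for (std::size_t i = 0; i < N; ++i) a[i] = static_cast<T>(i);
    return a;
}

// Element-wise comparison (std::array::operator== is not constexpr before C++20).
template <typename T, std::size_t N>
constexpr bool sameElements(const std::array<T, N>& a, const std::array<T, N>& b) {
    for (std::size_t i = 0; i < N; ++i)
        if (!(a[i] == b[i])) return false;
    return true;
}
```

In this model the destination always equals the source, even when the tile does not divide the tensor evenly, because every chunk passes through the staging buffer unchanged.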
Assembly Syntax¶
PTO-AS form: see PTO-AS Specification.
Synchronous form:
tget %dst_local, %src_remote : (!pto.memref<...>, !pto.memref<...>)
Lowering introduces UB staging tile(s) for the GM→UB→GM data path; the C++ intrinsic requires explicit stagingTileData (or pingTile / pongTile) operand(s).
C++ Intrinsic¶
Declared in include/pto/comm/pto_comm_inst.hpp
Single-tile (auto-chunking)¶
template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                          TileData &stagingTileData, WaitEvents&... events);
Ping-pong double buffering¶
Uses two staging tiles to overlap TLOAD and TSTORE for adjacent chunks, hiding one DMA transfer behind the other.
template <typename GlobalDstData, typename GlobalSrcData, typename TileData, typename... WaitEvents>
PTO_INST RecordEvent TGET(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                          TileData &pingTile, TileData &pongTile, WaitEvents&... events);
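Why two staging tiles help can be seen from an idealized schedule. The sketch below assumes every load and every store takes one time step, and that chunks alternate between the two tiles starting with pingTile. Both are assumptions made for illustration; the actual event synchronization inside TGET is not shown here, and all helper names are hypothetical.

```cpp
#include <cstddef>

// Idealized model: each load and each store takes exactly one time step.
// Slot 0 stands for pingTile, slot 1 for pongTile (assumed alternation).
constexpr std::size_t stagingSlot(std::size_t chunk) { return chunk % 2; }

// Single staging tile: the load and store of each chunk run back to back.
constexpr std::size_t singleTileSteps(std::size_t numChunks) { return 2 * numChunks; }

// Two staging tiles: the load of chunk i+1 shares a step with the store of chunk i,
// so only the first load and the last store are not overlapped.
constexpr std::size_t pingPongSteps(std::size_t numChunks) {
    return numChunks == 0 ? 0 : numChunks + 1;
}

// At step t the model loads chunk t while storing chunk t-1. The overlap is only
// safe if those two operations never use the same staging slot.
constexpr bool overlapIsConflictFree(std::size_t numChunks) {
    for (std::size_t t = 1; t < numChunks; ++t) {
        if (stagingSlot(t) == stagingSlot(t - 1)) return false;
    }
    return true;
}
```

In this model, four chunks take 8 steps with one tile and 5 steps with two, which is the "hiding one DMA transfer behind the other" effect in its simplest form.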
Constraints¶
- Type constraints:
  - GlobalSrcData::RawDType must equal GlobalDstData::RawDType.
  - TileData::DType must equal GlobalSrcData::RawDType.
  - GlobalSrcData::layout must equal GlobalDstData::layout.
- Memory constraints:
  - srcGlobalData must point to a remote address (on the source NPU).
  - dstGlobalData must point to a local address (on the current NPU).
  - stagingTileData / pingTile / pongTile must be pre-allocated in the Unified Buffer.
- Valid region:
  - Transfer size is determined by the GlobalTensor shape (auto-chunked to fit the tile).
- Ping-pong:
  - pingTile and pongTile must have the same type and dimensions.
  - pingTile and pongTile must reside at non-overlapping UB offsets.
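The type constraints above could be expressed as a compile-time predicate along the following lines. This is a sketch using hypothetical stand-in types (`MockLayout`, `MockGlobal`, `MockTile`) that expose only the member names mentioned in the constraints. It assumes `layout` is a static constant that can be compared with `==`, which is not confirmed here, and it does not reproduce the checks the library itself performs.

```cpp
#include <type_traits>

// Hypothetical stand-ins exposing only the members named in the constraints.
enum class MockLayout { ND, NZ };

template <typename T, MockLayout L>
struct MockGlobal {
    using RawDType = T;
    static constexpr MockLayout layout = L;  // assumed to be a comparable constant
};

template <typename T>
struct MockTile {
    using DType = T;
};

// True when all three type constraints hold for the given argument types.
template <typename GlobalDst, typename GlobalSrc, typename TileData>
constexpr bool satisfiesTgetTypeConstraints() {
    return std::is_same<typename GlobalSrc::RawDType, typename GlobalDst::RawDType>::value &&
           std::is_same<typename TileData::DType, typename GlobalSrc::RawDType>::value &&
           GlobalSrc::layout == GlobalDst::layout;
}
```

A wrapper could place such a predicate in a `static_assert` so that a mismatched element type or layout is reported at the call site rather than deep inside the instruction's implementation.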
Examples¶
Basic Usage¶
#include <pto/comm/pto_comm_inst.hpp>
#include <pto/pto-inst.hpp>
using namespace pto;
template <typename T>
void example_tget(__gm__ T* local_data, __gm__ T* remote_addr) {
    using TileT = Tile<TileType::Vec, T, 16, 16>;
    using GShape = Shape<1, 1, 1, 16, 16>;
    using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
    /*
    If the GlobalTensor is larger than the UB tile, TGET performs 2D sliding automatically, for example:
    using GShape = Shape<1, 1, 1, 4096, 4096>;
    using GStride = BaseShape2D<T, 4096, 4096, Layout::ND>;
    */
    using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;

    GTensor srcG(remote_addr);  // remote source (on the source NPU)
    GTensor dstG(local_data);   // local destination (on the current NPU)

    TileT stagingTile;
    TASSIGN(stagingTile, 0);    // place the staging tile at UB offset 0

    // Basic remote read
    comm::TGET(dstG, srcG, stagingTile);
}
Ping-pong Double Buffering¶
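// Fragment: dstG, srcG, and TileT are assumed to be defined as in Basic Usage,
// with TileT sized for 64 x 64 float elements.
// Tile footprint in UB, rounded up to a multiple of 1024 bytes.
constexpr size_t tileUBBytes = ((64 * 64 * sizeof(float) + 1023) / 1024) * 1024;
TileT pingTile(64, 64);
TileT pongTile(64, 64);
TASSIGN(pingTile, 0);
TASSIGN(pongTile, tileUBBytes); // Non-overlapping UB region
// Overlaps TLOAD[i+1] with TSTORE[i] for better pipeline utilization
comm::TGET(dstG, srcG, pingTile, pongTile);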
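The offset arithmetic in the fragment above can be checked on the host. The helpers `alignUp` and `regionsOverlap` below are hypothetical, not PTO functions. The 1024-byte granule is taken from the rounding used in the example, and the reason for that particular value is not stated here.

```cpp
#include <cstddef>

// Round bytes up to the next multiple of alignment (alignment must be non-zero).
constexpr std::size_t alignUp(std::size_t bytes, std::size_t alignment) {
    return ((bytes + alignment - 1) / alignment) * alignment;
}

// True when the half-open byte ranges [offA, offA + sizeA) and [offB, offB + sizeB) intersect.
constexpr bool regionsOverlap(std::size_t offA, std::size_t sizeA,
                              std::size_t offB, std::size_t sizeB) {
    return offA < offB + sizeB && offB < offA + sizeA;
}

// Values mirroring the example: two 64 x 64 float tiles placed back to back.
constexpr std::size_t kTileBytes  = 64 * 64 * sizeof(float);
constexpr std::size_t kPingOffset = 0;
constexpr std::size_t kPongOffset = alignUp(kTileBytes, 1024);
```

Because 64 x 64 elements already occupy a multiple of 1024 bytes, the rounding leaves the size unchanged and the pong tile starts exactly where the ping tile ends. The rounding only matters for tile shapes whose byte size is not a multiple of the granule.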