TGET_ASYNC

Introduction

TGET_ASYNC is an asynchronous remote read primitive. It starts a transfer from remote GM to local GM and returns an AsyncEvent immediately.

Data flow:

srcGlobalData (remote GM) -> DMA engine -> dstGlobalData (local GM)

Template Parameter

  • engine:
    • DmaEngine::SDMA (default)
    • DmaEngine::URMA (todo)

Important (SDMA path)
TGET_ASYNC with DmaEngine::SDMA currently supports only flat contiguous logical 1D tensors.
Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.

C++ Intrinsic

Declared in include/pto/comm/pto_comm_inst.hpp.

template <DmaEngine engine = DmaEngine::SDMA,
          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                               const AsyncSession &session, WaitEvents &... events);

AsyncSession is an engine-agnostic session object. Build once with BuildAsyncSession<engine>(), then pass to all async calls and event waits. The template engine parameter selects the DMA backend at compile time, making the code forward-compatible with future engines (URMA, CCU, etc.).

AsyncSession Construction

Use BuildAsyncSession from include/pto/comm/async/async_event_impl.hpp:

template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
                                    __gm__ uint8_t *workspace,
                                    AsyncSession &session,
                                    uint32_t syncId = 0,
                                    const sdma::SdmaBaseConfig &baseConfig = {32 * 1024, 0, 1},
                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);

The engine template parameter selects the backend (currently only SDMA).

Parameters with defaults:

Parameter Default Description
syncId 0 MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id.
baseConfig {32*1024, 0, 1} {block_bytes, comm_block_offset, queue_num}. Suitable for most single-queue transfers.
channelGroupIdx kAutoChannelGroupIdx SDMA channel group index. Default uses get_block_idx() internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios.

Constraints

  • GlobalSrcData::RawDType == GlobalDstData::RawDType
  • GlobalSrcData::layout == GlobalDstData::layout
  • SDMA path requires source tensor to be flat contiguous logical 1D only
  • workspace must be a valid GM pointer allocated by host-side SdmaWorkspaceManager

If the 1D contiguous requirement is not met, current implementation returns an invalid async event (handle == 0).

scratchTile Role

scratchTile is not used to hold transferred payload data.
It is converted to TmpBuffer and used as temporary UB workspace for:

  • writing/reading SDMA control words (flag, sq_tail, channel_info)
  • polling event completion flags
  • committing queue tail during completion

The real payload path remains remote GM -> DMA engine -> local GM; scratchTile is only for control/synchronization metadata.

scratchTile Type and Size Constraints

  • must be a pto::Tile type
  • must be UB/Vec tile (ScratchTile::Loc == TileType::Vec)
  • available bytes must be at least sizeof(uint64_t) (8 bytes)

Recommended: Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE> (256B).

Completion Semantics

Use AsyncEvent to synchronize:

  • event.Wait(session) — blocks until the transfer is complete

After wait succeeds, reads into dstGlobalData are complete.

Example

#include <pto/comm/pto_comm_inst.hpp>
#include <pto/common/pto_tile.hpp>

using namespace pto;

template <typename T>
__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
                                 __gm__ uint8_t *sdmaWorkspace)
{
    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;

    ShapeDyn shape(1, 1, 1, 1, 1024);
    StrideDyn stride(1024, 1024, 1024, 1024, 1);
    GT dstG(localDst, shape, stride);
    GT srcG(remoteSrc, shape, stride);

    ScratchTile scratchTile;
    TASSIGN(scratchTile, 0x0);

    comm::AsyncSession session;
    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
        return;
    }

    auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
    (void)event.Wait(session);
}