TGET_ASYNC¶

Introduction¶

TGET_ASYNC is an asynchronous remote read primitive. It starts a transfer from remote GM to local GM and returns an AsyncEvent immediately.

Data flow:

srcGlobalData (remote GM) -> DMA engine -> dstGlobalData (local GM)

Template Parameter¶

engine:
- DmaEngine::SDMA (default)
- DmaEngine::URMA (todo)

Important (SDMA path)
TGET_ASYNC with DmaEngine::SDMA currently supports only flat contiguous logical 1D tensors.
Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.

C++ Intrinsic¶

Declared in include/pto/comm/pto_comm_inst.hpp.

template <DmaEngine engine = DmaEngine::SDMA,
          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                               const AsyncSession &session, WaitEvents &... events);

AsyncSession is an engine-agnostic session object. Build once with BuildAsyncSession<engine>(), then pass to all async calls and event waits. The template engine parameter selects the DMA backend at compile time, making the code forward-compatible with future engines (URMA, CCU, etc.).

AsyncSession Construction¶

Use BuildAsyncSession from include/pto/comm/async/async_event_impl.hpp:

template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
                                    __gm__ uint8_t *workspace,
                                    AsyncSession &session,
                                    uint32_t syncId = 0,
                                    const sdma::SdmaBaseConfig &baseConfig = {32 * 1024, 0, 1},
                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);

The engine template parameter selects the backend (currently only SDMA).

Parameters with defaults:

Parameter	Default	Description
`syncId`	`0`	MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id.
`baseConfig`	`{32*1024, 0, 1}`	`{block_bytes, comm_block_offset, queue_num}`. Suitable for most single-queue transfers.
`channelGroupIdx`	`kAutoChannelGroupIdx`	SDMA channel group index. Default uses `get_block_idx()` internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios.

Constraints¶

GlobalSrcData::RawDType == GlobalDstData::RawDType
GlobalSrcData::layout == GlobalDstData::layout
SDMA path requires source tensor to be flat contiguous logical 1D only
workspace must be a valid GM pointer allocated by host-side SdmaWorkspaceManager

If the 1D contiguous requirement is not met, current implementation returns an invalid async event (handle == 0).

scratchTile Role¶

scratchTile is not used to hold transferred payload data.
It is converted to TmpBuffer and used as temporary UB workspace for:

writing/reading SDMA control words (flag, sq_tail, channel_info)
polling event completion flags
committing queue tail during completion

The real payload path remains remote GM -> DMA engine -> local GM; scratchTile is only for control/synchronization metadata.

scratchTile Type and Size Constraints¶

must be a pto::Tile type
must be UB/Vec tile (ScratchTile::Loc == TileType::Vec)
available bytes must be at least sizeof(uint64_t) (8 bytes)

Recommended: Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE> (256B).

Completion Semantics¶

Use AsyncEvent to synchronize:

event.Wait(session) — blocks until the transfer is complete

After wait succeeds, reads into dstGlobalData are complete.

Example¶

#include <pto/comm/pto_comm_inst.hpp>
#include <pto/common/pto_tile.hpp>

using namespace pto;

template <typename T>
__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
                                 __gm__ uint8_t *sdmaWorkspace)
{
    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;

    ShapeDyn shape(1, 1, 1, 1, 1024);
    StrideDyn stride(1024, 1024, 1024, 1024, 1);
    GT dstG(localDst, shape, stride);
    GT srcG(remoteSrc, shape, stride);

    ScratchTile scratchTile;
    TASSIGN(scratchTile, 0x0);

    comm::AsyncSession session;
    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
        return;
    }

    auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
    (void)event.Wait(session);
}