TPUT_ASYNC

Introduction

TPUT_ASYNC is an asynchronous remote write primitive. It starts a transfer from local GM to remote GM and returns an AsyncEvent immediately.

Data flow:

srcGlobalData (local GM) -> DMA engine -> dstGlobalData (remote GM)

Template Parameter

  • engine:
    • DmaEngine::SDMA (default)
    • DmaEngine::URMA (todo)

Important (SDMA path)
TPUT_ASYNC with DmaEngine::SDMA currently supports only flat contiguous logical 1D tensors.
Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.

C++ Intrinsic

Declared in include/pto/comm/pto_comm_inst.hpp.

template <DmaEngine engine = DmaEngine::SDMA,
          typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
PTO_INST AsyncEvent TPUT_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
                               const AsyncSession &session, WaitEvents &... events);

AsyncSession is an engine-agnostic session object. Build once with BuildAsyncSession<engine>(), then pass to all async calls and event waits. The template engine parameter selects the DMA backend at compile time, making the code forward-compatible with future engines (URMA, CCU, etc.).

AsyncSession Construction

Use BuildAsyncSession from include/pto/comm/async/async_event_impl.hpp:

template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
                                    __gm__ uint8_t *workspace,
                                    AsyncSession &session,
                                    uint32_t syncId = 0,
                                    const sdma::SdmaBaseConfig &baseConfig = {32 * 1024, 0, 1},
                                    uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);

The engine template parameter selects the backend (currently only SDMA).

Parameters with defaults:

Parameter Default Description
syncId 0 MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id.
baseConfig {32*1024, 0, 1} {block_bytes, comm_block_offset, queue_num}. Suitable for most single-queue transfers.
channelGroupIdx kAutoChannelGroupIdx SDMA channel group index. Default uses get_block_idx() internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios.

Constraints

  • GlobalSrcData::RawDType == GlobalDstData::RawDType
  • GlobalSrcData::layout == GlobalDstData::layout
  • SDMA path requires source tensor to be flat contiguous logical 1D only
  • workspace must be a valid GM pointer allocated by host-side SdmaWorkspaceManager

If the 1D contiguous requirement is not met, current implementation returns an invalid async event (handle == 0).

scratchTile Role

scratchTile is not the payload staging buffer for user data.
It is converted to TmpBuffer and used as temporary UB workspace for:

  • writing/reading SDMA control words (flag, sq_tail, channel_info)
  • polling event completion flags
  • committing queue tail during completion

Data payload moves between GM buffers directly; scratchTile only supports control and synchronization metadata.

scratchTile Type and Size Constraints

  • must be a pto::Tile type
  • must be UB/Vec tile (ScratchTile::Loc == TileType::Vec)
  • available bytes must be at least sizeof(uint64_t) (8 bytes)

Recommended: Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE> (256B).

Completion Semantics

Use AsyncEvent to synchronize:

  • event.Wait(session) — blocks until the transfer is complete

After wait succeeds, writes to dstGlobalData are complete.

Example

#include <pto/comm/pto_comm_inst.hpp>
#include <pto/common/pto_tile.hpp>

using namespace pto;

template <typename T>
__global__ AICORE void SimplePut(__gm__ T *remoteDst, __gm__ T *localSrc,
                                 __gm__ uint8_t *sdmaWorkspace)
{
    using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
    using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
    using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;

    ShapeDyn shape(1, 1, 1, 1, 1024);
    StrideDyn stride(1024, 1024, 1024, 1024, 1);
    GT dstG(remoteDst, shape, stride);
    GT srcG(localSrc, shape, stride);

    ScratchTile scratchTile;
    TASSIGN(scratchTile, 0x0);

    comm::AsyncSession session;
    if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
        return;
    }

    auto event = comm::TPUT_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
    (void)event.Wait(session);
}