TGET_ASYNC¶
Introduction¶
TGET_ASYNC is an asynchronous remote read primitive. It starts a transfer from remote GM to local GM and returns an AsyncEvent immediately.
Data flow:
srcGlobalData (remote GM) -> DMA engine -> dstGlobalData (local GM)
Template Parameter¶
engine:DmaEngine::SDMA(default)DmaEngine::URMA(todo)
Important (SDMA path)
TGET_ASYNCwithDmaEngine::SDMAcurrently supports only flat contiguous logical 1D tensors.
Non-1D or non-contiguous layouts are not supported by the current SDMA async implementation.
C++ Intrinsic¶
Declared in include/pto/comm/pto_comm_inst.hpp.
template <DmaEngine engine = DmaEngine::SDMA,
typename GlobalDstData, typename GlobalSrcData, typename... WaitEvents>
PTO_INST AsyncEvent TGET_ASYNC(GlobalDstData &dstGlobalData, GlobalSrcData &srcGlobalData,
const AsyncSession &session, WaitEvents &... events);
AsyncSession is an engine-agnostic session object. Build once with
BuildAsyncSession<engine>(), then pass to all async calls and event waits.
The template engine parameter selects the DMA backend at compile time, making the
code forward-compatible with future engines (URMA, CCU, etc.).
AsyncSession Construction¶
Use BuildAsyncSession from include/pto/comm/async/async_event_impl.hpp:
template <DmaEngine engine = DmaEngine::SDMA, typename ScratchTile>
PTO_INTERNAL bool BuildAsyncSession(ScratchTile &scratchTile,
__gm__ uint8_t *workspace,
AsyncSession &session,
uint32_t syncId = 0,
const sdma::SdmaBaseConfig &baseConfig = {32 * 1024, 0, 1},
uint32_t channelGroupIdx = sdma::kAutoChannelGroupIdx);
The engine template parameter selects the backend (currently only SDMA).
Parameters with defaults:
| Parameter | Default | Description |
|---|---|---|
syncId |
0 |
MTE3/MTE2 pipe sync event id (0-7). Override if kernel uses other pipe barriers on the same id. |
baseConfig |
{32*1024, 0, 1} |
{block_bytes, comm_block_offset, queue_num}. Suitable for most single-queue transfers. |
channelGroupIdx |
kAutoChannelGroupIdx |
SDMA channel group index. Default uses get_block_idx() internally, mapping to current AI core. Override for multi-block or custom channel mapping scenarios. |
Constraints¶
GlobalSrcData::RawDType == GlobalDstData::RawDTypeGlobalSrcData::layout == GlobalDstData::layout- SDMA path requires source tensor to be flat contiguous logical 1D only
- workspace must be a valid GM pointer allocated by host-side
SdmaWorkspaceManager
If the 1D contiguous requirement is not met, current implementation returns an invalid async event (handle == 0).
scratchTile Role¶
scratchTile is not used to hold transferred payload data.
It is converted to TmpBuffer and used as temporary UB workspace for:
- writing/reading SDMA control words (flag, sq_tail, channel_info)
- polling event completion flags
- committing queue tail during completion
The real payload path remains remote GM -> DMA engine -> local GM; scratchTile is only for control/synchronization metadata.
scratchTile Type and Size Constraints¶
- must be a
pto::Tiletype - must be UB/Vec tile (
ScratchTile::Loc == TileType::Vec) - available bytes must be at least
sizeof(uint64_t)(8 bytes)
Recommended: Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE> (256B).
Completion Semantics¶
Use AsyncEvent to synchronize:
event.Wait(session)— blocks until the transfer is complete
After wait succeeds, reads into dstGlobalData are complete.
Example¶
#include <pto/comm/pto_comm_inst.hpp>
#include <pto/common/pto_tile.hpp>
using namespace pto;
template <typename T>
__global__ AICORE void SimpleGet(__gm__ T *localDst, __gm__ T *remoteSrc,
__gm__ uint8_t *sdmaWorkspace)
{
using ShapeDyn = Shape<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
using StrideDyn = Stride<DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC, DYNAMIC>;
using GT = GlobalTensor<T, ShapeDyn, StrideDyn, Layout::ND>;
using ScratchTile = Tile<TileType::Vec, uint8_t, 1, comm::sdma::UB_ALIGN_SIZE>;
ShapeDyn shape(1, 1, 1, 1, 1024);
StrideDyn stride(1024, 1024, 1024, 1024, 1);
GT dstG(localDst, shape, stride);
GT srcG(remoteSrc, shape, stride);
ScratchTile scratchTile;
TASSIGN(scratchTile, 0x0);
comm::AsyncSession session;
if (!comm::BuildAsyncSession<comm::DmaEngine::SDMA>(scratchTile, sdmaWorkspace, session)) {
return;
}
auto event = comm::TGET_ASYNC<comm::DmaEngine::SDMA>(dstG, srcG, session);
(void)event.Wait(session);
}