TALLOC / TPUSH / TPOP / TFREE¶
Introduction¶
These interfaces provide an address-based GM FIFO protocol for producer/consumer communication through TPipe.
Unlike the tile-based TPUSH(pipe, tile) and TPOP(pipe, tile) forms, these APIs only manage FIFO synchronization and assign the GM slot address into a GlobalTensor object. The caller is responsible for moving data with ordinary TSTORE and TLOAD.
This is useful when a producer or consumer needs to access a FIFO slot as one or more GlobalTensor views, for example when a cube core stores several subtiles into one FIFO slot and vector cores later load slices from that slot.
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp:
template <typename Pipe, typename GlobalData, TileSplitAxis Split,
std::enable_if_t<is_global_data_v<GlobalData>, int> = 0, typename... WaitEvents>
PTO_INST RecordEvent TALLOC(Pipe &pipe, GlobalData &gmTensor, WaitEvents &... events);
template <typename Pipe, typename GlobalData, TileSplitAxis Split,
std::enable_if_t<is_global_data_v<GlobalData>, int> = 0, typename... WaitEvents>
PTO_INST RecordEvent TPUSH(Pipe &pipe, GlobalData &gmTensor, WaitEvents &... events);
template <typename Pipe, typename GlobalData, TileSplitAxis Split,
std::enable_if_t<is_global_data_v<GlobalData>, int> = 0, typename... WaitEvents>
PTO_INST RecordEvent TPOP(Pipe &pipe, GlobalData &gmTensor, WaitEvents &... events);
template <typename Pipe, typename GlobalData, TileSplitAxis Split,
std::enable_if_t<is_global_data_v<GlobalData>, int> = 0, typename... WaitEvents>
PTO_INST RecordEvent TFREE(Pipe &pipe, GlobalData &gmTensor, WaitEvents &... events);
Parameters¶
Pipe: aTPipe<FlagID, Direction, SlotSize, SlotNum, ...>type. The pipe owns FIFO state, slot base address, producer/consumer counters, and cross-core flags.GlobalData: aGlobalTensortype used for address calculation. ItsRawDTypeand static shape determine byte offsets for split modes.gmTensor: a runtimeGlobalTensorobject.TALLOCandTPOPassign the computed FIFO GM address into this object, whileTPUSHandTFREEuse it as the transaction descriptor.Split: one ofTileSplitAxis::TILE_NO_SPLIT,TileSplitAxis::TILE_UP_DOWN, orTileSplitAxis::TILE_LEFT_RIGHT.events: optional PTO events to synchronize before the interface body executes.
Producer Flow¶
TALLOC¶
TALLOC<Pipe, GlobalData, Split>(pipe, gmTensor) begins a producer-side FIFO transaction and assigns the GM address for the current FIFO slot into gmTensor.
It performs:
TSYNC(events...)- Producer-side allocation/wait for a free FIFO slot
- FIFO slot address calculation
- Producer tile index increment
gmTensoraddress assignment
It does not write data and does not notify the consumer. After TALLOC, the caller should use TSTORE or other GM-writing logic to populate the slot described by gmTensor.
TPUSH¶
TPUSH<Pipe, GlobalData, Split>(pipe, gmTensor) commits the producer transaction.
It performs:
TSYNC(events...)- Producer-side record/notify so the consumer can observe the data-ready flag
Call TPUSH only after all writes to the FIFO slot are complete.
Consumer Flow¶
TPOP¶
TPOP<Pipe, GlobalData, Split>(pipe, gmTensor) begins a consumer-side FIFO transaction and assigns the GM address for the current FIFO slot or subslot into gmTensor.
It performs:
TSYNC(events...)- Consumer-side wait for producer data-ready
- FIFO slot address calculation
- Consumer tile index increment
gmTensoraddress assignment
It does not load data and does not free the slot. After TPOP, the caller can use gmTensor directly or create additional GlobalTensor views from gmTensor.data() and use TLOAD to read data.
TFREE¶
TFREE<Pipe, GlobalData, Split>(pipe, gmTensor) completes the consumer transaction.
It performs:
TSYNC(events...)- Consumer-side free/notify so the producer can reuse the FIFO slot
Call TFREE only after all reads from the FIFO slot are complete.
Split Address Calculation¶
For DIR_C2V, vector consumers receive the same FIFO slot base address and can build their own GlobalTensor views from it. The GlobalData template argument used by TPOP normally matches the producer slot descriptor, so TPOP assigns the slot address into gmTensor. Consumers can then add explicit element offsets from gmTensor.data() before constructing smaller load views.
TILE_NO_SPLIT: no sub-core offset is added.TILE_UP_DOWN: sub-core offset isget_subblockid() * rows * cols * sizeof(dtype).TILE_LEFT_RIGHT: sub-core offset isget_subblockid() * cols * sizeof(dtype).
For DIR_V2C, producer-side split offsets are computed similarly when vector cores write separate subregions for a cube consumer.
GlobalData::staticShape[DIM_3], GlobalData::staticShape[DIM_4], and GlobalData::RawDType are used for split offset calculations. When the consumer wants to manually load subtiles from a full FIFO slot, use the same full-slot GlobalData type for TALLOC and TPOP, then construct narrower GlobalTensor load views from gmTensor.data().
Example¶
constexpr int M = 128;
constexpr int N = 128;
constexpr int RepeatN = 4;
constexpr int FullN = N * RepeatN;
constexpr int VecCores = 2;
constexpr int VecM = 16;
constexpr int VecLoadTimes = M / (VecCores * VecM);
using Pipe = TPipe<0, Direction::DIR_C2V, M * FullN * sizeof(float), 2>;
using SlotGlobal = GlobalTensor<float, Shape<1, 1, 1, M, FullN>, Stride<1, 1, 1, FullN, 1>>;
using StoreGlobal = GlobalTensor<float, Shape<1, 1, 1, M, N>, Stride<1, 1, 1, FullN, 1>>;
// TPOP uses this descriptor for split offset calculation, so each vector receives one [64, 512] half.
using PopGlobal = GlobalTensor<float, Shape<1, 1, 1, M / VecCores, FullN>, Stride<1, 1, 1, FullN, 1>>;
// One 3D TLOAD gathers four [16, 128] blocks into one logical [16, 512] vector tile.
using VecTileData = Tile<TileType::Vec, float, RepeatN * VecM, N, BLayout::RowMajor, RepeatN * VecM, N>;
using LoadGlobal3D = GlobalTensor<float, Shape<1, 1, RepeatN, VecM, N>, Stride<1, 1, N, FullN, 1>>;
using OutGlobal3D = GlobalTensor<float, Shape<1, 1, RepeatN, VecM, N>, Stride<1, 1, N, FullN, 1>>;
Pipe pipe(fifoMem, 0x0, 0x0);
// Cube producer: fill one [128, 512] FIFO slot with four [128, 128] stores.
SlotGlobal pushGlobal;
TALLOC<Pipe, SlotGlobal, TileSplitAxis::TILE_UP_DOWN>(pipe, pushGlobal);
for (int nTile = 0; nTile < RepeatN; ++nTile) {
StoreGlobal storeGlobal(pushGlobal.data() + nTile * N);
TSTORE(storeGlobal, accTile);
}
TPUSH<Pipe, SlotGlobal, TileSplitAxis::TILE_UP_DOWN>(pipe, pushGlobal);
// Vector consumer: each vector consumes four [16, 512] row slices from its [64, 512] half.
VecTileData vecTile;
VecTileData dstTile;
TASSIGN(vecTile, 0x0);
TASSIGN(dstTile, 0x10000);
PopGlobal popGlobal;
TPOP<Pipe, PopGlobal, TileSplitAxis::TILE_UP_DOWN>(pipe, popGlobal);
uint32_t subBlockIdx = get_subblockid();
for (int rowSlice = 0; rowSlice < VecLoadTimes; ++rowSlice) {
size_t vecBaseRow = static_cast<size_t>(M / VecCores) * static_cast<size_t>(subBlockIdx);
size_t localRowOffset = static_cast<size_t>(rowSlice * VecM);
size_t outRowOffset = (vecBaseRow + localRowOffset) * static_cast<size_t>(FullN);
LoadGlobal3D loadGlobal(popGlobal.data() + localRowOffset * static_cast<size_t>(FullN));
TLOAD(vecTile, loadGlobal);
TADDS(dstTile, vecTile, static_cast<float>(3.14));
OutGlobal3D outGlobal(out + outRowOffset);
TSTORE(outGlobal, dstTile);
}
TFREE<Pipe, PopGlobal, TileSplitAxis::TILE_UP_DOWN>(pipe, popGlobal);
Notes¶
- These interfaces are address-based. They intentionally do not perform
TSTOREorTLOADinternally. TALLOCmust be paired withTPUSH.TPOPmust be paired withTFREE.TALLOCandTPOPwrite the computed GM address into the providedGlobalTensor.- The
GlobalDatatemplate argument is used as a type-level shape/dtype descriptor, and the runtimegmTensorcarries the computed GM address for laterTSTORE,TLOAD,TPUSH, orTFREEcalls.