TSTORE¶
Tile Operation Diagram¶
Introduction¶
Store data from a Tile into a GlobalTensor (GM), optionally using atomic write or quantization parameters.
Math Interpretation¶
Notation depends on the GlobalTensor shape/stride and the Tile layout. Conceptually (2D view, with a base offset):
\[ \mathrm{dst}_{r_0 + i,\; c_0 + j} = \mathrm{src}_{i,j} \]
Assembly Syntax¶
PTO-AS form: see PTO-AS Specification.
Synchronous form:
tstore %t1, %sv_out[%c0, %c0]
IR Level 1 (SSA)¶
pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
IR Level 2 (DPS)¶
pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp and include/pto/common/constants.hpp:
template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
typename... WaitEvents>
PTO_INST RecordEvent TSTORE(GlobalData& dst, TileData& src, WaitEvents&... events);
template <typename TileData, typename GlobalData, AtomicType atomicType = AtomicType::AtomicNone,
typename... WaitEvents>
PTO_INST RecordEvent TSTORE(GlobalData& dst, TileData& src, uint64_t preQuantScalar, WaitEvents&... events);
template <typename TileData, typename GlobalData, typename FpTileData, AtomicType atomicType = AtomicType::AtomicNone,
typename... WaitEvents>
PTO_INST RecordEvent TSTORE_FP(GlobalData& dst, TileData& src, FpTileData& fp, WaitEvents&... events);
Constraints¶
- Implementation checks (A2A3):
- Source tile location must be one of:
TileType::Vec,TileType::Mat,TileType::Acc. - Runtime: all
dst.GetShape(dim)values andsrc.GetValidRow()/GetValidCol()must be> 0. - For
TileType::Vec/TileType::Mat:TileData::DTypemust be one of:int8_t,uint8_t,int16_t,uint16_t,int32_t,uint32_t,int64_t,uint64_t,half,bfloat16_t,float.sizeof(TileData::DType) == sizeof(GlobalData::DType).- Layouts must match ND/DN/NZ (or a special case where
TileData::Rows == 1orTileData::Cols == 1). - For
int64_t/uint64_t, only ND->ND or DN->DN are supported.
- For
TileType::Acc(including quantized/atomic variants):- Destination layout must be ND or NZ.
- Source dtype must be
int32_torfloat. - When not using quantization, destination dtype must be
__gm__ int32_t/float/half/bfloat16_t. - Static shape constraints:
1 <= TileData::Cols <= 4095; if ND then1 <= TileData::Rows <= 8192; if NZ then1 <= TileData::Rows <= 65535andTileData::Cols % 16 == 0. - Runtime:
1 <= src.GetValidCol() <= 4095.
- Implementation checks (A5):
- Source tile location must be
TileType::VecorTileType::Acc(noMatstore on this target). - For
TileType::Vec:sizeof(TileData::DType) == sizeof(GlobalData::DType).TileData::DTypemust be one of:int8_t,uint8_t,int16_t,uint16_t,int32_t,uint32_t,int64_t,uint64_t,half,bfloat16_t,float,float8_e4m3_t,float8_e5m2_t,hifloat8_t,float4_e1m2x2_t,float4_e2m1x2_t.- Layouts must match ND/DN/NZ (or a special case where
TileData::Rows == 1orTileData::Cols == 1). - Additional alignment constraints are enforced (e.g., for ND the row-major width in bytes must be a multiple of 32; for DN the column-major height in bytes must be a multiple of 32, with special-case exceptions).
- For
TileType::Acc:- Destination layout must be ND or NZ; source dtype must be
int32_torfloat. - When not using quantization, destination dtype must be
__gm__ int32_t/float/half/bfloat16_t. - Static shape constraints match A2A3 for rows/cols;
AtomicAddadditionally restricts destination dtype to supported atomic types.
- Destination layout must be ND or NZ; source dtype must be
- Valid region:
- The implementation uses
src.GetValidRow()/src.GetValidCol()as the transfer size.
Examples¶
Auto¶
#include <pto/pto-inst.hpp>
using namespace pto;
template <typename T>
void example_auto(__gm__ T* out) {
using TileT = Tile<TileType::Vec, T, 16, 16>;
using GShape = Shape<1, 1, 1, 16, 16>;
using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
GTensor gout(out);
TileT t;
TSTORE(gout, t);
}
Manual¶
#include <pto/pto-inst.hpp>
using namespace pto;
template <typename T>
void example_manual(__gm__ T* out) {
using TileT = Tile<TileType::Vec, T, 16, 16>;
using GShape = Shape<1, 1, 1, 16, 16>;
using GStride = BaseShape2D<T, 16, 16, Layout::ND>;
using GTensor = GlobalTensor<T, GShape, GStride, Layout::ND>;
GTensor gout(out);
TileT t;
TASSIGN(t, 0x1000);
TSTORE<TileT, GTensor, AtomicType::AtomicAdd>(gout, t);
}
ASM Form Examples¶
Auto Mode¶
# Auto mode: compiler/runtime-managed placement and scheduling.
pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
Manual Mode¶
# Manual mode: bind resources explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
pto.tstore %src, %mem : (!pto.tile<...>, !pto.partition_tensor_view<MxNxdtype>) -> ()
PTO Assembly Form¶
tstore %t1, %sv_out[%c0, %c0]
# IR Level 2 (DPS)
pto.tstore ins(%src : !pto.tile_buf<...>) outs(%mem : !pto.partition_tensor_view<MxNxdtype>)