TINSERT¶

Tile Operation Diagram¶

TINSERT tile operation

Introduction¶

Insert a source sub-tile into a destination tile at (indexRow, indexCol). Conceptually the inverse of TEXTRACT.

TINSERT is used for:

Acc → Mat insertion (with optional relu, scalar-quant, or vector-quant)
Acc → Vec insertion (with optional AccToVecMode, relu, scalar-quant, or vector-quant) (A5)
Vec → Mat insertion (ND and NZ layouts) (A5)
Vec → Vec insertion (ND and NZ layouts) (A5)
NZ split insertion (SPLIT2, SPLIT4) (A5)

Math Interpretation¶

Let R = src.GetValidRow() and C = src.GetValidCol(). For 0 <= i < R and 0 <= j < C:

\[ \mathrm{dst}_{\mathrm{indexRow}+i,\;\mathrm{indexCol}+j} = \mathrm{src}_{i,j} \]

Assembly Syntax¶

PTO-AS form: see PTO-AS Specification.

Synchronous form:

%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>

AS Level 1 (SSA)¶

%dst = pto.tinsert %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>

AS Level 2 (DPS)¶

pto.tinsert ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)

C++ Intrinsic¶

Declared in include/pto/common/pto_instr.hpp:

template <typename DstTileData, typename SrcTileData, typename... WaitEvents>
PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src,
                             uint16_t indexRow, uint16_t indexCol,
                             WaitEvents &... events);

template <typename DstTileData, typename SrcTileData, ReluPreMode reluMode,
          typename... WaitEvents>
PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src,
                             uint16_t indexRow, uint16_t indexCol,
                             WaitEvents &... events);

template <typename DstTileData, typename SrcTileData, AccToVecMode mode,
          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src,
                             uint16_t indexRow, uint16_t indexCol,
                             WaitEvents &... events);

template <typename DstTileData, typename SrcTileData,
          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src,
                             uint64_t preQuantScalar,
                             uint16_t indexRow, uint16_t indexCol,
                             WaitEvents &... events);

template <typename DstTileData, typename SrcTileData, AccToVecMode mode,
          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src,
                             uint64_t preQuantScalar,
                             uint16_t indexRow, uint16_t indexCol,
                             WaitEvents &... events);

template <typename DstTileData, typename SrcTileData, typename FpTileData,
          ReluPreMode reluMode = ReluPreMode::NoRelu, typename... WaitEvents>
PTO_INST RecordEvent TINSERT_FP(DstTileData &dst, SrcTileData &src,
                                FpTileData &fp,
                                uint16_t indexRow, uint16_t indexCol,
                                WaitEvents &... events);

template <typename DstTileData, typename SrcTileData, typename FpTileData,
          AccToVecMode mode, ReluPreMode reluMode = ReluPreMode::NoRelu,
          typename... WaitEvents>
PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src,
                             FpTileData &fp,
                             uint16_t indexRow, uint16_t indexCol,
                             WaitEvents &... events);

#ifdef PTO_NPU_ARCH_A5
template <TInsertMode mode, typename DstTileData, typename SrcTileData,
          typename... WaitEvents>
PTO_INST RecordEvent TINSERT(DstTileData &dst, SrcTileData &src,
                             uint16_t indexRow = 0, uint16_t indexCol = 0,
                             WaitEvents &... events);
#endif

Constraints¶

General constraints / checks¶

TINSERT has these overload families:
- plain insert: TINSERT(dst, src, indexRow, indexCol)
- relu form: TINSERT<..., reluMode>(dst, src, indexRow, indexCol)
- accumulator-to-vector form: TINSERT<..., mode, reluMode>(dst, src, indexRow, indexCol)
- scalar-quant form: TINSERT<..., reluMode>(dst, src, preQuantScalar, indexRow, indexCol) and TINSERT<..., mode, reluMode>(dst, src, preQuantScalar, indexRow, indexCol)
- vector-quant form: TINSERT_FP<..., reluMode>(dst, src, fp, indexRow, indexCol) and TINSERT<..., FpTileData, mode, reluMode>(dst, src, fp, indexRow, indexCol)
- NZ split form (A5 only): TINSERT<TInsertMode::SPLIT2>(dst, src, indexRow, indexCol) or TINSERT<TInsertMode::SPLIT4>(...)
reluMode is ReluPreMode::{NoRelu, NormalRelu}.
mode is AccToVecMode::{SingleModeVec0, SingleModeVec1, DualModeSplitM, DualModeSplitN}.
Runtime bounds: indexRow + src.ValidRow <= dst.Rows and indexCol + src.ValidCol <= dst.Cols.

A2A3 implementation checks¶

Supported tile-type pair: TileType::Acc → TileType::Mat only.
Source layout must be (BFractal: ColMajor, SFractal: RowMajor).
Destination layout must be (BFractal: ColMajor, SFractal: RowMajor) with SFractalSize == 512.
Dst.Cols * sizeof(DstDType) must be a multiple of 32 bytes and non-zero.
Plain / relu (non-quant) supported dtype pairs:
- float Acc → half, bfloat16_t
Scalar-quant supported dtype pairs:
- float Acc → int8_t
- int32_t Acc → int8_t, uint8_t, half, int16_t
Vector-quant (TINSERT_FP) supported dtype pairs:
- float Acc → int8_t, uint8_t
- int32_t Acc → int8_t, uint8_t, half, int16_t
Vector-quant requires an FpTileData scaling operand (TileType::Scaling).

A5 implementation checks¶

In addition to the Acc → Mat path, A5 supports Acc → Vec, Vec → Vec, Vec → Mat, and NZ split paths.
Acc → Mat (TileType::Acc → TileType::Mat):
- Source Acc type must be float or int32_t; source layout must be (BFractal: ColMajor, SFractal: RowMajor).
- Destination layout must be (!isRowMajor, SFractal: RowMajor) (NZ format).
- Non-quant (plain / relu) destination types:
  - float Acc → half, bfloat16_t, float
  - int32_t Acc → int32_t
- Scalar-quant destination types:
  - float Acc → int8_t, uint8_t, hifloat8_t, half, bfloat16_t, float8_e4m3_t
  - int32_t Acc → int8_t, uint8_t, half, bfloat16_t
- Vector-quant (TINSERT_FP) destination types: same as scalar-quant above.
Acc → Vec (TileType::Acc → TileType::Vec):
- Source Acc type must be float or int32_t; source layout must be (BFractal: ColMajor, SFractal: RowMajor).
- Non-quant (plain / relu) destination types:
  - float Acc → half, bfloat16_t, float
  - int32_t Acc → int32_t
- Scalar-quant destination types:
  - float Acc → int8_t, uint8_t, hifloat8_t, half, bfloat16_t, float8_e4m3_t
  - int32_t Acc → int8_t, uint8_t, half, bfloat16_t
- Vector-quant (TINSERT_FP / TINSERT with FpTileData) destination types: same as scalar-quant above.
- Destination layout must be one of: NZ-to-NZ (!isRowMajor, SFractal: RowMajor), NZ-to-ND (isRowMajor, SFractal: NoneBox), or NZ-to-DN (!isRowMajor, SFractal: NoneBox).
- AccToVecMode selects SingleModeVec0, SingleModeVec1, DualModeSplitM, or DualModeSplitN.
- Dual-destination modes (DualModeSplitM, DualModeSplitN) require QuantMode_t::NoQuant and do not support the NZ-to-DN path.
- Destination stride must be non-zero and dstStride * sizeof(dstType) must be a multiple of 32 bytes.
Vec → Vec (TileType::Vec → TileType::Vec):
- DstTileData::DType must equal SrcTileData::DType.
- Supported element types: half, bfloat16_t, float, int32_t, int8_t, hifloat8_t, float8_e4m3_t, float8_e5m2_t, float8_e8m0_t, float4_e2m1x2_t, float4_e1m2x2_t.
- Source and destination layout must match (both ND or both NZ).
- ND path: source valid region must fit within destination bounds. Dispatch selects copy_ubuf_to_ubuf (aligned), vlds/vsts (stride-aligned, unaligned validCol), vlds/vstus (unaligned strides or indexCol), or scalar copy (1×1 element).
- NZ path: source cols must not exceed destination cols. Uses ComputeNZBlockParams for fractal-block copy_ubuf_to_ubuf.
Vec → Mat (TileType::Vec → TileType::Mat, UB → L1):
- DstTileData::DType must equal SrcTileData::DType.
- Supported element types: half, bfloat16_t, float, int32_t, int8_t, hifloat8_t, float8_e4m3_t, float8_e5m2_t, float8_e8m0_t, float4_e2m1x2_t, float4_e1m2x2_t.
- ND path: source must be isRowMajor; uses copy_ubuf_to_cbuf. Data bytes per row must be aligned to BLOCK_BYTE_SIZE (32 bytes) for row-wise burst.
- NZ path: source must be (!isRowMajor, SFractal: RowMajor); uses ComputeNZBlockParams for fractal-block copy_ubuf_to_cbuf. For fp4 types (float4_e2m1x2_t, float4_e1m2x2_t), validCol and indexCol are halved for byte addressing.
NZ Split (TInsertMode::SPLIT2 / TInsertMode::SPLIT4, A5 only):
- Destination must be TileType::Mat; source must be TileType::Vec.
- DstTileData::DType must equal SrcTileData::DType.
- Source must be NZ format: (!isRowMajor, SFractal: RowMajor).
- Supported element types: half, bfloat16_t, float, int32_t, int8_t, hifloat8_t, float8_e4m3_t, float8_e5m2_t, float8_e8m0_t, float4_e2m1x2_t, float4_e1m2x2_t.
- validRow is aligned up to FRACTAL_NZ_ROW (16) for burst calculation.
- Splits the copy_ubuf_to_cbuf total burst into 2 or 4 sub-transfers, each handling totalBurstNum / SplitCount column blocks (last sub-transfer takes the remainder).

Examples¶

Auto¶

#include <pto/pto-inst.hpp>

using namespace pto;

// Vec -> Mat insertion (NZ layout)
void example_auto() {
  using SrcT = Tile<TileType::Vec, half, 16, 32, BLayout::ColMajor, 16, 32, SLayout::RowMajor>;
  using DstT = Tile<TileType::Mat, half, 16, 32, BLayout::ColMajor, -1, -1, SLayout::RowMajor>;
  SrcT src;
  DstT dst(16, 32);
  TINSERT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
}

Manual¶

#include <pto/pto-inst.hpp>

using namespace pto;

// Vec -> Mat insertion (NZ layout, manual buffer assignment)
void example_manual() {
  using SrcT = Tile<TileType::Vec, half, 16, 32, BLayout::ColMajor, 16, 32, SLayout::RowMajor>;
  using DstT = Tile<TileType::Mat, half, 16, 32, BLayout::ColMajor, -1, -1, SLayout::RowMajor>;
  SrcT src;
  DstT dst(16, 32);
  TASSIGN(src, 0x0);
  TASSIGN(dst, 0x0);
  TINSERT(dst, src, /*indexRow=*/0, /*indexCol=*/0);
}

ASM Form Examples¶

Auto Mode¶

# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.tinsert %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>

Manual Mode¶

# Manual mode: resources must be bound explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dst = pto.tinsert %src, %idxrow, %idxcol : (!pto.tile<...>, dtype, dtype) -> !pto.tile<...>

PTO Assembly Form¶

%dst = tinsert %src[%r0, %r1] : !pto.tile<...> -> !pto.tile<...>
# AS Level 2 (DPS)
pto.tinsert ins(%src, %idxrow, %idxcol : !pto.tile_buf<...>, dtype, dtype) outs(%dst : !pto.tile_buf<...>)