TREM

Tile Operation Diagram

TREM tile operation

Introduction

Elementwise remainder of two tiles. The result has the same sign as the divider.

Math Interpretation

For each element (i, j) in the valid region:

\[\mathrm{dst}_{i,j} = \mathrm{remainder}(\mathrm{src0}_{i,j}, \mathrm{src1}_{i,j}) = \mathrm{src0}_{i,j} - \mathrm{floor}(\frac{\mathrm{src0}_{i,j}}{\mathrm{src1}_{i,j}}) \times \mathrm{src1}_{i,j}\]

The result sign is corrected to match the sign of the divider (src1).

Note: This differs from TFMOD where the result sign follows the dividend (src0).

Assembly Syntax

PTO-AS form: see PTO-AS Specification.

Synchronous form:

%dst = trem %src0, %src1 : !pto.tile<...>

AS Level 1 (SSA)

%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

AS Level 2 (DPS)

pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)

C++ Intrinsic

Declared in include/pto/common/pto_instr.hpp:

template <auto PrecisionType = RemAlgorithm::DEFAULT, typename TileDataDst, typename TileDataSrc0,
          typename TileDataSrc1, typename TileDataTmp, typename... WaitEvents>
PTO_INST RecordEvent TREM(TileDataDst &dst, TileDataSrc0 &src0, TileDataSrc1 &src1, TileDataTmp &tmp, WaitEvents &... events);

Constraints

  • Implementation checks (A2A3):
    • TileData::DType must be one of: float, float32_t, int32_t.
    • Tile layout must be row-major (TileData::isRowMajor).
    • Tile location must be vector (TileData::Loc == TileType::Vec).
    • Runtime: src0, src1 and dst tiles should have the same validRow/validCol.
    • tmp tile must have at least 2 rows and validCols columns (row 0 for intermediate results, row 1 for comparison mask).
  • Implementation checks (A5):
    • TileData::DType must be one of: half, float, int16_t, uint16_t, int32_t, uint32_t.
    • Tile layout must be row-major (TileData::isRowMajor).
    • Tile location must be vector (TileData::Loc == TileType::Vec).
    • Runtime: src0, src1 and dst tiles should have the same validRow/validCol.
    • Note: tmp parameter is accepted but not used on A5.
  • Valid region:
    • The op uses dst.GetValidRow() / dst.GetValidCol() as the iteration domain.
  • Division-by-zero:
    • Behavior is target-defined; the CPU simulator asserts in debug builds.
  • High Precision Algorithm:
    • Only available on A5 for float type; PrecisionType option is ignored on A2A3.

Examples

#include <pto/pto-inst.hpp>

using namespace pto;

void example() {
  using TileT = Tile<TileType::Vec, float, 16, 16>;
  using TmpT = Tile<TileType::Vec, float, 2, 16>;
  TileT dst, src0, src1;
  TmpT tmp;
  TREM(dst, src0, src1, tmp);
}

ASM Form Examples

Auto Mode

# Auto mode: compiler/runtime-managed placement and scheduling.
%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

Manual Mode

# Manual mode: resources must be bound explicitly before issuing the instruction.
# Optional for tile operands:
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dst = pto.trem %src0, %src1 : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>

PTO Assembly Form

%dst = trem %src0, %src1 : !pto.tile<...>
# AS Level 2 (DPS)
pto.trem ins(%src0, %src1 : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dst : !pto.tile_buf<...>)