TCOLARGMIN¶
Tile Operation Diagram¶
Introduction¶
Get the row index of the minimum element for each column. A value+index variant is also available that returns both the minimum value and its row index.
Math Interpretation¶
Pure Index Mode¶
Let R = src.GetValidRow() and C = src.GetValidCol(). For 0 <= j < C:
\[ \mathrm{dstIdx}_{0,j} = \underset{0 \le i < R}{\operatorname{argmin}} \; \mathrm{src}_{i,j} \]
Value + Index Mode¶
\[ \mathrm{dstVal}_{0,j} = \min_{0 \le i < R} \mathrm{src}_{i,j} \]
\[ \mathrm{dstIdx}_{0,j} = \underset{0 \le i < R}{\operatorname{argmin}} \; \mathrm{src}_{i,j} \]
Assembly Syntax¶
PTO-AS form: see docs/grammar/PTO-AS.md.
Pure Index Mode¶
Synchronous form:
%dstIdx = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
IR Level 1 (SSA):
%dstIdx = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
IR Level 2 (DPS):
pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dstIdx : !pto.tile_buf<...>)
Value + Index Mode¶
Synchronous form:
%dstVal, %dstIdx = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>, !pto.tile<...>
IR Level 1 (SSA):
%dstVal, %dstIdx = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, !pto.tile<...>)
IR Level 2 (DPS):
pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dstVal, %dstIdx : !pto.tile_buf<...>, !pto.tile_buf<...>)
C++ Intrinsic¶
Declared in include/pto/common/pto_instr.hpp:
Pure Index Mode (3-argument)¶
template <typename TileDataOut, typename TileDataIn, typename TileDataTmp, typename... WaitEvents>
PTO_INST RecordEvent TCOLARGMIN(TileDataOut &dst, TileDataIn &src, TileDataTmp &tmp, WaitEvents &...events)
Value + Index Mode (4-argument)¶
template <typename TileDataOutVal, typename TileDataOutIdx, typename TileDataIn, typename TileDataTmp,
typename... WaitEvents>
PTO_INST RecordEvent TCOLARGMIN(TileDataOutVal& dstVal, TileDataOutIdx& dstIdx, TileDataIn& src, TileDataTmp& tmp,
WaitEvents&... events);
Constraints¶
General constraints / checks¶
dstIdxandsrcmust beTileType::Vec.srcmay use ND or DN non-fractal layout (SLayout::NoneBox).dstIdxmust use standard ND layout: row-major and non-fractal (BLayout::RowMajor,SLayout::NoneBox).- Supported destination index element types:
uint32_t,int32_t. - Runtime checks:
src.GetValidRow() != 0src.GetValidCol() != 0dstIdx.GetValidRow() == 1src.GetValidCol() == dstIdx.GetValidCol()
Pure Index Mode (3-argument)¶
A2A3 implementation checks¶
- Supported source element types:
half,float,uint16_t,uint32_t. tmpmust use the same element type assrc.tmpis used as scratch storage for index tracking and current comparison values.
A5 implementation checks¶
- Supported source element sizes are 8-bit, 16-bit, or 32-bit; covers
int8_t,uint8_t,int16_t,uint16_t,int32_t,uint32_t,half,float. tmpis accepted by the interface but not used by the implementation.
Value + Index Mode (4-argument)¶
In addition to the general constraints:
dstValmust beTileType::Vecwith standard ND layout (row-major, non-fractal).dstValelement type must match the source element typeTileDataIn::DType.- 8-bit source types are not supported.
- Runtime checks:
dstVal.GetValidRow() == 1dstVal.GetValidCol() != 0src.GetValidCol() == dstVal.GetValidCol()dstVal.GetValidRow() == dstIdx.GetValidRow()dstVal.GetValidCol() == dstIdx.GetValidCol()
A2A3 implementation checks¶
- Supported source element types:
half,float,uint16_t,uint32_t. - When source element size is 2 bytes (
half,uint16_t):dstIdxelement type must beuint16_torint16_t. - When source element size is 4 bytes (
float,uint32_t):dstIdxelement type must beuint32_torint32_t. tmpmust use the same element type assrc.tmpis used as scratch storage; for half input types an internal s16->f16->s32 conversion path is used for the index.
A5 implementation checks¶
- Source element size must be 16-bit or 32-bit (
sizeof(T) != 1). - When source element size is 2 bytes (
half,int16_t,uint16_t):dstIdxelement type must beuint16_torint16_t. - When source element size is 4 bytes (
float,int32_t,uint32_t):dstIdxelement type must beuint32_torint32_t. tmpis accepted by the interface but not used by the implementation.
About temporary tile tmp for A2A3¶
tmpis always used in the A2A3 implementation as scratch space for intermediate results (current row index, argmin index, and current minimum elements).tmptile's data type must be the same assrc's data type.tmptile is organized into three regions within a single row:- Region 0 (
[0, tmpGapEles)): current row index counter (incremented per row). - Region 1 (
[tmpGapEles, 2 * tmpGapEles)): current minimum elements for comparison. - Region 2 (
[2 * tmpGapEles, 3 * tmpGapEles)): argmin index result (before final conversion todstIdx). tmpGapElesis determined as follows:- When
srcValidCol >= elemPerRpt:tmpGapEles = elemPerRpt. - When
srcValidCol < elemPerRpt:tmpGapEles = ceil(srcValidCol / elemPerBlock) * elemPerBlock. - Simply set
tmptile size the same assrcwhensrcis small, or calculate using:
text
repeats = ceil(validCol / elementPerRepeat)
stride = ceil(repeats * 2 / elementPerBlock) * elementPerBlock + ceil(repeats / elementPerBlock) * elementPerBlock
- In Value + Index mode with half input,
tmpregion 2 data undergoes s16->f16->s32 conversion before being stored todstIdx.
About temporary tile tmp for A5¶
tmptemporary tile is not used in the A5 implementation for either mode. The A5 uses vector register-based computation (__VEC_SCOPE__) and does not require scratch tile storage.tmpis retained in the C++ intrinsic signature solely for API compatibility with A2A3.
Examples¶
Pure Index Mode¶
Auto¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_auto() {
using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
SrcT src(16, 255);
DstT dst(1, 255);
TmpT tmp(1, 32);
TCOLARGMIN(dst, src, tmp);
}
Manual¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_manual() {
using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
using DstT = Tile<TileType::Vec, uint32_t, 1, 256, BLayout::RowMajor, -1, -1>;
using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
SrcT src(16, 255);
DstT dst(1, 255);
TmpT tmp(1, 32);
TASSIGN(src, 0x0);
TASSIGN(dst, 0x1000);
TASSIGN(tmp, 0x2000);
TCOLARGMIN(dst, src, tmp);
}
Value + Index Mode¶
Auto¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_auto_val_idx() {
using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
using DstValT = Tile<TileType::Vec, float, 1, 256, BLayout::RowMajor, -1, -1>;
using DstIdxT = Tile<TileType::Vec, int32_t, 1, 256, BLayout::RowMajor, -1, -1>;
using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
SrcT src(16, 255);
DstValT dstVal(1, 255);
DstIdxT dstIdx(1, 255);
TmpT tmp(1, 32);
TCOLARGMIN(dstVal, dstIdx, src, tmp);
}
Manual¶
#include <pto/pto-inst.hpp>
using namespace pto;
void example_manual_val_idx() {
using SrcT = Tile<TileType::Vec, float, 16, 256, BLayout::RowMajor, -1, -1>;
using DstValT = Tile<TileType::Vec, float, 1, 256, BLayout::RowMajor, -1, -1>;
using DstIdxT = Tile<TileType::Vec, int32_t, 1, 256, BLayout::RowMajor, -1, -1>;
using TmpT = Tile<TileType::Vec, float, 1, 32, BLayout::RowMajor, -1, -1>;
SrcT src(16, 255);
DstValT dstVal(1, 255);
DstIdxT dstIdx(1, 255);
TmpT tmp(1, 32);
TASSIGN(src, 0x0);
TASSIGN(dstVal, 0x1000);
TASSIGN(dstIdx, 0x2000);
TASSIGN(tmp, 0x3000);
TCOLARGMIN(dstVal, dstIdx, src, tmp);
}
ASM Form Examples¶
Pure Index Auto Mode¶
# Auto mode: compiler/runtime-managed placement and scheduling.
%dstIdx = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
Pure Index Manual Mode¶
# Manual mode: resources must be bound explicitly before issuing the instruction.
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
%dstIdx = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> !pto.tile<...>
Value + Index Auto Mode¶
# Auto mode: compiler/runtime-managed placement and scheduling.
%dstVal, %dstIdx = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, !pto.tile<...>)
Value + Index Manual Mode¶
# Manual mode: resources must be bound explicitly before issuing the instruction.
# pto.tassign %arg0, @tile(0x1000)
# pto.tassign %arg1, @tile(0x2000)
# pto.tassign %arg2, @tile(0x3000)
%dstVal, %dstIdx = pto.tcolargmin %src, %tmp : (!pto.tile<...>, !pto.tile<...>) -> (!pto.tile<...>, !pto.tile<...>)
PTO Assembly Form¶
# Pure index
%dstIdx = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>
# Value + index
%dstVal, %dstIdx = tcolargmin %src : !pto.tile<...> -> !pto.tile<...>, !pto.tile<...>
# IR Level 2 (DPS) - pure index
pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dstIdx : !pto.tile_buf<...>)
# IR Level 2 (DPS) - value + index
pto.tcolargmin ins(%src, %tmp : !pto.tile_buf<...>, !pto.tile_buf<...>) outs(%dstVal, %dstIdx : !pto.tile_buf<...>, !pto.tile_buf<...>)