TREDUCE¶
Introduction¶
Reduce operation: gather data from multiple remote NPUs and perform element-wise reduction locally.
Only the root needs to execute TREDUCE. Non-root ranks only need to ensure their source buffers are ready and remain valid for the duration of the operation. Calling TREDUCE on non-root ranks is undefined behavior.
Large Tile Support: When the GlobalTensor exceeds the UB tile capacity in rows and/or columns, the reduction is automatically chunked via 2D sliding.
Math Interpretation¶
For each element \((i, j)\) in the valid region:

\[
\mathrm{dst}[i][j] = \bigoplus_{r=0}^{N-1} \mathrm{src}_r[i][j]
\]

where \(N\) is the number of ranks and \(\oplus\) is the reduction operation (sum, max, min, etc.).
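The semantics can be pictured with a plain host-side C++ sketch (a hypothetical reference model, not the device intrinsic itself): each output element is the fold of the corresponding element across all ranks' source buffers.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference for the TREDUCE semantics:
// dst[i] = src[0][i] (+) src[1][i] (+) ... (+) src[N-1][i]
template <typename T, typename ReduceFn>
void reduce_reference(const std::vector<std::vector<T>>& srcs,
                      std::vector<T>& dst, ReduceFn op) {
    for (std::size_t i = 0; i < dst.size(); ++i) {
        T acc = srcs[0][i];                  // initialize from rank 0
        for (std::size_t r = 1; r < srcs.size(); ++r)
            acc = op(acc, srcs[r][i]);       // fold in the remaining ranks
        dst[i] = acc;
    }
}
```

Because the fold is left-to-right over ranks, non-associative effects (e.g. floating-point rounding) depend on rank order, just as they do in most collective-reduce implementations.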
Assembly Syntax¶
PTO-AS form: see PTO-AS Specification.
Synchronous form:
treduce %group, %dst {op = #pto.reduce_op<Sum>} : (!pto.group<...>, !pto.memref<...>)
treduce %group, %dst {op = #pto.reduce_op<Max>} : (!pto.group<...>, !pto.memref<...>)
Lowering introduces internal accumulator and receive tiles for the reduce pipeline; the C++ intrinsic takes these as explicit operands: accTileData and recvTileData (or accTileData, pingTileData, and pongTileData).
C++ Intrinsic¶
Declared in include/pto/comm/pto_comm_inst.hpp:
// Basic reduce (accumulator + receive tile)
template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
TileData &accTileData, TileData &recvTileData, ReduceOp op, WaitEvents&... events);
// Ping-pong reduce (accumulator + ping + pong tiles for double buffering)
template <typename ParallelGroupType, typename GlobalDstData, typename TileData, typename... WaitEvents>
PTO_INST RecordEvent TREDUCE(ParallelGroupType &parallelGroup, GlobalDstData &dstGlobalData,
TileData &accTileData, TileData &pingTileData, TileData &pongTileData,
ReduceOp op, WaitEvents&... events);
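The ping-pong overload exists so the transfer of rank r+1's data can overlap with the accumulation of rank r's: the two receive tiles alternate roles each step. A minimal host-side sketch of that buffer rotation (hypothetical; the real pipeline runs the receive and the accumulate concurrently on hardware):

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of the ping-pong schedule: rank r is received into
// buffer (r & 1) while the previously filled buffer is folded into the
// accumulator. On the NPU these two phases overlap; here they are serial.
template <typename T>
T pingpong_sum(const std::vector<T>& per_rank) {
    T bufs[2];
    bufs[0] = per_rank[0];              // rank 0 lands in the ping buffer
    T acc = bufs[0];
    for (std::size_t r = 1; r < per_rank.size(); ++r) {
        bufs[r & 1] = per_rank[r];      // receive rank r into the idle buffer
        acc = acc + bufs[r & 1];        // fold it into the accumulator
    }
    return acc;
}
```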
Constraints¶
- Type constraints:
  - ParallelGroup::value_type::RawDType must equal GlobalDstData::RawDType.
  - TileData::DType must equal GlobalDstData::RawDType.
- Memory constraints:
  - dstGlobalData must point to a local address (on the current NPU).
  - accTileData, recvTileData (or accTileData, pingTileData, pongTileData) must be pre-allocated UB tiles.
- ParallelGroup constraints:
  - parallelGroup.tensors[r] must refer to rank r's source buffer (remote GM as seen by the root).
  - parallelGroup.GetRootIdx() identifies the calling NPU as the reduce root.
  - All source tensors are assumed to have the same shape and strides.
- Chunked mode constraints (when data exceeds a single UB tile):
  - If TileData has a static ValidRow, GetShape(DIM_3) must be divisible by ValidRow. Use a Tile with DYNAMIC ValidRow for partial row support.
  - If TileData has a static ValidCol, GetShape(DIM_4) must be divisible by ValidCol. Use a Tile with DYNAMIC ValidCol for partial column support.
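The 2D sliding that chunked mode performs can be pictured as a plain loop over row and column chunks (a hypothetical host-side sketch; the actual chunking is generated by lowering). With static ValidRow/ValidCol the region must divide evenly, so every chunk is full-sized:

```cpp
#include <cstddef>

// Hypothetical sketch of 2D sliding: walk a rows x cols region in
// tileRows x tileCols steps. Each chunk corresponds to one pass of the
// reduce pipeline; the divisibility constraints guarantee no partial chunk.
inline int count_chunks(std::size_t rows, std::size_t cols,
                        std::size_t tileRows, std::size_t tileCols) {
    int chunks = 0;
    for (std::size_t r0 = 0; r0 < rows; r0 += tileRows)
        for (std::size_t c0 = 0; c0 < cols; c0 += tileCols)
            ++chunks;
    return chunks;
}
```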
Examples¶
Basic Reduce Sum¶
#include <pto/comm/pto_comm_inst.hpp>
using namespace pto;
template <typename T, int SIZE, int NRANKS>
void reduce_sum(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
using TileT = Tile<TileType::Vec, T, 1, SIZE>;
using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
// Stack-allocated tensors
GTensor tensors[NRANKS];
for (int i = 0; i < NRANKS; ++i) {
tensors[i] = GTensor(group_addrs[i]);
}
comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
GTensor dstG(result);
TileT accTile, recvTile;
comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Sum);
}
Max Reduce¶
#include <pto/comm/pto_comm_inst.hpp>
using namespace pto;
template <typename T, int SIZE, int NRANKS>
void reduce_max(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
using TileT = Tile<TileType::Vec, T, 1, SIZE>;
using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
GTensor tensors[NRANKS];
for (int i = 0; i < NRANKS; ++i) {
tensors[i] = GTensor(group_addrs[i]);
}
comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
GTensor dstG(result);
TileT accTile, recvTile;
comm::TREDUCE(group, dstG, accTile, recvTile, comm::ReduceOp::Max);
}
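Ping-Pong Reduce Sum¶
A double-buffered variant using the three-tile overload declared above. This is an illustrative sketch mirroring the basic example (same assumed tile shapes and type aliases); it targets the device toolchain and is not host-runnable:

```cpp
#include <pto/comm/pto_comm_inst.hpp>
using namespace pto;
template <typename T, int SIZE, int NRANKS>
void reduce_sum_pingpong(__gm__ T* group_addrs[NRANKS], __gm__ T* result, int my_rank) {
    using TileT = Tile<TileType::Vec, T, 1, SIZE>;
    using GTensor = GlobalTensor<T, Shape<1,1,1,1,SIZE>,
                                 BaseShape2D<T, 1, SIZE, Layout::ND>, Layout::ND>;
    GTensor tensors[NRANKS];
    for (int i = 0; i < NRANKS; ++i) {
        tensors[i] = GTensor(group_addrs[i]);
    }
    comm::ParallelGroup<GTensor> group(tensors, NRANKS, my_rank);
    GTensor dstG(result);
    // Three UB tiles: one accumulator plus two receive tiles that alternate,
    // so the next rank's transfer can overlap with the current accumulation.
    TileT accTile, pingTile, pongTile;
    comm::TREDUCE(group, dstG, accTile, pingTile, pongTile, comm::ReduceOp::Sum);
}
```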