MoE Dispatch: PTO-ISA Standalone Communication Operator

Overview

Standalone implementation of the MegaMoE Dispatch communication operator using PTO-ISA instructions. The operator pulls quantized tokens from remote ranks' shared memory and splits each interleaved [int8 token | float scale] row into two separate outputs: gmA (tokens) and gmPerTokenScale (scales).

Three independent kernel paths are provided:

Direct (2-step): TLOAD remote GM → UB → TSTORE split — fast path with adaptive UB tiling
ViaGM (4-step): TGET remote GM → local GM → TLOAD → UB → TSTORE split — MegaMoE-compatible path
WithSync (integrated): CrossRankSync → Direct dispatch — device-side routing table computation + dispatch

Supported AI Processors

Ascend A5

Data Flow

Direct path (mode=direct):
  Remote GM ──TLOAD──▶ UB (ping/pong) ──TSTORE──▶ gmA (token)
                                        ──TSTORE──▶ gmPerTokenScale (scale)

ViaGM path (mode=viagm):
  Remote GM ──TGET──▶ Local temp GM ──TLOAD──▶ UB (ping/pong) ──TSTORE──▶ gmA
                                                                ──TSTORE──▶ gmPerTokenScale

WithSync path (mode=sync):
  Phase A: TPE (tokenPerExpert) AllGather (TSTORE remote write + TWAIT)
      Local TPE ──TSTORE+DataAsFlag──▶ All remote ranks
      ──TWAIT──▶ Receive TPE from all remote ranks

  Phase B: Compute routing tables (device-side)
      B.1 Strip DataAsFlag from received TPE (vectorized TLOAD/TADDS/TSTORE)
      B.2 Compute cumsumMM prefix sum (vectorized TLOAD/TADD/TSTORE)
      B.3 Compute preSumBeforeRank (scalar accumulation)

  Phase C: MoeDispatchDirect with computed tables
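
Phases A and B can be modelled on the host with plain loops. The sketch below is one reading of the flow above, not the kernel's code: `kFlagOffset`, `EncodeWithFlag`, `RowArrived`, `StripFlag`, and `CumsumOverRanks` are hypothetical names, the offset value is an assumption, and the running row-wise sum is an interpretation of step B.2. Step B.3 is left out because its layout is not specified here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical flag offset: added before the remote write so that a
// received word is never zero, even when the token count itself is zero.
constexpr int32_t kFlagOffset = 1 << 30;

// Phase A (sender): tag each per-expert count with the flag offset.
std::vector<int32_t> EncodeWithFlag(const std::vector<int32_t>& tokenPerExpert) {
    std::vector<int32_t> out(tokenPerExpert.size());
    for (std::size_t e = 0; e < out.size(); ++e) {
        out[e] = tokenPerExpert[e] + kFlagOffset;
    }
    return out;
}

// Phase A (receiver): a row has arrived once every word is non-zero.
bool RowArrived(const std::vector<int32_t>& row) {
    for (int32_t w : row) {
        if (w == 0) return false;
    }
    return true;
}

// Phase B.1: strip the offset with a scalar add of -kFlagOffset (the TADDS step).
void StripFlag(std::vector<int32_t>& row) {
    for (int32_t& w : row) w -= kFlagOffset;
}

// Phase B.2 (assumed semantics): running sum over ranks, so that
// row r of the result holds the element-wise sum of input rows 0..r.
std::vector<std::vector<int32_t>> CumsumOverRanks(
    const std::vector<std::vector<int32_t>>& tpe) {
    std::vector<std::vector<int32_t>> cumsum(tpe.size());
    std::vector<int32_t> acc(tpe.empty() ? 0 : tpe[0].size(), 0);
    for (std::size_t r = 0; r < tpe.size(); ++r) {
        for (std::size_t e = 0; e < acc.size(); ++e) acc[e] += tpe[r][e];
        cumsum[r] = acc;
    }
    return cumsum;
}
```

Encoding the data itself as the arrival flag means the receiver needs no separate signal word: polling the payload for a non-zero value is enough.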

Algorithm

=== Direct / ViaGM ===
for each local expert (groupIdx):
    for each remote rank (dstEpIdx, strided by coreIdx):
        1. Compute remote source address in peer shmem
        2. Compute local destination offset in gmA/gmPerTokenScale
        3. [Direct] TLOAD interleaved rows into UB, TSTORE split token and scale
           [ViaGM]  TGET rows to local GM, then TLOAD→UB→TSTORE split
        4. Event-driven ping-pong: overlap TLOAD(N+1) with TSTORE(N)
    // Cross-rank continuous pipeline: no bubble between ranks

=== WithSync ===
Phase A — TPE AllGather:
    for each remote rank i:
        TSTORE local tokenPerExpert to rank i's TPE exchange area (with DataAsFlag offset)
    for each remote rank i:
        TWAIT until rank i's data arrives (poll GM for non-zero signal)

Phase B — Routing Table Computation:
    B.1: TLOAD each TPE row, TADDS to strip DataAsFlag offset, TSTORE back
    SYNCALL (software-based GM polling)
    B.2: Vectorized prefix sum — TLOAD row[i], TADD with accumulator, TSTORE → cumsumMM
    B.3: Scalar loop — accumulate preSumBeforeRank[i] from cumsumMM columns

Phase C — Dispatch:
    Call MoeDispatchDirect with the computed routing tables
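
The Direct / ViaGM loop nest can be illustrated with a small host-side planner that lists the tiles one core would issue. This is a sketch under stated assumptions: `TileJob`, `PlanCore`, and the fixed `tilesPerPair` count are hypothetical (in the kernel the tile count would follow from the routing tables). It shows only the strided rank assignment and how the ping/pong slot carries over from one remote rank to the next.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct TileJob {
    uint32_t groupIdx;  // local expert index
    uint32_t dstEpIdx;  // remote rank index
    uint32_t ubSlot;    // 0 = ping buffer, 1 = pong buffer
};

// Enumerate, in issue order, the tiles that core `coreIdx` would process.
std::vector<TileJob> PlanCore(uint32_t coreIdx, uint32_t coreNum,
                              uint32_t expertNum, uint32_t epSize,
                              uint32_t tilesPerPair) {
    std::vector<TileJob> jobs;
    uint32_t slot = 0;  // persists across ranks and experts: no pipeline flush
    for (uint32_t g = 0; g < expertNum; ++g) {
        // Strided assignment: this core takes ranks coreIdx, coreIdx + coreNum, ...
        for (uint32_t ep = coreIdx; ep < epSize; ep += coreNum) {
            for (uint32_t t = 0; t < tilesPerPair; ++t) {
                jobs.push_back({g, ep, slot});
                slot ^= 1u;  // alternate ping/pong for the next tile
            }
        }
    }
    return jobs;
}
```

With 2 cores, 4 ranks, and 3 tiles per pair, core 1 handles ranks 1 and 3, and the first tile of rank 3 lands in the slot opposite to the last tile of rank 1, so the load for the new rank can overlap the store for the old one.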

Key Features

Triple-path design: Direct (fast), ViaGM (compatible), WithSync (self-contained)
Integrated CrossRankSync: WithSync path computes routing tables on-device, no host pre-computation
Vectorized cumsumMM: TLOAD/TADD/TSTORE prefix sum with pipelined events, padded to 32B alignment
Software SYNCALL: GM-polling based cross-core synchronization (avoids FFTS hardware dependency)
Adaptive MOVE_NUM: Compile-time DispatchTraits<TILE_COLS> auto-shrinks rows/batch for large hiddenSize
Event-driven ping-pong: set_flag/wait_flag overlaps MTE2 (TLOAD) and MTE3 (TSTORE) pipelines
Cross-rank continuous pipeline: Ping-pong state persists across remote ranks — no flush between ranks
Multi-core parallel: Each AIV core handles one or more remote ranks (strided assignment)
Token/scale separation: Remote rows [int8×K][float scale padded to 32B] → compact token + scale outputs
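
The adaptive MOVE_NUM idea can be sketched as a compile-time trait. This is a minimal illustration rather than the contents of moe_dispatch_config.h: `DispatchTraitsSketch`, `kUbBytesPerBuffer`, and `kMaxMoveNum` are hypothetical names, and the 64 KiB budget and 64-row cap are assumed values chosen only to make the arithmetic concrete.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kUbAlign = 32;                  // scale slot size per row
constexpr std::size_t kUbBytesPerBuffer = 64 * 1024;  // assumed budget per ping/pong buffer
constexpr std::size_t kMaxMoveNum = 64;               // assumed cap on rows per batch

template <std::size_t TILE_COLS>
struct DispatchTraitsSketch {
    // One interleaved row: TILE_COLS int8 tokens plus the 32-byte scale slot.
    static constexpr std::size_t kRowBytes = TILE_COLS + kUbAlign;
    // Rows that fit in one buffer at this hidden size.
    static constexpr std::size_t kFit = kUbBytesPerBuffer / kRowBytes;
    // Rows per batch: shrink below the cap when the rows get wide.
    static constexpr std::size_t kMoveNum = kFit < kMaxMoveNum ? kFit : kMaxMoveNum;
    static_assert(kMoveNum >= 1, "one interleaved row must fit in a UB buffer");
};
```

Under these assumed numbers, a hidden size of 128 keeps the full 64 rows per batch, while a hidden size of 7168 shrinks the batch to 9 rows. Because the choice is made at compile time, an oversized hidden size fails the build instead of overflowing UB at run time.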

A5-Specific Notes

HCCL window offset erratum: A5 MTE2 DMA reads from HCCL window base bytes [16..31] return zeros. The host driver applies winOffset=256 to skip the defective region.
HCCL V2 tiling: A5 uses the V2 tiling initialization path (common.hpp from tests/npu/a5/).
Compiler target: dav-c310-vec (A5 vector core ISA).

Specification

Item	Value
Data type (token)	`int8_t`
Data type (scale)	`float` (stored in 32B-aligned rows)
Remote row format	`hiddenSize` bytes of token + `UB_ALIGN` (32)-byte scale slot (float scale at offset 0, remainder is padding)
Output token	`gmA[maxOutputSize, hiddenSize]` — compact, no padding
Output scale	`gmPerTokenScale[maxOutputSize]` — 32 bytes/row (float at offset 0)
Default hiddenSize	128
Execution model	AIV-only (vector cores), multi-rank via mpirun
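
The row format in the table implies a source stride of `hiddenSize + 32` bytes. A host-side reference splitter, such as one might use for verification, could look like the sketch below. `SplitRows` and `ScaleAt` are hypothetical helper names. The layout is taken from the table above: compact tokens, and one 32-byte scale slot per output row with the float at offset 0.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t kUbAlign = 32;  // UB_ALIGN: bytes per scale slot

// Split `rows` interleaved rows ([hiddenSize int8 tokens][32-byte scale slot])
// into a compact token buffer and a scale buffer with one 32-byte slot per row.
void SplitRows(const uint8_t* src, std::size_t rows, std::size_t hiddenSize,
               int8_t* tokens, uint8_t* scales) {
    const std::size_t stride = hiddenSize + kUbAlign;
    for (std::size_t r = 0; r < rows; ++r) {
        const uint8_t* row = src + r * stride;
        std::memcpy(tokens + r * hiddenSize, row, hiddenSize);           // token part
        std::memcpy(scales + r * kUbAlign, row + hiddenSize, kUbAlign);  // scale slot
    }
}

// Read the float stored at offset 0 of row r's scale slot.
float ScaleAt(const uint8_t* scales, std::size_t r) {
    float v;
    std::memcpy(&v, scales + r * kUbAlign, sizeof(v));
    return v;
}
```

Using `memcpy` to read the float avoids alignment and aliasing problems when the scale buffer is handled as raw bytes.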

Directory Layout

kernels/manual/a5/moe_dispatch/
├── moe_dispatch_kernel.cpp     # Device kernel: triple-path dispatch
├── main.cpp                    # Host driver: MPI init, data gen, launch, verify
├── moe_dispatch_config.h       # Shape constants, DispatchTraits, workspace layout
├── hccl_context.h              # Device-side HCCL context struct
├── CMakeLists.txt              # Build configuration (bisheng + dav-c310-vec)
├── run.sh                      # Build & run convenience script
└── README.md                   # This file

Build & Run

# Set environment
source /mnt/data/ntlab/liulei/set_env_new.sh
export HCCL_WHITELIST_DISABLE=1

# Build & run Direct path (default), 2 ranks
bash run.sh all --ep 2 --mode direct

# Build & run ViaGM path, 4 ranks
bash run.sh all --ep 4 --mode viagm

# Build & run WithSync path (CrossRankSync + Dispatch), 2 ranks
bash run.sh all --ep 2 --mode sync

# Use specific devices (start from device 4)
bash run.sh all --ep 4 --first-device 4 --mode direct

# Build only
bash run.sh build --ep 2 --hidden 128 --debug

# Run only (after build)
bash run.sh run --ep 2 --mode direct

# Clean build
bash run.sh all --ep 4 --mode viagm --clean

run.sh Parameters

Parameter	Default	Description
`--ep N`	2	Number of ranks (EP count)
`--mode direct|viagm|sync`	direct	Kernel path selection
`--first-device N`	0	First NPU device ID
`--hidden N`	128	Hidden size (K)
`--tokens N`	64	Max tokens per rank
`--max-output N`	512	Max output rows
`--experts N`	1	Experts per rank
`--clean`	—	Force clean rebuild
`--debug`	—	Enable debug mode

Relation to MegaMoE

This operator validates the Dispatch phase of MegaMoE. It can serve as a direct building block for the full MegaMoE fused operator:

MegaMoE full pipeline:
  InitRouting → [Dispatch] → GEMM (FFN) → Combine
                 ^^^^^^^^
                 This operator (WithSync covers InitRouting + Dispatch)

Interface compatibility: Parameters (cumsumMM, tokenPerExpert, preSumBeforeRank, shmemBase) match MegaMoE exactly
WithSync path: Equivalent to MegaMoE's CrossRankSyncAndlocalTokenPerExpertAllGatherAndGetSumPreRankV2 plus the dispatch portion of DispatchAndCombine
ViaGM path: Functionally equivalent to MegaMoE's DispatchCopyPerToken
Direct path: PTO-ISA optimization that bypasses the intermediate GM buffer

Reference

MegaMoE source: vllm-ascend/csrc/mc2/dispatch_ffn_combine/op_kernel/dispatch_ffn_combine_kernel.hpp
Design doc: /mnt/data/ntlab/liulei/docs/megamoe/dispatch_pto_isa_design.md
PTO-ISA TGET API: include/pto/comm/pto_comm_inst.hpp