TGET / TGET_ASYNC Bandwidth Comparison Example¶

Overview¶

This example compares the point-to-point communication bandwidth of TGET (synchronous remote read) and TGET_ASYNC (asynchronous SDMA remote read), sweeping transfer sizes from 4 KB to 4 MB, and measuring both host-side bandwidth (GB/s) and device-side average execution cycles.

TGET performs remote reads via UB (Unified Buffer) staging: Remote GM → UB → Local GM. Bandwidth saturates at approximately 4 GB/s due to UB throughput limits.
TGET_ASYNC performs remote reads via the SDMA engine: Remote GM → SDMA → Local GM. By bypassing the UB bottleneck, it reaches approximately 13–14 GB/s at 4 MB.

Supported AI Processors¶

A2/A3

Directory Layout¶

kernels/manual/a2a3/tget_bandwidth/
├── scripts/
│   └── plot_bw_compare.py           # Generate bandwidth comparison plot
├── CMakeLists.txt                   # Build configuration
├── tget_bandwidth_kernel.cpp        # Kernel implementation (AICORE + host orchestration)
├── tget_bandwidth_kernel.h          # Kernel header
├── main.cpp                         # Host-side entry point (MPI initialization)
├── run.sh                           # Convenience script
├── README_zh.md                     # Chinese version
└── README.md                        # This file

Operator Description¶

Data Flow¶

TGET (synchronous):

Peer NPU GM ──TGET──▶ Local UB ──TSTORE──▶ Local GM

TGET_ASYNC (asynchronous):

Peer NPU GM ──SDMA──▶ Local GM   (direct transfer, no UB staging)

Test Procedure¶

Each rank prepares send data in HCCL shared memory (PrepareSendBufferKernel)
The root rank runs TGET and TGET_ASYNC for each transfer size
Host-side timing measures bandwidth; device-side SYS_CNT measures cycles
Received data is verified for correctness

Specification¶

Item	Value
Data type	`float`
NPU count	2 (point-to-point)
Transfer sizes	4 KB, 16 KB, 64 KB, 256 KB, 1 MB, 4 MB
Metrics	host bandwidth (GB/s), device average cycles

Measured Performance (Reference)¶

The following measurements were collected on Ascend A2/A3 (float type, 2-NPU point-to-point).

Transfer Size	TGET BW (GB/s)	TGET_ASYNC BW (GB/s)	TGET Device Avg Cycles	TGET_ASYNC Device Avg Cycles
4 KB	0.21	0.19	50.85	118.18
16 KB	0.72	0.75	202.05	166.42
64 KB	1.75	2.55	780.73	338.10
256 KB	3.01	6.08	3347.12	1094.37
1 MB	3.75	10.48	12703.39	3791.18
4 MB	3.99	12.95	52878.12	14834.47

Analysis¶

TGET bandwidth gradually increases with transfer size but saturates at approximately 4 GB/s — the throughput ceiling of the UB staging path.
TGET_ASYNC significantly outperforms TGET for large transfers (≥256 KB), reaching approximately 13 GB/s at 4 MB, close to the theoretical SDMA engine bandwidth.
For very small transfers (4 KB), TGET_ASYNC is slightly slower than TGET due to SDMA launch overhead.

Bandwidth Comparison Plot¶

Run the plotting script to generate the comparison chart:

python3 scripts/plot_bw_compare.py

Build and Run¶

Prerequisites¶

CANN Toolkit >= 8.5.0 (TGET synchronous instruction); >= 9.0.0 (TGET_ASYNC asynchronous instruction)
MPICH (recommended; the host side loads libmpi.so via comm_mpi.h with MPICH-compatible communicator handles)
2 or more Ascend NPUs

MPI installation (MPICH recommended)¶

# Ubuntu / Debian
sudo apt install mpich libmpich-dev

# Or install under $HOME without root — see tests/README.md "Build MPICH from Source"
export PATH=$HOME/mpich/bin:$PATH
export MPI_LIB_PATH=$HOME/mpich/lib/libmpi.so

run.sh searches common MPICH install paths and sets MPI_LIB_PATH. Override with MPI_SEARCH_DIRS (space-separated list of bin/ directories).

Note: This example does not support OpenMPI. comm_mpi.h hardcodes MPICH MPI_COMM_WORLD handle values; OpenMPI uses a different communicator representation and runtime MPI calls may fail. --allow-run-as-root is OpenMPI-specific and is not supported by MPICH.

Steps¶

Configure your Ascend CANN environment:

source ${ASCEND_HOME_PATH}/bin/setenv.bash
# or source <workspace>/set_env_new.sh

Run the example (2-NPU by default). run.sh switches to its own directory automatically; invoke it from the repo root or from this directory:

# Option A: cd into this example first
cd ${git_clone_path}/kernels/manual/a2a3/tget_bandwidth
bash run.sh -r npu -v a3

# Option B: from the pto-isa-main repo root
bash kernels/manual/a2a3/tget_bandwidth/run.sh -r npu -v a3

Use -n to specify the number of ranks (default is 2):

bash run.sh -r npu -v a3 -n 2

-v a3 matches the ST test scripts and maps internally to SOC_VERSION=Ascend910B1 for the A2/A3 platform.

On success, the output looks like:

================ TGET/TGET_ASYNC Bandwidth Sweep ================
peer_rank=1 dtype=float tile_elems=1024
[BW] instr=TGET bytes=4096 iters=1000 ...
[BW] instr=TGET_ASYNC bytes=4096 iters=1000 ...
...
test success

Changelog¶

Date	Change
2026-06-01	Align docs and `run.sh` with MPICH; remove OpenMPI-only `mpirun` flag
2026-04-02	Migrated from ST test to standalone performance example