Flash Attention Performance Kernel (A5)¶

Overview¶

This example demonstrates how to implement a mixed-precision Flash Attention (FA) operator using PTO on the Ascend A5 platform, including project setup, build, and execution.

For detailed operator description, optimization strategies, and pipeline orchestration of Flash Attention, please refer to the common version documentation: ../../common/flash_atten/README.md

Supported AI Processors¶

Ascend A5

Directory Layout¶

kernels/manual/a5/flash_atten/
├── scripts/
│   ├── gen_data.py                  # Generates input and golden output
│   ├── generate_cases.py            # Generates test cases
│   ├── pipeline_log_analysis.py     # Pipeline log analysis
│   ├── pipeline_schedule_gen.py     # Pipeline schedule generation
│   └── run_timeline.sh              # Timeline analysis script
├── CMakeLists.txt                   # Build configuration
├── fa_performance_kernel.cpp        # Kernel implementation
├── fa_performance_kernel.h          # Kernel header
├── main.cpp                         # Host-side entry point
├── pto_macro_fa_gu.hpp              # FA GU macro definitions
├── pto_macro_fa_softmax.hpp         # FA Softmax macro definitions
├── pto_macro_matmul.hpp             # Matmul macro definitions
└── run.sh                           # Convenience script

Build and Run¶

Configure your Ascend CANN environment (example path):

source ${ASCEND_INSTALL_PATH}/bin/setenv.bash

Run the example:

cd ${git_clone_path}/kernels/manual/a5/flash_atten

# Run default cases (same set baked into generated_cases.*)
bash run.sh -r npu -v Ascend910_9599

# Run a single case from the generated set
bash run.sh -r npu -v Ascend910_9599 -c case_float_H_128_S0_128_S1_1024

# Provide custom cases (semicolon separated: HEAD_SIZE,S0,S1,CUBE_S0,TILE_S1)
# TILE_S1: supports 128 (=CUBE_S1), 256, 512
bash run.sh -r npu -v Ascend910_9599 --cases "128,128,1024,128,128;128,2048,2048,128,512"

# Provide custom cases and run just one of them
bash run.sh -r npu -v Ascend910_9599 --cases "128,128,1024,128,128;128,512,2048,128,128" \
  -c case_float_H_128_S0_128_S1_1024

If the run succeeds, the output prints:

test success

Performance¶

This section records reference performance numbers for the manual Flash Attention kernel on the A5 platform in this directory.

Definitions: - S0: query sequence length (rows of Q/O). - S1: key/value sequence length (rows of K/V). - Total task time (us): end-to-end kernel time per task (microseconds). - GOps: total operations counted for the task. - TFLOPS: GOps / time.

Measured Performance (Reference)¶

The following data were collected on Ascend A5:

Cores	S0	S1	Total task time (us)	GOps	TFLOPS
1	128	1024	TBD	67.11	TBD
1	128	2048	TBD	134.22	TBD
1	128	4096	TBD	268.44	TBD

Note: TBD indicates data to be measured.

Operator Description¶

For detailed implementation notes, mathematical formulas, tiling strategies, and pipeline orchestration of the Flash Attention operator, please refer to:

Common Flash Attention Documentation

The documentation includes: 1. Computation Flow (FlashAttention 2.0) 2. Tensor shape progression (per stage) 3. Per-stage implementation & optimizations (compute_qk, compute_p, compute_pv, compute_gu) 4. Pipeline orchestration & cube/vector parallelism 5. Multicore task split, tiling and load balancing

A5 Platform-Specific Optimizations¶

Compared to the A2/A3 platform, the A5 version of the Flash Attention kernel has been optimized for A5 architecture characteristics:

Adjusted buffer allocation strategy for A5's memory hierarchy
Optimized synchronization mechanisms between Cube and Vector cores
Tuned pipeline depth parameters to match A5 hardware characteristics

Specific tuning parameters and performance data will be continuously updated in future versions.