Flash Attention Performance Kernel (A5)

Overview

This example demonstrates how to implement a mixed-precision Flash Attention (FA) operator using PTO on the Ascend A5 platform, including project setup, build, and execution.

For detailed operator description, optimization strategies, and pipeline orchestration of Flash Attention, please refer to the common version documentation: ../../common/flash_atten/README.md

Supported AI Processors

  • Ascend A5

Directory Layout

kernels/manual/a5/flash_atten/
├── scripts/
│   ├── gen_data.py                  # Generates input and golden output
│   ├── generate_cases.py            # Generates test cases
│   ├── pipeline_log_analysis.py     # Pipeline log analysis
│   ├── pipeline_schedule_gen.py     # Pipeline schedule generation
│   └── run_timeline.sh              # Timeline analysis script
├── CMakeLists.txt                   # Build configuration
├── fa_performance_kernel.cpp        # Kernel implementation
├── fa_performance_kernel.h          # Kernel header
├── main.cpp                         # Host-side entry point
├── pto_macro_fa_gu.hpp              # FA GU macro definitions
├── pto_macro_fa_softmax.hpp         # FA Softmax macro definitions
├── pto_macro_matmul.hpp             # Matmul macro definitions
└── run.sh                           # Convenience script

Build and Run

  1. Configure your Ascend CANN environment (example path):
source ${ASCEND_INSTALL_PATH}/bin/setenv.bash
  1. Run the example:
cd ${git_clone_path}/kernels/manual/a5/flash_atten

# Run default cases (same set baked into generated_cases.*)
bash run.sh -r npu -v Ascend910_9599

# Run a single case from the generated set
bash run.sh -r npu -v Ascend910_9599 -c case_float_H_128_S0_128_S1_1024

# Provide custom cases (semicolon separated: HEAD_SIZE,S0,S1,CUBE_S0,TILE_S1)
# TILE_S1: supports 128 (=CUBE_S1), 256, 512
bash run.sh -r npu -v Ascend910_9599 --cases "128,128,1024,128,128;128,2048,2048,128,512"

# Provide custom cases and run just one of them
bash run.sh -r npu -v Ascend910_9599 --cases "128,128,1024,128,128;128,512,2048,128,128" \
  -c case_float_H_128_S0_128_S1_1024

If the run succeeds, the output prints:

test success

Performance

This section records reference performance numbers for the manual Flash Attention kernel on the A5 platform in this directory.

Definitions: - S0: query sequence length (rows of Q/O). - S1: key/value sequence length (rows of K/V). - Total task time (us): end-to-end kernel time per task (microseconds). - GOps: total operations counted for the task. - TFLOPS: GOps / time.

Measured Performance (Reference)

The following data were collected on Ascend A5:

Cores S0 S1 Total task time (us) GOps TFLOPS
1 128 1024 TBD 67.11 TBD
1 128 2048 TBD 134.22 TBD
1 128 4096 TBD 268.44 TBD

Note: TBD indicates data to be measured.

Operator Description

For detailed implementation notes, mathematical formulas, tiling strategies, and pipeline orchestration of the Flash Attention operator, please refer to:

The documentation includes: 1. Computation Flow (FlashAttention 2.0) 2. Tensor shape progression (per stage) 3. Per-stage implementation & optimizations (compute_qk, compute_p, compute_pv, compute_gu) 4. Pipeline orchestration & cube/vector parallelism 5. Multicore task split, tiling and load balancing

A5 Platform-Specific Optimizations

Compared to the A2/A3 platform, the A5 version of the Flash Attention kernel has been optimized for A5 architecture characteristics:

  • Adjusted buffer allocation strategy for A5's memory hierarchy
  • Optimized synchronization mechanisms between Cube and Vector cores
  • Tuned pipeline depth parameters to match A5 hardware characteristics

Specific tuning parameters and performance data will be continuously updated in future versions.