Perf-Sim User Guide¶

Perf-Sim is the pipeline-level simulator used by the PTO cost model. It does not execute a kernel on NPU hardware. Instead, it runs the __COSTMODEL version of the kernel in a normal C++ process, records PTO instructions and synchronization events, simulates pipeline scheduling, and emits total cycles, pipeline busy cycles, and a JSON trace.

When To Use It¶

Estimate the pipeline behavior and total cycles of a PTO kernel.
Compare tile shapes, preload depth, and synchronization strategies.
Check whether AIC/AIV, MTE, VEC, and FIXP pipelines overlap as expected.
Run quick performance analysis without NPU hardware or a full NPU ST flow.

Perf-Sim is intended for performance modeling and trend analysis. It does not replace hardware profiling.

Run An Existing Test¶

The FA perf-sim test is located at:

tests/costmodel/perf_sim_st/testcase/fa_perf_sim

Build and run:

cmake --build tests/costmodel/perf_sim_st/build --target fa_perf_sim --parallel 4
tests/costmodel/perf_sim_st/build/bin/fa_perf_sim

Run one test case:

tests/costmodel/perf_sim_st/build/bin/fa_perf_sim \
  --gtest_filter=FAPerfSim.LongSeq_256x64x2048

The default output directory is:

perf_sim_output/

Add A New Kernel¶

1. Use the costmodel test tree¶

Create a test under tests/costmodel/perf_sim_st/testcase/<case_name>/ and register the target:

pto_costmodel_sim_st(my_perf_sim)
target_include_directories(my_perf_sim PRIVATE
    ${PROJECT_SOURCE_DIR}/../../../kernels/manual/common/my_kernel
)

The costmodel project defines:

__COSTMODEL
__NPU_ARCH__=2201
PTO_COMM_NOT_SUPPORTED

2. Include the perf-sim launcher¶

A typical test entry includes:

#include <pto/pto-inst.hpp>
#include <pto/costmodel/perf_sim/launch.hpp>
#include <gtest/gtest.h>

#include "my_kernel.cpp"

Under __COSTMODEL, <pto/pto-inst.hpp> selects the costmodel PTO instruction implementation. launch.hpp provides LAUNCH_KERNEL, which replaces the real NPU launch in the test.

3. Write a plain C++ wrapper¶

Perf-Sim needs to call the kernel body inside the CPU process. In most cases, do not call the host-side LaunchXXX wrapper. Call the kernel function directly:

void runMyKernelCase()
{
    runMyKernel<128, 128, 1024>(
        nullptr, nullptr, nullptr);
}

TEST(MyPerfSim, Basic)
{
    LAUNCH_KERNEL(runMyKernelCase, , (1, nullptr, nullptr));
}

LAUNCH_KERNEL(func, targs, (block_dim, l2_ptr, stream), args...) uses the third argument as a simulated launch config:

Field	Meaning
`block_dim`	Number of physical AIC cores. `1` means single core, `4` means four physical cores.
`l2_ptr`	Enables the L2 cache model when non-null. Most tests pass `nullptr`.
`stream`	Unused in costmodel mode. Usually `nullptr`.

4. Guard host launch code¶

If the original kernel file contains a host launch wrapper:

void LaunchMyKernel(..., aclrtStream stream)
{
    runMyKernel<<<block_dim, nullptr, stream>>>(...);
}

The <<<...>>> syntax cannot be parsed by a normal C++ compiler. Guard host launch wrappers with #ifndef __COSTMODEL:

#ifndef __COSTMODEL
void LaunchMyKernel(..., aclrtStream stream)
{
    runMyKernel<<<block_dim, nullptr, stream>>>(...);
}
#endif

Do not guard the kernel body itself. Perf-Sim still needs to call it.

5. Handle NPU-only headers¶

Costmodel test targets include this directory first:

include/pto/costmodel/stubs

It contains costmodel-only ACL and prefetch stubs. Add new compatibility stubs there when necessary. Avoid changing public NPU headers just to satisfy the costmodel build.

Outputs¶

Each LAUNCH_KERNEL call emits:

perf_sim_output/<op_name>.json
perf_sim_output/<op_name>_pipeline_summary.csv

<op_name> is the function name passed to LAUNCH_KERNEL, for example runTFA_256x64x2048.

Text Report¶

The test prints a report similar to:

===== Perf-Sim Report: runTFA_256x64x2048 =====
Cores        : 1
Instructions : 1313
Sync events  : 1046
Total cycles : 60320

Pipeline     | AIC-0   |
-------------+--------+
      Scalar |    580 |
   MTE2(AIC) |  13585 |
        MTE1 |   7216 |
        CUBE |   8384 |
        FIXP |  16144 |
   MTE2(AIV) |  22320 |
         VEC |  72160 |
        MTE3 |   5034 |

Key fields:

Total cycles: end-to-end simulated cycles.
Pipeline busy cycles: accumulated work on each pipeline.
In multi-core cases, each column represents one physical core.

pipeline_summary.csv¶

CSV header:

op_name,core_id,unit,total_cycles,active_start_cycle,active_end_cycle,active_cycles,busy_cycles,scalar_cycles,mte2_aic_cycles,mte2_aiv_cycles,mte1_cycles,cube_cycles,fixp_cycles,vec_cycles,mte3_cycles

Each physical core emits three rows: AIC, AIV0, and AIV1. This keeps the 1C2V structure visible instead of adding both AIV busy-cycle totals together.

Column	Meaning
`unit`	`AIC`, `AIV0`, or `AIV1`
`total_cycles`	End-to-end cycles of the physical core
`active_start_cycle` / `active_end_cycle`	First and last non-zero-duration event of this unit
`active_cycles`	`active_end_cycle - active_start_cycle`; use this when comparing with CAModel core/veccore execution windows
`busy_cycles`	Sum of busy cycles in this unit
`mte2_aic_cycles` / `mte2_aiv_cycles`	MTE2 is split by AIC and AIV. AIC rows use `mte2_aic_cycles`; AIV rows use `mte2_aiv_cycles`.

Example:

op_name,core_id,unit,total_cycles,active_start_cycle,active_end_cycle,active_cycles,busy_cycles,scalar_cycles,mte2_aic_cycles,mte2_aiv_cycles,mte1_cycles,cube_cycles,fixp_cycles,vec_cycles,mte3_cycles
runTFA_64x64x512,0,AIC,12438,0,11849,11849,7592,292,2600,0,1560,1120,2020,0,0
runTFA_64x64x512,0,AIV0,12438,1488,12438,10950,9000,0,0,1366,0,0,0,7272,362
runTFA_64x64x512,0,AIV1,12438,1488,12438,10950,9000,0,0,1366,0,0,0,7272,362

Typical use:

Compare total_cycles between parameter sets.
Identify dominant pipelines, such as high vec_cycles.
Check whether rows are close in a multi-core run.

JSON Trace¶

<op_name>.json uses the Chrome Trace Event format. Open it with Perfetto or Chrome tracing:

chrome://tracing

Use the trace to inspect:

Whether AIC and AIV overlap.
Whether MTE2/MTE1/CUBE/FIXP/VEC/MTE3 form a steady pipeline.
Long idle gaps or serial regions.
Per-core balance in multi-core runs.

Reading Results¶

flowchart TD
    A["Check total_cycles"] --> B{"Expected?"}
    B -->|No| C["Open JSON trace"]
    B -->|Yes| D["Open pipeline_summary.csv"]
    C --> E["Find idle gaps or serialized stages"]
    D --> F["Compare pipeline busy cycles"]
    F --> G{"Dominant pipeline"}
    G -->|CUBE| H["Tune QK/PV tile shape or preload"]
    G -->|VEC| I["Tune vector work and AIV concurrency"]
    G -->|MTE| J["Tune movement size and data reuse"]

FAQ¶

The build fails on `<<<...>>>`¶

Costmodel compiled a host launch wrapper. Guard the wrapper with #ifndef __COSTMODEL and call the kernel body directly from the test.

The build cannot find `acl/acl.h`¶

Make sure the target includes:

include/pto/costmodel/stubs

Do not add a global stub under public include/acl for costmodel only.

AIV0 and AIV1 look very different¶

First check whether the kernel intentionally splits work by subblock. If both AIVs in the same physical core are meant to be equivalent, their task ranges and busy cycles should usually be close.

Can pipeline busy cycles exceed total cycles?¶

Yes. total_cycles is wall-clock time. Pipeline busy cycles are accumulated durations. With concurrent pipelines or multiple AIVs, accumulated busy cycles can be larger than total cycles.