Multi-core Programming

This document describes common multi-core programming patterns in PTO Tile Lib and emphasizes work partitioning styles that align with the current programming model.

It focuses on tile-based decomposition, output ownership, load balancing, and locality.

1. Overview

PTO kernels commonly follow an SPMD-style execution model: multiple cores run the same kernel body, and work assignment is derived from block or core identity.

This style matches the tile-oriented programming model used throughout the PTO documentation.

For introductory examples, see:

2. The main multi-core model used in this repository

2.1 SPMD-style work partitioning

The most common model in this repository is:

all cores execute the same kernel code
each core handles a different region of the input or output
partitioning is usually expressed in terms of rows, columns, tiles, or block ranges
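
As a minimal sketch of the list above, the helper below derives a contiguous range of work items from a core index. It is plain C++, not a PTO API: `contiguous_range`, `Range`, `core_id`, and `num_cores` are hypothetical names, and the way a kernel obtains its core identity is deliberately left to the runtime or backend documentation.

```cpp
#include <cassert>
#include <cstdint>

// Half-open range [begin, end) of work items (rows, tiles, blocks) owned by one core.
struct Range {
    std::int64_t begin;
    std::int64_t end;
};

// Hypothetical helper, not a PTO intrinsic: every core runs the same code and
// selects its own chunk of ceil(total / num_cores) items from its index.
inline Range contiguous_range(std::int64_t total, std::int64_t core_id,
                              std::int64_t num_cores) {
    assert(total >= 0 && num_cores > 0);
    assert(core_id >= 0 && core_id < num_cores);
    const std::int64_t chunk = (total + num_cores - 1) / num_cores;  // ceiling division
    std::int64_t begin = core_id * chunk;
    if (begin > total) begin = total;  // trailing cores may own an empty range
    std::int64_t end = begin + chunk;
    if (end > total) end = total;      // the last non-empty chunk may be shorter
    return {begin, end};
}
```

With 10 items and 4 cores, this assigns [0, 3), [3, 6), [6, 9), and [9, 10); with fewer items than cores, the trailing cores receive empty ranges and simply skip their loop.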

This approach is a natural fit for:

elementwise kernels
reductions over tiled data
GEMM-like kernels
tiled attention-style kernels

2.2 Why this model is preferred

SPMD-style partitioning is easier to reason about in PTO because it aligns with:

tile-based work decomposition
predictable GM access patterns
straightforward load balancing
simpler synchronization structure

In most cases, keeping each core responsible for a regular, contiguous region is preferable to introducing irregular inter-core coordination.

3. Practical partitioning guidance

3.1 Partition by output ownership

A good default strategy is to partition work by the output region each core owns.

For example:

for vector-style operators, split the output along a linear range
for matrix-style operators, split along tile rows, tile columns, or a 2D block grid
for row-wise reductions, assign one or more output rows per core

This keeps the ownership model simple:

the core that computes an output tile also stores it
intermediate state remains local when possible
inter-core write conflicts are avoided
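
For the matrix-style case, output ownership over a 2D block grid can be sketched as follows. This is an illustration in plain C++ under stated assumptions: `owned_block`, `TileBlock`, and `chunk_start` are hypothetical helpers, and the row-major numbering of cores within the grid is an assumption rather than a documented PTO convention.

```cpp
#include <cassert>
#include <cstdint>

// Rectangle of output tiles owned by one core, half-open in both dimensions.
struct TileBlock {
    std::int64_t row_begin, row_end;
    std::int64_t col_begin, col_end;
};

// Start of chunk `i` when `n` items are split into `parts` ceiling-sized chunks,
// clamped to `n` so that trailing chunks may be empty.
inline std::int64_t chunk_start(std::int64_t n, std::int64_t parts, std::int64_t i) {
    const std::int64_t chunk = (n + parts - 1) / parts;
    const std::int64_t start = i * chunk;
    return start < n ? start : n;
}

// Maps a core index to the block of output tiles it owns in a
// grid_rows x grid_cols core grid (row-major core numbering is assumed).
inline TileBlock owned_block(std::int64_t tile_rows, std::int64_t tile_cols,
                             std::int64_t grid_rows, std::int64_t grid_cols,
                             std::int64_t core_id) {
    assert(grid_rows > 0 && grid_cols > 0);
    assert(core_id >= 0 && core_id < grid_rows * grid_cols);
    const std::int64_t gr = core_id / grid_cols;
    const std::int64_t gc = core_id % grid_cols;
    return {chunk_start(tile_rows, grid_rows, gr), chunk_start(tile_rows, grid_rows, gr + 1),
            chunk_start(tile_cols, grid_cols, gc), chunk_start(tile_cols, grid_cols, gc + 1)};
}
```

Because the blocks are disjoint and together cover the whole tile grid, each output tile has exactly one writer, which is what keeps inter-core write conflicts out of the design.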

3.2 Balance work across cores

When assigning tiles to cores:

prefer partitions with similar compute cost per core
avoid assigning the whole tail region to a single core that already holds a full share; spread the remainder across cores when this can be done cleanly
keep GM access regular and contiguous where possible

A partition that is mathematically even but creates poor memory locality may still perform badly, so balance and locality should be considered together.
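
One way to spread a remainder while keeping every core's range contiguous is sketched below. `balanced_range` and `TileRange` are hypothetical names for illustration, and the sketch assumes that all tiles cost roughly the same.

```cpp
#include <cassert>
#include <cstdint>

// Half-open range [begin, end) of tiles owned by one core.
struct TileRange {
    std::int64_t begin;
    std::int64_t end;
};

// Gives every core floor(total / num_cores) tiles and hands one extra tile to
// each of the first (total % num_cores) cores, so counts differ by at most one.
inline TileRange balanced_range(std::int64_t total, std::int64_t core_id,
                                std::int64_t num_cores) {
    assert(total >= 0 && num_cores > 0);
    assert(core_id >= 0 && core_id < num_cores);
    const std::int64_t base = total / num_cores;
    const std::int64_t rem = total % num_cores;
    // Number of extra tiles already handed to cores with a smaller index.
    const std::int64_t extra_before = core_id < rem ? core_id : rem;
    const std::int64_t begin = core_id * base + extra_before;
    const std::int64_t length = base + (core_id < rem ? 1 : 0);
    return {begin, begin + length};
}
```

For 10 tiles on 4 cores this yields counts of 3, 3, 2, and 2. A floor-sized split that appends the remainder to the last core would yield 2, 2, 2, and 4, so the last core would determine the finish time.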

3.3 Prefer regular tile loops

Multi-core kernels are easier to validate and optimize when each core follows the same tile loop structure.

Typical structure:

determine the tile range owned by the current core
iterate over the assigned tile range
perform TLOAD -> transform / compute -> TSTORE
handle edge tiles through valid-region control when needed

This style follows the tile-oriented programming model described in Quickstart Tutorial and Tile Programming Model.
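
The loop structure above can be emulated on the host in plain C++. The sketch below is not PTO code: the element copies merely stand in for TLOAD and TSTORE, the doubling step stands in for the compute stage, and `process_owned_tiles`, `kTileRows`, and `kTileCols` are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kTileRows = 4;  // illustrative tile shape, not a PTO requirement
constexpr std::size_t kTileCols = 4;

// Processes tile rows [tile_row_begin, tile_row_end) of a row-major
// rows x cols matrix and writes out[i] = 2 * in[i] for every owned element.
inline void process_owned_tiles(const std::vector<float>& in, std::vector<float>& out,
                                std::size_t rows, std::size_t cols,
                                std::size_t tile_row_begin, std::size_t tile_row_end) {
    assert(in.size() == rows * cols && out.size() == rows * cols);
    assert(tile_row_end <= (rows + kTileRows - 1) / kTileRows);
    const std::size_t tile_cols_total = (cols + kTileCols - 1) / kTileCols;
    float tile[kTileRows * kTileCols];  // stand-in for an on-chip tile buffer
    for (std::size_t tr = tile_row_begin; tr < tile_row_end; ++tr) {
        for (std::size_t tc = 0; tc < tile_cols_total; ++tc) {
            const std::size_t r0 = tr * kTileRows;
            const std::size_t c0 = tc * kTileCols;
            // Edge tiles have a valid region smaller than the full tile shape.
            const std::size_t valid_rows = std::min(kTileRows, rows - r0);
            const std::size_t valid_cols = std::min(kTileCols, cols - c0);
            for (std::size_t r = 0; r < valid_rows; ++r)      // "load" stage
                for (std::size_t c = 0; c < valid_cols; ++c)
                    tile[r * kTileCols + c] = in[(r0 + r) * cols + c0 + c];
            for (std::size_t r = 0; r < valid_rows; ++r)      // "compute" stage
                for (std::size_t c = 0; c < valid_cols; ++c)
                    tile[r * kTileCols + c] *= 2.0f;
            for (std::size_t r = 0; r < valid_rows; ++r)      // "store" stage
                for (std::size_t c = 0; c < valid_cols; ++c)
                    out[(r0 + r) * cols + c0 + c] = tile[r * kTileCols + c];
        }
    }
}
```

Every core runs the same loop nest and differs only in its tile-row range, and only the valid region of an edge tile is read or written.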

4. Multi-core concerns that matter in PTO

4.1 Load balancing

Load balancing is important because PTO kernels often combine:

GM movement
layout transforms
vector or cube compute
explicit synchronization

If one core receives substantially more tiles or more expensive tiles than the others, overall throughput may be limited by the slowest core.

In practice, check:

whether the output space is partitioned evenly
whether edge tiles are concentrated on too few cores
whether some cores perform extra transform or reduction work
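
A simple way to quantify these checks is to estimate a cost per core and compare the heaviest core against the mean. The sketch below assumes the caller supplies the per-core costs (for example tile counts, optionally weighted for edge tiles or extra reduction work); `imbalance_ratio` is a hypothetical helper, not part of PTO Tile Lib.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Ratio of the heaviest core's cost to the mean cost across cores.
// 1.0 means perfectly balanced; 2.0 means the slowest core carries twice the mean.
inline double imbalance_ratio(const std::vector<double>& cost_per_core) {
    assert(!cost_per_core.empty());
    const double total =
        std::accumulate(cost_per_core.begin(), cost_per_core.end(), 0.0);
    if (total == 0.0) return 1.0;  // no work at all counts as balanced
    const double max_cost =
        *std::max_element(cost_per_core.begin(), cost_per_core.end());
    return max_cost * static_cast<double>(cost_per_core.size()) / total;
}
```

A ratio well above 1.0 suggests revisiting the partition before tuning anything inside the per-core loop.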

4.2 Memory locality

Good multi-core partitioning should preserve locality in GM.

Preferred patterns usually have:

contiguous reads or writes
repeated reuse of nearby tensor regions
stable tile shapes and strides

Poor locality often shows up as a high data-movement cost relative to compute.
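
To make the contiguity point concrete, the sketch below compares two ways of assigning tiles to cores and counts how many separate contiguous runs of tiles each core touches. It is a host-side illustration only: it assumes tiles are listed in GM address order, and `block_assignment`, `cyclic_assignment`, and `contiguous_runs` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// owner[i] is the core that processes tile i, with tiles in GM address order.
inline std::vector<std::size_t> block_assignment(std::size_t num_tiles,
                                                 std::size_t num_cores) {
    assert(num_cores > 0);
    const std::size_t chunk = (num_tiles + num_cores - 1) / num_cores;
    std::vector<std::size_t> owner(num_tiles);
    for (std::size_t i = 0; i < num_tiles; ++i) owner[i] = i / chunk;
    return owner;
}

// Round-robin assignment: tile i goes to core i % num_cores.
inline std::vector<std::size_t> cyclic_assignment(std::size_t num_tiles,
                                                  std::size_t num_cores) {
    assert(num_cores > 0);
    std::vector<std::size_t> owner(num_tiles);
    for (std::size_t i = 0; i < num_tiles; ++i) owner[i] = i % num_cores;
    return owner;
}

// Number of maximal runs of adjacent tiles owned by `core`.
inline std::size_t contiguous_runs(const std::vector<std::size_t>& owner,
                                   std::size_t core) {
    std::size_t runs = 0;
    bool in_run = false;
    for (std::size_t o : owner) {
        if (o == core && !in_run) ++runs;
        in_run = (o == core);
    }
    return runs;
}
```

Both assignments give each core the same number of tiles, but with 8 tiles and 2 cores the block assignment touches one contiguous run per core while the cyclic assignment touches four.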

4.3 Cross-core communication

This repository documents communication instructions under docs/isa/comm/, but general multi-core kernels should not assume that arbitrary producer-consumer scheduling across cores is available as the default model.

For most compute kernels, it is better to:

minimize cross-core dependencies
partition outputs cleanly
keep synchronization local to true producer-consumer relationships when required

If a kernel depends on communication instructions, see Communication ISA Reference and the corresponding instruction pages.

5. Relationship with pipeline optimization

Multi-core parallelism and pipeline overlap solve different problems:

multi-core parallelism increases throughput by distributing work across cores
pipeline overlap increases per-core utilization by overlapping load / transform / compute / store stages

A high-performance kernel usually needs both:

a sensible per-core tile partition
an efficient intra-core pipeline

For overlap and buffering guidance, see Pipeline Parallelism and Events and Synchronization.
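
The interaction between the two can be illustrated with a deliberately idealized cost model. The sketch below assumes uniform tiles, ignores pipeline fill and drain, and treats perfect overlap as costing only the slowest stage per tile; the stage times are made-up inputs and `StageTimes`, `per_tile_time`, and `kernel_time` are hypothetical names, not measurements or PTO APIs.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Per-tile time of each pipeline stage, in arbitrary units.
struct StageTimes {
    double load;
    double transform;
    double compute;
    double store;
};

// Without overlap the stages run back to back; with ideal overlap the
// steady-state cost per tile is bounded by the slowest stage.
inline double per_tile_time(const StageTimes& t, bool overlapped) {
    if (overlapped) return std::max({t.load, t.transform, t.compute, t.store});
    return t.load + t.transform + t.compute + t.store;
}

// The kernel finishes when the core with the most tiles finishes.
inline double kernel_time(const std::vector<int>& tiles_per_core,
                          const StageTimes& t, bool overlapped) {
    assert(!tiles_per_core.empty());
    const int max_tiles =
        *std::max_element(tiles_per_core.begin(), tiles_per_core.end());
    return max_tiles * per_tile_time(t, overlapped);
}
```

In this model the partition sets the tile count on the critical core and the pipeline sets the cost per tile, so improving only one of the two leaves the other factor as the bottleneck.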

6. Programming boundaries

Multi-core PTO kernels are normally described in terms of tile ownership, regular work partitioning, and explicit dependencies. The following items fall outside that description unless they are introduced by dedicated runtime or backend documents:

runtime APIs that this repository does not define, such as a get_block_idx() call with an unspecified contract
placeholder instructions such as TCOMPUTE or TFILL, where these are not actual PTO public intrinsics in the form shown
generic pseudo-syntax such as Python-style tensor slicing inside TLOAD / TSTORE
unsupported claims that MPMD is a standard public programming model for ordinary PTO kernels in this repository

Such examples may help build intuition elsewhere, but they should not be read as authoritative repository documentation.

7. Multi-core development process

A common development process is:

start from a correct single-tile or single-core structure
define output ownership per core
partition work into regular tile ranges
validate correctness on CPU simulation
tune tile sizes, partitioning, and overlap on the target backend

This workflow keeps the programming model aligned with the rest of the PTO documentation and avoids introducing unnecessary complexity too early.
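
Before moving to the target backend, the partitioning logic itself can be checked on the host by running the per-core body once per core index and comparing against a single-core run. The sketch below is a plain C++ emulation and not the PTO CPU simulation flow; `row_sums` and `run_partitioned` are hypothetical helpers, and a row-wise sum stands in for the real kernel body.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Per-core body: sums each row in [row_begin, row_end) of a row-major matrix.
// Each output row is written by exactly one caller.
inline void row_sums(const std::vector<float>& in, std::size_t cols,
                     std::size_t row_begin, std::size_t row_end,
                     std::vector<float>& out) {
    for (std::size_t r = row_begin; r < row_end; ++r) {
        float sum = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) sum += in[r * cols + c];
        out[r] = sum;
    }
}

// Emulates an SPMD launch sequentially: the same body runs once per core
// index, each on its own contiguous range of output rows.
inline std::vector<float> run_partitioned(const std::vector<float>& in,
                                          std::size_t rows, std::size_t cols,
                                          std::size_t num_cores) {
    assert(num_cores > 0 && in.size() == rows * cols);
    std::vector<float> out(rows, 0.0f);
    const std::size_t chunk = (rows + num_cores - 1) / num_cores;
    for (std::size_t core = 0; core < num_cores; ++core) {
        const std::size_t begin = std::min(core * chunk, rows);
        const std::size_t end = std::min(begin + chunk, rows);
        row_sums(in, cols, begin, end, out);
    }
    return out;
}
```

Comparing `run_partitioned(in, rows, cols, 1)` with the result for several core counts, including counts that do not divide the row count, exercises the ownership and tail handling independently of any backend.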

8. Notes

Multi-core programming in PTO Tile Lib is generally organized around:

SPMD-style work partitioning;
output ownership and regular tile ranges;
regular, contiguous, and balanced access patterns;
coordination between multi-core partitioning and per-core pipeline optimization.