References and Further Reading

This document collects references for PTO development, including official documentation, example code, academic papers, online resources, and further reading, to help developers deepen their understanding of PTO programming.

1. Official Documentation

PTO-ISA Core Documentation

  • PTO Virtual ISA Manual
    • Complete PTO instruction set architecture specification
    • Hardware abstraction model
    • Programming model details

  • ISA Instruction Reference
    • Detailed description of all PTO instructions
    • Instruction syntax and semantics
    • Usage examples

  • Programming Guide
    • PTO programming introduction
    • Best practices
    • Common patterns

Topic-Specific Documentation

  • CANN Documentation


2. Example Code

Basic Examples

  • Add Operator
    • Simple element-wise addition
    • Basic Tile operations
    • Multi-core parallelism
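
To make the pattern concrete, here is a minimal NumPy sketch (illustrative only, not PTO code; the function name and tile size are assumptions) of an element-wise add processed in fixed-size tiles, the same streaming pattern a tiled Add kernel follows:

```python
import numpy as np

def tiled_add(a, b, tile=1024):
    """Element-wise add, one fixed-size tile at a time (illustrative sketch)."""
    out = np.empty_like(a)
    for start in range(0, a.size, tile):
        end = min(start + tile, a.size)  # last tile may be partial
        out.flat[start:end] = a.flat[start:end] + b.flat[start:end]
    return out
```

Each loop iteration stands in for one tile moving through on-chip memory; on real hardware, tiles would additionally be distributed across cores and processed in parallel.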

  • ReLU Operator
    • Activation function implementation
    • Conditional operations
    • Performance optimization

  • Softmax Operator
    • Reduction operations
    • Numerical stability
    • Row-wise processing
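
The two key ideas above, numerical stability and row-wise processing, can be sketched in a few lines of NumPy (a reference implementation for checking results, not PTO code):

```python
import numpy as np

def softmax_rows(x):
    # Subtract each row's max before exponentiating: exp() then cannot
    # overflow, and the result is mathematically unchanged.
    shifted = x - x.max(axis=-1, keepdims=True)  # row-wise max reduction
    e = np.exp(shifted)
    return e / e.sum(axis=-1, keepdims=True)     # row-wise sum + broadcast
```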

Advanced Examples

  • GEMM Optimization
    • Matrix multiplication optimization
    • Tiling strategies
    • Pipeline optimization
    • Performance tuning
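
The core tiling idea can be sketched as a blocked matrix multiply in NumPy (a didactic reference, not an optimized PTO kernel; the tile size T is an assumption):

```python
import numpy as np

def gemm_tiled(A, B, T=32):
    """C = A @ B computed tile by tile, so each block of A, B, and C can
    fit in fast memory. Illustrates the loop structure only."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, T):
        for j in range(0, N, T):
            for k in range(0, K, T):  # accumulate partial products over K
                C[i:i+T, j:j+T] += A[i:i+T, k:k+T] @ B[k:k+T, j:j+T]
    return C
```

NumPy slices clip at array bounds, so ragged edge tiles are handled without extra code; a real kernel has to manage tail tiles explicitly.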

  • Flash Attention
    • Attention mechanism implementation
    • Memory-efficient algorithm
    • Operator fusion
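
The memory-efficiency trick, processing keys and values block by block with a running (online) softmax so the full score matrix is never materialized, can be sketched for a single query row in NumPy (a simplified reference under assumed shapes, not the full algorithm from the paper or a PTO implementation):

```python
import numpy as np

def flash_attn_row(q, K, V, block=64):
    """Attention output for one query q against K, V, visiting K/V in
    blocks while keeping a running max m and normalizer l."""
    m = -np.inf                     # running max of scores seen so far
    l = 0.0                         # running softmax denominator
    acc = np.zeros(V.shape[1])      # running weighted sum of values
    for s in range(0, K.shape[0], block):
        scores = K[s:s+block] @ q
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)   # rescale old state to the new max
        p = np.exp(scores - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[s:s+block]
        m = m_new
    return acc / l
```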

  • LayerNorm
    • Normalization layer
    • Reduction and broadcast
    • Numerical precision
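
As a reference for the reduction-and-broadcast structure, here is a plain NumPy LayerNorm (the parameter names gamma/beta and the epsilon value are conventional assumptions, not taken from the PTO example):

```python
import numpy as np

def layernorm(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)    # reduce over the feature axis
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)  # eps guards near-zero variances
    return x_hat * gamma + beta            # broadcast scale and shift back
```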

Custom Operator Examples

  • Fused Add-ReLU-Mul
    • Operator fusion example
    • Three implementation versions
    • Progressive optimization
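
The payoff of fusion can be illustrated in NumPy by contrasting an unfused version, which materializes a full-size temporary between passes, with a fused tile-at-a-time version whose intermediate stays tile-sized (an analogy for the memory-traffic argument only; the actual example is written in PTO):

```python
import numpy as np

def add_relu_mul_unfused(a, b, c):
    t = a + b                  # pass 1: writes a full-size temporary
    t = np.maximum(t, 0.0)     # pass 2: reads it back for ReLU
    return t * c               # pass 3: final multiply

def add_relu_mul_fused(a, b, c, tile=1024):
    out = np.empty_like(a)
    for s in range(0, a.size, tile):
        e = min(s + tile, a.size)
        t = a.flat[s:e] + b.flat[s:e]  # intermediate stays tile-sized
        out.flat[s:e] = np.maximum(t, 0.0) * c.flat[s:e]
    return out
```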

3. Academic Papers

Tensor Compilers and DSLs

  • TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
    • Chen et al., OSDI 2018
    • Tensor compiler framework
    • Automatic optimization

  • Halide: A Language and Compiler for Optimizing Parallelism, Locality, and Recomputation
    • Ragan-Kelley et al., PLDI 2013
    • Image processing DSL
    • Schedule separation

  • Tiramisu: A Polyhedral Compiler for Expressing Fast and Portable Code
    • Baghdadi et al., CGO 2019
    • Polyhedral model
    • Code generation

Hardware Accelerators

  • In-Datacenter Performance Analysis of a Tensor Processing Unit
    • Jouppi et al., ISCA 2017
    • Google TPU architecture
    • Performance analysis

  • NVIDIA A100 Tensor Core GPU: Performance and Innovation
    • NVIDIA, 2020
    • GPU architecture
    • Tensor Core design

Optimization Techniques

  • FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
    • Dao et al., NeurIPS 2022
    • Memory-efficient attention
    • Tiling strategy

  • FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
    • Dao, 2023
    • Improved parallelism
    • Work partitioning

4. Online Resources

Official Websites

  • Ascend Community
    • Official Ascend platform
    • Documentation and downloads
    • Community forum

  • CANN GitHub
    • CANN source code
    • Issue tracking
    • Contribution guide

Tutorials and Blogs

Community Forums


5. Related Projects

Tensor Compilers

Deep Learning Frameworks

  • PyTorch
    • Dynamic computation graphs
    • Python-first design
    • Rich ecosystem

  • TensorFlow
    • Production-ready
    • Multi-platform support
    • Comprehensive tools

  • MindSpore
    • Huawei AI framework
    • Native Ascend support
    • Auto-parallelism

Performance Tools

  • msprof
    • Ascend profiler
    • Performance analysis
    • Bottleneck identification

  • NVIDIA Nsight
    • GPU profiler
    • System-wide analysis
    • Visualization

6. Tools and Libraries

Development Tools

  • CMake (>= 3.16)
    • Build system generator
    • Cross-platform support
    • cmake.org

  • GCC (>= 13.0) / Clang (>= 15.0)
    • C++20 compiler
    • Optimization support
    • gcc.gnu.org

  • Python (>= 3.8)
    • Scripting and testing
    • Framework integration
    • python.org

Debugging Tools

Performance Analysis

  • perf
    • Linux profiler
    • Hardware counters
    • System-wide analysis

  • Intel VTune
    • CPU profiler
    • Microarchitecture analysis
    • intel.com/vtune

7. Recommended Books

Computer Architecture

  • Computer Architecture: A Quantitative Approach
    • Hennessy & Patterson
    • Classic architecture textbook
    • Performance analysis

  • Modern Processor Design: Fundamentals of Superscalar Processors
    • Shen & Lipasti
    • Pipeline design
    • Instruction-level parallelism

Parallel Programming

  • Programming Massively Parallel Processors
    • Kirk & Hwu
    • GPU programming
    • CUDA fundamentals

  • Parallel Programming in C with MPI and OpenMP
    • Quinn
    • Parallel patterns
    • Performance optimization

Compiler Design

  • Engineering a Compiler
    • Cooper & Torczon
    • Compiler construction
    • Optimization techniques

  • Advanced Compiler Design and Implementation
    • Muchnick
    • Advanced optimizations
    • Code generation

Deep Learning Systems

  • Deep Learning Systems: Algorithms, Compilers, and Processors
    • Sze et al.
    • DL accelerators
    • System design

Contributing

We welcome contributions to improve this documentation.


License

This documentation is licensed under Apache License 2.0.


Contact