Unlocking PyTorch’s Power: Introducing TorchInductor 2.0

  1. Introduction
  2. Principles of TorchInductor
    • TorchInductor as PyTorch Native
    • Python-First Approach
    • Focus on General Design
  3. Technologies Used in TorchInductor
    • Define-by-Run Loop-Level IR
    • Support for Dynamic Shapes and Strides with SymPy
    • Leveraging the Triton Language
  4. Understanding the Define-by-Run Loop-Level IR
  5. Compiler Stack Overview
    • Graph Capture and Operator Set Decomposition
    • Graph Lowerings for Simplicity
    • Scheduling for Fusion and Optimization
    • Code Generation with Backend and Wrapper Code Gen
  6. GPU and CPU Performance Results
    • Impressive Speedups on GPU Benchmarks
    • Collaboration with the Intel PyTorch Team for CPU Support and Generality
  7. Conclusion
  8. References

In this article, we will delve into TorchInductor, the new compiler backend for PyTorch 2.0. TorchInductor aims to faithfully reproduce PyTorch’s behavior while offering a more generalized and scalable approach to compiler design. We will explore the principles behind TorchInductor, the key technologies it uses, and its inner workings. We will also examine the compiler stack and its components. Lastly, we will discuss the performance results achieved on both GPU and CPU platforms and conclude with the future prospects of TorchInductor.

TorchInductor as PyTorch Native

One of the fundamental principles behind TorchInductor is its native integration with PyTorch. By leveraging abstractions similar to those of PyTorch’s eager mode, TorchInductor can accurately capture the behavior of PyTorch. This native integration ensures that the compiled code faithfully reproduces the functionality and behavior of the original PyTorch program.

Python-First Approach

TorchInductor embraces Python as its primary implementation language. This makes it easier for developers to hack on and extend the compiler. By relying on Python, TorchInductor offers a familiar, flexible environment for implementing optimizations and exploring new compiler techniques.

Focus on General Design

Another crucial principle of TorchInductor is its emphasis on breadth rather than depth. Early in its development, TorchInductor prioritized tackling complex operators and optimizations to ensure a general and scalable design. This approach enables TorchInductor to handle a wide range of operators and optimizations, making it applicable to many use cases.

Define-by-Run Loop-Level IR

TorchInductor employs a define-by-run loop-level IR (intermediate representation) as a core technology. In this approach, the compiler’s internal representation is a Python callable object that operates at the loop level: code generation and analysis are performed by executing the IR, a novel approach in the compiler space. Define-by-run execution is already familiar from PyTorch programs themselves, giving TorchInductor a reliable foundation.

Support for Dynamic Shapes and Strides with SymPy

TorchInductor’s support for dynamic shapes and strides is built on SymPy, a symbolic math library. SymPy lets the compiler reason about shapes symbolically and generate code that is not specialized to specific input sizes. This provides flexibility and adaptability, making TorchInductor suitable for varying input dimensions.
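To make the idea concrete, here is a minimal sketch of the kind of symbolic reasoning SymPy enables. The symbol names (s0, s1) and the contiguous-stride example are illustrative assumptions, not TorchInductor’s actual internals:

```python
import sympy

# Hypothetical sketch: symbolic sizes stand in for a tensor's dynamic dimensions,
# so derived quantities (strides, flat offsets) stay valid for any concrete size.
s0 = sympy.Symbol("s0", positive=True, integer=True)  # e.g. a dynamic batch size
s1 = sympy.Symbol("s1", positive=True, integer=True)  # e.g. a dynamic feature size

# Strides of a contiguous (s0, s1) tensor, expressed symbolically.
strides = (s1, sympy.Integer(1))

# The flat offset of element (i, j) is itself a symbolic expression that a
# compiler can simplify and compare without knowing concrete input sizes.
i, j = sympy.symbols("i j", integer=True)
flat_offset = sympy.simplify(i * strides[0] + j * strides[1])
print(flat_offset)  # i*s1 + j
```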

Leveraging the Triton Language

TorchInductor takes inspiration from users who have written high-performance kernels in Triton, a state-of-the-art language developed by Philippe Tillet at OpenAI. On GPUs, TorchInductor’s compiler generates Triton code, which is easy to understand and inspect; on CPUs, it generates C++ code. Triton offers a higher-level language than CUDA, striking a balance between expressiveness and performance. This language choice keeps TorchInductor a versatile and efficient compiler for diverse hardware backends.
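For flavor, here is a small hand-written Triton kernel in the general style that generated code resembles. The kernel name, block size, and launch helper are illustrative, not TorchInductor output:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized slice of the flat buffers.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n_elements  # guard the tail when n_elements % BLOCK != 0
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Expects CUDA tensors, e.g. torch.randn(4096, device="cuda").
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)
    return out
```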

The inner workings of TorchInductor’s define-by-run loop-level IR are easiest to grasp through an example. Consider the IR for a “permute” followed by an “add”: the inner function in the IR takes a list of SymPy expressions that symbolically represent the coordinate to generate, and within its body it calls specific operations such as “ops.load” and “ops.add”. Because the IR is an ordinary Python callable, it is easy to customize and can be adapted for different purposes, such as analysis or code generation, simply by running it with a different handler behind “ops”. This define-by-run IR remains highly manipulable throughout the compilation process.
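The following self-contained sketch conveys the idea under stated assumptions: the handler class, buffer names, and inner_fn are illustrative stand-ins rather than TorchInductor’s real classes, but the shape of the IR (a callable over SymPy indices issuing ops.* calls) mirrors the description above.

```python
import sympy

class StringCodegen:
    """Toy ops handler: turns ops.* calls into C-like expression strings.
    Running the same IR with a different handler reuses it for analysis."""
    def load(self, buffer, index):
        return f"{buffer}[{index}]"
    def add(self, a, b):
        return f"({a} + {b})"

# Symbolic input sizes (stand-ins for dynamic shapes).
s0, s1 = sympy.symbols("s0 s1", positive=True, integer=True)

def inner_fn(ops, index):
    # "index" is a list of SymPy expressions naming one output coordinate.
    i0, i1 = index
    a = ops.load("arg0", i1 * s0 + i0)  # permuted (transposed) read
    b = ops.load("arg1", i0 * s1 + i1)  # contiguous read
    return ops.add(a, b)

# "Running" the IR with the codegen handler yields the body of a loop nest.
i0, i1 = sympy.symbols("i0 i1", integer=True)
print(inner_fn(StringCodegen(), [i0, i1]))
# prints something like: (arg0[i0 + i1*s0] + arg1[i0*s1 + i1])
```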

The TorchInductor compiler stack comprises several components, each playing a crucial role in the compilation process. Here is an overview of how the stack operates (a brief end-to-end usage sketch follows the list):

  1. Graph Capture and Operator Set Decomposition: TorchInductor starts from the computation graph captured by TorchDynamo. It then decomposes PyTorch’s large operator set into a more manageable set of around 250 primitive operators using AOT Autograd (ahead-of-time autograd) and PrimTorch.

  2. Graph Lowerings for Simplicity: The graph-lowering phase simplifies the graph by converting the reduced operator set into TorchInductor’s loop-level IR, which consists of approximately 50 operators. This simplification enables more efficient optimization and code generation.

  3. Scheduling for Fusion and Optimization: Scheduling determines which operations are fused together and performs various optimizations, including memory planning and tiling. This phase ensures that the generated code is highly optimized and uses hardware resources efficiently.

  4. Code Generation with Backend and Wrapper Code Gen: The final stage is code generation, which has two main components. Backend code generation produces either Triton code or C++ code, depending on the target hardware. Wrapper code generation stitches together the individual kernel calls, essentially replacing the interpreter part of the compiler stack. This step ensures that the compiled code is coherent and performant.
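As a minimal usage sketch, the whole stack is driven from torch.compile; the function and tensor shapes below are illustrative, and TORCH_COMPILE_DEBUG is an environment variable that dumps the generated code for inspection:

```python
import torch

def f(x, y):
    # An illustrative model: permute + add + relu, matching the IR example above.
    return (x.permute(1, 0) + y).relu()

# torch.compile routes through TorchDynamo -> AOT Autograd -> TorchInductor.
compiled_f = torch.compile(f, backend="inductor")

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(64, 128, device=device)
y = torch.randn(128, 64, device=device)
out = compiled_f(x, y)  # first call triggers compilation; later calls reuse it

# Run with TORCH_COMPILE_DEBUG=1 set in the environment to dump the generated
# Triton (GPU) or C++ (CPU) code for inspection.
```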

The performance of TorchInductor has been evaluated on both GPU and CPU platforms, showing impressive speedups:

  • GPU: TorchInductor achieves up to a 1.86x geometric mean speedup on large, realistic benchmark suites. These results demonstrate the substantial performance gains achievable with the TorchInductor compiler stack, making it an exciting development for GPU-accelerated deep learning.

  • CPU: In collaboration with the Intel PyTorch team, TorchInductor achieves up to a 1.26x geometric mean speedup on CPU inference tasks. These results not only highlight the broad hardware support of TorchInductor but also emphasize its focus on general applicability. By generating both Triton code and C++ code, TorchInductor ensures compatibility and performance across various hardware backends.

TorchInductor introduces a novel compiler backend for PyTorch, built on the principles of PyTorch-native integration, a Python-first approach, and general design. With unique features such as a define-by-run loop-level IR, support for dynamic shapes and strides via SymPy, and code generation through the Triton language, TorchInductor achieves compelling performance results on both GPU and CPU platforms. As an open-source project within the PyTorch repository, TorchInductor offers a promising direction for compiler optimizations in deep learning frameworks.

  1. TorchInductor Code Base: GitHub – PyTorch
  2. TorchInductor Blog Post: [URL of the Blog Post]
