High-performance code is essential to many of the most exciting applications of computing. Without core kernels that exploit the best hardware we can build near the peak of its potential—from CPUs to GPUs to DSPs and neural accelerators—we wouldn't have seen the past decade's breakthroughs in deep learning, smartphone photography, virtual reality, or self-driving cars.
But writing high-performance code is difficult. On today's hardware, the best-optimized code is often orders of magnitude faster than a clean, straightforward implementation—even in a "fast" language like C.
For decades we have wished for compilers that could optimize our programs automatically. While compilers have made great strides, hardware continues to get more complex and demanding. From deep memory hierarchies, to convoluted vector units, to accelerators like GPUs, DSPs, and neural engines, making fast code on modern machines requires worrying about much more than just minimizing the number of sequentially executed instructions. The "sufficiently smart" compiler that can automatically and consistently turn high-level source code into the very best optimized implementation remains as elusive as ever.
As a result, much of the critical code at the core of everything from neural networks to Adobe Photoshop to your favorite video game is still painstakingly hand-optimized by human performance engineers.
Getting the level of control required to implement these optimizations has traditionally meant tearing away abstractions and writing directly in low-level C/C++, assembly, or CUDA code. But rewriting code at this level is slow, expensive, and error-prone: programmers routinely spend weeks or even months optimizing a single algorithm for a specific target machine.
To make high-performance code easier to write, a new class of programming languages has become popular over the past decade. User-schedulable languages start from high-level, often domain-specific descriptions of the computation to be performed. Then, instead of relying on a black-box compiler to automatically generate the best code to compute it on a particular machine, these languages expose explicit control over how it should be computed—including essential choices like how to structure loops and arrays, map to parallel threads and vector instructions, and block computation for better memory locality—via what they often call a schedule.
This separation of concerns is powerful because it decouples the essence of what you want to compute from all the details of how you want to optimize it. Factored apart in this way, an optimized program takes much less code, it's easy to optimize for different hardware just by changing the schedule, and many of the most common bugs (reading out of bounds, accessing values before they are initialized, or otherwise deviating from the high-level specification) cannot even be introduced. Because of these advantages, user-schedulable languages have become popular in key parts of industry. Indeed, there is probably code written in one of these languages on your phone or PC right now!
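To make the what/how split concrete, here is a minimal sketch in plain Python (not the API of any real scheduling language such as Halide; all names are illustrative). The algorithm defines what each output element is; two schedules compute it with different loop structures, and both must produce identical results:

```python
# Algorithm: WHAT to compute -- a 3-point box blur at position i.
def blur_at(inp, i):
    return (inp[i - 1] + inp[i] + inp[i + 1]) / 3

# Schedule 1: HOW -- a single sequential loop over the interior.
def schedule_naive(inp):
    return [blur_at(inp, i) for i in range(1, len(inp) - 1)]

# Schedule 2: HOW -- the same computation iterated in tiles,
# the kind of restructuring done for better memory locality.
def schedule_tiled(inp, tile=4):
    out = []
    n = len(inp) - 2  # number of interior elements
    for start in range(1, n + 1, tile):
        for i in range(start, min(start + tile, n + 1)):
            out.append(blur_at(inp, i))
    return out

data = list(range(16))
assert schedule_naive(data) == schedule_tiled(data)
```

Changing the schedule changes performance characteristics but, by construction, never the result—which is why schedule-level bugs such as out-of-bounds reads cannot alter the computed values.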
However, the first generation of user-schedulable languages has major limitations. These languages implement schedules as parameters on a lowering process from high-level source to low-level optimized code. Their scheduling interfaces are languages, in the sense that they can express a large space of optimizations by composing a relatively small number of fundamental primitives, but they lack facilities fundamental to effective programming, such as abstraction, higher-order composition, and extensibility.
The breakthrough of the following paper is in fundamentally rethinking the design of user-schedulable languages so that decades of wisdom from traditional programming languages can be brought to bear.
The key insight underlying it all is the idea that, instead of implementing schedules as parameters on a monolithic lowering from high- to low-level code, user-guided optimizations can instead be implemented as rewrites within a single (functional) language. Adding new types of optimization to prior user-schedulable languages requires extending the compiler's core lowering process, and worrying about the potential interaction of the new feature with every other—an often herculean task. Rewrites, in contrast, are naturally composable, extensible, and easy to reason about: the definition and correctness of each new rewrite is independent of all the others.
Building on this, the authors show how classic ideas from functional programming can then be applied directly to abstract and compose rewrites into higher-level rewriting strategies. In contrast to earlier scheduling languages, we now have a way to specify optimizations and build high-performance libraries at ever-higher levels of abstraction.
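The flavor of this approach can be sketched in a few lines of Python (hypothetical names, in the spirit of strategy languages such as ELEVATE—this is not that system's API). Each rewrite is a function from an expression to a rewritten expression or `None` on failure, and higher-order combinators compose individual rewrites into strategies:

```python
# Two independent local rewrites; the correctness of each is
# self-contained and does not depend on any other rewrite.
def mul_one(e):   # ("mul", x, 1)  ->  x
    return e[1] if isinstance(e, tuple) and e[0] == "mul" and e[2] == 1 else None

def add_zero(e):  # ("add", x, 0)  ->  x
    return e[1] if isinstance(e, tuple) and e[0] == "add" and e[2] == 0 else None

# Combinators: build strategies out of rewrites.
def first(*rules):
    """Try each rewrite in order; return the first success, else None."""
    def strat(e):
        for r in rules:
            out = r(e)
            if out is not None:
                return out
        return None
    return strat

def try_(rule):
    """Apply the rewrite if it succeeds; otherwise keep e unchanged."""
    def strat(e):
        out = rule(e)
        return e if out is None else out
    return strat

def everywhere(rule):
    """Apply the rewrite (where applicable) at every node, bottom-up."""
    def strat(e):
        if isinstance(e, tuple):
            e = tuple(strat(c) for c in e)
        return try_(rule)(e)
    return strat

simplify = everywhere(first(mul_one, add_zero))
assert simplify(("add", ("mul", "x", 1), 0)) == "x"
```

Note how `simplify` is an ordinary value built by function composition: new rewrites can be added, abstracted over, and reused without touching a compiler's lowering pipeline.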
What is the result? Using this approach, the authors show that it's possible to describe the same optimizations as state-of-the-art user-schedulable languages, and achieve the same blistering performance, using no complex domain-specific schedule compiler, just rich compositions of simple, extensible rewrites. By finally putting user-schedulable languages on solid PL foundations, this is a big step towards making high-performance programming easy, productive, and safe.
To view the accompanying paper, visit doi.acm.org/10.1145/3580371
The Digital Library is published by the Association for Computing Machinery. Copyright © 2023 ACM, Inc.