Sign In

Communications of the ACM


Research for Practice: OS Scheduling

pendulum swinging to the right, illustration

Credit: Andrij Borys Associates, Pelfophoto

back to top 

In any system that multiplexes resources, the problem of scheduling what computations run where and when is perhaps the most fundamental. Yet, like many other essential problems in computing (for example, query optimization in databases), academic research in scheduling moves like a pendulum, with periods of intense activity followed by periods of dormancy when it is considered a "solved" problem.

It is exciting to witness this pendulum swing back, and we are very lucky to have Kostis Kaffes to curate a recent selection of outstanding research in scheduling. His choices highlight breakthroughs related to performance, extensibility, and policy choice. The first paper challenges the putative trade-off between low latency (typically achieved by provisioning dedicated cores) and high utilization (which requires core reallocation) by making allocation decisions possible at single-microsecond granularity. The second enables the creation of arbitrary scheduling policies by factoring apart the creation and manipulation of policy, which can be handled completely by user-space agents, from the fixed kernel mechanisms that communicate events to agents and apply scheduling decisions. Finally, with the ability to perform load-balancing and allocation decisions based on flexible policies at microsecond granularity on the table, the final selection addresses the choice of policy on an application-by-application basis.

Kostis has put together a magnificent and cohesive selection, with an introduction that will make readers who are thinking about research in scheduling for the first time comfortable. Enjoy.
    —Peter Alvaro

Scheduling has long been one of the most fundamental operations in systems and networks. It involves assigning tasks to CPUs and switching between them—decisions critical both for application performance and system efficiency. For a long time, operating system (OS) scheduling focused on fairness. However, two recent developments have led to a renaissance in OS scheduling research. First, the emergence of cloud computing lent importance to different, hard-to-optimize metrics such as tail latency and timescales—that is, microsecond (μs) scale—which were not considered by legacy schedulers. Second, the end of Moore's Law made necessary the specialization of the system stack, including scheduling, to continue making performance improvements.

This edition of Research for Practice presents three papers that look at ways to bring OS scheduling into the modern age. The first paper focuses on building the fastest possible scheduler, while the second one aims to make it easy to implement new policies that are compatible with existing applications and operating systems. The third paper explores the best scheduling policies for different types of applications. Ultimately, all three papers provide valuable contributions to the ongoing efforts to develop better scheduling policies for modern computing systems.

Microsecond-Scale Core Reallocation
A. Ousterhout, J. Fried, J. Behrens, A. Belay, and H. Balakrishnan
Shenango: Achieving high CPU efficiency for latency-sensitive datacenter workloads. In Proceedings of the 16th Usenix Symposium on Networked Systems Design and Implementation, 2019;

The first paper by Ousterhout et al., answers the fundamental questions of how fast core allocation can take place in an operating system and whether such reallocation is beneficial for application performance. The system presented in the paper, called Shenango, challenges the widely held belief that allocating cores across applications at microsecond timescales is not feasible because of high overhead and the potential for cache pollution. The authors show that rapid core reallocation is indeed possible and can yield significant performance benefits.

The key to achieving microsecond-scale core reallocation in the Shenango operating system is the use of a dedicated scheduling core that could make allocation decisions every 5 μs. To determine when to allocate or reclaim cores from an application, Shenango monitors the length of each application's thread-run queues and network-packet queues, using the derivative of their length as a congestion signal. This algorithm runs entirely on the dedicated core, which also manages the steering of incoming network packets to their respective target application cores.

The authors demonstrated the efficacy of this approach by showing how fine-grain core reallocation improved the performance of latency-sensitive and batch applications co-located on the same system. By allocating cores based on the instantaneous input packet rate, Shenango reduced latency for the former and increased throughput for the latter by more than six times when using a 5 μs core reallocation interval compared with a 100 μs interval.

Subsequent work has shown that Shenango's microsecond-scale scheduler can also help mitigate interference in other system resources, such as last-level caches and memory bandwidth, and provide fine-grain feedback to the network to prevent overload.

A Framework for Deploying OS Scheduling to Linux
J.T. Humphries, et al.
ghOSt: Fast & Flexible User-space Delegation of Linux Scheduling. In Proceedings of the 28th ACM Symposium on Operating Systems Principles, 2021;

Building a highly efficient scheduler such as Shenango is an interesting academic exercise, but it is not immediately applicable to a production environment with constraints such as backward compatibility with existing applications and operating systems such as Linux. This is why a team of Google engineers built a framework, called ghOSt, that makes it easy to implement different scheduling policies and deploy them to the Linux kernel.

The key observation behind ghOSt's design is that scheduling does not need to take place in the kernel. Taking inspiration from microkernels, ghOSt delegates OS scheduling to user-space agents, either global or per CPU. It consists of a bare-bones kernel scheduling class that only communicates scheduling events (for example, a thread became runnable) to these agents, which make the scheduling decision of what to run where and then communicate it back to the kernel. Thus, developers enjoy the agility of user-space development without being burdened by the limitations of kernel code and long deployment cycles. Furthermore, ghOSt allows applications to share hints with the scheduling agents over shared memory that enable the latter to make more informed scheduling decisions.

The biggest challenge that ghOSt had to overcome was the communication latency between the kernel component and the user-space agents, which can take up to 5 μs. This can cause race conditions (for example, the user-space agent schedules a thread to a CPU that has been removed from the thread's CPU mask) and underutilization as a CPU remains idle waiting for the agent's scheduling decision.

ghOSt avoids race conditions by implementing a transaction API over shared memory that allows the agent to atomically commit scheduling decisions. To alleviate the second problem, the authors propose using custom eBPF programs that run locally on each core and schedule tasks temporarily until the agent's decision is received. The same techniques are also applicable when offloading other operating system functionality to user space (for example, memory management).

Choosing the Best Scheduling Policy Option
S. McClure, A. Ousterhout, S. Shenker, S. Ratnasamy
Efficient scheduling policies for microsecond-scale tasks. In Proceedings of the 19th Usenix Symposium on Networked Systems Design and Implementation, 2022;

After the introduction of ghOSt, which allows for easy development and deployment of custom scheduling policies, the question arises as to which policy each application should use. To answer this question, McClure et al. conducted a comprehensive analysis of the options available.

The key observation behind ghOSt's design is that scheduling does not need to take place in the kernel.

The authors divided the scheduling process into two distinct decisions: allocating cores across applications and load-balancing tasks across CPUs within each application. Surprisingly, they found the second decision is relatively simple; work-stealing is the best load-balancing policy in terms of both latency and efficiency, regardless of task service time distribution, number of cores, core-allocation policy, and load-balancing overheads.

In contrast, the core-allocation decision is much more nuanced. For example, contrary to past work, the authors found that proactively reclaiming cores from applications based on average delay or utilization performs better for small tasks than waiting for a CPU to become idle. They also discovered that when serving small tasks, it is better to assign a fixed number of CPUs to each application instead of allocating them dynamically.

This analysis opens new areas of research, such as the development of new hardware that implements a scalable global queue, which in simulation was shown to perform even better than work-stealing. Furthermore, this study did not consider the presence of preemption, suggesting a need for further examination of how preemption affects scheduling decisions.

Back to Top


These three papers make significant contributions to an ongoing effort to develop better scheduling policies for modern computing systems. The papers highlight the need for better, more efficient, and more flexible OS schedulers; open new areas of research; and demonstrate the importance of continued development and innovation in OS scheduling policies.

Back to Top


Kostis Kaffes is an assistant professor at Columbia University, New York, NY, USA.

Peter Alvaro is an associate professor of computer science at the University of California Santa Cruz, where he leads the Disorderly Labs research group (

Copyright held by author/owner. Publication rights licensed to ACM.
Request permission to publish from

The Digital Library is published by the Association for Computing Machinery. Copyright © 2023 ACM, Inc.



No entries found

Sign In for Full Access
» Forgot Password? » Create an ACM Web Account
Article Contents: