Beyond Ctrl-C: The dark corners of Unix signal handling
RustConf 2024 is next week, so I thought I’d put up a written version of my RustConf 2023 talk about signals in time for that. I’ve tried to retain the feel of it being a talk, while editing it heavily to make it readable as a blog entry. Some links:
- Video of the talk on YouTube.
- Slides on Google Slides.
- Demo repository on GitHub.
1. Introduction
Imagine you’re in the middle of a conversation, when suddenly, a tap on the shoulder interrupts your train of thought. You turn to face the interloper, only to find a dear friend with an urgent message. In that moment, you’re faced with a choice: do you ignore the interruption and continue your conversation, or do you pause to address your friend’s needs?
In the world of computing, this tap on the shoulder is akin to a signal: a way for the operating system to interject and communicate with a running process. Just as you might choose to ignore or respond to your friend’s interruption, a process must decide how to handle the signals it receives.
Let’s start with a couple of questions:
- Have you ever hit Ctrl-C while running a command-line program?
- Have you ever encountered data corruption because you hit Ctrl-C at the wrong time?
If you’ve answered yes to the second question, then there’s a chance your program isn’t handling signals correctly. Much like a conversation that’s interrupted at an inopportune moment, a process that’s interrupted by a signal can be left in a state of disarray.
In this post, we’ll explore the world of Unix signals, delving into their history and their surprising complexity. We’ll learn how to handle these interruptions gracefully. And we’ll discover how async Rust can help us tame the chaos of signal handling, making it easier than ever to write robust software.
Most of this knowledge comes from my work on cargo-nextest, a next-generation test runner for Rust that is up to thrice as fast as `cargo test`. One of the things that sets nextest apart is how it carefully, rigorously handles any signals that come its way. Be sure to try it out if you haven’t already!
Why signals?
If you’re an experienced engineer, you’re probably used to asking questions like why, when, and how much you should care about something.
So: Why bother with signal handling at all? When should you care about signals?
You should care about signals if you’re developing a service. You’re likely going to be running under a service manager like Kubernetes, and the lifecycle generally involves signals.
The Kubernetes documentation explains that when your service is shutting down, it will receive a `SIGTERM`. Docker works the same way with `docker stop`. Note how the documentation advises that your code should “listen for this event and start shutting down cleanly at this point”.
You should also care about signals if you’re developing a command-line tool. That’s because your users are impatient and if they perceive your operation as being slow for any reason, they will hit Ctrl-C and send a signal to your process.
How much do you need to care about signals?
If you’re a command like `ls` or `cat` and all you’re doing is reading data, you probably don’t need to care much about signals. You’re not making any changes to the system, so there’s little that can go wrong if your process dies.

If you’re writing data, it’s definitely worth thinking about signal handling.
There are ways to arrange for your code to be resilient to sudden termination, such as writing files atomically, or using a database like SQLite. But even if your code doesn’t depend on correct signal handling, you can likely provide a better user experience if you do handle them.
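“Writing files atomically” usually means the classic write-to-a-temporary-file-then-rename trick: `rename` is atomic within a filesystem, so even a badly timed signal leaves either the old contents or the new contents on disk, never a half-written file. Here’s a minimal sketch of that pattern (the `.tmp` naming is just for illustration):

use std::fs;
use std::io::Write;
use std::path::Path;

fn write_atomically(path: &Path, data: &[u8]) -> std::io::Result<()> {
    // Write to a sibling temporary file first.
    let tmp = path.with_extension("tmp");
    let mut file = fs::File::create(&tmp)?;
    file.write_all(data)?;
    // Make sure the data has hit the disk before the rename.
    file.sync_all()?;
    // Atomically replace the destination.
    fs::rename(&tmp, path)?;
    Ok(())
}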
Where you need to care most about signals is if you’re orchestrating an operation in a distributed system. In those cases, if the process receives a signal locally, it may wish to send out cancellation messages over the wire.
Again, it’s a good idea to make your system fault-tolerant and resilient to sudden termination—for example, by serializing state in a persistent store—but at the very least, signals give you the opportunity to perform cleanup that can otherwise be difficult to do.
Before we move on: In this post, we’re going to focus on Unix, not Windows. We’re also going to focus on portable signal handling, which means mechanisms that work on every Unix. Some Unix platforms offer alternative ways to handle signals1, but this post will not be discussing them.
A basic example
Let’s look at a simple example of a signal being sent to a process:
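Roughly, the terminal session looks like this (build output elided; `^C` marks the point where Ctrl-C was pressed):

$ cargo build
   Compiling ...
^C
$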
What happened here?
- I ran `cargo build`.
- Then, a few seconds after, I hit Ctrl-C in my terminal.
- The terminal sent a signal called `SIGINT` (where `INT` means “interrupt”) to the Cargo process.
- The `SIGINT` signal caused the Cargo process, as well as all the Rust compiler processes underneath it, to be interrupted and terminated.
This example shows the two uses of signals. One is as a standardized, widely understood way for the kernel to interrupt a process. The other is as a basic, limited way to perform interprocess communication, or IPC.
Sending signals via `kill`
Besides Ctrl-C and other shortcuts, the main way you’d be sending signals to processes on the command line is via the `kill` command.
- To send `SIGINT` (what Ctrl-C sends), you can use `kill -INT <pid>`, where `<pid>` is the numeric process ID.
- Each signal also has an associated number. For `SIGINT` the number is always 2, so another way of saying this is `kill -2 <pid>`.
- If you don’t specify a signal and just say `kill <pid>`, it sends `SIGTERM` by default.
Within a programmatic context, libc has a `kill` function you can call which does the same thing.
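For example, here’s a minimal sketch of what that call can look like from Rust, assuming the `libc` crate is available as a dependency (this isn’t code from the post’s demo repository):

/// Send SIGINT to a process by ID: roughly `kill -INT <pid>`.
fn send_sigint(pid: libc::pid_t) -> std::io::Result<()> {
    // SAFETY: kill(2) has no memory-safety preconditions; it returns -1 and
    // sets errno if the pid or signal is invalid.
    let ret = unsafe { libc::kill(pid, libc::SIGINT) };
    if ret == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}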
Signal names and numbers
As mentioned above, each signal has a name and number associated with it. Some of those numbers are standardized across Unix platforms, while others aren’t.
On Linux, if you type in `man 7 signal`, you’ll see a long list of signals. Some of them are:
| Name | Number | Default action |
| --- | --- | --- |
| `SIGINT` | 2 | Terminate |
| `SIGTERM` | 15 | Terminate |
| `SIGKILL` | 9 | Terminate, can’t be customized |
| `SIGSEGV` | 11 | Terminate and core dump |
| `SIGTSTP` | Varies | Stop process |
| `SIGCONT` | Varies | Resume process |
In this table:
- `SIGKILL` is also known as `kill -9`, and it’s a special signal used to kill a process. What sets `SIGKILL` apart is that unlike almost all other signals, its behavior can’t be customized in any way.
- `SIGSEGV` might be familiar to you if you’ve ever encountered a core dump. Somewhat surprisingly, the behavior of `SIGSEGV` can be customized. For example, the Rust standard library customizes `SIGSEGV`’s behavior to detect call stack exhaustion2.
- `SIGTSTP` and `SIGCONT` are used for what is called “job control”. If you’ve ever used Ctrl-Z in `vim`, or the commands `fg` or `bg`, then you’ve used these signals3.
In general, all signals have a default action. Almost all of them also let you customize the default behavior, using what is called a signal handler.
Signal handlers
A signal handler is a custom function that is used to intercept specific signals. Once a signal handler is set up, the kernel no longer follows the default action, calling the handler instead. In that sense, it’s a reverse system call, also known as an upcall.
For example, if you’re writing or reading data, you use a system call or syscall to perform that action—in that case, you’re calling the kernel. With an upcall, instead, the kernel calls you. (It’s an “upcall” because the call goes “up”, not “down”).
Importantly, signal upcalls can happen at almost any time. And this is where we start running into issues.
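To make the idea of an upcall concrete, here is a minimal sketch of installing a handler at the lowest level, using the `libc` crate’s binding to `signal(2)`. (This is for illustration only; as we’ll see shortly, the handler body is severely restricted, and most Rust programs reach for a crate instead.)

use std::sync::atomic::{AtomicBool, Ordering};

// A flag for the rest of the program to poll. Loading and storing an atomic
// is one of the few things that is safe to do inside a handler.
static INTERRUPTED: AtomicBool = AtomicBool::new(false);

// The upcall: the kernel calls this function whenever SIGINT arrives.
extern "C" fn on_sigint(_signal: libc::c_int) {
    INTERRUPTED.store(true, Ordering::SeqCst);
}

fn install_sigint_handler() {
    // signal(2) takes the handler as an integer-sized value, so cast the
    // function pointer accordingly.
    let handler = on_sigint as extern "C" fn(libc::c_int);
    unsafe {
        libc::signal(libc::SIGINT, handler as libc::sighandler_t);
    }
}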
2. An example
To see the sorts of issues that signal handlers can create, let’s walk through a specific example: that of a download manager.
Back in the 2000s, these programs were a lifesaver. I grew up with pretty terrible internet back then, and the download managers had several features that really helped. The most important feature was their ability to resume downloads, something that browsers didn’t support back then.
For this post, we’re going to work with a really simple download manager. Let’s say you provide a bunch of URLs to a tool, which then downloads them in parallel and maintains their status in a database.
Now, let’s say we want to handle `SIGINT`. If the user hits Ctrl-C, we may want to follow a small set of sensible steps:
- Cancel all running downloads.
- Flush any data to disk.
- In the database, mark the state of these downloads as interrupted.
Your first idea might be, “let’s just put all this logic in a signal handler”. Can you do that? The answer turns out to be that no, you can’t. And the reasons why are a good illustration of many of the pitfalls of signal handlers.
Why are signal handlers limited?
Earlier, I mentioned that signal handlers can be called at any time. It turns out that the ability to call a piece of code at any time is fraught with peril like little else in computing. This property of signal handlers is at the root of so many of the problems with them.
For example, consider what happens if a signal handler is called while you’re holding a mutex or other lock. In general, trying to acquire the same lock again will result in a deadlock4. So you can’t call functions that try to acquire a lock.
Well, which functions acquire locks? Even something as basic as allocating memory via `malloc` requires a lock, because it pokes at global structures. This means that you cannot allocate memory in a signal handler. This alone shuts off a large percentage of the things you can do in a signal handler.

Another joy of signal handling is that while a handler is running, your process can receive a different signal and invoke a second signal handler. As you might imagine, this is just very hard to reason about in practice.
These aren’t just theoretical concerns! You might have heard of the CVE database, where security vulnerabilities are filed and listed. The CVE database has a lesser-known cousin called the CWE database, which lists out “common weaknesses” that result in security vulnerabilities. Within this database, there are no fewer than four weaknesses related to incorrect signal handlers:
- CWE-364: Signal Handler Race Condition
- CWE-432: Dangerous Signal Handler not Disabled During Sensitive Operations
- CWE-479: Signal Handler Use of a Non-reentrant Function
- CWE-828: Signal Handler with Functionality that is not Asynchronous-Safe
On Linux, the man page on `signal-safety` lists the functions described by POSIX as okay to call in signal handlers, and it is pretty short. You can write to a file descriptor, but `malloc` is not allowed, and neither is buffered I/O: you can’t `fopen` or `fseek` a file.
The functions okay to call in signal handlers are called async-signal-safe functions. The term is a bit confusing! “async” here has nothing to do with async Rust. (In many ways it’s the opposite of async Rust, because the defining characteristic of async Rust is that you cannot just be interrupted or preempted at any time. You can only be interrupted at await points.)
The self-pipe trick
So how do most modern programs handle signals? To understand that, we briefly need to introduce the concept of the self-pipe trick. The trick uses a Unix feature called self-pipes, which have been described in the same breath as both “wonderful” and “cursed”: a high honor!
You might be familiar with pipes from having used them in shells via the namesake pipe (`|`) operator. For example, consider a typical `find | xargs` command.
- When this command is run, the shell creates a pipe.
- Each pipe has a write end and a read end.
- In the case of `find | xargs`, the write end is held by `find`, and the read end by `xargs`.
A self-pipe is just a kind of pipe where the write and read ends are held by the same process.
Now, you might ask, “what’s the point of this?” And you’d be right to do so! Most of the time they don’t add any value. But they do add value in the specific context of signal handlers, because they let programs write signal handlers that are as simple as possible.
- The program starts by creating a self-pipe.
- It then hands the write end of the pipe to a signal handler, and holds on to the read end.
- Then, the signal handler writes to that self-pipe. This is safe to do, because a pipe is a file descriptor, and writing to a file descriptor is async-signal-safe.
- Then, the program reads from the self-pipe elsewhere.
Most C programs do this by hand, but in Rust you don’t have to write this delicate pattern manually. There are several crates which implement this pattern, like `signal-hook`, and most people just use one of them.
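For instance, here’s a minimal sketch of receiving Ctrl-C with `signal-hook`’s iterator API, which takes care of the handler and the pipe plumbing for you:

use signal_hook::consts::SIGINT;
use signal_hook::iterator::Signals;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Registers a SIGINT handler and hands us the read side.
    let mut signals = Signals::new([SIGINT])?;

    // Blocks until a signal arrives.
    for signal in signals.forever() {
        eprintln!("received signal {signal}");
        break;
    }
    Ok(())
}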
Now what?
Going back to the download manager example: Let’s say you write to a pipe, indicating that Ctrl-C has been pressed. How do you handle the read side of the pipe? Once you’ve received a signal, how do you handle it?
One option is to set some kind of global flag, and check whether you’ve received a signal at every iteration of a loop–or maybe every Nth iteration of a loop.
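A sketch of that approach, assuming a global `INTERRUPTED` flag set by a signal handler (the `WorkItem` type and `process_item` function are hypothetical stand-ins for your program’s real work):

use std::sync::atomic::{AtomicBool, Ordering};

// Set to true by the signal handler.
static INTERRUPTED: AtomicBool = AtomicBool::new(false);

fn run_all(work_items: Vec<WorkItem>) {
    for item in work_items {
        // Check the flag once per iteration.
        if INTERRUPTED.load(Ordering::SeqCst) {
            // A signal arrived: clean up and bail out.
            break;
        }
        process_item(item);
    }
}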
This works fine for small programs that are CPU-bound, but isn’t really scalable to large programs because those tend to have lots of loops in them (are you really going to add checks to every loop?)
Large programs are I/O bound anyway, and it’s a bit hard to check for signals while you’re waiting for some network operation to complete.
Another potential solution: Store all the state behind a mutex, and wrest control of the program by locking out all the workers. This is a solution that some programs use, but it is really difficult to coordinate state between the workers and the signal handler. I really wouldn’t recommend following this approach.
The most reasonable approach for I/O-bound programs is to use message passing, where the parts of a program that deal with signals are notified that a signal has occurred.
This is possible to do in a synchronous model, but, as I’ve documented in my previous post about how nextest uses Tokio, it adds a great deal of unbounded complexity. What happens in practice is that you inevitably create a nest of threads whose only responsibility is to pass messages around.
The good news is that dealing with messages is much, much simpler with async Rust.
Why async Rust?
At this point you might ask: “I’m just a simple downloader tool, why would I need async for this?”
Async Rust is usually presented as being for massively concurrent web or backend servers, but a little secret is that that framing is mostly marketing. The scope of problems async Rust solves is much broader, and it happens to be incredibly well suited to signal handling for most programs. This is because async Rust provides some very expressive ways to perform advanced control flow, making such code readable without sacrificing performance.
To see how, we’re going to use Tokio’s signal handling functionality, which under the hood uses the same self-pipe trick mentioned above5. Here’s what a very simple example of Ctrl-C looks like under async Rust:
use tokio::signal::unix::{signal, SignalKind};

let mut ctrl_c_stream = signal(SignalKind::interrupt())?;
loop {
    ctrl_c_stream.recv().await;
    println!("got SIGINT");
}
In this example:
- The code sets up a stream of `SignalKind::interrupt` (Ctrl-C) signals, then `await`s the `recv` method.
- The `recv` method resolves each time the process receives Ctrl-C.
Now, this isn’t very special by itself; you can easily implement this with synchronous code. But this model really shines in more complex code paths, because of async Rust’s ability to perform heterogeneous selects via `tokio::select!`.
For a detailed discussion of heterogeneous selects, see my earlier post about them. But a quick summary is that `tokio::select!` is a powerful control flow tool that waits on a set of futures concurrently, and resolves as soon as one of them completes.
What makes `tokio::select!` really special is that it doesn’t just work against specific kinds of asynchronicity, but against any arbitrary, heterogeneous source of asynchronicity: signals, timers, any async function–you name it. As a result, `tokio::select!` is a great fit for signal handling6.
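As a tiny illustration (not taken from the talk), here’s a select over two completely different sources of asynchronicity, a signal and a timer:

use std::time::Duration;
use tokio::signal::unix::{signal, SignalKind};

async fn ctrl_c_or_timeout() -> std::io::Result<()> {
    let mut ctrl_c_stream = signal(SignalKind::interrupt())?;
    tokio::select! {
        // Resolves if the process receives SIGINT first...
        _ = ctrl_c_stream.recv() => println!("got SIGINT"),
        // ...or if 60 seconds elapse first.
        _ = tokio::time::sleep(Duration::from_secs(60)) => println!("timed out"),
    }
    Ok(())
}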
Implementing signal handling using `tokio::select!`
Going back to our download manager example, let’s try using `tokio::select!`. There are a few ways to organize this, but here’s one way.
We’re going to briefly introduce two constructs that make our life simpler:
- The `JoinSet` type, which stands for a set of worker tasks (not threads!) that are running in parallel.
- Broadcast channels, which allow a single producer to send messages to multiple consumers. The idea here is that the main task will receive signals, and then broadcast them to the workers.
Here’s how the main function works:
- Create a `JoinSet` and a broadcast channel.
- Spin up a stream of `SIGINT` signals, as before.
- Spawn a task for each worker on the `JoinSet`, and pass in a receiver for broadcast messages.
use tokio::signal::unix::{signal, SignalKind};
use tokio::sync::broadcast;
use tokio::task::JoinSet;

#[tokio::main]
async fn main() {
    let to_download = /* ... */;

    // Make a JoinSet.
    let mut join_set = JoinSet::new();

    // Also make a broadcast channel. (The initial receiver is unused: each
    // worker gets its own receiver via `subscribe` below.)
    let (sender, _receiver) = broadcast::channel(16);

    // Spin up a stream of SIGINT signals.
    let mut ctrl_c_stream = signal(SignalKind::interrupt()).unwrap();

    // Spawn a task for each worker, passing in a broadcast receiver.
    for args in to_download {
        let receiver = sender.subscribe();
        join_set.spawn(worker_fn(args, receiver));
    }

    // ... continued below
}
Next, the main function needs to wait for results from all the workers. How would we do it if we weren’t handling signals? Well, in that case we would loop and wait for each worker task to finish until they’re all done, handling errors along the way.
// continued from above...
loop {
    let v = join_set.join_next().await;
    if let Some(result) = v {
        // Handle result...
    } else {
        // No more downloads left.
        break;
    }
}
To handle signals, we use a `tokio::select!` with two branches, both of which are waited on concurrently:
- The first branch waits for worker tasks to be done, with the same code as above.
- The second branch awaits a Ctrl-C message from the stream.
enum CancelKind {
    Interrupt,
}

loop {
    tokio::select! {
        v = join_set.join_next() => {
            if let Some(result) = v {
                // Handle result...
            } else {
                // No more downloads left.
                break;
            }
        }
        Some(_) = ctrl_c_stream.recv() => {
            // If all the workers have already exited there's no one to
            // notify, so ignore any send error.
            let _ = sender.send(CancelKind::Interrupt);
        }
    }
}
Now let’s look at the worker function. We first write our download function within an async block:
async fn worker_fn(args: Args, mut receiver: Receiver<CancelKind>) -> Result<()> {
    let op = async {
        args.db.update_state(&args.url, Downloading).await?;
        download_url_to(&args.url, &args.file).await?;
        args.db.update_state(&args.url, Completed).await?;
        Ok(())
    };
    // Pin the future so that `&mut op` can be polled in the select! loop below.
    tokio::pin!(op);

    // ...
}
Then, just like earlier, we write a loop with a `tokio::select!` over two options:
- The first branch drives the operation forward.
- The second branch waits for cancellation messages over the broadcast channel.
loop {
    tokio::select! {
        // 1. Drive the operation forward.
        res = &mut op => return res,
        // 2. Wait for cancellation messages over the broadcast channel.
        Ok(message) = receiver.recv() => {
            // Handle the cancellation message received.
        }
    }
}
What makes this model tick is how well it scales up to additional complexity. Two specific examples that are handled well by this model:
- First, you’ll likely want to handle other signals like `SIGTERM`, since `SIGINT` isn’t the only signal you’d receive.
- Another common extension is to use what is sometimes called a double Ctrl-C pattern. The first time the user hits Ctrl-C, you attempt to shut down the database cleanly, but the second time you encounter it, you give up and exit immediately.
A fully working example of the download manager described above is in the demo repository. These two extensions are marked as exercises for you to complete; try solving them!
3. Going deeper
If all you’re doing is orchestrating an external operation like downloading files, then this is most of what you need to know. But a lot of the really interesting details about signals come up when you’re spawning processes yourself, such as if you’re a shell, or a test runner like nextest. There is far too much to talk about here, but we’re going to go deeper into one particular example.
Process groups
In part 1, I mentioned that if you press Ctrl-C, the shell sends `SIGINT` to the process you’re running. That isn’t quite correct; I fibbed a little! The truth is that the shell actually sends the signal to what is called a process group.
What is a process group? In Unix, processes can be organized into groups, each with its own unique identifier. Process groups allow a set of processes to be sent a signal at once. The shell creates a new process group for the command and its children, assigning them a unique group ID. This allows the shell to manage the processes as a unit, sending signals to the entire group at once.
On Linux, you can print process groups with a command like `ps fo pid,pgid,comm`. (Other Unix platforms have their own `ps` flags to display similar output.) For example, if you run this command while `cargo build` is running in another terminal, you might get some output that looks like this:
$ ps fo pid,pgid,comm
    PID    PGID COMMAND
   1528    1528 zsh
   4100    4100  \_ cargo
   4261    4100      \_ rustc
   4587    4100      \_ rustc
   4640    4100      \_ rustc
In this case:
- The `zsh` shell created a `cargo` process numbered `4100`.
- When `zsh` did that, it also created a corresponding process group with the same number.
- When `cargo build` ran `rustc`, that process group was inherited by the `rustc` processes.
Recall from earlier that if you want to send `SIGINT` to a process via the `kill` command, you’d use `kill -INT <pid>`.
- If you want to send `SIGINT` to a process group, in very typical Unix fashion you have to use a negative number.
- For example, if you run `kill -INT -4100`, the `SIGINT` signal gets sent atomically to the whole process group numbered `4100`: the process, its children, grandchildren, everything.
And that’s what really happens when you hit Ctrl-C: both the `cargo` process itself and all the child `rustc` processes are terminated by this signal sent to the process group.
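The same negative-number convention applies to the `kill` function as well. Here’s a sketch, again assuming the `libc` crate:

/// Send SIGINT to an entire process group: roughly `kill -INT -<pgid>`.
fn interrupt_process_group(pgid: libc::pid_t) -> std::io::Result<()> {
    // SAFETY: as with kill on a single process, there are no memory-safety
    // preconditions; a negative pid targets the whole process group.
    let ret = unsafe { libc::kill(-pgid, libc::SIGINT) };
    if ret == 0 {
        Ok(())
    } else {
        Err(std::io::Error::last_os_error())
    }
}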
Making your own process groups
Let’s say that you’re a test runner like nextest. What if you want to join the party? For example, it’s common for tests to spin up servers in another process to communicate with. A test runner that terminates a test likely also wants to kill off any processes created by that test.
To set the process group for a process, Unix provides a function called `setpgid`. In Rust, access to this is provided via the `CommandExt` extension trait’s `process_group` method.
Most of the time, you’ll want to pass in `0` for the process group, which means that the kernel will create a new process group for you with the same number as the process ID.
Forwarding signals to child process groups
Let’s say you’re a command-line process that is using process groups to manage its children. What happens if the user hits Ctrl-C?
You might assume that process groups form some sort of tree. Just like with the notion of a tree of processes with parent and child processes, you might imagine that process groups follow a similar pattern.
In reality: no, process groups don’t form a tree. Each process group is its own island.
This means that if you’re a boundary process which manages process groups of your own, it is your responsibility to forward signals to those child process groups.
(This is quite easy to fit into our async Rust model: the worker tasks, on receiving a broadcast message, send the corresponding signal to the process groups that they’re responsible for.)
Most programs would want to behave as similarly as possible to the world where they didn’t set up process groups. To achieve that you’ll want to at least forward these signals:
To match behavior, forward the signals that are typically sent to process groups:

- `SIGINT` (Ctrl-C).
- `SIGQUIT` (Ctrl-\).
- `SIGTSTP` (Ctrl-Z) and `SIGCONT` (`fg` or `bg`). These two signals are interesting to handle in nextest: if you have timers running for your processes, for example because you want to time out and kill process groups, you can combine `SIGTSTP` and `SIGCONT` with async Rust to pause and resume those timers. (There is also a deep story about an issue in POSIX lurking here, which I’ll cover in a sequel to this post.)
Also consider forwarding `SIGTERM` and other signals, to meet user expectations. These signals are not always sent to process groups, but it would make sense to forward them.
4. Conclusion: echoes of the past
Signals are a fundamental part of the computing landscape, a legacy of design decisions made decades ago. They are a reminder that the systems we build today are shaped by the choices of the past, and that even the most well-intentioned innovations can have unintended consequences.
In the words of Doug McIlroy, creator of Unix pipes:

> `Signal()` was there first and foremost to support `SIGKILL`; it did not purport to provide a sound basis for asynchronous IPC. The complexity of `sigaction()` is evidence that asynchrony remains untamed 40 years on.
Signals, for all their utility, were never meant to be the foundation of interprocess communication that they have become today. Yet we have found ways to adapt and evolve them to our needs, working around their limitations and turning them into opportunities.
Indeed, this is the essence of computing, and of technology itself: a story of creativity in the face of constraint, and of building upon the foundations of the past to create something new and beautiful.
I hope this post has shed some light on the fascinating world of Unix signals, and perhaps even inspired you to think closely about the systems you build. If you’ll be at RustConf in Montreal, come find me: I’d love to chat more about this and hear your own stories about signals, or of other intricate systems with long-forgotten design decisions. See you there!
Thanks to Fiona for reviewing a draft of this post.
One example is signalfd on Linux. There is much to be said about signalfd, but it is outside the scope of this post. ↩︎
Some other language runtimes do much, much worse things with `SIGSEGV`. Here’s what Java does. ↩︎

If you haven’t, by the way, check it out! It’s one of the cooler parts of Unix. ↩︎
If you’ve dealt with locks you might have heard of the concept of recursive or re-entrant locks, which are locks that can be acquired by the same thread as many times as desired. Recursive locks often indicate a design flaw, because the goal of a lock is to temporarily violate invariants that are otherwise upheld outside of its context.
But more importantly, it isn’t guaranteed that signal handlers are going to be called on the same thread! So even recursive locks might deadlock when it comes to signal handlers. ↩︎
Under the hood, Tokio uses mio, which uses a self-pipe for its portable implementation. But on some platforms such as Linux, mio uses eventfd. This is an extension to Unix that is quite similar to self-pipes, and from the perspective of signal handlers serves the same purpose. ↩︎
There’s a case to be made that `tokio::select!` is too freeform—too powerful. By hiding details about async cancellation, `select!` can lead to surprising bugs. See this blog post by boats for more.

In this example, we’re handling cancellation explicitly, so the impact of those issues is less direct. ↩︎