Cancelling async Rust
This is an edited, written version of my RustConf 2025 talk about cancellations in async Rust. Like the written version of my RustConf 2023 talk, I’ve tried to retain the feel of a talk while making it readable as a standalone blog entry. Some links:
- Video of the talk on YouTube.
- Slides on Google Slides.
- Repository with links and notes on GitHub.
- Coverage on Linux Weekly News.
Introduction#
Let’s start with a simple example – you decide to read from a channel in a loop and gather a bunch of messages:
loop {
    match rx.recv().await {
        Ok(msg) => process(msg),
        Err(_) => return,
    }
}
All good, nothing wrong with this, but you realize sometimes the channel is empty for long periods of time, so you add a timeout and print a message:
loop {
    match timeout(Duration::from_secs(5), rx.recv()).await {
        Ok(Ok(msg)) => process(msg),
        Ok(Err(_)) => return,
        Err(_) => println!("no messages for 5 seconds"),
    }
}
There’s nothing wrong with this code—it behaves as expected.
Now you realize you need to write a bunch of messages out to a channel in a loop:
loop {
    let msg = next_message();
    match tx.send(msg).await {
        Ok(_) => println!("sent successfully"),
        Err(_) => return,
    }
}
But sometimes the channel gets too full and blocks, so you add a timeout and print a message:
loop {
    let msg = next_message();
    match timeout(Duration::from_secs(5), tx.send(msg)).await {
        Ok(Ok(_)) => println!("sent successfully"),
        Ok(Err(_)) => return,
        Err(_) => println!("no space for 5 seconds"),
    }
}
It turns out that this code is often incorrect, because not all messages make their way to the channel.
Hi, I’m Rain, and this post is about cancelling async Rust. This post is split into three parts:
- What is cancellation? It’s an extremely powerful part of async Rust but also one that is very hard to reason thoroughly about.
- Analyzing cancellations: Going deep into their mechanics and providing some helpful ways to think about them.
- What can be done? Solutions, including practical guidance, and real bugs we’ve found and fixed in production codebases.
Before we begin, I want to lay my cards on the table – I really love async Rust!

I gave a talk at RustConf a couple years ago talking about how async Rust is a great fit for signal handling in complex applications.
I’m also the author of cargo-nextest, a next-generation test runner for Rust, where async Rust is the best way I know of to express some really complex algorithms that I wouldn’t know how to express otherwise. I wrote a blog post about this a few years ago.
Now, I work at Oxide Computer Company, where we make cloud-in-a-box computers. We make vertically integrated systems where you provide power and networking on one end, and the software you want to run on the other end, and we take care of everything in between.
Of course, we use Rust everywhere, and in particular we use async Rust extensively for our higher-level software, such as storage, networking and the customer-facing management API. But along the way we’ve encountered a number of issues around async cancellation, and a lot of this post is about what we learned along the way.
1. What is cancellation?#
What does cancellation mean? Logically, a cancellation is exactly what it sounds like: you start some work, and then change your mind and decide to stop doing that work.
As you might imagine this is a useful thing to do:
- You may have started a large download or a long network request
- Maybe you’ve started reading a file, similar to the `head` command.
But then you change your mind: you want to cancel it rather than continue it to completion.
Cancellations in synchronous Rust#
Before we talk about async Rust, it’s worth thinking about how you’d do cancellations in synchronous Rust.
One option is to have some kind of flag you periodically check, maybe stored in an atomic:
while !should_cancel.load(Ordering::Relaxed) {
    expensive_operation();
}
- The code that wishes to perform the cancellation can set that flag.
- Then, the code which checks that flag can exit early.
This approach is fine for smaller bits of code but doesn’t really scale well to large chunks of code since you’d have to sprinkle these checks everywhere.
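For concreteness, here’s a minimal sketch of both sides of that flag-based protocol, reusing `expensive_operation` as a stand-in for the actual work:

use std::sync::Arc;
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

let should_cancel = Arc::new(AtomicBool::new(false));

// The worker checks the flag between units of work.
let flag = Arc::clone(&should_cancel);
let worker = thread::spawn(move || {
    while !flag.load(Ordering::Relaxed) {
        expensive_operation();
    }
});

// The canceller sets the flag; the worker notices it on its next check and exits.
should_cancel.store(true, Ordering::Relaxed);
worker.join().unwrap();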
A related option, if you’re working with a framework as part of your work, is to panic with a special payload of some kind.
- If that feels strange to you, you’re not alone! But the Salsa framework for incremental computation, used by—among other things—rust-analyzer, uses this approach.
- Something I learned recently was that this only works on build targets which have a notion of panic unwinding, or being able to bubble up the panic. Not all platforms support this, and in particular, Wasm doesn’t. This means that Salsa cancellations don’t work if you build rust-analyzer for Wasm.
A third option is to kill the whole process. This is a very heavyweight approach, but an effective one in case you spawn processes to do your work.
Rather than kill the whole process, can you kill a single thread?
- While some OSes have APIs to perform this action, they tend to warn very strongly against it. That’s because in general, most code is just not ready for a thread disappearing from underneath.
- In particular, thread killing is not permitted by safe Rust, since it can cause serious corruption. For example, Rust mutexes would likely stay locked forever.
All of these options are suboptimal or of limited use in some way. In general, the way I think about it is that there isn’t a universal protocol for cancellation in synchronous Rust.
In contrast, there is such a protocol in async Rust, and in fact cancellations are extraordinarily easy to perform in async Rust.
Why is that so? To understand that, let’s look at what a future is.
What is a future?#
Here’s a simple example of a future:
// This creates a state machine.
let future = async {
    let data = request().await;
    process(data).await
};
// Nothing executes yet. `future` is just a struct in memory.
In this future, you first perform a network request which returns some data, and then you process it.
The Rust compiler looks at this future and generates a state machine, which is just a struct or enum in memory:
// The compiler generates something like:
enum MyFuture {
    Start,
    WaitingForNetwork(NetworkFuture),
    WaitingForProcess(ProcessFuture, Data),
    Done(Result),
}
// It's just data, no running code!
If you’ve written async Rust before the `async` and `await` keywords, you’ve probably written code like it by hand. It’s basically just an enum describing all the possible states the future can be in.
The compiler also generates an implementation of the `Future` trait for this future:
impl Future for MyFuture {
    fn poll(/* ... */) -> Poll<Self::Output> {
        match self {
            Start => { /* ... */ }
            WaitingForNetwork(fut) => { /* ... */ }
            // etc
        }
    }
}
and when you call `.await` on the future, it gets translated down to this underlying `poll` function. It is only when `await` or this `poll` function is called that something actually happens.
Note that this is diametrically opposed to how async works in other languages like Go, JavaScript, or C#. In those languages, when you create a future to await on, it starts doing its thing, immediately, in the background:
// JavaScript: starts running immediately
const promise = fetch('/api/data');
That’s regardless of whether you await it or not.
In Rust, this `get` call does nothing until you actually call `.await` on it:
// Rust: just data, does nothing!
let future = reqwest::get("/api/data");
I know I sound a bit like a broken record here, but if you can take away one thing from this post, it would be that futures are passive, and completely inert until awaited or polled.
The universal protocol#
So what does the universal protocol to cancel futures look like? It is simply to drop the future, or to stop awaiting or polling it. Since a future is just a state machine, you can throw it away at any time its `poll` function isn’t actively being called.
let future = some_async_work();
drop(future); // cancelled
The upshot of all this is that any Rust future can be cancelled at any await point.
Given how hard cancellation tends to be in synchronous environments, the ability to easily cancel futures in async Rust is extraordinarily powerful—in many ways its greatest strength!
But there is a flip side, which is that cancelling futures is far, far too easy. This is for two reasons.
First, it’s just way too easy to quietly drop a future. As we’re going to see, there are all kinds of code patterns that lead to silently dropping futures.
Now this wouldn’t be so bad, if not for the second reason: that cancellation of parent futures propagates down to child futures.
Because of Rust’s single ownership model, child futures are owned by parent ones. If a parent future is dropped or cancelled, the same happens to the child.
To figure out whether a child future’s cancellation can cause issues, you have to look at its parent, and grandparent, and so on. Reasoning about cancellation becomes a very complicated non-local operation.
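To make that concrete, consider the timeout from the introduction: the `timeout` future owns the `send` future it wraps, so dropping the outer future drops the inner one as well.

// `timeout` wraps and owns the inner `send` future. Dropping the outer
// future also drops (and therefore cancels) the inner one.
let outer = timeout(Duration::from_secs(5), tx.send(msg));
drop(outer); // the inner `tx.send(msg)` future is dropped too, and msg is lost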
2. Analyzing cancellations#
I’m going to cover some examples in a bit, but before we do that I want to talk about a couple terms, some of which you might have seen references to already.
Cancel safety and cancel correctness#
The first term is cancel safety. You might have seen mentions of this in the Tokio documentation. Cancel safety, as generally defined, means the property of a future that can be cancelled (i.e. dropped) without any side effects.
For example, a Tokio sleep future is cancel safe: you can just stop waiting on the sleep and it’s completely fine.
let future = tokio::time::sleep(Duration::from_secs(1));
drop(future); // this has no side effects
An example of a future that is not cancel safe is Tokio’s MPSC send, which sends a message over a channel:
let message = /* ... */;
let future = sender.send(message);
drop(future); // message is lost!
If this future is dropped, the message is lost forever.
The important thing is that cancel safety is a local property of an individual future.
But cancel safety is not all that one needs to care about. What actually matters is the context the cancellation happens in, or in other words whether the cancellation actually causes some kind of larger property in the system to be violated.
- For example, if you drop a future which sends a message, but for whatever reason you don’t care about the message any more, it’s not really a bug!
To capture this I tend to use a different term called cancel correctness, which I define as a global property of system correctness in the face of cancellations. (This isn’t a standard term, but it’s a framing I’ve found really helpful in understanding cancellations.)
When is cancel correctness violated? It requires three things:
1. The system has a cancel-unsafe future somewhere within it. As we’ll see, many APIs that are cancel-unsafe can be reworked to be cancel-safe. If there aren’t any cancel-unsafe futures in the system, then the system is cancel correct.
2. A cancel-unsafe future is actually cancelled. This may sound a bit trivial, but if cancel-unsafe futures are always run to completion, then the system can’t have cancel correctness bugs.
3. Cancelling the future violates some property of the system. This could be data loss as with `Sender::send`, some kind of invariant violation, or some kind of cleanup that must be performed but isn’t.
So a lot of making Rust async robust is about trying to tackle one of these three things.
I want to zoom in for a second on invariant violations and talk about an example of a Tokio API that is very prone to cancel correctness issues: Tokio mutexes.
The pain of Tokio mutexes#
The way Tokio mutexes work is: you create a mutex, you lock it which gives you mutable access to the data underneath, and then you unlock it by releasing the mutex.
let guard = mutex.lock().await;
// Access guard.data, protected by the mutex...
drop(guard);
If you look at the `lock` function’s documentation, in the “cancel safety” section it says:
This method uses a queue to fairly distribute locks in the order they were requested. Cancelling a call to lock makes you lose your place in the queue.
Okay, so not totally cancel safe, but the only kind of unsafety is fairness, which doesn’t sound too bad.
But the problems lie in what you actually do with the mutex. In practice, most uses of mutexes are in order to temporarily violate invariants that are otherwise upheld when a lock isn’t held.
I’ll use a real world example of a cancel correctness bug that we found at my job at Oxide: we had code to manage a bunch of data sent over by our computers, which we call sleds. The shared state was guarded by a mutex, and a typical operation was:
- Obtain a lock on the mutex.
- Obtain the sled-specific data by value, moving it to an invalid `None` state.
- Perform an action.
- Set the sled-specific data back to the next valid state.
Here’s a rough sketch of what that looks like:
let guard = mutex.lock().await;
// guard.data is Option<T>: Some to begin with
let data = guard.data.take(); // guard.data is now None
let new_data = process_data(data);
guard.data = Some(new_data); // guard.data is Some again
This is all well and good, but the problem is that the action being performed actually had an await point within it:
let guard = mutex.lock().await;
// guard.data is Option<T>: Some to begin with
let data = guard.data.take(); // guard.data is now None
// DANGER: cancellation here leaves data in None state!
let new_data = process_data(data).await;
guard.data = Some(new_data); // guard.data is Some again
If the code that operated on the mutex got cancelled at that await point, then the data would be stuck in the invalid `None` state. Not great!
And keep in mind the non-local reasoning aspect: when doing this analysis, you need to look at the whole chain of callers.
Cancellation patterns#
Now that we’ve talked about some of the bad things that can happen during cancellations, it’s worth asking what kinds of code patterns lead to futures being cancelled.
The most straightforward example, and maybe a bit of a silly one, is that you create a future but simply forget to call `.await` on it.
some_async_work(); // missing .await
Now Rust actually warns you if you don’t call `.await` on the future:
warning: unused implementer of `Future` that must be used
   |
11 |     some_async_work();
   |     ^^^^^^^^^^^^^^^^^
   |
   = note: futures do nothing unless you `.await` or poll them
But a code pattern I’ve sometimes made mistakes with is that the future returns a `Result`, and you want to ignore the result so you assign it to an underscore like so:
let _ = some_async_work(); // future returns Result
If I forget to call `.await` on the future, Rust doesn’t warn me about it at all, and then I’m left scratching my head about why this code didn’t run. I know this sounds really silly and basic, but I’ve made this mistake a bunch of times.
(After my talk, it was pointed out to me that Clippy 1.67 and above have a `let_underscore_future` warn-by-default lint for this. Hooray!)
Another example of futures being cancelled is `try` operations, such as Tokio’s `try_join` macro. For example:
async fn do_stuff_async() -> Result<(), &'static str> {
    // async work
    Ok(())
}

async fn more_async_work() -> Result<(), &'static str> {
    // more here
    Ok(())
}

let res = tokio::try_join!(
    do_stuff_async(),
    more_async_work(),
);
// ...
If you call `try_join` with a bunch of futures, and all of them succeed, it’s all good. But if one of them fails, the rest simply get cancelled.
In fact, at Oxide we had a pretty bad bug around this: we had code to stop a bunch of services, all expressed as futures. We used `try_join`:
try_join!(
    stop_service_a(),
    stop_service_b(),
    stop_service_c(),
)?;
If one of these operations failed for whatever reason, we would stop running the code to wait for the other services to exit. Oops!
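One way we could have avoided this (a sketch of the general idea, not necessarily the exact fix we shipped) is to use `join!` instead, so that every stop future runs to completion, and only then look at the errors:

// All three futures run to completion regardless of individual failures;
// errors are only inspected after everything has stopped.
let (a, b, c) = tokio::join!(
    stop_service_a(),
    stop_service_b(),
    stop_service_c(),
);
a?;
b?;
c?;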
But perhaps the most well-known source of cancellations is Tokio’s `select` macro. Select is this incredibly beautiful operation. It is called with a set of futures, and it drives all of them forward concurrently:
tokio::select! {
    result1 = future1 => handle_result1(result1),
    result2 = future2 => handle_result2(result2),
}
Each future has a code block associated with it (above, `handle_result1` and `handle_result2`). If one of the futures completes, the corresponding code block is called. But also, all of the other futures are always cancelled!
For a variety of reasons, select statements in general, and select loops in particular, are particularly prone to cancel correctness issues. So a lot of the documentation about cancel safety talks about select loops. But I want to emphasize here that select is not the only source of cancellations, just a particularly notable one.
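To make the select-loop hazard concrete, here’s a minimal sketch (assuming `tx` is a Tokio MPSC sender and `interval` is a `tokio::time::Interval`). Every time the tick branch wins, the in-flight `send` future is dropped, and that message is lost:

loop {
    let msg = next_message();
    tokio::select! {
        res = tx.send(msg) => {
            if res.is_err() {
                return; // the receiver has gone away
            }
        }
        // If this branch completes first, the `send` future above is
        // cancelled and `msg` is lost.
        _ = interval.tick() => println!("still waiting for space"),
    }
}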
3. What can be done?#
So, now that we’ve looked at all of these issues with cancellations, what can be done about it?
First, I want to break the bad news to you – there is no general, fully reliable solution for this in Rust today. But in our experience there are a few patterns that have been successful at reducing the likelihood of cancellation bugs.
Going back to our definition of cancel correctness, there are three prongs all of which come together to produce a bug:
- A cancel-unsafe future exists
- This cancel-unsafe future is cancelled
- The cancellation violates a system property
Most solutions we’ve come up with try and tackle one of these prongs.
Making futures cancel-safe#
Let’s look at the first prong: the system has a cancel-unsafe future somewhere in it. Can we use code patterns to make futures be cancel-safe? It turns out we can! I’ll give you two examples here.
The first is MPSC sends. Let’s come back to the example from earlier where we would lose messages entirely:
loop {
    let msg = next_message();
    match timeout(Duration::from_secs(5), tx.send(msg)).await {
        Ok(Ok(_)) => println!("sent successfully"),
        Ok(Err(_)) => return,
        Err(_) => println!("no space for 5 seconds"),
    }
}
Can we find a way to make this cancel safe?
In this case, yes, and we do so by breaking up the operation into two parts:
loop {
    let msg = next_message();
    loop {
        match timeout(Duration::from_secs(5), tx.reserve()).await {
            Ok(Ok(permit)) => { permit.send(msg); break; }
            Ok(Err(_)) => return,
            Err(_) => println!("no space for 5 seconds"),
        }
    }
}
- The first component is the operation to reserve a permit or slot in the channel. This is an initial async operation that’s cancel-safe.
- The second is to actually send the message using the permit, which is a synchronous operation that cannot fail.
(I want to put an asterisk here that reserve is not entirely cancel-safe, since Tokio’s MPSC follows a first-in-first-out pattern and dropping the future means losing your place in line. Keep this in mind for now.)
The second is with Tokio’s `AsyncWrite`.
If you’ve written synchronous Rust you’re probably familiar with the `write_all` method, which writes an entire buffer out:
use std::io::Write;
let buffer: &[u8] = /* ... */;
writer.write_all(buffer)?;
In synchronous Rust, this is a great API. But within async Rust, the `write_all` pattern is absolutely not cancel safe! If the future is dropped before completion, you have no idea how much of this buffer was written out.
use tokio::io::AsyncWriteExt;
let buffer: &[u8] = /* ... */;
writer.write_all(buffer).await?; // Not cancel-safe!
But there’s an alternative API that is cancel-safe, called `write_all_buf`. This API is carefully designed to enable the reporting of partial progress, and it doesn’t just accept a buffer, but rather something that looks like a cursor on top of it:
use tokio::io::AsyncWriteExt;
let mut buffer: io::Cursor<&[u8]> = /* ... */;
writer.write_all_buf(&mut buffer).await?;
When part of the buffer is written out, the cursor is advanced by that number of bytes. So if you call `write_all_buf` in a loop, you’ll be resuming from this partial progress, which works great.
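As a sketch (assuming `writer` is an `AsyncWrite` and `data` is a byte slice), a retry loop might look like this. A timed-out attempt only cancels that attempt, and the cursor keeps the partial progress for the next one:

use std::io::Cursor;
use tokio::io::AsyncWriteExt;
use tokio::time::{timeout, Duration};

let mut buf = Cursor::new(data);
loop {
    match timeout(Duration::from_secs(5), writer.write_all_buf(&mut buf)).await {
        Ok(Ok(())) => break,          // the whole buffer has been written
        Ok(Err(e)) => return Err(e),  // a real I/O error
        Err(_) => println!("still writing after 5 seconds"), // resume from the cursor
    }
}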
Not cancelling futures#
Going back to the three prongs: the second prong is about actually cancelling futures. What code patterns can be used to not cancel futures? Here are a couple of examples.
The first one is, in a place like a select loop, resume futures rather than cancelling them each time. You’d typically achieve this by pinning a future, and then polling a mutable reference to that future. For example:
let mut future = Box::pin(channel.reserve());
loop {
    tokio::select! {
        result = &mut future => break result,
        _ = other_condition => continue,
    }
}
Coming back to our example of MPSC sends, the one asterisk with `reserve` is that cancelling it makes you lose your place in line. Instead, if you pin the `reserve` future and poll a mutable reference to it, you don’t lose your place in line.
(Does the difference here matter? It depends, but you can now have this strategy available to you.)
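Putting the pieces together, here’s a hedged sketch of the sending loop from earlier, with the `reserve` future pinned so that a timeout doesn’t cost us our place in the queue:

loop {
    let msg = next_message();
    // Pin the reserve future once, and keep polling the same future across timeouts.
    let mut reserve = Box::pin(tx.reserve());
    let permit = loop {
        match timeout(Duration::from_secs(5), &mut reserve).await {
            Ok(Ok(permit)) => break permit,
            Ok(Err(_)) => return, // the channel is closed
            Err(_) => println!("no space for 5 seconds"),
        }
    };
    permit.send(msg);
}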
The second example is to use tasks. I mentioned earlier that futures in Rust are diametrically opposed to similar notions in languages like JavaScript. Well, there’s an alternative in async Rust that’s much closer to the JavaScript idea, and that’s tasks.
- Unlike futures which are driven by the caller, tasks are driven by the runtime (such as Tokio).
- With Tokio, dropping a handle to a task does not cause it to be cancelled, which means they’re a good place to run cancel-unsafe code.
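Here’s a hedged sketch of that idea (`cancel_unsafe_work` and `shutdown_signal` are made-up names): the cancel-unsafe work is wrapped in a task, so even if we stop waiting on its `JoinHandle`, the work itself still runs to completion.

// Spawn the cancel-unsafe work onto the runtime.
let handle = tokio::spawn(cancel_unsafe_work());

tokio::select! {
    // The task finished; `res` is its output (or a JoinError).
    res = handle => { let _ = res; }
    // We stop waiting here, but the spawned task is *not* cancelled; it
    // keeps running in the background until it completes.
    _ = shutdown_signal() => {}
}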
A fun example is that at Oxide, we have an HTTP server called Dropshot. Previously, whenever an HTTP request came in, we’d use a future for it, and drop the future if the TCP connection was closed.
// Before: Future cancelled on TCP close
handle_request(req).await;
This was really bad because future cancellations could happen due to the behavior of not just the parent future, but of a process that was running across a network! This is a rather extreme form of non-local reasoning.
We addressed this by spinning up a task for each HTTP request, and by running the code to completion even if the connection is closed:
// After: Task runs to completion
tokio::spawn(handle_request(req));
Systematic solutions?#
The last thing I want to say is that this sucks!
The promise of Rust is that you don’t need to do this kind of non-local reasoning—that you can analyze small bits of code for local correctness, and scale that up to global correctness. Almost everything in Rust, from `&` and `&mut` to `unsafe`, is geared towards making that possible. Future cancellations fly directly in the face of that, and I think they’re probably the least Rusty part of Rust. This is all really unfortunate.
Can we come up with something more systematic than this kind of ad-hoc reasoning?
Nothing like this exists in safe Rust today, but there are a few different ideas people have come up with. I wanted to give a nod to those ideas:
- Async drop would let you run async code when a future is cancelled. This would handle some, though not all, of the cases we discussed today.
- There’s also a couple different proposals for what are called linear types, where you could force some code to be run on drop, or mark a particular future as non-cancellable (once it’s been created it must be driven to completion).
All of these options have really significant implementation challenges, though. This blog post from boats covers some of these solutions, and the implementation challenges with them.
Conclusion#
In this post, we:
- Saw that futures are passive
- Introduced cancel safety and cancel correctness as concepts
- Examined some bugs that can occur with cancellation
- Looked at some recommendations you can use to mitigate the downsides of cancellation
Some of the recommendations are:
- Avoid Tokio mutexes
- Rewrite APIs to make futures cancel-safe
- Find ways to ensure that cancel-unsafe futures are driven to completion
There’s a very deep well of complexity here, a lot more than I can cover in one blog post:
- Why are futures passive, anyway?
- Cooperative cancellation: cancellation tokens
- Actor model as an alternative to Tokio mutexes
- Task aborts
- Structured concurrency
- Relationship to panic safety and mutex poisoning
If you’re curious about any of these, check out this link where I’ve put together a collection of documents and blog posts about these concepts. In particular, I’d recommend reading these two Oxide RFDs:
- RFD 397 Challenges with async/await in the control plane by David Pacheco
- RFD 400 Dealing with cancel safety in async Rust by myself
Thank you for reading this post to the end! And thanks to many of my coworkers at Oxide for reviewing the talk and the RFDs linked above, and for suggestions and constructive feedback.