std::execution — WG21 Document: P2300R10 · Authors: Michał Dominiak, Lewis Baker, Lee Howes, Kirk Shoop, Michael Garland, Eric Niebler, Bryce Adelstein Lelbach · Date: 2024-06-28 · Audience: LEWG, LWG
After fourteen years, ten revisions, multiple design pivots, and at least one vote that failed to reach consensus, P2300R10 proposes the standard async execution framework C++ has been waiting for—or dreading, depending on who you ask. The paper defines schedulers, senders, and receivers as the core abstractions for asynchronous execution, with composable algorithms (then, let_value, when_all, bulk, etc.) and a cancellation model built on generalized stop tokens. R10 removes ensure_started and start_detached (deemed foot-guns, to be replaced by P3149 async_scope), renames transfer to continues_on and on to starts_on, fixes the sender algorithm customization mechanism via P3303R1, and adds the __cpp_lib_senders feature test macro.
The design is purely lazy—no work happens until you call start() on a connected operation state—and the paper spends considerable ink defending this choice against eager alternatives. The proposed wording is massive, touching stop tokens, the <execution> header, and adding a new subclause for asynchronous operations. Reference implementations exist at NVIDIA/stdexec and Meta/libunifex, with Intel’s bare-metal variant targeting embedded. Given this paper has been debated in SG1, LEWG, and plenary for over a decade, everyone has an opinion.
Paper Metadata
Document: P2300R10 · Title: `std::execution` · Authors: Michał Dominiak, Lewis Baker, Lee Howes, Kirk Shoop, Michael Garland, Eric Niebler, Bryce Adelstein Lelbach · Date: 2024-06-28 · Target: LEWG, LWG · Revision: R10 (previous: R9)
Is there a quick summary of what changed from R9? I’m still mentally processing the `tag_invoke` removal.
I’ve been waiting for a standard async framework since I started my C++ career. I’m about to retire.
I’ve shipped three products, had two kids, and changed careers since executors were first proposed. My oldest is learning to code. In Rust.
your kids have better taste than the committee
This paper has more revisions than my git repo has commits.
The biggest under-discussed change in R10 is the removal of `ensure_started` and `start_detached`. These were the escape hatches—the “I need to fire off some work and not block on it” tools. Their removal forces every use of senders into a fully structured concurrency model where every operation must be awaited through a scope. The paper points to P3149 (async_scope) as the replacement, but P3149 is a separate paper that isn’t in the same revision. So we’re shipping the async framework without the tool you need for one of the most common async patterns. It’s like shipping `<algorithm>` but putting `std::sort` in a companion paper.
I understand why they were removed. P3187R1 makes a solid case that `ensure_started` is a foot-gun—if the returned sender is destroyed before the operation completes, you get detached work, which is exactly the problem structured concurrency is supposed to solve. But the alternative is “wait for async_scope to ship.” In the meantime, users who need fire-and-forget have to either roll their own or reach for the reference implementation’s extensions.
That “to be replaced” is doing a lot of heavy lifting. Is the committee comfortable shipping a framework with this dependency on a not-yet-adopted companion paper?
The removal is the right call. `ensure_started` was fundamentally at odds with the structured concurrency model. If you want fire-and-forget, you need a scope that owns the lifetime of that work. That’s what `counting_scope` in P3149 provides. It’s not a gap; it’s a deliberate sequencing of features.
The committee has explicitly committed to shipping async_scope for C++26. P3109 lays out the plan. This isn’t a “networking someday” situation.
I run a server that needs to spawn background tasks for cleanup on disconnect. In Asio I do this today with `co_spawn` on a detached executor. What’s the P2300 answer in R10? “Wait for the next paper”?
You use `counting_scope::spawn()`. The scope owns the lifetime. When the scope is destroyed, it joins all outstanding work. That’s better than detached—you can’t leak.
TL;DR they removed the easy way to do things and the replacement is in a different paper that hasn’t shipped yet. Peak committee energy.
I went back and re-read the rationale in P3187. The key argument is:
Which is fair. But the fix is to make the destructor join, not to remove the feature entirely. `std::jthread` solved exactly this problem for threads.
Edit: I know blocking destructors are controversial. My point is that there’s a design space between “detach silently” and “don’t exist.”
Edit2: actually rethinking this. The problem with a blocking destructor on a sender is that you don’t know what thread will run the destructor. If it’s the event loop thread, you deadlock. OK fine, removal is defensible.
skill issue
So we now have a 150-page async execution framework in C++26, and you still can’t read from a socket. The paper doesn’t mention networking once. Not in the motivation, not in the examples, not in the design rationale.
Asio has been production-ready since 2003. Chris Kohlhoff has maintained it through twenty years of standardization chaos. The Networking TS was built on it. And instead of adopting the thing that works, we got a decade-long argument about whether execution resources should be lazy or eager, and now we have a framework that can parallelize `std::inclusive_scan` across a GPU but can’t open a TCP connection.
I get it. P2300 is “the foundation.” Networking comes “later.” P2762 shows how networking senders might look. But “later” has been the answer for networking since C++11, and I’m running out of patience.
P2300 is the execution model. Networking is an I/O concern that builds on top of the execution model. You don’t ship TCP in the same paper as the scheduler concept for the same reason you don’t ship `std::format` in the same paper as `char_traits`.
The whole reason the Networking TS stalled is that it baked in its own execution model (Asio’s completion token mechanism) rather than building on a standard one. P2300 fixes that by providing the standard execution model first. Then you layer networking on top, and it composes with everything else.
“First the foundation, then the building.” Sure. Except the Asio completion token model composes today. I compose async operations with `deferred`, `use_awaitable`, and `parallel_group`. It works. It’s deployed. I don’t need to understand `transform_completion_signatures_of` to read from a socket.
Show me one production deployment of stdexec that involves networking. Not a toy echo server. A production system.
Asio’s completion token model doesn’t compose with parallel algorithms, bulk operations, or structured concurrency. It’s a callback-adapter layer on top of proactor I/O. Senders compose with everything—that’s why they’re worth the complexity.
libunifex is deployed at Meta’s scale. NVIDIA uses stdexec internally for GPU pipeline orchestration. Intel has a bare-metal variant running on microcontrollers. The deployment exists, just not in the “TCP echo server” domain you’re looking at.
So the async framework for C++ is deployed at two companies—both of which employ the paper’s authors. And neither deployment involves I/O. You’re proving my point: this is a GPU compute framework that got generalized into a standard, not a general-purpose async framework that happens to support GPUs.
Asio is deployed at thousands of companies for networking. That’s a different scale of validation.
libunifex supports io_uring. It does I/O. The networking bridge (stdexec asioexec) was merged months ago. And the design explicitly supports I/O—see section 1.4, the Windows socket recv example. The paper just doesn’t standardize I/O because that’s a separate concern.
I just want to read from a socket without a PhD in template metaprogramming. Is that too much to ask.
just use tokio lmao
committee gonna committee
14 years. Executors have been in progress since 2012. There are people on the committee who started as grad students and now have gray hair because of this topic.
and people say C++ moves too fast
Firmware developer here. The zero-allocation property of operation states is the part of this paper that nobody is talking about. When you `connect` a sender chain, the resulting operation state composes statically. The paper’s `run_loop` example uses space in its operation states to build an intrusive linked list of work items—all on the stack. Zero heap allocations.
Intel’s bare-metal implementation proves this works in constrained environments. I’ve been building ad-hoc versions of exactly this pattern for twenty years. Having it in the standard with proper type safety and composability is genuinely exciting.
The people complaining about complexity are looking at the implementer-facing API. The user-facing API for someone who just wants to compose `then | continues_on | bulk` is clean.
The intrusive linked list in `run_loop` is exactly what we’ve been doing by hand for decades on embedded systems. Having it formalized in a standard type that composes with generic algorithms is worth the complexity of the specification. The specification is complex so that the usage doesn’t have to be.
will this fit in 64K though
The Intel implementation runs on STM32. So yes, if your toolchain supports C++20 concepts. The code size overhead is mostly template instantiation, which you’re already paying for if you use any modern C++.
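For anyone who hasn’t written this by hand: here’s a from-scratch sketch of the intrusive-queue idea (this is not the P2300 `run_loop` API, and the names `tiny_loop`/`op_state` are made up). Each operation state carries its own list node, so queueing work needs zero heap allocations—the states live on the stack:

```cpp
// From-scratch sketch of the intrusive work queue idea (not the P2300
// run_loop API): each operation state embeds its own list node, so the
// queue never allocates -- the states themselves live on the stack.
struct op_base {
    op_base* next = nullptr;
    void (*execute)(op_base*) = nullptr;
};

class tiny_loop {
    op_base* head_ = nullptr;
    op_base** tail_ = &head_;
public:
    void push(op_base* op) { *tail_ = op; tail_ = &op->next; }
    void run() {                        // drain the queue in FIFO order
        for (op_base* op = head_; op != nullptr; op = op->next)
            op->execute(op);
        head_ = nullptr; tail_ = &head_;
    }
};

template <class Fn>
struct op_state : op_base {             // one node per pending operation
    Fn fn;
    explicit op_state(Fn f) : fn(f) {
        execute = [](op_base* self) { static_cast<op_state*>(self)->fn(); };
    }
};
```

Usage: declare the `op_state` objects as locals, `push` their addresses, `run`. No `new`, no `delete`, no type erasure beyond one function pointer per node.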
The paper’s priorities in section 1.2 say “Make it easy to be correct by construction.” Then section 1.5.1 shows the implementation of `then`—the simplest possible sender adaptor—and it requires:
- a `connect` member and a `get_completion_signatures` member
- `transform_completion_signatures_of<S, Env, _except_ptr_sig, _set_value_t>` to compute the completion signatures
- a `connect_result_t` return type for the wrapped operation state
For `then`. The “hello world” of sender adaptors. If you want to write a custom sender for your execution resource, the learning curve is vertical.
The user-facing API (`schedule | then | continues_on`) is genuinely nice. But the gap between “using senders” and “implementing senders” is the widest I’ve seen in any standard library feature, and that includes ranges.
Eric Niebler literally said “the P2300 crew have collectively done a terrible job of making this work accessible.” At least the authors know. The question is whether acknowledging the problem counts as addressing it.
imagine debugging a template error from `transform_completion_signatures_of<S, Env, completion_signatures<set_error_t(exception_ptr)>, _set_value_t>`
laughs in compile times
I compiled a hello world with stdexec and it took 47 seconds. I am not joking. Forty-seven seconds. The binary was 8 bytes of useful work and 12 MB of template instantiations.
The end-user API is actually pretty clean. The pipe syntax reads left-to-right. You don’t need to understand completion signatures to use senders any more than you need to understand iterator traits to use a range-based for loop. Same tradeoff as ranges: the spec is complex so the usage doesn’t have to be.
Fair point on ranges. But with ranges, if I need to write a custom view, I can look at existing views and the pattern is learnable. With senders, the customization mechanism goes through `transform_sender` on an execution domain, which requires understanding the entire domain dispatch machinery in section 5.4. That’s a qualitatively different barrier.
tell me you’ve never shipped production code without telling me
Section 4.9.4 contains this remarkable admission:
“Still being investigated.” In R10. The revision being proposed for the standard. The cancellation overhead concern from SG1 is documented but unresolved in the text that’s going into C++26.
To be clear: the design mitigates this with `never_stop_token` and `unstoppable_token`, which should let the compiler optimize out the cancellation path. But “should” and “does” are different things, and the paper itself isn’t confident enough to remove the caveat.
I work on a major compiler. The `never_stop_token` optimization works with `-O2` in practice. The `unstoppable_token` concept is a compile-time check, so the compiler sees `bool_constant<(!tok.stop_possible())>::value` and eliminates the cancellation code path entirely. But it requires the stop token type to be statically known as `never_stop_token`, not just a `stoppable_token` that happens to return `false` from `stop_possible()` at runtime. That’s the right design—it puts the optimization decision in the type system.
Good to know. So the SG1 concern is effectively addressed by implementation quality, not by the specification? That’s... fine, I guess, but it means the paper is relying on optimizers doing the right thing with a complex type-level protocol. Which historically has not been C++’s strongest suit.
great, another paper that will take 10 years to get through LEWG
it already took 14
Section 4.10 is the most important part of this paper and the least discussed. The decision to make all senders lazy is what enables GPU kernel fusion, static operation state composition, and zero-overhead structured concurrency. If senders could be eager, every algorithm would need to handle the race between “the operation completed before connect was called” and “the operation hasn’t started yet,” which is exactly the `std::future` problem we’re trying to escape.
The paper lays out five failure modes of eager senders (UB on destruction, detached work, blocking destructors, type-erased stop callbacks, loss of execution context). Every one of them is a real bug I’ve debugged in production GPU code. Laziness eliminates all five by construction.
In HFT we need to start work immediately when market data arrives. The latency between “data received” and “computation begins” is measured in nanoseconds. With lazy senders, there’s always a connect-then-start overhead between constructing the sender and beginning execution.
The paper’s argument in 4.10.1 against eager senders doesn’t account for the case where you know the operation won’t be cancelled, the receiver is ready, and the only thing between you and execution is the framework’s own laziness overhead.
The paper addresses exactly this in section 4.10:
You can build eager on top of lazy. You cannot build lazy on top of eager without introducing synchronization. The design correctly defaults to the composable primitive and lets users opt into eagerness when they need it.
“Build eager on top of lazy” adds overhead. In our measurements with stdexec, the connect+start path adds 15-30ns compared to a direct function call. At our scale, that’s 2-3 ticks of market data processing. I’m not saying the design is wrong—I’m saying the paper claims “zero overhead” abstraction and it isn’t, for this use case.
Your 15-30ns overhead is my kernel fusion. The paper never claims zero overhead for all use cases. Section 1.2 says “care about all reasonable use cases, domains and platforms,” and section 4.10 explicitly acknowledges that the lazy model trades eager-start latency for composability and correctness. That’s a design choice, not a bug.
This is why the standard should support both models. Instead we got one camp’s preference enshrined as the only option. The committee had a chance to provide eager-when-available, lazy-by-default, and chose not to.
For people who haven’t been following: the committee history of this paper is wild.
The 2022 vote is the one to study. Thirty-seven people voted in favor, seventeen against, and nobody abstained. That vote tells you everything about how polarizing this design is. The fact that we got from there to a shippable proposal in two years is, frankly, impressive committee work.
Zero neutral votes on an executors poll is the most on-brand WG21 thing I’ve ever heard. Everyone had an opinion and nobody was willing to sit it out.
54 people in a room, all with strong opinions about async C++, zero abstentions. Surprised nobody got stabbed.
P3109 was the turning point. Once there was a concrete plan with milestones—what ships in C++26, what waits for C++29—the opposition softened. Not because the design changed, but because the process became legible. Half the SA votes in 2022 were probably “not ready yet” rather than “wrong direction.”
[removed by moderator]
what did they say?
something about the Networking TS being sabotaged, you know the usual
Rule 2. Conspiracy theories about committee process are not constructive.
I teach a graduate C++ course. My plan for integrating senders is: year one, coroutines and ranges; year two, senders. You need the prerequisites. Students need to be comfortable with concepts, `operator|` composition, and the idea of lazy evaluation before they can approach this.
The good news: the pipe syntax (`schedule(sch) | then(f) | continues_on(other_sch) | then(g)`) is genuinely intuitive once you’ve seen range adaptors. The bad news: the first time a student makes a mistake and gets a template error message, they’ll lose a week.
Two years of prerequisites to use the standard async library. This is a real cost that the paper doesn’t acknowledge. The motivation says “care about all reasonable use cases” but the onboarding curve says “care about use cases where the developer has a graduate degree in template metaprogramming.”
and this is why bootcamp kids pick JavaScript
R10 changelog for names alone:
- `transfer` → `continues_on`
- `on` → `starts_on`
- `on` = `starts_on` + `continues_on`
- `get_delegatee_scheduler` → `get_delegation_scheduler`
- `read` → `read_env`
R9 renamed `in_place_stop_*` to `inplace_stop_*`. R8 renamed `make_completion_signatures` to `transform_completion_signatures_of`. Every revision renames things. If you’ve been writing code against the reference implementation, you’ve rewritten your import statements every six months.
I’ve rewritten my blog tutorial three times. At this point I’m going to wait for C++29 before I publish anything, just so I don’t have to update it again when R11 renames `sync_wait` to `blocks_caller_until_done_with_optional_variant_tuple`.
The rename from `transfer` to `continues_on` is actually an improvement. “Transfer” is ambiguous—transfer what? `continues_on(snd, sch)` reads as “the work described by `snd` continues on `sch`.” Same for `starts_on`. The new names describe the operation. See P3175R3 for the rationale.
Fair, `continues_on` is clearer. My complaint isn’t about any individual rename—it’s that ten revisions of renames means the entire pre-standardization ecosystem’s documentation is wrong. Every CppCon talk, every blog post, every Stack Overflow answer uses the old names.
Has anyone noticed there’s no `schedule_after` or `schedule_at`? The paper mentions in section 4.2:
“Future papers.” So C++26 gets an async framework where you can’t say “do this in 5 seconds.” Timer-based scheduling is one of the most basic async patterns—timeouts, debouncing, polling, heartbeats—and it’s deferred to a future standard.
Time-based scheduling requires a clock source, which is execution-resource-specific. A thread pool uses `steady_clock`. An event loop uses its own timer mechanism. An embedded system uses a hardware timer. Standardizing all of that in the same paper would triple its size. The right move is to ship the core framework and layer timers on top.
ah yes, the C++ tradition of shipping half the feature and promising the rest later. See also: modules (no standard build system), coroutines (no `generator` until C++23 and still no `task`), ranges (no `to<vector>` until C++23). We keep standardizing the hard infrastructure and leaving out the parts people actually use.
The completion signatures mechanism (section 5.8) is probably the most underappreciated innovation in the paper. It’s a type-level protocol that lets a sender statically declare every way it can complete—value types, error types, and whether it can send stopped. This means the compiler can check at `connect` time whether the receiver can handle all possible completions.
Compare this to `std::future<T>`, which can only send one value type or an `exception_ptr`. Senders can send multiple different value types depending on runtime conditions, and the type system tracks all of them. That’s why you see things like `into_variant`—it collapses all possible value completions into a single variant type when you want to unify them.
`transform_completion_signatures_of` is the real power tool here. It lets you write an adaptor that transforms value completions while passing through error and stopped completions automatically. Once you understand it, writing adaptors becomes mechanical. The problem is the “once you understand it” part takes about a week of staring at the spec.
Meanwhile Rust has had async/await since 2019, tokio has been production-ready for years, and their async story “just works.” But sure, let’s spend another decade debating whether senders should be lazy.
Rust async is single-threaded by default and their `Send`/`Sync` pain for multi-threaded runtimes is well documented. Their model also doesn’t address heterogeneous compute (GPUs, FPGAs, DSPs). Different design space, different tradeoffs.
Every time someone mentions Rust async, they forget to mention the function coloring problem, the `Pin<Box<dyn Future>>` dance, and the fact that `select!` is a macro because the type system can’t express it. Async is hard in every language.
Sir, this is a Wendy’s
The author list is four NVIDIA employees and two Meta employees. The reference implementation is an NVIDIA project. The primary deployment is at NVIDIA and Meta. The design prioritizes GPU compute and large-scale server workloads, which are NVIDIA and Meta’s business domains.
This is a competent design for the problems the authors personally face. The committee should recognize that the recommendation to adopt this as the standard async model for all of C++ comes from authors with a documented advocacy position and commercial interest in this specific design direction. Corroborating analysis from authors outside the GPU/hyperscaler bubble would strengthen the case.
That’s not a criticism, that’s a feature. You want the people who actually need async at scale to design the async framework. Eric Niebler is also the person who designed range-v3, which became `std::ranges`. Lewis Baker wrote cppcoro. These aren’t corporate drones pushing a product—they’re the people who actually understand the problem space.
I didn’t say the design is wrong. I said the recommendation is advocacy-informed rather than neutral. Eric Niebler’s track record is excellent. But “this person has been right before” is not the same as “we don’t need independent validation.” What does someone who writes database engines, or embedded firmware, or game engines think about this design? We should hear from them too.
can we please just get networking in the standard before I retire
Want to 10x your C++ skills? Our Advanced C++ Masterclass covers async, coroutines, and more! Use code SENDERS20 for 20% off. learnmoderncpp.io
report and move on
Game engine perspective: we evaluated stdexec for our ECS job system. The runtime performance was competitive with our hand-rolled fiber-based scheduler—within 5% on our benchmark suite. The structured concurrency model maps well to frame-scoped work: spawn all jobs for a frame, `sync_wait` at the frame boundary, guaranteed completion.
The compile times were brutal. Our job system compiles in 3 seconds. The stdexec version took 90 seconds for the same translation unit. For a game studio that does 200+ builds per day, that’s a non-starter. We’re watching this space but won’t adopt until compile times improve by at least 10x.
Compile times are an implementation quality issue, not a design issue. The `basic-sender` exposition-only class template in R8+ is specifically designed to reduce template instantiation depth. As compilers get better at concepts and as `<execution>` moves into the standard library (precompiled), this will improve. Modules will also help—eventually.
I’ve heard “modules will help” for four years. My build system doesn’t even support modules yet and we’re shipping on three platforms. The 47-second hello world someone mentioned upthread was not a joke, was it.
tangentially, has anyone benchmarked `<execution>` compile times compared to, say, `<ranges>` or `<format>`? My CI pipeline already takes 2 hours and I’m not thrilled about adding another heavy header.
modules will fix this
modules will fix this
“modules will fix this” 🤡
I actually read the whole paper. All of it. Here’s what everyone is missing while arguing about naming and networking:
The sender algorithm customization mechanism in section 5.4 is the actual innovation. When you call `then(snd, f)`, the algorithm doesn’t just wrap the sender in a generic adaptor. It queries the sender’s completion scheduler for an execution domain, and then asks that domain to `transform_sender` the algorithm. This means a CUDA scheduler can intercept `bulk(snd, shape, f)` and turn it into a GPU kernel launch without the user writing any GPU-specific code.
This is why P3303R1 (fixing `transform_sender` in `connect`/`get_completion_signatures`) was critical for R10. Without it, the domain dispatch happened too early and the scheduler couldn’t see the full sender chain. The fix makes it lazy—the transform happens at connect time, when the scheduler has full context.
This is the part that justifies the complexity. Not `then`. Not `sync_wait`. The domain dispatch is what makes senders a platform rather than just another callback library.
This is the comment. The domain dispatch mechanism is what separates P2300 from “just another async library.” Without it, every GPU runtime would need to customize every algorithm. With it, the domain sees the whole pipeline and can fuse operations. That’s the magic.
Unpopular opinion apparently: I’ve been using coroutines for 3 years and this is the first time I’ve seen a coherent story for async in C++ that doesn’t feel like duct tape. The pipe syntax is nice, the structured concurrency model is sound, and the fact that operation states compose on the stack without allocation is genuinely novel for a standard library feature. I’ll take it.
brave of you to say something positive on r/wg21
I work on a major compiler and the amount of work required to implement this header is... significant. The `basic-sender` exposition-only type alone has more moving parts than most standard headers. We’re looking at 6-12 months of implementation work after the wording is frozen. Users should not expect `<execution>` to be available in their compiler the day C++26 ships.
oh no
proposed wording for my codebase:
you forgot `| stopped_as_error(std::make_error_code(std::errc::operation_canceled))` for when you rage-quit
I thought about this more and there’s a gap nobody’s mentioned: the interaction between senders and coroutines.
Section 5.7 says you can `co_await` a sender in a coroutine that uses `with_awaitable_senders`. That’s great. But the paper doesn’t provide a standard coroutine `task` type. So you can co_await senders, but only if you write your own promise type first. The gap between “coroutines exist” and “senders exist” and “they work together seamlessly” is still two or three papers wide.
This is the biggest practical gap. Most C++ developers who’ve used async at all have used coroutines. They want to write `co_await read_from_socket()`, not `schedule(sch) | then(f) | continues_on(sch2) | then(g)`. A standard `task<T>` that bridges senders and coroutines is what makes this usable for the median developer. Without it, P2300 is infrastructure for library authors, not end users.
is this the thread where we complain about executors taking too long, or the thread where we complain about them being too complicated? I want to make sure I’m posting in the right one.
yes
Boost/Folly/libunifex already do this. But sure, let’s spend another meeting cycle on wording for `get_delegation_scheduler`.
The whole point of standardization is that you don’t need to pick between Boost, Folly, and libunifex. One vocabulary. One set of concepts. `sender auto` works with any execution resource. That’s worth the pain, even if it takes 14 years.