P0876R22 - fiber_context - fibers without a scheduler [WG21]
▲ 342 points · 52 comments · submitted 5 hours ago by u/standards_watcher_2026

Document: P0876R22 · Authors: Oliver Kowalke, Nat Goodspeed · Date: 2026-02-22 · Audience: LEWG, LWG, CWG

Oliver Kowalke and Nat Goodspeed are back with R22 of the fiber proposal - the paper that has been through more revisions than most of us have had performance reviews. fiber_context proposes a minimal API for stackful context switching. No scheduler, no green threads, no opinions about how you use it. Just resume() to switch stacks and a synthesized fiber_context to switch back. Think Boost.Context, but standardized.

The pitch: give C++ a low-level primitive for switching between function call stacks. This is the building block that Boost.Fiber, Meta's folly::fibers, Bloomberg's quantum, Baidu's bthread, and literally billions of WeChat users (via Tencent's libco) already rely on through Boost.Context. A fiber switch takes 11 CPU cycles on x86_64. That is function-call territory.

The drama: EWG forwarded this to CWG/LWG for C++26 in Sofia last June (SF 10 / F 14 / N 4 / A 5 / SA 1), but both groups ran out of time. Now targeting C++29. The per-fiber exception state requirement from St. Louis 2024 remains the main implementation controversy, with Microsoft voicing performance concerns.

Paper: https://wg21.link/p0876r22

342 points (82% upvoted)
sorted by: best
u/AutoModerator 1 point 5 hr. ago pinned comment

Reminder: be civil. Paper authors sometimes read these threads. If you haven't read the paper, consider doing so before commenting on the design.

u/async_plumber 97 points 4 hr. ago 🏆

I work on an async networking stack built on Boost.Fiber, and this is the comment I've been waiting to write for about five years.

The paper barely touches what I think is the strongest argument for fiber_context: P4003R0 and P4007R0 laid out the coroutine frame allocation problem for async I/O. Every coroutine in an I/O call chain has a frame that can't be HALO-optimized because the operation's lifetime outlives the caller. So you need a custom allocator, threaded through every single coroutine in the chain via a thread_local write-through cache.

With fibers? The stack pointer IS your allocator. Decrement to allocate, increment to deallocate. Zero overhead frame management. We switched from a coroutine-based I/O model to folly::fibers at $employer three years ago and saw measurable latency improvements at p99 just from eliminating coroutine frame allocation churn.
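The "stack pointer IS your allocator" point in miniature - a toy bump allocator, nothing fiber-specific, and the names are mine:

```cpp
#include <cassert>
#include <cstddef>

// Toy bump "stack": pushing a call frame is one pointer decrement,
// returning is one increment. A fiber gets this for free from the
// ordinary call/return sequence on its own stack.
struct BumpStack {
    std::byte buffer[4096];
    std::byte* sp = buffer + sizeof(buffer);  // stacks grow downward

    void* allocate(std::size_t n) {
        sp -= n;            // "push a frame"
        return sp;
    }
    void deallocate(std::size_t n) {
        sp += n;            // "pop a frame" (strictly LIFO)
    }
    std::size_t used() const {
        return static_cast<std::size_t>(buffer + sizeof(buffer) - sp);
    }
};
```

Compare that to threading a custom allocator through every coroutine in a call chain: on the fiber side there is no allocator call on the hot path at all.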

The resume() taking 11 cycles claim checks out in our benchmarks on x86_64. That is comparable to a virtual function call. But the elephant in the room is the per-fiber exception state mandate from St. Louis - EWG voted 6/8/3/0/0 to require it. Has anyone actually measured what that costs on top of the 11 cycles?

Real link for the curious: github.com/boostorg/context - this is what the proposed API is modeled on.

Edit: to be clear, the 11 cycles number is from Boost.Context WITHOUT per-fiber exception state. The paper proposes mandating it but doesn't benchmark the delta. That's the thing I want to see.

u/fiber_skeptic_2024 54 points 3 hr. ago

This is a great summary of the allocation argument, but you've buried the lede in your own edit. The 11 cycles number is meaningless for evaluating the proposal because the proposal mandates something that isn't measured in those 11 cycles.

Per-fiber exception state means on every resume() or resume_with() you have to save and restore the data underlying std::uncaught_exceptions() and std::current_exception(). The paper describes two strategies: (1) save/restore TLS on every switch, or (2) use upper_bound() on current stack pointer to look up which fiber is current. Strategy 1 adds a TLS read+write on every switch. Strategy 2 shifts cost to every exception query.

What's the actual number? 15 cycles? 30? 100 on Windows with SEH? The paper doesn't say.
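For anyone who hasn't read the paper: this is the state strategy 1 has to shuttle around on every switch. Sketch only - the struct name and capture() helper are mine, and the restore half has no portable API, which is exactly why it's an implementation-internals question:

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>

// Per-fiber exception state, strategy 1: the data that would have to be
// saved on suspend and restored on resume. FiberExceptionState and
// capture() are illustrative; a conforming implementation would swap the
// language runtime's internal TLS, which has no portable user-level API.
struct FiberExceptionState {
    int uncaught;                 // backs std::uncaught_exceptions()
    std::exception_ptr current;   // backs std::current_exception()
};

FiberExceptionState capture() {
    return { std::uncaught_exceptions(), std::current_exception() };
}
```

The open question is whether doing the runtime-internal equivalent of this on every switch costs 3ns or 30.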

u/async_plumber 38 points 3 hr. ago

Fair point. In our testing with the Boost.Context patch that Nat presented in Wroclaw, the TLS save/restore strategy adds roughly 3-5ns per switch on libstdc++/Linux. So call it 20-25 total cycles on a modern Xeon. Still well under a microsecond, still orders of magnitude cheaper than a kernel context switch.

Whether that's acceptable depends on your workload. For us doing network I/O where each operation already takes microseconds minimum, it's noise. For someone doing millions of fiber switches per second on a hot HFT path, maybe not.

u/fiber_skeptic_2024 29 points 2 hr. ago

3-5ns on libstdc++/Linux is the optimistic case though. Gor Nishanov's feedback in February 2025 specifically flagged Windows/SEH as the concern. The MSVC exception model ties exception state to the stack in a fundamentally different way than Itanium ABI.

And "possible while expressing concern about potential performance" from the paper is doing a LOT of heavy lifting. What does "concern about potential performance" mean when it's Microsoft talking about their own runtime?

u/async_plumber 22 points 2 hr. ago

Yeah, the Windows story is genuinely different. SEH is stack-frame-based, and fibers create new stacks. I don't think anyone is claiming the cost is identical across platforms.

But here's the thing - without per-fiber exception state, you get the Appendix A/B/C horrors. std::current_exception() returns the wrong exception. throw; rethrows something from a completely different fiber. Those aren't hypothetical - the paper has working demonstrations of the breakage.

So the question isn't "is per-fiber exception state expensive." It's "is the alternative - broken exception semantics - acceptable." I'd argue no.

u/fiber_skeptic_2024 18 points 1 hr. ago

Alright, that's a fair framing. Correctness over performance, especially for a feature that's supposed to be a building block. I'd still like to see actual benchmarks from MSVC before this ships, but I concede the paper made the right call mandating it. The alternative is a landmine in the standard library.

u/UB_enjoyer_420 287 points 5 hr. ago 🏆🏆

22 revisions. this paper has been in committee longer than some of my coworkers have been alive

u/cmake_victim_42 143 points 5 hr. ago

committee gonna committee

u/boost_user_since_03 34 points 4 hr. ago

Some of us have been waiting for this since N3985 in 2014. The lineage is N3985 → P0099 → P0534 → P0876. Over a decade of stackful context switching proposals. At this rate my grandchildren will get to use std::fiber_context.

u/context_switch_cost 76 points 4 hr. ago

The paper claims:

A fiber switch takes 11 CPU cycles on a x86_64-Linux system using an implementation based on the strategy described in "fiber switch using the calling convention"

We've measured similar numbers with our own fcontext-based implementation. On Zen4 it's closer to 8-9 cycles. On Sapphire Rapids, 10-12. The calling-convention trick - save only the callee-saved registers (R12-R15, RBX, RBP on SysV) and swap the stack pointer - is well-understood and about as fast as you can get without compiler intrinsics.

What I want to know is the cost with the per-fiber exception state mandate baked in. The paper proposes TLS save/restore on every switch OR an upper_bound() lookup on current_exception()/uncaught_exceptions() calls. In our codebase, fiber switches are on the critical path. Even 5ns extra per switch matters when you're doing 10M+ switches/sec on a market data feed.

The paper needs actual benchmark numbers for both strategies before LWG can make an informed decision.
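Strategy 2 in miniature, for anyone who wants to reason about its cost: a map keyed by each fiber's stack top, upper_bound on the current stack pointer to find the running fiber. Illustrative names, not the paper's:

```cpp
#include <cstddef>
#include <map>

// Strategy 2 sketch: locate the running fiber by its stack pointer.
// FiberRecord and owner_of() are illustrative names.
struct FiberRecord {
    const std::byte* base;   // lowest address of this fiber's stack
    int id;
};

// Keyed by one-past-the-end of each stack, so upper_bound(sp) lands on
// the owning entry whenever sp is inside [base, top).
using StackMap = std::map<const std::byte*, FiberRecord>;

const FiberRecord* owner_of(const StackMap& m, const std::byte* sp) {
    auto it = m.upper_bound(sp);          // first top strictly above sp
    if (it == m.end() || sp < it->second.base)
        return nullptr;                   // sp not inside any known stack
    return &it->second;
}
```

This is why the cost shifts: the switch itself stays at ~11 cycles, but every current_exception()/uncaught_exceptions() call pays for a tree lookup.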

u/not_a_real_dev 12 points 3 hr. ago

wait, 11 cycles for a context switch? that's like... nothing? why is this even a debate

u/context_switch_cost 41 points 3 hr. ago

11 cycles is the base cost without exception state tracking. That's the whole point - the base is cheap. The question is what the mandated additions do to that number. A kernel context switch is 10,000-50,000 cycles depending on the OS and what gets invalidated. Even if per-fiber exception state doubles the fiber switch cost to 22 cycles, it's still three orders of magnitude faster. But "doubles the cost" is the kind of thing I'd like to know before it ships.

u/coroutine_hater 23 points 3 hr. ago

now add exception state save/restore plus whatever Windows SEH magic MSVC needs and watch that number triple. this is the classic WG21 pattern - propose something elegant, then committee requirements make it expensive, then it ships 5 years late and nobody uses it because the Boost version is faster

u/coroutine_hater 67 points 5 hr. ago

Remember when we were told coroutines would solve everything? P2300 and sender/receiver would give us async I/O and it would be beautiful? Now we need fibers because coroutine frame allocation is a problem (P4003R0), and coroutines waiting on senders need to propagate a custom allocator through every parameter list (P4007R0). Classic.

u/senior_cpp_23 28 points 4 hr. ago

Coroutines and fibers serve different purposes. Coroutines are great for lazy evaluation, generators, and structured concurrency patterns where suspension points are explicit. Fibers are for when you want to run existing synchronous code cooperatively without rewriting every function signature. Reading the paper helps.

u/coroutine_hater 15 points 4 hr. ago

I've read all 22 revisions thanks. My point is that the committee spent 5 years betting on stackless coroutines as the async primitive and now we're circling back to "actually you need stackful switching too." The ecosystem figured this out ages ago - that's why Boost.Fiber and folly::fibers exist.

u/actually_read_the_paper 89 points 3 hr. ago

Let me lay out the timeline because the process story here is remarkable.

LWG tentatively approved library wording in St. Louis, June 2024. CWG finished initial core wording review in Tokyo, March 2024 - with one change request (per-fiber exception state). EWG approved that change 6/8/3/0/0 in St. Louis. Then EWG requested implementation experience. Wroclaw: presented it. Microsoft asked for time. Hagenberg: Microsoft conceded it's implementable, but the concession arrived too late for EWG to act that meeting. Sofia, June 2025: EWG forwarded to CWG/LWG for C++26 with SF 10 / F 14 / N 4 / A 5 / SA 1.

And then CWG and LWG ran out of time.

The paper quotes P1000R6:

Just wait a couple more meetings and C++ will be open for business and can be the first thing voted into the C++ working draft.

This is the promise of the train model. A paper that has been through 22 revisions, had wording approved by two groups, got forwarded by a third, and still missed the train because of scheduling. Now it's C++29. The authors are understandably frustrated, and they are being polite about it.

The 5 against + 1 strongly against in the EWG vote is worth noting though. That is not insignificant opposition for a paper at forwarding stage. The concerns are real even if the majority is clear.

u/just_ship_networking 156 points 3 hr. ago 🏆

can we please just get networking OR fibers OR anything async in the standard before I retire

u/async_plumber 43 points 2 hr. ago

networking + fibers would honestly be the dream combo. synchronous-looking I/O code that's actually cooperative under the hood. that's what userver already does and it's a great model.

u/template_wizard_9000 71 points 2 hr. ago

the train model is more like a bus that's always full and the driver is arguing with the passengers about whether the destination exists

u/systems_greybeard 62 points 3 hr. ago

I want to push back on how the paper handles P3620R0's concerns. The paper states:

At a high level, P3620R0 appears to argue that unless fibers are appropriate for all use cases, they must not be available for any use case.

This is a straw man and the authors probably know it. P3620R0 raises three specific, technical concerns:

1. thread_local is shared between fibers on the same thread. The paper says "std::thread was introduced despite this problem." But thread_local was literally designed FOR threads. Fibers break the expectation. The paper's own TLS section acknowledges the problems but waves them away.

2. Deadlock potential when holding a mutex across a fiber switch. The paper again says "C++20 coroutines have the same problem." True, but coroutines have a visible suspension point (co_await). With fibers, ANY opaque function call might suspend. That is a qualitative difference in auditability.

3. Cross-thread resumption is forbidden (removed in R10). But many production fiber schedulers - including brpc's bthread - DO migrate fibers between threads. The standard facility is strictly less capable than several of the deployed systems the paper cites as evidence of demand.

Dismissing these as "fibers aren't appropriate for all use cases so they can't be available for any" is not engaging with the actual technical objections. A stronger paper would steelman P3620R0 and explain why the tradeoffs are still worth it.

u/boost_user_since_03 11 points 2 hr. ago

I've been using Boost.Fiber in production since 2016 and the TLS thing has literally never bitten us. We don't use thread_local in fiber-aware code. It's a known constraint, not a showstopper.

u/systems_greybeard 27 points 2 hr. ago

Your experience in a fiber-aware codebase doesn't generalize. The whole point of standardizing something is that it gets used by people who didn't write the library. What happens when someone calls a third-party library from a fiber and that library uses thread_local internally? The paper's own TLS section explicitly acknowledges this: function F uses thread_local V, calls function G which resumes another fiber, that fiber modifies V, and when F regains control, V has a surprising value.

The paper's response is "std::thread was introduced despite this problem." That's true but threads got thread_local as a companion. Fibers don't get fiber_local. SG1 rejected P3346R0 which would have made thread_local fiber-specific.
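You can reproduce the quoted F/G/V scenario without any fibers at all - an ordinary call stands in for the fiber switch, which is precisely the problem: from F's side the two are indistinguishable. Function names follow the paper's scenario; the rest is mine:

```cpp
#include <cassert>

// The paper's TLS hazard, minus the fibers. In the real scenario G
// resumes another fiber on the same thread and THAT code writes V;
// here a plain call plays the role of the switch, because from F's
// point of view both are just an opaque call that clobbers shared
// thread_local state.
thread_local int V = 0;

void G() {       // stand-in for "resume another fiber that touches V"
    V = 42;
}

int F() {
    V = 1;
    G();         // opaque call: might be a fiber switch, might not
    return V;    // surprising value: not the 1 that F wrote
}
```

With threads you at least know another thread has its own V; with fibers sharing one thread, nothing in F's source warns you.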

u/fiber_skeptic_2024 19 points 2 hr. ago

+1 on the cross-thread point especially. The paper cites brpc bthread (1 million+ deployed instances), Alibaba's Photon, and Meta's folly as evidence of demand - but all three support migrating fibers between threads, which fiber_context explicitly forbids. The standard version is a strict subset of what the cited production systems actually do.

u/definitely_knows_fibers 5 points 4 hr. ago

so this is basically green threads but they don't want to call it green threads

u/coroutine_convert 31 points 4 hr. ago

No. Green threads have a scheduler. fiber_context is specifically scheduler-free. When you call resume(), YOU decide which fiber runs next. There is no runtime making scheduling decisions for you. That is the entire point of this being a "building block" - you can implement green threads on top of it (Boost.Fiber does exactly that), but the primitive itself is lower level.

u/coroutine_convert 84 points 3 hr. ago

The API design here is genuinely interesting, and I think most people scrolling past are missing what makes it clever.

The key insight: resume() returns a synthesized fiber_context representing the calling fiber. So the fiber you switch to receives a handle to the fiber that just suspended. And resume() is rvalue-ref-qualified, so calling it empties the object. You literally cannot hold two handles to the same fiber.

fiber_context f{[](fiber_context&& caller) {
    // caller is the fiber that called f.resume()
    caller = std::move(caller).resume();
    // back again, caller updated
    return std::move(caller);
}};
f = std::move(f).resume();
// f now represents the suspended lambda

The type system prevents the misuse. If you have the object, you own the only handle. Once you call resume(), the object is empty. No double-resume, no dangling handles, no reference counting. Move-only + rvalue-ref-qualified is brutal and correct.

But the learning curve is steep. The "synthesized fiber_context" concept requires understanding that resume() both suspends the caller AND creates a new object representing that caller. It is symmetric switching - no caller/callee relationship. Compare to Go's goroutines where the scheduler is invisible, or coroutines where co_await is the explicit suspension point. fiber_context makes you be the scheduler, which is the point - and the barrier to adoption.
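Here's the ownership pattern boiled down to a toy, with consume() standing in for resume() - not the real fiber_context, just the move-only + rvalue-ref-qualified shape:

```cpp
#include <cassert>
#include <utility>

// Toy illustration of the pattern: consume() can only be called on an
// rvalue, and calling it empties the source and hands back a fresh
// handle, mirroring resume(). Not the proposed class, just its shape.
class OneShot {
    bool live_ = true;
public:
    OneShot() = default;
    OneShot(const OneShot&) = delete;             // no second handle, ever
    OneShot& operator=(const OneShot&) = delete;
    OneShot(OneShot&& o) noexcept : live_(o.live_) { o.live_ = false; }
    OneShot& operator=(OneShot&& o) noexcept {
        live_ = o.live_;
        o.live_ = false;
        return *this;
    }
    bool empty() const { return !live_; }

    // You must write std::move(x).consume(): the call consumes the
    // handle and returns a replacement.
    OneShot consume() && {
        live_ = false;
        return OneShot{};
    }
};
```

Trying to call consume() on an lvalue is a compile error, which is the whole trick: double-resume isn't a precondition violation, it's code that doesn't compile.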

u/daily_linker_error 47 points 2 hr. ago

I read the code example three times and I still can't trace which fiber is running at which line. I think I understand it now but I had to draw a diagram. this is going to be a pedagogical nightmare.

u/coroutine_convert 35 points 2 hr. ago

Right, and that is by design. The paper explicitly says this is the LOW-LEVEL primitive. You are not supposed to use fiber_context directly in application code any more than you use mmap instead of new. Libraries like Boost.Fiber wrap this in something sane. The value of standardizing the primitive is so those libraries can be written in portable C++ instead of per-platform assembly.

u/segfault_appreciator 21 points 2 hr. ago

move-only + rvalue-ref-qualified resume(). brutal and correct. this is how you prevent a whole class of bugs at the type level instead of writing "Preconditions: don't do that" in the spec

u/constexpr_everything_2025 14 points 4 hr. ago

ok but can we make fiber_context constexpr

u/template_wizard_9000 23 points 3 hr. ago

funny you should say that - the paper mentions that when Hana Dusikova was implementing P3367R3 constexpr coroutines, the "easiest way to model a coroutine" in the constexpr evaluator was to use fibers internally. so in some sense, fibers ARE constexpr. just not the way you meant

[deleted] 5 hr. ago

[removed by moderator]

u/segfault_appreciator 8 points 4 hr. ago

what did they say?

u/daily_linker_error 11 points 4 hr. ago

something about how C++ is dead and everyone should just use $other_language. the usual tuesday vibes

u/paper_trail_2019 1 point 4 hr. ago

Rule 2.

u/embedded_for_20_years 53 points 2 hr. ago

Embedded developer here. The explicit-stack constructor is the reason I care about this paper:

fiber_context(F&& entry, span<byte> stack, D&& deleter);

On our STM32 targets, we cannot afford dynamic stack allocation. We preallocate fixed-size stacks from a pool. The span constructor is exactly right for this use case.

But the paper says: "If at any time during the life of a fiber the data storage required to track its invocation sequence exceeds the size() of that span, the behaviour is undefined." No guard pages in our environment. No MMU in many cases. Stack overflow is silent corruption.

11 cycles for a context switch is appealing for our RTOS replacement use case. But UB on stack overflow, with no way to detect it, is the worst possible answer for safety-critical embedded systems. An implementation-defined hook for stack overflow detection would make this paper much more attractive for our domain.

u/async_plumber 22 points 1 hr. ago

The explicit stack constructor was added in R12 specifically for use cases like yours. You're right that the UB on overflow is unfortunate - but that is the same UB you get with any thread's stack today. The paper does note that implementations "might find it advantageous" to provide a guard page, but that is a Note, not normative.

For embedded, you probably want to size your stacks conservatively and instrument with stack painting during development. Same as with RTOS task stacks.
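Stack painting in miniature, in case anyone here hasn't done it: fill the stack with a sentinel before first use, then scan from the cold end to find the high-water mark. Names are mine:

```cpp
#include <cassert>
#include <cstddef>
#include <algorithm>

// Stack painting: the standard embedded trick for sizing stacks during
// development. paint() and high_water_mark() are illustrative names.
constexpr std::byte kPaint{0xAA};

void paint(std::byte* stack, std::size_t n) {
    std::fill(stack, stack + n, kPaint);
}

// Stacks grow downward, so unused paint survives at the low-address
// end; the first overwritten byte marks the deepest point reached.
std::size_t high_water_mark(const std::byte* stack, std::size_t n) {
    std::size_t untouched = 0;
    while (untouched < n && stack[untouched] == kPaint)
        ++untouched;
    return n - untouched;   // bytes actually used
}
```

It only catches overflows you exercised in testing, which is exactly why an implementation-defined detection hook would still be welcome for the no-MMU case.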

u/not_a_real_dev -3 points 1 hr. ago

what's a guard page?

[deleted] 4 hr. ago

[deleted]

u/just_ship_networking 31 points 4 hr. ago

laughs in compile times

at least <fiber_context> is presumably not going to be a compile-time disaster like <ranges>. it's one class with like 6 member functions.

u/cmake_victim_42 18 points 3 hr. ago

single header, single class, no template metaprogramming wizardry. this might be the most compile-time-friendly proposal to come out of WG21 in years. small victories.

u/not_a_real_dev 3 points 5 hr. ago

can someone ELI5 what a fiber is? I thought threads were the things you use for concurrency

u/senior_cpp_23 42 points 5 hr. ago

Thread: the OS decides when it runs and when it stops. Preemptive. Expensive to create (kernel resources, default 1-8MB stack). Having 10,000 of them makes your OS unhappy.

Fiber: YOU decide when it runs and when it stops. Cooperative. Cheap to create (user-space, you pick the stack size). Having 100,000 of them is fine. Each fiber runs on a thread, but many fibers can share one thread by taking turns.

fiber_context specifically: the lowest-level version of "fiber." No scheduler, no runtime, no opinions. Just resume() to switch from one stack to another. Everything else is built on top by libraries.

u/not_a_real_dev 1 point 4 hr. ago

ok that makes sense. but if I have to write the scheduler myself, why not just use threads and let the OS do it?

u/senior_cpp_23 28 points 4 hr. ago

Because 10,000 concurrent connections = 10,000 threads = your OS crying. With fibers you handle 10,000 connections on a handful of threads. Each connection gets its own fiber with a small stack. When a connection is waiting for I/O, its fiber suspends and the thread picks up another fiber. 11 cycles to switch vs 10,000+ for a kernel context switch.

You don't write the scheduler yourself - you use a library (Boost.Fiber, folly::fibers, etc.). Those libraries use fiber_context under the hood. This paper standardizes the under-the-hood part so those libraries don't need platform-specific assembly.

u/fiber_skeptic_2024 44 points 1 hr. ago

One more thing that bothers me about the proposed API. Destroying a non-empty fiber_context calls terminate:

If empty() is false, terminate is invoked ([except.terminate]).

This is more aggressive than std::thread, which gives you detach() and join() as escape hatches. With fiber_context, if you forget to properly wind down a fiber before the handle goes out of scope, your program dies. Period.

The paper includes an autocancel wrapper class in Appendix D - roughly 60 lines of non-trivial code - just to safely manage fiber lifetimes. If the "building-block primitive" requires a 60-line RAII wrapper to avoid calling terminate, maybe the building block needs another look at its lifecycle design.

I get the reasoning: you can't safely destroy a suspended fiber's stack because of RAII objects that might hold resources. And detach doesn't make sense without a scheduler. But "get it wrong and your program terminates" is a sharp edge for something intended to be foundational.

Edit: to be fair, std::jthread partially addressed the same problem for threads. Maybe a std::jfiber is the answer here too, but that's a separate paper and a separate decade of committee time.

u/actually_read_the_paper 19 points 47 minutes ago

The autocancel in Appendix D is illuminating. It has to track a done flag, a stop flag, and loop over resume() calls until the fiber voluntarily terminates. And even then the paper says the result is ambiguous because returning empty from resume doesn't necessarily mean YOUR fiber terminated - it could be some other fiber in the chain that terminated and resumed you.

This is the paper being honest about the complexity of the problem. Fibers are genuinely harder to manage than threads because there is no OS to clean up after you. Whether that makes the API wrong or just low-level is a reasonable question.