Authors: Jeff Garland, Paul E. McKenney, Roger Orr, Bjarne Stroustrup, David Vandevoorde, Michael Wong (Directions Group)
Document: P4023R0
Date: 2026-02-23
Target: WG21 (Plenary)
Link: wg21.link/p4023r0
The Directions Group has dropped a paper on how C++ should deal with AI. Six authors including Bjarne, and this one touches two completely different nerve clusters at once.
Thrust I: Governance. Aligning WG21 with ISO/IEC JTC1's existing guidance on AI use. The human author is the "intelligence of record." AI can assist with research, summarization, and consistency checks, but generating normative wording or core design proposals without rigorous human verification is out. Bots are forbidden from attending ISO meetings. And the paper directly acknowledges the "AI slop" problem - voluminous but low-quality submissions wasting committee time.
Thrust II: The ImageNet Challenge. The DG is calling on the ecosystem - Boost, Beman Project, academics, open source foundations - to build a curated, human-validated dataset of modern idiomatic C++ (C++20/23/26). Tagged by domain (embedded, finance, AI), favoring spans over pointers, sender/receiver over callbacks, algorithms over raw loops. They also want tooling that surfaces intent at the call site for AI agents, including potentially connecting compilers to MCP.
This is a directional paper updating P2000 - no proposed wording, no straw polls. It is the DG saying "here is what we think the strategy should be." Whether the ecosystem actually builds this dataset is a different question entirely.
Reminder: paper authors sometimes read these threads. Critique the paper, not the person. Rule 2 is enforced.
P4023R0 - Audience: WG21 (Plenary) - Authors: Jeff Garland, Paul E. McKenney, Roger Orr, Bjarne Stroustrup, David Vandevoorde, Michael Wong - PDF
tl;dr: the directions group said AI bad but also AI good but also please make an ImageNet for C++ but also we can't do it ourselves but someone should. got it.
that's... actually not far off
I work in ML. The ImageNet analogy is doing a lot of heavy lifting here and it does not hold up.
ImageNet worked because image classification has a ground truth. A picture of a cat is a cat. You can label 14 million images and have humans agree on what they depict. The labels are objective.
Code quality is not objective. "Modern, idiomatic C++" is a moving target that the committee itself disagrees about. Is `std::optional<T&>` idiomatic? Depends on which mailing you read. Are structured bindings in a range-for idiomatic? Depends on who you ask. Is `co_await` on a sender idiomatic? P2300 hasn't shipped outside NVIDIA's stack.
ImageNet didn't need to resolve philosophical debates about what a cat is. A curated C++ dataset would need to resolve philosophical debates about what good code is - and those debates are literally what the committee does for a living, slowly, and without consensus half the time.
The paper proposes domain tags (`ai/`, `embedded/`, `finance/`) as if that's sufficient quality control. It is not. You need pedagogical scaffolding - progressive complexity, explicit rationale for each design choice, anti-pattern comparisons. A flat corpus of "good" examples is how we got the current mess of stackoverflow-trained models in the first place.
The goal is right. The analogy undersells the difficulty by a factor of ten.
this is the comment I came here for. saving this.
You're right that ImageNet is a bad analogy, but the underlying point stands: AI models generate terrible C++ because they trained on terrible C++. The question isn't whether we need curated data - we do. The question is who's going to do the curation and what "correct" means for a language with this many dialects.
The Core Guidelines were supposed to be this. They have 400+ rules and still don't cover half the design space the paper mentions.
Fair. I should have said the analogy is misleading, not that the goal is wrong. The Core Guidelines point is good - they're a natural starting point and they already exist. The paper doesn't even mention them, which is strange.
can we please just get networking in the standard before we start governing AI. priorities, people.
this comment is posted under every single paper regardless of topic and it always gets 400 upvotes
Sir, this is a Wendy's.
Read the whole thing. Thrust I is 90% restating what ISO/IEC JTC1 SC22 N5991 already says. The "author is the intelligence of record" principle, the prohibition on AI-generated normative text, the copyright concerns - that document covers all of it.
What P4023 adds is (a) making it explicit for WG21 context, (b) the "no bots in meetings" rule, and (c) the direct acknowledgment that AI slop is already happening. Point (c) is the interesting one. The DG is putting it on paper that they've seen low-quality AI-generated submissions. That's a political statement as much as a policy one.
Thrust II is where the new content is, and also where the paper is weakest. More below.
so the Directions Group wrote a paper to say "what ISO already said, but with C++ branding"?
For Thrust I, basically yes. The value-add is making it WG21-specific and putting the "slop" problem on the record. For Thrust II - the ImageNet challenge - that part is genuinely new. The problem is it has no execution plan. No funding, no hosting org, no timeline, no success criteria. "The ecosystem should" is not a plan.
From the paper:
This is exactly backwards. APIs that are easy for humans and APIs that are easy for machines are not the same thing. Human-friendly APIs use overloading, implicit conversions, ADL, and contextual defaults. Machine-friendly APIs use explicit types, no overloading, flat namespaces, and named parameters.
Consider:
An LLM doesn't know which `connect` overload to call without reading the header. A human knows from context. These are different design pressures. The paper glosses over this entirely.
I actually agree with the paper here more than with you. The direction isn't "make C++ APIs look like Java." It's "surface intent at the call site." Inlay hints showing parameter names, concepts constraining template parameters, `[[nodiscard]]` on return types - these help both humans and machines.
The paper isn't asking you to stop overloading. It's asking you to make the intent of each overload discoverable without reading the implementation.
You're thinking about parameter names and inlay hints. I'm talking about overload sets and ADL. An LLM calling `swap(a, b)` has no way to know if it's getting `std::swap`, a hidden friend found via ADL, or a namespace-scope overload - and the behavior differences matter. Inlay hints don't help there.
The paper's advice reduces to "write clearer code" which - yes, obviously. But it claims the same clarity that helps humans also helps agents, and that's where the reasoning breaks down.
Fair point about ADL. That's a machine-hostile design pattern regardless of how you document it. I'll concede that the paper's framing of "human-friendly = agent-friendly" oversimplifies things. The reality is more like: some human-friendly patterns also help agents (concepts, nodiscard, strong types), and some don't (ADL, implicit conversions, overload sets with subtle SFINAE).
did two people just have a civil technical disagreement on reddit and reach partial consensus? is this real life?
so to summarize: C++ is so complex that even AI can't write it correctly, and the solution is... more papers
based
meanwhile in Rust, the compiler just tells you what's wrong. no curated training dataset needed. the type system is the training data.
cool story, now compile your Rust project with the 50 million lines of existing C++ it needs to interface with
this is why we can't have nice things
Rust borrow checker violations produce better training signal than C++ UB. That's just a fact. A compiler error is a labeled negative example. UB at runtime is an unlabeled catastrophe that might pass all your tests.
[removed by moderator]
what did they say?
something about all DG papers being written by AI already, you know the usual
The paper lists domain tags: `ai/`, `embedded/`, `finance/`. As someone who has written embedded C++ for two decades: 99% of C++ training data online is web examples with `std::cout` and heap allocation. My production code has zero dynamic allocations, no exceptions, no RTTI, and compiles for a Cortex-M4 with 256K flash.
A "curated dataset" that includes an `embedded/` tag is meaningless unless someone decides which embedded style. MISRA C++? AUTOSAR? Bare-metal RTOS? Safety-critical avionics? These are different worlds with different rules. The paper treats "embedded" as one thing. It is not.
I want this dataset to exist. I do not believe a volunteer effort can produce it. This needs institutional backing and money.
skill issue
this is exactly the problem they're trying to solve though. the fact that AI generates `std::cout` hello-world when you ask for embedded C++ is the failure mode they're describing.
Right, but "describing the problem" and "having a plan to solve it" are different documents. The paper says "WG21 cannot solve this alone" and then stops. Who funds it? Who curates it? Who decides that my no-allocation Cortex-M4 code is more "correct" than someone's Arduino sketch? The paper outsources the hard part to "the ecosystem" without defining what that means.
The dependency nobody is talking about: you cannot curate "safe modern C++" training data until you define what safe C++ is.
The paper says "favoring spans over pointers" and "null ptr checks." But the safety profiles work (P3081) is still in flight. The boundary between "safe" and "unsafe" C++ is actively being debated in SG23. Training an AI on today's definition of "safe" means retraining when the profiles ship and the definition changes.
Three-point version of the problem:
1. You can't curate "safe C++" examples without a stable definition of safe C++.
2. That definition lives in the profiles work (P3081), which is still in flight in SG23.
3. Training on today's definition means retraining when the profiles ship and the definition moves.
The DG is proposing a dataset that depends on committee output that doesn't exist yet. That's a sequencing error.
profiles are vapor until they ship. this whole paper is building on quicksand.
They have a reference implementation and active SG23 work. "Vapor" is unfair. "Not finished" is accurate. My point isn't that profiles will fail - it's that the training data challenge depends on their completion.
P3081, P2759, P3651 all feed into this. The dependency graph is real. You can't have a curated "safe C++" corpus without a stable definition of safety, and we don't have one yet.
I work in HFT. We have been generating C++ with internal tooling for two years. Our training data is proprietary. Our patterns are proprietary. The code that matters - the code where nanoseconds count - will never appear in a public dataset because it's worth money.
A public curated corpus will produce AI that writes competent library code and terrible performance-critical code. Which is fine for 90% of use cases and useless for the 10% that pays my salary.
The paper doesn't grapple with this. The best C++ is behind NDAs.
tell me you've never shipped open source without telling me you've never shipped open source
This is the silent majority problem. The people who write the most critical C++ can't contribute to the dataset. Abseil and Boost are the exception - high-quality public C++ that's actually used at scale. Everything else is either toy examples or locked behind corporate walls.
I have used Claude and GPT to help draft committee papers. Not generate - help draft. Research summaries, consistency checks, finding prior art, rewriting unclear paragraphs. The quality after human review is indistinguishable from fully human-written papers. I know because reviewers have told me my recent work is "noticeably clearer."
The governance section is solving a problem that doesn't exist yet. The real problem is not "AI is writing bad papers." The real problem is "bad papers exist and now people can produce them faster." That's a quality bar issue, not an AI issue.
that you know of
I appreciate the transparency, and I don't doubt your workflow produces good results. But the governance question isn't about quality - it's about accountability.
If an AI hallucinates a technical claim that makes it into normative text - say, a mischaracterization of implementation-defined behavior that influences wording - who owns that error? The human author, yes. But the failure mode is different. A human making an error has understood the surrounding context and made a judgment call. An AI hallucinating has no understanding at all. The error surface is different even if the output looks identical.
The human author is the intelligence of record. Which is exactly what the paper says. The accountability sits with the author regardless of their tools. We don't audit whether someone used a spellchecker or a thesaurus. Why are we auditing whether they used an LLM for research?
A spellchecker doesn't hallucinate new technical claims. That's not a fair comparison and you know it.
The real issue is the "slop" problem. We have already seen papers that are clearly 95% ChatGPT with five minutes of editing. I don't mean "the prose is suspiciously clean." I mean entire sections that read like prompted output - the hedging, the "it is worth noting that" constructions, the passive voice avalanche, the way every paragraph restates its own thesis. Three papers in the last mailing had this pattern.
The DG is putting governance in place because the quality bar is being gamed. That is not hypothetical.
Name them or it's FUD.
I could name at least two from Hagenberg but I enjoy being invited to meetings
Rule 2. Name-calling papers, not people. Last warning in this chain.
the fact that you can't tell which papers are AI-generated is literally the entire argument for governance, not against it
imagine being replaced by an AI that can't even get `std::variant` visitor syntax right
bold of you to assume I can get `std::variant` visitor syntax right
`overloaded{}` gang rise up
I was supposed to be replaced by Java. Then by C#. Then by Go. Then by Rust. Now by AI. I'm still here. The FORTRAN is still here. My codebase from 1997 is still running in production and nobody wants to touch it, which is exactly why I still have a job.
the FORTRAN is eternal. the FORTRAN endures.
The paper lists "Sender/Receiver over Callbacks" as a training data preference. Let's check the receipts.
P2300 (`std::execution`) was voted into C++26 but has zero production deployments outside of NVIDIA's stdexec. The API surface is complex enough that even experienced async developers need days to internalize the model. The reference implementation is a research artifact, not a production library.
We are asking AI to learn patterns that humans haven't adopted yet. The corpus would contain sender/receiver examples that approximately nobody has written in production. How is that "curated, human-validated" data? It's aspirational data. We're training the AI on what we wish people wrote, not what they actually write.
stdexec has been in libcu++ for a while now. It's not zero deployments.
libcu++ is NVIDIA's internal stack. That's my point. One vendor's CUDA library is not "ecosystem adoption."
I teach C++ to undergrads. The biggest AI problem isn't training data - it's that students submit AI-generated code they don't understand. It compiles. It passes the basic tests. And it has three undefined behaviors that only show up under ASan.
Last semester I got a submission that used
reinterpret_cast<int*>(&float_val)to "convert" a float to int. The student couldn't explain why it worked on their machine. The AI gave them code that looked correct, compiled without warnings, and violated strict aliasing. No amount of curated training data fixes this - the AI doesn't understand UB any more than the student does.The paper's heart is in the right place. But the crisis in my classroom isn't "AI generates C++98." It's "AI generates plausible-looking C++20 that's subtly broken in ways that require deep understanding to detect."
the "three undefined behaviors" thing hit different. I just audited an intern's code and found exactly this pattern.
we need to teach C++ to the AI and use AI to teach C++. surely nothing can go wrong with this circular dependency.
This is exactly why the curated dataset matters - but also why the paper's approach is insufficient. Domain tags (`ai/`, `embedded/`, `finance/`) are not pedagogical scaffolding. You need progressive complexity, explicit rationale for each design choice, and anti-pattern comparisons showing why the wrong approach is wrong. A flat corpus of "good" examples doesn't teach; it just provides more sophisticated patterns for models to parrot without understanding.
From the tooling section:
MCP is a specific protocol from Anthropic that has existed for about a year. It might not exist in three years. Why is a directions paper - a document meant to set strategy for the next decade - name-dropping a specific vendor protocol? This is like a 2015 directions paper saying "possibly connecting to Google+."
The underlying idea - structured compiler queries for AI agents - is sound. But pin it to the concept, not the implementation. LSP took a decade to get where it is. MCP might be a footnote by C++29.
because Jeff uses Cursor
the language server protocol has entered the chat
LSP has been "entering the chat" for 10 years and clangd still can't reliably parse template metaprogramming in large codebases. We should probably fix the protocol we have before adopting a new one.
Half the papers in the last mailing read like ChatGPT wrote them. The DG knows this. That's why Thrust I exists. This isn't preemptive governance - it's reactive damage control dressed up as strategy.
which papers? people keep saying this and never back it up with specifics.
I'm not naming papers because rule 2 exists. But the pattern is recognizable: every paragraph restates its own thesis, "it is worth noting that" appears four times, the "related work" section reads like a prompted summary, and the proposed wording has grammatical patterns that no native English speaker produces. You know it when you see it.
"the passive voice avalanche" describes half the existing standard text. are we sure the standard wasn't AI-generated?
accusing someone's paper of being AI-generated without evidence is genuinely harmful behavior. papers are public. authors are real people. some of them are non-native English speakers and the "it is worth noting" pattern is common in academic ESL writing.
[removed by moderator]
Thread locked. Rule 2 is not optional. If you want to discuss paper quality, do it without pointing fingers at specific submissions or authors.
the irony of discussing AI-generated slop on a thread where half the comments are probably AI-generated
Thrust II is basically asking for Boost 2.0 but for training data. The paper name-drops the Beman Project as a potential home for this kind of initiative, but Beman is already stretched thin incubating actual libraries for standardization. Adding "curate a massive cross-domain C++ corpus" to their scope without additional resources is wishful thinking.
The paper says "WG21 cannot solve this alone" but never identifies who can. Boost doesn't have the infrastructure for dataset curation. Beman doesn't have the funding. Academic groups could contribute but need grants. The only orgs with both the data and the money are the compiler vendors (Google, Microsoft, Apple, NVIDIA), and the paper doesn't ask them to do anything specific.
I want to see this happen. I don't see a mechanism for it to happen.
Beman Project for the curious: github.com/bemanproject - they're doing good work on near-standard library incubation but this would be a completely different kind of project.
The Beman Project doesn't even have funding for its current scope. The C++ Alliance does some of this but they're focused on Boost infrastructure. Who actually writes the check?
the paper says "Algorithms over Loops: Biasing generation toward
<algorithm>." my brother in Christ,#include <algorithm>adds 2 seconds to my compile timelaughs in compile times
modules will fix this. (said increasingly nervous man for the 7th year in a row)
[deleted]
what was this about?
something about training AI on the Boost mailing list, which honestly would produce the most aggressive code reviewer ever built
Honest question: if we had this magical curated modern C++ dataset, who decides what's "modern" and "idiomatic"? The committee that took 3 years to agree on range adaptor closure semantics? The same body where SF/SA votes split 12-11?
Edit: I'm told the range adaptor thing was actually resolved fairly quickly. The point stands for basically everything else.
the real paper was the bike-shedding we did along the way
unpopular opinion: AI writing C++ will force the committee to simplify the language because complexity is the actual bug. imagine if every API had to be discoverable by an LLM. half the standard library would be redesigned. implicit conversion sequences, ADL, SFINAE, template argument deduction - all of it exists because humans can "figure it out from context." machines cannot. the pressure to make C++ AI-friendly is pressure to make C++ simpler, and that's the best thing that could happen to it.
this is actually a good take and you're being downvoted for it
a correct opinion at -47. peak r/wg21.
[removed by moderator]
report and move on
the fact that Bjarne Stroustrup co-authored a paper about AI governance for C++ is peak 2026
wait until you see P4024 on quantum computing governance
Read the whole thing twice. The real gap nobody is talking about:
Section 3 says:
Section 5 says:
These are two completely different strategies. One crowdsources quality through human curation. The other automates discovery through compiler tooling. The paper never reconciles them. Are we training AI on curated examples, or are we giving AI tools to query the compiler directly? Because those two approaches have fundamentally different failure modes, resource requirements, and timelines.
A paper that's setting "strategic direction" should at least acknowledge when its two thrusts point in different directions.
Edit: to be fair, maybe the answer is "both." But the paper doesn't say "both" - it just presents them adjacently without connecting the dots.
tl;dr: AI is both the future and the problem and the solution and we need a dataset and also governance and also tooling and also the ecosystem should do something and also we can't do it ourselves. got it.
disclosure: this comment was written by Claude