Authors: Eddie Nolan, Zach Laine
Document: P2728R11
Date: 2026-03-15
Target: SG9, SG16
Link: wg21.link/p2728r11
P2728 is the paper that's been quietly building the foundation for real Unicode support in C++. Eleven revisions in, it proposes range adaptors for transcoding between UTF-8, UTF-16, and UTF-32 - pipe-style, lazy, composable with the rest of <ranges>.
The core API is straightforward: u8"hello" | views::to_utf32 gives you a lazy view of UTF-32 code points. Error handling comes in two flavors - the default views substitute the Unicode replacement character (U+FFFD) for invalid sequences, and the _or_error variants produce std::expected values with a detailed error enum telling you exactly what went wrong.
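To make the two error-handling flavors concrete, here is a minimal C++17 sketch. It is not the paper's implementation: it is eager rather than lazy, it uses a plain struct instead of std::expected so it builds without C++23, and the names utf8_error, decode_result, decode_one, and decode_lossy are all hypothetical. decode_one reports what went wrong; decode_lossy layers U+FFFD substitution on top of it.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical error categories; the paper's enum uses different names.
enum class utf8_error { none, invalid_lead, truncated, overlong, surrogate, out_of_range };

struct decode_result {
    char32_t code_point;  // meaningful only when error == utf8_error::none
    utf8_error error;
    std::size_t length;   // bytes consumed, always at least 1
};

// Decode one code point starting at s[i]. Precondition: i < s.size().
decode_result decode_one(std::string_view s, std::size_t i) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 < 0x80) return {b0, utf8_error::none, 1};  // ASCII

    std::size_t len = 0;
    unsigned char lo = 0x80, hi = 0xBF;  // valid range for the second byte
    utf8_error range_error = utf8_error::truncated;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
        len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
        len = 3;
        if (b0 == 0xE0) { lo = 0xA0; range_error = utf8_error::overlong; }
        if (b0 == 0xED) { hi = 0x9F; range_error = utf8_error::surrogate; }
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
        len = 4;
        if (b0 == 0xF0) { lo = 0x90; range_error = utf8_error::overlong; }
        if (b0 == 0xF4) { hi = 0x8F; range_error = utf8_error::out_of_range; }
    } else {
        // Stray continuation byte, or a lead byte that can never be valid.
        return {0, utf8_error::invalid_lead, 1};
    }

    char32_t cp = b0 & (0xFF >> (len + 1));  // payload bits of the lead byte
    for (std::size_t k = 1; k < len; ++k) {
        if (i + k >= s.size()) return {0, utf8_error::truncated, k};
        const unsigned char b = static_cast<unsigned char>(s[i + k]);
        if ((b & 0xC0) != 0x80) return {0, utf8_error::truncated, k};
        if (k == 1 && (b < lo || b > hi)) return {0, range_error, 1};
        cp = (cp << 6) | (b & 0x3F);
    }
    return {cp, utf8_error::none, len};
}

// Replacement behavior built on the error-reporting primitive.
std::u32string decode_lossy(std::string_view s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        const decode_result r = decode_one(s, i);
        out.push_back(r.error == utf8_error::none ? r.code_point : U'\uFFFD');
        i += r.length;
    }
    return out;
}
```

On an error the sketch consumes only the bytes that formed a valid prefix, which aims to follow the Unicode "maximal subpart" convention for U+FFFD substitution. Whether the paper's views make the same choice in every case is not something this sketch can confirm.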
Implementation experience comes from multiple directions: a Beman project reference implementation, Boost.Text with hundreds of GitHub stars, and Jonathan Wakely's implementation in libstdc++. The paper only works with char8_t/char16_t/char32_t; if your UTF-8 is in char, you pipe through as_char8_t first.
Paper: P2728R11 - Unicode in the Library, Part 1: UTF Transcoding
Reminder: paper authors sometimes read these threads. Be civil, be specific, critique the paper not the person. Rule 2 violations will be removed.
Eleven revisions. This paper has been through more design iterations than my startup's business model.
To be fair the API between R0 and R11 is completely different. They removed eager algorithms, changed error handling from exceptions to expected, split out null_term into its own paper, merged all the separate view classes into one. It's basically a new paper wearing old clothes.
The paper says this itself! And then requires as_char8_t anyway. So to transcode my perfectly normal std::string UTF-8 text I need to pipe it through as_char8_t first, instead of just piping the string straight into the transcoding view.
I understand the type-safety argument but this optimizes for the type system at the expense of the user who's sitting there with a std::string full of UTF-8 and wants to do something with it.
This was an explicit SG16 design decision. char doesn't carry encoding information - it could be Latin-1, it could be Shift-JIS, it could be UTF-8. char8_t exists precisely to mark "this is UTF-8." The one-time cast at the boundary is the cost of not lying about your encoding.
Also, as_char8_t is literally a static_cast in a view. Zero-cost at runtime and composes naturally in a pipeline.
I know the encoding of my strings at compile time. I set the compiler flags. The literal encoding hasn't surprised me since 2003. Making me declare what I already know every time I touch a string feels like ceremony, not safety.
That SG16 poll was SF:5 F:1 N:0 A:0 SA:1. Almost unanimous. At some point you have to accept the committee made a direction call and build on it rather than relitigate every time a paper follows the direction.
Zero discussion of performance in this paper. simdutf validates and transcodes UTF-8 at multiple GB/s using SIMD. These views store an inplace_vector buffer, a parent pointer, an index, and do one-code-point-at-a-time transcoding through a virtual dispatch on the underlying range.
That's fine for small strings and lazy composition, but the paper doesn't even acknowledge the tradeoff. Someone who searches "C++ UTF transcoding" after C++29 ships is going to find this, use it for bulk transcoding of a 50MB JSON file, and wonder why it's 20x slower than it needs to be.
You're comparing a bulk batch API to a composable lazy view. They serve fundamentally different purposes. Nobody expects views::transform to compete with hand-rolled SIMD either - the value is composability and zero-allocation.
Try doing that pipeline with simdutf. You'd need three intermediate allocations.
Fair distinction. But shouldn't the paper at least mention this? A user who sees "UTF transcoding" in the standard library and doesn't know simdutf exists will assume this is the way to do it. A non-normative note pointing to SIMD libraries for bulk workloads would cost two sentences and save a lot of Stack Overflow questions.
Actually yeah, that's a reasonable ask. A note in the design discussion section acknowledging the performance-vs-composability tradeoff wouldn't hurt. Something like what ranges::sort says about... wait, views::sort doesn't exist. Bad example. But you get the idea.
just use ICU
ICU is a 30MB dependency that does collation, BiDi, transliteration, break iteration, and date formatting. This paper is a zero-overhead range adaptor for UTF conversion. Recommending ICU for transcoding is like recommending Photoshop for cropping a screenshot.
The error handling design here is the best part of the paper and I don't think it's getting enough attention.
The paper cites CVE-2007-3917 where a game server crashed from invalid UTF-8 because the transcoding function threw exceptions and nobody caught them. The old codecvt facets had exactly this problem - exceptions as error handling for untrusted input is a denial-of-service vulnerability waiting to happen.
The _or_error views produce std::expected values, so invalid input doesn't throw, doesn't crash, doesn't silently corrupt. You get an error enum telling you exactly what went wrong (truncated_utf8_sequence, unpaired_high_surrogate, encoded_surrogate, etc). The default to_utf8 view substitutes U+FFFD per the Unicode spec. Two good defaults covering two common needs.
The nice thing is that _or_error is explicitly a basis operation - section 10.2 shows you can rebuild the replacement behavior from the _or_error view with a transform and join. So the error-aware path is the primitive and the convenience path is layered on top. Clean factoring.
Also the double-transcode optimization in 10.4 is clever. If you pipe through to_utf32 and then to_utf16, the CPO detects the nesting and elides the intermediate view. Same pattern as views::reverse undoing itself.
can we please get networking in the standard before we start on unicode
we can want two things
In Rust, strings are UTF-8 by default and validated at construction time. No transcoding views needed. The whole char8_t dance doesn't exist because the language got the default encoding right from day one.
Rust also has OsStr, OsString, CStr, CString, str, String, Path, PathBuf, Cow<str>, and the From<&[u8]> escape hatch when you need non-UTF-8 bytes. The grass isn't as green as the evangelists claim.
fair enough, though those exist for FFI boundaries not everyday use. But this isn't r/rust so I'll stop.
Dependencies concern: this paper needs P3117 for conditionally borrowed ranges, P3705 for null_term, and P4030 for endian views if you want the full UTF-16 LE/BE story from section 6.6. That's three papers that need to land alongside this one for the complete picture.
What's the C++29 dependency DAG looking like? Is anyone tracking whether these will actually be ready for the same standard?
P3117 is already being reviewed in LEWG. P3705 was literally carved out of this paper - it'll track together. P4030 is a nice-to-have, not a hard dependency. The core transcoding views work fine without endian views; section 6.6 is just showing what's possible when you compose them.
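For context on why byte order is a separate concern from transcoding, here is a small C++17 sketch of the UTF-16 LE/BE problem. It does not reflect the API of P4030 or of this paper; byte_order, encode_utf16, and to_bytes are hypothetical names, and encode_utf16 assumes its argument is a valid Unicode scalar value.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class byte_order { little, big };

// Encode one Unicode scalar value as UTF-16: a single code unit for the BMP,
// a surrogate pair above it. Precondition: cp is not a surrogate and is at
// most 0x10FFFF.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        out.push_back(static_cast<char16_t>(cp));
    } else {
        const char32_t v = cp - 0x10000;  // 20 bits split across the pair
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
    return out;
}

// Serialize code units to bytes; the same code units yield different byte
// sequences depending on the chosen order.
std::vector<std::uint8_t> to_bytes(const std::u16string& units, byte_order order) {
    std::vector<std::uint8_t> bytes;
    for (char16_t u : units) {
        const auto hi = static_cast<std::uint8_t>(u >> 8);
        const auto lo = static_cast<std::uint8_t>(u & 0xFF);
        if (order == byte_order::big) {
            bytes.push_back(hi);
            bytes.push_back(lo);
        } else {
            bytes.push_back(lo);
            bytes.push_back(hi);
        }
    }
    return bytes;
}
```

Transcoding works on code units, while byte order only matters once those units are written to or read from a byte stream, which is consistent with the point that endian views compose with the transcoding views rather than being required by them.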
Section 6.5 with the playing card suit-changing example is the most delightful thing I've seen in a standards paper. Someone actually had fun writing that section and it shows. We need more PLAYING CARD ACE OF SPADES in normative references.
The changelog in section 11 is longer than the actual changes in R11. Two bullet points of new work, eleven subsections of "what we changed three years ago." The codecvt facets were deprecated in C++17. It's 2026. That's nine years to replace something everyone agreed was broken.
Edit: yes I know the paper does more than replace codecvt. My point is about timeline, not scope.