r/wg21 - Unicode in the Library, Part 1: UTF Transcoding

Posted by u/unicode_standardista · 18 hr. ago

Authors: Eddie Nolan, Zach Laine
Document: P2728R11
Date: 2026-03-15
Target: SG9, SG16
Link: wg21.link/p2728r11

P2728 is the paper that's been quietly building the foundation for real Unicode support in C++. Eleven revisions in, it proposes range adaptors for transcoding between UTF-8, UTF-16, and UTF-32 - pipe-style, lazy, composable with the rest of <ranges>.

The core API is straightforward: u8"hello" | views::to_utf32 gives you a lazy view of UTF-32 code points. Error handling comes in two flavors - the default views substitute the Unicode replacement character (U+FFFD) for invalid sequences, and the _or_error variants produce std::expected values with a detailed error enum telling you exactly what went wrong.

Implementation experience comes from multiple directions: a Beman project reference implementation, Boost.Text with hundreds of GitHub stars, and Jonathan Wakely's implementation in libstdc++. The paper only works with char8_t/char16_t/char32_t - if your UTF-8 is in char, you pipe through as_char8_t first.

▲ 186 points (91% upvoted) · 25 comments

u/AutoModerator 1 point 18 hr. ago pinned comment

Paper: P2728R11 - Unicode in the Library, Part 1: UTF Transcoding | Authors: Eddie Nolan, Zach Laine | Date: 2026-03-15 | Audience: SG9, SG16 | Link: wg21.link/p2728r11

Reminder: paper authors sometimes read these threads. Be civil, be specific, critique the paper not the person. Rule 2 violations will be removed.

u/template_meta_enjoyer 87 points 17 hr. ago

Eleven revisions. This paper has been through more design iterations than my startup's business model.

u/constexpr_when 23 points 16 hr. ago

To be fair the API between R0 and R11 is completely different. They removed eager algorithms, changed error handling from exceptions to expected, split out null_term into its own paper, merged all the separate view classes into one. It's basically a new paper wearing old clothes.

u/char_was_fine_actually 67 points 16 hr. ago

"Because virtually all UTF-8 text processed by C++ is stored in char"

The paper says this itself! And then requires as_char8_t anyway. So to transcode my perfectly normal std::string UTF-8 text I need:

my_str | views::as_char8_t | views::to_utf32

instead of just

my_str | views::to_utf32

I understand the type-safety argument but this optimizes for the type system at the expense of the user who's sitting there with a std::string full of UTF-8 and wants to do something with it.

u/sg16_lurker 31 points 15 hr. ago

This was an explicit SG16 design decision. char doesn't carry encoding information - it could be Latin-1, it could be Shift-JIS, it could be UTF-8. char8_t exists precisely to mark "this is UTF-8." The one-time cast at the boundary is the cost of not lying about your encoding.

Also, as_char8_t is literally a static_cast in a view. Zero-cost at runtime and composes naturally in a pipeline.

u/char_was_fine_actually 18 points 14 hr. ago

I know the encoding of my strings at compile time. I set the compiler flags. The literal encoding hasn't surprised me since 2003. Making me declare what I already know every time I touch a string feels like ceremony, not safety.

u/ranges_guy_2021 12 points 13 hr. ago

"UTF transcoding interfaces provided by the C++ standard library should operate on charN_t types, with support for other types provided by adapters"

That SG16 poll was SF:5 F:1 N:0 A:0 SA:1. Almost unanimous. At some point you have to accept the committee made a direction call and build on it rather than relitigate every time a paper follows the direction.

u/simd_or_bust 44 points 15 hr. ago

Zero discussion of performance in this paper. simdutf validates and transcodes UTF-8 at multiple GB/s using SIMD. These views store an inplace_vector buffer, a parent pointer, an index, and do one-code-point-at-a-time transcoding through the underlying range's iterators.

That's fine for small strings and lazy composition, but the paper doesn't even acknowledge the tradeoff. Someone who searches "C++ UTF transcoding" after C++29 ships is going to find this, use it for bulk transcoding of a 50MB JSON file, and wonder why it's 20x slower than it needs to be.

u/lazy_eval_appreciator 38 points 14 hr. ago

You're comparing a bulk batch API to a composable lazy view. They serve fundamentally different purposes. Nobody expects views::transform to compete with hand-rolled SIMD either - the value is composability and zero-allocation.

input | views::as_char8_t | views::to_utf32
      | views::filter(is_letter)
      | views::to_utf8
      | ranges::to<string>()

Try doing that pipeline with simdutf. You'd need three intermediate allocations.

u/simd_or_bust 19 points 13 hr. ago

Fair distinction. But shouldn't the paper at least mention this? A user who sees "UTF transcoding" in the standard library and doesn't know simdutf exists will assume this is the way to do it. A non-normative note pointing to SIMD libraries for bulk workloads would cost two sentences and save a lot of Stack Overflow questions.

u/lazy_eval_appreciator 14 points 12 hr. ago

Actually yeah, that's a reasonable ask. A note in the design discussion section acknowledging the performance-vs-composability tradeoff wouldn't hurt. Something like what ranges::sort says about... wait, views::sort doesn't exist. Bad example. But you get the idea.

u/former_icu_user -3 points 14 hr. ago

just use ICU

u/string_theory_42 25 points 13 hr. ago

ICU is a 30MB dependency that does collation, BiDi, transliteration, break iteration, and date formatting. This paper is a zero-overhead range adaptor for UTF conversion. Recommending ICU for transcoding is like recommending Photoshop for cropping a screenshot.

u/error_code_maximalist 35 points 12 hr. ago

The error handling design here is the best part of the paper and I don't think it's getting enough attention.

"a multiplayer RPG server could be crashed by malicious users sending invalid UTF"

The paper cites CVE-2007-3917 where a game server crashed from invalid UTF-8 because the transcoding function threw exceptions and nobody caught them. The old codecvt facets had exactly this problem - exceptions as error handling for untrusted input is a denial-of-service vulnerability waiting to happen.

The _or_error views produce std::expected values, so invalid input doesn't throw, doesn't crash, doesn't silently corrupt. You get an error enum telling you exactly what went wrong (truncated_utf8_sequence, unpaired_high_surrogate, encoded_surrogate, etc). The default to_utf8 view substitutes U+FFFD per the Unicode spec. Two good defaults covering two common needs.

u/definitely_not_a_compiler_dev 19 points 11 hr. ago

The nice thing is that _or_error is explicitly a basis operation - section 10.2 shows you can rebuild the replacement behavior from the or_error view with a transform and join. So the error-aware path is the primitive and the convenience path is layered on top. Clean factoring.

Also the double-transcode optimization in 10.4 is clever. If you pipe through to_utf32 and then to_utf16, the CPO detects the nesting and elides the intermediate view. Same pattern as views::reverse undoing itself.

u/async_skeptic 34 points 11 hr. ago

can we please get networking in the standard before we start on unicode

u/coroutine_hater 52 points 10 hr. ago 🏆

we can want two things

u/daily_driver_rust 12 points 9 hr. ago

In Rust, strings are UTF-8 by default and validated at construction time. No transcoding views needed. The whole char8_t dance doesn't exist because the language got the default encoding right from day one.

u/string_theory_42 48 points 8 hr. ago 🏆

Rust also has OsStr, OsString, CStr, CString, str, String, Path, PathBuf, Cow<str>, and the &[u8] / Vec<u8> escape hatch when you need non-UTF-8 bytes. The grass isn't as green as the evangelists claim.

u/daily_driver_rust 5 points 7 hr. ago

fair enough, though those exist for FFI boundaries not everyday use. But this isn't r/rust so I'll stop.

u/boost_contributor_throwaway 29 points 6 hr. ago

Dependencies concern: this paper needs P3117 for conditionally borrowed ranges, P3705 for null_term, and P4030 for endian views if you want the full UTF-16 LE/BE story from section 6.6. That's three papers that need to land alongside this one for the complete picture.

What's the C++29 dependency DAG looking like? Is anyone tracking whether these will actually be ready for the same standard?

u/ranges_guy_2021 16 points 5 hr. ago

P3117 is already being reviewed in LEWG. P3705 was literally carved out of this paper - it'll track together. P4030 is a nice-to-have, not a hard dependency. The core transcoding views work fine without endian views; section 6.6 is just showing what's possible when you compose them.

u/the_real_move_semantics 42 points 4 hr. ago

Section 6.5 with the playing card suit-changing example is the most delightful thing I've seen in a standards paper. Someone actually had fun writing that section and it shows. We need more PLAYING CARD ACE OF SPADES in normative references.

u/just_ship_it_already 19 points 3 hr. ago

The changelog in section 11 is longer than the actual changes in R11. Two bullet points of new work, eleven subsections of "what we changed three years ago." The codecvt facets were deprecated in C++17. It's 2026. That's nine years to replace something everyone agreed was broken.

Edit: yes I know the paper does more than replace codecvt. My point is about timeline, not scope.
