P3904R1 - When paths go WTF: making formatting lossless WG21
Posted by u/format_enjoyer_2024 · 7 hr. ago

Author: Victor Zverovich
Document: P3904R1
Date: 2026-01-28
Target: SG16
Link: wg21.link/p3904r1

Victor Zverovich (of {fmt} fame) is back with a follow-up to P2845, which got std::filesystem::path formatting into C++26. Turns out there's one remaining gap: on Windows, paths can contain unpaired UTF-16 surrogates, and when you format them with std::format, distinct paths collapse to the same replacement character. Two different paths, same string. Not great.

The fix: WTF-8 (Wobbly Transformation Format - 8-bit). Yes, that's the actual name. It's a superset of UTF-8 that preserves arbitrary 16-bit code unit sequences losslessly. Rust already uses this for OsString on Windows, and libuv does the same. The visible output on terminals doesn't change - std::print still shows replacement characters for unpaired surrogates. The difference is in the bytes: each surrogate gets a unique WTF-8 encoding rather than all mapping to the same U+FFFD.
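
For anyone who wants to see the bytes, here's a minimal sketch of the encoding step. to_wtf8 is my own hypothetical helper, not the {fmt} implementation and not an API from the paper; it assumes the input is a run of UTF-16 code units that may contain unpaired surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper (not a standard or {fmt} API): encode possibly
// ill-formed UTF-16 as WTF-8.
std::string to_wtf8(std::u16string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        // A high surrogate followed by a low surrogate is combined into one
        // supplementary code point, exactly as a UTF-8 encoder would do.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size()) {
            std::uint32_t lo = in[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            // Unpaired surrogates land here and keep their own value
            // instead of being replaced with U+FFFD.
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

With this sketch a lone 0xD800 comes out as ED A0 80 and a lone 0xD801 as ED A0 81, whereas a lossy transcoder would emit EF BF BD (U+FFFD) for both.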

Already implemented in {fmt}. The parse-back API (going from WTF-8 to a path) is punted to a future paper.

▲ 147 points (91% upvoted) · 24 comments

u/actual_undefined_behavior 287 points 7 hr. ago 🏆

WTF-8. They called an encoding WTF-8. Say what you will about the committee's pace, but the paper titles are undefeated.

u/template_wizard_42 94 points 6 hr. ago

genuinely thought this was a shitpost until I clicked through and saw a real encoding spec

u/daily_cpp_news 53 points 6 hr. ago

It's legit. Simon Sapin formalized it years ago: wtf-8.codeberg.page. Rust uses it under the hood for all OS strings on Windows. The "Wobbly" in Wobbly Transformation Format is doing a lot of work.

u/not_a_real_cpp_dev 189 points 6 hr. ago 🏆

Compatibility means deliberately repeating other people's mistakes.

Opening epigraph goes hard. Best paper opener this mailing.

u/constexpr_everything_99 41 points 6 hr. ago

a pun title, a cynical epigraph, and a three-page paper that fixes one edge case. peak WG21 energy.

u/async_skeptic 67 points 5 hr. ago

So let me get this straight. The paper makes std::format produce WTF-8 encoded bytes for paths with unpaired surrogates. But the round trip depends on a read-path API that doesn't exist yet:

The API for the read path of the round trip will be proposed by a separate paper.

Are we shipping half a round trip? What's the plan if that follow-up paper doesn't make it through SG16? You've got WTF-8 bytes sitting in std::strings that nothing in the standard knows how to decode back to a path.

Edit: to be clear, I think the direction is right. I just don't love standardizing half of a bijection.

u/library_design_enjoyer 34 points 4 hr. ago

The format side has standalone value even without the parse side. This is a diagnostics and logging use case first - you want to be able to format two different paths and get two different strings. That's the core invariant P2845 broke on Windows.

Waiting to ship the format fix until the parse paper lands means P2845's lossy behavior sits in the standard for another cycle. The incremental approach is fine. Each half is useful independently.

u/async_skeptic 18 points 4 hr. ago

I take the diagnostics point. But "each half is useful independently" is a stretch - the paper literally motivates the change with round-trip losslessness. If the format side is just for diagnostics, why frame it around round trips at all?

The concern isn't the direction. It's shipping an encoding that nothing in the standard can consume.

u/library_design_enjoyer 12 points 3 hr. ago

Fair - the framing does lean heavily on round-trip and then defers half of it. I think the real immediate value is injectivity: two distinct paths produce two distinct strings. That's strictly better than what we have now even if parsability comes later. But yeah, the paper could be more upfront that the win right now is distinguishability, not full round trips.

u/formerly_msvc_dev 52 points 5 hr. ago

My concern is less about the round trip and more about what happens downstream. Right now, when you std::format a path, you get valid UTF-8 (with replacement characters). After this paper, you can get bytes that are valid WTF-8 but invalid UTF-8 - specifically the three-byte sequences for encoded surrogates.

Any code that receives a formatted path string and assumes it's valid UTF-8 - which is a lot of code - will reject or mangle these bytes. Logging frameworks, JSON serializers, database bindings. The std::format contract has been "produces valid UTF-8 text" and this would be the first exception.
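
To make that concrete, here's a rough sketch of the kind of strict check a lot of downstream code effectively performs. is_valid_utf8 is a hypothetical helper of mine, not taken from the paper or from any particular library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical strict UTF-8 validator, standing in for what a JSON
// serializer or database binding might do before accepting a string.
bool is_valid_utf8(std::string_view s) {
    std::size_t i = 0;
    while (i < s.size()) {
        auto b0 = static_cast<unsigned char>(s[i]);
        std::size_t len;
        std::uint32_t cp;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return false;  // stray continuation byte or invalid lead byte
        if (s.size() - i < len) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            auto b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        static constexpr std::uint32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len]) return false;              // overlong form
        if (cp > 0x10FFFF) return false;                 // beyond Unicode
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // encoded surrogate
        i += len;
    }
    return true;
}
```

A validator like this accepts EF BF BD (U+FFFD) but rejects ED A0 80, which is what a WTF-8 encoded lone surrogate looks like on the wire.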

Not saying it's wrong. But the blast radius is wider than "just a Windows edge case."

u/format_enjoyer_2024 23 points 4 hr. ago

The paper does address the terminal output side:

Recommended practice: For vprint_unicode, if invoking the native Unicode API requires transcoding, implementations should substitute invalid code units with U+FFFD REPLACEMENT CHARACTER per the Unicode Standard

So std::print output to a terminal is unchanged - it still shows the replacement character. The WTF-8 bytes only appear in the formatted string itself.
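
Roughly, the display-side substitution behaves like the sketch below. wtf8_to_display_utf8 is a name I made up for illustration, and it assumes its input is already well-formed WTF-8; an actual implementation would more likely do this while transcoding for the native Unicode API than by rewriting bytes.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical helper: turn well-formed WTF-8 into valid UTF-8 for display
// by replacing each encoded surrogate with U+FFFD.
std::string wtf8_to_display_utf8(std::string_view in) {
    std::string out;
    for (std::size_t i = 0; i < in.size();) {
        auto b0 = static_cast<unsigned char>(in[i]);
        // In UTF-8 the lead byte ED must be followed by 80..9F. A second
        // byte of A0..BF means the sequence encodes U+D800..U+DFFF.
        if (b0 == 0xED && i + 2 < in.size() &&
            static_cast<unsigned char>(in[i + 1]) >= 0xA0) {
            out += "\xEF\xBF\xBD";  // U+FFFD REPLACEMENT CHARACTER
            i += 3;
        } else {
            out += in[i];
            ++i;
        }
    }
    return out;
}
```

Two formatted strings that differ only in which surrogate they carry stay distinct as bytes, but both come out of this step as the same U+FFFD on screen.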

u/formerly_msvc_dev 19 points 4 hr. ago

Right, and that's exactly my point. The string you get from std::format and the string that std::print renders to a terminal are now semantically different objects. One is WTF-8, the other is UTF-8 with U+FFFD substitution. That's a new distinction that didn't exist before in the std::format ecosystem. I'm not against it, but "just a display-side thing" undersells the change.

u/just_use_rust_bro 43 points 6 hr. ago

Rust solved this years ago with OsString using WTF-8 internally. The fact that C++ is catching up to this in 2026 is on brand.

u/coroutine_hater_2020 71 points 6 hr. ago

I'm going to start a bingo card for r/wg21 threads. "Rust already does this" is the free space.

u/embedded_for_20_years 28 points 4 hr. ago

Genuine question: how common are unpaired surrogates in real filesystem paths? I've been writing Windows code for two decades and I've never personally encountered one. If this is fixing a theoretical hole that never occurs in practice, is it worth the complexity?

Not being dismissive - the consistency argument is real. I'm asking whether anyone has data on how often NTFS paths contain ill-formed UTF-16.

u/ntfs_internals_fan 37 points 3 hr. ago

More common than you'd think, but still rare. NTFS allows arbitrary 16-bit sequences as file names. You see them from:

- Malware or security tools that intentionally create paths with invalid UTF-16 to evade parsers
- Cross-platform file transfers where an encoding round trip already went wrong
- Legacy East Asian software that predates Unicode and stored DBCS in UCS-2 fields
- Fuzzing and test harnesses

It's not a daily occurrence for most people, but if you write tools that enumerate all files (backup software, antivirus, search indexers), you will hit these. The paper's argument isn't frequency - it's that the formatter should be injective. Two distinct inputs, two distinct outputs. The current behavior violates that.

u/senior_lib_designer 41 points 3 hr. ago

The cross-language convergence is worth noting. Rust uses WTF-8 for OsString, libuv adopted it for Windows paths, and Python's PEP 383 solves the same problem with a different mechanism. When three independent ecosystems all conclude that ill-formed input has to survive encoding losslessly, it's a decent signal that the approach is sound.

{fmt} already ships with this behavior: github.com/fmtlib/fmt. So we have implementation experience plus cross-ecosystem validation. This seems like one of the cleaner SG16 papers I've seen.

u/compiles_first_try 15 points 2 hr. ago

Python's approach is interesting to compare. PEP 383 uses surrogateescape, which maps invalid bytes to lone surrogates in the U+DC80..U+DCFF range. Different mechanism, same goal: reversible encoding of not-quite-Unicode data. The WTF-8 route is arguably cleaner since it doesn't repurpose the surrogate range for a second meaning.
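
For comparison, the core of the surrogateescape mapping fits in a few lines. This is only a sketch of the byte-to-code-unit mapping in C++ with hypothetical helper names (escape_byte, unescape_unit), not Python's actual implementation.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper: map an undecodable byte (0x80..0xFF) to a lone low
// surrogate in U+DC80..U+DCFF, as PEP 383's surrogateescape handler does.
constexpr char16_t escape_byte(std::uint8_t b) {
    return static_cast<char16_t>(0xDC00 + b);
}

// Hypothetical inverse: recover the original byte if the code unit falls in
// the escape range, otherwise report that it is not an escaped byte.
constexpr std::optional<std::uint8_t> unescape_unit(char16_t u) {
    if (u >= 0xDC80 && u <= 0xDCFF)
        return static_cast<std::uint8_t>(u - 0xDC00);
    return std::nullopt;
}
```

The catch is visible in unescape_unit: a code unit like 0xDC80 can no longer be told apart from an escaped 0x80 byte, which is the "second meaning" problem.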

[deleted] 4 hr. ago

[deleted]

u/template_wizard_42 5 points 4 hr. ago

what did they say?

u/yet_another_build_victim 14 points 2 hr. ago

I feel like every other paper in SG16 is about paths and encoding on Windows. At some point can we just acknowledge that the Windows path model was a mistake and move on?

u/old_guard_cpp17 8 points 2 hr. ago

we can acknowledge it all we want, the installed base isn't going anywhere. there are more NTFS volumes in production than there are humans who've read the C++ standard. you work with the platform you have.