Author: Victor Zverovich
Document: P3904R1
Date: 2026-01-28
Target: SG16
Link: wg21.link/p3904r1
Victor Zverovich (of {fmt} fame) is back with a follow-up to P2845, which got std::filesystem::path formatting into C++26. Turns out there's one remaining gap: on Windows, paths can contain unpaired UTF-16 surrogates, and when you format them with std::format, distinct paths collapse to the same replacement character. Two different paths, same string. Not great.
The fix: WTF-8 (Wobbly Transformation Format - 8-bit). Yes, that's the actual name. It's a superset of UTF-8 that preserves arbitrary 16-bit code unit sequences losslessly. Rust already uses this for OsString on Windows, and libuv does the same. The visible output on terminals doesn't change - std::print still shows replacement characters for unpaired surrogates. The difference is in the bytes: each surrogate gets a unique WTF-8 encoding rather than all mapping to the same U+FFFD.
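To make the byte-level difference concrete, here is a minimal sketch of the encoding step, assuming a hand-rolled encoder. The names encode_utf16 and append_generalized_utf8 are hypothetical and only for illustration; this is not the {fmt} implementation or wording from the paper.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper: append one code point using the UTF-8 bit layout.
// Surrogate code points are deliberately NOT rejected here; that is the
// "generalized UTF-8" that WTF-8 builds on.
void append_generalized_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical helper: potentially ill-formed UTF-16 -> bytes.
// lossy == false: unpaired surrogates keep their own 3-byte encoding (WTF-8).
// lossy == true:  unpaired surrogates are replaced with U+FFFD.
std::string encode_utf16(std::u16string_view in, bool lossy) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        const bool is_low = cp >= 0xDC00 && cp <= 0xDFFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Proper surrogate pair: combine into one supplementary code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if ((is_high || is_low) && lossy) {
            cp = 0xFFFD;  // every unpaired surrogate collapses to the same value
        }
        append_generalized_utf8(out, cp);
    }
    return out;
}
```

With lossy set to false, a lone 0xD800 comes out as ED A0 80 and a lone 0xD801 as ED A0 81. With lossy set to true, both come out as EF BF BD, which is the collapse described above.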
Already implemented in {fmt}. The parse-back API (going from WTF-8 to a path) is punted to a future paper.
WTF-8. They called an encoding WTF-8. Say what you will about the committee's pace, but the paper titles are undefeated.
genuinely thought this was a shitpost until I clicked through and saw a real encoding spec
It's legit. Simon Sapin formalized it years ago: wtf-8.codeberg.page. Rust uses it under the hood for all OS strings on Windows. The "Wobbly" in Wobbly Transformation Format is doing a lot of work.
Opening epigraph goes hard. Best paper opener this mailing.
a pun title, a cynical epigraph, and a three-page paper that fixes one edge case. peak WG21 energy.
So let me get this straight. The paper makes std::format produce WTF-8 encoded bytes for paths with unpaired surrogates. But the round trip depends on a read-path API that doesn't exist yet:
Are we shipping half a round trip? What's the plan if that follow-up paper doesn't make it through SG16? You've got WTF-8 bytes sitting in std::strings that nothing in the standard knows how to decode back to a path.
Edit: to be clear, I think the direction is right. I just don't love standardizing half of a bijection.
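For a sense of scale, the missing decode half could look roughly like the sketch below. wtf8_to_utf16 is a made-up name, not a proposed or existing standard API, and the sketch skips checks a conforming WTF-8 decoder would need (overlong forms, and surrogate pairs written as two 3-byte sequences).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: decode WTF-8 bytes into potentially ill-formed UTF-16.
// Returns std::nullopt on truncated input or malformed lead/continuation bytes.
std::optional<std::u16string> wtf8_to_utf16(std::string_view s) {
    std::u16string out;
    for (std::size_t i = 0; i < s.size();) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if (b0 < 0x80)                { len = 1; cp = b0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; }
        else return std::nullopt;                  // invalid lead byte
        if (i + len > s.size()) return std::nullopt;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return std::nullopt;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp > 0x10FFFF) return std::nullopt;
        if (cp < 0x10000) {
            // BMP code point, including a lone surrogate: one code unit.
            out += static_cast<char16_t>(cp);
        } else {
            // Supplementary code point: emit a surrogate pair.
            cp -= 0x10000;
            out += static_cast<char16_t>(0xD800 + (cp >> 10));
            out += static_cast<char16_t>(0xDC00 + (cp & 0x3FF));
        }
        i += len;
    }
    return out;
}
```

Feeding it ED A0 80 gives back the single code unit 0xD800, so the lone surrogate survives the trip through a std::string.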
The format side has standalone value even without the parse side. This is a diagnostics and logging use case first - you want to be able to format two different paths and get two different strings. That's the core invariant P2845 broke on Windows.
Waiting to ship the format fix until the parse paper lands means P2845's lossy behavior sits in the standard for another cycle. The incremental approach is fine. Each half is useful independently.
I take the diagnostics point. But "each half is useful independently" is a stretch - the paper literally motivates the change with round-trip losslessness. If the format side is just for diagnostics, why frame it around round trips at all?
The concern isn't the direction. It's shipping an encoding that nothing in the standard can consume.
Fair - the framing does lean heavily on round-trip and then defers half of it. I think the real immediate value is injectivity: two distinct paths produce two distinct strings. That's strictly better than what we have now even if parsability comes later. But yeah, the paper could be more upfront that the win right now is distinguishability, not full round trips.
My concern is less about the round trip and more about what happens downstream. Right now, when you std::format a path, you get valid UTF-8 (with replacement characters). After this paper, you can get bytes that are valid WTF-8 but invalid UTF-8 - specifically the three-byte sequences for encoded surrogates.
Any code that receives a formatted path string and assumes it's valid UTF-8 - which is a lot of code - will reject or mangle these bytes. Logging frameworks, JSON serializers, database bindings. The std::format contract has been "produces valid UTF-8 text" and this would be the first exception.
Not saying it's wrong. But the blast radius is wider than "just a Windows edge case."
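A sketch of the kind of strict check a downstream consumer might run on such a string. is_valid_utf8 is a hypothetical helper written for illustration, not taken from any particular logging or JSON library.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: strict UTF-8 validation.
// Rejects overlong forms, out-of-range values, and encoded surrogates.
bool is_valid_utf8(std::string_view s) {
    for (std::size_t i = 0; i < s.size();) {
        const auto b0 = static_cast<unsigned char>(s[i]);
        std::size_t len = 0;
        std::uint32_t cp = 0;
        std::uint32_t min = 0;  // smallest code point allowed for this length
        if (b0 < 0x80)                { len = 1; cp = b0;        min = 0; }
        else if ((b0 & 0xE0) == 0xC0) { len = 2; cp = b0 & 0x1F; min = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min = 0x10000; }
        else return false;                     // stray continuation or bad lead byte
        if (i + len > s.size()) return false;  // truncated sequence
        for (std::size_t k = 1; k < len; ++k) {
            const auto b = static_cast<unsigned char>(s[i + k]);
            if ((b & 0xC0) != 0x80) return false;
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return false;     // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // fine in WTF-8, not in UTF-8
        i += len;
    }
    return true;
}
```

The replacement character bytes EF BF BD pass this check. The WTF-8 encoding of a lone 0xD800, ED A0 80, does not.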
The paper does address the terminal output side:
So std::print output to a terminal is unchanged - it still shows the replacement character. The WTF-8 bytes only appear in the formatted string itself.
Right, and that's exactly my point. The string you get from std::format and the string that std::print renders to a terminal are now semantically different objects. One is WTF-8, the other is UTF-8 with U+FFFD substitution. That's a new distinction that didn't exist before in the std::format ecosystem. I'm not against it, but "just a display-side thing" undersells the change.
Rust solved this years ago with OsString using WTF-8 internally. The fact that C++ is catching up to this in 2026 is on brand.
I'm going to start a bingo card for r/wg21 threads. "Rust already does this" is the free space.
Genuine question: how common are unpaired surrogates in real filesystem paths? I've been writing Windows code for two decades and I've never personally encountered one. If this is fixing a theoretical hole that never occurs in practice, is it worth the complexity?
Not being dismissive - the consistency argument is real. I'm asking whether anyone has data on how often NTFS paths contain ill-formed UTF-16.
More common than you'd think, but still rare. NTFS allows arbitrary 16-bit sequences as file names. You see them from:
- Malware or security tools that intentionally create paths with invalid UTF-16 to evade parsers
- Cross-platform file transfers where an encoding round trip already went wrong
- Legacy East Asian software that predates Unicode and stored DBCS in UCS-2 fields
- Fuzzing and test harnesses
It's not a daily occurrence for most people, but if you write tools that enumerate all files (backup software, antivirus, search indexers), you will hit these. The paper's argument isn't frequency - it's that the formatter should be injective. Two distinct inputs, two distinct outputs. The current behavior violates that.
The cross-language convergence is worth noting. Rust uses WTF-8 for OsString, libuv adopted it for Windows paths, and Python's PEP 383 solves the same problem with a different mechanism. When three independent ecosystems reach the same conclusion about ill-formed encoding handling, it's a decent signal that the approach is sound.
{fmt} already ships with this behavior: github.com/fmtlib/fmt. So we have implementation experience plus cross-ecosystem validation. This seems like one of the cleaner SG16 papers I've seen.
Python's approach is interesting to compare. PEP 383 uses surrogateescape, which maps invalid bytes to lone surrogates in the U+DC80..U+DCFF range. Different mechanism, same goal: reversible encoding of not-quite-Unicode data. The WTF-8 route is arguably cleaner since it doesn't repurpose the surrogate range for a second meaning.
[deleted]
what did they say?
I feel like every other paper in SG16 is about paths and encoding on Windows. At some point can we just acknowledge that the Windows path model was a mistake and move on
we can acknowledge it all we want, the installed base isn't going anywhere. there are more NTFS volumes in production than there are humans who've read the C++ standard. you work with the platform you have.
Unpaired surrogates are undefined behavior in UTF-16. You shouldn't format them at all. Just throw an exception.