Document: P3642R4
Author: Jan Schultke
Date: 2026-02-17
Audience: LEWG
The paper proposes std::clmul and std::clmul_wide for carry-less multiplication, also known as XOR multiplication or polynomial multiplication over GF(2). It's the operation behind CRC computation, AES-GCM in cryptography, and bit manipulation tricks such as the simdjson JSON parser's quote-pair detection.
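For anyone who hasn't met the operation before, here is a minimal software sketch. It is not the paper's wording or any vendor's implementation, and the name clmul_sketch is made up for illustration. It is ordinary shift-and-add multiplication with the add replaced by XOR, so no carries propagate between bit positions.

```cpp
#include <cstdint>

// Hypothetical name, for illustration only: low 64 bits of the
// carry-less product of a and b.
constexpr std::uint64_t clmul_sketch(std::uint64_t a, std::uint64_t b) {
    std::uint64_t result = 0;
    for (int i = 0; i < 64; ++i) {
        // For every set bit i of b, XOR in a copy of a shifted left by i.
        if ((b >> i) & 1u) {
            result ^= a << i;
        }
    }
    return result;
}

// (x + 1) * (x + 1) = x^2 + 1 over GF(2), i.e. 0b11 (x) 0b11 = 0b101.
// Ordinary multiplication would give 9 here, not 5.
static_assert(clmul_sketch(3, 3) == 5, "carry-less square of x + 1");
```

The loop is the "naive" approach; the hardware instructions mentioned in the thread do the same thing in a few cycles.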
Hardware support is universal: x86 has had PCLMULQDQ since 2010, ARM has NEON pmull, RISC-V has Zbc/Zbkc extensions. LLVM recently added a portable @llvm.clmul intrinsic. The paper targets <numeric> and includes a widening variant (clmul_wide) that returns both halves of the result via a shared wide_result<T> type designed alongside P3161.
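A rough sketch of what a widening variant computes, under stated assumptions: the struct and member names below (wide_result_sketch, low, high) are placeholders, since the thread does not quote the actual wide_result<T> definition shared with P3161. The carry-less product of two 64-bit values needs up to 127 bits, so the bits shifted out of the low half are collected in a second word.

```cpp
#include <cstdint>

// Placeholder type; the real wide_result<T> layout and member names
// are defined by the paper, not by this sketch.
template <class T>
struct wide_result_sketch {
    T low;
    T high;
};

// Hypothetical helper: full 128-bit carry-less product of a and b.
constexpr wide_result_sketch<std::uint64_t>
clmul_wide_sketch(std::uint64_t a, std::uint64_t b) {
    std::uint64_t low = 0;
    std::uint64_t high = 0;
    for (int i = 0; i < 64; ++i) {
        if ((b >> i) & 1u) {
            low ^= a << i;
            // Bits of a that the left shift pushed past bit 63 land in
            // the high word. Skip i == 0 to avoid a shift by 64.
            if (i != 0) {
                high ^= a >> (64 - i);
            }
        }
    }
    return {low, high};
}
```

For example, multiplying x^63 by x gives x^64, which is zero in the low word and bit 0 in the high word.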
SG6 forwarded R1 at Sofia with a unanimous 11-0 poll, requesting SIMD overloads - which R4 now includes. Currently in the LEWG queue.
another <numeric> paper while std::net remains a collective hallucination. but sure, polynomial multiplication, that's what the people were asking for.
I've been using _mm_clmulepi64_si128 directly for years. Nice to see this getting a portable interface. The SG6 vote was 11-0 which is... not something you see every day in committee-land.
So this is basically a portable wrapper around PCLMULQDQ / pmull / clmul. Which is fine, but the paper buries the lede a bit:
Translation: you can't just implement this in a header. You need the compiler to lower it. Same pattern as popcount and countl_zero. Fine for established bit operations, but it means every stdlib vendor needs to coordinate with their compiler team.
The 9.2x benchmark gap between naive and NTL implementations is the real story here. If you're not hitting the hardware instruction, you're leaving an order of magnitude on the table.
Also: the SIMD widening operations - the AVX-512 VPCLMULQDQ stuff that crypto actually wants - are explicitly punted to a future paper.
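To make the "hit the hardware instruction or fall back" point concrete, here is a hedged sketch of what hand-rolled dispatch tends to look like today. It assumes GCC or Clang on x86-64, where the __PCLMUL__ macro is defined when building with -mpclmul; the function names are invented, and this is not how any standard library is known to implement the proposal.

```cpp
#include <cstdint>

#if defined(__PCLMUL__) && defined(__x86_64__)
#include <immintrin.h>  // _mm_clmulepi64_si128 plus the SSE2 conversions
#endif

// Portable fallback (hypothetical name): one shift-and-XOR step per bit.
constexpr std::uint64_t clmul_fallback(std::uint64_t a, std::uint64_t b) {
    std::uint64_t result = 0;
    for (int i = 0; i < 64; ++i) {
        if ((b >> i) & 1u) {
            result ^= a << i;
        }
    }
    return result;
}

// Hypothetical dispatcher: low 64 bits of the carry-less product.
inline std::uint64_t clmul_dispatch(std::uint64_t a, std::uint64_t b) {
#if defined(__PCLMUL__) && defined(__x86_64__)
    // Assumption: compiled with PCLMUL enabled, so the intrinsic is usable.
    const __m128i va = _mm_cvtsi64_si128(static_cast<long long>(a));
    const __m128i vb = _mm_cvtsi64_si128(static_cast<long long>(b));
    // Selector 0x00 multiplies the low 64-bit lanes of both operands.
    const __m128i product = _mm_clmulepi64_si128(va, vb, 0x00);
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(product));
#else
    return clmul_fallback(a, b);
#endif
}
```

Every additional ISA (pmull, Zbc) adds another preprocessor branch of this kind, which is the maintenance cost a standard function would absorb.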
The portability angle is the whole point. x86 has PCLMULQDQ (2010), ARM has NEON pmull, RISC-V has Zbc. Writing #ifdef chains for three ISAs is exactly the kind of thing the standard library should paper over.
The interesting design question here isn't clmul itself - it's wide_result<T>.
The paper explicitly shares this type with P3161 (widening arithmetic). That means if P3161 changes the type name, the layout, or the comparison semantics, this paper has to track. And P3161 hasn't shipped yet.
Fine in principle. In practice, coupling two in-flight proposals through a shared vocabulary type is how you get both of them stuck in LEWG for an extra year.
Minor: I would have expected this in <bit> rather than <numeric>. clmul is fundamentally a bitwise operation. The paper's reasoning is "P0543 did it this way" which is a precedent argument, not a technical one.
classic committee move: standardize the result type before standardizing the operation that needs it
the hilbert curve example in section 3.2 is pretty dense for motivation. the JSON parsing angle is the killer use case - simdjson uses exactly this trick (clmul(quote_mask, -1)) and it's absurdly fast. should have led with that.
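For context on why clmul(quote_mask, -1) is useful: carry-less multiplication by all-ones is a prefix XOR, so bit k of the result is the XOR of bits 0..k of the input. If quote_mask has a bit set at each quote character, the result is set from each opening quote up to just before the matching closing quote. A small sketch with invented names, using a software loop in place of the real instruction:

```cpp
#include <cstdint>

// Hypothetical software stand-in for the low half of a 64-bit clmul.
constexpr std::uint64_t clmul_lo_sketch(std::uint64_t a, std::uint64_t b) {
    std::uint64_t result = 0;
    for (int i = 0; i < 64; ++i) {
        if ((b >> i) & 1u) {
            result ^= a << i;
        }
    }
    return result;
}

// Prefix XOR via carry-less multiplication by all-ones.
constexpr std::uint64_t inside_quotes_sketch(std::uint64_t quote_mask) {
    return clmul_lo_sketch(quote_mask, ~std::uint64_t{0});
}

// The same prefix XOR written as log-step shifts, for comparison.
constexpr std::uint64_t prefix_xor_shifts(std::uint64_t m) {
    m ^= m << 1;
    m ^= m << 2;
    m ^= m << 4;
    m ^= m << 8;
    m ^= m << 16;
    m ^= m << 32;
    return m;
}

// Quotes at bit positions 2 and 5: positions 2, 3 and 4 are marked.
static_assert(inside_quotes_sketch(0x24) == 0x1C, "span between two quotes");
static_assert(prefix_xor_shifts(0x24) == 0x1C, "shift version agrees");
```

The opening quote is included in the mask and the closing quote is not; this sketch also ignores escaped quotes, which a real parser has to filter out of quote_mask first.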