r/wg21 - Carry-less product: std::clmul

P3642R4 - Carry-less product: std::clmul WG21

Posted by u/sg6_lurker · 6 hr. ago

Document: P3642R4
Author: Jan Schultke
Date: 2026-02-17
Audience: LEWG

Paper proposes std::clmul and std::clmul_wide for carry-less multiplication - also known as XOR multiplication or polynomial multiplication over GF(2). It's the operation behind CRC computation, AES-GCM in cryptography, and some clever bit manipulation tricks like the simdjson JSON parser's quote-pair detection.
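For anyone who hasn't met the operation before, here is a minimal sketch of what "XOR multiplication" means. It is ordinary shift-and-add long multiplication with the add swapped for XOR. The name clmul_naive is made up for illustration and is not the proposed std::clmul; it also keeps only the low 64 bits of the product.

```cpp
#include <cstdint>

// Hypothetical helper (not the proposed std::clmul): bit-by-bit carry-less
// product of two 64-bit values, truncated to the low 64 bits.
constexpr std::uint64_t clmul_naive(std::uint64_t a, std::uint64_t b) {
    std::uint64_t result = 0;
    for (int i = 0; i < 64; ++i) {
        if ((b >> i) & 1u) {
            result ^= a << i;  // XOR instead of ADD, so no carries propagate
        }
    }
    return result;
}

// (x + 1) * (x + 1) = x^2 + 1 over GF(2): 0b11 * 0b11 = 0b101
static_assert(clmul_naive(0b11, 0b11) == 0b101);
// 0b101 * 0b11 = 0b101 ^ 0b1010 = 0b1111
static_assert(clmul_naive(0b101, 0b11) == 0b1111);
```

A loop like this is the slow baseline; the point of a standard function is that the compiler can lower it to a single hardware instruction where one exists.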

Hardware support is widespread: x86 has had PCLMULQDQ since 2010, ARM has NEON pmull, and RISC-V has the Zbc/Zbkc extensions. LLVM recently added a portable @llvm.clmul intrinsic. The paper targets <numeric> and includes a widening variant (clmul_wide) that returns both halves of the result via a shared wide_result<T> type designed alongside P3161.

SG6 forwarded R1 at Sofia with a unanimous 11-0 poll, requesting SIMD overloads - which R4 now includes. Currently in the LEWG queue.

▲ 47 points (89% upvoted) · 7 comments

u/just_ship_something 31 points 5 hr. ago

another <numeric> paper while std::net remains a collective hallucination. but sure, polynomial multiplication, that's what the people were asking for.

u/crc32_daily 18 points 4 hr. ago

I've been using _mm_clmulepi64_si128 directly for years. Nice to see this getting a portable interface. The SG6 vote was 11-0 which is... not something you see every day in committee-land.

u/low_latency_dev 14 points 3 hr. ago

So this is basically a portable wrapper around PCLMULQDQ / pmull / clmul. Which is fine, but the paper buries the lede a bit:

"The issue with library implementations is that the optimal implementation for std::clmul highly depends on the architecture and has interesting mathematical properties that become opaque in the library."

Translation: you can't just implement this in a header. You need the compiler to lower it. Same pattern as popcount and countl_zero. Fine for established bit operations, but it means every stdlib vendor needs to coordinate with their compiler team.

The 9.2x benchmark gap between naive and NTL implementations is the real story here. If you're not hitting the hardware instruction, you're leaving an order of magnitude on the table.

Also: the SIMD widening operations - the AVX-512 VPCLMULQDQ stuff that crypto actually wants - are explicitly punted to a future paper.

u/crc32_daily 9 points 2 hr. ago

The portability angle is the whole point. x86 has PCLMULQDQ (2010), ARM has NEON pmull, RISC-V has Zbc. Writing #ifdef chains for three ISAs is exactly the kind of thing the standard library should paper over.

u/integer_overflow_pedant 12 points 2 hr. ago

The interesting design question here isn't clmul itself - it's wide_result<T>.

The paper explicitly shares this type with P3161 (widening arithmetic). That means if P3161 changes the type name, the layout, or the comparison semantics, this paper has to track. And P3161 hasn't shipped yet.

"The result type is deliberately not named clmul_wide_result so that future mul_wide and other operations can use the same result type"

Fine in principle. In practice, coupling two in-flight proposals through a shared vocabulary type is how you get both of them stuck in LEWG for an extra year.
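To make the coupling concrete, here is a rough sketch of what a shared two-halves result type could look like. Everything in it is an assumption for illustration: wide_result_sketch, its member names low_bits and high_bits, and clmul_wide_sketch are hypothetical stand-ins, not the wording from either paper.

```cpp
#include <cstdint>

// Hypothetical stand-in for the shared result type. The member names are
// assumptions, not taken from P3642 or P3161.
template <class T>
struct wide_result_sketch {
    T low_bits;
    T high_bits;
};

// Hypothetical widening carry-less multiply for 32-bit inputs: compute the
// full product in a 64-bit accumulator, then split it into two halves.
constexpr wide_result_sketch<std::uint32_t>
clmul_wide_sketch(std::uint32_t a, std::uint32_t b) {
    std::uint64_t product = 0;
    for (int i = 0; i < 32; ++i) {
        if ((b >> i) & 1u) {
            product ^= static_cast<std::uint64_t>(a) << i;
        }
    }
    return {static_cast<std::uint32_t>(product),
            static_cast<std::uint32_t>(product >> 32)};
}

// x^31 * x = x^32: the only set bit lands in the high half.
static_assert(clmul_wide_sketch(0x80000000u, 2u).low_bits == 0u);
static_assert(clmul_wide_sketch(0x80000000u, 2u).high_bits == 1u);
```

Any operation that returns a double-width result could reuse a struct shaped like this, which is why a rename or layout change on one side would ripple into the other.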

Minor: I would have expected this in <bit> rather than <numeric>. clmul is fundamentally a bitwise operation. The paper's reasoning is "P0543 did it this way" which is a precedent argument, not a technical one.

u/yet_another_standards_lawyer 6 points 1 hr. ago

classic committee move: standardize the result type before standardizing the operation that needs it

u/parserjock_99 8 points 47 minutes ago

the hilbert curve example in section 3.2 is pretty dense for motivation. the JSON parsing angle is the killer use case - simdjson uses exactly this trick (clmul(quote_mask, -1)) and it's absurdly fast. should have led with that.
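for anyone wondering why clmul(quote_mask, -1) works: multiplying by all-ones gives a prefix XOR, so bit k of the result is the XOR of bits 0..k of the mask. rough sketch below, using a slow portable loop in place of the real instruction. clmul_low and prefix_xor are names made up here, not simdjson's or the paper's API.

```cpp
#include <cstdint>

// Hypothetical portable stand-in for a hardware carry-less multiply,
// truncated to the low 64 bits.
constexpr std::uint64_t clmul_low(std::uint64_t a, std::uint64_t b) {
    std::uint64_t result = 0;
    for (int i = 0; i < 64; ++i) {
        if ((b >> i) & 1u) {
            result ^= a << i;
        }
    }
    return result;
}

// Prefix XOR: bit k of the result is the XOR of bits 0..k of quote_mask.
// With one bit set per quote character, the result is 1 from each opening
// quote up to (but not including) the matching closing quote.
constexpr std::uint64_t prefix_xor(std::uint64_t quote_mask) {
    return clmul_low(quote_mask, ~std::uint64_t{0});
}

// Quotes at bit positions 2 and 5: bits 2, 3, 4 are "inside the string".
static_assert(prefix_xor(0b100100) == 0b011100);
```

one multiply replaces a log-depth cascade of shift-and-XOR steps, which is where the speed comes from.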
