Document: P2929R2
Authors: Daniel Towner, Ruslan Arutyunyan (Intel)
Date: 2026-01-26
Audience: LEWG
Intel folks proposing simd::chunked_invoke - a utility that breaks large basic_vec values into register-sized chunks, applies a callable (typically wrapping a platform intrinsic) to each chunk, and reassembles the results via simd::cat. The paper title still says simd_invoke but the function was renamed to chunked_invoke in this revision to avoid confusion with std::invoke.
The motivation is practical: if your basic_vec is wider than a native register, you currently have to manually chunk, call intrinsics on each piece, and cat the results back together. This paper wraps that boilerplate into a single function. It also optionally passes a chunk index to the callable if the callable accepts one, using callable probing rather than a separate _indexed variant.
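For readers who have not written the chunk / invoke / cat boilerplate by hand, here is a minimal sketch of the pattern using std::array as a stand-in for basic_vec. It is not the paper's implementation and uses no std::simd facilities. The names native_size, chunk_len, extract, apply_one and chunked_apply_model are hypothetical, and the native width of 8 is an assumption (float on AVX).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical native width: 8 lanes, as for float on AVX.
inline constexpr std::size_t native_size = 8;

// Length of chunk i when n elements are split into native_size pieces.
constexpr std::size_t chunk_len(std::size_t n, std::size_t i) {
    const std::size_t remaining = n - i * native_size;
    return remaining < native_size ? remaining : native_size;
}

// Scalar model of extracting one chunk: copy Len elements starting at Offset.
template <std::size_t Offset, std::size_t Len, class T, std::size_t N>
std::array<T, Len> extract(const std::array<T, N>& v) {
    static_assert(Offset + Len <= N, "chunk must lie inside the input");
    std::array<T, Len> out{};
    for (std::size_t i = 0; i < Len; ++i) out[i] = v[Offset + i];
    return out;
}

// Process chunk I: extract it, call f on it, copy the result back in place.
template <std::size_t I, class F, class T, std::size_t N>
void apply_one(F& f, const std::array<T, N>& in, std::array<T, N>& out) {
    constexpr std::size_t offset = I * native_size;
    constexpr std::size_t len = chunk_len(N, I);
    const std::array<T, len> result = f(extract<offset, len>(in));
    for (std::size_t i = 0; i < len; ++i) out[offset + i] = result[i];
}

template <class F, class T, std::size_t N, std::size_t... I>
std::array<T, N> chunked_apply_impl(F& f, const std::array<T, N>& in,
                                    std::index_sequence<I...>) {
    std::array<T, N> out{};
    (apply_one<I>(f, in, out), ...);  // one call per chunk, in order
    return out;
}

// Hypothetical model of the chunk / invoke / concatenate pattern.
template <class F, class T, std::size_t N>
std::array<T, N> chunked_apply_model(F f, const std::array<T, N>& in) {
    constexpr std::size_t chunks = (N + native_size - 1) / native_size;
    return chunked_apply_impl(f, in, std::make_index_sequence<chunks>{});
}

// Usage: 19 elements are processed as chunks of 8, 8 and 3.
inline void chunked_apply_demo() {
    std::array<int, 19> v{};
    for (std::size_t i = 0; i < v.size(); ++i) v[i] = static_cast<int>(i);
    // Generic lambda: instantiated for the 8-element chunks and the 3-element tail.
    const auto doubled = chunked_apply_model(
        [](auto chunk) { for (auto& x : chunk) x *= 2; return chunk; }, v);
    for (std::size_t i = 0; i < v.size(); ++i) assert(doubled[i] == 2 * v[i]);
}
```

In real code the lambda body would wrap an intrinsic and the copies would be register moves. The point here is only how much scaffolding sits around the one line that matters.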
std::simd: the library that keeps growing before it ships. At this rate the API surface will be larger than the register file.
The core idea is a reasonable convenience. But the callable probing design gives me pause. From section 4.1.1:
"Probing capabilities" means SFINAE on whether the callable accepts an extra parameter. The note in the wording even warns about ambiguous overloads. I'd rather have an explicit chunked_invoke_indexed than silent overload resolution deciding whether my lambda gets an index or not. The permute precedent isn't great justification for propagating the same pattern.
That said, the chunk-invoke-cat boilerplate is genuinely annoying. I've written it enough times to appreciate the motivation.
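To make the probing concern concrete, here is a rough sketch of how "pass the index if the callable accepts it" could be modeled with std::is_invocable_v. This is an illustration, not the paper's wording: call_with_optional_index is a hypothetical name, the chunk is modeled as a plain int, and passing the index as a std::size_t is an assumption about its form.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical dispatcher: pass the chunk index only if f accepts it.
template <class F, class Chunk>
auto call_with_optional_index(F& f, const Chunk& chunk, std::size_t index) {
    if constexpr (std::is_invocable_v<F&, const Chunk&, std::size_t>) {
        return f(chunk, index);  // f is callable as f(chunk, index)
    } else {
        return f(chunk);         // fall back to f(chunk)
    }
}

inline void probing_demo() {
    auto plain = [](int chunk) { return chunk; };
    auto indexed = [](int chunk, std::size_t i) {
        return chunk + static_cast<int>(i);
    };
    // A variadic generic lambda accepts both forms, so probing picks the
    // indexed call even if the author never intended to receive an index.
    auto greedy = [](auto... args) { return sizeof...(args); };

    assert(call_with_optional_index(plain, 10, 2) == 10);
    assert(call_with_optional_index(indexed, 10, 2) == 12);
    assert(call_with_optional_index(greedy, 10, 2) == 2u);
}
```

The greedy case is the one that worries me: nothing at the call site says whether the index is being passed, which an explicit _indexed entry point would make visible.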
Genuinely curious - what does this buy you that Highway or xsimd don't already handle? Both of those have had "apply op to native-sized pieces" as a core abstraction for years.
Standardization, presumably. Though at the rate std::simd is moving, Highway will have rewritten their entire API twice before this ships.
The tail chunk handling is the interesting part. The paper's 19-element example on AVX (native size 8) gives chunks of 8, 8, and 3. Your callable has to handle that trailing 3-element chunk correctly.
The mandate says:
So your lambda needs to be a generic lambda or have overloads for every possible tail size. In practice that means a constexpr if chain inside the lambda to dispatch the right intrinsic per chunk size - which is pretty much the boilerplate this paper was supposed to eliminate.
The non-tail case is where the convenience wins. The tail case still requires the same manual dispatch.
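A sketch of what that per-size dispatch tends to look like, again with std::array standing in for a chunk. describe_chunk is a hypothetical callable, and the returned labels are placeholders for where real intrinsic calls would go. The particular sizes handled (8, 4, everything else) are assumptions for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <tuple>

// Generic callable: one body, instantiated once per distinct chunk size.
// The returned labels are placeholders for where real intrinsics would go.
inline constexpr auto describe_chunk = [](auto chunk) -> std::string_view {
    constexpr std::size_t len = std::tuple_size_v<decltype(chunk)>;
    if constexpr (len == 8) {
        return "8-wide intrinsic";
    } else if constexpr (len == 4) {
        return "4-wide intrinsic";
    } else {
        return "scalar or masked fallback";
    }
};

inline void tail_demo() {
    // 19 elements with a native width of 8 split as 8, 8, 3.
    static_assert(19 / 8 == 2 && 19 % 8 == 3);
    assert(describe_chunk(std::array<float, 8>{}) == "8-wide intrinsic");
    assert(describe_chunk(std::array<float, 3>{}) == "scalar or masked fallback");
}
```

Every tail size that can occur needs a branch that compiles, so the chain grows with the number of widths you want to treat specially.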
From the revision history:
Ironic that the paper title still says simd_invoke. chunked_invoke is better though - it signals that the chunking is the point, not just "invoke but for SIMD." Aligns with chunk and cat naming in the rest of the library.
Wait, std::simd made it into C++26? I thought it was still in the parallelism TS.
P1928 is merging it into the working draft. Long road from the TS, but it's happening.