Document: P3440R2
Author: Daniel Towner (Intel)
Date: 2026-02-20
Audience: LEWG
New revision of Daniel Towner's (Intel) paper proposing mask_from_count for std::simd - a free function that generates a mask with the first N elements set to true. The primary use case is loop remainders: when your data doesn't fill an entire SIMD vector, you need a mask to process just the leftover elements. Without a standard facility, users end up writing manual mask generation that's either non-portable across SIMD targets or subtly wrong for edge cases (the paper has some good examples of how the iota-based and bit-manipulation approaches silently break).
R2 renames the function from n_elements to mask_from_count, clarifies that out-of-range counts saturate instead of violating a precondition, and adds a simd-generic design that works uniformly across basic_vec and scalar types.
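To make the remainder use case and the saturating behaviour concrete, here is a minimal scalar sketch. It is not the paper's implementation and does not use std::simd: MaskModel, mask_from_count_model and masked_sum are hypothetical names, and a std::array of bool stands in for a real SIMD mask.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical scalar stand-in for a SIMD mask with Width lanes.
template <std::size_t Width>
using MaskModel = std::array<bool, Width>;

// Models the saturating semantics described above: the first
// min(count, Width) lanes are true, every other lane is false.
template <std::size_t Width>
MaskModel<Width> mask_from_count_model(std::size_t count) {
    MaskModel<Width> mask{};  // all lanes start out false
    const std::size_t active = std::min(count, Width);
    for (std::size_t lane = 0; lane < active; ++lane) {
        mask[lane] = true;
    }
    return mask;
}

// Sums `data` in chunks of Width lanes. The last, partial chunk is
// handled by the mask instead of a separate scalar tail loop.
template <std::size_t Width>
int masked_sum(const std::vector<int>& data) {
    int total = 0;
    for (std::size_t base = 0; base < data.size(); base += Width) {
        // `remaining` exceeds Width for every chunk except possibly the last;
        // saturation turns those cases into an all-true mask.
        const std::size_t remaining = data.size() - base;
        const MaskModel<Width> mask = mask_from_count_model<Width>(remaining);
        for (std::size_t lane = 0; lane < Width; ++lane) {
            if (mask[lane]) {
                total += data[base + lane];
            }
        }
    }
    return total;
}
```

With saturation, the same call serves full chunks and the tail: in this model, masked_sum<4> over seven elements uses an all-true mask for the first chunk and a three-lane mask for the second.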
R2 renames it from n_elements to mask_from_count. The naming section lists nine rejected alternatives. I give it two more revisions before someone in LEWG proposes make_mask_for_first_n_elements_of_basic_vec_please.
std::simd has been "almost in the standard" since the Parallelism TS and we're still adding individual utility functions one paper at a time. At this rate we'll have a complete SIMD library by C++35.
To be fair this one is like three lines of implementation. The fact that you can get it wrong in subtle ways on different targets is exactly why it should be in the standard library rather than hand-rolled in every codebase.
From the paper:
I understand the consistency argument with partial_load, but in my domain (automotive ECU, ASIL-D) a count that exceeds the SIMD width is almost certainly a bug in the loop arithmetic. I'd rather have a precondition violation that the contract checker catches at the boundary than silent saturation that makes the loop "work" while processing garbage.
The paper says Intel saw no performance difference between strict and relaxed preconditions, and that's plausible on AVX-512 where the clamp is a single min. But the semantic cost matters more than the runtime cost here - you've turned a detectable bug into correct-looking behavior.
partial_load already saturates for oversized ranges, so if you're using both together (which is the main use case per the paper), you'd need the same defensive check either way. At least they're consistent with each other.
From the performance section:
This is one vendor on architectures where mask registers are first-class hardware (AVX-512 k registers, AVX-10). On ARM SVE where masks are governing predicates with different constraints, or on NEON where masks are full-width vectors, the saturation clamp might not be free. Would be nice to see multi-target data before leaning on performance as the design justification.
Not a dealbreaker for the function itself - mask_from_count is obviously useful. Just saying the precondition argument leans on evidence from one architecture family.
Google Highway has FirstN which does exactly this, and it works across x86, ARM, RISC-V, and WASM today. Nice to see it landing in the standard eventually.
github.com/google/highway
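For anyone who wants the precondition disagreement from earlier in the thread spelled out, here is a rough sketch of the two behaviours side by side. It assumes a hypothetical 8-lane mask stored as one bit per lane; first_n_saturating and first_n_strict are made-up names, and the exception in the strict variant is only a stand-in for whatever a contract violation would do in practice.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical lane count for this sketch: one mask bit per lane.
constexpr std::size_t kLanes = 8;

// Saturating variant: an oversized count is clamped to kLanes, so the
// call always succeeds and returns an all-ones mask in that case.
inline std::uint8_t first_n_saturating(std::size_t count) {
    const std::size_t active = std::min(count, kLanes);
    // The shift is done in `unsigned`, which is wider than 8 bits,
    // so shifting by kLanes itself is well defined.
    return static_cast<std::uint8_t>((1u << active) - 1u);
}

// Strict variant: an oversized count is reported as an error instead of
// being clamped. The throw stands in for a contract violation.
inline std::uint8_t first_n_strict(std::size_t count) {
    if (count > kLanes) {
        throw std::out_of_range("count exceeds the number of lanes");
    }
    return static_cast<std::uint8_t>((1u << count) - 1u);
}
```

Both variants agree for every count up to kLanes. They only differ for larger counts, which is exactly the case the ASIL-D comment wants surfaced and the partial_load comment wants treated consistently.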