r/wg21 - P3440R2 - Add mask_from_count function to std::simd

Posted by u/simd_lane_watcher · 7 hr. ago

Document: P3440R2
Author: Daniel Towner (Intel)
Date: 2026-02-20
Audience: LEWG

New revision of Daniel Towner's (Intel) paper proposing mask_from_count for std::simd - a free function that generates a mask with the first N elements set to true. The primary use case is loop remainders: when your data doesn't fill an entire SIMD vector, you need a mask to process just the leftover elements. Without a standard facility, users end up writing manual mask generation that's either non-portable across SIMD targets or subtly wrong for edge cases (the paper has some good examples of how the iota-based and bit-manipulation approaches silently break).
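
For readers who have not hit these failures: the paper's own examples are not reproduced here, but the two well-known traps look roughly like the scalar model below. The names iota_mask_u8, bit_mask_32 and active_lanes are made up for illustration and are not part of std::simd. The iota version narrows the count to the lane width before comparing, and the naive bit trick (1u << count) - 1 is undefined when count equals the word width.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Scalar model of an iota-based mask for 8-bit lanes: lane i is active when
// i < count. Broadcasting count into an 8-bit lane truncates it, which is
// the failure being illustrated.
template <std::size_t N>
std::array<bool, N> iota_mask_u8(std::size_t count) {
    const auto narrowed = static_cast<std::uint8_t>(count);  // 256 becomes 0
    std::array<bool, N> mask{};
    for (std::size_t i = 0; i < N; ++i)
        mask[i] = static_cast<std::uint8_t>(i) < narrowed;
    return mask;
}

// Bit-manipulation mask for up to 32 lanes. The naive (1u << count) - 1 is
// undefined behaviour when count == 32, so that case needs its own branch.
std::uint32_t bit_mask_32(unsigned count) {
    if (count >= 32) return 0xFFFFFFFFu;
    return (std::uint32_t{1} << count) - 1u;
}

// Counts the active lanes of a modelled mask.
template <std::size_t N>
std::size_t active_lanes(const std::array<bool, N>& m) {
    std::size_t n = 0;
    for (bool b : m) n += b ? 1 : 0;
    return n;
}

void check_pitfalls() {
    assert(active_lanes(iota_mask_u8<16>(5)) == 5);    // small counts look fine
    assert(active_lanes(iota_mask_u8<16>(256)) == 0);  // truncated: empty, not full
    assert(bit_mask_32(32) == 0xFFFFFFFFu);            // only safe via the branch
}
```

Both failures stay hidden for small counts, which is why they tend to survive testing.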

R2 renames the function from n_elements to mask_from_count, clarifies that out-of-range counts saturate instead of violating a precondition, and adds a simd-generic design that works uniformly across basic_vec and scalar types.
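
To make the saturating semantics concrete, here is a minimal scalar sketch. It assumes a mask can be modelled as one bool per lane; lane_mask, mask_from_count_model and add_one are hypothetical names, and none of this is the signature or implementation from the paper.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Scalar stand-in for a SIMD mask: one bool per lane. This is a model only,
// not the std::simd mask type or the signature proposed in the paper.
template <std::size_t N>
using lane_mask = std::array<bool, N>;

// Model of the saturating behaviour described for R2: the first `count`
// lanes are true, and any count >= N yields a full mask.
template <std::size_t N>
lane_mask<N> mask_from_count_model(std::size_t count) {
    lane_mask<N> mask{};
    for (std::size_t i = 0; i < N; ++i)
        mask[i] = i < count;  // comparing at full width saturates naturally
    return mask;
}

// Hypothetical helper: add 1 to every element, handling the tail with a mask
// instead of a separate scalar cleanup loop.
template <std::size_t N>
void add_one(int* data, std::size_t size) {
    std::size_t i = 0;
    for (; i + N <= size; i += N)  // full vectors
        for (std::size_t lane = 0; lane < N; ++lane) data[i + lane] += 1;
    const auto tail = mask_from_count_model<N>(size - i);  // 0..N-1 lanes left
    for (std::size_t lane = 0; lane < N; ++lane)
        if (tail[lane]) data[i + lane] += 1;  // inactive lanes never touch memory
}

void check_mask_model() {
    int data[10] = {};
    add_one<4>(data, 10);  // two full vectors, then a tail of 2
    for (int v : data) assert(v == 1);
    assert(mask_from_count_model<4>(9)[3]);  // oversized count gives a full mask
}
```

A real implementation would replace the inner lane loops with vector operations; the point here is only how the mask covers the remainder.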

▲ 21 points (88% upvoted) · 7 comments

u/not_a_bikeshed_actually 11 points 6 hr. ago

R2 renames it from n_elements to mask_from_count. The naming section lists nine rejected alternatives. I give it two more revisions before someone in LEWG proposes make_mask_for_first_n_elements_of_basic_vec_please.

u/former_boost_contributor 8 points 5 hr. ago

std::simd has been "almost in the standard" since the Parallelism TS and we're still adding individual utility functions one paper at a time. At this rate we'll have a complete SIMD library by C++35.

u/constexpr_everything_2024 3 points 4 hr. ago

To be fair this one is like three lines of implementation. The fact that you can get it wrong in subtle ways on different targets is exactly why it should be in the standard library rather than hand-rolled in every codebase.

u/embedded_for_20_years 6 points 4 hr. ago

From the paper:

count >= size() saturates to a full mask

I understand the consistency argument with partial_load, but in my domain (automotive ECU, ASIL-D) a count that exceeds the SIMD width is almost certainly a bug in the loop arithmetic. I'd rather have a precondition violation that the contract checker catches at the boundary than silent saturation that makes the loop "work" while processing garbage.

The paper says Intel saw no performance difference between strict and relaxed preconditions, and that's plausible on AVX-512 where the clamp is a single min. But the semantic cost matters more than the runtime cost here - you've turned a detectable bug into correct-looking behavior.
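
A rough sketch of the difference, again as a scalar model rather than real std::simd code. The width of 8 lanes and the names saturating_mask and checked_mask are assumptions for illustration; the strict variant reports the bad count through std::optional here so the example stays testable, where a production build would more likely use a contract or assertion.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>

constexpr std::size_t kWidth = 8;  // assumed lane count for the model
using mask8 = std::array<bool, kWidth>;

// Saturating model: count > kWidth quietly produces a full mask.
mask8 saturating_mask(std::size_t count) {
    mask8 m{};
    for (std::size_t i = 0; i < kWidth; ++i) m[i] = i < count;
    return m;
}

// Strict wrapper (hypothetical, not from the paper): a count above the
// width is reported to the caller instead of being clamped.
std::optional<mask8> checked_mask(std::size_t count) {
    if (count > kWidth) return std::nullopt;
    return saturating_mask(count);
}

void check_strictness() {
    assert(saturating_mask(11)[7]);         // bad loop arithmetic goes unnoticed
    assert(!checked_mask(11).has_value());  // the same bad count is surfaced
    assert(checked_mask(3).has_value());    // in-range counts behave the same
}
```

A strict wrapper like this can be layered on top of a saturating primitive, but not the other way round without paying for the check twice.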

u/constexpr_everything_2024 4 points 3 hr. ago

partial_load already saturates for oversized ranges, so if you're using both together (which is the main use case per the paper), you'd need the same defensive check either way. At least they're consistent with each other.
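
A small model of that pairing, assuming a 4-lane vector and a partial load that copies at most one vector's worth of elements and zero-fills the rest. partial_load_model, mask_model and masked_sum are illustrative names, not the std::simd API.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed vector width for the model

// Model of a saturating partial load: copies at most kLanes elements and
// zero-fills the remaining lanes.
std::array<int, kLanes> partial_load_model(const int* first, std::size_t available) {
    std::array<int, kLanes> v{};
    std::copy_n(first, std::min(available, kLanes), v.begin());
    return v;
}

// Model of a saturating mask_from_count: lane i is active when i < count.
std::array<bool, kLanes> mask_model(std::size_t count) {
    std::array<bool, kLanes> m{};
    for (std::size_t i = 0; i < kLanes; ++i) m[i] = i < count;
    return m;
}

// Sums the active lanes of one chunk, given `available` remaining elements.
int masked_sum(const int* first, std::size_t available) {
    const auto v = partial_load_model(first, available);
    const auto m = mask_model(available);
    int sum = 0;
    for (std::size_t i = 0; i < kLanes; ++i)
        if (m[i]) sum += v[i];
    return sum;
}

void check_pairing() {
    const int data[6] = {1, 2, 3, 4, 5, 6};
    assert(masked_sum(data, 6) == 10);      // oversized: both clamp to 4 lanes
    assert(masked_sum(data + 4, 2) == 11);  // tail: both cover exactly 2 lanes
}
```

Because both pieces clamp the same way, the same remaining-element count can be passed to each without a separate min.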

u/arm_neon_enjoyer 5 points 3 hr. ago

From the performance section:

Intel's implementation experience shows no notable performance difference between implementations that handle all non-negative counts versus those with stricter preconditions

This is one vendor, on architectures where mask registers are first-class hardware (AVX-512 k registers, AVX10). On ARM SVE, where masks are governing predicates with different constraints, or on NEON, where masks are full-width vectors, the saturation clamp might not be free. It would be nice to see multi-target data before leaning on performance as the design justification.

Not a dealbreaker for the function itself - mask_from_count is obviously useful. Just saying the precondition argument leans on evidence from one architecture family.

u/highway_or_bust 2 points 2 hr. ago

Google Highway has FirstN which does exactly this, and it works across x86, ARM, RISC-V, and WASM today. Nice to see it landing in the standard eventually.

github.com/google/highway
