https://secret.club/2021/04/09/std-clamp.html
A look at LLVM - comparing clamp implementations
By duk
Apr 9, 2021
---------------------------------------------------------------------
Please note that this is not an endorsement or criticism of either of
these languages. It's simply something I found interesting with how
LLVM handles code generation between the two. This is an
implementation quirk, not a language issue.
The LLVM project is a modular set of tools that make designing and
implementing a compiler significantly easier. The best-known part of
LLVM is its intermediate representation, IR for short. LLVM's IR is
an extremely powerful tool, designed to make optimization and
targeting many architectures as easy as possible.
Many tools use LLVM IR; the Clang C++ compiler and the Rust compiler
(rustc) are both notable examples. However, despite this unified
architecture, code generation can still vary wildly between
implementations and how the IR is used. Some time ago, I stumbled
upon this tweet discussing Rust's implementation of clamping compared
to C++:
Rust 1.50 is out and has f32.clamp. I had extremely low
expectations for performance based on C++ experience but as usual
Rust proves to be "C++ done right".
Of course Zig already has clamp and also gets codegen right.
pic.twitter.com/0WI1fLrQaB
-- Arseny Kapoulkine (@zeuxcg) February 11, 2021
Rust's code generation on the latest version of LLVM is far superior
to that of an equivalent Clang version using std::clamp, even though
both compilers lower to the same underlying IR:
With f32.clamp:
pub fn clamp(v: f32) -> f32 {
v.clamp(-1.0, 1.0)
}
The corresponding assembly is shown below. It is short, concise, and
pretty much the best you're going to get. We can see two memory
accesses to get the clamp bounds and efficient use of x86
instructions.
.LCPI0_0:
.long 0xbf800000
.LCPI0_1:
.long 0x3f800000
example::clamp:
movss xmm1, dword ptr [rip + .LCPI0_0]
maxss xmm1, xmm0
movss xmm0, dword ptr [rip + .LCPI0_1]
minss xmm0, xmm1
ret
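Reading that instruction sequence back into C++: the maxss applies the lower bound and the minss applies the upper bound, computing min(1.0, max(-1.0, v)). A minimal sketch of that branchless shape (ignoring the NaN-ordering subtleties of the x86 min/max instructions):

```cpp
#include <algorithm>

// Branchless clamp in the same shape as the assembly above:
// maxss applies the lower bound, minss applies the upper bound.
// Sketch only: x86 maxss/minss have their own NaN-handling rules.
float clamp_sketch(float v) {
    return std::min(1.0f, std::max(-1.0f, v));
}
```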
Next is a short C++ program using std::clamp:
#include <algorithm>
float clamp2(float v) {
return std::clamp(v, -1.f, 1.f);
}
The corresponding assembly is shown below. It is significantly longer
with many more data accesses, conditional moves, and is in general
uglier.
.LCPI1_0:
.long 0x3f800000 # float 1
.LCPI1_1:
.long 0xbf800000 # float -1
clamp2(float): # @clamp2(float)
movss dword ptr [rsp - 4], xmm0
mov dword ptr [rsp - 8], -1082130432
mov dword ptr [rsp - 12], 1065353216
ucomiss xmm0, dword ptr [rip + .LCPI1_0]
lea rax, [rsp - 12]
lea rcx, [rsp - 4]
cmova rcx, rax
movss xmm1, dword ptr [rip + .LCPI1_1] # xmm1 = mem[0],zero,zero,zero
ucomiss xmm1, xmm0
lea rax, [rsp - 8]
cmovbe rax, rcx
movss xmm0, dword ptr [rax] # xmm0 = mem[0],zero,zero,zero
ret
Interestingly enough, reimplementing std::clamp causes this issue to
disappear:
float clamp(float v, float lo, float hi) {
v = (v < lo) ? lo : v;
v = (v > hi) ? hi : v;
return v;
}
float clamp1(float v) {
return clamp(v, -1.f, 1.f);
}
The assembly generated here is the same as with Rust's
implementation:
.LCPI0_0:
.long 0xbf800000 # float -1
.LCPI0_1:
.long 0x3f800000 # float 1
clamp1(float): # @clamp1(float)
movss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
maxss xmm1, xmm0
movss xmm0, dword ptr [rip + .LCPI0_1] # xmm0 = mem[0],zero,zero,zero
minss xmm0, xmm1
ret
Clearly, something is off between std::clamp and our implementation.
According to the C++ reference, std::clamp takes three const
references (the value and both bounds) along with a predicate (which
defaults to std::less) and returns a reference. Functionally, the
only difference between our code and std::clamp is that we do not use
reference types. Knowing this, we can reproduce the issue.
const float& bad_clamp(const float& v, const float& lo, const float& hi) {
return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
return bad_clamp(v, -1.f, 1.f);
}
Once again, we've generated the same bad code as with std::clamp:
.LCPI1_0:
.long 0x3f800000 # float 1
.LCPI1_1:
.long 0xbf800000 # float -1
clamp2(float): # @clamp2(float)
movss dword ptr [rsp - 4], xmm0
mov dword ptr [rsp - 8], -1082130432
mov dword ptr [rsp - 12], 1065353216
ucomiss xmm0, dword ptr [rip + .LCPI1_0]
lea rax, [rsp - 12]
lea rcx, [rsp - 4]
cmova rcx, rax
movss xmm1, dword ptr [rip + .LCPI1_1] # xmm1 = mem[0],zero,zero,zero
ucomiss xmm1, xmm0
lea rax, [rsp - 8]
cmovbe rax, rcx
movss xmm0, dword ptr [rax] # xmm0 = mem[0],zero,zero,zero
ret
LLVM IR and Clang
LLVM IR is a Static Single Assignment (SSA) intermediate
representation, which means every variable is assigned exactly once.
To represent conditional assignments, SSA form uses a special type of
instruction called a "phi" node, which picks a value based on which
block was previously executing.
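As an illustrative sketch (hand-written, not compiler output), a phi node merging two control-flow paths looks like this:

```llvm
define float @pick(i1 %cond) {
entry:
  br i1 %cond, label %then, label %else
then:
  br label %merge
else:
  br label %merge
merge:
  ; yields 1.0 if control came from %then, 2.0 if it came from %else
  %v = phi float [ 1.000000e+00, %then ], [ 2.000000e+00, %else ]
  ret float %v
}
```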
However, Clang does not initially emit phi nodes. Instead, to make
initial code generation easier, local variables are allocated on the
stack using alloca instructions; reads of and assignments to a
variable become load and store instructions on that alloca,
respectively:
int main() {
float x = 0;
}
In this unoptimized IR, we can see an alloca instruction that then
has the float value 0 stored to it:
define dso_local i32 @main() #0 {
%1 = alloca float, align 4
store float 0.000000e+00, float* %1, align 4
ret i32 0
}
LLVM will then (hopefully) optimize away the alloca instructions with
a relevant pass, like SROA.
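For instance, once a pass like SROA (or mem2reg) runs on the IR above, the alloca and store vanish entirely, since x is never read. A sketch of the expected result:

```llvm
define dso_local i32 @main() #0 {
  ret i32 0
}
```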
LLVM IR and reference types
Reference types are represented as pointers in LLVM IR:
void test(float& x2) {
x2 = 1;
}
In this optimized IR, we can see that the reference has been
converted to a pointer with specific attributes.
define dso_local void @_Z4testRf(float* nocapture nonnull align 4 dereferenceable(4) %0) local_unnamed_addr #0 {
store float 1.000000e+00, float* %0, align 4, !tbaa !2
ret void
}
When a function takes a reference-type argument, it is passed the
address of the underlying object instead of the object itself. The
compiler also attaches metadata about the reference: for example, the
nonnull and dereferenceable attributes are set on the argument
because the C++ standard dictates that references must always be
bound to a valid object. For us, this means the alloca instructions
are passed directly to the clamp function:
__attribute__((noinline)) const float& bad_clamp(const float& v, const float& lo, const float& hi) {
return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
return bad_clamp(v, -1.f, 1.f);
}
In this optimized IR, we can see alloca instructions passed to
bad_clamp corresponding to the variables passed as references.
define linkonce_odr dso_local nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %0, float* nonnull align 4 dereferenceable(4) %1, float* nonnull align 4 dereferenceable(4) %2) local_unnamed_addr #2 comdat {
%4 = load float, float* %0, align 4
%5 = load float, float* %1, align 4
%6 = fcmp olt float %4, %5
%7 = load float, float* %2, align 4
%8 = fcmp ogt float %4, %7
%9 = select i1 %8, float* %2, float* %0
%10 = select i1 %6, float* %1, float* %9
ret float* %10
}
define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #1 {
%2 = alloca float, align 4
%3 = alloca float, align 4
%4 = alloca float, align 4
store float %0, float* %2, align 4
store float -1.000000e+00, float* %3, align 4
store float 1.000000e+00, float* %4, align 4
%6 = call nonnull align 4 dereferenceable(4) float* @_Z9bad_clampRKfS0_S0_(float* nonnull align 4 dereferenceable(4) %2, float* nonnull align 4 dereferenceable(4) %3, float* nonnull align 4 dereferenceable(4) %4)
%7 = load float, float* %6, align 4
ret float %7
}
Lifetime annotations are omitted to make the IR a bit clearer.
In this example, the noinline attribute was used to demonstrate
passing references to functions. If we remove the attribute, the call
is inlined into the function:
const float& bad_clamp(const float& v, const float& lo, const float& hi) {
return (v < lo) ? lo : (v > hi) ? hi : v;
}
float clamp2(float v) {
return bad_clamp(v, -1.f, 1.f);
}
However, even after optimization, the alloca instructions are still
there for seemingly no good reason. These alloca instructions should
have been optimized away by LLVM's passes; they're not used anywhere
else and there are no tricky stores or lifetime problems.
define dso_local float @_Z6clamp2f(float %0) local_unnamed_addr #0 {
%2 = alloca float, align 4
%3 = alloca float, align 4
%4 = alloca float, align 4
store float %0, float* %2, align 4, !tbaa !2
store float -1.000000e+00, float* %3, align 4, !tbaa !2
store float 1.000000e+00, float* %4, align 4, !tbaa !2
%5 = fcmp olt float %0, -1.000000e+00
%6 = fcmp ogt float %0, 1.000000e+00
%7 = select i1 %6, float* %4, float* %2
%8 = select i1 %5, float* %3, float* %7
%9 = load float, float* %8, align 4, !tbaa !2
ret float %9
}
The only candidates here are the two sequential select instructions,
as they operate on the pointers created by the alloca instructions
rather than on the underlying values. However, LLVM has a pass for
this too: where possible, LLVM will try to "speculate" across select
instructions whose results are loaded.
select instructions are essentially ternary operators that pick one
of the last two operands (float pointers in our case) based on the
value of the first operand.
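To make the transformation concrete, here is a C++ sketch of what "speculating" a load across a select means (the function names are illustrative, not LLVM API): the loads are hoisted above the choice, which is only legal if both pointers are known to be dereferenceable.

```cpp
// Before speculation: choose a pointer, then load through it.
float load_after_select(bool c, const float* p, const float* q) {
    const float* chosen = c ? p : q;  // 'select' on pointers
    return *chosen;                   // a single load
}

// After speculation: load both sides eagerly, then choose a value.
// Only valid if *p and *q are both safe to dereference.
float load_before_select(bool c, const float* p, const float* q) {
    float pv = *p;
    float qv = *q;
    return c ? pv : qv;               // 'select' on values
}
```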
Select speculation - where things go wrong
A few calls down the chain, the select-speculation code calls
isDereferenceableAndAlignedPointer, which determines whether a
pointer can safely be dereferenced. The code here exposes the main
issue: a select instruction is never itself considered
dereferenceable. As such, when two selects appear in sequence (as
with our std::clamp), LLVM will not speculate across the outer select
and will not remove the allocas.
Fix 1: libcxx
A potential fix is modifying the original code so that it does not
produce select instructions in the same way. For example, we can
mimic our original implementation with pointers instead of value
types. Though the change in IR output is relatively small, this gives
us the code generation we want without modifying the LLVM codebase:
const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
const float *out;
out = (v < lo) ? &lo : &v;
out = (*out > hi) ? &hi : out;
return *out;
}
float clamp3(float v) {
return better_ref_clamp(v, -1.f, 1.f);
}
As you can see, the IR generated after the call is inlined is
significantly shorter and more efficient than before:
define dso_local float @_Z6clamp3f(float %0) local_unnamed_addr #1 {
%2 = fcmp olt float %0, -1.000000e+00
%3 = select i1 %2, float -1.000000e+00, float %0
%4 = fcmp ogt float %3, 1.000000e+00
%5 = select i1 %4, float 1.000000e+00, float %3
ret float %5
}
And the corresponding assembly is back to what we want it to be:
.LCPI1_0:
.long 0xbf800000 # float -1
.LCPI1_1:
.long 0x3f800000 # float 1
clamp3(float): # @clamp3(float)
movss xmm1, dword ptr [rip + .LCPI1_0] # xmm1 = mem[0],zero,zero,zero
maxss xmm1, xmm0
movss xmm0, dword ptr [rip + .LCPI1_1] # xmm0 = mem[0],zero,zero,zero
minss xmm0, xmm1
ret
Fix 2: LLVM
A much more general approach is fixing the code generation issue in
LLVM itself, which could be as simple as this:
diff --git a/llvm/lib/Analysis/Loads.cpp b/llvm/lib/Analysis/Loads.cpp
index d8f954f575838d9886fce0df2d40407b194e7580..affb55c7867f48866045534d383b4d7ba19773a3 100644
--- a/llvm/lib/Analysis/Loads.cpp
+++ b/llvm/lib/Analysis/Loads.cpp
@@ -103,6 +103,14 @@ static bool isDereferenceableAndAlignedPointer(
CtxI, DT, TLI, Visited, MaxDepth);
}
+ // For select instructions, both operands need to be dereferenceable.
+ if (const SelectInst *SelInst = dyn_cast<SelectInst>(V))
+ return isDereferenceableAndAlignedPointer(SelInst->getOperand(1), Alignment,
+ Size, DL, CtxI, DT, TLI,
+ Visited, MaxDepth) &&
+ isDereferenceableAndAlignedPointer(SelInst->getOperand(2), Alignment,
+ Size, DL, CtxI, DT, TLI,
+ Visited, MaxDepth);
// For gc.relocate, look through relocations
if (const GCRelocateInst *RelocateInst = dyn_cast<GCRelocateInst>(V))
return isDereferenceableAndAlignedPointer(RelocateInst->getDerivedPtr(),
All it does is add select instructions to the list of instruction
types to consider potentially dereferenceable. Though it seems to fix
the issue (and alive2 seems to like it), this is otherwise untested.
Also, the codegen still isn't perfect. Though the redundant memory
accesses are removed, there are still many more instructions than in
our libcxx fix (and Rust's implementation):
.LCPI0_0:
.long 0x3f800000 # float 1
.LCPI0_1:
.long 0xbf800000 # float -1
clamp2(float): # @clamp2(float)
movss xmm1, dword ptr [rip + .LCPI0_0] # xmm1 = mem[0],zero,zero,zero
minss xmm1, xmm0
movss xmm2, dword ptr [rip + .LCPI0_1] # xmm2 = mem[0],zero,zero,zero
cmpltss xmm0, xmm2
movaps xmm3, xmm0
andnps xmm3, xmm1
andps xmm0, xmm2
orps xmm0, xmm3
ret
However, this is a consequence of how the ternary operators are
ordered in the original libcxx clamp:
template <class _Tp, class _Compare>
const _Tp& clamp(const _Tp& __v, const _Tp& __lo, const _Tp& __hi, _Compare __comp)
{
_LIBCPP_ASSERT(!__comp(__hi, __lo), "Bad bounds passed to std::clamp");
return __comp(__v, __lo) ? __lo : __comp(__hi, __v) ? __hi : __v;
}
This doesn't look as good because LLVM needs to preserve the original
value of __v for the second comparison. As a result, it can't
optimize the second part of the computation into a maxss, as that
would produce different behavior when __lo is greater than __hi and
__v is negative.
#include <cstdio>

const float& ref_clamp(const float& v, const float& lo, const float& hi) {
return (v < lo) ? lo : (v > hi) ? hi : v;
}
const float& better_ref_clamp(const float& v, const float& lo, const float& hi) {
const float *out;
out = (v < lo) ? &lo : &v;
out = (*out > hi) ? &hi : out;
return *out;
}
int main() {
printf("%f\n", ref_clamp(-2.f, 1.f, -1.f)); // this prints 1.000
printf("%f\n", better_ref_clamp(-2.f, 1.f, -1.f)); // this prints -1.000
}
Even though we know this is undefined behavior in C++, LLVM doesn't
have enough information to know that. Adjusting code generation
accordingly would be no easy task either. Despite all of this though,
it does show how versatile LLVM truly is; relatively simple changes
can have significant results.