## UNLEASH THE POWER OF AVX-512 THROUGHAROHITECTURE, COMPILERAND CODEMODERNZATION

Xinmin Tian, Robert Geva, and Bob Valentine Intel Corporation
September 11, 2016

## Agenda

Section I - AVX-512 Architecture Insights
Section II - Intel ${ }^{\oplus}$ Compiler: Putting SIMD Vectorization to Work
Section III - Code Modernization: Best Practices for Vector Programming

Robert Valentine - Senior Principal Engineer
Intel Corporation
September 11, 2016

## Section I: Agenda

$\checkmark$ Introduction: Intel ${ }^{\oplus}$ ISA Roadmap
$\checkmark$ Deep dive: AVX1/2/AVX512 ISA
$\checkmark$ AVX-512 F: Common ISA Extension
$\checkmark$ AVX-512 ERI \& PRI: Intel ${ }^{\ominus}$ Xeon Phi ${ }^{\text {Tm }}$ Product Only
$\checkmark$ Xeon additions to AVX-512 F
$\checkmark$ Summary

## Intel ${ }^{\oplus}$ Advanced Vector Extensions



## Consistent Developer Tools and Programming Models



## Intel ${ }^{\circ}$ AVX Technology

Sandy Bridge


Haswell

## 256b AVX2

32 SP / 16 DP
Flops/Cycle (FMA)
Future (in planning, subject to change)

```
512b AVX512
```

Server: 64SP / 32 DP
Client: 32 SP / 16 DP
Flops/Cycle (FMA)

SNB-2011
HSW-2013

##  <br> AVX512 <br> 512-bit FP/Integer <br> 32 registers <br> 8 mask registers <br> Embedded rounding <br> Embedded broadcast <br> Scalar/SSE/AVX "promotions" <br> Native media additions <br> HPC additions <br> Transcendental support <br> Gather/Scatter

AVX512 Big Picture
$\checkmark$ Deep dive: AVX1/2/AVX512 ISA
$\checkmark$ AVX-512 F: Common ISA Extension
$\checkmark$ AVX-512 ERI \& PRI: Intel ${ }^{\oplus}$ Xeon Phi ${ }^{\text {m }}$ Product Only
$\checkmark$ Xeon additions to AVX-512 F

## AVX512 big picture

AVX512F

- 'Foundation' of architecture, required for any AVX512 implementation
- Many D/Q/SP/DP promotions from AVX2 with AAVX512 features
- Masking, 32 registers, embedded broadcast or rounding, 512-bit Vector Length
- New instructions added to accelerate HPC workloads
- Implementations add features to AVX512F "base"
- "base" will grow as MIC/Xeon converge on features

| AVX512CD | Conflict Detect : instructions tailored for vectorizing loops with potential address conflicts |
| :--- | :--- |
| AVX512ER | Exponential and Reciprocal : 'wide' approximateion of Log (22 bits) and RCP/RSQRT (28 bits) |
| AVX512PF | Prefetch : Multi-address prefetch instructions using gather/scatter semantics |
| AVX512DQ | Additional D/Q/SP/DP instructions (converts, transcendental support, etc) |
| AVX512BW | 512-bit Byte/Word support (promotions from AVX2, some additions) |
| AVX512VL | Vector Length Orthogonality : ability to operate on sub-512 vector sizes |

## Xeon \& Xeon Phi ${ }^{\text {TM }}$ New ISA: What Is Where?



Complex \& versatile big cores

- Big focus on latency and single-thread
- State-of-the-art SIMD support for HPC and Enterprise
- Best balance of performance for any workload
(intel inside
Xeon'Phi
Small \& efficient cores
- Big focus on throughput and many-threads
- State-of-the-art SIMD support for HPC
-Industry performance-per-watt leadership
KNL and Xeon are the first "unification" platforms:

AVX512F is the common SIMD foundation for HPC software development
$\qquad$

## AVX-512 features (I): More \& Bigger Registers

AVX: VADDPS YMMO, YMM3, [mem]

- Up to 16 AVX registers
- 8 in 32-bit mode
- 256-bit width
- $8 \times$ FP32
- $\quad 4$ x FP64

AVX-512: VADDPS ZMM0, ZMM24, [mem]

- Up to 32 AVX registers
- 8 in 32-bit mode
- 512-bit width
- $\quad 16$ x FP32
- $8 \times$ FP64
float32 A[N], B[N];
for(i=0; i<8; i++)
\{

$$
A[i]=A[i]+B[i] ;
$$

\}
float32 A[N], B[N];

```
for(i=0; i<16; i++)
{
    A[i] = A[i] + B[i];
```


## AVX-512 Mask Registers

8 Mask registers of size 64-bits

- k1-k7 can be used for predication
- kO can be used as a destination or source for mask manipulation operations

4 different mask granularities.
For instance, at 512b:

- Packed Integer Byte use mask bits [63:0]
- VPADDB zmm1 \{k1\}, zmm2, zmm3
- Packed Integer Word use mask bits [31:0]
- VPADDW zmm1 \{k1\}, zmm2, zmm3
- Packed IEEE FP32 and Integer Dword use mask bits [15:0]
- VADDPS zmm1 \{k1\}, zmm2, zmm3
- Packed IEEE FP64 and Integer Qword use mask bits [7:0]
- VADDPD zmm1 \{k1\}, zmm2, zmm3


|  | Vector Length |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  |  | 128 | 256 | 512 |
|  |  | 16 | 32 | 64 |
|  | Word | 8 | 16 | 32 |
|  | Dw ord/SP | 4 | 8 | 16 |
|  | Qw ord/DP | 2 | 4 | 8 |

## AVX-512 Features (II): Masking

## VADDPS ZMMO \{k1\}, ZMM3, [mem]

- Mask bits used to:

1. Suppress individual elements read from memory

- hence not signaling any memory fault

2. Avoid actual independent operations within an instruction happening

- and hence not signaling any FP fault

3. Avoid the individual destination elements being updated,

- or alternatively, force them to zero (zeroing)

```
for (I in vector length)
{
    if (no_masking or mask[I]) {
        dest[I] = OP(src2, src3)
    } else {
        if (zeroing_masking)
            dest[I] = 0
        else
            // dest[I] is preserved
    }
}
```

Caveat: vector shuffles do not suppress memory fault exceptions mask refers to "output" not to "input"

## Why True Masking?

## Memory fault suppression

- Vectorize code without touching memory that the correspondent scalar code would not touch
- Typical examples are if-conditional statements or loop remainders
- AVX is forced to use VMASKMOV* (risc)

MXCSR flag updates and fault handlers

- Avoid spurious floating-point exceptions without having to inject neutral data

Zeroing/merging

- Use zeroing to avoid false dependencies in OOO architecture
- Use merging to avoid extra blends in if-then-else clauses (predication) for great code density

```
float32 A[N], B[N], C[N];
for(i=0; i<16; i++)
{
    if(B[i] != 0) {
        A[i] =A[i] / B[i];
    else{
        A[i] = A[i] / C[i];
    }
}
VMOVUPS zmm2, A
VCMPPS k1, zmm0, B
VDIVPS zmm1 {k1}{z}, zmm2, B
KNOT k2, k1
VDIVPS zmm1 {k2}, zmm2, C
VMOVUPS A, zmm1
```


## Embedded Broadcasts and Masking Support

 VFMADD231PS zmm1, zmm2, C \{1to16\}- Scalars from memory are first class citizens
- Broadcast one scalar from memory into all vector elements before operation
- Memory fault suppression avoids fetching the scalar if no mask bit is set to 1

Other "tuples" supported

- Memory only touched if at least one consumer lane needs the data
- For instance, when broadcast a tuple of 4 elements, the semantics check for every element being really

VBROADCASTSS zmm1 \{k1\}, [rax] VBROADCASTF64X2 zmm2 \{k1\}, [rax] VBROADCASTF32X4 zmm3 \{k1\}, [rax] VBROADCASTF32X8 zmm4, \{k1\}, [rax] used

- E.g.: element 1 checks for mask bits 1, 5, 9, 13, $\ldots$

```
float32 A[N], B[N], C;
for(i=0; i<8; i++)
{
    if(A[i]!=0.0)
        A[i] =A[i] + C* B[i];
}
```


## AVX-512 Features: Embedded Rounding Control \& SAE (Suppress All Exceptions)

- MXCSR.RC can be overridden on a per instruction basis (Embedded Rounding Control )
- VADDPS ZMM1 \{k1\}, ZMM2, ZMM3 \{rne-sae\}
- VADDSS XMM1 \{k1\}, XMM2, XMM3 \{rrtz-sae\}
- "Suspend All Exceptions" (always implied by using Embedded Rounding Control)

NO MXCSR updates / exception reporting for any element

Expected usage of this feature

## Restricted to : FP instructions 512-bit or scalar Reg-reg operands

- Library codes can control effect of rounding and updates to MXCSR until the end stages of complex SW routines
- E.g.: avoid spurious overflow/underflow reporting in intermediate computations
- E.g: make sure that RM=rne regardless of the contents of MXCSR
- Saving, modifying and restoring MXCSR is generally slower and more and cumbersome
- Must use LDMXCSR to change fault masks, clear sticky bits or set a default rounding mode
- Do not need to use MXCSR OR embedded rounding for truncating FP conversion to int (use CVTT* instructions)


# AVX-512 F: <br> Common Xeon Phi (KNL) and Xeon Vector ISA Extension 

AVX-512 Foundation is the common SIMD foundation
for HPC software development
First on KNL
Planned on a future Xeon

## AVX-512 F Designed for HPC

- Promotions of many AVX and AVX2 instructions to AVX-512

32-bit and 64-bit floating-point instructions from AVX
Scalar and 512-bit
32-bit and 64-bit integer instructions from AVX2

- Many new instructions to speedup HPC workloads



## Quadword Integer Arithmetic

## Long int and packed pointer manipulation

64-bit integer trending towards becoming a first class citizen
Removes the need for expensive SW emulation sequences
Note: VPMULQ and int64 <-> FP converts not in AVX-512 F

| Instruction | Description |
| :--- | :--- |
| VPADDQ zmm1 \{k1\}, zmm2, zmm3 | INT64 addition |
| VPSUBQ zmm1 \{k1\}, zmm2, zmm3 | INT64 subtraction |
| VP\{SRA, SRL, SLL\}Q zmm1 \{k1\}, zmm2, imm8 | INT64 shift (imm8) |
| VP\{SRA, SRL, SLL\}VQ zmm1 \{k1\}, zmm2, zmm3 | INT64 shift (variable) |
| VP\{MAX,MIN\}Q zmm1 \{k1\}, zmm2, zmm3 | INT64 max, min |
| VP\{MAX,MIN\}UQ zmm1 \{k1\}, zmm2, zmm3 | UINT64 max, min |
| VPABSQ zmm1 \{k1\}, zmm2, zmm3 | INT64 absolute value |
| VPMUL\{DQ,UDQ zmm1 \{k1\}, zmm2, zmm3 | $32 \times 32=64$ integer multiply |

## Math Support

# Package to aid with Math library writing <br> - Good value upside in financial applications <br> - Available in PS, PD, SS and SD data types 

| Instruction |
| :---: |
| VGETXEXP ${ }_{\text {fPS,PD,SS, }}$ Sb\} |
| VGETMANT ${ }_{\text {(PS, P, P,SS,SD }}$ |
|  |
| $\mathrm{VSCALEF}_{\text {\{PS, PD, SS,SD\} }}$ |
|  |
|  |
| VRSQRT14 ${ }_{\text {[PS,PD,SS, SD] }}$ |
| $\mathrm{VDIV}_{\text {[PF,PD,Ss, SD }}$ |
| $\mathrm{VSQRT}_{\text {[PS,PD,SS,SD\} }}$ |


| zmm1 \{k1\}, zmm2 | Obtain exponent in FP format |
| :--- | :--- |
| zmm1 \{k1\}, zmm2 | Obtain normalized mantissa |
| zmm1 \{k1\}, zmm2, imm8 | Round to scaled integral number |
| zmm1 \{k1\}, zmm2, zmm3 | $\mathrm{X}^{*} 2^{\mathrm{y}}, \mathrm{X}<==$ getmant, Y <= getexp |
| zmm1, zmm2, zmm3, imm8 | Patch output numbers based on inputs |
| zmm1 \{k1\}, zmm2 | Approx. reciprocal() with rel. error $2^{-14}$ |
| zmm1 \{k1\}, zmm2 | Approx. rsqrt() with rel. error $2^{-14}$ |
| zmm1 \{k1\}, zmm2, zmm3 | IEEE division |
| zmm1 \{k1\}, zmm2 | IEEE square root |

## New 2-Source Shuffles

## 2-Src Shuffles

VSHUF\{PS,PD\}
VPUNPCK\{H,L\}\{DQ,QDQ\}
VUNPCK\{H,L\}\{PS,PD\}
VPERM\{I,D\}2\{D,Q,PS,PD\}
VSHUF\{F,I\}32X4

Long standing customer request
-16/32-entry table lookup (transcendental support)

- AOS $\Leftrightarrow$ SOA support, matrix transpose
- Variable VALIGN emulation



## Gather \& Scatter

D/Q/SP/DP element types

```
for(j=0,i=0; i<N; i++)
{
        B[R[i]] = A[Q[i]];
```

\}

VMOVDQU64 zmm1, Q[rsi]
VMOVDQU64 zmm2, R[rsi]
VGATHERQQ zmm0 \{k2\}, [rax+zmm1*8]
VSCATTERQQ [rax+zmm2*8] \{k3\}, zmm0


## Expand \& Compress

Allows vectorization of conditional loops

- Opposite operation (compress) in AVX512F
- Similar to FORTRAN pack/unpack intrinsics
- Provides mem fault suppression
- Faster than alternative gather/scatter


## VEXPANDPS zmm0 \{k2\}, [rax]

```
for(j=0,i=0;i<N; i++)
{
    if(C[i] != 0.0)
    {
        B[i] =A[i] * C[j++];
    }
}
```

Moves compressed (consecutive) elements in register or memory to sparse elements in register (controlled by mask), with merging prax] zeroing


## Bit Manipulation

## Basic bit manipulation operations on mask and vector operands

- Useful to manipulate mask registers
- Have uses in cryptography algorithms

| Instruction | Description |
| :--- | :--- |
| KUNPCKBW k1, k2, k3 | Interleave bytes in k2 and k3 |
| KSHIFT\{L,R\}W k1, k2, imm8 | Shift bits left/right using imm8 |
| VPROR\{D,Q\} zmm1 \{k1\}, zmm2, imm8 | Rotate bits right using imm8 |
| VPROL\{D,Q\} zmm1 \{k1\}, zmm2, imm8 | Rotate bits left using imm8 |
| VPRORV\{D,Q\} zmm1 \{k1\}, zmm2, zmm3/mem | Rotate bits right w/ variable ctrl |
| VPROLV\{D,Q\} zmm1 \{k1\}, zmm2, zmm3/mem | Rotate bits left w/ variable ctrl |

## VPTERNLOG - Ternary Logic Instruction

- Take every bit of three sources to obtain a 3-bit index N
- Obtain Nth bit from imm8

VPTERNLOGD zmm0 \{k2\}, zmm15, zmm3/[rax], imm8

Any arbitrary truth table of 3 values can be implemented andor, andxor, vote, parity, bitwise-cmov, etc each column in the right table corresponds to imm8

| S1 | S2 | S3 |
| :--- | :--- | :--- |
| 0 | 0 | 0 |
| 0 | 0 | 1 |
| 0 | 1 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 0 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
| 1 | 1 | 1 |$\quad$| ANDOR | VOTE (S1)?S3:S2 |  |
| :---: | :---: | :---: |
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 0 | 0 | 0 |
| 1 | 1 | 1 |
| 0 | 0 | 0 |
| 1 | 1 | 0 |
| 1 | 1 | 1 |
| 1 | 1 | 1 |



AVX-512 ERI \& AVX-512 PRI: Xeon Phi Only

## Xeon Phi Only Instructions

Set of segment-specific instruction extensions

- First appear on KNL
- Will be supported in all future Xeon Phi processors
- May or may not show up on a later Xeon processor


## Address two HPC customer requests

- Ability to maximize memory bandwidth
- Hardware prefetching is too restrictive
- Conventional software prefetching results in instructions overhead
- Competitive support for transcendental sequences
- Mostly division and square root
- Differentiating factor in HPC/TPT


## KNL AVX512 additions

| CPUID | Instructions | Motivation |
| :---: | :---: | :---: |
| $\begin{aligned} & \bar{\alpha} \\ & \stackrel{N}{N} \\ & \stackrel{N}{N} \\ & \dot{x} \\ & \underset{\gtrless}{2} \end{aligned}$ | PREFETCHWT1 | Reduce ring traffic in core-to-core data communication |
|  | VGATHERPF\{D,Q\}\{0,1\}PS | Reduce overhead of software prefetching: dedicate side engine to prefetch sparse structures while devoting the main CPU to pure raw flops |
|  | VSCATTERPF\{D,Q\}\{0,1\}PS |  |
|  | VEXP2\{PS,PD\} | Speed-up key FSI workloads: Black-Scholes, Montecarlo |
|  | VRCP28\{PS,PD\} | Key building block to speed up most transcendental sequences (in particular, division and square root): <br> Increasing precision from 14=>28 allows to reduce one complete Newton-Raphson iteration |
|  | VRSQRT28\{PS,PD\} |  |

## Summary of AVX512 on KNL

AVX-512 F: new 512-bit vector ISA extension

- Common between Xeon and Xeon Phi (KNL)


## AVX-512 CDI Conflict detection instructions

- Improves autovectorization of Histogram data patterns
- On Xeon Phi first

AVX-512 ERI \& PRI

- 28-bit transcendentals and new prefetch instructions
- On Xeon Phi only


## Xeon additions to AVX512F

## AVX512DQ

Complete Qword support

- VPMULLQ packed $64 \times 64 \rightarrow 64$
- Packed/Scalar converts of signed/unsigned to SP/DP
- Arithmatic shift right
- Etc

Extend mask architecture to word and byte

- Byte masks are natural for packed Qword operands

Minor additions to transcendental support
Convert AVX512 mask $\leftarrow \rightarrow$ 'SSE/AVX' mask
'aggregate datatype' support

- Broadcast/insert/extract complex singles etc


## AVX512DQ : additional HPC focus



Tuple support: 32X8, 64X2, 32X2
Int64 $\Leftrightarrow$ FP conversions
Both unsigned and signed
Transcendental package v2

INT64 arithmetic support

Byte support for mask instructions

Expanded mask functionality

## AVX512BW

Full support for Byte/Word operations

- MMX/SSE2/AVX2 re-promoted to AVX512 semantics

Mask operations extended to 32/64 bits

- 32-bit mask refers to AVX512 'short' operands
- 64-bit mask refers to AVX512 byte operands

Loads/Stores/Broadcastsfor AVX512 semantics
Permute architecture extended to words

- Vpermw, vpermi2w, vpermt2w

New PSAD instruction,etc

## AVX512BW : Byte and Word Support

| AV512BW | AV512BW |
| :---: | :---: |
| VPBROADCAST $\{$ B,W\} | KTEST $\{\mathrm{D}, \mathrm{Q}$ \} |
| VPSRLDQ, VPSLLDQ | KSHIFT ${ }^{\text {L }}$,R\}\{D,Q\} |
| VP\{SRL,SRA,SLL\} ${ }^{\text {VV }}$ W | KUNPACK\{WD,DQ\} |
| VPMOV\{WB,SWB, USWB\} | KADD $\{\mathrm{D}, \mathrm{Q}\}$ |
| VPTESTM $\{\mathrm{B}, \mathrm{W}\}$ | VPMOV\{B2M,W2M,M2B,M2W\} |
| VPMADW | VPCMP\{,EQ,GT\}\{B,W,UB,UW\} |
| VPTESTNM\{B,W\} | $\mathrm{VP}\{\mathrm{ABS}, \mathrm{AVG}\}$ [B,W\} |
| VDBPSADBW | VP\{ADD,SUB\}\{,S,US\}\{B,W\} |
| VPERMW, VPERM $\{1, T\} 2 \mathrm{~W}$ | VPALIGNR |
| VMOVDQU\{8,16\} | VP\{EXTR,INSR\}\{B,W\} |
| VPBLENDM\{B,W\} | VPMADD\{UBSW,WD\} |
| \{KAND,KANDN\} ${ }^{\text {d, Q }}$ \} | VP\{MAX,MIN\}\{S,U\}\{B,W\} |
| \{KOR,KXNOR,KXOR\}\{D,Q\} | VPMOV\{SX,ZX\}BW |
| KNOT\{D,Q\} | VPMUL\{HRS,H,L\}W |
| KORTEST\{D,Q\} | VPSADBW |


| AV512 BW |
| :--- |
| VPSHUFB, VPSHUF\{H,L\}W |
| VP\{SRA,SRL,SLL\}\{,V\}\{W\} |
| VPUNPCK\{H,L\}\{BW,WD\} |

## AVX512VL : Vector Length Orthogonality

Allow AVX512 instructions to operate on sub-vectors (lower 256/128 bits)

- Eases code generation for mixed data types
- Partial masks are functionally correct, why not use them?
- VL is in static in opcode, provides information EARLY in pipeline
- Clock gating of unneeded execution elements / buses
- Disabling RF read ports
- Preventing 'false overlap/forwarding' from being detected in memory
- Creating partial masks wastes instruction BW

AVX512VL is NOT a "list of instructions"

- "orthogonal feature' applying to "all" AVX512 instructions
- obvious caveats when instruction has implicit 256/512 width


## AVX512VL: Down-promotions



## Summary of Xeon AVX512 Additions

More Qword support

- Packed converts, VPMULLQ etc

Support for mixing AVX and AVX512 style masks

- VPMOVM2*, VPMOV*2M

All HLL datatypes at maximum SIMD width

- \# elements = VL / element_size

VL aids mixing datatypes

- VL = \# elements * element_size

VL specifies memory access sizes exactly

- Masks provide architectural support, but HW prefers a 'static’ knowledge


## Summary

AVX512 is a comprehensive addition to intels SIMD Instruction set
~2x performance on BLAS routines
new features to increase the vectorization coverage (masks, VPCOMRESS)
embedded rounding and new instructions accelerate math libraries
Knights Landing emphasis on HPC
support for D/Q/SP/DP and additional specialized instructions
Xeon adds support for all HLL datatypes
AVX512 is designed for compilers as well as programmers

## PUTTINGEXPLLCITVECTOR PROGRAMMINGTOWORKFORINTEL XEONANDXEON PHIARGHITECTURES

Xinmin Tian - Senior Principal Engineer Intel Compiler and Languages, SSG, Intel Corporation September 11, 2016

## Section II: Agenda

Seamless Vectorization and Parallelization Integration Showcase
Learnings: Cray*, Intel ${ }^{\oplus}$ Pentium ${ }^{\circledR} 4$ (90nm) SSE3 and SIMD Vectorization Hurdles
Successes: Putting SIMD Vectorization to Work
$\checkmark$ Mixed data type Vectorization
$\checkmark$ Function vectorization
$\checkmark$ Outer loop vectorization
$\checkmark$ SIMD Loop Vectorization with Cross-iteration Dependency
$\checkmark$ Less-than-full-vector Vectorization
$\checkmark$ Predication and Masking
$\checkmark$ Gather/Scatter Optimization
Advances: Tackle C++ Challenges and Beyond C/C++/Fortran
Summary: Close to Metal Performance

## Mandelbrot: ~2698x Speedup on Xeon Phi'"--Isn't it Cool?



## Learnings: Compiler Vectorization in 1978

The CRAY-1's Fortran compiler (CFT) is designed to give the scientific user immediate access to the benefits of the CRAY-1's vector processing architecture. An optimizing compiller, CFT, "vectorizes" innermost 100 loops. Compatible with the ANSI 1966 Fortran Standard and with many commonly supported Fortran extensions, cirt does not require any source program modifications or the use of additional nonstandard Fortran statements to achieve vectorization. Thus the user's investment of hundreds of man months of effort to develop Fortran programs for other contemporary computers is protected.

C CACM 1978
$C^{* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * ~}$
$c^{* * *}$ KERNEL 1 HYDRO FRAGMENT

C
cdir\$ ivdep
1001 DO $1 \mathrm{k}=1, \mathrm{n}$
$1 \quad \mathrm{X}(\mathrm{k})=\mathrm{Q}+\mathrm{Y}(\mathrm{k}) *(\mathrm{R} * \mathrm{ZX}(\mathrm{k}+10)+\mathrm{T} * \mathrm{ZX}(\mathrm{k}+11))$
C

Compiler vectorization "solved" in 1978

## Learnings: 2004 Intel ${ }^{\circledR}$ Pentium ${ }^{\circledR} 4$ SSE3 on 90nm

Complex Multiplication with SSE3: $(a+i b)(c+i d)=(a c-b d)+i(a d+b c)$


Memory
Ops:
3 SIMD loads
1 SIMD store
3 Arithmetic Ops
1 shuffle Ops

Ops not available in SSE2
movddup
addsubpd

Memory
4 Scalar loads, 6 Arithmetic Ops, 2 Scalar stores
Performance can be improved up to ~75\%, SPEC2000FP/168.wupwise 10-15\%

## Learnings: Program Factors Impact on Vectorization

Loop-carried dependencies

```
DO I = 2, N
    A(I) = A(I-1) + B(I)
ENDDO
```

Function calls

```
for (i = 1; i < nx; i++) {
    x = x0 + i * h;
    sumx = sumx + func(x, y, xp);
}
```

Pointer aliasing

```
```

for(j = 0; j <= MAX; j++) {

```
```

for(j = 0; j <= MAX; j++) {

```
for(i = 0; i <= MAX; i++) {
```

for(i = 0; i <= MAX; i++) {

```
for(i = 0; i <= MAX; i++) {
        D[i][j] += 1;
        D[i][j] += 1;
        D[i][j] += 1;
    }
    }
    }
}
```

```
```

}

```
```

```
}
```

```
```

```
```

struct _x { int d; int bound; };

```
```

struct _x { int d; int bound; };
void doit(int *a, struct _x *x)
void doit(int *a, struct _x *x)
{
{
for(int i = 0; i < x->bound; i++)
for(int i = 0; i < x->bound; i++)
a[i] = 0;
a[i] = 0;
}

```
```

}

```
```

Indirect memory access

Outer loops
Unknown loop iteration count

```
for (i=0; i<N; i++)
```

for (i=0; i<N; i++)
A[B[i]] += C[i]*D[i]

```
    A[B[i]] += C[i]*D[i]
```

```
void scale(int *a, int *b)
```

void scale(int *a, int *b)
{
{
for (int i = 0; i < 1000; i++)
for (int i = 0; i < 1000; i++)
b[i] = z * a[i];
b[i] = z * a[i];
}

```
}
```


## Learnings: SIMD Vectorization Hurdles

```
#pragma omp simd reduction(+:....)
for(p=0; p<N; p++) {
    // Blue work
    if(...) {
        // Green work
    } else {
        // Red work
    }
    while(...) {
        // Purple work
    }
    y = foo (x);
    Pink work
}
```

Two fundamental problems:
$\checkmark$ Data divergence
$\checkmark$ Control divergence


Vector code generation has become a more difficult problem increasing need for user guided explicit vectorization that maps concurrent execution to simd hardware

# Successes: Putting SIMD <br> Vectorization to Work 

Intel brings ICC Vectorization
Technology to LLVM
Vectorizer


## Mixed Data Type Vectorization



## Function Vectorization

| \#pragma omp declare simd <br> float sfoo(float $x$ ) <br> \{... ... | Compile | created | $\begin{aligned} & \ldots m 128 \text { vfoc } \\ & \{\ldots . . \\ & \} \end{aligned}$ | _m128 vx) |
| :---: | :---: | :---: | :---: | :---: |
| \} |  |  | Vector C function |  |
| Scalar C function |  |  |  |  |
| $\begin{aligned} & \text { sfoo(x0)->r0 } \\ & \text { sfoo }(x 1)->r 1 \\ & \text { sfoo(x2)->r2 } \\ & \text { sfoo }(x 3)->r 3 \\ & \text { sfoo( } x 4)->r 4 \end{aligned}$ | sfoo(x0)->rc | sfoo(x1)->r1 | sfoo(x2)->r2 | sfoo(x3)->r3 |
|  | sfoo(x4)->r | sfoo(x5)->r5 | sfoo(x6)->r6 | sfoo(x7)->r7 |
|  | sfoo(x8)->r8 | sfoo(x9)->r9 | ... | ... |
|  |  |  |  |  |
| ... | vfoo(x0...x3)->r0...r3 |  |  |  |
| Scalar execution | vfoo(X4...X7)->r4...r7 |  | Vector execution |  |

```
Recursive Function Vectorization
#pragma omp declare simd [processor(cpu-id)]
int binsearch(int key, int lo, int hi) {
    int ans;
    if ( lo > hi) {
        ans = -1;
    }
    else {
        int mid = lo + ((hi - lo) >> 1);
        int t = sortedarr[mid];
        if (key == t) {
                ans = mid;
        }
        else if ( key > t)
        ans = binsearch(key, mid + 1, hi);
    }
    else {
        ans = binsearch(key, lo, mid - 1);
        }
    }
    return ans;
}
#pragma omp simd
for (int i=0; i<M; i++) {
    ans[i] = binsearch(keys[i], 0, N-1);
}
```


## OpenMP* SIMD PROCESSOR Clause

New PROCESSOR clause extension to \#pragma omp declare simd (to define a SIMD routine) to target a specific processor

- Similar to Intel ${ }^{\oplus}$ Cilk ${ }^{\text {TM }}$ Plus extensions for declaring SIMD functions
- Available for C/C++ and Fortran
- Intel extension - NOT part of official OpenMP specification
- Helpful to allow programmers to leverage e.g. Intel ${ }^{\oplus}$ AVX-2 and Intel ${ }^{\circledR}$ AVX512 beyond default Intel ${ }^{\oplus}$ SSE2 support ( YMM+ZMM registers/operands additionally to XMM )


## Processor Name Identifiers

$\checkmark$ pentium_4
$\checkmark$ pentium_m
$\checkmark$ pentium_4_sse3
$\checkmark$ core_2_duo_ssse3
$\checkmark$ core_2_duo_sse4_1
$\checkmark$ atom
$\checkmark$ core_i7_sse4_2
$\checkmark$ core_aes_pclmulqdq
$\checkmark$ core_2nd_gen_avx
$\checkmark$ core_3rd_gen_avx

| $\checkmark$ | future_cpu_18 | // KNF |
| :--- | :--- | :--- |
| $\checkmark$ mic |  |  |
| $\checkmark$ future_cpu_19 | // KNC |  |
| $\checkmark$ future_cpu_20 | // HSW - no TSX |  |
| $\checkmark$ core_4th_gen_avx | // HSW -no TSX |  |
| $\checkmark$ core_4th_gen_avx_tsx | // HSW - TSX |  |
| $\checkmark$ future_cpu_21 | // BDW - NO TSX |  |
| $\checkmark$ future_cpu_21_tsx | // BDW - TSX |  |
| $\checkmark$ future_cu_22 | // KNL |  |
| $\checkmark$ future_cpu_23 | // SKL |  |

## Vortex Code: Outer Loop Vectorization <br> \#pragma omp simd // simd pragma for outer-loop at call-site of SIMD-function

```
for (int i = beg*16; i < end*16; ++i) {
    particleVelocity_block(px[i], py[i], pz[i], destvx + i, destvy + i, destvz + i, vel_block_start, vel_block_end);
}
```

\#pragama omp declare simd linear(velx,vely,velz) uniform(start,end) aligned(velx:64, vely:64, velz:64)
static void particleVelocity_block(const float posx, const float posy, const float posz,
float *velx, float *vely, float *velz, int start, int end) \{
for (int j = start; j < end; ++j) \{
const float del_p_x = posx - px[j];
const float del_p_y = posy - py[j];
const float del_p_z = posz - pz[j];
const float dxn= del_p_x * del_p_x + del_p_y * del_p_y + del_p_z * del_p_z +pa[j]* pa[j];
const float dxctaui = del_p_y *tz[j] - ty[j] * del_p_z;
const float dyctaui $=$ del_p_z*tx[j] - tz[j] * del_p_x;
const float dzctaui = del_p_x * ty[j] - tx[j] * del_p_y;
const float dst = 1.0f/std:::sqrt(dxn);
const float dst3 = dst*dst*dst;
*velx -= dxctaui * dst3;
*vely -= dyctaui * dst3;
*velz -= dzctaui *dst3;
\}
\}

## SIMD Loops with Cross-Iteration Dependencies

OpenMP* 4.5: Extend ordered Blocks in SIMD Contexts

```
C and C++:
#pragma omp ordered [simd]
    structured code block
```

Fortran:

```
!$omp ordered [simd]
    structured code block
!$omp end ordered
```


## Semantics:

- The ordered with simd clause construct specifies a structured block in the simd loop or SIMD function that will be executed in the order of the loop iterations w.r.t to dependency constraints or sequence of call to SIMD functions.


## Rules:

- \#pragma omp ordered simd is only allowed inside a SIMD loop or SIMD-enabled function.
- \#pragma omp ordered simd region must be a single-entry and single-exit code block
- The strict ordered execution is only guaranteed for the block itself
$\checkmark$ Execution remains weakly ordered w.r.t. to outside of the block or other ordered blocks
$\checkmark$ Data dependencies between statements of the same block will be correctly resolved
$\checkmark$ Other non-vector dependencies originating in ordered block still lead to undefined behavior


## Ordered SIMD Examples

OK:

```
```

\#pragma omp simd

```
```

\#pragma omp simd
for (i = 0; i < N; i++)
for (i = 0; i < N; i++)
{
{
..
..
\#pragma omp ordered simd
\#pragma omp ordered simd
{
{
a[indices[i]] += b[i]; // index conflict
a[indices[i]] += b[i]; // index conflict
}
}
\#pragma omp ordered simd
\#pragma omp ordered simd
{
{
if (c[i] > 0)
if (c[i] > 0)
q[j++] = b[i]; // compress pattern
q[j++] = b[i]; // compress pattern
}
}
...
...
\#pragma omp ordered simd
\#pragma omp ordered simd
{
{
lock(L) // atomic update
lock(L) // atomic update
if (x > 10) x = 0;
if (x > 10) x = 0;
unlock(L)
unlock(L)
}
}
,..

```
```

,..

```
```


## Not OK w.r.t serial:

```
#pragma omp simd
```

\#pragma omp simd
for (i = 0; i < N; i++) {
for (i = 0; i < N; i++) {
...
...
\#pragma omp ordered simd
\#pragma omp ordered simd
{
{
if (c[i] > 0)
if (c[i] > 0)
q[j++] = b[i]; // 1 'st compress
q[j++] = b[i]; // 1 'st compress
}
}
\#pragma omp ordered simd
\#pragma omp ordered simd
{
{
if (c[i] > 0) // order of stores will
if (c[i] > 0) // order of stores will
q[j++] = d[i]; // be changed w.r.t
q[j++] = d[i]; // be changed w.r.t
q[j++] // to serial execution
q[j++] // to serial execution
}
}
}

```
}
```


## ORDERED SIMD not always best Approach

```
#pragma omp simd
for(int i=0; i < VL; i++) {
    val = values[i];
    grp = groups[i];
#pragma omp ordered simd // conflict
    { g_total[grp] += val; }
```



```
#pragma omp simd reduction(+:g_total)
for(int i=0; i < VL; i++) {
    val = values[i];
    grp = groups[i];
    g total[grp] += val;
}
```



Solution: array reductions


## Less-Than-Full-Vector Vectorization

```
float foo(float *y, int n)
{ int k; float x = 10.0f;
    #pragma omp simd
    for (k = 0; k < n; k++) {
    x = x + fsqrt(y[k])
    }
    return x
}
```

```
misalign = &y[0] & 63
peeledTripCount =(63-misalign)/sizeof(float)
x = 10.0f;
do k0 = 0, peeledTripCount-1 // peeling loop
    x = x + fsqrt(y[k0])
enddo
x1_v512 = (m512)0
x2_v512 = (m512)0
mainTripCount = n - (( }\textrm{n}-\mathrm{ peeledTripCount) & 31)
do k1 = peeledTripCount, mainTripCount-1, 32
    x1_v512 = _mm512_add_ps(_mm512_fsqrt(y[k1:16]),x1_v512)
    x2_v512 = _mm512_add_ps(_mm512_fsqrt(y[k1+16:16]), x2_v512)
enddo
// perform vector add on two vector x1_v512 and x2_v512
x1_v512 = _mm512_add_ps(x1_v512, x2_512);
// perform horizontal add on all elements of x1_v512, and
// the add }\textrm{x}\mathrm{ for using its value in the remainder loop
x = x + _mm512_hadd_ps(x1_512)
do k2 = mainTripCount, n // Remainder loop
    x = x + fsqrt(y[k2])
```

enddo

## Less-Than-Full-Vector Vectorization

```
misalign \(=\& y[0] \& 63\)
peeledTripCount \(=(63-\) misalign \() /\) sizeof(float \()\)
x = 10.0f;
// create a vector: <0,1,2,...15>
k0_v512 = _mm512_series_pi(0, 1, 16)
// create vector: all 16 elements are peeledTripCount peeledTripCount_v512 =
_mm512_broadcast_pi32(peeledTripCount)
\(x 1 \_v 512=(m 512) 0\)
x2_v512 = (m512)0
do \(\mathrm{k} 0=0\), peeledTripCount-1, 16
// generate mask for vectorizing peeling loop
mask = _mm512_compare_pi32_mask_lt(k0_v512, peeledTriPCount_v512)
x1_v512 = _mm512_add_ps_mask(
_mm512_fsqrt(y[k0:16]), x1_v512, mask)
enddo
```

mainTripcount $=n-((n-$ peeledTripCount $) \& 31)$
do k1 = peeledTripCount, mainTripCount-1, 32
x1_v512 = _mm512_add_ps( _mm512_fsqrt(y[k1:16]), x1_v512)
x2_v512 = _mm512_add_ps( _mm512_fsqrt(y[k1+16:16]), x2_v512)
enddo
// create a vector: <mainTripCount, mainTripCount+1 ... mainTripCount+15> k2_v512 = _mm512_series_pi(mainTripCount, 1, 16)
// create a vector: all 16 elements has the same value $n$
n_v512 = _mm512_broadcast_pi32(n)
step_v512 = _mm512_broadcast_pi32(16)
do k 2 = mainTripCount, $\mathrm{n}, 16$ // vectorized remainder loop
mask = _mm512_compare_pi32_mask_lt(k2_v512, n_v512)
x1_v512 = _mm512_add_ps_mask( _mm512_fsqrt(y[k2:16]), x1_v512, mask)
k2_v512 = _mm512_add_ps(k2_v512, step_v512)
enddo
x1_v512 = _mm512_add_ps(x1_v512, x2_512);
// perform horizontal add on 8 elements and final reduction sum to write
// the result back to x .

## Predication and Masking

No conditional execution inside SIMD vector loop \#pragma omp simd
$\checkmark$ Except special vector conditions
$\checkmark$ Conditions transformed to vector predicates
$\checkmark$ Control flow is flattened
Predicated execution
$\checkmark$ Masking
$\checkmark$ Blending with speculation
Masking on loads/stores
$\checkmark$ Available on AVX512

```
for (i=0; i<n; i++) {
    if (A[i]>0) {
        B[i] = C[i] + 20;
    }
    else {
        B[i] = C[i] + 100;
    }
}
```

Before vectorizer

$$
B[i: i+3]\{M\}=C[i: i+3]\{M\}+\{M\} 20 ;
$$

$\checkmark$ Limited availability in Intel ${ }^{\circledR}$ MIC Arch, AVX2, and AVX

$$
M=(A[i: i+3]>0) ;
$$

$$
B[i: i+3]\{\sim M\}=C[i: i+3]\{\sim M\}+\{\sim M\} 100 ;
$$

$\checkmark$ Emulated on SSE
Blending
$\checkmark$ Safety to be proved


$$
B[i: i+3]=(C[i: i+3]+20) \& M \mid(C[i: i+3]+100) \& \sim M ;
$$

## Avoid Branchy Code $\rightarrow$ Improve SIMD Vector Efficiency

$\checkmark$ Only reverse conditions

## Neighboring Gather/Scatter Optimization

```
for (int i = 0; i < size; ++i) {
    #pragma omp simd
    for (int j = i + 1; j < size; ++j) {
        pij = pi - data[5 * j];
        qij = qi - data[5 * j + 1];
        rij = ri - data[5 * j + 2];
```



2x kernel speed-up on HSW

## Optimization of neighboring Gathers

- Complete support for unmasked strided (d[i]) and indexed (d[ind[i]]) loads of 1, 2, 4, 8 and 16-byte elements for SSE2-AVX512
- Provides more effective CPU resources usage for cases with data locality
- May require additional source changes to enable the pattern recognition (e.g. restrict, base/index hoisting, loads grouping)
- Also reflected in optimization report:

```
remark #34030: adjacent sparse (strided) loads
```

remark \#34030: adjacent sparse (strided) loads
optimized for speed.
optimized for speed.
Details: stride { 12 }, types { F32-V512, F32-
Details: stride { 12 }, types { F32-V512, F32-
V512, F32-V512 }, number of elements { 16 },
V512, F32-V512 }, number of elements { 16 },
select mask { 0x000000007 }.

```
select mask { 0x000000007 }.
```

```
struct {
    TYPE f0;
    TYPE fN;
} d[];
for (int i = 0; i < size; ++i)
    tmp += d[i].f0 + ... + d[i].fN;
Before (TYPE=FLOAT, N=2)
vgatherdps 0(%rcx,%zmm0,4), %zmm6{%k1}
vgatherdps 4(%rcx,%zmm0,4), %zmm7{%k2}
vgatherdps 8(%rcx,%zmm0,4), %zmm9{%k3}
```

Now (TYPE=FLOAT, N=2)

| vmovups | (\%rcx), \%zmm10 |
| :---: | :---: |
| vmovups | 64 (\%rcx), \%zmm9 |
| vmovups | 128 (\%rcx), \%zmm14 |
| vpermi2ps | \%zmm9, \%zmm10, \%zmm7 |
| vpermi2ps | \% $\mathbf{z m m} 9, \% \mathbf{z m m 1 0 , ~ \% \mathbf { m m }} 8$ |
| vpermt2ps | \%zmm9, \% zmm1, \% zmm10 |
| vpermi2ps | \%zmm7, \%zmm14, \%zmm11 |
| vpermi2ps | \%zmm8, \%zmm14, \%zmm12 |
| vpermt2ps | \%zmm10, \%zmm0, \%zmm14 |

## OpenMP SIMD Linear(ref/val/uval)

Rationale:

- For implicitly reference linear parameters it is nice to have reference as linear

OpenMP 4.5 syntax:

- Linearity specification for references vs. values
- linear(val(var):[step]) - the value is linear even if passed by reference
- If passed by reference the vector of references is passed
- linear(uval(var):[step]) - value passed by reference is linear
- The reference to the first lane is passed, other values constructed using step
- linear(ref(var):step) - for parameters passed by reference the underlying reference is linear
- Access will be sequential or strided depending on step
- Original linear(var:[step]) - the same as linear(val(var):[step])
! \$omp declare simd
REAL FUNCTION FOO (X, Y)
REAL, VALUE : : Y << by reference
REAL, VALUE : : X << by reference
$\mathrm{FOO}=\mathrm{X}+\mathrm{Y} \quad \ll$ gathers!!!!
END FUNCTION FOO
!omp\$ simd private (X,Y)
DO $\mathrm{I}=0$, N
$\mathbf{Y}=\mathrm{B}(\mathrm{I})$
$\mathbf{X}=\mathbf{A}(I)$
$C(I)+=F O O(X, Y)$
ENDDO

```
! $omp declare simd linear(ref(x),ref(y))
REAL FUNCTION FOO(X, Y)
REAL, VALUE :: Y << by reference
REAL, VALUE :: X << by reference
FOO = X + Y << sequential reads
END FUNCTION FOO
!omp$ simd private (X,Y)
DO I= 0, N
    Y = B(I)
    X = A(I)
    C(I) += FOO(X,Y)
ENDDO
```


## Linear(ref/val/uval) Examples

## Things to remember:

- linear(ref(x:[step])) matches to unit/non-unit stride arguments if step match
- linear(ref(x)) matches to private arguments: these are allocated sequentially
- linear(uval(x)) is preferred to linear(val(x)) for by-reference passed read-only linears: uval facilitates more efficient parameter passing. If both specified uval is matched

```
linear(ref):
#pragma omp declare simd linear(ref(p))
void add_one(int& p) { p += 1; }
int a[NN];
#prgma omp simd linear(p)
    add_one(*p); < <<< unit-stride load
    p++;
}
#prgma omp simd private(p)
for (i = O; i < NN; i++) {
    p = a[i];
    add_one(p); <<< private
    b[i] = p;
}
#prgma omp simd
for (i = 0; i < NN; i++) {
    add_one(i); <<< match, incorrect
}
```

```
Linear(val)/linear(uval):
interface
    real function func1(x, i)
! $omp declare simd(func1) uniform(x) linear(val(i):1)
            real(8), intent(inout) :: x(*)
            integer, intent(in), value :: i
    end function func1
    real function func2(x, i)
!$omp declare simd(func2) uniform(x) linear(uval(i):1)
            real(8), intent(inout) :: x(*)
            integer, intent(in), value :: i
    end function func2
end interface
!$omp simd linear(k:1)
    do i=1, n
        x(i) = func1(x, k) << k passed as vector of refs
        x(i) = func2(x,k) << k passed as single ref
            k = k + 1
    enddo
```


## SIMD Data Layout Template (SDLT) Library <br> \#include <stdio.h>

```
#include <stdio.h>
#include <iostream>
#define N }102
typedef struct RGBs {
    float r; float g; float b;
} RGBTy;
void main()
{
RGBTy a[N];
    #pragma omp simd
    for(int k=0; k<N; k++) {
        a[k].r = k*1.5; a[k].g = k*2.5; a[k].b = k*3.5;
    }
std::cout << "k =" << 10 <<
    ", a[k].r=" << a[10].r <<
    ",a[k].g=" << a[10].g <<
    ", a[k].b =" << a[10].b << std::endl;
}
#include <stdio.h>
```

\#include <sdlt/primitive.h>
\#include <sdlt/soa1d_container.h>
\#define N 1024
typedef struct RGBs \{
float r; float g; float b;

\} RGBTy;
b)
SDLT_PRIMITIVE(RGBTy, r, g, b)
void main()
\{
sdlt::soa1d_container<RGBTy> aContainer(N);
auto a = aContainer.access();
\#pragma omp simd
for(int k=0; k<N; k++) \{
$a[k] . r()=k^{*} 1.5 ; a[k] . g()=k^{*} 2.5 ; a[k] \cdot b()=k^{*} 3.5 ;$
\}
std::cout << "k =" << $10 \ll$
", a[k].r =" << a[10].r() <<
", a[k].g =" << a[10].g() <<
", a[k].b =" << a[10].b() << std::endl;

## SDLT: AOS vs. SOA

## AOS AVX512 ASM Code



## SOA / SDLT AVX512 ASM Code: scatter instructions are all gone

| B1.5: | \# Preds . . B1. 5 | B1. 4 |
| :---: | :---: | :---: |
| vpaddd | \%zmm4, \%zmm3, \%zmm12 | \#19.3 |
| vcvtdq2ps | \%zmm3, \%zmm7 | \#21.17 |
| vcvtdq2ps | \%zmm12, \%zmm10 | \#21.17 |
| vmulps | \%zmm7, \%zmm2, \%zmm5 | \#21.19 |
| vmulps | \%zmm7, \%zmm1, \%zmm6 | \#22.19 |
| vmulps | \%zmm7, \%zmm0, \%zmm8 | \#23.19 |
| vmulps | \%zmm10, \%zmm2, \%zmm3 | \#21.19 |
| vmulps | \%zmm10, \%zmm1, \%zmm9 | \#22.19 |
| vmulps | \%zmm10, \%zmm0, \%zmm11 | \#23.19 |
| vmovups | \%zmm5, (\%rsi,\%rcx,4) | \#21.15 |
| vmovups | \%zmm6, (\%rdx,\%rcx,4) | \#22.15 |
| vmovups | \%zmm8, (\%rax,\%rcx,4) | \#23.15 |
| vmovups | \%zmm3, 64 (\%rsi,\%rcx,4) | \#21.15 |
| vmovups | \%zmm9, 64 (\%rdx,\%rcx,4) | \#22.15 |
| vmovups | \%zmm11, 64 (\%rax, \%rcx,4) | \#23.15 |
| vpaddd | \%zmm4, \%zmm12, \%zmm3 | \#19.3 |
| addq | \$32, \%rcx | \#21.6 |
| cmpq | \$1024, \%rcx | \#21.6 |
| jb | ..B1.5 \# Prob 82\% | \#21. 6 |

## Programmer Friendly Optimization Report

Annotated source listing with compiler optimization reports


## Optimization Report Improvements

$\checkmark$ Significant improvement in variable names and memory references reporting
16.0: remark \#15346: vector dependence: assumed ANTI dependence between line 108 and line 116
17.0: remark \#15346: vector dependence: assumed ANTI dependence between *(s1) (108:2) and *( $\mathrm{r}+4$ ) (116:2)
$\checkmark$ More precise non-vectorization reasons

- E.g.: "exception handling for function call prevents vectorization"
$\checkmark$ Gather and partial scalarization reasons reporting (-qopt-report:5)
16.0: remark \#15328: vectorization support: gather was emulated for the variable xyBase: indirect access [scalar_ds7ash_fused.cpp $(334,27)]$
17.0: remark \#15328: vectorization support: gather was emulated for the variable <xybase[xboffset][c][s][1]>, indirect access, part of index is conditional [scalar_dslash_fused.cpp $(334,27)]$
Other reasons are:
- read from memory
- nonlinearly computed
o is result of a call to function
- is linear but may overflow \&either in unsigned indexing or in address computation
- is private < memory privatization in explicit vectorization or serialized computation


## Vectorization Advisor

Assist code vectorization for Intel ${ }^{\ominus}$ SIMD (Zakhar A. Matveev)
$\checkmark$ All the data you need in one place
$\checkmark$ Combines Intel Compiler opt-report with dynamic profile.
$\checkmark$ Detects "hot" unvectorized or "under vectorized" loops.
$\checkmark$ Identify performance penalties and recommend fixes
$\checkmark$ Explicit advices with "true intelligence" including OpenMP4.x
$\checkmark$ Memory layout (stride) analysis
$\checkmark$ Increase the confidence that vectorization is safe


## Advances: <br> Tackle C++ Challenges and Beyond



## Virtual SIMD Functions

## $\checkmark$ Syntax

- Exactly same syntax as for usual vector functions
$\checkmark$ Inheritance:
- Set of versions inherited and cannot be altered in overrides
- Implication: SIMDness should be introduced along with virtual method, not in overrides
$\checkmark$ Things to remember:
- Performance depends on divergence:
- uniform(this) fastest: single call per chunk
- Different overrides in lanes slowest: loop for each unique call target
- Limitations:
- Multiple inheritance is not supported
- Pointers to virtual vector methods are not supported

```
class A {
public:
    #pragma omp declare simd linear(X)
    #pragma omp declare simd uniform(this) linear(X)
        virtual int foo(int X)
};
#pragma omp declare simd uniform(this) linear(X)
int A::foo(int X){ return X+1; }
class B : public A {
public:
    // #pragma omp declare simd linear(X) - inherited
    // #pragma omp declare simd uniform(this) linear(X)
        int foo(int X) { return ( }\mp@subsup{\textrm{X}}{}{*}\textrm{X}); ;
};
int main() {
    A* b[N], a = new B();
    int sum=0;
    for (int i=0; i < N; i++) {
        b[1] = (i % 6) < 2 ? new A() : new B()
    }
    #pragma omp simd reduction (+:sum)
    for (int i=0; i < N; i++) {
        sum += a->foo(i); // uniform(this) matched
    }
    #pragma omp simd reduction (+:sum)
    for (int i=0; i < N; i++) {
        sum += b[i]->foo(i); // linear(X) matched
                                    // 1 or 2 calls per chunk
```


## SIMD Function Pointers

// SIMD vector pointer declaration annotation
\#pragma omp declare simd // universal but slowest definition matches the use in all three loops \#pragma omp declare simd linear(in1), linear(ref(in2)), uniform(mul) // matches the use in the first loop int (*funcptr)(int* in1, int\& in2, int mul);
int *a, *b, mul, *c;
int *ndx, nn;

```
// loop examples
```

\#pragma omp simd
for (int i = 0; i < nn; i++) \{
$c[i]=$ func $(a+i, *(b+i)$, mul); // in the loop, the first arg is changed linearly,
// the second reference is changed linearly too
// the third parameter is not changed
\}

## Vector Function Pointers

## Enable: -simd-function-pointers/ -Qsimd-function-pointers

The syntax:

- Apply usual vector function declaration to a function pointer variable or function pointer typedef
- Assign address of compatible SIMD-enabled function to a pointer or pass as parameter

Things to remember

- Performance depends on divergence
- Scalar function pointers and vector function pointers are binary incompatible
- Vector specifications are not part of a type (at least not in current implementation)
- They may not be used for function overloading or template instantiation. No specific name mangling done for e.g. parameters of such types.
- Situations that may lead to run-time ambiguities are caught and error "Error \#3757: this use of a vector function type is not fully supported" reported
- If you are sure that no ambiguity possible (e.g. function accepting vector function pointer has distinct name and fully declared before all uses) you may override the error via -wd3757 command line switch


## Vectorizing Loop with vcompress/vexpand

## Compress

count $=0$;
\#pragma omp simd
for(i) \{
if (cond(i))\{
\#pragma omp ordered simd \{
count++;
A[count] = B[i];//compress \}
\}
\}

Optimization Notice
Copyrighte 2015, Intel Corporation. All rights reserved.
Other names and brands may be claimed as the property of others.

## Expand

count $=0$;
\#pragma omp simd
for (i) \{
if (cond(i)) \{
\#pragma omp ordered simd
\{
count++;
$A[i]=B[$ count $] ; / / e x p a n d$
\}
\}
\}
When cond(i) is



## Compress/Expand with Monotonic Semantics

count $=0 ;$ inc $=1$;
\#pragma omp simd
for(i=0; i<N; i++) \{
\#pragma omp ordered simd monotonic(count: inc) // proposed clause
\{
if (cond(i)) \{
A[count] += B[i]; // compress count+=inc
$\mathrm{B}[\mathrm{i}]=\mathrm{C}[$ count $]$; // expand

## \}

\}
\}

```
Vector Operation with compress/expand
{
    st1 = count
    vt2 = maskload(B[i], cond(i))
    compressstore(A[st1], cond(i), vt2)
    count += popcount(cond(i))
    st1 += inc
    vt3 = expandload(C[st1], cond(i))
    maskstore(B[i], cond(i), vt3)
}
```


## Integral part of Intel ${ }^{\circledR}$ Parallel Studio XE

| As a software developer, I care about: |  | ...and my challenges are: | Intel compilers offer: |
| :---: | :---: | :---: | :---: |
|  | Performance - I develop applications that need to execute FAST | Taking advantage of the latest hardware innovations | Developers the full power of the latest x86compatible processors and instruction sets |
| $\\|\\|\\|\\|\\|\\|$ | Productivity - I need productivity and ease of use offered by compilers | Finding support for the leading languages and programming models | Support for the latest Fortran, C/C++, and OpenMP* standards; compatibility with leading compilers and IDEs |
|  | Scalability - I develop and debug my application locally, and deploy my application globally | Maintaining my code as core counts and vector widths increase at a fast pace | Scalable performance without changing code as newer generation processors are introduced |
|  |  |  | enter |

## Summary: Close to Metal Performance via Explicit SIMD Programming

The reality:
> There is no one single solution that would make all programmers happy after decades of trying.
> There is no free lunch for effectively utilizing SIMD HW in multicore CPUs, accelerators and GPUs.
> There are many emerging programming models for multicore CPUs, accelerators and GPUs.
> Programming languages and compilers are driven by hardware and application
> The incremental approach of applying the learnings from HPC and graphics is working

[^0]
## CODE MODERNIZATION: BEST PRACICESS IN VECTOR PROGRAMMING

Robert Geva - Senior Principal Engineer Intel Corporation
September 11, 2016

## Section III: Agenda

| Introduction |
| :---: |
| Case studies |
| The proposal being <br> processed at the <br> C++ standard |

-HW support growing over time
-Performance impact of SIMD
-Vector programming as part of parallel programming

- Based on joint projects with Intel's customers
- Myth debunking: vectorization is about adding \#pragma to the right loops
-Why OpenMP is inadequate
-What stays similar to OpenMP
-How it is different
-Future extensions to the proposal

Design patterns

Performance portability
-Best practices in using the vector syntax in algorithms
-AVX512 specific patterns

- How to write GPGPU kernels in OpenMP
- Myth debunking: why the same code for CPU ad GPU cannot be optimal for the CPU
-Performance data: methodology, case studies and results


## Parallel computing with Intel Architecture

|  | Cores | SIMD | LANES |
| :--- | ---: | ---: | ---: |
| $\times 2007$ | 8 | 128 | 32 |
| X2009 | 8 | 128 | 32 |
| $\times 2010$ | 12 | 128 | 48 |
| $\times 2012$ | 16 | 256 | 128 |
| X2013 | 24 | 256 | 192 |
| $\times 2014$ | 36 | 256 | 288 |
| $\times 2016$ | 44 | 256 | 352 |

vector lanes


Binomial Options
(1) Incremental growth in CPU resources
(2) Improvements in compilers and parallel frameworks
(3) Better Parallelization techniques


## Parallel performance over time



Monte Carlo Asian Options




4 considerations to take care of when writing an efficient, unconstrained parallel program

## \#1 Best Practice in Parallelizing a Loop Hierarchy



## Vectorize Innermost, Parallelize Outermost (VIPO)

## Performance with vector parallelism

Serial, single prec

serial, double prec


Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
parallel, single prec



Optimization I Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions.

## Vector Programming (Part of Parallel Programming)



## Capabilities in Vector Programming

```
#pragma omp simd
for(i = 0; i<N; ++i)
{
    a[i] = b[i]+c[i] ;
\}
```

1. Vector Loops
a) The syntax means that a loop is a vector loop
b) Used mostly at the application level
c) Syntax can look like OpenMP*, Intel® Cilk ${ }^{\mathrm{TM}}$, Intel ${ }^{\circledR}$ Threading Building Blocks, etc.
d) The loop is single threaded and consistent with SIMD execution
e) Additional syntax for more capabilities
2. Vector Functions
a) The function is compiled as if it is part of the body of a vector loop
b) For use in larger projects and for libraries
c) Organizations interested in methodological
parallel programming
d) Additional syntax for more capabilities
\#pragma omp declare simd
vec_add (float *a,float *b,float *c int i) \{

$$
a[i]=b[i]+c[i] ;
$$

Copyright © 2015, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
$\}$

## Case studies

## Case Study: Trinomial options



Vectorize the inner loop

| sec |  | How was it done |
| :--- | :--- | :--- |
| 31 |  | GCC Baseline |
| 28.7 | 1.1 X | ICC Baseline |
| 19.0 | 1.6 X | Vectorize critical loop. |
| 14.3 | 2.2 X | Use AVX |
| 11.6 | 2.7 X | Add aligned vectors |
| 0.84 | $\mathbf{3 7 X}$ | Parallelize with OMP |
| 0.60 | $\mathbf{5 1 X}$ | Move to Intel® Xeon <br> Phi coprocessor |

```
#pragma omp simd
for (int n = 0; n < 2*m + 1; n++)
{
    vCurrent[n] = fmax( BsAnalyticIntrinsicValue(TreeGetAdjustedSpot(m, n, params), kMod, isCall),
        discount*(ExpectedValue(vCurrent[n], vCurrent[n+1], vCurrent[n+2], params)));
    }
```


## Asian Options example: Vectorize an outer loop

```
#pragma omp parallel for simd reduction(+:val) reduction(+:val2)
for(int pos = 0; pos < RAND_N; pos++)
{
    __declspec(align(64)) tfloat simStepResult[SIMSTEPS+1];
    simStepResult[0] = 1.0;
    tfloat avgMean = 0.0;
    for (int simStep =0; simStep < SIMSTEPS; simStep++)
    {
        location = pos*SIMSTEPS + simStep;
        simStepResult[simStep+1] = simStepResult[simStep]*EXP(MuByT + VBySqrtT*l_Random[location]);
        avgMean += simStepResult[simStep+1];
    }
    //Use Arithmetic Mean
    avgMean *= Sval/SIMSTEPS;
    tfloat callValue = max((avgMean - Xval), 0);
    val += callValue;
    val2 += callValue * callValue;
}
```


## LIBOR case study: Vectorize an outer loop with function calls

```
#pragma omp declare simd
static void path_calc_b1(REAL *z, REAL *L, REAL *L2, const REAL* lambda)
{
    int i, n;
    REAL sqez, lam, con1, v, vrat;
    memcpy(L2, L, NN*sizeof(REAL)); A few lines of code
```



```
        sqez = SQRT_DELTA * z[n];
        v =REAL(0);
        for (i=n+1; i<NN; i++) {
            lam = lambda[i-n-1];
            con1 = DELTA * lam;
            v += con1 * L[i] / (REAL(1) + DELTA * L[i]);
            vrat = std::exp(con1 * v + lam * (sqez - REAL(0.5) * con1));
            L[i] = L[i] * vrat;
            L2[i+(n+1)*NN] = L[i];
        }
    }
}
```

```
#pragma omp simd reduction(+: sumv) reduction(+: sumlb)
for (path=0; path<numPaths; path++) {
    path_calc_b1(ptrZ, L, L2, lambda);
    path_calc_b2(L_b, L2, lambda);
```

Optimization I
Copyright © 2015,
*Other names and b


## OpenMP parallelism: wrong way

```
nbnds=jend-jstart+1 ! [jstart,jend)
!$OMP PARALLEL FOR collapse(2)
DO j=1,nbnds
    DO ir=1,nrxxs
        rho(ir,j)=CONJG(exxbuff(ir,j+jstart))*temppsi(ir)/Omega
    ENDDO
ENDDO
FFTm(rho) ! Batch 3D FFT
!$OMP PARALLEL FOR collapse(2)
DO j=1,nbnds
    DO ir=1,nrxxs
        vc(ir,j)=facb(ir)*rho(ir,j)*x(j+jstart)*y
    ENDDO
ENDDO
invFFTm(vc) ! Batch 3D FFT
!some more on vc
!$OMP PARALLEL FOR
Do ir=1,nrxxs
    Do j=1,nbnds
        result(ir)=result(ir)+vc(ir,j)*exxbuff(ir,j+jstart)
    ENDDO
ENDDO
```

collapse(2) introduced to expose parallelism over nbnds*nrxxs, but

- temppsi shared by j
- cannot be linearlized
- poor cache use


## Blocking: Parallelism, SIMD and Cache blocking

```
nbnds=jend-jstart+1 ! [jstart,jend)
nblocks=2048
!$OMP PARALLEL FOR collapse(2)
DO irb=1,nrxxs,nblocks
    DO j=1,nbnds
        irmax=min(nrxxs,irb+nblocks)
!DIR$ vector nontemporal(rho)
        DO ir=irb,irmax
            rho(ir,j)=CONJG(exxbuff(ir,j+jstart))*temppsi(ir)/Omega
        ENDDO
    ENDDO
ENDDO
```

Empirically determined for this loop

- irb-j loop better than j-irb
- nblocks=2048
- streaming stores (to be explored further later after dungeon)

Other seemingly similar loops favor j-irb
Reduction kernels benefit from blocking

## STAC-A2: breaking dependencies

```
void sum(const int* in1, const int* in2, std::size_t Scalar Cod
size, int* out)
{
    #pragma omp simd
    for(int i=0; i<size; ++i ){
        out[i] = in1[i] + in2[i];
    }
}
```



## Original Code

```
for (unsigned p = 0; i < nPaths; ++p)
{
    double mV[nTimeSteps];
    double mY[nTimeSteps];
    for (unsigned int t = 0; t < nTimeSteps; ++t){
        double currState = mY[t] ; // Backward dependency
            double logSpotPrice = func(currState, ...);
            mY[t+1][p] = logSpotPrice * A[t];
        mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p];
        price[t][p] = logSpotPrice*D[t] +E[t] * mV[t][p];
    }
}
```

```
double mV[nTimeSteps][nPaths]
double mY[nTimeSteps][nPaths];
...
for (unsigned int t = 0; t < nTimeSteps; ++t){
    #pragma omp simd
    for (unsigned p = 0; i < nPaths; ++p)
    {
        double currState = mY[t][p] ;
        double logSpotPrice = func(currState, ...);
        mY[t+1][p] = logSpotPrice * A[t];
        mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p];
        price[t][p] = logSpotPrice*D[t] +E[t] * mV[t][p];
    }
}
```


## Intel® TBB parallel blocks, using ranges

```
tbb::parallel_for(blocked_range<int>(0, nPaths),
    [&](const blocked_range<int>& r) {
        cosnt int block_size = r.size();
    double mV[nTimeSteps][block_size];
    double mY[nTimeSteps][block_size];
    ...
        for (unsigned int t = 0; t < nTimeSteps; ++t){
            #pragma omp simd
        for (unsigned p = 0; i < block_size; ++p)
            {
            double currState = mY[t][p] ;
            ...
            double logSpotPrice = func(currState, ...);
            mY[t+1][p] = logSpotPrice * A[t];
            mV[t+1][p] = logSpotPrice * B[t] + C[t] * mV[t][p];
            price[t][r.begin()+p] = logSpotPrice*D[t] +E[t] * mV[t][p];
            }
}
```


## PDE solver

```
void solve_tridigonal (double* x, const int N, double* a, double* b, double
*c, double* cprime)
{
    cprime[0] = c[0] / b[0];
    x[0] = x[0] / b[0];
    for (int in = 1; in < N; in++) {
            REAL m = REAL(1.0) / (b[in] - a[in] * cprime[in - 1]);
            cprime[in] = c[in] * m;
            x[in] = (x[in] - a[in] * x[in - 1]) * m;
    }
```

    for (int in = N - 2; in-- > 0; )
        \(x[i n]=x[i n]-c p r i m e[i n] * x[i n+1] ;\)
    \}

Most of the time is in the Thomas algorithm, solving a tridiagonal matrix

## But there is not a single matrix, there are many of them.

```
//Solve 2D PDE
for(int j=1; j<OUTER-1; j++)
{
    for(int i=1; i<INNER-1; i++)
    {
    //Create RHS
        H[i] = v1[i][j];
    }
    h[1] = h[1] - a[1]*v2[0][j];
    h[INNER-2] = h[INNER-2] - c[INNER-2]*v2[INNER-1][j];;
There is a loop of calls to the solver
    //Solve Tridiagonal system using Thomas Algorithm
    solve_tridigonal (&h[1], INNER-2, &a[1], &b[1], &c[1], &scratch[1]);
    //copy RHS to v2
    for(int i=1; i<INNER-1; i++)
    {
        v2[i][j] = h[i];
    }
}
```


## Widen the outer loop stride - add a dimension make space for a new inner loop

```
//Solve 2D PDE
for(int j=1; j<OUTER-1; j+=DIMNSZ)
{
    for(int i=1; i<INNER-1; i++)
    {
#pragma simd
    for(int j1=0; j1<DIMNSZ; j1++)
    {
        int cntJ = j+j1;
        h[i][j1] = vi[I][cmur], dimension
    }
}
#pragma simd
for(int j1=0; j1<DIMNSZ; j1++)
{
    int cntJ = j+j1;
    h[1][j1] = h[1][j1] - a[1]*v2[0][cntJ];
    h[INNER-2][j1] = h[INNER-2][j1] - c[INNER-2]*v2[INNER-1][cntJ];
}
//Solve Tridiagonal system using Thomas Algorithm
solve_tridigonal_simd(&h[1], INNER - 2, &a[1], &b[1], &c[1], &scratch[1]);
Add a
dimension
```

Add a dimension

```
        \}\begin{array}{c}{\mathrm{ Add a }}\\{\mathrm{ dimension }}
    }
```

```
    for(int i=1; i<INNER-1; i++)
```

```
    for(int i=1; i<INNER-1; i++)
```

```
    for(int i=1; i<INNER-1; i++)
```

```
    for(int i=1; i<INNER-1; i++)
    {
    {
    {
    {
#pragma simd
#pragma simd
#pragma simd
#pragma simd
for(int j1=0; j1<DIMNSZ; j1++)
for(int j1=0; j1<DIMNSZ; j1++)
for(int j1=0; j1<DIMNSZ; j1++)
for(int j1=0; j1<DIMNSZ; j1++)
    {
    {
    {
    {
        int cntJ = j+j1;
        int cntJ = j+j1;
        int cntJ = j+j1;
        int cntJ = j+j1;
        v2[i][cntJ] = h[i][j1];
        v2[i][cntJ] = h[i][j1];
        v2[i][cntJ] = h[i][j1];
        v2[i][cntJ] = h[i][j1];
        }
        }
        }
        }
    }
    }
    }
    }
}
```

}

```
}
```

}

```
```

}

```
```

```
}
```

```
```

}

```
```

```
}
```

```

\section*{The new inner loop can vectorize a the new dimension, where there are no dependencies}
```

void solve_tridigonal_simd (double x[][DIMNSZ], const int N, double* a, double* b, double *c, double *
cprime)
{
cprime[0] = c[0] / b[0];
\#pragma simd
for(int j=0; j<DIMNSZ; j++)
{
x[0][j] = x[0][j] / b[0];
}
/* loop from 1 to N - 1 inclusive */
for (int in = 1; in < N; in++)
{
double tmpA = a[in];
double tmpB = b[in];
double tmpC = c[in];
double m = REAL(1.0) / (tmpB - tmpA * cprime[in - 1]);
cprime[in] = tmpC * m;

```
```

\#pragma simd

```
#pragma simd
    for (int j =0; j<DIMNSZ; j++)
    for (int j =0; j<DIMNSZ; j++)
    {
    {
    {x[in][j] = (x[in][j] - tmpA * x[in - 1][j]) *m;
    {x[in][j] = (x[in][j] - tmpA * x[in - 1][j]) *m;
    }
    }
}
}
for (int in = N - 2; in-- > 0; )
for (int in = N - 2; in-- > 0; )
{
{
    double cPrime = cprime[in];
    double cPrime = cprime[in];
    #pragma simd
    #pragma simd
    for (int j =0; j<DIMNSZ; j++)
    for (int j =0; j<DIMNSZ; j++)
    {
    {
        x[in][j] = x[in][j] - cPrime * x[in + 1][j];
        x[in][j] = x[in][j] - cPrime * x[in + 1][j];
    }
    }
                            }
```

                            }
    ```


\section*{FXLSMMONTECARLOCASESTUDY}

Thomas Trenner
Robert Geva

\section*{Loop interchange leads to canonical loop hierarchy}

blocks loops - data independent
path loop - data independent
time loop - value at t depends on value at \(\mathrm{t}-1\)

Parallelize the outer loop
keep the middle loop sequential
vectorize the inner loop:
This works best!

\section*{Results}
\begin{tabular}{|c|c|c|c|}
\hline Step & Thread count & Time (Secs) & \begin{tabular}{l}
Measurements were taken on an E5-2697 Haswell with \\
- ICC v 15.0.1 \\
- RHEL 6.5
\end{tabular} \\
\hline 1 & 1 & 82.6 & Baseline Results \\
\hline 2 & 1 & 24.6 & Loop interchange \\
\hline 3 & 1 & 11.3 & With vectorization 2.2x Vectorization Speedup. \\
\hline 4 & 1 & 9.9 & +MKL RNG. \\
\hline 5 & 1 & 9.4 & +TBB allocator \\
\hline & & & 8.8x Single Threaded Speedup. \\
\hline 6 & 28 & 0.47 & +TBB Threads \\
\hline 7 & 28 & 0.45 & \begin{tabular}{l}
+Turbo mode enabled. \\
183x Multi threaded Speedup.
\end{tabular} \\
\hline
\end{tabular}

\section*{Vectorization}
```

\#pragma simd assert firstprivate(interpolator, dT, rootDT, tplus1, expMAlphaDt,\
normalCoefficient, deterministicDrift,\
localVolLowerLimit, localVolUpperLimit)\
vectorlengthfor(double)
\#pragma vector aligned
for(unsigned int path = 0; path < paths; ++path)
1\#if defined(__AVX2_)
_decIspec(vector(uniform(this), processor(core_4th_gen_avx)))
__declspec(vector(uniform(this)))
double Interpolate(
const double x
// [0] interpolated y-value
// [i] x-value to interpolate
) const;

```

\section*{Concepts used in vectorising the loop}
\#pragma simd:
- semantically correct to re-associate the order of evaluation, leading to vectorization

First private:
- each iteration gets a private copy of the object, initialized by the value it has prior to the loop

Assert:
- abort compilation with an error message if the loop is not vectorized

Vectorlengthfor(type):
- The size of type determines how many loop iteration to vectorize across
declspec(vector):
- A vector function. Compiled and execute as if a body of a vector loop

Uniform(parameter):
- All values of parameter in one vector invocation of the function are the same

Processor(code_4 \({ }^{\text {th }} \_\)gen_avx):
- use YMM to pass arguments, (the default is XMM)


\section*{Design patterns}

\section*{A simple patterns}
```

\#pragma omp parallel simd for
for (int i = 0; i < m_optionCount; i++)
BlackScholesBodyCPU(\&resultCallGen[i], \&resultPutGen[i],
stockPrice[i], optionStrike[i], optionYears[i],R,m_V);

```
```

\#prgma omp declare simd
void BlackScholesBodyCPU(
float* call, //Call option price
float* put, //Put option price
float Sf, //Current stock price
float Xf, //Option strike price
float Tf, //Option years
float Rf, //Riskless rate of return
float Vf //Stock volatility
K
float S = Sf, X = Xf, T = Tf, R = Rf, V = Vf;
float CNDD1, CNDD2, sqrtT, expRT
sqrtT = sqrtf(T);
d1 = (logf(S / X) + (R+0.5f*V * V) *T) / (V * sqrtT);
d2 = d1 - V * sqrtT;
CNDD1 = CND(d1);
CNDD2 = CND(d2);
expRT = expf(- R *T);
*call = (FTYPE)(S * CNDD1 - X * expRT * CNDD2);
*put = (FTYPE)(X * expRT * (1.Of-CNDD2) - S * (1.Of -
CNDD1);
}

```

A loop with no data dependencies
No control flow
Simple memory access
Could also be parallelized
Scales well

\section*{A loop with forward dependence}
```

float stepsArray[STEPS_CACHE_SIZE];
\#pragma omp simd
for (int j = 0; j < STEPS_CACHE_SIZE; j++) {
float profit = s * expf(vsdt * (2.0f * j - numSteps)) - x;
stepsArray[j] = profit > 0.0f ? profit : 0.0f;
}
for (int j = 0; j < numSteps; j++) {
\#pragma omp simd
for (int k = 0; k < NUM_STEPS_ROUND; ++k) {
stepsArray[k] = pdByr * stepsArray[k + 1] + puByr * stepsArray[k];
}
}

```

The vector loops propagates values from root to leaves
Looks very similar to original, sequential loop
stepArray \([\mathrm{k}]\) depends on stepArray \([\mathrm{k}+1]\), vector programming supports it SIMD != SIMT
Parallelization is at an outer level.

\section*{Parameter Qualifiers}
```

\#pragma omp declare simd
void foo (float \&a, int i)
{
x = a[i];
}

```

Compiling the vector variant will generate multiple expressions of \(x=a[i]-\) what are the relationship between the memory accesses?

If the compiler doesn't know better, then they are unrelated.
```

\#pragma omp declare simd uniform (a)
void foo(float *a, int i);
a is a pointer
i is a vector of integers
a[i] becomes gather/scatter

```
```

\#pragma omp declare simd linear(i)

```
void foo(float \(* a\), int i);
    a is a vector of pointers
    \(i\) is a sequence of integers
    [ \(i, i+1, i+2 \ldots]\)
    \(a[i]\) becomes gather/scatter
\#pragma omp declare simd
uniform(a),linear(i))
void foo(float *a, int i);
\(a\) is a pointer
i is a sequence of integers \([\mathrm{i}, \mathrm{i}+1, \mathrm{i}+2 \ldots]\)
\(a[i]\) is a unit-stride load/store

\section*{Multiple Variants of a Vector Function}
```

\#pragma omp declare simd
\#pragma omp declare simd uniform(r,op1,op2) linear (i)
Void
vec_add ( float *r, float *op1, float *op2, int i)
{
r[i] = op1[i] + op2[i];
}

```
```

\#pragma omp simd
for (int i = 0; i<N; ++i) {
vec_add(a,b,c,i);
}

```
```

\#pragma omp simd
for (int i = 0; i<N; ++i) {
vec_add(a[x1[[i]],b[x2[i]],c[x3[i]],i);
}

```

Two vector variants and one scalar

Call matches the variant with the uniforms

\section*{Call matches the} variant w/o the uniforms

\section*{Multiple Variants of a Vector Function}
```

\#pragma omp declane simd
\#pragma omp declare simd uniform(r,op1,op2) linear (i)
Void
vec_add ( float *r, float *op1, float *op2, int i)
{
r[i] = op1[i] + op2[i];
}

```
```

\#pragma omp simd
for (int i = 0; i<N; ++i) {
vec_add(a,b,c,i);
}

```
```

\#pragma omp simd
for (int i = 0; i<N; ++i) {
vec_add(a[x1[[i]],b[x2[i]],c[x3[i]],i);
}

```

Two vector variants and one scalar

Call matches the variant with the uniforms

\section*{Call matches the} variant w/o the uniforms

\section*{Multiple Variants of a Vector Function}
```

\#pragma omp declare simd
\#pragma omp declare simd uniform(r,op1,op2) linear (i)
Void
vec_add ( float *r, float *op1, float *op2, int i)
{
r[i] = op1[i] + op2[i];
}

```
```

\#pragma omp simd
for (int i = 0; i<N; ++i) {
vec_add(a,b,c,i);
}

```
```

\#pragma omp simd
for (int i = 0; i<N; ++i) {
vec_add(a[x1[[i]],b[x2[i]],c[x3[i]],i);
}

```

Two vector variants and one scalar

Call matches the variant with the uniforms

\section*{Call matches the} variant w/o the uniforms

\section*{Additional SIMD specific capabilities}
\[
\begin{aligned}
& \text { Scatter write: } \mathrm{a}[\mathrm{~b}[\mathrm{x}]]=\mathrm{d}[\mathrm{x}] \text {; } \\
& \text { Histogram: } \mathrm{a}[\mathrm{~b}[\mathrm{x}]]++ \text {; } \\
& \text { Expand: } \quad \text { if (c[i]) a[i] = b[i] * d[j++]; } \\
& \text { Compress: if (c[i])a[j++] }=\mathrm{b}[\mathrm{i}] * \mathrm{~d}[\mathrm{i}] \text {; }
\end{aligned}
\]

\section*{A lopsided loop}

Assume execution where expensive calc is called once per vector loop.

All lanes that execute inexpensive calc are held back, and execute as slow as the expensive calc.

Optimization: rewrite so that all expensive calcs are consecutive, and inexpensive calcs are consecutive.
```

\#pragma omp simd
for (int x = 0; x < N; ++x) {
double val = in[x];
if (val == 0.0){
results[x] = expensive_calc(val);
}
else
results[x] = inexpensive_calc(val);
}

```

The main loops speed-up for all HW targets.

The overhead is vectorizeable using compress / expand.

\section*{Partition By Weight}
```

for (int x = 0; x < N; ++x) {
double val = in[x];
int mask_local = val == 0.0;
mask[x] = mask_local;
if(mask_local){
vecX[cnt] = val; //compressed
cnt++;
}
}
\#pragma omp simd
for (int y = 0; y < cnt; ++y) {
vecX[y] = expensive_calc(vecX[y]);
}
cnt = 0;

```
```

for (int x = 0; x < N; ++x) {
double val = in[x];
if(___builtin_expect(mask[x],0))
results[x] = vecX[cnt++]; //expand
else
results[x] = inexpensive_calc(val);
}

```

\section*{Performance portability}


Questions related to performance portability
A. Interest in maintaining a single source base across CPUs and GPUs
B. Interest in languages that provide good support for both
C. Interest in OpenCL
D. Interest in conversion of CUDA code to CPUs / Xeon Phi

> Multiple parallelization related consideration are different, with potentially significant impact, limiting the potential for performance portability

Wider fan out vs more work / worker

Account for memory and cache efficiency, tiling, blocking

Account for cores per socket, hyper threads per core, NUMA effects

\section*{Today's focus: difference in vectorization.}

\section*{Rationale}

SIMT- style kernels are too restrictive
- There are many parallel algorithms and design patterns
- In many cases, the kernel is not the optimal design pattern
- Then, a kernel- only language or a kernel only algorithmic design locks you out of a solution A trivial case: Black Scholes

Example of kernels being inadequate: Binomial options
Data: writing kernels in OpenMP vs. writing loops in OpenMP,
- same language
- Same compiler
- Same HW
- Large performance impact

\section*{GPU kernels and CPU loop hierarchies}
\begin{tabular}{|c|}
\hline CUDA \\
\hline \multirow[t]{2}{*}{```
BlackScholesGPU<<<<256, 128>>>(
    d_CallResult, d_PutResult, d_OptionStrike, d_StockPrice,
    d_OptionYears, RISKFREE, VOLATILITY, OPT_N);
```

```
__device
void BlackScholesBodyGPU(
    float& CallResult,
    float& PutResult,
    float S, //Stock price
    float X, //Option strike
    float T, //Option years
    float R, //Riskless rate
    float V //Volatility rate
){
    float sqrtT, expRT;
    float d1, d2, CNDD1, CNDD2;
    sqrtT = sqrtf(T);
    d1 = (_logf(S / X) + (R + 0.5f * V * V) * T) / (V * sqrtT);
    d2 = d1 - V * sqrtT;
    CNDD1 = cndGPU(d1);
    CNDD2 = cndGPU(d2);
    expRT = __expf(- R * T);
    CallResult = S* CNDD1 - X * expRT * CNDD2;
    PutResult = X * expRT * (1.Of-CNDD2)- S *(1.Of-CNDD1);
}
```} \\
\hline \\
\hline
\end{tabular}

\section*{OpenMP}

\section*{\#pragma omp parallel simd for}
for (int i = 0; i < m_optionCount; i++)
BlackScholesBodyCPU(\&resultCallGen[i], \&resultPutGen[i],
stockPrice[i], optionStrike[i], optionYears[i],R,m_V);

\section*{\#prgma omp declare simd}
void BlackScholesBodyCPU
float* call, //Call option price
float* put, //Put option price
float Sf, //Current stock price
float Xf, //Option strike price
float Tf, //Option years
float Rf, //Riskless rate of return
float Vf //Stock volatility
) \{
float \(S=S f, X=X f, T=T f, R=R f, V=V f ;\)
float CNDD1, CNDD2, sqrtT, expRT
sqrtT = sqrtf(T);
d1 = (logf(S / X) + (R + 0.5f * V * V) * T) / (V * sqrtT);
\(\mathrm{d} 2=\mathrm{d} 1-\mathrm{V}\) * sqrtT;
CNDD1 = CND(d1);
CNDD2 \(=\) CND (d2);
\(\operatorname{expRT}=\operatorname{expf}\left(-\mathrm{R}^{*} \mathrm{~T}\right)\);
*call = (FTYPE)(S * CNDD1 - X * expRT * CNDD2);
*put \(=(\) FTYPE \()\left(X^{*} \operatorname{expRT} *(1.0 f-C N D D 2)-S^{*}(1.0 f-\right.\) CNDD1));
\}

\section*{Vectorize the inner loop independently of the outer loop. Vectorising with FORWARD dependencies}
```

float stepsArray[STEPS_CACHE_SIZE];
\#pragma omp simd
for (int j = 0; j < STEPS_CACHE_SIZE; j++) {
float profit = s * expf(vsdt * (2.0f * j - numSteps)) - x;
stepsArray[j] = profit > 0.0f ? profit : 0.0f;
}
for (int j = 0; j < numSteps; j++) {
\#pragma omp simd
for (int k = 0; k < NUM_STEPS_ROUND; ++k) {
stepsArray[k] = pdByr * stepsArray[k + 1] + puByr * stepsArray[k];
}
}

```

\section*{CUDA version}
```

//Calculations within shared memory
for(int k = c_start - 1; k >= c_end;){
//Compute discounted expected value
syncthreads();
if(tid <= k)
callB[tid] = puByDf * callA[tid + 1] + pdByDf * callA[tid];
k--;
//Compute discounted expected value
syncthreads();
if(tid <= k)
callA[tid] = puByDf * callB[tid + 1] + pdByDf * callB[tid];
k--;
}

```
since the order of thread scheduling is complex and best viewed as simply
undefined, the reduction primitive is double-buffered, ensuring by means of __syncthreads() that results from the previous stage are ready before they are used in the next,

\section*{C++ AMP Sample Code for binomial options}

```

index<1> tile_idx = tidx.tile; index<1> local_1dx = tidx.local
tile_static float call_a [CACHE_SIZE 1 ];
tile_static float call_b[CACHE_SIZE+1];
int tid = local_idx[0];
int i;
for ( $\mathrm{i}=$ tid; $\mathrm{i}<=$ NUM_STEPS; $\mathrm{i}+=$ CACHE_SIZE
Index<1> idx(tile_idx[0] * (NUM_STEPS + 16) + (i)
call_buffer[idx] = expiry_call_value(s[tile_idx], x[tile_idx], vdt[tile_idx], i),
for
for ( $\mathrm{i}=$ NUM_STEPS; $\mathrm{i}>0$; $\mathrm{i}-=$ CACHE_DELTA $)$
\{
for (int c_base $=0$; c_base $<i$; c_base $+=$ CACHE_STEP)
\{
int c_start $=\min \left(\right.$ CACHE_SIZE $-1, i-c \_$base $)$
int c_end $=$ c_start - CACHE_DELTA; tidx.barrier.wait()
if(tid <= c_start)
\{
index<1> idx(tile_idx[0] * (NUM STEPS + 16) + (c_base + tid))
call_a[tid] = call_buffer[idx];
\}
for (int $\mathrm{k}=\mathrm{c}$ _start $-1 ; \mathrm{k}>=\mathrm{c}_{-}$end;
\{
tidx.barrier.wait();

```


Significant and complex changes required when no vector programming available Original loop split to 2 , array split to 2 , barriers required.

\section*{GPGPU: only kernels}

One-Size-Fits-All design pattern:
Write a kernel function, serial code
Then: Invoke many instances in parallel
(not minimizing that a lot of hard work still required, including tiling, etc)


\section*{\#1 Best Practice in Parallelizing a Loop Hierarchy}


Vectorize Innermost, Parallelize Outermost (VIPO)

\section*{Methodology}

Write a few algorithms in two ways:
- Parallelize and vectorize the outer loop - the kernel pattern
- Parallelize the outer loop and vectorize the inner loop - VIPO

Express both patterns in OpenMP \({ }^{\oplus} 4.0\) and \(\mathrm{C}++\)
Use the same compiler - ICC
Use the same Hardware, OS, etc
The only difference - the parallelization and vectorization pattern
Compare performance

\section*{Skeleton of sequential Binomial Code}
```

Binomial()
{
__declspec(align(1024)) REAL Call[NUM_STEPS + 1];
//Forward Pass
for (int i = 0; i <= NUM_Nodes; i++)
{
double d = Sx * Exp(vDt * (2.0f* i - NUM_STEPS)) - Xx;
Call[i] = (d > 0) ? d : 0;
}
//Backward pass
for(int i = NUM_STEPS; i > 0; i--)
{
int Num_Nodes = i-1;
for(int j = 0; j <= Num_Nodes; j++)
Call[j] = puByDf * Call[j + 1] + pdByDf * Call[j];
}
}
main()
{
\#pragma omp parallel for
for (int i=0; i<Nopt; i++)
Binomial();

```
\}
Nopt \(=131072\) and NUM_STEPS \(=1024\)

\section*{Binomial Code Comparison}
__attribute__((vector(vectorlength(DIMNSZ)))
Binomial(.....)
\{
__declspec(align(1024)) double call[NUM_STEPS + 1];
for (int i = 0; i <= NUM_Nodes; i++) \{ double d \(=s x * \exp \left(t *\left(2.0 f * i-N U M \_S T E P S\right)\right)-x x ;\) call[i] = (d > 0) ? d : 0 ;
\}
for (int i = NUM_STEPS; i > 0; i--) \{ int Num_Nodes = i-1; for (int j = 0; j <= Num_Nodes; j++) call[j] = puByDf \(*\) Call[j + 1] + pdByDf * call[j];
\}
\}
main()
\{
\#pragma omp parallel for
for (int \(n=0\); \(n<\) Nopt; \(n+=\) DIMNSZ)
\{
......
\#pragma simd
for (int i = 0; i < DIMNSZ; ++i) \{
Binomial(... );
\}
```

    Binomial(....)
    {
        __declspec(align(1024)) REAL Call[NUM_STEPS + 1];
    #pragma simd
    for (int i = 0; i <= NUM_Nodes; i++) {
        double d = sx * exp(t * (2.0f * i - NUM_STEPS)) - xx;
        Call[i] = (d > 0) ? d : 0;
    }
    for(int i = NUM_STEPS; i > 0; i--) {
        int Num_Nodes = i-1;
        #pragma simd
        for(int j = 0; j <= Num_Nodes; j++)
            call[j] = puByDf * call[j + 1] + pdByDf * call[j];
        }
    }
main()
{
\#pragma omp parallel for
for (int i=0; i<Nopt; i++)
Binomial(...);
}

```

\section*{Monte Carlo Pseudo Code Pattern}
```

for(opt=0; opt< NumOptions ; opt++)
{
for(path=0; path<NumPaths; path++)
{
for (ts=1; ts<NumSteps; ts++)
{
S[ts] = S[ts-1] *exp(...)
}
}

```

\section*{Results}

\begin{tabular}{|l|l|l|l|l|l|l|}
\hline MC & reference & Kernel & & loops & & KDR \\
\hline & & time & speedup time & speedup & \\
\hline KNC & 4.19 & 2.59 & 1.620 .86 & 4.87 & 3.01 \\
\hline KNL & 3.99 & 2.48 & 1.610 .82 & 4.87 & 3.02 \\
\hline IVB & 3.35 & 1.23 & 2.721 .06 & 3.16 & 1.16 \\
\hline & & & & & & \\
\hline BO & reference & Kernel & & loops & & \\
\hline & & time & speedup time & speedup & \\
\hline KNC & 0.92 & 1.37 & 0.670 .85 & 1.08 & 1.61 \\
\hline KNL & 0.36 & 0.79 & 0.460 .36 & 1.01 & 2.22 \\
\hline IVB & 0.91 & 1.00 & 0.910 .90 & 1.01 & 1.11 \\
\hline & & & & & & \\
\hline LMM & reference & kernel & & loops & & \\
\hline & & time & speedup time & speedup & \\
\hline KNC & 2223.35 & 2340.60 & 0.95380 .10 & 5.85 & 6.16 \\
\hline KNL & 882.00 & 898.00 & 0.98102 .90 & 8.57 & 8.73 \\
\hline IVB & 1302.29 & 1414.00 & 0.92478 .10 & 2.72 & 2.96 \\
\hline
\end{tabular}

\footnotetext{
Optimization Notice
Copyright © 2015, Intel Corporation. All rights reserved.
*Other names and brands may be claimed as the property of others.
}


\section*{A proposal for c++}

\section*{The OpenMP syntax - a good solutions for loops}
```

\#pragma omp parallel for
for(int opt = 0; opt < OPT_N; opt++)
float VBySqrtT = VOLATILITY * sqrtf(T[opt]);
float MuByT = (RISKFREE - 0.5f * VOLATILITY * VOLATILITY) * T[opt];
float Sval = S[opt];
float Xval = X[opt];
float val = 0.0f, val2 = 0.0f;
\#pragma omp simd reduction(+:val) reduction(+:val2)
for(int pos = 0; pos < RAND_N; pos++){
float callValue = expectedCall(Sval, Xval, MuByT, VBySqrtT,
l_Random[pos]);
val += callValue;
val2 += callValue * callValue;
}
float exprt = expf(-RISKFREE *T[opt]);
h_CallResult[opt] = exprt * val / (float)RAND_N;
float stdDev = sqrtf(((float)RAND_N*val2 - val*val) /
((float)RAND_N*(float)(RAND_N - 1.f)));
h_CallConfidence[opt] =(float)(exprt * 1.96f * stdDev/sqrtf((float)RAND_N));

```

\section*{Not so much for standard algorithms}
```

Std::transform(inp.begin, inp.begin + inp.size(), out.begin(), ptr_fun<double, double>(sqrt));
void my_tranform(std::vector<int>\& src, std::vector<int>\& dst, std::function< int(int) > _func) {
vec::transform(src.begin(), src.end(), dst.begin(), _func);
}

```

\section*{The Parallelism TS in C++}
```

// Serial sort
std::sort(std::seq, x.begin(), x.end());
// Parallel sort
std::sort(std::par, x.begin(), x.end());
// Dynamically-selected policy
std::execution_policy e = std::seq();
if( x.size()>1024 )
e = std::par();
std::sort(e, x.begin(), x.end());
std::transform(std::par, b, e, o, ptr_fun(<double,
double>(sqrt));

```

Many STL algorithms are in the proposal:
copy, transform, replace,
generate, for_each, all_of, copy_if, find_if, is_sorted, inner_product, remove_if, rotate, binary_search ...

These will help with parallelization, Not with vectorization.

\section*{Indexed loops}
```

for_loop( par, 0, n, [\&](int i) {
A[i] = A[i] + B[i];
C[i] -= 2*A[i];
});

```
```

OpenMP Equivalent
\#pragma omp parallel for
for( int i=0; i<n; ++i ) {
A[i] = A[i] + B[i];
C[i] -= 2*A[i];
}

```

First, we proposed Indexed loops, even for the parallel execution policy

\section*{Reduction}
```

extern float s; extern int t;
for_loop( par, 0, n,
reduction_plus(s),
reduction_bit_and(t),
[\&](int i, float\& s_, int\& t_) {
s_ += A[i]*B[i];
t_ \&= C[i];
});
// s and t have final reduction values here.

```

OpenMP Equivalent
```

extern float s; extern int t;
\#pragma omp parallel for reduction(+:s) reduction(\&:t)
for( int i=0; i<n; ++i ) {
s += A[i]*B[i];
t \&= C[i];
}

## Vector execution policy

for_loop( vec, 0, n, [\&](int i) \{<br>$A[i]=A[i+1]+B[i] ;$<br>C[i] -= 2*A[i];<br>\});

OpenMP Equivalent

```
#pragma omp simd for
```

\#pragma omp simd for
for( int i=0; i<n; ++i ) {
for( int i=0; i<n; ++i ) {
A[i] = A[i+1] + B[i];
A[i] = A[i+1] + B[i];
C[i] -= 2*A[i];
C[i] -= 2*A[i];
}

```
}
```

A single threaded vector loops Allow vectorization of loops that cannot be parallelized (forward dependencies) Allow vectorization when multi-threading is undesired Useful for CPUs with SIMD, not so much for GPUs with SIMT

## Induction Variables

OpenMP 4.5 Equivalent

```
extern int j, k;
for_loop( vec, n, 0,
    induction(j, jstep),
    induction(k, -kstep),
    [&](int i, int j_, int k_) {
            A[i] = B[j_]*C[k_];
    });
// j and k have correct final values
here.
```

```
extern int j, k;
#pragma omp simd linear(j:jstep, k:-kstep)
for( int i=0; i<n; ++i) {
    A[i] = B[j]*C[k];
    j += jstep;
    k -= kstep;
}
```


## Extensions to the vec Policy

```
struct my_policy: vector_execution_policy {
    static const int safelen = 8;
    static const bool vectorize_remainder = true;
};
for_loop( my_policy(), 0, 1912, [&](int i) {
    Z[i+8] = Z[i]*A;
});
```

OpenMP Equivalent
(without vectorize_remainder)

```
#pragma omp simd safelen(8)
for( int i=0; i<1912; ++i ) {
    Z[i+8] = Z[i]*A;
});
```


## Both parallel and vector

\#pragma omp parallel simd for
\#pragma omp parallel simd for
for( int i=0; i<n; ++i ) {
for( int i=0; i<n; ++i ) {
A[i] = A[i] + B[i];
A[i] = A[i] + B[i];
C[i] -= 2*A[i];
C[i] -= 2*A[i];

Undefined behavior if there is a data race Undefined behavior if there is a critical section (deadlock) The loop is both parallelized and vectorized Same semantics as a "kernel" in GPGPU languages

## Less trivial vectorizeable algorithms

Algorithms with certain dependence patterns do not prevent vectorization of enclosing algorithms

And depending on the target architecture

```
// Histogram
a[b[i]]++;
```

May themselves be vectorized (may or may not be profitable)

Account for future direction of SIMD HW: these are made possible by AVX512

```
// compress / expand
if (cond(i)) {
    a[i] = b[i] * c[j++];
}
```



Binomial Options

(1) Incremental growth in CPU resources
(2) Improvements in compilers and parallel frameworks

## (3) Parallelization techniques

\#|BeST Practice in parallelzingaloophlirarchy


Vectorize Innermost, Parallelize Outermost (VIPO)

## Configuration

## Hardware configuration

- Processor:
- $\quad \operatorname{Intel}(\mathrm{R})$ Xeon(R) CPU E5-2697 @ 2.7 GHz (IVT)
- 2 sockets/24 cores/48 Threads; Turbo Off
- Memory:
- 128GB @ 1600 MHz

Software Configuration

- OS: Linux: RHEL 6.1
- Compiler:
- Gcc 4.8
- Intel Composer XE 2013 Sp1

Swap Pricer:

- Default : Number of scenarios 100
- Number of Swaps 100,000: parallelize at this level
- 26 time steps


## Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

- A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS.
- Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or

Intel's compilers may or may not optimize to the same degree for nonIntel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel.

Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision \#20110804

## Configuration for parallel speed-up



## Legal Disclaimer \& Optimization Notice

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Copyright © 2016, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

## Optimization Notice

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.

Notice revision \#20110804

## intel


[^0]:    Simple programming language extensions for computational use of SIMD Hardware Portable and consistent SIMD programming model across CPU, Coprocessors and GPUs

