# The significance of SIMD, SSE and AVX

#### Stephen Blair-Chappell

**Optimization Notice** 

Intel Compiler Labs



### Agenda

1. Auto-Vectorisation
2. CPU Dispatch
3. Manual Processor Dispatch
4. A Case Study





# "I must have the Intel compiler, it has sped up our application by two."

A customer when moving from version 9.1 to version 10 of the Intel compiler




### **Auto-Vectorisation**





#### **Vector Processing**

- A specific case of **data level parallelism** (DLP)
- The same operation is executed simultaneously on N > 1 elements of a vector.
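As a minimal sketch of the idea (not code from the talk), the two functions below add arrays of floats. `add_scalar` performs one addition per iteration, while `add_sse` uses SSE intrinsics to perform four at a time (N = 4 for 128-bit vectors). The function names are hypothetical, and the sketch assumes an x86 compiler that provides `<xmmintrin.h>`.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, ... */

/* Scalar version: one addition per loop iteration. */
static void add_scalar(const float *x, const float *y, float *out, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        out[i] = x[i] + y[i];
}

/* SIMD version: four additions per iteration (N = 4 for 128-bit SSE). */
static void add_sse(const float *x, const float *y, float *out, int n) {
    assert(n >= 0);
    int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);   /* load 4 floats, no alignment needed */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(out + i, _mm_add_ps(vx, vy));
    }
    for (; i < n; i++)                     /* remainder loop for leftover elements */
        out[i] = x[i] + y[i];
}
```

Both functions produce the same results; the SIMD version simply retires the work in fewer instructions.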



#### SIMD: Continuous Evolution

| 1999 | 2000 | 2004 | 2006 | 2007 | 2008 | 2009 | 2010/11 |
|------|------|------|------|------|------|------|---------|
| SSE | SSE2 | SSE3 | SSSE3 | SSE4.1 | SSE4.2 | AES-NI | AVX |
| 70 instr; single-precision vectors; streaming operations | 144 instr; double-precision vectors; 8/16/32/64/128-bit vector integer | 13 instr; complex data | 32 instr; decode | 47 instr; video, graphics building blocks; advanced vector instr | 8 instr; string/XML processing; POP-Count; CRC | 7 instr; encryption and decryption; key generation | ~100 new instr; ~300 legacy SSE instr updated; 256-bit vector; 3- and 4-operand instructions |



### SIMD Types in Processors from Intel [1]



#### MMX™

- Vector size: 64 bit
- Data types: 8, 16 and 32 bit integers
- VL: 2, 4, 8
- Sample (figure): Xi, Yi are 16 bit integers



#### Intel<sup>®</sup> SSE

- Vector size: 128 bit
- Data types: 8, 16, 32, 64 bit integers; 32 and 64 bit floats
- VL: 2, 4, 8, 16
- Sample: Xi, Yi are 32 bit int / float





### SIMD Types in Processors from Intel [2]



#### Intel<sup>®</sup> AVX

- Vector size: 256 bit
- Data types: 32 and 64 bit floats
- VL: 4, 8, 16
- Sample: Xi, Yi are 32 bit int or float



#### Intel<sup>®</sup> MIC

- Vector size: 512 bit
- Data types: 32 and 64 bit integers; 32 and 64 bit floats (some support for 16 bit floats)
- VL: 8, 16
- Sample: 32 bit float

Software and Services Group



### Scalar and Packed SSE Instructions

- The "vector" forms of SSE instructions, which operate on multiple data elements simultaneously, are called <u>packed</u>; vectorized SSE code therefore means code that uses packed instructions
  - Most of these instructions also have a <u>scalar</u> version that operates on one element only



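[Figure: scalar SSE add. Operands (X4, X3, X2, X1) and (Y4, Y3, Y2, Y1) give the result (X4, X3, X2, X1 add Y1); only the lowest element is added.]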
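The difference between the two forms can be sketched with intrinsics. This is an illustrative example rather than code from the talk: `_mm_add_ss` maps to the scalar `addss` instruction and `_mm_add_ps` to the packed `addps` instruction. The wrapper function names are hypothetical.

```c
#include <xmmintrin.h>  /* SSE: _mm_add_ss (scalar), _mm_add_ps (packed) */

/* Scalar form (addss): only the lowest element is added;
   the upper three elements are copied from the first operand. */
static void add_scalar_form(const float x[4], const float y[4], float out[4]) {
    _mm_storeu_ps(out, _mm_add_ss(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}

/* Packed form (addps): all four elements are added. */
static void add_packed_form(const float x[4], const float y[4], float out[4]) {
    _mm_storeu_ps(out, _mm_add_ps(_mm_loadu_ps(x), _mm_loadu_ps(y)));
}
```

With x = {1, 2, 3, 4} and y = {10, 20, 30, 40}, the scalar form yields {11, 2, 3, 4} and the packed form yields {11, 22, 33, 44}.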






#### Intel<sup>®</sup> AVX - Setting the Pace for Intel<sup>®</sup> Instruction Set






#### Key Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX) Features

| Key features | Benefits |
|--------------|----------|
| Wider vectors: increased from 128 to 256 bit; two 128-bit load ports | Up to 2x peak FLOPs (floating point operations per second) output with good power efficiency |
| Enhanced data rearrangement: use the new 256 bit primitives to broadcast, mask loads and permute data | Organize, access and pull only necessary data more quickly and efficiently |
| Three and four operands: non-destructive syntax for both AVX 128 and AVX 256 | Fewer register copies, better register use for both vector and scalar code |
| Flexible unaligned memory access support | More opportunities to fuse load and compute operations |
| Extensible new opcode (VEX) | Code size reduction |

Intel<sup>®</sup> AVX is a general purpose architecture, expected to supplant SSE in all applications used today



#### A New 3- and 4- Operand Instruction Format

 Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX) has a distinct destination argument that results in fewer register copies, better register use, more load/op macro-fusion opportunities, and smaller code size







#### Intel<sup>®</sup> Microarchitecture (Sandy Bridge) Highlights



#### **Auto-Vectorization**

Transforming sequential code to exploit the vector (SIMD, SSE) processing capabilities





#### Many Ways to introduce SSE Vectorization





#### How do I know if a loop is vectorised?

Use the -vec-report option (/Qvec-report on Windows):

> icl /Qvec-report MultArray.c
>
> MultArray.c(92): (col. 5) remark: LOOP WAS VECTORIZED.
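For illustration, a loop of the following shape is a typical candidate for that remark. This is a hypothetical function, not the contents of MultArray.c: it uses unit-stride accesses, has no dependence between iterations, and uses `restrict` to promise that the arrays do not overlap. Whether a given compiler version actually vectorises it still depends on the options in use.

```c
#include <stddef.h>

/* Element-wise multiply: each iteration is independent of the others,
   and the arrays are accessed contiguously. */
void mult_array(float *restrict out, const float *restrict a,
                const float *restrict b, size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] * b[i];
}
```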





# Examples of Code Generation

| Version | Code |
|---------|------|
| C source | <pre>double A[1000], B[1000], C[1000];<br/>void add() {<br/>    int i;<br/>    for (i=0; i&lt;1000; i++)<br/>        if (A[i]&gt;0)<br/>            A[i] += B[i];<br/>        else<br/>            A[i] += C[i];<br/>}</pre> |
| SSE2 | <pre>.B1.2::<br/>movaps xmm2, A[rdx*8]<br/>xorps xmm0, xmm0<br/>cmpltpd xmm0, xmm2<br/>movaps xmm1, B[rdx*8]<br/>andps xmm1, xmm0<br/>andnps xmm0, C[rdx*8]<br/>orps xmm1, xmm0<br/>addpd xmm2, xmm1<br/>movaps A[rdx*8], xmm2<br/>add rdx, 2<br/>cmp rdx, 1000<br/>jl .B1.2</pre> |
| SSE4.1 | <pre>.B1.2::<br/>movaps xmm2, A[rdx*8]<br/>xorps xmm0, xmm0<br/>cmpltpd xmm0, xmm2<br/>movaps xmm1, C[rdx*8]<br/>blendvpd xmm1, B[rdx*8], xmm0<br/>addpd xmm2, xmm1<br/>movaps A[rdx*8], xmm2<br/>add rdx, 2<br/>cmp rdx, 1000<br/>jl .B1.2</pre> |
| AVX | <pre>.B1.2::<br/>vmovaps ymm3, A[rdx*8]<br/>vmovaps ymm1, C[rdx*8]<br/>vcmpgtpd ymm2, ymm3, ymm0<br/>vblendvpd ymm4, ymm1, B[rdx*8], ymm2<br/>vaddpd ymm5, ymm3, ymm4<br/>vmovaps A[rdx*8], ymm5<br/>add rdx, 4<br/>cmp rdx, 1000<br/>jl .B1.2</pre> |
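The SSE2 sequence replaces the branch with a compare mask followed by and / and-not / or. A hand-written sketch of the same idea with SSE2 intrinsics might look as follows. It is an illustration under the assumption of `double` arrays, not compiler output, and `add_masked` and `LEN` are hypothetical names.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics for packed doubles */

#define LEN 1000
static_assert(LEN % 2 == 0, "LEN must be a multiple of the vector length");

static double A[LEN], B[LEN], C[LEN];

/* Branch-free form of: if (A[i] > 0) A[i] += B[i]; else A[i] += C[i]; */
static void add_masked(void) {
    const __m128d zero = _mm_setzero_pd();
    for (int i = 0; i < LEN; i += 2) {   /* 2 doubles per 128-bit vector */
        __m128d a    = _mm_loadu_pd(&A[i]);
        __m128d mask = _mm_cmplt_pd(zero, a);                      /* all ones where 0 < A[i] */
        __m128d b    = _mm_and_pd(mask, _mm_loadu_pd(&B[i]));     /* B where the mask is set */
        __m128d c    = _mm_andnot_pd(mask, _mm_loadu_pd(&C[i]));  /* C everywhere else */
        _mm_storeu_pd(&A[i], _mm_add_pd(a, _mm_or_pd(b, c)));
    }
}
```

SSE4.1 and AVX collapse the and / and-not / or triple into a single blend instruction, which is why those listings are shorter.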




#### "Loop was not vectorized" because:

- "Existence of vector dependence"
- "Non-unit stride used"
- "Mixed Data Types"
- "Condition too Complex"
- "Condition may protect exception"
- "Low trip count"
- "Subscript too complex"
- "Unsupported Loop Structure"
- "Contains unvectorizable statement at line XX"
- "Not Inner Loop"
- "vectorization possible but seems inefficient"
- "Operator unsuited for vectorization"
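Two of these messages can be illustrated with small loops. The functions below are hypothetical examples, and the exact message a compiler prints for them may vary with version and options.

```c
#include <assert.h>
#include <stddef.h>

/* Likely "existence of vector dependence": iteration i reads the value
   written by iteration i-1, so iterations cannot run side by side. */
void prefix_sum(float *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        a[i] += a[i - 1];
}

/* Likely "non-unit stride used": consecutive iterations touch elements
   that are `stride` apart, so one contiguous vector load cannot fetch them. */
void scale_strided(float *a, size_t n, size_t stride, float k) {
    assert(stride > 0);  /* a zero stride would never terminate */
    for (size_t i = 0; i < n; i += stride)
        a[i] *= k;
}
```

Restructuring the data so that the hot loop walks memory contiguously, with no value carried from one iteration to the next, is often what turns such a loop into a vectorisable one.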





### **Elemental Functions**

- Use scalar syntax to describe an operation on a single element
- Apply operation to arrays in parallel
- Utilize both vector parallelism and core parallelism

```
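#include <math.h>
#include <cilk/cilk.h>

/* Elemental function: written for one element, applied across arrays.
   N() is the cumulative normal distribution function, defined elsewhere. */
__declspec(vector)
double option_price_call_black_scholes
    (double S,double K,double r,double sigma,double time)
{
    double time_sqrt = sqrt(time);
    double d1 =
        (log(S/K)+r*time)/(sigma*time_sqrt)+0.5*sigma*time_sqrt;
    double d2 = d1-(sigma*time_sqrt);
    return S*N(d1) - K*exp(-r*time)*N(d2);
}

/* Call site (inside a function): cilk_for spreads iterations across cores,
   and each core can call the vector version of the function. */
cilk_for(int i=0; i < NUM_OPTIONS; i++) {
    call_serial[i] = option_price_call_black_scholes(S[i], K[i], r, sigma, time[i]);
}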
```

### **CPU-Dispatch**

#### **Adding Portability**




# "I've stopped using the Intel compiler. Each time I ship the product to a customer, they complain that the application crashes!"

A games developer at a recent networking event.




### Imagine this scenario:

1. Your IT dept have just bought you the latest and greatest Intel based workstation.
2. You've heard **auto-vectorisation** can make a real difference to performance.
3. You enable auto-vectorisation using **-xhost**.
4. You boast to your colleagues, "my application runs faster than anything you can write..."
5. You send the application to a colleague, and it refuses to run.


# What might be the issue?

# How can it be overcome?


#### Two Key Decisions to be Made:

1. How do we **introduce** the vector code?

2. How do we deal with the **Multiple** SIMD instruction set **extensions** like SSE, SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX ...?




#### Out-of-the-box behaviour – Intel Compiler

- Automatic vectorisation is enabled by default (turn it off with -no-vec)
- The option -msse2 is used by default (as long as no -x, -ax or -m option has been used)

-msse2: "May generate Intel<sup>®</sup> SSE2 and SSE instructions. This value is only available on Linux systems."



### Building for non-Intel processors (-m)

| Option | Description                                                                      |
|--------|----------------------------------------------------------------------------------|
| sse4.1 | May generate Intel <sup>®</sup> SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions. |
| ssse3  | May generate Intel <sup>®</sup> SSSE3, SSE3, SSE2, and SSE instructions.         |
| sse2   | May generate Intel <sup>®</sup> SSE2 and SSE instructions.                       |
| sse    | This option has been deprecated; it is now the same as specifying ia32.          |
| ia32   | Generates x86/x87 generic code that is compatible with IA-32 architecture.       |

This option tells the compiler to generate code specialized for the processor that executes your program.

Code generated with these options should execute on any compatible, non-Intel processor with support for the corresponding instruction set.





### Building for Intel processors (-x)

| Option    | Description                                                                                                                                                                                                                                          |
|-----------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| AVX       | AVX, SSE4.2, SSE4.1, SSSE3, SSE3, SSE2, and SSE instructions. |
| SSE4.2    | SSE4 Efficient Accelerated String and Text Processing instructions supported by Intel<sup>®</sup> Core<sup>™</sup> i7 processors, plus SSE4.1, SSSE3, SSE3, SSE2, and SSE. May optimize for the Intel<sup>®</sup> Core<sup>™</sup> processor family. |
| SSE4.1    | SSE4 Vectorizing Compiler and Media Accelerator instructions, plus SSSE3, SSE3, SSE2, and SSE. May optimize for Intel<sup>®</sup> 45nm Hi-k next generation Intel<sup>®</sup> Core<sup>™</sup> microarchitecture. |
| SSE3_ATOM | MOVBE (depending on -minstruction), SSSE3, SSE3, SSE2, and SSE. Optimizes for the Intel<sup>®</sup> Atom<sup>™</sup> processor and Intel<sup>®</sup> Centrino<sup>®</sup> Atom<sup>™</sup> Processor Technology. |
| SSSE3     | SSSE3, SSE3, SSE2, and SSE. Optimizes for the Intel<sup>®</sup> Core<sup>™</sup> microarchitecture. |
| SSE3      | SSE3, SSE2, and SSE. Optimizes for the enhanced Pentium<sup>®</sup> M processor microarchitecture and Intel NetBurst<sup>®</sup> microarchitecture. |
| SSE2      | SSE2 and SSE. Optimizes for the Intel NetBurst<sup>®</sup> microarchitecture. |








Software & Services Group Developer Products Division

Copyright© 2011, Intel Corporation. All rights reserved. \*Other brands and names are the property of their respective owners.









#### Using -ax compiler option ...

- Generates multiple code paths if there is a performance benefit
- Generates a base line path
- Other options (e.g. -O3) control the base line path
- At run time, a path is chosen based on the processor the code is running on
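Conceptually, this kind of dispatch amounts to choosing a function pointer once, based on what the processor reports. The sketch below illustrates the idea only and is not how the Intel compiler implements -ax. It assumes GCC or Clang on x86, whose `__builtin_cpu_init` and `__builtin_cpu_supports` builtins query CPU features, and all function names are hypothetical. `sum_tuned` is a plain C stand-in for a path that a real build would compile for a newer instruction set.

```c
#include <assert.h>
#include <stddef.h>

/* Base line path: runs on any processor the program targets. */
static float sum_baseline(const float *a, size_t n) {
    float s = 0.0f;
    for (size_t i = 0; i < n; i++) s += a[i];
    return s;
}

/* Stand-in for a path tuned for newer processors (four partial sums). */
static float sum_tuned(const float *a, size_t n) {
    float s[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
        for (int k = 0; k < 4; k++) s[k] += a[i + k];
    float total = (s[0] + s[1]) + (s[2] + s[3]);
    for (; i < n; i++) total += a[i];   /* leftover elements */
    return total;
}

typedef float (*sum_fn)(const float *, size_t);

/* Choose a path based on the processor the code is running on. */
static sum_fn select_sum(void) {
    __builtin_cpu_init();   /* GCC/Clang builtin: initialise CPU feature data */
    return __builtin_cpu_supports("avx") ? sum_tuned : sum_baseline;
}

/* Public entry point: selects once, then always calls the chosen path. */
float dispatch_sum(const float *a, size_t n) {
    static sum_fn chosen = NULL;
    assert(a != NULL || n == 0);
    if (chosen == NULL) chosen = select_sum();
    return chosen(a, n);
}
```

Callers only ever see `dispatch_sum`; the choice of path stays invisible to them, which mirrors how a single binary can serve several processor generations.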




#### The Base line

- Use -m or -x to set the base line
  - **-m** for non-Intel processors
  - **-x** for Intel processors
- If neither -m nor -x is given, the compiler defaults to -msse2
- -m and -x are mutually exclusive






#### **Running on Intel Processors**

- If -ax and -x are used together, the base line will execute on Intel processors compatible with the processor type specified by -x





#### Running on Intel and non-Intel processors

- If -ax and -m are used together, the base line will execute on non-Intel processors compatible with the processor type specified by -m





#### What option do AMD recommend?

AMD Opteron<sup>™</sup> 6100 Series P AMD Opteron<sup>™</sup> 4100 Series P Compiler Options Quick Referen

#### ICC

Latest release: 12.0 update3, March 2011 http://software.intel.com

| Purpose | ICC option |
|---------|------------|
| Architecture: generate instructions specific to Magny-Cours | -msse3 (avoid -ax) |
| Optimization levels: disable all optimizations | -O0 |

http://developer.amd.com/Assets/CompilerOptQuickRef-61004100.pdf



#### Quiz – what option is best?

1. Your application will only ever run on the same CPU as your development machine.
2. Your application will run on a farm of AMD Opterons (4100) and Intel i7s.
3. Your application will run on Sandy Bridge machines and Core 2.
4. You have no clue what machine the code will run on.



### **Benefit of CPU Dispatch**

#### Code

- Still works on older processors
- Works properly on non-Intel CPUs
  - Non-Intel processors will ALWAYS take the base line
- Can take advantage of the latest generation of CPUs



## **Manual Processor Dispatch**







#### Manual processor Dispatch

- Allows you to write processor-specific code
- Provide more than one version of code
- Use \_\_declspec(cpu\_dispatch(cpuid, cpuid, ...))



#### **CPUID Arguments**

| Argument for cpuid | Processors |
|--------------------|------------|
| future_cpu_16 (subject to change) | 2nd generation Intel<sup>®</sup> Core<sup>™</sup> processor family with support for Intel<sup>®</sup> Advanced Vector Extensions (Intel<sup>®</sup> AVX) |
| core_aes_pclmulqdq | Intel<sup>®</sup> Core<sup>™</sup> processors with support for Advanced Encryption Standard (AES) instructions and carry-less multiplication instruction |
| core_i7_sse4_2 | Intel<sup>®</sup> Core<sup>™</sup> processor family with support for Intel<sup>®</sup> SSE4 Efficient Accelerated String and Text Processing instructions (SSE4.2) |
| atom | Intel<sup>®</sup> Atom<sup>™</sup> processors |
| core_2_duo_sse4_1 | Intel<sup>®</sup> 45nm Hi-k next generation Intel<sup>®</sup> Core<sup>™</sup> microarchitecture processors with support for Intel<sup>®</sup> SSE4 Vectorizing Compiler and Media Accelerators instructions (SSE4.1) |
| core_2_duo_ssse3 | Intel<sup>®</sup> Core<sup>™</sup> 2 Duo processors and Intel<sup>®</sup> Xeon<sup>®</sup> processors with Intel<sup>®</sup> Supplemental Streaming SIMD Extensions 3 (SSSE3) |
| pentium_4_sse3 | Intel<sup>®</sup> Pentium 4 processor with Intel<sup>®</sup> Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSE3), Intel<sup>®</sup> Core<sup>™</sup> Duo processors, Intel<sup>®</sup> Core<sup>™</sup> Solo processors |
| pentium_4 | Intel<sup>®</sup> Pentium 4 processors |
| pentium_m | Intel<sup>®</sup> Pentium M processors |
| pentium_iii | Intel<sup>®</sup> Pentium III processors |
| generic | Other IA-32 or Intel 64 processors or compatible processors not provided by Intel Corporation |





#### Manual Dispatch Example

```
#include <stdio.h>

// Declare the dispatch stub and list the processor-specific versions.
// The compiler generates the dispatch code, so the body stays empty.
__declspec(cpu_dispatch(generic, future_cpu_16))
void dispatch_func() {};

// Version for non-Intel and generic Intel processors.
__declspec(cpu_specific(generic))
void dispatch_func() {
  printf("Code for non-Intel processors and generic Intel\n");
}

// Version for 2nd generation Intel Core processors (Intel AVX).
__declspec(cpu_specific(future_cpu_16))
void dispatch_func() {
  printf("Code for 2nd generation Intel Core processors goes here\n");
}

int main() {
  dispatch_func();
  printf("Return from dispatch_func\n");
  return 0;
}
```



#### **Questions to Ask**

- Is my application going to run on a different CPU to my development platform?
- Is my application going to run on one specific generation of CPU?
- Is my application going to run only on Intel CPUs?
- Will my application be running on non-Intel processors?



# A Case Study

# **An Engine Simulator**




#### **The Simulation Environment**





#### **The Simulation Frames**






#### Matlab design of the Engine Simulator







#### **Results on 100k loop simulation**

| CPU     | No Auto-<br>Vectorisation | With Auto-<br>Vectorisation | Speedup |
|---------|---------------------------|-----------------------------|---------|
| P4      | 39.344                    | 21.9                        | 1.80    |
| Core 2  | 5.546                     | 0.515                       | 10.77   |
| Speedup | 7.09                      | 42.52                       | 76      |




#### VTune confirms reason for Speedup

| CPU EVENT             | Without Vect   | With Vect     |
|-----------------------|----------------|---------------|
| CPU_CLK_UNHALTED.CORE | 16,641,000,448 | 1,548,000,000 |
| INST_RETIRED.ANY      | 3,308,999,936  | 1,395,000,064 |
| X87_OPS_RETIRED.ANY   | 250,000,000    | 0             |
| SIMD_INST_RETIRED     | 0              | 763,000,000   |
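As a worked reading of the table (an illustrative derivation, not taken from the paper), cycles per instruction (CPI) follows from dividing clock ticks by instructions retired:

$$
\mathrm{CPI}_{\text{without}} = \frac{16{,}641{,}000{,}448}{3{,}308{,}999{,}936} \approx 5.03,
\qquad
\mathrm{CPI}_{\text{with}} = \frac{1{,}548{,}000{,}000}{1{,}395{,}000{,}064} \approx 1.11
$$

The clock-tick ratio $16{,}641{,}000{,}448 / 1{,}548{,}000{,}000 \approx 10.75$ is in line with the speedup reported for Core 2 in the results table, while instructions retired fall by only about $2.37\times$. This suggests that most of the gain comes from each instruction costing fewer cycles once the x87 operations are replaced by SIMD instructions.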

Full paper available here: <u>http://edc.intel.com/Link.aspx?id=1045</u>




#### Summary of Simulation Performance Improvements

- Performance gains through **migrating** to newer Silicon
- Performance gains by using Intel compiler.





#### **Closing Remarks**

- Try auto-vectorisation: it can make a difference!
- Out-of-the-box use does not deliver the best optimisation
- If you are running on more than one generation of CPU, use -ax (CPU dispatching)
- Use the -m option for non-Intel CPUs




## **Any Questions**




#### **Optimization Notice**


Intel<sup>®</sup> compilers, associated libraries and associated development tools may include or utilize options that optimize for instruction sets that are available in both Intel<sup>®</sup> and non-Intel microprocessors (for example SIMD instruction sets), but do not optimize equally for non-Intel microprocessors. In addition, certain compiler options for Intel compilers, including some that are not specific to Intel micro-architecture, are reserved for Intel microprocessors. For a detailed description of Intel compiler options, including the instruction sets and specific microprocessors they implicate, please refer to the "Intel<sup>®</sup> Compiler User and Reference Guides" under "Compiler Options." Many library routines that are part of Intel<sup>®</sup> compiler products are more highly optimized for Intel microprocessors than for other microprocessors. While the compilers and libraries in Intel<sup>®</sup> compiler products offer optimizations for both Intel and Intel-compatible microprocessors, depending on the options you select, your code and other factors, you likely will get extra performance on Intel microprocessors.

Intel<sup>®</sup> compilers, associated libraries and associated development tools may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include Intel<sup>®</sup> Streaming SIMD Extensions 2 (Intel<sup>®</sup> SSE2), Intel<sup>®</sup> Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSE3), and Supplemental Streaming SIMD Extensions 3 (Intel<sup>®</sup> SSSE3) instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessors-dependent optimizations in this product are intended for use with Intel microprocessors.

While Intel believes our compilers and libraries are excellent choices to assist in obtaining the best performance on Intel<sup>®</sup> and non-Intel microprocessors, Intel recommends that you evaluate other compilers and libraries to determine which best meet your requirements. We hope to win your business by striving to offer the best performance of any compiler or library; please let us know if you find we do not.

Notice revision #20101101



#### Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, reference www.intel.com/software/products.

BunnyPeople, Celeron, Celeron Inside, Centrino, Centrino Atom, Centrino Atom Inside, Centrino Inside, Centrino logo, Cilk, Core Inside, FlashFile, i960, InstantIP, Intel, the Intel logo, Intel386, Intel486, IntelDX2, IntelDX4, IntelSX2, Intel Atom, Intel Atom Inside, Intel Core, Intel Inside, Intel Inside logo, Intel. Leap ahead., Intel. Leap ahead. logo, Intel NetBurst, Intel NetMerge, Intel NetStructure, Intel SingleDriver, Intel SpeedStep, Intel StrataFlash, Intel Viiv, Intel vPro, Intel XScale, Itanium, Itanium Inside, MCS, MMX, Oplus, OverDrive, PDCharm, Pentium, Pentium Inside, skoool, Sound Mark, The Journey Inside, Viiv Inside, vPro Inside, VTune, Xeon, and Xeon Inside are trademarks of Intel Corporation in the U.S. and other countries.

\*Other names and brands may be claimed as the property of others.

Copyright © 2010, Intel Corporation.








