#### Intel SIMD architecture



Z. Jerry Shi

#### Associate Professor of Computer Science and Engineering University of Connecticut

Revised from Yung-Yu Chuang's slides, 2007

## Overview

- SIMD
- MMX architectures
- MMX instructions
- Examples
- SSE/SSE2/SSE3/SSE4

- SIMD instructions are probably the best place to use assembly
  - Compilers usually do not do a good job of using these instructions

### **Performance boost**

- Increasing clock rate is not fast enough for boosting performance
- Architecture improvements (such as pipeline/cache/SIMD) are more significant
- Multimedia applications share the following characteristics:
  - Small native data types (8-bit pixel, 16-bit audio)
  - Recurring operations
  - Inherent parallelism

### SIMD

- SIMD (single instruction, multiple data) architecture performs the same operation on multiple data elements in parallel

PADDW MM0, MM1
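As a rough illustration of what a single `PADDW` does, the following C sketch models the four 16-bit lanes of a 64-bit register with a plain loop. The function name `paddw_model` and the array representation of a register are assumptions made for this example; the hardware performs the four additions with one instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of PADDW: adds four 16-bit lanes with wraparound.
   dst and src each stand in for one 64-bit MMX register. */
void paddw_model(uint16_t dst[4], const uint16_t src[4])
{
    assert(dst != 0 && src != 0);
    for (int lane = 0; lane < 4; ++lane) {
        /* The cast keeps only the low 16 bits of each sum. */
        dst[lane] = (uint16_t)(dst[lane] + src[lane]);
    }
}
```

Each lane is independent: a carry out of one 16-bit lane never reaches its neighbor.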



#### SISD/SIMD/Streaming



# IA-32 SIMD development

- MMX (<u>Multimedia Extension</u>) was introduced in 1996
  - Pentium with MMX and Pentium II
- SSE (<u>Streaming SIMD Extension</u>) was introduced with Pentium III
- SSE2 was introduced with Pentium 4
- SSE3 was introduced with the Prescott revision of Pentium 4
  - For supporting hyper-threading technology
  - 13 more instructions
- SSSE3 (Supplemental) in June 2006 in the "Woodcrest" Xeons
  - 16 new discrete instructions
- SSE4 in spring 2007 has 54 new instructions
  - 47 in SSE4.1
  - 7 in SSE 4.2
- SSE4a from AMD is different from SSE4.1

# MMX

- Typical elements in many applications are small
  - 8 bits for pixels
  - 16 bits for audio
  - 32 bits for general computing
- New data type: 64-bit packed data type. Why 64 bits?
  - Good enough
  - Practical, see in a moment

## MMX data types



# **MMX integration into IA-32**



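When an MMX register is written, bits 79-64 of the aliased FPU register are set to all ones, so the contents read as NaN or infinity if interpreted as a floating-point number.

Even though MMX registers are 64 bits wide, they do not turn the Pentium into a 64-bit CPU, since only logical, shift, and move instructions operate on full 64-bit data.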

8 MMX Registers MM0~MM7

# Compatibility

- Fully compatible with existing IA
  - No new mode or state was created
  - No extra state needs to be saved for context switching
- MMX is hidden behind FPU
  - When floating-point state is saved or restored, MMX is saved or restored.
- An existing OS can context-switch processes that execute MMX instructions without being aware of MMX
  - The price: MMX and FPU cannot be used at the same time
    - It may have been a bad decision
    - The OS could simply have provided a service pack or been updated
    - Intel introduced SSE later without any aliasing

## **MMX instructions**

- 57 MMX instructions
  - add, subtract, multiply, multiply-add
  - compare
  - shift, logical operation
  - data conversion
  - 64-bit data move
- All instructions except for data move use MMX registers as operands
  - All mnemonics start with p except for movd, movq, and emms
- Most complete support for 16-bit operations

## **MMX instructions**

| Category   | Operation                | Wraparound                      | Signed Saturation  | Unsigned Saturation |
|------------|--------------------------|---------------------------------|--------------------|---------------------|
| Arithmetic | Addition                 | PADDB, PADDW, PADDD             | PADDSB, PADDSW     | PADDUSB, PADDUSW    |
|            | Subtraction              | PSUBB, PSUBW, PSUBD             | PSUBSB, PSUBSW     | PSUBUSB, PSUBUSW    |
|            | Multiplication           | PMULLW, PMULHW                  |                    |                     |
|            | Multiply and Add         | PMADDWD                         |                    |                     |
| Comparison | Compare for Equal        | PCMPEQB, PCMPEQW, PCMPEQD       |                    |                     |
|            | Compare for Greater Than | PCMPGTB, PCMPGTW, PCMPGTD       |                    |                     |
| Conversion | Pack                     |                                 | PACKSSWB, PACKSSDW | PACKUSWB            |
|            | Unpack High              | PUNPCKHBW, PUNPCKHWD, PUNPCKHDQ |                    |                     |
|            | Unpack Low               | PUNPCKLBW, PUNPCKLWD, PUNPCKLDQ |                    |                     |

## **MMX instructions**

|                    |                                                                     | Packed                                       | Full Quadword                |
|--------------------|---------------------------------------------------------------------|----------------------------------------------|------------------------------|
| Logical            | And<br>And Not<br>Or<br>Exclusive OR                                |                                              | PAND<br>PANDN<br>POR<br>PXOR |
| Shift              | Shift Left Logical<br>Shift Right Logical<br>Shift Right Arithmetic | PSLLW, PSLLD<br>PSRLW, PSRLD<br>PSRAW, PSRAD | PSLLQ<br>PSRLQ               |
|                    |                                                                     | Doubleword Transfers                         | Quadword Transfers           |
| Data Transfer      | Register to Register<br>Load from Memory<br>Store to Memory         | MOVD<br>MOVD<br>MOVD                         | MOVQ<br>MOVQ<br>MOVQ         |
| Empty MMX<br>State |                                                                     | EMMS                                         |                              |

Call EMMS before you switch from MMX code to FPU code.

## **Saturation arithmetic**

- Useful in graphics applications.
- When an operation overflows or underflows, the result becomes the largest or smallest possible representable number.
- Two types: signed and unsigned saturation
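To make the difference between wraparound and the two saturation modes concrete, here is a minimal per-lane sketch in C. The helper names are made up for this illustration; each one models a single byte lane of `PADDB`, `PADDUSB`, and `PADDSB` respectively.

```c
#include <assert.h>
#include <stdint.h>

/* Wraparound add of two unsigned bytes (one lane of PADDB). */
uint8_t add_wrap_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* Unsigned saturating add (one lane of PADDUSB): clamp to 255. */
uint8_t add_sat_u8(uint8_t a, uint8_t b)
{
    unsigned sum = (unsigned)a + b;
    return (uint8_t)(sum > 255u ? 255u : sum);
}

/* Signed saturating add (one lane of PADDSB): clamp to [-128, 127]. */
int8_t add_sat_s8(int8_t a, int8_t b)
{
    int sum = (int)a + b;
    if (sum > INT8_MAX) sum = INT8_MAX;
    if (sum < INT8_MIN) sum = INT8_MIN;
    return (int8_t)sum;
}
```

For pixel data, 200 + 100 wraps to 44 (a dark pixel) but saturates to 255 (white), which is usually the visually sensible answer.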



#### wrap-around

#### saturating

# Arithmetic

- **PADDB/PADDW/PADDD**: add two packed numbers. No EFLAGS bits are set, so you must ensure yourself that overflow never occurs
- Multiplication takes two steps
- **PMULLW**: multiplies four words and stores the low words of the four doubleword results
- **PMULHW/PMULHUW**: multiplies four words and stores the high words of the four doubleword results. **PMULHUW** is the unsigned version.
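The two-step multiplication can be sketched per lane in C as follows. The function names are hypothetical; `mul_lo16` and `mul_hi16` model one lane of `PMULLW` and `PMULHW`, and `mul_full` shows how the two halves recombine into the full 32-bit product. The final signed conversions assume a two's-complement target.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits of the 32-bit signed product (one lane of PMULLW). */
uint16_t mul_lo16(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)((uint32_t)p & 0xFFFFu);
}

/* High 16 bits of the 32-bit signed product (one lane of PMULHW). */
uint16_t mul_hi16(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (uint16_t)(((uint32_t)p >> 16) & 0xFFFFu);
}

/* Recombine both halves into the full 32-bit product. */
int32_t mul_full(int16_t a, int16_t b)
{
    uint32_t bits = ((uint32_t)mul_hi16(a, b) << 16) | mul_lo16(a, b);
    return (int32_t)bits;
}
```

In MMX code the recombination step is typically done with the unpack instructions, which interleave the low and high words into doublewords.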

#### Arithmetic

- **PMADDWD**: multiplies four pairs of signed words and adds adjacent 32-bit products

 $\begin{array}{l} \mathsf{DEST[31:0]} \leftarrow (\mathsf{DEST[15:0]} * \mathsf{SRC[15:0]}) + (\mathsf{DEST[31:16]} * \mathsf{SRC[31:16]}); \\ \mathsf{DEST[63:32]} \leftarrow (\mathsf{DEST[47:32]} * \mathsf{SRC[47:32]}) + (\mathsf{DEST[63:48]} * \mathsf{SRC[63:48]}); \end{array}$ 

| Operand                | Word 3              | Word 2  | Word 1              | Word 0  |
|------------------------|---------------------|---------|---------------------|---------|
| SRC                    | X3                  | X2      | X1                  | X0      |
| DEST (before)          | Y3                  | Y2      | Y1                  | Y0      |
| TEMP (32-bit products) | X3 * Y3             | X2 * Y2 | X1 * Y1             | X0 * Y0 |
| DEST (after)           | (X3 * Y3) + (X2 * Y2) |       | (X1 * Y1) + (X0 * Y0) |       |
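The `PMADDWD` operation can be modeled in a few lines of C. This is only a sketch: the names `pmaddwd_model`, `dot4_model`, and `wrap32` are invented here, and registers are represented as small arrays. It also shows why the instruction is handy for dot products, since one more addition finishes a four-element dot product.

```c
#include <assert.h>
#include <stdint.h>

/* Wrap a 64-bit value to 32 bits, the way the hardware result wraps.
   The final conversion assumes a two's-complement target. */
static int32_t wrap32(int64_t v)
{
    return (int32_t)(uint32_t)v;
}

/* Scalar model of PMADDWD: multiply four signed word pairs,
   then add adjacent products into two doublewords. */
void pmaddwd_model(const int16_t dest[4], const int16_t src[4], int32_t out[2])
{
    out[0] = wrap32((int64_t)dest[0] * src[0] + (int64_t)dest[1] * src[1]);
    out[1] = wrap32((int64_t)dest[2] * src[2] + (int64_t)dest[3] * src[3]);
}

/* Four-element dot product built on top of the model. */
int32_t dot4_model(const int16_t a[4], const int16_t b[4])
{
    int32_t pair[2];
    pmaddwd_model(a, b, pair);
    return wrap32((int64_t)pair[0] + pair[1]);
}
```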

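Example: adding 5 to each of 24 bytes, eight bytes per iteration:

    char d[] = {5, 5, 5, 5, 5, 5, 5, 5};
    char clr[] = {65, 66, 68, ..., 87, 88}; // 24 bytes
    __asm {
        movq  mm1, d
        mov   ecx, 3          // loop counts down in ECX (3 x 8 bytes)
        mov   esi, 0
    L1: movq  mm0, clr[esi]
        paddb mm0, mm1
        movq  clr[esi], mm0
        add   esi, 8
        loop  L1
        emms
    }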

## Comparison

- No EFLAGS bits are set (how many flags would you need?). Results are stored in the destination as a mask: all ones where the comparison is true, all zeros where it is false.



PCMPEQB/PCMPGTB Operation
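Because the comparison result is a mask rather than a flag, it can drive branch-free selection. The sketch below models one byte lane of `PCMPGTB` (which compares signed values) and uses the mask to compute a maximum with AND/OR only. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of PCMPGTB: signed compare, all-ones mask if a > b. */
uint8_t cmpgt_mask_s8(int8_t a, int8_t b)
{
    return (a > b) ? 0xFFu : 0x00u;
}

/* Maximum selected through the mask: (a AND m) OR (b AND NOT m). */
int8_t max_s8_via_mask(int8_t a, int8_t b)
{
    uint8_t m = cmpgt_mask_s8(a, b);
    uint8_t r = (uint8_t)(((uint8_t)a & m) | ((uint8_t)b & (uint8_t)~m));
    return (int8_t)r; /* assumes a two's-complement target */
}
```

Note that 0xFF viewed as a signed byte is -1, so it compares as smaller than 1; unsigned data needs extra care with `PCMPGT`.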

#### **Change data types**

- Pack: converts a larger data type to the next smaller data type.
- Unpack: takes two operands and interleaves them. It can be used to expand a data type for intermediate calculation.
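A small C model may help picture both directions. `punpcklbw_model` interleaves the low four bytes of two registers (with a zero source this widens bytes to little-endian words), and `pack_word_to_u8` models one lane of `PACKUSWB`. The names and the array representation are assumptions for this sketch.

```c
#include <assert.h>
#include <stdint.h>

/* Model of PUNPCKLBW: interleave the low 4 bytes of dst and src.
   Byte 0 is the least significant byte of the register. */
void punpcklbw_model(const uint8_t dst[8], const uint8_t src[8], uint8_t out[8])
{
    for (int i = 0; i < 4; ++i) {
        out[2 * i]     = dst[i];
        out[2 * i + 1] = src[i];
    }
}

/* One lane of PACKUSWB: signed word to unsigned byte with saturation. */
uint8_t pack_word_to_u8(int16_t w)
{
    if (w < 0)   return 0;
    if (w > 255) return 255;
    return (uint8_t)w;
}
```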

#### Unpack low-order words into doublewords



### Pack with signed saturation



## Pack with signed saturation



### **Unpack low portion**



### **Unpack low portion**



#### **Unpack low portion**



## **Unpack high portion**



#### **Performance boost (data from 1996)**

Benchmark kernels: FFT, FIR, vector dot-product, IDCT, motion compensation

65% performance gain

Lowers the cost of multimedia programs by removing the need for specialized DSP chips



# **Keys to SIMD programming**

- Efficient data layout
- Elimination of branches

### **Application: frame difference**





## **Application: frame difference**





# (A-B) or (B-A)



#### **Application: frame difference**

Absolute difference of 8 pixels, computed as (A-B) OR (B-A) with unsigned saturating subtraction:

    MOVQ    mm1, A    // move 8 pixels of image A
    MOVQ    mm2, B    // move 8 pixels of image B
    MOVQ    mm3, mm1  // mm3 = A
    PSUBUSB mm1, mm2  // mm1 = A-B (0 where A < B)
    PSUBUSB mm2, mm3  // mm2 = B-A (0 where B < A)
    POR     mm1, mm2  // mm1 = |A-B|
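The "(A-B) or (B-A)" idea works because, with unsigned saturation, one of the two differences is always zero. A portable C sketch of the same computation follows; `sub_sat_u8` and `frame_abs_diff` are names chosen for this example only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Unsigned saturating subtract: results below zero clamp to 0. */
static uint8_t sub_sat_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* Per-pixel absolute difference without any data-dependent branch
   in the combining step: one operand of the OR is always zero. */
void frame_abs_diff(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint8_t a_minus_b = sub_sat_u8(a[i], b[i]);
        uint8_t b_minus_a = sub_sat_u8(b[i], a[i]);
        out[i] = (uint8_t)(a_minus_b | b_minus_a);
    }
}
```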

#### **Example: image fade-in-fade-out**



Blending images A and B with weight $\alpha$:

 $A \cdot \alpha + B \cdot (1-\alpha) = B + \alpha (A-B)$ 







### **Example: image fade-in-fade-out**

- Two formats: planar and chunky
- In Chunky format, 16 bits of 64 bits are wasted
- So, we use planar in the following example



### **Example: image fade-in-fade-out**


#### **Example: image fade-in-fade-out**

Fading four pixels of one color plane:

    MOVQ      mm0, alpha  // 4 16-bit zero-padded α
    MOVD      mm1, A      // move 4 pixels of image A
    MOVD      mm2, B      // move 4 pixels of image B
    PXOR      mm3, mm3    // clear mm3 to all zeroes
    // unpack 4 pixels to 4 words
    PUNPCKLBW mm1, mm3    // Because A-B could be
    PUNPCKLBW mm2, mm3    // negative, need 16 bits
    PSUBW     mm1, mm2    // (A-B)
    PMULHW    mm1, mm0    // (A-B)*fade/256
    PADDW     mm1, mm2    // (A-B)*fade + B
    // pack four words back to four bytes
    PACKUSWB  mm1, mm3
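For checking results, a scalar reference of the fade formula $B + \alpha (A-B)$ is useful. The sketch below assumes, for illustration, that the fade weight is an 8-bit fraction `alpha/256`; the function name `fade_pixel` and that encoding are assumptions, and rounding may differ in detail from a particular MMX sequence.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the fade: out = B + alpha * (A - B),
   with alpha interpreted as the fraction alpha/256. */
uint8_t fade_pixel(uint8_t a, uint8_t b, uint8_t alpha)
{
    int prod = ((int)a - (int)b) * (int)alpha;   /* may be negative */
    /* Divide by 256 rounding toward minus infinity, as an arithmetic
       shift of the product would. */
    int scaled = (prod >= 0) ? prod / 256 : -((-prod + 255) / 256);
    int result = (int)b + scaled;
    /* Clamp like PACKUSWB (unsigned saturation to 0..255). */
    if (result < 0)   result = 0;
    if (result > 255) result = 255;
    return (uint8_t)result;
}
```

The intermediate difference needs more than 8 bits, which is why the MMX version unpacks bytes to words before subtracting.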

# **Data-independent computation**

- Each operation can execute without needing to know the results of a previous operation.
- Example: sprite overlay
  - for i = 1 to sprite\_Size
    - if sprite[i] = clr then out\_color[i] = bg[i]
    - else out\_color[i] = sprite[i]





# **Application: sprite overlay**



|                | a3          | a2          | a1          | a0          |
|----------------|-------------|-------------|-------------|-------------|
| compared (=) to | clear_color | clear_color | clear_color | clear_color |
| resulting mask | 11111111    | 00000000    | 11111111    | 00000000    |

Phase 2



### **Application: sprite overlay**

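Overlaying four 16-bit pixels:

    MOVQ    mm0, sprite
    MOVQ    mm2, mm0     // keep a copy of the sprite
    MOVQ    mm4, bg
    MOVQ    mm1, clr
    PCMPEQW mm0, mm1     // mm0 = mask, all ones where sprite == clr
    PAND    mm4, mm0     // mm4 = background where the sprite is clear
    PANDN   mm0, mm2     // mm0 = sprite where it is not clear
    POR     mm0, mm4     // combine both parts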
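The compare/AND/ANDN/OR pattern replaces the `if` of the scalar loop with mask arithmetic. Here is a minimal C model of that pattern for 16-bit pixels; the function name `sprite_overlay` and its parameter list are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Select bg where the sprite pixel equals clear_color, else the sprite.
   Each step is annotated with the MMX instruction it models. */
void sprite_overlay(const uint16_t *sprite, const uint16_t *bg,
                    uint16_t clear_color, uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        uint16_t mask = (sprite[i] == clear_color) ? 0xFFFFu : 0x0000u; /* PCMPEQW */
        uint16_t from_bg = (uint16_t)(bg[i] & mask);                    /* PAND    */
        uint16_t from_sprite = (uint16_t)(~mask & sprite[i]);           /* PANDN   */
        out[i] = (uint16_t)(from_bg | from_sprite);                     /* POR     */
    }
}
```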

### **Application: matrix transpose**



Note: Repeat for the other rows to generate ( $[d_3, c_3, b_3, a_3]$  and  $[d_2, c_2, b_2, a_2]$ ).

MMX code sequence operation:

    movq      mm1, row1  ; load pixels from first row of matrix
    movq      mm2, row2  ; load pixels from second row of matrix
    movq      mm3, row3  ; load pixels from third row of matrix
    movq      mm4, row4  ; load pixels from fourth row of matrix
    punpcklwd mm1, mm2   ; unpack low order words of rows 1 & 2, mm1 = [b1, a1, b0, a0]
    punpcklwd mm3, mm4   ; unpack low order words of rows 3 & 4, mm3 = [d1, c1, d0, c0]
    movq      mm5, mm1   ; copy mm1 to mm5
    punpckldq mm1, mm3   ; unpack low order doublewords -> mm1 = [d0, c0, b0, a0]
    punpckhdq mm5, mm3   ; unpack high order doublewords -> mm5 = [d1, c1, b1, a1]

### **Application: matrix transpose (C code)**

```
char M1[4][8];// matrix to be transposed
char M2[8][4];// transposed matrix
int n=0;
for (int i=0;i<4;i++)
  for (int j=0;j<8;j++)
      { M1[i][j]=n; n++; }
```

### **Application: matrix transpose (MMX) - 1**

```
__asm{
    //move the 4 rows of M1 into MMX registers
    movq mm1,M1
    movq mm2,M1+8
    movq mm3,M1+16
    movq mm4,M1+24
```

```
//generate rows 1 to 4 of M2
punpcklbw mm1, mm2
punpcklbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 2 & row 1
punpckhwd mm0, mm3 //mm0 has row 4 & row 3
movq M2, mm1   //save rows 1 & 2 of M2
movq M2+8, mm0 //save rows 3 & 4 of M2
```

### **Application: matrix transpose (MMX) - 2**

```
//generate rows 5 to 8 of M2
movq mm1, M1 //get row 1 of M1
movq mm3, M1+16 //get row 3 of M1
punpckhbw mm1, mm2
punpckhbw mm3, mm4
movq mm0, mm1
punpcklwd mm1, mm3 //mm1 has row 6 & row 5
punpckhwd mm0, mm3 //mm0 has row 8 & row 7
//save results to M2
movq M2+16, mm1
movq M2+24, mm0
emms
} //end
```
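A plain C transpose is a convenient reference for verifying the MMX version: run both on the same `M1` and compare the resulting `M2` byte by byte. The function name `transpose_4x8` is chosen for this sketch; the array shapes match the declarations used in the example.

```c
#include <assert.h>

/* Reference transpose of a 4x8 byte matrix into an 8x4 matrix. */
void transpose_4x8(char m1[4][8], char m2[8][4])
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 8; ++j)
            m2[j][i] = m1[i][j];
}
```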

## How to use assembly in projects

- Write the whole project in assembly
  - Link with high-level languages
- Inline assembly
- Intrinsics

Checking for MMX support with `cpuid`:

    mov  eax, 1          ; request version info
    cpuid                ; supported since Pentium
    test edx, 00800000h  ; bit 23 (MMX)
                         ; 02000000h (bit 25) SSE
                         ; 04000000h (bit 26) SSE2
    jnz  HasMMX

cpuid

| Initial EAX Value | Register           | Information Provided about the Processor                                                                                                                                                                  |
|-------------------|--------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
|                   |                    | Basic CPUID Information                                                                                                                                                                                   |
| 0H                | EAX                | Maximum Input Value for Basic CPUID Information (see Table 3-13)                                                                                                                                          |
|                   | EBX                | "Genu"                                                                                                                                                                                                    |
|                   | ECX                | "ntel"                                                                                                                                                                                                    |
|                   | EDX                | "ineI"                                                                                                                                                                                                    |
| 01H               | EAX                | Version Information: Type, Family, Model, and Stepping ID (see Figure 3-5)                                                                                                                                |
|                   | EBX                | Bits 7-0: Brand Index<br>Bits 15-8: CLFLUSH line size (Value * 8 = cache line size in bytes)<br>Bits 23-16: Maximum number of logical processors in this physical package.<br>Bits 31-24: Initial APIC ID |
|                   | ECX                | Extended Feature Information (see Figure 3-6 and Table 3-15)                                                                                                                                              |
|                   | EDX                | Feature Information (see Figure 3-7 and Table 3-16)                                                                                                                                                       |
| 02H               | EAX, EBX, ECX, EDX | Cache and TLB Information (see Table 3-17)                                                                                                                                                                |

Feature information returned in EDX by `cpuid` with EAX = 1 (bits 10, 20, and 30 are reserved):

| Bit | Flag   | Meaning                               |
|-----|--------|---------------------------------------|
| 31  | PBE    | Pending Break Enable                  |
| 29  | TM     | Thermal Monitor                       |
| 28  | HTT    | Multi-threading                       |
| 27  | SS     | Self Snoop                            |
| 26  | SSE2   | SSE2 Extensions                       |
| 25  | SSE    | SSE Extensions                        |
| 24  | FXSR   | FXSAVE/FXRSTOR                        |
| 23  | MMX    | MMX Technology                        |
| 22  | ACPI   | Thermal Monitor and Clock Control     |
| 21  | DS     | Debug Store                           |
| 19  | CLFSH  | CLFLUSH Instruction                   |
| 18  | PSN    | Processor Serial Number               |
| 17  | PSE-36 | Page Size Extension (36-bit)          |
| 16  | PAT    | Page Attribute Table                  |
| 15  | CMOV   | Conditional Move/Compare Instructions |
| 14  | MCA    | Machine Check Architecture            |
| 13  | PGE    | PTE Global Bit                        |
| 12  | MTRR   | Memory Type Range Registers           |
| 11  | SEP    | SYSENTER and SYSEXIT                  |
| 9   | APIC   | APIC on Chip                          |
| 8   | CX8    | CMPXCHG8B Instruction                 |
| 7   | MCE    | Machine Check Exception               |
| 6   | PAE    | Physical Address Extensions           |
| 5   | MSR    | RDMSR and WRMSR Support               |
| 4   | TSC    | Time Stamp Counter                    |
| 3   | PSE    | Page Size Extensions                  |
| 2   | DE     | Debugging Extensions                  |
| 1   | VME    | Virtual-8086 Mode Enhancement         |
| 0   | FPU    | x87 FPU on Chip                       |
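Once the EDX value from `cpuid` (EAX = 1) is available in a C variable, the feature tests reduce to bit masks. The sketch below only decodes a value that the caller already obtained; how `cpuid` is executed is compiler-specific and left out. The macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Feature bits in EDX after CPUID with EAX = 1. */
#define CPUID_EDX_MMX   (UINT32_C(1) << 23)
#define CPUID_EDX_SSE   (UINT32_C(1) << 25)
#define CPUID_EDX_SSE2  (UINT32_C(1) << 26)

/* Each helper returns 1 if the feature bit is set in edx, else 0. */
int has_mmx(uint32_t edx)  { return (edx & CPUID_EDX_MMX)  != 0; }
int has_sse(uint32_t edx)  { return (edx & CPUID_EDX_SSE)  != 0; }
int has_sse2(uint32_t edx) { return (edx & CPUID_EDX_SSE2) != 0; }
```

A program would typically run this check once at startup and pick an MMX, SSE, or plain C code path accordingly.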

# Link ASM and High-level Language programs

- Assembly is rarely used to develop the entire program.
- Use high-level language for overall project development
  - Relieves the programmer from low-level details
- Use assembly language code
  - Speed up critical sections of code
  - Access nonstandard hardware devices
  - Write platform-specific code
  - Extend the high-level language's capabilities

### **General conventions**

- Considerations when calling assembly language procedures from high-level languages:
  - Both must use the same naming convention (rules regarding the naming of variables and procedures)
  - Both must use the same memory model, with compatible segment names
  - Both must use the same calling convention

### Inline assembly code

- Assembly language source code that is inserted directly into a HLL program.
- Compilers such as Microsoft Visual C++ and Borland C++ have compiler-specific directives that identify inline ASM code.
- Efficient inline code executes quickly because CALL and RET instructions are not required.
- Simple to code because there are no external names, memory models, or naming conventions involved.
- Decidedly not portable because it is written for a single platform.

### \_\_asm directive in Microsoft Visual C++

- Can be placed at the beginning of a single statement
- Or it can mark the beginning of a block of assembly language statements
- Syntax:

      __asm statement

      __asm {
        statement-1
        statement-2
        ...
        statement-n
      }

### **Intrinsics**

- An *intrinsic* is a function known by the compiler that directly maps to a sequence of one or more assembly language instructions.
- The compiler manages things that the user would normally have to be concerned with, such as register names, register allocations, and memory locations of data.
- Intrinsic functions are inherently more efficient than called functions because no calling linkage is required. But, not necessarily as efficient as assembly.
- Naming convention: `_mm_<opcode>_<suffix>`
  - ps: packed single-precision
  - ss: scalar single-precision

#### **Intrinsics**

```
#include <xmmintrin.h>

// packed addition of four floats
__m128 a, b, c;
c = _mm_add_ps(a, b);

// equivalent scalar loop
float a[4], b[4], c[4];
for (int i = 0; i < 4; ++i)
    c[i] = a[i] + b[i];

// a = b * c + d / e;
__m128 a = _mm_add_ps(_mm_mul_ps(b, c),
                      _mm_div_ps(d, e));
```

### **SSE features**

- Add eight 128-bit data registers (XMM registers)
- Sixteen XMM registers are available in 64-bit mode
- 32-bit MXCSR register (control and status)
- Add a new data type
  - 4 single-precision floating-point numbers in a 128-bit register
- New instructions:
  - Instructions to perform SIMD operations on 128-bit packed single-precision FP
  - Additional 64-bit SIMD integer operations
  - Instructions that explicitly prefetch data, control data cacheability and ordering of stores

In MMX

- An application cannot execute MMX instructions and perform floating-point operations simultaneously.
- A large number of processor clock cycles are needed to change the state of executing MMX instructions to the state of executing FP operations and vice versa.

## **SSE programming environment**



#### Exception

```
_MM_ALIGN16 float test1[4] = { 0, 0, 0, 1 };
_MM_ALIGN16 float test2[4] = { 1, 2, 3, 0 };
_MM_ALIGN16 float out[4];
_MM_SET_EXCEPTION_MASK(0); // enable (unmask) exceptions
                           // Without this, result is 1.#INF
__try {
  __m128 a = _mm_load_ps(test1);
  __m128 b = _mm_load_ps(test2);
  a = _mm_div_ps(a, b);    // last lane computes 1 / 0
  _mm_store_ps(out, a);
} __except(EXCEPTION_EXECUTE_HANDLER) {
  if (_mm_getcsr() & _MM_EXCEPT_DIV_ZERO)
     cout << "Divide by zero" << endl;
  return;
}
```

### **MXCSR control and status register**

| Bit(s) | Field                   |
|--------|-------------------------|
| 31-16  | Reserved                |
| 15     | Flush to Zero           |
| 14-13  | Rounding Control        |
| 12     | Precision Mask          |
| 11     | Underflow Mask          |
| 10     | Overflow Mask           |
| 9      | Divide-by-Zero Mask     |
| 8      | Denormal Operation Mask |
| 7      | Invalid Operation Mask  |
| 6      | Denormals Are Zeros     |
| 5      | Precision Flag          |
| 4      | Underflow Flag          |
| 3      | Overflow Flag           |
| 2      | Divide-by-Zero Flag     |
| 1      | Denormal Flag           |
| 0      | Invalid Operation Flag  |

### **SSE Packed FP Operation**



Packed single-precision FP: ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, RSQRTPS, MAXPS, MINPS

### **SSE Scalar FP Operation**



Scalar single-precision FP (used like the FPU): ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, RSQRTSS, MAXSS, MINSS

#### **SSE Shuffle Packed Single-Precision (SHUFPS)**

#### SHUFPS xmm1, xmm2, imm8

| imm8 bits (SELECT) | Selects | From        |
|--------------------|---------|-------------|
| [1:0]              | DEST[0] | DEST (xmm1) |
| [3:2]              | DEST[1] | DEST        |
| [5:4]              | DEST[2] | SRC (xmm2)  |
| [7:6]              | DEST[3] | SRC         |

Operation:

    CASE (SELECT[1:0]) OF
        0: DEST[31:0] ← DEST[31:0];
        1: DEST[31:0] ← DEST[63:32];
        2: DEST[31:0] ← DEST[95:64];
        3: DEST[31:0] ← DEST[127:96];
    ESAC;
    CASE (SELECT[3:2]) OF
        0: DEST[63:32] ← DEST[31:0];
        1: DEST[63:32] ← DEST[63:32];
        2: DEST[63:32] ← DEST[95:64];
        3: DEST[63:32] ← DEST[127:96];
    ESAC;
    CASE (SELECT[5:4]) OF
        0: DEST[95:64] ← SRC[31:0];
        1: DEST[95:64] ← SRC[63:32];
        2: DEST[95:64] ← SRC[95:64];
        3: DEST[95:64] ← SRC[127:96];
    ESAC;
    CASE (SELECT[7:6]) OF
        0: DEST[127:96] ← SRC[31:0];
        1: DEST[127:96] ← SRC[63:32];
        2: DEST[127:96] ← SRC[95:64];
        3: DEST[127:96] ← SRC[127:96];
    ESAC;
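The four selection cases collapse to four array lookups once the immediate is split into 2-bit fields. The following C model is a sketch: `shufps_model` is an invented name, registers are float arrays with element 0 as the lowest lane, and the result goes to a separate array so that all reads see the original `DEST`.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of SHUFPS dest, src, imm8.
   The two low result lanes come from dest, the two high lanes from src. */
void shufps_model(const float dest[4], const float src[4],
                  uint8_t imm8, float out[4])
{
    assert(out != dest && out != src);   /* keep inputs intact while reading */
    out[0] = dest[imm8 & 3];
    out[1] = dest[(imm8 >> 2) & 3];
    out[2] = src[(imm8 >> 4) & 3];
    out[3] = src[(imm8 >> 6) & 3];
}
```

For example, passing the same register as both operands with `imm8 = 0x1B` (binary 00 01 10 11) reverses the four lanes.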

# **Swap bytes with SHUFPS**







#### SSE Unpack Shuffle (UNPCKLPS and UNPCKHPS)



### SSE MOVLPS, MOVLHPS, MOVAPS



Many types: SS, LPS, HPS, APS, UPS, HLPS, LHPS

# **SSE Comparison for Single-precision FP**



Two versions: packed (ps) and scalar (ss)

- Packed: CMPEQPS, CMPNEQPS, CMPLTPS, CMPNLTPS, CMPLEPS, CMPNLEPS
- Scalar: CMPEQSS, CMPNEQSS, CMPLTSS, CMPNLTSS, CMPLESS, CMPNLESS

# **SSE Instruction Set (Floating-point Instructions)**

- Memory-to-Register / Register-to-Memory / Register-to-Register data movement
  - Scalar MOVSS
  - Packed MOVAPS, MOVUPS, MOVLPS, MOVHPS, MOVLHPS, MOVHLPS
- Arithmetic
  - Scalar ADDSS, SUBSS, MULSS, DIVSS, RCPSS, SQRTSS, MAXSS, MINSS, RSQRTSS
  - Packed ADDPS, SUBPS, MULPS, DIVPS, RCPPS, SQRTPS, MAXPS, MINPS, RSQRTPS
- Compare
  - Scalar CMPSS, COMISS, UCOMISS
  - Packed CMPPS
- Data shuffle and unpacking
  - Packed SHUFPS, UNPCKHPS, UNPCKLPS
- Data-type conversion
  - Scalar CVTSI2SS, CVTSS2SI, CVTTSS2SI
  - Packed CVTPI2PS, CVTPS2PI, CVTTPS2PI
- Bitwise logical operations
  - Packed ANDPS, ORPS, XORPS, ANDNPS

## **SSE Instruction Set (Integer and Other)**

#### **Integer instructions**

- Arithmetic
  - PMULHUW, PSADBW, PAVGB, PAVGW, PMAXUB, PMINUB, PMAXSW, PMINSW
- Data movement
  - PEXTRW, PINSRW
- Other
  - PMOVMSKB, PSHUFW

#### **Other instructions**

- MXCSR management
  - LDMXCSR, STMXCSR
- Cache and Memory management
  - PREFETCHT0, PREFETCHT1, PREFETCHT2, PREFETCHNTA
  - MOVNTQ, MOVNTPS, MASKMOVQ, SFENCE

#### **SSE Example: Packed Floating-point Addition**

```
void add(float *a, float *b, float *c) {
  for (int i = 0; i < 4; i++)
    c[i] = a[i] + b[i];
}

// The same operation as MSVC-style inline assembly (32-bit build).
// a, b and c must point to 16-byte aligned data.
//   movaps: move aligned packed SP FP
//   addps:  add packed SP FP
__asm {
  mov eax, a
  mov edx, b
  mov ecx, c
  movaps xmm0, XMMWORD PTR [eax]
  addps xmm0, XMMWORD PTR [edx]
  movaps XMMWORD PTR [ecx], xmm0
}
```
#### **SSE Example: PS Addition with Intrinsics**

```
__m128 a = _mm_load_ps(input1);
__m128 b = _mm_load_ps(input2);
__m128 t = _mm_add_ps(a, b);
_mm_store_ps(output, t);
```
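
The fragment above can be wrapped into a complete function. The sketch below is one possible way to do it; the name `add_ps` and the length parameter `n` are assumptions, and it uses the unaligned `_mm_loadu_ps`/`_mm_storeu_ps` instead of the aligned forms on the slide so that it works on arbitrary arrays.

```c
#include <xmmintrin.h>

/* c[i] = a[i] + b[i], four floats per iteration.
   Assumes n is a multiple of 4. */
void add_ps(const float *a, const float *b, float *c, int n) {
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* ADDPS */
    }
}
```

If the arrays are known to be 16-byte aligned, the aligned load and store from the slide can be substituted.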

#### **SSE Example: Cross Product**

```
Vector cross(const Vector& a, const Vector& b) {
  return Vector(
    a[1] * b[2] - a[2] * b[1],
    a[2] * b[0] - a[0] * b[2],
    a[0] * b[1] - a[1] * b[0]);
}
```

```
#include <xmmintrin.h>

__m128 _mm_cross_ps(__m128 a, __m128 b) {
  __m128 ea, eb;
  // set to a[1][2][0][3], b[2][0][1][3]
  ea = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3,0,2,1));
  eb = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3,1,0,2));
  // multiply
  __m128 xa = _mm_mul_ps(ea, eb);
  // set to a[2][0][1][3], b[1][2][0][3]
  a = _mm_shuffle_ps(a, a, _MM_SHUFFLE(3,1,0,2));
  b = _mm_shuffle_ps(b, b, _MM_SHUFFLE(3,0,2,1));
  __m128 xb = _mm_mul_ps(a, b);  // multiply
  return _mm_sub_ps(xa, xb);     // subtract
}
```

# **SSE Example: Dot Product**

- Given a set of vectors  $\{v_1, v_2, \dots, v_n\} = \{(x_1, y_1, z_1), (x_2, y_2, z_2), \dots, (x_n, y_n, z_n)\}$  and a vector  $v_c = (x_c, y_c, z_c)$ , calculate  $\{v_c \cdot v_i\}$
- Two options for memory layout
- Array of structure (AoS)

```
typedef struct { float dc, x, y, z; } Vertex;
Vertex v[n];
```

• Structure of array (SoA)

AoS version, one vertex per register. In the comments, register contents are listed from the highest element (bits 127..96) down to the lowest (bits 31..0), so z0 is assumed to sit in the lowest element and the sum ends up there:

    movaps  xmm0, v          ; xmm0 = DC, x0, y0, z0
    movaps  xmm1, vc         ; xmm1 = DC, xc, yc, zc
    mulps   xmm0, xmm1       ; xmm0 = DC, x0*xc, y0*yc, z0*zc
    movhlps xmm1, xmm0       ; xmm1 = DC, DC, DC, x0*xc
    addps   xmm1, xmm0       ; xmm1 = DC, DC, DC, x0*xc+z0*zc
    movaps  xmm2, xmm0
    shufps  xmm2, xmm2, 55h  ; xmm2 = DC, DC, DC, y0*yc
    addps   xmm1, xmm2       ; xmm1 = DC, DC, DC, x0*xc+y0*yc+z0*zc

movhlps: DEST[63..0] := SRC[127..64]
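
The same horizontal-sum idea can be written with intrinsics. The sketch below is an illustration, not the slide's code: the struct `Vec4` with memory order `x, y, z, pad` and the function name `dot_aos` are assumptions, so the lane positions differ from the assembly above.

```c
#include <xmmintrin.h>

typedef struct { float x, y, z, pad; } Vec4;  /* hypothetical layout; pad is ignored */

/* Dot product of two AoS vertices using one MULPS and a horizontal sum. */
float dot_aos(const Vec4 *v, const Vec4 *vc) {
    __m128 a  = _mm_loadu_ps((const float *)v);
    __m128 b  = _mm_loadu_ps((const float *)vc);
    __m128 p  = _mm_mul_ps(a, b);             /* lanes: x*xc, y*yc, z*zc, pad*pad */
    __m128 hi = _mm_movehl_ps(p, p);          /* MOVHLPS: lane 0 = z*zc */
    __m128 y  = _mm_shuffle_ps(p, p, 0x55);   /* SHUFPS 55h: broadcast lane 1 = y*yc */
    __m128 s  = _mm_add_ss(_mm_add_ss(p, hi), y);  /* lane 0 = x*xc + z*zc + y*yc */
    return _mm_cvtss_f32(s);
}
```

Three of the four multiplier lanes do useful work and the reduction needs extra shuffles, which is the usual argument for the SoA layout.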

## **SSE Example: Dot Product (SoA)**

    ; X = x1, x2, x3, x4
    ; Y = y1, y2, y3, y4
    ; Z = z1, z2, z3, z4
    ; A = xc, xc, xc, xc
    ; B = yc, yc, yc, yc
    ; C = zc, zc, zc, zc
    movaps xmm0, X     ; xmm0 = x1, x2, x3, x4
    movaps xmm1, Y     ; xmm1 = y1, y2, y3, y4
    movaps xmm2, Z     ; xmm2 = z1, z2, z3, z4
    mulps  xmm0, A     ; xmm0 = x1*xc, x2*xc, x3*xc, x4*xc
    mulps  xmm1, B     ; xmm1 = y1*yc, y2*yc, y3*yc, y4*yc
    mulps  xmm2, C     ; xmm2 = z1*zc, z2*zc, z3*zc, z4*zc
    addps  xmm0, xmm1
    addps  xmm0, xmm2  ; xmm0 = (x1*xc+y1*yc+z1*zc), ...

#### **SSE Example: Dot Product (SoA), Intrinsics**

\_mm\_store\_ps(output, t1);
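
Only the final store of this slide survives in the text. The following is a sketch of what an intrinsics version of the SoA computation could look like, not a reconstruction of the original code; the function name `dot4_soa` and its parameters are assumptions.

```c
#include <xmmintrin.h>

/* out[i] = x[i]*xc + y[i]*yc + z[i]*zc for i = 0..3 (SoA layout). */
void dot4_soa(const float *x, const float *y, const float *z,
              float xc, float yc, float zc, float *out) {
    __m128 t0 = _mm_mul_ps(_mm_loadu_ps(x), _mm_set1_ps(xc));
    __m128 t1 = _mm_mul_ps(_mm_loadu_ps(y), _mm_set1_ps(yc));
    __m128 t2 = _mm_mul_ps(_mm_loadu_ps(z), _mm_set1_ps(zc));
    t1 = _mm_add_ps(_mm_add_ps(t0, t1), t2);  /* four dot products at once */
    _mm_storeu_ps(out, t1);
}
```

All four lanes produce a finished result and no shuffles are needed.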

- **prefetch** (**\_mm\_prefetch**): a hint for the CPU to load operands for upcoming instructions so that data loading can overlap with computation.
- **movntps** (**\_mm\_stream\_ps**): asks the CPU to write data directly to memory instead of placing it in the cache.

### **SSE Example: Dot Product with Prefetch**

    _mm_prefetch((const char*)(vec1_x + next), _MM_HINT_NTA);
    _mm_prefetch((const char*)(vec1_y + next), _MM_HINT_NTA);
    _mm_prefetch((const char*)(vec1_z + next), _MM_HINT_NTA);
    _mm_prefetch((const char*)(vec2_x + next), _MM_HINT_NTA);
    _mm_prefetch((const char*)(vec2_y + next), _MM_HINT_NTA);
    _mm_prefetch((const char*)(vec2_z + next), _MM_HINT_NTA);

// 1.5x speedup
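
To show how the two hints fit into a loop, here is a small sketch combining `_mm_prefetch` with `_mm_stream_ps`. The kernel `mul_stream`, the prefetch distance `ahead`, and the element-wise multiply are all assumptions chosen for illustration; any speedup depends on the data size and the machine, and none is claimed here.

```c
#include <xmmintrin.h>

/* Hypothetical kernel: dst[i] = a[i] * b[i].
   Assumes n is a multiple of 4 and dst is 16-byte aligned
   (required by the non-temporal store). */
void mul_stream(const float *a, const float *b, float *dst, int n) {
    const int ahead = 16;  /* prefetch distance in floats; a tuning guess */
    for (int i = 0; i < n; i += 4) {
        if (i + ahead < n) {
            _mm_prefetch((const char *)(a + i + ahead), _MM_HINT_NTA);
            _mm_prefetch((const char *)(b + i + ahead), _MM_HINT_NTA);
        }
        __m128 r = _mm_mul_ps(_mm_loadu_ps(a + i), _mm_loadu_ps(b + i));
        _mm_stream_ps(dst + i, r);  /* MOVNTPS: store with a non-temporal hint */
    }
    _mm_sfence();  /* order the streaming stores before later stores */
}
```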


```
float result = coeff[8] * x;   // coeff[i] holds 1/i!
int i;
for (i = 7; i >= 2; i--) {
    result += coeff[i];
    result *= x;
}
return (result + 1) * x + 1;
```

$$e^{x} = 1 + x + \frac{x^{2}}{2!} + \frac{x^{3}}{3!} + \frac{x^{4}}{4!} + \dots$$

## **Example: Exponential with Intrinsics**

```
int i;
__m128 X = _mm_load_ps(data);
__m128 result = _mm_mul_ps(X, coeff_sse[8]);
for (i = 7; i >= 2; i--) {
    result = _mm_add_ps(result, coeff_sse[i]);
    result = _mm_mul_ps(result, X);
}
result = _mm_add_ps(result, sse_one);
result = _mm_mul_ps(result, X);
result = _mm_add_ps(result, sse_one);
_mm_store_ps(out, result);
```
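
The fragment above relies on tables that are not shown on the slide. A self-contained sketch follows, assuming the coefficients are $1/i!$ as in the series for $e^x$; the function name `exp4_taylor` and the local `coeff` table are illustrative, and unaligned loads and stores are used so that any array works.

```c
#include <xmmintrin.h>

/* Degree-8 Taylor polynomial of e^x for four floats at once (Horner form). */
void exp4_taylor(const float *data, float *out) {
    /* coeff[i] = 1/i!; entries 0 and 1 are not used by the loop */
    static const float coeff[9] = {
        1.0f, 1.0f, 1.0f / 2, 1.0f / 6, 1.0f / 24, 1.0f / 120,
        1.0f / 720, 1.0f / 5040, 1.0f / 40320
    };
    const __m128 one = _mm_set1_ps(1.0f);
    __m128 x = _mm_loadu_ps(data);
    __m128 result = _mm_mul_ps(x, _mm_set1_ps(coeff[8]));
    for (int i = 7; i >= 2; i--) {
        result = _mm_add_ps(result, _mm_set1_ps(coeff[i]));
        result = _mm_mul_ps(result, x);
    }
    result = _mm_add_ps(result, one);
    result = _mm_mul_ps(result, x);
    result = _mm_add_ps(result, one);
    _mm_storeu_ps(out, result);
}
```

The truncated series is only accurate for small |x|; a production routine would add range reduction.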

### SSE2

- Introduced into the IA-32 architecture in the Pentium 4 and Intel Xeon processors in 2001
- Targets applications such as 3-D graphics, video decoding/encoding, and speech recognition
- What is new?
  - SIMD operations on 128-bit packed double-precision FP
  - SIMD operations on 128-bit packed 64-bit integers
    - Offers more flexibility in big numbers
- 144 new instructions
- AMD didn't support SSE2 until 2003, with their Opteron and Athlon64 processors

## **SSE2 Features**

• Add data types and instructions for them



• Programming environment unchanged (also packed and scalar)

# **SSE2** Instructions (ARITHMETIC)

- addpd - Adds 2 64bit doubles.
- addsd - Adds bottom 64bit doubles.
- subpd - Subtracts 2 64bit doubles.
- subsd - Subtracts bottom 64bit doubles.
- mulpd - Multiplies 2 64bit doubles.
- mulsd - Multiplies bottom 64bit doubles.
- divpd - Divides 2 64bit doubles.
- divsd - Divides bottom 64bit doubles.
- maxpd - Gets largest of 2 64bit doubles for 2 sets.
- maxsd - Gets largest of 2 64bit doubles to bottom set.
- minpd - Gets smallest of 2 64bit doubles for 2 sets.
- minsd - Gets smallest of 2 64bit values for bottom set.
- paddb - Adds 16 8bit integers.
- paddw - Adds 8 16bit integers.
- paddd - Adds 4 32bit integers.
- paddq - Adds 2 64bit integers.
- paddsb - Adds 16 8bit integers with saturation.
- paddsw - Adds 8 16bit integers using saturation.
- paddusb - Adds 16 8bit unsigned integers using saturation.
- paddusw - Adds 8 16bit unsigned integers using saturation.
- psubb - Subtracts 16 8bit integers.
- psubw - Subtracts 8 16bit integers.
- psubd - Subtracts 4 32bit integers.
- psubq - Subtracts 2 64bit integers.
- psubsb - Subtracts 16 8bit integers using saturation.
- psubsw - Subtracts 8 16bit integers using saturation.
- psubusb - Subtracts 16 8bit unsigned integers using saturation.
- psubusw - Subtracts 8 16bit unsigned integers using saturation.
- pmaddwd - Multiplies 16bit integers into 32bit results and adds results.
- pmulhw - Multiplies 16bit integers and returns the high 16bits of the result.
- pmullw - Multiplies 16bit integers and returns the low 16bits of the result.
- pmuludq - Multiplies 2 32bit pairs and stores 2 64bit results.
- rcpps - Approximates the reciprocal of 4 32bit singles.
- rcpss - Approximates the reciprocal of bottom 32bit single.
- sqrtpd - Returns square root of 2 64bit doubles.
- sqrtsd - Returns square root of bottom 64bit double.
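
The difference between the plain and the saturating adds is easiest to see side by side. The sketch below uses the SSE2 intrinsics for PADDB and PADDUSB; the helper name `add_bytes` is an assumption.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Adds 16 unsigned bytes twice: once with wraparound, once with saturation. */
void add_bytes(const uint8_t a[16], const uint8_t b[16],
               uint8_t wrapped[16], uint8_t saturated[16]) {
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)wrapped,   _mm_add_epi8(va, vb));   /* PADDB */
    _mm_storeu_si128((__m128i *)saturated, _mm_adds_epu8(va, vb));  /* PADDUSB */
}
```

With every byte of `a` equal to 200 and every byte of `b` equal to 100, the wrapped result is 44 (300 mod 256) and the saturated result is clamped to 255.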

## **SSE2** Instructions (Logic)

- andnpd - Logically NOT ANDs 2 64bit doubles.
- andnps - Logically NOT ANDs 4 32bit singles.
- andpd - Logically ANDs 2 64bit doubles.
- pand - Logically ANDs 2 128bit registers.
- pandn - Logically inverts the first 128bit operand and ANDs with the second.
- por - Logically ORs 2 128bit registers.
- pslldq - Logically left shifts 1 128bit value.
- psllq - Logically left shifts 2 64bit values.
- pslld - Logically left shifts 4 32bit values.
- psllw - Logically left shifts 8 16bit values.
- psrad - Arithmetically right shifts 4 32bit values.
- psraw - Arithmetically right shifts 8 16bit values.
- psrldq - Logically right shifts 1 128bit value.
- psrlq - Logically right shifts 2 64bit values.
- psrld - Logically right shifts 4 32bit values.
- psrlw - Logically right shifts 8 16bit values.
- pxor - Logically XORs 2 128bit registers.
- orpd - Logically ORs 2 64bit doubles.
- xorpd - Logically XORs 2 64bit doubles.
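
Arithmetic and logical right shifts differ only for negative inputs: the former copies the sign bit, the latter shifts in zeros. A short sketch with the PSRAD and PSRLD intrinsics follows; the function name `shift_right_by_1` is hypothetical.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Shifts four 32-bit values right by one bit, arithmetically and logically. */
void shift_right_by_1(const int32_t in[4], int32_t arith[4], uint32_t logic[4]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    _mm_storeu_si128((__m128i *)arith, _mm_srai_epi32(v, 1));  /* PSRAD */
    _mm_storeu_si128((__m128i *)logic, _mm_srli_epi32(v, 1));  /* PSRLD */
}
```

For the input -8 the arithmetic shift gives -4, while the logical shift gives 0x7FFFFFFC.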

- cmppd Compares 2 pairs of 64bit doubles.
- cmpsd Compares bottom 64bit doubles.
- comisd Compares bottom 64bit doubles and stores result in EFLAGS.
- ucomisd Compares bottom 64bit doubles and stores result in EFLAGS. (QNaNs don't throw exceptions with ucomisd, unlike comisd.)
- pcmpxxb Compares 16 8bit integers.
- pcmpxxw Compares 8 16bit integers.
- pcmpxxd Compares 4 32bit integers.
- Compare codes (the xx parts above):
  - eq Equal to.
  - lt Less than.
  - le Less than or equal to.
  - ne Not equal.
  - nlt Not less than.
  - nle Not less than or equal to.
  - ord Ordered.
  - unord Unordered.
- The packed-integer forms (pcmpxxb/w/d) exist only as eq (equal) and gt (greater than).

# **SSE2** Instructions (Conversion)

- cvtdq2pd Converts 2 32bit integers into 2 64bit doubles.
- cvtdq2ps Converts 4 32bit integers into 4 32bit singles.
- cvtpd2pi Converts 2 64bit doubles into 2 32bit integers in an MMX register.
- cvtpd2dq Converts 2 64bit doubles into 2 32bit integers in the bottom of an XMM register.
- cvtpd2ps Converts 2 64bit doubles into 2 32bit singles in the bottom of an XMM register.
- cvtpi2pd Converts 2 32bit integers in an MMX register into 2 64bit doubles.
- cvtps2dq Converts 4 32bit singles into 4 32bit integers.
- cvtps2pd Converts 2 32bit singles into 2 64bit doubles.
- cvtsd2si Converts 1 64bit double to a 32bit integer in a GPR.
- cvtsd2ss Converts bottom 64bit double to a bottom 32bit single. Tops are unchanged.
- cvtsi2sd Converts a 32bit integer to the bottom 64bit double.
- cvtsi2ss Converts a 32bit integer to the bottom 32bit single.
- cvtss2sd Converts bottom 32bit single to bottom 64bit double.
- cvtss2si Converts bottom 32bit single to a 32bit integer in a GPR.
- cvttpd2pi Converts 2 64bit doubles to 2 32bit integers using truncation into an MMX register.
- cvttpd2dq Converts 2 64bit doubles to 2 32bit integers using truncation.
- cvttps2dq Converts 4 32bit singles to 4 32bit integers using truncation.
- cvttps2pi Converts 2 32bit singles to 2 32bit integers using truncation into an MMX register.
- cvttsd2si Converts a 64bit double to a 32bit integer using truncation into a GPR.
- cvttss2si Converts a 32bit single to a 32bit integer using truncation into a GPR.
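
The extra "t" in the truncating forms matters in practice: CVTSD2SI rounds according to the current MXCSR rounding mode, while CVTTSD2SI always chops toward zero. The sketch below assumes the default rounding mode (round to nearest even); the two function names are made up for the example.

```c
#include <emmintrin.h>

/* CVTSD2SI: rounds using the current MXCSR mode (nearest-even by default). */
int round_double(double d) {
    return _mm_cvtsd_si32(_mm_set_sd(d));
}

/* CVTTSD2SI: always truncates toward zero. */
int truncate_double(double d) {
    return _mm_cvttsd_si32(_mm_set_sd(d));
}
```

For 2.7 the first returns 3 and the second returns 2; for 2.5 the rounding form returns 2 because ties go to the even integer.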

# **SSE2** Instructions

Load/Store:

- movq Moves a 64bit value, clearing the top 64bits of an XMM register.
- movsd Moves a 64bit double, leaving tops unchanged if move is between two XMM registers.
- movapd Moves 2 aligned 64bit doubles.
- movupd Moves 2 unaligned 64bit doubles.
- movhpd Moves top 64bit value to or from an XMM register.
- movlpd Moves bottom 64bit value to or from an XMM register.
- movdq2q Moves bottom 64bit value into an MMX register.
- movq2dq Moves an MMX register value to the bottom of an XMM register. Top is cleared to zero.
- movntpd Moves a 128bit value to memory without using the cache. NT is "Non Temporal."
- movntdq Moves a 128bit value to memory without using the cache.
- movnti Moves a 32bit value without using the cache.
- maskmovdqu Moves 16 bytes based on sign bits of another XMM register.
- pmovmskb Generates a 16bit mask from the sign bits of each byte in an XMM register.

(Is "minimize cache pollution" the same as "without using cache"?)

# **SSE2** Instructions

Shuffling:

- pshufd - Shuffles 32bit values in a complex way.
- pshufhw - Shuffles high 16bit values in a complex way.
- pshuflw - Shuffles low 16bit values in a complex way.
- unpckhpd - Unpacks and interleaves top 64bit doubles from 2 128bit sources into 1.
- unpcklpd - Unpacks and interleaves bottom 64bit doubles from 2 128bit sources into 1.
- punpckhbw - Unpacks and interleaves top 8 8bit integers from 2 128bit sources into 1.
- punpckhwd - Unpacks and interleaves top 4 16bit integers from 2 128bit sources into 1.
- punpckhdq - Unpacks and interleaves top 2 32bit integers from 2 128bit sources into 1.
- punpckhqdq - Unpacks and interleaves top 64bit integers from 2 128bit sources into 1.
- punpcklbw - Unpacks and interleaves bottom 8 8bit integers from 2 128bit sources into 1.
- punpcklwd - Unpacks and interleaves bottom 4 16bit integers from 2 128bit sources into 1.
- punpckldq - Unpacks and interleaves bottom 2 32bit integers from 2 128bit sources into 1.
- punpcklqdq - Unpacks and interleaves bottom 64bit integers from 2 128bit sources into 1.
- packssdw - Packs 32bit integers to 16bit integers using saturation.
- packsswb - Packs 16bit integers to 8bit integers using saturation.
- packuswb - Packs 16bit integers to 8bit unsigned integers using saturation.
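
A common use of the unpack and pack instructions is widening pixel bytes to words for arithmetic and narrowing them back with clamping. The sketch below shows both directions; the function names are hypothetical, and interleaving with a zero register is just one way to zero-extend.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Zero-extends the low 8 bytes to 16-bit words by interleaving with zero (PUNPCKLBW). */
void widen_low_bytes(const uint8_t in[16], uint16_t out[8]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i w = _mm_unpacklo_epi8(v, _mm_setzero_si128());
    _mm_storeu_si128((__m128i *)out, w);
}

/* Narrows 8 signed words to unsigned bytes with saturation (PACKUSWB). */
void narrow_words(const int16_t in[8], uint8_t out[8]) {
    __m128i v = _mm_loadu_si128((const __m128i *)in);
    __m128i p = _mm_packus_epi16(v, v);      /* low 8 bytes hold the result */
    _mm_storel_epi64((__m128i *)out, p);     /* store only the low 64 bits */
}
```

Values below 0 become 0 and values above 255 become 255 in `narrow_words`.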

Cache Control:

- clflush - Flushes a cache line from all levels of cache.
- lfence - Guarantees that all memory loads issued before the lfence instruction are completed before any loads after the lfence instruction.
- mfence - Guarantees that all memory reads and writes issued before the mfence instruction are completed before any reads or writes after the mfence instruction.
- pause - Hints to the processor that the code is in a spin-wait loop, which improves the loop's performance and power use.

- SSE3 was introduced with the Pentium 4 processor supporting Hyper-Threading Technology in 2004.
- Supplemental Streaming SIMD Extensions 3 (SSSE3) were introduced with the Intel Xeon processor 5100 series and the Intel Core 2 processor families.
- SSE4 is introduced in Intel processor generations built on 45nm process technology (announced in 2006).
- SSE3/SSSE3/SSE4 do not introduce new data types
  - XMM registers are used to operate on packed data types
    - integer, single-precision FP, or double-precision FP

## SSE3

- 13 new instructions
  - Support horizontal operations across a single register
    - Instead of down through multiple registers
  - Asymmetric processing
- Unaligned access instructions are a new type of instruction
- Process control instructions to boost performance with Intel's hyper-threading feature
- AMD started to support SSE3 in 2005

# **SSE3 Instructions ADDSUBPD**



Asymmetric processing: ADDSUBPD

Add and Sub of packed double-precision FP

The second operand may be from memory

# **SSE3 Instructions: HADDPD**



Horizontal data movement: HADDPD

Horizontal add of packed double-precision FP

The second operand may be from memory

# **SSE3 Instructions Summary**

#### Arithmetic:

- addsubpd – DP. Addition on higher pair, subtraction on lower pair (+, -).
- addsubps – SP. Two adds and two subs interleaved (+, -, +, -).
- haddpd – DP. Horizontal addition. (src1+src0, dst1+dst0)
- haddps – SP. Horizontal addition. (src3+src2, src1+src0, dst3+dst2, dst1+dst0)
- hsubpd – DP. Horizontal subtraction. (src0-src1, dst0-dst1)
- hsubps – SP. Horizontal subtraction. (src2-src3, src0-src1, dst2-dst3, dst0-dst1)

#### Load/Store:

- lddqu – Loads an unaligned 128bit value
- movddup – Loads or moves a DP into the lower half and duplicates it to the higher half
- movshdup – Duplicates the higher singles. (src3, src3, src1, src1)
- movsldup – Duplicates the lower singles. (src2, src2, src0, src0)
- fisttp – Converts a floating-point value to an integer using truncation

#### **Process Control:**

- monitor - Sets up a region to monitor for activity
- mwait - Waits until activity happens in a region specified by monitor
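
To make the lane arithmetic of the asymmetric and horizontal instructions concrete, the sketch below models ADDSUBPS and HADDPS in plain scalar C. These are reference models written for this note (the function names are assumptions), not the SSE3 intrinsics themselves, which are `_mm_addsub_ps` and `_mm_hadd_ps` from `<pmmintrin.h>`. Index 0 is the lowest element.

```c
/* Scalar model of ADDSUBPS dst, src: subtract in even lanes, add in odd lanes. */
void addsubps_model(float dst[4], const float src[4]) {
    for (int i = 0; i < 4; i++)
        dst[i] = (i % 2 == 0) ? dst[i] - src[i] : dst[i] + src[i];
}

/* Scalar model of HADDPS dst, src: adjacent pairs are summed;
   the low half of the result comes from dst, the high half from src. */
void haddps_model(float dst[4], const float src[4]) {
    float r[4] = { dst[0] + dst[1], dst[2] + dst[3],
                   src[0] + src[1], src[2] + src[3] };
    for (int i = 0; i < 4; i++)
        dst[i] = r[i];
}
```

The alternating subtract/add pattern of ADDSUBPS matches the real and imaginary parts of a complex multiplication, which is one of its typical uses.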



- 32 new instructions designed to accelerate a variety of multimedia and signal processing applications
  - Only 16 distinct instructions, each in an MMX and an XMM form
- Integer data types include packed byte, word, or double word
- Operands can be 64 or 128 bit in MMX registers, XMM registers, or memory

# **SSSE3 Instructions: PHADDD**



#### Horizontal data movement: PHADDD

Horizontal add of packed DW

The second operand may be from memory

# **SSSE3 Instructions Summary**

- 12 for horizontal addition or subtraction operations
  - PHADDW, PHADDD, PHADDSW, and three for SUB (PHSUBW, PHSUBD, PHSUBSW)
- 6 for evaluating absolute values
  - PABSB, PABSW, PABSD
- 2 for multiply and add operations
  - PMADDUBSW (byte mul, add pairs, and to saturated words)
  - Speed up dot products
- 2 for packed-integer multiply operations
  - PMULHRSW (Q.15 multiplications, with rounding and scaling)
- 2 for a byte-wise, in-place shuffle
  - PSHUFB (similar to PERMUTE)
- 6 for negating packed integers in the destination where the corresponding source element is negative
  - PSIGNB, PSIGNW, PSIGND
- 2 for extracting aligned data from the composite of two operands
  - PALIGNR (similar to SHIFT PAIR)
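
Two of these are easier to grasp from their per-element rules. The sketch below gives plain scalar C models of PSHUFB and of one PMULHRSW lane; the function names are assumptions, and the code is a reference model rather than the SSSE3 intrinsics (`_mm_shuffle_epi8`, `_mm_mulhrs_epi16` in `<tmmintrin.h>`). The right shifts of negative values assume an arithmetic shift, which mainstream compilers provide.

```c
#include <stdint.h>

/* Scalar model of PSHUFB dst, ctrl: each result byte is picked from dst by the
   low 4 bits of the control byte, or cleared when the control's top bit is set. */
void pshufb_model(uint8_t dst[16], const uint8_t ctrl[16]) {
    uint8_t r[16];
    for (int i = 0; i < 16; i++)
        r[i] = (ctrl[i] & 0x80) ? 0 : dst[ctrl[i] & 0x0F];
    for (int i = 0; i < 16; i++)
        dst[i] = r[i];
}

/* Scalar model of one PMULHRSW lane: Q.15 multiply with rounding. */
int16_t pmulhrsw_model(int16_t a, int16_t b) {
    int32_t t = ((int32_t)a * b) >> 14;  /* keep one extra bit for rounding */
    return (int16_t)((t + 1) >> 1);
}
```

In Q.15, 0x4000 represents 0.5, so `pmulhrsw_model(0x4000, 0x4000)` yields 0x2000, that is 0.25.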



- SSE4 comprises two sets of extensions
  - SSE4.1 includes 47 new instructions
    - Targets media, imaging and 3D graphics
    - Adds instructions for improving compiler vectorization
    - Significantly increases support for packed dword computation
  - SSE4.2 has 7 new instructions
    - Improves performance in string and text processing
- Registers
  - Two SSE4.2 instructions operate on general-purpose registers
  - All other instructions operate on XMM registers
    - No MMX registers

# **SSE4.1 Instructions Summary (1)**

- Six instructions for conditional copying
  - BLENDPS, BLENDPD, BLENDVPS, BLENDVPD
  - PBLENDVB, PBLENDW
- Eight instructions expand support for packed integer MIN/MAX
  - PMINSB, PMINUW, PMINUD, PMINSD and PMAX versions
- Four instructions for floating-point rounding
  - ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD
  - One of the four rounding modes specified by an immediate
- Seven instructions for data insertion and extractions
  - INSERTPS, PINSRB, PINSRD/Q
  - EXTRACTPS, PEXTRB, PEXTRW, PEXTRD/PEXTRQ
- Twelve instructions converting packed integers to a wider format
  - PMOVSXBW, PMOVZXBW, also BD, BQ, WD, WQ, DQ
  - Sign and zero extensions

# **SSE 4.1 Instructions Summary (2)**

- Two instructions perform packed DW multiplications
  - PMULDQ, 32-bit to 64-bit, two DW from the source are used
  - PMULLD, 32-bit to 32-bit
- Two instructions for floating-point dot products
  - DPPS, dot product of PS, result in any (one or more) locations
  - DPPD, dot product of PD
- MPSADBW Computes 8 offset sums of absolute differences
- PTEST Logically compares two 128-bit values and sets the ZF and CF flags
- PCMPEQQ QW equality comparison
- PACKUSDW Signed DW to unsigned W with saturation
- MOVNTDQA Non-temporal (streaming) load of an aligned DQ
- PHMINPOSUW Packed horizontal word minimum
  - Value in the lower 16 bits, 3-bit index in bits 16 to 18
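
The "result in any (one or more) locations" remark about DPPS refers to its immediate byte: the high four bits choose which products enter the sum and the low four bits choose which destination lanes receive it. A plain scalar C model is sketched below; the function name is an assumption and this is not the intrinsic (`_mm_dp_ps` in `<smmintrin.h>`).

```c
/* Scalar model of DPPS dst, src, imm8.
   imm8[7:4]: which lane products take part in the sum.
   imm8[3:0]: which destination lanes receive the sum (others become 0). */
void dpps_model(float dst[4], const float src[4], unsigned imm8) {
    float t[4];
    for (int i = 0; i < 4; i++)
        t[i] = ((imm8 >> (4 + i)) & 1u) ? dst[i] * src[i] : 0.0f;
    float sum = (t[0] + t[1]) + (t[2] + t[3]);
    for (int i = 0; i < 4; i++)
        dst[i] = ((imm8 >> i) & 1u) ? sum : 0.0f;
}
```

With `imm8 = 0x71` the lowest three lanes are multiplied and summed and the result is written to lane 0 only, which is the 3-component dot product from the earlier examples in a single instruction.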

# **SSE 4.1 BLEND Instructions**

- Copy PS or PD from SRC if the control bit is 1
  - PS needs four bits while PD needs only two bits
- V versions use XMM0 as control
  - Use the MSB

| Instruction | Packed Double FP | Packed Single FP | Packed QWord | Packed DWord | Packed Word | Packed Byte | Blend Control |
|-------------|------------------|------------------|--------------|--------------|-------------|-------------|---------------|
| BLENDPS     |                  | X                |              |              |             |             | Imm8          |
| BLENDPD     | X                |                  |              |              |             |             | Imm8          |
| BLENDVPS    |                  | X                |              | X<sup>(1)</sup> |          |             | XMM0          |
| BLENDVPD    | X                |                  | X<sup>(1)</sup> |           |             |             | XMM0          |
| PBLENDVB    |                  |                  | (2)          | (2)          | (2)         | X           | XMM0          |
| PBLENDW     |                  |                  | X            | X            | X           |             | Imm8          |
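
The two control styles can be modelled in a few lines of scalar C. The sketch below is a reference model with made-up function names, not the SSE4.1 intrinsics (`_mm_blend_ps`, `_mm_blendv_ps`): the immediate form tests one bit per lane, the variable form tests the most significant bit of each control lane.

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of BLENDPS dst, src, imm8: lane i is copied from src when bit i is 1. */
void blendps_model(float dst[4], const float src[4], unsigned imm8) {
    for (int i = 0; i < 4; i++)
        if ((imm8 >> i) & 1u)
            dst[i] = src[i];
}

/* Scalar model of BLENDVPS dst, src, XMM0: lane i is copied from src
   when the most significant bit of control lane i is 1. */
void blendvps_model(float dst[4], const float src[4], const float xmm0[4]) {
    for (int i = 0; i < 4; i++) {
        uint32_t bits;
        memcpy(&bits, &xmm0[i], sizeof bits);  /* read the raw bit pattern */
        if (bits >> 31)
            dst[i] = src[i];
    }
}
```

Because only the MSB is examined, the mask produced by a packed comparison can be used directly as the control for the V versions.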

# **SSE4.2 Instructions**

- CRC32 Use the polynomial 0x11EDC6F41
  - R32 or R64 mode
- Four string comparison instructions
  - PCMPESTRI
  - PCMPESTRM
  - PCMPISTRI
  - PCMPISTRM
- PCMPGTQ Compare packed QW for greater than

- POPCNT Population count
- LZCNT Leading zero count (part of AMD's ABM extension, not one of the 7 SSE4.2 instructions)
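
The polynomial 0x11EDC6F41 is the one known as CRC-32C (Castagnoli). A bitwise software model of the accumulate step is sketched below; it is a reference model with hypothetical function names, not the hardware intrinsic (`_mm_crc32_u8` in `<nmmintrin.h>`). The constant 0x82F63B78 is the bit-reversed form of the polynomial, used because the data is processed least significant bit first, and the initial value and final inversion in `crc32c` follow the usual CRC-32C convention rather than anything the instruction does by itself.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of one CRC32 r32, r/m8 step: folds a byte into the running CRC. */
uint32_t crc32c_u8_model(uint32_t crc, uint8_t byte) {
    crc ^= byte;
    for (int k = 0; k < 8; k++)
        crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    return crc;
}

/* Conventional CRC-32C of a buffer: start from all ones, invert at the end. */
uint32_t crc32c(const void *buf, size_t len) {
    const uint8_t *p = (const uint8_t *)buf;
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++)
        crc = crc32c_u8_model(crc, p[i]);
    return crc ^ 0xFFFFFFFFu;
}
```

The standard check value for the ASCII string "123456789" is 0xE3069283.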

# **Other SIMD architectures**

• Graphics Processing Unit (GPU): nVidia 7800, 24 pipelines (8 vector/16 fragment)



# NVidia GeForce 8800, 2006

- Each GeForce 8800 GPU stream processor is a fully generalized, fully decoupled, scalar, processor that supports IEEE 754 floating point precision.
- Up to 128 stream processors



# **Cell processor**

- Cell Processor (IBM/Toshiba/Sony): 1 PPE (Power Processor Element) + 8 SPEs (Synergistic Processor Elements)
- An SPE is a RISC processor with 128-bit SIMD for single/double precision instructions, 128 128-bit registers, 256 KB local store
- Used in the PS3.

## **Cell Processor Architecture**



EIB (Element Interconnect Bus)
