### learn network inspire www.GDConf.com

Game Developers Conference<sup>®</sup> March 23-27, 2009 Moscone Center, San Francisco

### Lockless Programming in Games

Bruce Dawson Principal Software Design Engineer Microsoft Windows Client Performance

www.GDConf.com

09

ear



### Agenda

- » Locks and their problems
- » Lockless programming a different set of problems!
- » Portable lockless programming
- » Lockless algorithms that work
- » Conclusions
- » Focus is on improving intuition on the *reordering* aspects of lockless programming



## Cell phones

» Please turn off all cell phones, pagers, alarm clocks, crying babies, internal combustion engines, leaf blowers, etc.



### Mandatory Multi-core Mention

- » Xbox 360: six hardware threads
- » PS3: nine hardware threads
- » Windows: quad-core PCs for \$500
- » Multi-threading is mandatory if you want to harness the available power
- » Luckily it's easy
  - As long as there is no sharing of non-constant data
- » Sharing data is tricky
  - Easiest and safest way is to use OS features such as locks and semaphores



## Simple Job Queue

» Assigning work:

**>>** 

EnterCriticalSection( &workItemsLock );

workItems.push( workItem );

LeaveCriticalSection( &workItemsLock );

```
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem = workItems.front();
workItems.pop();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem );
```



# The Problem With Locks...

- » Overhead acquiring and releasing locks takes time
  - So don't acquire locks too often
- » Deadlocks lock acquisition order must be consistent to avoid these
  - So don't have very many locks, or only acquire one at a time
- » Contention sometimes somebody else has the lock
  - So never hold locks for too long contradicts point 1
  - So have lots of little locks contradicts point 2
- » Priority inversions if a thread is swapped out while holding a lock, progress may stall
  - ③ Changing thread priorities can lead to this
  - Solution State & St



## Sensible Reaction

» Use locks carefully

- Don't lock too frequently
- ③ Don't lock for too long
- ③ Don't use too many locks
- ③ Don't have one central lock

» Or, try lockless

www.GDConf.com



# Lockless Programming

» Techniques for safe multi-threaded data sharing without locks

### » Pros:

- May have lower overhead
- Avoids deadlocks
- May reduce contention
- Avoids priority inversions

### » Cons

- Serv limited abilities
- S Extremely tricky to get right
- Senerally non-portable



## Job Queue Again

» Assigning work:

**>>** 

EnterCriticalSection( &workItemsLock );

workItems.push( workItem );

LeaveCriticalSection( &workItemsLock );

```
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem = workItems.front();
workItems.pop();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem );
```



# Lockless Job Queue #1

» Assigning work:

**>>** 

EnterCriticalSection( &workItemsLock );

InterlockedPushEntrySList( workItem );

LeaveCriticalSection( &workItemsLock );

```
Worker threads:
EnterCriticalSection( &workItemsLock );
WorkItem workItem =
        InterlockedPopEntrySList();
LeaveCriticalSection( &workItemsLock );
DoWork( workItem );
```



## Lockless Job Stack #1

» Assigning work:



WorkItem workItem =

InterlockedPopEntrySList();

DoWork( workItem );

www.GDConf.com



## Lockless Job Queue #2

» Assigning work – one writer only:



www.GDConf.com



#### Simple CPU/Compiler Model



Read pC Write pA Write pB Read pD Write pC



#### Alternate CPU Model



Write pA Write pB Write pC

Visible order: Write pA Write pC Write pB



#### Alternate CPU – Reads Pass Reads



Read A1 Read A2 Read A1

Visible order: Read A1 Read A1 Read A2



#### Alternate CPU – Writes Pass Reads



Write A2 Visible order: Write A2

Read A1

Read A1



#### Alternate CPU – Reads Pass Writes



# Memory Models

|                        | x86/x64 | PowerPC | ARM  | IA64 |
|------------------------|---------|---------|------|------|
| store can pass store?  | No      | Yes*    | Yes* | Yes* |
| load can pass load?    | No      | Yes     | Yes  | Yes  |
| store can pass load?   | No      | Yes     | Yes  | Yes  |
| load can pass store?** | Yes     | Yes     | Yes  | Yes  |

» "Pass" means "visible before"

Pa

www.GDConf.com

- » Memory models are actually more complex than this
  - May vary for cacheable/non-cacheable, etc.
- » This only affects multi-threaded lock-free code!!!

\* Only stores to different addresses can pass each other

\*\* Loads to a previously stored address will load that value



#### Improbable CPU – Reads Don't Pass Writes





### Reads Must Pass Writes!

- » Reads not passing writes would mean L1 cache is frequently disabled
  - Severy read that follows a write would stall for shared storage latency
- » Huge performance impact
- » Therefore, on x86 and x64 (on all modern CPUs) reads can pass writes



# **Reordering Implications**

- » Publisher/Subscriber model
- » Thread A:

```
g_data = data;
```

```
g_dataReady = true;
```

```
» Thread B:
    if( g_dataReady )
        process( g_data );
```

» Is it safe?

www.GDConf.com





Proc 1: Write g\_data Write g\_dataReady

Proc 2: Read g\_dataReady Read g\_data

» Writes may reachL2 out of order





Proc 1: Write g\_data MyExportBarrier(); Write g\_dataReady

Proc 2: Read g\_dataReady Read g\_data

» Writes now reachL2 in order





Proc 1: Write g\_data MyExportBarrier(); Write g\_dataReady

Proc 2: Read g\_dataReady Read g\_data

Reads may leave
 L2 out of order –
 g\_data may be
 stale





Proc 1: Write g\_data MyExportBarrier(); Write g\_dataReady

Proc 2: Read g\_dataReady MyImportBarrier(); Read g\_data

» It's all good!



# x86/x64 FTW!!!

- » Not so fast...
- » Compilers are just as evil as processors
- » Compilers will rearrange your code as much as legally possible
  - And compilers assume your code is single threaded
- » Compiler and CPU reordering barriers needed



### MyExportBarrier

- Prevents reordering of writes by compiler *or* CPU
   Used when handing out access to data
- » x86/x64: \_ReadWriteBarrier();
  - ③ Compiler intrinsic, prevents compiler reordering
- » PowerPC: \_\_lwsync();
  - Ardware barrier, prevents CPU write reordering
- » ARM: \_\_dmb(); // Full hardware barrier
- » IA64: \_\_mf(); // Full hardware barrier
- » Positioning is crucial!
  - Write the data, MyExportBarrier, write the control value
- » Export-barrier followed by write is known as writerelease semantics



# MyImportBarrier();

- Prevents reordering of reads by compiler *or* CPU
   Used when gaining access to data
- » x86/x64: \_ReadWriteBarrier();
  - ③ Compiler intrinsic, prevents compiler reordering
- » PowerPC: \_\_lwsync(); or isync();
  - Hardware barrier, prevents CPU read reordering
- » ARM: \_\_dmb(); // Full hardware barrier
- » IA64: \_\_mf(); // Full hardware barrier
- » Positioning is crucial!
  - Read the control value, MyImportBarrier, read the data
- » Read followed by import-barrier is known as readacquire semantics



### Fixed Job Queue #2

» Assigning work – one writer only:

if( RoomAvail( readPt, writePt ) ) {

MyImportBarrier();

CircWorkList[ writePt ] = workItem;

MyExportBarrier();

```
writePt = WRAP( writePtr + 1 );
```

» Worker thr if( Datai MyImk

WorkItem workItem =

```
CircWorkList[ readPt ];
```

MyExportBarrier();

```
readPt = WRAP( readPt + 1 );
```

DoWork( workItem );

www.GDConf.com



### Dekker's/Peterson's Algorithm

```
int T1 = 0, T2 = 0;
```

```
Proc 1:
void LockForT1() {
    T1 = 1;
    if( T2 != 0 ) {
```

```
Proc 2:
void LockForT2() {
    T2 = 1;
    if( T1 != 0 ) {
```

. . .

}

www.GDConf.com



#### Dekker's/Peterson's Animation

Proc 1:

Write T1

Read T2

Proc 2:

Write T2

Read T1

» Epic fail! (on

x86/x64 also)





#### Dekker's/Peterson's Animation



Proc 1: Write T1 MemoryBarrier(); Read T2 Proc 2: Write T2 MemoryBarrier(); Read T1

» It's all good!



## Full Memory Barrier

- » MemoryBarrier();
  - x86: \_\_asm xchg Barrier, eax
  - . x64: \_\_faststorefence();
  - Xbox 360: \_\_\_\_sync();
  - . ARM: \_\_dmb();
  - ③ IA64: \_\_mf();
- » Needed for Dekker's algorithm, implementing locks, etc.
- Prevents all reordering including preventing reads passing writes
- » Most expensive barrier type



### Dekker's/Peterson's Fixed

```
int T1 = 0, T2 = 0;
```

```
Proc 1:
void LockForT1() {
    T1 = 1;
    MemoryBarrier();
    if( T2 != 0 ) {
```

```
Proc 2:
void LockForT2() {
    T2 = 1;
    MemoryBarrier();
    if( T1 != 0 ) {
```



### Dekker's/Peterson's Still Broken

```
int T1 = 0, T2 = 0;
```

```
Proc 1:
void LockForT1() {
    T1 = 1;
    MyExportBarrier();
    if( T2 != 0 ) {
```

```
Proc 2:
void LockForT2() {
    T2 = 1;
    MyExportBarrier();
    if( T1 != 0 ) {
```

www.GDConf.com



#### Dekker's/Peterson's Still Broken

```
int T1 = 0, T2 = 0;
```

```
Proc 1:
void LockForT1() {
    T1 = 1;
    MyImportBarrier();
    if( T2 != 0 ) {
```

```
Proc 2:
void LockForT2() {
    T2 = 1;
    MyImportBarrier();
    if( T1 != 0 ) {
```



#### Dekker's/Peterson's Still Broken

```
int T1 = 0, T2 = 0;
```

```
Proc 1:
void LockForT1() {
    T1 = 1;
    MyExportBarrier(); MyImportBarrier();
    if( T2 != 0 ) {
```

```
Proc 2:
void LockForT2() {
    T2 = 1;
    MyExportBarrier(); MyImportBarrier();
    if( T1 != 0 ) {
```



# What About Volatile?

- Standard volatile semantics not designed for multi-threading
  - Compiler can move normal reads/writes past volatile reads/writes
  - Also, doesn't prevent CPU reordering
- » VC++ 2005+ volatile is better...
  - Acts as read-acquire/write-release on x86/x64 and Itanium
  - Doesn't prevent hardware reordering on Xbox 360
- » Watch for atomic<T> in C++0x
  - Sequentially consistent by default but can choose from four memory models



**>>** 

## Double Checked Locking

```
Foo* GetFoo() {
    static Foo* volatile s pFoo;
    Foo* tmp = s pFoo;
    if( !tmp ) {
        EnterCriticalSection( &initLock );
        tmp = s pFoo; // Reload inside lock
        if( !tmp ) {
            tmp = new Foo();
            s pFoo = tmp;
        }
        LeaveCriticalSection( &initLock );
    }
    return tmp; }
  This is broken on many systems
```



# Possible Compiler Rewrite

```
Foo* GetFoo() {
    static Foo* volatile s pFoo;
    Foo* tmp = s pFoo;
    if( !tmp ) {
        EnterCriticalSection( &initLock );
        tmp = s pFoo; // Reload inside lock
        if( !tmp ) {
            s pFoo = (Foo*)new char[sizeof(Foo)];
            new(s pFoo) Foo; tmp = s pFoo;
        }
        LeaveCriticalSection( &initLock );
    }
    return tmp; }
```



>>

# Double Checked Locking

```
Foo* GetFoo() {
    static Foo* volatile s pFoo;
    Foo* tmp = s_pFoo; MyImportBarrier();
    if( !tmp ) {
        EnterCriticalSection( &initLock );
        tmp = s pFoo; // Reload inside lock
        if( !tmp ) {
            tmp = new Foo();
            MyExportBarrier(); s pFoo = tmp;
        }
        LeaveCriticalSection( &initLock );
    }
    return tmp; }
  Fixed
```



### InterlockedXxx

- Necessary to extend lockless algorithms to greater than two threads
   A whole separate talk...
- » InterlockedXxx is a full barrier on Windows for x86, x64, and Itanium
- » Not a barrier at all on Xbox 360
   ③ Oops. Still atomic, just not a barrier
- » InterlockedXxx Acquire and Release are portable across all platforms
  - Same guarantees everywhere
  - Safer than regular InterlockedXxx on Xbox 360
  - No difference on x86/x64
  - Recommended



# Practical Lockless Uses

- » Reference counts
- » Setting a flag to tell a thread to exit
- » Publisher/Subscriber with one reader and one writer – lockless pipe
- » SLists
- » XMCore on Xbox 360
- » Double checked locking



### Barrier Summary

- » MyExportBarrier when publishing data, to prevent write reordering
- » MyImportBarrier when acquiring data, to prevent read reordering
- » MemoryBarrier to stop all reordering, including reads passing writes
- » Identify where you are publishing/releasing and where you are subscribing/acquiring



#### Summary

- » Prefer using locks they are full barriers
  - » Acquiring and releasing a lock is a memory barrier
- » Use lockless only when costs of locks are shown to be too high
- » Use pre-built lockless algorithms if possible
- » Encapsulate lockless algorithms to make them safe to use
- » Volatile is not a portable solution
- » Remember that InterlockedXxx is a full barrier on Windows, but not on Xbox 360



#### References

- Intel memory model documentation in Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 3A: System Programming Guide
  - <u>http://download.intel.com/design/processor/manuals/253668.pdf</u>
- » AMD "Multiprocessor Memory Access Ordering"
  - <u>http://www.amd.com/us-</u> <u>en/assets/content\_type/white\_papers\_and\_tech\_docs/24593.pdf</u>
- » PPC memory model explanation
  - <u>http://www.ibm.com/developerworks/eserver/articles/powerpc.ht</u>
     <u>ml</u>
- » Lockless Programming Considerations for Xbox 360 and Microsoft Windows
  - <u>http://msdn.microsoft.com/en-us/library/bb310595.aspx</u>
- » Perils of Double Checked Locking
  - <u>http://www.aristeia.com/Papers/DDJ\_Jul\_Aug\_2004\_revised.pdf</u>
- » Java Memory Model Cookbook
  - http://g.oswego.edu/dl/jmm/cookbook.html



#### Questions?

» <u>bdawson@microsoft.com</u>

www.GDConf.com



### Feedback forms

» Please fill out feedback forms

www.GDConf.com