NUMA support
author Simon Marlow <marlowsd@gmail.com>
Sat, 23 Apr 2016 20:14:49 +0000 (21:14 +0100)
committer Simon Marlow <marlowsd@gmail.com>
Fri, 10 Jun 2016 20:25:54 +0000 (21:25 +0100)
commit 9e5ea67e268be2659cd30ebaed7044d298198ab0
tree c395e74ee772ae0d59c852b3cbde743784b08d09
parent b9fa72a24ba2cc3120912e6afedc9280d28d2077

Summary:
The aim here is to reduce the number of remote memory accesses on
systems with a NUMA memory architecture, typically multi-socket servers.

Linux provides a NUMA API for doing two things:
* Allocating memory local to a particular node
* Binding a thread to a particular node

When given the +RTS --numa flag, the runtime will:
* Determine the number of NUMA nodes (N) by querying the OS
* Assign capabilities to nodes, so cap C is on node C%N
* Bind worker threads on a capability to the correct node
* Keep separate free lists in the block layer for each node
* Allocate the nursery for a capability from node-local memory
* Allocate blocks in the GC from node-local memory

For example, using nofib/parallel/queens on a 24-core 2-socket machine:

```
$ ./Main 15 +RTS -N24 -s -A64m
  Total   time  173.960s  (  7.467s elapsed)

$ ./Main 15 +RTS -N24 -s -A64m --numa
  Total   time  150.836s  (  6.423s elapsed)
```
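
To put the two runs above in relative terms, the following small helper
computes the percentage reduction. The function name `percent_reduction` is
made up for this illustration; the inputs are the timings printed above.

```c
#include <assert.h>

/* Percentage by which `after` is smaller than `before`. */
static double percent_reduction(double before, double after)
{
    assert(before > 0.0);
    return 100.0 * (before - after) / before;
}
```

Applied to the figures above, `percent_reduction(7.467, 6.423)` gives roughly
14.0% for elapsed time and `percent_reduction(173.960, 150.836)` gives
roughly 13.3% for total time.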

The biggest win here is expected to come from allocating from node-local
memory, so programs using a large -A value (as here) should benefit most.

According to perf, on this program the number of remote memory accesses
was reduced by more than 50% by using `--numa`.

Test Plan:
* validate
* There's a new flag --debug-numa=<n> that pretends to do NUMA without
  actually making the OS calls, which is useful for testing the code
  on non-NUMA systems.
* TODO: I need to add some unit tests
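
One way to picture how a flag like --debug-numa=<n> can exercise the NUMA
code paths without touching the OS is the sketch below. It is an assumption
about the general shape, not the patch's implementation: `NumaFlags`,
`NodeQuery`, `choose_node_count` and `stub_query_os` are all hypothetical
names, and the stub merely stands in for a real OS query.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag record; the field names are illustrative only. */
typedef struct {
    bool     numa;        /* --numa was given */
    unsigned debug_numa;  /* n from --debug-numa=<n>, 0 if absent */
} NumaFlags;

/* Callback that asks the OS how many NUMA nodes exist. */
typedef unsigned (*NodeQuery)(void);

/* Stand-in for a real OS query; counts how often it is called. */
static unsigned os_calls;
static unsigned stub_query_os(void)
{
    os_calls++;
    return 2;
}

/* Decide how many nodes the runtime should behave as if it has. */
static unsigned choose_node_count(const NumaFlags *flags, NodeQuery query_os)
{
    if (flags->debug_numa > 0) {
        return flags->debug_numa;      /* pretend: no OS call is made */
    }
    if (flags->numa) {
        unsigned n = query_os();
        return n > 0 ? n : 1;          /* fall back to a single node */
    }
    return 1;                          /* NUMA support disabled */
}
```

Because the debug path never calls `query_os`, the node assignment and
per-node bookkeeping can be tested on a machine with a single node.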

Reviewers: erikd, austin, rwbarton, ezyang, bgamari, hvr, niteria

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2199
43 files changed:
configure.ac
docs/users_guide/runtime_control.rst
includes/Cmm.h
includes/Rts.h
includes/RtsAPI.h
includes/rts/Constants.h
includes/rts/Flags.h
includes/rts/OSThreads.h
includes/rts/Threads.h
includes/rts/storage/Block.h
includes/rts/storage/MBlock.h
rts/Capability.c
rts/Capability.h
rts/HeapStackCheck.cmm
rts/Inlines.c
rts/Messages.h
rts/PrimOps.cmm
rts/ProfHeap.c
rts/RtsFlags.c
rts/SMPClosureOps.h [moved from includes/rts/storage/SMPClosureOps.h with 98% similarity]
rts/STM.c
rts/Schedule.c
rts/Task.c
rts/Task.h
rts/eventlog/EventLog.c
rts/package.conf.in
rts/posix/OSMem.c
rts/posix/OSThreads.c
rts/sm/BlockAlloc.c
rts/sm/BlockAlloc.h
rts/sm/GC.c
rts/sm/GCUtils.c
rts/sm/GCUtils.h
rts/sm/MBlock.c
rts/sm/MarkStack.h
rts/sm/OSMem.h
rts/sm/Storage.c
rts/win32/OSMem.c
rts/win32/OSThreads.c
testsuite/config/ghc
testsuite/tests/codeGen/should_run/all.T
testsuite/tests/concurrent/prog001/all.T
testsuite/tests/concurrent/should_run/all.T