False Sharing

How to detect and analyze False Sharing

perf record/report/stat are widely used for performance tuning. Once hotspots are detected, tools like ‘perf-c2c’ and ‘pahole’ can be used to pinpoint the data structures that may be suffering from false sharing. ‘addr2line’ is also useful for decoding an instruction pointer when there are multiple layers of inline functions.

perf-c2c can capture the cache lines with the most false sharing hits, the decoded functions (with file and line number) that access each cache line, and the in-line offset of the data. Simple commands are:

$ perf c2c record -ag sleep 3
$ perf c2c report --call-graph none -k vmlinux

When the above commands are run while testing will-it-scale’s tlb_flush1 case, perf reports something like:

Total records                     :    1658231
Locked Load/Store Operations      :      89439
Load Operations                   :     623219
Load Local HITM                   :      92117
Load Remote HITM                  :        139

#----------------------------------------------------------------------
    4        0     2374        0        0        0  0xff1100088366d880
#----------------------------------------------------------------------
  0.00%   42.29%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81373b7b         0       231       129     5312        64  [k] __mod_lruvec_page_state    [kernel.vmlinux]  memcontrol.h:752   1
  0.00%   13.10%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff81374718         0       226        97     3551        64  [k] folio_lruvec_lock_irqsave  [kernel.vmlinux]  memcontrol.h:752   1
  0.00%   11.20%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c29bf         0       170       136      555        64  [k] lru_add_fn                 [kernel.vmlinux]  mm_inline.h:41     1
  0.00%    7.62%    0.00%    0.00%    0.00%    0x8     1       1  0xffffffff812c3ec5         0       175       108      632        64  [k] release_pages              [kernel.vmlinux]  mm_inline.h:41     1
  0.00%   23.29%    0.00%    0.00%    0.00%   0x10     1       1  0xffffffff81372d0a         0       234       279     1051        64  [k] __mod_memcg_lruvec_state   [kernel.vmlinux]  memcontrol.c:736   1
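How a data address splits into a cache line address and an in-line offset can be sketched in a few lines of C. This is only an illustration: it assumes 64-byte cache lines (consistent with the 64-byte-aligned line address in the report above), and `demo_line_base()` and `demo_line_offset()` are hypothetical helper names, not part of perf or the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define DEMO_CACHE_LINE_SIZE 64ULL

static_assert((DEMO_CACHE_LINE_SIZE & (DEMO_CACHE_LINE_SIZE - 1)) == 0,
              "cache line size must be a power of two for the masks below");

/* Address of the cache line that contains addr. */
static uint64_t demo_line_base(uint64_t addr)
{
        return addr & ~(DEMO_CACHE_LINE_SIZE - 1);
}

/* Byte offset of addr inside its cache line. */
static uint64_t demo_line_offset(uint64_t addr)
{
        return addr & (DEMO_CACHE_LINE_SIZE - 1);
}
```

Under these assumptions, data addresses 0xff1100088366d888 and 0xff1100088366d890 both map to the line at 0xff1100088366d880, at offsets 0x8 and 0x10, so accesses to the two locations contend for the same cache line even though they touch different data.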

A nice introduction to perf-c2c is [3].

‘pahole’ decodes data structure layouts and marks the cache line boundaries. Users can match the offset in the perf-c2c output against pahole’s decoding to locate the exact data members. For global data, users can look up the data address in System.map.
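The matching step can be sketched in plain C with `offsetof()`. This is a minimal illustration, not how pahole works internally: `struct demo_stats`, `demo_member_at()` and the member names are hypothetical, 64-byte cache lines are assumed, and the object is assumed to start on a cache line boundary (otherwise the object's own offset within its line has to be added first).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 64-byte cache lines. */
#define DEMO_CACHE_LINE_SIZE 64

/* Hypothetical structure for illustration; not a real kernel type. */
struct demo_stats {
        uint64_t flags;         /* offset 0x00 */
        uint64_t counter;       /* offset 0x08 */
        uint64_t state;         /* offset 0x10 */
        uint8_t  pad[40];       /* fills the rest of cache line 0 */
        uint64_t other;         /* offset 0x40, first bytes of cache line 1 */
};

static_assert(offsetof(struct demo_stats, other) == DEMO_CACHE_LINE_SIZE,
              "'other' is expected to start the second cache line");

struct demo_member {
        const char *name;
        size_t offset;
        size_t size;
};

#define DEMO_MEMBER(m) \
        { #m, offsetof(struct demo_stats, m), sizeof(((struct demo_stats *)0)->m) }

static const struct demo_member demo_members[] = {
        DEMO_MEMBER(flags),
        DEMO_MEMBER(counter),
        DEMO_MEMBER(state),
        DEMO_MEMBER(pad),
        DEMO_MEMBER(other),
};

/*
 * Return the member that covers byte 'off_in_line' of cache line 'line'
 * of the structure, or NULL if no listed member covers that byte.
 */
static const struct demo_member *demo_member_at(size_t line, size_t off_in_line)
{
        size_t off = line * DEMO_CACHE_LINE_SIZE + off_in_line;
        size_t i;

        for (i = 0; i < sizeof(demo_members) / sizeof(demo_members[0]); i++) {
                const struct demo_member *m = &demo_members[i];

                if (off >= m->offset && off < m->offset + m->size)
                        return m;
        }
        return NULL;
}
```

For this hypothetical layout, `demo_member_at(0, 0x8)` returns the entry for `counter` and `demo_member_at(0, 0x10)` the entry for `state`, which is the same lookup a reader performs by eye against pahole's output.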

Possible Mitigations

False sharing does not always need to be mitigated. False sharing mitigations should balance performance gains with complexity and space consumption. Sometimes, lower performance is OK, and it’s unnecessary to hyper-optimize every rarely used data structure or a cold data path.

Cases where false sharing hurts performance are seen more frequently as core counts increase. Because of these detrimental effects, many patches have been proposed across a variety of subsystems (like networking and memory management) and merged. Some common mitigations (with examples) are:

- Separate hot global data into its own dedicated cache line, even if it is just a ‘short’ type. The downside is higher consumption of memory, cache lines and TLB entries.
  - Commit 91b6d3256356 (“net: cache align tcp_memory_allocated, tcp_sockets_allocated”)
- Reorganize the data structure to separate the interfering members into different cache lines. One downside is that it may introduce new false sharing among other members.
  - Commit 802f1d522d5f (“mm: page_counter: re-layout structure to reduce false sharing”)
- Replace ‘write’ with ‘read’ when possible, especially in loops. For example, for some global variables, use compare(read)-then-write instead of an unconditional write. That is, use:
  ```
  if (!test_bit(XXX))
          set_bit(XXX);
  ```
  instead of calling “set_bit(XXX);” directly, and similarly for atomic_t data:
  ```
  if (atomic_read(XXX) == AAA)
          atomic_set(XXX, BBB);
  ```
  - Commit 7b1002f7cfe5 (“bcache: fixup bcache_dev_sectors_dirty_add() multithreaded CPU false sharing”)
  - Commit 292648ac5cf1 (“mm: gup: allow FOLL_PIN to scale in SMP”)
- Turn hot global data into ‘per-cpu data + global data’ when possible, or reasonably increase the threshold for syncing per-cpu data to global data, to reduce or postpone the ‘write’ to that global data.
  - Commit 520f897a3554 (“ext4: use percpu_counters for extent_status cache hits/misses”)
  - Commit 56f3547bfa4d (“mm: adjust vm_committed_as_batch according to vm overcommit policy”)
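The last mitigation, ‘per-cpu data + global data’ with a sync threshold, can be sketched in userspace C11 as follows. This is a simplified illustration of the idea, not the kernel's percpu_counter implementation: `struct demo_counter`, `struct demo_local`, `demo_counter_add()` and the `DEMO_BATCH` threshold are hypothetical names, 64-byte cache lines are assumed, and each `struct demo_local` is assumed to be touched by one CPU (or thread) only.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>

#define DEMO_CACHE_LINE_SIZE 64 /* assumption: 64-byte cache lines */
#define DEMO_BATCH 32           /* hypothetical sync threshold */

/* Shared total: written only when a local delta reaches the threshold. */
struct demo_counter {
        atomic_long global;
};

/*
 * Per-CPU (here: per-thread) part. Aligned to a full cache line so that
 * two instances placed side by side in an array never share a line.
 */
struct demo_local {
        alignas(DEMO_CACHE_LINE_SIZE) long delta;
};

static_assert(sizeof(struct demo_local) == DEMO_CACHE_LINE_SIZE,
              "each local slot should occupy exactly one cache line");

/* Accumulate locally; fold into the shared total once per batch. */
static void demo_counter_add(struct demo_counter *c, struct demo_local *l,
                             long n)
{
        l->delta += n;
        if (l->delta >= DEMO_BATCH || l->delta <= -DEMO_BATCH) {
                atomic_fetch_add_explicit(&c->global, l->delta,
                                          memory_order_relaxed);
                l->delta = 0;
        }
}

/* Approximate value: deltas not yet folded in are not included. */
static long demo_counter_read_approx(struct demo_counter *c)
{
        return atomic_load_explicit(&c->global, memory_order_relaxed);
}
```

The trade-off is visible in `demo_counter_read_approx()`: a larger `DEMO_BATCH` means fewer writes to the shared cache line, but the global value can lag behind by up to roughly the threshold per CPU.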

Of course, all mitigations should be carefully verified to make sure they do not cause side effects. To avoid introducing false sharing when coding, it is better to:

- Be aware of cache line boundaries
- Group mostly read-only fields together
- Group things that are written at the same time together
- Separate frequently read and frequently written fields onto different cache lines

It is also good practice to add a comment stating the false sharing consideration.
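These guidelines can be illustrated with a small userspace C11 sketch. Everything here is hypothetical (`struct demo_queue`, its fields and `demo_same_line()`), and 64-byte cache lines are assumed. Kernel code would normally express the alignment with annotations such as `____cacheline_aligned_in_smp` rather than C11 `alignas`.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE_SIZE 64 /* assumption: 64-byte cache lines */

/* Hypothetical structure for illustration; not a real kernel type. */
struct demo_queue {
        /* Read-mostly: set up once, then only read by all CPUs. */
        uint32_t capacity;
        uint32_t flags;
        const void *ops;

        /*
         * Written by producers. Kept on a separate cache line so that
         * producer writes do not invalidate the read-mostly fields above
         * (false sharing consideration).
         */
        alignas(DEMO_CACHE_LINE_SIZE) uint64_t head;
        uint64_t enqueued;      /* updated together with head */

        /*
         * Written by consumers. Separate line again, so that producers
         * and consumers do not bounce one line between them.
         */
        alignas(DEMO_CACHE_LINE_SIZE) uint64_t tail;
        uint64_t dequeued;      /* updated together with tail */
};

/* True if two byte offsets fall into the same cache line. */
static int demo_same_line(size_t a, size_t b)
{
        return a / DEMO_CACHE_LINE_SIZE == b / DEMO_CACHE_LINE_SIZE;
}

/* Catch accidental layout changes at build time. */
static_assert(offsetof(struct demo_queue, head) % DEMO_CACHE_LINE_SIZE == 0,
              "producer fields must start a cache line");
static_assert(offsetof(struct demo_queue, tail) % DEMO_CACHE_LINE_SIZE == 0,
              "consumer fields must start a cache line");
```

The build-time assertions play the same role as the comments: they document the layout intent and make it harder for a later change to silently put hot written fields back next to read-mostly ones.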

Note that sometimes, even after a severe case of false sharing is detected and fixed, performance may still show no obvious improvement because the hotspot moves to a new place.

Miscellaneous

One open issue is that the kernel has an optional data structure randomization mechanism, which also randomizes how data members share cache lines.
