

## OS IMPACT ON PERFORMANCE: OPENING THE ROOF

#### LAPP - Annecy Sébastien Valat - INRIA

Gray Scott Reloaded – Summer School - 11 July 2024

- **I. Introduction**
- II. Analysis of OS paging policy
- III. NUMA allocator for HPC applications
- IV. Page zeroing in Linux first touch handler
- V. Conclusion

## ADVERTISING

### **Understanding the memory management**

From the transistor to the application

#### <u>What Every Programmer Should Know About Memory</u> (Ulrich Drepper) https://people.freebsd.org/~lstewart/articles/cpumemory.pdf



Figure: first page of "What Every Programmer Should Know About Memory" (Ulrich Drepper, Red Hat, Inc.).




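## A look at HPC / supercomputers

My PhD work

Contribution à l'amélioration des méthodes d'optimisation de la gestion de la mémoire dans le cadre du Calcul Haute Performance (Contribution to the improvement of memory management optimization methods in the context of High Performance Computing)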



## OPERATING SYSTEM (OS) BETWEEN APPLICATIONS AND HARDWARE

#### The keywords of the last two weeks



#### Hardware and software stack

To obtain performance:

- We need to **optimize** the **interaction** between **all components**.
- The synthetic view so far:





https://press.ariane.group/le-nouveau-lanceur-europeen-ariane-6-a-pris-son-envol/?lang=fra


#### Hardware and software stack

- Non-optimal hardware usage leads to slowdowns.
- We are not in direct contact with the hardware.
- Poor use of the OS leads to slowdowns too.





- **I. Introduction**
- II. Analysis of OS paging policy
- III. NUMA allocator for HPC applications
- IV. Page zeroing in Linux first touch handler
- V. Conclusion

## INTRODUCTION

#### **Context: HPC**

Memory is becoming a critical resource

Growing impact on **performance**

**Data movement:** speed gap between CPU and RAM, the **memory wall**.

Management: we now have to handle close to a TB of memory

Decreasing memory per core



http://www.cea.fr/multimedia/Pages/galeries/defense/Tera-100.aspx

## CACHES

### Damn slow memory

A story of caches and hierarchy



Computer science: operations & data

Multiple memory levels

Hierarchical caches



Pre-fetcher

#### **Cache lines**



## Array of Struct in memory







#### Half of the cache lost

```cpp
// AOS (Array of Struct)
struct Cell
{
    float u;
    float v;
};

Cell mesh[HEIGHT][WIDTH];

// Loop using one of the members
#pragma omp parallel for
for (size_t y = 0; y < HEIGHT; y++)
    for (size_t x = 0; x < WIDTH; x++)
        mesh[y][x].u += 5.0;
```
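The "half of the cache lost" observation can be checked with a little arithmetic. The sketch below assumes 64-byte cache lines (a common value on x86, not stated on the slide); `CACHE_LINE_BYTES`, `cells_per_line` and `useful_fraction_aos` are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>

// AoS cell as on the slide: u and v are interleaved in memory.
struct Cell
{
    float u;
    float v;
};

// Assumption: 64-byte cache lines (typical on x86, check your own CPU).
constexpr std::size_t CACHE_LINE_BYTES = 64;

// Number of cells that fit in one cache line.
constexpr std::size_t cells_per_line()
{
    return CACHE_LINE_BYTES / sizeof(Cell);
}

// Fraction of each loaded cache line that a loop touching only
// the member u actually uses.
constexpr double useful_fraction_aos()
{
    return static_cast<double>(cells_per_line() * sizeof(float))
         / static_cast<double>(CACHE_LINE_BYTES);
}

static_assert(sizeof(Cell) == 2 * sizeof(float), "no padding expected in Cell");
```

With 4-byte floats, each line holds 8 cells, but only the 8 `u` values (32 bytes out of 64) are used by the loop; the `v` values are loaded and never read.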

#### Struct of Array in memory

#### **Struct of Array**





```cpp
// Loop using one of the members
// (assuming mesh is a struct holding one array per member: mesh.u, mesh.v)
#pragma omp parallel for
for (size_t y = 0; y < HEIGHT; y++)
    for (size_t x = 0; x < WIDTH; x++)
        mesh.u[y][x] += 5.0;
```
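The slide only shows the loop, so here is a minimal sketch of what a Struct of Array layout could look like. The `Mesh` type, its flat `std::vector` storage and the `add_to_u` helper are assumptions made for illustration, not the author's actual code.

```cpp
#include <cstddef>
#include <vector>

// SoA (Struct of Array): each member gets its own contiguous array.
// Hypothetical layout; the slide does not show the definition.
struct Mesh
{
    std::size_t height;
    std::size_t width;
    std::vector<float> u;   // all u values, row-major
    std::vector<float> v;   // all v values, row-major

    Mesh(std::size_t h, std::size_t w)
        : height(h), width(w), u(h * w, 0.0f), v(h * w, 0.0f) {}
};

// Update only u: the loop streams through the u array and never
// pulls any v value into the cache.
void add_to_u(Mesh &mesh, float value)
{
    #pragma omp parallel for
    for (std::size_t y = 0; y < mesh.height; y++)
        for (std::size_t x = 0; x < mesh.width; x++)
            mesh.u[y * mesh.width + x] += value;
}
```

A call such as `Mesh mesh(HEIGHT, WIDTH); add_to_u(mesh, 5.0f);` mirrors the loop from the slide. Compared with the AoS version, every byte of each cache line loaded by this loop holds a `u` value.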




The classic **mistake**:

- Walk over data (time 1)
- Walk again over data (time 2)
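One common reading of this mistake is two successive loops over an array that does not fit in cache: by the time the second walk starts, the beginning of the array has already been evicted. The sketch below contrasts that with a fused loop. The function names and the arithmetic are made up for illustration.

```cpp
#include <cstddef>
#include <vector>

// Two separate walks: if the array is larger than the cache, the data
// loaded by the first loop has been evicted before the second loop runs,
// so every element is fetched from memory twice.
void two_passes(std::vector<float> &data)
{
    for (std::size_t i = 0; i < data.size(); i++)
        data[i] *= 2.0f;
    for (std::size_t i = 0; i < data.size(); i++)
        data[i] += 1.0f;
}

// Fused walk: each element is loaded once and both operations are
// applied while it is still in a register or in cache.
void one_pass(std::vector<float> &data)
{
    for (std::size_t i = 0; i < data.size(); i++)
        data[i] = data[i] * 2.0f + 1.0f;
}
```

Both functions compute the same result. Only the number of trips through the memory hierarchy differs, which is what matters once the data set exceeds the cache size.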








#### OMP\_PROC\_BIND=close OMP\_NUM\_THREADS=6 hwloc-bind core:0-5 ./soa

Figure: lstopo topology of the test machine (host svalat-inria, Thu. 11 July 2024 09:41:01), 31 GB total, one package, one NUMA node, 24 MB shared L3.

- Fast cores (Core L#0 to L#5): 1280 KB L2, 48 KB L1d and 32 KB L1i per core, two PUs per core (PU L#0 to L#11).
- Energy efficient cores (Core L#6 to L#13): 2048 KB L2 caches, 32 KB L1d and 64 KB L1i per core, one PU per core (PU L#12 to L#19).
- Other devices: OpenCL coprocessor (20 compute units, 7976 MB), NVMe disk nvme0n1 (1907 GB), network interface wlp0s20f3.
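Thread placement under `OMP_PROC_BIND=close` or `OMP_PROC_BIND=spread` can also be observed from inside the program. The sketch below records the logical CPU each OpenMP thread runs on. It relies on `sched_getcpu()`, which is specific to Linux/glibc, and `sample_thread_cpus` is a hypothetical helper name, not part of the `soa` binary from the slides.

```cpp
#include <sched.h>   // sched_getcpu(), Linux/glibc specific
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns, for each OpenMP thread, the logical CPU (PU) it was running
// on when sampled. Entries stay at -1 for threads that were not started.
// Without OpenMP support the result has a single entry.
std::vector<int> sample_thread_cpus()
{
    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif
    std::vector<int> cpus(nthreads, -1);

    #pragma omp parallel
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        // Each thread writes only its own slot, so no synchronization
        // is needed.
        cpus[tid] = sched_getcpu();
    }
    return cpus;
}
```

Built with OpenMP enabled (for example `-fopenmp` with GCC) and run with the environment variables from the slides, the returned CPU numbers can be compared against the P# indexes of the topology view to see which cores each binding policy selects.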


#### OMP\_PROC\_BIND=spread OMP\_NUM\_THREADS=6 hwloc-bind core:0-5 ./soa

[Figure: lstopo topology of host svalat-inria (Thursday 11 July 2024, 09:41:01). One package with a single NUMA node (31 GB) and a 24 MB L3. Cores L#0 to L#5 are the fast cores (L2 1280 KB, L1d 48 KB, L1i 32 KB, two PUs per core). Cores L#6 to L#13 are the energy efficient cores (L2 2048 KB, L1d 32 KB, L1i 64 KB, PUs P#12 to P#19). PCI devices: an OpenCL coprocessor (20 compute units, 7976 MB), an NVMe disk nvme0n1 (1907 GB) and the network interface wlp0s20f3.]


## NUMA

Hierarchical memory

Remote / local memories (NUMA : Non Uniform Memory Access)






#### Now also inside the CPU – Intel KNL

- Intel KNL (64 cores) can be configured with **2 or 4 NUMA domains**
- It also adds MCDRAM (similar in idea to GPU GDDR5), exposed as a NUMA node



[Figure: Intel KNL memory layout (MCDRAM and DDR4)]

Or on AMD CPUs

## PAGING

#### Software memory management layer

Impact of memory management mechanisms ?

- Involving two components :
  - User space memory allocator (malloc)
  - Operating System (OS)
- Focus on :
  - Impact on allocation time
  - Impact on access efficiency (placement)

Malloc C or C++ interface:

    float *ptr = new float[SIZE];
    ...
    delete [] ptr;



#### **OS virtual / physical address spaces**

Two address spaces : **physical + virtual** 

- The memory mapping is described in blocks of 4 KB (pages)
- Paging was first used in 1962 on the ATLAS computer
- Areas are created with syscalls: mmap / munmap / mremap
- malloc is responsible for hiding pages from developers
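As an illustration of the area-creation syscalls listed above, here is a minimal sketch assuming a Linux/POSIX system. The helper names `area_create` and `area_destroy` are hypothetical and only wrap `mmap` / `munmap` for an anonymous private area.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: reserve `size` bytes of anonymous, private virtual
 * memory. Returns NULL on failure. */
static void *area_create(size_t size)
{
    assert(size > 0);
    void *ptr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return ptr == MAP_FAILED ? NULL : ptr;
}

/* Hypothetical helper: give the whole area back to the OS.
 * Returns 0 on success, as munmap does. */
static int area_destroy(void *ptr, size_t size)
{
    return munmap(ptr, size);
}
```

A caller would use `area_create(16 * 4096)` to get 16 pages of zero-filled memory, then `area_destroy` with the same size. This is roughly what a malloc implementation does internally for large blocks.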



http://www.computerhistory.org/collections/catalog/102698470



#### **Origin of the concept**

- May **1956**: Fritz Rudolf Güntsch's doctoral thesis
- *Logical Design of a Digital Computer with Multiple Asynchronous Rotating Drums and Automatic High Speed Memory Operation*
- Origin of the virtual memory concept: <u>IEEE Annals of the History of Computing – Anecdotes</u> https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1369143



## Page table and TLB



The CPU caches address translations in the TLB (Translation Lookaside Buffer)

### Huge pages

- With **4K pages**, an Intel CPU TLB with **1024 entries** covers **4 MB**
- **x86\_64** processors also support **2 MB** or **1 GB** pages (huge pages)
- With **2M pages**, the TLB covers **2 GB**
- First real support: FreeBSD (superpages, 2002) [1]
- Linux support: the *old HugeTLBfs*, and now **Transparent Huge Pages (THP), 2011**
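The coverage figures above are simply entries times page size. A small sketch of that arithmetic follows, assuming (as the slide does) the same 1024 entries for both page sizes. The function name `tlb_reach` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: memory covered by the TLB without a miss,
 * assuming every entry maps one page of `page_size` bytes. */
static uint64_t tlb_reach(uint64_t entries, uint64_t page_size)
{
    assert(entries > 0 && page_size > 0);
    return entries * page_size;
}
```

With 1024 entries, `tlb_reach(1024, 4096)` gives 4 MB and `tlb_reach(1024, 2 << 20)` gives 2 GB. Real TLBs often have fewer entries for huge pages, so this is an upper bound.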



## Play with memory mapping

- Virtual memory **isolates** each process
- We can build **shared memory** by mapping **the same memory twice**
- Most OSes also use a trick: mapping the OS memory into each process
  - At the end of the address space
  - Protected
- This caused issues with the Meltdown / Spectre attacks a few years ago!
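Mapping the same memory twice can be sketched as follows, assuming a POSIX system with a writable temporary directory. The function `same_memory_twice` is hypothetical: it backs two `MAP_SHARED` views with one temporary file, writes through the first view and reads through the second.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical demo: returns 1 if a write through the first view is
 * visible through the second one, 0 if not, -1 on error. */
static int same_memory_twice(size_t size)
{
    assert(size >= 6);                 /* room for "hello" and its NUL */
    FILE *fp = tmpfile();              /* anonymous backing file */
    if (fp == NULL)
        return -1;
    int fd = fileno(fp);
    int result = -1;
    if (ftruncate(fd, (off_t)size) == 0) {
        char *v1 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *v2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (v1 != MAP_FAILED && v2 != MAP_FAILED) {
            strcpy(v1, "hello");       /* write through view 1 */
            /* two distinct virtual addresses, same physical content */
            result = (v1 != v2) && (strcmp(v2, "hello") == 0);
        }
        if (v1 != MAP_FAILED)
            munmap(v1, size);
        if (v2 != MAP_FAILED)
            munmap(v2, size);
    }
    fclose(fp);
    return result;
}
```

The same idea works across processes: each process maps the same file descriptor (or a POSIX shared memory object) into its own virtual address space.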



Physical memory (RAM)

#### Lazy page allocation

- mmap creates a purely virtual area
- The first touch triggers a **page fault** for each virtual page
- The OS provides physical pages on first touch
- **First touch implicitly** determines the **NUMA placement** of the page
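Lazy allocation can be observed from user space. The sketch below assumes Linux, where `mincore` reports which pages of a mapping are resident. The helper names `resident_pages` and `first_touch_demo` are hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: count the resident pages of a mapping, -1 on error. */
static long resident_pages(void *addr, size_t npages, size_t page)
{
    unsigned char *vec = malloc(npages);
    if (vec == NULL)
        return -1;
    long count = -1;
    if (mincore(addr, npages * page, vec) == 0) {
        count = 0;
        for (size_t i = 0; i < npages; i++)
            count += vec[i] & 1;       /* low bit set: page is in RAM */
    }
    free(vec);
    return count;
}

/* Hypothetical demo: map `npages`, count resident pages before and after
 * touching each page once. Returns 0 on success, -1 on error. */
static int first_touch_demo(size_t npages, long *before, long *after)
{
    assert(npages > 0 && before != NULL && after != NULL);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = npages * page;
    char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED)
        return -1;
    *before = resident_pages(area, npages, page);
    for (size_t i = 0; i < npages; i++)
        area[i * page] = 1;            /* first touch of each page */
    *after = resident_pages(area, npages, page);
    munmap(area, size);
    return (*before < 0 || *after < 0) ? -1 : 0;
}
```

On a typical Linux system `before` is expected to be 0 and `after` equal to `npages`. With transparent huge pages a single touch may make many 4K pages resident at once.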



#### **Cache associativity**

- Data can only be placed in one of the **N lines associated with the address**
- This can create **conflicts**, depending on the pages chosen by the OS
  - Linux "**randomly**" **chooses** the pages
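To make the "N lines associated with the address" concrete, here is a sketch of simple modulo set indexing. The struct and function names are hypothetical, and real CPUs may hash the address instead.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of a set-associative cache. */
struct cache_geometry {
    uint64_t size;       /* total size in bytes */
    uint64_t ways;       /* associativity (N lines per set) */
    uint64_t line_size;  /* bytes per cache line */
};

/* Number of sets: each set holds `ways` lines. */
static uint64_t cache_num_sets(struct cache_geometry c)
{
    assert(c.ways > 0 && c.line_size > 0);
    return c.size / (c.ways * c.line_size);
}

/* Set selected by a physical address under plain modulo indexing. */
static uint64_t cache_set_index(struct cache_geometry c, uint64_t addr)
{
    return (addr / c.line_size) % cache_num_sets(c);
}
```

For an assumed 8 MB, 16-way cache with 64-byte lines, there are 8192 sets, and two addresses 512 KB apart (one way size) compete for the same set.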



[Figure: conflicts caused by the paging policy. First plot: excess conflicts (pages) versus buffer size (MB), for random paging and Linux paging. Second plot: Linux versus Linux + THP, versus buffer size (MB).]

#### Same equation as for electrons in Boltzmann statistics!




### **Existing solutions**

#### Huge pages



### Page coloring

- 4K pages, placed while taking care of cache associativity
- Available on **OpenSolaris**
- **Color** based on the **virtual address** (modulo)



**Regular coloring**: coloring with **repeated patterns**
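A modulo-based color can be sketched as below. This is only an illustration of the principle, not the OpenSolaris implementation. The names `page_colors` and `page_color` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of page colors, i.e. how many pages fit
 * in one cache way. */
static uint64_t page_colors(uint64_t cache_size, uint64_t ways,
                            uint64_t page_size)
{
    assert(ways > 0 && page_size > 0);
    return cache_size / (ways * page_size);
}

/* Hypothetical helper: color of the page holding `vaddr` (simple modulo). */
static uint64_t page_color(uint64_t vaddr, uint64_t page_size, uint64_t colors)
{
    assert(page_size > 0 && colors > 0);
    return (vaddr / page_size) % colors;
}
```

For an assumed 8 MB, 16-way cache and 4 KB pages there are 128 colors, so virtual pages 0, 128, 256, ... share the same color and the pattern repeats every 512 KB.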



- I. Introduction
- II. Analysis of OS paging policy
- III. NUMA allocator performance for HPC applications
- IV. Page zeroing in Linux first touch handler
- V. Conclusion

## ANALYSIS OF OS PAGING POLICY

### **OS strategies comparison**

Each **system** has its default paging **strategy**:

| OS          | Strategy      |
|-------------|---------------|
| Linux       | 4K random     |
| OpenSolaris | Page coloring |
| FreeBSD     | Huge pages    |

- Is Linux slower due to random paging?

Tested architecture: dual-socket Intel **Nehalem**

Use a fixed toolchain: GCC / Binutils / MPI / BLAS

**Focus on a pathological case**

### **EulerMHD** issue

EulerMHD (CEA):

- C++ / MPI
- Magnetohydrodynamics stencil code
- FreeBSD: slowdown of 1.5x, up to 3x in parallel
- The impacted function only does computation
- The function works on **9 arrays pre-allocated** at init:

`for (i = 0 ; i < SIZE ; i++) x1[i] = x2[i] + x3[i] ... + x9[i]`

- What changes between OSes:
  - User space memory allocator (malloc)
  - OS paging policy
  - (Scheduler)
- The effect can be controlled by changing the allocator



[Figure: EulerMHD, sequential, default allocator, versus problem size]




### Alignment effect on regular coloring

- Each malloc (OS) produces different alignments
- **FreeBSD** aligns **large segments** on **2 MB**
- **This interferes** with the **regular patterns** generated by:
  - The OpenSolaris coloring method (modulo)
  - Huge pages

`for (i = 0 ; i < SIZE ; i++) a[i] = b[i] + c[i];`





### Solution

Avoid **aligning** segments on the **cache way size** (mmap / malloc).

- The Linux random approach prevents pathological cases
- Do not <u>use **regular patterns**</u> for **page coloring** (e.g. **single modulo**)
- **Huge pages** are **regular** by **hardware definition**
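On the application side, one possible workaround (an assumption of this sketch, not the method used in the talk) is to shift each large array by a different small offset so that their base addresses do not share the same alignment. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical handle: `base` goes to free(), `data` is the usable array. */
struct shifted_array {
    void  *base;
    float *data;
};

/* Allocate `count` floats, shifted by `index * shift_bytes` from the start
 * of the underlying block. Different indices give different alignments
 * when the allocator aligns all large blocks the same way. */
static struct shifted_array alloc_shifted(size_t count, size_t index,
                                          size_t shift_bytes)
{
    assert(shift_bytes % sizeof(float) == 0);  /* keep floats aligned */
    struct shifted_array a = { NULL, NULL };
    size_t offset = index * shift_bytes;
    a.base = malloc(count * sizeof(float) + offset);
    if (a.base != NULL)
        a.data = (float *)((char *)a.base + offset);
    return a;
}
```

For example, arrays `a`, `b` and `c` could be allocated with indices 0, 1 and 2 and a shift of a few cache lines, then released with `free(x.base)`. The right shift depends on the cache geometry and must be measured.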



### Impact on threads

- Larger effects on <u>shared caches</u> with threads/processes (Nehalem)
- EulerMHD: slowdown of up to 3x on FreeBSD
- A 16-way L3 cache shared by 4 cores implies a maximum of 4 aligned arrays per core
- No limit on concurrent arrays for unaligned allocations




### New Intel L3 cache slices

Since Sandy Bridge:

- The L3 is split into **slices**
- The **slice** is selected by **hashing the address**
- **Each slice** is **16-way** associative
- This fixes the coloring/alignment issue



https://software.intel.com/en-us/articles/intel-xeon-processor-e5-26004600-product-family-technical-overview

### **On today's CPUs**

- No longer an issue for Intel L3 caches
  - Change of topology
- AMD Zen (Ryzen)
  - Now also uses slices
  - Should solve the issue
- **Still** an issue on **IBM POWER8**
  - L3 cache has 8 ways for 8 MB
  - Issue present
  - **POWER9?** Also "regions" in the LLC?
- For **ARM** (v7/v8)?
  - L2 shared associative cache
  - The issue should be present
  - But I have never tested it
- Issue for the L2 of all processors!
  - Think of hyperthreading with 8 ways!



#### Test on Haswell Core i7-4790

## 4K aliasing (old issue but fun !)

Consider the simple loop:

`for (i = 1 ; i < SIZE ; i++) a[i] = b[i-1]`

If the addresses satisfy:

`a % 4 KB == b % 4 KB`

- The **processor assumes** (fast check on the 12 lower bits) that the addresses are equal (alias)
- The processor does **not execute** them in **parallel** (**out of order**)
- In malloc, a direct call to mmap generates 4K alignment by default!
- Mainly **fixed since Sandy Bridge**
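The aliasing condition above only involves the 12 lower address bits, so it can be checked in user code. A minimal sketch with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: true when both addresses share their 12 lower bits,
 * i.e. a % 4 KB == b % 4 KB. */
static int is_4k_aliased(const void *a, const void *b)
{
    return (((uintptr_t)a ^ (uintptr_t)b) & 0xFFF) == 0;
}
```

Two buffers obtained directly from mmap are both page aligned, so `is_4k_aliased(a, b)` is true for them. Shifting one of them by a cache line breaks the alias.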

| Alignment  | Cycles / loop on Nehalem |
|------------|--------------------------|
| 4K aligned | 16.8                     |
| Unaligned  | 8.5                      |



## NUMA ALLOCATOR FOR HPC APPLICATIONS

### **Allocator performance on HPC applications**

Main interest: malloc time cost

- Test case: Hera (CEA)
  - Adaptive Mesh Refinement (AMR)
  - **Massive C++/MPI code** (~1 million lines)
- Large number of memory allocations (~75 million in 5 minutes on 12 cores)
- Large number of alloc/realloc of around 20 MB
- Available allocators:
  - **Doug Lea** / **PTMalloc**: libc Linux
  - **Jemalloc**: FreeBSD / Firefox / Facebook
  - **TCMalloc**: Google
  - Hoard



Temporal distribution of allocations



## Hera preliminary results



#### 128 cores



### How to measure malloc time

#### Measurement method :

    T0 = clock_start();
    ptr = malloc(SIZE);
    T1 = clock_end();

OK for **small blocks**, but not for **large** ones:

- Lazy page allocation.
- Page faults on first access.

| For 4 GB        | Malloc | First access |
|-----------------|--------|--------------|
| Time (M cycles) | 0.008  | 1217         |

- Small allocations are **well handled** by most allocators; the **best are jemalloc / tcmalloc**
- Cost of large allocations: page faults
- **Commonly neglected**: the literature mainly discusses small allocations
- Large allocations are served by direct calls to **mmap/munmap**
- **HPC applications** are (expected to) use **large arrays**
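To see both costs, the call to malloc and the first access have to be timed separately. The sketch below assumes a POSIX system and reports seconds through `clock_gettime` instead of the cycle counts used in the table. All names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <time.h>

/* Current monotonic time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Hypothetical result type: the two costs, in seconds. */
struct alloc_timing {
    double malloc_s;       /* time spent inside malloc() */
    double first_touch_s;  /* time spent writing once to every page */
};

/* Time malloc() and the first access separately. Returns 0 on success. */
static int time_large_alloc(size_t size, size_t page, struct alloc_timing *out)
{
    assert(size > 0 && page > 0 && out != NULL);
    double t0 = now_seconds();
    volatile char *ptr = malloc(size);   /* volatile: keep the writes */
    double t1 = now_seconds();
    if (ptr == NULL)
        return -1;
    for (size_t i = 0; i < size; i += page)
        ptr[i] = 1;                      /* page faults happen here */
    double t2 = now_seconds();
    free((void *)ptr);
    out->malloc_s = t1 - t0;
    out->first_touch_s = t2 - t1;
    return 0;
}
```

Calling `time_large_alloc(64u << 20, 4096, &t)` on a freshly started process should typically show `first_touch_s` well above `malloc_s`, but the ratio depends on the allocator and on whether it recycles memory.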

## Large allocations

## My goals :

- **Recycle** large arrays
- Avoid **fragmentation** on large segments
- Take care of **NUMA**
- Limit locks
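The recycling goal can be illustrated with a deliberately simple cache of freed segments. This sketch is not the allocator presented in the talk: it is single threaded, ignores NUMA and fragmentation, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define CACHE_SLOTS 8

/* Hypothetical cache of freed large segments, kept for reuse. */
struct segment_cache {
    void  *ptr[CACHE_SLOTS];
    size_t size[CACHE_SLOTS];
};

/* Return a cached segment of at least `size` bytes, else allocate one.
 * `*actual` receives the usable size of the returned segment. */
static void *segment_alloc(struct segment_cache *c, size_t size, size_t *actual)
{
    assert(c != NULL && actual != NULL && size > 0);
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->ptr[i] != NULL && c->size[i] >= size) {
            void *p = c->ptr[i];       /* reuse: pages are already mapped */
            *actual = c->size[i];
            c->ptr[i] = NULL;
            return p;
        }
    }
    *actual = size;
    return malloc(size);               /* fresh segment: page faults later */
}

/* Keep the segment for reuse if a slot is free, else release it. */
static void segment_free(struct segment_cache *c, void *p, size_t size)
{
    assert(c != NULL);
    if (p == NULL)
        return;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (c->ptr[i] == NULL) {
            c->ptr[i] = p;
            c->size[i] = size;
            return;
        }
    }
    free(p);
}
```

A zero-initialized `struct segment_cache` is an empty cache. A recycled segment keeps its physical pages, so reusing it avoids both the mmap call and the first-touch page faults.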

### **Allocator Profiles**

Test the allocator with **multiple profiles**

#### **Lowmem** profile

- Return memory to the OS as soon as possible

#### **UMA** profile

- Recycle large segments
- Disable NUMA
- Use only one common memory source

#### **NUMA** profile

- Recycle large segments
- Enable NUMA structures

### Hera on Nehalem-EP (128 : 4\*4\*8 cores)



#### Physical memory (GB)

## MySQL results




- II. Analysis of OS paging policy
- III. Allocator for HPC applications
- IV. Page zeroing in Linux first touch handler
- V. Conclusion and future work

## PAGE ZEROING IN LINUX FIRST TOUCH HANDLER

### **Benchmarking page faults**

Page faults are an issue for allocation performance

We previously limited them by recycling large segments

Can we **improve fault performance**?

Micro-benchmark :
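A page fault micro-benchmark can be sketched as follows. This is not the original benchmark from the talk, only a minimal version under the assumption of a Linux/POSIX system. The function name `fault_cost_ns` is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical benchmark: average time in nanoseconds to first-touch one
 * page of a fresh anonymous mapping. Returns a negative value on error. */
static double fault_cost_ns(size_t npages)
{
    assert(npages > 0);
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = npages * page;
    volatile char *area = mmap(NULL, size, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (area == MAP_FAILED)
        return -1.0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < npages; i++)
        area[i * page] = 1;            /* one write per page */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    munmap((void *)area, size);
    double ns = (double)(t1.tv_sec - t0.tv_sec) * 1e9
              + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / (double)npages;
}
```

Running `fault_cost_ns(1 << 16)` several times and looking at the distribution, not only the mean, gives a first idea of the fault cost on a given machine.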



## On my current laptop – page fault costs (4K pages)



## Also using the energy efficient cores



Is average + standard deviation the right observable?

Is median + 10% quantiles better?


### Page fault scalability

- Are page faults scalable over threads or processes?
- Measurement on **4\*4 Nehalem-EP** (128 cores) and on **Xeon Phi** (60 cores)
- We hit a scalability issue!




### Can huge pages solve this issue ?

- Standard pages: **4K**
- Huge pages (x86\_64): **2M**
- This divides the number of faults by 512
- Impact on performance?
  - Sequential: only 40%
  - Parallel: none
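The factor 512 comes from the ratio of page sizes. A small sketch of the fault count arithmetic, with a hypothetical function name:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: number of first-touch faults needed to populate
 * `size` bytes with pages of `page_size` bytes (rounded up). */
static uint64_t first_touch_faults(uint64_t size, uint64_t page_size)
{
    assert(page_size > 0);
    return (size + page_size - 1) / page_size;
}
```

For a 4 GB buffer this gives 1,048,576 faults with 4K pages and 2,048 faults with 2M pages, a ratio of 512. The measured speedup is far smaller, so the cost per fault must grow with the page size.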

Why?



### What happens on first touch page fault ?



- Take locks on the page table
- Check the reason for the fault
- It is a first touch from lazy allocation



## How to avoid page zeroing cost?

#### Microsoft approach :

- **Windows** uses a **system thread** to clear the memory
- So it is done **out** of the **critical path**
- But **zeroing**:
  - Implies useless work
  - Consumes CPU cycles, and therefore energy
  - Consumes memory bandwidth

#### Allocation pattern follow:

Why not **avoid them** ?

### **Reusing local pages to avoid zeroing**

- Page zeroing is **required** for **security reasons**
- It prevents information **leaks** from **other processes** or from the **kernel**
- But we can reuse pages locally!
- This requires extending the mmap semantics:
- Usable by **malloc / realloc**



### **Performance impact**

- We get the **expected improvement** on **4K pages** (40% for sequential)
- It also improves scalability on 1 socket
- On NUMA, locking effects become dominant for scalability
- We still get the constant improvement related to page zeroing



## **Performance impact on huge pages**

**Huge page** (2 MB) faults become **47** times faster, **60** times in parallel.

**New interest** for huge pages.



Page fault time on 2\*6 cores + Patched THP


## A SHORT PROBLEM WITH NUMA

### Malloc NUMA issue

**Exchanges** between **NUMA nodes** :



Most current allocators are affected by this issue

Malloc has no information about the use of allocated segments

C++ tends to perform more allocations, so it is more exposed on NUMA

- I. Introduction
- II. Analysis of OS / allocator / caches interactions
- III. Allocator for HPC applications
- IV. Optimization of Linux page fault handler
- V. Conclusion and future work

## CONCLUSION

- Consider the **genius** of the **people who invented** **paging**!
- Even after **60 years** of memory management, we **can still do a lot!**
- Current operating systems **still have to digest** the side effects of **multi-core** and **NUMA**
- The impact can be huge!
- I hope you now know better **what is behind malloc**!

## **QUESTIONS**?