Virtual Memory

Memory (RAM) is a scarce resource, even by current desktop/laptop standards (8 GiB). Any modern CPU can run several processes at a time thanks to several physical cores as well as multitasking. In a 32-bit architecture, it is very common that programs access memory with 32-bit addresses. This means that each program is able to address 2^32 = 4.294.967.296 positions. Since each position is a byte, each program can address up to 4GiB of memory. This set of addresses is known as the address space of a program.

It might be clear to the reader that if our computer has 4GiB of RAM, not even a single program will be able to use the full 4GiB, as the operating system needs some memory for itself. Even if the computer had 8GiB or 16GiB or… of RAM, only 2, 3 or… programs could use 4GiB for their own use.

When programs have “direct” access to memory, as used to be the case initially, they have to implement memory management algorithms and some logic to solve 3 main issues:

      1) The limited amount of real memory, which has to be shared with the operating system and all the other programs
      2) Memory fragmentation
      3) Overlapping with (and overwriting) memory used by other programs or by the operating system

What is Virtual Memory (VM)??

As commented, early computers and operating systems used the memory available in the system directly. But because memory was very expensive, virtual memory was devised to define more logical memory than the real physical memory, so programs could think they had more memory to use. This, as many other things in Computer Science, is done via indirection - getting or accessing something via a reference instead of directly. In our case, it implies that instead of accessing any physical memory position via its physical address, programs are given virtual addresses. In other words, instead of a physical address space, programs are given a virtual address space.

As you can imagine, Virtual Memory uses other devices (normally disk storage) to store what does not fit in real memory. This means that in the virtual address space, some entries point to locations on disk. This is done transparently to the programs, so they think that everything is in the 4GiB of RAM they can address.

Also, it should already be clear that VM needs a place to store the mapping between the virtual and physical address spaces. This is known as the Page Table (PT). Each entry in the PT is known as a Page Table Entry (PTE). Both are explained in more detail later.

So far we can say that Virtual Memory has some very important consequences:

      1) Each program believes it has 4GiB of memory to play with (to address) –> This is good !!
      2) As each program has its own address space, there is no fragmentation –> This is also good !!
      3) By the previous rule, no memory overlapping can happen –> This is also good !!

      a) Indirection has a computing cost, so it is slower than direct access –> This is bad !!
      b) Writing or fetching information at disk is even slower (much slower) –> This is very bad !!

Please note that 1, 2 and 3 solve the 3 issues that programs had to deal with when accessing memory directly. This brings another very important consequence: programmers are freed from knowing details about the physical running environment, so memory access via indirection gives a higher level of abstraction.

Pages (and some maths…)

So we now have the Page Table (PT), which holds all the Page Table Entries (PTEs): the maps between virtual addresses (VA) and physical addresses (PA).

As our address space is 2^32 (4.294.967.296) addresses, does the PT have to have 2^32 entries (PTEs)? And how big is a PTE? So how big is the PT? (and PTs also have to live in memory…).

Well, in a 32-bit architecture, the ‘word’ unit of work is 32 bits (so 4 bytes). As memory is normally aligned at word boundaries, the PT would need 2^32 / 2^2 = 2^30 entries (in decimal, 4.294.967.296 / 4 = 1.073.741.824 entries).

This is still a very big number of entries (1 Giga), which would also consume a lot of memory for the PT itself. A way to overcome this problem is to group words. These groups of words are called pages, and each page will have its own entry in the PT. Grouping ‘chunks’ of addresses into pages reduces the number of entries in the Page Table.

Nowadays pages are normally 4KiB (4096 bytes - 1024 words) for 32-bit systems. This drastically reduces the size of the PT: 2^32 / 2^12 = 2^20 entries (in decimal, 4.294.967.296 / 4096 = 1.048.576 entries). That is about 1 million entries. With a weight of 4 bytes per entry, the PT will be 4MiB per process, which is a quite reasonable and manageable size.
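As a quick check of these numbers, here is a minimal C sketch (the names are just for illustration, nothing here comes from the kernel) that reproduces the arithmetic above; changing page_shift to 21 gives the 2MiB “Huge Pages” numbers discussed further below:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    const uint64_t address_space = 1ULL << 32;        /* 2^32 bytes = 4 GiB        */
    const uint64_t pte_size      = 4;                 /* assume 4 bytes per PTE    */
    const unsigned page_shift    = 12;                /* 12 -> 4 KiB, 21 -> 2 MiB  */

    uint64_t page_size = 1ULL << page_shift;          /* bytes per page            */
    uint64_t entries   = address_space >> page_shift; /* number of PTEs            */
    uint64_t pt_bytes  = entries * pte_size;          /* size of the PT itself     */

    printf("page size : %llu bytes\n", (unsigned long long)page_size);
    printf("PT entries: %llu\n",       (unsigned long long)entries);
    printf("PT size   : %llu KiB\n",   (unsigned long long)(pt_bytes / 1024));
    return 0;
}

With page_shift = 12 it prints 1.048.576 entries and a 4096 KiB (4 MiB) page table, matching the numbers above.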

There is, however, a drawback when using pages: whole pages have to be moved to disk when free space is needed in RAM, or loaded from disk when they are not in memory. As explained a bit later, this is a slow process and affects overall performance.

NOTE: A virtual page is somewhat abstract when compared with a Page Frame (or just a frame). A Page frame normally refers to a page-sized and page-aligned physical memory block.

You can check the page size in your system with getconf:

mycomputer ~ # getconf -a | grep -i page
PAGESIZE                           4096
PAGE_SIZE                          4096
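The same value can also be obtained from a program. A minimal C example using the standard POSIX sysconf() call:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* sysconf(_SC_PAGESIZE) returns the page size in bytes */
    long page_size = sysconf(_SC_PAGESIZE);
    printf("page size: %ld bytes\n", page_size);
    return 0;
}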

Page size is “hard-coded” in the Linux kernel. Check it at ‘page.h’. Wondering what PAGE_SHIFT means in the code? Find an explanation at kernelnewbies.com.

Support for “Huge Pages” can be enabled via the ‘CONFIG_HUGETLB_PAGE’ option at kernel compilation time.

Huge Pages will not be covered here. But it is important to understand that there are other page sizes, like 2MiB. In this case, the PT would be even smaller: a page would hold up to 524.288 words, and the PT would have only 2048 entries, with a size in memory of 8 KiB. On the other hand, bigger pages imply more data to load or save when pages have to be moved to or from disk storage.

Let’s comment a little on performance before talking about how address translation is actually done.

Performance

So far we have a mechanism that allows programs to think that each one of them has 4GiB of RAM, regardless of the real memory in the system. In other words, each program is given a Virtual Address Space (VAS) of 4GiB. Also, each program can change the data in its memory at will, as VM isolates each address space.

Performance, despite all these benefits, is a problem. Modern CPUs can work down to the ‘nanosecond’. This is a second divided by 1.000.000.000. This means modern CPUs can do roughly 1 instruction each nanosecond, or 1.000.000.000 in a second. This is fast. Very fast.

But not everything goes that fast:

1 nanosecond                              0,000000001 seconds   10^-9   1 instruction
100 nanoseconds                           0,0000001 seconds     10^-7   DRAM random read
10 microseconds (10.000 nanoseconds)      0,00001 seconds       10^-5   Ping to localhost
10 milliseconds (10.000.000 nanoseconds)  0,01 seconds          10^-2   HD seek time (or 1 Linux context switch)

As can be seen, fetching data from DRAM is slow. Each instruction in the CPU would have to wait 100 times its execution time (1 nanosecond vs. 100 nanoseconds) to get its data from DRAM. And that time is without VM (it is a direct access to a RAM address). With VM, that time is bigger, as some cycles have to be “wasted” to do the translation from VA to PA, fetch the data in RAM and send it back to the CPU.

Imagine the “eternal” time (from the point of view of an instruction in the CPU) spent waiting for the disk: VM does the translation, finds that the data is on disk, the data is read, put in memory and then fetched to the CPU… The instruction has had to wait for more than 10.000.000 nanoseconds. It also means that 10.000.000 other instructions could have queued up to be loaded into the CPU and run !!!

In simple words, DRAM is 100x slower than CPUs. And disks are several orders of magnitude slower than DRAM (around 100.000x, using the seek time above).

Mico Maco hopes that the need for mechanisms that help speed up the delivery of data to the CPU is now clear. The Translation Lookaside Buffer (TLB) is one of them and is explained later on.

Address Translation

Before speeding up data delivery to the CPU, let’s talk about how Virtual Addresses (VA), Physical Addresses (PA) and pages are managed.

We continue with the 4KiB page. This allows for 4096 addresses ‘in’ each page. This is 2^12, so 12 bits. These 12 bits mark which address is selected within a given page. So, given a page number, these 12 bits are the offset to locate the address.

Therefore, we can decompose the VA into 20 bits for the Virtual Page Number and 12 bits for the offset. As commented before, this gives 1.048.576 pages (1.048.576 PTEs).

As both physical and virtual memory are divided into pages, the offset can be passed through transparently when doing the conversion. So offset ‘308’ in a VA is also offset ‘308’ in the PA. Only the virtual page number has to be translated into the physical page.

As an example, imagine VA 0x000402A7 (“0000 0000 0000 0100 0000 0010 1010 0111” in base 2):

Offset: 2A7 (“0010 1010 0111”)
Virtual Page Number: 00040 (“0000 0000 0000 0100 0000”)

We will need to get the value at position 00040 of the PT. Imagine that this value is 0010A (let’s suppose there are 4GiB of RAM).

Then the PA would be physical page “0010A” with offset “2A7”, so “0010A2A7” (“0000 0000 0001 0000 1010 0010 1010 0111”)

As a short summary, the address translation mechanism consists in using the virtual page number of each VA as an index into the PT “array”; the entry at that index holds the corresponding physical frame number. In other words, entries in the PT are tuples that relate VAs (virtual pages) and PAs (frames in real memory). Only the upper 20 bits of any 32-bit address are used for translation, as the 12 lower bits are used as a common offset for both VAs and PAs.
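To make the mechanism concrete, here is a minimal C sketch of this single-level translation (the page table array and the mapping it contains are made up just to reproduce the 0x000402A7 -> 0x0010A2A7 example above):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define OFFSET_MASK ((1u << PAGE_SHIFT) - 1)        /* lower 12 bits: 0xFFF   */
#define NR_PTES     (1u << (32 - PAGE_SHIFT))       /* 2^20 = 1.048.576 PTEs  */

/* Toy single-level page table: index = virtual page number,
 * value = physical frame number (filled in just for the example). */
static uint32_t page_table[NR_PTES];

static uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> PAGE_SHIFT;             /* upper 20 bits          */
    uint32_t offset = va & OFFSET_MASK;             /* lower 12 bits          */
    uint32_t frame  = page_table[vpn];              /* the PT look-up         */
    return (frame << PAGE_SHIFT) | offset;          /* rebuild the PA         */
}

int main(void)
{
    page_table[0x00040] = 0x0010A;                  /* the example mapping    */
    printf("PA = 0x%08X\n", (unsigned)translate(0x000402A7)); /* 0x0010A2A7   */
    return 0;
}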

VM in Linux

Linux follows, with some modifications, the address translation mechanism described above. The difference lies in that Linux chooses a hierarchical system to organize the page tables. More precisely, Linux uses a 3-level hierarchy for 32-bit architectures and 4 levels for 64-bit architectures. For 32-bit, the 3 levels are named Page Global Directory (PGD), Page Middle Directory (PMD) and our well known Page Table Entry (PTE).

Again, this is a further indirection: each directory contains pointers to other pointers. As stated in Gorman’s “Understanding The Linux Virtual Memory Manager“:

Each active entry in the PGD table points to a page frame containing an array of Page Middle Directory (PMD) entries of type pmd_t which in turn points to page frames containing Page Table Entries (PTE) of type pte_t, which finally points to page frames containing the actual user data.

In other words, each entry in the PGD, PMD or PTE points to a “physical” frame (“page” sized). At the PTE level, the offset is added to get the actual data.

There are some optimizations, like page directory entries having an extra bit that states whether there is data (a valid pointer) for the requested page. This speeds up look-ups: if the bit is not set, the walk can stop going down the levels and go directly to load the page from disk.

Although more complicated to work with, a hierarchical (or multilevel) structure saves memory. In a single-level table, all page entries have to be defined, consuming space regardless of whether they will ever hold data. In a multilevel hierarchy, only the top-level directory entries have to exist; the rest are created when they have data.

As always, there is a drawback to multilevel hierarchies: page-in and page-out movements require more actions (at least one per level).
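A rough sketch of how such a 3-level walk resolves a virtual address is shown below. This is not kernel code: the structures, names and the 2/9/9/12 bit split are simplifications made for the example, and a NULL pointer plays the role of the “not present” bit mentioned above.

#include <stdint.h>
#include <stddef.h>

struct pte_table { uint32_t frame[512]; int present[512]; }; /* last level       */
struct pmd_table { struct pte_table *pte[512]; };            /* middle directory */
struct pgd_table { struct pmd_table *pmd[4]; };              /* global directory */

/* Returns 0 and fills *pa on success; -1 means some level is missing
 * (a page fault: the page may have to be brought in from disk). */
static int walk(const struct pgd_table *pgd, uint32_t va, uint32_t *pa)
{
    uint32_t pgd_i  = (va >> 30) & 0x3;    /* top 2 bits   -> PGD entry */
    uint32_t pmd_i  = (va >> 21) & 0x1FF;  /* next 9 bits  -> PMD entry */
    uint32_t pte_i  = (va >> 12) & 0x1FF;  /* next 9 bits  -> PTE entry */
    uint32_t offset =  va        & 0xFFF;  /* lower 12 bits -> offset   */

    const struct pmd_table *pmd = pgd->pmd[pgd_i];
    if (pmd == NULL)
        return -1;                         /* PMD level not present     */
    const struct pte_table *pte = pmd->pte[pmd_i];
    if (pte == NULL)
        return -1;                         /* PTE level not present     */
    if (!pte->present[pte_i])
        return -1;                         /* page itself not in RAM    */

    *pa = (pte->frame[pte_i] << 12) | offset;
    return 0;
}

In the real kernel the walk uses helper macros (pgd_offset(), pmd_offset(), pte_offset_kernel() and friends) rather than plain array indexing, but the idea is the same.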

Linux page information is located, for Intel architectures, in page_types.h. There you will find the definition of the offset width (PAGE_SHIFT), which serves to obtain the page size. Find an explanation of PAGE_SHIFT at kernelnewbies.com.

/* PAGE_SHIFT determines the page size */
#define PAGE_SHIFT		12
#define PAGE_SIZE		(_AC(1,UL) << PAGE_SHIFT)
#define PAGE_MASK		(~(PAGE_SIZE-1))
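(_AC(1,UL) is simply the constant 1 with a UL suffix.) To see how these macros are typically used, here is a small user-space illustration (not kernel code) that splits an address into its page base and its offset in the same spirit:

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

int main(void)
{
    unsigned long va     = 0x000402A7UL;     /* the address from the example above */
    unsigned long base   = va & PAGE_MASK;   /* start of the page: 0x00040000      */
    unsigned long offset = va & ~PAGE_MASK;  /* position inside it: 0x2A7          */

    printf("page base: 0x%08lX, offset: 0x%03lX\n", base, offset);
    return 0;
}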

Also, this file contains the definitions for the Page Middle Directory (PMD) (the second-level table) and the page mask, among others.

Further definitions for the 32-bit 3-level structure can be found at pgtable-3level_types.h.

As commented, 64-bit architectures use a 4-level hierarchical structure: Page Global Directory (PGD), Page Upper Directory (PUD), Page Middle Directory (PMD) and our well known Page Table Entry (PTE).

You can see the new PUD level definition in page_types.h.

More definitions for the 4-level structure can be found at pgtable_64_types.h.

It is important to note that (at least) up to 2017, 64-bit architectures only used 48 of the 64 available bits for virtual addresses. These 48 bits provide up to 256 TiB of addressable memory (but see the “Past and future” section below).

Linux defines in the kernel the mm_struct, which holds the information regarding virtual memory. This ‘C’ structure is set up for each process and is referenced from the scheduler structure (task_struct) created for each process (see more at [scheduling]). Find here the definition of mm_struct (it is a large structure that goes down to line 509). At line 374 you can see the entry for the PGD.
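To visualize the chain, here is a heavily abridged view of the structures involved (for illustration only: just a couple of the real fields are shown, and pgd_t is simplified to an integer type; the real definitions are in the kernel sources linked above):

typedef unsigned long pgd_t;    /* simplification: architecture-specific type in the kernel */

struct mm_struct {
    pgd_t *pgd;                 /* top of the page table hierarchy (the PGD)     */
    /* ... start/end of code, data, heap, stack, list of mapped areas, ...       */
};

struct task_struct {
    struct mm_struct *mm;       /* the virtual memory of this process            */
    /* ... scheduling information, open files, credentials, ...                  */
};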

TLB - Translation Lookaside Buffer

One of the most important optimizations regarding VM is the Translation Lookaside Buffer (TLB).

As commented previously, address translation consumes CPU cycles. Some calculations show that memory is accessed about 1,33 times per instruction. At a minimum, each piece of data that an instruction needs implies one access to the page table and one access to the memory address itself. For hierarchical virtual memory structures (like Linux’s), this consumes even more CPU cycles, as there are several page table levels…

Chip manufacturers devised the TLB and implemented it in hardware CPU designs. It is very similar to a cache, but it holds resolved page table entries, which contain the map from a virtual page to its corresponding page frame. In order to be fast, the TLB is very small. The “logic” behind the TLB is that observation has shown that programs access memory addresses which are normally quite close to one another. This is known as “locality”.

This locality allows a small (and therefore fast) TLB. It usually has no more than 64 page entries. This is a very small portion when compared with the million page entries we ended up with in the “Pages” section above. Nevertheless, the locality of memory accesses means that many of the final memory addresses are in the same page or in very close pages. In this case, 64 pointers to pages provide a high ratio of TLB hits.

There is a further improvement: many CPUs provide a TLB for instructions (iTLB) and another TLB for data (dTLB).

As in Shakespeare’s play “Hamlet”, to TLB or not to TLB has many implications for PTEs:

To be in the TLB means a TLB “hit”: the page frame is obtained immediately, without walking the page table at all.

Not to be in the TLB (a TLB “miss”) can be because the page has not been accessed yet, because its entry was evicted to make room for more recent translations (remember the TLB is small), or because the page is not even in RAM. In any case the full page-table walk has to be done and, in the worst case, the page has to be brought in from disk.
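As a sketch of this hit/miss logic (this is how the idea works, not how the hardware actually implements it; page_table_walk() below is just a placeholder for the full multi-level walk):

#include <stdint.h>

#define PAGE_SHIFT 12
#define TLB_SIZE   64                    /* small on purpose: locality makes it enough */

/* One cached, already-resolved translation: virtual page number -> frame. */
struct tlb_entry { uint32_t vpn; uint32_t frame; int valid; };
static struct tlb_entry tlb[TLB_SIZE];

/* Placeholder for the (slow) full page-table walk; identity mapping
 * just to keep the sketch self-contained. */
static uint32_t page_table_walk(uint32_t vpn) { return vpn; }

uint32_t translate(uint32_t va)
{
    uint32_t vpn    = va >> PAGE_SHIFT;
    uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);

    /* TLB hit: the translation is resolved without touching the page table. */
    for (int i = 0; i < TLB_SIZE; i++)
        if (tlb[i].valid && tlb[i].vpn == vpn)
            return (tlb[i].frame << PAGE_SHIFT) | offset;

    /* TLB miss: do the full walk and cache the result, evicting an old
     * entry in the simplest possible (round-robin) way. */
    uint32_t frame = page_table_walk(vpn);
    static int next;
    tlb[next] = (struct tlb_entry){ .vpn = vpn, .frame = frame, .valid = 1 };
    next = (next + 1) % TLB_SIZE;

    return (frame << PAGE_SHIFT) | offset;
}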

Some past and future

While Mico Maco is talking about the present, there are already people thinking about the future… As memory needs never end, there are already provisions for a hierarchical structure with 5 levels. Look here !!

Perhaps the reader might want to look at some information from the past to better understand the future… Some good (past) readings are:

“Understanding The Linux Virtual Memory Manager”, a book written by Mel Gorman. It covers up to kernel 2.6, so it still stands well, as the 3-level hierarchy was initially implemented in those kernels.

Also, the articles “Four-level page tables merged” and “4 level page tables merged into mainline” at LWN.net provide a good view of how VM has evolved in Linux…

Kernel and User Memory in Linux

Find here an overview of the Linux memory layouts for 32-bit and 64-bit architectures.

For the x86 32-bit architecture, and with the default configuration, Linux splits the 4GiB virtual address space into user space and kernel space. By default it dedicates 3 GiB to user space and 1 GiB to kernel space.

+--------------------+  <--- 0xFFFFFFFF (=4GiB)
|  Kernel addresses  |
|                    |
|--------------------|  <--- CONFIG_PAGE_OFFSET
|                    |       (default 0xC0000000 for 32 bit)
|                    |
|    User space      |
|    addresses       |
|                    |
|                    |
+--------------------+  <--- 0x00000000
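Under this default split, whether an address belongs to kernel space or to user space can be told simply by comparing it with CONFIG_PAGE_OFFSET. A trivial illustration (user-space C, with the default 0xC0000000 hard-coded just for the example):

#include <stdio.h>

#define PAGE_OFFSET 0xC0000000UL   /* default CONFIG_PAGE_OFFSET for 32-bit */

int main(void)
{
    unsigned long addrs[] = { 0x08048000UL, 0xBFFFF000UL, 0xC0100000UL };

    for (int i = 0; i < 3; i++)
        printf("0x%08lX is a %s address\n", addrs[i],
               addrs[i] >= PAGE_OFFSET ? "kernel-space" : "user-space");
    return 0;
}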

For 64-bit, find a detailed virtual memory map layout in Linus Torvalds’ kernel repository.

Ok, but how does all this fit in with processes?

Virtual Memory is a key component (if not the most important or complicated one) of modern computers. When a program is executed, it is loaded into memory (now we know it is virtual memory). The process then runs under the control of the scheduler. The scheduler has a Process Control Block (PCB) (or task_struct in Linux) for each process. In the PCB there are several structures with process information. One of them is the mm_struct structure, which contains the pointers to the different memory structures.

Find more about the scheduler here, and here about how an executable file is loaded into memory (how a program becomes a process !!)