page table

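Linux defines up to five levels of page tables. The physical address that a virtual address maps to is usually referred to by its page frame number (PFN), which is the physical address of the page divided by PAGE_SIZE.
With 4 kB pages and a 32-bit address space, PFN 0 stands for 0x00000000, PFN 1 for 0x00001000, and so on. With 16 kB pages the sequence is 0x00000000, 0x00004000, and so on.
With 4 kB pages the PFN occupies address bits 12-31, which is the meaning of PAGE_SHIFT = 12; PAGE_SIZE = (1 << PAGE_SHIFT).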
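
The PFN arithmetic above can be sketched in a few lines of user-space C. This is only an illustration, assuming 4 kB pages; the helper names phys_to_pfn and pfn_to_phys are made up for this sketch and are not the kernel's own macros.

```c
#include <stdint.h>

/* Illustrative constants mirroring the kernel's names (assumption: 4 kB pages). */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

/* Physical address -> page frame number: drop the in-page offset bits. */
static inline uint64_t phys_to_pfn(uint64_t phys)
{
	return phys >> PAGE_SHIFT;
}

/* Page frame number -> physical address of the first byte of that page. */
static inline uint64_t pfn_to_phys(uint64_t pfn)
{
	return pfn << PAGE_SHIFT;
}
```

Changing PAGE_SHIFT to 14 gives the 16 kB case, where PFN 1 corresponds to 0x00004000.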

One level above the PFN is the PTE (page table entry). As memory sizes grew, more levels of page indexing were added over time.
The current page table hierarchy is:

+-----+
| PGD |
+-----+
   |
   |   +-----+
   +-->| P4D |
       +-----+
          |
          |   +-----+
          +-->| PUD |
              +-----+
                 |
                 |   +-----+
                 +-->| PMD |
                     +-----+
                        |
                        |   +-----+
                        +-->| PTE |
                            +-----+

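The levels, from bottom to top:

  • PTE: pte_t, pteval_t, page table entry. A PTE table is an array of PTRS_PER_PTE elements of type pteval_t, each mapping one page.
    A pteval_t is a 32-bit or 64-bit value: the upper bits hold the PFN and the lower bits hold architecture-specific memory protection bits.
  • PMD: pmd_t, pmdval_t, page middle directory. An array of PTRS_PER_PMD entries, each referencing a PTE table.
  • PUD: pud_t, pudval_t, page upper directory. Introduced later to support 4-level page tables.
  • P4D: p4d_t, p4dval_t, page level 4 directory. Introduced after the PUD to support 5-level page tables.
  • PGD: pgd_t, pgdval_t, page global directory. Each user-space process has its own PGD, stored in struct mm_struct (referenced from struct task_struct). The kernel maintains its own, swapper_pg_dir.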
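
To make the hierarchy concrete, here is a minimal sketch of how a virtual address is cut into per-level indices. It assumes x86_64 with 4-level paging and 4 kB pages, where every table has 512 entries (9 index bits per level) and the P4D level is folded away. The struct and function names are invented for the illustration.

```c
#include <stdint.h>

#define PAGE_SHIFT 12     /* assumption: 4 kB pages */
#define INDEX_MASK 0x1ffu /* assumption: 512 entries (9 bits) per table */

/* Hypothetical container for the per-level table indices. */
struct va_indices {
	unsigned int pgd, pud, pmd, pte;
	unsigned int offset; /* byte offset inside the page */
};

/* Split a 48-bit virtual address as 9 + 9 + 9 + 9 + 12 bits. */
static inline struct va_indices split_va(uint64_t va)
{
	struct va_indices ix;

	ix.offset = (unsigned int)(va & ((1UL << PAGE_SHIFT) - 1));
	ix.pte = (unsigned int)(va >> 12) & INDEX_MASK;
	ix.pmd = (unsigned int)(va >> 21) & INDEX_MASK;
	ix.pud = (unsigned int)(va >> 30) & INDEX_MASK;
	ix.pgd = (unsigned int)(va >> 39) & INDEX_MASK;
	return ix;
}
```

With 5-level paging a further 9-bit P4D index would sit between the PGD and PUD fields, taken from bits 39-47, and the PGD index would move up to bits 48-56.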

x86_64

Memory layout

The virtual memory layout with 4-level page tables is as follows:

========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |          0 | 00007fffffffefff | ~128 TB | user-space virtual memory, different per mm
 00007ffffffff000 |    ~128 TB | 00007fffffffffff |    4 kB | ... guard hole
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0000800000000000 |    +128 TB | 7fffffffffffffff |   ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -8 EB
                  |            |                  |         |     starting offset of kernel mappings.
                  |            |                  |         |
                  |            |                  |         | LAM relaxes canonicality check allowing to create aliases
                  |            |                  |         | for userspace memory here.
__________________|____________|__________________|_________|___________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 8000000000000000 |      -8 EB | ffff7fffffffffff |   ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -128 TB
                  |            |                  |         |     starting offset of kernel mappings.
                  |            |                  |         |
                  |            |                  |         | LAM_SUP relaxes canonicality check allowing to create
                  |            |                  |         | aliases for kernel memory here.
____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ffff800000000000 |    -128 TB | ffff87ffffffffff |    8 TB | ... guard hole, also reserved for hypervisor
 ffff880000000000 |    -120 TB | ffff887fffffffff |  0.5 TB | LDT remap for PTI
 ffff888000000000 |  -119.5 TB | ffffc87fffffffff |   64 TB | direct mapping of all physical memory (page_offset_base)
 ffffc88000000000 |   -55.5 TB | ffffc8ffffffffff |  0.5 TB | ... unused hole
 ffffc90000000000 |     -55 TB | ffffe8ffffffffff |   32 TB | vmalloc/ioremap space (vmalloc_base)
 ffffe90000000000 |     -23 TB | ffffe9ffffffffff |    1 TB | ... unused hole
 ffffea0000000000 |     -22 TB | ffffeaffffffffff |    1 TB | virtual memory map (vmemmap_base)
 ffffeb0000000000 |     -21 TB | ffffebffffffffff |    1 TB | ... unused hole
 ffffec0000000000 |     -20 TB | fffffbffffffffff |   16 TB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Identical layout to the 56-bit one from here on:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 fffffc0000000000 |      -4 TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |      -2 TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |    -1.5 TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |      -1 TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 |    -512 GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |     -68 GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |      -4 GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |      -2 GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |   -2048 MB |                  |         |
 ffffffffa0000000 |   -1536 MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |     -16 MB |                  |         |
    FIXADDR_START |    ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |     -10 MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |      -2 MB | ffffffffffffffff |    2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________

The virtual memory layout with 5-level page tables is as follows:

========================================================================================================================
    Start addr    |   Offset   |     End addr     |  Size   | VM area description
========================================================================================================================
                  |            |                  |         |
 0000000000000000 |          0 | 00ffffffffffefff |  ~64 PB | user-space virtual memory, different per mm
 00fffffffffff000 |     ~64 PB | 00ffffffffffffff |    4 kB | ... guard hole
__________________|____________|__________________|_________|___________________________________________________________
                  |            |                  |         |
 0100000000000000 |     +64 PB | 7fffffffffffffff |   ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -8 EB
                  |            |                  |         |     starting offset of kernel mappings.
                  |            |                  |         |
                  |            |                  |         | LAM relaxes canonicality check allowing to create aliases
                  |            |                  |         | for userspace memory here.
__________________|____________|__________________|_________|___________________________________________________________
                                                            |
                                                            | Kernel-space virtual memory, shared between all processes:
____________________________________________________________|___________________________________________________________
 8000000000000000 |      -8 EB | feffffffffffffff |   ~8 EB | ... huge, almost 63 bits wide hole of non-canonical
                  |            |                  |         |     virtual memory addresses up to the -64 PB
                  |            |                  |         |     starting offset of kernel mappings.
                  |            |                  |         |
                  |            |                  |         | LAM_SUP relaxes canonicality check allowing to create
                  |            |                  |         | aliases for kernel memory here.
____________________________________________________________|___________________________________________________________
                  |            |                  |         |
 ff00000000000000 |     -64 PB | ff0fffffffffffff |    4 PB | ... guard hole, also reserved for hypervisor
 ff10000000000000 |     -60 PB | ff10ffffffffffff | 0.25 PB | LDT remap for PTI
 ff11000000000000 |  -59.75 PB | ff90ffffffffffff |   32 PB | direct mapping of all physical memory (page_offset_base)
 ff91000000000000 |  -27.75 PB | ff9fffffffffffff | 3.75 PB | ... unused hole
 ffa0000000000000 |     -24 PB | ffd1ffffffffffff | 12.5 PB | vmalloc/ioremap space (vmalloc_base)
 ffd2000000000000 |   -11.5 PB | ffd3ffffffffffff |  0.5 PB | ... unused hole
 ffd4000000000000 |     -11 PB | ffd5ffffffffffff |  0.5 PB | virtual memory map (vmemmap_base)
 ffd6000000000000 |   -10.5 PB | ffdeffffffffffff | 2.25 PB | ... unused hole
 ffdf000000000000 |   -8.25 PB | fffffbffffffffff |   ~8 PB | KASAN shadow memory
__________________|____________|__________________|_________|____________________________________________________________
                                                            |
                                                            | Identical layout to the 47-bit one from here on:
____________________________________________________________|____________________________________________________________
                  |            |                  |         |
 fffffc0000000000 |      -4 TB | fffffdffffffffff |    2 TB | ... unused hole
                  |            |                  |         | vaddr_end for KASLR
 fffffe0000000000 |      -2 TB | fffffe7fffffffff |  0.5 TB | cpu_entry_area mapping
 fffffe8000000000 |    -1.5 TB | fffffeffffffffff |  0.5 TB | ... unused hole
 ffffff0000000000 |      -1 TB | ffffff7fffffffff |  0.5 TB | %esp fixup stacks
 ffffff8000000000 |    -512 GB | ffffffeeffffffff |  444 GB | ... unused hole
 ffffffef00000000 |     -68 GB | fffffffeffffffff |   64 GB | EFI region mapping space
 ffffffff00000000 |      -4 GB | ffffffff7fffffff |    2 GB | ... unused hole
 ffffffff80000000 |      -2 GB | ffffffff9fffffff |  512 MB | kernel text mapping, mapped to physical address 0
 ffffffff80000000 |   -2048 MB |                  |         |
 ffffffffa0000000 |   -1536 MB | fffffffffeffffff | 1520 MB | module mapping space
 ffffffffff000000 |     -16 MB |                  |         |
    FIXADDR_START |    ~-11 MB | ffffffffff5fffff | ~0.5 MB | kernel-internal fixmap range, variable size and offset
 ffffffffff600000 |     -10 MB | ffffffffff600fff |    4 kB | legacy vsyscall ABI
 ffffffffffe00000 |      -2 MB | ffffffffffffffff |    2 MB | ... unused hole
__________________|____________|__________________|_________|___________________________________________________________

arm64

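Kernel address translation

On arm64, the macros __pa() and __va() convert between virtual and physical addresses.
They are defined in arch/arm64/include/asm/memory.h (source taken from version 6.17.9):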

/*
* Drivers should NOT use these either.
*/
#define __pa(x) __virt_to_phys((unsigned long)(x))
#define __pa_symbol(x) __phys_addr_symbol(RELOC_HIDE((unsigned long)(x), 0))
#define __pa_nodebug(x) __virt_to_phys_nodebug((unsigned long)(x))
#define __va(x) ((void *)__phys_to_virt((phys_addr_t)(x)))
#define pfn_to_kaddr(pfn) __va((pfn) << PAGE_SHIFT)
#define sym_to_pfn(x) __phys_to_pfn(__pa_symbol(x))

__virt_to_phys_nodebug is defined in the same file. __tag_reset only matters when CONFIG_KASAN_SW_TAGS or CONFIG_KASAN_HW_TAGS is enabled; if neither is set it can be ignored.

/*
* Check whether an arbitrary address is within the linear map, which
* lives in the [PAGE_OFFSET, PAGE_END) interval at the bottom of the
* kernel's TTBR1 address range.
*/
#define __is_lm_address(addr) (((u64)(addr) - PAGE_OFFSET) < (PAGE_END - PAGE_OFFSET))

#define __lm_to_phys(addr) (((addr) - PAGE_OFFSET) + PHYS_OFFSET)
#define __kimg_to_phys(addr) ((addr) - kimage_voffset)

#define __virt_to_phys_nodebug(x) ({					\
	phys_addr_t __x = (phys_addr_t)(__tag_reset(x));		\
	__is_lm_address(__x) ? __lm_to_phys(__x) : __kimg_to_phys(__x);	\
})

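The __is_lm_address macro checks whether an address lies in the linear map, whose range is [PAGE_OFFSET, PAGE_END). With VA_BITS = 39, PAGE_OFFSET = -(1 << 39) = 0xffffff8000000000 and PAGE_END = -(1 << 38) = 0xffffffc000000000.
If the address is in the linear map, __lm_to_phys is used. Because the linear map is a direct mapping of physical memory, the physical address is simply the offset from PAGE_OFFSET plus PHYS_OFFSET.
PHYS_OFFSET is defined as: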

/* PHYS_OFFSET - the physical address of the start of memory. */
#define PHYS_OFFSET ({ VM_BUG_ON(memstart_addr & 1); memstart_addr; })

memstart_addr is assigned in arm64_memblock_init() in arch/arm64/mm/init.c, from the value returned by memblock_start_of_DRAM() (defined in mm/memblock.c):

/* lowest address */
phys_addr_t __init_memblock memblock_start_of_DRAM(void)
{
	return memblock.memory.regions[0].base;
}
...
	/*
	 * Select a suitable value for the base of physical memory.
	 */
	memstart_addr = round_down(memblock_start_of_DRAM(),
				   ARM64_MEMSTART_ALIGN);

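memblock is the allocator the kernel uses to manage memory during early boot. Its regions are usually populated from the device tree (DT), so the physical base address varies from device to device.
__kimg_to_phys converts a kernel image virtual address to a physical address. Typical inputs are the addresses of symbols such as functions and variables in the kernel's text and data segments. These are also linearly related to their physical addresses, and the offset kimage_voffset is fixed during early boot.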
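
The two translation paths can be modelled in user space as a rough sketch. VA_BITS = 39 follows the example in the text. The values of phys_offset and kimage_voffset below are hypothetical, chosen only so the arithmetic is easy to follow; on a real system both are determined at boot. The function names are invented for this model.

```c
#include <stdint.h>
#include <stdbool.h>

#define VA_BITS     39
#define PAGE_OFFSET (-(UINT64_C(1) << VA_BITS))       /* 0xffffff8000000000 */
#define PAGE_END    (-(UINT64_C(1) << (VA_BITS - 1))) /* 0xffffffc000000000 */

/* Hypothetical boot-time values, for illustration only. */
static const uint64_t phys_offset    = UINT64_C(0x40000000);         /* start of DRAM */
static const uint64_t kimage_voffset = UINT64_C(0xffffffbfc0000000); /* image VA - image PA */

/*
 * Unsigned subtraction wraps around for addresses below PAGE_OFFSET,
 * so a single comparison checks both ends of [PAGE_OFFSET, PAGE_END).
 */
static inline bool is_lm_address(uint64_t addr)
{
	return (addr - PAGE_OFFSET) < (PAGE_END - PAGE_OFFSET);
}

/* Model of __virt_to_phys_nodebug() without the tag reset step. */
static inline uint64_t virt_to_phys_model(uint64_t va)
{
	return is_lm_address(va) ? (va - PAGE_OFFSET) + phys_offset
				 : va - kimage_voffset;
}
```

With these assumed values, a linear-map address such as 0xffffff8000001000 lands 0x1000 bytes above the start of DRAM, while an address outside the linear map is treated as a kernel image address and shifted by kimage_voffset.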

__va(x) goes the other way, from a physical address to a virtual address, and its definition is simpler:

#define __phys_to_virt(x)	((unsigned long)((x) - PHYS_OFFSET) | PAGE_OFFSET)
#define __phys_to_kimg(x) ((unsigned long)((x) + kimage_voffset))

__phys_to_kimg, by contrast, is used when converting page table addresses.
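
As a small sanity check of the reverse direction, the following sketch models __phys_to_virt and shows that it round-trips with the linear-map formula. It assumes VA_BITS = 39 and a hypothetical DRAM base of 0x40000000; the function names are illustrative.

```c
#include <stdint.h>

#define VA_BITS     39
#define PAGE_OFFSET (-(UINT64_C(1) << VA_BITS)) /* 0xffffff8000000000 */

/* Hypothetical DRAM base; the real PHYS_OFFSET is chosen at boot. */
static const uint64_t phys_offset = UINT64_C(0x40000000);

/* Model of __phys_to_virt(): offset into DRAM, OR-ed into the linear map base. */
static inline uint64_t phys_to_virt_model(uint64_t pa)
{
	return (pa - phys_offset) | PAGE_OFFSET;
}

/* Model of __lm_to_phys(): the inverse operation. */
static inline uint64_t lm_to_phys_model(uint64_t va)
{
	return (va - PAGE_OFFSET) + phys_offset;
}
```

The OR behaves like an addition here because the low 39 bits of PAGE_OFFSET are zero and the offset into the linear map is smaller than 2^38.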

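TTBR0 register

TTBR0 (Translation Table Base Register 0) holds the base address of the first-level translation table.
When the MMU of an ARM processor translates an address, TTBR0 gives the starting point of the table walk. It is normally paired with TTBR1: TTBR0 covers the low part of the virtual address space and TTBR1 the high part.

How TTBR0 works

When the CPU needs to translate a virtual address to a physical address, the MMU:

  1. Checks the virtual address range to decide whether to use TTBR0 or TTBR1
  2. Reads the translation table base address from the selected TTBR
  3. Uses part of the virtual address as an index to look up the corresponding entry in the table
  4. Obtains either the physical address or the address of the next-level table

The format of TTBR0 depends on the TTBCR.EAE bit:

  • When TTBCR.EAE = 0, the 32-bit format is used and TTBR0[63:32] is ignored.
  • When TTBCR.EAE = 1, the 64-bit format is used.

Besides the table base address, TTBR0 also carries attribute bits that control the memory attributes used for table walks (such as cacheability and shareability).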
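
Step 1 of the walk can be sketched for the 32-bit short-descriptor format (TTBCR.EAE = 0), where the TTBCR.N field decides how the address space is divided between the two registers. The function name select_ttbr is made up for this illustration.

```c
#include <stdint.h>

/*
 * Returns 0 when the walk starts from TTBR0 and 1 when it starts from TTBR1.
 * n is the TTBCR.N field (0..7) of the short-descriptor format.
 */
static inline int select_ttbr(uint32_t va, unsigned int n)
{
	if (n == 0)
		return 0; /* N == 0: TTBR0 covers the whole address space */

	/* If any of the top N address bits is set, TTBR1 is used. */
	return (va >> (32 - n)) != 0;
}
```

For example, with N = 1 the address space is divided in half: addresses below 0x80000000 go through TTBR0 and the rest through TTBR1.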

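TTBR0 use cases

1. Process isolation in the operating system

Each process has its own virtual address space. On a context switch the operating system updates TTBR0 to point to the new process's page tables, which isolates the address spaces of different processes.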

// Simplified example of switching address spaces in the Linux kernel
// arch/arm/mm/context.c
void switch_mm(struct mm_struct *prev, struct mm_struct *next,
	       struct task_struct *tsk)
{
	// ...
	cpu_switch_mm(next->pgd, next);
	// ...
}

2. Security extensions (TrustZone)

In an ARM TrustZone environment there is a secure world and a non-secure world, and each maintains its own virtual MMU.

3. Virtualization

Each guest OS has its own page tables.

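TTBR0 in ATF and OP-TEE

ATF (ARM Trusted Firmware)

ATF sits at the lowest level and is responsible for secure boot and TrustZone management. ATF uses TTBR0 to set up the memory mappings of the secure world.

OP-TEE (Open Portable Trusted Execution Environment)

OP-TEE is an open-source TEE implementation that runs in the secure world and provides security services to the normal world. It uses TTBR0 to manage the secure world's memory mappings.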

TTBR0 bit fields

32-bit format (TTBCR.EAE = 0)

TTBR0 instructions

TTBR0 is read and written with the MRC and MCR instructions, or with MRRC and MCRR for 64-bit accesses:

; Read TTBR0 into R0 and R1 (64-bit access)
MRRC p15, 0, r0, r1, c2

; Write R0 and R1 to TTBR0 (64-bit access)
MCRR p15, 0, r0, r1, c2

; Read the low 32 bits of TTBR0 into R0 (32-bit access)
MRC p15, 0, r0, c2, c0, 0

; Write R0 to the low 32 bits of TTBR0 (32-bit access)
MCR p15, 0, r0, c2, c0, 0

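ARM page table

32-bit
The two-level translation splits the address as 12+8+12.
The first-level table is indexed by the top 12 bits of the virtual address. The MMU takes the translation table base (TTB) from the TTBR register and adds the top 12 bits as an index to locate the entry. An entry can be one of several types: section, page table, or supersection.

A section maps a 1 MB region directly, a page table entry points to a second-level table, and a supersection maps a 16 MB region.
There are two base registers, TTBR0 and TTBR1. When TTBCR.N = 0, TTBR0 is always used. When N > 0, TTBR0 is used if the top N bits of the virtual address are all zero (for example 0x7fxxxx), and TTBR1 is used otherwise (for example 0xffffxxxx).
The linear mapping region is usually mapped with sections.
Bits[1:0] of the entry indicate whether it is a section or a page table:

  • page table: 0b01
  • section: 0b10; bit[18] indicates whether it is a supersection

section-mapping

Section address translation works as follows:
The MMU reads the first-level table base address from the TTBR, adds the index formed by the top 12 bits of the virtual address, and finds that the low two bits of the entry are 10, which identifies a section mapping. It then concatenates the top 12 bits of the entry with the low 20 bits of the virtual address to form the physical address.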
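
The section case can be written down as a short sketch. It assumes the ARMv7 short-descriptor format and ignores the variant in which bit[0] of a section descriptor carries the PXN attribute; the helper names are invented for the illustration.

```c
#include <stdint.h>
#include <stdbool.h>

/* Index into the 4096-entry first-level table: VA bits [31:20]. */
static inline uint32_t l1_index(uint32_t va)
{
	return va >> 20;
}

/* A 1 MB section has bits[1:0] == 0b10 and bit[18] (supersection) clear. */
static inline bool l1_is_section(uint32_t desc)
{
	return (desc & 0x3u) == 0x2u && !(desc & (1u << 18));
}

/* Section translation: descriptor bits [31:20] joined with VA bits [19:0]. */
static inline uint32_t section_to_phys(uint32_t desc, uint32_t va)
{
	return (desc & 0xfff00000u) | (va & 0x000fffffu);
}
```

For instance, a descriptor of 0x40100002 found at index 0xc01 would map the virtual address 0xc0112345 to the physical address 0x40112345.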

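page-mapping

A page-mapped virtual address is translated as follows:
The first-level entry points to a second-level table, which is indexed by the next 8 bits of the virtual address. When the low two bits of the second-level entry are 10 (a small page), the top 20 bits of the entry are concatenated with the low 12 bits of the virtual address to form the physical address.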
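
The second-level step can be sketched in the same way, again assuming the ARMv7 short-descriptor format with 4 kB small pages. The helper names are illustrative only.

```c
#include <stdint.h>

/* Second-level table base from a first-level page-table descriptor: bits [31:10]. */
static inline uint32_t l2_table_base(uint32_t l1_desc)
{
	return l1_desc & 0xfffffc00u;
}

/* Index into the 256-entry second-level table: VA bits [19:12]. */
static inline uint32_t l2_index(uint32_t va)
{
	return (va >> 12) & 0xffu;
}

/* Small-page translation: descriptor bits [31:12] joined with VA bits [11:0]. */
static inline uint32_t small_page_to_phys(uint32_t l2_desc, uint32_t va)
{
	return (l2_desc & 0xfffff000u) | (va & 0x00000fffu);
}
```

Each second-level entry is 4 bytes wide, so the entry for a given address sits at l2_table_base(l1_desc) + 4 * l2_index(va).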

Linux kernel linear mapping region

The linear mapping region is initialized by map_mem() during kernel boot. It maps physical memory as a whole.
The function is defined in arch/arm64/mm/mmu.c:

static void __init map_mem(pgd_t *pgdp)
{
	static const u64 direct_map_end = _PAGE_END(VA_BITS_MIN);
	phys_addr_t kernel_start = __pa_symbol(_text);
	phys_addr_t kernel_end = __pa_symbol(__init_begin);
	phys_addr_t start, end;
	phys_addr_t early_kfence_pool;
	int flags = NO_EXEC_MAPPINGS;
	u64 i;

	/*
	 * Setting hierarchical PXNTable attributes on table entries covering
	 * the linear region is only possible if it is guaranteed that no table
	 * entries at any level are being shared between the linear region and
	 * the vmalloc region. Check whether this is true for the PGD level, in
	 * which case it is guaranteed to be true for all other levels as well.
	 * (Unless we are running with support for LPA2, in which case the
	 * entire reduced VA space is covered by a single pgd_t which will have
	 * been populated without the PXNTable attribute by the time we get here.)
	 */
	BUILD_BUG_ON(pgd_index(direct_map_end - 1) == pgd_index(direct_map_end) &&
		     pgd_index(_PAGE_OFFSET(VA_BITS_MIN)) != PTRS_PER_PGD - 1);

	early_kfence_pool = arm64_kfence_alloc_pool();

	if (can_set_direct_map())
		flags |= NO_BLOCK_MAPPINGS | NO_CONT_MAPPINGS;

	/*
	 * Take care not to create a writable alias for the
	 * read-only text and rodata sections of the kernel image.
	 * So temporarily mark them as NOMAP to skip mappings in
	 * the following for-loop
	 */
	memblock_mark_nomap(kernel_start, kernel_end - kernel_start);

	/* map all the memory banks */
	for_each_mem_range(i, &start, &end) {
		if (start >= end)
			break;
		/*
		 * The linear map must allow allocation tags reading/writing
		 * if MTE is present. Otherwise, it has the same attributes as
		 * PAGE_KERNEL.
		 */
		__map_memblock(pgdp, start, end, pgprot_tagged(PAGE_KERNEL),
			       flags);
	}

	/*
	 * Map the linear alias of the [_text, __init_begin) interval
	 * as non-executable now, and remove the write permission in
	 * mark_linear_text_alias_ro() below (which will be called after
	 * alternative patching has completed). This makes the contents
	 * of the region accessible to subsystems such as hibernate,
	 * but protects it from inadvertent modification or execution.
	 * Note that contiguous mappings cannot be remapped in this way,
	 * so we should avoid them here.
	 */
	__map_memblock(pgdp, kernel_start, kernel_end,
		       PAGE_KERNEL, NO_CONT_MAPPINGS);
	memblock_clear_nomap(kernel_start, kernel_end - kernel_start);
	arm64_kfence_map_pool(early_kfence_pool, pgdp);
}

The pgdp argument passed to __map_memblock is the page global directory swapper_pg_dir, which manages the memory mappings of kernel space.

References