Page Cache
Adrian Huang | Jan, 2022
* Based on kernel 5.11 (x86_64) – QEMU
* SMP (4 CPUs) and 8GB memory
* Kernel parameter: nokaslr norandmaps
* Userspace: ASLR is disabled
* EXT4 file system
* Legacy BIOS
Agenda
• What is page cache?
• Page cache & buffer cache (struct buffer_head)
• How to find an existing page in the page cache?
• Interaction with generic block layer: methods for bio construction
1. Based on buffer_head
2. [Without buffer_head] Based on page descriptor & file system
• File system block size & sector size
• [Detailed discussion] With or without buffer_head
• File system-based IO
• Block device-based IO
What is page cache?
• Page cache (stored in physical memory) = cached disk data
• Speeds up disk data access
• The Linux kernel consults the page cache for disk reads and writes
• If there is enough free memory,
• the page cache is kept for an indefinite time
• cached pages can be reused by other processes without accessing the disk
• Open a file with the O_DIRECT flag → Bypass page cache
• Application: Some database applications use their own disk cache algorithm
• Especially for large data access
• Example: Using direct I/O with Oracle
Reference from: Chapter 15. The Page Cache, Understanding the Linux Kernel, Third Edition
Page cache & buffer cache
1. Page cache: Interaction with VFS. (Upper layer)
2. Buffer cache: Interaction with the disk. (Lower layer)
[Figure: Page cache and buffer cache. A file (continuous file position, file->f_pos) is cached in 4KB pages (page cache); each page is split into buffers tracked by buffer heads (buffer cache); the buffers correspond to 512B sectors on disk.]
Page cache & buffer cache: relationship
1. Block size = file system-based unit
2. Page cache might *NOT* include buffer_head struct. (File system specific: file data)
• [page->flags] PG_private: the page carries fs-private data (here, the buffer_head list reached through private)
[Figure: a page descriptor whose private field points to the first of four buffer heads (b_size = 1024). The buffer heads are chained through b_this_page; each b_page points back to the page descriptor and each b_data points to its buffer inside the page frame (page_address(page)). b_bdev + b_blocknr locate the block on disk: submit_bh()->submit_bio().]
[Legend: buffer_head fields shown: b_this_page, b_state, b_blocknr, b_page, b_data, b_size, b_bdev]
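The relationship on this slide can be sketched as a toy model. This is only an illustration in Python, not kernel code: the `BufferHead` class, `attach_buffers` and `walk` are hypothetical names, the page size is assumed to be 4KB, and `b_data` is modelled as a byte offset inside the page frame rather than a pointer.

```python
from dataclasses import dataclass
from typing import List, Optional

PAGE_SIZE = 4096  # assumption: 4KB pages, as drawn on the slides


@dataclass
class BufferHead:
    """Toy stand-in for struct buffer_head (only a few fields)."""
    b_size: int
    b_data: int  # modelled as a byte offset inside the page frame
    b_this_page: Optional["BufferHead"] = None  # next buffer_head of the same page


def attach_buffers(block_size: int) -> BufferHead:
    """Build the chain of buffer_heads covering one page; return the first one
    (the one page->private would point to in this model)."""
    count = PAGE_SIZE // block_size
    heads = [BufferHead(b_size=block_size, b_data=i * block_size) for i in range(count)]
    for i, bh in enumerate(heads):
        bh.b_this_page = heads[(i + 1) % count]  # the last one links back to the first
    return heads[0]


def walk(head: BufferHead) -> List[int]:
    """Follow b_this_page once around the chain, collecting b_data offsets."""
    offsets, bh = [], head
    while True:
        offsets.append(bh.b_data)
        bh = bh.b_this_page
        if bh is head:
            break
    return offsets


if __name__ == "__main__":
    print(walk(attach_buffers(1024)))
```

With a 1024-byte block size the page ends up with four buffers, matching the four buffer heads in the figure; with a 4096-byte block size there is a single buffer covering the whole page.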
Radix Tree (or XArray): How to find an existing page in the page cache?
[Figure: pointer chain from a process to its cached pages]
• task_struct (files) -> files_struct (fd_array[]) -> file (f_inode, f_pos, f_mapping, f_path: mnt, dentry)
• dentry (d_parent, d_name, d_inode); d_name is a qstr (name = “mnt”, u32 len = 3, u32 hash, in a union with u64 hash_len)
• inode (*i_mapping, i_data, i_atime, i_mtime, i_ctime) -> address_space (host, page_tree, i_mmap)
• page_tree: radix_tree_root (height = 2, rnode) -> radix_tree_node (slots 0 … 63, count = 2) -> two second-level radix_tree_nodes (count = 1 each in the drawing) -> page (mapping, index)
• Example pages: index = 1 (slots[0] -> slots[1]), index = 3 (slots[0] -> slots[3]), index = 194 (slots[3] -> slots[2])
[v4.20] XArray replaced radix tree
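The slot numbers in the figure follow from splitting the page index into 6-bit groups, one group per tree level. A minimal sketch of that arithmetic, assuming 64 slots per node (as drawn) and 4KB pages; `slot_path` and `page_index` are hypothetical helper names, not kernel functions.

```python
RADIX_TREE_MAP_SHIFT = 6  # 64 slots per node, as drawn on the slide
RADIX_TREE_MAP_MASK = (1 << RADIX_TREE_MAP_SHIFT) - 1


def page_index(f_pos: int, page_shift: int = 12) -> int:
    """File position -> index of the page that caches it (4KB pages assumed)."""
    return f_pos >> page_shift


def slot_path(index: int, height: int) -> list:
    """Slot chosen at each level, from the root node down to the leaf node."""
    if index >= 1 << (RADIX_TREE_MAP_SHIFT * height):
        raise ValueError("index does not fit in a tree of this height")
    path = []
    for level in range(height - 1, -1, -1):
        path.append((index >> (level * RADIX_TREE_MAP_SHIFT)) & RADIX_TREE_MAP_MASK)
    return path


if __name__ == "__main__":
    for idx in (1, 3, 194):
        print(idx, slot_path(idx, height=2))
```

For the three pages on the slide this gives slots[0] -> slots[1] for index 1, slots[0] -> slots[3] for index 3, and slots[3] -> slots[2] for index 194 (194 = 3 * 64 + 2).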
Interaction with generic block layer: bio construction based on buffer_head
[Figure: mount path through the I/O stack]
• Userspace application: mount(…) -> sys_mount()/__x64_sys_mount() -> VFS
• ext4_mount->ext4_fill_super
• fs/buffer.c: ext4_fill_super -> ext4_sb_bread_unmovable
• ext4_read_bh_lock -> … -> submit_bh -> submit_bio
• Layers: VFS, Page Cache/Buffer Cache (fs/buffer.c, mm/readahead.c, mm/filemap.c), Mapping Layer (disk filesystem, block device file), Generic Block Layer (submit_bio()), I/O Scheduler Layer, Block Device Driver, Disk
• Page cache side: XArray -> page (mapping, index, private) -> buffer_heads (b_this_page, b_state, b_blocknr, b_page, b_data, b_size, b_bdev), each describing a buffer in the page frame
• bio: bi_iter is a bvec_iter (bi_sector: 512-byte sector, bi_size); bi_io_vec points to bio_vec (bv_page, bv_len, bv_offset), filled by bio_add_page(…)
Interaction with generic block layer: buffer_head
• buffer_head cache: per-cpu variable
• Scenarios
o File system metadata (file)
▪ superblock
▪ inode info
▪ extent tree
o File hole
o Block device: page cache is not up-to-date
Interaction with generic block layer: bio construction based on page & file system
[Figure: read path through the I/O stack]
• sys_read()/__x64_sys_read() -> VFS -> ext4_file_read_iter -> generic_file_read_iter
• Page cache miss for file->f_pos: 1. Allocate a page descriptor 2. Add it to XArray (page: mapping, index)
• ext4_mpage_readpages -> ext4_map_blocks -> submit_bio
• ext4_map_blocks works on m_lblk, m_pblk, m_len, m_flags
• bio: bi_iter is a bvec_iter (bi_sector: 512-byte sector, bi_size); bi_io_vec points to bio_vec (bv_page, bv_len, bv_offset), filled by bio_add_page(…); bv_page references the page frame (page_address(page))
• Layers: VFS, Page Cache/Buffer Cache (fs/buffer.c, mm/readahead.c, mm/filemap.c), Mapping Layer (disk filesystem, block device file), Generic Block Layer (submit_bio()), I/O Scheduler Layer, Block Device Driver, Disk
1. No need to allocate buffer_head struct
2. [Scenario] readahead mechanism
A. File read/write (file system)
B. Block device read/write: the corresponding pages are not in the page cache yet.
File system block size & sector size
[Figure: page cache (4KB page) -> Mapping Layer: file system (works in units of the file system block size) -> Generic Block Layer (works in units of the sector size) -> Disk]
File system block size & sector size: file system block size = 1024
[Figure: User/Kernel; page cache (4KB page) -> Mapping Layer: file system -> Generic Block Layer (sector size) -> Disk. The bio carries one file system block: bvec_iter (bi_sector, bi_size = 1024) and one bio_vec (bv_page, bv_offset, bv_len = 1024).]
File system block size & sector size: file system block size = 4096
[Figure: User/Kernel; page cache (4KB page) -> Mapping Layer: file system -> Generic Block Layer (sector size) -> Disk. The bio carries one file system block: bvec_iter (bi_sector, bi_size = 4096) and one bio_vec (bv_page, bv_offset, bv_len = 4096).]
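The conversion between file system blocks and sectors can be sketched with a few shifts. This is an illustrative model only: `block_to_bio` is a hypothetical helper, the result is a plain dictionary rather than a real bio, and block number 10 in the usage lines is an arbitrary example value. The only fact taken from the slides is that bi_sector counts 512-byte sectors.

```python
SECTOR_SHIFT = 9  # the generic block layer counts in 512-byte sectors


def block_to_bio(block_nr: int, block_size: int, nr_blocks: int = 1) -> dict:
    """Translate a run of file system blocks into bio-style fields."""
    if block_size < 512 or block_size & (block_size - 1):
        raise ValueError("block size must be a power of two >= 512")
    blkbits = block_size.bit_length() - 1  # log2(block_size)
    bi_size = nr_blocks << blkbits         # bytes covered by the bio
    return {
        "bi_sector": block_nr << (blkbits - SECTOR_SHIFT),  # first 512-byte sector
        "bi_size": bi_size,
        "sectors": bi_size >> SECTOR_SHIFT,                 # sectors transferred
    }


if __name__ == "__main__":
    print(block_to_bio(10, 1024))  # a 1024-byte block spans 2 sectors
    print(block_to_bio(10, 4096))  # a 4096-byte block spans 8 sectors
```

The same block number therefore lands on a different starting sector depending on the file system block size, while bi_size matches the values shown on the two slides (1024 and 4096).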
File system access & raw block device access
[Figure, left: User -> Kernel: VFS sys_read()/__x64_sys_read() -> page cache (4KB page) -> Mapping Layer: file system (file system block size) -> Generic Block Layer (sector size) -> Disk]
[Figure, right: User -> Kernel: VFS sys_read()/__x64_sys_read() -> page cache (4KB page) -> Mapping Layer: block device file (block size) -> Generic Block Layer (sector size) -> Disk]
Submit IO with or without buffer_head struct?
With or without buffer_head
[Flowchart: User -> Kernel]
• page cache available?
o N (readahead path): Allocate/init page struct(s), then check: file hole?
▪ Y: Submit IO with buffer_head struct
▪ N: Submit IO without buffer_head struct
o Y: Is page up-to-date?
▪ Y: Return the page
▪ N (readpage path): Invoke mapping->a_ops->readpage()
▪ [block device: blkdev_readpage()] Submit IO with buffer_head struct
▪ ext4 file system: ext4_readpage()
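The decision flow on this slide can be written down as a small function. This is a sketch of one reading of the flowchart, not kernel code: `read_path` and its parameters are hypothetical, and the assumption that ext4_readpage() repeats the same file-hole check as the readahead path is labeled in the code.

```python
def read_path(page_cached: bool, uptodate: bool = False,
              has_hole: bool = False, block_device: bool = False) -> str:
    """Return which branch of the flowchart a read would take (toy model)."""
    if not page_cached:
        # readahead path: allocate/init page struct(s), then check for a file hole
        return "submit with buffer_head" if has_hole else "submit without buffer_head"
    if uptodate:
        return "return the page"
    # readpage path: mapping->a_ops->readpage()
    if block_device:
        return "submit with buffer_head"  # blkdev_readpage()
    # Assumption: ext4_readpage() applies the same file-hole check as readahead.
    return "submit with buffer_head" if has_hole else "submit without buffer_head"


if __name__ == "__main__":
    print(read_path(page_cached=False))
    print(read_path(page_cached=True, uptodate=True))
    print(read_path(page_cached=True, uptodate=False, block_device=True))
```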
File hole detection – file system implementation
vfs_read -> … -> read_pages -> aops->readahead
• ext4: ext4_readahead -> ext4_mpage_readpages -> ext4_map_blocks
• block device: blkdev_readahead -> mpage_readahead -> do_mpage_readpage -> blkdev_get_block
• reiserfs: reiserfs_readahead -> mpage_readahead -> do_mpage_readpage -> reiserfs_get_block
Notes from the figure:
• block device [special case]: cannot detect file hole
• Detect file hole: check file hole for specific blocks. Do not set MAPPED flag if blocks are holes
With or without buffer_head: system configuration
Test Configuration
• ext4 file system
• block size: 1024 bytes
• sector size: 512 bytes
[Screenshots: sector size; ext4 file system block size; mount command]
File size = 1023 bytes
[Screenshot: read.c]
[Figure: for the file's f_pos, the kernel 1. allocates a page descriptor and 2. adds it to the XArray (page: mapping, index). Mapping Layer: file system builds a bio with bi_size = 1024 and one bio_vec with bv_len = 1024; bv_page references the page frame (page_address(page)). Generic Block Layer sector size = 512.]
File size = 2047 bytes
[Figure: same layout; bi_size = 2048, bv_len = 2048.]
File size = 4095 bytes
[Figure: same layout; bi_size = 4096, bv_len = 4096.]
File size = 5119 bytes
/ # /read /adrian/mnt/files/5119.txt 512
[Figure: User/Kernel; two pages (mapping, index) are added to the XArray. Mapping Layer: file system builds one bio with bi_size = 5120 and bi_vcnt = 2: the first bio_vec has bv_len = 4096, the second has bv_len = 1024 (each with its own bv_page, bv_offset). Generic Block Layer sector size = 512.]
Readahead mechanism
• Spatial locality
• Default readahead pages: 32
o If the file is smaller than the default readahead window, read only as many pages as the file has.
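The bi_size and bv_len values on the file-size slides can be reproduced with a short calculation. This is a simplified model under stated assumptions: the file's blocks are contiguous on disk, a single bio covers the whole readahead window, and the window is capped at 32 pages. `bio_layout` is a hypothetical helper, not a kernel function.

```python
PAGE_SIZE = 4096  # assumption: 4KB pages


def bio_layout(file_size: int, block_size: int = 1024, ra_pages: int = 32) -> dict:
    """Estimate the bio built for the first read of a small file (toy model)."""
    blocks = -(-file_size // block_size)      # round up to whole file system blocks
    file_pages = -(-file_size // PAGE_SIZE)   # round up to whole pages
    nr_pages = min(ra_pages, file_pages)      # readahead never goes past the file
    bi_size = min(blocks * block_size, nr_pages * PAGE_SIZE)
    bv_len, remaining = [], bi_size
    while remaining > 0:                      # one bio_vec per page touched
        chunk = min(PAGE_SIZE, remaining)
        bv_len.append(chunk)
        remaining -= chunk
    return {"bi_size": bi_size, "bi_vcnt": len(bv_len), "bv_len": bv_len}


if __name__ == "__main__":
    for size in (1023, 2047, 4095, 5119):
        print(size, bio_layout(size))
```

For the 5119-byte file this yields bi_size = 5120 split into two bio_vecs of 4096 and 1024 bytes, which is the bi_vcnt = 2 case on the slide.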
Readahead mechanism: default pages = 32
File hole – Use buffer_head
file-hole-2 (ext4 file system)
block #: 0 = hole, 1 = data, 2 = hole, 3 = data
[Figure: the page (mapping, index) in the XArray points through private to a list of buffer_heads (b_state, b_blocknr, b_page, b_data, b_size, b_bdev) linked by b_this_page, one per buffer in the page frame. Two bios (bi_iter, bi_io_vec) are each filled with bio_add_page and submitted with submit_bio to read from the disk.]
File hole – Use buffer_head
[Screenshot: xxd file-hole-2]
File hole – Use buffer_head
block #: 0 = hole, 1 = data, 2 = hole, 3 = data
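Why a hole/data/hole/data layout ends up on the buffer_head path can be illustrated with a simplified model of the per-page check. This is a sketch loosely following mpage-style read logic and should be treated as an assumption rather than a description of the exact kernel rules; `classify_page` and `bios_for_buffer_heads` are hypothetical helpers, and the physical block numbers 100 and 101 are made up for the example.

```python
def classify_page(block_map: list) -> str:
    """block_map holds one entry per block in the page: a physical block
    number, or None for a hole. Returns which path the page would take."""
    nr = len(block_map)
    first_hole, prev = nr, None
    for i, pblk in enumerate(block_map):
        if pblk is None:
            if first_hole == nr:
                first_hole = i          # remember where the first hole starts
            continue
        if first_hole != nr:
            return "buffer_head"        # a mapped block follows a hole
        if prev is not None and pblk != prev + 1:
            return "buffer_head"        # blocks are not contiguous on disk
        prev = pblk
    return "no buffer_head"             # one contiguous run, optional hole at the tail


def bios_for_buffer_heads(block_map: list, block_size: int = 1024) -> list:
    """buffer_head path in this model: one bio per mapped block."""
    return [{"bi_sector": pblk * (block_size >> 9), "bi_size": block_size}
            for pblk in block_map if pblk is not None]


if __name__ == "__main__":
    file_hole_2 = [None, 100, None, 101]  # hole, data, hole, data
    print(classify_page(file_hole_2))
    print(bios_for_buffer_heads(file_hole_2))
```

In this model the file-hole-2 layout produces two separate 1024-byte bios, one per data block, which lines up with the two submit_bio calls in the figure.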
Call path with/without buffer_head struct
[Screenshots: call path without buffer_head; call path with buffer_head]
Interaction between VFS/mm and mapping layer
[Figure: pointer chain from a process to the mapping layer]
• task_struct (files) -> files_struct (fd_array[]) -> file (f_inode, f_pos, f_mapping, f_path: mnt, dentry)
• dentry (d_parent, d_name, d_inode); d_name is a qstr (name = “mnt”, u32 len = 3, u32 hash, in a union with u64 hash_len)
• inode (*i_mapping, i_data, i_atime, i_mtime, i_ctime) -> address_space (host, page_tree, i_mmap, a_ops) -> page (mapping, index)
• a_ops -> address_space_operations (readpage, readahead, …) = mapping layer: file system or raw block disk
• Layers: sys_read()/__x64_sys_read() -> VFS, mm, Page Cache/Buffer Cache (fs/buffer.c, mm/readahead.c, mm/filemap.c), Mapping Layer (disk filesystem, block device file), Generic Block Layer (submit_bio()), I/O Scheduler Layer, Block Device Driver, Disk
Block device access – without buffer_head (full-page access)
[Figure: block device read path]
• blkdev_read_iter -> generic_file_read_iter
• Page cache miss for file->f_pos: 1. Allocate a page descriptor 2. Add it to XArray (page: mapping, index)
• read_pages -> blkdev_readahead -> mpage_readahead -> do_mpage_readpage
• get_block -> map_bh fills a buffer_head used only for the mapping: b_bdev, b_blocknr = 256, b_size = 4096
• bio: bi_iter is a bvec_iter (bi_sector = 512, bi_size = 4096); bi_io_vec points to bio_vec (bv_page, bv_len, bv_offset), filled by bio_add_page(…); bv_page references the page frame (page_address(page))
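The numbers on this slide (b_blocknr = 256, bi_sector = 512, bi_size = 4096) are consistent with a block device block size of 1024 bytes, which is an assumption here inferred from the earlier test configuration. A minimal sketch of the arithmetic, where `blkdev_page_to_bio` is a hypothetical helper and the page index 64 is inferred rather than shown on the slide:

```python
PAGE_SHIFT = 12    # assumption: 4KB pages
SECTOR_SHIFT = 9   # 512-byte sectors


def blkdev_page_to_bio(page_index: int, blkbits: int = 10) -> dict:
    """Map one page of a block device file to mapping/bio fields (toy model).

    For a raw block device the block number in the "file" equals the block
    number on the device, so no file system lookup is modelled.
    """
    first_block = page_index << (PAGE_SHIFT - blkbits)
    return {
        "b_blocknr": first_block,                              # first block of the page
        "b_size": 1 << PAGE_SHIFT,                             # whole page mapped at once
        "bi_sector": first_block << (blkbits - SECTOR_SHIFT),  # in 512-byte sectors
        "bi_size": 1 << PAGE_SHIFT,
    }


if __name__ == "__main__":
    # Reading at byte offset 262144 (page index 64) of the device
    print(blkdev_page_to_bio(262144 >> PAGE_SHIFT))
```

Under these assumptions page index 64 maps to block 256 and sector 512, and the full 4096-byte page is read with a single bio.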
Block device access – with buffer_head
/dev/loop0, block # 0 1 2 3
[Figure labels: Unread, super block, data, Unread; page cache (not up-to-date)]
Backup
[File system] page cache without buffer_head
[Block device] page cache with buffer_head
[Block device] page cache with buffer_head
[Block device] page cache without buffer_head
