The Linux Virtual File System
Linux allows different filesystems to be accessed through a kernel software layer called the Virtual File System (VFS). The VFS handles all system calls related to a standard Unix filesystem. The actual filesystems implemented underneath the VFS need not use the same abstractions and operations internally, but they must implement operations that are semantically equivalent to those specified by the VFS objects.
Any filesystem operation goes through the following steps:
- The system call accesses the VFS data structures to determine which filesystem the accessed file belongs to.
- The file is represented by a file data structure in kernel memory. This structure contains a field called f_op that holds pointers to the functions implementing each operation for that filesystem. The system call handler finds the pointer to the right function and invokes it.
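The dispatch through f_op can be illustrated with a toy user-space model. This is only a sketch of the idea, not kernel code: the names FileOperations, File, MEMFS_OPS, mem_read, mem_write, sys_read and sys_write are hypothetical and exist only for this example.

```python
# Toy user-space model of VFS dispatch; not kernel code.
# Every class and function name below is an illustrative assumption.


class FileOperations:
    """Stands in for the table of function pointers behind f_op."""

    def __init__(self, read, write):
        self.read = read
        self.write = write


class File:
    """Stands in for the kernel's open-file object."""

    def __init__(self, f_op, data=b""):
        self.f_op = f_op            # per-filesystem operation table
        self.data = bytearray(data)
        self.pos = 0                # current file offset


def mem_read(file, count):
    """Read implementation of a hypothetical in-memory filesystem."""
    chunk = bytes(file.data[file.pos:file.pos + count])
    file.pos += len(chunk)
    return chunk


def mem_write(file, buf):
    """Write implementation of the same hypothetical filesystem."""
    file.data[file.pos:file.pos + len(buf)] = buf
    file.pos += len(buf)
    return len(buf)


MEMFS_OPS = FileOperations(read=mem_read, write=mem_write)


def sys_read(file, count):
    """Generic layer: look up the function pointer, then invoke it."""
    return file.f_op.read(file, count)


def sys_write(file, buf):
    """Generic layer: the caller never names the filesystem directly."""
    return file.f_op.write(file, buf)


f = File(MEMFS_OPS)
sys_write(f, b"hello")
f.pos = 0                           # rewind before reading back
print(sys_read(f, 5))
```

The generic sys_read and sys_write functions never need to know which filesystem backs the file. Supporting another filesystem in this model only means supplying a different operation table.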
VFS supports three groups of filesystems:
- Disk-based filesystems, e.g. ext2, ext4
- Network filesystems, e.g. NFS, AFS
- Pseudo-filesystems, e.g. /proc, pipefs, sockfs, sysfs
$ cat /proc/filesystems
nodev sysfs
nodev tmpfs
nodev bdev
nodev proc
nodev cgroup
...
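The same listing can be read programmatically. The sketch below assumes the usual two-column layout of /proc/filesystems, where a leading nodev marks a filesystem that is not backed by a block device. The function name parse_filesystems and the SAMPLE fallback text are made up for this example.

```python
import os

# Fallback text used only when /proc/filesystems is unavailable
# (for example on a non-Linux system); the contents are illustrative.
SAMPLE = "nodev\tsysfs\nnodev\tproc\n\text4\n"


def parse_filesystems(text):
    """Split /proc/filesystems content into (nodev, device_backed) lists."""
    nodev, device_backed = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue                      # skip blank lines
        if fields[0] == "nodev" and len(fields) > 1:
            nodev.append(fields[1])       # no block device required
        else:
            device_backed.append(fields[-1])
    return nodev, device_backed


path = "/proc/filesystems"
if os.path.exists(path):
    with open(path) as fh:
        text = fh.read()
else:
    text = SAMPLE

nodev, device_backed = parse_filesystems(text)
print("nodev:", nodev)
print("device-backed:", device_backed)
```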
VFS defines four main data structures:
- Superblock stores information about a mounted filesystem. For disk-based filesystems, it corresponds to the filesystem control block stored on disk, which records the number of inodes, the number of disk blocks, the start of the list of free disk blocks, etc. Destroying the superblock renders the filesystem unreadable.
- File represents an open file and is created in response to the open system call. It supports operations such as read, write, sendfile, lock, etc.
- Inode holds the metadata of a specific file (we discussed it in this post). For disk-based filesystems, this object corresponds to a file control block stored on disk.
- Dentry represents a directory entry. It stores the mapping between a file name and an inode number, because the kernel identifies files by inode rather than by name.
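The name-to-inode mapping can be observed from user space with hard links: two directory entries that point to the same inode. The sketch below only shows what stat reports, not the kernel's in-memory dentry objects, and it assumes the temporary directory lives on a filesystem that supports hard links. The file names and the helper inode_of are made up for this example.

```python
import os
import tempfile


def inode_of(path):
    """Return the inode number that the given name resolves to."""
    return os.stat(path).st_ino


with tempfile.TemporaryDirectory() as d:
    original = os.path.join(d, "a.txt")
    alias = os.path.join(d, "b.txt")

    with open(original, "w") as fh:
        fh.write("data")

    # Create a second directory entry that refers to the same inode.
    os.link(original, alias)

    same_inode = inode_of(original) == inode_of(alias)
    link_count = os.stat(original).st_nlink   # names pointing at the inode

    print(same_inode, link_count)
```

Both names resolve to one inode, so the file's data and metadata exist once, however many directory entries refer to them.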
Inode cache, Dentry cache and page cache
VFS keeps inodes and dentries in their respective caches to improve performance and avoid constant disk accesses, especially for inodes, which can be updated many times while a file is open.
File data that is read or written is also cached, in the page cache. Written pages are marked dirty and are transferred from the cache to the underlying device later (for example when the data is flushed with fsync()), while reads are served by copying from the cache to process memory. This minimizes accesses to the device, but it can yield incorrect or stale results (when one process modifies the data on the storage device while another continues to work with its copy in memory) or data loss (for example, if power to the computer is interrupted after data has been written to the buffer in memory but before it reaches the device).
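A program that cannot tolerate this window of data loss can flush explicitly. The sketch below shows the common write, flush, fsync sequence; the helper name durable_write and the file name log.bin are made up for this example. Note that fsync() asks the kernel to push the file's dirty pages to the device, but how durable the result is still depends on the hardware underneath.

```python
import os
import tempfile


def durable_write(path, payload):
    """Write payload and ask the kernel to flush it to the device."""
    with open(path, "wb") as fh:
        fh.write(payload)          # lands in Python's user-space buffer
        fh.flush()                 # user-space buffer -> kernel page cache
        os.fsync(fh.fileno())      # dirty pages -> storage device
    return len(payload)


with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "log.bin")
    written = durable_write(path, b"record-1\n")
    with open(path, "rb") as fh:
        content = fh.read()        # typically served from the page cache

print(written, content)
```

Calling fsync() after every write trades throughput for durability, so it is usually reserved for data that must survive a crash, such as journal or log records.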