In the previous article, we saw that system processes perform file operations very frequently. In this article, I will discuss some properties of the Linux filesystem and how Linux looks up and opens a file.
Linux Filesystem Properties
File Path: a file path in Linux can be specified in two ways: 1) an absolute path, which tells the system to search for the file starting from the root directory, e.g. /a/b/file.txt, and 2) a relative path, which uses the current working directory as the point of reference.
File Link: if an existing file lives in another location and its path is too long to type, we can use a file link, which creates a directory entry that points to that file. There are two types of links: hardlink and softlink. We discuss them in detail below.
File Type: Linux supports regular files, which consist of a series of bytes, and special files such as directories, softlinks, character device files, block device files, local domain sockets and named pipes. Devices are represented by files located in the /dev directory. The Inode of a device file also contains a major number, which identifies the device type, and a minor number, which tells the driver which physical unit of the device to address. Device files are created using mknod(). When a new file is created, the kernel assigns it a new Inode and sets the file type in the Inode.
File Mounting: mounting is the process by which the OS makes the files and directories on a storage device available to users through the computer’s file system. Unlike systems where each disk has its own separate file tree, Linux allows a disk to be mounted into another disk’s file tree. During mounting, the operating system acquires access to the storage device, i.e. it reads the filesystem structure and metadata on it, and registers them with the virtual file system (VFS) component. The location in the VFS where the newly mounted medium is registered is called the mount point. When mounting completes, the user can access the files and directories on the medium from there.
File Locking: Linux provides two kinds of lock to keep multiple processes from accessing the same part of a file at the same time and causing race conditions. A lock does not have to make an entire directory or file inaccessible; the caller specifies how many bytes of the file are locked. Both kinds are advisory by default, meaning they only constrain processes that also ask for a lock.
- Shared lock: Linux allows multiple processes to place shared locks on the same portion of a file, or on the entire file. However, an attempt to place an exclusive lock on those bytes will fail.
- Exclusive lock: only one process is allowed to hold the exclusive lock, and every byte in the region to be locked must be available.
When placing a lock is not possible, the process can choose to block, so that it is unblocked and the lock is placed once the existing lock has been removed. The process can also choose not to block: the system call returns immediately with a status code telling whether the lock succeeded, and the caller has to decide what to do next.
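The non-blocking case can be sketched in Python with the standard fcntl module, assuming a Linux system. The helper names try_lock_range and unlock_range are made up for this illustration; they are not part of any library.

```python
import errno
import fcntl
import os


def try_lock_range(fd, start, length, exclusive=True):
    """Try to lock `length` bytes at offset `start` without blocking.

    Returns True if the lock was placed, False if another process
    already holds a conflicting lock on part of that range.
    """
    kind = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        # LOCK_NB makes the call return immediately instead of waiting.
        fcntl.lockf(fd, kind | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        return True
    except OSError as exc:
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise


def unlock_range(fd, start, length):
    """Release a lock previously placed on the same byte range."""
    fcntl.lockf(fd, fcntl.LOCK_UN, length, start, os.SEEK_SET)
```

Dropping LOCK_NB from the call gives the blocking behaviour instead: the process sleeps until the conflicting lock is released. Note that these locks belong to the process, so two lock requests from the same process never conflict with each other.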
Before we look at file lookup in Linux, let’s discuss the filesystem layout and the Inode.
Filesystem layout: Sector, block, partition
Essentially, data is stored on hard disk devices, e.g. /dev/sda, /dev/sdb, etc.
The surface of the disk is divided into circular tracks, which are then pie-sliced into sectors, typically of 512 to 4096 bytes.
One or more sectors (2, 4, 8 or 16) are then grouped into a block. Linux then divides the disk into partitions, where each partition is a contiguous span of blocks formatted either as a file system or as swap space (MBR allows up to 4 primary partitions, but this can be extended [1]).
Operating systems address (or point to) blocks rather than individual sectors, because each block has a unique address, the disk block number, and the number of addresses an operating system can use is limited. By defining a block as several sectors, it can work with bigger hard drives within that limit. For example, early versions of PC DOS could address only 65,536 blocks (64K).
In the example below, each sector is 512 bytes and a block is formed from 8 sectors, so a block is 4096 bytes.
vagrant@vagrant:~$ sudo fdisk -l
Disk /dev/sda: 64 GiB, 68719476736 bytes, 134217728 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0x4ec46e60
vagrant@vagrant:~$ sudo blockdev --getbsz /dev/sda
4096
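The numbers in this output can be checked with a little arithmetic. The sketch below uses only the values printed above; the function max_addressable_bytes is a name I made up to show why larger blocks let a fixed number of addresses cover a larger disk, and the block sizes passed to it are illustrative rather than the ones PC DOS actually used.

```python
SECTOR_SIZE = 512          # bytes, from the fdisk output
SECTOR_COUNT = 134217728   # sectors, from the fdisk output
BLOCK_SIZE = 4096          # bytes, from blockdev --getbsz

disk_bytes = SECTOR_SIZE * SECTOR_COUNT
sectors_per_block = BLOCK_SIZE // SECTOR_SIZE
block_count = disk_bytes // BLOCK_SIZE


def max_addressable_bytes(address_count, block_size):
    """Largest disk that `address_count` block addresses can cover."""
    return address_count * block_size


print(disk_bytes)                         # 68719476736, i.e. 64 GiB
print(sectors_per_block)                  # 8
print(block_count)                        # 16777216
print(max_addressable_bytes(65536, 512))  # 33554432, i.e. 32 MiB
print(max_addressable_bytes(65536, 4096)) # 268435456, i.e. 256 MiB
```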
As I mentioned in the article about Linux booting, the very first sector is called the MBR (Master Boot Record) and is used to boot the computer. The end of the MBR contains the partition table, which describes the starting and ending addresses of each partition.
The partition layout varies by filesystem, but these parts are common:
- The first block is the boot block. Only one partition, the active partition, holds the booting code. During the boot process, the MBR program loads the booting code from the first block (the boot block) of this active partition.
- The superblock, which contains the key parameters of the file system, e.g. the number of Inodes, and is read into memory when the computer is booted.
- The Inodes, numbered sequentially up to some maximum. Each Inode holds the information about one file; we discuss them in detail below.
- Data blocks where contents of files and directories are physically located.
Let’s now turn our attention to the most important filesystem data structure, the Inode, which stores the attributes (metadata) of files and directories. An Inode is typically 128 bytes in size (Ext4 uses 256 bytes by default), and Inodes are stored in the Inode table, a structure with a known location on disk. A file’s attributes include the mode (protection bits, setuid and setgid bits), timestamps (changed, modified, accessed), link count, uid, gid, file size, generation number (incremented when the Inode is reused), extended file attributes and pointers to the addresses of disk blocks.
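Most of these attributes can be read from user space with the stat system call. Here is a small Python sketch using os.stat; describe_inode is a helper name I chose for the example, and it shows only a subset of the fields.

```python
import os
import stat
import tempfile


def describe_inode(path):
    """Return a few Inode attributes of `path` as a dictionary."""
    st = os.stat(path)
    return {
        "inode": st.st_ino,
        "mode": stat.filemode(st.st_mode),  # e.g. "-rw-r--r--"
        "links": st.st_nlink,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "modified": st.st_mtime,
        "accessed": st.st_atime,
        "changed": st.st_ctime,
    }


# Try it on a scratch file so the example does not depend on any path.
with tempfile.NamedTemporaryFile() as f:
    f.write(b"hello")
    f.flush()
    info = describe_inode(f.name)
    print(info["inode"], info["mode"], info["links"], info["size"])
```

The block pointers themselves are not exposed through this interface; only the kernel and filesystem tools see them.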
In the Ext2 and Ext3 filesystems (Ext4 uses extents by default), each Inode has 15 block pointers. The first 12 point directly to disk blocks of the kind described in the section above. The 13th points to a single indirect block, which contains the disk addresses of more data blocks; the 14th points to a double indirect block and the 15th to a triple indirect block.
By timtjtim - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=75836001
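To get a feel for how far this pointer scheme reaches, here is a rough calculation. It assumes 4-byte block addresses, which is my assumption for the sketch, and it only counts what the pointers can address; real filesystems impose other, often lower, limits on file size.

```python
def max_file_bytes(block_size=4096, pointer_size=4, direct=12):
    """Upper bound on file size from 12 direct plus single, double
    and triple indirect pointers (pointer scheme only)."""
    per_block = block_size // pointer_size  # addresses per indirect block
    blocks = direct + per_block + per_block**2 + per_block**3
    return blocks * block_size


print(max_file_bytes())                 # 4402345721856, about 4 TiB
print(max_file_bytes(block_size=1024))  # 17247252480, about 16 GiB
```

Small files need only the direct pointers, so they cost no extra disk reads, while the triple indirect block accounts for almost all of the reachable size.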
File descriptors index into a per-process file descriptor table maintained by the kernel, which in turn indexes into a system-wide table of files opened by all processes, called the file table. This table records the mode with which the file (or other resource) has been opened: for reading, writing, appending, and possibly other modes. It also indexes into a third table, called the Inode table, that describes the actual underlying files. To perform input or output, the process passes the file descriptor to the kernel through a system call, and the kernel accesses the file on behalf of the process. The process does not have direct access to the file or Inode tables.
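From the process's point of view, a file descriptor is just a small integer. The Python sketch below uses the thin os.open, os.write, os.read and os.close wrappers around the corresponding system calls; write_then_read is a name I made up for the example.

```python
import os
import tempfile


def write_then_read(path, payload):
    """Write `payload` to `path` and read it back using raw descriptors."""
    # os.open returns an index into this process's descriptor table.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        os.write(fd, payload)
    finally:
        os.close(fd)

    fd = os.open(path, os.O_RDONLY)
    try:
        # The kernel performs the read on behalf of the process.
        data = os.read(fd, len(payload))
    finally:
        os.close(fd)
    return fd, data


with tempfile.TemporaryDirectory() as tmp:
    fd, data = write_then_read(os.path.join(tmp, "demo.txt"), b"hello\n")
    print(fd, data)
```

The open flags (read-only, write-only, and so on) are what the file table records as the open mode.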
A hardlink is a directory entry that refers to the same Inode as the original file. It is often used when you need to move the file around, because renaming or removing the original does not break the hardlink.
$ touch text.txt
$ stat text.txt
  File: text.txt
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: fd00h/64768d    Inode: 2621496     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
$ ln text.txt hardlink.txt
$ stat hardlink.txt
  File: hardlink.txt
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: fd00h/64768d    Inode: 2621496     Links: 2
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
The original text.txt and hardlink.txt have the same Inode number. The link count is stored in the Inode: it is 1 while text.txt is the only name, and becomes 2 once hardlink.txt also points to the Inode. Since this is an empty file, no blocks are allocated for its content, but in general the Inode holds pointers to the disk blocks.
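The same behaviour can be reproduced programmatically. This is a minimal Python sketch using os.link; hardlink_demo is a hypothetical helper and the file names simply mirror the shell example.

```python
import os
import tempfile


def hardlink_demo(directory):
    """Create a file and a hardlink to it, then remove the original."""
    original = os.path.join(directory, "text.txt")
    link = os.path.join(directory, "hardlink.txt")
    with open(original, "w") as f:
        f.write("data\n")

    links_before = os.stat(original).st_nlink
    os.link(original, link)  # add a second name for the same Inode
    same_inode = os.stat(original).st_ino == os.stat(link).st_ino
    links_after = os.stat(link).st_nlink

    os.remove(original)      # the content stays reachable via the link
    with open(link) as f:
        content = f.read()
    return links_before, links_after, same_inode, content


with tempfile.TemporaryDirectory() as tmp:
    print(hardlink_demo(tmp))
```

Removing text.txt only drops one name; the Inode and its data stay alive as long as hardlink.txt still refers to them.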
Note that because Inode numbers are only meaningful within a single filesystem, you cannot create a hardlink to a file on a different filesystem.
$ df -h
Filesystem                    Size  Used Avail Use% Mounted on
udev                          463M     0  463M   0% /dev
tmpfs                          99M  5.3M   94M   6% /run
/dev/mapper/vagrant--vg-root   62G  2.3G   57G   4% /
tmpfs                         493M     0  493M   0% /dev/shm
tmpfs                         5.0M     0  5.0M   0% /run/lock
tmpfs                         493M     0  493M   0% /sys/fs/cgroup
vagrant                       466G   99G  368G  22% /vagrant
tmpfs                          99M     0   99M   0% /run/user/1000
$ pwd
/home/vagrant
$ touch test.txt
$ ln test.txt /tmp/hardlink.txt
$ ln test.txt /run/hardlink.txt
ln: failed to create hard link '/run/hardlink.txt' => 'test.txt': Invalid cross-device link
In the example, /home/vagrant and /tmp are on the same filesystem (the root filesystem mounted on /), while /run is a separate tmpfs, so the hardlink cannot be created there.
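A program can anticipate this error by comparing device IDs, or simply attempt the link and handle the failure. The sketch below does both; same_filesystem and try_hardlink are names I made up, and the device comparison is only a rough check since the kernel makes the final decision.

```python
import errno
import os


def same_filesystem(path_a, path_b):
    """Rough check: do both paths report the same device ID?"""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def try_hardlink(src, dst):
    """Create a hardlink; return False on a cross-device error."""
    try:
        os.link(src, dst)
        return True
    except OSError as exc:
        if exc.errno == errno.EXDEV:  # "Invalid cross-device link"
            return False
        raise
```

For example, on the machine above I would expect try_hardlink("test.txt", "/run/hardlink.txt") to return False, matching the error that ln printed.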
Also, you cannot create a hardlink to a directory, since that could introduce a number of problems for the filesystem, such as ambiguity about a directory’s parent or loops between parent and child directories, where you would try to access a directory that points to itself. [2]
On the other hand, a symbolic link, or softlink, is a separate file that contains only the path of the original file. It is often used to refer to a certain location in the system. The original and the softlink can be on the same or on different filesystems.
$ ln -s test.txt symlink.txt
$ ls -l symlink.txt
lrwxrwxrwx 1 vagrant vagrant 8 Oct 18 15:53 symlink.txt -> test.txt
$ stat test.txt
  File: test.txt
  Size: 0          Blocks: 0          IO Block: 4096   regular empty file
Device: fd00h/64768d    Inode: 2621496     Links: 1
Access: (0664/-rw-rw-r--)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
$ stat symlink.txt
  File: symlink.txt -> test.txt
  Size: 8          Blocks: 0          IO Block: 4096   symbolic link
Device: fd00h/64768d    Inode: 2621497     Links: 1
Access: (0777/lrwxrwxrwx)  Uid: ( 1000/ vagrant)   Gid: ( 1000/ vagrant)
symlink.txt and the original file test.txt have different Inodes. The Inode of test.txt has pointers to the disk blocks that store the content of the file. The Inode of symlink.txt, on the other hand, refers to the stored path of test.txt, which is 8 bytes long here. For a short path like this one, Ext filesystems keep the path inside the Inode itself, which is why stat reports 0 blocks.
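The difference between the link and its target is easy to see from code as well. In this Python sketch, symlink_demo is a hypothetical helper; os.lstat inspects the link itself while os.stat follows it to the target.

```python
import os
import stat
import tempfile


def symlink_demo(directory):
    """Create test.txt and a softlink to it; compare their Inodes."""
    target = os.path.join(directory, "test.txt")
    link = os.path.join(directory, "symlink.txt")
    open(target, "w").close()
    os.symlink("test.txt", link)  # the stored path, relative to the link

    link_info = os.lstat(link)    # the softlink's own Inode
    target_info = os.stat(link)   # follows the link to test.txt
    return {
        "stored_path": os.readlink(link),
        "link_size": link_info.st_size,  # length of the stored path
        "is_symlink": stat.S_ISLNK(link_info.st_mode),
        "different_inodes": link_info.st_ino != target_info.st_ino,
    }


with tempfile.TemporaryDirectory() as tmp:
    print(symlink_demo(tmp))
```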
You can view the total and free number of Inodes of a filesystem with:
$ stat -f /
  File: "/"
    ID: 15766b2164c77a61 Namelen: 255     Type: overlayfs
Block size: 4096       Fundamental block size: 4096
Blocks: Total: 15313873   Free: 3581851    Available: 2796520
Inodes: Total: 3907584    Free: 593877
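The same counters are available to programs through the statvfs call. A minimal Python sketch, where inode_usage is a name I chose for the example:

```python
import os


def inode_usage(path="/"):
    """Return total, free and used Inode counts for the filesystem
    that contains `path`."""
    vfs = os.statvfs(path)
    return {
        "total": vfs.f_files,
        "free": vfs.f_ffree,
        "used": vfs.f_files - vfs.f_ffree,
    }


print(inode_usage("/"))
```

Some filesystems allocate Inodes dynamically and may report 0 for these fields, so the numbers are only meaningful where the Inode table has a fixed size.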
There is no real deletion. When you remove a file with the rm command, the data is still on the disk: the system simply unlinks the file, i.e. removes its directory entry, and possibly zeroes out the metadata in the Inode. The content is essentially still recoverable unless the data has been overwritten.
When you use the mv command within the same filesystem, it creates a new directory entry at the destination that points to the same Inode as the source, and removes the old entry. Across different filesystems, it first copies the file to the destination and then unlinks the source.
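The same-filesystem case can be checked by comparing the Inode number before and after a rename. This is a small Python sketch; rename_demo and the file names are made up for the example.

```python
import os
import tempfile


def rename_demo(directory):
    """Rename a file within one directory and compare Inode numbers."""
    src = os.path.join(directory, "a.txt")
    dst = os.path.join(directory, "b.txt")
    with open(src, "w") as f:
        f.write("x")

    inode_before = os.stat(src).st_ino
    os.rename(src, dst)  # only directory entries change, no data is copied
    inode_after = os.stat(dst).st_ino
    return inode_before == inode_after, os.path.exists(src)


with tempfile.TemporaryDirectory() as tmp:
    print(rename_demo(tmp))  # (True, False)
```

os.rename fails across filesystems, just like a hardlink does. Python's shutil.move handles that case the way mv does, by copying the file and then removing the source.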
The Lookup Process
Files are identified by paths, after resolving symlinks. The kernel begins path name lookup on the requested file. In the case of an absolute path, e.g. /home/mydirectory/file.txt, the root directory is known to be located in a predetermined block on disk and its Inode is also known (usually 2). In the case of a relative path, the process structure (task_struct) holds a pointer to the current directory, so the kernel can determine the Inode of the current directory. The kernel also checks the process’s privileges. Within each directory, the entries for files and subdirectories are unsorted, and each entry holds the Inode number of its file. The directory is searched linearly to find the entry for the next component of the path string, e.g. home. The kernel then fetches that entry’s Inode and uses it to locate the directory’s blocks. The process continues until the last component is found and its Inode is loaded into memory. Finally, the kernel returns a handle, the file descriptor, to the process for subsequent access to the file.
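The walk described above can be modelled with a toy in-memory "disk". This is only a sketch of the idea, not how the kernel implements it: the DIRECTORIES table and every Inode number except the root's are invented for the example, and permission checks, symlinks, ".." and caching are left out.

```python
ROOT_INODE = 2  # the root directory's usual Inode number

# Toy "disk": directory Inode -> unsorted list of (name, Inode) entries.
# All numbers other than ROOT_INODE are hypothetical.
DIRECTORIES = {
    2: [("tmp", 5), ("home", 11)],
    11: [("other", 20), ("mydirectory", 12)],
    12: [("notes.md", 31), ("file.txt", 13)],
}


def lookup(path, cwd_inode=ROOT_INODE):
    """Resolve `path` to an Inode number, or raise FileNotFoundError."""
    # Absolute paths start at the root, relative ones at the cwd.
    current = ROOT_INODE if path.startswith("/") else cwd_inode
    for component in path.split("/"):
        if component in ("", "."):
            continue
        entries = DIRECTORIES.get(current)
        if entries is None:
            raise NotADirectoryError(path)
        # Linear scan of the directory's unsorted entries.
        for name, inode in entries:
            if name == component:
                current = inode
                break
        else:
            raise FileNotFoundError(path)
    return current


print(lookup("/home/mydirectory/file.txt"))          # 13
print(lookup("mydirectory/file.txt", cwd_inode=11))  # 13
```

The second call shows the relative case: the search starts from the Inode of the current directory instead of the root.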