Filesystem

Surprisingly, almost everybody knows what a file system is, yet it is hard to find a common definition (or even agreement on how to spell it!). Choose your preferred one:

      - In a computer, a file system (sometimes written filesystem) is the way in which files are named and where they are placed logically for storage and retrieval. (TechTarget)

      - A file system is a method of organizing and retrieving files from a storage medium, such as a hard drive. (Computerhope)

      - In computing, a file system or filesystem is used to control how data is stored and retrieved. (Wikipedia)

      - A file system is the entire hierarchy of directories (also referred to as the directory tree) that is used to organize files on a computer system. (Linfo)

      - The method for storing and retrieving files on a disk. It is system software that takes commands from the operating system to read and write the disk clusters (groups of sectors). (PC Magazine)

Linux file system

While creating the Linux kernel in 1991, Linus Torvalds used the file system layout of Minix, the educational OS created by Andrew S. Tanenbaum in the 1980s. As Minix was designed for teaching, it had several limitations when used outside that scope. In 1992 Rémy Card designed the Extended File System (aka “EXT”), the first file system created specifically for Linux. The metadata structure of EXT is based on the original Unix File System (UFS).

Ext has evolved over time through several major versions: ext2, ext3 and the current ext4.

Like Minix or UFS, the EXT4 file system has several parts and structures that hold the file system metadata and make up its on-disk layout. The structures most relevant to Mico Maco are presented in the next sections.

Disks, partitions and sectors

Before going deeper into the Linux file system, it is worth noting that file systems are created on physical storage, be it a Hard Disk Drive (HDD), a Solid State Drive (SSD), a flash or USB drive, etc. Any of these needs a file system to store data. The type of file system depends on the operating system that will use the storage.

All operating systems have utilities to create file systems on storage. Usually, the first step is to create a partition on the disk. A partition is a logical disk that can take up part or all of the space of the physical storage. This is done with fdisk in Linux and DOS, or diskpart in Windows. Modern operating systems use the “GUID Partition Table” (GPT) partitioning scheme instead of the older, well-known “Master Boot Record” (MBR).

The partition utilities divide the drive into sectors. Nowadays these sectors are mainly logical, as modern disks no longer work with the old physical “Cylinder-head-sector” (C/H/S) addressing and use “Logical block addressing” (LBA) instead. One can say that LBA is a logical abstraction of the physical disk implementation.

Please be aware that the terms sector and block overlap somewhat. Sectors refer to the physical portions of the storage. The meaning of block depends on the context; it normally means a portion of data (or the place where data is put). The block size may be the same as the sector size or a multiple of it.

Even nowadays, although operating systems tend to work with 4 KB data blocks, many hard drives still present 512-byte sectors, so 1 block = 8 sectors. You can check this on your system:

fdisk -l | grep "Sector size"      # shows the disk sector size (run as root)
blockdev --getbsz /dev/sda         # shows the block size from the OS point of view
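
The same relationship can be checked from a program. The following is only a minimal sketch in Python: it assumes a 512-byte logical sector (the real value cannot be read without access to the device), and it uses os.statvfs, which reports the block size of the mounted file system rather than of the raw device. The constant and function names are made up for this illustration.

```python
import os

ASSUMED_SECTOR_SIZE = 512  # assumption: logical sector size in bytes


def sectors_per_block(block_size: int, sector_size: int = ASSUMED_SECTOR_SIZE) -> int:
    """Return how many sectors make up one file system block."""
    if block_size % sector_size != 0:
        raise ValueError("block size must be a multiple of the sector size")
    return block_size // sector_size


# Preferred block size of the file system mounted at "/" (Unix only).
fs_block_size = os.statvfs("/").f_bsize

print(f"file system block size at /: {fs_block_size} bytes")
print(f"a 4096-byte block spans {sectors_per_block(4096)} sectors")
```

The value reported for "/" depends on the file system in use, so it may differ from the 4 KB discussed here.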

So far, the disk is just partitioned… Now it is time to create a file system in the desired partition. In Linux, mkfs creates the file system. If the file system type is “ext4” (the current default), it creates the disk layout and the needed structures. The ones Mico Maco considers most important are described below. Refer to EXT4 for a full description.

Blocks and block groups

Returning to EXT4, it creates (by default) storage blocks of 4 KB. Blocks are grouped into “Block Groups”. With the default block size (4 KB), a block group is made of 32,768 blocks. Therefore a block group can hold up to 128 MB of data.
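
Where does 32,768 come from? The text above does not say, but in the ext2/3/4 layout the block bitmap of a group occupies exactly one block and spends one bit per block, so a group can track 8 x block size blocks. A small sketch of that arithmetic (the function name is made up for this illustration):

```python
def block_group_geometry(block_size: int) -> tuple:
    """Return (blocks per group, bytes per group) for a given block size.

    Assumes the block bitmap of a group fits in exactly one block and
    uses one bit per block, so a group tracks 8 * block_size blocks.
    """
    blocks_per_group = 8 * block_size
    return blocks_per_group, blocks_per_group * block_size


for size in (1024, 2048, 4096):
    blocks, total_bytes = block_group_geometry(size)
    print(f"{size}-byte blocks: {blocks} blocks per group, "
          f"{total_bytes // 2**20} MB per group")
```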

Superblock

The superblock contains all the needed information about the configuration of the file system. It includes fields such as the total number of inodes and blocks in the file system, how many of them are free, when the file system was mounted (and whether it was cleanly unmounted) and when it was modified.

The superblock is essential to mounting the file system. A primary copy is stored at an offset of 1024 bytes from the start of the device. There are backup copies of the superblock stored in block groups throughout the file system.
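
To make this more concrete, here is a hedged sketch that decodes the first 60 bytes of a superblock with Python's struct module. The field order and sizes are taken from the ext4 on-disk layout documentation, not from this article, and the image is a synthetic buffer with invented values; reading a real device would require root access.

```python
import struct

SUPERBLOCK_OFFSET = 1024  # primary copy starts 1024 bytes into the device
EXT_MAGIC = 0xEF53        # magic number shared by ext2, ext3 and ext4

# First 60 bytes, little-endian: 13 32-bit fields (up to the last write
# time) followed by 4 16-bit fields (up to the state flags).
_HEAD = struct.Struct("<13I4H")


def parse_superblock(image: bytes) -> dict:
    """Decode a few superblock fields from a raw file system image."""
    (inodes, blocks_lo, _reserved, free_blocks_lo, free_inodes,
     _first_data_block, log_block_size, _log_cluster_size, blocks_per_group,
     _clusters_per_group, inodes_per_group, mount_time, write_time,
     _mount_count, _max_mount_count, magic, state) = _HEAD.unpack_from(
        image, SUPERBLOCK_OFFSET)
    if magic != EXT_MAGIC:
        raise ValueError("not an ext2/ext3/ext4 superblock")
    return {
        "inodes": inodes,
        "blocks": blocks_lo,  # low 32 bits only in this sketch
        "free_blocks": free_blocks_lo,
        "free_inodes": free_inodes,
        "block_size": 1024 << log_block_size,
        "blocks_per_group": blocks_per_group,
        "inodes_per_group": inodes_per_group,
        "mount_time": mount_time,
        "write_time": write_time,
        "cleanly_unmounted": bool(state & 0x0001),
    }


# Synthetic 1 GB file system (invented values, for illustration only).
image = bytearray(2048)
_HEAD.pack_into(image, SUPERBLOCK_OFFSET,
                65536, 262144, 0, 200000, 65000, 0, 2, 2, 32768,
                32768, 8192, 0, 0, 1, 0xFFFF, EXT_MAGIC, 1)

sb = parse_superblock(bytes(image))
for key, value in sb.items():
    print(f"{key:18} {value}")
```

On a real system the same fields can be inspected without any code using dumpe2fs -h on the device.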

Block Group Descriptor

The metadata of each block group is held in a Block Group Descriptor structure. Among other things, this structure stores the location of the block bitmap, the location of the inode bitmap and the location of the inode table.
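
As an illustration, the sketch below decodes the start of a 32-byte group descriptor. The field order follows the ext4 on-disk layout documentation (it is not given in this article), only the low 32 bits of each location are handled, and the sample values are invented.

```python
import struct
from typing import NamedTuple


class GroupDescriptor(NamedTuple):
    block_bitmap: int  # block number of the block bitmap
    inode_bitmap: int  # block number of the inode bitmap
    inode_table: int   # first block of the inode table
    free_blocks: int
    free_inodes: int
    used_dirs: int


# First 18 bytes of the descriptor: three 32-bit locations, three 16-bit counters.
_DESC_HEAD = struct.Struct("<3I3H")


def parse_group_descriptor(raw: bytes) -> GroupDescriptor:
    """Decode the leading fields of one block group descriptor."""
    return GroupDescriptor(*_DESC_HEAD.unpack_from(raw, 0))


# Synthetic descriptor padded to 32 bytes (values are made up).
raw = _DESC_HEAD.pack(1025, 1041, 1057, 30000, 8000, 2).ljust(32, b"\0")
desc = parse_group_descriptor(raw)
print(desc)
```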

Directory

Unix directories are lists of association structures, each of which contains one filename and one inode number.

A directory is a filesystem object and has an inode just like a file. It is a specially formatted file containing records which associate each name with an inode number. Later revisions of the filesystem also encode the type of the object (file, directory, symlink, device, fifo, socket) to avoid the need to check the inode itself for this information.

The inode allocation code tries to assign inodes which are in the same block group as the directory in which they are first created.

The original Ext2 revision used a singly linked list to store the filenames in the directory; newer revisions are able to use hashes and binary trees.

Also note that as a directory grows, additional blocks are assigned to store the additional file records. When filenames are removed, some implementations do not free these additional blocks.

Having seen what a directory is, it is worth noting the difference between a hard link and a soft link.
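
The linked-list form can be sketched in a few lines of Python. Each record starts with an 8-byte header (inode number, record length, name length, file type) followed by the name, and the record length says where the next record begins. The header layout and the type codes come from the ext4 on-disk layout documentation; the 64-byte "block", the helper names and the sample entries are invented for the example.

```python
import struct

FILE_TYPES = {0: "unknown", 1: "file", 2: "directory", 3: "char device",
              4: "block device", 5: "fifo", 6: "socket", 7: "symlink"}

# Record header: inode (32 bits), record length (16), name length (8), type (8).
_ENTRY_HEAD = struct.Struct("<IHBB")


def make_entry(inode: int, name: str, file_type: int, rec_len: int = 0) -> bytes:
    """Build one directory record; rec_len defaults to the 4-byte aligned minimum."""
    raw_name = name.encode()
    minimum = (_ENTRY_HEAD.size + len(raw_name) + 3) & ~3
    rec_len = rec_len or minimum
    head = _ENTRY_HEAD.pack(inode, rec_len, len(raw_name), file_type)
    return (head + raw_name).ljust(rec_len, b"\0")


def walk_directory_block(block: bytes):
    """Yield (name, inode, type) by following the record lengths."""
    offset = 0
    while offset < len(block):
        inode, rec_len, name_len, ftype = _ENTRY_HEAD.unpack_from(block, offset)
        if rec_len < _ENTRY_HEAD.size:
            raise ValueError("corrupt record length")
        if inode != 0:  # inode 0 marks an unused record
            start = offset + _ENTRY_HEAD.size
            name = block[start:start + name_len].decode()
            yield name, inode, FILE_TYPES.get(ftype, "unknown")
        offset += rec_len  # jump to the next record in the list


BLOCK_SIZE = 64  # tiny block, just for the demonstration
block = make_entry(2, ".", 2) + make_entry(2, "..", 2)
# The last record stretches to the end of the block.
block += make_entry(12, "notes.txt", 1, rec_len=BLOCK_SIZE - len(block))

entries = list(walk_directory_block(block))
for name, inode, kind in entries:
    print(f"{inode:4} {kind:10} {name}")
```

Finding a name this way means scanning the records one by one, which is why large directories benefit from the hashed lookups of newer revisions.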

A hard link is just a new entry in the directory structure. This entry contains a new file name that points to a pre-existing inode. So a hard link is not a file. It is just a new pointer to an existing file. As directory entries can only refer to inodes of the filesystem they belong to, hard links cannot point to files in other filesystems.

A soft link is a file. We can say that it is a special type of “text” file containing the path to another file. So a soft link consumes an inode (and of course a directory entry), which points to a file with the “text”. As it contains a path rather than an inode number, it can point to any file in any filesystem.

In newer filesystems (e.g. EXT4), if the path is shorter than 60 bytes, it is written in a special part of the inode (see below). This saves a data block.
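
The difference is easy to observe. The sketch below creates a file, a hard link and a soft link in a temporary directory and compares their inode numbers; the file names are arbitrary, and it assumes the temporary directory lives on a file system that supports both kinds of link.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "original.txt")
    hard = os.path.join(tmp, "hard.txt")
    soft = os.path.join(tmp, "soft.txt")

    with open(original, "w") as fh:
        fh.write("hello")

    os.link(original, hard)     # new directory entry, same inode
    os.symlink(original, soft)  # new inode whose content is a path

    st_original = os.stat(original)
    st_hard = os.stat(hard)
    st_soft = os.lstat(soft)    # lstat inspects the link itself, not its target

    same_inode = st_original.st_ino == st_hard.st_ino
    link_count = st_original.st_nlink
    soft_has_own_inode = st_soft.st_ino != st_original.st_ino
    soft_target = os.readlink(soft)

print(f"hard link shares the inode:  {same_inode}")
print(f"link count of that inode:    {link_count}")
print(f"soft link has its own inode: {soft_has_own_inode}")
print(f"soft link stores the path:   {soft_target}")
```

The equivalent shell commands are ln for the hard link and ln -s for the soft link, with ls -li to display the inode numbers.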

Inode

Inodes (index nodes) contain the metadata of filesystem objects such as data files, directories, device files, etc. (everything in Unix/Linux is a file!).

Inodes are created with an empty information structure when the filesystem is created (the ratio of inodes to disk space is fixed at that point and cannot be changed afterwards). The creation of a file system object implies the usage of an inode to hold the object metadata.

Inode usage can be seen with df -i
Inode number can be seen with ls -i
The stat command shows most of the metadata of a file

The inode structure changed significantly in EXT4: the default inode size grew to 256 bytes, as opposed to 128 bytes in EXT2 or EXT3. EXT4 continues to use a default block size of 4096 bytes, so there are 16 inodes per block.
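
With these sizes, finding an inode on disk is plain arithmetic. The sketch below assumes 8192 inodes per block group (the real value is stored in the superblock) and relies on the fact that inode numbers start at 1; the function name is made up for this illustration.

```python
def locate_inode(ino: int, inodes_per_group: int = 8192,
                 inode_size: int = 256, block_size: int = 4096) -> tuple:
    """Return (block group, block within the inode table, byte offset in that block)."""
    if ino < 1:
        raise ValueError("inode numbers start at 1")
    group, index = divmod(ino - 1, inodes_per_group)
    table_block, offset = divmod(index * inode_size, block_size)
    return group, table_block, offset


print("inodes per block:", 4096 // 256)
for number in (2, 12, 17, 8193):
    print(f"inode {number}: group/block/offset = {locate_inode(number)}")
```

The block number returned is relative to the start of the inode table of that group, whose location is kept in the block group descriptor.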

The inode structure contains these fields:

      - Type of file (e.g., regular, directory, special device, pipes, etc.)
      - Access permissions for the file owner, the owner’s group members and others (i.e. the general public).
      - Number of links (aliases) to the file
      - File owner’s User ID
      - File owner’s Group ID
      - File size in bytes (for regular files)
      - Pointers to the disk addresses of the data blocks (where the contents of the file are actually stored)
      - Time of last access (atime)
      - Time of last modification (mtime), i.e. modification of the data
      - Time of last change (ctime), i.e. change of the inode
      - Time of deletion (dtime)
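
Most of these fields can be read from a program through the stat system call. The sketch below uses Python's os.lstat on a throwaway file; the function name and the sample file are invented for the example. The deletion time and the block pointers are not exposed through this interface.

```python
import os
import stat
import tempfile


def inode_summary(path: str) -> dict:
    """Collect the inode fields that os.lstat exposes for a path."""
    st = os.lstat(path)  # lstat does not follow symbolic links
    if stat.S_ISREG(st.st_mode):
        kind = "regular file"
    elif stat.S_ISDIR(st.st_mode):
        kind = "directory"
    elif stat.S_ISLNK(st.st_mode):
        kind = "symbolic link"
    else:
        kind = "other"
    return {
        "inode": st.st_ino,
        "type": kind,
        "permissions": stat.filemode(st.st_mode),
        "links": st.st_nlink,
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
        "atime": st.st_atime,
        "mtime": st.st_mtime,
        "ctime": st.st_ctime,
    }


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w") as fh:
        fh.write("hello")
    summary = inode_summary(sample)
    dir_summary = inode_summary(tmp)

for key, value in summary.items():
    print(f"{key:12} {value}")
```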

Files, inodes and disk blocks

By default, ext4 uses 4096 bytes (4 KB) as the disk block size. How are larger files managed, and how large can a single file be?

Well, it depends on the selected file system (see the answer at the end of this section).

Before ext4, the ext2 and ext3 file systems used inodes that included pointers to the data blocks. Each inode contained 15 pointers of 32 bits (60 bytes in total): 12 direct pointers plus one single, one double and one triple indirect pointer:

0                                11 12 13 14   <--- 15 inode pointers
-------------------------------------------
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |
-------------------------------------------
|  |  |  |  |  |  |  |  |  |  |  |  |  |  |---------|
            data blocks             |  |            |
                                    |  |----|       |
                                    |       |       |
                                  -----   -----   -----
                                  |   |   |   |   |   |
                                  -----   -----   -----
                                   |||     |||     |||
                                   data   -----   -----
                                          |   |   |   |
                                          -----   -----
                                           |||     |||
With 4K blocks:                            data   -----
direct 12x4K = 48K                                |   |
indirect 1024x4K = 4MB                            -----
double indirect 1024x1024x4K = 4GB                 |||
triple indirect 1024x1024x1024x4K = 4TB            data
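
The figures in the diagram follow from the pointer size: a 4 KB block holds 1024 pointers of 4 bytes, and every level of indirection multiplies the reach by that number. A small sketch of the arithmetic (the names are made up for this illustration):

```python
POINTER_SIZE = 4      # bytes, i.e. the 32-bit block pointers mentioned above
DIRECT_POINTERS = 12  # pointers that reference data blocks directly


def indirect_capacity(block_size: int) -> dict:
    """Bytes reachable through each class of block pointer."""
    per_block = block_size // POINTER_SIZE  # pointers held by one block
    return {
        "direct": DIRECT_POINTERS * block_size,
        "indirect": per_block * block_size,
        "double indirect": per_block ** 2 * block_size,
        "triple indirect": per_block ** 3 * block_size,
    }


capacity = indirect_capacity(4096)
for level, size in capacity.items():
    print(f"{level:16} {size:>20,} bytes")
print(f"{'total':16} {sum(capacity.values()):>20,} bytes")
```

This is only the addressing capacity of the pointer scheme; the practical ext3 file size limit quoted at the end of this section is lower.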

Instead of direct and indirect block pointers, ext4 uses a different mechanism named the “extent tree” to track file content. An extent is a single descriptor for a range of contiguous physical disk blocks.

The extent tree scheme addresses an inefficiency of ext2 and ext3 with large files: their mapping keeps an entry for every single block, and big files have many blocks to track.

As an extent can describe a contiguous range of physical disk blocks of the needed size, extents improve performance and also help to reduce fragmentation, since an extent encourages contiguous layouts on the disk.

The physical block field in an extent structure has a size of 48 bits, while the length field has 16 bits, of which the top bit is used to mark uninitialized extents. Therefore one extent can represent up to 2^15 contiguous blocks, or 128 MB (with 4 KB block size). Huge files will need to be split into several extents, but a 125 MB file can be held by a single extent, instead of the 32,000 individual block pointers needed by ext3.
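
The comparison can be worked out in a few lines. The sketch below counts the 4 KB blocks of a file and the minimum number of extents needed to describe them, assuming the best case of a perfectly contiguous file and binary units (1 MB = 2^20 bytes), which gives 32,000 blocks for 125 MB. The function name is made up for this illustration.

```python
MAX_EXTENT_BLOCKS = 2 ** 15  # blocks covered by one extent, as stated above


def blocks_and_min_extents(file_size: int, block_size: int = 4096) -> tuple:
    """Return (data blocks, minimum extents) for a fully contiguous file."""
    blocks = -(-file_size // block_size)          # ceiling division
    extents = -(-blocks // MAX_EXTENT_BLOCKS)
    return blocks, extents


for label, size in (("125 MB", 125 * 2**20), ("128 MB", 128 * 2**20),
                    ("1 GB", 2**30)):
    blocks, extents = blocks_and_min_extents(size)
    print(f"{label:>7}: {blocks:>7} blocks, at least {extents} extent(s)")
```

A fragmented file needs more extents than this minimum, one per contiguous run of blocks.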

The extent information is placed in the inode structure, in the same 60 bytes allocated for the ext2/ext3 block pointers (bytes 40 to 99 of the inode structure). These 60 bytes can hold 1 extent header + 4 extent structures (12 bytes each). Larger or more fragmented files need more extents. In this case a constant-depth extent tree is used to store the extent map of the file.

The root of this tree is stored in the ext4 inode structure and the extents are stored in the leaf nodes of the tree. Each node in the tree starts with an extent header, which contains a magic number, the number of valid entries in the node, the number of entries the node can store and the depth of the tree.

So, returning to the earlier question about limits…

                     EXT3      EXT4
Max. FS size         16 TB     1 EB
Max. file size       2 TB      16 TB
Max. # of subdirs    32,000    Unlimited

1 EB = 1,048,576 TB (1 EB = 1024 PB, 1 PB = 1024 TB, 1 TB = 1024 GB)
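
Several of these limits are simply the block size multiplied by the number of blocks that can be addressed. The sketch below reproduces them under these assumptions: 4 KB blocks, 32-bit block numbers in ext3, 48-bit physical block numbers in ext4 and a 32-bit logical block number inside an ext4 file. The names are made up for this illustration.

```python
BLOCK_SIZE = 4096
TB = 2 ** 40
EB = 2 ** 60


def addressable_bytes(block_number_bits: int, block_size: int = BLOCK_SIZE) -> int:
    """Bytes reachable with block numbers of the given width."""
    return 2 ** block_number_bits * block_size


ext3_fs_limit = addressable_bytes(32)    # 32-bit block numbers
ext4_fs_limit = addressable_bytes(48)    # 48-bit physical block numbers
ext4_file_limit = addressable_bytes(32)  # 32-bit logical block numbers in a file

print(f"ext3 max FS size:   {ext3_fs_limit // TB} TB")
print(f"ext4 max FS size:   {ext4_fs_limit // EB} EB")
print(f"ext4 max file size: {ext4_file_limit // TB} TB")
print(f"1 EB = {EB // TB:,} TB")
```

The ext3 file size limit does not follow from this formula and is not derived here.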