Free Web Hosting Provider - Web Hosting - E-commerce - High Speed Internet - Free Web Page
Search the Web

 

Pune Institute of Computer Technology
 
 
 
 

LFS
 
 

(Log-structured File System)
 
 
 
 
 
 
 
 
 
 

By ...

Abhishikt N. Jain.

E-mail : author@abhishikt.8k.com

Web Site : http://abhishikt.8k.com

(1999 - 2000)




 
 
 

Acknowledgement:
 
 

I take this Opportunity to express my heartful thanks for all the people who have helped me in completion of this seminar. I am thankful to Prof. Dr. S. A. Kulkarni, H. O. D., Computer Dept., for giving me this opportunity to present my seminar. I also thank to my friend, Ashutosh S. Rajekar, who suggested me this topic. I also thank to Prashant V. Bhuse for his help. The thanks are also due to Mr. Cristian Czezatke, who provided some detailed information about Log-structured File Systems.

Last, but not the least, I would like to thank all my friends and colleagues, who directly or indirectly helped me by providing suggestions and valuable inputs from time to time.
 
 

Abhishikt N. Jain.

E-mail: author@abhishikt.8k.com

Web Site : http://abhishikt.8k.com

(1999-2000)



 

INDEX  

S. No. Title Page No.
  Acknowledgement 2
1 Introduction
1.1 Unix File Systems (UFS)
1.2 Unix Fast File System (FFS)
1.3 Log-structured File System (LFS)
5
2 Some Techniques for High Performance
2.1 Read and Write Caching
2.2 Disk Optimization Techniques
7
3 Log-structured File System
3.1 Overview of Log-structured File System
3.3 Data Location and reading
3.4 Inode map
9
4 Free Space Management in LFS
4.1 Using Threading
4.2 Using Copying
4.3 Combination of Treading and Copying
4.4 Segments (Mechanism to Manage Free Space)
4.5 Segment Cleaning
4.6 Segment Selection
4.7 Segment Layout
4.8 Log regions and segment summary blocks
4.9 Log region layout
14
5 Enhancing Performance of LFS
5.1 Enhancing Performance
5.2 Effect of Segment size
5.3 Cleaning Policies
5.4 Cleaning Algorithms
26
6 Summary and Conclusion 30
  Bibliography 31



 

 Chapter 1
 
 

Introduction
 
 

Computer program or system consists of several components, and if some of the components are made much faster while leaving the other components unchanged, then the unimproved components are likely to become a performance bottleneck.

Recent technology trends suggest that disk input and output may become such a bottleneck in the near future. CPU speeds are increasing dramatically along with memory sizes and speeds, but the speeds of disk drives are barely improving at all. Without new file system techniques, it seems likely that the performance of computers in the 2000’s will be limited by the disks they use for file storage. For example, suppose that an application program today spends 10% of its time waiting for disk I/O. If CPU speed improves by a factor of 10 without any disk improvements, then the application will only run about 5 times faster and will spend about 50% of its time waiting for I/O. If CPU speed improves by another factor of 10, still without disk improvements, then the application will only run about twice again as fast, and will spend about 90% of its time waiting for I/O. The hundred-fold overall improvement in CPU speed will only result in a ten-fold improvement in application speed.

Fortunately, there are two technology trends that offer hope for beating the I/O bottleneck: large memories and disk arrays. First, as memories get larger and larger, it becomes possible to cache more and more file information in main memory and thereby eliminate many disk accesses. Second, disks are becoming much cheaper and physically smaller. This makes it practical to build disk systems containing tens or hundreds of drives (‘‘disk arrays’’). Disk arrays don’t improve the performance of a single small access, but they offer much greater overall bandwidth and the possibility of many concurrent accesses.

This seminar is a discussion of the technology trends related to I/O, the problems they may pose for future engineering and office applications, and a set of possible solutions. Our overall goal is to find ways of capitalizing on fast-moving technologies (CPU speed, memory size, disk array size) to compensate for slow-moving technologies (disk seek times) so that future systems will not be I/O-limited [3].
 
 

1.1 Unix File System (UFS):

The basic storage in Unix is a file. A Unix file is represented as a sequence of bytes. Files also have attributes about the owner, access permissions, etc. Each file is uniquely identified as i-number. Unix uses a directory as a special type of file. The entries in directory may refer to other directories or files, this leads to a tree like structure. Unix system has fixed blocks of size 512-bytes.

For each open file a current offset into the file is maintained by the file system. The write system calls allows an application to transfer an arbitrary amount of data to a file starting at the current offset. The read system call allows an arbitrary amount of data to be read into memory buffer from the file starting at the current offset. The current offset is updated by both the read and write system calls. The application can change the offset without transferring data using the lseek system call. By explicitly setting the current offset before each transfer an application can perform non-sequential accesses to the file.

  1.2 Unix Fast File System (FFS):

In the early 1980s a major revision of the Unix file system (UFS) was done at Berkeley [2]. The resulting file system, called the Berkeley Fast File System (Unix FFS), has become one of the most widely used and studied Unix file systems. Unix FFS changed the on-disk layout in a number of ways that improved performance. Changes included increasing the block size, changing the disk region layouts, and a new block allocation algorithm. This section presents the changes made in Unix FFS while the next section will review the storage management techniques that motivated the changes.

Unix FFS removes the limit that blocks were only 512 bytes in size. Blocks in the Unix FFS file sys-tem can be any power of two starting with a minimum size of 4096 bytes up to the limits imposed by the disk controller and drives. Unix FFS also adds the notion of a block fragment that only occupies a portion of a file system block. To reduce internal fragmentation the last block in a file is allowed to be one or more fragments rather than an entire file system block. The fragment size is set to be one-fourth or one-eighth the size of a block. Although having fragments complicates the design, fragments are necessary to avoid large amounts of disk space being lost to internal fragmentation. Studies showed that with a 4096 byte block as much as 50% disk space would be lost on normal workloads.

Unix FFS keeps the basic on-disk format of inodes (explained latter in section 3.4) and indirect blocks that was present in the initial Unix implementation. However, the inodes are no longer clustered together in their own region. Instead, Unix FFS breaks the disk into regions of adjacent cylinders called cylinder groups. Each cylinder group has its own inode region and data block region.

Unix FFS also has a new block allocation algorithm and data structures. The freelist data structure is replaced by a bitmap data structure that records the allocated state for each block.
 
 

1.3 Log-structured file systems:

This dissertation describes a new disk storage management technique, called log-structured file systems, that is well suited for the technology and challenges facing the disk storage managers of the 1990s. Using a log-structured file system, storage write requests can be processed an order of magnitude more efficiently that current file systems; this allows the power of future processors to be fully utilized. The improvement in write performance is accomplished by writing all new information to disk in a sequential structure called the log. This approach increases write performance dramatically by almost eliminating disk seeks. By writing everything sequentially to disk, a log-structured file system uses the disk in the most efficient way possible. The sequential nature of the log also permits much faster crash recovery. Rather than scan the entire disk to restore consistency, a log-structured file system need only examine the most recent portion of the log. Crash recovery time in a log-structured file system does not grow with the disk capacity [1].

Log-structured file systems are based on the assumption that files are cached in main memory and that increasing memory sizes will make caches more and more effective at satisfying read requests. As a result, disk traffic will become dominated by writes. This condition is precisely the opposite of the read-dominated workloads that almost all-previous disk storage managers were designed for. The changed workloads require a write-optimized rather than read-optimized disk access technique.

The notion of logging is not new; it has been used extensively in database systems and a number of recent file systems have incorporated a log as an auxiliary structure to speed up writes and crash recovery. However, these other systems use the log only for temporary storage; the permanent home for information is in a traditional random-access storage structure on disk. In contrast, a log-structured file system stores data permanently in the log; there is no other structure on disk. The log contains indexing information so that files can be read back with an efficiency comparable to that of current file systems. The result is that a log-structured file system can provide the read performance of current read-optimized storage managers while greatly exceeding their write performance.

For a log-structured file system to operate efficiently, it must ensure that there are always large extents of free space available for writing new data. Keeping large extents of space available is the most significant challenge in the design of a log-structured file system. This dissertation presents a solution based on large extents called segments. A segment cleaner process continually re-generates empty segments by compressing the live data from heavily fragmented segments. The segment cleaner acts as a garbage collector that insures that there is room to append more to the log. The file system can be viewed as a log that is continually wrapping upon itself.

The segment cleaner uses a simple but effective algorithm to segregate older, more slowly changing data from younger rapidly changing data. The algorithm is based on cost and benefit and causes the older data to be treated differently from younger data during cleaning. Simulations show that this algorithm reduces the overhead of the segment cleaner substantially, allowing high performance for disks operating at the capacity utilization of current storage systems.

The concept of log-structured file systems is demonstrated by a prototype log-structured file system called LFS, which is in production use as part of the network operating system. Benchmark programs demonstrate that, for small files, the raw writing speed of LFS is more than an order of magnitude greater than that of the commonly used Unix file system. Even for other workloads, such as those including reads and large-file accesses, LFS is at least as fast as Unix in all cases but one (files read sequentially after being written randomly). Long-term measurements of LFS in production use show the inherent advantages of log-structured file systems. Overall, LFS permits about 65-75% of a disk’s raw bandwidth to be used for writing new data (the rest is used for cleaning). For comparison, Unix systems can only utilize 5-10% of a disk’s raw bandwidth for writing new data; the rest of the time is spent seeking.




 

Chapter 2
 
 

Techniques for High Performance
 
 

Access time of disk storage is the main problem for improving performance. So a number of techniques have been developed to improve the performance of disk storage managers. This section discusses several of the techniques, focusing on those that benefit the Unix or related environment. The first techniques presented involve the use of main memory to hide from the application the much slower magnetic disks. Techniques presented include caching and pre-fetching of data.

The second form of performance improvement techniques presented focuses on reducing the disk delays by accessing the disk with more efficient access patterns. Techniques of this form include optimizing the layout of data on disk to promote locality of storage that is likely to be requested together. Contiguous allocation and clustering are examples of such disk layout techniques. Disk efficiency can also be improved by ordering the requests sent to disk using the disk scheduling techniques.
 
 

2.1 Read and Write Caching:

Access time of disk very much greater than main memory, so main memory is used for caching disks. The applications that are frequently needed are cached in main memory so the time spent on read same thing from slow disk is saved. The Unix file system makes extensive use of read caching. All disk resident items such as files, directories, and metadata blocks can be cached in the disk cache. The Unix disk cache is maintained using Least Recently Used algorithm that exploits locality and reuse in file system access patterns. Studies have found cache hits rates of 80-90%, this rate is high enough to significantly reduce the disk delays experienced by Unix applications.

From the storage manager's view there is no disadvantage to read caching. The more memory devoted to caching, the higher the cache hit rate and the fewer disk accesses that are needed. From the whole system perspective there are disadvantages to read caching. Memory used to cache files can not be used for supporting the application’s other memory requirements. Care must be taken not to give so much memory to the storage manager that applications can not perform acceptably. The data, which might be referred in future, can be brought to memory in advance. This will improve performance greatly. This, Pre-fetching, works well in Unix system.

Write caching is also known as write buffering, write behind, or write delayed writes, returns control to the application before the write work is done (i.e. data transferred to the disk). The main memory is used to buffer the write requests. Second benefit of write caching occurs if multiple writes are done on same block, in this case changes in contents of the block is done in the buffer only and the final contents are transferred to the disk.

The sync system call occurs every 30 seconds in Unix system. This limits the data loss in case of power failure or a crash. Because of this, any changes that are older than 30 seconds will not be lost in a crash. This provision allows the file system to get some of the benefits of write caching without arbitrary long exposures to data loss.

To provide additional guarantees to applications, the Unix FFS implementation added a new system call, fsync, that allows an application to force all modifications to a file from the disk cache to disk. Once an fsync call returns, an application knows that all modifications to the file have been made it safely to disk. Applications such as text editors have been modified to use the fsync call to avoid loss of data during crashes.

2.2 Disk Optimization Techniques:

Once the main memory of the computer has been fully utilized to improve performance, the next area of focus is to minimize the time an application spends waiting for the requests that have go to disk. Disk request times can be reduced by reducing overheads such as time spent moving the disk head (disk seeks) and time spent waiting for the storage to rotate underneath the disk head (rotational delay). This section describes the general approach used by storage managers to reduce overheads in disk requests. The first approach is to control the layout of storage on disk so that requests can be processed with the least over-head. The second approach is to schedule the application’s disk requests to reduce the overheads.
 
 

Contiguous allocation works by allocating objects (i.e. files) to consecutive sectors on disk so that sequential accesses can be performed with minimal delay between blocks. When objects are read sequentially, the request for the first block suffers a delay but all blocks after that proceed at the best-case rate. Another Advantage of contiguous allocation is that the size of an index used to find the blocks of an object is small. The location of an entire object can be described by a single disk pointer to the start sector of the object. The index is much smaller than the Unix technique that has at least one pointer per block. Smaller indexes mean that less data needs to be stored and examined to locate blocks of an object.
 
 

If more than one disk request is outstanding at a given time then the storage manager can process the requests in an order that reduces disk overheads. For example if the disk is presented with 10 requests that alternate between disk track 0 and disk track 100 the storage manager can reorder the requests so that all the requests are processed on track 0 before moving to track 100. This would result in only two disk seek delays, one to track 0 and one to track 100, rather than ten seek delays that would occur if the requests were processed in the order given. This scheduling of the disk, called disk scheduling, can reduce the seek delays, number of seeks, and rotational delays. It is found that sorting large numbers of random requests can increase the disk efficiency from around 7% to 25% of the disk’s maximum bandwidth. Achieving 25% of the maximum disk bandwidth required a thousand requests to be queued and running of a complex, CPU intensive sort function that considered both seek delays and rotational latency.
 
 
 




 
 

Chapter 3
 
 

Log-structured File System
 
 
 
 

3.1 Overview of Log-structured File System

The fundamental idea of a log-structured file system is to improve write performance by buffering in the file cache all modifications to the file system and writing all the changes to disk sequentially in a single disk transfer operation. The information written to disk in the write operation includes file data blocks, file attributes, index blocks, directories, and almost all the other information used to manage the file system. By using this format, a log-structured file system can obtain high write performance regardless of the request size or access pattern of the workload. The log-structured file system gets its name from the similarity of its disk layout to that of a log in the logging techniques. In both cases the disk storage manager writes modifications sequentially to disk. The chief difference between the systems is that the log in a log-structured file system is the only data structure on disk.

Having the log as the only data structure on disk allows the log-structured file system to avoid the second disk write to update the data structure that is needed in systems that use write-ahead logging. Elimination of this second write allows log-structured file systems to utilize nearly 100% of the raw disk bandwidth on all workloads including those for which standard logging techniques can achieve only 25% to 50% of the bandwidth.

Although the basic idea of a log-structured file system is simple, there are two key issues that must be resolved to achieve the potential benefits of the log-as-the-only-data-structure approach. The first issue is how to retrieve information from the log. For traditional logging systems, the log is never read during normal operations and only read sequentially during crash recovery. A log-structured file system would have unacceptable read performance if this same design were used. If a sequential search of the log were required to retrieve information, the system could suffer large delays from reads that miss in the file cache.

The second major issue of log-structured file system design is how to manage the free space on disk so that large extents of free space are always available for writing new data. One way to view this problem is that the log of the log-structured file system quickly fills the disk and wraps upon itself. During steady-state operation the log of a log-structured file system is continuously wrapping upon itself. The key issue in the design of a log-structure file system is to handle this ‘‘log wrap’’ problem as efficiently as possible.
 
 
 

3.2 Data Location and Reading:

The log structure of the log-structured file system makes the allocation of data on disk trivial; data blocks are allocated and written when the log is written to disk. The challenge of file I/O in a log-structured file system is in the algorithms for retrieving data from disk. This section discusses the design techniques that allow log-structured file systems to quickly retrieve data from disk. Although the term ‘‘log-structured’’ might suggest that sequential scans are needed for retrieval operations in a log-structured file system, this is not the case and it would lead to unacceptable performance in most environments. For example, the disk technologies currently used in the environment have capacities around one gigabyte and maximum transfer rates of two megabytes per second. Using these disks a sequential scan over the entire disk would take over eight minutes. Even with a cache read hit rate of 99.9% (which is much higher than any current file system achieves) the average read time would increase by 480 milliseconds, a factor of ten slower than the current storage manager provides. Clearly, retrieval mechanisms which increases the file cache miss cost five orders-of-magnitude are unacceptable.

The alternative to sequential scan is to have the log-structured file system maintain an index within the log which can be used to speed retrieval for requests that miss in the file cache. The index allows the storage manager to locate and read data items written anywhere in the log without searching. Every data transfer to the log will modify this index so its update must be efficient. To achieve this efficiency, the modifications to the index must be written to the log along with all other modifications.

To implement the fast retrieval needs of Unix, the LFS prototype builds index structures similar to those used in Unix FFS. For each file, LFS allocates an inode data structure that, like the Unix inode, contains the file’s attributes (type, owner, permissions, etc.) and the disk addresses of the first ten blocks of the file. LFS also uses the same indirect blocks and double indirect blocks that as Unix for handling large files.

The main difference between the index structures of LFS and Unix FFS is that LFS inodes are not in fixed locations on disk. This differences is key to reducing the number of seeks of workloads containing small file modifications. When the inodes are in fixed locations updates to inodes, directory blocks, and the file’s data blocks cause non-sequential and small transfers to disks; precisely the type of access patterns that cause poor performance. Figure shows how the attributes, data, and index blocks of a file are allocated in the log in LFS and compares the disk layout of LFS with that of the Unix FFS.

Using the same indexing data structures as Unix FFS had several advantages for the LFS prototype. It was simple to integrate into the existing kernel and provided opportunities to reuse code from ’s original Unix FFS-like file system. Furthermore, keeping the indexing structure the same meant that behavioral differences between LFS and Unix FFS could not be attributed to the index mechanism. This was useful in the evaluation of the log-structured file system.

Figure 1: A comparison between LFS and Unix FFS:


Creating two single-block files named dir1/file1 and dir2/file2. Each system must write new data blocks and inodes for file1 and file2, plus new data blocks and inodes for the containing directories. Unix FFS requires ten non-sequential writes for the new information (the inodes for the new files are each written twice to ease recovery from crashes), while LFS performs the operations in a single large write. The same number of disk accesses will be required to read the files in the two systems. LFS also writes out new inode map blocks to record the new inode locations
 
 

3.3 Inode map:

Although accessing file data given the inode is the same in Unix FFS and LFS, accessing the inode is slightly more complex in LFS because of the code needed to locate the inode. In Unix each inode is at a fixed location on disk; given the i-number for a file, a simple calculation yields the disk address of the file’s inode. In contrast, LFS doesn’t place inodes at fixed positions; they are written to the log.

To locate inodes quickly LFS uses a data structure called an inode map that maintains the current location of each inode. The inode map is structured like a large array indexed by the file’s i-number. Each entry in the array contains the current disk address of one inode. To fetch an inode for a file, LFS indexes into the inode map to compute the disk location of the inode. Once the address lookup is done, file inode and data block access are identical to Unix FFS. The same inode and indirect block structure means that once the lookup is done the same code is executed and the same number of disk blocks are accessed.

Since the LFS inode map replaces what was a simple address calculation in Unix FFS, the data structure implementing the inode map should be both reliable and fast. Should the inode map become lost or corrupted, all files will be unaccessible or only available after very time-consuming sequential scans. The inode map is queried and updated frequently meaning these operations need to be fast. Ideally the 31 inode map lookup should take no longer than the address calculation done in Unix FFS.

To achieve high performance, LFS divides the inode map into blocks that are written to the log much like file data blocks. Writing the inode map to the log allows modifications to the map to be written to disk using the normal high write performance of the log. No extra disk seeks are needed to update the inode map.

The inode map needs to be reliably recovered after system crashes and shutdowns. To support fast startup of a file system, the locations of the inode map blocks are kept in a fixed region on disk called the checkpoint region. At file system attach time, the checkpoint region is read to find the location of all inode map blocks.

With the inode map divided into blocks, LFS can support high performance lookups in the inode map by caching frequently accessed blocks of the inode map in the same memory area used to cache file blocks. The key to high performance lookups in the inode map is to have the hit rate on the inode map be high enough so the average cost is close to zero disk accesses per lookup. To achieve this high hit rate requires that the active portion of the inode map fit comfortable in memory.

The amount of memory needed to successfully cache the inode map depends on the size of the map and the locality of the lookups. Size of the inode map depends on a file system creation parameter specifying the maximum number of inode map entries. Each inode map entry is 12 bytes in size and contains the current address of the inode, a flags field, a version number, and the time of last access of the file.

Although the inode map entry only needs the inode address to locate the file, having version number and the last access time in the inode map rather than the inode means that these fields can be examined and updated without fetching or updating the inode. This is particularly important for the access time, which is updated on every read operation to the file.

By default, the maximum number of inode map entries is set so there can be one entry for each 4 kilobytes of space. This means that the maximum size of the inode map will be equal to 0.29% of the disk space. Currently the size of main memory of the file servers is around 5% of the disk space on the machine. Thus the entire inode maps for all LFS file systems fit comfortably in main memory on the file servers.

The preceding argument about the maximum size of the inode map is based on the overly pessimistic assumption that the file systems have an average file size of 4 kilobytes (The average size on the file system range from 9 kilobytes to 60 kilobytes.) File systems with a larger average file size will have fewer inode map entries. Inode map blocks that contain no allocated entries are not given space in memory or on disks. The inode map can be as small as a single block for file systems with few files.

To allow the inode map to require even less space in the file cache, the allocation of files to i-numbers is done so that files in the same directory are given i-numbers close to each other. The flags field of the inode map entry indicates if the i-number is allocated. Directories are allocated an inode map entry at random and files created in the directory start at the directory’s inode map entry and scan forward until the first available entry is found. This allocation policy means that normally all the files in a directory will reside in the same inode map block. Locality of references to files in the same directory causes locality in the references to inode map blocks. The inode map blocks containing entries from the directories currently being referenced will be in memory. The blocks containing entries for directories not currently being referenced are not needed in memory.

The combination of locality of reference and a larger average file size on the production LFS systems means the hit rate for inode map reference is very high. Hit rates range from 99.1% to 99.9% on the production file systems even though on average only 15% of the maximum inode map size is resident in the file cache (i.e. only 3.6 megabytes of memory are used for caching inode map blocks of 8.1 gigabytes of disk space). Most misses occur because of cache cold-start effects when the file systems are first mounted after a reboot.

As technology scales up the size of the disk storage, the size of the inode map will increase but so will the size of the memories available to cache the blocks. Even with very large (1024-gigabyte) disk stores it seems unlikely that the inode map block cache miss rate will increase significantly.




 
 

Chapter 4
 
 

Free Space Management in LFS
 
 

The most challenging design issue for log-structured file systems is the management of free space. The goal is to maintain large contiguous free regions on disk so that the log can be written with large sequential transfers. When the file system is created all the free space is in a single extent on disk. By the time the log reaches the end of the disk and wraps around on itself, the free space will have been fragmented into many small extents corresponding to the files or blocks hat were deleted or overwritten. From this point on, the file system has two choices: threading or copying. Either the log can be threaded around the still live blocks or the live blocks can be copied out of the way. This section discusses the two approaches in detail and presents the combined threading-copying approach used in LFS.
 
 

4.1 Free Space Management using Threading:

The threading technique calls for the live blocks to be left where they are: new log blocks are written into the free spaces. Figure shows how a log-structured file system using threading might look. The disk is treated as a large circular list of blocks and the log is written to the next free block. The chief benefits of threading are its simplicity, the ability to operate at high disk capacity utilizations, and low free space management overheads when compared to the copying approaches. By scanning over the disk and writing to the free blocks a threaded log-structured file system can operate until the disk is 100% full. Keeping track of the allocated blocks is straightforward. A simple bitmap data structure like the one used by Unix FFS will work. The next block for writing the log can be located by scanning the bitmap in a circular fashion. The storage manager can continue operating as long as there is an unallocated block on the disk. This can be contrasted with some of the copying schemes under which it is prohibitively expensive to operate when the disk is nearly full.

The main problem with threading is that the free space on the disk will become fragmented so that large sequential transfers to the log are not possible. In a threading system, the large contiguous pieces of free disk space are overwritten with blocks of the log, which are later overwritten or deleted to deallocate the space. Unless all data is overwritten and deleted, the large pieces of disk space will be broken into smaller pieces by the blocks that survive. These surviving blocks must be skipped over when the log wraps. The time spent skipping over blocks can not be used for transferring data to the log.

The reduction in performance caused by skipping over the live blocks depends on the number of live blocks and how the live blocks are clustered on disk. The number and clustering of live blocks is in turn dependent on the storage capacity utilization and workload. Operating under a low storage capacity utilization means that the disk contains few live blocks to skip over when the log wraps. For example, a log-structured file system running with only 10% of the disk allocated to live blocks will on average write 9 out of every 10 blocks that the log wraps on. Using the assumption that the disk can skip over a block at least as fast as it can write the block, write performance of at least 90% of disk maximum bandwidth should be obtainable regardless of the workload’s access pattern. This example can be generalized to produce an estimate of the maximum bandwidth as a function of the disk capacity utilization.

Although a threaded log-structured file system can operate until all disk blocks are in use, the estimate for disk write performance indicates that performance will degrade as storage capacity utilization increases. Consider a disk operating at 80% storage capacity utilization, a common figure in the environment. The log-structured file system would have to skip over 8 blocks for every 2 blocks written thus achieving only 20% of the disk maximum bandwidth. As the disk fills the write performance of a threaded log-structured file system drops.

Figure 2: Managing Free Space by Threading in log-structured file system:

This figure shows the before and after disk image of log wrap on a threaded log-structured file system. With threading the log skips over the blocks that are still live and overwrites those blocks that have been previously deleted or overwritten. Pointers are needed to link together the disjoint pieces of the log.


The above estimate of write performance may be overly pessimistic for systems operating at high storage capacity utilizations. It is possible that the workload overwrites or deleted files in such a way that the live blocks are clustered with other live blocks and the dead blocks are clustered with other dead blocks. This clustering would allow the log-structured file system to write in large transfers for high bandwidth and use fast seeks to skip over data. Unfortunately such clustering is unlikely in practice. The clustering of live blocks is controlled by what data is written together to the log and the clustering of dead blocks is controlled by the pattern of overwrites and deletions generated by applications. Neither of these clusterings is under the control the storage manager so the performance of threading depends on the workload. The clustering needed for high performance threading relies on application with perfect locality in the write access patterns such that data that is written together is always overwritten or deleted together. This type of access allows the log to be threaded though the clusters of dead blocks and skipped over for the clusters of live blocks. Unfortunately, although the office/engineering environment has significant locality, it is not perfect. Most but not all files written at the same time will be deleted or overwritten together. The result is that over time the few that live will cause the free space to become heavily fragmented. The large contiguous disk transfers needed for high performance will not be available.
 
 
 
 

4.2 Free Space Management using Copying:

An alternative to skipping over live data during log wrap is to move the live blocks somewhere else. The live data can be read in and written back to disk in another location, allowing the log to be written to the newly cleared location. Copying allows a log-structured file system to avoid the fragmentation created by threading. By copying data out of the way, the log can always be written in large contiguous pieces. Figure shows how a log-structured file system using copying might look.

The disadvantage of copying is that the copying overhead reduces the amount of the disk bandwidth available for application data. The time spent reading the live data in and writing it back out can not be used for reading or writing new data. Several factors control the amount of copying needed and the impact of this copying on storage manager performance. The factors include how the copying is implemented and the application workload. This relationship between workload, implementation, and performance is examined in detail in next chapter. One fundamental issue that must be addressed in a copying scheme is where to copy the data. Possible solutions include copying the live data back in a compacted form at the head of the log, copying into another log-structured file system to form a hierarchy of logs, or copying the data into some totally different file system or archive. Each approach has its advantages but they all share the common disadvantage that all long-lived data is written to disk multiple times. Valuable disk bandwidth for writing new data is lost to these multiple writes.

Copying the data back into the log is the simplest approach; only one data structure, the log, is needed. Under this approach the log wrap is dealt with by reading in the live data being wrapped on, compressing it into an unfragmented piece, and writing it back into the log. The mechanism to implement this form of copying is simple but it suffers the disadvantage that long-lived data is continuously copied every time the log wraps upon it.

The other two approaches attempt to reduce the amount of copying by removing the data from the main log when it is wrapped upon. By copying the data into another log, a hierarchy of log-structured file systems can be formed with each log containing data of similar age. The hope is that the older the data in the log the slower it will be changing and the less copying will be needed. The hierarchy of logs approach lessens the problems of the single log approach because long lived data is not copied on every wrap of the main log. It still performs multiple copies of long-lived data as the data moves through the layers of the hierarchy.

An alternative to the hierarchy of logs approach is to copy the data into another non log- structured file system. Any files that live long enough to survive the first log wrap would be copied into a separate storage area. This second file system could use a format that works well for slowly changing read-mostly workloads. The hope is that the first log-structured file system would slow the rate of modifications to the point that the read-optimized file system could absorb the changes. Copying into a separate file system limits the number of copies of long lived data to two copies.


 
 

Figure 3: Managing Free Space by Copying in log-structured file system:

This figure shows the before and after disk image of log wrap on a copying log-structured file system. Blocks that are still alive with the log wraps are read, compacted, and written back to the log.



 
 
 
 

The disadvantage of Copying approach is that it suffers the same multiple copy problem of the hierarchy of logs. Every long-lived byte must be copied at least once. An additional complexity is caused by maintaining two different storage managers, a log-structured one and a read-optimized one.
 

4.3 Combination of Threading and Copying:

Using a pure Threading approach, a log-structured file system loses performance to fragmentation, as large sequential transfers are not possible. Using a pure Copying approach, a log-structured file system can control the fragmentation but loses performance to the copying overhead. What is needed is a combination of these techniques that uses Copying to control the fragmentation and uses Threading to control the copying overhead.

The solution to the log wrap problem used in the LFS prototype employs a combination of Threading and Copying. LFS copies live data back into the log to control fragmentation. When fragmentation is not a problem, a Threading approach is used to quickly skip the unfragmented log sections. The design has the advantages of both the Threading and Copying approaches without the fragmentation overhead of Threading or the copying overhead of Copying. So Threading is used when disk utilization is low, and Copying is used when disk utilization is high. Next section describes the mechanisms used to implement this hybrid approach.
 
 
 
 

4.4 Segments (Mechanism to Manage Free Space):

To ensure that large contiguous chunks of disk space are available for log writes, LFS divides the disk into large fixed-size extents called segments. The log is written to segments sequentially from the segment’s beginning to its end, and all live data must be copied out of a segment before the segment can be rewritten. Segments also form the unit of the log threading. Once the log fills the current segment the log writing moves to another segment. Any segment may be chosen to be the next segment provided that it contains no live blocks.

To ensure a constant supply of segments available for writing the log, LFS uses copying to move the live blocks from heavily fragmented segments. This copying, called Segment Cleaning, generates empty segments by copying all the live blocks out of one or more of segments, combining the blocks together, and writing them to the log. The segment cleaning mechanism is described in detail in the next section.

Segments and the segment cleaning mechanism also allow slowly changing data to be collected together and skipped over by the log. Long-lived and unchanging data living in a segment can be ignored by the log writing mechanism. This allows LFS to achieve both the benefits of threading as well as the benefits of copying.

The LFS design calls for segments to be large so that the transfer time to read or write a whole segment is much greater than the cost of a seek to the beginning of the segment. Thus whole-segment operations run at nearly the maximum bandwidth of the disk regardless of the order in which segments are accessed. If we increase segment size then percentage of bandwidth achievable also increases as access delay (seek time + rotational delay + controller overhead) becoming negligible compared disk write-time. But large segments will create problems for segment cleaning. So we have to select segment size to precise.
 
 

4.5 Segment Cleaning:

Because LFS writes entire segments at once, only segments that are known to contain no live blocks can be written. LFS refers to such empty segments as clean segments and only permits the log to be threaded through clean segments. To ensure a constant supply of clean segments, LFS uses a mechanism called segment cleaning that generates clean segments from segments containing live blocks. The segment cleaning mechanism works by copying all the live blocks out of one or more segments, combining the blocks together, and writing them to clean segments as part of the log. Once the data has been written back to the log, the original segments are clean and can be used for new data or for additional cleaning. The live data that was previously fragmented is now compacted together in the segments written during the cleaning operation.

The mechanisms used to implement segment cleaning in LFS are simple and flexible. Segment cleaning is implemented as a three-step process: read a number of segments into memory, identify the live data, and write the live data back to a smaller number of clean segments. This section details the basic mechanism used to implement segment cleaning. The segment cleaning mechanism is controlled by several policies that guide decisions made during the segment cleaning process. Policies control events such as when cleaning is started and stopped, which

Figure 4: Segment cleaning in LFS:

In the top figure the log is seen wrapping upon itself. Then segment cleaning occurs, the fragmented segments are read, compacted, and written back to other segments. After segment cleaning the log has room to grow though the newly cleaned segments. Note that the log can skip over non-fragmented segments.



 
 
 
 

segments are selected to be cleaned, how the live data of segments is combined together after being cleaned, and where the combined data is written.

The LFS implementation treats segment cleaning much as if the live blocks of the segment being cleaned were rewritten with the same data by an application program. Live blocks are brought into the file cache and written back to a segment in the log. Because of the similarity, segment cleaning in LFS is able to share much of the normal file system code.

The key difference between segment cleaning and normal file modifications is that during segment cleaning it must be possible to identify which blocks in a segment are live. For the live blocks it must also be possible to identify the file to which each block belongs and the position of the block within the file. This extra information is needed to update the file’s inode or indirect block to contain the new location of the block when it is written out. Although most file systems maintain an index mapping a file’s block to its location on disk, few support the reverse map from a disk block to its inode. When this reverse mapping is done such as during the disk scavenger program of the Unix file system, it is a very time-consuming process requiring scanning the inodes and indirect blocks of the file system. Using this mechanism would be unacceptably show for LFS.

Live block identification during segment cleaning is implemented in LFS by making the contents of a segment self-identifying. When a segment is written, enough additional information is added so that the segment cleaning mechanism can identify each block of a segment. For example, the additional information for a file block contains the file’s i-number and block number within the file. When the block is moved during segment cleaning this additional information is used to find and update the file’s inode to reflect the block’s new location.

Making segments self-identifying allows the segment cleaner to know what a block is or what it was but does not tell if it has been overwritten or deleted. The cleaner still needs a test to determine if a block is still live. LFS does this by checking the inode or indirect block to see if the block in question is still part of the file. If the block is pointed to by the file’s block index it is still live; otherwise it can be ignored during segment cleaning.

One advantage of using this lookup mechanism is that it eliminates data structures such as bitmaps or freelists that are commonly used in file systems to track blocks in use. Having no bitmap or freelist benefits both the normal running of the file system and the crash recovery algorithm. During normal operation such data structures are constantly being modified, requiring CPU and disk I/Os. The crash recovery algorithm must be coded to restore the data structures to a consistent state before the file system can be used again.

However, elimination of the bitmap means that LFS does not know which blocks in a segment are live until it reads the segment. This precludes any optimizations to the cleaning mechanism that attempt to read only the live blocks of segments. With the bitmap, the cleaning mechanism could be optimized to read only the live blocks as indicated by the bitmap.

The value of this lost optimization depends on the clustering of live blocks in the segments cleaned. In order for the optimization to be a big win, the implementation must be able to read the live blocks significantly faster than reading the entire segment. For current disk technology and CPU technology, this only occurs when a segment’s live blocks are clustered together into a few contiguous groups of blocks that can be read in a few transfers. Between the high disk controller overhead and the CPU time necessary to start a disk I/O, this is over 5 milliseconds of overhead per I/O. This overhead combined with missing disk revolutions between transfers means that when the number of transfers is greater than one per track it will not be faster than reading the entire segment.

A second disadvantage of the test LFS uses to determine if a block is live is that fetching the inode or indirect block may involve additional disk I/O’s. If the inode or indirect block that describes a block is not in the segment being cleaned and not in the file cache, the cleaning mechanism must fetch it from disk to do the liveness test. In the worst case the lookup could cause several additional disk I/O’s for each block examined. To keep the cost of cleaning low it is important that such random disk references be avoided.

In LFS, extra disk references will occur during the test only when the block’s index is neither in the file cache nor in the segments being cleaned. For workloads with small files and those with large files written sequentially the index is likely to be in one or both of these places. For small files, the file data and inode data are likely to be modified and written together into the same segment. Reading the entire segment reads both the file’s blocks and inode so no additional reads are required to check the block.

For workloads with large files written sequentially, the segments will contain many consecutive blocks of the same file. Checking the blocks will require few random disk I/O’s because many of the blocks will be mapped by the same indirect blocks and inode. The random I/O request to fetch an indirect block will be amortized over many lookup requests using the same indirect block.

The workloads under which the lookup test causes the most random I/O’s are those in which the index blocks and inode are separated from the blocks they map. This happens when large files are written non-sequentially and with little locality of reference. The data blocks mapped by a single indirect block will be scattered over the disk causing random I/O’s to fetch the index when cleaning.

Using knowledge of the workload in the environment, LFS further improves cleaning performance by maintaining a file version number that is updated every time a file is deleted or truncated to length zero. The version number allows LFS to determine quickly and without additional I/O’s whether a block is part of a file that has been totally overwritten or deleted. The version number for a file is kept in the file’s inode map entry (see Section 4.2.1) and included in the summary information for each of the file’s blocks in the segment. During the liveness test a quick comparison of the version number in the summary information and the inode map can determine if the block has been deleted.

In the workload environment of the LFS prototype, large-file random I/O without locality is rare so the lookup test performs well. With the truncate version number eliminating most of the inode lookups for deleted files, the remaining fetches of inodes require little disk I/O. Excluding the swap1 file system, only 1% to 5% of the inode fetches made during segment cleaning required any disk I/O. The needed inodes were either found in the file cache or in the segment being cleaned. The swap1 file system with its non-sequential access patterns to large files had a much higher number of disk I/Os during segment cleaning. Around 18% of the inode fetches missed in the file cache and required a disk I/O to fetch the inode block. The experience with swap1 suggests that an additional mechanism might be needed to reduce the extra disk I/Os during segment cleaning on file systems with workloads containing large files accessed non-sequentially.
 
 
 

4.6 Segment selection:

Although the cleaning mechanism described above will work for any segment, some segments are better than others for efficient operation of the system. Segments that are heavily fragmented make a better choice than those that contain little free space. To guide in the selection of segments to clean and to keep track of the segments that are clean, LFS maintains a data structure called the segment usage table.

For each segment on disk the segment usage table contains an entry that describes the state of the segment, the number of live bytes in the segment, and an estimate of the age of the youngest block in the segment. The state bits tell if the segment is clean and therefore available for writing new data. The live byte count and the age are used by the policy for selecting segments to clean. Its usage is described in next chapter.

The segment usage table is modified by any operation that adds or deletes data from disk. The entry for a segment is updated when the log is written to the segment and when files are truncated or deleted. Entries are also updated when a block of a file or directory is rewritten causing the old block to become free.

Like the inode map, the segment usage table is broken into blocks that can be cached in the file cache. The size of the segment usage table depends on the segment size and the size of the disk. With the 512-kilobyte segment size of the current file system, the segment usage table occupies 48 kilobytes per gigabyte of disk space.
 
 
 
 

4.7 Segment layout:

This section presents the part of the LFS design that controls the organization of data on disks. The disk layout is important because it affects not only the read performance of LFS but also the segment cleaning overhead, crash recovery time, and space efficiency. This section describes layout of data in segments and the mapping of segments onto the disk media.

Because both write performance and cleaning overhead depend on efficient whole-segment transfers, segments need to be mapped onto the disk so that whole segment disk transfers are efficient. Segment size and alignment should be adjusted so that segments correspond to ‘‘natural’’ disk transfer units such as sets of tracks or cylinders. Such placement allows high bandwidth transfers that are not slowed by extra seeks or rotational delays. For disk arrays, the alignment and size should take the redundancy overhead into account so that segment write operations avoid the costly read/modify/write cycles needed for non-stripe-aligned accesses.

Once the segments have been sized and aligned for efficient transfers to and from disk, the LFS storage manager can focus on the placement of data inside segments. Issues that influence the layout of data in a segment include the read performance, data identification, and end-of-log detection.

Read performance:

The whole-segment transfer mode used by log writes in LFS means that the write transfer performance does not depend on how the data is organized in a segment. The important performance metric driving the segment layout is the performance of read requests. Data of a segment is optimized to exploit common read access patterns to increase performance. In the environment the most common access pattern is sequential whole-file access.

Data identification:

The segment must contain the additional information required for self-identification. This information should be placed in the segment so that the overhead in terms of log bandwidth and storage efficiency is low and the speed of identification during cleaning is fast.

End-of-log detection:

A basic problem in a logging system is how the crash recovery algorithm detects when it has reached the end of the log. The mechanism used must work regardless of when the system dies while writing the log. For LFS this means it must be able to detect when a log write to a segment only partially completed because the system crashed.
 
 
 
 

4.8 Log regions and segment summary blocks:

The identification of a segment’s contents and the end-of-log detection algorithm are complicated by the need to write partial segments as well as whole ones. Partial-segment writes are needed when modifications to the file system are insufficient to fill an entire segment yet still must be written to provide reliability guarantees (e.g. a client has invoked the fsync kernel call). It must be possible to fill in a segment piece-wise with each piece self-contained.

LFS addresses this need by dividing segments into contiguous extents called log regions. A log region contains zero or more data or metadata blocks combined with a segment summary block that contains the additional information describing the contents of the region. The log region is the smallest unit in which blocks are written to the log. It can range in size from one block to an entire segment. Figure shows the layout of a log region containing several files.

Besides containing the self-identifying information for a segment, the segment summary block also contains pointers used to link log regions together to form a logically sequential log. The links are used to connect log regions in different segments as well as those within a single segment. Each segment summary block starts with a header containing the disk address of the previous segment summary block and the address of where the next segment summary block will be written. A program can follow these pointers to traverse the log in both the forward and backward directions.

Using the summary block approach for data identification requires that two different blocks, the summary block and the data block, be read to identify the data block. The alternative approach would be to make every block self-identifying so the summary block would not be needed. This approach was rejected because it would require stealing bytes from file data blocks to provide the identification information. Since most application make file system block size requests, stealing as little as one bit from a block will cause application request to access multiple disk blocks.
 
 
 
 

4.9 Log region layout:

With the segment summary blocks taking care of the identification and end-of-log detection, the remaining issue for the log layout on disk is the arrangement of the data for fast retrieval and high space efficiency. During the log writing process, LFS assembles an image of the log region in memory. It is not until this time that the disk addresses for the blocks in the log region are known. The layout code must place blocks into the region and update the indexes pointing at them. The rest of this section details the log region layout algorithms.

Once initiated, the log region layout algorithm runs as a separate kernel-resident process for each file system that needs to write data. The file cache code hands the layout algorithm the files whose attributes, data blocks, or block indexes have changed. The layout algorithm retrieves the modified blocks and attributes from the cache and builds an in-memory image of the log region organized as a list of pointers to blocks in the file cache. When all the modifications are placed in the log region or the log region fills it is transferred to the current segment being written.

To help read performance, the log region layout algorithm clusters together the modified blocks from the same file. This is implemented by placing into the log region all the modified blocks of a file before moving on to the next file. For files written sequentially, this algorithm will allocate the blocks of the file contiguously and in order in the log region. Sequential read requests to the file can then be processed sequentially from disk.

For each block placed in a log region, the algorithm must update two data structures: the segment summary block and the index of the block. The addition of a block in a log region is recorded in an entry added to the summary block of the log region. This entry contains the i-number of the file, the block number being added, and the truncate version of the file from the inode map. Placing the block in the log region also determines the disk address at which the block will be written. The inode or indirect block is updated in the file cache to reflect this new address for the block.

When placing a file in a log region, the data blocks are placed first followed by the indirect blocks followed by any double indirect blocks and finally the inode. Placing blocks in this bottom-up hierarchical fashion means that blocks are never modified after being placed in a log region. Furthermore, the layout algorithm is the same for indirect blocks as it is for data blocks. Placing an indirect block causes the next higher level of index to be updated in addition to adding an entry to the segment summary block. In the entry, negative block numbers are used to distinguish the index blocks from the file’s data blocks. These block numbers are assign using a depth first scan of the tree of indirect blocks (i.e. the single indirect blocks is -1, the double indirect block is -2, the children of the double indirect block are -3 through -1026, etc).

The final step of a file’s layout is to place the inode in the log region and update the file’s inode map to point at this new inode. This step is both important for crash recovery and problematic for the layout code. Until the inode is written and the inode map updated, the blocks written to the log region will not be visible on disk because it is not part of the file’s index rooted at the inode. The crash recovery algorithm described in next chapter uses this property so that the inode acts like a data base commit record that decides when to switch the on-disk version of a file to include the newly written changes.

The problem with placing the inode is caused by its small size. The inode in LFS is only 128 bytes so allocating an entire 4-kilobyte block or even a 512-byte sector will waste much disk space due to internal fragmentation. To reduce the fragmentation, LFS clusters the inodes of a log region into a block called an inode block. The size of the inode block is a file system settable parameter that defaults to 4 kilobytes.

Once all the modified blocks of the file have been placed in a log region, the layout algorithm places the inode of a file into the inode block and updates the inode map entry of the file to point at the inode block. Inode blocks are allocated in the log region whenever an inode needs to be placed and no space exists in the current inode block. For log regions with many small files, the files’ inodes will be placed into a few inode blocks with low fragmentation. If need be, all blocks of a log region can be allocated to data blocks with no inode blocks.

On the prototype LFS file systems with four-kilobyte inode blocks, on average only half the entries in the inode blocks were in use. The distribution indicated that having a smaller inode block would not have helped much because the blocks were either nearly full or contained only a few live inodes. One of the reasons for this was that the crash recovery mechanism of the distributed file system modifies the inodes of the files recently used by a client machine when the client crashes. This caused more inode modifications than LFS anticipated. In hindsight, the crash recovery version number should have been put in the inode map like the file’s access time.

Clustering the inodes together into an inode block has benefits beyond reducing fragmentation. By fetching the entire inode block when a reference to an inode in the block occurs, LFS implicitly pre-fetches the other inodes in the block. Any locality of reference to inodes in the same inode block is exploited. The inode blocks are cached like file blocks in the file cache. Since the higher levels of the file system cache individual inodes, the inode blocks act as a second-level cache.

On the production LFS file systems, most access to inodes find the inode block already in the cache. Except for the swap1 file system, the cache-hit rates on the inode blocks are in the range 90% to 96%. The hit rate on the swap1 file system was only 75% for two reasons. The first reason was that swap1 had relatively few files so the caching of inodes in the higher levels of the file system was very effective. Only the cold start misses referenced the inode blocks in the file cache. The second reason was that swap1 also suffered worse fragmentation in the inode blocks. The average number of inodes was only four per block and most (60%) inode blocks contain less than three inodes. The heavy fragmentation of the inode blocks meant that the caching was ineffective for swap1.

To reduce fragmentation and improve space utilization, the LFS log region layout algorithm allows the last block of a file to occupy less than a full 4-kilobyte block. The number of sectors allocated to the last block of a file is rounded to an integral number of 512-byte disk sectors. Like the Berkeley Unix FFS, allowing the last block to be a fragment greatly reduces the intra-file fragmentation in file systems with many small files that are not integral numbers of blocks. It is also possible to round fragments to even smaller units than disk sectors and achieve greater fragmentation reduction. This was not done in the current LFS implementation because addressing the disk on more than sector boundaries would require increasing the size of disk pointers.




 
 

  Chapter 5

Enhancing Performance of LFS
 
 

5.1 Enhancing Performance:
 

As we know improving Write performance of LFS was designed to provide high performance for write through large batched disk transfers. However, additional research demonstrated that cleaning overhead could result in dramatically lower write performance for some workloads. Self-tuning principles can be applied to LFS to provide high write performance across a broader range of workloads.
 

The cleaner reads an entire segment even if a few live blocks remain. It may be beneficial to allow the cleaner to read only if the live blocks when that would take less time, but following the original LFS, we did not include this optimization. Ideally, the data would be written once and never moved bt the cleaner; this happens if all data in the segment is overwritten before the segment is reclaimed. In the best case, then, the write cost can be 1; i.e. the entire write was used in the new data.
 

Here we define the transfer in-efficiency as…
 

Transfer in-efficiency = Segment Transfer Time actual / Segment Transfer Time Ideal
 


Access Time + Segment Size / Disk Bandwidth

Transfer in-efficiency =   ----------------------------------------------------------------

Segment Size / Disk Bandwidth

Disk Bandwidth

Transfer in-efficiency  =  Access Time *  -----------------------------------------------  + 1.

Segment Size


So the overall cost will be ...

Transfer Time total
=  ---------------------------------------------------
Transfer Time ideal



5.2 Effect of Segment size

The segment size plays an important role in the write performance of LFS than has been previously suggested. It was suggested that the segment size should be large enough to absorb the penalty of the seek to the segment. In Sprite LFS a relatively large 1-MB segment is used. On the other hand, there is a countervailing benefit to choosing a smaller segment size. It was observed that at smaller segments the variance in the segment utilization is larger; allowing the cleaner to chose less utilized segments. In particular, smaller segments can simply be declared clean without requiring any disk transfers by the cleaner.

In the limit, with one-block segments, cleaning costs would always be zero because all segments would be either fill or empty and no data would need to be compacted. Of course, single block segments would eliminate any advantage from batched transfers.

According to the original definition write cost, write cost is minimized at small segments because smaller the segment reduces cleaner overhead. However, this does not reflect the inefficiency introduced by transferring smaller segments. Transfer in-efficiency is the ratio between the actual segment transfer time to the time it would have taken to transfer the segment at full disk bandwidth.

As segments become large, the access time becomes insignificant relative to the time for transferring the segment, and therefore the transfer inefficiency approaches one. The write cost measures the overhead of cleaning, The transfer inefficiency measures the bandwidth degradation caused by seek and rotational delay.

The overall cost equation captures both these effects. The overall write cost is the total time required to write new data and clean segments, divided by the time to write the new data at full disk bandwidth.

If we consider disk characteristics, against segment size at different disk bandwidths, the overall write cost will be greater than 3 when segment size is less than 16 KB or greater than 4 MB. The write cost is minimum, when segment size is in around 256 KB. So at this segment size the disk performance will be maximum.
 

  5.3 Cleaning policies

The final parameters for the simulations selected the segment cleaning policy. The simulator allowed for control of the following policies…

(1) The size and number of segments in the file system. The segment size ranged in values from 24 kilobytes up to 8 megabytes. Unless otherwise specified a segment size of 2 megabytes (512 files) was used. The number of segments (size of the simulated disk) was adjusted so that there were always 200 megabytes (524288 files) of data.

(2) The number of live files read at a time during cleaning. The reading step of the cleaning mechanism continues until it has read this number of live files into memory to be combined and written out during the write-back step. The motivation behind the number is that LFS reserves space in the cache for cleaning. This number specifies the maximum amount of space reserved in the cache. If not specified then 10 megabytes (2560 files) were read during a cleaning operation.

(3) The policy used to select the segments to be cleaned. This policy is invoked when cleaning is started and is used to select which segments are to be read in and combined. Two policies were implemented: the greedy and cost-benefit policy. The greedy policy chooses the segments with the least number of active bytes according to the segment usage table. The cost-benefit policy uses a function of the age and amount of data in a segment. Unless specified the greedy policy is used.

(4) The policy specifying the order in which live data is written back to disk during cleaning. The two choices were to combine the live files in the order in which they were read from segments during cleaning or to group the files by last modify time.

Because of the large number of possible policy and workload combinations it was infeasible to study all combinations. The approach taken for this thesis was to study the effects of varying each cleaning policy individually. The goal was not to use the simulator to predict performance but to understand the reason behind the changes in write cost due to varying the policies. To further aid in understanding, I modified the simulator to graphically represent the state of the disk. As the simulator ran it produced an animated view of segments and allowed the effects of policy changes to be visualized.
 

5.4 Cleaning Algorithms

The cleaning used normally degrades the performance, so we have to use the cleaning algorithm such that it will least affect the performance. Traditional LFS cleaning is used when disk utilization is low. But at higher disk utilizations, depending on usage pattern different cleaning methods are used like Hole-plugging.
 
 

In traditional cleaning, the live blocks in several empty segments are combined to produce a new segment, freeing the old partially empty segment for reuse. In many environments traditional cleaning works well. Normally this cleaning is done when system is idle. For the workloads examined, 97 % of the cleaning is done in background. Observation shows that cleaning cost is relatively low when disk utilization is less than 80 %. If segments updates show a higher degree of locality, then some segments will be emptier than other and will yield more free space when cleaned.

The problem with cleaning appears at higher disk utilizations, mainly for the workloads with many random updates and insufficient idle time. Because segments do not have a chance to empty before they must be cleaned. In order to free one segment cleaner have to process many nearly full segments. Each segment must be read, and all but the few holes rewritten into a new segment. This translates high write cost. In an extreme case, the entire disk might need to be cleaned in order to make a new single free segment.
 
 
 

In hole-plugging, partially empty segment are freed by writing their live blocks into the holes found in other segments. In order to produce one free segment worth of space, we need only read one segment and rewrite each of its live blocks. These writes are more expensive per block than writing complete segment because each block requires additional seek and rotational delay. However, despite the higher per block cost, at higher at disk utilization, hole-plugging is still better than traditional cleaning because we avoid processing so many segments.
 

    Segment size = 4 blocks
 
 

                       


 
 
 

                       

New Free Segment
 
 

 

Live block of segment


Figure 6: Hole-plugging: Above figure shows some live block from one partially full segment are transferred to other partially full segment. This is very useful here as disk utilization is high.

    But at lower disk utilization larger cost of writing individual blocks makes hole-plugging more expensive than traditional cleaning. So we have to use different cleaning algorithm depending upon disk utilization.




 
 
 

Chapter 6
 
 

Summary and Conclusion
 
 
 
 

This Seminar presents a new disk storage management technique called a log-structured file system. The basic principle is simple: collect large amounts of new data in a file cache in main memory, then write the data to disk in a single large I/O that can use all of the disk’s bandwidth.
 
 

Also presented was the cost-benefit segment selection policy that enables the log-structured file system to exploit locality to further improve performance. Finally, it provides a software technology to help magnetic disk storage track the improvements in CPU technology so future systems can achieve higher performance.
 
 

Even if one of the other approaches, such as file caches with battery backup, proves better than log-structured file systems, we think that the nature of I/O to disk is changing enough to justify new disk organizations. Logging approaches offer easier recovery and versioning and better locality than most of today’s file systems, so some of the logging techniques may be applicable as a supplement to other approaches.




 
 
 

Bibliography
 
 

[1] Christian Czezatke. DTFS: A Log-structured File System for Linux, August 1998

[2] Mendel Roseblum. The design and implementation of a Log-structured File System. Master’s thesis, University of California, Berkley, January 1992.

[3] John Ousterhout, Fred Douglis. Beating the I/O bottleneck: A case study of Log-structured File System. February 1994.

[4] Uresh Vahlia, "UNIX Internals", Prentice-Hall, New Jersey, 1996, Chapter 11, pp. 338-350.

[5] Jeana Neefe Matthews Drew Roselli Adam M. Castello Randolf Y. Wang and Thomas E. Anderson. Improving the performance of Log-structured File System with adaptive methods. In Sixteenth ACM Symposium on Operating System Principles. ACM, October 1997.

[6] Tage Stabell. Security and Log-structured File System. ACM Operating System review, Vol. 31, No. 2, pp. 9-10, January 1997.

[7] John Ousterhout and Mendel Rosenblum. Design and implementation of Log-structured File System. July 1991.
 
 

Web-Sites:

[1] http://www.complang.tuwien.ac.at/czezatke/

[2] http://collective.cpoint.net/lfs/

[3] http://metalab.unc.edu/linux-bin/
 
 


Mail your Suggestions, Comments, Doubts, Questions to author@abhishikt.8k.com
$Id: LFS.htm,v 1.4 2000/02/29 18:31:46 root Exp root $