Abstract

Hadoop is one of the most popular big-data analytics platforms. It often relies on hard disk drives because the data volumes it stores exceed the capacity of solid-state drives. Unlike other data-intensive applications, such as database management systems, big-data processing jobs frequently issue extensive sequential I/O requests. Previously proposed methods for improving sequential I/O performance modified the block usage bitmap of the Ext2/3 filesystem so that the faster disk zones, which are the outer zones of each hard disk drive, are actively used. However, these methods do not support Ext4, the current version of the Ext filesystem family. In this paper, we discuss a method for improving the sequential I/O performance of the Ext4 filesystem. First, we evaluate the sequential file access throughput of the Ext3, Ext4, and XFS filesystems. We point out that Ext4 does not actively reuse the area freed by deleting existing files, which causes file access performance to decline. Second, we propose a method for improving the sequential file access performance of Ext4. The improved Ext4 actively utilizes the faster zones of storage devices by controlling where files are placed. Third, we evaluate the proposed filesystem and show that it outperforms existing filesystems. In the case of TeraSort, Hadoop with the proposed Ext4 filesystem outperforms Hadoop with the original Ext4 filesystem by as much as 30.1%.
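The placement effect described above can be illustrated with a toy model. The sketch below is not the Ext4 block allocator and not the method proposed in the paper. It assumes a hypothetical 100-block disk on which lower block numbers lie in the faster outer zones, and compares two hypothetical policies: one that continues searching from the end of the last allocation, and one that always searches from block 0 and therefore reuses space freed in the outer zone. The names `allocate`, `free`, and `place_new_file` are illustrative only.

```python
def allocate(bitmap, length, start=0):
    """Mark the first free run of `length` blocks at or after `start` as used.

    Returns the first block of the run, or None if no such run exists.
    """
    run = 0
    for i in range(start, len(bitmap)):
        run = run + 1 if not bitmap[i] else 0
        if run == length:
            first = i - length + 1
            for j in range(first, i + 1):
                bitmap[j] = True
            return first
    return None


def free(bitmap, first, length):
    """Mark `length` blocks starting at `first` as free again."""
    for j in range(first, first + length):
        bitmap[j] = False


def place_new_file(lowest_first):
    """Write four 10-block files, delete the first two, then write a fifth.

    Returns the starting block of the fifth file. Assumption: block 0 is the
    outermost (fastest) position on the toy disk.
    """
    bitmap = [False] * 100
    goal = 0  # where the next search starts under the "continue" policy
    starts = []
    for _ in range(4):
        s = allocate(bitmap, 10, goal)
        starts.append(s)
        goal = s + 10
    free(bitmap, starts[0], 10)
    free(bitmap, starts[1], 10)
    # Toy policy switch: search from block 0, or keep going from `goal`.
    return allocate(bitmap, 10, 0 if lowest_first else goal)


if __name__ == "__main__":
    print("continue from last allocation:", place_new_file(False))
    print("search from block 0:", place_new_file(True))
```

In this toy model, the policy that continues from the last allocation places the new file past all earlier files and leaves the freed outer blocks unused, while the policy that searches from block 0 places it in the freed outer region.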
