Hadoop has emerged as a successful framework for large-scale data-intensive computing applications. However, there is no research on performance models for the Hadoop Distributed File System (HDFS). Because HDFS is complex and its performance depends on multiple impact factors that are difficult to model, establishing HDFS performance models directly from these impact factors is very complicated. In this paper, the relationship between file size and HDFS Write/Read (W/R for short) throughput, i.e., the average flow rate of an HDFS W/R operation, is studied to build HDFS performance models from a systematic view. Based on data measured in specially designed experiments (in which HDFS W/R operations can be viewed as single-input single-output systems), a system identification-based approach is applied to construct performance models for HDFS W/R operations under different conditions. Furthermore, dynamic characteristics metrics for HDFS performance are defined. Based on the identified performance models and these metrics, the dynamic characteristics of HDFS W/R operations, such as steady state and overshoot, are studied, and the relationships between impact factors and dynamic characteristics are analyzed. The results of this analysis can guide the design and configuration of HDFS and Hadoop-based applications.
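To make the quantities named in the abstract concrete, the following Python sketch computes throughput as file size over elapsed time, then derives a steady-state value and a percent overshoot from a throughput-versus-file-size curve. The abstract does not state the paper's metric definitions or model structure, so the conventional control-theory definitions used here (steady state as the mean of the last few samples, overshoot as the relative excess of the peak over that value) are assumptions, as are the function names and the sample measurements.

```python
from statistics import mean


def throughput_mb_s(file_size_mb, elapsed_s):
    """Average flow rate of one W/R operation: data moved divided by time taken."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return file_size_mb / elapsed_s


def steady_state(values, tail=3):
    """Assumed definition: mean of the last `tail` samples of the response."""
    if not 0 < tail <= len(values):
        raise ValueError("tail must be between 1 and len(values)")
    return mean(values[-tail:])


def percent_overshoot(values, tail=3):
    """Assumed definition: relative excess of the peak over the steady state, in percent."""
    final = steady_state(values, tail)
    return max(0.0, (max(values) - final) / final * 100.0)


# Hypothetical measurements, not taken from the paper:
# (file size in MB, elapsed seconds) for one type of operation.
samples = [
    (16, 1.0), (32, 1.0), (64, 1.0), (128, 1.6),
    (256, 4.0), (512, 8.0), (1024, 16.0),
]

# File size is the single input, throughput the single output.
curve = [throughput_mb_s(size, elapsed) for size, elapsed in samples]

print("throughput curve (MB/s):", curve)
print("steady state (MB/s):", steady_state(curve))
print("overshoot (%):", percent_overshoot(curve))
```

In this illustrative data set the throughput rises with file size, peaks, and then settles, which is the kind of behavior that steady-state and overshoot metrics are meant to summarize. A fitted model from system identification would replace the raw `curve` with the model's predicted response.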
Dong, B., Zheng, Q., Tian, F., Chao, K-M., Godwin, N., Ma, T., & Xu, H. (2014). Performance models and dynamic characteristics analysis for HDFS write and read operations: A systematic view. Journal of Systems and Software, 93, 132-151. https://doi.org/10.1016/j.jss.2014.02.038