Abstract— Virtualization platforms deployed throughout the IT infrastructure are an important class of Green IT services in cloud data centers. Hadoop clusters built on on-demand virtual infrastructure provide an effective implementation of MapReduce for developing data-intensive applications in cloud computing. Deploying Hadoop clusters across large numbers of data center virtual machines (VMs) can significantly increase productivity and reduce both energy and resource consumption. However, interference between VMs is complicated and grows with data size, and it degrades the performance of Map and Reduce tasks when Hadoop clusters run on virtual machines. In this paper, a Comprehensive Performance Rating (CPR) scheme is presented to probe the root causes of these problems, and solutions to VM interference are introduced that exploit data locality and the large set of configuration parameters. Unlike previous solutions that customize Hadoop's native job scheduler, the proposed CPR scheme uses Hadoop configuration metrics revealed through the Principal Component Analysis (PCA) method to guide performance tuning. Experimental results are reported on a 20-node virtual cluster. The performance predicted by the proposed CPR scheme is close to the measured execution time across different data sizes, cluster sizes, and map task ratios.
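The PCA-based selection of influential configuration metrics mentioned above can be sketched as follows. This is a minimal illustration, not the paper's actual method: the metric names and the synthetic run data are invented for the example, and metrics are simply ranked by the magnitude of their loadings on the first principal component.

```python
import numpy as np

# Hypothetical Hadoop configuration metrics observed across benchmark runs
# (names are illustrative assumptions, not taken from the paper).
metric_names = ["io.sort.mb", "mapred.reduce.tasks",
                "dfs.block.size", "map.tasks.ratio"]

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))  # 20 synthetic runs x 4 metrics

# Standardize each metric, then perform PCA via eigendecomposition
# of the covariance matrix.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
cov = np.cov(Xs, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)        # returned in ascending order
order = np.argsort(eigvals)[::-1]             # sort descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rank metrics by the absolute loading on the first principal component:
# metrics with large loadings account for most run-to-run variance and
# are natural first candidates for tuning.
loadings = np.abs(eigvecs[:, 0])
ranked = sorted(zip(metric_names, loadings), key=lambda t: -t[1])
for name, weight in ranked:
    print(f"{name}: {weight:.3f}")
```

In practice the rows of `X` would be real measurements (e.g., execution times or resource counters) collected from the virtual cluster, and more than one component may be retained when the first component explains too little variance.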
Index Terms— Hadoop, MapReduce, data locality, principal component analysis.
Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, and Ko-Chin Chang are with National Defense University, Taiwan (e-mail: superalf@gmail.com).
Cite: Fong-Hao Liu, Ya-Ruei Liou, Hsiang-Fu Lo, Ko-Chin Chang, and Wei-Tsong Lee, "The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform," International Journal of Information and Electronics Engineering, vol. 4, no. 6, pp. 480-484, 2014.