An optimized approach for storing and accessing small files on cloud storage

B. Dong, Q. Zheng, F. Tian, Kuo-Ming Chao, R. Ma, Rachid Anane

    Research output: Contribution to journal (Article)

    90 Citations (Scopus)

    Abstract

    Hadoop distributed file system (HDFS) is widely adopted to support Internet services. Unfortunately, native HDFS does not perform well with large numbers of small files, a problem that has attracted significant attention. This paper first analyzes the causes of the small file problem in HDFS: (1) large numbers of small files impose a heavy burden on the NameNode of HDFS; (2) correlations between small files are not considered for data placement; and (3) no optimization mechanism, such as prefetching, is provided to improve I/O performance. Second, in the context of HDFS, the cut-off point between large and small files is determined through experimentation, which helps answer ‘how small is small’. Third, according to file correlation features, files are classified into three types: structurally-related files, logically-related files, and independent files. Finally, based on the above three steps, an optimized approach is designed to improve the storage and access efficiencies of small files on HDFS. A file merging and prefetching scheme is applied to structurally-related small files, while a file grouping and prefetching scheme is used to manage logically-related small files. Experimental results demonstrate that the proposed schemes effectively improve the storage and access efficiencies of small files, compared with native HDFS and a Hadoop file archiving facility.
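
    The general idea of merging small files and prefetching related ones can be illustrated with a minimal in-memory sketch. This is not the paper's implementation, whose details the abstract does not give: the names `merge_files` and `MergedReader`, the `prefetch` parameter, and the policy of prefetching the files that follow the requested one are all assumptions made for illustration, and no HDFS API is used.

```python
from collections import OrderedDict


def merge_files(files):
    """Concatenate small files into one blob and build a local index.

    `files` maps file name -> bytes. Returns (blob, index), where index
    maps each name to its (offset, length) inside the blob.
    Hypothetical helper: a single merged object means a single metadata
    entry instead of one entry per small file.
    """
    blob = bytearray()
    index = OrderedDict()
    for name, data in files.items():
        index[name] = (len(blob), len(data))
        blob.extend(data)
    return bytes(blob), index


class MergedReader:
    """Reads files out of a merged blob and prefetches the next neighbours.

    Assumption: files stored next to each other are likely to be read
    together, so a cache miss also loads the following `prefetch` files.
    """

    def __init__(self, blob, index, prefetch=2):
        self.blob = blob
        self.index = index
        self.names = list(index)   # storage order of the merged files
        self.prefetch = prefetch
        self.cache = {}
        self.hits = 0              # reads served from the prefetch cache

    def _load(self, name):
        offset, length = self.index[name]
        return self.blob[offset:offset + length]

    def read(self, name):
        if name in self.cache:
            self.hits += 1
            return self.cache.pop(name)
        data = self._load(name)
        pos = self.names.index(name)
        # Prefetch the files stored immediately after the requested one.
        for neighbour in self.names[pos + 1:pos + 1 + self.prefetch]:
            self.cache[neighbour] = self._load(neighbour)
        return data
```

    In this sketch, reading the first file of a merged set places its successors in the cache, so a sequential reader pays one lookup for several files. A real deployment would keep the blob on HDFS and would need a policy for cache eviction, which is omitted here.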

    Original language: English
    Pages (from-to): 1847-1862
    Number of pages: 16
    Journal: Journal of Network and Computer Applications
    Volume: 35
    Issue number: 6
    Early online date: 24 Jul 2012
    Publication status: Published - Nov 2012

    Keywords

    • cloud storage
    • small file storage
    • storage efficiency
    • prefetching
    • access efficiency
