research-article

DGFIndex for smart grid: enhancing hive with a cost-effective multidimensional range index

Published: 01 August 2014

Abstract

In Smart Grid applications, as the number of deployed electric smart meters increases, massive amounts of valuable meter data are generated and collected every day. To enable reliable data collection and fast business decisions, high-throughput storage and high-performance analysis of massive meter data become crucial for grid companies. Because Hadoop and Hive offer high efficiency, fault tolerance, and good price-performance, they are frequently deployed as the underlying platform for big data processing. However, in real business use cases, these data analysis applications typically involve multidimensional range queries (MDRQ) as well as batch reading and statistics on the meter data. While Hive performs well at complex batch reading and analysis, it lacks efficient indexing techniques for MDRQ.

In this paper, we propose DGFIndex, an index structure for Hive that efficiently supports MDRQ for massive meter data. DGFIndex divides the data space into cubes using the grid file technique. Unlike the existing indexes in Hive, which store all combinations of multiple dimensions, DGFIndex stores only the information of cubes. This leads to a smaller index size and faster query processing. Furthermore, by pre-computing user-defined aggregations for each cube, DGFIndex only needs to access the boundary region for an aggregation query. Our comprehensive experiments show that DGFIndex saves significant disk space in comparison with the existing indexes in Hive, and that query performance with DGFIndex is 2-50 times faster than existing indexes in Hive and HadoopDB for aggregation queries, 2-5 times faster than both for non-aggregation queries, and 2-75 times faster than scanning the whole table, across different query selectivities.
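The idea of combining grid cells with pre-computed per-cell aggregates can be illustrated with a small in-memory sketch. This is not the authors' implementation, which targets Hive on Hadoop; it is a toy Python model that assumes uniform cell sizes, a single SUM aggregate, and half-open query ranges. The names `GridIndex`, `insert`, and `range_sum` are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from math import floor


class GridIndex:
    """Toy grid index: fixed-size cells plus a pre-computed SUM per cell.

    Hypothetical illustration only; not the DGFIndex implementation.
    """

    def __init__(self, origin, cell_size):
        self.origin = origin          # lower corner of the data space
        self.cell_size = cell_size    # cell width in each dimension
        self.cells = defaultdict(list)    # cell key -> [(point, value), ...]
        self.cell_sum = defaultdict(int)  # cell key -> pre-computed SUM

    def _key(self, point):
        # Map a point to the integer coordinates of the cell containing it.
        return tuple(
            floor((p - o) / s)
            for p, o, s in zip(point, self.origin, self.cell_size)
        )

    def insert(self, point, value):
        key = self._key(point)
        self.cells[key].append((point, value))
        self.cell_sum[key] += value   # keep the per-cell aggregate current

    def range_sum(self, low, high):
        """SUM over points with low[d] <= point[d] < high[d] in every dimension.

        Returns (total, scanned), where scanned counts the records that had
        to be read individually because their cell straddles the query edge.
        """
        total, scanned = 0, 0
        for key, records in self.cells.items():
            cell_low = [o + k * s for k, o, s in zip(key, self.origin, self.cell_size)]
            cell_high = [cl + s for cl, s in zip(cell_low, self.cell_size)]
            # Cell lies entirely outside the query box: skip it.
            if any(ch <= l or cl >= h
                   for cl, ch, l, h in zip(cell_low, cell_high, low, high)):
                continue
            # Cell lies entirely inside the query box: use the stored aggregate.
            if all(l <= cl and ch <= h
                   for cl, ch, l, h in zip(cell_low, cell_high, low, high)):
                total += self.cell_sum[key]
                continue
            # Boundary cell: read its records and filter them one by one.
            for point, value in records:
                scanned += 1
                if all(l <= p < h for p, l, h in zip(point, low, high)):
                    total += value
        return total, scanned


if __name__ == "__main__":
    idx = GridIndex(origin=(0, 0), cell_size=(10, 10))
    for point, value in [((5, 5), 1), ((15, 5), 2), ((15, 15), 4),
                         ((25, 25), 8), ((12, 12), 16), ((15, 19), 32)]:
        idx.insert(point, value)
    print(idx.range_sum(low=(10, 0), high=(22, 18)))
```

In this sketch, cells fully covered by the query contribute their stored sum without any record being read, and only cells that straddle the query edge are scanned. For brevity the sketch loops over every non-empty cell; a practical index would instead enumerate only the cell keys that overlap the query range.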



Published in: Proceedings of the VLDB Endowment, Volume 7, Issue 13 (August 2014), 466 pages
ISSN: 2150-8097
Publisher: VLDB Endowment
