Abstract
In Smart Grid applications, as the number of deployed electric smart meters increases, massive amounts of valuable meter data are generated and collected every day. To enable reliable data collection and fast business decisions, high-throughput storage and high-performance analysis of massive meter data become crucial for grid companies. Because Hadoop and Hive offer high efficiency, fault tolerance, and good price-performance, they are frequently deployed as the underlying platform for big data processing. However, in real business use cases, these data analysis applications typically involve multidimensional range queries (MDRQ) as well as batch reading and statistics on the meter data. While Hive performs well at complex batch reading and analysis, it lacks efficient indexing techniques for MDRQ.
In this paper, we propose DGFIndex, an index structure for Hive that efficiently supports MDRQ for massive meter data. DGFIndex divides the data space into cubes using the grid file technique. Unlike the existing indexes in Hive, which store all combinations of multiple dimensions, DGFIndex stores only the information of cubes. This leads to a smaller index size and faster query processing. Furthermore, by pre-computing user-defined aggregations for each cube, DGFIndex only needs to access the boundary region to answer an aggregation query. Our comprehensive experiments show that DGFIndex saves significant disk space in comparison with the existing indexes in Hive. Query performance with DGFIndex is 2-50 times faster than existing indexes in Hive and HadoopDB for aggregation queries, 2-5 times faster than both for non-aggregation queries, and 2-75 times faster than scanning the whole table across different query selectivities.
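The idea of pre-aggregated cubes plus a scanned boundary region can be illustrated with a small in-memory sketch. This is not the paper's implementation, which targets Hive tables on Hadoop. The class name `GridAggIndex`, its methods, the fixed cell widths, and the restriction to integer coordinates and SUM are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product


class GridAggIndex:
    """Toy in-memory grid index with a pre-computed SUM per cell (hypothetical)."""

    def __init__(self, mins, widths):
        self.mins = mins                    # lower bound of each dimension
        self.widths = widths                # cell width along each dimension
        self.cell_sum = defaultdict(float)  # cell key -> pre-computed SUM
        self.cell_rows = defaultdict(list)  # cell key -> rows stored in that cell

    def _cell(self, point):
        # Map a point to the integer coordinates of the cell that contains it.
        return tuple((p - m) // w for p, m, w in zip(point, self.mins, self.widths))

    def insert(self, point, value):
        key = self._cell(point)
        self.cell_sum[key] += value
        self.cell_rows[key].append((point, value))

    def range_sum(self, lows, highs):
        """SUM of values with lows[i] <= point[i] < highs[i] in every dimension.

        Returns (total, scanned), where scanned counts the rows read one by one.
        Integer coordinates are assumed, so the last covered cell contains high - 1.
        """
        first = self._cell(lows)
        last = self._cell([h - 1 for h in highs])
        total, scanned = 0.0, 0
        for key in product(*(range(f, l + 1) for f, l in zip(first, last))):
            if key not in self.cell_sum:
                continue  # empty cell, nothing stored
            cell_lo = [m + k * w for k, m, w in zip(key, self.mins, self.widths)]
            cell_hi = [c + w for c, w in zip(cell_lo, self.widths)]
            inner = all(
                lo <= cl and ch <= hi
                for lo, hi, cl, ch in zip(lows, highs, cell_lo, cell_hi)
            )
            if inner:
                # Cell lies fully inside the query: use the pre-computed aggregate.
                total += self.cell_sum[key]
            else:
                # Boundary cell: only here are individual rows read and filtered.
                for point, value in self.cell_rows[key]:
                    scanned += 1
                    if all(lo <= p < hi for lo, p, hi in zip(lows, point, highs)):
                        total += value
        return total, scanned
```

In this sketch, a query whose range covers many whole cells reads individual rows only from the cells that the range cuts through, which is the effect the abstract attributes to pre-computed cube aggregations.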
Index Terms
- DGFIndex for smart grid: enhancing hive with a cost-effective multidimensional range index