DGFIndex for smart grid: enhancing hive with a cost-effective multidimensional range index

Authors:
Yue Liu (Chinese Academy of Sciences, China and University of Chinese Academy of Sciences, China),
Songlin Hu (Chinese Academy of Sciences, China),
Tilmann Rabl (Middleware Systems Research Group, University of Toronto, Canada),
Wantao Liu (Chinese Academy of Sciences, China),
Hans-Arno Jacobsen (Middleware Systems Research Group, University of Toronto, Canada),
Kaifeng Wu (State Grid Electricity Science Research Institute, China),
Jian Chen (Zhejiang Electric Power Corporation, China),
Jintao Li (Chinese Academy of Sciences, China)

Proceedings of the VLDB Endowment, Volume 7, Issue 13, pp. 1496–1507. https://doi.org/10.14778/2733004.2733021

Published: 01 August 2014

Abstract

In Smart Grid applications, as the number of deployed electric smart meters increases, massive amounts of valuable meter data are generated and collected every day. To enable reliable data collection and fast business decisions, high-throughput storage and high-performance analysis of massive meter data become crucial for grid companies. Given their high efficiency, fault tolerance, and price-performance, Hadoop and Hive are frequently deployed as the underlying platform for big data processing. However, in real business use cases, these data analysis applications typically involve multidimensional range queries (MDRQ) as well as batch reading and statistics on the meter data. While Hive performs well at complex batch reading and analysis, it lacks efficient indexing techniques for MDRQ.

In this paper, we propose DGFIndex, an index structure for Hive that efficiently supports MDRQ for massive meter data. DGFIndex divides the data space into cubes using the grid file technique. Unlike the existing indexes in Hive, which store all combinations of multiple dimensions, DGFIndex stores only the information of cubes. This leads to a smaller index size and faster query processing. Furthermore, by pre-computing user-defined aggregations for each cube, DGFIndex needs to access only the boundary region for an aggregation query. Our comprehensive experiments show that DGFIndex saves significant disk space in comparison with the existing indexes in Hive, and that query performance with DGFIndex is 2-50 times faster than the existing indexes in Hive and HadoopDB for aggregation queries, 2-5 times faster than both for non-aggregation queries, and 2-75 times faster than scanning the whole table across different query selectivities.
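The boundary-region idea can be illustrated with a minimal in-memory sketch. It assumes a two-dimensional data space, fixed-size grid cells, and SUM as the pre-computed aggregation. The class name `GridSumIndex`, its parameters, and the synthetic meter records are hypothetical and chosen only for illustration; this is not the DGFIndex implementation, which operates on Hive tables rather than Python lists.

```python
from collections import defaultdict


class GridSumIndex:
    """Toy grid index that keeps a pre-computed SUM per cell (illustrative only)."""

    def __init__(self, cell_width, cell_height):
        self.cell_width = cell_width
        self.cell_height = cell_height
        self.cell_sum = defaultdict(float)   # cell -> pre-computed SUM(value)
        self.cell_rows = defaultdict(list)   # cell -> raw records in that cell

    def _cell_of(self, x, y):
        # Map a point to the coordinates of the grid cell that contains it.
        return (int(x // self.cell_width), int(y // self.cell_height))

    def insert(self, x, y, value):
        cell = self._cell_of(x, y)
        self.cell_sum[cell] += value
        self.cell_rows[cell].append((x, y, value))

    def range_sum(self, x_lo, x_hi, y_lo, y_hi):
        """Return (SUM(value), rows_scanned) for x_lo <= x < x_hi, y_lo <= y < y_hi."""
        total, rows_scanned = 0.0, 0
        cx_lo, cy_lo = self._cell_of(x_lo, y_lo)
        cx_hi, cy_hi = self._cell_of(x_hi, y_hi)
        for cx in range(cx_lo, cx_hi + 1):
            for cy in range(cy_lo, cy_hi + 1):
                cell = (cx, cy)
                if cell not in self.cell_sum:
                    continue  # empty cell, nothing to add
                left, bottom = cx * self.cell_width, cy * self.cell_height
                fully_inside = (
                    x_lo <= left and left + self.cell_width <= x_hi
                    and y_lo <= bottom and bottom + self.cell_height <= y_hi
                )
                if fully_inside:
                    # Inner cell: the pre-computed aggregate is enough.
                    total += self.cell_sum[cell]
                else:
                    # Boundary cell: scan its records and filter them.
                    for x, y, value in self.cell_rows[cell]:
                        rows_scanned += 1
                        if x_lo <= x < x_hi and y_lo <= y < y_hi:
                            total += value
        return total, rows_scanned


# Synthetic data: 100 meters x 24 hourly readings (hypothetical values).
records = [(m, h, float(m % 7 + h)) for m in range(100) for h in range(24)]
index = GridSumIndex(cell_width=10, cell_height=6)
for meter, hour, reading in records:
    index.insert(meter, hour, reading)

# Sum readings for meters 5..94 over the whole day.
total, rows_scanned = index.range_sum(5, 95, 0, 24)
```

In this sketch only the cells cut by the query rectangle are scanned; cells that lie entirely inside it contribute their stored sum without any record access, which is the effect the abstract attributes to pre-computed per-cube aggregations.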

Index Terms

DGFIndex for smart grid: enhancing hive with a cost-effective multidimensional range index
1. Information systems
  1. Information retrieval
    1. Document representation
    2. Search engine architectures and scalability
      1. Search engine indexing

Index terms have been assigned to the content through auto-classification.

Published in
Proceedings of the VLDB Endowment, Volume 7, Issue 13
August 2014
466 pages
ISSN: 2150-8097
Editors: H. V. Jagadish (University of Michigan), Aoying Zhou (East Normal University, China)
Publisher: VLDB Endowment
Publication History: Published 1 August 2014
Qualifiers: research-article
