research-article

Open Access

Grouping Time Series for Efficient Columnar Storage

Authors:
Chenguang Fang

Tsinghua University, China, Beijing, China

Tsinghua University, China, Beijing, China

0000-0002-7261-7478
View Profile

,
Shaoxu Song

Tsinghua University, China, Beijing, China

Tsinghua University, China, Beijing, China

0000-0002-9503-2755
View Profile

,
Haoquan Guan

Tsinghua University, China, Beijing, China

Tsinghua University, China, Beijing, China

0009-0007-5412-4489
View Profile

,
Xiangdong Huang

Tsinghua University, China, Beijing, China

Tsinghua University, China, Beijing, China

0000-0002-6868-4045
View Profile

,
Chen Wang

Tsinghua University, China, Beijing, China

Tsinghua University, China, Beijing, China

0000-0003-1698-8992
View Profile

,
Jianmin Wang

Tsinghua University, China, Beijing, China

Tsinghua University, China, Beijing, China

0000-0001-6841-7943
View Profile

Authors Info & Claims

Proceedings of the ACM on Management of Data Volume 1 Issue 1Article No.: 23pp 1–26https://doi.org/10.1145/3588703

Published:30 May 2023Publication History

Proceedings of the ACM on Management of Data

Abstract

Columnar storage is now an industry standard design in most open-source or commercial time series database products, making them HTAP systems. The time column of a time series serves as the key for identifying the other value column, namely single-column storage scheme. When multiple time series share a similar set of timestamps, very likely in a module of multiple sensors, it is natural to group them together, i.e., one time column identifies multiple value columns in a single-group storage scheme. While multiple value columns sharing the same time column reduce the space cost of repeating timestamps, it may introduce extra space cost for recording null values. The reason is that time series may not be exactly aligned on each timestamp, owing to missing values, distinct data collection frequencies, unsynchronized clocks and so on. The columngroups storage scheme is thus to divide columns into multiple groups, within which the value columns share the same time column. Unfortunately, the problem of finding the optimal column groups for the minimum space cost is highly challenging, NP-hard according to our analysis. Thereby, we propose a heuristic algorithm for automatically grouping time series for efficient columnar storage. The column groups storage has been deployed in Apache IoTDB, an open-source time series database. The extensive performance analysis, over real-world data from our industrial partners, demonstrates that the proposed column groups achieve near optimal storage, more concise than the storage of single-column or single-group schemes. Interestingly, both the flushing and querying time costs of column groups are comparable to those of single-column or singlegroup, i.e., without incurring extra time cost.

Supplemental Material

23sigmod-alignment.mp4

mp4

20.5 MB

Download

References

Apache HBase. https://hbase.apache.org/.Google Scholar
Apache HBase Implementation. https://github.com/iotdbColumnGroup/HBase.Google Scholar
Apache IoTDB. http://iotdb.apache.org.Google Scholar
Appendix. https://iotdbcolumngroup.github.io/iotdbColumnGroup/appendix.pdf.Google Scholar
Code and Data. https://github.com/iotdbColumnGroup/iotdbColumnGroup.Google Scholar
InfluxDB. https://www.influxdata.com/.Google Scholar
OpenTSDB. http://opentsdb.net/.Google Scholar
Prometheus. https://prometheus.io.Google Scholar
Source Code. https://github.com/apache/iotdb/tree/research/auto-aligned.Google Scholar
TDengine. https://github.com/taosdata/TDengine/.Google Scholar
TDengine Implementation. https://github.com/iotdbColumnGroup/TDengine.Google Scholar
TimescaleDB. https://www.timescale.com.Google Scholar
Zomato Dataset. https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants.Google Scholar
Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient Exact Set-Similarity Joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006. ACM, 918--929. http://dl.acm.org/citation.cfm?id=1164206Google ScholarDigital Library
Esther M Arkin and Refael Hassin. 1998. On local search for weighted k-set packing. Mathematics of Operations Research 23, 3 (1998), 640--648.Google ScholarCross Ref
Haoqiong Bian, Ying Yan, Wenbo Tao, Liang Jeff Chen, Yueguo Chen, Xiaoyong Du, and Thomas Moscibroda. 2017. Wide Table Layout Optimization based on Column Ordering and Duplication. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017. ACM, 299--314. https://doi.org/10.1145/3035918.3035930Google ScholarDigital Library
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!). In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6--8, Seattle, WA, USA, Brian N. Bershad and Jeffrey C. Mogul (Eds.). USENIX Association, 205--218. http://www.usenix.org/events/osdi06/tech/chang.htmlGoogle Scholar
Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8 April 2006, Atlanta, GA, USA. IEEE Computer Society, 5. https://doi.org/10.1109/ICDE.2006.9Google ScholarDigital Library
Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and Sandeep Tata. 2011. Column-Oriented Storage Techniques for MapReduce. Proc. VLDB Endow. 4, 7 (2011), 419--429. https://doi.org/10.14778/1988776.1988778Google ScholarDigital Library
Pranjal Gupta, Amine Mhedhbi, and Semih Salihoglu. 2021. Columnar Storage and List-based Processing for Graph Database Management Systems. Proc. VLDB Endow. 14, 11 (2021), 2491--2504. http://www.vldb.org/pvldb/vol14/p2491-gupta.pdfGoogle ScholarDigital Library
Muon Ha and Yulia A. Shichkina. 2022. Translating a Distributed Relational Database to a Document Database. Data Sci. Eng. 7, 2 (2022), 136--155. https://doi.org/10.1007/s41019-022-00181--9Google Scholar
Shuai Han, Mingxia Liu, and Jian-Zhong Li. 2022. Efficient Partitioning Method for Optimizing the Compression on Array Data. J. Comput. Sci. Technol. 37, 5 (2022), 1049--1067. https://doi.org/10.1007/s11390-022--2371--7Google ScholarCross Ref
Donghe Kang, Ruochen Jiang, and Spyros Blanas. 2021. Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021. 898--911. https://doi.org/10.1145/3448016.3457547Google ScholarDigital Library
Per-Åke Larson, Eric N. Hanson, and Susan L. Price. 2012. Columnar Storage in SQL Server 2012. IEEE Data Eng. Bull. 35, 1 (2012), 15--20. http://sites.computer.org/debull/A12mar/apollo.pdfGoogle Scholar
Yinan Mei, Shaoxu Song, Yunsu Lee, Jungho Park, Soo-Hyung Kim, and Sungmin Yi. 2020. Representing Temporal Attributes for Schema Matching. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23--27, 2020. ACM, 709--719. https://doi.org/10.1145/3394486.3403115Google ScholarDigital Library
Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. 1997. Potential Benefits of Delta Encoding and Data Compression for HTTP. In Proceedings of the ACM SIGCOMM 1997 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, September 14--18, 1997, Cannes, France. ACM, 181--194. https://doi.org/10.1145/263105.263162Google ScholarDigital Library
M. Tamer Özsu. 2016. A survey of RDF data management systems. Frontiers Comput. Sci. 10, 3 (2016), 418--432. https://doi.org/10.1007/s11704-016--5554-yGoogle ScholarCross Ref
Tuomas Pelkonen, Scott Franklin, Paul Cavallaro, Qi Huang, Justin Meza, Justin Teller, and Kaushik Veeraraghavan. 2015. Gorilla: A Fast, Scalable, In-Memory Time Series Database. Proc. VLDB Endow. 8, 12 (2015), 1816--1827. https://doi.org/10.14778/2824032.2824078Google ScholarDigital Library
Horst Samulowitz, Chandra Reddy, Ashish Sabharwal, and Meinolf Sellmann. 2013. Snappy: A Simple Algorithm Portfolio. In Theory and Applications of Satisfiability Testing - SAT 2013 - 16th International Conference, Helsinki, Finland, July 8--12, 2013. Proceedings (Lecture Notes in Computer Science, Vol. 7962), Matti Järvisalo and Allen Van Gelder (Eds.). Springer, 422--428. https://doi.org/10.1007/978--3--642--39071--5_33Google Scholar
Dilbag Singh, Yavuz Selim Taspinar, Ramazan Kursun, Ilkay Cinar, Murat Koklu, Ilker Ali Ozkan, and Heung-No Lee. 2022. Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models. Electronics 11, 7 (2022), 981.Google ScholarCross Ref
Shaoxu Song, Yue Cao, and Jianmin Wang. 2016. Cleaning Timestamps with Temporal Constraints. Proc. VLDB Endow. 9, 10 (2016), 708--719. https://doi.org/10.14778/2977797.2977798Google ScholarDigital Library
Shaoxu Song and Lei Chen. 2010. Efficient set-correlation operator inside databases. In Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26--30, 2010. ACM, 139--148. https://doi.org/10.1145/1871437.1871459Google ScholarDigital Library
Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin Mcgrail, Peng Wang, Diaohan Luo, Jun Yuan, Jianmin Wang, and Jiaguang Sun. 2020. Apache IoTDB: Time-series database for Internet of Things. Proc. VLDB Endow. 13, 12 (2020), 2901--2904. https://doi.org/10.14778/3415478.3415504Google ScholarDigital Library
Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20--24, 2012. ACM, 85--96. https://doi.org/10.1145/2213836.2213847Google ScholarDigital Library
Zhiqi Wang, Jin Xue, and Zili Shao. 2021. Heracles: An Efficient Storage Model And Data Flushing For Performance Monitoring Timeseries. Proc. VLDB Endow. 14, 6 (2021), 1080--1092. https://doi.org/10.14778/3447689.3447710Google ScholarDigital Library
Jiaye Wu, Peng Wang, Ningting Pan, Chen Wang, Wei Wang, and Jianmin Wang. 2019. KV-Match: A Subsequence Matching Approach Supporting Normalization and Time Warping. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8--11, 2019. IEEE, 866--877. https://doi.org/10.1109/ICDE.2019.00082Google Scholar

Index Terms

Grouping Time Series for Efficient Columnar Storage
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Stream management

Recommendations

A flash-based decomposition storage model
DASFAA'12: Proceedings of the 17th international conference on Database Systems for Advanced Applications

The traditional HDD-based columnar storage is an important technology to improve the performance of query-intensive database. However, some features of HDD weaken the advantages of columnar storage. In this paper, we study the advantages of SSD over HDD ...
Read More
A Comparative Study of Row and Column Storage for Time Series Data
Spatial Data and Intelligence
Abstract
Over the past few decades, researchers have done massive research on data storage structures of relational databases. The existing research mainly focuses on the analysis and optimization of row and column storage structures in relational ...
Read More
Automatic Array Transformation to Columnar Storage at Run Time
MPLR '22: Proceedings of the 19th International Conference on Managed Programming Languages and Runtimes

Today’s huge memories make it possible to store and process large data structures in memory instead of in a database. Hence, accesses to this data should be optimized, which is normally relegated either to the runtimes and compilers or is left to the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the ACM on Management of Data Volume 1, Issue 1
PACMMOD
May 2023
2807 pages
EISSN:2836-6573
DOI:10.1145/3603164
Editor:
Divyakant Agrawal
UC Santa Barbara, United States
Issue’s Table of Contents
Copyright © 2023 Owner/Author
This work is licensed under a Creative Commons Attribution International 4.0 License.
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 May 2023
Published in pacmmod Volume 1, Issue 1
Author Tags
columnar storage
grouping
time series
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1
  Total Citations
  View Citations
- 832
  Total Downloads
- Downloads (Last 12 months)832
- Downloads (Last 6 weeks)106
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Grouping Time Series for Efficient Columnar Storage

Proceedings of the ACM on Management of Data

Abstract

Supplemental Material

References

Cited By

Index Terms

Recommendations

A flash-based decomposition storage model

A Comparative Study of Row and Column Storage for Time Series Data

Automatic Array Transformation to Columnar Storage at Run Time

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Grouping Time Series for Efficient Columnar Storage

Proceedings of the ACM on Management of Data

Abstract

Supplemental Material

References

Cited By

Index Terms

Recommendations

A flash-based decomposition storage model

A Comparative Study of Row and Column Storage for Time Series Data

Automatic Array Transformation to Columnar Storage at Run Time

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media