skip to main content
research-article
Open Access

Grouping Time Series for Efficient Columnar Storage

Published:30 May 2023Publication History
Skip Abstract Section

Abstract

Columnar storage is now an industry standard design in most open-source or commercial time series database products, making them HTAP systems. The time column of a time series serves as the key for identifying the other value column, namely single-column storage scheme. When multiple time series share a similar set of timestamps, very likely in a module of multiple sensors, it is natural to group them together, i.e., one time column identifies multiple value columns in a single-group storage scheme. While multiple value columns sharing the same time column reduce the space cost of repeating timestamps, it may introduce extra space cost for recording null values. The reason is that time series may not be exactly aligned on each timestamp, owing to missing values, distinct data collection frequencies, unsynchronized clocks and so on. The columngroups storage scheme is thus to divide columns into multiple groups, within which the value columns share the same time column. Unfortunately, the problem of finding the optimal column groups for the minimum space cost is highly challenging, NP-hard according to our analysis. Thereby, we propose a heuristic algorithm for automatically grouping time series for efficient columnar storage. The column groups storage has been deployed in Apache IoTDB, an open-source time series database. The extensive performance analysis, over real-world data from our industrial partners, demonstrates that the proposed column groups achieve near optimal storage, more concise than the storage of single-column or single-group schemes. Interestingly, both the flushing and querying time costs of column groups are comparable to those of single-column or singlegroup, i.e., without incurring extra time cost.

Skip Supplemental Material Section

Supplemental Material

23sigmod-alignment.mp4

mp4

20.5 MB

References

  1. Apache HBase. https://hbase.apache.org/.Google ScholarGoogle Scholar
  2. Apache HBase Implementation. https://github.com/iotdbColumnGroup/HBase.Google ScholarGoogle Scholar
  3. Apache IoTDB. http://iotdb.apache.org.Google ScholarGoogle Scholar
  4. Appendix. https://iotdbcolumngroup.github.io/iotdbColumnGroup/appendix.pdf.Google ScholarGoogle Scholar
  5. Code and Data. https://github.com/iotdbColumnGroup/iotdbColumnGroup.Google ScholarGoogle Scholar
  6. InfluxDB. https://www.influxdata.com/.Google ScholarGoogle Scholar
  7. OpenTSDB. http://opentsdb.net/.Google ScholarGoogle Scholar
  8. Prometheus. https://prometheus.io.Google ScholarGoogle Scholar
  9. Source Code. https://github.com/apache/iotdb/tree/research/auto-aligned.Google ScholarGoogle Scholar
  10. TDengine. https://github.com/taosdata/TDengine/.Google ScholarGoogle Scholar
  11. TDengine Implementation. https://github.com/iotdbColumnGroup/TDengine.Google ScholarGoogle Scholar
  12. TimescaleDB. https://www.timescale.com.Google ScholarGoogle Scholar
  13. Zomato Dataset. https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants.Google ScholarGoogle Scholar
  14. Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient Exact Set-Similarity Joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006. ACM, 918--929. http://dl.acm.org/citation.cfm?id=1164206Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Esther M Arkin and Refael Hassin. 1998. On local search for weighted k-set packing. Mathematics of Operations Research 23, 3 (1998), 640--648.Google ScholarGoogle ScholarCross RefCross Ref
  16. Haoqiong Bian, Ying Yan, Wenbo Tao, Liang Jeff Chen, Yueguo Chen, Xiaoyong Du, and Thomas Moscibroda. 2017. Wide Table Layout Optimization based on Column Ordering and Duplication. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017. ACM, 299--314. https://doi.org/10.1145/3035918.3035930Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!). In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6--8, Seattle, WA, USA, Brian N. Bershad and Jeffrey C. Mogul (Eds.). USENIX Association, 205--218. http://www.usenix.org/events/osdi06/tech/chang.htmlGoogle ScholarGoogle Scholar
  18. Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8 April 2006, Atlanta, GA, USA. IEEE Computer Society, 5. https://doi.org/10.1109/ICDE.2006.9Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and Sandeep Tata. 2011. Column-Oriented Storage Techniques for MapReduce. Proc. VLDB Endow. 4, 7 (2011), 419--429. https://doi.org/10.14778/1988776.1988778Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Pranjal Gupta, Amine Mhedhbi, and Semih Salihoglu. 2021. Columnar Storage and List-based Processing for Graph Database Management Systems. Proc. VLDB Endow. 14, 11 (2021), 2491--2504. http://www.vldb.org/pvldb/vol14/p2491-gupta.pdfGoogle ScholarGoogle ScholarDigital LibraryDigital Library
  21. Muon Ha and Yulia A. Shichkina. 2022. Translating a Distributed Relational Database to a Document Database. Data Sci. Eng. 7, 2 (2022), 136--155. https://doi.org/10.1007/s41019-022-00181--9Google ScholarGoogle Scholar
  22. Shuai Han, Mingxia Liu, and Jian-Zhong Li. 2022. Efficient Partitioning Method for Optimizing the Compression on Array Data. J. Comput. Sci. Technol. 37, 5 (2022), 1049--1067. https://doi.org/10.1007/s11390-022--2371--7Google ScholarGoogle ScholarCross RefCross Ref
  23. Donghe Kang, Ruochen Jiang, and Spyros Blanas. 2021. Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021. 898--911. https://doi.org/10.1145/3448016.3457547Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Per-Åke Larson, Eric N. Hanson, and Susan L. Price. 2012. Columnar Storage in SQL Server 2012. IEEE Data Eng. Bull. 35, 1 (2012), 15--20. http://sites.computer.org/debull/A12mar/apollo.pdfGoogle ScholarGoogle Scholar
  25. Yinan Mei, Shaoxu Song, Yunsu Lee, Jungho Park, Soo-Hyung Kim, and Sungmin Yi. 2020. Representing Temporal Attributes for Schema Matching. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23--27, 2020. ACM, 709--719. https://doi.org/10.1145/3394486.3403115Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. 1997. Potential Benefits of Delta Encoding and Data Compression for HTTP. In Proceedings of the ACM SIGCOMM 1997 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, September 14--18, 1997, Cannes, France. ACM, 181--194. https://doi.org/10.1145/263105.263162Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. M. Tamer Özsu. 2016. A survey of RDF data management systems. Frontiers Comput. Sci. 10, 3 (2016), 418--432. https://doi.org/10.1007/s11704-016--5554-yGoogle ScholarGoogle ScholarCross RefCross Ref
  28. Tuomas Pelkonen, Scott Franklin, Paul Cavallaro, Qi Huang, Justin Meza, Justin Teller, and Kaushik Veeraraghavan. 2015. Gorilla: A Fast, Scalable, In-Memory Time Series Database. Proc. VLDB Endow. 8, 12 (2015), 1816--1827. https://doi.org/10.14778/2824032.2824078Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Horst Samulowitz, Chandra Reddy, Ashish Sabharwal, and Meinolf Sellmann. 2013. Snappy: A Simple Algorithm Portfolio. In Theory and Applications of Satisfiability Testing - SAT 2013 - 16th International Conference, Helsinki, Finland, July 8--12, 2013. Proceedings (Lecture Notes in Computer Science, Vol. 7962), Matti Järvisalo and Allen Van Gelder (Eds.). Springer, 422--428. https://doi.org/10.1007/978--3--642--39071--5_33Google ScholarGoogle Scholar
  30. Dilbag Singh, Yavuz Selim Taspinar, Ramazan Kursun, Ilkay Cinar, Murat Koklu, Ilker Ali Ozkan, and Heung-No Lee. 2022. Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models. Electronics 11, 7 (2022), 981.Google ScholarGoogle ScholarCross RefCross Ref
  31. Shaoxu Song, Yue Cao, and Jianmin Wang. 2016. Cleaning Timestamps with Temporal Constraints. Proc. VLDB Endow. 9, 10 (2016), 708--719. https://doi.org/10.14778/2977797.2977798Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Shaoxu Song and Lei Chen. 2010. Efficient set-correlation operator inside databases. In Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26--30, 2010. ACM, 139--148. https://doi.org/10.1145/1871437.1871459Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin Mcgrail, Peng Wang, Diaohan Luo, Jun Yuan, Jianmin Wang, and Jiaguang Sun. 2020. Apache IoTDB: Time-series database for Internet of Things. Proc. VLDB Endow. 13, 12 (2020), 2901--2904. https://doi.org/10.14778/3415478.3415504Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20--24, 2012. ACM, 85--96. https://doi.org/10.1145/2213836.2213847Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Zhiqi Wang, Jin Xue, and Zili Shao. 2021. Heracles: An Efficient Storage Model And Data Flushing For Performance Monitoring Timeseries. Proc. VLDB Endow. 14, 6 (2021), 1080--1092. https://doi.org/10.14778/3447689.3447710Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Jiaye Wu, Peng Wang, Ningting Pan, Chen Wang, Wei Wang, and Jianmin Wang. 2019. KV-Match: A Subsequence Matching Approach Supporting Normalization and Time Warping. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8--11, 2019. IEEE, 866--877. https://doi.org/10.1109/ICDE.2019.00082Google ScholarGoogle Scholar

Index Terms

  1. Grouping Time Series for Efficient Columnar Storage

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in

    Full Access

    • Published in

      cover image Proceedings of the ACM on Management of Data
      Proceedings of the ACM on Management of Data  Volume 1, Issue 1
      PACMMOD
      May 2023
      2807 pages
      EISSN:2836-6573
      DOI:10.1145/3603164
      Issue’s Table of Contents

      Copyright © 2023 Owner/Author

      This work is licensed under a Creative Commons Attribution International 4.0 License.

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 30 May 2023
      Published in pacmmod Volume 1, Issue 1

      Qualifiers

      • research-article

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader