Abstract
Columnar storage is now an industry standard design in most open-source or commercial time series database products, making them HTAP systems. The time column of a time series serves as the key for identifying the other value column, namely single-column storage scheme. When multiple time series share a similar set of timestamps, very likely in a module of multiple sensors, it is natural to group them together, i.e., one time column identifies multiple value columns in a single-group storage scheme. While multiple value columns sharing the same time column reduce the space cost of repeating timestamps, it may introduce extra space cost for recording null values. The reason is that time series may not be exactly aligned on each timestamp, owing to missing values, distinct data collection frequencies, unsynchronized clocks and so on. The columngroups storage scheme is thus to divide columns into multiple groups, within which the value columns share the same time column. Unfortunately, the problem of finding the optimal column groups for the minimum space cost is highly challenging, NP-hard according to our analysis. Thereby, we propose a heuristic algorithm for automatically grouping time series for efficient columnar storage. The column groups storage has been deployed in Apache IoTDB, an open-source time series database. The extensive performance analysis, over real-world data from our industrial partners, demonstrates that the proposed column groups achieve near optimal storage, more concise than the storage of single-column or single-group schemes. Interestingly, both the flushing and querying time costs of column groups are comparable to those of single-column or singlegroup, i.e., without incurring extra time cost.
Supplemental Material
- Apache HBase. https://hbase.apache.org/.Google Scholar
- Apache HBase Implementation. https://github.com/iotdbColumnGroup/HBase.Google Scholar
- Apache IoTDB. http://iotdb.apache.org.Google Scholar
- Appendix. https://iotdbcolumngroup.github.io/iotdbColumnGroup/appendix.pdf.Google Scholar
- Code and Data. https://github.com/iotdbColumnGroup/iotdbColumnGroup.Google Scholar
- InfluxDB. https://www.influxdata.com/.Google Scholar
- OpenTSDB. http://opentsdb.net/.Google Scholar
- Prometheus. https://prometheus.io.Google Scholar
- Source Code. https://github.com/apache/iotdb/tree/research/auto-aligned.Google Scholar
- TDengine. https://github.com/taosdata/TDengine/.Google Scholar
- TDengine Implementation. https://github.com/iotdbColumnGroup/TDengine.Google Scholar
- TimescaleDB. https://www.timescale.com.Google Scholar
- Zomato Dataset. https://www.kaggle.com/himanshupoddar/zomato-bangalore-restaurants.Google Scholar
- Arvind Arasu, Venkatesh Ganti, and Raghav Kaushik. 2006. Efficient Exact Set-Similarity Joins. In Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12--15, 2006. ACM, 918--929. http://dl.acm.org/citation.cfm?id=1164206Google ScholarDigital Library
- Esther M Arkin and Refael Hassin. 1998. On local search for weighted k-set packing. Mathematics of Operations Research 23, 3 (1998), 640--648.Google ScholarCross Ref
- Haoqiong Bian, Ying Yan, Wenbo Tao, Liang Jeff Chen, Yueguo Chen, Xiaoyong Du, and Thomas Moscibroda. 2017. Wide Table Layout Optimization based on Column Ordering and Duplication. In Proceedings of the 2017 ACM International Conference on Management of Data, SIGMOD Conference 2017, Chicago, IL, USA, May 14--19, 2017. ACM, 299--314. https://doi.org/10.1145/3035918.3035930Google ScholarDigital Library
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Michael Burrows, Tushar Chandra, Andrew Fikes, and Robert Gruber. 2006. Bigtable: A Distributed Storage System for Structured Data (Awarded Best Paper!). In 7th Symposium on Operating Systems Design and Implementation (OSDI '06), November 6--8, Seattle, WA, USA, Brian N. Bershad and Jeffrey C. Mogul (Eds.). USENIX Association, 205--218. http://www.usenix.org/events/osdi06/tech/chang.htmlGoogle Scholar
- Surajit Chaudhuri, Venkatesh Ganti, and Raghav Kaushik. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. In Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006, 3--8 April 2006, Atlanta, GA, USA. IEEE Computer Society, 5. https://doi.org/10.1109/ICDE.2006.9Google ScholarDigital Library
- Avrilia Floratou, Jignesh M. Patel, Eugene J. Shekita, and Sandeep Tata. 2011. Column-Oriented Storage Techniques for MapReduce. Proc. VLDB Endow. 4, 7 (2011), 419--429. https://doi.org/10.14778/1988776.1988778Google ScholarDigital Library
- Pranjal Gupta, Amine Mhedhbi, and Semih Salihoglu. 2021. Columnar Storage and List-based Processing for Graph Database Management Systems. Proc. VLDB Endow. 14, 11 (2021), 2491--2504. http://www.vldb.org/pvldb/vol14/p2491-gupta.pdfGoogle ScholarDigital Library
- Muon Ha and Yulia A. Shichkina. 2022. Translating a Distributed Relational Database to a Document Database. Data Sci. Eng. 7, 2 (2022), 136--155. https://doi.org/10.1007/s41019-022-00181--9Google Scholar
- Shuai Han, Mingxia Liu, and Jian-Zhong Li. 2022. Efficient Partitioning Method for Optimizing the Compression on Array Data. J. Comput. Sci. Technol. 37, 5 (2022), 1049--1067. https://doi.org/10.1007/s11390-022--2371--7Google ScholarCross Ref
- Donghe Kang, Ruochen Jiang, and Spyros Blanas. 2021. Jigsaw: A Data Storage and Query Processing Engine for Irregular Table Partitioning. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20--25, 2021. 898--911. https://doi.org/10.1145/3448016.3457547Google ScholarDigital Library
- Per-Åke Larson, Eric N. Hanson, and Susan L. Price. 2012. Columnar Storage in SQL Server 2012. IEEE Data Eng. Bull. 35, 1 (2012), 15--20. http://sites.computer.org/debull/A12mar/apollo.pdfGoogle Scholar
- Yinan Mei, Shaoxu Song, Yunsu Lee, Jungho Park, Soo-Hyung Kim, and Sungmin Yi. 2020. Representing Temporal Attributes for Schema Matching. In KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Virtual Event, CA, USA, August 23--27, 2020. ACM, 709--719. https://doi.org/10.1145/3394486.3403115Google ScholarDigital Library
- Jeffrey C. Mogul, Fred Douglis, Anja Feldmann, and Balachander Krishnamurthy. 1997. Potential Benefits of Delta Encoding and Data Compression for HTTP. In Proceedings of the ACM SIGCOMM 1997 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, September 14--18, 1997, Cannes, France. ACM, 181--194. https://doi.org/10.1145/263105.263162Google ScholarDigital Library
- M. Tamer Özsu. 2016. A survey of RDF data management systems. Frontiers Comput. Sci. 10, 3 (2016), 418--432. https://doi.org/10.1007/s11704-016--5554-yGoogle ScholarCross Ref
- Tuomas Pelkonen, Scott Franklin, Paul Cavallaro, Qi Huang, Justin Meza, Justin Teller, and Kaushik Veeraraghavan. 2015. Gorilla: A Fast, Scalable, In-Memory Time Series Database. Proc. VLDB Endow. 8, 12 (2015), 1816--1827. https://doi.org/10.14778/2824032.2824078Google ScholarDigital Library
- Horst Samulowitz, Chandra Reddy, Ashish Sabharwal, and Meinolf Sellmann. 2013. Snappy: A Simple Algorithm Portfolio. In Theory and Applications of Satisfiability Testing - SAT 2013 - 16th International Conference, Helsinki, Finland, July 8--12, 2013. Proceedings (Lecture Notes in Computer Science, Vol. 7962), Matti Järvisalo and Allen Van Gelder (Eds.). Springer, 422--428. https://doi.org/10.1007/978--3--642--39071--5_33Google Scholar
- Dilbag Singh, Yavuz Selim Taspinar, Ramazan Kursun, Ilkay Cinar, Murat Koklu, Ilker Ali Ozkan, and Heung-No Lee. 2022. Classification and Analysis of Pistachio Species with Pre-Trained Deep Learning Models. Electronics 11, 7 (2022), 981.Google ScholarCross Ref
- Shaoxu Song, Yue Cao, and Jianmin Wang. 2016. Cleaning Timestamps with Temporal Constraints. Proc. VLDB Endow. 9, 10 (2016), 708--719. https://doi.org/10.14778/2977797.2977798Google ScholarDigital Library
- Shaoxu Song and Lei Chen. 2010. Efficient set-correlation operator inside databases. In Proceedings of the 19th ACM Conference on Information and Knowledge Management, CIKM 2010, Toronto, Ontario, Canada, October 26--30, 2010. ACM, 139--148. https://doi.org/10.1145/1871437.1871459Google ScholarDigital Library
- Chen Wang, Xiangdong Huang, Jialin Qiao, Tian Jiang, Lei Rui, Jinrui Zhang, Rong Kang, Julian Feinauer, Kevin Mcgrail, Peng Wang, Diaohan Luo, Jun Yuan, Jianmin Wang, and Jiaguang Sun. 2020. Apache IoTDB: Time-series database for Internet of Things. Proc. VLDB Endow. 13, 12 (2020), 2901--2904. https://doi.org/10.14778/3415478.3415504Google ScholarDigital Library
- Jiannan Wang, Guoliang Li, and Jianhua Feng. 2012. Can we beat the prefix filtering?: an adaptive framework for similarity join and search. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2012, Scottsdale, AZ, USA, May 20--24, 2012. ACM, 85--96. https://doi.org/10.1145/2213836.2213847Google ScholarDigital Library
- Zhiqi Wang, Jin Xue, and Zili Shao. 2021. Heracles: An Efficient Storage Model And Data Flushing For Performance Monitoring Timeseries. Proc. VLDB Endow. 14, 6 (2021), 1080--1092. https://doi.org/10.14778/3447689.3447710Google ScholarDigital Library
- Jiaye Wu, Peng Wang, Ningting Pan, Chen Wang, Wei Wang, and Jianmin Wang. 2019. KV-Match: A Subsequence Matching Approach Supporting Normalization and Time Warping. In 35th IEEE International Conference on Data Engineering, ICDE 2019, Macao, China, April 8--11, 2019. IEEE, 866--877. https://doi.org/10.1109/ICDE.2019.00082Google Scholar
Index Terms
- Grouping Time Series for Efficient Columnar Storage
Recommendations
A flash-based decomposition storage model
DASFAA'12: Proceedings of the 17th international conference on Database Systems for Advanced ApplicationsThe traditional HDD-based columnar storage is an important technology to improve the performance of query-intensive database. However, some features of HDD weaken the advantages of columnar storage. In this paper, we study the advantages of SSD over HDD ...
A Comparative Study of Row and Column Storage for Time Series Data
Spatial Data and IntelligenceAbstractOver the past few decades, researchers have done massive research on data storage structures of relational databases. The existing research mainly focuses on the analysis and optimization of row and column storage structures in relational ...
Automatic Array Transformation to Columnar Storage at Run Time
MPLR '22: Proceedings of the 19th International Conference on Managed Programming Languages and RuntimesToday’s huge memories make it possible to store and process large data structures in memory instead of in a database. Hence, accesses to this data should be optimized, which is normally relegated either to the runtimes and compilers or is left to the ...
Comments