Abstract
The SQL standard introduced MATCH_RECOGNIZE in 2016 for row pattern recognition. Since then, MATCH_RECOGNIZE has been supported by several leading relation systems, they implemented this function using Non-Deterministic Finite Automaton (NFA). While NFA is suitable for pattern recognition in streaming scenarios, the current uses of NFA by the relational systems for historical data analysis scenarios overlook important optimization opportunities. We propose a new approach to use Join to speed up row pattern recognition in historical analysis scenarios for relational systems. Implemented as a logical plan rewrite rule, the new approach first filters the input relation to MATCH_RECOGNIZE using Joins constructed based on a subset of symbols taken from the PATTERN expression, then run the NFA-based MATCH_RECOGNIZE on the filtered rows, reducing the net cost. The rule also includes a specialized cardinality model for the Joins and a cost model for the NFA-based MATCH_RECOGNIZE operator for choosing an appropriate symbol set. The rewrite rule is applicable when the query pattern's definition is self-contained and either the input table has no duplicates or there is a window condition. Applying the rewrite rule to a query benchmark with 1,800 queries spanning over 6 patterns and 3 pattern definitions, we observed median speedups of 5.4X on Trino (v373 with ORC files on Hive), 57.5X on SQL Server (2019) using column store and 41.6X on row store.
- 2022. Citus. https://github.com/citusdata/citus.Google Scholar
- 2022. Columnstore indexes: Overview. https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview.Google Scholar
- 2022. Index Accelerated Pattern Matching on Persistent Event Streams. https://github.com/sigmod2021-index-pattern/index-pattern.Google Scholar
- Christopher R. Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. 2017. EmptyHeaded: A Relational Engine for Graph Processing. ACM Trans. Database Syst. 42, 4 (2017), 20:1--20:44. Google ScholarDigital Library
- Jagrati Agrawal, Yanlei Diao, Daniel Gyllstrom, and Neil Immerman. 2008. Efficient pattern matching over event streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, Jason Tsong-Li Wang (Ed.). ACM, 147--160. Google ScholarDigital Library
- Apache Flink. 2022. Pattern Recognition. https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sql/queries/match_recognize/.Google Scholar
- Brian Olsen. 2020. A gentle introduction to the Hive connector. https://trino.io/blog/2020/10/20/intro-to-hive-connector.html.Google Scholar
- Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401--412. Google ScholarDigital Library
- Badrish Chandramouli, Jonathan Goldstein, and David Maier. 2010. High-Performance Dynamic Pattern Matching over Disordered Streams. Proc. VLDB Endow. 3, 1 (2010), 220--231. Google ScholarDigital Library
- Chicago. 2022. Chicago Crimes. https://www.kaggle.com/chicago/chicago-crime.Google Scholar
- Alan J. Demers, Johannes Gehrke, Biswanath Panda, Mirek Riedewald, Varun Sharma, and Walker M. White. 2007. Cayuga: A General Purpose Event Monitoring System. In Third Biennial Conference on Innovative Data Systems Research, CIDR 2007, Asilomar, CA, USA, January 7-10, 2007, Online Proceedings. www.cidrdb.org, 412--422. http://cidrdb.org/cidr2007/papers/cidr07p47.pdfGoogle Scholar
- David J. DeWitt, Jeffrey F. Naughton, and Donovan A. Schneider. 1991. An Evaluation of Non-Equijoin Algorithms. In 17th International Conference on Very Large Data Bases, September 3-6, 1991, Barcelona, Catalonia, Spain, Proceedings, Guy M. Lohman, Amílcar Sernadas, and Rafael Camps (Eds.). Morgan Kaufmann, 443--452. http://www.vldb.org/conf/1991/P443.PDFGoogle Scholar
- Yanlei Diao, Neil Immerman, and Daniel Gyllstrom. 2007. Sase+: An agile language for kleene closure over event streams. UMass Technical Report (2007).Google Scholar
- Nihal Dindar, Peter M. Fischer, Merve Soner, and Nesime Tatbul. 2011. Efficiently correlating complex events over live and archived data streams. In Proceedings of the Fifth ACM International Conference on Distributed Event-Based Systems, DEBS 2011, New York, NY, USA, July 11-15, 2011, David M. Eyers, Opher Etzion, Avigdor Gal, Stanley B. Zdonik, and Paul Vincent (Eds.). ACM, 243--254. Google ScholarDigital Library
- Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek R. Narasayya, and Surajit Chaudhuri. 2019. Selectivity Estimation for Range Predicates using Lightweight Models. Proc. VLDB Endow. 12, 9 (2019), 1044--1057. Google ScholarDigital Library
- Erkang Zhu and Silu Huang and Surajit Chaudhuri. 2022. High-Performance Row Pattern Recognition Using Joins (Technical Report). https://www.microsoft.com/en-us/research/publication/high-performance-row-pattern-recognition-using-joins-technical-report/.Google Scholar
- Felipe Hoffa. 2021. Funnel analytics with SQL: MATCH_RECOGNIZE() on Snowflake. https://towardsdatascience.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1.Google Scholar
- Michael J. Freitag, Maximilian Bandle, Tobias Schmidt, Alfons Kemper, and Thomas Neumann. 2020. Adopting Worst-Case Optimal Joins in Relational Database Systems. Proc. VLDB Endow. 13, 11 (2020), 1891--1904. http://www.vldb.org/pvldb/vol13/p1891-freitag.pdfGoogle ScholarDigital Library
- Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database systems - the complete book (2. ed.). Pearson Education.Google Scholar
- Nikos Giatrakos, Elias Alevizos, Alexander Artikis, Antonios Deligiannakis, and Minos N. Garofalakis. 2020. Complex event recognition in the Big Data era: a survey. VLDB J. 29, 1 (2020), 313--352. Google ScholarDigital Library
- Goetz Graefe. 1993. Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25, 2 (1993), 73--170. Google ScholarDigital Library
- Goetz Graefe. 1995. The Cascades Framework for Query Optimization. IEEE Data Eng. Bull. 18, 3 (1995), 19--29. http://sites.computer.org/debull/95SEP-CD.pdfGoogle Scholar
- Goetz Graefe and William J. McKenna. 1993. The Volcano Optimizer Generator: Extensibility and Efficient Search. In Proceedings of the Ninth International Conference on Data Engineering, April 19-23, 1993, Vienna, Austria. IEEE Computer Society, 209--218. Google ScholarCross Ref
- Laura M. Haas, Johann Christoph Freytag, Guy M. Lohman, and Hamid Pirahesh. 1989. Extensible Query Processing in Starburst. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, Portland, Oregon, USA, May 31 - June 2, 1989, James Clifford, Bruce G. Lindsay, and David Maier (Eds.). ACM Press, 377--388. Google ScholarDigital Library
- ISO/IEC JTC 1/SC 32 Data management and interchange. 2016. ISO/IEC TR 19075-5:2016 Information technology - Database languages - SQL Technical Reports - Part 5: Row Pattern Recognition in SQL. https://www.iso.org/standard/65143.html.Google Scholar
- Kasia Findeisen. 2021. Row pattern recognition with MATCH_RECOGNIZE. https://trino.io/blog/2021/05/19/row_pattern_matching.html.Google Scholar
- Keith Laker. 2017. MATCH_RECOGNIZE and predicates - everything you need to know. https://blogs.oracle.com/datawarehousing/post/match_recognize-and-predicates-everything-you-need-to-know.Google Scholar
- Mahmoud Abo Khamis, Hung Q. Ngo, Christopher Ré, and Atri Rudra. 2016. Joins via Geometric Resolutions: Worst Case and Beyond. ACM Trans. Database Syst. 41, 4 (2016), 22:1--22:45. Google ScholarDigital Library
- Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. 1977. Fast Pattern Matching in Strings. SIAM J. Comput. 6, 2 (1977), 323--350. Google ScholarDigital Library
- Ilya Kolchinsky and Assaf Schuster. 2018. Join Query Optimization Techniques for Complex Event Processing Applications. Proc. VLDB Endow. 11, 11 (2018), 1332--1345. Google ScholarDigital Library
- Ilya Kolchinsky and Assaf Schuster. 2019. Real-Time Multi-Pattern Detection over Event Streams. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska (Eds.). ACM, 589--606. Google ScholarDigital Library
- Ilya Kolchinsky, Izchak Sharfman, and Assaf Schuster. 2015. Lazy evaluation methods for detecting complex events. In Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems, DEBS '15, Oslo, Norway, June 29 - July 3, 2015, Frank Eliassen and Roman Vitenberg (Eds.). ACM, 34--45. Google ScholarDigital Library
- Michael Körber, Nikolaus Glombiewski, and Bernhard Seeger. 2021. Index-Accelerated Pattern Matching in Event Stores. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, Guoliang Li, Zhanhuai Li, Stratos Idreos, and Divesh Srivastava (Eds.). ACM, 1023--1036. Google ScholarDigital Library
- Rundong Li, Wolfgang Gatterbauer, and Mirek Riedewald. 2020. Near-Optimal Distributed Band-Joins through Recursive Partitioning. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 2375--2390. Google ScholarDigital Library
- Mo Liu, Elke A. Rundensteiner, Daniel J. Dougherty, Chetan Gupta, Song Wang, Ismail Ari, and Abhay Mehta. 2011. High-performance nested CEP query processing over event streams. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, Serge Abiteboul, Klemens Böhm, Christoph Koch, and Kian-Lee Tan (Eds.). IEEE Computer Society, 123--134. Google ScholarDigital Library
- Michael V. Mannino, Paicheng Chu, and Thomas Sager. 1988. Statistical Profile Estimation in Database Systems. ACM Comput. Surv. 20, 3 (1988), 191--221. Google ScholarDigital Library
- Marta Paes. 2019. MATCH_RECOGNIZE: where Flink SQL and Complex Event Processing meet. https://www.ververica.com/blog/match_recognize-where-flink-sql-and-complex-event-processing-meet.Google Scholar
- Yuan Mei and Samuel Madden. 2009. ZStream: a cost-based query processor for adaptively detecting composite events. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, Ugur Çetintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul (Eds.). ACM, 193--206. Google ScholarDigital Library
- Microsoft Azure. 2021. MATCH_RECOGNIZE (Stream Analytics). https://docs.microsoft.com/en-us/stream-analytics-query/match-recognize-stream-analytics.Google Scholar
- Hung Q. Ngo, Dung T. Nguyen, Christopher Ré, and Atri Rudra. 2014. Beyond worst-case analysis for joins with minesweeper. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14, Snowbird, UT, USA, June 22-27, 2014, Richard Hull and Martin Grohe (Eds.). ACM, 234--245. Google ScholarDigital Library
- Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2018. Worst-case Optimal Join Algorithms. J. ACM 65, 3 (2018), 16:1--16:40. Google ScholarDigital Library
- Oracle. 2022. Pattern Recognition With MATCH_RECOGNIZE. https://docs.oracle.com/en/middleware/fusion-middleware/osa/19.1/cqlreference/pattern-recognition-match_recognize.html.Google Scholar
- Raghu Ramakrishnan, Donko Donjerkovic, Arvind Ranganathan, Kevin S. Beyer, and Muralidhar Krishnaprasad. 1998. SRQL: Sorted Relational Query Language. In 10th International Conference on Scientific and Statistical Database Management, Proceedings, Capri, Italy, July 1-3, 1998, Maurizio Rafanelli and Matthias Jarke (Eds.). IEEE Computer Society, 84--95. Google ScholarCross Ref
- Medhabi Ray, Chuan Lei, and Elke A. Rundensteiner. 2016. Scalable Pattern Sharing on Event Streams. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). ACM, 495--510. Google ScholarDigital Library
- Maximilian Reif and Thomas Neumann. 2022. A Scalable and Generic Approach to Range Joins. Proc. VLDB Endow. 15, 11 (2022), 3018--3030. https://www.vldb.org/pvldb/vol15/p3018-reif.pdfGoogle ScholarDigital Library
- Rodrigo Alves. 2019. Azure Stream Analytics now supports MATCH_RECOGNIZE. https://azure.microsoft.com/en-us/blog/azure-stream-analytics-now-supports-match-recognize/.Google Scholar
- Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, and Jafar Adibi. 2004. Expressing and optimizing sequence queries in database systems. ACM Trans. Database Syst. 29, 2 (2004), 282--318. Google ScholarDigital Library
- Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, USA, May 30 - June 1, Philip A. Bernstein (Ed.). ACM, 23--34. Google ScholarDigital Library
- Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. 1994. Sequence Query Processing. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, USA, May 24-27, 1994, Richard T. Snodgrass and Marianne Winslett (Eds.). ACM Press, 430--441. Google ScholarDigital Library
- Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. 1995. SEQ: A Model for Sequence Databases. In Proceedings of the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei, Taiwan, Philip S. Yu and Arbee L. P. Chen (Eds.). IEEE Computer Society, 232--239. Google ScholarCross Ref
- Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. 1996. The Design and Implementation of a Sequence Database System. In VLDB'96, Proceedings of 22th International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda (Eds.). Morgan Kaufmann, 99--110. http://www.vldb.org/conf/1996/P099.PDFGoogle ScholarDigital Library
- Snowflake. 2021. Identifying Sequences of Rows That Match a Pattern. https://docs.snowflake.com/en/user-guide/match-recognize-introduction.html.Google Scholar
- Valery Soloviev. 1993. A Truncating Hash Algorithm for Processing Band-Join Queries. In Proceedings of the Ninth International Conference on Data Engineering, April 19-23, 1993, Vienna, Austria. IEEE Computer Society, 419--427. Google ScholarCross Ref
- Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2019. C-store: a column-oriented DBMS. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker, Michael L. Brodie (Ed.). ACM / Morgan & Claypool, 491--518. Google ScholarDigital Library
- Trino. 2022. MATCH_RECOGNIZE. https://trino.io/docs/current/sql/match-recognize.html.Google Scholar
- Todd L. Veldhuizen. 2014. Triejoin: A Simple, Worst-Case Optimal Join Algorithm. In Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, Nicole Schweikardt, Vassilis Christophides, and Vincent Leroy (Eds.). OpenProceedings.org, 96--106. Google ScholarCross Ref
- Eugene Wu, Yanlei Diao, and Shariq Rizvi. 2006. High-performance complex event processing over streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, Surajit Chaudhuri, Vagelis Hristidis, and Neoklis Polyzotis (Eds.). ACM, 407--418. Google ScholarDigital Library
Recommendations
Efficient Row Pattern Matching Using Pattern Hierarchies for Sequence OLAP
Big Data Analytics and Knowledge DiscoveryAbstractSequence OLAP is a variant of OLAP for sequence data analysis such as analysis of RFID log and person trip data. It extracts pattern occurrences of the given patterns (e.g., state transition pattern , movement pattern ) on sequence data ...
Optimization of joins using random record generation method
A2CWiC '10: Proceedings of the 1st Amrita ACM-W Celebration on Women in Computing in IndiaJoins are statements that retrieve data from more than one table. A Join is characterized by multiple tables in the FROM clause, and the relationship between the tables is defined through the existence of a Join condition in the WHERE clause. In the ...
Reducing outer joins
We present a method for transforming some outer joins to inner joins and describe a generalized semijoin reduction technique. The first part of the paper shows how to transform a given outer join query whose join graph is a tree to an equivalent inner ...
Comments