skip to main content
research-article

High-Performance Row Pattern Recognition Using Joins

Published:01 January 2023Publication History
Skip Abstract Section

Abstract

The SQL standard introduced MATCH_RECOGNIZE in 2016 for row pattern recognition. Since then, MATCH_RECOGNIZE has been supported by several leading relation systems, they implemented this function using Non-Deterministic Finite Automaton (NFA). While NFA is suitable for pattern recognition in streaming scenarios, the current uses of NFA by the relational systems for historical data analysis scenarios overlook important optimization opportunities. We propose a new approach to use Join to speed up row pattern recognition in historical analysis scenarios for relational systems. Implemented as a logical plan rewrite rule, the new approach first filters the input relation to MATCH_RECOGNIZE using Joins constructed based on a subset of symbols taken from the PATTERN expression, then run the NFA-based MATCH_RECOGNIZE on the filtered rows, reducing the net cost. The rule also includes a specialized cardinality model for the Joins and a cost model for the NFA-based MATCH_RECOGNIZE operator for choosing an appropriate symbol set. The rewrite rule is applicable when the query pattern's definition is self-contained and either the input table has no duplicates or there is a window condition. Applying the rewrite rule to a query benchmark with 1,800 queries spanning over 6 patterns and 3 pattern definitions, we observed median speedups of 5.4X on Trino (v373 with ORC files on Hive), 57.5X on SQL Server (2019) using column store and 41.6X on row store.

References

  1. 2022. Citus. https://github.com/citusdata/citus.Google ScholarGoogle Scholar
  2. 2022. Columnstore indexes: Overview. https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-overview.Google ScholarGoogle Scholar
  3. 2022. Index Accelerated Pattern Matching on Persistent Event Streams. https://github.com/sigmod2021-index-pattern/index-pattern.Google ScholarGoogle Scholar
  4. Christopher R. Aberger, Andrew Lamb, Susan Tu, Andres Nötzli, Kunle Olukotun, and Christopher Ré. 2017. EmptyHeaded: A Relational Engine for Graph Processing. ACM Trans. Database Syst. 42, 4 (2017), 20:1--20:44. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Jagrati Agrawal, Yanlei Diao, Daniel Gyllstrom, and Neil Immerman. 2008. Efficient pattern matching over event streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008, Jason Tsong-Li Wang (Ed.). ACM, 147--160. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Apache Flink. 2022. Pattern Recognition. https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/dev/table/sql/queries/match_recognize/.Google ScholarGoogle Scholar
  7. Brian Olsen. 2020. A gentle introduction to the Hive connector. https://trino.io/blog/2020/10/20/intro-to-hive-connector.html.Google ScholarGoogle Scholar
  8. Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A High-Performance Incremental Query Processor for Diverse Analytics. Proc. VLDB Endow. 8, 4 (2014), 401--412. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Badrish Chandramouli, Jonathan Goldstein, and David Maier. 2010. High-Performance Dynamic Pattern Matching over Disordered Streams. Proc. VLDB Endow. 3, 1 (2010), 220--231. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Chicago. 2022. Chicago Crimes. https://www.kaggle.com/chicago/chicago-crime.Google ScholarGoogle Scholar
  11. Alan J. Demers, Johannes Gehrke, Biswanath Panda, Mirek Riedewald, Varun Sharma, and Walker M. White. 2007. Cayuga: A General Purpose Event Monitoring System. In Third Biennial Conference on Innovative Data Systems Research, CIDR 2007, Asilomar, CA, USA, January 7-10, 2007, Online Proceedings. www.cidrdb.org, 412--422. http://cidrdb.org/cidr2007/papers/cidr07p47.pdfGoogle ScholarGoogle Scholar
  12. David J. DeWitt, Jeffrey F. Naughton, and Donovan A. Schneider. 1991. An Evaluation of Non-Equijoin Algorithms. In 17th International Conference on Very Large Data Bases, September 3-6, 1991, Barcelona, Catalonia, Spain, Proceedings, Guy M. Lohman, Amílcar Sernadas, and Rafael Camps (Eds.). Morgan Kaufmann, 443--452. http://www.vldb.org/conf/1991/P443.PDFGoogle ScholarGoogle Scholar
  13. Yanlei Diao, Neil Immerman, and Daniel Gyllstrom. 2007. Sase+: An agile language for kleene closure over event streams. UMass Technical Report (2007).Google ScholarGoogle Scholar
  14. Nihal Dindar, Peter M. Fischer, Merve Soner, and Nesime Tatbul. 2011. Efficiently correlating complex events over live and archived data streams. In Proceedings of the Fifth ACM International Conference on Distributed Event-Based Systems, DEBS 2011, New York, NY, USA, July 11-15, 2011, David M. Eyers, Opher Etzion, Avigdor Gal, Stanley B. Zdonik, and Paul Vincent (Eds.). ACM, 243--254. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Anshuman Dutt, Chi Wang, Azade Nazi, Srikanth Kandula, Vivek R. Narasayya, and Surajit Chaudhuri. 2019. Selectivity Estimation for Range Predicates using Lightweight Models. Proc. VLDB Endow. 12, 9 (2019), 1044--1057. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Erkang Zhu and Silu Huang and Surajit Chaudhuri. 2022. High-Performance Row Pattern Recognition Using Joins (Technical Report). https://www.microsoft.com/en-us/research/publication/high-performance-row-pattern-recognition-using-joins-technical-report/.Google ScholarGoogle Scholar
  17. Felipe Hoffa. 2021. Funnel analytics with SQL: MATCH_RECOGNIZE() on Snowflake. https://towardsdatascience.com/funnel-analytics-with-sql-match-recognize-on-snowflake-8bd576d9b7b1.Google ScholarGoogle Scholar
  18. Michael J. Freitag, Maximilian Bandle, Tobias Schmidt, Alfons Kemper, and Thomas Neumann. 2020. Adopting Worst-Case Optimal Joins in Relational Database Systems. Proc. VLDB Endow. 13, 11 (2020), 1891--1904. http://www.vldb.org/pvldb/vol13/p1891-freitag.pdfGoogle ScholarGoogle ScholarDigital LibraryDigital Library
  19. Hector Garcia-Molina, Jeffrey D. Ullman, and Jennifer Widom. 2009. Database systems - the complete book (2. ed.). Pearson Education.Google ScholarGoogle Scholar
  20. Nikos Giatrakos, Elias Alevizos, Alexander Artikis, Antonios Deligiannakis, and Minos N. Garofalakis. 2020. Complex event recognition in the Big Data era: a survey. VLDB J. 29, 1 (2020), 313--352. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Goetz Graefe. 1993. Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25, 2 (1993), 73--170. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Goetz Graefe. 1995. The Cascades Framework for Query Optimization. IEEE Data Eng. Bull. 18, 3 (1995), 19--29. http://sites.computer.org/debull/95SEP-CD.pdfGoogle ScholarGoogle Scholar
  23. Goetz Graefe and William J. McKenna. 1993. The Volcano Optimizer Generator: Extensibility and Efficient Search. In Proceedings of the Ninth International Conference on Data Engineering, April 19-23, 1993, Vienna, Austria. IEEE Computer Society, 209--218. Google ScholarGoogle ScholarCross RefCross Ref
  24. Laura M. Haas, Johann Christoph Freytag, Guy M. Lohman, and Hamid Pirahesh. 1989. Extensible Query Processing in Starburst. In Proceedings of the 1989 ACM SIGMOD International Conference on Management of Data, Portland, Oregon, USA, May 31 - June 2, 1989, James Clifford, Bruce G. Lindsay, and David Maier (Eds.). ACM Press, 377--388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. ISO/IEC JTC 1/SC 32 Data management and interchange. 2016. ISO/IEC TR 19075-5:2016 Information technology - Database languages - SQL Technical Reports - Part 5: Row Pattern Recognition in SQL. https://www.iso.org/standard/65143.html.Google ScholarGoogle Scholar
  26. Kasia Findeisen. 2021. Row pattern recognition with MATCH_RECOGNIZE. https://trino.io/blog/2021/05/19/row_pattern_matching.html.Google ScholarGoogle Scholar
  27. Keith Laker. 2017. MATCH_RECOGNIZE and predicates - everything you need to know. https://blogs.oracle.com/datawarehousing/post/match_recognize-and-predicates-everything-you-need-to-know.Google ScholarGoogle Scholar
  28. Mahmoud Abo Khamis, Hung Q. Ngo, Christopher Ré, and Atri Rudra. 2016. Joins via Geometric Resolutions: Worst Case and Beyond. ACM Trans. Database Syst. 41, 4 (2016), 22:1--22:45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Donald E. Knuth, James H. Morris Jr., and Vaughan R. Pratt. 1977. Fast Pattern Matching in Strings. SIAM J. Comput. 6, 2 (1977), 323--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Ilya Kolchinsky and Assaf Schuster. 2018. Join Query Optimization Techniques for Complex Event Processing Applications. Proc. VLDB Endow. 11, 11 (2018), 1332--1345. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Ilya Kolchinsky and Assaf Schuster. 2019. Real-Time Multi-Pattern Detection over Event Streams. In Proceedings of the 2019 International Conference on Management of Data, SIGMOD Conference 2019, Amsterdam, The Netherlands, June 30 - July 5, 2019, Peter A. Boncz, Stefan Manegold, Anastasia Ailamaki, Amol Deshpande, and Tim Kraska (Eds.). ACM, 589--606. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Ilya Kolchinsky, Izchak Sharfman, and Assaf Schuster. 2015. Lazy evaluation methods for detecting complex events. In Proceedings of the 9th ACM International Conference on Distributed Event-Based Systems, DEBS '15, Oslo, Norway, June 29 - July 3, 2015, Frank Eliassen and Roman Vitenberg (Eds.). ACM, 34--45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Michael Körber, Nikolaus Glombiewski, and Bernhard Seeger. 2021. Index-Accelerated Pattern Matching in Event Stores. In SIGMOD '21: International Conference on Management of Data, Virtual Event, China, June 20-25, 2021, Guoliang Li, Zhanhuai Li, Stratos Idreos, and Divesh Srivastava (Eds.). ACM, 1023--1036. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Rundong Li, Wolfgang Gatterbauer, and Mirek Riedewald. 2020. Near-Optimal Distributed Band-Joins through Recursive Partitioning. In Proceedings of the 2020 International Conference on Management of Data, SIGMOD Conference 2020, online conference [Portland, OR, USA], June 14-19, 2020, David Maier, Rachel Pottinger, AnHai Doan, Wang-Chiew Tan, Abdussalam Alawini, and Hung Q. Ngo (Eds.). ACM, 2375--2390. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Mo Liu, Elke A. Rundensteiner, Daniel J. Dougherty, Chetan Gupta, Song Wang, Ismail Ari, and Abhay Mehta. 2011. High-performance nested CEP query processing over event streams. In Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, Serge Abiteboul, Klemens Böhm, Christoph Koch, and Kian-Lee Tan (Eds.). IEEE Computer Society, 123--134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Michael V. Mannino, Paicheng Chu, and Thomas Sager. 1988. Statistical Profile Estimation in Database Systems. ACM Comput. Surv. 20, 3 (1988), 191--221. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Marta Paes. 2019. MATCH_RECOGNIZE: where Flink SQL and Complex Event Processing meet. https://www.ververica.com/blog/match_recognize-where-flink-sql-and-complex-event-processing-meet.Google ScholarGoogle Scholar
  38. Yuan Mei and Samuel Madden. 2009. ZStream: a cost-based query processor for adaptively detecting composite events. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2009, Providence, Rhode Island, USA, June 29 - July 2, 2009, Ugur Çetintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul (Eds.). ACM, 193--206. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Microsoft Azure. 2021. MATCH_RECOGNIZE (Stream Analytics). https://docs.microsoft.com/en-us/stream-analytics-query/match-recognize-stream-analytics.Google ScholarGoogle Scholar
  40. Hung Q. Ngo, Dung T. Nguyen, Christopher Ré, and Atri Rudra. 2014. Beyond worst-case analysis for joins with minesweeper. In Proceedings of the 33rd ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS'14, Snowbird, UT, USA, June 22-27, 2014, Richard Hull and Martin Grohe (Eds.). ACM, 234--245. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Hung Q. Ngo, Ely Porat, Christopher Ré, and Atri Rudra. 2018. Worst-case Optimal Join Algorithms. J. ACM 65, 3 (2018), 16:1--16:40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Oracle. 2022. Pattern Recognition With MATCH_RECOGNIZE. https://docs.oracle.com/en/middleware/fusion-middleware/osa/19.1/cqlreference/pattern-recognition-match_recognize.html.Google ScholarGoogle Scholar
  43. Raghu Ramakrishnan, Donko Donjerkovic, Arvind Ranganathan, Kevin S. Beyer, and Muralidhar Krishnaprasad. 1998. SRQL: Sorted Relational Query Language. In 10th International Conference on Scientific and Statistical Database Management, Proceedings, Capri, Italy, July 1-3, 1998, Maurizio Rafanelli and Matthias Jarke (Eds.). IEEE Computer Society, 84--95. Google ScholarGoogle ScholarCross RefCross Ref
  44. Medhabi Ray, Chuan Lei, and Elke A. Rundensteiner. 2016. Scalable Pattern Sharing on Event Streams. In Proceedings of the 2016 International Conference on Management of Data, SIGMOD Conference 2016, San Francisco, CA, USA, June 26 - July 01, 2016, Fatma Özcan, Georgia Koutrika, and Sam Madden (Eds.). ACM, 495--510. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Maximilian Reif and Thomas Neumann. 2022. A Scalable and Generic Approach to Range Joins. Proc. VLDB Endow. 15, 11 (2022), 3018--3030. https://www.vldb.org/pvldb/vol15/p3018-reif.pdfGoogle ScholarGoogle ScholarDigital LibraryDigital Library
  46. Rodrigo Alves. 2019. Azure Stream Analytics now supports MATCH_RECOGNIZE. https://azure.microsoft.com/en-us/blog/azure-stream-analytics-now-supports-match-recognize/.Google ScholarGoogle Scholar
  47. Reza Sadri, Carlo Zaniolo, Amir M. Zarkesh, and Jafar Adibi. 2004. Expressing and optimizing sequence queries in database systems. ACM Trans. Database Syst. 29, 2 (2004), 282--318. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. Patricia G. Selinger, Morton M. Astrahan, Donald D. Chamberlin, Raymond A. Lorie, and Thomas G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data, Boston, Massachusetts, USA, May 30 - June 1, Philip A. Bernstein (Ed.). ACM, 23--34. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. 1994. Sequence Query Processing. In Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, USA, May 24-27, 1994, Richard T. Snodgrass and Marianne Winslett (Eds.). ACM Press, 430--441. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. 1995. SEQ: A Model for Sequence Databases. In Proceedings of the Eleventh International Conference on Data Engineering, March 6-10, 1995, Taipei, Taiwan, Philip S. Yu and Arbee L. P. Chen (Eds.). IEEE Computer Society, 232--239. Google ScholarGoogle ScholarCross RefCross Ref
  51. Praveen Seshadri, Miron Livny, and Raghu Ramakrishnan. 1996. The Design and Implementation of a Sequence Database System. In VLDB'96, Proceedings of 22th International Conference on Very Large Data Bases, September 3-6, 1996, Mumbai (Bombay), India, T. M. Vijayaraman, Alejandro P. Buchmann, C. Mohan, and Nandlal L. Sarda (Eds.). Morgan Kaufmann, 99--110. http://www.vldb.org/conf/1996/P099.PDFGoogle ScholarGoogle ScholarDigital LibraryDigital Library
  52. Snowflake. 2021. Identifying Sequences of Rows That Match a Pattern. https://docs.snowflake.com/en/user-guide/match-recognize-introduction.html.Google ScholarGoogle Scholar
  53. Valery Soloviev. 1993. A Truncating Hash Algorithm for Processing Band-Join Queries. In Proceedings of the Ninth International Conference on Data Engineering, April 19-23, 1993, Vienna, Austria. IEEE Computer Society, 419--427. Google ScholarGoogle ScholarCross RefCross Ref
  54. Mike Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth J. O'Neil, Patrick E. O'Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 2019. C-store: a column-oriented DBMS. In Making Databases Work: the Pragmatic Wisdom of Michael Stonebraker, Michael L. Brodie (Ed.). ACM / Morgan & Claypool, 491--518. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Trino. 2022. MATCH_RECOGNIZE. https://trino.io/docs/current/sql/match-recognize.html.Google ScholarGoogle Scholar
  56. Todd L. Veldhuizen. 2014. Triejoin: A Simple, Worst-Case Optimal Join Algorithm. In Proc. 17th International Conference on Database Theory (ICDT), Athens, Greece, March 24-28, 2014, Nicole Schweikardt, Vassilis Christophides, and Vincent Leroy (Eds.). OpenProceedings.org, 96--106. Google ScholarGoogle ScholarCross RefCross Ref
  57. Eugene Wu, Yanlei Diao, and Shariq Rizvi. 2006. High-performance complex event processing over streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006, Surajit Chaudhuri, Vagelis Hristidis, and Neoklis Polyzotis (Eds.). ACM, 407--418. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 16, Issue 5
    January 2023
    208 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 January 2023
    Published in pvldb Volume 16, Issue 5

    Check for updates

    Qualifiers

    • research-article
  • Article Metrics

    • Downloads (Last 12 months)75
    • Downloads (Last 6 weeks)7

    Other Metrics

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader