Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 4/2023

06.05.2023

ROhAN: Row-order agnostic null models for statistically-sound knowledge discovery

verfasst von: Maryam Abuissa, Alexander Lee, Matteo Riondato

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 4/2023

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We introduce a novel class of null models for the statistical validation of results obtained from binary transactional and sequence datasets. Our null models are Row-Order Agnostic (ROA), i.e., do not consider the order of rows in the observed dataset to be fixed, in stark contrast with previous null models, which are Row-Order Enforcing (ROE). We present ROhAN, an algorithmic framework for efficiently sampling datasets from ROA models according to user-specified distributions, which is a necessary step for the resampling-based statistical hypothesis tests employed to validate the results. ROhAN uses Metropolis-Hastings or rejection sampling to build on top of existing or future ROE sampling procedures. Our experimental evaluation shows that ROA models are very different from ROE ones, impacting the statistical validation, and that ROhAN is efficient, mixes fast, and scales well as the dataset grows.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
Throughout this work, we use “significant” to mean “statistically significant”.
 
2
We drop “binary” and just use “transactional” in the rest of this work.
 
3
When considering the order of transactions as fixed, as ROE models do, there is a 1:1 correspondence between transactional datasets and binary matrices. The row sums correspond to the transaction lengths, and the column sum to the supports of single items.
 
4
Preserving properties exactly can partially be incorporated in these null models, but they usually make it impossible to derive a closed form for \(\pi\), with relevant computational consequences. The same is also true for many complex in-expectation constraints (Cimini et al. 2019).
 
5
We do not indicate this fact in the notation, to keep it light.
 
6
Gionis et al. (2007) focus on the case where \(\pi\) is the uniform distribution, but extending their discussion to a generic \(\pi\) is straightforward.
 
7
If not even earlier.
 
8
Some presentations of the algorithms mention a “transaction identifier” associated to each transaction, but this identifier is used only to uniquely label transactions, not for the purpose of ordering the rows, and it is in part a leftover of the idea that a transactional dataset is stored in a table in a relational database.
 
9
We assume \(\left( {\begin{array}{c}0\\ 0,\dotsc , 0\end{array}}\right) = 1\).
 
10
Other definitions of Q are possible. Deriving, for example, a tight lower bound \(b \le \min _{\mathcal {D}\in \mathcal {Z}_{\textrm{A}}} \textsf{c}(\mathcal {D})\) can be used to define \(Q \doteq {\left| {\mathcal {Z}_{\textrm{E}}}\right| }/{(\left| {\mathcal {Z}_{\textrm{A}}}\right| b)}\), which would lead to more samples being accepted. We leave this derivation to future work.
 
12
The other parameters of the generator were left to their default values.
 
Literatur
Zurück zum Zitat Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proc. 20th Int. Conf. Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, VLDB ’94, pp 487–499 Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proc. 20th Int. Conf. Very Large Data Bases. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, VLDB ’94, pp 487–499
Zurück zum Zitat Casella G, Robert CP, Wells MT (2004) Generalized accept-reject sampling schemes. In: A Festschrift for Herman Rubin, IMS Lecture Notes - Monograph Series, vol 45. IMS, p 342–347 Casella G, Robert CP, Wells MT (2004) Generalized accept-reject sampling schemes. In: A Festschrift for Herman Rubin, IMS Lecture Notes - Monograph Series, vol 45. IMS, p 342–347
Zurück zum Zitat Chen Y, Diaconis P, Holmes SP et al. (2005) Sequential monte carlo methods for statistical analysis of tables. J Am Stat Assoc 100(469):109–120MathSciNetCrossRefMATH Chen Y, Diaconis P, Holmes SP et al. (2005) Sequential monte carlo methods for statistical analysis of tables. J Am Stat Assoc 100(469):109–120MathSciNetCrossRefMATH
Zurück zum Zitat Cimini G, Squartini T, Saracco F et al. (2019) The statistical physics of real-world networks. Nature Rev Phys 1(1):58–71CrossRef Cimini G, Squartini T, Saracco F et al. (2019) The statistical physics of real-world networks. Nature Rev Phys 1(1):58–71CrossRef
Zurück zum Zitat Connor EF, Simberloff D (1979) The assembly of species communities: chance or competition? Ecology 60(6):1132–1140CrossRef Connor EF, Simberloff D (1979) The assembly of species communities: chance or competition? Ecology 60(6):1132–1140CrossRef
Zurück zum Zitat Dalleiger S, Vreeken J (2022) Discovering significant patterns under sequential false discovery control. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, KDD ’22 Dalleiger S, Vreeken J (2022) Discovering significant patterns under sequential false discovery control. In: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining. ACM, KDD ’22
Zurück zum Zitat Fout AM (2022) New methods for fixed-margin binary matrix sampling, Fréchet covariance, and MANOVA tests for random objects in multiple metric spaces. PhD thesis, Colorado State University Fout AM (2022) New methods for fixed-margin binary matrix sampling, Fréchet covariance, and MANOVA tests for random objects in multiple metric spaces. PhD thesis, Colorado State University
Zurück zum Zitat Gionis A, Mannila H, Mielikäinen T et al. (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Dis from Data (TKDD) 1(3):14CrossRef Gionis A, Mannila H, Mielikäinen T et al. (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Dis from Data (TKDD) 1(3):14CrossRef
Zurück zum Zitat Gwadera R, Crestani F (2010) Ranking sequential patterns with respect to significance. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp 286–299 Gwadera R, Crestani F (2010) Ranking sequential patterns with respect to significance. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, Springer, pp 286–299
Zurück zum Zitat Hrovat G, Fister IJr, Yermak K, et al. (2015) Interestingness measure for mining sequential patterns in sports. Journal of Intelligent & Fuzzy Systems 29(5):1981–1994 Hrovat G, Fister IJr, Yermak K, et al. (2015) Interestingness measure for mining sequential patterns in sports. Journal of Intelligent & Fuzzy Systems 29(5):1981–1994
Zurück zum Zitat Jenkins S, Walzer-Goldfeld S, Riondato M (2022) SPEck: mining statistically-significant sequential patterns efficiently with exact sampling. Data Min Knowl Disc 36(4):1575–1599MathSciNetCrossRefMATH Jenkins S, Walzer-Goldfeld S, Riondato M (2022) SPEck: mining statistically-significant sequential patterns efficiently with exact sampling. Data Min Knowl Disc 36(4):1575–1599MathSciNetCrossRefMATH
Zurück zum Zitat Lehmann EL, Romano JP (2022) Testing Statistical Hypotheses, 4th edn. Springer, BerlinCrossRefMATH Lehmann EL, Romano JP (2022) Testing Statistical Hypotheses, 4th edn. Springer, BerlinCrossRefMATH
Zurück zum Zitat Low-Kam C, Raïssi C, Kaytoue M, et al. (2013) Mining statistically significant sequential patterns. In: 2013 IEEE 13th International Conference on Data Mining, IEEE, pp 488–497 Low-Kam C, Raïssi C, Kaytoue M, et al. (2013) Mining statistically significant sequential patterns. In: 2013 IEEE 13th International Conference on Data Mining, IEEE, pp 488–497
Zurück zum Zitat Méger N, Rigotti C, Pothier C (2015) Swap randomization of bases of sequences for mining satellite image times series. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 190–205 Méger N, Rigotti C, Pothier C (2015) Swap randomization of bases of sequences for mining satellite image times series. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, pp 190–205
Zurück zum Zitat Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, KDD ’98, pp 274–278 Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, KDD ’98, pp 274–278
Zurück zum Zitat Mitzenmacher M, Upfal E (2005) Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press Mitzenmacher M, Upfal E (2005) Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press
Zurück zum Zitat Pei J, Han J, Mortazavi-Asl B et al. (2004) Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440CrossRef Pei J, Han J, Mortazavi-Asl B et al. (2004) Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng 16(11):1424–1440CrossRef
Zurück zum Zitat Pellegrina L, Riondato M, Vandin F (2019) Hypothesis testing and statistically-sound pattern mining. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, New York, NY, USA, KDD ’19, pp 3215–3216, https://doi.org/10.1145/3292500.3332286, Pellegrina L, Riondato M, Vandin F (2019) Hypothesis testing and statistically-sound pattern mining. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. ACM, New York, NY, USA, KDD ’19, pp 3215–3216, https://​doi.​org/​10.​1145/​3292500.​3332286,
Zurück zum Zitat Pinxteren S, Calders T (2021) Efficient permutation testing for significant sequential patterns. In: Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), SIAM, pp 19–27 Pinxteren S, Calders T (2021) Efficient permutation testing for significant sequential patterns. In: Proceedings of the 2021 SIAM International Conference on Data Mining (SDM), SIAM, pp 19–27
Zurück zum Zitat Preti G, De Francisci Morales G, Riondato M (2022) Alice and the caterpillar: A more descriptive null models for assessing data mining results. In: Proceedings of the 22nd IEEE International Conference on Data Mining, pp 418–427 Preti G, De Francisci Morales G, Riondato M (2022) Alice and the caterpillar: A more descriptive null models for assessing data mining results. In: Proceedings of the 22nd IEEE International Conference on Data Mining, pp 418–427
Zurück zum Zitat Stanley RP (2011) Enumerative Combinatorics, vol 1, 2nd edn. Cambridge University Press Stanley RP (2011) Enumerative Combinatorics, vol 1, 2nd edn. Cambridge University Press
Zurück zum Zitat Tonon A, Vandin F (2019) Permutation strategies for mining significant sequential patterns. In: 2019 IEEE International Conference on Data Mining (ICDM), IEEE, pp 1330–1335 Tonon A, Vandin F (2019) Permutation strategies for mining significant sequential patterns. In: 2019 IEEE International Conference on Data Mining (ICDM), IEEE, pp 1330–1335
Zurück zum Zitat Vreeken J, Tatti N (2014) Interesting patterns. In: Frequent pattern mining. Springer, p 105–134 Vreeken J, Tatti N (2014) Interesting patterns. In: Frequent pattern mining. Springer, p 105–134
Zurück zum Zitat Wang G (2020) A fast MCMC algorithm for the uniform sampling of binary matrices with fixed margins. Electron J Statistics 14(1):1690–1706MathSciNetCrossRefMATH Wang G (2020) A fast MCMC algorithm for the uniform sampling of binary matrices with fixed margins. Electron J Statistics 14(1):1690–1706MathSciNetCrossRefMATH
Zurück zum Zitat Westfall PH, Young SS (1993) Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons Westfall PH, Young SS (1993) Resampling-based multiple testing: Examples and methods for p-value adjustment. John Wiley & Sons
Zurück zum Zitat Zimmermann A (2014) The data problem in data mining. SIGKDD Explor 16(2):38–45CrossRef Zimmermann A (2014) The data problem in data mining. SIGKDD Explor 16(2):38–45CrossRef
Metadaten
Titel
ROhAN: Row-order agnostic null models for statistically-sound knowledge discovery
verfasst von
Maryam Abuissa
Alexander Lee
Matteo Riondato
Publikationsdatum
06.05.2023
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 4/2023
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-023-00938-4

Weitere Artikel der Ausgabe 4/2023

Data Mining and Knowledge Discovery 4/2023 Zur Ausgabe

Premium Partner