Abstract
Computer science is essentially an applied or engineering science, creating tools. In Data Mining, those tools are supposed to help humans understand large amounts of data. In this position paper, I argue that for all the progress that has been made in Data Mining, in particular Pattern Mining, we are lacking insight into three key aspects: 1) How pattern mining algorithms perform quantitatively, 2) How to choose parameter settings, and 3) How to relate found patterns to the processes that generated the data. I illustrate the issue by surveying existing work in light of these concerns and pointing to the (relatively few) papers that have attempted to fill in the gaps. I argue further that progress regarding those questions is held back by a lack of data with varying, controlled properties, and that this lack is unlikely to be remedied by the ever increasing collection of real-life data. Instead, I am convinced that we will need to make a science of digital data generation, and use it to develop guidance to data practitioners.
- Corsika - an air shower simulation program, https://web.ikp.kit.edu/corsika/.Google Scholar
- I. Adá and M. R. Berthold. The new iris data: modular data generators. In B. Rao, B. Krishnapuram, A. Tomkins, and Q. Yang, editors, KDD, pages 413--422. ACM, 2010. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Databases, pages 487--499, Santiago de Chile, Chile, Sept. 1994. Morgan Kaufmann. Google ScholarDigital Library
- C. Anderson. The end of theory: The data deluge makes the scienti_c method obsolete. http://archive.wired.com/science/discoveries/ magazine/16-07/pb_theory. Accessed 08/21/2014.Google Scholar
- R. J. Bayardo Jr., B. Goethals, and M. J. Zaki, editors. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, 2004.Google Scholar
- J. Besson, C. Rigotti, I. Mitasiunaite, and J.-F. Boulicaut. Parameter tuning for di_erential mining of string patterns. In ICDM Workshops, pages 77--86. IEEE Computer Society, 2008. Google ScholarDigital Library
- T. D. Bie. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min. Knowl. Discov., 23(3):407--446, 2011. Google ScholarDigital Library
- C. Blake and C. Merz. UCI repository of machine learning databases, 1998.Google Scholar
- M. Boley, T. Gärtner, and H. Grosskreutz. Formal concept sampling for counting and threshold-free local pattern mining. In SDM, pages 177--188. SIAM, 2010.Google Scholar
- M. Boley and H. Grosskreutz. A randomized approach for approximating the number of frequent sets. In ICDM, pages 43--52. IEEE Computer Society, 2008. Google ScholarDigital Library
- B. Bringmann and A. Zimmermann. Tree2 - Decision trees for tree structured data. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, editors, 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 46--58. Springer, 2005.Google Scholar
- D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38(1), 2006. Google ScholarDigital Library
- C. Cooper and M. Zito. Realistic synthetic data for testing association rule mining algorithms for market basket databases. In J. N. Kok, J. Koronacki, R. L. de M_antaras, S. Matwin, D. Mladenic, and A. Skowron, editors, PKDD, volume 4702 of Lecture Notes in Computer Science, pages 398--405. Springer, 2007. Google ScholarDigital Library
- F. Flouvat, F. D. Marchi, and J.-M. Petit. A new classification of datasets for frequent itemsets. J. Intell. Inf.Syst., 34(1):1--19, 2010. Google ScholarDigital Library
- A. Freno, M. Keller, and M. Tommasi. Fiedler random fields: A large-scale spectral approach to statistical network modeling. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS, pages 1871--1879, 2012.Google Scholar
- F. Geerts, B. Goethals, and J. V. den Bussche. Tight upper bounds on the number of candidate patterns. ACM Trans. Database Syst., 30(2):333--363, 2005. Google ScholarDigital Library
- S. Ghazizadeh and S. S. Chawathe. Seus: Structure extraction using summaries. In S. Lange, K. Satoh, and C. H. Smith, editors, Discovery Science, volume 2534 of Lecture Notes in Computer Science, pages 71--85. Springer, 2002. Google ScholarDigital Library
- A. Gionis, H.Mannila, T. Mielikäinen, and P. Tsaparas. Assessing data mining results via swap randomization. TKDD, 1(3), 2007. Google ScholarDigital Library
- B. Goethals and M. J. Zaki, editors. FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.Google Scholar
- K. Gouda and M. J. Zaki. Genmax: An e_cient algorithm for mining maximal frequent itemsets. Data Min. Knowl. Discov., 11(3):223--242, 2005. Google ScholarDigital Library
- J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. Freespan: frequent pattern-projected sequential pattern mining. In R. Ramakrishnan, S. J. Stolfo, R. J. Bayardo, and I. Parsa, editors, KDD, pages 355--359. ACM, 2000. Google ScholarDigital Library
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, SIGMOD Conference, pages 1--12. ACM, 2000. Google ScholarDigital Library
- J. Han, B. W. Wah, V. Raghavan, X. Wu, and R. Rastogi, editors. Fifth IEEE International Conference on Data Mining, Houston, Texas, USA, Nov. 2005. IEEE.Google Scholar
- J. Huan, W. Wang, and J. Prins. E_cient mining of frequent subgraphs in the presence of isomorphism. In ICDM, pages 549--552. IEEE Computer Society, 2003. Google ScholarDigital Library
- A. Inokuchi, T. Washio, and H. Motoda. An aprioribased algorithm for mining frequent substructures from graph data. In D. A. Zighed, H. J. Komorowski, and J. M. Zytkow, editors, PKDD, volume 1910 of Lecture Notes in Computer Science, pages 13--23. Springer, 2000. Google ScholarDigital Library
- A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda. A fast algorithm for mining frequent connected subgraphs. Technical report, IBM Research, 2002.Google Scholar
- E. F. V. J. J. Down. A plant-wide industrial process control problem. Computers & Chemical Engineering, 17(3):245--255, March 1993.Google ScholarCross Ref
- U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In W. Wang, H. Kargupta, S. Ranka, P. S. Yu, and X. Wu, editors, ICDM, pages 229--238. IEEE Computer Society, 2009. Google ScholarDigital Library
- E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification/clustering homepage, 2011.Google Scholar
- M. Kuramochi and G. Karypis. Frequent subgraph discovery. In N. Cercone, T. Y. Lin, and X. Wu, editors, ICDM, pages 313--320. IEEE Computer Society, 2001. Google ScholarDigital Library
- M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph*. Data Min. Knowl. Discov., 11(3):243--271, 2005. Google ScholarDigital Library
- S. Laxman, P. S. Sastry, and K. P. Unnikrishnan. Discovering frequent episodes and learning hidden markov models: A formal connection. IEEE Trans. Knowl. Data Eng., 17(11):1505--1517, 2005. Google ScholarDigital Library
- J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, editors, WWW, pages 695--704. ACM, 2008. Google ScholarDigital Library
- L. Lhote, F. Rioult, and A. Soulet. Average number of frequent (closed) patterns in bernouilli and markovian databases. In Han et al. {23}, pages 713--716. Google ScholarDigital Library
- M. Mampaey and J. Vreeken. Summarizing categorical data by clustering attributes. Data Min. Knowl. Discov., 26(1):130--173, 2013. Google ScholarDigital Library
- M. Mampaey, J. Vreeken, and N. Tatti. Summarizing data succinctly with the most informative itemsets. TKDD, 6(4):16, 2012. Google ScholarDigital Library
- H. Mannila and H. Toivonen. Discovering frequent episodes in sequences. In Proceedings of the First Inter- national Conference on Knowledge Discovery and Data Mining (KDD'95), pages 210--215. AAAI Press, 1995.Google Scholar
- A. U. Matthijs van Leeuwen. Fast estimation of the pattern frequency spectrum.Google Scholar
- R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824--827, 2002.Google ScholarCross Ref
- S. Nijssen and J. Kok. Frequent subgraph miners: runtimes don't say everything. In T. Gärtner, G. Garriga, and T. Meinl, editors, Proceedings of the Workshop on Mining and Learning with Graphs,, pages 173--180, 2006.Google Scholar
- G. K. Orman, V. Labatut, and H. Cheri_. Qualitative comparison of community detection algorithms. In H. Cheri_, J. M. Zain, and E. El-Qawasmeh, editors, DICTAP (2), volume 167 of Communications in Computer and Information Science, pages 265--279. Springer, 2011.Google Scholar
- G. K. Orman, V. Labatut, and H. Cheri_. Towards realistic arti_cial benchmark for community detection algorithms evaluation. IJWBC, 9(3):349--370, 2013. Google ScholarDigital Library
- P. Palmerini, S. Orlando, and R. Perego. Statistical properties of transactional databases. In H. Haddad, A. Omicini, R. L. Wainwright, and L. M. Liebrock, editors, SAC, pages 515--519. ACM, 2004. Google ScholarDigital Library
- J. Pei, J. Han, and R. Mao. Closet: An e_cient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21--30, 2000.Google Scholar
- J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. Prefixspan: Mining sequential patterns by pre_x-projected growth. In D. Georgakopoulos and A. Buchmann, editors, ICDE, pages 215--224. IEEE Computer Society, 2001. Google ScholarDigital Library
- Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, 2006.Google Scholar
- D. M. Pennock and Q. F. Stout. Exploiting a theory of phase transitions in three-satis_ability problems. In AAAI/IAAI, Vol. 1, pages 253--258, 1996. Google ScholarDigital Library
- B. A. Prakash, J. Vreeken, and C. Faloutsos. E_ciently spotting the starting points of an epidemic in a large graph. Knowl. Inf. Syst., 38(1):35--59, 2014.Google ScholarCross Ref
- G. Ramesh, W. Maniatty, and M. J. Zaki. Feasible itemset distributions in data mining: theory and application. In PODS, pages 284--295. ACM, 2003. Google ScholarDigital Library
- G. Ramesh, M. J. Zaki, and W. Maniatty. Distributionbased synthetic database generation techniques for itemset mining. In IDEAS, pages 307--316. IEEE Computer Society, 2005. Google ScholarDigital Library
- R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In P. M. G. Apers, M. Bouzeghoub, and G. Gardarin, editors, EDBT, volume 1057 of Lecture Notes in Computer Science, pages 3--17. Springer, 1996. Google ScholarDigital Library
- N. Tatti and J. Vreeken. Discovering descriptive tile trees - by mining optimal geometric subtiles. In P. A. Flach, T. D. Bie, and N. Cristianini, editors, Machine Learning and Knowledge Discovery in Databases - Eu- ropean Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part I, volume 7523 of Lecture Notes in Computer Science, pages 9--24. Springer, 2012.Google Scholar
- N. Tatti and J. Vreeken. The long and the short of it: summarising event sequences with serial episodes. In Q. Yang, D. Agarwal, and J. Pei, editors, The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12-16, 2012, pages 462--470. ACM, 2012. Google ScholarDigital Library
- J. Vanschoren, H. Blockeel, B. Pfahringer, and G. Holmes. Experiment databases - A new way to share, organize and learn from experiments. Machine Learning, 87(2):127--158, 2012. Google ScholarDigital Library
- J. Vreeken, M. van Leeuwen, and A. Siebes. Preserving privacy through data generation. In N. Ramakrishnan and O. Zaane, editors, ICDM, pages 685--690. IEEE Computer Society, 2007. Google ScholarDigital Library
- J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: mining itemsets that compress. Data Min. Knowl. Discov., 23(1):169--214, 2011. Google ScholarDigital Library
- G. I. Webb. Discovering signi_cant patterns. Machine Learning, 68(1):1--33, 2007. Google ScholarDigital Library
- G. I. Webb and J. Vreeken. E_cient discovery of the most interesting associations. Transactions on Knowl- edge Discovery from Data, 8(3):15--1, 2014. Google ScholarDigital Library
- M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen. A quantitative comparison of the subgraph miners mofa, gspan, _sm, and gaston. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, editors, PKDD, pages 392--403. Springer, 2005. Google ScholarDigital Library
- X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In ICDM, pages 721--724. IEEE Computer Society, 2002. Google ScholarDigital Library
- M. J. Zaki. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng., 12(3):372--390, 2000. Google ScholarDigital Library
- M. J. Zaki. Spade: An e_cient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31--60, 2001. Google ScholarDigital Library
- M. J. Zaki. E_ciently mining frequent trees in a forest. In KDD, pages 71--80. ACM, 2002. Google ScholarDigital Library
- M. J. Zaki and C.-J. Hsiao. Charm: An e_cient algorithm for closed itemset mining. In R. L. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani, editors, SDM. SIAM, 2002.Google Scholar
- Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD, pages 401--406, 2001. Google ScholarDigital Library
- A. Zimmermann. Objectively evaluating condensed representations and interestingness measures for frequent itemset mining. Journal of Intelligent Information Systems, pages 1--19, 2013.Google Scholar
- A. Zimmermann. Understanding episode mining techniques: Benchmarking on diverse, realistic, arti_cial data. Intell. Data Anal., 18(5):761--791, 2014. Google ScholarDigital Library
- A. Zimmermann and B. Bringmann. Ctc - correlating tree patterns for classification. In Han et al. {23}, pages 833--836. Google ScholarDigital Library
Index Terms
- The Data Problem in Data Mining
Recommendations
Mining uncertain data
As an important data mining and knowledge discovery task, association rule mining searches for implicit, previously unknown, and potentially useful pieces of information—in the form of rules revealing associative relationships—that are embedded in the ...
Mining fuzzy specific rare itemsets for education data
Association rule mining is an important data analysis method for the discovery of associations within data. There have been many studies focused on finding fuzzy association rules from transaction databases. Unfortunately, in the real world, one may ...
Comments