skip to main content
column

The Data Problem in Data Mining

Published:21 May 2015Publication History
Skip Abstract Section

Abstract

Computer science is essentially an applied or engineering science, creating tools. In Data Mining, those tools are supposed to help humans understand large amounts of data. In this position paper, I argue that for all the progress that has been made in Data Mining, in particular Pattern Mining, we are lacking insight into three key aspects: 1) How pattern mining algorithms perform quantitatively, 2) How to choose parameter settings, and 3) How to relate found patterns to the processes that generated the data. I illustrate the issue by surveying existing work in light of these concerns and pointing to the (relatively few) papers that have attempted to fill in the gaps. I argue further that progress regarding those questions is held back by a lack of data with varying, controlled properties, and that this lack is unlikely to be remedied by the ever increasing collection of real-life data. Instead, I am convinced that we will need to make a science of digital data generation, and use it to develop guidance to data practitioners.

References

  1. Corsika - an air shower simulation program, https://web.ikp.kit.edu/corsika/.Google ScholarGoogle Scholar
  2. I. Adá and M. R. Berthold. The new iris data: modular data generators. In B. Rao, B. Krishnapuram, A. Tomkins, and Q. Yang, editors, KDD, pages 413--422. ACM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In Proceedings of the 20th International Conference on Very Large Databases, pages 487--499, Santiago de Chile, Chile, Sept. 1994. Morgan Kaufmann. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. C. Anderson. The end of theory: The data deluge makes the scienti_c method obsolete. http://archive.wired.com/science/discoveries/ magazine/16-07/pb_theory. Accessed 08/21/2014.Google ScholarGoogle Scholar
  5. R. J. Bayardo Jr., B. Goethals, and M. J. Zaki, editors. FIMI '04, Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Brighton, UK, November 1, 2004, 2004.Google ScholarGoogle Scholar
  6. J. Besson, C. Rigotti, I. Mitasiunaite, and J.-F. Boulicaut. Parameter tuning for di_erential mining of string patterns. In ICDM Workshops, pages 77--86. IEEE Computer Society, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. T. D. Bie. Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min. Knowl. Discov., 23(3):407--446, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C. Blake and C. Merz. UCI repository of machine learning databases, 1998.Google ScholarGoogle Scholar
  9. M. Boley, T. Gärtner, and H. Grosskreutz. Formal concept sampling for counting and threshold-free local pattern mining. In SDM, pages 177--188. SIAM, 2010.Google ScholarGoogle Scholar
  10. M. Boley and H. Grosskreutz. A randomized approach for approximating the number of frequent sets. In ICDM, pages 43--52. IEEE Computer Society, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. B. Bringmann and A. Zimmermann. Tree2 - Decision trees for tree structured data. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, editors, 9th European Conference on Principles and Practice of Knowledge Discovery in Databases, pages 46--58. Springer, 2005.Google ScholarGoogle Scholar
  12. D. Chakrabarti and C. Faloutsos. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv., 38(1), 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. C. Cooper and M. Zito. Realistic synthetic data for testing association rule mining algorithms for market basket databases. In J. N. Kok, J. Koronacki, R. L. de M_antaras, S. Matwin, D. Mladenic, and A. Skowron, editors, PKDD, volume 4702 of Lecture Notes in Computer Science, pages 398--405. Springer, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. F. Flouvat, F. D. Marchi, and J.-M. Petit. A new classification of datasets for frequent itemsets. J. Intell. Inf.Syst., 34(1):1--19, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Freno, M. Keller, and M. Tommasi. Fiedler random fields: A large-scale spectral approach to statistical network modeling. In P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, NIPS, pages 1871--1879, 2012.Google ScholarGoogle Scholar
  16. F. Geerts, B. Goethals, and J. V. den Bussche. Tight upper bounds on the number of candidate patterns. ACM Trans. Database Syst., 30(2):333--363, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. S. Ghazizadeh and S. S. Chawathe. Seus: Structure extraction using summaries. In S. Lange, K. Satoh, and C. H. Smith, editors, Discovery Science, volume 2534 of Lecture Notes in Computer Science, pages 71--85. Springer, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. Gionis, H.Mannila, T. Mielikäinen, and P. Tsaparas. Assessing data mining results via swap randomization. TKDD, 1(3), 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. B. Goethals and M. J. Zaki, editors. FIMI '03, Frequent Itemset Mining Implementations, Proceedings of the ICDM 2003 Workshop on Frequent Itemset Mining Implementations, 19 December 2003, Melbourne, Florida, USA, volume 90 of CEUR Workshop Proceedings. CEUR-WS.org, 2003.Google ScholarGoogle Scholar
  20. K. Gouda and M. J. Zaki. Genmax: An e_cient algorithm for mining maximal frequent itemsets. Data Min. Knowl. Discov., 11(3):223--242, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. Freespan: frequent pattern-projected sequential pattern mining. In R. Ramakrishnan, S. J. Stolfo, R. J. Bayardo, and I. Parsa, editors, KDD, pages 355--359. ACM, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In W. Chen, J. F. Naughton, and P. A. Bernstein, editors, SIGMOD Conference, pages 1--12. ACM, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. J. Han, B. W. Wah, V. Raghavan, X. Wu, and R. Rastogi, editors. Fifth IEEE International Conference on Data Mining, Houston, Texas, USA, Nov. 2005. IEEE.Google ScholarGoogle Scholar
  24. J. Huan, W. Wang, and J. Prins. E_cient mining of frequent subgraphs in the presence of isomorphism. In ICDM, pages 549--552. IEEE Computer Society, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Inokuchi, T. Washio, and H. Motoda. An aprioribased algorithm for mining frequent substructures from graph data. In D. A. Zighed, H. J. Komorowski, and J. M. Zytkow, editors, PKDD, volume 1910 of Lecture Notes in Computer Science, pages 13--23. Springer, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. A. Inokuchi, T. Washio, K. Nishimura, and H. Motoda. A fast algorithm for mining frequent connected subgraphs. Technical report, IBM Research, 2002.Google ScholarGoogle Scholar
  27. E. F. V. J. J. Down. A plant-wide industrial process control problem. Computers & Chemical Engineering, 17(3):245--255, March 1993.Google ScholarGoogle ScholarCross RefCross Ref
  28. U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In W. Wang, H. Kargupta, S. Ranka, P. S. Yu, and X. Wu, editors, ICDM, pages 229--238. IEEE Computer Society, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. E. Keogh, Q. Zhu, B. Hu, Y. Hao, X. Xi, L. Wei, and C. A. Ratanamahatana. The UCR time series classification/clustering homepage, 2011.Google ScholarGoogle Scholar
  30. M. Kuramochi and G. Karypis. Frequent subgraph discovery. In N. Cercone, T. Y. Lin, and X. Wu, editors, ICDM, pages 313--320. IEEE Computer Society, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph*. Data Min. Knowl. Discov., 11(3):243--271, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. S. Laxman, P. S. Sastry, and K. P. Unnikrishnan. Discovering frequent episodes and learning hidden markov models: A formal connection. IEEE Trans. Knowl. Data Eng., 17(11):1505--1517, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney. Statistical properties of community structure in large social and information networks. In J. Huai, R. Chen, H.-W. Hon, Y. Liu, W.-Y. Ma, A. Tomkins, and X. Zhang, editors, WWW, pages 695--704. ACM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. L. Lhote, F. Rioult, and A. Soulet. Average number of frequent (closed) patterns in bernouilli and markovian databases. In Han et al. {23}, pages 713--716. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. Mampaey and J. Vreeken. Summarizing categorical data by clustering attributes. Data Min. Knowl. Discov., 26(1):130--173, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. M. Mampaey, J. Vreeken, and N. Tatti. Summarizing data succinctly with the most informative itemsets. TKDD, 6(4):16, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. H. Mannila and H. Toivonen. Discovering frequent episodes in sequences. In Proceedings of the First Inter- national Conference on Knowledge Discovery and Data Mining (KDD'95), pages 210--215. AAAI Press, 1995.Google ScholarGoogle Scholar
  38. A. U. Matthijs van Leeuwen. Fast estimation of the pattern frequency spectrum.Google ScholarGoogle Scholar
  39. R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii, and U. Alon. Network motifs: simple building blocks of complex networks. Science, 298(5594):824--827, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  40. S. Nijssen and J. Kok. Frequent subgraph miners: runtimes don't say everything. In T. Gärtner, G. Garriga, and T. Meinl, editors, Proceedings of the Workshop on Mining and Learning with Graphs,, pages 173--180, 2006.Google ScholarGoogle Scholar
  41. G. K. Orman, V. Labatut, and H. Cheri_. Qualitative comparison of community detection algorithms. In H. Cheri_, J. M. Zain, and E. El-Qawasmeh, editors, DICTAP (2), volume 167 of Communications in Computer and Information Science, pages 265--279. Springer, 2011.Google ScholarGoogle Scholar
  42. G. K. Orman, V. Labatut, and H. Cheri_. Towards realistic arti_cial benchmark for community detection algorithms evaluation. IJWBC, 9(3):349--370, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. P. Palmerini, S. Orlando, and R. Perego. Statistical properties of transactional databases. In H. Haddad, A. Omicini, R. L. Wainwright, and L. M. Liebrock, editors, SAC, pages 515--519. ACM, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. J. Pei, J. Han, and R. Mao. Closet: An e_cient algorithm for mining frequent closed itemsets. In ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, pages 21--30, 2000.Google ScholarGoogle Scholar
  45. J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. Prefixspan: Mining sequential patterns by pre_x-projected growth. In D. Georgakopoulos and A. Buchmann, editors, ICDE, pages 215--224. IEEE Computer Society, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Y. Pei and O. Zaïane. A synthetic data generator for clustering and outlier analysis. Technical report, 2006.Google ScholarGoogle Scholar
  47. D. M. Pennock and Q. F. Stout. Exploiting a theory of phase transitions in three-satis_ability problems. In AAAI/IAAI, Vol. 1, pages 253--258, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  48. B. A. Prakash, J. Vreeken, and C. Faloutsos. E_ciently spotting the starting points of an epidemic in a large graph. Knowl. Inf. Syst., 38(1):35--59, 2014.Google ScholarGoogle ScholarCross RefCross Ref
  49. G. Ramesh, W. Maniatty, and M. J. Zaki. Feasible itemset distributions in data mining: theory and application. In PODS, pages 284--295. ACM, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. G. Ramesh, M. J. Zaki, and W. Maniatty. Distributionbased synthetic database generation techniques for itemset mining. In IDEAS, pages 307--316. IEEE Computer Society, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In P. M. G. Apers, M. Bouzeghoub, and G. Gardarin, editors, EDBT, volume 1057 of Lecture Notes in Computer Science, pages 3--17. Springer, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. N. Tatti and J. Vreeken. Discovering descriptive tile trees - by mining optimal geometric subtiles. In P. A. Flach, T. D. Bie, and N. Cristianini, editors, Machine Learning and Knowledge Discovery in Databases - Eu- ropean Conference, ECML PKDD 2012, Bristol, UK, September 24-28, 2012. Proceedings, Part I, volume 7523 of Lecture Notes in Computer Science, pages 9--24. Springer, 2012.Google ScholarGoogle Scholar
  53. N. Tatti and J. Vreeken. The long and the short of it: summarising event sequences with serial episodes. In Q. Yang, D. Agarwal, and J. Pei, editors, The 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD '12, Beijing, China, August 12-16, 2012, pages 462--470. ACM, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. J. Vanschoren, H. Blockeel, B. Pfahringer, and G. Holmes. Experiment databases - A new way to share, organize and learn from experiments. Machine Learning, 87(2):127--158, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. J. Vreeken, M. van Leeuwen, and A. Siebes. Preserving privacy through data generation. In N. Ramakrishnan and O. Zaane, editors, ICDM, pages 685--690. IEEE Computer Society, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. J. Vreeken, M. van Leeuwen, and A. Siebes. Krimp: mining itemsets that compress. Data Min. Knowl. Discov., 23(1):169--214, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. G. I. Webb. Discovering signi_cant patterns. Machine Learning, 68(1):1--33, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. G. I. Webb and J. Vreeken. E_cient discovery of the most interesting associations. Transactions on Knowl- edge Discovery from Data, 8(3):15--1, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. M. Wörlein, T. Meinl, I. Fischer, and M. Philippsen. A quantitative comparison of the subgraph miners mofa, gspan, _sm, and gaston. In A. Jorge, L. Torgo, P. Brazdil, R. Camacho, and J. Gama, editors, PKDD, pages 392--403. Springer, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. X. Yan and J. Han. gspan: Graph-based substructure pattern mining. In ICDM, pages 721--724. IEEE Computer Society, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. M. J. Zaki. Scalable algorithms for association mining. IEEE Trans. Knowl. Data Eng., 12(3):372--390, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. M. J. Zaki. Spade: An e_cient algorithm for mining frequent sequences. Machine Learning, 42(1/2):31--60, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. M. J. Zaki. E_ciently mining frequent trees in a forest. In KDD, pages 71--80. ACM, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. M. J. Zaki and C.-J. Hsiao. Charm: An e_cient algorithm for closed itemset mining. In R. L. Grossman, J. Han, V. Kumar, H. Mannila, and R. Motwani, editors, SDM. SIAM, 2002.Google ScholarGoogle Scholar
  65. Z. Zheng, R. Kohavi, and L. Mason. Real world performance of association rule algorithms. In KDD, pages 401--406, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. A. Zimmermann. Objectively evaluating condensed representations and interestingness measures for frequent itemset mining. Journal of Intelligent Information Systems, pages 1--19, 2013.Google ScholarGoogle Scholar
  67. A. Zimmermann. Understanding episode mining techniques: Benchmarking on diverse, realistic, arti_cial data. Intell. Data Anal., 18(5):761--791, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. A. Zimmermann and B. Bringmann. Ctc - correlating tree patterns for classification. In Han et al. {23}, pages 833--836. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. The Data Problem in Data Mining

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image ACM SIGKDD Explorations Newsletter
            ACM SIGKDD Explorations Newsletter  Volume 16, Issue 2
            December 2014
            49 pages
            ISSN:1931-0145
            EISSN:1931-0153
            DOI:10.1145/2783702
            Issue’s Table of Contents

            Copyright © 2015 Author

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 21 May 2015

            Check for updates

            Qualifiers

            • column

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader