DOI: 10.1145/1871437.1871494
research-article

Accelerating probabilistic frequent itemset mining: a model-based approach

Published: 26 October 2010

ABSTRACT

Data uncertainty is inherent in emerging applications such as location-based services, sensor monitoring systems, and data integration. Uncertain databases have recently been developed to handle large amounts of imprecise information. In this paper, we study how to efficiently discover frequent itemsets from large uncertain databases, interpreted under the Possible World Semantics. This is technically challenging, since an uncertain database induces an exponential number of possible worlds. To tackle this problem, we propose a novel method that models the itemset mining process with a Poisson binomial distribution. This model-based approach extracts frequent itemsets with a high degree of accuracy and supports large databases. We apply our techniques to improve the performance of algorithms for: (1) finding itemsets whose frequentness probabilities are larger than some threshold; and (2) mining itemsets with the k highest frequentness probabilities. Our approaches support both the tuple and attribute uncertainty models, which are commonly used to represent uncertain databases. Extensive evaluation on real and synthetic datasets shows that our methods are highly accurate. Moreover, they are orders of magnitude faster than previous approaches.
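
To make the Poisson binomial view concrete, the following is a minimal sketch, not the paper's algorithm. It assumes a simplified setting in which each transaction contains a given itemset independently with a known probability, so the support count follows a Poisson binomial distribution. The function names `exact_frequentness` and `poisson_frequentness` and the parameter `minsup` are hypothetical. The first function computes the frequentness probability exactly by dynamic programming. The second uses a Poisson approximation, which is one standard way to approximate a Poisson binomial and is an assumption here rather than a claim about the authors' exact procedure.

```python
import math


def exact_frequentness(probs, minsup):
    """Exact P(support >= minsup) for independent per-transaction probabilities.

    Builds the Poisson binomial pmf by dynamic programming in O(n^2) time.
    """
    # dist[k] = probability that exactly k of the transactions processed
    # so far contain the itemset.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # itemset absent from this transaction
            nxt[k + 1] += mass * p       # itemset present in this transaction
        dist = nxt
    return sum(dist[minsup:])


def poisson_frequentness(probs, minsup):
    """Approximate P(support >= minsup) using a Poisson with mean sum(probs).

    Illustrative assumption: the approximation is reasonable when the
    individual probabilities are small. Runs in O(n + minsup) time.
    """
    mu = sum(probs)
    term = math.exp(-mu)  # Poisson P(X = 0)
    cdf = 0.0
    for k in range(minsup):
        cdf += term            # accumulate P(X = k)
        term *= mu / (k + 1)   # advance to P(X = k + 1)
    return 1.0 - cdf


if __name__ == "__main__":
    probs = [0.01] * 1000  # hypothetical existence probabilities
    print(exact_frequentness(probs, 10))
    print(poisson_frequentness(probs, 10))
```

Under these assumptions, an itemset would be reported for task (1) when the returned probability exceeds the chosen threshold, and task (2) would rank itemsets by the same value. The approximation replaces a quadratic computation with a pass that needs only the expected support.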

Published in

CIKM '10: Proceedings of the 19th ACM international conference on Information and knowledge management
October 2010
2036 pages
ISBN: 9781450300995
DOI: 10.1145/1871437

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
