Skip to main content
Top
Published in: International Journal of Machine Learning and Cybernetics 5/2016

01-10-2016 | Original Article

Speeding up maximal fully-correlated itemsets search in large databases

Authors: Lian Duan, W. Nick Street

Published in: International Journal of Machine Learning and Cybernetics | Issue 5/2016

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Finding the most interesting correlations among items is essential for problems in many commercial, medical, and scientific domains. Our previous work on the maximal fully-correlated itemset (MFCI) framework can rule out the itemsets with irrelevant items and its downward-closed property helps to achieve good computational performance. However, to calculate the desired MFCIs in large databases, there are still two computational issues. First, unlike finding maximal frequent itemsets which can start the pruning from 1-itemsets, finding MFCIs must start the pruning from 2-itemsets. When the number of items in a given dataset is large and the support of all the pairs cannot be loaded into the memory, the IO cost (\(O(n^2)\)) for calculating correlation of all the pairs can be very high. Second, users usually need to try different correlation thresholds for different desirable MFCIs. Therefore, the cost of processing the Apriori procedure each time for a different correlation threshold is also very high. Consequently, we proposed two techniques to solve these problems. First, we identify the correlation upper bound for any good correlation measure to avoid unnecessary IO query for the support of pairs, and make use of their common monotone property to prune many pairs even without computing their correlation upper bounds. In addition, we build an enumeration tree to save the fully-correlated value for all the MFCIs under a given initial correlation threshold. We can either efficiently retrieve the desired MFCIs for any given threshold above the initial threshold or incrementally grow the tree if the given threshold is below the initial threshold. Experimental results show that our algorithm can be an order of magnitude faster than the original MFCI algorithm.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Show more products
Literature
1.
go back to reference Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: SIGMOD ’93: Proceedings of the ACM SIGMOD international conference on management of data. ACM, New York, pp 207–216 Agrawal R, Imieliński T, Swami A (1993) Mining association rules between sets of items in large databases. In: SIGMOD ’93: Proceedings of the ACM SIGMOD international conference on management of data. ACM, New York, pp 207–216
2.
go back to reference Bayardo RJ, Jr. (1998) Efficiently mining long patterns from databases. In: SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international conference on management of data. ACM, New York, pp 85–93 Bayardo RJ, Jr. (1998) Efficiently mining long patterns from databases. In: SIGMOD ’98: Proceedings of the 1998 ACM SIGMOD international conference on management of data. ACM, New York, pp 85–93
3.
go back to reference Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: Generalizing association rules to correlations. In: SIGMOD ’97: Proceedings ACM SIGMOD international conference on management of data. ACM, New York, pp 265–276 Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: Generalizing association rules to correlations. In: SIGMOD ’97: Proceedings ACM SIGMOD international conference on management of data. ACM, New York, pp 265–276
4.
go back to reference Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: SIGMOD ’97: Proceedings of the ACM SIGMOD international conference on management of data. ACM, New York, pp 255–264 Brin S, Motwani R, Ullman JD, Tsur S (1997) Dynamic itemset counting and implication rules for market basket data. In: SIGMOD ’97: Proceedings of the ACM SIGMOD international conference on management of data. ACM, New York, pp 255–264
5.
go back to reference Burdick D (2001) Mafia: A maximal frequent itemset algorithm for transactional databases. In: ICDE ’01: Proceedings of the 17th international conference on data engineering. IEEE Computer Society, Washington, DC, p 443 Burdick D (2001) Mafia: A maximal frequent itemset algorithm for transactional databases. In: ICDE ’01: Proceedings of the 17th international conference on data engineering. IEEE Computer Society, Washington, DC, p 443
6.
go back to reference Duan L, Khoshneshin M, Street W, Liu M (2013) Adverse drug effect detection. IEEE J Biomed Health Inf 17(2):305–311CrossRef Duan L, Khoshneshin M, Street W, Liu M (2013) Adverse drug effect detection. IEEE J Biomed Health Inf 17(2):305–311CrossRef
7.
go back to reference Duan L, Street WN (2009) Finding maximal fully-correlated itemsets in large databases. In: ICDM ’09: Proceedings of the 9th international conference on data mining. IEEE Computer Society, Miami, pp 770–775 Duan L, Street WN (2009) Finding maximal fully-correlated itemsets in large databases. In: ICDM ’09: Proceedings of the 9th international conference on data mining. IEEE Computer Society, Miami, pp 770–775
8.
go back to reference Duan L, Street WN, Liu Y (2013) Speeding up correlation search for binary data. Pattern Recognit Lett 34(13):1499–1507CrossRef Duan L, Street WN, Liu Y (2013) Speeding up correlation search for binary data. Pattern Recognit Lett 34(13):1499–1507CrossRef
9.
go back to reference Duan L, Street WN, Liu Y, Lu H (2014) Community detection in graphs through correlation. In: The 20th ACM SIGKDD conference on knowledge discovery and data mining (accepted) Duan L, Street WN, Liu Y, Lu H (2014) Community detection in graphs through correlation. In: The 20th ACM SIGKDD conference on knowledge discovery and data mining (accepted)
10.
go back to reference Duan L, Street WN, Liu Y, Xu S, Wu B (2014) Selecting the right correlation measure for binary data. ACM transactions on knowledge discovery from data (accepted) Duan L, Street WN, Liu Y, Xu S, Wu B (2014) Selecting the right correlation measure for binary data. ACM transactions on knowledge discovery from data (accepted)
11.
go back to reference Dunning T (1993) Accurate methods for the statistics of surprise and coincidence. Comput Linguist 19(1):61–74 Dunning T (1993) Accurate methods for the statistics of surprise and coincidence. Comput Linguist 19(1):61–74
12.
go back to reference Geng L, Hamilton HJ (2006) Interestingness measures for data mining: A survey. ACM Comput Surv 38(3):9CrossRef Geng L, Hamilton HJ (2006) Interestingness measures for data mining: A survey. ACM Comput Surv 38(3):9CrossRef
13.
go back to reference Gouda K, Zaki MJ (2005) Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Min Knowl Discov 11(3):223–242MathSciNetCrossRef Gouda K, Zaki MJ (2005) Genmax: An efficient algorithm for mining maximal frequent itemsets. Data Min Knowl Discov 11(3):223–242MathSciNetCrossRef
14.
go back to reference Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on management of data. ACM, New York, pp 1–12 Han J, Pei J, Yin Y (2000) Mining frequent patterns without candidate generation. In: SIGMOD ’00: Proceedings of the 2000 ACM SIGMOD international conference on management of data. ACM, New York, pp 1–12
15.
go back to reference Jermaine C (2005) Finding the most interesting correlations in a database: How hard can it be? Inf Syst 30(1):21–46CrossRef Jermaine C (2005) Finding the most interesting correlations in a database: How hard can it be? Inf Syst 30(1):21–46CrossRef
16.
go back to reference Liu M, Hinz ERM, Matheny ME, Denny JC, Schildcrout JS, Miller RA, Xu H (2013) Comparative analysis of pharmacovigilance methods in the detection of adverse drug reactions using electronic medical records. J Am Med Inform Assoc 20(3):420–426CrossRef Liu M, Hinz ERM, Matheny ME, Denny JC, Schildcrout JS, Miller RA, Xu H (2013) Comparative analysis of pharmacovigilance methods in the detection of adverse drug reactions using electronic medical records. J Am Med Inform Assoc 20(3):420–426CrossRef
18.
go back to reference Omiecinski ER (2003) Alternative interest measures for mining associations in databases. IEEE Trans Knowl Data Eng 15(1):57–69MathSciNetCrossRef Omiecinski ER (2003) Alternative interest measures for mining associations in databases. IEEE Trans Knowl Data Eng 15(1):57–69MathSciNetCrossRef
19.
go back to reference Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. AAAI/MIT Press, Cambridge Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. AAAI/MIT Press, Cambridge
20.
go back to reference Tan P-N, Kumar V, Srivastava J (2004) Selecting the right objective measure for association analysis. Inf Syst 29(4):293–313CrossRef Tan P-N, Kumar V, Srivastava J (2004) Selecting the right objective measure for association analysis. Inf Syst 29(4):293–313CrossRef
21.
go back to reference Tew C, Giraud-Carrier C, Tanner K, Burton S (2014) Behavior-based clustering and analysis of interestingness measures for association rule mining. Data Min Knowl Discov 28(4):1004–1045 Tew C, Giraud-Carrier C, Tanner K, Burton S (2014) Behavior-based clustering and analysis of interestingness measures for association rule mining. Data Min Knowl Discov 28(4):1004–1045
Metadata
Title
Speeding up maximal fully-correlated itemsets search in large databases
Authors
Lian Duan
W. Nick Street
Publication date
01-10-2016
Publisher
Springer Berlin Heidelberg
Published in
International Journal of Machine Learning and Cybernetics / Issue 5/2016
Print ISSN: 1868-8071
Electronic ISSN: 1868-808X
DOI
https://doi.org/10.1007/s13042-014-0290-9

Other articles of this Issue 5/2016

International Journal of Machine Learning and Cybernetics 5/2016 Go to the issue