Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 4/2016

01.07.2016

Mining significant association rules from uncertain data

verfasst von: Anshu Zhang, Wenzhong Shi, Geoffrey I. Webb

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 4/2016

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In association rule mining, the trade-off between avoiding harmful spurious rules and preserving authentic ones is an ever critical barrier to obtaining reliable and useful results. The statistically sound technique for evaluating statistical significance of association rules is superior in preventing spurious rules, yet can also cause severe loss of true rules in presence of data error. This study presents a new and improved method for statistical test on association rules with uncertain erroneous data. An original mathematical model was established to describe data error propagation through computational procedures of the statistical test. Based on the error model, a scheme combining analytic and simulative processes was designed to correct the statistical test for distortions caused by data error. Experiments on both synthetic and real-world data show that the method significantly recovers the loss in true rules (reduces type-2 error) due to data error occurring in original statistically sound method. Meanwhile, the new method maintains effective control over the familywise error rate, which is the distinctive advantage of the original statistically sound technique. Furthermore, the method is robust against inaccurate data error probability information and situations not fulfilling the commonly accepted assumption on independent error probabilities of different data items. The method is particularly effective for rules which were most practically meaningful yet sensitive to data error. The method proves promising in enhancing values of association rule mining results and helping users make correct decisions.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
Zurück zum Zitat Aggarwal CC, Li Y, Wang J, Wang J (2009) Frequent pattern mining with uncertain data. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2009), pp 29–38 Aggarwal CC, Li Y, Wang J, Wang J (2009) Frequent pattern mining with uncertain data. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2009), pp 29–38
Zurück zum Zitat Agrawal R, Imielinski T, Swami A (1993) Mining associations between sets of items in massive databases. In: Proceedings of 1993 ACM-SIGMOD international conference on management of data, pp 207–216 Agrawal R, Imielinski T, Swami A (1993) Mining associations between sets of items in massive databases. In: Proceedings of 1993 ACM-SIGMOD international conference on management of data, pp 207–216
Zurück zum Zitat Bastide Y, Pasquier N, Taouil R, Stumme G, Lakhal L (2000) Mining minimal non-redundant association rules using frequent closed itemsets. In: Proceedings of first international conference on computational logic, pp 972–986 Bastide Y, Pasquier N, Taouil R, Stumme G, Lakhal L (2000) Mining minimal non-redundant association rules using frequent closed itemsets. In: Proceedings of first international conference on computational logic, pp 972–986
Zurück zum Zitat Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Disc 5(3):213–246CrossRefMATH Bay SD, Pazzani MJ (2001) Detecting group differences: mining contrast sets. Data Min Knowl Disc 5(3):213–246CrossRefMATH
Zurück zum Zitat Bayardo RJ Jr, Agrawal R, Gunopulos D (2000) Constraint-based rule mining in large, dense databases. Data Min Knowl Disc 4(2/3):217–240CrossRef Bayardo RJ Jr, Agrawal R, Gunopulos D (2000) Constraint-based rule mining in large, dense databases. Data Min Knowl Disc 4(2/3):217–240CrossRef
Zurück zum Zitat Ben-Israel A, Greville TNE (2003) Generalized inverses: theory and applications. Springer, New YorkMATH Ben-Israel A, Greville TNE (2003) Generalized inverses: theory and applications. Springer, New YorkMATH
Zurück zum Zitat Bishop G (2009) Assessing the likely quality of the statistical longitudinal census dataset. Research paper, Australian Bureau of Statistics Bishop G (2009) Assessing the likely quality of the statistical longitudinal census dataset. Research paper, Australian Bureau of Statistics
Zurück zum Zitat Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. In: SIGMOD 1997, proceedings ACM SIGMOD international conference on management of data, pp 265–276 Brin S, Motwani R, Silverstein C (1997) Beyond market baskets: generalizing association rules to correlations. In: SIGMOD 1997, proceedings ACM SIGMOD international conference on management of data, pp 265–276
Zurück zum Zitat Calders T, Garboni C, Goethals B (2010) Approximation of frequentness probability of itemsets in uncertain data. In: Proceedings of IEEE international conference on data mining (ICDM 2010), pp 749–754 Calders T, Garboni C, Goethals B (2010) Approximation of frequentness probability of itemsets in uncertain data. In: Proceedings of IEEE international conference on data mining (ICDM 2010), pp 749–754
Zurück zum Zitat Carvalho JV, Ruiz DD (2013) Discovering frequent itemsets on uncertain data: a systematic review. In: Proceedings of 9th international conference on machine learning and data mining, pp 390–404 Carvalho JV, Ruiz DD (2013) Discovering frequent itemsets on uncertain data: a systematic review. In: Proceedings of 9th international conference on machine learning and data mining, pp 390–404
Zurück zum Zitat Chui CK, Kao B (2008) A decremental approach for mining frequent itemsets from uncertain data. In: Proceedings of 12th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2008), pp 64–75 Chui CK, Kao B (2008) A decremental approach for mining frequent itemsets from uncertain data. In: Proceedings of 12th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2008), pp 64–75
Zurück zum Zitat Chui CK, Kao B, Hung E (2007) Mining frequent itemsets from uncertain data. In: Proceedings of 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 47–58 Chui CK, Kao B, Hung E (2007) Mining frequent itemsets from uncertain data. In: Proceedings of 11th Pacific-Asia conference on knowledge discovery and data mining (PAKDD 2007), pp 47–58
Zurück zum Zitat Foody GM (2002) Status of land cover classification accuracy assessment. Remote Sens Environ 80:185–201CrossRef Foody GM (2002) Status of land cover classification accuracy assessment. Remote Sens Environ 80:185–201CrossRef
Zurück zum Zitat Gray B, Orlowska M (1998) CCAIIA: clustering categorical attributes into interesting association rules. In: Proceedings of 2nd Pacific-Asia conference on knowledge discovery and data mining (PAKDD’98), pp 132–143 Gray B, Orlowska M (1998) CCAIIA: clustering categorical attributes into interesting association rules. In: Proceedings of 2nd Pacific-Asia conference on knowledge discovery and data mining (PAKDD’98), pp 132–143
Zurück zum Zitat Hollister JW, Gonzalez ML, Paul JF, August PV, Copeland JL (2004) Assessing the accuracy of National Land Cover Dataset area estimates at multiple spatial extents. Photogramm Eng Remote Sensing 70:405–414CrossRef Hollister JW, Gonzalez ML, Paul JF, August PV, Copeland JL (2004) Assessing the accuracy of National Land Cover Dataset area estimates at multiple spatial extents. Photogramm Eng Remote Sensing 70:405–414CrossRef
Zurück zum Zitat International Business Machines (1996) IBM intelligent miner user’s guide, version 1, release 1 International Business Machines (1996) IBM intelligent miner user’s guide, version 1, release 1
Zurück zum Zitat Jones N, Lewis D (eds, with Aitken A, Hörngren J, Zilhão MJ) (2003) Handbook on improving quality by analysis of process variables. Final report, Eurostat Jones N, Lewis D (eds, with Aitken A, Hörngren J, Zilhão MJ) (2003) Handbook on improving quality by analysis of process variables. Final report, Eurostat
Zurück zum Zitat Mennis J, Liu JW (2005) Mining association rules in spatio-temporal data: an analysis of urban socioeconomic and land cover change. Trans GIS 9(1):5–17CrossRef Mennis J, Liu JW (2005) Mining association rules in spatio-temporal data: an analysis of urban socioeconomic and land cover change. Trans GIS 9(1):5–17CrossRef
Zurück zum Zitat McDonald JH (2014) Handbook of biological statistics, 3rd edn. Sparky House Publishing, Baltimore McDonald JH (2014) Handbook of biological statistics, 3rd edn. Sparky House Publishing, Baltimore
Zurück zum Zitat Shaffer JP (1995) Multiple hypothesis testing. Annu Rev Psychol 46:561–584CrossRef Shaffer JP (1995) Multiple hypothesis testing. Annu Rev Psychol 46:561–584CrossRef
Zurück zum Zitat Liu B, Hsu W, Ma Y (1999) Pruning and summarizing the discovered associations. In: Proceedings of 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’99), pp 125–134 Liu B, Hsu W, Ma Y (1999) Pruning and summarizing the discovered associations. In: Proceedings of 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD ’99), pp 125–134
Zurück zum Zitat Liu B, Hsu W, Ma Y (2001) Identifying non-actionable association rules. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 329–334 Liu B, Hsu W, Ma Y (2001) Identifying non-actionable association rules. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 329–334
Zurück zum Zitat Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of 4th international conference on knowledge discovery and data mining (KDD ’98), pp 27–78 Megiddo N, Srikant R (1998) Discovering predictive association rules. In: Proceedings of 4th international conference on knowledge discovery and data mining (KDD ’98), pp 27–78
Zurück zum Zitat Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Piatetsky-Shapiro G, Frawley J (eds) Knowledge discovery in databases. AAAI/MIT Press, Menlo Park, pp 229–248 Piatetsky-Shapiro G (1991) Discovery, analysis, and presentation of strong rules. In: Piatetsky-Shapiro G, Frawley J (eds) Knowledge discovery in databases. AAAI/MIT Press, Menlo Park, pp 229–248
Zurück zum Zitat Rao CR, Mitra SK (1972) Generalized inverse of a matrix and its applications. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 1: theory of statistics, pp 601–620 Rao CR, Mitra SK (1972) Generalized inverse of a matrix and its applications. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 1: theory of statistics, pp 601–620
Zurück zum Zitat Smith JH, Stehman SV, Wickham JD, Yang L (2003) Effects of landscape characteristics on land-cover class accuracy. Remote Sens Environ 84:342–349CrossRef Smith JH, Stehman SV, Wickham JD, Yang L (2003) Effects of landscape characteristics on land-cover class accuracy. Remote Sens Environ 84:342–349CrossRef
Zurück zum Zitat Srikant R, Agrawal R (1995) Mining generalized association rules. In: Proceedings of 21st international conference on very large data bases, pp 407–419 Srikant R, Agrawal R (1995) Mining generalized association rules. In: Proceedings of 21st international conference on very large data bases, pp 407–419
Zurück zum Zitat Stehman SV, Wickham JD, Wade TG, Smith JH (2008) Designing a multi-objective, multi-support accuracy assessment of the 2001 National Land Cover Data (NLCD 2001) of the conterminous United States. Photogramm Eng Remote Sensing 74:1561–1571CrossRef Stehman SV, Wickham JD, Wade TG, Smith JH (2008) Designing a multi-objective, multi-support accuracy assessment of the 2001 National Land Cover Data (NLCD 2001) of the conterminous United States. Photogramm Eng Remote Sensing 74:1561–1571CrossRef
Zurück zum Zitat Sun L, Cheng R, Cheung DW, Cheng J (2010) Mining uncertain data with probabilistic guarantees. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2010), pp 273–282 Sun L, Cheng R, Cheung DW, Cheng J (2010) Mining uncertain data with probabilistic guarantees. In: Proceedings of 17th international conference on knowledge discovery and data mining (KDD 2010), pp 273–282
Zurück zum Zitat Ting KM (2011) Confusion matrix. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning, 1st edn. Springer, New York Ting KM (2011) Confusion matrix. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning, 1st edn. Springer, New York
Zurück zum Zitat Tong Y, Chen L, Ding B (2012) Discovering threshold-based frequent closed itemsets over probabilistic data. In: Proceedings of 28th international conference on data engineering, pp 270–281 Tong Y, Chen L, Ding B (2012) Discovering threshold-based frequent closed itemsets over probabilistic data. In: Proceedings of 28th international conference on data engineering, pp 270–281
Zurück zum Zitat Webb GI (2007) Discovering significant patterns. Mach Learn 68:1–33CrossRef Webb GI (2007) Discovering significant patterns. Mach Learn 68:1–33CrossRef
Zurück zum Zitat Yang L, Stehman SV, Smith JH, Wickham JD (2001) Thematic accuracy of MRLC land cover for eastern United States. Remote Sens Environ 76:418–422CrossRef Yang L, Stehman SV, Smith JH, Wickham JD (2001) Thematic accuracy of MRLC land cover for eastern United States. Remote Sens Environ 76:418–422CrossRef
Zurück zum Zitat Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2000), pp 34–43 Zaki MJ (2000) Generating non-redundant association rules. In: Proceedings of 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD-2000), pp 34–43
Zurück zum Zitat Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of 10th international conference on knowledge discovery and data mining (KDD 2004), pp 374–383 Zhang H, Padmanabhan B, Tuzhilin A (2004) On the discovery of significant statistical quantitative rules. In: Proceedings of 10th international conference on knowledge discovery and data mining (KDD 2004), pp 374–383
Zurück zum Zitat Zhu XQ, Wu XD (2006) Error awareness data mining. In: 2006 IEEE international conference on granular computing, pp 269–274 Zhu XQ, Wu XD (2006) Error awareness data mining. In: 2006 IEEE international conference on granular computing, pp 269–274
Zurück zum Zitat Zhu XQ, Wu XD, Yang Y (2004) Error detection and impact-sensitive instance ranking in noisy datasets. In: Proceedings of 19th national conference on artificial intelligence, pp 378–383 Zhu XQ, Wu XD, Yang Y (2004) Error detection and impact-sensitive instance ranking in noisy datasets. In: Proceedings of 19th national conference on artificial intelligence, pp 378–383
Metadaten
Titel
Mining significant association rules from uncertain data
verfasst von
Anshu Zhang
Wenzhong Shi
Geoffrey I. Webb
Publikationsdatum
01.07.2016
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 4/2016
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-015-0446-6

Weitere Artikel der Ausgabe 4/2016

Data Mining and Knowledge Discovery 4/2016 Zur Ausgabe

Premium Partner