Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 5-6/2014

01.09.2014

Unsupervised interaction-preserving discretization of multivariate data

verfasst von: Hoang-Vu Nguyen, Emmanuel Müller, Jilles Vreeken, Klemens Böhm

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 5-6/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain data that retains as much information in the continuous original as possible. In general, but in particular for exploratory tasks, a key open question is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and combines them if (a) their multivariate data distributions are statistically similar, and (b) this merge reduces the MDL encoding cost. To assess the similarities, we propose \( ID \), a novel interaction distance that does not require assuming a distribution and permits computation in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
Zurück zum Zitat Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD Conference, p 37–46. Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD Conference, p 37–46.
Zurück zum Zitat Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, p 94–105. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, p 94–105.
Zurück zum Zitat Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, p 415–424. Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, p 415–424.
Zurück zum Zitat Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843CrossRefMATH Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843CrossRefMATH
Zurück zum Zitat Aue A, Hörmann S, Horváth L, Reimherr M (2009) Break detection in the covariance structure of multivariate time series models. Ann Stat 37(6B):4046–4087CrossRefMATH Aue A, Hörmann S, Horváth L, Reimherr M (2009) Break detection in the covariance structure of multivariate time series models. Ann Stat 37(6B):4046–4087CrossRefMATH
Zurück zum Zitat Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491–512CrossRefMATH Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491–512CrossRefMATH
Zurück zum Zitat Bay SD, Pazzani MJ (1999) Detecting change in categorical data: Mining contrast sets. In: KDD, p 302–306. Bay SD, Pazzani MJ (1999) Detecting change in categorical data: Mining contrast sets. In: KDD, p 302–306.
Zurück zum Zitat Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598CrossRefMATHMathSciNet Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598CrossRefMATHMathSciNet
Zurück zum Zitat Breunig MM, Kriegel HP, Raymond T Ng JS (2000) LOF: identifying density-based local outliers. In: SIGMOD Conference, p 93–104. Breunig MM, Kriegel HP, Raymond T Ng JS (2000) LOF: identifying density-based local outliers. In: SIGMOD Conference, p 93–104.
Zurück zum Zitat Bu S, Lakshmanan LVS, Ng RT (2005) Mdl summarization with holes. In: VLDB, p 433–444. Bu S, Lakshmanan LVS, Ng RT (2005) Mdl summarization with holes. In: VLDB, p 433–444.
Zurück zum Zitat Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: KDD, p 84–93. Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: KDD, p 84–93.
Zurück zum Zitat Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New YorkMATH Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New YorkMATH
Zurück zum Zitat Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MATHMathSciNet Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MATHMathSciNet
Zurück zum Zitat Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, p 1022–1029. Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, p 1022–1029.
Zurück zum Zitat Ferrandiz S, Boullé M (2005) Multivariate discretization by recursive supervised bipartition of graph. In: MLDM, p 253–264. Ferrandiz S, Boullé M (2005) Multivariate discretization by recursive supervised bipartition of graph. In: MLDM, p 253–264.
Zurück zum Zitat Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2):210–226CrossRefMathSciNet Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2):210–226CrossRefMathSciNet
Zurück zum Zitat Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge
Zurück zum Zitat Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD Conference, p 463–474. Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD Conference, p 463–474.
Zurück zum Zitat Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15:55–86CrossRefMathSciNet Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15:55–86CrossRefMathSciNet
Zurück zum Zitat Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. In: KSEM, p 556–562. Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. In: KSEM, p 556–562.
Zurück zum Zitat Kerber R (1992) ChiMerge: discretization of numeric attributes. In: AAAI, p 123–128. Kerber R (1992) ChiMerge: discretization of numeric attributes. In: AAAI, p 123–128.
Zurück zum Zitat Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. J Mach Learn Res 2:219–226 Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. J Mach Learn Res 2:219–226
Zurück zum Zitat Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson T (2002) The generalized MDL approach for summarization. In: VLDB, p 766–777. Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson T (2002) The generalized MDL approach for summarization. In: VLDB, p 766–777.
Zurück zum Zitat Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: ECML/PKDD (3), p 288–303. Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: ECML/PKDD (3), p 288–303.
Zurück zum Zitat Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM TKDD 6:1–44CrossRef Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM TKDD 6:1–44CrossRef
Zurück zum Zitat Mehta S, Parthasarathy S, Yang H (2005) Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 17(9):1174–1185CrossRef Mehta S, Parthasarathy S, Yang H (2005) Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 17(9):1174–1185CrossRef
Zurück zum Zitat Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, p 533–541. Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, p 533–541.
Zurück zum Zitat Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009) DensEst: density estimation for data mining in high dimensional spaces. In: SDM, p 173–184. Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009) DensEst: density estimation for data mining in high dimensional spaces. In: SDM, p 173–184.
Zurück zum Zitat Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SDM, p 198–206. Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SDM, p 198–206.
Zurück zum Zitat Peleg S, Werman M, Rom H (1989) A unified approach to the change of resolution: space and gray-level. IEEE Trans Pattern Anal Mach Intell 11(7):739–742CrossRef Peleg S, Werman M, Rom H (1989) A unified approach to the change of resolution: space and gray-level. IEEE Trans Pattern Anal Mach Intell 11(7):739–742CrossRef
Zurück zum Zitat Rao M, Seth S, Xu JW, Chen Y, Tagare H, Príncipe JC (2011) A test of independence based on a generalized correlation function. Signal Process 91(1):15–27CrossRefMATH Rao M, Seth S, Xu JW, Chen Y, Tagare H, Príncipe JC (2011) A test of independence based on a generalized correlation function. Signal Process 91(1):15–27CrossRefMATH
Zurück zum Zitat Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524CrossRef Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524CrossRef
Zurück zum Zitat Scargle JD, Norris JP, Jackson B, Chiang J (2013) Studies in astronomical time series analysis. vi. Bayesian block representations. Astrophys J 764(2) Scargle JD, Norris JP, Jackson B, Chiang J (2013) Studies in astronomical time series analysis. vi. Bayesian block representations. Astrophys J 764(2)
Zurück zum Zitat Seth S, Rao M, Park I, Príncipe JC (2011) A unified framework for quadratic measures of independence. IEEE Trans Signal Process 59(8):3624–3635CrossRefMathSciNet Seth S, Rao M, Park I, Príncipe JC (2011) A unified framework for quadratic measures of independence. IEEE Trans Signal Process 59(8):3624–3635CrossRefMathSciNet
Zurück zum Zitat Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall/CRC, LondonCrossRefMATH Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall/CRC, LondonCrossRefMATH
Zurück zum Zitat Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: ICDM, p 588–597. Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: ICDM, p 588–597.
Zurück zum Zitat Tzoumas K, Deshpande A, Jensen CS (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11):852–863 Tzoumas K, Deshpande A, Jensen CS (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11):852–863
Zurück zum Zitat Vereshchagin NK, Vitányi PMB (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290CrossRef Vereshchagin NK, Vitányi PMB (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290CrossRef
Zurück zum Zitat Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Disc 23(1):169–214CrossRefMATH Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Disc 23(1):169–214CrossRefMATH
Zurück zum Zitat Wagner A, Lützkendorf T, Voss K, Spars G, Maas A, Herkel S (2014) Performance analysis of commercial buildings: results and experiences from the german demonstration program ‘energy optimized building (EnOB)’. Energy Build 68:634–638CrossRef Wagner A, Lützkendorf T, Voss K, Spars G, Maas A, Herkel S (2014) Performance analysis of commercial buildings: results and experiences from the german demonstration program ‘energy optimized building (EnOB)’. Energy Build 68:634–638CrossRef
Zurück zum Zitat Yang X, Procopiuc CM, Srivastava D (2009) Summarizing relational databases. PVLDB 2(1):634–645 Yang X, Procopiuc CM, Srivastava D (2009) Summarizing relational databases. PVLDB 2(1):634–645
Zurück zum Zitat Yang X, Procopiuc CM, Srivastava D (2011) Summary graphs for relational database schemas. PVLDB 4(11):899–910 Yang X, Procopiuc CM, Srivastava D (2011) Summary graphs for relational database schemas. PVLDB 4(11):899–910
Metadaten
Titel
Unsupervised interaction-preserving discretization of multivariate data
verfasst von
Hoang-Vu Nguyen
Emmanuel Müller
Jilles Vreeken
Klemens Böhm
Publikationsdatum
01.09.2014
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 5-6/2014
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-014-0350-5

Weitere Artikel der Ausgabe 5-6/2014

Data Mining and Knowledge Discovery 5-6/2014 Zur Ausgabe