Skip to main content
Top
Published in: Data Mining and Knowledge Discovery 5-6/2014

01-09-2014

Unsupervised interaction-preserving discretization of multivariate data

Authors: Hoang-Vu Nguyen, Emmanuel Müller, Jilles Vreeken, Klemens Böhm

Published in: Data Mining and Knowledge Discovery | Issue 5-6/2014

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain data that retains as much information in the continuous original as possible. In general, but in particular for exploratory tasks, a key open question is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and combines them if (a) their multivariate data distributions are statistically similar, and (b) this merge reduces the MDL encoding cost. To assess the similarities, we propose \( ID \), a novel interaction distance that does not require assuming a distribution and permits computation in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Appendix
Available only for authorised users
Literature
go back to reference Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD Conference, p 37–46. Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD Conference, p 37–46.
go back to reference Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, p 94–105. Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, p 94–105.
go back to reference Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, p 415–424. Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, p 415–424.
go back to reference Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843CrossRefMATH Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843CrossRefMATH
go back to reference Aue A, Hörmann S, Horváth L, Reimherr M (2009) Break detection in the covariance structure of multivariate time series models. Ann Stat 37(6B):4046–4087CrossRefMATH Aue A, Hörmann S, Horváth L, Reimherr M (2009) Break detection in the covariance structure of multivariate time series models. Ann Stat 37(6B):4046–4087CrossRefMATH
go back to reference Bay SD, Pazzani MJ (1999) Detecting change in categorical data: Mining contrast sets. In: KDD, p 302–306. Bay SD, Pazzani MJ (1999) Detecting change in categorical data: Mining contrast sets. In: KDD, p 302–306.
go back to reference Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598CrossRefMATHMathSciNet Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598CrossRefMATHMathSciNet
go back to reference Breunig MM, Kriegel HP, Raymond T Ng JS (2000) LOF: identifying density-based local outliers. In: SIGMOD Conference, p 93–104. Breunig MM, Kriegel HP, Raymond T Ng JS (2000) LOF: identifying density-based local outliers. In: SIGMOD Conference, p 93–104.
go back to reference Bu S, Lakshmanan LVS, Ng RT (2005) Mdl summarization with holes. In: VLDB, p 433–444. Bu S, Lakshmanan LVS, Ng RT (2005) Mdl summarization with holes. In: VLDB, p 433–444.
go back to reference Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: KDD, p 84–93. Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: KDD, p 84–93.
go back to reference Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New YorkMATH Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New YorkMATH
go back to reference Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MATHMathSciNet Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MATHMathSciNet
go back to reference Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, p 1022–1029. Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, p 1022–1029.
go back to reference Ferrandiz S, Boullé M (2005) Multivariate discretization by recursive supervised bipartition of graph. In: MLDM, p 253–264. Ferrandiz S, Boullé M (2005) Multivariate discretization by recursive supervised bipartition of graph. In: MLDM, p 253–264.
go back to reference Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge
go back to reference Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD Conference, p 463–474. Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD Conference, p 463–474.
go back to reference Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15:55–86CrossRefMathSciNet Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15:55–86CrossRefMathSciNet
go back to reference Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. In: KSEM, p 556–562. Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. In: KSEM, p 556–562.
go back to reference Kerber R (1992) ChiMerge: discretization of numeric attributes. In: AAAI, p 123–128. Kerber R (1992) ChiMerge: discretization of numeric attributes. In: AAAI, p 123–128.
go back to reference Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. J Mach Learn Res 2:219–226 Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. J Mach Learn Res 2:219–226
go back to reference Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson T (2002) The generalized MDL approach for summarization. In: VLDB, p 766–777. Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson T (2002) The generalized MDL approach for summarization. In: VLDB, p 766–777.
go back to reference Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: ECML/PKDD (3), p 288–303. Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: ECML/PKDD (3), p 288–303.
go back to reference Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM TKDD 6:1–44CrossRef Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM TKDD 6:1–44CrossRef
go back to reference Mehta S, Parthasarathy S, Yang H (2005) Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 17(9):1174–1185CrossRef Mehta S, Parthasarathy S, Yang H (2005) Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 17(9):1174–1185CrossRef
go back to reference Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, p 533–541. Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, p 533–541.
go back to reference Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009) DensEst: density estimation for data mining in high dimensional spaces. In: SDM, p 173–184. Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009) DensEst: density estimation for data mining in high dimensional spaces. In: SDM, p 173–184.
go back to reference Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SDM, p 198–206. Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SDM, p 198–206.
go back to reference Peleg S, Werman M, Rom H (1989) A unified approach to the change of resolution: space and gray-level. IEEE Trans Pattern Anal Mach Intell 11(7):739–742CrossRef Peleg S, Werman M, Rom H (1989) A unified approach to the change of resolution: space and gray-level. IEEE Trans Pattern Anal Mach Intell 11(7):739–742CrossRef
go back to reference Rao M, Seth S, Xu JW, Chen Y, Tagare H, Príncipe JC (2011) A test of independence based on a generalized correlation function. Signal Process 91(1):15–27CrossRefMATH Rao M, Seth S, Xu JW, Chen Y, Tagare H, Príncipe JC (2011) A test of independence based on a generalized correlation function. Signal Process 91(1):15–27CrossRefMATH
go back to reference Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524CrossRef Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524CrossRef
go back to reference Scargle JD, Norris JP, Jackson B, Chiang J (2013) Studies in astronomical time series analysis. vi. Bayesian block representations. Astrophys J 764(2) Scargle JD, Norris JP, Jackson B, Chiang J (2013) Studies in astronomical time series analysis. vi. Bayesian block representations. Astrophys J 764(2)
go back to reference Seth S, Rao M, Park I, Príncipe JC (2011) A unified framework for quadratic measures of independence. IEEE Trans Signal Process 59(8):3624–3635CrossRefMathSciNet Seth S, Rao M, Park I, Príncipe JC (2011) A unified framework for quadratic measures of independence. IEEE Trans Signal Process 59(8):3624–3635CrossRefMathSciNet
go back to reference Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall/CRC, LondonCrossRefMATH Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall/CRC, LondonCrossRefMATH
go back to reference Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: ICDM, p 588–597. Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: ICDM, p 588–597.
go back to reference Tzoumas K, Deshpande A, Jensen CS (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11):852–863 Tzoumas K, Deshpande A, Jensen CS (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11):852–863
go back to reference Vereshchagin NK, Vitányi PMB (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290CrossRef Vereshchagin NK, Vitányi PMB (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290CrossRef
go back to reference Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Disc 23(1):169–214CrossRefMATH Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Disc 23(1):169–214CrossRefMATH
go back to reference Wagner A, Lützkendorf T, Voss K, Spars G, Maas A, Herkel S (2014) Performance analysis of commercial buildings: results and experiences from the german demonstration program ‘energy optimized building (EnOB)’. Energy Build 68:634–638CrossRef Wagner A, Lützkendorf T, Voss K, Spars G, Maas A, Herkel S (2014) Performance analysis of commercial buildings: results and experiences from the german demonstration program ‘energy optimized building (EnOB)’. Energy Build 68:634–638CrossRef
go back to reference Yang X, Procopiuc CM, Srivastava D (2009) Summarizing relational databases. PVLDB 2(1):634–645 Yang X, Procopiuc CM, Srivastava D (2009) Summarizing relational databases. PVLDB 2(1):634–645
go back to reference Yang X, Procopiuc CM, Srivastava D (2011) Summary graphs for relational database schemas. PVLDB 4(11):899–910 Yang X, Procopiuc CM, Srivastava D (2011) Summary graphs for relational database schemas. PVLDB 4(11):899–910
Metadata
Title
Unsupervised interaction-preserving discretization of multivariate data
Authors
Hoang-Vu Nguyen
Emmanuel Müller
Jilles Vreeken
Klemens Böhm
Publication date
01-09-2014
Publisher
Springer US
Published in
Data Mining and Knowledge Discovery / Issue 5-6/2014
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-014-0350-5

Other articles of this Issue 5-6/2014

Data Mining and Knowledge Discovery 5-6/2014 Go to the issue

Premium Partner