Top

Data Mining and Knowledge Discovery

Published in:

01-09-2014

Unsupervised interaction-preserving discretization of multivariate data

Authors: Hoang-Vu Nguyen, Emmanuel Müller, Jilles Vreeken, Klemens Böhm

Published in: Data Mining and Knowledge Discovery | Issue 5-6/2014

Activate our intelligent search to find suitable subject content or patents.

search-config

AI-assisted search

Off

Abstract

Discretization is the transformation of continuous data into discrete bins. It is an important and general pre-processing technique, and a critical element of many data mining and data management tasks. The general goal is to obtain data that retains as much information in the continuous original as possible. In general, but in particular for exploratory tasks, a key open question is how to discretize multivariate data such that significant associations and patterns are preserved. That is exactly the problem we study in this paper. We propose IPD, an information-theoretic method for unsupervised discretization that focuses on preserving multivariate interactions. To this end, when discretizing a dimension, we consider the distribution of the data over all other dimensions. In particular, our method examines consecutive multivariate regions and combines them if (a) their multivariate data distributions are statistically similar, and (b) this merge reduces the MDL encoding cost. To assess the similarities, we propose \( ID \), a novel interaction distance that does not require assuming a distribution and permits computation in closed form. We give an efficient algorithm for finding the optimal bin merge, as well as a fast well-performing heuristic. Empirical evaluation through pattern-based compression, outlier mining, and classification shows that by preserving interactions we consistently outperform the state of the art in both quality and speed.

previous article Self-organizing maps by difference of convex functions optimization

next article Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

inform now

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

inform now

Available only for authorised users

http://www.ipd.kit.edu/~nguyenh/ipd/.

http://www.pamap.org/demo.html.

Aggarwal CC, Yu PS (2001) Outlier detection for high dimensional data. In: SIGMOD Conference, p 37–46.

Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. In: SIGMOD Conference, p 94–105.

Akoglu L, Tong H, Vreeken J, Faloutsos C (2012) Fast and reliable anomaly detection in categorical data. In: CIKM, p 415–424.

Allen JF (1983) Maintaining knowledge about temporal intervals. Commun ACM 26(11):832–843CrossRefMATH

Allen JF, Ferguson G (1994) Actions and events in interval temporal logic. J Log Comput 4(5):531–579CrossRefMATHMathSciNet

Aue A, Hörmann S, Horváth L, Reimherr M (2009) Break detection in the covariance structure of multivariate time series models. Ann Stat 37(6B):4046–4087CrossRefMATH

Bay SD (2001) Multivariate discretization for set mining. Knowl Inf Syst 3(4):491–512CrossRefMATH

Bay SD, Pazzani MJ (1999) Detecting change in categorical data: Mining contrast sets. In: KDD, p 302–306.

Breiman L (2001) Random forests. Mach Learn 45(1):5–32CrossRefMATH

Breiman L, Friedman JH (1985) Estimating optimal transformations for multiple regression and correlation. J Am Stat Assoc 80(391):580–598CrossRefMATHMathSciNet

Breunig MM, Kriegel HP, Raymond T Ng JS (2000) LOF: identifying density-based local outliers. In: SIGMOD Conference, p 93–104.

Bu S, Lakshmanan LVS, Ng RT (2005) Mdl summarization with holes. In: VLDB, p 433–444.

Cheng CH, Fu AWC, Zhang Y (1999) Entropy-based subspace clustering for mining numerical data. In: KDD, p 84–93.

Cover TM, Thomas JA (2006) Elements of information theory. Wiley, New YorkMATH

Demsar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30MATHMathSciNet

Fayyad UM, Irani KB (1993) Multi-interval discretization of continuous-valued attributes for classification learning. In: IJCAI, p 1022–1029.

Ferrandiz S, Boullé M (2005) Multivariate discretization by recursive supervised bipartition of graph. In: MLDM, p 253–264.

Grosskreutz H, Rüping S (2009) On subgroup discovery in numerical domains. Data Min Knowl Discov 19(2):210–226CrossRefMathSciNet

Grünwald PD (2007) The minimum description length principle. MIT Press, Cambridge

Gunopulos D, Kollios G, Tsotras VJ, Domeniconi C (2000) Approximating multi-dimensional aggregate range queries over real attributes. In: SIGMOD Conference, p 463–474.

Han J, Cheng H, Xin D, Yan X (2007) Frequent pattern mining: current status and future directions. Data Min Knowl Disc 15:55–86CrossRefMathSciNet

Kang Y, Wang S, Liu X, Lai H, Wang H, Miao B (2006) An ICA-based multivariate discretization algorithm. In: KSEM, p 556–562.

Kerber R (1992) ChiMerge: discretization of numeric attributes. In: AAAI, p 123–128.

Kontkanen P, Myllymäki P (2007) MDL histogram density estimation. J Mach Learn Res 2:219–226

Lakshmanan LVS, Ng RT, Wang CX, Zhou X, Johnson T (2002) The generalized MDL approach for summarization. In: VLDB, p 766–777.

Lee J, Verleysen M (2007) Nonlinear dimensionality reduction. Springer, New YorkCrossRefMATH

Lemmerich F, Becker M, Puppe F (2013) Difference-based estimates for generalization-aware subgroup discovery. In: ECML/PKDD (3), p 288–303.

Liu R, Yang L (2008) Kernel estimation of multivariate cumulative distribution function. J Nonparametr Stat 20(8):661–677CrossRefMATHMathSciNet

Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM TKDD 6:1–44CrossRef

Mehta S, Parthasarathy S, Yang H (2005) Toward unsupervised correlation preserving discretization. IEEE Trans Knowl Data Eng 17(9):1174–1185CrossRef

Moise G, Sander J (2008) Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: KDD, p 533–541.

Müller E, Assent I, Krieger R, Günnemann S, Seidl T (2009) DensEst: density estimation for data mining in high dimensional spaces. In: SDM, p 173–184.

Nguyen HV, Müller E, Vreeken J, Keller F, Böhm K (2013) CMI: an information-theoretic contrast measure for enhancing subspace cluster and outlier detection. In: SDM, p 198–206.

Peleg S, Werman M, Rom H (1989) A unified approach to the change of resolution: space and gray-level. IEEE Trans Pattern Anal Mach Intell 11(7):739–742CrossRef

Philip Preuß HD Ruprecht Puchstein (2013) Detection of multiple structural breaks in multivariate time series. arXiv:1309.1309v1.

Rao M, Seth S, Xu JW, Chen Y, Tagare H, Príncipe JC (2011) A test of independence based on a generalized correlation function. Signal Process 91(1):15–27CrossRefMATH

Reshef DN, Reshef YA, Finucane HK, Grossman SR, McVean G, Turnbaugh PJ, Lander ES, Mitzenmacher M, Sabeti PC (2011) Detecting novel associations in large data sets. Science 334(6062):1518–1524CrossRef

Rissanen J (1978) Modeling by shortest data description. Automatica 14(1):465–471CrossRefMATH

Rissanen J (1983) Modeling by shortest data description. Ann Stat 11(2):416–431CrossRefMATHMathSciNet

Scargle JD, Norris JP, Jackson B, Chiang J (2013) Studies in astronomical time series analysis. vi. Bayesian block representations. Astrophys J 764(2)

Seth S, Rao M, Park I, Príncipe JC (2011) A unified framework for quadratic measures of independence. IEEE Trans Signal Process 59(8):3624–3635CrossRefMathSciNet

Silverman BW (1986) Density estimation for statistics and data analysis. Chapman & Hall/CRC, LondonCrossRefMATH

Tatti N, Vreeken J (2008) Finding good itemsets by packing data. In: ICDM, p 588–597.

Tzoumas K, Deshpande A, Jensen CS (2011) Lightweight graphical models for selectivity estimation without independence assumptions. PVLDB 4(11):852–863

Vereshchagin NK, Vitányi PMB (2004) Kolmogorov’s structure functions and model selection. IEEE Trans Inf Theory 50(12):3265–3290CrossRef

Vreeken J, van Leeuwen M, Siebes A (2011) Krimp: mining itemsets that compress. Data Min Knowl Disc 23(1):169–214CrossRefMATH

Wagner A, Lützkendorf T, Voss K, Spars G, Maas A, Herkel S (2014) Performance analysis of commercial buildings: results and experiences from the german demonstration program ‘energy optimized building (EnOB)’. Energy Build 68:634–638CrossRef

Yang X, Procopiuc CM, Srivastava D (2009) Summarizing relational databases. PVLDB 2(1):634–645

Yang X, Procopiuc CM, Srivastava D (2011) Summary graphs for relational database schemas. PVLDB 4(11):899–910

Title: Unsupervised interaction-preserving discretization of multivariate data
Authors: Hoang-Vu Nguyen
Emmanuel Müller
Jilles Vreeken
Klemens Böhm
Publication date: 01-09-2014
Publisher: Springer US
Published in: Data Mining and Knowledge Discovery / Issue 5-6/2014
Print ISSN: 1384-5810
Electronic ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-014-0350-5

Springer Professional

Abstract

Please log in to get access to your license.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Other articles of this Issue 5-6/2014

Approximating the crowd

Guest editors’ introduction: special issue of the ECML/PKDD 2014 journal track

A peek into the black box: exploring classifiers by randomization

Overlapping community detection in labeled graphs

Invariant time-series factorization

Ontology of core data mining entities

Premium Partner