Skip to main content
Erschienen in: Knowledge and Information Systems 1/2018

19.01.2018 | Regular Paper

From sets of good redescriptions to good sets of redescriptions

verfasst von: Janis Kalofolias, Esther Galbrun, Pauli Miettinen

Erschienen in: Knowledge and Information Systems | Ausgabe 1/2018

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Redescription mining aims at finding pairs of queries over data variables that describe roughly the same set of observations. These redescriptions can be used to obtain different views on the same set of entities. So far, redescription mining methods have aimed at listing all redescriptions supported by the data. Such an approach can result in many redundant redescriptions and hinder the user’s ability to understand the overall characteristics of the data. In this work, we present an approach to identify and remove the redundant redescriptions, that is, an approach to move from a set of good redescriptions to a good set of redescriptions. We measure the redundancy of a redescription using a framework inspired by the concept of subjective interestingness based on maximum entropy distributions as proposed by De Bie (Data Min Knowl Discov 23(3):407–446, 2011). Redescriptions, however, generate specific requirements on the framework, and our solution differs significantly from the existing ones. Notably, our approach can handle disjunctions and conjunctions in the queries, whereas the existing approaches are limited only to conjunctive queries. Our framework can also handle data with Boolean, nominal, or real-valued data, possibly containing missing values, making it applicable to a wide variety of data sets. Our experiments show that our framework can efficiently reduce the redundancy even on large data sets.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Fußnoten
1
The present work is an extended version of our earlier conference publication [21].
 
2
The Dirac delta, which is the continuous equivalent of the Kronecker delta, is a generalised function that assumes an infinite mass when its argument is zero, in our case effectively ensuring that only the case of \({{\varvec{r}}}={{\varvec{r}}}_\rho \) is possible.
 
4
http://​siren.​gforge.​inria.​fr, accessed 13 December 2017.
 
Literatur
1.
Zurück zum Zitat Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec 27(2):94–105CrossRef Agrawal R, Gehrke J, Gunopulos D, Raghavan P (1998) Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD Rec 27(2):94–105CrossRef
2.
Zurück zum Zitat Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of 20th international conference on very large data bases (VLDB’94), pp 487–499 Agrawal R, Srikant R (1994) Fast algorithms for mining association rules in large databases. In: Proceedings of 20th international conference on very large data bases (VLDB’94), pp 487–499
3.
Zurück zum Zitat Barber D (2012) Bayesian reasoning and machine learning. Cambridge University Press, CambridgeMATH Barber D (2012) Bayesian reasoning and machine learning. Cambridge University Press, CambridgeMATH
4.
Zurück zum Zitat Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Linguist 22(1):39–71 Berger AL, Pietra VJD, Pietra SAD (1996) A maximum entropy approach to natural language processing. Comput Linguist 22(1):39–71
5.
Zurück zum Zitat Bickel S, Scheffer T (2004) Multi-view clustering. In: Proceedings of the 4th IEEE international conference on data mining (ICDM’04), pp 19–26 Bickel S, Scheffer T (2004) Multi-view clustering. In: Proceedings of the 4th IEEE international conference on data mining (ICDM’04), pp 19–26
6.
Zurück zum Zitat Burden RL, Faires JD (2011) Numerical analysis, 9th edn. Brooks/Cole, BostonMATH Burden RL, Faires JD (2011) Numerical analysis, 9th edn. Brooks/Cole, BostonMATH
7.
Zurück zum Zitat De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446MathSciNetCrossRefMATH De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446MathSciNetCrossRefMATH
9.
Zurück zum Zitat Galbrun E, Miettinen P (2012a) From black and white to full color: extending redescription mining outside the boolean world. Stat Anal Data Min 5(4):284–303MathSciNetCrossRef Galbrun E, Miettinen P (2012a) From black and white to full color: extending redescription mining outside the boolean world. Stat Anal Data Min 5(4):284–303MathSciNetCrossRef
10.
Zurück zum Zitat Galbrun E, Miettinen P (2012b) Siren: an interactive tool for mining and visualizing geospatial redescriptions. In: Proceedings of the 18th ACM SIGKDD International conference on knowledge discovery and data mining (KDD’12), pp 1544–1547 Galbrun E, Miettinen P (2012b) Siren: an interactive tool for mining and visualizing geospatial redescriptions. In: Proceedings of the 18th ACM SIGKDD International conference on knowledge discovery and data mining (KDD’12), pp 1544–1547
11.
Zurück zum Zitat Galbrun E, Miettinen P (2014) Interactive redescription mining. In: Proceedings of the 2016 international conference on management of data (SIGMOD’14), pp 1079–1082 Galbrun E, Miettinen P (2014) Interactive redescription mining. In: Proceedings of the 2016 international conference on management of data (SIGMOD’14), pp 1079–1082
12.
Zurück zum Zitat Galbrun E, Miettinen P (2018) Redescription mining. Springer, Cham Galbrun E, Miettinen P (2018) Redescription mining. Springer, Cham
13.
Zurück zum Zitat Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the 8th SIAM international conference on data mining (SDM’08), pp 334–345 Gallo A, Miettinen P, Mannila H (2008) Finding subgroups having several descriptions: algorithms for redescription mining. In: Proceedings of the 8th SIAM international conference on data mining (SDM’08), pp 334–345
14.
15.
Zurück zum Zitat Grove AJ, Halpern JY, Koller D (1992) Random worlds and maximum entropy. In: Proceedings of the 7th annual IEEE symposium on logic in computer science (LICS’92), pp 22–33 Grove AJ, Halpern JY, Koller D (1992) Random worlds and maximum entropy. In: Proceedings of the 7th annual IEEE symposium on logic in computer science (LICS’92), pp 22–33
16.
Zurück zum Zitat Hijmans RJ, Cameron SE, Parra LJ, Jones PG, Jarvis A (2005) Very high resolution interpolated climate surfaces for global land areas. Int J Climatol 25:1965–1978CrossRef Hijmans RJ, Cameron SE, Parra LJ, Jones PG, Jarvis A (2005) Very high resolution interpolated climate surfaces for global land areas. Int J Climatol 25:1965–1978CrossRef
17.
Zurück zum Zitat Jaroszewicz S, Simovici DA (2002) Pruning redundant association rules using maximum entropy principle. In: Proceedings of the 6th Pacific–Asia conference on advances in knowledge discovery and data mining (PAKDD’02), pp 135–147 Jaroszewicz S, Simovici DA (2002) Pruning redundant association rules using maximum entropy principle. In: Proceedings of the 6th Pacific–Asia conference on advances in knowledge discovery and data mining (PAKDD’02), pp 135–147
18.
Zurück zum Zitat Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9):939–952CrossRef Jaynes E (1982) On the rationale of maximum-entropy methods. Proc IEEE 70(9):939–952CrossRef
19.
Zurück zum Zitat Jaynes ET (2003) Probability theory: the logic of science, vol 10. Cambridge University Press, Cambridge, p 33CrossRef Jaynes ET (2003) Probability theory: the logic of science, vol 10. Cambridge University Press, Cambridge, p 33CrossRef
20.
Zurück zum Zitat Jensen FV, Jensen F (1994) Optimal junction trees. In: Proceedings of the 10th annual conference on uncertainty in artificial intelligence (UAI’94), pp 360–366 Jensen FV, Jensen F (1994) Optimal junction trees. In: Proceedings of the 10th annual conference on uncertainty in artificial intelligence (UAI’94), pp 360–366
21.
Zurück zum Zitat Kalofolias J, Galbrun E, Miettinen P (2016) From sets of good redescriptions to good sets of redescriptions. In: Proceedings of the 16th IEEE international conference on data mining (ICDM’16), pp 211–220 Kalofolias J, Galbrun E, Miettinen P (2016) From sets of good redescriptions to good sets of redescriptions. In: Proceedings of the 16th IEEE international conference on data mining (ICDM’16), pp 211–220
22.
Zurück zum Zitat Kontonasios K-N, De Bie T (2012) Formalizing complex prior information to quantify subjective interestingness of frequent pattern sets. In: Proceedings of the 11th international symposium on advances in intelligent data analysis (IDA’12), pp 161–171 Kontonasios K-N, De Bie T (2012) Formalizing complex prior information to quantify subjective interestingness of frequent pattern sets. In: Proceedings of the 11th international symposium on advances in intelligent data analysis (IDA’12), pp 161–171
24.
Zurück zum Zitat Kontonasios K-N, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM’1), pp 350–359 Kontonasios K-N, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM’1), pp 350–359
25.
Zurück zum Zitat Kontonasios K-N, Vreeken J, De Bie T (2013) Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data. In: Proceedings of the 2013 European conference on machine learning and principles and practice of knowledge discovery in databases (ECML-PKDD’13), pp 256–271 Kontonasios K-N, Vreeken J, De Bie T (2013) Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data. In: Proceedings of the 2013 European conference on machine learning and principles and practice of knowledge discovery in databases (ECML-PKDD’13), pp 256–271
26.
Zurück zum Zitat Kröger P (2009) Subspace clustering techniques. In: Liu L, Özsu M T (eds) Encyclopedia of database systems. Springer, Berlin, pp 2873–2875 Kröger P (2009) Subspace clustering techniques. In: Liu L, Özsu M T (eds) Encyclopedia of database systems. Springer, Berlin, pp 2873–2875
27.
Zurück zum Zitat Mampaey M, Tatti N, Vreeken J (2011) Tell me what i need to know: succinctly summarizing data with itemsets. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11), pp 573–581 Mampaey M, Tatti N, Vreeken J (2011) Tell me what i need to know: succinctly summarizing data with itemsets. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11), pp 573–581
28.
Zurück zum Zitat Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM Trans Knowl Discov Data 6(4):16:1–16:42CrossRef Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM Trans Knowl Discov Data 6(4):16:1–16:42CrossRef
29.
Zurück zum Zitat Mannila H, Pavlov D, Smyth P (1999) Prediction with local patterns using cross-entropy. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’99), pp 357–361 Mannila H, Pavlov D, Smyth P (1999) Prediction with local patterns using cross-entropy. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’99), pp 357–361
30.
Zurück zum Zitat Mihelčić M, Šmuc T (2016) InterSet: interactive redescription set exploration. In: Proceedings of the 19th international conference on discovery science (DS’16), pp 35–50 Mihelčić M, Šmuc T (2016) InterSet: interactive redescription set exploration. In: Proceedings of the 19th international conference on discovery science (DS’16), pp 35–50
31.
Zurück zum Zitat Mitchell-Jones A J et al (1999) The atlas of European mammals. Academic Press, New York Mitchell-Jones A J et al (1999) The atlas of European mammals. Academic Press, New York
32.
Zurück zum Zitat Murdock GP (1967) Ethnographic atlas: a summary. Ethnology 6(2):109–236CrossRef Murdock GP (1967) Ethnographic atlas: a summary. Ethnology 6(2):109–236CrossRef
33.
Zurück zum Zitat Novak PK, Lavrač N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403MATH Novak PK, Lavrač N, Webb GI (2009) Supervised descriptive rule discovery: a unifying survey of contrast set, emerging pattern and subgroup mining. J Mach Learn Res 10:377–403MATH
34.
Zurück zum Zitat Parida L, Ramakrishnan N (2005) Redescription mining: structure theory and algorithms. In: Proceedings of the 20th national conference on artificial intelligence and the 7th innovative applications of artificial intelligence conference (AAAI’05), pp 837–844 Parida L, Ramakrishnan N (2005) Redescription mining: structure theory and algorithms. In: Proceedings of the 20th national conference on artificial intelligence and the 7th innovative applications of artificial intelligence conference (AAAI’05), pp 837–844
35.
Zurück zum Zitat Pavlov D, Mannila H, Smyth P (2003) Beyond independence: probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15(6):1409–1421CrossRef Pavlov D, Mannila H, Smyth P (2003) Beyond independence: probabilistic models for query approximation on binary transaction data. IEEE Trans Knowl Data Eng 15(6):1409–1421CrossRef
36.
Zurück zum Zitat Phillips SJ, Anderson RP, Schapire RE (2006) Maximum entropy modeling of species geographic distributions. Ecol Model 190(3):231–259CrossRef Phillips SJ, Anderson RP, Schapire RE (2006) Maximum entropy modeling of species geographic distributions. Ecol Model 190(3):231–259CrossRef
37.
Zurück zum Zitat Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 266–275 Ramakrishnan N, Kumar D, Mishra B, Potts M, Helm RF (2004) Turning CARTwheels: an alternating algorithm for mining redescriptions. In: Proceedings of the 10th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 266–275
38.
Zurück zum Zitat Rasch G (1960) Probabilistic models for some intelligence and achievement tests. Danish Institute for Educational Research, Copenhagen Rasch G (1960) Probabilistic models for some intelligence and achievement tests. Danish Institute for Educational Research, Copenhagen
40.
Zurück zum Zitat Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1):57–77CrossRef Tatti N (2008) Maximum entropy based significance of itemsets. Knowl Inf Syst 17(1):57–77CrossRef
41.
Zurück zum Zitat Tatti N, Vreeken J (2011) Comparing apples and oranges. In: Joint european conference on machine learning and knowledge discovery in databases, Springer, pp 398–413 Tatti N, Vreeken J (2011) Comparing apples and oranges. In: Joint european conference on machine learning and knowledge discovery in databases, Springer, pp 398–413
42.
Zurück zum Zitat van Leeuwen M, Galbrun E (2015) Association discovery in two-view data. IEEE Trans Knowl Data Eng 27(12):3190–3202CrossRef van Leeuwen M, Galbrun E (2015) Association discovery in two-view data. IEEE Trans Knowl Data Eng 27(12):3190–3202CrossRef
44.
Zurück zum Zitat Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06), pp 730–735 Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’06), pp 730–735
45.
Zurück zum Zitat Wu H, Vreeken J, Tatti N, Ramakrishnan N (2014) Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas. Data Min Knowl Discov 28(5–6):1398–1428MathSciNetCrossRef Wu H, Vreeken J, Tatti N, Ramakrishnan N (2014) Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas. Data Min Knowl Discov 28(5–6):1398–1428MathSciNetCrossRef
46.
Zurück zum Zitat Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’05), pp 364–373 Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’05), pp 364–373
47.
Zurück zum Zitat Zinchenko T, Galbrun E, Miettinen P (2015) Mining predictive redescriptions with trees. In: IEEE International conference on data mining workshops, pp 1672–1675 Zinchenko T, Galbrun E, Miettinen P (2015) Mining predictive redescriptions with trees. In: IEEE International conference on data mining workshops, pp 1672–1675
Metadaten
Titel
From sets of good redescriptions to good sets of redescriptions
verfasst von
Janis Kalofolias
Esther Galbrun
Pauli Miettinen
Publikationsdatum
19.01.2018
Verlag
Springer London
Erschienen in
Knowledge and Information Systems / Ausgabe 1/2018
Print ISSN: 0219-1377
Elektronische ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-017-1149-7

Weitere Artikel der Ausgabe 1/2018

Knowledge and Information Systems 1/2018 Zur Ausgabe

Premium Partner