Skip to main content
Erschienen in: Data Mining and Knowledge Discovery 5-6/2014

01.09.2014

Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas

verfasst von: Hao Wu, Jilles Vreeken, Nikolaj Tatti, Naren Ramakrishnan

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 5-6/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Many application domains such as intelligence analysis and cybersecurity require tools for the unsupervised identification of suspicious entities in multi-relational/network data. In particular, there is a need for automated semi-automated approaches to ‘uncover the plot’, i.e., to detect non-obvious coalitions of entities bridging many types of relations. We cast the problem of detecting such suspicious coalitions and their connections as one of mining surprisingly dense and well-connected chains of biclusters over multi-relational data. With this as our goal, we model data by the Maximum Entropy principle, such that in a statistically well-founded way we can gauge the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally organize discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop, knowledge discovery.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Fußnoten
1
Note that, to save computation, we do not update the MaxEnt model after adding each \(B^*\). However, in line with the local score, we know that adding a bicluster typically only changes the distribution locally, and as we never re-visit the same relation \(R\) in a single chain \(C\) the information by \(B_i\) is unlikely to influence much the informativeness of \(B_{i+1}\).
 
Literatur
Zurück zum Zitat Califano A, Stolovitzky G, Tu Y (2000) Analysis of gene expression microarrays for phenotype classification. In: Proceedings of the 8th international conference on intelligent systems for molecular biology, pp 75–85 Califano A, Stolovitzky G, Tu Y (2000) Analysis of gene expression microarrays for phenotype classification. In: Proceedings of the 8th international conference on intelligent systems for molecular biology, pp 75–85
Zurück zum Zitat Cerf L, Besson J, Robardet C, Boulicaut JF (2009) Closed patterns meet n-ary relations. ACM Trans Knowl Discov Data 3(1):3:1–3:36CrossRef Cerf L, Besson J, Robardet C, Boulicaut JF (2009) Closed patterns meet n-ary relations. ACM Trans Knowl Discov Data 3(1):3:1–3:36CrossRef
Zurück zum Zitat Cerf L, Besson J, Nguyen KNT, Boulicaut JF (2013) Closed and noise-tolerant patterns in n-ary relations. Data Min Knowl Discov 26(3):574–619CrossRefMATHMathSciNet Cerf L, Besson J, Nguyen KNT, Boulicaut JF (2013) Closed and noise-tolerant patterns in n-ary relations. Data Min Knowl Discov 26(3):574–619CrossRefMATHMathSciNet
Zurück zum Zitat Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the eighth international conference on intelligent systems for molecular biology, AAAI Press, pp 93–103 Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the eighth international conference on intelligent systems for molecular biology, AAAI Press, pp 93–103
Zurück zum Zitat Cover T, Thomas J (2006) Elements of information theory. Wiley, New YorkMATH Cover T, Thomas J (2006) Elements of information theory. Wiley, New YorkMATH
Zurück zum Zitat Davis WLI, Schwarz P, Terzi E (2009) Finding representative association rules from large rule collections. In: Proceedings of the 9th SIAM international conference on data mining (SDM). Sparks, NV, SIAM, pp 521–532 Davis WLI, Schwarz P, Terzi E (2009) Finding representative association rules from large rule collections. In: Proceedings of the 9th SIAM international conference on data mining (SDM). Sparks, NV, SIAM, pp 521–532
Zurück zum Zitat De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446CrossRefMATHMathSciNet De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446CrossRefMATHMathSciNet
Zurück zum Zitat Dehaspe L, Toironen H (2000) Discovery of relational association rules. In: Dĕzeroski S (ed) Relational data mining. Springer, New York Inc, pp 189–208 Dehaspe L, Toironen H (2000) Discovery of relational association rules. In: Dĕzeroski S (ed) Relational data mining. Springer, New York Inc, pp 189–208
Zurück zum Zitat Dzeroski S, Lavrac N (eds) (2001) Relational data mining. Springer, BerlinMATH Dzeroski S, Lavrac N (eds) (2001) Relational data mining. Springer, BerlinMATH
Zurück zum Zitat Geerts F, Goethals B, Mielikainen T (2004) Tiling databases. In: Proceedings of discovery science. Springer, Berlin, pp 278–289 Geerts F, Goethals B, Mielikainen T (2004) Tiling databases. In: Proceedings of discovery science. Springer, Berlin, pp 278–289
Zurück zum Zitat Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3):167–176CrossRef Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3):167–176CrossRef
Zurück zum Zitat Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Paris, France, pp 379–388 Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Paris, France, pp 379–388
Zurück zum Zitat Hossain M, Gresock J, Edmonds Y, Helm R, Potts M, Ramakrishnan N (2012a) Connecting the dots between PubMed abstracts. PLoS ONE 7(1) Hossain M, Gresock J, Edmonds Y, Helm R, Potts M, Ramakrishnan N (2012a) Connecting the dots between PubMed abstracts. PLoS ONE 7(1)
Zurück zum Zitat Hossain MS, Butler P, Boedihardjo AP, Ramakrishnan N (2012b) Storytelling in entity networks to support intelligence analysts. In: Proceedings of the 18th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Beijing, China, pp 1375–1383 Hossain MS, Butler P, Boedihardjo AP, Ramakrishnan N (2012b) Storytelling in entity networks to support intelligence analysts. In: Proceedings of the 18th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Beijing, China, pp 1375–1383
Zurück zum Zitat Hughes FJ (2005) Discovery, proof, choice: the art and science of the process of intelligence analysis, case study 6, “All Fall Down”, unpublished report Hughes FJ (2005) Discovery, proof, choice: the art and science of the process of intelligence analysis, case study 6, “All Fall Down”, unpublished report
Zurück zum Zitat Jaynes ET (1957) Information theory and statistical mechanics. Phys Rev Ser II 106(4):620–630MATHMathSciNet Jaynes ET (1957) Information theory and statistical mechanics. Phys Rev Ser II 106(4):620–630MATHMathSciNet
Zurück zum Zitat Jin Y, Murali TM, Ramakrishnan N (2008) Compositional mining of multirelational biological datasets. ACM Trans Knowl Discov Data 2(1):2:1–2:35CrossRef Jin Y, Murali TM, Ramakrishnan N (2008) Compositional mining of multirelational biological datasets. ACM Trans Knowl Discov Data 2(1):2:1–2:35CrossRef
Zurück zum Zitat Kiernan J, Terzi E (2008) Constructing comprehensive summaries of large event sequences. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (SIGKDD). Las Vegas, NV, pp 417–425 Kiernan J, Terzi E (2008) Constructing comprehensive summaries of large event sequences. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (SIGKDD). Las Vegas, NV, pp 417–425
Zurück zum Zitat Kontonasios KN, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM). Vancouver, Canada, IEEE, pp 350–359 Kontonasios KN, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM). Vancouver, Canada, IEEE, pp 350–359
Zurück zum Zitat Kontonasios KN, Vreeken J, De Bie T (2013) Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD). Springer, Prague, Czech Republic, pp 256–271 Kontonasios KN, Vreeken J, De Bie T (2013) Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD). Springer, Prague, Czech Republic, pp 256–271
Zurück zum Zitat Kumar D, Ramakrishnan N, Helm RF, Potts M (2006) Algorithms for storytelling. In: Proceedings of the 12th ACM international conference on knowledge discovery and data Mining (SIGKDD), Philadelphia, PA, pp 604–610 Kumar D, Ramakrishnan N, Helm RF, Potts M (2006) Algorithms for storytelling. In: Proceedings of the 12th ACM international conference on knowledge discovery and data Mining (SIGKDD), Philadelphia, PA, pp 604–610
Zurück zum Zitat Lavrac N, Flach P (2001) An extended transformation approach to inductive logic programming. ACM Trans Comput Logic 2(4):458–494CrossRef Lavrac N, Flach P (2001) An extended transformation approach to inductive logic programming. ACM Trans Comput Logic 2(4):458–494CrossRef
Zurück zum Zitat Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinf 1(1):24–45CrossRef Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinf 1(1):24–45CrossRef
Zurück zum Zitat Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM Trans Knowl Discov Data 6:1–44CrossRef Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM Trans Knowl Discov Data 6:1–44CrossRef
Zurück zum Zitat Ojala M, Garriga GC, Gionis A, Mannila H (2010) Evaluating query result significance in databases via randomizations. In: Proceedings of the 10th SIAM international conference on data mining (SDM). Columbus, OH, pp 906–917 Ojala M, Garriga GC, Gionis A, Mannila H (2010) Evaluating query result significance in databases via randomizations. In: Proceedings of the 10th SIAM international conference on data mining (SDM). Columbus, OH, pp 906–917
Zurück zum Zitat Rasch G (1960) Probabilistic models for some intelligence and attainnment tests. Danmarks paedagogiske Institut Rasch G (1960) Probabilistic models for some intelligence and attainnment tests. Danmarks paedagogiske Institut
Zurück zum Zitat Segal E, Taskar B, Gasch A, Friedman N, Koller D (2001) Rich probabilistic models for gene expression. Bioinformatics 17(suppl 1):S243–S252CrossRef Segal E, Taskar B, Gasch A, Friedman N, Koller D (2001) Rich probabilistic models for gene expression. Bioinformatics 17(suppl 1):S243–S252CrossRef
Zurück zum Zitat Shahaf D, Guestrin C (2010) Connecting the dots between news articles. In: Proceedings of the 16th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Washington, DC, pp 623–632 Shahaf D, Guestrin C (2010) Connecting the dots between news articles. In: Proceedings of the 16th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Washington, DC, pp 623–632
Zurück zum Zitat Shahaf D, Guestrin C (2012) Connecting two (or less) dots: discovering structure in news articles. ACM Trans Knowl Discov Data 5(4):24:1–24:31CrossRef Shahaf D, Guestrin C (2012) Connecting two (or less) dots: discovering structure in news articles. ACM Trans Knowl Discov Data 5(4):24:1–24:31CrossRef
Zurück zum Zitat Sheng Q, Moreau Y, De Moor B (2003) Biclustering microarray data by gibbs sampling. Bioinformatics 19(suppl 2):196–205CrossRef Sheng Q, Moreau Y, De Moor B (2003) Biclustering microarray data by gibbs sampling. Bioinformatics 19(suppl 2):196–205CrossRef
Zurück zum Zitat Spyropoulou E, De Bie T (2011) Interesting multi-relational patterns. Proceedings of the 11th IEEE international conference on data mining (ICDM). Vancouver, Canada, pp 675–684 Spyropoulou E, De Bie T (2011) Interesting multi-relational patterns. Proceedings of the 11th IEEE international conference on data mining (ICDM). Vancouver, Canada, pp 675–684
Zurück zum Zitat Spyropoulou E, De Bie T, Boley M (2013) Mining interesting patterns in multi-relational data with n-ary relationships. Discovery Science, vol 8140, Lecture Notes in Computer Science. Springer, Berlin, pp 217–232 Spyropoulou E, De Bie T, Boley M (2013) Mining interesting patterns in multi-relational data with n-ary relationships. Discovery Science, vol 8140, Lecture Notes in Computer Science. Springer, Berlin, pp 217–232
Zurück zum Zitat Spyropoulou E, De Bie T, Boley M (2014) Interesting pattern mining in multi-relational data. Data Min Knowl Discov 28(3):808–849CrossRefMathSciNet Spyropoulou E, De Bie T, Boley M (2014) Interesting pattern mining in multi-relational data. Data Min Knowl Discov 28(3):808–849CrossRefMathSciNet
Zurück zum Zitat Tatti N, Vreeken J (2012) Comparing apples and oranges - measuring differences between exploratory data mining results. Data Min Knowl Disc 25(2):173–207CrossRefMATHMathSciNet Tatti N, Vreeken J (2012) Comparing apples and oranges - measuring differences between exploratory data mining results. Data Min Knowl Disc 25(2):173–207CrossRefMATHMathSciNet
Zurück zum Zitat Tibshirani R, Hastie T, Eisen M, Ross D, Botstein D, Brown P (1999) Clustering methods for the analysis of dna microarray data. Stanford University, Tech. rep Tibshirani R, Hastie T, Eisen M, Ross D, Botstein D, Brown P (1999) Clustering methods for the analysis of dna microarray data. Stanford University, Tech. rep
Zurück zum Zitat Uno T, Kiyomi M, Arimura H (2005) Lcm ver.3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, New York, NY, USA, OSDM ’05, pp 77–86 Uno T, Kiyomi M, Arimura H (2005) Lcm ver.3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, New York, NY, USA, OSDM ’05, pp 77–86
Zurück zum Zitat Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (SIGKDD), Philadelphia, PA, pp 730–735 Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (SIGKDD), Philadelphia, PA, pp 730–735
Zurück zum Zitat Zaki M, Hsiao CJ (2005) Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng 17(4):462–478CrossRef Zaki M, Hsiao CJ (2005) Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng 17(4):462–478CrossRef
Zurück zum Zitat Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Chicago, IL, pp 364–373 Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Chicago, IL, pp 364–373
Metadaten
Titel
Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas
verfasst von
Hao Wu
Jilles Vreeken
Nikolaj Tatti
Naren Ramakrishnan
Publikationsdatum
01.09.2014
Verlag
Springer US
Erschienen in
Data Mining and Knowledge Discovery / Ausgabe 5-6/2014
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI
https://doi.org/10.1007/s10618-014-0370-1

Weitere Artikel der Ausgabe 5-6/2014

Data Mining and Knowledge Discovery 5-6/2014 Zur Ausgabe