nach oben

Data Mining and Knowledge Discovery

Erschienen in:

01.09.2014

Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas

verfasst von: Hao Wu, Jilles Vreeken, Nikolaj Tatti, Naren Ramakrishnan

Erschienen in: Data Mining and Knowledge Discovery | Ausgabe 5-6/2014

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

Many application domains such as intelligence analysis and cybersecurity require tools for the unsupervised identification of suspicious entities in multi-relational/network data. In particular, there is a need for automated semi-automated approaches to ‘uncover the plot’, i.e., to detect non-obvious coalitions of entities bridging many types of relations. We cast the problem of detecting such suspicious coalitions and their connections as one of mining surprisingly dense and well-connected chains of biclusters over multi-relational data. With this as our goal, we model data by the Maximum Entropy principle, such that in a statistically well-founded way we can gauge the surprisingness of a discovered bicluster chain with respect to what we already know. We design an algorithm for approximating the most informative multi-relational patterns, and provide strategies to incrementally organize discovered patterns into the background model. We illustrate how our method is adept at discovering the hidden plot in multiple synthetic and real-world intelligence analysis datasets. Our approach naturally generalizes traditional attribute-based maximum entropy models for single relations, and further supports iterative, human-in-the-loop, knowledge discovery.

Vorheriger Artikel Unsupervised interaction-preserving discretization of multivariate data

Nächster Artikel Discovering bands from graphs

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Note that, to save computation, we do not update the MaxEnt model after adding each \(B^*\). However, in line with the local score, we know that adding a bicluster typically only changes the distribution locally, and as we never re-visit the same relation \(R\) in a single chain \(C\) the information by \(B_i\) is unlikely to influence much the informativeness of \(B_{i+1}\).

http://dac.cs.vt.edu/projects.

http://www.alchemyapi.com/.

Califano A, Stolovitzky G, Tu Y (2000) Analysis of gene expression microarrays for phenotype classification. In: Proceedings of the 8th international conference on intelligent systems for molecular biology, pp 75–85

Cerf L, Besson J, Robardet C, Boulicaut JF (2009) Closed patterns meet n-ary relations. ACM Trans Knowl Discov Data 3(1):3:1–3:36CrossRef

Cerf L, Besson J, Nguyen KNT, Boulicaut JF (2013) Closed and noise-tolerant patterns in n-ary relations. Data Min Knowl Discov 26(3):574–619CrossRefMATHMathSciNet

Cheng Y, Church GM (2000) Biclustering of expression data. In: Proceedings of the eighth international conference on intelligent systems for molecular biology, AAAI Press, pp 93–103

Cover T, Thomas J (2006) Elements of information theory. Wiley, New YorkMATH

Csiszar I (1975) \(I\)-Divergence geometry of probability distributions and minimization problems. Ann Probab 3(1):146–158CrossRefMATHMathSciNet

Darroch JN, Ratcliff D (1972) Generalized iterative scaling for log-linear models. Ann Math Stat 43(5):1470–1480CrossRefMATHMathSciNet

Davis WLI, Schwarz P, Terzi E (2009) Finding representative association rules from large rule collections. In: Proceedings of the 9th SIAM international conference on data mining (SDM). Sparks, NV, SIAM, pp 521–532

De Bie T (2011) Maximum entropy models and subjective interestingness: an application to tiles in binary databases. Data Min Knowl Discov 23(3):407–446CrossRefMATHMathSciNet

Dehaspe L, Toironen H (2000) Discovery of relational association rules. In: Dĕzeroski S (ed) Relational data mining. Springer, New York Inc, pp 189–208

Dzeroski S, Lavrac N (eds) (2001) Relational data mining. Springer, BerlinMATH

Geerts F, Goethals B, Mielikainen T (2004) Tiling databases. In: Proceedings of discovery science. Springer, Berlin, pp 278–289

Gionis A, Mannila H, Mielikäinen T, Tsaparas P (2007) Assessing data mining results via swap randomization. ACM Trans Knowl Discov Data 1(3):167–176CrossRef

Hanhijärvi S, Ojala M, Vuokko N, Puolamäki K, Tatti N, Mannila H (2009) Tell me something I don’t know: randomization strategies for iterative data mining. In: Proceedings of the 15th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Paris, France, pp 379–388

Hossain M, Gresock J, Edmonds Y, Helm R, Potts M, Ramakrishnan N (2012a) Connecting the dots between PubMed abstracts. PLoS ONE 7(1)

Hossain MS, Butler P, Boedihardjo AP, Ramakrishnan N (2012b) Storytelling in entity networks to support intelligence analysts. In: Proceedings of the 18th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Beijing, China, pp 1375–1383

Hughes FJ (2005) Discovery, proof, choice: the art and science of the process of intelligence analysis, case study 6, “All Fall Down”, unpublished report

Jaynes ET (1957) Information theory and statistical mechanics. Phys Rev Ser II 106(4):620–630MATHMathSciNet

Jin Y, Murali TM, Ramakrishnan N (2008) Compositional mining of multirelational biological datasets. ACM Trans Knowl Discov Data 2(1):2:1–2:35CrossRef

Kiernan J, Terzi E (2008) Constructing comprehensive summaries of large event sequences. In: Proceedings of the 14th ACM international conference on knowledge discovery and data mining (SIGKDD). Las Vegas, NV, pp 417–425

Kontonasios KN, Vreeken J, De Bie T (2011) Maximum entropy modelling for assessing results on real-valued data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM). Vancouver, Canada, IEEE, pp 350–359

Kontonasios KN, Vreeken J, De Bie T (2013) Maximum entropy models for iteratively identifying subjectively interesting structure in real-valued data. In: Proceedings of the European conference on machine learning and principles and practice of knowledge discovery in databases (ECML PKDD). Springer, Prague, Czech Republic, pp 256–271

Kumar D, Ramakrishnan N, Helm RF, Potts M (2006) Algorithms for storytelling. In: Proceedings of the 12th ACM international conference on knowledge discovery and data Mining (SIGKDD), Philadelphia, PA, pp 604–610

Lavrac N, Flach P (2001) An extended transformation approach to inductive logic programming. ACM Trans Comput Logic 2(4):458–494CrossRef

Madeira SC, Oliveira AL (2004) Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans Comput Biol Bioinf 1(1):24–45CrossRef

Mampaey M, Vreeken J, Tatti N (2012) Summarizing data succinctly with the most informative itemsets. ACM Trans Knowl Discov Data 6:1–44CrossRef

Ojala M, Garriga GC, Gionis A, Mannila H (2010) Evaluating query result significance in databases via randomizations. In: Proceedings of the 10th SIAM international conference on data mining (SDM). Columbus, OH, pp 906–917

Rasch G (1960) Probabilistic models for some intelligence and attainnment tests. Danmarks paedagogiske Institut

Rissanen J (1978) Modeling by shortest data description. Automatica 14(1):465–471CrossRefMATH

Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6(2):461–464CrossRefMATH

Segal E, Taskar B, Gasch A, Friedman N, Koller D (2001) Rich probabilistic models for gene expression. Bioinformatics 17(suppl 1):S243–S252CrossRef

Shahaf D, Guestrin C (2010) Connecting the dots between news articles. In: Proceedings of the 16th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Washington, DC, pp 623–632

Shahaf D, Guestrin C (2012) Connecting two (or less) dots: discovering structure in news articles. ACM Trans Knowl Discov Data 5(4):24:1–24:31CrossRef

Sheng Q, Moreau Y, De Moor B (2003) Biclustering microarray data by gibbs sampling. Bioinformatics 19(suppl 2):196–205CrossRef

Spyropoulou E, De Bie T (2011) Interesting multi-relational patterns. Proceedings of the 11th IEEE international conference on data mining (ICDM). Vancouver, Canada, pp 675–684

Spyropoulou E, De Bie T, Boley M (2013) Mining interesting patterns in multi-relational data with n-ary relationships. Discovery Science, vol 8140, Lecture Notes in Computer Science. Springer, Berlin, pp 217–232

Spyropoulou E, De Bie T, Boley M (2014) Interesting pattern mining in multi-relational data. Data Min Knowl Discov 28(3):808–849CrossRefMathSciNet

Tatti N (2006) Computational complexity of queries based on itemsets. Inf Process Lett 98(5):183–187. doi:10.1016/j.ipl.2006.02.003 CrossRefMATHMathSciNet

Tatti N, Vreeken J (2012) Comparing apples and oranges - measuring differences between exploratory data mining results. Data Min Knowl Disc 25(2):173–207CrossRefMATHMathSciNet

Tibshirani R, Hastie T, Eisen M, Ross D, Botstein D, Brown P (1999) Clustering methods for the analysis of dna microarray data. Stanford University, Tech. rep

Uno T, Kiyomi M, Arimura H (2005) Lcm ver.3: collaboration of array, bitmap and prefix tree for frequent itemset mining. In: Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ACM, New York, NY, USA, OSDM ’05, pp 77–86

Wang C, Parthasarathy S (2006) Summarizing itemset patterns using probabilistic models. In: Proceedings of the 12th ACM international conference on knowledge discovery and data mining (SIGKDD), Philadelphia, PA, pp 730–735

Zaki M, Hsiao CJ (2005) Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng 17(4):462–478CrossRef

Zaki MJ, Ramakrishnan N (2005) Reasoning about sets using redescription mining. In: Proceedings of the 11th ACM international conference on knowledge discovery and data mining (SIGKDD). ACM, Chicago, IL, pp 364–373

Titel: Uncovering the plot: detecting surprising coalitions of entities in multi-relational schemas
verfasst von: Hao Wu
Jilles Vreeken
Nikolaj Tatti
Naren Ramakrishnan
Publikationsdatum: 01.09.2014
Verlag: Springer US
Erschienen in: Data Mining and Knowledge Discovery / Ausgabe 5-6/2014
Print ISSN: 1384-5810
Elektronische ISSN: 1573-756X
DOI: https://doi.org/10.1007/s10618-014-0370-1

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Weitere Artikel der Ausgabe 5-6/2014

Detecting localized homogeneous anomalies over spatio-temporal data

Self-organizing maps by difference of convex functions optimization

Learning about meetings

Confidence bands for time series data

Classy: fast clustering streams of call-graphs

Invariant time-series factorization