Abstract
Many datasets of interest today are best described as a linked collection of interrelated objects. These may represent homogeneous networks, in which there is a single-object type and link type, or richer, heterogeneous networks, in which there may be multiple object and link types (and possibly other semantic information). Examples of homogeneous networks include single mode social networks, such as people connected by friendship links, or the WWW, a collection of linked web pages. Examples of heterogeneous networks include those in medical domains describing patients, diseases, treatments and contacts, or in bibliographic domains describing publications, authors, and venues. Link mining refers to data mining techniques that explicitly consider these links when building predictive or descriptive models of the linked data. Commonly addressed link mining tasks include object ranking, group detection, collective classification, link prediction and subgraph discovery. While network analysis has been studied in depth in particular areas such as social network analysis, hypertext mining, and web analysis, only recently has there been a cross-fertilization of ideas among these different communities. This is an exciting, rapidly expanding area. In this article, we review some of the common emerging themes.
- J. Adibi, H. Chalupsky, M. Grobelnik, N. Milic-Frayling, and D. Mladenic. KDD Workshop on Link Analysis and Group Detection. 2004.]]Google Scholar
- J. Adibi, H. Chalupsky, E. Melz, and A. Valente. The KOJAK group finder: Connecting the dots via integrated knowledge-based and statistical reasoning. In Innovative Applications of Artificial Intelligence Conference, 2004.]]Google ScholarCross Ref
- J. Adibi, M. Grobelnik, D. Mladenic, and P. Pantel. KDD Workshop on Link Discovery: Issues, Approaches and Applications. 2005.]]Google Scholar
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In International Conference on Very Large Data Bases, pages 487--499, Sept. 1994.]] Google ScholarDigital Library
- E. M. Airoldi and K. M. Carley. Sampling algorithms for pure network topologies. SIGKDD Explorations, 7(2), 2005.]] Google ScholarDigital Library
- R. Ananthakrishna, S. Chaudhuri, and V. Ganti. Eliminating fuzzy duplicates in data warehouses. In International Conference on Very Large Databases (VLDB), Hong Kong, China, 2002.]]Google ScholarCross Ref
- A. L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509--512, 1999.]]Google ScholarCross Ref
- K. Bharat and M. R. Henzinger. Improved algorithms for topic distillation in a hyperlinked environment. In ACM SIGIR International Conference on Research and Development in Information Retrieval, pages 104--111, 1998.]] Google ScholarDigital Library
- I. Bhattacharya and L. Getoor. Iterative record linkage for cleaning and integration. In SIGMOD 2004 Workshop on Research Issues on Data Mining and Knowledge Discovery, June 2004.]] Google ScholarDigital Library
- I. Bhattacharya and L. Getoor. Entity resolution in graphs. Technical Report 4758, Computer Science Department, University of Maryland, 2005.]]Google Scholar
- I. Bhattacharya and L. Getoor. A Latent dirichlet model for unsupervised entity resolution. In SIAM International Conference on Data Mining, 2006. To Appear.]]Google ScholarCross Ref
- P. Bonacich. Power and centrality: A family of measures. American Journal of Sociology, 92(5):1170--1182, 1987.]]Google ScholarCross Ref
- S. P. Borgatti and M. G. Everett. Models of core / periphery structures. Social Networks, 21:375--395, 1999.]]Google ScholarCross Ref
- P. J. Carrington, J. Scott, and S. Wasserman. Models and Methods in Social Network Analysis. Cambridge University Press, Cambridge, 2005.]]Google ScholarCross Ref
- D. Chakrabarti. Tools for Large Graph Mining. PhD thesis, School of Computer Science, Carnegie Mellon University, 2005.]] Google ScholarDigital Library
- S. Chakrabarti. Mining the Web. Morgan Kaufman, 2002.]]Google Scholar
- S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan, and S. Rajagopalan. Automatic resource list compilation by analyzing hyperlink structure and associated text. In International World Wide Web Conference (WWW), 1998.]] Google ScholarDigital Library
- S. Chakrabarti, B. Dom, and P. Indyk. Enhanced hypertext categorization using hyperlinks. In SIGMOD International Conference on Management of Data, pages 307--318, 1998.]] Google ScholarDigital Library
- R. Chellappa and A. Jain. Markov random fields: theory and applications. Academic Press, Boston, 1993.]]Google Scholar
- D. Cohn and H. Chang. Learning to probabilistically identify authoritative documents. In International Conference on Machine Learning (ICML), pages 167--174. Morgan Kaufmann, San Francisco, CA, 2000.]] Google ScholarDigital Library
- D. Cohn and T. Hofmann. The missing link - a probabilistic model of document content and hypertext connectivity. In Neural Information Processing Systems 13, 2001.]]Google Scholar
- D. J. Cook and L. B. Holder. Substructure discovery using minimum description length and background knowledge. Journal of Artificial Intelligence Research, 1:231--255, 1994.]]Google ScholarDigital Library
- D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15(2):32--41, 2000.]] Google ScholarDigital Library
- M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. Learning to construct knowledge bases from the world wide web. Artificial Intelligence, 118(1-2):69--114, 2000.]] Google ScholarDigital Library
- A. Culotta and A. McCallum. Joint deduplication of multiple record types in relational data. In Fourteenth Conference on Information and Knowledge Management (CIKM), 2005.]] Google ScholarDigital Library
- G. V. Cybenko and J. Srivastava. SIAM Workshop on Link Analysis, Counterterrorism and Privacy. 2004.]]Google Scholar
- L. Dehaspe, H. Toivonen, and R. King. Finding frequent substructures in chemical compounds. In International Conference on Knowledge Discovery and Data Mining, pages 30--36, 1998.]]Google Scholar
- T. Dietterich, L. Getoor, and K. Murphy. ICML Workshop on Statistical Relational Learning and its Connections to Other Fields. 2004.]]Google Scholar
- C. Ding, X. He, P. Husbands, H. Zha, and H. D. Simon. PageRank, HITS and a unified framework for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval, pages 353--354, 2002.]] Google ScholarDigital Library
- C. H. Q. Ding. Spectral clustering, 2004. http://crd.lbl.gov/cding/Spectral/.]]Google Scholar
- A. Doan, P. Domingos, and A. Y. Halevy. Learning to match the schemas of data sources: A multistrategy approach. Machine Learning, 50(3), 2003.]] Google ScholarDigital Library
- A. Doan, J. Madhavan, P. Domingos, and A. Halevy. Learning to map between ontologies on the semantic web. In International World Wide Web Conference, 2002.]] Google ScholarDigital Library
- P. Domingos and M. Richardson. Markov logic: A unifying framework for statistical relational learning. In ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other Fields, pages 49--54, 2004.]]Google Scholar
- X. Dong, A. Halevy, and J. Madhavan. Reference reconciliation in complex information spaces. In ACM SIGMOD International Conference on Management of Data, pages 85--96, 2005.]] Google ScholarDigital Library
- S. Donoho, T. Dybala, M. Grobelnik, N. Milic-Frayling, and D. Mladenic. KDD Workshop on Link Analysis for Detecting Complex Behavior. 2003.]]Google Scholar
- S. Dzeroski and H. Blockeel. KDD Workshop on Multi-Relational Data Mining. 2004.]]Google Scholar
- S. Dzeroski and H. Blockeel. KDD Workshop on Multi-Relational Data Mining. 2005.]]Google Scholar
- S. Dzeroski and N. Lavrac, editors. Relational Data Mining. Kluwer, Berlin, 2001.]] Google ScholarDigital Library
- S. Dzeroski, L. D. Raedt, and S. Wrobel. KDD Workshop on Multi-Relational Data Mining. 2003.]]Google Scholar
- R. Feldman. Link analysis: Current state of the art, 2002.]]Google Scholar
- O. Frank and K. Nowicki. Exploratory statistical analysis of networks. Annals of Discrete Mathematics, 55:349--366, 1993.]]Google ScholarCross Ref
- T. Frantz and K. M. Carley. A formal characterization of cellular networks. Technical Report CMU-ISRI-05-109, Carnegie Mellon University, 2005.]]Google ScholarCross Ref
- L. Freeman. Centrality in social networks: Conceptual clarifications. Social Networks, 1:215--239, 1979.]]Google ScholarCross Ref
- T. Gärtner. Exponential and geometric kernels for graphs. In NIPS Workshop on Unreal Data: Principles of Modeling Nonvectorial Data, 2002.]]Google Scholar
- T. Gärtner. A survey of kernels for structured data. SIGKDD Explorations, 5(1):49--58, 2003.]] Google ScholarDigital Library
- L. Getoor. Link mining: a new data mining challenge. SIGKDD Explorations, 5(1):84--89, 2003.]] Google ScholarDigital Library
- L. Getoor. N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of link structure. Journal of Machine Learning Research, 3:679--707, 2003.]] Google ScholarDigital Library
- L. Getoor and D. Jensen. AAAI Workshop on Learning Statistical Models from Relational Data. AAAI Press, 2000.]]Google Scholar
- L. Getoor and D. Jensen. IJCAI Workshop on Learning Statistical Models from Relational Data. 2003.]]Google Scholar
- D. Gibson, J. Kleinberg, and P. Raghavan. Inferring web communities from link topology. In ACM Conference on Hypertext and Hypermedia, pages 225--234, 1998.]] Google ScholarDigital Library
- T. H. Haveliwala. Topic-sensitive PageRank. In International Conference on the World Wide Web (WWW), pages 517--526, 2002.]] Google ScholarDigital Library
- M. Huisman and T. A. B. Snijders. Statistical analysis of longitudinal network data with changing composition. Sociological Methods and Research, 32:253--287, 2003.]]Google ScholarCross Ref
- R. Hummel and S. Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 267--287, 1983.]]Google ScholarDigital Library
- A. Inokuchi, T. Washio, and H. Motoda. An Apriori-based algorithm for mining frequent substructures from graph data. In European Conference on Principles and Practice of Knowledge Discovery and Data Mining, pages 13--23, 2000.]] Google ScholarDigital Library
- G. Jeh and J. Widom. SimRank: A measure of structural-context similarity. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538--543, 2002.]] Google ScholarDigital Library
- G. Jeh and J. Widom. Scaling personalized web search. In International Conference on the World Wide Web (WWW), pages 271--279, 2003.]] Google ScholarDigital Library
- D. Jensen. Statistical challenges to inductive inference in linked data. In Seventh International Workshop on Artificial Intelligence and Statistics, 1999.]]Google Scholar
- D. Jensen and H. Goldberg. AAAI Fall Symposium on AI and Link Analysis. AAAI Press, 1998.]]Google Scholar
- D. V. Kalashnikov, S. Mehrotra, and Z. Chen. Exploiting relationships for domain-independent data cleaning. In SIAM International Conference on Data Mining, April 21--23 2005.]]Google ScholarCross Ref
- H. Kashima and A. Inokuchi. Kernels for graph classification. In ICDM Workshop on Active Mining, 2002.]]Google Scholar
- C. Kemp, T. L. Griffiths, and J. B. Tenenbaum. Discovering latent classes in relational data. Technical Report AI Memo 2004-019, Massachusetts Institute of Technology, September 2004.]]Google Scholar
- N. Ketkar, L. Holder, and D. Cook. Comparison of graph-based and logic-based multi-relational data mining. SIGKDD Explorations, 7(2), December 2005.]] Google ScholarDigital Library
- R. D. King, S. H. Muggleton, A. Srinivasan, and M. J. E. Sternberg. Structure-activity relationships derived by machine learning: The use of atoms and their bond connectivities to predict mutagenicity by inductive logic programming. National Academy of Sciences, 93(1):438--442, January 1996.]]Google ScholarCross Ref
- J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604--632, 1999.]] Google ScholarDigital Library
- A. Knobbe and D. van der Wallen. ECML/PKDD Workshop on Multi-Relational Data Mining. 2001.]]Google Scholar
- J. N. Kok and T. Washio. ECML/PKDD Workshop on Mining Graphs, Trees and Sequences. 2004.]]Google Scholar
- J. Kubica, A. Moore, D. Cohn, and J. Schneider. eGraph: A fast graph-based method for link analysis and queries. In IJCAI 2003 Text-Mining and Link-Analysis Workshop, August 2003.]]Google Scholar
- J. Kubica, A. Moore, and J. Schneider. Tractable group detection on large link data sets. In The Third IEEE International Conference on Data Mining, pages 573--576, 2003.]] Google ScholarDigital Library
- J. Kubica, A. Moore, J. Schneider, and Y. Yang. Stochastic link and group detection. In Eighteenth National Conference on Artificial Intelligence, pages 798--804. American Association for Artificial Intelligence, 2002.]] Google ScholarDigital Library
- M. Kuramochi and G. Karypis. Frequent subgraph discovery. In IEEE International Conference on Data Mining, pages 313--320, 2001.]] Google ScholarDigital Library
- J. Lafferty, A. McCallum, and F. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. of ICML-01, 2001.]] Google ScholarDigital Library
- N. Lavrač and S. Džeroski. Inductive Logic Programming: Techniques and Applications. Ellis Horwood, 1994.]] Google ScholarDigital Library
- R. Lempel and S. Moran. The stochastic approach for link-structure analysis (SALSA) and the TKC effect. Computer Networks, 33(1-6):387--401, 2000.]] Google ScholarDigital Library
- X. Li, P. Morie, and D. Roth. Semantic integration in text: From ambiguous names to identifiable entities. AI Magazine. Special Issue on Semantic Integration, 2005.]] Google ScholarDigital Library
- D. Liben-Nowell and J. Kleinberg. The link prediction problem for social networks. In International Conference on Information and Knowledge Management (CIKM), pages 556--559, 2003.]] Google ScholarDigital Library
- Q. Lu and L. Getoor. Link-based classification. In International Conference on Machine Learning, 2003.]]Google Scholar
- A. Madche and S. Staab. Ontology learning for the semantic web. IEEE Intelligent Systems, 16(2):72--79, March/April 2001.]] Google ScholarDigital Library
- T. Matsuda, T. Horiuchi, H. Motoda, and T. Washio. Extension of graph-based induction for general graph structured data. In PAKDD, pages 420--431, 2000.]] Google ScholarDigital Library
- S. Muggleton, editor. Inductive Logic Programming. Academic Press, 1992.]]Google Scholar
- J. Neville and D. Jensen. Iterative classification in relational data. In Proc. AAAI-2000 Workshop on Learning Statistical Models from Relational Data. AAAI Press, 2000.]]Google Scholar
- J. Neville and D. Jensen. Leveraging relational auto-correlation with latent group models. In IEEE International Conference on Data Mining (ICDM), 2005.]] Google ScholarDigital Library
- M. E. J. Newman. Detecting community structure in networks. European Physical Journal B, 38:321--330, 2004.]]Google ScholarCross Ref
- A. Y. Ng, A. X. Zheng, and M. I. Jordan. Link analysis, eigenvectors and stability. In International Joint Conference on Artificial Intelligence (IJCAI), pages 903--910, 2001.]]Google Scholar
- A. Y. Ng, A. X. Zheng, and M. I. Jordan. Stable algorithms for link analysis. In ACM SIGIR Conference on Research and Development in Information Retrieval, 2001.]] Google ScholarDigital Library
- S. Nijssen, T. Meinl, and G. Karypis. ECML/PKDD Workshop on Mining Graphs, Trees and Sequences. 2005.]]Google Scholar
- K. Nowicki and T. A. B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077--1087, 2001.]]Google ScholarCross Ref
- H.-J. Oh, S. H. Myaeng, and M.-H. Lee. A practical hypertext catergorization method using links and incrementally available class information. In International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 264--271, 2000.]] Google ScholarDigital Library
- J. O'Madadhain, J. Hutchins, and P. Smyth. Prediction and ranking algorithms for even-based network data. SIGKDD Explorations, 7(2), December 2005.]] Google ScholarDigital Library
- J. O'Madadhain and P. Smyth. EventRank: A framework for ranking time-varying networks. In KDD Workshop on Link Discovery (LinkKDD): Issues, Approaches and Applications, 2005.]] Google ScholarDigital Library
- J. O'Madadhain, P. Smyth, and L. Adamic. Learning predictive models for link formation. Presented at the International Sunbelt Social Network Conference, February, 2005.]]Google Scholar
- L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford University, 1998.]]Google Scholar
- H. Pasula, B. Marthi, B. Milch, S. Russell, and I. Shpitser. Identity uncertainty and citation matching. In Advances in Neural Information Processing Systems 15. MIT Press, 2003.]]Google Scholar
- A. Popescul and L. H. Ungar. Statistical relational learning for link prediction. In IJCAI Workshop on Learning Statistical Models from Relational Data, 2003.]]Google ScholarDigital Library
- L. D. Raedt and T. Washio. ECML/PKDD Workshop on Mining Graphs, Trees and Sequences. 2003.]]Google Scholar
- D. Rafiei and A. O. Mendelzon. What is this page known for? Computing web page reputations. In International World Wide Web Conference (WWW), pages 823--835. North-Holland Publishing Co., 2000.]] Google ScholarDigital Library
- C. Ramakrishnan, W. Milnor, M. Perry, and A. Sheth. Discovering informative connection subgraphs in multi-relational graphs. SIGKDD Explorations, 7(2), December 2005.]] Google ScholarDigital Library
- M. Rattigan and D. Jensen. The case for anomalous link discovery. SIGKDD Explorations, 7(2), December 2005.]] Google ScholarDigital Library
- M. Richardson and P. Domingos. The Intelligent Surfer: Probabilistic Combination of Link and Content Information in PageRank. In Advances in Neural Information Processing Systems 14. MIT Press, 2002.]]Google Scholar
- A. Rosenfeld, R. Hummel, and S. Zucker. Scene labeling by relaxation operations. IEEE Transactions on Systems, Man and Cybernetics, 6(6), 1976.]]Google ScholarCross Ref
- T. Senator. Link mining applications: Progress and challenges. SIGKDD Explorations, 7(2), 2005.]] Google ScholarDigital Library
- L. Singh, L. Getoor, and L. Licamele. Pruning social networks using structural properties and descriptive attributes. In International Conference on Data Mining, 2005.]] Google ScholarDigital Library
- P. Singla and P. Domingos. Multi-relational record linkage. In KDD Workshop on Multi-Relational Data Mining, Seattle, WA, August 2004.]]Google Scholar
- D. Skillicorn and K. Carley. SIAM Workshop on Link Analysis, Counterterrorism and Security. 2005.]]Google Scholar
- A. Srivastava, D. Barbara, H. Kargupta, and V. Kumar. SIAM Workshop on Data Mining for Counterterrorism and Security. 2003.]]Google Scholar
- J. Sun, H. Qu, D. Chakrabarti, and C. Faloutsos. Relevance search and anomaly detection in bipartite graphs. SIGKDD Explorations, 7(2), December 2005.]] Google ScholarDigital Library
- L. Sweeney. Privacy-enhanced linking. SIGKDD Explorations, 7(2), 2005.]] Google ScholarDigital Library
- B. Taskar, P. Abbeel, and D. Koller. Discriminative probabilistic models for relational data. In Proc. of UAI, pages 485--492, Edmonton, Canada, 2002.]]Google Scholar
- B. Taskar, M.-F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Neural Information Processing Systems Conference, Vancouver, Canada, December 2003.]]Google Scholar
- J. R. Tyler, D. M. Wilkinson, and B. A. Huberman. Email as Spectroscopy: Automated Discovery of Community Structure within Organizations. Kluwer, B. V., Deventer, The Netherlands, The Netherlands, 2003.]] Google ScholarDigital Library
- X. Wang, N. Mohanty, and A. McCallum. Group and topic discovery from relations and text. In KDD Workshop on Link Discovery, August 2005.]] Google ScholarDigital Library
- S. Wasserman. Conformity of two sociometric relations. Psychometrika, 52:3--18, 1987.]]Google ScholarCross Ref
- S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, 1994.]]Google ScholarCross Ref
- D. J. Watts and S. H. Strogatz. Collective dynamics of "small-world" networks. Nature, 393:440--442, 1998.]]Google ScholarCross Ref
- S. White and P. Smyth. Algorithms for estimating relative importance in networks. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 266--275, 2003.]] Google ScholarDigital Library
- A. P. Wolfe and D. Jensen. Playing multiple roles: Discovering overlapping roles in social networks. In ICML-04 Workshop on Statistical Relational Learning and its Connections to Other Fields, 2004.]]Google Scholar
- X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In International Conference on Data Mining, 2002.]] Google ScholarDigital Library
- K. Yoshida, H. Motoda, and N. Indurkhya. Graph-based induction as a unified learning framework. Journal of Applied Intelligence, 4(3):297--316, July 1994.]]Google ScholarCross Ref
Recommendations
Authoritative sources in a hyperlinked environment
The network structure of a hyperlinked environment can be a rich source of information about the content of the environment, provided we have effective means for understanding it. We develop a set of algorithmic tools for extracting information from the ...
A Survey of Link Prediction in Complex Networks
Networks have become increasingly important to model complex systems composed of interacting elements. Network data mining has a large number of applications in many disciplines including protein-protein interaction networks, social networks, ...
node2vec: Scalable Feature Learning for Networks
KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data MiningPrediction tasks over nodes and edges in networks require careful effort in engineering features used by learning algorithms. Recent research in the broader field of representation learning has led to significant progress in automating prediction by ...
Comments