Skip to main content
Top
Published in: Knowledge and Information Systems 2/2017

30-08-2016 | Regular Paper

Domain-agnostic discovery of similarities and concepts at scale

Authors: Olof Görnerup, Daniel Gillblad, Theodore Vasiloudis

Published in: Knowledge and Information Systems | Issue 2/2017

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Appropriately defining and efficiently calculating similarities from large data sets are often essential in data mining, both for gaining understanding of data and generating processes and for building tractable representations. Given a set of objects and their correlations, we here rely on the premise that each object is characterized by its context, i.e., its correlations to the other objects. The similarity between two objects can then be expressed in terms of the similarity between their contexts. In this way, similarity pertains to the general notion that objects are similar if they are exchangeable in the data. We propose a scalable approach for calculating all relevant similarities among objects by relating them in a correlation graph that is transformed to a similarity graph. These graphs can express rich structural properties among objects. Specifically, we show that concepts—abstractions of objects—are constituted by groups of similar objects that can be discovered by clustering the objects in the similarity graph. These principles and methods are applicable in a wide range of fields and will be demonstrated here in three domains: computational linguistics, music, and molecular biology, where the numbers of objects and correlations range from small to very large.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Footnotes
1
That is, most objects are either completely unrelated or at most negligibly correlated. Two randomly selected persons in a large social network, for instance, most likely do not know each other.
 
Literature
2.
go back to reference Alexandrov A, Bergmann R, Ewen S et al (2014) The Stratosphere platform for big data analytics. VLDB J 23:163–181 Alexandrov A, Bergmann R, Ewen S et al (2014) The Stratosphere platform for big data analytics. VLDB J 23:163–181
3.
go back to reference Anisimova M, Kosiol C (2009) Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 26(2):255–271CrossRef Anisimova M, Kosiol C (2009) Investigating protein-coding sequence evolution with probabilistic codon substitution models. Mol Biol Evol 26(2):255–271CrossRef
4.
go back to reference Bitton D, Boral H, DeWitt DJ et al (1983) Parallel algorithms for the execution of relational database operations. ACM Trans Database Syst 8(3):324–353CrossRef Bitton D, Boral H, DeWitt DJ et al (1983) Parallel algorithms for the execution of relational database operations. ACM Trans Database Syst 8(3):324–353CrossRef
5.
go back to reference Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. In: From form to meaning: processing texts automatically, Proceedings of the Biennial GSCL Conference, pp 31–40 Bouma G (2009) Normalized (pointwise) mutual information in collocation extraction. In: From form to meaning: processing texts automatically, Proceedings of the Biennial GSCL Conference, pp 31–40
6.
go back to reference Brown PF, deSouza PV, Mercer RL et al (1992) Class-based N-gram models of natural language. Comput Linguist 18(4):467–479 Brown PF, deSouza PV, Mercer RL et al (1992) Class-based N-gram models of natural language. Comput Linguist 18(4):467–479
7.
go back to reference Cancho RF, Solé RV (2001) The small world of human language. Proc R Soc Lond B Biol Sci 268(1482):2261–2265CrossRef Cancho RF, Solé RV (2001) The small world of human language. Proc R Soc Lond B Biol Sci 268(1482):2261–2265CrossRef
8.
go back to reference Celma Ò (2010) Music recommendation and discovery in the long tail. Springer, BerlinCrossRef Celma Ò (2010) Music recommendation and discovery in the long tail. Springer, BerlinCrossRef
9.
go back to reference Celma Ó, Cano P (2008) From hits to niches? Or how popular artists can bias music recommendation and discovery. In: Proceedings of the 2nd KDD workshop on large-scale recommender systems and the Netflix Prize Competition. ACM, p 5 Celma Ó, Cano P (2008) From hits to niches? Or how popular artists can bias music recommendation and discovery. In: Proceedings of the 2nd KDD workshop on large-scale recommender systems and the Netflix Prize Competition. ACM, p 5
10.
go back to reference Chandra AK, Merlin PM (1977) Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of the ninth annual ACM symposium on theory of computing, STOC ’77. ACM, New York, NY, USA, pp 77–90 Chandra AK, Merlin PM (1977) Optimal implementation of conjunctive queries in relational data bases. In: Proceedings of the ninth annual ACM symposium on theory of computing, STOC ’77. ACM, New York, NY, USA, pp 77–90
11.
go back to reference Chelba C, Mikolov T, Schuster M et al (2013) One billion word benchmark for measuring progress in statistical language modeling. CoRR arXiv:1312.3005 Chelba C, Mikolov T, Schuster M et al (2013) One billion word benchmark for measuring progress in statistical language modeling. CoRR arXiv:​1312.​3005
12.
go back to reference Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29 Church KW, Hanks P (1990) Word association norms, mutual information, and lexicography. Comput Linguist 16(1):22–29
13.
go back to reference Dayhoff MO, Schwartz RM (1978) Chapter 22: A model of evolutionary change in proteins. In: Atlas of protein sequence and structure Dayhoff MO, Schwartz RM (1978) Chapter 22: A model of evolutionary change in proteins. In: Atlas of protein sequence and structure
14.
go back to reference Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26(3):297–302CrossRef Dice LR (1945) Measures of the amount of ecologic association between species. Ecology 26(3):297–302CrossRef
15.
go back to reference Finkelstein L, Gabrilovich E, Matias Y et al (2001) Placing search in context: the concept revisited. In: Proceedings of the 10th international conference on World Wide Web, WWW ’01. ACM, New York, NY, USA, pp 406–414 Finkelstein L, Gabrilovich E, Matias Y et al (2001) Placing search in context: the concept revisited. In: Proceedings of the 10th international conference on World Wide Web, WWW ’01. ACM, New York, NY, USA, pp 406–414
16.
go back to reference Firth JR (1957) A synopsis of linguistic theory 1930–55. In: Studies in linguistic analysis (special volume of the Philological Society), vol 1952–59. The Philological Society, pp 1–32 Firth JR (1957) A synopsis of linguistic theory 1930–55. In: Studies in linguistic analysis (special volume of the Philological Society), vol 1952–59. The Philological Society, pp 1–32
18.
go back to reference Görnerup O, Gillblad D, Vasiloudis T (2015) Knowing an object by the company it keeps: a domain-agnostic scheme for similarity discovery. In: IEEE international conference on data mining (ICDM 2015) Görnerup O, Gillblad D, Vasiloudis T (2015) Knowing an object by the company it keeps: a domain-agnostic scheme for similarity discovery. In: IEEE international conference on data mining (ICDM 2015)
19.
go back to reference Halawi G, Dror G, Gabrilovich E et al (2012) Large-scale learning of word relatedness with constraints. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 1406–1414 Halawi G, Dror G, Gabrilovich E et al (2012) Large-scale learning of word relatedness with constraints. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, New York, NY, USA, pp 1406–1414
20.
go back to reference Harispe S, Ranwez S, Janaqi S et al (2015) Semantic similarity from natural language and ontology analysis. Synth Lect Hum Lang Technol 8(1):1–254CrossRef Harispe S, Ranwez S, Janaqi S et al (2015) Semantic similarity from natural language and ontology analysis. Synth Lect Hum Lang Technol 8(1):1–254CrossRef
21.
22.
go back to reference Hill F, Reichart R, Korhonen A (2014) Simlex-999: evaluating semantic models with (genuine) similarity estimation. CoRR arXiv:1408.3456 Hill F, Reichart R, Korhonen A (2014) Simlex-999: evaluating semantic models with (genuine) similarity estimation. CoRR arXiv:​1408.​3456
23.
go back to reference Jaccard P (1912) The distribution of the flora in the alpine zone. New Phytol 11(2):37–50CrossRef Jaccard P (1912) The distribution of the flora in the alpine zone. New Phytol 11(2):37–50CrossRef
24.
go back to reference Jeh G, Widom J (2002) Simrank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02. ACM, New York, NY, USA, pp 538–543 Jeh G, Widom J (2002) Simrank: a measure of structural-context similarity. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’02. ACM, New York, NY, USA, pp 538–543
25.
go back to reference Jordan IK, Mariño Ramírez L, Wolf YI et al (2004) Conservation and coevolution in the scale-free human gene coexpression network. Mol Biol Evol 21(11):2058–2070CrossRef Jordan IK, Mariño Ramírez L, Wolf YI et al (2004) Conservation and coevolution in the scale-free human gene coexpression network. Mol Biol Evol 21(11):2058–2070CrossRef
26.
go back to reference Kessler M (1963) Bibliographic coupling between scientific papers. Am Doc 14:10–25CrossRef Kessler M (1963) Bibliographic coupling between scientific papers. Am Doc 14:10–25CrossRef
27.
go back to reference Koutris P, Suciu D (2011) Parallel evaluation of conjunctive queries. In: Proceedings of the thirteenth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’11. ACM, New York, NY, USA, pp 223–234 Koutris P, Suciu D (2011) Parallel evaluation of conjunctive queries. In: Proceedings of the thirteenth ACM SIGMOD-SIGACT-SIGART symposium on principles of database systems, PODS ’11. ACM, New York, NY, USA, pp 223–234
28.
go back to reference Larson R (1996) Bibliometrics of the World Wide Web: an exploratory analysis of the intellectual structure of cyberspace. Ann. Meeting of the American Soc. Info, Sci Larson R (1996) Bibliometrics of the World Wide Web: an exploratory analysis of the intellectual structure of cyberspace. Ann. Meeting of the American Soc. Info, Sci
29.
go back to reference Leicht EA, Holme P, Newman MEJ (2006) Vertex similarity in networks. Phys Rev E 73:026120CrossRef Leicht EA, Holme P, Newman MEJ (2006) Vertex similarity in networks. Phys Rev E 73:026120CrossRef
30.
go back to reference Lin Y, Michel J, Aiden EL et al (2012) Syntactic annotations for the google books ngram corpus. In: Proceedings of the ACL 2012 system demonstrations, ACL ’12. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 169–174 Lin Y, Michel J, Aiden EL et al (2012) Syntactic annotations for the google books ngram corpus. In: Proceedings of the ACL 2012 system demonstrations, ACL ’12. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 169–174
31.
go back to reference Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, Pickett JP, Hoiberg D, Clancy D, Norvig P, Orwant J, Pinker S, Nowak MA, Aiden EL (2011) Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182 Michel JB, Shen YK, Aiden AP, Veres A, Gray MK, Pickett JP, Hoiberg D, Clancy D, Norvig P, Orwant J, Pinker S, Nowak MA, Aiden EL (2011) Quantitative analysis of culture using millions of digitized books. Science 331(6014):176–182
32.
go back to reference Mihalcea R, Radev D (2011) Graph-based natural language processing and information retrieval. Cambridge University Press, CambridgeCrossRefMATH Mihalcea R, Radev D (2011) Graph-based natural language processing and information retrieval. Cambridge University Press, CambridgeCrossRefMATH
33.
go back to reference Mikolov T, Sutskever I, Chen K et al (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119 Mikolov T, Sutskever I, Chen K et al (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems, pp 3111–3119
34.
go back to reference Miller GA (1995) Wordnet: a lexical database for English. Commun ACM 38(11):39–41CrossRef Miller GA (1995) Wordnet: a lexical database for English. Commun ACM 38(11):39–41CrossRef
35.
go back to reference Mislove A, Marcon M, Gummadi KP et al (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, IMC ’07. ACM, New York, NY, USA, pp 29–42 Mislove A, Marcon M, Gummadi KP et al (2007) Measurement and analysis of online social networks. In: Proceedings of the 7th ACM SIGCOMM conference on Internet measurement, IMC ’07. ACM, New York, NY, USA, pp 29–42
36.
go back to reference Nirenberg M, Leder P, Bernfield M et al (1965) RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc Natl Acad Sci 53:1161–1168CrossRef Nirenberg M, Leder P, Bernfield M et al (1965) RNA codewords and protein synthesis, VII. On the general nature of the RNA code. Proc Natl Acad Sci 53:1161–1168CrossRef
37.
go back to reference Palla G, Derenyi I, Farkas I et al (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818CrossRef Palla G, Derenyi I, Farkas I et al (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435(7043):814–818CrossRef
38.
go back to reference Pecina P (2008) A machine learning approach to multiword expression extraction. In: Proceedings of the LREC 2008 workshop towards a shared task for multiword expressions. European Language Resources Association, pp 54–57 Pecina P (2008) A machine learning approach to multiword expression extraction. In: Proceedings of the LREC 2008 workshop towards a shared task for multiword expressions. European Language Resources Association, pp 54–57
39.
go back to reference Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, pp 1532–1543 Pennington J, Socher R, Manning C (2014) Glove: global vectors for word representation. In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). Association for Computational Linguistics, pp 1532–1543
40.
go back to reference Ravasz E, Somera AL, Mongru DA et al (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555CrossRef Ravasz E, Somera AL, Mongru DA et al (2002) Hierarchical organization of modularity in metabolic networks. Science 297(5586):1551–1555CrossRef
41.
go back to reference Sahlgren M (2006) The Word-Space Model: using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Stockholm University Sahlgren M (2006) The Word-Space Model: using distributional analysis to represent syntagmatic and paradigmatic relations between words in high-dimensional vector spaces. Ph.D. thesis, Stockholm University
42.
go back to reference Schneider A, Cannarozzi G, Gonnet G (2005) Empirical codon substitution matrix. BMC Bioinform 6(134):1–7 Schneider A, Cannarozzi G, Gonnet G (2005) Empirical codon substitution matrix. BMC Bioinform 6(134):1–7
43.
go back to reference Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504CrossRef Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504CrossRef
44.
go back to reference Small H (1973) Co-citation in the scientific literature: a new measure of the relationship between two documents. J Am Soc Inf Sci 24(4):265–269MathSciNetCrossRef Small H (1973) Co-citation in the scientific literature: a new measure of the relationship between two documents. J Am Soc Inf Sci 24(4):265–269MathSciNetCrossRef
45.
go back to reference Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol Skr 5:1–34 Sørensen T (1948) A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Biol Skr 5:1–34
46.
go back to reference Steyvers M, Tenenbaum JB (2005) The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cogn Sci 29(1):41–78CrossRef Steyvers M, Tenenbaum JB (2005) The large-scale structure of semantic networks: statistical analyses and a model of semantic growth. Cogn Sci 29(1):41–78CrossRef
47.
go back to reference Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393(6684):409–10CrossRef Watts DJ, Strogatz SH (1998) Collective dynamics of ’small-world’ networks. Nature 393(6684):409–10CrossRef
48.
go back to reference Wong W, Liu W, Bennamoun M (2012) Ontology learning from text: a look back and into the future. ACM Comput Surv 44(4):20:1–20:36CrossRefMATH Wong W, Liu W, Bennamoun M (2012) Ontology learning from text: a look back and into the future. ACM Comput Surv 44(4):20:1–20:36CrossRefMATH
49.
go back to reference Wu TD, Brutlag DL (1996) Discovering empirically conserved amino acid substitution groups in databases of protein families. In: States DJ, Agarwal P, Gaasterland T, Hunter L, Smith R (eds) Proceedings of the fourth international conference on intelligent systems for molecular biology, St. Louis, MO, USA, June 12–15 1996. AAAI, pp 230–240 Wu TD, Brutlag DL (1996) Discovering empirically conserved amino acid substitution groups in databases of protein families. In: States DJ, Agarwal P, Gaasterland T, Hunter L, Smith R (eds) Proceedings of the fourth international conference on intelligent systems for molecular biology, St. Louis, MO, USA, June 12–15 1996. AAAI, pp 230–240
50.
go back to reference Xie J, Szymanski BK, Liu X (2011) SLPA: uncovering overlapping communities in social networks via a speaker–listener interaction dynamic process. In: ICDM 2011 Workshop on DMCCI Xie J, Szymanski BK, Liu X (2011) SLPA: uncovering overlapping communities in social networks via a speaker–listener interaction dynamic process. In: ICDM 2011 Workshop on DMCCI
51.
go back to reference Yih W, Qazvinian V (2012) Measuring word relatedness using heterogeneous vector space models. In: Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL HLT ’12. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 616–620 Yih W, Qazvinian V (2012) Measuring word relatedness using heterogeneous vector space models. In: Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, NAACL HLT ’12. Association for Computational Linguistics, Stroudsburg, PA, USA, pp 616–620
52.
go back to reference Yu W, Zhang W, Lin X et al (2012) A space and time efficient algorithm for simrank computation. World Wide Web 15(3):327–353CrossRef Yu W, Zhang W, Lin X et al (2012) A space and time efficient algorithm for simrank computation. World Wide Web 15(3):327–353CrossRef
53.
go back to reference Zaharia M, Chowdhury M, Das T et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Presented as part of the 9th USENIX symposium on networked systems design and implementation (NSDI 12), San Jose, CA, pp 15–28 Zaharia M, Chowdhury M, Das T et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Presented as part of the 9th USENIX symposium on networked systems design and implementation (NSDI 12), San Jose, CA, pp 15–28
54.
go back to reference Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4, Article17 Zhang B, Horvath S (2005) A general framework for weighted gene co-expression network analysis. Stat Appl Genet Mol Biol 4, Article17
Metadata
Title
Domain-agnostic discovery of similarities and concepts at scale
Authors
Olof Görnerup
Daniel Gillblad
Theodore Vasiloudis
Publication date
30-08-2016
Publisher
Springer London
Published in
Knowledge and Information Systems / Issue 2/2017
Print ISSN: 0219-1377
Electronic ISSN: 0219-3116
DOI
https://doi.org/10.1007/s10115-016-0984-2

Other articles of this Issue 2/2017

Knowledge and Information Systems 2/2017 Go to the issue

Premium Partner