Skip to main content
Top
Published in: Discover Computing 5/2020

22-07-2020 | Mining Actionable Insights from Online User Generated Content

Robust keyword search in large attributed graphs

Authors: Spencer Bryson, Heidar Davoudi, Lukasz Golab, Mehdi Kargar, Yuliya Lytvyn, Piotr Mierzejewski, Jaroslaw Szlichta, Morteza Zihayat

Published in: Discover Computing | Issue 5/2020

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

There is a growing need to explore attributed graphs such as social networks, expert networks, and biological networks. A well-known mechanism for non-technical users to explore such graphs is keyword search, which receives a set of query keywords and returns a connected subgraph that contains the keywords. However, existing approaches, such as methods based on shortest paths between nodes containing the query keywords, may generate weakly-connected answers as they ignore the structure of the whole graph. To address this issue, we formulate and solve the robust keyword search problem for attributed graphs to find strongly-connected answers. We prove that the problem is NP-hard and we propose a solution based on a random walk with restart (RWR). To improve the efficiency and scalability of RWR, we use Monte Carlo approximation and we also propose a distributed version, which we implement in Apache Spark. Finally, we provide experimental evidence of the efficiency and effectiveness of our approach on real-world graphs.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
4
We call these nodes (that contain the query keywords) content nodes or keyword holders, in contrast to middle nodes that serve as connections between the content nodes.
 
5
We assume G is connected. If G is not connected, we use the strongly connected component, as in Lappas et al. (2009). Having a connected graph guarantees that the returned answers are always connected. Note that in our applications, returning disconnected subgraphs is meaningless.
 
6
Note that choosing any set of nodes that contain one of the input keywords in the first step guarantees the coverage of all keywords at the end. However, starting with the rarest keyword holder is a greedy strategy that improves performance significantly as it prunes the search space without significantly sacrificing presicion in practice.
 
7
RDD is a fundamental data structure in Spark. RDD provides an an immutable, partitioned collection of elements in which they can be operated on in parallel.
 
Literature
go back to reference Anagnostopoulos, A., Becchetti, L., Castillo, C., Gionis, A., & Leonardi, S. (2012). Online team formation in social networks. In WWW ’12, pp. 839–848. Anagnostopoulos, A., Becchetti, L., Castillo, C., Gionis, A., & Leonardi, S. (2012). Online team formation in social networks. In WWW ’12, pp. 839–848.
go back to reference Bahmani, B., Chowdhury, A., & Goel, A. (2010). Fast incremental and personalized pagerank. Proceedings of the VLDB Endowment, 4(3), 173–184.CrossRef Bahmani, B., Chowdhury, A., & Goel, A. (2010). Fast incremental and personalized pagerank. Proceedings of the VLDB Endowment, 4(3), 173–184.CrossRef
go back to reference Bai, L., Liang, J., Du, H., & Guo, Y. (2018). A novel community detection algorithm based on simplification of complex networks. Knowledge-Based Systems, 143, 58–64.CrossRef Bai, L., Liang, J., Du, H., & Guo, Y. (2018). A novel community detection algorithm based on simplification of complex networks. Knowledge-Based Systems, 143, 58–64.CrossRef
go back to reference Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). Objectrank: Authority-based keyword search in databases. In VLDB, pp. 564–575. Balmin, A., Hristidis, V., & Papakonstantinou, Y. (2004). Objectrank: Authority-based keyword search in databases. In VLDB, pp. 564–575.
go back to reference Bhalotia, Gaurav, Hulgeri, Arvind, Nakhe, Arvind, Chakrabarti, Soumen, & Sudarshan, Shashank. (2002). Keyword searching and browsing in databases using banks. In Proceedings 18th international conference on data engineering, pp. 431–440. IEEE. Bhalotia, Gaurav, Hulgeri, Arvind, Nakhe, Arvind, Chakrabarti, Soumen, & Sudarshan, Shashank. (2002). Keyword searching and browsing in databases using banks. In Proceedings 18th international conference on data engineering, pp. 431–440. IEEE.
go back to reference Cavallari, S., Zheng, V., Cai, H., Chang, K., & Cambria, E. (2017). Learning community embedding with community detection and node embedding on graphs. In CIKM, pp. 337–386. Cavallari, S., Zheng, V., Cai, H., Chang, K., & Cambria, E. (2017). Learning community embedding with community detection and node embedding on graphs. In CIKM, pp. 337–386.
go back to reference Chakrabarti, S. (2007). Dynamic personalized pagerank in entity relation graphs. In WWW, pp. 571–580. Chakrabarti, S. (2007). Dynamic personalized pagerank in entity relation graphs. In WWW, pp. 571–580.
go back to reference Cui, W., Xiao, Y., Wang, H., & Wang, W. (2014). Local search of communities in large graphs. In SIGMOD, pp. 991–1002. Cui, W., Xiao, Y., Wang, H., & Wang, W. (2014). Local search of communities in large graphs. In SIGMOD, pp. 991–1002.
go back to reference Ding, B., Yu, J., Wang, S., Qin, L., Zhang, X., & Lin, X. (2007). Finding top-k min-cost connected trees in databases. In ICDE, pp. 836–845. Ding, B., Yu, J., Wang, S., Qin, L., Zhang, X., & Lin, X. (2007). Finding top-k min-cost connected trees in databases. In ICDE, pp. 836–845.
go back to reference Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., & Lin, X. (2007). Finding top-k min-cost connected trees in databases. In 2007 IEEE 23rd international conference on data engineering, pp. 836–845. IEEE. Ding, B., Yu, J. X., Wang, S., Qin, L., Zhang, X., & Lin, X. (2007). Finding top-k min-cost connected trees in databases. In 2007 IEEE 23rd international conference on data engineering, pp. 836–845. IEEE.
go back to reference Duret, L., & Mouchiroud, D. (1999). Expression pattern and surprisingly, gene length shape codon usage in caenorhabditis, drosophila, and arabidopsis. Proceedings of the National Academy of Sciences, 13(96), 4482–4487.CrossRef Duret, L., & Mouchiroud, D. (1999). Expression pattern and surprisingly, gene length shape codon usage in caenorhabditis, drosophila, and arabidopsis. Proceedings of the National Academy of Sciences, 13(96), 4482–4487.CrossRef
go back to reference Duret, L., & Mouchiroud, D. (1999). Expression pattern and, surprisingly, gene length shape codon usage in caenorhabditis, drosophila, and arabidopsis. In Proceedings of the national academy of sciences of the United States of America, pp. 4482–4487. Duret, L., & Mouchiroud, D. (1999). Expression pattern and, surprisingly, gene length shape codon usage in caenorhabditis, drosophila, and arabidopsis. In Proceedings of the national academy of sciences of the United States of America, pp. 4482–4487.
go back to reference Fang, Y., Cheng, R., Luo, S., & Hu, J. (2016). Effective community search for large attributed graphs. PVLDB, 9(12), 1233–1244. Fang, Y., Cheng, R., Luo, S., & Hu, J. (2016). Effective community search for large attributed graphs. PVLDB, 9(12), 1233–1244.
go back to reference Gonzalez, M., & Kann, M. (2012). Protein interactions and disease. PLoS Computational Biology, 8(12), 819–830.CrossRef Gonzalez, M., & Kann, M. (2012). Protein interactions and disease. PLoS Computational Biology, 8(12), 819–830.CrossRef
go back to reference Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 855–864. ACM. Grover, A., & Leskovec, J. (2016). node2vec: Scalable feature learning for networks. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp. 855–864. ACM.
go back to reference Hajiabadi, M., Zare, H., & Bobarshad, H. (2017). IEDC: An integrated approach for overlapping and non-overlapping community detection. Knowledge-Based Systems, 123, 188–199.CrossRef Hajiabadi, M., Zare, H., & Bobarshad, H. (2017). IEDC: An integrated approach for overlapping and non-overlapping community detection. Knowledge-Based Systems, 123, 188–199.CrossRef
go back to reference Han, S., Zou, L., Yu, J.X., & Zhao, D. (2017). Keyword search on RDF graphs: A query graph assembly approach. In CIKM, pp. 227–236. Han, S., Zou, L., Yu, J.X., & Zhao, D. (2017). Keyword search on RDF graphs: A query graph assembly approach. In CIKM, pp. 227–236.
go back to reference He, H., Wang, H., Yang, J., & Yu, P. (2007). BLINKS: Ranked keyword searches on graphs. In SIGMOD, pp. 305–316. He, H., Wang, H., Yang, J., & Yu, P. (2007). BLINKS: Ranked keyword searches on graphs. In SIGMOD, pp. 305–316.
go back to reference Huang, X., & Lakshmanan, L. V. S. (2017). Attribute-driven community search. PVLDB, 10(9), 949–960. Huang, X., & Lakshmanan, L. V. S. (2017). Attribute-driven community search. PVLDB, 10(9), 949–960.
go back to reference Ingvarsson, P. (2007). Gene expression and protein length influence codon usage and rates of sequence evolution in populus tremula. Molecular Biology and Evolution, 24(3), 836–844.CrossRef Ingvarsson, P. (2007). Gene expression and protein length influence codon usage and rates of sequence evolution in populus tremula. Molecular Biology and Evolution, 24(3), 836–844.CrossRef
go back to reference Ingvarsson, Pär K. (2007). Gene expression and protein length influence codon usage and rates of sequence evolution in populus tremula. Molecular Biology and Evolution, 24(3), 836–844.CrossRef Ingvarsson, Pär  K. (2007). Gene expression and protein length influence codon usage and rates of sequence evolution in populus tremula. Molecular Biology and Evolution, 24(3), 836–844.CrossRef
go back to reference Jay, S. M. (2019). Protein silencing to stop a silent killer. Science Translational Medicine, 11(473), 529.CrossRef Jay, S. M. (2019). Protein silencing to stop a silent killer. Science Translational Medicine, 11(473), 529.CrossRef
go back to reference Jeong, Y., Pan, Y., Rathore, S., Kim, B., & Park, J. H. (2019). A parallel team formation approach using crowd intelligence from social network. Computers in Human Behavior, 101, 429–434.CrossRef Jeong, Y., Pan, Y., Rathore, S., Kim, B., & Park, J. H. (2019). A parallel team formation approach using crowd intelligence from social network. Computers in Human Behavior, 101, 429–434.CrossRef
go back to reference Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., & Karambelkar, H. (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, pp. 505–516. Kacholia, V., Pandit, S., Chakrabarti, S., Sudarshan, S., Desai, R., & Karambelkar, H. (2005). Bidirectional expansion for keyword search on graph databases. In VLDB, pp. 505–516.
go back to reference Kargar, M., & An, A. (2011). Keyword search in graphs: Finding r-cliques. PVLDB, 10(4), 681–692. Kargar, M., & An, A. (2011). Keyword search in graphs: Finding r-cliques. PVLDB, 10(4), 681–692.
go back to reference Kargar, M., An, A., & Yu, X. (2014). Efficient duplication free and minimal keyword search in graphs. TKDE, 26(7), 1657–1669. Kargar, M., An, A., & Yu, X. (2014). Efficient duplication free and minimal keyword search in graphs. TKDE, 26(7), 1657–1669.
go back to reference Kargar, M., Golab, L., & Szlichta, J. (2016). eGraphSearch: Effective keyword search in graphs. In CIKM, system demonstration, pp. 2461–2464. Kargar, M., Golab, L., & Szlichta, J. (2016). eGraphSearch: Effective keyword search in graphs. In CIKM, system demonstration, pp. 2461–2464.
go back to reference Kargar, M., Zihayat, M., & An, A. (2013). Finding affordable and collaborative teams from a network of experts. In SDM, pp. 587–595. Kargar, M., Zihayat, M., & An, A. (2013). Finding affordable and collaborative teams from a network of experts. In SDM, pp. 587–595.
go back to reference Kasneci, G., Ramanath, M., & Sozio, M. (2009). STAR: Steiner-tree approximation in relationship graphs. In ICDE, pp. 868–879. Kasneci, G., Ramanath, M., & Sozio, M. (2009). STAR: Steiner-tree approximation in relationship graphs. In ICDE, pp. 868–879.
go back to reference Lao, N., & Cohen, W. (2010). Fast query execution for retrieval models based on path-constrained random walks. In KDD, pp. 881–888. Lao, N., & Cohen, W. (2010). Fast query execution for retrieval models based on path-constrained random walks. In KDD, pp. 881–888.
go back to reference Lappas, T., Liu, L., & Terzi, E. (2009). Finding a team of experts in social networks. In KDD, pp. 467–476. Lappas, T., Liu, L., & Terzi, E. (2009). Finding a team of experts in social networks. In KDD, pp. 467–476.
go back to reference Le, W., Li, F., Kementsietsidis, A., & Duan, S. (2014). Scalable keyword search on large RDF data. TKDE, 26(11), 2774–2788. Le, W., Li, F., Kementsietsidis, A., & Duan, S. (2014). Scalable keyword search on large RDF data. TKDE, 26(11), 2774–2788.
go back to reference Leskovec, J., Rajaraman, A., & Ullman, J. (2014). Mining of massive datasets (2nd ed.). Cambridge: Cambridge University Press.CrossRef Leskovec, J., Rajaraman, A., & Ullman, J. (2014). Mining of massive datasets (2nd ed.). Cambridge: Cambridge University Press.CrossRef
go back to reference Li, G., Ooi, B., Feng, J., Wang, J., & Zhou, L. (2008). EASE: Efficient and adaptive keyword search on unstructured, semi-structured and structured data. In SIGMOD, pp. 903–904. Li, G., Ooi, B., Feng, J., Wang, J., & Zhou, L. (2008). EASE: Efficient and adaptive keyword search on unstructured, semi-structured and structured data. In SIGMOD, pp. 903–904.
go back to reference Li, R. H., Qin, L., Yu, J. X., & Mao, R. (2015). Influential community search in large networks. PVLDB, 8(5), 509–520. Li, R. H., Qin, L., Yu, J. X., & Mao, R. (2015). Influential community search in large networks. PVLDB, 8(5), 509–520.
go back to reference Majumder, A., Datta, S., & Naidu, K. (2012). Capacitated team formation problem on social networks. In KDD, pp. 1005–1013. Majumder, A., Datta, S., & Naidu, K. (2012). Capacitated team formation problem on social networks. In KDD, pp. 1005–1013.
go back to reference Markowetz, A., Ying, Y., & Papadias, D. (2007). Keyword search on relational data streams. In SIGMOD, pp. 605–616. Markowetz, A., Ying, Y., & Papadias, D. (2007). Keyword search on relational data streams. In SIGMOD, pp. 605–616.
go back to reference Pinero, J. (2017). DisGeNET: A comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research, 45(D1), 833–839.CrossRef Pinero, J. (2017). DisGeNET: A comprehensive platform integrating information on human disease-associated genes and variants. Nucleic Acids Research, 45(D1), 833–839.CrossRef
go back to reference Qin, L., Yu, J., Chang, L., & Tao, Y. (2009). Querying communities in relational databases. In ICDE, pp. 724–735. Qin, L., Yu, J., Chang, L., & Tao, Y. (2009). Querying communities in relational databases. In ICDE, pp. 724–735.
go back to reference Sozio, M., & Gionis, A. (2010). The community-search problem and how to plan a successful cocktail party. In KDD, pp. 939–948. Sozio, M., & Gionis, A. (2010). The community-search problem and how to plan a successful cocktail party. In KDD, pp. 939–948.
go back to reference Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., & Su, Z.. (2008). Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 990–998. ACM. Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., & Su, Z.. (2008). Arnetminer: extraction and mining of academic social networks. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 990–998. ACM.
go back to reference Termehchy, A., & Winslett, M. (2011). Using structural information in XML keyword search effectively. ACM Transactions on Database Systems (TODS), 36(2), 1–49.CrossRef Termehchy, A., & Winslett, M. (2011). Using structural information in XML keyword search effectively. ACM Transactions on Database Systems (TODS), 36(2), 1–49.CrossRef
go back to reference Wang, H., & Aggarwal, C. C. (2010). A survey of algorithms for keyword search on graph data, (pp. 249–273). Springer US, Boston, MA. Wang, H., & Aggarwal, C. C. (2010). A survey of algorithms for keyword search on graph data, (pp. 249–273). Springer US, Boston, MA.
go back to reference Wang, W., He, Z., Shi, P., Wu, W., Jiang, Y., An, B., et al. (2019). Strategic social team crowdsourcing: Forming a team of truthful workers for crowdsourcing in social networks. IEEE Transactions on Mobile Computing, 18(6), 1419–1432.CrossRef Wang, W., He, Z., Shi, P., Wu, W., Jiang, Y., An, B., et al. (2019). Strategic social team crowdsourcing: Forming a team of truthful workers for crowdsourcing in social networks. IEEE Transactions on Mobile Computing, 18(6), 1419–1432.CrossRef
go back to reference Yang, Y., Agrawal, D., Jagadish, H. V., Tung, A. K. H., & Wu, S. (2019). An efficient parallel keyword search engine on knowledge graphs. In 2019 IEEE 35th international conference on data engineering (ICDE), pp. 338–349. Yang, Y., Agrawal, D., Jagadish, H. V., Tung, A. K. H., & Wu, S. (2019). An efficient parallel keyword search engine on knowledge graphs. In 2019 IEEE 35th international conference on data engineering (ICDE), pp. 338–349.
go back to reference Yin, X., Qu, C., Wang, Q., Wu, F., Liu, B., Chen, F., et al. (2019). Social connection aware team formation for participatory tasks. IEEE Access, 6, 20309–20319.CrossRef Yin, X., Qu, C., Wang, Q., Wu, F., Liu, B., Chen, F., et al. (2019). Social connection aware team formation for participatory tasks. IEEE Access, 6, 20309–20319.CrossRef
go back to reference Zhang, J., Yu, P. S., & Lv, Y. (2017). Enterprise employee training via project team formation. In WSDM, pp. 3–12. Zhang, J., Yu, P. S., & Lv, Y. (2017). Enterprise employee training via project team formation. In WSDM, pp. 3–12.
Metadata
Title
Robust keyword search in large attributed graphs
Authors
Spencer Bryson
Heidar Davoudi
Lukasz Golab
Mehdi Kargar
Yuliya Lytvyn
Piotr Mierzejewski
Jaroslaw Szlichta
Morteza Zihayat
Publication date
22-07-2020
Publisher
Springer Netherlands
Published in
Discover Computing / Issue 5/2020
Print ISSN: 2948-2984
Electronic ISSN: 2948-2992
DOI
https://doi.org/10.1007/s10791-020-09379-9

Premium Partner