2017 | OriginalPaper | Book chapter

SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble

Authors: Jiaming Shen, Zeqiu Wu, Dongming Lei, Jingbo Shang, Xiang Ren, Jiawei Han

Published in: Machine Learning and Knowledge Discovery in Databases

Publisher: Springer International Publishing

Abstract

Corpus-based set expansion (i.e., finding the “complete” set of entities belonging to the same semantic class, based on a given corpus and a tiny set of seeds) is a critical task in knowledge discovery. It may facilitate numerous downstream applications, such as information extraction, taxonomy induction, question answering, and web search.
To discover new entities for an expanded set, previous approaches either perform a one-time entity ranking based on distributional similarity or resort to iterative pattern-based bootstrapping. The core challenge for these methods is how to deal with noisy context features derived from free-text corpora, which may lead to entity intrusion and semantic drift. In this study, we propose a novel framework, SetExpan, which tackles this problem with two techniques: (1) a context feature selection method that selects clean context features for calculating entity-entity distributional similarity, and (2) a ranking-based unsupervised ensemble method for expanding the entity set based on denoised context features. Experiments on three datasets show that SetExpan is robust and outperforms previous state-of-the-art methods in terms of mean average precision.
Code related to this chapter is available at: https://github.com/mickeystroller/SetExpan
Data related to this chapter are available at: https://goo.gl/1suS3Z
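
The two techniques named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify scoring functions, so the weighted Jaccard similarity, the random subsampling of selected features, the mean reciprocal rank aggregation, the function names, the parameters (`top_q`, `n_rounds`, `sample_frac`), and the example data are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical toy data: entity -> {context feature: association weight}.
weights = {
    "Illinois": {"state of _": 3, "governor of _": 2, "_ said": 1},
    "Texas": {"state of _": 4, "governor of _": 3, "_ said": 2},
    "Ohio": {"state of _": 2, "governor of _": 2},
    "Obama": {"_ said": 5, "president _": 4},
    "Chicago": {"city of _": 3, "_ said": 1, "mayor of _": 2},
}
seeds = ["Illinois", "Texas"]


def select_features(weights, seeds, top_q):
    """Keep the top_q context features most strongly tied to the seeds (assumed scoring)."""
    score = defaultdict(float)
    for s in seeds:
        for f, w in weights[s].items():
            score[f] += w
    return sorted(score, key=lambda f: (-score[f], f))[:top_q]


def weighted_jaccard(a, b, features):
    """Weighted Jaccard similarity of two entities, restricted to the given features."""
    num = sum(min(a.get(f, 0.0), b.get(f, 0.0)) for f in features)
    den = sum(max(a.get(f, 0.0), b.get(f, 0.0)) for f in features)
    return num / den if den else 0.0


def expand(weights, seeds, top_q=2, n_rounds=20, sample_frac=0.6, rng_seed=0):
    """Rank candidate entities by averaging reciprocal ranks over feature subsets."""
    rng = random.Random(rng_seed)
    selected = select_features(weights, seeds, top_q)
    candidates = sorted(e for e in weights if e not in seeds)
    k = max(1, int(len(selected) * sample_frac))
    mrr = defaultdict(float)
    for _ in range(n_rounds):
        subset = rng.sample(selected, k)  # one "view" of the denoised features
        sim = {
            c: sum(weighted_jaccard(weights[c], weights[s], subset) for s in seeds)
            / len(seeds)
            for c in candidates
        }
        ranked = sorted(candidates, key=lambda c: (-sim[c], c))
        for rank, c in enumerate(ranked, start=1):
            mrr[c] += 1.0 / (rank * n_rounds)
    return sorted(candidates, key=lambda c: (-mrr[c], c))


ranking = expand(weights, seeds)
print(ranking)
```

In this toy setting, the generic feature `_ said` scores lowest among the seeds' features and is filtered out, so entities that share only that noisy context with the seeds gain no similarity from it.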

Footnotes
3
Results of SEISA on PubMed-CVD are omitted due to the scalability issue.
 
Metadata
Title
SetExpan: Corpus-Based Set Expansion via Context Feature Selection and Rank Ensemble
Authors
Jiaming Shen
Zeqiu Wu
Dongming Lei
Jingbo Shang
Xiang Ren
Jiawei Han
Copyright year
2017
DOI
https://doi.org/10.1007/978-3-319-71249-9_18