
2015 | OriginalPaper | Book chapter

Web as a Corpus: Going Beyond the n-gram

Written by: Preslav Nakov

Published in: Information Retrieval

Publisher: Springer International Publishing


Abstract

The 60-year-old dream of computational linguistics is to make computers capable of communicating with humans in natural language. This has proven hard, and thus research has focused on sub-problems. Even so, the field was stuck with manual rules until the early 90s, when computers became powerful enough to enable the rise of statistical approaches. Eventually, this shifted the main research attention to machine learning from text corpora, thus triggering a revolution in the field.
Today, the Web is the biggest available corpus, providing access to quadrillions of words; and, in corpus-based natural language processing, size does matter. Unfortunately, while there has been substantial research on the Web as a corpus, it has typically been restricted to using page hit counts as an estimate for n-gram word frequencies; this has led some researchers to conclude that the Web should only be used as a baseline. We show that much better results are possible for structural ambiguity problems when going beyond the n-gram.


Footnotes
2
See (Nakov 2013) for an overview of the syntax and semantics of noun compounds. See also Nakov and Hearst (2013).
 
3
This score worked best on training, when Keller and Lapata were doing model selection. On testing, \(\mathrm {Pr}\) (with the dependency model) worked better and achieved accuracy of 80.32 %, but this result was ignored, as \(\mathrm {Pr}\) did worse on training.
 
4
Zero counts sometimes happen for \(\#(w_1,w_3)\), but are rare for unigrams and bigrams on the Web, and there is no need for a more sophisticated smoothing.
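As a rough illustration of why simple smoothing is enough here, the sketch below adds a small constant to every count before comparing two word-association estimates. The estimate \(\#(w_i,w_j)/\#(w_j)\), the constant `alpha=0.5`, the tie-break towards left bracketing, the function names, and the toy counts are all assumptions made for this example; they are not taken from the chapter.

```python
import math


def dependency_score(c12: float, c13: float, c2: float, c3: float,
                     alpha: float = 0.5) -> float:
    """Log-ratio of two smoothed association estimates.

    c12 = #(w1,w2), c13 = #(w1,w3), c2 = #(w2), c3 = #(w3).
    alpha is a hypothetical additive smoothing constant; it keeps
    both estimates strictly positive, so the logarithm is defined
    even when a bigram count is zero.
    """
    p_left = (c12 + alpha) / (c2 + alpha)    # assumed estimate for w1 -> w2
    p_right = (c13 + alpha) / (c3 + alpha)   # assumed estimate for w1 -> w3
    return math.log(p_left) - math.log(p_right)


def bracket(c12: float, c13: float, c2: float, c3: float,
            alpha: float = 0.5) -> str:
    """Return 'left' for [[w1 w2] w3] and 'right' for [w1 [w2 w3]].

    Ties go to 'left'; that tie-break is an assumption of this sketch.
    """
    return "left" if dependency_score(c12, c13, c2, c3, alpha) >= 0 else "right"


if __name__ == "__main__":
    # Toy counts (invented): #(w1,w3) is zero, yet the score stays finite.
    print(bracket(c12=120, c13=0, c2=5000, c3=8000))
```

Because zero counts are rare at Web scale, the constant only matters in the occasional \(\#(w_1,w_3)=0\) case, where it prevents a division by zero or a logarithm of zero without noticeably shifting the other comparisons.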
 
5
For example, as used by Lauer to introduce a prior for left-right bracketing preference. The best Lauer model does not work with words directly, but uses a taxonomy and further needs a probabilistic interpretation, so that the hidden taxonomy variables can be summed out. Because of that summation, the term \(\mathrm {Pr}(w_2 \rightarrow w_3|w_3)\) does not cancel in his dependency model.
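To see what "cancel" refers to, consider a sketch of the word-based dependency model in its usual formulation (an assumption here, since the footnote does not spell it out). Left bracketing \([[w_1\, w_2]\, w_3]\) is preferred over right bracketing \([w_1\, [w_2\, w_3]]\) when

\[
\frac{\mathrm{Pr}(w_1 \rightarrow w_2|w_2)\,\mathrm{Pr}(w_2 \rightarrow w_3|w_3)}{\mathrm{Pr}(w_1 \rightarrow w_3|w_3)\,\mathrm{Pr}(w_2 \rightarrow w_3|w_3)}
= \frac{\mathrm{Pr}(w_1 \rightarrow w_2|w_2)}{\mathrm{Pr}(w_1 \rightarrow w_3|w_3)} > 1 .
\]

With words used directly, the shared factor \(\mathrm{Pr}(w_2 \rightarrow w_3|w_3)\) drops out of the ratio. When numerator and denominator are each a sum over hidden taxonomy variables, the factor sits inside the sums and can no longer be pulled out.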
 
6
Features can also occur combined, e.g., brain’s stem-cells.
 
7
This appears as Surface features (sum) in Tables 1 and 2.
 
8
In addition to the articles (a, an, the), we also used quantifiers (e.g., some, every) and pronouns (e.g., this, his).
 
9
In our experiments, we used MSN Search (now Bing) statistics for the n-grams and the paraphrases (unless the pattern contained a “*”), and Google for the surface features. MSN always returned exact numbers, while Google and Yahoo rounded their page hits, which generally leads to lower accuracy (Yahoo was better than Google for these estimates).
 
12
In fact, the differences are negligible; their system achieved a very similar result on the half split as well as on the whole set (personal communication).
 
13
Note, however, that here we experiment with 232 of the 430 examples.
 
14
When presented with a whole sentence, average humans score 93 %.
 
15
Ratnaparkhi (1998) noted that the test set contains errors, but did not correct them.
 
16
The configurations of the kind n \(h_1\) c \(h_2\) (e.g., company/n \(cars/h_1\) and/c \(trucks/h_2\)) can be handled in a similar way.
 
19
It can be extended to handle adjective-noun pairs as well, as demonstrated in Sect. 6.5 below.
 
20
The best type B system on SemEval achieved 76.3 % accuracy using the manually annotated WordNet senses in context for each example, which constitutes an additional data source, as opposed to an additional resource. The systems that used WordNet as a resource only, i.e., ignoring the manually annotated senses, were classified as type A or C (Girju et al. 2007).
 
References
Rajeev, A., Boggess, L.: A simple but useful approach to conjunct identification. In: Proceedings of ACL, pp. 15–21 (1992)
Michele, B., Brill, E.: Scaling to very very large corpora for natural language disambiguation. In: Proceedings of ACL (2001)
Bansal, M., Klein, D.: Web-scale features for full-scale parsing. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, vol. 1, HLT 2011, pp. 693–702. Stroudsburg, PA, USA (2011)
Barker, K., Szpakowicz, S.: Semi-automatic recognition of noun modifier relationships. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 96–102. Association for Computational Linguistics, Morristown, NJ, USA (1998)
Bergsma, S., Goebel, R.: Using visual information to predict lexical preference. In: Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, pp. 399–405. RANLP 2011 Organising Committee, Hissar, Bulgaria (2011)
Pitler, E., Lin, D.: Creating robust supervised classifiers via web-scale n-gram data. In: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 865–874. Uppsala, Sweden (2010)
Van Durme, B.: Learning bilingual lexicons using the visual similarity of labeled web images. In: Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence, Volume Three, IJCAI 2011, pp. 1764–1769. AAAI Press (2011)
Iris Wang, Q.: Learning noun phrase query segmentation. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 819–826 (2007)
Brants, T., Popat, A.C., Peng, X., Och, F.J., Dean, J.: Large language models in machine translation. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), pp. 858–867. Prague, Czech Republic (2007)
Brill, E., Resnik, P.: A rule-based approach to prepositional phrase attachment disambiguation. In: Proceedings of COLING (1994)
Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. Comput. Netw. 30, 107–117 (1998)
Budanitsky, A., Hirst, G.: Evaluating wordnet-based measures of lexical semantic relatedness. Comput. Linguist. 32, 13–47 (2006)
Butnariu, C., Kim, S.N., Nakov, P., Séaghdha, D., Szpakowicz, S., Veale, T.: Noun compounds using paraphrasing verbs and prepositions. In: Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2), Uppsala, Sweden, 11–16 July 2010, pp. 39–44 (2010)
Veale, T.: A concept-centered approach to noun-compound interpretation. In: Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008), pp. 81–88. Manchester, UK (2008)
Cafarella, M., Banko, M., Etzioni, O.: Technical Report 02 April 2006, University of Washington, Department of Computer Science and Engineering (2006)
Calvo, H., Gelbukh, A.: Improving prepositional phrase attachment disambiguation using the web as corpus. In: Sanfeliu, A., Ruiz-Shulcloper, J. (eds.) CIARP 2003. LNCS, vol. 2905, pp. 604–610. Springer, Heidelberg (2003)
Cao, Y., Li, H.: Base noun phrase translation using web data and the EM algorithm. In: COLING, pp. 127–133 (2002)
Chantree, F., Kilgarriff, A., De Roeck, A., Willis, A.: Using a distributional thesaurus to resolve coordination ambiguities. In: Technical Report 2005/02. The Open University, UK (2005)
Chklovski, T., Pantel, P.: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 33–40 (2004)
Church, K., Patil, R.: Coping with syntactic ambiguity or how to put the block in the box on the table. Am. J. Comput. Linguist. 8, 139–149 (1982)
Collins, M., Brooks, J.: Prepositional phrase attachment through a backed-off model. In: Proceedings of EMNLP, pp. 27–38 (1995)
Downing, P.: On the creation and use of English compound nouns. Language 53(4), 810–842 (1977)
Dumais, S., Banko, M., Brill, E., Lin, J., Andrew Ng.: Web question answering: Is more always better? In: Proceedings of SIGIR, pp. 291–298 (2002)
Fleiss, J.L.: Statistical Methods for Rates and Proportions, 2nd edn. John Wiley & Sons Inc, New York (1981)
Girju, R., Moldovan, D., Tatu, M., Antohe, D.: On the semantics of noun compounds. Special Issue on Multiword Expressions 19(4), 479–496 (2005)
Girju, R., Nakov, P., Nastase, V., Szpakowicz, S., Turney, P., Yuret, D.: SemEval-2007 task 04: classification of semantic relations between nominals. In: Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), pp. 13–18, Prague, Czech Republic (2007)
Nakov, P., Nastase, V., Szpakowicz, S., Turney, P., Yuret, D.: Language Resources and Evaluation 43, 105–121 (2009)
Goldberg, M.: An unsupervised model for statistically determining coordinate phrase attachment. In: Proceedings of ACL, pp. 610–614 (1999)
Grefenstette, G.: The world wide web as a resource for example-based machine translation tasks. In: Proceedings of the ASLIB Conference on Translating and the Computer (1998)
Hendrickx, I., Kim, S.N., Kozareva, Z., Nakov, P., Séaghdha, D., Padó, S., Romano, M., Szpakowicz, S.: SemEval-2010 Task 8: Multi-way classification of semantic relations between pairs of nominals. In: Proceedings of the 5th International Workshop on Semantic Evaluations (SemEval-2), Uppsala, Sweden, 11–16 July 2010, pp. 33–38 (2010)
Weber, I.M.: Semantic Methods for Execution-level Business Process Modeling. LNBIP, vol. 40. Springer, Heidelberg (2009)
Hindle, D., Rooth, M.: Structural ambiguity and lexical relations. Comput. Linguist. 19, 103–120 (1993)
Szpektor, I., Tanev, H., Dagan, I., Coppola, B.: Scaling web-based acquisition of entailment relations. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 401–48 (2004)
Weber, I.M.: Evaluation. Semantic Methods for Execution-level Business Process Modeling. LNBIP, vol. 40, pp. 203–225. Springer, Heidelberg (2009)
Keller, F., Lapata, M.: Using the Web to obtain frequencies for unseen bigrams. Comput. Linguist. 29, 459–484 (2003)
Kilgarriff, A., Grefenstette, G.: Introduction to the special issue on the web as corpus. Comput. Linguist. 29, 333–347 (2003)
Kilgarriff, A.: Googleology is bad science. Comput. Linguist. 33, 147–151 (2007)
Nam, K.S., Nakov, P.: Large-scale noun compound interpretation using bootstrapping and the web as a corpus. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 648–658. Edinburgh, Scotland, UK (2011)
Kurohashi, S., Nagao, M.: Dynamic programming method for analyzing conjunctive structures in Japanese. In: Proceedings of COLING, vol. 1 (1992)
Lapata, M., Keller, F.: The Web as a baseline: evaluating the performance of unsupervised Web-based models for a range of NLP tasks. In: Proceedings of HLT-NAACL, pp. 121–128, Boston (2004)
Keller, F.: Web-based models for natural language processing. ACM Trans. Speech Lang. Process. 2(1), 1–31 (2005)
Lauer, M.: Designing statistical language learners: experiments on noun compounds. Department of Computing Macquarie University NSW 2109 Australia dissertation (1995)
Levi, J.: The syntax and semantics of complex nominals. Academic Press, New York (1978)
Levy, O., Goldberg, Y.: Linguistic regularities in sparse and explicit word representations. In: Proceedings of the Eighteenth Conference on Computational Natural Language Learning, pp. 171–180 (2014)
Lin, D.: An information-theoretic definition of similarity. In: ICML 1998: Proceedings of the Fifteenth International Conference on Machine Learning, pp. 296–304. Morgan Kaufmann Publishers Inc, San Francisco, CA, USA (1998)
Church, K., Ji, H., Sekine, S., Yarowsky, D., Bergsma, S., Patil, K., Pitler, E., Lathbury, R., Rao, V., Dalwani, K., Narsale, S.: New tools for web-scale n-grams. In: Calzolari, N. (Conference Chair), Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010). European Language Resources Association (ELRA), Valletta, Malta (2010)
Lin, Y., Michel, J.-B., Lieberman, E.A., Orwant, J., Brockman, W., Petrov, S.: Syntactic annotations for the Google Books ngram corpus. In: Proceedings of the ACL 2012 System Demonstrations, pp. 169–174. Jeju Island, Korea (2012)
Marcus, M.: A Theory of Syntactic Recognition for Natural Language. MIT Press, Cambridge (1980)
Santorini, B., Marcinkiewicz, M.: Building a large annotated corpus of English: The Penn Treebank. Comput. Linguist. 19, 313–330 (1994)
Mihalcea, R., Moldovan, D.: A method for word sense disambiguation of unrestricted text. In: ACL, pp. 152–158 (1999)
Mikolov, T., Yih, W., Zweig, G.: Linguistic regularities in continuous space word representations. In: Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 746–751. Atlanta, Georgia (2013)
Modjeska, N., Markert, K., Nissim, M.: Using the web in machine learning for other-anaphora resolution. In: Proceedings of the 2003 Conference on Empirical Methods in Natural Language Processing, pp. 176–183 (2003)
Nakov, P.: Using the web as an implicit training set: Application to noun compound syntax and semantics. EECS Department, University of California, Berkeley, UCB/EECS-2007-173 dissertation (2007)
Improved statistical machine translation using monolingual paraphrases. In: Proceedings of the European Conference on Artificial Intelligence, ECAI 2008, pp. 338–342. Patras, Greece (2008a)
Nakov, P.: Noun compound interpretation using paraphrasing verbs: feasibility study. In: Dochev, D., Pistore, M., Traverso, P. (eds.) AIMSA 2008. LNCS (LNAI), vol. 5253, pp. 103–117. Springer, Heidelberg (2008)
Paraphrasing verbs for noun compound interpretation. In: Proceedings of the LREC’08 Workshop: Towards a Shared Task for Multiword Expressions, MWE 2008, pp. 46–49. Marrakech, Morocco (2008c)
On the interpretation of noun compounds: Syntax, semantics, and entailment. Natural Lang. Eng. vol. 19, pp. 291–330 (2013)
Hearst, M.: Search engine statistics beyond the n-gram: Application to noun compound bracketing. In: Proceedings of CoNLL-2005, Ninth Conference on Computational Natural Language Learning (2005a)
Hearst, M.: A study of using search engine page hits as a proxy for n-gram frequencies. In: Proceedings of RANLP 2005, pp. 347–353. Borovets, Bulgaria (2005)
Hearst, M.: Using the web as an implicit training set: application to structural ambiguity resolution. In: HLT 2005: Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, pp. 835–842. Association for Computational Linguistics, Morristown, NJ, USA (2005c)
Hearst, M.: Solving relational similarity problems using the web as a corpus. In: Proceedings of the 46th Annual Meeting on Association for Computational Linguistics, ACL 2008, pp. 452–460. Columbus, OH (2008)
Nakov, P., Hearst, M.: Using verbs to characterize noun-noun relations. In: Euzenat, J., Domingue, J. (eds.) AIMSA 2006. LNCS (LNAI), vol. 4183, pp. 233–244. Springer, Heidelberg (2006)
Kozareva, Z.: Combining relational and attributional similarity for semantic relation classification. In: Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2011, pp. 323–330. Hissar, Bulgaria (2011)
Schwartz, A., Wolf, B., Hearst, M.: Scaling up BioNLP: application of a text annotation architecture to noun compound bracketing. In: Proceedings of SIG BioLINK (2005a)
Schwartz, A., Wolf, B., Hearst, M.: Proceedings of the ACL 2005 on interactive poster and demonstration sessions, pp. 65–68. Association for Computational Linguistics, Morristown, NJ, USA (2005b)
Nakov, P.I., Hearst, M.A.: Semantic interpretation of noun compounds using verbal and other paraphrases. ACM Trans. Speech Lang. Process. 10, 1–51 (2013)
Nastase, V., Nakov, P., Séaghdha, D.Ó., Szpakowicz, S.: Semantic Relations Between Nominals: Synthesis Lectures on Human Language Technologies. Morgan & Claypool Publishers, San Rafael (2013)
Pantel, P., Lin, D.: An unsupervised approach to prepositional phrase attachment using contextually similar words. In: Proceedings of ACL (2000)
Porter, M.: An algorithm for suffix stripping. Program 14, 130–137 (1980)
Pustejovsky, J., Anick, P., Bergler, S.: Lexical semantic techniques for corpus analysis. Comput. Linguist. 19, 331–358 (1993)
Ratnaparkhi, A.: Statistical models for unsupervised prepositional phrase attachment. In: Proceedings of COLING-ACL, vol. 2, pp. 1079–1085 (1998)
Reynar, J., Roukos, S.: A maximum entropy model for prepositional phrase attachment. In: Proceedings of the ARPA Workshop on Human Language Technology, pp. 250–255 (1994)
Resnik, P.: Selection and information: a class-based approach to lexical relationships. University of Pennsylvania, UMI Order No. GAX94-13894 dissertation (1993)
Mining the web for bilingual text. In: Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics on Computational Linguistics, pp. 527–534. Association for Computational Linguistics, Morristown, NJ, USA (1999a)
Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. In: JAIR 11, pp. 95–130 (1999b)
Rigau, G., Magnini, B., Agirre, E., Carroll, J.: Meaning: A roadmap to knowledge technologies. In: Proceedings of COLING Workshop on A Roadmap for Computational Linguistics (2002)
Rus, V., Moldovan, D., Bolohan, O.: Bracketing compound nouns for logic form derivation. In: Haller, S.M., Simmons, G. (eds.) FLAIRS Conference, pp. 198–202. AAAI Press (2002)
Zurück zum Zitat Santamaría, C., Gonzalo, J., Verdejo, F.: Automatic association of web directories with word senses. Comput. Linguist. 29, 485–502 (2003)CrossRef Santamaría, C., Gonzalo, J., Verdejo, F.: Automatic association of web directories with word senses. Comput. Linguist. 29, 485–502 (2003)CrossRef
Zurück zum Zitat Shinzato, K., Torisawa, K.: Acquiring hyponymy relations from web documents. In: Proceedings of HLT-NAACL, pp. 73–80 (2004) Shinzato, K., Torisawa, K.: Acquiring hyponymy relations from web documents. In: Proceedings of HLT-NAACL, pp. 73–80 (2004)
Zurück zum Zitat Soricut, R., Brill, E.: Automatic question answering: Beyond the factoid. In: Proceedings of HLT-NAACL, pp. 57–64 (2004) Soricut, R., Brill, E.: Automatic question answering: Beyond the factoid. In: Proceedings of HLT-NAACL, pp. 57–64 (2004)
Zurück zum Zitat Stetina, J., Makoto.: Corpus based PP attachment ambiguity resolution with a semantic dictionary. In: Proceedings of WVLC, pp. 66–80 (1997) Stetina, J., Makoto.: Corpus based PP attachment ambiguity resolution with a semantic dictionary. In: Proceedings of WVLC, pp. 66–80 (1997)
Zurück zum Zitat Toutanova, K., Klein, D., Manning, C., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL 2003, pp. 252–259 (2003) Toutanova, K., Klein, D., Manning, C., Singer, Y.: Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of HLT-NAACL 2003, pp. 252–259 (2003)
Zurück zum Zitat Toutanova, K., Manning, C.D., Andrew Y.Ng.: Learning random walk models for inducing word dependency distributions. In: Proceedings of ICML (2004) Toutanova, K., Manning, C.D., Andrew Y.Ng.: Learning random walk models for inducing word dependency distributions. In: Proceedings of ICML (2004)
Zurück zum Zitat Turney, P., Littman, M.: Corpus-based learning of analogies and semantic relations. Mach. Learn. J. 60, 251–278 (2005)CrossRef Turney, P., Littman, M.: Corpus-based learning of analogies and semantic relations. Mach. Learn. J. 60, 251–278 (2005)CrossRef
Zurück zum Zitat Turney, P.D.: Similarity of semantic relations. Comput. Linguist. 32, 379–416 (2006)CrossRefMATH Turney, P.D.: Similarity of semantic relations. Comput. Linguist. 32, 379–416 (2006)CrossRefMATH
Zurück zum Zitat Volk, M.: Scaling up. using the www to resolve PP attachment ambiguities. In: Proceedings of Konvens-2000. Sprachkommunikation (2000) Volk, M.: Scaling up. using the www to resolve PP attachment ambiguities. In: Proceedings of Konvens-2000. Sprachkommunikation (2000)
Zurück zum Zitat Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In: Proceedings of Corpus Linguistics (2001) Exploiting the WWW as a corpus to resolve PP attachment ambiguities. In: Proceedings of Corpus Linguistics (2001)
Zurück zum Zitat Wang, K., Thrasher, C., Paul Hsu, B.-J.: Web scale NLP: A case study on url word breaking. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 357–366. ACM, New York, NY, USA (2011) Wang, K., Thrasher, C., Paul Hsu, B.-J.: Web scale NLP: A case study on url word breaking. In: Proceedings of the 20th International Conference on World Wide Web, WWW 2011, pp. 357–366. ACM, New York, NY, USA (2011)
Zurück zum Zitat Warren, B.: Semantic patterns of noun-noun compounds. In: Gothenburg Studies in English 41, Goteburg, Acta Universtatis Gothoburgensis (1978) Warren, B.: Semantic patterns of noun-noun compounds. In: Gothenburg Studies in English 41, Goteburg, Acta Universtatis Gothoburgensis (1978)
Zurück zum Zitat Way, A., Gough, N.: wEBMT: developing and validating an example-based machine translation system using the world wide web. Comput. Linguist. 29, 421–457 (2003)CrossRef Way, A., Gough, N.: wEBMT: developing and validating an example-based machine translation system using the world wide web. Comput. Linguist. 29, 421–457 (2003)CrossRef
Zurück zum Zitat Yang, Y., Pedersen, J.: A comparative study on feature selection in text categorization. In: Proceedings of ICML1997, pp. 412–420 (1997) Yang, Y., Pedersen, J.: A comparative study on feature selection in text categorization. In: Proceedings of ICML1997, pp. 412–420 (1997)
Zurück zum Zitat Zahariev, M.: School of Computing Science, Simon Fraser University, USA dissertation (2004) Zahariev, M.: School of Computing Science, Simon Fraser University, USA dissertation (2004)
Zurück zum Zitat Zhu, X., Rosenfeld, R.: Improving trigram language modeling with the world wide web. In: Proceedings of ICASSP I, pp. 533–536 (2001) Zhu, X., Rosenfeld, R.: Improving trigram language modeling with the world wide web. In: Proceedings of ICASSP I, pp. 533–536 (2001)
Metadata
Title
Web as a Corpus: Going Beyond the n-gram
Author
Preslav Nakov
Copyright year
2015
DOI
https://doi.org/10.1007/978-3-319-25485-2_5
