
2018 | Original Paper | Book Chapter

Extract Knowledge from Web Pages in a Specific Domain

Authors: Yihong Lu, Shuiyuan Yu, Minyong Shi, Chunfang Li

Published in: Knowledge Science, Engineering and Management

Publisher: Springer International Publishing

Abstract

Most NLP tasks rely on large, well-organized corpora in the general domain, while little work has been done in specific domains due to the lack of qualified corpora and evaluation datasets. However, domain-specific applications are widely needed nowadays. In this paper, we propose a fast, inexpensive, model-assisted method for training a high-quality distributional model from scattered, unstructured web pages, which can capture knowledge of a specific domain. This approach requires neither a pre-organized corpus nor much human help, and hence works in specific domains that cannot afford the cost of artificially constructed corpora and complex training. We use Word2vec to assist in creating a term set and an evaluation dataset for the embroidery domain. Next, we train a distributional model on the filtered search results of the term set, and conduct task-specific tuning via two simple but practical evaluation metrics: word-pair similarity and in-domain term coverage. Furthermore, our much smaller models outperform a word-embedding model trained on a large, general corpus in our task. In this work, we demonstrate the effectiveness of our method and hope it can serve as a reference for researchers who extract high-quality knowledge in specific domains.
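The two evaluation metrics named in the abstract, word-pair similarity and in-domain term coverage, can be sketched in a few lines. This is a minimal illustration over a hypothetical toy embedding table, not the authors' implementation; the vectors, pairs, and term set below are invented for demonstration.

```python
import math

def cosine(u, v):
    # standard cosine similarity between two dense vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def word_pair_similarity(model, pairs):
    """Mean cosine similarity over the word pairs found in the model."""
    scores = [cosine(model[a], model[b])
              for a, b in pairs if a in model and b in model]
    return sum(scores) / len(scores) if scores else 0.0

def term_coverage(model, term_set):
    """Fraction of in-domain terms that received an embedding."""
    return sum(1 for t in term_set if t in model) / len(term_set)

# hypothetical toy embeddings for the embroidery domain
model = {
    "embroidery": [1.0, 0.2],
    "stitch": [0.9, 0.3],
    "satin": [0.5, 0.8],
}
pairs = [("embroidery", "stitch"), ("embroidery", "satin")]
terms = {"embroidery", "stitch", "satin", "needlepoint"}

print(round(term_coverage(model, terms), 2))  # → 0.75
```

In the paper's setting, the model would come from Word2vec trained on the filtered web-page corpus, and the pairs and term set from the model-assisted construction step; both metrics then guide the task-specific tuning.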


Metadata
Title: Extract Knowledge from Web Pages in a Specific Domain
Authors: Yihong Lu, Shuiyuan Yu, Minyong Shi, Chunfang Li
Copyright year: 2018
DOI: https://doi.org/10.1007/978-3-319-99365-2_10
