Published in: World Wide Web 1/2020

2 December 2019

A crowd-efficient learning approach for NER based on online encyclopedia

Authors: Maolong Li, Zhixu Li, Qiang Yang, Zhigang Chen, Pengpeng Zhao, Lei Zhao

Abstract

Named Entity Recognition (NER) is a core task of NLP. State-of-the-art supervised NER models rely heavily on large amounts of high-quality annotated data, which are expensive to obtain. Various approaches have been proposed to reduce this reliance on large training sets, but only with limited effect. In this paper, we propose a crowd-efficient learning approach for supervised NER that makes full use of online encyclopedia pages. We first define three criteria (representativeness, informativeness, diversity) to select a much smaller set of samples for crowd labeling. We then propose a data augmentation method that uses the structured knowledge of the online encyclopedia to generate much more training data and thereby greatly strengthen training. After training the model on the augmented sample set, we re-select new samples for crowd labeling to refine the model. We repeat the training and selection procedure until the model can no longer be improved or its performance meets our requirement. Our empirical study on several real data collections shows that our approach can reduce manual annotation by 50% while achieving almost the same NER performance as the fully trained model.
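
The iterative procedure described in the abstract can be pictured as a simple control loop. The Python sketch below is only an illustration under stated assumptions, not the authors' implementation: the function name `crowd_efficient_training`, its parameters (`batch_size`, `target_score`, `min_gain`), and all five callables are hypothetical. In particular, the abstract does not say how the three selection criteria are computed or combined, nor how the encyclopedia is used for augmentation, so both are left as pluggable functions.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Sentence = str
Labeled = Tuple[str, str]  # (sentence, tags): a toy stand-in for a real NER annotation


def crowd_efficient_training(
    pool: Sequence[Sentence],
    select: Callable[[List[Sentence], Optional[object], int], List[Sentence]],
    crowd_label: Callable[[Sentence], Labeled],
    augment: Callable[[Labeled], List[Labeled]],
    train: Callable[[List[Labeled]], object],
    evaluate: Callable[[object], float],
    batch_size: int = 2,
    target_score: float = 0.9,
    min_gain: float = 1e-3,
) -> Tuple[Optional[object], float, int]:
    """Select, label, augment, and retrain until a stopping condition holds.

    All callables are hypothetical placeholders:
      select      picks samples to label (e.g. by representativeness,
                  informativeness, and diversity)
      crowd_label obtains a crowd annotation for one sample
      augment     derives extra training samples from one labeled sample
      train       fits a model on the current training set
      evaluate    scores a model on held-out data
    """
    unlabeled = list(pool)
    training_set: List[Labeled] = []
    model: Optional[object] = None
    best_score = float("-inf")
    n_labeled = 0

    while unlabeled:
        batch = select(unlabeled, model, batch_size)
        if not batch:
            break
        for sentence in batch:
            unlabeled.remove(sentence)
            labeled = crowd_label(sentence)
            n_labeled += 1
            training_set.append(labeled)
            training_set.extend(augment(labeled))

        model = train(training_set)
        score = evaluate(model)
        gain = score - best_score
        best_score = max(best_score, score)
        # Stop when the requirement is met or the model no longer improves.
        if score >= target_score or gain < min_gain:
            break

    return model, best_score, n_labeled


# Toy stand-ins so the sketch runs end to end; none reflect the paper's components.
pool = [f"sentence {i}" for i in range(6)]
model, score, n_labeled = crowd_efficient_training(
    pool,
    select=lambda unlabeled, current_model, k: unlabeled[:k],
    crowd_label=lambda s: (s, "O"),
    augment=lambda labeled: [(labeled[0] + " (variant)", labeled[1])],
    train=lambda data: len(data),          # the "model" is just the training-set size
    evaluate=lambda m: min(1.0, m / 8),    # the score grows with the training-set size
)
```

Passing the components as functions keeps the loop independent of any particular NER model or encyclopedia; a real system would replace the toy lambdas with its own selection, labeling, augmentation, training, and evaluation steps.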

Metadata
Title
A crowd-efficient learning approach for NER based on online encyclopedia
Authors
Maolong Li
Zhixu Li
Qiang Yang
Zhigang Chen
Pengpeng Zhao
Lei Zhao
Publication date
2 December 2019
Publisher
Springer US
Published in
World Wide Web / Issue 1/2020
Print ISSN: 1386-145X
Electronic ISSN: 1573-1413
DOI
https://doi.org/10.1007/s11280-019-00736-3
