Published in: International Journal on Digital Libraries, Issue 2-3/2018

17.05.2017

Section mixture models for scientific document summarization

Written by: John M. Conroy, Sashka T. Davis

Abstract

In this paper, we present a system for summarizing scientific and structured documents. The system has three components: section mixture models that estimate the weights of terms; a hypothesis test that selects a subset of these terms; and a sentence extractor based on combinatorial optimization techniques. The section mixture model approach adapts a bigram mixture model to the main sections of a scientific document and a collection of citing sentences (citances) from papers that reference the document. The model was adapted from earlier work on biomedical documents used in the summarization task of the 2014 Text Analysis Conference (TAC 2014). The mixture model trained on the biomedical data was also applied to the data for the Computational Linguistics scientific summarization task of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (CL-SciSumm 2016). This model yields machine-generated summaries with ROUGE scores nearly as strong as those seen on the biomedical data, and it was also the highest-scoring submission to the task of generating a human summary. For sentence extraction, we use the OCCAMS algorithm (Davis et al. 2012), which takes the sentences of the original document and the term weights computed by the language models and outputs a set of minimally overlapping sentences whose combined term coverage is maximized. Finally, we explore how important an appropriate background model is for the term-selection hypothesis test in achieving the best-quality summaries.
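To make the sentence extraction step more concrete, the sketch below shows a simple greedy heuristic for weighted term coverage under a length budget. It is not the OCCAMS algorithm itself and is not taken from the paper; it only illustrates the idea of picking minimally overlapping sentences that cover as much term weight as possible. The function name `greedy_term_coverage`, its parameters, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def greedy_term_coverage(
    sentences: List[Set[str]],
    lengths: List[int],
    term_weights: Dict[str, float],
    budget: int,
) -> List[int]:
    """Greedily pick sentence indices by newly covered term weight per word.

    Illustrative heuristic only (an assumption, not the OCCAMS algorithm):
    a term counts once, so sentences repeating covered terms gain nothing.
    """
    covered: Set[str] = set()
    chosen: List[int] = []
    remaining = budget
    candidates = set(range(len(sentences)))

    while candidates:
        best, best_ratio = None, 0.0
        for i in sorted(candidates):
            # Skip sentences that are empty or no longer fit in the budget.
            if lengths[i] <= 0 or lengths[i] > remaining:
                continue
            # Only terms not yet covered contribute to the gain.
            gain = sum(term_weights.get(t, 0.0) for t in sentences[i] - covered)
            ratio = gain / lengths[i]
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            break  # nothing fits or nothing adds new weight
        chosen.append(best)
        covered |= sentences[best]
        remaining -= lengths[best]
        candidates.remove(best)

    return chosen


# Toy data (hypothetical): each sentence is the set of terms it contains.
example_sentences = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d"}]
example_lengths = [2, 2, 4, 1]
example_weights = {"a": 3.0, "b": 2.0, "c": 1.0, "d": 0.5}

selected = greedy_term_coverage(
    example_sentences, example_lengths, example_weights, budget=5
)
print(selected)  # prints [0, 1, 3]
```

In this toy run the third sentence covers the most weight in total, but it is never chosen: the first sentence has the better weight-per-word ratio, and afterwards the longer sentence no longer fits the budget. In the system described above, the terms would be the bigrams retained by the hypothesis test, with weights supplied by the section mixture model.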

References
1. Abu-Jbara, A., Radev, D.: Coherent citation-based summarization of scientific papers. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 500–509. Association for Computational Linguistics, Portland, Oregon, USA (2011). http://www.aclweb.org/anthology/P11-1051
2. Cabanac, G., Chandrasekaran, M.K., Frommholz, I., Jaidka, K., Kan, M.Y., Mayr, P., Wolfram, D.: Joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2016). In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, JCDL ’16, pp. 299–300. ACM, New York, NY, USA (2016). doi:10.1145/2910896.2926734
3. Cohan, A., Goharian, N.: Scientific article summarization using citation-context and article’s discourse structure. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 390–400. Association for Computational Linguistics (2015). http://aclweb.org/anthology/D15-1045
4. Cohan, A., Soldaini, L., Goharian, N.: Matching citation text and cited spans in biomedical literature: a search-oriented approach. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1042–1048. Association for Computational Linguistics (2015). http://www.aclweb.org/anthology/N15-1110
5. Conroy, J.M., Davis, S., Kubina, J., Liu, Y.K., O’Leary, D.P., Schlesinger, J.D.: Multilingual summarization: dimensionality reduction and a step towards optimal term coverage. In: ACL, MultiLing Workshop, pp. 454–463 (2013)
6. Conroy, J.M., Davis, S.T.: Vector space and language models for scientific document summarization. In: Proceedings of NAACL-HLT, pp. 186–191 (2015)
7. Davis, S.T., Conroy, J.M., Schlesinger, J.D.: OCCAMS—an optimal combinatorial covering algorithm for multi-document summarization. In: Vreeken, J., Ling, C., Zaki, M.J., Siebes, A., Yu, J.X., Goethals, B., Webb, G.I., Wu, X. (eds.) ICDM Workshops, pp. 454–463. IEEE Computer Society (2012)
9. Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)
10. Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D., Radev, D.: Blind men and elephants: What do citation summaries tell us about a research article? J. Am. Soc. Inf. Sci. Technol. 59(1), 51–62 (2008). doi:10.1002/asi.v59:1
12. Li, L., Mao, L., Zhang, Y., Chi, J., Huang, T., Cong, X., Peng, H.: CIST system for CL-SciSumm 2016 shared task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) co-located with the Joint Conference on Digital Libraries 2016 (JCDL 2016), Newark, NJ, USA, June 23, 2016, pp. 156–167 (2016). http://ceur-ws.org/Vol-1610/paper18.pdf
13. Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: Moens, M.-F., Szpakowicz, S. (eds.) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (2004)
14. Lin, C.Y., Cao, G., Gao, J., Nie, J.Y.: An information-theoretic approach to automatic evaluation of summaries. In: Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pp. 463–470. Association for Computational Linguistics, Stroudsburg, PA, USA (2006). doi:10.3115/1220835.1220894
15. Lin, C.Y., Hovy, E.: The automated acquisition of topic signatures for text summarization. In: Proceedings of the 18th Conference on Computational Linguistics, pp. 495–501. Association for Computational Linguistics, Morristown, NJ, USA (2000)
16. McDonald, R.: A study of global inference algorithms in multi-document summarization. In: Proceedings of ECIR, pp. 557–564 (2007)
17. Nakov, P.I., Schwartz, A.S., Hearst, M.A.: Citances: citation sentences for semantic analysis of bioscience text. In: Proceedings of the SIGIR’04 Workshop on Search and Discovery in Bioinformatics (2004)
19. Nishikawa, H., Hasegawa, T., Matsuo, Y., Kikui, G.: Opinion summarization with integer linear programming formulation for sentence extraction and ordering. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 910–918. Association for Computational Linguistics (2010)
20. Parveen, D., Mesgar, M., Strube, M.: Generating coherent summaries of scientific articles using coherence patterns. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 772–783. Association for Computational Linguistics (2016). https://aclweb.org/anthology/D16-1074
21. Qazvinian, V., Radev, D.R.: Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING ’08, pp. 689–696. Association for Computational Linguistics, Stroudsburg, PA, USA (2008). http://dl.acm.org/citation.cfm?id=1599081.1599168
22. Rankel, P., Conroy, J., Slud, E., O’Leary, D.: Ranking human and machine summarization systems. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 467–473. Association for Computational Linguistics, Edinburgh, Scotland, UK (2011). http://www.aclweb.org/anthology/D11-1043
24. Teufel, S., Moens, M.: Summarizing scientific articles—experiments with relevance and rhetorical status. Comput. Linguist. 28(4), 409–445 (2002)
25. Yates, F.: Contingency tables involving small numbers and the \(\chi^2\) test. Supplement. J. R. Stat. Soc. 1, 217–235 (1934)
Metadata
Title
Section mixture models for scientific document summarization
Written by
John M. Conroy
Sashka T. Davis
Publication date
17.05.2017
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 2-3/2018
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-017-0218-6
