Published in: International Journal on Digital Libraries 2-3/2018

17-05-2017

Section mixture models for scientific document summarization

Authors: John M. Conroy, Sashka T. Davis

Abstract

In this paper, we present a system for summarizing scientific and structured documents. It has three components: section mixture models that estimate term weights; a hypothesis test that selects a subset of these terms; and a sentence extractor based on combinatorial optimization techniques. The section mixture model approach adapts a bigram mixture model to the main sections of a scientific document and a collection of citing sentences (citances) from papers that reference the document. The model was adapted from earlier work on biomedical documents used in the summarization task of the 2014 Text Analysis Conference (TAC 2014). The mixture model trained on the biomedical data was also applied to the data for the Computational Linguistics scientific summarization task of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (CL-SciSumm 2016). On that data, the model yields machine-generated summaries with ROUGE scores nearly as strong as those seen on the biomedical data, and it was the highest-scoring submission to the task of generating a human summary. For sentence extraction, we use the OCCAMS algorithm (Davis et al., in: Vreeken, Ling, Zaki, Siebes, Yu, Goethals, Webb, Wu (eds) ICDM workshops, IEEE Computer Society, pp 454–463, 2012), which takes the sentences of the original document and the term weights computed by the language models and outputs a set of minimally overlapping sentences whose combined term coverage is maximized. Finally, we explore how important an appropriate background model is for the term-selection hypothesis test in achieving the best-quality summaries.
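The two later stages named in the abstract (a hypothesis test against a background model, then coverage-based sentence extraction) can be illustrated with a minimal sketch. This is not the authors' implementation: the test shown is the log-likelihood ratio statistic of Dunning (reference 9), the extractor is a plain greedy coverage heuristic standing in for OCCAMS, and every name here (`g2`, `select_terms`, `greedy_cover`, the 10.83 threshold, the word budget) is a hypothetical choice made for illustration. Term weights are passed in by hand, where the paper would obtain them from the section mixture model.

```python
import math
from collections import Counter


def bigrams(tokens):
    """Return the adjacent word pairs of a token list."""
    return list(zip(tokens, tokens[1:]))


def _loglik(k, n, p):
    """Binomial log-likelihood of k hits in n trials at rate p (constant dropped)."""
    ll = 0.0
    if k > 0:
        ll += k * math.log(p)
    if n - k > 0:
        ll += (n - k) * math.log(1.0 - p)
    return ll


def g2(k1, n1, k2, n2):
    """Log-likelihood ratio statistic: does a term's rate differ between two corpora?"""
    p1, p2, p = k1 / n1, k2 / n2, (k1 + k2) / (n1 + n2)
    return 2.0 * (_loglik(k1, n1, p1) + _loglik(k2, n2, p2)
                  - _loglik(k1, n1, p) - _loglik(k2, n2, p))


def select_terms(doc_counts, bg_counts, threshold=10.83):
    """Keep terms significantly MORE frequent in the document than in the background.

    The threshold is an assumed chi-square cutoff (1 degree of freedom, p = 0.001).
    """
    n1, n2 = sum(doc_counts.values()), sum(bg_counts.values())
    selected = set()
    for term, k1 in doc_counts.items():
        k2 = bg_counts.get(term, 0)
        if k1 / n1 > k2 / n2 and g2(k1, n1, k2, n2) > threshold:
            selected.add(term)
    return selected


def greedy_cover(sentences, weights, budget):
    """Greedy stand-in for the extractor: repeatedly add the sentence with the
    best newly covered weight per word, without exceeding the word budget."""
    covered, chosen, used = set(), [], 0
    while True:
        best, best_ratio = None, 0.0
        for i, sent in enumerate(sentences):
            if i in chosen or not sent or used + len(sent) > budget:
                continue
            new_terms = (set(bigrams(sent)) & weights.keys()) - covered
            ratio = sum(weights[t] for t in new_terms) / len(sent)
            if ratio > best_ratio:
                best, best_ratio = i, ratio
        if best is None:
            return sorted(chosen)
        chosen.append(best)
        covered |= set(bigrams(sentences[best])) & weights.keys()
        used += len(sentences[best])


# Toy usage: hand-set weights stand in for the mixture-model estimates.
sentences = [
    "mixture models weight terms".split(),
    "mixture models are used here too".split(),
    "sentence extraction maximizes coverage".split(),
]
weights = {("mixture", "models"): 2.0, ("weight", "terms"): 1.0,
           ("sentence", "extraction"): 1.0}
summary_ids = greedy_cover(sentences, weights, budget=8)
```

In the toy usage, the second sentence repeats a bigram that the first already covers, so it adds no new weight and is passed over. This is the sense in which a coverage objective favours minimally overlapping sentences. Changing the background counts given to `select_terms` changes which terms survive the test, which is the sensitivity the abstract's final sentence refers to.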

Literature
1.
Abu-Jbara, A., Radev, D.: Coherent citation-based summarization of scientific papers. In: Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 500–509. Association for Computational Linguistics, Portland, Oregon, USA (2011). http://www.aclweb.org/anthology/P11-1051
2.
Cabanac, G., Chandrasekaran, M.K., Frommholz, I., Jaidka, K., Kan, M.Y., Mayr, P., Wolfram, D.: Joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries (BIRNDL 2016). In: Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries, JCDL ’16, pp. 299–300. ACM, New York, NY, USA (2016). doi:10.1145/2910896.2926734
3.
Cohan, A., Goharian, N.: Scientific article summarization using citation-context and article’s discourse structure. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 390–400. Association for Computational Linguistics (2015). http://aclweb.org/anthology/D15-1045
4.
Cohan, A., Soldaini, L., Goharian, N.: Matching citation text and cited spans in biomedical literature: a search-oriented approach. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 1042–1048. Association for Computational Linguistics (2015). http://www.aclweb.org/anthology/N15-1110
5.
Conroy, J.M., Davis, S., Kubina, J., Liu, Y.K., O’Leary, D.P., Schlesinger, J.D.: Multilingual summarization: dimensionality reduction and a step towards optimal term coverage. In: ACL, MultiLing Workshop, pp. 454–463 (2013)
6.
Conroy, J.M., Davis, S.T.: Vector space and language models for scientific document summarization. In: Proceedings of NAACL-HLT, pp. 186–191 (2015)
7.
Davis, S.T., Conroy, J.M., Schlesinger, J.D.: OCCAMS—an optimal combinatorial covering algorithm for multi-document summarization. In: Vreeken, J., Ling, C., Zaki, M.J., Siebes, A., Yu, J.X., Goethals, B., Webb, G.I., Wu, X. (eds.) ICDM Workshops, pp. 454–463. IEEE Computer Society (2012)
9.
Dunning, T.: Accurate methods for the statistics of surprise and coincidence. Comput. Linguist. 19(1), 61–74 (1993)
10.
Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D., Radev, D.: Blind men and elephants: What do citation summaries tell us about a research article? J. Am. Soc. Inf. Sci. Technol. 59(1), 51–62 (2008). doi:10.1002/asi.v59:1
12.
Li, L., Mao, L., Zhang, Y., Chi, J., Huang, T., Cong, X., Peng, H.: CIST system for CL-SciSumm 2016 shared task. In: Proceedings of the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL) co-located with the Joint Conference on Digital Libraries 2016 (JCDL 2016), Newark, NJ, USA, June 23, 2016, pp. 156–167 (2016). http://ceur-ws.org/Vol-1610/paper18.pdf
13.
Lin, C.Y.: Rouge: a package for automatic evaluation of summaries. In: S.S. Marie-Francine Moens (ed.) Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (2004)
14.
Lin, C.Y., Cao, G., Gao, J., Nie, J.Y.: An information-theoretic approach to automatic evaluation of summaries. In: Proceedings of the Main Conference on Human Language Technology Conference of the North American Chapter of the Association of Computational Linguistics, HLT-NAACL ’06, pp. 463–470. Association for Computational Linguistics, Stroudsburg, PA, USA (2006). doi:10.3115/1220835.1220894
15.
Lin, C.Y., Hovy, E.: The automated acquisition of topic signatures for text summarization. In: Proceedings of the 18th Conference on Computational Linguistics, pp. 495–501. Association for Computational Linguistics, Morristown, NJ, USA (2000)
16.
McDonald, R.: A study of global inference algorithms in multi-document summarization. In: Proceedings of ECIR, pp. 557–564 (2007)
17.
Nakov, P.I., Schwartz, A.S., Hearst, M.A.: Citances: citation sentences for semantic analysis of bioscience text. In: Proceedings of the SIGIR’04 Workshop on Search and Discovery in Bioinformatics (2004)
19.
Nishikawa, H., Hasegawa, T., Matsuo, Y., Kikui, G.: Opinion summarization with integer linear programming formulation for sentence extraction and ordering. In: Proceedings of the 23rd International Conference on Computational Linguistics: Posters, pp. 910–918. Association for Computational Linguistics (2010)
20.
Parveen, D., Mesgar, M., Strube, M.: Generating coherent summaries of scientific articles using coherence patterns. In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 772–783. Association for Computational Linguistics (2016). https://aclweb.org/anthology/D16-1074
21.
Qazvinian, V., Radev, D.R.: Scientific paper summarization using citation summary networks. In: Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING ’08, pp. 689–696. Association for Computational Linguistics, Stroudsburg, PA, USA (2008). http://dl.acm.org/citation.cfm?id=1599081.1599168
22.
Rankel, P., Conroy, J., Slud, E., O’Leary, D.: Ranking human and machine summarization systems. In: Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp. 467–473. Association for Computational Linguistics, Edinburgh, Scotland, UK (2011). http://www.aclweb.org/anthology/D11-1043
24.
Teufel, S., Moens, M.: Summarizing scientific articles—experiments with relevance and rhetorical status. Comput. Linguist. 28, 2002 (2002)
25.
Yates, F.: Contingency tables involving small numbers and the \(\chi^2\) test. Supplement. J. R. Stat. Soc. 1, 217–235 (1934)
Metadata
Title
Section mixture models for scientific document summarization
Authors
John M. Conroy
Sashka T. Davis
Publication date
17-05-2017
Publisher
Springer Berlin Heidelberg
Published in
International Journal on Digital Libraries / Issue 2-3/2018
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI
https://doi.org/10.1007/s00799-017-0218-6