Functional structure identification of scientific documents in computer science

Lu, Wei; Huang, Yong; Bu, Yi; Cheng, Qikai

doi:10.1007/s11192-018-2640-y

Published: 02 February 2018

Scientometrics, Volume 115, pages 463–486 (2018)

Wei Lu¹, Yong Huang¹,², Yi Bu² & Qikai Cheng¹ (ORCID: orcid.org/0000-0003-3904-8901)

Abstract

The increasing number of open-access full-text scientific documents promotes the transformation from metadata-based to content-based studies, which are more detailed and semantically richer. Along with the benefits of ample data, the disordered internal structure of these documents makes data organization and analysis difficult. Each unit in a scientific document has its own function in expressing the authors' research ideas, such as introducing motivations, describing methods, stating related work, and drawing conclusions; these functions can be used to identify the functional structure of scientific documents. This paper first proposes a clustering method that generates domain-specific structures from the high-frequency section headers of scientific documents in a domain. To automatically identify the structure of scientific documents, we categorize them into three types: (1) strong-structure documents; (2) weak-structure documents; and (3) no-structure documents. We further divide the identification into three levels (section header-based identification, section content-based identification, and paragraph-based identification), corresponding to the three types of documents. Our experiments on documents in the field of computer science show that: (1) section header-based identification is the most direct and simplest method, but its accuracy is limited by unknown words in section headers; (2) section content-based identification is more stable and obtains good performance; and (3) paragraph-based identification is promising for identifying the functions of no-structure documents. Additionally, we apply our methods to two tasks: academic search and keyword extraction. Both tasks demonstrate the effectiveness of functional structure.
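
As a rough illustration of the three identification levels described in the abstract, the following Python sketch labels a section header by keyword lookup and then picks a level for a document. It is not the authors' implementation. The keyword lexicon, the function names, and the rule for deciding whether a document counts as strong-, weak-, or no-structure are all assumptions made for this example; the paper derives its categories by clustering high-frequency section headers, which is not reproduced here.

```python
import re
from collections import Counter
from typing import Optional, Sequence

# Hypothetical lexicon, chosen for illustration only. The paper obtains its
# domain-specific categories by clustering high-frequency section headers.
FUNCTION_KEYWORDS = {
    "introduction": {"introduction", "motivation", "background"},
    "related work": {"related", "literature", "previous"},
    "method": {"method", "methods", "approach", "model", "algorithm"},
    "experiment": {"experiment", "experiments", "evaluation", "results"},
    "conclusion": {"conclusion", "conclusions", "summary", "future"},
}


def identify_by_header(header: str) -> Optional[str]:
    """Return the function whose keywords best match the header, or None.

    None models the limitation noted in the abstract: a header made up
    only of unknown words cannot be labeled from the header alone.
    """
    # Keep alphabetic tokens only, which also drops numbering such as "3.1".
    tokens = set(re.findall(r"[a-z]+", header.lower()))
    scores = Counter(
        {label: len(tokens & keywords) for label, keywords in FUNCTION_KEYWORDS.items()}
    )
    label, hits = scores.most_common(1)[0]
    return label if hits > 0 else None


def identification_level(headers: Sequence[str]) -> str:
    """Choose an identification level for a document (assumed decision rule)."""
    if not headers:
        # No section headers at all: treat as a no-structure document.
        return "paragraph-based"
    if all(identify_by_header(h) is not None for h in headers):
        # Every header is recognizable: treat as a strong-structure document.
        return "section header-based"
    # Some headers contain only unknown words: treat as weak-structure.
    return "section content-based"


if __name__ == "__main__":
    print(identify_by_header("3.1 Our Approach"))
    print(identification_level(["1 Introduction", "The FooBar System"]))
```

In this sketch, a header such as "The FooBar System" matches no keyword, so the document falls back to the content-based level, where a classifier trained on section text would take over.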

Acknowledgements

This study was supported by the Natural Science Funding in China (No. 71473183). The authors would like to thank Ying Ding and the anonymous reviewer for their insightful suggestions.

Author information

Authors and Affiliations

Information Retrieval and Knowledge Mining Laboratory, School of Information Management, Wuhan University, Wuhan, Hubei, China
Wei Lu, Yong Huang & Qikai Cheng
School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN, USA
Yong Huang & Yi Bu

Corresponding author

Correspondence to Qikai Cheng.

About this article

Cite this article

Lu, W., Huang, Y., Bu, Y. et al. Functional structure identification of scientific documents in computer science. Scientometrics 115, 463–486 (2018). https://doi.org/10.1007/s11192-018-2640-y

Received: 14 September 2017
Published: 02 February 2018
Issue Date: April 2018
DOI: https://doi.org/10.1007/s11192-018-2640-y
