research-article

Summarizing figures, tables, and algorithms in scientific publications to augment search results

Published: 06 March 2012

Abstract

Special-purpose search engines are increasingly being built to retrieve document-elements such as tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of each document-element, some document metadata such as the paper's title and authors, and the caption of the document-element. Although authors in some disciplines write carefully tailored captions, an author generally assumes that the caption will be read in the context of the surrounding document text. When the caption is presented out of context, as in a document-element search result, it may not contain enough information for the end-user to understand the content of the document-element. Consequently, end-users examining document-element search results would want a short “synopsis” of this information presented along with the document-element. A synopsis lets the end-user quickly understand the content of the document-element, because examining it takes less time than downloading, opening, and reading the file to find the same information. It may also allow the end-user to examine more results than they otherwise would.
In this paper, we present the first set of methods to automatically extract this synopsis for document-elements. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences in the document text, based on the similarity and proximity of each sentence to the caption and to the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of the different features used. We also investigate the problem of choosing the optimum synopsis size, one that balances the information content against the size of the generated synopses. Finally, we perform a user study to measure how the synopses generated by our proposed method compare with those of other state-of-the-art approaches.
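
The abstract names two kinds of evidence for selecting synopsis sentences: similarity and proximity to the caption and to the sentences that refer to the document-element. The sketch below illustrates how such features might be computed; it is not the authors' implementation. The function names, the raw term-count cosine similarity, the label-matching rule for finding reference sentences, and the 1 / (1 + distance) proximity score are all assumptions made for illustration. The resulting feature tuples could then be passed to a Naïve Bayes or support vector machine classifier.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two strings using raw term counts (an assumed measure)."""
    ca, cb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(ca[t] * cb[t] for t in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def sentence_features(sentences, caption, element_label):
    """Return one (caption_sim, reference_sim, proximity) tuple per sentence.

    element_label is a hypothetical identifier such as "Figure 2"; a sentence
    that mentions it is treated as a reference sentence (an assumption).
    """
    pattern = re.compile(r"\b" + re.escape(element_label) + r"\b", re.IGNORECASE)
    ref_idx = [i for i, s in enumerate(sentences) if pattern.search(s)]

    features = []
    for i, sentence in enumerate(sentences):
        # Similarity of the sentence to the caption.
        caption_sim = cosine_similarity(sentence, caption)
        # Highest similarity to any reference sentence (0.0 if there are none).
        reference_sim = max(
            (cosine_similarity(sentence, sentences[j]) for j in ref_idx),
            default=0.0,
        )
        # Distance in sentences to the nearest reference sentence,
        # mapped to (0, 1] so that closer sentences score higher.
        distance = min((abs(i - j) for j in ref_idx), default=len(sentences))
        proximity = 1.0 / (1.0 + distance)
        features.append((caption_sim, reference_sim, proximity))
    return features
```

In this sketch a reference sentence gets a proximity of 1.0 and its immediate neighbours 0.5, so a classifier trained on these tuples can weigh closeness to the reference against textual overlap with the caption.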

References

  1. Bhatia, S., Lahiri, S., and Mitra, P. 2009. Generating synopses for document-element search. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM'09). ACM, New York, NY, 2003--2006.
  2. Bhatia, S., Mitra, P., and Giles, C. L. 2010. Finding algorithms in scientific articles. In Proceedings of the International World Wide Web Conference. 1061--1062.
  3. Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.
  4. Carbonell, J. and Goldstein, J. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 335--336.
  5. Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Techn. 2, 27. (Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.)
  6. Chang, C.-C., Lin, C.-J., and Hsu, C.-W. 2009. A practical guide to support vector classification. http://www.csie.ntu.edu.tw/~cjlin/talks/freiburg.pdf.
  7. Chen, Y., Wang, G., and Dong, S. 2003. Learning with progressive transductive support vector machine. Patt. Recog. Lett. 24, 12, 1845--1855.
  8. Cho, S., Koudas, N., and Srivastava, D. 2006. Meta-data indexing for XPath location steps. In Proceedings of the SIGMOD Conference, S. Chaudhuri, V. Hristidis, and N. Polyzotis, Eds., ACM, 455--466.
  9. Corio, M. and Lapalme, G. 1999. Generation of texts for information graphics. In Proceedings of the 7th European Workshop on Natural Language Generation (EWNLG'99). 49--58.
  10. Demner-Fushman, D., Antani, S., Simpson, M., and Thoma, G. 2009. Annotation and retrieval of clinically relevant images. Int. J. Med. Inf. 78, 12, e59--e67.
  11. Elzer, S., Carberry, R., Chester, D., Demir, S., Green, N., Zukerman, I., and Trnka, K. 2005. Exploring and exploiting the limited utility of captions in recognizing intention in information graphics. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL). 223--230.
  12. Futrelle, R. P. 1999. Summarization of diagrams in documents. In Advances in Automated Text Summarization. 403--421.
  13. Futrelle, R. P. 2004. Handling figures in document summarization. In Text Summarization Branches Out. In Proceedings of the Workshop at the Annual Meeting of the Association for Computational Linguistics. 61--65.
  14. Goldstein, J., Kantrowitz, M., Mittal, V., and Carbonell, J. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval. 121--128.
  15. Goldstein, J., Mittal, V., Carbonell, J., and Kantrowitz, M. 2000. Multi-document summarization by sentence extraction. In Proceedings of the Workshop on Automatic summarization. Association for Computational Linguistics, 40--48.
  16. Guglielmo, E. J. and Rowe, N. C. 1996. Natural-language retrieval of images based on descriptive captions. ACM Trans. Inf. Syst. 14, 237--267.
  17. Hadjieleftheriou, M., Kollios, G., Bakalov, P., and Tsotras, V. J. 2005. Complex spatiotemporal pattern queries. In Proceedings of the International Conference on Very Large Databases. K. Böhm, C. S. Jensen, L. M. Haas, M. L. Kersten, P.-A. Larson, and B. C. Ooi, Eds., ACM, 877--888.
  18. Hearst, M. A., Divoli, A., Guturu, H., Ksikes, A., Nakov, P., Wooldridge, M. A., and Ye, J. 2007. Biotext search engine: Beyond abstract search. Bioinformatics 23, 16, 2196--2197.
  19. Huang, W., Tan, C. L., and Leow, W. K. 2005. Associating text and graphics for scientific chart understanding. In Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR'05). IEEE Computer Society, Washington, DC, 580--584.
  20. Kataria, S., Brouwer, W., Mitra, P., and Giles, C. L. 2008. Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents. In Proceedings of the National Conference on Artificial Intelligence. 1169--1174.
  21. Ko, Y. and Seo, J. 2008. An effective sentence-extraction technique using contextual information and statistical approaches for text summarization. Pattern Recogn. Lett. 29, 9, 1366--1371.
  22. Kupiec, J., Pedersen, J., and Chen, F. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International Conference on Research and Development in Information Retrieval (SIGIR'95). ACM Press, New York, NY, 68--73.
  23. Liu, Y., Bai, K., Mitra, P., and Giles, C. L. 2007. Tableseer: automatic table metadata extraction and searching in digital libraries. In Proceedings of the Joint Conference on Digital Libraries. ACM, 91--100.
  24. Luhn, H. P. 1958. Automatic creation of literature abstracts. IBM J. Res. Dev. 2, 159--165.
  25. Mani, I. and Maybury, M. T., Eds. 1999. Advances in Automatic Text Summarization. MIT Press, Cambridge, MA.
  26. Manning, C. D., Raghavan, P., and Schutze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
  27. Metzler, D. and Kanungo, T. 2008. Machine learned sentence selection strategies for query-biased summarization. In Proceedings of the SIGIR Learning to Rank Workshop.
  28. Osuna, E. E., Freund, R., and Girosi, F. 1997. Training support vector machines: An application to face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
  29. Passonneau, R., Kukich, K., Robin, J., Hatzivassiloglou, V., Lefkowitz, L., and Jing, H. 1996. Generating summaries of work flow diagrams. In Proceedings of the International Conference on Natural Language Processing and Industrial Applications (NLPIA'96). 204--210.
  30. Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.
  31. Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. 1995. Okapi at TREC-3. http://www.compapp.dcu.ie/~gjones/Teaching/CA437/city.pdf.
  32. Sandusky, R. and Tenopir, C. 2008. Finding and using journal-article components: Impacts of disaggregation on teaching and research practice. J. Amer. Soc. Inf. Sci. Techn. 59, 6, 970--982.
  33. Teufel, S. and Moens, M. 1997. Sentence extraction as a classification task. In Proceedings of the Workshop on Intelligent and Scalable Text Summarization.
  34. Tombros, A. and Sanderson, M. 1998. Advantages of query biased summaries in information retrieval. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2--10.
  35. White, R., Jose, J. M., and Ruthven, I. 2003. A task-oriented study on the influencing effects of query-biased summarisation in web searching. Inf. Process. Manage 39, 5, 707--733.
  36. Wu, T.-F., Lin, C.-J., and Weng, R. C. 2003. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Resear. 5, 975--1005.

          • Published in

            ACM Transactions on Information Systems, Volume 30, Issue 1
            February 2012
            193 pages
            ISSN: 1046-8188
            EISSN: 1558-2868
            DOI: 10.1145/2094072

            Copyright © 2012 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 6 March 2012
            • Accepted: 1 October 2011
            • Revised: 1 July 2011
            • Received: 1 December 2010


            Qualifiers

            • research-article
            • Research
            • Refereed
