Summarizing figures, tables, and algorithms in scientific publications to augment search results

Authors:
Sumit Bhatia, Pennsylvania State University, PA
Prasenjit Mitra, Pennsylvania State University, PA

ACM Transactions on Information Systems, Volume 30, Issue 1, Article No. 3, pp. 1–24. https://doi.org/10.1145/2094072.2094075

Published: 06 March 2012

Abstract

Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements like tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of the document-element, some document metadata such as the title of the paper and its authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, the author of a document generally assumes that the caption will be read in the context of the surrounding text. When the caption is presented out of context, as in a document-element search result, it may not contain enough information to help the end-user understand the content of the document-element. Consequently, end-users examining document-element search results would want a short "synopsis" of this information presented along with the document-element. A synopsis lets the end-user quickly understand the content of the document-element, because reading it takes less time than downloading, opening, and reading the entire document. Furthermore, it may allow the end-user to examine more results than they would otherwise. In this paper, we present the first set of methods to automatically extract this synopsis for document-elements. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences in the document text, based on each sentence's similarity and proximity to the caption and to the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of the different features used. We also investigate the problem of choosing the optimum synopsis size, one that balances the information content against the length of the generated synopsis. Finally, a user study measures how the synopses generated by our proposed method compare with those of other state-of-the-art approaches.
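
The sentence-classification idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract only states that similarity and proximity to the caption and to the referring sentence are used as features, so the three boolean features, the 0.2 similarity threshold, the one-sentence proximity window, and the names `tokens`, `cosine`, `features`, and `BernoulliNB` are all assumptions chosen for illustration. Only the Naïve Bayes side is sketched, using the standard library.

```python
import math
import re
from collections import Counter


def tokens(text):
    """Lowercase word tokens; a stand-in for real tokenization and stemming."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token lists using raw term counts."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(
        sum(v * v for v in cb.values())
    )
    return dot / norm if norm else 0.0


def features(sentences, caption, ref_index, sim_threshold=0.2, window=1):
    """Return one (similar_to_caption, similar_to_reference, near_reference)
    tuple per sentence.

    ref_index is the position of the sentence that refers to the element
    (for example "as shown in Figure 2"). The threshold and window are
    illustrative assumptions, not values from the paper.
    """
    cap = tokens(caption)
    ref = tokens(sentences[ref_index])
    rows = []
    for i, sentence in enumerate(sentences):
        t = tokens(sentence)
        rows.append((
            cosine(t, cap) > sim_threshold,
            cosine(t, ref) > sim_threshold,
            abs(i - ref_index) <= window,
        ))
    return rows


class BernoulliNB:
    """Tiny Naive Bayes over boolean features with add-one smoothing."""

    def fit(self, X, y):
        self.classes = sorted(set(y))
        n_feat = len(X[0])
        self.prior = {}
        self.cond = {}
        for c in self.classes:
            rows = [x for x, label in zip(X, y) if label == c]
            self.prior[c] = len(rows) / len(X)
            # P(feature j is True | class c), smoothed
            self.cond[c] = [
                (sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                for j in range(n_feat)
            ]
        return self

    def predict(self, x):
        def log_posterior(c):
            lp = math.log(self.prior[c])
            for j, value in enumerate(x):
                p = self.cond[c][j]
                lp += math.log(p if value else 1 - p)
            return lp

        return max(self.classes, key=log_posterior)
```

In use, one would label a handful of sentences as belonging in the synopsis (1) or not (0), call `BernoulliNB().fit(rows, labels)` on the output of `features`, and keep the sentences predicted as 1. Swapping in a support vector machine would leave the feature extraction unchanged.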

References

Bhatia, S., Lahiri, S., and Mitra, P. 2009. Generating synopses for document-element search. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM'09). ACM, New York, NY, 2003–2006.
Bhatia, S., Mitra, P., and Giles, C. L. 2010. Finding algorithms in scientific articles. In Proceedings of the International World Wide Web Conference. 1061–1062.
Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.
Carbonell, J. and Goldstein, J. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 335–336.
Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Techn. 2, 27. (Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.)
Chang, C.-C., Lin, C.-J., and Hsu, C.-W. 2009. A practical guide to support vector classification. http://www.csie.ntu.edu.tw/~cjlin/talks/freiburg.pdf.
Chen, Y., Wang, G., and Dong, S. 2003. Learning with progressive transductive support vector machine. Patt. Recog. Lett. 24, 12, 1845–1855.
Cho, S., Koudas, N., and Srivastava, D. 2006. Meta-data indexing for XPath location steps. In Proceedings of the SIGMOD Conference, S. Chaudhuri, V. Hristidis, and N. Polyzotis, Eds., ACM, 455–466.
Corio, M. and Lapalme, G. 1999. Generation of texts for information graphics. In Proceedings of the 7th European Workshop on Natural Language Generation (EWNLG'99). 49–58.
Demner-Fushman, D., Antani, S., Simpson, M., and Thoma, G. 2009. Annotation and retrieval of clinically relevant images. Int. J. Med. Inf. 78, 12, e59–e67.
Elzer, S., Carberry, R., Chester, D., Demir, S., Green, N., Zukerman, I., and Trnka, K. 2005. Exploring and exploiting the limited utility of captions in recognizing intention in information graphics. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL). 223–230.
Futrelle, R. P. 1999. Summarization of diagrams in documents. In Advances in Automated Text Summarization. 403–421.
Futrelle, R. P. 2004. Handling figures in document summarization. In Text Summarization Branches Out. In Proceedings of the Workshop at the Annual Meeting of the Association for Computational Linguistics. 61–65.
Goldstein, J., Kantrowitz, M., Mittal, V., and Carbonell, J. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval. 121–128.
Goldstein, J., Mittal, V., Carbonell, J., and Kantrowitz, M. 2000. Multi-document summarization by sentence extraction. In Proceedings of the Workshop on Automatic Summarization. Association for Computational Linguistics, 40–48.
Guglielmo, E. J. and Rowe, N. C. 1996. Natural-language retrieval of images based on descriptive captions. ACM Trans. Inf. Syst. 14, 237–267.
Hadjieleftheriou, M., Kollios, G., Bakalov, P., and Tsotras, V. J. 2005. Complex spatiotemporal pattern queries. In Proceedings of the International Conference on Very Large Databases. K. Böhm, C. S. Jensen, L. M. Haas, M. L. Kersten, P.-A. Larson, and B. C. Ooi, Eds., ACM, 877–888.
Hearst, M. A., Divoli, A., Guturu, H., Ksikes, A., Nakov, P., Wooldridge, M. A., and Ye, J. 2007. Biotext search engine: Beyond abstract search. Bioinformatics 23, 16, 2196–2197.
Huang, W., Tan, C. L., and Leow, W. K. 2005. Associating text and graphics for scientific chart understanding. In Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR'05). IEEE Computer Society, Washington, DC, 580–584.
Kataria, S., Brouwer, W., Mitra, P., and Giles, E. L. 2008. Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents. In Proceedings of the National Conference on Artificial Intelligence. 1169–1174.
Ko, Y. and Seo, J. 2008. An effective sentence-extraction technique using contextual information and statistical approaches for text summarization. Pattern Recogn. Lett. 29, 9, 1366–1371.
Kupiec, J., Pedersen, J., and Chen, F. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International Conference on Research and Development in Information Retrieval (SIGIR'95). ACM Press, New York, NY, 68–73.
Liu, Y., Bai, K., Mitra, P., and Giles, C. L. 2007. Tableseer: automatic table metadata extraction and searching in digital libraries. In Proceedings of the Joint Conference on Digital Libraries. ACM, 91–100.
Luhn, H. P. 1958. Automatic creation of literature abstracts. IBM J. Res. Dev. 2, 159–165.
Mani, I. and Maybury, M. T., Eds. 1999. Advances in Automatic Text Summarization. MIT Press, Cambridge, MA.
Manning, C. D., Raghavan, P., and Schutze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
Metzler, D. and Kanungo, T. 2008. Machine learned sentence selection strategies for query-biased summarization. In Proceedings of the SIGIR Learning to Rank Workshop.
Osuna, E. E., Freund, R., and Girosi, F. 1997. Training support vector machines: An application to face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
Passonneau, R., Kukich, K., Robin, J., Hatzivassiloglou, V., Lefkowitz, L., and Jing, H. 1996. Generating summaries of work flow diagrams. In Proceedings of the International Conference on Natural Language Processing and Industrial Applications (NLPIA'96). 204–210.
Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130–137.
Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. 1995. Okapi at TREC-3. http://www.compapp.dcu.ie/~gjones/Teaching/CA437/city.pdf.
Sandusky, R. and Tenopir, C. 2008. Finding and using journal-article components: Impacts of disaggregation on teaching and research practice. J. Amer. Soc. Inf. Sci. Techn. 59, 6, 970–982.
Teufel, S. and Moens, M. 1997. Sentence extraction as a classification task. In Proceedings of the Workshop on Intelligent and Scalable Text Summarization.
Tombros, A. and Sanderson, M. 1998. Advantages of query biased summaries in information retrieval. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2–10.
White, R., Jose, J. M., and Ruthven, I. 2003. A task-oriented study on the influencing effects of query-biased summarisation in web searching. Inf. Process. Manage 39, 5, 707–733.
Wu, T.-F., Lin, C.-J., and Weng, R. C. 2003. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Resear. 5, 975–1005.

Published in

ACM Transactions on Information Systems Volume 30, Issue 1
February 2012
193 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2094072

Copyright © 2012 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 6 March 2012
- Accepted: 1 October 2011
- Revised: 1 July 2011
- Received: 1 December 2010
Author Tags
Classification
document-element
summarization
synopses
Qualifiers
- research-article
- Research
- Refereed