Abstract
Increasingly, special-purpose search engines are being built to enable the retrieval of document-elements such as tables, figures, and algorithms [Bhatia et al. 2010; Liu et al. 2007; Hearst et al. 2007]. These search engines present a thumbnail view of each document-element, document metadata such as the paper's title and authors, and the caption of the document-element. While some authors in some disciplines write carefully tailored captions, the author of a document generally assumes that the caption will be read in the context of the surrounding text. When the caption is presented out of context, as in a document-element search result, it may not contain enough information for the end-user to understand the content of the document-element. Consequently, end-users examining document-element search results would want a short “synopsis” of this information presented along with the document-element. A synopsis lets the end-user quickly understand the content of the document-element, because reading it takes less time than downloading, opening, and reading the entire document. Furthermore, it may allow the end-user to examine more results than they otherwise would. In this paper, we present the first set of methods to automatically extract this useful information (the synopsis) related to document-elements. We use Naïve Bayes and support vector machine classifiers to identify relevant sentences in the document text, based on their similarity and proximity to the caption and to the sentences in the document text that refer to the document-element. We compare the two classification methods and study the effects of the different features used.
We also investigate the problem of choosing the optimum synopsis size, one that balances the information content of the generated synopses against their length. Finally, we perform a user study to measure how the synopses generated by our proposed method compare with those produced by other state-of-the-art approaches.
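The similarity and proximity signals described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the raw term-count cosine similarity, and the `1 / (1 + distance)` proximity score are all assumptions chosen for illustration. It only shows how per-sentence feature vectors of the kind a Naïve Bayes or support vector machine classifier could consume might be computed.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity of two texts over raw term counts (an assumed weighting)."""
    va, vb = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def sentence_features(sentences, caption, element_label):
    """Return one (caption_sim, reference_sim, proximity) tuple per sentence.

    `element_label` is a hypothetical identifier such as "Figure 2". A sentence
    containing it is treated as a reference sentence. The plain substring test
    is a simplification: "Figure 2" would also match "Figure 20".
    """
    label = element_label.lower()
    ref_idx = [i for i, s in enumerate(sentences) if label in s.lower()]
    features = []
    for i, sentence in enumerate(sentences):
        caption_sim = cosine_similarity(sentence, caption)
        # Highest similarity to any *other* reference sentence.
        reference_sim = max(
            (cosine_similarity(sentence, sentences[j]) for j in ref_idx if j != i),
            default=0.0,
        )
        # Proximity is 1.0 for a reference sentence and decays with distance
        # (in sentences) from the nearest one; 0.0 if there is no reference.
        distance = min((abs(i - j) for j in ref_idx), default=None)
        proximity = 1.0 / (1 + distance) if distance is not None else 0.0
        features.append((caption_sim, reference_sim, proximity))
    return features
```

Each tuple could then be paired with a relevant or not-relevant label and passed to whichever classifier is being compared. The specific features and weightings used in the paper are not given in this abstract.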
References
- Bhatia, S., Lahiri, S., and Mitra, P. 2009. Generating synopses for document-element search. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM'09). ACM, New York, NY, 2003--2006.
- Bhatia, S., Mitra, P., and Giles, C. L. 2010. Finding algorithms in scientific articles. In Proceedings of the International World Wide Web Conference. 1061--1062.
- Bishop, C. M. 2006. Pattern Recognition and Machine Learning. Springer.
- Carbonell, J. and Goldstein, J. 1998. The use of MMR, diversity-based reranking for reordering documents and producing summaries. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. 335--336.
- Chang, C.-C. and Lin, C.-J. 2001. LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst. Techn. 2, 27. (Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.)
- Chang, C.-C., Lin, C.-J., and Hsu, C.-W. 2009. A practical guide to support vector classification. http://www.csie.ntu.edu.tw/~cjlin/talks/freiburg.pdf.
- Chen, Y., Wang, G., and Dong, S. 2003. Learning with progressive transductive support vector machine. Patt. Recog. Lett. 24, 12, 1845--1855.
- Cho, S., Koudas, N., and Srivastava, D. 2006. Meta-data indexing for XPath location steps. In Proceedings of the SIGMOD Conference, S. Chaudhuri, V. Hristidis, and N. Polyzotis, Eds., ACM, 455--466.
- Corio, M. and Lapalme, G. 1999. Generation of texts for information graphics. In Proceedings of the 7th European Workshop on Natural Language Generation (EWNLG'99). 49--58.
- Demner-Fushman, D., Antani, S., Simpson, M., and Thoma, G. 2009. Annotation and retrieval of clinically relevant images. Int. J. Med. Inf. 78, 12, e59--e67.
- Elzer, S., Carberry, R., Chester, D., Demir, S., Green, N., Zukerman, I., and Trnka, K. 2005. Exploring and exploiting the limited utility of captions in recognizing intention in information graphics. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL). 223--230.
- Futrelle, R. P. 1999. Summarization of diagrams in documents. In Advances in Automated Text Summarization. 403--421.
- Futrelle, R. P. 2004. Handling figures in document summarization. In Text Summarization Branches Out. In Proceedings of the Workshop at the Annual Meeting of the Association for Computational Linguistics. 61--65.
- Goldstein, J., Kantrowitz, M., Mittal, V., and Carbonell, J. 1999. Summarizing text documents: Sentence selection and evaluation metrics. In Research and Development in Information Retrieval. 121--128.
- Goldstein, J., Mittal, V., Carbonell, J., and Kantrowitz, M. 2000. Multi-document summarization by sentence extraction. In Proceedings of the Workshop on Automatic summarization. Association for Computational Linguistics, 40--48.
- Guglielmo, E. J. and Rowe, N. C. 1996. Natural-language retrieval of images based on descriptive captions. ACM Trans. Inf. Syst. 14, 237--267.
- Hadjieleftheriou, M., Kollios, G., Bakalov, P., and Tsotras, V. J. 2005. Complex spatiotemporal pattern queries. In Proceedings of the International Conference on Very Large Databases. K. Böhm, C. S. Jensen, L. M. Haas, M. L. Kersten, P.-A. Larson, and B. C. Ooi, Eds., ACM, 877--888.
- Hearst, M. A., Divoli, A., Guturu, H., Ksikes, A., Nakov, P., Wooldridge, M. A., and Ye, J. 2007. Biotext search engine: Beyond abstract search. Bioinformatics 23, 16, 2196--2197.
- Huang, W., Tan, C. L., and Leow, W. K. 2005. Associating text and graphics for scientific chart understanding. In Proceedings of the 8th International Conference on Document Analysis and Recognition (ICDAR'05). IEEE Computer Society, Washington, DC, 580--584.
- Kataria, S., Brouwer, W., Mitra, P., and Giles, C. L. 2008. Automatic extraction of data points and text blocks from 2-dimensional plots in digital documents. In Proceedings of the National Conference on Artificial Intelligence. 1169--1174.
- Ko, Y. and Seo, J. 2008. An effective sentence-extraction technique using contextual information and statistical approaches for text summarization. Pattern Recogn. Lett. 29, 9, 1366--1371.
- Kupiec, J., Pedersen, J., and Chen, F. 1995. A trainable document summarizer. In Proceedings of the 18th Annual International Conference on Research and Development in Information Retrieval (SIGIR'95). ACM Press, New York, NY, 68--73.
- Liu, Y., Bai, K., Mitra, P., and Giles, C. L. 2007. Tableseer: automatic table metadata extraction and searching in digital libraries. In Proceedings of the Joint Conference on Digital Libraries. ACM, 91--100.
- Luhn, H. P. 1958. Automatic creation of literature abstracts. IBM J. Res. Dev. 2, 159--165.
- Mani, I. and Maybury, M. T., Eds. 1999. Advances in Automatic Text Summarization. MIT Press, Cambridge, MA.
- Manning, C. D., Raghavan, P., and Schütze, H. 2008. Introduction to Information Retrieval. Cambridge University Press.
- Metzler, D. and Kanungo, T. 2008. Machine learned sentence selection strategies for query-biased summarization. In Proceedings of the SIGIR Learning to Rank Workshop.
- Osuna, E. E., Freund, R., and Girosi, F. 1997. Training support vector machines: An application to face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
- Passonneau, R., Kukich, K., Robin, J., Hatzivassiloglou, V., Lefkowitz, L., and Jing, H. 1996. Generating summaries of work flow diagrams. In Proceedings of the International Conference on Natural Language Processing and Industrial Applications (NLPIA'96). 204--210.
- Porter, M. F. 1980. An algorithm for suffix stripping. Program 14, 3, 130--137.
- Robertson, S. E., Walker, S., Jones, S., Hancock-Beaulieu, M. M., and Gatford, M. 1995. Okapi at TREC-3. http://www.compapp.dcu.ie/~gjones/Teaching/CA437/city.pdf.
- Sandusky, R. and Tenopir, C. 2008. Finding and using journal-article components: Impacts of disaggregation on teaching and research practice. J. Amer. Soc. Inf. Sci. Techn. 59, 6, 970--982.
- Teufel, S. and Moens, M. 1997. Sentence extraction as a classification task. In Proceedings of the Workshop on Intelligent and Scalable Text Summarization.
- Tombros, A. and Sanderson, M. 1998. Advantages of query biased summaries in information retrieval. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2--10.
- White, R., Jose, J. M., and Ruthven, I. 2003. A task-oriented study on the influencing effects of query-biased summarisation in web searching. Inf. Process. Manage 39, 5, 707--733.
- Wu, T.-F., Lin, C.-J., and Weng, R. C. 2003. Probability estimates for multi-class classification by pairwise coupling. J. Mach. Learn. Resear. 5, 975--1005.
Index Terms
- Summarizing figures, tables, and algorithms in scientific publications to augment search results