DOI: 10.1145/1363686.1363793
research-article

A wrapper generation system for PDF documents

Published: 16 March 2008

ABSTRACT

The widespread use of the PDF format for exchanging print-oriented documents raises new challenges for research on information extraction. In this paper we present a novel wrapper generation system for extracting information from PDF documents. Objects in a PDF document are accessible by their position; we therefore exploit spatial constraints to drive the extraction of relevant information according to a set of group type definitions. Moreover, conditions based on fuzzy logic make it possible to handle uncertainty in interpreting the layout structure of PDF documents. The experimental results reported in the paper show that our PDF wrapping system achieves good accuracy.
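To make the idea of fuzzy spatial constraints concrete, the following is a minimal sketch and not the system described in the paper. It assumes that each page object is reduced to an axis-aligned bounding box with y growing downward, that membership functions are piecewise linear, and that conditions are combined with the minimum operator (the standard fuzzy intersection). The names `Box`, `ramp`, `horizontal_overlap`, `below`, and `fuzzy_and`, as well as the `tolerance` parameter, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of a page object (assumed units: points, y grows downward)."""
    x0: float
    y0: float
    x1: float
    y1: float


def ramp(value: float, low: float, high: float) -> float:
    """Piecewise-linear membership: 0 at or below `low`, 1 at or above `high`."""
    if value <= low:
        return 0.0
    if value >= high:
        return 1.0
    return (value - low) / (high - low)


def horizontal_overlap(a: Box, b: Box) -> float:
    """Fraction of the narrower box's width that the two boxes share."""
    shared = min(a.x1, b.x1) - max(a.x0, b.x0)
    narrower = min(a.x1 - a.x0, b.x1 - b.x0)
    if narrower <= 0:
        return 0.0
    return max(0.0, shared) / narrower


def below(a: Box, b: Box, tolerance: float = 10.0) -> float:
    """Degree in [0, 1] to which `a` lies below `b`.

    A small vertical overlap (up to `tolerance`) is accepted with a reduced
    degree instead of being rejected outright; this is an illustrative choice.
    """
    gap = a.y0 - b.y1  # positive when `a` starts after `b` ends
    return ramp(gap, -tolerance, 0.0)


def fuzzy_and(*degrees: float) -> float:
    """Combine membership degrees with the minimum operator."""
    return min(degrees)


label = Box(50, 100, 150, 115)
aligned = Box(55, 120, 140, 135)     # clearly below the label, same column
overlapping = Box(55, 110, 140, 125)  # slightly overlaps the label vertically
elsewhere = Box(300, 120, 400, 135)   # below, but in a different column

for candidate in (aligned, overlapping, elsewhere):
    degree = fuzzy_and(below(candidate, label),
                       horizontal_overlap(candidate, label))
    print(candidate, degree)
```

Under these assumptions, a candidate is not simply accepted or rejected: a box that slightly overlaps the label vertically still matches, but with a lower degree, which is the kind of tolerance to layout uncertainty that fuzzy conditions are meant to provide.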


                  Published in

                  SAC '08: Proceedings of the 2008 ACM symposium on Applied computing
                  March 2008
                  2586 pages
                  ISBN: 9781595937537
                  DOI: 10.1145/1363686

                  Copyright © 2008 ACM

                  Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                  Publisher

                  Association for Computing Machinery

                  New York, NY, United States

                  Acceptance Rates

                  Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%