ABSTRACT
The widespread use of the PDF format for exchanging print-oriented documents raises new challenges for information extraction. In this paper we present a novel wrapper generation system for extracting information from PDF documents. Because objects in a PDF document are accessible by their position, we exploit spatial constraints to drive the extraction of relevant information according to a set of group type definitions. Moreover, conditions based on fuzzy logic make it possible to handle uncertainty in the interpretation of the layout structure of PDF documents. The experimental results reported in the paper indicate that our PDF wrapping system achieves good accuracy.
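To make the idea of fuzzy spatial conditions concrete, the following is a minimal sketch and not the system described in the paper. It assumes each PDF object is reduced to a bounding box, and it scores two hypothetical spatial conditions (`left_of` and `same_row`) with a degree in [0, 1] instead of a yes/no answer. All names, thresholds, and the linear membership shape are illustrative assumptions; the conjunction uses the standard minimum operator from fuzzy set theory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical bounding box of a PDF object (x0, y0) to (x1, y1), in points."""
    x0: float
    y0: float
    x1: float
    y1: float


def ramp_down(value: float, full: float, zero: float) -> float:
    """Membership 1.0 for value <= full, 0.0 for value >= zero, linear in between."""
    if value <= full:
        return 1.0
    if value >= zero:
        return 0.0
    return (zero - value) / (zero - full)


def left_of(a: Box, b: Box, max_gap: float = 50.0) -> float:
    """Degree to which `a` lies immediately to the left of `b` (assumed thresholds)."""
    gap = b.x0 - a.x1
    if gap < 0:  # boxes overlap horizontally, so `a` is not left of `b`
        return 0.0
    return ramp_down(gap, full=max_gap / 5, zero=max_gap)


def same_row(a: Box, b: Box, tolerance: float = 10.0) -> float:
    """Degree to which the two boxes share the same baseline (assumed tolerance)."""
    offset = abs(a.y0 - b.y0)
    return ramp_down(offset, full=0.0, zero=tolerance)


def label_value_pair(label: Box, value: Box) -> float:
    """Fuzzy AND (minimum) of the two spatial conditions."""
    return min(left_of(label, value), same_row(label, value))


if __name__ == "__main__":
    label = Box(10, 100, 60, 112)
    close_value = Box(65, 100, 120, 112)   # small gap, same baseline
    loose_value = Box(90, 105, 150, 117)   # larger gap, slightly shifted
    print(label_value_pair(label, close_value))
    print(label_value_pair(label, loose_value))
```

A wrapper built along these lines could accept a candidate grouping only when its degree exceeds a chosen threshold, which lets it tolerate small layout variations that a crisp rule would reject.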
Index Terms
- A wrapper generation system for PDF documents