research-article

A wrapper generation system for PDF documents

Authors:
Bettina Fazzinga (DEIS-UNICAL, Rende, Italy)
Sergio Flesca (DEIS-UNICAL, Rende, Italy)
Andrea Tagarelli (DEIS-UNICAL, Rende, Italy)
Salvatore Garruzzo (DIMET-UNIRC, Reggio Calabria, Italy)
Elio Masciari (ICAR-CNR, Rende, Italy)

SAC '08: Proceedings of the 2008 ACM symposium on Applied computing, March 2008, Pages 442–446, https://doi.org/10.1145/1363686.1363793

Published: 16 March 2008

ABSTRACT

The widespread use of the PDF format for exchanging print-oriented documents raises new challenges for information extraction research. In this paper we present a novel wrapper generation system for extracting information from PDF documents. Because objects in a PDF document are accessible by their position, we exploit spatial constraints to drive the extraction of relevant information according to a set of group type definitions. Moreover, conditions based on fuzzy logic make it possible to handle uncertainty in the interpretation of the layout structure of PDF documents effectively. The experimental results reported in the paper indicate that our PDF wrapping system achieves good accuracy.
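
To make the idea of fuzzy spatial constraints more concrete, the following is a minimal sketch and not the system described in the paper. It assumes that each text object is reduced to a bounding box, and it invents the names Box, same_row, right_of, label_value_degree and best_value, as well as all tolerances and thresholds, purely for illustration. Each spatial condition returns a degree in [0, 1] rather than a yes/no answer, and conditions are combined with the minimum, the standard fuzzy-set intersection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Bounding box of a text object in page units (hypothetical model)."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def ramp_down(value: float, full: float, zero: float) -> float:
    """Degree 1 when value <= full, 0 when value >= zero, linear in between."""
    if value <= full:
        return 1.0
    if value >= zero:
        return 0.0
    return (zero - value) / (zero - full)


def same_row(a: Box, b: Box, tolerance: float = 10.0) -> float:
    """Degree to which two boxes sit on the same row (assumed tolerance)."""
    return ramp_down(abs(a.y0 - b.y0), 0.0, tolerance)


def right_of(a: Box, b: Box, max_gap: float = 100.0) -> float:
    """Degree to which b starts to the right of a, within an assumed gap."""
    gap = b.x0 - a.x1
    if gap < 0:
        return 0.0
    return ramp_down(gap, max_gap / 2, max_gap)


def label_value_degree(label: Box, value: Box) -> float:
    """Fuzzy AND (minimum) of the two spatial conditions."""
    return min(same_row(label, value), right_of(label, value))


def best_value(label: Box, candidates: list[Box], threshold: float = 0.5):
    """Return the best-matching candidate and its degree, or (None, degree)."""
    scored = [(label_value_degree(label, c), c) for c in candidates if c is not label]
    degree, box = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    return (box, degree) if degree >= threshold else (None, degree)


if __name__ == "__main__":
    label = Box("Total:", 50, 700, 90, 712)
    amount = Box("42.00", 100, 702, 140, 714)   # slightly misaligned value
    footer = Box("Page 1", 100, 100, 140, 112)  # far away on the page
    match, degree = best_value(label, [amount, footer])
    print(match.text if match else None, degree)
```

In this sketch a slightly misaligned value still matches its label, only with a degree below 1, which is one way a wrapper could tolerate layout variation instead of rejecting the document outright.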

Published in
SAC '08: Proceedings of the 2008 ACM symposium on Applied computing
March 2008
2586 pages
ISBN: 9781595937537
DOI: 10.1145/1363686
Conference Chairs: Roger L. Wainwright (University of Tulsa), Hisham M. Haddad (Kennesaw State University)
Copyright © 2008 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%