
2014 | OriginalPaper | Book chapter

23. Analysis of Documents Born Digital

Written by: Jianying Hu, Ying Liu

Published in: Handbook of Document Image Processing and Recognition

Publisher: Springer London


Abstract

While traditional document analysis has focused on printed media, an increasingly large portion of today's documents are generated in digital form from the start. Such “documents born digital” range from plain text documents such as emails to more sophisticated forms such as PDF documents and Web documents. On the one hand, the existence of a digital encoding eliminates the need for scanning, image processing, and character recognition in most situations (a notable exception being the prevalent use of text embedded in images in Web documents, as elaborated upon in section “Analysis of Text in Web Images”). On the other hand, many higher-level processing tasks remain, because almost all existing digital document encoding systems (e.g., HTML, PDF) are designed for display or printing for human consumption, not for machine-level information exchange and extraction. As such, a significant amount of processing is still required for automatic information extraction, indexing, and content repurposing from such documents, and many challenges exist in this process. This chapter describes in detail the key technologies for processing documents born digital, with a focus on PDF and Web document processing.

Metadata
Title
Analysis of Documents Born Digital
Written by
Jianying Hu
Ying Liu
Copyright year
2014
Publisher
Springer London
DOI
https://doi.org/10.1007/978-0-85729-859-1_26
