
Published in: International Journal on Digital Libraries

Published: 01.08.2014

Unsupervised document structure analysis of digital scientific articles

Authors: Stefan Klampfl, Michael Granitzer, Kris Jack, Roman Kern

Published in: International Journal on Digital Libraries | Issue 3-4/2014


Abstract

Text mining and information retrieval in large collections of scientific literature require automated processing systems that analyse the documents’ content. However, the layout of scientific articles varies widely across publishers, and common digital document formats are optimised for presentation but lack structural information. To overcome these challenges, we have developed a processing pipeline that analyses the structure of a PDF document using a number of unsupervised machine learning techniques and heuristics. Apart from the meta-data extraction, which we reused from previous work, our system uses only information available from the current document and does not require any pre-trained model. First, contiguous text blocks are extracted from the raw character stream. Next, we determine geometrical relations between these blocks, which, together with geometrical and font information, are then used to categorize the blocks into different classes. Based on the resulting logical structure, we finally extract the body text and the table of contents of a scientific article. We separately evaluate the individual stages of our pipeline on a number of different datasets and compare it with other document structure analysis approaches. We show that it outperforms a state-of-the-art system in terms of the quality of the extracted body text and table of contents. Our unsupervised approach could provide a basis for advanced digital library scenarios that involve diverse and dynamic corpora.


http://www.mendeley.com

http://www.citeulike.org

http://citeseerx.ist.psu.edu

http://knowminer.at:8080/code-demo/index.html

https://www.knowminer.at/svn/opensource/projects/code/trunk

http://pdfbox.apache.org/

http://itextpdf.com/

http://opensource.intarsys.de/home/en/index.php?n=OpenSource.JPod

http://poppler.freedesktop.org

Consider a page with four text blocks arranged in two columns (two blocks in each column), with a fifth block in the middle of the page spanning both columns. Then the top right block is before the middle block in the reading order, and the middle block is before the bottom left block, but the bottom left block is before the top right block. The pairwise relation therefore contains a cycle and is not transitive.
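The cycle described in this footnote can be reproduced with a small sketch. The pairwise rule used here is an assumption made for illustration, not necessarily the rule used in the article: a block precedes another if it lies above it and the two overlap horizontally, or if it lies entirely to its left. The `Block` class, the coordinates, and all function names are hypothetical.

```python
from dataclasses import dataclass
from graphlib import CycleError, TopologicalSorter
from itertools import permutations


@dataclass(frozen=True)
class Block:
    """Axis-aligned text block; y grows downwards (hypothetical layout units)."""
    name: str
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps_horizontally(a: Block, b: Block) -> bool:
    return a.x0 < b.x1 and b.x0 < a.x1


def before(a: Block, b: Block) -> bool:
    """Assumed pairwise rule: above within the same column, else left-of."""
    if overlaps_horizontally(a, b):
        return a.y1 <= b.y0
    return a.x1 <= b.x0


# Two columns with two blocks each, plus a middle block spanning both columns.
BLOCKS = [
    Block("top_left", 0, 0, 45, 30),
    Block("top_right", 55, 0, 100, 30),
    Block("middle", 0, 40, 100, 60),
    Block("bottom_left", 0, 70, 45, 100),
    Block("bottom_right", 55, 70, 100, 100),
]


def predecessor_graph(blocks):
    """Map each block name to the names of all blocks that precede it."""
    graph = {b.name: set() for b in blocks}
    for a, b in permutations(blocks, 2):
        if before(a, b):
            graph[b.name].add(a.name)
    return graph


def has_consistent_order(blocks) -> bool:
    """True if the pairwise relation can be linearised by a topological sort."""
    try:
        tuple(TopologicalSorter(predecessor_graph(blocks)).static_order())
        return True
    except CycleError:
        return False
```

Under this assumed rule, a plain topological sort of the five blocks fails because of the cycle top right, middle, bottom left, top right, whereas the four column blocks alone can be ordered without conflict.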

http://www.ncbi.nlm.nih.gov/pubmed/

http://wing.comp.nus.edu.sg/parsCit/

http://poppler.freedesktop.org/

https://github.com/timtadh/zhang-shasha


Title: Unsupervised document structure analysis of digital scientific articles
Authors: Stefan Klampfl, Michael Granitzer, Kris Jack, Roman Kern
Publication date: 01.08.2014
Publisher: Springer Berlin Heidelberg
Published in: International Journal on Digital Libraries / Issue 3-4/2014
Print ISSN: 1432-5012
Electronic ISSN: 1432-1300
DOI: https://doi.org/10.1007/s00799-014-0115-1
