
Open Access 2022 | OriginalPaper | Book chapter

Conversion of Multi-lingual STEM Documents in E-Born PDF into Various Accessible E-Formats

Written by: Masakazu Suzuki, Katsuhito Yamaguchi

Published in: Computers Helping People with Special Needs

Publisher: Springer International Publishing


Abstract

We present a new method of mathematical OCR that remarkably improves recognition accuracy for e-born PDF by making use of SVG information generated from the PDF. Even if the text in the PDF is written in a local language, the PDF can be converted into various accessible e-formats without a special OCR engine for that language. The software GUI has also been improved so that end users can easily customize it for their language; French and Vietnamese versions have been released using this new feature. Some evaluations carried out in Vietnam are also reported.

1 Introduction

One of the most serious problems with digitized STEM (science, technology, engineering and mathematics) content, which is provided in PDF in most cases, is its poor accessibility. Print-disabled people usually use OCR (optical character recognition) software to read such PDF files. However, ordinary OCR software cannot properly recognize technical parts such as mathematical formulas, which makes such PDF files hard for them to read. To solve this problem, we have been developing OCR software for STEM content, “InftyReader” [1].
It is said that 90% of visually disabled people live in developing countries. Accessible e-books have recently become gradually available even in those countries [2]. At present, however, ordinary conversion tools usually cannot handle STEM content in their local languages, and converting inaccessible STEM content in those languages into an accessible form still requires a great deal of manual work.
There are essentially two types of PDF. The first, “e-born PDF,” is PDF produced from a digital file such as a document in Microsoft Word, LaTeX, Adobe InDesign, etc. We refer to the other type as “image PDF,” which is usually made by scanning or copying.
We have been releasing the English and Japanese versions of InftyReader for almost twenty years. Through user support for the software, we have realized that in recent years most individual end users use InftyReader to read STEM content in e-born PDF, and that the importance of e-born PDF accessibility is clearly increasing.
From the viewpoint of computerized conversion of e-born PDF into an accessible form, its most significant advantage is that accurate information on each character/symbol, such as the character code, the font name and the coordinates on the page, is embedded in it. At ICCHP2016, we reported a new method of mathematical OCR that improves recognition accuracy for e-born PDF by combining the analysis technologies of our mathematical OCR with character/symbol information extracted from the PDF by a PDF parser [3]. At that time, the ambiguity of the character coordinates obtained by the PDF parser was our main problem; in this paper, it is solved by making use of a new method developed by Fujiyoshi et al. [4].
If the character set of a local language is included in Unicode, character information in e-born PDF is usually represented in Unicode. As is well known, image-based OCR software needs a special OCR engine customized for a local language in order to recognize it correctly. Our new method for e-born PDF, however, no longer uses image-based OCR to recognize local-language text. Thus, it is possible to develop a system that converts e-born STEM content in other local languages into accessible formats without a special OCR engine. Here, we report our recent work in this direction. We have improved InftyReader so that it can handle e-born PDF in any Unicode-based language. Such e-born STEM content can be converted automatically into various accessible e-formats: LaTeX, human-readable (HR) TeX, Microsoft Word, xHTML with MathML, accessible EPUB3, multimedia DAISY [5] and ChattyBook (audio-embedded HTML5 with JavaScript) [6]. We have also developed a scheme that allows end users to customize the software GUI (graphical user interface) so that it is displayed in their local language.

2 Background

As mentioned above, we reported our first approach to improving recognition quality for e-born PDF at ICCHP2016 [3]. Sorge et al. have also studied a method to recognize e-born STEM PDF by making use of embedded character information and image-based OCR [7, 8]. However, as far as mathematical parts are concerned, the font rect-area (rectangular area) extracted from e-born PDF by a PDF parser often differs significantly from the graphical rect-area of the original character image. Consequently, in the previous works it was impossible to realize mathematical recognition based only on character information extracted from the PDF. To solve this problem, both we and Sorge et al. estimated the correct graphical rect-area of characters/symbols in mathematical parts by combining the extracted data with image-based OCR.
At the DEIMS2021 conference, Fujiyoshi et al. reported a completely new approach to extracting character information from PDF [4]. They developed an application named “PDFContentExtracter” that outputs the vector information for drawing each character/symbol as scalable vector graphics (SVG) by trapping a function for printing PDF. This application allows us to obtain the correct graphical rect-area even in mathematical parts. By making use of their application, we have improved InftyReader so that its structure analysis of mathematical formulas is less dependent on the image-based OCR results for characters/symbols.
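The idea of a graphical rect-area obtained from vector data can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a glyph outline is available as an SVG path string using only the absolute commands M, L, C, Q and Z; the function name `path_bbox` and the sample paths are hypothetical and are not taken from PDFContentExtracter or InftyReader. Because Bezier control points are included in the computation, the box is a conservative bound for curved outlines.

```python
import re

# One (possibly signed, possibly fractional) number in path data.
_NUM = re.compile(r"-?(?:\d+\.?\d*|\.\d+)")


def path_bbox(d):
    """Return (min_x, min_y, max_x, max_y) of an absolute-coordinate path.

    Assumption: only M, L, C, Q and Z are used, so every number belongs
    to an (x, y) pair. Control points of curves are treated like ordinary
    points, which may make the box slightly larger than the drawn shape.
    """
    unsupported = set(re.findall(r"[A-Za-z]", d)) - set("MLCQZ")
    if unsupported:
        raise ValueError(f"unsupported path commands: {sorted(unsupported)}")
    nums = [float(n) for n in _NUM.findall(d)]
    if not nums or len(nums) % 2:
        raise ValueError("path data must contain complete (x, y) pairs")
    xs, ys = nums[0::2], nums[1::2]
    return (min(xs), min(ys), max(xs), max(ys))


# A thin horizontal rectangle, such as a fraction bar drawn as graphics.
fraction_bar = "M 10 20 L 30 20 L 30 22 L 10 22 Z"
print(path_bbox(fraction_bar))
```

A box computed this way depends only on what is actually drawn, which is the property that distinguishes a graphical rect-area from a font rect-area reported by a parser.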

3 Method

Our new recognition method for e-born PDF is carried out in the following workflow.
(1)
 Converting e-born PDF into SVG with PDFContentExtracter. PDFContentExtracter, a utility built on the PDF parser “PDFBOX” (written in Java), converts e-born PDF into SVG. The SVG consists of three types of elements: characters, images and paths, where a path is a set of point sequences for drawing lines, arcs and Bezier curves, together with color information for filling the inside of closed paths. For a character element, not only its vector image (the path command in SVG) but also its font name and character code are output.
 
(2)
 Analyzing the SVG. So that the SVG includes only black text for recognition, three tasks are performed: removing background images, extracting image areas and changing all character colors to black. In addition, large mathematical symbols such as fraction bars, big parentheses, radical signs and integral signs are often represented as graphics; they are detected and changed to mathematical symbols with character information.
 
(3)
 Checking the character code of each mathematical symbol to handle user-defined character fonts. Although most mathematical symbols can be represented in Unicode, user-defined character fonts are often used in STEM content. The user-character-code area of Unicode is usually assigned to them, but sometimes ASCII character codes are assigned irregularly.
To avoid the misrecognition caused by such user-defined fonts, InftyReader checks the character codes of each font used in a target PDF by making use of image-based OCR. To make this process efficient, the check is done for each font whose codes are ASCII or lie in the user-character-code area of the Unicode table, when the font first appears in a document; the result is used to construct a character-code map specific to the fonts used in the target PDF.
 
(4)
 Judging the font category. In STEM, various font styles are used to represent different notions: math Italic, Roman, script, Fraktur, bold, blackboard bold, calligraphic, etc. Here, we call them font categories. Even if the categories differ, in the Unicode representation the same ASCII code is assigned to a character in all those different-category fonts. Of course, their font names differ from each other. However, it is difficult to determine the category from the font name alone, since there is no standard table of font names in common use worldwide.
In InftyReader, some typical characters that have clearly different shapes in the different categories are first chosen. For instance, to distinguish the two categories Italic and calligraphic, D, E, F, etc. can be chosen, while C and S should be excluded: the selected characters have clearly different shapes in Italic and calligraphic, whereas the excluded characters such as C and S have similar shapes (see Fig. 1). Using image-based OCR, InftyReader recognizes the characters selected for the distinction and judges the category of each font name. All categories appearing in a document are judged in this manner.
 
(5)
 Structure analysis of mathematical expressions. The method for analyzing mathematical structures is essentially the same as our conventional one [9]. However, the e-born PDF case has an important advantage: no misrecognition occurs in distinguishing between upper- and lowercase letters that have similar shapes, such as “C” and “c” or “S” and “s.” (Usually, the size of a lowercase “c” or “s” on the baseline is almost the same as that of an uppercase “C” or “S” in a subscript position, which sometimes gives rise to misrecognition.) As a result, the accuracy of the mathematical structure analysis is improved in comparison with image-based OCR.
 
(6)
 Segmentation of text and math elements. In languages using Roman/Latin characters, such as European languages, Vietnamese, Malay and Indonesian, some ordinary words are often misrecognized as math objects or vice versa. Function names such as \(\log \) and other technical elements should be distinguished correctly from short words appearing in the local-language text. Of course, an element having a math structure is automatically judged to be a math part; however, even for other elements (character sequences) having no math structure, this judgement plays an important role in the use of recognition results, for instance in reading aloud and Braille translation.
This classification is strongly language-dependent, and it requires an appropriate dictionary for each local language. For the new version, we collected STEM contents in WIKI to construct dictionaries for several local languages. Although the collected source texts are not very large, the resulting short-word dictionaries seem to work surprisingly well, and the latest version incorporates such dictionaries for 20 languages: English, Czech, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Indonesian, Malay, Norwegian, Polish, Portuguese, Russian, Slovak, Spanish, Swedish, Turkish and Vietnamese.
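The dictionary-based judgement in step (6) can be sketched roughly as follows. This is only an illustration under stated assumptions: the word lists are tiny and hypothetical, the rule order (function names first, then dictionary words, then single letters) is our own simplification, and the names `classify_token` and `segment` do not come from InftyReader.

```python
# Hypothetical, tiny word lists for illustration only; a real dictionary
# would be built from a corpus for each language.
FUNCTION_NAMES = {"log", "sin", "cos", "tan", "exp", "lim"}
SHORT_WORDS = {
    "en": {"a", "an", "in", "is", "of", "to", "we"},
    "fr": {"a", "de", "et", "la", "le", "on", "un"},
}


def classify_token(token, lang):
    """Label a token that has no math structure as "math" or "text"."""
    if token in FUNCTION_NAMES:
        return "math"  # known function name
    if token.lower() in SHORT_WORDS.get(lang, set()):
        return "text"  # short word of the local language
    if len(token) == 1 and token.isalpha():
        return "math"  # isolated letter, most likely a variable
    return "text"      # default: ordinary word


def segment(tokens, lang):
    """Return (token, label) pairs for a token sequence."""
    return [(t, classify_token(t, lang)) for t in tokens]


print(segment(["we", "take", "log", "x"], "en"))
```

The example shows why the dictionary matters: the single letter “a” is labelled as text when the French list is used, but as math for a language without a dictionary entry for it.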
 
After completing these steps, the recognition result can be converted into various accessible e-formats: LaTeX, human-readable (HR) TeX, Microsoft Word, xHTML with MathML, accessible EPUB3, multimedia DAISY and ChattyBook.
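As an illustration of step (4) of the workflow above, the following Python sketch judges a font category by majority vote over the discriminating characters. It is a sketch under assumptions, not InftyReader's implementation: the OCR engine is replaced by a caller-supplied function, the glyphs are stand-in strings, and the names `judge_font_category` and `fake_ocr` are hypothetical. The only detail taken from the text is that C and S are excluded when distinguishing Italic from calligraphic.

```python
from collections import Counter

# Characters whose Italic and calligraphic shapes are similar (per the text),
# so they are not used for the decision.
AMBIGUOUS = {"C", "S"}


def judge_font_category(samples, ocr_category):
    """Return the most frequently recognized category for one font name.

    samples: iterable of (code_char, glyph) pairs taken from that font.
    ocr_category: callable mapping a glyph to a category name; it stands
    in for image-based OCR and is an assumption of this sketch.
    """
    votes = Counter(
        ocr_category(glyph)
        for code_char, glyph in samples
        if code_char not in AMBIGUOUS
    )
    if not votes:
        return None  # no discriminating character was available
    return votes.most_common(1)[0][0]


def fake_ocr(glyph):
    """Stand-in recognizer: the 'glyph' string already names its category."""
    return glyph.split(":")[0]


samples = [
    ("D", "calligraphic:D"),
    ("E", "calligraphic:E"),
    ("C", "italic:C"),  # ignored because C is ambiguous
    ("F", "italic:F"),  # a single misrecognition is outvoted
]
print(judge_font_category(samples, fake_ocr))
```

Voting over several characters, rather than trusting one, keeps a single misrecognized glyph from deciding the category of a whole font.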

4 Multilingual Support in GUI

In our accessible STEM-document editor, “ChattyInfty3,” we introduced a localization scheme that allows end users to easily customize its graphical user interface (GUI) and its reading aloud of mathematical formulas so that they are presented in a local language [10]. This time, we have developed a similar localization scheme for the InftyReader GUI.
We provide a definition table for the GUI as an XML file separate from the main program. All menu items, button names, etc. are loaded from this definition table, in which any Unicode characters can be used. By simply placing a local-language version of the definition table in a specified folder, end users can select that language for the InftyReader GUI. Thus, they can translate the GUI into their local language without modifying the main program. French and Vietnamese teams have already prepared definition tables for their languages, and at present English, French, Japanese and Vietnamese can be chosen as the GUI language. Here, we show the GUI in English and Vietnamese (see Fig. 2). As discussed later, the Vietnamese GUI has actually been used in workshops for sighted Vietnamese instructors and coordinators.
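How a GUI could load its labels from such a definition table can be sketched in a few lines of Python. The XML layout below (a `gui` root with a `lang` attribute and `item` elements with an `id`) is entirely hypothetical, since the actual InftyReader file format is not described here; the fallback to default labels for missing items is likewise our own assumption.

```python
import xml.etree.ElementTree as ET

# Built-in labels used when a definition table lacks an item (assumed).
DEFAULT_LABELS = {
    "menu.file": "File",
    "menu.open": "Open",
    "button.start": "Start",
}

# Hypothetical definition table for French; "button.start" is left out
# on purpose to show the fallback.
FRENCH_TABLE = """<gui lang="fr">
  <item id="menu.file">Fichier</item>
  <item id="menu.open">Ouvrir</item>
</gui>"""


def load_labels(xml_text, defaults):
    """Return (language code, labels) read from a definition table."""
    root = ET.fromstring(xml_text)
    labels = dict(defaults)  # start from the defaults, then override
    for item in root.iter("item"):
        key = item.get("id")
        if key and item.text:
            labels[key] = item.text.strip()
    return root.get("lang"), labels


lang, labels = load_labels(FRENCH_TABLE, DEFAULT_LABELS)
print(lang, labels["menu.file"], labels["button.start"])
```

Keeping the labels in a separate file like this is what lets a translation be added without touching the program itself.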

5 Evaluation

As a first test, using the new mathematical-OCR method, we converted an e-born PDF math drill book in Vietnamese into several accessible formats such as LaTeX, xHTML with MathML and multimedia DAISY. A blind end user in Hanoi evaluated their accessibility. She reported that the new mathematical-OCR method worked well. In particular, she rated the multimedia DAISY version best, since she could properly read all the contents including mathematical expressions. With the other versions, she could not read some technical parts smoothly with the popular screen reader “JAWS” [11]. As we expected, she could not read most of the technical parts in the original PDF itself with JAWS, either.
Next, we conducted a small online workshop for evaluation. A retired math teacher, two young working adults and four students participated. All of them are blind Vietnamese; three of them are JAWS users, and the others use “NVDA,” another popular screen reader [12]. They tried to access the math drill book in various formats. Since the majority of blind people in Vietnam do not have a DAISY player, we asked them to evaluate mainly the accessibility of the ChattyInfty3 content. We confirmed that it was sufficiently accessible in Vietnamese, whereas the PDF and Microsoft Word versions were not accessible with JAWS or NVDA.
We also conducted another series of online workshops for sighted instructors and coordinators in Vietnam. In Vietnam, the government provides blind students with Braille versions of textbooks, but other types of accessible textbooks are not officially released at present. They are planning to release other types such as multimedia DAISY, and we gave lectures to about twenty participants on a method for converting printed STEM books into various accessible formats using InftyReader. The new Vietnamese GUI appeared to be very helpful for them in using the software smoothly.

6 Conclusion and Future Works

Here, we have discussed our new mathematical OCR method recently implemented in InftyReader, which remarkably improves recognition accuracy for e-born PDF STEM content written in Latin characters. Even if a Unicode-based local language is used in the text part, InftyReader can convert the PDF into various accessible e-formats without a special OCR engine for that language. In addition, the software GUI has been improved so that end users can easily customize it for their local language; French and Vietnamese versions have been released. As the next step, we intend to work on the following tasks.
As for Asian languages written in non-Latin characters, we have already incorporated three languages into InftyReader: Japanese, Korean and Thai. However, there remain many other such languages in Asia with large speaker populations, for instance Chinese, Hindi and Arabic. As for Chinese, our current problem is that the PDF parser, PDFBOX, does not yet sufficiently support the fonts frequently used in Chinese. For Hindi and Arabic, we need further investigation of the character and word representation systems of those languages. Realizing proper recognition of them in InftyReader is an important future task.
Another important remaining task for InftyReader is how to analyze complicated layouts correctly. In recent school textbooks, page layouts have become increasingly complicated as publishing technologies develop. At present, we have to do much manual work to handle them properly. Further development of machine-learning technology to replace this work with automatic processing is strongly desired.
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
References
4. Nakamura, S., Kohase, K., Fujiyoshi, A.: Extracting precise coordinate information of components from E-Born PDF Files. In: Proceedings of the 4th International Workshop on “Digitization and E-Inclusion in Mathematics and Science 2021” (DEIMS2021), Nihon University, online, pp. 15–18 (2021)
7. Baker, J., Sexton, A., Sorge, V.: Extracting precise data from PDF documents for mathematical formula recognition. In: Proceedings of the 8th IAPR International Workshop on Document Analysis Systems (2008)
9. Suzuki, M., Tamari, F., Fukuda, R., Uchida, S., Kanahori, T.: INFTY - an integrated OCR system for mathematical documents -. In: Proceedings of the ACM Symposium on Document Engineering 2003, Grenoble, pp. 95–104 (2003)
Metadata
Title
Conversion of Multi-lingual STEM Documents in E-Born PDF into Various Accessible E-Formats
Written by
Masakazu Suzuki
Katsuhito Yamaguchi
Copyright year
2022
DOI
https://doi.org/10.1007/978-3-031-08648-9_2
