Segmentation of text and math elements. In languages written with Roman/Latin characters, such as European languages, Vietnamese, Malay, Indonesian, etc., ordinary words are often misrecognized as math objects, or vice versa. Function names such as \(\log \) and other technical elements must be distinguished correctly from the short words appearing in the local-language text. An element that has a math structure is, of course, automatically judged to be math; however, for elements (character sequences) with no math structure, this judgement still plays an important role in how the recognition results are used, for instance in reading aloud or Braille translation.
This classification is strongly language-dependent, so an appropriate dictionary is needed for each local language. For the new version, we collected STEM content from WIKI to construct dictionaries for several local languages. Although the collected source texts are not very large, the resulting short-word dictionaries work surprisingly well, and the latest version incorporates such dictionaries for 20 languages: English, Czech, Danish, Dutch, Finnish, French, German, Hungarian, Italian, Indonesian, Malay, Norwegian, Polish, Portuguese, Russian, Slovak, Spanish, Swedish, Turkish and Vietnamese.
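The idea of a short-word dictionary can be illustrated with a minimal sketch. It is not the implementation used in the system: the function names, the fixed list of math function names, the length limit `max_len`, and the frequency threshold `min_count` are all assumptions made for illustration. Short words that occur often enough in a language's source text form the dictionary; a short token with no math structure is then judged to be text if it is in the dictionary and math otherwise.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical list of function names that are always treated as math.
MATH_FUNCTIONS = {"log", "sin", "cos", "tan", "exp", "lim", "max", "min"}


def build_short_word_dictionary(corpus, max_len=3, min_count=2):
    """Collect short words that occur at least min_count times in corpus."""
    # NFC normalization keeps accented letters (e.g. Vietnamese) as single characters.
    text = unicodedata.normalize("NFC", corpus).lower()
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    counts = Counter(w for w in words if len(w) <= max_len)
    return {w for w, c in counts.items() if c >= min_count}


def classify_token(token, short_words, max_len=3):
    """Judge a character sequence with no math structure as 'text' or 'math'."""
    t = unicodedata.normalize("NFC", token).lower()
    if t in MATH_FUNCTIONS:
        return "math"
    # Assumption of this sketch: longer sequences are ordinary words.
    if len(t) > max_len or t in short_words:
        return "text"
    return "math"


sample = "the sum of the series is the limit of the partial sums if it is finite"
english = build_short_word_dictionary(sample)

for tok in ["of", "is", "log", "xy", "series"]:
    print(tok, classify_token(tok, english))
```

With this toy sample, only "the", "of" and "is" reach the frequency threshold, so "xy" is judged to be math while "of" is judged to be text. A real dictionary would be built per language from a much larger text collection, and the thresholds would need tuning.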