Skip to main content
Top

2015 | OriginalPaper | Chapter

Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

We present an improved implementation of the Annotated suffix tree method for text analysis (abbreviated as the AST-method). Annotated suffix trees are an extension of the original suffix tree data structure, with nodes labeled by occurrence frequencies for corresponding substrings in the input text collection. They have a range of interesting applications in text analysis, such as language-independent computation of a matching score for a keyphrase against some text collection. In our enhanced implementation, new algorithms and data structures (suffix arrays used instead of the traditional but heavyweight suffix trees) have enabled us to derive an implementation superior to the previous ones in terms of both memory consumption (10 times less memory) and runtime. We describe an open-source statistical text analysis software package, called “EAST”, which implements this enhanced annotated suffix tree method. Besides, the EAST package includes an adaptation of a distributional synonym extraction algorithm that supports the Russian language and allows us to achieve better results in keyphrase matching.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
1.
go back to reference Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)MathSciNetCrossRefMATH Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)MathSciNetCrossRefMATH
2.
go back to reference Barsky, M., Stege, U., Thomo, A.: A survey of practical algorithms for suffix tree construction in external memory. Softw. Pract. Experience 40(11), 965–988 (2010)CrossRef Barsky, M., Stege, U., Thomo, A.: A survey of practical algorithms for suffix tree construction in external memory. Softw. Pract. Experience 40(11), 965–988 (2010)CrossRef
3.
go back to reference Dubov, M., Chernyak, E.: Annotated suffix trees: implementation details. Transactions of Scientific Conference on Analysis of Images, Social Networks and Texts (AIST), pp. 49–57. Springer, Switzerland (2013) Dubov, M., Chernyak, E.: Annotated suffix trees: implementation details. Transactions of Scientific Conference on Analysis of Images, Social Networks and Texts (AIST), pp. 49–57. Springer, Switzerland (2013)
4.
go back to reference Dubov, M., Mirkin, B., Shal, A.: Automatic russian text processing system. Open Systems DBMS 22(10), 15–17 (2014) Dubov, M., Mirkin, B., Shal, A.: Automatic russian text processing system. Open Systems DBMS 22(10), 15–17 (2014)
5.
go back to reference Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)CrossRefMATH Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)CrossRefMATH
6.
go back to reference Kinen, J, Sanders, P.: Simple Linear Work Suffix Array Construction. Automata, Languages and Programming. Lecture Notes in Computer Science, pp. 943–2719 (2003) Kinen, J, Sanders, P.: Simple Linear Work Suffix Array Construction. Automata, Languages and Programming. Lecture Notes in Computer Science, pp. 943–2719 (2003)
7.
go back to reference Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001) CrossRef Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001) CrossRef
8.
go back to reference Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 768–774 (1998) Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 768–774 (1998)
10.
go back to reference Mirkin, B., Chernyak, E., Chugunova, O.: Method of annotated suffix tree for scoring the extent of presence of a string in text. Bus. Inf. 3(21), 31–41 (2012) Mirkin, B., Chernyak, E., Chugunova, O.: Method of annotated suffix tree for scoring the extent of presence of a string in text. Bus. Inf. 3(21), 31–41 (2012)
11.
go back to reference Pampapathi, R.: Annotated suffix trees for text modelling and classification. Doctoral dissertation, Birkbeck College, University of London, Retrieved from CiteSeerX (2008) Pampapathi, R.: Annotated suffix trees for text modelling and classification. Doctoral dissertation, Birkbeck College, University of London, Retrieved from CiteSeerX (2008)
12.
go back to reference Pampapathi, R., Mirkin, B., Levene, M.: A suffix tree approach to anti-spam email filtering. Mach. Learn. 65(1), 309–338 (2006)CrossRef Pampapathi, R., Mirkin, B., Levene, M.: A suffix tree approach to anti-spam email filtering. Mach. Learn. 65(1), 309–338 (2006)CrossRef
13.
go back to reference Perkins, J.: Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, Birmingham (2010) Perkins, J.: Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, Birmingham (2010)
16.
go back to reference Wang, T.: Extracting Synonyms from Dictionary Definitions. Retrieved from Focus on Research, Master dissertation, University of Toronto (2009) Wang, T.: Extracting Synonyms from Dictionary Definitions. Retrieved from Focus on Research, Master dissertation, University of Toronto (2009)
Metadata
Title
Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation
Author
Mikhail Dubov
Copyright Year
2015
DOI
https://doi.org/10.1007/978-3-319-26123-2_30

Premium Partner