Skip to main content

2015 | OriginalPaper | Buchkapitel

Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We present an improved implementation of the Annotated suffix tree method for text analysis (abbreviated as the AST-method). Annotated suffix trees are an extension of the original suffix tree data structure, with nodes labeled by occurrence frequencies for corresponding substrings in the input text collection. They have a range of interesting applications in text analysis, such as language-independent computation of a matching score for a keyphrase against some text collection. In our enhanced implementation, new algorithms and data structures (suffix arrays used instead of the traditional but heavyweight suffix trees) have enabled us to derive an implementation superior to the previous ones in terms of both memory consumption (10 times less memory) and runtime. We describe an open-source statistical text analysis software package, called “EAST”, which implements this enhanced annotated suffix tree method. Besides, the EAST package includes an adaptation of a distributional synonym extraction algorithm that supports the Russian language and allows us to achieve better results in keyphrase matching.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)MathSciNetCrossRefMATH Abouelhoda, M.I., Kurtz, S., Ohlebusch, E.: Replacing suffix trees with enhanced suffix arrays. J. Discrete Algorithms 2, 53–86 (2004)MathSciNetCrossRefMATH
2.
Zurück zum Zitat Barsky, M., Stege, U., Thomo, A.: A survey of practical algorithms for suffix tree construction in external memory. Softw. Pract. Experience 40(11), 965–988 (2010)CrossRef Barsky, M., Stege, U., Thomo, A.: A survey of practical algorithms for suffix tree construction in external memory. Softw. Pract. Experience 40(11), 965–988 (2010)CrossRef
3.
Zurück zum Zitat Dubov, M., Chernyak, E.: Annotated suffix trees: implementation details. Transactions of Scientific Conference on Analysis of Images, Social Networks and Texts (AIST), pp. 49–57. Springer, Switzerland (2013) Dubov, M., Chernyak, E.: Annotated suffix trees: implementation details. Transactions of Scientific Conference on Analysis of Images, Social Networks and Texts (AIST), pp. 49–57. Springer, Switzerland (2013)
4.
Zurück zum Zitat Dubov, M., Mirkin, B., Shal, A.: Automatic russian text processing system. Open Systems DBMS 22(10), 15–17 (2014) Dubov, M., Mirkin, B., Shal, A.: Automatic russian text processing system. Open Systems DBMS 22(10), 15–17 (2014)
5.
Zurück zum Zitat Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)CrossRefMATH Gusfield, D.: Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)CrossRefMATH
6.
Zurück zum Zitat Kinen, J, Sanders, P.: Simple Linear Work Suffix Array Construction. Automata, Languages and Programming. Lecture Notes in Computer Science, pp. 943–2719 (2003) Kinen, J, Sanders, P.: Simple Linear Work Suffix Array Construction. Automata, Languages and Programming. Lecture Notes in Computer Science, pp. 943–2719 (2003)
7.
Zurück zum Zitat Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001) CrossRef Kasai, T., Lee, G.H., Arimura, H., Arikawa, S., Park, K.: Linear-time longest-common-prefix computation in suffix arrays and its applications. In: Amir, A., Landau, G.M. (eds.) CPM 2001. LNCS, vol. 2089, pp. 181–192. Springer, Heidelberg (2001) CrossRef
8.
Zurück zum Zitat Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 768–774 (1998) Lin, D.: Automatic retrieval and clustering of similar words. In: Proceedings of the 17th International Conference on Computational Linguistics, pp. 768–774 (1998)
10.
Zurück zum Zitat Mirkin, B., Chernyak, E., Chugunova, O.: Method of annotated suffix tree for scoring the extent of presence of a string in text. Bus. Inf. 3(21), 31–41 (2012) Mirkin, B., Chernyak, E., Chugunova, O.: Method of annotated suffix tree for scoring the extent of presence of a string in text. Bus. Inf. 3(21), 31–41 (2012)
11.
Zurück zum Zitat Pampapathi, R.: Annotated suffix trees for text modelling and classification. Doctoral dissertation, Birkbeck College, University of London, Retrieved from CiteSeerX (2008) Pampapathi, R.: Annotated suffix trees for text modelling and classification. Doctoral dissertation, Birkbeck College, University of London, Retrieved from CiteSeerX (2008)
12.
Zurück zum Zitat Pampapathi, R., Mirkin, B., Levene, M.: A suffix tree approach to anti-spam email filtering. Mach. Learn. 65(1), 309–338 (2006)CrossRef Pampapathi, R., Mirkin, B., Levene, M.: A suffix tree approach to anti-spam email filtering. Mach. Learn. 65(1), 309–338 (2006)CrossRef
13.
Zurück zum Zitat Perkins, J.: Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, Birmingham (2010) Perkins, J.: Python Text Processing with NLTK 2.0 Cookbook. Packt Publishing, Birmingham (2010)
14.
16.
Zurück zum Zitat Wang, T.: Extracting Synonyms from Dictionary Definitions. Retrieved from Focus on Research, Master dissertation, University of Toronto (2009) Wang, T.: Extracting Synonyms from Dictionary Definitions. Retrieved from Focus on Research, Master dissertation, University of Toronto (2009)
Metadaten
Titel
Text Analysis with Enhanced Annotated Suffix Trees: Algorithms and Implementation
verfasst von
Mikhail Dubov
Copyright-Jahr
2015
DOI
https://doi.org/10.1007/978-3-319-26123-2_30

Premium Partner