Published in: Pattern Analysis and Applications 1/2018

1 June 2017 | Short paper

A text representation model using Sequential Pattern-Growth method

Authors: Suraya Alias, Siti Khaotijah Mohammad, Gan Keng Hoon, Tan Tien Ping

Abstract

Text representation is an essential task that transforms input text into features that can later be used in Text Mining and Information Retrieval tasks. The most commonly used text representation models are Bag-of-Words (BOW) and N-gram. Nevertheless, two known issues with these models warrant investigation: inaccurate semantic representation of text and the high dimensionality of word combinations. A pattern-based model named Frequent Adjacent Sequential Pattern (FASP) is introduced to represent text as a set of sequences of adjacent words that occur frequently across the document collection. The purpose of this study is to discover similar textual patterns between documents, which can later be converted into a set of rules that describe the main news event. FASP is based on the divide-and-conquer strategy of Pattern-Growth; the main difference between FASP and the prior technique lies in the Pattern Generation phase. The approach is tested against the BOW and N-gram text representation models on Malay and English news datasets, with different term weightings in the Vector Space Model (VSM). The findings demonstrate that the FASP model shows promising performance in finding similarities between documents, with an average vector size reduction of 34% against BOW and 77% against the N-gram model on the Malay dataset. Results on the English dataset are consistent, indicating that the FASP approach is also language independent.
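
To make the idea of frequent adjacent word sequences more concrete, the following is a minimal Python sketch. It is not the authors' FASP algorithm, whose Pattern Generation phase is not described in the abstract. It only illustrates a generic pattern-growth style of mining contiguous word sequences and then comparing documents in a vector space. The function names (`mine_adjacent_patterns`, `pattern_vector`, `cosine`), the `min_support` threshold, the raw-count weighting, and the toy documents are all assumptions made for illustration.

```python
from collections import defaultdict
import math


def mine_adjacent_patterns(docs, min_support=2):
    """Mine contiguous word sequences found in at least min_support documents.

    Pattern-growth style (an illustrative assumption, not the paper's exact
    procedure): each frequent pattern is extended only by the word directly
    following one of its occurrences, so infrequent prefixes are never grown.
    """
    # Seed with single words; an occurrence is (doc index, index of last word).
    seeds = defaultdict(list)
    for d, tokens in enumerate(docs):
        for i, word in enumerate(tokens):
            seeds[(word,)].append((d, i))

    frequent = {}

    def grow(pattern, occurrences):
        support = len({d for d, _ in occurrences})  # document frequency
        if support < min_support:
            return
        frequent[pattern] = support
        # Project onto the next adjacent word of every occurrence.
        extensions = defaultdict(list)
        for d, i in occurrences:
            if i + 1 < len(docs[d]):
                extensions[pattern + (docs[d][i + 1],)].append((d, i + 1))
        for longer, occs in extensions.items():
            grow(longer, occs)

    for pattern, occs in seeds.items():
        grow(pattern, occs)
    return frequent


def pattern_vector(tokens, patterns):
    """Count contiguous occurrences of each mined pattern in one document."""
    vector = {}
    for p in patterns:
        n = len(p)
        count = sum(
            1 for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == p
        )
        if count:
            vector[p] = count
    return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical toy collection, already tokenised.
docs = [
    "the prime minister visited the flood victims".split(),
    "the prime minister announced new flood aid".split(),
    "flood victims received aid".split(),
]
patterns = mine_adjacent_patterns(docs, min_support=2)
vectors = [pattern_vector(tokens, patterns) for tokens in docs]
print(sorted(patterns, key=len, reverse=True)[0])
print(cosine(vectors[0], vectors[1]) > cosine(vectors[1], vectors[2]))
```

In this sketch, only sequences shared by at least two documents become vector dimensions, which is one plausible way a pattern-based representation can end up smaller than a full N-gram vocabulary. The reduction figures reported in the abstract come from the authors' experiments and are not reproduced by this toy example.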

Metadata
Title
A text representation model using Sequential Pattern-Growth method
Authors
Suraya Alias
Siti Khaotijah Mohammad
Gan Keng Hoon
Tan Tien Ping
Publication date
1 June 2017
Publisher
Springer London
Published in
Pattern Analysis and Applications / Issue 1/2018
Print ISSN: 1433-7541
Electronic ISSN: 1433-755X
DOI
https://doi.org/10.1007/s10044-017-0624-9
