Published in: The Journal of Supercomputing 8/2016

01.08.2016

Extracting significant pattern histories from timestamped texts using MapReduce

Written by: Jing-Doo Wang


Abstract

This paper offers an approach to trend analysis in text mining: attach timestamps to texts as tags, then observe the frequency distribution of patterns over equally spaced time intervals to predict trends. Observing the frequency distributions (histories) of significant patterns plays an important role for trend analysts. To make the extraction of these frequency distributions scalable for huge amounts of timestamped text spanning long time periods, this paper proposes a novel approach based on the Hadoop MapReduce programming model. It improves on our previous work, which was based on an external memory approach, and reduces the computation time from several days to several hours. The history of a significant pattern is the frequency distribution of that pattern over equally spaced time intervals; a significant pattern is a maximal repeat of consecutive words within the texts. Note that a significant pattern can be as long as a whole sentence if that sentence appears at least twice. To support the contribution of this study, the experiments used the titles and abstracts (12 GB in total) of 14,473,242 articles from 1990 to 2014 (25 years) downloaded from PubMed, a well-known website for biomedical literature. Experimental results show that the computation time can be reduced from days to hours using six computing nodes within one personal computer cluster. Notably, these pattern histories, spanning more than two decades, not only provide clues for analyzing trend variations within these articles, but also have the potential to reveal revolutions in article writing that might be valuable to linguists engaged in corpus analysis in the future.
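The idea of a pattern history can be illustrated with a minimal sketch in plain Python that imitates the map and reduce steps in memory. It is not the paper's implementation: it assumes fixed-length runs of consecutive words (n-grams) as a stand-in for maximal repeats, uses one year as the time interval, and all names (`DOCS`, `map_phase`, `reduce_phase`, `pattern_histories`) and the toy corpus are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical toy corpus of (year, text) records; not data from the paper.
DOCS: List[Tuple[int, str]] = [
    (1990, "gene expression in yeast"),
    (1990, "gene expression in mice"),
    (1991, "gene expression in yeast cells"),
]

Key = Tuple[str, int]  # (pattern, year)


def map_phase(year: int, text: str, n: int = 2) -> Iterator[Tuple[Key, int]]:
    """Emit ((pattern, year), 1) for every run of n consecutive words.

    Assumption: fixed-length n-grams replace the maximal repeats of the paper.
    """
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield (" ".join(words[i:i + n]), year), 1


def reduce_phase(pairs: Iterable[Tuple[Key, int]]) -> Dict[Key, int]:
    """Sum the counts per (pattern, year) key, as a reducer would after the shuffle."""
    totals: Dict[Key, int] = defaultdict(int)
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


def pattern_histories(
    docs: Iterable[Tuple[int, str]],
    years: List[int],
    n: int = 2,
    min_total: int = 2,
) -> Dict[str, List[int]]:
    """Build one frequency list per pattern, aligned with `years`.

    Patterns occurring fewer than `min_total` times overall are dropped,
    since a pattern that never repeats has no history worth observing.
    """
    counts = reduce_phase(
        pair for year, text in docs for pair in map_phase(year, text, n)
    )
    per_pattern: Dict[str, Dict[int, int]] = {}
    for (pattern, year), count in counts.items():
        if year not in years:
            continue  # ignore records outside the observed intervals
        per_pattern.setdefault(pattern, {y: 0 for y in years})[year] += count
    return {
        pattern: [by_year[y] for y in years]
        for pattern, by_year in per_pattern.items()
        if sum(by_year.values()) >= min_total
    }
```

With the toy corpus, `pattern_histories(DOCS, [1990, 1991])["gene expression"]` evaluates to `[2, 1]`, that is, two occurrences in 1990 and one in 1991. In a real Hadoop job the map and reduce functions would run on separate nodes and the framework would group the keys between them; the sketch only mirrors that data flow.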

Footnotes
3
(PubMedPubDate PubStatus = “pubmed”).

6
Based on this paper, the author filed a US provisional patent application (US 62/301,681), entitled “METHOD FOR EXTRACTING MAXIMAL REPEAT PATTERNS AND COMPUTING FREQUENCY DISTRIBUTION TABLES”, on 2016/3/1.

Metadata
Title
Extracting significant pattern histories from timestamped texts using MapReduce
Written by
Jing-Doo Wang
Publication date
01.08.2016
Publisher
Springer US
Published in
The Journal of Supercomputing / Issue 8/2016
Print ISSN: 0920-8542
Electronic ISSN: 1573-0484
DOI
https://doi.org/10.1007/s11227-016-1713-z
