2018 | Original Paper | Book Chapter

16. Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing

Authors: Lukas Borke, Wolfgang K. Härdle

Published in: Handbook of Big Data Analytics

Publisher: Springer International Publishing

Abstract

QuantNet is an integrated web-based environment consisting of different types of statistics-related documents and program code. Its goal is to foster reproducibility and to offer a platform for sharing validated knowledge native to the social web. Increasing the efficiency of information retrieval (IR) requires incorporating semantic information. Three text mining models are examined: the vector space model (VSM), the generalized VSM (GVSM), and latent semantic analysis (LSA). LSA has been used successfully for IR as a technique for capturing semantic relations between terms and inserting them into the similarity measure between documents. Our results show that different model configurations allow adapted similarity-based document clustering and knowledge discovery. In particular, different LSA configurations together with hierarchical clustering yield good results under M³ evaluation. QuantNet and the corresponding Data-Driven Documents (D3) based visualization can be found and used at http://quantlet.de. The driving technology behind it is Q3-D3-LSA, which combines the “GitHub API based QuantNet Mining infrastructure in R”, LSA, and a D3 implementation.
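To make the idea of inserting term relations into the document similarity concrete, the following is a minimal sketch rather than the chapter's implementation. It assumes a toy binary term-document matrix `D` (terms as rows, documents as columns), the plain VSM similarity `D^T D`, and one common GVSM formulation that routes the similarity through the term-term matrix `D D^T`. The terms, documents, and helper names are hypothetical, and the weighting actually used in the chapter is not given here.

```python
from math import sqrt

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]

def cosine_normalize(s):
    """Scale a similarity matrix so that every diagonal entry becomes 1."""
    n = len(s)
    return [[s[i][j] / sqrt(s[i][i] * s[j][j]) for j in range(n)] for i in range(n)]

# Hypothetical toy corpus: rows are terms, columns are documents d0..d3.
D = [
    [1, 0, 0, 0],  # "option"
    [1, 1, 0, 0],  # "pricing"
    [0, 1, 1, 0],  # "volatility"
    [0, 0, 0, 1],  # "cluster"
]
Dt = transpose(D)

# Basic VSM: documents are similar only if they share terms.
sim_vsm = cosine_normalize(matmul(Dt, D))

# GVSM (assumed term-term variant): similarity passes through D D^T,
# so terms that co-occur in some document link otherwise disjoint documents.
term_term = matmul(D, Dt)
sim_gvsm = cosine_normalize(matmul(matmul(Dt, term_term), D))

print("VSM  sim(d0, d2):", sim_vsm[0][2])
print("GVSM sim(d0, d2):", round(sim_gvsm[0][2], 3))
print("GVSM sim(d0, d3):", sim_gvsm[0][3])
```

In this toy corpus, d0 and d2 share no term, so their VSM similarity is zero, but both are linked through d1 ("pricing" co-occurs with "volatility"), which gives them a positive GVSM similarity. Document d3 shares no term with any other document, directly or indirectly, and stays at zero. LSA follows the same pattern but would replace `D` by a rank-reduced approximation obtained from a truncated singular value decomposition before the similarities are computed.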

Metadata
Title: Q3-D3-LSA: D3.js and Generalized Vector Space Models for Statistical Computing
Authors: Lukas Borke, Wolfgang K. Härdle
Copyright year: 2018
DOI: https://doi.org/10.1007/978-3-319-18284-1_16