Skip to main content
Top
Published in: Journal of Intelligent Information Systems 3/2014

01-06-2014

TS-stream: clustering time series on data streams

Published in: Journal of Intelligent Information Systems | Issue 3/2014

Log in

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

The current ability to produce massive amounts of data and the impossibility in storing it motivated the development of data stream mining strategies. Despite the proposal of many techniques, this research area still lacks in approaches to mine data streams composed of multiple time series, which has applications in finance, medicine and science. Most of the current techniques for clustering streaming time series have a serious limitation in their similarity measure, which are based on the Pearson correlation. In this paper, we show the Pearson correlation is not capable of detecting similarities even for classic time series models, such as those by Box and Jenkins. This limitation motivated our proposal to cluster streaming time series based on their generating functions, which is achieved by considering features obtained using descriptive measures, such as Auto Mutual Information, the Hurst Exponent and several others. We present a new tree-based clustering algorithm, entitled TS-Stream, which uses the extracted features to produce partitions in better accordance to the time series generating functions. Experiments with synthetic data sets confirm TS-Stream outperforms ODAC, currently the most popular technique, in terms of clustering quality. Using real financial time series from the NYSE and NASDAQ, we conducted stock trading simulations employing TS-Stream to support the creation of diversified investment portfolios. Results confirmed TS-Stream increased the monetary returns in several orders of magnitude when compared to trading strategies simply based on the Moving Average Convergence Divergence financial indicator.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Footnotes
1
Determinism is a measure based on Recurrence Quantification Analysis (Marwan et al. 2007).
 
2
Defined as \(\sigma ^{2}(X)~=~\sum _{i=1}^{n} p_{i}~\cdot ~(x_{i} - \mu )^{2}\) for a series X = (x 1, x 2, … , x n ), in which p i is the probability of observation x i and μ is the mean of the sequence.
 
3
A deterministic, chaotic, nonlinear model.
 
4
In this case outliers are simply exceptional gains, but when plotted hinder a better visualization of the distribution.
 
Literature
go back to reference Aggarwal, C.C., Han, J., Wang, J., Yu, P.S. (2003). A framework for clustering evolving data streams. In: VLDB ’2003: Proceedings of the 29th international conference on very large data bases (pp. 81–92). VLDB Endowment. Aggarwal, C.C., Han, J., Wang, J., Yu, P.S. (2003). A framework for clustering evolving data streams. In: VLDB ’2003: Proceedings of the 29th international conference on very large data bases (pp. 81–92). VLDB Endowment.
go back to reference Aggarwal, C.C., Han, J.,Wang, J., Yu, P.S. (2004). A framework for projected clustering of high dimensional data streams. In VLDB ’04: Proceedings of the 30th international conference on very large data bases (pp. 852–863). VLDB Endowment. Aggarwal, C.C., Han, J.,Wang, J., Yu, P.S. (2004). A framework for projected clustering of high dimensional data streams. In VLDB ’04: Proceedings of the 30th international conference on very large data bases (pp. 852–863). VLDB Endowment.
go back to reference Appel, G. (2005). Technical analysis: power tools for active investors, 1st edn, FT Press. Appel, G. (2005). Technical analysis: power tools for active investors, 1st edn, FT Press.
go back to reference Ardia, D., Boudt, K., Carl, P., Mullen, K.M., Peterson, B.G. (2011). Differential Evolution with DEoptim: An application to non-convex portfolio optimization. The Royal Journal, 3(1), 27–34. Ardia, D., Boudt, K., Carl, P., Mullen, K.M., Peterson, B.G. (2011). Differential Evolution with DEoptim: An application to non-convex portfolio optimization. The Royal Journal, 3(1), 27–34.
go back to reference Athanassioum, P. (2012). Research handbook on hedge funds, private equity and alternative investments, Edward Elgar Pub. Athanassioum, P. (2012). Research handbook on hedge funds, private equity and alternative investments, Edward Elgar Pub.
go back to reference Bélisle, C. (1992). Convergence theorems for a class of simulated annealing algorithms on rd. Journal of Applied Probability, 885–895. Bélisle, C. (1992). Convergence theorems for a class of simulated annealing algorithms on rd. Journal of Applied Probability, 885–895.
go back to reference Bifet, A., & Kirby, R. (2009). Data stream mining: A practical approach. Technical report, The University of Waikato. Bifet, A., & Kirby, R. (2009). Data stream mining: A practical approach. Technical report, The University of Waikato.
go back to reference Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. The journal of political economy, 637–654. Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. The journal of political economy, 637–654.
go back to reference Box, G., & Jenkins, G. (1994). Time series analysis: forecasting and control, Prentice Hall PTR. Box, G., & Jenkins, G. (1994). Time series analysis: forecasting and control, Prentice Hall PTR.
go back to reference Cao, F. (2006). Density-based clustering over an evolving data stream with noise. In Proceedings of the 6th SIAM international conference data mining. Cao, F. (2006). Density-based clustering over an evolving data stream with noise. In Proceedings of the 6th SIAM international conference data mining.
go back to reference Chaovalit, P. (2009). Clustering transient data streams by example and by variable. PhD thesis, University of Maryland. Chaovalit, P. (2009). Clustering transient data streams by example and by variable. PhD thesis, University of Maryland.
go back to reference Chatfield, C. (2003). The analysis of time series: an introduction (Vol. 59). CRC press. Chatfield, C. (2003). The analysis of time series: an introduction (Vol. 59). CRC press.
go back to reference Daubechies, I. (1992). Ten lectures on wavelets. Society for industrial and applied mathematics, Philadelphia. Daubechies, I. (1992). Ten lectures on wavelets. Society for industrial and applied mathematics, Philadelphia.
go back to reference D’haeseleer, P., et al. (2005). How does gene expression clustering work? Nature biotechnology, 23(12), 1499–1502.CrossRef D’haeseleer, P., et al. (2005). How does gene expression clustering work? Nature biotechnology, 23(12), 1499–1502.CrossRef
go back to reference Fourier, J. (1888). Théorie analytique de la chaleur (Vol. 1). Gauthier-Villars et fils. Fourier, J. (1888). Théorie analytique de la chaleur (Vol. 1). Gauthier-Villars et fils.
go back to reference Fujita, A., Sato, J., Demasi, M., Sogayar, M., Ferreira, C., Miyano, S. (2009). Comparing pearson, spearman and hoeffding’s d measure for gene expression association analysis. Journal of Bioinformatics and Computational Biology, 7(4), 663–84.CrossRef Fujita, A., Sato, J., Demasi, M., Sogayar, M., Ferreira, C., Miyano, S. (2009). Comparing pearson, spearman and hoeffding’s d measure for gene expression association analysis. Journal of Bioinformatics and Computational Biology, 7(4), 663–84.CrossRef
go back to reference Gama, J. (2010). Knowledge discovery from data streams, 1st edn. Chapman & Hall/CRC. Gama, J. (2010). Knowledge discovery from data streams, 1st edn. Chapman & Hall/CRC.
go back to reference Hilbert, D. (1912). Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen. BG Teubner. Hilbert, D. (1912). Grundzüge einer allgemeinen Theorie der linearen Integralgleichungen. BG Teubner.
go back to reference Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 13–30. Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 13–30.
go back to reference Huang, N., Shen, Z., Long, S., Wu, M., Shih, H., Zheng, Q., Yen, N., Tung, C., Liu, H. (1998). The empirical mode decomposition and the hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London Series A: Mathematical, Physical and Engineering Sciences, 454(1971), 903.CrossRefMATHMathSciNet Huang, N., Shen, Z., Long, S., Wu, M., Shih, H., Zheng, Q., Yen, N., Tung, C., Liu, H. (1998). The empirical mode decomposition and the hilbert spectrum for nonlinear and non-stationary time series analysis. Proceedings of the Royal Society of London Series A: Mathematical, Physical and Engineering Sciences, 454(1971), 903.CrossRefMATHMathSciNet
go back to reference Ishii, R.P., Rios, R.A., de Mello, R.F. (2011). Classification of time series generation processes using experimental tools: a survey and proposal of an automatic and systematic approach. International Journal of Computational Science and Engineering, 1, 1–21. Ishii, R.P., Rios, R.A., de Mello, R.F. (2011). Classification of time series generation processes using experimental tools: a survey and proposal of an automatic and systematic approach. International Journal of Computational Science and Engineering, 1, 1–21.
go back to reference Kantz, H., & Schreiber, T. (1997). Nonlinear time series analysis. New York: Cambridge University Press. Kantz, H., & Schreiber, T. (1997). Nonlinear time series analysis. New York: Cambridge University Press.
go back to reference Keogh, E., Lin, J., Truppel, W. (2003). Clustering of time series subsequences is meaningless: implications for previous and future research. In ICDM ’03: Proceedings of the 3rd IEEE international conference on data mining, IEEE computer society (pp. 115–). Washington, DC. http://dl.acm.org/citation.cfm?id=951949.952156. Keogh, E., Lin, J., Truppel, W. (2003). Clustering of time series subsequences is meaningless: implications for previous and future research. In ICDM ’03: Proceedings of the 3rd IEEE international conference on data mining, IEEE computer society (pp. 115–). Washington, DC. http://​dl.​acm.​org/​citation.​cfm?​id=​951949.​952156.
go back to reference Kontaki, M., Papadopoulos, A.N., Manolopoulos, Y. (2008). Continuous subspace clustering in streaming time series. Information Systems, 33(2), 240–260.CrossRef Kontaki, M., Papadopoulos, A.N., Manolopoulos, Y. (2008). Continuous subspace clustering in streaming time series. Information Systems, 33(2), 240–260.CrossRef
go back to reference Lorenz, E.N. (1963). Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2), 130–141.CrossRef Lorenz, E.N. (1963). Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2), 130–141.CrossRef
go back to reference MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.MATH MacKay, D. (2003). Information theory, inference, and learning algorithms. Cambridge: Cambridge University Press.MATH
go back to reference Peng, C., Buldyrev, S., Havlin, S., Simons, M., Stanley, H., Goldberger, A. (1994). On the mosaic organization of DNA sequences. Physical Review E, 49, 1685–1689.CrossRef Peng, C., Buldyrev, S., Havlin, S., Simons, M., Stanley, H., Goldberger, A. (1994). On the mosaic organization of DNA sequences. Physical Review E, 49, 1685–1689.CrossRef
go back to reference Quinlan, J. (1986). Induction of decision trees. Machine learning, 1(1), 81–106. Quinlan, J. (1986). Induction of decision trees. Machine learning, 1(1), 81–106.
go back to reference Quinlan, J. (1993). C4. 5: programs for machine learning, Morgan Kaufmann. Quinlan, J. (1993). C4. 5: programs for machine learning, Morgan Kaufmann.
go back to reference R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://www.R-project.org, ISBN 3-900051-07-0. R Development Core Team (2011). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna. http://​www.​R-project.​org, ISBN 3-900051-07-0.
go back to reference Ren, J., Cai, B., Hu, C. (2011). Clustering over data streams based on grid density and index tree. Journal of Convergence Information Technology, 6(1). Ren, J., Cai, B., Hu, C. (2011). Clustering over data streams based on grid density and index tree. Journal of Convergence Information Technology, 6(1).
go back to reference Sandri, M. (1996). Numerical calculation of lyapunov exponents. The Mathematical Journal, 6(3), 78–84. Sandri, M. (1996). Numerical calculation of lyapunov exponents. The Mathematical Journal, 6(3), 78–84.
go back to reference Skiena, S.S. (1998). The algorithm design manual. New York: Springer. Skiena, S.S. (1998). The algorithm design manual. New York: Springer.
go back to reference Takens, F. (1981). Detecting strange attractors in turbulence. Dynamical systems and turbulence, (pp. 366–381). Warwick 1980. Takens, F. (1981). Detecting strange attractors in turbulence. Dynamical systems and turbulence, (pp. 366–381). Warwick 1980.
go back to reference Tan, P.N., Steinbach, M., Kumar, V. (2005). Introduction to Data Mining. Boston: Addison-Wesley Longman. Tan, P.N., Steinbach, M., Kumar, V. (2005). Introduction to Data Mining. Boston: Addison-Wesley Longman.
go back to reference Tang, L.A., Zheng, Y., Yuan, J., Han, J., Leung, A., Hung, C.C., Peng, W.C. (2012). On discovery of traveling companions from streaming trajectories. In 2012 IEEE 28th International Conference on Data Engineering (ICDE), (pp. 186-197). IEEE. Tang, L.A., Zheng, Y., Yuan, J., Han, J., Leung, A., Hung, C.C., Peng, W.C. (2012). On discovery of traveling companions from streaming trajectories. In 2012 IEEE 28th International Conference on Data Engineering (ICDE), (pp. 186-197). IEEE.
go back to reference Vinh, N.X., Epps, J., Bailey, J. (2009). Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML ’09 (pp 1073–1080). New York: ACM. doi:10.1145/1553374.1553511. Vinh, N.X., Epps, J., Bailey, J. (2009). Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML ’09 (pp 1073–1080). New York: ACM. doi:10.​1145/​1553374.​1553511.
go back to reference Walpole, R., Myers, R., Myers, S., Ye, K. (1998). Probability and statistics for engineers and scientists. Upper Saddle River: Prentice Hall. Walpole, R., Myers, R., Myers, S., Ye, K. (1998). Probability and statistics for engineers and scientists. Upper Saddle River: Prentice Hall.
go back to reference Wan, L., Ng, W.K., Dang, X.H., Yu, P.S., Zhang, K. (2009). Density-based clustering of data streams at multiple resolutions. ACM Transactions on Knowledge and Discovery Data, 3(3), 1–28. doi:10.1145/1552303.1552307.CrossRef Wan, L., Ng, W.K., Dang, X.H., Yu, P.S., Zhang, K. (2009). Density-based clustering of data streams at multiple resolutions. ACM Transactions on Knowledge and Discovery Data, 3(3), 1–28. doi:10.​1145/​1552303.​1552307.CrossRef
go back to reference Widiputra, H., Pears, R., Kasabov, N. (2011). Multiple time-series prediction through multiple time-series relationships profiling and clustered recurring trends. In Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 161–172). Berlin, Heidelberg: Springer-Verlag.CrossRef Widiputra, H., Pears, R., Kasabov, N. (2011). Multiple time-series prediction through multiple time-series relationships profiling and clustered recurring trends. In Proceedings of the 15th Pacific-Asia conference on advances in knowledge discovery and data mining (pp. 161–172). Berlin, Heidelberg: Springer-Verlag.CrossRef
go back to reference Yang, Y., & Chen, K. (2011). Temporal data clustering via weighted clustering ensemble with different representations. IEEE Transactions on Knowledge and Data Engineering, 23, 307–320.CrossRef Yang, Y., & Chen, K. (2011). Temporal data clustering via weighted clustering ensemble with different representations. IEEE Transactions on Knowledge and Data Engineering, 23, 307–320.CrossRef
go back to reference Zheng, K., Zheng, Y., Yuan, N.J., Shang, S. (2013). On discovery of gathering patterns from trajectories. In IEEE international conference on data engineering, ICDE. Zheng, K., Zheng, Y., Yuan, N.J., Shang, S. (2013). On discovery of gathering patterns from trajectories. In IEEE international conference on data engineering, ICDE.
Metadata
Title
TS-stream: clustering time series on data streams
Publication date
01-06-2014
Published in
Journal of Intelligent Information Systems / Issue 3/2014
Print ISSN: 0925-9902
Electronic ISSN: 1573-7675
DOI
https://doi.org/10.1007/s10844-013-0290-3

Other articles of this Issue 3/2014

Journal of Intelligent Information Systems 3/2014 Go to the issue

Premium Partner