Abstract
We present Cypress, a novel framework to archive and query massive time series streams such as those generated by sensor networks, data centers, and scientific computing. Cypress applies multi-scale analysis to decompose time series and to obtain sparse representations in various domains (e.g. frequency domain and time domain). Relying on the sparsity, the time series streams can be archived with reduced storage space. We then show that many statistical queries such as trend, histogram and correlations can be answered directly from compressed data rather than from reconstructed raw data. Our evaluation with server utilization data collected from real data centers shows significant benefit of our framework.
- D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The Design of the Borealis Stream Processing Engine. In Second Biennial Conference on Innovative Data Systems Research (CIDR 2005), 2005.Google Scholar
- D. Achlioptas. Database-friendly random projections. In ACM PODS, 2001. Google ScholarDigital Library
- M. K. Aguilera, J. C. Mogul, J. L. Wiener, P. Reynolds, and A. Muthitacharoen. Performance debugging for distributed systems of black boxes. In ACM SOSP, 2003. Google ScholarDigital Library
- E. Bingham and H. Mannila. Random projection in dimensionality reduction: applications to image and text data. In in Knowledge Discovery and Data Mining, pages 245--250. ACM Press, 2001. Google ScholarDigital Library
- G. Chen, W. He, J. Liu, S. Nath, L. Rigas, L. Xiao, and F. Zhao. Energy-aware server provisioning and load dispatching for connection-intensive internet services. In USENIX NSDI, 2008. Google ScholarDigital Library
- R. Cole, D. Shasha, and X. Zhao. Fast window correlations over uncooperative time series. In KDD, 2005. Google ScholarDigital Library
- G. Cormode, P. Indyk, N. Koudas, and S. Muthukrishan. Fast mining of massive tabular data via approximate distance computations. In ICDE, 2002.Google ScholarCross Ref
- D. Donoho. Compressed sensing. IEEE Trans. on Information Theory, 52(4):1289--1306, 2006. Google ScholarDigital Library
- X. Fan, W. dietrich Weber, and L. A. Barroso. Power provisioning for a warehouse-sized computer. In Intl. Symp. on Computer Architecture, pages 13--23, 2007. Google ScholarDigital Library
- X. Z. Fern and C. E. Brodley. Random projection for high dimensional data clustering: A cluster ensemble approach. In Intl. Conf. on Machine Learning, 2003.Google Scholar
- S. Gandhi, S. Nath, S. Suri, and J. Liu. GAMPS: Compressing Multi Sensor Data by Grouping and Amplitude Scaling. In ACM SIGMOD, 2009. Google ScholarDigital Library
- M. Garofalakis and P. B. Gibbons. Wavelet Synopses with Error Guarantees. In ACM SIGMOD, 2002. Google ScholarDigital Library
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. One-pass wavelet decompositions of data streams. IEEE Transactions on Knowledge and Data Engineering, 15(3):541--554, 2003. Google ScholarDigital Library
- J. Haupt and R. Nowak. Signal reconstruction from noisy random projections. IEEE Trans. on Information Theory, 52(9):4036--4048, 2006. Google ScholarDigital Library
- P. Indyk, N. Koudas, and S. Muthukrishan. Identifying representative trends in massive time series data sets using sketches. In VLDB, 2000. Google ScholarDigital Library
- P. Indyk and R. Motwani. Approximate nearest neighbots: Towards removing the curse of dimensionality. In ACM STOC, 1998. Google ScholarDigital Library
- W. Johnson and J. Lindenstrauss. Extensions of lipschitz mapping into hilber space. Contemporary Mathematics, 26:189--206, 1984.Google ScholarCross Ref
- Y. Lu, A. Montanari, B. Prabhakar, S. Dharmapurikar, and A. Kabbani. Counter braids: a novel counter architecture for per-flow measurement. In ACM SIGMETRICS, 2008. Google ScholarDigital Library
- A. K. Menon, G. V. A. Pham, S. Chawla, and A. Viglas. An incremental data-stream sketch using sparse random projections. In SIAM International Conference on Data Mining, 2007.Google ScholarCross Ref
- S. Papdimitriou, J. Sun, and C. Faloutsos. Streaming pattern discovery in multiple time-series. In VLDB, 2005. Google ScholarDigital Library
- Y. Sakurai, S. Papdimitriou, and C. Faloutsos. Braid: Stream mining through group lag correslations. In SIGMOD, June 2005. Google ScholarDigital Library
- Y. Z. D. Shasha. StatStream: Statistical monitoring of thousands of data streams in real time. In In VLDB, 2002. Google ScholarDigital Library
- J. Shieh and E. Keogh. iSAX: indexing and mining terabyte sized time series. In ACM SIGKDD, 2008. Google ScholarDigital Library
- N. Thaper, S. Guha, P. Indyk, and N. Koudas. Dynamic multidimensional histograms. In ACM SIGMOD, 2002. Google ScholarDigital Library
- A. Thiagarajan and S. Madden. Querying continuous functions in a database system. In ACM SIGMOD, 2008. Google ScholarDigital Library
- M. Vlachos, M. Hadjieleftheriou, D. Gunopulos, and E. Keogh. Indexing multidimensional time-series. The VLDB Journal, 15(1):1--20, 2006. Google ScholarDigital Library
Index Terms
- Managing massive time series streams with multi-scale compressed trickles
Recommendations
Analyzing massive data streams: past, present, and future
DMKD '03: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discoveryContinuous data streams arise naturally, for example, in the installations of large telecom and Internet service providers where detailed usage information (Call-Detail-Records, SNMP-/RMON packet-flow data, etc.) from different parts of the underlying ...
A Framework for Clustering Massive-Domain Data Streams
ICDE '09: Proceedings of the 2009 IEEE International Conference on Data EngineeringIn this paper, we will examine the problem of clustering massive domain data streams. Massive-domain data streams are those in which the number of possible domain values for each attribute are very large and cannot be easily tracked for clustering ...
Query-aware partitioning for monitoring massive network data streams
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of dataData Stream Management Systems (DSMS) are gaining acceptance for applications that need to process very large volumes of data in real time. The load generated by such applications frequently exceeds by far the computation capabilities of a single ...
Comments