research-article

Effective temporal dependence discovery in time series data

Authors:
Qingchao Cai (National University of Singapore)
Zhongle Xie (National University of Singapore)
Meihui Zhang (Beijing Institute of Technology)
Gang Chen (Zhejiang University)
H. V. Jagadish (University of Michigan)
Beng Chin Ooi (National University of Singapore)

Proceedings of the VLDB Endowment, Volume 11, Issue 8, pp. 893–905. https://doi.org/10.14778/3204028.3204033

Published: 01 April 2018

Abstract

To analyze user behavior over time, it is useful to group users into cohorts, giving rise to cohort analysis. We identify several crucial limitations of current cohort analysis, motivated by the unmet need for temporal dependence discovery. To address these limitations, we propose a generalization that we call recurrent cohort analysis. We introduce a set of operators for recurrent cohort analysis and design access methods specific to these operators in both single-node and distributed environments. Through extensive experiments, we show that recurrent cohort analysis, when implemented using the proposed access methods, is up to six orders of magnitude faster than an implementation layered on top of a database in a single-node setting, and two orders of magnitude faster than an implementation using Spark SQL in a distributed setting.

Published in
Proceedings of the VLDB Endowment, Volume 11, Issue 8
April 2018
94 pages
ISSN: 2150-8097
Editors: Jian Pei (Simon Fraser University), Sihem Amer-Yahia (University of Grenoble Alpes, CNRS)
Publisher: VLDB Endowment