Abstract
To analyze user behavior over time, it is useful to group users into cohorts, giving rise to cohort analysis. We identify several crucial limitations of current cohort analysis, motivated by the unmet need for temporal dependence discovery. To address these limitations, we propose a generalization that we call recurrent cohort analysis. We introduce a set of operators for recurrent cohort analysis and design access methods specific to these operators in both single-node and distributed environments. Through extensive experiments, we show that recurrent cohort analysis, when implemented using the proposed access methods, is up to six orders of magnitude faster than an implementation layered on top of a database in a single-node setting, and two orders of magnitude faster than an implementation using Spark SQL in a distributed setting.