Published in: The VLDB Journal 6/2020

27 August 2020 | Regular Paper

Cohort analytics: efficiency and applicability

Authors: Behrooz Omidvar-Tehrani, Sihem Amer-Yahia, Laks V. S. Lakshmanan


Abstract

The abundant availability of health-care data calls for effective analysis methods that help medical experts better understand their patients and their health. Existing work has focused largely on prediction. In this paper, we introduce Core, a framework for cohort “representation” and “exploration.” Our contributions are twofold. First, we formalize cohort representation as the problem of aggregating the trajectories of a cohort's patients. This problem is challenging because cohorts often consist of hundreds of patients who underwent medical actions of various types at different points in time. We prove that producing a representative cohort trajectory is NP-complete with a reduction from the multiple sequence alignment problem. We propose a heuristic that extends the Needleman–Wunsch algorithm for sequence matching to handle temporal sequences. To further improve the efficiency of cohort representation, we introduce “trajectory families” and “stratified sampling.” Second, we formalize cohort exploration as the problem of finding a set of cohorts that are similar to a cohort of interest and that maximize entropy. This problem is challenging because the number of potentially similar cohorts is huge. We prove NP-completeness with a reduction from the maximum edge subgraph problem. To address this complexity, we develop a multi-staged approach that limits the search space to “contrast cohorts.” To speed up the computation of cohort similarity, we use “event sets,” which are inspired by the double dictionary encoding proposed for keyword search. We evaluate the usefulness and efficiency of Core in an extensive set of qualitative and quantitative experiments on two real health-care datasets. In a user study with medical experts, we show that Core reduces time-to-insight from hours to seconds and helps them find better insights than baseline approaches. We also show that the obtained cohort representations offer the right trade-off between quality and performance. We study the benefits of trajectory families and stratified sampling for cohort representation and show their applicability to large and heterogeneous cohorts. Finally, we show that event sets give cohort exploration interactive performance.
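For readers unfamiliar with the alignment step the abstract builds on, the following is a minimal sketch of the classic Needleman–Wunsch global alignment applied to two sequences of action labels. It is not the temporal extension proposed in the paper: timestamps are ignored, and the scoring values (`match`, `mismatch`, `gap`) and the action labels other than CPAP are assumptions chosen purely for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def needleman_wunsch(
    a: Sequence[str],
    b: Sequence[str],
    match: int = 1,      # assumed reward for identical actions
    mismatch: int = -1,  # assumed penalty for differing actions
    gap: int = -1,       # assumed penalty for skipping an action
) -> Tuple[int, List[Optional[str]], List[Optional[str]]]:
    """Classic global alignment; None marks a gap in the returned rows."""
    n, m = len(a), len(b)

    def sub(i: int, j: int) -> int:
        # Substitution score for a[i-1] versus b[j-1] (1-based table indices).
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    row_a: List[Optional[str]] = []
    row_b: List[Optional[str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            row_a.append(a[i - 1])
            row_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row_a.append(a[i - 1])
            row_b.append(None)
            i -= 1
        else:
            row_a.append(None)
            row_b.append(b[j - 1])
            j -= 1
    row_a.reverse()
    row_b.reverse()
    return score[n][m], row_a, row_b


# Hypothetical trajectories of two patients (labels are illustrative only).
p1 = ["consultation", "CPAP", "hospitalization"]
p2 = ["consultation", "hospitalization"]
best, aligned_1, aligned_2 = needleman_wunsch(p1, p2)
```

With these assumed scores, the second patient's row receives a gap opposite the CPAP action. A temporal variant would additionally have to account for when each action occurred, which this sketch does not attempt.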


Footnotes
1. Continuous Positive Airway Pressure.
2. Throughout the paper, we use the shorter term “age” for “age category,” and “life” for “life status.”
3. Throughout the paper, the dot notation represents the invocation of a function on its right-hand side for the object (e.g., a patient) on its left-hand side.

Metadata
Title
Cohort analytics: efficiency and applicability
Authors
Behrooz Omidvar-Tehrani
Sihem Amer-Yahia
Laks V. S. Lakshmanan
Publication date
27 August 2020
Publisher
Springer Berlin Heidelberg
Published in
The VLDB Journal, Issue 6/2020
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-020-00625-6
