Published in: Annals of Data Science 3/2021

06-06-2020

A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates

Authors: Chunlei Tang, Joseph Michael Plasek, Yun Xiong, Zhikun Zhang, David Westfall Bates, Li Zhou


Abstract

This paper proposes a novel unsupervised clustering algorithm, based on document embedding, to generate clinical note templates. We adapted Charikar’s SimHash to embed each clinical document into a vector representation. We modified the traditional K-means algorithm to merge any two clusters whose centroids are very close. Under the K-means paradigm, our algorithm designates the document whose vector is closest to the cluster centroid as the prototype template for that cluster. We evaluated the feasibility of applying our algorithm at the individual author level on a corpus of 1,063,893 clinical notes written by 19,146 unique providers between January 2011 and July 2016. Our algorithm achieved more than 80% precision and runs in O(n) time. We further validated our algorithm with human annotators, who reported that it can efficiently detect a real clinical document that represents the other documents in the same cluster at both the department level and the individual clinician level.
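The three steps named in the abstract (SimHash embedding, K-means with merging of close centroids, and choosing the document nearest each centroid as the prototype) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the whitespace tokenizer, the SHA-256 token hash, the 64-bit width, the Euclidean merge threshold `merge_dist`, the first-k initialization, and the rule of dropping the later of two close centroids are all assumptions made for illustration.

```python
import hashlib
import numpy as np

def simhash_vector(text, n_bits=64):
    """Charikar-style SimHash sketch: each token votes +1/-1 per bit position.
    The tokenizer and hash function are assumptions, not taken from the paper."""
    acc = np.zeros(n_bits)
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        bits = np.array([(h >> i) & 1 for i in range(n_bits)])
        acc += 2 * bits - 1  # bit 1 votes +1, bit 0 votes -1
    return np.where(acc >= 0, 1.0, -1.0)  # sign vector used as the embedding

def kmeans_with_merge(X, k, merge_dist, n_iter=20):
    """K-means variant that collapses centroids closer than merge_dist.
    Initialization uses the first k rows so that runs are reproducible."""
    X = np.asarray(X, dtype=float)
    centroids = X[:k].copy()
    for _ in range(n_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update step: recompute means, skipping clusters that became empty.
        centroids = np.array([X[labels == j].mean(axis=0)
                              for j in np.unique(labels)])
        # Merge step (assumed rule): keep a centroid only if it is farther
        # than merge_dist from every centroid already kept.
        kept = []
        for c in centroids:
            if all(np.linalg.norm(c - q) > merge_dist for q in kept):
                kept.append(c)
        centroids = np.array(kept)
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return centroids, d.argmin(axis=1)

def pick_prototypes(X, centroids, labels):
    """For each cluster, return the index of the member closest to its centroid."""
    X = np.asarray(X, dtype=float)
    prototypes = {}
    for j, c in enumerate(centroids):
        members = np.flatnonzero(labels == j)
        if members.size == 0:
            continue
        dist = np.linalg.norm(X[members] - c, axis=1)
        prototypes[j] = int(members[dist.argmin()])
    return prototypes

if __name__ == "__main__":
    # Made-up toy notes, not data from the paper.
    notes = [
        "chief complaint cough plan rest fluids",
        "chief complaint fever plan rest fluids",
        "mri brain without contrast impression normal",
        "mri spine without contrast impression normal",
    ]
    X = np.stack([simhash_vector(n) for n in notes])
    centroids, labels = kmeans_with_merge(X, k=3, merge_dist=6.0)
    for cluster, idx in pick_prototypes(X, centroids, labels).items():
        print(cluster, notes[idx])
```

Because the prototype is always an actual document rather than a synthetic mean, the result can be shown to a human reviewer as a candidate template. The merge step also means the final number of clusters may be smaller than the initial `k`.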

Literature
1.
2. Tan P, Steinbach M, Karpatne A, Kumar V (2019) Introduction to data mining, 2nd edn. Pearson Education India, London
4. Doing-Harris K, Patterson O, Igo S, Hurdle J (2013) Document sublanguage clustering to detect medical specialty in cross-institutional clinical texts. In: Proceedings of the 7th international workshop on data and text mining in biomedical informatics. ACM, pp 9–12
5. Patterson O, Hurdle JF (2011) Document clustering of clinical narratives: a systematic study of clinical sublanguages. In: AMIA annual symposium proceedings. American Medical Informatics Association, p 1099
6. Zhang R, Pakhomov S, Melton GB (2014) Longitudinal analysis of new information types in clinical notes. In: AMIA joint summits on translational science proceedings. American Medical Informatics Association, pp 232–237
7. Cohen R, Elhadad M, Elhadad N (2013) Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies. BMC Bioinform 14:10
8. Downey D, Etzioni O, Soderland S (2010) Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artif Intell 174(11):726
9. Zhang R, Pakhomov S, McInnes BT, Melton GB (2011) Evaluating measures of redundancy in clinical texts. In: AMIA annual symposium proceedings, pp 1612–1620
10. Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: KDD workshop on text mining, vol 400, no 1, pp 525–526
12. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, pp 380–388
13. Sadowski C, Levin G (2007) Simhash: hash-based similarity detection. Technical report, Google
14. Boley D, Gini M, Gross R, Han E, Hastings K, Karypis G, Kumar V, Mobasher B, Moore J (1999) Partitioning-based clustering for web document categorization. Decis Support Syst 27(3):329–341
15. Zhao Y, Karypis G, Fayyad U (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2):141–168
16. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining. AAAI, pp 226–231
17. Liao W, Liu Y, Choudhary A (2004) A grid-based clustering algorithm using adaptive mesh refinement. In: 7th workshop on mining scientific and engineering datasets of SIAM international conference on data mining, vol 22. SIAM, pp 61–69
18. Zhong S, Ghosh J (2003) A unified framework for model-based clustering. J Mach Learn Res 4:1001–1037
19. Wu H, Luk R, Wong K, Kwok K (2008) Interpreting TF-IDF term weights as making relevance decisions. ACM Trans Inf Syst 26(3):13
20. Hui S, Dechao Z (2016) A weighted topical document embedding based clustering method for news text. In: 2016 IEEE information technology, networking, electronic and automation control conference. IEEE, pp 1060–1065
21. Sood S (2011) Probabilistic simhash matching. Doctoral dissertation. Texas A&M University
22. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of ACM STOC, pp 604–613
23. Svenstrup DT, Hansen J, Winther O (2017) Hash embeddings for efficient word representations. In: Advances in neural information processing systems, pp 4928–4936
24. Sim J, Wright CC (2005) The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther 85(3):257–268
25. McHugh ML (2012) Interrater reliability: the kappa statistic. Biochem Med 22(3):276–282
26. Institute of Medicine Committee on Quality of Health Care in America (2001) Crossing the quality chasm: a new health system for the 21st century. National Academies Press, Washington
27. Clark A (1998) Being there: putting brain, body, and world together again. MIT Press, Cambridge
28. Kashyap V, Turchin A, Morin L, Chang F, Li Q, Hongsermeier T (2006) Creation of structured documentation templates using natural language processing techniques. In: AMIA annual symposium proceedings. American Medical Informatics Association, p 977
29. Ferrández O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM (2013) BoB, a best-of-breed automated text de-identification system for VHA clinical documents. J Am Med Inform Assoc 20(1):77–83
Metadata
Title
A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates
Authors
Chunlei Tang
Joseph Michael Plasek
Yun Xiong
Zhikun Zhang
David Westfall Bates
Li Zhou
Publication date
06-06-2020
Publisher
Springer Berlin Heidelberg
Published in
Annals of Data Science / Issue 3/2021
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI
https://doi.org/10.1007/s40745-020-00296-8
