
Annals of Data Science

Publication date: 06-06-2020

A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates

Authors: Chunlei Tang, Joseph Michael Plasek, Yun Xiong, Zhikun Zhang, David Westfall Bates, Li Zhou

Published in: Annals of Data Science | Issue 3/2021


Abstract

This paper proposes a novel unsupervised clustering algorithm, based on document embedding, to generate clinical note templates. We adapted Charikar’s SimHash to embed each clinical document into a vector representation. We modified the traditional K-means algorithm to merge any two clusters whose centroids are very close. Under the K-means paradigm, our algorithm designates the document whose vector is closest to a cluster's centroid as that cluster's representative, which serves as the prototype template. We evaluated the feasibility of applying our algorithm at the individual author level on a corpus of 1,063,893 clinical notes from 19,146 unique providers, written between January 2011 and July 2016. Our algorithm achieved more than 80% precision and runs in O(n) time. We further validated our algorithm with human annotators, who reported that it can efficiently detect a real clinical document that represents the other documents in the same cluster, at both the department level and the individual clinician level.
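The three steps named in the abstract (SimHash embedding, K-means with merging of close centroids, and choosing the document nearest each centroid as the prototype) can be illustrated with a minimal sketch. This is not the authors' implementation: the unweighted whitespace tokenization, the MD5 token hash, the 64-bit width, the greedy merge rule, and the names `simhash_vector`, `kmeans_with_merge`, `prototypes`, and `merge_dist` are all assumptions made for illustration, and the sample notes are invented.

```python
import hashlib
import math
import random


def simhash_vector(text, n_bits=64):
    """Embed a document as a 0/1 vector, Charikar-style (assumed unweighted tokens).

    n_bits must be a multiple of 8 and at most 128 (MD5 yields 16 bytes).
    """
    votes = [0] * n_bits
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[: n_bits // 8], "big")
        for i in range(n_bits):
            # Each token votes +1 or -1 on every bit position.
            votes[i] += 1 if (h >> i) & 1 else -1
    # The sign of each accumulated vote becomes one coordinate.
    return [1.0 if v > 0 else 0.0 for v in votes]


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(v, centroids):
    """Index of the centroid closest to v."""
    return min(range(len(centroids)), key=lambda c: dist(v, centroids[c]))


def kmeans_with_merge(vectors, k, merge_dist, n_iter=20, seed=0):
    """K-means in which a centroid closer than merge_dist to a kept one is dropped.

    Dropping the centroid merges the two clusters, because its members are
    reassigned on the next pass (an assumed, simplified merge rule).
    """
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(n_iter):
        labels = [nearest(v, centroids) for v in vectors]
        groups = {}
        for v, lab in zip(vectors, labels):
            groups.setdefault(lab, []).append(v)
        # Update step: mean of each non-empty cluster.
        means = [[sum(col) / len(g) for col in zip(*g)] for g in groups.values()]
        centroids = []
        for m in means:
            if all(dist(m, c) >= merge_dist for c in centroids):
                centroids.append(m)
    labels = [nearest(v, centroids) for v in vectors]
    return centroids, labels


def prototypes(vectors, centroids, labels):
    """Map each cluster label to the index of the document nearest its centroid."""
    best = {}
    for i, (v, lab) in enumerate(zip(vectors, labels)):
        d = dist(v, centroids[lab])
        if lab not in best or d < best[lab][0]:
            best[lab] = (d, i)
    return {lab: i for lab, (d, i) in best.items()}


# Invented sample notes, two rough "templates".
notes = [
    "chief complaint cough history of present illness patient reports cough for three days",
    "chief complaint fever history of present illness patient reports fever for three days",
    "chief complaint cough history of present illness patient reports cough for two days",
    "procedure note colonoscopy performed under sedation no polyps found follow up in ten years",
    "procedure note colonoscopy performed under sedation two polyps found follow up in three years",
    "procedure note colonoscopy performed under sedation no polyps found follow up in five years",
]
vectors = [simhash_vector(n) for n in notes]
centroids, labels = kmeans_with_merge(vectors, k=3, merge_dist=2.0)
for lab, idx in sorted(prototypes(vectors, centroids, labels).items()):
    print(lab, notes[idx])
```

Because the prototype is always an actual document rather than the centroid itself, the output is a real note that can be read as a template. The `merge_dist` threshold here is arbitrary and would need tuning on real data.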



References

1. Dubois S, Romano N, Kale DC, Shah N, Jung K (2017) Effective representations of clinical notes. arXiv preprint arXiv:1705.07025

2. Tan P, Steinbach M, Karpatne A, Kumar V (2019) Introduction to data mining, 2nd edn. Pearson Education India, London

3. Naming clusters (2017) Dataiku.com. https://academy.dataiku.com/cluster-models/513439. Accessed 5 June 2020

4. Doing-Harris K, Patterson O, Igo S, Hurdle J (2013) Document sublanguage clustering to detect medical specialty in cross-institutional clinical texts. In: Proceedings of the 7th international workshop on data and text mining in biomedical informatics. ACM, pp 9–12

5. Patterson O, Hurdle JF (2011) Document clustering of clinical narratives: a systematic study of clinical sublanguages. In: AMIA annual symposium proceedings. American Medical Informatics Association, p 1099

6. Zhang R, Pakhomov S, Melton GB (2014) Longitudinal analysis of new information types in clinical notes. In: AMIA joint summits on translational science proceedings. American Medical Informatics Association, pp 232–237

7. Cohen R, Elhadad M, Elhadad N (2013) Redundancy in electronic health record corpora: analysis, impact on text mining performance and mitigation strategies. BMC Bioinform 14:10

8. Downey D, Etzioni O, Soderland S (2010) Analysis of a probabilistic model of redundancy in unsupervised information extraction. Artif Intell 174(11):726

9. Zhang R, Pakhomov S, McInnes BT, Melton GB (2011) Evaluating measures of redundancy in clinical texts. In: AMIA annual symposium proceedings, pp 1612–1620

10. Steinbach M, Karypis G, Kumar V (2000) A comparison of document clustering techniques. In: KDD workshop on text mining, vol 400, no 1, pp 525–526

11. Keogh E, Mueen A (2017) Curse of dimensionality. In: Sammut C, Webb GI (eds) Encyclopedia of machine learning and data mining. Springer, Boston, MA. https://doi.org/10.1007/978-1-4899-7687-1

12. Charikar MS (2002) Similarity estimation techniques from rounding algorithms. In: Proceedings of the thirty-fourth annual ACM symposium on theory of computing. ACM, pp 380–388

13. Sadowski C, Levin G (2007) Simhash: hash-based similarity detection. Technical report, Google

14. Boley D, Gini M, Gross R, Han E, Hastings K, Karypis G, Kumar V, Mobasher B, Moore J (1999) Partitioning-based clustering for web document categorization. Decis Support Syst 27(3):329–341

15. Zhao Y, Karypis G, Fayyad U (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Discov 10(2):141–168

16. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the second international conference on knowledge discovery and data mining. AAAI, pp 226–231

17. Liao W, Liu Y, Choudhary A (2004) A grid-based clustering algorithm using adaptive mesh refinement. In: 7th workshop on mining scientific and engineering datasets of SIAM international conference on data mining, vol 22. SIAM, pp 61–69

18. Zhong S, Ghosh J (2003) A unified framework for model-based clustering. J Mach Learn Res 4:1001–1037

19. Wu H, Luk R, Wong K, Kwok K (2008) Interpreting TF-IDF term weights as making relevance decisions. ACM Trans Inf Syst 26(3):13

20. Hui S, Dechao Z (2016) A weighted topical document embedding based clustering method for news text. In: 2016 IEEE information technology, networking, electronic and automation control conference. IEEE, pp 1060–1065

21. Sood S (2011) Probabilistic simhash matching. Doctoral dissertation. Texas A&M University

22. Indyk P, Motwani R (1998) Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of ACM STOC, pp 604–613

23. Svenstrup DT, Hansen J, Winther O (2017) Hash embeddings for efficient word representations. In: Advances in neural information processing systems, pp 4928–4936

24. Sim J, Wright CC (2005) The kappa statistic in reliability studies: use, interpretation, and sample size requirements. Phys Ther 85(3):257–268

25. McHugh ML (2012) Interrater reliability: the kappa statistic. Biochem Med 22(3):276–282

26. Institute of Medicine Committee on Quality of Health Care in America (2001) Crossing the quality chasm: a new health system for the 21st century. National Academies Press, Washington

27. Clark A (1998) Being there: putting brain, body, and world together again. MIT Press, Cambridge

28. Kashyap V, Turchin A, Morin L, Chang F, Li Q, Hongsermeier T (2006) Creation of structured documentation templates using natural language processing techniques. In: AMIA annual symposium proceedings. American Medical Informatics Association, p 977

29. Ferrández O, South BR, Shen S, Friedlin FJ, Samore MH, Meystre SM (2013) BoB, a best-of-breed automated text de-identification system for VHA clinical documents. J Am Med Inform Assoc 20(1):77–83

Title: A Clustering Algorithm Based on Document Embedding to Identify Clinical Note Templates
Authors: Chunlei Tang, Joseph Michael Plasek, Yun Xiong, Zhikun Zhang, David Westfall Bates, Li Zhou
Publication date: 06-06-2020
Publisher: Springer Berlin Heidelberg
Published in: Annals of Data Science / Issue 3/2021
Print ISSN: 2198-5804
Electronic ISSN: 2198-5812
DOI: https://doi.org/10.1007/s40745-020-00296-8

