2021 | OriginalPaper | Book chapter

Staging Cancer Through Text Mining of Pathology Records

Written by: Pietro Belloni, Giovanna Boccuzzo, Stefano Guzzinati, Irene Italiano, Carlo R. Rossi, Massimo Rugge, Manuel Zorzi

Published in: Data Science and Social Research II

Publisher: Springer International Publishing

Abstract

Healthcare record systems store valuable information, and over 40% of it is estimated to be unstructured free clinical text. The Veneto Cancer Registry provided a collection of pathology records: these medical records refer to cases of melanoma and contain free text, in particular the diagnosis. The aim of this research is to extract from the free text the size of the primary tumour, the involvement of lymph nodes, the presence of metastasis, and the cancer stage. This goal is pursued with text mining techniques based on a supervised statistical approach. Since extracting information from free text can be framed as a statistical classification problem, we apply several machine learning models to extract these variables from the text. A gold standard for these variables is available: the clinical records have already been assessed case by case by an expert. Among the estimated models, gradient boosting performs best. Despite this good performance, the classification error is not low enough for this kind of text mining procedure to be used in a Cancer Registry as proposed.
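To make the idea of treating information extraction as supervised classification concrete, the following is a minimal sketch, not the authors' implementation. It assumes a binary bag-of-words representation and a small hand-written gradient boosting routine (squared-error loss, one-term decision stumps). The example texts, the labels, and all function names are invented for illustration; the chapter's actual preprocessing, features, and model settings are not given on this page.

```python
# Toy sketch: classify free-text records with bag-of-words + gradient boosting.
# All data and names below are hypothetical and for illustration only.

def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real preprocessing)."""
    return text.lower().split()

def build_vocabulary(docs):
    """Collect every distinct token seen in the training documents."""
    return sorted({tok for doc in docs for tok in tokenize(doc)})

def to_features(doc, vocab):
    """Binary bag-of-words vector: 1 if the term occurs in the document."""
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocab]

def fit_stump(X, residuals):
    """Pick the term whose presence/absence best splits the residuals."""
    best = None
    for j in range(len(X[0])):
        on = [r for x, r in zip(X, residuals) if x[j] == 1]
        off = [r for x, r in zip(X, residuals) if x[j] == 0]
        if not on or not off:
            continue  # term occurs in all or no documents: no usable split
        mean_on, mean_off = sum(on) / len(on), sum(off) / len(off)
        sse = (sum((r - mean_on) ** 2 for r in on)
               + sum((r - mean_off) ** 2 for r in off))
        if best is None or sse < best[0]:
            best = (sse, j, mean_on, mean_off)
    return None if best is None else best[1:]

def fit_gradient_boosting(X, y, n_rounds=20, learning_rate=0.5):
    """Fit stumps sequentially to the residuals of the current prediction."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        j, mean_on, mean_off = stump
        stumps.append(stump)
        preds = [p + learning_rate * (mean_on if x[j] else mean_off)
                 for x, p in zip(X, preds)]
    return base, learning_rate, stumps

def predict(doc, vocab, model):
    """Return 1 if the boosted score is at least 0.5, else 0."""
    base, learning_rate, stumps = model
    x = to_features(doc, vocab)
    score = base + sum(learning_rate * (mean_on if x[j] else mean_off)
                       for j, mean_on, mean_off in stumps)
    return 1 if score >= 0.5 else 0

# Invented training texts; label 1 means "metastasis present" per a
# hypothetical expert-assigned gold standard.
train_docs = [
    "melanoma with distant metastasis",
    "metastasis detected in lung",
    "cutaneous melanoma metastasis present",
    "melanoma in situ margins clear",
    "thin melanoma margins clear",
    "superficial melanoma completely excised",
]
train_labels = [1, 1, 1, 0, 0, 0]

vocab = build_vocabulary(train_docs)
X_train = [to_features(doc, vocab) for doc in train_docs]
model = fit_gradient_boosting(X_train, train_labels)
```

Calling `predict("metastasis in lymph node", vocab, model)` classifies an unseen text with the fitted model. A sketch like this ignores negation ("no metastasis"), abbreviations, and spelling variation, which is one reason classification error on real pathology text can remain too high for routine registry use.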

Data analysis is based on anonymized data that have been analysed at the Cancer Registry of the Veneto Health Care System (Azienda Zero) after a formal agreement with the University of Padua.

Title: Staging Cancer Through Text Mining of Pathology Records
Written by: Pietro Belloni, Giovanna Boccuzzo, Stefano Guzzinati, Irene Italiano, Carlo R. Rossi, Massimo Rugge, Manuel Zorzi
Publisher: Springer International Publishing
Book: Data Science and Social Research II
Print ISBN: 978-3-030-51221-7
Electronic ISBN: 978-3-030-51222-4
Copyright year: 2021
DOI: https://doi.org/10.1007/978-3-030-51222-4_4
