
2021 | Original Paper | Book Chapter

Staging Cancer Through Text Mining of Pathology Records

Written by: Pietro Belloni, Giovanna Boccuzzo, Stefano Guzzinati, Irene Italiano, Carlo R. Rossi, Massimo Rugge, Manuel Zorzi

Published in: Data Science and Social Research II

Publisher: Springer International Publishing


Abstract

Healthcare record systems store valuable information, and over 40% of it is estimated to be unstructured, in the form of free clinical text. The Veneto Cancer Registry provided a collection of pathology records: these medical records refer to cases of melanoma and contain free text, in particular the diagnosis. The aim of this research is to extract from the free text the size of the primary tumour, the involvement of lymph nodes, the presence of metastasis, and the cancer stage. We pursue this goal with text mining techniques based on a supervised statistical approach. Since extracting information from free text can be traced back to a statistical classification problem, we apply several machine learning models to extract these variables from the text. A gold standard for the variables is available: the clinical records have already been assessed case by case by an expert. The best-performing of the estimated models is gradient boosting. Despite its good performance, the classification error is not low enough for this kind of text mining procedure to be used in a Cancer Registry as proposed here.
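The abstract's framing of information extraction as supervised classification can be illustrated with a minimal sketch. It is not the authors' pipeline: the four training sentences and their labels are invented toy data, all function names are hypothetical, and the booster is a simplified variant (squared-error loss, one-split decision stumps on term presence) chosen so that the example runs with the Python standard library alone.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return cleaned.split()

def vectorize(doc, vocab):
    """Bag-of-words: one term count per vocabulary entry."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]

def fit_stump(X, residuals):
    """Pick the term whose presence/absence split best fits the residuals."""
    best = None
    for j in range(len(X[0])):
        present = [r for x, r in zip(X, residuals) if x[j] > 0]
        absent = [r for x, r in zip(X, residuals) if x[j] == 0]
        if not present or not absent:
            continue  # this term does not split the training set
        mean_p = sum(present) / len(present)
        mean_a = sum(absent) / len(absent)
        sse = (sum((r - mean_p) ** 2 for r in present)
               + sum((r - mean_a) ** 2 for r in absent))
        if best is None or sse < best[0]:
            best = (sse, j, mean_p, mean_a)
    return best

def fit_boosting(X, y, n_rounds=10, learning_rate=0.5):
    """Gradient boosting with squared-error loss: each stump fits the residuals."""
    base = sum(y) / len(y)
    preds = [base] * len(y)
    stumps = []
    for _ in range(n_rounds):
        # For squared error, the negative gradient is the plain residual.
        residuals = [yi - pi for yi, pi in zip(y, preds)]
        stump = fit_stump(X, residuals)
        if stump is None:
            break
        _, j, mean_p, mean_a = stump
        stumps.append((j, mean_p, mean_a))
        preds = [p + learning_rate * (mean_p if x[j] > 0 else mean_a)
                 for x, p in zip(X, preds)]
    return base, stumps, learning_rate

def predict(model, x):
    """Sum the base value and all stump contributions, then threshold at 0.5."""
    base, stumps, lr = model
    score = base + sum(lr * (mp if x[j] > 0 else ma) for j, mp, ma in stumps)
    return 1 if score >= 0.5 else 0

# Invented toy data (assumption): label 1 = lymph node involvement reported.
train_docs = [
    "metastasis detected in sentinel lymph node",
    "lymph node positive for melanoma metastasis",
    "sentinel lymph node negative",
    "lymph nodes free of tumour",
]
train_labels = [1, 1, 0, 0]

vocab = sorted({tok for doc in train_docs for tok in tokenize(doc)})
X_train = [vectorize(doc, vocab) for doc in train_docs]
model = fit_boosting(X_train, train_labels)

print(predict(model, vectorize("metastasis in lymph node", vocab)))
print(predict(model, vectorize("lymph node negative", vocab)))
```

In a real setting the labels would come from the expert-assessed gold standard, the text would be preprocessed for the language of the records, and performance would be measured on held-out records rather than on hand-picked sentences.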


Footnotes
1
Data analysis is based on anonymized data that have been analysed at the Cancer Registry of the Veneto Health Care System (Azienda Zero) after a formal agreement with the University of Padua.
 
Metadata
Title
Staging Cancer Through Text Mining of Pathology Records
Written by
Pietro Belloni
Giovanna Boccuzzo
Stefano Guzzinati
Irene Italiano
Carlo R. Rossi
Massimo Rugge
Manuel Zorzi
Copyright year
2021
DOI
https://doi.org/10.1007/978-3-030-51222-4_4
