Methods Inf Med 2013; 52(05): 382-394
DOI: 10.3414/ME12-01-0092
Original Articles
Schattauer GmbH

Feasibility of Feature-based Indexing, Clustering, and Search of Clinical Trials[*]

A Case Study of Breast Cancer Trials from ClinicalTrials.gov
M. R. Boland (1), R. Miotto (1), J. Gao (1), C. Weng (1, 2)
1   Department of Biomedical Informatics, Columbia University, New York, NY, USA
2   The Irving Institute for Clinical and Translational Research, Columbia University, New York, NY, USA

Publication History

received: 03 October 2012

accepted: 21 February 2013

Publication Date:
20 January 2018 (online)

Summary

Background: When standard therapies fail, clinical trials provide experimental treatment opportunities for patients with drug-resistant illnesses or terminal diseases. Clinical trials can also provide free treatment and education for individuals who otherwise may not have access to such care. To find relevant clinical trials, patients often search online; however, they encounter a significant barrier because of the large number of trials and the ineffective indexing methods available for reducing the trial search space.

Objectives: This study explores the feasibility of feature-based indexing, clustering, and search of clinical trials and informs designs to automate these processes.

Methods: We decomposed 80 randomly selected stage III breast cancer clinical trials into a vector of eligibility features, which were organized into a hierarchy. We clustered trials based on their eligibility feature similarities. In a simulated search process, manually selected features were used to generate specific eligibility questions to filter trials iteratively.
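The steps named in the Methods (feature vectors, similarity between trials, and iterative filtering by eligibility questions) can be illustrated with a minimal sketch. The trial names and features below are hypothetical, Jaccard similarity is used only as one plausible similarity measure, and treating every feature as a required inclusion criterion is a simplifying assumption; none of this is the study's actual implementation.

```python
# Hypothetical trials and eligibility features, for illustration only.
trials = {
    "trial_A": {"age >= 18", "HER2 positive", "prior chemotherapy"},
    "trial_B": {"age >= 18", "HER2 positive", "measurable disease"},
    "trial_C": {"age >= 18", "ER positive", "postmenopausal"},
}


def feature_index(trials):
    """Return the sorted list of all distinct features across the trials."""
    return sorted(set().union(*trials.values()))


def to_vector(features, index):
    """Encode one trial as a binary vector over the shared feature index."""
    return [1 if f in features else 0 for f in index]


def jaccard(a, b):
    """Jaccard similarity of two feature sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_trials(trials, answers):
    """Drop trials that require a feature the patient answered 'no' to.

    `answers` maps a feature to True/False. Assumption: every feature is
    a required inclusion criterion, so a 'yes' answer excludes nothing.
    """
    remaining = dict(trials)
    for feature, has_it in answers.items():
        if not has_it:
            remaining = {
                name: feats
                for name, feats in remaining.items()
                if feature not in feats
            }
    return remaining


index = feature_index(trials)
vector_a = to_vector(trials["trial_A"], index)
similarity_ab = jaccard(trials["trial_A"], trials["trial_B"])
after_question = filter_trials(trials, {"HER2 positive": False})
```

In this toy setup, each answered question narrows the candidate set, and the pairwise similarities could feed any standard clustering routine.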

Results: We extracted 1,437 distinct eligibility features and achieved an inter-rater agreement of 0.73 for feature extraction for 37 frequent features occurring in more than 20 trials. Using all 1,437 features, we stratified the 80 trials into six clusters containing trials recruiting similar patients by patient-characteristic features, five clusters by disease-characteristic features, and two clusters by mixed features. Most of the features were mapped to one or more Unified Medical Language System (UMLS) concepts, demonstrating the utility of named entity recognition prior to mapping with the UMLS for automatic feature extraction.
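The summary does not state which agreement statistic produced the value of 0.73. As a hedged illustration of how chance-corrected agreement on feature extraction can be computed, the sketch below implements Cohen's kappa for two raters; the choice of statistic, the function name, and the example ratings are assumptions, not details taken from the study.

```python
def cohen_kappa(rater1, rater2):
    """Cohen's kappa for two raters labelling the same items.

    Each argument is a list of labels, e.g. 1 if the rater marked a
    feature as present in a trial and 0 otherwise.
    """
    if len(rater1) != len(rater2) or not rater1:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater1)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Agreement expected by chance from each rater's label frequencies.
    labels = set(rater1) | set(rater2)
    expected = sum(
        (rater1.count(label) / n) * (rater2.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings: did each rater extract a given feature from 4 trials?
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

With more than two raters, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.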

Conclusions: It is feasible to develop feature-based indexing and clustering methods for clinical trials to identify trials with similar target populations and to improve trial search efficiency.

* Supplementary material published on our website www.methods-online.com


 