Introduction
- To the best of our knowledge, this survey is the first to provide a bird's-eye view of healthcare KG construction.
- A new representative taxonomy is outlined to facilitate KG construction in the healthcare domain.
- An in-depth analysis of state-of-the-art KG construction methodologies is provided, and their main strengths and weaknesses are discussed.
- A summary of the research findings and remaining issues is presented, paving the way for future research.
Survey methodology
Groundworks
An overview of KG
Generic and domain specific KG
A taxonomy of healthcare KG construction
Levels of knowledge extraction
Entity-level
Relation-level
Types of knowledge base
Types of knowledge resources
KG evaluation metrics
State of the art review
Drug discovery, repurposing and adverse reaction
Ref. | KG Specific Functionality | Knowledge Extraction Techniques (entity-level; relation-level) | Type of KB | KG Resource(s) | KG Stats | Evaluation Measure(s) | Shortcoming(s)
---|---|---|---|---|---|---|---
[46] | Drug discovery | Manual and fuzzy matching | Schema-based | Wikidata, DrugBank, WebMD, and GoodRx | N/A | R, P | • Lack of statistics on the resultant KG. • Limited discussion of the ontology design. • The evaluation of the proposed model emphasized the KG embedding rather than the resultant integrated KG.
[47] | Drug discovery for COVID-19 | Manual construction based on six KGs obtained from the literature | Schema-based | Literature on COVID-19 | #n: 100,00 #e: 670,000 | AUC and AUPRC | • Insufficient discussion of the mechanism used to integrate the incorporated KGs. • The evaluation of Att-GCN-DDI is limited and not detailed.
[48] | Drug discovery | Manual extraction based on Bio2RDF KG | Hybrid | Bio2RDF | #n: 2,947,140 #e: 10,131,654 | AUC, AUPR, F1 | • Inadequate discussion of the construction of the drug KG.
[55] | Drug repurposing | Algorithms developed at BenevolentAI and part of their IP | Hybrid | Structured and unstructured resources, including literature on COVID-19 | #n: millions #e: hundreds of millions | Case study | • No detailed discussion of the mechanism used to construct the BenevolentAI graph. • The evaluation was limited to a case study.
[56] | Drug repurposing | Entity-level: coarse- and fine-grained entity extraction; relation-level: manual, based on CTD and MeSH | Schema-based | Multimodal scientific literature (CTD) | #n: 67,217 #e: 77,844,574 | Case study on drug repurposing report generation | • Although the proposed framework successfully tackled the quantity issue of relevant KG resources, the quality issue was not properly evaluated to demonstrate its effectiveness. • Observed bias in training and development data, source, and test queries.
[57] | Drug repurposing | Manually encoded in Biological Expression Language | Schema-free | PubMed, LitCovid, EuropePMC, etc. | #n: 4,016 #e: 10,232 | Case study (gene expression analysis) | • The manual mechanism used to construct the KG scales poorly.
[58] | Drug repurposing | Cross-referencing | Schema-based | PharmGKB, TTD, KEGG DRUG, DrugBank, SIDER, and DID | N/A | Case study (finding drug–disease pairs) | • The data model used for data integration could be improved by using a formal domain ontology to better conceptualize the domain.
[67] | Prediction of adverse drug reactions | Direct construction from structural databases | Schema-free | DrugBank and SIDER databases | #n: 12,473 #e: 154,239 | P, R, F1, AUC, and a case study on drug-induced liver injury | • The KG omits information on drugs and protein targets. • The scope of information perceived by entities could be enlarged by using longer paths in the KG as input to the Word2Vec model.
[66] | Prediction of adverse drug reactions | Direct construction from structural databases | Schema-free | DrugBank, SIDER | #n: 5,828 #e: 70,382 | AUC and case study (validation in EHRs and EudraVigilance) | • No clear discussion of the KG construction approach. • Insufficient discussion of the methodology followed in the ML benchmark comparison.
[68] | Discovery of adverse drug reactions | Entity-level: cTAKES; relation-level: naive Bayesian model | Schema-based | MEDLINE | #n: 9,699 #e: 139,254 | Co-occurrence analysis and case study (Osimertinib) | • The computed drug-biomarker groupings cannot differentiate between a drug-treatment relationship. • The study pays no attention to drug-drug interactions. • No rationale is given for the choice of entity extraction method.
[75] | Drug action | Automatic, using a rule-based approach | Schema-free | Medical papers | #n: 40,963 #e: 57,865 | R and accuracy | • Lack of verification of the text prior to KG construction. • Limited comparison with existing similar KGs.
Diseases and disorders
Ref. | KG Specific Functionality | Knowledge Extraction Techniques (entity-level; relation-level) | Type of KB | KG Resource(s) | KG Stats | Evaluation Measure(s) | Shortcoming(s)
---|---|---|---|---|---|---|---
[15] | Cardiovascular domain | Entity-level: LSTM-CR; relation-level: pattern-based and supervised learning methods | Hybrid | UMLS, EMRs, medical standards, and expert knowledge | #n: 8,293,284 #e: 32,256,360 | Evaluation is conducted on the embedded modules | • The overall framework requires a detailed case study to evaluate the effectiveness of integrating the proposed modules.
[45] | Subarachnoid hemorrhage | Entity-level: semantic analysis (ontologies: LBO, IAO, etc.); relation-level: automatic (rule-based) | Schema-based | Clinical notes and brain angiograms | N/A | P, R, F1, and AC | • Limited discussion of the KG statistics. • The overall framework requires a detailed case study to evaluate the effectiveness of integrating the proposed modules.
[76] | Hepatocellular carcinoma | SemRep, rule-based method, and BioIE (with Att-BiLSTM-CRF) | Schema-based | PubMed, SemMedDB, UpToDate, and Clinical Trials | #n: 5,028 #e: 13,296 | Accuracy | • The KG was not properly evaluated on a real-life case study addressing hepatocellular carcinoma. • No detailed discussion of the mechanism used to address the presented disagreements.
[80] | Stroke | NLTK, PKDE4J, and Bio-BERT | Schema-free |  | #n: 46 k #e: 157 k | P, R, F1 | • The constructed KG is limited to the Chinese context, which makes it hard to replicate and to extend into a more comprehensive map of medical knowledge.
[77] | Diagnosis and treatment of viral hepatitis B | Entity-level: N/A; relation-level: N/A | Schema-based | EMR (8,544 patients in China) | #n: 8,563 #e: 96,896 | N/A | • No proper evaluation was conducted. • No discussion of the mechanism used to construct the KG.
[83] | Coronavirus pneumonia-related diseases | Entity-level: CRF; relation-level: Bio-BERT | Schema-free | COVID-19 scientific literature | #n: 10,993 #e: 1,204,234 | Specificity, P, R, F1, and AC | • The entity and relation extraction datasets are provided, but the mechanism used to conduct the experiments on them is not discussed.
[78] | Identifying disease-gene associations | Entity-level: N/A; relation-level: N/A | Schema-free |  | #n: 103,625 #e: 3,273,215 | N/A | • No discussion of the mechanism used to extract entities and relationships. • The construction of the KG itself is not evaluated.
[79] | Myopia prevention | Automatic, using a Python script | Schema-based | Baidu Encyclopedia, Chinese Wikipedia, and professional websites | #n: N/A #e: N/A | N/A | • The mechanisms used to extract entities and relationships are not described. • No proper evaluation is undertaken.
[93] | Depression disorder | XMedlan, semantic queries with regular expressions | Hybrid | PubMed, Clinical Trials, DrugBank, DrugBook, Wikipedia, SIDER, and UMLS | #e: 8,892,722 | Use cases | • Lack of proper evaluation. • Insufficient use of other important medical repositories. • Lack of discussion of both the knowledge integration methodology and the KG statistics.
[91] | Autism spectrum disorder | Entity-level: MinHash lookup/UMLS; relation-level: skip-gram and k-means++ | Schema-free | PubMed (autism spectrum disorder-related article abstracts) | #n: 6,827 #e: 16,192 | Hit@k | • Extracted relations are coarse-grained. • Semantically related relations are difficult to distinguish. • Insufficient overall evaluation of the model.
[94] | Depression | SemRep, OpenIE, and rule-based method | Schema-based | SemMedDB, PubMed | #n: 3,055 #e: 30 | Jaccard | • Poor data quality. • The utility of the KG was not well proven.
[95] | Metabolism-depression associations | Manual curation and extraction by domain experts (traditional logical rules) | Schema-based | KEGG and scientific literature | #n: 3,724,526 #e: 5,725,821 | Case study | • Ineffective inferences due to the incorporated traditional logical rules. • Automatic extraction methods are required to enrich the functional diversity of the depression KG.
Biomedicine
Ref. | KG Specific Functionality | Knowledge Extraction Techniques (entity-level; relation-level) | Type of KB | KG Resource(s) | KG Stats | Evaluation Measure(s) | Shortcoming(s)
---|---|---|---|---|---|---|---
[100] | Generic biomedicine | Manual integration and mapping of entities and relationships | Schema-based | OMIM, DrugBank, PharmGKB, Therapeutic Target Database, SIDER, and HumanNet | #n: 7,603 #e: 500,958 | Hits@N and downstream tasks | • The quality and integrity of the metadata cannot be fully assured. • The final version of the constructed graph has fewer entities than state-of-the-art KGs. • No discussion of the adopted ontology.
[101] | Generic biomedicine | Entity-level: PubTator and manual annotation (EBC); relation-level: Stanford Dependency Parser | Schema-free | Biomedical literature (Medline abstracts) | #n: N/A #e: 2,236,307 | Benchmark comparison | • Heavily dependent on the co-occurrence of paths to map scarcer paths to themes. • Complex relations are not handled. • Parser errors are possible.
[102] | Translational biomedicine | Manual and automatic, using Snakemake | Schema-based | 70 knowledge sources including SemMedDB, ChEMBL, etc. | #n: 6.4 m #e: 39.3 m | Benchmark comparison | • The automated process used to construct the KG was not detailed. • The comparison with other KGs is neither well discussed nor well formulated.
[103] | Biomedical causal discovery | Manual and rule-based approach | Schema-free | PubMed | #n: N/A #e: N/A | Accuracy | • The paper failed to extract implicit causality. • The process used to identify concepts and the relationships between them is not detailed.
[82] | Marine Chinese medicine | Manual mapping between the ontology and the KG | Schema-based | Medical literature | #n: N/A #e: N/A | N/A | • The paper inadequately described the construction and evaluation of the proposed KG.
[104] | Generic biomedicine | Entity-level: BioDBLinker; relation-level: automatic mapping | Schema-free |  | #n: N/A #e: N/A | Benchmark comparison | • Suffers from data sparsity. • Risk of train-test data leakage if used without careful review.
[105] | Intestinal cells | Manual, based on the conceptual model | Schema-based | PubMed | #n: 2,443 #e: 160,253 | Case study | • Poor entity and relation extraction approaches. • The data source is static and limited to medical literature, yet medical facts about intestinal cells can be obtained from future experiments.
[112] | Microbiology | NER and NLP techniques | Schema-based | KG Hub – COVID-19 | #n: 266,000 #e: 432,000 | N/A | • Poor discussion of the mechanisms used to construct and validate the KG.
[113] | Gut microbiota | Manual annotation and mapping | Schema-based | Google Scholar and PubMed, UMLS, MeSH, SNOMED CT, and KEGG | #f: 31,268,998 | Case studies | • Poor extraction of entities and relations. • The correctness and completeness of the extracted relations limit the precision and reliability of semantic search.
[114] | Microbe-disease associations | Kindred entity and relation classifier | Schema-free | Wikidata, UMLS, NCBI | #n: 9,832 #e: 21,905 | Hits@N | • The KG could be expanded by means of a bacterial attribute mining tool. • Lacks a discussion of interactions between bacteria and antibiotics or viruses.
[115] | Coronavirus | Manual extraction and mapping | Schema-free | Analytical Graph (AG) and CORD-19 | #n: 588,820 #e: N/A | Case study | • Limited data sources. • Static KG.
[116] | Coronavirus | BioBERT | Schema-free | PubMed and CORD-19 | #n: N/A #e: N/A | P, R, and F1-score | • The KG could be expanded to other biomedical datasets. • Further biomedical NLP models for NER, e.g., BlueBERT, could be tried to verify the validity of the extracted knowledge.
Miscellaneous healthcare
Ref. | KG Specific Functionality | Knowledge Extraction Techniques (entity-level; relation-level) | Type of KB | KG Resource(s) | KG Stats | Evaluation Measure(s) | Shortcoming(s)
---|---|---|---|---|---|---|---
[120] | A generic medical KG of patient visits | Entity-level: BMM, BiLSTM-CRF, and pattern recognizer; relation-level: nine predefined relations | Schema-free | Southwest Hospital in China: 16,217,270 de-identified visits of 3,767,198 patients | #n: 22,508 #e: 579,094 | R, P, F1, and NDCG | • KG embedding was designed around, and limited to, Bi-LSTM without considering other state-of-the-art techniques. • The evaluation was mainly conducted on the embedded components. • Beyond the preliminary discussion of applications, an overall evaluation of the KG is lacking.
[121] | KG of online EMR and emergency department | Entity-level: N/A; relation-level: N/A | Schema-free | BIDMC dataset and EMRs from an emergency department | #n: N/A #e: N/A | F1 and the area under the precision-recall curve | • The provided statistics describe the sources of the KG; statistics on the KG's entities and edges are missing. • No discussion of the mechanism used to construct the KG in terms of entities and relations.
[133] | Smart healthcare management | Entity-level: CRF; relation-level: manual and classification-based algorithms | Schema-based |  | #n: 1,169 #e: 9,707 | R, P, and F1 | • The resultant KG could be consolidated with information about diseases and drugs, linked to symptom entities.
[128] | Q&A | Entity-level: BiLSTM-CRF; relation-level: manual | Schema-free | EMRs from a hospital in Shanghai | #n: 44,111 #e: 203,308 | R, F1, and accuracy | • Lack of a comparative study of the model. • Limited practicability of the system. • Limited size and pretreatment of the corpus.
[129] | Q&A | BiLSTM + CRF | Schema-free | National Service Platform for Famous Old Chinese Medicine Experience | #n: N/A #e: N/A | Case study and hit ratio | • Poor KG with a minimal number of entities and relationships.
[42] | Q&A | Plausible reasoning | Schema-free | BioASQ, DrugBank, Disease Ontology, and SemMedDB | #n: N/A #e: N/A | Domain expert's verification | • Insufficient evaluation. • The performance of the query rewriting algorithm is not evaluated.
[130] | Q&A | Automatic mapping | Schema-free | Chinese medical websites | #n: 18,687 #e: 88,858 | Case study | • Poor discussion of the extraction of entities and relationships. • The QA system does not exhibit utility due to inapplicable results.
[131] | Q&A | Entity-level: Jieba; relation-level: automatic mapping | Schema-free | A medical company (YiFeng Pharmacy) | #n: 34,788 #e: 601,475 | Training and decision accuracy, cost, and time | • The construction of the KG is not validated. • The system can answer only one intention per question and thus cannot answer questions with multiple intentions.
[146] | COVID-19 clinical research | Entity-level: Stanza's NER; relation-level: Stanza's Bi-LSTM | Schema-free | Artificial Intelligence in Medicine | #n: N/A #e: N/A | Baseline comparison | • Lack of statistics on entities and relationships. • Poor KG validation method.
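The abbreviations P, R, and F1 in the Evaluation Measure(s) column of the tables above stand for precision, recall, and F1-score. As a minimal sketch of how these measures could be computed for a set of extracted triples against a gold-standard set, consider the following; the function name `triple_prf` and the placeholder triples are assumptions made for illustration and are not taken from any surveyed paper.

```python
def triple_prf(extracted, gold):
    """Compute precision, recall, and F1 of extracted triples against a gold set.

    Both arguments are iterables of (head, relation, tail) tuples.
    A triple counts as correct only if it matches a gold triple exactly.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical placeholder triples (not real medical facts).
gold_triples = [
    ("drug_A", "treats", "disease_X"),
    ("drug_A", "causes", "effect_P"),
    ("drug_B", "treats", "disease_Y"),
    ("gene_Z", "associated_with", "disease_X"),
]
extracted_triples = [
    ("drug_A", "treats", "disease_X"),   # correct
    ("drug_B", "treats", "disease_Y"),   # correct
    ("drug_B", "causes", "effect_P"),    # not in the gold set
]

precision, recall, f1 = triple_prf(extracted_triples, gold_triples)
```

Exact matching is the strictest choice; a study could instead credit partial matches (for example, correct entities with a wrong relation label), which would change all three values.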
Summary
Findings, open issues, and opportunities
- KG data sources: Many previous studies concentrated on knowledge curated and facts captured from a limited number of data sources. For example, certain KGs were constructed using only biomedical scientific publications (e.g. PubMed and SemMedDB) [94, 103, 105]. Knowledge extracted from such sources alone lacks completeness, leading to poorly described entities and potentially flawed relationships within a particular healthcare domain. This also limits the capacity of the graph to deliver useful facts or rules to power data-driven methods for healthcare decision making [45]. To consolidate a healthcare KG and establish a cohesive view of the domain, alternative sources need to be incorporated and integrated, including EMRs, PMRs, clinical trials, patient records, epidemiological surveillance, sensor data, disease registries, wearable devices, health workforce data, census data, implanted equipment, pill cameras, and other relevant sources. However, fully integrating such heterogeneous data sources can be complicated and time-consuming, especially for large-scale datasets to which traditional data assimilation and aggregation techniques are not applicable. There is therefore still room for research that addresses the big data problem in healthcare KGs by developing more advanced data collection and aggregation techniques.
- Healthcare knowledge interoperability: Linked Open Data (LOD) and Semantic Web technologies have made it possible to improve a variety of domain-specific applications [14, 151–153]. KGs represent an expansion of these efforts and are frequently connected with LOD initiatives because they improve data semantics by enhancing the conceptual representations of entities [154]. Appropriate interlinking of entities gathered from different data sources therefore facilitates information interoperability, resulting in multimodal KGs. However, some of the methodologies investigated in this study had difficulty attaining the appropriate level of knowledge expandability and interoperability. In particular, semantic expansion strategies were underutilised, and the opportunity to exploit freely accessible vocabularies and semantic resources was largely ignored. Expanding healthcare knowledge with health records collected from different channels, such as hospital admissions, family physician visits, prescription drugs, pharmacy requests, laboratory blood analyses, and death certificates, establishes a comprehensive individual health (or disease) profile [155]. This holistic view carries enormous implications for several research areas, such as epidemiology and precision medicine. The basic structure of KGs facilitates better data integration, unification, and information sharing. Semantic expansion adds context to the facts collected in a KG, enhances the quality of the aggregated knowledge, eliminates redundant records, and detects missing entities. Given the success of existing healthcare semantic expansion initiatives such as the Centre for Health Record Linkage (CHeReL) in Australia [156] and the Rochester Epidemiology Project in the USA [157], more research in this direction should be conducted.
- KG construction mechanisms: Constructing a KG comprises several activities, which vary with the type of knowledge base (schema-based, schema-free, or hybrid), the knowledge resources and their data types (structured or unstructured), the knowledge extraction techniques (entity-level and relation-level), etc. Several of the examined studies failed to adequately disclose the internal mechanisms used to build and implement their KGs. A commonly observed shortcoming was poor or limited discussion of either the overall construction methodology [48, 55, 66] or essential construction tasks such as ontology design [46, 100], entity and/or relation extraction [78, 83], and knowledge integration [47, 93]. Furthermore, many of the KGs described in those papers are not publicly available for inspection. These drawbacks hinder knowledge sharing, translation, and reuse, and make the proposed approaches difficult to replicate. This is particularly problematic in the healthcare domain, where knowledge replicability can help consolidate the facts about certain scientific tests and medical experiments [158]. Therefore, future studies must ensure that all steps of KG construction are well explained, and the resultant KG must be publicly shared with the community to reinforce the FAIR principles (Findable, Accessible, Interoperable, Reusable).
- KG evaluation: Despite the continuous proliferation of KGs for the healthcare domain and its sub-domains, this survey reports evident problems with KG evaluation and/or case study implementation. Numerous KGs were constructed with no proper concern for evaluating their quality [77–79, 82]. Additionally, the constructed KGs have shown only limited utility in real-life applications. Instead of targeting practical applications, the proposed KGs mainly attempted to provide an underlying conceptual structure of the domain using domain-specific entities, concepts, relationships, and events. For example, the authors of [76] built a KG for hepatocellular carcinoma with no verified utility in addressing the designated disease. Designing and implementing actionable healthcare analytics must be the essence of the KG construction philosophy, where relevant facts are obtained with the objective of conceptualising the correct context and addressing a domain problem, thereby achieving the hoped-for value. Future works must ensure that KGs are assessed using one or more appropriate evaluation and refinement methodologies, such as (i) silver and gold standards [159]; (ii) theoretically proven computational measures such as precision and recall; and (iii) domain experts. In addition, the constructed KG must prove its utility and verify its applicability in real-life scenarios and in the execution of downstream tasks.
- Data quality and privacy: Applying healthcare KGs to downstream tasks such as drug discovery, clinical decision support, and medical treatment relies heavily on the quality of the embedded facts. Although some of the examined works constructed their KGs from structured, verified, and curated data sources [42, 94, 104], other KGs imported data from unstructured sources (such as scientific medical literature or social media) with little regard for applying data quality measures before incorporating the extracted information [56, 105]. Freely available texts such as scientific medical literature commonly contain ambiguous data, abbreviations, and noisy data, including words and phrases irrelevant to the designated context. EMRs are also a vital source of embedded clinical data that can be either mistakenly neglected or hard to collect due to confidentiality constraints. These challenges raise concerns about the quality and reliability of KGs generated from such data sources. High-quality healthcare KGs should therefore be constructed by selecting high-quality data sources and developing quality measurement techniques. Advanced NLP and deep learning algorithms that can efficiently and automatically identify high-quality entities and relations should also be implemented wherever possible. Such tools should additionally be used to improve data privacy, integrity, and security, preventing malicious activities that attempt to abuse patients' sensitive medical information.
- Recentness: Most of the examined studies did not consider the temporal factor; their KGs are static in nature and often neglect the validity period of the incorporated triples. A healthcare KG built from a single snapshot of the knowledge landscape might not be a sustainable depiction of the designated domain, particularly with the emergence of wearable medical devices, sensors, health monitoring systems, and mobile applications [160], which make the construction of dynamic and frequently updated KGs a necessity. Ignoring the dynamic nature of healthcare knowledge degrades the quality and accuracy of the facts embedded in KGs, consequently leading to poor data analytics and decision making.
- Healthcare KG reasoning: KG reasoning aims to infer new facts and draw new conclusions from existing data, deriving new insights and enriching KGs with new relations. Several techniques have been proposed in the literature for KG reasoning, including ontology reasoning, logic rules, and random walk algorithms [161]. Recently, KG embedding approaches have attracted considerable attention in the research community due to their capacity to generalize and infer new facts. KG embedding techniques aim to transform the KG into a continuous low-dimensional vector space that preserves its semantics. The embedded KG can then be used for several downstream tasks, including link prediction, knowledge discovery, etc. [162]. This study reveals a relative lack of successful KG embedding strategies in the investigated papers.
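To make the embedding-based link prediction mentioned in the last item more concrete, the following is a minimal sketch of a translation-based scoring function in the style of TransE, one well-known KG embedding model. It is an illustrative assumption, not the method of any surveyed paper: the entity and relation names are placeholders, and the two-dimensional vectors are set by hand rather than learned from data.

```python
import math

# Hypothetical, hand-set 2-d embeddings (a real model would learn these).
ENTITY_VECS = {
    "drug_A": (0.0, 0.0),
    "disease_X": (1.0, 0.0),
    "disease_Y": (1.0, 0.2),
    "gene_Z": (0.0, 3.0),
}
RELATION_VECS = {"treats": (1.0, 0.0)}


def transe_score(head, relation, tail):
    """Return -||h + r - t||; scores closer to 0 mean a more plausible triple."""
    h = ENTITY_VECS[head]
    r = RELATION_VECS[relation]
    t = ENTITY_VECS[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Link prediction: rank all other entities as candidate tails, best first."""
    candidates = [e for e in ENTITY_VECS if e != head]
    return sorted(
        candidates,
        key=lambda tail: transe_score(head, relation, tail),
        reverse=True,
    )


ranking = rank_tails("drug_A", "treats")
```

With these hand-set vectors, candidates whose embedding lies near `drug_A + treats` rank first. In practice the vectors would be trained so that observed triples score higher than corrupted ones, and the ranking would be evaluated with measures such as Hits@N.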