2023 | Book

Health Information Processing. Evaluation Track Papers

8th China Conference, CHIP 2022, Hangzhou, China, October 21–23, 2022, Revised Selected Papers

edited by: Buzhou Tang, Qingcai Chen, Hongfei Lin, Fei Wu, Lei Liu, Tianyong Hao, Yanshan Wang, Haitian Wang, Jianbo Lei, Zuofeng Li, Hui Zong

Publisher: Springer Nature Singapore

Book series: Communications in Computer and Information Science

About this book

This book constitutes the papers presented at the Evaluation Track of the 8th China Conference on Health Information Processing, CHIP 2022, held in Hangzhou, China during October 21–23, 2022.
The 20 full papers included in this book were carefully reviewed and selected from 20 submissions. They were organized in topical sections as follows: text mining for gene-disease association semantic; medical causal entity and relation extraction; medical decision tree extraction from unstructured text; OCR of electronic medical document; clinical diagnostic coding.

Table of contents

Frontmatter

Text Mining for Gene-Disease Association Semantic

Frontmatter
Text Mining Task for “Gene-Disease” Association Semantics in CHIP 2022
Abstract
Gene-disease associations play a crucial role in healthcare knowledge discovery, yet a large amount of valuable information remains hidden in the literature. To address this problem, we designed and organized the Gene-Disease Association Semantics (GDAS) track in CHIP2022, which aims to automatically extract the semantic association between genes and diseases from the literature. The GDAS track includes three progressive subtasks: gene-disease concept recognition, semantic role labeling, and gene-regulation-disease triplet extraction. Six teams participated in the track and submitted valid results, three of which showed promising performance; we briefly present and summarize their methods. Finally, we discuss the potential value of the GDAS track to the healthcare and BioNLP communities, and explore the feasibility of further methods to facilitate the GDAS track.
Sizhuo Ouyang, Xinzhi Yao, Yuxing Wang, Qianqian Peng, Zhihan He, Jingbo Xia
Hierarchical Global Pointer Network: An Implicit Relation Inference Method for Gene-Disease Knowledge Discovery
Abstract
The investigation of new associations between genes and diseases is a crucial knowledge extraction task for drug discovery. In this paper, we present a novel approach to gene-disease knowledge discovery in the 8th China Health Information Processing Conference (CHIP 2022) [31] Open Shared Task AGAC Track (http://www.cips-chip.org.cn/2022/eval1). Selective annotation and latent topic annotation are two major challenges in this task. To address these challenges, we propose a novel Hierarchical Global Pointer Network with a propagation module to extract implicit relations that are inferred from explicit relations. We are also the first to design a unified end-to-end model that can perform the three AGAC tasks simultaneously and alleviate the problem of selective annotation. The experimental results show that the proposed method achieves F1-scores of 52%, 31% and 30% on the three tasks respectively.
Yiwen Jiang, Wentao Xie
A Knowledge-Based Data Augmentation Framework for Few-Shot Biomedical Information Extraction
Abstract
A large amount of biomedical knowledge is hidden in the massive scientific and clinical literature. This knowledge exists in an unstructured form and is difficult to extract automatically. Natural language processing makes it possible to mine this knowledge automatically. At present, most information extraction models need enough data to achieve good performance. Because high-quality labeled biomedical data are scarce, it is still difficult to extract information from biomedical literature accurately when only a few samples are available. This paper describes our participation in task 1 of the “China Health Information Processing Conference” (CHIP 2022). We propose a knowledge-based data augmentation framework that expands the data to overcome the scarcity of training data. The experimental results show that after data augmentation, the F1 score of named entity recognition using BioBERT-BiLSTM-CRF reaches 0.58 and the F1 score of relation extraction using TDEER reaches 0.6. Finally, we won second place, which validates the performance of our approach.
Xin Su, Chuang Cheng, Kuo Yang, Xuezhong Zhou
Biomedical Named Entity Recognition Under Low-Resource Situation
Abstract
Biomedical named entity recognition is a key technology for the automatic processing of health and medical information. However, obtaining enough labeled data is labor-intensive, so the task often faces a low-resource situation. We propose a method for biomedical named entity recognition under low-resource conditions. Our work is based on task 1 of the 8th China Health Information Processing Conference and ranked third among all teams. In the final results on the test set, the precision is 38.07%, the recall is 51.04% and the F1-score is 43.61%.
Jianfei Zhao, Xiangyu Ren, Shuo Zhao, Jinyi Li
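The precision, recall and F1-score reported in the abstract above can be cross-checked with the standard harmonic-mean definition of F1. The following is a minimal sketch; the function name `f1_score` is chosen here for illustration and is not taken from the paper or from the task's official scorer.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, expressed as fractions.
reported_f1 = f1_score(0.3807, 0.5104)
print(f"{reported_f1:.2%}")  # prints 43.61%
```

The result agrees with the reported F1-score to two decimal places.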

Medical Causal Entity and Relation Extraction

Frontmatter
CHIP2022 Shared Task Overview: Medical Causal Entity Relationship Extraction
Abstract
Modern medicine emphasizes interpretability and requires doctors to give reasonable, well-founded and convincing diagnostic results when diagnosing patients. As a result, the text of consultation results contains a large number of causal and explanatory relationships among medical concepts such as symptoms, diagnoses and treatments, and mining these relationships from text is of great help in improving the accuracy and interpretability of medical search results. Based on this, this paper constructs a new medical causality extraction dataset, CMedCausal (Chinese Medical Causal dataset), which is used in the CHIP2022 shared task and defines three key types of medical causal relationships: causal, conditional, and hypothetical. It consists of 9,153 medical texts with a total of 79,244 entity relationships annotated. Participants need to correctly label these reasoning relationships and the corresponding subject and object entities. A total of 49 teams submitted results for the preliminary round, with a highest Macro-F1 value of 0.4510. A total of 25 teams submitted results for the final round, with a highest Macro-F1 value of 0.4416.
Zihao Li, Mosha Chen, Kangping Yin, Yixuan Tong, Chuanqi Tan, Zhenzhen Lang, Buzhou Tang
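The overview above ranks systems by Macro-F1 over relation types. As a rough illustration, macro-averaging can be sketched as the unweighted mean of per-class F1 scores. The counts below are invented toy values, the helper names are hypothetical, and the official scorer for the task may define matching and averaging differently.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one relation type
Counts = Tuple[int, int, int]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 written directly in terms of counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class: Dict[str, Counts]) -> float:
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = [f1_from_counts(*counts) for counts in per_class.values()]
    return sum(scores) / len(scores) if scores else 0.0


# Toy counts (assumed, not from the paper) for the three relation types.
toy_counts = {
    "causal": (8, 2, 2),
    "conditional": (3, 1, 3),
    "hypothetical": (1, 1, 3),
}
print(round(macro_f1(toy_counts), 4))
```

Because every class contributes equally, a system that does well on the frequent relation type but poorly on the rare ones is penalized more than it would be under a micro-averaged score.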
Domain Robust Pipeline for Medical Causal Entity and Relation Extraction Task
Abstract
Medical entity and relation extraction is an essential task for medical knowledge graphs, which can provide explanatory answers for medical search engines. Recently, PL-Marker, a deep learning based pipeline, has been proposed, which follows a similar NER & ER paradigm. In this method, medical entities are first identified by a NER model, and then they are combined in pairs and fed into an ER model to learn the causal relations among the medical entities. Because of its own defects, such as exposure bias and a lack of relevance between entities and relationships, this pipeline cannot handle the complex entity relationships contained in CMedCausal. In this paper, we propose a novel pipeline, the Domain Robust Pipeline (DRP), which tackles these challenges by introducing noisy entities to address the exposure bias, adding a KL loss to learn from samples with noisy labels, applying multitask learning to escape semantic traps, and re-targeting the relationships to increase the robustness of the pipeline.
Tao Liang, Shengjun Yuan, Pengfei Zhou, Hangcong Fu, Huizhe Wu
A Multi-span-Based Conditional Information Extraction Model
Abstract
Conditional information extraction plays an important role in medical information extraction applications, such as medical information retrieval, medical knowledge graph construction, intelligent diagnosis and medical question-answering. Based on the evaluation task of the China Conference on Health Information Processing 2022 (CHIP 2022), we propose a Multi-span-based Conditional Information Extraction model (MSCIE), which handles conditional information extraction by extracting multiple spans and the relations between them. Moreover, the model provides a solution to conditional information extraction in complex scenarios such as discontinuous entities, entity overlap, and entity nesting. Finally, our model, which fuses two pretrained models, ranked 1st on list A and 2nd on list B, which also demonstrates its effectiveness.
Jiajia Jiang, Xiaowei Mao, Ping Huang, Mingxing Huang, Xiaobo Zhou, Yao Hu, Peng Shen
Medical Causality Extraction: A Two-Stage Based Nested Relation Extraction Model
Abstract
The extraction of medical causality contributes to constructing medical causal knowledge graphs and enhancing the interpretability of the modern medical consultation process. In this paper, we present our approach to medical causal entity and relation extraction in the 8th China Health Information Processing Conference (CHIP 2022) Open Shared Task. Nested relations and overlapping relations with shared entities are two major challenges in this task. We propose a two-stage model to achieve nested relation extraction. In the first stage, we extract traditional non-nested relations and explore how to utilize causal relational signals in the entity recognition module to alleviate the problem of overlapping relations. In the second stage, we identify entities in nested relations through machine reading comprehension and design a span-based contrastive learning method (SpanCL) with an under-sampling strategy to determine whether causality is nested. The experimental results show that the proposed method achieves a macro-averaged F1-score of 43.23%.
Yiwen Jiang, Jingyi Zhao

Medical Decision Tree Extraction from Unstructured Text

Frontmatter
Extracting Decision Trees from Medical Texts: An Overview of the Text2DT Track in CHIP2022
Abstract
This paper presents an overview of the Text2DT shared task (http://cips-chip.org.cn/2022/eval3) held as part of the CHIP-2022 shared tasks. The shared task addresses the challenging topic of automatically extracting medical decision trees from unstructured medical texts such as medical guidelines and textbooks. Many teams from both industry and academia participated in the shared tasks, and the top teams achieved amazing test results. This paper describes the tasks, the datasets, evaluation metrics, and the top systems for both tasks. Finally, the paper summarizes the techniques and results of the evaluation of the various approaches explored by the participating teams.
Wei Zhu, Wenfeng Li, Xiaoling Wang, Wendi Ji, Yuanbin Wu, Jin Chen, Liang Chen, Buzhou Tang
Medical Decision Tree Extraction: A Prompt Based Dual Contrastive Learning Method
Abstract
The extraction of decision-making knowledge in the form of decision trees from unstructured textual knowledge sources is a novel research area within the field of information extraction. In this paper, we present an approach to extract medical decision trees from medical texts (a.k.a. Text2DT) in the 8th China Health Information Processing Conference (CHIP 2022) Open Shared Task (http://www.cips-chip.org.cn/2022/eval3). The Text2DT task involves the construction of tree nodes using relation triples, which extends upon the foundation of the named entity recognition and relation extraction tasks. Compared to the fixed event schema typically defined in the event extraction task, the tree structure allows a more flexible and variable approach to representing information. To achieve this novel task, we propose a prompt based dual contrastive learning method. The experimental results demonstrate that the decision tree constructed by our model can achieve an accuracy of 55% (65% using the relaxed metric).
Yiwen Jiang, Hao Yu, Xingyue Fu
An Automatic Construction Method of Diagnosis and Treatment Decision Tree Based on UIE and Logical Rules
Abstract
In traditional information extraction, entity relation triples are mainly extracted from text, and there is no further logical relationship between triples. Therefore, an information extraction model based on triple extraction and decision tree generation is proposed. The model first extracts the triples in the medical text through the UIE method, and then forms the triples into a binary tree according to the condition node and the decision node. The condition node represents the condition judgment that needs to be made, and the decision node represents the diagnosis and treatment decision that needs to be made. This approach not only mines the core entities and relationships in the text, but also connects the entity relationship information to form a complete decision process. The model achieves good accuracy in decision tree construction, which shows that it can effectively generate decision trees.
Shiqi Zhen, Ruijie Wang
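The binary tree of condition and decision nodes described above can be pictured with a small data structure. The sketch below is an illustration only, not the authors' implementation: it assumes the nodes arrive as a preorder list in which every condition node ("C") has exactly two children and every decision node ("D") is a leaf, and it uses placeholder triples rather than output from UIE. The names `Node`, `build_tree` and `depth` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]    # (head, relation, tail)
Item = Tuple[str, List[Triple]]  # (role, triples); role is "C" or "D"


@dataclass
class Node:
    role: str                                  # "C" = condition, "D" = decision
    triples: List[Triple] = field(default_factory=list)
    yes: Optional["Node"] = None               # branch taken when the condition holds
    no: Optional["Node"] = None                # branch taken otherwise


def build_tree(preorder: List[Item]) -> Node:
    """Rebuild the tree from a preorder list (assumed input format)."""
    items: Iterator[Item] = iter(preorder)

    def build() -> Node:
        role, triples = next(items)
        node = Node(role, list(triples))
        if role == "C":          # condition nodes branch; decision nodes are leaves
            node.yes = build()
            node.no = build()
        return node

    root = build()
    if next(items, None) is not None:
        raise ValueError("extra nodes after a complete tree")
    return root


def depth(node: Optional[Node]) -> int:
    """Number of nodes on the longest root-to-leaf path."""
    if node is None:
        return 0
    return 1 + max(depth(node.yes), depth(node.no))


# Placeholder triples, invented for illustration.
toy = [
    ("C", [("patient", "has", "finding A")]),
    ("D", [("patient", "receives", "treatment X")]),
    ("C", [("patient", "has", "finding B")]),
    ("D", [("patient", "receives", "treatment Y")]),
    ("D", [("patient", "receives", "treatment Z")]),
]
tree = build_tree(toy)
print(depth(tree))  # prints 3
```

Keeping the triples on the nodes means the tree carries both the extracted facts and the logical order in which they are checked, which is the connection between triples that the abstract says plain triple extraction lacks.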
Research on Decision Tree Method of Medical Text Based on Information Extraction
Abstract
Extracting diagnosis and treatment decision trees from medical texts is a meaningful task, and research in this area has only just started. The general direction is to use pipeline extraction methods, which can be divided into two steps: triplet extraction and decision tree generation. However, previous methods have problems in both triplet extraction and decision tree generation, which lead to poor overall decision extraction. This paper improves on them in three directions: (1) it adopts pre-training on a medical data set; (2) for triplet extraction, it uses named entity recognition and biaffine to judge the relationship between entities; (3) it adopts a pattern method to generate a decision tree from the triples. With these three improvements, it achieves excellent performance on the data set of 2022CHIP evaluation task three (Text2MDT). The medical pre-trained model gives the model a deeper understanding of medical vocabulary and the dependencies between vocabulary; the triplet extraction method, which uses biaffine to judge entity relationships, is suitable for the triplet extraction of the evaluation data set; and the method using pattern triplets is more expressive.
Zihong Wu

OCR of Electronic Medical Document

Frontmatter
Information Extraction of Medical Materials: An Overview of the Track of Medical Materials MedOCR
Abstract
In the medical and insurance industries, electronic medical record materials contain a lot of information. Extracting this information with artificial intelligence technology and applying it to various businesses would greatly reduce labor costs and improve efficiency. However, the information is difficult to extract, and at present most of it is entered manually. Using Optical Character Recognition (OCR) and Natural Language Processing (NLP) technology to digitize and structure the information on these paper materials has gradually become a hot spot in the industry. Based on this, we constructed a medical material information extraction data set, the Medical OCR dataset (MedOCR) [1], and we also held the “Medical inventory invoice OCR element extraction Task” evaluation competition at the eighth China Health Information Processing Conference (CHIP2022), in order to promote the development of medical material information extraction technology. A total of 18 teams participated in the competition, most of which used an OCR-based extraction system. For the evaluation index Acc, the best performing teams reached 0.9330 and 0.9076. The task of the competition focuses on information extraction technology, and MedOCR will remain open in the long term for researchers to carry out related technical research.
Lifeng Liu, Dejie Chang, Xiaolong Zhao, Longjie Guo, Mosha Chen, Buzhou Tang
TripleMIE: Multi-modal and Multi Architecture Information Extraction
Abstract
The continuous development of deep learning technology has made it widely used in various fields. In the medical domain, electronic voucher recognition is a very challenging task. Compared with traditional manual entry, the application of OCR and NLP technology can effectively improve work efficiency and reduce the training cost of business personnel. Using OCR and NLP technology to digitize and structure the information on these paper materials has gradually become a hot spot in the industry.
Evaluation task 4 (OCR identification of electronic medical paper documents (ePaper)) of CHIP2022 [15, 16, 25] requires extracting 87 fields from four types of medical voucher materials: discharge summaries, outpatient invoices, drug purchase invoices, and inpatient invoices. This task is very challenging because of the variety of materials, the noisy data, and the many categories of target fields.
To achieve the above goals, we propose a knowledge-based multi-modal and multi-architecture medical voucher information extraction method, namely TripleMIE, which includes I2SM: Image to sequence model, L-SPN: Large scale PLM-based span prediction net, MMIE: multi-modal information extraction model, etc. At the same time, a knowledge-based model integration module named KME is proposed to effectively integrate prior knowledge such as competition rules and material types with the model results. With the help of the above modules, we have achieved excellent results on the online official test data, which verifies the performance of the proposed method (https://tianchi.aliyun.com/dataset/131815#4).
Boqian Xia, Shihan Ma, Yadong Li, Wenkang Huang, Qiuhui Shi, Zuming Huang, Lele Xie, Hongbin Wang
Multimodal End-to-End Visual Document Parsing
Abstract
The record materials used in some industries, including medical services and insurance, carry information of high commercial and scientific research value, yet they are still mainly paper-based. Recent progress in deep learning makes it easier to parse visually rich documents. Compared with traditional manual input, applying this technology improves work efficiency and reduces the training cost of business personnel. In previous work, visual document parsing was divided into two stages: Optical Character Recognition (OCR) and Natural Language Understanding (NLU). To address a series of problems with OCR, such as high computational cost, multi-language inflexibility and the propagation of OCR errors to the subsequent stage, OCR-free multimodal visual document understanding methods based on deep learning have recently been proposed. By fine-tuning the pre-trained model, such a method can perform well in many downstream tasks. However, the approach is still limited in a specific context by 1) the language and context in which the encoder and decoder are pre-trained; 2) the image input size of the pre-trained encoder. In this paper, we put forward solutions to these two problems. As a result, our proposed scheme, working in an end-to-end way, won second place in the “Identification of Electronic Medical Paper Documents (ePaper)” (IEMPD) task in the Eighth China Health Information Processing Conference (CHIP 2022).
Yujiang Lu, Weifeng Qiu, Yinghua Hong, Jiayi Wang
Improving Medical OCR Information Extraction with Integrated Bert and LayoutXLM Models
Abstract
Currently, medical records in most hospitals are paper-based and rely on manual input, but with the advancements in OCR and NLP technologies, it is now possible to convert such records into electronic and structured formats. In this paper, we explore the CHIP2022 evaluation task 4 and compare the performance of two pre-training models: Bert without additional coordinate information and LayoutXLM with additional coordinate information. We apply a selection and regularization process to refine the results and evaluate our framework’s accuracy through a list in Ali Cloud Tianchi. Our results demonstrate that our framework achieved good performance.
Lianchi Zheng, Xiaoming Liu, Zhihui Sun, Yuxiang He

Clinical Diagnostic Coding

Frontmatter
Overview of CHIP 2022 Shared Task 5: Clinical Diagnostic Coding
Abstract
The 8th China Conference on Health Information Processing (CHIP2022) released 5 shared tasks related to Chinese medical information processing. Among them, the fifth task is about clinical diagnosis coding, which demands assigning the standard medical diagnostic words to the possible medical concepts in the visiting information. A total of 10 teams participated in the task and finally submitted 19 sets of results. This task takes the average F1 score as the final evaluation index, and the highest F1 score among all submissions reaches 0.6908.
Gengxin Luo, Bo Kang, Hao Peng, Ying Xiong, Zhenli Lin, Buzhou Tang
Clinical Coding Based on Knowledge Enhanced Language Model and Attention Pooling
Abstract
Clinical coding is the process of obtaining a standard ICD code based on the patient’s electronic medical record (EMR) information, including diagnosis, procedure, drug list, order, etc. It is essential in the medical record information management of the hospital. However, medical record data suffer from problems such as low quality, insufficient data, abundant non-standardized jargon, irrelevant order information, and unbalanced data distribution, which result in poor clinical coding performance. This task is a multi-label classification problem. Based on the medical pre-trained language model and our medical knowledge engineering (MetaMed KE), we propose Value-Level Attention Pooling (VLAP) to build a clinical diagnostic coding framework for Chinese electronic medical records. The framework includes three components: the preprocessing module, the model, and the postprocessing module. Compared to existing algorithms, our framework dramatically improves generalization ability and accuracy when data are insufficient and classes are unbalanced. Thus, our method provides a reliable automatic solution for clinical coding in hospital medical record information management.
Yong He, Weiqing Li, Shun Zhang, Zhaorong Li, Zixiao Ding, Zhenyu Zeng
Rule-Enhanced Disease Coding Method Based on Roberta
Abstract
Disease coding is crucial in medical informatics, but the complexity and sheer volume of disease codes make traditional methods ineffective. To address this challenge, this paper formulates disease coding as a classification problem, leveraging pre-trained language models and integrating domain knowledge to propose a rule-enhanced clinical coding approach. Experimental results demonstrate the effectiveness of this method, which achieved the second-best performance in the China Health Information Processing Conference (CHIP) 2022 competition.
Bo An, Xiaodan Lv
Diagnosis Coding Rule-Matching Based on Characteristic Words and Dictionaries
Abstract
With the continuous development of medical informatization and the wide use of electronic medical records, effective and standardized information management is essential for the storage and use of medical information. Shared task 5 of the 8th China Health Information Processing Conference (CHIP2022) proposed diagnostic coding for Chinese electronic medical records. The task is to match each record, given the relevant diagnosis and seven other attributes of the medical information, to its corresponding standard codes in National Clinical Version 2.0. In this evaluation task, we propose a method of matching diagnostic codes based on dictionaries of medical feature words and feature drugs, using the feature words of the medical text and the connection between therapeutic drugs and therapeutic means. The experimental results on the test dataset show that the F1 value of our proposed method is 65.81%, which earned third place in this task.
Shuangcan Xue, Jintao Tang, Shasha Li, Ting Wang
Backmatter
Metadata
Title
Health Information Processing. Evaluation Track Papers
edited by
Buzhou Tang
Qingcai Chen
Hongfei Lin
Fei Wu
Lei Liu
Tianyong Hao
Yanshan Wang
Haitian Wang
Jianbo Lei
Zuofeng Li
Hui Zong
Copyright year
2023
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9948-26-0
Print ISBN
978-981-9948-25-3
DOI
https://doi.org/10.1007/978-981-99-4826-0
