Learning from imbalanced data in surveillance of nosocomial infection
Introduction
Surveillance is the cornerstone activity of infection control, whether nosocomial or otherwise. It provides data to assess the magnitude of the problem, detect outbreaks, identify risk factors for infection, target control measures on high-risk patients or wards, or evaluate prevention programs. Ultimately, the goal of surveillance is to decrease infection risk and consequently improve patients’ safety.
There are several ways to perform surveillance, each with its advantages and drawbacks. The gold standard is hospital-wide prospective surveillance, which consists of reviewing, on a daily basis, all available information on all hospitalized patients in order to detect all nosocomial infections (NIs). This method is labor-intensive, infeasible at the hospital level, and currently recommended only for high-risk, i.e., critically ill, patients. As an alternative and more realistic approach, prevalence surveys are being recognized as a valid surveillance strategy and are increasingly performed. Their major limitations are their retrospective nature, their dependency on readily available data, a prevalence bias, the inability to detect outbreaks (depending on survey frequency), and a limited capacity to identify risk factors. However, they provide sufficiently good data to measure the magnitude of the problem, evaluate a prevention program, and help allocate resources. They give a snapshot of clinically active NIs during a given index day and provide information about the frequency and characteristics of these infections. The efficacy of infection control policies can be easily measured by repeated prevalence surveys [1]. However, whatever the strategy used, surveillance of NI is resource- and labor-intensive, as it requires assembling a wide range of data gathered from multiple sources. This calls for the development of alternative methods that would ultimately allow infection risk to be monitored continuously across the hospital, and at a lower cost.
Section snippets
Background and motivation
At present, the detection of NIs largely relies on manual methods. Infection control practitioners report infection rates using standard methods (i.e., guidelines) elaborated by the Centers for Disease Control (CDC) [2]. Several teams have developed tools to assist physicians in detecting NIs, using computerized approaches. These tools typically work by searching clinical databases of microbiology and other data and producing a report that infection control physicians can then use to assess whether or
Data collection and preparation
The University Hospital of Geneva (HUG) has been performing yearly prevalence studies since 1994 [10]. These surveys are undertaken every year at the same period and last approximately 3 weeks. All patients who have been hospitalized for more than 48 h at the time of the survey are assessed for the presence of an active nosocomial infection. Data are extracted from medical records, kardex, X-ray and microbiology reports, and, if necessary, interviews with the nurses and physicians in charge of the patient. All
The imbalanced data problem
The major difficulty inherent in the data (as in many medical diagnostic applications) is the highly skewed class distribution. Out of 683 patients, only 75 (11% of the total) were infected and 608 were not. The problem of imbalanced datasets is particularly crucial in applications where the goal is to maximize recognition of the minority class. The issue of class imbalance has been actively
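To make the skew concrete, the following minimal sketch recomputes the class proportions from the counts given above and shows what a trivial classifier that labels every patient as not infected would achieve. The variable names are illustrative and not taken from the study.

```python
# Class counts reported in the text: 683 patients, 75 infected, 608 not infected.
n_infected = 75
n_not_infected = 608
n_total = n_infected + n_not_infected

# Share of the minority (infected) class in the dataset.
minority_share = n_infected / n_total

# A trivial classifier that predicts "not infected" for every patient
# gets all 608 negatives right and misses all 75 positives.
baseline_accuracy = n_not_infected / n_total
baseline_sensitivity = 0 / n_infected

print(f"minority share:                {minority_share:.3f}")
print(f"majority-baseline accuracy:    {baseline_accuracy:.3f}")
print(f"majority-baseline sensitivity: {baseline_sensitivity:.3f}")
```

Such a baseline reaches roughly 89% accuracy while detecting no infection at all, which is why accuracy alone says little when the goal is to recognize the minority class.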
Strategies for handling imbalanced data
In this section we present two distinct approaches to the imbalanced data problem. The first is aimed at eliminating or at least attenuating class imbalance before the learning process whereas the second adjusts the learning algorithm’s bias to allow it to learn despite the handicap of imbalanced data. In the first approach, we decompose each class into fine-grained clusters and generate artificial cases in the form of cluster prototypes; these synthetic cases are used to drive the preliminary
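The idea of using cluster prototypes as synthetic cases can be illustrated with a small sketch. It rests on several assumptions that the excerpt does not confirm: k-means is used as the clustering method, features are numeric, distance is Euclidean, and the prototype is the cluster centroid. The function names and the toy data are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two numeric vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]

def farthest_point_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from the centroids chosen so far."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        nxt = max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        centroids.append(list(nxt))
    return centroids

def kmeans_prototypes(points, k, n_iter=20):
    """Return k cluster centroids (prototypes) for a list of numeric vectors.
    Assumption: plain k-means stands in for the clustering step."""
    centroids = farthest_point_init(points, k)
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            clusters[nearest].append(p)
        # Keep the previous centroid if a cluster ends up empty.
        centroids = [mean_point(c) if c else centroids[j] for j, c in enumerate(clusters)]
    return centroids

# Hypothetical minority-class cases with two numeric features (illustration only).
minority = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [5.0, 5.0], [5.2, 4.8], [4.8, 5.2]]

# Oversampling: add the prototypes to the minority class as synthetic cases.
prototypes = kmeans_prototypes(minority, k=2)
oversampled_minority = minority + prototypes
print(prototypes)
```

Because each prototype is a mean of real cases, it lies inside the range spanned by those cases on every feature. Under the same assumptions, applying the function to the majority class with a small `k` and keeping only the prototypes would give a reduced majority set.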
Learning algorithms
For the preprocessing strategy we compared alternative solutions to the class imbalance problem using five learning algorithms with clearly distinct inductive biases. Decision trees such as those built by C4.5 are models in which each node is a test on an individual variable and a path from the root to a leaf is a conjunction of conditions required for a given classification [25]. Naive Bayes computes the posterior probability of each class given a new case, then assigns the case to the most
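As an illustration of the Naive Bayes description above, here is a minimal sketch for categorical features. Laplace smoothing, the function names, the two features, and the toy records are all assumptions made for the example and do not come from the study.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, class) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            value_counts[(i, y)][v] += 1
            feature_values[i].add(v)
    return class_counts, value_counts, feature_values

def posterior(model, row):
    """Posterior probability of each class given a new case."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    scores = {}
    for y, n_y in class_counts.items():
        score = n_y / total  # prior P(y)
        for i, v in enumerate(row):
            # Laplace-smoothed estimate of P(x_i = v | y) (an assumption here).
            score *= (value_counts[(i, y)][v] + 1) / (n_y + len(feature_values[i]))
        scores[y] = score
    norm = sum(scores.values())
    return {y: s / norm for y, s in scores.items()}

def predict(model, row):
    """Assign the case to the class with the highest posterior."""
    post = posterior(model, row)
    return max(post, key=post.get)

# Hypothetical records: (catheter, intensive care), illustration only.
rows = [("yes", "yes"), ("yes", "no"), ("no", "no"),
        ("no", "no"), ("no", "yes"), ("no", "no")]
labels = ["infected", "infected", "not_infected",
          "not_infected", "not_infected", "not_infected"]

model = train_naive_bayes(rows, labels)
print(posterior(model, ("yes", "yes")))
print(predict(model, ("no", "no")))
```

The class prior enters the score directly, so with a skewed class distribution the majority class starts with a built-in advantage that the feature likelihoods have to overcome.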
Results
Table 1 summarizes performance results on the original skewed class distribution and illustrates clearly the inadequacy of the accuracy criterion for this task. For instance, AdaBoost exhibits the highest accuracy of 90% but actually performs more poorly than Naive Bayes in detecting positive cases of nosocomial infections. In fact, Naive Bayes ranks last in terms of accuracy rate due to its poor performance on the majority class (specificity of 0.88, lower than all the others) but attains the
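The gap between accuracy and minority-class detection can be sketched with confusion-matrix counts. The counts below are hypothetical: they respect the 75/608 class split reported earlier but are not the values from Table 1, and `classifier_a` and `classifier_b` do not correspond to any algorithm in the study.

```python
def rates(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)   # share of infected patients detected
    specificity = tn / (tn + fp)   # share of non-infected patients cleared
    return accuracy, sensitivity, specificity

# Hypothetical counts for two classifiers on 75 positives and 608 negatives.
classifier_a = rates(tp=30, fn=45, tn=590, fp=18)   # conservative: few alarms
classifier_b = rates(tp=60, fn=15, tn=535, fp=73)   # aggressive: more alarms

for name, (acc, sens, spec) in [("A", classifier_a), ("B", classifier_b)]:
    print(f"classifier {name}: accuracy={acc:.3f} "
          f"sensitivity={sens:.3f} specificity={spec:.3f}")
```

In this illustration the classifier with the higher accuracy detects fewer infections, which is the kind of ranking reversal that makes sensitivity and specificity more informative than accuracy for this task.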
Conclusion
We analyzed the results of a prevalence study of nosocomial infections in order to predict infection risk on the basis of patient records. The major hurdle, typical in medical diagnosis, is the problem of rare positives. We addressed this problem via two different approaches. The first is based on the generation of synthetic instances for both oversampling and undersampling. Generation of artificial cases must however meet a hard constraint: the synthetic cases generated must remain within the
Acknowledgement
The authors are grateful for the dataset provided by the infection control team at the University of Geneva Hospitals.
References (31)
- et al. CDC definitions for nosocomial infections. Am J Infect Control (1988)
- et al. Improving support vector machine classifiers by modifying kernel functions. Neural Networks (1999)
- et al. Repeated prevalence surveys for monitoring effectiveness of hospital infection control. Lancet (1983)
- et al. The second national prevalence survey of infection in hospitals: methodology. J Hosp Infect (1995)
- et al. Computer algorithms to detect bloodstream infections. Emerg Infect Dis (2004)
- et al. Association rules and data mining in hospital infection control and public health surveillance. J Am Med Inform Assoc (1998)
- et al. Application of data mining to intensive care unit microbiologic data. Emerg Infect Dis (1999)
- et al. A data mining system for infection control surveillance. Meth Inf Med (2000)
- et al. A framework for infection control surveillance using association rules
- et al. A system for monitoring nosocomial infections
- Nosocomial infections in Swiss university hospitals: a multi-centre survey and review of the published experience. Schweiz Med Wochenschr
- Data mining on multimedia data
- The class imbalance problem: a systematic study. Intell Data Anal J
- SMOTE: Synthetic Minority Over-sampling TEchnique. J Artif Intell Res (JAIR)
- Addressing the curse of imbalanced data sets: one-sided sampling