Learning from imbalanced data in surveillance of nosocomial infection

https://doi.org/10.1016/j.artmed.2005.03.002

Summary

Objective

An important problem that arises in hospitals is the monitoring and detection of nosocomial, or hospital-acquired, infections (NIs). This paper describes a retrospective analysis of a prevalence survey of NIs conducted at the Geneva University Hospital. Our goal is to identify patients with one or more NIs on the basis of clinical and other data collected during the survey.

Methods and material

Standard surveillance strategies are time-consuming and cannot be applied hospital-wide; alternative methods are required. In NI detection viewed as a classification task, the main difficulty resides in the significant imbalance between positive or infected (11%) and negative (89%) cases. To remedy class imbalance, we explore two distinct avenues: (1) a new resampling approach in which both oversampling of rare positives and undersampling of the noninfected majority rely on synthetic cases (prototypes) generated via class-specific subclustering, and (2) a support vector algorithm in which asymmetrical margins are tuned to improve recognition of rare positive cases.
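As an illustration of the second avenue, one common way to write a soft-margin support vector machine with asymmetrical treatment of the two classes is to give each class its own misclassification penalty. The formulation below is a standard textbook form and is not necessarily the exact variant used in this paper; the symbols $C^{+}$ and $C^{-}$ are assumptions introduced here.

```latex
\min_{w,\,b,\,\xi}\quad \frac{1}{2}\lVert w \rVert^{2}
  + C^{+}\sum_{i:\,y_i=+1}\xi_i
  + C^{-}\sum_{i:\,y_i=-1}\xi_i
\qquad\text{subject to}\qquad
y_i\,(w^{\top}x_i + b) \ge 1 - \xi_i,\quad \xi_i \ge 0 .
```

Choosing $C^{+} > C^{-}$ makes margin violations on the rare positive (infected) cases more costly than violations on the negative majority, which pushes the decision boundary toward the negative class and tends to raise sensitivity at the expense of specificity.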

Results and conclusion

Experiments have shown both approaches to be effective for the NI detection problem. Our novel resampling strategies perform remarkably better than classical random resampling. However, they are outperformed by asymmetrical soft margin support vector machines which attained a sensitivity rate of 92%, significantly better than the highest sensitivity (87%) obtained via prototype-based resampling.

Introduction

Surveillance is the cornerstone activity of infection control, whether nosocomial or otherwise. It provides data to assess the magnitude of the problem, detect outbreaks, identify risk factors for infection, target control measures on high-risk patients or wards, or evaluate prevention programs. Ultimately, the goal of surveillance is to decrease infection risk and consequently improve patients’ safety.

There are several ways to perform surveillance, each with its advantages and drawbacks. The gold standard is hospital-wide prospective surveillance, which consists of reviewing, on a daily basis, all available information on all hospitalized patients in order to detect all nosocomial infections (NIs). This method is labor-intensive, infeasible at the hospital level, and currently recommended only for high-risk (i.e., critically ill) patients. As a more realistic alternative, prevalence surveys are being recognized as a valid surveillance strategy and are increasingly performed. Their major limitations are their retrospective nature, their dependency on readily available data, a prevalence bias, the inability to detect outbreaks (depending on survey frequency), and a limited capacity to identify risk factors. However, they provide sufficiently good data to measure the magnitude of the problem, evaluate a prevention program, and help allocate resources. They give a snapshot of clinically active NIs during a given index day and provide information about the frequency and characteristics of these infections. The efficacy of infection control policies can be easily measured by repeated prevalence surveys [1]. However, whatever the strategy used, surveillance of NIs is resource- and labor-intensive, as it requires assembling a wide range of data gathered from multiple sources. This calls for the development of alternative methods that would ultimately allow infection risk to be monitored continuously across the hospital, and at a lower cost.

Section snippets

Background and motivation

Current detection of NIs relies largely on manual methods. Infection control practitioners report infection rates using standard methods (i.e., guidelines) developed by the Centers for Disease Control (CDC) [2]. Several teams have developed computerized tools to assist physicians in detecting NIs. These tools typically work by searching clinical databases of microbiology and other data and producing a report that infection control physicians can then use to assess whether or

Data collection and preparation

The University Hospital of Geneva (HUG) has been performing yearly prevalence studies since 1994 [10]. These surveys are undertaken every year at the same period and last approximately 3 weeks. All patients who have been hospitalized for more than 48 h at the time of the survey are assessed for the presence of an active nosocomial infection. Data are extracted from medical records, kardex, X-ray and microbiology reports, and, if necessary, interviews with the nurses and physicians in charge of the patient. All

The imbalanced data problem

The major difficulty inherent in the data (as in many medical diagnostic applications) is the highly skewed class distribution. Out of 683 patients, only 75 (11% of the total) were infected and 608 were not. The problem of imbalanced datasets is particularly crucial in applications where the goal is to maximize recognition of the minority class. The issue of class imbalance has been actively
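To make the consequence of this skew concrete, the short sketch below works through the arithmetic using only the class counts quoted above. The "always negative" classifier is a hypothetical baseline introduced here for illustration, not a method from the paper.

```python
# Class counts reported for the survey data: 683 patients, 75 infected.
n_total = 683
n_positive = 75
n_negative = n_total - n_positive  # non-infected patients

prevalence = n_positive / n_total

# Hypothetical baseline: label every patient as non-infected.
# All negatives are classified correctly, all positives are missed.
baseline_accuracy = n_negative / n_total
baseline_sensitivity = 0 / n_positive

print(f"prevalence:           {prevalence:.3f}")
print(f"baseline accuracy:    {baseline_accuracy:.3f}")
print(f"baseline sensitivity: {baseline_sensitivity:.3f}")
```

A classifier that never detects a single infection still scores roughly 89% accuracy on this distribution, which is why recognition of the minority class has to be measured separately.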

Strategies for handling imbalanced data

In this section we present two distinct approaches to the imbalanced data problem. The first is aimed at eliminating or at least attenuating class imbalance before the learning process whereas the second adjusts the learning algorithm’s bias to allow it to learn despite the handicap of imbalanced data. In the first approach, we decompose each class into fine-grained clusters and generate artificial cases in the form of cluster prototypes; these synthetic cases are used to drive the preliminary
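The first approach can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it uses plain k-means as the subclustering step, replaces the majority class by its cluster centroids (undersampling), and appends minority-class centroids to the original minority cases (oversampling). The function names, the cluster counts `k_pos` and `k_neg`, and the two-dimensional toy data are all hypothetical.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_prototypes(points, k, iters=10, seed=0):
    """Cluster one class with k-means and return its k centroids (prototypes)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize from k distinct cases
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean_point(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids

def rebalance(positives, negatives, k_pos, k_neg):
    """Oversample positives with prototypes; replace negatives by prototypes."""
    new_pos = positives + kmeans_prototypes(positives, k_pos)
    new_neg = kmeans_prototypes(negatives, k_neg)
    return new_pos, new_neg

# Toy data with the same class sizes as the survey (values are random, not real).
rng = random.Random(42)
negatives = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(608)]
positives = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(75)]

new_pos, new_neg = rebalance(positives, negatives, k_pos=25, k_neg=100)
print(len(new_pos), len(new_neg))
```

Because each prototype is the mean of cases from a single class, it stays inside the region already occupied by that class, which is one reason to cluster each class separately rather than the pooled data.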

Learning algorithms

For the preprocessing strategy we compared alternative solutions to the class imbalance problem using five learning algorithms with clearly distinct inductive biases. Decision trees such as those built by C4.5 are models in which each node is a test on an individual variable and a path from the root to a leaf is a conjunction of conditions required for a given classification [25]. Naive Bayes computes the posterior probability of each class given a new case, then assigns the case to the most
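For reference, the Naive Bayes decision rule in its standard textbook form can be written as below. The notation ($x_1,\dots,x_d$ for the variables describing a case, $c$ for a class) is an assumption introduced here and may differ from the paper's.

```latex
\hat{c}
  = \arg\max_{c}\; P(c \mid x_1,\dots,x_d)
  = \arg\max_{c}\; P(c)\prod_{j=1}^{d} P(x_j \mid c)
```

The second equality follows from Bayes' rule together with the "naive" assumption that the variables are conditionally independent given the class; the evidence term $P(x_1,\dots,x_d)$ is dropped because it does not depend on $c$.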

Results

Table 1 summarizes performance results on the original skewed class distribution and illustrates clearly the inadequacy of the accuracy criterion for this task. For instance, AdaBoost exhibits the highest accuracy of 90% but actually performs more poorly than Naive Bayes in detecting positive cases of nosocomial infections. In fact, Naive Bayes ranks last in terms of accuracy rate due to its poor performance on the majority class (specificity of 0.88, lower than all the others) but attains the
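The sketch below shows how accuracy can rank two classifiers in the opposite order from sensitivity on a class distribution of 75 positives and 608 negatives. The confusion-matrix counts for classifiers "A" and "B" are hypothetical values chosen for illustration; they are not the figures from Table 1.

```python
def rates(tp, fn, tn, fp):
    """Return (accuracy, sensitivity, specificity) from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)  # fraction of infected patients detected
    specificity = tn / (tn + fp)  # fraction of non-infected patients cleared
    return accuracy, sensitivity, specificity

# Hypothetical counts on 75 positives and 608 negatives (NOT from Table 1).
clf_a = rates(tp=30, fn=45, tn=590, fp=18)  # conservative: few false alarms
clf_b = rates(tp=60, fn=15, tn=535, fp=73)  # aggressive: catches more positives

for name, (acc, sens, spec) in [("A", clf_a), ("B", clf_b)]:
    print(f"{name}: accuracy={acc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

In this toy comparison, classifier A wins on accuracy while missing more than half of the infected patients, whereas B loses on accuracy but detects most of them.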

Conclusion

We analyzed the results of a prevalence study of nosocomial infections in order to predict infection risk on the basis of patient records. The major hurdle, typical in medical diagnosis, is the problem of rare positives. We addressed this problem via two different approaches. The first is based on the generation of synthetic instances for both oversampling and undersampling. Generation of artificial cases must however meet a hard constraint: the synthetic cases generated must remain within the

Acknowledgement

The authors are grateful for the dataset provided by the infection control team at the University of Geneva Hospitals.

References (31)

  • J.S. Garner et al. CDC definitions for nosocomial infections. Am J Infect Control (1988)
  • S. Amari et al. Improving support vector machine classifiers by modifying kernel functions. Neural Networks (1999)
  • G.G. French et al. Repeated prevalence surveys for monitoring effectiveness of hospital infection control. Lancet (1983)
  • M.C. Kelsey et al. The second national prevalence survey of infection in hospitals: methodology. J Hosp Infect (1995)
  • W.E. Trick et al. Computer algorithms to detect bloodstream infections. Emerg Infect Dis (2004)
  • S.E. Brossette et al. Association rules and data mining in hospital infection control and public health surveillance. J Am Med Inform Assoc (1998)
  • S.A. Moser et al. Application of data mining to intensive care unit microbiologic data. Emerg Infect Dis (1999)
  • S.E. Brossette et al. A data mining system for infection control surveillance. Meth Inf Med (2000)
  • L. Ma et al. A framework for infection control surveillance using association rules
  • E. Lamma et al. A system for monitoring nosocomial infections
  • S. Harbarth et al. Nosocomial infections in Swiss university hospitals: a multi-centre survey and review of the published experience. Schweiz Med Wochenschr (1999)
  • P. Perner. Data mining on multimedia data (2002)
  • N. Japkowicz. The class imbalance problem: a systematic study. Intell Data Anal J (2002)
  • N. Chawla et al. SMOTE: Synthetic Minority Over-sampling TEchnique. J Artif Intell Res (JAIR) (2002)
  • M. Kubat et al. Addressing the curse of imbalanced data sets: one-sided sampling
