Healthcare-related information has grown tremendously in recent years [18]. The large amount of available data is being used on high performance computing architectures in applications such as identifying high-cost patients, analyzing and predicting hospital readmissions [19–21], triage, where the risk of complications is estimated [22], predicting a patient's risk of decompensation, predicting adverse events well before they occur [23], and predicting diseases affecting multiple organ systems [24]. Healthcare data exhibits all four defining characteristics of Big Data: volume, variety, veracity, and velocity [25]. The volume of healthcare data continues to grow: every day more patients, healthcare institutions, and health insurance companies adopt electronic operations and produce data in a variety of areas, from gene data to discharge summaries. To use this data efficiently, input to and output from the systems must be fast. Big Data applications include medical R&D efficiency [26], Medicare fraud detection [27], reducing hospital readmissions [28], and wellness predictive modeling [29]. This paper studies readmission risk prediction for COPD patients from a Big Data perspective.
Since the HRRP was established, the readmission problem has been an important focus of the healthcare industry. It affects all healthcare stakeholders: CMS, in achieving a better overall healthcare economy; hospitals, through reductions in inpatient prospective payment system (IPPS) payments; and, most importantly, patients and family members, since a lower chance of readmission means better outcomes. Because of these impacts, significant research has addressed readmission reduction from various angles, such as demographics and disease types. This section discusses some of the noteworthy studies in this area. The review is not limited to a particular disease, so that advantages and drawbacks identified for one condition can also be considered in the context of other diseases. The studies were analyzed and compared based on data type, data size, disease conditions, algorithms, and other features that could be valuable for addressing the readmission problem.
Mehdi et al. [1] studied the all-cause risk of 30-day readmission on 323,813 inpatient stays. They used a neural network model with 1667 features in various feature categories, including encounter reason and hospital problems, which yielded a precision, or positive predictive value (PPV), of 0.24. This was 20% higher than LACE (length of stay, acuity of admission, comorbidities, and emergency department visits), the industry standard. The study also performed a basic cost analysis showing savings as a function of intervention success rate. The study by Mohsen et al. [15] on the readmission problem for congestive heart failure (CHF) showed an 18.2% reduction in re-hospitalizations with a cost savings of 3.8%. It used logistic regression with 3888 binary variables extracted from the patient visit data of 1172 hospital visits.
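To make the PPV metric concrete, the following minimal sketch computes it from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from any of the cited studies.

```python
def positive_predictive_value(true_positives: int, false_positives: int) -> float:
    """Fraction of patients flagged as high risk who were actually readmitted."""
    flagged = true_positives + false_positives
    if flagged == 0:
        raise ValueError("no patients were flagged, so PPV is undefined")
    return true_positives / flagged


# Hypothetical counts: 100 patients flagged, 24 of them readmitted within 30 days.
ppv = positive_predictive_value(true_positives=24, false_positives=76)
print(f"PPV = {ppv:.2f}")  # prints "PPV = 0.24"
```

A PPV of 0.24 therefore means that roughly one in four flagged patients is actually readmitted, which matters when each flag triggers a costly intervention.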
The study by Futoma et al. [13] was based on a dataset from the New Zealand Ministry of Health covering 3.3 million hospital readmissions between 2006 and 2012. The study showed that the analysis can be applied to US healthcare data as well. It used logistic regression, random forests, and support vector machines for COPD, CHF, pneumonia, THA/TKA, and AMI. The COPD subset had a size of 31,457 and yielded an AUC of 0.711. The study by Issac et al. [11] addressed HF, acute myocardial infarction, pneumonia, and COPD. Its data comprised 7200 records gathered from administrative records of the Veterans Health Administration, corresponding to 2985 distinct adult patients. The study shows that PHSF (Phase-Type Random Forests) performs better than random forests, SVM, logistic regression, or neural networks.
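Several of the studies above report the area under the ROC curve (AUC). As a rough illustration of what that number measures, the sketch below computes AUC as the probability that a randomly chosen readmitted patient receives a higher risk score than a randomly chosen non-readmitted patient, with ties counted as one half. The function name and the toy scores are assumptions made for this example only.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy risk scores (hypothetical); label 1 = readmitted, 0 = not readmitted.
toy_scores = [0.90, 0.40, 0.35, 0.80, 0.20]
toy_labels = [1, 0, 1, 0, 0]
print(auc(toy_scores, toy_labels))
```

The quadratic pairwise loop is used here for clarity; on datasets of the sizes cited above a rank-based formulation would be preferable.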
Danning et al. [12] used claims-based data to predict 3-day readmission using standardized billing codes for chronic pancreatitis. The study used data on 26,091 admissions from Johns Hopkins Hospital and 16,194 admissions from Bayview Medical Center and reported an AUC of more than 0.65. In another study, Amarasingham et al. [14] used data drawn directly from a multi-condition EHR system across 7 hospital systems, with a patient record set of 39,604. The model was compared with the acute decompensated heart failure registry (ADHERE) model and CMS models and was shown to perform better. The study also concluded that claims-based models are not efficient because claims data are gathered at a very late stage and may no longer be as useful by that time.
The cost-sensitive study by Christopher et al. [30] used a dataset of 1248 patient discharge summaries, from which a total of 5429 features were extracted using bag-of-words. The dataset was somewhat imbalanced, with 14.32% of instances in the positive class (readmission) and 85.68% in the negative class (non-readmission). The classification algorithms chosen were Naive Bayes (NB), Random Forest (RF), Support Vector Machines (SVM), k-Nearest Neighbors (kNN), C4.5, Bagging with REPTree, and Boosting with Decision Stump. The study shows that including a cost factor in classification can reduce CMS penalties.
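One common way to include a cost factor in classification is to move the decision threshold according to the relative costs of the two error types. The sketch below illustrates that general idea only; it is not the method of the cited study, and the function names and cost values are hypothetical. A patient is flagged when the expected cost of missing a readmission exceeds the expected cost of an unnecessary intervention.

```python
def cost_sensitive_threshold(cost_false_positive: float, cost_false_negative: float) -> float:
    """Probability above which flagging a patient minimizes expected cost.

    Flag when p * cost_false_negative > (1 - p) * cost_false_positive,
    which rearranges to p > cost_fp / (cost_fp + cost_fn).
    """
    return cost_false_positive / (cost_false_positive + cost_false_negative)


def flag_for_intervention(prob_readmit: float, cost_fp: float, cost_fn: float) -> bool:
    """Return True if the patient should be flagged under the given costs."""
    return prob_readmit > cost_sensitive_threshold(cost_fp, cost_fn)


# Hypothetical costs: a missed readmission is assumed to cost 6x an unneeded intervention.
threshold = cost_sensitive_threshold(cost_false_positive=1.0, cost_false_negative=6.0)
print(f"flag patients with predicted risk above {threshold:.3f}")
print(flag_for_intervention(0.20, cost_fp=1.0, cost_fn=6.0))
```

With equal costs the threshold is 0.5; as missed readmissions become more expensive the threshold drops, which is one way to counter the bias toward the majority (non-readmission) class in an imbalanced dataset.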
In another study by Christopher et al. [31], hospital readmission datasets are shown to be imbalanced in nature, so models built directly on such data do not provide efficient solutions. The authors therefore proposed a method that uses an ensemble of topic learners to leverage data from multiple hospitals and sources. The study was performed on a dataset of 62,714 instances from 16 hospitals, with a total of 7112 extracted features in the corpus.
The study used Naive Bayes (NB), k-nearest neighbors (kNN), linear regression (LR), and support vector machine (SVM) classification algorithms. The results showed that hospitals that implement latent topic ensemble learners using Naive Bayes reduce readmissions and CMS penalties when compared with those using other known methods.
Although some of the works reviewed in this section predict readmission better than others, overall they lack two very important aspects of data analysis in a real-world setting: frameworks suitable for increasing amounts of data, and the ability to handle new data that is extracted on a daily basis. Big Data will play a very important role if a hospital readmission prediction system is to be used in a real-world setting, where all current patients are scored for readmission probability as they are treated and the feature sets are updated as new data is extracted. This study therefore focuses on implementation using Big Data and on using models in that context.
The Big Data platform selected for this study is HPCC (high performance computing cluster) Systems, also known as the data analytics supercomputer (DAS). HPCC Systems is an open-source Big Data software architecture developed by LexisNexis Risk Solutions. It is implemented on commodity computing clusters to deliver high-speed processing of Big Data [32, 33]. Because HPCC Systems uses commodity hardware as processing clusters connected by a high-speed network, real-time analytics of healthcare readmissions can be cost-effective. The architecture provides high redundancy and availability: file part replicas are stored on several nodes so that, in the event of a failure, the data remains available.
The HPCC Systems architecture provides pre-built tools to create and manage a Big Data platform with ease and efficiency [34]. These include administrative tools that allow easy configuration in a cluster environment and job monitoring to keep track of all job units being processed. It also provides extension modules for natural language parsing, machine learning, and data encryption, which can readily be used in the healthcare domain for predicting readmissions from patient discharge summaries [35]. HPCC Systems also provides an easy-to-use declarative language designed for its Big Data architecture, known as enterprise control language (ECL). The ECL compiler is cluster-aware and automatically optimizes code for parallel processing.
HPCC Systems offers several advantages over its alternative, Hadoop, which is based on Google's MapReduce paradigm [36]. HPCC Systems uses three types of parallelism (data parallelism, pipeline parallelism, and system parallelism), whereas Hadoop uses only one [36, 37]. According to a study by Seref et al. [25], on the same 400-node hardware configuration, HPCC took 6 min 27 s whereas Hadoop took 25 min 28 s, which indicates that HPCC Systems is designed efficiently and makes better use of the same hardware. HPCC Systems uses ROXIE, which was built on an architecture of random access, low latency, and high concurrency and provides real-time query output, whereas Hadoop does not provide real-time processing. One of the distinguishing features of HPCC is its suite of monitoring services and tools to ensure high availability. This suggests that HPCC Systems can enable more effective and efficient use of Big Data in healthcare.
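The size of the reported gap is easier to judge as a ratio. The short calculation below converts the two run times quoted above into seconds and computes the implied speedup; the helper name is ours, and only the two timings come from the text.

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a minutes-and-seconds duration to total seconds."""
    return 60 * minutes + seconds


hpcc_runtime = to_seconds(6, 27)      # 387 s
hadoop_runtime = to_seconds(25, 28)   # 1528 s
speedup = hadoop_runtime / hpcc_runtime
print(f"HPCC was about {speedup:.2f}x faster")  # prints "HPCC was about 3.95x faster"
```

On this single benchmark, HPCC finished in roughly a quarter of the time Hadoop needed on identical hardware.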
HPCC Systems is used in a wide range of applications, including parameter estimation for improving machine learning models [38] and cyber security analytics [39–41]. Healthcare applications built on the HPCC platform, such as detecting organized crime in healthcare using social network analytics [42], show the potential of HPCC in this domain as well.