Introduction
Background
Big data challenges in healthcare
Related work
| Hadoop MapReduce | Apache Spark |
---|---|---|
Definition | Open-source big data framework that processes structured and unstructured data stored in HDFS; Hadoop MapReduce is designed to process large amounts of data on a cluster | Open-source big data framework; a flexible in-memory framework that can handle both batch and real-time analytics and data-processing workloads. Spark is designed for fast computation |
Speed | Reading and writing from/to the file system and disk slows down processing | Up to 100 times faster in memory, and up to 10 times faster even when running on disk, than Hadoop MapReduce, because computation runs in memory |
Ease of use | Developers need to code each operation and build their own abstractions, so it is difficult to program each problem easily | Easier to use than Hadoop, thanks to a rich set of high-level operators on RDDs |
Real-time analysis | No | Yes |
Execution model | Batch | Batch, streaming |
In-memory | No | Yes |
Methods
Proposed architecture for real-time health status prediction and analytics system
Data sources
Kafka real-time data collection
Spark streaming data processing
- Transformations: create a new RDD from existing RDDs by applying operations such as map, filter, and others; they are evaluated lazily.
- Actions: compute a result from an RDD; the result is either returned to the driver program or saved to an external storage system.
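The transformation/action split above can be illustrated with a minimal stdlib-only sketch that mimics Spark's lazy evaluation model. The `FakeRDD` class and the heart-rate example values are purely illustrative, not the real RDD API:

```python
# Illustrative sketch of Spark's lazy evaluation model (not the real RDD API):
# transformations are only recorded; nothing runs until an action is called.
class FakeRDD:
    def __init__(self, data, pipeline=None):
        self._data = data
        self._pipeline = pipeline or []   # deferred transformations

    # Transformations: return a new FakeRDD; no computation happens yet.
    def map(self, fn):
        return FakeRDD(self._data, self._pipeline + [("map", fn)])

    def filter(self, pred):
        return FakeRDD(self._data, self._pipeline + [("filter", pred)])

    # Actions: run the whole recorded pipeline and materialize a result.
    def collect(self):
        items = iter(self._data)
        for kind, fn in self._pipeline:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)

    def count(self):
        return len(self.collect())

readings = FakeRDD([72, 180, 95, 60, 150])          # e.g. heart-rate events
high = readings.filter(lambda bpm: bpm > 100).map(lambda bpm: bpm - 100)
print(high.collect())   # [80, 50] -- computed only when collect() runs
print(high.count())     # 2
```

Deferring work this way is what lets Spark fuse a chain of transformations into a single pass over the data instead of materializing each intermediate result.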
Spark architecture
Use case datasets
No | Attributes | Description |
---|---|---|
1 | Age | Age in years |
2 | Sex | Sex (1= male, 0= female) |
3 | Cp | Chest pain type |
4 | Restbpss | Resting blood pressure |
5 | Chol | Serum Cholesterol |
6 | Fbs | Fasting blood sugar |
7 | Restecg | Resting electrocardiographic results |
8 | Thalach | Maximum heart rate |
9 | Oldpeak | ST depression induced by exercise relative to rest |
10 | Exang | Exercise induced angina |
11 | Slope | Slope of peak exercise ST segment |
12 | Ca | Number of major vessels colored with fluoroscopy |
13 | Thal | 3 (normal), 6 (fixed defect), 7 (reversible defect) |
14 | Num | Class (1 = presence of heart disease, 0 = absence of heart disease) |
Spark implementation of parallel decision tree
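As background for the decision-tree implementation, MLlib grows a classification tree by choosing, at each node, the split that most reduces an impurity measure; the default measure for classification is Gini impurity. The following stdlib-only sketch (labels and indices are a toy example, not the paper's data) shows the criterion itself:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class frequencies p_k."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gain(labels, left_idx):
    """Impurity reduction from splitting `labels` into left/right subsets."""
    left_set = set(left_idx)
    left = [labels[i] for i in left_idx]
    right = [labels[i] for i in range(len(labels)) if i not in left_set]
    n = len(labels)
    weighted = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - weighted

# Toy node: 1 = heart disease present, 0 = absent.
labels = [1, 1, 0, 0, 0, 1]
print(round(gini(labels), 3))                   # 0.5 -- maximally mixed node
print(round(split_gain(labels, [0, 1, 5]), 3))  # perfect split -> gain 0.5
```

In the distributed setting, Spark evaluates such candidate splits in parallel across partitions of the training data, which is what makes the tree construction scale with the number of workers.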
Data storage and visualization
- The reasonable cost and ease of implementation.
- Partitioning and replication of data across multiple machines.
- Scalability at the schema level: columns can be added easily, allowing larger volumes of data to be handled quickly.
- The speed of data transfer compared to conventional databases.
- Scalability by adding nodes to the cluster without having to redesign the data distribution.

After processing with Spark, the resulting data is stored in a Cassandra table with a primary key. The stored data can later be queried for historical analysis, visualization, reporting, and real-time monitoring.
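A storage layout of this kind could be sketched as the following CQL schema. The keyspace, table, and column names here are hypothetical illustrations, not taken from the paper; the point is the primary key, which lets dashboards query one patient's history in time order:

```sql
-- Hypothetical schema (illustrative names): prediction results written by
-- Spark, keyed so each patient's events can be read back in time order.
CREATE TABLE health_data.predictions (
    patient_id      text,
    event_time      timestamp,
    heart_rate      int,
    predicted_class int,        -- 1 = disease predicted, 0 = not
    PRIMARY KEY (patient_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
```

The partition key (`patient_id`) controls how rows are distributed and replicated across nodes, while the clustering column (`event_time`) orders rows within a partition for efficient range queries.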
Experiments
Experiment setup
Parameter | Master | Worker |
---|---|---|
Processor | Core i7 | Core i3 |
Cores | 4 | 4 |
Memory | 8 GB | 4 GB |
Operating system | Ubuntu 16.04 | Ubuntu 16.04 |
No. of nodes | Events/s | Kafka | Topic partitions | Spark workers | Cassandra |
---|---|---|---|---|---|
1 | \(\sim\) 550,000 | 1 | 1 | 1 | 1 |
2 | \(\sim\) 1,300,000 | 2 | 2 | 2 | 2 |
Results and discussion
Performance evaluation of machine learning model
- Receiver operating characteristic (ROC) curve: one of the most important and effective metrics for evaluating the quality or performance of diagnostic tests. The ROC curve is supported by MLlib.
- Classification accuracy: defined as the ratio of correct predictions to all predictions made. In this work, classification accuracy for the datasets is measured using the equation:$$\begin{aligned} \mathrm{Accuracy}=\frac{TP+TN}{TP+TN+FP+FN} \end{aligned}$$(6)
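The metrics reported below can be sketched directly from a confusion matrix. This stdlib-only example (the label vectors are made up for illustration; the paper itself relies on MLlib's evaluators) applies Eq. (6) together with the sensitivity and specificity definitions:

```python
# Evaluation metrics from a confusion matrix, computed with the stdlib only.
def confusion_counts(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):          # Eq. (6)
    return (tp + tn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):               # true positive rate
    return tp / (tp + fn)

def specificity(tn, fp):               # true negative rate
    return tn / (tn + fp)

# Toy predictions: 1 = disease present, 0 = absent.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(accuracy(tp, tn, fp, fn))   # 0.75
print(sensitivity(tp, fn))        # 0.75
print(specificity(tn, fp))        # 0.75
```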
Dataset | Diabetes | Heart disease |
---|---|---|
maxBins | 250 | 100 |
maxDepth | 8 | 6 |
Sensitivity (%) | 85.97 | 80.00 |
Specificity (%) | 94.38 | 85.36 |
ROC curve (%) | 90.03 | 82.3 |
Accuracy (%) | 91.57 | 82.40 |
Apache Spark vs Weka performance
Spark based DT scalability
Throughput
- The larger the throughput, the higher the processing time.
- The more nodes used, the lower the processing time.
- With enough nodes, performance stays close to optimal even at high throughput. For example, at a throughput of 2.5 million records, the execution time is about 2 s with 2 nodes versus about 3 s with a single node, and it decreases further as more nodes are added.
- Spark Streaming combined with distributed machine learning is a good choice for real-time problems: by leveraging more nodes, we can tackle big data problems, especially those related to real-time prediction in the healthcare field.