Introduction
Motivation
Contributions
- We propose RSP-Explore, a new method for big data exploration and cleaning on computing clusters using the RSP approach;
- We address error detection as an estimation problem by estimating the proportions of inconsistent values in quantitative data;
- We propose an algorithm to estimate the statistical properties of the entire unknown clean data set by cleaning only a few RSP blocks;
- We present a theoretical analysis of using RSP blocks for statistical estimation;
- We empirically demonstrate the performance of RSP-Explore on three real data sets.
Related work
Sampling-based approximate big data analysis
Big data exploration and profiling
Scaling statistical methods to big data
Background
RSP distributed data model
Big data computing with RSP blocks
Methods
RSP-Explore
Statistics estimator
Error detector
- Error values: values that do not belong to the data type of X (e.g., a negative value in a power consumption feature);
- Outlier values: any value that is not an error and lies more than a specified threshold away from the center of the data. Since the mean and standard deviation are sensitive to outliers, we use the median as a robust metric of location instead of the mean, and the Median Absolute Deviation (MAD) as a robust metric of dispersion instead of the standard deviation [16, 53]. The MAD measures the median distance of all the values from the median. The outlier threshold is then defined as \(a \times \mathrm{MAD}\) away from the median, where \(a\) is often set to 2, 2.5, or 3;
- Missing values: values that are absent, often represented using a special string, e.g., NA;
- Valid values: values that do not belong to any of the previous categories.
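The four categories above can be illustrated with a minimal sketch that computes their proportions in a single block. It is not the authors' implementation: the function names, the `in_domain` predicate (modeled on the non-negative power consumption example), and the choice to compute the median and MAD only over values that are neither missing nor errors are assumptions made for illustration.

```python
from statistics import median


def mad(values, scale=1.0):
    """Median of the absolute deviations from the median, times an optional scale.

    Assumption: scale=1.0 gives the raw MAD; some statistical software
    multiplies by 1.4826 to make it comparable to a standard deviation.
    """
    center = median(values)
    return scale * median([abs(v - center) for v in values])


def category_proportions(block, in_domain=lambda v: v >= 0, a=3.0, missing_token="NA"):
    """Proportions of error, outlier, missing, and valid values in one block.

    `in_domain` is a hypothetical validity check for the feature's data type.
    """
    counts = {"errors": 0, "outliers": 0, "missing": 0, "valid": 0}
    candidates = []  # values that are neither missing nor errors
    for item in block:
        if item is None or item == missing_token:
            counts["missing"] += 1
            continue
        try:
            value = float(item)
        except (TypeError, ValueError):
            counts["errors"] += 1  # cannot be read as a number
            continue
        if in_domain(value):
            candidates.append(value)
        else:
            counts["errors"] += 1
    if candidates:
        center = median(candidates)
        threshold = a * mad(candidates)
        for value in candidates:
            if abs(value - center) > threshold:
                counts["outliers"] += 1
            else:
                counts["valid"] += 1
    total = len(block)
    return {name: (count / total if total else 0.0) for name, count in counts.items()}


# Toy block: one extreme value, one missing value, one negative (out-of-domain) value.
block = [1, 2, 3, 4, 5, 100, "NA", -7]
props = category_proportions(block)
print(props)
```

Note that when the MAD of a block is zero, the threshold collapses to zero and every value different from the median is flagged as an outlier, so a real detector would likely need a guard for that case.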
Data cleaner
Theoretical analysis
Results
Experiment environment and settings
Name | N | M | K | n |
---|---|---|---|---|
HIGGS | 11,000,000 | 28 | 100 | 110,000 |
Power | 46,669,266 | 99 | 6667 | 7000 |
AirOnTime87to12 | 148,619,655 | 47 | 298 | 498,724 |
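The table does not define its columns here. Reading N as the number of records, K as the number of RSP blocks, and n as the number of records per block (an interpretation, not something the table states), the rows are consistent with n ≈ N / K. The short check below uses only the values from the table.

```python
# (N, K, n) copied from the table; the column interpretation is an assumption.
datasets = {
    "HIGGS": (11_000_000, 100, 110_000),
    "Power": (46_669_266, 6667, 7000),
    "AirOnTime87to12": (148_619_655, 298, 498_724),
}

relative_gaps = {}
for name, (N, K, n) in datasets.items():
    records_per_block = N / K
    # Relative difference between N / K and the reported block size n.
    relative_gaps[name] = abs(records_per_block - n) / n
    print(f"{name}: N/K = {records_per_block:.2f}, reported n = {n}")
```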
Exploring HIGGS data
Summary statistics
Stat | True value | RSP-based estimates: Mean ± StdDev | RSP-based estimates: 5th percentile | RSP-based estimates: 50th percentile | RSP-based estimates: 95th percentile |
---|---|---|---|---|---|
Mean | 0.9914658 | 0.9916447 ± 0.001691498 | 0.9892427 | 0.9921890 | 0.9941837 |
StdDev | 0.5653777 | 0.5657651 ± 0.002567618 | 0.5625647 | 0.5652471 | 0.5696894 |
Median | 0.8533714 | 0.8531884 ± 0.001508743 | 0.8513674 | 0.8530054 | 0.8555858 |
MAD | 0.4485073 | 0.4483536 ± 0.001575363 | 0.4464044 | 0.4482361 | 0.4509222 |
Skewness | 1.758388 | 1.758561 ± 0.03039423 | 1.719768 | 1.753431 | 1.810040 |
Kurtosis | 5.57178304 | 5.551798 ± 0.4035557 | 5.048026 | 5.532661 | 6.335199 |
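The summary columns in the table above (mean ± standard deviation and the 5th, 50th, and 95th percentiles) can be read as describing how an estimate varies from one RSP block to another. The following sketch shows one way such a summary could be computed. It is an illustration under assumptions: the data are synthetic, a plain shuffle-and-split stands in for the RSP data model, and `summarize_block_estimates` is a hypothetical helper rather than part of RSP-Explore.

```python
import random
from statistics import mean, quantiles, stdev


def summarize_block_estimates(estimates):
    """Summarize per-block estimates: mean, std dev, and 5th/50th/95th percentiles."""
    cuts = quantiles(estimates, n=100, method="inclusive")  # 99 percentile cut points
    return {
        "mean": mean(estimates),
        "stddev": stdev(estimates),
        "p5": cuts[4],
        "p50": cuts[49],
        "p95": cuts[94],
    }


# Synthetic stand-in for a big data set (not one of the paper's data sets).
random.seed(0)
population = [random.gauss(1.0, 0.5) for _ in range(20_000)]
random.shuffle(population)

K = 100                      # number of blocks (assumed for this toy example)
n = len(population) // K     # records per block
blocks = [population[i * n:(i + 1) * n] for i in range(K)]

# One estimate of the mean per block, then a summary across blocks.
block_means = [mean(b) for b in blocks]
summary = summarize_block_estimates(block_means)
true_mean = mean(population)
print(true_mean, summary)
```

The same helper applies to any block-level statistic (standard deviation, median, MAD, and so on) by replacing `mean(b)` with the statistic of interest.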
Correlation
Feature | True value | RSP-based estimates: Mean ± StdDev | RSP-based estimates: 5th percentile | RSP-based estimates: 50th percentile | RSP-based estimates: 95th percentile |
---|---|---|---|---|---|
V3 | −0.000153 | −0.000151 ± 0.003051 | −0.003929 | −0.000420 | 0.002098 |
V10 | −0.006265 | −0.006065 ± 0.003414 | −0.011673 | −0.006966 | −0.002738 |
V15 | −0.011190 | −0.011556 ± 0.003186 | −0.015670 | −0.011981 | −0.006128 |
V20 | 0.000090 | 0.000023 ± 0.003806 | −0.003273 | 0.000171 | 0.007839 |
V25 | 0.272327 | 0.274107 ± 0.003175 | 0.266316 | 0.270652 | 0.277722 |
V29 | 0.141168 | 0.141790 ± 0.003782 | 0.135513 | 0.141922 | 0.146332 |
Error detection
Category | True value | RSP-based proportions: Mean ± StdDev | RSP-based proportions: 5th percentile | RSP-based proportions: 50th percentile | RSP-based proportions: 95th percentile |
---|---|---|---|---|---|
Errors | 0 | 0 ± 0 | 0 | 0 | 0 |
Outliers | 0.04084664 | 0.04082 ± 0.0006663723 | 0.03977546 | 0.04074545 | 0.04160728 |
Missing values | 0 | 0 ± 0 | 0 | 0 | 0 |
Valid values | 0.95915336 | 0.95918 ± 0.0006663723 | 0.9583927 | 0.9592546 | 0.9602245 |
Exploring Power data
Summary statistics
Stat | True value | RSP-based estimates: Mean ± StdDev | RSP-based estimates: 5th percentile | RSP-based estimates: 50th percentile | RSP-based estimates: 95th percentile |
---|---|---|---|---|---|
Mean | 5314.49 | 5256.008 ± 215.3196 | 4942.404 | 5254.394 | 5602.095 |
StdDev | 15127.22 | 10937.52 ± 4318.59 | 5990.154 | 10331.031 | 19210.659 |
Median | 3100 | 3103.9 ± 88.62217 | 2982.475 | 3099.750 | 3243.750 |
MAD | 3783.595 | 3776.738 ± 106.9769 | 3616.877 | 3783.966 | 3954.020 |
Skewness | −108.5203 | 20.84063 ± 12.67476 | 2.474514 | 21.064943 | 39.870585 |
Kurtosis | 71059.13 | 802.1309 ± 660.3176 | 11.966831 | 692.604567 | 1977.642239 |
Correlation
Feature | True value | RSP-based estimates: Mean ± StdDev | RSP-based estimates: 5th percentile | RSP-based estimates: 50th percentile | RSP-based estimates: 95th percentile |
---|---|---|---|---|---|
V3 | 0.8512235 | 0.7536487 ± 0.1524914 | 0.5233034 | 0.7376381 | 0.9631805 |
V15 | 0.8720032 | 0.8009430 ± 0.1265738 | 0.6350162 | 0.7963755 | 0.9701548 |
V37 | 0.96786065 | 0.97107535 ± 0.02932448 | 0.93525808 | 0.97572809 | 0.99619922 |
V39 | 0.95651256 | 0.96303372 ± 0.02644475 | 0.92738897 | 0.96685111 | 0.99514627 |
V65 | 0.8905704 | 0.8496362 ± 0.0984346 | 0.6913239 | 0.8548659 | 0.9741665 |
V98 | 0.8527415 | 0.7653636 ± 0.1444491 | 0.5650703 | 0.7657670 | 0.9650463 |
Error detection
Category | True value | RSP-based proportions: Mean ± StdDev | RSP-based proportions: 5th percentile | RSP-based proportions: 50th percentile | RSP-based proportions: 95th percentile |
---|---|---|---|---|---|
Errors | 0.00000154 | 0.00001408 ± 0.00001408 | 0 | 0 | 0 |
Outliers | 0.03630323 | 0.03663272 ± 0.002324408 | 0.03328592 | 0.03656265 | 0.04053885 |
Missing values | 0.53230704 | 0.5320224 ± 0.005352080 | 0.52315101 | 0.53212264 | 0.54086298 |
Valid values | 0.43138819 | 0.4313435 ± 0.005839722 | 0.42173133 | 0.43105753 | 0.44055828 |
Data cleaning
Stat | True value | RSP-based estimates: Mean ± StdDev | RSP-based estimates: 5th percentile | RSP-based estimates: 50th percentile | RSP-based estimates: 95th percentile |
---|---|---|---|---|---|
Mean | 3404.014 | 3396.572 ± 101.0598 | 3241.725 | 3403.829 | 3559.236 |
StdDev | 2388.191 | 2382.172 ± 69.43742 | 2271.915 | 2378.807 | 2486.027 |
Median | 3100 | 3103.9 ± 88.62217 | 2982.475 | 3099.750 | 3243.750 |
MAD | 0 | 0 ± 0 | 0 | 0 | 0 |
Skewness | 1.996612 | 1.996031 ± 0.03404484 | 1.947026 | 1.997957 | 2.053139 |
Kurtosis | 5.185274 | 5.187026 ± 0.2038164 | 4.842079 | 5.144150 | 5.537230 |
Feature | True value | RSP-based estimates: Mean ± StdDev | RSP-based estimates: 5th percentile | RSP-based estimates: 50th percentile | RSP-based estimates: 95th percentile |
---|---|---|---|---|---|
V3 | 0.3126873 | 0.3132926 ± 0.01267681 | 0.2956115 | 0.3131225 | 0.3356508 |
V15 | 0.3268368 | 0.3276999 ± 0.01238780 | 0.3059839 | 0.3286929 | 0.3459587 |
V37 | 0.8598520 | 0.8601381 ± 0.01011005 | 0.8427635 | 0.8606978 | 0.8760539 |
V39 | 0.5330211 | 0.5335060 ± 0.01609916 | 0.5082045 | 0.5346581 | 0.5578231 |
V65 | 0.63534 | 0.6340810 ± 0.01368082 | 0.6108725 | 0.6361881 | 0.6562849 |
V98 | 0.4790145 | 0.4784269 ± 0.01759861 | 0.4507184 | 0.4788393 | 0.5072602 |
Exploring airlines data
Stat | True value | RSP-based estimates: Mean ± StdDev | RSP-based estimates: 5th percentile | RSP-based estimates: 50th percentile | RSP-based estimates: 95th percentile |
---|---|---|---|---|---|
Mean | 6.5666 | 6.471984 ± 0.1872946 | 6.229485 | 6.518680 | 6.728294 |
StdDev | 31.55641 | 31.58147 ± 0.5532607 | 30.73029 | 31.59074 | 32.25136 |
Median | 0 | 0 ± 0 | 0 | 0 | 0 |
MAD | 13.3434 | 13.3434 ± 0 | 13.3434 | 13.3434 | 13.3434 |
Skewness | 5.763419 | 5.685329 ± 0.3536372 | 5.189962 | 5.765824 | 6.116367 |
Kurtosis | 90.08612 | 88.75798 ± 17.53257 | 68.61477 | 87.41795 | 115.98568 |
Feature | True value | RSP-based estimates: Mean ± StdDev | RSP-based estimates: 5th percentile | RSP-based estimates: 50th percentile | RSP-based estimates: 95th percentile |
---|---|---|---|---|---|
DepDelay | 0.9257360 | 0.9266382 ± 0.0098338 | 0.906661909 | 0.929922510 | 0.936090184 |
WeatherDelay | 0.2616046 | 0.2592562 ± 0.0233771 | 0.228037031 | 0.258204030 | 0.297533928 |
Distance | −0.0134785 | −0.0151175 ± 0.0123154 | −0.034275887 | −0.016134390 | 0.006646666 |
AirTime | −0.0146634 | −0.0170937 ± 0.0124837 | −0.036912348 | −0.019335688 | 0.004608020 |
CarrierDelay | 0.5357752 | 0.5348132 ± 0.0297866 | 0.489324517 | 0.534856336 | 0.586338471 |
NASDelay | 0.3217719 | 0.3181037 ± 0.0243393 | 0.284709646 | 0.320763065 | 0.352731115 |
Category | True value | RSP-based proportions: Mean ± StdDev | RSP-based proportions: 5th percentile | RSP-based proportions: 50th percentile | RSP-based proportions: 95th percentile |
---|---|---|---|---|---|
Errors | 0 | 0 ± 0 | 0 | 0 | 0 |
Outliers | 0.07903821 | 0.07931381 ± 0.00185624 | 0.07680381 | 0.07920513 | 0.08183604 |
Missing values | 0.02047453 | 0.02039184 ± 0.00111821 | 0.01903729 | 0.02016446 | 0.02201047 |
Valid values | 0.90048725 | 0.9002944 ± 0.00227722 | 0.8977497 | 0.9003487 | 0.9037174 |
Discussion
- Currently, RSP-Explore cannot be used to get an approximate histogram. While it is possible to get histograms of individual RSP blocks, building an approximate histogram requires criteria for combining local histograms and quantifying the uncertainty of the approximated global histogram. We are currently working on extending RSP-Explore to build an approximate equi-width histogram that can be used to quickly understand the probability distribution of the entire data.
- RSP-Explore cannot be used directly to detect and repair duplication errors. It needs an additional step to check for duplicates across RSP blocks. We are currently experimenting with this idea. Furthermore, empirical and theoretical evidence is necessary to study the effect of de-duplication on the probability distribution in RSP blocks and on the similarity between these blocks and the entire unknown clean data. In fact, the burden of big data cleaning would be dramatically alleviated if repairing duplicates in only a small block-level sample were enough to get samples of the entire unknown clean data.
- RSP-Explore is not designed for streaming data. As mentioned before, we target offline workloads where data scientists explore big data sets on computing clusters with a variety of techniques. For streaming data, a different strategy is required to get synopses of the data, such as sketching [59].
- If the target is to find statistics or proportions in a certain subspace of the data, we may need alternative data partitioning algorithms to create RSP blocks with specific characteristics (e.g., each block is a random sample of the observations about customers in a certain city or branch). This issue still needs more investigation and experiments.