Introduction
Background and related work
Stream computing
Dimension | Batch processing | Streaming processing |
---|---|---|
Input | Data chunks | Stream of new data or updates |
Data size | Known and finite | Infinite or unknown in advance |
Hardware | Multiple CPUs | Typical single limited amount of memory |
Storage | Store | Not store or store non-trivial portion in memory |
Processing | Processed in multiple rounds | A single or few passes over data |
Time | Much longer | A few seconds or even milliseconds |
Applications | Widely adopted in almost every domain | Web mining, traffic monitoring, sensor networks |
Big data stream analysis
Key issues in big data stream analysis
Scalability
Integration
Fault-tolerance
Timeliness
Consistency
Heterogeneity and incompleteness
Load balancing
High throughput
Privacy
Accuracy
Related work
Research method
Research question
-
Research Question 1: What are the tools and technologies employed for big data stream analysis?
-
Research Question 2: What methods and techniques are used in analysing big data streams?
-
Research Question 3: What do these tools and technologies have in common and their differences in terms of concept, purpose and capabilities?
-
Research Question 4: What are the limitations and strengths of these tools and technologies?
-
Research Question 5: What are the evaluation techniques or benchmarks used for evaluating big data streaming tools and technology?
Search string
Data sources
Data retrieval
Scopus | ScienceDirect | EBSCOhost | Total | |
---|---|---|---|---|
Number of papers | 2097 | 65 | 133 | 2295 |
Scopus | ScienceDirect | EBSCOhost | Total | |
---|---|---|---|---|
Number of papers | 196 | 27 | 92 | 315 |
Scopus | ScienceDirect | EBSCOhost | Total | |
---|---|---|---|---|
Number of papers | 64 | 23 | 24 | 111 |
Scopus | ScienceDirect | EBSCOhost | Total | |
---|---|---|---|---|
Number of papers | 25 | 10 | 12 | 47 |
Inclusion criteria
Exclusion criteria
Result
Research Question 1: What are the tools and technologies employed for big data stream analysis?
Shape of the data
Data access
Availability and consistency requirement
Workload profile required
Latency requirement
Tools and technology | Article |
---|---|
BlockMon | [83] |
NoSQL | |
Spark streaming | |
Apache storm | |
Kafka | |
Yahoo! S4 | |
Apache Samza | |
Photon | |
Apache Aurora | |
MavEStream | [103] |
EsperTech | |
Redis | [106] |
C-SPARQL | |
SAMOA | |
CQELS | |
ETALIS | [112] |
XSEQ | [73] |
Apache Kylin | [113] |
Splunk stream | [114] |
Tools and technology | Article |
---|---|
CodeBlue | [115] |
Anodot | [116] |
Cloudet | [117] |
Sentiment brand monitoring | [118] |
Numenta | [119] |
Elastic streaming processing engine | [120] |
Microsoft azure stream analytics | [121] |
IBM InfoSphere streams | |
Google MillWheel | [123] |
Artemis | [124] |
WSO2 analytics | [125] |
Microsoft StreamInsight | [126] |
TIBCO StreamBase | [127] |
Striim | [128] |
Kyvos insights | [129] |
AtScale | |
Lambda architecture | [57] |
Research Question 2: What methods and techniques are used in analysing big data streams?
Methods and techniques | Article |
---|---|
SPADE | [132] |
Locally supervised metric learning (LSML) | [133] |
KTS | [106] |
Multinomial latent dirichlet allocation | [106] |
Voltage clustering algorithm | [106] |
Locality sensitive hashing (LSH) | [134] |
User profile vector update algorithm | [134] |
Tag assignment stream clustering (TASC) | [134] |
StreamMap | [117] |
Density cognition | [117] |
QRS detection algorithm | [87] |
Forward chaining rule | [110] |
Stream | [135] |
CluStream | |
HPClustering | [138] |
DenStream | [139] |
D-Stream | [140] |
ACluStream | [141] |
DCStream | [142] |
P-Stream | [143] |
ADStream | [144] |
Continuous query processing (CQR) | [145] |
FPSPAN-growth | [146] |
Outlier method for cloud computing algorithm (OMCA) | [147] |
Multi-query optimization strategy (MQOS) | [148] |
Parallel K-means clustering | [72] |
Visibly push down automata (VPA) | [73] |
Incremental MI outlier detection algorithm (Inc I-MLOF) | [149] |
Adaptive windowing based online ensemble (AWOE) | [74] |
Dynamic prime-number based security verification | [84] |
K-anonymity, I-diversity, t-closeness | [90] |
Singular spectrum matrix completion (SS-MC) | [76] |
Temporal fuzzy concept analysis | [96] |
ECM-sketch | [77] |
Nearest neighbour | [91] |
Markov chains | [91] |
Block-QuickSort-AdjacentJobMatch | [86] |
Block-QuickSort-OverlapReplicate | [86] |
Fuzzy-CSar-AFP | [150] |
Weighted online sequential extreme learning machine with kernels (WOS-ELMK) | [22] |
Concept-adapting very fast decision tree (CVFDT) | [151] |
Research Question 3: What do big data streaming tools and technologies have in common and their differences in terms of concept, purpose, and capabilities?
Tools and technology | Database support | Execution model | Workload | Fault tolerance | Latency | Throughput | Reliability | Operating system | Implementation/supported languages | Application |
---|---|---|---|---|---|---|---|---|---|---|
BlockMon | Cassandra, MongoDB, XML | Streaming | Multi-slice memory allocation and batch allocations | Checkpoint, rollback | Very low | High | At least once | Linux | C ++11, Python | Anomaly detection, network optimization, multimedia content delivery, financial market analysis, web analytics |
Spark Streaming | Kafka, HBase, Hive Flume, HDF/S3, Kinesis, TCP sockets, Twitter, SQL | Batch, Iterative, Streaming | CPU/memory intensive | RDD based Check-pointing, parallel recovery, replication | Low | High | Exactly once | Windows, macOS, Linux | Scala, Python, Java, R | Event detection, streaming machine learning, fog computing, interactive analysis, multimedia analysis, cluster analysis, filtering, re-processing, cache invalidation |
Apache Storm | Spout, HBase, Hive, SQL, Cassandra, Memcached | Streaming | CPU/memory intensive | Replication, checkpoint, data recovery, Upstream backup, record-level acknowledgement, stateless management | Very low | Low | At least once | Windows, macOS, Linux | Clojure, Java, Scala, Clojure, non-JVM languages | Internet of things, streaming machine learning, multimedia analysis |
Yahoo! S4 | MySQL, NoSQL, Rich Data Format | Streaming | CPU/memory intensive | Replication, checkpoint, data recovery | Low | Low | Exactly once | Linux | Java, Python, C++, Perl | Online analytics, monitoring, fraud detection, financial data processing, web personalization and session modelling |
Apache Samza | Kafka, HDFS, Kinesis, Stream consumer, Key-value stores | Streaming, batch processing | Memory intensive | Checkpoint | Very low | High | At least once | Linux, Windows | Java, Scala, JVM languages | Filtering, re-processing, cache invalidation |
Apache Flink | Kafka, Flume, HDF/S3, Kinesis, TCP sockets, Twitter, Cassandra, Redis, MongoDB, HBase, SQL | Streaming, batch, iterative, interactive | Memory intensive | Stream replay and marker-checkpoint | Very low | High | Exactly once | Linux, MacOS, Windows | Java, Scala, Python | Optimization of e-commerce search result, network/sensor monitoring and error detection, ETL for business intelligence infrastructure, machine learning |
Apache Aurora | H2, Java maps, MyBatis, MySQL, PostgreSQL | Streaming | Memory and disk space | Periodic recovery checkpoint and rollback | Low | High | At least once | Linux | Python | Monitoring applications such as financial analysis and military applications |
Redis | Key-value stores, rabitmq, MongoDB | Streaming | In-memory but persistent on-disk database | Replica migration, Sentinel | Low | High | At least once | Ubuntu, Linux, OSX | C, C#, Java, PHP, Python | Web analysis, cache, message queues |
C-SPARQL | RDF, SQLJ, NoSQL, HDF | Batch, streaming | Low memory usage | Adaptation | Very low | High | Cumulative | Windows, Linux, MacOS, Android | Java, Apache Jena libraries | Real-time reasoning over sensor data, social semantic data, urban computing |
SAMOA | HBase, Hive, Cassandra | Streaming | Low memory usage | Upstream backup | Low | High | Exactly once | Linux | Java | Classification, clustering, spam detection, regression, frequent pattern mining |
CQELS | RDF, SQLJ, NoSQL, HDF | Batch, streaming | In-memory | Adaptation | Low | High | Cumulative | Windows, Linux, MacOS, Android | Java | Real-time reasoning over sensor data, social semantic data, urban computing |
ETALIS | RDF | Streaming | Binarization | Adaptation | Low | Low | Cumulative | Windows, Linux, MacOS, Android | Prolog, Java, C, SPARQL, C#, ETALIS Language for Events (ELE) | Event detection, reasoning over streaming events |
XSEQ | XML | Batch, streaming | In-memory with buffering | checkpoint | Low | High | At least once | Windows, Linux | Java, Apache Xerces | Biological data, social networks, user behaviour, financial data analysis, filtering |
IBM InfoSphere streams | Pig, Hive, Jaql, HBase Flume, Lucene, Avro, ZooKeeper, Oozie, Oracle Database, DB2, Netezza, MySQL, Aster, Informix. | Streaming | Capture database workloads and replay them in a test database environment | Automatic recovery | Low | High | Exactly once, At least once, At most once | Linux, CentOS | C ++ Java SPL | Space weather prediction, physiological data streams analysis, traffic management, real-time predictions, event detection, visualisation |
Google MillWheel | BigTable, Spanner | Streaming | In-memory and bloom filtering | Uncoordinated periodic, checkpoint, upstream backup | Low | High | Exactly once | Linux | Virtually any programming language | Anomaly detection, health monitoring, image processing, network switch management |
Infochimps cloud | SQL, NoSQL, Hive, Pig Wukong, Hadoop, RDBMS, Virtually any data format | Batch, streaming | In-memory | Upstream backup | Low | High | Exactly once | Linux | Java | Disaster discovery, text analysis, complex event processing, visualisation |
Microsoft StreamInsight | SQL Server | Streaming | In-memory | Replication, checkpoint, data recovery | Very low | High | Exactly once | Windows | .NET, C#, LINQ, Rx | Manufacturing process monitoring and control, financial data analysis, operation analytics, web analytics, event pattern detection |
TIBCO StreamBase | Oracle database, SQL Server, Impala | Batch, Streaming | In-memory | Synchronization, replication, rollback | Very low | High | At least once/at most once/exactly once | Windows, MacOS, Linux | R, Java | Mission critical analysis, IoT analysis, click-stream analysis, predictive analytics, workflow optimization, risk avoidance |
Lambda Architecture | RDBMS, Cassandra, Kafka, Data Warehouses, Kinesis Data Stream, HDFS, HBase | Batch, Streaming | In-memory/disk database | Replication, checkpoint | Low | Low | Exactly once | Ubuntu, Windows, Linux | Java, C#, Python, Pig Latin | IoT analysis, tracking real-time updates, financial risk management, click-stream analysis |
Research Question 4: What are the limitations and strengths of big data streaming tools and technologies?
Research Question 5: What are the evaluation techniques or benchmarks that are used for evaluating big data streaming tools and technologies?
Discussion
Year | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011 | 2012 | 2013 | 2014 | 2015 | 2016 | 2017 | 2018 | Total |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Paper | 2 | 1 | 2 | 3 | 5 | 2 | 5 | 4 | 5 | 10 | 22 | 28 | 38 | 98 | 156 | 381 |