2020 | Book

Big Data Analytics and Knowledge Discovery

22nd International Conference, DaWaK 2020, Bratislava, Slovakia, September 14–17, 2020, Proceedings

Editors: Prof. Min Song, Prof. Il-Yeol Song, Gabriele Kotsis, Prof. Dr. A Min Tjoa, Ismail Khalil

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

The volume LNCS 12393 constitutes the papers of the 22nd International Conference on Big Data Analytics and Knowledge Discovery, which will be held online in September 2020.

The 15 full papers presented together with 14 short papers plus 1 position paper in this volume were carefully reviewed and selected from a total of 77 submissions.

This volume covers a wide range of subjects on the theoretical and practical aspects of big data analytics and knowledge discovery, such as a new generation of big data repositories, data pre-processing, data mining, text mining, sequences, graph mining, and parallel processing.

Table of Contents


Position Paper

Analyzing the Research Landscape of DaWaK Papers from 1999 to 2019
The International Conference on Big Data Analytics and Knowledge Discovery (DaWaK) has become a key conduit to exchange experience and knowledge among researchers and practitioners in the field of data warehousing and knowledge discovery. This study has quantitatively analyzed the 775 papers published in DaWaK from 1999 to 2019. This study presents the knowledge structure of the DaWaK papers and identifies the evolution of research topics in this discipline. Several text mining techniques were applied to analyze the contents of the research fields and to structure the knowledge presented at DaWaK. Dirichlet Multinomial Regression (DMR) is used to examine the trend of the research topics. Research metrics were used to identify conference and paper performance in terms of citation counts, readers, and the number of downloads. The study shows that DaWaK research outcomes have been receiving consistent attention from the scholarly community in the past 21 years. The 775 papers were cited 4,339 times, an average of 207 citations per proceedings volume and six citations per published paper.
Tatsawan Timakum, Soobin Lee, Il-Yeol Song, Min Song


DHE\(^{2}\): Distributed Hybrid Evolution Engine for Performance Optimizations of Computationally Intensive Applications
A large number of real-world optimization and search problems are too computationally intensive to be solved due to their large state space. Therefore, a mechanism for generating approximate solutions must be adopted. Genetic Algorithms, a subclass of Evolutionary Algorithms, represent one of the widely used methods of finding and approximating useful solutions to hard problems. Due to their population-based logic and iterative behaviour, Evolutionary Algorithms are very well suited for parallelization and distribution. Several distributed models have been proposed to meet the challenges of implementing parallel Evolutionary Algorithms. Among them, the MapReduce paradigm proved to be a proper abstraction for mapping the evolutionary process. In this paper, we propose a generic framework, i.e., DHE\(^{2}\) (Distributed Hybrid Evolution Engine), that implements distributed Evolutionary Algorithms on top of the MapReduce open-source implementation in Apache Hadoop. Within DHE\(^{2}\), we propose and implement two distributed hybrid evolution models, i.e., the MasterSlaveIslands and MicroMacroIslands models, alongside a real-world application that avoids the local optimum for clustering in an efficient and performant way. The experiments for the proposed application are used to demonstrate the increased performance of DHE\(^{2}\).
Oana Stroie, Elena-Simona Apostol, Ciprian-Octavian Truică
Grand Reports: A Tool for Generalizing Association Rule Mining to Numeric Target Values
Since its introduction in the 1990s, association rule mining (ARM) has proven to be one of the essential concepts in data mining, both in practice and in research. Discretization is the only means to deal with a numeric target column in today’s association rule mining tools. However, domain experts and decision-makers are used to arguing in terms of mean values when it comes to numeric target values. In this paper, we provide a tool that reports mean values of a chosen numeric target column concerning all possible combinations of influencing factors – so-called grand reports. We give an in-depth explanation of the functionalities of the proposed tool. Furthermore, we compare the capabilities of the tool with one of the leading association rule mining tools, i.e., RapidMiner. Moreover, the study delves into the motivation of grand reports and offers some useful insight into their theoretical foundation.
Sijo Arakkal Peious, Rahul Sharma, Minakshi Kaushik, Syed Attique Shah, Sadok Ben Yahia
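
One possible reading of the "grand report" idea in the abstract above is sketched below: for every non-empty combination of influencing factors, group the rows by the values of those factors and report the mean of the numeric target. The function name `grand_report`, the column names, and the sample rows are hypothetical; the abstract does not describe the tool's actual interface or output layout.

```python
from collections import defaultdict
from itertools import combinations
from statistics import mean


def grand_report(rows, factors, target):
    """Mean of `target` for every value assignment of every factor combination.

    Keys are tuples of (factor, value) pairs; this layout is an assumption
    made for illustration, not the format used by the authors' tool.
    """
    report = {}
    for k in range(1, len(factors) + 1):
        for combo in combinations(factors, k):
            groups = defaultdict(list)
            for row in rows:
                # Group rows by the values they take on the chosen factors.
                groups[tuple(row[f] for f in combo)].append(row[target])
            for values, targets in groups.items():
                report[tuple(zip(combo, values))] = mean(targets)
    return report


# Hypothetical sample data.
rows = [
    {"dept": "A", "senior": True, "salary": 50},
    {"dept": "A", "senior": False, "salary": 30},
    {"dept": "B", "senior": True, "salary": 70},
    {"dept": "B", "senior": True, "salary": 90},
]
report = grand_report(rows, ["dept", "senior"], "salary")
for key, value in sorted(report.items(), key=lambda kv: (len(kv[0]), str(kv[0]))):
    print(key, value)
```

Enumerating all factor combinations grows exponentially with the number of factors, so a sketch like this is only practical for a handful of columns.
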
Expected vs. Unexpected: Selecting Right Measures of Interestingness
Measuring interestingness in between data items is one of the key steps in association rule mining. To assess interestingness, after the introduction of the classical measures (support, confidence and lift), over 40 different measures have been published in the literature. Out of the large variety of proposed measures, it is very difficult to select the appropriate measures in a concrete decision support scenario. In this paper, based on the diversity of measures proposed to date, we conduct a preliminary study to identify the most typical and useful roles of the measures of interestingness. The research on selecting useful measures of interestingness according to their roles will not only help to decide on optimal measures of interestingness, but can also be a key factor in proposing new measures of interestingness in association rule mining.
Rahul Sharma, Minakshi Kaushik, Sijo Arakkal Peious, Sadok Ben Yahia, Dirk Draheim
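
For readers unfamiliar with the three classical measures named in the abstract above, the sketch below computes support, confidence and lift using their standard textbook definitions. The transactions are made-up sample data, and the function names are chosen for illustration only.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): supp(A and C) / supp(A)."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by supp(C); a value of 1 indicates independence."""
    return confidence(transactions, antecedent, consequent) / support(
        transactions, consequent
    )


# Hypothetical market-basket data.
transactions = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

print(support(transactions, {"bread", "milk"}))
print(confidence(transactions, {"bread"}, {"milk"}))
print(lift(transactions, {"bread"}, {"milk"}))
```

In this sample the lift of bread → milk is below 1, meaning the two items co-occur slightly less often than independence would predict.
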
SONDER: A Data-Driven Methodology for Designing Net-Zero Energy Public Buildings
The reduction of carbon emissions into the atmosphere has become an urgent health issue. The energy in buildings and their construction represents more than 1/3 of final global energy consumption and contributes to nearly 1/4 of greenhouse gas emissions worldwide. Heating, Ventilation, and Air-Conditioning (HVAC) systems are major energy consumers and responsible for about 18% of all building energy use. To reduce this huge amount of energy, the Net-Zero Energy Building (nZEB) concept has been imposed by energy authorities. They recommend a massive use of renewable energy technology. With the popularization of Smart Grid, Internet of Things devices, and Machine Learning (ML), a couple of data-driven approaches have emerged to reach this crucial objective. By analysing these approaches, we find that they lack a comprehensive methodology with a well-identified life cycle that favours collaboration between nZEB actors. In this paper, we share our vision for developing Energy Management Systems for nZEB as part of the IMPROVEMENT EU Interreg Sudoe programme. First, we propose a comprehensive methodology (SONDER), associated with a well-identified life cycle for developing data-driven solutions. Secondly, an instantiation of this methodology is given by considering a case study for predicting the energy consumption of the domestic hot water system in the Regional Hospital of La Axarquia, Spain, which includes gas and electricity sections. This prediction is conducted using four ML techniques: multivariate regression, XGBoost, Random Forest and ANN. Our obtained results show the effectiveness of SONDER by offering a fluid collaboration among project actors and the prediction efficiency of ANN.
Ladjel Bellatreche, Felix Garcia, Don Nguyen Pham, Pedro Quintero Jiménez
Reverse Engineering Approach for NoSQL Databases
In recent years, the need to use NoSQL systems to store and exploit big data has been steadily increasing. Most of these systems are characterized by the “schema-less” property, which means the absence of a data model when creating a database. This property offers an undeniable flexibility, allowing the user to add new data without making any changes to the data model. However, the lack of an explicit data model makes it difficult to express queries on the database. Therefore, users (developers and decision-makers) still need the database data model to know how data are stored and related, and then to write their queries. In previous works, we have proposed a process to extract the physical model of a document-oriented NoSQL database. In this paper, we aim to extend this work to achieve a reverse engineering of NoSQL databases in order to provide an element of semantic knowledge close to human understanding. The reverse engineering process is ensured by a set of transformation algorithms. We provide experiments of our approach using a case study taken from the medical field.
Fatma Abdelhedi, Amal Ait Brahim, Rabah Tighilt Ferhat, Gilles Zurfluh

Big Data/Data Lake

HANDLE - A Generic Metadata Model for Data Lakes
The substantial increase in generated data induced the development of new concepts such as the data lake. A data lake is a large storage repository designed to enable flexible extraction of the data’s value. A key aspect of exploiting data value in data lakes is the collection and management of metadata. To store and handle the metadata, a generic metadata model is required that can reflect metadata of any potential metadata management use case, e.g., data versioning or data lineage. However, an evaluation of existent metadata models yields that none so far are sufficiently generic. In this work, we present HANDLE, a generic metadata model for data lakes, which supports the flexible integration of metadata, data lake zones, metadata on various granular levels, and any metadata categorization. With these capabilities HANDLE enables comprehensive metadata management in data lakes. We show HANDLE’s feasibility through the application to an exemplary access-use-case and a prototypical implementation. A comparison with existent models yields that HANDLE can reflect the same information and provides additional capabilities needed for metadata management in data lakes.
Rebecca Eichler, Corinna Giebler, Christoph Gröger, Holger Schwarz, Bernhard Mitschang

Data Mining

A SAT-Based Approach for Mining High Utility Itemsets from Transaction Databases
Mining high utility itemsets is a keystone in several data analysis tasks. High Utility Itemset Mining generalizes the frequent itemset mining problem by considering item quantities and weights. A high utility itemset is a set of items that appears in the transaction database and has high importance to the user, as measured by a utility function. The utility of a pattern can be quantified in terms of various objective criteria, e.g., profit, frequency, and weight. Constraint Programming (CP) and Propositional Satisfiability (SAT) based frameworks for modeling and solving pattern mining tasks have gained considerable attention in the last few years. This paper introduces the first declarative framework for mining high utility itemsets from transaction databases. First, we model the problem of mining high utility itemsets from transaction databases as a propositional satisfiability problem. Moreover, to facilitate the mining task, we add a constraint based on the weighted clique cover problem to improve the efficiency of our method. Then, we exploit efficient SAT solving techniques to output all the high utility itemsets in the data that satisfy user-specified minimum support and minimum utility values. Experimental evaluations on real and synthetic datasets show that the performance of our proposed approach is close to that of the optimal case of state-of-the-art HUIM algorithms.
Amel Hidouri, Said Jabbour, Badran Raddaoui, Boutheina Ben Yaghlane
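
To make the notion of itemset utility in the abstract above concrete, the following sketch uses the common quantity-times-unit-profit definition and enumerates itemsets by brute force. It is not the SAT encoding the paper proposes; the database, the profit table and the threshold are hypothetical.

```python
from itertools import combinations


def itemset_utility(db, profit, itemset):
    """Sum of quantity * unit profit over transactions containing the itemset."""
    total = 0
    for transaction in db:  # transaction maps item -> purchased quantity
        if all(item in transaction for item in itemset):
            total += sum(transaction[item] * profit[item] for item in itemset)
    return total


def high_utility_itemsets(db, profit, min_util):
    """Brute-force enumeration; exponential, for illustration only."""
    items = sorted({item for transaction in db for item in transaction})
    result = {}
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            utility = itemset_utility(db, profit, candidate)
            if utility >= min_util:
                result[candidate] = utility
    return result


# Hypothetical transaction database and unit profits.
db = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"a": 1, "b": 1, "c": 3},
]
profit = {"a": 5, "b": 2, "c": 1}

print(high_utility_itemsets(db, profit, min_util=15))
```

Unlike support, utility is not anti-monotone (a superset can have higher utility than its subsets), which is one reason dedicated search strategies or declarative encodings are studied for this problem.
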
High-Utility Interval-Based Sequences
Sequential pattern mining is an interesting research area with a broad range of applications. Most prior research on sequential pattern mining has considered point-based data where events occur instantaneously. However, in many application domains, events persist over intervals of time of varying lengths. Furthermore, traditional frameworks for sequential pattern mining assume all events have the same weight or utility. This simplifying assumption neglects the opportunity to find informative patterns in terms of utilities, such as cost. To address these issues, we incorporate the concept of utility into interval-based sequences and define a framework to mine high utility patterns in interval-based sequences, i.e., patterns whose utility meets or exceeds a minimum threshold. In the proposed framework, the utility of events is considered while assuming multiple events can occur coincidentally and persist over varying periods of time. An algorithm named High Utility Interval-based Pattern Miner (HUIPMiner) is proposed and applied to real datasets. To achieve an efficient solution, HUIPMiner is augmented with a pruning strategy. Experimental results show that HUIPMiner is an effective solution to the problem of mining high utility interval-based sequences.
S. Mohammad Mirbagheri, Howard J. Hamilton
Extreme-SAX: Extreme Points Based Symbolic Representation for Time Series Classification
Time series classification is an important problem in data mining with several applications in different domains. Because time series data are usually high dimensional, dimensionality reduction techniques have been proposed as an efficient approach to lower their dimensionality. One of the most popular dimensionality reduction techniques of time series data is the Symbolic Aggregate Approximation (SAX), which is inspired by algorithms from text mining and bioinformatics. SAX is simple and efficient because it uses precomputed distances. The disadvantage of SAX is its inability to accurately represent important points in the time series. In this paper we present Extreme-SAX (E-SAX), which uses only the extreme points of each segment to represent the time series. E-SAX has exactly the same simplicity and efficiency of the original SAX, yet it gives better results in time series classification than the original SAX, as we show in extensive experiments on a variety of time series datasets.
Muhammad Marwan Muhammad Fuad
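
As background for the abstract above, here is a minimal sketch of the classic SAX transformation (z-normalization, piecewise aggregate approximation, then discretization with equiprobable Gaussian breakpoints). It does not implement E-SAX, whose exact encoding of extreme points is not given in the abstract. The function name and defaults are illustrative, and the sketch assumes the series length is divisible by the number of segments.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def sax(series, n_segments, alphabet="abcd"):
    """Classic SAX word for `series` (not the E-SAX variant)."""
    sigma = pstdev(series)
    if sigma == 0:
        raise ValueError("constant series cannot be z-normalized")
    if len(series) % n_segments != 0:
        raise ValueError("this sketch assumes equally sized segments")
    mu = mean(series)
    z = [(x - mu) / sigma for x in series]

    # Piecewise Aggregate Approximation: mean of each equal-length segment.
    seg_len = len(z) // n_segments
    paa = [mean(z[i * seg_len:(i + 1) * seg_len]) for i in range(n_segments)]

    # Breakpoints split the standard normal into equiprobable regions.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / a) for k in range(1, a)]

    # Each PAA value maps to the symbol of the region it falls into.
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa)


print(sax([1, 2, 3, 4, 5, 6, 7, 8], n_segments=4))
```

Because each segment is reduced to its mean, peaks and troughs inside a segment are smoothed away, which is the limitation the abstract attributes to SAX.
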
Framework to Optimize Data Processing Pipelines Using Performance Metrics
Optimizing Data Processing Pipelines (DPPs) is challenging in the context of both data warehouse architectures and data science architectures. Few approaches to this problem have been proposed so far. The most challenging issue is to build a cost model of the whole DPP, especially if user defined functions (UDFs) are used. In this paper we addressed the problem of the optimization of UDFs in data-intensive workflows and presented our approach to construct a cost model to determine the degree of parallelism for parallelizable UDFs.
Syed Muhammad Fawad Ali, Robert Wrembel
A Scalable Randomized Algorithm for Triangle Enumeration on Graphs Based on SQL Queries
Triangle enumeration is a fundamental problem in large-scale graph analysis. For instance, triangles are used to solve practical problems like community detection and spam filtering. On the other hand, there is a large amount of data stored on database management systems (DBMSs), which can be modeled and analyzed as graphs. Alternatively, graph data can be quickly loaded into a DBMS. Our paper shows how to adapt and optimize a randomized distributed triangle enumeration algorithm with SQL queries, which is a significantly different approach from programming graph algorithms in traditional languages such as Python or C++. We choose a parallel columnar DBMS given its fast query processing, but our solution should work for a row DBMS as well. Our randomized solution provides a balanced workload for parallel query processing, being robust to the existence of skewed degree vertices. We experimentally prove our solution ensures a balanced data distribution, and hence workload, among machines. The key idea behind the algorithm is to evenly partition all possible triplets of vertices among machines, sending edges that may form a triangle to a proxy machine; this edge redistribution eliminates shuffling edges during join computation and therefore triangle enumeration becomes local and fully parallel. In summary, our algorithm exhibits linear speedup with large graphs, including graphs that have high skewness in vertex degree distributions.
Abir Farouzi, Ladjel Bellatreche, Carlos Ordonez, Gopal Pandurangan, Mimoun Malki
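
The basic idea of enumerating triangles with SQL joins, mentioned in the abstract above, can be sketched on a single node as follows. This is a plain, non-randomized baseline run through Python's built-in `sqlite3` module; it does not reproduce the paper's randomized partitioning across machines or its columnar DBMS setting. The table name `E` and column names `i`, `j` are assumptions.

```python
import sqlite3


def enumerate_triangles(edges):
    """Return all triangles (a, b, c) with a < b < c in an undirected graph."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE E (i INTEGER, j INTEGER)")
    # Store each undirected edge once, oriented so that i < j; drop self-loops.
    canonical = {(min(u, v), max(u, v)) for u, v in edges if u != v}
    conn.executemany("INSERT INTO E VALUES (?, ?)", sorted(canonical))
    rows = conn.execute(
        """
        SELECT e1.i AS a, e1.j AS b, e2.j AS c
        FROM E AS e1
        JOIN E AS e2 ON e1.j = e2.i                  -- path a -> b -> c
        JOIN E AS e3 ON e3.i = e1.i AND e3.j = e2.j  -- closing edge a -> c
        ORDER BY a, b, c
        """
    ).fetchall()
    conn.close()
    return rows


edges = [(1, 2), (2, 3), (1, 3), (3, 4), (2, 4), (4, 5)]
print(enumerate_triangles(edges))
```

Orienting every edge from the smaller to the larger vertex id ensures each triangle is reported exactly once rather than once per permutation of its vertices.
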
Data Engineering for Data Science: Two Sides of the Same Coin
A de facto technological standard of data science is based on notebooks (e.g., Jupyter), which provide an integrated environment to execute data workflows in different languages. However, from a data engineering point of view, this approach is typically inefficient and unsafe, as most of the data science languages process data locally, i.e., in workstations with limited memory, and store data in files. Thus, this approach neglects the benefits brought by over 40 years of R&D in the area of data engineering, i.e., advanced database technologies and data management techniques. In this paper, we advocate for a standardized data engineering approach for data science and we present a layered architecture for a data processing pipeline (DPP). This architecture provides a comprehensive conceptual view of DPPs, which next enables the semi-automation of the logical and physical designs of such DPPs.
Oscar Romero, Robert Wrembel
Mining Attribute Evolution Rules in Dynamic Attributed Graphs
A dynamic attributed graph is a graph that changes over time and where each vertex is described using multiple continuous attributes. Such graphs are found in numerous domains, e.g., social network analysis. Several studies have been done on discovering patterns in dynamic attributed graphs to reveal how attribute(s) change over time. However, many algorithms restrict all attribute values in a pattern to follow the same trend (e.g. increase) and the set of vertices in a pattern to be fixed, while others consider that a single vertex may influence its neighbors. As a result, these algorithms are unable to find complex patterns that show the influence of multiple vertices on many other vertices in terms of several attributes and different trends. This paper addresses this issue by proposing to discover a novel type of patterns called attribute evolution rules (AER). These rules indicate how changes of attribute values of multiple vertices may influence those of others with a high confidence. An efficient algorithm named AER-Miner is proposed to find these rules. Experiments on real data show AER-Miner is efficient and that AERs can provide interesting insights about dynamic attributed graphs.
Philippe Fournier-Viger, Ganghuan He, Jerry Chun-Wei Lin, Heitor Murilo Gomes
Sustainable Development Goal Relational Modelling: Introducing the SDG-CAP Methodology
A mechanism for predicting whether individual regions will meet their UN Sustainable Development Goals (SDGs) is presented which takes into consideration the potential relationships between time series associated with individual SDGs, unlike previous work where an independence assumption was made. The challenge is in identifying the existence of relationships and then using these relationships to make SDG attainment predictions. To this end, the SDG Correlation/Causal Attainment Prediction (SDG-CAP) methodology is presented. Five alternative mechanisms for determining time series relationships are considered together with three prediction mechanisms. The results demonstrate that by considering the relationships between time series, by combining a number of popular causal and correlation identification mechanisms, more accurate SDG forecast predictions can be made.
Yassir Alharbi, Frans Coenen, Daniel Arribas-Bel
Mining Frequent Seasonal Gradual Patterns
Gradual patterns that capture co-variation of complex attributes in the form “when X increases/decreases, Y increases/decreases” play an important role in many real world applications where huge volumes of complex numerical data must be handled. More recently, they have received attention from the data mining community for exploring temporal data and methods have been defined to automatically extract gradual patterns from temporal data. However, to the best of our knowledge, no method has been proposed to extract gradual patterns that always appear at the identical time intervals in the sequences of temporal data, despite the knowledge that such patterns may bring for certain applications such as e-commerce. This paper proposes to extract co-variations of periodically repeating attributes from the sequences of temporal data that we call seasonal gradual patterns. We discuss the specific features of these patterns and propose an approach for their extraction by exploiting a motif mining algorithm in a sequence, and justify its applicability to the gradual case. Illustrative results obtained from a real world data set are described and show the interest for such patterns.
Jerry Lonlac, Arnaud Doniec, Marin Lujak, Stephane Lecoeuche
Derivative, Regression and Time Series Analysis in SARS-CoV-2
The Covid-19 pandemic and the need of confinement have had a very significant impact on people’s lives all over the world. Everybody is looking at curves and how those curves evolve to try to understand how their country has been and will be affected in the near future. Thanks to open data and data science tools, we can analyze the evolution of key factors. Derivatives, polynomial regression and time series analysis can be used to capture trends. In this paper we explore and evaluate the use of such techniques, concluding regarding their merits and limitations for the Covid-19 data series. We conclude that polynomial regression on derivative totals, with degree 2 or 3 achieved the lowest average errors (median 5.5 to 6%) over 20 countries, while PROPHET and ARIMA may excel in larger series.
Pedro Furtado

Machine Learning and Deep Learning

Building a Competitive Associative Classifier
With the huge success of deep learning, other machine learning paradigms have had to take a back seat. Yet other models, particularly rule-based ones, are more readable and explainable and can even be competitive when labelled data is not abundant. However, most of the existing rule-based classifiers suffer from the production of a large number of classification rules, affecting the model readability. This hampers the classification accuracy as noisy rules might not add any useful information for classification and also lead to longer classification time. In this study, we propose SigD2, which uses a novel two-stage pruning strategy that prunes most of the noisy, redundant and uninteresting rules and makes the classification model more accurate and readable. To make SigDirect more competitive with the most prevalent but uninterpretable machine learning-based classifiers like neural networks and support vector machines, we propose bagging and boosting on the ensemble of the SigDirect classifier. The results of the proposed algorithms are quite promising and we are able to obtain a minimal set of statistically significant rules for classification without jeopardizing the classification accuracy. We use 15 UCI datasets and compare our approach with eight existing systems. The SigD2 and boosted SigDirect (ACboost) ensemble model outperform various state-of-the-art classifiers not only in terms of classification accuracy but also in terms of the number of rules.
Nitakshi Sood, Osmar Zaiane
Contrastive Explanations for a Deep Learning Model on Time-Series Data
In the last decade, with the emergence of Deep Learning (DL), artificial intelligence has taken a step forward compared with previous years. Although Deep Learning models have gained strength in many fields like image classification, speech recognition, time-series anomaly detection, etc., these models are often difficult to understand because of their lack of interpretability. In recent years an effort has been made to understand DL models, creating a new research area called Explainable Artificial Intelligence (XAI). Most of the research in XAI has been done for image data, and little research has been done in the time-series data field. In this paper, a model-agnostic method called Contrastive Explanation Method (CEM) is used for interpreting a DL model for time-series classification. Even though CEM has been validated in tabular data and image data, the obtained experimental results show that CEM is also suitable for interpreting deep learning models that work with time-series data.
Jokin Labaien, Ekhi Zugasti, Xabier De Carlos
Cyberbullying Detection in Social Networks Using Deep Learning Based Models
Cyberbullying is a disturbing online misbehaviour with troubling consequences. It appears in different forms, and in most of the social networks, it is in textual format. Automatic detection of such incidents requires intelligent systems. Most of the existing studies have approached this problem with conventional machine learning models and the majority of the developed models in these studies are adaptable to a single social network at a time. Recently, deep learning based models have been used for similar objectives, claiming that they can overcome the limitations of the conventional models and improve the detection performance. In this paper, we investigated the findings of recent literature in this regard and validated them using the same datasets. We further expanded the work by applying the developed methods to a new dataset. We aimed to further investigate the performance of the models on new social media platforms. Our findings show that the deep learning based models outperform the machine learning models previously applied to the same dataset. We believe that the deep learning based models can also benefit from integrating other sources of information and looking into the impact of profile information of the users in social networks.
Maral Dadvar, Kai Eckert
Predicting Customer Churn for Insurance Data
Most organisations employ customer relationship management systems to provide a strategic advantage over their competitors. One aspect of this is applying a customer lifetime value to each client which effectively forms a fine-grained ranking of every customer in their database. This is used to focus marketing and sales budgets and, in turn, generate a more optimised and targeted spend. The problem is that it requires a full customer history for every client and this rarely exists. In effect, there is a large gap between the available information in application databases and the types of datasets required to calculate customer lifetime values. This gap prevents any meaningful calculation of customer lifetime values. In this research, we present an approach to generating some of the missing parameters for CLV calculations. This requires a specialised form of data warehouse architecture and a flexible prediction and validation methodology for imputing missing data.
Michael Scriney, Dongyun Nie, Mark Roantree

Supervised Learning

Scalable Machine Learning on Popular Analytic Languages with Parallel Data Summarization
Machine learning requires scalable processing. An important acceleration mechanism is data summarization, which is accurate for many models and whose summary requires a small amount of RAM. In this paper, we generalize a data summarization matrix to produce one or multiple summaries, which benefits a broader class of models, compared to previous work. Our solution works well in popular languages, like R and Python, on a shared-nothing architecture, the standard in big data analytics. We introduce an algorithm which computes machine learning models in three phases: Phase 0 pre-processes and transfers the data set to the parallel processing nodes; Phase 1 computes one or multiple data summaries in parallel and Phase 2 computes a model in one machine based on such data set summaries. A key innovation is evaluating a demanding vector-vector outer product in C++ code, in a simple function call from a high-level programming language. We show Phase 1 is fully parallel, requiring a simple barrier synchronization at the end. Phase 2 is a sequential bottleneck, but contributes very little to overall time. We present an experimental evaluation with a prototype in the R language, with our summarization algorithm programmed in C++. We first show R is faster and simpler than competing big data analytic systems computing the same models, including Spark (using MLlib, calling Scala functions) and a parallel DBMS (computing data summaries with SQL queries calling UDFs). We then show our parallel solution becomes better than single-node processing as data set size grows.
Sikder Tahsin Al-Amin, Carlos Ordonez
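
The summarization idea in the abstract above can be illustrated with a small sketch. It assumes, without confirmation from the abstract, that the summary is a matrix of sufficient statistics built by accumulating the outer product of z = [1, x] for each data point, so that it holds the count, the linear sums and the quadratic sums. The function names and the two-partition split are hypothetical, and the partitions are processed sequentially here rather than on parallel nodes.

```python
def summarize(partition):
    """Phase 1 (per node): accumulate z * z^T with z = [1, x_1, ..., x_d]."""
    d = len(partition[0]) + 1
    gamma = [[0.0] * d for _ in range(d)]
    for row in partition:
        z = [1.0, *row]
        for a in range(d):
            for b in range(d):
                gamma[a][b] += z[a] * z[b]
    return gamma


def merge(summaries):
    """Partial summaries are additive, so combining them is an element-wise sum."""
    d = len(summaries[0])
    return [[sum(s[a][b] for s in summaries) for b in range(d)] for a in range(d)]


def means_and_variances(gamma):
    """Phase 2 (one machine): derive simple statistics from the summary alone."""
    n = gamma[0][0]
    means = [gamma[0][j] / n for j in range(1, len(gamma))]
    variances = [gamma[j][j] / n - means[j - 1] ** 2 for j in range(1, len(gamma))]
    return means, variances


# Hypothetical data set split into two partitions.
data = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
merged = merge([summarize(data[:2]), summarize(data[2:])])
print(means_and_variances(merged))
```

Because the summary is additive, the only coordination needed between partitions is the final sum, which matches the abstract's remark that the parallel phase ends with a simple barrier synchronization.
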
Which Bills Are Lobbied? Predicting and Interpreting Lobbying Activity in the US
Using lobbying data from, we offer several experiments applying machine learning techniques to predict if a piece of legislation (US bill) has been subjected to lobbying activities or not. We also investigate the influence of the intensity of the lobbying activity on how discernible a lobbied bill is from one that was not subject to lobbying. We compare the performance of a number of different models (logistic regression, random forest, CNN and LSTM) and text embedding representations (BOW, TF-IDF, GloVe, Law2Vec). We report results of above 0.85 ROC AUC scores, and 78% accuracy. Model performance significantly improves (95% ROC AUC, and 88% accuracy) when bills with higher lobbying intensity are looked at. We also propose a method that could be used for unlabelled data. Through this we show that there is a considerably large number of previously unlabelled US bills where our predictions suggest that some lobbying activity took place. We believe our method could potentially contribute to the enforcement of the US Lobbying Disclosure Act (LDA) by indicating the bills that were likely to have been affected by lobbying but were not filed as such.
Ivan Slobozhan, Peter Ormosi, Rajesh Sharma
FIBS: A Generic Framework for Classifying Interval-Based Temporal Sequences
We study the problem of classifying interval-based temporal sequences (IBTSs). Since common classification algorithms cannot be directly applied to IBTSs, the main challenge is to define a set of features that effectively represents the data such that classifiers can be applied. Most prior work utilizes frequent pattern mining to define a feature set based on discovered patterns. However, frequent pattern mining is computationally expensive and often discovers many irrelevant patterns. To address this shortcoming, we propose the FIBS framework for classifying IBTSs. FIBS extracts features relevant to classification from IBTSs based on relative frequency and temporal relations. To avoid selecting irrelevant features, a filter-based selection strategy is incorporated into FIBS. Our empirical evaluation on eight real-world datasets demonstrates the effectiveness of our methods in practice. The results provide evidence that FIBS effectively represents IBTSs for classification algorithms, which contributes to similar or significantly better accuracy compared to state-of-the-art competitors. It also suggests that the feature selection strategy is beneficial to FIBS’s performance.
S. Mohammad Mirbagheri, Howard J. Hamilton
Multivariate Time Series Classification: A Relational Way
Multivariate Time Series Classification (MTSC) has attracted increasing research attention in the past years due to its wide range of applications, e.g., in action/activity recognition and EEG/ECG classification. In this paper, we open a novel path to tackle MTSC: a relational way. The multiple dimensions of MTS are represented in a relational data scheme, then a propositionalisation technique (based on classical aggregation/selection functions from the relational data field) is applied to build interpretable features from secondary tables to “flatten” the data. Finally, the MTS flattened data are classified using a selective Naïve Bayes classifier. Experimental validation on various benchmark data sets shows the relevance of the suggested approach.
Dominique Gay, Alexis Bondu, Vincent Lemaire, Marc Boullé, Fabrice Clérot
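As a rough illustration of the "flattening" idea described in the abstract above, the sketch below turns each dimension of a multivariate series into a few aggregate features. The function name `flatten_series` and the choice of min/max/mean are assumptions made for this example only; the paper's actual propositionalisation technique and feature set are not specified here.

```python
from statistics import mean


def flatten_series(series):
    """Build one flat feature row from a multivariate time series.

    `series` maps a dimension name to its list of values. The aggregates
    used here (min, max, mean) are illustrative choices, not the paper's.
    """
    features = {}
    for dim, values in series.items():
        features[f"{dim}_min"] = min(values)
        features[f"{dim}_max"] = max(values)
        features[f"{dim}_mean"] = mean(values)
    return features


# Hypothetical two-dimensional series (e.g. two sensor channels).
row = flatten_series({"x": [1, 2, 3], "y": [4.0, 4.0, 10.0]})
```

Each resulting row is an ordinary attribute-value record, so any standard classifier could be trained on a table of such rows.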

Unsupervised Learning

Behave or Be Detected! Identifying Outlier Sequences by Their Group Cohesion
Since the amount of sequentially recorded data is constantly increasing, the analysis of time series (TS), and especially the identification of anomalous points and subsequences, is nowadays an important field of research. Many approaches consider only a single TS, but in some cases multiple sequences need to be investigated. In 2019 we presented a new method to detect behavior-based outliers in TS which analyses relations of sequences to their peers. To this end, we clustered data points of TS per timestamp and calculated distances between the resulting clusters of different points in time. We realized this by evaluating the number of peers a TS is moving with. We defined a stability measure for time series and subsequences, which is used to detect the outliers. Originally we considered cluster splits but did not take merges into account. In this work we present two major modifications to our previous work, namely the introduction of the Jaccard index as a distance measure for clusters and a weighting function, which enables behavior-based outlier detection in larger TS. We evaluate our modifications separately and in conjunction on two real and one artificial data set. The adjustments lead to well-reasoned and sound results, which are robust regarding larger TS.
Martha Tatusch, Gerhard Klassen, Stefan Conrad
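The abstract above mentions the Jaccard index as a measure for comparing clusters. A minimal sketch of the standard Jaccard index on two sets of time series identifiers follows; the identifiers and the treatment of two empty sets are assumptions for this example, and how the authors embed the index in their stability measure is not shown here.

```python
def jaccard_index(cluster_a, cluster_b):
    """Standard Jaccard index: |A intersect B| / |A union B|."""
    a, b = set(cluster_a), set(cluster_b)
    union = a | b
    if not union:
        # Assumption for this sketch: two empty clusters count as identical.
        return 1.0
    return len(a & b) / len(union)


# Hypothetical clusters of time series IDs at two consecutive timestamps.
cluster_t1 = {"ts1", "ts2", "ts3"}
cluster_t2 = {"ts2", "ts3", "ts4"}
similarity = jaccard_index(cluster_t1, cluster_t2)
distance = 1.0 - similarity  # the corresponding Jaccard distance
```

A series that keeps moving with the same peers yields a high index between its clusters at consecutive timestamps, while a series that changes peers yields a low one.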
Detecting Anomalies in Production Quality Data Using a Method Based on the Chi-Square Test Statistic
This paper describes the capability of the Chi-Square test statistic for detecting outliers in production-quality data. The goal is automated detection and evaluation of statistical anomalies for a large number of time series in the production-quality context. The investigated time series are the temporal course of sensor failure rates in relation to particular aspects (e.g. type of failure, information about products, the production process, or measuring sites). By means of an industrial use case, we show why in this setting our chosen approach is superior to standard methods for statistical outlier detection.
Michael Mayr, Johannes Himmelbauer
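For readers unfamiliar with the statistic named in the abstract above, the sketch below computes the Pearson chi-square statistic, the sum of (observed - expected)^2 / expected, for failure counts at several measuring sites. The data, the function names, and the simplification of comparing only failure counts against a pooled failure rate are assumptions for illustration; this is not the authors' detection method.

```python
def expected_failures(units, failures):
    """Expected failures per site if every site shared the pooled failure rate."""
    pooled_rate = sum(failures) / sum(units)
    return [n * pooled_rate for n in units]


def chi_square_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: units measured and failures observed at three sites.
units = [100, 100, 200]
failures = [5, 5, 30]
expected = expected_failures(units, failures)
statistic = chi_square_statistic(failures, expected)
```

A large value indicates that the failure counts deviate from what a common failure rate would predict; judging significance would additionally require the chi-square distribution with the appropriate degrees of freedom.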
Learning from Past Observations: Meta-Learning for Efficient Clustering Analyses
Many clustering algorithms require the number of clusters as an input parameter prior to execution. Since the “best” number of clusters is most often unknown in advance, analysts typically execute clustering algorithms multiple times with varying parameters and subsequently choose the most promising result. Several methods for an automated estimation of suitable parameters have been proposed. Similar to the procedure of an analyst, these estimation methods draw on repetitive executions of a clustering algorithm with varying parameters. However, when working with voluminous datasets, each single execution tends to be very time-consuming. Especially in today’s Big Data era, such a repetitive execution of a clustering algorithm is not feasible for an efficient exploration. We propose a novel and efficient approach to accelerate estimations for the number of clusters in datasets. Our approach relies on the idea of meta-learning and terminates each execution of the clustering algorithm as soon as an expected qualitative demand is met. We show that this new approach is generally applicable, i.e., it can be used with existing estimation methods. Our comprehensive evaluation reveals that our approach is able to speed up the estimation of the number of clusters by an order of magnitude, while still achieving accurate estimates.
Manuel Fritz, Dennis Tschechlov, Holger Schwarz
Parallel K-Prototypes Clustering with High Efficiency and Accuracy
Big data is often characterized by a huge volume and mixed types of data, including numeric and categorical. K-prototypes is one of the best-known clustering methods for mixed data. Despite this, it is not suitable for dealing with huge volumes of data. Several methods have attempted to solve the efficiency problem of k-prototypes using parallel frameworks. However, none of the existing clustering methods for mixed data satisfies both accuracy and efficiency. To deal with this issue, we propose a novel parallel k-prototypes clustering method that improves both efficiency and accuracy. The proposed method is based on integrating a parallel approach through the Spark framework and implementing a new center initialization strategy using sampling. Experiments performed on simulated and real datasets show that the proposed method is scalable and improves both the efficiency and accuracy of the existing k-prototypes methods.
Hiba Jridi, Mohamed Aymen Ben HajKacem, Nadia Essoussi
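To make the mixed-data setting in the abstract above concrete, the sketch below shows the dissimilarity commonly used by the classic k-prototypes algorithm: squared Euclidean distance on numeric attributes plus a weight gamma times the number of mismatching categorical attributes. The function names, the sample values, and gamma = 0.5 are assumptions for this example; the parallel Spark implementation and the sampling-based initialization proposed in the paper are not reproduced.

```python
def mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma=1.0):
    """Classic k-prototypes dissimilarity between an object and a prototype."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    categorical_part = sum(1 for a, b in zip(x_cat, p_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(x_num, x_cat, prototypes, gamma=1.0):
    """Return the index of the closest prototype (the assignment step)."""
    distances = [
        mixed_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return distances.index(min(distances))


# Hypothetical prototypes: (numeric attributes, categorical attributes).
prototypes = [([1.0, 4.0], ["a", "c"]), ([9.0, 9.0], ["a", "b"])]
assigned = nearest_prototype([1.0, 2.0], ["a", "b"], prototypes, gamma=0.5)
```

The weight gamma controls how strongly categorical mismatches count relative to numeric differences, so it usually needs tuning to the scale of the numeric attributes.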
Self-Organizing Map for Multi-view Text Clustering
Text document clustering represents a key task in machine learning, which partitions a given document collection into clusters of related documents. To this end, a pre-processing step is carried out to represent text in a structured form. However, text has several aspects, which a single representation cannot capture. Therefore, multi-view clustering presents an efficient solution to exploit and integrate the information captured from different representations or views. However, the existing methods are limited to representing views using term-frequency-based representations, which lose valuable information and fail to capture the semantic aspect of text. To deal with these issues, we propose a new method for multi-view text clustering that exploits different representations of text. The proposed method applies the Self-Organizing Map to the problem of unsupervised clustering of texts by simultaneously taking into account several views obtained from textual data. Experiments are performed to demonstrate the improvement of clustering results compared to the existing methods.
Maha Fraj, Mohamed Aymen Ben Hajkacem, Nadia Essoussi