2019 | Book

Intelligent Data Engineering and Automated Learning – IDEAL 2019

20th International Conference, Manchester, UK, November 14–16, 2019, Proceedings, Part II

Edited by: Dr. Hujun Yin, David Camacho, Peter Tino, Dr. Antonio J. Tallón-Ballesteros, Ronaldo Menezes, Richard Allmendinger

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

About this book

This two-volume set of LNCS 11871 and 11872 constitutes the thoroughly refereed conference proceedings of the 20th International Conference on Intelligent Data Engineering and Automated Learning, IDEAL 2019, held in Manchester, UK, in November 2019.

The 94 full papers presented were carefully reviewed and selected from 149 submissions. These papers provided a timely sample of the latest advances in data engineering and machine learning, from methodologies, frameworks, and algorithms to applications. The core themes of IDEAL 2019 include big data challenges, machine learning, data mining, information retrieval and management, bio-/neuro-informatics, bio-inspired models (including neural networks, evolutionary computation and swarm intelligence), agents and hybrid intelligent systems, real-world applications of intelligent techniques and AI.

Table of contents

Frontmatter

Special Session on Fuzzy Systems and Intelligent Data Analysis

Frontmatter
Computational Generalization in Taxonomies Applied to: (1) Analyze Tendencies of Research and (2) Extend User Audiences

We define a most specific generalization of a fuzzy set of topics assigned to leaves of the rooted tree of a domain taxonomy. This generalization lifts the set to its “head subject” node in the higher ranks of the taxonomy tree. The head subject is supposed to “tightly” cover the query set, possibly bringing in some errors referred to as “gaps” and “offshoots”. Our method, ParGenFS, globally minimizes a penalty function that combines the numbers of head subjects, gaps and offshoots, each weighted differently. Two applications are considered: (1) analysis of tendencies of research in Data Science; (2) audience extension for programmatic targeted advertising online. The former involves a taxonomy of Data Science derived from the celebrated ACM Computing Classification System 2012. Based on a collection of research papers published by Springer in 1998–2017, and applying in-house methods for text analysis and fuzzy clustering, we derive fuzzy clusters of leaf topics in learning, retrieval and clustering. The head subjects of these clusters inform us of some general tendencies of the research. The latter involves the publicly available IAB Tech Lab Content Taxonomy. Each of about 25 million users is assigned a fuzzy profile within this taxonomy, which is generalized offline using ParGenFS. Our experiments show that these head subjects at least double the size of targeted audiences without losing quality.

Dmitry Frolov, Susana Nascimento, Trevor Fenner, Zina Taran, Boris Mirkin
Unsupervised Initialization of Archetypal Analysis and Proportional Membership Fuzzy Clustering

This paper further investigates and compares a method for fuzzy clustering which retrieves pure individual types from data, known as fuzzy clustering with proportional membership (FCPM), with the FurthestSum Archetypal Analysis algorithm (FS-AA). The Anomalous Pattern (AP) initialization algorithm, which sequentially extracts clusters one by one in a manner similar to principal component analysis, is shown to outperform FurthestSum not only by improving the convergence of the FCPM and AA algorithms but also by being able to model the number of clusters to extract from data. A study comparing nine information-theoretic validity indices and the soft ARI has shown that the soft Normalized Mutual Information max ($$NMI_{sM}$$) and the Adjusted Mutual Information (AMI) indices are more adequate for assessing the quality of FCPM and AA partitions than soft internal validity indices. The experimental study was conducted on a collection of 99 synthetic data sets generated from a proper data generator, the FCPM-DG, covering various dimensionalities, as well as 18 benchmark data sets from machine learning.

Susana Nascimento, Nuno Madaleno

Special Session on Machine Learning Towards Smarter Multimodal Systems

Frontmatter
Multimodal Web Based Video Annotator with Real-Time Human Pose Estimation

This paper presents a multi-platform Web-based video annotator to support multimodal annotation that can be applied to several working areas, such as dance rehearsals, among others. The CultureMoves’ “Motion-Notes” Annotator was designed to assist the creative and exploratory processes of both professional and amateur users, working with a digital device for personal annotations. This prototype is being developed for any device capable of running a modern Web browser. It is a real-time multimodal video annotator based on keyboard, touch and voice inputs. Five different ways of adding annotations have already been implemented: voice, draw, text, web URL, and mark annotations. Pose estimation functionality uses machine learning techniques to identify a person’s skeleton in the video frames, which gives the user another resource to identify possible annotations.

Rui Rodrigues, Rui Neves Madeira, Nuno Correia, Carla Fernandes, Sara Ribeiro
New Interfaces for Classifying Performance Gestures in Music

Interactive machine learning (ML) allows a music performer to digitally represent musical actions (via gestural interfaces) and affect their musical output in real-time. Processing musical actions (termed performance gestures) with ML is useful because it predicts and maps often-complex biometric data. ML models can therefore be used to create novel interactions with musical systems, game-engines, and networked analogue devices. Wekinator is a free open-source software for ML (based on the Waikato Environment for Knowledge Analysis – WEKA - framework) which has been widely used, since 2009, to build supervised predictive models when developing real-time interactive systems. This is because it is accessible in its format (i.e. a graphical user interface – GUI) and simplified approach to ML. Significantly, it allows model training via gestural interfaces through demonstration. However, Wekinator offers the user several models to build predictive systems with. This paper explores which ML models (in Wekinator) are the most useful for predicting an output in the context of interactive music composition. We use two performance gestures for piano, with opposing datasets, to train available ML models, investigate compositional outcomes and frame the investigation. Our results show ML model choice is important for mapping performance gestures because of disparate mapping accuracies and behaviours found between all Wekinator ML models.

Chris Rhodes, Richard Allmendinger, Ricardo Climent

Special Session on Data Selection in Machine Learning

Frontmatter
Classifying Ransomware Using Machine Learning Algorithms

Ransomware is a continuing threat and has resulted in a battle between the development and detection of new techniques. Detection and mitigation systems have been developed and are in wide-scale use; however, their reactive nature has resulted in a continuing evolution and updating process. This is largely because detection mechanisms can often be circumvented by introducing changes in the malicious code and its behaviour. In this paper, we demonstrate a classification technique that integrates both static and dynamic features to increase the accuracy of detection and classification of ransomware. We train supervised machine learning algorithms using a test set and use a confusion matrix to observe accuracy, enabling a systematic comparison of each algorithm. In this work, supervised algorithms such as the Naïve Bayes algorithm resulted in an accuracy of 96% with the test set result, SVM 99.5%, random forest 99.5%, and 96%. We also use Youden’s index to determine sensitivity and specificity.

Samuel Egunjobi, Simon Parkinson, Andrew Crampton
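
The ransomware abstract above mentions a confusion matrix and Youden’s index. As a minimal sketch (not the authors’ code), the standard definition J = sensitivity + specificity − 1 can be computed from binary confusion-matrix counts as follows. The function name and the counts are hypothetical and chosen only for illustration.

```python
def youden_index(tp, fn, tn, fp):
    """Return (sensitivity, specificity, J) from binary confusion-matrix counts.

    tp/fn: positives classified correctly/incorrectly
    tn/fp: negatives classified correctly/incorrectly
    """
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    j = sensitivity + specificity - 1.0  # Youden's J statistic
    return sensitivity, specificity, j


# Hypothetical counts, not taken from the paper.
sens, spec, j = youden_index(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} J={j:.2f}")
```

J ranges from −1 to 1; a value of 0 means the classifier is no better than chance, and 1 means perfect separation.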
Artificial Neural Networks in Mathematical Mini-Games for Automatic Students’ Learning Styles Identification: A First Approach

The lack of customized education results in low performance in subjects such as mathematics. Recognizing and knowing student learning styles will enable educators to create an appropriate learning environment. Questionnaires are the traditional method of identifying the learning styles of students. Nevertheless, they exhibit several limitations, such as misunderstanding of the questions and boredom in children. Thus, this work proposes a first automatic approach to detect the learning styles (Activist, Reflector, Theorist, Pragmatist), based on Honey and Mumford theory, through the use of Artificial Neural Networks in mathematical Mini-Games. Metrics from the mathematical Mini-Games, such as score and time, were used as input data to train the Artificial Neural Networks to predict the percentages of learning styles. The data gathered in this work came from a pilot study of Ecuadorian students aged between 9 and 10 years. The preliminary results show that the average overall difference between the two techniques (Artificial Neural Networks and CHAEA-Junior) is 4.13%. Finally, we conclude that video games can be fun and suitable tools for an accurate prediction of learning styles.

Richard Torres-Molina, Jorge Banda-Almeida, Lorena Guachi-Guachi
The Use of Unified Activity Records to Predict Requests Made by Applications for External Services

Many modern applications use services and data made available by provisioning platforms of third parties. The question arises whether the use of individual services and data resources, such as open data, by novel applications can be predicted. In particular, it is not clear whether initial software development efforts, such as application development during hackathons, can be monitored to provide data for models predicting requests submitted to open data platforms and possibly other platforms. In this work, we propose an iterative method of transforming request streams into activity records. Activity records are vectors containing an aggregated representation of the requests for external services made by individual applications over growing periods of software development. The approach we propose extends previous works on the development of network flows aggregating network traffic and makes it possible to predict future requests made to web services with high accuracy.

Maciej Grzenda, Robert Kunicki, Jaroslaw Legierski
Fuzzy Clustering Approach to Data Selection for Computer Usage in Headache Disorders

This paper focuses on a new approach based on a fuzzy clustering system for diagnosing headache disorders. The proposed system is based on a two-step Gustafson-Kessel clustering system. The experimental data set consists of the frequency of computer use, habits while using the computer, and an assessment of the adverse health effects due to computer use. Attribute selection for major features is based on the experimental data set. The proposed fuzzy clustering system is tested on a data set collected from patients at the Clinical Centre of Vojvodina in Novi Sad, Serbia.

Svetlana Simić, Ljiljana Radmilo, Dragan Simić, Svetislav D. Simić, Antonio J. Tallón-Ballesteros
Multitemporal Aerial Image Registration Using Semantic Features

A semantic feature extraction method for multitemporal high resolution aerial image registration is proposed in this paper. These features encode properties or information about temporally invariant objects such as roads and help deal with issues such as changing foliage in image registration, which classical handcrafted features are unable to address. These features are extracted from a semantic segmentation network and have shown good robustness and accuracy in registering aerial images across years and seasons in the experiments.

Ananya Gupta, Yao Peng, Simon Watson, Hujun Yin

Special Session on Machine Learning in Healthcare

Frontmatter
Brain Tumor Classification Using Principal Component Analysis and Kernel Support Vector Machine

Early diagnosis improves cancer outcomes by giving care at the earliest possible stage and is, therefore, an important health strategy in all settings. Gliomas, meningiomas, and pituitary tumors are among the most common brain tumors in adults. This paper classifies these three types of brain tumors from patients using a Kernel Support Vector Machine (KSVM) classifier. The images are pre-processed, and their dimensionality is reduced before entering the classifier. We compare the accuracy obtained with and without pre-processing techniques, as well as three different kernels for the classifier, namely linear, quadratic, and Gaussian Radial Basis (GRB). The experimental results showed that the proposed approach, with pre-processed MRI images and the GRB kernel, achieves better performance than the quadratic and linear kernels in terms of accuracy, precision, and specificity.

Richard Torres-Molina, Carlos Bustamante-Orellana, Andrés Riofrío-Valdivieso, Francisco Quinga-Socasi, Robinson Guachi, Lorena Guachi-Guachi
Modelling Survival by Machine Learning Methods in Liver Transplantation: Application to the UNOS Dataset

The aim of this study is to develop and validate a machine learning (ML) model for predicting survival after liver transplantation based on pre-transplant donor and recipient characteristics. For this purpose, we consider a database from the United Network for Organ Sharing (UNOS), containing 29 variables and 39,095 donor-recipient pairs, describing liver transplantations performed in the United States of America from November 2004 until June 2015. More than 74% of the observations in the dataset are censored, which makes the problem challenging. Several methods, including proportional-hazards regression models and ML methods such as Gradient Boosting, were applied, using 10 donor characteristics, 15 recipient characteristics and 4 shared variables associated with the donor-recipient pair. To measure the performance of the seven state-of-the-art methodologies, three different evaluation metrics are used, of which the concordance index (ipcw) is the most suitable for this problem. The results show that a different technique obtains the highest value for each measure, with all techniques performing almost the same; on ipcw, however, Gradient Boosting outperforms the rest of the methods.

David Guijo-Rubio, Pedro J. Villalón-Vaquero, Pedro A. Gutiérrez, Maria Dolores Ayllón, Javier Briceño, César Hervás-Martínez
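
The liver transplantation abstract above relies on a concordance index for censored survival data. The sketch below illustrates the underlying idea with the plain Harrell's concordance index; it is not the inverse-probability-of-censoring-weighted (ipcw) variant the abstract refers to, and it is not the authors' implementation. The function name and the toy data are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Plain Harrell's concordance index for right-censored data.

    times:       observed follow-up time per subject
    events:      1 if the event was observed, 0 if the subject was censored
    risk_scores: model output, higher means an earlier event is expected
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable only if subject i had an observed event
            # strictly before subject j's observed time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # model ranks the pair correctly
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # ties count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data (hypothetical): the second subject is censored at time 4.
times = [2, 4, 6, 8]
events = [1, 0, 1, 1]
risk = [0.9, 0.3, 0.1, 0.5]
print(harrell_c_index(times, events, risk))  # 0.75
```

Censored subjects never start a comparable pair, which is why heavy censoring shrinks the set of usable pairs and motivates weighted variants such as the ipcw index.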
Design and Development of an Automatic Blood Detection System for Capsule Endoscopy Images

Wireless Capsule Endoscopy is a technique that allows for observation of the entire gastrointestinal tract in an easy and non-invasive way. However, its greatest limitation lies in the time required to analyze the large number of images generated in each examination for diagnosis, which is about 2 h. This causes not only a high cost, but also a high probability of a wrong diagnosis due to the physician’s fatigue, while the variable appearance of abnormalities requires continuous concentration. In this work, we designed and developed a system capable of automatically detecting blood based on classification of extracted regions, following two different classification approaches. The first method consisted in extraction of hand-crafted features that were used to train machine learning algorithms, specifically Support Vector Machines and Random Forest, to create models for classifying images as healthy tissue or blood. The second method consisted in applying deep learning techniques, concretely convolutional neural networks, capable of extracting the relevant features of the image by themselves. The best results (95.7% sensitivity and 92.3% specificity) were obtained for a Random Forest model trained with features extracted from the histograms of the three HSV color space channels. For both methods we extracted square patches of several sizes using a sliding window, while for the first approach we also implemented the waterpixels technique in order to improve the classification results.

Pedro Pons, Reinier Noorda, Andrea Nevárez, Adrián Colomer, Vicente Pons Beltrán, Valery Naranjo
Comparative Analysis for Computer-Based Decision Support: Case Study of Knee Osteoarthritis

This case study benchmarks a range of statistical and machine learning methods relevant to computer-based decision support in clinical medicine, focusing on the diagnosis of knee osteoarthritis at first presentation. The methods, comprising logistic regression, Multilayer Perceptron (MLP), Chi-square Automatic Interaction Detector (CHAID) and Classification and Regression Trees (CART), are applied to a public domain database, the Osteoarthritis Initiative (OAI), a 10 year longitudinal study starting in 2002 (n = 4,796). In this real-world application, it is shown that logistic regression is comparable with the neural networks and decision trees for discrimination of positive diagnosis on this data set. This is likely because of weak non-linearities among high levels of noise. After comparing the explanations provided by the different methods, it is concluded that the interpretability of the risk score index provided by logistic regression is expressed in a form that most naturally integrates with clinical reasoning. The reason for this is that it gives a statistical assessment of the weight of evidence for making the diagnosis, so providing a direction for future research to improve explanation of generic non-linear models.

Philippa Grace McCabe, Ivan Olier, Sandra Ortega-Martorell, Ian Jarman, Vasilios Baltzopoulos, Paulo Lisboa
A Clustering-Based Patient Grouper for Burn Care

Patient casemix is a system of defining groups of patients. For reimbursement purposes, these groups should be clinically meaningful and share similar resource usage during their hospital stay. In the UK National Health Service (NHS) these groups are known as health resource groups (HRGs), and are predominantly derived based on expert advice and checked for homogeneity afterwards, typically using length of stay (LOS) to assess similarity in resource consumption. LOS does not fully capture the actual resource usage of patients, and assurances on the accuracy of HRG as a basis of payment rate derivation are therefore difficult to give. Also, with complex patient groups such as those encountered in burn care, expert advice will often reflect average patients only, therefore not capturing the complexity and severity of many patients’ injury profile. The data-driven development of a grouper may support the identification of features and segments that more accurately account for patient complexity and resource use. In this paper, we describe the development of such a grouper using established techniques for dimensionality reduction and cluster analysis. We argue that a data-driven approach minimises bias in feature selection. Using a registry of patients from 23 burn services in England and Wales, we demonstrate a reduction of within cluster cost-variation in the identified groups, when compared to the original casemix.

Chimdimma Noelyn Onah, Richard Allmendinger, Julia Handl, Paraskevas Yiapanis, Ken W. Dunn
A Comparative Assessment of Feed-Forward and Convolutional Neural Networks for the Classification of Prostate Lesions

Prostate cancer is the most common cancer in men in the UK. An accurate diagnosis at the earliest stage possible is critical in its treatment. Multi-parametric Magnetic Resonance Imaging is gaining popularity in prostate cancer diagnosis: it can be used to actively monitor low-risk patients, and it is convenient due to its non-invasive nature. However, it requires specialist knowledge to review the abundance of available data, which has motivated the use of machine learning techniques to speed up the analysis of these many and complex images. This paper focuses on assessing the capabilities of two neural network approaches to accurately discriminate between three tissue types: significant prostate cancer lesions, non-significant lesions, and healthy tissue. For this, we used data from a previous SPIE ProstateX challenge that included significant and non-significant lesions, and we extended the dataset to include healthy prostate tissue due to clinical interest. Feed-Forward and Convolutional Neural Networks have been used, and their performances were evaluated using 80/20 training/test splits. Several combinations of the data were tested under different conditions and summarised results are presented. Using all available imaging data, a Convolutional Neural Network three-class classifier comparing prostate lesions and healthy tissue attains an Area Under the Curve of 0.892.

Sabrina Marnell, Patrick Riley, Ivan Olier, Marc Rea, Sandra Ortega-Martorell

Special Session on Machine Learning in Automatic Control

Frontmatter
A Method Based on Filter Bank Common Spatial Pattern for Multiclass Motor Imagery BCI

The Common Spatial Pattern (CSP) algorithm is capable of solving the binary classification problem for the motor imagery task brain-computer interface (BCI). This paper proposes a novel method based on the Filter Bank Common Spatial Pattern (FBCSP), termed the Multiscale and Overlapping FBCSP (MO-FBCSP). We extend the CSP algorithm to the multiclass case by using the one-versus-one (OvO) strategy. Multiple periods are selected and combined with the overlapping spectrum of the filter bank, which contains useful information. This method is evaluated on the benchmark BCI Competition IV dataset 2a with 9 subjects. An average accuracy of 80% was achieved with the random forest (RF) classifier, and the corresponding kappa value was 0.734. Quantitative results have shown that the proposed scheme outperforms the classical FBCSP algorithm by over 12%.

Ziqing Xia, Likun Xia, Ming Ma
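
The motor imagery abstract above reports both an accuracy and a kappa value. As a hedged illustration of how the two relate, the sketch below computes Cohen's kappa, (p_o − p_e) / (1 − p_e), from a confusion matrix. BCI Competition IV dataset 2a is a four-class task, so a balanced confusion matrix has a chance agreement of p_e = 0.25. The matrix itself is hypothetical and is not taken from the paper.

```python
def cohens_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    k = len(confusion)
    n = sum(sum(row) for row in confusion)
    # Observed agreement: fraction of trials on the diagonal.
    p_o = sum(confusion[i][i] for i in range(k)) / n
    # Chance agreement: sum over classes of (row total * column total) / n^2.
    p_e = sum(
        sum(confusion[i]) * sum(row[i] for row in confusion) for i in range(k)
    ) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical balanced four-class matrix with 80% accuracy (48 of 60 per class).
confusion = [
    [48, 4, 4, 4],
    [4, 48, 4, 4],
    [4, 4, 48, 4],
    [4, 4, 4, 48],
]
print(round(cohens_kappa(confusion), 4))  # 0.7333
```

With balanced classes, an accuracy of exactly 80% gives kappa = (0.80 − 0.25) / 0.75 ≈ 0.733, which is in line with the reported 0.734 given that the accuracy is quoted as a rounded average.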
Safe Deep Neural Network-Driven Autonomous Vehicles Using Software Safety Cages

Deep learning is a promising class of techniques for controlling an autonomous vehicle. However, functional safety validation is seen as a critical issue for these systems due to the lack of transparency in deep neural networks and the safety-critical nature of autonomous vehicles. The black box nature of deep neural networks limits the effectiveness of traditional verification and validation methods. In this paper, we propose two software safety cages, which aim to limit the control action of the neural network to a safe operational envelope. The safety cages impose limits on the control action during critical scenarios, which if breached, change the control action to a more conservative value. This has the benefit that the behaviour of the safety cages is interpretable, and therefore traditional functional safety validation techniques can be applied. The work here presents a deep neural network trained for longitudinal vehicle control, with safety cages designed to prevent forward collisions. Simulated testing in critical scenarios shows the effectiveness of the safety cages in preventing forward collisions whilst under normal highway driving unnecessary interruptions are eliminated, and the deep learning control policy is able to perform unhindered. Interventions by the safety cages are also used to re-train the network, resulting in a more robust control policy.

Sampo Kuutti, Richard Bowden, Harita Joshi, Robert de Temple, Saber Fallah
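
The safety cage idea in the abstract above (limit the network's control action in critical scenarios and fall back to a more conservative value when the limit is breached) can be sketched as a simple rule-based wrapper. This is an illustrative assumption, not the authors' design: the use of time headway as the criticality measure, the action range of [-1, 1] with negative values meaning braking, and all thresholds are hypothetical.

```python
def safety_cage(nn_action, time_headway_s,
                min_headway_s=1.0,
                action_limit_when_critical=-0.5,
                conservative_action=-1.0):
    """Return the control action after applying a rule-based safety cage.

    nn_action:      action proposed by the neural network, assumed in [-1, 1]
                    (negative = braking, positive = accelerating)
    time_headway_s: time gap to the vehicle in front, in seconds
    All thresholds are hypothetical values chosen for illustration.
    """
    critical = time_headway_s < min_headway_s
    if critical and nn_action > action_limit_when_critical:
        # The network is not braking hard enough in a critical scenario:
        # override with the more conservative action.
        return conservative_action
    # Otherwise the network's action passes through unchanged.
    return nn_action


print(safety_cage(0.3, time_headway_s=2.5))   # 0.3  (normal driving, no intervention)
print(safety_cage(0.3, time_headway_s=0.6))   # -1.0 (critical, action overridden)
print(safety_cage(-0.8, time_headway_s=0.6))  # -0.8 (critical, but already braking hard)
```

Because the wrapper is a fixed, readable rule, its behaviour can be inspected and tested independently of the network, which is the interpretability benefit the abstract describes.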
Wave and Viscous Resistance Estimation by NN

Ship resistance estimation is one of the most important problems to be solved by naval architects at the early stages of a ship design project. This paper presents a comparison of methods that are used to estimate the resistance value of a vessel, studying the two most relevant terms: the viscous resistance (which depends on the form factor) and the wave resistance, which appears in any floating device. This work focuses on the estimation of the form factor, since it is a parameter that is difficult to estimate in the early design phases and is not always available in the measurements provided by real experiments with ship prototypes in towing tanks. Different estimation methods are applied and compared with the direct estimation and with the prediction obtained with a feedforward neural network. The results support the suitability of neural networks for identifying these vessel-shape and wave-related variables.

D. Marón, M. Santos
Neural Controller of UAVs with Inertia Variations

Floating offshore wind turbines (FOWT) are exposed to harsh environmental conditions, which can lead to expensive maintenance operations. These costs could be alleviated by monitoring these floating devices using UAVs. Given the FOWT location, UAVs are currently the only way to do this health monitoring. This means that the UAVs should be well equipped and must be accurately controlled. Rotational inertia variation is a common disturbance that affects aerial vehicles during these inspection tasks. To address this issue, in this work we propose a new neural controller based on adaptive neuro estimators. The approach is based on the hybridization of feedback linearization, PIDs and artificial neural networks. Online learning is used to help the network improve its estimations while the system is working. The proposal is tested by simulation with several complex trajectories when the rotational inertia is multiplied by 10. Results show that the proposed UAV neural controller achieves good tracking and that the neuro estimators counteract the effect of the variations in rotational inertia.

J. Enrique Sierra-Garcia, Matilde Santos, Juan G. Victores

Special Session on Finance and Data Mining

Frontmatter
A Metric Framework for Quantifying Data Concentration

Poor performance of artificial neural nets when applied to credit-related classification problems is investigated and contrasted with logistic regression classification. We propose that artificial neural nets are less successful because of the inherent structure of credit data rather than any particular aspect of the neural net structure. Three metrics are developed to rationalise the result with such data. The metrics exploit the distributional properties of the data to rationalise neural net results. They are used in conjunction with a variant of an established concentration measure that differentiates between class characteristics. The results are contrasted with those obtained using random data, and are compared with results obtained using logistic regression. We find, in general agreement with previous studies, that logistic regressions out-perform neural nets in the majority of cases. An approximate decision criterion is developed in order to explain adverse results.

Peter Mitic
Adaptive Machine Learning-Based Stock Prediction Using Financial Time Series Technical Indicators

Stock market prediction is a hard task even with the help of advanced machine learning algorithms and computational power. Although much research has been conducted in the field, the results are often not reproducible. For this reason, the proposed workflow is publicly available on GitHub [1] as a continuous effort to help improve research in the field. This study explores in detail the importance of financial time series technical indicators, examining new approaches and technical indicators, targets, feature selection techniques, and machine learning algorithms. Using data from multiple assets and periods, the proposed model adapts to market patterns to predict the future, and it uses multiple supervised learning algorithms to ensure adaptation to different markets. The lack of research focusing on feature importance and the premise that technical indicators can improve prediction accuracy directed this research. The highest accuracy of the proposed approach reaches 75% with an area under the curve (AUC) of 0.82, using historical data up to 2019 to ensure applicability to today’s market, with more than a hundred experiments on a diverse set of publicly available assets.

Ahmed K. Taha, Mohamed H. Kholief, Walid AbdelMoez

Special Session on Knowledge Discovery from Data

Frontmatter
Exploiting Online Newspaper Articles Metadata for Profiling City Areas

News websites are among the most popular sources from which internet users read news articles. Such articles are often freely available and updated very frequently. Apart from the description of the specific news, these articles often contain metadata that can be automatically extracted and analyzed using data mining and machine learning techniques. In this work, we discuss how online news articles can be integrated as a further source of information in a framework for profiling city areas. We present some preliminary results considering online news articles related to the city of Rome. We characterize the different areas of Rome in terms of criminality, events, services, urban problems, decay and accidents. Profiles are identified using the k-means clustering algorithm. In order to offer better services to citizens and visitors, the profiles of the city areas may be a useful support for the decision making process of local administrations.

Livio Cascone, Pietro Ducange, Francesco Marcelloni
Modelling the Social Interactions in Ant Colony Optimization

Ant Colony Optimization (ACO) is a swarm-based algorithm inspired by the foraging behavior of ants. Despite its success, the efficiency of ACO has depended on the appropriate choice of parameters, requiring deep knowledge of the algorithm. A true understanding of ACO is linked to the (social) interactions between the agents, given that it is through these interactions that the ants are able to explore and exploit the search space. We propose to study the social interactions that take place as artificial agents explore the search space and communicate using stigmergy. We argue that this study brings insights into the way ACO works. The interaction network that we model out of the social interactions reveals nuances of the algorithm that are otherwise hard to notice. Examples include the ability to see whether certain agents are more influential than others and the structure of communication, to name a few. We argue that our interaction-network approach may lead to a unified way of seeing swarm systems and, in the case of ACO, remove part of the reliance on experts for parameter choice.

Nishant Gurrapadi, Lydia Taw, Mariana Macedo, Marcos Oliveira, Diego Pinheiro, Carmelo Bastos-Filho, Ronaldo Menezes
An Innovative Deep-Learning Algorithm for Supporting the Approximate Classification of Workloads in Big Data Environments

In this paper, we describe AppxDL, an algorithm for approximate classification of workloads of running processes in big data environments via deep learning (deep neural networks). The Deep Neural Network is trained with workloads that belong to known categories (e.g., compiler, file compressor, etc.). Its purpose is to extract the type of workload from the executions of reference programs, so that a Neural Model of the workloads can be learned. When the learning phase is completed, the Deep Neural Network is available as a Neural Model of the known workloads. We describe the AppxDL algorithm, and we report and discuss some significant results we have achieved with it.

Alfredo Cuzzocrea, Enzo Mumolo, Carson K. Leung, Giorgio Mario Grasso
Control-Flow Business Process Summarization via Activity Contraction

Organizations collect and store considerable amounts of process data in event logs that are subsequently mined to obtain process models. When the business process involves hundreds of activities, executed according to complex execution patterns, the process model can become too large and complex for relevant information to be identified by manual and visual inspection alone. Summarization techniques can help by providing concise and meaningful representations of the underlying process. This paper describes a business process summarization algorithm based on the hierarchical grouping of activities. In the proposed approach, activity grouping is guided by the existence of certain relations between pairs of activities, mined from the associated event log.

Valeria Fionda, Gianluigi Greco
Classifying Flies Based on Reconstructed Audio Signals

Advancements in sensor technology and processing power have made it possible to create recording equipment that can reconstruct the audio signal of insects passing through a directed infrared beam. The widespread deployment of such devices would allow for a range of applications previously not practical. A sensor net of detectors could be used to help model population dynamics, assess the efficiency of interventions and serve as an early warning system. At the core of any such system is a classification problem: given a segment of audio collected as something passes through a sensor, can we classify it? We examine the case of detecting the presence of fly species, with a particular focus on mosquitoes. This gives rise to a range of problems such as: can we discriminate between species of fly? Can we detect different species of mosquito? Can we detect the sex of the insect? Automated classification would significantly improve the effectiveness and efficiency of vector monitoring using these sensor nets. We assess a range of time series classification (TSC) algorithms on data from two projects working in this area. We assess our prior belief that spectral features are most effective, and we remark on all approaches with respect to whether they can be considered “real-time”.

Michael Flynn, Anthony Bagnall
Studying the Evolution of the ‘Circular Economy’ Concept Using Topic Modelling

Circular Economy has gained immense popularity for its perceived capacity to operationalise sustainable development. However, a comprehensive long-term understanding of the concept, characterising its evolution in academic literature, has not yet been provided. As a first step, we apply unsupervised topic models to academic articles to identify patterns in concept evolution. We generate topics using LDA, and investigate topic prevalence over time. We determine the optimal number of topics for the model (k) through coherence scores and evaluate the topic model results by expert judgement. Specifying k as 20, we find topics in the literature focussing on resources, business models, process modelling, conceptual research and policies. We identify a shift in the research focus of contemporary literature, moving away from Chinese predominance to a European perspective, along with a shift towards micro-level interventions, e.g., circular design and business models, around 2014–2015.
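As an illustration of what "topic prevalence over time" can mean in practice, the sketch below averages per-document topic proportions by publication year. It assumes the proportions have already been produced by a fitted LDA model; the function name `topic_prevalence_by_year`, the input layout and the choice of a simple per-year mean are assumptions made for this example and are not taken from the paper.

```python
from collections import defaultdict


def topic_prevalence_by_year(docs):
    """Average topic proportions per year.

    docs: iterable of (year, proportions) pairs, where proportions is a
    list with one weight per topic (assumed to come from a fitted LDA model).
    Returns {year: [mean proportion of each topic in that year]}.
    """
    sums = {}
    counts = defaultdict(int)
    for year, proportions in docs:
        if year not in sums:
            sums[year] = [0.0] * len(proportions)
        for i, p in enumerate(proportions):
            sums[year][i] += p
        counts[year] += 1
    # Divide each accumulated topic weight by the number of documents that year.
    return {
        year: [total / counts[year] for total in totals]
        for year, totals in sums.items()
    }


# Hypothetical toy input: two topics, three documents.
example_docs = [
    (2014, [0.8, 0.2]),
    (2014, [0.6, 0.4]),
    (2015, [0.1, 0.9]),
]
prevalence = topic_prevalence_by_year(example_docs)
```

Plotting each topic's yearly mean would then show when a topic gains or loses weight in the literature.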

Sampriti Mahanty, Frank Boons, Julia Handl, Riza Batista-Navarro
Mining Frequent Distributions in Time Series

Time series data is composed of observations of one or more variables over a time period. By analyzing the variability of the variables, we can reveal patterns that repeat or that are correlated, which helps to understand the behaviour of the variables over time. Our method finds frequent distributions of a target variable in time series data and discovers relationships between frequent distributions in consecutive time intervals. The frequent distributions are found using a new method, and relationships between them are found using association rule mining.
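The second step described above, relating distributions in consecutive intervals, can be sketched with the standard support and confidence measures from association rule mining. The sketch assumes each time interval has already been assigned a distribution label by some earlier step; the function name `consecutive_rules`, the labels and the thresholds are hypothetical, and the paper's own method for finding frequent distributions is not reproduced here.

```python
from collections import Counter


def consecutive_rules(labels, min_support=0.0, min_confidence=0.0):
    """Mine rules 'label at interval t -> label at interval t+1'.

    labels: sequence of distribution labels, one per time interval
            (assumed to be given; how they are obtained is out of scope).
    Returns a list of (antecedent, consequent, support, confidence).
    """
    pairs = list(zip(labels, labels[1:]))
    pair_counts = Counter(pairs)
    antecedent_counts = Counter(a for a, _ in pairs)
    rules = []
    for (a, b), n in pair_counts.items():
        support = n / len(pairs)             # share of all transitions
        confidence = n / antecedent_counts[a]  # share of transitions leaving a
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules


# Hypothetical labels for six consecutive intervals.
intervals = ["low", "high", "low", "high", "low", "low"]
strong_rules = consecutive_rules(intervals, min_confidence=0.9)
```

With these toy labels, only the rule from "high" to "low" passes the confidence threshold, because every "high" interval is followed by a "low" one.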

José Carlos Coutinho, João Mendes Moreira, Cláudio Rebelo de Sá
Time Series Display for Knowledge Discovery on Selective Laser Melting Machines

This paper presents a method for displaying industrial time series. It aims to support data and process engineers in data analytics tasks, especially in the area of Industry 4.0, where data and processes join. The method is entitled SCG, from Splitting, Clustering and Graph making, which are its main pillars. It brings two innovations: sample making and visualizations. The first is in charge of building well-suited samples geared towards the data exploration objectives, whereas the second is in charge of showing a graph-based view and a time-based view. The final objective of this method is the detection of stable working states on a working machine, which is key for process understanding, while at the same time it sheds light on knowledge discovery and monitoring. The use case in which this work is grounded is the Selective Laser Melting (SLM) industrial process, though the introduced SCG procedure could be applied to any time series collection.

Ramón Moreno, Juan Carlos Pereira, Alex López, Asif Mohammed, Prasha Pahlevannejad

Special Session on Machine Learning Algorithms for Hard Problems

Frontmatter
Using Prior Knowledge to Facilitate Computational Reading of Arabic Calligraphy

Arabic calligraphy (AC) is central to Arabic cultural heritage and has been used since its introduction, with the first writing of the Holy Quran, up until the present. It is famous for the artistic and complicated ways that letters and words interweave and intertwine to express textual statements, usually quotations from the Quran. These characteristics make it probably the hardest of all human writing systems to read. Here, we introduce the challenge of reading Arabic calligraphy using artificial intelligence (AI), a challenge that combines image processing and understanding of texts. We have collected a corpus of 1000 AC images along with annotated quotations from the Quran, pre-processing the images and identifying individual letters using detection methods based on maximally stable extremal regions (MSERs) and sliding windows (SWs). We then collect the identified letters to form bags of extracted letters (BOLs). These BOLs are then used to search for possible quotations from the corpus. Our results show that MSERs outperform SWs in letter detection. Furthermore, BOL-matching is better than word generation in predicting the correct quotation, with the correct answer found in the list of the 10 topmost matches for more than 74% of the 388 test examples.
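The idea of matching a bag of extracted letters against a corpus of quotations can be sketched as a multiset comparison. The example below is only an illustration: the Dice-style overlap score, the function names and the toy Latin-letter corpus are assumptions made here, and the scoring actually used in the paper may differ.

```python
from collections import Counter


def bag_of_letters(text):
    """Multiset of the non-space characters in a quotation."""
    return Counter(ch for ch in text if not ch.isspace())


def overlap_score(detected, quotation):
    """Dice-style overlap between the detected bag and a quotation's bag."""
    candidate = bag_of_letters(quotation)
    shared = sum((detected & candidate).values())  # multiset intersection
    total = sum(detected.values()) + sum(candidate.values())
    return 2 * shared / total if total else 0.0


def top_matches(detected_letters, corpus, k=10):
    """Return the k quotations whose letters best match the detected bag."""
    detected = Counter(detected_letters)
    ranked = sorted(corpus, key=lambda q: overlap_score(detected, q), reverse=True)
    return ranked[:k]


# Hypothetical toy corpus using Latin letters for readability.
toy_corpus = ["abc", "abd", "xyz"]
best = top_matches(["a", "b", "d"], toy_corpus, k=2)
```

Because the bag ignores letter order, this kind of matching tolerates the interwoven layouts typical of calligraphy, at the cost of confusing quotations that are anagrams of each other.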

Seetah ALSalamah, Riza Batista-Navarro, Ross D. King
SMOTE Algorithm Variations in Balancing Data Streams

Year after year, ever larger amounts of data are being created in different fields of application. A great deal of these sources require real-time processing and analysis, which leads to increased interest in streaming data classification within machine learning. It is not rare for these applications to deal with skewed or imbalanced data. In this paper, we analyze the use of variations of the SMOTE oversampling algorithm in learning patterns from imbalanced data streams using different incremental learning ensemble algorithms.

Bogdan Gulowaty, Paweł Ksieniewicz
Multi-class Text Complexity Evaluation via Deep Neural Networks

Automatic Text Complexity Evaluation (ATE) is a natural language processing task which aims to assess text difficulty, taking into account many facets related to complexity. A large number of papers tackle the problem of ATE by means of machine learning algorithms in order to classify texts into complex or simple classes. In this paper, we try to go beyond the methodologies presented so far by introducing a preliminary system based on a deep neural network model whose objective is to classify sentences into more than two classes. Experiments have been carried out on a manually annotated corpus which has been preprocessed in order to make it suitable for the scope of the paper. The results show that a finer-grained classification makes the ATE problem much harder to solve, exposing the weaknesses of the model in accomplishing the task correctly.

Alfredo Cuzzocrea, Giosué Lo Bosco, Giovanni Pilato, Daniele Schicchi
Imbalance Reduction Techniques Applied to ECG Classification Problem

In this work, we explored ways of improving the performance of deep learning models by reducing dataset imbalance. For our experiments, the highly imbalanced MIT-BIH ECG dataset was used. Multiple approaches were considered. First, we introduced multiclass UMCE, an ensemble designed to deal with imbalanced datasets. Second, we studied the impact of applying oversampling techniques to the training set. SMOTE without prior majority class undersampling was used as one of the methods. Another method we used was SMOTE with noise introduced to the synthetic learning examples. The baseline for our study was a single ResNet network with undersampling of the training set. Multiclass UMCE proved to be superior to the baseline model, but failed to beat the results obtained by a single model with SMOTE applied to the training set. Introducing perturbations to signals generated by SMOTE did not bring significant improvement. Future work may consider combining multiclass UMCE with SMOTE.
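For readers unfamiliar with SMOTE, the sketch below shows its core step: a synthetic minority example is placed at a random position on the segment between a minority sample and one of its nearest minority neighbours. The optional Gaussian perturbation stands in for the "noise introduced to synthetic examples" variant; the function name `smote`, the parameter `noise_std` and the use of Gaussian noise are assumptions made for this example, not details from the paper.

```python
import math
import random


def smote(minority, n_new, k=2, noise_std=0.0, seed=0):
    """Generate n_new synthetic minority points by interpolation.

    minority:  list of feature vectors (at least two) from the minority class.
    k:         number of nearest minority neighbours to choose from.
    noise_std: standard deviation of optional Gaussian noise (assumption).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the base sample, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        point = [
            b + gap * (n - b) + rng.gauss(0.0, noise_std)
            for b, n in zip(base, neighbour)
        ]
        synthetic.append(point)
    return synthetic


# Hypothetical minority class: the corners of the unit square.
corners = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_points = smote(corners, n_new=5)
```

With `noise_std=0.0` every synthetic point lies exactly on a segment between two existing minority samples; a positive value scatters the points around that segment.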

Jȩdrzej Kozal, Paweł Ksieniewicz
Machine Learning Methods for Fake News Classification

The problem of fake news publication is not new, having been reported since ancient times, but it has started to have a huge impact, especially on social media users. Such false information should be detected as soon as possible to avoid its negative influence on readers and, in some cases, on their decisions, e.g., during elections. Therefore, methods which can effectively detect fake news are the focus of intense research. This work focuses on fake news detection in articles published online, and on the basis of extensive research we confirmed that the chosen machine learning algorithms can distinguish fake news from reliable information.

Paweł Ksieniewicz, Michał Choraś, Rafał Kozik, Michał Woźniak
A Genetic-Based Ensemble Learning Applied to Imbalanced Data Classification

Imbalanced data classification is still a focus of intense research, due to its ever-growing presence in real-life decision tasks. In this article, we focus on a classifier ensemble for imbalanced data classification. The ensemble is formed on the basis of individual classifiers trained on feature subsets selected in a supervised manner. There are several methods employing this concept to ensure a highly diverse ensemble; nevertheless, most of them, such as Random Subspace or Random Forest, select attributes for a particular classifier randomly. The main drawback of these methods is that they offer no way to supervise and control this task. In this work, we apply a genetic algorithm to the considered problem. The proposal formulates an original learning criterion that takes into consideration not only the overall classification performance but also ensures that the trained ensemble is characterised by high diversity. The experimental study confirmed the high efficiency of the proposed algorithm and its superiority to other ensemble-forming methods based on random feature selection.

Jakub Klikowski, Paweł Ksieniewicz, Michał Woźniak
The Feasibility of Deep Learning Use for Adversarial Model Extraction in the Cybersecurity Domain

Machine learning algorithms have found their way into a surprisingly wide range of applications, providing utility and allowing for insights gathered from data in a way never before possible. Those tools, however, have not been developed with security in mind. A deployed algorithm can meet a multitude of risks in the real world. This work explores one of those risks: the feasibility of an exploratory attack geared towards stealing an algorithm used in the cybersecurity domain. The process we have used is thoroughly explained and the results are promising.

Michał Choraś, Marek Pawlicki, Rafał Kozik
Backmatter
Metadata
Title
Intelligent Data Engineering and Automated Learning – IDEAL 2019
Edited by
Dr. Hujun Yin
David Camacho
Peter Tino
Dr. Antonio J. Tallón-Ballesteros
Ronaldo Menezes
Richard Allmendinger
Copyright Year
2019
Electronic ISBN
978-3-030-33617-2
Print ISBN
978-3-030-33616-5
DOI
https://doi.org/10.1007/978-3-030-33617-2