
About this book

This volume constitutes the refereed proceedings of the 14th International Conference on Hybrid Artificial Intelligent Systems, HAIS 2019, held in León, Spain, in September 2019.
The 64 full papers published in this volume were carefully reviewed and selected from 134 submissions. They are organized in the following topical sections: data mining, knowledge discovery and big data; bio-inspired models and evolutionary computation; learning algorithms; visual analysis and advanced data processing techniques; data mining applications; and hybrid intelligent applications.

Table of Contents

Frontmatter

Data Mining, Knowledge Discovery and Big Data

Frontmatter

Testing Modified Confusion Entropy as Split Criterion for Decision Trees

Confusion Entropy (CEN) has been proposed as a performance measure for classification, showing better discrimination than other metrics, and many works have since used CEN for other purposes. Recently, an improvement to the definition of CEN, the modified CEN (MCEN), has been proposed. The aim of this work is to revisit a previous work based on a classification tree that uses CEN as a pruning criterion, replacing this criterion with the newly defined MCEN metric.

J. David Nuñez-Gonzalez, Alexander Gonzalo de Sá, Manuel Graña

Generating a Question Answering System from Text Causal Relations

The aim of this paper is to present a methodology for creating expert systems by processing texts in order to respond to the queries of a question answering system. In previous work, we presented several algorithms able to extract causal information from text documents and to summarize it. These approaches extracted knowledge from unstructured information, but the resulting representation could not be processed automatically to infer new knowledge: the generated summaries present the information only in natural language and hence cannot be processed to derive complex implications. In this paper, we introduce a procedure capable of using this knowledge to infer new causal relations between concepts automatically, by creating expert systems from the processed texts. These expert systems contain the causal relations present in the processed texts, and in this representation, using logic programming, we can infer new concepts implied by the causal relations. We describe the methodology, the technical details of the implementation of our question answering system, and a full example illustrating its usefulness.

E. C. Garrido Merchán, C. Puente, J. A. Olivas

Creation of a Distributed NoSQL Database with Distributed Hash Tables

Databases are an essential tool in the real world. Traditionally, the relational model and centralized architectures have been predominant. However, with the growth of the Internet in recent decades, both in the number of users and in the amount of information, decentralized architectures and alternatives to the relational model, known as NoSQL (Not only Structured Query Language) databases, have become widespread. In the present work, the development of a distributed NoSQL database is proposed, which aims to achieve high availability and high scalability through a decentralized architecture based on DHTs (Distributed Hash Tables).
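
To make the DHT idea concrete, here is a minimal consistent-hashing sketch in Python; it is our illustration rather than the paper's implementation, and the node names and virtual-node count are assumptions.

```python
# Minimal consistent-hashing ring of the kind a DHT-based key-value store
# might use to assign keys to nodes (illustrative sketch).
import bisect
import hashlib

class HashRing:
    def __init__(self, nodes, vnodes=8):
        # Map each physical node to several virtual points on the ring
        # to even out the key distribution.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.sha1(value.encode()).hexdigest(), 16)

    def node_for(self, key):
        # Walk clockwise to the first virtual node at or after the key's hash.
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # e.g. 'node-b'
```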

Agustín San Román Guzmán, Diego Valdeolmillos, Alberto Rivas, Angélica González Arrieta, Pablo Chamoso

Study of Data Pre-processing for Short-Term Prediction of Heat Exchanger Behaviour Using Time Series

Geothermal exchangers are among the most interesting solutions to equip a modern house with a renewable energy heating installation. The present study shows the computational modelling of an instance of such an installation, aiming to predict the behaviour of the system in the short term based on data registered at previous time instants. A correct prediction could be of potential use in the design of smart power grids. In this study, several models and configurations have been compared to determine the best and most economical setup needed for registering the data used in the prediction. The study includes comparisons of several ways of arranging the temporal data and pre-processing it with unsupervised techniques, as well as several regression models. The novel approach has been tested empirically with a real dataset of measurements registered over a complete year, obtaining good results in all operating condition ranges.

Bruno Baruque, Esteban Jove, Santiago Porras, José Luis Calvo-Rolle

Can Automated Smoothing Significantly Improve Benchmark Time Series Classification Algorithms?

tl;dr: no, it cannot, at least not on average on the standard archive problems. We assess whether six smoothing algorithms (moving average, exponential smoothing, Gaussian filter, Savitzky-Golay filter, Fourier approximation and a recursive median sieve) can be automatically applied to time series classification problems as a preprocessing step to improve the performance of three benchmark classifiers (1-Nearest Neighbour with Euclidean and Dynamic Time Warping distances, and Rotation Forest). We found no significant improvement over unsmoothed data, even when we set the smoothing parameter through cross-validation. We are not claiming smoothing has no worth: it has an important role in exploratory analysis and helps with specific classification problems where domain knowledge can be exploited. What we observe is that automatic application does not improve classification performance, and that we cannot explain the improvement of other time series classification algorithms over the baseline classifiers simply as a function of the absence of smoothing.
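
A sketch of this kind of experiment (our own illustration, not the authors' code): pick a moving-average window by cross-validation on the training set, then apply it before a 1-NN Euclidean classifier. The window sizes are illustrative; a window of 1 leaves the series unsmoothed.

```python
# X: (n_cases, series_length) array of time series; y: class labels.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def smooth(X, w):
    if w <= 1:
        return X
    kernel = np.ones(w) / w
    return np.apply_along_axis(
        lambda s: np.convolve(s, kernel, mode="same"), 1, X)

def best_window(X, y, windows=(1, 3, 5, 9, 17)):
    # Choose the window maximising 1-NN accuracy under 5-fold CV.
    scores = {w: cross_val_score(KNeighborsClassifier(n_neighbors=1),
                                 smooth(X, w), y, cv=5).mean()
              for w in windows}
    return max(scores, key=scores.get)
```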

James Large, Paul Southam, Anthony Bagnall

Dataset Weighting via Intrinsic Data Characteristics for Pairwise Statistical Comparisons in Classification

In supervised learning, some data characteristics (e.g. presence of errors, overlapping degree, etc.) may negatively influence classifier performance. Many methods are designed to overcome the undesirable effects of these issues. When comparing one of those techniques with existing ones, a proper selection of datasets must be made, based on how well each dataset reflects the characteristic specifically addressed by the proposed algorithm. In this setting, statistical tests are necessary to check the significance of the differences found when comparing different methods. Wilcoxon's signed-ranks test is one of the best-known statistical tests for pairwise comparisons between classifiers. However, it gives the same importance to every dataset, disregarding how representative each of them is of the concrete issue addressed by the methods compared. This research proposes a hybrid approach which combines measurement techniques for data characterization with statistical tests for decision making in data mining. Thus, each dataset is weighted according to its representativeness of the property of interest before applying Wilcoxon's test. Our proposal has been successfully compared with the standard Wilcoxon's test in two scenarios related to the noisy data problem. As a result, this approach highlights properties of the algorithms that may otherwise remain hidden if data characteristics are not considered in the comparison.
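
One way such a weighting could enter a signed-rank comparison is sketched below as a permutation test; this is a hedged illustration of the general idea, not the paper's exact procedure, and the per-dataset weights are assumed to be given.

```python
import numpy as np

def weighted_signed_rank_pvalue(scores_a, scores_b, weights,
                                n_perm=10000, seed=0):
    rng = np.random.default_rng(seed)
    weights = np.asarray(weights, dtype=float)
    d = np.asarray(scores_a) - np.asarray(scores_b)
    ranks = np.argsort(np.argsort(np.abs(d))) + 1.0  # ranks of |differences|
    stat = np.sum(weights * np.sign(d) * ranks)
    # Null hypothesis: each difference's sign flips with probability 1/2.
    flips = rng.choice([-1.0, 1.0], size=(n_perm, d.size))
    null = (flips * (weights * ranks)).sum(axis=1)
    return np.mean(np.abs(null) >= abs(stat))  # two-sided p-value
```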

José A. Sáez, Pablo Villacorta, Emilio Corchado

Mining Network Motif Discovery by Learning Techniques

Properties of complex networks represent a powerful set of tools that can be used to study the complex behaviour of these systems of interconnections. They vary from simplistic metrics (number of edges and nodes) to properties that reflect complex information about the connections between the entities of the network (assortativity degree, density or clustering coefficient). One such topological property with valuable implications for the study of network dynamics is network motifs: patterns of interconnections found in real-world networks. Since one of the biggest issues with network motif discovery is its NP-complete algorithmic nature, this paper presents a method to detect whether a network is prone to generate motifs by making use of its topological properties to train various classification models. This approach is intended to serve as a time-saving pre-processing step for the state-of-the-art solutions used to detect motifs in complex networks.
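
A minimal sketch of such a pipeline with networkx and scikit-learn; the feature set and classifier are plausible choices on our part, and the proneness labels are assumed to come from an offline motif-discovery run.

```python
import networkx as nx
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def topological_features(g):
    # Cheap global descriptors of the network's topology.
    return [
        g.number_of_nodes(),
        g.number_of_edges(),
        nx.density(g),
        nx.average_clustering(g),
        nx.degree_assortativity_coefficient(g),
    ]

def train_motif_proneness(graphs, labels):
    X = np.array([topological_features(g) for g in graphs])
    return RandomForestClassifier(n_estimators=200).fit(X, labels)
```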

Bogdan-Eduard-Mădălin Mursa, Anca Andreica, Laura Dioşan

Deep Structured Semantic Model for Recommendations in E-commerce

This paper presents an approach for building a recommender system that makes use of heterogeneous side information, based on a modification of the Deep Structured Semantic Model (DSSM). The core idea is to unite all side-information features into two subnetworks of user-related and item-related features and learn the similarity between their latent representations using neural matrix factorization. We tested the proposed model on the task of product recommendation using a dataset provided by a Russian online-marketing company. To deal with the sparsity of the data, we suggest recommending categories of items first and then using any other algorithm to rank items inside categories. We compared the performance of the proposed model to several traditional methods and demonstrated that DSSM with heterogeneous input significantly increases overall recommendation quality, which makes it suitable for recommendations with rich side information about items and users.
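
A minimal two-tower sketch of this kind of architecture in PyTorch (our illustration; the layer sizes are assumptions): user features and item features pass through separate subnetworks, and similarity is taken between the latent vectors.

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    def __init__(self, user_dim, item_dim, latent=64):
        super().__init__()
        self.user_net = nn.Sequential(nn.Linear(user_dim, 128), nn.ReLU(),
                                      nn.Linear(128, latent))
        self.item_net = nn.Sequential(nn.Linear(item_dim, 128), nn.ReLU(),
                                      nn.Linear(128, latent))

    def forward(self, user_x, item_x):
        u = self.user_net(user_x)
        v = self.item_net(item_x)
        # Cosine similarity between user and item latent representations.
        return nn.functional.cosine_similarity(u, v, dim=-1)
```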

Anna Larionova, Polina Kazakova, Nikita Nikitinsky

Bio-inspired Models and Evolutionary Computation

Frontmatter

Improving Comparative Radiography by Multi-resolution 3D-2D Evolutionary Image Registration

Comparative radiography has a crucial role in the forensic identification endeavor. An approach to automate the comparison of ante-mortem and post-mortem radiographs, based on an evolutionary image registration method, has recently been proposed. It uses differential evolution to estimate the parameters of a 3D-2D registration transformation that automatically superimposes a bone surface model over a radiograph of the same bone. The main drawback of this proposal is its high computational cost. This contribution tackles that cost by incorporating multi-resolution and multi-start strategies into the optimization process. We have studied the accuracy, robustness and computation time of different configurations of the proposed method with synthetic images of patellae, clavicles and frontal sinuses. A significant improvement has been obtained over the state-of-the-art method in terms of the robustness of the optimization and the computational cost, with a drop in accuracy smaller than 0.5% of the pixels of the silhouette of the bone or cavity.
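
As a sketch of the optimization step, differential evolution can search a small set of registration parameters with SciPy; here project_and_compare is a hypothetical stand-in for the projection-and-comparison pipeline, and the parameterization (3 rotations, 3 translations, 1 scale) is an illustrative assumption.

```python
import numpy as np
from scipy.optimize import differential_evolution

def register(radiograph, bone_model, project_and_compare):
    # 7 illustrative parameters: rotations, translations, scale.
    bounds = [(-np.pi, np.pi)] * 3 + [(-50.0, 50.0)] * 3 + [(0.5, 2.0)]
    result = differential_evolution(
        lambda p: project_and_compare(bone_model, p, radiograph),
        bounds, maxiter=200, tol=1e-6)
    return result.x, result.fun  # best parameters and mismatch score
```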

Oscar Gómez, Oscar Ibáñez, Andrea Valsecchi, Oscar Cordón

Evolutionary Algorithm for Pathways Detection in GWAS Studies

In genetics, a genome-wide association study (GWAS) involves an analysis of the single-nucleotide polymorphisms (SNPs) that constitute the genome. This analysis is performed on a large set of individuals, usually classified as cases and controls. The study of differences in the SNP chains of both groups is known as pathway analysis. This analysis allows the researcher to go beyond univariate results such as those offered by p-value analysis and its representation in Manhattan plots. Pathway analysis makes it possible to detect weaker single-variant signals and is also helpful for understanding molecular mechanisms linked to certain diseases and phenotypes. The present research proposes a new algorithm based on evolutionary computation, capable of finding significant pathways in GWA studies. Its performance has been tested with the help of synthetic data sets created with a genomic data simulator developed ad hoc.

Fidel Díez Díaz, Fernando Sánchez Lasheras, Francisco Javier de Cos Juez, Vicente Martín Sánchez

Particle Swarm Optimization-Based CNN-LSTM Networks for Anomalous Query Access Control in RBAC-Administered Model

As most organizations and companies depend on databases to process confidential information, database security has received considerable attention in recent years. Within database security, access control is the selective restriction of access to the system or information to authorized users only. However, access control has difficulty preventing information leakage through structured query language (SQL) statements crafted by internal attackers. In this paper, we propose a hybrid anomalous query access control system that extracts features of access behavior by parsing the query log, assuming the DBA uses role-based access control (RBAC), and detects database access anomalies in those features using particle swarm optimization (PSO)-based CNN-LSTM networks. The CNN hierarchy extracts features important for role classification from the vectors encoding the SQL queries, while the LSTM model is suited to representing the sequential relationships of SQL query statements. The PSO automatically finds the optimal CNN-LSTM hyperparameters for access control. Our CNN-LSTM method achieves nearly perfect access control performance for very similar roles that were previously difficult to classify, and explains the important variables that influence the role classification. Finally, the PSO-based CNN-LSTM networks outperform other state-of-the-art machine learning techniques on the TPC-E scenario-based virtual query dataset.
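
For reference, a compact PSO loop of the kind used for such hyper-parameter searches; evaluate is a hypothetical function that trains a CNN-LSTM with the given continuous hyper-parameters (e.g. learning rate, unit counts) and returns a validation error to minimize.

```python
import numpy as np

def pso(evaluate, bounds, n_particles=10, iters=30,
        w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds, dtype=float).T
    x = rng.uniform(lo, hi, size=(n_particles, len(bounds)))
    v = np.zeros_like(x)
    pbest, pbest_f = x.copy(), np.array([evaluate(p) for p in x])
    g = pbest[pbest_f.argmin()].copy()      # global best position
    for _ in range(iters):
        r1, r2 = rng.random(x.shape), rng.random(x.shape)
        v = w * v + c1 * r1 * (pbest - x) + c2 * r2 * (g - x)
        x = np.clip(x + v, lo, hi)          # keep particles inside bounds
        f = np.array([evaluate(p) for p in x])
        better = f < pbest_f
        pbest[better], pbest_f[better] = x[better], f[better]
        g = pbest[pbest_f.argmin()].copy()
    return g
```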

Tae-Young Kim, Sung-Bae Cho

d(Tree)-by-dx: Automatic and Exact Differentiation of Genetic Programming Trees

Genetic programming (GP) has developed to the point where it is a credible candidate for the ‘black box’ modeling of real systems. Wider application, however, could greatly benefit from its seamless embedding in conventional optimization schemes, which are most efficiently carried out using gradient-based methods. This paper describes the development of a method to automatically differentiate GP trees using a series of tree transformation rules; the resulting method can be applied an unlimited number of times to obtain higher derivatives of the function approximated by the original, trained GP tree. We demonstrate the utility of our method using a number of illustrative gradient-based optimizations that embed GP models.
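
To make the idea concrete, here is a minimal recursive differentiation of expression trees in the spirit of such transformation rules (our sketch; the paper's own rule set covers many more operators). Re-applying diff to its own output yields higher derivatives, mirroring the unlimited repeated application described above.

```python
# Trees are tuples: ('const', c), ('x',), ('+', left, right), ('*', left, right).
def diff(tree):
    op = tree[0]
    if op == "const":
        return ("const", 0.0)
    if op == "x":
        return ("const", 1.0)
    if op == "+":
        # Sum rule: (u + v)' = u' + v'
        return ("+", diff(tree[1]), diff(tree[2]))
    if op == "*":
        # Product rule: (u * v)' = u'v + uv'
        u, v = tree[1], tree[2]
        return ("+", ("*", diff(u), v), ("*", u, diff(v)))
    raise ValueError(f"unknown operator {op!r}")

# d/dx (x * x), left unsimplified as a pure tree transformation would leave it.
print(diff(("*", ("x",), ("x",))))
```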

Peter Rockett, Yuri Kaszubowski Lopes, Tiantian Dou, Elizabeth A. Hathway

Genetic Algorithm-Based Deep Learning Ensemble for Detecting Database Intrusion via Insider Attack

A database Intrusion Detection System (IDS) based on the Role-based Access Control (RBAC) mechanism, with the capability of learning and adaptation, learns SQL transaction patterns represented by roles in order to detect insider attacks. In this paper, we parameterize the rules for partitioning the entire query set into multiple areas with simple chromosomes and propose an ensemble of multiple deep learning models that can effectively model the tree-structural characteristics of SQL transactions. Experimental results on a large synthetic query dataset verify that it quantitatively outperforms other ensemble methods and machine learning methods, including deep learning models, in terms of 10-fold cross validation and chi-square validation.

Seok-Jun Bu, Sung-Bae Cho

An Efficient Hybrid Genetic Algorithm for Solving a Particular Two-Stage Fixed-Charge Transportation Problem

In this paper we address a particular capacitated two-stage fixed-charge transportation problem using an efficient hybrid genetic algorithm. The proposed approach is designed to fit the challenges of the investigated optimization problem and is obtained by incorporating a linear programming (LP) optimization problem within the framework of a genetic algorithm. We evaluated our proposed solution approach on two sets of instances often used in the literature. The experimental results show the efficiency of our hybrid algorithm in yielding high-quality solutions within reasonable running times, as well as its superiority over other existing competing methods.

Ovidiu Cosma, Petrica C. Pop, Cosmin Sabo

Analysis of MOEA/D Approaches for Inferring Ancestral Relationships

Over the years, decomposition approaches have been attracting major research attention as a promising way to solve complex multiobjective optimization problems. This work investigates the application of decomposition-based optimization techniques to address a challenging problem from the bioinformatics domain: the reconstruction of ancestral relationships from protein data. A comparative analysis of different design alternatives for the Multiobjective Evolutionary Algorithm based on Decomposition (MOEA/D) is undertaken. In particular, MOEA/D variants integrating genetic operators (MOEA/D-GA) and differential evolution (MOEA/D-DE) are studied. Hybrid search mechanisms are included to improve the accuracy of these methods, combining evolutionary strategies with problem-specific heuristics. Experimental results on four real-world problem instances demonstrate the significance of these techniques, especially when differential evolution approaches are used to conduct the search. As a result, significant multiobjective performance and biological solution quality are achieved in comparison with other methods from the literature.

Sergio Santander-Jiménez, Miguel A. Vega-Rodríguez, Leonel Sousa

Parsimonious Modeling for Estimating Hospital Cooling Demand to Reduce Maintenance Costs and Power Consumption

Hospitals are massive consumers of energy, and their cooling systems for HVAC and sanitary uses are particularly energy-intensive. Forecasting the thermal cooling demand of a hospital facility is remarkable for its potential to improve the energy efficiency of these buildings. A predictive model can help forecast the activity of water-cooled generators and improve the overall efficiency of the whole system: power generation can be adapted to the expected real demand and adjusted accordingly. In addition, the maintenance costs related to power-generator breakdowns or ineffective starts and stops can be reduced. This article details the steps taken to develop an optimal and efficient model based on a genetic methodology that searches for low-complexity models through feature selection, parameter tuning and parsimonious model selection. The methodology, called GAparsimony, has been tested with neural networks, support vector machines and gradient boosting techniques. The operational method employed herein can be replicated in similar buildings with comparable water-cooled generators, regardless of whether the buildings are new or existing structures.

Eduardo Dulce, Francisco Javier Martinez-de-Pison

Haploid Versus Diploid Genetic Algorithms. A Comparative Study

Genetic algorithms (GAs) are powerful tools for solving complex optimization problems, usually using a haploid representation. In the past decades, there has been growing interest in diploid genetic algorithms. Even though this area seems attractive, it lacks wider coverage and research in the Evolutionary Computation community. The scope of this paper is to provide some reasons why this is the case; to this end, we present experimental results using a conventional haploid GA and a newly developed diploid GA, tested on several major benchmark functions used for the performance evaluation of genetic algorithms. The obtained results show the superiority of the diploid GA over the conventional haploid GA on the considered benchmark functions.

Adrian Petrovan, Petrica Pop-Sitar, Oliviu Matei

Entropy and Organizational Performance

The main purpose of this article is to analyze the impact of workers' behavior, in terms of their emotions and feelings, on the system's performance, i.e., it looks at issues concerned with Organizational Sustainability. Indeed, the aim is to define a process that motivates and inspires managers and personnel to act at the limit, i.e., to achieve the organizational goals through an effective and efficient implementation of operational and behavioral strategies. The focus will be on the importance of specific psychosocial variables that may affect collective pro-organizational attitudes. Data is increasing exponentially and is somehow getting out of control; the question is to know the correct value of the information that may lie behind these numbers.

José Neves, Nuno Maia, Goreti Marreiros, Mariana Neves, Ana Fernandes, Jorge Ribeiro, Isabel Araújo, Nuno Araújo, Liliana Ávidos, Filipa Ferraz, António Capita, Nicolás Lori, Victor Alves, Henrique Vicente

A Semi-supervised Method to Classify Educational Videos

Currently, topic modelling has regained interest in the world of e-learning, where it is necessary to search through an extensive database of online learning objects, mainly in the form of educational videos. The main problem is the retrieval of those learning objects best suited to students' keyword searches, and it remains an open topic today. Accordingly, this paper aims to provide a more sophisticated method to improve the search for educational videos and thus show those that best fit the learning objectives of students. To do this, a semi-supervised method to cluster and classify a large dataset of educational videos from the Universitat Politècnica de València has been developed. The proposed method employs open content resources from Wikipedia as labelled data to train the model.

Alexandru Stefan Stoica, Stella Heras, Javier Palanca, Vicente Julian, Marian Cristian Mihaescu

New Approach for the Aesthetic Improvement of Images Through the Combination of Convolutional Neural Networks and Evolutionary Algorithms

Programs for the aesthetic improvement of images have been among the most widely used applications in recent years, from both the commercial and the private point of view. Such improvement is typically achieved by applying different filters that transform the original image into another whose aesthetics are improved. In this work, a new approach for the automatic improvement of image aesthetics is presented. This approach uses a Convolutional Neural Network (CNN) trained with the AVA photography dataset, which contains around 255,000 images rated by amateur photographers. Once trained, the network can assess an image in terms of its aesthetic characteristics. Through a differential evolution algorithm, an optimization process is carried out to find the parameters of a set of filters that improve the aesthetics of the original image, with the trained CNN used as the fitness function. At the end of the experimentation, the viability of this methodology is presented, analyzing the convergence capacity and some visual results.

Juan Abascal, Miguel A. Patricio, Antonio Berlanga, José M. Molina

Learning Algorithms

Frontmatter

Evaluating Strategies for Selecting Test Datasets in Recommender Systems

Recommender systems based on collaborative filtering are widely used to predict users' behaviour in large databases where users rate items. The prediction model is built from a training dataset according to the matrix factorization method and validated using a test dataset in order to measure the prediction error. Random selection is the simplest and most intuitive way to build test datasets. Nevertheless, one could consider other deterministic methods that select test ratings uniformly along the database, in order to obtain a balanced contribution from all users and items. In this paper, we perform several experiments validating recommender systems using random and deterministic strategies to select test datasets. We considered a zigzag deterministic strategy that selects ratings uniformly across the rows and columns of the ratings matrix, following a diagonal path. After analysing the statistical results, we conclude that there are no particular advantages in the deterministic strategy.
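
One possible zigzag selection is sketched below (the paper's exact path may differ): walk the ratings matrix along diagonals so that test cells are spread uniformly over rows (users) and columns (items). In practice only positions holding observed ratings would be kept.

```python
def zigzag_test_indices(n_users, n_items, n_test):
    # n_test must not exceed n_users * n_items.
    picked, d = [], 0
    while len(picked) < n_test:
        for r in range(n_users):            # walk one diagonal across all rows
            picked.append((r, (r + d) % n_items))
            if len(picked) == n_test:
                break
        d += 1                              # shift to the next diagonal
    return picked

print(zigzag_test_indices(4, 5, 6))
# [(0, 0), (1, 1), (2, 2), (3, 3), (0, 1), (1, 2)]
```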

Francisco Pajuelo-Holguera, Juan A. Gómez-Pulido, Fernando Ortega

Use of Natural Language Processing to Identify Inappropriate Content in Text

The quick development of communication through new technology media such as social networks and mobile phones has improved our lives. However, it also produces collateral problems, such as the presence of insults and abusive comments. In this work, we address the problem of detecting violent content in text documents using Natural Language Processing techniques. Following an approach based on Machine Learning, we trained six models resulting from the combinations of two text encoders, Term Frequency-Inverse Document Frequency and Bag of Words, with three classifiers: Logistic Regression, Support Vector Machines and Naïve Bayes. We also assessed StarSpace, a Deep Learning approach proposed by Facebook, configured to use Hit@1 accuracy. We evaluated these seven alternatives on two publicly available datasets from the Wikipedia Detox Project: Attack and Aggression. StarSpace achieved an accuracy of 0.938 and 0.937 on these datasets, respectively, making it the recommended algorithm among the evaluated alternatives for detecting violent content in text documents.
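
One of the six classical combinations, sketched with scikit-learn (our illustration): TF-IDF features feeding Logistic Regression. The training and test variables are assumed to come from a labelled corpus such as the Wikipedia Detox data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Encoder + classifier as a single pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
# model.fit(train_texts, train_labels)
# accuracy = model.score(test_texts, test_labels)
```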

Sergio Merayo-Alba, Eduardo Fidalgo, Víctor González-Castro, Rocío Alaiz-Rodríguez, Javier Velasco-Mata

Classification of Human Driving Behaviour Images Using Convolutional Neural Network Architecture

Traffic safety is a problem of worldwide concern. Many traffic accidents occur, and many situations cause them; however, the relevant statistics show that most traffic accidents are caused by driver behavior. Drivers who exhibit careless behavior cause accidents, and early detection of such actions may prevent them. In this study, driver behavior is recognized on the State Farm distracted driver detection data, which includes images of nine situations that may cause an accident and one normal state. The images are preprocessed with a LoG (Laplacian of Gaussian) filter. Feature extraction is carried out with GoogLeNet, a convolutional neural network architecture. The resulting classification achieves 97.7% accuracy.

Emine Cengil, Ahmet Cinar

Building a Classification Model Using Affinity Propagation

Regular classification of data involves a training set and a test set. For classifiers such as Naïve Bayes, Artificial Neural Networks, and Support Vector Machines, the whole training set is used for training. This study explores the possibility of using a condensed form of the training set in order to obtain comparable classification accuracy. The technique explored here uses a clustering algorithm to determine how the data can be compressed: for example, is it possible to represent 50 records as a single record, and can this single record train a classifier as well as using all 50 records? This study explores how data compression can be achieved through clustering, which concepts capture the qualities of a compressed dataset, and how to check the information gain to ensure the integrity and quality of the compression algorithm. Specifically, it explores compression through Affinity Propagation on categorical data, uses entropy within cluster sets to quantify integrity and quality, and tests the compressed dataset against the uncompressed one with a classifier using Cosine Similarity.
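
A minimal sketch of the compression idea with scikit-learn; the default Euclidean similarity and the 1-NN cosine classifier are simplifying assumptions relative to the categorical-data setting studied in the paper.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.neighbors import KNeighborsClassifier

def compress_and_train(X_train, y_train):
    # Affinity Propagation picks one exemplar per cluster; keep only those.
    ap = AffinityPropagation(random_state=0).fit(X_train)
    exemplars = ap.cluster_centers_indices_
    clf = KNeighborsClassifier(n_neighbors=1, metric="cosine")
    return clf.fit(X_train[exemplars], y_train[exemplars])
```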

Christopher Klecker, Ashraf Saad

Clustering-Based Ensemble Pruning and Multistage Organization Using Diversity

The purpose of ensemble pruning is to reduce the number of predictive models in order to improve the efficiency and predictive performance of the ensemble. In the clustering-based approach, we look for groups of similar models and then prune each of them separately in order to increase the overall diversity of the ensemble. In this paper we propose two methods for this purpose, using classifier clustering on the basis of a criterion based on a diversity measure. In the first method, we select from each cluster the model with the best predictive performance to form the final ensemble, while the second one employs a multistage organization, where instead of removing classifiers from the ensemble, each classifier group makes its decision independently. The final answer of the proposed framework is the result of majority voting over the decisions returned by each group. Experimental results, validated through statistical tests, confirmed the usefulness of the proposed approaches.

Paweł Zyblewski, Michał Woźniak

Towards a Custom Designed Mechanism for Indexing and Retrieving Video Transcripts

Finding appropriate e-Learning resources within a repository of videos represents a critical aspect for students. Given that transcripts are available for the entire set of videos, the problem reduces to obtaining a ranked list of video transcripts for a particular query. The paper presents a custom approach for searching the 16,012 available video transcripts from https://media.upv.es/ at Universitat Politècnica de València. An inherent difficulty of the problem comes from the fact that the transcripts are in Spanish. The proposed solution embeds all the transcripts using feed-forward Neural-Net Language Models, clusters the embedded transcripts and builds a Latent Dirichlet Allocation (LDA) model for each cluster. We can then process a new query and find the transcripts whose LDA results are closest to those of the query.

Gabriel Turcu, Stella Heras, Javier Palanca, Vicente Julian, Marian Cristian Mihaescu

Active Image Data Augmentation

Deep neural network models have achieved state-of-the-art results in a great number of tasks in different domains (e.g., natural language processing and computer vision). However, the notions of robustness, causality, and fairness are not measured in traditional evaluation settings. In this work, we propose an active data augmentation method to improve model robustness to new data. We use Vanilla Backpropagation to visualize what the trained model considers important in the input information. Based on that information, we augment the training dataset with new data to refine the model training. The objective is to make the model robust and effective with respect to important input information. We evaluated our approach on a Spinal Cord Gray Matter Segmentation task and verified improvement in robustness while keeping the model competitive on the traditional metrics. Moreover, we achieve state-of-the-art results on that task using a U-Net based model.

Flávio Arthur Oliveira Santos, Cleber Zanchettin, Leonardo Nogueira Matos, Paulo Novais

A Novel Density-Based Clustering Approach for Outlier Detection in High-Dimensional Data

Outlier detection, also known as outlier mining, is a primary aspect of data-mining and machine learning applications. The importance of outlier detection in medical data comes from the fact that outliers may carry precious information; however, outlier detection can perform very poorly in the presence of high-dimensional data. In this paper, a new outlier detection technique based on a feature selection strategy is proposed to avoid the curse of dimensionality, named Infinite Feature Selection DBSCAN. The main purpose of our proposed method is to reduce the dimensions of a high-dimensional data set in order to efficiently identify outliers using clustering techniques. Simulations on real databases proved the effectiveness of our method in terms of accuracy, error rate, F-score and retrieval time.
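
A hedged sketch of the two-step scheme: reduce the dimensionality first, then let a density-based clusterer flag low-density points as outliers (DBSCAN labels them -1). PCA stands in here for the paper's Infinite Feature Selection step, and the parameter values are illustrative.

```python
from sklearn.cluster import DBSCAN
from sklearn.decomposition import PCA

def detect_outliers(X, n_components=10, eps=0.5, min_samples=5):
    X_red = PCA(n_components=n_components).fit_transform(X)
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_red)
    return labels == -1   # boolean mask of points flagged as outliers
```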

Thouraya Aouled Messaoud, Abir Smiti, Aymen Louati

Visual Analysis and Advanced Data Processing Techniques

Frontmatter

Convolutional CARMEN: Tomographic Reconstruction for Night Observation

To remove the distortion that the atmosphere causes in observations performed with extremely large telescopes, correction techniques are required. To tackle this problem, adaptive optics systems use wave front sensors to obtain measurements of the atmospheric turbulence and hence estimate a reconstruction of the atmosphere; applying this calculation to deformable mirrors compensates the aberrated wave front. In Multi Object Adaptive Optics (MOAO), several Shack-Hartmann wave front sensors, along with reference guide stars, are used to characterize the aberration produced by the atmosphere. Typically, this is a two-step process, where a centroiding algorithm is applied to the image provided by the sensor, and the centroids from different Shack-Hartmann wave front sensors are combined using a Least Squares algorithm or an Artificial Neural Network, such as the Multi-Layer Perceptron. In this article, a new solution based on Convolutional Neural Networks is proposed, which integrates both the centroiding and the tomographic reconstruction in the same algorithm, obtaining a substantial improvement over the traditional Least Squares algorithm and a performance similar to the Multi-Layer Perceptron, but without the need to previously compute the centroiding algorithm.

Francisco García Riesgo, Sergio Luis Suárez Gómez, Fernando Sánchez Lasheras, Carlos González Gutiérrez, Carmen Peñalver San Cristóbal, Francisco Javier de Cos Juez

A Proof of Concept in Multivariate Time Series Clustering Using Recurrent Neural Networks and SP-Lines

The Big Data and IoT explosion has made clustering multivariate Time Series (TS) one of the most effervescent research fields. From bio-informatics to business and management, multivariate TS are becoming more and more interesting, as they allow matching events that co-occur in time but are hardly noticeable. This study represents a step forward in our research. We first made use of Recurrent Neural Networks and transfer learning to analyze each example, measuring similarities between variables. All the results are finally aggregated to create an adjacency matrix that allows extracting the groups. In this second approach, splines are introduced to smooth the TS before modeling; this step also avoids learning from data with high variation or noise. In the experiments, the two solutions are compared using the same proof-of-concept experimentation.

Iago Vázquez, José R. Villar, Javier Sedano, Svetlana Simić, Enrique de la Cal

On the Influence of Admissible Orders in IVOVO

It is known that when dealing with interval-valued data, there exist problems associated with the non-existence of a total order. In this work we investigate a reformulation of an interval-valued decomposition strategy for multi-class problems called IVOVO, and we analyze the effectiveness of considering different admissible orders in the aggregation phase of IVOVO. We demonstrate that the choice of an appropriate admissible order allows the method to obtain significant differences in terms of accuracy.
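
For background, one of the best-known admissible orders from the interval literature is the Xu-Yager order, which totally orders intervals by comparing their midpoints first and their widths second; we give it here as a reference point, with the caveat that it is one example among the several admissible orders the paper analyzes.

```latex
% Xu-Yager admissible order on closed intervals [\underline{a},\overline{a}]:
[\underline{a},\overline{a}] \preceq [\underline{b},\overline{b}]
\iff
\underline{a}+\overline{a} < \underline{b}+\overline{b}
\ \text{ or }\
\bigl(\underline{a}+\overline{a} = \underline{b}+\overline{b}
\ \text{ and }\ \overline{a}-\underline{a} \le \overline{b}-\underline{b}\bigr)
```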

Mikel Uriz, Daniel Paternain, Humberto Bustince, Mikel Galar

Does the Order of Attributes Play an Important Role in Classification?

This paper proposes a methodology for feature sorting in the context of supervised machine learning algorithms. Feature sorting is defined as a procedure that orders the initial arrangement of the attributes according to a sorting algorithm, assigning an ordinal number to every feature depending on its importance; the initial features are then sorted following the ordinal numbers, from first to last, provided by the sorting method. Feature ranking has been chosen as the representative technique to fulfil the sorting purpose within the feature selection area. This contribution introduces a new methodology where all attributes are included in the data mining task, following different sortings produced by different feature ranking methods. The approach has been assessed on ten binary and multi-class problems with fewer than 37 features and between 106 and 28056 instances; the test-bed includes one challenging data set with 21 labels and 23 attributes where previous works were not able to achieve an accuracy of at least fifty percent. ReliefF is a strong candidate for re-sorting the initial feature space, and the C4.5 algorithm achieved promising global performance; additionally, PART, a rule-based classifier, and Support Vector Machines obtained acceptable results.

Antonio J. Tallón-Ballesteros, Simon Fong, Rocío Leal-Díaz

The Contract Random Interval Spectral Ensemble (c-RISE): The Effect of Contracting a Classifier on Accuracy

The Random Interval Spectral Ensemble (RISE) is a recently introduced tree based time series classification algorithm, in which each tree is built on a distinct set of Fourier, autocorrelation and partial autocorrelation features. It is a component in the meta ensemble HIVE-COTE [9]. RISE has run time complexity of O(nm^2), where m is the series length and n the number of train cases. This is prohibitively slow when considering long series, which are common in problems such as audio classification, where spectral approaches are likely to perform better than classifiers built in the time domain. We propose an enhancement of RISE that allows the user to specify how long the algorithm can have to run. The contract RISE (c-RISE) allows for check-pointing and adaptively estimates the time taken to build each tree in the ensemble through learning the constant terms in the run time complexity function. We show how the dynamic approach to contracting is more effective than the static approach of estimating the complexity before executing, and investigate the effect of contracting on accuracy for a range of large problems.

Michael Flynn, James Large, Tony Bagnall

Graph-Based Knowledge Inference for Style-Directed Architectural Design

This paper deals with style-oriented approach to computer aided architectural design. The generated 3D-models of architectural forms are composed of perceptual primitives determined on the basis of Biederman’s geons theory. The knowledge about the designed models is represented by the labelled graphs. In the provided CAD environment the designer selects and marks the model parts characterizing the considered style. Then the system infers subgraphs corresponding to these parts from graph representations of the models and encodes them into graph grammar rules. The additional graph grammar rules are constructed on the basis of creative design actions taken by the designer during the design process. The system supports the designer in his creative process by automatically generating graphs corresponding to new architectural forms in the desired style. The approach is illustrated by examples of designing objects in the Neoclassical style.

Agnieszka Mars, Ewa Grabska, Grażyna Ślusarczyk, Barbara Strug

A Convex Formulation of SVM-Based Multi-task Learning

Multi-task learning (MTL) is a powerful framework that allows taking advantage of the similarities between several machine learning tasks to improve on the solutions obtained by independent task-specific models. Support Vector Machines (SVMs) are well suited for this, and Cai et al. have proposed additive MTL SVMs, where the final model corresponds to the sum of a common model shared between all tasks and each task-specific model. In this work we propose a different formulation of this additive approach, in which the final model is a convex combination of common and task-specific ones. The convex mixing hyper-parameter λ takes values between 0 and 1, where a value of 1 is mathematically equivalent to a common model for all the tasks, whereas a value of 0 corresponds to independent task-specific models. We show that for λ values between 0 and 1, this convex approach is equivalent to the additive one of Cai et al. when the other SVM parameters are properly selected. On the other hand, the predictions of the proposed convex model are also convex combinations of the common and specific predictions, making this formulation easier to interpret. Finally, the convex formulation is easier to hyper-parametrize, since the hyper-parameter λ is constrained to the [0, 1] interval, in contrast with the unbounded range in additive MTL SVMs.
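
In symbols, the convex combination described in the abstract can be written as follows (the notation, with a common model \bar{f} and task-specific models f_t, is ours):

```latex
f_t^{\mathrm{conv}}(x) = \lambda\,\bar{f}(x) + (1-\lambda)\,f_t(x),
\qquad \lambda \in [0,1],
```

so that λ = 1 recovers a single common model for all tasks and λ = 0 recovers fully independent task-specific models.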

Carlos Ruiz, Carlos M. Alaíz, José R. Dorronsoro

Influence Maximization and Extremal Optimization

Influence Extremal Optimization (InfEO) is an algorithm based on Extremal Optimization, adapted for the influence maximization problem for the independent cascade model. InfEO maximizes the marginal contribution of a node to the influence set of the model. Numerical experiments are used to compare InfEO with other influence maximization methods, indicating the potential of this approach. Practical results are discussed on a network constructed from publication data in the field of computer science.

Tamás Képes, Noémi Gaskó, Rodica Ioana Lung, Mihai-Alexandru Suciu

Data Mining Applications

Frontmatter

Forecast Daily Air-Pollution Time Series with Deep Learning

Air quality in urban areas is one of the most critical concerns for governments. A wide spectrum of measures is implemented in relation to this issue, from laws and the promotion of renewed heating and transport systems to the establishment of monitoring and prediction systems. When air-pollutant levels exceed healthy thresholds, traffic limitations are activated, with non-negligible nuisances and social and economic impacts. For this reason, high-pollution episodes must be appropriately anticipated. In this work, deep learning-based implementations are evaluated for forecasting daily values of three pollutants, CO, NO2 and O3, at three types of monitoring station, using the air-quality time series provided by Madrid City Council. The analysis also evaluates the influence of working and non-working days and the use of multivariate input composed of multiple-pollutant time series. As a result, a ranking of the most suitable algorithms for forecasting air-quality time series is provided.

Miguel Cárdenas-Montes

Botnet Detection on TCP Traffic Using Supervised Machine Learning

The increase of botnet presence on the Internet has made it necessary to detect their activity in order to prevent them from attacking and spreading over the Internet. The main methods to detect botnets are traffic classifiers and sinkhole servers, which are special servers designed as traps for botnets. However, sinkholes also receive non-malicious automatic online traffic, and therefore they also need traffic classifiers. For these reasons, we have created two new datasets to evaluate classifiers: the TCP-Int dataset, built from publicly available TCP Internet traces of normal traffic and of three botnets, Kelihos, Miuref and Sality; and the TCP-Sink dataset, based on traffic from a private sinkhole server with traces of the Conficker botnet and of automatic normal traffic. We used the two datasets to test four well-known Machine Learning classifiers: Decision Tree, k-Nearest Neighbours, Support Vector Machine and Naïve Bayes. On the TCP-Int dataset, we used the F1 score to measure the capability to identify the type of traffic, i.e., whether the trace is normal or from one of the three considered botnets, while on TCP-Sink we used ROC curves and the corresponding AUC score, since it only presents two classes: non-malicious or botnet traffic. The best performance was achieved by Decision Tree, with a 0.99 F1 score and a 0.99 AUC score on the TCP-Int and the TCP-Sink datasets, respectively.

Javier Velasco-Mata, Eduardo Fidalgo, Víctor González-Castro, Enrique Alegre, Pablo Blanco-Medina

Classifying Pastebin Content Through the Generation of PasteCC Labeled Dataset

Online notepad services allow users to upload and share free text anonymously. Reviewing Pastebin, one of the most popular online notepad services, it is possible to find textual content that could be related to illegal activities, such as leaks of personal information or hyperlinks to multimedia files containing child sexual abuse images or videos. An automatic approach to monitor and detect these activities in such an active and dynamic environment could be useful for Law Enforcement Agencies to fight cybercrime. In this work, we present Pastes Content Classification 17K (PasteCC_17K), a dataset of 17640 textual samples crawled from Pastebin, classified into 15 categories, 6 of which are suspected of being related to illegal activities. We used PasteCC_17K to evaluate two well-known text representation techniques combined with three different supervised approaches to classify the pastes of the Pastebin website. We found that the best performance is achieved by combining TF-IDF encoding with Logistic Regression, obtaining an accuracy of 98.63%. The proposed model could assist the authorities in the detection of suspicious content shared on Pastebin.

Adrián Riesco, Eduardo Fidalgo, Mhd Wesam Al-Nabki, Francisco Jáñez-Martino, Enrique Alegre

Network Traffic Analysis for Android Malware Detection

The possibilities offered by the management of huge quantities of equipment and/or networks are attracting a growing number of malware developers. In this paper, we propose a working methodology for the detection of malicious traffic, based on the analysis of the flow of packets circulating on the network. This objective is achieved through the parameterization of the characteristics of these packets, which are later analyzed with supervised learning techniques focused on traffic labeling, so as to enable a proactive response to the large volume of information handled by current filters.

José Gaviria de la Puerta, Iker Pastor-López, Borja Sanz, Pablo G. Bringas

Inferring Knowledge from Clinical Data for Anesthesia Automation

The use of Hybrid Artificial Intelligence techniques in medicine has increased in recent years. Specifically, one of the main challenges in anesthesia is achieving new controllers capable of automating drug titration during surgery. This work deals with the development of a Takagi-Sugeno fuzzy controller to automate drug infusion for the control of hypnosis in patients undergoing anesthesia. To do that, a combination of Neural Networks and optimization techniques was applied to tune the internal parameters of the fuzzy controller. For the training process, data from 20 patients undergoing surgery were used. Finally, the proposed controller was tested on 16 virtual surgeries. It was concluded that the fuzzy controller was able to meet both clinical and control objectives.

Jose M. Gonzalez-Cava, Iván Castilla-Rodríguez, José Antonio Reboso, Ana León, María Martín, Esteban Jove-Pérez, José Luis Calvo-Rolle, Juan Albino Méndez-Pérez

Anomaly Detection Over an Ultrasonic Sensor in an Industrial Plant

The significant industrial developments in terms of digitalization and optimization have focused attention on anomaly detection techniques. This work presents a detailed study of the performance of different one-class intelligent techniques used for detecting anomalies in the operation of an ultrasonic sensor. The initial dataset is obtained from a level control plant, and different percentage variations in the sensor measurements are generated. For each variation, the performance of three one-class classifiers is assessed, obtaining very good results.

Esteban Jove, José-Luis Casteleiro-Roca, Jose Manuel González-Cava, Héctor Quintián, Héctor Alaiz-Moretón, Bruno Baruque, Juan Albino Méndez-Pérez, José Luis Calvo-Rolle

A Machine Learning Approach to Determine Abundance of Inclusions in Stainless Steel

The steel-making process is a complex procedure involving the presence of exogenous materials, which could potentially lead to non-metallic inclusions. Determining the abundance of inclusions at the earliest possible stage may help to reduce costs and avoid further post-processing manufacturing steps to alleviate undesired effects. This paper presents a data analysis and machine learning approach to analyze data related to austenitic stainless steel (Type 304L), in order to develop a decision-support tool that helps minimize the inclusion content present in the final product. Several machine learning models (generalized linear models with regularization, random forest, artificial neural networks and support vector machines) were tested in this analysis. Moreover, two different outcomes were analyzed (average and maximum abundance of inclusions per steel cast), and two different settings were considered based on the input features used to train the models (the full set of features and the more relevant ones). The results showed that the average abundance of inclusions can be predicted more accurately than the maximum abundance using linear models and the reduced set of features. A list of the features most relevant to the abundance of inclusions, based on the data and models used in this study, is additionally provided.

Héctor Mesa, Daniel Urda, Juan J. Ruiz-Aguilar, José A. Moscoso-López, Juan Almagro, Patricia Acosta, Ignacio J. Turias

Measuring Lower Limb Alignment and Joint Orientation Using Deep Learning Based Segmentation of Bones

Deformities of the lower limbs are a common clinical problem encountered in orthopedic practice. Several methods have been proposed for measuring lower limb alignment and joint orientation, either clinically or using computer-assisted methods. In this work we introduce a new approach for measuring lower limb alignment and joint orientation on the basis of bones segmented by deep neural networks. The bones are segmented on X-ray images using a U-Net convolutional neural network, trained on forty manually segmented images. Afterwards, the segmented bones are post-processed using fully connected CRFs. Finally, lines are fitted to pruned skeletons representing the bones. We discuss algorithms for measuring lower limb alignment and joint orientation, present both qualitative and quantitative segmentation results on ten test images, and compare the results obtained manually with a computer-assisted program against those of the proposed algorithm.

Kamil Kwolek, Adrian Brychcy, Bogdan Kwolek, Wojciech Marczyński

Constraint Programming Based Algorithm for Solving Large-Scale Vehicle Routing Problems

Smart city management has become an interesting topic in which recent decision-aiding algorithms are essential to solve and optimize the related problems. A popular transportation optimization problem is the Vehicle Routing Problem (VRP), which is so complex that it is categorized as NP-hard. VRPs are influential problems that are widely present in many real-world industrial applications; they have become an elemental part of the economy, and their enhancement results in a significant reduction in costs. The basic version of VRPs, the Capacitated VRP (CVRP), occupies a central position for historical and practical reasons, since many important real-world systems can be satisfactorily modeled as a CVRP. A Constraint Programming (CP) paradigm is used to model and solve the CVRP, applying interval and sequence variables together with a transition distance matrix to attain the objective. An empirical study over 52 classical CVRP instances, with 16 to 200 nodes, and 20 large-scale CVRP instances, with 106 to 459 nodes, shows the relative merits of our proposed approach. It also shows that the CP paradigm successfully tackles large-scale problems, with a percentage deviation varying from 2% to 10%, where several exact and heuristic algorithms fail and only a few meta-heuristics can solve instances with such a large number of customers.

Bochra Rabbouch, Foued Saâdaoui, Rafaa Mraihi

Application of Extractive Text Summarization Algorithms to Speech-to-Text Media

This paper presents how speech-to-text summarization can be performed using extractive text summarization algorithms. Our objective is to recommend which of the six text summarization algorithms evaluated in the study is the most suitable for the task of audio summarization. First, we selected six text summarization algorithms: Luhn, TextRank, LexRank, LSA, SumBasic, and KLSum. Then, we evaluated them on two datasets, DUC2001 and OWIDSum, with six ROUGE metrics. After that, we selected five speech documents from the ISCI Corpus dataset and transcribed them using Automatic Speech Recognition (ASR) from the Google Cloud Speech API. Finally, we applied the studied extractive summarization algorithms to these five text samples to obtain a text summary from the original audio file. Experimental results showed that Luhn and TextRank obtained the best performance for the task of extractive speech-to-text summarization on the samples evaluated.

Domínguez M. Victor, Fidalgo F. Eduardo, Rubel Biswas, Enrique Alegre, Laura Fernández-Robles

User Profiles Matching for Different Social Networks Based on Faces Identification

It is common practice nowadays to use multiple social networks for different social roles, even though these networks differ in content type, communication style and manner of speech. If we intend to understand human behaviour as a key feature for recommender systems, banking risk assessment or sociological research, this is better achieved using a combination of data from different social media. In this paper, we propose a new approach for matching user profiles across social media based on publicly available users' face photos, and conduct an experimental study of its efficiency. Our approach is robust to changes in content and style across social media.

Timur Sokhin, Nikolay Butakov, Denis Nasonov

Ro-Ro Freight Prediction Using a Hybrid Approach Based on Empirical Mode Decomposition, Permutation Entropy and Artificial Neural Networks

This study attempts to create an optimal forecasting model of daily Ro-Ro freight traffic at ports by using Empirical Mode Decomposition (EMD) and Permutation Entropy (PE) together with Artificial Neural Networks (ANNs) as the learner method. The EMD method decomposes the time series into several simpler subseries that are easier to predict. However, the number of subseries may be high; thus, the PE method identifies the complexity degree of the decomposed components in order to aggregate the least complex ones, significantly reducing the computational cost. Finally, an ANN model is applied to forecast the resulting subseries, and an ensemble of the predicted results provides the final prediction. The proposed hybrid EMD-PE-ANN method is more robust than the individual ANN model and can generate high-accuracy predictions. This methodology may be useful as an input to a Decision Support System (DSS) at ports, as well as providing relevant information for advance planning in the port community.
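
For reference, a compact permutation-entropy computation in the Bandt-Pompe style, the kind of complexity measure used to decide which decomposed components are simple enough to aggregate; this is our sketch rather than the authors' code, and the order and delay values are illustrative.

```python
import math
from itertools import permutations

import numpy as np

def permutation_entropy(series, order=3, delay=1):
    # Count the ordinal patterns of consecutive (delayed) windows.
    patterns = {p: 0 for p in permutations(range(order))}
    n = len(series) - (order - 1) * delay
    for i in range(n):
        window = series[i:i + order * delay:delay]
        patterns[tuple(np.argsort(window))] += 1
    probs = np.array([c for c in patterns.values() if c > 0]) / n
    # Normalised to [0, 1] by the maximum entropy log(order!).
    return -np.sum(probs * np.log(probs)) / math.log(math.factorial(order))

print(permutation_entropy([4, 7, 9, 10, 6, 11, 3]))  # close to 1 (complex)
```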

Jose Antonio Moscoso-Lopez, Juan Jesus Ruiz-Aguilar, Javier Gonzalez-Enrique, Daniel Urda, Hector Mesa, Ignacio J. Turias

Hybrid Intelligent Applications

Frontmatter

Modeling a Mobile Group Recommender System for Tourism with Intelligent Agents and Gamification

Providing recommendations to groups of people is a complex task, especially due to the group's heterogeneity and conflicting preferences and personalities. This heterogeneity is even deeper in occasional groups formed for predefined tour packages in tourism. Group Recommender Systems (GRS) are being designed to help in situations like these. However, many limitations can still be found, either in their time-consuming configuration and excessive intrusiveness when building the tourists' profile, or in their lack of concern for the tourists' interests during the planning and the tours, such as feeling greater liberty, diminishing the sense of fear or of being lost, increasing the sense of companionship, and promoting social interaction without losing a personalized experience. In this paper, we propose a conceptual model that intends to enhance GRS for tourism by using gamification techniques, intelligent agents modeled with the tourists' context and profile (including psychological and socio-cultural aspects), and dialogue games between the agents for the post-recommendation process. Some important aspects of a GRS for tourism are also discussed, opening the way for the proposed conceptual model, which we believe will help to solve the identified limitations.

Patrícia Alves, João Carneiro, Goreti Marreiros, Paulo Novais

Orthogonal Properties of Asymmetric Neural Networks with Gabor Filters

Research on neural networks has developed alongside recent machine learning. To improve the performance of neural networks, biologically inspired neural networks are often studied. Models for motion processing in biological systems have been used, which consist of symmetric networks with quadrature pairs of Gabor filters. This paper proposes a model of bio-inspired asymmetric neural networks, which shows excellent ability in movement detection. Its prominent features are nonlinear characteristics, namely squaring and rectification functions, which are observed in retinal and visual cortex networks. In this paper, the proposed asymmetric network with Gabor filters and the conventional energy model are analyzed in terms of their orthogonality characteristics. It is shown that the biological asymmetric network is effective for generating orthogonal functions using network correlation computations. Further, asymmetric networks with nonlinear characteristics are able to generate independent subspaces, which will be useful for creating feature spaces and for efficient computations in learning.

Naohiro Ishii, Toshinori Deguchi, Masashi Kawaguchi, Hiroshi Sasaki, Tokuro Matsuo
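
A small numeric illustration of the orthogonality property underlying such models (not the paper's network): a quadrature pair of Gabor filters, an even cosine and an odd sine component under the same Gaussian envelope, has a vanishing inner product. The parameters below are arbitrary.

```python
# The dot product of an even and an odd function sampled over a
# symmetric interval is ~0, so the quadrature Gabor pair is orthogonal.
import numpy as np

x = np.linspace(-4, 4, 1024)
sigma, f = 1.0, 1.5
envelope = np.exp(-x**2 / (2 * sigma**2))
g_even = envelope * np.cos(2 * np.pi * f * x)   # symmetric Gabor filter
g_odd = envelope * np.sin(2 * np.pi * f * x)    # antisymmetric Gabor filter

print(np.dot(g_even, g_odd))   # ~0 up to numerical error
```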

Deep CNN-Based Recognition of JSL Finger Spelling

In this paper, we present a framework for the recognition of static finger spelling in Japanese Sign Language (JSL) on RGB images. The finger-spelled signs were recognized by an ensemble consisting of a ResNet-based convolutional neural network and two ResNet quaternion convolutional neural networks. A 3D articulated hand model has been used to generate synthetic finger spellings and to extend a dataset consisting of real hand gestures. Twelve different gesture realizations were prepared for each of 41 signs. Ten images have been rendered for each realization through interpolation between the starting and end poses. Experimental results demonstrate that, owing to a sufficient amount of training data, a high recognition rate can be attained on images from a single RGB camera. Results achieved by the ResNet quaternion convolutional neural network are better than those obtained by the ResNet CNN. The best recognition results were achieved by the ensemble. The JSL-rend dataset is available for download.

Nam Tu Nguen, Shinji Sako, Bogdan Kwolek

Algorithm for Constructing a Classifier Team Using a Modified PCA (Principal Component Analysis) in the Task of Diagnosis of Acute Lymphocytic Leukaemia Type B-CLL

Data recognition and classification systems are becoming more and more developed, and newer algorithms keep appearing that solve more difficult and complex decision problems. Very good results are obtained using sets of classifiers. In their research, the authors focused on a certain data characteristic: the recognition of classes of objects whose features can be grouped. Clusters created in this manner can contribute to better recognition of certain decision classes. One such example is diagnosis and prognosis in the case of acute lymphocytic leukaemia of the B-CLL type. In this paper, the authors present a modified PCA-based feature selection method. The modification concerns the rotation of objects in relation to decision classes. In addition to grouping similar features using Varimax rotation, a procedure for grouping patients within these PCA groups was developed. Within each PCA group, two classifiers were built: a strong one and a weak one. In the experimental part, the developed method was compared with one-stage recognition algorithms known from the literature. The obtained results make a significant contribution to medical diagnostics. They allow the development of a procedure for the treatment of B-CLL lymphocytic leukaemia. Making an appropriate diagnosis increases a patient's survival chances by enabling appropriate treatment.

Mariusz Topolski, Katarzyna Topolska
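
As a rough sketch of the feature-grouping step, the code below uses scikit-learn's FactorAnalysis with varimax rotation (available from version 0.24) as a stand-in for the paper's modified PCA; the data and component count are placeholders.

```python
# Group features by the varimax-rotated component on which they load
# most heavily; strong/weak classifiers would then be trained per group.
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.default_rng(0).normal(size=(200, 12))  # placeholder data

fa = FactorAnalysis(n_components=3, rotation="varimax").fit(X)
groups = np.abs(fa.components_).argmax(axis=0)  # one group index per feature
print(groups)
```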

Road Lane Landmark Extraction: A State-of-the-art Review

In this paper we present a state-of-the-art review of road lane landmark extraction. Automatic lane landmark extraction has been studied over the last decade for different practical applications. The purpose of this paper is to gather and discuss methodologies for road lane landmark extraction based on signals from different sensors, in order to automate the extraction of horizontal road surface lane signs and obtain an accurate map of road lane landmarks. Specific algorithms for each kind of sensor are analyzed, describing their basic ideas and discussing their pros and cons.

Asier Izquierdo, Jose Manuel Lopez-Guede, Manuel Graña

CAPAS: A Context-Aware System Architecture for Physical Activities Monitoring

Attribute grammars are widely used by compiler generators, since they allow complete specification of static semantics. They can also be applied to other fields of research, for instance human activity recognition. This paper presents CAPAS, a Context-aware system Architecture to monitor Physical ActivitieS. One of the components present in the architecture is an attribute grammar, which is filled in after a prediction is made from the data gathered from the user through sensors. The physical activity is then validated by analysing the attribute grammar against some predefined rules, checking whether it meets those requirements. The paper also proposes an attribute grammar itself, which can be incorporated into a system in order to validate the performed physical activity.

Paulo Ferreira, Leandro O. Freitas, Pedro Rangel Henriques, Paulo Novais, Juan Pavón

Anomaly Detection Using Gaussian Mixture Probability Model to Implement Intrusion Detection System

Network intrusion detection systems (NIDS) detect attacks or anomalous network traffic patterns in order to avoid cybersecurity issues. Anomaly detection algorithms are used to identify unusual behavior or outliers in the network traffic in order to generate alarms. Traditionally, Gaussian Mixture Models (GMMs) have been used for probabilistic anomaly-detection NIDS. We propose to use multiple simple GMMs to model each individual feature, and an asymmetric voting scheme that aggregates the individual anomaly detectors to provide a final decision. We test our approach using the NSL dataset. We construct the normal behavior models using only the samples labelled as normal in this dataset and evaluate our proposal using the official NSL testing set. As a result, we obtain an F1-score over 0.9, outperforming other supervised and unsupervised proposals.

Roberto Blanco, Pedro Malagón, Samira Briongos, José M. Moya
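
A minimal sketch of the per-feature GMM scheme with a voting aggregation, assuming scikit-learn; the component count, the 1st-percentile thresholds and the minimum vote count are illustrative choices, not the paper's values.

```python
# One GMM per feature, fitted on normal traffic only; a sample is
# flagged when enough per-feature detectors report a low likelihood.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_detectors(X_normal, n_components=2):
    """Returns (feature index, model, log-likelihood threshold) triples."""
    detectors = []
    for j in range(X_normal.shape[1]):
        col = X_normal[:, [j]]
        gm = GaussianMixture(n_components=n_components, random_state=0).fit(col)
        threshold = np.percentile(gm.score_samples(col), 1)  # illustrative
        detectors.append((j, gm, threshold))
    return detectors

def is_anomalous(x, detectors, min_votes=2):
    """Asymmetric vote: a few low-likelihood features suffice to alarm."""
    votes = sum(gm.score_samples(x[[j]].reshape(1, -1))[0] < t
                for j, gm, t in detectors)
    return votes >= min_votes
```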

Combining Random Subspace Approach with smote Oversampling for Imbalanced Data Classification

The following work utilizes a hybrid approach combining the Random Subspace method and SMOTE oversampling to solve the problem of imbalanced data classification. The paper proposes an ensemble diversified using the Random Subspace approach, trained on a set oversampled in the context of each reduced subset of features. The algorithm was evaluated through computer experiments carried out on benchmark datasets with three different base classifiers.

Pawel Ksieniewicz
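
The combination just described can be sketched as below, assuming scikit-learn and imbalanced-learn: each ensemble member draws its own random feature subspace and is trained on data SMOTE-oversampled inside that subspace. The ensemble size, subspace size and decision-tree base learner are illustrative choices.

```python
# Random Subspace ensemble with per-subspace SMOTE oversampling.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.tree import DecisionTreeClassifier

def fit_rs_smote(X, y, n_members=10, seed=0):
    rng = np.random.default_rng(seed)
    members = []
    for _ in range(n_members):
        feats = rng.choice(X.shape[1], size=max(2, X.shape[1] // 2),
                           replace=False)
        # Oversample the minority class within the reduced subspace
        Xs, ys = SMOTE(random_state=int(rng.integers(10**6))).fit_resample(
            X[:, feats], y)
        members.append((feats, DecisionTreeClassifier().fit(Xs, ys)))
    return members

def predict(x, members):
    votes = [clf.predict(x[feats].reshape(1, -1))[0] for feats, clf in members]
    return max(set(votes), key=votes.count)   # majority vote
```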

Optimization of the Master Production Scheduling in a Textile Industry Using Genetic Algorithm

In a competitive environment, an industry's success is directly related to the level of optimization of its processes, and to how production is planned and developed. In this area, master production scheduling (MPS) is the key action for success. The object of study arises from the need to optimize the medium-term production planning system in a textile company through genetic algorithms. This research begins with the analysis of the constraints, mainly determined by the installed capacity and the number of workers. Aggregate production planning is carried out for the T-shirt families. Given such complexity, bioinspired optimization techniques demonstrate superior performance compared with the exact, simple methods industries normally employ, which provide an empirical MPS but can compromise efficiency and costs. The products are then disaggregated for each of the items for which the MPS is determined, based on the analysis of the demand forecast and the orders made by customers. From this, with the use of genetic algorithms, the MPS is optimized to carry out production planning, with an improvement of up to 96% in the level of service provided.

Leandro L. Lorente-Leyva, Jefferson R. Murillo-Valle, Yakcleem Montero-Santos, Israel D. Herrera-Granda, Erick P. Herrera-Granda, Paul D. Rosero-Montalvo, Diego H. Peluffo-Ordóñez, Xiomara P. Blanco-Valencia
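
A toy sketch of a GA for this kind of planning problem: a chromosome is a vector of production quantities per item, and fitness penalises unmet demand and capacity overload. The demand figures, capacity, penalty weights and GA settings below are invented for illustration, not from the paper.

```python
# Minimal genetic algorithm for a capacity-constrained production plan.
import numpy as np

rng = np.random.default_rng(1)
demand = np.array([120, 90, 150, 110])   # hypothetical demand per item
capacity = 400                            # hypothetical plant capacity

def fitness(plan):
    shortage = np.maximum(demand - plan, 0).sum()   # unmet demand
    overload = max(plan.sum() - capacity, 0)        # capacity violation
    return -(10 * shortage + 5 * overload + plan.sum())

pop = rng.integers(0, 200, size=(50, demand.size))
for _ in range(200):
    scores = np.array([fitness(p) for p in pop])
    parents = pop[np.argsort(scores)[-25:]]          # truncation selection
    cuts = rng.integers(1, demand.size, size=25)
    children = np.array([np.concatenate((parents[i][:c],
                                         parents[(i + 1) % 25][c:]))
                         for i, c in enumerate(cuts)])  # one-point crossover
    mask = rng.random(children.shape) < 0.1             # random mutation
    children[mask] = rng.integers(0, 200, mask.sum())
    pop = np.vstack((parents, children))

best = max(pop, key=fitness)
print(best, fitness(best))
```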

Urban Pollution Environmental Monitoring System Using IoT Devices and Data Visualization: A Case Study

This work presents a new approach to the Internet of Things (IoT) connecting sensor nodes with a data analysis and visualization platform, with the purpose of acquiring urban pollution data. The main objective is to determine the degree of contamination in the city of Ibarra in real time. To do this, on the one hand, thirteen IoT devices have been implemented. On the other hand, prototype selection and data balancing algorithms are compared in relation to a k-Nearest Neighbours classifier. With this, the system has an adequate training set to achieve the highest classification performance. As a final result, the system presents a visualization platform that estimates the pollution condition with more than 90% accuracy.

Paul D. Rosero-Montalvo, Vivian F. López-Batista, Diego H. Peluffo-Ordóñez, Leandro L. Lorente-Leyva, X. P. Blanco-Valencia
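
As one example of the training-set reduction pattern compared in the paper, the sketch below pairs condensed nearest neighbour (a classical prototype-selection/balancing method from imbalanced-learn) with a k-NN classifier; the paper compares several such algorithms, and this shows only the general shape.

```python
# Reduce the training set to prototypes, then fit k-NN on the result.
from imblearn.under_sampling import CondensedNearestNeighbour
from sklearn.neighbors import KNeighborsClassifier

def train_reduced_knn(X, y, k=3):
    X_red, y_red = CondensedNearestNeighbour(random_state=0).fit_resample(X, y)
    return KNeighborsClassifier(n_neighbors=k).fit(X_red, y_red)
```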

Crowd-Powered Systems to Diminish the Effects of Semantic Drift

The Internet and the social Web have made it possible to acquire information to feed a growing number of Machine Learning (ML) applications and, in addition, have brought to light the use of crowdsourcing approaches, commonly applied to problems that are easy for humans but difficult for computers to solve, building crowd-powered systems. In this work, we consider the issue of semantic drift in a bootstrap learning algorithm and propose the novel idea of a crowd-powered approach to diminish the effects of this issue. To put this idea to the test we built a hybrid version of the Coupled Pattern Learner (CPL), a bootstrap learning algorithm that extracts contextual patterns from unstructured text, and SSCrowd, a component that allows conversation between learning systems and Web users, in an attempt to actively and autonomously seek human supervision by asking people to take part in the knowledge acquisition process, thus using the intelligence of the crowd to improve the learning capabilities of CPL. We take advantage of the ease with which humans understand language in unstructured text, and we show the results of using a hybrid crowd-powered approach to diminish the effects of semantic drift.

Saulo D. S. Pedro, Estevam R. Hruschka

Prediction of Student Performance Through an Intelligent Hybrid Model

The present work addresses the problem of low academic performance among engineering degree students. Models capable of predicting academic performance are generated through the application of several intelligent regression techniques to a dataset containing the official academic records of students of the engineering degree at the University of A Coruña. The global model, specifically the hybrid model based on K-means clustering, can predict the grade of a subject based on previous courses. In addition, an LDA (Linear Discriminant Analysis) has been implemented in order to identify the important features and visualize the classification clearly. Thus, the developed model makes it possible to estimate the academic performance of each student as well as the most important variables associated with it.

Héctor Alaiz-Moretón, José Antonio López Vázquez, Héctor Quintián, José-Luis Casteleiro-Roca, Esteban Jove, José Luis Calvo-Rolle
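
A minimal sketch of one natural reading of such a K-means-based hybrid regressor, assuming scikit-learn: cluster the students, fit one regressor per cluster, and route each new sample through its nearest cluster. The cluster count and linear base model are illustrative; the paper evaluates several regression techniques.

```python
# Cluster-then-regress hybrid: one local model per K-means cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

def fit_hybrid(X, y, n_clusters=3):
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    models = {c: LinearRegression().fit(X[km.labels_ == c],
                                        y[km.labels_ == c])
              for c in range(n_clusters)}
    return km, models

def predict(km, models, x):
    c = km.predict(x.reshape(1, -1))[0]   # nearest cluster
    return models[c].predict(x.reshape(1, -1))[0]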

A Hybrid Automatic Classification Model for Skin Tumour Images

In medical practice, early and accurate detection of all types of skin tumours is essential to guide appropriate management and improve patients' survival. The most important task is to differentiate between malignant skin tumours and benign lesions. The aim of this research is the classification of skin tumours by analyzing medical dermoscopy images. This paper focuses on a new strategy based on a hybrid model which combines mathematical and artificial intelligence techniques to define a strategy for the automatic classification of skin tumour images. The proposed hybrid system is tested on the well-known HAM10000 data set, and the experimental results are compared with similar research.

Svetlana Simić, Svetislav D. Simić, Zorana Banković, Milana Ivkov-Simić, José R. Villar, Dragan Simić

Texture Descriptors for Automatic Estimation of Workpiece Quality in Milling

Milled workpieces present a regular pattern when they are correctly machined. However, if problems occur, the pattern is not so homogeneous and, consequently, quality is reduced. This paper proposes a method based on texture descriptors in order to detect workpiece wear in milling automatically. Images are captured using a boroscope connected to a camera, and the whole inner surface of the workpiece is analysed. Texture features are then computed from the co-occurrence matrix of each image. Next, the feature vectors are classified by four different approaches: Decision Trees, k-Nearest Neighbours, Naïve Bayes and a Multilayer Perceptron. Linear discriminant analysis reduces the number of features from 6 to 2 without losing accuracy. A hit rate of 91.8% is achieved with Decision Trees, which fulfils the industrial requirements.

Manuel Castejón-Limas, Lidia Sánchez-González, Javier Díez-González, Laura Fernández-Robles, Virginia Riego, Hilde Pérez
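
The feature-extraction step can be sketched as below with scikit-image (version 0.19 or later, where the functions are named graycomatrix/graycoprops); the distances and angles are illustrative choices, not the paper's exact configuration.

```python
# Six co-occurrence-matrix (GLCM) descriptors per grayscale image patch.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def texture_features(img):
    """img: 2-D uint8 grayscale patch of the workpiece surface."""
    glcm = graycomatrix(img, distances=[1], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "dissimilarity", "homogeneity",
             "energy", "correlation", "ASM")
    return np.array([graycoprops(glcm, p).mean() for p in props])
# The resulting 6-D vectors are projected to 2-D with LDA and then
# classified, e.g. with a decision tree.
```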

Surface Defect Modelling Using Co-occurrence Matrix and Fast Fourier Transformation

Several industries supply key elements to other industries, where they are critical. Hence, foundry castings are subject to very strict safety controls to assure the quality of the manufactured castings. In recent years, computer vision technologies have been used to control surface quality. In particular, we have focused our work on inclusions, cold laps and misruns. We propose a new methodology that detects and categorises imperfections on the surface. To this end, we compared several features extracted from the images to highlight the regions of the casting that may be affected and then applied several machine-learning techniques to classify those regions. Although deep learning techniques perform very well on such problems, they need a huge dataset to achieve those results. In this case, due to the size of the dataset (which is a real constraint in a real environment), we have used traditional machine learning techniques. Our experiments show that this method obtains high precision rates in general; our best results are 96.64% accuracy and 0.9763 area under the ROC curve.

Iker Pastor-López, Borja Sanz, José Gaviria de la Puerta, Pablo G. Bringas
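
As an illustration of the FFT side of such a feature set, the sketch below computes one simple spectral descriptor of a surface patch: the share of spectral energy away from the DC component, which tends to rise for irregular (defective) textures. The radius threshold is an illustrative choice.

```python
# High-frequency energy ratio of the 2-D power spectrum of a patch.
import numpy as np

def fft_high_freq_ratio(patch, radius=8):
    spec = np.abs(np.fft.fftshift(np.fft.fft2(patch))) ** 2
    h, w = spec.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h // 2, xx - w // 2)   # distance from DC bin
    return spec[dist > radius].sum() / spec.sum()
```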

Reinforcement Learning Experiments Running Efficiently over Widely Heterogeneous Computer Farms

Researchers working with Reinforcement Learning typically face issues that severely hinder the efficiency of their research workflow. These issues include high computational requirements, numerous hyper-parameters that must be set manually, and the high probability of failing many times before succeeding. In this paper, we present some of the challenges our research has faced and the way we have successfully tackled them in an innovative software platform. We provide some benchmarking results that show the improvements introduced by the new platform.

Borja Fernandez-Gauna, Xabier Larrucea, Manuel Graña

Backmatter
