Bioinformatics and Health Informatics

Frontmatter

Fully Automatic Classification of Flow Cytometry Data

Flow cytometry is a powerful analytical method that measures several properties individually for up to hundreds of thousands of particles contained in a sample. The joint distribution of these properties is a highly informative descriptor, yet it cannot be used directly by standard machine learning methods. Hence, such data is traditionally pre-processed into numerical features, which is often a manual or semi-automatic process. This paper introduces flowForest, an ensemble classifier capable of directly processing flow cytometry data, modelled after the popular Random Forest method. We demonstrate that it can achieve high classification performance in a fully automatic way.

Bartosz Paweł Piotrowski, Miron Bartosz Kursa

Positive Unlabeled Link Prediction via Transfer Learning for Gene Network Reconstruction

Transfer learning can be employed to leverage knowledge from a source domain in order to better solve tasks in a target domain where the available data is scarce. While most previous papers work in the supervised setting, we study the more challenging case of positive-unlabeled transfer learning, where few positive labeled instances are available for both the source and the target domains. Specifically, we focus on the link prediction task on network data, where we consider known existing links as positive labeled data and all the remaining possible links as unlabeled data. In many real applications (e.g., in bioinformatics), this usually leads to few positive labeled instances and a huge amount of unlabeled data. The transfer learning method proposed in this paper exploits the unlabeled data and the knowledge of a source network in order to improve the reconstruction of a target network. Experiments conducted in the biological field showed the effectiveness of the proposed approach with respect to the considered baselines when exploiting the Mus Musculus gene network (source) to improve the reconstruction of the Homo Sapiens Sapiens gene network (target).

Paolo Mignone, Gianvito Pio

Early Detection of Heart Symptoms with Convolutional Neural Network and Scattering Wavelet Transformation

The paper uses a Convolutional Neural Network (CNN) for preliminary screening of cardiac pathologies by classifying heartbeat signals recorded by digital stethoscopes and mobile devices. The Scattering Wavelet Transformation (SWT) was used to represent the heartbeat. The experiments revealed the optimum concatenation size of SWT windows, which yields state-of-the-art results on the majority of metrics from the PASCAL Classifying Heart Sounds Challenge.

Mariusz Kleć

Rough Sets: Visually Discerning Neurological Functionality During Thought Processes

The central aim of this paper is to test and illustrate the viability of using Rough Set Theory to visualize neurological events that occur when a human is thinking very intensely to solve a problem or, conversely, solving a trivial problem with little to no effort. Since humans solve complex problems by leveraging synapses from a distributed neural network in the frontal and parietal lobes, which is a difficult portion of the brain to research, it has been a challenge for the neuroscience community to functionally measure how intensely a subject is thinking while trying to solve a problem. Herein, we present our research on optimizing machine intelligence to visually illustrate when members of our cohort experienced misunderstandings and challenges while they read and comprehended short code snippets. This research is a continuation of the authors’ efforts to use Rough Sets and artificial intelligence to deliver a system that will eventually visually illustrate deception.

Rory Lewis, Chad A. Mello, Yanyan Zhuang, Martin K.-C. Yeh, Yu Yan, Dan Gopstein

Graph Mining

Solving the Maximal Clique Problem on Compressed Graphs

The Maximal Clique Enumeration problem (MCE) is a graph problem encountered in many applications such as social network analysis and computational biology. However, this problem is difficult and requires exponential time. Consequently, appropriate solutions must be proposed for massive graph databases. In this paper, we investigate and evaluate an approach that deals with this problem on a compressed version of the graphs. This approach is interesting because compression is a staple of massive data processing. We mainly show, through extensive experiments, that besides reducing the size of the graphs, this approach enhances the efficiency of existing algorithms.

Jocelyn Bernard, Hamida Seba
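
As a point of reference for the enumeration task named in the abstract above, the following is a minimal sketch of the classical Bron-Kerbosch algorithm with pivoting on an ordinary (uncompressed) graph. It is not the compressed-graph approach evaluated in the paper; the function name `bron_kerbosch` and the adjacency-dictionary input format are assumptions made for illustration.

```python
def bron_kerbosch(adj):
    """Enumerate all maximal cliques of an undirected graph.

    `adj` maps each vertex to the set of its neighbours (hypothetical
    input format chosen for this sketch). Returns a list of frozensets.
    """
    cliques = []

    def expand(r, p, x):
        # r: current clique, p: candidate vertices, x: already processed
        if not p and not x:
            cliques.append(frozenset(r))
            return
        # Pivot on the vertex with the most neighbours among candidates,
        # so that fewer branches need to be explored.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return cliques


# A triangle 0-1-2 with a pendant edge 2-3.
example = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
maximal = bron_kerbosch(example)
```

The number of maximal cliques can grow exponentially with the number of vertices, which is why any exact enumeration has exponential worst-case running time.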

Clones in Graphs

Finding structural similarities in graph data, like social networks, is a far-ranging task in data mining and knowledge discovery. A (conceptually) simple reduction would be to compute the automorphism group of a graph. However, this approach is ineffective in data mining since real-world data does not exhibit enough structural regularity. Here we step in with a novel approach based on mappings that preserve the maximal cliques. For this we exploit the well-known correspondence between bipartite graphs and the data structure formal context (G, M, I) from Formal Concept Analysis. From there we utilize the notion of clone items. The investigation of these is still an open problem, to which we add new insights with this work. Furthermore, we present a substantial experimental investigation of real-world data. We conclude by demonstrating the generalization of clone items to permutations.

Stephan Doerfel, Tom Hanika, Gerd Stumme

Knowledge-Based Mining of Exceptional Patterns in Logistics Data: Approaches and Experiences in an Industry 4.0 Context

In the context of Industry 4.0 and smart production, large-scale industrial enterprise data is used to enable data-driven analysis and modeling methods. However, the majority of currently applied approaches consider the data in an isolated fashion, such that data from different sources, e.g., from large data warehouses, are only considered independently. Furthermore, connections and relations between those data, i.e., those relating to semantic dependencies, are typically not considered, although these would open up integrated semantic approaches for effective data mining methods. This paper tackles these issues and demonstrates approaches and experiences in the context of a real-world case study in the industrial logistics domain: we propose knowledge-based data analysis that applies subgroup discovery to identify exceptional patterns in a semantic approach using appropriately constructed knowledge graphs.

Eric Sternberg, Martin Atzmueller

An Intra-algorithm Comparison Study of Complete Search FSM Implementations in Centralized Graph Transaction Databases

Frequent subgraph mining (FSM) algorithms are widely used in various areas of data analysis. Several experimental studies of FSM algorithms have been reported in the literature; however, these experiments do not clarify which implementation of a specific algorithm is the most efficient for a given context of use (e.g., medium-size datasets). In this paper, we present an experimental study of available implementations of two well-known complete search FSM algorithms, namely gSpan and Gaston. The main purpose of this experimental study is to find a suitable Frequent Subgraph Mining implementation for indexing centralized graph databases for aggregated search (CAIR home page: www.irit.fr/CAIR). We provide details of the experimental results according to the input variation cases. We propose (for end users) a summary of the most efficient FSM implementations for each algorithm (i.e., gSpan and Gaston), based on real datasets from the literature.

Rihab Ayed, Mohand-Saïd Hacid, Rafiqul Haque, Abderrazek Jemai

Critical Link Identification Based on Bridge Detection for Network with Uncertain Connectivity

Efficiently identifying critical links, i.e., links that substantially degrade network performance if they fail to function, is challenging for a large complex network. In this paper, we tackle this problem under a more realistic situation where each link is probabilistically disconnected, as when a road is blocked in a natural disaster, rather than assuming that no road is ever blocked in a disaster. To solve this problem, we utilize the bridge detection technique from graph theory and efficiently identify critical links when node reachability is taken as the performance measure, which corresponds to the number of people who can reach at least one evacuation facility in a disaster. Using two real-world road networks, we empirically show that the proposed method is much more efficient than other methods based on traditional centrality measures and that the links our method detects are substantially more critical than those detected by the others.

Kazumi Saito, Kouzou Ohara, Masahiro Kimura, Hiroshi Motoda
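
The bridge detection primitive mentioned in the abstract above can be illustrated with the standard depth-first, low-link procedure for a deterministic undirected graph. This is only a sketch of that building block, not the authors' method for probabilistically disconnected links; the function name `find_bridges` and the edge-list input are assumptions made for illustration.

```python
def find_bridges(n, edges):
    """Return the bridges of an undirected graph with vertices 0..n-1.

    A bridge is an edge whose removal disconnects its two endpoints.
    `edges` is a list of (u, v) pairs (hypothetical input format).
    Recursive DFS: suitable for small graphs only.
    """
    adj = [[] for _ in range(n)]
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))

    disc = [-1] * n   # discovery time of each vertex, -1 if unvisited
    low = [0] * n     # earliest discovery time reachable from the subtree
    bridges = []
    timer = 0

    def dfs(u, parent_edge):
        nonlocal timer
        disc[u] = low[u] = timer
        timer += 1
        for v, idx in adj[u]:
            if idx == parent_edge:
                continue  # do not reuse the edge we arrived by
            if disc[v] == -1:
                dfs(v, idx)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:
                    # No back edge from v's subtree reaches u or above.
                    bridges.append(edges[idx])
            else:
                low[u] = min(low[u], disc[v])

    for start in range(n):
        if disc[start] == -1:
            dfs(start, -1)
    return bridges


# A cycle 0-1-2 with a tail 2-3-4: only the tail edges are bridges.
road_links = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4)]
critical = find_bridges(5, road_links)
```

In a road-network reading, every vertex beyond a bridge loses reachability when that link fails, which is why bridges are natural candidates for critical links.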

Image Analysis

Deep Neural Networks for Face Recognition: Pairwise Optimisation

Factors such as lighting conditions, head rotations and view angles affect the reliability of face recognition and make the recognition task difficult. Recognition of multiple subjects requires learning class boundaries whose complexity grows quickly with the number of subjects. Artificial Neural Networks (ANNs) have provided efficient solutions, although their performance needs to be improved. Multiclass and convolutional ANNs require massive computations and the tuning of ad-hoc parameters in order to maximise performance. The pairwise ANN structure has outperformed multiclass ANNs on some face recognition tasks. We propose pairwise optimisation for ANNs, which requires a significantly smaller number of ad-hoc parameters and substantially fewer computations than multiclass and convolutional networks.

Elitsa Popova, Athanasios Athanasopoulos, Efraim Ie, Nikolaos Christou, Ndifreke Nyah

Mobile Application with Image Recognition for Persons with Aphasia

A person with aphasia, caused by damage to the brain, loses (partially or completely) the ability to use speech or writing. Rehabilitation is based on mental or motor exercises that stimulate the brain areas responsible for communication. The aim of our work was to implement an application (app) for smartphones that could be used for this rehabilitation. We used Google’s Cloud Vision for photo analysis. The initial prototype of the app was modified according to comments from a specialist. The usability tests with the target group of users prove the effectiveness of the app and suggest directions for further development.

Jan Gonera, Krzysztof Szklanny, Marcin Wichrowski, Alicja Wieczorkowska

A Comparative Study on Soft Biometric Approaches to Be Used in Retail Stores

Soft biometric analysis aims at recognizing personal traits that provide some information about the individual. In this paper, we implemented and compared several approaches for analyzing human soft biometric traits: age, gender, and the presence of eyeglasses and a beard. Convolutional Neural Networks can be successfully used to recognize soft biometric traits of passers-by looking at public displays and at shop windows.

Berardina De Carolis, Nicola Macchiarulo, Giuseppe Palestra

Low Cost Intelligent System for the 2D Biomechanical Analysis of Road Cyclists

This paper introduces an intelligent system focused on the biomechanical analysis of road cyclists. This type of analysis is carried out in specialized medical centers that operate using costly resources and are employed mainly for studies on high-performance athletes. The proposed system contrasts with these centers in that it provides the rookie cyclist with an accessible and affordable biomechanical analysis, although not as accurate. The architecture of the system rests on advances in motion capture and augmented reality libraries. The paper discusses the motivations of the research, the internal design of the proposed system and the differences from various other systems.

Camilo Salguero, Sandra P. Mosquera, Andrés F. Barco, Élise Vareilles

Intelligent Systems

An Approach for the Police Districting Problem Using Artificial Intelligence

Police patrols are usually assigned to a restricted zone where they have to serve and uphold the law. This involves not only routine tasks, such as issuing traffic tickets, but also other important tasks, like assisting in accidents or riot control, that need to be covered. An efficient placement of traffic Police patrols and a schedule assignment across the streets of a city or a road network ensure that the traffic Police fulfil their functions. How to distribute these patrols in the city is a complicated task that needs experience and a deep analysis of traffic and Police data. In this work, we present a method that uses artificial intelligence to analyse these data and to propose how to distribute the Police patrols in reaction to events that are monitored in real time, for a better service to citizens.

José Manuel Rodríguez-Jiménez

Unsupervised Vehicle Recognition Using Incremental Reseeding of Acoustic Signatures

Vehicle recognition and classification have broad applications, ranging from traffic flow management to military target identification. We demonstrate an unsupervised method for automated identification of moving vehicles from roadside audio sensors. Using a short-time Fourier transform to decompose audio signals, we treat the frequency signature in each time window as an individual data point. We then use a spectral embedding for dimensionality reduction. Based on the leading eigenvectors, we relate the performance of an incremental reseeding algorithm to that of spectral clustering. We find that incremental reseeding accurately identifies individual vehicles using their acoustic signatures.

Justin Sunu, Allon G. Percus, Blake Hunter

A Big Data Framework for Analysis of Traffic Data in Italian Highways

The analysis of traffic data can provide decision-makers with invaluable information. Despite the availability of methodologies specifically oriented to processing this kind of data and extracting knowledge from them, few tools provide a rich set of functionalities tailored to traffic analysis in large-scale, stream-like contexts. In this paper we aim to fill this gap by introducing an exploratory framework that supports the analysis of massive stream traffic data either by OLAP-like exploration or by resorting to advanced data mining techniques.

Claudia Diamantini, Domenico Potena, Emanuele Storti

Traffic Data Classification for Police Activity

Traffic data, automatically collected en masse every day, can be mined to discover information or patterns to support police investigations. Leveraging domain expertise, in this paper we show how unsupervised clustering techniques can be used to infer trending behaviors of road users and thus classify both routes and vehicles. We describe a tool devised and implemented upon openly available scientific libraries, and we present a new set of experiments involving three years’ worth of data. Our classification results show robustness to noise and have high potential for detecting anomalies possibly connected to criminal activity.

Stefano Guarino, Fabio Leuzzi, Flavio Lombardi, Enrico Mastrostefano

Multipurpose Web-Platform for Labeling Audio Segments Efficiently and Effectively

One of the principal reasons for the success of machine learning is the use of large amounts of labeled data to train various learning models. The availability of annotated data depends, to a large extent, on the nature of the domain and on how easy it is to obtain labeled data points. One of the areas that we believe still lacks substantial labeled data is audio. This is not surprising, since labeling audio segments can be rather tedious and time-consuming, mainly due to the temporal nature of audio. In this paper, we present a free and open-source web-based platform that we developed, which allows individuals and research teams to crowdsource large numbers of labeled audio segments efficiently and effectively. Once an individual or a team signs up to use the platform as researchers, they are granted administrative access that enables them to upload their own audio files and to customize the labeling and data collection process according to their study needs. Examples of customizing the study include listing the different labels of interest, specifying the duration of audio segments and how they should be extracted from the audio file(s), and dictating how labelers should be prompted with the audio segments based on a set of pre-determined user-defined rules. Our system automatically handles generating the audio segments from the audio files, presenting labelers with an intuitive interface using the rules specified by the study administrators, and finally recording the labelers’ responses and providing them to the administrators of the study in a readable and easy-to-access format.

Ayman Hajja, Griffin P. Hiers, Pierre Arbajian, Zbigniew W. Raś, Alicja A. Wieczorkowska

A Description Logic of Typicality for Conceptual Combination

We propose a nonmonotonic Description Logic of typicality able to account for the phenomenon of combining prototypical concepts, an open problem in the fields of AI and cognitive modelling. Our logic extends the logic of typicality $$\mathcal{ALC}+\mathbf{T}_\mathbf{R}$$, based on the notion of rational closure, by inclusions $$p \ {::} \ \mathbf{T}(C) \sqsubseteq D$$ (“we have probability p that typical Cs are Ds”), coming from the distributed semantics of probabilistic Description Logics. Additionally, it embeds a set of cognitive heuristics for concept combination. We show that the complexity of reasoning in our logic is ExpTime-complete as in $$\mathcal{ALC}$$.

Antonio Lieto, Gian Luca Pozzato

Mining Complex Patterns

Sparse Multi-label Bilinear Embedding on Stiefel Manifolds

Dimensionality reduction plays an important role in various machine learning tasks. In this paper, we propose a novel method dubbed Sparse Multi-label bILinear Embedding (SMILE) on Stiefel manifolds for supervised dimensionality reduction on multi-label data. Unlike traditional multi-label dimensionality reduction algorithms that work on vectorized data, the proposed SMILE directly takes second-order tensor data as input and thus characterizes the spatial structure of the tensor data in an efficient way. Unlike existing tensor-based dimensionality reduction methods that perform an eigen-decomposition in each iteration, SMILE uses a gradient ascent strategy to optimize the objective function in each iteration and is thus more efficient. Moreover, we impose column-orthonormal constraints on the transformation matrices to eliminate the redundancy between the projection directions of the learned subspace, and we add an $$L_1$$-norm regularization term to the objective function to enhance the interpretability of the learned subspace. Experiments on a standard image dataset validate the effectiveness of the proposed method.

Yang Liu, Guohua Dong, Zhonglei Gu

Learning Latent Factors in Linked Multi-modality Data

Many real-world datasets can be represented as networks in which the vertices and edges represent data entities and the interrelationships between them, respectively. The discovery of network clusters, which are typical latent structures, is one of the most significant tasks of network analytics. Currently, there are no effective approaches that are able to deal with linked data whose features come from multiple modalities. To address this, we propose an effective model for learning latent factors in linked multimodality data, named LFLMD. Given the link structure and the multimodality features associated with vertices, LFLMD formulates a constrained optimization problem to learn corresponding latent spaces that represent the strength with which each vertex belongs to the latent components w.r.t. link structure and multimodality features. In addition, LFLMD adopts an effective method to model the affinity between pairs of vertices so that the cluster membership of each vertex can be revealed by grouping vertices that share more similar latent structures. For model inference, a series of iterative algorithms for updating the variables in the latent spaces is derived. LFLMD has been tested on several sets of networked data with different modalities of features and is found to be very effective.

Tiantian He, Keith C. C. Chan

Researcher Name Disambiguation: Feature Learning and Affinity Propagation Clustering

Name ambiguity has been considered a challenging problem in the field of information retrieval. When we query all the papers of a researcher in a current literature integration system, we find that many irrelevant papers written under the same researcher name appear in the retrieval results, which seriously affects the quality of retrieval. To tackle this problem, the name disambiguation task was proposed to correctly partition the papers so that the papers contained in each part belong to a unique researcher. Certain information sources can help disambiguate researchers, e.g., co-researchers, affiliations, homepages and paper titles. However, such information sources may be costly to obtain or unavailable. Therefore, it is necessary to solve the name disambiguation task under the condition of insufficient information sources. Another challenge is how to accomplish the task without knowing the number of distinct researchers. In this paper, we make full use of the relational network between papers. Our proposed method learns feature representations of papers and then uses affinity propagation clustering to solve the name disambiguation task. The experimental results show that our proposed method obtains better accuracy on the name disambiguation task than existing methods.

Zhizhi Yu, Bo Yang

Hierarchical Clustering of High-Dimensional Data Without Global Dimensionality Reduction

Very few clustering algorithms can cope with a high number of dimensions. Many problems arise when dimensions grow to the order of hundreds. Dimensionality reduction and feature selection are simple remedies to these problems. In addition to being somewhat intellectually disappointing approaches, both also lead to loss of information. Furthermore, many elaborate clustering algorithms become unintuitive for the user because she is required to set the values of several hyperparameters without a clear understanding of their meaning and effect. We develop PCA-based hierarchical clustering algorithms that are particularly geared for high-dimensional data. Technically, the novelty is to describe data vectors iteratively by their angles with eigenvectors in the orthogonal basis of the input space. The new algorithms avoid the major curse-of-dimensionality problems that affect cluster analysis. We aim at expressive algorithms that are easily applicable. This entails that the user only needs to set a few intuitive hyperparameters. Moreover, exploring the effects of tuning parameters is simple since they are directly (or inversely) proportional to the clustering resolution. Also, the clustering hierarchy has a comprehensible interpretation and, therefore, moving between nodes in the hierarchy tree has an intuitive meaning.

Ilari Kampman, Tapio Elomaa

Exploiting Order Information Embedded in Ordered Categories for Ordinal Data Clustering

Ordinal data, a major type of categorical data, are those whose attributes have possible values (also called categories) that are naturally ordered. As far as we know, none of the existing distance metrics proposed for categorical data takes the underlying order information into account during distance measurement. This makes the produced distances inaccurate and further affects the results of ordinal data clustering. We therefore propose a specially designed distance metric that exploits the order information embedded in the ordered categories for distance measurement. It quantifies the distance between two ordinal categories by accumulating the sub-entropies of all the categories ordered between them. Since the proposed distance metric takes the order information into account, the distances it produces are more reasonable than those of the other metrics proposed for categorical data. Moreover, it is parameter-free and can be easily applied to different ordinal data clustering tasks. Experimental results show the promising advantages of the proposed distance metric.

Yiqun Zhang, Yiu-ming Cheung
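
The idea of accumulating sub-entropies between two ordered categories, as described in the abstract above, might be sketched as follows. The abstract does not give the exact formula, so several choices here are assumptions for illustration only: the sub-entropy of a category is taken as -p log p with p its relative frequency, both endpoint categories are included in the sum, and the distance of a category to itself is set to zero. The function names are hypothetical.

```python
import math
from collections import Counter


def sub_entropies(values, order):
    """Sub-entropy -p*log(p) of each category, p being its relative frequency.

    `values` are the observed values of one ordinal attribute and `order`
    lists its categories from lowest to highest (assumed input format).
    """
    counts = Counter(values)
    total = len(values)
    result = {}
    for category in order:
        p = counts[category] / total
        result[category] = -p * math.log(p) if p > 0 else 0.0
    return result


def ordinal_distance(a, b, order, entropies):
    """Sum the sub-entropies of the categories lying from a to b.

    Assumption: endpoints are included, and the distance is 0 when a == b.
    """
    i, j = sorted((order.index(a), order.index(b)))
    if i == j:
        return 0.0
    return sum(entropies[c] for c in order[i:j + 1])


order = ["low", "medium", "high"]
observed = ["low", "low", "medium", "high", "high", "high"]
ent = sub_entropies(observed, order)
d_far = ordinal_distance("low", "high", order, ent)
d_near = ordinal_distance("low", "medium", order, ent)
```

Under these assumptions, categories that are further apart in the order never receive a smaller distance than nearer ones, because more non-negative terms are accumulated.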

User-Emotion Detection Through Sentence-Based Classification Using Deep Learning: A Case-Study with Microblogs in Albanian

Human emotion analysis has long stimulated studies in different disciplines, such as Cognitive Sciences and Psychology, and, thanks to the diffusion of social media, it is attracting the interest of computer scientists too. In particular, the growing popularity of microblogging platforms has generated large amounts of information, which in turn represent an attractive source of data for opinion mining and sentiment analysis. In our research, we analyse microblogging texts and postings in the Albanian language, which enables the use of technologies to monitor and follow the feelings and perceptions of people with respect to products, issues, events, etc. Our approach to emotion analysis tackles the problem of classifying a text fragment into a set of pre-defined emotion categories and therefore aims at detecting the emotional state of the writer conveyed through the text. In order to achieve this goal, we perform a comparative analysis of different classifiers, using deep learning and other classical machine learning classification algorithms. We also adopt a domestic stemming tool for the Albanian language in order to preprocess the datasets used in a second round of experiments. Experimental evaluation shows that deep learning produces overall better results than the other methods in terms of classification accuracy. We also present other findings related to the length of the texts being processed and its impact on the classifiers’ accuracy.

Marjana Prifti Skenduli, Marenglen Biba, Corrado Loglisci, Michelangelo Ceci, Donato Malerba

A Novel Personalized Citation Recommendation Approach Based on GAN

With the explosive growth of scientific publications, researchers find it hard to search for appropriate research papers. Citation recommendation can overcome this obstacle. In this paper, we propose a novel approach to citation recommendation that applies generative adversarial networks. The generative adversarial model plays an adversarial game with two linked models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than from G. The model first encodes the graph structure and the content information to obtain a content-based graph representation. Then, we encode the network structure and co-authorship to obtain an author-based graph representation. Finally, the concatenation of the two representations serves as the node feature vector, which is a more accurate network representation that integrates the author and content information. Based on the obtained node vectors, we propose a novel personalized citation recommendation approach called CGAN and its variation VCGAN. When evaluated on the AAN dataset, our proposed approaches outperform existing state-of-the-art approaches.

Ye Zhang, Libin Yang, Xiaoyan Cai, Hang Dai

Novelty Detection and Class Imbalance

Unsupervised LSTMs-based Learning for Anomaly Detection in Highway Traffic Data

Since road traffic is nowadays predominant, improving its safety, security and comfort may have a significant positive impact on people’s lives. This objective requires suitable studies of traffic behavior to help stakeholders obtain non-trivial information, understand the traffic models and plan suitable actions. While, on the one hand, the pervasiveness of georeferencing and mobile technologies allows us to know the position of relevant objects and track their routes, on the other hand the huge amounts of data to be handled and the intrinsic complexity of road traffic make this study quite difficult. Deep Neural Networks (NNs) are powerful models that have achieved excellent performance on many tasks. In this paper we propose a sequence-to-sequence (Seq2Seq) autoencoder able to detect anomalous routes, consisting of an encoder Long Short Term Memory (LSTM) that maps the input route to a fixed-length vector representation and a decoder LSTM that reconstructs the input route. It was applied to the TRAP2017 dataset, freely available from the Italian National Police.

Nicola Di Mauro, Stefano Ferilli

SCUT-DS: Learning from Multi-class Imbalanced Canadian Weather Data

Learning from multi-class imbalanced data streams with multiple minority classes and varying degrees of skewed distributions is an important problem in many real-world applications. However, to date, this aspect has received limited attention in the research community. Rather, the focus is on binary class problems or, alternatively, multi-class scenarios are decomposed into multiple binary sub-problems that are handled separately. Furthermore, the evolving nature of data streams makes the task of correctly predicting minority instances challenging. In this paper, we introduce the SCUT-DS approach, which combines multi-class synthetic oversampling and cluster-based under-sampling. SCUT-DS is a window-based method that balances the number of incoming instances of all classes directly, as the stream evolves. We present our experimental evaluation on a stream of Canadian weather data with varying degrees of skewed distributions and multiple classes. We demonstrate that our SCUT-DS algorithms consistently improve the recognition rates of the minority instances in this multi-class imbalanced setting. Our results are especially promising for difficult-to-learn minority classes, notably for predicting ice storms and glaze events.

Olubukola M. Olaitan, Herna L. Viktor

An Efficient Algorithm for Network Vulnerability Analysis Under Malicious Attacks

Given a communication network, we address the problem of computing a lower bound to the transmission rate between two network nodes notwithstanding the presence of an intelligent malicious attacker with limited destructive power. Formally, we are given a link capacitated network N with source node s and destination node t and a budget B for the attacker. We want to compute the Guaranteed Maximum Flow from s to t when an attacker can remove at most B edges. This problem is known to be NP-hard for general networks. For Internet-like networks we present an efficient ILP-based algorithm coupled with instance transformation techniques that allow us to solve the above problem for networks with more than 200 000 nodes and edges within a few minutes. To the best of our knowledge this is the first time that instances of this size for the above problem have been solved for Internet-like networks.

Toni Mancini, Federico Mari, Igor Melatti, Ivano Salvo, Enrico Tronci
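
To make the problem statement in the abstract above concrete, the following sketch computes the Guaranteed Maximum Flow on a tiny network by brute force: it tries every set of at most B removed edges and keeps the worst resulting maximum flow. This exhaustive search is exponential in B and is not the ILP-based algorithm of the paper; the function names, the edge-list format and the use of integer capacities are assumptions made for illustration.

```python
from collections import deque
from itertools import combinations


def max_flow(n, edges, s, t):
    """Edmonds-Karp maximum flow on vertices 0..n-1.

    `edges` is a list of directed (u, v, capacity) triples with integer
    capacities (assumed input format for this sketch).
    """
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
    flow = 0
    while True:
        # Breadth-first search for a shortest augmenting path.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in range(n):
                if parent[v] == -1 and cap[u][v] > 0:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            return flow
        # Find the bottleneck capacity along the path, then augment.
        bottleneck = float("inf")
        v = t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            cap[parent[v]][v] -= bottleneck
            cap[v][parent[v]] += bottleneck
            v = parent[v]
        flow += bottleneck


def guaranteed_max_flow(n, edges, s, t, budget):
    """Worst-case max flow when an attacker removes at most `budget` edges."""
    worst = max_flow(n, edges, s, t)
    for k in range(1, min(budget, len(edges)) + 1):
        for removed in combinations(range(len(edges)), k):
            kept = [e for i, e in enumerate(edges) if i not in removed]
            worst = min(worst, max_flow(n, kept, s, t))
    return worst


network = [(0, 1, 3), (0, 2, 2), (1, 3, 2), (2, 3, 3), (1, 2, 1)]
no_attack = guaranteed_max_flow(4, network, 0, 3, budget=0)
one_edge = guaranteed_max_flow(4, network, 0, 3, budget=1)
```

The number of removal sets grows combinatorially with the budget and the edge count, which illustrates why instances with hundreds of thousands of edges call for a dedicated method.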

Social Data Analysis

An Instrumented Methodology to Analyze and Categorize Information Flows on Twitter Using NLP and Deep Learning: A Use Case on Air Quality

This article focuses on the development of an instrumented methodology for modeling and analyzing the circulation of message flows concerning air quality on the social network Twitter. This methodology aims at describing and representing, on the one hand, the modes of circulation and distribution of message flows on this social medium and, on the other hand, the content exchanged between stakeholders. To achieve this, we developed Natural Language Processing (NLP) tools and a classifier based on Deep Learning approaches in order to categorize messages from scratch. The conceptual and instrumented methodology presented is part of a broader interdisciplinary methodology, based on quantitative and qualitative methods, for the study of communication in environmental health. A use case on air quality is presented.

B. Juanals, J. L. Minel

Market-Aware Proactive Skill Posting

Referral networks consist of a network of experts, human or automated agents, who have differential expertise across topics and can redirect tasks to appropriate colleagues based on their topic-conditioned skills. Proactive skill posting is a setting in referral networks where agents are allowed a one-time local-network advertisement of a subset of their skills. Heretofore, while advertising expertise, experts only considered their own skills and reported their strongest skills. However, in practice, tasks can have varying difficulty levels, and reporting skills that are uncommon or rare may give an expert a relative advantage over others and give the network as a whole a better ability to solve problems. This work introduces market-aware proactive skill posting, where experts report a subset of their skills that gives them competitive advantages over their peers. Our proposed algorithm in this new setting, proactive-DIEL$$_{\varDelta}$$, outperforms the previous state of the art, proactive-DIEL$$_t$$, during the early learning phase, while retaining important properties such as tolerance to noisy self-skill estimates and robustness to evolving networks and strategic lying.

Ashiqur R. KhudaBukhsh, Jong Woo Hong, Jaime G. Carbonell

Evidential Multi-relational Link Prediction Based on Social Content

A novel framework to address the link prediction problem in multiplex social networks is introduced. In this framework, uncertainty found in social data due to noise, missing information and observation errors is handled by belief function theory. Despite the numerous published studies on link prediction, few are concerned with imperfections in social data, which distort social network structures and probably lead to inaccurate results. In addition, most works focus on similarity scores based on network topology, whereas social networks include rich content which may add semantics to the analysis and enhance results. To this end, we develop a link prediction method that combines network topology and social content to predict the existence of new links along with their types in multiplex social networks. Information about structural and social neighbors is gathered and pooled using the combination rules of belief function theory. It is subsequently revised according to global information about the multiplex. Experiments performed on real-world social data show that our approach works well and enhances the prediction accuracy.

Sabrine Mallek, Imen Boukhris, Zied Elouedi, Eric Lefevre

Spatio-temporal Analysis

Frontmatter

Predicting Temporal Activation Patterns via Recurrent Neural Networks

We tackle the problem of predicting whether a target user (or group of users) will be active within an event stream before a time horizon. Our solution, called PATH, leverages recurrent neural networks to learn an embedding of the past events. The embedding captures influence and susceptibility between users and places closer together (the representations of) users that frequently become active in different event streams within a small time interval. We conduct an experimental evaluation on real-world data and compare our approach with related work.

Giuseppe Manco, Giuseppe Pirrò, Ettore Ritacco

Handling Multi-scale Data via Multi-target Learning for Wind Speed Forecasting

Wind speed forecasting is particularly important for wind farms due to cost-related issues, dispatch planning, and energy market operations. This paper presents a multi-target learning method to model historical wind speed data and yield accurate forecasts of the wind speed on the day-ahead (24 h) horizon. The proposed method is based on the analysis of historical data, which are represented at multiple scales in both space and time. Handling multi-scale data allows us to leverage the knowledge hidden in both the spatial and temporal variability of the shared information, in order to identify spatio-temporal aided patterns that contribute to accurate wind speed forecasts. The viability of the presented method is evaluated on benchmark data. Specifically, the empirical study shows that learning from multi-scale historical data allows us to determine accurate wind speed forecasts.

Annalisa Appice, Antonietta Lanza, Donato Malerba

Temporal Reasoning with Layered Preferences

Temporal representation and temporal reasoning are central to Artificial Intelligence. The literature is moving to the treatment of “non-crisp” temporal constraints, in which preferences or probabilities are also considered. However, most approaches only support numeric preferences, while, in many domain applications, users naturally operate on “layered” scales of values (e.g., Low, Medium, High), which are domain- and task-dependent. For many tasks, including decision support, the evaluation of the minimal network of the constraints (i.e., the tightest constraints) is of primary importance. We propose the first approach in the literature coping with layered preferences on quantitative temporal constraints. We extend the widely used simple temporal problem (STP) framework to consider layered user-defined preferences, proposing (i) a formal representation of quantitative constraints with layered preferences, and (ii) a temporal reasoning algorithm, based on the general algorithm Compute-Summaries, for the propagation of such temporal constraints. We also prove that our temporal reasoning algorithm evaluates the minimal network.

Luca Anselma, Alessandro Mazzei, Luca Piovesan, Paolo Terenziani
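
As background for the notion of a minimal network mentioned in the abstract above, here is a minimal sketch for the plain STP framework, without the layered preferences that the chapter adds and without the Compute-Summaries algorithm. It uses the standard construction in which each constraint lo <= x_j - x_i <= hi becomes two weighted edges of a distance graph, and all-pairs shortest paths (Floyd-Warshall) give the tightest bounds. The function name and the input format are illustrative assumptions.

```python
import math


def minimal_network(n, constraints):
    """Tightest bounds of a simple temporal problem over time points 0..n-1.

    `constraints` maps (i, j) to (lo, hi), meaning lo <= x_j - x_i <= hi.
    Returns a dict of tightened (lo, hi) bounds for every pair i < j,
    or None if the constraints are inconsistent.
    """
    # Distance graph: d[i][j] is an upper bound on x_j - x_i.
    d = [[0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for (i, j), (lo, hi) in constraints.items():
        d[i][j] = min(d[i][j], hi)
        d[j][i] = min(d[j][i], -lo)
    # Floyd-Warshall all-pairs shortest paths.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    # A negative cycle means no assignment satisfies all constraints.
    if any(d[i][i] < 0 for i in range(n)):
        return None
    return {(i, j): (-d[j][i], d[i][j])
            for i in range(n) for j in range(i + 1, n)}


# Hypothetical example: the loose bound on (0, 2) is tightened from below.
stp = {(0, 1): (10, 20), (1, 2): (30, 40), (0, 2): (0, 55)}
print(minimal_network(3, stp))
```

In this example the constraints on (0, 1) and (1, 2) together imply that x_2 - x_0 is at least 40, so the lower bound of the (0, 2) constraint is tightened accordingly while its upper bound is already as tight as possible.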

Granular and Soft Clustering

Frontmatter

An Adaptive Three-Way Clustering Algorithm for Mixed-Type Data

Three-way clustering differs from traditional two-way clustering. Instead of representing a cluster by a single set, which yields two regions, it represents a cluster by a pair of sets, which yields three regions: the core region, the fringe region and the trivial region. The three-way representation intuitively shows which objects are on the fringe of the cluster, and it was proposed for dealing with uncertain clustering. However, a three-way clustering algorithm usually needs an appropriate evaluation function and corresponding thresholds, and setting the thresholds in advance is neither scientific nor efficient. Meanwhile, there is a large amount of mixed-type data in real life. Therefore, this paper proposes an adaptive three-way clustering algorithm for mixed-type data, which adjusts the three-way thresholds during the clustering process based on the idea of universal gravitation by mining more detailed ascription relations between objects and clusters. The experimental results show that the proposed algorithm performs well on indices such as accuracy, F-measure, RI and NMI.

Jing Xiong, Hong Yu

Three-Way Spectral Clustering

In recent years, three-way clustering has shown promising performance in many different fields. In this paper, we present a new three-way spectral clustering algorithm that combines three-way decision and spectral clustering. In the proposed algorithm, we revise the process of spectral clustering and obtain an upper bound of each cluster. Perturbation analysis is applied to separate the core region from the upper bound, and the difference between the upper bound and the core region is regarded as the fringe region of the specific cluster. The results on UCI data sets show that this strategy is effective in reducing the value of DBI and increasing the values of ACC and AS.

Hong Shi, Qiang Liu, Pingxin Wang

The Granular Structures in Formal Concept Analysis

Granular analysis in formal concept analysis is a newly proposed interesting topic. In this paper, we study the granular structures in formal concept analysis. According to the principle of multiview, we define the granules from viewpoints of objects and attributes, respectively. On the basis of the semantic meaning and the function in lattice construction, ten different kinds of granules are given. These ten kinds of granules construct the multilevel granular structures. Then, we define the similarity between each pair of granules in the same level. On the basis of the similarity measurement, we show how to transform granules from one level to another. Finally, an example is presented to illustrate the results we obtain.

Ruisi Ren, Ling Wei

Fuzzy RST and RST Rules Can Predict Effects of Different Therapies in Parkinson’s Disease Patients

Neurodegenerative disorders (ND) such as Parkinson’s disease (PD) are increasing in frequency with ageing, but we still do not have a cure for ND. In the present study, we have analyzed results of neurological, psychological and eye movement (saccadic) tests in order to discover patterns (KDD) and to predict disease progression with fuzzy rough set (FRST) and rough set (RST) theories. It is a longitudinal study in which we have repeated our measurements every six months and estimated disease progression in three different groups of patients: BMT-group: medication only; DBS-group: medication and deep brain stimulation (DBS); and POP-group: same as DBS but with a several years longer period of DBS. With the help of the above KDD methods, we have predicted UPDRS (Unified Parkinson’s Disease Rating Scale) values in the following two visits on the basis of the first visit with an accuracy of 0.7 for both BMT visits, 0.56 for DBS, and 0.7-0.8 for POP visits. We could also predict UPDRS of DBS patients by rules obtained from the BMT-group with accuracies of 0.6, 0.8, and 0.7 for the three following DBS visits. Using FRST we have predicted UPDRS of DBSW3 from DBSW2 with an accuracy of 0.5. We could not predict by RST the disease progression of POP patients from other groups, but with FRST we could predict POPW1 on the basis of DBSW1 results (with an accuracy of 0.33). In summary: long-term DBS (POP-group), in contrast to the other groups, has changed brain mechanisms, and only FRST found similarities between POP and the other groups in disease progression.

Andrzej W. Przybyszewski

From Knowledge Discovery to Customer Attrition

This article presents a novel approach to handling the customer attrition problem with knowledge discovery methods. The data mining is performed on customer feedback data, which was labelled in terms of customer activity by means of temporal transactional invoice data. The problem was raised within an industry-academia collaboration project at the University of North Carolina at Charlotte by one of the companies from the heavy equipment repair industry. They expressed interest in gaining better insight into this problem, having already implemented their own active CRM program. The goal and motivation within this topic is to determine whether there are markers in the sales trends that might suggest a customer is getting ready to defect. By observing the behavior of customers who left a company, one might be able to identify customers who may leave as well.

Katarzyna Tarnowska, Zbigniew Ras

Initial Analysis of Multivariate Factors for Prediction of Shark Presence and Attacks on the Coast of North Carolina

Classification, association rules and clustering are used in the study to improve understanding of the presence of sharks in nearshore waters during tourist seasons in middle Atlantic coastal waters, specifically North Carolina. The Global Shark Attack File, combined with data on environmental, biotic and meteorological factors, is prepared for analysis using the CRISP-DM process. In future work, combined inputs including a standardized hashtag for Twitter mining, real-time weather and water information, and data on crab and turtle presence will provide real-time input to an app or a dashboard providing early warning of shark presence.

Sonal Kaulkar, Lavanya Vinodh, Pamela Thompson

Topic Modelling and Opinion Mining

Frontmatter

An Experimental Evaluation of Algorithms for Opinion Mining in Multi-domain Corpus in Albanian

Opinion mining is an important tool to find out what others think about something. Most methods used for opinion mining are based on machine learning. In this paper we present an experimental evaluation of machine learning algorithms used for opinion mining in a multi-domain corpus in the Albanian language. We have created 11 multi-domain corpora combining the opinions from 5 different topics. The opinions are classified as positive or negative. All the corpora are used to train and test 50 classification algorithms and to compare their opinion mining performance. Of the seven best-performing algorithms, three are based on Naïve Bayes.

Nelda Kote, Marenglen Biba, Evis Trandafili

Predicting Author’s Native Language Using Abstracts of Scholarly Papers

Predicting an author’s attributes is useful for understanding implicit meanings of documents. The target problem of this paper is predicting the author’s native language for each document. The authors of this paper used surface-level features of documents for the problem and tried to clarify the practical tendencies of the writing style as word occurrences. They conducted a classification of the English abstracts of approximately 85,000 scholarly papers written in English or in Japanese. As a result of the experiment, the accuracy of the binary classification was 0.97, and they found that a number of distinctive phrases used in the classification were related to typical writing styles of Japanese.

Takahiro Baba, Kensuke Baba, Daisuke Ikeda

Identifying Exceptional Descriptions of People Using Topic Modeling and Subgroup Discovery

Descriptions of images form the backbone of many intelligent systems, which assume that descriptions vary randomly in construction and content but that description content is homogeneous. This assumption becomes problematic when extended to descriptions of images of people [14], where people are known to show systematic biases in how they process others [19]. Therefore, this paper presents a novel approach for discovering exceptional subgroups of descriptions in which the content of those descriptions reliably differs from the general set of descriptions. We develop a novel interestingness measure for subgroup discovery appropriate for probability distributions across semantic representations. The proposed method is applied to a web-based experiment in which 500 raters describe images of 200 people. Our analysis identifies multiple exceptional subgroups and the attributes of the respective raters and images. We further discuss implications for intelligent systems.

Andrew T. Hendrickson, Jason Wang, Martin Atzmueller
