2017 | Book

Discovery Science

20th International Conference, DS 2017, Kyoto, Japan, October 15–17, 2017, Proceedings

Editors: Akihiro Yamamoto, Takuya Kida, Takeaki Uno, Tetsuji Kuboyama

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science

About this book

This book constitutes the proceedings of the 20th International Conference on Discovery Science, DS 2017, held in Kyoto, Japan, in October 2017, co-located with the International Conference on Algorithmic Learning Theory, ALT 2017.

The 18 revised full papers and 6 short papers presented in this volume, together with 2 invited talks, were carefully reviewed and selected from 42 submissions. The scope of the conference includes the development and analysis of methods for discovering scientific knowledge, coming from machine learning, data mining, intelligent data analysis, and big data analysis, as well as their application in various scientific domains. The papers are organized in topical sections on online learning, regression, label classification, deep learning, feature selection, recommendation systems, community detection, pattern mining, bioinformatics, and knowledge discovery.

Table of Contents

Frontmatter

Online Learning

Frontmatter
Context-Based Abrupt Change Detection and Adaptation for Categorical Data Streams
Abstract
The identification of changes in data distributions associated with data streams is critical in understanding the mechanics of data generating processes and ensuring that data models remain representative through time. To this end, concept drift detection methods often utilize statistical techniques that take numerical data as input. However, many applications produce data streams containing categorical attributes, where numerical statistical methods are not applicable. In this setting, common solutions use error monitoring, assuming that fluctuations in the error measures of a learning system correspond to concept drift. Context-based concept drift detection techniques for categorical streams, which observe changes in the actual data distribution, have received limited attention. Such context-based change detection is arguably more informative as it is data-driven and directly applicable in an unsupervised setting. This paper introduces a novel context-based algorithm for categorical data, namely FG-CDCStream. In this unsupervised method, multiple drift detection tracks are maintained and their votes are combined in order to determine whether a real change has occurred. In this way, change detections are rapid and accurate, while the number of false alarms remains low. Our experimental evaluation against synthetic data streams shows that FG-CDCStream outperforms the state of the art. Our analysis further indicates that FG-CDCStream produces highly accurate and representative post-change models.
Sarah D’Ettorre, Herna L. Viktor, Eric Paquet
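As an editorial illustration of the voting idea described in the abstract, the sketch below combines several simple frequency-based drift "tracks" over a categorical stream by majority vote; it is a minimal sketch under assumed window sizes and thresholds, not the authors' FG-CDCStream algorithm.

    from collections import Counter

    # Illustrative sketch only: majority voting over several simple drift "tracks"
    # on a categorical stream. Window sizes and the vote threshold are hypothetical
    # choices, not the parameters of FG-CDCStream.

    class FrequencyTrack:
        def __init__(self, window, threshold=0.3):
            self.window, self.threshold = window, threshold
            self.reference, self.recent = [], []

        def update(self, value):
            target = self.reference if len(self.reference) < self.window else self.recent
            target.append(value)
            if len(self.recent) < self.window:
                return False
            # Compare category frequencies between the reference and recent windows.
            ref, cur = Counter(self.reference), Counter(self.recent)
            cats = set(ref) | set(cur)
            dist = sum(abs(ref[c] - cur[c]) for c in cats) / (2 * self.window)
            drifted = dist > self.threshold
            if drifted:
                self.reference, self.recent = self.recent, []   # adapt after a change
            else:
                self.recent.pop(0)                               # slide the window
            return drifted

    def detect_drift(stream, windows=(50, 100, 200), min_votes=2):
        tracks = [FrequencyTrack(w) for w in windows]
        for i, value in enumerate(stream):
            votes = sum(track.update(value) for track in tracks)
            if votes >= min_votes:
                yield i   # index at which a real change is declared
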
A New Adaptive Learning Algorithm and Its Application to Online Malware Detection
Abstract
Nowadays, the number of new malware samples discovered every day is in the millions, which undermines the effectiveness of the traditional signature-based approach to malware detection. To address this problem, machine learning methods have become an attractive and almost imperative solution. In most previous work, machine learning has been applied to this problem in a batch learning setting. Due to its fixed setting during the learning phase, batch learning often results in low detection accuracy when encountering zero-day samples with obfuscated appearance or unseen behavior. Therefore, in this paper, we propose the FTRL-DP online algorithm to address the problem of malware detection under concept drift, when the behavior of malware changes over time. The experimental results show that online learning outperforms batch learning in all settings, with or without retraining.
Ngoc Anh Huynh, Wee Keong Ng, Kanishka Ariyapala
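For readers unfamiliar with the FTRL family, the sketch below shows a standard FTRL-Proximal update for online logistic regression, the kind of online learner the abstract refers to; it is an illustrative baseline with assumed hyperparameters, not the authors' FTRL-DP algorithm.

    import math

    # Sketch of a standard per-coordinate FTRL-Proximal learner for online logistic
    # regression (the family that FTRL-DP builds on, not the authors' algorithm).
    # Hyperparameters are illustrative defaults.

    class FTRLProximal:
        def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
            self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
            self.z = [0.0] * dim   # accumulated (adjusted) gradients
            self.n = [0.0] * dim   # accumulated squared gradients

        def _weight(self, i):
            if abs(self.z[i]) <= self.l1:
                return 0.0          # the L1 term induces sparse weights
            sign = -1.0 if self.z[i] < 0 else 1.0
            return -(self.z[i] - sign * self.l1) / (
                (self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

        def predict(self, x):       # x: sparse feature dict {index: value}
            s = sum(self._weight(i) * v for i, v in x.items())
            return 1.0 / (1.0 + math.exp(-max(min(s, 35), -35)))

        def update(self, x, y):     # y in {0, 1}
            p = self.predict(x)
            for i, v in x.items():
                g = (p - y) * v
                sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
                self.z[i] += g - sigma * self._weight(i)
                self.n[i] += g * g
            return p
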
Real-Time Validation of Retail Gasoline Prices
Abstract
We provide a method of validating current gasoline (petrol) prices reported by users of a crowd-sourced price-sharing app. We propose an approach that validates each price report in real time, as it is entered by a consumer, by comparing it to the current prediction of the price expected at the specified store at the present time. To do so, a forecast model is used to predict, with high accuracy, a price range for each store in real time, so that a price entered by a consumer can be flagged if it falls outside the predicted range. We present the first experimental results concerning predicting the current price in real time at all stores in a city. The results indicate a significant reduction in forecast error when using our Price Change Rules model instead of the modified Most Common Action model.
Mondelle Simeon, Howard J. Hamilton
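The validation step itself is simple to illustrate. The sketch below flags a reported price that falls outside a forecast range; the range here is a hypothetical placeholder (last observed price plus or minus a tolerance), not the paper's Price Change Rules model.

    # Flag a reported price if it falls outside the range forecast for that store.
    # The interval is a simple placeholder, not the Price Change Rules model.

    def validate_report(store_id, reported_price, last_price, tolerance=0.05):
        lower = last_price * (1 - tolerance)
        upper = last_price * (1 + tolerance)
        if lower <= reported_price <= upper:
            return "accepted"
        return "flagged"            # outside the predicted range: ask for confirmation

    print(validate_report("store-42", 1.62, last_price=1.58))   # accepted
    print(validate_report("store-42", 1.19, last_price=1.58))   # flagged
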

Regression

Frontmatter
General Meta-Model Framework for Surrogate-Based Numerical Optimization
Abstract
We present a novel, general framework for surrogate-based numerical optimization. We introduce the concept of a modular meta-model that can be easily coupled with any optimization method. It incorporates a dynamically constructed surrogate that efficiently approximates the objective function. We consider two surrogate management strategies for deciding when to evaluate the surrogate and when to evaluate the true objective. We address the task of estimating parameters of non-linear models of dynamical biological systems from observations. We show that the meta-model significantly improves the efficiency of optimization, achieving up to 50% reduction of the time needed for optimization and substituting up to 63% of the total number of evaluations of the objective function. The improvement is a result of the use of an adaptive management strategy learned from the history of objective evaluations.
Žiga Lukšič, Jovan Tanevski, Sašo Džeroski, Ljupčo Todorovski
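The following sketch illustrates the general surrogate-management pattern (a cheap surrogate answers most queries, the true objective is evaluated only periodically and its results extend the surrogate's training history); the nearest-neighbour surrogate and the fixed refresh schedule are simplifying assumptions, not the adaptive strategy described in the abstract.

    import random

    # Illustrative surrogate-management loop, not the paper's meta-model framework.
    # A deliberately simple surrogate (1-nearest-neighbour lookup over past true
    # evaluations) answers most queries; the true objective is evaluated only every
    # `refresh` queries and its result is added to the history.

    def make_meta_objective(true_objective, refresh=4):
        history = []                 # (point, value) pairs from true evaluations
        counter = {"n": 0}

        def surrogate(x):
            nearest = min(history,
                          key=lambda pv: sum((a - b) ** 2 for a, b in zip(pv[0], x)))
            return nearest[1]

        def meta(x):
            counter["n"] += 1
            if not history or counter["n"] % refresh == 0:
                value = true_objective(x)        # expensive call
                history.append((tuple(x), value))
                return value
            return surrogate(x)                  # cheap approximation
        return meta

    # Usage with a toy objective and random search:
    sphere = lambda x: sum(v * v for v in x)
    meta = make_meta_objective(sphere)
    best = min(meta([random.uniform(-5, 5) for _ in range(3)]) for _ in range(200))
    print(best)
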
Evaluation of Different Heuristics for Accommodating Asymmetric Loss Functions in Regression
Abstract
Most machine learning methods used for regression explicitly or implicitly assume a symmetric loss function. However, recently an increasing number of problem domains require loss functions that are asymmetric in the sense that the costs for over- or under-predicting the target value may differ. This paper discusses theoretical foundations of handling asymmetric loss functions, and describes and evaluates simple methods which might be used to offset the effects of asymmetric losses. While these methods are applicable to any problem where an asymmetric loss is used, our work derives its motivation from the area of predictive maintenance, which is often characterized by a small number of training samples (in case of failure prediction) or monetary cost-based, mostly non-convex, loss functions.
Andrei Tolstikov, Frederik Janssen, Johannes Fürnkranz
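As a concrete example of one simple heuristic in this spirit (not necessarily one of those evaluated in the paper), the sketch below takes predictions from any regressor trained with a symmetric loss and shifts them by a constant chosen to minimise a linear-linear asymmetric loss on held-out data.

    # Generic heuristic for asymmetric losses: shift symmetric-loss predictions by a
    # constant that minimises a linear-linear (pinball-style) loss, where
    # over-prediction costs c_over and under-prediction costs c_under per unit error.

    def asymmetric_loss(y_true, y_pred, c_over=1.0, c_under=5.0):
        total = 0.0
        for t, p in zip(y_true, y_pred):
            err = p - t
            total += c_over * err if err > 0 else c_under * (-err)
        return total / len(y_true)

    def best_constant_shift(y_true, y_pred, candidates, **costs):
        return min(candidates,
                   key=lambda s: asymmetric_loss(y_true, [p + s for p in y_pred], **costs))

    y_true = [10, 12, 9, 11]
    y_pred = [11, 11, 10, 10]        # from any symmetric-loss regressor
    shift = best_constant_shift(y_true, y_pred, [i / 10 for i in range(-20, 21)])
    print(shift)   # with under-prediction 5x as costly, the chosen shift is positive
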
Differentially Private Empirical Risk Minimization with Input Perturbation
Abstract
We propose input perturbation, a novel framework for differentially private ERM. Existing differentially private ERM approaches implicitly assume that the data contributors submit their private data to a database, expecting the database to invoke a differentially private mechanism for publication of the learned model. In input perturbation, each data contributor independently randomizes her/his own data and submits the perturbed data to the database. We show that the input perturbation framework theoretically guarantees that the model learned from the randomized data eventually satisfies differential privacy with the prescribed privacy parameters. At the same time, input perturbation guarantees local differential privacy with respect to the server. We also show that the excess risk bound of the model learned with input perturbation is O(1/n) under a certain condition, where n is the sample size. This is the same as the excess risk bound of the state of the art.
Kazuto Fukuchi, Quang Khai Tran, Jun Sakuma
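The sketch below illustrates the local randomization step in generic terms: each contributor clips and perturbs her/his own record with Laplace noise before submission. The noise calibration shown is the standard Laplace mechanism for bounded vectors and is only an assumption for illustration; the paper derives its own calibration so that the final learned model satisfies differential privacy.

    import random

    # Generic local-perturbation sketch (not the paper's calibration): each
    # contributor clips her/his feature vector to [-1, 1] and adds Laplace noise
    # locally, then submits only the noisy record.

    def laplace(scale):
        # the difference of two exponentials is Laplace-distributed with this scale
        return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

    def perturb_record(record, epsilon):
        clipped = [max(-1.0, min(1.0, v)) for v in record]
        scale = 2.0 * len(clipped) / epsilon   # L1 sensitivity of the clipped vector
        return [v + laplace(scale) for v in clipped]

    # Each contributor runs this locally; the server never sees the raw record.
    noisy = perturb_record([0.3, -0.7, 1.4], epsilon=1.0)
    print(noisy)
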

Label Classification

Frontmatter
On a New Competence Measure Applied to the Dynamic Selection of Classifiers Ensemble
Abstract
In this paper, a new method for calculating classifier competence in the dynamic mode is developed. In the method, first the decision profile of the classified object is calculated using the K nearest objects from the validation set. Next, the decision profile is compared with the support vector produced by the classifier. The competence measure reflects the outcome of this comparison and rates the classifier, in a continuous manner, with respect to the similarity between its support vector and the decision profile of the test object. Three different procedures for calculating the decision profile and three different measures for comparing the decision profile and the support vector are proposed, which leads to nine methods of competence calculation. Two multiclassifier (MC) systems, with homogeneous and heterogeneous pools of base classifiers and with a dynamic ensemble selection (DES) scheme, were constructed using the methods developed. The performance of the constructed MC systems was compared against seven state-of-the-art MC systems using 15 benchmark data sets taken from the UCI Machine Learning Repository. The experimental investigation clearly shows the effectiveness of dynamic multiclassifier combination with the proposed competence measures, regardless of the ensemble type used.
Marek Kurzynski, Pawel Trajdos
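One possible instantiation of the competence idea (an illustration under assumed choices, not the authors' exact procedures) is sketched below, assuming scikit-learn-style classifiers with a predict_proba method: the decision profile is the class-frequency vector of the K nearest validation objects, competence is the cosine similarity between this profile and the classifier's support vector, and above-average classifiers form the ensemble.

    import math

    # Illustrative dynamic ensemble selection based on a competence measure.
    # `validation` is a list of (feature_vector, class_label) pairs.

    def decision_profile(x, validation, k=5, n_classes=2):
        nearest = sorted(validation,
                         key=lambda vy: sum((a - b) ** 2 for a, b in zip(vy[0], x)))[:k]
        profile = [0.0] * n_classes
        for _, label in nearest:
            profile[label] += 1.0 / k
        return profile

    def competence(support, profile):
        # cosine similarity between the support vector and the decision profile
        dot = sum(s * p for s, p in zip(support, profile))
        norm = math.sqrt(sum(s * s for s in support)) * math.sqrt(sum(p * p for p in profile))
        return dot / norm if norm else 0.0

    def dynamic_ensemble_select(classifiers, x, validation):
        profile = decision_profile(x, validation)
        scored = [(competence(clf.predict_proba([x])[0], profile), clf) for clf in classifiers]
        mean = sum(c for c, _ in scored) / len(scored)
        return [clf for c, clf in scored if c >= mean]   # keep above-average classifiers
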
Multi-label Classification Using Random Label Subset Selections
Abstract
In this work, we address the task of multi-label classification (MLC). There are two main groups of methods addressing the task of MLC: problem transformation and algorithm adaptation. Methods from the former group transform the dataset to simpler local problems and then use off-the-shelf methods to solve them. Methods from the latter group change and adapt existing methods to directly address this task and provide a global solution. There is no consensus on when to apply a given method (local or global) to a given dataset. In this work, we design a method that builds on the strengths of both groups of methods. We propose an ensemble method that constructs global predictive models on randomly selected subsets of labels. More specifically, we extend the random forests of predictive clustering trees (PCTs) to consider random output subspaces. We evaluate the proposed ensemble extension on 13 benchmark datasets. The results give parameter recommendations for the proposed method and show that the method yields models with competitive performance as compared to three competing methods.
Martin Breskvar, Dragi Kocev, Sašo Džeroski
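The sketch below conveys the random-output-subspace idea using scikit-learn multi-output decision trees as stand-ins for predictive clustering trees (an assumption for illustration); each ensemble member is trained on a bootstrap sample and a random subset of labels, and per-label votes are averaged.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    # Illustrative random-label-subset ensemble for multi-label classification.
    # X: numpy array (n_samples, n_features); Y: binary numpy array (n_samples, n_labels).
    # Assumes 2 <= subset_size <= n_labels.

    class RandomLabelSubsetEnsemble:
        def __init__(self, n_estimators=30, subset_size=3, random_state=0):
            self.n_estimators, self.subset_size = n_estimators, subset_size
            self.rng = np.random.RandomState(random_state)
            self.members = []                     # (label_indices, fitted tree) pairs

        def fit(self, X, Y):
            self.n_labels = Y.shape[1]
            for _ in range(self.n_estimators):
                labels = self.rng.choice(self.n_labels, self.subset_size, replace=False)
                rows = self.rng.choice(len(X), len(X), replace=True)   # bootstrap sample
                tree = DecisionTreeClassifier().fit(X[rows], Y[rows][:, labels])
                self.members.append((labels, tree))
            return self

        def predict(self, X):
            votes = np.zeros((len(X), self.n_labels))
            counts = np.zeros(self.n_labels)
            for labels, tree in self.members:
                votes[:, labels] += tree.predict(X)
                counts[labels] += 1
            return (votes / np.maximum(counts, 1)) >= 0.5   # majority vote per label
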
Option Predictive Clustering Trees for Hierarchical Multi-label Classification
Abstract
In this work, we address the task of hierarchical multi-label classification (HMLC). HMLC is a variant of classification, where a single example may belong to multiple classes at the same time and the classes are organized in the form of a hierarchy. Many practically relevant problems can be presented as a HMLC task, such as predicting gene function, habitat modelling, annotation of images and videos, etc. We propose to extend the predictive clustering trees for HMLC – a generalization of decision trees for HMLC – toward learning option predictive clustering trees (OPCTs) for HMLC. OPCTs address the myopia of the standard tree induction by considering alternative splits in the internal nodes of the tree. An option tree can also be regarded as a condensed representation of an ensemble. We evaluate OPCTs on 12 benchmark HMLC datasets from various domains. With the least restrictive parameter values, OPCTs are comparable to the state-of-the-art ensemble methods of bagging and random forest of PCTs. Moreover, OPCTs statistically significantly outperform PCTs.
Tomaž Stepišnik Perdih, Aljaž Osojnik, Sašo Džeroski, Dragi Kocev

Deep Learning

Frontmatter
Re-training Deep Neural Networks to Facilitate Boolean Concept Extraction
Abstract
Deep neural networks are accurate predictors, but their decisions are difficult to interpret, which limits their applicability in various fields. Symbolic representations in the form of rule sets are one way to illustrate their behavior as a whole, as well as the hidden concepts they model in the intermediate layers. The main contribution of the paper is to demonstrate how to facilitate rule extraction from a deep neural network by retraining it in order to encourage sparseness in the weight matrices and to make the hidden units either maximally or minimally active. Instead of using datasets which combine the attributes in an unclear manner, we show the effectiveness of the methods on the task of reconstructing predefined Boolean concepts, so it can later be assessed to what degree the patterns were captured in the rule sets. The evaluation shows that reducing the connectivity of the network in such a way significantly assists later rule extraction, and that when the neurons are either minimally or maximally active it suffices to consider one threshold per hidden unit.
Camila González, Eneldo Loza Mencía, Johannes Fürnkranz
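A minimal sketch of the retraining objective described above, written in PyTorch with illustrative penalty weights: an L1 term encourages sparse weight matrices, and an h(1-h) term pushes sigmoid hidden activations towards 0 or 1. It shows the general recipe, not the paper's exact training setup.

    import torch
    import torch.nn as nn

    # Illustrative retraining objective: L1 sparsity on the weights plus a penalty
    # that is small only when sigmoid hidden activations are near 0 or 1.

    class SmallNet(nn.Module):
        def __init__(self, n_in, n_hidden, n_out):
            super().__init__()
            self.hidden = nn.Linear(n_in, n_hidden)
            self.out = nn.Linear(n_hidden, n_out)

        def forward(self, x):
            h = torch.sigmoid(self.hidden(x))
            return torch.sigmoid(self.out(h)), h

    def retrain_step(model, optimizer, x, y, l1=1e-3, saturate=1e-2):
        optimizer.zero_grad()
        pred, h = model(x)
        loss = nn.functional.binary_cross_entropy(pred, y)
        loss = loss + l1 * sum(p.abs().sum() for p in model.parameters())
        loss = loss + saturate * (h * (1 - h)).mean()   # drives h towards 0 or 1
        loss.backward()
        optimizer.step()
        return loss.item()
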
An In-Depth Experimental Comparison of RNTNs and CNNs for Sentence Modeling
Abstract
The goal of modeling sentences is to accurately represent their meaning for different tasks. A variety of deep learning architectures have been proposed to model sentences, however, little is known about their comparative performance on a common ground, across a variety of datasets, and on the same level of optimization. In this paper, we provide such a novel comparison for two popular architectures, Recursive Neural Tensor Networks (RNTNs) and Convolutional Neural Networks (CNNs). Although RNTNs have been shown to work well in many cases, they require intensive manual labeling due to the vanishing gradient problem. To enable an extensive comparison of the two architectures, this paper employs two methods to automatically label the internal nodes: a rule-based method and (this time as part of the RNTN method) a convolutional neural network. This enables us to compare these RNTN models to a relatively simple CNN architecture. Experiments conducted on a set of benchmark datasets demonstrate that the CNN outperforms the RNTNs based on automatic phrase labeling, whereas the RNTN based on manual labeling outperforms the CNN. The results corroborate that CNNs already offer good predictive performance and, at the same time, more research on RNTNs is needed to further exploit sentence structure.
Zahra Ahmadi, Marcin Skowron, Aleksandrs Stier, Stefan Kramer

Feature Selection

Frontmatter
Improving Classification Accuracy by Means of the Sliding Window Method in Consistency-Based Feature Selection
Abstract
In the digital era, collecting relevant information about a technological process has become increasingly cheap and easy. However, due to the huge amount of available data, supervised classification is one of the most challenging tasks within the artificial intelligence field. Feature selection addresses this problem by removing irrelevant and redundant features from data. In this paper we propose a new feature selection algorithm called Swcfs, which works well on high-dimensional and noisy data. Swcfs can detect noisy features by leveraging the sliding window method over the set of consecutive features ranked according to their non-linear correlation with the class feature. The metric Swcfs uses to evaluate sets of features, with respect to their relevance to the class label, is the Bayesian risk, which represents the theoretical upper error bound of deterministic classification. Experiments reveal that Swcfs is more accurate than most of the state-of-the-art feature selection algorithms.
Adrian Pino Angulo, Kilho Shin
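The two ingredients named in the abstract can be illustrated as follows; this is a simplified sketch, not the Swcfs algorithm: an empirical estimate of the Bayesian risk of a feature subset, and a sliding window over a correlation-based feature ranking that keeps a window of consecutive features only if it lowers that risk.

    from collections import Counter, defaultdict

    # Illustrative sketch only. X: list of rows (categorical values indexable by
    # feature index); y: class labels; ranked_features: feature indices already
    # ranked by their correlation with the class.

    def bayesian_risk(X, y, features):
        # 1 - expected probability of the majority class within each value pattern
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[tuple(row[f] for f in features)].append(label)
        correct = sum(Counter(labels).most_common(1)[0][1] for labels in groups.values())
        return 1.0 - correct / len(y)

    def sliding_window_selection(X, y, ranked_features, window=3):
        selected = [ranked_features[0]]
        risk = bayesian_risk(X, y, selected)
        for start in range(1, len(ranked_features) - window + 1):
            extra = [f for f in ranked_features[start:start + window] if f not in selected]
            if not extra:
                continue
            candidate = selected + extra
            new_risk = bayesian_risk(X, y, candidate)
            if new_risk < risk:              # keep the window only if it lowers the risk
                selected, risk = candidate, new_risk
        return selected
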
Feature Ranking for Multi-target Regression with Tree Ensemble Methods
Abstract
In this work, we address the task of feature ranking for multi-target regression (MTR). The task of MTR concerns problems where there are multiple continuous dependent variables and the goal is to learn a model for predicting all of the targets simultaneously. This task is receiving increasing attention from the research community. However, feature ranking in the context of MTR has not been studied. Here, we propose three feature ranking methods for MTR: Symbolic, Genie3 and Random Forest. These methods are then coupled with three types of ensemble methods: Bagging, Random Forest, and Extremely Randomized Trees. All of the ensemble methods use predictive clustering trees (PCTs) as base predictive models. PCTs are a generalization of decision trees capable of MTR. In total, we consider eight different ensemble-ranking pairs. We extensively evaluate these pairs on 26 benchmark MTR datasets. The results reveal that all of the methods produce relevant feature rankings and that the best performing method is the Genie3 ranking used with Random Forests of PCTs.
Matej Petković, Sašo Džeroski, Dragi Kocev
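In scikit-learn terms, one ensemble-ranking pair can be approximated as below; this is an illustration only, since the paper uses ensembles of predictive clustering trees and its Symbolic/Genie3 scores differ from impurity-based importances.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor, ExtraTreesRegressor

    # Approximate sketch of a tree-ensemble feature ranking for multi-target
    # regression: fit a multi-output forest and rank features by importance
    # averaged over all targets.

    def mtr_feature_ranking(X, Y, method="rf", n_estimators=100, random_state=0):
        cls = RandomForestRegressor if method == "rf" else ExtraTreesRegressor
        forest = cls(n_estimators=n_estimators, random_state=random_state).fit(X, Y)
        importances = forest.feature_importances_       # averaged over all targets
        return np.argsort(importances)[::-1]            # feature indices, best first

    # Usage on random data with 5 features and 3 targets:
    rng = np.random.RandomState(0)
    X = rng.rand(200, 5)
    Y = np.column_stack([X[:, 0] + X[:, 1], X[:, 1] ** 2, rng.rand(200)])
    print(mtr_feature_ranking(X, Y))
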

Recommendation System

Frontmatter
Recommending Collaborative Filtering Algorithms Using Subsampling Landmarkers
Abstract
Recommender Systems have become increasingly popular, propelling the emergence of several algorithms. As the number of algorithms grows, the selection of the most suitable algorithm for a new task becomes more complex. The development of new Recommender Systems would benefit from tools to support the selection of the most suitable algorithm. Metalearning has been used for similar purposes in other tasks, such as classification and regression. It learns predictive models that map characteristics of a dataset to the predictive performance obtained by a set of algorithms. To this end, different types of characteristics have been proposed: statistical and/or information-theoretical, model-based, and landmarkers. Recent studies argue that landmarkers are successful in selecting algorithms for different tasks. We propose a set of landmarkers for a Metalearning approach to the selection of Collaborative Filtering algorithms. The performance is compared with a state-of-the-art systematic metafeatures approach using statistical and/or information-theoretical metafeatures. The results show that the metalevel accuracy performance using landmarkers is not statistically significantly better than that of the metafeatures obtained with a more traditional approach. Furthermore, the baselevel results obtained with the algorithms recommended using landmarkers are worse than the ones obtained with the other metafeatures. In summary, our results show that, contrary to the results obtained in other tasks, these landmarkers are not necessarily the best metafeatures for algorithm selection in Collaborative Filtering.
Tiago Cunha, Carlos Soares, André C. P. L. F. de Carvalho

Community Detection

Frontmatter
Recursive Extraction of Modular Structure from Layered Neural Networks Using Variational Bayes Method
Abstract
Deep neural networks have made a substantial contribution to the recognition and prediction of complex data in various fields, such as image processing, speech recognition and bioinformatics. However, it is very difficult to discover knowledge from the inference provided by a neural network, since its internal representation consists of many nonlinear and hierarchical parameters. To solve this problem, an approach has been proposed that extracts a global and simplified structure from a neural network. Although it can successfully detect such a hidden modular structure, its convergence is not sufficiently stable and is vulnerable to the initial parameters. In this paper, we propose a new deep learning algorithm that consists of recursive back propagation, community detection using a variational Bayes method, and pruning of unnecessary connections. We show that the proposed method can appropriately detect a hidden inference structure and compress a neural network without increasing the generalization error.
Chihiro Watanabe, Kaoru Hiramatsu, Kunio Kashino
Discovering Hidden Knowledge in Carbon Emissions Data: A Multilayer Network Approach
Abstract
In this paper, we construct the first human carbon emissions network which connects more than a thousand geographical locations based on their daily carbon emissions. We use this network to enable a data-driven analysis for a myriad of scientific knowledge discovery tasks. Specifically, we demonstrate that our carbon emissions network is strongly correlated with oil prices and socio-economic events like regional wars and financial crises. Further, we propose the first multilayer network approach that couples carbon emissions with climate (temperature) anomalies and identifies climate anomaly outlier locations across 60 years of documented carbon emissions data; these outlier locations, despite having different emission trends, experience similar temperature anomalies. Overall, we demonstrate how using network science as a key data analysis technique can reveal a treasure trove of knowledge hidden beneath the carbon emissions data.
Kartikeya Bhardwaj, HingOn Miu, Radu Marculescu
Topic Extraction on Twitter Considering Author’s Role Based on Bipartite Networks
Abstract
This paper proposes a method for extracting quality topics on Twitter based on authors' roles in bipartite networks. We suppose that an author's role, that is, who belongs to which group, affects the quality of the extracted topics. Our proposed method expresses the relations between authors and words as bipartite networks, explores authors' roles by forming clusters using our original community detection technique, and finds quality topics by considering the semantic accuracy of words together with authors' roles.
Takako Hashimoto, Tetsuji Kuboyama, Hiroshi Okamoto, Kilho Shin

Pattern Mining

Frontmatter
Mining Strongly Closed Itemsets from Data Streams
Abstract
We consider the problem of mining strongly closed itemsets from transactional data streams. Compactness and stability against changes in the input are two characteristic features of this kind of itemsets that make them appealing for different applications. Utilizing their algebraic and algorithmic properties, we propose an algorithm based on reservoir sampling for approximating this type of itemsets in the landmark streaming setting, prove its correctness, and show empirically that it yields a considerable speed-up over a straightforward naive algorithm without any significant loss in precision and recall. As a motivating application, we experimentally demonstrate the suitability of strongly closed itemsets to concept drift detection in transactional data streams.
Daniel Trabold, Tamás Horváth
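The reservoir-sampling ingredient mentioned above is easy to illustrate in isolation: keep a uniform random sample of the transactions seen so far in the landmark setting, and mine itemsets from that sample. The sketch below shows only the sampling step; the strongly closed itemset mining itself is omitted.

    import random

    # Standard reservoir sampling over a transactional stream: after t transactions,
    # each one is in the reservoir with probability capacity / t.

    def reservoir_sample(stream, capacity, seed=0):
        rng = random.Random(seed)
        reservoir = []
        for t, transaction in enumerate(stream, start=1):
            if len(reservoir) < capacity:
                reservoir.append(transaction)
            else:
                j = rng.randrange(t)        # replace an element with prob capacity / t
                if j < capacity:
                    reservoir[j] = transaction
        return reservoir

    stream = ({"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}, {"a"})
    print(reservoir_sample(stream, capacity=3))
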
Extracting Mutually Dependent Multisets
Abstract
In this paper, we extend mutually dependent patterns as itemsets, introduced by Ma and Hellerstein (2001), to mutually dependent multisets allowing two or more occurrences of the same items. Then, improving the Apriori-based algorithm that extracts all mutually dependent patterns by maintaining itemsets and their supports, we design an AprioriTid-based algorithm that extracts all mutually dependent multisets by traversing the database just once and maintaining both the multisets and their tail occurrences, without computing the overall multiplicity of items in the multisets. Finally, we give experimental results of applying the algorithm to both real data, namely antibiograms consisting of a date, a patient id, a detected bacterium, and so on, and artificial data obtained by repeating items in transaction data.
Natsuki Kiyota, Sho Shimamura, Kouichi Hirata

Bioinformatics

Frontmatter
LOCANDA: Exploiting Causality in the Reconstruction of Gene Regulatory Networks
Abstract
The reconstruction of gene regulatory networks via link prediction methods is receiving increasing attention due to the large availability of data, mainly produced by high throughput technologies. However, the reconstructed networks often suffer from a high amount of false positive links, which are actually the result of indirect regulation activities. Such false links are mainly due to the presence of common cause and common effect phenomena, which are typically present in gene regulatory networks. Existing methods for the identification of a transitive reduction of a network or for the removal of (possibly) redundant links suffer from limitations about the structure of the network or the nature/length of the indirect regulation, and often require additional pre-processing steps to handle specific peculiarities of the networks at hand (e.g., cycles).
In this paper, we propose the method LOCANDA, which overcomes these limitations and is able to identify and exploit indirect relationships of arbitrary length to remove links considered as false positives. This is performed by identifying indirect paths in the network and by comparing their reliability with that of direct links. Experiments performed on networks of two organisms (E. coli and S. cerevisiae) show a higher accuracy in the reconstruction with respect to the considered competitors, as well as a higher robustness to the presence of noise in the data.
Gianvito Pio, Michelangelo Ceci, Francesca Prisciandaro, Donato Malerba
Discovery of Salivary Gland Tumors’ Biomarkers via Co-Regularized Sparse-Group Lasso
Abstract
In this study, we discovered a panel of discriminative microRNAs in salivary gland tumors by application of statistical machine learning methods. We modelled multi-component interactions of salivary microRNAs to detect group-based associations among the features, enabling the distinction of malignant from benign tumors with a high predictive performance utilizing only seven microRNAs. Several of the identified microRNAs are separately known to be involved in cell cycle regulation. Integrated biological interpretation of identified microRNAs can provide potential new insights into the biology of salivary gland tumors and supports the development of non-invasive diagnostic tests to discriminate salivary gland tumor subtypes.
Sultan Imangaliyev, Johannes H. Matse, Jan G. M. Bolscher, Ruud H. Brakenhoff, David T. W. Wong, Elisabeth Bloemena, Enno C. I. Veerman, Evgeni Levin

Knowledge Discovery

Frontmatter
Measuring the Inspiration Rate of Topics in Bibliographic Networks
Abstract
Information diffusion is a widely studied topic thanks to its applications to social media/network analysis, viral marketing campaigns, influence maximization and prediction. In bibliographic networks, for instance, an information diffusion process takes place when some authors that publish papers in a given topic influence some of their neighbors (coauthors, citing authors, collaborators) to publish papers in the same topic, and the latter influence their neighbors in their turn. This well-accepted definition, however, does not consider that influence in bibliographic networks is a complex phenomenon involving several scientific and cultural aspects. In fact, in scientific citation networks, influential topics are usually considered to be those that spread most rapidly in the network. Although this is generally the case, this semantics does not account for the fact that topics in bibliographic networks evolve continuously. Knowledge, information and ideas are dynamic entities that acquire different meanings when passing from one person to another. Thus, in this paper, we propose a new definition of influence that captures the diffusion of inspiration within the network. We propose a measure of the inspiration rate called inspiration rank. Finally, we show the effectiveness of our measure in detecting the most inspiring topics in a citation network built upon a large bibliographic dataset.
Livio Bioglio, Valentina Rho, Ruggero G. Pensa
Discovering Minority Sub-clusters and Local Difficulty Factors from Imbalanced Data
Abstract
Learning classifiers from imbalanced data is particularly challenging when class imbalance is accompanied by local data difficulty factors, such as outliers, rare cases, class overlapping, or minority class decomposition. Although these issues have been highlighted in previous research, there have been no proposals of algorithms that simultaneously detect all the aforementioned difficulties in a dataset. In this paper, we put forward two extensions to popular clustering algorithms, ImKmeans and ImScan, and one novel algorithm, ImGrid, that attempt to detect minority sub-clusters, outliers, rare cases, and class overlapping. Experiments with artificial datasets show that ImGrid, which uses a Bayesian test to join similar neighboring regions, is able to re-discover simulated clusters and types of minority examples on par with competing methods, while being the least sensitive to parameter tuning.
Mateusz Lango, Dariusz Brzezinski, Sebastian Firlik, Jerzy Stefanowski
Fusion Techniques for Named Entity Recognition and Word Sense Induction and Disambiguation
Abstract
In this paper we explore the use of well-known multimodal fusion techniques to solve two prominent Natural Language Processing tasks. Specifically, we focus on solving Named Entity Recognition and Word Sense Induction and Disambiguation by applying feature-combination methods that have already shown their efficiency in the multimedia analysis domain. We present a series of experiments employing fusion techniques to combine textual linguistic features. Our intuition is that by combining different types of features we may find semantic relatedness among words at different levels, and thus the combination (and recombination) of these levels may yield performance gains. To our knowledge, employing these techniques has not been studied for the tasks we address in this paper. We test the proposed fusion techniques on three datasets for named entity recognition and one for word sense induction and disambiguation. Our results show that the combination of textual features indeed improves the performance compared to single-feature representations and trivial feature concatenation.
Edmundo-Pavel Soriano-Morales, Julien Ah-Pine, Sabine Loudcher
Backmatter
Metadata
Title
Discovery Science
Editors
Akihiro Yamamoto
Takuya Kida
Takeaki Uno
Tetsuji Kuboyama
Copyright Year
2017
Electronic ISBN
978-3-319-67786-6
Print ISBN
978-3-319-67785-9
DOI
https://doi.org/10.1007/978-3-319-67786-6
