Skip to main content

2015 | Buch

Discovery Science

18th International Conference, DS 2015, Banff, AB, Canada, October 4-6, 2015. Proceedings

insite
SUCHEN

Über dieses Buch

This book constitutes the proceedings of the 17th International Conference on Discovery Science, DS 2015, held in banff, AB, Canada in October 2015. The 16 long and 12 short papers presendted together with 4 invited talks in this volume were carefully reviewed and selected from 44 submissions. The combination of recent advances in the development and analysis of methods for discovering scienti c knowledge, coming from machine learning, data mining, and intelligent data analysis, as well as their application in various scienti c domains, on the one hand, with the algorithmic advances in machine learning theory, on the other hand, makes every instance of this joint event unique and attractive.

Inhaltsverzeichnis

Frontmatter
Resolution Transfer in Cancer Classification Based on Amplification Patterns
Abstract
In the current scientific age, the measurement technology has considerably improved and diversified producing data in different representations. Traditional machine learning and data mining algorithms can handle data only in a single representation in their standard form. In this contribution, we address an important challenge encountered in data analysis: what to do when the data to be analyzed are represented differently with regards to the resolution? Specifically, in classification, how to train a classifier when class labels are available only in one resolution and missing in the other resolutions? The proposed methodology learns a classifier in one data resolution and transfers it to learn the class labels in a different resolution. Furthermore, the methodology intuitively works as a dimensionality reduction method. The methodology is evaluated on a simulated dataset and finally used to classify cancers in a real–world multiresolution chromosomal aberration dataset producing plausible results.
Prem Raj Adhikari, Jaakko Hollmén
Very Short-Term Wind Speed Forecasting Using Spatio-Temporal Lazy Learning
Abstract
A wind speed forecast corresponds to an estimate of the upcoming production of a wind farm. The paper illustrates a variant of the Nearest Neighbor algorithm that yields wind speed forecasts, with a fast time resolution, for a (very) short time horizon. The proposed algorithm allows us to monitor a grid of wind farms, which collaborate by sharing information (i.e. wind speed measurements). It accounts for both spatial and temporal correlation of shared information. Experiments show that the presented algorithm is able to determine more accurate forecasts than a state-of-art statistical algorithm, namely auto. ARIMA.
Annalisa Appice, Sonja Pravilovic, Antonietta Lanza, Donato Malerba
Discovery of Parameters for Animation of Midge Swarms
Abstract
We describe a method of discovering suitable parameters for simulating and animating the swarming behaviour of non-biting midges (Diptera: Chironomidae). A characteristic of animal aggregations that can be emulated by software is the emergence of complex behaviours from simple rules. Here the well-characterized swarming behaviour of non-biting midges is used to create a rule-based behaviour model for them. To test the effectiveness of this model in creating the emergent qualities of real swarms, success criteria are derived from quantitative swarm data. We propose using a genetic algorithm to automate the identification of parameter settings that optimize the effectiveness of the model.
Judith Bjorndahl, Ashley Herman, Richard Hamilton, Howard J. Hamilton, Mark Brigham
No Sentiment is an Island
Sentiment Classification on Medical Forums
Abstract
In this study we propose a new method to classify sentiments in messages posted on online forums. Traditionally, sentiment classification relies on analysis of emotionally-charged words and discourse units found in the classified text. In coherent online discussions, however, messages’ non-lexical meta-information can be sufficient to achieve reliable classification results. Our empirical evidence is obtained through multi-class classification of messages posted on a medical forum.
Victoria Bobicev, Marina Sokolova
Active Learning for Classifying Template Matches in Historical Maps
Abstract
Historical maps are important sources of information for scholars of various disciplines. Many libraries are digitising their map collections as bitmap images, but for these collections to be most useful, there is a need for searchable metadata. Due to the heterogeneity of the images, metadata are mostly extracted by hand—if at all: many collections are so large that anything more than the most rudimentary metadata would require an infeasible amount of manual effort. We propose an active-learning approach to one of the practical problems in automatic metadata extraction from historical maps: locating occurrences of image elements such as text or place markers. For that, we combine template matching (to locate possible occurrences) with active learning (to efficiently determine a classification). Using this approach, we design a human computer interaction in which large numbers of elements on a map can be located reliably using little user effort. We experimentally demonstrate the effectiveness of this approach on real-world data.
Benedikt Budig, Thomas C. van Dijk
An Evaluation of Score Descriptors Combined with Non-linear Models of Expressive Dynamics in Music
Abstract
Expressive interpretation forms an important but complex aspect of music, in particular in certain forms of classical music. Modeling the relation between musical expression and structural aspects of the score being performed, is an ongoing line of research. Prior work has shown that some simple numerical descriptors of the score (capturing dynamics annotations and pitch) are effective for predicting expressive dynamics in classical piano performances. Nevertheless, the features have only been tested in a very simple linear regression model. In this work, we explore the potential of a non-linear model for predicting expressive dynamics. Using a set of descriptors that capture different types of structure in the musical score, we compare the predictive accuracies of linear and non-linear models. We show that, in addition to being (slightly) more accurate, non-linear models can better describe certain interactions between numerical descriptors than linear models.
Carlos Eduardo Cancino Chacón, Maarten Grachten
Geo-Coordinated Parallel Coordinates (GCPC): A Case Study of Environmental Data Analysis
Abstract
Knowledge discovery in scientific and research datasets is an extremely challenging problem due to the high dimensionality, heterogeneity, and complex relationships within the data. When these datasets also includes temporal and geospatial components, the challenges in analyzing the data become even more difficult. A number of visualization approaches have been developed and studied to support the exploration and analysis among such datasets, including parallel coordinate plots, dimensional subsetting, geovisualization, and multiple coordinated views. In this research, we combine and enhance these approaches in a system called Geo-Coordinated Parallel Coordinates (GCPC), with the goal of supporting interactive exploration, analytical reasoning, and knowledge discovery.
Maha El Meseery, Orland Hoeber
Generalized Shortest Path Kernel on Graphs
Abstract
We consider the problem of classifying graphs using graph kernels. We define a new graph kernel, called the generalized shortest path kernel, based on the number and length of shortest paths between nodes. For our example classification problem, we consider the task of classifying random graphs from two well-known families, by the number of clusters they contain. We verify empirically that the generalized shortest path kernel outperforms the original shortest path kernel on a number of datasets. We give a theoretical analysis for explaining our experimental results. In particular, we estimate distributions of the expected feature vectors for the shortest path kernel and the generalized shortest path kernel, and we show some evidence explaining why our graph kernel outperforms the shortest path kernel for our graph classification problem.
Linus Hermansson, Fredrik D. Johansson, Osamu Watanabe
Ensembles of Extremely Randomized Trees for Multi-target Regression
Abstract
In this work, we address the task of learning ensembles of predictive models for predicting multiple continuous variables, i.e., multi-target regression (MTR). In contrast to standard regression, where the output is a single scalar value, in MTR the output is a data structure – a tuple/vector of continuous variables. The task of MTR is recently gaining increasing interest by the research community due to its applicability in a practically relevant domains. More specifically, we consider the Extra-Tree ensembles – the overall top performer in the DREAM4 and DREAM5 challenges for gene network reconstruction. We extend this method for the task of multi-target regression and call the extension Extra-PCTs ensembles. As base predictive models, we propose to use predictive clustering trees (PCTs) – a generalization of decision trees for predicting structured outputs, including multiple continuous variables. We consider both global and local prediction of the multiple variables, the former based on a single model that predicts all of the target variables simultaneously and the latter based on a collection of models, each predicting a single target variable. We conduct an experimental evaluation of the proposed method on a collection of 10 benchmark datasets for with multiple continuous targets and compare its performance to random forests of PCTs. The results reveal that a multi-target Extra-PCTs ensemble performs statistically significantly better than a single multi-target or single-target PCT. Next, the performance among the different ensemble learning methods is not statistically significantly different, while multi-target Extra-PCTs ensembles are the best performing method. Finally, in terms of efficiency (running times and model complexity), both multi-target variants of the ensemble methods are more efficient and produce smaller models as compared to the single-target ensembles.
Dragi Kocev, Michelangelo Ceci
Clustering-Based Optimised Probabilistic Active Learning (COPAL)
Abstract
Facing ever increasing volumes of data but limited human annotation capacities, active learning approaches that allocate these capacities to the labelling of the most valuable instances gain in importance. A particular challenge is the active learning of arbitrary, user-specified adaptive classifiers in evolving datastreams.We address this challenge by proposing a novel clustering-based optimised probabilistic active learning (COPAL) approach for evolving datastreams. It combines established clustering techniques, inspired by semi-supervised learning, which are used to capture the structure of the unlabelled data, with the recently introduced probabilistic active learning approach, which is used for the selection among clusters. The labels actively selected by COPAL are then available for training an arbitrary adaptive stream classifier. The performance of our algorithm is evaluated on several synthetic and real-world datasets. The results show that it achieves a better accuracy for the same budget than other recently proposed active learning approaches for such evolving datastreams.
Georg Krempl, Tuan Cuong Ha, Myra Spiliopoulou
Predictive Analysis on Tracking Emails for Targeted Marketing
Abstract
In this work, we present our experiences using a learning model on predicting the “opens” and “unopens” of targeted marketing emails. The model is based on the features extracted from the emails and email recipients profiles. To achieve this, we have employed and evaluated two different classifiers and two different data sets using different feature sets. Our results demonstrate that it is possible to predict the rate for a targeted marketing email to be opened or not with approximately 78 % F1-measure.
Xiao Luo, Revanth Nadanasabapathy, A. Nur Zincir-Heywood, Keith Gallant, Janith Peduruge
Semi-supervised Learning for Stream Recommender Systems
Abstract
Recommender systems suffer from an extreme data sparsity that results from a large number of items and only a limited capability of users to perceive them. Only a small fraction of items can be rated by a single user. Consequently, there is plenty of unlabelled information that can be leveraged by semi-supervised methods. We propose the first semi-supervised framework for stream recommender systems that can leverage this information incrementally on a stream of ratings. We design several novel components, such as a sensitivity-based reliability measure, and extend a state-of-the-art matrix factorization algorithm by the capability to extend the dimensions of a matrix incrementally as new users and items occur in a stream. We show that our framework improves the quality of recommendations at nearly all time points in a stream.
Pawel Matuszyk, Myra Spiliopoulou
Detecting Transmembrane Proteins Using Decision Trees
Abstract
Transmembrane (TM) proteins are proteins that span a cell membrane; their segments crossing the membrane are called TM domains. TM domain and TM protein detection are important problems in computational biology, but typical machine learning approaches yield classifiers that are difficult to interpret and hence yield no biological insight. We study both TM domain and TM protein detection with easy to interpret decision trees. For TM domain detection, the use of decision trees is already reported in the literature, but we provide a critical study of the existing approach, resulting in improved feature sets as well as observations on how to avoid biased training and test sets. In particular, we discover a motif known to be common to TM domains that was not discovered in previous research using machine learning. For TM protein detection, we propose a 2-layer learning method. This method can be generalized to deal with a large class of string classification problems. The method achieves sensitivity and specificity values of up to 92 % on the settings we experimented with, while providing intuitive classifiers that are easy to interpret for the domain expert.
Mohammad Hossein Nikravan, Ashwani Kumar, Sandra Zilles
Change Point Detection for Information Diffusion Tree
Abstract
We propose a method of detecting the points at which the speed of information diffusion changed from an observed diffusion sequence data over a social network, explicitly taking the network structure into account. Thus, change in diffusion is both spatial and temporal. This is different from most of the existing change detection approaches in which all the diffusion information is projected on a single time line and the search is made in this time axis. We formulate this as a search problem of change points and their respective change rates under the framework of maximum log-likelihood embedded in MDL. Time complexity of the search is almost proportional to the number of observed data points and the method is very efficient. We tested this using both a real Twitter date (ground truth not known) and the synthetic data (ground truth known), and demonstrated that the proposed method can detect the change points efficiently and the results are very different from the existing sequence-based (time axis) approach (Kleinberg’s method).
Kouzou Ohara, Kazumi Saito, Masahiro Kimura, Hiroshi Motoda
Multi-label Classification via Multi-target Regression on Data Streams
Abstract
Multi-label classification is becoming more and more critical in data mining applications. Many efficient methods exist in the classical batch setting, however, in the streaming setting, comparatively few methods exist. In this paper, we propose a new methodology for multi-label classification via multi-target regression in a streaming setting and develop a streaming multi-target regressor iSOUP-Tree, which uses this approach. We experimentally evaluated two variants of the iSOUP-Tree algorithm, and determined that the use of regression trees is advisable over the use model trees. Furthermore, we compared our results to the state-of-the-art and found that the iSOUP-Tree method is comparable to the other streaming multi-label learners. This is a motivation for the potential use of iSOUP-Tree in an ensemble setting as a base learner.
Aljaž Osojnik, Panče Panov, Sašo Džeroski
Periodical Skeletonization for Partially Periodic Pattern Mining
Abstract
Finding periodical regularities in sequential databases is an important topic in Knowledge Discovery. In pattern mining such regularity is modeled as partially periodic patterns, where typical periods (e.g., daily or weekly) can be considered. Although efficient algorithms have been studied, applying them to real databases is still challenging because they are noisy and most transactions are not extremely frequent in practice. They cause a combinatorial explosion of patterns and the difficulty of tuning a threshold parameter. To overcome these issues we investigate a pre-processing method called skeletonization, which was recently introduced for finding sequential patterns. It tries to find clusters of symbols in patterns, aiming at shrinking the space of all possible patterns in order to avoid the combinatorial explosion and to provide comprehensive patterns. The key idea is to compute similarities within symbols in patterns from a given database based on the definition of patterns we would like to mine, and to use clustering methods based on the similarities computed. Although the original method cannot allow for periods, we generalize it by using the periodicity. We give experimental results using both synthetic and real datasets, and compare results of mining with and without the skeletonization, to see that our method helps us to obtain comprehensive partially periodic patterns.
Keisuke Otaki, Akihiro Yamamoto
Predicting Drugs Adverse Side-Effects Using a Recommender-System
Abstract
Adverse Drug Events (ADEs) are a major health problem, and developing accurate prediction methods may have a significant impact in public health. Ideally, we would like to have predictive methods, that could pinpoint possible ADRs during the drug development process. Unfortunately, most relevant information on possible ADRs is only available after the drug is commercially available. As a first step, we propose using prior information on existing interactions through recommendation systems algorithms. We have evaluated our proposal using data from the ADReCS database with promising results.
Diogo Pinto, Pedro Costa, Rui Camacho, Vítor Santos Costa
Dr. Inventor Framework: Extracting Structured Information from Scientific Publications
Abstract
Even if research communities and publishing houses are putting increasing efforts in delivering scientific articles as structured texts, nowadays a considerable part of on-line scientific literature is still available in layout-oriented data formats, like PDF, lacking any explicit structural or semantic information. As a consequence the bootstrap of textual analysis of scientific papers is often a time-consuming activity. We present the first version of the Dr. Inventor Framework, a publicly available collection of scientific text mining components useful to prevent or at least mitigate this problem. Thanks to the integration and the customization of several text mining tools and on-line services, the Dr. Inventor Framework is able to analyze scientific publications both in plain text and PDF format, making explicit and easily accessible core aspects of their structure and semantics. The facilities implemented by the Framework include the extraction of structured textual contents, the discursive characterization of sentences, the identifications of the structural elements of both papers header and bibliographic entries and the generation of graph based representations of text excerpts. The Framework is distributed as a Java library. We describe in detail the scientific mining facilities included in the Framework and present two use cases where the Framework is respectively exploited to boost scientific creativity and to generate RDF graphs from scientific publications.
Francesco Ronzano, Horacio Saggion
Predicting Protein Function and Protein-Ligand Interaction with the 3D Neighborhood Kernel
Abstract
Kernels for structured data have gained a lot of attention in a world with an ever increasing amount of complex data, generated from domains such as biology, chemistry, or engineering. However, while many applications involve spatial aspects, up to now only few kernel methods have been designed to take 3D information into account. We introduce a novel kernel called the 3D Neighborhood Kernel. As a first step, we focus on 3D structures of proteins and ligands, in which the atoms are represented as points in 3D space. By comparing the Euclidean distances between selected sets of atoms, the kernel can select spatial features that are important for determining functions of proteins or interactions with other molecules. We evaluate the kernel on a number of benchmark datasets and show that it obtains a competitive performance w.r.t. the state-of-the-art methods. While we apply this kernel to proteins and ligands, it is applicable to any kind of 3D data where objects follow a common schema, such as RNA, cars, or standardized equipment.
Leander Schietgat, Thomas Fannes, Jan Ramon
Hierarchical Multidimensional Classification of Web Documents with MultiWebClass
Abstract
Most of works on text categorization have focused on classifying documents into a set of categories with no relationships among them (flat classification). However, due to the intrinsic structure that can be found in many domains, recent works are focusing on more complex tasks, such as multi-label classification, hierarchical classification and multidimensional classification. In this paper, we propose the hierarchical multidimensional classification task, where documents can be classified according to different dimensions/viewpoints (e.g., topic, geographic area, time period, etc.), where in each dimension categories can be organized hierarchically. In particular, we propose the system MultiWebClass, a multidimensional variant of the system WebClassIII, which discovers correlations among categories belonging to different dimensions and exploits them, according to two different strategies, to refine the set of features used during the learning process. Experimental evaluation performed on both synthetic and real datasets confirms that the exploitation of correlations among categories can lead to better results in terms of classification accuracy, possibly reducing specialization error or generalization error, depending on the strategy adopted for the refinement of the feature sets.
Francesco Serafino, Gianvito Pio, Michelangelo Ceci, Donato Malerba
Evaluating the Effectiveness of Hashtags as Predictors of the Sentiment of Tweets
Abstract
Recently, there has been growing research interest in the sentiment analysis of tweets. However, there is still a need to examine the contribution of Twitter-specific features to this task. One such feature is hashtags, which are user-defined topics. In our study, we compare the performance of sentiment and non-sentiment hashtags in classifying tweets as positive or negative. By combining subjective words from different lexical resources, we achieve accuracy scores of 83.58 % and 83.83 % in identifying sentiment hashtags and non-sentiment hashtags, respectively. Furthermore, our accuracy scores surpass those scores obtained using models that apply a single lexical resource. We apply derived properties of sentiment and non-sentiment hashtags, including their sentiment polarity to classify tweets. Our best classification models achieve accuracy scores of 81.14 % and 86.07 % using sentiment hashtags and non-sentiment hashtags, respectively. Additionally, our models perform comparably to supervised machine learning algorithms, and outperform a scoring algorithm developed in a previous study.
Credell Simeon, Robert Hilderman
On the Feasibility of Discovering Meta-Patterns from a Data Ensemble
Abstract
We introduce meta-pattern discovery from a data ensemble, a new paradigm of pattern discovery which goes beyond the KDD process model. A data ensemble, which represents a set of data sets, seems to be more natural as a model of the big data (We focus on the volume and velocity aspects of the big data.). We propose two kinds of meta-patterns, each of which specifies patterns such as clusters for a set of data sets, for an unsupervised setting and a supervised one. Our solutions for these settings were shown to be feasible with one synthetic and two real data ensembles by experiments.
Einoshin Suzuki
An Algorithm for Influence Maximization in a Two-Terminal Series Parallel Graph and its Application to a Real Network
Abstract
We developed an algorithm to exactly solve an influence maximization problem (MaxInf) for a two-terminal series parallel graph (TTSPG) in the independent cascade model. The class of TTSPGs can be considered as a class wider than that of trees, only for which an efficient exact solver of this problem has been developed so far. Our algorithm calculates candidate node sets in the divide-and-conquer manner keeping the number of them as small as possible by efficiently eliminating unnecessary ones in merge of subproblems’ solutions. Furthermore, we propose a way of converting an arbitrary network to a TTSPG with edges important for propagation to apply our method to real networks. According to our empirical results, our method is significantly faster than the greedy approximation algorithm for MaxInf of a TTSPG. We also demonstrate improvement of solutions by converting to TTSPGs instead of trees using real networks made from DBLP datasets.
Koji Tabata, Atsuyoshi Nakamura, Mineichi Kudo
Benchmarking Stream Clustering for Churn Detection in Dynamic Networks
Abstract
Retaining users and customers is one of the most important challenges for the service industry from mobile communications to online gaming. As the users of these services form dynamic networks that grow in size, predicting ‘churners’ becomes harder and harder. In this work, we explore the use of anomaly detection for churn prediction. To this end, we evaluate bio-inspired and deterministic online clustering algorithms on both cell phone and online gaming data sets. We discuss the results of each technique from the perspective of: feature identification, sensitivity analysis of the parameters as well as their capacity to detect churn.
Serdar Baran Tatar, Andrew McIntyre, Nur Zincir-Heywood, Malcolm Heywood
Canonical Correlation Methods for Exploring Microbe-Environment Interactions in Deep Subsurface
Abstract
In this study, we apply non-linear kernelized canonical correlation analysis (KCCA) as well as primal-dual sparse canonical correlation analysis (SCCA) to the discovery of correlations between sulphate reducing bacterial taxa and their geochemical environment in the deep biosphere. For visualization of canonical patterns, we demonstrate the applicability of the correlation plot technique on kernelized data. Finally, we provide an extension to the visual analysis by clustergrams. The presented framework and visualization tools enabled extraction of latent canonical correlation patterns between the salinity of the groundwater and the bacterial taxonomic orders Desulfobacterales, Desulfovibrionales and Clostridiales.
Viivi Uurtio, Malin Bomberg, Kristian Nybo, Merja Itävaara, Juho Rousu
KeCo: Kernel-Based Online Co-agreement Algorithm
Abstract
We propose a kernel-based online semi-supervised algorithm that is applicable for large scale learning tasks. In particular, we use a multi-view learning framework and a co-agreement strategy to take into account unlabelled data and to improve classification performance of the algorithm. Unlike the standard online methods our algorithm is naturally applicable to many real-world situations where data is available in multiple representations. In addition our online algorithm allows learning non-linear relations in the data via kernel functions, that are efficiently embedded into the formulation of the algorithm. We test performance of the algorithm on several large-scale LIBSVM and UCI benchmark datasets and demonstrate improved performance in comparison to standard online learning methods. Last but not least, we make a Python implementation of our algorithm available for download (Available at https://​github.​com/​laurensvdwiel/​KeCo).
Laurens Wiel, Tom Heskes, Evgeni Levin
Tree PCA for Extracting Dominant Substructures from Labeled Rooted Trees
Abstract
We propose novel principal component analysis (PCA) for rooted labeled trees to discover dominant substructures from a collection of trees. The principal components of trees are defined in analogy to the ordinal principal component analysis on numerical vectors. Our methods substantially extend earlier work, in which the input data are restricted to binary trees or rooted unlabeled trees with unique vertex indexing, and the principal components are also restricted to the form of paths. In contrast, our extension allows the input data to accept general rooted labeled trees, and the principal components to have more expressive forms of subtrees instead of paths. For this extension, we can employ the technique of flexible tree matching; various mappings used in tree edit distance algorithms. We design an efficient algorithm using top-down mappings based on our framework, and show the applicability of our algorithm by applying it to extract dominant patterns from a set of glycan structures.
Tomoya Yamazaki, Akihiro Yamamoto, Tetsuji Kuboyama
Enumerating Maximal Clique Sets with Pseudo-Clique Constraint
Abstract
It is an important task in Data Mining and Social Network Analysis to detect dense subgraphs, namely pseudo-cliques in networks. Given a positive integer k designating an upper bound of the number of disconnections, some algorithms to enumerate k-plexes as pseudo-cliques have been proposed based on the anti-monotonicity property similar to the case of cliques. Those algorithms are however effective only for small k, since every vertex set with its size less than \(k+1\) is trivially a k-plex. Moreover, there still exist non-dense k-plexes with their sizes exceeding k. For these reasons, it has been a hard task to design an efficient k-plex enumerator for non-small k. This paper aims at developing a fast enumerator for finding densely connected k-plexes for non-small k, avoiding both of the small k-plexes and non-dense medium k plexes. For this purpose, we construct a clique-graph from the original input graph and consider meta-cliques of overlapping cliques satisfying several constraints about k-plexness and overlappingness using bond measure for set-theoretic correlation. We also show its usefulness by exhaustive experiments about the number of solution k-plexes, computational costs and even the quality of output k-plexes.
Hongjie Zhai, Makoto Haraguchi, Yoshiaki Okubo, Etsuji Tomita
Backmatter
Metadaten
Titel
Discovery Science
herausgegeben von
Nathalie Japkowicz
Stan Matwin
Copyright-Jahr
2015
Electronic ISBN
978-3-319-24282-8
Print ISBN
978-3-319-24281-1
DOI
https://doi.org/10.1007/978-3-319-24282-8

Premium Partner