
About this Book

This book constitutes the proceedings of the 17th International Conference on Discovery Science, DS 2014, held in Bled, Slovenia, in October 2014. The 30 full papers included in this volume were carefully reviewed and selected from 62 submissions. The papers cover topics such as: computational scientific discovery; data mining and knowledge discovery; machine learning and statistical methods; computational creativity; mining scientific data; data and knowledge visualization; knowledge discovery from scientific literature; mining text, unstructured and multimedia data; mining structured and relational data; mining temporal and spatial data; mining data streams; network analysis; discovery informatics; discovery and experimental workflows; knowledge capture and scientific ontologies; data and knowledge integration; logic and philosophy of scientific discovery; and applications of computational methods in various scientific domains.

Table of Contents

Frontmatter

Explaining Mixture Models through Semantic Pattern Mining and Banded Matrix Visualization

Semi-automated data analysis is possible for the end user if data analysis processes are supported by easily accessible tools and methodologies for pattern/model construction, explanation, and exploration. The proposed three-part methodology for multiresolution 0–1 data analysis consists of data clustering with mixture models, extraction of rules from clusters, as well as data, cluster, and rule visualization using banded matrices. The results of the three-part process (clusters, rules from clusters, and the banded structure of the data matrix) are finally merged in a unified visual banded matrix display. The incorporation of multiresolution data is enabled by the supporting ontology, which describes the relationships between the different resolutions and is used as background knowledge in the semantic pattern mining process of descriptive rule induction. The presented experimental use case highlights the usefulness of the proposed methodology for analyzing complex DNA copy number amplification data, studied in previous research, for which we provide new insights in terms of induced semantic patterns and cluster/pattern visualization.

Prem Raj Adhikari, Anže Vavpetič, Jan Kralj, Nada Lavrač, Jaakko Hollmén

Big Data Analysis of StockTwits to Predict Sentiments in the Stock Market

Online stock forums have become a vital investing platform for publishing relevant and valuable user-generated content (UGC), such as investment recommendations, allowing investors to view the opinions of a large number of users and to share and exchange trading ideas. This paper combines text mining, feature selection and Bayesian Networks to analyze and extract sentiments from stock-related micro-blogging messages called “StockTwits”. We investigate whether the collective sentiments of StockTwits can be predicted and how these predicted sentiments might help investors and their peers make profitable investment decisions in the stock market. Specifically, we build Bayesian Networks from terms identified in the tweets that are selected using wrapper feature selection. We then use textual visualization to provide a better understanding of the predicted relationships among sentiments and their related features.

Alya Al Nasseri, Allan Tucker, Sergio de Cesare

Synthetic Sequence Generator for Recommender Systems – Memory Biased Random Walk on a Sequence Multilayer Network

Personalized recommender systems rely on each user’s personal usage data in the system, in order to assist in decision making. However, privacy policies protecting users’ rights prevent these highly personal data from being publicly available to a wider researcher audience. In this work, we propose a memory biased random walk model on a multilayer sequence network, as a generator of synthetic sequential data for recommender systems. We demonstrate the applicability of the generated synthetic data in training recommender system models in cases when privacy policies restrict clickstream publishing.
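To make the generator concrete, here is a minimal sketch of a memory biased random walk in Python; the transition rule and the parameters (memory bias beta, memory window size) are illustrative assumptions rather than the authors' exact model:

```python
import random
from collections import deque

def memory_biased_walk(adj, start, steps, beta=0.5, memory=10, seed=0):
    """Generate a synthetic click sequence by a random walk that, with
    probability beta, revisits a recently visited neighbour (memory bias)
    and otherwise moves to a uniformly random neighbour.

    adj: dict mapping node -> list of neighbouring nodes.
    """
    rng = random.Random(seed)
    recent = deque(maxlen=memory)   # sliding window of visited nodes
    node, walk = start, [start]
    for _ in range(steps):
        neighbours = adj[node]
        if not neighbours:
            break
        remembered = [v for v in neighbours if v in recent]
        if remembered and rng.random() < beta:
            node = rng.choice(remembered)   # biased return to memory
        else:
            node = rng.choice(neighbours)   # ordinary random-walk step
        recent.append(node)
        walk.append(node)
    return walk

# toy item-transition network
adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
print(memory_biased_walk(adj, "a", 20))
```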

Nino Antulov-Fantulin, Matko Bošnjak, Vinko Zlatić, Miha Grčar, Tomislav Šmuc

Predicting Sepsis Severity from Limited Temporal Observations

Sepsis, an acute systemic inflammatory response syndrome caused by severe infection, is one of the leading causes of in-hospital mortality. Our recent work provides evidence that the mortality rate in sepsis patients can be significantly reduced by Hemoadsorption (HA) therapy whose duration is determined by a data-driven approach. The therapy optimization process requires predicting the high-mobility group protein B-1 (HMGB1) concentration 24 hours in the future. However, measuring sepsis biomarkers is very costly, and blood volume is limited, so the number of temporal observations available for training a regression model is small. The challenge addressed in this study is how to balance the trade-off between prediction accuracy and the limited number of temporal observations by selecting a sampling protocol (biomarker selection and frequency of measurements) appropriate for the prediction model and measurement noise level. In particular, to predict the HMGB1 concentration 24 hours ahead when limiting the number of blood drawings before therapy to three, we found that the accuracy of observing HMGB1 and three other cytokines (Lsel, TNF-alpha, and IL10) was comparable to observing eight cytokines that are commonly used as sepsis biomarkers. We found that blood drawings 1 hour apart are preferred when measurements are noise free, but in the presence of noise, blood drawings 3 hours apart are preferred. Compared to the data-driven approaches, the sampling protocol obtained using domain knowledge achieves similar accuracy at the same cost, but with half the number of blood drawings.

Xi Hang Cao, Ivan Stojkovic, Zoran Obradovic

Completion Time and Next Activity Prediction of Processes Using Sequential Pattern Mining

Process mining is a research discipline that aims to discover, monitor and improve real processes using event logs. In this paper we describe a novel approach that (i) identifies partial process models by exploiting sequential pattern mining and (ii) uses the additional information about the activities matching a partial process model to train nested prediction models from event logs. The models can be used to predict the next activity and completion time of a new (running) process instance. We compare our approach with a model based on Transition Systems implemented in the ProM5 Suite and show that the attributes in the event log can improve the accuracy of the model without decreasing performance. The experimental results show that our algorithm outperforms ProM5 by a large margin in predicting the completion time of a process, while it presents competitive results for next activity prediction.

Michelangelo Ceci, Pasqua Fabiana Lanotte, Fabio Fumarola, Dario Pietro Cavallo, Donato Malerba

Antipattern Discovery in Ethiopian Bagana Songs

This paper develops and applies sequential pattern mining to a corpus of songs for the bagana, a large lyre played in Ethiopia. An important aspect of this repertoire is the unique availability of rare motifs that have been used by a master bagana teacher in Ethiopia. The method is applied to find antipatterns: patterns that are surprisingly rare in a corpus of bagana songs. In contrast to previous work, this is performed without an explicit set of background pieces. The results of this study show that data mining methods can reveal these antipatterns of interest with high significance, based on the computational analysis of a small corpus of bagana songs.

Darrell Conklin, Stéphanie Weisser

Categorize, Cluster, and Classify: A 3-C Strategy for Scientific Discovery in the Medical Informatics Platform of the Human Brain Project

One of the goals of the European Flagship Human Brain Project is to create a platform that will enable scientists to search for new biologically and clinically meaningful discoveries by making use of a large database of neurological data enlisted from many hospitals. While the patients whose data will be available have been diagnosed, there is a widespread concern that their diagnosis, which relies on current medical classification, may be too wide and ambiguous and thus hides important scientific information.

We therefore offer a strategy for such a search, which combines supervised and unsupervised learning in three steps: Categorization, Clustering and Classification. This 3-C strategy runs as follows: using external medical knowledge, we categorize the available set of features into three types: the patients’ assigned disease diagnosis, clinical measurements, and potential biological markers, where the latter may include genomic and brain imaging information. In order to reduce the number of clinical measurements, a supervised learning algorithm (Random Forest) is applied and only the best predicting features are kept. We then use unsupervised learning in order to create new clinical manifestation classes, based on clustering the selected clinical measurements. Profiles of these clinical manifestation classes are described visually using profile plots and analytically using decision trees, in order to facilitate their clinical interpretation. Finally, we classify the new clinical manifestation classes by relying on the potential biological markers. Our strategy strives to connect potential biomarkers with classes of clinical and functional manifestation, both expressed by meaningful features. We demonstrate this strategy using data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) cohort.
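A minimal sketch of the three steps with scikit-learn; the synthetic data, feature counts, and cut-offs are purely illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 200
clinical = rng.normal(size=(n, 10))        # categorized: clinical measurements
biomarkers = rng.normal(size=(n, 5))       # categorized: potential biomarkers
diagnosis = rng.integers(0, 2, size=n)     # categorized: assigned diagnosis

# Step 1 (reduce clinical features): keep the measurements that best predict
# the assigned diagnosis, ranked by random-forest importance.
rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(clinical, diagnosis)
keep = np.argsort(rf.feature_importances_)[-4:]          # top 4, arbitrary cut
selected = clinical[:, keep]

# Step 2 (cluster): derive new clinical-manifestation classes.
manifest = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(selected)

# Describe the clusters with a decision tree for clinical interpretation.
describer = DecisionTreeClassifier(max_depth=3).fit(selected, manifest)

# Step 3 (classify): predict the manifestation classes from biomarkers.
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(biomarkers, manifest)
print("biomarker->class accuracy:", clf.score(biomarkers, manifest))
```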

Tal Galili, Alexis Mitelpunkt, Netta Shachar, Mira Marcus-Kalish, Yoav Benjamini

Multilayer Clustering: A Discovery Experiment on Country Level Trading Data

The topic of this work is a novel clustering methodology based on instance similarity in two or more attribute layers. The work is motivated by multi-view clustering and redescription mining algorithms. In our approach we do not construct descriptions of subsets of instances, and we do not use the conditional independence assumption of different views. We perform bottom-up merging of clusters only if it reduces an example variability score for all layers. The score is defined as a two-component sum of squared deviations of example similarity values. For a given set of instances, the similarity values are computed by executing an artificially constructed supervised classification problem. As a final result we identify small but coherent clusters. The methodology is illustrated on a real-life discovery task aimed at identifying relevant subgroups of countries with similar trading characteristics with respect to the type of commodities they export.

Dragan Gamberger, Matej Mihelčić, Nada Lavrač

Medical Document Mining Combining Image Exploration and Text Characterization

With an ever-growing number of published scientific studies, there is a need for automated search methods able to collect and extract as much information as possible from those articles. We propose a framework for the extraction and characterization of brain activity areas published in neuroscientific reports, as well as a suitable clustering strategy for those areas. We further show that it is possible to obtain three-dimensional summarizing brain maps accounting for a particular topic within those studies. Afterwards, using the text information from the articles, we characterize such maps. As an illustrative experiment, we demonstrate the proposed mining approach on fMRI reports of default mode networks. The proposed method hints at the possibility of searching for both visual and textual keywords in neuro atlases.

Nicolau Gonçalves, Erkki Oja, Ricardo Vigário

Mining Cohesive Itemsets in Graphs

Discovering patterns in graphs is a well-studied field of data mining. While a lot of work has already gone into finding structural patterns in graph datasets, we focus on relaxing the structural requirements in order to find items that often occur near each other in the input graph. By doing this, we significantly reduce the search space and simplify the output. We look for itemsets that are both frequent and cohesive, which enables us to use the anti-monotonicity property of the frequency measure to speed up our algorithm. We experimentally demonstrate that our method can handle larger and more complex datasets than the existing methods that either run out of memory or take too long.
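To make "cohesive" concrete, the sketch below shows one plausible instantiation of a cohesion score (not necessarily the paper's exact measure): for every vertex carrying an item of the itemset, compute the smallest BFS radius within which all items of the itemset occur, and average these radii; smaller averages mean the items occur nearer to each other.

```python
from collections import deque

def min_radius_covering(adj, labels, start, itemset):
    """Smallest BFS radius around `start` within which every item of
    `itemset` appears on some vertex (including `start` itself)."""
    missing = set(itemset) - {labels[start]}
    seen, frontier, radius = {start}, [start], 0
    while missing and frontier:
        radius += 1
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    missing.discard(labels[v])
                    nxt.append(v)
        frontier = nxt
    return radius if not missing else float("inf")

def avg_covering_radius(adj, labels, itemset):
    """Average covering radius over all vertices labelled with an item of
    the itemset; lower values mean a more cohesive itemset."""
    starts = [v for v in adj if labels[v] in itemset]
    radii = [min_radius_covering(adj, labels, v, itemset) for v in starts]
    return sum(radii) / len(radii)

adj = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
labels = {1: "a", 2: "b", 3: "a", 4: "c"}
print(avg_covering_radius(adj, labels, {"a", "b"}))  # a and b are adjacent
```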

Tayena Hendrickx, Boris Cule, Bart Goethals

Mining Rank Data

This paper addresses the problem of mining rank data, that is, data in the form of rankings (total orders) of an underlying set of items. More specifically, two types of patterns are considered, namely frequent subrankings and dependencies between such rankings in the form of association rules. Algorithms for mining patterns of this kind are proposed and illustrated on three case studies.
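For illustration, a subranking is contained in a ranking if its items appear there in the same relative order; support counting then works as in standard frequent pattern mining. The brute-force sketch below is for illustration only, not the paper's algorithm:

```python
from itertools import permutations

def contains(ranking, pattern):
    """True if `pattern` (a tuple of items) occurs in `ranking` in the
    same relative order, i.e. is a subranking of it."""
    pos = {item: i for i, item in enumerate(ranking)}
    idx = [pos[item] for item in pattern if item in pos]
    return len(idx) == len(pattern) and idx == sorted(idx)

def frequent_subrankings(data, items, length, minsup):
    """Enumerate all subrankings of the given length whose relative
    support in `data` reaches `minsup` (brute force over permutations)."""
    out = {}
    for pattern in permutations(items, length):
        sup = sum(contains(r, pattern) for r in data) / len(data)
        if sup >= minsup:
            out[pattern] = sup
    return out

data = [("a", "b", "c", "d"), ("a", "c", "b", "d"), ("b", "a", "c", "d")]
print(frequent_subrankings(data, "abcd", 2, minsup=0.6))
```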

Sascha Henzgen, Eyke Hüllermeier

Link Prediction on the Semantic MEDLINE Network

An Approach to Literature-Based Discovery

Retrieving and linking different segments of scientific information into understandable and interpretable knowledge is a challenging task. Literature-based discovery (LBD) is a methodology for automatically generating hypotheses for scientific research by uncovering hidden, previously unknown relationships from existing knowledge (published literature). Semantic MEDLINE is a database consisting of semantic predications extracted from MEDLINE citations. The predications provide a normalized form of the meaning of the text. The associations between the concepts in these predications can be described in terms of a network, consisting of nodes and directed arcs, where the nodes represent biomedical concepts and the arcs represent their semantic relationships. In this paper we propose and evaluate a methodology for link prediction of implicit relationships in the Semantic MEDLINE network. Link prediction was performed using different similarity measures including common neighbors, Jaccard index, and preferential attachment. The proposed approach is complementary to, and may augment, existing LBD approaches. The analyzed network consisted of 231,589 nodes and 10,061,747 directed arcs. The results showed high prediction performance, with the common neighbors method providing the best area under the ROC curve of 0.96.
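For reference, the three similarity measures mentioned are simple functions of node neighbourhoods. A minimal sketch, using undirected neighbour sets for brevity (the Semantic MEDLINE network itself is directed):

```python
def common_neighbors(nbrs, u, v):
    # number of concepts linked to both u and v
    return len(nbrs[u] & nbrs[v])

def jaccard(nbrs, u, v):
    union = nbrs[u] | nbrs[v]
    return len(nbrs[u] & nbrs[v]) / len(union) if union else 0.0

def preferential_attachment(nbrs, u, v):
    # product of degrees: well-connected concepts tend to gain links
    return len(nbrs[u]) * len(nbrs[v])

# toy neighbour sets
nbrs = {"aspirin": {"pain", "inflammation"},
        "ibuprofen": {"pain", "inflammation", "fever"},
        "insulin": {"glucose"}}
for f in (common_neighbors, jaccard, preferential_attachment):
    print(f.__name__, f(nbrs, "aspirin", "ibuprofen"))
```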

Andrej Kastrin, Thomas C. Rindflesch, Dimitar Hristovski

Medical Image Retrieval Using Multimodal Data

In this paper we propose a system for medical image retrieval using multimodal data. The system can be separated into an off-line and an on-line phase. The off-line phase deals with modality classification of the images by their visual content; we use state-of-the-art opponentSIFT visual features to describe the image content and SVMs for the classification. The modality classification labels all images in the database with their corresponding modality. The off-line phase also implements the text-based retrieval structure of the system: we index the text associated with the images using the open-source search engine Terrier IR. The retrieval is performed in the on-line phase. The system receives a text query, processes it, and performs text-based retrieval with Terrier IR to generate the initial results. Afterwards, the images in the initial results are re-ranked based on their modality and the final results are provided. Our system was evaluated on the standardized ImageCLEF 2013 medical dataset and reported a mean average precision of 0.32, which is state-of-the-art performance on the dataset.

Ivan Kitanovski, Ivica Dimitrovski, Gjorgji Madjarov, Suzana Loskovska

Fast Computation of the Tree Edit Distance between Unordered Trees Using IP Solvers

We propose a new method for computing the tree edit distance between two unordered trees by problem encoding. Our method transforms an instance of the computation into an instance of an integer programming (IP) problem and solves it with an efficient IP solver. The tree edit distance is defined as the minimum cost of a sequence of edit operations (substitution, deletion, or insertion) that transforms one tree into another. Although the problem is NP-hard, several encoding techniques have been proposed for computational efficiency, for example an encoding into the clique problem. As a new encoding method, we propose to use IP solvers and provide new IP formulations representing the problem of finding the minimum-cost mapping between two unordered trees, where the minimum cost exactly coincides with the tree edit distance. Moreover, our method can efficiently compute variations of the tree edit distance by adding constraints to the formulation. Our experimental results on Glycan datasets and the Web log dataset CSLOGS show that our method is much faster than an existing method when input trees have a large degree. We also show that two variations of the tree edit distance can be computed efficiently by IP solvers.
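A sketch of one possible IP encoding of the minimum-cost mapping, written with the PuLP modelling library; unit edit costs and the tiny example trees are assumptions for illustration, not the paper's exact formulation:

```python
import pulp

def tree_edit_distance_ip(nodes1, nodes2, anc1, anc2, label1, label2):
    """nodes*: lists of node ids; anc*: set of (ancestor, descendant)
    pairs; label*: dict node -> label. Unit cost for every operation."""
    prob = pulp.LpProblem("tree_mapping", pulp.LpMinimize)
    x = pulp.LpVariable.dicts("x", (nodes1, nodes2), cat="Binary")

    # each node is mapped at most once
    for i in nodes1:
        prob += pulp.lpSum(x[i][j] for j in nodes2) <= 1
    for j in nodes2:
        prob += pulp.lpSum(x[i][j] for i in nodes1) <= 1

    # mapped pairs must agree on the ancestor relation (unordered trees)
    for i1 in nodes1:
        for i2 in nodes1:
            if i1 == i2:
                continue
            for j1 in nodes2:
                for j2 in nodes2:
                    if j1 == j2:
                        continue
                    if ((i1, i2) in anc1) != ((j1, j2) in anc2):
                        prob += x[i1][j1] + x[i2][j2] <= 1

    # cost: relabelling for mapped pairs, deletion/insertion for the rest
    sub = pulp.lpSum((0 if label1[i] == label2[j] else 1) * x[i][j]
                     for i in nodes1 for j in nodes2)
    dele = pulp.lpSum(1 - pulp.lpSum(x[i][j] for j in nodes2) for i in nodes1)
    ins = pulp.lpSum(1 - pulp.lpSum(x[i][j] for i in nodes1) for j in nodes2)
    prob += sub + dele + ins
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(prob.objective)

# T1: a(b, c)   T2: a(b)   ->  distance 1 (delete c)
print(tree_edit_distance_ip(
    [1, 2, 3], [4, 5], {(1, 2), (1, 3)}, {(4, 5)},
    {1: "a", 2: "b", 3: "c"}, {4: "a", 5: "b"}))
```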

Seiichi Kondo, Keisuke Otaki, Madori Ikeda, Akihiro Yamamoto

Probabilistic Active Learning: Towards Combining Versatility, Optimality and Efficiency

Mining data with minimal annotation costs requires efficient active learning approaches that ideally select the optimal candidate for labelling under a user-specified classification performance measure. Common generic approaches, usable with any classifier and any performance measure, are either slow, like error reduction, or heuristic, like uncertainty sampling. In contrast, our Probabilistic Active Learning (PAL) approach offers versatility, direct optimisation of a performance measure and computational efficiency. Given a labelling candidate from a pool, PAL models both the candidate’s label and the true posterior in its neighbourhood as random variables. By computing the expectation of the gain in classification performance over both random variables, PAL then selects the candidate that in expectation will improve the classification performance the most. Extending our recent poster, we discuss the properties of PAL and perform a thorough experimental evaluation on several synthetic and real-world data sets of different sizes. Results show comparable or better classification performance than error reduction and uncertainty sampling, yet PAL has the same asymptotic time complexity as uncertainty sampling and is faster than error reduction.
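The core computation can be sketched numerically: treat the true posterior in the candidate's neighbourhood as Beta-distributed given the observed label statistics, treat the candidate's label as Bernoulli, and average the accuracy gain over both. The following is a simplified sketch under those assumptions, not the authors' implementation:

```python
import numpy as np

def probabilistic_gain(n_pos, n, grid_size=2001):
    """Expected accuracy gain in a candidate's neighbourhood from buying
    one more label: average over the Beta posterior of the true class
    probability p and the Bernoulli label y drawn from it."""
    p = np.linspace(0.0, 1.0, grid_size)
    w = p**n_pos * (1.0 - p)**(n - n_pos)   # Beta(n_pos+1, n-n_pos+1) shape
    w /= w.sum()
    acc_now = np.where(2 * n_pos >= n, p, 1.0 - p)            # majority vote
    acc_pos = np.where(2 * (n_pos + 1) >= n + 1, p, 1.0 - p)  # if y = 1
    acc_neg = np.where(2 * n_pos >= n + 1, p, 1.0 - p)        # if y = 0
    gain = p * acc_pos + (1.0 - p) * acc_neg - acc_now
    return float((w * gain).sum())

print(probabilistic_gain(n_pos=1, n=2))    # uncertain region: positive gain
print(probabilistic_gain(n_pos=9, n=10))   # confident region: ~no gain
```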

Georg Krempl, Daniel Kottke, Myra Spiliopoulou

Incremental Learning with Social Media Data to Predict Near Real-Time Events

In this paper, we focus on the problem of predicting particular user activities in social media. Our challenge is to consider real events, such as posting messages to friends, forwarding received ones, and connecting to new friends, and to provide near real-time prediction of new events. Our approach is based on latent factor models, which can simultaneously exploit the timestamped interaction information among users and their posted content. We propose a simple strategy to learn the latent factors incrementally at each time step. Our method uses only recent data to update the latent factor models and can thus reduce the computational cost. Experiments on a real dataset collected from Twitter show that our method achieves performance comparable with other state-of-the-art non-incremental techniques.

Duc Kinh Le Tran, Cécile Bothorel, Pascal Cheung Mon Chan, Yvon Kermarrec

Stacking Label Features for Learning Multilabel Rules

Dependencies between the labels are commonly regarded as the crucial issue in multilabel classification. Rules provide a natural way of symbolically describing such relationships: for instance, rules with label tests in the body allow for representing directed dependencies like implications, subsumptions, or exclusions. Moreover, rules naturally allow for jointly capturing both local and global label dependencies.

We present a bootstrapped stacking approach which uses a common rule learner in order to induce label-dependent rules. For this, we learn a separate ruleset for each label, but include the remaining labels as additional attributes in the training instances (as sketched after this abstract). Proceeding this way, label dependencies can be made explicit in the rules. Our experiments show competitive results in terms of the standard multilabel evaluation measures. More importantly, using these additional attributes is shown to make it possible to discover and consider label relations, as well as to better comprehend the available multilabel datasets.

However, this approach is only a first step towards integrating multilabel rule learning directly into the rule induction process, e.g., into typical separate-and-conquer rule learners. We present future perspectives, advantages, and arising issues in this regard.
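A minimal sketch of the stacking idea with scikit-learn, using decision trees as a stand-in for the rule learner; the synthetic data and the simple iterative bootstrapping loop are illustrative assumptions:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))
Y = (X[:, :3] + rng.normal(scale=0.5, size=(300, 3)) > 0).astype(int)  # 3 labels

models = []
for j in range(Y.shape[1]):
    # stack the remaining labels onto the feature space for label j
    others = np.delete(Y, j, axis=1)
    Xj = np.hstack([X, others])
    models.append(DecisionTreeClassifier(max_depth=4).fit(Xj, Y[:, j]))

def predict(x, n_rounds=3):
    """Bootstrapped prediction: start from a guess for the label vector,
    then repeatedly re-predict each label given the current others."""
    y = np.zeros(Y.shape[1], dtype=int)
    for _ in range(n_rounds):
        for j, m in enumerate(models):
            xj = np.hstack([x, np.delete(y, j)])
            y[j] = m.predict(xj.reshape(1, -1))[0]
    return y

print(predict(X[0]))
```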

Eneldo Loza Mencía, Frederik Janssen

Selective Forgetting for Incremental Matrix Factorization in Recommender Systems

Recommender Systems are used to build models of users’ preferences. Those models should reflect the current state of the preferences at any point in time. The preferences, however, are not static: they are subject to concept drift or even shift, as known from stream mining, and undergo permanent changes as users' tastes and their perception of items change over time. Therefore, it is crucial to select the currently valid data for training models and to forget the outdated data.

The problem of selective forgetting in recommender systems has not been addressed so far. We therefore propose two forgetting techniques for incremental matrix factorization and incorporate them into a stream recommender. We use a stream-based algorithm that adapts continuously to changes, so that the forgetting techniques have an immediate effect on recommendations. We introduce a new evaluation protocol for recommender systems in a streaming environment and show that forgetting outdated data substantially increases the quality of recommendations.
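One simple way to realize forgetting in incremental matrix factorization is to shrink a user's latent factors before each stream update, so that older observations gradually lose influence. A hedged sketch (the decay rule below is an assumption, not necessarily one of the paper's two techniques):

```python
import numpy as np

class ForgettingMF:
    """Incremental matrix factorization (SGD) with exponential forgetting:
    a user's latent vector is shrunk before each new observation, so old
    preferences gradually lose influence."""
    def __init__(self, n_users, n_items, k=8, lr=0.05, reg=0.02, decay=0.98):
        rng = np.random.default_rng(0)
        self.P = rng.normal(scale=0.1, size=(n_users, k))  # user factors
        self.Q = rng.normal(scale=0.1, size=(n_items, k))  # item factors
        self.lr, self.reg, self.decay = lr, reg, decay

    def update(self, u, i, rating):
        self.P[u] *= self.decay                 # forget a bit of the past
        err = rating - self.P[u] @ self.Q[i]
        self.P[u] += self.lr * (err * self.Q[i] - self.reg * self.P[u])
        self.Q[i] += self.lr * (err * self.P[u] - self.reg * self.Q[i])

    def predict(self, u, i):
        return self.P[u] @ self.Q[i]

mf = ForgettingMF(n_users=10, n_items=10)
for u, i, r in [(0, 1, 5.0), (0, 2, 1.0), (0, 1, 5.0)]:  # a tiny stream
    mf.update(u, i, r)
print(round(mf.predict(0, 1), 2))
```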

Pawel Matuszyk, Myra Spiliopoulou

Providing Concise Database Covers Instantly by Recursive Tile Sampling

Known pattern discovery algorithms for finding tilings (covers of 0/1-databases consisting of 1-rectangles) cannot be integrated into instant and interactive KD tools, because they do not satisfy at least one of two key requirements: a) to provide results within a short response time of only a few seconds and b) to return a concise set of patterns with only a few elements that nevertheless covers a large fraction of the input database. In this paper we present a novel randomized algorithm that works well under these requirements. It is based on the recursive application of a simple tile sampling procedure that can be implemented efficiently using rejection sampling. While our analysis shows that the theoretical solution distribution can be weak in the worst case, the approach performs very well in practice and outperforms previous sampling algorithms as well as deterministic ones.

Sandy Moens, Mario Boley, Bart Goethals

Resampling-Based Framework for Estimating Node Centrality of Large Social Network

We address the problem of efficiently estimating the value of a centrality measure for a node in a large social network, using only a partial network generated by sampling nodes from the entire network. To this end, we propose a resampling-based framework to estimate the approximation error, defined as the difference between the true and the estimated values of the centrality. We experimentally evaluate the fundamental performance of the proposed framework with the closeness and betweenness centralities on three real-world networks. The results show that, compared with the traditionally used standard error, the framework estimates the approximation error more tightly and more precisely at a confidence level of 95%, even for a small partial network. They also show that we could potentially identify the top nodes, and possibly rank them, in a given centrality measure with a high confidence level from only a small partial network.
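A minimal sketch of the idea for closeness centrality with networkx: estimate the centrality of a target node from a sampled partial network, then bootstrap-resample the sample to bound the approximation error (the paper's exact resampling scheme may differ):

```python
import random
import networkx as nx

def closeness_with_error(G, target, sample_size, n_boot=200, seed=0):
    rng = random.Random(seed)
    others = [v for v in G if v != target]
    sample = rng.sample(others, sample_size)

    def estimate(nodes):
        # closeness of `target` in the subgraph induced by the sample
        H = G.subgraph(nodes + [target])
        return nx.closeness_centrality(H, u=target)

    point = estimate(sample)
    # bootstrap: resample the sampled nodes with replacement
    boots = [estimate(list(set(rng.choices(sample, k=sample_size))))
             for _ in range(n_boot)]
    boots.sort()
    lo, hi = boots[int(0.025 * n_boot)], boots[int(0.975 * n_boot)]
    return point, (lo, hi)

G = nx.barabasi_albert_graph(500, 3, seed=1)
print(closeness_with_error(G, target=0, sample_size=100))
```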

Kouzou Ohara, Kazumi Saito, Masahiro Kimura, Hiroshi Motoda

Detecting Maximum k-Plex with Iterative Proper ℓ-Plex Search

In this paper, we are concerned with the notion of k-plex, a relaxation model of clique in which the degree of relaxation is controlled by the parameter k. In particular, we present an efficient algorithm for detecting a maximum k-plex in a given simple undirected graph. Existing algorithms for extracting a maximum k-plex do not work well for larger k-values because the number of k-plexes grows exponentially as k becomes larger. In order to design an efficient algorithm for the problem, we introduce a notion of properness of k-plexes. Our algorithm iteratively finds a maximum proper ℓ-plex, decreasing the value of ℓ from k to 1. At each iteration stage, the maximum size of the proper ℓ-plexes found so far serves as an effective lower bound which makes our branch-and-bound pruning more powerful. Our experimental results for several benchmark graphs show that our algorithm can detect maximum k-plexes much faster than SPLEX, the most efficient existing algorithm.
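For reference, the defining condition is easy to state in code: a vertex set S is a k-plex if every vertex of S is adjacent to at least |S| - k other vertices of S (a clique is the special case k = 1). A minimal checker:

```python
def is_k_plex(adj, S, k):
    """adj: dict vertex -> set of neighbours; S: candidate vertex set."""
    S = set(S)
    return all(len(adj[v] & (S - {v})) >= len(S) - k for v in S)

# a 4-cycle is a 2-plex but not a clique
adj = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(is_k_plex(adj, {0, 1, 2, 3}, k=2))  # True
print(is_k_plex(adj, {0, 1, 2, 3}, k=1))  # False
```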

Yoshiaki Okubo, Masanobu Matsudaira, Makoto Haraguchi

Exploiting Bhattacharyya Similarity Measure to Diminish User Cold-Start Problem in Sparse Data

Collaborative Filtering (CF) is one of the most successful approaches for personalized product recommendations. Neighborhood-based collaborative filtering is an important class of CF; it is a simple and efficient approach to product recommendation widely used in the commercial domain. However, neighborhood-based CF suffers from the user cold-start problem, which becomes severe when neighborhood-based CF is used on sparse rating data. In this paper, we propose an effective similarity measure to address the user cold-start problem in sparse rating datasets. Unlike existing measures, our proposed approach can find neighbors in the absence of co-rated items. To show the effectiveness of this measure under the cold-start scenario, we experimented with real rating datasets. Experimental results show that CF based on our approach outperforms CFs based on state-of-the-art measures for the cold-start problem.
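The Bhattacharyya coefficient between two rating distributions can be computed without any co-rated items, which is what makes it attractive under cold-start. A minimal sketch of the coefficient itself (the paper's full similarity measure combines further components):

```python
from math import sqrt
from collections import Counter

def bhattacharyya(ratings_a, ratings_b, scale=(1, 2, 3, 4, 5)):
    """Similarity of two rating distributions: sum over the rating scale
    of sqrt(p_a(r) * p_b(r)); equals 1 for identical distributions."""
    ca, cb = Counter(ratings_a), Counter(ratings_b)
    na, nb = len(ratings_a), len(ratings_b)
    return sum(sqrt((ca[r] / na) * (cb[r] / nb)) for r in scale)

# two items rated by disjoint user sets can still be compared
print(bhattacharyya([5, 4, 5, 5], [5, 5, 4]))   # high: similar distributions
print(bhattacharyya([5, 4, 5, 5], [1, 2, 1]))   # low: dissimilar
```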

Bidyut Kr. Patra, Raimo Launonen, Ville Ollikainen, Sukumar Nandi

Failure Prediction – An Application in the Railway Industry

Machine or system failures have a high impact at both the technical and economic level. Most modern equipment has logging systems that allow us to collect a diversity of data regarding its operation and health. Using data mining models for novelty detection enables us to explore those datasets and build classification systems that can detect and issue an alert when a failure starts evolving, avoiding an unnoticed evolution up to breakdown. In the present case we use a failure detection system to predict train door breakdowns before they happen, using data from the doors' logging system. We study three methods for failure detection: outlier detection, novelty detection and a supervised SVM. Given the problem’s features, namely the possibility of a passenger interrupting the movement of a door, all three predictors are prone to false alarms. The main contribution of this work is the use of a low-pass filter to process the output of the predictors, leading to a strong reduction in the false alarm rate.
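An exponential moving average is one common form of low-pass filter for this purpose. The sketch below illustrates, with made-up scores and an illustrative threshold, how smoothing suppresses an isolated false alarm while a sustained failure signal still passes:

```python
def low_pass(scores, alpha=0.2):
    """First-order low-pass (exponential moving average) of alarm scores:
    isolated spikes are damped, persistent elevations pass through."""
    out, y = [], 0.0
    for s in scores:
        y = alpha * s + (1 - alpha) * y
        out.append(y)
    return out

# raw detector output: one spurious spike, then a sustained failure signal
raw = [0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1]
smooth = low_pass(raw)
threshold = 0.5
print([i for i, v in enumerate(raw) if v >= threshold])     # includes the spike
print([i for i, v in enumerate(smooth) if v >= threshold])  # only the sustained run
```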

Pedro Pereira, Rita P. Ribeiro, João Gama

Wind Power Forecasting Using Time Series Cluster Analysis

The growing integration of wind turbines into the power grid can only be balanced with precise forecasts of upcoming energy production. This information serves as the basis for operation and management strategies for a reliable and economical integration into the power grid. A precise forecast needs to overcome the problem of variable energy production caused by fluctuating weather conditions. In this paper, we define a data mining approach that processes a past set of wind power measurements of a wind turbine and extracts a robust prediction model. We resort to a time series clustering algorithm to extract a compact, informative representation of the time series of wind power measurements in the past set, and we use the cluster prototypes for predicting the upcoming wind power of the turbine. We illustrate a case study with real data collected from a wind turbine installed in the Apulia region.
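A minimal sketch of prototype-based forecasting: cluster historical daily power profiles with k-means and predict the remainder of a new day from the closest prototype. The profile length, cluster count, and synthetic data are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# toy history: 300 daily profiles of 24 hourly wind power values
days = np.vstack([np.clip(np.sin(np.linspace(0, np.pi, 24)) * a
                          + rng.normal(scale=0.1, size=24), 0, None)
                  for a in rng.uniform(0.2, 1.0, size=300)])

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(days)
prototypes = km.cluster_centers_

def forecast(partial):
    """Match the first hours of a new day to the nearest prototype and
    return the prototype's values for the remaining hours."""
    h = len(partial)
    dists = ((prototypes[:, :h] - partial) ** 2).sum(axis=1)
    return prototypes[dists.argmin(), h:]

print(forecast(days[0, :6]).round(2))   # predicted power for hours 6..23
```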

Sonja Pravilovic, Annalisa Appice, Antonietta Lanza, Donato Malerba

Feature Selection in Hierarchical Feature Spaces

Feature selection is an important preprocessing step in data mining, which has an impact on both the runtime and the result quality of the subsequent processing steps. While there are many cases where hierarchical relations between features exist, most existing feature selection approaches are not capable of exploiting those relations. In this paper, we introduce a method for feature selection in hierarchical feature spaces. The method first eliminates redundant features along paths in the hierarchy, and then prunes the resulting feature set based on the features’ relevance. We show that our method yields a good trade-off between feature space compression and classification accuracy, and outperforms both standard approaches as well as other approaches which also exploit hierarchies.
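A hedged sketch of the two stages: drop a feature whose values are nearly redundant given its hierarchy parent, then rank the survivors by relevance. The redundancy test (correlation) and the thresholds are assumptions, not the paper's exact criteria:

```python
import numpy as np

def prune_hierarchy(X, features, parent, corr_thresh=0.95):
    """X: (n_samples, n_features); features: column names aligned with X;
    parent: dict child -> parent feature name (hierarchy edges)."""
    col = {f: i for i, f in enumerate(features)}
    kept = set(features)
    for child, par in parent.items():
        if child in kept and par in kept:
            r = np.corrcoef(X[:, col[child]], X[:, col[par]])[0, 1]
            if abs(r) >= corr_thresh:      # child adds little over parent
                kept.discard(child)
    return sorted(kept)

def top_by_relevance(X, y, features, k):
    """Rank remaining features by absolute correlation with the target."""
    col = {f: i for i, f in enumerate(features)}
    rel = {f: abs(np.corrcoef(X[:, col[f]], y)[0, 1]) for f in features}
    return sorted(features, key=rel.get, reverse=True)[:k]

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 1] = X[:, 0] + rng.normal(scale=0.01, size=100)   # "mammal" ~ "animal"
y = (X[:, 0] > 0).astype(float)
feats = ["animal", "mammal", "color"]
survivors = prune_hierarchy(X, feats, parent={"mammal": "animal"})
print(top_by_relevance(X, y, survivors, k=2))
```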

Petar Ristoski, Heiko Paulheim

Incorporating Regime Metrics into Latent Variable Dynamic Models to Detect Early-Warning Signals of Functional Changes in Fisheries Ecology

In this study, dynamic Bayesian networks have been applied to predict the future biomass of geographically different but functionally equivalent fish species. A latent variable is incorporated to model functional collapse, where the underlying food web structure changes dramatically and irrevocably (known as a regime shift). We examined whether the use of a hidden variable can reflect changes in the trophic dynamics of the system, and also whether the inclusion of recognised statistical metrics would improve the predictive accuracy of the dynamic models. The hidden variable appears to reflect some of the metrics’ characteristics in terms of identifying regime shifts that are known to have occurred. It also appears to capture changes in the variance of different species’ biomass. Including metrics in the models had an impact on predictive accuracy, but only in some cases. Finally, we explore whether exploiting expert knowledge in the form of diet matrices based upon stomach surveys is a better approach to learning model structure than using biomass data alone when predicting food web dynamics. A non-parametric bootstrap in combination with a greedy search algorithm was applied to estimate the confidence of features of networks learned from the data, allowing us to identify pairwise relations of high confidence between species.

Neda Trifonova, Daniel Duplisea, Andrew Kenny, David Maxwell, Allan Tucker

An Efficient Algorithm for Enumerating Chordless Cycles and Chordless Paths

A chordless cycle (induced cycle) C of a graph is a cycle without any chord, meaning that there is no edge outside the cycle connecting two vertices of the cycle. A chordless path is defined similarly. In this paper, we consider the problems of enumerating the chordless cycles/paths of a given graph G = (V, E), and propose algorithms taking O(|E|) time for each chordless cycle/path. In existing studies, the problems had not been deeply studied in the theoretical computer science area, and no output-polynomial time algorithm had been proposed. Our experiments show that the computation time of our algorithms is constant per chordless cycle/path for non-dense random graphs and real-world graphs. They also show that the number of chordless cycles is much smaller than the number of cycles. We applied the algorithm to the prediction of NMR (Nuclear Magnetic Resonance) spectra, and increased the accuracy of the prediction.
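For reference, verifying chordlessness is straightforward (enumeration is the hard part): a vertex sequence is a chordless cycle iff consecutive vertices are adjacent and no other pair is. A minimal checker:

```python
def is_chordless_cycle(adj, cycle):
    """adj: dict vertex -> set of neighbours; cycle: vertex list in order.
    Checks that consecutive vertices are adjacent and nothing else is."""
    n = len(cycle)
    if len(set(cycle)) != n or n < 3:
        return False
    for i, u in enumerate(cycle):
        for j in range(i + 1, n):
            v = cycle[j]
            consecutive = (j == i + 1) or (i == 0 and j == n - 1)
            if consecutive != (v in adj[u]):
                return False   # missing cycle edge, or a chord
    return True

# a 4-cycle with one diagonal: vertices {0,1,2,3} plus chord (0,2)
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
print(is_chordless_cycle(adj, [0, 1, 2, 3]))   # False: chord 0-2
print(is_chordless_cycle(adj, [0, 1, 2]))      # True: a triangle
```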

Takeaki Uno, Hiroko Satoh

Algorithm Selection on Data Streams

We explore the possibilities of meta-learning on data streams, in particular algorithm selection. In a first experiment we calculate the characteristics of a small sample of a data stream and try to predict which classifier performs best on the entire stream. This yields promising results and interesting patterns. In a second experiment, we build a meta-classifier that predicts, based on measurable data characteristics in a window of the data stream, the best classifier for the next window. The results show that this meta-algorithm is very competitive with state-of-the-art ensembles, such as OzaBag, OzaBoost and Leveraged Bagging. The results of all experiments are made publicly available in an online experiment database, for the purposes of verifiability, reproducibility and generalizability.

Jan N. van Rijn, Geoffrey Holmes, Bernhard Pfahringer, Joaquin Vanschoren

Sparse Coding for Key Node Selection over Networks

The size of networks now needed to model real-world phenomena poses significant computational challenges. Key node selection in networks (KNSIN), presented in this paper, selects a representative set of nodes that preserves the sketch of the original nodes in the network and thus serves as a useful solution to this challenge. KNSIN is accomplished via a sparse coding algorithm that efficiently learns a basis set over the feature space defined by the nodes. By applying a stopping criterion, KNSIN automatically learns the dimensionality of the node space and guarantees that the learned basis accurately preserves the sketch of the original node space. In experiments, we use two large-scale network datasets to evaluate the proposed KNSIN framework. Our results on the two datasets demonstrate the effectiveness of the KNSIN algorithm.

Ye Xu, Dan Rockmore

Variational Dependent Multi-output Gaussian Process Dynamical Systems

This paper presents a dependent multi-output Gaussian process (GP) for modeling complex dynamical systems. The outputs are dependent in this model, which differs substantially from previous GP dynamical systems. We adopt convolved multi-output GPs to model the outputs, which are provided with a flexible multi-output covariance function. We adapt the variational inference method with inducing points for approximate posterior inference of the latent variables, and use conjugate gradient based optimization to estimate the parameters involved. Besides the temporal dependency, the proposed model also captures the dependency among outputs in complex dynamical systems. We evaluate the model on both synthetic and real-world data, and encouraging results are observed.

Jing Zhao, Shiliang Sun

Erratum: Categorize, Cluster, and Classify: A 3-C Strategy for Scientific Discovery in the Medical Informatics Platform of the Human Brain Project

By mistake the following acknowledgement text was missing in the originally published paper:

Acknowledgement.

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 604102 (Human Brain Project).

Tal Galili, Alexis Mitelpunkt, Netta Shachar, Mira Marcus-Kalish, Yoav Benjamini

Backmatter
