
2013 | Book

Discovery Science

16th International Conference, DS 2013, Singapore, October 6-9, 2013. Proceedings

Edited by: Johannes Fürnkranz, Eyke Hüllermeier, Tomoyuki Higuchi

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


About this book

This book constitutes the proceedings of the 16th International Conference on Discovery Science, DS 2013, held in Singapore in October 2013, and co-located with the International Conference on Algorithmic Learning Theory, ALT 2013. The 23 papers presented in this volume were carefully reviewed and selected from 52 submissions. They cover recent advances in the development and analysis of methods of automatic scientific knowledge discovery, machine learning, intelligent data analysis, and their application to knowledge discovery.

Table of contents

Frontmatter
Mixture Models from Multiresolution 0-1 Data
Abstract
Multiresolution data has received considerable research interest due to its practical usefulness in combining datasets of different resolutions into a single analysis. Most models and methods can only model a single data resolution, that is, vectors of the same dimensionality, at a time. This is also true for mixture models, the model family of interest here. In this paper, we propose a multiresolution mixture model capable of modeling data in multiple resolutions. Firstly, we define the multiresolution component distributions of mixture models from the domain ontology. We then learn the parameters of the component distributions in the Bayesian network framework. Secondly, we map the multiresolution data in a Bayesian network setting to a vector representation in order to learn the mixture coefficients and the parameters of the component distributions. We investigate the proposed algorithms on two data sets. A simulated data set allows us to have full data observations in all resolutions; however, this is unrealistic in practical applications. The second data set consists of DNA aberration data in two resolutions. The results with multiresolution models show improved modeling performance, in terms of likelihood, over single-resolution mixture models.
Prem Raj Adhikari, Jaakko Hollmén
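The multiresolution construction itself depends on the domain ontology described in the paper; as a point of reference, the sketch below fits an ordinary single-resolution Bernoulli mixture to 0-1 data with EM, which is the kind of baseline the multiresolution model is compared against. All names and parameter choices are illustrative, not the authors' implementation.

```python
import numpy as np

def fit_bernoulli_mixture(X, k, n_iter=100, seed=0, eps=1e-9):
    """EM for a k-component Bernoulli mixture over 0-1 data X of shape (n, d)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    pi = np.full(k, 1.0 / k)                       # mixture coefficients
    theta = rng.uniform(0.25, 0.75, size=(k, d))   # per-component Bernoulli parameters
    for _ in range(n_iter):
        # E-step: responsibilities from the component log-likelihoods
        log_p = (X @ np.log(theta + eps).T
                 + (1 - X) @ np.log(1 - theta + eps).T
                 + np.log(pi + eps))
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: update coefficients and component parameters
        nk = resp.sum(axis=0)
        pi = nk / n
        theta = (resp.T @ X) / (nk[:, None] + eps)
    return pi, theta

# toy usage: two clusters of binary vectors
X = np.vstack([np.random.binomial(1, 0.9, (50, 8)),
               np.random.binomial(1, 0.1, (50, 8))]).astype(float)
pi, theta = fit_bernoulli_mixture(X, k=2)
print(pi.round(2))
```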
Model Tree Ensembles for Modeling Dynamic Systems
Abstract
We address the task of discrete-time modeling of nonlinear dynamic systems using measured data. In the area of control engineering, this task is typically converted into a classical regression problem, which can then be solved with any nonlinear regression approach. As tree ensembles are a very successful predictive modelling approach, we investigate the use of tree ensembles for regression for this task.
While ensembles of regression trees have been extensively used and different variants thereof explored (such as bagging and random forests), ensembles of model trees have received much less attention, being limited mostly to bagging of model trees. We introduce a novel model tree ensemble approach to regression, namely, bagging and random forests of fuzzified model trees. The main advantage of the new approach is that it produces a model with no discontinuities and with satisfactory extrapolation behavior, both of which are needed for modeling dynamic systems.
We evaluate existing tree ensemble approaches to regression and the approach we propose on two synthetic and one real task of modeling nonlinear dynamic systems coming from the area of control engineering. The results show that our proposed model tree ensembles outperform ensembles of regression trees and have comparable performance to state-of-the-art methods for system identification typically used in control engineering. The computing time of our approach is comparable to that of the state-of-the-art methods on the small problems considered, with the potential to scale much better to large modeling problems.
Darko Aleksovski, Juš Kocijan, Sašo Džeroski
Fast and Scalable Image Retrieval Using Predictive Clustering Trees
Abstract
The recent overwhelming increase in the amount of available visual information, especially digital images, has brought up a pressing need to develop efficient and accurate systems for image retrieval. State-of-the-art systems for image retrieval use the bag-of-visual-words representation of the images. However, the computational bottleneck in all such systems is the construction of the visual vocabulary (i.e., how to obtain the visual words). This is typically performed by clustering hundreds of thousands or millions of local descriptors, where the resulting clusters correspond to visual words. Each image is then represented by a histogram of the distribution of its local descriptors over the vocabulary. The major issue for such retrieval systems is that as image databases grow, the number of local descriptors to be clustered increases rapidly, so using conventional clustering techniques becomes infeasible. Considering this, we propose to construct the visual codebook using predictive clustering trees, which are very efficient and have good performance. Moreover, to increase the stability of the model, we propose to use random forests of predictive clustering trees. We evaluate the proposed method on a benchmark database of a million images and compare it to other state-of-the-art methods. The results reveal that the proposed method produces a visual vocabulary with superior discriminative power and thus better retrieval performance.
Ivica Dimitrovski, Dragi Kocev, Suzana Loskovska, Sašo Džeroski
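The paper replaces the conventional k-means codebook with random forests of predictive clustering trees, for which no off-the-shelf implementation is assumed here. As a reference point only, the sketch below builds a visual vocabulary with the conventional MiniBatchKMeans approach and encodes an image's local descriptors as a histogram over the resulting visual words; descriptor extraction (e.g. SIFT) is assumed to have happened upstream.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors, n_words=1000, seed=0):
    """Cluster local descriptors into a visual vocabulary (the k-means baseline)."""
    km = MiniBatchKMeans(n_clusters=n_words, random_state=seed, n_init=3)
    km.fit(descriptors)
    return km

def bow_histogram(vocab, image_descriptors):
    """Represent one image as a normalized histogram over visual words."""
    words = vocab.predict(image_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# toy usage with random 128-dimensional "descriptors"
all_desc = np.random.rand(5000, 128)
vocab = build_vocabulary(all_desc, n_words=50)
img_desc = np.random.rand(300, 128)
print(bow_histogram(vocab, img_desc)[:10])
```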
Avoiding Anomalies in Data Stream Learning
Abstract
The presence of anomalies in data compromises data quality and can reduce the effectiveness of learning algorithms. Standard data mining methodologies treat data cleaning as a pre-processing step before the learning task. The problem of data cleaning is exacerbated when learning in the computational model of data streams. In this paper we present a streaming algorithm for learning classification rules that is able to detect contextual anomalies in the data. Contextual anomalies are surprising attribute values in the context defined by the conditional part of the rule. For each example we compute a degree of anomaliness based on the probability of the attribute values given the conditional part of the rule covering the example. Examples with a high degree of anomaliness are signaled to the user and not used to train the classifier. The experimental evaluation on real-world data sets shows the ability to discover anomalous examples in the data. The main advantage of the proposed method is its ability to report the context of an anomaly and explain why it occurs.
João Gama, Petr Kosina, Ezilda Almeida
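The exact scoring used by the rule learner is specified in the paper; the minimal sketch below only conveys the underlying idea, scoring an example by how improbable its attribute values are among the examples covered by the same rule. The Laplace smoothing and averaging are illustrative assumptions.

```python
import math
from collections import Counter

def anomaliness(example, covered_examples, smoothing=1.0):
    """Average negative log-probability of the example's attribute values,
    estimated from the examples covered by the same rule (higher = more anomalous)."""
    n = len(covered_examples)
    score = 0.0
    for attr, value in example.items():
        counts = Counter(e[attr] for e in covered_examples)
        # Laplace-smoothed probability of this value in the rule's context
        p = (counts[value] + smoothing) / (n + smoothing * (len(counts) + 1))
        score += -math.log(p)
    return score / len(example)

# toy usage: 'blue' is surprising in the context defined by this rule
covered = [{"color": "red", "size": "s"},
           {"color": "red", "size": "m"},
           {"color": "red", "size": "s"}]
print(anomaliness({"color": "blue", "size": "s"}, covered))
print(anomaliness({"color": "red", "size": "s"}, covered))
```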
Generalizing from Example Clusters
Abstract
We consider the following problem: Given a set of data and one or more examples of clusters, find a clustering of the whole data set that is consistent with the given clusters. This is essentially a semi-supervised clustering problem, but it differs from previously studied semi-supervised clustering settings in significant ways. Earlier work has shown that none of the existing methods for semi-supervised clustering handle this problem well. We identify two reasons for this, which are related to the default metric learning methods not working well in this situation, and to overfitting behavior. We investigate the latter in more detail and propose a new method that explicitly guards against overfitting. Experimental results confirm that the new method generalizes much better. Several other problems identified here remain open.
Pan Hu, Celine Vens, Bart Verstrynge, Hendrik Blockeel
Clustering Based Active Learning for Evolving Data Streams
Abstract
Data labeling is an expensive and time-consuming task. Choosing which instances to label is thus becoming increasingly important. In the active learning setting, a classifier is trained by asking for labels for only a small fraction of all instances. While many works deal with this issue in non-streaming scenarios, few exist for the data stream setting. In this paper we propose a new active learning approach for evolving data streams, based on a pre-clustering step, for selecting the most informative instances for labeling. We consider a batch-incremental setting: when a new batch arrives, we first cluster the examples and then select the best instances to train the learner. The clustering approach allows us to cover the whole data space and to avoid oversampling examples from only a few areas. We compare our method against state-of-the-art active learning strategies on real datasets. The results highlight the improvement in performance of our proposal. Experiments on parameter sensitivity are also reported.
Dino Ienco, Albert Bifet, Indrė Žliobaitė, Bernhard Pfahringer
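The paper's selection criteria and the underlying stream learner are its own; the sketch below only illustrates the batch-incremental skeleton: cluster each incoming batch, then query the label of the instance closest to each cluster centre so that the queries cover the data space. The library choice and the "closest to centroid" rule are assumptions, not the authors' strategy.

```python
import numpy as np
from sklearn.cluster import KMeans

def select_queries(batch, budget, seed=0):
    """Cluster the batch and pick the instance nearest each centroid for labeling."""
    km = KMeans(n_clusters=budget, random_state=seed, n_init=10).fit(batch)
    chosen = []
    for c in range(budget):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(batch[members] - km.cluster_centers_[c], axis=1)
        chosen.append(members[np.argmin(dists)])
    return sorted(chosen)

# toy usage: label 5 of 200 instances in this batch, then feed them to any incremental learner
batch = np.random.rand(200, 10)
query_idx = select_queries(batch, budget=5)
print(query_idx)
```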
Robust Crowd Labeling Using Little Expertise
Abstract
Crowd-labeling emerged from the need to label large-scale and complex data, a tedious, expensive, and time-consuming task. But the problem of obtaining good-quality labels from a crowd and integrating them is still unresolved. To address this challenge, we propose a new framework that automatically combines and boosts bulk crowd labels, supported by a limited number of “ground truth” labels from experts. The ground-truth labels help to estimate the individual expertise of crowd labelers and the difficulty of each instance, both of which are used to aggregate the labels. We show through extensive experiments that, unlike other state-of-the-art approaches, our method is robust even in the presence of a large proportion of bad labelers in the crowd. We also derive a lower bound on the number of expert labels needed to judge the crowd and the dataset, as well as to obtain better-quality labels.
Faiza Khan Khattak, Ansaf Salleb-Aouissi
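The paper's framework jointly estimates labeler expertise and instance difficulty; the sketch below shows only the simpler first ingredient as a reference point: estimate each labeler's accuracy on the few expert-labeled instances, then aggregate the remaining labels by an accuracy-weighted vote. The smoothing and weighting scheme here are illustrative assumptions, not the authors' estimator.

```python
from collections import defaultdict

def labeler_accuracy(crowd_labels, expert_labels, prior=1.0):
    """crowd_labels: {labeler: {item: label}}, expert_labels: {item: label}."""
    acc = {}
    for labeler, answers in crowd_labels.items():
        hits = sum(1 for i, y in expert_labels.items() if answers.get(i) == y)
        seen = sum(1 for i in expert_labels if i in answers)
        acc[labeler] = (hits + prior) / (seen + 2 * prior)   # smoothed accuracy
    return acc

def aggregate(crowd_labels, accuracy):
    """Accuracy-weighted vote per item over all crowd labels."""
    votes = defaultdict(lambda: defaultdict(float))
    for labeler, answers in crowd_labels.items():
        for item, label in answers.items():
            votes[item][label] += accuracy[labeler]
    return {item: max(v, key=v.get) for item, v in votes.items()}

# toy usage: labeler 'b' disagrees with the expert and gets down-weighted
crowd = {"a": {1: "pos", 2: "neg", 3: "pos"},
         "b": {1: "neg", 2: "pos", 3: "neg"}}
expert = {1: "pos"}
acc = labeler_accuracy(crowd, expert)
print(acc, aggregate(crowd, acc))
```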
A New Approach to String Pattern Mining with Approximate Match
Abstract
Frequent string mining is the problem of finding frequently appearing strings in a given string database or large text. It has many applications to the analysis of string data such as texts, word sequences, and genome sequences. The problem becomes difficult if the string patterns are allowed to match approximately: allowing a fixed number of errors leads to an explosion in the number of small solutions, while allowing a fixed ratio of errors violates monotonicity, which disables hill-climbing algorithms and thus makes searching difficult. There is also a difficulty in modeling the problem; simple mathematical definitions result in an explosion of solutions. To overcome these difficulties, we go back to the motivation for finding frequent strings and propose a new method for generating string patterns that appear in the input string many times. In the method, we first compute the similarity between the strings in the database and enumerate the clusters generated by this similarity. We then compute representative strings for each cluster; these representatives are our frequent strings. Further, by taking majority votes, we extend the obtained representatives to obtain long frequent strings. The computational experiments we performed show the efficiency of both our model and our algorithm: we were able to find many string patterns that appear many times in the data and are long but not overly numerous. The computation time of our method is practically short, e.g., about 20 minutes even for a genomic sequence of 100 million letters.
Tetsushi Matsui, Takeaki Uno, Juzoh Umemori, Tsuyoshi Koide
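The clustering and extension steps depend on the similarity measure defined in the paper; the sketch below only illustrates the majority-vote idea used to obtain a representative: given a cluster of (here, equal-length and already aligned) strings, take the most frequent letter at each position. Alignment handling in the actual method is more involved.

```python
from collections import Counter

def majority_representative(strings):
    """Column-wise majority vote over a cluster of aligned, equal-length strings."""
    if not strings:
        return ""
    length = len(strings[0])
    assert all(len(s) == length for s in strings), "strings must be aligned"
    return "".join(Counter(s[i] for s in strings).most_common(1)[0][0]
                   for i in range(length))

# toy usage: the representative smooths out isolated mismatches
cluster = ["ACGTACGT", "ACGTACGA", "ACCTACGT", "ACGTACGT"]
print(majority_representative(cluster))   # -> ACGTACGT
```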
OntoDM-KDD: Ontology for Representing the Knowledge Discovery Process
Abstract
In this article, we present an ontology for representing the knowledge discovery (KD) process based on the CRISP-DM process model (OntoDM-KDD). OntoDM-KDD defines the most essential entities for describing data mining investigations in the context of KD in a two-layered ontological structure. The ontology is aligned with, and reuses, state-of-the-art resources for representing scientific investigations, such as the Information Artifact Ontology (IAO) and the Ontology for Biomedical Investigations (OBI). It provides a taxonomy of KD-specific actions, processes and specifications of inputs and outputs. OntoDM-KDD supports the annotation of DM investigations in application domains. The ontology has been thoroughly assessed following best practices in ontology engineering, is fully interoperable with many domain resources, and is easily extensible. OntoDM-KDD is available at http://www.ontodm.com .
Panče Panov, Larisa Soldatova, Sašo Džeroski
A Wordification Approach to Relational Data Mining
Abstract
This paper describes a propositionalization technique called wordification. Wordification is inspired by text mining and can be seen as a transformation of a relational database into a corpus of documents. Wordification aims to produce simple, easy-to-understand features that act as words in the transformed bag-of-words representation. As with other propositionalization methods, any propositional data mining algorithm can be applied after the wordification step. The most notable advantage of the presented technique is its greater scalability: the propositionalization step is done in time linear in the number of attributes times the number of examples. The paper presents the wordification methodology, implemented in the cloud-based web data mining platform ClowdFlows, and describes experiments on two real-life datasets, together with a critical comparison to the RSD propositionalization approach.
Matic Perovšek, Anže Vavpetič, Bojan Cestnik, Nada Lavrač
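As a rough illustration of the wordification idea (not the authors' implementation), the sketch below turns each row of a relational table, plus its joined rows from a secondary table, into a document of `table__attribute__value` words that any bag-of-words learner can consume. The table and attribute names are hypothetical.

```python
def wordify_row(table_name, row, exclude=("id",)):
    """Turn one relational tuple into 'table__attribute__value' words."""
    return [f"{table_name}__{attr}__{val}"
            for attr, val in row.items() if attr not in exclude]

def wordify(main_rows, linked_rows_by_id, main_table="customer", linked_table="order"):
    """One document per main-table row, including words from its joined rows."""
    docs = {}
    for row in main_rows:
        words = wordify_row(main_table, row)
        for linked in linked_rows_by_id.get(row["id"], []):
            words += wordify_row(linked_table, linked, exclude=("id", "customer_id"))
        docs[row["id"]] = " ".join(words)
    return docs

# toy usage with hypothetical tables
customers = [{"id": 1, "age": "young", "city": "ljubljana"}]
orders = {1: [{"id": 10, "customer_id": 1, "item": "book"}]}
print(wordify(customers, orders))
# {1: 'customer__age__young customer__city__ljubljana order__item__book'}
```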
Multi-interval Discretization of Continuous Attributes for Label Ranking
Abstract
Label Ranking (LR) problems, such as predicting rankings of financial analysts, are becoming increasingly important in data mining. While there has been a significant amount of work on the development of learning algorithms for LR in recent years, pre-processing methods for LR are still very scarce. However, some methods, like Naive Bayes for LR and APRIORI-LR, cannot deal with real-valued data directly. As a makeshift solution, one could consider conventional discretization methods used in classification, simply treating each unique ranking as a separate class. In this paper, we show that such an approach has several disadvantages. As an alternative, we propose an adaptation of an existing method, MDLP, specifically for LR problems. We illustrate the advantages of the new method using synthetic data. Additionally, we present results obtained on several benchmark datasets. The results clearly indicate that the discretization performs as expected and in some cases improves the results of the learning algorithms.
Cláudio Rebelo de Sá, Carlos Soares, Arno Knobbe, Paulo Azevedo, Alípio Mário Jorge
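The adaptation of MDLP to rankings is the paper's contribution; the sketch below shows only the makeshift baseline the authors argue against: treat every distinct ranking as a separate class label and apply a standard entropy-based binary split (here without the full MDL stopping rule). It is a reference point, not the proposed method.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(values, rankings):
    """Makeshift baseline: treat each distinct ranking (a tuple) as one class
    and pick the cut point that minimizes the weighted class entropy."""
    pairs = sorted(zip(values, rankings))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                  # no cut inside tied values
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        best = min(best, (score, cut))
    return best

# toy usage: values of one attribute and the associated label rankings
values = [1.0, 1.2, 3.5, 3.7, 4.0]
rankings = [(1, 2, 3), (1, 2, 3), (3, 2, 1), (3, 2, 1), (2, 1, 3)]
print(best_split(values, rankings))
```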
Identifying Super-Mediators of Information Diffusion in Social Networks
Abstract
We propose a method to discover a different kind of influential node in a social network, which we call “super-mediators”: nodes which play an important role in receiving information and passing it on to other nodes. We mathematically formulate this as a problem of maximizing the difference in the average influence degree caused by a node removal, i.e., a node that makes this difference large is influential. We further characterize super-mediators as having both a large influence degree, i.e., being capable of widely spreading information to other recipient nodes, and a large reverse-influence degree, i.e., being capable of widely receiving information from other information-source nodes. We conducted extensive experiments using three real-world social networks and confirmed that this property holds. We further investigated how well conventional centrality measures capture super-mediators. In short, in-degree centrality is a good measure when the diffusion probability is small and betweenness centrality is a good measure when the diffusion probability is large, but super-mediators do depend on the value of the diffusion probability, and no single centrality measure works equally well over a wide range of diffusion probabilities.
Kazumi Saito, Masahiro Kimura, Kouzou Ohara, Hiroshi Motoda
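The formalization and estimation procedure in the paper are more refined; the sketch below only illustrates the quantity being maximized: the drop in a Monte-Carlo estimate of the average influence degree under the independent cascade model when a node is removed. The toy graph, diffusion probability, and simulation counts are assumptions for illustration.

```python
import random

def simulate_ic(adj, seed_node, p, rng):
    """One independent-cascade run from seed_node; returns the set of activated nodes."""
    active, frontier = {seed_node}, [seed_node]
    while frontier:
        nxt = []
        for u in frontier:
            for v in adj.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    nxt.append(v)
        frontier = nxt
    return active

def avg_influence(adj, p, runs=200, seed=0):
    """Monte-Carlo estimate of the average influence degree over random source nodes."""
    rng = random.Random(seed)
    nodes = list(adj)
    total = sum(len(simulate_ic(adj, rng.choice(nodes), p, rng)) for _ in range(runs))
    return total / runs

def mediator_score(adj, node, p):
    """Drop in average influence degree when `node` is removed from the network."""
    reduced = {u: [v for v in vs if v != node] for u, vs in adj.items() if u != node}
    return avg_influence(adj, p) - avg_influence(reduced, p)

# toy directed network: node 2 mediates between the two halves
adj = {0: [2], 1: [2], 2: [3, 4], 3: [], 4: []}
print(mediator_score(adj, 2, p=0.5))
```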
SM2D: A Modular Knowledge Discovery Approach Applied to Hydrological Forecasting
Abstract
In this paper, we address the problem of flood prediction in complex situations. We present an original solution that aims at the main goals of accuracy, flexibility and readability. We propose SM2D, a modular data-driven approach that provides predictive models for each sub-process of a global hydrological process. We show that this solution improves predictive accuracy compared to a global approach. The originality of our proposal is threefold: (1) the predictive model is defined as a set of aggregate variables that act as classifiers, (2) an evolutionary technique is implemented to find the best juries of such classifiers, and (3) the complexity of the flood process is addressed by searching for sub-models on sub-processes identified partly by spatial criteria. The solution has proved to perform well on flash flood phenomena of tropical areas, which are known to be hard to predict. It was successfully applied to a real Caribbean river dataset after both the preprocessing and preliminary analysis steps presented in the paper.
Wilfried Segretier, Martine Collard
A Dynamic Programming Algorithm for Learning Chain Event Graphs
Abstract
Chain event graphs are a model family particularly suited for asymmetric causal discrete domains. This paper describes a dynamic programming algorithm for exact learning of chain event graphs from multivariate data. While the exact algorithm is slow, it allows reasonably fast approximations and provides clues for implementing more scalable heuristic algorithms.
Tomi Silander, Tze-Yun Leong
Mining Interesting Patterns in Multi-relational Data with N-ary Relationships
Abstract
We present a novel method for mining local patterns from multi-relational data in which relationships can be of any arity. More specifically, we define a new pattern syntax for such data, develop an efficient algorithm for mining it, and define a suitable interestingness measure that is able to take into account the prior information of the data miner. Our approach is a strict generalisation of prior work on multi-relational data in which relationships were restricted to be binary, as well as of prior work on local pattern mining from a single n-ary relationship. Remarkably, despite being more general, our algorithm is as fast as or faster than the state of the art in these less general problem settings.
Eirini Spyropoulou, Tijl De Bie, Mario Boley
Learning Hierarchical Multi-label Classification Trees from Network Data
Abstract
We present an algorithm for hierarchical multi-label classification (HMC) in a network context. It is able to classify instances that may belong to multiple classes at the same time and to take into account the hierarchical organization of the classes. It assumes that the instances are placed in a network and uses information on the network connections during the learning of the predictive model. Many real-world prediction problems have classes that are organized hierarchically and instances that can have pairwise connections. One example is web document classification, where topics (classes) are typically organized into a hierarchy and documents are connected by hyperlinks. Another example, which is considered in this paper, is gene/protein function prediction, where genes/proteins are connected and form protein-to-protein interaction (PPI) networks. Network datasets are characterized by a form of autocorrelation, where the value of a variable at a given node depends on the values of variables at the nodes it is connected with. Combining the hierarchical multi-label classification task with network prediction is thus not trivial and requires the introduction of the new concept of network autocorrelation for HMC. The proposed algorithm is able to profitably exploit network autocorrelation when learning a tree-based prediction model for HMC. The learned model is in the form of a Predictive Clustering Tree (PCT) and predicts multiple (hierarchically organized) labels at the leaves. Experiments show the effectiveness of the proposed approach for different problems of gene function prediction, considering different PPI networks. The results show that different networks provide different benefits in different gene function prediction problems.
Daniela Stojanova, Michelangelo Ceci, Donato Malerba, Sašo Džeroski
A Density-Based Backward Approach to Isolate Rare Events in Large-Scale Applications
Abstract
While significant work in data mining has been dedicated to the detection of single outliers in the data, less research has approached the problem of isolating a group of outliers, i.e., rare events representing micro-clusters of less – or significantly less – than 1% of the whole dataset. This research issue is critical, for example, in medical applications. The problem is difficult to handle as it lies at the frontier between outlier detection and clustering, and is distinguished by the clear challenge of avoiding missing true positives. We address this challenge and propose a novel two-stage framework, based on a backward approach, to isolate abnormal groups of events in large datasets. The key to our backward approach is to first identify the cores of the dense regions and then gradually augment them based on a density-driven condition. The framework outputs a small subset of the dataset containing both rare events and outliers. We tested our framework on a biomedical application to find micro-clusters of pathological cells. The comparison against a common clustering algorithm (DBSCAN) and a common outlier detection algorithm (LOF) shows that our approach is a very efficient alternative for the detection of rare events – generally a recall of 100% and a higher precision, positively correlated with the size of the rare event – while also providing an \(\mathcal{O}(N)\) solution, in contrast to the existing algorithms, which are dominated by \(\mathcal{O}(N^2)\) complexity.
Enikö Székely, Pascal Poncelet, Florent Masseglia, Maguelonne Teisseire, Renaud Cezar
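The framework's two stages are defined precisely in the paper; the sketch below only conveys the backward intuition under assumed heuristics: rank points by a k-NN density estimate, keep the densest points as cores, then grow each core with neighbours whose density stays above a fraction of the core's density, leaving the rest as candidates for rare events and outliers. The k, quantile, and growth threshold are illustrative assumptions, not the authors' conditions.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def backward_isolate(X, k=10, core_quantile=0.9, growth_ratio=0.5):
    """Return indices of points kept after the core-then-augment pass."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)                     # idx[:, 0] is the point itself
    density = 1.0 / (dist[:, 1:].mean(axis=1) + 1e-12)
    cores = set(np.where(density >= np.quantile(density, core_quantile))[0])
    kept, frontier = set(cores), list(cores)
    while frontier:                                  # augment cores with dense neighbours
        p = frontier.pop()
        for q in idx[p, 1:]:
            if q not in kept and density[q] >= growth_ratio * density[p]:
                kept.add(q)
                frontier.append(q)
    return sorted(kept)

# toy usage: whatever falls outside the dense regions remains to be inspected for rare events
X = np.random.rand(500, 2)
dense_part = backward_isolate(X)
rare_candidates = sorted(set(range(len(X))) - set(dense_part))
print(len(dense_part), len(rare_candidates))
```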
Inductive Process Modeling of Rab5-Rab7 Conversion in Endocytosis
Abstract
Inductive process modeling (IPM) is an approach to equation discovery that can be used to induce comprehensible models of dynamical systems from observed data and domain knowledge. We apply IPM to the task of modeling the conversion of Rab5 domain proteins to Rab7 domain proteins, a key process in endocytosis. Endocytosis, and in particular its specific form phagocytosis, is a major mechanism of the immune system, used to remove pathogens. We first introduce a formal representation of the domain knowledge for modeling this process. We then present the design of the IPM experiments using the domain knowledge and measured data and the results obtained from these experiments. We finally compare our results with results already published in the literature.
Jovan Tanevski, Ljupčo Todorovski, Yannis Kalaidzidis, Sašo Džeroski
Fast Compression of Large-Scale Hypergraphs for Solving Combinatorial Problems
Abstract
We present a fast algorithm to compress hypergraphs into the ZDD data structure and analyze its computational complexity. Our algorithm uses the multikey Quicksort of Bentley and Sedgewick. Through experiments with various datasets, we show that our algorithm is significantly faster and requires much less memory than an existing method.
Takahisa Toda
Semantic Data Mining of Financial News Articles
Abstract
Subgroup discovery aims at constructing symbolic rules that describe statistically interesting subsets of instances with a chosen property of interest. Semantic subgroup discovery extends standard subgroup discovery approaches by exploiting ontological concepts in rule construction. Compared to the previously developed semantic data mining systems SDM-SEGS and SDM-Aleph, this paper presents a general-purpose semantic subgroup discovery system, Hedwig, that takes as input training examples encoded in RDF and constructs relational rules by an effective top-down search of ontologies, also encoded as RDF triples. The effectiveness of the system is demonstrated through an application in a financial domain, with the goal of analyzing financial news in search of interesting vocabulary patterns that reflect credit default swap (CDS) trend reversals for financially troubled countries. The approach is showcased by analyzing over 8 million news articles collected over a period of eighteen months. The paper exemplifies the results by showing rules reflecting interesting news topics that characterize the Portugal CDS trend reversal in terms of conjunctions of terms describing concepts at different levels of the concept hierarchy.
Anže Vavpetič, Petra Kralj Novak, Miha Grčar, Igor Mozetič, Nada Lavrač
Polynomial Delay and Space Discovery of Connected and Acyclic Sub-hypergraphs in a Hypergraph
Abstract
In this paper, we study the problem of finding all tree-like substructures contained in a hypergraph, with potential applications to substructure mining from relational data. We employ the class of connected and Berge-acyclic sub-hypergraphs as the definition of tree-like substructures; Berge acyclicity is the most restrictive notion of acyclicity for hypergraphs. We present an efficient depth-first algorithm that finds all connected and Berge-acyclic sub-hypergraphs \(S\) in a hypergraph \(\mathcal H\) with \(m\) hyperedges and \(n\) vertices in \(O(nm^2)\) time per solution (delay) using \(O(N)\) space, where \(N = ||\mathcal H||\) is the total input size. To achieve efficient enumeration, we use the notion of the maximum border set. This result gives the first polynomial-delay and polynomial-space algorithm for the enumeration of connected and Berge-acyclic sub-hypergraphs. We also present an incremental enumeration algorithm that finds all solutions \(S\) in \(O(\varDelta MB(S)\,\tau(m)) = O(rd\cdot\tau(m))\) delay using \(O(N)\) space and preprocessing, whose delay depends only on the difference between consecutive solutions, where \(S\) is the enumerated sub-hypergraph, \(\varDelta MB(S)\) is the number of hyperedges newly added to the maximum border of \(S\), \(r\) and \(d\) are the rank and degree of \(\mathcal H\), respectively, and \(\tau(m) = O((\log \log m)^2/\log \log \log m)\).
Kunihiro Wasa, Takeaki Uno, Kouichi Hirata, Hiroki Arimura
Hyperlink Prediction in Hypernetworks Using Latent Social Features
Abstract
Predicting the existence of links between pairs of objects in networks is a key problem in the study of social networks. However, relationships among objects are often more complex than simple pairwise relations. By restricting attention to dyads, information valuable for many learning tasks can be lost. The hypernetwork relaxes the assumption that only two nodes can participate in a link, permitting instead an arbitrary number of nodes to participate in so-called hyperlinks or hyperedges, which is a more natural representation for complex, multi-party relations. However, the hyperlink prediction problem has yet to be studied. In this paper, we propose HPLSF (Hyperlink Prediction using Latent Social Features), a hyperlink prediction algorithm for hypernetworks. By exploiting the homophily property of social networks, HPLSF explores social features for hyperlink prediction. To handle the problem that social features are not always observable, a latent social feature learning scheme is developed. To cope with the arbitrary cardinality of hyperlinks in hypernetworks, we design a feature-embedding scheme that maps the a priori arbitrarily-sized feature set associated with each hyperlink into a uniformly-sized auxiliary space. To address the fact that observed features and latent features may not be independent, we generalize a structural SVM to learn using both observed and latent features. In experiments, we evaluate the proposed HPLSF framework on three large-scale hypernetwork datasets. Our results on the three diverse datasets demonstrate the effectiveness of the HPLSF algorithm. Although developed in the context of social networks, HPLSF is a general methodology and applies to arbitrary hypernetworks.
Ye Xu, Dan Rockmore, Adam M. Kleinbaum
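HPLSF's latent-feature learning and structural SVM are beyond a short sketch; the snippet below only illustrates the feature-embedding step for variable-cardinality hyperlinks: pool the (observed) node feature vectors of a candidate hyperlink into a fixed-size vector via mean/min/max, which any standard classifier can then score. The pooling choice, the classifier, and the sampled negatives are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_hyperlink(node_features, members):
    """Map an arbitrary-sized set of member nodes to a fixed-size vector."""
    F = node_features[list(members)]
    return np.concatenate([F.mean(axis=0), F.min(axis=0), F.max(axis=0)])

# toy usage with hypothetical node features and candidate hyperlinks
rng = np.random.default_rng(0)
node_features = rng.normal(size=(30, 5))
positives = [{0, 1, 2}, {3, 4}, {5, 6, 7, 8}]      # observed hyperlinks
negatives = [{0, 9, 20}, {11, 25}, {2, 14, 28}]    # sampled non-links
X = np.array([embed_hyperlink(node_features, h) for h in positives + negatives])
y = np.array([1] * len(positives) + [0] * len(negatives))
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict([embed_hyperlink(node_features, {3, 4, 5})]))
```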
Extracting Opinionated (Sub)Features from a Stream of Product Reviews
Abstract
We propose a stream mining method that learns opinionated product features from a stream of reviews. Monitoring the attitude of customers towards products is a field of much interest, but the products themselves may come in and out of the market. We therefore investigate which (implicit) features of the products are important to customers, and monitor how customer attitude towards those features evolves. To this purpose, we use a two-level stream clustering algorithm that extracts features and subfeatures from an opinionated stream, and couple it with dedicated feature-specific classifiers that assess the polarity of each extracted (sub)feature. We evaluate our method on a stream of reviews and elaborate on how changes in the arrival rate of features (drift) affect algorithm performance.
Max Zimmermann, Eirini Ntoutsi, Myra Spiliopoulou
Backmatter
Metadata
Title
Discovery Science
Edited by
Johannes Fürnkranz
Eyke Hüllermeier
Tomoyuki Higuchi
Copyright year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-40897-7
Print ISBN
978-3-642-40896-0
DOI
https://doi.org/10.1007/978-3-642-40897-7
