Skip to main content

Über dieses Buch

This book constitutes the thoroughly refereed conference proceedings of the First International Workshop on New Frontiers in Mining Complex Patterns, NFMCP 2012, held in conjunction with ECML/PKDD 2012, in Bristol, UK, in September 2012.

The 15 revised full papers were carefully reviewed and selected from numerous submissions. The papers are organized in topical sections on mining rich (relational) datasets, mining complex patterns from miscellaneous data, mining complex patterns from trajectory and sequence data, and mining complex patterns from graphs and networks.



Mining Rich (Relational) Datasets

Learning with Configurable Operators and RL-Based Heuristics

In this paper, we push forward the idea of machine learning systems for which the operators can be modified and finetuned for each problem. This allows us to propose a learning paradigm where users can write (or adapt) their operators, according to the problem, data representation and the way the information should be navigated. To achieve this goal, data instances, background knowledge, rules, programs and operators are all written in the same functional language, Erlang. Since changing operators affect how the search space needs to be explored, heuristics are learnt as a result of a decision process based on reinforcement learning where each action is defined as a choice of operator and rule. As a result, the architecture can be seen as a ‘system for writing machine learning systems’ or to explore new operators.
Fernando Martínez-Plumed, Cèsar Ferri, José Hernández-Orallo, María José Ramírez-Quintana

Reducing Examples in Relational Learning with Bounded-Treewidth Hypotheses

Feature selection methods often improve the performance of attribute-value learning. We explore whether also in relational learning, examples in the form of clauses can be reduced in size to speed up learning without affecting the learned hypothesis. To this end, we introduce the notion of safe reduction: a safely reduced example cannot be distinguished from the original example under the given hypothesis language bias. Next, we consider the particular, rather permissive bias of bounded treewidth clauses. We show that under this hypothesis bias, examples of arbitrary treewidth can be reduced efficiently. The bounded treewidth bias can be replaced by other assumptions such as acyclicity with similar benefits. We evaluate our approach on four data sets with the popular system Aleph and the state-of-the-art relational learner nFOIL. On all four data sets we make learning faster for nFOIL, achieving an order-of-magnitude speed up on one of the data sets, and more accurate for Aleph.
Ondřej Kuželka, Andrea Szabóová, Filip Železný

Mining Complex Patterns from Miscellaneous Data

Mining Complex Event Patterns in Computer Networks

More and more ubiquitous and mobile computer networks are becoming available, which leads to a massive growth in the amount of traffic and according log messages. Therefore, sophisticated approaches for network management and analysis are necessary for handling and managing networks efficiently.
In this paper, we show how to use temporal data mining in a declarative framework for analysing log files for computer networks. From a sequence of network management protocol messages, we derive temporal association rules, which state frequent dependencies between the occuring events. We also present methods for extendable and modular parsing of text messages and their analysis in log files based on Xml.
Dietmar Seipel, Philipp Neubeck, Stefan Köhler, Martin Atzmueller

Learning in the Presence of Large Fluctuations: A Study of Aggregation and Correlation

Consider a scenario where one aims to learn models from data being characterized by very large fluctuations that are neither attributable to noise nor outliers. This may be the case, for instance, when predicting the potential future damages of earthquakes or oil spills, or when conducting financial data analysis. If follows that, in such a situation, the standard central limit theorem does not apply, since the associated Gaussian distribution exponentially suppresses large fluctuations. In this paper, we present an analysis of data aggregation and correlation in such scenarios. To this end, we introduce the Lévy, or stable, distribution which is a generalization of the Gaussian distribution. Our theoretical conclusions are illustrated with various simulations, as well as against a benchmarking financial database. We show which specific strategies should be adopted for aggregation, depending on the stability exponent of the Lévy distribution. Our results indicate that the correlation in between two attributes may be underestimated if a Gaussian distribution is erroneously assumed. Secondly, we show that, in the scenario where we aim to learn a set of rules to estimate the level of stability of a stock market, the Lévy distribution produces superior results. Thirdly, we illustrate that, in a multi-relational database mining setting, aggregation using average values may be highly unsuitable.
Eric Paquet, Herna Lydia Viktor, Hongyu Guo

Retracted: Machine Learning as an Objective Approach to Understanding Music

Traditional research into the arts has generally been based around the subjective judgment of human critics. We propose an alternative approach based on the use of objective machine learning programs. To illustrate this methodology we investigated the distribution of music from around the world: geographical ethnomusicology. To ensure that the knowledge obtained about geographical ethnomusicology is objective and operational we cast the problem as a machine learning one: predicting the geographical origin of pieces of music. We collected 1,142 pieces of music from 73 countries, and described them using 2 sets of standard audio descriptors using MARSYAS. To predict the location of origin of the music we developed a method designed to deal with the spherical surface topology based upon a modified k-nearest-neighbour. We also investigated the utility of a priori geographical knowledge in the predictions: a land and sea mask, and a population distribution overlay. The best-performing prediction method achieved a median land distance error of 1,506km, with comparable random trials having mean of medians 3,190km - this is significant at P < 0.001.
Claire Q, Ross D. King

Pair-Based Object-Driven Action Rules

Action rules, as proposed by Raś and Wieczorkowska in [11], can be defined as actionable tasks that describe possible transitions of objects from one state to another with respect to a distinguished attribute. Recently, a new specialized case of action rules, namely object-driven action rules, has been introduced by Ayman et al. in [4]. Object-driven action rules are action rules that are extracted from information systems with temporal and object-based nature. By object-based nature, we refer to systems that contain multiple observations for each object. A typical example of an object-based system would be a system of patients recording multiple visits; each patient is considered a distinct object. In this paper, we will further investigate the concept of object-driven action rules by proposing a new pair-based way of examining object-driven systems, which we believe is more intuitive for temporal and object-driven systems. The focus of this paper will be on our proposed pair-based approach, along with the modifications required to extract action rules and calculate their properties.
Ayman Hajja, Alicja A. Wieczorkowska, Zbigniew W. Ras, Ryszard Gubrynowicz

Mining Complex Patterns from Trajectory and Sequence Data

Effectively Grouping Trajectory Streams

Trajectory data streams are huge amounts of data pertaining to time and position of moving objects. They are continuously generated by different sources exploiting a wide variety of technologies (e.g., RFID tags, GPS, GSM networks). Mining such amount of data is a challenging problem, since the possibility to extract useful information from this peculiar kind of data is crucial in many application scenarios such as vehicle traffic management, hand-off in cellular networks, supply chain management. Moreover, spatial data streams pose interesting challenges for their proper representation, thus making the mining process harder than for classical point data. In this paper, we address the problem of trajectory data streams clustering, that revealed really intriguing as we deal with a kind of data (trajectories) for which the order of elements is relevant. We propose a complete framework starting from data preparation task that allows us to make the mining step quite effective. Since the validation of data mining approaches has to be experimental we performed several tests on real world datasets that confirmed the efficiency and effectiveness of the proposed technique.
Gianni Costa, Giuseppe Manco, Elio Masciari

Healthcare Trajectory Mining by Combining Multidimensional Component and Itemsets

Sequential pattern mining is aimed at extracting correlations among temporal data. Many different methods were proposed to either enumerate sequences of set valued data (i.e., itemsets) or sequences containing multidimensional items. However, in real-world scenarios, data sequences are described as events of both multidimensional items and set valued information. These rich heterogeneous descriptions cannot be exploited by traditional approaches. For example, in healthcare domain, hospitalizations are defined as sequences of multi-dimensional attributes (e.g. Hospital or Diagnosis) associated with two sets, set of medical procedures (e.g. \( \lbrace \) Radiography, Appendectomy \(\rbrace\)) and set of medical drugs (e.g. \(\lbrace \) Aspirin, Paracetamol \(\rbrace\)) . In this paper we propose a new approach called MMISP (Mining Multidimensional Itemset Sequential Patterns) to extract patterns from a complex sequences including both dimensional items and itemsets. The novelties of the proposal lies in: (i) the way in which the data can be efficiently compressed; (ii) the ability to reuse and adopt sequential pattern mining algorithms and (iii) the extraction of new kind of patterns. We introduce as a case-study, experimented on real data aggregated from a regional healthcare system and we point out the usefulness of the extracted patterns. Additional experiments on synthetic data highlights the efficiency and scalability of our approach.
Elias Egho, Chedy Raïssi, Dino Ienco, Nicolas Jay, Amedeo Napoli, Pascal Poncelet, Catherine Quantin, Maguelonne Teisseire

Graph-Based Approaches to Clustering Network-Constrained Trajectory Data

Clustering trajectory data attracted considerable attention in the last few years. Most of prior work assumed that moving objects can move freely in an euclidean space and did not consider the eventual presence of an underlying road network and its influence on evaluating the similarity between trajectories. In this paper, we present an approach to clustering such network-constrained trajectory data. More precisely we aim at discovering groups of road segments that are often travelled by the same trajectories. To achieve this end, we model the interactions between segments w.r.t. their similarity as a weighted graph to which we apply a community detection algorithm to discover meaningful clusters. We showcase our proposition through experimental results obtained on synthetic datasets.
Mohamed Khalil El Mahrsi, Fabrice Rossi

Mining Complex Patterns from Graphs and Networks

Finding the Most Descriptive Substructures in Graphs with Discrete and Numeric Labels

Many graph datasets are labelled with discrete and numeric attributes. Frequent substructure discovery algorithms usually ignore numeric attributes; in this paper we show that they can be used to improve discrimination and search performance. Our thesis is that the most descriptive substructures are those which are normative both in terms of their structure and in terms of their numeric values. We explore the relationship between graph structure and the distribution of attribute values and propose an outlier-detection step, which is used as a constraint during substructure discovery. By pruning anomalous vertices and edges, more weight is given to the most descriptive substructures. Our experiments on a real-world access control database returns similar substructures to unconstrained search with 30% fewer graph isomorphism tests.
Michael Davis, Weiru Liu, Paul Miller

Learning in Probabilistic Graphs Exploiting Language-Constrained Patterns

The probabilistic graphs framework models the uncertainty inherent in real-world domains by means of probabilistic edges whose value quantifies the likelihood of the edge existence or the strength of the link it represents. The goal of this paper is to provide a learning method to compute the most likely relationship between two nodes in a framework based on probabilistic graphs. In particular, given a probabilistic graph we adopted the language-constrained reachability method to compute the probability of possible interconnections that may exists between two nodes. Each of these connections may be viewed as feature, or factor, between the two nodes and the corresponding probability as its weight. Each observed link is considered as a positive instance for its corresponding link label. Given the training set of observed links a L2-regularized Logistic Regression has been adopted to learn a model able to predict unobserved link labels.
Claudio Taranto, Nicola Di Mauro, Floriana Esposito

Improving Robustness and Flexibility of Concept Taxonomy Learning from Text

The spread and abundance of electronic documents requires automatic techniques for extracting useful information from the text they contain. The availability of conceptual taxonomies can be of great help, but manually building them is a complex and costly task. Building on previous work, we propose a technique to automatically extract conceptual graphs from text and reason with them. Since automated learning of taxonomies needs to be robust with respect to missing or partial knowledge and flexible with respect to noise, this work proposes a way to deal with these problems. The case of poor data/sparse concepts is tackled by finding generalizations among disjoint pieces of knowledge. Noise is handled by introducing soft relationships among concepts rather than hard ones, and applying a probabilistic inferential setting. In particular, we propose to reason on the extracted graph using different kinds of relationships among concepts, where each arc/relationship is associated to a weight that represents its likelihood among all possible worlds, and to face the problem of sparse knowledge by using generalizations among distant concepts as bridges between disjoint portions of knowledge.
Fabio Leuzzi, Stefano Ferilli, Fulvio Rotella

Discovering Evolution Chains in Dynamic Networks

Most of the works on learning from networked data assume that the network is static. In this paper we consider a different scenario, where the network is dynamic, i.e. nodes/relationships can be added or removed and relationships can change in their type over time. We assume that the “core” of the network is more stable than the “marginal” part of the network, nevertheless it can change with time. These changes are of interest for this work, since they reflect a crucial step in the network evolution. Indeed, we tackle the problem of discovering evolution chains, which express the temporal evolution of the “core” of the network. To describe the “core” of the network, we follow a frequent pattern-mining approach, with the critical difference that the frequency of a pattern is computed along a time-period and not on a static dataset. The proposed method proceeds in two steps: 1) identification of changes through the discovery of emerging patterns; 2) composition of evolution chains by joining emerging patterns. We test the effectiveness of the method on both real and synthetic data.
Corrado Loglisci, Michelangelo Ceci, Donato Malerba

Supporting Information Spread in a Social Internetworking Scenario

The problem of quickly, capillary and effectively spreading information over social networks has become extremely important in many areas of our society. This problem has been widely studied in the recent literature and is still open, but it becomes even more challenging, due to the new issues to deal with, in a multi-social-network context, where the possibility that information can cross different social networks has a fundamental role. As a matter of fact, this is the scenario towards which social networks are evolving with a rapid increase of the mutual interaction among them. In this new scenario, called Social Internetworking Scenario (SIS, for short), we propose an approach devoted to favor information spreading, by identifying two stereotypes, specific for SISs, which are expected to be good spreaders: the starter and the bridge.
Francesco Buccafurri, Gianluca Lax, Antonino Nocera, Domenico Ursino

Context-Aware Predictions on Business Processes: An Ensemble-Based Solution

The discovery of predictive models for process performances is an emerging topic, which poses a series of difficulties when considering complex and flexible processes, whose behaviour tend to change over time depending on context factors. We try to face such a situation by proposing a predictive-clustering approach, where different context-related execution scenarios are equipped with separate prediction models. Recent methods for the discovery of both Predictive Clustering Trees and state-aware process performance predictors can be reused in the approach, provided that the input log is preliminary converted into a suitable propositional form, based on the identification of an optimal subset of features for log traces. In order to make the approach more robust and parameter free, we also introduce an ensemble-based clustering method, where multiple PCTs are learnt (using different, randomly selected, subsets of features), and integrated into an overall model. Several tests on real-life logs confirmed the validity of the approach.
Francesco Folino, Massimo Guarascio, Luigi Pontieri

Erratum: Machine Learning as an Objective Approach to Understanding Music

The paper starting on page 64 of this publication has been withdrawn because Figure 1 is incorrect and it is unclear if the paper’s results can be reproduced.
Claire Q, Ross D. King


Weitere Informationen

Premium Partner