2009 | Book

Discovery Science

12th International Conference, DS 2009, Porto, Portugal, October 3-5, 2009

Editors: João Gama, Vítor Santos Costa, Alípio Mário Jorge, Pavel B. Brazdil

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


Table of Contents

Frontmatter
Inference and Learning in Planning (Extended Abstract)

Planning is concerned with the development of solvers for a wide range of models where actions must be selected for achieving goals. In these models, actions may be deterministic or not, and full or partial sensing may be available. In the last few years, significant progress has been made, resulting in algorithms that can produce plans effectively in a variety of settings. These developments have to do with the formulation and use of general inference techniques and transformations. In this invited talk, I’ll review the inference techniques used for solving individual planning instances from scratch, and discuss the use of learning methods and transformations for obtaining more general solutions.

Hector Geffner
Mining Heterogeneous Information Networks by Exploring the Power of Links

Knowledge is power, but for interrelated data, knowledge is often hidden in massive links in heterogeneous information networks. We explore the power of links in mining heterogeneous information networks through several interesting tasks, including link-based object distinction, veracity analysis, multidimensional online analytical processing of heterogeneous information networks, and rank-based clustering. Some recent results of our research that explore the crucial information hidden in links will be introduced, including (1) Distinct for object distinction analysis, (2) TruthFinder for veracity analysis, (3) Infonet-OLAP for online analytical processing of information networks, and (4) RankClus for integrated ranking-based clustering. We also discuss some of our on-going studies in this direction.

Jiawei Han
Learning on the Web

It is commonplace to say that the Web has changed everything. Machine learning researchers often say that their projects and results respond to that change with better methods for finding and organizing Web information. However, not much of the theory, or even the current practice, of machine learning takes the Web seriously. We continue to devote much effort to refining supervised learning, but the Web reality is that labeled data is hard to obtain, while unlabeled data is inexhaustible. We cling to the iid assumption, while all the Web data generation processes drift rapidly and involve many hidden correlations. Much of our theory and many of our algorithms assume data representations of fixed dimension, while in fact the dimensionality of data, for example the number of distinct words in text, grows with data size. While there has been much work recently on learning with sparse representations, the actual patterns of sparsity on the Web have received little attention. Those patterns might be very relevant to the communication costs of distributed learning algorithms, which are necessary at Web scale, but little work has been done on this.

Fernando C. N. Pereira
Learning and Domain Adaptation

Domain adaptation is a fundamental learning problem where one wishes to use labeled data from one or several source domains to learn a hypothesis performing well on a different, yet related, domain for which no labeled data is available. This generalization across domains is a very significant challenge for many machine learning applications and arises in a variety of natural settings, including NLP tasks (document classification, sentiment analysis, etc.), speech recognition (speakers and noise or environment adaptation) and face recognition (different lighting conditions, different population composition).

The learning theory community has only recently started to analyze domain adaptation problems. In the talk, I will overview some recent theoretical models and results regarding domain adaptation.

This talk is based on joint work with Mehryar Mohri and Afshin Rostamizadeh.

Yishay Mansour
The Two Faces of Active Learning

The active learning model is motivated by scenarios in which it is easy to amass vast quantities of unlabeled data (images and videos off the web, speech signals from microphone recordings, and so on) but costly to obtain their labels. Like supervised learning, the goal is ultimately to learn a classifier. But like unsupervised learning, the data come unlabeled. More precisely, the labels are hidden, and each of them can be revealed only at a cost. The idea is to query the labels of just a few points that are especially informative about the decision boundary, and thereby to obtain an accurate classifier at significantly lower cost than regular supervised learning.

Sanjoy Dasgupta
An Iterative Learning Algorithm for Within-Network Regression in the Transductive Setting

Within-network regression addresses the task of regression in partially labeled networked data where labels are sparse and continuous. Data for inference consist of entities associated with nodes for which labels are known, interlinked with nodes for which labels must be estimated. The premise of this work is that many networked datasets are characterized by a form of autocorrelation in which the value of the response variable in a node depends on the values of the predictor variables of interlinked nodes. This autocorrelation violates the assumption that observations are independent. To overcome this problem, lagged predictor variables are added to the regression model. We investigate a computational solution to this problem in the transductive setting, which asks for predicting the response values only for the unlabeled nodes of the network. The neighborhood relation is computed on the basis of the node links. We propose a regression inference procedure based on a co-training approach, in which two separate model trees are learned, one from the attribute values of labeled nodes and one from the attribute values aggregated in the neighborhood of labeled nodes. Each model tree is used to label the unlabeled nodes for the other during an iterative learning process. The set of labeled data is changed by including labels which are estimated as confident, where the confidence estimate is based on the influence of the predicted labels on known labels of interlinked nodes. Experiments with sparsely labeled networked data show that the proposed method improves traditional model tree induction.
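
A minimal sketch of the co-training loop described above, assuming scikit-learn and networkx, with nodes numbered 0..n-1 indexing the rows of the attribute matrix; DecisionTreeRegressor stands in for a model tree learner, and the agreement-based confidence heuristic is a simplified placeholder rather than the paper's influence-based criterion.

```python
import numpy as np
import networkx as nx
from sklearn.tree import DecisionTreeRegressor

def neighbour_aggregates(G, X, nodes):
    """Average the predictor attributes over each node's neighbourhood (lagged variables)."""
    rows = []
    for v in nodes:
        nbrs = list(G.neighbors(v))
        rows.append(X[nbrs].mean(axis=0) if nbrs else X[v])
    return np.array(rows)

def cotrain_regression(G, X, y, labelled, unlabelled, iterations=10, batch=5):
    labelled, unlabelled = list(labelled), list(unlabelled)
    for _ in range(iterations):
        if not unlabelled:
            break
        # View 1: node attributes; view 2: attributes aggregated over the neighbourhood.
        t1 = DecisionTreeRegressor().fit(X[labelled], y[labelled])
        t2 = DecisionTreeRegressor().fit(neighbour_aggregates(G, X, labelled), y[labelled])
        p1 = t1.predict(X[unlabelled])
        p2 = t2.predict(neighbour_aggregates(G, X, unlabelled))
        # Placeholder confidence: keep the predictions on which the two views agree most.
        for i in np.argsort(np.abs(p1 - p2))[:batch]:
            v = unlabelled[i]
            y[v] = (p1[i] + p2[i]) / 2.0   # each view labels nodes for the other
            labelled.append(v)
        unlabelled = [v for v in unlabelled if v not in set(labelled)]
    return y
```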

Annalisa Appice, Michelangelo Ceci, Donato Malerba
Detecting New Kinds of Patient Safety Incidents

We present a novel approach to discovering small groups of anomalously similar pieces of free text.

The UK’s National Reporting and Learning System (NRLS) contains free text and categorical variables describing several million patient safety incidents that have occurred in the National Health Service. The groups of interest represent previously unknown incident types. The task is particularly challenging because the free text descriptions vary greatly in length, from very short to quite extensive, and include arbitrary abbreviations and misspellings as well as technical medical terms. Incidents of the same type may also be described in many different ways.

The aim of the analysis is to produce a global, numerical model of the text, such that the relative positions of the incidents in the model space reflect their meanings. A high dimensional vector space model of the text passages is produced; TF-IDF term weighting is applied, reflecting the differing importance of particular words to a description’s meaning. The dimensionality of the model space is reduced, using principal component and linear discriminant analysis. The supervised analysis uses categorical variables from the NRLS, and allows incidents of similar meaning to be positioned close to one another in the model space. Anomaly detection tools are then used to find small groups of descriptions that are more similar than one would expect. The results are evaluated by having the groups assessed qualitatively by domain experts to see whether they are of substantive interest.
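
A rough sketch of the text-modelling pipeline outlined above, assuming scikit-learn; TruncatedSVD stands in for the principal component step, the supervised (discriminant) step is omitted, and the nearest-neighbour score is only an illustrative way of surfacing anomalously tight small groups.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors

def embed_incidents(descriptions, n_components=50):
    """Map free-text incident descriptions to a reduced TF-IDF vector space."""
    tfidf = TfidfVectorizer(min_df=2)                    # TF-IDF term weighting
    return TruncatedSVD(n_components=n_components).fit_transform(
        tfidf.fit_transform(descriptions))

def tight_group_scores(Z, k=5):
    """Mean distance of each incident to its k nearest neighbours; unusually low
    scores point at small groups of anomalously similar descriptions."""
    dist, idx = NearestNeighbors(n_neighbors=k + 1).fit(Z).kneighbors(Z)
    return dist[:, 1:].mean(axis=1), idx[:, 1:]
```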

James Bentham, David J. Hand
Using Data Mining for Wine Quality Assessment

Certification and quality assessment are crucial issues within the wine industry. Currently, wine quality is mostly assessed by physicochemical (e.g., alcohol levels) and sensory (e.g., human expert evaluation) tests. In this paper, we propose a data mining approach to predict wine preferences that is based on easily available analytical tests at the certification step. A large dataset is considered, with white vinho verde samples from the Minho region of Portugal. Wine quality is modeled under a regression approach, which preserves the order of the grades. Explanatory knowledge is given in terms of a sensitivity analysis, which measures the response changes when a given input variable is varied through its domain. Three regression techniques were applied, under a computationally efficient procedure that performs simultaneous variable and model selection and that is guided by the sensitivity analysis. The support vector machine achieved promising results, outperforming the multiple regression and neural network methods. Such a model is useful for understanding how physicochemical tests affect sensory preferences. Moreover, it can support the wine expert evaluations and ultimately improve production.
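
A hedged illustration (not the authors' exact experimental setup), assuming scikit-learn and a local copy of the UCI white vinho verde file winequality-white.csv: an SVM regressor plus the kind of one-dimensional sensitivity analysis described above, which varies one input through its domain while holding the others at their means.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

data = pd.read_csv("winequality-white.csv", sep=";")     # file location is an assumption
X, y = data.drop(columns="quality").values, data["quality"].values
model = make_pipeline(StandardScaler(), SVR()).fit(X, y)

def sensitivity(model, X, feature, levels=7):
    """Vary one input over its observed range, hold the others at their means,
    and record the predicted response; the spread measures the input's importance."""
    probes = np.tile(X.mean(axis=0), (levels, 1))
    probes[:, feature] = np.linspace(X[:, feature].min(), X[:, feature].max(), levels)
    responses = model.predict(probes)
    return probes[:, feature], responses, responses.max() - responses.min()
```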

Paulo Cortez, Juliana Teixeira, António Cerdeira, Fernando Almeida, Telmo Matos, José Reis
MICCLLR: Multiple-Instance Learning Using Class Conditional Log Likelihood Ratio

Multiple-instance learning (MIL) is a generalization of the supervised learning problem where each training observation is a labeled bag of unlabeled instances. Several supervised learning algorithms have been successfully adapted to the multiple-instance learning setting. We explore the adaptation of the Naive Bayes (NB) classifier and the utilization of its sufficient statistics for developing novel multiple-instance learning methods. Specifically, we introduce MICCLLR (multiple-instance class conditional log likelihood ratio), a method that maps each bag of instances to a single meta-instance using class conditional log likelihood ratio statistics, so that any supervised base classifier can be applied to the meta-data. The results of our experiments with MICCLLR using different base classifiers suggest that no single base classifier consistently outperforms the others on all data sets. We show that a substantial improvement in performance is obtained using an ensemble of MICCLLR classifiers trained with different base learners. We also show that an extra gain in classification accuracy is obtained by applying AdaBoost.M1 to weak MICCLLR classifiers. Overall, our results suggest that the predictive performance of the three proposed variants of MICCLLR is competitive with some of the state-of-the-art MIL methods.
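
A rough interpretation of the bag-to-meta-instance mapping, assuming scikit-learn and binary 0/1 bag labels: labels are propagated to instances, a Gaussian Naive Bayes model supplies per-instance class-conditional log-likelihood ratios, and each bag is summarised by their mean; the paper's exact statistics may differ from this sketch.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

def bags_to_meta_instances(bags, bag_labels):
    """bags: list of 2-D arrays (instances per bag); bag_labels: 0/1 per bag."""
    X = np.vstack(bags)
    y = np.concatenate([[lab] * len(b) for b, lab in zip(bags, bag_labels)])
    nb = GaussianNB().fit(X, y)                          # instance-level NB statistics
    def meta(bag):
        logp = nb.predict_log_proba(bag)                 # columns ordered as classes 0, 1
        return (logp[:, 1] - logp[:, 0]).mean(keepdims=True)
    return np.vstack([meta(b) for b in bags])

# Any supervised base classifier can then be trained on the meta-instances, e.g.:
# clf = LogisticRegression().fit(bags_to_meta_instances(bags, labels), labels)
```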

Yasser EL-Manzalawy, Vasant Honavar
On the Complexity of Constraint-Based Theory Extraction

In this paper we rule out output polynomial listing algorithms for the general problem of discovering theories for a conjunction of monotone and anti-monotone constraints, as well as for the particular subproblem in which all constraints are frequency-based. For the general problem we prove a concrete exponential lower time bound that holds for any correct algorithm, even in cases in which the size of the theory as well as the only previously known bound are constant. For the case of frequency-based constraints our result holds unless P = NP. These findings motivate further research to identify tractable subproblems and justify approaches with exponential worst-case complexity.

Mario Boley, Thomas Gärtner
Algorithm and Feature Selection for VegOut: A Vegetation Condition Prediction Tool

The goal of the VegOut tool is to provide accurate early-warning drought prediction. VegOut integrates climate, oceanic, and satellite-based vegetation indicators to identify historical patterns between drought and vegetation condition indices and to predict future vegetation conditions based on these patterns at multiple time steps (2, 4 and 6-week outlooks). This paper evaluates different sets of data mining techniques and various climatic indices for improving the prediction accuracy of the VegOut tool.

Sherri Harms, Tsegaye Tadesse, Brian Wardlow
Regression Trees from Data Streams with Drift Detection

The problem of extracting meaningful patterns from time-changing data streams is of increasing importance for the machine learning and data mining communities. We present an algorithm which is able to learn regression trees from fast and unbounded data streams in the presence of concept drift. To the best of our knowledge, there is no other algorithm for incremental learning of regression trees equipped with change detection abilities. The FIRT-DD algorithm has mechanisms for drift detection and model adaptation which make it possible to maintain accurate and updated regression models at any time. The drift detection mechanism is based on sequential statistical tests that track the evolution of the local error at each node of the tree and inform the learning process of the detected changes. In response to a local drift, the algorithm adapts the model only locally, avoiding the need for a global model adaptation. The adaptation strategy consists of building a new tree whenever a change is suspected in a region and replacing the old one when the new tree becomes more accurate. This enables smooth and granular adaptation of the global model. The results of the empirical evaluation performed over several different types of drift show that the algorithm consistently detects and properly adapts to concept drifts.
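
To make the drift-detection idea concrete, here is a minimal sketch of one standard sequential change test, the Page-Hinkley test, applied to a node's stream of absolute errors; it is shown only as an illustration and is not necessarily the exact test used by FIRT-DD.

```python
class PageHinkley:
    """Sequential test signalling a significant increase in the mean of an error stream."""
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta, self.threshold = delta, threshold
        self.n, self.mean, self.cum, self.min_cum = 0, 0.0, 0.0, 0.0

    def update(self, error):
        """Feed one absolute error observed at this tree node; True means drift is suspected."""
        self.n += 1
        self.mean += (error - self.mean) / self.n
        self.cum += error - self.mean - self.delta
        self.min_cum = min(self.min_cum, self.cum)
        return self.cum - self.min_cum > self.threshold
```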

Elena Ikonomovska, João Gama, Raquel Sebastião, Dejan Gjorgjevik
Mining Frequent Bipartite Episode from Event Sequences

In this paper, we first introduce a bipartite episode of the form A ↦ B for two sets A and B of events, which means that every event of A is followed by every event of B. Then, we present an algorithm that finds all frequent bipartite episodes from an input sequence S without duplication in O(|Σ| · N) time per episode and in O(|Σ|² n) space, where Σ is the alphabet, N is the total input size of S, and n is the length of S. Finally, we give experimental results on artificial and real sequences to evaluate the efficiency of the algorithm.
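
A naive illustration of the episode semantics (not the authors' mining algorithm), assuming occurrences are counted over sliding windows of a fixed width: a window supports A ↦ B when every event of A has an occurrence preceding every event of B.

```python
def window_supports(window, A, B):
    """True if, within the window, each event of A occurs before each event of B."""
    first = {a: next((i for i, e in enumerate(window) if e == a), None) for a in A}
    last = {b: max((i for i, e in enumerate(window) if e == b), default=None) for b in B}
    if any(v is None for v in first.values()) or any(v is None for v in last.values()):
        return False
    return max(first.values()) < min(last.values())

def support(sequence, A, B, width):
    """Number of sliding windows of the given width that support A -> B."""
    return sum(window_supports(sequence[i:i + width], A, B)
               for i in range(len(sequence) - width + 1))

# Example: support("abacbdbc", {"a"}, {"b", "c"}, width=4)
```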

Takashi Katoh, Hiroki Arimura, Kouichi Hirata
CHRONICLE: A Two-Stage Density-Based Clustering Algorithm for Dynamic Networks

Information networks, such as social networks and networks extracted from bibliographic data, change dynamically over time. It is crucial to discover time-evolving communities in dynamic networks. In this paper, we study the problem of finding time-evolving communities such that each community freely forms, evolves, and dissolves for any time period. Although previous t-partite graph based methods are quite effective for discovering such communities from large-scale dynamic networks, they have some weak points, such as finding only stable clusters of single path type and not being scalable w.r.t. the time period. We propose CHRONICLE, an efficient clustering algorithm that discovers not only clusters of single path type but also clusters of path group type. In order to find clusters of both types and also control the dynamicity of clusters, CHRONICLE performs two-stage density-based clustering: the second-stage density-based clustering is performed on the t-partite graph constructed from the first-stage density-based clustering result for each timestamp network. For a given data set, CHRONICLE finds all clusters in a fixed time by using a fixed amount of memory, regardless of the number and the length of clusters. Experimental results using real data sets show that CHRONICLE finds a wider range of clusters in a shorter time with a much smaller amount of memory than the previous method.

Min-Soo Kim, Jiawei Han
Learning Large Margin First Order Decision Lists for Multi-Class Classification

Inductive Logic Programming (ILP) systems have been successfully applied to solve binary classification problems. It remains an open question how an accurate solution to a multi-class problem can be obtained using a logic-based learning method. In this paper we present a novel logic-based approach to solve challenging multi-class classification problems. Our technique is based on the use of large margin methods in conjunction with kernels constructed from first order rules induced by an ILP system. The proposed approach learns a multi-class classifier by using a divide-and-conquer reduction strategy that splits the multi-class problem into binary groups and solves each individual problem recursively, hence generating an underlying decision list structure. We also study the well-known one-vs-all scheme in conjunction with logic-based kernel learning. In order to construct a highly informative logical and relational space we introduce a low dimensional embedding method. The technique is applicable to both skewed and non-skewed class distributions; multi-class problems such as protein fold recognition are generally characterized by highly uneven class distributions. We performed a series of experiments to evaluate the proposed rule selection and multi-class schemes. The methods were applied to solve challenging problems in computational biology and bioinformatics, namely multi-class protein fold recognition and mutagenicity detection. Experimental comparisons of the performance of the large margin first order decision list based multi-class scheme with the standard multi-class ILP algorithm and the multi-class Support Vector Machine yielded statistically significant results. The results also demonstrated a favorable comparison between the performance of the decision list based scheme and the one-vs-all strategy.

Huma Lodhi, Stephen Muggleton, Mike J. E. Sternberg
Centrality Measures from Complex Networks in Active Learning

In this paper, we present some preliminary results indicating that Complex Network properties may be useful to improve performance of Active Learning algorithms. In fact, centrality measures derived from networks generated from the data allow ranking the instances to find out the best ones to be presented to a human expert for manual classification. We discuss how to rank the instances based on the network vertex properties of closeness and betweenness. Such measures, used in isolation or combined, enable identifying regions in the data space that characterize prototypical or critical examples in terms of the classification task. Results obtained on different data sets indicate that, as compared to random selection of training instances, the approach reduces error rate and variance, as well as the number of instances required to reach representatives of all classes.

Robson Motta, Alneu de Andrade Lopes, Maria Cristina F. de Oliveira
Player Modeling for Intelligent Difficulty Adjustment

In this paper we aim at automatically adjusting the difficulty of computer games by clustering players into different types and by supervised prediction of the type from short traces of gameplay. An important ingredient of video games is challenging players with tasks of appropriate and increasing difficulty. How this difficulty should be chosen and how it should increase over time strongly depends on the ability, experience, perception and learning curve of each individual player. It is a subjective parameter that is very difficult to set. Wrong choices can easily lead to players abandoning the game as they get bored (if underburdened) or frustrated (if overburdened). An ideal game should be able to adjust its difficulty dynamically, governed by the player’s performance. Modern video games utilise a game-testing process to investigate, among other factors, the perceived difficulty for a multitude of players. In this paper, we investigate how machine learning techniques can be used for automatic difficulty adjustment. Our experiments confirm the potential of machine learning in this application.

Olana Missura, Thomas Gärtner
Unsupervised Fuzzy Clustering for the Segmentation and Annotation of Upwelling Regions in Sea Surface Temperature Images

The Anomalous Pattern algorithm is explored as an initialization strategy for Fuzzy K-Means (FCM) with the sequential extraction of clusters, which simultaneously allows the determination of the number of clusters. The composed algorithm, Anomalous Pattern Fuzzy Clustering (AP-FCM), is applied to the segmentation of Sea Surface Temperature (SST) images for the identification of coastal upwelling.

A set of features is constructed from the AP-FCM clustering segmentation taking into account domain knowledge, and a threshold procedure is defined in order to identify the transition cluster, whose frontline is automatically annotated on SST images to separate the upwelling regions from the background.

Two independent data samples, totalling 61 SST images covering a large diversity of upwelling situations, are analysed. Results show that by tuning the AP-FCM stop conditions the algorithm fits an appropriate number of clusters, providing an effective segmentation of the SST images whose spatial visualization of fuzzy membership closely reproduces the original images. Comparing the AP-FCM with the FCM using several validation indices shows the advantage of the AP-FCM in avoiding under- or over-segmented images. Quantitative assessment of the segmentations is accomplished through ROC analysis. Compared to FCM, the number of iterations of the AP-FCM is significantly decreased.

The automatic annotation of upwelling frontlines from the AP-FCM segmentation removes the need for subjective visual inspection by oceanographers.

Susana Nascimento, Pedro Franco
Discovering the Structures of Open Source Programs from Their Developer Mailing Lists

This paper presents a method which discovers the structure of given open source programs from their developer mailing lists. Our goal is to help successive developers understand the structures and the components of open source programs even if sufficient documentation is not provided. Our method consists of two phases: (1) producing a mapping between the source files and the emails, and (2) constructing a lattice from the produced mapping and then reducing it with a novel algorithm, called PRUNIA (PRUNing Algorithm Based on Introduced Attributes), in order to obtain a more compact structure. We performed experiments with some open source projects which originated in or are popular in Japan, such as Namazu and Ruby. The experimental results reveal that the extracted structures reflect very well important parts of the hidden structures of the programs.

Dinh Anh Nguyen, Koichiro Doi, Akihiro Yamamoto
A Comparison of Community Detection Algorithms on Artificial Networks

Community detection has become a very important part of complex network analysis. Authors traditionally test their algorithms on a few real or artificial networks. Testing on real networks is necessary but also limited: the considered real networks are usually small, the actual underlying communities are generally not defined objectively, and it is not possible to control their properties. Generating artificial networks makes it possible to overcome these limitations. Until recently, though, most works used variations of the classic Erdős-Rényi random model and consequently suffered from the same flaws, generating networks that are not realistic enough. In this work, we use the model of Lancichinetti et al., which is able to generate networks with controlled power-law degree and community size distributions, to test some community detection algorithms. We analyze the properties of the generated networks and use the normalized mutual information measure to assess the quality of the results and compare the considered algorithms.
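
A short sketch of the evaluation step, assuming networkx and scikit-learn: a benchmark graph with planted communities is generated (the generator call below follows the networkx LFR benchmark example and only approximates the setup of the paper), a detection algorithm is run, and the two partitions are compared with normalized mutual information.

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from sklearn.metrics import normalized_mutual_info_score

G = nx.LFR_benchmark_graph(250, tau1=3, tau2=1.5, mu=0.1,
                           average_degree=5, min_community=20, seed=10)
truth = {v: min(G.nodes[v]["community"]) for v in G}       # planted community of each node

found = {v: i for i, c in enumerate(greedy_modularity_communities(G)) for v in c}
nodes = sorted(G)
nmi = normalized_mutual_info_score([truth[v] for v in nodes], [found[v] for v in nodes])
print(f"NMI between planted and detected communities: {nmi:.3f}")
```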

Günce Keziban Orman, Vincent Labatut
Towards an Ontology of Data Mining Investigations

Motivated by the need for unification of the domain of data mining and the demand for a formalized representation of the outcomes of data mining investigations, we address the task of constructing an ontology of data mining. In this paper we present an updated version of the OntoDM ontology, which is based on a recent proposal of a general framework for data mining and is aligned with the ontology of biomedical investigations (OBI). The ontology aims at describing and formalizing entities from the domain of data mining and knowledge discovery. It includes definitions of basic data mining entities (e.g., datatype, dataset, data mining task, data mining algorithm, etc.) and allows extensions with more complex data mining entities (e.g., constraints, data mining scenarios and data mining experiments). Unlike most existing approaches to constructing ontologies of data mining, OntoDM is compliant with best practices in engineering ontologies that describe scientific investigations (e.g., OBI) and is a step towards an ontology of data mining investigations. OntoDM is available at: http://kt.ijs.si/panovp/OntoDM/.

Panče Panov, Larisa N. Soldatova, Sašo Džeroski
OMFP: An Approach for Online Mass Flow Prediction in CFB Boilers

Fuel feeding and inhomogeneity of fuel typically cause process fluctuations in circulating fluidized bed (CFB) boilers. If control systems fail to compensate for the fluctuations, the whole plant will suffer from fluctuations that are reinforced by the closed-loop controls. Accurate estimates of fuel consumption, among other factors, are needed for control system operation. In this paper we address the problem of online mass flow prediction. In particular, we consider the problems of (1) constructing the ground truth, (2) handling noise and abrupt concept drift, and (3) learning an accurate predictor. Last but not least, we emphasize the importance of having domain knowledge concerning the considered case. We demonstrate the performance of OMFP using real data sets collected from an experimental CFB boiler.

Indrė Žliobaitė, Jorn Bakker, Mykola Pechenizkiy
C-DenStream: Using Domain Knowledge on a Data Stream

Stream clustering algorithms are traditionally designed to process streams efficiently and to adapt to the evolution of the underlying population. This is done without assuming any prior knowledge about the data. However, in many cases, a certain amount of domain or background knowledge is available, and instead of simply using it for the external validation of the clustering results, this knowledge can be used to guide the clustering process. In non-stream data, domain knowledge is exploited in the context of semi-supervised clustering.

In this paper, we extend the static semi-supervised learning paradigm to streams. We present C-DenStream, a density-based clustering algorithm for data streams that includes domain information in the form of constraints. We also propose a novel method for the use of background knowledge in data streams. A performance study over a number of real and synthetic data sets demonstrates the effectiveness and efficiency of our method. To our knowledge, this is the first approach to include domain knowledge in clustering for data streams.

Carlos Ruiz, Ernestina Menasalvas, Myra Spiliopoulou
Discovering Influential Nodes for SIS Models in Social Networks

We address the problem of efficiently discovering the influential nodes in a social network under the susceptible/infected/susceptible (SIS) model, a diffusion model in which nodes are allowed to be activated multiple times. The computational complexity drastically increases because of this multiple activation property. We solve this problem by constructing a layered graph from the original social network, with one layer added on top of another as time proceeds, and applying bond percolation with pruning and burnout strategies. Using two large-scale real-world networks (a blog network and a Wikipedia network), we experimentally demonstrate that the proposed method gives much better solutions than conventional methods that are solely based on the notion of centrality for social network analysis. We further show by theoretical analysis that the computational complexity of the proposed method is much smaller than that of the conventional naive probabilistic simulation method, and confirm this by experimentation. The properties of the influential nodes discovered are substantially different from those identified by the centrality-based heuristic methods.
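
For contrast, a sketch of the naive probabilistic simulation baseline mentioned above (not the proposed layered-graph bond-percolation method), assuming networkx; parameter names and values are illustrative.

```python
import random
import networkx as nx

def sis_influence(G, seed, p_infect=0.1, p_recover=0.3, steps=20, runs=200):
    """Estimate a node's influence under a discrete-time SIS process by Monte Carlo:
    the average number of activations generated when only `seed` starts infected.
    Nodes may be re-infected after recovery, which is the SIS property."""
    total = 0
    for _ in range(runs):
        infected, activations = {seed}, 1
        for _ in range(steps):
            newly = {v for u in infected for v in G.neighbors(u)
                     if v not in infected and random.random() < p_infect}
            recovered = {u for u in infected if random.random() < p_recover}
            infected = (infected - recovered) | newly
            activations += len(newly)
        total += activations
    return total / runs

# Ranking nodes this way is accurate but slow, which is what the proposed method avoids:
# ranking = sorted(G, key=lambda v: sis_influence(G, v), reverse=True)
```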

Kazumi Saito, Masahiro Kimura, Hiroshi Motoda
An Empirical Comparison of Probability Estimation Techniques for Probabilistic Rules

Rule learning is known for its descriptive and therefore comprehensible classification models which also yield good class predictions. However, in some application areas, we also need good class probability estimates. For different classification models, such as decision trees, a variety of techniques for obtaining good probability estimates have been proposed and evaluated. However, so far, there has been no systematic empirical study of how these techniques can be adapted to probabilistic rules and how these methods affect the probability-based rankings. In this paper we apply several basic methods for the estimation of class membership probabilities to classification rules. We also study the effect of a shrinkage technique for merging the probability estimates of rules with those of their generalizations.

Jan-Nikolas Sulzmann, Johannes Fürnkranz
Precision and Recall for Regression

Cost sensitive prediction is a key task in many real world applications. Most existing research in this area deals with classification problems. This paper addresses a related regression problem: the prediction of rare extreme values of a continuous variable. These values are often regarded as outliers and removed from subsequent analysis. However, for many applications (e.g. in finance, meteorology, biology, etc.) these are the key values that we want to accurately predict. Any learning method obtains models by optimizing some preference criteria. In this paper we propose new evaluation criteria that are better suited to these applications. We describe a generalization for regression of the concepts of precision and recall often used in classification. Using these new evaluation metrics we are able to focus the evaluation of predictive models on the cases that really matter for these applications. Our experiments indicate the advantages of the use of these new measures when comparing predictive models in the context of our target applications.
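
A much-simplified illustration of the idea (the paper's definitions are utility-based and more general): if a relevance threshold marks the rare extreme values as the events of interest, precision and recall can be computed over which cases the model predicts as extreme.

```python
import numpy as np

def precision_recall_for_extremes(y_true, y_pred, threshold):
    """Treat values at or above `threshold` as the rare extreme events to be captured."""
    relevant = y_true >= threshold        # truly extreme cases
    retrieved = y_pred >= threshold       # cases the model predicts as extreme
    tp = np.sum(relevant & retrieved)
    precision = tp / max(np.sum(retrieved), 1)
    recall = tp / max(np.sum(relevant), 1)
    return precision, recall

# Example: precision_recall_for_extremes(np.array([0.1, 5.0, 7.2]),
#                                        np.array([0.2, 4.8, 2.0]), threshold=4.0)
```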

Luis Torgo, Rita Ribeiro
Mining Local Correlation Patterns in Sets of Sequences

Given a set of (possibly infinite) sequences, we consider the problem of detecting events where a subset of the sequences is correlated for a short period. In other words, we want to find cases where a number of the sequences output exactly the same substring at the same time. Such substrings, together with the sequences in which they are contained, form a local correlation pattern. In practice we only want to find patterns that are longer than γ and appear in at least σ sequences.

Our main contribution is an algorithm for mining such patterns in an online case, where the sequences are read in parallel one symbol at a time (no random access) and the patterns must be reported as soon as they occur.

We conduct experiments on both artificial and real data. The results show that the proposed algorithm scales well as the number of sequences increases. We also conduct a case study using a public EEG dataset. We show that the local correlation patterns capture essential features that can be used to automatically distinguish subjects diagnosed with a genetic predisposition to alcoholism from a control group.
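
A simplified online sketch of the pattern definition (not the paper's algorithm): streams are read one symbol at a time, and at every step the sequences whose last γ symbols coincide exactly form a candidate local correlation pattern if there are at least σ of them.

```python
from collections import defaultdict, deque

def local_correlation_patterns(streams, gamma, sigma):
    """`streams` is a list of equal-rate symbol iterables read in parallel;
    yields (time, substring, member indices) whenever a pattern is observed."""
    suffixes = [deque(maxlen=gamma) for _ in streams]
    for t, symbols in enumerate(zip(*streams)):          # one symbol per stream per step
        groups = defaultdict(list)
        for i, s in enumerate(symbols):
            suffixes[i].append(s)
            if len(suffixes[i]) == gamma:
                groups[tuple(suffixes[i])].append(i)
        for substring, members in groups.items():
            if len(members) >= sigma:
                yield t, "".join(substring), members

# Example: list(local_correlation_patterns(["xabcx", "yabcy", "zzzzz"], gamma=3, sigma=2))
# -> [(3, 'abc', [0, 1])]
```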

Antti Ukkonen
Subspace Discovery for Promotion: A Cell Clustering Approach

The promotion analysis problem has been proposed in , where ranking-based promotion query processing techniques are studied to effectively and efficiently promote a given object, such as a product, by exploring ranked answers. To be more specific, in a multidimensional data set, our goal is to discover interesting subspaces in which the object is ranked high. In this paper, we extend the previously proposed promotion cube techniques and develop a cell clustering approach that achieves a better tradeoff between offline materialization and online query processing. We formally formulate our problem and present a solution to it. Our empirical evaluation on both synthetic and real data sets shows that the proposed technique can greatly speed up query processing with respect to baseline implementations.

Tianyi Wu, Jiawei Han
Contrasting Sequence Groups by Emerging Sequences

Group comparison per se is a fundamental task in many scientific endeavours and is also the basis of any classifier. Contrast sets and emerging patterns contrast groups of categorical data. Comparing groups of sequence data is a relevant task in many applications. We define Emerging Sequences (ESs) as subsequences that are frequent in sequences of one group and less frequent in the sequences of another, and that thus distinguish or contrast sequences of different classes. There are two challenges in distinguishing sequence classes: the extraction of ESs is not trivially efficient, and only exact matches of sequences are usually considered. In our work we address these problems with a suffix tree-based framework and a similarity-based matching mechanism. We propose a classifier based on Emerging Sequences. Evaluating against two learning algorithms based on frequent subsequences and exact matching subsequences, experiments on two datasets show that our model outperforms the baseline approaches by up to 20% in prediction accuracy.
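
A small sketch of the contrast measure behind Emerging Sequences (the mining procedure and the similarity-based matching are omitted), assuming sequences are plain Python iterables: the support of a candidate subsequence is computed per group, and the support ratio indicates how strongly it distinguishes one group from the other.

```python
def contains_subsequence(sequence, pattern):
    """True if `pattern` occurs in `sequence` as an ordered, not necessarily contiguous, subsequence."""
    it = iter(sequence)
    return all(any(p == s for s in it) for p in pattern)

def support(sequences, pattern):
    return sum(contains_subsequence(s, pattern) for s in sequences) / len(sequences)

def growth_rate(pattern, group_a, group_b, eps=1e-9):
    """Support ratio between groups; large values mark candidate Emerging Sequences."""
    return support(group_a, pattern) / (support(group_b, pattern) + eps)
```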

Kang Deng, Osmar R. Zaïane
A Sliding Window Algorithm for Relational Frequent Patterns Mining from Data Streams

Among the challenges in frequent pattern mining from data streams are drift of the data distribution and computational efficiency. In this work an additional challenge is considered: data streams that describe complex objects modeled by multiple database relations. A multi-relational data mining algorithm is proposed to efficiently discover approximate relational frequent patterns over a sliding time window of a complex data stream. The effectiveness of the method is demonstrated in an application to an Internet packet stream.

Fabio Fumarola, Anna Ciampi, Annalisa Appice, Donato Malerba
A Hybrid Collaborative Filtering System for Contextual Recommendations in Social Networks

Recommender systems are based mainly on collaborative filtering algorithms, which use only the ratings given by users to products. When context is taken into account, difficulties may arise when making recommendations to users who are placed in a context other than their usual one, since their preferences will not correlate with the preferences of those in the new context. In this paper, a hybrid collaborative filtering model is proposed which provides recommendations based on the context of travelling users. A combination of a user-based collaborative filtering method and a semantic-based one is used. Contextual recommendation may be applied in the many social networks that are spreading world-wide. The resulting system has been tested over 11870.com, a good example of a social network where context is a primary concern.

Jorge Gonzalo-Alonso, Paloma de Juan, Elena García-Hortelano, Carlos Á. Iglesias
Linear Programming Boosting by Column and Row Generation

We propose a new boosting algorithm based on a linear programming formulation. Our algorithm can take advantage of the sparsity of the solution of the underlying optimization problem. In preliminary experiments, our algorithm outperforms a state-of-the-art LP solver and LPBoost especially when the solution is given by a small set of relevant hypotheses and support vectors.

Kohei Hatano, Eiji Takimoto
Discovering Abstract Concepts to Aid Cross-Map Transfer for a Learning Agent

The capacity to apply knowledge in a context different from the one in which it was learned has become crucial within the area of autonomous agents. This paper specifically addresses the issue of transferring knowledge acquired through online learning in partially observable environments. We investigate the discovery of relevant abstract concepts which help the transfer of knowledge in the context of an environment characterized by its 2D geographical configuration. The proposed architecture is tested in a simple grid-world environment where two agents duel each other. Results show that an agent’s performance is improved through learning, including when it is tested on a map it has not yet seen.

Cédric Herpson, Vincent Corruble
A Dialectic Approach to Problem-Solving

We analyze the dynamics of problem-solving in a framework which captures two key features of that activity. The first feature is that problem-solving is a social game where a number of problem-solvers interact, rely on other agents to tackle parts of a problem, and regularly communicate the outcomes of their investigations. The second feature is that problem-solving requires a careful control over the set of hypotheses that might be needed at various stages of the investigation for the problem to be solved; more particularly, that any incorrect hypothesis be eventually refuted in the face of some evidence: all agents can expect such evidence to be brought to their knowledge whenever it holds. Our presentation uses a very general form of logic programs, viewed as sets of rules that can be activated and fire, depending on what a problem-solver is willing to explore, what a problem-solver is willing to hypothesize, and what a problem-solver knows about the problem to be solved in the form of data or background knowledge.

Our framework supports two fundamental aspects of problem-solving. The first aspect is that no matter how the work is distributed amongst agents, exactly the same knowledge is guaranteed to be discovered eventually. The second aspect is that any group of agents (with, at one end, one agent being in charge of all rules and, at the other end, each agent being in charge of one and only one rule) might sometimes need to put forward hypotheses to allow for the discovery of a particular piece of knowledge in finite time.

Eric Martin, Jean Sallantin
Gene Functional Annotation with Dynamic Hierarchical Classification Guided by Orthologs

This paper proposes an approach to automating Gene Ontology (GO) annotation in the framework of hierarchical classification that uses known, already annotated functions of the orthologs of a given gene. The proposed approach exploits such known functions as constraints and dynamically builds classifiers based on the training data available under the constraints. In addition, two unsupervised approaches are applied to complement the classification framework. The validity and effectiveness of the proposed approach are empirically demonstrated.

Kazuhiro Seki, Yoshihiro Kino, Kuniaki Uehara
Stream Clustering of Growing Objects

We study incremental clustering of objects that grow and accumulate over time. The objects come from a multi-table stream, e.g., streams of Customer and Transaction records. As the Transaction stream accumulates, the Customers’ profiles grow. First, we use incremental propositionalisation to convert the multi-table stream into a single-table stream, upon which we apply clustering. For this purpose, we develop an online version of the K-Means algorithm that can handle these swelling objects and any new objects that arrive. The algorithm also monitors the quality of the model and performs re-clustering when it deteriorates. We evaluate our method on the PKDD Challenge 1999 dataset.
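
A bare-bones sketch of the online K-Means update mentioned above, assuming the incremental propositionalisation already yields fixed-length numpy profiles; quality monitoring and re-clustering are omitted, and freshly grown profiles are simply re-submitted to the same update.

```python
import numpy as np

class OnlineKMeans:
    def __init__(self, k, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.centres = rng.normal(size=(k, dim))
        self.counts = np.zeros(k)

    def update(self, x):
        """Assign an arriving (or freshly grown) profile to the nearest centre and move it."""
        j = int(np.argmin(np.linalg.norm(self.centres - x, axis=1)))
        self.counts[j] += 1
        self.centres[j] += (x - self.centres[j]) / self.counts[j]
        return j
```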

Zaigham Faraz Siddiqui, Myra Spiliopoulou
Finding the k-Most Abnormal Subgraphs from a Single Graph

In this paper, we propose a discord discovery method which finds the k most dissimilar subgraphs of size n among the subgraphs of the same size of an input graph, where the values of k and n are given by the user. Our algorithm SD3 (Subgraph Discord Detector based on Dissimilarity) exploits a dynamic index structure, and its effectiveness is demonstrated through experiments using graph data from chemical informatics and bioinformatics.

JianBin Wang, Bin-Hui Chou, Einoshin Suzuki
Latent Topic Extraction from Relational Table for Record Matching

We propose a latent feature extraction method for record linkage. We first introduce a probabilistic model that generates records together with their latent topics. The proposed generative model is designed to utilize the co-occurrence among the attributes of a record. We then derive a topic estimation algorithm using the Gibbs sampling technique. The estimated topics are used to identify records. The proposed algorithm works in an unsupervised way; i.e., we do not need to prepare labor-intensive training data. We evaluated the proposed model using bibliographic records and show that the proposed method tends to perform better for records with more attributes, by utilizing their co-occurrence.

Atsuhiro Takasu, Daiji Fukagawa, Tatsuya Akutsu
Computing a Comprehensible Model for Spam Filtering

In this paper, we describe the application of the Decision Tree Boosting (DTB) learning model to spam email filtering. This classification task implies learning in a high dimensional feature space, so it is an example of how the DTB algorithm performs in such feature space problems. In [1], it has been shown that the hypotheses computed by the DTB model are more comprehensible than the ones computed by other ensemble methods. Hence, this paper tries to show that the DTB algorithm maintains the same comprehensibility of hypotheses in high dimensional feature space problems while achieving the performance of other ensemble methods. Four traditional evaluation measures (precision, recall, F1 and accuracy) have been considered for performance comparison between DTB and other models usually applied to spam email filtering. The hypothesis computed by DTB is smaller and more comprehensible than the hypotheses computed by AdaBoost and Naïve Bayes.

Amparo Ruiz-Sepúlveda, José L. Triviño-Rodriguez, Rafael Morales-Bueno
Better Decomposition Heuristics for the Maximum-Weight Connected Graph Problem Using Betweenness Centrality

We present new decomposition heuristics for finding the optimal solution for the maximum-weight connected graph problem, which is known to be NP-hard. Previous optimal algorithms for solving the problem decompose the input graph into subgraphs using heuristics based on node degree. We propose new heuristics based on betweenness centrality measures, and show through computational experiments that our new heuristics tend to reduce the number of subgraphs in the decomposition, and therefore could lead to a reduction in computational time for finding the optimal solution. The method is further applied to the analysis of biological pathway data.
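
A hedged sketch of how betweenness centrality can drive such a decomposition, assuming networkx and an undirected input graph; it illustrates the general idea discussed above, not the authors' exact heuristic.

```python
import networkx as nx

def decompose(G, max_size=50):
    """Recursively remove the highest-betweenness node until every piece is small enough."""
    if G.number_of_nodes() <= max_size:
        return [G]
    bc = nx.betweenness_centrality(G)
    cut = max(bc, key=bc.get)                 # node lying on the most shortest paths
    H = G.subgraph(set(G) - {cut}).copy()
    pieces = []
    for comp in nx.connected_components(H):
        pieces.extend(decompose(H.subgraph(comp).copy(), max_size))
    return pieces
```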

Takanori Yamamoto, Hideo Bannai, Masao Nagasaki, Satoru Miyano
Backmatter
Metadata
Title
Discovery Science
Editors
João Gama
Vítor Santos Costa
Alípio Mário Jorge
Pavel B. Brazdil
Copyright Year
2009
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-04747-3
Print ISBN
978-3-642-04746-6
DOI
https://doi.org/10.1007/978-3-642-04747-3
