
2010 | Book

Advances in Knowledge Discovery and Management

Edited by: Fabrice Guillet, Gilbert Ritschard, Djamel Abdelkader Zighed, Henri Briand

Publisher: Springer Berlin Heidelberg

Book series: Studies in Computational Intelligence


About this book

During the last decade, the French-speaking scientific community has developed a very strong research activity in the field of Knowledge Discovery and Management (KDM, or EGC for “Extraction et Gestion des Connaissances” in French), which is concerned with, among others, Data Mining, Knowledge Discovery, Business Intelligence, Knowledge Engineering and the Semantic Web. The recent and novel research contributions collected in this book are extended and reworked versions of a selection of the best papers that were originally presented in French at the EGC 2009 conference held in Strasbourg, France, in January 2009. The volume is organized in four parts. Part I includes five papers concerned with various aspects of supervised learning or information retrieval. Part II presents five papers on unsupervised learning issues. Part III includes two papers on data streaming and two on security, while in Part IV the last four papers deal with ontologies and semantics.

Table of contents

Frontmatter

Supervised Learning and Information Retrieval

Frontmatter
Discrepancy Analysis of Complex Objects Using Dissimilarities
Abstract
In this article we consider objects for which we have a matrix of dissimilarities and we are interested in their links with covariates. We focus on state sequences for which pairwise dissimilarities are given for instance by edit distances. The methods discussed apply, however, to any kind of objects and measures of dissimilarities. We start with a generalization of the analysis of variance (ANOVA) to assess the link of complex objects (e.g. sequences) with a given categorical variable. The trick is to show that discrepancy among objects can be derived from the pairwise dissimilarities alone, which then makes it possible to identify the factors that most reduce this discrepancy. We present a general statistical test and introduce an original way of rendering the results for state sequences. We then generalize the method to the case with more than one factor and discuss its advantages and limitations, especially regarding interpretation. Finally, we introduce a new tree method for analyzing the discrepancy of complex objects that exploits the former test as splitting criterion. We demonstrate the scope of the methods presented through a study of the factors that most discriminate Swiss occupational trajectories. All methods presented are freely accessible in our TraMineR package for the R statistical environment.
Matthias Studer, Gilbert Ritschard, Alexis Gabadinho, Nicolas S. Müller
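The central trick of the abstract — deriving discrepancy from pairwise dissimilarities alone — can be sketched as follows. This is a minimal illustration of the idea, not the TraMineR implementation; the function names and data layout are ours:

```python
def total_ss(d):
    """Pseudo total sum of squares computed from pairwise
    dissimilarities only: SS_T = (1/n) * sum_{i<j} d[i][j]**2."""
    n = len(d)
    return sum(d[i][j] ** 2 for i in range(n)
               for j in range(i + 1, n)) / n

def within_ss(d, groups):
    """Residual (within-group) pseudo sum of squares for a
    categorical factor: for each group, (1/n_g) times the sum of
    squared dissimilarities between members of that group."""
    ss = 0.0
    for g in set(groups):
        idx = [i for i, gi in enumerate(groups) if gi == g]
        ss += sum(d[i][j] ** 2 for k, i in enumerate(idx)
                  for j in idx[k + 1:]) / len(idx)
    return ss

def pseudo_r2(d, groups):
    """Share of the discrepancy explained by the factor; the
    factor that maximizes this is the one that most reduces
    the within-group discrepancy."""
    return 1.0 - within_ss(d, groups) / total_ss(d)
```

Significance would then be assessed by the permutation test the abstract mentions: repeatedly shuffling the group labels and comparing the resulting statistic to the observed one.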
A Bayes Evaluation Criterion for Decision Trees
Abstract
We present a new evaluation criterion for the induction of decision trees. We exploit a parameter-free Bayesian approach and propose an analytic formula for the evaluation of the posterior probability of a decision tree given the data. We thus transform the training problem into an optimization problem in the space of decision tree models, and search for the best tree, which is the maximum a posteriori (MAP) one. The optimization is performed using top-down heuristics with pre-pruning and post-pruning processes. Extensive experiments on 30 UCI datasets and on the 5 WCCI 2006 performance prediction challenge datasets show that our method obtains predictive performance similar to that of alternative state-of-the-art methods, with far simpler trees.
Nicolas Voisine, Marc Boullé, Carine Hue
Classifying Very-High-Dimensional Data with Random Forests of Oblique Decision Trees
Abstract
The random forests method is one of the most successful ensemble methods. However, random forests do not perform well on very-high-dimensional data in the presence of dependencies. In this case one can expect many combinations between the variables, and unfortunately the usual random forests method does not effectively exploit this situation. We here investigate a new approach for supervised classification with a huge number of numerical attributes. We propose a random oblique decision trees method: it consists of randomly choosing a subset of predictive attributes and using an SVM as the split function over these attributes. We compare, on 25 datasets, the effectiveness, with classical measures (e.g. precision, recall, F1-measure and accuracy), of random forests of random oblique decision trees with SVMs and random forests of C4.5. Our proposal performs significantly better on very-high-dimensional datasets, with slightly better results on lower-dimensional datasets.
Thanh-Nghi Do, Philippe Lenca, Stéphane Lallich, Nguyen-Khang Pham
Intensive Use of Correspondence Analysis for Large Scale Content-Based Image Retrieval
Abstract
In this paper, we investigate the intensive use of Correspondence Analysis (CA) for large scale content-based image retrieval. Correspondence Analysis is a useful method for analyzing textual data and we adapt it to images using the SIFT local descriptors. CA is used to reduce dimensions and to limit the number of images to be considered during the search step. An incremental algorithm for CA is proposed to deal with large databases giving exactly the same result as the standard algorithm. We also integrate the Contextual Dissimilarity Measure in our search scheme in order to improve response time and accuracy. We explore this integration in two ways: (i) off-line (the structure of image neighborhoods is corrected off-line) and (ii) on-the-fly (the structure of image neighborhoods is adapted during the search). The evaluation tests have been performed on a large image database (up to 1 million images).
Nguyen-Khang Pham, Annie Morin, Patrick Gros, Quyet-Thang Le
Toward a Better Integration of Spatial Relations in Learning with Graphical Models
Abstract
This paper deals with structural representations of images for machine learning and image categorization. The representation consists of a graph whose vertices represent image regions and whose edges represent spatial relations between them. Both vertices and edges are attributed. The method is based on graph kernels, in order to derive a metric for comparing images. We show in particular the importance of edge information (i.e. spatial relations), specifically the influence of whether a relation between two regions is satisfied or not. The main contribution of the paper lies in highlighting the challenges that follow, in terms of image representation, if fuzzy models are considered for estimating relation satisfiability.
Emanuel Aldea, Isabelle Bloch

Unsupervised Learning

Frontmatter
Multigranular Manipulations for OLAP Querying
Abstract
Decisional systems are based on multidimensional databases supporting OLAP analyses. This chapter describes a new OLAP operator named “BLEND” that performs multigranular analyses. This operation transforms multidimensional structures at query time so that measures can be analyzed according to several granularity levels of one parameter. We study valid uses of this operation in the context of strict hierarchies. Experiments within an R-OLAP implementation show the low cost of the operator.
Gilles Hubert, Olivier Teste
A New Approach for Unsupervised Classification in Image Segmentation
Abstract
Image segmentation is a fundamental problem in image analysis and understanding. Among the existing approaches proposed to solve this problem, unsupervised classification or clustering is often involved in an early step to partition the space of pixel intensities (i.e. grey levels, colours or spectral signatures). Since it completely ignores pixel neighbourhoods, a second step is then necessary to perform spatial analysis (e.g. with a connected component labeling) in order to identify the regions built by the segmentation process. This lack of spatial information is a major drawback of classification-based segmentation approaches, and many solutions (where classification is used together with other techniques) have been proposed in the literature. In this paper, we propose a new formulation of unsupervised classification which is able to perform image segmentation without requiring additional techniques. More precisely, we introduce a k-means-like method where the data to be clustered are the pixels themselves (and no longer their intensities or colours) and where the distances between points and class centres are no longer Euclidean but topographic. Segmentation is then an iterative process where, at each iteration, the resulting classes can be seen as influence zones in the sense of mathematical morphology. This analogy lets us benefit from efficient algorithms proposed in that field (such as hierarchical queue-based solutions), while adding the iterative property of the unsupervised classification methods considered here. Finally, we illustrate the potential of our approach with segmentation results obtained on artificial and natural images.
Sébastien Lefèvre
Cluster-Dependent Feature Selection through a Weighted Learning Paradigm
Abstract
This paper addresses the problem of selecting a subset of the most relevant features from a dataset through a weighted learning paradigm. We propose two automated feature selection algorithms for unlabeled data. In contrast to supervised learning, automated feature selection and feature weighting in the context of unsupervised learning is challenging, because label information is not available to guide the selection. The algorithms combine the introduction of unsupervised local feature weights, identifying the relevant features of the data, with the suppression of irrelevant features through unsupervised selection. The algorithms described in this paper provide topographic clustering, each cluster being associated with a prototype and a weight vector reflecting the relevance of the features. The proposed methods require simple computational techniques and are based on the self-organizing map (SOM) model. Empirical results based on both synthetic and real datasets from the UCI repository are given and discussed.
Nistor Grozavu, Younès Bennani, Mustapha Lebbah
Two Variants of the OKM for Overlapping Clustering
Abstract
This paper deals with overlapping clustering and presents two extensions of the OKM approach, denoted OKMED and WOKM. OKMED generalizes the well-known k-medoid method to overlapping clustering and helps organize data with any proximity matrix as input. WOKM (Weighted-OKM) proposes a model with local weighting of the clusters; this variant is suitable for overlapping clustering since a single data item can match multiple classes on different features. On text clustering, we show that OKMED behaves similarly to OKM but allows the use of metrics other than the Euclidean distance. We then observe significant improvements using the weighted extension of OKM.
Guillaume Cleuziou
A Stable Decomposition Algorithm for Dynamic Social Network Analysis
Abstract
Dynamic networks raise new challenges for knowledge discovery. To efficiently handle this kind of data, analysis methods have to decompose the network, modelled by a graph, into similar sets of nodes. In this article, we present a graph decomposition algorithm that generates overlapping clusters. The complexity of this algorithm is \(O(|E| \cdot deg^2_{max} + |V| \cdot log(|V|))\). This algorithm is particularly efficient because it can detect major changes in the data as it evolves over time.
Romain Bourqui, Paolo Simonetto, Fabien Jourdan

Security and Data Streaming

Frontmatter
An Hybrid Data Stream Summarizing Approach by Sampling and Clustering
Abstract
Computer systems generate a large amount of data that, in terms of space and time, is very expensive - even impossible - to store. Besides this, many applications need to keep an historical view of such data in order to provide historical aggregated information, perform data mining tasks or detect anomalous behavior in computer systems. One solution is to treat the data as streams being processed on the fly in order to build historical summaries. Many data summarizing techniques have already been developed such as sampling, clustering, histograms, etc. Some of them have been extended to be applied directly to data streams. This chapter presents a new approach to build such historical summaries of data streams. It is based on a combination of two existing algorithms: StreamSamp and CluStream. The combination takes advantages of the benefits of each algorithm and avoids their drawbacks. Some experiments are presented both on real and synthetic data. These experiments show that the new approach gives better results than using any one of the two mentioned algorithms.
Nesrine Gabsi, Fabrice Clérot, Georges Hébrail
SPAMS: A Novel Incremental Approach for Sequential Pattern Mining in Data Streams
Abstract
Mining sequential patterns in data streams is a new challenging problem for the data mining community, since data arrives sequentially in the form of continuous, rapid and infinite streams. In this paper, we propose a new on-line algorithm, SPAMS, to deal with the sequential pattern mining problem in data streams. This algorithm uses an automaton-based structure, the SPA (Sequential Pattern Automaton), to maintain the set of frequent sequential patterns. The sequential pattern automaton can be smaller than the set of frequent sequential patterns by two or more orders of magnitude, which allows us to overcome the combinatorial explosion of sequential patterns. Current results can be output at any time for any user-specified threshold. In addition, taking into account the characteristics of data streams, we propose a well-suited method said to be approximate, since we can provide near-optimal results with high probability. Experimental studies show the relevance of the SPA data structure and the efficiency of the SPAMS algorithm on various datasets. Our contribution opens a promising gateway to using an automaton as a data structure for mining frequent sequential patterns in data streams.
Lionel Vinceslas, Jean-Emile Symphor, Alban Mancheron, Pascal Poncelet
Mining Common Outliers for Intrusion Detection
Abstract
Data mining for intrusion detection can be divided into several sub-topics, among which is unsupervised clustering (which has controversial properties). Unsupervised clustering for intrusion detection aims to i) group behaviours together depending on their similarity and ii) detect groups containing only one (or very few) behaviour(s). Such isolated behaviours seem to deviate from the model of normality and are therefore considered malicious. Obviously, not all atypical behaviours are attacks or intrusion attempts, which is one drawback of intrusion detection methods based on clustering. We therefore require an additional feature of isolated behaviours before they are considered malicious: the possible repeated occurrence of the behaviour on many information systems. Based on this feature, we propose a new outlier mining method which we validate through a set of experiments.
Goverdhan Singh, Florent Masseglia, Céline Fiot, Alice Marascu, Pascal Poncelet
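The intuition described above — an isolated behaviour is only reported when it recurs on several systems — can be sketched as follows. This is an illustrative reconstruction, not the authors' method; the function names and data layout are ours:

```python
from collections import Counter

def isolated(assignment, max_size=1):
    """Behaviours whose cluster holds at most `max_size` members,
    i.e. the atypical behaviours of one information system.
    `assignment` maps each behaviour to its cluster label."""
    sizes = Counter(assignment.values())
    return {b for b, c in assignment.items() if sizes[c] <= max_size}

def common_outliers(assignments, min_systems=2, max_size=1):
    """Keep only behaviours that are isolated on at least
    `min_systems` systems: repetition across systems is treated
    as evidence of a real attack rather than a local oddity."""
    seen = Counter()
    for assignment in assignments:  # one clustering per system
        seen.update(isolated(assignment, max_size))
    return {b for b, n in seen.items() if n >= min_systems}
```

A behaviour that is atypical on a single system is thus discarded, while one that forms a tiny cluster on several systems survives the filter.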
Intrusion Detections in Collaborative Organizations by Preserving Privacy
Abstract
To overcome the problem of attacks on networks, new Intrusion Detection System (IDS) approaches have been proposed in recent years. They consist in identifying signatures of known attacks and comparing them to each request to determine whether it is an attack or not. However, these methods fail when the attack is absent from the database of signatures. Usually this problem is solved by calling on human expertise to update the database of signatures. Yet it is frequent that an attack has already been detected by another organization, and it would be useful to benefit from this knowledge to enrich the database of signatures. Unfortunately this information is not so easy to obtain: organizations do not necessarily want to spread the information that they have already faced this type of attack. In this paper we propose a new approach to intrusion detection in a collaborative environment that preserves the privacy of the collaborating organizations. Our approach works for any signature that may be written as a regular expression, ensuring that no information is disclosed about the content of the sites.
Nischal Verma, François Trousset, Pascal Poncelet, Florent Masseglia

Ontologies and Semantics

Frontmatter
Alignment-Based Partitioning of Large-Scale Ontologies
Abstract
Ontology alignment is an important task for information integration systems, as it can make different resources, described by various and heterogeneous ontologies, interoperate. However, very large ontologies have been built in domains such as medicine or agronomy, and the challenge now lies in scaling up alignment techniques, which often perform complex tasks. In this paper, we propose two partitioning methods designed to take the alignment objective into account in the partitioning process as early as possible. These methods transform the two ontologies to be aligned into two sets of blocks of limited size. Furthermore, the elements of the two ontologies that might be aligned are grouped in a minimal set of blocks, and the comparison is then performed on these blocks. Results of experiments performed with the two methods on various pairs of ontologies are promising.
Fayçal Hamdi, Brigitte Safar, Chantal Reynaud, Haïfa Zargayouna
Learning Ontologies with Deep Class Hierarchies by Mining the Content of Relational Databases
Abstract
Relational databases are valuable sources for ontology learning. Previous work showed how precise ontologies can be learned and fruitfully exploited to solve practical problems, such as ensuring the integration and interoperation of heterogeneous databases. However, a major persisting limitation of existing approaches is that they derive ontologies with a flat structure that simply mirrors the schema of the source databases. In this paper, we present the RTAXON learning method, which shows how the content of the databases can be exploited to identify categorization patterns from which class hierarchies can be generated. This fully formalized method combines a classical database schema analysis with hierarchy mining in the stored data. RTAXON is one of the methods implemented in RDBToOnto, a comprehensive tool that supports the transition from data access to the generation of populated ontologies.
Farid Cerbah
Semantic Analysis for the Geospatial Semantic Web
Abstract
Semantic analysis is a new search paradigm for the Semantic Web which aims at the automatic extraction of the semantic associations existing between individuals described in RDF(S) graphs. In order to infer additional semantic associations and to increase the accuracy of the analysis, we propose here to adapt semantic analysis to OWL-DL ontologies. We also show that by taking into account the spatio-temporal information usually attached to resources, new and possibly interesting semantic associations can be discovered. Moreover, we propose to handle spatial and temporal contexts in order to limit the scope of the analysis to a given region of space and a given period of time considered interesting from the user’s point of view. For reasoning on spatial and temporal information and relations, we use ONTOAST, a spatio-temporal representation and querying system compatible with OWL-DL.
Alina Dia Miron, Jérôme Gensel, Marlène Villanova-Oliver
Statistically Valid Links and Anti-links Between Words and Between Documents: Applying TourneBool Randomization Test to a Reuters Collection
Abstract
Neighborhood is a central concept in data mining, and many definitions have been implemented, mainly rooted in geometrical or topological considerations. We propose here a statistical definition of neighborhood: our TourneBool randomization test processes an objects × attributes binary table in order to establish which inter-attribute relations are fortuitous and which ones are meaningful, without requiring any pre-defined statistical model, while taking the empirical distributions into account. The result is a robust, statistically validated graph. We present a full-scale experiment on one of the public-access Reuters test corpora. We characterize the resulting word graph by a series of indicators, such as clustering coefficients, degree distribution and correlation, cluster modularity and size distribution. Another graph structure stems from this process: the one conveying the negative “counter-relations” between words, i.e. words which “steer clear” of one another. We characterize the counter-relation graph in the same way. Finally, we generate the pair of valid document graphs (i.e. links and anti-links) and evaluate them by taking the Reuters document categories into account.
Alain Lelu, Martine Cadot
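The randomization idea can be illustrated with a simplified permutation test on two word columns of the documents × words binary table. This sketch is ours, not TourneBool itself: shuffling one column preserves that word's frequency but, unlike TourneBool, not the document margins:

```python
import random

def cooccurrence(col_a, col_b):
    """Number of documents (rows) containing both words."""
    return sum(a and b for a, b in zip(col_a, col_b))

def link_p_value(col_a, col_b, n_perm=2000, seed=0):
    """Simplified randomization test: shuffle one word's column
    (preserving its frequency) and count how often the shuffled
    co-occurrence reaches the observed one. A low p-value flags a
    meaningful 'link'; a p-value near 1 flags a candidate
    'anti-link' (words that steer clear of one another)."""
    rng = random.Random(seed)
    obs = cooccurrence(col_a, col_b)
    shuffled = list(col_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if cooccurrence(col_a, shuffled) >= obs:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing
```

Running the test over all word pairs and keeping the statistically valid ends of the distribution yields the two graphs the abstract describes: links and anti-links.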
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Management
Edited by
Fabrice Guillet
Gilbert Ritschard
Djamel Abdelkader Zighed
Henri Briand
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-00580-0
Print ISBN
978-3-642-00579-4
DOI
https://doi.org/10.1007/978-3-642-00580-0
