
2010 | Book

Advances in Data Mining. Applications and Theoretical Aspects

10th Industrial Conference, ICDM 2010, Berlin, Germany, July 12-14, 2010. Proceedings


About this book

These are the proceedings of the tenth event of the Industrial Conference on Data Mining ICDM held in Berlin (www.data-mining-forum.de). For this edition the Program Committee received 175 submissions. After the peer-review process, we accepted 49 high-quality papers for oral presentation that are included in this book. The topics range from theoretical aspects of data mining to applications of data mining such as on multimedia data, in marketing, finance and telecommunication, in medicine and agriculture, and in process control, industry and society. Extended versions of selected papers will appear in the international journal Transactions on Machine Learning and Data Mining (www.ibai-publishing.org/journal/mldm). Ten papers were selected for poster presentations and are published in the ICDM Poster Proceeding Volume by ibai-publishing (www.ibai-publishing.org). In conjunction with ICDM four workshops were held on special hot application-oriented topics in data mining: Data Mining in Marketing DMM, Data Mining in LifeScience DMLS, the Workshop on Case-Based Reasoning for Multimedia Data CBR-MD, and the Workshop on Data Mining in Agriculture DMA. The Workshop on Data Mining in Agriculture ran for the first time this year. All workshop papers will be published in the workshop proceedings by ibai-publishing (www.ibai-publishing.org). Selected papers of CBR-MD will be published in a special issue of the international journal Transactions on Case-Based Reasoning (www.ibai-publishing.org/journal/cbr).

Table of Contents

Frontmatter

Invited Talk

Moving Targets
When Data Classes Depend on Subjective Judgement, or They Are Crafted by an Adversary to Mislead Pattern Analysis Algorithms - The Cases of Content Based Image Retrieval and Adversarial Classification

The vast majority of pattern recognition applications assume that data can be subdivided into a number of data classes on the basis of the values of a set of suitable features. Supervised techniques assume the data classes are given in advance, and the goal is to find the most suitable set of features and the classification algorithm that allow an effective partition of the data. Unsupervised techniques, on the other hand, allow the discovery of the “natural” data classes into which data can be partitioned, for a given set of features. These approaches are showing their limitations in handling the challenges raised by applications where, for each instance of the problem, patterns can be assigned to different data classes, and the definition of the data classes itself is not uniquely fixed. As a consequence, the set of features providing an effective discrimination of patterns, and the related discrimination rule, should be set for each instance of the classification problem. Two applications from different domains share these characteristics: Content-Based Multimedia Retrieval and Adversarial Classification. The retrieval of multimedia data by content is biased by the high subjectivity of the concept of similarity. On the other hand, in an adversarial environment, the adversary carefully crafts new patterns so that they are assigned to the incorrect data class. In this paper, the issues of the two application scenarios are discussed, and some effective solutions and future research directions are outlined.

Giorgio Giacinto
Bioinformatics Contributions to Data Mining

The field of bioinformatics is showing tremendous growth at the crossroads of biology, medicine, information science, and computer science. Figures clearly demonstrate that today bioinformatics research is as productive as data mining research as a whole. However, most bioinformatics research deals with tasks of prediction, classification, and tree or network induction from data. Bioinformatics tasks consist mainly of similarity-based sequence search, microarray data analysis, 2D or 3D macromolecule shape prediction, and phylogenetic classification. It is therefore interesting to consider how the methods of bioinformatics can contribute pertinent advances to data mining, and to highlight some examples of how these bioinformatics algorithms can potentially be applied to domains outside biology.

Isabelle Bichindaritz

Theoretical Aspects of Data Mining

Bootstrap Feature Selection for Ensemble Classifiers

A small number of samples combined with a high-dimensional feature space leads to degraded classifier performance in machine learning, statistics and data mining systems. This paper presents bootstrap feature selection for ensemble classifiers to deal with this problem and compares it with traditional feature selection for ensembles (selecting the optimal features from the whole dataset before bootstrapping the data). Four base classifiers, Multilayer Perceptron, Support Vector Machines, Naive Bayes and Decision Tree, are used to evaluate performance on UCI machine learning repository and causal discovery datasets. The bootstrap feature selection algorithm provides slightly better accuracy than traditional feature selection for ensemble classifiers.
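The paper's algorithm is not reproduced here, but the core idea can be sketched as follows: features are selected separately on each bootstrap sample rather than once on the whole dataset. A minimal Python sketch, assuming numeric features and integer class labels, with scikit-learn's univariate SelectKBest standing in for the paper's feature selector (helper names are illustrative):

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.tree import DecisionTreeClassifier

def bootstrap_fs_ensemble(X, y, n_estimators=25, k=10, seed=0):
    # Select features on each bootstrap sample, not on the whole dataset.
    rng = np.random.default_rng(seed)
    members = []
    n = len(y)
    for _ in range(n_estimators):
        idx = rng.integers(0, n, size=n)                      # sample with replacement
        sel = SelectKBest(f_classif, k=k).fit(X[idx], y[idx])
        clf = DecisionTreeClassifier().fit(sel.transform(X[idx]), y[idx])
        members.append((sel, clf))
    return members

def ensemble_predict(members, X):
    votes = np.stack([clf.predict(sel.transform(X)) for sel, clf in members])
    # Majority vote; assumes integer class labels 0..K-1.
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)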

Rakkrit Duangsoithong, Terry Windeatt
Evaluating the Quality of Clustering Algorithms Using Cluster Path Lengths

Many real-world systems can be modeled as networks or graphs. Clustering algorithms that help us to organize and understand these networks are usually referred to as graph-based clustering algorithms. Many algorithms exist in the literature for clustering network data, and evaluating the quality of these clustering algorithms is an important task addressed by different researchers. An important ingredient in evaluating these clustering techniques is the node-edge density of a cluster. In this paper, we argue that evaluation methods based on density are heavily biased toward networks having dense components, such as social networks, but are not well suited for data sets with other network topologies where the nodes are not densely connected. Examples of such data sets are transportation and Internet networks. We justify our hypothesis by presenting examples from real-world data sets.

We present a new metric to evaluate the quality of a clustering algorithm to overcome the limitations of existing cluster evaluation techniques. This new metric is based on the path length of the elements of a cluster and avoids judging the quality based on cluster density. We show the effectiveness of the proposed metric by comparing its results with other existing evaluation methods on artificially generated and real world data sets.
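As an illustration only (the paper defines its own metric), a path-length-based quality score can be sketched as the average shortest-path length among the members of each cluster, computed on the induced subgraph. Unlike density-based measures, this does not reward densely connected clusters:

import networkx as nx

def mean_intra_cluster_path_length(G, clusters):
    # clusters: iterable of node sets; lower scores = more compact clusters.
    scores = []
    for nodes in clusters:
        H = G.subgraph(nodes)
        if H.number_of_nodes() > 1 and nx.is_connected(H):
            scores.append(nx.average_shortest_path_length(H))
    return sum(scores) / len(scores) if scores else float("inf")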

Faraz Zaidi, Daniel Archambault, Guy Melançon
Finding Irregularly Shaped Clusters Based on Entropy

In data clustering, the more traditional algorithms are based on similarity criteria which depend on a metric distance. This fact imposes important constraints on the shape of the clusters found. These shapes are generally hyperspherical in the metric’s space, because each element in a cluster lies within a radial distance relative to a given center. In this paper we propose a clustering algorithm that does not depend on simple distance metrics and, therefore, allows us to find clusters with arbitrary shapes in n-dimensional space. Our proposal is based on concepts stemming from Shannon’s information theory and evolutionary computation. Here each cluster consists of a subset of the data where entropy is minimized. This is a highly non-linear and usually non-convex optimization problem which disallows the use of traditional optimization techniques. To solve it we apply a rugged genetic algorithm (the so-called Vasconcelos’ GA). In order to test the efficiency of our proposal we artificially created several sets of data with known properties in a tridimensional space. The results of applying our algorithm show that it is able to find highly irregular clusters that traditional algorithms cannot. Some previous work is based on algorithms relying on similar approaches (such as ENCLUS and CLIQUE). The differences between such approaches and ours are also discussed.
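A minimal sketch of the entropy objective, assuming points have already been assigned to a candidate cluster; the genetic algorithm that evolves the assignments (the paper's Vasconcelos' GA) is omitted, and the grid discretization is an illustrative choice:

import numpy as np

def cluster_entropy(points, bins=10):
    # points: (n, d) array of the elements currently assigned to one cluster.
    hist, _ = np.histogramdd(points, bins=bins)
    p = hist.ravel() / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))    # low entropy = tightly concentrated cluster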

Angel Kuri-Morales, Edwin Aldana-Bobadilla
Fuzzy Conceptual Clustering

Grouping unknown data into groups of similar data is a necessary first step for classification, indexing of data bases, and prediction. Most of today’s applications, such as news classification, blog indexing, image classification, and medical diagnosis, obtain their data in temporal sequence or on-line. The necessity for data exploration requires a graphical method that allows the expert in the field to study the determined groups of data. Therefore, incremental hierarchical clustering methods that can create explicit cluster descriptions are convenient. The noisy and uncertain nature of the data makes it necessary to develop fuzzy clustering methods. We propose a novel fuzzy conceptual clustering algorithm. We describe the fuzzy objective function for incremental building of the clusters and the relation among the clusters in a hierarchy. The operations that can incrementally re-optimize the fuzzy-based hierarchy based on the newly arrived data are explained. Finally, we evaluate our method and present results.

Petra Perner, Anja Attig
Mining Concept Similarities for Heterogeneous Ontologies

We consider the problem of discovering pairs of similar concepts which are part of two given source ontologies, in which each concept node is mapped to a set of instances. The similarity measures we propose are based on learning a classifier for each concept that discriminates the respective concept from the remaining concepts in the same ontology. We present two new measures that are compared experimentally: (1) one based on comparing the sets of support vectors from the learned SVMs, and (2) one which considers the list of discriminating variables for each concept. These lists are determined using a novel variable selection approach for the SVM. We compare the performance of the two suggested techniques with two standard approaches (Jaccard similarity and class-means distance). We also present a novel recursive matching algorithm based on concept similarities.

Konstantin Todorov, Peter Geibel, Kai-Uwe Kühnberger
Re-mining Positive and Negative Association Mining Results

Positive and negative association mining are well-known and extensively studied data mining techniques for analyzing market basket data. Efficient algorithms exist to find both types of association, separately or simultaneously. Association mining is performed by operating on the transaction data. Despite being an integral part of the transaction data, pricing and time information has not been incorporated into market basket analysis so far, and additional attributes have been handled using quantitative association mining. In this paper, a new approach is proposed to incorporate price, time and domain-related attributes into data mining by re-mining the association mining results. The underlying factors behind positive and negative relationships, as indicated by the association rules, are characterized and described through the second data mining stage, re-mining. The applicability of the methodology is demonstrated by analyzing data from the apparel retailing industry, where price markdown is an essential tool for promoting sales and generating increased revenue.

Ayhan Demiriz, Gurdal Ertek, Tankut Atan, Ufuk Kula
Multi-Agent Based Clustering: Towards Generic Multi-Agent Data Mining

A framework for Multi-Agent Data Mining (MADM) is described. The framework comprises a collection of agents cooperating to address given data mining tasks. The fundamental concept underpinning the framework is that it should support generic data mining. The vision is that of a system that grows in an organic manner. The central issue in facilitating this growth is the communication medium required to support agent interaction. This issue is addressed partly by the nature of the proposed architecture and partly through an extendable ontology; both are described. The advantages offered by the framework are illustrated in this paper by considering a clustering application. The motivation for the latter is that no “best” clustering algorithm has been identified, and consequently an agent-based approach can be adopted to identify “best” clusters. The application serves to demonstrate the full potential of MADM.

Santhana Chaimontree, Katie Atkinson, Frans Coenen
Describing Data with the Support Vector Shell in Distributed Environments

Distributed data stream mining is increasingly in demand in extensive application domains, such as web traffic analysis and financial transactions. In distributed environments, it is impractical to transmit all data to one node to build a global model. It is more reasonable to extract the essential parts of the local models of the subsidiary nodes and integrate them into the global model. In this paper we propose an approach, SVDDS, to perform this model integration in distributed environments. It is based on SVM theory and trades off the risk of the global model against the total transmission load. Our analysis and experiments show that SVDDS considerably lowers the total transmission load while the global accuracy drops comparatively little.

Peng Wang, Guojun Mao
Robust Clustering Using Discriminant Analysis

The cluster ensemble technique has attracted serious attention in the area of unsupervised learning. It aims at improving the robustness and quality of a clustering scheme, particularly in scenarios where either randomization or sampling is part of the clustering algorithm.

In this paper, we address the problems of instability and non-robustness in K-means clusterings. These problems arise naturally because of the random seed selection by the algorithm, the order sensitivity of the algorithm, and the presence of noise and outliers in the data. We propose a cluster ensemble method based on Discriminant Analysis to obtain robust clusterings from the K-means clusterer. The proposed algorithm operates in three phases. The first phase is preparatory, in which multiple clustering schemes are generated and the cluster correspondence is obtained. The second phase uses discriminant analysis and constructs a label matrix. In the final phase, the consensus partition is generated and noise, if any, is segregated. Experimental analysis using standard public data sets provides strong empirical evidence of the high quality of the resultant clustering scheme.

Vasudha Bhatnagar, Sangeeta Ahuja
New Approach in Data Stream Association Rule Mining Based on Graph Structure

The discovery of useful information and valuable knowledge from transactions has attracted many researchers due to the increasing use of very large databases and data warehouses. Furthermore, most of the proposed methods are designed to work on traditional databases in which re-scanning the transactions is allowed. These methods are not useful for mining in data streams (DS), because re-scanning the transactions is not possible given the huge and continuous data in a DS. In this paper, we propose GRM, an effective approach to mining the frequent itemsets used for association rule mining in a DS. Unlike other semi-graph methods, our method is based on a graph structure and has the ability to maintain and update the graph in one pass over the transactions. In this method, data storage is optimized by a memory usage criterion and the rules are mined in linear processing time.

The efficiency of our implemented method is compared with another proposed method, and the results are presented.

Samad Gahderi Mojaveri, Esmaeil Mirzaeian, Zarrintaj Bornaee, Saeed Ayat

Multimedia Data Mining

Fast Training of Neural Networks for Image Compression

The paper considers the problem of image compression using artificial neural networks (ANN). The main concept of this approach is the reduction of the original feature space, which allows us to eliminate image redundancy and accordingly leads to compression. Two variants of neural networks are proposed and analyzed: a two-layer ANN with a self-learning algorithm based on a weighted informational criterion, and an auto-associative four-layer feedforward network.

Yevgeniy Bodyanskiy, Paul Grimm, Sergey Mashtalir, Vladimir Vinarski
Processing Handwritten Words by Intelligent Use of OCR Results

About 3.5 million dried plants on paper sheets are deposited in the Botanical Museum Berlin in Germany. Frequently they carry handwritten annotations (see figure 1), so a procedure had to be developed to process the handwriting on the sheet. The present work describes an approach that tries to identify the writer from handwritten words and to read handwritten keywords. To this end, each word is cut out, transformed into a 6-dimensional time series, and compared, e.g. by means of the DTW method. A recognition rate of 98.6% is achieved with 12 different words (1200 samples). All herbarium documents contain several printed tokens which provide further information about the plant. From these tokens it is possible to learn who found the plant, where it was found (country and sometimes the town), what kind of plant it is, and so on. By using the local connections of the text it is possible to extract more information from the herbarium document, e.g. to find and recognize handwritten text in a defined area.

Benjamin Mund, Karl-Heinz Steinke
Saliency-Based Candidate Inspection Region Extraction in Tape Automated Bonding

Electronic circuits are composed of components connected by traces which conduct the current. While the interconnections between components can be created by assembling individual pieces of wire, it is nowadays common to use printed circuit boards. Tape automated bonding (TAB) is a technique for assembling chips and printed circuit boards. Because TABs are becoming smaller, inspection methods are required to adapt to the decreasing size of the electric circuits’ patterns. An image of a TAB is taken during the manufacturing process and analysed using image processing algorithms to inspect it for flaws in its pattern. This paper proposes an algorithm to find candidate inspection regions in a TAB pattern based on visual saliency. Orientation information contained in the image is processed to detect probable error regions and exclude correct regions from further inspection. The algorithm finds all the flaws in an image and, in the case of regular patterns, marks only 5% of the image pixels as belonging to a candidate inspection region. The results show that a saliency-based approach is applicable to the task of finding flaws in the pattern of an electric circuit.

Martina Dümcke, Hiroki Takahashi
Image Classification Using Histograms and Time Series Analysis: A Study of Age-Related Macular Degeneration Screening in Retinal Image Data

An approach to image mining is described that combines a histogram-based representation with a time series analysis technique. More specifically, a Dynamic Time Warping (DTW) approach is applied to histogram-represented image sets that have been enhanced using CLAHE and noise removal. The focus of the work is the screening (classification) of retinal image sets to identify age-related macular degeneration (AMD). Results are reported from experiments conducted to compare different image enhancement techniques, the combination of two different histograms for image classification, and different histogram-based approaches. The experiments demonstrated that the image enhancement techniques produce improved results, that the usage of two histograms improves classifier performance, and that the proposed DTW procedure out-performs other histogram-based techniques in terms of classification accuracy.
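For illustration, a textbook DTW recurrence over two grey-level histograms (not necessarily the exact variant used in the paper) can be sketched as:

import numpy as np

def dtw_distance(h1, h2):
    # h1, h2: 1-D histograms treated as sequences over the grey-level axis.
    n, m = len(h1), len(h2)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(h1[i - 1] - h2[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

Screening could then proceed, for example, by nearest-neighbour classification of an unseen image's histogram against histograms of labelled AMD and non-AMD images.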

Mohd Hanafi Ahmad Hijazi, Frans Coenen, Yalin Zheng
Entropic Quadtrees and Mining Mars Craters

This paper introduces entropic quadtrees, structures derived from quadtrees by allowing nodes to split only when they point to sufficiently diverse sets of objects. Diversity is evaluated using the entropy of the histograms of feature values for the sets designated by the nodes.

As an application, we used entropic quadtrees to locate craters on the surface of Mars, represented by circles in digital images.
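A minimal sketch of the split criterion, assuming a node holds the feature values of the objects it points to; the threshold and the single-feature histogram are illustrative simplifications of the paper's multi-feature formulation:

import numpy as np

def node_entropy(values, bins=16):
    hist, _ = np.histogram(values, bins=bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def should_split(values, threshold=2.0):
    # A node splits into four quadrants only if its contents are diverse enough.
    return node_entropy(values) > threshold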

Rosanne Vetro, Dan A. Simovici
Hybrid DIAAF/RS: Statistical Textual Feature Selection for Language-Independent Text Classification

Textual Feature Selection (TFS) is an important phase in the process of text classification. It aims to identify the most significant textual features (i.e. key words and/or phrases), in a textual dataset, that serve to distinguish between text categories. In TFS, basic techniques can be divided into two groups: linguistic vs. statistical. For the purpose of building a language-independent text classifier, the study reported here is concerned with statistical TFS only. In this paper, we propose a novel statistical TFS approach that hybridizes the ideas of two existing techniques, DIAAF (Darmstadt Indexing Approach Association Factor) and RS (Relevancy Score). With respect to associative (text) classification, the experimental results demonstrate that the proposed approach can produce greater classification accuracy than other alternative approaches.

Yanbo J. Wang, Fan Li, Frans Coenen, Robert Sanderson, Qin Xin
Multimedia Summarization in Law Courts: A Clustering-Based Environment for Browsing and Consulting Judicial Folders

Digital videos represent a fundamental source of information on the events that occur during penal proceedings; thanks to the technologies available nowadays, they can be stored, organized and retrieved in a short time and at low cost. However, considering the size that a video source can assume during a trial recording, several requirements have been pointed out by judicial actors: fast navigation of the stream, efficient access to the data inside, and effective representation of the relevant contents. One possible solution to these requirements is multimedia summarization, aimed at deriving a synthetic representation of audio/video contents characterized by a limited loss of meaningful information. In this paper a multimedia summarization environment is proposed for defining a storyboard for proceedings held in courtrooms.

E. Fersini, E. Messina, F. Archetti
Comparison of Redundancy and Relevance Measures for Feature Selection in Tissue Classification of CT Images

In this paper we report on a study of feature selection within the minimum-redundancy maximum-relevance framework. Features are ranked by their correlations with the target vector. These relevance scores are then integrated with the correlations between features in order to obtain a set of relevant and least-redundant features. The applied measures of correlation or distributional similarity for redundancy and relevance include the Kolmogorov-Smirnov (KS) test, Spearman correlations, Jensen-Shannon divergence, and the sign-test. We introduce a metric called the “value difference metric” (VDM) and present a simple measure which we call the “fit criterion” (FC). We draw conclusions about the usefulness of the different measures. While the KS-test and the sign-test provided useful information, Spearman correlations are not fit for comparing data of different measurement intervals. VDM performed very well in our experiments as both a redundancy and a relevance measure. Jensen-Shannon and the sign-test are good redundancy measure alternatives, and FC is a good relevance measure alternative.
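A minimal greedy minimum-redundancy maximum-relevance sketch, here using absolute Spearman correlation for both roles purely for illustration (the paper evaluates the KS-test, Jensen-Shannon divergence, the sign-test, VDM and FC as alternatives):

import numpy as np
from scipy.stats import spearmanr

def mrmr_select(X, y, k):
    n_feat = X.shape[1]
    relevance = np.array([abs(spearmanr(X[:, j], y)[0]) for j in range(n_feat)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_feat):
            if j in selected:
                continue
            # Penalize features correlated with those already selected.
            redundancy = np.mean([abs(spearmanr(X[:, j], X[:, s])[0]) for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected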

Benjamin Auffarth, Maite López, Jesús Cerquides

Data Mining in Marketing

Quantile Regression Model for Impact Toughness Estimation

The purpose of this study was to develop a product design model for estimating the impact toughness of low-alloy steel plates. The rejection probability in a Charpy-V test (CVT) is predicted from process variables and chemical composition. The proposed method is suitable for the whole production line of a steel plate mill, including all grades of steel in production. The quantile regression model was compared to the joint model of mean and dispersion and to the constant variance model. The quantile regression model proved to be the most effective method for modelling a highly complicated property to this extent.

Next, the developed model will be implemented in a graphical simulation tool that is in daily use in the product planning department and already contains some other mechanical property models. The model will guide designers in predicting the associated risk of rejection and in producing the desired properties in the product at lower cost.
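A minimal sketch of the modelling step with statsmodels' QuantReg; the quantile and the variable contents are illustrative, not the mill's actual process variables:

import statsmodels.api as sm

def fit_rejection_quantile(X, y, q=0.9):
    # Model a high quantile of the response rather than the conditional mean.
    model = sm.QuantReg(y, sm.add_constant(X))
    return model.fit(q=q)

# result = fit_rejection_quantile(X, y)
# result.params, result.predict(sm.add_constant(X_new))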

Satu Tamminen, Ilmari Juutilainen, Juha Röning
Mining for Paths in Flow Graphs

This paper presents FlowGSP, a data-mining algorithm that discovers frequent sequences of attributes in subpaths of a flow graph. FlowGSP was evaluated using flow graphs derived from the execution of transactions in the IBM® WebSphere® Application Server, a large real-world enterprise application server. The vertices of this flow graph may represent single instructions, bytecodes, basic blocks, regions, or entire methods. These vertices are annotated with attributes that correspond to run-time characteristics of the execution of the program. FlowGSP successfully identified a number of existing characteristics of the WebSphere Application Server which had previously been discovered only through extensive manual examination. In addition, a multi-threaded implementation of FlowGSP demonstrates the algorithm’s suitability for exploiting the resources of modern multi-core computers.

Adam Jocksch, José Nelson Amaral, Marcel Mitran
Combining Unsupervised and Supervised Data Mining Techniques for Conducting Customer Portfolio Analysis

Leveraging the power of increasing amounts of data to analyze the customer base for attracting and retaining the most valuable customers is a major problem facing companies in this information age. Data mining technologies extract hidden information and knowledge from large data stored in databases or data warehouses, thereby supporting the corporate decision-making process. In this study, we apply a two-level approach that combines SOM-Ward clustering and decision trees to conduct customer portfolio analysis for a case company. The created two-level model was then used to identify potential high-value customers from the customer base. It was found that this hybrid approach could provide more detailed and accurate information about the customer base for tailoring actionable marketing strategies.

Zhiyuan Yao, Annika H. Holmbom, Tomas Eklund, Barbro Back
Managing Product Life Cycle with MultiAgent Data Mining System

Production planning is the main factor affecting the income of a manufacturing company. A correct production planning policy, chosen for the right product at the right time, lessens production, storage and other related costs. The task of choosing a production policy is in most cases solved by an expert group, which not every company can support. Thus the topic of an intelligent system for supporting the production management process becomes relevant. The main tasks such a system should be able to solve are defining the current Product Life Cycle (PLC) phase of a product and determining the transition point, i.e. the moment of time (period) when the PLC phase changes, since the results obtained will affect the decision of which production planning policy should be used.

The paper presents a MultiAgent Data Mining system meant to support a production manager in his/her production planning decisions. The developed system is based on the analysis of historical demand for products and on information about transitions between phases in the life cycles of those products. The architecture of the developed system is presented, and an analysis of test results on real-world data is given.

Serge Parshutin
Modeling Pricing Strategies Using Game Theory and Support Vector Machines

Data Mining is a widely used discipline with methods that are heavily supported by statistical theory. Game theory, in contrast, develops models with solid economic foundations but, so far, with low applicability in companies. This work attempts to unify both approaches, presenting a model of price competition in the credit industry. Based on game theory, and sustained by the robustness of Support Vector Machines for structurally estimating the model, it takes advantage of each approach to provide strong results and useful information. The model consists of a market-level game that determines the marginal cost, demand, and efficiency of the competitors. Demand is estimated using Support Vector Machines, allowing the inclusion of multiple variables and empowering standard economic estimation through the aggregation of client-level models. The model is being applied by one competitor, for whom it has created new business opportunities, such as the strategic chance to aggressively cut prices given the acquired market knowledge.

Cristián Bravo, Nicolás Figueroa, Richard Weber

Data Mining in Industrial Processes

Determination of the Fault Quality Variables of a Multivariate Process Using Independent Component Analysis and Support Vector Machine

The multivariate statistical process control (MSPC) chart plays an important role in monitoring a multivariate process. Once a process disturbance has occurred, the MSPC out-of-control signal is triggered. The process personnel then begin to search for the root causes of the disturbance in order to take remedial action to compensate for its effects. However, the use of an MSPC chart encounters a difficulty in practice: which quality variable, or which set of quality variables, is responsible for the generation of the out-of-control signal? This determination is not straightforward, and it often confuses the process personnel. This study proposes a hybrid approach composed of independent component analysis (ICA) and a support vector machine (SVM) to determine the fault quality variables when a step-change disturbance exists in a process. The well-known Hotelling T² control chart is employed to monitor the multivariate process. The proposed hybrid ICA-SVM scheme first applies ICA to the Hotelling T² statistics to generate independent components (ICs). The hidden useful information about the fault quality variables can be discovered in these ICs. The ICs are then used as the input variables of the SVM for building the classification model. The performance of various process designs is investigated and compared with a typical classification method.
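In outline, and under the assumption that the T² statistics are arranged into fixed-length windows so that ICA has multivariate input (this is a sketch, not the authors' exact formulation):

import numpy as np
from sklearn.decomposition import FastICA
from sklearn.svm import SVC

def hotelling_t2(X, mean, cov_inv):
    # One T² value per observation: (x - mu)' S^-1 (x - mu).
    d = X - mean
    return np.einsum("ij,jk,ik->i", d, cov_inv, d)

def build_fault_classifier(t2_windows, labels, n_components=3):
    # t2_windows: (n_windows, window_len) matrix of successive T² values.
    ics = FastICA(n_components=n_components).fit_transform(t2_windows)
    return SVC(kernel="rbf").fit(ics, labels)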

Yuehjen E. Shao, Chi-Jie Lu, Yu-Chiun Wang
Dynamic Pattern Extraction of Parameters in Laser Welding Process

Tuning parameters is essential to the results of the welding process. In order to optimize the tuning of welding parameters, we propose a system based on historical data from laser welding machines. For a given combination of materials, the system extracts patterns dynamically and classifies new cases with a relative accuracy that depends on the selected data set. The analysis of the generated patterns helps decision makers to visualize important features in large databases and, therefore, to achieve optimal results.

Gissel Velarde, Christian Binroth
Trajectory Clustering for Vibration Detection in Aircraft Engines

The automatic detection of the vibration signature of rotating parts of an aircraft engine is considered. This paper introduces an algorithm that takes into account the variation over time of the detection level of orders, i.e. vibrations at multiples of the rotating speed. The detection levels over time at a specific order are gathered in a so-called trajectory. It is shown that clustering the trajectories to classify them into detected and non-detected orders improves robustness to noise and other external conditions, compared to traditional statistical signal detection by a hypothesis test. The algorithms are illustrated on real aircraft engine data.

Aurélien Hazan, Michel Verleysen, Marie Cottrell, Jérôme Lacaille
Episode Rule-Based Prognosis Applied to Complex Vacuum Pumping Systems Using Vibratory Data

This paper presents a local pattern-based method that addresses system prognosis, and details a successful application to complex vacuum pumping systems. More precisely, using historical vibratory data, we first model the behavior of the systems by extracting a given type of episode rules, namely First Local Maximum episode rules (FLM-rules). A subset of the extracted FLM-rules is then selected in order to predict pumping system failures in a vibratory datastream context. The results we obtained on production data are very encouraging, as we predict failures with good time-scale precision. We are now deploying our solution for a customer in the semiconductor market.

Florent Martin, Nicolas Méger, Sylvie Galichet, Nicolas Becourt
Predicting Disk Failures with HMM- and HSMM-Based Approaches

Understanding and predicting disk failures are essential for both disk vendors and users to manufacture more reliable disk drives and build more reliable storage systems, in order to avoid service downtime and possible data loss. Predicting disk failure from observable disk attributes, such as those provided by the Self-Monitoring and Reporting Technology (SMART) system, has been shown to be effective. In this paper, we treat SMART data as time series and explore the prediction power of HMM- and HSMM-based approaches. Our experimental results show that our prediction models outperform other models that do not capture the temporal relationship among attribute values over time. Using the best single attribute, our approach can achieve a detection rate of 46% at a 0% false alarm rate. Combining the two best attributes, our approach can achieve a detection rate of 52% at a 0% false alarm rate.
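A minimal sketch with the hmmlearn library: fit one HMM on SMART attribute sequences from healthy drives and one on failed drives, then classify a new sequence by comparing log-likelihoods. The paper's models, including the HSMM variants, are more involved:

import numpy as np
from hmmlearn import hmm

def fit_smart_hmm(sequences, n_states=5):
    # sequences: list of (timesteps, n_attributes) arrays from one drive class.
    X = np.concatenate(sequences)
    lengths = [len(s) for s in sequences]
    model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    return model.fit(X, lengths)

def predict_failure(seq, healthy_model, failed_model):
    # Classify by which model explains the observed SMART series better.
    return failed_model.score(seq) > healthy_model.score(seq)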

Ying Zhao, Xiang Liu, Siqing Gan, Weimin Zheng
Aircraft Engine Health Monitoring Using Self-Organizing Maps

Aircraft engines are designed to be used for several tens of years. Ensuring the proper operation of engines over their lifetime is therefore an important and difficult task. Maintenance can be improved if efficient procedures for understanding the data flows produced by sensors for monitoring purposes are implemented. This paper details such a procedure, aiming at visualizing in a meaningful way the successive data measured on aircraft engines. The core of the procedure is based on Self-Organizing Maps (SOM), which are used to visualize the evolution of the data measured on the engines. Raw measurements cannot be used directly as inputs, because they are influenced by external conditions. A preprocessing procedure is set up to extract meaningful information and remove uninteresting variations due to changes in environmental conditions. The proposed procedure contains three main modules to tackle these difficulties: environmental conditions normalization (ECN), change detection and adaptive signal modeling (CD), and finally visualization with Self-Organizing Maps (SOM). The architecture of the procedure and of its modules is described in detail in this paper, and results on real data are also supplied.

Etienne Côme, Marie Cottrell, Michel Verleysen, Jérôme Lacaille

Data Mining in Medicine

Finding Temporal Patterns in Noisy Longitudinal Data: A Study in Diabetic Retinopathy

This paper describes an approach to temporal pattern mining using the concept of user-defined temporal prototypes to define the nature of the trends of interest. The temporal patterns are defined in terms of sequences of support values associated with identified frequent patterns. The prototypes are defined mathematically so that they can be mapped onto the temporal patterns. The focus for the advocated temporal pattern mining process is a large longitudinal patient database collected as part of a diabetic retinopathy screening programme. The data set is, in itself, also of interest as it is very noisy (in common with other similar medical datasets) and does not feature a clear association between specific time stamps and subsets of the data. The diabetic retinopathy application, the data warehousing and cleaning process, and the frequent pattern mining procedure (together with the application of the prototype concept) are all described in the paper. An evaluation of the frequent pattern mining process is also presented.

Vassiliki Somaraki, Deborah Broadbent, Frans Coenen, Simon Harding
Selection of High Risk Patients with Ranked Models Based on the CPL Criterion Functions

Important practical problems in computer-supported medical diagnosis are related to screening procedures; the identification of high-risk patients can serve as an example. The identification results should allow patients to be selected in an objective manner for additional therapeutic treatment. The design of such screening tools can be based on the minimisation of convex and piecewise-linear (CPL) criterion functions. In particular, ranked models can be designed in this manner for the purposes of screening procedures.

Leon Bobrowski
Medical Datasets Analysis: A Constructive Induction Approach

The main goal of our research was to develop a new methodology for building simplified learning models in the form of decision rule sets. Each investigated source dataset was extended by applying a constructive induction method to obtain a new, additional descriptive attribute, and sets of decision rules were then developed for the source and the extended database, respectively. In the last step, the obtained sets of rules were optimized and compared to the earlier sets of rules.

Wiesław Paja, Mariusz Wrzesień

Data Mining in Agriculture

Regression Models for Spatial Data: An Example from Precision Agriculture

The term precision agriculture refers to the application of state-of-the-art GPS technology in connection with small-scale, sensor-based treatment of the crop. This data-driven approach to agriculture poses a number of data mining problems. One of those is also an obviously important task in agriculture: yield prediction. Given a precise, geographically annotated data set for a certain field, can a season’s yield be predicted?

Numerous approaches have been proposed for solving this problem. In the past, classical regression models for non-spatial data have been used, such as regression trees, neural networks and support vector machines. However, in a cross-validation learning approach, issues arise with the assumption of statistical independence of the data records. Therefore, the geographical location of data records should clearly be considered when employing a regression model. This paper gives a short overview of the available data, points out the issues with the classical learning approaches, and presents a novel spatial cross-validation technique to overcome the problems and solve the aforementioned yield prediction task.

Georg Ruß, Rudolf Kruse
Trend Mining in Social Networks: A Study Using a Large Cattle Movement Database

This paper reports on a mechanism to identify temporal spatial trends in social networks. The trends of interest are defined in terms of the occurrence frequency of time-stamped patterns across social network data. The paper proposes a technique for identifying such trends founded on the Frequent Pattern Mining paradigm. The challenge of this technique is that, given appropriate conditions, many trends may be produced, and consequently the analysis of the end result is inhibited. To assist in the analysis, a Self Organising Map (SOM) based approach to visualizing the outcomes is proposed. The focus of the work is the social network represented by the UK’s cattle movement database. However, the proposed solution is equally applicable to other large social networks.

Puteri N. E. Nohuddin, Rob Christley, Frans Coenen, Christian Setzkorn

WebMining

Spam Email Filtering Using Network-Level Properties

Spam is a serious problem that affects email users (e.g. phishing attacks, viruses and time spent reading unwanted messages). We propose a novel spam email filtering approach based on network-level attributes (e.g. the sender’s IP geographic coordinates) that are more persistent in time than the message content. This approach was tested using two classifiers, Naive Bayes (NB) and Support Vector Machines (SVM), and compared against bag-of-words models and eight blacklists. Several experiments were conducted with recently collected legitimate (ham) and illegitimate (spam) messages, in order to simulate distinct user profiles from two countries (USA and Portugal). Overall, the network-level SVM model achieved the best discriminatory performance. Moreover, preliminary results suggest that this method is more robust to phishing attacks.
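A minimal sketch of the idea, assuming each message has already been reduced to a vector of network-level attributes (the feature list is illustrative, not the paper's exact set):

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# X: one row per message, e.g. [sender_latitude, sender_longitude, asn, hour_sent]
# y: 1 = spam, 0 = ham
def train_network_level_filter(X, y):
    return make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)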

Paulo Cortez, André Correia, Pedro Sousa, Miguel Rocha, Miguel Rio
Domain-Specific Identification of Topics and Trends in the Blogosphere

Staying tuned to the trends and opinions in a certain domain is an important task in many areas. For example, market researchers want to know about the acceptance of products. Traditionally this is done by screening broadcast media, but in recent years social media like the blogosphere have gained more and more importance. As manual screening of the blogosphere is a tedious task, automated knowledge discovery techniques for trend analysis and topic detection are needed.

Our system “Social Media Miner” supports professionals in these tasks. The system aggregates relevant blog articles in a specified domain from blog search services, analyzes their link structure and their importance, provides an overview of the most active topics and identifies general trends in the area. For every topic it gives the analyst access to the most relevant articles. Experiments show that our system achieves a high degree of sound automated processing.

Rafael Schirru, Darko Obradović, Stephan Baumann, Peter Wortmann
Combining Business Process and Data Discovery Techniques for Analyzing and Improving Integrated Care Pathways

Hospitals increasingly use process models for structuring their care processes. The activities performed on patients are logged to a database, but these data are rarely used for managing and improving the efficiency of care processes and the quality of care. In this paper, we propose a synergy of process mining and data discovery techniques. In particular, we analyze a dataset consisting of the activities performed on 148 patients during hospitalization for breast cancer treatment in a hospital in Belgium. We expose multiple quality-of-care issues that will be resolved in the near future, discover process variations and best practices, and uncover issues with the data registration system. For example, 25% of patients receiving breast-conserving therapy did not receive the key intervention “revalidation”. We found this was caused by lowering the length of stay in the hospital over the years without modifying the care process. Whereas the process representations offered by Hidden Markov Models are easier to use than those offered by Formal Concept Analysis, the latter data discovery technique has proven to be very useful for analyzing process anomalies and exceptions in detail.

Jonas Poelmans, Guido Dedene, Gerda Verheyden, Herman Van der Mussele, Stijn Viaene, Edward Peters
Interest-Determining Web Browser

This paper investigates the application of data-mining techniques to a user’s browsing history for the purpose of determining the user’s interests. More specifically, a system is outlined that attempts to determine certain keywords that a user may or may not be interested in. This is done by first applying a term-frequency/inverse-document-frequency filter to extract keywords from webpages in the user’s history, after which a Self-Organizing Map (SOM) neural network is utilized to determine if these keywords are of interest to the user. Such a system could enable web browsers to highlight areas of web pages that may be of higher interest to the user. It is found that while the system is indeed successful in identifying many keywords of user interest, it also misclassifies many uninteresting words, achieving only a 62% accuracy rate.
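A minimal sketch of the keyword-extraction stage using scikit-learn's TF-IDF implementation; the SOM stage that judges user interest is omitted here:

from sklearn.feature_extraction.text import TfidfVectorizer

def top_keywords(pages, per_page=5):
    # pages: list of page texts from the browsing history.
    vec = TfidfVectorizer(stop_words="english")
    tfidf = vec.fit_transform(pages)
    terms = vec.get_feature_names_out()
    return [[terms[j] for j in row.toarray().ravel().argsort()[::-1][:per_page]]
            for row in tfidf]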

Khaled Bashir Shaban, Joannes Chan, Raymond Szeto
Web-Site Boundary Detection

Defining the boundaries of a web-site, for (say) archiving or information retrieval purposes, is an important but complicated task. In this paper a web-page clustering approach to boundary detection is suggested. The principal issue is feature selection, hampered by the observation that there is no clear understanding of what a web-site is. This paper proposes a definition of a web-site, founded on the principle of user intention, directed at the boundary detection problem; and then reports on a sequence of experiments, using a number of clustering techniques and a wide range of features and combinations of features, to identify web-site boundaries. The preliminary results reported seem to indicate that, in general, a combination of features produces the most appropriate result.

Ayesh Alshukri, Frans Coenen, Michele Zito

Data Mining in Finance

An Application of Element Oriented Analysis Based Credit Scoring

In this paper, we present an application of an Element Oriented Analysis (EOA) credit scoring model used as a classifier for assessing bad-risk records. The objectives of this study are: 1) to develop a stratified model based on EOA to classify the risk in Brazilian credit card data; 2) to investigate whether this model is a satisfactory classifier for this application; and 3) to compare the characteristics of our model to conventional credit scoring models in this specific domain. Classifier performance is measured using the Area under the Receiver Operating Characteristic curve (AUC) and the overall error rate in out-of-sample tests.

Yihao Zhang, Mehmet A. Orgun, Rohan Baxter, Weiqiang Lin
A Semi-supervised Approach for Reject Inference in Credit Scoring Using SVMs

This paper presents a novel semi-supervised approach that determines a linear predictor using Support Vector Machines (SVMs) and incorporates information on rejected loans, assuming that the labeled data (accepted applicants) and unlabeled data (rejected applicants) are not drawn from the same distribution. We use a self-training algorithm in order to predict how likely a rejected applicant would have repaid had the applicant received credit. A modification to the self-training algorithm based on Platt’s probabilistic output for SVMs is introduced. Experiments with two toy data sets, a well-known benchmark credit scoring data set, and a project performed for a Chilean financial institution demonstrate that our approach achieves the best classification performance compared to well-known reject inference alternatives and another state-of-the-art semi-supervised method for SVMs (Transductive SVM).
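A generic self-training sketch using Platt-scaled SVM probabilities (scikit-learn's probability=True); the paper's modification to the algorithm is not reproduced here:

import numpy as np
from sklearn.svm import SVC

def self_train(X_acc, y_acc, X_rej, confidence=0.9, max_iter=10):
    X, y = X_acc.copy(), y_acc.copy()
    clf = None
    for _ in range(max_iter):
        clf = SVC(kernel="linear", probability=True).fit(X, y)   # Platt scaling
        if len(X_rej) == 0:
            break
        proba = clf.predict_proba(X_rej)
        sure = proba.max(axis=1) >= confidence
        if not sure.any():
            break
        # Adopt confident pseudo-labels for rejected applicants and retrain.
        X = np.vstack([X, X_rej[sure]])
        y = np.concatenate([y, clf.classes_[proba.argmax(axis=1)[sure]]])
        X_rej = X_rej[~sure]
    return clf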

Sebastián Maldonado, Gonzalo Paredes

Aspects of Data Mining

Data Mining with Neural Networks and Support Vector Machines Using the R/rminer Tool

We present rminer, our open source library for the R tool, which facilitates the use of data mining (DM) algorithms, such as neural networks (NNs) and support vector machines (SVMs), in classification and regression tasks. Tutorial examples with real-world problems (i.e. satellite image analysis and prediction of car prices) are used to demonstrate the rminer capabilities and the NN/SVM advantages. Additional experiments were also conducted to test the rminer predictive capabilities, revealing competitive performances.

Paulo Cortez
The Orange Customer Analysis Platform

In itself, the continuous exponential increase in data-warehouse size does not necessarily lead to richer and finer-grained information, since processing capabilities do not increase at the same rate. Current state-of-the-art technologies require the user to strike a delicate balance between processing cost and information quality. We describe an industrial approach which leverages recent advances in treatment automatization and in relevant data/instance selection and indexing to dramatically improve our capability to turn huge volumes of raw data into useful information.

Raphaël Féraud, Marc Boullé, Fabrice Clérot, Françoise Fessant, Vincent Lemaire
Semi-supervised Learning for False Alarm Reduction

Intrusion Detection Systems (IDSs), which have been deployed in computer networks to detect a wide variety of attacks, suffer from the difficulty of managing the large number of triggered alerts. Thus, reducing false alarms efficiently has become the most important issue in IDS. In this paper, we introduce a semi-supervised learning mechanism to build an alert filter, which reduces up to 85% of false alarms while keeping a high detection rate. Our semi-supervised learning approach needs only a very small amount of label information. This saves a huge amount of security officers’ effort and makes the alert filter more practical for real systems. In numerical comparisons with a conventional supervised learning approach using the same small portion of labeled data, our method achieves a significantly superior detection rate as well as false alarm reduction rate.

Chien-Yi Chiu, Yuh-Jye Lee, Chien-Chung Chang, Wen-Yang Luo, Hsiu-Chuan Huang
Learning from Humanoid Cartoon Designs

Character design is a key ingredient in the success of any comic book, graphic novel, or animated feature. Artists typically use shape, size and proportion as the first design layer to express role, physicality and personality traits. In this paper, we propose a knowledge mining framework that extracts primitive shape features from finished art and trains models with labeled metadata attributes. The applications are in shape-based querying of character databases as well as label-based generation of basic shape scaffolds, providing an informed starting point for sketching new characters. It paves the way for more intelligent shape indexing of arbitrary well-structured objects in image libraries. Furthermore, it provides an excellent tool for novices and junior artists to learn from the experts. We first describe a novel primitive-based shape signature for annotating character body parts. We then use a support vector machine to classify these characters using their body parts’ shape signatures as features. The proposed data transformation is computationally light and yields compact storage. We compare the learning performance of our shape representation with a low-level point feature representation, with substantial improvement.

Md. Tanvirul Islam, Kaiser Md. Nahiduzzaman, Why Yong Peng, Golam Ashraf
Mining Relationship Associations from Knowledge about Failures Using Ontology and Inference

Mining general knowledge about relationships between concepts described in the analyses of failure cases could help people to avoid repeating previous failures. Furthermore, by representing knowledge using ontologies that support inference, we can identify relationships between concepts more effectively than text-mining techniques. A relationship association is a form of knowledge generalization that is based on binary relationships between entities in semantic graphs. Specifically, relationship associations involve two binary relationships that share a connecting entity and that co-occur frequently in a set of semantic graphs. Such connected relationships can be considered as generalized knowledge mined from a set of knowledge resources, such as failure case descriptions, that are formally represented by the semantic graphs. This paper presents the application of a technique to mine relationship associations from formalized semantic descriptions of failure cases. Results of mining relationship associations in a knowledge base containing 291 semantic graphs representing failure cases are presented.

Weisen Guo, Steven B. Kraines

Data Mining for Network Performance Monitoring

Event Prediction in Network Monitoring Systems: Performing Sequential Pattern Mining in Osmius Monitoring Tool

Event prediction is one of the most challenging problems in network monitoring systems. This type of inductive knowledge provides monitoring systems with valuable real time predictive capabilities. By obtaining this knowledge, system and network administrators can anticipate and prevent failures.

In this paper we present a prediction module for the monitoring software Osmius (www.osmius.net). Osmius has been developed by Peopleware (peopleware.es). We have extended the Osmius database to store the knowledge we obtain from the algorithms in a highly parametrized way. Thus system administrators can apply the most appropriate settings for each system.

Results are presented in terms of positive predictive values and false discovery rates over a huge event database. They confirm that these pattern mining processes will provide network monitoring systems with accurate real time predictive capabilities.

Rafael García, Luis Llana, Constantino Malagón, Jesús Pancorbo
Selection of Effective Network Parameters in Attacks for Intrusion Detection

Current Intrusion Detection Systems (IDS) examine a large number of data features to detect intrusion or misuse patterns, but some of the features may be redundant or contribute little to the detection process. The purpose of this study is to identify important input features for building an IDS that is computationally efficient and effective. This paper proposes and investigates a selection of effective network parameters for detecting network intrusions, extracted from the Tcpdump DARPA1998 dataset. Here, the PCA method is used to determine an optimal feature set. An appropriate feature set helps to build an efficient decision model as well as to reduce the size of the feature set. Feature reduction considerably speeds up the training and testing processes for the attack identification system. The Tcpdump DARPA1998 intrusion dataset was used in the experiments as the test data. Experimental results indicate a reduction in training and testing time while maintaining detection accuracy within a tolerable range.
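A minimal sketch of the PCA reduction step, assuming preprocessed numeric features extracted from Tcpdump records; the component count and the downstream classifier are illustrative choices:

from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

def build_ids_model(X, y, n_components=10):
    # Standardize, project onto the top principal components, then classify.
    pipe = make_pipeline(StandardScaler(),
                         PCA(n_components=n_components),
                         DecisionTreeClassifier())
    return pipe.fit(X, y)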

Gholam Reza Zargar, Peyman Kabiri
Backmatter
Metadata
Title
Advances in Data Mining. Applications and Theoretical Aspects
edited by
Petra Perner
Copyright year
2010
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-14400-4
Print ISBN
978-3-642-14399-1
DOI
https://doi.org/10.1007/978-3-642-14400-4