2006 | Book

Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining

6th Industrial Conference on Data Mining, ICDM 2006, Leipzig, Germany, July 14-15, 2006. Proceedings

About this book

The Industrial Conference on Data Mining ICDM-Leipzig was the sixth event in a series of annual events which started in 2000. We are pleased to note that the topic data mining with special emphasis on real-world applications has been adopted by so many researchers all over the world into their research work. We received 156 papers from 19 different countries. The main topics are data mining in medicine and marketing, web mining, mining of images and signals, theoretical aspects of data mining, and aspects of data mining that bundle a series of different data mining applications such as intrusion detection, knowledge management, manufacturing process control, time-series mining and criminal investigations. The Program Committee worked hard in order to select the best papers. The acceptance rate was 30%. All these selected papers are published in this proceedings volume as long papers up to 15 pages. Moreover we installed a forum where work in progress was presented. These papers are collected in a special poster proceedings volume and show once more the potentials and interesting developments of data mining for different applications. Three new workshops have been established in connection with ICDM: (1) Mass Data Analysis on Images and Signals, MDA 2006; (2) Data Mining for Life Sciences, DMLS 2006; and (3) Data Mining in Marketing, DMM 2006. These workshops are developing new topics for data mining under the aspect of the special application. We are pleased to see how many interesting developments are going on in these fields.

Table of Contents

Frontmatter

Data Mining in Medicine

Using Prototypes and Adaptation Rules for Diagnosis of Dysmorphic Syndromes

Since diagnosis of dysmorphic syndromes is a domain with incomplete knowledge and where even experts have seen only few syndromes themselves during their lifetime, documentation of cases and the use of case-oriented techniques are popular. In dysmorphic systems, diagnosis usually is performed as a classification task, where a prototypicality measure is applied to determine the most probable syndrome. These measures differ from the usual Case-Based Reasoning similarity measures, because here cases and syndromes are not represented as attribute value pairs but as long lists of symptoms, and because query cases are not compared with cases but with prototypes. In contrast to these dysmorphic systems our approach additionally applies adaptation rules. These rules do not only consider single symptoms but combinations of them, which indicate high or low probabilities of specific syndromes.

Rainer Schmidt, Tina Waligora
OVA Scheme vs. Single Machine Approach in Feature Selection for Microarray Datasets

The large number of genes in microarray data makes feature selection techniques more crucial than ever. From rank-based filter techniques to classifier-based wrapper techniques, many studies have devised their own feature selection techniques for microarray datasets. By combining the OVA (one-vs.-all) approach and differential prioritization in our feature selection technique, we ensure that class-specific relevant features are selected while guarding against redundancy in predictor set at the same time. In this paper we present the OVA version of our differential prioritization-based feature selection technique and demonstrate how it works better than the original SMA (single machine approach) version.

Chia Huey Ooi, Madhu Chetty, Shyh Wei Teng
Similarity Searching in DNA Sequences by Spectral Distortion Measures

Searching for similarity among biological sequences is an important research area of bioinformatics because it can provide insight into the evolutionary and genetic relationships between species that open doors to new scientific discoveries such as drug design and treatment. In this paper, we introduce a novel measure of similarity between two biological sequences without the need of alignment. The method is based on the concept of spectral distortion measures developed for signal processing. The proposed method was tested using a set of six DNA sequences taken from Escherichia coli K-12 and Shigella flexneri, and one random sequence. It was further tested with a complex dataset of 40 DNA sequences taken from the GenBank sequence database. The results obtained from the proposed method are found superior to some existing methods for similarity measure of DNA sequences.

Tuan D. Pham
Multispecies Gene Entropy Estimation, a Data Mining Approach

This paper presents a data mining approach to estimate multispecies gene entropy by using a self-organizing map (SOM) to mine a homologous gene set. The gene distribution function for each gene in the feature space is approximated by its probability distribution in the feature space. The phylogenetic applications of the multispecies gene entropy are investigated in an example of inferring the species phylogeny of eight yeast species. It is found that genes with the nearest K-L distances to the minimum entropy gene are more likely to be phylogenetically informative. The K-L distances of genes are strongly correlated with the spectral radiuses of their identity percentage matrices. The images of identity percentage matrices of the genes with small K-L distances to the minimum entropy gene are more similar to the image of the minimum entropy gene in their frequency domains after fast Fourier transforms (FFT) than the images of those genes with large K-L distances to the minimum entropy gene. Finally, a K-L distance based gene concatenation approach under gene clustering is proposed to infer species phylogenies robustly and systematically.
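
The K-L (Kullback-Leibler) distance used above for ranking genes can be illustrated with a minimal sketch. The distributions, the gene names, and the direction of the divergence (candidate gene relative to the reference gene) are assumptions made for illustration; the abstract does not specify them.

```python
import math

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for two discrete
    distributions given as equal-length lists of probabilities."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf  # p puts mass where q has none
            total += pi * math.log(pi / qi)
    return total

# Hypothetical distributions: a reference ("minimum entropy") gene and two candidates.
reference = [0.7, 0.2, 0.1]
candidates = {
    "gene_a": [0.6, 0.3, 0.1],
    "gene_b": [0.1, 0.2, 0.7],
}

# Rank candidates by their divergence from the reference, nearest first.
ranked = sorted(candidates, key=lambda g: kl_divergence(candidates[g], reference))
print(ranked)  # gene_a is closer to the reference than gene_b
```

Note that the K-L divergence is not symmetric, so the ranking can depend on which distribution is treated as the reference.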

Xiaoxu Han
A Unified Approach for Discovery of Interesting Association Rules in Medical Databases

Association rule discovery is an important technique for mining knowledge from large databases. Data mining researchers have studied subjective measures of interestingness to reduce the volume of discovered rules and to improve the overall efficiency of the knowledge discovery in databases process (KDD). The objective of this paper is to provide a framework that uses subjective measures of interestingness to discover interesting patterns from association rules algorithms. The framework works in an environment where the medical databases are evolving with time. In this paper we consider a unified approach to quantify interestingness of association rules. We believe that the expert mining can provide a basis for determining user threshold which will ultimately help us in finding interesting rules. The framework is tested on public datasets in medical domain and results are promising.

Harleen Kaur, Siri Krishan Wasan, Ahmed Sultan Al-Hegami, Vasudha Bhatnagar
Named Relationship Mining from Medical Literature

This article addresses the task of mining named relationships between concepts from biomedical literature for indexing purposes or for scientific discovery from medical literature. This research builds on previous work on concept mining from medical literature for indexing purposes and proposes to learn the semantic names of the relationships between the concepts learnt. The previous ConceptMiner system learnt pairs of concepts, each expressing a relationship between two concepts, but did not learn the semantic names of these relationships. Building on ConceptMiner, RelationshipMiner also learns the relationships together with their names, as identified from the Unified Medical Language System (UMLS) knowledge base, as a basis for creating higher-level knowledge structures, such as rules, cases, and models, in future work. The current system is focused on learning semantically typed relationships as predefined in the UMLS, for which a dictionary of synonyms and variations has been created. An evaluation is presented showing that this relationship mining task improves the results of the concept mining task by enabling a better screening of the relationships between concepts for relevant ones.

Isabelle Bichindaritz
Experimental Study of Evolutionary Based Method of Rule Extraction from Neural Networks in Medical Data

In the paper the method of rule extraction from neural networks based on evolutionary approach, called GEX, is presented. Its details are described but the main stress is focussed on the experimental studies, the aim of which was to examine its usefulness in knowledge discovery and rule extraction for classification task of medical data. The tests were made using the well-known benchmark data sets from UCI, as well as two other data sets collected by Lower Silesian Oncology Center.

Urszula Markowska-Kaczmar, Rafal Matkowski

Web Mining and Logfile Analysis

httpHunting: An IBR Approach to Filtering Dangerous HTTP Traffic

Recently, there has been significant interest in applying artificial intelligence techniques to the intrusion detection problem. To address the difficulties in acquiring and representing existing knowledge in most systems, we propose a novel instance-based intrusion detection system called httpHunting. It provides a framework for the intrusion detection problem, incorporating several artificial intelligence techniques that help to overcome some of those limitations. httpHunting is able to classify, in real time, traffic data arriving at the network interface of the host that it is protecting, detecting anomalous traffic patterns. From our initial experiments, we can conclude that such an approach offers important benefits in the network traffic-filtering domain.

F. Fdez-Riverola, L. Borrajo, R. Laza, F. J. Rodríguez, D. Martínez
A Comparative Performance Study of Feature Selection Methods for the Anti-spam Filtering Domain

In this paper we analyse the strengths and weaknesses of the most commonly used feature selection methods in text categorization when they are applied to the spam problem domain. Several experiments with different feature selection methods and content-based filtering techniques are carried out and discussed. Information Gain, χ²-test, Mutual Information and Document Frequency feature selection methods have been analysed in conjunction with Naïve Bayes, boosting trees, Support Vector Machines and ECUE models in different scenarios. From the experiments carried out, the underlying ideas behind feature selection methods are identified and applied for improving the feature selection process of SpamHunting, a novel anti-spam filtering software able to accurately classify suspicious e-mails.

J. R. Méndez, F. Fdez-Riverola, F. Díaz, E. L. Iglesias, J. M. Corchado
Evaluation of Web Robot Discovery Techniques: A Benchmarking Study

This paper describes part of a web usage mining study executed on log files obtained from a Belgian e-commerce company. From these log files, it can be observed that numerous web robots are active on the site. Most of these robots show a crawling behavior that is radically different from the browsing behavior of human visitors. Because the owners of the e-shop desire information about the paths that human visitors follow through the site, it is of crucial importance to remove these robotic visits from the log files.

Several existing methods for web robot discovery are evaluated and compared, none of them leading to satisfying results. Therefore, a new technique is developed that results in a successful and reliable identification of web robots.

Nick Geens, Johan Huysmans, Jan Vanthienen
Data Preparation of Web Log Files for Marketing Aspects Analyses

This article deals with several aspects of a marketing-oriented analysis of web log files. It discusses their preprocessing and possible ways to enrich the raw data that can be gained from a web log file in order to facilitate a later use in different analyses. Further, we look at the question which requirements a good web log analysis software needs to meet and offer an overview over current and future analysis practices including their advantages and disadvantages.

Meike Reichle, Petra Perner, Klaus-Dieter Althoff
UP-DRES User Profiling for a Dynamic REcommendation System

The WWW is currently the most dynamic and attractive information exchange place. Finding useful information is hard due to the huge amount of data, varied topics and unstructured contents. In this paper we present a web browsing support system that proposes personalized contents. It is integrated in the content management system and it runs on the server hosting the site. It periodically processes site contents, extracting vectors of the most significant words. A topology tree is defined by applying hierarchical clustering. During online browsing, viewed contents are processed and mapped into the previously defined vector space. The centroid of these vectors is compared with the centroids of the topology tree nodes to find the most similar one; its contents are presented to the user as link suggestions or dynamically created pages. The personal profile is saved after every session and included in the analysis during the same user’s subsequent visits, avoiding the cold start problem.
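
The centroid comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the similarity measure, so cosine similarity is used here, and the node names and word vectors are hypothetical.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def most_similar_node(viewed_vectors, node_centroids):
    """Name of the tree node whose centroid is closest to the session centroid."""
    session_centroid = centroid(viewed_vectors)
    return max(node_centroids, key=lambda name: cosine(session_centroid, node_centroids[name]))

# Hypothetical topology-tree nodes and one browsing session (3-word vocabulary).
nodes = {"sports": [0.9, 0.1, 0.0], "finance": [0.1, 0.8, 0.3]}
session = [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]]
best = most_similar_node(session, nodes)
print(best)  # sports
```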

Enza Messina, Daniele Toscani, Francesco Archetti
Improving Effectiveness on Clickstream Data Mining

Developing and applying data mining processes are often very complex tasks to users without deep knowledge in this domain, particularly when such tasks involve clickstream data processing. One important and known challenge arises in the selection of mining methods to apply on a specific data analysis problem, trying to get better and useful results for a particular goal. Our approach to address this challenge relies on the reuse of the acquired experience from similar problems, which had provided successful mining processes in the past. In order to accomplish such goal, we implemented a prototype mining plans selection system, based on the Case-Based Reasoning paradigm. In this paper we explain how this paradigm and the implemented system may be explored to assist decisions on the data mining or Web usage mining specific scope. Additionally, we also identify the underlying issues and the approaches that were followed.

Cristina Wanzeller, Orlando Belo
Conceptual Knowledge Retrieval with FooCA: Improving Web Search Engine Results with Contexts and Concept Hierarchies

This paper presents a new approach to accessing information on the Web. FooCA, an application in the field of Conceptual Knowledge Processing, is introduced to support a holistic representation of today’s standard sequential Web search engine retrieval results. FooCA uses the itemset consisting of the title, a short description, and the URL to build a context and the appropriate concept hierarchy. In order to generate a nicely arranged concept hierarchy using line diagrams to retrieve and analyze the data, the prior context can be iteratively explored and enhanced. The combination of Web Mining techniques and Formal Concept Analysis (FCA) with contextual attribute elicitation gives the user more insight and more options than a traditional search engine interface. Besides serving as a tool for holistic data exploration, FooCA also enables the regular user to learn step by step how to run new, optimized search queries for his personal information need on the Web.

Bjoern Koester

Theoretical Aspects of Data Mining

A Pruning Based Incremental Construction Algorithm of Concept Lattice

The concept lattice has played an important role in knowledge discovery. However due to inevitable occurrence of redundant information in the construction process of concept lattice, the low construction efficiency has been a main concern in the literature. In this work, an improved incremental construction algorithm of concept lattice over the traditional Godin algorithm, called the pruning based incremental algorithm is proposed, which uses a pruning process to detect and eliminate possible redundant information during the construction. Our pruning based construction algorithm is in nature superior to the Godin algorithm. It can achieve the same structure with the Godin algorithm but with less computational complexity. In addition, our pruning based algorithm is also experimentally validated by taking the star spectra from the LAMOST project as the formal context.
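
To make concrete what the nodes of a concept lattice are, the following sketch enumerates all formal concepts (extent, intent) of a tiny formal context by exhaustive closure. It is a naive illustration only, neither the Godin algorithm nor the pruning based algorithm of the paper, and the context is hypothetical.

```python
from itertools import combinations

def common_attributes(objects, context, all_attributes):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set(all_attributes)
    for obj in objects:
        shared &= context[obj]
    return shared

def objects_having(attributes, context):
    """Objects that possess every attribute in `attributes`."""
    return {obj for obj, attrs in context.items() if attributes <= attrs}

def formal_concepts(context):
    """All (extent, intent) pairs of the context, found by closing every object subset."""
    all_attributes = set().union(*context.values())
    objects = sorted(context)
    concepts = set()
    for size in range(len(objects) + 1):
        for subset in combinations(objects, size):
            intent = common_attributes(subset, context, all_attributes)
            extent = objects_having(intent, context)
            concepts.add((frozenset(extent), frozenset(intent)))
    return concepts

# Hypothetical formal context: object -> set of attributes.
context = {"o1": {"a", "b"}, "o2": {"b", "c"}, "o3": {"b"}}
concepts = formal_concepts(context)
print(len(concepts))  # 4
```

The exhaustive loop visits every subset of objects, which is exponential in the number of objects; incremental algorithms exist precisely to avoid this kind of redundant work.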

Zhang Ji-Fu, Hu Li-Hua, Zhang Su-Lan
Association Rule Mining with Chi-Squared Test Using Alternate Genetic Network Programming

A method of association rule mining using Alternate Genetic Network Programming (aGNP) is proposed. GNP is one of the evolutionary optimization techniques, which uses directed graph structures as genes. aGNP is an extended GNP in terms of including two kinds of sets of node functions. The proposed system can extract important association rules whose antecedent and consequent are composed of the attributes of each family defined by users. The method measures the significance of association via chi-squared test using GNP’s features. Rule extraction is done without identifying frequent itemsets used in Apriori-like methods. Therefore, the method can be applied to rule extraction from dense database, and can extract dependent pairs of the sets of attributes in the database. Extracted rules are stored in a pool all together through generations and reflected in genetic operators as acquired information. In this paper, we describe the algorithm capable of finding the important association rules and present some experimental results.
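
The chi-squared significance measure mentioned above can be sketched for a single candidate rule X → Y from support counts. This shows only the standard 2×2 test statistic; it does not reproduce how aGNP evaluates rules, and the counts are hypothetical.

```python
def chi_squared(n, n_x, n_y, n_xy):
    """Chi-squared statistic of the 2x2 contingency table for itemsets X and Y.

    n    -- total number of transactions
    n_x  -- transactions containing X
    n_y  -- transactions containing Y
    n_xy -- transactions containing both X and Y
    """
    observed = [
        [n_xy, n_x - n_xy],
        [n_y - n_xy, n - n_x - n_y + n_xy],
    ]
    row_totals = [n_x, n - n_x]
    col_totals = [n_y, n - n_y]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                raise ValueError("expected count is zero; the test is undefined")
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2

# Hypothetical counts: X in 40 of 100 transactions, Y in 50, both together in 35.
score = chi_squared(100, 40, 50, 35)
print(score)  # 37.5

# With one degree of freedom, the 5% critical value is about 3.84.
is_significant = score > 3.84
```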

Kaoru Shimada, Kotaro Hirasawa, Jinglu Hu
Ordinal Classification with Monotonicity Constraints

Classification methods commonly assume unordered class values. In many practical applications – for example grading – there is a natural ordering between class values. Furthermore, some attribute values of classified objects can be ordered, too. The standard approach in this case is to convert the ordered values into a numeric quantity and apply a regression learner to the transformed data. This approach can be used just in case of linear ordering. The proposed method for such a classification lies on the boundary between ordinal classification trees, classification trees with monotonicity constraints and multi-relational classification trees. The advantage of the proposed method is that it is able to handle non-linear ordering on the class and attribute values. For the better understanding, we use a toy example from the semantic web environment – prediction of rules for the user’s evaluation of hotels.

Tomáš Horváth, Peter Vojtáš
Local Modelling in Classification on Different Feature Subspaces

Sometimes one may be confronted with classification problems where classes are constituted of several subclasses that possess different distributions and therefore destroy accurate models of the entire classes as one similar group. An issue is modelling via local models of several subclasses.

In this paper, a method is presented of how to handle such classification problems where the subclasses are furthermore characterized by different subsets of the variables. Situations are outlined and tested where such local models in different variable subspaces dramatically improve the classification error.

Gero Szepannek, Claus Weihs
Supervised Selection of Dynamic Features, with an Application to Telecommunication Data Preparation

In the field of data mining, data preparation has more and more in common with a bottleneck. Indeed, collecting and storing data becomes cheaper while modelling costs remain unchanged. As a result, feature selection is now usually performed. In the data preparation step, selection often relies on feature ranking. In the supervised classification context, ranking is based on the information that the explanatory feature brings on the target categorical attribute.

With the increasing presence in the database of features measured over time, i.e. dynamic features, new supervised ranking methods have to be designed. In this paper, we propose a new method to evaluate dynamic features, which is derived from a probabilistic criterion. The criterion is non-parametric and automatically handles the problem of overfitting the data. The resulting evaluation produces reliable results. Furthermore, the design of the criterion relies on an understandable and simple approach. This makes it possible to provide a meaningful visualization of the evaluation, in addition to the computed score. The advantages of the new method are illustrated on a telecommunication dataset.

Sylvain Ferrandiz, Marc Boullé
Using Multi-SOMs and Multi-Neural-Gas as Neural Classifiers

Within this paper we present the extension of two neural network paradigms for clustering tasks. The Self Organizing feature Maps (SOM) are extended to the Multi SOM approach, and the Neural Gas is extended to a Multi Neural Gas. Some common cluster analysis coefficients (Silhouette Coefficient, Gap Statistics, Calinski-Harabasz Coefficient) have been adapted for the new paradigms. Both new neural clustering methods are described and evaluated briefly using exemplary data sets.

Nils Goerke, Alexandra Scherbart
Derivative Free Stochastic Discrete Gradient Method with Adaptive Mutation

In data mining we come across many problems, such as function optimization or parameter estimation for classifiers, for which a good search algorithm is essential. In this paper we propose a stochastic derivative-free algorithm for unconstrained optimization problems. Many derivative-based local search methods exist, but they usually get stuck in local solutions for non-convex optimization problems. On the other hand, global search methods are very time consuming and work only for a limited number of variables. In this paper we investigate a derivative-free multi-search gradient-based method which overcomes the problem of local minima and produces a global solution in less time. We have tested the proposed method on many benchmark datasets from the literature and compared the results with other existing algorithms. The results are very promising.

Ranadhir Ghosh, Moumita Ghosh, Adil Bagirov

Data Mining in Marketing

Association Analysis of Customer Services from the Enterprise Customer Management System

The communications market has seen rising competition among businesses. While securing new customers is still important, it is more crucial to maintain and manage existing customers by providing optimized service and efficient marketing strategies for each customer in order to preserve existing customers from business rivals and ultimately maximize corporate sales. This thesis investigates how to obtain useful methodologies for customer management by applying the technological concepts of data-mining and association analysis to KT’s secure customer data.

Sung-Ju Kim, Dong-Sik Yun, Byung-Soo Chang
Feature Selection in an Electric Billing Database Considering Attribute Inter-dependencies

With the increasing size of databases, feature selection has become a relevant and challenging problem for the area of knowledge discovery in databases. An effective feature selection strategy can significantly reduce the data mining processing time, improve the predicted accuracy, and help to understand the induced models, as they tend to be smaller and make more sense to the user. Many feature selection algorithms assumed that the attributes are independent between each other given the class, which can produce models with redundant attributes and/or exclude sets of attributes that are relevant when considered together. In this paper, an effective best first search algorithm, called buBF, for feature selection is described. buBF uses a novel heuristic function based on n-way entropy to capture inter-dependencies among variables. It is shown that buBF produces more accurate models than other state-of-the-art feature selection algorithms when compared on several real and synthetic datasets. Specifically we apply buBF to a Mexican Electric Billing database and obtain satisfactory results.
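
Why attributes may be relevant only when considered together can be shown with a small entropy-based sketch. It is not buBF's heuristic, whose definition is not given in the abstract; it merely compares the information gain of single attributes with that of an attribute pair on a hypothetical XOR dataset.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def information_gain(rows, attribute_indices, labels):
    """Reduction in class entropy when the listed attributes are known jointly."""
    keys = [tuple(row[i] for i in attribute_indices) for row in rows]
    n = len(rows)
    conditional = 0.0
    for key, count in Counter(keys).items():
        subset = [label for k, label in zip(keys, labels) if k == key]
        conditional += (count / n) * entropy(subset)
    return entropy(labels) - conditional

# Hypothetical dataset: the class is the XOR of two binary attributes.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [a ^ b for a, b in rows]

gain_first = information_gain(rows, [0], labels)    # attribute 0 alone
gain_second = information_gain(rows, [1], labels)   # attribute 1 alone
gain_pair = information_gain(rows, [0, 1], labels)  # both attributes jointly
print(gain_first, gain_second, gain_pair)
```

Each attribute alone carries no information about the class, while the pair determines it completely, so a ranking that scores attributes one at a time would discard both.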

Manuel Mejía-Lavalle, Eduardo F. Morales
Learning the Reasons Why Groups of Consumers Prefer Some Food Products

In this paper we propose a method for learning the reasons why groups of consumers prefer some food products instead of others of the same type. We emphasize the role of groups given that, from a practical point of view, they may represent market segments that demand different products. Our method starts by representing people’s preferences in a metric space; there we are able to define similarity functions that allow a clustering algorithm to discover significant groups of consumers with homogeneous tastes. Finally, in each cluster, we learn, with an SVM, a function that explains the tastes of the consumers grouped in the cluster. Additionally, a feature selection process highlights the essential properties of food products that have a major influence on their acceptability. To illustrate our method, a real case of consumers of lamb meat was studied. The panel was formed by 773 people of 216 families from 6 European countries. Different tastes between Northern and Southern families were enhanced.

Juan José del Coz, Jorge Díez, Antonio Bahamonde, Carlos Sañudo, Matilde Alfonso, Philippe Berge, Eric Dransfield, Costas Stamataris, Demetrios Zygoyiannis, Tyri Valdimarsdottir, Edi Piasentier, Geoffrey Nute, Alan Fisher
Exploiting Randomness for Feature Selection in Multinomial Logit: A CRM Cross-Sell Application

Data mining applications addressing classification problems must master two key tasks: feature selection and model selection. This paper proposes a random feature selection procedure integrated within the multinomial logit (MNL) classifier to perform both tasks simultaneously. We assess the potential of the random feature selection procedure (exploiting randomness) as compared to an expert feature selection method (exploiting domain-knowledge) on a CRM cross-sell application. The results show great promise as the predictive accuracy of the integrated random feature selection in the MNL algorithm is substantially higher than that of the expert feature selection method.

Anita Prinzie, Dirk Van den Poel
Data Mining Analysis on Italian Family Preferences and Expenditures

Italian expenditures are a complex system. Every year the Italian National Bureau of Statistics (ISTAT) carries out a survey on the expenditure behavior of Italian families. The survey regards household expenditures on durable and daily goods and on various services. Our goal is here twofold: firstly we describe the most important characteristics of family behavior with respect to expenditures on goods and usage of different services; secondly possible relationships among these behaviors are highlighted and explained by social-demographical features of families. Different data mining techniques are jointly used to these aims so as to identify different capabilities of selected methods within these kinds of issues. In order to properly focalize on service usage, further investigation will be needed about the nature of investigated services (private or public) and, most of all, about their supply and effectiveness along the national territory.

Paola Annoni, Pier Alda Ferrari, Silvia Salini
Multiobjective Evolutionary Induction of Subgroup Discovery Fuzzy Rules: A Case Study in Marketing

This paper presents a multiobjective genetic algorithm which obtains fuzzy rules for subgroup discovery in disjunctive normal form. This kind of fuzzy rules lets us represent knowledge about patterns of interest in an explanatory and understandable form which can be used by the expert. The evolutionary algorithm follows a multiobjective approach in order to optimize in a suitable way the different quality measures used in this kind of problems. Experimental evaluation of the algorithm, applying it to a market problem studied in the University of Mondragón (Spain), shows the validity of the proposal. The application of the proposal to this problem allows us to obtain novel and valuable knowledge for the experts.

Francisco Berlanga, María José del Jesus, Pedro González, Francisco Herrera, Mikel Mesonero
A Scatter Search Algorithm for the Automatic Clustering Problem

We present a new hybrid algorithm for data clustering. This new proposal uses one of the well known evolutionary algorithms called Scatter Search. Scatter Search operates on a small set of solutions and makes only a limited use of randomization for diversification when searching for globally optimal solutions. The proposed method automatically discovers the number of clusters and the cluster centres without prior knowledge of a possible number of classes, and without any initial partition. We have applied this algorithm on standard and real world databases and we have obtained good results compared to the K-means algorithm and an artificial ant based algorithm, the Antclass algorithm.

Rasha S. Abdule-Wahab, Nicolas Monmarché, Mohamed Slimane, Moaid A. Fahdil, Hilal H. Saleh
Multi-objective Parameters Selection for SVM Classification Using NSGA-II

Selecting proper parameters is an important issue in extending the classification ability of the Support Vector Machine (SVM), which makes SVM practically useful. The Genetic Algorithm (GA) has been widely applied to solve the problem of parameter selection for SVM classification due to its ability to discover good solutions quickly for complex searching and optimization problems. However, traditional GA in this field relies on a single generalization error bound as the fitness function to select parameters. Since several generalization error bounds have been developed, picking and using a single criterion as the fitness function seems intractable and insufficient. Motivated by multi-objective optimization problems, this paper introduces an efficient method of parameter selection for SVM classification based on the multi-objective evolutionary algorithm NSGA-II. We also introduce an adaptive mutation rate for NSGA-II. Experimental results show that our method is better than single-objective approaches, especially in the case of tiny training sets with large testing sets.

Li Xu, Chunping Li
Effectiveness Evaluation of Data Mining Based IDS

Data mining has been widely applied to the problem of Intrusion Detection in computer networks. However, the misconception of the underlying problem has led to out of context results. This paper shows that factors such as the probability of intrusion and the costs of responding to detected intrusions must be taken into account in order to compare the effectiveness of machine learning algorithms over the intrusion detection domain. Furthermore, we show the advantages of combining different detection techniques. Results regarding the well known 1999 KDD dataset are shown.
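
The point that detectors cannot be compared without the probability of intrusion and the cost of responding can be illustrated with a simple expected-cost calculation. The cost model, the parameter names, and all numbers below are assumptions chosen for illustration; they are not taken from the paper.

```python
def expected_cost_per_event(p_intrusion, tpr, fpr, cost_miss, cost_response):
    """Expected cost per monitored event under a simple assumed cost model.

    p_intrusion   -- prior probability that an event is an intrusion
    tpr, fpr      -- true positive and false positive rates of the detector
    cost_miss     -- damage caused by an undetected intrusion
    cost_response -- cost of responding to any alarm, true or false
    """
    missed = p_intrusion * (1 - tpr) * cost_miss
    alarms = p_intrusion * tpr + (1 - p_intrusion) * fpr
    return missed + alarms * cost_response

# Two hypothetical detectors: A is sensitive but noisy, B is quiet but misses more.
detector_a = {"tpr": 0.99, "fpr": 0.05}
detector_b = {"tpr": 0.80, "fpr": 0.001}
costs = {"cost_miss": 100.0, "cost_response": 1.0}

for p in (0.001, 0.2):
    cost_a = expected_cost_per_event(p, **detector_a, **costs)
    cost_b = expected_cost_per_event(p, **detector_b, **costs)
    better = "A" if cost_a < cost_b else "B"
    print(f"p={p}: A={cost_a:.4f} B={cost_b:.4f} -> detector {better} is cheaper")
```

Under these assumed numbers, detector B is cheaper when intrusions are rare and detector A is cheaper when they are frequent, so a ranking based on detection rates alone would be misleading.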

Agustín Orfila, Javier Carbó, Arturo Ribagorda

Mining Signals and Images

Spectral Discrimination of Southern Victorian Salt Tolerant Vegetation

The use of remotely sensed data to map aspects of the landscape is both efficient and cost effective. In geographically large and sparsely populated countries such as Australia these approaches are attracting interest as an aid in the identification of areas affected by environmental problems such as dryland salinity. This paper investigates the feasibility of using visible and near infra-red spectra to distinguish between salt tolerant and salt sensitive vegetation species in order to identify saline areas in Southern Victoria, Australia. A series of classification models were built using a variety of data mining techniques and these together with a discriminant analysis suggested that excellent generalisation results could be achieved on a laboratory collected spectra data base. The results form a basis for continuing work on the development of methods to distinguish between vegetation species based on remotely sensed rather than laboratory based measurements.

Chris Matthews, Rob Clark, Leigh Callinan
A Generative Graphical Model for Collaborative Filtering of Visual Content

In this paper, we propose a novel generative graphical model for collaborative filtering of visual content. The preferences of “like-minded” users are modelled in order to predict the relevance of visual documents represented by their visual features. We formulate the problem using a probabilistic latent variable model in which users’ preferences and items’ classes are combined into a unified framework in order to provide an accurate, generative model that overcomes the new item problem generally encountered in traditional collaborative filtering systems.

Sabri Boutemedjet, Djemel Ziou
A Variable Initialization Approach to the EM Algorithm for Better Estimation of the Parameters of Hidden Markov Model Based Acoustic Modeling of Speech Signals

The traditional method for estimating the parameters of Hidden Markov Model (HMM) based acoustic models of speech uses the Expectation-Maximization (EM) algorithm. The EM algorithm is sensitive to the initial values of the HMM parameters and is likely to terminate at a local maximum of the likelihood function, resulting in a non-optimal HMM estimate and lower recognition accuracy. In this paper, to obtain a better HMM estimate and higher recognition accuracy, several candidate HMMs are created by applying EM to multiple initial models. The candidate HMM with the highest likelihood value is chosen as the best HMM. Initial models are created by varying the maximum frame number in the segmentation step of the HMM initialization process. A binary search is applied while creating the initial models. The proposed method has been tested on the TIMIT database. Experimental results show that our approach obtains improved likelihood values and improved recognition accuracy.

Md. Shamsul Huda, Ranadhir Ghosh, John Yearwood
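The multi-start idea in the abstract above (run EM from several initial models and keep the candidate with the highest likelihood) can be sketched on a much simpler model. The example swaps the HMM for a two-component one-dimensional Gaussian mixture to stay short. It does not reproduce the paper's segmentation-based initialization or its binary search. The data, starting points, and function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, model):
    """Log-likelihood of the data under a two-component mixture (w, mu, var)."""
    w, mu, var = model
    return sum(
        math.log(sum(w[k] * normal_pdf(x, mu[k], var[k]) for k in range(2)) + 1e-300)
        for x in data
    )


def em(data, mu_init, iters=50):
    """Plain EM for a two-component 1-D Gaussian mixture from given initial means."""
    w, mu, var = [0.5, 0.5], list(mu_init), [1.0, 1.0]
    for _ in range(iters):
        # E-step: responsibilities (tiny constant guards against division by zero).
        resp = []
        for x in data:
            p = [w[k] * normal_pdf(x, mu[k], var[k]) + 1e-300 for k in range(2)]
            resp.append([p[0] / (p[0] + p[1]), p[1] / (p[0] + p[1])])
        # M-step: re-estimate weights, means, and variances (with a variance floor).
        for k in range(2):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk, 1e-3)
    return w, mu, var


# Synthetic data: two clusters around 0 and 5 (illustrative only).
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(5.0, 1.0) for _ in range(100)]

# Several initial models; some are deliberately poor.
starts = [(-10.0, -9.0), (0.0, 5.0), (2.0, 3.0), (20.0, 30.0)]
candidates = [em(data, s) for s in starts]
scores = [log_likelihood(data, c) for c in candidates]
best = candidates[scores.index(max(scores))]
print(scores, best[1])
```

Poorly placed starting points can leave EM at a worse solution after the same number of iterations, which is why the candidate with the highest likelihood is kept.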
Mining Dichromatic Colours from Video

It is commonly accepted that the most powerful approaches for increasing the efficiency of visual content delivery are personalisation and adaptation of visual content according to the user’s preferences and individual characteristics. In this work, we present results of a comparative study of colour contrast and characteristics of colour change between successive video frames for normal vision and the two most common types of colour blindness: protanopia and deuteranopia. The results were obtained by colour mining from three videos of different kinds, including their original and simulated colour-blind versions. Detailed data regarding the reduction of colour contrast, the decrease in the number of distinguishable colours, and the reduction of the inter-frame colour change rate in dichromats are provided.

Vassili A. Kovalev
Feature Analysis and Classification of Classical Musical Instruments: An Empirical Study

We present an empirical study on classical music instrument classification. A methodology with feature extraction and evaluation is proposed and assessed with a number of experiments, whose final stage is to detect instruments in solo passages. Feature selection yields similar but not identical rankings for individual tone classification and solo passage instrument recognition. Based on the feature selection results, excerpts from concerto and sonata files are processed so as to detect and distinguish four major instruments in solo passages: trumpet, flute, violin, and piano. Nineteen features selected from the Mel-frequency cepstral coefficients (MFCC) and the MPEG-7 audio descriptors achieve a recognition rate of around 94% with the best classifier, as assessed by cross validation.

Christian Simmermacher, Da Deng, Stephen Cranefield
Automated Classification of Images from Crystallisation Experiments

Protein crystallography can often provide the three-dimensional structures of macro-molecules necessary for functional studies and drug design. However, identifying the conditions that will provide diffraction quality crystals often requires numerous experiments. The use of robots has led to a dramatic increase in the number of crystallisation experiments performed in most laboratories and, in structural genomics centres, tens of thousands of experiments can be produced daily. The results of these experiments must be assessed repeatedly over time and inspection of the results by eye is becoming increasingly impractical. A number of systems are now available for automated imaging of crystallisation experiments and the primary aim of this research is the development of software to automate image analysis.

Julie Wilson

Aspects of Data Mining

An Efficient Algorithm for Frequent Itemset Mining on Data Streams

This paper proposes a new approach for efficiently mining frequent itemsets on data streams. The memory-efficient and accurate one-pass algorithm divides all the frequent itemsets into frequent equivalence classes and prunes all redundant itemsets except those that represent the GLB (Greatest Lower Bound) and LUB (Least Upper Bound) of each frequent equivalence class; the number of GLBs and LUBs is much smaller than the number of frequent itemsets. In order to maintain these equivalence classes, a compact data structure, the frequent itemset enumerate tree (FIET), is proposed. A detailed experimental evaluation on synthetic and real datasets shows that the algorithm is very accurate in practice and requires significantly less memory than Jin and Agrawal’s one-pass algorithm.

Xie Zhi-jun, Chen Hong, Cuiping Li
Discovering Key Sequences in Time Series Data for Pattern Classification

This paper addresses the issue of discovering key sequences from time series data for pattern classification. The aim is to find from a symbolic database all sequences that are both indicative and non-redundant. Such a sequence is called a key sequence in the paper. In order to solve this problem, we first establish criteria to evaluate sequences in terms of the measures of evaluation base and discriminating power. The main idea is to accept those sequences appearing frequently and possessing high co-occurrences with consequents as indicative ones. Then a sequence search algorithm is proposed to locate indicative sequences in the search space. Nodes encountered during the search procedure are handled appropriately to enable completeness of the search results while removing redundancy. We also show that the key sequences identified can later be utilized as strong evidence in probabilistic reasoning to determine to which class a new time series most probably belongs.

Peter Funk, Ning Xiong
Data Alignment Via Dynamic Time Warping as a Prerequisite for Batch-End Quality Prediction

In this work, a 4-phase dynamic time warping is implemented to align measurement profiles from an existing chemical batch reactor process, making all batch measurement profiles equal in length, while also matching the major events occurring during each batch run.

This data alignment is the first step towards constructing an inferential batch-end quality sensor, capable of predicting 3 quality variables before batch run completion using a multivariate statistical partial least squares model. This inferential sensor provides on-line quality predictions, allowing corrective actions to be performed when the quality of the polymerization product does not meet the specifications, saving valuable production time and reducing operation cost.

Geert Gins, Jairo Espinosa, Ilse Y. Smets, Wim Van Brempt, Jan F. M. Van Impe
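As a rough illustration of how dynamic time warping can bring batch profiles of different lengths onto a common time axis, the sketch below implements the textbook DTW recursion and warps one profile onto a reference. It is not the 4-phase procedure from the abstract. The profiles, the absolute-difference local cost, and the function names are assumptions made for the example.

```python
def dtw(a, b):
    """Classic DTW: return the total alignment cost and the warping path."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed: absolute difference)
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which samples were matched.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


def align_to_reference(reference, batch):
    """Warp `batch` so that it has the same length as `reference`."""
    _, path = dtw(reference, batch)
    buckets = [[] for _ in reference]
    for i, j in path:
        buckets[i].append(batch[j])
    # Average the batch samples matched to each reference sample.
    return [sum(v) / len(v) for v in buckets]


# Hypothetical profiles: the batch run lingers at the start and at the peak.
reference = [0, 1, 2, 3, 2, 1, 0]
batch = [0, 0, 1, 2, 3, 3, 2, 1, 0]
distance, _ = dtw(reference, batch)
aligned = align_to_reference(reference, batch)
print(distance, aligned)
```

After warping, every batch profile has the reference length, so the profiles can be stacked into a fixed-size matrix as input to a regression model such as partial least squares.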
A Distance Measure for Determining Similarity Between Criminal Investigations

The information explosion has led to problems and possibilities in many areas of society, including that of law enforcement. By comparing individual criminal investigations for similarity, we seize one of the opportunities of the information surplus to determine what crimes may or may not have been committed by the same group of individuals.

For this purpose we introduce a new distance measure that is specifically suited to the comparison between investigations that differ largely in terms of available intelligence. It employs an adaptation of the probability density function of the normal distribution to constitute this distance between all possible couples of investigations.

We embed this distance measure in a four-step paradigm that extracts entities from a collection of documents and uses it to transform a high-dimensional vector table into input for a police-operable tool. The eventual report is a two-dimensional representation of the distances between the various investigations and will assist the police force on the job to get a clearer picture of the current situation.

Tim K. Cocx, Walter A. Kosters
Establishing Fraud Detection Patterns Based on Signatures

All over the world we have been witnessing a significant increase in the use of telecommunication systems. People are faced day after day with strong marketing campaigns seeking their attention for new telecommunication products and services. Telecommunication companies struggle in a highly competitive business arena. Their efforts seem to have paid off, because customers are strongly adopting the new trends and systematically use (and abuse) communication services in their daily lives. Although fraud situations are rare, they are increasing, and they correspond to a large amount of money that telecommunication companies lose every year. In this work, we studied the problem of fraud detection in telecommunication systems, especially the cases of superimposed fraud, providing an anomaly detection technique supported by a signature schema. Our main goal is to detect deviant behaviors in a timely manner, giving fraud analysts a better basis for more accurate decisions when establishing potential fraud situations.

Pedro Ferreira, Ronnie Alves, Orlando Belo, Luís Cortesão
Intelligent Information Systems for Knowledge Work(ers)

Our society needs and expects more high-value services. Such “knowledge-intensive” services can only be delivered if the necessary organizational and technical requirements are fulfilled. In addition, the cost-benefit analysis from the service provider's point of view needs to be positive. Continuous improvement and goal-directed (partial) automation of such services is therefore of crucial importance. As a contribution to this we describe our current research vision for (partially) automated support of knowledge work(ers) based on intelligent information systems focusing on the use of experience. To implement such a vision we build on the integration of approaches from artificial intelligence and software engineering. A “deep” integration of case-based reasoning and experience factory is a first successful step in this direction [33, 28]. We envision the further integration of software product lines and multi-agent systems as the next one.

Klaus-Dieter Althoff, Björn Decker, Alexandre Hanft, Jens Mänz, Régis Newo, Markus Nick, Jörg Rech, Martin Schaaf
Nonparametric Approaches for e-Learning Data

In this paper we propose nonparametric approaches for e-learning data. In particular, we want to supply a measure of the relative importance of exercises, to estimate the knowledge acquired by each student, and finally to personalize the e-learning platform. The methodology employed is based on a comparison between nonparametric statistics for kernel density classification and parametric models such as generalized linear models and generalized additive models.

Paolo Baldini, Silvia Figini, Paolo Giudici
An Intelligent Manufacturing Process Diagnosis System Using Hybrid Data Mining

The high cost of maintaining a complex manufacturing process necessitates an efficient maintenance system. For the efficient maintenance of a manufacturing process, a precise diagnosis of the process should be performed and the appropriate maintenance action should be executed when the current condition of the manufacturing system is diagnosed as abnormal. This paper suggests an intelligent manufacturing process diagnosis system using hybrid data mining. In this system, the cause-and-effect rules for the manufacturing process condition are inferred by hybrid decision tree/evolution strategies learning, and the most effective maintenance action is recommended by a decision network and AHP (analytical hierarchy process). To verify the hybrid learning proposed in this paper, we compared its accuracy with that of the general decision tree learning algorithm (C4.5) and of hybrid decision tree/genetic algorithm learning, using datasets from the well-known dataset repository at UCI (University of California at Irvine).

Joon Hur, Hongchul Lee, Jun-Geol Baek
Computer Network Monitoring and Abnormal Event Detection Using Graph Matching and Multidimensional Scaling

Computer network monitoring and abnormal event detection have become important areas of research. In previous work, it has been proposed to represent a computer network as a time series of graphs and to compute the difference, or distance, of consecutive graphs in such a time series. Whenever the distance of two graphs exceeds a given threshold, an abnormal event is reported. In the present paper we go one step further and compute graph distances between all pairs of graphs in a time series. Given these distances, a multidimensional scaling procedure is applied that maps each graph onto a point in the two-dimensional real plane, such that the distances between the graphs are reflected, as closely as possible, in the distances between the points in the two-dimensional plane. In this way the behaviour of a network can be visualised and abnormal events as well as states or clusters of states of the network can be graphically represented. We demonstrate the feasibility of the proposed method by means of synthetically generated graph sequences and data from real computer networks.

H. Bunke, P. Dickinson, A. Humm, Ch. Irniger, M. Kraetzl
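The two uses of graph distance described in the abstract above, thresholding the distance between consecutive graphs and computing distances between all pairs, can be sketched as follows. The distance used here (size of the symmetric difference of the edge sets) is only a simple stand-in, since the abstract does not name the measure. The graphs, the threshold, and the function names are hypothetical, and the multidimensional scaling step is not implemented.

```python
def graph_distance(g1, g2):
    """Stand-in distance: number of edges present in exactly one of the graphs."""
    return len(g1 ^ g2)


def abnormal_events(series, threshold):
    """Indices t at which the distance between graph t-1 and graph t exceeds the threshold."""
    return [
        t for t in range(1, len(series))
        if graph_distance(series[t - 1], series[t]) > threshold
    ]


def distance_matrix(series):
    """Pairwise distances between all graphs in the series."""
    return [[graph_distance(g, h) for h in series] for g in series]


# Hypothetical time series of network graphs, each given as a set of undirected edges.
series = [
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("a", "b"), ("b", "c"), ("c", "d")}),
    frozenset({("a", "d"), ("d", "e"), ("e", "f"), ("b", "c")}),
    frozenset({("a", "d"), ("d", "e"), ("e", "f"), ("b", "c"), ("a", "b")}),
]

events = abnormal_events(series, threshold=3)
matrix = distance_matrix(series)
print(events)
for row in matrix:
    print(row)
```

The pairwise matrix is the kind of input a multidimensional scaling routine would take to place each graph as a point in the plane.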
Backmatter
Metadata
Title
Advances in Data Mining. Applications in Medicine, Web Mining, Marketing, Image and Signal Mining
Edited by
Petra Perner
Copyright Year
2006
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-36037-7
Print ISBN
978-3-540-36036-0
DOI
https://doi.org/10.1007/11790853
