
2007 | Book

Intelligent Data Engineering and Automated Learning - IDEAL 2007

8th International Conference, Birmingham, UK, December 16-19, 2007. Proceedings

Edited by: Hujun Yin, Peter Tino, Emilio Corchado, Will Byrne, Xin Yao

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Learning and Information Processing

Support Function Machines

This paper proposes a novel model, the support function machine (SFM), for time series prediction. Two machine learning models, support vector machines (SVM) and procedural neural networks (PNN), are compared on time series problems, and they inspire the creation of the SFM. The SFM aims to extend support vectors to the spatiotemporal domain, in which each component of a vector is a function of time. In this view, the SFM transforms a vector function of time into a static vector. Similar to the SVM training procedure, the corresponding learning algorithm for the SFM is presented, which is equivalent to solving a quadratic programming problem. Moreover, two practical examples are investigated, and the experimental results illustrate the feasibility of the SFM in modeling time series predictions.

Jiuzhen Liang
Different Bayesian Network Models in the Classification of Remote Sensing Images

In this paper we study the application of Bayesian network models to the classification of multispectral and hyperspectral remote sensing images. Different Bayesian network models, such as Naive Bayes (NB), Tree Augmented Naive Bayes (TAN) and General Bayesian Networks (GBN), are applied to the classification of hyperspectral data. In addition, several Bayesian multi-net models: TAN multi-net, GBN multi-net and the model developed by Gurwicz and Lerner, the TAN-Based Bayesian Class-Matched multi-net (tBCM²) (see [1]), are applied to the classification of multispectral data. A comparison of the results obtained with the different classifiers is presented.

Cristina Solares, Ana Maria Sanz
Group Decision Making with Triangular Fuzzy Linguistic Variables

In group decision making with linguistic information, the decision makers (DMs) usually provide their assessment information by means of linguistic variables. In some situations, however, the DMs may provide fuzzy linguistic information because of time pressure, lack of knowledge, and their limited attention and information processing capabilities. In this paper, we introduce the concepts of the triangular fuzzy linguistic variable and its membership function, and introduce some operational laws of triangular fuzzy linguistic variables. We propose a formula for comparing triangular fuzzy linguistic variables, and develop some operators for aggregating them, such as the fuzzy linguistic averaging (FLA) operator, fuzzy linguistic weighted averaging (FLWA) operator, fuzzy linguistic ordered weighted averaging (FLOWA) operator, and induced FLOWA (IFLOWA) operator. Based on the FLWA and IFLOWA operators, we develop a practical method for group decision making with triangular fuzzy linguistic variables, and finally, an illustrative example is given to verify the feasibility and effectiveness of the developed method.

Zeshui Xu
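To make the flavor of such aggregation concrete, here is a minimal sketch of a weighted average of triangular fuzzy numbers, one common way an FLWA-style operator is defined. The operational laws assumed (component-wise addition and scalar multiplication of the (l, m, u) triples) are the standard ones for triangular fuzzy numbers and are my assumption, not necessarily the exact laws defined in the paper.

```python
# Sketch: weighted averaging of triangular fuzzy linguistic values.
# A triangular fuzzy number is a triple (l, m, u) with l <= m <= u.
# Assumes standard operational laws: k*(l,m,u) = (k*l, k*m, k*u) and
# component-wise addition; the paper's FLWA may differ in detail.

def flwa(values, weights):
    """Fuzzy linguistic weighted average of triangular fuzzy numbers."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    l = sum(w * v[0] for w, v in zip(weights, values))
    m = sum(w * v[1] for w, v in zip(weights, values))
    u = sum(w * v[2] for w, v in zip(weights, values))
    return (l, m, u)

# Three decision makers rate one alternative on a 0-8 linguistic scale.
ratings = [(3, 4, 5), (5, 6, 7), (4, 5, 6)]
weights = [0.5, 0.3, 0.2]
print(flwa(ratings, weights))  # -> (3.8, 4.8, 5.8)
```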
Sparse Kernel Modelling: A Unified Approach

A unified approach is proposed for sparse kernel data modelling that includes regression and classification as well as probability density function estimation. The orthogonal-least-squares forward selection method based on the leave-one-out test criteria is presented within this unified data-modelling framework to construct sparse kernel models that generalise well. Examples from regression, classification and density estimation applications are used to illustrate the effectiveness of this generic sparse kernel data modelling approach.

S. Chen, X. Hong, C. J. Harris
Advanced Forecasting and Classification Technique for Condition Monitoring of Rotating Machinery

Prediction and classification of particular faults in rotating machinery, based on a given set of measurements, can significantly reduce the overall costs of maintenance and repair. The vibration signal is usually sampled at a very high frequency due to its nature, so it is quite difficult to perform considerably long forecasting with methods suitable for, e.g., financial time series (where the sampling frequency is lower). In this paper a new forecasting and classification technique for particular vibration signal characteristics is proposed. The suggested approach allows creating a part of a control system responsible for early fault detection, which could be used for preventive maintenance of industrial equipment. The presented approach can be extended to high-frequency financial data for the prediction of “faults” on the market.

Ilya Mokhov, Alexey Minin
Out of Bootstrap Estimation of Generalization Error Curves in Bagging Ensembles

The dependence of the classification error on the size of a bagging ensemble can be modeled within the framework of Monte Carlo theory for ensemble learning. These error curves are parametrized in terms of the probability that a given instance is misclassified by one of the predictors in the ensemble. Out of bootstrap estimates of these probabilities can be used to model generalization error curves using only information from the training data. Since these estimates are obtained using a finite number of hypotheses, they exhibit fluctuations. This implies that the modeled curves are biased and tend to overestimate the true generalization error. This bias becomes negligible as the number of hypotheses used in the estimator becomes sufficiently large. Experiments are carried out to analyze the consistency of the proposed estimator.

Daniel Hernández-Lobato, Gonzalo Martínez-Muñoz, Alberto Suárez
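The Monte Carlo model behind such error curves can be illustrated in a few lines: a majority-vote ensemble of size T misclassifies an instance when more than half of its members do, so the expected error is a binomial tail in the per-instance misclassification probability. The sketch below uses made-up probabilities; in the paper these are out-of-bootstrap estimates, and their finite-sample fluctuation is exactly the source of the bias the abstract discusses.

```python
# Expected majority-vote error as a function of ensemble size, under
# the binomial (Monte Carlo) model of bagging. p holds hypothetical
# per-instance misclassification probabilities (OOB estimates in the
# paper); only odd ensemble sizes are used so votes cannot tie.
import numpy as np
from scipy.stats import binom

def ensemble_error_curve(p, sizes):
    """Expected majority-vote error for each (odd) ensemble size."""
    p = np.asarray(p)
    # P(more than T/2 of T members err) = binomial survival function.
    return [float(np.mean(binom.sf(T // 2, T, p))) for T in sizes]

p_hat = [0.05, 0.10, 0.30, 0.45, 0.60]   # hypothetical OOB estimates
sizes = [1, 11, 51, 101]
print(ensemble_error_curve(p_hat, sizes))
```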
An Edit Distance Approach to Shallow Semantic Labeling

This paper proposes a model of semantic labeling based on the edit distance. The dynamic programming approach stresses a non-exact string matching technique that takes full advantage of the underlying grammatical structure of 65,000 parse trees in a Treebank. Both part-of-speech and lexical similarity serve to identify the possible semantic labels, without getting mired in a purely linguistic analysis. The model described has been implemented. We also analyze the tradeoffs between part-of-speech and lexical similarity in the semantic labeling. Experimental results for recognizing various labels in 10,000 sentences are used to demonstrate its significance.

Samuel W. K. Chan
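For reference, the dynamic-programming core that edit-distance approaches build on is the classic Levenshtein algorithm, sketched below. This is the textbook string version only, not the paper's extension to parse trees with part-of-speech and lexical similarity.

```python
# Classic edit (Levenshtein) distance via dynamic programming:
# d[i][j] is the minimum number of insertions, deletions and
# substitutions turning a[:i] into b[:j].
def edit_distance(a, b):
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # i deletions
    for j in range(n + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # delete
                          d[i][j - 1] + 1,        # insert
                          d[i - 1][j - 1] + cost) # match / substitute
    return d[m][n]

print(edit_distance("kitten", "sitting"))  # -> 3
```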
A Comparison of One-Class Classifiers for Novelty Detection in Forensic Case Data

This paper investigates the application of novelty detection techniques to the problem of drug profiling in forensic science. Numerous one-class classifiers are tried out, from the simple k-means to the more elaborate Support Vector Data Description algorithm. The target application is the classification of illicit drug samples as part of an existing trafficking network or as a new cluster. A unique chemical database of heroin and cocaine seizures is available and allows assessing the methods. Evaluation is done using the area under the ROC curve of the classifiers. Gaussian mixture models and the SVDD method are trained both with and without outlier examples, and it is found that providing outliers during training improves classification performance in some cases. Finally, combination schemes of classifiers are also tried out. Results highlight methods that may guide the profiling methodology used in forensic analysis.

Frédéric Ratle, Mikhail Kanevski, Anne-Laure Terrettaz-Zufferey, Pierre Esseiva, Olivier Ribaux
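A minimal sketch of the simplest detector the abstract mentions: fit k-means on "target" data only, score new samples by distance to the nearest centroid, and evaluate with the area under the ROC curve. The data here is synthetic Gaussian, standing in for the forensic drug profiles used in the paper.

```python
# One-class novelty detection with a k-means distance score, evaluated
# by ROC AUC. Synthetic data; cluster count and thresholds are
# illustrative choices, not the paper's settings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
target = rng.normal(0.0, 1.0, size=(300, 5))   # known network samples
novel = rng.normal(3.0, 1.0, size=(50, 5))     # new-cluster samples

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(target)

def novelty_score(X):
    # Distance to the closest centroid: large means "novel".
    return km.transform(X).min(axis=1)

X_test = np.vstack([target[:50], novel])
y_true = np.r_[np.zeros(50), np.ones(50)]      # 1 = novelty
print("AUC:", roc_auc_score(y_true, novelty_score(X_test)))
```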
Variational GTM

Generative Topographic Mapping (GTM) is a non-linear latent variable model that provides simultaneous visualization and clustering of high-dimensional data. It was originally formulated as a constrained mixture of distributions, for which the adaptive parameters were determined by Maximum Likelihood (ML), using the Expectation-Maximization (EM) algorithm. In this paper, we define an alternative variational formulation of GTM that provides a full Bayesian treatment to a Gaussian Process (GP)-based variation of GTM. The performance of the proposed Variational GTM is assessed in several experiments with artificial datasets. These experiments highlight the capability of Variational GTM to avoid data overfitting through active regularization.

Iván Olier, Alfredo Vellido
Skill Combination for Reinforcement Learning

Recently, researchers have introduced methods to develop reusable knowledge in reinforcement learning (RL). In this paper, we define simple principles for combining skills in reinforcement learning. We present a skill combination method that uses trained skills to solve different tasks in an RL domain. Through this combination method, composite skills can be used to express tasks at a high level, and they can also be re-used with different tasks within the same problem domains. The method generates an abstract task representation based upon normal reinforcement learning which decreases the information coupling of states, thus improving an agent’s learning. The experimental results demonstrate that the skill combination method can effectively reduce the learning space, and so accelerate the learning speed of the RL agent. We also show in the examples that different tasks can be solved by combining simple reusable skills.

Zhihui Luo, David Bell, Barry McCollum
A New Recurring Multistage Evolutionary Algorithm for Solving Problems Efficiently

This paper introduces a new approach, called the recurring multistage evolutionary algorithm (RMEA), to balance the explorative and exploitative features of the conventional evolutionary algorithm. Unlike most previous work, the basis of RMEA is the repeated and alternated execution of two different stages, i.e. exploration and exploitation, during evolution. RMEA uses dissimilar information across the population and similar information within a population neighbourhood in its mutation operation to achieve global exploration and local exploitation, respectively. It is applied to two unimodal, two multimodal, one rotated multimodal and one composition function. The experimental results indicate the effectiveness of using different objective-oriented stages and their repeated alternation during evolution. The comparison of RMEA with other algorithms shows its superiority on complex problems.

Md. Monirul Islam, Mohammad Shafiul Alam, Kazuyuki Murase
The Effect of Missing Wind Speed Data on Wind Power Estimation

In this paper, the effect of possible missing data on wind power estimation is examined. One-month wind speed data obtained from the wind and solar observation station constructed at the Iki Eylul Campus of Anadolu University are used. A close correlation is found between consecutive wind speed data collected at 15-second intervals. A very-short-term wind speed forecasting model is built using a two-input, one-output Adaptive Neuro-Fuzzy Inference System (ANFIS). First, some randomly selected data are discarded from the whole data set. Second, 10%, 20% and 30% of all data, randomly selected from a predefined interval (3-6 m/s), are discarded and the discarded data are forecasted. Finally, the data are fitted to a Weibull distribution, the Weibull distribution parameters are obtained, and wind power is estimated for all cases. The results show that missing data have a significant effect on wind power estimation and must be taken into account in wind studies. Furthermore, it is concluded that ANFIS is a convenient tool for this kind of prediction.

Fatih Onur Hocaoğlu, Mehmet Kurban
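The final step of the pipeline, fitting a Weibull distribution and estimating mean wind power density, can be sketched as below. The synthetic wind speeds, the air density value and the P/A = ½·ρ·E[v³] power-density formula are standard assumptions of mine for illustration, not values from the paper.

```python
# Fit a two-parameter Weibull to wind speeds and estimate mean wind
# power density, P/A = 0.5 * rho * E[v^3]. For Weibull(k, c),
# E[v^3] = c**3 * Gamma(1 + 3/k).
from math import gamma

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
v = rng.weibull(2.0, 10_000) * 6.0        # synthetic 15-s wind speeds, m/s

shape, _, scale = stats.weibull_min.fit(v, floc=0)   # location fixed at 0

rho = 1.225                                # air density, kg/m^3 (assumed)
mean_power_density = 0.5 * rho * scale**3 * gamma(1 + 3 / shape)
print(f"k={shape:.2f}, c={scale:.2f} m/s, P/A={mean_power_density:.1f} W/m^2")
```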
Exploration of a Text Collection and Identification of Topics by Clustering

An application of cluster analysis to identify topics in a collection of poster abstracts from the Society for Neuroscience (SfN) Annual Meeting in 2006 is presented. The topics were identified by selecting, from the abstracts belonging to each cluster, the terms with the highest scores under different ranking schemes. The ranking scheme based on log-entropy showed better performance in this task than more classical TF-IDF schemes. An evaluation of the extracted topics was performed by comparison with previously defined thematic categories, for which titles are available, after assigning each cluster to one dominant category. The results show that repeated bisecting k-means performs better than standard k-means.

Antoine Naud, Shiro Usui
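A sketch of log-entropy term weighting, the ranking scheme the abstract found best, is given below. The formula used is the standard textbook definition (weight = log(1 + tf) times one plus the normalized entropy of the term's distribution over documents); the paper's exact variant may differ.

```python
# Log-entropy term weighting for a (documents x terms) count matrix:
#   weight_td = log(1 + tf_td) * (1 + sum_d p_td * log(p_td) / log(N)),
# where p_td = tf_td / gf_t and gf_t is the global term frequency.
import numpy as np

def log_entropy(tf):
    tf = np.asarray(tf, dtype=float)
    n_docs = tf.shape[0]
    gf = tf.sum(axis=0)                            # global term frequency
    p = np.where(gf > 0, tf / np.where(gf == 0, 1, gf), 0.0)
    plogp = np.where(p > 0, p * np.log(np.where(p > 0, p, 1.0)), 0.0)
    entropy = 1.0 + plogp.sum(axis=0) / np.log(n_docs)
    return np.log1p(tf) * entropy

counts = [[2, 0, 1], [0, 3, 1], [1, 1, 1]]         # 3 docs x 3 terms
print(log_entropy(counts).round(3))
```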
Asynchronous BCI Control of a Robot Simulator with Supervised Online Training

Due to the non-stationarity of EEG signals, online training and adaptation are essential to EEG-based brain-computer interface (BCI) systems. Asynchronous BCI offers more natural human-machine interaction, but it is a great challenge to train and adapt an asynchronous BCI online because the user’s control intention and timing are usually unknown. This paper proposes a novel motor imagery based asynchronous BCI for controlling a simulated robot in a specifically designed environment which is able to provide the user’s control intention and timing during online experiments, so that online training and adaptation of motor imagery based asynchronous BCI can be effectively investigated. This paper also proposes an online training method, attempting to automate the process of finding the optimal parameter values of the BCI system to deal with non-stationary EEG signals. Experimental results have shown that the proposed method for online training of asynchronous BCI significantly improves the performance.

Chun Sing Louis Tsui, John Q. Gan
Fuzzy Ridge Regression with Non Symmetric Membership Functions and Quadratic Models

Fuzzy regression models have traditionally been treated as a linear programming problem. The use of quadratic programming makes it possible to overcome the limitations of linear programming and to obtain highly adaptable regression approaches. However, we verify the existence of multicollinearity in fuzzy regression, and we propose a model based on ridge regression to address this problem.

S. Donoso, N. Marín, M. A. Vila
A Subjective and Objective Integrated Method for MAGDM Problems with Multiple Types of Exact Preference Formats

Group decision making with preference information on alternatives has become a very active research field over the last decade. In particular, the investigation of group decision making problems based on different preference formats has attracted great interest from researchers recently, and some approaches have been developed for dealing with these problems. However, the existing approaches are only suitable for handling subjective preference information. In this paper, we investigate multiple attribute group decision making (MAGDM) problems in which the attribute values (objective information) are given as non-negative real numbers, the information about attribute weights is to be determined, and the decision makers have subjective preferences on alternatives. The provided subjective preference information can be represented in three well-known exact preference formats: 1) utility values; 2) fuzzy preference relations; and 3) multiplicative preference relations. We first set up three constrained optimization models integrating the given objective information with each of the three preference formats respectively, and then, based on these three models, we establish an integrated constrained optimization model to derive the attribute weights. The obtained attribute weights contain both the subjective preference information given by all the decision makers and the objective information. Thus, a straightforward and practical method is provided for MAGDM with multiple types of exact preference formats.

Zeshui Xu, Jian Chen
Energy Saving by Means of Fuzzy Systems

It is well known that global sustainability must begin with human actions. A reduction of the energy consumed by heating systems is one such possible action. The higher a society's prosperity, the higher the required comfort of houses, and the higher the amount of energy needed. This is especially important in Spain, where the construction rate is almost half of that in Europe. Saving energy is urgent, which means that energy losses must be reduced.

In this paper, a multi-agent system solution for the reduction of energy consumption in the heating systems of houses is presented. A control central unit (CCU) responsible for minimising the energy consumption interacts with the heaters. The CCU includes a Fuzzy Model (FM) and a Fuzzy Controller (FC) and makes use of the concept of energy balance to distribute the energy between the heaters.

Results show that the proposed system is a very promising solution for energy saving and comfort tracking in houses. This solution is a preliminary study to be included in a heating system product of a local company.

José R. Villar, Enrique de la Cal, Javier Sedano
A Comparative Study of Local Classifiers Based on Clustering Techniques and One-Layer Neural Networks

In this article different approximations of a local classifier algorithm are described and compared. The classification algorithm is composed of two steps. The first consists of clustering the input data by means of three different techniques, specifically a k-means algorithm, a Growing Neural Gas (GNG) and a Self-Organizing Map (SOM). The groups of data obtained are the input to the second step of the classifier, which is composed of a set of one-layer neural networks whose aim is to fit a local model for each cluster. The three approaches used in the first step are compared with respect to several parameters, such as their dependence on the initial state, the number of nodes employed and their performance. To carry out the comparative study, two artificial and three real benchmark data sets were employed.

Yuridia Gago-Pallares, Oscar Fontenla-Romero, Amparo Alonso-Betanzos
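The two-step structure is easy to sketch: cluster the inputs, fit one simple local model per cluster, and route each test point to the model of its nearest cluster. Below, k-means (one of the three clusterers compared) is paired with logistic regression standing in for the one-layer networks; the data set and all hyperparameters are illustrative assumptions.

```python
# Local classification: k-means clustering followed by one simple
# (linear) model per cluster; test points use the model of their
# nearest cluster.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.linear_model import LogisticRegression

X, y = make_moons(n_samples=400, noise=0.2, random_state=0)
km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

local_models = {}
for c in range(km.n_clusters):
    mask = km.labels_ == c
    if len(np.unique(y[mask])) == 1:      # pure cluster: constant label
        local_models[c] = int(y[mask][0])
    else:
        local_models[c] = LogisticRegression().fit(X[mask], y[mask])

def predict(X_new):
    labels = []
    for c, x in zip(km.predict(X_new), X_new):
        m = local_models[c]
        labels.append(m if isinstance(m, int) else int(m.predict(x[None, :])[0]))
    return np.array(labels)

print("training accuracy:", (predict(X) == y).mean())
```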
Filter Methods for Feature Selection – A Comparative Study

Adequate selection of features may improve the accuracy and efficiency of classifier methods. There are two main approaches to feature selection: wrapper methods, in which the features are selected using the classifier, and filter methods, in which the selection of features is independent of the classifier used. Although the wrapper approach may obtain better performance, it requires greater computational resources. For this reason, a new paradigm, the hybrid approach, which combines both filter and wrapper methods, has lately emerged. One of its problems is selecting the filter method that gives the best relevance index for each case, which is not an easy question to solve. Different approaches to relevance evaluation lead to a large number of indices for ranking and selection. In this paper, several filter methods are applied over artificial data sets with different numbers of relevant features, levels of noise in the output, interaction between features and increasing numbers of samples. The results obtained for the four filters studied (ReliefF, Correlation-based Feature Selection, Fast Correlation-Based Filter and INTERACT) are compared and discussed. The final aim of this study is to select a filter for constructing a hybrid method for feature selection.

Noelia Sánchez-Maroño, Amparo Alonso-Betanzos, María Tombilla-Sanromán
FPGA-Based Architecture for Computing Testors

Irreducible testors (also named typical testors) are a useful tool for feature selection in supervised classification problems with mixed incomplete data. However, the complexity of computing all irreducible testors of a training matrix grows exponentially with the number of columns in the matrix. For this reason, different approaches such as heuristic algorithms and parallel and distributed processing have been developed. In this paper, we present the design and implementation of a custom architecture for the BT algorithm, which computes testors from a given input matrix. The architectural design is based on a parallel approach that is suitable for highly populated input matrices. The architecture has been designed to process all matrix rows in parallel, with automatic candidate generation, and can be configured for any matrix size. The architecture is able to evaluate whether a feature subset is a testor of the matrix and to calculate the next candidate to be evaluated in a single clock cycle. The architecture has been implemented on a Field Programmable Gate Array (FPGA) device. Results show that it provides significant performance improvements over a previously reported hardware implementation. Implementation results are presented and discussed.

Alejandro Rojas, René Cumplido, J. Ariel Carrasco-Ochoa, Claudia Feregrino, J. Francisco Martínez-Trinidad
Minimal BSDT Abstract Selectional Machines and Their Selectional and Computational Performance

Turing machine (TM) theory constitutes the theoretical basis for contemporary digital (von Neumann) computers. But it is questionable whether it could be an adequate theory of brain functions (computations) because, as is widely accepted, the brain is a selectional device with blurred boundaries between the areas responsible for data processing, control, and behavior. In this paper, by analogy with TMs, the optimal decoding algorithm of the recent binary signal detection theory (BSDT) is presented in the form of a minimal one-dimensional abstract selectional machine (ASM). The ASM’s hypercomplexity is explicitly hypothesized, and its optimal selectional and super-Turing computational performance are discussed. BSDT ASMs can contribute to a mathematically strict and biologically plausible theory of the functional properties of the brain, mind/brain relations, and super-Turing machines partially mimicking some cognitive abilities in animals and humans.

Petro Gopych
Active Learning for Regression Based on Query by Committee

We investigate a committee-based approach for active learning of real-valued functions. This is a variance-only strategy for selection of informative training data. As such it is shown to suffer when the model class is misspecified since the learner’s bias is high. Conversely, the strategy outperforms passive selection when the model class is very expressive since active minimization of the variance avoids overfitting.

Robert Burbidge, Jem J. Rowland, Ross D. King
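The variance-only strategy can be sketched in a few lines: train a bootstrap committee on the labeled data, score each unlabeled candidate by the variance of the committee's predictions, and query the point with the highest disagreement. Committee members here are small regression trees on a toy function; the paper's model classes and data are different.

```python
# Query-by-committee for regression: bootstrap committee, select the
# pool point where prediction variance is largest.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)                      # unknown target function
X_lab = rng.uniform(-2, 2, (10, 1))
y_lab = f(X_lab).ravel()
X_pool = rng.uniform(-2, 2, (200, 1))            # unlabeled candidates

def committee_variance(X_lab, y_lab, X_pool, members=10):
    preds = []
    for _ in range(members):
        idx = rng.integers(0, len(X_lab), len(X_lab))   # bootstrap sample
        m = DecisionTreeRegressor(max_depth=4).fit(X_lab[idx], y_lab[idx])
        preds.append(m.predict(X_pool))
    return np.var(preds, axis=0)

for _ in range(5):                               # five active queries
    var = committee_variance(X_lab, y_lab, X_pool)
    i = int(np.argmax(var))                      # most informative point
    X_lab = np.vstack([X_lab, X_pool[i:i + 1]])
    y_lab = np.append(y_lab, f(X_pool[i]))
    print("queried x =", round(float(X_pool[i, 0]), 3))
```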
Influence of Wavelet Frequency and Orientation in an SVM-Based Parallel Gabor PCA Face Verification System

We present a face verification system using Parallel Gabor Principal Component Analysis (PGPCA) and fusion of Support Vector Machines (SVM) scores. The algorithm has been tested on two databases: XM2VTS (frontal images with frontal or lateral illumination) and FRAV2D (frontal images with diffuse or zenithal illumination, varying poses and occlusions). Our method outperforms others when fewer PCA coefficients are kept. It also has the lowest equal error rate (EER) in experiments using frontal images with occlusions. We have also studied the influence of wavelet frequency and orientation on the EER in a one-Gabor PCA. The high frequency wavelets are able to extract more discriminant information compared to the low frequency wavelets. Moreover, as a general rule, oblique wavelets produce a lower EER compared to horizontal or vertical wavelets. Results also suggest that the optimal wavelet orientation coincides with the illumination gradient.

Ángel Serrano, Isaac Martín de Diego, Cristina Conde, Enrique Cabello, Linlin Shen, Li Bai
Wrapping the Naive Bayes Classifier to Relax the Effect of Dependences

The Naive Bayes Classifier is based on the (unrealistic) assumption of independence among the values of the attributes given the class value. Consequently, its effectiveness may decrease in the presence of interdependent attributes. In spite of this, in recent years the Naive Bayes classifier has earned a privileged position due to several reasons [1]. We present DGW (Dependency Guided Wrapper), a wrapper that uses information about dependences to transform the data representation to improve Naive Bayes classification. This paper presents experiments comparing the performance and execution time of 12 DGW variations against 12 previous approaches, such as constructive induction of Cartesian product attributes and wrappers that search for optimal subsets of attributes.

Experimental results show that DGW generates a new data representation that allows the Naive Bayes to obtain better accuracy more times than any other wrapper tested. DGW variations also obtain the best possible accuracy more often than the state of the art wrappers while often spending less time in the attribute subset search process.

Jose Carlos Cortizo, Ignacio Giraldez, Mari Cruz Gaya
Preference Learning from Interval Pairwise Data. A Distance-Based Approach

Preference learning has recently received a lot of attention within the machine learning field; in particular, learning by pairwise comparisons is a well-established technique in this field. We focus on the problem of learning the overall preference weights of a set of alternatives from the (possibly conflicting) uncertain and imprecise information given by a group of experts in the form of interval pairwise comparison matrices. Because of the complexity of real-world problems, incomplete information or knowledge, and the differing patterns of the experts, interval data provide a flexible framework to account for uncertainty and imprecision. In this context, we propose a two-stage method in a distance-based framework, where the impact of the data certainty degree is captured. First, the group preference matrix that best reflects the imprecise information given by the experts is obtained. Then, the crisp preference weights and the associated ranking of the alternatives are derived from the obtained group matrix. The proposed methodology is made operational by using an Interval Goal Programming formulation.

Esther Dopazo, Mauricio Ruiz-Tagle, Juan Robles
Psychometric Functions Within the Framework of Binary Signal Detection Theory: Coding the Face Identity

One of the standard methods in vision research is measuring psychometric functions (PFs), which are then analyzed assuming the validity of traditional signal detection theory (SDT). This research paradigm contains an essential inherent contradiction: in contrast to most empirical PFs, the ones predicted by SDT do not satisfy the Neyman-Pearson objective. The problem may successfully be overcome within the framework of the recent binary signal detection theory (BSDT), which provides PFs for which the required objective is always achieved. Here, the original BSDT theory for vision is for the first time applied to the quantitative description of specific empirical PFs measured in experiments where the coding of facial identity has been studied. By fitting the data, some parameters of the BSDT face recognition algorithm were extracted, and it was demonstrated that the BSDT supports the popular prototype face identification model. Results can be used for developing new high-performance computational methods for face recognition.

Petro Gopych, Anna Kolot
Load Forecasting with Support Vector Machines and Semi-parametric Method

A new approach to short-term electrical load forecasting is investigated in this paper. As electrical load data are highly non-linear in nature, in the proposed approach we first separate out the linear and the non-linear parts, and then forecast using the non-linear part only. A semi-parametric spectral estimation method is used to decompose the load data signal into a harmonic linear signal model and a non-linear trend. A support vector machine is then used to predict the non-linear trend. The final predicted signal is found by adding the SVM-predicted trend to the linear signal part. The performance of the proposed method appears to be more robust than using only the raw load data, because the method focuses on the non-linear part rather than on a diluted mixture of the linear and non-linear parts, as is usually done.

J. A. Jordaan, A. Ukil
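The decomposition idea can be sketched as follows: fit a small harmonic model by linear least squares, train an SVR on the residual, and add the two predictions back together. The synthetic "load", the single daily frequency and the lag structure are my assumptions; the paper extracts the harmonic part with a semi-parametric spectral estimation method instead.

```python
# Harmonic linear part by least squares + SVR on the non-linear residual.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
t = np.arange(500, dtype=float)
load = (10 + 3 * np.sin(2 * np.pi * t / 24)
        + 0.5 * np.tanh(np.sin(t / 40))          # mild non-linear trend
        + 0.1 * rng.normal(size=t.size))

# Linear (harmonic) part: least-squares fit of a daily sinusoid.
A = np.column_stack([np.ones_like(t),
                     np.sin(2 * np.pi * t / 24),
                     np.cos(2 * np.pi * t / 24)])
coef, *_ = np.linalg.lstsq(A, load, rcond=None)
linear_part = A @ coef

# Non-linear part: SVR on lagged residuals.
resid = load - linear_part
lags = 3
Xr = np.column_stack([resid[i:len(resid) - lags + i] for i in range(lags)])
yr = resid[lags:]
svr = SVR(C=10.0, epsilon=0.01).fit(Xr, yr)

pred = linear_part[lags:] + svr.predict(Xr)      # combined forecast
print("in-sample RMSE:", np.sqrt(np.mean((pred - load[lags:]) ** 2)).round(4))
```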
Reproducing Kernel Hilbert Space Methods to Reduce Pulse Compression Sidelobes

Since the development of pulse compression in the mid-1950s, the concept has become an indispensable feature of modern radar systems. A matched filter is used on reception to maximize the signal-to-noise ratio of the received signal. The transmitted waveforms are chosen to have an autocorrelation function with a narrow peak at zero time shift and values at all other times, referred to as sidelobes, that are as low as possible. A new approach to radar pulse compression is introduced, namely the Reproducing Kernel Hilbert Space (RKHS) method. This method reduces sidelobe levels significantly. The paper compares a second-degree polynomial kernel RKHS method to least squares and L2P-norm mismatched filters, and concludes with a presentation of representative testing results.

J. A. Jordaan, M. A. van Wyk, B. J. van Wyk
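A minimal illustration of the sidelobe problem being attacked: the autocorrelation of the Barker-13 code has a mainlobe of 13 and peak sidelobes of magnitude 1, about -22.3 dB. This is only the matched-filter baseline that the RKHS and mismatched-filter methods aim to improve on; it is not the paper's method.

```python
# Matched-filter (autocorrelation) output for the Barker-13 code and
# its peak sidelobe level relative to the mainlobe.
import numpy as np

barker13 = np.array([1, 1, 1, 1, 1, -1, -1, 1, 1, -1, 1, -1, 1], float)
acf = np.correlate(barker13, barker13, mode="full")   # matched filter output

mainlobe = acf.max()
sidelobe = np.abs(np.delete(acf, np.argmax(acf))).max()
print("peak sidelobe level: %.1f dB" % (20 * np.log10(sidelobe / mainlobe)))
```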
Support Kernel Machine-Based Active Learning to Find Labels and a Proper Kernel Simultaneously

SVM-based active learning has been successfully applied when a large number of unlabeled samples are available but obtaining their labels is costly. However, the kernel used in the SVM must be fixed properly before the active learning process. If the pre-selected kernel is inadequate for the target data, the learned SVM has poor performance. Therefore, new active learning methods are required which effectively find an adequate kernel for the target data as well as the labels of unknown samples.

In this paper, we propose a two-phase SKM-based active learning method and a sampling strategy for this purpose. Experiments show that the proposed SKM-based active learning method responds quickly enough for interaction with human experts and can find an appropriate kernel among linear combinations of given multiple kernels. We also show that, with the proposed sampling strategy, it converges to the proper combination of kernels earlier than with the popular sampling strategy MARGIN.

Yasusi Sinohara, Atsuhiro Takasu
Making Class Bias Useful: A Strategy of Learning from Imbalanced Data

The performance of many learning methods is usually affected by the class imbalance problem, in which the training data is dominated by instances belonging to one class. In this paper, we propose a novel method which combines random-forest-based techniques and sampling methods for effectively learning from imbalanced data. Our method is composed of two phases: data cleaning and classification based on a random forest. First, the training data is cleaned through the elimination of dangerous negative instances. The data cleaning process is supervised by a negatively biased random forest, in which the negative instances make up a major proportion of the training data in each tree of the forest. Second, we develop a variant of random forest in which each tree is biased towards the positive class to classify the data set, with a majority vote used for prediction. In experimental tests, we compared our method with other existing methods on real data sets, and the results demonstrate a significant performance improvement of our method in terms of the area under the ROC curve (AUC).

Jie Gu, Yuanbing Zhou, Xianqiang Zuo
Detecting Phishing E-mails by Heterogeneous Classification

This paper presents a system for classifying e-mails into two categories, legitimate and fraudulent. The classifier system is based on the serial application of three filters: a Bayesian filter that classifies the textual content of e-mails, a rule-based filter that classifies their non-grammatical content and, finally, a filter based on an emulator of fictitious accesses which classifies the responses from websites referenced by links contained in the e-mails. This system is based on an approach that is both hybrid, because it uses different classification methods, and integrated, because it takes into account all kinds of data and information contained in e-mails. This approach aims to provide effective and efficient classification. The system first applies fast and reliable classification methods, and only when the resulting classification decision is imprecise does it apply more complex analysis and classification methods.

M. Dolores del Castillo, Angel Iglesias, J. Ignacio Serrano
Load Balancing in Fault Tolerant Video Server

The Video-on-Demand (VoD) application is popular because videos can be delivered to users at any time and anywhere. The system load increases when many users access the VoD system simultaneously. The proposed architecture provides a load balancing mechanism within a video server to serve users with a small start-up delay. Video storage is based on the probability of users requesting each video. Videos with higher request probability are stored and replicated to ensure guaranteed retrieval. A parity generation scheme is employed to provide reliability for non-popular videos. The system is also capable of handling disk failures transparently, thereby providing a reliable service to the user.

D. N. Sujatha, K. Girish, B. Rashmi, K. R. Venugopal, L. M. Patnaik
Position-Aware String Kernels with Weighted Shifts and a General Framework to Apply String Kernels to Other Structured Data

In combination with efficient kernel-based learning machines such as the Support Vector Machine (SVM), string kernels have proven to be significantly effective in a wide range of research areas (e.g. bioinformatics, text analysis, voice analysis). Many of the string kernels proposed so far take advantage of simpler kernels such as the trivial comparison of characters and/or substrings, and are classified into two classes: the position-aware string kernel, which takes advantage of the positional information of characters/substrings in their parent strings, and the position-unaware string kernel, which does not. Although the positive semidefiniteness of kernels is a critical prerequisite for learning machines to work properly, little has been known about the positive semidefiniteness of the position-aware string kernel. The present paper is the first to present easily checkable sufficient conditions for the positive semidefiniteness of a certain useful subclass of position-aware string kernels: those in which the similarity/matching of pairs of characters/substrings is evaluated with weights determined according to shifts (the differences in the positions of the characters/substrings). Such string kernels have been studied in the literature, but insufficiently. In addition, we generalize our results by presenting a general framework for converting positive semidefinite string kernels into kernels for richer data structures such as trees and graphs.
Kilho Shin
A New Regression Based Software Cost Estimation Model Using Power Values

The paper aims to improve software estimation research through a new regression model. The study design is organized as follows. Evaluation of estimation methods based on historical data sets requires that these data sets be representative of current or future projects. For that reason, the International Software Benchmarking Standards Group (ISBSG) data set Release 9 is used as the data set for the software cost estimation model. The data set records true project values from the real world, and can be used to extract information to predict the cost of new projects in terms of effort. Regression models are used as the estimation method. The main contribution of this study is the new cost production function that is used to obtain software cost estimates. The performance of the newly proposed cost estimation function is compared with related work in the literature. In the study, some calibration of the production function is performed in order to obtain maximum performance. There is some important discussion on how the results can be improved and how they can be applied to other estimation models and datasets.

Oktay Adalier, Aybars Uğur, Serdar Korukoğlu, Kadir Ertaş
Visualising and Clustering Video Data

We review a new form of self-organizing map which is based on a nonlinear projection of latent points into data space, identical to that performed in the Generative Topographic Mapping (GTM) [1]. But whereas the GTM is an extension of a mixture of experts, this model is an extension of a product of experts [6]. We show visualisation and clustering results on a data set composed of video data of lips uttering 5 Korean vowels and show that the new mapping achieves better results than the standard Self-Organizing Map.

Colin Fyfe, Wei Chuang Ooi, Hanseok Ko
Neural Network-Based Receiver for Uplink Multiuser Code Division Multiple Access Communication System

In this paper, the uplink multiuser code division multiple access (CDMA) communication system model is described in the space-time domain through antenna array and multipath fading expressions. A novel, suitable neural network technique is proposed as an effective signal processing method for the receiver of such an uplink multiuser CDMA system. By an appropriate choice of channel state information for the neural network parameters, the neural network can collectively resolve the effects of both the inter-symbol interference due to the multipath fading channel and the multiple access interference in the receiver of the uplink multiuser CDMA communication system. The dynamics of the proposed neural network receiver for the uplink multiuser CDMA communication system are also studied.

Zi-Wei Zheng
Evolving Tree Algorithm Modifications

There are many variants of the original self-organizing map algorithm proposed by Kohonen. One of the most recent is the Evolving Tree, a tree-shaped self-organizing network which has many interesting characteristics. This network builds a tree structure, splitting the input dataset during learning. This paper presents a speed-up modification of the original training algorithm that is useful when the Evolving Tree network is used with complex data such as images or video. After a measurement of its effectiveness, an application of the modified algorithm to image segmentation is presented.

Vincenzo Cannella, Riccardo Rizzo, Roberto Pirrone
A Linear Learning Method for Multilayer Perceptrons Using Least-Squares

Training multilayer neural networks is typically carried out using gradient descent techniques. Ever since the brilliant backpropagation (BP), the first gradient-based algorithm proposed by Rumelhart et al., novel training algorithms have appeared to improve several facets of the learning process for feed-forward neural networks. Learning speed is one of these. In this paper, a learning algorithm that applies linear least squares is presented. We offer the theoretical basis for the method, and its performance is illustrated by its application to several examples in which it is compared with other learning algorithms on well-known data sets. Results show that the new algorithm improves the learning speed of several backpropagation algorithms while preserving good optimization accuracy. Due to its performance and low computational cost, it is an interesting alternative, even to second-order methods, particularly when dealing with large networks and training sets.

Bertha Guijarro-Berdiñas, Oscar Fontenla-Romero, Beatriz Pérez-Sánchez, Paula Fraguela
A Discriminative Model Corresponding to Hierarchical HMMs

Hidden Markov Models (HMMs) are very popular generative models for sequence data. Recent work has, however, shown that on many tasks, Conditional Random Fields (CRFs), a type of discriminative model, perform better than HMMs. We propose Hierarchical Hidden Conditional Random Fields (HHCRFs), a discriminative model corresponding to hierarchical HMMs (HHMMs). HHCRFs model the conditional probability of the states at the upper levels given observations. The states at the lower levels are hidden and marginalized in the model definition. We have developed two algorithms for the model: a parameter learning algorithm that needs only the states at the upper levels in the training data and the marginalized Viterbi algorithm, which computes the most likely state sequences at the upper levels by marginalizing the states at the lower levels. In an experiment that involves segmenting electroencephalographic (EEG) data for a Brain-Computer Interface, HHCRFs outperform HHMMs.

Takaaki Sugiura, Naoto Goto, Akira Hayashi
Finding Unsatisfiable Subformulas with Stochastic Method

Explaining the causes of infeasibility of Boolean formulas has many practical applications in various fields. A small unsatisfiable subformula provides a succinct explanation of infeasibility and is valuable for applications. In recent years the problem of finding unsatisfiable subformulas has been addressed frequently in research, mostly based on SAT solvers with the DPLL backtrack-search algorithm. However, little attention has been paid to the extraction of unsatisfiable subformulas using stochastic methods. In this paper, we propose a resolution-based stochastic local search algorithm to derive unsatisfiable subformulas. This approach directly constructs the resolution sequences for proving unsatisfiability with a local search procedure, and then extracts small unsatisfiable subformulas from the refutation traces. We report and analyze experimental results on benchmarks.

Jianmin Zhang, Shengyu Shen, Sikun Li
A New Efficient Approach in Clustering Ensembles

Previous clustering ensemble algorithms usually use a consensus function to obtain a final partition from the outputs of the initial clusterings. In this paper, we propose a new clustering ensemble method which generates a new feature space from the initial clustering outputs. Multiple runs of an initial clustering algorithm such as k-means generate a new feature space, which is significantly better than the pure or normalized feature space. Therefore, running a simple clustering algorithm on the generated feature space obtains a final partition significantly better than one obtained on the pure data. In this method, we use a modification of k-means for the initial clustering runs, named "Intelligent k-means", which is especially designed for clustering ensembles. The results of the proposed method are presented using both simple k-means and intelligent k-means. Fast convergence and appropriate behavior are the most interesting points of the proposed method. Experimental results on real data sets show the effectiveness of the proposed method.

Javad Azimi, Monireh Abdoos, Morteza Analoui
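The core idea can be sketched quickly: run k-means several times, one-hot encode the cluster labels of each run into a new feature space, and cluster that representation with a final k-means. Plain random restarts stand in here for the paper's "Intelligent k-means" initialization, and the data set is an illustrative choice.

```python
# Clustering ensemble via a label-derived feature space: multiple
# k-means runs -> one-hot label matrix -> final k-means on that matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data
k = 3
runs = [KMeans(k, n_init=1, random_state=s).fit_predict(X)
        for s in range(10)]                    # initial clusterings
labels = np.column_stack(runs)                 # samples x runs

# One-hot encode each run's labels and stack into the new feature space.
new_space = np.hstack([(labels[:, r:r + 1] == np.arange(k)).astype(float)
                       for r in range(labels.shape[1])])

final = KMeans(k, n_init=10, random_state=0).fit_predict(new_space)
print(np.bincount(final))                      # final cluster sizes
```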
An Evolutionary Hyperheuristic to Solve Strip-Packing Problems

In this paper we introduce an evolutionary hyperheuristic approach to solve difficult strip packing problems. We have designed a genetic-based hyperheuristic using the most recently proposed low-level heuristics in the literature. Two versions for tuning parameters have also been evaluated. The results obtained are very encouraging, showing that our approach outperforms the single heuristics and other well-known techniques.

Pablo Garrido, María-Cristina Riff
Statistical Analysis of Sample-Size Effects in ICA

Independent component analysis (ICA) solves the blind source separation problem by evaluating higher-order statistics, e.g. by estimating fourth-order moments. While estimation errors of the kurtosis can be shown to asymptotically decay with sample size according to a square-root law, they are subject to two further effects for finite samples. Firstly, errors in the estimation of kurtosis increase with the deviation from Gaussianity. Secondly, errors in kurtosis-based ICA algorithms increase when approaching the Gaussian case. These considerations allow us to derive a strict lower bound for the sample size to achieve a given separation quality, which we study analytically for a specific family of distributions and a particular algorithm (fastICA). We further provide results from simulations that support the relevance of the analytical results.

J. Michael Herrmann, Fabian J. Theis
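A simulation in the spirit of the paper makes the square-root law visible: the standard error of the sample kurtosis shrinks roughly as 1/sqrt(n). The distribution (Laplace, excess kurtosis 3) and sample sizes below are arbitrary illustrative choices, not the family analyzed in the paper.

```python
# Empirical decay of kurtosis-estimation error with sample size.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
for n in (100, 1_000, 10_000):
    # 200 independent estimates of excess kurtosis at each sample size.
    estimates = [stats.kurtosis(rng.laplace(size=n)) for _ in range(200)]
    print(f"n={n:6d}  mean={np.mean(estimates):+.3f}  "
          f"std={np.std(estimates):.3f}")   # std shrinks roughly as n**-0.5
```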
HPGP: An Abstraction-Based Framework for Decision-Theoretic Planning

This paper is a report on research towards the development of an abstraction-based framework for decision-theoretic planning. We make use of two planning approaches in the context of probabilistic planning: planning by abstraction and planning graphs. To create abstraction hierarchies our planner uses an adapted version of a hierarchical planner under uncertainty, and to search for plans, we propose a probabilistic planning algorithm based on Pgraphplan. The article outlines the main framework characteristics, and presents results on some problems found in the literature. Our preliminary results suggest that our planner can reduce the size of the search space, when compared with Pgraphplan, hierarchical planning under uncertainty and top-down dynamic programming.

Letícia Friske, Carlos Henrique Costa Ribeiro
Correction of Medical Handwriting OCR Based on Semantic Similarity

In this paper a method for the correction of handwriting Optical Character Recognition (OCR) based on semantic similarity is presented. Different versions of the extraction of semantic similarity measures from a corpus are analysed, with the best results achieved for the combination of the text window context and the Rank Weight Function. An algorithm for selecting word sequences with high internal similarity is proposed. The method was trained on and applied to a corpus of real medical documents written in Polish.

Bartosz Broda, Maciej Piasecki
Multiple Classifier Fusion Using k-Nearest Localized Templates

This paper presents a method for combining classifiers that uses k-nearest localized templates. The localized templates are estimated from a training set using the C-means clustering algorithm, and matched to the decision profile of a new incoming sample by a similarity measure. The sample is assigned to the class which is most frequently represented among the k most similar templates. The appropriate value of k is determined according to the characteristics of the given data set. Experimental results on real and artificial data sets show that the proposed method performs better than conventional fusion methods.

Jun-Ki Min, Sung-Bae Cho

Data Mining and Information Management

Color Image Segmentation Applied to Medical Domain

The article presents two practical ways of using automated color image segmentation in the medical field: for content-based region queries and for tracking the time evolution of disease in patients following a certain treatment. A known technique, the color set back-projection algorithm, was used for automated color medical image segmentation. Our previous work on the extraction of color regions from a database of nature images using the same algorithm showed promising results. The images are transformed from the RGB to the HSV color space, quantized to 166 colors and processed by the color set back-projection algorithm, which allows color region detection. The algorithm is studied from two points of view: complexity and retrieval quality. The experiments, made on a database of color endoscopy images of the digestive tract, have shown satisfying results for both applications, which are important in practical medical use and medical teaching.

Liana Stanescu, Dan Dumitru Burdescu, Cosmin Stoica
Hierarchical Program Representation for Program Element Matching

Many intermediate program representations are used by compilers and other software development tools. In this paper, we propose a novel representation technique that, unlike those commonly used by compilers, has been explicitly designed for facilitating program element matching, a task at the heart of many software mining problems.

Fernando Berzal, Juan-Carlos Cubero, Aída Jiménez
A Combination-of-Tools Method for Learning Interpretable Fuzzy Rule-Based Classifiers from Support Vector Machines

A new approach is proposed for the data-based identification of transparent fuzzy rule-based classifiers. It is observed that fuzzy rule-based classifiers work in a manner similar to kernel-function-based support vector machines (SVMs), since both nonlinearly map the input space into a feature space where the decision can be easily made. Accordingly, a trained SVM can be used for the construction of fuzzy rule-based classifiers. However, the transformed SVM does not automatically result in an interpretable fuzzy model, because the SVM produces a complex rule base in which the number of rules is approximately 40-60% of the number of training data. Hence, reduction of the SVM-initialized classifier is an essential task. For this purpose, a three-step reduction algorithm is developed based on the combination of previously published model reduction techniques. In the first step, the identification of the SVM is followed by the application of the Reduced Set method to decrease the number of kernel functions. The reduced SVM is then transformed into a fuzzy rule-based classifier. The interpretability of a fuzzy model highly depends on the distribution of the membership functions. Hence, the second reduction step is achieved by merging similar fuzzy sets based on a similarity measure. Finally, in the third step, an orthogonal least-squares method is used to reduce the number of rules and re-estimate the consequent parameters of the fuzzy rule-based classifier. The proposed approach is applied to the Wisconsin Breast Cancer, Iris and Wine classification problems to compare its performance with other methods.

Tamas Kenesei, Johannes A. Roubos, Janos Abonyi
An Effective Content-Based Image Retrieval System by Hierarchical Segmentation

Owing to the inaccuracy of current image segmentation methods, it is unavoidable that objects with discrepant components are segmented into different regions. As a result, good image retrieval performance cannot be achieved by region-based image retrieval approaches. Furthermore, the complexity of image segmentation is also an unmanageable issue in scenarios with complex backgrounds. Inspired by wavelet multi-resolution analysis, objects with different scales, orientations, and locations can be retrieved by their invariant features and hierarchical multi-resolution segmentations. For simplicity, the hierarchical segmentation divides an image into equal blocks with different shifts and sizes in a hierarchical way, and those blocks form a complete cover of the image at different hierarchical levels with different shifts and sizes. The smaller the sizes of the blocks, the higher the hierarchical level. Then, the similarity metrics of these sub-blocks with respect to the query image are evaluated to retrieve the sub-blocks whose content matches the query images. Meanwhile, the location and scale information of the query objects can also be returned in the retrieved images. With geometric invariants, normalized histograms and their combinations as invariant features, the hierarchical segmentation-based image retrieval scheme is tested in experiments on an image database with 500 images. The retrieval accuracy with geometric invariants as invariant features reaches 78% for the optimal similarity metric threshold, only inferior to that of region-based image retrieval schemes, whose retrieval accuracy in our experiments is 80% with expectation-maximization segmentations.

Mingxin Zhang, Zhaogan Lu, Junyi Shen
Knowledge Extraction from Unstructured Surface Meshes

We propose methods that allow the investigation of local modifications of aerodynamic design data represented by discrete unstructured surface meshes. A displacement measure is suggested to evaluate local differences between the shapes. The displacement measure provides information on the amount and direction of surface modifications. Using the displacement measure in conjunction with statistical methods or data mining techniques provides meaningful knowledge from the data set for guiding further shape optimization processes.

Lars Graening, Markus Olhofer, Bernhard Sendhoff
Clustering with Reinforcement Learning

We show how a previously derived method of using reinforcement learning for supervised clustering of a data set can lead to a sub-optimal solution if the cluster prototypes are initialised to poor positions. We then develop three novel reward functions which show great promise in overcoming poor initialization. We illustrate the results on several data sets. We then use the clustering methods with an underlying latent space which enables us to create topology preserving mappings. We illustrate this method on both real and artificial data sets.

Wesam Barbakh, Colin Fyfe
Mining Frequent Itemsets in Large Data Warehouses: A Novel Approach Proposed for Sparse Data Sets

Proposing efficient techniques for the discovery of useful information and valuable knowledge from very large databases and data warehouses has attracted the attention of many researchers in the field of data mining. The well-known Association Rule Mining (ARM) algorithm, Apriori, searches for frequent itemsets (i.e., sets of items with acceptable support) by scanning the whole database repeatedly to count the frequency of each candidate itemset. Most of the methods proposed to improve the efficiency of the Apriori algorithm attempt to count the frequency of each itemset without re-scanning the database. However, these methods rarely propose any solution to reduce the complexity of the inevitable enumerations that are inherent in the problem. In this paper, we propose a new algorithm for mining frequent itemsets and association rules. The algorithm computes the frequency of itemsets in an efficient manner. Only a single scan of the database is required. The data is encoded into a compressed form and stored in main memory within a suitable data structure. The proposed algorithm works in an iterative manner, and in each iteration the time required to measure the frequency of an itemset is reduced further (i.e., checking the frequency of n-dimensional candidate itemsets is much faster than for those of n-1 dimensions). The efficiency of our algorithm is evaluated using artificial and real-life datasets. Experimental results indicate that our algorithm is more efficient than existing algorithms.

S. M. Fakhrahmad, M. Zolghadri Jahromi, M. H. Sadreddini
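The core trick of single-scan, in-memory support counting can be sketched with bit vectors: encode each item's occurrences during one pass over the database, then obtain the support of any candidate itemset by AND-ing those vectors, with no further database scans. The paper's encoding and iteration scheme are more elaborate; this shows only the basic mechanism.

```python
# Single-scan support counting with per-item occurrence bit vectors.
import numpy as np

transactions = [{"a", "b", "c"}, {"a", "c"}, {"a", "d"}, {"b", "c"}]
items = sorted(set().union(*transactions))

# One scan over the database: boolean occurrence vector per item.
bits = {it: np.array([it in t for t in transactions]) for it in items}

def support(itemset):
    """Support of an itemset = popcount of the AND of its item vectors."""
    mask = np.ones(len(transactions), dtype=bool)
    for it in itemset:
        mask &= bits[it]                 # bitwise AND, no database access
    return int(mask.sum())

print(support({"a", "c"}))   # -> 2
print(support({"b", "c"}))   # -> 2
```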
A Sparse Bayesian Position Weighted Bio-Kernel Network

The Bio-Basis Function Neural Network (BBFNN) is a successful neural network architecture for peptide classification. However, the selection of a subset of peptides for a parsimonious network structure is always a difficult process. We present a Sparse Bayesian Bio-Kernel Network in which a minimal set of representative peptides can be selected automatically. We also introduce per-residue weighting to the Bio-Kernel to improve accuracy and identify patterns for biological activity. The new network is shown to outperform the original BBFNN on various datasets covering different biological activities, such as enzymatic activity and post-translational modification, and generates simple, interpretable models.

David C. Trudgian, Zheng Rong Yang
Square Penalty Support Vector Regression

Support Vector Regression (SVR) is usually pursued using the ε-insensitive loss function while, alternatively, the initial regression problem can be reduced to a properly defined classification problem. In either case, slack variables have to be introduced in practically interesting problems, the usual choice being linear penalties for them. In this work we discuss the solution of an SVR problem by recasting it first as a classification problem and working with square penalties. Besides a general theoretical discussion, we also derive some consequences for regression problems of the coefficient structure of the resulting SVMs, and illustrate the procedure on some standard problems widely used as benchmarks, as well as on a wind energy forecasting problem.

Álvaro Barbero, Jorge López, José R. Dorronsoro
Constructing Accurate Fuzzy Rule-Based Classification Systems Using Apriori Principles and Rule-Weighting

A fuzzy rule-based classification system (FRBCS) is one of the most popular approaches used in pattern classification problems. One advantage of a fuzzy rule-based system is its interpretability. However, we are faced with some challenges when generating the rule base. In high-dimensional problems, we cannot generate every possible rule with respect to all antecedent combinations. In this paper, making use of some data mining concepts, we propose a method for rule generation that can result in a rule base containing rules of different lengths. In the next phase, we use rule weights as a simple mechanism to tune the classifier and propose a new method of rule-weight specification for this purpose. Through computer simulations on data sets from the UCI repository, we show that the proposed scheme achieves better prediction accuracy compared with other fuzzy and non-fuzzy rule-based classification systems proposed in the past.

S. M. Fakhrahmad, A. Zare, M. Zolghadri Jahromi
Visualization of Topology Representing Networks

As data analysis tasks often have to face huge and complex data sets, there is a need for new algorithms that combine vector quantization and mapping methods to visualize the hidden data structure in a low-dimensional vector space. In this paper a new class of algorithms is defined. Topology representing networks are applied to quantize and disclose the data structure, and different nonlinear mapping algorithms are applied for the low-dimensional visualization of the quantized data. To evaluate the main properties of the resulting topology-representing-network-based mapping methods, a detailed analysis based on the wine benchmark example is given.

Agnes Vathy-Fogarassy, Agnes Werner-Stark, Balazs Gal, Janos Abonyi
The Outer Impartation Information Content of Rules and Rule Sets

The appraisal of rules and rule sets is very important in data mining. This paper discusses the information content of rules, categorized into inner mutual information and outer impartation information. We put forward the viewpoint that the outer impartation information content of rules and rule sets can be represented by relations from the input universe to the output universe. The interaction of rules in a rule set can then be conveniently represented by the union and intersection of binary relations. Based on the entropy of relations, the outer impartation information content of rules and rule sets is well measured. Compared with methods that appraise rule sets by their overall performance (accuracy, error rate) on given test data sets, the outer impartation information content of rule sets is more objective and convenient because no test data sets are required.

Dan Hu, Yuanfu Feng
An Engineering Approach to Data Mining Projects

Both the number and the complexity of Data Mining projects have increased in recent years. Unfortunately, there is currently no formal process model for this kind of project, and existing approaches are not complete enough. In some sense, the present situation is comparable to that of software development before the 'software crisis' of the late 1960s. Software Engineering matured on the basis of process models and methodologies, and Data Mining's evolution is proceeding in parallel to that of Software Engineering. The research work described in this paper proposes a Process Model for Data Mining Projects based on the study of current Software Engineering process models (IEEE Std 1074 and ISO 12207) and the most widely used Data Mining methodology, CRISP-DM (considered a de facto standard), as basic references.

Óscar Marbán, Gonzalo Mariscal, Ernestina Menasalvas, Javier Segovia
Classifying Polyphonic Melodies by Chord Estimation Based on Hidden Markov Model

In this investigation we propose a novel approach for classifying polyphonic melodies. Our main idea is to classify polyphonic melodies automatically by a Hidden Markov model where the states correspond to well-tempered chords over the music and the observation sequences to feature values called the pitch spectrum. The similarity among harmonies can be considered by means of the features and well-tempered chords. We show the effectiveness and usefulness of the approach through experimental results.
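
The abstract leaves the decoding step implicit; for reference, a generic log-space Viterbi decoder over chord states can be sketched as follows (a minimal sketch: the transition and emission scores, as well as the pitch-spectrum features, would come from the trained model, which is not reproduced here):

```python
import numpy as np

def viterbi(log_init, log_trans, log_emit):
    """Most likely chord-state sequence given per-frame emission log-scores.

    log_init:  (S,)    initial state log-probabilities
    log_trans: (S, S)  log P(state_j | state_i)
    log_emit:  (T, S)  per-frame log-likelihood of the observed features
    """
    T, S = log_emit.shape
    delta = log_init + log_emit[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans      # scores[i, j]: come from i to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emit[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                # backtrack through pointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```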

Yukiteru Yoshihara, Takao Miura, Isamu Shioya
Elastic Non-contiguous Sequence Pattern Detection for Data Stream Monitoring

In recent years, there has been increasing interest in the detection of non-contiguous sequence patterns in data streams. Existing works define a fixed temporal constraint between every pair of adjacent elements of the sequence. While this method is simple and intuitive, it suffers from the following shortcomings: 1) it is difficult for users who are not domain experts to specify such complex temporal constraints properly; 2) the fixed temporal constraint is not flexible enough to capture interesting patterns hidden in long sequences. In this paper, we introduce a novel type of non-contiguous sequence pattern, named Elastic Temporal Constrained Non-contiguous Sequence Pattern (ETC-NSP). Such a pattern defines an elastic temporal constraint on the sequence and is thus more flexible and effective than fixed temporal constraints. Detecting ETC-NSPs in data streams is a non-trivial task, since a brute-force approach is exponential in time. Our method exploits a similarity measure called Minimal Variance Matching as the basic matching mechanism. To further speed up the monitoring process, we develop pruning strategies that make it practical to use ETC-NSP in streaming environments. Experimental studies show that the proposed method is efficient and effective in detecting non-contiguous sequence patterns in data streams.

Xinqiang Zuo, Yuanbing Zhou, Chunhui Zhao
Joint Cutoff Probabilistic Estimation Using Simulation: A Mailing Campaign Application

Organisations frequently face complex situations where decision making is difficult. In these scenarios, several related decisions must be made at a time, bounded by constraints (e.g. inventory/stock limitations, costs, limited resources, time schedules, etc.). In this paper, we present a new method for making a good global decision in such a complex environment with several interwoven local data mining models. In these situations, the best local cutoff for each model is not usually the best cutoff in global terms. We use simulation with Petri nets to obtain better cutoffs for the data mining models. We apply our approach to a frequent problem in customer relationship management (CRM), more specifically, the design of a direct-marketing campaign where several alternative products have to be offered to the same house list of customers under the usual inventory limitations. We experimentally compare two different methods for obtaining the model cutoffs (one based on merging the prospective customer lists and using the local cutoffs, the other based on simulation), illustrating that methods which use simulation to adjust the model cutoffs obtain better results than a more classical analytical method.

Antonio Bella, Cèsar Ferri, José Hernández-Orallo, María José Ramírez-Quintana
Segmentation and Annotation of Audiovisual Recordings Based on Automated Speech Recognition

Searching multimedia data, in particular audiovisual data, is still a challenging task. The number of digital video recordings has increased dramatically as recording technology has become more affordable and network infrastructure has become capable enough to provide download and streaming solutions. However, the accessibility and traceability of their content for further use is still rather limited. In this paper we describe and evaluate a new approach to synchronizing auxiliary text-based material, such as presentation slides, with lecture video recordings. Our goal is to show that a tentative transliteration is sufficient for synchronization. Different approaches to synchronizing textual material with deficient transliterations of lecture recordings are discussed and evaluated. Our evaluation data set comprises recordings in different languages and from various speakers.

Stephan Repp, Jörg Waitelonis, Harald Sack, Christoph Meinel
Mining Disjunctive Sequential Patterns from News Stream

Frequent disjunctive patterns are known to be a sophisticated method of text mining in a single document that satisfies anti-monotonicity, which allows efficient algorithms based on APRIORI. In this work, we propose a new online, single-pass algorithm that extracts currently frequent disjunctive patterns from a news stream using a weighting method for past events. We also discuss some experimental results.

Kazuhiro Shimizu, Isamu Shioya, Takao Miura
A New Dissimilarity Measure Between Trees by Decomposition of Unit-Cost Edit Distance

Tree edit distance is a conventional dissimilarity measure between labeled trees. However, tree edit distance, including unit-cost edit distance, mixes label similarity and tree-structure similarity. Therefore, even if the label similarity between two trees that share many nodes with the same label is high, this high label similarity is hard to recognize from their tree edit distance when the tree sizes or shapes are quite different. To overcome this flaw, we propose a novel method that obtains a label dissimilarity measure and a structural dissimilarity measure separately by decomposing unit-cost edit distance.

Hisashi Koga, Hiroaki Saito, Toshinori Watanabe, Takanori Yokoyama
Optimizing Web Structures Using Web Mining Techniques

With the web vibrant and rapidly growing, website complexity is constantly increasing, making it more difficult for users to quickly locate the information they are looking for. At the same time, this is becoming more and more important due to the widespread reliance on the many services available on the Internet nowadays. Web mining techniques have been used successfully for quite some time, for example in search engines like Google, to facilitate the retrieval of relevant information. This paper takes a different approach, as we believe that not only search engines can facilitate the task of finding information, but so can an optimization of a website's internal structure based on previously recorded user behavior. We present a novel approach to identifying problematic structures in websites. The method compares user behavior, derived via web log mining techniques, to an analysis of the website's link structure obtained by applying the Weighted PageRank algorithm (see [19]). We then show how to use these intermediate results to point out problematic website structures to the website owner.
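
The paper applies the Weighted PageRank variant of [19]; as a baseline reference only, standard PageRank by power iteration can be sketched as follows (the weighted variant replaces the uniform split of a page's rank among its out-links with in/out-degree-based weights):

```python
import numpy as np

def pagerank(adj, d=0.85, tol=1e-9, max_iter=200):
    """Standard PageRank by power iteration; adj[i][j] = 1 if page i links to j.
    Dangling-page mass is simply dropped here for brevity."""
    A = np.asarray(adj, dtype=float)
    n = A.shape[0]
    out = A.sum(axis=1)
    out[out == 0] = 1.0                      # avoid division by zero
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * (A / out[:, None]).T @ r
        if np.abs(r_new - r).sum() < tol:
            break
        r = r_new
    return r

# Tiny example: page 0 -> 1, 1 -> 2, 2 -> 0.
print(pagerank([[0, 1, 0], [0, 0, 1], [1, 0, 0]]))
```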

Jonathan Jeffrey, Peter Karski, Björn Lohrmann, Keivan Kianmehr, Reda Alhajj
A Collaborative Recommender System Based on Asymmetric User Similarity

Recommender systems can be seen as an application of a data mining process in which data collection, pre-processing, user profile building and evaluation phases are performed in order to deliver personalised recommendations. Collaborative filtering systems rely on user-to-user similarities using standard similarity measures. The symmetry of most standard similarity measures makes it difficult to differentiate users' patterns based on their historical behaviour: they cannot distinguish between two users when one user's behaviour is quite similar to the other's but not vice versa. We have also found that the k-nearest neighbour algorithm may generate groups which are not necessarily homogeneous. In this paper, we use an asymmetric similarity measure in order to distinguish users' patterns. Recommendations are delivered based on the historical behaviour of the users closest to a target user. Preliminary experimental results show that the similarity measure used is a powerful tool for differentiating users' patterns.
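
The abstract does not specify the measure; purely as an illustration of asymmetry, one simple choice is the fraction of one user's rated items that the other user has also rated, which is not symmetric when history sizes differ:

```python
def asymmetric_overlap(items_u, items_v):
    """Fraction of u's items also rated by v; asymmetric in general."""
    if not items_u:
        return 0.0
    return len(items_u & items_v) / len(items_u)

u = {"i1", "i2", "i3", "i4"}
v = {"i1", "i2"}
print(asymmetric_overlap(v, u))  # 1.0: v's behaviour is contained in u's
print(asymmetric_overlap(u, v))  # 0.5: but not vice versa
```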

Marta Millan, Maria Trujillo, Edward Ortiz
Stop Wasting Time: On Predicting the Success or Failure of Learning for Industrial Applications

The successful application of machine learning techniques to industrial problems places various demands on the collaborators. The system designers must possess appropriate analytical skills and technical expertise, and the management of the industrial or commercial partner must be sufficiently convinced of the potential benefits to invest money and equipment. Vitally, the collaboration also requires a significant investment of time from the end-users in order to provide the training data from which the system can (hopefully) learn. This poses a problem if the developed machine learning system is not sufficiently accurate, as the users and management may view their input as wasted effort and lose faith in the process. In this paper we investigate techniques for making early predictions of the error rate achievable after further interactions. In particular we show how decomposing the error into different components can lead to useful predictors of achievable accuracy, but that this depends on the choice of an appropriate sampling methodology.

J. E. Smith, M. A. Tahir
Parallel Wavelet Transform for Spatio-temporal Outlier Detection in Large Meteorological Data

This paper describes a state-of-the-art parallel data mining solution that employs wavelet analysis for scalable outlier detection in large, complex spatio-temporal data. The algorithm has been implemented on a multiprocessor architecture and evaluated on real-world meteorological data. Our solution on a high-performance architecture can process massive and complex spatial data in reasonable time and yields improved predictions.

Sajib Barua, Reda Alhajj
A Tool for Web Usage Mining

This paper presents a tool for web usage mining. The aim is to provide a tool that facilitates the mining process rather than to implement elaborate algorithms and techniques. The tool covers the different phases of the CRISP-DM methodology, such as data preparation, data selection, modeling and evaluation. The algorithms used in the modeling phase are those implemented in the Weka project. The tool has been tested on a web site to find access and navigation patterns.

Jose M. Domenech, Javier Lorenzo
An Algorithm to Mine General Association Rules from Tabular Data

Mining association rules is a major technique within data mining and has many applications. Most methods for mining association rules from tabular data mine simple rules which only represent equality in their items. Limiting the operator to "=" means that many interesting frequent patterns that may exist are not identified. Clearly, where there is an order between objects, being greater than or less than a value is as important as equality. This motivates an extension from simple equality to a more general set of operators. We address the problem of mining general association rules in tabular data where rules can have any of the operators { ≤ , ≥ , ≠ , = } in their antecedent part. The proposed algorithm, Mining General Rules (MGR), is applicable to datasets with discrete ordered attributes and to quantitative discretized attributes. It stores candidate general itemsets in a tree structure in such a way that the supports of complex itemsets can be recursively computed from the supports of simpler itemsets. The algorithm is shown to have benefits in terms of time complexity and memory management, and has great potential for parallelization.

Siyamand Ayubi, Maybin Muyeba, John Keane
Intrusion Detection at Packet Level by Unsupervised Architectures

Intrusion Detection Systems (IDSs) monitor the traffic in computer networks to detect suspicious activities. Connectionist techniques can support the development of IDSs by modeling 'normal' traffic. This paper presents the application of several unsupervised neural methods to a packet dataset for the first time. Three unsupervised neural methods are considered, namely Vector Quantization (VQ), Self-Organizing Maps (SOM) and Auto-Associative Back-Propagation (AABP) networks. The first paradigm proves quite powerful in supporting the basic space-spanning mechanism for sifting normal traffic from anomalous traffic. The SOM attains quite acceptable results in dealing with some anomalies while failing on others. The AABP model effectively drives a nonlinear compression paradigm and eventually yields a compact visualization of the network traffic progression.

Álvaro Herrero, Emilio Corchado, Paolo Gastaldo, Davide Leoncini, Francesco Picasso, Rodolfo Zunino
Quality of Adaptation of Fusion ViSOM

This work investigates the performance capabilities of an extension of the ViSOM (Visualization Induced SOM) algorithm that uses the ensemble meta-algorithm and a subsequent fusion process. The fusion process has two variants, based on two different criteria for node similarity: Euclidean distance and similarity of Voronoi polygons. The capabilities, strengths and weaknesses of the different variants of the model are discussed and compared in depth. Details of several experiments performed over different datasets, applying the fusion variants to the ViSOM algorithm as well as to the SOM, are included for this purpose.

Bruno Baruque, Emilio Corchado, Hujun Yin
Classification Based on the Trace of Variables over Time

To be successful with certain classification problems or knowledge discovery tasks, it is not sufficient to look at the available variables at a single point in time; their development has to be traced over a period of time. It is shown that patterns and sequences of labeled intervals represent a particularly well-suited data format for this purpose. An extension of existing classifiers is proposed that enables them to handle this kind of sequential data. Compared to earlier approaches, the expressiveness of the pattern language (using Allen et al.'s interval relationships) is increased, which allows the discovery of many temporal patterns common to real-world applications.

Frank Höppner, Alexander Topp
Extracting Meaningful Contexts from Mobile Life Log

Life logs comprise people's experiences collected from various sources and can be used to support the user's memory. There are many studies that collect and store life logs for personal memory. In this paper, we collect log data from a smart phone, derive contexts from the log, and then identify which contexts are meaningful using a method based on KeyGraph. To evaluate the proposed method, we show an example of identifying meaningful places using contexts and GPS logs collected from two users.

Youngseol Lee, Sung-Bae Cho
Topological Tree Clustering of Social Network Search Results

In the information age, online collaboration and social networks are of increasing importance and are quickly becoming an integral part of our lifestyle. In business, social networking can be a powerful tool to expand the customer network to which a company can sell products and services, or to find new partners or employees in a more trustworthy and targeted manner. Identifying new friends or partners on social networking websites is usually done via a keyword search, browsing a directory of topics (e.g. interests, geography, or employer) or a chain of social ties (e.g. links to other friends on a user's profile). However, there are limitations to these three approaches. Keyword search typically produces a list of ranked results, and traversing pages of ranked results can be tedious and time-consuming. A directory of groups/networks is generally created manually, requires significant ongoing maintenance and cannot keep up with rapid changes. Social chains require the initial users to specify metadata in their profile settings, which again may not be up to date. In this paper we propose to use the topological tree method to dynamically identify similar groups based on metadata and content. The topological tree method is used to automatically organise social networking groups. The retrieved results, organised using an online version of the topological tree method, are compared against the results returned by a social network search. We discuss criteria for representing social relationships, and the advantages of presenting underlying topics and providing a clear view of the connections between topics. The topological tree has been found to be a superior representation, well suited for organising social networking content.

Richard T. Freeman

Bioinformatics and Neuroinformatics

A Framework to Analyze Biclustering Results on Microarray Experiments

Microarray technology produces large amounts of information to be manipulated by analysis methods, such as biclustering algorithms, to extract new knowledge. All-purpose multivariate data visualization tools are usually not enough for studying microarray experiments. Additionally, clustering tools do not provide means of simultaneous visualization of all the biclusters obtained.

We present an interactive tool that integrates traditional visualization techniques with others related to bioinformatics, such as transcription regulatory networks and microarray heatmaps, to provide enhanced understanding of the biclustering results. Our aim is to gain insight about the structure of biological data and the behavior of different biclustering algorithms.

Rodrigo Santamaría, Roberto Therón, Luis Quintales
Methods to Bicluster Validation and Comparison in Microarray Data

There are many validation indices and techniques for studying clustering results. In recent years, biclustering algorithms have been applied with great success in Systems Biology, principally in DNA microarray analysis. Nowadays there is a large set of biclustering algorithms, each based on different concepts, but there are few intercomparisons measuring their performance. We review and present here some numerical measures, both new and evolved from traditional clustering validation techniques, to allow the comparison and validation of biclustering algorithms.

Rodrigo Santamaría, Luis Quintales, Roberto Therón
Capturing Heuristics and Intelligent Methods for Improving Micro-array Data Classification

Classification of micro-array data has been studied extensively, but only a small amount of research has addressed the classification of micro-array data involving more than two classes. This paper proposes a learning strategy for building a multi-target classifier that takes advantage of well-known data mining techniques. To address the intrinsic difficulty of selecting features that promote classification accuracy, the paper considers the use of a set of binary classifiers, each devoted to predicting a single class of the multi-classification problem. These classifiers act as local experts whose knowledge (about the features most correlated with each class value) is taken into account by the learning strategy when selecting an optimal set of features. Results of experiments performed on a publicly available dataset demonstrate the feasibility of the proposed approach.

Andrea Bosin, Nicoletta Dessì, Barbara Pes
Classification of Microarrays with kNN: Comparison of Dimensionality Reduction Methods

Dimensionality reduction can often improve the performance of the k-nearest neighbor (kNN) classifier for high-dimensional data sets, such as microarrays. The effect of the choice of dimensionality reduction method on the predictive performance of kNN for classifying microarray data is an open issue, and four common dimensionality reduction methods, Principal Component Analysis (PCA), Random Projection (RP), Partial Least Squares (PLS) and Information Gain (IG), are compared on eight microarray data sets. All dimensionality reduction methods are observed to result in more accurate classifiers than using the raw attributes. Furthermore, both PCA and PLS reach their best accuracies with fewer components than the other two methods, while RP needs far more components than the others to outperform kNN on the non-reduced dataset. None of the dimensionality reduction methods can be concluded to generally outperform the others, although PLS is shown to be superior on all four binary classification tasks. The main conclusion of the study is that the choice of dimensionality reduction method can be of major importance when classifying microarrays using kNN.
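
A minimal scikit-learn sketch of the PCA-then-kNN combination compared in the paper (illustrative only: the dataset below is a stand-in, and the paper's component counts and evaluation protocol differ):

```python
from sklearn.datasets import load_breast_cancer  # stand-in for a microarray set
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

X, y = load_breast_cancer(return_X_y=True)
for n_components in (5, 10, 20):
    # Reduce dimensionality first, then classify with kNN.
    model = make_pipeline(PCA(n_components=n_components),
                          KNeighborsClassifier(n_neighbors=3))
    score = cross_val_score(model, X, y, cv=5).mean()
    print(n_components, round(score, 3))
```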

Sampath Deegalla, Henrik Boström
Protein Data Condensation for Effective Quaternary Structure Classification

Many proteins are composed of two or more subunits, each associated with different polypeptide chains. The number and arrangement of the subunits forming a protein are referred to as its quaternary structure. The quaternary structure of a protein is important, since it characterizes the biological function of the protein when it is involved in specific biological processes. Unfortunately, quaternary structures are not trivially deducible from protein amino acid sequences. In this work, we propose a protein quaternary structure classification method exploiting the functional domain composition of proteins. It is based on a nearest-neighbor condensation technique that reduces both the portion of the dataset to be stored and the number of comparisons to carry out. Our approach seems promising, in that it guarantees a high classification accuracy even though it does not require the entire dataset to be analyzed. Indeed, experimental evaluations show that the proposed method selects a small portion of the dataset for classification (on the order of 6.43%) and is very accurate (97.74%).
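
The condensation step builds on the classic nearest-neighbour condensation idea; Hart's condensed nearest neighbour rule, sketched below, is the textbook form of it (the paper's actual condensation criterion may differ):

```python
import numpy as np

def condense(X, y):
    """Hart's CNN sketch: keep only the training points needed so that 1-NN
    on the kept subset classifies every remaining point correctly."""
    keep = [0]
    changed = True
    while changed:
        changed = False
        for i in range(len(X)):
            if i in keep:
                continue
            d = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(d.argmin())]
            if y[nearest] != y[i]:   # misclassified by the current subset
                keep.append(i)
                changed = True
    return np.array(keep)

X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1]])
y = np.array([0, 0, 0, 1, 1])
print(condense(X, y))  # a small subset suffices to classify the rest
```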

Fabrizio Angiulli, Valeria Fionda, Simona E. Rombo
PINCoC: A Co-clustering Based Approach to Analyze Protein-Protein Interaction Networks

A novel technique to search for functional modules in a protein-protein interaction network is presented. The network is represented by the adjacency matrix associated with the undirected graph modelling it. The algorithm introduces the concept of the quality of a sub-matrix of the adjacency matrix, and applies a greedy search technique for finding locally optimal solutions made of dense sub-matrices containing the maximum number of ones. An initial random solution, constituted by a single protein, is evolved to search for a locally optimal solution by adding/removing connected proteins that best contribute to improving the quality function. Experimental evaluations carried out on Saccharomyces cerevisiae proteins show that the algorithm is able to efficiently isolate groups of biologically meaningful proteins corresponding to the most compact sets of interactions.

Clara Pizzuti, Simona E. Rombo
Discovering α–Patterns from Gene Expression Data

Biclustering techniques have the purpose of finding subsets of genes that show similar activity patterns under a subset of conditions. In this paper we characterize a specific type of pattern, which we call an α-pattern, and present a new biclustering algorithm specifically designed to find α-patterns, in which the gene expression values evolve across the experimental conditions showing similar behavior inside a band that ranges from 0 up to a pre-defined threshold called α. The α value guarantees the co-expression among genes. We have tested our method on the Yeast dataset and compared the results to the biclustering algorithms of Cheng & Church (2000) and Aguilar & Divina (2005). Results show that the algorithm finds interesting biclusters, grouping genes with similar behavior while maintaining a very low mean squared residue.

Domingo S. Rodríguez-Baena, Norberto Diaz-Diaz, Jesús S. Aguilar-Ruiz, Isabel Nepomuceno-Chamorro
Biclusters Evaluation Based on Shifting and Scaling Patterns

Microarray techniques have motivated the development of different methods to extract useful information from a biological point of view. Biclustering algorithms obtain a set of genes with the same behaviour over a group of experimental conditions from gene expression data. In order to evaluate the quality of a bicluster, it is useful to identify specific tendencies represented by patterns in the data. These patterns describe the behaviour of a bicluster previously obtained by an adequate biclustering technique from gene expression data. In this paper a new measure for evaluating biclusters is proposed. This measure captures a special kind of pattern with scaling trends which represents quality patterns; these are not captured by the evaluation measure previously accepted in the literature. This work is a first step towards methods that search for biclusters based on the concept of shift and scale invariance. Experimental results based on the yeast cell cycle and the human B-cell lymphoma datasets are reported. Finally, the performance of the proposed technique is compared with an optimization method based on the Nelder-Mead simplex search algorithm.

Juan A. Nepomuceno, Alicia Troncoso Lora, Jesús S. Aguilar–Ruiz, Jorge García–Gutiérrez
A Deterministic Model to Infer Gene Networks from Microarray Data

Microarray experiments help researchers to construct the structure of gene regulatory networks, i.e., networks representing relationships among different genes. Filtering and knowledge extraction processes are necessary in order to handle the huge amount of data produced by microarray technologies. We propose regression tree techniques as a method to identify gene networks. Regression trees are a very useful technique for estimating numerical values for the target outputs. They are often more precise than linear regression models because they can fit different linear regressions to separate areas of the search space. In our approach, we generate a single regression tree for each gene from a set of genes, taking the remaining genes as input, and finally build a graph from all the relationships among output and input genes. In this paper, we simplify the approach by setting a single seed, the gene ARN1, and building the graph around it. The final model may give some clues to understand the dynamics, the regulation or the topology of the gene network from one (or several) seeds, since it gathers relevant genes with accurate connections. The performance of our approach is experimentally tested on the yeast Saccharomyces cerevisiae dataset (Rosetta compendium).

Isabel Nepomuceno-Chamorro, Jesús S. Aguilar–Ruiz, Norberto Díaz–Díaz, Domingo S. Rodríguez–Baena, Jorge García
Profiling of High-Throughput Mass Spectrometry Data for Ovarian Cancer Detection

Mass Spectrometry (MS) has been applied to the early detection of ovarian cancer. To date, most studies have concentrated on the so-called whole-spectrum approach, which treats each point in the spectrum as a separate test, due to its better accuracy than the profiling approach. However, the whole-spectrum approach does not guarantee biologically meaningful results and is difficult to interpret biologically and to apply clinically. It is therefore necessary to develop an accurate profiling technique for the early detection of ovarian cancer. This paper proposes a novel profiling method for high-resolution ovarian cancer MS data that integrates the Smoothed Nonlinear Energy Operator (SNEO), correlation-based peak selection and a Random Forest classifier. In order to evaluate the performance of this method without bias, we employed randomization techniques, repeatedly dividing the data set into testing and training sets to test the whole procedure many times over. Test results show that the method can find a parsimonious set of biologically meaningful biomarkers with better accuracy than other methods.
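
The Nonlinear Energy Operator at the core of SNEO is ψ[x](n) = x(n)² − x(n−1)·x(n+1), followed by smoothing; a minimal sketch (the window shape and length are placeholder choices, not the paper's settings):

```python
import numpy as np

def sneo(x, window=7):
    """Smoothed Nonlinear Energy Operator:
    psi(n) = x(n)^2 - x(n-1)*x(n+1), then convolve with a smoothing window."""
    x = np.asarray(x, dtype=float)
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    w = np.hanning(window)
    return np.convolve(psi, w / w.sum(), mode="same")

# SNEO responds strongly where a signal has sharp, high-frequency peaks.
t = np.linspace(0, 1, 500)
signal = np.sin(2 * np.pi * 5 * t)
signal[250] += 2.0                      # a spike, as a stand-in for a peak
print(int(sneo(signal).argmax()))       # near index 250
```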

Shan He, Xiaoli Li
Adapting Machine Learning Technique for Periodicity Detection in Nucleosomal Locations in Sequences

DNA sequence is an important determinant of the positioning, stability, and activity of nucleosomes, yet the molecular basis of these remains elusive. Positioned nucleosomes are believed to play an important role in transcriptional regulation and in the organization of chromatin in cell nuclei. Since the completion of the genome projects of many organisms, sequence mining has received considerable and increasing attention. Much effort has been devoted to detecting periodicity in DNA sequences, namely in the DNA segments that wrap the histone proteins. In this paper, we describe and apply a dynamic periodicity detection algorithm to discover periodicity in DNA sequences. Our algorithm is based on a suffix tree as the underlying data structure. The proposed approach considers the periodicity of alternative substrings, in addition to using a dynamic window to detect the periodicity of certain instances of substrings. We demonstrate the applicability and effectiveness of the proposed approach by reporting test results on three data sets.

Faraz Rasheed, Mohammed Alshalalfa, Reda Alhajj
Analysis of Tiling Microarray Data by Learning Vector Quantization and Relevance Learning

We apply learning vector quantization to the analysis of tiling microarray data. As an example we consider the classification of C. elegans genomic probes as intronic or exonic. Training is based on the current annotation of the genome. Relevance learning techniques are used to weight and select features according to their importance for the classification. Among other findings, the analysis suggests that correlations between the perfect match intensity of a particular probe and its neighbors are highly relevant for successful exon identification.

Michael Biehl, Rainer Breitling, Yang Li
Discriminating Microbial Species Using Protein Sequence Properties and Machine Learning

Much work has been done to identify species-specific proteins in sequenced genomes and hence to determine their function. We assumed that such proteins have specific physico-chemical properties that will discriminate them from proteins in other species. In this paper, we examine the validity of this assumption by comparing proteins and their properties from different bacterial species using Support Vector Machines (SVM). We show that by training on selected protein sequence properties, SVMs can successfully discriminate between proteins of different species. This finding takes us a step closer to inferring the functional characteristics of these proteins.

Ali Al-Shahib, David Gilbert, Rainer Breitling
Automatic Prognostic Determination and Evolution of Cognitive Decline Using Artificial Neural Networks

This work goes a step further in the development of methods based on automatic learning techniques to parse and interpret data relating to cognitive decline (CD). We studied the neuropsychological tests from 267 consultations of 30 patients carried out by the Alzheimer's Patient Association of Gran Canaria in 2005. An adaptation of the Sanger neural network for the treatment of missing values allowed a Principal Components Analysis (PCA) to be performed on the data obtained. The results show that the first three principal components extract information relating to functional, cognitive and instrumental symptomatology, respectively, from the tests. By means of these techniques, it is possible to develop tools that allow physicians to quantify, view and better track the symptomatology associated with cognitive decline processes, contributing to a better knowledge of these processes.

Patricio García Báez, Carmen Paz Suárez Araujo, Carlos Fernández Viadero, José Regidor García

Agents and Distributed Systems

SCSTallocator: Sized and Call-Site Tracing-Based Shared Memory Allocator for False Sharing Reduction in Page-Based DSM Systems

False sharing results from the co-location of unrelated data in the same unit of memory coherency, and is one source of unnecessary overhead that does nothing to help maintain memory coherency in multiprocessor systems. Moreover, the damage caused by false sharing grows in proportion to the granularity of memory coherency. To reduce false sharing in page-based DSM systems, it is necessary to allocate unrelated data objects that have different access patterns into separate shared pages. In this paper we propose a sized and call-site tracing-based shared memory allocator, SCSTallocator for short. SCSTallocator expects that data objects requested from different call-sites may have different access patterns in the future, so it places each data object requested from a different call-site into a separate shared page; consequently, data objects that share a call-site are likely to end up in the same shared pages. At the same time, SCSTallocator places data objects of different sizes into different shared pages, preventing different-sized objects from being allocated to the same shared page. We use execution-driven simulation of real parallel applications to evaluate the effectiveness of SCSTallocator. Our observations show that SCSTallocator outperforms existing dynamic shared memory allocators: by combining the two existing allocation techniques, we can eliminate a considerable number of false sharing misses.

Jongwoo Lee, Youngho Park, Yongik Yoon
A Hybrid Social Model for Simulating the Effects of Policies on Residential Power Consumption

In this paper, a hybrid social model combining an econometric model and a social influence model is proposed to address a problem in power resources management. A hybrid society simulation platform based on the proposed model, termed the Residential Electric Power Consumption Multi-Agent System (RECMAS), is designed to simulate residential power consumption with multiple agents. RECMAS is composed of a consumer agent, a power supplier agent, and a policy maker agent. It provides policy makers with an additional tool to evaluate power price policies and public education campaigns in different scenarios. Through an influence diffusion mechanism, RECMAS can simulate the factors affecting power consumption, including those associated with public education campaigns. An application of the method to simulating residential power consumption in China is presented.

Minjie Xu, Zhaoguang Hu, Xiaoyou Jiao, Junyong Wu
On Intelligent Interface Agents for Human Based Computation

In this paper a new type of interface agent is presented. This agent is oriented to modelling systems for human based computation. This kind of computation, which we consider a logical extension of the intelligent agent paradigm, is emerging as a valid approach for the resolution of complex problems.

First, the state of the art of interface agents is reviewed. Next, human based computation is defined, and we show why it is necessary to extend the current typology of interface agents to model this new kind of computation. In addition, a new type of interface agent, oriented to modelling this type of computational system, is presented. Finally, two of the most representative applications of human based computation are specified using this new typology.

F. Aznar, M. Sempere, M. Pujol, R. Rizo
Reverse Engineering an Agent-Based Hidden Markov Model for Complex Social Systems

The power of social values to shape our behavior patterns is inevitable, and how we surreptitiously respond to the hidden curriculum deriving from such social values in our decision making can be just as significant. Through a machine learning approach, we are able to discover the agent dynamics that drive the evolution of social groups in a community. We set up the problem by introducing an agent-based hidden Markov model, in which the acts of an agent are determined by micro-laws with unknown parameters. To solve the problem, we develop a multistage learning process for determining the micro-laws of a community based on an observed set of communications between actors, without the semantic contents. We present the results of extensive experiments on synthetic data as well as some results on real communities, e.g., Enron email and movie newsgroups.

Hung-Ching Chen, Mark Goldberg, Malik Magdon-Ismail, William A. Wallace
Effects of Neighbourhood Structure on Evolution of Cooperation in N-Player Iterated Prisoner’s Dilemma

In multi-agent systems, complex and dynamic interactions often emerge among individual agents. The ability of each agent to learn adaptively is therefore important for survival in such a changing environment. In this paper, we consider the effects of neighbourhood structure on the evolution of cooperative behaviour in the N-Player Iterated Prisoner's Dilemma (NIPD). We simulate the NIPD as a bidding game on a two-dimensional grid world, where each agent bids against its neighbours based on a chosen game strategy. We conduct experiments with three different types of neighbourhood structure, namely the triangular neighbourhood structure, the rectangular neighbourhood structure and random pairing. Our results show that cooperation does emerge under the triangular neighbourhood structure, but defection prevails under the rectangular neighbourhood structure as well as under random pairing.

Raymond Chiong, Sandeep Dhakal, Lubo Jankovic
Interface Agents’ Design for a DRT Transportation System Using PASSI

The present work continues a longer line of research on flexible transportation services and the design of an agent system devoted to the planning, scheduling and control of trips in such a domain. In particular, this paper focuses on the design and development of the interface agents present in the system, following the agent development methodology PASSI. The interface agent devoted to interaction with the customers is explained in detail and its prototype is shown.

Claudio Cubillos, Sandra Gaete
A Multi-agent System Approach to Power System Topology Verification

The paper deals with power system topology verification, an important problem in real-time power system modeling. An approach using a multi-agent system is proposed. First, the benefits of utilizing agent technology are presented. Then, a theoretical background for power system topology verification is described. Next, the multi-agent system for topology verification and the functions of the particular agents are characterized. Finally, the features of the presented approach to power system topology verification are summed up.

Kazimierz Wilkosz

Financial Engineering and Modelling

A System for Efficient Portfolio Management

In this work we perform an automatic data survey to draw up an optimum portfolio and to automate the one-year forecast of a portfolio's payoff and risk, showing the advantages of using formally grounded models in portfolio management and of adopting a strategy that ensures a high rate of return at minimum risk. The use of neural networks provides an interesting alternative to the statistical classifier: we can take a decision on the purchase or sale of a given asset by using a neural network to classify the process into three decisions: buy, sell or do nothing.

Vivian F. López, Luis Alonso, María N. Moreno, Saddys Segrera, Alfredo Belloso
Partitioning-Clustering Techniques Applied to the Electricity Price Time Series

Clustering is used to generate groupings of data from a large dataset, with the intention of representing the behavior of a system as accurately as possible. In this sense, clustering is applied in this work to extract useful information from the electricity price time series. To be precise, two clustering techniques, K-means and Expectation Maximization, have been utilized for the analysis of the price curve, demonstrating that these techniques are effective in splitting the whole year into different groups of days according to their price behavior. This information will later be used to predict prices in the short term. The prices exhibited a remarkable resemblance among days within the same season and can be split into two major kinds of clusters: working days and festivities.
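
A minimal sketch of the day-clustering step, treating each day's 24 hourly prices as one feature vector (illustrative only: the data below are random placeholders, and the paper also applies Expectation Maximization to real market data):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
prices = rng.random((365, 24))   # placeholder: one row of hourly prices per day

# Two clusters, e.g. working days vs. festivities.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(prices)
print(np.bincount(labels))       # number of days assigned to each cluster
```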

F. Martínez-Álvarez, A. Troncoso, J. C. Riquelme, J. M. Riquelme
Time-Series Prediction Using Self-Organising Mixture Autoregressive Network

In the past few years, various variants of the self-organising map (SOM) have been proposed to extend its ability to model time series or temporal sequences. Most of them, however, have little connection to, or are over-simplified versions of, autoregressive (AR) models. In this paper, a new extension, termed the self-organising mixture autoregressive (SOMAR) network, is proposed to topologically cluster time-series segments into underlying generating AR models. It uses autocorrelation values as the similarity measure between the model and the time-series segments. Such networks can be used for modelling nonstationary time series. Experiments on predicting artificial time series (Mackey-Glass) and real-world data (foreign exchange rates) are presented, and the results show that the proposed SOMAR network is viable and superior to other SOM-based approaches.
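
A sketch of the autocorrelation-based similarity idea: two segments are compared through the distance between their first few autocorrelation coefficients (the SOMAR network's update rules are beyond this snippet, and the lag count is a placeholder):

```python
import numpy as np

def autocorr(x, max_lag=5):
    """First max_lag autocorrelation coefficients of a segment."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = (x * x).sum()
    if denom == 0:                        # constant segment
        return np.zeros(max_lag)
    return np.array([(x[:-k] * x[k:]).sum() / denom
                     for k in range(1, max_lag + 1)])

def ac_distance(seg_a, seg_b, max_lag=5):
    """Dissimilarity of two segments via their autocorrelation signatures."""
    return np.linalg.norm(autocorr(seg_a, max_lag) - autocorr(seg_b, max_lag))
```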

He Ni, Hujun Yin
Adjusting the Generalized Pareto Distribution with Evolution Strategies – An application to a Spanish Motor Liability Insurance Database

The management of extreme events requires special consideration, as well as a sufficiently wide time horizon for solvency evaluation. Whereas their classical adjustment is usually carried out with distributions based on Extreme Value Theory (EVT) (namely, the Generalized Pareto Distribution), evolutionary techniques are tried here to fit the GPD parameters as an optimisation problem. The comparison between classical and evolutionary techniques highlights the accuracy of the evolutionary process. The data adjusted in this paper come from a Spanish motor liability insurance portfolio.

María J. Pérez-Fructuoso, Almudena García, Antonio Berlanga, José M. Molina
Independent Factor Reinforcement Learning for Portfolio Management

In this paper we propose to perform portfolio management using reinforcement learning (RL) and an independent factor model. Factors in the independent factor model are mutually independent and exhibit better predictability. RL is applied to each factor to capture temporal dependence and provide an investment suggestion for the factor. Optimal weights on the factors are found by a portfolio optimization method subject to the investment suggestions and general portfolio constraints. Experimental results and analysis are given to show that the proposed method performs better than two alternative portfolio management systems.

Jian Li, Kun Zhang, Laiwan Chan
Discrete Time Portfolio Selection with Lévy Processes

This paper analyzes discrete-time portfolio selection models with Lévy processes. We first implement portfolio models under the hypothesis that the vector of log-returns follows either a multivariate Variance Gamma model, a multivariate Normal Inverse Gaussian model, or a Brownian motion. In particular, we propose ex-ante and ex-post empirical comparisons from the point of view of different investors. We thus compare portfolio strategies considering different term structure scenarios and different distributional assumptions when unlimited short sales are allowed.

Cesarino Bertini, Sergio Ortobelli Lozza, Alessandro Staino

Agent-Based Approach to Service Sciences

Analyzing the Influence of Overconfident Investors on Financial Markets Through Agent-Based Model

In this research, we employ an agent-based model to analyze how asset prices are affected by investors' behavior. The analysis focuses on the influence of overconfident investors on the financial market. As a result of intensive analysis, we find that overconfident investors are generated in a bottom-up fashion in the market. Furthermore, we find that overconfident investors can contribute to market efficiency.

Hiroshi Takahashi, Takao Terano
Modularity, Product Innovation, and Consumer Satisfaction: An Agent-Based Approach

The importance of modularity in product innovation is analyzed in this paper. Through simulations with an agent-based modular economic model, we examine the significance of using a modular structure in new product designs in terms of its impact upon customer satisfaction and firms' competitiveness. To achieve this purpose, the automatically defined terminal is proposed and used to modify simple genetic programming.

Shu-Heng Chen, Bin-Tzong Chie
An Agent-Based Model of Interactions in the Payment Card Market

We develop an agent-based model of the competition between payment cards by focusing on the interactions between consumers and merchants that determine the subscription and usage of cards. We find that after a short period of time the market is dominated by a small number of cards, even though there are no significant differences between cards and the market is fully competitive. In contrast to the existing literature, we focus on the dynamics of market shares and the emergence of multi-homing rather than on equilibrium outcomes.

Biliana Alexandrova-Kabadjova, Andreas Krause, Edward Tsang
The Possibility of an Epidemic Meme Analogy for Web Community Population Analysis

The aim of this paper is to discuss the possibility of understanding human social interaction in web communities by analogy with a disease propagation model from epidemiology. When an article is submitted by an individual to a social web service, it potentially influences other participants. The submission sometimes starts a long and argumentative chain of articles, but often does not. This complex behavior makes the management of server resources difficult, and a more theoretical methodology is required. This paper tries to express these complex human dynamics by analogy with infection by a virus. In this first report, by fitting an epidemiological model to Bulletin Board System (BBS) logs in terms of a numerical triple, we show that the analogy is reasonable and beneficial, because it can estimate the community size even though only the submitters' information is observable.
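
A standard SIR model is the usual starting point for such an analogy; a minimal sketch of its dynamics (the abstract does not give the paper's actual model or fitted parameters, so the equations and numbers below are assumptions):

```python
import numpy as np
from scipy.integrate import odeint

def sir(state, t, beta, gamma):
    """Classic SIR dynamics: S' = -beta*S*I, I' = beta*S*I - gamma*I,
    R' = gamma*I, with S + I + R constant."""
    s, i, r = state
    return [-beta * s * i, beta * s * i - gamma * i, gamma * i]

t = np.linspace(0, 30, 300)
trajectory = odeint(sir, y0=[0.99, 0.01, 0.0], t=t, args=(0.9, 0.2))
# trajectory[:, 1]: fraction of "infected" (actively posting) participants,
# the quantity one would fit against observed submission counts.
print(trajectory[:, 1].max())
```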

Masao Kubo, Keitaro Naruse, Hiroshi Sato, Takashi Matubara
The Econometric Analysis of Agent-Based Models in Finance: An Application

This paper illustrates how to compare different agent-based models and how to compare an agent-based model with real data. As examples we investigate ARFIMA models, the probability density function, and the spectral density function. We illustrate the methodology in an analysis of the agent-based model developed by Levy, Levy and Solomon (2000), and confront it with the S&P 500 for a comparison with real-life data.

Youwei Li, Bas Donkers, Bertrand Melenberg
Short Run Dynamics in an Artificial Futures Market with Human Subjects

This paper presents the computational results obtained in strategy experiments in an artificial futures market with human subjects. Participants submit their own strategy files and receive the performances of all the market participants in order to improve for the next round. After the two-round experiments, simulations with only machine agents are run. We find that the time series data support so-called stylized facts in some regards and that the experiments with human subjects seem to bring prices closer to the theoretical value.

Takashi Yamada, Yusuke Koyama, Takao Terano
Video-Based Conjoint Analysis and Agent Based Simulation for Estimating Customer’s Behavior

Conjoint analysis is a statistical technique for revealing customers' hidden preferences using a series of questions about tradeoffs in products. In this paper, we propose a new variant of this technique that uses product layouts and customers' actions in a store instead of conjoint cards and answers. We demonstrate the effectiveness of this method by building an agent-based in-store simulator that can reproduce the congestion in a store. The parameters of the agents in the simulator were determined by our technique, video-based conjoint analysis.

Hiroshi Sato, Masao Kubo, Akira Namatame
Effect of the Number of Users and Bias of Users’ Preference on Recommender Systems

A recommender system offers a target user certain products adapted to them, selected from a large number of products. One of the most successful recommendation algorithms is collaborative filtering, which is used on many websites. However, the recommendation result is influenced by community characteristics such as the number of users and the bias of users' preferences, because the system uses the users' product ratings when making recommendations.

In this paper, we evaluate the effect of community characteristics on a recommender system using multi-agent based simulation. The results show that a certain number of ratings is necessary for effective recommendation based on collaborative filtering. Moreover, the results also indicate that the number of ratings necessary for recommendation depends on the number of users and the bias of the users' preferences.

Akihiro Yamashita, Hidenori Kawamura, Hiroyuki Iizuka, Azuma Ohuchi
Exploring Quantitative Evaluation Criteria for Service and Potentials of New Service in Transportation: Analyzing Transport Networks of Railway, Subway, and Waterbus

This paper explores quantitative evaluation criteria for service and the potential for new services from the transportation viewpoint. For this purpose, we analyze the transport networks of railway, subway, and waterbus, and reveal the following implications: (1) the efficiency criterion proposed by Latora [7,8] and the centrality criterion from the complex network literature can be applied as quantitative evaluation criteria for service in the transportation domain; and (2) new services are highly embedded among networks, i.e., the analyses of the combined networks have great potential for finding new services that cannot be found by analyzing a single network.

Keiki Takadama, Takahiro Majima, Daisuke Watanabe, Mitsujiro Katsuhara

Neural-evolutionary Fusion Algorithms and Their Applications

Saw-Tooth Algorithm Guided by the Variance of Best Individual Distributions for Designing Evolutionary Neural Networks

This paper proposes a diversity generating mechanism for an evolutionary algorithm that determines the basic structure of Multilayer Perceptron (MLP) classifiers and simultaneously estimates the coefficients of the models. We apply a modified version of a recently proposed diversity enhancement mechanism [1] that uses a variable population size and periodic partial reinitializations of the population in the form of a saw-tooth function. Our improvement on this standard scheme consists of guiding the saw-tooth reinitializations by the variance of the best individuals in the population, restarting the population when the difference in variance between two consecutive generations is lower than a percentage of the previous variance. Empirical results over six benchmark datasets show that the proposed mechanism outperforms the standard saw-tooth algorithm. Moreover, the results are very promising in terms of classification accuracy, yielding state-of-the-art performance.
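
A sketch of the baseline saw-tooth schedule of [1], with the paper's variance-guided restart reduced to a simple test (the percentage threshold and schedule parameters below are placeholders, not the paper's settings):

```python
def sawtooth_population(gen, pop_max=100, pop_min=20, period=15):
    """Baseline saw-tooth schedule: the population shrinks linearly from
    pop_max to pop_min, then is partially reinitialised back to pop_max."""
    phase = gen % period
    return pop_max - (pop_max - pop_min) * phase // (period - 1)

def should_restart(var_prev, var_curr, pct=0.1):
    """The paper's modification replaces the fixed period with this kind of
    test: restart when the variance of the best individuals changes by less
    than a percentage of the previous generation's variance."""
    return abs(var_prev - var_curr) < pct * var_prev

print([sawtooth_population(g) for g in range(16)])  # 100 ... 20, then 100
```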

Pedro Antonio Gutiérrez, César Hervás, Manuel Lozano
Using a Genetic Algorithm for Editing k-Nearest Neighbor Classifiers

The edited k-nearest neighbor classifier consists of the application of the k-nearest neighbor classifier with an edited training set in order to reduce the classification error rate. This edited training set is a subset of the complete training set from which some of the training patterns are excluded. In recent works, genetic algorithms have been successfully applied to generate edited sets. In this paper we propose three improvements to the design of the edited k-nearest neighbor classifier using genetic algorithms: the use of a mean square error based objective function, the implementation of a clustered crossover, and a fast smart mutation scheme. Results achieved using the breast cancer and diabetes databases from the UCI machine learning benchmark repository demonstrate the improvement achieved by the joint use of these three proposals.
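
A sketch of the generic GA-edited kNN scheme the paper builds on: a bitmask over the training set is evolved, with fitness given by the validation accuracy of kNN restricted to the selected patterns (the paper's MSE-based objective, clustered crossover, and fast smart mutation are not reproduced here):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def fitness(mask, X_tr, y_tr, X_val, y_val, k=3):
    """Accuracy of kNN trained only on the patterns selected by the bitmask."""
    if mask.sum() < k:
        return 0.0
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr[mask], y_tr[mask])
    return knn.score(X_val, y_val)

def edit_with_ga(X_tr, y_tr, X_val, y_val, pop=30, gens=50, seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_tr)
    population = rng.random((pop, n)) < 0.5          # random initial bitmasks
    for _ in range(gens):
        scores = np.array([fitness(m, X_tr, y_tr, X_val, y_val)
                           for m in population])
        order = scores.argsort()[::-1]
        parents = population[order[: pop // 2]]      # truncation selection
        cuts = rng.integers(1, n, size=pop // 2)     # one-point crossover
        children = np.array([np.concatenate([a[:c], b[c:]])
                             for a, b, c in zip(parents,
                                                np.roll(parents, 1, axis=0),
                                                cuts)])
        flips = rng.random(children.shape) < 1.0 / n  # bit-flip mutation
        population = np.vstack([parents, children ^ flips])
    return max(population, key=lambda m: fitness(m, X_tr, y_tr, X_val, y_val))
```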

R. Gil-Pita, X. Yao
An Evolution of Geometric Structures Algorithm for the Automatic Classification of HRR Radar Targets

This paper presents a novel approach to solving multiclass classification problems using purely evolutionary techniques. The proposed approach, called the Evolution of Geometric Structures (EGS) algorithm, consists of the evolution of several geometric structures such as hypercubes, hyperspheres and hyperoctahedrons to obtain a first division of the sample space, which is then re-evolved in a second step in order to resolve samples belonging to two or more structures. We apply the EGS algorithm to a well-known multiclass classification problem and compare our approach with several existing classification algorithms.

Leopoldo Carro-Calvo, Sancho Salcedo-Sanz, Roberto Gil-Pita, Antonio Portilla-Figueras, Manuel Rosa-Zurera
Hybrid Cross-Entropy Method/Hopfield Neural Network for Combinatorial Optimization Problems

This paper presents a novel hybrid algorithm for combinatorial optimization problems that mixes the cross-entropy (CE) method and a Hopfield neural network. The algorithm uses the CE method as a global search procedure, whereas the Hopfield network is used to satisfy the constraints associated with the problems. We show the validity of our approach on several instances of the generalized frequency assignment problem.

Emilio G. Ortiz-García, Ángel M. Pérez-Bellido
Backmatter
Metadata
Title: Intelligent Data Engineering and Automated Learning - IDEAL 2007
Edited by: Hujun Yin, Peter Tino, Emilio Corchado, Will Byrne, Xin Yao
Copyright year: 2007
Publisher: Springer Berlin Heidelberg
Electronic ISBN: 978-3-540-77226-2
Print ISBN: 978-3-540-77225-5
DOI: https://doi.org/10.1007/978-3-540-77226-2