
2004 | Book

Intelligent Data Engineering and Automated Learning – IDEAL 2004

5th International Conference, Exeter, UK, August 25-27, 2004. Proceedings

Editors: Zheng Rong Yang, Hujun Yin, Richard M. Everson

Publisher: Springer Berlin Heidelberg

Book Series: Lecture Notes in Computer Science


Table of Contents

Frontmatter

Bioinformatics

Modelling and Clustering of Gene Expressions Using RBFs and a Shape Similarity Metric

This paper introduces a novel approach for gene expression time-series modelling and clustering using neural networks and a shape similarity metric. Modelling gene expressions with Radial Basis Function (RBF) neural networks is proposed to produce a more general and smooth characterisation of the series. Furthermore, we show that using the correlation coefficient of the derivatives of the modelled profiles allows profiles to be compared on the basis of their shapes and the distributions of their time points. The series are grouped into similarly shaped profiles using a correlation-based fuzzy clustering algorithm. A well-known dataset is used to demonstrate the proposed approach, and a set of known genes is used as a benchmark to evaluate its performance. The results show biological relevance and indicate that the proposed method is a useful technique for gene expression time-series analysis.

Carla S. Möller-Levet, Hujun Yin
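
The core of the shape metric above is that derivatives, not raw values, are correlated, which makes the comparison insensitive to amplitude and offset. A minimal sketch of that idea (not the authors' code; plain NumPy arrays stand in for the RBF-smoothed profiles):

import numpy as np

def shape_similarity(profile_a, profile_b, t):
    # Pearson correlation of the first derivatives of two smoothed
    # expression profiles sampled at (possibly uneven) time points t.
    da = np.gradient(profile_a, t)   # slope of profile A at each time point
    db = np.gradient(profile_b, t)
    return np.corrcoef(da, db)[0, 1]

# Toy usage: same shape, different amplitude and offset
t = np.array([0.0, 1.0, 2.0, 4.0, 7.0])
a = np.sin(t)
b = 3.0 * np.sin(t) + 1.0
print(shape_similarity(a, b, t))  # close to 1.0: identical shapes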
A Novel Hybrid GA/SVM System for Protein Sequences Classification

A novel hybrid genetic algorithm (GA)/support vector machine (SVM) system, which selects features from protein sequences and trains the SVM classifier simultaneously using a multi-objective genetic algorithm, is proposed in this paper. The system is applied to classify protein sequences obtained from the Protein Information Resource (PIR) protein database. Experimental results over six protein superfamilies are reported, showing that the proposed hybrid GA/SVM system outperforms BLAST and HMMer.

Xing-Ming Zhao, De-Shuang Huang, Yiu-ming Cheung, Hong-qiang Wang, Xin Huang
Building Genetic Networks for Gene Expression Patterns

Building genetic regulatory networks from time-series data of gene expression patterns is an important topic in bioinformatics. Probabilistic Boolean networks (PBNs) have been developed as a model of gene regulatory networks. PBNs are able to cope with uncertainty, incorporate rule-based dependencies between genes, and uncover the relative sensitivity of genes in their interactions with other genes. However, PBNs are rarely used in practice because of the huge number of possible predictors and their computed probabilities. In this paper, we propose a multivariate Markov chain model to govern the dynamics of a genetic network for gene expression patterns. The model preserves the strength of PBNs and reduces the complexity of the networks. The number of model parameters is quadratic in the number of genes. We also develop an efficient estimation method for the model parameters. Simulation results on yeast data are given to illustrate the effectiveness of the model.

Wai-Ki Ching, Eric S. Fung, Michael K. Ng
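
A multivariate Markov chain of this kind can be pictured as a weighted mixture of pairwise gene-to-gene transitions. A hedged sketch of one prediction step (the names step, P, and Lam are illustrative; the paper's parameter-estimation procedure is omitted):

import numpy as np

def step(states, P, Lam):
    # One step of a multivariate Markov chain over n genes.
    # states: list of n probability vectors (current state of each gene)
    # P: P[i][j] is a column-stochastic matrix predicting gene i from gene j
    # Lam: n x n mixture weights, rows non-negative and summing to one
    n = len(states)
    return [sum(Lam[i][j] * P[i][j] @ states[j] for j in range(n))
            for i in range(n)]

# Toy example: two genes, two expression levels (low/high)
P = [[np.array([[0.9, 0.2], [0.1, 0.8]]), np.array([[0.5, 0.5], [0.5, 0.5]])],
     [np.array([[0.6, 0.3], [0.4, 0.7]]), np.array([[0.8, 0.1], [0.2, 0.9]])]]
Lam = np.array([[0.7, 0.3], [0.4, 0.6]])
x0 = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
print(step(x0, P, Lam))  # next-step distribution for each gene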
SVM-Based Classification of Distant Proteins Using Hierarchical Motifs

This article presents a discriminative approach to protein classification in the particular case of remote homology. The protein family is modelled by a set M of motifs related to the physicochemical properties of the residues. We propose an algorithm for discovering motifs based on the ascending hierarchical classification paradigm. The set M defines a feature space for the sequences: each sequence is transformed into a vector that indicates the possible presence of the motifs belonging to M. We then use the SVM learning method to discriminate the target family. Our hierarchical motif set specifically models interleukins among all the structural families of the SCOP database. Our method yields significantly better remote protein classification than spectrum kernel techniques.

Jérôme Mikolajczack, Gérard Ramstein, Yannick Jacques
Knowledge Discovery in Lymphoma Cancer from Gene–Expression

A comprehensive study of the database used by Alizadeh et al. [7], on the identification of lymphoma cancer subtypes within Diffuse Large B-Cell Lymphoma (DLBCL), is presented in this paper, focused on both the feature selection and classification tasks. Firstly, we tackle the identification of genes relevant to the prediction of lymphoma cancer types, and then the discovery of the most relevant genes for the Activated B-Like Lymphoma and Germinal Centre B-Like Lymphoma subtypes within DLBCL. Afterwards, decision trees provide knowledge models to predict both types of lymphoma and subtypes within DLBCL. The main conclusion of our work is that the data may be insufficient to exactly predict lymphoma or even to extract functionally relevant genes.

Jesús S. Aguilar-Ruiz, Francisco Azuaje
A Method of Filtering Protein Surface Motifs Based on Similarity Among Local Surfaces

We have developed a system called SUrface MOtif mining MOdule (SUMOMO) for extracting surface motifs from a protein molecular surface database. However, SUMOMO tends to extract a large number of surface motifs, making it difficult to distinguish whether they are true active sites. Since active sites from proteins having a particular function have similar shapes and physical properties, proteins can be classified based on similarity among local surfaces. Thus, motifs extracted from proteins in the same group can be considered significant, and the rest can be filtered out. The proposed method is applied to 3,183 surface motifs extracted from 15 proteins belonging to each of four function groups. As a result, the number of motifs is reduced to 14.1% without eliminating important motifs that correspond to the four functional sites.

Nripendra Lal Shrestha, Youhei Kawaguchi, Tadasuke Nakagawa, Takenao Ohkawa
Qualified Predictions for Proteomics Pattern Diagnostics with Confidence Machines

In this paper, we focus on the problem of prediction with confidence and describe the recently developed transductive confidence machines (TCM). TCM allows us to make predictions within predefined confidence levels, thus providing a controlled and calibrated classification environment. We apply the TCM to the problem of proteomics pattern diagnostics. We demonstrate that the TCM performs well, yielding accurate, well-calibrated and informative predictions in both online and offline learning settings.

Zhiyuan Luo, Tony Bellotti, Alex Gammerman
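
The calibrated prediction idea behind TCMs can be made concrete: for each tentative label of a new example, compute a transductive p-value from nonconformity scores, and output the set of labels that are not rejected at the chosen confidence level. A toy sketch assuming a 1-NN nonconformity measure (a common choice in the TCM literature, not necessarily the one used in the paper):

import numpy as np

def tcm_predict(train_x, train_y, x_new, labels, confidence=0.95):
    # Nonconformity: distance to nearest same-label example divided by
    # distance to nearest other-label example (higher = stranger).
    def score(X, y, i):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf
        return d[y == y[i]].min() / d[y != y[i]].min()

    region = []
    for lab in labels:
        X = np.vstack([train_x, x_new])
        y = np.append(train_y, lab)
        scores = np.array([score(X, y, i) for i in range(len(y))])
        # p-value: fraction of examples at least as nonconforming as x_new
        p = (scores >= scores[-1]).mean()
        if p > 1.0 - confidence:
            region.append(lab)
    return region  # labels not rejected at the given confidence level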
An Assessment of Feature Relevance in Predicting Protein Function from Sequence

Improving the performance of protein function prediction is the ultimate goal for a bioinformatician working in functional genomics. The classical prediction approach is to employ pairwise sequence alignments. However, this method often faces difficulties when no statistically significant homologous sequences are identified. An alternative way is to predict protein function from sequence-derived features using machine learning. In this case the choice of possible features which can be derived from the sequence is of vital importance to ensure adequate discrimination to predict function. In this paper we show that carefully assessing the discriminative value of derived features by performing feature selection improves the performance of the prediction classifiers by eliminating irrelevant and redundant features. The subset selected from the available features has also been shown to be biologically meaningful, as it corresponds to features that have commonly been employed to assess biological function.

Ali Al-Shahib, Chao He, Aik Choon Tan, Mark Girolami, David Gilbert
A New Artificial Immune System Algorithm for Clustering

This paper describes a new artificial immune system (AIS) algorithm for data clustering. The proposed algorithm resembles CLONALG, a widely used AIS algorithm, but is much simpler, as it uses one-shot learning and omits cloning. The algorithm is tested on four simulated and two benchmark data sets for data clustering. Experimental results indicate that it produces the correct clusters for these data sets.

Reda Younsi, Wenjia Wang
The Categorisation of Similar Non-rigid Biological Objects by Clustering Local Appearance Patches

A novel approach is presented to the categorisation of non-rigid biological objects from unsegmented scenes in an unsupervised manner. The biological objects investigated are five phytoplankton species from the coastal waters of the European Union. The high morphological variability within each species and the high similarity between species make the categorisation task a challenge for both marine ecologists and machine vision systems. The framework developed takes a local appearance approach to learn the object model, using a novel clustering algorithm with minimal supervised information. Test objects are classified based on matches with local patches of high occurrence. Experiments show that the method achieves good results, given the difficulty of the task.

Hongbin Wang, Phil F. Culverhouse
Unsupervised Dense Regions Discovery in DNA Microarray Data

In this paper, we introduce the notion of dense regions in DNA microarray data and present algorithms for discovering them. We demonstrate that dense regions are of statistical and biological significance through experiments. A dataset containing gene expression levels of 23 primate brain samples is employed to test our algorithms. Subsets of potential genes distinguishing between species and a subset of samples with potential abnormalities are identified.

Andy M. Yip, Edmond H. Wu, Michael K. Ng, Tony F. Chan
Visualisation of Distributions and Clusters Using ViSOMs on Gene Expression Data

Microarray datasets are often too large to visualise due to their high dimensionality. The self-organising map (SOM) has been found useful for analysing massive, complex datasets and can be used for clustering, visualisation, and dimensionality reduction. However, for visualisation purposes the SOM uses colouring schemes as a means of marking cluster boundaries on the map, and the distribution of the data and the cluster structures are not faithfully portrayed. In this paper we apply the recently proposed visualisation induced Self-Organising Map (ViSOM), which directly preserves the inter-point distances of the input data on the map as well as the topology. The ViSOM algorithm regularises the neurons so that the distances between them are proportional in both the data space and the map space. The results are similar to Sammon mappings but with improved detail on gene distributions and greater flexibility in handling nonlinearity. The method is also more suitable for larger datasets.

Swapna Sarvesvaran, Hujun Yin
Prediction of Implicit Protein-Protein Interaction by Optimal Associative Feature Mining

Proteins are known to perform their biological functions by interacting with other proteins or compounds. Since protein-protein interaction is intrinsic to most cellular processes, protein interaction prediction is an important issue in post-genomic biology, where abundant interaction data have been produced by many research groups. In this paper, we present an associative feature mining method to predict implicit protein-protein interactions of S. cerevisiae from public protein-protein interaction data. To overcome the dimensionality problem of conventional data mining approaches, we employ a feature dimension reduction filter (FDRF) method based on information theory to select optimal informative features and to speed up the overall mining procedure. As the mining method to predict interactions, we use an association rule discovery algorithm for associative feature and rule mining. Using the discovered associative features, we predict implicit protein interactions that have not been observed in the training data. According to the experimental results, the proposed method achieves about 94.8% prediction accuracy with reduced computation time, 32.5% faster than a conventional method without feature filtering.

Jae-Hong Eom, Jeong-Ho Chang, Byoung-Tak Zhang
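
The FDRF stage is described as an information-theoretic filter. As a hedged illustration of that kind of filter (the paper's exact criterion may differ; fdrf_filter and mutual_information are names chosen here), discrete features can be ranked by their mutual information with the class and all but the top k dropped before rule mining:

import numpy as np
from collections import Counter

def mutual_information(x, y):
    # I(X;Y) for two discrete sequences, in bits.
    n = len(x)
    pxy = Counter(zip(x, y))
    px, py = Counter(x), Counter(y)
    return sum((c / n) * np.log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def fdrf_filter(features, target, k):
    # Keep the k features most informative about the target.
    # features: dict mapping feature name -> discrete value sequence.
    ranked = sorted(features,
                    key=lambda f: mutual_information(features[f], target),
                    reverse=True)
    return ranked[:k]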
Exploring Dependencies Between Yeast Stress Genes and Their Regulators

An environmental stress response gene should, by definition, have common properties in its behavior across different stress treatments. We search for such common properties by models that maximize common variation, and explore potential regulators of the stress response by further maximizing mutual information with transcription factor binding data. A computationally tractable combination of generalized canonical correlations and clustering that searches for dependencies is proposed and shown to find promising sets of genes and their potential regulators.

Janne Nikkilä, Christophe Roos, Samuel Kaski
Poly-transformation

Poly-transformation is the extension of the idea of ensemble learning to the transformation step of Knowledge Discovery in Databases (KDD). In poly-transformation multiple transformations of the data are made before learning (data mining) is applied. The theoretical basis for poly-transformation is the same as that for other combining methods – using different predictors to remove uncorrelated errors. It is not possible to demonstrate the utility of poly-transformation using standard datasets, because no pre-transformed data exists for such datasets. We therefore demonstrate its utility by applying it to a single well-known hard problem for which we have expertise – the problem of predicting protein secondary structure from primary structure. We applied four different transformations of the data, each of which was justifiable by biological background knowledge. We then applied four different learning methods (linear discrimination, back-propagation, C5.0, and learning vector quantization) both to the four transformations, and to combining predictions from the different transformations to form the poly-transformation predictions. Each of the learning methods produced significantly higher accuracy with poly-transformation than with only a single transformation. Poly-transformation is the basis of the secondary structure prediction method Prof, which is one of the most accurate existing methods for this problem.

Ross D. King, Mohammed Ouali
Prediction of Natively Disordered Regions in Proteins Using a Bio-basis Function Neural Network

Recent studies have found that many proteins contain regions that do not form well defined three-dimensional structures in their native states. The study and detection of such disordered regions is very important both for facilitating structural analysis and to aid understanding of protein function. A newly developed pattern recognition algorithm termed a “Bio-basis Function Neural Network” has been applied to the detection of disordered regions in proteins. Different models were trained studying the effect of changing the size of the window used for residue classification. Ten-fold cross validation showed that the estimated prediction accuracy was 95.2% for a window size of 21 residues and an overlap threshold of 30%. Blind tests using the trained models on a data set unrelated to the training set gave a regional prediction accuracy of 81.4% (± 0.9%).

Rebecca Thomson, Robert Esnouf
The Effect of Image Compression on Classification and Storage Requirements in a High-Throughput Crystallization System

High-throughput crystallization and imaging facilities can require a huge amount of disk space to keep images on-line. Although compressed images can look very similar to the human eye, the effect on the performance of crystal detection software needs to be analysed. This paper tests the use of common lossy and lossless compression algorithms on image file size and on the performance of the York University image analysis software by comparison of compressed Oxford images with their native, uncompressed bitmap images. This study shows that significant (approximately 4-fold) space savings can be gained with only a moderate effect on classification capability.

Ian Berry, Julie Wilson, Chris Mayo, Jon Diprose, Robert Esnouf
PromSearch: A Hybrid Approach to Human Core-Promoter Prediction

This paper presents an effective core-promoter prediction system for human DNA sequences. The system, named PromSearch, employs a hybrid approach that combines search-by-content and search-by-signal methods. Global statistics of promoter-specific content are included to represent new significant information underlying the proximal and downstream regions around the transcription start site (TSS) of a DNA sequence. Local signal features such as the TATA box and the CAAT box are encoded by the position weight matrix (PWM) method. In experiments on the sequence set from the review by J. W. Fickett, PromSearch shows a 47% positive predictive value, which surpasses most previously reported systems. On large genomic sequences, it shows a reduced false-positive rate while preserving the true-positive rate.

Byoung-Hee Kim, Seong-Bae Park, Byoung-Tak Zhang

Data Mining and Knowledge Engineering

Synergy of Logistic Regression and Support Vector Machine in Multiple-Class Classification

In this paper, we focus on multiple-class classification problems. By using polytomous logistic regression and support vector machines together, we obtain a hybrid multi-class classifier with very promising results in terms of classification accuracy. Usually, a multiple-class classifier can be built using many binary classifiers as its construction bases. Those binary classifiers might be trained in either a one-versus-one or a one-versus-others manner, and the final classifier is constructed by some kind of "leveraging" method, such as majority vote, weighted vote, or regression. Here, we propose a new way of constructing binary classifiers, which can take the relationships between classes into consideration, for example, the level of severity of a disease in medical diagnostics. Depending on the methods used for constructing the binary classifiers, the final classifier is constructed/assembled by nominal, ordinal or even more sophisticated polytomous logistic regression techniques. This hybrid method has been applied to many real-world benchmark data sets, and the results show that it is very promising and outperforms classifiers using the support vector machine technique alone.

Yuan-chin Ivar Chang, Sung-Chiang Lin
Deterministic Propagation of Blood Pressure Waveform from Human Wrists to Fingertips

Feeling the pulse at the wrist is one of the most important diagnostic methods in traditional Chinese medicine (TCM). In this paper we test whether there is any difference between feeling the pulse at the wrist and at other parts of the body, such as the fingertips. To do this we employ optimal neural networks, estimated by description length, to model blood pressure propagation from the wrist to the fingertip, and then apply the method of surrogate data to the residuals of this model. Our results indicate that for healthy subjects, measuring the pulse waveform at the fingertip is equivalent to feeling the pulse on the lateral artery (wrist).

Yi Zhao, Michael Small
Pre-pruning Decision Trees by Local Association Rules

This paper proposes a pre-pruning method for decision trees called KLC4. Based on KL divergence, our method drops candidate attributes irrelevant to classification. We compare our technique to conventional ones and show its usefulness experimentally.

Tomoya Takamitsu, Takao Miura, Isamu Shioya
A New Approach for Selecting Attributes Based on Rough Set Theory

Decision trees are widely used in data mining and machine learning for classification. In the process of constructing a tree, the criterion used to select partitioning attributes influences the classification accuracy of the tree. In this paper, we present a new measure for choosing attributes, weighted mean roughness, which is based on rough set theory. The experimental results show that, compared with the entropy-based approach, our approach is a better way to select nodes for constructing decision trees.

Jiang Yun, Li Zhanhuai, Zhang Yang, Zhang Qiang
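
As a rough-set refresher, roughness measures how imprecisely the partition induced by an attribute approximates the decision classes (0 = crisp, 1 = completely rough). The abstract does not define the weighting, so the size-weighted average in this sketch is an assumption, shown only to fix ideas:

from collections import defaultdict

def weighted_mean_roughness(rows, attr, decision):
    # rows: list of dicts; attr/decision: column names.
    blocks = defaultdict(list)          # equivalence classes of `attr`
    for r in rows:
        blocks[r[attr]].append(r)
    total, acc = len(rows), 0.0
    for c in {r[decision] for r in rows}:
        members = [r for r in rows if r[decision] == c]
        lower = sum(len(b) for b in blocks.values()
                    if all(r[decision] == c for r in b))
        upper = sum(len(b) for b in blocks.values()
                    if any(r[decision] == c for r in b))
        acc += (len(members) / total) * (1.0 - lower / upper)
    return acc  # lower is better: the attribute separates classes more crisply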
A Framework for Mining Association Rules in Data Warehouses

Mining association rules, especially in real-world business applications, is a significantly important data mining task. Recently, association rule algorithms have been developed to cope with multidimensional data. In this paper we are concerned with mining association rules in data warehouses, focusing on the measurement of summarized data. We propose two algorithms, HAvg and VAvg, to provide the initialization data for mining association rules in data warehouses by concentrating on the measurement of aggregate data. These algorithms are capable of efficiently extracting initialized data from data warehouses and are used for mining association rules in data warehouses.

Haorianto Cokrowijoyo Tjioe, David Taniar
Intelligent Web Service Discovery in Large Distributed System

Web services are the new paradigm for distributed computing. Traditional centralized indexing schemes cannot scale to large distributed systems that need a scalable, flexible and robust discovery mechanism. In this paper, we use an ontology-based approach to capture real-world knowledge for the finest-granularity annotation of Web services, which is the core of intelligent discovery. We use a distributed hash table (DHT) based catalog service in a P2P (peer-to-peer) system to index the ontology information and store the index at peers. We discuss the DHT-based service discovery model and the discovery procedure. Since DHTs support only exact-match lookups, we improve the matching algorithm for intelligent service discovery. The experiments show that the discovery model has good scalability and that the semantic annotation notably improves discovery exactness. The improved algorithms can discover the services that best match a request.

Shoujian Yu, Jianwei Liu, Jiajin Le
The Application of K-Medoids and PAM to the Clustering of Rules

Earlier research has resulted in the production of an ‘all-rules’ algorithm for data-mining that produces all conjunctive rules of above given confidence and coverage thresholds. While this is a useful tool, it may produce a large number of rules. This paper describes the application of two clustering algorithms to these rules, in order to identify sets of similar rules and to better understand the data.

Alan P. Reynolds, Graeme Richards, Vic J. Rayward-Smith
A Comparison of Texture Features for the Classification of Rock Images

Texture analysis plays a vital role in image understanding research. One of the key areas of research is to compare how well texture algorithms differentiate between textures. Traditionally, texture algorithms have been applied mostly to benchmark data, and some studies have found certain algorithms better suited for differentiating between certain types of textures. In this paper we compare seven well-established image texture analysis algorithms on the task of classifying rocks.

Maneesha Singh, Akbar Javadi, Sameer Singh
A Mixture of Experts Image Enhancement Scheme for CCTV Images

The main aim of this paper is to present a mixture of experts framework for the selection of an optimal image enhancement. This scheme selects the best image enhancement algorithm from a bank of algorithms on a per image basis. The results show that this scheme considerably improves the quality of test images collected from CCTV.

Maneesha Singh, Sameer Singh, Matthew Porter
Integration of Projected Clusters and Principal Axis Trees for High-Dimensional Data Indexing and Query

High-dimensional data indexing and query is a challenging problem due to the inherent sparsity of the data, and fast algorithms are urgently needed in this field. In this paper, an automatic subspace dimension selection (ASDS) based clustering algorithm is derived from the well-known projection-based clustering algorithm ORCLUS, and a two-level architecture for high-dimensional data indexing and query is proposed, which integrates projected clusters and principal axis trees (PAT) to generate efficient high-dimensional data indexes. The query performance of similarity search by ASDS+PAT, ORCLUS+PAT, PAT alone, and Clindex is compared on two high-dimensional data sets. The results show that the integration of ASDS and PAT is an efficient indexing architecture that considerably reduces query time.

Ben Wang, John Q. Gan
Unsupervised Segmentation on Image with JSEG Using Soft Class Map

A soft class map is presented for JSEG, and the definitions of J values and related quantities in JSEG are adjusted correspondingly. A method for constructing the soft class map is provided. JSEG with a soft class map is a more robust method for unsupervised image segmentation than the original JSEG, and can correctly segment images in which the underlying object regions contain smooth colour transitions.

Yuanjie Zheng, Jie Yang, Yue Zhou
DESCRY: A Density Based Clustering Algorithm for Very Large Data Sets

A novel algorithm, named DESCRY, for clustering very large multidimensional data sets with numerical attributes is presented. DESCRY discovers clusters of different shapes, sizes, and densities, even in the presence of noise, by first finding and clustering a small set of points, called meta-points, that well depict the shape of the clusters present in the data set. Final clusters are obtained by assigning each point to one of the partial clusters. The computational complexity of DESCRY is linear both in the data set size and in the data set dimensionality. Experiments show very good qualitative results, comparable with those obtained by state-of-the-art clustering algorithms.

Fabrizio Angiulli, Clara Pizzuti, Massimo Ruffolo
A Fuzzy Set Based Trust and Reputation Model in P2P Networks

Trust plays an important role in making collaborative decisions, so the trust problem in P2P networks has become a focus of research. In this paper, a fuzzy set based model is proposed for building trust and reputation in P2P collaborations based on observations and recommendations; using fuzzy sets, a peer can represent differentiated trust and combine different aspects of trust in collaborations. Based on a fuzzy similarity measure, a set of good recommenders can be maintained. The evaluation of the model using simulation experiments shows its effectiveness.

Zhang Shuqin, Lu Dongxin, Yang Yongtian
Video Based Human Behavior Identification Using Frequency Domain Analysis

The identification of human activity in video, for example whether a person is walking, clapping, or waving, is extremely important for video interpretation. Since different people perform the same action over different numbers of frames, matching training and test actions is not a trivial task. In this paper we discuss a new technique for video shot matching where the shots matched are of different sizes. The proposed technique is based on frequency domain analysis of feature data, and it is shown to achieve very high recognition accuracy on a number of different human actions with both synthetic and real-life data.

Jessica JunLin Wang, Sameer Singh
Mobile Data Mining by Location Dependencies

Mobile mining is about finding useful knowledge from the raw data produced by mobile users. The mobile environment consists of a set of static and mobile devices. Previous work in mobile data mining includes finding frequency patterns and group patterns; location dependency was not considered in previous work, although it is meaningful. The proposed method builds a user profile based on past mobile visiting data, filters it, and mines association rules. The more frequently the user profiles are updated, the more accurate the rules are. Our performance evaluation shows that as the number of characteristics increases, the number of rules increases dramatically; therefore, only the relevant characteristics should be carefully chosen, to ensure a manageable number of rules.

Jen Ye Goh, David Taniar
An Algorithm for Artificial Intelligence-Based Model Adaptation to Dynamic Data Distribution

Changes in data distribution between in-sample training and out-sample validation can be unavoidable due to the presence of random dynamic noise created by external, uncontrollable environmental factors. To compensate for the variation in data distribution, one approach is to recursively use the immediate past prediction error to augment the current data. This paper proposes a simple algorithm that ensures the parameter settings in an ANFIS model are adaptive to its unique data distribution. Such an 'open-ended' strategy allows the ANFIS to be more accurate in predicting chaotic time series. An application of the proposed procedure to predicting the Dow Jones Industrial Average index yielded better prediction accuracy than the conventional prediction model.

Vincent C. S. Lee, Alex T. H. Sim
On a Detection of Korean Prosody Phrases Boundaries

This paper describes an automatic detection technique for Korean accentual phrase boundaries using one-stage DP and a normalized pitch pattern. For producing the normalized pitch pattern, we propose a modified normalization method for spoken Korean. The results show that 76.4% of the accentual phrase boundaries are correctly detected, with a false detection rate of 14.7%. The results also show that accentual phrase detection by pattern matching achieves a higher detection rate than detection by LH tone.

Jong Kuk Kim, Ki Young Lee, Myung Jin Bae
A Collision Avoidance System for Autonomous Ship Using Fuzzy Relational Products and COLREGs

This paper presents a collision avoidance system for autonomous ships. Unlike the collision avoidance systems of other unmanned vehicles, a collision avoidance system for autonomous ships aims not only at deriving a reasonable and safe path to the goal but also at keeping the COLREGs (International Regulations for Preventing Collisions at Sea). Heuristic search based on fuzzy relational products is adopted to achieve the general purpose of a collision avoidance system: deriving a reasonable and safe path. The rule of "action to avoid collision" is adopted for the other necessary and sufficient condition: keeping the COLREGs.

Young-il Lee, Yong-Gi Kim
Engineering Knowledge Discovery in Network Intrusion Detection

The use of data mining techniques for intrusion detection (ID) is one of the ongoing issues in the field of computer security, but little attention has been paid to engineering ID activities. This paper presents a framework that models the ID process as a set of cooperative tasks, each supporting a specialized activity. Specifically, the framework organises raw audit data into a set of relational tables and applies data mining algorithms to generate intrusion detection models. Specialized components of a commercial DBMS have been used to validate the proposed approach. Results show that the framework works well in capturing patterns of intrusion, while the availability of an integrated software environment allows a high level of modularity in performing each task.

Andrea Bosin, Nicoletta Dessì, Barbara Pes
False Alarm Classification Model for Network-Based Intrusion Detection System

A network-based IDS (Intrusion Detection System) gathers network packet data and classifies it as attack or normal. However, such systems often output a large amount of low-level or incomplete alert information, which can be unmanageable and mixed with false alerts. In this paper we propose a false alarm classification model that reduces the false alarm rate using the classification analysis of data mining techniques. The model was implemented based on associative classification in the domain of DDoS attacks. We evaluated the false alarm classifier, deployed in front of Snort, with the DARPA 1998 dataset and verified the reduction of the false alarm rate. Our approach is useful for reducing false alerts and improving the detection rate of network-based intrusion detection systems.

Moon Sun Shin, Eun Hee Kim, Keun Ho Ryu
Exploiting Safety Constraints in Fuzzy Self-organising Maps for Safety Critical Applications

This paper defines a constrained Artificial Neural Network (ANN) that can be employed for highly-dependable roles in safety critical applications. The derived model is based upon the Fuzzy Self-Organising Map (FSOM) and enables behaviour to be described qualitatively and quantitatively. By harnessing these desirable features, behaviour is bounded through incorporation of safety constraints – derived from safety requirements and hazard analysis. The constrained FSOM has been termed a ’Safety Critical Artificial Neural Network’ (SCANN) and preserves valuable performance characteristics for non-linear function approximation problems. The SCANN enables construction of compelling (product-based) safety arguments for mitigation and control of identified failure modes. Illustrations of potential benefits for real-world applications are also presented.

Zeshan Kurd, Tim P. Kelly, Jim Austin
Surface Spatial Index Structure of High-Dimensional Space

This paper proposes a spatial index structure based on a new space-partitioning method. Previous research proposed various high dimensional index structures. However, when dimensionality becomes high, the effectiveness of the spatial index structure disappears. This problem is called the “curse of dimensionality”. This paper focuses on the fact that the volume of high dimensional space is mostly occupied by its surface and then proposes a new surface index structure. The utility of this new surface spatial index structure is illustrated through a series of experiments.

Jiyuan An, Yi-Ping Phoebe Chen, Qinying Xu
Generating and Applying Rules for Interval Valued Fuzzy Observations

One of the objectives of intelligent data engineering and automated learning is to develop algorithms that learn the environment, generate rules, and take possible courses of actions. In this paper, we report our work on how to generate and apply such rules with a rule matrix model. Since the environments can be interval valued and rules often fuzzy, we further study how to obtain and apply rules for interval valued fuzzy observations.

Andre de Korvin, Chenyi Hu, Ping Chen
Automatic Video Shot Boundary Detection Using Machine Learning

In this paper we present a machine learning system that can accurately predict the transitions between frames in a video sequence. We propose a set of novel features and describe how to use dominant features based on a coarse-to-fine strategy to accurately predict video transitions.

Wei Ren, Sameer Singh
On Building XML Data Warehouses

Developing a data warehouse for XML documents involves two major processes: creating it, by processing raw XML documents into a specified data warehouse repository; and querying it, by applying techniques that better answer users' queries. The methodology proposed in our paper for building XML data warehouses covers processes such as data cleaning and integration, summarization, intermediate XML documents, and updating/linking existing documents and creating fact tables.

Laura Irina Rusu, Wenny Rahayu, David Taniar
A Novel Method for Mining Frequent Subtrees from XML Data

In this paper, we focus on the problem of finding frequent subtrees in a large collection of XML data, where both the patterns and the data are modelled by labeled ordered trees. We present an efficient algorithm, RSTMiner, that computes all rooted subtrees appearing in a collection of XML data trees with frequency above a user-specified threshold, using a special structure called a Me-tree. In this algorithm, the Me-tree is used as a merging tree supplying schema information for efficient pruning and mining of frequent subtrees. The keys of the algorithm are efficient candidate pruning with the Me-tree structure and incrementally enumerating all rooted subtrees in canonical form based on an extended rightmost expansion technique. Experimental results show that the RSTMiner algorithm is efficient and scalable.

Wan-Song Zhang, Da-Xin Liu, Jian-Pei Zhang
Mining Association Rules Using Relative Confidence

Mining for association rules is one of the fundamental tasks of data mining. Association rule mining searches for interesting relationships amongst items in a given dataset based mainly on the support and confidence measures. Support is used for filtering out infrequent rules, while confidence measures the implication relationships from a set of items to one another. However, one of the main drawbacks of the confidence measure is that it presents an absolute value of implication that does not truthfully reflect the relationships amongst items. For example, if two items have a very high frequency, they will probably form a rule with high confidence even if there is no relationship between them at all. In this paper, we propose a new measure known as relative confidence for mining association rules, which is able to truthfully reflect the relationships of items. The effectiveness of the relative confidence measure is evaluated in comparison with the confidence measure in mining interesting relationships between terms from textual documents and in associative classification.

Tien Dung Do, Siu Cheung Hui, Alvis C. M. Fong
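
The contrast the abstract draws can be made concrete with toy numbers. In the sketch below, relative_confidence uses a baseline-subtracting normalization chosen purely for illustration; the paper's actual definition of relative confidence may differ:

def confidence(supp_xy, supp_x):
    return supp_xy / supp_x                     # estimates P(Y|X)

def relative_confidence(supp_xy, supp_x, supp_y):
    # Hypothetical normalization: how much X raises the chance of Y
    # above Y's baseline frequency (0 = no relationship, 1 = perfect).
    return (supp_xy / supp_x - supp_y) / (1.0 - supp_y)

# Two frequent but statistically independent items:
print(confidence(0.81, 0.9))                    # 0.9 -- looks strong
print(relative_confidence(0.81, 0.9, 0.9))      # 0.0 -- no real relationship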
Multiple Classifiers Fusion System Based on the Radial Basis Probabilistic Neural Networks

The design of a multiple classifier fusion system based on the radial basis probabilistic neural network (RBPNN) is discussed in this paper. By means of the proposed design method, complex structure optimization can be effectively avoided when designing the RBPNN. In addition, the D-S fusion algorithm adopted in the system greatly improves classification performance on complex real-world problems. The simulation results demonstrate that the proposed fusion system based on RBPNNs is feasible and effective.

Wen-Bo Zhao, Ming-Yi Zhang, Li-Ming Wang, Ji-Yan Du, De-Shuang Huang
An Effective Distributed Privacy-Preserving Data Mining Algorithm

Data mining is a useful means for discovering valuable patterns, associations, trends, and dependencies in data. Data mining is often required to be performed among a group of sites, with the precondition that no site's private data should be leaked to other sites. In this paper a distributed privacy-preserving data mining algorithm is proposed. The proposed algorithm is characterized by (1) its ability to preserve privacy without any coordinator site and, especially, its ability to resist collusion; and (2) its light weight, since only random numbers are used for preserving privacy. Performance analysis and experimental results are provided to demonstrate the effectiveness of the proposed algorithm.

Takuya Fukasawa, Jiahong Wang, Toyoo Takata, Masatoshi Miyazaki
Dimensionality Reduction with Image Data

A common objective in image analysis is dimensionality reduction. The most often used data-exploratory technique with this objective is principal component analysis. We propose a new method based on the projection of the images as matrices after a Procrustes rotation and show that it leads to a better reconstruction of images.

Mónica Benito, Daniel Peña
Implicit Fitness Sharing Speciation and Emergent Diversity in Tree Classifier Ensembles

Implicit fitness sharing is an approach to the stimulation of speciation in evolutionary computation for problems where the fitness of an individual is determined as its success rate over a number of trials against a collection of succeed/fail tests. By fixing the reward available for each test, individuals succeeding in a particular test depress the size of one another's fitness gain and hence implicitly co-operate with those succeeding in other tests. An important class of problems of this form is attribute-value learning of classifiers. Here, it is recognised that combining diverse classifiers has the potential to enhance performance in comparison with the best obtainable individual classifiers. However, proposed prescriptive measures of the required diversity have inherent limitations from which we would expect the diversity emergent from the self-organisation of a speciating evolutionary simulation to be free. The approach was tested on a number of popularly used real-world data sets and produced encouraging results in terms of accuracy and stability.

Karl J. Brazier, Graeme Richards, Wenjia Wang
Improving Decision Tree Performance Through Induction- and Cluster-Based Stratified Sampling

It is generally recognised that recursive partitioning, as used in the construction of classification trees, is inherently unstable, particularly for small data sets. Classification accuracy and, by implication, tree structure, are sensitive to changes in the training data. Successful approaches to counteract this effect include multiple classifiers, e.g. boosting, bagging or windowing. The downside of these multiple classification models, however, is the plethora of trees that result, often making it difficult to extract the classifier in a meaningful manner. We show that, by using some very weak knowledge in the sampling stage, when the data set is partitioned into the training and test sets, a more consistent and improved performance is achieved by a single decision tree classifier.

Abdul A. Gill, George D. Smith, Anthony J. Bagnall
Learning to Classify Biomedical Terms Through Literature Mining and Genetic Algorithms

We present an approach to classification of biomedical terms based on the information acquired automatically from the corpus of relevant literature. The learning phase consists of two stages: acquisition of terminologically relevant contextual patterns (CPs) and selection of classes that apply to terms used with these patterns. CPs represent a generalisation of similar term contexts in the form of regular expressions containing lexical, syntactic and terminological information. The most probable classes for the training terms co-occurring with the statistically relevant CP are learned by a genetic algorithm. Term classification is based on the learnt results. First, each term is associated with the most frequently co-occurring CP. Classes attached to such CP are initially suggested as the term’s potential classes. Then, the term is finally mapped to the most similar suggested class.

Irena Spasić, Goran Nenadić, Sophia Ananiadou
PRICES: An Efficient Algorithm for Mining Association Rules

In this paper, we present PRICES, an efficient algorithm for mining association rules that first identifies all large itemsets and then generates association rules. Our approach reduces large itemset generation time, known to be the most time-consuming step, by scanning the database only once and using logical operations in the process. Experimental results and comparisons with the state-of-the-art algorithm Apriori show that PRICES is very efficient and in some cases up to ten times as fast as Apriori.

Chuan Wang, Christos Tjortjis
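
The "single scan plus logical operations" strategy can be illustrated with transaction bitmaps: one pass builds a bit vector per item, after which the support of any itemset is a bitwise AND followed by a population count, with no rescan of the database. A sketch of that generic trick (not necessarily PRICES' exact encoding):

from itertools import combinations

def item_bitmaps(transactions, items):
    # Single database scan: one bit per transaction for each item.
    bm = {i: 0 for i in items}
    for t_id, t in enumerate(transactions):
        for item in t:
            bm[item] |= 1 << t_id
    return bm

def support(itemset, bm):
    # Support via bitwise AND and a popcount -- no rescan needed.
    acc = None
    for item in itemset:
        acc = bm[item] if acc is None else acc & bm[item]
    return bin(acc).count("1")

transactions = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}]
bm = item_bitmaps(transactions, {"a", "b", "c"})
for pair in combinations("abc", 2):
    print(pair, support(pair, bm))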
Combination of SVM Knowledge for Microcalcification Detection in Digital Mammograms

In this paper, we propose a novel combinational SVM algorithm via a set of decision rules to achieve better performances in microcalcification detection inside digital mammograms towards computer aided breast cancer diagnosis. Based on the discovery that the polynomial SVM is sensitive to MC (microcalcification) pixels and the linear SVM is sensitive to non-MC pixels, we designed an adaptive threshold mechanism via establishment of their correspondences to exploit the complementary nature between the polynomial SVM and the linear SVM. Experiments show that the proposed algorithm successfully reduced false positive detection rate while keeping the true positive detection rate competitive.

Ying Li, Jianmin Jiang
Char: An Automatic Way to Describe Characteristics of Data

As e-business software prevails worldwide, large amounts of data are accumulated automatically in the databases of most sizable companies. Managers in organizations now face the problem of making sense of the data. In this paper, an algorithm is proposed to automatically produce characteristic rules that describe the major characteristics of the data in a table. In contrast to traditional Attribute Oriented Induction methods, the algorithm, named the Char Algorithm, does not need a concept tree and only requires setting a desired coverage threshold to generate a minimal set of characteristic rules describing the given dataset. Our simulation results show that the characteristic rules found by Char are fairly consistent even when the number of records and attributes increases.

Yu-Chin Liu, Ping-Yu Hsu
Two Methods for Automatic 3D Reconstruction from Long Un-calibrated Sequences

This paper presents two methods for automatic 3D reconstruction: a quantitative measure for frame grouping over long uncalibrated sequences, and a 3D reconstruction algorithm based on projective invariance. The first method evaluates the duration of corresponding points over the sequence, the homography error, and the distribution of correspondences in the image. By making efficient bundles, we can overcome the limitation of factorization, namely the assumption that all correspondences must remain in all views. In addition, we use collinearity, among other invariant properties in projective space, to refine the projective matrix: any points located on a 2D imaged line must lie on the reconstructed projective line. Therefore, we regard points not satisfying collinearity as outliers caused by false feature tracking. After fitting a new 3D line to the projective points, we iteratively obtain a more precise projective matrix by using the orthogonal projections of the outliers onto the line. Experimental results show that our methods can efficiently recover 3D structure from uncalibrated sequences.

Yoon-Yong Jeong, Bo-Ra Seok, Yong-Ho Hwang, Hyun-Ki Hong
Wrapper for Ranking Feature Selection

We propose a new feature selection criterion that is not based on calculated measures between attributes or on complex and costly distance calculations. Applying a wrapper to the output of a new attribute ranking method, we obtain a minimal subset with the same error rate as the original data. In experiments, our method was compared to two other algorithms, achieving the same results with a much shorter computation time.

Roberto Ruiz, Jesús S. Aguilar-Ruiz, José C. Riquelme
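
The rank-then-wrap idea combines a cheap filter with a targeted wrapper: score features once, then scan them best-first and keep a feature only if it lowers the cross-validated error. A minimal sketch, assuming a user-supplied ranking score and error estimator (names are illustrative):

def rank_then_wrap(features, rank_score, cv_error):
    # features: list of feature names
    # rank_score: callable giving a relevance score for one feature
    # cv_error: callable giving cross-validated error for a feature subset
    ranked = sorted(features, key=rank_score, reverse=True)
    subset, best = [], cv_error([])   # baseline, e.g. majority-class error
    for f in ranked:
        err = cv_error(subset + [f])
        if err < best:                # keep the feature only if it helps
            subset.append(f)
            best = err
    return subset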
Simultaneous Feature Selection and Weighting for Nearest Neighbor Using Tabu Search

Both feature selection and feature weighting techniques are useful for improving the classification accuracy of the K-nearest-neighbor (KNN) rule. The term feature selection refers to algorithms that select the best subset of the input feature set. In feature weighting, each feature is multiplied by a weight value proportional to the ability of the feature to distinguish among pattern classes. In this paper, a tabu search based heuristic is proposed for simultaneous feature selection and feature weighting for the KNN rule. The proposed heuristic, in combination with a KNN classifier, is compared with several classifiers on various available datasets. The results indicate a significant improvement in performance in terms of classification accuracy.

Muhammad Atif Tahir, Ahmed Bouridane, Fatih Kurugollu
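
Selection and weighting unify naturally in one search space: a weight of 0 drops a feature, while intermediate weights rescale it. A toy tabu search over such a weight vector (neighbourhood size, tabu tenure, and the absence of an aspiration rule are simplifications; the evaluate callable, e.g. KNN cross-validation accuracy, is supplied by the user):

import random

def tabu_search(evaluate, n_features, weights=(0.0, 0.5, 1.0),
                iters=200, tabu_len=15):
    current = [1.0] * n_features
    best, best_score = current[:], evaluate(current)
    tabu = []                        # recently applied (index, weight) moves
    for _ in range(iters):
        moves = [(i, w) for i in range(n_features) for w in weights
                 if w != current[i] and (i, w) not in tabu]
        # score a random sample of non-tabu single-weight changes
        scored = []
        for i, w in random.sample(moves, min(20, len(moves))):
            cand = current[:]
            cand[i] = w
            scored.append((evaluate(cand), i, w, cand))
        score, i, w, current = max(scored)   # take the best neighbour
        tabu = (tabu + [(i, w)])[-tabu_len:] # forbid undoing it for a while
        if score > best_score:
            best, best_score = current[:], score
    return best, best_score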
Fast Filtering of Structural Similarity Search Using Discovery of Topological Patterns

Similarity search in protein 3D structure databases is complex and computationally expensive, and it is essential to improve the performance of existing comparison systems such as DALI and VAST. In our approach, structural similarity search consists of a filter step, which generates a small candidate set, and a refinement step, which computes the structural alignment. This paper describes fast filtering for similarity search using the discovery of topological patterns of secondary structure elements based on spatial relations. Our system is fully implemented using Oracle 8i Spatial. Experimental results show that our method is approximately three times faster than DaliLite.

Sung-Hee Park, Keun Ho Ryu
Detecting Worm Propagation Using Traffic Concentration Analysis and Inductive Learning

As a vast number of services have flooded onto the Internet, Internet resources are increasingly exposed to hacking activities such as the Code Red and SQL Slammer worms. Since worms spread quickly over the Internet using self-propagation mechanisms, it is crucial to detect worm propagation and protect against it for a secure network infrastructure. In this paper, we propose a mechanism to detect worm propagation using the computation of the entropy of network traffic and the compilation of network traffic. In experiments, we tested our framework in simulated network settings and could successfully detect worm propagation.

Sanguk Noh, Cheolho Lee, Keywon Ryu, Kyunghee Choi, Gihyun Jung
Comparing Study for Detecting Microcalcifications in Digital Mammogram Using Wavelets

A comparative study of detecting microcalcifications in digital mammograms using wavelets is presented. Microcalcifications are an early sign of breast cancer, appearing as isolated bright spots in mammograms; however, they are difficult to detect due to their small size (0.05 to 1 mm in diameter). From a signal processing point of view, microcalcifications are high-frequency components of mammograms. To enhance the detection of microcalcifications we use the wavelet transform: due to its multi-resolution decomposition capacity, we can decompose the image into different resolution levels, each sensitive to a different frequency band. By choosing an appropriate wavelet and the right resolution level, we can effectively detect microcalcifications in digital mammograms. In this paper, several standard wavelet family functions are studied comparatively, and for each wavelet function, different resolution levels are explored for detecting microcalcifications. Experimental results show that the Daubechies wavelet with 4th-level decomposition achieves the best detection result: a 95% TP rate with an FP rate of 0.3 clusters per image.

Ju Cheng Yang, Jin Wook Shin, Dong Sun Park
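
The signal-processing observation, that microcalcifications live in the high-frequency sub-bands, can be sketched with PyWavelets: decompose, discard the coarse approximation, reconstruct, and threshold. A hedged sketch of that generic pipeline, not the authors' system (the threshold rule here is a simple mean-plus-k-sigma assumption):

import numpy as np
import pywt  # PyWavelets

def detect_spots(image, wavelet="db4", level=4, k=3.0):
    coeffs = pywt.wavedec2(image, wavelet, level=level)
    coeffs[0] = np.zeros_like(coeffs[0])        # drop low-frequency content
    detail = pywt.waverec2(coeffs, wavelet)     # keep only detail sub-bands
    detail = detail[:image.shape[0], :image.shape[1]]
    thresh = detail.mean() + k * detail.std()   # keep only strong bright spots
    return detail > thresh                      # candidate microcalcification mask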
A Hybrid Multi-layered Speaker Independent Arabic Phoneme Identification System

A phoneme identification system for Arabic language has been developed. It is based on a hybrid approach that incorporates two levels of phoneme identification. In the first layer power spectral information, efficiently condensed through the use of singular value decomposition, is utilized to train separate self-organizing maps for identifying each Arabic phoneme. This is followed by a second layer of identification, based on similarity metric, that compares the standard pitch contours of phonemes with the pitch contours of the input sound. The second layer performs the identification in case the first layer generates multiple classifications of the same input sound. The system has been developed using utterances of twenty-eight Arabic phonemes from over a hundred speakers. The identification accuracy based on the first layer alone was recorded at 71%, which increased to 91% with the addition of the second identification layer. The introduction of singular values for training instead of power spectral densities directly has resulted in reduction of training and recognition times for self-organizing maps by 80% and 89% respectively. The research concludes that power spectral densities along with the pitch information result in an acceptable and robust identification system for the Arabic phonemes.

Mian M. Awais, Shahid Masud, Shafay Shamail, J. Akhtar
Feature Selection for Natural Disaster Texts Classification Using Testors

In this paper, feature selection for the classification of natural disaster texts using testors is presented. Testors are feature subsets that introduce no class confusion; typical testors are irreducible testors. They can therefore be used to select which words are relevant for separating the classes, and so help achieve better classification rates. Experiments were carried out with KNN and Naive Bayes classifiers, and the results were compared against the frequency threshold and information gain methods.

Jesús A. Carrasco-Ochoa, José Fco. Martínez-Trinidad
Mining Large Engineering Data Sets on the Grid Using AURA

AURA (Advanced Uncertain Reasoning Architecture) is a parallel pattern matching technology intended for high-speed approximate search and match operations on large unstructured datasets. This paper presents how the AURA technology is extended and used to search engine data within a major UK eScience Grid project (DAME) for the maintenance of Rolls-Royce aero-engines, and how it may be applied in other areas. Examples of its use are presented.

Bojian Liang, Jim Austin
Self-tuning Based Fuzzy PID Controllers: Application to Control of Nonlinear HVAC Systems

A Heating, Ventilating and Air Conditioning (HVAC) plant is a multivariable, nonlinear and non-minimum-phase system, which makes its control very difficult. For this reason, the idea of self-tuning controllers is used in the design of HVAC controllers. In this paper a robust, adaptive, self-tuning based fuzzy PID controller for the control of nonlinear HVAC systems is presented. Simulations illustrate the effectiveness of the proposed method: the results show that it is not only robust but also gives good dynamic response compared with traditional controllers, and the response time is very fast despite the control strategy being based on bounded rationality. To evaluate the usefulness of the proposed method, we compare its response with that of a PID controller; the results show that our method has better control performance.

Behzad Moshiri, Farzan Rashidi
Ontology-Based Web Navigation Assistant

This paper proposes a navigation assistant that provides more personalized Web navigation by exploiting domain-specific ontologies. In general, an ontology is regarded as the specification of conceptualization that enables formal definitions about things and states by using terms and relationships between them. In our approach, Web pages are converted into concepts by referring to domain-specific ontologies which employ a hierarchical concept structure. This concept mapping makes it easy to handle Web pages and also provides higher-level classification information. The proposed navigation assistant eventually recommends the Web documents that are intimately associated with the concept nodes in the upper-levels of the hierarchy by analyzing the current Web page and its outwardly-linked pages.

Hyunsub Jung, Jaeyoung Yang, Joongmin Choi
A Hybrid Fuzzy-Neuro Model for Preference-Based Decision Analysis

A hybrid fuzzy-neuro model that combines frame-based fuzzy logic system and neural network learning paradigm, hereinafter called FRN, is proposed to support innovative decision analysis (selection and assessment) process. The FRN model exploits the merits of reasoning from a frame-based fuzzy expert system in combination with preference-based learning derived from a supervised neural network. The FRN has proven to be useful and practical in filtering out all possible decision analyses. A salient feature of FRN is its ability to adapt to user’s sudden change of preference in the midst of model implementation. A case study on the decision analysis (assessment and selection) of preference-based product is included to illustrate the implementation of FRN model.

Vincent C. S. Lee, Alex T. H. Sim
Combining Rules for Text Categorization Using Dempster’s Rule of Combination

In this paper, we present an investigation into the combination of rules for text categorization using Dempster’s rule of combination. We first propose a boosting-like technique for generating multiple sets of rules based on rough set theory, and then describe how to use Dempster’s rule of combination to combine the classification decisions produced by multiple sets of rules. We apply these methods to 10 out of the 20-newsgroups – a benchmark data collection, individually and in combination. Our experimental results show that the performance of the best combination of the multiple sets of rules on the 10 groups of the benchmark data can achieve 80.47% classification accuracy, which is 3.24% better than that of the best single set of rules.

Yaxin Bi, Terry Anderson, Sally McClean
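For readers unfamiliar with Dempster's rule, the combination step the abstract refers to can be sketched in a few lines of Python. The rough-set rule generation is omitted, and the mass functions and class names below are hypothetical:

```python
from itertools import product

def dempster_combine(m1, m2):
    """Combine two mass functions (dicts mapping frozensets of classes to
    masses) with Dempster's rule: multiply, intersect, renormalize."""
    combined, conflict = {}, 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            conflict += wa * wb            # mass falling on the empty set
    if conflict >= 1.0:
        raise ValueError("total conflict: sources cannot be combined")
    return {s: w / (1.0 - conflict) for s, w in combined.items()}

# Two rule sets voting over newsgroup classes {'comp', 'sci'}:
m1 = {frozenset({'comp'}): 0.6, frozenset({'comp', 'sci'}): 0.4}
m2 = {frozenset({'comp'}): 0.5, frozenset({'sci'}): 0.3,
      frozenset({'comp', 'sci'}): 0.2}
print(dempster_combine(m1, m2))
```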
Genetic Program Based Data Mining for Fuzzy Decision Trees

A data mining procedure for automatic determination of fuzzy decision tree structure using a genetic program is discussed. A genetic program is an algorithm that evolves other algorithms or mathematical expressions. Methods of accelerating convergence of the data mining procedure including a new innovation based on computer algebra are examined. Experimental results related to using computer algebra are given. A comparison between a tree obtained using a genetic program and one constructed solely by interviewing experts is made. A genetic program evolved tree is shown to be superior to one created by hand using expertise alone. Finally, additional methods that have been used to validate the data mining algorithm are discussed.

James F. Smith III
Automating Co-evolutionary Data Mining

An approach is being explored that involves embedding a fuzzy logic based resource manager in an electronic game environment. Game agents can function under their own autonomous logic or human control. This approach automates the data mining problem: the game automatically creates a cleansed database reflecting the domain expert’s knowledge, calls a data mining function (a genetic algorithm) to mine the database as required, and allows easy evaluation of the information extracted. Co-evolutionary fitness functions are discussed. The strategy tree concept and its relationship to co-evolutionary data mining are examined, as well as the associated phase space representation of fuzzy concepts. Co-evolutionary data mining alters the geometric properties of the overlap region known as the combined admissible region of phase space, significantly enhancing the performance of the resource manager. Significant experimental results are provided.

James F. Smith III
Topological Tree for Web Organisation, Discovery and Exploration

In this paper we focus on the organisation of web contents, which allows efficient browsing, searching and discovery. We propose a method that dynamically creates such a structure called Topological Tree. The tree is generated using an algorithm called Automated Topological Tree Organiser, which uses a set of hierarchically organised self-organising growing chains. Each chain fully adapts to a specific topic, where its number of subtopics is determined using entropy-based validation and cluster tendency schemes. The Topological Tree adapts to the natural underlying structure at each level in the hierarchy. The topology in the chains also relates close topics together, thus can be exploited to reduce the time needed for search and navigation. This method can be used to generate a web portal or directory where browsing and user comprehension are improved.

Richard T. Freeman, Hujun Yin
New Medical Diagnostic Method for Oriental Medicine Using BYY Harmony Learning

With the help of BYY harmony learning for binary independent factor analysis, an automatic oriental medical diagnostic approach is proposed. A preliminary experiment has shown a promising result on the diagnostic problem of ’Flaming-up liver fire’ with medical data from oriental medical diagnosis.

JeongYon Shim
An Intelligent Topic-Specific Crawler Using Degree of Relevance

It is indispensable that users surfing the Internet have web pages classified into a given topic as correctly as possible. Toward this end, this paper presents a topic-specific crawler that computes the degree of relevance and refines the preliminary set of related web pages using term frequency/document frequency, entropy, and compiled rules. In the experiments, we test our topic-specific crawler in terms of classification accuracy, crawling efficiency, and crawling consistency. When 51 representative terms were used, the resulting classification accuracy was 97.8%.

Sanguk Noh, Youngsoo Choi, Haesung Seo, Kyunghee Choi, Gihyun Jung
Development of a Global and Integral Model of Business Management Using an Unsupervised Model

In this paper, we use a recent artificial neural architecture called Cooperative Maximum Likelihood Hebbian Learning (CMLHL) in order to categorize the necessities for the Acquisition, Transfer and Updating of Knowledge of the different departments of a firm. We apply Maximum Likelihood Hebbian learning to an extension of a negative feedback network characterised by the use of lateral connections on the output layer. These lateral connections have been derived from the Rectified Gaussian distribution. This technique is used as a tool to develop a part of a Global and Integral Model of Business Management, which brings about a global improvement in the firm, adding value, flexibility and competitiveness. From this perspective, the model tries to generalise the hypothesis of organizational survival and competitiveness, so that the organisation that is able to identify, strengthen, and use key knowledge will reach a pole position.

Emilio Corchado, Colin Fyfe, Lourdes Sáiz, Ana Lara
Spam Mail Detection Using Artificial Neural Network and Bayesian Filter

We propose dynamic anti-spam filtering methods for agglutinative languages in general and for Turkish in particular, based on Artificial Neural Networks (ANN) and Bayesian filters. The algorithms are adaptive and have two components: the first deals with the morphology and the second classifies the e-mails by using the roots. Two ANN structures, single-layer perceptron and multi-layer perceptron, are considered, and the inputs to the networks are determined using binary and probabilistic models. For Bayesian classification, three approaches are employed: binary, probabilistic, and advanced probabilistic models. In the experiments, a total of 750 e-mails (410 spam and 340 normal) were used and a success rate of about 90% was achieved.

Levent Özgür, Tunga Güngör, Fikret Gürgen
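The Bayesian binary-model component described above can be sketched as a Bernoulli-style naive Bayes over word roots; the morphological root extraction is assumed done upstream, and the tokens and smoothing choice below are illustrative, not the authors' code:

```python
import math
from collections import defaultdict

def train_binary_nb(docs, labels):
    """Naive Bayes with binary presence features and Laplace smoothing."""
    classes = set(labels)
    doc_count = defaultdict(int)                      # docs per class
    word_count = {c: defaultdict(int) for c in classes}
    for words, c in zip(docs, labels):
        doc_count[c] += 1
        for w in set(words):
            word_count[c][w] += 1
    vocab = {w for d in docs for w in d}
    return classes, vocab, doc_count, word_count, len(docs)

def classify(words, model):
    classes, vocab, doc_count, word_count, n = model
    present, best = set(words), None
    for c in classes:
        score = math.log(doc_count[c] / n)            # class prior
        for w in vocab:
            p = (word_count[c][w] + 1) / (doc_count[c] + 2)
            score += math.log(p if w in present else 1 - p)
        if best is None or score > best[0]:
            best = (score, c)
    return best[1]

model = train_binary_nb([["ucuz", "kredi"], ["toplanti", "rapor"]],
                        ["spam", "normal"])
print(classify(["kredi"], model))   # -> 'spam'
```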
An Integrated Approach to Automatic Indoor Outdoor Scene Classification in Digital Images

This paper describes a method for automatic indoor/outdoor scene classification in digital images. Digital images of the inside of buildings will contain a higher proportion of objects with sharp edges. We applied an edge detection algorithm to these images and a method to collect and measure the straightness of the lines in each image. This paper highlights a novel integrated method of measuring these straight lines and training a neural network to detect the difference between a set of indoor and outdoor images.

Matthew Traherne, Sameer Singh
Using Fuzzy Sets in Contextual Word Similarity

We propose a novel algorithm for computing asymmetric word similarity (AWS) using mass assignment based on fuzzy sets of words. Words in documents are considered similar if they appear in similar contexts. However, these similar words do not have to be synonyms, or belong to the same lexical category. We apply AWS in measuring document similarity. We evaluate the effectiveness of our method against a typical symmetric similarity measure, TF.IDF. The system has been evaluated on real world documents, and the results show that this method performs well.

Masrah Azmi-Murad, Trevor P. Martin
Summarizing Time Series: Learning Patterns in ‘Volatile’ Series

Most financial time series processes are nonstationary and their frequency characteristics are time-dependent. In this paper we present a time series summarization and prediction framework to analyse nonstationary, volatile and high-frequency time series data. Multiscale wavelet analysis is used to separate out the trend, cyclical fluctuations and autocorrelational effects. The framework can generate verbal signals to describe each effect. The summary output is used to reason about the future behaviour of the time series and to give a prediction. Experiments on the intra-day European currency spot exchange rates are described. The results are compared with a neural network prediction framework.

Saif Ahmad, Tugba Taskaya-Temizel, Khurshid Ahmad
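The trend/fluctuation separation step can be approximated with a discrete wavelet decomposition, for instance using the PyWavelets package as below; the wavelet, level and synthetic data are illustrative assumptions, not the authors' exact configuration:

```python
import numpy as np
import pywt  # PyWavelets

def trend_and_fluctuations(series, wavelet="db4", level=4):
    """Split a series into a slow trend (coarse approximation) and
    faster cyclical fluctuations (the remaining detail)."""
    coeffs = pywt.wavedec(series, wavelet, level=level)
    # Trend: keep only the coarsest approximation, zero all details.
    trend_coeffs = [coeffs[0]] + [np.zeros_like(c) for c in coeffs[1:]]
    trend = pywt.waverec(trend_coeffs, wavelet)[: len(series)]
    return trend, series - trend

rates = np.cumsum(np.random.randn(512)) + 100.0   # stand-in for FX quotes
trend, cycles = trend_and_fluctuations(rates)
```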
Cosine Transform Priors for Enhanced Decoding of Compressed Images

Image compression methods such as JPEG use quantisation of discrete cosine transform (DCT) coefficients of image blocks to produce lossy compression. During decoding, an inverse DCT of the quantised values is used to obtain the lossy image. These methods suffer from blocky effects at the region boundaries, and can produce poor representations of regions containing sharp edges. Such problems can be obvious artefacts in compressed images but also cause significant problems for many super-resolution algorithms. Prior information about the DCT coefficients of an image and the continuity between image blocks can be used to improve the decoding using the same compressed image information. This paper analyses empirical priors for DCT coefficients, and shows how they can be combined with block edge contiguity information to produce decoding methods which reduce the blockiness of images. We show that the use of DCT priors is generic and can be useful in many other circumstances.

Amos Storkey, Michael Allan
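Fitting empirical priors of the kind the paper analyses starts from the per-frequency DCT coefficients of all 8x8 blocks; a minimal collection step might look like this sketch (block size and the random test image are assumptions):

```python
import numpy as np
from scipy.fft import dctn

def block_dct_coefficients(image, block=8):
    """Return one block x block DCT spectrum per image block, from which
    empirical (e.g. heavy-tailed) priors per frequency can be fitted."""
    h, w = image.shape
    h, w = h - h % block, w - w % block       # crop to whole blocks
    blocks = image[:h, :w].reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(-1, block, block)
    return np.stack([dctn(b, norm="ortho") for b in blocks])

coeffs = block_dct_coefficients(np.random.rand(64, 64))
print(coeffs.shape)   # (64, 8, 8): one spectrum per block
```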
Partial Discharge Classification Through Wavelet Packets of Their Modulated Ultrasonic Emission

Locating and classifying partial discharges due to sharp edges, polluted insulators and loose contacts in power systems significantly reduces outage time, impending failures, equipment damage and supply interruption. In this paper, an efficient novel scheme for neural network recognition of partial discharges, based on wavelet packet features of their modulated ultrasound emissions, is proposed. The employed preprocessing, wavelet features and near-optimally sized network led to successful classification of up to 100%, particularly when longer-duration signals are processed.

Mazen Abdel-Salam, Yassin M. Y. Hasan, Mohammed Sayed, Salah Abdel-Sattar
A Hybrid Optimization Method of Multi-objective Genetic Algorithm (MOGA) and K-Nearest Neighbor (KNN) Classifier for Hydrological Model Calibration

The MOGA is used as an automatic calibration method for a wide range of water and environmental simulation models. The task of estimating the entire Pareto set requires a large number of fitness evaluations in a standard MOGA optimization process. However, it is very time-consuming to obtain values of the objective functions in many real engineering problems. We propose a hybrid method of MOGA and a KNN classifier to reduce the number of actual fitness evaluations. The test results for multi-objective calibration show that the proposed method requires only about 30% of the actual fitness evaluations of the MOGA.

Yang Liu, Soon-Thiam Khu, Dragan Savic
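The surrogate idea is to learn, from individuals already evaluated by the expensive hydrological model, whether a new candidate is promising, and only run the model on those. A minimal sketch with scikit-learn follows; the k, labels and random data are illustrative, not the authors' settings:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

evaluated_x = np.random.rand(200, 5)        # decision variables already run
promising = np.random.randint(0, 2, 200)    # 1 = near the current Pareto set

surrogate = KNeighborsClassifier(n_neighbors=5).fit(evaluated_x, promising)
offspring = np.random.rand(50, 5)           # new MOGA candidates
to_evaluate = offspring[surrogate.predict(offspring) == 1]
print(f"full model runs needed: {len(to_evaluate)} of {len(offspring)}")
```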
Cluster-Based Visualisation of Marketing Data

Marketing data analysis typically aims to gain insights for targeted promotions or, increasingly, to implement collaborative filtering. Ideally, data would be visualised directly. There is a scarcity of methods to visualise the position of individual data points in clusters, mainly because dimensionality reduction is necessary for analysis of high-dimensional data and projective methods tend to merge clusters together. This paper proposes a cluster-based projective method to represent cluster membership, which shows good cluster separation and retains linear relationships in the data. This method is practical for the analysis of large, high-dimensional, databases, with generic applicability beyond marketing studies. Theoretical properties of this non-orthogonal projection are derived and its practical value is demonstrated on real-world data from a web-based retailer, benchmarking with the visualisation of clusters using Sammon and Kohonen maps.

Paulo J. G. Lisboa, Shail Patel
Dynamic Symbolization of Streaming Time Series

Symbolization of time series is an important preprocessing subroutine for many data mining tasks. However, it is usually difficult, if not impossible, to apply the traditional static symbolization approach to streaming time series, because of either the low efficiency of re-computing the typical sub-series or the low capability of representing the up-to-date characteristics of the series. This paper presents a novel symbolization method in which the typical sub-series are dynamically adjusted to fit the up-to-date characteristics of the streaming time series. It works in an incremental form without scanning the whole data set. Experiments on data sets from the stock market justify the superiority of the proposed method over traditional ones.

Xiaoming Jin, Jianmin Wang, Jiaguang Sun
A Clustering Model for Mining Evolving Web User Patterns in Data Stream Environment

With the fast growth of the Internet and its Web users all over the world, how to manage and discover useful patterns from tremendous and evolving Web information sources has become a new challenge for data engineering researchers. Also, there is a great demand for scalable and flexible data mining algorithms for various time-critical and data-intensive Web applications. In this paper, we propose a new clustering model for efficiently generating and maintaining clusters that represent the changing Web user patterns on Websites. With an effective pruning process, the clusters can be quickly discovered and updated to reflect the current or changing user patterns to Website administrators. This model can also be employed in different Web applications such as personalization and recommendation systems.

Edmond H. Wu, Michael K. Ng, Andy M. Yip, Tony F. Chan
An Improved Constructive Neural Network Ensemble Approach to Medical Diagnoses

Neural networks have played an important role in intelligent medical diagnoses. This paper presents an Improved Constructive Neural Network Ensemble (ICNNE) approach to three medical diagnosis problems. A new initial structure for the ensemble, a new freezing criterion, and a different error function are presented. Experimental results show that our ICNNE approach performs better on most problems.

Zhenyu Wang, Xin Yao, Yong Xu
Spam Classification Using Nearest Neighbour Techniques

Spam mail classification and filtering is a commonly investigated problem, yet there has been little research into the application of nearest neighbour classifiers in this field. This paper examines the possibility of using a nearest neighbour algorithm for simple, word-based spam mail classification. This approach is compared to a neural network and a decision tree, along with results published in another conference paper on the subject.

Dave C. Trudgian

Learning Algorithms and Systems

Kernel Density Construction Using Orthogonal Forward Regression

An automatic algorithm is derived for constructing kernel density estimates based on a regression approach that directly optimizes generalization capability. Computational efficiency of the density construction is ensured using an orthogonal forward regression, and the algorithm incrementally minimizes the leave-one-out test score. Local regularization is incorporated into the density construction process to further enforce sparsity. Examples are included to demonstrate the ability of the proposed algorithm to effectively construct a very sparse kernel density estimate with comparable accuracy to that of the full sample Parzen window density estimate.

Sheng Chen, Xia Hong, Chris J. Harris
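The full-sample Parzen window density estimate that the sparse construction is benchmarked against is straightforward to state in code; the Gaussian kernel width and data below are illustrative assumptions:

```python
import numpy as np

def parzen_density(x_train, x_query, h):
    """Parzen window density estimate with a Gaussian kernel of width h:
    the average of one kernel centred on every training sample."""
    d = x_train.shape[1]
    norm = (2 * np.pi * h**2) ** (-d / 2) / len(x_train)
    diffs = x_query[:, None, :] - x_train[None, :, :]
    sq_dist = np.sum(diffs**2, axis=-1)
    return norm * np.exp(-sq_dist / (2 * h**2)).sum(axis=1)

x = np.random.randn(500, 2)
grid = np.random.randn(10, 2)
print(parzen_density(x, grid, h=0.5))
```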
Orthogonal Least Square with Boosting for Regression

A novel technique is presented to construct sparse regression models based on the orthogonal least square method with boosting. This technique tunes the mean vector and diagonal covariance matrix of individual regressors by incrementally minimizing the training mean square error. A weighted optimization method is developed based on boosting to append regressors one by one in an orthogonal forward selection procedure. Experimental results obtained using this technique demonstrate that it offers a viable alternative to existing state-of-the-art kernel modeling methods for constructing parsimonious regression models.

Sheng Chen, Xunxian Wang, David J. Brown
New Applications for Object Recognition and Affine Motion Estimation by Independent Component Analysis

This paper proposes a new scheme based on independent component analysis (ICA) for object recognition under affine transformation and for affine motion estimation between video frames. For different skewed shapes of a recognized object, an invariant descriptor can be extracted by ICA, which can solve some skewed-object recognition problems. The method can also be used to estimate the affine motion between two frames, which is important in high-compression-rate coding such as the MPEG4 or MPEG7 standards. Simulation results show that the proposed method performs better than other traditional methods in pattern recognition and affine motion estimation.

Liming Zhang, Xuming Huang
Personalized News Reading via Hybrid Learning

In this paper, we present a personalized news reading prototype in which the latest news articles published by various on-line news providers are automatically collected, categorized and ranked in light of a user’s habits or interests. Moreover, our system can adapt itself towards better performance. In order to develop such an adaptive system, we propose a hybrid learning strategy: supervised learning is used to create an initial system configuration based on the user’s feedback during registration, while an unsupervised learning scheme gradually updates the configuration by tracing the user’s behaviors as the system is being used. Simulation results demonstrate satisfactory performance.

Ke Chen, Sunny Yeung
Mercer Kernel, Fuzzy C-Means Algorithm, and Prototypes of Clusters

In this paper, an unsupervised Mercer kernel based fuzzy c-means (MKFCM) clustering algorithm is proposed, in which the implicit assumptions about the shapes of clusters in the FCM algorithm are removed, so that the new algorithm possesses strong adaptability to cluster structures within data samples. A new method for calculating the prototypes of clusters in input space is also proposed, which is essential for data clustering applications. Experimental results have demonstrated the promising performance of the MKFCM algorithm in different scenarios.

Shangming Zhou, John Q. Gan
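One common kernelized FCM formulation (Gaussian kernel, prototypes kept in input space, due to Zhang and Chen) gives a flavour of the update equations; this is a generic sketch, not necessarily the authors' exact MKFCM:

```python
import numpy as np

def kernel_fcm(X, c=3, m=2.0, sigma=1.0, iters=50, seed=0):
    """Fuzzy c-means with the kernel-induced distance 2*(1 - K(x, v))
    for a Gaussian kernel K; prototypes V live in the input space."""
    rng = np.random.default_rng(seed)
    V = X[rng.choice(len(X), c, replace=False)]
    for _ in range(iters):
        K = np.exp(-((X[:, None] - V[None]) ** 2).sum(-1) / (2 * sigma**2))
        d2 = np.clip(2.0 * (1.0 - K), 1e-12, None)   # kernel-induced distance
        U = d2 ** (-1.0 / (m - 1.0))                 # membership update
        U /= U.sum(axis=1, keepdims=True)
        W = (U**m) * K                               # prototype update weights
        V = (W.T @ X) / W.sum(axis=0)[:, None]
    return U, V

U, V = kernel_fcm(np.random.randn(300, 2))
```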
DIVACE: Diverse and Accurate Ensemble Learning Algorithm

In order for a neural network ensemble to generalise properly, two factors are considered vital. One is the diversity and the other is the accuracy of the networks that comprise the ensemble. There exists a tradeoff as to what should be the optimal measures of diversity and accuracy. The aim of this paper is to address this issue. We propose the DIVACE algorithm which tries to produce an ensemble as it searches for the optimum point on the diversity-accuracy curve. The DIVACE algorithm formulates the ensemble learning problem as a multi-objective problem explicitly.

Arjun Chandra, Xin Yao
Parallel Processing for Movement Detection in Neural Networks with Nonlinear Functions

Parallel processing of spatial information is one of the prominent features of neural networks, yet the key features underlying this parallel processing have not been clarified theoretically. In this paper, it is shown that asymmetric nonlinear functions play a crucial role in parallel processing for movement detection. Visual information is input first to the retinal neural networks, then transmitted onwards, and finally processed in the visual network of the cortex and the middle temporal area of the brain. In these networks, it is reported that some nonlinear functions process the visual information effectively. We make clear that parallel processing with even and odd nonlinear functions is effective in movement detection.

Naohiro Ishii, Toshinori Deguchi, Hiroshi Sasaki
Combining Multiple k-Nearest Neighbor Classifiers Using Different Distance Functions

The k-nearest neighbor (KNN) classification is a simple and effective classification approach. However, improving the performance of the classifier remains attractive. Combining multiple classifiers is an effective technique for improving accuracy. There are many general combining algorithms, such as Bagging, Boosting, or Error Correcting Output Coding, that significantly improve classifiers such as decision trees, rule learners, or neural networks. Unfortunately, these combining methods do not improve nearest neighbor classifiers. In this paper we present a new approach to combining multiple KNN classifiers based on different distance functions, in which we apply multiple distance functions to improve the performance of the k-nearest neighbor classifier. The proposed algorithm seeks to increase generalization accuracy when compared to the basic k-nearest neighbor algorithm. Experiments have been conducted on benchmark datasets from the UCI Machine Learning Repository. The results show that the proposed algorithm improves the performance of k-nearest neighbor classification.

Yongguang Bao, Naohiro Ishii, Xiaoyong Du
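The core idea, one KNN per distance function combined by vote, is easy to sketch with scikit-learn; the metrics, k and the Iris dataset are illustrative choices, not the paper's exact experimental setup:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3, random_state=0)

# One KNN per distance function; combine predictions by majority vote.
votes = np.stack([
    KNeighborsClassifier(n_neighbors=5, metric=m).fit(Xtr, ytr).predict(Xte)
    for m in ("euclidean", "manhattan", "chebyshev")
])
combined = np.array([np.bincount(col).argmax() for col in votes.T])
print("accuracy:", (combined == yte).mean())
```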
Finding Minimal Addition Chains Using Ant Colony

Modular exponentiation is one of the most important operations in almost all modern cryptosystems. It is performed using a series of modular multiplications. The latter operation is time-consuming for large operands, which is always the case in cryptography. Hence, accelerating public-key cryptography software or hardware requires either optimising the time consumed by a single modular multiplication and/or reducing the total number of modular multiplications required. This paper introduces a novel idea based on the principles of the ant colony for finding a minimal addition chain that allows us to reduce the number of modular multiplications so that modular exponentiation can be implemented very efficiently.

Nadia Nedjah, Luiza de Macedo Mourelle
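To see why a short addition chain saves multiplications, here is a small sketch (the ant colony search itself is not shown; the chain and modulus are examples):

```python
def chain_pow(x, chain, mod):
    """Evaluate x**chain[-1] mod `mod` along an addition chain: every
    element is the sum of two earlier ones, costing one multiplication."""
    powers = {1: x % mod}
    for target in chain[1:]:
        # find an earlier pair i, target - i already in the chain
        i = next(i for i in powers if target - i in powers)
        powers[target] = (powers[i] * powers[target - i]) % mod
    return powers[chain[-1]]

# x^15 via 1,2,3,6,12,15 costs 5 multiplications; plain binary
# square-and-multiply for exponent 15 costs 6.
print(chain_pow(7, [1, 2, 3, 6, 12, 15], 1000003))
```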
Combining Local and Global Models to Capture Fast and Slow Dynamics in Time Series Data

Many time series exhibit dynamics over vastly different time scales. The standard way to capture this behavior is to assume that the slow dynamics are a “trend”, to de-trend the data, and then to model the fast dynamics. However, for nonlinear dynamical systems this is generally insufficient. In this paper we describe a new method, utilizing two distinct nonlinear modeling architectures to capture both fast and slow dynamics. Slow dynamics are modeled with the method of analogues, and fast dynamics with a deterministic radial basis function network. When combined, the resulting model outperforms either individual system.

Michael Small
A Variable Metric Probabilistic k-Nearest-Neighbours Classifier

The k-nearest neighbour (k-nn) model is a simple, popular classifier. Probabilistic k-nn is a more powerful variant in which the model is cast in a Bayesian framework using (reversible jump) Markov chain Monte Carlo methods to average out the uncertainty over the model parameters. The k-nn classifier depends crucially on the metric used to determine distances between data points. However, scalings between features, and indeed whether some subset of features is redundant, are seldom known a priori. Here we introduce a variable metric extension to the probabilistic k-nn classifier, which permits averaging over all rotations and scalings of the data. In addition, the method permits automatic rejection of irrelevant features. Examples are provided on synthetic data, illustrating how the method can deform feature space and select salient features, and also on real-world data.

Richard M. Everson, Jonathan E. Fieldsend
Feature Word Tracking in Time Series Documents

Data mining from time series documents is a new challenge in text mining and, for this purpose, time dependent feature extraction is an important problem. This paper proposes a method to track feature terms in time series documents. When analyzing and mining time series data, the key is to handle time information. The proposed method applies non-linear principal component analysis to document vectors that consist of term frequencies and time information. This paper reports preliminary experimental results in which the proposed method is applied to a corpus of topic detection and tracking, and we show that the proposed method is effective in extracting time dependent terms.

Atsuhiro Takasu, Katsuaki Tanaka
Combining Gaussian Mixture Models

A Gaussian mixture model (GMM) estimates a probability density function using the expectation-maximization algorithm. However, it may lead to a poor performance or inconsistency. This paper analytically shows that performance of a GMM can be improved in terms of Kullback-Leibler divergence with a committee of GMMs with different initial parameters. Simulations on synthetic datasets demonstrate that a committee of as few as 10 models outperforms a single model.

Hyoung-joo Lee, Sungzoon Cho
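The committee idea, averaging the densities of GMMs fitted from different initialisations, is a few lines with scikit-learn; the synthetic data, component count and committee size are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.concatenate([np.random.randn(200, 1) - 3,
                    np.random.randn(200, 1) + 3])

# Committee of 10 GMMs differing only in their random initialisation.
models = [GaussianMixture(n_components=2, random_state=s).fit(X)
          for s in range(10)]
x_grid = np.linspace(-8, 8, 200).reshape(-1, 1)
committee_density = np.mean(
    [np.exp(m.score_samples(x_grid)) for m in models], axis=0)
```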
Global Convergence of Steepest Descent for Quadratic Functions

This paper analyzes the effect of momentum on steepest descent training for quadratic performance functions. Some global convergence conditions for the steepest descent algorithm are obtained by directly analyzing the exact momentum equations for quadratic cost functions. These conditions can be derived directly from the parameters of the Hessian matrix, rather than from its eigenvalues as in existing results. The results presented in this paper are new.

Zhigang Zeng, De-Shuang Huang, Zengfu Wang
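The iteration under analysis is ordinary gradient descent with momentum on a quadratic cost; a minimal numerical sketch (step size, momentum and the Hessian below are arbitrary illustrative values):

```python
import numpy as np

def steepest_descent_momentum(A, b, lr=0.1, beta=0.9, steps=200):
    """Momentum iteration on f(x) = 0.5 x^T A x - b^T x, whose
    gradient is A x - b; converges to the solution of A x = b."""
    x = np.zeros_like(b)
    v = np.zeros_like(b)
    for _ in range(steps):
        g = A @ x - b
        v = beta * v - lr * g     # momentum accumulates past gradients
        x = x + v
    return x

A = np.array([[3.0, 0.5], [0.5, 1.0]])   # positive definite Hessian
b = np.array([1.0, -2.0])
print(steepest_descent_momentum(A, b), np.linalg.solve(A, b))
```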
Boosting Orthogonal Least Squares Regression

A comparison between support vector machine regression (SVR) and orthogonal least squares (OLS) forward selection regression is given by an example. The disadvantage of SVR is shown and analyzed. A new algorithm is proposed that uses the OLS method to select regressors (support vectors) and a boosting method to train the regressors’ weights. This algorithm gives a small regression error when a very sparse system model is required. When a detailed model is required, the resulting training-set and test-set error models may look very similar.

Xunxian Wang, David J. Brown
Local Separation Property of the Two-Source ICA Problem with the One-Bit-Matching Condition

The one-bit-matching conjecture for independent component analysis (ICA) is basically stated as “all the sources can be separated as long as there is one-to-one same-sign-correspondence between the kurtosis signs of all source probability density functions (pdf’s) and the kurtosis signs of all model pdf’s”, which has been widely believed in the ICA community, but not proved completely. Recently, it has been proved that under the assumption of zero skewness for the model pdf’s, the global maximum of a cost function on the ICA problem with the one-bit-matching condition corresponds to a feasible solution of the ICA problem. In this paper, we further study the one-bit-matching conjecture along this direction and prove that all the possible local maximums of this cost function correspond to the feasible solutions of the ICA problem in the case of two sources under the same assumption. That is, the one-bit-matching condition is sufficient for solving the two-source ICA problem via any local ascent algorithm of the cost function.

Jinwen Ma, Zhiyong Liu, Lei Xu
Two Further Gradient BYY Learning Rules for Gaussian Mixture with Automated Model Selection

Under the Bayesian Ying-Yang (BYY) harmony learning theory, a harmony function has been developed for Gaussian mixture model with an important feature that, via its maximization through a gradient learning rule, model selection can be made automatically during parameter learning on a set of sample data from a Gaussian mixture. This paper proposes two further gradient learning rules, called conjugate and natural gradient learning rules, respectively, to efficiently implement the maximization of the harmony function on Gaussian mixture. It is demonstrated by simulation experiments that these two new gradient learning rules not only work well, but also converge more quickly than the general gradient ones.

Jinwen Ma, Bin Gao, Yang Wang, Qiansheng Cheng
Improving Support Vector Solutions by Selecting a Sequence of Training Subsets

In this paper we demonstrate that it is possible to gradually improve the performance of support vector machine (SVM) classifiers by using a genetic algorithm to select a sequence of training subsets from the available data. Performance improvement is possible because the SVM solution generally lies some distance away from the Bayes optimal in the space of learning parameters. We illustrate performance improvements on a number of benchmark data sets.

Tom Downs, Jianxiong Wang
Machine Learning for Matching Astronomy Catalogues

An emerging issue in the field of astronomy is the integration, management and utilization of databases from around the world to facilitate scientific discovery. In this paper, we investigate application of the machine learning techniques of support vector machines and neural networks to the problem of amalgamating catalogues of galaxies as objects from two disparate data sources: radio and optical. Formulating this as a classification problem presents several challenges, including dealing with a highly unbalanced data set. Unlike the conventional approach to the problem (which is based on a likelihood ratio) machine learning does not require density estimation and is shown here to provide a significant improvement in performance. We also report some experiments that explore the importance of the radio and optical data features for the matching problem.

David Rohde, Michael Drinkwater, Marcus Gallagher, Tom Downs, Marianne Doyle
Boosting the Tree Augmented Naïve Bayes Classifier

The Tree Augmented Naïve Bayes (TAN) classifier relaxes the sweeping independence assumptions of the Naïve Bayes approach by taking account of conditional probabilities. It does this in a limited sense, by incorporating the conditional probability of each attribute given the class and (at most) one other attribute. The method of boosting has previously proven very effective in improving the performance of Naïve Bayes classifiers and in this paper, we investigate its effectiveness on application to the TAN classifier.

Tom Downs, Adelina Tang
Clustering Model Selection for Reduced Support Vector Machines

The reduced support vector machine (RSVM) was proposed with the practical objective of overcoming computational difficulties and reducing model complexity by generating a nonlinear separating surface for a massive dataset. It has been successfully applied to other kernel-based learning algorithms, and experimental studies have shown its efficiency. In this paper we propose a robust method to build the RSVM model via RBF (Gaussian) kernel construction. Applying a clustering algorithm to each class, we generate cluster centroids for each class and use them to form the reduced set used in RSVM. We also estimate the approximate density of each cluster to set the parameter used in the Gaussian kernel. With comparable classification performance on the test set, our method selects a smaller reduced set than the random selection scheme. Moreover, it determines the kernel parameter automatically and individually for each point in the reduced set, whereas RSVM uses a common kernel parameter determined by a tuning procedure.

Lih-Ren Jen, Yuh-Jye Lee
Generating the Reduced Set by Systematic Sampling

Computational difficulties occur when a conventional support vector machine with nonlinear kernels is used on massive datasets. The reduced support vector machine (RSVM) replaces the fully dense square kernel matrix with a small rectangular kernel matrix, which is used in the nonlinear SVM formulation to avoid these difficulties. In this paper, we propose a new algorithm, Systematic Sampling RSVM (SSRSVM), that selects informative data points to form the reduced set, whereas RSVM uses a random selection scheme. The algorithm is inspired by the key idea of the SVM: the SVM classifier can be represented by support vectors, and misclassified points are a part of the support vectors. SSRSVM starts with an extremely small initial reduced set and iteratively adds a portion of misclassified points into the reduced set, based on the current classifier, until the validation-set correctness is large enough. In our experiments, we tested SSRSVM on six publicly available datasets. It turns out that SSRSVM can automatically generate a smaller reduced set than random sampling. Moreover, SSRSVM is faster than RSVM and much faster than the conventional SVM at the same level of test-set correctness.

Chien-Chung Chang, Yuh-Jye Lee
Experimental Comparison of Classification Uncertainty for Randomised and Bayesian Decision Tree Ensembles

In this paper we experimentally compare the classification uncertainty of the randomised Decision Tree (DT) ensemble technique and the Bayesian DT technique with a restarting strategy on a synthetic dataset as well as on some datasets commonly used in the machine learning community. For quantitative evaluation of classification uncertainty, we use an Uncertainty Envelope dealing with the class posterior distribution and a given confidence probability. Counting the classifier outcomes, this technique produces feasible evaluations of the classification uncertainty. Using this technique in our experiments, we found that the Bayesian DT technique is superior to the randomised DT ensemble technique.

Vitaly Schetinin, Derek Partridge, Wojtek J. Krzanowski, Richard M. Everson, Jonathan E. Fieldsend, Trevor C. Bailey, Adolfo Hernandez
Policy Gradient Method for Team Markov Games

The main aim of this paper is to extend the single-agent policy gradient method for multiagent domains where all agents share the same utility function. We formulate these team problems as Markov games endowed with the asymmetric equilibrium concept and based on this formulation, we provide a direct policy gradient learning method. In addition, we test the proposed method with a small example problem.

Ville Könönen
An Information Theoretic Optimal Classifier for Semi-supervised Learning

Model uncertainty refers to the risk associated with basing prediction on only one model. In semi-supervised learning, this uncertainty is greater than in supervised learning (for the same total number of instances) given that many data points are unlabelled. An optimal Bayes classifier (OBC) reduces model uncertainty by averaging predictions across the entire model space weighted by the models’ posterior probabilities. For a given model space and prior distribution OBC produces the lowest risk. We propose an information theoretic method to construct an OBC for probabilistic semi-supervised learning using Markov chain Monte Carlo sampling. This contrasts with typical semi-supervised learning that attempts to find the single most probable model using EM. Empirical results verify that OBC yields more accurate predictions than the best single model.

Ke Yin, Ian Davidson
Improving Evolutionary Algorithms by a New Smoothing Technique

In this paper, a novel smoothing technique that can be integrated into different optimization methods to improve their performance is presented. First, a new smoothing technique using a properly truncated Fourier series as the smoothing function is proposed. This smoothing function eliminates many local minima and preserves the global minima, making the search for the optimal solution easier and faster. Second, the technique is integrated into a simple genetic algorithm to demonstrate its efficiency. The simulation results indicate that the new smoothing technique can improve the simple genetic algorithm greatly.

Yuping Wang
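Truncating a Fourier series to smooth away local minima can be illustrated on a sampled 1-D objective with NumPy's FFT; the cut-off frequency and test function are illustrative assumptions, not the paper's construction:

```python
import numpy as np

def fourier_smooth(f_values, keep=5):
    """Smooth a sampled periodic objective by truncating its Fourier
    series: keep only the `keep` lowest frequencies, zero the rest."""
    spectrum = np.fft.rfft(f_values)
    spectrum[keep:] = 0.0
    return np.fft.irfft(spectrum, n=len(f_values))

x = np.linspace(0, 1, 256, endpoint=False)
rugged = np.sin(2 * np.pi * x) + 0.3 * np.sin(40 * np.pi * x)  # many local minima
smoothed = fourier_smooth(rugged, keep=5)  # wiggles removed, global shape kept
```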
In-Situ Learning in Multi-net Systems

Multiple classifier systems based on neural networks can give improved generalisation performance as compared with single classifier systems. We examine collaboration in multi-net systems through in-situ learning, exploring how generalisation can be improved through the simultaneous learning in networks and their combination. We present two in-situ trained systems; first, one based upon the simple ensemble, combining supervised networks in parallel, and second, a combination of unsupervised and supervised networks in sequence. Results for these are compared with existing approaches, demonstrating that in-situ trained systems perform better than similar pre-trained systems.

Matthew Casey, Khurshid Ahmad
Multi-objective Genetic Algorithm Based Method for Mining Optimized Fuzzy Association Rules

This paper introduces optimized fuzzy association rule mining. We propose a multi-objective Genetic Algorithm (GA) based approach for mining fuzzy association rules containing instantiated and uninstantiated attributes. According to our method, fuzzy association rules can contain an arbitrary number of uninstantiated attributes. The method uses three objectives for the rule mining process: support, confidence, and number of fuzzy sets. Experimental results conducted on a real data set demonstrate the effectiveness and applicability of the proposed approach.

Mehmet Kaya, Reda Alhajj
Co-training from an Incremental EM Perspective

We study classification when the majority of data is unlabeled and only a small fraction is labeled: the so-called semi-supervised learning situation. Blum and Mitchell’s co-training is a popular semi-supervised algorithm [1] to use when we have multiple independent views of the entities to classify. An example of a multi-view situation is classifying web pages: one view may describe the pages by the words that occur on them, another view describes the pages by the words in the hyperlinks that point to them. In co-training, two learners each form a model from the labeled data and then incrementally label small subsets of the unlabeled data for each other. The learners then re-estimate their models from the labeled data and the pseudo-labels provided by the other learner. Though some analysis of the algorithm’s performance exists [1], the computation performed is still not well understood. We propose that each view in co-training is effectively performing incremental EM as postulated by Neal and Hinton [3], combined with a Bayesian classifier. This analysis suggests improvements over the core co-training algorithm. We introduce variations which result in faster convergence to the maximum possible accuracy of classification than the core co-training algorithm, and therefore increase the learning efficiency. We empirically verify our claim for a number of data sets in the context of belief network learning.

Minoo Aminian
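The core co-training loop can be sketched with two naive Bayes learners, one per view; the two views X1, X2 (non-negative count features, as MultinomialNB expects), the confidence thresholding and all sizes are illustrative assumptions:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labeled_idx, rounds=10, per_round=5):
    """Co-training sketch: one learner per view; each pseudo-labels its
    most confident unlabeled points, growing the shared labeled pool."""
    labeled = set(labeled_idx)
    y = y.copy()
    for _ in range(rounds):
        for Xa in (X1, X2):                      # alternate the two views
            idx = sorted(labeled)
            clf = MultinomialNB().fit(Xa[idx], y[idx])
            pool = [i for i in range(len(y)) if i not in labeled]
            if not pool:
                return y
            proba = clf.predict_proba(Xa[pool])
            conf = proba.max(axis=1)
            for j in np.argsort(conf)[-per_round:]:
                i = pool[j]
                y[i] = clf.classes_[proba[j].argmax()]   # pseudo-label
                labeled.add(i)
    return y
```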
Locally Tuned General Regression for Learning Mixture Models Using Small Incomplete Data Sets with Outliers and Overlapping Classes

Finite mixture models are commonly used in pattern recognition. Parameters of these models are usually estimated via the Expectation Maximization algorithm, which was modified earlier to handle incomplete data. However, the modified algorithm is sensitive to the occurrence of outliers in the data and to the overlap among data classes in the data space. Meanwhile, it requires the number of missing values to be small in order to produce good estimates of the model parameters. Therefore, a new algorithm is proposed in this paper to overcome these problems. A comparison study shows that the proposed algorithm is preferable to other algorithms commonly used in the literature, including the modified Expectation Maximization algorithm.

Ahmed Rafat

Financial Engineering

Credit Risks of Interest Rate Swaps: A Comparative Study of CIR and Monte Carlo Simulation Approach

This paper compares the credit risk profile for two types of model, the Monte Carlo model used in the existing literature, and the Cox, Ingersoll and Ross (CIR) model. Each of the profiles has a concave or hump-backed shape, reflecting the amortisation and diffusion effects. However, the CIR model generates significantly different results. In addition, we consider the sensitivity of these models of credit risk to initial interest rates, volatility, maturity, kappa and delta. The results show that the sensitivities vary across the models, and we explore the meaning of that variation.

Victor Fang, Vincent C. S. Lee
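The CIR short-rate dynamics dr = kappa*(theta - r)dt + sigma*sqrt(r)dW that the comparison rests on can be simulated with a simple Euler scheme; the parameter values and the reflection used to keep rates positive are illustrative assumptions:

```python
import numpy as np

def cir_paths(r0, kappa, theta, sigma, T=5.0, steps=500, n_paths=1000, seed=0):
    """Euler-discretised CIR paths; kappa is the mean-reversion speed,
    theta the long-run level, sigma the volatility parameter."""
    rng = np.random.default_rng(seed)
    dt = T / steps
    r = np.full(n_paths, r0)
    paths = [r.copy()]
    for _ in range(steps):
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)
        r = np.abs(r + kappa * (theta - r) * dt + sigma * np.sqrt(r) * dw)
        paths.append(r.copy())
    return np.array(paths)

paths = cir_paths(r0=0.05, kappa=0.5, theta=0.05, sigma=0.1)
print(paths[-1].mean())   # hovers near the long-run level theta
```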
Cardinality Constrained Portfolio Optimisation

The traditional quadratic programming approach to portfolio optimisation is difficult to implement when there are cardinality constraints. Recent approaches to resolving this have used heuristic algorithms to search for points on the cardinality constrained frontier. However, these can be computationally expensive when the practitioner does not know a priori exactly how many assets they may desire in a portfolio, or what level of return/risk they wish to be exposed to without recourse to analysing the actual trade-off frontier. This study introduces a parallel solution to this problem. By extending techniques developed in the multi-objective evolutionary optimisation domain, a set of portfolios representing estimates of all possible cardinality constrained frontiers can be found in a single search process, for a range of portfolio sizes and constraints. Empirical results are provided on emerging markets and US asset data, and compared to unconstrained frontiers found by quadratic programming.

Jonathan E. Fieldsend, John Matatko, Ming Peng
Stock Trading by Modelling Price Trend with Dynamic Bayesian Networks

We study a stock trading method based on dynamic Bayesian networks to model the dynamics of the trend of stock prices. We design a three-level hierarchical hidden Markov model (HHMM). There are five states describing the trend in the first level. The second and third levels are abstract and concrete hidden Markov models to produce the observed patterns. To train the HHMM, we adopt semi-supervised learning in which the trend states of the first layer are manually labelled. The inferred probability distributions of the first level are used as an indicator for the trading signal, which is more natural and reasonable than technical indicators. Experimental results on 20 representative companies of the Korean stock market show that the proposed HHMM outperforms a technical indicator in trading performance.

O Jangmin, Jae Won Lee, Sung-Bae Park, Byoung-Tak Zhang
Detecting Credit Card Fraud by Using Questionnaire-Responded Transaction Model Based on Support Vector Machines

This work proposes a new method to solve the credit card fraud problem. Traditionally, systems based on previous transaction data were set up to predict a new transaction. This approach provides a good solution in some situations. However, there are still many problems waiting to be solved, such as skewed data distribution, too much overlapped data, fickle-minded consumer behavior, and so on. To address these problems, we propose to develop a personalized system, which can prevent fraud from the initial use of credit cards. First, the questionnaire-responded transaction (QRT) data of users are collected by using an online questionnaire based on consumer behavior surveys. The data are then trained by using support vector machines (SVMs), whereby the QRT models are developed. The QRT models are used to predict a new transaction. Results from this study show that the proposed method can effectively detect credit card fraud.

Rong-Chang Chen, Ming-Li Chiu, Ya-Li Huang, Lin-Ti Chen
Volatility Forecasts in Financial Time Series with HMM-GARCH Models

Nowadays many researchers use GARCH models to generate volatility forecasts. However, it is well known that volatility persistence, as indicated by the sum of the two parameters G1 and A1 [1], in GARCH models is usually too high. Since volatility forecasts in GARCH models are based on these two parameters, this may lead to poor volatility forecasts. It has long been argued that this high persistence is due to structure changes (e.g. shifts of volatility levels) in the volatility processes, which GARCH models cannot capture. To solve this problem, we introduce our GARCH model based on Hidden Markov Models (HMMs), called the HMM-GARCH model. By using the concept of hidden states, HMMs allow for periods with different volatility levels characterized by the hidden states. Within each state, local GARCH models can be applied to model conditional volatility. Empirical analysis demonstrates that our model takes care of the structure changes and hence yields better volatility forecasts.

Xiong-Fei Zhuang, Lai-Wan Chan
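The per-regime building block is the standard GARCH(1,1) conditional variance recursion; a plain NumPy sketch follows (the parameter values correspond to the abstract's A1 and G1 and are purely illustrative):

```python
import numpy as np

def garch11_variance(returns, omega, alpha, beta):
    """GARCH(1,1) recursion:
    sigma2[t] = omega + alpha * r[t-1]**2 + beta * sigma2[t-1].
    Persistence is alpha + beta; fitting separate parameters per
    hidden regime is what lowers it in the HMM-GARCH idea."""
    sigma2 = np.empty_like(returns)
    sigma2[0] = returns.var()
    for t in range(1, len(returns)):
        sigma2[t] = omega + alpha * returns[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

r = np.random.randn(500) * 0.01
print(garch11_variance(r, omega=1e-6, alpha=0.08, beta=0.9)[-1])
```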

Agent Technologies

User Adaptive Answers Generation for Conversational Agent Using Genetic Programming

Recently, there has been growing interest in conversational agents as effective and familiar information providers. Most conversational agents reply to users’ queries with static answers constructed in advance; they therefore cannot respond with flexible answers adjusted to the user, and this rigidity limits their usability. In this paper, we propose a method using genetic programming to generate answers adapted to users. In order to construct answers, Korean grammar structures are defined in BNF (Backus Naur Form), and various grammar structures are generated using genetic programming (GP). We have applied the proposed method to an agent introducing a fashion web site, and confirmed that it responds more flexibly to users’ queries.

Kyoung-Min Kim, Sung-Soo Lim, Sung-Bae Cho
Comparing Measures of Agreement for Agent Attitudes

A model for the interaction of three agents is presented in which each agent has three personality parameters; tolerance, volatility and stubbornness. A pair of agents interact and evolve their attitudes to each other as a function of how well they agree in their attitude towards the third agent in the group. The effects of using two different measures of agreement are compared and contrasted and it is found that although the measures used have quite different motivations and formulations, there are striking similarities between the overall results which they produce.

Mark McCartney, David H. Glass
Hierarchical Agglomerative Clustering for Agent-Based Dynamic Collaborative Filtering

Collaborative filtering systems suggest items to a user because they are highly rated by other users with similar tastes. Although these systems are achieving great success in web-based applications, the tremendous growth in the number of people using these applications requires performing many recommendations per second for millions of users. Technologies are needed that can rapidly produce high-quality recommendations for a large community of users. In this paper we present an agent-based approach to collaborative filtering where agents work on behalf of their users to form shared “interest groups”, a process of pre-clustering users based on their interest profiles. These groups are dynamically updated to reflect users’ evolving interests over time.

Gulden Uchyigit, Keith Clark
Learning Users’ Interests in a Market-Based Recommender System

Recommender systems are widely used to cope with the problem of information overload and, consequently, many recommendation methods have been developed. However, no one technique is best for all users in all situations. To combat this, we have previously developed a market-based recommender system that allows multiple agents (each representing a different recommendation method or system) to compete with one another to present their best recommendations to the user. Our marketplace thus coordinates multiple recommender agents and ensures only the best recommendations are presented. To do this effectively, however, each agent needs to learn the users’ interests and adapt its recommending behaviour accordingly. To this end, in this paper, we develop a reinforcement learning and Boltzmann exploration strategy that the recommender agents can use for these tasks. We then demonstrate that this strategy helps the agents to effectively obtain information about the users’ interests which, in turn, speeds up the market convergence and enables the system to rapidly highlight the best recommendations.

Yan Zheng Wei, Luc Moreau, Nicholas R. Jennings
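The Boltzmann exploration strategy mentioned above selects an action with probability proportional to exp(Q/T), trading exploration against exploitation through the temperature T; a generic sketch (Q-values and temperature are illustrative):

```python
import numpy as np

def boltzmann_choice(q_values, temperature, rng=None):
    """Sample an action with probability proportional to exp(Q / T):
    high T explores broadly, low T exploits the learned Q-values."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                        # numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

q = [0.2, 0.8, 0.5]   # learned values of three candidate recommendations
print(boltzmann_choice(q, temperature=0.3))
```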
Visualisation of Multi-agent System Organisations Using a Self-organising Map of Pareto Solutions

The structure and performance of organisations – natural or man-made – are intricately linked, and these multifaceted interactions are increasingly being investigated using Multi Agent System concepts. This paper shows how a selection of generic structural metrics for organisations can be explored using a combination of Pareto Frontier exemplars; extensive simulations of simple goal-orientated Multi Agent Systems, and exposé of organisational types through Self-Organising Map clusters can provide insights into desirable structures for such objectives as robustness and efficiency.

Johnathan M. E. Gabbai, W. Andy Wright, Nigel M. Allinson
Backmatter
Metadata
Title
Intelligent Data Engineering and Automated Learning – IDEAL 2004
Editors
Zheng Rong Yang
Hujun Yin
Richard M. Everson
Copyright Year
2004
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-540-28651-6
Print ISBN
978-3-540-22881-3
DOI
https://doi.org/10.1007/b99975