Skip to main content
main-content

Über dieses Buch

This book constitutes the proceedings of the 8th International Conference on Modeling Decisions for Artificial Intelligence, MDAI 2011, held in Changsha, China, in July 2011. The 25 papers presented were carefully reviewed and selected from 43 submissions. The volume also contains extended abstracts of the three invited papers. The topics covered are aggregation operators and decision making; clustering and similarity; computational intelligence; and data privacy.

Inhaltsverzeichnis

Frontmatter

Invited Papers

Online Social Honeynets: Trapping Web Crawlers in OSN

Abstract
Web crawlers are complex applications that explore the Web with different purposes. Web crawlers can be configured to crawl online social networks (OSN) to obtain relevant data about its global structure. Before a web crawler can be launched to explore the web, a large amount of settings have to be configured. This settings define the behavior of the crawler and have a big impact on the collected data. The amount of collected data and the quality of the information that it contains are affected by the crawler settings and, therefore, by properly configuring this web crawler settings we can target specific goals to achieve with our crawl. In this paper, we analyze how different scheduler algorithms affect to the collected data in terms of users’ privacy. Furthermore, we introduce the concept of online social honeynet (OShN) to protect OSN from web crawlers and we provide an OShN proof-of-concept that achieve good results for protecting OSN from a specific web crawler.
Jordi Herrera-Joancomartí, Cristina Pérez-Solà

Cost-Sensitive Learning

Abstract
In conventional classification settings, the classifiers generally try to maximize the accuracy or minimize the error rate, both are equivalent to minimizing the number of mistakes in classifying new instances. Such a setting is valid when the costs of different types of mistakes are equal. In real-world applications, however, the costs of different types of mistakes are often unequal. For example, in intrusion detection, the cost of mistakenly classifying an intrusion as a normal access is usually far larger than that of mistakenly classifying a normal access as an intrusion, because the former type of mistakes will result in much more serious losses.
Zhi-Hua Zhou

Evolving Graph Structures for Drug Discovery

Abstract
Computer-Aided Drug Discovery (CADD) is concerned with the use of computational techniques to determine drug structures with certain desirable properties. Evolutionary algorithms (EAs) have been proposed to evolve drug molecules by mimicking chemical reactions that cause the exchange of chemical bonds and components between molecules. For these EAs to perform their tasks, known molecular components, which can serve as building blocks for the drugs to be designed, and known chemical rules, which govern chemical combination between different components, have to be introduced before an evolutionary process can take place. To automate drug molecular design without such prior knowledge and constraints, we need a special EA that can evolve molecular graphs with minimal background knowledge. In this talk, we present one such EA that can evolves graph structures used to represent drug molecules. We show how molecular fingerprints can be used to evaluate the “fitness” of an evolved drug structure obtained at each generation during the evolutionary process. We also show how the discovering of privileged structures in many drug molecules and the use of ligand docking and binding affinity can be used as alternatives for fitness evaluating in an EA for drug design. We show how the results obtained using the proposed EA may lead to a promising approach for CADD.
Keith C. C. Chan

Fuzzy Measures and Comonotonicity on Multisets

Abstract
Fuzzy measures on multisets are studied. We show that a class of multisets can be represented as a subset of positive integers. Comonotonicity for multisets are defined. We show that a fuzzy measure on multisets with some comonotonicity condition can be represented by generalized fuzzy integral.
Yasuo Narukawa, Klara Stokes, Vicenç Torra

Regular Papers

Aggregation Operators and Decision Making

A Parallel Fusion Method for Heterogeneous Multi-sensor Transportation Data

Abstract
Information fusion technology has been introduced for data analysis in intelligent transportation systems (ITS) in order to generate a more accurate evaluation of the traffic state. The data collected from multiple heterogeneous traffic sensors are converted into common traffic state features, such as mean speed and volume. Afterwards, we design a hierarchical evidential fusion model (HEFM) based on D-S Evidence Theory to implement the feature-level fusion. When the data quantity reaches a large amount, HEFM can be parallelized in data-centric mode, which mainly consists of region-based data decomposition by quadtree and fusion task scheduling. The experiments are conducted to testify the scalability of this parallel fusion model on accuracy and efficiency as the numbers of decomposed sub-regions and cyberinfrastructure computing nodes increase. The results show that significant speedups can be achieved without loss in accuracy.
Yingjie Xia, Chengkun Wu, Qingjie Kong, Zhenyu Shan, Li Kuang

A Dynamic Value-at-Risk Portfolio Model

Abstract
A mathematical dynamic portfolio allocation model with uncertainty is discussed. Introducing a value-at-risk under a condition, this paper formulates value-at-risks in a dynamic stochastic environment. By dynamic programming approach, an optimality condition of the optimal portfolio for dynamic value-at-risks is derived. It is shown that the optimal time-average value-at-risk is a solution of the optimality equation under a reasonable assumption, and an optimal trading strategy is obtained from the equation. A numerical example is given to illustrate our idea.
Yuji Yoshida

Modelling Heterogeneity among Experts in Multi-criteria Group Decision Making Problems

Abstract
Heterogeneity in group decision making problems has been recently studied in the literature. Some instances of these studies include the use of heterogeneous preference representation structures, heterogeneous preference representation domains and heterogeneous importance degrees. On this last heterogeneity level, the importance degrees are associated to the experts regardless of what is being assessed by them, and these degrees are fixed through the problem. However, there are some situations in which the experts’ importance degrees do not depend only on the expert. Sometimes we can find sets of heterogeneously specialized experts, that is, experts whose knowledge level is higher on some alternatives and criteria than it is on any others. Consequently, their importance degree should be established in accordance with what is being assessed. Thus, there is still a gap on heterogeneous group decision making frameworks to be studied. We propose a new fuzzy linguistic multi-criteria group decision making model which considers different importance degrees for each expert depending not only on the alternatives but also on the criterion which is taken into account to evaluate them.
Ignacio J. Pérez, Sergio Alonso, Francisco J. Cabrerizo, Jie Lu, Enrique Herrera-Viedma

Fast Mining of Non-derivable Episode Rules in Complex Sequences

Abstract
Researchers have been endeavoring to discover concise sets of episode rules instead of complete sets in sequences. Existing approaches, however, are not able to process complex sequences and can not guarantee the accuracy of resulting sets due to the violation of anti-monotonicity of the frequency metric. In some real applications, episode rules need to be extracted from complex sequences in which multiple items may appear in a time slot. This paper investigates the discovery of concise episode rules in complex sequences. We define a concise representation called non-derivable episode rules and formularize the mining problem. Adopting a novel anti-monotonic frequency metric, we then develop a fast approach to discover non-derivable episode rules in complex sequences. Experimental results demonstrate that the utility of the proposed approach substantially reduces the number of rules and achieves fast processing.
Min Gan, Honghua Dai

Hybridizing Data Stream Mining and Technical Indicators in Automated Trading Systems

Abstract
Automated trading systems for financial markets can use data mining techniques for future price movement prediction. However, classifier accuracy is only one important component in such a system: the other is a decision procedure utilizing the prediction in order to be long, short or out of the market. In this paper, we investigate the use of technical indicators as a means of deciding when to trade in the direction of a classifier’s prediction. We compare this “hybrid” technical/data stream mining-based system with a naive system that always trades in the direction of predicted price movement. We are able to show via evaluations across five financial market datasets that our novel hybrid technique frequently outperforms the naive system. To strengthen our conclusions, we also include in our evaluation several “simple” trading strategies without any data mining component that provide a much stronger baseline for comparison than traditional buy-and-hold or sell-and-hold strategies.
Michael Mayo

Semi-supervised Dimensionality Reduction via Harmonic Functions

Abstract
Traditional unsupervised dimensionality reduction techniques are widely used in many learning tasks, such as text classification and face recognition. However, in many applications, a few labeled examples are readily available. Thus, semi-supervised dimensionality reduction(SSDR), which could incorporate the label information, has aroused considerable research interests. In this paper, a novel SSDR approach, which employs the harmonic function in a gaussian random field to compute the states of all points, is proposed. It constructs a complete weighted graph, whose edge weights are assigned by the computed states. The linear projection matrix is then derived to maximize the separation of points in different classes. For illustration, we provide some deep theoretical analyses and promising classification results on different kinds of data sets. Compared with other dimensionality reduction approaches, it is more beneficial for classification. Comparing with the transductive harmonic function method, it is inductive and able to deal with new coming data directly.
Chenping Hou, Feiping Nie, Yi Wu

Clustering

Semi-supervised Agglomerative Hierarchical Clustering with Ward Method Using Clusterwise Tolerance

Abstract
This paper presents a new semi-supervised agglomerative hierarchical clustering algorithm with ward method using clusterwise tolerance. Recently, semi-supervised clustering has been remarked and studied in many research fields. In semi-supervised clustering, must-link and cannot-link called pairwise constraints are frequently used in order to improve clustering properties. First, a clusterwise tolerance based pairwise constraints is introduced in order to handle must-link and cannot-link constraints. Next, a new semi-supervised agglomerative hierarchical clustering algorithm with ward method is constructed based on above discussions. Moreover, the effectiveness of proposed algorithms is verified through numerical examples.
Yukihiro Hamasuna, Yasunori Endo, Sadaaki Miyamoto

Agglomerative Clustering Using Asymmetric Similarities

Abstract
Algorithms of agglomerative hierarchical clustering using asymmetric similarity measures are studied. Two different measures between two clusters are proposed, one of which generalizes the average linkage for symmetric similarity measures. Asymmetric dendrogram representation is considered after foregoing studies. It is proved that the proposed linkage methods for asymmetric measures have no reversals in the dendrograms. Examples based on real data show how the methods work.
Satoshi Takumi, Sadaaki Miyamoto

On Hard c-Means Using Quadratic Penalty-Vector Regularization for Uncertain Data

Abstract
Clustering is one of the unsupervised classification techniques of the data analysis. Data are transformed from a real space into a pattern space to apply clustering methods. However, the data cannot be often represented by a point because of uncertainty of the data, e.g., measurement error margin and missing values in data. In this paper, we introduce quadratic penalty-vector regularization to handle such uncertain data into hard c-means (HCM) which is one of the most typical clustering algorithms. First, we propose a new clustering algorithm called hard c-means using quadratic penalty-vector regularization for uncertain data (HCMP). Second, we propose sequential extraction hard c-means using quadratic penalty-vector regularization (SHCMP) to handle datasets whose cluster number is unknown. Moreover, we verify the effectiveness of our propose algorithms through some numerical examples.
Yasunori Endo, Arisa Taniguchi, Aoi Takahashi, Yukihiro Hamasuna

Grey Synthetic Clustering Method for DoS Attack Effectiveness Evaluation

Abstract
Effectiveness evaluation of DoS attack is a complex problem, in which the information is incomplete and vague. The grey theory, which deals with the "less data uncertainty" matter, is a powerful tool to solve the problem. We propose a grey synthetic clustering method for DoS attack effectiveness evaluation in this paper. Firstly, we calculate grey clustering coefficient with general grey clustering method. Secondly, if there is no significant difference about grey clustering coefficient, we calculate the synthetic clustering coefficient. Finally, clustering objects can be clustered accurately with the synthetic clustering coefficient. The experimental results show that the approach is feasible and correct.
Zimei Peng, Wentao Zhao, Jun Long

Fuzzy-Possibilistic Product Partition: A Novel Robust Approach to c-Means Clustering

Abstract
One of the main challenges in the field of c-means clustering models is creating an algorithm that is both accurate and robust. In the absence of outlier data, the conventional probabilistic fuzzy c-means (FCM) algorithm, or the latest possibilistic-fuzzy mixture model (PFCM), provide highly accurate partitions. However, during the 30-year history of FCM, the researcher community of the field failed to produce an algorithm that is accurate and insensitive to outliers at the same time. This paper introduces a novel mixture clustering model built upon probabilistic and possibilistic fuzzy partitions, where the two components are connected to each other in a qualitatively different way than they were in earlier mixtures. The fuzzy-possibilistic product partition c-means (FP3CM) clustering algorithm seems to fulfil the initial requirements, namely it successfully suppresses the effect of outliers situated at any finite distance and provides partitions of high quality.
László Szilágyi

Computational Intelligence and Data Mining

A Novel and Effective Approach to Shape Analysis: Nonparametric Representation, De-noising and Change-Point Detection, Based on Singular-Spectrum Analysis

Abstract
This paper proposes new very effective methods for building nonparametric, multi-resolution models of 2D closed contours, based on Singular Spectrum Analysis (SSA). Representation, de-noising and change-point detection to automate the landmark selection are simultaneously addressed in three different settings. The basic one is to apply SSA to a shape signature encoded by sampling a real-valued time series from a radius-vector contour function. However, this is only suited for star-shaped contours. A second setting is to generalize SSA so as to apply to a complex-valued trajectory matrix in order to directly represent the contour as a time series path in the complex plan, along with detecting change-points in a complex-valued time series. A third setting is to consider the pairs (x, y) of coordinates as a co-movement of two real-valued time series and to apply SSA to a trajectory matrix defined in such a way to span both of them.
Vasile Georgescu

A SSA-Based New Framework Allowing for Smoothing and Automatic Change-Points Detection in the Fuzzy Closed Contours of 2D Fuzzy Objects

Abstract
The aim of this paper is to propose a new framework, based on Singular-Spectrum Analysis, allowing for smoothing and automatic change-point detection in the fuzzy closed contours of 2D fuzzy objects. The representation of fuzzy objects is first addressed, by distinguishing between fuzzy regions and fuzzy closed curves. Fuzzy shape signatures are derived in special cases, from which fuzzy time series can be subsequently sampled. Geodesic and Euclidean fuzzy paths and distances between two points in a fuzzy region are next contrasted. Finally, a novel approach to decomposing and reconstructing a fuzzy shape and to automatic change-point detection is proposed, based on a generalization of Singular-Spectrum Analysis so as to deal with complex-valued trajectory matrices. The coordinates themselves, represented as complex numbers are used as a shape signature. This approach is suitable for non-convex and non-star-shaped fuzzy contours.
Vasile Georgescu

Possibilistic Linear Programming Using General Necessity Measures Preserves the Linearity

Abstract
In this paper, a robust optimization approach to possibilistic linear programming problems is studied. After necessity measures and generation processes of logical connectives are reviewed, the necessity fractile optimization model of possibilistic linear programming problem is introduced as a robust optimization model. This problem is reduced to a linear semi-infinite programming problem. Assuming the convexity of the right parts of membership functions of fuzzy coefficients and the concavity of membership functions of fuzzy constraints, we investigate conditions on logical connectives for the problems to be reduced to linear programming problems. Several examples are given to demonstrate that necessity fractile optimization models are often reduced to linear programming problems.
Masahiro Inuiguchi

An Efficient Hybrid Approach to Correcting Errors in Short Reads

Abstract
High-throughput sequencing technologies produce a large number of short reads that may contain errors. These sequencing errors constitute one of the major problems in analyzing such data. Many algorithms and software tools have been proposed to correct errors in short reads. However, the computational complexity limits their performance. In this paper, we propose a novel and efficient hybrid approach which is based on an alignment-free method combined with multiple alignments. We construct suffix arrays on all short reads to search the correct overlapping regions. For each correct overlapping region, we form multiple alignments for the substrings following the correct overlapping region to identify and correct the erroneous bases. Our approach can correct all types of errors in short reads produced by different sequencing platforms. Experiments show that our approach provides significantly higher accuracy and is comparable or even faster than previous approaches.
Zhiheng Zhao, Jianping Yin, Yong Li, Wei Xiong, Yubin Zhan

Data Privacy

Rule Protection for Indirect Discrimination Prevention in Data Mining

Abstract
Services in the information society allow automatically and routinely collecting large amounts of data. Those data are often used to train classification rules in view of making automated decisions, like loan granting/denial, insurance premium computation, etc. If the training datasets are biased in what regards sensitive attributes like gender, race, religion, etc., discriminatory decisions may ensue. Direct discrimination occurs when decisions are made based on biased sensitive attributes. Indirect discrimination occurs when decisions are made based on non-sensitive attributes which are strongly correlated with biased sensitive attributes. This paper discusses how to clean training datasets and outsourced datasets in such a way that legitimate classification rules can still be extracted but indirectly discriminating rules cannot.
Sara Hajian, Josep Domingo-Ferrer, Antoni Martínez-Ballesté

A Comparison of Two Different Types of Online Social Network from a Data Privacy Perspective

Abstract
We consider two distinct types of online social network, the first made up of a log of writes to wall by users in Facebook, and the second consisting of a corpus of emails sent and received in a corporate environment (Enron). We calculate the statistics which describe the topologies of each network represented as a graph. Then we calculate the information loss and risk of disclosure for different percentages of perturbation for each dataset, where perturbation is achieved by randomly adding links to the nodes. We find that the general tendency of information loss is similar, although Facebook is affected to a greater extent. For risk of disclosure, both datasets also follow a similar trend, except for the average path length statistic. We find that the differences are due to the different distributions of the derived factors, and also the type of perturbation used and its parameterization. These results can be useful for choosing and tuning anonymization methods for different graph datasets.
David F. Nettleton, Diego Sáez-Trumper, Vicenç Torra

On the Declassification of Confidential Documents

Abstract
We introduce the anonymization of unstructured documents to settle the base of automatic declassification of confidential documents. Departing from known ideas and methods of data privacy, we introduce the main issues of unstructured document anonymization and propose the use of named entity recognition techniques from natural language processing and information extraction to identify the entities of the document that need to be protected.
Daniel Abril, Guillermo Navarro-Arribas, Vicenç Torra

Uncovering Community Structure in Social Networks by Clique Correlation

Abstract
Community is tightly-connected group of agents in social networks and the discovery of such subgraphs has aroused considerable research interest in the past few years. Typically, a quantity function called modularity is used to guide the division of the network. By representing the network as a bipartite graph between its vertices and cliques, we show that community structure can be uncovered by the correlation coefficients derived from the bipartite graph through a suitable optimization procedure. We also show that the modularity can be seen as a special case of the quantity function built from the covariance of the vertices. Due the the heteroscedaticity, the modularity suffers a resolution limit problem. And the quantity function based on correlation proposed here exhibits higher resolution power. Experiments show that the proposed method can achieve promising results on synthesized and real world networks. It outperforms several state-of-the-art algorithms.
Xu Liu, Chenping Hou, Qiang Luo, Dongyun Yi

Backmatter

Weitere Informationen

Premium Partner

    Bildnachweise