Skip to main content

2010 | Buch

Advanced Data Mining and Applications

6th International Conference, ADMA 2010, Chongqing, China, November 19-21, 2010, Proceedings, Part I

herausgegeben von: Longbing Cao, Yong Feng, Jiang Zhong

Verlag: Springer Berlin Heidelberg

Buchreihe : Lecture Notes in Computer Science

insite
SUCHEN

Über dieses Buch

With the ever-growing power of generating, transmitting, and collecting huge amounts of data, information overloadis nowan imminent problemto mankind. The overwhelming demand for information processing is not just about a better understanding of data, but also a better usage of data in a timely fashion. Data mining, or knowledge discovery from databases, is proposed to gain insight into aspects ofdata and to help peoplemakeinformed,sensible,and better decisions. At present, growing attention has been paid to the study, development, and application of data mining. As a result there is an urgent need for sophisticated techniques and toolsthat can handle new ?elds of data mining, e. g. , spatialdata mining, biomedical data mining, and mining on high-speed and time-variant data streams. The knowledge of data mining should also be expanded to new applications. The 6th International Conference on Advanced Data Mining and Appli- tions(ADMA2010)aimedtobringtogethertheexpertsondataminingthrou- out the world. It provided a leading international forum for the dissemination of original research results in advanced data mining techniques, applications, al- rithms, software and systems, and di?erent applied disciplines. The conference attracted 361 online submissions from 34 di?erent countries and areas. All full papers were peer reviewed by at least three members of the Program Comm- tee composed of international experts in data mining ?elds. A total number of 118 papers were accepted for the conference. Amongst them, 63 papers were selected as regular papers and 55 papers were selected as short papers.

Inhaltsverzeichnis

Frontmatter

Data Mining Foundations

Cost Sensitive Classification in Data Mining

Cost-sensitive classification is one of mainstream research topics in data mining and machine learning that induces models from data with unbalance class distributions and impacts by quantifying and tackling the unbalance. Rooted in diagnosis data analysis applications, there are great many techniques developed for cost-sensitive learning. They are mainly focused on minimizing the total cost of misclassification costs, test costs, or other types of cost, or a combination among these costs. This paper introduces the up-to-date prevailing cost-sensitive learning methods and presents some research topics by outlining our two new results: lazy-learning and semi-learning strategies for cost-sensitive classifiers.

Zhenxing Qin, Chengqi Zhang, Tao Wang, Shichao Zhang
Web Users Access Paths Clustering Based on Possibilistic and Fuzzy Sets Theory

Web users access paths clustering is important to conduct Web page prediction. In this paper, a novel Web users access paths clustering method is proposed based on possibilistic and fuzzy sets theory. Firstly, a similarity measure method of access paths is proposed based on differences between paths’ factors, such as the length of time spent on visiting a page, the frequency of a page accessed and the order of pages accessed. Furthermore, considering that clusters tend to have vague or imprecise boundaries in the path clustering, a novel uncertain clustering method is proposed based on combining advantages of fuzzy clustering and possibility clustering. A

λ

_cut set is defined here to process the overlapping clusters adaptively. The comparison of experimental results shows that our proposed method is valid and efficient.

Hong Yu, Hu Luo, Shuangshuang Chu
Discriminative Markov Logic Network Structure Learning Based on Propositionalization and χ 2-Test

In this paper we present a bottom-up discriminative algorithm to learn automatically Markov Logic Network structures. Our approach relies on a new propositionalization method that transforms a learning dataset into an approximative representation in the form of boolean tables, from which to construct a set of candidate clauses according to a

χ

2

-test. To compute and choose clauses, we successively use two different optimization criteria, namely pseudo-log-likelihood (PLL) and conditional log-likelihood (CLL), in order to combine the efficiency of PLL optimization algorithms together with the accuracy of CLL ones. First experiments show that our approach outperforms the existing discriminative MLN structure learning algorithms.

Quang-Thang Dinh, Matthieu Exbrayat, Christel Vrain
EWGen: Automatic Generation of Item Weights for Weighted Association Rule Mining

Association Rule Mining is an important data mining technique that has been widely used as an automatic rule generation method. While having outstanding success in many different application domains, it also has the potential to generate a vast number of rules, many of which are of little interest to the user. Weighted Association Rule Mining (WARM) overcomes this problem by assigning weights to items thus enabling interesting rules to be ranked ahead of less interesting ones and making it easier for the user to determine which rules are the most useful. Past research on WARM assumes that users have the necessary knowledge to supply item weights. In this research we relax this assumption by deriving item weights based on interactions between items. Our experimentation shows that the rule bases produced by our scheme produces more compact rule bases with a higher information content than standard rule generation methods.

Russel Pears, Yun Sing Koh, Gillian Dobbie
Best Clustering Configuration Metrics: Towards Multiagent Based Clustering

Multi-Agent Clustering (MAC) requires a mechanism for identifying the most appropriate cluster configuration. This paper reports on experiments conducted with respect to a number of validation metrics to identify the most effective metric with respect to this context. This paper also describes a process whereby such metrics can be used to determine the optimum parameters typically required by clustering algorithms, and a process for incorporating this into a MAC framework to generate best cluster configurations with minimum input from end users.

Santhana Chaimontree, Katie Atkinson, Frans Coenen
On Probabilistic Models for Uncertain Sequential Pattern Mining

We study uncertainty models in

sequential pattern mining

. We consider situations where there is uncertainty either about a

source

or an

event

. We show that both these types of uncertainties could be modelled using probabilistic databases, and give possible-worlds semantics for both. We then describe ”interestingness” criteria based on two notions of frequentness (previously studied for frequent itemset mining) namely

expected support

[C. Aggarwal et al. KDD’09;Chui et al., PAKDD’07,’08] and

probabilistic frequentness

[Bernecker et al., KDD’09]. We study the interestingness criteria from a complexity-theoretic perspective, and show that in case of source-level uncertainty, evaluating probabilistic frequentness is #P-complete, and thus no polynomial time algorithms are likely to exist, but evaluate the interestingness predicate in polynomial time in the remaining cases.

Muhammad Muzammal, Rajeev Raman
Cube Based Summaries of Large Association Rule Sets

A major problem when dealing with association rules post-processing is the huge amount of extracted rules. Several approaches have been implemented to summarize them. However, the obtained summaries are generally difficult to analyse because they suffer from the lack of navigational tools. In this paper, we propose a novel method for summarizing large sets of association rules. Our approach enables to obtain from a rule set, several summaries called Cube Based Summaries (CBSs). We show that the CBSs can be represented as cubes and we give an overview of OLAP navigational operations that can be used to explore them. Moreover, we define a new quality measure called homogeneity, to evaluate the interestingness of CBSs. Finally, we propose an algorithm that generates a relevant CBS w.r.t. a quality measure, to initialize the exploration. The evaluation of our algorithm on benchmarks proves the effectiveness of our approach.

Marie Ndiaye, Cheikh T. Diop, Arnaud Giacometti, Patrick Marcel, Arnaud Soulet
A Perceptron-Like Linear Supervised Algorithm for Text Classification

A fast and accurate linear supervised algorithm is presented which compares favorably to other state of the art algorithms over several real data collections on the problem of text categorization. Although it has been already presented in [6], no proof of its convergence is given. From the geometric intuition of the algorithm it is evident that it is not a Perceptron or a gradient descent algorithm thus an algebraic proof of its convergence is provided in the case of linearly separable classes. Additionally we present experimental results on many standard text classification datasets and artificially generated linearly separable datasets. The proposed algorithm is very simple to use and easy to implement and it can be used in any domain without any modification on the data or parameter estimation.

Anestis Gkanogiannis, Theodore Kalamboukis
Research on Time Series Forecasting Model Based on Moore Automata

For time series data mining (TSDM), the problem of time series forecasting has attracted wide attention as solving it actually paves a way to extrapolate past behavior into the future. Researchers have long been interested in modeling the problem by linear regression, neural network, chaos, support vector machines, etc. In this paper, we explore the use of Moore automata for time series forecast modeling and demonstrate how the Moore automata can be converted to solve the problem with regression methods. The effectiveness of the proposed approach has been verified by experiments.

Yixiong Chen, Zhongfu Wu, Zhiguo Li, Yixing Zhang
A Clustering Algorithm FCM-ACO for Supplier Base Management

Supplier selection is one of the critical components of supply chain management and has played a very important role to improve the competitiveness of the entire supply chain. Successful supplier selection may have significant business implications, while how to determine the suitable suppliers is a complex decision making problem which includes both qualitative and quantitative factors. Therefore, this paper proposes a novel approach that combines the fuzzy c-means (FCM) algorithm with ant colony optimization (ACO) algorithm to cluster suppliers into manageable smaller groups with similar characteristics. The simulation results show that the proposed method improves the performance of FCM algorithm for it is less sensitive to local extreme.

Weining Liu, Lei Jiang
Nearest Neighbour Distance Matrix Classification

A distance based classification is one of the popular methods for classifying instances using a point-to-point distance based on the nearest neighbour or

k

-NEAREST NEIGHBOUR (

k-

NN). The representation of distance measure can be one of the various measures available (e.g. Euclidean distance, Manhattan distance, Mahalanobis distance or other specific distance measures). In this paper, we propose a modified nearest neighbour method called Nearest Neighbour Distance Matrix (NNDM) for classification based on unsupervised and supervised distance matrix. In the proposed NNDM method, an Euclidean distance method coupled with a distance loss function is used to create a distance matrix. In our approach, distances of each instance to the rest of the training instances data will be used to create the training distance matrix (TADM). Then, the TADM will be used to classify a new instance. In supervised NNDM, two instances that belong to different classes will be pushed apart from each other. This is to ensure that the instances that are located next to each other belong to the same class. Based on the experimental results, we found that the trained distance matrix yields reasonable performance in classification.

Mohd Shamrie Sainin, Rayner Alfred
Classification Inductive Rule Learning with Negated Features

This paper reports on an investigation to compare a number of strategies to include negated features within the process of Inductive Rule Learning (IRL). The emphasis is on generating the negation of features while rules are being “learnt”; rather than including (or deriving) the negation of all features as part of the input. Eight different strategies are considered based on the manipulation of three feature sub-spaces. Comparisons are also made with Associative Rule Learning (ARL) in the context of multi-class text classification. The results indicate that the option to include negated features within the IRL process produces more effective classifiers.

Stephanie Chua, Frans Coenen, Grant Malcolm
Fast Retrieval of Time Series Using a Multi-resolution Filter with Multiple Reduced Spaces

Fast retrieval of time series that are similar to a given pattern in large databases is a problem which has received a lot of attention in the last decade. The high dimensionality and large size of time series databases make sequential scanning inefficient to handle the similarity search problem. Several dimensionality reduction techniques have been proposed to reduce the complexity of the similarity search. Multi-resolution techniques are methods that speed-up the similarity search problem by economizing distance computations. In this paper we revisit two of previously proposed methods and present an improved algorithm that combine the advantages of these two methods. We conduct extensive experiments that show the show the superior performance of the improved algorithm over the previously proposed techniques.

Muhammad Marwan Muhammad Fuad, Pierre-François Marteau
DHPTID-HYBRID Algorithm: A Hybrid Algorithm for Association Rule Mining

Direct Hashing and Pruning algorithm of ARM performs well at initial passes by smaller candidate 2-itemset generation and turns out to be very powerful to facilitate initial itemset generation. Efficient pruning technique of AprioriTid algorithm is highly effective for frequent itemset generation in the later passes. Theoretical and practical studies reveals the underneath facts for the scope of improvement for both the algorithms. This paper describes a hybrid approach for association rule mining, based on coupling of best part of two well known ARM algorithms, AprioriTid and DHP with a binary approach which significantly improves the performance.

Shilpa Sonawani, Amrita Mishra
An Improved Rough Clustering Using Discernibility Based Initial Seed Computation

In this paper, we present the discernibility approach for an initial seed computation of Rough K-Means (RKM). We propose the use of the discernibility initial seed computation (ISC) for RKM. Our proposed algorithm aims to improve the performance and to avoid the problem of an empty cluster which affects the numerical stability since there are data constellations where |

Ck

| = 0 in RKM algorithm. For verification, our proposed algorithm was tested using 8 UCI datasets and validated using the David Bouldin Index. The experimental results showed that the proposed algorithm of the discernibility initial seed computation of RKM was appropriate to avoid the empty cluster and capable of improving the performance of RKM.

Djoko Budiyanto Setyohadi, Azuraliza Abu Bakar, Zulaiha Ali Othman
Fixing the Threshold for Effective Detection of Near Duplicate Web Documents in Web Crawling

The drastic development of the WWW in recent times has made the concept of Web Crawling receive remarkable significance. The voluminous amounts of web documents swarming the web have posed huge challenges to web search engines making their results less relevant to the users. The presence of duplicate and near duplicate web documents in abundance has created additional overheads for the search engines critically affecting their performance and quality which have to be removed to provide users with the relevant results for their queries. In this paper, we have presented a novel and efficient approach for the detection of near duplicate web pages in web crawling where the keywords are extracted from the crawled pages and the similarity score between two pages is calculated. The documents having similarity score greater than a threshold value are considered as near duplicates. In this paper we have fixed the threshold value.

V. A. Narayana, P. Premchand, A. Govardhan
Topic-Constrained Hierarchical Clustering for Document Datasets

In this paper, we propose the

topic-constrained hierarchical clustering

, which organizes document datasets into hierarchical trees consistant with a given set of topics. The proposed algorithm is based on a constrained agglomerative clustering framework and a semi-supervised criterion function that emphasizes the relationship between documents and topics and the relationship among documents themselves simultaneously. The experimental evaluation show that our algorithm outperformed the traditional agglomerative algorithm by 7.8% to 11.4%.

Ying Zhao
Discretization of Time Series Dataset Using Relative Frequency and K-Nearest Neighbor Approach

In this work, we propose an improved approach of time series data discretization using the Relative Frequency and K- nearest Neighbor functions called the RFknn method. The main idea of the method is to improve the process of determining the sufficient number of intervals for discretization of time series data. The proposed approach improved the time series data representation by integrating it with the Piecewise Aggregate Approximation (PAA) and the Symbolic Aggregate Approximation (SAX) representation. The intervals are represented as a symbol and can ensure efficient mining process where better knowledge model can be obtained without major loss of knowledge. The basic idea is not to minimize or maximize the number of intervals of the temporal patterns over their class labels. The performance of RFknn is evaluated using 22 temporal datasets and compared to the original time series discretization SAX method with similar representation. We show that RFknn can improve representation preciseness without losing symbolic nature of the original SAX representation. The experimental results showed that RFknn gives better term of representation with lower and comparable error rates.

Azuraliza Abu Bakar, Almahdi Mohammed Ahmed, Abdul Razak Hamdan
MSDBSCAN: Multi-density Scale-Independent Clustering Algorithm Based on DBSCAN

A good approach in data mining is density based clustering in which the clusters are constructed based on the density of shape regions. The prominent algorithm proposed in density based clustering family is DBSCAN [1] that uses two global density parameters, namely minimum number of points for a dense region and epsilon indicating the neighborhood distance. Among others, one of the weaknesses of this algorithm is its un-suitability for multi-density data sets where different regions have various densities so the same epsilon does not work. In this paper, a new density based clustering algorithm, MSDBSCAN, is proposed. MSDBSCAN uses a new definition for core point and dense region. The MSDBSCAN can find clusters in multi-variant density data sets. Also this algorithm benefits scale independency. The results obtained on data sets show that the MSDBSCAN is very effective in multi-variant environment.

Gholamreza Esfandani, Hassan Abolhassani
An Efficient Algorithm for Mining Erasable Itemsets

Mining erasable itemsets first introduced in 2009 is one of new emerging data mining tasks. In this paper, we present a new data representation called PID_list, which keeps track of the id_nums (identification number) of products that include an itemset. Based on PID_list, we propose a new algorithm called VME for mining erasable itemsets efficiently. The main advantage of VME algorithm is that the gain of an itemset can be computed efficiently via union operations on product id_nums. In addition, VME algorithm can also automatically prune irrelevant data. For evaluating VME algorithm, we have conducted experiments on six synthetic product databases. Our performance study shows that the VME algorithm is efficient and is on average over two orders of magnitude faster than the META algorithm, which is the first algorithm for dealing with the problem of erasable itemsets mining.

Zhihong Deng, Xiaoran Xu
Discord Region Based Analysis to Improve Data Utility of Privately Published Time Series

Privacy preserving data publishing is one of the most important issues of privacy preserving data mining, but the problem of privately publishing time series data has not received enough attention. Random perturbation is an efficient method of privately publishing data. Random noise addition introduces uncertainty into published data, increasing the difficult of conjecturing the original values. The existing Gaussian white noise addition distributes the same amount of noise to every single attribute of each series, incurring the great decrease of data utility for classification purpose. Through analyzing the different impact of local regions on overall classification pattern, we formally define the concept of discord region which strongly influences the classification performance. We perturb original series differentially according to their position, whether in a discord region, to improve classification utility of published data. The experimental results on real and synthetic data verify the effectiveness of our proposed methods.

Shuai Jin, Yubao Liu, Zhijie Li
Deep Web Sources Classifier Based on DSOM-EACO Clustering Model

There are many deep web sources providing the services, but we may not be aware of their existence, and not know which sources can satisfy our demands. So that there is a great significant to build a system to integrate the myriad deep web sources in the Internet, and the classification of deep web sources is very important in the integration. In this paper, a clustering model based on dynamic self-organizing maps (DSOM) and enhanced ant colony optimization (EACO) is systematically proposed for deep web sources classification. The basic idea of the model is to produce the cluster by DSOM and EACO. With the classified data instances, the classifier can be established. And then the classifier can be used in real deep web sources classification, and it is observed that the proposed approach gives better performance over some traditional approaches for deep web sources classification problems.

Yong Feng, Xianyong Chen, Zhen Chen
Kernel Based K-Medoids for Clustering Data with Uncertainty

Uncertain data is ubiquitous in real-world applications due to various causes. In recent years, clustering uncertain data has been paid more attention by the research community, and the classical clustering algorithms based on partition, density and hierarchy have been extended to handle the uncertain data. However, these extended algorithms usually work in the input space. In this paper, to well explore the inherent data pattern in the high dimensional feature space, we propose a kernel based K-medoids algorithm for clustering uncertain data. Extensive experiments performed on synthetic and several real datasets demonstrate that our kernel based method has higher clustering accuracy than the state-of-the-art UK-medoids algorithm. Also, it signifies that the uncertain data pattern in the new feature space could be well presented when the kernel function and the K-medoids algorithm are effectively incorporated.

Baoguo Yang, Yang Zhang
Frequent Pattern Mining Using Modified CP-Tree for Knowledge Discovery

Mining frequent pattern from databases is useful for knowledge discovery. In this paper, we propose modified CP-Tree, which scans entire transactions only once and constructs the tree by inserting the transactions one by one. The constructed tree consists of an item list along with its occurrence. In addition, a sorted order of items with its frequency of occurrence is maintained and based on the sorted value, the tree is dynamically rearranged. In rearranging phase, the nodes are rearranged in each branch based on sorted order of items. Each path of the branch is removed from the tree, sorted based on sorted order of items and inserted back as a branch into the tree. We have evaluated the performance of the proposed modified tree on benchmark databases such as CHESS, MUSHROOM and T10I4D100K. It is observed that the time taken for extracting frequent item from the tree is encouraging compared to conventional CP-Tree.

R . Vishnu Priya, A. Vadivel, R. S. Thakur
Spatial Neighborhood Clustering Based on Data Field

Based on the theory of data field, each sample point in the spatial database radiates its data energy from the sample space to the mother space. This paper studies the use of the data field as a basis for clustering. We put forward a novel method for clustering, which is a kind of natural clustering method called spatial neighborhood clustering. In the data field, the potential center is identical to the cluster center. The key step of the cluster algorithm is to find the potential centers in the grid units of data field. The spatial neighborhood cluster method makes use of the distribution property of the potential value point as the potential center in the data field to discriminate the maximum potential value in a certain neighborhood window. Then the cluster centers can be acquired corresponding to the maximum potential values and the number of cluster centers is automatically amount to that of potential centers.

Meng Fang, Shuliang Wang, Hong Jin
Surrounding Influenced K-Nearest Neighbors: A New Distance Based Classifier

The nearest neighbor classification method assigns to an unclassified point the class of the nearest of a set of previously classified points. An extension to this approach is the

K

-NN method, in which the classification is made taking into account the

K

nearest points and classifying the unclassified point by a voting criteria from this k points. We present a new method that extends the

K

-NN limits, taking into account, for each neighbor, its

I

nearest neighbors. Experimental results are promising, obtaining better results for two class problems than the original

K

-NN.

I. Mendialdua, B. Sierra, E. Lazkano, I. Irigoien, E. Jauregi
A Centroid k-Nearest Neighbor Method

k

-nearest neighbor method (

k

NN) is a very useful and easy-implementing method for real applications. The query point is estimated by its

k

nearest neighbors. However, this kind of prediction simply uses the label information of its neighbors without considering their space distributions. This paper proposes a novel

k

NN method in which the centroids instead of the neighbors themselves are employed. The centroids can reflect not only the label information but also the distribution information of its neighbors. In order to evaluate the proposed method, Euclidean distance and Mahalanobis distance is used in our experiments. Moreover, traditional

k

NN is also implemented to provide a comparison with the proposed method. The empirical results suggest that the propose method is more robust and effective.

Qingjiu Zhang, Shiliang Sun
Mining Spatial Association Rules with Multi-relational Approach

Working on spatial data models for geographic phenomena have always been viewed from a spatial context and emphasizing spatial change. But the problem is that spatial relationships are embedded in space, unknown a priori. To achieve such issue, spatial association rules mining techniques are needed. In this paper, we propose a multi-relational mining method to deal with it. We use a non-parametric way by using Vironoi-diagram based neighborhood, classification method is implemented to pre-process the rules condition, association rules are pre-defined, and a close Apriori-base algorithm is proposed to cope with it. Then the framework is evaluated by the real-world dataset, and some thoughtful association rules are given.

Min Qian, Li-Jie Pu, Rong Fu, Ming Zhu
An Unsupervised Classification Method of Remote Sensing Images Based on Ant Colony Optimization Algorithm

Remote sensing images classification method can be divided into supervised classification and unsupervised classification according to whether there is prior knowledge. Supervised classification is a machine learning procedure for deducing a function from training data; unsupervised classification is a kind of classification which no training sample is available and subdivision of the feature space is achieved by identifying natural groupings present in the images values. As a branch of swarm intelligence, ant colony optimization algorithm has self-organization, adaptation, positive feedback and other intelligent advantages. In this paper, ant colony optimization algorithm is tentatively introduced into unsupervised classification of remote sensing images. A series of experiments are performed with remote sensing data. Comparing with the K-mean and the ISODATA clustering algorithm, the experiment result proves that artificial ant colony optimization algorithm provides a more effective approach to remote sensing images classification.

Duo Wang, Bo Cheng
A Novel Clustering Algorithm Based on Gravity and Cluster Merging

Fuzzy C-means (FCM) clustering algorithm is commonly used in data mining tasks. It has the advantage of producing good modeling results in many cases. However, it is sensitive to outliers and the initial cluster centers. In addition, it could not get the accurate cluster number during the algorithm. To overcome the above problems, a novel FCM algorithm based on gravity and cluster merging was presented in this paper. By using gravity in this algorithm, the influence of outliers was minimized and the initial cluster centers were selected. And by using cluster merging, an appropriate number of clustering could be specified. The experimental evaluation shows that the modified method can effectively improve the clustering performance.

Jiang Zhong, Longhai Liu, Zhiguo Li

Data Mining in Specific Areas

Evolution Analysis of a Mobile Social Network

Smart phones and ubiquitous wireless connections are helping people to build and maintain mobile social relationships. We present an in-depth and complete evolution analysis on the user activities and social graph of a mobile social network using data obtained from Nokia Friend View. Our results show that (1) user activities in Friend View are highly correlated and the power law fitted exponents for user activities distribution are slowly becoming larger over time, which appears to be contrary to the famous “rich get richer” assertion in the preferential attachment model because users in Friend View regard the reciprocity as important during the interaction and (2) both undirected friend network and directed comment network in Friend View are small-world and scale-free networks over time with slowly decreasing clustering coefficient. However, compared to online social networks where users have a large number of friends but loose weakly-tied subgroups, users in Friend View tend to have close strongly-tied cohesive subgroups. The results can help us understand users’ social activities and interactions over time in mobile social networks.

Hao Wang, Alvin Chin
Distance Distribution and Average Shortest Path Length Estimation in Real-World Networks

The average shortest path length is one of the most important and frequent-invoked characteristics of real-world complex networks. However, the high time complexity of the algorithms prevents us to apply them to calculate the average shortest path lengths in real-world massive networks. In this paper, we present an empirical study of the vertex-vertex distance distributions in more than 30 artificial and real-world networks. To best of our knowledge, we are the first to find out the vertex-vertex distance distributions in these networks conform well to the normal distributions with different means and variations. We also investigate how to use the sampling algorithms to estimate the average shortest path lengths in these networks. Comparing our average shortest path estimating algorithm with other three different sampling strategies, the results show that we can estimate the average shortest path length quickly by just sampling a small number of vertices in both of real-world and artificial networks.

Qi Ye, Bin Wu, Bai Wang
Self-adaptive Change Detection in Streaming Data with Non-stationary Distribution

Non-stationary distribution, in which the data distribution evolves over time, is a common issue in many application fields, e.g., intrusion detection and grid computing. Detecting the changes in massive streaming data with a non-stationary distribution helps to alarm the anomalies, to clean the noises, and to report the new patterns. In this paper, we employ a novel approach for detecting changes in streaming data with the purpose of improving the quality of modeling the data streams. Through observing the outliers, this approach of change detection uses a weighted standard deviation to monitor the evolution of the distribution of data streams. A cumulative statistical test, Page-Hinkley, is employed to collect the evidence of changes in distribution. The parameter used for reporting the changes is self-adaptively adjusted according to the distribution of data streams, rather than set by a fixed empirical value. The self-adaptability of the novel approach enhances the effectiveness of modeling data streams by timely catching the changes of distributions. We validated the approach on an online clustering framework with a benchmark KDDcup 1999 intrusion detection data set as well as with a real-world grid data set. The validation results demonstrate its better performance on achieving higher accuracy and lower percentage of outliers comparing to the other change detection approaches.

Xiangliang Zhang, Wei Wang
Anchor Points Seeking of Large Urban Crowd Based on the Mobile Billing Data

In everyday life, people spend most of their time in some routine places such as the living places(origin) and working places(destination). We define these locations as anchor points. The anchor point information is important to the city planning, transportation management and optimization. Traditional methods of anchor points seeking mainly based on the data obtained from the sample survey or link volumes. The defects of these methods such as low sample rate and high cost make it difficult for us to study on the large crowd in the city.In recent years, with the rapid development of wireless communication, mobile phones have becoming more and more popular. In this paper, we proposed a novel approach to obtain the anchor points of the large urban crowd based on the mobile billing data. In addition, we took advantage of the spatial and temporal patterns of people’s behavior in the anchor points to improve the simple algorithm.

Wenhao Huang, Zhengbin Dong, Nan Zhao, Hao Tian, Guojie Song, Guanhua Chen, Yun Jiang, Kunqing Xie
Frequent Pattern Trend Analysis in Social Networks

This paper describes an approach to identifying and comparing

frequent pattern trends

in social networks. A frequent pattern trend is defined as a sequence of time-stamped occurrence (

support

) values for specific frequent patterns that exist in the data. The trends are generated according to

epochs

. Therefore, trend changes across a sequence epochs can be identified. In many cases, a great many trends are identified and difficult to interpret the result. With a combination of constraints, placed on the frequent patterns, and clustering and cluster analysis techniques, it is argued that analysis of the result is enhanced. Clustering technique uses a

Self Organising Map

approach to produce a sequence of

maps

, one per epoch. These maps can then be compared and the movement of trends identified. This

Frequent Pattern Trend Mining

framework has been evaluated using two non-standard types of social networks, the cattle movement network and the insurance quote network.

Puteri N. E. Nohuddin, Rob Christley, Frans Coenen, Yogesh Patel, Christian Setzkorn, Shane Williams
Efficient Privacy-Preserving Data Mining in Malicious Model

In many distributed data mining settings, disclosure of the original data sets is not acceptable due to privacy concerns. To address such concerns, privacy-preserving data mining has been an active research area in recent years. While confidentiality is a key issue, scalability is also an important aspect to assess the performance of a privacy-preserving data mining algorithms for practical applications. With this in mind, Kantarcioglu et al. proposed secure dot product and secure set-intersection protocols for privacy-preserving data mining in malicious adversarial model using zero knowledge proofs, since the assumption of semi-honest adversary is unrealistic in some settings. Both the computation and communication complexities are linear with the number of data items in the protocols proposed by Kantarcioglu et al. In this paper, we build efficient and secure dot product and set-intersection protocols in malicious model. In our work, the complexity of computation and communication for proof of knowledge is always constant (independent of the number of data items), while the complexity of computation and communication for the encrypted messages remains the same as in Kantarcioglu et al.’s work (linear with the number of data items). Furthermore, we provide the security model in Universal Composability framework.

Keita Emura, Atsuko Miyaji, Mohammad Shahriar Rahman
Analyze the Wild Birds’ Migration Tracks by MPI-Based Parallel Clustering Algorithm

Aiming at the avian influenza outbreak in Qinghai Lake area, the satellite tracking of migratory birds in Qinghai Lake is studied to analyze the relationship between bird migration, virus spread and ecological environment. These biological problems have been converted into computational studies in previous studies in which spatial clustering is the key factor. A bird migration data analysis system based on DBSCAN algorithm was designed in previous work, by which data can be systematically analyzed, and knowledge patterns are subsequently available for deep biological studies. As the GPS (Global Positioning System) raw data grows rapidly which is large scale with high complexity, DBSCAN takes long time (several minutes) to get the result. In this paper, parallel STING (statistical information grid) algorithm is designed and implemented based on MPI (message passing interface) for spatial clustering. By using parallel STING algorithm, it only takes several seconds to get the result.

HaiMing Zhang, YuanChun Zhou, JianHui Li, XueZhi Wang, BaoPing Yan
Formal Concept Analysis Based Clustering for Blog Network Visualization

Blog network is growing explosively recently. As a result, the network becomes huge and dynamic. People have an urge to have effective ways to explore and retrieve related information. Blog analysis has been investigated for several years. There are still some improvement space exist. In this paper, we provide a formal concept analysis based clustering visualization to help people find information easily. Especially it is easy for them to find hot topics and their related information. Our approach has several steps such as extracting keywords from individual blog entries, formal concept analysis (FCA) based clustering and user interactions. Compare with other applications, the main difference is using FCA to analysis the content of individual entries so that group similar entries into one community. Experiments results are provided to show the advantages of our approach.

Jing Gao, Wei Lai
Finding Frequent Subgraphs in Longitudinal Social Network Data Using a Weighted Graph Mining Approach

The mining of social networks entails a high degree of computational complexity. This complexity is exacerbate when considering longitudinal social network data. To address this complexity issue three weighting schemes are proposed in this paper. The fundamental idea is to reduce the complexity by considering only the most significant nodes and links. The proposed weighting schemes have been incorporated into the weighted variations and extensions of the well established gSpan frequent subgraph mining algorithm. The focus of the work is the cattle movement network found in Great Britain. A complete evaluation of the proposed approaches is presented using this network. In addition, the utility of the discovered patterns is illustrated by constructing a sequential data set to which a sequential mining algorithm can be applied to capturing the changes in “behavior” represented by a network.

Chuntao Jiang, Frans Coenen, Michele Zito
Weigted-FP-Tree Based XML Query Pattern Mining

According as XML data have been prevailing in many areas such as internet and public documentation, we need to research data mining algorithm to XML data. And many kinds of techniques have been researched to speed up the query performance about XML data. In this paper, therefore, as the method for speeding up the query performance we analyze the XML query pattern and propose Weighted- FP-growth algorithm extracting the similar XML query pattern fast. The proposed method is applied to XML query subtrees. And we experimented our method compared with the existing algorithm. And we showed the proposed method outperform the other methods and give the fast query result to the repeatedly occurring queries.

Mi Sug Gu, Jeong Hee Hwang, Keun Ho Ryu
Privacy-Preserving Data Mining in Presence of Covert Adversaries

Disclosure of the original data sets is not acceptable due to privacy concerns in many distributed data mining settings. To address such concerns, privacy-preserving data mining has been an active research area in recent years. All the recent works on privacy-preserving data mining have considered either semi-honest or malicious adversarial models, whereby an adversary is assumed to follow or arbitrarily deviate from the protocol, respectively. While semi-honest model provides weak security requiring small amount of computation and malicious model provides strong security requiring expensive computations like Non-Interactive Zero Knowledge proofs, we envisage the need for ‘covert’ adversarial model that performs in between the semi-honest and malicious models, both in terms of security guarantee and computational cost. In this paper, for the first time in data-mining area, we build efficient and secure dot product and set-intersection protocols in covert adversarial model. We use homomorphic property of Paillier encryption scheme and two-party computation of Aumann et al. to construct our protocols. Furthermore, our protocols are secure in Universal Composability framework.

Atsuko Miyaji, Mohammad Shahriar Rahman
Multiple Level Views on the Adherent Cohesive Subgraphs in Massive Temporal Call Graphs

In this paper, we present a multi-level empirical study of locality structures of several temporal call graphs containing both mobile and fixed-line call graphs emphasizing on comparing the patterns of these two types of call graphs. We investigate the topological patterns of the cohesive subgraphs in a mesoscopic scale, and we find several novel patterns in these call graphs. We study the correlation between the link weights and their localities. To our surprise, we can find there is nearly no correlation between link weights and their edge betweenness values. We also find that in the network of communities, communities’ sizes and betweenness centralities are highly positively correlated, which indicates large communities tend to be in the center of the call networks. Our analysis also suggests that ‘small-world’ phenomenon still exists in the community-based networks. We believe that our analysis results will help Telecom operators have better understanding of their customers.

Qi Ye, Bin Wu, Bai Wang
Combating Link Spam by Noisy Link Analysis

Link Spam has indentified as one of the major obstacles for link-based ranking algorithms of modern search engine since it intently constructs hyperlink structure to help some poor-content pages obtaining undeserved high rank. This problem is even worse with the advent of wikis, blogs and forum that are rich in links. Existing works on link spam are mainly focused on link spam detection by extracting some special link structures (e.g. clique, tight bipartite etc.). However, link spam structures could have many variations and easily make the existing detection methods ineffective. In this paper, we tackle the problem of link spam from a more fundamental viewpoint—“noisy link” analysis. First of all, how “non-voting” hyperlinks affect the quality of ranking is investigated, and then based on this investigation, an approach to detect and process “noisy link” both effectively and automatically is proposed. We also compare our work with two other related works (TrustRank and Site-level Noise removal) on two real web datasets. The experimental results demonstrate that the proposed “noisy link” analysis is very effective on both spam page filtering and final ranking improvement.

Yitong Wang, Xiaofei Chen, Xiaojun Feng
High Dimensional Image Categorization

We are interested in varying the vocabulary size in the image categorization task with a bag-of-visual-words to investigate its influence on the classification accuracy in two cases: in the first one, both the test-set and the training set contains the same objects (with only different view points in the test-set) and the second one where objects in the test-set do not appear at all in the training set (only other objects from the same category appear). In order to perform these tasks, we need to scale-up the algorithms used to deal with millions data points in hundred of thousand dimensions. We present k-means (used in the quantization step) and SVM (used in the classification step) algorithms extended to deal with very large datasets. These new incremental and parallel algorithms can be used on various distributed architectures, like multi-thread computer, cluster or GPU (graphics processing units). The efficiency of the approach is shown with the categorization of the 3D-Dataset from Savarese and Fei-Fei containing about 6700 images of 3D objects from 10 different classes. The obtained incremental and parallel SVM algorithm is several orders of magnitude faster than usual ones (like lib-SVM, SVM-perf or CB-SVM) and the incremental and parallel k-means is at least one order of magnitude faster than usual implementations.

François Poulet, Nguyen-Khang Pham
Efficiently Mining Co-Location Rules on Interval Data

Spatial co-location rules represent subsets of spatial features whose instances are frequently located together. This paper studies co-location rule mining on interval data and achieves the following goals: 1) defining the semantic proximity between instances, getting fuzzy equivalent classes of instances and grouping instances in a fuzzy equivalent class into a semantic proximity neighborhood, so that the proximity neighborhood on interval data can be rapidly computed and adjusted; 2) defining new related concepts with co-location rules based on the semantic proximity neighborhood; 3) designing an algorithm to mine the above co-location rules efficiently; 4) verifying the efficiency of the method by experiments on synthetic datasets and the plant dataset of “Three Parallel Rivers of Yunnan Protected Areas”.

Lizhen Wang, Hongmei Chen, Lihong Zhao, Lihua Zhou
Multiple Attribute Frequent Mining-Based for Dengue Outbreak

Dengue fever (DF) and dengue hemorrhagic fever (DHF) are vector borne disease which is notifiable diseases in Malaysia since 1974. Early notification is essential for control measures as delayed notification will lead to further occurrences of outbreak cases. In this study we identify the number of attributes to be used in determining outbreaks rather than using only case counts. The experiment is conducted using multiple attribute value based on Apriori concept. The outcomes are promising when we can identify more than one attributes showing similar graph in vector-borne diseases outbreaks. Our methods also outperform in term of detection rate, false positive rate and overall performance. We prove through our experiment that more than one attributes can be used to better detect outbreaks.

Zalizah Awang Long, Azuraliza Abu Bakar, Abdul Razak Hamdan, Mazrura Sahani
A Top-Down Approach for Hierarchical Cluster Exploration by Visualization

With the much increased capability of data collection and storage in the past decade, data miners have to deal with much larger datasets in knowledge discovery tasks. Very large observations may cause traditional clustering methods to break down and not be able to cope with such large volumes of data. To enable data miners effectively detect the hierarchical cluster structure of a very large dataset, we introduce a visualization technique HOV

3

to plot the dataset into clear and meaningful subsets by using its statistical summaries. Therefore, data miners can focus on investigating a relatively smaller-sized subset and its nested clusters. In such a way, data miners can explore clusters of any subset and its offspring subsets in a top-down fashion. As a consequence, HOV

3

provides data miners an effective method on the exploration of clusters in a hierarchy by visualization.

Ke-Bing Zhang, Mehmet A. Orgun, Peter A. Busch, Abhaya C. Nayak
Distributed Frequent Items Detection on Uncertain Data

Frequent items detection is one of the valuable techniques in many applications, such as network monitor, network intrusion detection, worm virus detection, and so on. This technique has been well studied on deterministic databases. However, it is a new task on emerging uncertain database, especially in distributed environment. In this paper, a new definition of frequent items on uncertain data is defined. Based on the definition, a polynomial algorithm is proposed, which can efficiently answer the queries in central environment. Furthermore, this work designs the communication-efficient algorithms for retrieving the top-k items with the largest probability from distributed sites. The algorithms compute the upper bound of each round of the transmission, and filter the data as much as possible, which have no chance to influence the query result. Extensive experiments show that the algorithms can process the queries correctly and reduce communication cost efficiently with various data set.

Shuang Wang, Guoren Wang, Jitong Chen
Mining Uncertain Sentences with Multiple Instance Learning

Distinguishing uncertain information from factual ones in online texts is of essential importance in information extraction, because uncertain information would mislead systems to find useless even fault information. In this paper, we propose a method for detecting uncertain sentences with multiple instance learning (MIL). Based on the basic assumption, we derive two new constraints for estimating the weight vector by defining a probability margin, which is used in an online learning algorithm known as Passive-Aggressive algorithm. To demonstrate the effectiveness of our method, we experiment on the biomedical corpus. Compared with an intuitive method with conventional single instance learning (SIL), our method provide higher performance by raising the performance from 79.07% up to 82.55%, over 3% improvement.

Feng Ji, Xipeng Qiu, Xuanjing Huang
WeightLOFCC: A Heuristic Weight-Setting Strategy of LOF Applied to Outlier Detection in Time Series Data

Finding outliers is more interesting than finding common patterns in many KDD applications. The local outlier factor method (LOF) is a popular approach to detect outliers, in which a degree of being an outlier will be assigned to each object. In this paper, we present a modification method called WeightLOFCC to better handle outliers in time series data. Differing from the traditional LOF algorithm, the proposed WeightLOFCC method utilizes the idea of semi-supervised learning and weight factor to model data, and makes use of the cross correlation to measure the similarity. We evaluated the proposed algorithm on a large variety of data sets, and the experiment results show that for most of the data sets, our solution for outlier detection can achieve the best performance compared with other classical techniques.

Hongrui Xie, Yujiu Yang, Wenhuang Liu
TGP: Mining Top-K Frequent Closed Graph Pattern without Minimum Support

In this paper, we propose a new mining task: mining top-k frequent closed graph patterns without minimum support. Most previous frequent graph pattern mining works require the specification of a minimum support threshold. However it is difficult for users to set a suitable value sometimes. We develop an efficient algorithm, called TGP, to mine patterns without minimum support. A new structure called Lexicographic Pattern Net is designed to store graph patterns, which makes the closed pattern verification more efficient and speeds up raising support threshold dynamically. In addition, Lexicographic Pattern Net can be stored in the file through serialization, so it doesn’t need generate candidate patterns again in the next mining. It is found in the preliminary experiments that TGP can find top-k frequent closed graph patterns completely and accurately. Furthermore, TGP can be extended to mine other kinds of graphs or dynamic graph streams easily.

Yuhua Li, Quan Lin, Ruixuan Li, Dongsheng Duan
Research on Similarity Matching for Multiple Granularities Time-Series Data

Because of the appropriate algorithm of measuring multiple granularities time-series is few, this article advanced a similarity matching algorithm for multiple granularities time-series, which based on the ideal of time calibrator and hypothesis test. It firstly expounded the definition of multiple granularities time-series, and proposed a sample of distance; secondly, it put forward the similarity matching algorithm of multiple granularities time-series; finally, the experimental result proved that the algorithm can effectively reflect the time-series of multiple granularities.

Wenning Hao, Enlai Zhao, Hongjun Zhang, Gang Chen, Dawei Jin
A Novel Algorithm for Hierarchical Community Structure Detection in Complex Networks

Networks have in recent years emerged as an invaluable tool for describing and quantifying complex systems in many branches of science. Recent studies suggest that network often exhibit hierarchical organization, where vertices divide into groups that further subdivided into groups of groups, and so forth over multiple scales. In this paper, we introduce a novel algorithm that searches for the hierarchical structure. The method iteratively combines the similar communities with the elaborate design of community similarity and combination threshold. The experiments on artificial and real networks show that the method is able to obtain reasonable hierarchical structure solutions.

Chuan Shi, Jian Zhang, Liangliang Shi, Yanan Cai, Bin Wu
Investigating Sequential Patterns of DNS Usage and Its Applications

DNS is an important infrastructure of the Internet whose smooth operation and performance are vital to both the Internet users and applications. In this paper, we investigate the DNS usage patterns by applying the mixture of Markov chains (

MMC

) to DNS usage data. Results on cluster analysis of real DNS query data demonstrate that the method is capable of revealing insightful patterns of sequential activities of DNS users. Furthermore, typical DNS user groups are investigated. Finally, two particular application scenarios of user clustering using this method are proposed, which might help improve the DNS performance by adding plug-ins into the DNS server software.

Jun Wu, Xin Wang, Xiaodong Lee, Baoping Yan
Key Issues and Theoretical Framework on Moving Objects Data Mining

Considering technical difficulties and bottlenecks in moving objects data mining, such as massive movement data, high dimensional data, topological complexity, and knowledge semantic representation etc., this paper focuses on the study of theory and methods of moving objects data mining. First, it presents two key scientific issues of the research topic, i.e. integration and modeling of heterogeneous data, and information aggregation and interpretation. Second, a theoretical framework of moving object data mining is proposed based on different perspectives of "

space-time data(space-time concept(space-time pattern

". Two aspects of the framework are then discussed in details, including (1) moving objects data modeling and semantic expression; (2) mining methods and algorithms of association rules based on concept lattice. Finally, future works are discussed.

Rong Xie, Xin Luo
An Improved KNN Based Outlier Detection Algorithm for Large Datasets

Outlier detection is becoming a hot issue in the field of data mining since outliers often contain useful information. In this paper, we propose an improved KNN based outlier detection algorithm which is fulfilled through two stage clustering. Clustering one is to partition the dataset into several clusters and then calculate the Kth nearest neighbor in each cluster which can effectively avoid passing the entire dataset for each calculation. Clustering two is to partition the clusters obtained by clustering one and then prune the partitions as soon as it is determined that it cannot contain outliers which results in substantial savings in computation. Experimental results on both synthetic and real life datasets demonstrate that our algorithm is efficient in large datasets.

Qian Wang, Min Zheng
Some Developments of Determinacy Analysis

This paper deals with development of Determincy Analysis (DA), a method for data mining. There are two approaches to DA that give a different result consisting of non-overlapping (exclusive each other) rules. The first method finds exactly one system consisting of rules that have equal length. The second, step by step approach enables to find very many different rule systems where the rules have different length. In the first case the rules contain a lot of redundant attributes, in the second case there are too many different (formally complete) systems of rules what makes the selection hard. A better result can be obtained by finding overlapping rules. This paper presents DA approach and technique that enables to find overlapping rules with different length and algorithm realizing it. Such approach for DA has not been created earlier.

Rein Kuusik, Grete Lind
A New Computational Framework for Gene Expression Clustering

Clustering of gene expression is a useful exploratory technique for gene expression dataset as it groups similar objects together and identify potentially meaningful relationships between the objects. However, there are several issues arise for instance data intensive and redundancy in the cluster. Therefore, the new computational framework is needed in order to handle these issues. The results showed that the proposed computational framework achieved better results compared with other methods.

Shahreen Kasim, Safaai Deris, Razib M. Othman
Forecasting Short-Term Trends of Stock Markets Based on Fuzzy Frequent Pattern Tree

Stock forecasting is a non-linear financial time series forecasting problem. Stock index contains tremendous noise and is affected by numerous factors. Fuzzy time series takes advantage of such problems. In this paper, a novel model based on the fuzzy frequent pattern tree (FFPT) is proposed to forecast short-term trends of stock markets. Fuzzy frequent pattern tree is a combination of fuzzy set theory and frequent pattern tree. Frequent pattern tree is a highly compressed data structure store the information of association rules to be mined. In this paper, an FFPT is built using fuzzy stock time series. Then we forecast short-term trends by a new method called FFPT-Search. And stock data from several famous stock markets is picked up to evaluate the effectiveness of our model. Computational results indicate it works well.

Defu Zhang, Bo Wu, Xian Hua, Yangbin Yang
Inspired Rule-Based User Identification

Web log mining is an important branch of Web usage mining. It is a kind of specific application of data mining. It can find Web user mode of behavior through processing server-side log files, and further improve the structure of the website or provide users with personalized service. Data preprocessing is an important step of Web log mining. In general, data preprocessing is divided into four steps: data cleaning, user identification, session identification, path supplement, transaction identification. This paper proposes a user identification method which is based on inspired rules. Experiment results prove the effectiveness of this method.

Peng Yang, Yan Zheng
Backmatter
Metadaten
Titel
Advanced Data Mining and Applications
herausgegeben von
Longbing Cao
Yong Feng
Jiang Zhong
Copyright-Jahr
2010
Verlag
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-17316-5
Print ISBN
978-3-642-17315-8
DOI
https://doi.org/10.1007/978-3-642-17316-5