2013 | Book

Advances in Knowledge Discovery and Data Mining

17th Pacific-Asia Conference, PAKDD 2013, Gold Coast, Australia, April 14-17, 2013, Proceedings, Part I

Edited by: Jian Pei, Vincent S. Tseng, Longbing Cao, Hiroshi Motoda, Guandong Xu

Publisher: Springer Berlin Heidelberg

Book series: Lecture Notes in Computer Science

About this book

The two-volume set LNAI 7818 + LNAI 7819 constitutes the refereed proceedings of the 17th Pacific-Asia Conference on Knowledge Discovery and Data Mining, PAKDD 2013, held in Gold Coast, Australia, in April 2013. The total of 98 papers presented in these proceedings was carefully reviewed and selected from 363 submissions. They cover the general fields of data mining and KDD extensively, including pattern mining, classification, graph mining, applications, machine learning, feature selection and dimensionality reduction, multiple information sources mining, social networks, clustering, text mining, text classification, imbalanced data, privacy-preserving data mining, recommendation, multimedia data mining, stream data mining, data preprocessing and representation.

Table of Contents

Frontmatter
Discovering Local Subgroups, with an Application to Fraud Detection

In Subgroup Discovery, one is interested in finding subgroups that behave differently from the ‘average’ behavior of the entire population. In many cases, such an approach works well because the general population is rather homogeneous, and the subgroup encompasses clear outliers. In more complex situations, however, the investigated population is a mixture of various subpopulations, and reporting all of these as interesting subgroups is undesirable, as the variation in behavior is explainable. In these situations, one would be interested in finding subgroups that are unusual with respect to their neighborhood. In this paper, we present a novel method for discovering such local subgroups. Our work is motivated by an application in health care fraud detection. In this domain, one is dealing with various types of medical practitioners, who sometimes specialize in specific patient groups (elderly, disabled, etc.), such that unusual claiming behavior in itself is not cause for suspicion. However, unusual claims with respect to a reference group of similar patients do warrant further investigation of the associated medical practitioner. We demonstrate experimentally how local subgroups can be used to capture interesting fraud patterns.

Rob M. Konijn, Wouter Duivesteijn, Wojtek Kowalczyk, Arno Knobbe
PUF-Tree: A Compact Tree Structure for Frequent Pattern Mining of Uncertain Data

Many existing algorithms mine frequent patterns from traditional databases of precise data. However, there are situations in which data are uncertain. In recent years, researchers have paid attention to frequent pattern mining from uncertain data. When handling uncertain data, UF-growth and UFP-growth are examples of well-known mining algorithms, which use the UF-tree and the UFP-tree respectively. However, these trees can be large, and thus degrade the mining performance. In this paper, we propose (i) a more compact tree structure to capture uncertain data and (ii) an algorithm for mining all frequent patterns from the tree. Experimental results show that (i) our tree is usually more compact than the UF-tree or UFP-tree, (ii) our tree can be as compact as the FP-tree, and (iii) our mining algorithm finds frequent patterns efficiently.

Carson Kai-Sang Leung, Syed Khairuzzaman Tanbeer
Frequent Pattern Mining in Attributed Trees

Frequent pattern mining is an important data mining task with a broad range of applications. Initially focused on the discovery of frequent itemsets, studies were extended to mine structural forms like sequences, trees or graphs. In this paper, we introduce a new data mining method that consists of mining a new kind of pattern in a collection of attributed trees (atrees). Attributed trees are trees in which vertices are associated with itemsets. Mining this type of pattern (called asubtrees), which combines tree mining and itemset mining, requires the exploration of a huge search space. We present several new algorithms for attributed tree mining and show that their implementations can efficiently list frequent patterns in a database of several thousand attributed trees.

Claude Pasquier, Jérémy Sanhes, Frédéric Flouvat, Nazha Selmaoui-Folcher
Mining Frequent Patterns from Human Interactions in Meetings Using Directed Acyclic Graphs

In modern life, interactions between human beings frequently occur in meetings, where topics are discussed. Semantic knowledge of meetings can be revealed by discovering interaction patterns from these meetings. An existing method mines interaction patterns from meetings using tree structures. However, such a tree-based method may not capture all kinds of triggering relations between interactions, and it may not distinguish a participant of a certain rank from another participant of a different rank in a meeting. Hence, the tree-based method may not be able to find all interaction patterns such as those about correlated interaction. In this paper, we propose to mine interaction patterns from meetings using an alternative data structure—namely, a directed acyclic graph (DAG). Specifically, a DAG captures both temporal and triggering relations between interactions in meetings. Moreover, to distinguish one participant of a certain rank from another, we assign weights to nodes in the DAG. As such, a meeting can be modeled as a weighted DAG, from which weighted frequent interaction patterns can be discovered. Experimental results showed the effectiveness of our proposed DAG-based method for mining interaction patterns from meetings.

Anna Fariha, Chowdhury Farhan Ahmed, Carson Kai-Sang Leung, S. M. Abdullah, Longbing Cao
ClaSP: An Efficient Algorithm for Mining Frequent Closed Sequences

In this paper, we propose a new algorithm, called ClaSP, for mining frequent closed sequential patterns in temporal transaction data. Our algorithm uses several efficient search space pruning methods together with a vertical database layout. Experiments on both synthetic and real datasets show that ClaSP outperforms well-known state-of-the-art methods, such as CloSpan.

Antonio Gomariz, Manuel Campos, Roque Marin, Bart Goethals
Efficient Mining of Contrast Patterns on Large Scale Imbalanced Real-Life Data

Contrast pattern mining has been studied intensively for its strong discriminative capability. However, the state-of-the-art methods rarely consider the class imbalance problem, which has proved to be a big challenge in mining large scale data. This paper introduces a novel pattern, the converging pattern, which refers to itemsets whose supports contrast sharply from the minority class to the majority one. A novel algorithm, ConvergMiner, which adopts a T*-tree and branch-and-bound pruning strategies to mine converging patterns efficiently, is proposed. Substantial experiments in online banking fraud detection show that ConvergMiner greatly outperforms existing cost-sensitive classification methods in terms of predictive accuracy. In particular, the efficiency improves with the increase of data imbalance.

Jinjiu Li, Can Wang, Wei Wei, Mu Li, Chunming Liu
Online Cross-Lingual PLSI for Evolutionary Theme Patterns Analysis

In this paper, we focus on the problem of evolutionary theme patterns (ETP) analysis in cross-lingual scenarios. Previously, cross-lingual topic models in batch mode have been explored. Directly applying such techniques to ETP analysis, however, has two limitations. (1) It is time-consuming to re-train all the latent themes for each time interval in the time sequence. (2) The latent themes between two adjacent time intervals might lose continuity. This motivates us to utilize online algorithms to overcome these limitations. Research on online topic models is not new, but previous work cannot be directly employed because it mainly targets monolingual texts. Consequently, we propose an online cross-lingual topic model. Through experiments on a real-world dataset, we demonstrate that our algorithm performs well in the ETP analysis task: it efficiently reduces the updating time complexity, and it effectively addresses the continuity limitation.

Xin Xin, Kun Zhuang, Ying Fang, Heyan Huang
F-Trail: Finding Patterns in Taxi Trajectories

Given a large number of taxi trajectories, we would like to find interesting and unexpected patterns from the data. How can we summarize the major trends, and how can we spot anomalies? The analysis of trajectories has been an issue of considerable interest, with many applications such as tracking trails of migrating animals and predicting the path of hurricanes. Several recent works propose methods for clustering and indexing trajectory data. However, these approaches are not especially well suited to pattern discovery with respect to the dynamics of social and economic behavior. To further analyze a huge collection of taxi trajectories, we develop a novel method, called F-Trail, which allows us to find meaningful patterns and anomalies. Our approach has the following advantages: (a) it is fast, and scales linearly on the input size; (b) it is effective, leading to novel discoveries and surprising outliers. We demonstrate the effectiveness of our approach by performing experiments on real taxi trajectories. In fact, F-Trail does produce concise, informative and interesting patterns.

Yasuko Matsubara, Lei Li, Evangelos Papalexakis, David Lo, Yasushi Sakurai, Christos Faloutsos
Mining Appliance Usage Patterns in Smart Home Environment

Nowadays, due to the great advances in sensor technology, the data of all appliances in a house can be collected easily. However, with a huge amount of appliance usage log data, it is not an easy task for residents to visualize how the appliances are used. Mining algorithms are necessary to discover appliance usage patterns that capture representative usage behavior of appliances. If such representative patterns of appliance electricity usage are available, we may be able to adapt our usage behaviors to conserve energy easily. In this paper, we introduce (i) two types of usage patterns which capture the representative usage behaviors of appliances in a smart home environment and (ii) the corresponding algorithms for discovering usage patterns efficiently. Finally, we apply our algorithms on a real-world dataset to show the practicability of usage pattern mining.

Yi-Cheng Chen, Yu-Lun Ko, Wen-Chih Peng, Wang-Chien Lee
Computational Models of Stress in Reading Using Physiological and Physical Sensor Data

Stress is a major problem facing our world today, and it is important to develop an objective understanding of how average individuals respond to stress in a typical activity like reading. The aim of this paper is to determine whether stress patterns can be recognized using individual-independent computational models from sensor-based stress response signals induced by reading text with stressful content. The response signals were obtained by sensors that sourced various physiological and physical signals, from which hundreds of features were derived. The paper proposes feature selection methods to deal with redundant and irrelevant features and improve the performance of classifications obtained from models based on artificial neural networks (ANNs) and support vector machines (SVMs). A genetic algorithm (GA) and a novel method based on pseudo-independence of features are proposed as feature selection methods for the classifiers. Classification performances for the proposed classifiers are compared. The performance of the individual-independent classifiers improved when the feature selection methods were used. The GA-SVM hybrid produced the best results with a stress recognition rate of 98%.

Nandita Sharma, Tom Gedeon
Latent Patient Profile Modelling and Applications with Mixed-Variate Restricted Boltzmann Machine

Efficient management of chronic diseases is critical in modern health care. We consider diabetes mellitus, and our ongoing goal is to examine how machine learning can deliver information for clinical efficiency. The challenge is to aggregate highly heterogeneous sources including demographics, diagnoses, pathologies and treatments, and extract similar groups so that care plans can be designed. To this end, we extend our recent model, the mixed-variate restricted Boltzmann machine (MV.RBM), as it seamlessly integrates multiple data types for each patient aggregated over time and outputs a homogeneous representation called a “latent profile” that can be used for patient clustering, visualisation, disease correlation analysis and prediction. We demonstrate that the method outperforms all baselines on these tasks: the primary characteristics of patients in the same group can be identified, and good results are achieved for diagnosis code prediction.

Tu Dinh Nguyen, Truyen Tran, Dinh Phung, Svetha Venkatesh
MassBayes: A New Generative Classifier with Multi-dimensional Likelihood Estimation

Existing generative classifiers (e.g., BayesNet and AnDE) make independence assumptions and estimate one-dimensional likelihoods. This paper presents a new generative classifier called MassBayes that estimates multi-dimensional likelihoods without making any explicit assumptions. It aggregates the multi-dimensional likelihoods estimated from random subsets of the training data using varying-size random feature subsets (see the sketch after this entry). Our empirical evaluations show that MassBayes yields better classification accuracy than the existing generative classifiers in large data sets. As it works with fixed-size subsets of training data, it has constant training time complexity and constant space complexity, and it can easily scale up to very large data sets.

Sunil Aryal, Kai Ming Ting
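
The aggregation scheme described in this abstract lends itself to a compact illustration. The sketch below is a toy approximation, not the authors' estimator: it uses per-class Gaussian summaries on random data and feature subsets as a stand-in for MassBayes' multi-dimensional likelihood estimation, and all function names and defaults are invented for illustration.

```python
import numpy as np

def fit_ensemble(X, y, n_members=50, subset_size=256, seed=0):
    """Toy MassBayes-style ensemble: each member sees a fixed-size random
    subset of the training data and a random-size random feature subset."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    classes = np.unique(y)
    members = []
    for _ in range(n_members):
        rows = rng.choice(n, size=min(subset_size, n), replace=False)
        feats = rng.choice(d, size=rng.integers(1, d + 1), replace=False)
        summaries = {}
        for c in classes:
            Xc = X[rows][y[rows] == c][:, feats]
            if len(Xc) == 0:
                continue  # class absent from this random subset
            # Gaussian summary as a crude stand-in for the paper's estimator
            summaries[c] = (Xc.mean(0), Xc.var(0) + 1e-9, len(Xc) / len(rows))
        members.append((feats, summaries))
    return classes, members

def predict(X, classes, members):
    scores = np.zeros((len(X), len(classes)))
    for feats, summaries in members:
        for j, c in enumerate(classes):
            if c not in summaries:
                continue
            mu, var, prior = summaries[c]
            # per-member log-likelihood on its own feature subset
            ll = -0.5 * (((X[:, feats] - mu) ** 2 / var) + np.log(var)).sum(1)
            scores[:, j] += ll + np.log(prior)
    return classes[scores.argmax(1)]
```

Training touches only fixed-size subsets, which is what gives the constant training time and space noted in the abstract.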
Fast and Effective Single Pass Bayesian Learning

The rapid growth in data makes ever more urgent the quest for highly scalable learning algorithms that can maximize the benefit derived from the information implicit in big data. Where data are too big to reside in core, efficient learning requires minimal data access. Single pass learning accesses each data point once only, providing the most efficient data access possible without resorting to sampling. The AnDE family of classifiers are effective single pass learners. We investigate two extensions to A2DE: subsumption resolution and MI-weighting. Neither of these techniques requires additional data access. Both reduce A2DE’s learning bias, improving its effectiveness for big data. Furthermore, we demonstrate that the techniques are complementary. The resulting combined technique delivers computationally efficient low-bias learning well suited to learning from big data.

Nayyar A. Zaidi, Geoffrey I. Webb
Sparse Reductions for Fixed-Size Least Squares Support Vector Machines on Large Scale Data

Fixed-Size Least Squares Support Vector Machines (FS-LSSVM) is a powerful tool for solving large scale classification and regression problems. FS-LSSVM solves an over-determined system of M linear equations by using Nyström approximations on a set of prototype vectors (PVs) in the primal. This introduces sparsity in the model along with the ability to scale to large datasets. But there exists no formal method for selecting the right value of M. In this paper, we investigate the sparsity-error trade-off by introducing a second level of sparsity after performing one iteration of FS-LSSVM. This helps to overcome the problem of selecting the right number of initial PVs, as the final model is highly sparse and depends on only a few appropriately selected support vectors (SVs), which form a subset of the PVs. The first proposed method performs an iterative approximation of the L0-norm, which acts as a regularizer. The second method belongs to the category of threshold methods: we set a window and select the SV set from correctly classified PVs closer to and farther from the decision boundaries in the case of classification. For regression, we obtain the SV set by selecting the PVs with the least mean squared error (mse). Experiments on real world datasets from the UCI repository illustrate that highly sparse models are obtained without a significant trade-off in error estimation, and that the method scales to large datasets.

Raghvendra Mall, Johan A. K. Suykens
Discovery of Regional Co-location Patterns with k-Nearest Neighbor Graph

Spatial co-location pattern mining discovers subsets of features whose events are frequently located together in a geographic space. Current research on this topic adopts a distance threshold, which has limitations in spatial data sets with various magnitudes of neighborhood distances, especially for mining regional co-location patterns. In this paper, we propose a hierarchical co-location mining framework that considers both the variety of neighborhood distances and spatial heterogeneity. By adopting a k-nearest neighbor graph (kNNG) instead of a distance threshold, we propose the “distance variation coefficient” as a new measure to drive the mining process and determine an individual neighborhood relationship graph for each region. The experimental results on a real world data set verify the effectiveness of our framework.

Feng Qian, Kevin Chiew, Qinming He, Hao Huang, Lianhang Ma
Spectral Decomposition for Optimal Graph Index Prediction

There is an ample body of recent research on indexing for structural graph queries. However, as verified by our experiments with a large number of random and scale-free graphs, there may be a great variation in the performance of graph query indexes. Unfortunately, the structures of graph indexes are often complex and ad hoc, so deriving an accurate performance model is a daunting task. As a result, database practitioners may encounter difficulties in choosing the optimal index for their data graphs. In this paper, we address this problem by proposing a spectral decomposition method for predicting the relative performances of graph indexes. Specifically, given a graph, we compute its spectrum. We then propose a similarity function to compare the spectra of graphs. We adopt a classification algorithm to build a model and a voting algorithm for predicting the optimal index. Our empirical studies on a large number of random and scale-free graphs, using four structurally distinguishable indexes, demonstrate that our spectral decomposition method is robust and almost always exhibits an accuracy of 70% or above.

Liyan Song, Yun Peng, Byron Choi, Jianliang Xu, Bingsheng He
Patterns amongst Competing Task Frequencies: Super-Linearities, and the Almond-DG Model

If Alice has double the friends of Bob, will she also have double the phone-calls (or wall-postings, or tweets)? Our first contribution is the discovery that the relative frequencies obey a power-law (sub-linear, or super-linear) for a wide variety of diverse settings: tasks in a phone-call network, like count of friends, count of phone-calls, total count of minutes; tasks in a twitter-like network, like count of tweets, count of followees, etc. The second contribution is that we further provide a full, digitized 2-d distribution, which we call the Almond-DG model, thanks to the shape of its iso-surfaces. The Almond-DG model matches all our empirical observations: super-linear relationships among variables, and (provably) log-logistic marginals. We illustrate our observations on two large, real network datasets, spanning ~2.2M and ~3.1M individuals with 5 features each. We show how to use our observations to spot clusters and outliers, like, e.g., telemarketers in our phone-call network.

Danai Koutra, Vasileios Koutras, B. Aditya Prakash, Christos Faloutsos
Node Classification in Social Network via a Factor Graph Model

This paper addresses the task of node classification in social networks. Node classification in social networks is an important challenge for understanding the underlying graph with its linkage structure and node features. Compared with the traditional classification problem, we should not only use node features, but also consider the relationships between nodes. Besides, it is costly in time and effort to label every node in a large social network. In this work, we use a factor graph model with partially-labeled data to address these problems. Our experiments on two data sets (DBLP co-author network, Weibo) with multiple small tasks demonstrate that our model works much better than traditional classification algorithms.

Huan Xu, Yujiu Yang, Liangwei Wang, Wenhuang Liu
Fast Graph Stream Classification Using Discriminative Clique Hashing

As many data mining applications involve networked data with dynamically increasing volumes, graph stream classification has recently attracted significant research interest. The aim of graph stream classification is to learn a discriminative model from a stream of graphs represented by sets of edges on a complex network. In this paper, we propose a fast graph stream classification method using DIscriminative Clique Hashing (DICH). The main idea is to employ a fast algorithm to decompose a compressed graph into a number of cliques and to sequentially extract clique-patterns over the graph stream as features. Two random hashing schemes are employed to compress the original edge set of the graph stream and to map the ever-growing set of clique-patterns onto a fixed-size feature space, respectively. The hashed cliques are used to update an “in-memory” fixed-size pattern-class table, which is finally used to construct a rule-based classifier (a toy sketch of this table appears after this entry). DICH essentially speeds up the discriminative clique-pattern mining process and solves the unlimited clique-pattern expansion problem in graph stream mining. Experimental results on two real-world graph stream data sets demonstrate that DICH clearly outperforms the compared state-of-the-art method in both classification accuracy and training efficiency.

Lianhua Chi, Bin Li, Xingquan Zhu
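
To make the “in-memory pattern-class table” idea concrete, here is a minimal sketch under loose assumptions: cliques are plain sets of node labels, Python's built-in hash stands in for the paper's random hashing schemes, and the rule-based classifier is reduced to frequency-weighted voting. All names and sizes are illustrative.

```python
from collections import defaultdict

TABLE_SIZE = 2 ** 16  # fixed feature-space size (an illustrative choice)

def hash_clique(clique):
    # map a clique (a set of node labels) onto the fixed-size feature space
    return hash(frozenset(clique)) % TABLE_SIZE

class PatternClassTable:
    """Toy fixed-size pattern-class count table in the spirit of DICH."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))

    def update(self, cliques, label):
        # called once per labeled graph arriving on the stream
        for clique in cliques:
            self.counts[hash_clique(clique)][label] += 1

    def classify(self, cliques):
        votes = defaultdict(float)
        for clique in cliques:
            entry = self.counts.get(hash_clique(clique))
            if entry:
                total = sum(entry.values())
                for label, count in entry.items():
                    votes[label] += count / total  # discriminative weight
        return max(votes, key=votes.get) if votes else None
```

Because clique-patterns are hashed into a fixed range, the table never grows with the stream, which is the point of the paper's second hashing scheme.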
Mining Interesting Itemsets in Graph Datasets

Traditionally, pattern discovery in graphs has been mostly limited to searching for frequent subgraphs, recurring patterns in which nodes with certain labels are frequently interconnected in exactly the same way. We relax this requirement by claiming that a set of labels is interesting if the labels often occur in each other’s vicinity, but are not necessarily always interconnected by exactly the same structures. Searching for such itemsets can be done both in datasets consisting of one large graph, and in datasets consisting of many graphs. We present novel methods dealing with both possible settings. Our algorithms benefit from avoiding the computationally costly isomorphism checks typical of subgraph mining, as well as from a greatly reduced search space, as we consider only itemsets, and not all possible edges and paths that can connect them.

Boris Cule, Bart Goethals, Tayena Hendrickx
Robust Synchronization-Based Graph Clustering

Complex graph data now arises in various fields like social networks, protein-protein interaction networks, ecosystems, etc. To reveal the underlying patterns in graphs, an important task is to partition them into several meaningful clusters. The question is: how can we find the natural partitions of a complex graph which truly reflect the intrinsic patterns? In this paper, we propose RSGC, a novel approach to graph clustering. The key philosophy of RSGC is to consider graph clustering as a dynamic process towards synchronization. Each vertex is viewed as an oscillator and interacts with other vertices according to the graph connection information. During the process towards synchronization, vertices with similar connectivity patterns tend to naturally synchronize together to form a cluster. Inherited from the powerful concept of synchronization, RSGC shows several desirable properties: (a) it provides a novel perspective for graph clustering based on the proposed interaction model; (b) RSGC allows discovering natural clusters in a graph without any data distribution assumption; (c) RSGC is also robust against noise vertices. We systematically evaluate the RSGC algorithm on synthetic and real data to demonstrate its superiority.

Junming Shao, Xiao He, Qinli Yang, Claudia Plant, Christian Böhm
Efficient Mining of Combined Subspace and Subgraph Clusters in Graphs with Feature Vectors

Large graphs are ubiquitous in today’s applications. Besides the mere graph structure, data sources usually provide information about single objects by feature vectors. To realize the full potential for knowledge extraction, recent approaches consider both information types simultaneously. Thus, for the task of clustering, combined clustering models determine object groups within one network that are densely connected and show similar characteristics. However, due to the inherent complexity of such a combination, the existing methods are not efficiently executable and are hardly applicable to large graphs.

In this work, we develop a method for efficient clustering of combined data sources that at the same time finds high-quality results. We prove the complexity of our model and identify the critical parts inhibiting an efficient execution. Based on this analysis, we develop the algorithm EDCAR, which approximates the optimal clustering solution using the established GRASP (Greedy Randomized Adaptive Search) principle. In thorough experiments we show that EDCAR outperforms all competing approaches in terms of runtime and simultaneously achieves high clustering quality. For repeatability and further research we publish all datasets, executables and parameter settings on our website.

Stephan Günnemann, Brigitte Boden, Ines Färber, Thomas Seidl
Exploiting Temporal Information in a Two-Stage Classification Framework for Content-Based Depression Detection

Depression has become a critical illness in human society as many people suffer from the condition without being aware of it. The goal of this paper is to design a system to identify potential depression candidates based on their write-ups. To solve this problem, we propose a two-stage supervised learning framework. The first stage determines whether the user possesses apparent negative emotion. Then the positive cases are passed to the second stage to further evaluate whether the condition is clinical depression or just ordinary sadness. Our training data are generated automatically from Bulletin Board Systems. The content and temporal features are designed to improve the classification accuracy. Finally we develop an online demo system that takes a piece of written text as input, and outputs the likelihood of the author currently suffering depression. We conduct cross-validation and human study to evaluate the effectiveness of this system.

Yu-Chun Shen, Tsung-Ting Kuo, I-Ning Yeh, Tzu-Ting Chen, Shou-De Lin
EEG-Based Person Verification Using Multi-Sphere SVDD and UBM

The use of brain-wave patterns extracted from electroencephalography (EEG) brain signals for person verification has been investigated recently. The challenge is that the EEG signals are noisy due to low conductivity of the human skull and the EEG data have unknown distribution. We propose a multi-sphere support vector data description (MSSVDD) method to reduce noise and to provide a mixture of hyperspheres that can describe the EEG data distribution. We also propose a MSSVDD universal background model (UBM) to model impostors in person verification. Experimental results show that our proposed methods achieved lower verification error rates than other verification methods.

Phuoc Nguyen, Dat Tran, Trung Le, Xu Huang, Wanli Ma
Measuring Reproducibility of High-Throughput Deep-Sequencing Experiments Based on Self-adaptive Mixture Copula

Measurement of the statistical reproducibility between biological experiment replicates is a vital first step of the entire series of bioinformatics analyses for mining meaningful biological discoveries from massive data. To distinguish real biologically relevant signals from artificial signals, the irreproducible discovery rate (IDR) employing a Copula, which can separate the dependence structure and marginal distributions from data, has been put forth. However, IDR employs a Gaussian Copula, which may cause underestimation of risk and limit the robustness of the method. To address this issue, we propose a Self-adaptive Mixture Copula (SaMiC) to measure the reproducibility of experiment replicates from high-throughput deep-sequencing data. Simple and easy to implement, the proposed SaMiC method can self-adaptively tune its coefficients so that the measurement of reproducibility is more effective for general distributions. Experiments on simulated and real data indicate that, compared with IDR, the SaMiC method can better estimate reproducibility between replicate samples.

Qian Zhang, Junping Zhang, Chenghai Xue
Mining Representative Movement Patterns through Compression

Mining trajectories (or moving object patterns) from spatio-temporal data is an active research field. Most research is devoted to extracting trajectories that differ in their structure and characteristics in order to capture different object behaviors. The first issue is that all these methods extract thousands of patterns, resulting in a huge amount of redundant knowledge that limits their usefulness. The second issue stems from the nature of spatio-temporal databases, from which different types of patterns can be extracted. This means that using only a single type of pattern is not sufficient to supply an insightful picture of the whole database.

Motivated by these issues, we develop a Minimum Description Length (MDL)-based approach that is able to compress spatio-temporal data by combining different kinds of moving object patterns. The proposed method produces a ranking of the patterns involved in the summarization of the dataset. In order to validate the quality of our approach, we conduct an empirical study on real data to compare the proposed algorithms in terms of effectiveness, running time and compressibility.

Phan Nhat Hai, Dino Ienco, Pascal Poncelet, Maguelonne Teisseire
NARGES: Prediction Model for Informed Routing in a Communications Network

There is a dependency between packet-loss and the delay and jitter time-series derived from a telecommunication link. Multimedia applications such as Voice over IP (VoIP) are sensitive to loss, and packet recovery alone is not an efficient solution given the increasing number of Internet users. Predicting packet-loss from the network dynamics of past transmissions is crucial for informing the next generation of routers in making smart decisions. This paper proposes a hybrid data mining model for routing management in a communications network, called NARGES. The proposed model is designed and implemented for predicting packet-loss based on forecasted delays and jitters. The model consists of two parts: a historical symbolic time-series approximation module, called HDAX, and a Multilayer Perceptron (MLP). It is validated with heterogeneous quality of service (QoS) datasets, namely delay, jitter and packet-loss time-series. The results show improved precision and quality of prediction compared to the autoregressive moving average (ARMA) model.

Hooman Homayounfard, Paul J. Kennedy, Robin Braun
Mining Usage Traces of Mobile Apps for Dynamic Preference Prediction

Due to the huge number of mobile applications (abbreviated as Apps), the usage preferences of Apps are important for App providers in recommending, downloading and promoting Apps. We predict and quantize users’ dynamic preferences by exploring their usage traces of Apps. To address the dynamic preference prediction problem, we propose the Mode-based Prediction (MBP) and Reference-based Prediction (RBP) algorithms. Both MBP and RBP consist of two phases: the trend detection phase and the change estimation phase. In the trend detection phase, both algorithms determine whether the preference of an App is increasing or decreasing. Then, in the change estimation phase, the amount of preference change is calculated. In particular, MBP adopts the user’s current usage mode (active or inactive), and then estimates the amount of change via our proposed utility model. On the other hand, RBP calculates an expected usage count as a reference, and then builds a probabilistic model to estimate the change of preference by comparing the real usage with the reference (a simplified sketch of this idea follows this entry). We conduct comprehensive experiments using two App usage traces and one music listening log, the Last.fm dataset, to validate our proposed algorithms. The experimental results show that both MBP and RBP outperform the usage-based method that relies solely on usage counts.

Zhung-Xun Liao, Wen-Chih Peng, Philip S. Yu
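
As a rough illustration of the reference-based idea, the snippet below computes an expected usage count from recent history and turns the gap between observed and expected usage into a signed preference change. The window, the scaling and the function name are assumptions made for illustration, not the paper's probabilistic model.

```python
import numpy as np

def rbp_preference_change(usage_history, current_usage, window=7):
    """Reference-based sketch: the recent mean usage is the reference; the
    normalized gap between observed usage and the reference is the change."""
    recent = np.asarray(usage_history[-window:], dtype=float)
    reference = recent.mean() if recent.size else 0.0   # expected usage count
    trend = np.sign(current_usage - reference)          # trend detection phase
    scale = reference if reference > 0 else 1.0
    return trend * abs(current_usage - reference) / scale  # change estimation
```

A positive return value reads as growing preference for the App, a negative one as fading preference.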
Leveraging Hybrid Citation Context for Impact Summarization

Impact summarization aims to highlight the influential aspects of a cited paper by selecting a few representative citation sentences into the summary. Most existing work considers only the citation sentence information while the hybrid citation context associated with each citation sentence has been ignored. This paper proposes a context-aware approach. In the approach, different kinds of relationships among papers and authors are leveraged to jointly infer the impact of hybrid citation context, which is further integrated in a sentence language smoothing model to measure citation sentence relationships more effectively. The experimental results show that the proposed approach can achieve significantly better results than several baselines.

Po Hu, Yujing Guo, Donghong Ji, Jiacong He
Optimal Allocation of High Dimensional Assets through Canonical Vines

The widely used mean-variance criterion is actually not the optimal solution for asset allocation, as the joint distributions of asset returns are asymmetric rather than following the assumed normal distribution. It is a computationally challenging task to model the asymmetries and skewness of joint distributions of returns in high dimensional space due to their complicated structure. This paper proposes to use a new form of canonical vine to produce the complex joint distribution of asset returns. Then, we use the utility function of Constant Relative Risk Aversion to determine the optimal allocation of the assets. The importance of using the asymmetry information is assessed by comparing the performance of a portfolio based on the mean-variance criterion with that of a portfolio based on the new canonical vine. The results show that investors using forecasts of these asymmetries can make better portfolio decisions than those who ignore the asymmetry information.

Wei Wei, Jinyan Li, Longbing Cao, Jingguang Sun, Chunming Liu, Mu Li
Inducing Context Gazetteers from Encyclopedic Databases for Named Entity Recognition

Named entity recognition (NER) is a fundamental task for mining valuable information from unstructured and semi-structured texts. State-of-the-art NER models mostly employ a supervised machine learning approach that heavily depends on local contexts. However, recent research has demonstrated that non-local contexts at the sentence or document level can help improve recognition performance. In this paper, we propose the use of a context gazetteer, a list of contexts with which entity names can co-occur, as new non-local context information. We build a context gazetteer from an encyclopedic database because manually annotated data are often too few to extract rich and sophisticated context patterns. In addition, dependency paths are used as sentence-level non-local contexts, capturing contexts that are more syntactically related to entity mentions than the linear contexts of traditional NER. In our experiments, we build a context gazetteer of gene names and apply it to a biomedical NER task. High confidence context patterns appear in various forms. Some are similar to a predicate–argument structure whereas some are in unexpected forms. The experiment results show that the proposed model using both entity and context gazetteers improves both precision and recall over a strong baseline model, demonstrating the usefulness of the context gazetteer.

Han-Cheol Cho, Naoaki Okazaki, Kentaro Inui
An Optimization Method for Proportionally Diversifying Search Results

The problem of diversifying search results has attracted much attention, since diverse results can provide non-redundant information and cover multiple query-related topics. However, existing approaches typically assign equal importance to each topic. In this paper, we propose a novel method for diversification: proportionally diversifying search results. Specifically, we study the problem of returning a top-k ranked list where the number of candidates in each topic is proportional to the popularity degree of that topic with respect to the query. We obtain such a top-k proportionally diverse list by maximizing our proposed objective function, and we prove that this is an NP-hard problem. We further propose a greedy heuristic to efficiently obtain a good approximate solution (a simplified sketch follows this entry). To evaluate the effectiveness of our model, we also propose a novel metric based on the concept of proportionality. Extensive experimental evaluations over our proposed metric as well as standard measures demonstrate the effectiveness and efficiency of our method.

Lin Wu, Yang Wang, John Shepherd, Xiang Zhao
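
The quota-filling flavor of proportional diversification can be sketched in a few lines. This is an illustration of the idea only, not the paper's objective function or its approximation guarantees; the function name and data layout are invented, and topic popularities are assumed positive.

```python
def proportional_top_k(candidates, topic_popularity, k):
    """Greedy proportional-diversification sketch. `candidates` maps
    topic -> [(relevance, item), ...]; each topic receives a quota of the
    k result slots proportional to its popularity, and any leftover slots
    go to the best remaining candidates overall."""
    total = sum(topic_popularity.values())
    picked, leftovers = [], []
    for topic, pool in candidates.items():
        quota = round(k * topic_popularity.get(topic, 0) / total)
        ranked = sorted(pool, key=lambda t: t[0], reverse=True)
        picked.extend(ranked[:quota])      # fill this topic's quota
        leftovers.extend(ranked[quota:])
    picked = sorted(picked, key=lambda t: t[0], reverse=True)[:k]
    if len(picked) < k:                    # rounding left some slots open
        picked += sorted(leftovers, key=lambda t: t[0],
                         reverse=True)[:k - len(picked)]
    return picked
```

Topics with higher popularity contribute more entries to the final list, which is exactly the proportionality constraint the abstract describes.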
Joint Naïve Bayes and LDA for Unsupervised Sentiment Analysis

In this paper we propose a hierarchical generative model based on Naïve Bayes and LDA for unsupervised sentiment analysis at the sentence and document levels of granularity simultaneously. In particular, our model, called NB-LDA, assumes that each sentence instead of each word has a latent sentiment label, and the sentiment label then generates a series of features for the sentence independently, in the Naïve Bayes manner. The NB assumption at the sentence level makes it possible to use advanced NLP technologies such as dependency parsing to improve the performance of unsupervised sentiment analysis. Experimental results show that the proposed NB-LDA obtains significantly improved results for sentiment analysis compared to other approaches.

Yong Zhang, Dong-Hong Ji, Ying Su, Hongmiao Wu
An Unsupervised Learning Model to Perform Side Channel Attack

This paper proposes a novel unsupervised learning approach for Power Analysis, a form of side channel attack in Cryptanalysis. Different from existing works that exploit a supervised learning framework to solve this problem, our method does not require any labeled pairs, which contain information of the form {X,Y}={key, power-trace}, but is still capable of deciphering the secret key accurately. Besides proposing a regression-based, unsupervised approach for this purpose, we further propose an enhanced model that exploits the dependency of key bits between different sub-processes during the encryption process to obtain accurate results in a more efficient way. Our experiments show that the proposed method significantly outperforms the state-of-the-art non-learning based decipherment methods.

Jung-Wei Chou, Min-Huang Chu, Yi-Lin Tsai, Yun Jin, Chen-Mou Cheng, Shou-De Lin
Decisive Supervised Learning

Traditional active learning selects the most informative (e.g., the most uncertain) example and queries an oracle for the label. However, as more examples are learned in the process, even the most uncertain examples can become certain. In this case, would it be better to make predictions directly and take the consequence if the prediction is wrong, rather than asking the oracle for labels? In this paper, we propose a new learning paradigm. In contrast to traditional active learning, the learner can obtain true labels not only by querying oracles but also by making predictions and taking the consequence. Under this paradigm, we further propose a novel algorithm named Decisive Learner, which always chooses the most decisive action (either querying oracles or making predictions) in the learning process. Compared to other typical learners (indecisive learners, traditional active learners, conservative learners), we show empirically that our decisive learner makes fewer mistakes and incurs the smallest total costs in the learning process.

Eileen A. Ni, Da Kuang, Charles X. Ling
Learning Overlap Optimization for Domain Decomposition Methods

The finite element method is a numerical simulation technique for solving partial differential equations. Domain decomposition provides a means for parallelizing the expensive simulation with modern computing architecture. Choosing the sub-domains for domain decomposition is a non-trivial task, and in this paper we show how this can be addressed with machine learning. Our method starts with a baseline decomposition, from which we learn tailored sub-domain overlaps from localized neighborhoods. An evaluation of 527 partial differential equations shows that our learned solutions improve the baseline decomposition with high consistency and by a statistically significant margin.

Steven Burrows, Jörg Frochte, Michael Völske, Ana Belén Martínez Torres, Benno Stein
CLUEKR : CLUstering Based Efficient kNN Regression

The k-Nearest Neighbor based regression algorithm assigns a value to the query instance based on the values of its neighboring instances. Although kNN has proved to be a ubiquitous classification/regression tool with good scalability, it suffers from some drawbacks. One of its biggest drawbacks is that it is a lazy learner, i.e., it uses all the training data at runtime. In this paper, we propose CLUEKR, a novel, efficient and accurate clustering-based kNN regression algorithm, which has the advantage of low computational complexity. Instead of searching for the nearest neighbors directly in the entire dataset, we first hierarchically cluster the data and then find the cluster in which the query point should lie (a simplified sketch follows this entry). Our empirical experiments with several real world datasets show that our algorithm reduces the search space for kNN significantly and is yet accurate.

Harshit Dubey, Vikram Pudi
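
A minimal sketch of the cluster-then-search idea, flattened to a single k-means level instead of the paper's hierarchy; the class name and parameter defaults are illustrative, not CLUEKR's actual design.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsRegressor

class ClusteredKNNRegressor:
    """Route each query to its nearest cluster, then run kNN regression
    only inside that cluster (a one-level stand-in for CLUEKR's hierarchy)."""

    def __init__(self, n_clusters=10, n_neighbors=5):
        self.kmeans = KMeans(n_clusters=n_clusters, n_init=10)
        self.n_neighbors = n_neighbors
        self.members = {}

    def fit(self, X, y):
        labels = self.kmeans.fit_predict(X)
        for c in np.unique(labels):
            mask = labels == c
            knn = KNeighborsRegressor(
                n_neighbors=min(self.n_neighbors, int(mask.sum())))
            self.members[c] = knn.fit(X[mask], y[mask])
        return self

    def predict(self, X):
        clusters = self.kmeans.predict(X)  # search space shrinks to one cluster
        return np.array([self.members[c].predict(row[None, :])[0]
                         for c, row in zip(clusters, X)])
```

The neighbor search now scans one cluster rather than the whole training set, which is the source of the speed-up the abstract claims.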
AREM: A Novel Associative Regression Model Based on EM Algorithm

In recent years, there have been increasing efforts in applying association rule mining to build Associative Classification (AC) models. However, the similar area of applying association rule mining to build Associative Regression (AR) models has not been well explored. In this work, we fill this gap by presenting a novel regression model based on association rules called AREM. AREM starts by finding a set of regression rules through an instance-based pruning strategy, in which the best rules for each instance are discovered and combined. Then a probabilistic model is trained by applying the EM algorithm, in which the right hand sides of the rules and their importance weights are updated. The extensive experimental evaluation shows that our model can perform better than both the previously proposed AR model and some state-of-the-art regression models, including Boosted Regression Trees, SVR, CART and Cubist, with the Mean Squared Error (MSE) used as the performance metric.

Zhonghua Jiang, George Karypis
One-Class Transfer Learning with Uncertain Data

One-class learning aims at constructing a distinctive classifier based on labeled data from one class. However, it is a challenge for existing one-class learning methods to transfer knowledge from a source task to a target task for uncertain data. To address this challenge, this paper proposes a novel approach, called uncertain one-class transfer learning with SVM (UOCT-SVM), which first formulates the uncertain data and transfer learning into one-class SVM as an optimization problem and then proposes an iterative framework to build an accurate classifier for the target task. Our proposed method explicitly addresses the problem of one-class transfer learning with uncertain data. Extensive experiments show that our proposed method can mitigate the effect of uncertain data on the decision boundary and transfer knowledge to help build an accurate classifier for the target task, compared with state-of-the-art one-class learning methods.

Bo Liu, Philip S. Yu, Yanshan Xiao, Zhifeng Hao
Time Series Forecasting Using Distribution Enhanced Linear Regression

Amongst the wealth of available machine learning algorithms for forecasting time series, linear regression has remained one of the most important and widely used methods, due to its simplicity and interpretability. A disadvantage, however, is that a linear regression model may often have higher error than models produced by more sophisticated techniques. In this paper, we investigate the use of a grouping-based quadratic mean loss function for improving the performance of linear regression. In particular, we propose segmenting the input time series into groups and simultaneously optimizing both the average loss of each group and the variance of the loss between groups, over the entire series (an illustrative form of this loss follows this entry). This aims to produce a linear model that has low overall error, is less sensitive to distribution changes in the time series and is more robust to outliers. We experimentally investigate the performance of our method and find that it can build models which differ from those produced by standard linear regression, whilst achieving significant reductions in prediction errors.

Goce Ristanoski, Wei Liu, James Bailey
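
The grouping-based objective described in the abstract can be written down directly. The equal weighting of the two terms below is an assumption for illustration, not necessarily the weighting used in the paper, and the function name is invented.

```python
import numpy as np

def grouped_loss(y_true, y_pred, n_groups=4):
    """Segment the series into contiguous groups and combine the average
    per-group squared loss with the variance of losses across groups."""
    errors = (np.asarray(y_true, float) - np.asarray(y_pred, float)) ** 2
    segments = np.array_split(errors, n_groups)      # contiguous time segments
    group_losses = np.array([seg.mean() for seg in segments])
    return group_losses.mean() + group_losses.var()  # low error + stability
```

Minimizing such a loss over the weights of a linear model penalizes parameter choices whose errors concentrate in one part of the series, which is how the method gains robustness to distribution changes.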
Twin Bridge Transfer Learning for Sparse Collaborative Filtering

Collaborative filtering (CF) is widely applied in recommender systems. However, the sparsity issue is still a crucial bottleneck for most existing CF methods. Although target data are extremely sparse for a newly-built CF system, some dense auxiliary data may already exist in other matured related domains. In this paper, we propose a novel approach, Twin Bridge Transfer Learning (TBT), to address the sparse collaborative filtering problem. TBT reduces the sparsity in target data by transferring knowledge from dense auxiliary data through two paths: 1) the latent factors of users and items learned from two dense auxiliary domains, and 2) the similarity graphs of users and items constructed from the learned latent factors. These two paths act as a twin bridge to allow more knowledge to be transferred across domains to reduce the sparsity of target data. Experiments on two benchmark datasets demonstrate that our TBT approach significantly outperforms state-of-the-art CF methods.

Jiangfeng Shi, Mingsheng Long, Qiang Liu, Guiguang Ding, Jianmin Wang
Dimensionality Reduction with Dimension Selection

We propose a novel method called sparse dimensionality reduction (SDR) in this paper. It performs dimension selection while reducing data dimensionality. Unlike traditional dimensionality reduction methods, this method does not require dimensionality estimation. The number of final dimensions is the outcome of the sparse component of this method. In a nutshell, the idea is to transform input data to a suitable space where redundant dimensions are compressible. The structure of this method is very flexible and accommodates a series of variants along this line. In this paper, the data transformation is carried out by Laplacian eigenmaps and the dimension selection is fulfilled by the ℓ2/ℓ1 norm. A Nesterov algorithm is proposed to solve the approximated SDR objective function. Experiments have been conducted on images from video sequences and protein structure data. It is evident that the SDR algorithm has subspace learning capability and may potentially be applied to computer vision applications.

Yi Guo, Junbin Gao, Feng Li
Multi-View Visual Classification via a Mixed-Norm Regularizer

In data mining and machine learning, we often represent instances by multiple views for better descriptions and effective learning. However, such comprehensive representations can introduce redundancy and noise. Learning with these multi-view data without any preprocessing may affect the effectiveness of visual classification. In this paper, we propose a novel mixed-norm joint sparse learning model to effectively eliminate the negative effect of redundant views and noisy attributes (or dimensions) for multi-view multi-label (MVML) classification. In particular, a mixed-norm regularizer, integrating a Frobenius norm and an ℓ2,1-norm, is embedded into the framework of joint sparse learning to achieve the design goals, which include selecting significant views, preserving the intrinsic view structure and removing noisy attributes from the selected views. Moreover, we devise an iterative algorithm to solve the derived objective function of the proposed mixed-norm joint sparse learning model. We theoretically prove that the objective function converges to its global optimum via the algorithm. Experimental results on challenging real-life datasets show the superiority of the proposed learning model over state-of-the-art methods.

Xiaofeng Zhu, Zi Huang, Xindong Wu
Mining Specific Features for Acquiring User Information Needs

Term-based approaches can extract many features in text documents, but most include noise. Many popular text-mining strategies have been adapted to reduce noisy information from extracted features; however, text-mining techniques suffer from the low-frequency problem. The key issue is how to discover relevance features in text documents to fulfil user information needs. To address this issue, we propose a new method to extract specific features from user relevance feedback. The proposed approach includes two stages. The first stage extracts topics (or patterns) from text documents to focus on interesting topics. In the second stage, topics are deployed to lower level terms to address the low-frequency problem and find specific terms. The specific terms are determined based on their appearances in relevance feedback and their distribution in topics or high-level patterns. We test our proposed method with extensive experiments on the Reuters Corpus Volume 1 dataset and TREC topics. Results show that our proposed approach significantly outperforms the state-of-the-art models.

Abdulmohsen Algarni, Yuefeng Li
Ensemble-Based Wrapper Methods for Feature Selection and Class Imbalance Learning

The wrapper feature selection approach is useful in identifying informative feature subsets from high-dimensional datasets. Typically, an inductive algorithm “wrapped” in a search algorithm is used to evaluate the merit of the selected features. However, significant bias may be introduced when dealing with highly imbalanced datasets. That is, the selected features may favour one class while being less useful for the other class. In this paper, we propose an ensemble-based wrapper approach for feature selection from data with highly imbalanced class distributions. The key idea is to create multiple balanced datasets from the original imbalanced dataset via sampling, and subsequently evaluate feature subsets using an ensemble of base classifiers, each trained on a balanced dataset (a simplified sketch follows this entry). The proposed approach provides a unified framework that incorporates ensemble feature selection and multiple sampling in a mutually beneficial way. The experimental results indicate that, overall, features selected by the ensemble-based wrapper are significantly better than those selected by wrappers with a single inductive algorithm in imbalanced data classification.

Pengyi Yang, Wei Liu, Bing B. Zhou, Sanjay Chawla, Albert Y. Zomaya
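
A compact sketch of the evaluation step under stated assumptions: binary labels with class 1 in the minority, undersampling as one plausible balancing scheme, and a decision tree as the base classifier. Function names and defaults are illustrative, not the paper's exact protocol.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def balanced_subsets(X, y, n_subsets=5, seed=0):
    """Build several balanced datasets by undersampling the majority class."""
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    for _ in range(n_subsets):
        sampled = rng.choice(majority, size=minority.size, replace=False)
        idx = np.concatenate([minority, sampled])
        yield X[idx], y[idx]

def ensemble_merit(X, y, feature_subset, base=DecisionTreeClassifier()):
    """Score a candidate feature subset with an ensemble of base classifiers,
    each trained and cross-validated on one balanced dataset."""
    scores = [cross_val_score(clone(base), Xb[:, feature_subset], yb,
                              cv=3).mean()
              for Xb, yb in balanced_subsets(X, y)]
    return float(np.mean(scores))
```

A wrapper search would call ensemble_merit once per candidate subset and keep the subset with the highest averaged score, so no single balanced sample dominates the selection.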
Exploring Groups from Heterogeneous Data via Sparse Learning

Complex networks, such as social networks, biological networks and co-citation networks, are ubiquitous in reality. Identifying groups from data is critical for network analysis, as it can offer deep insights into the structural properties and functions of complex networks. Over the past decades, many attempts from interdisciplinary fields have been made to identify groups from data. However, little attention has been paid to exploring groups and their relationships from different views. In this work, we address this issue by using canonical correlation analysis (CCA) to analyze groups and their interplay in networks. To further improve the interpretability of the results, we solve the optimization problem with sparse learning, and then propose a generalized framework for group discovery from heterogeneous data. This framework enables us to find groups and explicitly model their relationships from diverse views simultaneously. Extensive experimental studies conducted on both synthetic and DBLP datasets demonstrate the effectiveness of the proposed method.

Huawen Liu, Jiuyong Li, Lin Liu, Jixue Liu, Ivan Lee, Jianmin Zhao
Multiplex Topic Models

Multiplex document networks have multiple types of links such as citation and coauthor links between scientific papers. Inferring thematic topics from multiplex document networks requires quantifying and balancing the influence from different types of links. It is therefore a problem of considerable interest and represents significant challenges. To address this problem, we propose a novel multiplex topic model (MTM) that represents the topic influence from different types of links using a factor graph. To estimate parameters in MTM, we also develop an approximate inference algorithm, multiplex belief propagation (MBP), which can estimate the influence weights of multiple links automatically at each learning iteration. Experimental results confirm the superiority of MTM in two applications, document clustering and link prediction, when compared with several state-of-the-art link-based topic models.

Juan Yang, Jia Zeng, William K. Cheung
Integrating Clustering and Ranking on Hybrid Heterogeneous Information Network

Recently, ranking-based clustering on heterogeneous information networks has emerged, showing its advantages in the mutual promotion of clustering and ranking. However, these algorithms are restricted to information networks containing only heterogeneous relations. In many applications, networked data are more complex, and they can be represented as a hybrid network which simultaneously includes heterogeneous and homogeneous relations. It is more promising to promote clustering and ranking performance by combining the heterogeneous and homogeneous relations. This paper studies ranking-based clustering on this kind of hybrid network and proposes the ComClus algorithm. ComClus applies a star schema with self loops to organize the hybrid network and uses a probability model to represent the generative probability of objects. Experiments show that ComClus can achieve more accurate clustering results and produce more reasonable rankings with quick and steady convergence.

Ran Wang, Chuan Shi, Philip S. Yu, Bin Wu
Learning from Multiple Observers with Unknown Expertise

The Internet has emerged as a powerful technology for collecting labeled data from a large number of users around the world at very low cost. Consequently, each instance is often associated with a handful of labels, precluding any assessment of an individual user’s quality. We present a probabilistic model for regression when there are multiple, yet possibly unreliable, observers providing continuous responses. Our approach simultaneously learns the regression function and the expertise of each observer, allowing us to predict the ground truth and observers’ responses on new data. Experimental results on both synthetic and real-world data sets indicate that the proposed method has clear advantages over the “taking the average” baseline and some state-of-the-art models.

Han Xiao, Huang Xiao, Claudia Eckert
Backmatter
Metadata
Title
Advances in Knowledge Discovery and Data Mining
Edited by
Jian Pei
Vincent S. Tseng
Longbing Cao
Hiroshi Motoda
Guandong Xu
Copyright year
2013
Publisher
Springer Berlin Heidelberg
Electronic ISBN
978-3-642-37453-1
Print ISBN
978-3-642-37452-4
DOI
https://doi.org/10.1007/978-3-642-37453-1
