Top

2018 | Book

Read chapter Read first chapter

Advances in Knowledge Discovery and Data Mining

22nd Pacific-Asia Conference, PAKDD 2018, Melbourne, VIC, Australia, June 3-6, 2018, Proceedings, Part III

Editors: Dinh Phung, Vincent S. Tseng, Prof. Geoffrey I. Webb, Bao Ho, Mohadeseh Ganji, Lida Rashidi

Publisher: Springer International Publishing

Book Series : Lecture Notes in Computer Science

Part of: Springer Professional "Wirtschaft+Technik" , Springer Professional "Technik" , Springer Professional "Wirtschaft"

About this book

This three-volume set, LNAI 10937, 10938, and 10939, constitutes the thoroughly refereed proceedings of the 22nd Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2018, held in Melbourne, VIC, Australia, in June 2018.

The 164 full papers were carefully reviewed and selected from 592 submissions. The volumes present papers focusing on new ideas, original research results and practical development experiences from all KDD related areas, including data mining, data warehousing, machine learning, artificial intelligence, databases, statistics, knowledge engineering, visualization, decision-making systems and the emerging applications.

Frontmatter

Feature Learning and Data Mining Process

Frontmatter

Discovering High Utility Itemsets Based on the Artificial Bee Colony Algorithm

Mining high utility itemsets (HUI) is an interesting research problem in data mining. Recently, evolutionary computation has attracted researchers’ attention, and based on the genetic algorithm and particle swarm optimization, new algorithms for mining HUIs have been proposed. In this paper, we propose a new algorithm called HUI mining based on the artificial bee colony algorithm (HUIM-ABC). In HUIM-ABC, a bitmap is used to transform the original database that represents a nectar source and three types of bee. In addition to an efficient bitwise operation and direct utility computation, a bitmap can also be used for search space pruning. Furthermore, the size of discovered itemsets is used to generate new nectar sources, which has a higher chance of producing HUIs than generating new nectar sources at random. Extensive tests show that the proposed algorithm outperforms existing state-of-the-art algorithms.

Wei Song, Chaomin Huang

A Scalable and Efficient Subgroup Blocking Scheme for Multidatabase Record Linkage

Record linkage is a commonly used task in data integration to facilitate the identification of matching records that refer to the same entity from different databases. The scalability of multidatabase record linkage (MDRL) is significantly challenged with the increase of both the sizes and the number of databases that are to be linked. Identifying matching records across subgroups of databases is an important aspect in MDRL that has not been addressed so far. We propose a scalable subgroup blocking approach for MDRL that uses an efficient search over a graph structure to identify similar blocks of records that need to be compared across subgroups of multiple databases. We provide an analysis of our technique in terms of complexity and blocking quality. We conduct an empirical study on large real-world datasets that shows our approach is scalable with the size of subgroups and the number of databases, and outperforms an existing state-of-the-art blocking technique for MDRL.

Thilina Ranbaduge, Dinusha Vatsalan, Peter Christen

Efficient Feature Selection Framework for Digital Marketing Applications

Digital marketing strategies can help businesses achieve better Return on Investment (ROI). Big data and predictive modelling are key to identifying these specific customers. Yet the very rich and mostly irrelevant attributes(features) will adversely affect the predictive modelling performance, both computationally and qualitatively. So selecting relevant features is a crucial task for marketing applications. The feature selection process is very time consuming due to the large amount of data and high dimensionality of features. In this paper, we propose to reduce the computation time through regularizing the feature search process using expert knowledge. We also combine the regularized search with a generative filtering step, so we can address potential problems with the regularized search and further speed up the process. In addition, a progressive sampling and coarse to fine selection framework is built to further lower the space and time requirements.

Wei Zhang, Shiladitya Bose, Said Kobeissi, Scott Tomko, Chris Challis

Dynamic Feature Selection Algorithm Based on Minimum Vertex Cover of Hypergraph

Feature selection is an important pre-processing step in many fields, such as data mining, machine learning and pattern recognition. This paper focuses on dynamically updating a subset of features with new samples arriving and provides a hypergraph model to deal with dynamic feature selection problem. Firstly, we discuss the relationship between feature selection of information system and minimum vertex cover of hypergraph, and feature selection is converted to a minimum vertex cover problem based on this relationship. Then, an algorithm for generating induced hypergraph from information system is presented, the induced hypergraph can be divided into two part: the original induced hypergraph and the added hypergraph with new samples arriving. Finally, a novel dynamic feature selection algorithm based on minimum vertex cover of hypergraph is proposed, and this algorithm only needs a small amount of computation. Experiments show that the proposed method is feasible and highly effective.

Xiaojun Xie, Xiaolin Qin

Feature Selection for Multiclass Binary Data

Feature selection in binary datasets is an important task in many real world machine learning applications such as document classification, genomic data analysis, and image recognition. Despite many algorithms available, selecting features that distinguish all classes from one another in a multiclass binary dataset remains a challenge. Furthermore, many existing feature selection methods incur unnecessary computation costs for binary data, as they are not specifically designed for binary data. We show that exploiting the symmetry and feature value imbalance of binary datasets, more efficient feature selection measures that can better distinguish the classes in multiclass binary datasets can be developed. Using these measures, we propose a greedy feature selection algorithm, CovSkew, for multiclass binary data. We show that CovSkew achieves high accuracy gain over baseline methods, upto $$\sim $$∼40%, especially when the selected feature subset is small. We also show that CovSkew has low computational costs compared with most of the baselines.

Kushani Perera, Jeffrey Chan, Shanika Karunasekera

Scalable Model-Based Cascaded Imputation of Missing Data

Missing data is a common trait of real-world data that can negatively impact interpretability. In this paper, we present Cascade Imputation (CIM), an effective and scalable technique for automatic imputation of missing data. CIM is not restrictive on the characteristics of the data set, providing support for: Missing At Random and Missing Completely At Random data, numerical and nominal attributes, and large data sets including highly dimensional data sets. We compare CIM against well-established imputation techniques over a variety of data sets under multiple test configurations to measure the impact of imputation on the classification problem. Test results show that CIM outperforms other imputation methods over multiple test conditions. Additionally, we identify optimal performance and failure conditions for popular imputation techniques.

Jacob Montiel, Jesse Read, Albert Bifet, Talel Abdessalem

On Reducing Dimensionality of Labeled Data Efficiently

We address the problem of reducing dimensionality for labeled data. Our objective is to achieve better class separation in latent space. Existing nonlinear algorithms rely on pairwise distances between data samples, which are generally infeasible to compute or store in the large data limit. In this paper, we propose a parametric nonlinear algorithm that employs a spherical mixture model in the latent space. The proposed algorithm attains grand efficiency in reducing data dimensionality, because it only requires distances between data points and cluster centers. In our experiments, the proposed algorithm achieves up to 44 times better efficiency while maintaining similar efficacy. In practice, it can be used to speedup k-NN classification or visualize data points with their class structure.

Guoxi Zhang, Tomoharu Iwata, Hisashi Kashima

Using Metric Space Indexing for Complete and Efficient Record Linkage

Record linkage is the process of identifying records that refer to the same real-world entities in situations where entity identifiers are unavailable. Records are linked on the basis of similarity between common attributes, with every pair being classified as a link or non-link depending on their similarity. Linkage is usually performed in a three-step process: first, groups of similar candidate records are identified using indexing, then pairs within the same group are compared in more detail, and finally classified. Even state-of-the-art indexing techniques, such as locality sensitive hashing, have potential drawbacks. They may fail to group together some true matching records with high similarity, or they may group records with low similarity, leading to high computational overhead. We propose using metric space indexing (MSI) to perform complete linkage, resulting in a parameter-free process combining indexing, comparison and classification into a single step delivering complete and efficient record linkage. An evaluation on real-world data from several domains shows that linkage using MSI can yield better quality than current indexing techniques, with similar execution cost, without the need for domain knowledge or trial and error to configure the process.

Özgür Akgün, Alan Dearle, Graham Kirby, Peter Christen

Dimensionality Reduction via Community Detection in Small Sample Datasets

Real world networks constructed from raw data are often characterized by complex community structures. Existing dimensionality reduction techniques, however, do not take such characteristics into account. This is especially important for problems with low number of samples where the curse of dimensionality is particularly significant. Therefore, in this paper, we propose FeatureNet, a novel community-based dimensionality reduction framework targeting small sample problems. To this end, we propose a new method to directly construct a network from high-dimensional raw data while explicitly revealing its hidden community structure; these communities are then used to learn low-dimensional features using a representation learning framework. We show the effectiveness of our approach on eight datasets covering application areas as diverse as handwritten digits, biology, physical sciences, NLP, and computational sustainability. Extensive experiments on the above datasets (with sizes mostly between 100 and 1500 samples) demonstrate that FeatureNet significantly outperforms (i.e., up to 40% improvement in classification accuracy) ten well-known dimensionality reduction methods like PCA, Kernel PCA, Isomap, SNE, t-SNE, etc.

Kartikeya Bhardwaj, Radu Marculescu

An Interaction-Enhanced Feature Selection Algorithm

Feature selection is a crucial pre-processing step in machine learning and data mining. A popular approach is based on information theoretic measures. Most of the existing methods used low-dimensional mutual information terms that are ineffective in detecting high-order feature interactions. To fill this gap, we employ higher-order interactions for feature selection. We first relax the assumptions of MI-based methods to allow for higher-order interactions. A direct calculation of the interaction terms is computationally expensive. We use four-dimensional joint mutual information, a computationally efficient measure, to estimate the interaction terms. We also use the ‘maximum of the minimum’ nonlinear approach to avoid the overestimation of feature significance. Finally, we arrive at an effective feature selection method that makes use of higher-order interactions. To evaluate the performance of the proposed method, we compare it with seven representative feature selection methods, including RelaxMRMR, JMIM, IWFS, CIFE, MIFS, MIM, and reliefF. Experimental results on eighteen benchmark data sets demonstrate that higher-order interactions are effective in improving MI-based feature selection.

Xiaochuan Tang, Yuanshun Dai, Yanping Xiang, Liang Luo

An Extended Random-Sets Model for Fusion-Based Text Feature Selection

Selecting features that represent a specific corpus is important for the success of many machine learning and text mining applications. In information retrieval (IR), fusion-based techniques have shown remarkable performance compared to traditional models. However, in text feature selection (FS), popular models do not consider the fusion of the taxonomic features of the corpus. This research proposed an innovative and effective extended random-sets model for fusion-based FS. The model fused scores of different hierarchal features to accurately weight the representative words based on their appearance across the documents in the corpus and in several latent topics. The model was evaluated for information filtering (IF) using TREC topics and the standard RCV1 dataset. The results showed that the proposed model significantly outperformed eleven state-of-the-art baseline models in six evaluation metrics.

Abdullah Semran Alharbi, Yuefeng Li, Yue Xu

Attribute Reduction Algorithm Based on Improved Information Gain Rate and Ant Colony Optimization

Solving minimal attribute reduction (MAR) in rough set theory is a NP-hard and nonlinear constrained combinatorial optimization problem. Ant colony optimization (ACO), a new intelligent computing method, takes strategies of heuristic search, which is characterized by a distributed and positive feedback and it has the advantage of excellent global optimization ability for handling combinatorial optimization problems. Having considered that the existing information entropy and information gain methods fail to help to select the optimal minimal attribute every time, this paper proposed a novel attribute reduction algorithm based on ACO. Firstly, the algorithm adopts an improved information gain rate as heuristic information. Secondly, each ant solves a problem of minimum attributes reduction and then conduct redundancy test to each selected attribute. What’s more, redundant detection of all non-core attributes in the optimal solution will be perfomed in each generation. The result of the experiment on several datasets from UCI show that the proposed algorithms are more capable of finding the minimum attribute reduction and can faster converge and at the same time they can almost keep the classification accuracy, compared with the traditional attribute reduction based on ACO algorithm.

Jipeng Wei, Qianjin Wei, Yimin Wen

Efficient Approximate Algorithms for the Closest Pair Problem in High Dimensional Spaces

The Closest Pair Problem (CPP) is one of the fundamental problems that has a wide range of applications in data mining, such as unsupervised data clustering, user pattern similarity search, etc. A number of exact and approximate algorithms have been proposed to solve it in the low dimensional space. In this paper, we address the problem when the metric space is of a high dimension. For example, the drug-target or movie-user interaction data could contain as many as hundreds of features. To solve this problem under the $$\ell _2$$ℓ2 norm, we present two novel approximate algorithms. Our algorithms are based on the novel idea of projecting the points into the real line. We prove high probability bounds on the run time and accuracy for both of the proposed algorithms. Both algorithms are evaluated via comprehensive experiments and compared with existing best-known approaches. The experiments reveal that our proposed approaches outperform the existing methods.

Xingyu Cai, Sanguthevar Rajasekaran, Fan Zhang

Efficient Compression Technique for Sparse Sets

Recent growth in internet has generated large amount of data over web. Representations of most of such data are high-dimensional and sparse. Many fundamental subroutines of various data analytics tasks such as clustering, ranking, nearest neighbour scales poorly with the data dimension. In spite of significant growth in the computational power performing such computations on high dimensional data sets are infeasible, and at times impossible. Thus, it is desirable to investigate on compression algorithms that can significantly reduce dimension while preserving similarity between data objects. In this work, we consider the data points as sets, and use Jaccard similarity as the similarity measure. Pratap and Kulkarni [10] suggested a compression technique for high dimensional, sparse, binary data for preserving the Inner product and Hamming distance. In this work, we show that their algorithm also works well for Jaccard similarity. We present a theoretical analysis of compression bound and complement it with rigorous experimentation on synthetic and real-world datasets. We also compare our results with the state-of-the-art “min-wise independent permutation [6]”, and show that our compression algorithm achieves almost equal accuracy while significantly reducing the compression time and the randomness.

Rameshwar Pratap, Ishan Sohony, Raghav Kulkarni

It Pays to Be Certain: Unsupervised Record Linkage via Ambiguity Minimization

Record linkage (RL) is a process of identifying records that refer to the same real-world entity. Many existing approaches to RL apply supervised machine learning (ML) techniques to generate a classification model that classifies a pair of records as either linked or non-linked. In such techniques, the labeled data helps guide the choice and relative importance to similarity measures to be employed in RL. Unsupervised RL is therefore a more challenging problem since the quality of similarity measures needs to be estimated in the absence of linkage labels. In this paper we propose a novel optimization approach to unsupervised RL. We define a scoring technique which aggregates similarities between two records along all attributes and all available similarity measures using a weighted sum formulation. The core idea behind our method is embodied in an objective function representing the overall ambiguity of the scoring across a dataset. Our goal is to iteratively optimize the objective function to progressively refine estimates of the scoring weights in the direction of lesser overall ambiguity. We have evaluated our approach on multiple real world datasets which are commonly used in the RL community. Our experimental results show that our proposed approach outperforms state-of-the-art techniques, while being orders of magnitude faster.

Anna Jurek, Deepak P.

Community Detection and Network Science

Frontmatter

Consensus Community Detection in Multilayer Networks Using Parameter-Free Graph Pruning

The clustering ensemble paradigm has emerged as an effective tool for community detection in multilayer networks, which allows for producing consensus solutions that are designed to be more robust to the algorithmic selection and configuration bias. However, one limitation is related to the dependency on a co-association threshold that controls the degree of consensus in the community structure solution. The goal of this work is to overcome this limitation with a new framework of ensemble-based multilayer community detection, which features parameter-free identification of consensus communities based on generative models of graph pruning that are able to filter out noisy co-associations. We also present an enhanced version of the modularity-driven ensemble-based multilayer community detection method, in which community memberships of nodes are reconsidered to optimize the multilayer modularity of the consensus solution. Experimental evidence on real-world networks confirms the beneficial effect of using model-based filtering methods and also shows the superiority of the proposed method on state-of-the-art multilayer community detection.

Domenico Mandaglio, Alessia Amelio, Andrea Tagarelli

Community Discovery Based on Social Relations and Temporal-Spatial Topics in LBSNs

Community discovery is a comprehensive problem associating with sociology and computer science. The recent surge of Location-Based Social Networks (LBSNs) brings new challenges to this problem as there is no definite community structure in LBSNs. This paper tackles the multidimensional community discovery in LBSNs based on user check-in characteristics. Communities discovered in this paper satisfy two requirements: frequent user interaction and consistent temporal-spatial pattern. Firstly, based on a new definition of dynamic user interaction, two types of check-ins in LBSNs are distinguished. Secondly, a novel community discovery model called SRTST is conceived to describe the generative process of different types of check-ins. Thirdly, the Gibbs Sampling algorithm is derived for the model parameter estimation. In the end, empirical experiments on real-world LBSN datasets are designed to validate the performance of the proposed model. Experimental results show that SRTST model can discover multidimensional communities and it outperforms the state-of-the-art methods on various evaluation metrics.

Shuai Xu, Jiuxin Cao, Xuelin Zhu, Yi Dong, Bo Liu

A Unified Weakly Supervised Framework for Community Detection and Semantic Matching

Due to the sparsity of network, some community detection methods only based on topology often lead to relatively low accuracy. Although some methods have been proposed to improve the detection accuracy by using few known semi-supervised information or node content, the research of community detection not only pursues the enhancement of community accuracy, but also pays more attention to the semantic description for communities. In this paper, we proposed a unified nonnegative matrix factorization framework simultaneously for community detection and semantic matching by integrating both semi-supervised information and node content. The framework reveals two-fold community structures as well as their coupling relationship matrix, which helps to identify accurate community structure and at the same time assign specific semantic information to each community. Experiments on some real networks show that the framework is efficient to match each community with specific semantic information, and the performance are superior over the compared methods.

Wenjun Wang, Xiao Liu, Pengfei Jiao, Xue Chen, Di Jin

Tapping Community Memberships and Devising a Novel Homophily Modeling Approach for Trust Prediction

Prediction of trust relations between users of social networks is critical for finding credible information. Inferring trust is challenging, since user-specified trust relations are highly sparse and power-law distributed. In this paper, we explore utilizing community memberships for trust prediction in a principled manner. We also propose a novel method to model homophily that complements existing work. To the best of our knowledge, this is the first work that mathematically formulates an insight based on community memberships for unsupervised trust prediction. We propose and model the hypothesis that a user is more likely to develop a trust relation within the user’s community than outside it. Unlike existing work, our approach for encoding homophily directly links user-user similarities with the pair-wise trust model. We derive mathematical factors that model our hypothesis relating community memberships to trust relations and the homophily effect. Along with low-rank matrix factorization, they are combined into chTrust, the proposed multi-faceted optimization framework. Our experiments on the standard Ciao and Epinions datasets show that the proposed framework outperforms multiple unsupervised trust prediction baselines for all test user pairs as well as the low-degree segment, across evaluation settings.

Pulkit Parikh, Manish Gupta, Vasudeva Varma

Deep Learning Theory and Applications in KDD

Frontmatter

Text-Visualizing Neural Network Model: Understanding Online Financial Textual Data

This study aims to visualize financial documents to swiftly obtain market sentiment information from these documents and determine the reason for which sentiment decisions are made. This type of visualization is considered helpful for nonexperts to easily understand technical documents such as financial reports. To achieve this, we propose a novel interpretable neural network (NN) architecture called gradient interpretable NN (GINN). GINN can visualize both the market sentiment score from a whole financial document and the sentiment gradient scores in concept units. We experimentally demonstrate the validity of text visualization produced by GINN using a real textual dataset.

Tomoki Ito, Hiroki Sakaji, Kota Tsubouchi, Kiyoshi Izumi, Tatsuo Yamashita

MIDA: Multiple Imputation Using Denoising Autoencoders

Missing data is a significant problem impacting all domains. State-of-the-art framework for minimizing missing data bias is multiple imputation, for which the choice of an imputation model remains nontrivial. We propose a multiple imputation model based on overcomplete deep denoising autoencoders. Our proposed model is capable of handling different data types, missingness patterns, missingness proportions and distributions. Evaluation on several real life datasets show our proposed model significantly outperforms current state-of-the-art methods under varying conditions while simultaneously improving end of the line analytics.

Lovedeep Gondara, Ke Wang

Dual Control Memory Augmented Neural Networks for Treatment Recommendations

We formulate the task of treatment recommendation as a sequence-to-sequence prediction model that takes the time–ordered medical history as input, and predicts a sequence of future clinical procedures and medications. It is built on the premise that an effective treatment plan may have long–term dependencies from previous medical history. We approach the problem by using a memory–augmented neural network, in particular, by leveraging the recent differentiable neural computer that consists of a neural controller and an external memory module. Differing from the original model, we use dual controllers, one for encoding the history followed by another for decoding the treatment sequences. In the encoding phase, the memory is updated as new input is read; at the end of this phase, the memory holds not only the medical history but also the information about the current illness. During the decoding phase, the memory is write–protected. The decoding controller generates a treatment sequence, one treatment option at a time. The resulting dual controller write–protected memory–augmented neural network is demonstrated on the MIMIC-III dataset on two tasks: procedure prediction and medication prescription. The results show improved performance over both traditional bag-of-words and sequence-to-sequence methods.

Hung Le, Truyen Tran, Svetha Venkatesh

Denoising Time Series Data Using Asymmetric Generative Adversarial Networks

Denoising data is a preprocessing step for several time series mining algorithms. This step is especially important if the noise in data originates from diverse sources. Consequently, it is commonly used in biomedical applications that use Electroencephalography (EEG) data. In EEG data noise can occur due to ocular, muscular and cardiac activities. In this paper, we explicitly learn to remove noise from time series data without assuming a prior distribution of noise. We propose an online, fully automated, end-to-end system for denoising time series data. Our model for denoising time series is trained using unpaired training corpora and does not need information about the source of the noise or how it is manifested in the time series. We propose a new architecture called AsymmetricGAN that uses a generative adversarial network for denoising time series data. To analyze our approach, we create a synthetic dataset that is easy to visualize and interpret. We also evaluate and show the effectiveness of our approach on an existing EEG dataset.

Sunil Gandhi, Tim Oates, Tinoosh Mohsenin, David Hairston

Shared Deep Kernel Learning for Dimensionality Reduction

Deep Kernel Learning (DKL) has been proven to be an effective method to learn complex feature representation by combining the structural properties of deep learning with the nonparametric flexibility of kernel methods, which can be naturally used for supervised dimensionality reduction. However, if limited training data are available its performance could be compromised because parameters of the deep structure embedded into the model are large and difficult to be efficiently optimized. In order to address this issue, we propose the Shared Deep Kernel Learning model by combining DKL with shared Gaussian Process Latent Variable Model. The novel method could not only bring the improved performance without increasing model complexity but also learn the hierarchical features by sharing the deep kernel. The comparison with some supervised dimensionality reduction methods and deep learning approach verify the advantages of the proposed model.

Xinwei Jiang, Junbin Gao, Xiaobo Liu, Zhihua Cai, Dongmei Zhang, Yuanxing Liu

CDSSD: Refreshing Single Shot Object Detection Using a Conv-Deconv Network

Single shot multi-box object detectors [13] have been recently shown to achieve state-of-the-art performance on object detection tasks. We extend the single shot detection (SSD) framework in [13] and propose a generic architecture using a deep convolution-deconvolution network. Our architecture does not rely on any pretrained network, and can be pretrained in an unsupervised manner for a given image dataset. Furthermore, we propose a novel approach to combine feature maps from both convolution and deconvolution layers to predict bounding boxes and labels with improved accuracy. Our framework, Conv-Deconv SSD (CDSSD), with its two key contributions – unsupervised pretraining and multi-layer confluence of convolution-deconvolution feature maps – results in state-of-the-art performance while utilizing significantly less number of bounding boxes and improved identification of small objects. On $$300 \times 300$$300×300 image inputs, we achieve 80.7% mAP on VOC07 and 78.1% mAP on VOC07+12 (1.7% to 2.8% improvement over StairNet [21], DSSD [5], SSD [13]). CDSSD achieves 30.2% mAP on COCO performing at-par with R-FCN [3] and faster-R-FCN [18], while working on smaller size input images. Furthermore, CDSSD matches SSD performance while utilizing 82% of data, and reduces the prediction time per image by 10%.

Vijay Gabale, Uma Sawant

Binary Classification of Sequences Possessing Unilateral Common Factor with AMS and APR

Most real-world sequence data for binary classification tasks appear to possess unilateral common factor. That is, samples from one of the classes occur because of common underlying causes while those from the other class may not. We are interested in resolving these tasks using convolutional neural networks (CNN). However, due to both the technical specification and the nature of the data, learning a classifier is generally associated with two problems: (1) defining a segmentation window size to sub-sequence for sufficient data augmentation and avoiding serious multiple-instance learning issue is non-trivial; (2) samples from one of the classes have common underlying causes and thus present similar features, while those from the other class can have various latent characteristics which can distract CNN in the learning process. We mitigate the first problem by introducing a random variable on sample scaling parameters, whose distribution’s parameters are jointly learnt with CNN and leads to what we call adaptive multi-scale sampling (AMS). To address the second problem, we propose activation pattern regularization (APR) on only samples with the common causes such that CNN focuses on learning representations pertaining to the common factor. We demonstrate the effectiveness of both proposals in extensive experiments on real-world datasets.

Yujin Tang, Kei Yonekawa, Mori Kurokawa, Shinya Wada, Kiyohito Yoshihara

Automating Reading Comprehension by Generating Question and Answer Pairs

Neural network-based methods represent the state-of-the-art in question generation from text. Existing work focuses on generating only questions from text without concerning itself with answer generation. Moreover, our analysis shows that handling rare words and generating the most appropriate question given a candidate answer are still challenges facing existing approaches. We present a novel two-stage process to generate question-answer pairs from the text. For the first stage, we present alternatives for encoding the span of the pivotal answer in the sentence using Pointer Networks. In our second stage, we employ sequence to sequence models for question generation, enhanced with rich linguistic features. Finally, global attention and answer encoding are used for generating the question most relevant to the answer. We motivate and linguistically analyze the role of each component in our framework and consider compositions of these. This analysis is supported by extensive experimental evaluations. Using standard evaluation metrics as well as human evaluations, our experimental results validate the significant improvement in the quality of questions generated by our framework over the state-of-the-art. The technique presented here represents another step towards more automated reading comprehension assessment. We also present a live system (Demo of the system is available at https://www.cse.iitb.ac.in/~vishwajeet/autoqg.html.) to demonstrate the effectiveness of our approach.

Vishwajeet Kumar, Kireeti Boorla, Yogesh Meena, Ganesh Ramakrishnan, Yuan-Fang Li

Emotion Classification with Data Augmentation Using Generative Adversarial Networks

It is a difficult task to classify images with multiple class labels using only a small number of labeled examples, especially when the label (class) distribution is imbalanced. Emotion classification is such an example of imbalanced label distribution, because some classes of emotions like disgusted are relatively rare comparing to other labels like happy or sad. In this paper, we propose a data augmentation method using generative adversarial networks (GAN). It can complement and complete the data manifold and find better margins between neighboring classes. Specifically, we design a framework using a CNN model as the classifier and a cycle-consistent adversarial networks (CycleGAN) as the generator. In order to avoid gradient vanishing problem, we employ the least-squared loss as adversarial loss. We also propose several evaluation methods on three benchmark datasets to validate GAN’s performance. Empirical results show that we can obtain 5%–10% increase in the classification accuracy after employing the GAN-based data augmentation techniques.

Xinyue Zhu, Yifan Liu, Jiahong Li, Tao Wan, Zengchang Qin

Trans2Vec: Learning Transaction Embedding via Items and Frequent Itemsets

Learning meaningful and effective representations for transaction data is a crucial prerequisite for transaction classification and clustering tasks. Traditional methods which use frequent itemsets (FIs) as features often suffer from the data sparsity and high-dimensionality problems. Several supervised methods based on discriminative FIs have been proposed to address these disadvantages, but they require transaction labels, thus rendering them inapplicable to real-world applications where labels are not given. In this paper, we propose an unsupervised method which learns low-dimensional continuous vectors for transactions based on information of both singleton items and FIs. We demonstrate the superior performance of our proposed method in classifying transactions on four datasets compared with several state-of-the-art baselines.

Dang Nguyen, Tu Dinh Nguyen, Wei Luo, Svetha Venkatesh

Detecting Complex Sensitive Information via Phrase Structure in Recursive Neural Networks

State-of-the-art sensitive information detection in unstructured data relies on the frequency of co-occurrence of keywords with sensitive seed words. In practice, however, this may fail to detect more complex patterns of sensitive information. In this work, we propose learning phrase structures that separate sensitive from non-sensitive documents in recursive neural networks. Our evaluation on real data with human labeled sensitive content shows that our new approach outperforms existing keyword based strategies.

Jan Neerbek, Ira Assent, Peter Dolog

Clustering and Unsupervised Learning

Frontmatter

A Distance Scaling Method to Improve Density-Based Clustering

Density-based clustering is able to find clusters of arbitrary sizes and shapes while effectively separating noise. Despite its advantage over other types of clustering, it is well-known that most density-based algorithms face the same challenge of finding clusters with varied densities. Recently, ReScale, a principled density-ratio preprocessing technique, enables a density-based clustering algorithm to identify clusters with varied densities. However, because the technique is based on one-dimensional scaling, it does not do well in datasets which require multi-dimensional scaling. In this paper, we propose a multi-dimensional scaling method, named DScale, which rescales based on the computed distance. It overcomes the key weakness of ReScale and requires one less parameter while maintaining the simplicity of the implementation. Our empirical evaluation shows that DScale has better clustering performance than ReScale for three existing density-based algorithms, i.e., DBSCAN, OPTICS and DP, on synthetic and real-world datasets.

Ye Zhu, Kai Ming Ting, Maia Angelova

Neighbourhood Contrast: A Better Means to Detect Clusters Than Density

Most density-based clustering algorithms suffer from large density variations among clusters. This paper proposes a new measure called Neighbourhood Contrast (NC) as a better alternative to density in detecting clusters. The proposed NC admits all local density maxima, regardless of their densities, to have similar NC values. Due to this unique property, NC is a better means to detect clusters in a dataset with large density variations among clusters. We provide two applications of NC. First, replacing density with NC in the current state-of-the-art clustering procedure DP leads to significantly improved clustering performance. Second, we devise a new clustering algorithm called Neighbourhood Contrast Clustering (NCC) which does not require density or distance calculations, and therefore has a linear time complexity in terms of dataset size. Our empirical evaluation shows that both NC-based methods outperform density-based methods including the current state-of-the-art.

Bo Chen, Kai Ming Ting

Clustering of Multiple Density Peaks

Density-based clustering, such as Density Peak Clustering (DPC) and DBSCAN, can find clusters with arbitrary shapes and have wide applications such as image processing, spatial data mining and text mining. In DBSCAN, a core point has density greater than a threshold, and can spread its cluster ID to its neighbours. However, the core points selected by one cut/threshold are too coarse to segment fine clusters that are sensitive to densities. DPC resolves this problem by finding a data point with the peak density as centre to develop a fine cluster. Unfortunately, a DPC cluster that comprises only one centre may be too fine to form a natural cluster. In this paper, we provide a novel clustering of multiple density peaks (MDPC) to find clusters with arbitrary number of regional centres with local peak densities through extending DPC. In MDPC, we generate fine seed clusters containing single density peaks, and form clusters with multiple density peaks by merging those clusters that are close to each other and have similar density distributions. Comprehensive experiments have been conducted on both synthetic and real-world datasets to demonstrate the accuracy and effectiveness of MDPC compared with DPC, DBSCAN and other base-line clustering algorithms.

Borui Cai, Guangyan Huang, Yong Xiang, Jing He, Guang-Li Huang, Ke Deng, Xiangmin Zhou

A New Local Density for Density Peak Clustering

Density peak clustering is able to recognize clusters of arbitrary shapes, so it has attracted attention in academic community. However, existing density peak clustering algorithms prefer to select cluster centers from dense regions and thus easily ignore clusters from sparse regions. To solve this problem, we redefine the local density of a point as the number of points whose neighbors contain this point. This idea is based on our following finding: whether in dense clusters or in sparse clusters, a cluster center would have a relatively high local density calculated by our new measure. Even in a sparse region, there may be some points with high local densities in our definition, thus one of these points can be selected to be the center of this region in subsequent steps and this region is then detected as a cluster. We apply our new definition to both density peak clustering and the combination of density peak clustering with agglomerative clustering. Experiments on benchmark datasets show the effectiveness of our methods.

Zhishuai Guo, Tianyi Huang, Zhiling Cai, William Zhu

An Efficient Ranking-Centered Density-Based Document Clustering Method

Document clustering is a popular method for discovering useful information from text data. This paper proposes an innovative hybrid document clustering method based on the novel concepts of ranking, density and shared neighborhood. We utilize ranked documents generated from a search engine to effectively build a graph of shared relevant documents. The high density regions in the graph are processed to form initial clusters. The clustering decisions are further refined using the shared neighborhood information. Empirical analysis shows that the proposed method is able to produce accurate and efficient solution as compared to relevant benchmarking methods.

Wathsala Anupama Mohotti, Richi Nayak

Fast Manifold Landmarking Using Locality-Sensitive Hashing

Manifold landmarks can approximately represent the low-dimensional nonlinear manifold structure embedded in high-dimensional ambient feature space. Due to the quadratic complexity of many learning algorithms in the number of training samples, selecting a sample subset as manifold landmarks has become an important issue for scalable learning. Unfortunately, state-of-the-art Gaussian process methods for selecting manifold landmarks themselves are not scalable to large datasets. In an attempt to speed up learning manifold landmarks, uniformly selected minibatch stochastic gradient descent is used by the state-of-the-art approach. Unfortunately, this approach only goes part way to making manifold learning tractable. We propose two adaptive sample selection approaches for gradient-descent optimization, which can lead to better performance in accuracy and computational time. Our methods exploit the compatibility of locality-sensitive hashing (via LSH and DBH) and the manifold assumption, thereby limiting expensive optimization to relevant regions of the data. Landmarks selected by our methods achieve superior accuracy than training the state-of-the-art learner with randomly selected minibatch. We also demonstrate that our methods can be used to find manifold landmarks without learning Gaussian processes at all, which leads to orders-of-magnitude speed up with only minimal decrease in accuracy.

Zay Maung Maung Aye, Benjamin I. P. Rubinstein, Kotagiri Ramamohanarao

Equitable Conceptual Clustering Using OWA Operator

We propose an equitable conceptual clustering approach based on multi-agent optimization, where each cluster is represented by an agent having its own satisfaction. The problem consists in finding the best cumulative satisfaction while emphasizing a fair compromise between all individual agents. The fairness goal is achieved using an equitable formulation of the Ordered Weighted Averages (OWA) operator. Experiments performed on UCI and ERP datasets show that our approach efficiently finds clusterings of consistently high quality.

Noureddine Aribi, Abdelkader Ouali, Yahia Lebbah, Samir Loudni

Unsupervised Extremely Randomized Trees

In this paper we present a method to compute dissimilarities on unlabeled data, based on extremely randomized trees. This method, Unsupervised Extremely Randomized Trees, is used jointly with a novel randomized labeling scheme we describe here, and that we call AddCl3. Unlike existing methods such as AddCl1 and AddCl2, no synthetic instances are generated, thus avoiding an increase in the size of the dataset. The empirical study of this method shows that Unsupervised Extremely Randomized Trees with AddCl3 provides competitive results regarding the quality of resulting clusterings, while clearly outperforming previous similar methods in terms of running time.

Kevin Dalleau, Miguel Couceiro, Malika Smail-Tabbone

Local Graph Clustering by Multi-network Random Walk with Restart

Searching local graph clusters is an important problem in big network analysis. Given a query node in a graph, local clustering aims at finding a subgraph around the query node, which consists of nodes highly relevant to the query node. Existing local clustering methods are based on single networks that contain limited information. In contrast, the real data are always comprehensive and can be represented better by multiple connected networks (multi-network). To take the advantage of heterogeneity of multi-network and improve the clustering accuracy, we advance a strategy for local graph clustering based on Multi-network Random Walk with Restart (MRWR), which discovers local clusters on a target network in association with additional networks. For the proposed local clustering method, we develop a localized approximate algorithm (AMRWR) on solid theoretical basis to speed up the searching process. To the best of our knowledge, this is the first elaboration of local clustering on a target network by integrating multiple networks. Empirical evaluations show that the proposed method improves clustering accuracy by more than 10% on average with competently short running time, compared with the alternative state-of-the-art graph clustering approaches.

Yaowei Yan, Dongsheng Luo, Jingchao Ni, Hongliang Fei, Wei Fan, Xiong Yu, John Yen, Xiang Zhang

Scalable Approximation Algorithm for Graph Summarization

Massive sizes of real-world graphs, such as social networks and web graph, impose serious challenges to process and perform analytics on them. These issues can be resolved by working on a small summary of the graph instead. A summary is a compressed version of the graph that removes several details, yet preserves it’s essential structure. Generally, some predefined quality measure of the summary is optimized to bound the approximation error incurred by working on the summary instead of the whole graph. All known summarization algorithms are computationally prohibitive and do not scale to large graphs. In this paper we present an efficient randomized algorithm to compute graph summaries with the goal to minimize reconstruction error. We propose a novel weighted sampling scheme to sample vertices for merging that will result in the least reconstruction error. We provide analytical bounds on the running time of the algorithm and prove approximation guarantee for our score computation. Efficiency of our algorithm makes it scalable to very large graphs on which known algorithms cannot be applied. We test our algorithm on several real world graphs to empirically demonstrate the quality of summaries produced and compare to state of the art algorithms. We use the summaries to answer several structural queries about original graph and report their accuracies.

Maham Anwar Beg, Muhammad Ahmad, Arif Zaman, Imdadullah Khan

Privacy-Preserving and Security

Frontmatter

RIPEx: Extracting Malicious IP Addresses from Security Forums Using Cross-Forum Learning

Is it possible to extract malicious IP addresses reported in security forums in an automatic way? This is the question at the heart of our work. We focus on security forums, where security professionals and hackers share knowledge and information, and often report misbehaving IP addresses. So far, there have only been a few efforts to extract information from such security forums. We propose RIPEx, a systematic approach to identify and label IP addresses in security forums by utilizing a cross-forum learning method. In more detail, the challenge is twofold: (a) identifying IP addresses from other numerical entities, such as software version numbers, and (b) classifying the IP address as benign or malicious. We propose an integrated solution that tackles both these problems. A novelty of our approach is that it does not require training data for each new forum. Our approach does knowledge transfer across forums: we use a classifier from our source forums to identify seed information for training a classifier on the target forum. We evaluate our method using data collected from five security forums with a total of 31 K users and 542 K posts. First, RIPEx can distinguish IP address from other numeric expressions with 95% precision and above 93% recall on average. Second, RIPEx identifies malicious IP addresses with an average precision of 88% and over 78% recall, using our cross-forum learning. Our work is a first step towards harnessing the wealth of useful information that can be found in security forums.

Joobin Gharibshah, Evangelos E. Papalexakis, Michalis Faloutsos

Pattern-Mining Based Cryptanalysis of Bloom Filters for Privacy-Preserving Record Linkage

Data mining projects increasingly require records about individuals to be linked across databases to facilitate advanced analytics. The process of linking records without revealing any sensitive or confidential information about the entities represented by these records is known as privacy-preserving record linkage (PPRL). Bloom filters are a popular PPRL technique to encode sensitive information while still enabling approximate linking of records. However, Bloom filter encoding can be vulnerable to attacks that can re-identify some encoded values from sets of Bloom filters. Existing attacks exploit that certain Bloom filters can occur frequently in an encoded database, and thus likely correspond to frequent plain-text values such as common names. We present a novel attack method based on a maximal frequent itemset mining technique which identifies frequently co-occurring bit positions in a set of Bloom filters. Our attack can re-identify encoded sensitive values even when all Bloom filters in an encoded database are unique. As our experiments on a real-world data set show, our attack can successfully re-identify values from encoded Bloom filters even in scenarios where previous attacks fail.

Peter Christen, Anushka Vidanage, Thilina Ranbaduge, Rainer Schnell

A Privacy Preserving Bayesian Optimization with High Efficiency

Bayesian optimization is a powerful machine learning technique for solving experimental design problems. With its use in industrial design optimization, time and cost of industrial processes can be reduced significantly. However, often the experimenters in industries may not have the expertise of optimization techniques and may require help from third-party optimization services. This can cause privacy concerns as the optimized design of an industrial process typically needs to be kept secret to retain its competitive advantages. To this end, we propose a novel Bayesian optimization algorithm that can allow the experimenters from an industry to utilize the expertise of a third-party optimization service in privacy preserving manner. Privacy of our proposed algorithm is guaranteed under a modern privacy preserving framework called Error Preserving Privacy, especially designed to maintain high utility even under the privacy restrictions. Using several benchmark optimization problems as well as optimization problems from real-world industrial processes, we demonstrate that the optimization efficiency of our algorithm is comparable to the non-private Bayesian optimization algorithm and significantly better than its differential privacy counterpart.

Thanh Dai Nguyen, Sunil Gupta, Santu Rana, Svetha Venkatesh

Randomizing SVM Against Adversarial Attacks Under Uncertainty

Robust machine learning algorithms have been widely studied in adversarial environments where the adversary maliciously manipulates data samples to evade security systems. In this paper, we propose randomized SVMs against generalized adversarial attacks under uncertainty, through learning a classifier distribution rather than a single classifier in traditional robust SVMs. The randomized SVMs have advantages on better resistance against attacks while preserving high accuracy of classification, especially for non-separable cases. The experimental results demonstrate the effectiveness of our proposed models on defending against various attacks, including aggressive attacks with uncertainty.

Yan Chen, Wei Wang, Xiangliang Zhang

Recommendation and Data Factorization

Frontmatter

One for the Road: Recommending Male Street Attire

Growth of male fashion industry and escalating popularity of affordable street fashion wear has created a demand for the intervention of effective data analytics and recommender systems for male street wear. This motivated us to undertake extensive image collection of male subjects in casual wear and pose; assiduously annotate and carefully select discriminating features. We build up a classifier which predicts accurately the attractive quotient of an outfit. Further, we build a recommendation system - MalOutRec - which provides pointed recommendation of changing a part of the outfit in case the outfit looks unattractive (e.g. change the existing pair of trousers with a recommended one). We employ an innovative methodology that uses personalized pagerank in designing MalOutRec - experimental results show that it handsomely beats the metapath based baseline algorithm.

Debopriyo Banerjee, Niloy Ganguly, Shamik Sural, Krothapalli Sreenivasa Rao

Context-Aware Location Annotation on Mobility Records Through User Grouping

Due to the increasing popularity of location-based services, a massive volume of human mobility records have been generated. At the same time, the growing spatial context data provides us rich semantic information. Associating the mobility records with relevant surrounding contexts, known as the location annotation, enables us to understand the semantics of the mobility records and helps further tasks like advertising. However, the location annotation problem is challenging due to the ambiguity of contexts and the sparsity of personal data. To solve this problem, we propose a Context-Aware location annotation method through User Grouping (CAUG) to annotate locations with venues. This method leverages user grouping and venue categories to alleviate the data sparsity issue and annotates locations according to multi-view information (spatial, temporal and contextual) of multiple granularities. Through extensive experiments on a real-world dataset, we demonstrate that our method significantly outperforms other baseline methods.

Yong Zhang, Hua Wei, Xuelian Lin, Fei Wu, Zhenhui Li, Kaiheng Chen, Yuandong Wang, Jie Xu

A Joint Optimization Approach for Personalized Recommendation Diversification

In recommendation systems, items of interest are often classified into categories such as genres of movies. Existing research has shown that diversified recommendations can improve real user experience. However, most existing methods do not consider the fact that users’ levels of interest (i.e., user preferences) in different categories usually vary, and such user preferences are not reflected in the diversified recommendations. We propose an algorithm that considers user preferences for different categories when recommending diversified results, and refer to this problem as personalized recommendation diversification. In the proposed algorithm, a model that captures user preferences for different categories is optimized jointly toward both relevance and diversity. To provide the proposed algorithm with informative training labels and effectively evaluate recommendation diversity, we also propose a new personalized diversity measure. The proposed measure overcomes limitations of existing measures in evaluating recommendation diversity: existing measures either cannot effectively handle user preferences for different categories, or cannot evaluate both relevance and diversity at the same time. Experiments using two real-world datasets confirm the superiority of the proposed algorithm, and show the effectiveness of the proposed measure in capturing user preferences.

Xiaojie Wang, Jianzhong Qi, Kotagiri Ramamohanarao, Yu Sun, Bo Li, Rui Zhang

Personalized Item-of-Interest Recommendation on Storage Constrained Smartphone Based on Word Embedding Quantization

In recent years, word embedding models receive tremendous research attentions due to their capability of capturing textual semantics. This study investigates the issue of employing word embedding models into resource-limited smartphones for personalized item recommendation. The challenge lies in that the existing embedding models are often too large to fit into a resource-limited smartphones. One naive idea is to incorporate a secondary storage by residing the model in the secondary storage and processing recommendation with the secondary storage. However, this idea suffers from the burden of additional traffics. To this end, we propose a framework called Word Embedding Quantization (WEQ) that constructs an index upon a given word embedding model and stores the index on the primary storage to enable the use of the word embedding model on smartphones. One challenge for using the index is that the exact user profile is no longer ensured. However, we find that there are opportunities for computing the correct recommendation results by knowing only inexact user profile. In this paper, we propose a series of techniques that leverage the opportunities for computing candidates with the goal of minimizing the accessing cost to a secondary storage. Experiments are made to verify the efficiency of the proposed techniques, which demonstrates the feasibility of the proposed framework.

Si-Ying Huang, Yung-Yu Chen, Hung-Yuan Chen, Lun-Chi Chen, Yao-Chung Fan

Social Network, Ubiquitous Data and Graph Mining

Frontmatter

Topic-Specific Retweet Count Ranking for Weibo

In this paper, we study topic-specific retweet count ranking problem in Weibo. Two challenges make this task nontrivial. Firstly, traditional methods cannot derive effective feature for tweets, because in topic-specific setting, tweets usually have too many shared contents to distinguish them. We propose a LSTM-embedded autoencoder to generate tweet features with the insight that any different prefixes of tweet text is a possible distinctive feature. Secondly, it is critical to fully catch the meaning of topic in topic-specific setting, but Weibo can provide little information about topic. We leverage real-time news information from Toutiao to enrich the meaning of topic, as more than 85% topics are headline news. We evaluate the proposed components based on ablation methods, and compare the overall solution with a recently-proposed tensor factorization model. Extensive experiments on real Weibo data show the effectiveness and flexibility of our methods.

Hangyu Mao, Yang Xiao, Yuan Wang, Jiakang Wang, Zhen Xiao

Motif-Aware Diffusion Network Inference

Characterizing and understanding information diffusion over social networks play an important role in various real-world applications. In many scenarios, however, only the states of nodes can be observed while the underlying diffusion networks are unknown. Many methods have therefore been proposed to infer the underlying networks based on node observations. To enhance the inference performance, structural priors of the networks, such as sparsity, scale-free, and community structures, are often incorporated into the learning procedure. As the building blocks of networks, network motifs occur frequently in many social networks, and play an essential role in describing the network structures and functionalities. However, to the best of our knowledge, no existing work exploits this kind of structural primitives in diffusion network inference. In order to address this unexplored yet important issue, in this paper, we propose a novel framework called Motif-Aware Diffusion Network Inference (MADNI), which aims to mine the motif profile from the node observations and infer the underlying network based on the mined motif profile. The mined motif profile and the inferred network are alternately refined until the learning procedure converges. Extensive experiments on both synthetic and real-world datasets validate the effectiveness of the proposed framework.

Qi Tan, Yang Liu, Jiming Liu

Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams

Given a graph stream, how can we estimate the number of triangles in it using multiple machines with limited storage?Counting triangles (i.e., cycles of length three) is a classical graph problem whose importance has been recognized in diverse fields, including data mining, social network analysis, and databases. Recently, for triangle counting in massive graphs, two approaches have been intensively studied. One approach is streaming algorithms, which estimate the count of triangles incrementally in time-evolving graphs or in large graphs only part of which can be stored. The other approach is distributed algorithms for utilizing computational power and storage of multiple machines.Can we have the best of both worlds? We propose Tri-Fly, the first distributed streaming algorithm for approximate triangle counting. Making one pass over a graph stream, Tri-Fly rapidly and accurately estimates the counts of global triangles and local triangles incident to each node. Compared to state-of-the-art single-machine streaming algorithms, Tri-Fly is (a) Accurate: yields up to 4.5$$\times $$×smaller estimation error, (b) Fast: runs up to 8.8$$\times $$×faster with linear scalability, and (c) Theoretically sound: gives unbiased estimates with smaller variances.

Kijung Shin, Mohammad Hammoud, Euiwoong Lee, Jinoh Oh, Christos Faloutsos

WFSM-MaxPWS: An Efficient Approach for Mining Weighted Frequent Subgraphs from Edge-Weighted Graph Databases

Weighted frequent subgraph mining comes with an inherent challenge—namely, weighted support does not support the downward closure property, which is often used in mining algorithms for reducing the search space. Although this challenge attracted attention from several researchers, most existing works in this field use either affinity based pruning or alternative anti-monotonic weighting technique for subgraphs other than average edge-weight. In this paper, we propose an efficient weighted frequent subgraph mining algorithm called WFSM-MaxPWS. Our algorithm uses the MaxPWS pruning technique, which significantly reduces search space without changing subgraph weighting scheme while ensuring completeness. Our evaluation results on three different graph datasets with two different weight distributions (normal and negative exponential) showed that our WFSM-MaxPWS algorithm led to significant runtime improvement over the existing MaxW pruning technique (which is a concept for weighted pattern mining in computing subgraph weight by taking average of edge weights).

Md. Ashraful Islam, Chowdhury Farhan Ahmed, Carson K. Leung, Calvin S. H. Hoi

A Game-Theoretic Adversarial Approach to Dynamic Network Prediction

Predicting the evolution of a dynamic network—the addition of new edges and the removal of existing edges—is challenging. In part, this is because: (1) networks are often noisy; (2) various performance measures emphasize different aspects of prediction; and (3) it is not clear which network features are useful for prediction. To address these challenges, we develop a novel framework for robust dynamic network prediction using an adversarial formulation that leverages both edge-based and global network features to make predictions. We conduct experiments on five distinct dynamic network datasets to show the superiority of our approach compared to state-of-the-art methods.

Jia Li, Brian Ziebart, Tanya Berger-Wolf

Targeted Influence Minimization in Social Networks

An online social network can be used for the diffusion of malicious information like derogatory rumors, disinformation, hate speech, revenge pornography, etc. This motivates the study of influence minimization that aim to prevent the spread of malicious information. Unlike previous influence minimization work, this study considers the influence minimization in relation to a particular group of social network users, called targeted influence minimization. Thus, the objective is to protect a set of users, called target nodes, from malicious information originating from another set of users, called active nodes. This study also addresses two fundamental, but largely ignored, issues in different influence minimization problems: (i) the impact of a budget on the solution; (ii) robust sampling. To this end, two scenarios are investigated, namely unconstrained and constrained budget. Given an unconstrained budget, we provide an optimal solution; Given a constrained budget, we show the problem is NP-hard and develop a greedy algorithm with an $$(1-1/e)$$(1-1/e)-approximation. More importantly, in order to solve the influence minimization problem in large, real-world social networks, we propose a robust sampling-based solution with a desirable theoretic bound. Extensive experiments using real social network datasets offer insight into the effectiveness and efficiency of the proposed solutions.

Xinjue Wang, Ke Deng, Jianxin Li, Jeffery Xu Yu, Christian S. Jensen, Xiaochun Yang

Maximizing Social Influence on Target Users

Influence maximization has attracted a considerable amount of research work due to the explosive growth in online social networks. Existing studies of influence maximization on social networks aim at deriving a set of users (referred to as seed users) in a social network to maximize the expected number of users influenced by those seed users. However, in some scenarios, such as election campaigns and target audience marketing, the requirement of the influence maximization is to influence a set of specific users. This set of users is defined as the target set of users. In this paper, given a target set of users, we study the Target Influence Maximization (TIM) problem with the purpose of maximizing the number of users within the target set. We particularly focus on two important issues: (1) how to capture the social influence among users, and (2) how to develop an efficient scheme that offers wide influence spread on specified subsets. Experiment results on real-world datasets validate the performance of the solution for TIM using our proposed approaches.

Yu-Ting Wen, Wen-Chih Peng, Hong-Han Shuai

Team Expansion in Collaborative Environments

In this paper, we study the team expansion problem in collaborative environments where people collaborate with each other in the form of a team, which might need to be expanded frequently by having additional team members during the course of the project. Intuitively, there are three factors as well as the interactions between them that have a profound impact on the performance of the expanded team, including (1) the task the team is performing, (2) the existing team members, and (3) the new candidate team member. However, the vast majority of the existing work either considers these factors separately, or even ignores some of these factors. In this paper, we propose a neural network based approach TECE to simultaneously model the interactions between the team task, the team members as well as the candidate team members. Experimental evaluations on real-world datasets demonstrate the effectiveness of the proposed approach.

Lun Zhao, Yuan Yao, Guibing Guo, Hanghang Tong, Feng Xu, Jian Lu

HashAlign: Hash-Based Alignment of Multiple Graphs

Fusing or aligning two or more networks is a fundamental building block of many graph mining tasks (e.g., recommendation systems, link prediction, collective analysis of networks). Most past work has focused on formulating pairwise graph alignment as an optimization problem with varying constraints and relaxations. In this paper, we study the problem of multiple graph alignment (collectively aligning multiple graphs at once) and propose HashAlign, an efficient and intuitive hash-based framework for network alignment that leverages structural properties and other node and edge attributes (if available) simultaneously. We introduce a new construction of LSH families, as well as robust node and graph features that are tailored for this task. Our method quickly aligns multiple graphs while avoiding the all-pairwise-comparison problem by expressing all alignments in terms of a chosen ‘center’ graph. Our extensive experiments on synthetic and real networks show that, on average, HashAlign is $$2{\times }$$2× faster and 10 to 20% more accurate than the baselines in pairwise alignment, and $$2{\times }$$2× faster while 50% more accurate in multiple graph alignment.

Mark Heimann, Wei Lee, Shengjie Pan, Kuan-Yu Chen, Danai Koutra

Evaluating and Analyzing Reliability over Decentralized and Complex Networks

In an increasingly interconnected and distributed world, the ability to ensure communications becomes pivotal in day-to-day operations. Given a network whose edges are prone to failures and disruptions, reliability captures the probability that traffic will reach a target location by traversing edges starting from a given source. This paper investigates reliability in decentralized and complex networks. To evaluate reliability, we introduce a multi-agent method that involves pathfinding agents to reduce the graph. Performance of this method is tested on scale-free and small-world networks as well as real-world spatial networks. We also investigate reliability score which aims to rank the capability of nodes in terms of traffic dissemination traffic across all nodes. Analysis over spatial networks indicates that the reliability score correlates with central and sub-central regions in a geographical region.

Jaron Mar, Jiamou Liu, Yanni Tang, Wu Chen, Tianyi Sun

Efficient Exact and Approximate Algorithms for Computing Betweenness Centrality in Directed Graphs

In this paper, first given a directed network G and a vertex $$r \in V(G)$$r∈V(G), we propose a new exact algorithm to compute betweenness score of r. Our algorithm pre-computes a set $$\mathcal {RF}(r)$$RF(r), which is used to prune a huge amount of computations that do not contribute in the betweenness score of r. Then, for the cases where $$\mathcal {RF}(r)$$RF(r) is large, we present a randomized algorithm that samples from $$\mathcal {RF}(r)$$RF(r) and performs computations for only the sampled elements. We show that this algorithm provides an $$(\epsilon ,\delta )$$(ϵ,δ)-approximation of the betweenness score of r. Finally, we empirically evaluate our algorithms and show that they significantly outperform the most efficient existing algorithms, in terms of both running time and accuracy. Our experiments also show that our proposed algorithms can effectively compute betweenness scores of all vertices in a set of vertices.

Mostafa Haghir Chehreghani, Albert Bifet, Talel Abdessalem

Forecasting Bitcoin Price with Graph Chainlets

Over the last couple of years, Bitcoin cryptocurrency and the Blockchain technology that forms the basis of Bitcoin have witnessed a flood of attention. In contrast to fiat currencies used worldwide, the Bitcoin distributed ledger is publicly available by design. This facilitates observing all financial interactions on the network, and analyzing how the network evolves in time. We introduce a novel concept of chainlets, or Bitcoin subgraphs, which allows us to evaluate the local topological structure of the Bitcoin graph over time. Furthermore, we assess the role of chainlets on Bitcoin price formation and dynamics. We investigate the predictive Granger causality of chainlets and identify certain types of chainlets that exhibit the highest predictive influence on Bitcoin price and investment risk.

Cuneyt G. Akcora, Asim Kumer Dey, Yulia R. Gel, Murat Kantarcioglu

Information Propagation Trees for Protest Event Prediction

Protest event prediction using information propagation from social media is an important but challenging problem. Despite the plethora of research, the implicit relationship between social media information propagation and real-world protest events is unknown. Given some information propagating on social media, how can we tell if a protest event will occur? What features of information propagation are useful and how do these features contribute to a pending protest event? In this paper, we address these questions by presenting a novel formalized propagation tree model that captures relevant protest information propagating as precursors to protest events. We present a viewpoint of information propagation as trees which captures both temporal and structural aspects of information propagation. We construct and extract structural and temporal features daily from propagation trees. We develop a matching scheme that maps daily feature values to protest events. Finally, we build a robust prediction model that leverages propagation tree features for protest event prediction. Extensive experiments conducted on Twitter datasets across states in Australia show that our model outperforms existing state-of-the-art prediction models with an accuracy of up to 89% and F1-score of 0.84. We also provide insights on the interpretability of our features to real-world protest events.

Jeffery Ansah, Wei Kang, Lin Liu, Jixue Liu, Jiuyong Li

Predictive Team Formation Analysis via Feature Representation Learning on Social Networks

Team formation is to find a group of experts covering required skills and well collaborating together. Existing studies suffer from two defects: cannot afford flexible designation of team members and do not consider whether the formed team is truly adopted in practice. In this paper, we propose the Predictive Team Formation (PTF) problem. PTF provides the flexibility of designated members and delivers the prediction-based formulation to compose the team. We propose two methods by learning the feature representations of experts based on node2vec [4]. One is Biased-n2v that models the topic bias of each expert in the social network. The other is Guided-n2v that refines the transition probabilities between skills and experts to guide the random walk in a heterogeneous graph of expert-expert, expert-skill, and skill-skill. Experiments conducted on DBLP and IMDb datasets exhibit that our methods can significantly outperform the state-of-the-art optimization-based approaches in terms of prediction recall. We also reveal that the designated members with tight social connections can lead to better performance.

Lo Pang-Yun Ting, Cheng-Te Li, Kun-Ta Chuang

Leveraging Local Interactions for Geolocating Social Media Users

Predicting the geolocation of social media users is one of the core tasks in many applications, such as rapid disaster response, targeted advertisement, and recommending local events. In this paper, we introduce a new approach for user geolocation that unifies users’ social relationships, textual content, and metadata. Our two key contributions are as follows: (1) We leverage semantic similarity between users’ posts to predict their geographic proximity. To achieve this, we train and utilize a powerful word embedding model over millions of tweets. (2) To deal with isolated users in the social graph, we utilize a stacking-based learning approach to predict users’ locations based on their tweets’ textual content and metadata. Evaluation on three standard Twitter benchmark datasets shows that our approach outperforms state-of-the-art user geolocation methods.

Mohammad Ebrahimi, Elaheh ShafieiBavani, Raymond Wong, Fang Chen

Utilizing Sequences of Touch Gestures for User Verification on Mobile Devices

Smartphones have become ubiquitous in our daily lives; they are used for a wide range of tasks and store increasing amounts of personal data. To minimize risk and prevent misuse of this data by unauthorized users, access must be restricted to verified users. Current classification-based methods for gesture-based user verification only consider single gestures, and not sequences. In this paper, we present a method which utilizes information from sequences of touchscreen gestures, and the context in which the gestures were made using only basic touch features. To evaluate our approach, we built an application which records all the necessary data from the device (touch and contextual sensors which do not consume significant battery life). Using XGBoost on the collected data, we were able to classify between a legitimate user and the population of illegitimate users (imposters) with an average equal error rate (EER) of 4.78% and an average area under the curve (AUC) of 98.15%. Our method demonstrates that by considering only basic touch features and utilizing sequences of gestures, as opposed to individual gestures, the accuracy of the verification process improves significantly.

Liron Ben Kimon, Yisroel Mirsky, Lior Rokach, Bracha Shapira

Backmatter

Title: Advances in Knowledge Discovery and Data Mining
Editors: Dinh Phung
Vincent S. Tseng
Prof. Geoffrey I. Webb
Bao Ho
Mohadeseh Ganji
Lida Rashidi
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-93040-4
Print ISBN: 978-3-319-93039-8
DOI: https://doi.org/10.1007/978-3-319-93040-4

Springer Professional

About this book

Table of Contents

Frontmatter

Feature Learning and Data Mining Process

Frontmatter

Discovering High Utility Itemsets Based on the Artificial Bee Colony Algorithm

A Scalable and Efficient Subgroup Blocking Scheme for Multidatabase Record Linkage

Efficient Feature Selection Framework for Digital Marketing Applications

Dynamic Feature Selection Algorithm Based on Minimum Vertex Cover of Hypergraph

Feature Selection for Multiclass Binary Data

Scalable Model-Based Cascaded Imputation of Missing Data

On Reducing Dimensionality of Labeled Data Efficiently

Using Metric Space Indexing for Complete and Efficient Record Linkage

Dimensionality Reduction via Community Detection in Small Sample Datasets

An Interaction-Enhanced Feature Selection Algorithm

An Extended Random-Sets Model for Fusion-Based Text Feature Selection

Attribute Reduction Algorithm Based on Improved Information Gain Rate and Ant Colony Optimization

Efficient Approximate Algorithms for the Closest Pair Problem in High Dimensional Spaces

Efficient Compression Technique for Sparse Sets

It Pays to Be Certain: Unsupervised Record Linkage via Ambiguity Minimization

Community Detection and Network Science

Frontmatter

Consensus Community Detection in Multilayer Networks Using Parameter-Free Graph Pruning

Community Discovery Based on Social Relations and Temporal-Spatial Topics in LBSNs

A Unified Weakly Supervised Framework for Community Detection and Semantic Matching

Tapping Community Memberships and Devising a Novel Homophily Modeling Approach for Trust Prediction

Deep Learning Theory and Applications in KDD

Frontmatter

Text-Visualizing Neural Network Model: Understanding Online Financial Textual Data

MIDA: Multiple Imputation Using Denoising Autoencoders

Dual Control Memory Augmented Neural Networks for Treatment Recommendations

Denoising Time Series Data Using Asymmetric Generative Adversarial Networks

Shared Deep Kernel Learning for Dimensionality Reduction

CDSSD: Refreshing Single Shot Object Detection Using a Conv-Deconv Network

Binary Classification of Sequences Possessing Unilateral Common Factor with AMS and APR

Automating Reading Comprehension by Generating Question and Answer Pairs

Emotion Classification with Data Augmentation Using Generative Adversarial Networks

Trans2Vec: Learning Transaction Embedding via Items and Frequent Itemsets

Detecting Complex Sensitive Information via Phrase Structure in Recursive Neural Networks

Clustering and Unsupervised Learning

Frontmatter

A Distance Scaling Method to Improve Density-Based Clustering

Neighbourhood Contrast: A Better Means to Detect Clusters Than Density

Clustering of Multiple Density Peaks

A New Local Density for Density Peak Clustering

An Efficient Ranking-Centered Density-Based Document Clustering Method

Fast Manifold Landmarking Using Locality-Sensitive Hashing

Equitable Conceptual Clustering Using OWA Operator

Unsupervised Extremely Randomized Trees

Local Graph Clustering by Multi-network Random Walk with Restart

Scalable Approximation Algorithm for Graph Summarization

Privacy-Preserving and Security

Frontmatter

RIPEx: Extracting Malicious IP Addresses from Security Forums Using Cross-Forum Learning

Pattern-Mining Based Cryptanalysis of Bloom Filters for Privacy-Preserving Record Linkage

A Privacy Preserving Bayesian Optimization with High Efficiency

Randomizing SVM Against Adversarial Attacks Under Uncertainty

Recommendation and Data Factorization

Frontmatter

One for the Road: Recommending Male Street Attire

Context-Aware Location Annotation on Mobility Records Through User Grouping

A Joint Optimization Approach for Personalized Recommendation Diversification

Personalized Item-of-Interest Recommendation on Storage Constrained Smartphone Based on Word Embedding Quantization

Social Network, Ubiquitous Data and Graph Mining

Frontmatter

Topic-Specific Retweet Count Ranking for Weibo

Motif-Aware Diffusion Network Inference

Tri-Fly: Distributed Estimation of Global and Local Triangle Counts in Graph Streams

WFSM-MaxPWS: An Efficient Approach for Mining Weighted Frequent Subgraphs from Edge-Weighted Graph Databases

A Game-Theoretic Adversarial Approach to Dynamic Network Prediction

Targeted Influence Minimization in Social Networks

Maximizing Social Influence on Target Users

Team Expansion in Collaborative Environments

HashAlign: Hash-Based Alignment of Multiple Graphs

Evaluating and Analyzing Reliability over Decentralized and Complex Networks

Efficient Exact and Approximate Algorithms for Computing Betweenness Centrality in Directed Graphs

Forecasting Bitcoin Price with Graph Chainlets

Information Propagation Trees for Protest Event Prediction

Predictive Team Formation Analysis via Feature Representation Learning on Social Networks