
2016 | Book

Advanced Data Mining and Applications

12th International Conference, ADMA 2016, Gold Coast, QLD, Australia, December 12-15, 2016, Proceedings


About this book

This book constitutes the proceedings of the 12th International Conference on Advanced Data Mining and Applications, ADMA 2016, held in Gold Coast, Australia, in December 2016.

The 70 papers presented in this volume were carefully reviewed and selected from 105 submissions. The selected papers covered a wide variety of important topics in the area of data mining, including parallel and distributed data mining algorithms, mining on data streams, graph mining, spatial data mining, multimedia data mining, Web mining, the Internet of Things, health informatics, and biomedical data mining.

Table of Contents

Frontmatter
Erratum to: Sentiment Analysis for Depression Detection on Social Networks
Jinyan Li, Xue Li, Shuliang Wang, Jianxin Li, Quan Z. Sheng

Spotlight Research Papers

Frontmatter
Effective Monotone Knowledge Integration in Kernel Support Vector Machines

In many machine learning applications there exists prior knowledge that the response variable should be increasing (or decreasing) in one or more of the features. This is the knowledge of ‘monotone’ relationships. This paper presents two new techniques for incorporating monotone knowledge into non-linear kernel support vector machine classifiers. Incorporating monotone knowledge is useful because it can improve predictive performance and satisfy user requirements. While this is relatively straightforward for linear margin classifiers, for kernel SVM it is more challenging to achieve efficiently. We apply the new techniques to real datasets and investigate the impact of monotonicity and sample size on predictive accuracy. The results show that the proposed techniques can significantly improve accuracy when the unconstrained model is not already fully monotone, which often occurs at smaller sample sizes. In contrast, existing techniques demonstrate a significantly lower capacity to increase monotonicity or achieve the resulting accuracy improvements.

Christopher Bartley, Wei Liu, Mark Reynolds
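The paper's constrained kernel SVM formulation is not reproduced in the abstract, but the notion of a model being "fully monotone" can be made concrete: count the fraction of comparable point pairs whose predictions respect the assumed increasing relationship. A minimal sketch (the function name and pair definition are our own illustration, not the authors' measure):

```python
def monotone_pair_fraction(X, preds, feature):
    """Fraction of comparable pairs whose predictions respect an
    increasing relationship in `feature` (all other features equal)."""
    ok = total = 0
    for i in range(len(X)):
        for j in range(len(X)):
            if i == j:
                continue
            # comparable: identical except a strictly larger value in `feature`
            if X[i][feature] < X[j][feature] and all(
                X[i][k] == X[j][k] for k in range(len(X[i])) if k != feature
            ):
                total += 1
                if preds[i] <= preds[j]:
                    ok += 1
    return ok / total if total else 1.0
```

A value below 1.0 flags an unconstrained model that violates the prior knowledge, which is where the paper reports the largest accuracy gains.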
Textual Cues for Online Depression in Community and Personal Settings

Depression is often associated with poor social skills. The Internet allows individuals who are depressed to connect with others via online communities, helping them to address this social skill deficit. While the difficulty of collecting data in traditional studies raises the bar for investigating the cues of depression, the user-generated media left by depression sufferers on social media enable us to learn more about depression signs. Previous studies examined the traces left in the posts of online depression communities in comparison with other online communities. This work further investigates whether the content that members of the depression community contribute to the community blogs differs from what they post in their own personal blogs. The answer to this question would help to improve the performance of online depression screening across different blogging settings. The content made in the two settings was compared on three textual features: affective information, topics, and language styles. Machine learning and statistical methods were used to discriminate the blog content. All three features were found to be significantly different between depression community and personal blogs. Notably, topic and language style features, either separately or jointly used, show strong indicative power in the prediction of depression blogs in personal or community settings, illustrating the potential of using content-based multi-cues for early screening of online depression communities and individuals.

Thin Nguyen, Svetha Venkatesh, Dinh Phung
Confidence-Weighted Bipartite Ranking

Bipartite ranking is a fundamental machine learning and data mining problem. It commonly concerns the maximization of the AUC metric. Recently, a number of studies have proposed online bipartite ranking algorithms to learn from massive streams of class-imbalanced data. These methods include both linear and kernel-based bipartite ranking algorithms based on first- and second-order online learning. Unlike kernelized rankers, linear rankers are more scalable learning algorithms. The existing linear online bipartite ranking algorithms fail either to handle non-separable data or to construct an adaptive large margin. These limitations yield unreliable bipartite ranking performance. In this work, we propose a linear online confidence-weighted bipartite ranking algorithm (CBR) that adopts soft confidence-weighted learning. The proposed algorithm leverages the properties of soft confidence-weighted learning in a framework for bipartite ranking. We also develop a diagonal variant of the proposed confidence-weighted bipartite ranking algorithm to deal with high-dimensional data by maintaining only the diagonal elements of the covariance matrix. We empirically evaluate the effectiveness of the proposed algorithms on several benchmark and high-dimensional datasets. The experimental results validate the reliability of the proposed algorithms. The results also show that our algorithms outperform, or are at least comparable to, the competing online AUC maximization methods.

Majdi Khalid, Indrakshi Ray, Hamidreza Chitsaz
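The AUC objective these online rankers maximize is simply the probability that a randomly drawn positive instance is scored above a randomly drawn negative one (ties counting one half). A brute-force reference implementation of that quantity (not the authors' online algorithm, which never materializes all pairs):

```python
def auc(scores_pos, scores_neg):
    """AUC = probability a random positive is ranked above a random
    negative; ties count one half."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))
```

Online methods approximate this pairwise objective incrementally, one streamed instance at a time, which is what makes them suitable for massive class-imbalanced streams.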
Mining Distinguishing Customer Focus Sets for Online Shopping Decision Support

With the development of e-commerce, online shopping has become increasingly popular. Very often, online shopping customers read reviews written by other customers to compare similar items. However, the number of customer reviews is typically too large to look through in a reasonable amount of time. To extract information that can be used for online shopping decision support, this paper investigates a novel data mining problem: mining distinguishing customer focus sets from customer reviews. We demonstrate that this problem has many applications and, at the same time, is challenging. We present dFocus-Miner, a mining method with various techniques that make the mined results interpretable and user-friendly. Our experimental results on real-world data sets verify the effectiveness and efficiency of our method.

Lu Liu, Lei Duan, Hao Yang, Jyrki Nummenmaa, Guozhu Dong, Pan Qin
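dFocus-Miner itself is not specified in the abstract, but the flavour of a "distinguishing focus" can be illustrated by extracting terms that are disproportionately frequent in one item's reviews relative to another's. A toy sketch (the ratio threshold and add-one smoothing are our own assumptions):

```python
from collections import Counter

def distinguishing_terms(target_docs, background_docs, min_ratio=2.0):
    """Terms whose relative frequency in the target review set is at
    least `min_ratio` times their frequency in the background set."""
    tgt = Counter(w for d in target_docs for w in d.split())
    bg = Counter(w for d in background_docs for w in d.split())
    t_total = sum(tgt.values()) or 1
    b_total = sum(bg.values()) or 1
    out = []
    for w, c in tgt.items():
        # add-one smoothing so unseen background terms don't divide by zero
        ratio = (c / t_total) / ((bg[w] + 1) / (b_total + 1))
        if ratio >= min_ratio:
            out.append(w)
    return sorted(out)
```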
Community Detection in Networks with Less Significant Community Structure

Label propagation is a low-complexity approach to community detection in complex networks. Research has extended the basic label propagation algorithm (LPA) in multiple directions, including maximizing the modularity, a well-known quality function for evaluating the goodness of a community division, of the detected communities. The current state-of-the-art modularity-specialized label propagation algorithm (LPAm+) maximizes modularity using a two-stage iterative procedure: the first stage assigns labels to nodes using label propagation; the second stage merges smaller communities to further improve modularity. LPAm+ has been shown to achieve excellent performance on networks with significant community structure, where the network modularity is above a certain threshold. However, we show in this paper that for networks with less significant community structure, LPAm+ tends to get trapped in poor local optima. The main reason is that the first stage of LPAm+ often misplaces node labels and severely hinders the merging operation in the second stage. We overcome this drawback of LPAm+ by correcting the node labels after the first stage. We apply a label propagation procedure inspired by the meta-heuristic Record-to-Record Travel algorithm that reassigns node labels to improve modularity before merging communities. Experimental results show that the proposed algorithm, named meta-LPAm+, outperforms LPAm+ in terms of modularity on networks with less significant community structure while retaining almost the same performance on networks with significant community structure.

Ba-Dung Le, Hung Nguyen, Hong Shen
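The basic LPA that LPAm+ and meta-LPAm+ build on is easy to state: every node starts in its own community and repeatedly adopts the most common label among its neighbours. A minimal sketch of that first stage only (the modularity maximization, merging, and Record-to-Record correction steps of the paper are omitted; the smallest-label tie-break is our own choice):

```python
def label_propagation(adj, max_iter=20):
    """Synchronous-order label propagation: each node adopts the most
    common label among its neighbours; ties broken by smallest label."""
    labels = {v: v for v in adj}
    for _ in range(max_iter):
        changed = False
        for v in sorted(adj):
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            if not counts:
                continue
            best = min(l for l in counts if counts[l] == max(counts.values()))
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:
            break
    return labels
```

On a graph of two disconnected triangles this converges to exactly two communities, one per triangle.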
Prediction-Based, Prioritized Market-Share Insight Extraction

We present an approach for Business Intelligence (BI), where market share changes are tracked, evaluated, and prioritized dynamically and interactively. Out of all the hundreds or thousands of possible combinations of sub-markets and players, the system brings to the user those combinations where the most significant changes have happened, grouped into related insights. Time-series prediction and user interaction enable the system to learn what “significant” means to the user, and adapt the results accordingly. The proposed approach captures key insights that are missed by current top-down aggregative BI systems, and that are hard for humans to spot (e.g., Cisco’s US market disruption in 2010).

Renato Keshet, Alina Maor, George Kour
Interrelationships of Service Orchestrations

Although topic models have been successfully used to reveal hidden orchestration patterns from service logs, the potential uses of their interrelationships have yet to be explored. In particular, the popularity of an orchestration pattern is a leading indicator of other orchestrations in many situations. Indeed, research on capturing relationships by induced networks has been active in some areas, such as spatial problems. In this paper, we propose a structure discovery process to reveal relationship networks among service orchestrations. In practice, more robust business logic can be formulated by having a good understanding of these relationships, leading to efficiency gains. Our proposed interrelationship discovery process is performed by a set of optimizations with adaptive regularization. These features make our proposed solution efficient and self-adjusting to the dynamics of service environments. The results from our extensive experiments on service consumption logs confirm the effectiveness of our proposed solution.

Victor W. Chu, Raymond K. Wong, Fang Chen, Chi-Hung Chi
Outlier Detection on Mixed-Type Data: An Energy-Based Approach

Outlier detection amounts to finding data points that differ significantly from the norm. Classic outlier detection methods are largely designed for a single data type, such as continuous or discrete. However, real-world data is increasingly heterogeneous, where a data point can have both discrete and continuous attributes. Handling mixed-type data in a disciplined way remains a great challenge. In this paper, we propose a new unsupervised outlier detection method for mixed-type data based on the Mixed-variate Restricted Boltzmann Machine (Mv.RBM). The Mv.RBM is a principled probabilistic method that models data density. We propose to use the free energy derived from Mv.RBM as an outlier score to detect outliers as those data points lying in low-density regions. The method is fast to learn and compute, and is scalable to massive datasets. At the same time, the outlier score is identical to the negative log-density of the data up to an additive constant. We evaluate the proposed method on synthetic and real-world datasets and demonstrate that (a) proper handling of mixed types is necessary in outlier detection, and (b) the free energy of Mv.RBM is a powerful and efficient outlier scoring method, which is highly competitive against the state of the art.

Kien Do, Truyen Tran, Dinh Phung, Svetha Venkatesh
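For a standard binary RBM, the free energy used as the outlier score has the closed form F(v) = −a·v − Σⱼ log(1 + exp(bⱼ + Wⱼ·v)); higher free energy corresponds to lower density, i.e., a more outlying point. A sketch of that binary case only (Mv.RBM generalizes the visible units to mixed types, which is not shown here):

```python
import math

def free_energy(v, a, b, W):
    """Free energy of visible vector v in a binary RBM with visible
    bias a, hidden bias b, and weights W[j] = weights of hidden unit j.
    Higher free energy = lower model density = more outlying."""
    va = sum(vi * ai for vi, ai in zip(v, a))
    hidden = sum(
        math.log1p(math.exp(bj + sum(wji * vi for wji, vi in zip(Wj, v))))
        for bj, Wj in zip(b, W)
    )
    return -va - hidden
```

Ranking points by this score needs no partition function, which is why it is fast to compute once the RBM is trained.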
Low-Rank Feature Reduction and Sample Selection for Multi-output Regression

There are always varieties of inherent relational structures in the observations, which are crucial for performing multi-output regression on high-dimensional data. Therefore, this paper proposes a new multi-output regression method that simultaneously takes into account three kinds of relational structures, i.e., the relationships between output and output, feature and output, and sample and sample. Specifically, the paper seeks the correlation of output variables by using a low-rank constraint, finds the correlation between features and outputs by imposing an ℓ2,1-norm regularization on the coefficient matrix to conduct feature selection, and discovers the correlation of samples by applying the ℓ2,1-norm to the loss function to conduct sample selection. Furthermore, an effective iterative optimization algorithm is proposed to solve the convex but non-smooth objective function. Finally, experimental results on many real datasets showed that the proposed method outperforms all comparison algorithms in terms of aCC and aRMSE.

Shichao Zhang, Lifeng Yang, Yonggang Li, Yan Luo, Xiaofeng Zhu
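The ℓ2,1-norm terms act on whole rows of the coefficient (or loss) matrix, which is why they perform feature and sample selection: in proximal-style iterative solvers, rows with small Euclidean norm are shrunk to exactly zero. A sketch of that row-wise shrinkage step alone (the paper's full algorithm also handles the low-rank constraint, omitted here):

```python
import math

def prox_l21(M, t):
    """Proximal operator of t * ||M||_{2,1}: shrink each row of M toward
    zero by t in Euclidean norm; rows with norm below t are zeroed out,
    which is what drives row-sparsity (feature/sample selection)."""
    out = []
    for row in M:
        norm = math.sqrt(sum(x * x for x in row))
        scale = max(0.0, 1.0 - t / norm) if norm > 0 else 0.0
        out.append([scale * x for x in row])
    return out
```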
Biologically Inspired Pattern Recognition for E-nose Sensors

The high sensitivity, stability, selectivity, and adaptivity of the mammalian olfactory system is a result of a large number of olfactory receptors feeding into extensive layers of neural processing units. Olfactory receptor cells (ORC) contribute significantly to the sense of smell. Bloodhounds have four billion ORC, making them ideal for tracking, while humans have about 30 million ORC. E-nose stability, sensitivity, and selectivity have long been a challenge. We hypothesize that appropriate signal processing with an increased number of sensory receptors can significantly improve odour recognition in e-noses. Adding physical receptors to an e-nose is costly and can increase system complexity. Therefore, we propose an Artificial Olfactory Receptor Cells Model (AORCM), inspired by neural circuits of the vertebrate olfactory system, to improve e-nose performance. Secondly, we introduce an adaptation layer to cope with drift and unknown changes. The major layers in our model are the sensory transduction layer, the sensory adaptation layer, the artificial olfactory receptors layer (AORL), and the artificial olfactory cortex layer (AOCL). Each layer in the proposed system is biologically inspired by the mammalian olfactory system. The experiments are executed using chemo-sensory array data generated over a three-year period. The proposed model resulted in better performance and stability compared to other models. To our knowledge, e-nose stability, selectivity, and sensitivity are still unsolved problems. Our paper provides a new approach to improving e-nose pattern recognition over long periods of time.

Sanad Al-Maskari, Wenping Guo, Xiaoming Zhao
Addressing Class Imbalance and Cost Sensitivity in Software Defect Prediction by Combining Domain Costs and Balancing Costs

Effective methods for the identification of software defects help minimize the business costs of software development. Classification methods can be used to perform software defect prediction. When cost-sensitive methods are used, the predictions are optimized for business cost. The data sets used as input for these methods typically suffer from the class imbalance problem. That is, there are many more defect-free code examples than defective code examples to learn from. This negatively impacts the classifier’s ability to correctly predict defective code examples. Cost-sensitive classification can also be used to mitigate the effects of the class imbalance problem by setting the costs to reflect the level of imbalance in the training data set. Through an experimental process, we have developed a method for combining these two different types of costs. We demonstrate that by using our proposed approach, we can produce more cost-effective predictions than several recent cost-sensitive methods used for software defect prediction. Furthermore, we examine the software defect prediction models built by our method and present the discovered insights.

Michael J. Siers, Md Zahidul Islam
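The two cost types in the abstract can be combined in many ways; one simple illustration (our own rule, not necessarily the authors' experimentally derived combination) scales the domain false-negative cost by the class imbalance ratio and then labels each module by minimum expected cost:

```python
def combined_costs(domain_fp, domain_fn, imbalance_ratio):
    """Scale the false-negative (missed defect) cost by the
    majority/minority ratio so the balancing cost counters class
    imbalance -- one simple combination rule, assumed for illustration."""
    return domain_fp, domain_fn * imbalance_ratio

def cost_sensitive_label(p_defect, c_fp, c_fn):
    """Predict 'defective' when the expected cost of saying 'clean'
    exceeds that of saying 'defective': p * c_fn > (1 - p) * c_fp."""
    return "defective" if p_defect * c_fn > (1 - p_defect) * c_fp else "clean"
```

Note how raising the false-negative cost lowers the probability threshold at which a module is flagged, which is exactly how cost sensitivity counteracts the scarcity of defective examples.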
Unsupervised Hypergraph Feature Selection with Low-Rank and Self-Representation Constraints

Unsupervised feature selection is designed to select a subset of informative features from unlabeled data to avoid the issue of the ‘curse of dimensionality’, thus achieving efficient computation and storage. In this paper, we integrate the feature-level self-representation property, a low-rank constraint, a hypergraph regularizer, and a sparsity-inducing regularizer (i.e., an ℓ2,1-norm regularizer) in a unified framework to conduct unsupervised feature selection. Specifically, we represent each feature by the other features to rank the importance of features via the feature-level self-representation property. We then embed a low-rank constraint to consider the relations among features, and a hypergraph regularizer to consider both the high-order relations and the local structure of the samples. We finally use an ℓ2,1-norm regularizer to induce sparsity and output informative features that satisfy the above constraints. The resulting feature selection model thus takes into account both the global structure of the samples (via the low-rank constraint) and the local structure of the data (via the hypergraph regularizer), rather than considering only one of them as in previous studies. This makes the proposed model more robust than previous models, yielding a stable feature selection model. Experimental results on benchmark datasets showed that the proposed method effectively selected the most informative features by removing the adverse effect of redundant/noisy features, compared to the state-of-the-art methods.

Wei He, Xiaofeng Zhu, Yonggang Li, Rongyao Hu, Yonghua Zhu, Shichao Zhang
Improving Cytogenetic Search with GPUs Using Different String Matching Schemes

Cytogenetic data involves the analysis of chromosome structure using karyotyping. The current cytogenetic data of patients in a hospital are very large. A physician needs to search and analyze these data for typical aberrations. This paper presents an approach to speed up searching large cytogenetic data with GPUs. It utilizes the parallel threads in GPUs, which concurrently look for a typical string. Two search schemes are parallelized and their performances are compared; other search schemes can be parallelized in the same manner. The experimental results show that a speedup of up to 15 times can be achieved compared to the sequential version, even for a large number of searched strings. With the help of shared memory, the parallel string search can be improved further by 8%. However, shared memory has a limited size and cannot hold a large number of strings. The percentage of data transfer time can be reduced if more strings are searched, i.e., more workload per thread. Using a more optimized string matching scheme leads to lower speedup due to the overhead of precomputing state tables, and occupies more GPU memory for these tables. Thus, given the nature of GPUs, which have many concurrent threads and separate, smaller memory, a simple algorithm per thread may be good enough. Optimization may then focus on GPU-related issues such as memory coalescing, thread divergence, etc. to improve the speedup further. We also present the application of the GPU string search to finding typical aberrations and extracting the relevant data from patients’ records for further analysis.

Chantana Chantrapornchai, Chidchanok Choksuchat
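The "simple algorithm per thread" conclusion refers to naive position-by-position matching, where each GPU thread checks one candidate start position. A sequential Python sketch of that baseline scheme (the GPU kernel itself is not shown; function name is ours):

```python
def find_patterns(text, patterns):
    """Return {pattern: [match positions]} by naive scanning -- the
    simple scheme that maps naturally onto one GPU thread per
    candidate start position, with no precomputed state tables."""
    hits = {p: [] for p in patterns}
    for p in patterns:
        for i in range(len(text) - len(p) + 1):
            if text[i:i + len(p)] == p:
                hits[p].append(i)
    return hits
```

Table-driven matchers such as Aho-Corasick do less redundant work per position, but, as the abstract notes, their state tables cost GPU memory and precomputation time, which can outweigh the benefit.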
CEIoT: A Framework for Interlinking Smart Things in the Internet of Things

In the emerging Internet of Things (IoT) environment, things are interconnected but not interlinked. Interlinking relevant things offers great opportunities to discover implicit relationships and enable potential interactions among things. To achieve this goal, implicit correlations between things need to be discovered. However, little work has been done in this important direction, and the lack of correlation discovery has inevitably limited the power of interlinking things in IoT. With the rapidly growing number of things connected to the Internet, there are increasing needs for correlation formation and discovery to support effectively interlinking relevant things together. In this paper, we propose a novel approach based on a Multi-Agent Systems (MAS) architecture to extract correlations between smart things. Our MAS system is able to identify correlations on demand due to the autonomous behaviors of object agents. Specifically, we introduce a novel open-source framework, namely CEIoT, to extract correlations in the context of IoT. Based on the attributes of things in our IoT dataset, we identify three types of correlations in our system and propose a new approach to extract and represent the correlations between things. We implement our architecture using the Java Agent Development Framework (JADE) and conduct experimental studies on both synthetic and real-world datasets. The results demonstrate that our approach can extract the correlations at a much higher speed than the naive pairwise computation method.

Ali Shemshadi, Quan Z. Sheng, Yongrui Qin, Ali Alzubaidi
Adopting Hybrid Descriptors to Recognise Leaf Images for Automatic Plant Specie Identification

In recent years, leaf image recognition and classification has become one of the most important subjects in computer vision. Many approaches have been proposed to recognise and classify leaf images, relying on feature extraction and selection algorithms. In this paper, a distinctive hybrid descriptor is proposed, consisting of both global and local features. An HSV colour histogram (HSV-CH) is extracted from leaf images as the global feature, whereas the Local Binary Pattern after two-level wavelet decomposition (WavLBP) is extracted to represent the local characteristics of leaf images. A hybrid method, namely the “Hybrid Descriptor” (HD), is then proposed considering both the global and local features. The proposed method has been empirically evaluated using four data sets of leaf images of 256 × 256 pixels. Experimental results indicate that the performance of the proposed method is promising: the HD outperformed typical leaf image recognition approaches used as baseline models in the experiments. The presented work makes a clear, significant contribution to knowledge advancement in leaf recognition and image classification.

Ali A. Al-kharaz, Xiaohui Tao, Ji Zhang, Raid Lafta
Efficient Mining of Pan-Correlation Patterns from Time Course Data

There are different types of correlation patterns between the variables of a time course data set, such as positive correlations, negative correlations, time-lagged correlations, and correlations containing small interrupted gaps. Usually, these correlations hold only on a subset of time points rather than on the whole span of time points traditionally required for the definition of correlation. As these types of patterns underlie different trends of data movement, mining all of them is an important step toward gaining a broad insight into the dependencies of the variables. In this work, we prove that these diverse types of correlation patterns can all be represented by a generalized form of positive correlation patterns. We also prove a correspondence between positive correlation patterns and sequential patterns. We then present an efficient single-scan algorithm for mining all of these types of correlations. This “pan-correlation” mining algorithm is evaluated on synthetic time course data sets, as well as on yeast cell cycle gene expression data sets. The results indicate that: (i) our mining algorithm has linear time increment in terms of increasing number of variables; (ii) negative correlation patterns are abundant in real-world data sets; and (iii) correlation patterns with time lags and gaps are also abundant. Existing methods have only discovered incomplete forms of many of these patterns, and have missed some important patterns completely.

Qian Liu, Jinyan Li, Limsoon Wong, Kotagiri Ramamohanarao
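A time-lagged positive correlation between two series can be found by sliding one series against the other and scoring each overlap. A brute-force sketch of that idea (the paper's single-scan algorithm, gap tolerance, and subset-of-time-points generalization are far more efficient and general):

```python
def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def best_lag(x, y, max_lag):
    """Return (lag, correlation): the shift of y (0..max_lag steps)
    whose overlapping window is most positively correlated with x."""
    scored = []
    for lag in range(max_lag + 1):
        overlap = len(x) - lag
        scored.append((pearson(x[:overlap], y[lag:lag + overlap]), lag))
    corr, lag = max(scored)
    return lag, corr
```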
Recognizing Daily Living Activity Using Embedded Sensors in Smartphones: A Data-Driven Approach

Smartphones are widely available commercial devices, and using them as a sensing platform creates the possibility of future widespread usage and potential applications. This paper utilizes the embedded sensors in a smartphone to recognise a number of common human actions and postures. We group the range of all possible human actions into five basic action classes, namely walking, standing, sitting, crouching and lying. We also consider the postures pertaining to three of the above actions, including standing postures (backward, straight, forward and bend), sitting postures (lean, upright, slouch and rest) and lying postures (back, side and stomach). Training data was collected from a number of people performing a sequence of these actions and postures with a smartphone in their shirt pockets. We analysed and compared three classification algorithms, namely k Nearest Neighbour (kNN), Decision Tree Learning (DTL) and Linear Discriminant Analysis (LDA), in terms of classification accuracy and efficiency (training time as well as classification time). kNN performed the best overall compared to the other two and is believed to be the most appropriate classification algorithm for this task. The developed system takes the form of an Android app. Our system accesses the motion data from the three sensors in real time and classifies a particular action or posture online using the kNN algorithm. It successfully recognizes the specified actions and postures with very high precision and recall values, generally above 96%.

Wenjie Ruan, Leon Chea, Quan Z. Sheng, Lina Yao
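The kNN classifier chosen by the study is simple enough to sketch end to end: classify a sensor feature vector by the majority label among its k nearest training vectors. A minimal sketch (the labels and toy feature vectors below are illustrative, not the paper's sensor features):

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (feature_vector, label) pairs; classify query by
    majority label among the k nearest neighbours (Euclidean distance)."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    nearest = sorted(train, key=lambda fl: dist(fl[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]
```

kNN has zero training time, which matches the paper's efficiency comparison, at the price of a per-query cost linear in the training set size.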
Dynamic Reverse Furthest Neighbor Querying Algorithm of Moving Objects

With the development of wireless communications and positioning technologies, location-based services for moving objects are in high demand. Previous research on reverse furthest neighbor queries has largely assumed static data. However, in the real world the data are dynamic, and may even be uncertain due to the limitations of measuring equipment or delays in data communication. To effectively find the influence of querying a large number of moving objects in a boundary area versus the query results over the global query area, we put forward dynamic reverse furthest neighbor query algorithms and probabilistic reverse furthest neighbor query algorithms. These algorithms can solve the query of the weak influence set for moving objects. Furthermore, we investigate a model of uncertain moving objects and define a probabilistic reverse furthest neighbor query, and then present a half-plane pruning method for individual moving objects and a spatial pruning method for uncertain moving objects. The experimental results show that the algorithms are effective, efficient and scalable across different distributions and volumes of data sets.

Bohan Li, Chao Zhang, Weitong Chen, Yingbao Yang, Shaohong Feng, Qiqian Zhang, Weiwei Yuan, Dongjing Li
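The underlying (monochromatic) reverse furthest neighbor query can be stated in a few lines: return the points whose furthest neighbor, among all points plus the query, is the query itself. A brute-force sketch for static 2-D points (the paper's contribution is the dynamic, probabilistic, and pruning machinery on top of this definition, none of which is shown):

```python
def reverse_furthest_neighbors(q, points):
    """Monochromatic RFN: the points whose furthest neighbour, among all
    points plus the query q, is q itself (brute force, squared distance)."""
    def d2(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    result = []
    for p in points:
        others = [x for x in points if x != p] + [q]
        if max(others, key=lambda x: d2(p, x)) == q:
            result.append(p)
    return result
```

Intuitively, a query far outside the cluster "influences" every point (all see it as furthest), while a query in the middle influences none, which is why RFN captures a weak-influence set.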

Research Papers

Frontmatter
Relative Neighborhood Graphs Uncover the Dynamics of Social Media Engagement

In this paper, we examine whether the Relative Neighborhood Graph (RNG) can reveal related dynamics of page-level social media metrics. A statistical analysis is also provided to illustrate the application of the method to two other datasets (the Indo-European Language dataset and the Shakespearean Era Text dataset). Using social media metrics on the world’s ‘top check-in locations’ Facebook pages dataset, the statistical analysis reveals coherent dynamical patterns. In the largest cluster, the categories ‘Gym’, ‘Fitness Center’, and ‘Sports and Recreation’ appear closely linked together in the RNG. Taken together, our study validates our expectation that RNGs can provide a “parameter-free” mathematical formalization of proximity. Our approach gives useful insights into user behaviour in social media page-level metrics as well as other applications.

Natalie Jane de Vries, Ahmed Shamsul Arefin, Luke Mathieson, Benjamin Lucas, Pablo Moscato
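The "parameter-free" proximity notion behind the RNG is compact: two points are connected unless some third point is closer to both of them than they are to each other. A brute-force sketch of that construction:

```python
def relative_neighborhood_graph(points):
    """Edge (i, j) iff no third point k satisfies
    max(d(p_i, p_k), d(p_j, p_k)) < d(p_i, p_j) -- i.e., no point lies
    in the 'lune' between p_i and p_j. No parameters to tune."""
    def d(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    edges, n = [], len(points)
    for i in range(n):
        for j in range(i + 1, n):
            dij = d(points[i], points[j])
            if not any(
                max(d(points[i], points[k]), d(points[j], points[k])) < dij
                for k in range(n) if k not in (i, j)
            ):
                edges.append((i, j))
    return edges
```

On three collinear points, the two endpoints are not joined directly because the middle point sits in their lune, illustrating how the RNG keeps only locally meaningful links.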
An Ensemble Approach for Better Truth Discovery

Truth discovery is a hot research topic in the Big Data era, with the goal of identifying true values from the conflicting data provided by multiple sources on the same data items. Previously, many methods have been proposed to tackle this issue. However, none of the existing methods is a clear winner that consistently outperforms the others due to the varied characteristics of different methods. In addition, in some cases, an improved method may not even beat its original version as a result of the bias introduced by limited ground truths or different features of the applied datasets. To realize an approach that achieves better and robust overall performance, we propose to fully leverage the advantages of existing methods by extracting truth from the prediction results of these existing truth discovery methods. In particular, we first distinguish between the single-truth and multi-truth discovery problems and formally define the ensemble truth discovery problem. Then, we analyze the feasibility of the ensemble approach, and derive two models, i.e., serial model and parallel model, to implement the approach, and to further tackle the above two types of truth discovery problems. Extensive experiments over three large real-world datasets and various synthetic datasets demonstrate the effectiveness of our approach.

Xiu Susie Fang, Quan Z. Sheng, Xianzhi Wang
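The core ensemble idea, extracting truth from the outputs of several truth discovery methods, can be illustrated with weighted voting over the methods' claims (the paper's serial and parallel models learn the weights; here they are supplied, as an assumption, by the caller):

```python
def ensemble_truth(claims, weights=None):
    """claims: {method_name: {item: claimed_value}}; for each item pick
    the value with the largest total method weight (unweighted weights
    reduce this to a plain majority vote across methods)."""
    weights = weights or {m: 1.0 for m in claims}
    items = {i for votes in claims.values() for i in votes}
    truth = {}
    for item in sorted(items):
        tally = {}
        for m, votes in claims.items():
            if item in votes:
                tally[votes[item]] = tally.get(votes[item], 0.0) + weights[m]
        truth[item] = max(tally, key=tally.get)
    return truth
```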
Single Classifier Selection for Ensemble Learning

Ensemble classification is one of the representative learning techniques in the field of machine learning, which combines a set of single classifiers aiming at achieving better classification performance. Not every arbitrary set of single classifiers yields a good ensemble classifier. A necessary condition for constructing an accurate ensemble classifier is that the single classifiers be accurate and diverse. In this paper, we first formally give the definitions of accurate and diverse classifiers and put forward metrics to quantify the accuracy and diversity of the single classifiers; afterwards, we propose a novel parameter-free method to select a set of accurate and diverse single classifiers for the ensemble. The experimental results on real-world data sets show the effectiveness of the proposed method, which improves the performance of the representative ensemble classifier Bagging.

Guangtao Wang, Xiaomei Yang, Xiaoyan Zhu
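The "accurate and diverse" condition can be made operational with two simple metrics, per-classifier accuracy and pairwise disagreement, plus a greedy filter. A sketch (the thresholds and the greedy rule are our illustration; the paper's actual selection method is parameter-free):

```python
def accuracy(preds, truth):
    return sum(p == t for p, t in zip(preds, truth)) / len(truth)

def disagreement(p1, p2):
    """Fraction of instances on which two classifiers differ -- one
    simple diversity measure for ensemble member selection."""
    return sum(a != b for a, b in zip(p1, p2)) / len(p1)

def select_members(all_preds, truth, min_acc, min_div):
    """Greedily keep classifiers that are accurate enough and disagree
    enough with every member already selected."""
    chosen = []
    for name, preds in all_preds.items():
        if accuracy(preds, truth) >= min_acc and all(
            disagreement(preds, all_preds[c]) >= min_div for c in chosen
        ):
            chosen.append(name)
    return chosen
```

A clone of an already-selected classifier is rejected even if highly accurate, since it adds no diversity to the ensemble.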
Community Detection in Dynamic Attributed Graphs

Community detection is one of the most widely studied tasks in network analysis because community structures are ubiquitous across real-world networks. These real-world networks are often both attributed and dynamic in nature. In this paper, we propose a community detection algorithm for dynamic attributed graphs that, unlike existing community detection methods, incorporates both temporal and attribute information along with the structural properties of the graph. Our proposed algorithm handles graphs with heterogeneous attribute types, as well as changes to both the structure and the attribute information, which is essential for its applicability to real-world networks. We evaluated our proposed algorithm on a variety of synthetically generated benchmark dynamic attributed graphs, as well as on large-scale real-world networks. The results obtained show that our proposed algorithm is able to identify graph partitions of high modularity and high attribute similarity more efficiently than state-of-the-art methods for community detection.

Gonzalo A. Bello, Steve Harenberg, Abhishek Agrawal, Nagiza F. Samatova
Secure Computation of Skyline Query in MapReduce

Selecting representative objects from a large-scale database is an important step in understanding the database. A skyline query, which retrieves a set of non-dominated objects, is one of the popular methods for selecting representative objects. In this paper, we consider a distributed algorithm for computing a skyline query in order to handle “big data”. In conventional distributed algorithms for computing a skyline query, the values of each object in a local database have to be disclosed to other parties. Increasingly, we have to be aware of privacy in databases, where such disclosures of private information, as in conventional distributed algorithms, are not allowed. In this work, we propose a novel approach to compute the skyline in a multi-party computing environment without disclosing the individual values of objects to another party. Our method is designed to work in the MapReduce framework, i.e., in Hadoop. Our experimental results confirm the effectiveness and scalability of the proposed secure skyline computation.

Asif Zaman, Md. Anisuzzaman Siddique, Annisa, Yasuhiko Morimoto
Recommending Features of Mobile Applications for Developer

Feature recommendation is an important technique for eliciting the requirements to develop and update mobile Apps, and it has been one of the frontier studies in requirements engineering. However, mobile Apps' descriptions are often free-format and noisy, so the classical feature recommendation methods cannot be applied effectively to them. In addition, most mobile Apps' source code, which contains API-calling information that accurately indicates functional features, can be obtained by software tools. Therefore, this paper proposes a hybrid feature recommendation method for mobile Apps, based on both the explicit descriptions and the implicit code information. A self-adaptive similarity measure and KNN are used to find relevant Apps, and functional features are extracted from these Apps and recommended to developers. Experimental results on Apps from four categories show that the proposed feature recommendation method with hybrid information is more effective than the classical method.

Hong Yu, Yahong Lian, Shuotao Yang, Linlin Tian, Xiaowei Zhao
Adaptive Multi-objective Swarm Crossover Optimization for Imbalanced Data Classification

Training a classifier with an imbalanced dataset, where there are more data from the majority class than from the minority class, is a well-known problem in the data mining research community. The resultant classifier becomes under-fitted in recognizing test instances of the minority class and over-fitted with the overwhelming number of mediocre samples from the majority class. Many existing techniques have been tried, ranging from artificially boosting the number of minority-class training samples (e.g., SMOTE) and downsizing the volume of majority-class samples, to modifying the classification induction algorithm in favour of the minority class. However, finding the optimal ratio between the samples of the two classes for building the most accurate classifier is tricky, due to the non-linear relationships between the attributes and the class labels. Merely rebalancing the sample sizes of the two classes to exact proportions often does not produce the best result, and a brute-force search for the perfect combination of majority/minority-class samples is NP-hard. In this paper, a unified preprocessing approach is proposed that uses stochastic swarm heuristics to cooperatively optimize the mixture of the two classes by progressively rebuilding the training dataset. Our novel approach is shown to outperform the existing popular methods.
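As a point of reference for the rebalancing baselines the abstract mentions, a crude random-oversampling step might look like the following (a hypothetical stand-in for SMOTE, which interpolates synthetic samples instead of duplicating; the ratio here is fixed, whereas the paper's swarm heuristic searches for it):

```python
import random

def oversample_minority(majority, minority, ratio=1.0, seed=0):
    """Randomly duplicate minority samples until the minority portion
    reaches `ratio` times the majority size (illustrative helper)."""
    rng = random.Random(seed)
    target = int(len(majority) * ratio)
    boosted = list(minority)
    while len(boosted) < target:
        boosted.append(rng.choice(minority))
    return majority + boosted
```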

Jinyan Li, Simon Fong, Meng Yuan, Raymond K. Wong
Causality-Guided Feature Selection

Identifying meaningful features that drive a phenomenon (response) of interest in complex systems of interconnected factors is a challenging problem. Causal discovery methods have been previously applied to estimate bounds on causal strengths of factors on a response or to identify meaningful interactions between factors in complex systems, but these approaches have been used only for inferential purposes. In contrast, we posit that interactions between factors with a potential causal association on a given response could be viable candidates not only for hypothesis generation but also for predictive modeling. In this work, we propose a causality-guided feature selection methodology that identifies factors having a potential cause-effect relationship in complex systems, and selects features by clustering them based on their causal strength with respect to the response. To this end, we estimate statistically significant causal effects on the response of factors taking part in potential causal relationships, while addressing associated technical challenges, such as multicollinearity in the data. We validate the proposed methodology for predicting the response in five real-world datasets from the domains of climate science and biology. The selected features show predictive skill and consistent performance across the different domains.

Mandar S. Chaudhary, Doel L. Gonzalez II, Gonzalo A. Bello, Michael P. Angus, Dhara Desai, Steve Harenberg, P. Murali Doraiswamy, Fredrick H. M. Semazzi, Vipin Kumar, Nagiza F. Samatova, for the Alzheimer’s Disease Neuroimaging Initiative
Temporal Interaction Biased Community Detection in Social Networks

Community detection in social media is a fundamental problem in social data analytics, helping us understand user relationships and improve social recommendations. Although the problem has been extensively investigated, most research has examined communities based on the static structure of social networks. Our findings within large social networks such as Twitter show that only a few users have interactions or communications within any fixed time interval, so it makes more sense to find active communities that are biased towards the temporal interactions of social users, rather than to rely solely on static structure. Communities detected from this new perspective will provide time-variant social relationships or recommendations in social networks, which can greatly improve the applicability of social data analytics. In this paper, we address the proposed problem of temporal interaction biased community detection using a three-step process. Firstly, we develop an activity biased weight model that gives higher weight to active edges, or to inactive edges in close proximity to active edges. Secondly, we redesign the activity biased community model by extending the classical density-based community detection metric. Thirdly, we develop two different expansion-driven algorithms to find the activity biased densest community efficiently. Finally, we verify the effectiveness of the extended community metric and the efficiency of the algorithms using three real datasets.

Noha Alduaiji, Jianxin Li, Amitava Datta, Xiaolu Lu, Wei Liu
Extracting Key Challenges in Achieving Sobriety Through Shared Subspace Learning

Alcohol abuse is common among people of all ages. The uncontrolled use of alcohol affects both the individual and society: alcohol addiction leads to a large increase in crime, suicide, health-related problems and financial crises. Research has shown that certain behavioral changes can be effective towards staying abstinent. Analyzing the behavioral changes of quitters, and of those at the beginning phase of quitting, can be useful for reducing the issues related to alcohol addiction. Most conventional approaches are based on surveys and are therefore expensive in both time and cost. Social media has lent itself as a source of large, diverse and unbiased data for analyzing social behaviors. Reddit is a social media platform where a large number of people communicate with each other; it has many subject-based sub-groups called subreddits. We collected more than 40,000 self-reported user posts from a subreddit called '/r/stopdrinking'. Based on badge days at the time of post submission, we divide the data into two groups: short-term abstainers, with fewer than 30 abstinent days, and long-term abstainers, with more than 365. Common and discriminative topics are extracted from the data using JS-NMF, a shared-subspace non-negative matrix factorization method. The validity of the extracted topics is demonstrated through predictive performance.

Haripriya Harikumar, Thin Nguyen, Santu Rana, Sunil Gupta, Ramachandra Kaimal, Svetha Venkatesh
Unified Weighted Label Propagation Algorithm Using Connection Factor

With social networks growing ever larger, fast community detection algorithms such as the label propagation algorithm are attracting more attention. However, the label propagation algorithm treats vertices without proper weights, which leads to a loss in performance. We propose the connection factor of a vertex to measure its influence on local connectivity. The connection factor reveals features of the topological structure, and we use it to derive a unified weight that modifies the original label propagation algorithm. Experiments show that our Unified Weighted LPA achieves an average performance improvement of 5 % to 10 %, and of more than 30 % in the best case.
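For orientation, the unweighted algorithm being modified can be sketched as follows; the paper's variant would weight each neighbour's vote by the proposed unified weight rather than counting votes equally:

```python
from collections import Counter

def label_propagation(adj, max_iter=100):
    """Plain label propagation: every vertex repeatedly adopts the most
    frequent label among its neighbours until labels stabilize.
    `adj` maps each vertex to a list of its neighbours."""
    labels = {v: v for v in adj}  # each vertex starts with its own label
    for _ in range(max_iter):
        changed = False
        for v in adj:
            if not adj[v]:
                continue  # isolated vertex keeps its own label
            counts = Counter(labels[u] for u in adj[v])
            best = counts.most_common(1)[0][0]
            if labels[v] != best:
                labels[v] = best
                changed = True
        if not changed:
            break
    return labels
```

On a graph made of two disconnected triangles, each triangle converges to a single shared label, distinct from the other's.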

Xin Wang, Songlei Jian, Kai Lu, Xiaoping Wang
MetricRec: Metric Learning for Cold-Start Recommendations

Making recommendations for new users is a challenging cold-start task due to the absence of historical ratings. When attributes of the users are available, such as age, occupation and gender, the new users' preferences can be inferred. Inspired by user-based collaborative filtering in the warm-start scenario, we propose using similarity over attributes to make recommendations for new users. Two basic similarity metrics, cosine and Jaccard, are evaluated for the cold-start setting. We also propose a novel recommendation model, MetricRec, which learns an interest-derived metric such that users with similar interests are close to each other in the attribute space. As MetricRec's feasible region is conic, we propose an efficient Interior-point Stochastic Gradient Descent (ISGD) method to optimize it; during optimization, the metric is always guaranteed to remain in the feasible region. Owing to its stochastic strategy, ISGD is scalable. Finally, the proposed models are assessed on two movie datasets, Movielens-100K and Movielens-1M. Experimental results demonstrate that MetricRec can effectively learn an interest-derived metric that is superior to cosine and Jaccard, and solves the cold-start problem effectively.
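The Jaccard baseline can be sketched as attribute-set similarity feeding a nearest-neighbour prediction (a hypothetical minimal setup with made-up user/item names; MetricRec would replace this fixed similarity with a learned metric):

```python
def jaccard(a, b):
    """Jaccard similarity between two attribute sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def predict_for_new_user(new_attrs, users, ratings, item, k=2):
    """Score `item` for a cold-start user as the similarity-weighted
    average rating of the k most attribute-similar warm users."""
    sims = sorted(((jaccard(new_attrs, attrs), uid)
                   for uid, attrs in users.items()), reverse=True)[:k]
    num = sum(s * ratings[uid].get(item, 0) for s, uid in sims)
    den = sum(s for s, uid in sims)
    return num / den if den else 0.0
```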

Furong Peng, Xuan Lu, Jianfeng Lu, Richard Yi-Da Xu, Cheng Luo, Chao Ma, Jingyu Yang
Time Series Forecasting on Engineering Systems Using Recurrent Neural Networks

Modern large-scale processing and manufacturing systems cover a wide array and large number of assets that need to work together to ensure that the plant generates output reliably and at the desired yield rate, such as the viscosity of quench oil in a styrene cracking system. Because of the complexity of the overall process, it is important to consider the entire plant as a network in order to identify deterioration patterns and forecast condition. Instead of deriving predictions from an engineering perspective, we propose to leverage a deep learning approach to predict the next state from historical information. In particular, the recurrent neural network (RNN) is selected in this paper as the basis for temporal forecasting. Since multiple sub-systems run in parallel and their interdependence cannot be captured by a plain RNN, we design an LSTM (Long Short-Term Memory) network for each sub-system and feed the outputs of the LSTMs into a linear neural network layer to predict viscosity one hour ahead.

Dongxu Shao, Tianyou Zhang, Kamal Mannar, Yue Han
EDAHT: An Expertise Degree Analysis Model for Mass Comments in the E-Commerce System

To help consumers quickly retrieve the most valuable information from a large volume of comments, we present a method for evaluating the expertise degree of comments. Firstly, we propose an algorithm that automatically constructs an attribute-word hierarchy tree from massive comment data. Secondly, we develop an expertise degree analysis based on the attribute-word hierarchy tree (EDAHT) to estimate the expertise degree of comments. Experimental results on 8,000 manually scored comments show that the EDAHT model is highly consistent with the manual scores and that this novel model is effective.

Jiang Zhong, You Xiong, Weili Guo, Jingyi Xie
A Scalable Document-Based Architecture for Text Analysis

Analyzing textual data is a very challenging task because of the huge volume of data generated daily. Fundamental issues in text analysis include the lack of structure in document datasets, the need for various preprocessing steps, and performance and scaling issues. Existing text analysis architectures only partly solve these issues: they provide restrictive data schemas, address only one aspect of text preprocessing, and focus on a single task when dealing with performance optimization. Thus, we propose in this paper a new generic text analysis architecture in which document structure is flexible, many preprocessing techniques are integrated, and textual datasets are indexed for efficient access. We implement our conceptual architecture using both a relational and a document-oriented database. Our experiments demonstrate the feasibility of our approach and the superiority of the document-oriented logical and physical implementation.

Ciprian-Octavian Truică, Jérôme Darmont, Julien Velcin
DAPPFC: Density-Based Affinity Propagation for Parameter Free Clustering

In clustering algorithms, identifying clusters of arbitrary shape is a bottleneck. In this paper, a new method, DAPPFC (density-based affinity propagation for parameter-free clustering), is proposed. Firstly, it obtains a group of normalized densities from unsupervised clustering results. Then, these densities are used to perform density clustering multiple times. Finally, the multiple density clustering results undergo a two-stage synthesis to achieve the final clustering result. Experiments show that the proposed method does not require user intervention and can obtain accurate clustering results in the presence of arbitrarily shaped clusters with minimal additional computational cost.
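The density-based backbone that such a method runs repeatedly at different density settings is classic DBSCAN; a compact sketch with Euclidean distance and no spatial index (illustrative only, not the paper's multi-density synthesis):

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, -1 for noise."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisional noise; may later join a cluster as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point, not expanded
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # core point: expand the cluster
                queue.extend(n for n in nbrs if labels[n] is None)
    return labels
```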

Hanning Yuan, Shuliang Wang, Yang Yu, Ming Zhong
Effective Traffic Flow Forecasting Using Taxi and Weather Data

Short-term traffic flow forecasting is an important component of intelligent transportation systems. The forecasting results can be used to support intelligent transportation systems in operation planning and revenue management. In this paper, we aim to predict the daily floating population by presenting a novel model that uses taxi trajectory data and weather information. We study the problem of floating traffic flow prediction in weather-affected New York City and propose a new methodology, called WTFPredict, to solve it. In particular, we target the busiest parts of the city (i.e., the airports) and identify their boundaries to compute the traffic flow around each area. The experimental results, based on large-scale, real-life taxi and weather data (12 million records), indicate that the proposed method performs well in forecasting short-term traffic flows. Our study provides valuable insights for transport management, urban planning, and location-based services (LBS).

Xiujuan Xu, Benzhe Su, Xiaowei Zhao, Zhenzhen Xu, Quan Z. Sheng
Understanding Behavioral Differences Between Short and Long-Term Drinking Abstainers from Social Media

Drinking alcohol imposes a high cost on society. The journey from regular drinker to successful quitter may be long and hard, fraught with the risk of relapse. Research has shown that certain behavioral changes can be effective towards staying abstinent. The traditional way to study drinking abstainers is a questionnaire-based approach that collects data from a curated group of people. However, this approach is expensive in both cost and time and often results in small data with little diversity. Recently, social media has emerged as a rich data source. Reddit is one such platform, with a community ('subreddit') devoted to quitting drinking whose discussions date back to 2011 and contain more than 40,000 posts. This large-scale data is generated by the users themselves, without the limits of any survey questionnaire. The most predictive factors among the features (unigrams, topics and LIWC) associated with short-term and long-term abstinence are identified using Lasso. Many common patterns manifest across unigrams, topics and LIWC. Whilst topics provide much richer associations between groups of words and the outcome, unigrams and LIWC are good at finding highly predictive solo and psycholinguistically important words. Combining them, we find many interesting patterns associated with the successful attempts of long-term abstainers, as well as many of the common issues faced during the initial period of abstinence.

Haripriya Harikumar, Thin Nguyen, Sunil Gupta, Santu Rana, Ramachandra Kaimal, Svetha Venkatesh
Discovering Trip Hot Routes Using Large Scale Taxi Trajectory Data

Discovering trip hot routes is very meaningful for drivers picking up passengers, as well as for managers planning urban public transport. Taxis are one of the important means of transportation, and large-scale taxi trajectory data from GPS devices captures residents' trip behavior. In this paper, we present a method to discover trip hot routes using large-scale taxi trajectory data. Firstly, we measure taxi trajectory similarity with the longest common subsequence (LCS) and propose an LCS-based DBSCAN trajectory clustering algorithm. Then, hot routes are extracted from the large-scale taxi trajectory data. Our experiments show that the trajectory clustering algorithm and the hot route extraction method are effective.
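The LCS similarity used in the first step can be sketched with the textbook dynamic program over two symbol sequences (e.g., road-segment IDs); the normalization shown here is a common choice for trajectory similarity, not necessarily the paper's exact formula:

```python
def lcs_length(s, t):
    """Classic dynamic-programming longest common subsequence length."""
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if s[i] == t[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n]

def trajectory_similarity(s, t):
    """LCS length normalised by the shorter trajectory, in [0, 1]."""
    return lcs_length(s, t) / min(len(s), len(t)) if s and t else 0.0
```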

Linjiang Zheng, Qisen Feng, Weining Liu, Xin Zhao
Discovering Spatially Contiguous Clusters in Multivariate Geostatistical Data Through Spectral Clustering

Spectral clustering has recently become one of the most popular modern clustering algorithms for traditional data. However, applying this clustering method to geostatistical data produces spatially scattered clusters, which is undesirable for many geoscience applications. In this work, we develop a spectral clustering method aimed at discovering spatially contiguous and meaningful clusters in multivariate geostatistical data, in which spatial dependence plays an important role. The proposed spectral clustering method relies on a similarity measure built from a non-parametric kernel estimator of the multivariate spatial dependence structure of the data, emphasizing the spatial correlation among data locations. The capability of the proposed spectral clustering method to provide spatially contiguous and meaningful clusters is illustrated using the European Geological Surveys Geochemical database.

Francky Fouedjio
On Improving Random Forest for Hard-to-Classify Records

Random Forest draws much interest from the research community because of its simplicity and excellent performance. The splitting attribute at each node of a decision tree in a Random Forest is determined from a randomly selected subset, of predefined size, of the entire attribute set. The size of this subset is one of the most debated aspects of Random Forest and has encouraged many contributions. However, little attention has been given to improving Random Forest specifically for those records that are hard to classify. In this paper, we propose a novel technique for detecting hard-to-classify records and increasing their weights in the training dataset. We then build a Random Forest from the weighted training dataset. The experimental results presented in this paper indicate that the ensemble accuracy of Random Forest can be improved when it is applied to weighted training datasets with more emphasis on hard-to-classify records.
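A toy version of the reweighting idea: after probing an initial ensemble, give each record a weight that grows with the fraction of trees that misclassified it, then draw the next forest's bootstrap samples from those weights (a hypothetical scheme; the paper's detection of hard records is more involved):

```python
import random

def reweight_hard_records(miss_counts, n_trees, boost=1.0):
    """One weight per record: base weight 1 plus a bonus proportional to
    the fraction of trees that got the record wrong."""
    return [1.0 + boost * m / n_trees for m in miss_counts]

def weighted_bootstrap(records, weights, k, seed=0):
    """Draw a bootstrap sample of size k, biased towards heavy records."""
    return random.Random(seed).choices(records, weights=weights, k=k)
```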

Md Nasim Adnan, Md Zahidul Islam
Scholarly Output Graph: A Graphical Article-Level Metric Indicating the Impact of a Scholar’s Publications

Statistically, top scholars tend to accumulate a large number of publications during their tenure, yet the patterns illustrating their scientific impact are monotonous, and it is difficult to get a concrete picture of the development of a scholar's output. We therefore address the issue of graphically presenting and comparing the impact of individual scholars' publications. Moreover, with the development of Web 2.0, more information about the social impact of a scholar's work is becoming increasingly available and relevant. This raises the challenge of how to quickly compare a scholar's entire collection of publications and pinpoint those with higher social popularity as well as academic influence. To this end, we propose a graphical article-level metric, the Scholarly Output Graph (SOG). SOG captures three dimensions, journal impact factor (JIF), scientific impact and social popularity, and reflects not only the quality of the publications but also the immediate responses from social networks. With the visual cues of block length, width and color, users can intuitively locate articles of higher scientific impact, JIF and social popularity. Additionally, SOG proves to be widely applicable, practical and flexible as a navigation tool for filtering publications. To demonstrate the usability of SOG, we designed a literature navigation homepage with a list of 50 researchers in computer science and their individual scholarly output graphs; the results can be found at http://impact.linkscholar.org/SOGExample.html.

Yu Liu, Dan Lin, Jing Li, Shimin Shan
Distributed Lazy Association Classification Algorithm Based on Spark

Lazy association classification algorithms are inefficient when classifying multiple unclassified samples at the same time, and the existing algorithms are sequential and cannot handle big data problems. To solve these problems, we propose a distributed lazy association classification algorithm based on Spark, named SDLAC. Firstly, it clusters the unclassified samples with the K-Means algorithm. Secondly, it performs distributed projections according to the clustering results and mines classification association rules with a distributed, Spark-based mining algorithm. It then constructs a classifier to classify the unclassified samples. Experiments are conducted on five UCI datasets and a big dataset from the first national college competition on cloud computing (China). The results show that the SDLAC algorithm is more accurate than the CBA algorithm and far more efficient than the typical distributed lazy association classification algorithm. In other words, the SDLAC algorithm can adapt to a big data environment.

Xueming Li, Chaoyang Zhang, Guangwei Chen, Xiaoteng Sun, Qi Zhang, Haomin Yang
Event Evolution Model Based on Random Walk Model with Hot Topic Extraction

To identify the evolution relationships between events, this paper presents a new model that uses a random walk model to weight the cosine similarity of events according to their chronological order. Comparisons with other models show this model to be effective and accurate in identifying the relationships between events. The paper also puts forward an innovative method for detecting hot topics that applies the concept of related events, and introduces parallel computing, including Spark and MapReduce, to significantly improve the efficiency of the event evolution calculation and hot topic detection.
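The similarity being reweighted is ordinary cosine similarity over event term vectors; a sketch with an exponential time discount standing in for the random-walk weighting (the decay form and the `tau` parameter are assumptions for illustration, not the paper's formula):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse term-frequency dicts."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def evolution_score(u, v, dt, tau=7.0):
    """Cosine similarity discounted by the time gap `dt` between events."""
    return cosine(u, v) * math.exp(-dt / tau)
```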

Chunzi Wu, Bin Wu, Bai Wang
Knowledge-Guided Maximal Clique Enumeration

Maximal clique enumeration is a long-standing problem in graph mining and knowledge discovery. Numerous classic algorithms exist for solving this problem. However, these algorithms focus on enumerating all maximal cliques, which may be computationally impractical and much of the output may be irrelevant to the user. To address this issue, we introduce the problem of knowledge-biased clique enumeration, a query-driven formulation that reduces output space, computation time, and memory usage. Moreover, we introduce a dynamic state space indexing strategy for efficiently processing multiple queries over the same graph. This strategy reduces redundant computations by dynamically indexing the constituent state space generated with each query. Experimental results over real-world networks demonstrate this strategy’s effectiveness at reducing the cumulative query-response time. Although developed in the context of maximal cliques, our techniques could possibly be generalized to other constraint-based graph enumeration tasks.
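For context, the classic enumeration these query-driven techniques build on is the Bron-Kerbosch algorithm; a minimal version without pivoting (and without the paper's knowledge-biasing or state-space indexing):

```python
def bron_kerbosch(adj, r=frozenset(), p=None, x=frozenset()):
    """Yield all maximal cliques of the graph given by `adj`, a dict
    mapping each vertex to the set of its neighbours.
    r: current clique, p: candidate vertices, x: already-processed vertices."""
    if p is None:
        p = frozenset(adj)
    if not p and not x:
        yield r  # r cannot be extended: it is a maximal clique
        return
    for v in list(p):
        yield from bron_kerbosch(adj, r | {v}, p & adj[v], x & adj[v])
        p = p - {v}
        x = x | {v}
```

On a triangle {1, 2, 3} with a pendant edge {3, 4}, the maximal cliques are exactly {1, 2, 3} and {3, 4}.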

Steve Harenberg, Ramona G. Seay, Gonzalo A. Bello, Rada Y. Chirkova, P. Murali Doraiswamy, Nagiza F. Samatova
Got a Complaint? Keep Calm and Tweet It!

Research shows that many public service agencies use Twitter to share information and reach out to the public. Recently, Twitter has also been used as a platform to collect complaints from citizens and resolve them in a timely and efficient manner. However, due to the dynamic nature of the website and the presence of free-form text, manual identification of complaint posts is impractical. We formulate the problem of complaint identification as an ensemble classification problem. We perform several text enrichment processes, such as hashtag expansion, spelling correction and slang conversion, on raw tweets to identify linguistic features. We implement one-class SVM classification and evaluate the performance of various kernel functions for identifying complaint tweets. Our results show that the linear-kernel SVM outperforms the polynomial and RBF kernels, and the proposed approach classifies complaint tweets with an overall precision of $$76\,\%$$. We boost the accuracy of our approach by forming an ensemble over all three kernels. Results show that the one-class parallel ensemble SVM classifier outperforms cascaded ensemble learning by a margin of approximately $$20\,\%$$. By comparing the performance of each kernel against the ensemble classifier, we provide an efficient method for classifying complaint reports.
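The parallel-ensemble step can be reduced to majority voting over the per-kernel decisions (+1 for complaint, -1 for not); a minimal sketch of that final combination step only (the per-kernel one-class SVMs themselves are not shown):

```python
def ensemble_vote(per_kernel_preds):
    """Majority vote across classifiers; each inner list holds one
    kernel's +1/-1 predictions for every tweet."""
    return [1 if sum(votes) > 0 else -1 for votes in zip(*per_kernel_preds)]
```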

Nitish Mittal, Swati Agarwal, Ashish Sureka
Query Classification by Leveraging Explicit Concept Information

A key task in query understanding is interpreting user intentions from the limited words that the user submits to the search engine. Query classification (QC) has been widely studied for this purpose: it classifies queries into a set of target categories representing user search intents. Query classification is an important as well as difficult problem in the field of information retrieval, since queries are usually short, ambiguous and noisy. In this case, traditional "bag-of-words" based classification methods fail to achieve high accuracy on the QC task. In this paper, we propose to mine explicit "Concept" information to help resolve this problem. Specifically, we first leverage existing knowledge bases to enrich the short query at the concept level. Then we discuss the usage of the mined concept information and propose a novel language-model-based query classification method that takes both words and concepts into consideration. Experimental results show that the mined concepts are very informative and effective in improving query classification.

Fang Wang, Ze Yang, Zhoujun Li, Jianshe Zhou
Stabilizing Linear Prediction Models Using Autoencoder

To date, the instability of prognostic predictors in sparse high-dimensional models, which hinders their clinical adoption, has received little attention. Stable prediction is often overlooked in favour of performance, yet stability is key when adopting models in critical areas such as healthcare. Our study proposes a stabilization scheme based on detecting higher-order feature correlations. Using a linear model as the basis for prediction, we achieve feature stability by regularizing the latent correlation among features, which is modelled using an autoencoder network. Stability is further enhanced by combining a recent technique that uses a feature graph with external unlabelled data augmented for training the autoencoder network. Our experiments are conducted on a heart failure cohort from an Australian hospital. Stability was measured using the consistency index for feature subsets and the signal-to-noise ratio for model parameters. Our methods demonstrated significant improvements in feature stability and model estimation stability when compared to the baselines.

Shivapratap Gopakumar, Truyen Tran, Dinh Phung, Svetha Venkatesh
Mining Source Code Topics Through Topic Model and Words Embedding

Developers nowadays can leverage existing systems to build their own applications. However, a lack of documentation hinders the process of software system reuse. We examine the problem of mining topics (i.e., topic extraction) from source code, which can facilitate the comprehension of software systems. We propose a topic extraction method, Embedded Topic Extraction (EmbTE), that leverages word embedding techniques to consider word semantics, which have not previously been considered in mining topics from source code. We also adopt Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF) to extract topics from source code. Moreover, an automated term selection algorithm is proposed to identify the terms from source code that contribute most to the topic extraction task. Empirical studies on Github (https://github.com/) Java projects show that EmbTE outperforms the other methods in terms of providing more coherent topics. The results also indicate that method names, method comments, class names and class comments are the types of terms that contribute most to source code topic extraction.

Wei Emma Zhang, Quan Z. Sheng, Ermyas Abebe, M. Ali Babar, Andi Zhou
IPC Multi-label Classification Based on the Field Functionality of Patent Documents

The International Patent Classification (IPC) is used to classify patents according to their technological area. Research on automatic IPC classification has focused on applying various existing machine learning methods rather than considering the data characteristics or the field structure of patent documents. This paper proposes a new method for automatic IPC classification using two structural fields, the technical field and the background field, selected according to the characteristics of patent documents. The effects of the structural fields on patent document classification are examined using a multi-label model and 564,793 registered Korean patents at the IPC subclass level. An 87.2% precision rate is obtained when using titles, abstracts, claims, technical fields and backgrounds. This verifies that the technical field and the background field play an important role in improving the precision of IPC multi-label classification at the IPC subclass level.

Sora Lim, YongJin Kwon
Unsupervised Component-Wise EM Learning for Finite Mixtures of Skew t-distributions

In recent years, finite mixtures of skew distributions have been gaining popularity as a flexible tool for modelling data with asymmetric distributional features. Parameter estimation for these mixture models via the traditional EM algorithm requires the number of components to be specified a priori. In this paper, we consider unsupervised learning of skew mixture models in which the optimal number of components is estimated during the parameter estimation process. We adopt a component-wise EM algorithm and use the minimum message length (MML) criterion. For illustrative purposes, we focus on the case of a finite mixture of multivariate skew t-distributions. The performance of the approach is demonstrated on a real dataset from flow cytometry, where our mixture model was used to provide an automated segmentation of cell populations.

Sharon X. Lee, Geoffrey J. McLachlan
Supervised Feature Selection by Robust Sparse Reduced-Rank Regression

Feature selection, which keeps discriminative features (i.e., removes noisy and irrelevant features) from high-dimensional data, has become a vitally important technique in machine learning, since noisy/irrelevant features can deteriorate the performance of classification and regression. Moreover, feature selection has been applied in all kinds of real applications due to its interpretability. Motivated by the successful use of sparse learning in machine learning and of reduced-rank regression in statistics, in this article we put forward a novel supervised feature selection method using a reduced-rank regression model and a sparsity-inducing regularizer. Distinguished from state-of-the-art attribute selection methods, the present method: (1) is built upon an $$\ell _{2,p}$$-norm loss function and an $$\ell _{2,p}$$-norm regularizer, simultaneously integrating subspace learning and attribute selection into a unified framework; (2) selects the more discriminative features flexibly, since it is capable of controlling the degree of sparseness and is robust to outlier samples; and (3) is interpretable and stable, because it embeds subspace learning (i.e., enabling stable output models) into the feature selection framework (i.e., enabling interpretable results). Experimental results on eight multi-output datasets indicate the effectiveness of our model compared to state-of-the-art methods on regression tasks.

Rongyao Hu, Xiaofeng Zhu, Wei He, Jilian Zhang, Shichao Zhang
PUEPro: A Computational Pipeline for Prediction of Urine Excretory Proteins

A computational pipeline is developed to accurately predict urine excretory proteins and the possible origins of these proteins. The novel contributions of this study include: (i) a new method for predicting whether a cellular protein is urine excretory, based on unique features of proteins known to be urine excretory; and (ii) a novel method for identifying urinary proteins originating from the urinary system. By integrating these tools, our computational pipeline can predict the origin of a detected urinary protein, hence offering a novel tool for identifying potential biomarkers of a specific disease whose associated proteins may be excreted in urine. One application of the pipeline is presented to demonstrate its predictive effectiveness. The pipeline and supplementary materials can be accessed at the following URL: http://csbl.bmb.uga.edu/PUEPro/.

Yan Wang, Wei Du, Yanchun Liang, Xin Chen, Chi Zhang, Wei Pang, Ying Xu
Partitioning Clustering Based on Support Vector Ranking

Support Vector Clustering (SVC) has become a significant boundary-based clustering algorithm. In this paper we propose a novel SVC algorithm named “Partitioning Clustering Based on Support Vector Ranking (PC-SVR)”, which aims to improve traditional SVC by addressing its high computational cost during cluster partitioning. PC-SVR consists of two parts. In the first part, we sort the support vectors (SVs) based on their geometrical properties in the feature space. Based on this, the second part partitions the samples using the clustering algorithm of similarity segmentation based point sorting (CASS-PS), thus producing the clusters. Theoretically, PC-SVR inherits the advantages of both SVC and CASS-PS while avoiding the downsides of the two algorithms. According to the experimental results, PC-SVR demonstrates good clustering performance and outperforms several existing approaches in terms of the Rand index, the adjusted Rand index, and accuracy.
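The sort-then-partition idea can be illustrated in a simplified one-dimensional setting: order points by a geometric score and cut wherever the gap between consecutive points is unusually large. This is only the flavour of point-sorting-based partitioning; the actual PC-SVR sorts support vectors in the kernel feature space, and the gap rule below is a hypothetical stand-in.

```python
def sort_and_split(values, gap_factor=3.0):
    """Cluster 1-D points: sort them, then start a new cluster
    wherever a gap exceeds gap_factor times the mean gap (a toy
    analogue of point-sorting-based partitioning)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    s = [values[i] for i in order]
    gaps = [b - a for a, b in zip(s, s[1:])]
    mean_gap = sum(gaps) / len(gaps)
    labels = [0] * len(values)
    cluster = 0
    for pos in range(1, len(s)):
        if s[pos] - s[pos - 1] > gap_factor * mean_gap:
            cluster += 1              # a large gap starts a new cluster
        labels[order[pos]] = cluster
    return labels
```

Because the partition is derived from a single sorted pass, the split step costs O(n log n), which conveys why ranking the support vectors first can cut the partitioning cost.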

Qing Peng, Yan Wang, Ge Ou, Yuan Tian, Lan Huang, Wei Pang
Global Recursive Based Node Importance Evaluation

Research on the world city network (WCN) has been promoting the development of evaluation algorithms for complex networks. WCN studies focus on measuring a city's position in the network. In previous studies, algorithms such as centricity evaluation, power evaluation and their recursive counterparts were proposed to support WCN research. In this paper, we propose a novel global recursive based node importance evaluation (GRNIE) algorithm for the WCN, which improves the performance of network centricity and power evaluation; we evaluate it on the Friedmann Basics Network. The experimental results show that the proposed GRNIE outperforms the previous algorithms (i.e., degree, recursive centricity, and recursive power) on the classification task, with accuracy improvements of 72 %, 32 % and 20 %, respectively.
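Recursive importance measures share the pattern "a node is important if it is connected to important nodes". A minimal sketch of that pattern (plain power iteration toward an eigenvector-style centrality, not the GRNIE algorithm itself):

```python
def recursive_importance(adj, n_iter=100):
    """Iteratively set each node's score to its own score plus the
    sum of its neighbours' scores, then normalize. The self term
    keeps the iteration from oscillating on bipartite graphs."""
    n = len(adj)
    score = [1.0 / n] * n
    for _ in range(n_iter):
        new = [score[i] + sum(adj[i][j] * score[j] for j in range(n))
               for i in range(n)]
        norm = sum(new) or 1.0
        score = [s / norm for s in new]
    return score

# star graph: node 0 is connected to nodes 1..3
adj = [[0, 1, 1, 1],
       [1, 0, 0, 0],
       [1, 0, 0, 0],
       [1, 0, 0, 0]]
scores = recursive_importance(adj)
```

On the star graph the hub ends up with a markedly higher score than the leaves, which is the qualitative behaviour any recursive centrality or power measure should reproduce.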

Lu Zhao, Li Xiong, Shan Xue
Extreme User and Political Rumor Detection on Twitter

Twitter, as a popular social networking tool that allows its users to conveniently propagate information, has been widely used by politicians and political campaigners worldwide. In recent years, Twitter has come under scrutiny for its lack of filtering mechanisms, which leads to the propagation of trolling, bullying, and other antisocial behaviors. Rumors can also be easily created on Twitter, e.g., by extreme political campaigners, and widely spread by readers who cannot judge their truthfulness. Current work on Twitter message assessment, however, focuses on credibility, which is subjective and can be affected by the assessor's bias. In this paper, we focus on actual message truthfulness and propose a rule-based method for detecting political rumors on Twitter by identifying extreme users. We employ clustering methods to identify news tweets. In contrast with other methods that focus on the content of tweets, our unsupervised classification method employs five structural and timeline features to detect extreme users. We show with extensive experiments that certain rules in our rule set provide accurate rumor detection, with precision and recall both above 80 %, while some other rules provide 100 % precision, although with lower recall.
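A rule set of this kind can be expressed directly as predicates over per-user features. The abstract does not name the five structural and timeline features, so the feature names and thresholds below are purely hypothetical placeholders showing the shape of a rule-based detector:

```python
def is_extreme_user(user):
    """Apply simple threshold rules to a dict of user features.
    The features and cut-offs here are illustrative only."""
    rules = [
        lambda u: u["tweets_per_day"] > 100,               # hyperactive timeline
        lambda u: u["followers"] < 0.1 * u["following"],   # skewed follow graph
        lambda u: u["retweet_ratio"] > 0.9,                # mostly amplification
    ]
    fired = sum(rule(user) for rule in rules)
    return fired >= 2            # flag the user when most rules fire
```

Keeping each rule a standalone predicate makes it possible to report precision and recall per rule, which is how the abstract distinguishes high-recall rules from the 100 % precision ones.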

Cheng Chang, Yihong Zhang, Claudia Szabo, Quan Z. Sheng
Deriving Public Sector Workforce Insights: A Case Study Using Australian Public Sector Employment Profiles

Effective approaches to measuring human capital in public sector and government agencies are essential for robust workforce planning against changing economic conditions. To this end, adopting innovative hypothesis-driven workforce data analysis can help discover hidden patterns and trends in the workforce. These trends are useful for decision making and support the development of policies to reach desired employment outcomes. In this study, the data challenges and approaches of a real-life workforce analytics scenario are described. Statistical results from numerous workforce data experiments are combined to derive three hypotheses that are useful to public sector organisations for human resources management and decision making.

Shameek Ghosh, Yi Zheng, Thorsten Lammers, Ying Ying Chen, Carolyn Fitzmaurice, Scott Johnston, Jinyan Li
Real-Time Stream Mining Electric Power Consumption Data Using Hoeffding Tree with Shadow Features

Many energy load forecasting models have been built with batch-based supervised learning, where the whole dataset must be loaded for training. Due to the sheer volume of accumulated consumption data, which arrives in the form of continuous data streams, such batch-mode learning requires a very long time to rebuild the model. Incremental learning, on the other hand, is an alternative for online learning and prediction that learns the data stream in segments. However, its prediction performance is known to fall short of batch learning. In this paper, we propose a novel approach called Shadow Features (SF), which offer extra dimensions of information about the data streams. SF are relatively easy to compute and are suitable for lightweight online stream mining.
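Extra stream-derived dimensions of this kind can be computed cheaply with a sliding window. The specific statistics below (moving average and first difference) are illustrative choices, not the paper's exact shadow feature definition:

```python
from collections import deque

def shadow_feature_stream(values, window=3):
    """Augment each raw reading with two 'shadow' dimensions:
    the moving average over the last `window` readings and the
    change since the previous reading. One pass, O(window) memory,
    so it suits lightweight online stream mining."""
    buf = deque(maxlen=window)
    prev = None
    out = []
    for v in values:
        buf.append(v)
        moving_avg = sum(buf) / len(buf)
        delta = 0.0 if prev is None else v - prev
        out.append((v, moving_avg, delta))
        prev = v
    return out
```

Each augmented tuple can then be fed to an incremental learner such as a Hoeffding tree in place of the raw reading alone.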

Simon Fong, Meng Yuen, Raymond K. Wong, Wei Song, Kyungeun Cho
Real-Time Investigation of Flight Delays Based on the Internet of Things Data

Flight delay is a very important problem, resulting in billions of dollars wasted each year. Previous researchers have studied this problem using historical flight records. With the emerging paradigm of the Internet of Things (IoT), it is now possible to analyze sensor data in real time. We investigate flight delays using real-time IoT data, crawled and collected from various sources including flight, weather and air quality sensors. Our goal is to improve the understanding of the roots and signs of flight delays so that a given flight can be classified based on features drawn from flights and other data sources. We extend existing work by adding new data sources and considering new factors in the analysis of flight delay. Through the use of real-time data, our goal is to establish a novel service that predicts delays in real time.

Abdulwahab Aljubairy, Ali Shemshadi, Quan Z. Sheng

Demo Papers

Frontmatter
IRS-HD: An Intelligent Personalized Recommender System for Heart Disease Patients in a Tele-Health Environment

The use of intelligent technologies in clinical decision support may play a promising role in improving the quality of life of heart disease patients and help reduce the cost and workload involved in their daily health care in a tele-health environment. The objective of this demo proposal is to demonstrate an intelligent prediction system we developed, called IRS-HD, which accurately advises patients with heart disease on whether they need to take a body test on a given day, based on the analysis of their medical data over the past few days. Easy-to-use, user-friendly interfaces are provided for users to supply the necessary inputs to the system and to receive its recommendations. IRS-HD yields satisfactory recommendation accuracy, offers a promising way to reduce the risk of incorrect recommendations, and saves patients the workload of conducting body tests every day.

Raid Lafta, Ji Zhang, Xiaohui Tao, Yan Li, Vincent S. Tseng
Sentiment Analysis for Depression Detection on Social Networks

In response to the urgent demand for methods that help detect depression at an early stage, the work presented in this paper adopts sentiment analysis techniques to analyse users' social network contributions to detect potential depression. A prototype has been developed, aiming to demonstrate the mechanism of the approach and the potential social benefit it may deliver. The contributions include a depressive-sentiment knowledge base and an algorithm that analyses textual data for depression detection.

Xiaohui Tao, Xujuan Zhou, Ji Zhang, Jianming Yong
Traffic Flow Visualization Using Taxi GPS Data

Intelligent transportation systems (ITSs) have become an essential tool for a broad range of transportation applications, and traffic flow visualization is an important problem in ITS. The visualized results can be used to support ITSs in planning operations and managing revenue. In this paper, we aim to visualize daily floating taxis by presenting a novel figure based on taxi trajectory data and weather information. Many visualization platforms feature an online-offline architecture in which taxi GPS trajectory data is processed in two phases. This approach incurs high costs, however, since taxis continually generate huge volumes of trajectory data every second. To support frequent-trajectory analysis, we present a tool for mining frequent taxi trajectories (FTMTool). It allows us to find drivers' routes by collecting input on the most frequently travelled roads, thereby obtaining a set of high-quality routes. The tool also supports statistics over selected roads. We demonstrate the usefulness of our tool using real data from New York City.
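Mining of frequent roads from trajectories can be bootstrapped by counting how many trips traverse each road segment. The sketch below is a hypothetical stand-in for FTMTool's mining step, with support counted once per trip:

```python
from collections import Counter

def frequent_roads(trajectories, min_support=2):
    """Each trajectory is a list of road-segment IDs; return the
    segments traversed by at least `min_support` distinct trips."""
    counts = Counter()
    for traj in trajectories:
        counts.update(set(traj))      # count each road once per trip
    return {road for road, c in counts.items() if c >= min_support}
```

The surviving high-support segments are the natural candidates for building the set of high-quality routes that the visualization then renders.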

Xiujuan Xu, Zhenzhen Xu, Xiaowei Zhao
Backmatter
Metadata
Title
Advanced Data Mining and Applications
Edited by
Jinyan Li
Xue Li
Shuliang Wang
Jianxin Li
Quan Z. Sheng
Copyright Year
2016
Electronic ISBN
978-3-319-49586-6
Print ISBN
978-3-319-49585-9
DOI
https://doi.org/10.1007/978-3-319-49586-6