2019 | Book

Advanced Data Mining and Applications

15th International Conference, ADMA 2019, Dalian, China, November 21–23, 2019, Proceedings

Edited by: Jianxin Li, Sen Wang, Dr. Shaowen Qin, Xue Li, Shuliang Wang

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

This book constitutes the proceedings of the 15th International Conference on Advanced Data Mining and Applications, ADMA 2019, held in Dalian, China in November 2019.

The 39 full papers presented together with 26 short papers and 2 demo papers were carefully reviewed and selected from 170 submissions. The papers were organized in topical sections named: Data Mining Foundations; Classification and Clustering Methods; Recommender Systems; Social Network and Social Media; Behavior Modeling and User Profiling; Text and Multimedia Mining; Spatial-Temporal Data; Medical and Healthcare Data/Decision Analytics; and Other Applications.

Table of Contents

Frontmatter

Data Mining Foundations

Mining Emerging High Utility Itemsets over Streaming Database

High utility itemset mining (HUIM) is a classical data mining problem that has gained much attention in the research community and has a wide range of applications. The goal of HUIM is to identify all itemsets whose utility satisfies a user-defined threshold. In this paper, we address a new and interesting direction of high utility itemset mining: mining temporal emerging high utility itemsets from data streams. Temporal emerging high utility itemsets are those that are not high utility in the current time window of the data stream but have high potential to become high utility in subsequent time windows. Discovering them is an important step in mining interesting itemsets that yield high profits from streaming databases, with applications such as proactive decision making by domain experts, building powerful classifiers, market basket analysis, and catalogue design, among others. We propose a novel method, named EFTemHUI (Efficient Framework for Temporal Emerging HUI mining), to better identify emerging high utility itemsets. To improve the efficiency of the mining process, we devise a new mechanism for evaluating the high utility itemsets that will emerge, which can capture and store information about potential high utility itemsets. Through extensive experiments on three datasets, we show that the proposed method yields excellent accuracy and low error in predicting emerging patterns for the next window.

Acquah Hackman, Yu Huang, Philip S. Yu, Vincent S. Tseng
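
To make the utility threshold in the abstract above concrete, the following minimal sketch computes the utility of an itemset over a tiny transaction database and lists the itemsets that meet a threshold by brute force. It illustrates only the standard HUIM definition, not EFTemHUI; the `PROFIT` table, the transactions, and all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical unit profits and transactions (item -> purchased quantity).
PROFIT = {"a": 5, "b": 2, "c": 1}
TRANSACTIONS = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 3},
    {"b": 1, "c": 1},
]


def itemset_utility(itemset, transactions, profit):
    """Sum quantity * unit profit over the transactions containing every item."""
    total = 0
    for t in transactions:
        if all(item in t for item in itemset):
            total += sum(t[item] * profit[item] for item in itemset)
    return total


def high_utility_itemsets(transactions, profit, min_util):
    """Brute-force enumeration; practical miners prune the search space instead."""
    items = sorted(profit)
    result = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            utility = itemset_utility(itemset, transactions, profit)
            if utility >= min_util:
                result[itemset] = utility
    return result


if __name__ == "__main__":
    print(high_utility_itemsets(TRANSACTIONS, PROFIT, min_util=10))
```

Exhaustive enumeration is exponential in the number of items, which is why HUIM algorithms rely on upper bounds and pruning; the sketch is meant only to show what "utility satisfies a threshold" means.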

A Methodology for Resolving Heterogeneity and Interdependence in Data Analytics

Big data analytics is widely applied in a number of areas because it can uncover hidden patterns, correlations, and insights by integrating multiple data sources. However, the interdependence and heterogeneity of these data sources make it very challenging to manage them in support of “last mile” analytics for decision making and value co-creation, which usually involve multiple perspectives and multiple granularities. In this paper, we propose a unified knowledge representation framework, namely Cyber-Entity (Cyber-E) modeling, to capture and formalize selected behaviors of real entities in both the social and physical worlds in the cyber analytic space. Its special features include not only the stateful intra-properties of a Cyber-E, but also the inter-relationships and dependences among Cyber-Es. A grouping mechanism, called Cyber-G, is also introduced to support flexible granularity adjustment in knowledge management. It supports rapid, on-demand, self-service analytics. An illustrative example of applying this approach to the academic research community is given, followed by a case study of two top conferences in the service computing area, ICSOC and ICWS, to illustrate the effectiveness and potential of our approach.

Han Han, Yunwei Zhao, Can Wang, Min Shu, Tao Peng, Chi-Hung Chi, Yonghong Yu

Learning Subgraph Structure with LSTM for Complex Network Link Prediction

Link prediction is a hot research topic in complex networks. Traditional link prediction is based on similarities between nodes, such as common neighbors and the Jaccard index. These methods are easy to understand and widely used. However, most existing works use a single relationship between the two target nodes and do not exploit the information around them. Because these methods also scale poorly, their link prediction performance suffers. In this paper, we propose a novel link prediction method, learning Subgraph structure with Long-Short Term Memory network (SG-LSTM), which uses a recurrent neural network to learn subgraph patterns for predicting links. First, we extract the enclosing subgraph of the target link. Second, we use a graph labeling algorithm called the hash-based Weisfeiler-Lehman (HWL) algorithm to re-label the extracted enclosing subgraphs, which maps the subgraphs to sequential data that reflects the subgraph structure. Finally, a long-short term memory network (LSTM) is trained on these sequential data to learn the link prediction model, and the learned LSTM model is used to predict links. Large-scale experiments verify that our proposed method outperforms traditional link prediction methods.

Yun Han, Donghai Guan, Weiwei Yuan
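
The two traditional similarity scores named in the abstract above, common neighbors and the Jaccard index, can be sketched in a few lines. This shows only those baseline scores on a hypothetical toy graph, not SG-LSTM; the function names and the edge list are illustrative assumptions.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Build an undirected adjacency map (node -> set of neighbours)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def common_neighbors(adj, u, v):
    """Number of nodes adjacent to both u and v."""
    return len(adj[u] & adj[v])


def jaccard_index(adj, u, v):
    """Shared neighbours divided by all neighbours of u or v."""
    union = adj[u] | adj[v]
    if not union:
        return 0.0
    return len(adj[u] & adj[v]) / len(union)


if __name__ == "__main__":
    # Hypothetical toy graph.
    adj = build_adjacency([(1, 2), (1, 3), (2, 3), (2, 4)])
    print(common_neighbors(adj, 1, 4), jaccard_index(adj, 1, 4))
```

Both scores look only at the immediate neighbourhoods of the two target nodes, which is the limitation the abstract points to.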

Accelerating Minimum Temporal Paths Query Based on Dynamic Programming

Finding temporal paths is a fundamental problem in the study of temporal graphs. The solutions in existing studies [19] are not efficient enough because they spend much of their time scanning temporal edges, which reflect the connections between two vertices at every time instant. Therefore, in this paper, we first propose efficient algorithms, FDP and SDP, which use dynamic programming to calculate the shortest path and the fastest path respectively. Then we define restricted minimum temporal paths for some special requirements, including the restricted earliest-arrival path and the restricted latest-departure path, and present the REDP and RLDP algorithms to compute them. Finally, extensive experiments demonstrate that our proposed algorithms are effective and efficient on massive real-world temporal graphs.

Mo Li, Junchang Xin, Zhiqiong Wang, Huilin Liu
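
For readers unfamiliar with temporal paths, the sketch below computes earliest-arrival times with a well-known single scan over time-ordered edges. It is a baseline illustration only, not the FDP, SDP, REDP, or RLDP algorithms of the paper; the edge format `(u, v, departure, duration)`, the assumption of strictly positive durations, and the function name are all hypothetical choices made for the example.

```python
import math


def earliest_arrival(edges, source, t_start=0):
    """Earliest arrival time at each reachable vertex.

    Each edge is (u, v, departure, duration) with duration > 0 (assumed).
    Scanning edges in order of departure time lets every edge be relaxed once.
    """
    arrival = {source: t_start}
    for u, v, depart, duration in sorted(edges, key=lambda e: e[2]):
        # The edge is usable only if we are already at u when it departs.
        if arrival.get(u, math.inf) <= depart:
            if depart + duration < arrival.get(v, math.inf):
                arrival[v] = depart + duration
    return arrival


if __name__ == "__main__":
    # Hypothetical temporal edges.
    edges = [
        ("a", "b", 1, 1),
        ("b", "c", 3, 2),
        ("a", "c", 2, 5),
        ("b", "d", 1, 1),
    ]
    print(earliest_arrival(edges, "a"))
```

In the example, vertex `d` is unreachable because the only edge into it departs from `b` before `b` can be reached, which shows how time order constrains paths in a temporal graph.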

Multiple Query Point Based Collective Spatial Keyword Querying

Spatial keyword search is a useful technique that enables users to find the spatial web objects they prefer. Since the objects spatially close to the query point may not fulfill all query objectives, a collective spatial keyword query aims to retrieve a group of objects that together cover all required query keywords while being suitably located in space. However, in some cases a query is issued jointly by several people in different locations, and the returned group of objects should not only cover all of their objectives but also be optimal with respect to all of the people involved. To this end, this paper studies the problem of multiple query point based collective spatial keyword querying (MCSKQ). Two novel algorithms, HCQ and BCQ, are proposed to support efficient collective query processing w.r.t. multiple query points. The experimental results and related analysis show that MCSKQ achieves good efficiency and accuracy.

Yun Li, Ziheng Wang, Jing Chen, Fei Wang, Jiajie Xu

Unsupervised Feature Selection for Noisy Data

Feature selection techniques are widely applied in a variety of data analysis tasks in order to reduce dimensionality. According to the type of learning, feature selection algorithms are categorized as supervised or unsupervised. In unsupervised learning scenarios, selecting features is a much harder problem because of the lack of class labels that would facilitate the search for relevant features. The difficulty of feature selection is amplified when the data is corrupted by different kinds of noise. Almost all traditional unsupervised feature selection methods are not robust against noise in the samples. These approaches have no explicit mechanism for detaching and isolating the noise, and thus they cannot produce an optimal feature subset. In this article, we propose an unsupervised approach for feature selection on noisy data, called Robust Independent Feature Selection (RIFS). Specifically, we choose a feature subset that contains most of the underlying information, using the same criteria as independent component analysis (ICA). Simultaneously, the noise is separated as an independent component. The isolation of representative noise samples is achieved using oblique factor rotation, whereas noise identification is performed using factor pattern loadings. Extensive experimental results on diverse real-life data sets show the efficiency and advantage of the proposed algorithm.

Kaveh Mahdavi, Jesus Labarta, Judit Gimenez

Tri-Level Cross-Domain Sign Prediction for Complex Network

Sign prediction is a fundamental research issue in complex network mining, but the high cost of data collection leads to insufficient data for prediction. Transfer learning can use the transferable knowledge in other networks to complete the learning tasks in the target network. However, when the inter-domain differences are large, it is difficult for existing methods to obtain useful transferable knowledge. We therefore propose a tri-level cross-domain model that uses inter-domain similarity and relativity to solve the sign prediction problem in complex networks (TCSP). The first level pre-classifies the source domain, the second level selects the key instances of the source domain, and the third level calculates the similarity between the source domain and the target domain to obtain the pseudo-labels of the target domain. These “labeled” instances are used to train the sign classifier and predict the sign in the target network. Experimental results on real complex network datasets verify the effectiveness of the proposed method.

Jiali Pang, Donghai Guan, Weiwei Yuan

An Efficient Mining Algorithm of Closed Frequent Itemsets on Multi-core Processor

In this paper, we improve NOV-CFI, a sequential algorithm for mining closed frequent itemsets in transaction databases. The improved algorithm, called SEQ-CFI, consists of three phases: the first phase quickly detects a Kernel_COOC array of co-occurrences and occurrences of kernel items in at least one transaction; the second phase builds the list of nLOOC-Trees based on Kernel_COOC and a binary matrix of the dataset (self-reduced search space); the last phase quickly mines closed frequent itemsets based on the nLOOC-Tree. Next, we parallelize the sequential algorithm to make effective use of multi-core processors; the parallel algorithm is called NPA-CFI. The experimental results show that the proposed algorithms perform better than other existing algorithms and suggest that the parallel NPA-CFI algorithm can be extended to distributed computing systems such as Hadoop and Spark.

Huan Phan

TSRuleGrowth: Mining Partially-Ordered Prediction Rules From a Time Series of Discrete Elements, Application to a Context of Ambient Intelligence

This paper presents TSRuleGrowth, an algorithm for mining partially-ordered rules on a time series. TSRuleGrowth takes principles from the state of the art of transactional rule mining, and applies them to time series. It proposes a new definition of the support, which overcomes the limitations of previous definitions. Experiments on two databases of real data coming from connected environments show that this algorithm extracts relevant usual situations and outperforms the state of the art.

Benoit Vuillemin, Lionel Delphin-Poulat, Rozenn Nicol, Laetitia Matignon, Salima Hassas

HGTPU-Tree: An Improved Index Supporting Similarity Query of Uncertain Moving Objects for Frequent Updates

Position uncertainty is a key feature of moving objects. Existing indexing technology for uncertain moving objects aims to improve query efficiency. However, when the positions of moving objects are updated frequently, the existing methods incur a high update cost. We propose an index structure for frequent position updates, the HGTPU-tree, which decreases the cost caused by frequent position updates of moving objects. The HGTPU-tree reduces the number of disk I/Os and the update cost by using a bottom-up update strategy and by reducing updates of moving objects in the same group. Furthermore, we propose a moving object group partition algorithm, STSG (Spatial Trajectory of Similarity Group), and a similar-group update algorithm for uncertain moving objects. Experiments show that the HGTPU-tree reduces memory cost and increases system stability compared to existing bottom-up indexes. We compared the HGTPU-tree with the TPU-tree, GTPU-tree, and TPU2M-tree. The results show that the HGTPU-tree is superior to the other three state-of-the-art index structures in update cost.

Mengqian Zhang, Bohan Li, Kai Wang

Robust Feature Selection Based on Fuzzy Rough Sets with Representative Sample

Fuzzy rough set theory is not only an objective mathematical tool for dealing with incomplete and uncertain information but also a powerful computing paradigm for feature selection. However, the existing fuzzy rough set models are sensitive to noise in feature selection. To solve this problem, a novel fuzzy rough set model that is robust to noise is studied in this paper, which expands the research on fuzzy rough set theory and broadens the application of feature selection. In this study, we propose a fuzzy rough set model with representative sample (RS-FRS), which deals better with noise. First, the fuzzy membership of a sample is defined and added into the construction of the RS-FRS model, which can increase the upper and lower approximations of RS-FRS and reduce the influence of noisy samples. The proposed model considers the fuzziness of the sample membership degree, and it can approximate other subsets of the domain space with the fuzzy equivalent approximation space more precisely. Furthermore, the RS-FRS model does not require parameters to be set in advance, which helps reduce model complexity and human intervention effectively. We also carefully study the related properties of the RS-FRS model and design a robust feature selection method based on RS-FRS with sample pair selection. Extensive experiments illustrate the robustness and effectiveness of the proposed model.

Zhimin Zhang, Weitong Chen, Chengyu Liu, Yun Kang, Feng Liu, Yuwen Li, Shoushui Wei

Classification and Clustering Methods

HUE-Span: Fast High Utility Episode Mining

High utility episode mining consists of finding episodes (sub-sequences of events) that have a high importance (e.g., high profit) in a sequence of events with quantities and weights. Though it has important real-life applications, the current problem definition has two critical limitations. First, it underestimates the utility of episodes by not taking into account all timestamps of minimal occurrences for utility calculations, which can result in missing high utility episodes. Second, the state-of-the-art UP-Span algorithm is inefficient on large databases because it uses a loose upper bound on the utility to reduce the search space. This paper addresses the first issue by redefining the problem to guarantee that all high utility episodes are found. Moreover, an algorithm named HUE-Span is proposed to find all patterns efficiently. It relies on a novel upper bound to reduce the search space and a novel co-occurrence-based pruning strategy. Experimental results show that HUE-Span not only finds all patterns but is also up to five times faster than UP-Span.

Philippe Fournier-Viger, Peng Yang, Jerry Chun-Wei Lin, Unil Yun

Clustering Noisy Temporal Data

Clustering time series data is frequently hampered by various noise components within the signal. These disturbances affect the ability of clustering to detect similarities across the various signals, which may result in poor clustering results. We propose a method, which first smooths out such noise using wavelet decomposition and thresholding, then reconstructs the original signal (with minimised noise) and finally undertakes the clustering on this new signal. We experimentally evaluate the proposed method on 250 signals that are generated from five classes of signals. Our proposed method achieves improved clustering results.

Paul Grant, Md Zahidul Islam
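
The decompose, threshold, and reconstruct pipeline described in the abstract above can be sketched as follows. The abstract does not name a wavelet family or thresholding rule, so the Haar wavelet and soft thresholding used here are assumptions chosen for brevity, and the function names are hypothetical; the clustering step is omitted.

```python
import math

SQRT2 = math.sqrt(2)


def haar_forward(signal):
    """Full Haar decomposition; the signal length must be a power of two."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("signal length must be a power of two")
    approx = list(signal)
    details = []  # detail coefficients per level, finest level first
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.append([(a - b) / SQRT2 for a, b in pairs])
        approx = [(a + b) / SQRT2 for a, b in pairs]
    return approx, details


def haar_inverse(approx, details):
    """Invert haar_forward, rebuilding from the coarsest level upward."""
    approx = list(approx)
    for level in reversed(details):
        rebuilt = []
        for s, d in zip(approx, level):
            rebuilt.append((s + d) / SQRT2)
            rebuilt.append((s - d) / SQRT2)
        approx = rebuilt
    return approx


def soft_threshold(x, thr):
    """Shrink x toward zero by thr; values within [-thr, thr] become zero."""
    return math.copysign(max(abs(x) - thr, 0.0), x)


def denoise(signal, thr):
    """Decompose, shrink the detail coefficients, and reconstruct."""
    approx, details = haar_forward(signal)
    shrunk = [[soft_threshold(d, thr) for d in level] for level in details]
    return haar_inverse(approx, shrunk)


if __name__ == "__main__":
    noisy = [4.0, 4.2, 3.9, 4.1, 8.0, 8.1, 7.9, 8.0]
    print(denoise(noisy, thr=0.3))
```

With a threshold of 0.3 the small fluctuations within each half of the example signal are removed while the large step between the halves survives, which is the property that makes the smoothed signals easier to cluster.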

A Novel Approach for Noisy Signal Classification Through the Use of Multiple Wavelets and Ensembles of Classifiers

Classification of time series signals can be crucial for many practical applications. While the existing classifiers may accurately classify pure signals, the existence of noise can significantly disturb the classification accuracy of these classifiers. We propose a novel classification approach that uses multiple wavelets together with an ensemble of classifiers to return high classification accuracy even for noisy signals. The proposed technique has two main steps. In Step 1, we convert raw signals into a useful dataset by applying multiple wavelet transforms, each from a different wavelet family or all from the same family with differing filter lengths. In Step 2, we apply the dataset processed in Step 1 to an ensemble of classifiers. We test on 500 noisy signals from five different classes. Our experimental results demonstrate the effectiveness of the proposed technique on noisy signals, compared to approaches that use either raw signals or a single wavelet transform.

Paul Grant, Md Zahidul Islam

Recommender Systems

Reminder Care System: An Activity-Aware Cross-Device Recommendation System

Alzheimer’s disease (AD) affects large numbers of elderly people worldwide and represents a significant social and economic burden on society, particularly in relation to the need for long term care facilities. These costs can be reduced by enabling people with AD to live independently at home for a longer time. The use of recommendation systems for the Internet of Things (IoT) in the context of smart homes can contribute to this goal. In this paper, we present the Reminder Care System (RCS), a research prototype of a recommendation system for the IoT for elderly people with cognitive disabilities. RCS exploits daily activities that are captured and learned from IoT devices to provide personalised recommendations. The experimental results indicate that RCS can inform the development of real-world IoT applications.

May S. Altulyan, Chaoran Huang, Lina Yao, Xianzhi Wang, Salil Kanhere, Yunajiang Cao

Similar Group Finding Algorithm Based on Temporal Subgraph Matching

Similar group search is an important approach for recommendation systems or social network analysis. However, the influence of the temporal features of a social network on the search for similar groups has been neglected. In this paper, we model the social network as a temporal graph and define the similar group in the temporal social network. Then, the T-VF2 algorithm is designed to search for similar groups through temporal subgraph matching. To evaluate our proposed algorithm, we also extend the VF2 algorithm with point-side collaborative filtering to perform temporal subgraph matching. Finally, extensive experiments show the effectiveness and efficiency of our proposed algorithm.

Yizhu Cai, Mo Li, Junchang Xin

Traditional PageRank Versus Network Capacity Bound

In a former paper [10] we simplified the proof of a theorem on personalized random walk that is fundamental to graph nodes clustering and generalized it to bipartite graphs for a specific case where the probability of random jump was proportional to the number of links of “personally preferred” nodes. In this paper, we turn to the more complex issue of graphs in which the random jump follows a uniform distribution.

Robert A. Kłopotek, Mieczysław A. Kłopotek

RecKGC: Integrating Recommendation with Knowledge Graph Completion

Both recommender systems and knowledge graphs can provide overall and detailed views of datasets, and each of them has been a hot research domain in its own right. However, recommending items with a pre-constructed knowledge graph or without one often limits the recommendation performance. Similarly, constructing and completing a knowledge graph without a target is insufficient for applications such as recommendation. In this paper, we address the problems of recommendation and knowledge graph completion together with a novel model named RecKGC that generates a completed knowledge graph and recommends items for users simultaneously. Comprehensive representations of users, items, and interactions/relations are learned in each respective domain, such as our attentive embeddings that integrate tuples in a knowledge graph for recommendation and our high-level interaction representations of entities and relations for knowledge graph completion. We join the tasks of recommendation and knowledge graph completion by sharing the comprehensive representations. As a result, the performance of recommendation and that of knowledge graph completion are mutually enhanced: the recommendation becomes more effective while the knowledge graph becomes more informative. Experiments validate the effectiveness of the proposed model on both tasks.

Jingwei Ma, Mingyang Zhong, Jiahui Wen, Weitong Chen, Xiaofang Zhou, Xue Li

PRME-GTS: A New Successive POI Recommendation Model with Temporal and Social Influences

Successive point-of-interest (POI) recommendation is an important research task that can recommend new POIs the user has not visited before. However, existing research on new successive POI recommendation ignores the integration of time information and social relations information, which can improve the prediction of the system. To solve this problem, we propose a new recommendation model called PRME-GTS that incorporates social relations and temporal information. Based on the framework of pair-wise ranking metric embedding, it models the relations between users, temporal information, points of interest, and social information. Experimental results on two datasets demonstrate that employing temporal information and social relations information can effectively improve the performance of successive POI recommendation.

Rubai Mao, Zhe Han, Zitu Liu, Yong Liu, Xingfeng Lv, Ping Xuan

Social Network and Social Media

Precomputing Hybrid Index Architecture for Flexible Community Search over Location-Based Social Networks

Community search is defined as finding query-based communities within simple graphs. One of the most important community models is the minimum degree subgraph, in which each vertex has at least k neighbours. With the rapid development of location-based devices, however, simple graphs are unable to handle the personal information found in Location-Based Social Networks (LBSNs), such as interests and spatial locations. Hence, this paper aims to construct a Precomputed Hybrid Index Architecture (PHIA) that enhances simple graphs to store and retrieve the information of LBSN users. The method consists of two stages: the first is precomputing, and the second is index construction. Numerical testing showed that our hybrid index approach is reasonable because of its flexibility in combining different dimensions by adapting the widely used k-core community model.

Ismail Alaqta, Junhu Wang, Mohammad Awrangjeb
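
The minimum degree community model mentioned in the abstract above, in which each vertex keeps at least k neighbours, corresponds to the k-core, which can be found by repeatedly peeling low-degree vertices. The sketch below shows only this standard peeling procedure on a hypothetical toy graph; it is not the PHIA index, and the function name and data are illustrative assumptions.

```python
def k_core(adj, k):
    """Return the vertices of the k-core of an undirected graph.

    adj maps each vertex to the set of its neighbours. Vertices with fewer
    than k remaining neighbours are removed until none are left to remove.
    """
    alive = set(adj)
    degree = {v: len(adj[v]) for v in adj}
    pending = [v for v in adj if degree[v] < k]
    while pending:
        v = pending.pop()
        if v not in alive:
            continue  # already peeled via an earlier entry
        alive.remove(v)
        for w in adj[v]:
            if w in alive:
                degree[w] -= 1
                if degree[w] < k:
                    pending.append(w)
    return alive


if __name__ == "__main__":
    # Hypothetical graph: a triangle 1-2-3 with a tail 3-4-5.
    adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
    print(k_core(adj, 2))
```

In the example, removing vertex 5 lowers the degree of vertex 4 below 2, so the tail is peeled away and only the triangle remains as the 2-core.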

Expert2Vec: Distributed Expert Representation Learning in Question Answering Community

Community question answering (CQA) has attracted increasing attention recently due to its potential as a de facto knowledge base. Expert finding in CQA websites also has considerably broad applications. Stack Overflow is one of the most popular question answering platforms and is often used in recent studies on domain expert recommendation. Despite the substantial progress seen recently, research on the direct representation of expert users is still lacking. Hence we propose Expert2Vec, a distributed expert representation learning approach for question answering communities, to boost domain expert recommendation. Word2Vec is used to preprocess the Stack Overflow dataset, which helps to generate representations of domain topics. Weight rankings are then extracted based on domains, and a variational autoencoder (VAE) is utilized to generate representations of user-topic information. Finally, a reinforcement learning framework is adopted with the user-topic matrix to improve it internally. Experiments show the adequate performance of our proposed approach in the recommendation system.

Xiaocong Chen, Chaoran Huang, Xiang Zhang, Xianzhi Wang, Wei Liu, Lina Yao

Improving the Link Prediction by Exploiting the Collaborative and Context-Aware Social Influence

The study of link prediction has attracted increasing attention with the boom in social networks. Researchers have used topological features of networks and attribute features of nodes to predict new links in the future or to find missing links in the current network. Some works take topics into consideration, but they do not consider the social influence that can affect link prediction. This leads us to introduce social influence into topics to find contexts. In this paper, we propose a novel model under the collaborative filtering framework and improve link prediction by exploiting context-aware social influence. We also adopt a clustering algorithm that uses topological features, so that social influence, topics, and topological structure are all incorporated to improve the quality of link prediction. We test our method on the Digg data set, and the experimental results demonstrate that our method performs better than traditional approaches.

Han Gao, Yuxin Zhang, Bohan Li

A Causality Driven Approach to Adverse Drug Reactions Detection in Tweets

Social media sites such as Twitter are platforms where users express their feelings, opinions, and experiences; for example, users often share their experiences with medications, including adverse drug reactions, in their tweets. Mining and detecting this information on adverse drug reactions could be immensely beneficial for pharmaceutical companies, drug-safety authorities, and medical practitioners. However, the automatic extraction of adverse drug reactions from tweets is a nontrivial task due to the short and informal nature of tweets. In this paper, we aim to detect adverse drug reaction mentions in tweets, assuming that there exists a cause-effect relationship between drug names and adverse drug reactions. We propose a causality-driven neural network-based approach to detect adverse drug reactions in tweets. Our approach applies a multi-head self-attention mechanism to learn word-to-word interactions. We show that when the causal features are combined with the word-level semantic features, our approach can outperform several state-of-the-art adverse drug reaction detection approaches.

Humayun Kayesh, Md. Saiful Islam, Junhu Wang

Correlate Influential News Article Events to Stock Quote Movement

This study investigates the influence of digital media on financial equity stocks. A knowledge-based decision support system is an important element of investment planning. The stock exchange is becoming one of the major areas of investment. Various factors affect the stock exchange, among which social media and digital news articles are found to be major ones. As the world is more connected now than a decade ago, social media plays a major role in decision making and changes how people perceive things. A robust model is therefore needed for forecasting stock price movements using social media news or articles. Following this line of research, we assess the performance of correlation-based models to check their rigor on large data sets of stocks and news articles. We evaluate the stock quotes of various entities across the world on the day a news article is published. Conventional sentiment analysis is applied to the news article events to extract polarity by categorizing positive and negative statements, and their influence on the stocks is studied based on correlation.

Arun Chaitanya Mandalapu, Saranya Gunabalan, Avinash Sadineni, Taotao Cai, Nur Al Hasan Haldar, Jianxin Li

Top-N Hashtag Prediction via Coupling Social Influence and Homophily

Given the wide acceptance of social media, social influence is starting to play a very important role. Homophily has been widely accepted as a confounding factor for social influence. While the literature attempts to identify and gauge the magnitude of the effects of social influence and homophily separately, limited attention has been given to using both sources for social behavior computing and prediction. In this work, we address this shortcoming and propose neighborhood-based collaborative filtering (CF) methods that use behavior interior dimensions extracted from domain knowledge to model data interdependence along the time factor. Extensive experiments on Twitter data demonstrate that the behavior interior based CF methods produce better prediction results than state-of-the-art approaches. Furthermore, considering the impact of topic communication modalities (topic dialogicity, discussion intensiveness, discussion extensibility) on interior dimensions leads to an improvement of 3%. Finally, the joint consideration of social influence and homophily leads to a performance improvement of as much as 80.8% in terms of accuracy compared to existing approaches.

Can Wang, Zhonghao Sun, Yunwei Zhao, Chi-Hung Chi, Willem-Jan van den Heuvel, Kwok-Yan Lam, Bela Stantic

Unfolding the Mixed and Intertwined: A Multilevel View of Topic Evolution on Twitter

Despite extensive research efforts in information diffusion, most previous studies focus on the speed and coverage of the diffused information in the network. A better understanding of the semantics of information diffusion can provide critical information for domain-specific and socio-economic phenomenon studies based on diffused topics. More specifically, what is still lacking is (a) a comprehensive understanding of the multiplexity in the diffused topics, especially with respect to the temporal relations and inter-dependence between topic semantics; and (b) an understanding of the similarities and differences in these dimensions under different diffusion degrees. In this paper, the semantics of a topic is described by sentiment, controversy, content richness, hotness, and trend momentum. The multiplexity in the diffusion mechanisms is also considered, namely hashtag cascade, URL cascade, and retweet. Our study is conducted on 840,362 topics from about 42 million tweets posted from January to October 2010. The results show that the topics are not randomly distributed in the Twitter space but exhibit a unique pattern at each diffusion degree, with a significant correlation among content richness, hotness, and trend momentum. Moreover, under each diffusion mechanism, we also find a remarkable similarity among topics, especially when considering shifting and scaling in both the temporal and amplitude scales of these dimensions.

Yunwei Zhao, Can Wang, Han Han, Willem-Jan van den Heuvel, Chi-Hung Chi, Weimin Li

Behavior Modeling and User Profiling

DAMTRNN: A Delta Attention-Based Multi-task RNN for Intention Recognition

Recognizing human intentions from electroencephalographic (EEG) signals is attracting extraordinary attention from the artificial intelligence community because of its promise in providing non-muscular forms of communication and control to those with disabilities. So far, studies have explored correlations between specific segments of an EEG signal and an associated intention. However, challenges remain. Among these, vector representations suffer from the enormous amounts of noise that characterize EEG signals. Identifying the correlations between signals from adjacent sensors on a headset is still difficult. Further, research has not yet reached the point where learning models can accept decomposed EEG signals to capture the unique biological significance of the six established frequency bands. In pursuit of a more effective intention recognition method, we developed DAMTRNN, a delta attention-based multi-task recurrent neural network, for human intention recognition. The framework accepts divided EEG signals as inputs, and each frequency range is modeled separately but concurrently with a series of LSTMs. A delta attention network fuses the spatial and temporal interactions across different tasks into high-impact features, which captures correlations over longer time spans and further improves recognition accuracy. Comparative evaluations between DAMTRNN and 14 state-of-the-art methods and baselines show that DAMTRNN achieves a record-setting accuracy of 98.87%.

Weitong Chen, Lin Yue, Bohan Li, Can Wang, Quan Z. Sheng

DeepIdentifier: A Deep Learning-Based Lightweight Approach for User Identity Recognition

Identifying a user precisely from mobile-device-based sensing information is a challenging and practical issue, as it is usually affected by context and human-action interference. We propose a novel deep learning-based lightweight approach called DeepIdentifier. More specifically, we design a powerful and efficient block, namely the funnel block, as the core component of our approach, and further adopt depthwise separable convolutions to reduce the computational overhead of the model. Moreover, a multi-task learning approach is applied to DeepIdentifier, which learns to recognize the identity and reconstruct the signal of the input sensor data simultaneously during the training phase. The experimental results on two real-world datasets demonstrate that our proposed approach significantly outperforms other existing approaches in terms of efficiency and effectiveness, showing up to 17 times and 40 times improvement over state-of-the-art approaches in terms of model size reduction and computational cost respectively, while offering even higher accuracy. To the best of our knowledge, DeepIdentifier is the first lightweight deep learning approach for solving the identity recognition problem. The dataset we gathered, together with the implemented source code, is publicly available to support the research community.

Meng-Chieh Lee, Yu Huang, Josh Jia-Ching Ying, Chien Chen, Vincent S. Tseng
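
To make the saving from depthwise separable convolutions mentioned in the abstract above more concrete, the following sketch compares weight counts for one layer. It uses the standard textbook counts with biases ignored; the layer sizes are hypothetical and are not taken from DeepIdentifier.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(k, c_in, c_out):
    """Weights in a k x k depthwise convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
k, c_in, c_out = 3, 64, 128
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
reduction = standard / separable
print(standard, separable, round(reduction, 2))
```

For this layer the separable form needs roughly an eighth of the weights; the ratio approaches `k * k` as the number of output channels grows.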

Domain-Aware Unsupervised Cross-dataset Person Re-identification

We focus on the person re-identification (re-id) problem of matching people across non-overlapping camera views. While most existing works rely on an abundance of labeled exemplars, we consider a more difficult unsupervised scenario, where no labeled exemplar is provided. One solution for unsupervised re-id that has attracted much attention in recent research is cross-dataset transfer learning. It utilizes knowledge from multiple source datasets from different domains to enhance the unsupervised learning performance on the target domain. In previous works, much effort has been devoted to extracting generic and robust common appearance representations across domains. However, we observe that there are also particular appearances in different domains. Simply ignoring these domain-unique appearances will mislead the matching schema in re-id applications. Few unsupervised cross-dataset algorithms have been proposed to learn the common appearances across multiple domains, and even fewer of them consider the domain-unique representations. In this paper, we propose a novel domain-aware representation learning algorithm for the unsupervised cross-dataset person re-id problem. The proposed algorithm not only learns common appearances across datasets but also captures the domain-unique appearances on the target dataset via minimization of the overlapped signal supports across different domains. Extensive experimental studies on benchmark datasets show the superior performance of our algorithm over state-of-the-art algorithms. Analysis of selected samples also verifies the diversity learning ability of our algorithm.

Zhihui Li, Wenhe Liu, Xiaojun Chang, Lina Yao, Mahesh Prakash, Huaxiang Zhang

Invariance Matters: Person Re-identification by Local Color Transfer

Person re-identification is a complex image retrieval problem. The color of the image is distorted due to changes in illumination, etc., which makes pedestrian recognition more challenging. In this paper, we take the conditional image, the reference image and its corresponding clothing segmentation image as input, and then restore the true color of the person through color conversion. In addition, we calculate the similarity between the conditional image and the image dataset by the chromatic aberration similarity and the clothing segmentation invariance. We evaluated the proposed method on a public dataset. A large number of experimental results show that the method is effective.

Ying Niu, Chunmiao Yuan, Kunliang Liu, Yukuan Sun, Jiayu Liang, Guanghao Jin, Jianming Wang

Research on Interactive Intent Recognition Based on Facial Expression and Line of Sight Direction

Interactive intent recognition refers to the discrimination and prediction of whether a person (user) wants to interact with the robot during the human-robot interaction (HRI) process. Interactive intent recognition is one of the key technologies of intelligent robots. This paper mainly studies interactive intent recognition based on visual images, which is of great significance for improving the intelligence of robots. When people communicate, they often interact differently according to each other’s emotional state. At present, vision-based interactive intent recognition methods mainly use the user’s gesture, line of sight direction, and head posture to judge the interaction intention, and no interactive intent recognition method based on the user’s emotional state has been found. Therefore, this paper proposes an interactive intent recognition algorithm that combines facial expression features and line of sight direction. The experimental results show that the accuracy of the intent recognition algorithm with expression recognition is 93.3%, and the accuracy of the intent recognition algorithm without expression recognition is 83%. Therefore, the performance of the intent recognition algorithm is significantly improved after expression recognition is added.

Siyu Ren, Guanghao Jin, Kunliang Liu, Yukuan Sun, Jiayu Liang, Shiling Jiang, Jianming Wang

Text and Multimedia Mining

Frontmatter

Mining Summary of Short Text with Centroid Similarity Distance

Text summarization aims at producing a concise summary that preserves key information. Many textual inputs are short and do not fit the standard techniques designed for longer text. Most of the existing short text summarization approaches rely on metadata such as the authors or reply networks. However, not all raw textual data can provide such information. In this paper, we present our method to summarize short text using a centroid-based method with word embeddings. In particular, we consider the task when there is no metadata other than the text itself. We show that the centroid embeddings approach can be applied to short text to capture semantically similar sentences for summarization. With a further clustering strategy, we were able to identify relevant sub-topics that further improve the context diversity in the overall summary. The empirical evaluation demonstrates that our approach can outperform other methods on two annotated LREC track datasets.

Nigel Franciscus, Junhu Wang, Bela Stantic
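
The centroid-based selection described in the abstract above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `TOY_VECTORS` is a hypothetical 2-D embedding table standing in for trained word embeddings, each text is embedded as the mean of its word vectors, the centroid is the mean of the text embeddings, and the clustering step for sub-topics is omitted.

```python
import math

# Hypothetical 2-D word vectors; a real system would load trained embeddings.
TOY_VECTORS = {
    "flood": (1.0, 0.0), "rain": (0.9, 0.1), "storm": (0.8, 0.2),
    "river": (0.9, 0.3), "pizza": (0.0, 1.0), "lunch": (0.1, 0.9),
}

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))

def embed(text):
    """Embed a short text as the mean of its known word vectors, or None."""
    known = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    return mean_vector(known) if known else None

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def summarize(texts, top_k=1):
    """Return the top_k texts whose embeddings lie closest to the centroid."""
    embedded = [(t, embed(t)) for t in texts]
    embedded = [(t, v) for t, v in embedded if v is not None]
    if not embedded:
        return []
    centroid = mean_vector([v for _, v in embedded])
    ranked = sorted(embedded, key=lambda tv: cosine(tv[1], centroid), reverse=True)
    return [t for t, _ in ranked[:top_k]]

posts = ["rain storm flood", "flood river rain", "pizza lunch"]
summary = summarize(posts, top_k=2)
```

Because two of the three posts concern the same topic, the centroid leans toward them and the off-topic post is ranked last.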

Children’s Speaker Recognition Method Based on Multi-dimensional Features

In everyday life, the voice signals that people collect are essentially mixed signals, which mainly include information related to speaker characteristics such as gender, age, and emotional state. This paper analyzes the commonality and characteristics of traditional single-dimensional speaker information recognition, and carries out a child-specific analysis of common acoustic feature parameters such as prosodic features, sound quality features, and spectral-based features. Then, considering the temporal characteristics of voice, a multi-channel model combining the Time-Delay Neural Network (TDNN) model, the Bidirectional Long Short-Term Memory model, and the attention mechanism is trained to form a solution for children’s speaker recognition. A large number of experimental results show that, while maintaining the accuracy of age and gender recognition, higher accuracy of children’s voiceprint recognition can be obtained.

Ning Jia, Chunjun Zheng, Wei Sun

Constructing Dictionary to Analyze Features Sentiment of a Movie Based on Danmakus

As a new commenting mode, danmaku not only shows the subjective attitude or emotion of the reviewer, but also offers immediacy and interactivity compared with traditional comments. In order to improve the existing film evaluation mechanism of mainstream film rating websites, this paper trains a word vector model on movie danmaku and iteratively builds a lexicon of movie feature words. Then, using the Boson (https://bosonnlp.com/dev/resource) sentiment dictionary and the TF-IDF algorithm, we build the feature-sentiment dictionary. Finally, we combine the feature-sentiment dictionary with a dictionary of degree words to calculate the sentiment score of each feature based on the movie danmaku. Our experimental results are compared with the scores of the film rating website “Mtime” (http://www.mtime.com/). The comparison shows that our method of analyzing and computing the sentiment of movie features is not only novel but also effective.

Jie Li, Yukun Li

Single Image Dehazing Algorithm Based on Sky Region Segmentation

In this paper, a hybrid image defogging approach based on region segmentation is proposed to address the shortcomings of the dark channel prior algorithm in defogging sky regions. The first stage of the proposed approach segments a foggy image into sky and non-sky regions, taking advantage of Meanshift and edge detection with embedded confidence. In the second stage, an improved dark channel prior algorithm is employed to defog the non-sky region. Finally, the sky area is processed by the DehazeNet algorithm, which relies on deep convolutional neural networks. The simulation results show that the proposed hybrid approach addresses the problem of color distortion associated with sky regions in foggy images. The approach greatly improves image quality indices including information entropy, visibility ratio of the edges, average gradient, and saturation percentage, with a very fast computation time, which indicates the strong performance of the model.

Weixiang Li, Wei Jie, Somaiyeh Mahmoudzadeh

Online Aggregated-Event Representation for Multiple Event Detection in Videos

Event detection is used to locate the frames corresponding to events of interest in given videos. Real-world videos contain multiple events of interest, and they are rarely segmented. Existing online methods can only detect segments containing single event instances, which is not suitable for processing videos with several event instances. Methods for detecting multiple events exist, but they are all offline and relatively inefficient. To handle the online detection of several events, we propose a novel framework with three modules: event proposal generation, aggregated-event representation, and refined detection. The first module locates time intervals that are likely to contain target events, termed proposals. The second module aggregates all events before the current time to form a temporal context that is used to generate initial detection results for multiple events. The refined detection module finally refines the results based on event proposals and object detection. The proposed method achieves a detection accuracy of 24.88% on the multi-event dataset Charades, which is higher than that of state-of-the-art methods.

Molefe Vicky Mleya, Weiqi Li, Jiayu Liang, Kunliang Liu, Yunkuan Sun, Guanghao Jin, Jianming Wang

Standard Deviation Clustering Combined with Visual Psychological Test Algorithm for Image Segmentation

Detecting the visually salient area of an image is beneficial for image segmentation, image recognition, and adaptive compression applications. It makes an object, a person, or some pixels stand out against the background of the image and provides support for image recognition and target detection. The detection can simplify computer vision image processing and improve the effect and efficiency of computer visual inspection. This paper introduces a saliency detection method that requires no manual intervention and uses brightness and color space decomposition, negative map solution, and standard deviation to find the super-distance pixels in the image. Clustering is used to separate object regions from the image background and to output an RGB color image of the salient objects. Moreover, the method can accurately highlight the object contour and internal pixels. This method studies characteristics of the original pixels such as brightness or color and uses basic image features to achieve image saliency detection. It has high adaptive detection ability, low time complexity, and high computational efficiency.

Zhenggang Wang, Jin Jin, Zhong Liu

Fast Video Clip Retrieval Method via Language Query

The goal of video clip retrieval is to find, in massive video data, the video clips that match a natural language query. The boom in video-based social media, the growing amount of video data, and the increasing complexity of video content have created challenges for video retrieval. Existing methods rely on more complex descriptive paragraphs to match the corresponding videos and then associate each sentence with a specific segment of interest, which requires richer language query supervision. In this paper, we aim to improve efficiency by generating proper descriptions from the videos and searching for clips only in the candidate videos whose descriptions match the queries. Specifically, our method is a top-down framework that divides the task into two stages. The upper stage is a coarse retrieval that selects candidate videos according to their descriptions. The bottom stage locates video clips by matching the queries with candidate clips through the matching strategy. We compared our method with existing methods on the Charades-STA dataset, and the experimental data show a remarkable improvement in performance.

Pengju Zhang, Chunmiao Yuan, Kunliang Liu, Yukuan Sun, Jiayu Liang, Guanghao Jin, Jianming Wang

Research on Speech Emotional Feature Extraction Based on Multidimensional Feature Fusion

In the field of speech processing, speech emotion recognition is a challenging task with broad application prospects. Since the effectiveness of the speech feature set directly affects the accuracy of speech emotion recognition, research on effective features is one of the key issues in speech emotion recognition. Emotional expression and individualized features are often related, so it is often difficult to find generalized effective speech features, which is one of the main research topics of this paper. A general emotional feature representation of the speech signal needs to be generated from the perspective of both local and global features: (1) The spectrogram and a Convolutional Recurrent Neural Network (CRNN) are used to construct the speech emotion recognition model, which can effectively learn to represent the spatial characteristics of the emotional information and obtain the aggravated local feature information. (2) Using Low-Level acoustic Descriptors (LLD), feature representations of limited dimensions, such as energy, fundamental frequency, spectrum, and statistical features based on these low-level features, are screened through a large number of experiments to obtain the global feature description. (3) The previous features are combined, and the performance of the various features in emotion recognition is evaluated on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) emotional corpus, verifying the accuracy and representativeness of the features obtained in this paper.

Chunjun Zheng, Chunli Wang, Wei Sun, Ning Jia

Improved Algorithms for Zero Shot Image Super-Resolution with Parametric Rectifiers

Recently, a novel Zero-Shot Super-Resolution (ZSSR) method was proposed to generate high-resolution (HR) images from their low-resolution (LR) counterparts. ZSSR employs a convolutional neural network (CNN) to represent transformations from LR images to HR images and is trained on a single image. ZSSR achieves state-of-the-art performance on both real low-resolution images (e.g., historic images and images taken with a mobile phone) and several benchmark datasets (e.g., Set 5 and Set 14). However, the training of the ZSSR CNN is not stable, since a rectifier is used as the activation function and a custom learning rate adjustment policy is used in ZSSR. In this paper, we use a parametric rectifier as the activation function and present an improved algorithm for the training of ZSSR. Experimental results demonstrate that the proposed method outperforms ZSSR in terms of both reconstruction accuracy and speed on two benchmark datasets, Set 5 and Set 14.

Jiayi Zhu, Senjian An, Wanquan Liu, Ling Li
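
For readers unfamiliar with the term, the difference between a plain rectifier and a parametric rectifier can be sketched as below. This follows the commonly used PReLU definition, with a learnable slope `a` for negative inputs; the value `a = 0.25` is only an illustrative assumption, and nothing here reflects the specific training algorithm of the paper.

```python
def relu(x):
    """Plain rectifier: negative inputs are clamped to zero."""
    return max(0.0, x)

def prelu(x, a):
    """Parametric rectifier: negative inputs are scaled by the slope a."""
    return max(0.0, x) + a * min(0.0, x)

def prelu_grad_wrt_a(x):
    """Derivative of prelu(x, a) with respect to a; nonzero only for x < 0."""
    return min(0.0, x)

a = 0.25  # hypothetical slope; in practice it is learned with the weights
samples = [-2.0, -0.5, 0.0, 1.5]
relu_out = [relu(x) for x in samples]
prelu_out = [prelu(x, a) for x in samples]
```

Unlike the plain rectifier, the parametric form passes a scaled signal for negative inputs, so those units still receive gradient during training.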

Spatial-Temporal Data

Frontmatter

Spatial-Temporal Recurrent Neural Network for Anomalous Trajectories Detection

With the aim of improving the quality of taxi service and protecting the interests of passengers, anomalous trajectory detection is attracting increasing attention. Most of the existing methods concentrate on the coordinate information of trajectories and learn the similarities between anomalous trajectories from a large number of coordinate sequences. These methods ignore spatial-temporal relationships and the particularity of the whole trajectory. Through data analysis, we find that there are significant differences between normal and anomalous trajectories in terms of spatial-temporal characteristics. Meanwhile, a Recurrent Neural Network can use trajectory embedding to capture the sequential information in the trajectory. Consequently, we propose an efficient method named Spatial-Temporal Recurrent Neural Network (ST-RNN) that uses both the coordinate sequence and the spatial-temporal sequence. ST-RNN exploits the strengths of the Recurrent Neural Network (RNN) in learning sequence information and adds an attention mechanism to the RNN to improve the performance of the model. The application of Spatial-Temporal Laws in anomalous trajectory detection also has a positive effect. Several experiments on a real-world dataset demonstrate that the proposed ST-RNN achieves state-of-the-art performance in most cases.

Yunyao Cheng, Bin Wu, Li Song, Chuan Shi

Spatiotemporal Crime Hotspots Analysis and Crime Occurrence Prediction

Advances in technology in every aspect of daily life have enabled an expanded analytical approach to crime. Crime is a foremost problem and a top-priority concern for individuals, communities, and governments. Increasing possibilities to track crime events give public organizations and police departments the opportunity to collect and store detailed data, including spatial and temporal information. Thus, exploratory analysis and data mining have become an important part of the current methodology for detecting and forecasting crime development. Spatiotemporal crime hotspot analysis is an approach to analyzing and identifying different crime patterns, relations, and trends, together with the identification of highly concentrated crime areas. In this paper, spatiotemporal crime hotspot analysis was performed on a dataset from the city of Chicago. First, we explored the spatiotemporal characteristics of crime in the city. Second, we explored the time series trends of the top five crime types. Third, a crime prediction model based on the seasonal autoregressive integrated moving average model (SARIMA) is presented, and its results are compared with those of Long Short-Term Memory (LSTM), a recently developed deep learning model for forecasting time series data. The results show that LSTM outperforms the SARIMA model.

Niyonzima Ibrahim, Shuliang Wang, Boxiang Zhao

Medical and Healthcare Data/Decision Analytics

Frontmatter

Fast Bat Algorithm for Predicting Diabetes Mellitus Using Association Rule Mining

In recent years, Association Rules (ARs) have been among the most important Data Mining (DM) tools for extracting useful information stored in large databases. Motivated by the success of population-based metaheuristics in dealing with this amount of data, we propose a faster variant of the Bat algorithm based on Association Rule Mining (ARM). Our approach is evaluated on a real database of a population with and without diabetes. The proposed algorithm has better optimization accuracy and time complexity than the previous version of the algorithm.

Hend Amraoui, Faouzi Mhamdi, Mourad Elloumi

Using a Virtual Hospital for Piloting Patient Flow Decongestion Interventions

It is beyond the capacity of the human mind to process large amounts of interdependent information, such as predicting the dynamic behavior of a complex system and evaluating the short- and long-term effects of potential interventions aimed at improving its operations. At the same time, it is extremely costly to test these interventions on the real-world system that is to be improved. Fortunately, we have moved to an era where advancements in computing and software technology allow us to build virtual complex systems (simulation models) that can serve as risk-free digital platforms for running pilot experiments with potential system interventions and obtaining comparative data for decision support and optimization. This paper presents two case studies in a healthcare setting, where a simulation model named HESMAD (Hospital Event Simulation Model: Arrivals to Discharge) was applied to pilot potential interventions, proposed by hospital professionals or researchers, that are aimed at minimizing hospital patient flow congestion episodes. It was demonstrated that simulation modelling is not only an effective approach to conducting virtual experiments for evaluating intervention ideas proposed by healthcare professionals, but also an ideal vehicle for piloting scientific research outcomes from data science researchers. Some experience-based discussions of various issues involved in simulation modelling, such as validation of the simulation model and interpretation of simulation results, are also provided.

Shaowen Qin

Deep Interpretable Mortality Model for Intensive Care Unit Risk Prediction

Estimating the mortality of patients plays a fundamental role in an intensive care unit (ICU). Currently, most learning approaches are based on deep learning models. However, these approaches to mortality prediction suffer from two problems: (i) the specificity of causes of death is not considered in the learning process, because different diseases and symptoms are mixed together without diversification and localization; (ii) the learning outcome of the mortality prediction is not self-explanatory for clinicians. In this paper, we propose a Deep Interpretable Mortality Model (DIMM), which employs Multi-Source Embedding, Gated Recurrent Units (GRU), an attention mechanism, and focal loss to predict mortality. We strengthened the mortality prediction by considering the different clinical measures, medical treatments, and the heterogeneity of the disease. More importantly, for the first time, this framework uses a separate evidence-based interpreter named Highlighter to interpret the prediction model, which makes the prediction understandable and trustworthy to clinicians. We demonstrate that our approach achieves state-of-the-art performance in mortality prediction and can produce interpretable predictions for four different diseases.

Zhenkun Shi, Weitong Chen, Shining Liang, Wanli Zuo, Lin Yue, Sen Wang
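
The focal loss named in the abstract above is commonly used for imbalanced classification because it down-weights examples the model already classifies well. A minimal sketch of the standard binary form follows; the defaults `gamma=2.0` and `alpha=0.25` are widely used illustrative values and are assumptions here, not settings reported for DIMM.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted probability p of class 1 and label y.

    FL = -alpha_t * (1 - p_t) ** gamma * log(p_t)
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    p_t = min(max(p_t, 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# A confidently correct prediction versus a confidently wrong one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)

# With gamma = 0 the modulating factor vanishes and the loss reduces to
# alpha-weighted cross-entropy.
weighted_ce = focal_loss(0.9, 1, gamma=0.0, alpha=0.5)
```

The easy example contributes far less to the total loss than the hard one, which shifts training effort toward rare or difficult cases.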

Causality Discovery with Domain Knowledge for Drug-Drug Interactions Discovery

Bayesian Network Probabilistic Graphs have recently been applied to the problem of discovering drug-drug interactions, i.e., the identification of drugs that, when consumed together, produce an unwanted side effect. These methods have the advantage of being explainable: the cause of the interaction is made explicit. However, they suffer from two intrinsic problems: (1) the high time complexity of computing causation, which is exponential; and (2) the difficulty of identifying causality directions, i.e., it is difficult to determine from drug-drug interaction databases whether a drug causes an adverse effect or, vice versa, an adverse effect causes a drug to be consumed. While solutions for identifying the causality direction exist, e.g., the CARD method, they assume statistical independence between the drug pairs considered for interaction, and real data often does not satisfy this condition. In this paper, we propose a novel causality discovery algorithm for drug-drug interactions that goes beyond these limitations: Domain-knowledge-driven Causality Discovery (DCD). In DCD, a knowledge base that contains known drug-side effect pairs is used to prime a greedy drug-drug interaction algorithm that detects the drugs that, when consumed together, cause a side effect. This algorithm resolves the drug-drug interaction discovery problem in $$O(n^2)$$ time and provides the causal direction of combined causes and their effect, without assuming statistical independence of drug intake. Comprehensive experiments on real-world and synthetic datasets show that the proposed method is more effective and efficient than current state-of-the-art solutions, while also addressing a number of their drawbacks, including the high time complexity and the strong assumptions about real-world data that are often violated.

Sitthichoke Subpaiboonkit, Xue Li, Xin Zhao, Harrisen Scells, Guido Zuccon

Personalised Medicine in Critical Care Using Bayesian Reinforcement Learning

Patients with similar conditions in the intensive care unit (ICU) may react differently to a given treatment. Effective personalised medicine can help save patient lives. The availability of recorded ICU data provides huge potential to train and develop such systems. However, there is no ground truth for the best treatments, which makes existing supervised learning-based methods inappropriate. In this paper, we propose clustering-based Bayesian reinforcement learning. First, we transformed the multivariate time series patient records into a real-time Patient Sequence Model (PSM). After that, we computed the likelihood of treatment effects for all patients and clustered them on that basis. Finally, we applied Bayesian reinforcement learning to derive personalised policies. We tested our proposed method using 11,791 ICU patient records from the MIMIC-III database. Results show that we are able to cluster patients based on their treatment effects. In addition, our method provides better explainability and time-critical recommendations, which are very important in a real ICU setting.

Chandra Prasetyo Utomo, Hanna Kurniawati, Xue Li, Suresh Pokharel

TDDF: HFMD Outpatients Prediction Based on Time Series Decomposition and Heterogenous Data Fusion in Xiamen, China

Hand, foot and mouth disease (HFMD) is a common infectious disease in global public health. In this paper, the time series decomposition and heterogeneous data fusion (TDDF) method is proposed to enhance features in the performance of HFMD outpatients prediction. The TDDF first represents meteorological features and Baidu search index features with the consideration of lags, then those features are fused into decomposed historical HFMD cases to predict coming outpatient cases. Experimental results and analyses on the real collected records show the efficiency and effectiveness of TDDF on regression methods.

Zhijin Wang, Yaohui Huang, Bingyan He, Ting Luo, Yongming Wang, Yingxian Lin

Other Applications

Frontmatter

Efficient Gaussian Distance Transforms for Image Processing

This paper presents the Gaussian distance transform (GDT) of images and demonstrates its applications to image partition and image filtering. The time complexity of the naive implementation of GDT is quadratic in the image size and is thus computationally intractable for real-time applications and for high-resolution images. To address this issue, we investigate the properties of GDT and show that GDT can be computed in linear time using well-known matrix search algorithms. Experimental results are provided to show the applications of GDT to image partition and image filtering.

Senjian An, Yiwei Liu, Wanquan Liu, Ling Li

Tourist’s Tour Prediction by Sequential Data Mining Approach

This paper addresses the problem of predicting the future behaviour of a tourist based on the past behaviour of individual tourists. Individual behaviour is naturally an indicator of the behaviour of other tourists. The prediction of tourist movement plays a crucial role in tourism marketing, both to create demand and to assist tourists in decision-making. With advances in information and communication technology, social media platforms generate data from millions of people from different countries during their travels. The main objective of this paper is to consider sequential data mining methods to predict tourist movement based on Instagram data. The rules that emerge from these methods are exploited to predict future behaviour. The originality of this approach lies in combining pattern mining, to reduce the size of the data, with automata, to condense the rules. Paris, the capital city of France, is selected to demonstrate the utility of the proposed methodology.

Lilia Ben Baccar, Sonia Djebali, Guillaume Guérard

TOM: A Threat Operating Model for Early Warning of Cyber Security Threats

Threat profiling helps reveal current attack trends and underscores the significance of specific vulnerabilities, and hence serves as a means of providing early warning of potential attacks. However, the existing approaches to threat profiling are mainly rule-based and depend on domain experts’ knowledge, which limits their applicability in the automated processing of cyber threat information from heterogeneous sources, e.g., cyber threat intelligence information from open sources. Threat profiling models based on analytic approaches, on the other hand, are potentially capable of automatically discovering hidden patterns in a massive volume of information. This paper proposes to apply data analytic approaches to develop threat profiling models that identify potential threats by analyzing a large number of cyber threat intelligence reports from open sources, extract information from those reports, and represent it in a structure that facilitates automated risk assessment, thereby achieving early warning of likely cyber attacks. We introduce the Threat Operating Model (TOM), which captures important information about the identified cyber threats and can be implemented as an extension of the Structured Threat Information eXpression (STIX). Both a matrix-decomposition-based semi-supervised method and a term-frequency-based unsupervised method are proposed. The experimental results demonstrate fair effectiveness (accuracy around 0.8) and robust performance across different temporal periods.

Tao Bo, Yue Chen, Can Wang, Yunwei Zhao, Kwok-Yan Lam, Chi-Hung Chi, Hui Tian

Prediction for Student Academic Performance Using SMNaive Bayes Model

Predicting students’ academic performance is very important for their future development. Every year, a large number of students cannot graduate from college on time for various reasons. Nowadays, a large volume of student academic data is generated in the process of promoting education informatization. It has become critical to predict student performance and help students graduate on time by making the best use of these data. Machine learning models that predict student performance are widely available. However, some existing models still suffer from low prediction accuracy. To solve this problem, we propose the SMNaive Bayes (SMNB) model, which integrates Sequential Minimal Optimization (SMO) and Naive Bayes to make the prediction results more accurate. The basic idea is that the model predicts students’ performance in professional courses from their performance in basic courses in the previous stage. In particular, the SMO algorithm is used in the first step to predict students’ academic performance and produce initial results; Naive Bayes then decides on the inconsistent results of the initial prediction; lastly, the final predictions of students’ professional course performance are produced. To test the effectiveness of the proposed model, we conducted extensive experiments comparing SMNB against four prediction methods. The experimental results demonstrate that the proposed SMNB model is superior to all the compared methods.

Baoting Jia, Ke Niu, Xia Hou, Ning Li, Xueping Peng, Peipei Gu, Ran Jia

Chinese Sign Language Identification via Wavelet Entropy and Support Vector Machine

Sign language recognition is significant for reducing the communication barrier between hearing-impaired people and healthy people. This paper proposes a novel Chinese sign language identification approach in which wavelet entropy is adopted for feature reduction and a support vector machine is employed for classification. The experiment was evaluated with 10-fold cross validation. Our method (WE+SVM) yielded an overall accuracy of 85.69 ± 0.59%. The results indicate that this method is effective and superior to three state-of-the-art approaches.

Xianwei Jiang, Zhaosong Zhu

An Efficient Multi-request Route Planning Framework Based on Grid Index and Heuristic Function

In this paper, we discuss a recently introduced and still little-studied path finding problem: multi-request route planning (MRRP). Given a road network and a large number of points of interest (POIs), each POI has its own service list. The user specifies a departure location, a destination, and a request list; the task of MRRP is to find the most cost-effective route from the user’s starting point to the end point that satisfies all the user’s requests. At present, only one paper has solved the MRRP problem. Its method cannot be extended directly to time-dependent road networks with time-varying values because it takes up more memory. In this paper, we propose a new framework based on a grid file and heuristic functions for solving the MRRP problem. The framework consists of three phases. The area arrangement phase compares request lists with the service lists contained in adjacent grid cells to filter out unnecessary regions. In the routing preparation phase, the most profitable POIs are selected to meet the needs of users. The path finding phase then obtains the final shortest path results. Extensive experiments have been conducted to evaluate the performance of the proposed framework and compare it with state-of-the-art algorithms. The results show that the route costs selected by the proposed method are 2–3 times lower than those obtained by the others under different settings. Meanwhile, the execution time of our algorithm is 2–3 times shorter than theirs.

Jiajia Li, Jiahui Hu, Vladislav Engel, Chuanyu Zong, Xiufeng Xia

Nodes Deployment Optimization Algorithm Based on Fuzzy Data Fusion Model in Wireless Sensor Networks

As an integrated network, a wireless sensor network (WSN) can connect the logical information world with the real physical world by sensing, gathering, processing and delivering information. WSNs have diverse potential applications. In recent years, the increasing demand for WSNs has led more and more research to be dedicated to the question of sensor node deployment. For underwater wireless sensor networks, the optimization strategy for node deployment also determines the capability and quality of service of the network. Several key points should be considered, including the coverage range to be monitored, the energy consumption of nodes, the number of deployed sensors, connectivity, and the lifetime of the network. This paper analyzes the problem of node deployment optimization in wireless sensor networks. Drawing on the fuzzy cognitive model and the fuzzy data fusion model, and taking into account certain environmental factors that may affect the detection result, a novel method, NAFC, is presented. The simulation model is built in MATLAB. According to the simulation results, the proposed underwater sensor node deployment algorithm is effective: it can fulfill the required network coverage ratio, reduce the number of deployed nodes, prolong the network lifetime and expand the detection range of the network, thereby improving the overall detection performance of the WSN.

Na Li, Qiangyi Li, Qiangnan Li

Community Enhanced Record Linkage Method for Vehicle Insurance System

Record linkage is a pivotal data integration stage in the vehicle insurance claims analysis system and serves as a foundation for fraud detection, market promotion and other major business applications. While the traditional method of rule-based classification plus clerical review is still in use in the industry, the latest developments have advanced to collective record linkage based on link analysis, which places the blocking and classification processes in a global context. To apply this method with a fraud detection objective, we have developed a community enhanced record linkage model specially tailored to the requirements of a vehicle insurance claim system. A major novelty is the construction of claim communities that link the claims, customers and vehicles involved, together with the application of probabilistic data matching algorithms integrated with spatio-temporal co-occurrence patterns. In addition, the matched results can be used to identify outliers in fraud detection analysis.

Christian Lu, Guangyan Huang, Yong Xiang

COEA: An Efficient Method for Entity Alignment in Online Encyclopedias

The knowledge graph is the cornerstone of artificial intelligence. Entity alignment in multi-source online encyclopedias is an important part of the data integration needed to construct a knowledge graph. To address the problem that traditional methods are not effective enough for entity alignment tasks in online encyclopedias, this paper proposes the Chinese Online Encyclopedia Aligner (COEA), based on a combination of entity attributes and context. In this paper, we focus on (1) extracting attribute information and context of entities from the infobox of online encyclopedias and normalizing them, (2) computing the similarity of entity attributes based on the Vector Space Model, and (3) further considering the entity similarity based on a topic model over the entity context when the similarity of attributes lies between the lower bound and the upper bound. Finally, data sets for entity alignment in online encyclopedias are constructed for simulation experiments. The experimental results show that the proposed method outperforms traditional entity alignment algorithms and verify that it can significantly improve the performance of entity alignment in online encyclopedias during the construction of Chinese knowledge graphs.

Yimin Lv, Xin Wang, Runpu Yue, Fuchuan Tang, Xue Xiang
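
The two-bound decision described in the COEA abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the bound values, the function names, and the rule that pairs above the upper bound are accepted while pairs below the lower bound are rejected are assumptions, and a plain bag-of-words cosine stands in for the topic model used on the entity context.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_same_entity(attrs_a, attrs_b, context_a, context_b,
                   lower=0.3, upper=0.8, context_threshold=0.5):
    """Decide whether two encyclopedia entries describe the same entity.

    attrs_* and context_* are lists of normalized tokens.
    All threshold values are hypothetical.
    """
    attr_sim = cosine_similarity(Counter(attrs_a), Counter(attrs_b))
    if attr_sim >= upper:   # attributes alone are conclusive (assumed)
        return True
    if attr_sim <= lower:   # attributes alone rule the pair out (assumed)
        return False
    # Ambiguous band between the bounds: consult the context.
    # Stand-in: bag-of-words cosine instead of a topic model.
    ctx_sim = cosine_similarity(Counter(context_a), Counter(context_b))
    return ctx_sim >= context_threshold
```

The point of the design is that the more expensive context comparison runs only for pairs whose attribute similarity falls in the ambiguous band.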

Efficient Deployment and Mission Timing of Autonomous Underwater Vehicles in Large-Scale Operations

This study introduces a connective model of routing and local path planning for time-efficient Autonomous Underwater Vehicle (AUV) maneuvering in long-range operations. Assuming the vehicle operates in a turbulent underwater environment, the local path planner produces water-current-resilient shortest paths along the existing nodes in the global route. A re-routing procedure is defined to reorganize the order of nodes in a route and compensate for any time lost during the mission. The Firefly Optimization Algorithm (FOA) is employed by both planners to validate the model’s performance in mission timing and its robustness against water current variations. Given the limited battery lifetime, the model offers accurate mission timing and real-time performance. The routing system and the local path planner operate cooperatively, which is another reason for the model’s real-time performance. The simulation results confirm the model’s capability to fulfil the expected criteria and demonstrate its significant robustness against underwater uncertainties and variations in mission conditions.

Somaiyeh MahmoudZadeh

MLCA: A Multi-label Competency Analysis Method Based on Deep Neural Network

The goal of human resource management is to place the right people in the right positions, whether through recruitment, assessment or promotion. Competency analysis is an effective way to achieve this goal. Through the analysis, we can obtain both the employee’s competencies and the position’s requirements. Competency analysis also provides strong support for downstream work such as assessing or promoting employees or establishing employee files. The multi-label text classification model proposed in this paper, which is based on a deep neural network, can successfully complete the competency analysis, and its performance is much better than that of current multi-label text classification methods. We also construct a multi-label classification dataset for the human resources field which, as far as we know, is the first one focused on competency analysis.

Guohao Qiao, Bin Wu, Bai Wang, Baoli Zhang

MACCA: A SDN Based Collaborative Classification Algorithm for QoS Guaranteed Transmission on IoT

Software defined network (SDN) can effectively balance link loads and guarantee QoS for data streams of different application categories on the Internet of Things (IoT). To achieve high accuracy and low time consumption in stream classification for SDN, collaborative methods are considered. By analyzing data sets of network flows on CyberGIS and IoT, a Misclassification-Aware Collaborative Classification Algorithm named MACCA is proposed. MACCA combines a misclassification judgment module with a decision module to calculate the final classification results, and thus avoids the reduction in overall accuracy caused by using voting to determine the results. The evaluation results show that MACCA, which can be implemented on SDN-based networks, classifies network data streams efficiently, with an average accuracy of 99.66% and lower time consumption than other classification algorithms.

Weifeng Sun, Zun Wang, Guanghao Zhang, Boxiang Dong

DataLearner: A Data Mining and Knowledge Discovery Tool for Android Smartphones and Tablets

Smartphones have become the ultimate ‘personal’ computer, yet despite this, general-purpose data mining and knowledge discovery tools for mobile devices are surprisingly rare. DataLearner is a new data mining application designed specifically for Android devices that imports the Weka data mining engine and augments it with algorithms developed by Charles Sturt University. Moreover, DataLearner can be expanded with additional algorithms. Combined, DataLearner delivers 40 classification, clustering and association rule mining algorithms for model training and evaluation without need for cloud computing resources or network connectivity. It provides the same classification accuracy as PCs and laptops, while doing so with acceptable processing speed and consuming negligible battery life. With its ability to provide easy-to-use data mining on a phone-size screen, DataLearner is a new portable, self-contained data mining tool for remote, personalised and educational applications alike. DataLearner features four elements – this paper, the app available on Google Play, the GPL3-licensed source code on GitHub and a short video on YouTube.

Darren Yates, Md Zahidul Islam, Junbin Gao

Prediction of Customer Purchasing Power of Google Merchandise Store

For customer data mining on the Google Merchandise Store, ensemble learning models such as LightGBM are popular. However, LightGBM mines the data only once, so the granularity of its data mining is coarse. As a result, LightGBM cannot uncover the deeper internal correlations in the Google Merchandise Store dataset. In this paper, the deep LGB model is proposed, which automatically refines the granularity of data mining through a sliding window; on this basis, the deep structure endows the model with a certain representation learning ability, so as to uncover deeper associations in the data. Then, a semi-automatic feature engineering procedure is proposed, which first processes some features of the data set automatically and then generates the final data set with a little manual analysis. The experimental results on customer data from the Google Merchandise Store show that the prediction accuracy of the deep LGB model with semi-automatic feature engineering is 6.16 percentage points higher than that of the single LGB model trained on the original data set.

ZhiYu Ye, AiMin Feng, Hang Gao

Research on Short-Term Traffic Flow Forecasting Based on KNN and Discrete Event Simulation

With the rapid development of urban traffic, accurate short-term traffic flow forecasting has become very important. First, to address short-term traffic flow forecasting, the key features that affect traffic flow are extracted and the KNN non-parametric regression method is used for forecasting. Second, to solve the problem of dynamic traffic flow assignment, we build a simulation model, which achieves good results. Finally, we use a case of short-term flow forecasting at an airport to carry out a data experiment. The experimental results show that the traffic flow of traffic nodes and routes can be fully forecast by using the KNN algorithm combined with discrete event simulation, and that the results are more credible.

Shaozheng Yu, Yingqiu Li, Guojun Sheng, Jiao Lv
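
As a rough illustration of the KNN non-parametric regression step mentioned in the abstract above, the sketch below forecasts the next flow value as the mean outcome of the k most similar historical situations. The feature layout (recent flow readings), the Euclidean distance, the unweighted mean, and the name `knn_forecast` are assumptions; the abstract does not specify which features or weighting the authors use.

```python
import math


def knn_forecast(history, query, k=3):
    """Forecast the next traffic flow with k-nearest-neighbour regression.

    history: list of (feature_vector, observed_next_flow) pairs
    query:   feature vector describing the current situation
    Returns the unweighted mean of the k nearest observed outcomes.
    """
    if not history:
        raise ValueError("history must not be empty")
    # Rank historical records by Euclidean distance to the query.
    ranked = sorted(history, key=lambda rec: math.dist(rec[0], query))
    nearest = ranked[:k]
    return sum(flow for _, flow in nearest) / len(nearest)


# Toy data (invented): features are the two most recent flow readings,
# the target is the flow observed in the following interval.
toy_history = [
    ((10, 12), 14),
    ((11, 13), 15),
    ((50, 52), 55),
    ((9, 11), 13),
]
toy_prediction = knn_forecast(toy_history, (10, 12), k=3)
```

Here the three nearest records have outcomes 14, 15 and 13, so the forecast is their mean; the distant `(50, 52)` record does not influence the result.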

Application of Weighted K-Means Decision Cluster Classifier in the Recognition of Infectious Expressions of Primary School Students Reading

In this paper, a new classification algorithm for infectious expressions in reading is proposed. The algorithm, called the weighted K-means decision cluster classifier (WKDCC), is based on a decision tree model, with four improvements: (1) using anchor partitioning instead of heuristic information such as information gain for the search; (2) feature weighting; (3) using k-means cluster centers instead of centroids as anchor points; (4) asymmetric partitioning. WKDCC is used to recognize infectious expressions in students’ reading. The results show that WKDCC achieves higher accuracy than a decision tree and has lower time complexity than the classical decision tree algorithm. WKDCC is particularly suitable for large data sets with many samples, such as audio data.

Dongqing Zhang, Zhenyu Liu

An Anti-fraud Framework for Medical Insurance Based on Deep Learning

Given rising medical costs, medical expense control has become an important task in the healthcare domain. To solve the shortage of medical reimbursement mechanisms based on medical service items, single-disease payment models have been extensively studied. However, the approach of payment via a single-disease model is also flawed, and fraud may occur. Herein, we present an anti-fraud framework for medical insurance based on deep learning to automatically identify suspicious medical records, ensure the effective implementation of single-disease charges, and reduce the workload of medical insurance auditors. The framework first predicts the probabilities of diseases according to patients’ chief complaints and then evaluates whether the disease codes written in medical records are reasonable via the predicted probabilities; finally, medical records with unreasonable disease codes are selected as abnormal cases for manual auditing. We conduct experiments on a real-world dataset from a large hospital and demonstrate that our model can play an effective role in anti-fraud for medical insurance.

Guoming Zhang, Shucun Fu, Xiaolong Xu, Lianyong Qi, Xuyun Zhang, Wanchun Dou
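
The auditing step described in the anti-fraud abstract above (predict disease probabilities from the chief complaint, then check whether the recorded disease code is plausible) can be sketched as below. This is only an illustration under assumptions: the real framework uses a deep learning model, whereas `toy_predict` is a hard-coded stand-in, and the record fields, the placeholder codes `D1` to `D9`, and the `threshold` value are hypothetical.

```python
def flag_suspicious(records, predict_proba, threshold=0.05):
    """Return the ids of records whose recorded disease code looks implausible.

    predict_proba maps a chief complaint to {disease_code: probability}.
    A record is flagged for manual auditing when the probability of its
    recorded code falls below the (hypothetical) threshold.
    """
    flagged = []
    for rec in records:
        probs = predict_proba(rec["chief_complaint"])
        # Codes the model never predicts are treated as probability 0.
        if probs.get(rec["disease_code"], 0.0) < threshold:
            flagged.append(rec["id"])
    return flagged


def toy_predict(chief_complaint):
    """Hard-coded stand-in for the deep learning model (assumption)."""
    if "cough" in chief_complaint:
        return {"D1": 0.90, "D2": 0.08, "D3": 0.02}
    return {"D1": 0.10, "D2": 0.20, "D3": 0.70}


# Toy records (invented); D1..D9 are placeholder disease codes.
toy_records = [
    {"id": 1, "chief_complaint": "cough and fever", "disease_code": "D1"},
    {"id": 2, "chief_complaint": "cough", "disease_code": "D3"},
    {"id": 3, "chief_complaint": "knee pain", "disease_code": "D3"},
    {"id": 4, "chief_complaint": "knee pain", "disease_code": "D9"},
]
toy_flagged = flag_suspicious(toy_records, toy_predict)
```

Records 2 and 4 are flagged because their recorded codes receive a predicted probability below the threshold; in practice the threshold would control how many records reach the human auditors.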

Demos

Frontmatter

BSI: A System for Predicting and Analyzing Accident Risk

In recent years, the rapid growth of motor vehicle ownership has put great pressure on the road traffic system and inevitably leads to a large number of traffic accidents. It is therefore an important and demanding task to build a well-developed system that identifies the high-risk links, i.e., black spots, of a road network. However, most existing works identify black spots in a road network simply from statistical accident data, which leads to low accuracy. In this demonstration, we present a novel system called BSI that predicts and analyzes the high-risk links in a road network by adequately utilizing the spatial-temporal features of accidents. First, BSI predicts the trend of accidents with a spatial-temporal sequence model. Then, based on the predicted results, the K-means method is used to discover the roads with the highest accident severity. Finally, BSI identifies the central location and coverage of a high-risk link with a modified DBSCAN clustering model. BSI can visualize the final identified black spots and provide the results to the user.

Xinyu Ma, Yuhao Yang, Meng Wang

KG3D: An Interactive 3D Visualization Tool for Knowledge Graphs

With the emergence of knowledge graphs of different scales, such as DBpedia, YAGO, and WikiData, knowledge graphs have become the cornerstone supporting many artificial intelligence tasks. However, it is difficult for end-users to query and understand knowledge graphs consisting of hundreds of millions of nodes and edges. To help end-users better retrieve information from RDF data and explore the knowledge graph without SPARQL or knowledge of the relation types, we developed an interactive visual query tool called KG3D, which supports connection queries and pattern matching. Our tool displays the knowledge graph in 3-dimensional space and automatically converts the query into a SPARQL statement. In this paper, we present the advantages of KG3D over other tools, discuss the design motivation, and demonstrate various use cases.

Dawei Xu, Lin Wang, Xin Wang, Dianquan Li, Jianpeng Duan, Yongzhe Jia

Backmatter

Title: Advanced Data Mining and Applications
Edited by: Jianxin Li
Sen Wang
Dr. Shaowen Qin
Xue Li
Shuliang Wang
Publisher: Springer International Publishing
Electronic ISBN: 978-3-030-35231-8
Print ISBN: 978-3-030-35230-1
DOI: https://doi.org/10.1007/978-3-030-35231-8