nach oben

2021 | Buch

Advances in Knowledge Discovery and Data Mining

25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, May 11–14, 2021, Proceedings, Part I

herausgegeben von: Prof. Kamal Karlapalem, Hong Cheng, Naren Ramakrishnan, R. K. Agrawal, P. Krishna Reddy, Dr. Jaideep Srivastava, Assist. Prof. Tanmoy Chakraborty

Verlag: Springer International Publishing

Buchreihe : Lecture Notes in Computer Science

Enthalten in: Springer Professional "Wirtschaft+Technik" , Springer Professional "Technik" , Springer Professional "Wirtschaft"

Einloggen, um Zugang zu erhalten

Über dieses Buch

The 3-volume set LNAI 12712-12714 constitutes the proceedings of the 25th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, PAKDD 2021, which was held during May 11-14, 2021.

The 157 papers included in the proceedings were carefully reviewed and selected from a total of 628 submissions. They were organized in topical sections as follows:

Part I: Applications of knowledge discovery and data mining of specialized data;

Part II: Classical data mining; data mining theory and principles; recommender systems; and text analytics;

Part III: Representation learning and embedding, and learning from data.

Inhaltsverzeichnis

Frontmatter

Applications of Knowledge Discovery

Frontmatter

Fuzzy World: A Tool Training Agent from Concept Cognitive to Logic Inference

Not like many visual systems or NLP frameworks, human generally use both visual and semantic information for reasoning tasks. In this paper, we present a 3D virtual simulation learning environment Fuzzy World based on gradual learning paradigm to train visual-semantic reasoning agent for complex logic reasoning tasks. Furthermore our baseline approach employed semantic graphs and deep reinforcement learning architecture shows the significant performance over the tasks.

Minzhong Luo

Collaborative Reinforcement Learning Framework to Model Evolution of Cooperation in Sequential Social Dilemmas

Multi-agent reinforcement learning (MARL) has very high sample complexity leading to slow learning. For repeated social dilemma games e.g. Public Goods Game(PGG), Fruit Gathering Game(FGG), MARL exhibits low sustainability of cooperation due to non-stationarity of the agents and the environment, and the large sample complexity. Motivated by the fact that humans learn not only through their own actions (organic learning) but also by following the actions of other humans (social learning) who also continuously learn about the environment, we address this challenge by augmenting RL based models with a notion of collaboration among agents. In particular, we propose Collaborative-Reinforcement-Learning (CRL), where agents collaborate by observing and following other agent’s actions/decisions. The CRL model significantly influences the speed of individual learning, which effects the collective behavior as compared to RL only models and thereby effectively explaining the sustainability of cooperation in repeated PGG settings. We also extend the CRL model for PGGs over different generations where agents die, and new agents are born following a birth-death process. Also, extending the proposed CRL model, we propose Collaborative Deep RL Network(CDQN) for a team based game (FGG) and the experimental results confirm that agents following CDQN learns faster and collects more fruits.

Ritwik Chaudhuri, Kushal Mukherjee, Ramasuri Narayanam, Rohith D. Vallam

SigTran: Signature Vectors for Detecting Illicit Activities in Blockchain Transaction Networks

Cryptocurrency networks have evolved into multi-billion-dollar havens for a variety of disputable financial activities, including phishing, ponzi schemes, money-laundering, and ransomware. In this paper, we propose an efficient graph-based method, SigTran, for detecting illicit nodes on blockchain networks. SigTran first generates a graph based on the transaction records from blockchain. It then represents the nodes based on their structural and transactional characteristics. These node representations accurately differentiate nodes involved in illicit activities. SigTran is generic and can be applied to records extracted from different networks. SigTran achieves an $$F_1$$ F 1 score of 0.92 on Bitcoin and 0.94 on Ethereum, which outperforms the state-of-the-art performance on these benchmarks obtained by much more complex, platform-dependent models.

Farimah Poursafaei, Reihaneh Rabbany, Zeljko Zilic

VOA*: Fast Angle-Based Outlier Detection over High-Dimensional Data Streams

Outlier detection in the high-dimensional data stream is a challenging data mining task. In high-dimensional data, the distance-based measures of outlierness become less effective and unreliable. Angle-based outlier detection ABOD technique was proposed as a more suitable scheme for high-dimensional data. However, ABOD is designed for static datasets and its naive application on a sliding window over data streams will result in poor performance. In this research, we propose two incremental algorithms for fast outlier detection based on an outlier threshold value in high-dimensional data streams: IncrementalVOA and $$VOA^{*}$$ V O A ∗ . IncrementalVOA is a basic incremental algorithm for computing outlier factor of each data point in each window. $$VOA^{*}$$ V O A ∗ enhances the incremental computation by using a bound-based pruning method and a retrospect-based incremental computation technique. The effectiveness and efficiency of the proposed algorithms are experimentally evaluated on synthetic and real world datasets where $$VOA^{*}$$ V O A ∗ outperformed other methods.

Vijdan Khalique, Hiroyuki Kitagawa

Learning Probabilistic Latent Structure for Outlier Detection from Multi-view Data

Mining anomalous objects from multi-view data is a challenging issue as data collected from diverse sources have more complicated distributions and exhibit inconsistently heterogeneous properties. Existing multi-view outlier detection approaches mainly focus on transduction, which becomes very costly when new data points are introduced by an input stream. Besides, the existing detection methods use either the pairwise multiplication of cross-view data vectors to quantify outlier scores or the predicted joint probability to measure anomalousness, which are less extensible to support more sources. To resolve these challenges, we propose in this paper a Bayesian probabilistic model for finding multi-view outliers in an inductive learning manner. Specifically, we first follow the probabilistic projection method of latent variable for exploring the structural correlation and consistency of different views. Then, we seek for a promising Bayesian treatment for the generative model to approach the issue of selecting the optimal dimensionality. Next, we explore a variational approximation approach to estimate model parameters and achieve model selection. The outlier score for every sample can then be predicted by analyzing mutual distances of its representations across different views in the latent subspace. Finally, we benchmark the efficacy of the proposed method by conducting comprehensive experiments.

Zhen Wang, Ji Zhang, Yizheng Chen, Chenhao Lu, Jerry Chun-Wei Lin, Jing Xiao, Rage Uday Kiran

GLAD-PAW: Graph-Based Log Anomaly Detection by Position Aware Weighted Graph Attention Network

Anomaly detection is a crucial and challenging subject that has been studied within diverse research areas. In this work, we focus on log data (especially computer system logs) which is a valuable source to investigate system status and detect system abnormality. In order to capture transition pattern and position information of records in logs simultaneously, we transfer log files to session graphs and formulate the log anomaly detection problem as a graph classification task. Specifically, we propose GLAD-PAW, a graph-based log anomaly detection model utilizing a new position aware weighted graph attention layer (PAWGAT) and a global attention readout function to learn embeddings of records and session graphs. Extensive experimental studies demonstrate that our proposed model outperforms existing log anomaly detection methods including both statistical and deep learning approaches.

Yi Wan, Yilin Liu, Dong Wang, Yujin Wen

CubeFlow: Money Laundering Detection with Coupled Tensors

Money laundering (ML) is the behavior to conceal the source of money achieved by illegitimate activities, and always be a fast process involving frequent and chained transactions. How can we detect ML and fraudulent activity in large scale attributed transaction data (i.e. tensors)? Most existing methods detect dense blocks in a graph or a tensor, which do not consider the fact that money are frequently transferred through middle accounts. CubeFlow proposed in this paper is a scalable, flow-based approach to spot fraud from a mass of transactions by modeling them as two coupled tensors and applying a novel multi-attribute metric which can reveal the transfer chains accurately. Extensive experiments show CubeFlow outperforms state-of-the-art baselines in ML behavior detection in both synthetic and real data.

Xiaobing Sun, Jiabao Zhang, Qiming Zhao, Shenghua Liu, Jinglei Chen, Ruoyu Zhuang, Huawei Shen, Xueqi Cheng

Unsupervised Boosting-Based Autoencoder Ensembles for Outlier Detection

Autoencoders have been recently applied to outlier detection. However, neural networks are known to be vulnerable to overfitting, and therefore have limited potential in the unsupervised outlier detection setting. The majority of existing deep learning methods for anomaly detection is sensitive to contamination of the training data to anomalous instances. To overcome the aforementioned limitations we develop a Boosting-based Autoencoder Ensemble approach (BAE). BAE is an unsupervised ensemble method that, similarly to boosting, builds an adaptive cascade of autoencoders to achieve improved and robust results. BAE trains the autoencoder components sequentially by performing a weighted sampling of the data, aimed at reducing the amount of outliers used during training, and at injecting diversity in the ensemble. We perform extensive experiments and show that the proposed methodology outperforms state-of-the-art approaches under a variety of conditions.

Hamed Sarvari, Carlotta Domeniconi, Bardh Prenkaj, Giovanni Stilo

Unsupervised Domain Adaptation for 3D Medical Image with High Efficiency

Domain adaptation is a fundamental problem in the 3D medical image process. The current methods mainly cut the 3D image into 2D slices and then use 2D CNN for processing, which may ignore the inter-slice information of the 3D medical image. Methods based on 3D CNN can capture the inter-slice information but lead to extensive memory consumption and lots of training time. In this paper, we aim to model the inter-slice information in 2D CNN to realize the unsupervised 3D medical image’s domain adaptation without additional cost from 3D convolutions. To better capture the inter-slice information, we train the model from the adjacent (local) slices and the global slice sequence perspective. We first propose the Slice Subtract Module method (SSM), which can easily embed into 2D CNN and model the adjacent slices by introducing very limited extra computation cost. We then train the adaptation model with a distance consistency loss supervised by the LSTM component to model the global slice sequence. Extensive experiments on BraTS2019 and Chaos datasets show that our method can effectively improve the quality of domain adaptation and achieve state-of-the-art accuracy on segmentation tasks with little computation increased while remaining parameters unchanged.

Chufu Deng, Kuilin Li, Zhiguang Chen

A Hierarchical Structure-Aware Embedding Method for Predicting Phenotype-Gene Associations

Identifying potential causal genes for disease phenotypes is essential for disease treatment and facilitates drug development. Inspired by existing random-walk based embedding methods and the hierarchical structure of Human Phenotype Ontology (HPO), this work presents a Hierarchical Structure-Aware Embedding Method (HSAEM) for predicting phenotype-gene associations, which explicitly incorporates node type information and node individual difference into random walks. Unlike existing meta-path-guided heterogeneous network embedding techniques, HSAEM estimates an individual jumping probability for each node learned from hierarchical structures of phenotypes and different node influences among genes. The jumping probability guides the current node to select either a heterogeneous neighborhood or a homogeneous neighborhood as the next node, when performing random walks over the heterogeneous network including HPO, phenotype-gene and Protein-Protein Interaction (PPI) networks. The generated node sequences are then fed into a heterogeneous SkipGram model to perform node representations. By defining the individual jumping probability based on hierarchical structure, HSAEM can effectively capture co-occurrence of nodes in the heterogeneous network. HSAEM yields its extraordinary performance not only in the statistical evaluation metrics compared to baselines but also in the practical effectiveness of prioritizing causal genes for Parkinson’s Disease.

Lin Wang, Mingming Liu, Wenqian He, Xu Jin, Maoqiang Xie, Yalou Huang

Autonomous Vehicle Path Prediction Using Conditional Variational Autoencoder Networks

Path prediction of autonomous vehicles is an essential requirement under any given traffic scenario. Trajectory of several agent vehicles in the vicinity of ego vehicle, at least for a short future, is needed to be predicted in order to decide upon the maneuver of the ego vehicle. We explore variational autoencoder networks to obtain multimodal trajectories of agent vehicles. In our work, we condition the network on past trajectories of agents and traffic scenes as well. The latent space representation of traffic scenes is achieved by using another variational autoencoder network. The performance of the proposed networks is compared against a residual baseline model.

D. N. Jagadish, Arun Chauhan, Lakshman Mahto

Heterogeneous Graph Attention Network for Small and Medium-Sized Enterprises Bankruptcy Prediction

Credit assessment for Small and Medium-sized Enterprises (SMEs) is of great interest to financial institutions such as commercial banks and Peer-to-Peer lending platforms. Effective credit rating modeling can help them make loan-granted decisions while limiting their risk exposure. Despite a substantial amount of research being conducted in this domain, there are three existing issues. Firstly, many of them are mainly developed based on financial statements, which usually are not publicly-accessible for SMEs. Secondly, they always neglect the rich relational information embodied in financial networks. Finally, existing graph-neural-network-based (GNN) approaches for credit assessment are only applicable to homogeneous networks. To address these issues, we propose a heterogeneous-attention-network-based model (HAT) to facilitate SMEs bankruptcy prediction using publicly-accessible data. Specifically, our model has two major components: a heterogeneous neighborhood encoding layer and a triple attention output layer. While the first layer can encapsulate target nodes’ heterogeneous neighborhood information to address the graph heterogeneity, the latter can generate the prediction by considering the importance of different metapath-based neighbors, metapaths, and networks. Extensive experiments in a real-world dataset demonstrate the effectiveness of our model compared with baselines.

Yizhen Zheng, Vincent C. S. Lee, Zonghan Wu, Shirui Pan

Algorithm Selection as Superset Learning: Constructing Algorithm Selectors from Imprecise Performance Data

Algorithm selection refers to the task of automatically selecting the most suitable algorithm for solving an instance of a computational problem from a set of candidate algorithms. Here, suitability is typically measured in terms of the algorithms’ runtimes. To allow the selection of algorithms on new problem instances, machine learning models are trained on previously observed performance data and then used to predict the algorithms’ performances. Due to the computational effort, the execution of such algorithms is often prematurely terminated, which leads to right-censored observations representing a lower bound on the actual runtime. While simply neglecting these censored samples leads to overly optimistic models, imputing them with precise though hypothetical values, such as the commonly used penalized average runtime, is a rather arbitrary and biased approach. In this paper, we propose a simple regression method based on so-called superset learning, in which right-censored runtime data are explicitly incorporated in terms of interval-valued observations, offering an intuitive and efficient approach to handling censored data. Benchmarking on publicly available algorithm performance data, we demonstrate that it outperforms the aforementioned naïve ways of dealing with censored samples and is competitive to established methods for censored regression in the field of algorithm selection.

Jonas Hanselle, Alexander Tornede, Marcel Wever, Eyke Hüllermeier

Sim2Real for Metagenomes: Accelerating Animal Diagnostics with Adversarial Co-training

Machine learning models have made great strides in many fields of research, including bioinformatics and animal diagnostics. Recently, attention has shifted to detecting pathogens from genome sequences for disease diagnostics with computational models. While there has been tremendous progress, it has primarily been driven by large amounts of annotated data, which is expensive and hard to obtain. Hence, there is a need to develop models that can leverage low-cost, synthetic genome sequences to help tackle complex metagenome classification problems for diagnostics. In this paper, we present one of the first sim2real approaches to help multi-task deep learning models learn robust feature representations from synthetic metagenome sequences and transfer the learning to predict pathogen sequences in real data. Extensive experiments show that our model can successfully leverage synthetic and real genome sequences to obtain $$80\%$$ 80 % accuracy on metagenome sequence classification. Additionally, we show that our proposed model obtains 76% accuracy, with limited real metagenome sequences.

Vineela Indla, Vennela Indla, Sai Narayanan, Akhilesh Ramachandran, Arunkumar Bagavathi, Vishalini Laguduva Ramnath, Sathyanarayanan N. Aakur

Attack Is the Best Defense: A Multi-Mode Poisoning PUF Against Machine Learning Attacks

Resistance to modeling attacks is an important issue for Physical Unclonable Functions (PUFs). Deep learning, the state-of-the-art modeling attack, has recently been shown to be able to break many newly developed PUFs. Since then, many more complex PUF structures or challenge obfuscations have been proposed to resist deep learning attacks. However, the proposed methods typically focus on increasing the nonlinearity of PUF structure and challenge-response mapping. In this paper, we explore another direction with a multi-mode poisoning approach for a classic PUF (MMP PUF) in which each working mode is a simple add-on function for a classic PUF. By dividing the original challenge space for each working mode, the proposed MMP PUF generates a multi-modal challenge-response dataset that poisons machine learning algorithms. To validate the idea, we design two working mode types, challenge shift and response flip, as examples with widely-used delay-based Arbiter PUF. Experimental results show that our approach respectively achieves 74.37%, 68.08%, and 50.09% accuracy for dual-mode shift, quad-mode circular shift and dual-mode flip with deep learning models trained on over 3 million challenge-response pairs.

Chia-Chih Lin, Ming-Syan Chen

Combining Exogenous and Endogenous Signals with a Semi-supervised Co-attention Network for Early Detection of COVID-19 Fake Tweets

Fake tweets are observed to be ever-increasing, demanding immediate countermeasures to combat their spread. During COVID-19, tweets with misinformation should be flagged and neutralised in their early stages to mitigate the damages. Most of the existing methods for early detection of fake news assume to have enough propagation information for large labelled tweets – which may not be an ideal setting for cases like COVID-19 where both aspects are largely absent. In this work, we present ENDEMIC, a novel early detection model which leverages exogenous and endogenous signals related to tweets, while learning on limited labelled data. We first develop a novel dataset, called ECTF for early COVID-19 Twitter fake news, with additional behavioural test-sets to validate early detection. We build a heterogeneous graph with follower-followee, user-tweet, and tweet-retweet connections and train a graph embedding model to aggregate propagation information. Graph embeddings and contextual features constitute endogenous, while time-relative web-scraped information constitutes exogenous signals. ENDEMIC is trained in a semi-supervised fashion, overcoming the challenge of limited labelled data. We propose a co-attention mechanism to fuse signal representations optimally. Experimental results on ECTF, PolitiFact, and GossipCop show that ENDEMIC is highly reliable in detecting early fake tweets, outperforming nine state-of-the-art methods significantly.

Rachit Bansal, William Scott Paka, Nidhi, Shubhashis Sengupta, Tanmoy Chakraborty

TLife-LSTM: Forecasting Future COVID-19 Progression with Topological Signatures of Atmospheric Conditions

Understanding the impact of atmospheric conditions on SARS-CoV2 is critical to model COVID-19 dynamics and sheds a light on the future spread around the world. Furthermore, geographic distributions of expected clinical severity of COVID-19 may be closely linked to prior history of respiratory diseases and changes in humidity, temperature, and air quality. In this context, we postulate that by tracking topological features of atmospheric conditions over time, we can provide a quantifiable structural distribution of atmospheric changes that are likely to be related to COVID-19 dynamics. As such, we apply the machinery of persistence homology on time series of graphs to extract topological signatures and to follow geographical changes in relative humidity and temperature. We develop an integrative machine learning framework named Topological Lifespan LSTM (TLife-LSTM) and test its predictive capabilities on forecasting the dynamics of SARS-CoV2 cases. We validate our framework using the number of confirmed cases and hospitalization rates recorded in the states of Washington and California in the USA. Our results demonstrate the predictive potential of TLife-LSTM in forecasting the dynamics of COVID-19 and modeling its complex spatio-temporal spread dynamics.

Ignacio Segovia-Dominguez, Zhiwei Zhen, Rishabh Wagh, Huikyo Lee, Yulia R. Gel

Lifelong Learning Based Disease Diagnosis on Clinical Notes

Current deep learning based disease diagnosis systems usually fall short in catastrophic forgetting, i.e., directly fine-tuning the disease diagnosis model on new tasks usually leads to abrupt decay of performance on previous tasks. What is worse, the trained diagnosis system would be fixed once deployed but collecting training data that covers enough diseases is infeasible, which inspires us to develop a lifelong learning diagnosis system. In this work, we propose to adopt attention to combine medical entities and context, embedding episodic memory and consolidation to retain knowledge, such that the learned model is capable of adapting to sequential disease-diagnosis tasks. Moreover, we establish a new benchmark, named Jarvis-40, which contains clinical notes collected from various hospitals. Experiments show that the proposed method can achieve state-of-the-art performance on the proposed benchmark. Code is available at https://github.com/yifyang/LifelongLearningDiseaseDiagnosis .

Zifeng Wang, Yifan Yang, Rui Wen, Xi Chen, Shao-Lun Huang, Yefeng Zheng

GrabQC: Graph Based Query Contextualization for Automated ICD Coding

Automated medical coding is a process of codifying clinical notes to appropriate diagnosis and procedure codes automatically from the standard taxonomies such as ICD (International Classification of Diseases) and CPT (Current Procedure Terminology). The manual coding process involves the identification of entities from the clinical notes followed by querying a commercial or non-commercial medical codes Information Retrieval (IR) system that follows the Centre for Medicare and Medicaid Services (CMS) guidelines. We propose to automate this manual process by automatically constructing a query for the IR system using the entities auto-extracted from the clinical notes. We propose GrabQC, a Graph based Query Contextualization method that automatically extracts queries from the clinical text, contextualizes the queries using a Graph Neural Network (GNN) model and obtains the ICD Codes using an external IR system. We also propose a method for labelling the dataset for training the model. We perform experiments on two datasets of clinical text in three different setups to assert the effectiveness of our approach. The experimental results show that our proposed method is better than the compared baselines in all three settings.

Jeshuren Chelladurai, Sudarsun Santhiappan, Balaraman Ravindran

Deep Gaussian Mixture Model on Multiple Interpretable Features of Fetal Heart Rate for Pregnancy Wellness

About 1% of the newborns have disorders of cardiac rhythm and conduction that result in heart failure, brain damage and even sudden newborn death. To ensure the safety of pregnancy, the electronic fetal monitoring (EFM) technique is widely used in obstetrics. However, the existing automatic diagnosis methods suffer from two main problems: insufficient features and low interpretability. In order to improve the interpretability and effect of the method, we propose a novel fully interpretable method. We first propose an iterative local linear regression (ILLR) method of linear complexity, which calculates over all local ranges of the fetal heart rate (FHR) and generates local gradients and coefficients of determination, that are used as indicators of intensity and typicality of fetal heart activity. Then, we elaborate the methodology of extraction of dozens of features by interpretable methods. Finally, we propose an interpretable deep Gaussian mixture model that can automatically select multiple features, which is composed of a mixture model based on Gaussian model weighted by features and a regression model. We conduct cross validation experiments on the full benchmark intrapartum database CTU-UHB, which shows that our method obtains significant improvements of 5.61% accuracy over state-of-the-art baselines.

Yan Kong, Bin Xu, Bowen Zhao, Ji Qi

Adverse Drug Events Detection, Extraction and Normalization from Online Comments of Chinese Patent Medicines

Chinese Patent Medicines (CPMs) are welcomed by many people around the world, but the lack of information about their Adverse Drug Reactions (ADRs) is a big issue for drug safety. To get this information, we need to analyze from a number of real-world Adverse Drug Events (ADEs). However, current surveillance systems can only capture a small portion of them and there is a significant time lag in processing the reported data. With the rapid growth of E-commerce in recent years, quantities of patient-oriented user comments are posted on social media in real-time, making it of great value to automatically discover ADEs. To this end, we build a dataset containing 17K patient-oriented user posts about CPMs and further propose a new model that jointly performs ADE detection, extraction and normalization. Different from most previous works dealing with these tasks independently, we show how multi-task learning helps tasks to facilitate each other. To better deal with colloquial expressions and confusing statements in user comments, we leverage standard ADR-terms as prior knowledge as well as finding clues from other related comments. Experimental results show that our model outperforms previous work by a substantial margin.

Zi Chai, Xiaojun Wan

Adaptive Graph Co-Attention Networks for Traffic Forecasting

Traffic forecasting has remained a challenging topic in the field of transportation, due to the time-varying traffic patterns and complicated spatial dependencies on road networks. To address such challenges, we propose an adaptive graph co-attention network (AGCAN) to predict traffic conditions on a road network graph. In our model, an adaptive graph modelling method is adopted to learn a dynamic relational graph in which the links can capture the dynamic spatial correlations of traffic patterns among nodes, even though the adjacent nodes may not be physically connected. Besides, we propose a novel co-attention network targeting long- and short-term traffic patterns. The long-term graph attention module is used to derive periodic patterns from historical data, while the short-term graph attention module is employed to respond to sudden traffic changes, like car accidents and special events. To minimize the loss generated during the learning process, we adopt an encoder-decoder architecture, where both the encoder and decoder consist of novel hierarchical spatio-temporal attention blocks to model the impact of potential factors on traffic conditions. Overall, the experimental results on two real-world traffic prediction tasks demonstrate the superiority of AGCAN.

Boyu Li, Ting Guo, Yang Wang, Amir H. Gandomi, Fang Chen

Dual-Stage Bayesian Sequence to Sequence Embeddings for Energy Demand Forecasting

Bayesian methods provide a robust framework to model uncertainty estimation. However, conventional Bayesian models are hard to tune and often fail to scale over high dimensional data. We propose a novel Dual-Stage Bayesian Sequence to Sequence (DBS2S) model by extending Sequence to Sequence (S2S) deep learning architecture with exogenous variable input to capture uncertainty and forecast energy consumption accurately. DBS2S model is trained with a two-stage S2S encoder-decoder network and benefits from the feature representation capability of S2S to capture complex multimodal posterior distributions within embeddings. We evaluate the proposed model for probabilistic energy consumption forecasting using four real-world public datasets and achieve improved prediction accuracy up to 64% in terms of mean absolute percentage error over existing, state-of-the-art Bayesian neural networks. Additionally, probabilistic prediction intervals forecasted by DBS2S is utilized to detect outliers and flag diverse consumer behavior.

Frances Cameron-Muller, Dilusha Weeraddana, Raghav Chalapathy, Nguyen Lu Dang Khoa

AA-LSTM: An Adversarial Autoencoder Joint Model for Prediction of Equipment Remaining Useful Life

Remaining Useful Life (RUL) prediction of equipment can estimate the time when equipment reaches the safe operating limit, which is essential for strategy formulation to reduce the possibility of loss due to unexpected shutdowns. This paper proposes a novel RUL prediction model named AA-LSTM. We use a Bi-LSTM-based autoencoder to extract degradation information contained in the time series data. Meanwhile, a generative adversarial network is used to assist the autoencoder in extracting abstract representation, and then a predictor estimates the RUL based on the abstract representation learned by the autoencoder. AA-LSTM is an end-to-end model, which jointly optimizes autoencoder, generative adversarial network, and predictor. This training mechanism improves the model’s feature extraction and prediction capabilities for time series. We validate AA-LSTM on turbine engine datasets, and its performance outperforms state-of-the-art methods, especially on datasets with complex working conditions.

Dong Zhu, Chengkun Wu, Chuanfu Xu, Zhenghua Wang

Data Mining of Specialized Data

Frontmatter

Analyzing Topic Transitions in Text-Based Social Cascades Using Dual-Network Hawkes Process

We address the problem of modeling bursty diffusion of text-based events over a social network of user nodes. The purpose is to recover, disentangle and analyze overlapping social conversations from the perspective of user-topic preferences, user-user connection strengths and, importantly, topic transitions. For this, we propose a Dual-Network Hawkes Process (DNHP), which executes over a graph whose nodes are user-topic pairs, and closeness of nodes is captured using topic-topic, a user-user, and user-topic interactions. No existing Hawkes Process model captures such multiple interactions simultaneously. Additionally, unlike existing Hawkes Process based models, where event times are generated first, and event topics are conditioned on the event times, the DNHP is more faithful to the underlying social process by making the event times depend on interacting (user, topic) pairs. We develop a Gibbs sampling algorithm for estimating the three network parameters that allows evidence to flow between the parameter spaces. Using experiments over large real collection of tweets by US politicians, we show that the DNHP generalizes better than state of the art models, and also provides interesting insights about user and topic transitions.

Jayesh Choudhari, Srikanta Bedathur, Indrajit Bhattacharya, Anirban Dasgupta

HiPaR: Hierarchical Pattern-Aided Regression

We introduce HiPaR, a novel pattern-aided regression method for data with both categorical and numerical attributes. HiPaR mines hybrid rules of the form $$p \Rightarrow y = f(X)$$ p ⇒ y = f ( X ) where p is the characterization of a data region and f(X) is a linear regression model on a variable of interest y. The novelty of the method lies in the combination of an enumerative approach to explore the space of regions and efficient heuristics that guide the search. Such a strategy provides more flexibility when selecting a small set of jointly accurate and human-readable hybrid rules that explain the entire dataset. As our experiments shows, HiPaR mines fewer rules than existing pattern-based regression methods while still attaining state-of-the-art prediction performance.

Luis Galárraga, Olivier Pelgrin, Alexandre Termier

Improved Topology Extraction Using Discriminative Parameter Mining of Logs

Analytics on log data from various sources like application, middleware and infrastructure plays a very important role in troubleshooting of distributed applications. The existing tools for log analytics work by mining log templates and template sequences. The template sequences are then used to derive the application control flow or topology. In this work, we show how the use of parameters in logs enables discovery of a more accurate application topology, thereby aiding troubleshooting. An accurate application topology information helps in better correlation of logs at runtime, enabling troubleshooting tasks like anomaly detection, fault localization and root cause analysis to be more accurate.To this end, we propose a novel log template mining approach which uses parameter mining combined with fuzzy clustering on historical runtime logs to mine better quality templates. We also leverage parameter flows between log templates using a novel discriminative parameter mining approach for better topology extraction. In our method we do not assume any source code instrumentation or application specific assumptions like presence of transaction identifiers. We demonstrate the effectiveness of our approach in mining templates and application topology using real world as well as simulated data.

Atri Mandal, Saranya Gupta, Shivali Agarwal, Prateeti Mohapatra

Back to Prior Knowledge: Joint Event Causality Extraction via Convolutional Semantic Infusion

Joint event and causality extraction is a challenging yet essential task in information retrieval and data mining. Recently, pre-trained language models (e.g., BERT) yield state-of-the-art results and dominate in a variety of NLP tasks. However, these models are incapable of imposing external knowledge in domain-specific extraction. Considering the prior knowledge of frequent n-grams that represent cause/effect events may benefit both event and causality extraction, in this paper, we propose convolutional knowledge infusion for frequent n-grams with different windows of length within a joint extraction framework. Knowledge infusion during convolutional filter initialization does not only help the model capture both intra-event (i.e., features in an event cluster) and inter-event (i.e., associations across event clusters) features but also boost training convergence. Experimental results on the benchmark datasets show that our model significantly outperforms the strong BERT+CSNN baseline.

Zijian Wang, Hao Wang, Xiangfeng Luo, Jianqi Gao

A k-MCST Based Algorithm for Discovering Core-Periphery Structures in Graphs

Core-periphery structures are examples of meso-scale characteristics of graphs. Most existing algorithms for core-periphery (CP) structures work by first finding the dense cores of a network and then discovering the peripheral nodes around them. Our algorithm presented here seeks to query a graph to return the CP structures centered around any selected query node. Our algorithm significantly reduces the computational complexity of repeatedly querying the CP structures from a network. Our algorithm repeatedly extracts minimum cost spanning trees (MCSTs), first from the original network, and then successively from the residual networks. From the union of these MCSTs, our algorithm efficiently answers the queries for CP structures around nodes. We validate our algorithm on example networks taken from two domains.

Susheela Polepalli, Raj Bhatnagar

Detecting Sequentially Novel Classes with Stable Generalization Ability

Recognizing object of unseen novel classes is of great importance in real-world incremental data scenarios. Additionally, novel classes arise frequently in data stream mining, e.g., new topics in opinion monitoring, and novel protein families in protein sequence classification. Conducting streaming novel class detection is a complex problem composed of detection and model update. However, when updating the model solely with detected novel instances, it concentrates more on the novel patterns than known ones; thus the detection ability of the model may degrade consequently. This would exert harmful affections to further classification as the data evolving. To this end, in this paper, we consider the accuracy of the detection along with the robustness of model updating, and propose DEtecting Sequentially Novel clAsses with Stable generalization Ability (DesNasa). Specifically, DesNasa utilizes a prototypical network to reflect the structure information between scattered prototypes for novel class detection. Furthermore, the well-designed data augmentation method can help the model learning novel patterns robustly without degrading detection ability. Experiments on various datasets successfully validate the effectiveness of our proposed method.

Da-Wei Zhou, Yang Yang, De-Chuan Zhan

Learning-Based Dynamic Graph Stream Sketch

A graph stream is a kind of dynamic graph representation that consists of a consecutive sequence of edges where each edge is represented by two endpoints and a weight. Graph stream is widely applied in many application scenarios to describe the relationships in social networks, communication networks, academic collaboration networks, etc. Graph sketch mechanisms were proposed to summarize large-scale graphs by compact data structures with hash functions to support fast queries in a graph stream. However, the existing graph sketches use fixed-size memory and inevitably suffer from dramatic performance drops after a massive number of edge updates. In this paper, we propose a novel Dynamic Graph Sketch (DGS) mechanism, which is able to adaptively extend graph sketch size to mitigate the performance degradation caused by memory overload. The proposed DGS mechanism incorporates deep neural network structures with graph sketch to actively detect the query errors, and dynamically expand the memory size and hash space of a graph sketch to keep the error below a pre-defined threshold. We conducted extensive experiments on three real-world graph stream datasets, which show that DGS outperforms the state-of-the-arts with regard to the accuracy of different kinds of graph queries.

Ding Li, Wenzhong Li, Yizhou Chen, Mingkai Lin, Sanglu Lu

Discovering Dense Correlated Subgraphs in Dynamic Networks

Given a dynamic network, where edges appear and disappear over time, we are interested in finding sets of edges that have similar temporal behavior and form a dense subgraph. Formally, we define the problem as the enumeration of the maximal subgraphs that satisfy specific density and similarity thresholds. To measure the similarity of the temporal behavior, we use the correlation between the binary time series that represent the activity of the edges. For the density, we study two variants based on the average degree. For these problem variants we enumerate the maximal subgraphs and compute a compact subset of subgraphs that have limited overlap. We propose an approximate algorithm that scales well with the size of the network, while achieving a high accuracy. We evaluate our framework on both real and synthetic datasets. The results of the synthetic data demonstrate the high accuracy of the approximation and show the scalability of the framework.

Giulia Preti, Polina Rozenshtein, Aristides Gionis, Yannis Velegrakis

Fake News Detection with Heterogenous Deep Graph Convolutional Network

Fake news detection is a challenging problem due to its tremendous real-world political and social impacts. Previous works judged the authenticity of news mainly based on the content of a single news, which is generally not effective because the fake news is often written to mislead users by mimicking the true news. This paper innovatively utilizes the connection between multiple news, such as their relevance in time, content, topic and source, to detect fake news. We construct a heterogeneous graph with different types of nodes and edges, which is named as News Detection Graph (NDG), to integrate various information of multiple news. In order to learn deep representation of news nodes, we propose a Heterogenous Deep Convolutional Network (HDGCN) which utilizes a wider receptive field, a neighbor sampling strategy and a hierarchical attention mechanism. Extensive experiments carried on two real-world datasets demonstrated the effectiveness of our work in solving the fake news detection problem.

Zhezhou Kang, Yanan Cao, Yanmin Shang, Tao Liang, Hengzhu Tang, Lingling Tong

Incrementally Finding the Vertices Absent from the Maximum Independent Sets

A vertex v in a graph G is called an absent vertex if it is not in any maximum independent set of G. Absent vertex discovery is useful in various scenarios. For example, if G depicts a wireless communication interference graph, the existence of absent vertices in G may indicate network throughput bottlenecks. However, finding all the absent vertices is hard since it is at least as difficult as finding all the maximum independent sets, which is NP-hard. This paper focuses on a method that finds the absent vertices incrementally, in the hope of finding many such vertices quickly in the early incremental stages. The method iteratively invokes two polynomial-time algorithms to find the ‘easy’ absent vertices, and then the expensive exact maximum independent set solver to find the ‘difficult’ ones. At each iteration, the Mirror theorem is used to find extra absent vertices, and then all the absent vertices found so far are removed from the graph before going into the next iteration, until all absent vertices are found. Experimental results show that the above method can find most absent vertices much earlier than the baseline brute-force method on several widely-used datasets, showing its effectiveness.

Xiaochen Liu, Weiguo Zheng, Zhenyi Chen, Zhenying He, X. Sean Wang

Neighbours and Kinsmen: Hateful Users Detection with Graph Neural Network

With a massive rise of user-generated web content on social media, the amount of hate speech is also increasing. Countering online hate speech is a critical yet challenging task. Previous research has primarily focused on hateful content detection. In this study, we shift the attention from hateful content detection towards hateful users detection. Note, hateful users detection can benefit from users’ tweets, profiles, social relationships, but the real benefit is that it can be aided by Graph Neural Networks (GNN). Typical Graph Neural Networks, such as GraphSAGE, only considers local neighbourhood information and samples the neighbourhood uniformly, thus they lack the ability to capture long-range relationships or to differentiate neighbours of a node. In this paper, we present HateGNN – a GNN-based method to address these two limitations. Our proposed method relies on the notion of latent neighbourhood, as well as systematic sampling of the neighbourhood nodes. The experimental results demonstrate that HateGNN outperforms state-of-the-art baselines in the task of detecting hateful users. We also provide a detailed analysis to demonstrate the efficacy of the proposed method.

Shu Li, Nayyar A. Zaidi, Qingyun Liu, Gang Li

Graph Neural Networks for Soft Semi-Supervised Learning on Hypergraphs

Graph-based semi-supervised learning (SSL) assigns labels to initially unlabelled vertices in a graph. Graph neural networks (GNNs), esp. graph convolutional networks (GCNs), are at the core of the current-state-of-the art models for graph-based SSL problems. GCNs have recently been extended to undirected hypergraphs in which relationships go beyond pairwise associations. There is a need to extend GCNs to directed hypergraphs which represent more expressively many real-world data sets such as co-authorship networks and recommendation networks. Furthermore, labels of interest in these applications are most naturally represented by probability distributions. Motivated by these needs, in this paper, we propose a novel GNN-based method for directed hypergraphs, called Directed Hypergraph Network (DHN) for semi-supervised learning of probability distributions (Soft SSL). A key contribution of this paper is to establish generalisation error bounds for GNN-based soft SSL. In fact, our theoretical analysis is quite general and has straightforward applicability to DHN as well as to existing hypergraph methods. We demonstrate the effectiveness of our method through detailed experimentation on real-world datasets. We have made the code available.

Naganand Yadati, Tingran Gao, Shahab Asoodeh, Partha Talukdar, Anand Louis

A Meta-path Based Graph Convolutional Network with Multi-scale Semantic Extractions for Heterogeneous Event Classification

Heterogeneous social events modeling in large and noisy data sources is an important task for applications such as international situation assessment and disaster relief. Accurate and interpretable classification can help human analysts better understand global social dynamics and make quick and accurate decisions. However, achieving these goals is challenging due to several factors: (i) it is not easy to model different types of objects and relations in heterogeneous events in an unified manner, (ii) it is difficult to extract different semantic dependences at different scales among word sequences, and (iii) it is hard to accurately learn the subtle difference between events. Recently, graph neural networks have demonstrated advantages in learning complex and heterogeneous data. In this paper, we design a social event modeling method based on a Heterogeneous Information Network (HIN) and meta-path to calculate the similarity of events. In order to extract different semantic dependence, we propose a multi-scales semantic feature extraction framework. We present a Local Extrema Graph convolution Network (LEGCN) to expand the difference of various types events achieving accurate classification. We conduct extensive experiments on multiple real-world datasets and show that the proposed method exhibits comparable or even superior performance and also provides convincing interpretation capabilities.

Haiyang Wang, Xin Song, Yujia Liu, ChenGuang Chen, Bin Zhou

Noise-Enhanced Unsupervised Link Prediction

Link prediction has attracted attention from multiple research areas. Although several – mostly unsupervised – link prediction methods have been proposed, improving them is still under study. In several fields of science, noise is used as an advantage to improve information processing, inspiring us to also investigate noise enhancement in link prediction. In this research, we study link prediction from a data preprocessing point of view by introducing a noise-enhanced link prediction framework that improves the links predicted by current link prediction heuristics. The framework proposes three noise methods to help predict better links. Theoretical explanation and extensive experiments on synthetic and real-world datasets show that our framework helps improve current link prediction methods.

Reyhaneh Abdolazimi, Reza Zafarani

Weak Supervision Network Embedding for Constrained Graph Learning

Constrained learning, a weakly supervised learning task, aims to incorporate domain constraints to learn models without requiring labels for each instance. Because weak supervision knowledge is useful and easy to obtain, constrained learning outperforms unsupervised learning in performance and is preferable than supervised learning in terms of labeling costs. To date, constrained learning, especially constrained clustering, has been extensively studied, but was primarily focused on data in the Euclidean space. In this paper, we propose a weak supervision network embedding (WSNE) for constrained learning of graphs. Because no label is available for individual nodes, we propose a new loss function to quantify the constraint-based loss, and integrate this loss in a graph convolutional neural network (GCN) and variational graph auto-encoder (VGAE) combined framework to jointly model graph structures and node attributes. The joint optimization allows WSNE to learn embedding not only preserving network topology and content, but also satisfying the constraints. Experiments show that WSNE outperforms baselines for constrained graph learning tasks, including constrained graph clustering and constrained graph classification.

Ting Guo, Xingquan Zhu, Yang Wang, Fang Chen

RAGA: Relation-Aware Graph Attention Networks for Global Entity Alignment

Entity alignment (EA) is the task to discover entities referring to the same real-world object from different knowledge graphs (KGs), which is the most crucial step in integrating multi-source KGs. The majority of the existing embedding-based entity alignment methods embed entities and relations into a vector space based on relation triples of KGs for local alignment. As these methods insufficiently consider the multiple relations between entities, the structure information of KGs has not been fully leveraged. In this paper, we propose a novel framework based on Relation-aware Graph Attention Networks to capture the interactions between entities and relations. Our framework adopts the self-attention mechanism to spread entity information to the relations and then aggregate relation information back to entities. Furthermore, we propose a global alignment algorithm to make one-to-one entity alignments with a fine-grained similarity matrix. Experiments on three real-world cross-lingual datasets show that our framework outperforms the state-of-the-art methods.

Renbo Zhu, Meng Ma, Ping Wang

Graph Attention Networks with Positional Embeddings

Graph Neural Networks (GNNs) are deep learning methods which provide the current state of the art performance in node classification tasks. GNNs often assume homophily – neighboring nodes having similar features and labels–, and therefore may not be at their full potential when dealing with non-homophilic graphs. In this work, we focus on addressing this limitation and enable Graph Attention Networks (GAT), a commonly used variant of GNNs, to explore the structural information within each graph locality. Inspired by the positional encoding in the Transformers, we propose a framework, termed Graph Attentional Networks with Positional Embeddings (GAT-POS), to enhance GATs with positional embeddings which capture structural and positional information of the nodes in the graph. In this framework, the positional embeddings are learned by a model predictive of the graph context, plugged into an enhanced GAT architecture, which is able to leverage both the positional and content information of each node. The model is trained jointly to optimize for the task of node classification as well as the task of predicting graph context. Experimental results show that GAT-POS reaches remarkable improvement compared to strong GNN baselines and recent structural embedding enhanced GNNs on non-homophilic graphs.

Liheng Ma, Reihaneh Rabbany, Adriana Romero-Soriano

Unified Robust Training for Graph Neural Networks Against Label Noise

Graph neural networks (GNNs) have achieved state-of-the-art performance for node classification on graphs. The vast majority of existing works assume that genuine node labels are always provided for training. However, there has been very little research effort on how to improve the robustness of GNNs in the presence of label noise. Learning with label noise has been primarily studied in the context of image classification, but these techniques cannot be directly applied to graph-structured data, due to two major challenges—label sparsity and label dependency—faced by learning on graphs. In this paper, we propose a new framework, UnionNET, for learning with noisy labels on graphs under a semi-supervised setting. Our approach provides a unified solution for robustly training GNNs and performing label correction simultaneously. The key idea is to perform label aggregation to estimate node-level class probability distributions, which are used to guide sample reweighting and label correction. Compared with existing works, UnionNET has two appealing advantages. First, it requires no extra clean supervision, or explicit estimation of the noise transition matrix. Second, a unified learning framework is proposed to robustly train GNNs in an end-to-end manner. Experimental results show that our proposed approach: (1) is effective in improving model robustness against different types and levels of label noise; (2) yields significant improvements over state-of-the-art baselines.

Yayong Li, Jie Yin, Ling Chen

Graph InfoClust: Maximizing Coarse-Grain Mutual Information in Graphs

This work proposes a new unsupervised (or self-supervised) node representation learning method that aims to leverage the coarse-grain information that is available in most graphs. This extends previous attempts that only leverage fine-grain information (similarities within local neighborhoods) or global graph information (similarities across all nodes). Intuitively, the proposed method identifies nodes that belong to the same clusters and maximizes their mutual information. Thus, coarse-grain (cluster-level) similarities that are shared between nodes are preserved in their representations. The core components of the proposed method are (i) a jointly optimized clustering of nodes during learning and (ii) an Infomax objective term that preserves the mutual information among nodes of the same clusters. Our method is able to outperform competing state-of-art methods in various downstream tasks, such as node classification, link prediction, and node clustering. Experiments show that the average gain is between 0.2% and 6.1%, over the best competing approach, over all tasks. Our code is publicly available at: https://github.com/cmavro/Graph-InfoClust-GIC .

Costas Mavromatis, George Karypis

A Deep Hybrid Pooling Architecture for Graph Classification with Hierarchical Attention

Graph classification has been a classical problem of interest in machine learning and data mining because of its role in biological and social network analysis. Due to the recent success of graph neural networks for node classification and representation, researchers started extending them for the entire graph classification purpose. The main challenge is to represent the whole graph by a single vector which can be used to classify the graph in an end-to-end fashion. Global pooling, where node representations are directly aggregated to form the graph representation and more recently hierarchical pooling, where the whole graph is converted to a smaller graph through a set of hierarchies, are proposed in the literature. Though hierarchical pooling shows promising results for graph classification, it looses a significant amount of information in the hierarchical architecture. To address this, we propose a novel hybrid graph pooling architecture, which finds the importance of different hierarchies of pooling and aggregates them accordingly. We use a series of graph isomorphism networks, along with a bi-directional LSTM with self attention to implement the proposed hybrid pooling. Experiments show the merit of the proposed architecture with respect to a diverse set of state-of-the-art algorithms on multiple datasets.

Sambaran Bandyopadhyay, Manasvi Aggarwal, M. Narasimha Murty

Maximizing Explainability with SF-Lasso and Selective Inference for Video and Picture Ads

There is a growing interest in explainable machine learning methods. In our investigation, we have collected heterogeneous features from two series of YouTube video ads and seven series of Instagram picture ads to form our datasets. There are two main challenges that we found in analysing such data: i) multicollinearity and ii) infrequent common features. Due to these issues, standard estimation methods, such as OLS, Lasso, and Elastic-net, are only able to find a small number of significant features. This paper proposes a method called Significant Feature Lasso (SF-Lasso) to maximize model explainability by identifying most of the significant features that affect a target outcome (such as online video and picture ad popularity). Experiments show that SF-Lasso is able to identify much more significant features while maintaining similar prediction accuracy as what Lasso and Elastic-net can obtain. The human evaluation shows that SF-Lasso is better at identifying true features that appeal to ad viewers. We also find that the number of significant features is mainly affected by the model size (i.e., the number of active variables) and the correlations among explanatory variables.

Eunkyung Park, Raymond K. Wong, Junbum Kwon, Victor W. Chu

Reliably Calibrated Isotonic Regression

Using classifiers for decision making requires well-calibrated probabilities for estimation of expected utility. Furthermore, knowledge of the reliability is needed to quantify uncertainty. Outputs of most classifiers can be calibrated, typically by using isotonic regression that bins classifier outputs together to form empirical probability estimates. However, especially for highly imbalanced problems it produces bins with few samples resulting in probability estimates with very large uncertainty. We provide a formal method for quantifying the reliability of calibration and extend isotonic regression to provide reliable calibration with guarantees for width of credible intervals of the probability estimates. We demonstrate the method in calibrating purchase probabilities in e-commerce and achieve significant reduction in uncertainty without compromising accuracy.

Otto Nyberg, Arto Klami

Multiple Instance Learning for Unilateral Data

Multi-instance learning (MIL) is a popular learning paradigm rooted in real-world applications. Recent studies have achieved prominent performance with sufficient annotation data. Nevertheless, acquisition of enough labeled data is often hard and only a little or partially labeled data is available. For example, in web text mining, the concerning bags (positive) is often rare compared with the unrelated ones (negative) and unlabeled ones. This leads to a new learning scenario with little negative bags and many unlabeled bags, which we name it as unilateral data. It is a new learning problem and has received little attention. In this paper, we propose a new method called Multiple Instance Learning for Unilateral Data (MILUD) to tackle this problem. To utilize the information of bags fully, we consider statistics characters and discriminative mapping information simultaneously. The key instances of bags are determined by the distinguishability of mapped samples based on fake labels. Besides, we also employed an empirical risk minimization loss function based on the mapping results to learn the optimal classifier and analyze its generalization error bound. The experimental results show that method outperforms other existing state-of-art methods.

Xijia Tang, Tingjin Luo, Tianxiang Luan, Chenping Hou

An Online Learning Algorithm for Non-stationary Imbalanced Data by Extra-Charging Minority Class

Online learning is one of the trending areas of machine learning in recent years. How to update the model based on new data is the core question in developing an online classifier. When new data arrives, the classifier should keep its model up-to-date by (1) learn new knowledge, (2) keep relevant learned knowledge, and (3) forget obsolete knowledge. This problem becomes more challenging in imbalanced non-stationary scenarios. Previous approaches save arriving instances, then utilize up/down sampling techniques to balance preserved samples and update their models. However, this strategy comes with two drawbacks: first, a delay in updating the models, and second, the up/down sampling causes information loss for the majority classes and introduces noise for the minority classes. To address these drawbacks, we propose the Hyper-Ellipses-Extra-Margin model (HEEM), which properly addresses the class imbalance challenge in online learning by reacting to every new instance as it arrives. HEEM keeps an ensemble of hyper-extended-ellipses for the minority class. Misclassified instances of the majority class are then used to shrink the ellipse, and correctly predicted instances of the minority class are used to enlarge the ellipse. Experimental results show that HEEM mitigates the class imbalance problem and outperforms the state-of-the-art methods.

Sajjad Kamali Siahroudi, Daniel Kudenko

Locally Linear Support Vector Machines for Imbalanced Data Classification

Classification of imbalanced data is one of most challenging aspects of machine learning. Despite over two decades of progress there is still a need for developing new techniques capable to overcome numerous difficulties embedded in the nature of imbalanced datasets. In this paper, we propose Locally Linear Support Vector Machines (LL-SVMs) for effectively handling imbalanced datasets. LL-SVMs is a lazy learning approach which trains a local classifier for each new test instance using its k nearest neighbors. This way, we are able to maximize the margin in the original input features space and obtain a better adaptation to complex class boundaries. We combine LL-SVMs with local oversampling and cost-sensitive approaches to make them skew-insensitive. Working only in the local neighborhood significantly improves the generalization over the minority class and tackles instance-level difficulties, such as class overlapping, borderline and noisy instances, as well as small disjuncts. An extensive experimental study shows that our local models are able to outperform their global counterparts, especially when handling difficult, borderline, and noisy imbalanced datasets.

Bartosz Krawczyk, Alberto Cano

Low-Dimensional Representation Learning from Imbalanced Data Streams

Learning from data streams is among the contemporary challenges in the machine learning domain, which is frequently plagued by the class imbalance problem. In non-stationary environments, ratios among classes, as well as their roles (majority and minority) may change over time. The class imbalance is usually alleviated by balancing classes with resampling. However, this suffers from limitations, such as a lack of adaptation to concept drift and the possibility of shifting the true class distributions. In this paper, we propose a novel ensemble approach, where each new base classifier is built using a low-dimensional embedding. We use class-dependent entropy linear manifold to find the most discriminative low-dimensional representation that is, at the same time, skew-insensitive. This allows us to address two challenging issues: (i) learning efficient classifiers from imbalanced and drifting streams without data resampling; and (ii) tackling simultaneously high-dimensional and imbalanced streams that pose extreme challenges to existing classifiers. Our proposed low-dimensional representation algorithm is a flexible plug-in that can work with any ensemble learning algorithm, making it a highly useful tool for difficult scenarios of learning from high-dimensional imbalanced and drifting data streams.

Łukasz Korycki, Bartosz Krawczyk

PhotoStylist: Altering the Style of Photos Based on the Connotations of Texts

The need to modify a photo to reflect the connotations of a text can arise due to multifarious reasons (e.g., a musician might modify a photo in the album cover to better reflect the connotations in her song lyrics). An interesting observation is that different styles of photos convey different feelings. In this paper, we propose the PhotoStylist scheme to effectively modify the style of an input photo to represent the connotations in an input text. Existing methods that aim to transfer emotions into photos rely on an emotion class being provided as input and modify the overall color of photos based on the input emotion class, generating unrealistic colors for many objects in the image. To address these limitations, we design PhotoStylist, a novel deep-learning-based approach, to alter the individual style of each object in the photo in a way that the connotations of the input text are naturally and effectively embedded into the modified photos. Evaluation results on the Amazon Mechanical Turk (MTurk) show that our scheme can achieve output photos significantly closer to the connotations of the input text than the output photos from the state-of-the-art baselines.

Siamul Karim Khan, Daniel (Yue) Zhang, Ziyi Kou, Yang Zhang, Dong Wang

Gazetteer-Guided Keyphrase Generation from Research Papers

The task of keyphrase generation aims to generate the key phrases that capture the primary content of a document. An external domain-specific gazetteer can assist in generating keyphrases that are literally absent in the document (i.e., do not match any contiguous sub-sequence of source text) but relevant to the content of the document. In this paper, we present a technique to integrate knowledge from a gazetteer in order to improve keyphrase generation from research papers. We also present a copy mechanism that helps our model to utilize the gazetteer vocabulary to deal with the out-of-vocabulary words in keyphrases. Since constructing and maintaining relevant high-quality gazetteer by hand is very expensive, we also propose a method for automatic construction of a gazetteer given the input document, by leveraging similar documents in the training corpus. The thus constructed gazetteer helps focus on corpus-level information carried by other similar documents. Although this external information is crucial, it is never considered in previous studies. Experiments on real world datasets of research papers demonstrate that our proposed approach improves the performance of the state-of-the-art keyphrase generation models.

T. Y. S. S. Santosh, Debarshi Kumar Sanyal, Plaban Kumar Bhowmick, Partha Pratim Das

Minits-AllOcc: An Efficient Algorithm for Mining Timed Sequential Patterns

Sequential pattern mining aims to find the subsequences in a sequence database that appear together in the order of timestamps. Although there exist sequential pattern mining techniques, they ignore the temporal relationship information between the itemsets in the subsequences. This information is important in many real-world applications. For example, even if healthcare providers know that symptom Y frequently occurs after symptom X, it is also valuable for them to be able to estimate when Y will occur after X so that they can provide treatment at the right time. Considering temporal relationship information for sequential pattern mining raises new issues to be solved, such as designing a new data structure to save this information and traversing this structure efficiently to discover patterns without re-scanning the database. In this paper, we propose an algorithm called Minits-AllOcc (MINIng Timed Sequential Pattern for All-time Occurrences) to find sequential patterns and the transition time between itemsets based on all possible occurrences of a pattern in the database. We also propose a parallel multicore CPU version of this algorithm, called MMinits-AllOcc (Multicore Minits-AllOcc), to deal with Big Data. Extensive experiments on real and synthetic datasets show the advantages of this approach over the brute-force method. Also, the multicore CPU version of the algorithm is shown to outperform the single-core version on Big Data by 2.5X.

Somayah Karsoum, Clark Barrus, Le Gruenwald, Eleazar Leal

: Harnessing Text and Temporal Tree Network for Rumor Detection on Twitter

Social media platforms have democratized the publication process resulting into easy and viral propagation of information. However, spread of rumors via such media often results into undesired and extremely impactful political, economic, social, psychological and criminal consequences. Several manual as well as automated efforts have been undertaken in the past to solve this critical problem. Existing automated methods are text based, user credibility based or use signals from the tweet propagation tree. We aim at using the text, user, propagation tree and temporal information jointly for rumor detection on Twitter. This involves several challenges like how to handle text variations on Twitter, what signals from user profile could be useful, how to best encode the propagation tree information, and how to incorporate the temporal signal. Our novel architecture, $$T^3N$$ T 3 N (Text and Temporal Tree Network), leverages deep learning based architectures to encode text, user and tree information in a temporal-aware manner. Our extensive comparisons show that our proposed methods outperform the state-of-the-art techniques by $$\sim $$ ∼ 7 and $$\sim $$ ∼ 6% points respectively on two popular benchmark datasets, and also lead to better early detection results.

Nikhil Pinnaparaju, Manish Gupta, Vasudeva Varma

AngryBERT: Joint Learning Target and Emotion for Hate Speech Detection

Automated hate speech detection in social media is a challenging task that has recently gained significant traction in the data mining and Natural Language Processing community. However, most of the existing methods adopt a supervised approach that depended heavily on the annotated hate speech datasets, which are imbalanced and often lack training samples for hateful content. This paper addresses the research gaps by proposing a novel multitask learning-based model, AngryBERT, which jointly learns hate speech detection with sentiment classification and target identification as secondary relevant tasks. We conduct extensive experiments to augment three commonly-used hate speech detection datasets. Our experiment results show that AngryBERT outperforms state-of-the-art single-task-learning and multitask learning baselines. We conduct ablation studies and case studies to empirically examine the strengths and characteristics of our AngryBERT model and show that the secondary tasks are able to improve hate speech detection.

Md Rabiul Awal, Rui Cao, Roy Ka-Wei Lee, Sandra Mitrović

SCARLET: Explainable Attention Based Graph Neural Network for Fake News Spreader Prediction

False information and true information fact checking it, often co-exist in social networks, each competing to influence people in their spread paths. An efficient strategy here to contain false information is to proactively identify if nodes in the spread path are likely to endorse false information (i.e. further spread it) or refutation information (thereby help contain false information spreading). In this paper, we propose SCARLET (truSt andCredibility bAsed gRaph neuraLnEtwork model using aTtention) to predict likely action of nodes in the spread path. We aggregate trust and credibility features from a node’s neighborhood using historical behavioral data and network structure and explain how features of a spreader’s neighborhood vary. Using real world Twitter datasets, we show that the model is able to predict false information spreaders with an accuracy of over 87%.

Bhavtosh Rath, Xavier Morales, Jaideep Srivastava

Content Matters: A GNN-Based Model Combined with Text Semantics for Social Network Cascade Prediction

Effectively modeling and predicting the size of information cascades is essential for downstream tasks such as rumor detection and epidemic prevention. Traditional methods normally rely on tedious hand-crafted feature engineering efforts, which is inefficient in complex diffusion processes such as social network (SN) cascades. In recent years, graph neural network methods have been successfully used in cascade prediction tasks. Most of these methods make use of the structural features of SNs, while the effect of textual user-generated content (UGC) is far from clear or fully utilized. In this paper, we focus on the questions of how the textual UGC affect user activation state and trigger their retweet behaviors. We propose a novel GNN-based model named TSGNN, which jointly model the textual content and structure features. It uses recurrent neural networks with attentions to learn content feature representations that potentially affect information propagation. We find that tweets of fewer high coherent topics are more likely to trigger user retweet behaviors, and we also design a gate mechanism to model the activation state of users under the combined influence of content, structure, and other self-activation. Experimental results demonstrate that TSGNN significantly outperforms all the state-of-the-art methods in multiple metrics.

Yujia Liu, Kang Zeng, Haiyang Wang, Xin Song, Bin Zhou

TERMCast: Temporal Relation Modeling for Effective Urban Flow Forecasting

Urban flow forecasting is a challenging task, given the inherent periodic characteristics of urban flow patterns. To capture the periodicity, existing urban flow prediction approaches are often designed with closeness, period, and trend components extracted from the urban flow sequence. However, these three components are often considered separately in the prediction model. These components have not been fully explored together and simultaneously incorporated in urban flow forecasting models. We introduce a novel urban flow forecasting architecture, TERMCast. A Transformer based long-term relation prediction module is explicitly designed to discover the periodicity and enable the three components to be jointly modeled This module predicts the periodic relation which is then used to yield the predicted urban flow tensor. To measure the consistency of the predicted periodic relation vector and the relation vector inferred from the predicted urban flow tensor, we propose a consistency module. A consistency loss is introduced in the training process to further improve the prediction performance. Through extensive experiments on three real-world datasets, we demonstrate that TERMCast outperforms multiple state-of-the-art methods. The effectiveness of each module in TERMCast has also been investigated.

Hao Xue, Flora D. Salim

Traffic Flow Driven Spatio-Temporal Graph Convolutional Network for Ride-Hailing Demand Forecasting

Accurately predicting the demand for ride-hailing in the region is important for transportation and the economy. Prior works are devoted to mining the spatio-temporal correlations between regions limited to historical demand data, weather data, and event data, ignoring rich traffic flow information related to citizens’ travel. However, due to the dynamic characteristics of traffic flow and the irregularity of the road network structure, it is difficult to utilize traffic flow information directly. In this paper, we propose a framework called traffic flow driven spatio-temporal graph convolutional network (TST-GCN) to forecast ride-hailing demand. Specifically, we construct a novel region graph based on point of interest (POI) information to model the association between different regions. Besides, we design a stacked traffic-region demand graph convolutional network (TRGCN) module, which is composed of two kinds of nested graph convolutional network structures, effectively modeling the spatial dynamic dependency between regions. Then, the convolution long short-term memory (ConvLSTM) layer is further adopted to obtain spatio-temporal features. We evaluate the proposed model on two real datasets, and the experimental results show that our model outperforms many state-of-art methods.

Hao Fu, Zhong Wang, Yang Yu, Xianwei Meng, Guiquan Liu

A Proximity Forest for Multivariate Time Series Classification

Multivariate time series (MTS) classification has gained attention in recent years with the increase of multiple temporal datasets from various domains, such as human activity recognition, medical diagnosis, etc. The research on MTS is still insufficient and poses two challenges. First, discriminative features may exist on the interactions among dimensions rather than individual series. Second, the high dimensionality exponentially increases computational complexity. For that, we propose a novel proximity forest for MTS classification (MTSPF). MTSPF builds an ensemble of proximity trees that are split through the proximity between unclassified time series and its exemplar one. The proximity of trees is measured by locally slope-based dynamic time warping (DTW), which enhances traditional DTW by considering regional slope information. To extract the interaction among dimensions, several dimensions of an MTS instance are randomly selected and converted into interrelated sequences as the input of trees. After constructing each tree independently, the weight of each tree is calculated for weighted classifying. Experimental results on the UEA MTS datasets demonstrate the efficiency and accuracy of the proposed approach.

Yue Zhang, Zhihai Wang, Jidong Yuan

C-Guard: A Cross-Correlation Gaining Framework for Urban Air Quality Prediction

Predicting air quality is increasingly important for protecting people’s daily health and helping government decision-making. The multistep air quality prediction largely depends on the correlations of air quality-related factors. How to model the correlations among factors is a big challenge. In this paper, we propose a cross-correlation gaining framework (C $$^2$$ 2 -Guard) consisting of a temporal correlation module, factor correlation module, and cross gaining module for air quality (mainly PM2.5) prediction. Specifically, the temporal correlation module is used to extract the temporal dependence of air pollutant time series to gain their distributed representation. In the factor correlation module, a novel convolution and recalibration block is designed for air quality factor correlations extraction to gain their distributed representation in the factor dimension. In the cross gaining module, a joint-representation block is proposed to learn the cross-correlations between time and factor dimensions. Finally, extensive experiments are conducted on two real-world air quality datasets. The results demonstrate that our C $$^2$$ 2 -Guard outperforms the state-of-the-art methods of air pollutants prediction in terms of RMSE and MAE.

Yu Chu, Lin Li, Qing Xie, Guandong Xu

Simultaneous Multiple POI Population Pattern Analysis System with HDP Mixture Regression

In recent years, the use of smartphone Global Positioning System (GPS) logs has accelerated the analysis of urban dynamics. Predicting the population of a city is important for understanding the land use patterns of specific areas of interest. The current state-of-the-art predictive model is a variant of bilinear Poisson regression models. It is independently optimized for each point of interest (POI) using the GPS logs captured at that single POI. Thus, it is prone to instability during fine-scale POI analysis. Inspired by the success of topic modeling, in this study, we propose a novel approach based on the hierarchical Dirichlet process mixture regression to capture the relationship between POIs and upgrade the prediction performance. Specifically, the proposed model enables mixture regression for each POI, while the parameters of each regression are shared across the POIs owing to the hierarchical Bayesian property. The empirical study using 32 M GPS logs from mobile phones in Tokyo shows that our model for large-scale finer-mesh analysis outperforms the state-of-the-art models. We also show that our proposed model realizes important applications, such as visualizing the relationship between cities or abnormal population increase during an event.

Yuta Hayakawa, Kota Tsubouchi, Masamichi Shimosaka

Interpretable Feature Construction for Time Series Extrinsic Regression

Supervised learning of time series data has been extensively studied for the case of a categorical target variable. In some application domains, e.g., energy, environment and health monitoring, it occurs that the target variable is numerical and the problem is known as time series extrinsic regression (TSER). In the literature, some well-known time series classifiers have been extended for TSER problems. As first benchmarking studies have focused on predictive performance, very little attention has been given to interpretability. To fill this gap, in this paper, we suggest an extension of a Bayesian method for robust and interpretable feature construction and selection in the context of TSER. Our approach exploits a relational way to tackle with TSER: (i), we build various and simple representations of the time series which are stored in a relational data scheme, then, (ii), a propositionalisation technique (based on classical aggregation/selection functions from the relational data field) is applied to build interpretable features from secondary tables to “flatten” the data; and (iii), the constructed features are filtered out through a Bayesian Maximum A Posteriori approach. The resulting transformed data can be processed with various existing regressors. Experimental validation on various benchmark data sets demonstrates the benefits of the suggested approach.

Dominique Gay, Alexis Bondu, Vincent Lemaire, Marc Boullé

SEPC: Improving Joint Extraction of Entities and Relations by Strengthening Entity Pairs Connection

Joint extraction of entities and relations aims at recognizing relational triples (subject s, relation r, object o) from unstructured text. For any entity pair (s, o) in correct relational triples, they do not appear independently, but depending on each other. While existing approaches usually model entity pairs only by sharing the encoder layer, which is insufficient to exploit entity pair intrinsic connection. To solve this problem, we propose to strengthen entity pairs connection (SEPC) by utilizing the duality property of entity pairs, which can further improve the joint extraction. The entity pairs recognization is transformed to finding subject conditioned on the object and finding object conditioned on the subject, and the dual supervised learning is introduced to model their connection. We finally demonstrate the effectiveness of our proposed method on two widely used datasets NYT and WebNLG (Code and data available: https://github.com/zjp9574/SEPC ).

Jiapeng Zhao, Panpan Zhang, Tingwen Liu, Jinqiao Shi

Backmatter

Titel: Advances in Knowledge Discovery and Data Mining
herausgegeben von: Prof. Kamal Karlapalem
Hong Cheng
Naren Ramakrishnan
R. K. Agrawal
P. Krishna Reddy
Dr. Jaideep Srivastava
Assist. Prof. Tanmoy Chakraborty
Verlag: Springer International Publishing
Electronic ISBN: 978-3-030-75762-5
Print ISBN: 978-3-030-75761-8
DOI: https://doi.org/10.1007/978-3-030-75762-5