nach oben

2018 | Buch

Kapitel lesen Erstes Kapitel lesen

Database Systems for Advanced Applications

23rd International Conference, DASFAA 2018, Gold Coast, QLD, Australia, May 21-24, 2018, Proceedings, Part II

herausgegeben von: Jian Pei, Yannis Manolopoulos, Shazia Sadiq, Jianxin Li

Verlag: Springer International Publishing

Buchreihe : Lecture Notes in Computer Science

Enthalten in: Springer Professional "Wirtschaft+Technik" , Springer Professional "Technik" , Springer Professional "Wirtschaft"

Einloggen, um Zugang zu erhalten

Über dieses Buch

This two-volume set LNCS 10827 and LNCS 10828 constitutes the refereed proceedings of the 23rd International Conference on Database Systems for Advanced Applications, DASFAA 2018, held in Gold Coast, QLD, Australia, in May 2018.
The 83 full papers, 21 short papers, 6 industry papers, and 8 demo papers were carefully selected from a total of 360 submissions. The papers are organized around the following topics: network embedding; recommendation; graph and network processing; social network analytics; sequence and temporal data processing; trajectory and streaming data; RDF and knowledge graphs; text and data mining; medical data mining; security and privacy; search and information retrieval; query processing and optimizations; data quality and crowdsourcing; learning models; multimedia data processing; and distributed computing.

Inhaltsverzeichnis

Frontmatter

Correction to: Coverage-Oriented Diversification of Keyword Search Results on Graphs

In the originally published version of chapter 10 the funding information in the acknowledgement section was incomplete. This has now been corrected.

Ming Zhong, Ying Wang, Yuanyuan Zhu

Medical Data Mining

Frontmatter

Personalized Prescription for Comorbidity

Personalized medicine (PM) aiming at tailoring medical treatment to individual patient is critical in guiding precision prescription. An important challenge for PM is comorbidity due to the complex interrelation of diseases, medications and individual characteristics of the patient. To address this, we study the problem of PM for comorbidity and propose a neural network framework Deep Personalized Prescription for Comorbidity (PPC). PPC exploits multi-source information from massive electronic medical records (EMRs), such as demographic information and laboratory indicators, to support personalized prescription. Patient-level, disease-level and drug-level representations are simultaneously learned and fused with a trilinear method to achieve personalized prescription for comorbidity. Experiments on a publicly real world EMRs dataset demonstrate PPC outperforms state-of-the-art works.

Lu Wang, Wei Zhang, Xiaofeng He, Hongyuan Zha

Modeling Patient Visit Using Electronic Medical Records for Cost Profile Estimation

Estimating health care cost of patients provides promising opportunities for better management and treatment to medical providers and patients. Existing clinical approaches only focus on patient’s demographics and historical diagnoses but ignore ample information from clinical records. In this paper, we formulate the problem of patient’s cost profile estimation and use Electronic Medical Records (EMRs) to model patient visit for better estimating future health care cost. The performance of traditional learning based methods suffered from the sparseness and high dimensionality of EMR dataset. To address these challenges, we propose Patient Visit Probabilistic Generative Model (PVPGM) to describe a patient’s historical visits in EMR. With the help of PVPGM, we can not only learn a latent patient condition in a low dimensional space from sparse and missing data but also hierarchically organize the high dimensional EMR features. The model finally estimates the patient’s health care cost through combining the effects learned both from the latent patient condition and the generative process of medical procedure. We evaluate the proposed model on a large collection of real-world EMR dataset with 836,033 medical visits from over 50,000 patients. Experimental results demonstrate the effectiveness of our model.

Kangzhi Zhao, Yong Zhang, Zihao Wang, Hongzhi Yin, Xiaofang Zhou, Jin Wang, Chunxiao Xing

Learning the Representation of Medical Features for Clinical Pathway Analysis

Clinical Pathway (CP) represents the best practice of treatment process management for inpatients with specific diagnosis, and a treatment process can be divided into several stages, usually in units of days. With the explosion of medical data, CP analysis is receiving increasing attention, which provides important support for CP design and optimization. However, these data-driven researches often suffer from the high complexity of medical data, so that a proper representation of medical features is necessary. Most of existing representation learning methods in healthcare domain focus on outpatient data, which get weak performance and interpretability when adopted for CP analysis. In this paper, we propose a new representation, RoMCP, which can capture both diagnosis information and temporal relations between days. The learned diagnosis embedding grasps the key factors of the disease, and each day embedding is determined by the diagnosis together with the preorder days. We evaluate RoMCP on real-world dataset with 538K inpatient visits for several typical CP analysis tasks. Our method demonstrates significant improvement on performance and interpretation.

Xiao Xu, Ying Wang, Tao Jin, Jianmin Wang

Domain Supervised Deep Learning Framework for Detecting Chinese Diabetes-Related Topics

As millions of people are diagnosed with diabetes every year in China, many diabetes-related websites in Chinese provide news and articles. However, most of the online articles are uncategorized or lack a clear or unified topic, users often cannot find their topics of interest effectively and efficiently. The problem of health text classification on Chinese websites cannot be easily addressed by applying existing approaches, which have been used for English documents, in a straightforward manner. To address this problem and meet users’ demand for diabetes-related information needs, we propose a Chinese domain lexicon, adopt some professional diabetes topic explanations as domain knowledge and incorporate them into deep learning approach to form our topic classification framework. Our experiments using real datasets showed that the framework significantly achieved a higher effectiveness and accuracy in categorizing diabetes-related topics than most of the state-of-the-art benchmark approaches. Our experimental analysis also revealed that some health websites provided some incorrect or misleading category information.

Xinhuan Chen, Yong Zhang, Kangzhi Zhao, Qingcheng Hu, Chunxiao Xing

Security and Privacy

Frontmatter

Publishing Graph Node Strength Histogram with Edge Differential Privacy

Protecting the private graph data while releasing accurate estimate of the data is one of the most challenging problems in data privacy. Node strength combines the topological information with the weight distribution of the weighted graph in a natural way. Since an edge in graph data oftentimes represents relationship between two nodes, edge-differential privacy (edge-DP) can protect relationship between two entities from being disclosed. In this paper, we investigate the problem of publishing the node strength histogram of a private graph under edge-DP. We propose two clustering approaches based on sequence-aware and local density to aggregate histogram. Our experimental study demonstrates that our approaches can greatly reduce the error of approximating the true node strength histogram.

Qing Qian, Zhixu Li, Pengpeng Zhao, Wei Chen, Hongzhi Yin, Lei Zhao

PrivTS: Differentially Private Frequent Time-Constrained Sequential Pattern Mining

In this paper, we address the problem of mining time-constrained sequential patterns under the differential privacy framework. The mining of time-constrained sequential patterns from the sequence dataset has been widely studied, in which the transition time between adjacent items should not be too large to form frequent sequential patterns. A wide spectrum of applications can greatly benefit from such patterns, such as movement behavior analysis, targeted advertising, and POI recommendation. Improper releasing and use of such patterns could jeopardize the individually’s privacy, which motivates us to apply differential privacy to mining such patterns. It is a challenging task due to the inherent sequentiality and high complexity. Towards this end, we propose a two-phase algorithm PrivTS, which consists of sample-based filtering and count refining modules. The former takes advantage of an improved sparse vector technique to retrieve a set of potentially frequent sequential patterns. Utilizing this information, the latter computes their noisy supports and detects the final frequent patterns. Extensive experiments conducted on real-world datasets demonstrate that our approach maintains high utility while providing privacy guarantees.

Yanhui Li, Guoren Wang, Ye Yuan, Xin Cao, Long Yuan, Xuemin Lin

Secure Range Query over Encrypted Data in Outsourced Environments

With the rapid development of cloud computing paradigm, data owners have the opportunity to outsource their databases and management tasks to the cloud. Due to the privacy concerns, it is required for them to encrypt the databases before outsourcing. However, there is no existing techniques handling range queries in a fully secure way. Therefore, in this paper we focus exactly on secure processing of range queries over outsourced encrypted databases. To efficiently process secure range queries, the extraordinarily challenging task is how to perform fully secure range queries over encrypted data without the cloud ever decrypting the data. To address the challenge, we first propose a basic secure range queries algorithm which is not absolutely secure (i.e., leaking the privacy of access patterns and path patterns). To meet a better security, we present a fully secure algorithm that preserves the privacy of the data, query, result, access patterns and path patterns. At last, we empirically analyze and conduct a comprehensive performance evaluation using real dataset to validate our ideas and the proposed secure algorithms.

Ningning Cui, Xiaochun Yang, Leixia Wang, Bin Wang, Jianxin Li

TRQED: Secure and Fast Tree-Based Private Range Queries over Encrypted Cloud

With the prevalence of cloud computing, data owners are motivated to outsource their databases to the cloud. However, sensitive information has to be encrypted to preserve the privacy, which inevitably makes effective data utilization a challenging task. Existing work either focuses on keyword searches, or suffers from inadequate security guarantees or inefficiency. In this paper, we focus on the problem of multi-dimensional range queries over dynamic encrypted cloud data. We propose a Tree-based private Range Query scheme over dynamic Encrypted cloud Data (TRQED), which enables faster-than-linear range queries and supports data dynamics while preserving the query privacy and single-dimensional privacy simultaneously. TRQED achieves provable security against semi-honest adversaries under known background model. Extensive experiments on real-world datasets show that the overhead of TRQED is desirable, and TRQED is more efficient compared with existing work.

Wei Yang, Yang Xu, Yiwen Nie, Yao Shen, Liusheng Huang

Search and Information Retrieval

Frontmatter

iExplore: Accelerating Exploratory Data Analysis by Predicting User Intention

Exploratory data analysis over large datasets has become an increasingly prevalent use case. However, users are easily overwhelmed by the data and might take a long time to find interesting facts. In this paper, we design a system called iExplore to assist users in doing this time-consuming data exploration task through predicting user intention. Moreover, we propose an intention model to help the iExplore system have a comprehensive understanding of user’s intention. Thus, the exploratory process can be accelerated by the intention-driven recommendation and prefetching mechanisms. Extensive experiments demonstrate that the intention-driven iExplore system can significantly lighten the burden of users and facilitate the exploratory process.

Zhihui Yang, Jiyang Gong, Chaoying Liu, Yinan Jing, Zhenying He, Kai Zhang, X. Sean Wang

Coverage-Oriented Diversification of Keyword Search Results on Graphs

Query result diversification has drawn great research interests in recent years. Most previous work focuses on finding a locally diverse subset of a given finite result set, in which the results are as dissimilar to each other as possible. However, such a setup may not always hold. Firstly, we may need the result set to be globally diverse with respect to all possible demands behind a given query. Secondly, the result set may not be given before diversification. In this paper, we address these two problems in the scenario of keyword search on graphs. We first reasonably formalize a problem of coverage-oriented diversified keyword search on graphs. It aims to find both locally and globally diverse and also relevant results simultaneously while searching on graphs. The global diversity is defined as a query-dependent metric called coverage, which dynamically assigns weights to potential query demands with respect to their topological distances to the given keywords. Then, we present a search algorithm to solve our problem. It guarantees to return the optimal diverse result set, and can eliminate unnecessary and redundant diversity computation. Lastly, we perform both effectiveness and efficiency evaluation of our approach on DBPedia. Compared with the local diversification approach, our approach can improve the coverage and reduce the redundancy of search results remarkably.

Ming Zhong, Ying Wang, Yuanyuan Zhu

Novel Approaches to Accelerating the Convergence Rate of Markov Decision Process for Search Result Diversification

Recently, some studies have utilized the Markov Decision Process for diversifying (MDP-DIV) the search results in information retrieval. Though promising performances can be delivered, MDP-DIV suffers from a very slow convergence, which hinders its usability in real applications. In this paper, we aim to promote the performance of MDP-DIV by speeding up the convergence rate without much accuracy sacrifice. The slow convergence is incurred by two main reasons: the large action space and data scarcity. On the one hand, the sequential decision making at each position needs to evaluate the query-document relevance for all the candidate set, which results in a huge searching space for MDP; on the other hand, due to the data scarcity, the agent has to proceed more “trial and error” interactions with the environment. To tackle this problem, we propose MDP-DIV-kNN and MDP-DIV-NTN methods. The MDP-DIV-kNN method adopts a k nearest neighbor strategy, i.e., discarding the k nearest neighbors of the recently-selected action (document), to reduce the diversification searching space. The MDP-DIV-NTN employs a pre-trained diversification neural tensor network (NTN-DIV) as the evaluation model, and combines the results with MDP to produce the final ranking solution. The experiment results demonstrate that the two proposed methods indeed accelerate the convergence rate of the MDP-DIV, which is 3x faster, while the accuracies produced barely degrade, or even are better.

Feng Liu, Ruiming Tang, Xutao Li, Yunming Ye, Huifeng Guo, Xiuqiang He

Structures or Texts? A Dynamic Gating Method for Expert Finding in CQA Services

Expert finding plays an important role in community question answering websites. Previously, most works focused on assessing the user expertise scores mainly from their past question-answering semantic features. In this work, we propose a gating mechanism to dynamically combine structural and textual representations based on past question-answering behaviors. We also use some user activities including temporal behaviors as the features, which determine the gate values. We evaluate the performance of our method on the well-known question answering sites Stackexchange and Quora. Experiments show that our approach can improve the performance on expert finding tasks.

Zhiqiang Liu, Yan Zhang

Query Processing and Optimizations

Frontmatter

Collusion-Resistant Processing of SQL Range Predicates

Prior solutions for securely handling SQL range predicates in outsourced cloud-resident databases have primarily focused on passive attacks in the Honest-but-Curious adversarial model, where the server is only permitted to observe the encrypted query processing. We consider here a significantly more powerful adversary, wherein the server can launch an active attack by clandestinely issuing specific range queries via collusion with a few compromised clients. The security requirement in this environment is that data values from a plaintext domain of size N should not be leaked to within an interval of size $$H$$H. Unfortunately, all prior encryption schemes for range predicate evaluation are easily breached with only $$O(log_2\psi )$$O(log2ψ) range queries, where $$\psi = N/H$$ψ=N/H. To address this lacuna, we present SPLIT, a new encryption scheme where the adversary requires exponentially more – $$\mathbf{O}(\psi )$$O(ψ) – range queries to breach the interval constraint, and can therefore be easily detected by standard auditing mechanisms.The novel aspect of SPLIT is that each value appearing in a range-sensitive column is first segmented into two parts. These segmented parts are then independently encrypted using a layered composition of a Secure Block Cipher with the Order-Preserving Encryption and Prefix-Preserving Encryption schemes, and the resulting ciphertexts are stored in separate tables. At query processing time, range predicates are rewritten into an equivalent set of table-specific sub-range predicates, and the disjoint union of their results forms the query answer. A detailed evaluation of SPLIT on benchmark database queries indicates that its execution times are well within a factor of two of the corresponding plaintext times, testifying to its efficiency in resisting active adversaries.

Manish Kesarwani, Akshar Kaul, Gagandeep Singh, Prasad M. Deshpande, Jayant R. Haritsa

Interactive Transaction Processing for In-Memory Database System

In-memory transaction processing has gained fast development in recent years. Previous works usually assume the one-shot transaction model, where transactions are run as stored procedures. Though many systems have shown impressive throughputs in handling one-shot transactions, it is hard for developers to debug and maintain stored procedures. According to a recent survey, most applications still prefer to operate the database using the JDBC/ODBC interface. Upon realizing this, the work targets on the problem of interactive transaction processing for in-memory database system. Our key contributions are: (1) we address several important design considerations for supporting interaction transaction processing; (2) a coroutine-based execution engine is proposed to handle different kinds of blocking efficiently and improve the CPU usage; (3) a lightweight and latch-free lock manager is designed to schedule transaction conflicts without introducing many overhead; (4) experiments on both the TPC-C and a micro benchmark show that our method achieves better performance than existing solutions.

Tao Zhu, Donghui Wang, Huiqi Hu, Weining Qian, Xiaoling Wang, Aoying Zhou

An Adaptive Eviction Framework for Anti-caching Based In-Memory Databases

Current in-memory DBMSs suffer from the performance bottleneck when data cannot fit in memory. To solve such a problem, anti-caching system is proposed and with proper configuration, it can achieve better performance than state-of-the-art counterpart. However, in current anti-caching eviction procedure, all the eviction parameters are fixed while real workloads keep changing from time to time. Therefore, the performance of anti-caching system can hardly stay in the best state. We propose an adaptive eviction framework for anti-caching system and implement four tuning techniques to automatically tune the eviction parameters. In particular, we design a novel tuning technique called window-size adaption specialized for anti-caching system and embed it into the adaptive eviction framework. The experimental results show that with adaptive eviction, anti-caching based database system can outperform the traditional prototype by 1.2x–1.8x and 1.7x–4.5x under TPC-C benchmark and YCSB benchmark, respectively.

Kaixin Huang, Shengan Zheng, Yanyan Shen, Yanmin Zhu, Linpeng Huang

Efficient Complex Social Event-Participant Planning Based on Heuristic Dynamic Programming

To manage the Event Based Social Networks (EBSNs), an important task is to solve the Global Event Planning with Constraints (GEPC) problem, which arranges suitable social events to target users. Existing studies are not efficient enough because of the two-step framework. In this paper, we propose a more efficient method, called Heuristic-DP, which asynchronously considers all the constraints together. Using this method, we improve the computational complexity from $$O(|E|^2 + |U||E|^2)$$O(|E|2+|U||E|2) to O(|U||E|), where |U| is the number of users and |E| is the number of events in an EBSN platform. We also propose an improved heuristic strategy in one function of the heuristic-DP algorithm, which slightly increases the time cost, but can obtain a more accurate result. Finally, we verify the effectiveness and efficiency of our proposed algorithms through extensive experiments over real and synthetic datasets.

Junchang Xin, Mo Li, Wangzihao Xu, Yizhu Cai, Minhua Lu, Zhiqiong Wang

Data Quality and Crowdsourcing

Frontmatter

Repairing Data Violations with Order Dependencies

Lexicographical order dependencies (ODs) are proposed to describe the relationships between two lexicographical ordering specifications with respect to lists of attributes, and are proved to be useful in query optimizations concerning ordered attributes. To take full advantage of ODs, the data instance is supposed to satisfy OD specifications. In practice, data are often found to violate given ODs, as demonstrated in recent studies on discovery of ODs. This highlights the quest for data repairing techniques for ODs, to restore consistency of the data with respect to ODs. New challenges arise since ODs convey order semantics beyond functional dependencies, and are specified on lists of attributes. In this paper, we make a first effort to develop techniques for repairing data violations with ODs. (1) We formalize the data repairing problem for ODs, and prove that it is NP-hard in the size of the data. (2) Despite the intractability, we develop effective heuristic algorithms to address the problem. (3) We experimentally evaluate the effectiveness and efficiency of our algorithms, using both real-life and synthetic data.

Yu Qiu, Zijing Tan, Kejia Yang, Weidong Yang, Xiangdong Zhou, Naiwang Guo

Multi-Worker-Aware Task Planning in Real-Time Spatial Crowdsourcing

Spatial crowdsourcing emerges as a new computing paradigm with the development of mobile Internet and the ubiquity of mobile devices. The core of many real-world spatial crowdsourcing applications is to assign suitable tasks to proper workers in real time. Many works only assign a set of tasks to each worker without making the plan how to perform the assigned tasks. Others either make task plans only for a single worker or are unable to operate in real time. In this paper, we propose a new problem called the Multi-Worker-AwareTaskPlanning (MWATP) problem in the online scenario, in which we not only assign tasks to workers but also make plans for them, such that the total utility (revenue) is maximized. We prove that the offline version of MWATP problem is NP-hard, and no online algorithm has a constant competitive ratio on the MWATP problem. Two heuristic algorithms, called Delay-Planning and Fast-Planning, are proposed to solve the problem. Extensive experiments on synthetic and real datasets verify the effectiveness and efficiency of the two proposed algorithms.

Qian Tao, Yuxiang Zeng, Zimu Zhou, Yongxin Tong, Lei Chen, Ke Xu

MT-MCD: A Multi-task Cognitive Diagnosis Framework for Student Assessment

Student assessment aims to diagnose student latent attributes (e.g., skill proficiency), which is a crucial issue for many educational applications. Existing studies, such as cognitive diagnosis, mainly focus on exploiting students’ scores on questions to mine their attributes from an independent exam. However, in many real-world scenarios, different students usually participate in different exams, where the results obtained from different exams by traditional methods are not comparable to each other. Therefore, the problem of conducting assessments from different exams to obtain precise and comparable results is still underexplored. To this end, in this paper, we propose a Multi Task - Multidimensional Cognitive Diagnosis framework (MT-MCD) for student assessment from different exams simultaneously. In the framework, we first apply a multidimensional cognitive diagnosis model for each independent assessment task. Then, we extract features from the question texts to bridge the connections with each task. After that, we employ a multi-task optimization method for the framework learning. MT-MCD is a general framework where we develop two effective implementations based on two representative cognitive diagnosis models. We conduct extensive experiments on real-world datasets where the experimental results demonstrate that MT-MCD can obtain more precise and comparable assessment results.

Tianyu Zhu, Qi Liu, Zhenya Huang, Enhong Chen, Defu Lian, Yu Su, Guoping Hu

Towards Adaptive Sensory Data Fusion for Detecting Highway Traffic Conditions in Real Time

The key challenge of detecting highway traffic conditions is to achieve it in a fully-covered, high-accuracy, low-cost and real-time manner. We present an approach named Megrez on the basis of treating mobile phones and probe vehicles as roving sensors, loop detectors as static sensors. Megrez can admit one or multiple types of data, including signaling data in a mobile communication network, data from loop detectors, and GPS data from probe vehicles, to carry out the traffic estimation and monitoring. In order to accurately reconstruct traffic conditions with full road segment coverage, Megrez provides a practical way to overcome the sparsity and incoherence of sensory data and recover the missing data in light of recent progresses in compressive sensing. Moreover, Megrez incorporates the characteristics of traffic flows to rectify the estimates. Using large-scale real-world data as input, we conduct extensive experiments to evaluate Megrez. The experimental results show that, in contrast to three other fusion methods, the results from our approach have high precisions and recalls. In addition, Megrez keeps the errors of estimates low even when not all three types of data are available.

Yanling Cui, Beihong Jin, Fusang Zhang, Tingjian Ge

On the Interaction of Functional and Inclusion Dependencies with Independence Atoms

Infamously, the finite and unrestricted implication problems for the classes of (i) functional and inclusion dependencies together, and (ii) embedded multivalued dependencies alone are each undecidable. Famously, the restriction of (i) to functional and unary inclusion dependencies in combination with the restriction of (ii) to multivalued dependencies yield implication problems that are still different in the finite and unrestricted case, but each are finitely axiomatizable and decidable in low-degree polynomial time. An important embedded tractable fragment of embedded multivalued dependencies are independence atoms. These stipulate independence between two attribute sets in the sense that for every two tuples there is a third tuple that agrees with the first tuple on the first attribute set and with the second tuple on the second attribute set. Our main results show that finite and unrestricted implication deviate for the combined class of independence atoms, unary functional and unary inclusion dependencies, but both are axiomatizable and decidable in low-degree polynomial time. This combined class adds arbitrary independence atoms to unary keys and unary foreign keys, which frequently occur in practice as surrogate keys and references to them.

Miika Hannula, Sebastian Link

Source Selection for Inconsistency Detection

Inconsistencies in a database can be detected based on violations of integrity constraints, such as functional depencies (FDs). In big data era, many related data sources give us the chance of detecting inconsistency extensively. That is, even though violations do not exist in a single data set D, we can leverage other data sources to discover potential violations. A significant challenge for violation detection based on data sources is that accessing too many data sources introduces a huge cost, while involving too few data sources may miss serious violations. Motivated by this, we investigate how to select a proper subset of sources for inconsistency detection. To address this problem, we formulate the gain model of sources and introduce the optimization problem of source selection, called SSID, in which the gain is maximized with the cost under a threshold. We show that the SSID problem is NP-hard and propose a greedy approximation approach for SSID. To avoid accessing data sources, we also present a randomized technique for gain estimation with theoretical guarantees. Experimental results on both real and synthetic data show high performance on both effectiveness and efficiency of our algorithm.

Lingli Li, Xu Feng, Hongyu Shao, Jinbao Li

Effective Solution for Labeling Candidates with a Proper Ration for Efficient Crowdsourcing

One of the core problems of crowdsourcing research is how to reduce the cost, in other words, how to get better results with a limited budget. To save budget, most researchers concentrate on internal steps of crowdsourcing while in this work we focus on the pre-processing stage: how to select the input for crowds to contribute. A straightforward application of this work is to help budget-limited machine learning researchers to get better balanced training data from crowd labeling. Specifically, we formulate the prior information based input manipulating procedure as the Candidate Selection Problem (CSP) and propose an end-squeezing algorithm for it. Our results show that a considerable cost reduction can be achieved by manipulating the input to the crowd with the help of some additional prior information. We verify the effectiveness and efficiency of these algorithms through extensive experiments.

Zhao Chen, Peng Cheng, Chen Zhang, Lei Chen

Handling Unreasonable Data in Negative Surveys

Negative survey is a method of collecting sensitive data. Compared with traditional surveys, negative survey can effectively protect the privacy of participants. Data collector usually has some background knowledge about the survey, and background knowledge could be effectively used for estimating aggregated results from the collected data. Traditional methods for estimating aggregated results would get some unreasonable data, such as negative values, and some values inconsistent with the background knowledge. Handling these unreasonable data could improve the accuracy of the estimated aggregated results. In this paper, we propose a method for handling values that are inconsistent with the background knowledge and negative values. The simulation results show that, compared with NStoPS, NStoPS-I and NStoPS-BK, more accurate aggregated results could be estimated by the proposed method.

Jianwen Xiang, Shu Fang, Dongdong Zhao, Jing Tian, Shengwu Xiong, Dong Li, Chunhui Yang

Learning Models

Frontmatter

Multi-view Proximity Learning for Clustering

In recent years, multi-view clustering has become a hot research topic due to the increasing amount of multi-view data. Among existing multi-view clustering methods, proximity-based method is a typical class and achieves much success. Usually, these methods need proximity matrices as inputs, which can be constructed by some nearest-neighbors-based approaches. However, in this way, neither the intra-view cluster structure nor the inter-view correlation is considered in constructing proximity matrices. To address this issue, we propose a novel method, named multi-view proximity learning. By introducing the idea of representative, our model can consider both the relations between data objects and the cluster structure within individual views. Besides, the spectral-embedding-based scheme is adopted for modeling the correlations across different views, i.e. the view consistency and complement properties. Extensive experiments on both synthetic and real-world datasets demonstrate the effectiveness of our method.

Kun-Yu Lin, Ling Huang, Chang-Dong Wang, Hong-Yang Chao

Extracting Label Importance Information for Multi-label Classification

Existing multi-label learning approaches assume all labels in a dataset are of the same importance. However, the importance of each label is generally different in real world. In this paper, we introduce multi-labelimportance (MLI) which measures label importance from two perspectives: label predictability and label effects. Specifically, label predictability and label effects can be extracted from training data before building models for multi-label learning. After that, the multi-labelimportance information can be used in existing approaches to improve the performance of multi-label learning. To prove this, we propose a classifier chain algorithm based on multi-labelimportance ranking and a improved kNN-based algorithm which takes both feature distance and label distance into consideration. We apply our algorithms on benchmark datasets demonstrating efficient multi-label learning by exploiting multi-labelimportance. It is also worth mentioning that our experiments show the strong positive correlation between label predictability and label effects.

Dengbao Wang, Li Li, Jingyuan Wang, Fei Hu, Xiuzhen Zhang

Exploiting Instance Relationship for Effective Extreme Multi-label Learning

Extreme multi-label classification is an important data mining technique, which can be used to label each unseen instance with a subset of labels from a large label set. It has wide applications and many methods have been proposed in recent years. Existing methods either seek to compress label space or train a classifier based on instances’ features, among which tree-based classifiers enjoy the advantages of better efficiency and accuracy. In many real world applications, instances are not independent and relationship between instances is very useful information. However, how to utilize relationship between instances in extreme multi-label classification is less studied. Exploiting such relationship may help improve prediction accuracy, especially in the circumstance that feature space is very sparse. In this paper, we study how to utilize the similarity between instances to build more accurate tree-based extreme multi-label classifiers. To this end, we introduce the utilization of relationship between instances to state-of-the-art models in two ways: feature engineering and collaborative labeling. Extensive experiments conducted on three real world datasets demonstrate that our proposed method achieves higher accuracy than the state-of-the-art models.

Feifei Li, Hongyan Liu, Jun He, Xiaoyong Du

Exploiting Ranking Consistency Principle in Representation Learning for Location Promotion

Location-based services, which use information of people’s geographical position as service context, are becoming part of our daily life. Given the large volume of heterogeneous data generated by location-based services, one important problem is to estimate the visiting probability of users who haven’t visited a target Point of Interest (POI) yet, and return the target user list based on their visiting probabilities. This problem is called the location promotion problem. The location promotion problem has not been well studied due to the following difficulties: (1) the cold start POI problem: a target POI for promotion can be a new POI with no check-in records; and (2) heterogeneous information integration. Existing methods mainly focus on developing a general mobility model for all users’ check-ins, but ignore the ranking utility from the perspective of POIs and the interaction between geographical and preference influence of POIs.In order to overcome the limitations of existing studies, we propose a unified representation learning framework called hybrid ranking and embedding. The core idea of our method is to exploit the ranking consistency principle into the representation learning of POIs. Our method not only enables the interaction between the geographical and preference influence for both users and POIs under a ranking scheme, but also integrates heterogeneous semantic information of POIs to learn a unified preference representation. Extensive experiments show that our method can return a ranked user list with better ranking utility than the state-of-the-art methods for both existing POIs and new POIs. Moreover, the performance of our method with respect to different POI categories is consistent with the hierarchy of needs in human life.

Siyuan Zhang, Yu Rong, Yu Zheng, Hong Cheng, Junzhou Huang

Patent Quality Valuation with Deep Learning Models

Patenting is of significant importance to protect intellectual properties for individuals, organizations and companies. One of practical demands is to automatically evaluate the quality of new patents, i.e., patent valuation, which can be used for patent indemnification and patent portfolio. However, to solve this problem, most traditional methods just conducted simple statistical analyses based on patent citation networks, while ignoring much crucial information, such as patent text materials and many other useful attributes. To that end, in this paper, we propose a Deep Learning based Patent Quality Valuation (DLPQV) model which can integrate the above information to evaluate the quality of patents. It consists of two parts: Attribute Network Embedding (ANE) and Attention-based Convolutional Neural Network (ACNN). ANE learns the patent embedding from citation networks and attributes, and ACNN extracts the semantic representation from patent text materials. Then their outputs are concatenated to predict the quality of new patents. The experimental results on a real-world patent dataset show our method outperforms baselines significantly with respect to patent valuation.

Hongjie Lin, Hao Wang, Dongfang Du, Han Wu, Biao Chang, Enhong Chen

Learning Distribution-Matched Landmarks for Unsupervised Domain Adaptation

Domain adaptation is widely used in database applications, especially in data mining. The basic assumption of domain adaptation (DA) is that some latent factors are shared by the source domain and the target domain. Revealing these shared factors, as a result, is the core operation of many DA approaches. This paper proposes a novel approach, named Learning Distribution-Matched Landmarks (LDML), for unsupervised DA. LDML reveals the latent factors by learning a domain-invariant subspace where the two domains are well aligned at both feature level and sample level. At the feature level, the divergences of both the marginal distribution and the conditional distribution are mitigated. At the sample level, each sample is evaluated so that we can take full advantage of the pivotal samples and filter out the outliers. Extensive experiments on two standard benchmarks verify that our approach can outperform state-of-the-art methods with significant advantages.

Mengmeng Jing, Jingjing Li, Jidong Zhao, Ke Lu

Factorization Meets Memory Network: Learning to Predict Activity Popularity

We address the problem, i.e., early prediction of activity popularity in event-based social networks, aiming at estimating the final popularity of new activities to be published online, which promotes applications such as online advertising recommendation. A key to success for this problem is how to learn effective representations for the three common and important factors, namely, activity organizer (who), location (where), and textual introduction (what), and further model their interactions jointly. Most of existing relevant studies for popularity prediction usually suffer from performing laborious feature engineering and their models separate feature representation and model learning into two different stages, which is sub-optimal from the perspective of optimization. In this paper, we introduce an end-to-end neural network model which combines the merits of Memory netwOrk and factOrization moDels (MOOD), and optimizes them in a unified learning framework. The model first builds a memory network module by proposing organizer and location attentions to measure their related word importance for activity introduction representation. Afterwards, a factorization module is employed to model the interaction of the obtained introduction representation with organizer and location identity representations to generate popularity prediction. Experiments on real datasets demonstrate MOOD indeed outperforms several strong alternatives, and further validate the rational design of MOOD by ablation test.

Wen Wang, Wei Zhang, Jun Wang

Representation Learning for Large-Scale Dynamic Networks

Representation leaning on networks aims to embed networks into a low-dimensional vector space, which is useful in many tasks such as node classification, network clustering, link prediction and recommendation. In reality, most real-life networks constantly evolve over time with various kinds of changes to the network structure, e.g., creation and deletion of edges. However, existing network embedding methods learn the representation vectors for nodes in a static manner, which are not suitable for dynamic network embedding. In this paper, we propose a dynamic network embedding approach for large-scale networks. The method incrementally updates the embeddings by considering the changes of the network structures and is able to dynamically learn the embedding for networks with millions of nodes within a few seconds. Extensive experimental results on three real large-scale networks demonstrate the efficiency and effectiveness of our proposed methods.

Yanwei Yu, Huaxiu Yao, Hongjian Wang, Xianfeng Tang, Zhenhui Li

Multi-view Discriminative Learning via Joint Non-negative Matrix Factorization

Multi-view learning attempts to generate a classifier with a better performance by exploiting relationship among multiple views. Existing approaches often focus on learning the consistency and/or complementarity among different views. However, not all consistent or complementary information is useful for learning, instead, only class-specific discriminative information is essential. In this paper, we propose a new robust multi-view learning algorithm, called DICS, by exploring the Discriminative and non-discriminative Information existing in Common and view-Specific parts among different views via joint non-negative matrix factorization. The basic idea is to learn a latent common subspace and view-specific subspaces, and more importantly, discriminative and non-discriminative information from all subspaces are further extracted to support a better classification. Empirical extensive experiments on seven real-world data sets have demonstrated the effectiveness of DICS, and show its superiority over many state-of-the-art algorithms.

Zhong Zhang, Zhili Qin, Peiyan Li, Qinli Yang, Junming Shao

Efficient Discovery of Embedded Patterns from Large Attributed Trees

Discovering informative patterns deeply hidden in large tree datasets is an important research area that has many practical applications. Many modern applications and systems represent, export and exchange data in the form of trees whose nodes are associated with attributes. In this paper, we address the problem of mining frequent embedded attributed patterns from large attributed data trees. Attributed pattern mining requires combining tree mining and itemset mining. This results in exploring a larger pattern search space compared to addressing each problem separately. We first design an interleaved pattern mining approach which extends the equivalence-class based tree pattern enumeration technique with attribute sets enumeration. Further, we propose a novel layered approach to discover all frequent attributed patterns in stages. This approach seamlessly integrates an itemset mining technique with a recent unordered embedded tree pattern mining algorithm to greatly reduce the pattern search space. Our extensive experimental results on real and synthetic large-tree datasets show that the layered approach displays, in most cases, orders of magnitude performance improvements over both the interleaved mining method and the attribute-as-node embedded tree pattern mining method and has good scaleup properties.

Xiaoying Wu, Dimitri Theodoratos

Classification Learning from Private Data in Heterogeneous Settings

Classification is useful for mining labels of data. Though well-trained classifiers benefit many applications, their training procedures on user-contributed data may leak users’ privacy.This work studies methods for private model training in heterogeneous settings, specially for the Naïve Bayes Classifier (NBC). Unlike previous works focusing on centralized and consistent datasets, we consider the private training in two more practical settings, namely the local setting and the mixture setting. In the local setting, individuals directly contribute training tuples to the untrusted trainer. In the mixture setting, the training dataset is composed of individual tuples and statistics of datasets from institutes. We propose a randomized response based NBC strategy for the local setting. To cope with the privacy of heterogeneous data (single tuples and the statistics) in the mixture setting, we design a unified privatized scheme. It integrates respective sanitization strategies on the two data types while preserving privacy. Besides contributing error bounds of estimated probabilities constituting NBC, we prove their optimality in the minimax framework and quantify the classification error of the privately learned NBC. Our analyses are validated with extensive experiments on real-world datasets.

Yiwen Nie, Shaowei Wang, Wei Yang, Liusheng Huang, Zhenhua Zhao

Multimedia Data Processing

Frontmatter

Fusing Satellite Data and Urban Data for Business Location Selection: A Neural Approach

Business location selection is of great importance in practice, but is a long and costly process. Traditional approaches to selecting optimal business locations consider complex factors such as foot traffic, neighborhood structure, space rent and available workforce, which are typically hard to acquire or measure. In this paper, we propose to exploit the highly available satellite data (e.g., satellite images and nighttime light data) as well as urban data for business location selection. We first perform an empirical analysis to evaluate the direct relationship between satellite features and business locations. We then propose a novel regression-and-ranking combined neural network model, to collectively predict the popularity and ranking for each location. Our model fuses the heterogeneous yet discriminative features from satellite and urban data and captures feature interactions effectively. We conduct experiments to compare our approach with various baselines. The results verify the effectiveness of the extracted satellite features and the superior performance of our model in terms of four metrics.

Yanan Xu, Yanyan Shen, Yanmin Zhu, Jiadi Yu

Index and Retrieve Multimedia Data: Cross-Modal Hashing by Learning Subspace Relation

Hashing methods have been extensively applied to efficient multimedia data indexing and retrieval on account of explosion of multimedia data. Cross-modal hashing usually learns binary codes by mapping multi-modal data into a common Hamming space. Most supervised methods utilize relation information like class labels as pairwise similarities of cross-modal data pair to narrow intra-modal and inter-modal gap. In this paper, we propose a novel supervised cross-modal hashing method dubbed Subspace Relation Learning for Cross-modal Hashing (SRLCH), which exploits relation information in semantic labels to make similar data from different modalities closer in the low-dimension Hamming subspace. SRLCH preserves the discrete constraints and nonlinear structures, while admitting a closed-form binary codes solution, which effectively enhances the training efficiency. An iterative alternative optimization algorithm is developed to simultaneously learn both hash functions and unified binary codes, indexing multimedia data in an efficient way. Evaluations in two cross-modal retrieval tasks on three widely-used datasets show that the proposed SRLCH outperforms most cross-modal hashing methods.

Luchen Liu, Yang Yang, Mengqiu Hu, Xing Xu, Fumin Shen, Ning Xie, Zi Huang

Deep Sparse Informative Transfer SoftMax for Cross-Domain Image Classification

In many real applications, it is often encountered that the models trained on source domain cannot fit the related target images very well, due to the variants and changes of the imaging background, lighting of environment, viewpoints and so forth. Therefore cross-domain image classification becomes a very interesting research problem. Lots of research efforts have been conducted on this problem, where many of them focus on exploring the cross-domain image features. Recently transfer learning based methods become the main stream. In this paper, we present a novel transfer SoftMax model called Sparse Informative Transfer SoftMax (SITS) to deal with the problem of cross-domain image classification. SITS is a flexible classification framework. Specifically, the principle eigenvectors of the target domain feature space are introduced into our objective function, hence the informative features of the target domain are exploited in the process of the model training. The sparse regularization for feature selection and the SoftMax classification are also employed in our framework. On this basis, we developed Deep SITS network to efficiently learn informative transfer model and enhance the transferable ability of deep neural network. Extensive experiments are conducted on several commonly used benchmarks. The experimental results show that comparing with the state-of-the-art methods, our method achieves the best performance.

Hanfang Yang, Xiangdong Zhou, Lan Lin, Bo Yao, Zijing Tan, Haocheng Tang, Yingjie Tian

Sitcom-Stars Oriented Video Advertising via Clothing Retrieval

This paper introduces a novel learning-based framework for video content-based advertising, DeepLink, which aims at linking sitcom-stars and online stores with clothing retrieval by using state-of-the-art deep convolutional neural networks (CNNs). Concretely, several deep CNN models are adopted for composing multiple sub-modules in DeepLink, including human-body detection, human-pose selection, face verification, clothing detection and retrieval from advertisements (ads) pool that is constructed by clothing images collected from real-world online stores. For clothing detection and retrieval from ad images, we firstly transfer the state-of-the-art deep CNN models to our data domain, and then train corresponding models based on our constructed large-scale clothing datasets. Extensive experimental results demonstrate the feasibility and efficacy of our proposed clothing-based video-advertising system.

Haijun Zhang, Yuzhu Ji, Wang Huang, Linlin Liu

Distributed Computing

Frontmatter

Efficient Snapshot Isolation in Paxos-Replicated Database Systems

Modern database systems are increasingly deployed in a cluster of commodity machines with Paxos-based replication technique to offer better performance, higher availability and fault-tolerance. The widely adopted implementation is that one database replica is elected to be a leader and to be responsible for transaction requests. After the transaction execution is completed, the leader generates transaction log and commit this transaction until the log has been replicated to a majority of replicas. The state of the leader is always ahead of that of the follower replicas since the leader commits the transactions firstly and then notifies other replicas of the latest committed log entries in the later communication. As the follower replica can’t immediately provide the latest snapshot, both read-write and read-only transactions would be executed at the leader to guarantee the strong snapshot isolation semantic. In this work, we design and implement an efficient snapshot isolation scheme. This scheme uses adaptive timestamp allocation to avoid frequently requesting the leader to assign transaction timestamps. Furthermore, we design an early log replay mechanism for follower replicas. It allows the follower replica to execute a read operation without waiting to replay log to generate the required snapshot. Comparing with the conventional implementation, we experimentally show that the optimized snapshot isolation for Paxos-replicated database systems has better performance in terms of scalability and throughput.

Jinwei Guo, Peng Cai, Bing Xiao, Weining Qian, Aoying Zhou

Proof of Reputation: A Reputation-Based Consensus Protocol for Peer-to-Peer Network

The advent of blockchain sheds light on addressing trust issues of peer-to-peer networks by providing a distributed tamper-resistant ledger. Beyond cryptocurrencies, it is believed that blockchain can also be used to protect other properties such as reputation. Most of the existing studies of enhancing reputation systems using blockchains are built on top of the bitcoin-like blockchains, so they are inherently constrained by the low-efficiency and high-consumption of the underlying blockchain. To fill this gap, we present a reputation-based consensus protocol called Proof of Reputation (PoR), which guarantees the reliability and integrity of transaction outcomes in an efficient way. In PoR, we let reputation serves as the incentive for both good behavior and block publication instead of digital coins, therefore no miners are needed. We also implement a prototype and our scalability experiments show that our protocol can scale to over a thousand participants in a peer-to-peer network with throughput of hundreds of transactions per second.

Fangyu Gai, Baosheng Wang, Wenping Deng, Wei Peng

Incremental Materialized View Maintenance on Distributed Log-Structured Merge-Tree

Modern database systems are in need of supporting highly scalable transactions of data updates and efficient queries over data simultaneously for the real-time applications. One solution to reach the demand is to implement query optimization techniques on the online transaction processing (OLTP) systems. The materialized view is considered as a panacea to improve query latency. However, it also involves a significant cost of maintenance which trades away transaction performance. In this paper, we develop materialized views on a distributed log-structured merge-tree (LSM-tree), which is a well-known structure adopted to improve data write performance. We examine the design space and conclude several design features for the implementation of view on LSM-tree. An asynchronous approach with two optimizations are proposed to decouple the view maintenance with transaction process. Under the asynchronous update, we also provide consistency query for views. Experiments on TPC-H benchmark show our method achieves better performance than straightforward methods on different workloads.

Huichao Duan, Huiqi Hu, Weining Qian, Haixin Ma, Xiaoling Wang, Aoying Zhou

CDSFM: A Circular Distributed SGLD-Based Factorization Machines

Factorization Machines (FMs) offers attractive performance by combining low-rank data vectors and heuristic features. However, it suffers from the growth of the dataset and the model complexity. Although much efforts have been made to distribute FMs over multiple machines, the computation efficiency is still limited by the foundational master-slave framework. In this paper, we propose CDSFM, which leverages Stochastic Gradient Langevin Dynamics (SGLD) to optimize FMs, and is distributed into a completely new circular framework. Experiments on two genres of datasets show that CDSFM can achieves a 2.3–4.7$$\times $$× speed-up over the comparison methods while obtains better performance.

Kankan Zhao, Jing Zhang, Liangfu Zhang, Cuiping Li, Hong Chen

Industrial Track

Frontmatter

An Industrial-Scale System for Heterogeneous Information Card Ranking in Alipay

Alipay (https://global.alipay.com/), one of the world’s largest mobile and online payment platforms, provides not only payment services but also business about many aspects of our daily lives (finance, insurance, credit, express, news, social contact, etc.). The homepage in Alipay app (https://render.alipay.com/p/s/download) integrates massive heterogeneous information cards, which need to be ranked in appropriate order for better user experience. This paper demonstrates an industrial-scale system for heterogeneous information card ranking. We implement an ensemble ranking model, blending online and chunked-based learning algorithms which are developed on parameter server mechanism and able to handle industrial-scale data. Moreover, we propose efficient and effective factor embedding methods, which aim to reduce high-dimensional heterogenous factor features to low-dimensional embedding vectors by subtly revealing feature interactions. Offline experimental as well as online A/B testing results illustrate the efficiency and effectiveness of our proposals.

Zhiqiang Zhang, Chaochao Chen, Jun Zhou, Xiaolong Li

A Twin-Buffer Scheme for High-Throughput Logging

For a transactional database system, the efficiency of logging is usually crucial to its performance. The emergence of new hardware, such as NVM and SSD, eliminated the traditional I/O bottleneck of logging and released the potential of multi-core CPUs. As a result, the parallelism of logging becomes important. We propose a parallel logging subsystem called TwinBuf and implemented it in PostgreSQL. This solution can make better use of multi-core CPUs, and is generally applicable to all kinds of storage devices, such as hard disk, SSD and NVM. TwinBuf adopts per-thread logging slots to parallelize logging, and a twin-log-buffer mechanism to make sure that logging can be performed in a non-stop manner. It performs group commit to minimize the persistence overheads. Experimental evaluation was conducted to demonstrate its advantages.

Qingzhong Meng, Xuan Zhou, Shan Wang, Haiyan Huang, Xiaoli Liu

Qualitative Instead of Quantitative: Towards Practical Data Analysis Under Differential Privacy

Differential privacy (DP) has become the de facto standard in the academic and industrial communities. Although DP can provide strong privacy guarantee, it also brings a major of performance loss for data mining systems. Recently there has been a flood of research into the quantitative mining of DP based algorithms, which are designed to improve the performance of data mining systems. However, industrial applications demand accurate quantitative mining results. Results containing noise are actually difficult to use. This paper rethinks to apply DP in industrial big data from another perspective: qualitative analysis, which aims to dig the data about rank, pattern, important set, etc. It does not require accurate results and naturally has a greater ability to accommodate noise. We design a framework about DP data publication based attribute importance rank to support the qualitative analysis of DP, which assists data buyers to perform qualitative analysis tasks and to know the credibility of their results. We show the realization of this framework using two typical qualitative tasks. Experimental results on public data and industrial data show that making use of this framework, qualitative analysis tasks can be completed with a high confidence support even when privacy budget $$\epsilon $$ϵ is very small (e.g., 0.05). Our observations suggest that qualitative analysis of DP has the potential ability to realize applying DP in industrial applications.

Xuanyu Bai, Jianguo Yao, Mingyuan Yuan, Jia Zeng, Haibing Guan

Client Churn Prediction with Call Log Analysis

Client churn prediction is a classic business problem of retaining customers. Recently, machine learning algorithms have been applied to predict client churn and have shown promising performance comparing to traditional methods. Despite of its success, existing machine learning approach mainly focus on structured data such as demographic and transactional data, while unstructured data, such as emails and phone calls, have been largely overlooked. In this work, we propose to improve existing churn prediction models by analysing customer characteristics and behaviours from unstructured data, particularly, audio calls. To be specific, we developed a text mining model combined with gradient boosting tree to predict client churn. We collected and conducted extensive experiments on 900 thousand audio calls from 200 thousand customers, and experimental results show that our approach can significantly improve the previous model by exploiting the additional unstructured data.

Nhi N. Y. Vo, Shaowu Liu, James Brownlow, Charles Chu, Ben Culbert, Guandong Xu

Unpack Local Model Interpretation for GBDT

A gradient boosting decision tree (GBDT), which aggregates a collection of single weak learners (i.e. decision trees), is widely used for data mining tasks. Because GBDT inherits the good performance from its ensemble essence, much attention has been drawn to the optimization of this model. With its popularization, an increasing need for model interpretation arises. Besides the commonly used feature importance as a global interpretation, feature contribution is a local measure that reveals the relationship between a specific instance and the related output. This work focuses on the local interpretation and proposes an unified computation mechanism to get the instance-level feature contributions for GBDT in any version. Practicality of this mechanism is validated by the listed experiments as well as applications in real industry scenarios.

Wenjing Fang, Jun Zhou, Xiaolong Li, Kenny Q. Zhu

Cost-Sensitive Churn Prediction in Fund Management Services

Churn prediction is vital to companies as to identify potential churners and prevent losses in advance. Although it has been addressed as a classification task and a variety of models have been employed in practice, fund management services have presented several special challenges. One is that financial data is extremely imbalanced since only a tiny proportion of customers leave every year. Another is a unique cost-sensitive learning problem, i.e., costs of wrong predictions for churners should be related to their account balances, while costs of wrong predictions for non-churners should be the same. To address these issues, this paper proposes a new churn prediction model based on ensemble learning. In our model, multiple classifiers are built using sampled datasets to tackle the imbalanced data issue while exploiting data fully. Moreover, a novel sampling strategy is proposed to deal with the unique cost-sensitive issue. This model has been deployed in one of the leading fund management institutions in Australia, and its effectiveness has been fully validated in real applications.

James Brownlow, Charles Chu, Bin Fu, Guandong Xu, Ben Culbert, Qinxue Meng

Demonstration Track

Frontmatter

A Movie Search System with Natural Language Queries

In this demo, we present $$\mathsf {MSeeker}$$MSeeker, a user-friendly movie search system with following characteristics: it (1) transforms natural language queries ($$\mathsf {nlq}$$nlq) into graph pattern queries Q with a special node $$u_o$$uo as “query focus”; (2) identifies diversified top-k matches of $$u_o$$uo by early termination algorithm; and (3) provides graphical interface to help users interact with the system.

Xin Wang, Huayi Zhan, Lan Yang, Zonghai Li, Jiying Zhong, Liang Zhao, Rui Sun, Bin Tan

EventSys: Tracking Event Evolution on Microblogging Platforms

In this paper, we demonstrate a prototype system named EventSys, which provides efficient monitoring services for detecting and tracking event evolution on microblogging platforms. The major features of EventSys are: (1) It describes the lifecycle of an event by a staged model, and provides effective algorithms for detecting the stages of an event. (2) It offers emotional analysis over the stages of an event, through which people are able to know the public emotional tendency over a specific event at different time. (3) It provides a novel event-type-driven method to extract event tuples, which forms the foundation for event evolution analysis. After a brief introduction to the architecture and key technologies of EventSys, we present a case study to demonstrate the working process of EventSys.

Lin Mu, Peiquan Jin, Lizhou Zheng, En-Hong Chen

AdaptMX: Flexible Join-Matrix Streaming System for Distributed Theta-Joins

Stream join is a fundamental and important processing in many real-world applications. Due to the complexity of join operation and the inherent characteristic of streaming data (e.g., skewed distribution and dynamics), though massive research has been conducted, adaptivity and load-balancing are still urgent problems. In this paper, an enhanced adaptive join-matrix system AdaptMX for stream theta-join is presented, which combines the key-based and tuple-based join approaches well: (i) at outer level, it modifies the well-known join-matrix model to allocate resource on demand, improving the adaptivity of tuple-based parititoning scheme; (ii) at inner level, it adopts a key-based routing policy among grouped processing tasks to maintain the join semantics and cost-effective load balancing strategies to remove the stragglers. For demonstration, we present a transparent processing of distributed stream theta-join and compare the performance of our AdaptMX system with other baselines, with 3$$\times $$× higher throughput.

Xiaotong Wang, Cheng Jiang, Junhua Fang, Xiangfeng Wang, Rong Zhang

A System for Spatial-Temporal Trajectory Data Integration and Representation

Different GPS devices and transportation companies record and store their data using various formats. Even though GPS data often contains the same spatial-temporal and semantic attributes, describing the moving object’s trajectory, the integration of these datasets into a single format and storage platform is yet an issue. Therefore, we deliver a data integration system for simplified loading and preprocessing of trajectory data into a standard text platform; this facilitates data access and processing by any trajectory application using multiple and heterogeneous datasets.

Douglas Alves Peixoto, Xiaofang Zhou, Nguyen Quoc Viet Hung, Dan He, Bela Stantic

SLIND: Identifying Stable Links in Online Social Networks

Link stability detection has been an important and long-standing problem in the link prediction domain. However, it is often easily overlooked as being trivial and has not been adequately dealt with in link prediction [1]. In this demo, we introduce an innovative link stability detection system, called SLIND (Stable LINk Detection), that adopts a Multi-Variate Vector Autoregression analysis (MVVA) approach using link dynamics to establish stability confidence scores of links within a clique of nodes in online social networks (OSN) to improve detection accuracy and the representation of stable links. SLIND is also able to determine stable links through the use of partial feature information and potentially scales well to much larger datasets with very little accuracy to performance trade-offs using random walk Monte-Carlo estimates.

Ji Zhang, Leonard Tan, Xiaohui Tao, Xiaoyao Zheng, Yonglong Luo, Jerry Chun-Wei Lin

MusicRoBot: Towards Conversational Context-Aware Music Recommender System

Traditional recommendation approaches work well on depicting users’ long-term music preference. However, in the conversational applications, it is unable to capture users’ real time music taste, which are dynamic and depend on user context including users’ emotion, current activities or sites. To meet users’ real time music preferences, we have developed a conversational music recommender system based on music knowledge graph, MusicRoBot (Music RecOmmendation Bot). We embed the music recommendation into a chatbot, integrating both the advantages of dialogue system and recommender system. In our system, conversational interaction helps capture more real-time and richer requirements. Users can receive real time recommendation and give feedbacks by conversation. Besides, MusicRoBot also provides the music Q&A function to answer several types of musical question by the music knowledge graph. A WeChat based service has been deployed piloted for volunteers already.

Chunyi Zhou, Yuanyuan Jin, Kai Zhang, Jiahao Yuan, Shengyuan Li, Xiaoling Wang

HDUMP: A Data Recovery Tool for Hadoop

Hadoop is a popular distributed framework for massive data processing. HDFS is the underlying file system of Hadoop. More and more companies use Hadoop as data processing platform. Once Hadoop crashes, the data stored in HDFS can not be accessed directly. We present HDUMP, a light-weight bypassing file system, which aims to recover the data stored in HDFS when Hadoop crashes.

Zhongsheng Li, Qiuhong Li, Wei Wang, Qitong Wang, Fengbin Qi, Yimin Liu, Peng Wang

Modeling and Evaluating MID1 ICAL Pipeline on Spark

Squire Kilometre Array (SKA) project generates almost the hugest data volume in the world. SKA data flow pipelines need almost real-time processing ability, which brings huge challenges to the execution frameworks (EF for short). We propose a cost model for a typical SKA data flow pipeline named as MID1 ICAL pipeline on Spark. By simulating the I/O of MID1 ICAL pipeline with a reduced SKA data, we evaluate several different implementations of MID1 ICAL pipeline and conclude the optimized method for this pipeline on Spark.

Zhongsheng Li, Qiuhong Li, Yimin Liu, Wei Wang, Fengbin Qi, Mingmin Chi, Yitong Wang

Backmatter

Titel: Database Systems for Advanced Applications
herausgegeben von: Jian Pei
Yannis Manolopoulos
Shazia Sadiq
Jianxin Li
Verlag: Springer International Publishing
Electronic ISBN: 978-3-319-91458-9
Print ISBN: 978-3-319-91457-2
DOI: https://doi.org/10.1007/978-3-319-91458-9