
2017 | Book

Database Systems for Advanced Applications

22nd International Conference, DASFAA 2017, Suzhou, China, March 27-30, 2017, Proceedings, Part I

Edited by: Selçuk Candan, Lei Chen, Torben Bach Pedersen, Lijun Chang, Wen Hua

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science


About this book

This two volume set LNCS 10177 and 10178 constitutes the refereed proceedings of the 22nd International Conference on Database Systems for Advanced Applications, DASFAA 2017, held in Suzhou, China, in March 2017.

The 73 full papers, 9 industry papers, 4 demo papers and 3 tutorials were carefully selected from a total of 300 submissions. The papers are organized around the following topics: semantic web and knowledge management; indexing and distributed systems; network embedding; trajectory and time series data processing; data mining; query processing and optimization; text mining; recommendation; security, privacy, sensor and cloud; social network analytics; map matching and spatial keywords; query processing and optimization; search and information retrieval; string and sequence processing; stream data processing; graph and network data processing; spatial databases; real time data processing; big data; social networks and graphs.

Table of Contents

Frontmatter

Semantic Web and Knowledge Management

Frontmatter
A General Fine-Grained Truth Discovery Approach for Crowdsourced Data Aggregation

Crowdsourcing has been proven to be an efficient tool for collecting large-scale datasets. Answers provided by the crowd are often noisy and conflicting, which makes aggregating them to infer ground truth a critical challenge. Existing fine-grained truth discovery methods solve this problem by exploring the correlation between source reliability and task topics or answers. However, they can only work on limited kinds of tasks, which makes them incompatible with writing and transcription tasks and leads to insufficient utilization of the global dataset. To maintain compatibility, we consider the existence of clusters in both tasks and sources, and then propose a general fine-grained method. The proposed approach contains two integral components: kl-means and Pattern-based Truth Discovery (PTD). With the aid of ground truth data, kl-means directly employs a co-clustering reliability model on the correctness matrix to learn the patterns. Then PTD conducts the answer aggregation by incorporating the captured patterns, producing a more accurate estimation. Therefore, our approach is compatible with all tasks and can better demonstrate the correlation among tasks and sources. Experimental results show that our method can produce a more precise estimation than other general truth discovery methods due to its ability to learn and utilize the patterns of both tasks and sources.

Yang Du, Hongli Xu, Yu-E Sun, Liusheng Huang
Learning the Structures of Online Asynchronous Conversations

Online social networks have enjoyed huge success over the last two decades. More and more people are now used to chatting with friends online via instant messaging applications on personal computers or mobile devices. Since these conversations are organized sequentially, which fails to show the logical relations between messages, they are called asynchronous conversations in previous studies. Unfortunately, the sequential layout of messages usually makes it hard to see how a conversation evolves over time. In this paper, we propose to learn the structures of online asynchronous conversations by predicting the “reply-to” relation between messages based on text similarity and latent semantic transferability. A heuristic method is also brought forward to predict the relation and then recover the conversation structure. We demonstrate the effectiveness of the proposed method through experiments on a real-world web forum comment dataset.
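
As a rough illustration of the text-similarity cue (only one of the two signals the paper combines, and not its actual heuristic), one could link each message to its most TF-IDF-similar predecessor; the sketch below uses scikit-learn and invented example messages.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def guess_reply_targets(messages):
    """For each message (after the first), guess the earlier message it
    replies to as the most TF-IDF-similar preceding one."""
    tfidf = TfidfVectorizer().fit_transform(messages)
    sims = cosine_similarity(tfidf)
    targets = {}
    for i in range(1, len(messages)):
        targets[i] = max(range(i), key=lambda j: sims[i, j])
    return targets

thread = [
    "Has anyone tried the new database release?",
    "I upgraded yesterday, the query planner is much faster.",
    "What about the weather this weekend?",
    "Faster planner sounds great, which version did you install?",
]
print(guess_reply_targets(thread))  # e.g. message 3 should point back to message 1
```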

Jun Chen, Chaokun Wang, Heran Lin, Weiping Wang, Zhipeng Cai, Jianmin Wang
A Question Routing Technique Using Deep Neural Network for Communities of Question Answering

Online Communities for Question Answering (CQA) such as Quora and Stack Overflow face the challenge of providing high-quality answers to the questions asked by their users. Although CQA frameworks receive new questions at a linear rate, the number of unanswered questions grows exponentially. This gap eventually compromises the effectiveness of CQA frameworks as knowledge-sharing platforms. The main cause of this challenge is the improper routing of questions to potential answerers, field experts or interested users. The proposed technique QR-DSSM uses a deep semantic similarity model (DSSM) to extract semantic similarity features using deep neural networks. The extracted semantic features are used to rank the profiles of answerers by their relevance to the routed question. QR-DSSM maps the asked questions and the profiles of the users into a latent semantic space where relevance is measured using cosine similarity between the two: questions and users’ profiles. QR-DSSM achieved an MRR score of 0.1737 and outperformed baseline models such as the query likelihood language model (QLLM), Latent Dirichlet Allocation (LDA), an SVM classifier, and the RankingSVM learning-to-rank technique.
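
The ranking step described above, cosine similarity between a question and user profiles in a shared latent space, can be sketched as follows; the random vectors stand in for the DSSM encoder outputs, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two latent vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def route_question(question_vec, user_profiles, top_k=5):
    """Rank answerer profiles by cosine similarity to the question
    in the shared latent space (profiles: dict of user -> vector)."""
    scored = [(user, cosine(question_vec, vec)) for user, vec in user_profiles.items()]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]

# Toy example with random 128-dimensional latent vectors.
rng = np.random.default_rng(0)
q = rng.normal(size=128)
profiles = {f"user_{i}": rng.normal(size=128) for i in range(100)}
print(route_question(q, profiles, top_k=3))
```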

Amr Azzam, Neamat Tazi, Ahmad Hossny
Category-Level Transfer Learning from Knowledge Base to Microblog Stream for Accurate Event Detection

Many Web applications need accurate event detection on microblog streams, but the accuracy of existing methods is still challenged by microblogs’ short length and high noise. We develop a novel category-level transfer learning method, TransDetector, to deal with this task. TransDetector is based on two facts: microblogs are short but can be semantically enriched from a knowledge base via transfer learning, and events can be detected more accurately on microblogs with richer semantics. TransDetector makes the following contributions. (1) We propose a structure-guided category-level topic extraction method, which exploits the knowledge base’s hierarchical structure to extract each category’s highly correlated topics. (2) We develop a probabilistic model, CTrans-LDA, for category-level transfer learning, which utilizes word co-occurrences and transfers the knowledge base’s category-level topics into microblogs. (3) Events are detected accurately on category-level word time series, thanks to richer semantics and less noise. (4) Experiments verify the quality of the category-level topics extracted from the knowledge base, and a further study on the benchmark Edinburgh Twitter corpus validates the effectiveness of the proposed transfer learning method for event detection. TransDetector achieves high accuracy, improving precision by 9% without sacrificing recall.

Weijing Huang, Tengjiao Wang, Wei Chen, Yazhou Wang

Indexing and Distributed Systems

Frontmatter
AngleCut: A Ring-Based Hashing Scheme for Distributed Metadata Management

Today’s file systems are required to store PB-scale or even EB-scale data across thousands of servers. Under this scenario, distributed metadata management schemes, which store metadata on a group of metadata servers (MDS’s), are used to alleviate the workload of a single server. However, they present a significant challenge, as the group of MDS’s should maintain a high level of both metadata locality and load balancing, which are practically contradictory to each other. In this paper we propose a novel, specially designed hashing scheme called AngleCut to partition the metadata namespace tree and serve large-scale distributed storage systems. AngleCut first uses a locality preserving hashing (LPH) function to project the namespace tree into a linear keyspace, i.e., multiple Chord-like rings. Then we design a history-based allocation strategy to adjust the workload of MDS’s dynamically. A metadata cache mechanism is also adopted in AngleCut to improve query efficiency. In general, our scheme preserves metadata locality while maintaining high load balancing among MDS’s. Theoretical proof and extensive experiments on trace data exhibit the superiority of AngleCut over the previous literature.

Jiaxi Liu, Renxuan Wang, Xiaofeng Gao, Xiaochun Yang, Guihai Chen
An Efficient Bulk Loading Approach of Secondary Index in Distributed Log-Structured Data Stores

How to improve the read performance of the Log-Structured Merge (LSM)-tree has gained much attention recently, and constructing secondary indexes for LSM data stores is a popular solution. Bulk loading of a secondary index is inevitable when a new application is developed on an existing LSM data store. However, to the best of our knowledge there are few studies on bulk loading of secondary indexes in distributed LSM-trees. In this paper, we study how to improve the performance of bulk loading of secondary indexes in distributed LSM-tree data stores and propose an efficient bulk loading approach. Firstly, we design a secondary index structure based on the distributed LSM-tree to guarantee the scalability and consistency of the secondary index. Secondly, we propose an efficient framework to handle bulk loading of the secondary index in a distributed environment, which provides good load balancing for query processing by using an equal-depth histogram to capture the data distribution. Theoretical analysis and experimental results on a standard benchmark illustrate the efficacy of the proposed methods in a distributed environment.
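
As a generic illustration of the load-balancing idea (not the paper's implementation), an equal-depth histogram chooses split keys so that each range holds roughly the same number of entries:

```python
import random

def equal_depth_boundaries(keys, num_buckets):
    """Return bucket boundary keys so that each bucket holds roughly the
    same number of entries (equal-depth histogram)."""
    keys = sorted(keys)
    n = len(keys)
    boundaries = []
    for i in range(1, num_buckets):
        # Boundary between bucket i-1 and bucket i.
        boundaries.append(keys[i * n // num_buckets])
    return boundaries

# Example: split 1,000 skewed secondary-key values into 4 balanced ranges.
random.seed(42)
sample = [int(random.expovariate(0.01)) for _ in range(1000)]
print(equal_depth_boundaries(sample, 4))
```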

Yanchao Zhu, Zhao Zhang, Peng Cai, Weining Qian, Aoying Zhou
Performance Comparison of Distributed Processing of Large Volume of Data on Top of Xen and Docker-Based Virtual Clusters

Recently, with the advent of cloud computing, it has become essential to run distributed computing tasks, such as Hadoop MapReduce jobs, on top of virtual computing nodes instead of physical computing nodes. However, distributed big data processing on top of virtual machines usually causes unbalanced use of physical resources, such as memory, disk I/O and network resources, resulting in severe performance problems. In this paper, we show how virtualization methods affect distributed processing of very large volumes of data by comparing Hadoop MapReduce performance on top of Xen-based virtual clusters versus Docker-based virtual clusters. In our experiments, we compare the performance of the two virtual clusters while varying the virtualization method, block size and number of nodes. Our results show that, in terms of distributed big data processing performance, the Docker-based virtual cluster is usually faster than the Xen-based one, but there are cases where Xen is faster than Docker depending on parameters such as the block size and the number of virtual nodes.

Haejin Chung, Yunmook Nah
An Adaptive Data Partitioning Scheme for Accelerating Exploratory Spark SQL Queries

For data analysis, it is useful to explore a dataset with a sequence of queries, frequently using the results of previous queries to shape the next ones. Thus, data used in previous queries is often reused, at least in part, in the next queries. This fact can be used to accelerate queries with data partitioning, a widely used technique that enables skipping irrelevant data for better I/O performance. To obtain effective partitions that are likely to cover the future query workload, we propose an adaptive partitioning scheme that combines data-driven metrics and user-driven metrics to guide the data partitioning, together with a heuristic model using a metric plugin system to support different exploratory patterns. For partition storage and management, we propose an effective partition index structure for quickly finding appropriate partitions to answer queries. The system is quite helpful in improving the performance of exploratory queries.

Chenghao Guo, Zhigang Wu, Zhenying He, X. Sean Wang

Network Embedding

Frontmatter
Semi-Supervised Network Embedding

Network embedding aims to learn a distributed representation vector for each node in a network, which is fundamental to support many data mining and machine learning tasks such as node classification, link prediction, and social recommendation. Current popular network embedding methods normally first transform the network into a set of node sequences, and then feed them into an unsupervised feature learning model that outputs a distributed representation vector for each node. The first limitation of existing methods is that the node orders in node sequences are ignored; as a result, some topological structure information encoded in the node orders cannot be effectively captured by such order-insensitive embedding methods. Second, given a particular machine learning task, some annotation data may be available. Existing network embedding methods are unsupervised and cannot effectively incorporate the annotation data to learn better representation vectors. In this paper, we propose an order-sensitive semi-supervised framework for network embedding. Specifically, we first propose a novel order-sensitive network embedding method, StructuredNE, to integrate node order information into the embedding process in an unsupervised manner. Then, based on the annotation data, we further propose a semi-supervised framework, SemNE, to modify the representation vectors learned by StructuredNE so that they better fit the annotation data. We thoroughly evaluate our framework through three data mining tasks (multi-label classification, network reconstruction and link prediction) on three datasets. Experimental results show the effectiveness of the proposed framework.

Chaozhuo Li, Zhoujun Li, Senzhang Wang, Yang Yang, Xiaoming Zhang, Jianshe Zhou
CirE: Circular Embeddings of Knowledge Graphs

Embedding representation technology provides convenience for machine learning on knowledge graphs (KG): it encodes entities and relations into continuous vector spaces and then constructs $$\langle entity, relation, entity \rangle$$ triples. However, KG embedding models are sensitive to infrequent and uncertain objects. Furthermore, there is a contradiction between learning ability and learning cost. To this end, we propose circular embeddings (CirE) to learn representations of an entire KG, which can accurately model various objects, save storage space, speed up calculation, and is easy to train and scalable to very large datasets. We make the following contributions: (1) We improve the accuracy of learning various objects by combining holographic projection and dynamic learning. (2) We reduce parameters and storage by adopting a circulant matrix as the projection matrix from the entity space to the relation space. (3) We reduce training time through an adaptive parameter update algorithm which dynamically changes the learning time for various objects. (4) We speed up computation and enhance scalability via the fast Fourier transform (FFT). Extensive experiments show that CirE outperforms state-of-the-art baselines in link prediction and entity classification, justifying the efficiency and scalability of CirE.
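
The computational advantage of a circulant projection matrix comes from the fact that multiplying by it is a circular convolution, which the FFT evaluates in O(n log n). The snippet below only demonstrates that equivalence, not the CirE model itself.

```python
import numpy as np

def circulant_matvec_fft(c, x):
    """Multiply the circulant matrix defined by first column c with x,
    using the FFT (circular convolution theorem)."""
    return np.real(np.fft.ifft(np.fft.fft(c) * np.fft.fft(x)))

# Check against an explicitly constructed circulant matrix.
c = np.array([1.0, 2.0, 3.0, 4.0])
x = np.array([0.5, -1.0, 2.0, 0.0])
C = np.column_stack([np.roll(c, k) for k in range(len(c))])  # column k is c shifted by k
assert np.allclose(C @ x, circulant_matvec_fft(c, x))
print(circulant_matvec_fft(c, x))
```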

Zhijuan Du, Zehui Hao, Xiaofeng Meng, Qiuyue Wang
PPNE: Property Preserving Network Embedding

Network embedding aims at learning a distributed representation vector for each node in a network, which has been increasingly recognized as an important task in network analysis. Most existing embedding methods focus on encoding topology information into the representation vectors. In reality, nodes in a network may carry rich properties, which could potentially contribute to learning better representations. In this paper, we study the novel problem of property-preserving network embedding and propose a general model, PPNE, to effectively incorporate the rich types of node properties. We formulate the learning process of representation vectors as a joint optimization problem, where the topology-derived and property-derived objective functions are optimized jointly with shared parameters. By solving this joint optimization problem with an efficient stochastic gradient descent algorithm, we obtain representation vectors incorporating both network topology and node property information. We extensively evaluate our framework through two data mining tasks on five datasets. Experimental results show the superior performance of PPNE.

Chaozhuo Li, Senzhang Wang, Dejian Yang, Zhoujun Li, Yang Yang, Xiaoming Zhang, Jianshe Zhou
HINE: Heterogeneous Information Network Embedding

Network embedding has shown its effectiveness for embedding homogeneous networks. Compared with homogeneous networks, heterogeneous information networks (HINs) contain semantic information from multi-typed entities and relations, and have been shown to be a more effective model for real-world data. Existing network embedding methods fail to explicitly capture the semantics in HINs. In this paper, we propose an HIN embedding model (HINE), which consists of local and global semantic embedding. Local semantic embedding aims to incorporate entity type information by embedding the local structures and types of entities in a supervised way. Global semantic embedding leverages multi-hop relation types among entities to propagate the global semantics via a Markov Random Field (MRF) and thereby influence the embedding vectors. By doing so, HINE is capable of capturing both local and global semantic information in the embedding vectors. Experimental results show that HINE significantly outperforms state-of-the-art methods.

Yuxin Chen, Chenguang Wang

Trajectory and Time Series Data Processing

Frontmatter
DT-KST: Distributed Top-k Similarity Query on Big Trajectory Streams

During the past decade, with the widespread use of smartphones and other mobile devices, big trajectory data has been generated and stored in a distributed way. In this work, we focus on distributed top-k similarity queries over big trajectory streams. Processing such a distributed query is challenging due to limited network bandwidth. To overcome this challenge, we propose a communication-saving algorithm, DT-KST (Distributed Top-K Similar Trajectories). DT-KST utilizes the multi-resolution property of the Haar wavelet and devises a level-increasing communication strategy to tighten the similarity bounds. Then, a local pruning strategy is introduced to reduce the amount of data returned from distributed nodes. Theoretical analysis and extensive experiments on a real dataset show that DT-KST outperforms the state-of-the-art approach in terms of communication cost.
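
The multi-resolution property being exploited can be seen in a plain Haar decomposition, where coarse averages give loose similarity bounds that finer detail coefficients tighten. This sketch computes level-wise Haar coefficients only and is not the paper's communication protocol.

```python
import numpy as np

def haar_levels(series):
    """Return the Haar decomposition as a list of (averages, details) pairs,
    from the finest level to the coarsest. Length must be a power of 2."""
    levels = []
    current = np.asarray(series, dtype=float)
    while len(current) > 1:
        avg = (current[0::2] + current[1::2]) / 2.0     # pairwise averages
        detail = (current[0::2] - current[1::2]) / 2.0  # pairwise differences
        levels.append((avg, detail))
        current = avg
    return levels

for avg, det in haar_levels([2, 4, 6, 8, 10, 12, 14, 16]):
    print("avg:", avg, "detail:", det)
```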

Zhigang Zhang, Yilin Wang, Jiali Mao, Shaojie Qiao, Cheqing Jin, Aoying Zhou
A Distributed Multi-level Composite Index for KNN Processing on Long Time Series

Recently, sensor-based applications have emerged that collect plenty of long time series. Traditional whole-matching similarity search can only query full-length time series. However, for long time series, similarity search on arbitrary time windows is more attractive and important. In this paper, we address the problem of window-based KNN search over time series data on HBase. Based on the PAA approximation, we propose a composite index structure comprising a Horizontal Segment Tree (HS-Tree) and a Vertical Inverted Table (VI-Table). The VI-Table is capable of pruning time series using data summaries at high levels, while the HS-Tree leverages data summaries at low levels to reduce access to the raw time series data. Both the VI-Table and the HS-Tree can be built in parallel and incrementally. Our experimental results show the effectiveness and robustness of the proposed approach.
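
PAA, the summary the index is built on, simply replaces each equal-length segment of a series with its mean. A minimal version (illustrative, and assuming the series length is divisible by the number of segments):

```python
import numpy as np

def paa(series, num_segments):
    """Piecewise Aggregate Approximation: mean of each equal-length segment.
    Assumes len(series) is divisible by num_segments."""
    arr = np.asarray(series, dtype=float)
    return arr.reshape(num_segments, -1).mean(axis=1)

print(paa([1, 2, 3, 4, 10, 10, 10, 10], num_segments=2))  # -> [ 2.5 10. ]
```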

Xiaqing Wang, Zicheng Fang, Peng Wang, Ruiyuan Zhu, Wei Wang
Outlier Trajectory Detection: A Trajectory Analytics Based Approach

Trajectories obtained from GPS-enabled devices give us great opportunities to mine hidden knowledge about urban mobility, traffic dynamics and human behaviors. In this paper, we aim to understand historical trajectory data to discover outlier trajectories of taxis. An outlier trajectory is a trajectory grossly different from others, meaning there are few or even no trajectories following a similar route in a dataset. To identify outlier trajectories, we first present a prefix-tree-based algorithm called PTS, which traverses the search space on the fly to calculate the number of trajectories following similar routes for outlier detection. Then we propose two trajectory clustering based approaches, PBOTD and DBOTD, to cluster trajectories and extract representative routes in different ways. Outlier detection is carried out on the representatives directly, and the accuracy can be guaranteed by proven error bounds. The evaluation of the proposed methods on a real dataset of taxi trajectories verifies the high efficiency and accuracy of the DBOTD algorithm.

Zhongjian Lv, Jiajie Xu, Pengpeng Zhao, Guanfeng Liu, Lei Zhao, Xiaofang Zhou
Clustering Time Series Utilizing a Dimension Hierarchical Decomposition Approach

Time series clustering has attracted a great amount of attention recently. However, clustering massive time series faces the challenge of huge computation cost. To reduce this cost, we propose a novel Dimension Hierarchical Decomposition (DHD for short) method to represent time series and a corresponding tree structure, denoted as DHDTree, to reorganize the time series collection to achieve the best separation effect. The main idea of DHDTree is to adapt the k-d tree for time series by utilizing the DHD representation. When splitting, we select the most separable splitting strategy according to a predefined cost model. A fundamental feature of DHDTree is that it overcomes the curse of dimensionality by leveraging dimension compositions instead of selecting only one dimension when splitting, aiming to acquire the maximal separation effect. We show that DHDTree has both the balance and locality properties, which are important factors for the efficiency of time series organization for clustering. With the support of DHDTree, we improve clustering in two ways. First, the DHD representation decreases the computation cost between time series dramatically. Second, we obtain the cluster centers from the reorganization of the time series using our proposed DHDTree structure. Both synthetic and real datasets verify the effectiveness and efficiency of the proposed method.

Qiuhong Li, Peng Wang, Yang Wang, Wei Wang, Yimin Liu, Jiaye Wu, Danyang Dou

Data Mining

Frontmatter
Fast Extended One-Versus-Rest Multi-label SVM Classification Algorithm Based on Approximate Extreme Points

In a large-scale multi-label classification framework, applications of the non-linear kernel extended one-versus-rest multi-label support vector machine (OVR-ESVM) classification algorithm are severely restricted by excessive training time. To deal with this problem, we improve the OVR-ESVM classification algorithm and propose a fast OVR-ESVM classification algorithm based on approximate extreme points (AEML-ESVM). The AEML-ESVM classification algorithm integrates the advantages of the OVR-ESVM classification algorithm and the binary approximate extreme points support vector machine (AESVM) classification algorithm. In other words, it can not only shorten training time greatly, but also reflect the label correlation of individual instances explicitly. Meanwhile, its classification performance is similar to that of the OVR-ESVM classification algorithm. Experimental results on three public datasets show that the AEML-ESVM classification algorithm can substantially reduce training time while its classification performance remains comparable with that of the OVR-ESVM classification algorithm. It also outperforms existing fast multi-label SVM classification algorithms in both training time and classification performance.

Zhongwei Sun, Zhongwen Guo, Xupeng Wang, Jing Liu, Shiyong Liu
Efficiently Discovering Most-Specific Mixed Patterns from Large Data Trees

Discovering informative tree patterns hidden in large datasets is an important research area with many practical applications. Over the years, research has evolved from mining induced patterns to mining embedded patterns. Mixed patterns allow extracting all the information extracted by embedded or induced patterns, but also more detailed information which cannot be extracted by the other two. Unfortunately, the problem of extracting unconstrained mixed patterns from data trees has not been addressed up to now. In this paper, we address the problem of mining unordered frequent mixed patterns from large trees. We propose a novel approach that non-redundantly extracts most-specific mixed patterns. Our approach utilizes effective pruning techniques to reduce the pattern search space. It exploits efficient homomorphic pattern matching algorithms to compute pattern support incrementally and avoids the costly enumeration of all pattern matchings required by older approaches. An extensive experimental evaluation shows that our approach not only mines mixed patterns from real and synthetic datasets up to several orders of magnitude faster than older state-of-the-art embedded tree mining algorithms applied to large data trees, but also scales well, enabling the extraction of informative mixed patterns from large datasets for which no previous approaches exist.

Xiaoying Wu, Dimitri Theodoratos
Max-Cosine Matching Based Neural Models for Recognizing Textual Entailment

Recognizing textual entailment (RTE) is a fundamental task in a variety of text mining and natural language processing applications. This paper proposes a simple neural model for the RTE problem. It first matches each word in the hypothesis with its most-similar word in the premise, producing an augmented representation of the hypothesis conditioned on the premise as a sequence of word pairs. An LSTM model is then used to model this augmented sequence, and the final output from the LSTM is fed into a softmax layer to make the prediction. Beyond the base model, we also propose three techniques to enhance its performance: the integration of multiple word-embedding libraries, bi-way integration, and an ensemble based on model averaging. Experimental results on the SNLI dataset show that the three techniques are effective in boosting predictive accuracy and that our method outperforms several state-of-the-art ones.
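
The max-cosine matching step, pairing each hypothesis word with its most similar premise word, can be written compactly; the toy embeddings below are placeholders for whatever pre-trained vectors are actually used.

```python
import numpy as np

def max_cosine_pairs(hypothesis, premise, embeddings):
    """For each hypothesis token, find the premise token with the highest
    cosine similarity. Returns a list of (hyp_word, best_premise_word, score)."""
    def unit(w):
        return embeddings[w] / np.linalg.norm(embeddings[w])
    premise_vecs = {w: unit(w) for w in premise}
    pairs = []
    for h in hypothesis:
        hv = unit(h)
        best = max(premise_vecs.items(), key=lambda kv: float(np.dot(hv, kv[1])))
        pairs.append((h, best[0], float(np.dot(hv, best[1]))))
    return pairs

# Toy embeddings; in practice these come from a pre-trained embedding library.
rng = np.random.default_rng(1)
emb = {w: rng.normal(size=50) for w in ["a", "man", "plays", "guitar", "person", "music"]}
print(max_cosine_pairs(["person", "music"], ["a", "man", "plays", "guitar"], emb))
```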

Zhipeng Xie, Junfeng Hu
An Intelligent Field-Aware Factorization Machine Model

The widely used field-aware factorization machine model (FFM) takes the interactions of all text features into consideration, which leads to a large number of invalid calculations. An intelligent field-aware factorization machine model (iFFM) is proposed in this paper. In the model, the key attributes are promoted and the factor selection operations are embedded into the computation process intelligently by using automated feature engineering. Meanwhile, Markov Chain Monte Carlo (MCMC) and stochastic gradient descent (SGD) methods are applied to optimize the loss function and improve the recommendation accuracy. In order to obtain better diversity, a new model, iFFM-2, is put forward, which is a linear weighted combination of iFFM and a model built on the heat-spreading algorithm. The experimental results show that iFFM obtains higher accuracy and computational efficiency compared with FFMs, while iFFM-2 inherits the accuracy of iFFM and provides better diversity.
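
For reference, a standard FFM score combines pairwise interactions through field-aware latent vectors; the sketch below implements the usual FFM formula, not the iFFM extensions described above.

```python
import numpy as np

def ffm_score(x, fields, w0, w, V):
    """Standard FFM prediction for a sparse instance.
    x: dict feature_index -> value; fields: dict feature_index -> field_index;
    V[i][f] is the latent vector of feature i when interacting with field f."""
    score = w0 + sum(w[i] * xi for i, xi in x.items())
    feats = list(x.items())
    for a in range(len(feats)):
        for b in range(a + 1, len(feats)):
            i, xi = feats[a]
            j, xj = feats[b]
            score += float(np.dot(V[i][fields[j]], V[j][fields[i]])) * xi * xj
    return score

# Tiny example: 3 features in 2 fields, 4-dimensional latent vectors.
rng = np.random.default_rng(0)
V = {i: {f: rng.normal(scale=0.1, size=4) for f in range(2)} for i in range(3)}
w = {i: 0.01 for i in range(3)}
print(ffm_score({0: 1.0, 2: 1.0}, {0: 0, 1: 0, 2: 1}, w0=0.0, w=w, V=V))
```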

Cairong Yan, Qinglong Zhang, Xue Zhao, Yongfeng Huang

Query Processing and Optimization (I)

Frontmatter
Beyond Skylines: Explicit Preferences

Skyline queries are well known in the database community, and there are many algorithms for the computation of the Pareto frontier. But users do not only think of finding the Pareto optimal objects; they often want to find the best objects with respect to an explicitly specified preference order. While preferences themselves are often defined as general strict partial orders, almost all algorithms are designed to evaluate Pareto preferences combining weak orders, i.e., Skylines. In this paper, we consider general strict partial orders and present a method to evaluate such explicit preferences by embedding any strict partial order into a complete lattice. This enables preference evaluation with specialized lattice-based algorithms instead of algorithms relying on tuple-to-tuple comparisons and therefore speeds up their computation, as can be seen in our experiments.

Markus Endres, Timotheus Preisinger
Optimizing Window Aggregate Functions in Relational Database Systems

The window function has become an important OLAP extension of SQL since SQL:2003, and is supported by major commercial RDBMSs (e.g. Oracle, DB2, SQL Server, Teradata and Pivotal Greenplum) and by emerging Big Data platforms (e.g. Google Tenzing, Apache Hive, Pivotal HAWQ and Cloudera Impala). Window functions are designed for advanced data analytics use cases, bringing significant functional and performance enhancements to OLAP and decision support applications. However, we find that existing window function evaluation approaches still leave significant room for improvement. In this paper, we revisit the conventional two-phase evaluation framework for window functions in relational databases, and propose several novel optimization techniques which aim to minimize redundant data accesses and computations during the function calls invoked over window frames. We have integrated the proposed techniques into PostgreSQL, and compared them with both PostgreSQL’s and SQL Server’s native window function implementations over the TPC benchmark. Our comprehensive experimental studies demonstrate significant speedups over existing approaches.
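
The kind of redundancy such optimizations target can be illustrated with a sliding SUM frame: instead of re-summing the whole frame for every row, an incremental evaluator adds the entering row and subtracts the leaving one. This is a generic illustration in Python, not the PostgreSQL implementation.

```python
def sliding_sum(values, preceding, following):
    """SUM(...) OVER (ROWS BETWEEN preceding PRECEDING AND following FOLLOWING),
    computed incrementally in O(n) instead of O(n * frame_size)."""
    n = len(values)
    window_sum = sum(values[0:min(following + 1, n)])  # frame of row 0
    results = [window_sum]
    for i in range(1, n):
        entering = i + following        # row that enters the frame
        leaving = i - preceding - 1     # row that leaves the frame
        if entering < n:
            window_sum += values[entering]
        if leaving >= 0:
            window_sum -= values[leaving]
        results.append(window_sum)
    return results

print(sliding_sum([3, 1, 4, 1, 5, 9, 2], preceding=1, following=1))
```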

Guangxuan Song, Jiansong Ma, Xiaoling Wang, Cheqing Jin, Yu Cao
Query Optimization on Hybrid Storage

Thanks to the rapid growth of memory capacity, it is now feasible to perform query processing completely in memory. Nevertheless, as main memory is substantially more expensive than most secondary storage devices, including HDDs and SSDs, it is not suitable for storing cold data. Therefore, a hybrid data storage composed of both memory and secondary storage is expected to stay popular in the foreseeable future. In this paper, we introduce a query optimization model for hybrid data storage. Different from traditional query processors, which treat either main memory as a cache or secondary storage as an anti-cache, our model performs semantic data partitioning between memory and secondary storage. Query optimization can thus take the partitioning of data into account to achieve enhanced performance. We conducted an experimental evaluation on a columnar query engine to demonstrate the advantage of the proposed approach.

Anxuan Yu, Qingzhong Meng, Xuan Zhou, Binyu Shen, Yansong Zhang
Efficient Batch Grouping in Relational Datasets

Data grouping is an expensive and frequently used operator in data processing. Meanwhile, data is often too big to fit in memory, in which case disk-sorting-based methods are often employed. Disk sorting reads and writes the entire dataset many times, which is very time-consuming, so reducing I/O costs is of great significance. In many applications, grouping a set of records multiple times on different keys is very common. In this paper we study grouping in a batch manner and techniques for sharing intermediate results to gain efficiency. In the batch grouping setting, different grouping orders may result in different I/O costs. To minimize I/O costs, we formalize the group-order scheduling problem as an optimization problem, which we prove to be NP-complete, and then propose a heuristic algorithm. Experimental results on TPC-H as well as synthetic datasets show the efficiency and robustness of our techniques.

Jizhou Sun, Jianzhong Li, Hong Gao

Text Mining

Frontmatter
Memory-Enhanced Latent Semantic Model: Short Text Understanding for Sentiment Analysis

Short texts, such as tweets and reviews, are not easy to process using conventional methods because of their short length, irregular syntax and lack of statistical signals. Term dependencies can be used to alleviate the problem and to mine latent semantics hidden in short texts, and Long Short-Term Memory networks (LSTMs) can capture and remember term dependencies over a long distance. LSTMs have been widely used to mine the semantics of short texts. At the same time, by analyzing the text, we find that a number of key words contribute greatly to the semantics of the text. In this paper, we propose an LSTM-based model (MLSM) to enhance the memory of the key words in short text. The proposed model is evaluated on two datasets: IMDB and SemEval2016. Experimental results demonstrate that the proposed method is effective, with significant performance gains over the baseline LSTM and several other latent semantic models.

Fei Hu, Xiaofei Xu, Jingyuan Wang, Zhanbo Yang, Li Li
Supervised Intensive Topic Models for Emotion Detection over Short Text

With the emergence of social media services, documents that only include a few words are becoming increasingly prevalent. More and more users post short messages to express their feelings and emotions through Twitter, Flickr, YouTube and other apps. However, the sparsity of word co-occurrence patterns in short text brings new challenges to emotion detection tasks. In this paper, we propose two supervised intensive topic models to associate latent topics with emotional labels. The first model constrains topics to relevant emotions, and then generates document-topic probability distributions. The second model establishes association among biterms and emotions by topics, and then estimates word-emotion probabilities. Experiments on short text emotion detection validate the effectiveness of the proposed models.

Yanghui Rao, Jianhui Pang, Haoran Xie, An Liu, Tak-Lam Wong, Qing Li, Fu Lee Wang
Leveraging Pattern Associations for Word Embedding Models

Word embedding methods have been shown to be powerful in capturing word associations, and have facilitated numerous applications by effectively bridging lexical gaps. Word semantics are encoded with vectors and modeled based on n-gram language models; as a result, only word co-occurrences within a shallow sliding window are taken into consideration. However, this language modeling assumption ignores valuable associations between words at long distances beyond the n-gram coverage. In this paper, we argue that it is beneficial to jointly model both the surrounding context and flexible associative patterns, so that the model can cover long-distance and intensive associations. We propose a novel approach to combine associated patterns with the word embedding method via a joint training objective. We apply our model to query expansion in a document retrieval task. Experimental results show that the proposed method performs significantly better than state-of-the-art baseline models.

Qian Liu, Heyan Huang, Yang Gao, Xiaochi Wei, Ruiying Geng
Multi-Granularity Neural Sentence Model for Measuring Short Text Similarity

Measuring the semantic similarity between short texts is a critical and fundamental task because it is the basis for many applications. Although existing methods have explored this problem by enriching short text representations with pre-trained word embeddings, the performance is still far from satisfactory because of the limited feature information. In this paper, we present an effective approach that combines a convolutional neural network and long short-term memory to exploit character-level to sentence-level features for the semantic matching of short texts. The proposed approach nicely models the feature information of sentences with multiple representations and captures rich matching patterns at different levels. Our model is rather generic and can hence be applied to matching tasks in different languages. We use both paraphrase identification and semantic similarity tasks to evaluate our approach. The experimental results demonstrate that the proposed multi-granularity neural sentence model obtains a significant improvement on measuring short text similarity compared with existing benchmark approaches.

Jiangping Huang, Shuxin Yao, Chen Lyu, Donghong Ji

Recommendation

Frontmatter
Leveraging Kernel Incorporated Matrix Factorization for Smartphone Application Recommendation

The explosive growth in the number of smartphone applications (apps) available on the market poses a significant challenge to making personalized recommendations based on user preferences. The training data usually consists of sparse binary implicit feedback (i.e., user-app installation pairs), which results in ambiguities in representing users’ interests due to a lack of negative examples. In this paper, we propose two kernel incorporated matrix factorization models to predict user preferences for apps by introducing the categorical information of the apps. The two models extend Probabilistic Matrix Factorization (PMF) by constraining the user and app latent features to be similar to their neighbors in the app-categorical space, and adopt Stochastic Gradient Descent (SGD)-based methods to learn the models. The experimental results show that our models outperform the baselines in terms of two ranking-oriented evaluation metrics.
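
As background, a plain PMF update (which the proposed models extend with app-category neighborhood constraints) can be written as a regularized SGD step; the code below is the textbook version, not the paper's extended objective.

```python
import numpy as np

def pmf_sgd_epoch(observations, U, V, lr=0.01, reg=0.05):
    """One SGD pass of plain Probabilistic Matrix Factorization.
    observations: list of (user, item, rating); U, V: latent factor matrices."""
    for u, i, r in observations:
        err = r - U[u] @ V[i]
        u_old = U[u].copy()
        U[u] += lr * (err * V[i] - reg * U[u])   # gradient step for the user factor
        V[i] += lr * (err * u_old - reg * V[i])  # gradient step for the item factor
    return U, V

rng = np.random.default_rng(0)
U, V = rng.normal(scale=0.1, size=(5, 8)), rng.normal(scale=0.1, size=(7, 8))
ratings = [(0, 1, 1.0), (0, 3, 0.0), (2, 1, 1.0), (4, 6, 1.0)]
for _ in range(50):
    U, V = pmf_sgd_epoch(ratings, U, V)
print(U[0] @ V[1])  # should move toward the observed value 1.0
```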

Chenyang Liu, Jian Cao, Jing He
Preference Integration in Context-Aware Recommendation

In recommender systems, recommendation tasks are usually implemented by preference mining. Most existing methods in context-aware recommendation learn user interests from all the records to implement preference mining. However, while some records contribute to recommendations, others may reduce performance. To address this limitation, we propose a division learning strategy to divide the original records into several groups based on regression tree techniques. Then, a two-layer preference mining process is carried out to produce group and local preferences. Finally, the two preferences are integrated by the preference integration (Prin) approach to give recommendations. The experimental results demonstrate that our model outperforms other state-of-the-art methods, which illustrates the importance of targeted modeling.

Lin Zheng, Fuxi Zhu
Jointly Modeling Heterogeneous Temporal Properties in Location Recommendation

Point-Of-Interest (POI) recommendation systems suggest interesting locations to users based on their previous check-ins in location-based social networks (LBSNs). An individual’s visits to a location are partially affected by many factors, including social links, travel distance and time. A growing line of research has been devoted to taking advantage of various effects to improve existing location recommendation methods. However, the temporal influence has numerous dimensions which deserve to be explored more deeply. The subset property comprises a set of homogeneous slots such as the hour of the day, the day of the week, the week of the month, the month of the year, and so on. In addition, time has other attributes such as recency, which distinguishes newly visited locations from others. In this paper, we further study the role of the time factor in recommendation models. Accordingly, we define a new problem: jointly modeling a pair of heterogeneous time-related effects (recency and the subset feature) in location recommendation. To address the challenges, we propose a generative model which computes the probability that the query user visits a candidate location based on various homogeneous subset attributes. At the same time, the model calculates how likely newly visited venues are to obtain a higher rank than others. The model finally performs POI recommendation by combining the effects learned from both homogeneous and heterogeneous temporal influences. Extensive experiments are conducted on two real-life datasets. The results show that our system achieves better effectiveness than other competitors in location recommendation.

Saeid Hosseini, Hongzhi Yin, Meihui Zhang, Xiaofang Zhou, Shazia Sadiq
Location-Aware News Recommendation Using Deep Localized Semantic Analysis

With the popularity of mobile devices and the rapid growth of the mobile Web, users can now browse news wherever they want, so their news preferences are usually strongly correlated with their geographical contexts. Consequently, much research effort has been devoted to location-aware news recommendation; the explored approaches can mainly be divided into physical distance-based and geographical topic-based ones. For geographical topic-based location-aware news recommendation, ELSA is the state-of-the-art geographical topic model: it has been reported to outperform many other topic models, e.g., BOW, LDA, and ESA. However, the Wikipedia-based topic space in ELSA suffers from problems of high dimensionality, sparsity, and redundancy, which greatly degrade its recommendation performance. To overcome these problems, in this work we propose three novel geographical topic feature models, CLSA, ALSA, and DLSA, which integrate clustering, autoencoders, and recommendation-oriented deep neural networks, respectively, with ELSA to extract dense, abstract, low-dimensional, and effective topic features from the Wikipedia-based topic space for the representation of news and locations. Experimental results show that (i) CLSA, ALSA, and DLSA all greatly outperform the state-of-the-art geographical topic model, ELSA, in location-aware news recommendation in terms of both recommendation effectiveness and efficiency; (ii) Deep Localized Semantic Analysis (DLSA) achieves the most significant improvements: its precision, recall, MRR, and MAP are all about 3 times better than those of ELSA, while its recommendation time cost is only about 1/29 of that of ELSA; and (iii) DLSA, ALSA, and CLSA can also remedy the “cold-start” problem by uncovering users’ latent news preferences at new locations.

Cheng Chen, Thomas Lukasiewicz, Xiangwu Meng, Zhenghua Xu
Review-Based Cross-Domain Recommendation Through Joint Tensor Factorization

Cross-domain recommendation, which aims to transfer knowledge from auxiliary domains to target domains, has become an important way to solve the problems of data sparsity and cold start in recommendation systems. However, most existing works only consider ratings and tags, ignoring text information such as reviews. In reality, review text often explains why a product gained high or low ratings and reflects users’ sentiment towards different aspects of an item. For instance, reviews can be leveraged to obtain users’ attitudes towards a specific aspect, such as the “screen” or “battery” of a “cell phone”. Taking these aspect factors into cross-domain recommendation tells us more about user preferences, and thus could potentially improve recommendation performance. In this paper, we study for the first time how to fully exploit the aspect factors extracted from review text to improve the performance of cross-domain recommendation. Specifically, we first model each user’s sentiment orientation and degree of concern towards different aspects of items, extracted from reviews, as tensors. To effectively transfer users’ aspect-level preferences towards items, we propose a joint tensor factorization model over the auxiliary and target domains together. Experimental results on real datasets show the superior performance of the proposed method, especially for cold-start users in the target domain, in comparison with several state-of-the-art cross-domain recommendation methods.

Tianhang Song, Zhaohui Peng, Senzhang Wang, Wenjing Fu, Xiaoguang Hong, Philip S. Yu

Security, Privacy, Sensor and Cloud

Frontmatter
A Local-Clustering-Based Personalized Differential Privacy Framework for User-Based Collaborative Filtering

The Collaborative Filtering (CF) algorithm plays an essential role in recommender systems. However, CF relies on users’ direct information to provide good recommendations, which may cause major privacy issues. To address these problems, Differential Privacy (DP) has been introduced into CF recommendation algorithms. In this paper, we propose a novel framework called Local-clustering-based Personalized Differential Privacy (LPDP) as an extension of DP. In LPDP, we take the privacy requirements specified at the item level into consideration instead of employing the same level of privacy guarantee for all users. Moreover, we introduce a local-similarity-based item clustering process into the LPDP scheme, so that any items within the same local cluster are hidden. We conduct a theoretical analysis of the privacy guarantees provided by the proposed LPDP scheme. We experimentally evaluate the LPDP scheme on real datasets and demonstrate its superior performance in recommendation quality.
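
For orientation, the basic building block behind differentially private schemes of this kind is the Laplace mechanism, which perturbs a value with noise calibrated to its sensitivity and the privacy budget ε: a smaller ε means more noise. The per-item budgets below are hypothetical, and this is only the generic mechanism, not the full LPDP framework.

```python
import numpy as np

def laplace_mechanism(value, sensitivity, epsilon, rng=None):
    """Release value + Laplace(sensitivity / epsilon) noise,
    satisfying epsilon-differential privacy for that value."""
    rng = rng or np.random.default_rng()
    scale = sensitivity / epsilon
    return value + rng.laplace(loc=0.0, scale=scale)

# Per-item privacy budgets: a stricter (smaller) epsilon adds more noise.
ratings = {"item_a": 4.0, "item_b": 2.0}
budgets = {"item_a": 0.5, "item_b": 2.0}   # hypothetical item-level requirements
noisy = {i: laplace_mechanism(r, sensitivity=1.0, epsilon=budgets[i])
         for i, r in ratings.items()}
print(noisy)
```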

Yongkai Li, Shubo Liu, Jun Wang, Mengjun Liu
Fast Multi-dimensional Range Queries on Encrypted Cloud Databases

With the adoption of cloud computing, data owners can store their datasets on clouds for lower cost and better performance. However, privacy issues compel sensitive data to be encrypted before outsourcing, which inevitably introduces challenges in terms of search functionalities. This paper considers the issue of multi-dimensional range queries on encrypted cloud databases. Prior schemes focusing on this issue are weak in either security or efficiency. In this paper, using our improved asymmetric scalar-product-preserving encryption, we present an innovative technique for the encrypted rectangle intersection problem. Based on this technique, we propose a tree-based method to handle multi-dimensional range queries in encrypted form. Thorough analysis demonstrates that our method is secure under the honest-but-curious model and the known-plaintext attack model. Experimental results on both real-life and artificial datasets and comprehensive comparisons with other schemes show the high efficiency of our proposed approach.

Jialin Chi, Cheng Hong, Min Zhang, Zhenfeng Zhang
When Differential Privacy Meets Randomized Perturbation: A Hybrid Approach for Privacy-Preserving Recommender System

Privacy risks of recommender systems have attracted increasing attention. Users’ private data is often collected by possibly untrusted recommender systems in order to provide high-quality recommendations. Meanwhile, malicious attackers may use recommendation results to make inferences about other users’ private data. Existing approaches focus either on keeping users’ private data protected during recommendation computation or on preventing the inference of any single user’s data from the recommendation result; none is designed both to hide users’ private data and to prevent privacy inference. To achieve this goal, we propose a hybrid approach for privacy-preserving recommender systems by combining differential privacy (DP) with randomized perturbation (RP). We theoretically show that the noise added by RP has limited effect on recommendation accuracy and that the noise added by DP can be well controlled based on the sensitivity analysis of functions on the perturbed data. Extensive experiments on three large-scale real-world datasets show that the hybrid approach generally provides more privacy protection with acceptable recommendation accuracy loss, and surprisingly sometimes achieves better privacy without sacrificing accuracy, thus validating its feasibility in practice.

Xiao Liu, An Liu, Xiangliang Zhang, Zhixu Li, Guanfeng Liu, Lei Zhao, Xiaofang Zhou
Supporting Cost-Efficient Multi-tenant Database Services with Service Level Objectives (SLOs)

Quality of Service (QoS) is at the core of the vision of Database as a Service (DBaaS). Traditional DBaaS approaches often reserve computation resources (e.g. CPU and memory) to satisfy tenants’ QoS guarantees under various circumstances, which inevitably results in poor resource utilization, as tenants’ actual workloads are usually below the expectations described by their Service Level Objectives (SLOs). In this paper, we propose a novel scheme, FrugalDB, to enhance resource utilization for DBaaS systems with QoS guarantees. FrugalDB accommodates two independent database engines: an in-memory engine for heavy workloads with tight SLOs, and a disk-based engine for light workloads with loose SLOs. By allocating each tenant’s workload to an appropriate engine via workload migration, this dual-engine scheme can substantially save computation resources and thus consolidate more tenants on a single database server. FrugalDB tries to minimize the migration cost incurred in moving workloads between the two engines. With an effective workload estimation method and an efficient migration scheduling algorithm, FrugalDB responds quickly to workload changes and executes workload migrations with minimal overhead. We evaluate FrugalDB with extensive experiments, which show that it achieves a high tenant consolidation rate with few SLO violations.

Yifeng Luo, Junshi Guo, Jiaye Zhu, Jihong Guan, Shuigeng Zhou
Recovering Missing Values from Corrupted Spatio-Temporal Sensory Data via Robust Low-Rank Tensor Completion

With the booming of the Internet of Things, a tremendous number of sensors have been installed in different geographic locations, generating massive sensory data with both time stamps and geo-tags. Such data usually exhibit complex spatio-temporal correlations and are easily missing in practice due to communication failure or data corruption. In this paper, we aim to tackle the challenge of accurately and efficiently recovering the missing values in corrupted spatio-temporal sensory data. Specifically, we first formulate such sensor data as a high-dimensional tensor that naturally preserves the sensors’ geographical and time information, which we call a spatio-temporal tensor. Then we model the sensor data recovery as a robust low-rank tensor completion problem by exploiting its latent low-rank structure and sparse noise property. To solve this optimization problem, we design a highly efficient optimization method that combines the alternating direction method of multipliers and accelerated proximal gradient to minimize the tensor’s convex surrogate and the noise’s $$\ell_1$$-norm. In addition to testing our method on a synthetic dataset, we also use passive RFID (radio-frequency identification) sensors to build a real-world sensor-array testbed, which generates 115,200 sensor readings in total for model evaluation. The experimental results demonstrate the accuracy and robustness of our approach.
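
Two proximal operators do most of the work in this style of robust low-rank completion: entrywise soft-thresholding (the sparse-noise / ℓ1 step) and singular value thresholding of an unfolding (the low-rank step). The sketch below shows only these two operators on a matrix, as a simplified stand-in for the tensor algorithm, with invented toy data.

```python
import numpy as np

def soft_threshold(X, tau):
    """Entrywise proximal operator of tau * ||X||_1 (shrinks entries toward zero)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def singular_value_threshold(X, tau):
    """Proximal operator of tau * nuclear norm: soft-threshold the singular values."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt

rng = np.random.default_rng(0)
low_rank = rng.normal(size=(20, 3)) @ rng.normal(size=(3, 15))        # rank-3 signal
sparse_noise = soft_threshold(rng.normal(size=(20, 15)), 1.5)         # few large outliers
corrupted = low_rank + sparse_noise
denoised = singular_value_threshold(corrupted, tau=5.0)               # low-rank estimate
print(np.linalg.matrix_rank(denoised, tol=1e-6))
```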

Wenjie Ruan, Peipei Xu, Quan Z. Sheng, Nickolas J. G. Falkner, Xue Li, Wei Emma Zhang

Social Network Analytics (I)

Frontmatter
Group-Level Influence Maximization with Budget Constraint

Influence maximization aims at finding a set of seed nodes in a social network that can influence the largest number of nodes. Existing work often focuses on the influence of individual nodes, ignoring that infecting different seeds may require different costs. Nonetheless, in many real-world applications such as advertising, advertisers care more about the influence of groups (e.g., crowds in the same areas or communities) than of specific individuals, and are very concerned about how to maximize influence with a limited budget. In this paper, we investigate the problem of group-level influence maximization with a budget constraint. Towards this, we introduce a statistical method to reveal the influence relationships between groups, based on which we propose a propagation model that can dynamically calculate the influence spread of seed groups, followed by a greedy algorithm called GLIMB that maximizes the influence spread under a limited cost budget via optimization of the seed-group portfolio. Theoretical analysis shows that GLIMB can guarantee an approximation ratio of at least $$(1-1/\sqrt{e})$$. Experimental results on both synthetic and real-world datasets verify the effectiveness and efficiency of our approach.
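
The budgeted greedy selection follows the familiar pattern of repeatedly adding the group with the best marginal gain per unit cost that still fits the remaining budget; the `spread` function below is a placeholder for the paper's propagation model, and the code is a generic sketch rather than GLIMB itself.

```python
def greedy_budgeted_selection(groups, costs, spread, budget):
    """Greedily add the group with the highest marginal influence gain per
    unit cost until the budget is exhausted. `spread(S)` returns the influence
    of a set of seed groups and is assumed to be monotone."""
    selected, remaining = set(), budget
    while True:
        best, best_ratio = None, 0.0
        base = spread(selected)
        for g in groups:
            if g in selected or costs[g] > remaining:
                continue
            ratio = (spread(selected | {g}) - base) / costs[g]
            if ratio > best_ratio:
                best, best_ratio = g, ratio
        if best is None:
            return selected
        selected.add(best)
        remaining -= costs[best]

# Toy spread: coverage of each group's "audience" (diminishing returns on overlap).
audiences = {"g1": {1, 2, 3}, "g2": {3, 4}, "g3": {5, 6, 7, 8}}
spread = lambda S: len(set().union(*(audiences[g] for g in S))) if S else 0
print(greedy_budgeted_selection(audiences.keys(), {"g1": 2, "g2": 1, "g3": 4}, spread, budget=5))
```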

Qian Yan, Hao Huang, Yunjun Gao, Wei Lu, Qinming He
Correlating Stressor Events for Social Network Based Adolescent Stress Prediction

Increasingly severe psychological stress damages mental health in this highly competitive society, especially for immature teenagers who cannot cope with stress well. It is of great significance to predict teenagers’ psychological stress in advance and prepare targeted help in time. Because stressor events are the source of stress and affect its progression, in this paper we offer a novel insight into the correlation between stressor events and stress series (stressor-stress correlation, denoted SSC) and propose an SSC-based stress prediction model on a microblog platform. Considering both linguistic and temporal correlations between stressor series and stress series, we first quantify the stressor-stress correlation with a KNN method. Afterward, a dynamic NARX recurrent neural network is constructed to integrate the impact of stressor events for predicting teens’ stress in future episodes. Experimental results on a real dataset of 124 high school students verify that our prediction framework achieves promising performance and outperforms baseline methods. Integrating the correlation of stressor events is shown to be effective in stress prediction, significantly improving the average prediction accuracy.

Qi Li, Liang Zhao, Yuanyuan Xue, Li Jin, Mostafa Alli, Ling Feng
Emotion Detection in Online Social Network Based on Multi-label Learning

Emotion detection in online social networks benefits many applications, such as recommendation systems and personalized advertisement services. Traditional sentiment or emotion analysis mainly addresses polarity prediction or single-label classification, while ignoring the co-existence of multiple emotion labels in one instance. In this paper, we address the multiple emotion detection problem in online social networks and formulate it as a multi-label learning problem. By making observations on an annotated Twitter dataset, we discover that multiple emotion labels are correlated and influenced by social network relationships. Based on these observations, we propose a factor graph model to incorporate emotion labels and social correlations into a unified framework, and solve the emotion detection problem with a multi-label learning algorithm. Performance evaluation shows that the proposed approach outperforms existing baseline algorithms.

Xiao Zhang, Wenzhong Li, Sanglu Lu
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Edited by
Selçuk Candan
Lei Chen
Torben Bach Pedersen
Lijun Chang
Wen Hua
Copyright year
2017
Electronic ISBN
978-3-319-55753-3
Print ISBN
978-3-319-55752-6
DOI
https://doi.org/10.1007/978-3-319-55753-3