2017 | Book

Databases Theory and Applications

28th Australasian Database Conference, ADC 2017, Brisbane, QLD, Australia, September 25–28, 2017, Proceedings

About this book

This book constitutes the refereed proceedings of the 28th Australasian Database Conference, ADC 2017, held in Brisbane, QLD, Australia, in September 2017.

The 20 full papers presented together with 2 demo papers were carefully reviewed and selected from 32 submissions. The mission of ADC is to share novel research solutions to problems of today’s information society that fulfill the needs of heterogeneous applications and environments and to identify new issues and directions for future research and development work. The topics of the presented papers are related to all practical and theoretical aspects of advanced database theory and applications, as well as case studies and implementation experiences.

Table of Contents

Frontmatter

Spatial Databases

Frontmatter
The Flexible Group Spatial Keyword Query
Abstract
We propose the flexible group spatial keyword query and algorithms to process three variants of the query in the spatial textual domain: (i) the group nearest neighbor with keywords query, which finds the data object that optimizes the aggregate cost function for a whole group Q of n query objects, (ii) the subgroup nearest neighbor with keywords query, which finds the optimal subgroup of query objects and the data object that optimizes the aggregate cost function for a given subgroup size m (m ≤ n), and (iii) the multiple subgroup nearest neighbor with keywords query, which finds optimal subgroups and corresponding data objects for each subgroup size in the range [m, n]. We design query processing algorithms based on the branch-and-bound and best-first paradigms. Finally, we conduct extensive experiments on two real datasets to show the efficiency of the proposed algorithms.
Sabbir Ahmad, Rafi Kamal, Mohammed Eunus Ali, Jianzhong Qi, Peter Scheuermann, Egemen Tanin
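To make the query variants above concrete, here is a minimal brute-force sketch of variant (ii), the subgroup nearest neighbor with keywords query. The data layout (point/keyword-set pairs) and the keyword-coverage rule are illustrative assumptions, not the authors' algorithms, which rely on branch-and-bound and best-first search rather than enumeration:

    from itertools import combinations
    from math import dist

    def subgroup_nn(queries, objects, m, agg=max):
        # Brute force: choose the subgroup of m query points and the data
        # object minimizing the aggregate distance, considering only objects
        # whose keywords cover all keywords requested by the subgroup.
        best = None  # (cost, subgroup, object location)
        for sub in combinations(queries, m):
            required = set().union(*(kw for _, kw in sub))
            for loc, kws in objects:
                if not required <= kws:
                    continue  # object must carry every keyword the subgroup asks for
                cost = agg(dist(q, loc) for q, _ in sub)
                if best is None or cost < best[0]:
                    best = (cost, sub, loc)
        return best

    queries = [((0, 0), {"cafe"}), ((4, 0), {"wifi"}), ((9, 9), {"cafe"})]
    objects = [((1, 1), {"cafe", "wifi"}), ((5, 5), {"cafe", "wifi", "parking"})]
    print(subgroup_nn(queries, objects, m=2))

Variant (iii) would amount to repeating this for every subgroup size in [m, n]; the paper's contribution is avoiding exactly this kind of exhaustive enumeration.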
Surrounding Join Query Processing in Spatial Databases
Abstract
Spatial join queries play an important role in spatial databases, and almost all distance-based join queries are built on range search or nearest neighbour (NN) search, namely the range join query and the kNN join query. In this paper, we propose a new join query called the surrounding join query. Given two point datasets Q and P of multidimensional objects, the surrounding join query retrieves, for each point in Q, all of its surrounding points in P. We propose algorithms that are able to process this new spatial join query efficiently. Evaluation on multiple real-world datasets shows that our approach achieves high performance.
Lingxiao Li, David Taniar, Maria Indrawan-Santiago, Zhou Shao
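The abstract positions the surrounding join against the classical distance-based joins. For reference, here is a naive kNN join baseline in Python — quadratic and index-free, purely illustrative, and not the paper's surrounding join or its efficient algorithms:

    import heapq

    def knn_join(Q, P, k):
        # Naive kNN join: for each point q in Q, return its k nearest
        # points in P by squared Euclidean distance. O(|Q| * |P|) time.
        def d2(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return {q: heapq.nsmallest(k, P, key=lambda p: d2(p, q)) for q in Q}

    Q = [(0, 0), (5, 5)]
    P = [(1, 0), (2, 2), (6, 5), (9, 9)]
    print(knn_join(Q, P, k=2))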
Boosting Point-Based Trajectory Search with Quad-Tree
Abstract
The availability of spatial data generated by objects enables people to search for similar patterns using a set of query points. In this paper, we focus on the point-based trajectory search problem, which returns the top-k trajectories for a set of query points. The primary purpose of this work is to revisit state-of-the-art search algorithms on various indices and to identify the best choice of spatial index, giving the reasons behind it. Furthermore, we propose an optimization of the search method that finds an initial upper bound for the query points, leading to further performance improvement. Lastly, extensive experiments on a real dataset verify the choice of index and the effectiveness of our proposed search method.
Munkh-Erdene Yadamjav, Sheng Wang, Zhifeng Bao, Bang Zhang
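Since the paper's argument turns on the choice of spatial index, a minimal point quad-tree (insert plus circular range query) may help fix ideas. This is a generic textbook structure, not the authors' implementation:

    class QuadTree:
        # Minimal point quad-tree: a node splits its square into four
        # children once it holds more than `cap` points.
        def __init__(self, x, y, size, cap=4):
            self.x, self.y, self.size, self.cap = x, y, size, cap
            self.points, self.children = [], None

        def insert(self, p):
            if self.children:
                return self._child(p).insert(p)
            self.points.append(p)
            if len(self.points) > self.cap and self.size > 1e-9:
                half = self.size / 2
                self.children = [QuadTree(self.x + dx, self.y + dy, half, self.cap)
                                 for dx in (0, half) for dy in (0, half)]
                pts, self.points = self.points, []
                for q in pts:
                    self._child(q).insert(q)

        def _child(self, p):
            half = self.size / 2
            i = (2 if p[0] >= self.x + half else 0) + (1 if p[1] >= self.y + half else 0)
            return self.children[i]

        def query(self, qx, qy, r):
            # Prune subtrees whose square cannot intersect the query circle.
            if qx + r < self.x or qx - r > self.x + self.size or \
               qy + r < self.y or qy - r > self.y + self.size:
                return []
            hits = [p for p in self.points if (p[0]-qx)**2 + (p[1]-qy)**2 <= r*r]
            for c in self.children or []:
                hits += c.query(qx, qy, r)
            return hits

    qt = QuadTree(0, 0, 100)
    for p in [(10, 10), (12, 11), (80, 75), (50, 50), (11, 12), (13, 13)]:
        qt.insert(p)
    print(qt.query(11, 11, r=3))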

Query Processing

Frontmatter
Query Refinement for Correlation-Based Time Series Exploration
Abstract
In this paper, we focus on the problem of exploring sequential data to discover time sub-intervals that satisfy certain pairwise correlation constraints. Unlike most existing works, we use the deviation from targeted pairwise correlation constraints as an objective to minimize. Moreover, we include user preferences as a second objective, in the form of maximizing similarity to the users' initial sub-intervals. The combination of these two objectives is prevalent in applications where users explore time series data to locate sub-intervals in which targeted patterns exist. Discovering such sub-intervals in time series data is extremely useful in various application areas such as network and environment monitoring.
Towards finding the optimal sub-interval (i.e., the optimal query) satisfying these objectives, we propose applying query refinement techniques to enable efficient processing of candidate queries. Specifically, we propose QFind, an efficient algorithm which refines a user's initial query to discover the optimal query by applying novel pruning techniques. QFind applies two-level pruning to safely skip processing unqualified candidate queries, and early-abandons the computation of correlation for some pairs based on a monotonicity property. We experimentally validate the efficiency of our proposed algorithm against a state-of-the-art algorithm under different settings using real and synthetic data.
Abdullah M. Albarrak, Mohamed A. Sharaf
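A hedged sketch of the problem the abstract sets up: exhaustively score every sub-interval by a weighted sum of correlation deviation and dissimilarity to the user's initial interval. The weighting scheme and the dissimilarity measure are assumptions; QFind's point is precisely to prune this quadratic candidate space rather than scan it:

    import numpy as np

    def refine_query(x, y, target_corr, init, alpha=0.5, min_len=20):
        # Among all sub-intervals [i, j) of length >= min_len, minimize a
        # weighted sum of (a) deviation of the Pearson correlation from the
        # target and (b) dissimilarity to the user's initial interval.
        s0, e0 = init
        n = len(x)
        best, best_cost = None, float("inf")
        for i in range(n - min_len + 1):
            for j in range(i + min_len, n + 1):
                c = np.corrcoef(x[i:j], y[i:j])[0, 1]
                deviation = abs(c - target_corr)
                dissim = (abs(i - s0) + abs(j - e0)) / n  # endpoint displacement
                cost = alpha * deviation + (1 - alpha) * dissim
                if cost < best_cost:
                    best, best_cost = (i, j), cost
        return best, best_cost

    rng = np.random.default_rng(0)
    x, y = rng.standard_normal(100), rng.standard_normal(100)
    print(refine_query(x, y, target_corr=0.5, init=(10, 60)))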
A Multi-way Semi-stream Join for a Near-Real-Time Data Warehouse
Abstract
Semi-stream processing, the operation of joining a stream of data with non-stream disk-based master data, is a crucial component of near-real-time data warehousing. The requirements for semi-stream joins are fast, accurate processing and the ability to function well with limited memory. Currently, semi-stream algorithms presented in the literature, such as MeshJoin, Semi-Stream Index Join and CacheJoin, can join only one foreign key in the stream data with one table in the master data. However, it is quite likely that stream data have multiple foreign keys that need to be joined with multiple tables in the master data. We extend CacheJoin into three new multi-way semi-stream joins, namely the Sequential, Semi-concurrent, and Concurrent joins. Initially, the new algorithms join two foreign keys in the stream data with two tables in the master data; however, they can easily be generalized to join with any number of tables in the master data. We evaluated the performance of all three algorithms, and our results show that the semi-concurrent architecture performs best under the same settings.
M. Asif Naeem, Kim Tung Nguyen, Gerald Weber
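Of the three architectures named above, the sequential one is the simplest to picture: join on the first foreign key, then on the second. The sketch below uses in-memory dicts as stand-ins for the disk-based master tables and the caching machinery that CacheJoin-style algorithms actually manage:

    def sequential_semi_stream_join(stream, master1, master2):
        # Each stream tuple carrying two foreign keys (fk1, fk2) is
        # enriched from two master tables, one after the other.
        for tup in stream:
            row1 = master1.get(tup["fk1"])       # first join
            if row1 is None:
                continue                          # no match: drop (or queue) the tuple
            row2 = master2.get(tup["fk2"])       # second join, only after the first
            if row2 is None:
                continue
            yield {**tup, **row1, **row2}

    master1 = {1: {"customer": "Ann"}, 2: {"customer": "Bo"}}
    master2 = {"a": {"product": "tea"}, "b": {"product": "rice"}}
    stream = [{"fk1": 1, "fk2": "a", "qty": 3}, {"fk1": 2, "fk2": "b", "qty": 1}]
    print(list(sequential_semi_stream_join(stream, master1, master2)))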
Decomposition-Based Approximation of Time Series Data with Max-Error Guarantees
Abstract
With the growing popularity of IoT, tremendous amounts of high-resolution time series data are being generated, transmitted, stored, and processed by modern sensor networks in different application domains, which naturally incurs extensive storage and computation costs in practice. Data compression is the key to resolving this challenge, and various compression techniques, either lossless or lossy, have been proposed and widely adopted in industry and academia. Although existing approaches are generally successful, we observe a unique characteristic in certain time series data, namely significant periodicity combined with strong randomness, which leads to poor compression performance with existing methods and hence calls for a specifically designed compression mechanism that can exploit the periodic and stochastic patterns at the same time. To this end, we propose a decomposition-based compression algorithm which divides the original time series into several components reflecting periodicity and randomness respectively, and then approximates each component accordingly to guarantee the overall compression ratio and maximum error. We conduct extensive evaluation on a real-world dataset, and the experimental results verify the superiority of our proposal compared with current state-of-the-art methods.
Boyu Ruan, Wen Hua, Ruiyuan Zhang, Xiaofang Zhou
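A minimal sketch of the decomposition idea, under assumptions the abstract does not fix: the periodic component is estimated as the per-phase mean, and the random residual is approximated by piecewise-constant segments whose per-point error stays within max_err (a segment's midpoint covers a value range of at most 2 * max_err):

    import numpy as np

    def compress(series, period, max_err):
        # Split the series into a periodic template plus a residual, then
        # approximate the residual with piecewise-constant segments under a
        # per-point max-error guarantee.
        x = np.asarray(series, dtype=float)
        phase = np.arange(len(x)) % period
        template = np.array([x[phase == p].mean() for p in range(period)])
        residual = x - template[phase]
        segments = []                       # (start index, value) pairs
        i = 0
        while i < len(residual):
            lo = hi = residual[i]
            j = i + 1
            while j < len(residual):
                nlo, nhi = min(lo, residual[j]), max(hi, residual[j])
                if nhi - nlo > 2 * max_err:  # midpoint can no longer cover all points
                    break
                lo, hi = nlo, nhi
                j += 1
            segments.append((i, (lo + hi) / 2))
            i = j
        return template, segments

    t = np.arange(200)
    series = np.sin(2 * np.pi * t / 20) + 0.05 * np.random.default_rng(0).standard_normal(200)
    template, segments = compress(series, period=20, max_err=0.1)
    print(len(segments), "segments for", len(series), "points")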

Similarity Search

Frontmatter
Searching k-Nearest Neighbor Trajectories on Road Networks
Abstract
With the proliferation of mobile devices, massive trajectory data has been generated, and searching trajectories by locations is one of the fundamental tasks. Previous work such as [3, 6, 9] has been proposed to answer such searches. This work typically measures the distance between trajectories and queries by the distance between query points and the GPS points of trajectories. Such a measurement can be inaccurate because the GPS points, generated at some sampling rate, are essentially discrete. To overcome this issue, we treat a trajectory as a sequence of line segments and compute the distance between a query point and a trajectory as the distance between the query point and its line segments. Next, we index the line segments with an R-tree and match each trajectory to its associated line segments via inverted lists. On top of this indexing structure, we propose a k-nearest neighbor (kNN) search algorithm. Moreover, we propose to cluster line segments and merge redundant trajectory IDs for higher efficiency. Experimental results validate that the proposed method significantly outperforms existing approaches in terms of both storage cost and query performance.
Pengcheng Yuan, Qinpei Zhao, Weixiong Rao, Mingxuan Yuan, Jia Zeng
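The core geometric step described in the abstract — the distance from a query point to a line segment — is standard; a self-contained sketch follows. The paper's contribution lies in the R-tree and inverted-list indexing built on top of it:

    def point_segment_distance(p, a, b):
        # Project p onto the supporting line of segment ab, clamp the
        # projection parameter to [0, 1], and measure to that closest point.
        (px, py), (ax, ay), (bx, by) = p, a, b
        dx, dy = bx - ax, by - ay
        if dx == dy == 0:                       # degenerate segment
            return ((px - ax) ** 2 + (py - ay) ** 2) ** 0.5
        t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
        t = max(0.0, min(1.0, t))               # clamp to the segment
        cx, cy = ax + t * dx, ay + t * dy
        return ((px - cx) ** 2 + (py - cy) ** 2) ** 0.5

    trajectory = [(0, 0), (2, 0), (4, 2)]       # polyline of GPS fixes
    q = (1, 1)
    print(min(point_segment_distance(q, a, b)
              for a, b in zip(trajectory, trajectory[1:])))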
Efficient Supervised Hashing via Exploring Local and Inner Data Structure
Abstract
Recent years have witnessed the promising capacity of hashing techniques for tackling nearest neighbor search, owing to their high efficiency in storage and retrieval. Data-independent approaches (e.g., Locality Sensitive Hashing) normally construct hash functions using random projections, which neglect intrinsic data properties. To compensate for this drawback, learning-based approaches explore local data structure and/or supervised information to boost hashing performance. However, due to the construction of the Laplacian matrix, existing methods usually suffer from unaffordable training costs. In this paper, we propose a novel supervised hashing scheme, which has the merits of (1) exploring the inherent neighborhoods of samples; (2) significantly reducing the training cost on massive training data by employing an approximate anchor graph; and (3) preserving semantic similarity by leveraging pair-wise supervised knowledge. Besides, we integrate a discrete constraint to eliminate the accumulated errors in learning reliable hash codes and hash functions. We devise an alternating algorithm to efficiently solve the optimization problem. Extensive experiments on two image datasets demonstrate that our proposed method is superior to the state-of-the-art.
Shiyuan He, Guo Ye, Mengqiu Hu, Yang Yang, Fumin Shen, Heng Tao Shen, Xuelong Li
Learning Robust Graph Hashing for Efficient Similarity Search
Abstract
Unsupervised hashing has recently drawn much attention in efficient similarity search for its desirable advantages of low storage cost, fast search speed, and independence from semantic labels. Among the existing solutions, graph hashing makes a significant contribution as it can effectively preserve neighbourhood data similarities in binary codes via spectral analysis. However, existing graph hashing methods separate graph construction and hash learning into two independent processes, and this two-step design may lead to sub-optimal results. Furthermore, the features of data samples may unfortunately contain noise that makes the built graph less reliable. In this paper, we propose Robust Graph Hashing (RGH) to address these problems. RGH automatically learns a robust graph based on the self-representation of samples to alleviate the noise. Moreover, it seamlessly integrates graph construction and hash learning into a unified learning framework. The learning process ensures that an optimal graph is constructed for the subsequent hash learning, and simultaneously that the hash codes preserve the similarities of data samples well. An effective optimization method is devised to iteratively solve the formulated problem. Experimental results on publicly available image datasets validate the superior performance of RGH compared with several state-of-the-art hashing methods.
Luyao Liu, Lei Zhu, Zhihui Li
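For context, the one-step spectral pipeline that classical graph hashing performs — and that RGH replaces with joint graph and hash learning — looks roughly like this. Dense affinities keep the sketch short but limit it to small n:

    import numpy as np

    def spectral_hash_codes(X, n_bits, sigma=1.0):
        # Build a Gaussian affinity graph, take eigenvectors of the normalized
        # Laplacian with the smallest non-trivial eigenvalues, binarize by sign.
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-sq / (2 * sigma ** 2))
        d = W.sum(1)
        L = np.eye(len(X)) - W / np.sqrt(np.outer(d, d))   # normalized Laplacian
        vals, vecs = np.linalg.eigh(L)
        Y = vecs[:, 1:n_bits + 1]           # skip the trivial eigenvector
        return (Y >= 0).astype(np.uint8)    # sign binarization -> binary codes

    X = np.random.default_rng(0).standard_normal((60, 8))
    print(spectral_hash_codes(X, n_bits=4).shape)   # (60, 4)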

Data Mining

Frontmatter
Provenance-Based Rumor Detection
Abstract
With the advance of social media networks, people are sharing content at an unprecedented scale. This makes social networks such as microblogs an ideal place for spreading rumors. Although different types of information are available in a post on social media, traditional approaches to rumor detection leverage only the text of the post, which limits their detection accuracy. In this paper, we propose a provenance-aware approach based on recurrent neural networks that combines provenance information with the text of the post itself to improve the accuracy of rumor detection. Experimental results on a real-world dataset show that our technique is able to outperform state-of-the-art approaches in rumor detection.
Chi Thang Duong, Quoc Viet Hung Nguyen, Sen Wang, Bela Stantic
An Embedded Feature Selection Framework for Hybrid Data
Abstract
Feature selection in inductive supervised learning is the process of selecting a subset of features relevant to the target concept and removing irrelevant and redundant features. The majority of feature selection methods developed in the last decades can deal with only numerical or categorical features. An exception is Recursive Feature Elimination under the clinical kernel function, an embedded feature selection method; however, it suffers from low classification performance. In this work, we propose several embedded feature selection methods which are capable of dealing with hybrid balanced and hybrid imbalanced data sets. In an experimental evaluation on five UCI Machine Learning Repository data sets, we demonstrate the dominance and effectiveness of the proposed methods in terms of dimensionality reduction and classification performance.
Forough Rezaei Boroujeni, Bela Stantic, Sen Wang
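The Recursive Feature Elimination baseline the abstract references is available in scikit-learn; a minimal run on a numeric dataset follows. Note that the paper's setting is hybrid data under a clinical kernel, which this off-the-shelf call does not cover:

    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    X, y = load_breast_cancer(return_X_y=True)     # numeric-only stand-in dataset
    # a linear SVM exposes coef_, which RFE uses to rank and drop features
    selector = RFE(SVC(kernel="linear"), n_features_to_select=10).fit(X, y)
    print(selector.support_)                       # mask of the 10 retained features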
A New Data Mining Scheme for Analysis of Big Brain Signal Data
Abstract
Analysis and processing of brain signal data (e.g., electroencephalogram (EEG) data) is a significant challenge in the medical data mining community due to its massive size and dynamic nature. The most crucial part of EEG data analysis is to discover hidden knowledge from a large volume of data through pattern mining for efficient analysis. This study focuses on discovering representative patterns from the data of each channel to recover useful information while reducing the data size. In this paper, a novel algorithm based on principal component analysis (PCA) is developed to accurately and efficiently extract a pattern from the vast amount of EEG signal data. This study adopts PCA to explore the sequential pattern of each EEG channel's data, as PCA is a dominant tool for finding patterns. In order to represent the distribution of a pattern, the most significant statistical features (e.g., mean, standard deviation) are computed from the extracted pattern. Then, aggregating all of the features extracted from each of the patterns in a subject, a feature vector set is created and fed into random forest (RF) and random tree (RT) classification models individually for classifying different categories of the signals. The proposed methodology is tested on two benchmark EEG datasets from BCI Competition III: dataset V and dataset IVa. In order to further evaluate performance, the proposed scheme is compared with some recently reported algorithms that used the same datasets. The experimental results demonstrate that the proposed methodology with RF achieves higher performance than RT and also the recently reported methods. The present study suggests the merits and feasibility of applying the proposed method in medical data mining for efficient analysis of any biomedical signal data.
Siuly Siuly, Roozbeh Zarei, Hua Wang, Yanchun Zhang
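One plausible reading of the pipeline above, as a hedged sketch: per channel, windows of the signal are stacked and the first principal component serves as the representative pattern; simple statistics of that pattern, concatenated over channels, feed a random forest. The window length, choice of statistics, and data here are illustrative assumptions, not the paper's exact procedure:

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.ensemble import RandomForestClassifier

    def channel_pattern(signal, win=64):
        # Slice the channel into fixed-length windows and take the first
        # principal component as the dominant repeating shape.
        n = len(signal) // win
        segments = signal[:n * win].reshape(n, win)
        return PCA(n_components=1).fit(segments).components_[0]

    def feature_vector(trial):
        # Summarize each channel's extracted pattern with simple statistics.
        stats = []
        for ch in trial.T:                     # trial: (n_samples, n_channels)
            p = channel_pattern(ch)
            stats += [p.mean(), p.std(), p.min(), p.max()]
        return stats

    rng = np.random.default_rng(0)
    X = [feature_vector(rng.standard_normal((512, 4))) for _ in range(40)]
    y = rng.integers(0, 2, size=40)
    clf = RandomForestClassifier(n_estimators=100).fit(X, y)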
Extracting Keyphrases Using Heterogeneous Word Relations
Abstract
Extracting keyphrases from documents to provide a quick and insightful summary is an interesting and important task on which much research effort has been spent. Most existing methods can be categorized as co-occurrence-based, statistics-based, or semantics-based. Co-occurrence-based methods do not take the various word relations beyond co-occurrence into full consideration; statistics-based methods inevitably introduce unrelated noise due to the inclusion of external text corpora; and semantics-based methods depend heavily on the semantic meanings of words. In this paper, we propose a novel graph-based approach to extract keyphrases by considering heterogeneous latent word relations, namely co-occurrence and semantics. The underlying random walk model behind the proposed approach is made possible and reasonable by exploiting nearest-neighbor documents. Extensive experiments over real data show that our method outperforms the state-of-the-art methods.
Wei Shi, Zheng Liu, Weiguo Zheng, Jeffrey Xu Yu
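The random walk machinery the abstract refers to is in the TextRank family; a bare-bones co-occurrence-only version follows. The paper's method additionally folds in semantic relations and nearest-neighbor documents, which this sketch omits:

    from collections import defaultdict

    def textrank(words, window=3, d=0.85, iters=30):
        # Score words by a random walk over a co-occurrence graph.
        graph = defaultdict(set)
        for i in range(len(words)):
            for j in range(i + 1, min(i + window, len(words))):
                if words[i] != words[j]:
                    graph[words[i]].add(words[j])
                    graph[words[j]].add(words[i])
        score = {w: 1.0 for w in graph}
        for _ in range(iters):
            score = {w: (1 - d) + d * sum(score[v] / len(graph[v])
                                          for v in graph[w])
                     for w in graph}
        return sorted(score, key=score.get, reverse=True)

    tokens = "spatial query processing exploits spatial index structures".split()
    print(textrank(tokens)[:3])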
Mining High-Dimensional CyTOF Data: Concurrent Gating, Outlier Removal, and Dimension Reduction
Abstract
Cytometry is a powerful tool in the clinical diagnosis of health disorders, in particular immunodeficiency diseases and acute leukemia. Recent technological advancements have enabled up to 100 measurements to be taken simultaneously on each cell, thus generating high-throughput, high-dimensional datasets. Current analysis, relying on manual segmentation of cell populations (gating) on sequential low-dimensional projections of the data, is subjective, time-consuming and error-prone. It is also known that these multidimensional cytometric data typically exhibit non-normal features, including asymmetry, multimodality, and heavy tails. This presents a great challenge to traditional clustering methods, which are typically based on symmetric distributions.
In recent years, non-normal distributions have received increasing interest in the statistics literature. In particular, finite mixtures of skew distributions have emerged as a promising alternative to the traditional normal mixture modelling. However, these models are not well suited to high-dimensional settings.
This paper describes a flexible statistical approach designed to perform unsupervised clustering, dimension reduction, and outlier removal for cytometric data at the same time. The approach is based on finite mixtures of multivariate skew normal factor analyzers (SkewFA) with threshold pruning. The model can be fitted by maximum likelihood (ML) via an expectation-maximization (EM) algorithm. An application to a large CyTOF dataset is presented to demonstrate the usefulness of the SkewFA model and to illustrate its effectiveness relative to other algorithms.
Sharon X. Lee

AI for Big Data

Frontmatter
Locality-Constrained Transfer Coding for Heterogeneous Domain Adaptation
Abstract
Currently, most widely used databases are label-based: people organize their data with corresponding labels, e.g., class information, keywords and descriptions, for the convenience of indexing and retrieval. However, labels for the data from a novel application are usually not available, and labeling by hand is very expensive. To address this, we propose a novel approach based on transfer learning. Specifically, we aim at tackling heterogeneous domain adaptation (HDA), a crucial topic in transfer learning. Two inevitable issues, feature discrepancy and distribution divergence, stand in the way of HDA. Due to the significant challenges of HDA, however, previous work commonly focuses on handling one of them and neglects the other. Here we propose locality-constrained transfer coding (LCTC) to simultaneously alleviate the feature discrepancy and mitigate the distribution divergence. Our method is powered by two tactics: feature alignment and distribution alignment. The former learns new transferable feature representations by shared-dictionary coding, and the latter closes the distribution gaps in the new feature space. By formulating the problem as a unified objective and optimizing it in an iterative fashion, the two tactics reinforce each other and the two domains are drawn closer under the new representations. Extensive experiments on image classification and text categorization verify the superiority of our method against several state-of-the-art approaches.
Jingjing Li, Ke Lu, Lei Zhu, Zhihui Li
Training Deep Ranking Model with Weak Relevance Labels
Abstract
Deep neural networks have already achieved great success in a number of fields, for example computer vision, natural language processing, and speech recognition. However, such advances have not yet been observed in information retrieval (IR) tasks such as ad-hoc retrieval. A potential explanation is that training a document ranker for a particular IR task usually needs large amounts of relevance labels describing the relationship between queries and documents, and this kind of relevance judgment is usually very expensive to obtain. In this paper, we propose to train deep ranking models with weak relevance labels generated by click models from real users' click behavior. We investigate the effectiveness of weak relevance labels generated by several major click models, such as DBN, RCM, PSCM, TCM, and UBM. The experimental results indicate that ranking models trained with weak relevance labels are able to utilize large-scale behavior data and achieve performance similar to that of ranking models trained on relevance labels from external assessors, which are supposed to be more accurate. This preliminary finding encourages us to develop deep ranking models with weakly supervised data.
Cheng Luo, Yukun Zheng, Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma
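As a toy illustration of turning click logs into weak relevance labels: weight clicks by an inverse examination propensity per rank. Real click models such as DBN or UBM estimate these propensities (and much more) from the log itself; here they are assumed given, so this is only the flavor of the approach, not any of the models named above:

    from collections import defaultdict

    def weak_labels(click_log, prop):
        # Estimate relevance of a (query, doc) pair as clicks weighted by the
        # inverse probability that the user examined the rank it was shown at.
        num, den = defaultdict(float), defaultdict(float)
        for query, doc, rank, clicked in click_log:
            den[(query, doc)] += 1.0
            num[(query, doc)] += clicked / prop[rank]   # inverse-propensity weight
        return {k: num[k] / den[k] for k in den}

    log = [("q1", "d1", 0, 1), ("q1", "d2", 1, 0), ("q1", "d2", 0, 1)]
    print(weak_labels(log, prop=[1.0, 0.5, 0.33]))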
A Deep Approach for Multi-modal User Attribute Modeling
Abstract
With the explosive growth of user-generated content (e.g., texts, images and videos) on social networks, it is of great significance to analyze and extract people's interests from the massive social media data, thus providing more accurate personalized recommendations and services. In this paper, we propose a novel multi-modal deep learning algorithm for user profiling, dubbed the multi-modal User Attribute Model (mmUAM), which explores the intrinsic semantic correlations across different modalities. Our proposed model is based on the Poisson Gamma Belief Network (PGBN), a deep topic model for count data in documents. By improving PGBN, we succeed in learning a shared representation between texts and images in order to obtain textual and visual attributes for users. To evaluate the effectiveness of our proposed method, we collect a real dataset from Sina Weibo. Experimental results demonstrate that the proposed algorithm achieves encouraging performance compared with several state-of-the-art methods.
Xiu Huang, Zihao Yang, Yang Yang, Fumin Shen, Ning Xie, Heng Tao Shen

Potpourri

Frontmatter
Reversible Fragile Watermarking for Fine-Grained Tamper Localization in Spatial Data
Abstract
Spatial data has been widely used in many areas. Meanwhile, the increasingly simple data access and manipulation methods in spatial-data-related applications make it important to verify data truthfulness. In this paper, we propose a reversible fragile watermarking scheme for the content authentication of spatial data. The proposed scheme embeds (detects) four kinds of watermarks, i.e., group, object, vertex-local and vertex-global watermarks, into (from) each object (vertex) by altering the digits of the coordinates after the LSD (Least Significant Digit). According to the detection results for these four kinds of watermarks, the scheme can not only locate tampering, but also recognize the type of modification at both the object and vertex levels. In addition, the proposed scheme is reversible; that is, the original data can be restored from the watermarked data. This reversibility makes the scheme suitable for applications which require high-precision data.
Hai Lan, Yuwei Peng
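The reversible ingredient — writing a watermark digit into coordinate digits below the stated precision (after the LSD) and truncating it away on restore — can be shown in a few lines. This illustrates only that one trick, not the scheme's four watermark kinds or its tamper localization; decimal arithmetic avoids float rounding surprises, and non-negative coordinates are assumed for brevity:

    from decimal import Decimal

    def embed(coord, wm_digit, prec=6):
        # Quantize to the stated precision, then append one watermark digit
        # one place below the least significant digit.
        c = Decimal(coord).quantize(Decimal(1).scaleb(-prec))
        return c + Decimal(wm_digit).scaleb(-(prec + 1))

    def detect_and_restore(marked, prec=6):
        # Truncating back to the stated precision restores the original exactly.
        original = marked.quantize(Decimal(1).scaleb(-prec), rounding="ROUND_DOWN")
        wm_digit = int((marked - original).scaleb(prec + 1))
        return wm_digit, original

    marked = embed("153.021375", wm_digit=7)      # pass coordinates as strings
    print(marked, detect_and_restore(marked))     # 153.0213757 (7, 153.021375)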
Jointly Learning Attentions with Semantic Cross-Modal Correlation for Visual Question Answering
Abstract
Visual Question Answering (VQA) has emerged as a prominent multi-discipline research problem in artificial intelligence. A number of recent studies propose attention mechanisms such as visual attention ("where to look") or question attention ("what words to listen to"), which have proved effective for VQA. However, they focus on modeling the prediction error but ignore the semantic correlation between image attention and question attention, which inevitably results in suboptimal attentions. In this paper, we argue that in addition to modeling visual and question attentions, it is equally important to model their semantic correlation, so as to learn them jointly and to facilitate their joint representation learning for VQA. We propose a novel end-to-end model that jointly learns attentions with semantic cross-modal correlation for efficiently solving the VQA problem. Specifically, we propose a multi-modal embedding that maps the visual and question attentions into a joint space to guarantee their semantic consistency. Experimental results on the benchmark datasets demonstrate that our model outperforms several state-of-the-art techniques for VQA.
Liangfu Cao, Lianli Gao, Jingkuan Song, Xing Xu, Heng Tao Shen
Deep Semantic Indexing Using Convolutional Localization Network with Region-Based Visual Attention for Image Database
Abstract
In this paper, we introduce a novel deep semantic indexing method, a.k.a. captioning, for image databases. Our method automatically generates a natural-language caption describing an image, to serve as a semantic reference for indexing the image. Specifically, we use a convolutional localization network to generate a pool of region proposals from an image, and then leverage a visual attention mechanism to sequentially generate the meaningful words of the caption. Compared with previous methods, our approach efficiently generates compact captions, which guarantees a higher level of semantic indexing for image databases. We evaluate our approach on two widely used benchmark datasets, Flickr30K and MS COCO. Experimental results across various evaluation metrics show the superiority of our approach compared with other visual-attention-based approaches.
Mingxing Zhang, Yang Yang, Hanwang Zhang, Yanli Ji, Ning Xie, Heng Tao Shen

Demo Papers

Frontmatter
Real-Time Popularity Prediction on Instagram
Abstract
Social network services have become part of modern daily life. Despite the explosive growth of social media, people only pay attention to a small fraction of posts. Therefore, predicting the popularity of a post in a social network becomes an important service and can benefit a series of important applications, such as advertisement delivery, load balancing and personalized recommendation. In this demonstration, we present a real-time popularity prediction system based on user feedback, e.g., the count of likes. In the proposed system, we develop effective algorithms which utilize the temporal growth of user feedback to predict popularity in a real-time manner. Moreover, the system is easily adapted to a variety of social network platforms. Using datasets collected from Instagram, we show that the proposed system can effectively predict the popularity of a post at an early stage.
Deming Chu, Zhitao Shen, Yu Zhang, Shiyu Yang, Xuemin Lin
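The demonstration abstract does not disclose its algorithms; as a generic illustration of feedback-growth extrapolation, one can fit early-stage like counts to a log-linear growth curve and extrapolate. The functional form and the data below are assumptions for illustration only:

    import numpy as np

    def predict_likes(t_obs, likes_obs, t_target):
        # Fit likes ~ a * log(1 + t) + b on early observations, then extrapolate.
        X = np.log1p(t_obs)
        a, b = np.polyfit(X, likes_obs, 1)
        return a * np.log1p(t_target) + b

    t = np.array([5, 10, 20, 30, 60])            # minutes since posting
    likes = np.array([12, 21, 30, 37, 48])
    print(predict_likes(t, likes, t_target=24 * 60))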
DSKQ: A System for Efficient Processing of Diversified Spatial-Keyword Query
Abstract
With the rapid development of mobile portable devices and location positioning technologies, massive amounts of geo-textual data are being generated by a huge number of web users on various social platforms, such as Facebook and Twitter. Meanwhile, spatial-textual objects that represent points of interest (POIs, e.g., shops, cinemas, hotels or restaurants) are becoming increasingly pervasive. Consequently, how to retrieve a set of objects that best matches a user's spatial keyword query (SKQ) has been intensively studied by research communities and commercial organisations. Existing works focus only on returning the nearest matching objects, although we observe that many real-life applications now use diversification to enhance the quality of query results; thus, existing methods fail to solve the diversified SKQ problem efficiently. In this demonstration, we introduce DSKQ, an in-memory diversified spatial-keyword query system which considers both the textual relevance and the spatial diversity of results, processed on a road network. We present a prototype of DSKQ which provides users with an application-based interface to explore the diversified spatial-keyword query system.
Shanqing Jiang, Chengyuan Zhang, Ying Zhang, Wenjie Zhang, Xuemin Lin, Muhammad Aamir Cheema, Xiaoyang Wang
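Diversified retrieval of the kind DSKQ performs is often approximated greedily; a generic MMR-style sketch follows. The actual system scores textual relevance and road-network diversity, for which the score and dist callables here merely stand in:

    def diversify(candidates, score, dist, k, lam=0.5):
        # Greedy MMR-style selection: repeatedly take the candidate with the
        # best trade-off between relevance and distance to results chosen so far.
        chosen, pool = [], list(candidates)
        while pool and len(chosen) < k:
            best = max(pool, key=lambda c: lam * score(c) +
                       (1 - lam) * min((dist(c, s) for s in chosen), default=1.0))
            chosen.append(best)
            pool.remove(best)
        return chosen

    pois = [(0, 0), (0.1, 0), (5, 5), (9, 1)]
    picked = diversify(pois, score=lambda p: -abs(p[0] - 1),
                       dist=lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1]), k=2)
    print(picked)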
Backmatter
Metadata
Title
Databases Theory and Applications
Editors
Zi Huang
Xiaokui Xiao
Xin Cao
Copyright Year
2017
Electronic ISBN
978-3-319-68155-9
Print ISBN
978-3-319-68154-2
DOI
https://doi.org/10.1007/978-3-319-68155-9
