About this book

This book constitutes the refereed proceedings of the 29th Australasian Database Conference, ADC 2018, held in Gold Coast, QLD, Australia, in May 2018.

The 23 full papers plus 6 short papers presented together with 3 demo papers were carefully reviewed and selected from 53 submissions. The Australasian Database Conference is an annual international forum for sharing the latest research advancements and novel applications of database systems, data-driven applications, and data analytics between researchers and practitioners from around the globe, particularly Australia and New Zealand.

Table of Contents

Frontmatter

Full Research Papers: Database and Applications

Adaptive Access Path Selection for Hardware-Accelerated DRAM Loads

For modern main memory database systems, the memory bus is the main bottleneck. Specialized hardware components of large NUMA systems, such as HPE’s GRU, make it possible to offload memory transfers. In some cases, this improves the throughput by 30%, but other scenarios suffer from reduced performance. We show which factors influence this tradeoff. Based on our experiments, we present an adaptive prediction model that supports the DBMS in deciding whether to utilize these components. In addition, we evaluate non-coherent memory access as an additional access method and discuss its benefits and shortcomings.

Markus Dreseler, Timo Gasda, Jan Kossmann, Matthias Uflacker, Hasso Plattner

Privacy Preservation for Trajectory Data Publishing by Look-Up Table Generalization

With the proliferation of location-aware devices, it is easy to collect a person's trajectory, which can be represented as a sequence of visited locations with timestamps. For some applications, such as traffic management and location-based advertising, the trajectory data may need to be published together with other private information. However, revealing a user's private trajectory and sensitive information raises privacy concerns, especially when an adversary has background knowledge about the target user, i.e., partial trajectory information. In general, data transformation is needed to ensure privacy preservation before the data are released. Not only must privacy be preserved, but data quality must also be addressed, i.e., the impact of the transformation on data quality should be minimized. The LKC-privacy model is a well-known model for anonymizing trajectory data that are published with sensitive information. However, computing the optimal LKC-privacy solution on trajectory data with the brute-force (BF) algorithm and the full-domain generalization technique is highly time-consuming. In this paper, we propose a look-up table brute-force (LT-BF) algorithm that preserves privacy and maintains data quality under the LKC-privacy model in scenarios where the generalization technique is applied to anonymize the trajectory data efficiently. We then evaluate the proposed algorithm experimentally. The results demonstrate that it not only returns the same optimal solution as the BF algorithm but is also highly efficient.

Nattapon Harnsamut, Juggapong Natwichai, Surapon Riyana

Trajectory Set Similarity Measure: An EMD-Based Approach

To address the trajectory sparsity issue concerning Origin-Destination (OD) pairs, most existing studies reconstruct trajectories by concatenating sub-trajectories along specific paths and filling the gaps with conceptual trajectories. However, none of them validates the robustness of the reconstructed trajectories. Intuitively, reconstructed trajectories are of higher quality if they are more similar to the exact ones that traverse directly from the origin to the destination, and this similarity indicates the effectiveness of the corresponding trajectory augmentation algorithms. Nevertheless, to our knowledge, no existing work has studied the similarity of trajectory sets. Motivated by this, we propose a novel similarity measure to evaluate the similarity between two sets of trajectories, borrowing the idea of the Earth Mover’s Distance. Empirical studies on a large real trajectory dataset show that our proposed similarity measure is effective and robust.

Dan He, Boyu Ruan, Bolong Zheng, Xiaofang Zhou

Histogram Construction for Difference Analysis of Spatio-Temporal Data on Array DBMS

To analyze scientific data, there are frequent demands for comparing multiple datasets on the same subject to detect any differences between them. For instance, comparing observation datasets of a certain spatial area at different times, or comparing spatial simulation datasets generated with different parameters, is considered important. Therefore, this paper proposes a difference operator for spatio-temporal data warehouses, based on the notion of histograms in the database research area. We propose difference histogram construction methods, which are used for effective and efficient data visualization in difference analysis. In addition, we implement the proposed algorithms on SciDB, an array DBMS that is well suited to processing and managing scientific data. Experiments are conducted using mass evacuation simulation data for tsunami disasters, and the effectiveness and efficiency of our methods are verified.

Jing Zhao, Yoshiharu Ishikawa, Chuan Xiao, Kento Sugiura

Location-Aware Group Preference Queries in Social-Networks

With the recent advances in location-acquisition techniques and GPS-embedded mobile devices, traditional social networks such as Twitter and Facebook have acquired the dimension of location. This in turn has facilitated the generation of geo-tagged data (e.g., check-ins) at unprecedented scale and has substantially enhanced the user experience in location-based services associated with social networks. Typical location-based social networks allow people to check in at a location of interest using smart devices; the check-in is then published on the social network, and this information can be exploited for recommendation. In this paper, we propose a new type of query called the Geo-Social Group preference Top-k (SG-$Top_k$) query. For a group of users, an SG-$Top_k$ query returns the top-k places that are most likely to satisfy the needs of the users based on spatial and social relevance. Finally, we conduct an exhaustive evaluation of the proposed schemes for answering the query and demonstrate the effectiveness of the proposed approaches.

Ammar Sohail, Arif Hidayat, Muhammad Aamir Cheema, David Taniar

Social-Textual Query Processing on Graph Database Systems

Graph database systems are increasingly being used to store and query large-scale property graphs with complex relationships. Graph data, particularly data generated from social networks, generally has text associated with the graph. Although graph systems provide support for efficient graph-based queries, there have not been comprehensive studies on how other dimensions, such as text, stored within a graph can work well together with graph traversals. In this paper we focus on a query that combines graph traversal and text search in a graph database system and ranks users by a combination of their social distance and the relevance of their text description to the query keyword. Our proposed algorithm leverages graph partitioning techniques to speed up query processing along both dimensions. We conduct experiments on large real-world graph datasets and show the benefits of our algorithm compared to several baseline schemes.

Oshini Goonetilleke, Timos Sellis, Xiuzhen Zhang

Using SIMD Instructions to Accelerate Sequence Similarity Searches Inside a Database System

Database systems are optimised for managing large data sets, but they face difficulties making an impact on the life sciences, where the typical use cases involve much more complex analytical algorithms than those found in traditional OLTP or OLAP scenarios. Although many database management systems (DBMS) are extensible via stored procedures to implement transactions or complex algorithms, these stored procedures are usually unable to leverage the inbuilt optimizations provided by the query engine, so other optimization avenues must be explored. In this paper, we investigate how sequence alignment algorithms, one of the most common operations carried out on a bioinformatics or genomics database, can be efficiently implemented close to the data within an extensible database system. We investigate the use of single instruction, multiple data (SIMD) extensions to accelerate logic inside a DBMS. We also compare it to implementations of the same logic outside the DBMS. Our implementation of an SIMD-accelerated Smith-Waterman sequence-alignment algorithm shows an order of magnitude improvement over a non-accelerated version while running inside a DBMS. Our SIMD-accelerated version also performs with little to no overhead inside the DBMS compared to the same logic running outside the DBMS.

Sidath Randeni Kadupitige, Uwe Röhm
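
For readers unfamiliar with the Smith-Waterman algorithm named in the abstract above, the following is a minimal scalar sketch of its scoring recurrence. It is not the authors' SIMD implementation and does not run inside a DBMS; the function name, the linear gap penalty, and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for illustration.

```python
def smith_waterman(a: str, b: str, match: int = 2,
                   mismatch: int = -1, gap: int = -1) -> int:
    """Return the best local alignment score between a and b.

    Plain (non-SIMD) dynamic programming with a linear gap penalty.
    The scoring parameters are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1].
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below zero.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(smith_waterman("GGACGTGG", "TTACGTTT"))
```

Each cell depends on its left, upper, and upper-left neighbours, which is why vectorized implementations typically process cells along anti-diagonals or in striped layouts rather than row by row.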

Renovating Database Applications with DBAutoAwesome

Renovating a database application is the act of significantly reprogramming the application to meet new needs, extend functionality, or re-design to foster maintainability. It can be costly to manually renovate a database application, so techniques for automating the renovation are needed. Previous research in renovation has focused on methods to improve performance, such as autonomic database research to automatically tune a DBMS or manage indexes. But there has been little previous research on how to improve functionality. There are several ways in which the functionality can be improved, such as interfaces to other tools (e.g., data mining with Weka), content management system integration (e.g., WordPress plugins), an enhanced set of forms and scripts to query and manage the data, and database mediation and migration scripts. We focus on the final category in this paper: management of the data. We propose an approach, which we call DBAutoAwesome, that adopts Google’s Auto Awesome philosophy: automatically improve an existing artifact and let the user (developer) decide whether to keep and use the improved artifact. The DBAutoAwesome approach ingests a database application to produce an enhanced application. In this paper we describe how DBAutoAwesome enhances data modification and query forms.

Jonathan Adams, Curtis E. Dyreson

Full Research Papers: Data Mining and Applications

Uncovering Attribute-Driven Active Intimate Communities

Most existing studies in community detection either focus on the common attributes of the nodes (users) or rely only on the topological links of the social network graph. However, the bulk of the literature ignores the interaction strength among the users in the retrieved communities. As a result, many members of the detected communities do not interact frequently with each other. This inactivity creates a problem for online advertisers, as they require the community to be highly interactive to efficiently diffuse marketing information. In this paper, we study the problem of detecting attribute-driven active intimate communities, that is, for a given input query consisting of a set of attributes, we want to find densely connected communities in which community members actively participate as well as have strong interaction (intimacy) with respect to the given query attributes. We design a novel attribute relevance intimacy score function for the detected communities and establish its desirable properties. To this end, we use an index-based solution to efficiently discover active intimate communities. Extensive experiments on real data sets show the effectiveness and performance of our proposed method.

Md Musfique Anwar, Chengfei Liu, Jianxin Li

Customer Churn Prediction in Superannuation: A Sequential Pattern Mining Approach

The role of churn modelling is to maximize the value of marketing dollars spent and minimize the attrition of valuable customers. Though churn prediction is a common classification task, traditional approaches cannot be employed directly due to the unique issues inherent within the wealth management industry. In this paper we address the issue of unseen churn in superannuation, whereby customer accounts become dormant following the discontinuation of compulsory employer contributions, and suggest solutions to the problem of scarce customer engagement data. To address these issues, this paper proposes a new approach for churn prediction and its application in the superannuation industry. We use the extreme gradient boosting algorithm coupled with contrast sequential pattern mining to extract behaviors preceding a churn event. The results demonstrate a significant lift in the performance of prediction models when pattern features are used in combination with demographic and account features.

Ben Culbert, Bin Fu, James Brownlow, Charles Chu, Qinxue Meng, Guandong Xu

Automated Underwriting in Life Insurance: Predictions and Optimisation

Underwriting is an important stage in the life insurance process and is concerned with accepting individuals into an insurance fund and on what terms. It is a tedious and labour-intensive process for both the applicant and the underwriting team. An applicant must fill out a large survey containing thousands of questions about their life. The underwriting team must then process this application and assess the risks posed by the applicant and offer them insurance products as a result. Our work implements and evaluates classical data mining techniques to help automate some aspects of the process to ease the burden on the underwriting team as well as optimise the survey to improve the applicant experience. Logistic Regression, XGBoost and Recursive Feature Elimination are proposed as techniques for the prediction of underwriting outcomes. We conduct experiments on a dataset provided by a leading Australian life insurer and show that our early-stage results are promising and serve as a foundation for further work in this space.

Rhys Biddle, Shaowu Liu, Peter Tilocca, Guandong Xu

Maintaining Boolean Top-K Spatial Temporal Results in Publish-Subscribe Systems

Nowadays many devices and applications in social networks and location-based services are producing, storing and using the description, location and occurrence time of objects. Given a massive number of boolean top-k spatial-temporal queries and spatial-textual message streams, in this paper we study the problem of continuously updating the top-k messages with the highest ranks, each of which contains all the requested keywords, where the rank of a message is calculated from its location and freshness. Because the ranks of existing top-k results decrease over time and new messages keep arriving, the best results must be continuously computed and maintained. To the best of our knowledge, there is no prior work that can exactly solve this problem. We propose two indexing and matching methods, then conduct an experimental evaluation to show the impact of parameters and analyse the models.

Maryam Ghafouri, Xiang Wang, Long Yuan, Ying Zhang, Xuemin Lin

Interdependent Model for Point-of-Interest Recommendation via Social Networks

Point-of-Interest (POI) recommendation is an important way to help people discover attractive places. POI recommendation approaches are usually based on collaborative filtering methods, whose performance is largely limited by the extreme scarcity of POI check-ins and a lack of rich contexts, and also by the assumption that locations are independent. Recent strategies have been proposed to capture the relationship between locations based on statistical analysis, thereby estimating the similarity between locations purely based on the visiting frequencies of multiple users. However, implicit interactions with other linked locations are overlooked, which leads to the discovery of incomplete information. This paper proposes an interdependent item-based model for POI recommender systems, which considers both the intra-similarity (i.e., co-occurrence of locations) and inter-similarity (i.e., dependency of locations via links) between locations, based on the TF-IDF conversion of check-in times. Geographic information, such as the longitude and latitude of locations, is incorporated into the interdependent model. Substantial experiments on three social network data sets verify that POI recommendation built with our proposed interdependent model achieves a significant performance improvement compared to state-of-the-art techniques.

Jake Hashim-Jones, Can Wang, Md. Saiful Islam, Bela Stantic

Child Abuse and Domestic Abuse: Content and Feature Analysis from Social Media Disclosures

Due to the increasing popularity of social media, people have started sharing their thoughts and opinions in the form of textual posts. People now tend to disclose even socially tabooed topics such as Child Abuse (CA) and Domestic Abuse (DA) in order to receive the desired response and social support in return. The increasing volume of abuse-related posts shared on social media is of great interest to public health sectors and family welfare organizations for monitoring public health and promoting support services. However, due to the large volume, high velocity and huge variety of context and content of user-generated data, it is difficult to separate the different kinds of abuse-related posts (CA and DA) from the general posts that flood the web. Hence, this paper aims to discover and differentiate the characteristics of CA and DA posts among the massive volume of user-generated posts, together with their underlying context. Various features, such as psycholinguistic, textual and sentiment features, are analyzed, and machine learning techniques are trained to assess the predictive power of the extracted features. The resulting model achieves greater predictive power and high accuracy in classifying possible abuse-related posts among diverse user posts.

Sudha Subramani, Hua Wang, Md Rafiqul Islam, Anwaar Ulhaq, Manjula O’Connor

30 min-Ahead Gridded Solar Irradiance Forecasting Using Satellite Data

Solar irradiance forecasting is critical to balancing solar energy production and energy consumption in the electric grid; however, solar irradiance forecasting is dependent on meteorological conditions and, in particular, cloud cover, which are captured in satellite imagery. In this paper we present a method for short-term solar irradiance forecasting using gridded global horizontal irradiance (GHI) data estimated from satellite images. We use this data to first create a simple linear regression model with a single predictor variable. We then discuss various methods to extend and improve the model. We found that adding predictor variables and partitioning the data to create multiple models both reduced prediction errors under certain circumstances. However, both these techniques were outperformed by applying a data transformation before training the linear regression model.

Todd Taomae, Lipyeow Lim, Duane Stevens, Dora Nakafuji

An Efficient Framework for the Analysis of Big Brain Signals Data

Big Brain Signals Data (BBSD) analysis is one of the most difficult challenges in the biomedical signal processing field for modern treatment and health monitoring applications. BBSD analytics has recently been applied to aid the process of care delivery and disease exploration. The main purpose of this paper is to introduce a framework for analyzing BBSD in the form of EEG time series in order to identify abnormalities. The framework combines complex network and machine learning techniques. The proposed method is tested on an electroencephalogram (EEG) time series database, as electrodes implanted in the brain generate huge amounts of EEG time series data. The pilot study in this paper shows that the proposed methodology is capable of analyzing massive volumes of brain signal data and can also be used for handling other biomedical signal data in time series form (e.g., electrocardiogram (ECG) and electromyogram (EMG)). The main benefit of the proposed methodology is that it provides an effective way of analyzing the vast amount of BBSD generated from the brain, supporting better outcomes in patient care and helping technicians build intelligent decision-making systems.

Supriya, Siuly, Hua Wang, Yanchun Zhang

Full Research Papers: Theories and Methodologies

TSAUB: A Temporal-Sentiment-Aware User Behavior Model for Personalized Recommendation

Personalized recommender systems have become an essential means to help people discover attractive and interesting items. We find that, when buying an item, a user is influenced not only by her intrinsic interests and temporal contexts, but also by the crowd sentiment towards this item. Users tend to reject recommended items whose reviews are mostly negative. In light of this, we propose a temporal-sentiment-aware user behavior model (TSAUB) to learn personal interests, temporal contexts (i.e., temporal preferences of the public) and crowd sentiment from user review data. Based on the knowledge learnt by TSAUB, we design a temporal-sentiment-aware recommender system. To improve the training efficiency of TSAUB, we develop a distributed learning algorithm for model parameter estimation using the Spark framework. Extensive experiments have been performed on four Amazon datasets, and the results show that our recommender system significantly outperforms state-of-the-art methods by making more effective and efficient recommendations.

Qinyong Wang, Hongzhi Yin, Hao Wang, Zi Huang

Finding Maximal Stable Cores in Social Networks

Maximal Stable Cores are cohesive subgraphs of a social network that use both engagement and similarity to identify stable groups of users. The problem is, given a query user and a similarity threshold, to find all Maximal Stable Cores relative to the user. We propose a baseline algorithm and, as the problem is NP-hard, an improved heuristic algorithm that utilises linear-time k-core decomposition. Experiments show that when the two algorithms differ, the improved algorithm significantly outperforms the baseline.

Alexander Zhou, Fan Zhang, Long Yuan, Ying Zhang, Xuemin Lin
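
The k-core decomposition mentioned in the abstract above can be illustrated with a simple peeling procedure: repeatedly remove a vertex of minimum remaining degree and record the largest such degree seen so far as its core number. The sketch below is a plain illustration, not the authors' heuristic; the function name `core_numbers` is hypothetical, and this version is quadratic in the number of vertices, whereas the linear-time variant relies on bucketed degree lists.

```python
from collections import defaultdict


def core_numbers(edges):
    """Return {vertex: core number} for an undirected graph.

    Simple peeling: always remove a vertex of minimum remaining degree.
    This is O(n^2) for clarity; it is not the linear-time variant.
    """
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:  # ignore self-loops
            adj[u].add(v)
            adj[v].add(u)

    degree = {v: len(ns) for v, ns in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.__getitem__)
        k = max(k, degree[v])  # core numbers never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                degree[w] -= 1
    return core


if __name__ == "__main__":
    # A triangle (a, b, c) with a pendant vertex d attached to c.
    print(core_numbers([("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]))
```

The k-core for a given k is then the set of vertices whose core number is at least k.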

Feature Extraction for Smart Sensing Using Multi-perspectives Transformation

Air quality sensing systems, such as e-nose, are complex dynamic systems due to their sensitivity to electromagnetic interference, humidity, temperature, pressure and airflow. This yields a Multi-Dependency effect on the output signal. To address the Multi-Dependency effect, we propose a multi-dimensional signal transformation for feature extraction. Our idea is analogous to viewing one huge object from different angles and arriving at different perspectives. Every perspective is partially true, but the final picture can be inferred by combining all perspectives. We evaluated our method extensively on two data sets, including a publicly available e-nose dataset generated over a three-year period. Our results show higher performance in terms of accuracy, F-measure, and stability when compared to standard methods.

Sanad Al-Maskari, Ibrahim A. Ibrahim, Xue Li, Eimad Abusham, Abdulqader Almars

Finding Influential Nodes by a Fast Marginal Ranking Method

The problem of Influence Maximization (IM) aims to find a small set of k nodes (seed nodes) in a social network G that maximizes the expected number of influenced nodes. It has been proven to be #P-hard, and many approximation algorithms and heuristic algorithms have been proposed to solve this problem in polynomial time. Those algorithms, however, either trade effectiveness for practical efficiency or vice versa. In order to strike a good balance between effectiveness and efficiency, this paper introduces a novel ranking method to identify the influential nodes without computing their exact influence. In particular, our method consists of two phases, the influence ranking and the node selection. In the first phase, we rank the nodes’ influence based on the centrality of the network. In the second phase, we greedily pick the nodes of high rank as seeds by considering their marginal influence on the current seed set. Experiments on real-world datasets show that our method outperforms the state-of-the-art heuristic methods in effectiveness by 3% to 25%, and that it is faster than the approximate method by at least three orders of magnitude (e.g., the approximate method could not complete in 12 h even for a social network of |V| = 196,591 and |E| = 950,327, while our method completes in 100 s).

Yipeng Zhang, Ping Zhang, Zhifeng Bao, Zizhe Xie, Qizhi Liu, Bang Zhang

Maximizing Reverse k-Nearest Neighbors for Trajectories

In this paper, we address a popular query involving trajectories, namely, the Maximizing Reverse k-Nearest Neighbors for Trajectories (MaxRkNNT) query. Given a set of existing facility trajectories (e.g., bus routes), a set of user trajectories (e.g., daily commuting routes of users) and a set of query facility trajectories (e.g., proposed new bus routes), the MaxRkNNT query finds the proposed facility trajectory that maximizes the cardinality of reverse k-Nearest Neighbors (NNs) set for the query trajectories. A major challenge in solving this problem is to deal with complex computation of nearest neighbors (or similarities) with respect to multi-point queries and data objects. To address this problem, we first introduce a generic similarity measure between a query object and a data object that helps us to define the nearest neighbors according to user requirements. Then, we propose some pruning strategies that can quickly compute k-NNs (or top-k) facility trajectories for a given user trajectory. Finally, we propose a filter and refinement technique to compute the MaxRkNNT. Our experimental results show that our proposed approach significantly outperforms the baseline for both real and synthetic datasets.

Tamjid Al Rahat, Arif Arman, Mohammed Eunus Ali

Auto-CES: An Automatic Pruning Method Through Clustering Ensemble Selection

Ensemble learning is a machine learning approach where multiple learners are trained to solve a particular problem. Random Forest is an ensemble learning algorithm which comprises numerous decision trees and nominates a class through majority voting for classification and an averaging approach for regression. Prior research affirms that the learning time of the Random Forest algorithm increases linearly as the number of trees in the forest grows. This large number of decision trees in the Random Forest can cause certain challenges. Firstly, it can increase the model complexity, and secondly, it can negatively affect efficiency on large-scale datasets. Hence, ensemble pruning methods (e.g., Clustering Ensemble Selection (CES)) are devised to select a subset of decision trees out of the forest. The main challenge is that prior CES models require the number of clusters as input. To solve the problem, we devise an Automatic CES pruning model (Auto-CES) for Random Forest which can automatically find the proper number of clusters. Our proposed model is able to obtain an optimal subset of trees that can provide the same or even better effectiveness compared to the original set. Auto-CES has two components: clustering and selection. First, our algorithm utilizes a new clustering technique to group homogeneous trees. In the selection part, it takes both the accuracy and the diversity of the trees into consideration to choose the best tree. Extensive experiments are conducted on five datasets. The results show that our algorithm can perform the classification task more effectively than the state-of-the-art rivals.

Mojtaba Amiri Maskouni, Saeid Hosseini, Hadi Mohammadzadeh Abachi, Mohammadreza Kangavari, Xiaofang Zhou

DistClusTree: A Framework for Distributed Stream Clustering

In this paper, we investigate the problem of clustering distributed multidimensional data streams. We devise a distributed clustering framework DistClusTree that extends the centralized ClusTree approach. The main difficulty in distributed clustering is balancing communication cost and clustering quality. We tackle this in DistClusTree through combining spatial index summaries and online tracking for efficient local and global incremental clustering. We demonstrate through extensive experiments the efficacy of the framework in terms of communication cost and approximate clustering quality.

Zhinoos Razavi Hesabi, Timos Sellis, Kewen Liao

Short Research Papers

Mobile Application Based Heavy Vehicle Fatigue Compliance in Australian Operations

The Australian National Heavy Vehicle Regulator (NHVR) defines the rules for fatigue management in heavy haulage trucking operations. The rules place restrictions on total work and minimum rest hours, and are aimed at regulating the potential for fatigue risk amongst drivers. This paper presents a performance-based fatigue management system based on driver fatigue data stored in simple mobile databases and deployed via Android smart phones. The system, funded by WorkSafe Tasmania and entitled Logistics Fatigue Manager (LFM), was evaluated with a cohort of heavy haulage drivers in Australian forestry. The correlation between driver fatigue estimates and actual sleep hours (recorded using FitBits) is confirmed, and is also supported through driver interviews. The benefit is that management of fatigue risk could be more tailored to individual drivers, opening up efficiency gains across supply chains.

Luke Mirowski, Joel Scanlan

Statistical Discretization of Continuous Attributes Using Kolmogorov-Smirnov Test

Unsupervised discretization methods use simple rules to discretize continuous attributes with a low time complexity that depends mostly on the sorting procedure, whereas supervised discretization algorithms take the class labels into consideration to achieve high accuracy. The supervised discretization of continuous features faces two significant challenges. Firstly, noisy class labels reduce the effectiveness of discretization. Secondly, because supervised algorithms are computationally expensive on large-scale datasets, the time complexity is dominated by the discretization stage rather than the sorting procedure. To address these challenges, we devise a statistical unsupervised method named SUFDA. SUFDA aims to produce discrete intervals by decreasing the differential entropy of the normal distribution, with low time complexity and high accuracy. The results show that our unsupervised system achieves better effectiveness than other discretization baselines on large-scale datasets.

Hadi Mohammadzadeh Abachi, Saeid Hosseini, Mojtaba Amiri Maskouni, Mohammadreza Kangavari, Ngai-Man Cheung

Econometric Analysis of the Industrial Growth Determinants in Colombia

An econometric study is carried out using a panel data model with fixed effects to identify the determinants of industrial development in Colombia during the period 2005–2015. The database used in the study comes from the World Bank and the Colombian state. The determinants of industrial growth identified at the theoretical level that allow the enhancement of productive capacities to face foreign competition in Colombia are: innovation; networks of innovation and knowledge among companies and organizations; the interest rate; the capital-product ratio; the unit labor cost; and the exchange rate. The amount invested in scientific, technological and innovation activities by industrial group is the only variable that is not significant in the model.

Carolina Henao-Rodríguez, Jenny-Paola Lis-Gutiérrez, Mercedes Gaitán-Angulo, Luz Elena Malagón, Amelec Viloria

Parallelizing String Similarity Join Algorithms

A key operation in data cleaning and integration is the use of string similarity join (SSJ) algorithms to identify and remove duplicate or similar records within data sets. With the advent of big data, a natural question is how to parallelize SSJ algorithms. There is a large body of existing work on SSJ algorithms, and parallelizing each of them individually may not be the most feasible solution. In this paper, we propose a parallelization framework for string similarity joins that utilizes existing SSJ algorithms. Our framework partitions the data using a variety of partitioning strategies and then executes the SSJ algorithms on the partitions in parallel. Some of the partitioning strategies that we investigate trade accuracy for speed. We implemented and validated our framework on several SSJ algorithms and data sets. Our experiments show that our framework achieves a significant speedup with little loss in accuracy.

Ling-Chih Yao, Lipyeow Lim
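
The partition-then-join idea in the abstract above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the token-based Jaccard measure, the all-pairs `naive_ssj` stand-in for an existing SSJ algorithm, the hypothetical `partition_by_first_char` strategy, and the use of threads. For CPU-bound joins in Python a process pool would normally be used instead.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased, whitespace-separated tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def naive_ssj(records, threshold):
    """Stand-in for any existing SSJ algorithm: compare all pairs."""
    return {(a, b) for a, b in combinations(sorted(records), 2)
            if jaccard(a, b) >= threshold}


def partition_by_first_char(records):
    """Hypothetical lossy strategy: group records by their first character.

    Similar strings that start with different characters end up in
    different partitions and are missed, trading accuracy for speed.
    """
    parts = defaultdict(list)
    for r in records:
        parts[r[:1].lower()].append(r)
    return list(parts.values())


def parallel_ssj(records, threshold, ssj=naive_ssj,
                 partitioner=partition_by_first_char, workers=4):
    """Run an unmodified SSJ algorithm on each partition and merge results."""
    parts = partitioner(records)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda p: ssj(p, threshold), parts))
    matches = set()
    for r in results:
        matches |= r
    return matches


records = [
    "apple pie recipe", "apple pie recipes", "Apple pie recipe easy",
    "banana bread", "bread banana", "easy apple pie recipe",
]
full = naive_ssj(records, 0.7)          # join over the whole data set
partitioned = parallel_ssj(records, 0.7)  # join within partitions only
print(len(full), len(partitioned))
```

In this toy data set the partitioned join finds only a subset of the pairs found by the full join, because "easy apple pie recipe" lands in a different partition from the records it resembles. That is the kind of accuracy loss a lossy partitioning strategy accepts in exchange for smaller, independent work units.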

Exploring Human Mobility Patterns in Melbourne Using Social Media Data

Location-based social networks such as Swarm provide a rich source of information on urban functions and city dynamics. Users voluntarily check in at places they visit using a mobile application. Analysis of the data created by check-ins can give insight into users' mobility patterns. This study uses location-sharing data from Swarm to explore spatio-temporal and geo-temporal patterns within the city of Melbourne. Descriptive statistical analyses of the check-in data were performed using SPSS to reveal meaningful trends and to attain a deeper understanding of human mobility patterns in the city. The results showed that mobility patterns vary by gender and venue category. Furthermore, the patterns differ between days of the week and between times of the day, but are not necessarily influenced by weather.

Ravinder Singh, Yanchun Zhang, Hua Wang

Bootstrapping Uncertainty in Schema Covering

Schema covering is the process of representing large and complex schemas by easily comprehensible common objects. This task is done by identifying a set of common concepts from a repository, called the concept repository, and generating a cover that describes the schema in terms of these concepts. The traditional schema covering approach has two shortcomings: it does not model the uncertainty in the covering process, and it requires the user to state an ambiguity constraint, which is hard to define. We remedy these problems by incorporating a probabilistic model into schema covering to generate a probabilistic schema cover. The integrated probabilities not only enhance the coverage of the cover results but also eliminate the need to define the ambiguity parameter. Experiments on real datasets show the competitive performance of our approach.

Nguyen Thanh Toan, Phan Thanh Cong, Duong Chi Thang, Nguyen Quoc Viet Hung, Bela Stantic

Demo Papers

Frontmatter

TEXUS: Table Extraction System for PDF Documents

Tables are a rich and under-exploited source of structured data in otherwise unstructured documents. The extraction and understanding of tabular data is a challenging task that has attracted the attention of researchers from a range of disciplines such as information retrieval, machine learning and natural language processing. In this demonstration, we present an end-to-end table extraction and understanding system which takes a PDF file and automatically generates a set of XML and CSV files containing the extracted cells, rows and columns of tables, as well as a complete reading order analysis of the tables. Unlike many systems that work as black-box, ad hoc solutions, our system design incorporates an open, reusable and extensible architecture to support research into, and development of, table-processing systems. During the demo, users will see how our system gradually transforms a PDF document into a set of structured files through a series of processing modules, namely: locating, segmenting and function/structure analysis.

Roya Rastan, Hye-Young Paik, John Shepherd, Seung Hwan Ryu, Amin Beheshti

Visual Evaluation of SQL Plan Cache Algorithms

Caching optimized query plans reduces the time spent optimizing SQL queries at the cost of increased memory consumption. Different cache eviction strategies, such as LRU(-K) or GD(F)S, aim to increase the cache hit ratio or to reduce the overall cost of cache misses. A comprehensive study on how different workloads and tuning parameters influence these strategies is not yet publicly available. We propose a tool that enables both such research and performance tuning for DBAs by visualizing the effects of changed parameters in real time.

Jan Kossmann, Markus Dreseler, Timo Gasda, Matthias Uflacker, Hasso Plattner
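
To make the trade-off in the abstract above concrete, the following is a minimal sketch of a plan cache with plain LRU eviction that tracks its hit ratio. It is not the tool or any DBMS implementation described by the authors. The class name `LRUPlanCache`, the `optimize` callback and the sample workload are hypothetical, and the LRU-K and GD(F)S variants mentioned in the abstract are not covered.

```python
from collections import OrderedDict


class LRUPlanCache:
    """Illustrative plan cache with least-recently-used eviction."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self._plans = OrderedDict()  # SQL text -> cached plan, oldest first
        self.hits = 0
        self.misses = 0

    def get_plan(self, sql, optimize):
        """Return the cached plan, or call optimize(sql) on a miss."""
        if sql in self._plans:
            self._plans.move_to_end(sql)  # mark as most recently used
            self.hits += 1
            return self._plans[sql]
        self.misses += 1
        plan = optimize(sql)  # the expensive step that caching avoids
        self._plans[sql] = plan
        if len(self._plans) > self.capacity:
            self._plans.popitem(last=False)  # evict least recently used
        return plan

    @property
    def hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def run_workload(capacity, workload):
    """Replay a list of SQL strings and return the resulting cache."""
    cache = LRUPlanCache(capacity)
    for sql in workload:
        cache.get_plan(sql, optimize=lambda s: f"plan({s})")
    return cache


workload = ["q1", "q2", "q1", "q3", "q2", "q1"]
small = run_workload(2, workload)
large = run_workload(3, workload)
print(small.hit_ratio, large.hit_ratio)
```

Replaying the same workload with different capacities shows how a tuning parameter changes the hit ratio, which is the kind of effect the proposed tool is described as visualizing.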

Visualising Top-k Alternative Routes

Alternatives to the shortest path are a standard feature of modern navigation services, where more than one suitable path between source and destination is presented to users so that they can choose a path for navigation. Although several approaches exist to compute top-k alternative paths, these techniques define suitable paths differently; hence, the top-k alternative routes they generate may differ. Unfortunately, there is no work that quantifies or experimentally compares the quality of the alternative routes generated by these techniques. This demonstration visualises the top-k alternative routes generated by two state-of-the-art techniques as well as the routes provided by Google Maps. The visualisation makes it easy for users of the demonstration to compare the quality of the routes generated by each technique. The source code of the demonstration is also made publicly available, which makes it easy to incorporate results from other techniques and mapping services and thus to compare the routes they provide.

Lingxiao Li, Muhammad Aamir Cheema, David Taniar, Maria Indrawan-Santiago

Backmatter
