2016 | Book

Database Systems for Advanced Applications

DASFAA 2016 International Workshops: BDMS, BDQM, MoI, and SeCoP, Dallas, TX, USA, April 16-19, 2016, Proceedings

About this book

This book constitutes the workshop proceedings of the 21st International Conference on Database Systems for Advanced Applications, DASFAA 2016, held in Dallas, TX, USA, in April 2016.

The volume contains 32 full papers (selected from 43 submissions) from 4 workshops, each focusing on a specific area that contributes to the main themes of DASFAA 2016: The Third International Workshop on Semantic Computing and Personalization, SeCoP 2016; the Third International Workshop on Big Data Management and Service, BDMS 2016; the First International Workshop on Big Data Quality Management, BDQM 2016; and the Second International Workshop on Mobile of Internet, MoI 2016.

Table of Contents

Frontmatter

SeCoP 2016

Frontmatter
Weibo Mood Towards Stock Market

Behavioral economics and behavioral finance hold that public mood is correlated with economic indicators and that financial decisions are significantly driven by emotions. A growing body of research has examined the correlation between the stock market and public mood states expressed on social media. However, most of this research has been conducted on English-language social media websites; little work has examined how public mood states on Chinese social media affect the stock market in China. This paper first summarizes previous research on text mining and social media sentiment analysis. We then investigate whether measurements of collective public mood states derived from Weibo, a social media website similar to Twitter on which most posts are written in Chinese, are correlated with stock market prices in China. We use a novel Chinese mood extraction method based on two NLP (Natural Language Processing) tools, Jieba and the Chinese Emotion Words Ontology, to analyze the text content of daily Weibo posts. A Granger causality analysis is then used to investigate the hypothesis that the extracted public mood or emotion states are predictive of stock price movements in China. Our experimental results indicate that some public mood dimensions, such as “Happiness” and “Disgust”, are highly correlated with changes in stock price and can be used to forecast price movements.
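The core of a Granger-style analysis can be sketched in a few lines: does adding yesterday's mood score reduce the error of a regression that already uses yesterday's price? This is a minimal one-lag illustration of the idea, not the authors' implementation:

```python
import numpy as np

def granger_gain(price, mood):
    """One-lag Granger-style check: compare the squared error of predicting
    today's price from yesterday's price alone vs. yesterday's price plus
    yesterday's mood score. A positive gain suggests mood is predictive."""
    y = np.asarray(price[1:], dtype=float)

    def sse(cols):
        # ordinary least squares with an intercept; return sum of squared residuals
        X = np.column_stack([np.ones(len(y))] + cols)
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        r = y - X @ beta
        return float(r @ r)

    own = np.asarray(price[:-1], dtype=float)         # lagged price
    lagged_mood = np.asarray(mood[:-1], dtype=float)  # lagged mood score
    return sse([own]) - sse([own, lagged_mood])       # > 0 means mood helps
```

A full Granger test would turn this error reduction into an F-statistic over several lags; the gain above is only the numerator of that comparison.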

Wen Hao Chen, Yi Cai, Kin Keung Lai
Improving Diversity of User-Based Two-Step Recommendation Algorithm with Popularity Normalization

Recommender systems have become increasingly important in addressing the information overload problem. Beyond conventional rating prediction and ranking prediction technologies, two-step recommendation algorithms have been shown to achieve outstanding accuracy in top-N recommendation tasks. However, their recommendation lists are biased towards popular items. In this paper, we propose a popularity normalization method to improve the diversity of user-based two-step recommendation algorithms. Experimental results show that our approach improves diversity significantly while maintaining the accuracy advantage of two-step recommendation approaches.
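The basic shape of popularity normalization can be sketched as a discount applied to each candidate's score; the exponent `alpha` and the formula below are illustrative assumptions, not the paper's exact model:

```python
def popularity_normalize(scores, popularity, alpha=0.5):
    """Discount each candidate item's recommendation score by its popularity.
    alpha controls how strongly popular items are penalized (alpha = 0
    leaves scores unchanged; larger values favor niche items)."""
    return {item: s / (popularity[item] ** alpha)
            for item, s in scores.items()}
```

With this kind of discount, a long-tail item with a slightly lower raw score can overtake a blockbuster in the final top-N list, which is exactly the diversity effect the abstract targets.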

Xiangyu Zhao, Wei Chen, Feng Yang, Zhongqiang Liu
Followee Recommendation in Event-Based Social Networks

Recent years have witnessed the rapid growth of event-based social networks (EBSNs) such as Plancast and DoubanEvent. In these EBSNs, followee recommendation, which recommends new users to follow, can bring great benefits to both users and service providers. In this paper, we focus on the problem of followee recommendation in EBSNs. The sparsity and imbalance of the social relations in EBSNs make this problem very challenging. Therefore, by exploiting the heterogeneous nature of EBSNs, we propose a new method called Heterogeneous Network based Followee Recommendation (HNFR). In the HNFR method, to relieve the problem of data sparsity, we combine the explicit and latent features captured from both the online social network and the offline event participation network of an EBSN. Moreover, to overcome the problem of data imbalance, we propose a Bayesian optimization framework which adopts pairwise user preference on both the social relations and the events, and aims to optimize the area under the ROC curve (AUC). Experiments on real-world data demonstrate the effectiveness of our method.

Shuchen Li, Xiang Cheng, Sen Su, Le Jiang
SBTM: Topic Modeling over Short Texts

With the rapid development of social media services such as Twitter and Sina Weibo, short texts are becoming more and more prevalent. However, inferring topics from short texts remains challenging for many content analysis tasks because of the sparsity of word co-occurrence patterns in short texts. In this paper, we propose a classification model named the sentimental biterm topic model (SBTM), which is applied to sentiment classification over short texts. To alleviate the sparsity problem, the similarities between words and documents are first estimated by singular value decomposition. Then, the most similar words are added to each short document in the corpus. Extensive evaluations on sentiment detection over short texts validate the effectiveness of the proposed method.
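The SVD step the abstract describes can be illustrated on a toy term-document matrix: factorize, keep the top-k latent dimensions, and compare words to documents by cosine similarity. The matrix and vocabulary below are invented for illustration:

```python
import numpy as np

# Toy term-document count matrix: rows = words, columns = short documents.
words = ["happy", "sad", "good", "bad"]
A = np.array([[1., 0.],   # "happy" appears only in doc 0
              [0., 1.],   # "sad"   appears only in doc 1
              [1., 0.],   # "good"  appears only in doc 0
              [0., 1.]])  # "bad"   appears only in doc 1

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                             # number of latent dimensions kept
word_vecs = U[:, :k] * s[:k]      # latent word representations
doc_vecs = Vt[:k, :].T * s[:k]    # latent document representations

def cos(a, b):
    """Cosine similarity between two latent vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In the paper's setting, the words with the highest similarity to a short document would then be appended to it before topic modeling, densifying its co-occurrence patterns.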

Jianhui Pang, Xiangsheng Li, Haoran Xie, Yanghui Rao
Similarity-Based Classification for Big Non-Structured and Semi-Structured Recipe Data

In the current big data era, there has been an explosive growth of various kinds of data. Much of this large volume of data is non-structured or semi-structured (e.g., tweets, weibos, or blogs), which makes it difficult to manage and organize. Therefore, an effective and efficient classification algorithm for such data is essential. In this article, we focus on a specific kind of non-structured/semi-structured data from daily life: recipe data. We propose a document model and a similarity-based classification algorithm for big non-structured and semi-structured recipe data. Using the proposed algorithm and system, we conduct an experimental study on a real-world dataset. The results verify the effectiveness of the proposed approach and framework.

Wei Chen, Xiangyu Zhao
An Approach of Fuzzy Relation Equation and Fuzzy-Rough Set for Multi-label Emotion Intensity Analysis

Social media contains a large number of subjective texts expressing all kinds of sentiments and emotions. Analyzing these sentiments and predicting the emotional expressions of human beings have been widely studied in academic communities and applied in commercial systems. However, most existing methods focus on single-label sentiment analysis, meaning that only an exclusive sentiment orientation (negative, positive, or neutral) or emotion state (joy, hate, love, sorrow, anxiety, surprise, anger, or expectation) is considered for a document. In fact, multiple emotions may coexist in one document, paragraph, or even sentence. Moreover, different words can express different emotion intensities in a text. In this paper, we propose an approach that combines fuzzy relation equations with fuzzy-rough sets to solve the multi-label emotion intensity analysis problem. We first obtain the fuzzy emotion intensity of every sentiment word by solving a fuzzy relation equation, and then utilize an improved fuzzy-rough set method to predict emotion intensities for sentences, paragraphs, and documents. Compared with previous work, our algorithm can simultaneously model multi-labeled emotions and their corresponding intensities in social media. Experiments on a well-known blog emotion corpus show that our multi-label emotion intensity analysis algorithm outperforms baseline methods by a large margin.

Chu Wang, Daling Wang, Shi Feng, Yifei Zhang
Features Extraction Based on Neural Network for Cross-Domain Sentiment Classification

Sentiment analysis is important for developing marketing strategies, enhancing sales, and optimizing supply chains in electronic commerce. Many supervised and unsupervised algorithms have been applied to build sentiment analysis models, under the assumption that the distributions of the labeled and unlabeled data are identical. In this paper, we address the issue that a classifier trained in one domain might not perform as well in a different one, especially when the distribution of the labeled data differs from that of the unlabeled data. To tackle this problem, we incorporate feature extraction methods into a neural network model for cross-domain sentiment classification. These methods simplify the structure of the neural network and improve accuracy. Experiments on two real-world datasets validate the effectiveness of our methods for cross-domain sentiment classification.

Endong Zhu, Guoyan Huang, Biyun Mo, Qingyuan Wu
Personalized Medical Reading Recommendation: Deep Semantic Approach

Therapists are faced with the overwhelming task of identifying, reading, and incorporating new information from a vast and fast-growing volume of publications into their daily clinical decisions. In this paper, we propose a system that semantically analyzes patient records and medical articles, performs medical-domain-specific inference to extract knowledge profiles, and finally recommends the publications that best match a patient’s health profile. We present specific knowledge extraction and matching details, examples, and results from the mental health domain.

Tatiana Erekhinskaya, Mithun Balakrishna, Marta Tatu, Dan Moldovan
A Highly Effective Hybrid Model for Sentence Categorization

Sentence categorization is the task of classifying sentences by their types, which is useful for the analysis in many NLP applications. Grammatical or syntactic rules exist to determine the types of some sentences, and keywords, such as negation words for negative sentences, are important features. However, not all sentences can be classified by rules. Besides, different types of sentences may contain the same keywords, whose meaning may be changed by context. We address the first issue by proposing a hybrid model consisting of Decision Trees and Support Vector Machines. In addition, we design a new feature based on the N-gram model. The results of experiments conducted on the sentence categorization dataset of the “Good Ideas of China” Competition 2015 show that (1) our model outperforms baseline methods and all online systems in this competition; and (2) our feature is more effective than features frequently used in NLP.
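The rule-then-classifier structure of such a hybrid model can be sketched as follows. The rules and the fallback here are hypothetical stand-ins; in the paper, the fallback would be the trained Decision Tree/SVM combination over N-gram features:

```python
def hybrid_classify(sentence, rules, fallback):
    """Apply hand-written surface rules first; defer to a trained classifier
    when no rule fires. A simplified sketch of the hybrid idea, not the
    paper's exact model."""
    for pattern, label in rules:
        if pattern in sentence:
            return label
    return fallback(sentence)
```

The benefit of this layering is that unambiguous sentences are handled cheaply and deterministically, while the statistical model only has to cover the residual, context-dependent cases.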

Zhenhong Chen, Kai Yang, Yi Cai, Dongping Huang, Ho-fung Leung
Improved Automatic Keyword Extraction Given More Semantic Knowledge

Graph-based ranking algorithms such as TextRank show a remarkable effect on keyword extraction. However, these algorithms build graphs considering only the lexical sequence of the documents. Hence, the generated graphs cannot reflect the semantic relationships within documents. In this paper, we demonstrate that information is lost in the graph-building process from textual documents to graphs, and that this loss leads to misjudgments by the algorithm. To solve this problem, we propose a new approach called Topic-based TextRank. Different from the traditional algorithm, our approach takes the lexical meaning of each text unit (i.e., words and phrases) into account. Our experimental results show that the proposed algorithm outperforms state-of-the-art algorithms.
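The baseline that the paper builds on, TextRank over a lexical co-occurrence graph, can be sketched in a few lines: link words that appear within a sliding window, then run a PageRank-style iteration over the graph. This is a minimal unweighted version of the standard algorithm:

```python
import collections

def textrank(words, window=2, d=0.85, iters=50):
    """Minimal TextRank sketch: build an undirected co-occurrence graph
    over a token sequence, then rank words by a PageRank-style score."""
    graph = collections.defaultdict(set)
    for i, w in enumerate(words):
        for v in words[i + 1:i + window + 1]:  # neighbors within the window
            if v != w:
                graph[w].add(v)
                graph[v].add(w)
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {w: (1 - d) + d * sum(score[v] / len(graph[v])
                                      for v in graph[w])
                 for w in graph}
    return sorted(score, key=score.get, reverse=True)
```

Note that the graph is built purely from token adjacency; the paper's point is precisely that this ignores lexical meaning, which Topic-based TextRank adds back.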

Kai Yang, Zhenhong Chen, Yi Cai, DongPing Huang, Ho-fung Leung
Generating Computational Taxonomy for Business Models of the Digital Economy

We propose a semi-automatic ontology building approach to create a new taxonomy of the digital economy based on a big data approach: harvesting data by scraping publicly available Web pages of digitally-focused businesses. The method is based on a small core ontology which provides the basic-level concepts of business models. We use computational approaches to extract Web data and generate concepts and a taxonomy of business models in the digital economy, which can consequently help address important questions when exploring new business models in the big data era.

Chao Wu, Yi Cai, Mei Zhao, Songping Huang, Yike Guo
How to Use the Social Media Data in Assisting Restaurant Recommendation

Online social network applications such as Twitter and Weibo play an important role in people’s lives, and their posts contain a tremendous amount of information. However, mining these posts for valuable information is a difficult problem. In this paper, we design the whole process for extracting data from Weibo and develop an algorithm for detecting foodborne disease events. The detected foodborne disease information is then utilized to assist restaurant recommendation. The experimental results show the effectiveness and efficiency of our method.

Wenjuan Cui, Pengfei Wang, Xin Chen, Yi Du, Danhuai Guo, Yuanchun Zhou, Jianhui Li
A Combined Collaborative Filtering Model for Social Influence Prediction in Event-Based Social Networks

Event-based social networks (EBSNs) provide convenient online platforms for users to organize, attend, and share social events. Understanding users’ social influences in social networks can benefit many applications, such as social recommendation and social marketing. In this paper, we focus on the problem of predicting users’ social influences on upcoming events in EBSNs. We formulate this prediction problem as the estimation of unobserved entries of a constructed user-event social influence matrix, where each entry represents the influence value of a user on an event. In particular, we define a user’s social influence on a given event as the proportion of the user’s friends who are influenced by him/her to attend the event. To solve this problem, we present a combined collaborative filtering model, namely the Matrix Factorization with Event Neighborhood (MF-EN) model, which incorporates an event-based neighborhood method into matrix factorization. Because the constructed social influence matrix is very sparse and contains few overlapping values, it is challenging to find reliable similar event neighbors using widely adopted similarity measures (e.g., Pearson correlation and cosine similarity). To address this challenge, we propose an additional-information-based neighborhood discovery (AID) method that considers three event-specific features in EBSNs. The parameters of our MF-EN model are determined by minimizing the associated regularized squared error function through stochastic gradient descent. We conduct a comprehensive performance evaluation on real-world datasets collected from DoubanEvent. Experimental results demonstrate the superiority of the proposed model compared to several alternatives.
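The matrix-factorization-with-SGD core that the abstract describes can be sketched as follows: learn low-rank factors P and Q so that observed entries of the influence matrix are approximated by inner products. The hyperparameters and the toy matrix below are illustrative, and the event-neighborhood term of MF-EN is omitted:

```python
import numpy as np

def mf_sgd(R, k=2, lr=0.02, reg=0.02, epochs=1000, seed=0):
    """Factorize the observed entries of R (NaN = unobserved) as P @ Q.T,
    minimizing regularized squared error by stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, m = R.shape
    P = rng.normal(0, 0.1, (n, k))   # user (row) factors
    Q = rng.normal(0, 0.1, (m, k))   # event (column) factors
    obs = [(i, j, R[i, j]) for i in range(n) for j in range(m)
           if not np.isnan(R[i, j])]
    for _ in range(epochs):
        for i, j, r in obs:
            e = r - P[i] @ Q[j]                  # prediction error
            P[i] += lr * (e * Q[j] - reg * P[i]) # gradient steps with
            Q[j] += lr * (e * P[i] - reg * Q[j]) # L2 regularization
    return P, Q
```

Unobserved entries are then predicted as `P[i] @ Q[j]`, which is how the model fills in a user's influence on events they have not yet interacted with.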

Xiao Li, Xiang Cheng, Sen Su, Shuchen Li, Jianyu Yang
Learning Manifold Representation from Multimodal Data for Event Detection in Flickr-Like Social Media

In this work, a three-stage social event detection model is devised to discover events in Flickr data. As the features possessed by the data are typically heterogeneous, a multimodal fusion model (M²F) that exploits a soft-voting strategy and a reinforcing model is devised to learn fused features in the first stage. Furthermore, a Laplacian non-negative matrix factorization (LNMF) model is exploited to extract a compact manifold representation. In particular, a Laplacian regularization term constructed on the multimodal features is introduced to preserve the geometric structure of the data. Finally, clustering algorithms can be applied seamlessly to detect event clusters. Extensive experiments conducted on a real-world dataset reveal that the M²F-LNMF-based approaches outperform the baselines.

Zhenguo Yang, Qing Li, Wenyin Liu, Yun Ma
Deep Neural Network for Short-Text Sentiment Classification

As a concise medium for describing events, short text plays an important role in conveying the opinions of users. Classifying user emotions based on short text has become a significant topic in social network analysis. Neural networks can obtain good classification performance with high generalization ability. However, conventional neural networks use only a simple back-propagation algorithm to estimate the parameters, which may introduce large instabilities when training deep neural networks from random initializations. In this paper, we apply a pre-training method based on restricted Boltzmann machines to deep neural networks, aiming to obtain competitive and stable classification performance for user emotions over short text. Experimental evaluations using real-world datasets validate the effectiveness of our model on the short-text sentiment classification task.

Xiangsheng Li, Jianhui Pang, Biyun Mo, Yanghui Rao, Fu Lee Wang

BDMS 2016

Frontmatter
VMPSP: Efficient Skyline Computation Using VMP-Based Space Partitioning

The skyline query returns the set of interesting points that are not dominated by any other point in a multi-dimensional dataset. This query has been studied considerably over the last several years in preference analysis and multi-criteria decision-making applications. Space partitioning is the best non-index framework proposed so far, but existing methods based on it do not consider the balance of the partitioned subspaces. To overcome this limitation, we first develop a cost evaluation model of space partitioning in skyline computation and propose an efficient approach to compute the skyline set using balanced partitioning, illustrating the importance of balance in partitioning. Based on this, we propose a method to construct a balanced partitioning point VMP, whose ith attribute value is the median value of all points in the ith dimension. We also design a structure, RST, to reduce dominance tests among comparable subspaces. The experimental evaluation indicates that our algorithm is at least several times faster than existing state-of-the-art algorithms.
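The dominance relation underlying every skyline algorithm can be stated in a few lines; the naive quadratic computation below is the baseline that partitioning schemes such as VMPSP accelerate (assuming smaller values are preferred in every dimension):

```python
def dominates(p, q):
    """p dominates q if p is no worse than q in every dimension and
    strictly better in at least one (smaller = better here)."""
    return all(a <= b for a, b in zip(p, q)) and \
           any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep every point not dominated by another."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

Space-partitioning methods avoid most of these pairwise dominance tests by comparing only points in subspaces that can actually dominate each other, which is where the balance of the partition matters.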

Kaiqi Zhang, Donghua Yang, Hong Gao, Jianzhong Li, Hongzhi Wang, Zhipeng Cai
Real-Time Event Detection with Water Sensor Networks Using a Spatio-Temporal Model

Event detection using spatio-temporal correlation is one of the most popular applications of wireless sensor networks. This kind of task tends to be a difficult big data analysis problem due to the massive data generated by large-scale sensor networks such as water sensor networks, especially in the context of real-time analysis. To reduce the computational cost of abnormal event detection and improve response time, sensor node selection is needed to cut down the amount of data used in the spatio-temporal correlation analysis. In this paper, a connected dominating set (CDS) approach is introduced to select backbone nodes from the sensor network. Furthermore, a spatio-temporal model is proposed for the correlation analysis, in which a Markov chain models the temporal dependency among different sensor nodes, and a Bayesian network (BN) models the spatial dependency. The proposed approach and model have been applied to the real-time detection of urgent events (e.g., water pollution incidents) with water sensor networks. Preliminary experimental results on simulated data indicate that our solution achieves better performance in terms of response time and scalability, compared to a simple threshold algorithm and a BN-only algorithm.

Yingchi Mao, Xiaoli Chen, Zhuoming Xu
Bayesian Network Structure Learning from Big Data: A Reservoir Sampling Based Ensemble Method

Bayesian network (BN) learning from big datasets is potentially more valuable than learning from conventional small datasets as big data contain more comprehensive probability distributions and richer causal relationships. However, learning BNs from big datasets requires high computational cost and easily ends in failure, especially when the learning task is performed on a conventional computation platform. This paper addresses the issue of BN structure learning from a big dataset on a conventional computation platform, and proposes a reservoir sampling based ensemble method (RSEM). In RSEM, a greedy algorithm is used to determine an appropriate size of sub datasets to be extracted from the big dataset. A fast reservoir sampling method is then adopted to efficiently extract sub datasets in one pass. Lastly, a weighted adjacent matrix based ensemble method is employed to produce the final BN structure. Experimental results on both synthetic and real-world big datasets show that RSEM can perform BN structure learning in an accurate and efficient way.
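The one-pass extraction step RSEM relies on is classic reservoir sampling (Algorithm R), which keeps a uniform sample of k records while streaming over a dataset too large to hold in memory. A minimal sketch:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Algorithm R: return a uniform random sample of k items from a
    stream of unknown length, in a single pass and O(k) memory."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)         # item i survives with prob k/(i+1)
            if j < k:
                reservoir[j] = item       # evict a random resident
    return reservoir
```

Each extracted sub-dataset would then be fed to a structure learner, with the per-sample results combined through the weighted adjacency matrix ensemble the abstract describes.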

Yan Tang, Zhuoming Xu, Yuanhang Zhuang
Correlation Feature of Big Data in Smart Cities

Smart cities constantly generate massive data resources. To effectively manage and utilize big city data, data vitalization technology has been proposed. Given the complex and diverse relationships among big data, data correlation is very important for data vitalization. This paper presents a framework for data correlation and describes the discovery, representation, and growth of data correlations. In particular, it proposes an innovative representation of data correlation: the data correlation diagram. Based on basic and multi-stage data relations, we optimize data correlation diagrams according to transitive rules. We also design dynamic data diagrams to support data and relation changes, reducing the response time to data changes and enabling the autonomous growth of the vitalized data and relations. Finally, an instance of smart behaviors is introduced which verifies the feasibility and efficiency of the data relation diagram.

Yi Zhang, Xiaolan Tang, Bowen Du, Weilin Liu, Juhua Pu, Yujun Chen
Nearly Optimal Probabilistic Coverage for Roadside Data Dissemination in Urban VANETs

Data dissemination based on Roadside Access Points (RAPs) in vehicular ad-hoc networks has attracted much attention and shows promising prospects. In this paper, we focus on roadside data dissemination involving three basic elements: a RAP Service Provider (RSP), mobile vehicles, and requesters. The RSP has deployed many RAPs at different locations in a city. A requester wants to rent some RAPs, which can disseminate its data to vehicles with certain probabilities, and tries to select the minimal number of RAPs to finish the data dissemination in order to save expenses. Meanwhile, the selected RAPs need to ensure that the probability of each vehicle successfully receiving the data is no less than a threshold. We prove that this RAP selection problem is NP-hard, since it is a meaningful extension of the classic Set Cover problem. To solve it, we propose a greedy algorithm and give its approximation ratio. Moreover, we conduct extensive simulations on real-world data to demonstrate its good performance.
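A greedy heuristic for this probabilistic set cover variant can be sketched by working in log-space: a vehicle covered by RAPs with delivery probabilities p₁, p₂, … succeeds with probability 1 − ∏(1 − pᵢ), so each RAP contributes −log(1 − p) of "coverage mass" toward a target of −log(1 − threshold). This is an illustrative greedy sketch, not the paper's algorithm or its approximation analysis:

```python
import math

def greedy_rap_selection(raps, threshold):
    """raps: {rap_id: {vehicle: delivery_prob}}. Greedily pick RAPs until
    every vehicle's success probability 1 - prod(1 - p) >= threshold."""
    vehicles = {v for probs in raps.values() for v in probs}
    target = -math.log(1 - threshold)       # required coverage mass
    have = {v: 0.0 for v in vehicles}       # mass accumulated so far
    chosen, remaining = [], dict(raps)

    def weight(p):                          # a RAP's mass for one vehicle
        return target if p >= 1 else -math.log(1 - p)

    while any(have[v] < target for v in vehicles) and remaining:
        def gain(r):                        # capped marginal coverage gain
            return sum(min(weight(p), target - have[v])
                       for v, p in remaining[r].items() if have[v] < target)
        best = max(remaining, key=gain)
        chosen.append(best)
        for v, p in remaining.pop(best).items():
            have[v] += weight(p)
    return chosen
```

Capping each vehicle's contribution at its remaining demand is what makes this the probabilistic analogue of the classic greedy set cover step.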

Yawei Hu, Mingjun Xiao, An Liu, Ruhong Cheng, Hualin Mao
OCC: Opportunistic Crowd Computing in Mobile Social Networks

Crowd computing is a new paradigm in which a group of users is coordinated to deal with a huge job, or amounts of data, that one user cannot easily handle. In this paper, we design an Opportunistic Crowd Computing (OCC) system for mobile social networks (MSNs). Unlike traditional crowd computing systems, mobile users in OCC move around and communicate with each other using short-distance wireless mechanisms (e.g., WiFi or Bluetooth) when they encounter each other, so as to save communication costs. The key design of OCC is the task assignment scheme. Unlike the traditional crowd computing task assignment problem, task assignment in OCC must take users’ mobility behaviors into consideration. To solve this problem, we present an optimal user group algorithm (OUGA), which minimizes the total cost while ensuring the task completion rates. Moreover, we conduct a performance analysis and prove the optimality of this algorithm. In addition, simulations show that our algorithm achieves good performance.

Hualin Mao, Mingjun Xiao, An Liu, Jianbo Li, Yawei Hu
Forest of Distributed B+Tree Based on Key-Value Store for Big-Set Problem

In many big-data systems, the amount of data is growing rapidly. Many systems have to store big sets: sets with a large number of items. Efficiently storing a large number of big sets to support high-rate updating and querying is a challenging problem in data storage systems. Nowadays, distributed key-value stores play an important role in building large-scale systems, offering horizontal scalability, low latency, and high throughput when manipulating small or medium key-value pairs. Unfortunately, they do not work well with big-set data structures, and most of them do not scale to a large number of big sets. In this research, we analyze the difficulty of storing big sets using key-value stores. An architecture called “Forest of Distributed B+Trees” and its algorithms are proposed to build a NoSQL data store for big data structures such as sets and dictionaries. Big sets are split into multiple small sets of limited size and stored in key-value stores. A multi-level metadata structure is also proposed to reduce the complexity of write operations on big sets from O(N) to O(log(N)). This design can store a larger number of items in a set than Cassandra and Google BigTable: parts of a big set are distributed, whereas a row in Google BigTable has a limited size and must fit on a single server. Experimental results show that the proposed system has better read performance than Cassandra. The proposed architecture may potentially be used in various applications, such as storage systems for sensor data in Internet of Things (IoT) systems, commercial transaction storage, and social networks.
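The core splitting idea can be sketched with a plain dict standing in for the distributed key-value store: a logical big set becomes many fixed-size sub-set values plus a meta key listing them. The page size, key naming, and flat (single-level) metadata here are simplifying assumptions; the paper uses a multi-level B+Tree over the sub-sets:

```python
PAGE_LIMIT = 3   # tiny sub-set capacity, for illustration only

class BigSetStore:
    """Store one logical big set as many small sub-sets in a key-value
    namespace, with a meta key listing the sub-set keys."""
    def __init__(self):
        self.kv = {}  # stands in for a distributed key-value store

    def add(self, name, item):
        if self.contains(name, item):
            return                            # set semantics: no duplicates
        pages = self.kv.setdefault(name + ":meta", [])
        if not pages or len(self.kv[pages[-1]]) >= PAGE_LIMIT:
            page = f"{name}:part{len(pages)}" # open a fresh sub-set
            pages.append(page)
            self.kv[page] = set()
        self.kv[pages[-1]].add(item)

    def contains(self, name, item):
        return any(item in self.kv[p]
                   for p in self.kv.get(name + ":meta", []))
```

Because each stored value stays small, every sub-set fits comfortably in an ordinary key-value pair, which is exactly the property that lets the full design scale past per-row size limits.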

Thanh Trung Nguyen, Minh Hieu Nguyen

BDQM 2016

Frontmatter
An Efficient Schema Matching Approach Using Previous Mapping Result Set

The widespread adoption of the eXtensible Markup Language has pushed a growing number of researchers to design XML-specific schema matching approaches, which aim to find semantic correspondences between concepts in different data sources. In recent years, there has been a growing need for high-performance matching systems to identify and discover such semantic correspondences across XML data. XML schema matching methods face several challenges in the definition, utilization, and combination of element similarity measures. In this paper, we propose an XML schema matching framework based on a previous mapping result set (PMRS). We first parse XML schemas as schema trees and extract schema features. Then we construct the PMRS as auxiliary information and design a retrieval algorithm based on it. To cope with complex matching discovery, we compute the similarity among XML schemas using the semantic information carried by XML data. Our experimental results demonstrate the performance benefits of the schema matching framework using PMRS.

Hongjie Fan, Junfei Liu, Wenfeng Luo, Kejun Deng
A Distributed Load Balance Algorithm of MapReduce for Data Quality Detection

Big data quality detection is a valuable problem in the data quality field. MapReduce is an important distributed data processing model, mainly used for big data processing, and load balance is a key factor that influences its performance. In this paper, we propose a distributed greedy approximation algorithm for the load balance problem in MapReduce for data quality detection. Our work has three key features: (a) we show that the problem is NP-complete and prove a considerable approximation ratio for the proposed algorithm; (b) the algorithm imposes only one more round of MapReduce than conventional processing and occupies minimal time in the total process; and (c) it is simple and easy to implement. Experimental results on real-life and synthetic data demonstrate that the proposed algorithm is effective for load balance.
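A textbook greedy approach to this kind of load balancing is Longest-Processing-Time-first: sort tasks by size and always give the next one to the least-loaded worker. This is shown as a generic sketch of the greedy idea (the classic 4/3-approximation for makespan), not the paper's specific algorithm:

```python
import heapq

def lpt_assign(task_sizes, n_workers):
    """LPT greedy: assign each task (largest first) to the currently
    least-loaded worker, using a min-heap keyed on worker load."""
    heap = [(0.0, w) for w in range(n_workers)]
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for size in sorted(task_sizes, reverse=True):
        load, w = heapq.heappop(heap)      # least-loaded worker
        assignment[w].append(size)
        heapq.heappush(heap, (load + size, w))
    return assignment
```

In a MapReduce setting, `task_sizes` would be the per-key workload estimates gathered in the extra round, and the workers would be reducers.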

Yitong Gao, Yan Zhang, Hongzhi Wang, Jianzhong Li, Hong Gao
A Formal Taxonomy to Improve Data Defect Description

Data quality assessment outcomes are essential for analytical processes, especially in big data environments. The efficiency and efficacy of such assessment depend on automated solutions, which in turn are determined by an understanding of the problem associated with each data defect. Despite the considerable number of works that describe data defects regarding accuracy, completeness, and consistency, there is significant heterogeneity in terminology, nomenclature, description depth, and the number of examined defects. To fill this gap, this work reports a taxonomy that organizes data defects according to a three-step methodology. The proposed taxonomy enhances the descriptions and coverage of defects with respect to related works, and also supports certain requirements of data quality assessment, including the design of semi-supervised solutions for data defect detection.

João Marcelo Borovina Josko, Marcio Katsumi Oikawa, João Eduardo Ferreira
ISSA: Efficient Skyline Computation for Incomplete Data

Over the past years, the skyline query has attracted wide attention in the database community. For skyline computation over incomplete data, existing algorithms focus mainly on reducing the dominance tests among points with the same bitmap representation by exploiting the bucket technique. However, the issue of exhaustive comparisons among points in different buckets, which is the major cost, remains unsolved. In this paper, we present a general framework, COBO, for skyline computation over incomplete data. Based on COBO, we develop an efficient algorithm, ISSA, with two phases: pruning the compared list and reducing the expected number of comparisons. We construct a compared-list order according to ACD to significantly diminish the total comparisons among points in different buckets. The experimental evaluation on synthetic and real datasets indicates that our algorithm outperforms the existing state-of-the-art algorithm by 1 to 2 orders of magnitude in the number of comparisons.

Kaiqi Zhang, Hong Gao, Hongzhi Wang, Jianzhong Li
Join Query Processing in Data Quality Management

Data quality management is an essential problem for information systems. As a basic operation in data quality management, joins on large-scale data play an important role in document clustering. MapReduce is a programming model that is usually applied to process large-scale data; many tasks, such as data processing for search engines and machine learning, can be implemented under this framework. However, current implementations of MapReduce have no efficient support for the join operation. In this paper, we present a strategy to build an extended Bloom filter for a large dataset using MapReduce, and use the extended Bloom filter to improve the performance of two-way and multi-way joins.
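The basic Bloom-filter join idea can be sketched as follows: build a filter over the smaller relation's join keys, use it to discard most non-matching records from the larger relation before the join, and accept that false positives are harmless (they simply find no partner). This is a plain single-machine sketch, not the paper's extended filter or its MapReduce implementation:

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k salted SHA-256 hashes over an m-bit array."""
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _hashes(self, key):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, key):
        for h in self._hashes(key):
            self.bits[h] = 1

    def might_contain(self, key):
        # may return a false positive, never a false negative
        return all(self.bits[h] for h in self._hashes(key))

def bloom_join(small, big):
    """Join two lists of (key, value) pairs, filtering the big side with a
    Bloom filter built from the small side's keys."""
    bf = BloomFilter()
    for key, _ in small:
        bf.add(key)
    candidates = [(k, v) for k, v in big if bf.might_contain(k)]
    small_map = {}
    for k, v in small:
        small_map.setdefault(k, []).append(v)
    return [(k, sv, bv) for k, bv in candidates
            for sv in small_map.get(k, [])]
```

In the MapReduce setting, the filter is what gets shipped to the mappers instead of the whole small relation, which is why it cuts shuffle traffic for both two-way and multi-way joins.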

Mingliang Yue, Hong Gao, Shengfei Shi, Hongzhi Wang
Similarity Search on Massive Data Based on FPGA

Data quality is a very important issue in massive data processing. When we want to distill valuable knowledge from a massive dataset, the key point is to know whether the dataset is clean, so we should perform data cleaning before extracting useful information from it. Similarity search is an important method in data cleaning, and MapReduce is used to perform similarity search in our data cleaning system. However, its efficiency is very low: when we process massive data stored in HDFS with the MapReduce programming model, every part of the dataset is scanned, which is very time-consuming, especially for large-scale datasets. In this paper, we perform a filtering operation on the original data with FPGA hardware before applying similarity search for data cleaning.

Yanzheng Wang, Hong Gao, Shengfei Shi, Hongzhi Wang
Skyline Join Query Processing over Multiple Relations

Skyline queries over multiple relations, known as skyline join query processing, have attracted much attention recently. However, most existing algorithms perform the skyline join on only two relations. In this paper, we propose an efficient algorithm, Skyjog, which is applicable to skyline joins over two or more relations. Skyjog divides each relation into two or three partitions. Based on the proposed group-division approach, tuples generated by several join combinations of these partitions are guaranteed to be skyline points, so Skyjog only has to examine the tuples of the remaining join combinations. Skyjog thus achieves efficiency by avoiding much of the skyline computation. Experiments demonstrate that Skyjog performs well on all datasets and outperforms the state-of-the-art skyline join algorithms on both two relations and more than two relations.

Jinchao Zhang, Zheng Lin, Bo Li, Weiping Wang, Dan Meng
Detect Redundant RDF Data by Rules

The development and standardization of Semantic Web technologies have resulted in an unprecedented volume of RDF datasets being published on the Web. However, data quality problems exist in most information systems, and RDF data is no exception. The quality of RDF data has become a hot topic in Web research, and many data quality dimensions and metrics have been proposed. In this paper, we focus on the redundancy problem in RDF data and propose a rule-based method to find and delete semantically redundant triples. By evaluating existing datasets, we show that our method can remove redundant triples and thus help data publishers provide more concise RDF data.

Tao Guang, Jinguang Gu, Li Huang

MoI 2016

Frontmatter
Behavior-Based Twitter Overlapping Community Detection

In this paper, we cluster Twitter users into different communities, which can overlap based on the users' interests. The paper proposes an RWC (relation-weight-clustering) model to construct the Twitter user network. This model takes users' "@" and "RT@" behaviors into account: by counting the frequency of "@" and "RT@" mentions, the relation strength can be described. Using an SVM, we obtain each user's interest vector by analyzing their tweets, and the common interest vector between two users is calculated from their shared interests. Applying a community detection algorithm to the resulting relation-node network, overlapping communities are formed with a modularity of 0.682.
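The counting of "@" and "RT@" mentions that underlies the relation weights can be sketched as follows (a minimal illustration; the RWC model's exact weighting scheme and the SVM-based interest vectors are not shown):

```python
import re
from collections import Counter

def relation_weights(tweets):
    """tweets: list of (author, text) pairs. Count how often each
    author mentions ('@') or retweets ('RT @') another user; the
    counts serve as edge weights in the user network."""
    w = Counter()
    for author, text in tweets:
        for target in re.findall(r"(?:RT )?@(\w+)", text):
            if target != author:  # ignore self-mentions
                w[(author, target)] += 1
    return w

tweets = [
    ("alice", "RT @bob great post"),
    ("alice", "@bob thanks!"),
    ("bob", "@carol hello"),
]
print(relation_weights(tweets))
# Counter({('alice', 'bob'): 2, ('bob', 'carol'): 1})
```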

Lixiang Guo, Zhaoyun Ding, Hui Wang
Versatile Safe-Region Generation Method for Continuous Monitoring of Moving Objects in the Road Network Distance

This paper proposes a fast safe-region generation method for several kinds of vicinity queries, including distance range queries, set k nearest neighbor (kNN) queries, and ordered kNN queries. When a user is driving a car on a road network, he/she wants to know the objects located in the vicinity of the car. However, the result changes as the car moves, so the up-to-date result is always expected and requested from the server; frequent update requests, in turn, place a heavy load on the server. To cope with this problem efficiently, the idea of the safe region has been proposed. This paper proposes a fast method for generating safe regions that is applicable to several types of vicinity queries. Experimental evaluations show that the proposed algorithm achieves high performance in terms of processing time and is one to two orders of magnitude faster than existing algorithms.
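The client-side use of a safe region can be sketched as follows (a Euclidean simplification with a circular region; the paper computes safe regions in road-network distance, which this sketch does not attempt):

```python
import math

def in_safe_region(pos, center, radius):
    """The client keeps its cached query result while it stays inside
    the safe region; only upon leaving it does the client contact the
    server for a fresh result and a new safe region."""
    return math.dist(pos, center) <= radius

center, radius = (0.0, 0.0), 2.0
print(in_safe_region((1.0, 1.0), center, radius))  # True: result still valid
print(in_safe_region((2.0, 1.5), center, radius))  # False: re-query needed
```

The larger the safe region that can be guaranteed, the fewer requests reach the server, which is why fast generation of maximal safe regions matters.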

Yutaka Ohsawa, Htoo Htoo
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Editors
Hong Gao
Jinho Kim
Yasushi Sakurai
Copyright Year
2016
Electronic ISBN
978-3-319-32055-7
Print ISBN
978-3-319-32054-0
DOI
https://doi.org/10.1007/978-3-319-32055-7
