
2016 | Book | 1st Edition

Web-Age Information Management

17th International Conference, WAIM 2016, Nanchang, China, June 3-5, 2016, Proceedings, Part II


About this book

This two-volume set, LNCS 9658 and 9659, constitutes the thoroughly refereed proceedings of the 17th International Conference on Web-Age Information Management, WAIM 2016, held in Nanchang, China, in June 2016.

The 80 full research papers presented together with 8 demonstrations were carefully reviewed and selected from 266 submissions. The focus of the conference is on the following topics: data mining, spatial and temporal databases, recommender systems, graph data management, information retrieval, privacy and trust, query processing and optimization, social media, big data analytics, and distributed and cloud computing.

Table of contents

Frontmatter
Erratum to: Web-Age Information Management (Part I and II)
Bin Cui, Nan Zhang, Jianliang Xu, Xiang Lian, Dexi Liu

Privacy and Trust

Frontmatter
Detecting Anomalous Ratings Using Matrix Factorization for Recommender Systems

Personalized recommendation techniques play a key role in popular e-commerce services such as Amazon and TripAdvisor. In practice, collaborative filtering recommender systems are highly vulnerable to "shilling" attacks because of their openness. Although the detection of such attacks has been extensively researched during the past decade, the problem remains open. Existing approaches either extract extra features from user profiles or directly calculate the similarity between users to capture the attackers of concern. In this paper, we propose a novel detection technique that bypasses these hard problems by combining max-margin matrix factorization with Bayesian nonparametrics and outlier detection. First, mean prediction errors for users and items are calculated by applying the trained prediction model to test sets. We then analyze the distribution of the mean prediction errors of items in order to narrow the set of suspected items. Based on the suspected items, all anomalous users can finally be determined by analyzing the distribution of the mean prediction error of each user. Extensive experiments on the MovieLens-100K dataset demonstrate the effectiveness of the proposed method.
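The abstract above flags suspicious items and users by the distribution of their mean prediction errors. A minimal sketch of that outlier step (only the error-distribution analysis, not the paper's max-margin matrix factorization model; all names are illustrative) might look like:

```python
from statistics import mean, stdev

def mean_abs_errors(ratings, predictions):
    """Per-item mean absolute prediction error.

    ratings, predictions: dicts mapping (user, item) -> rating.
    Returns a dict: item -> mean |actual - predicted| over its raters.
    """
    per_item = {}
    for (user, item), actual in ratings.items():
        per_item.setdefault(item, []).append(abs(actual - predictions[(user, item)]))
    return {item: mean(errs) for item, errs in per_item.items()}

def flag_outliers(err_by_key, k=2.0):
    """Keys whose mean error exceeds mu + k*sigma of the error distribution."""
    vals = list(err_by_key.values())
    mu, sigma = mean(vals), stdev(vals)
    return {key for key, e in err_by_key.items() if e > mu + k * sigma}
```

The same `flag_outliers` helper could then be reapplied to per-user mean errors restricted to the suspected items.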

Zhihai Yang, Zhongmin Cai, Xinyuan Chen
A Novel Spatial Cloaking Scheme Using Hierarchical Hilbert Curve for Location-Based Services

With the rapid development of positioning and wireless technologies, Location-Based Services (LBS) are everywhere in our daily life. Though LBS brings great convenience to users, location privacy is vulnerable in many ways (e.g., an untrusted LBS server). To address these privacy issues, we propose HHScloak, a novel Hierarchical Hilbert Curve Spatial Cloaking algorithm that effectively achieves k-anonymity for mobile users in LBS. Unlike existing methods, we take the Average Query Density (AQD) into consideration and generate an Anonymity Set (AS) that satisfies reciprocity and uniformity. Based on the hierarchical method and an optimal bucket-splitting strategy, our scheme provides a larger cloaking region that guarantees the privacy level. Security analysis and experimental evaluation show that the HHScloak scheme performs more efficiently and effectively than other methods.
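Hilbert-curve cloaking schemes rest on mapping 2-D cells to 1-D curve positions so that nearby cells tend to stay nearby on the curve. A sketch of the standard iterative coordinate-to-index mapping (the textbook Hilbert transform, not the paper's hierarchical variant) is:

```python
def xy2d(n, x, y):
    """Map cell (x, y) in an n x n grid (n a power of two) to its
    distance d along the Hilbert curve, using the standard iterative
    quadrant-rotation algorithm."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        s //= 2
    return d
```

A k-anonymity cloak can then pick the k-1 cells whose curve indices are closest to the user's own cell.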

Ningning Cui, Xiaochun Yang, Bin Wang
Efficient Privacy-Preserving Content-Based Image Retrieval in the Cloud

Cloud storage systems are increasingly being used to host personal or organizational data of critical importance, especially image data, which needs more storage space than ordinary data. While bringing much convenience, existing cloud storage solutions could seriously breach the privacy of users. Encrypting images before outsourcing them to the cloud helps to protect the privacy of the data, but it also makes it challenging to perform image retrieval over the encrypted data. To address this issue, a considerable number of searchable encryption schemes have been proposed in the literature. However, most existing schemes are either less secure or too computation- and communication-intensive to be practical. In this paper, we propose an efficient privacy-preserving content-based image retrieval scheme. We first convert the high-dimensional image descriptors to compact binary codes, and then adapt asymmetric scalar-product-preserving encryption (ASPE) to ensure the confidentiality of the sensitive images. The security analysis and experiments show the security, accuracy and efficiency of our proposed scheme.

Kai Huang, Ming Xu, Shaojing Fu, Dongsheng Wang
Preserving the d-Reachability When Anonymizing Social Networks

The goal of graph anonymization is to avoid the disclosure of private information in social networks through graph modifications while preserving the data utility of the anonymized graphs for social network analysis. Graph reachability is an important data utility, as reachability queries are not only common on graph databases but also serve as fundamental operations for many other graph queries. However, graph reachability is severely distorted after anonymization. In this work, we study how to preserve the d-reachability of vertices when anonymizing social networks. We solve the problem by designing a d-reachability preserving graph anonymization (d-RPA for short) algorithm. The main idea of d-RPA is to find a subgraph that preserves the d-reachability, and to keep it unchanged during anonymization. We show that d-RPA can efficiently find such a subgraph and anonymize the released graph with low information loss. Extensive experiments on real datasets illustrate that anonymized social networks generated by our method can be used to answer d-reachability queries with high accuracy.

Xiangyu Liu, Jiajia Li, Dahai Zhou, Yunzhe An, Xiufeng Xia
Personalized Location Anonymity - A Kernel Density Estimation Approach

In recent years, the problem of location privacy protection in location-based services (LBS) has drawn a great deal of researchers' attention. However, existing location privacy protection technologies rarely consider the personal visit probability and other side information, which are likely to be exploited by attackers. In order to protect users' location privacy more effectively, we propose Personalized Location Anonymity (PLA), which combines side information to achieve k-anonymity. In the offline phase, we use a Kernel Density Estimation (KDE) approach to obtain the personal visit probability for each cell of the space according to a specific user's visited locations. In the online phase, the dummy locations for each user's query are selected based on both the entropy of the personal visit probabilities and the area of the Cloaking Region (CR). We conduct extensive experiments on a real dataset to verify the degree of privacy protection, where the privacy properties are measured by the location information entropy and the area of the CR.
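The offline KDE step described above can be illustrated with a plain Gaussian kernel density estimate over grid-cell centres, normalised into a per-cell visit probability (a minimal sketch with illustrative names, not the paper's implementation):

```python
from math import exp, pi

def kde_visit_prob(visits, cells, bandwidth=1.0):
    """Gaussian KDE over a user's visited points.

    visits: list of (x, y) locations the user has visited.
    cells:  list of (x, y) grid-cell centres.
    Returns a list of visit probabilities, one per cell, summing to 1.
    """
    def kernel(dx, dy):
        return exp(-(dx * dx + dy * dy) / (2 * bandwidth ** 2)) / (2 * pi * bandwidth ** 2)

    densities = [
        sum(kernel(cx - vx, cy - vy) for vx, vy in visits) / len(visits)
        for cx, cy in cells
    ]
    total = sum(densities)
    return [d / total for d in densities]
```

The online phase would then favour dummy cells whose probabilities make the resulting entropy high.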

Dapeng Zhao, Jiansong Ma, Xiaoling Wang, Xiuxia Tian
Detecting Data-model-oriented Anomalies in Parallel Business Process

Currently, most information systems are data intensive, and their data models exert a notable influence on business processes. However, most existing process verification methods leave out the impact of data models on process models. Meanwhile, as parallel structures in business processes multiply, business process structures are becoming increasingly intricate and large. A parallel structure also engenders uncertainty, and consequently increases the chances, and decreases the detectability, of anomalies occasioned by conflicts between process and data models. In this paper, these anomalies are analyzed and classified. A data state matrix and a data operation algebra are introduced to establish the relation between the parallel-process model and the data model. Then, an anomaly detection method under the divide-and-conquer framework is proposed to ensure efficiency in detecting anomalies in business processes. Both theoretical analysis and experimental results prove this method to be highly efficient and effective in detecting data-model-oriented anomalies.

Ning Yin, Shanshan Wang, Hongyan Li, Lilue Fan
Learning User Credibility on Aspects from Review Texts

Spammer detection, which aims at filtering out unfair or non-credible customers, has been widely studied in recent years. Most users have different backgrounds or preferences and thus give distinct reviews/ratings; however, they cannot all be treated as spammers. To date, existing spammer detection techniques have limited usability. In this paper, we propose a method to calculate user credibility along multiple dimensions by considering user differences related to their personalities, e.g., background and preference. First, we propose to evaluate customer credibility on aspects, taking into account the different concerns of different customers. A bootstrapping algorithm is applied to detect the intrinsic aspects of the review text, and the aspect ratings are assigned by mining semantic polarity. Then, an iterative algorithm is designed for estimating credibility by considering the consistency between individual ratings and overall ratings on aspects. Finally, experiments on a real dataset demonstrate that our method outperforms baseline systems.

Yifan Gao, Yuming Li, Yanhong Pan, Jiali Mao, Rong Zhang
Detecting Anomaly in Traffic Flow from Road Similarity Analysis

Taxis equipped with GPS devices can be considered 24-hour moving sensors widely distributed in urban road networks. Plenty of accurate, real-time taxi trajectories are recorded by GPS devices and are commonly studied for understanding traffic dynamics. This paper focuses on anomaly detection in traffic volume, especially non-recurrent traffic anomalies caused by unexpected or transient incidents, such as traffic accidents, celebrations and disasters. It is important to detect such sharp changes of traffic status in order to sense abnormal events and plan for their impact on traffic flow. Unlike existing anomaly detection approaches, which mainly monitor the deviation of the current traffic status from its history, the method proposed in this paper evaluates the anomaly score of traffic on one road by comparing its current traffic volume not only with its historical data but also with that of its neighbors. We define the neighbors as the roads that are close in the sense of both geo-location and traffic patterns, which are extracted by matrix factorization. The evaluation results on trajectory data of 12,286 taxis over four weeks in Beijing show that our approach outperforms other baseline methods with higher precision and recall.
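The core scoring idea, comparing a road's current volume against both its own history and its neighbors' current volumes, can be sketched with two combined z-scores (an illustrative simplification; the paper extracts neighbors via matrix factorization, which is not shown here):

```python
from statistics import mean, stdev

def anomaly_score(current, history, neighbour_currents, w=0.5):
    """Score a road's current traffic volume against (a) its own
    historical volumes and (b) the current volumes of similar
    ('neighbour') roads. Higher means more anomalous."""
    def zscore(x, sample):
        mu, sigma = mean(sample), stdev(sample)
        return abs(x - mu) / sigma if sigma > 0 else 0.0

    return w * zscore(current, history) + (1 - w) * zscore(current, neighbour_currents)
```

A road whose volume matches both references scores near zero; a sudden drop that its neighbors do not share scores high.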

Xinran Liu, Xingwu Liu, Yuanhong Wang, Juhua Pu, Xiangliang Zhang

Query Processing and Optimization

Frontmatter
PACOKS: Progressive Ant-Colony-Optimization-Based Keyword Search over Relational Databases

Keyword search over relational databases makes it easier to retrieve information from structured data. One solution is to first represent the relational data as a graph, and then find the minimum Steiner tree containing all the keywords by traversing the graph. However, existing work involves substantial costs, even for approaches based on heuristic algorithms, since the minimum Steiner tree problem is NP-hard. In order to reduce the response time of a single search to a low level, we propose a progressive ant-colony-optimization-based algorithm, called PACOKS, which converges to the best answer step by step, through the cooperation of a large number of searches over time, instead of in a one-step manner by a single search. In this way, the high cost of finding the best answer is shared among a large number of searches, so that low cost and fast response time for a single search are achieved. Extensive experimental results based on our prototype show that our method achieves better performance than state-of-the-art methods.

Ziyu Lin, Qian Xue, Yongxuan Lai
Enhanced Query Classification with Millions of Fine-Grained Topics

Query classification is a crucial task for understanding user search intents. Although this problem has been well studied in the past decades, it remains a big challenge in real-world applications due to the sparse, noisy and ambiguous nature of queries. In this paper, we present another important issue, called "the pomegranate phenomenon". This phenomenon is named for the gap between a manually manageable small taxonomy and the massive coherent topics in each category. Furthermore, the fine-grained topics in one category of the taxonomy may be textually more relevant to topics in other categories. This phenomenon hurts the performance of most traditional classification methods. To overcome this problem, we present a practical approach to enhance the performance of traditional query classifiers. First, we detect millions of fine-grained query topics from two years of click logs, which can represent different query intents, and give them category labels. Second, for a given query, we find the K most relevant topics and select a label by majority voting, then use this label to improve the results of classical query classification methods. Empirical evaluation confirms that our topic-based classification algorithms can significantly enhance the performance of traditional classifiers in real-world query classification tasks.

Qi Ye, Feng Wang, Bo Li, Zhimin Liu
A Hybrid Machine-Crowdsourcing Approach for Web Table Matching and Cleaning

Table matching and data cleaning are two crucial activities in integrating data from different web tables, and they have traditionally been considered separate activities. We show that data cleaning can effectively help discover table matches, and vice versa. In this paper, we study a hybrid machine-crowdsourcing approach that handles the two activities together with the help of a well-developed knowledge base. Understanding the semantics of tables is fundamental to both matching and cleaning. We select the most valuable columns for crowdsourced validation and infer the others by consolidating crowdsourcing results with machine-generated results. When resolving inconsistencies between data and semantics, relative trust is taken into account to validate data or semantics via the crowd. Our experimental results on real-life datasets show the effectiveness of the proposed approach for matching and cleaning web tables.

Chunhua Li, Pengpeng Zhao, Victor S. Sheng, Zhixu Li, Guanfeng Liu, Jian Wu, Zhiming Cui
An Update Method for Shortest Path Caching with Burst Paths Based on Sliding Windows

Caching shortest paths is an important problem, widely applicable to traffic networks, social networks, logistics networks, communication networks and other fields. So far, few existing methods have considered burst paths, which are common in real life. For instance, queries for a certain path can increase sharply due to a promotion or a vacation. Such burst queries usually last for a relatively short period, and their frequencies are too low for them to be loaded into caches by conventional methods. In this paper, we propose two methods: a basic and an incremental cache update method. Both methods are based on sliding windows. In particular, the incremental update method quantifies burst paths and updates the cache incrementally. Comprehensive experiments show that our methods surpass the current best algorithm, SPC, by more than 30% in terms of hit ratio on real road network data with burst paths.
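The sliding-window idea behind burst-path detection can be sketched with a fixed-length window of recent queries and in-window frequency counts (an illustrative simplification with a made-up burst threshold, not the paper's update methods):

```python
from collections import Counter, deque

class BurstDetector:
    """Track path queries inside a sliding window; a path is 'bursting'
    when its in-window frequency exceeds `ratio` times the per-path
    average frequency in the window."""

    def __init__(self, window, ratio=3.0):
        self.window = window      # number of most recent queries kept
        self.ratio = ratio
        self.events = deque()
        self.counts = Counter()

    def query(self, path):
        self.events.append(path)
        self.counts[path] += 1
        if len(self.events) > self.window:
            old = self.events.popleft()
            self.counts[old] -= 1
            if self.counts[old] == 0:
                del self.counts[old]

    def bursting(self):
        if not self.counts:
            return set()
        avg = len(self.events) / len(self.counts)
        return {p for p, c in self.counts.items() if c > self.ratio * avg}
```

A cache updater would then admit bursting paths even though their long-run frequency is low.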

Xiaohua Li, Ning Wang, Kanggui Peng, Xiaochun Yang, Ge Yu
Low Overhead Log Replication for Main Memory Database System

Log replication is a key component of highly available database systems. To guarantee data consistency and reliability, modern database systems often use the Paxos protocol to replicate the log across multiple database instance sites. Since the replicated logs need to contain metadata such as the committed log sequence number (LSN), replication increases storage and network overhead, which has a significantly negative impact on throughput under update-intensive workloads. In this paper, we present an implementation of log replication and database recovery that adopts the idea of piggybacking, i.e., the committed LSN is embedded in the commit logs. This practice not only retains the virtues of Paxos replication, but also reduces disk and network I/O effectively, which enhances performance and decreases recovery time. We implemented and evaluated our approach in a main memory database system (OceanBase), and found that our method offers 1.3x higher throughput than traditional log replication with a synchronization mechanism.

Jinwei Guo, Chendong Zhang, Peng Cai, Minqi Zhou, Aoying Zhou
Diversification of Keyword Query Result Patterns

Keyword search allows users to search for information on tree data without using a complex query language and without knowing the schema of the data sources. However, keyword queries are usually ambiguous in expressing the user intent. Most current keyword search approaches either filter or use a scoring function to rank the candidate result set. These techniques do not differentiate the results and might return to the user a result set that is not the one intended. To address this problem, we introduce in this paper an original approach for diversifying keyword search results on tree data, which aims at returning a subset of the candidate result set that trades off relevance for diversity. We formally define the problem of diversifying the patterns of keyword search results on tree data as an optimization problem. We introduce relevance and diversity measures on result pattern sets. We design a greedy heuristic algorithm that chooses the top-k most relevant and diverse result patterns for a given keyword query. Our experimental results show that the introduced relevance and diversity measures can be used effectively, and that our algorithm can efficiently compute a set of result patterns for keyword queries that is both relevant and diverse.
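A greedy heuristic of the kind the abstract describes typically picks, at each step, the candidate that best trades off its relevance against its distance to the patterns already selected. A minimal sketch of that scheme (generic maximal-marginal-relevance style, with illustrative names; the paper's own measures are not reproduced) is:

```python
def greedy_diverse_topk(candidates, relevance, distance, k, lam=0.5):
    """Greedily select k results trading off relevance for diversity.

    candidates: list of pattern ids.
    relevance:  dict id -> relevance score (higher is better).
    distance:   f(a, b) -> dissimilarity between two patterns.
    lam:        weight on relevance vs. diversity in [0, 1].
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def gain(c):
            # Diversity term: distance to the closest already-picked pattern.
            div = min((distance(c, s) for s in selected), default=0.0)
            return lam * relevance[c] + (1 - lam) * div
        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return selected
```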

Cem Aksoy, Ananya Dass, Dimitri Theodoratos, Xiaoying Wu
Efficient Approximate Substring Matching in Compressed String

The LZ77 self-index has been proposed for repetitive text in compressed form. Existing methods of approximate string matching based on LZ77 focus on space efficiency. We focus on how to efficiently search for similar strings in text without decompressing the whole text. We propose the RS-search algorithm, which efficiently merges all the occurrences of a substring to narrow down the potential region, and design novel filters to reduce the number of candidates. The experiments show that our algorithm achieves outstanding performance and an interesting time-space trade-off in approximate matching over compressed strings.

Yutong Han, Bin Wang, Xiaochun Yang
Top-K Similarity Search for Query-By-Humming

As an important mode of music retrieval, Query-By-Humming has gained wide attention because of its effectiveness and convenience. This paper proposes a novel top-k similarity search technique that provides fast retrieval for Query-By-Humming. We propose a distance function, MDTW, for multi-dimensional sequence matching, as well as a subsequence matching method, MDTW_sub. We show that the proposed method is highly applicable to music retrieval. In our paper, music pieces are represented by 2-dimensional time series, where the dimensions hold information about the pitch and duration of each note, respectively. In order to improve efficiency, we use inverted lists and the q-gram technique to process the music database, and the q-chunk technique to process the hummed piece. We then calculate the MDTW distances between hummed q-chunks and music q-grams to obtain the candidate music pieces and their sensitive data areas. We propose the TopK-Brute and TopK-LB algorithms to search for the top-k songs. The experimental results demonstrate both the efficiency and effectiveness of our approach.
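The abstract builds on dynamic time warping over multi-dimensional sequences. A minimal sketch of such a distance, classic DTW with a Euclidean ground distance over (pitch, duration)-style tuples (not the paper's MDTW definition, whose details are not given here), is:

```python
from math import inf, sqrt

def mdtw(a, b):
    """Dynamic time warping distance between two multi-dimensional
    sequences (lists of equal-length tuples), with Euclidean ground
    distance between elements."""
    def d(p, q):
        return sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

    n, m = len(a), len(b)
    dp = [[inf] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = d(a[i - 1], b[j - 1])
            # Warping: match, or stretch either sequence by one step.
            dp[i][j] = cost + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[n][m]
```

Because the warping path can repeat elements, a hummed note held slightly too long still aligns with a single note in the database at zero extra cost.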

Peipei Wang, Bin Wang, Shiying Luo

Social Media

Frontmatter
Restricted Boltzmann Machines for Retweeting Behaviours Prediction

With the information explosion on social networks, personalized recommendation is eagerly required to help users obtain interesting news or tweets within limited time. Since retweeting history reveals users' personal preferences to some degree, a retweeting behavior prediction system could feed users with messages according to their probability of being retweeted. In this paper, based on the neural network model called the Restricted Boltzmann Machine (RBM), we propose retweeting behavior prediction methods adapted to two scenarios: with or without detailed information about users and microblogs. When the dataset lacks detailed information, the prediction problem is treated as a collaborative filtering task and the RBM plays the role of an independent classifier. Otherwise, the RBM serves as a feature selector that detects the hidden similarity between users for a content-based model, the logistic regression (LR) model. Furthermore, users are clustered into different communities by our previously proposed community detection algorithm, and the community property is integrated into the RBM to improve its performance. Experimental results indicate that features extracted by the RBM improve performance by 3.79% in terms of F1-score compared with the basic LR model, and that the community property further improves the effectiveness of the RBM.

Xiang Li, Lijuan Xie, Yong Tan, Qiuli Tong
Cross-Collection Emotion Tagging for Online News

With the rapid development of the Internet and social media, online news has become an important type of information that attracts millions of readers to express their emotions. Therefore, it is of great significance to build an emotion classifier for online news. However, building an emotion classifier largely relies on a collection with sufficient labeled news, and the manual labeling work can be quite labor intensive. Moreover, different collections may cover different domains, such as politics or entertainment. Even in the same domain, different collections require different classifiers, since they have different emotion labels and feature distributions. In this paper, we focus on the task of cross-collection emotion tagging for online news. This task can be formulated as a transfer learning problem that utilizes a source collection with abundant labeled data and a target collection with limited labeled data in the same domain. We propose a novel method to transfer knowledge from the source collection to help build an emotion classifier for the target collection. Experimental results on four real datasets show that our method outperforms competitive baselines.

Li Yu, Xue Zhao, Chao Wang, Haiwei Zhang, Ying Zhang
Online News Emotion Prediction with Bidirectional LSTM

Recent years have brought a significant growth in the volume of user-generated data. Sentiment analysis is a crucial tool for mining such data, which is of great value both for improving particular services and for assisting organizations' decision-making processes. Existing research focuses on identifying sentiment polarity in subjective text, such as tweets and product reviews. Sentiment analysis on news still remains a challenge: identifying effective emotion-differentiated features automatically from the more objective content, and modeling the longer documents semantically. In this paper, we tackle this problem by improving document representations. At the word level, we implement the skip-gram model to learn word representations with rich contextual information. Moreover, we propose two document representation approaches based on neural networks. We first introduce a bidirectional long short-term memory (BLSTM) neural network to capture the complete contextual information in long news articles. In order to extract more salient information from a document, we integrate a convolutional neural network with the BLSTM to augment the document representations. Extensive experiments show the proposed model outperforms the other baseline methods.

Xue Zhao, Chao Wang, Zhifan Yang, Ying Zhang, Xiaojie Yuan
Learning for Search Results Diversification in Twitter

Diversifying retrieved results is an effective approach to addressing users' information needs in Twitter, which are typically described by a query phrase, are often ambiguous, and have more than one interpretation. Because tweets are often very short and lack a reliable grammatical style, the effectiveness of traditional IR and NLP techniques is reduced. However, Twitter, as a social medium, also presents interesting opportunities for this task (for example, author information such as the number of statuses). In this paper, we first address the diversification of search results in Twitter with a learning method and explore a series of diversity features describing the relationship between tweets, including tweet content, tweet sub-topics, and Twitter-specific social information such as hashtags. The experimental results on the Tweets2013 dataset demonstrate the effectiveness of the learning approach. Additionally, the Twitter retrieval task is improved by taking the diversity features into account. Finally, we find that the sub-topic and Twitter-specific social features, especially the post time, the hashtags of a tweet and the location of the author, can help solve the diversity task.

Ying Wang, Zhunchen Luo, Yang Yu
Adjustable Time-Window-Based Event Detection on Twitter

Twitter has become an important platform for reporting breaking news and instant events. However, it is almost impossible to detect events on Twitter manually due to the large volume of data and the noise in it. Although automatic event detection has been studied extensively, most works can only detect events in a fixed time window. In this paper, we propose an efficient system that can detect events in adjustable time windows. We detect terms with unusual frequency and group them into events. We further modify a segment tree data structure to support adjustable-time-window-based event detection, which can efficiently aggregate statistics of terms over varied-size time windows and is both space- and time-saving. We finally validate the effectiveness and efficiency of our proposed techniques through extensive experiments on real datasets.
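The segment-tree aggregation step described above can be illustrated with a plain range-sum segment tree over fixed time slots: a term's frequency in any window of slots is then an O(log n) query, regardless of the window size. This is a generic sketch, not the paper's modified structure:

```python
class SegmentTree:
    """Range-sum segment tree over n fixed time slots, so a term's
    frequency in any half-open slot window [lo, hi) is answered in
    O(log n), for any adjustable window size."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (2 * n)

    def add(self, slot, count):
        """Record `count` occurrences of the term in the given slot."""
        i = slot + self.n
        while i:
            self.tree[i] += count
            i //= 2

    def window_sum(self, lo, hi):
        """Total occurrences in slots [lo, hi)."""
        lo += self.n
        hi += self.n
        total = 0
        while lo < hi:
            if lo & 1:
                total += self.tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                total += self.tree[hi]
            lo //= 2
            hi //= 2
        return total
```

One such tree per tracked term (or a tree of per-slot counters) lets the detector compare a term's frequency across windows of different lengths without rescanning raw tweets.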

Qinyi Wang, Jieying She, Tianshu Song, Yongxin Tong, Lei Chen, Ke Xu
User-IBTM: An Online Framework for Hashtag Suggestion in Twitter

Twitter, the global social networking and microblogging service, allows registered users to post 140-character messages known as tweets. People use the hashtag symbol '#' before a relevant keyword or phrase in their tweets to categorize them and help them show up more easily in a Twitter search. However, very few tweets contain hashtags, which limits the quality of search results and their applications. Therefore, how to automatically generate or recommend hashtags has become a particularly important research problem. Although many attempts have been made to solve this problem, previous methods mostly do not take the dynamic nature of hashtags into consideration. Furthermore, most previous work focuses on exploiting the similarity between tweets and ignores the semantics in tweets. In this paper, we regard hashtags as the underlying topics of the tweets. We first introduce an effective method for discovering the latent topics of streaming tweets, which uses the recently proposed incremental biterm topic model (IBTM). Then, considering personalized preferences, we propose a novel model, namely online Twitter-User LDA, to learn each Twitter user's dynamic interests. Finally, we design an online hashtag suggestion framework called User-IBTM by combining the online Twitter-User LDA and IBTM. As shown in the experimental results on real-world data from Twitter, our framework outperforms several state-of-the-art methods and achieves satisfying performance (code is available at https://github.com/worldcodingNow/UserIBTM/tree/master).

Jia Li, Hua Xu
Unifying User and Message Clustering Information for Retweeting Behavior Prediction

Online social networks have increasingly become the dominant platform for information diffusion through users' retweeting behavior. Thus, understanding and predicting retweeting behavior in a given network is a challenging but important task. Existing studies only investigate the individual user and message for retweeting prediction. However, social influence and selection lead to the formation of groups, and this intrinsic and important factor has been neglected for this problem. In this paper, we propose a unified user and message clustering based approach for retweeting behavior prediction. We first cluster users and messages into different groups based on explicit and implicit factors together. Then we model the social clustering information as regularization terms introduced into the retweeting prediction framework, in order to reduce data sparsity and improve prediction accuracy. Finally, we employ a matrix factorization method to predict users' retweeting behavior. The experimental results on a real-world dataset demonstrate that our proposed method effectively increases the accuracy of retweeting behavior prediction compared to state-of-the-art methods.

Bo Jiang, Jiguang Liang, Ying Sha, Lihong Wang, Zhixin Kuang, Rui Li, Peng Li
KPCA-WT: An Efficient Framework for High Quality Microblog Extraction in Time-Frequency Domain

Massive numbers of social-event-relevant messages are generated on online social media, which makes filtering and screening a great challenge. In order to obtain messages of high quality, we propose a high-quality information extraction framework based on kernel principal component analysis and wavelet transformation (KPCA-WT). First, based on multi-feature fusion, we design an algorithm to extract microblogs of high quality, which transforms the features into the wavelet domain to capture the detailed differences between the feature signals. Then the weights of the features are evaluated by the EM algorithm and fused further to obtain a comprehensive value for each message. In addition, to reduce the effect of noisy features and speed up the computation, these features are processed through kernel principal component analysis before being transformed into the wavelet domain. Experimental results show that the proposed framework can extract information with higher quality and less redundancy, and greatly reduces the time consumption.

Min Peng, Xinyuan Dai, Kai Zhang, Guanyin Zeng, Jiahui Zhu, Shuang Ouyang, Qianqian Xie, Gang Tian

Big Data Analytics

Frontmatter
Active Learning Method for Constraint-Based Clustering Algorithms

Semi-supervised clustering aims to improve clustering performance with the help of user-provided side information, of which pairwise constraints have become one of the most studied types. According to previous studies, such constraints increase clustering performance, but the choice of constraints is critical: if the constraints are selected improperly, they may even degrade the clustering performance. In order to solve this problem, researchers have proposed learning methods to actively select the most informative pairwise constraints. In this paper, we present a new active learning method for selecting an informative constraint set, which significantly improves both the Explore phase and the Consolidate phase of the Min-Max algorithm. Experimental results on datasets from the UCI Machine Learning Repository, using MPCK-means as the underlying constraint-based semi-supervised clustering algorithm, show that the proposed algorithm has better performance.

Lijun Cai, Tinghao Yu, Tingqin He, Lei Chen, Meiqi Lin
An Effective Cluster Assignment Strategy for Large Time Series Data

The problem of clustering time series data is important for finding similar groups of time series, e.g., identifying people who share similar mobility by analyzing their spatio-temporal trajectory data as time series. YADING is one of the most recent and efficient methods for clustering large-scale time series data; it mainly consists of sampling, clustering, and assigning steps. Given a set of processed time series entities, in the sampling step YADING finds clusters with a density-based clustering method. Next, the remaining input data is assigned by computing the distance (or similarity) to the entities in the sampled data. A Sorted Neighbors Graph (SNG) data structure is used to prune the similarity computation over all possible pairs of entities. However, it does not guarantee choosing the sampled time series with lower density, and therefore results in a deterioration of accuracy. To resolve this issue, we propose a strategy to order the SNG keys with respect to the density of the clusters, which speeds up the selection of time series entities with lower density. Extensive experiments show that our method achieves higher accuracy in terms of NMI than the baseline YADING algorithm. The results suggest that the order of SNG keys should be the same as in the clustering phase. Furthermore, the findings also reveal interesting patterns in identifying density radii for clustering.

Damir Mirzanurov, Waqas Nawaz, JooYoung Lee, Qiang Qu
AdaWIRL: A Novel Bayesian Ranking Approach for Personal Big-Hit Paper Prediction

Predicting the most impactful (big-hit) paper among a researcher’s publications in advance, so that it can be well disseminated, not only has a large impact on individual academic success but also provides useful guidance to the research community. In this work, we tackle the following problem: given the corpus of a researcher’s publications over the previous few years, how can we effectively predict which paper will become the big-hit in the future? We explore a series of features that can drive a paper to become a big-hit, and design a novel Bayesian ranking algorithm, AdaWIRL (Adaptive Weighted Impact Ranking Learning), that leverages a weighted training schema and an adaptive timely false-correction strategy to predict big-hit papers. Experimental results on the large ArnetMiner dataset, with over 1.7 million authors and 2 million papers, demonstrate the effectiveness of AdaWIRL. Specifically, it correctly predicts over 78.3 % of all researchers’ big-hit papers and outperforms the compared regression and ranking algorithms, with average improvements of 5.8 % and 2.9 % respectively. Further analysis shows that temporal features are the best indicator of personal big-hit papers, while authorship and social features are less relevant. We also demonstrate a high correlation between the impact of a researcher’s future works and their similarity to the predicted big-hit paper.

Chuxu Zhang, Lu Yu, Jie Lu, Tao Zhou, Zi-Ke Zhang
Detecting Live Events by Mining Textual and Spatial-Temporal Features from Microblogs

As microblogging services on mobile devices become widely used, microblogs can be viewed as a kind of event sensor that perceives dynamic behaviors in the city. In particular, detecting live events in microblogs, such as mass gatherings and emergencies, can help us understand what happened from the point of view of the people who were present. To identify live events from a large number of short and noisy microblogs, this paper builds a generative probabilistic model named ST-LDA, which clusters microblogs with similar semantics, time, and location into the same topic, and then determines the live events among the topics with an HMM-based method. We conduct experiments on real microblogs from weibo.com. Experimental results show that our method detects live events more accurately and more completely than the LDA-based and TimeLDA-based methods.

Zhejun Zheng, Beihong Jin, Yanling Cui, Qiang Ji
A Label Correlation Based Weighting Feature Selection Approach for Multi-label Data

Exploiting label correlation is important for multi-label learning, where each instance is associated with a set of labels. However, most existing multi-label feature selection methods ignore label correlation. Therefore, we propose a label-correlation-based weighting feature selection approach for multi-label data, called MLLCWFS. It is a framework developed from traditional filtering feature selection methods for single-label data. To exploit label correlation, we compute the importance of each label in terms of mutual information and adopt three weighting strategies to evaluate the correlation between features and labels. Extensive experiments conducted on four benchmark data sets with two base classifiers demonstrate that our approach is superior to state-of-the-art feature selection algorithms for multi-label data.
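As a rough illustration of weighting features by label-wise mutual information — a toy formulation of our own, not the exact MLLCWFS weighting strategies — one can score each feature by the sum over labels of its mutual information with that label, weighted here by the label's entropy as a stand-in for label importance:

```python
from collections import Counter
from math import log2

def entropy(y):
    """Shannon entropy of a discrete label column."""
    n = len(y)
    return -sum(c / n * log2(c / n) for c in Counter(y).values())

def mutual_info(x, y):
    """Mutual information between two discrete columns of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log2((c / n) / (px[a] / n * py[b] / n))
               for (a, b), c in pxy.items())

def score_feature(x, label_cols):
    # weight each label's MI contribution by that label's entropy
    return sum(entropy(y) * mutual_info(x, y) for y in label_cols)
```

A feature that replicates one label scores high, while a feature independent of every label scores near zero.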

Lu Liu, Jing Zhang, Peipei Li, Yuhong Zhang, Xuegang Hu
Valuable Group Trajectory Pattern Mining Directed by Adaptable Value Measuring Model

Group trajectory pattern mining over large numbers of mobile customers is a practical task with broad applications. Usually the mining result is a set of patterns with their support. However, most of these patterns are not valuable to users, and it is difficult for users to go through all of them to find the valuable ones. In this paper, instead of just mining group trajectory patterns, we investigate how to mine the top valuable patterns for users, a problem not yet well solved due to two challenges. The first is how to estimate the value of trajectory patterns according to users’ requirements. The second is that the mined results contain redundant information, because many patterns share common sub-patterns. To address these challenges, we define an adaptable value measuring model that leverages multi-factor correlations in users’ requirements to estimate the value of trajectory patterns. To reduce redundant sub-patterns, we propose a new group trajectory pattern mining approach directed by this model. In addition, we extend and implement the algorithm as a parallel algorithm on a cloud computing platform to handle massive data. Experiments on real massive mobile data show the effectiveness and efficiency of the proposed approach.

Xinyu Huang, Tengjiao Wang, Shun Li, Wei Chen
DualPOS: A Semi-supervised Attribute Selection Approach for Symbolic Data Based on Rough Set Theory

Rough set theory, which supplies an effective model for representing uncertain knowledge, has been widely used in knowledge engineering and data mining; in particular, it has been applied with much success to attribute selection. However, current rough set approaches to attribute reduction are unsuitable for semi-supervised learning, because there is not enough labeled data to reliably calculate the dependency degree. We propose a new attribute selection strategy based on rough sets, called DualPOS. It provides a mutual function mechanism over multiple attributes and generates the most consistent one as a candidate. Experiments test the classification and clustering performance of the proposed algorithm, and the results show that DualPOS is valid for attribute selection in semi-supervised learning.
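The dependency degree mentioned above is the classical rough-set measure: the fraction of objects whose equivalence class under an attribute subset B is consistent on the decision attribute. A small self-contained sketch (our own illustrative encoding of a decision table as tuples, not the paper's DualPOS mechanism):

```python
from collections import defaultdict

def dependency_degree(table, B, d):
    """gamma_B(D) = |POS_B(D)| / |U|. Group rows by their values on the
    attribute subset B; a block lies in the positive region iff all of
    its rows agree on the decision attribute d."""
    blocks = defaultdict(list)
    for row in table:
        blocks[tuple(row[a] for a in B)].append(row[d])
    pos = sum(len(ds) for ds in blocks.values() if len(set(ds)) == 1)
    return pos / len(table)
```

Attribute reduction then searches for a minimal B whose dependency degree equals that of the full attribute set — exactly the quantity that becomes unreliable when few labeled objects are available.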

Jianhua Dai, Huifeng Han, Hu Hu, Qinghua Hu, Jinghong Zhang, Wentao Wang
Semi-supervised Clustering Based on Artificial Bee Colony Algorithm with Kernel Strategy

The Artificial Bee Colony (ABC) algorithm, which simulates the intelligent foraging behavior of a honey bee swarm, is a recently introduced optimization algorithm whose performance has proved very effective in many studies. In this paper, an ABC algorithm combined with a kernel strategy is proposed for semi-supervised clustering. The proposed strategy can make use of more background knowledge than traditional clustering methods and can handle non-spherical clusters of arbitrary shape. Several datasets, including 2D display data and UCI datasets, are used to test the performance of the proposed algorithm, and the experimental results indicate that it is effective.

Jianhua Dai, Huifeng Han, Hu Hu, Qinghua Hu, Bingjie Wei, Yuejun Yan

Distributed and Cloud Computing

Frontmatter
HMNRS: A Hierarchical Multi-source Name Resolution Service for the Industrial Internet

The concept of the Industrial Internet brings great potential for realizing intelligent products. To identify and locate the information objects associated with these smart things, an identifier-locator mapping system is one of the most widely recognized technologies enabling the product information network. However, existing approaches pay too little attention to locality. Whether they aggregate information at the initial manufacturer or at a random node computed by a consistent hashing algorithm, these proposals violate the autonomy of product information systems and cause high latency and communication costs. To solve this, we propose a hierarchical multi-source name resolution service based on a two-level mapping system: only the latest redirection is published at global scale, while detailed mapping records remain in each local subdomain. We further propose two resolution modes on top of this mapping system: a basic mode that always starts the resolution with an inter-domain name lookup, and an improved mode that is a locality-preference query mechanism. Theoretical analysis and simulation experiments show that as long as the percentage of local queries exceeds 20 %, the improved mode performs much better for trace queries.
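The two-level, locality-preference lookup can be sketched as follows (a dictionary-based toy of our own; the table names and structure are assumptions, not HMNRS's actual interfaces): try the local subdomain first, and only on a miss consult the global redirection table, which names the authoritative subdomain holding the detailed record.

```python
def resolve(name, local, global_index, domains):
    """Locality-preference resolution: check the local subdomain first,
    then follow the global redirection to the owning subdomain."""
    if name in local:                    # local hit: no inter-domain cost
        return local[name]
    subdomain = global_index.get(name)   # global table: name -> subdomain
    if subdomain is None:
        return None                      # unknown identifier
    return domains[subdomain].get(name)  # detailed record kept locally there
```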

Yang Liu, Guoqiang Fu, Xinchi Li
Optimizing Replica Exchange Strategy for Load Balancing in Multitenant Databases

Resource sharing in multitenant databases is a challenging issue. Contention for shared resources among tenants can adversely affect a tenant’s performance and may cause a performance crisis. In this paper, a performance crisis is mitigated by a dynamic load balancing mechanism based on exchanging the roles of tenants’ primary and secondary replicas. The mechanism has two parts: first, to balance resource utilization across servers, queries are dynamically allocated according to resource consumption; second, an improved simulated annealing algorithm identifies an optimal subset of tenants on overloaded servers, and for each tenant in this set a suitable secondary replica is selected to exchange roles with its primary replica to mitigate the crisis. Experimental results show a significant reduction in service level objective violations compared to a migration-based load balancing method and to no load balancing.

Teng Liu, Qingzhong Li, Lanju Kong, Lei Liu, Lizhen Cui
ERPC: An Edge-Resources Based Framework to Reduce Bandwidth Cost in the Personal Cloud

Personal cloud storage and file synchronization services, such as Dropbox, Google Drive, and Baidu Cloud, are increasingly prevalent within the Internet community; subscriptions to personal cloud storage were projected to hit 1.3 billion in 2017. To provide high data retrieval rates, cloud providers require huge amounts of bandwidth. As an attempt to reduce their bandwidth cost while guaranteeing quality of service, we propose a novel cloud framework based on distributed edge resources (i.e., voluntary peers in P2P networks and edge servers in content delivery networks). P2P techniques are widely considered an efficient way to distribute content, but not all content is suitable for them, so we first present an approach to select the content that is most profitable to place in P2P networks. Considering the unreliability of P2P networks, we use CDN servers to dynamically cache the content that cannot be well served by P2P peers. The proposed caching algorithm takes the state of P2P peers into account and considers replacement and allocation policies simultaneously. According to trace-driven simulations, ERPC achieves impressive performance in saving bandwidth for the cloud system while guaranteeing download rates for users.

Shaoduo Gan, Jie Yu, Xiaoling Li, Jun Ma, Lei Luo, Qingbo Wu, Shasha Li
Multidimensional Similarity Join Using MapReduce

Similarity join is arguably one of the most important operators in multidimensional data analysis tasks. However, processing a similarity join is costly, especially for large volumes of high-dimensional data. In this work, we process the similarity join on MapReduce so that the join computation can scale horizontally. To balance the workload among all MapReduce nodes, we systematically select the most profitable feature based on a novel data selectivity approach. Given the selected feature, we develop partitioning schemes for MapReduce processing under two different optimization goals. Our proposed techniques are extensively evaluated on real datasets.
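One standard way to realize the partitioning idea above — sketched here as a single-process stand-in for the map and reduce stages, with the fixed-width cell scheme as our own assumption rather than the paper's scheme — is to bucket points by cells of width ε along the selected dimension; only points in the same or adjacent cells can possibly be within ε of each other.

```python
import math
from collections import defaultdict

def similarity_join(points, dim, eps):
    """Bucket points by cells of width eps along the chosen dimension
    (the 'map' key); within each cell, compare members against the same
    cell and the next cell (the 'reduce' work). Returns id pairs whose
    Euclidean distance is at most eps."""
    cells = defaultdict(list)
    for pid, p in enumerate(points):
        cells[int(p[dim] // eps)].append((pid, p))
    pairs = set()
    for c in list(cells):
        # candidates: own cell plus the next cell (covers boundary pairs)
        candidates = cells[c] + cells.get(c + 1, [])
        for pid, p in cells[c]:
            for qid, q in candidates:
                if pid != qid and math.dist(p, q) <= eps:
                    pairs.add((min(pid, qid), max(pid, qid)))
    return pairs
```

In an actual MapReduce job the cell id would be the shuffle key, so each reducer performs the pairwise comparisons for one cell and its neighbor.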

Ye Li, Jian Wang, Leong Hou U
Real-Time Logo Recognition from Live Video Streams Using an Elastic Cloud Platform

Real-time logo recognition from a live video stream has promising commercial applications. For example, a sports video website broadcasting a live soccer match could show advertisements of brands when their logos appear in the video. Although logo recognition is a well-studied problem, the vast majority of previous work focuses on recognition accuracy, rather than system efficiency. Consequently, existing methods cannot recognize logos in real-time, especially when a large number of logos appear in the video. Motivated by this, we propose a general framework that converts an offline logo detection method to a real-time one, by utilizing the massive parallel processing capabilities of an elastic cloud platform. The main challenge is to obtain high scalability, meaning that logo recognition efficiency keeps improving as we add more computing resources, as well as elasticity, meaning that the resource allocation is guided by the current workload rather than the peak load. The proposed framework achieves these by balancing workload, elastically provisioning resources, minimizing communication overhead, and eliminating performance bottlenecks in the system. Experiments using real data demonstrate the high efficiency, scalability and elasticity of the proposed solution.

Jianbing Ding, Hongyang Chao, Mansheng Yang
Profit Based Two-Step Job Scheduling in Clouds

One of the critical challenges facing the cloud computing industry today is increasing the profitability of cloud services. In this paper, we deal with the problem of scheduling parallelizable batch jobs in commercial data centers to maximize cloud providers’ profit. We propose a novel and efficient two-step online scheduler. The first step ranks the arriving jobs by their inherent profitability to decide an eligible set and pre-allocates resources to them; the second step re-allocates resources among the waiting jobs in the eligible set, using a threshold profit-effectiveness ratio as a cut-off point, which is decided dynamically by solving an aggregated revenue maximization problem. Numerical experiments and simulations show that our approach is efficient for scheduling parallelizable batch jobs in clouds, and that our scheduler outperforms comparison schedulers based on classical heuristics from the literature.
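The cut-off idea can be illustrated with a toy ranking by profit-effectiveness ratio (profit per unit of resource demand). The job fields and the fixed threshold below are our own simplifications; in the paper the threshold is derived dynamically from a revenue maximization problem.

```python
def eligible_jobs(jobs, threshold):
    """jobs: list of (name, profit, resource_demand) tuples. Rank jobs
    by profit-effectiveness ratio and keep those at or above the
    threshold, most profitable-per-resource first."""
    ranked = sorted(jobs, key=lambda j: j[1] / j[2], reverse=True)
    return [name for name, profit, res in ranked
            if profit / res >= threshold]
```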

Shuo Zhang, Li Pan, Shijun Liu, Lei Wu, Xiangxu Meng
A Join Optimization Method for CPU/MIC Heterogeneous Systems

In recent years, heterogeneous systems consisting of general-purpose CPUs and many-core coprocessors have become the main trend in high-performance computing due to their powerful parallel computing capabilities and superior energy efficiency. Join is one of the most important operations in a database system. To effectively exploit each hardware component’s advantages in heterogeneous systems, this paper focuses on optimizing the join algorithm in a hybrid CPU/MIC system. We design a join method in which the CPU and the MIC work collaboratively to execute the join operation. To fully utilize the MIC’s parallel computing power, we also propose a Sort-Scatter-Join (SSJ) algorithm for the MIC to generate the join index. By turning the traditional process of comparison and matching into one of computing and scattering, SSJ benefits more from thread-level parallelism and SIMD data parallelism. Experimental results show that, compared with a traditional parallel sort-merge join, the peak performance of SSJ running on the MIC improves by around 26 %.
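For reference, the traditional sort-merge join that SSJ is compared against sorts both inputs on the join key and merges matching runs with two cursors. A compact single-threaded sketch (an illustrative baseline only, not the paper's SSJ algorithm):

```python
def sort_merge_join(R, S, key=lambda t: t[0]):
    """Sort both relations on the join key, then advance two cursors,
    emitting the cross product of each pair of equal-key runs."""
    R, S = sorted(R, key=key), sorted(S, key=key)
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        if key(R[i]) < key(S[j]):
            i += 1
        elif key(R[i]) > key(S[j]):
            j += 1
        else:
            k, j2 = key(R[i]), j
            # pair R[i] with the whole equal-key run in S
            while j2 < len(S) and key(S[j2]) == k:
                out.append((R[i], S[j2]))
                j2 += 1
            i += 1   # the next R tuple may match the same S run
    return out
```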

Kailai Zhou, Hong Chen, Hui Sun, Cuiping Li, Tianzhen Wu
GFSF: A Novel Similarity Join Method Based on Frequency Vector

String similarity join is widely used in many fields, e.g., data cleaning, web search, pattern recognition, and DNA sequence matching. In recent years, many similarity join methods have been proposed, for example Pass-Join, Ed-Join, and Trie-Join, among which the Pass-Join algorithm, based on edit distance, achieves much better overall performance than the others. However, Pass-Join cannot effectively filter candidate pairs that are only partially similar. Here we propose a novel algorithm called GFSF, which introduces two additional filtering steps based on character frequency vectors. In this way, the number of candidate pairs that are only partially similar is greatly reduced, which greatly reduces the total time of the string similarity join. Experimental results show that the overall performance of the proposed method is better than that of Pass-Join.
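Character-frequency filtering rests on a standard observation: a single edit operation changes a string's character-frequency vector by an L1 distance of at most 2 (a substitution decrements one count and increments another; an insertion or deletion changes one count), so any pair within edit distance τ has frequency vectors within L1 distance 2τ. A minimal sketch of such a filter (our own illustration of the general idea, not GFSF's exact two-step filter):

```python
from collections import Counter

def freq_filter(s, t, tau):
    """Return False if the pair (s, t) can be pruned: when the L1
    distance between the character frequency vectors exceeds 2*tau,
    the edit distance must exceed tau. Pairs that survive still need
    exact edit-distance verification."""
    cs, ct = Counter(s), Counter(t)
    l1 = sum(abs(cs[ch] - ct[ch]) for ch in set(cs) | set(ct))
    return l1 <= 2 * tau
```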

Ziyu Lin, Daowen Luo, Yongxuan Lai
Backmatter
Metadata
Title
Web-Age Information Management
Edited by
Bin Cui
Nan Zhang
Jianliang Xu
Xiang Lian
Dexi Liu
Copyright year
2016
Electronic ISBN
978-3-319-39958-4
Print ISBN
978-3-319-39957-7
DOI
https://doi.org/10.1007/978-3-319-39958-4
