
2016 | Book

Web-Age Information Management

17th International Conference, WAIM 2016, Nanchang, China, June 3-5, 2016, Proceedings, Part I


About this book

This two-volume set, LNCS 9658 and 9659, constitutes the thoroughly refereed proceedings of the 17th International Conference on Web-Age Information Management, WAIM 2016, held in Nanchang, China, in June 2016.

The 80 full research papers presented together with 8 demonstrations were carefully reviewed and selected from 266 submissions. The conference focuses on the following topics: data mining, spatial and temporal databases, recommender systems, graph data management, information retrieval, privacy and trust, query processing and optimization, social media, big data analytics, and distributed and cloud computing.

Table of Contents

Frontmatter
Erratum to: BMF: An Indexing Structure to Support Multi-element Check
Chenyang Xu, Qin Liu, Weixiong Rao

Data Mining

Frontmatter
More Efficient Algorithm for Mining Frequent Patterns with Multiple Minimum Supports

Frequent pattern mining (FPM) is an important data mining task with numerous applications. However, an important limitation of traditional FPM algorithms is that they rely on a single minimum support threshold to identify frequent patterns (FPs). As a solution, several algorithms have been proposed to mine FPs using multiple minimum supports. Nevertheless, a crucial problem is that these algorithms generally consume a large amount of memory and have long execution times. In this paper, we address this issue by introducing a novel algorithm for efficient discovery of Frequent Patterns with Multiple minimum supports from the Enumeration-tree (FP-ME). The proposed algorithm discovers FPs using a novel Set-Enumeration-tree structure with Multiple minimum supports (ME-tree), and employs a novel sorted downward closure (SDC) property of FPs with multiple minimum supports. The proposed algorithm discovers FPs directly from the ME-tree without generating candidates. Furthermore, an improved algorithm, named FP-ME_DiffSet, is also proposed based on the DiffSet concept, to further increase mining performance. Substantial experiments on real-life datasets show that the proposed approaches not only avoid the “rare item problem”, but also efficiently and effectively discover the complete set of FPs in transactional databases.

Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao
Efficient Mining of Uncertain Data for High-Utility Itemsets

High-utility itemset mining (HUIM) is emerging as an important research topic in data mining. Most algorithms for HUIM can only handle precise data; however, uncertainty is embedded in the big data collected from experimental measurements or noisy sensors in real-life applications. In this paper, an efficient algorithm, named Mining Uncertain data for High-Utility Itemsets (MUHUI), is proposed to efficiently discover potential high-utility itemsets (PHUIs) from uncertain data. Based on the probability-utility-list (PU-list) structure, the MUHUI algorithm directly mines PHUIs without candidate generation and reduces the construction of PU-lists for numerous unpromising itemsets by using several efficient pruning strategies, thus greatly improving mining performance. Extensive experiments on both real-life and synthetic datasets show that the proposed algorithm significantly outperforms the state-of-the-art PHUI-List algorithm in terms of efficiency and scalability; in particular, the MUHUI algorithm scales well on large-scale uncertain datasets for mining PHUIs.

Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong, Vincent S. Tseng
An Improved HMM Model for Sensing Data Predicting in WSN

Wireless sensor networks (WSNs) have been employed in numerous real-world applications. Data failure and noise reduction remain tough unsolved problems for WSNs. Prediction methods for data recovery based on empirical treatment, mostly statistical, have been studied extensively; machine learning models can greatly enhance prediction performance. In this paper, an improved HMM is proposed for multi-step prediction of wireless sensing data given historical data. The proposed model is based on clustering of wireless sensing data, and multi-step prediction is accordingly accomplished for different varying patterns using an HMM whose parameters are optimized by Particle Swarm Optimization (PSO). We evaluate our model on two real wireless sensing datasets, and comparisons with Naive Bayes, Grey System, BP Neural Networks and traditional HMMs are conducted. The experimental results show that our proposed model provides higher accuracy in predicting sensing data. The proposed model is promising in agriculture, industry and other domains in which the sensing data usually contain various varying patterns.

Zeyu Zhang, Bailong Deng, Siding Chen, Li Li
eXtreme Gradient Boosting for Identifying Individual Users Across Different Digital Devices

With the increasing popularity of tablets, smartphones and other mobile electronic devices, it is not uncommon for users to complete online tasks through different electronic devices. Identifying individual users across different digital devices is becoming a hot research topic. Methods based on name, email and other demographic information have received much attention; however, it is often difficult to obtain a complete set of such information. In this paper, we use a probabilistic approach to the cross-device identity problem and focus on comparing different algorithms. We conduct an in-depth study and expand the attributes of the data by studying the relationships between attributes. Dummy variables are introduced to improve the efficiency of the models. Experimental results on four datasets (released by the ICDM Challenge) show that eXtreme Gradient Boosting can consistently and significantly outperform other algorithms in both accuracy and F1-score. It also consistently outperforms the methods we used in the ICDM 2015 Challenge (in which we achieved a moderate ranking using C4.5 and BP models), and achieves a better comprehensive evaluation ranking.

Rongwei Song, Siding Chen, Bailong Deng, Li Li
Two-Phase Mining for Frequent Closed Episodes

The concept of episodes was introduced for discovering useful and interesting temporal patterns from sequential data. Over the years, many episode mining strategies have been suggested, which can be roughly classified into two classes: Apriori-based breadth-first algorithms and projection-based depth-first algorithms. Both kinds of algorithms are level-wise pattern growth methods, so they have high computational overhead due to level-wise growth iteration. In addition, their mining time increases with sequence length. In this paper, we propose a novel two-phase strategy to discover frequent closed episodes. In phase I, we present a level-wise shrinking mechanism, based on maximal duration episodes, to find candidate frequent closed episodes among the episodes with the same 2-neighboring episode prefix; in phase II, we compare the candidates with different prefixes to discover the final frequent closed episodes. The advantage of the suggested mining strategy is that it reduces mining time by narrowing the episode mapping range when performing closure judgment. Experiments on simulated and real datasets demonstrate that the suggested strategy is effective and efficient.

Guoqiong Liao, Xiaoting Yang, Sihong Xie, Philip S. Yu, Changxuan Wan
Effectively Updating High Utility Co-location Patterns in Evolving Spatial Databases

Spatial co-location mining discovers sets of spatial features that show frequent association relationships based on spatial neighborhoods. In spatial high utility co-location mining, utility should be considered as a measure of interest, since individual instances belonging to different features may have different values. This paper presents the problem of updating high utility co-locations on evolving spatial databases, which are updated with fresh data in some areas. Updating spatial patterns is a complicated process because fresh data introduce new neighbor relationships, and these new relationships can affect the result of high utility co-location mining. This paper proposes an algorithm for efficiently updating high utility co-locations and evaluates the algorithm through experiments.

Xiaoxuan Wang, Lizhen Wang, Junli Lu, Lihua Zhou
Mining Top-k Distinguishing Sequential Patterns with Flexible Gap Constraints

Distinguishing sequential pattern (DSP) mining has been widely employed in many applications, such as building classifiers and comparing/analyzing protein families. However, in previous studies on DSP mining, the gap constraints are very rigid: they are identical for all discovered patterns and at all positions in those patterns, in addition to being predetermined. This paper considers a more flexible way to handle gap constraints, allowing the gap constraints between different pairs of adjacent elements in a pattern to be different and allowing different patterns to use different gap constraints. The associated DSPs are called DSPs with flexible gap constraints. After discussing the importance of specifying/determining gap constraints flexibly in DSP mining, we present GepDSP, a heuristic mining method based on Gene Expression Programming, for mining DSPs with flexible gap constraints. Our empirical study on real-world datasets demonstrates that GepDSP is effective and efficient, and that DSPs with flexible gap constraints are more effective in capturing discriminating sequential patterns.

Chao Gao, Lei Duan, Guozhu Dong, Haiqing Zhang, Hao Yang, Changjie Tang
A Novel Chinese Text Mining Method for E-Commerce Review Spam Detection

Review spam is increasingly rampant in China, seriously hampering the development of the vigorous e-commerce market. In this paper, we propose a novel Chinese text mining method to detect review spam automatically and efficiently. We accurately extract keywords from complicated review text and conduct fine-grained analysis to recognize semantic orientation. We study spammers’ behavior patterns and propose four effective features to describe untruthful comments. We then train a classifier to classify reviews as spam or non-spam. Experiments demonstrate the excellent performance of our algorithm.

Xiu Li, Xinwei Yan

Spatial and Temporal Databases

Frontmatter
Retrieving Routes of Interest Over Road Networks

In this paper, we propose the route of interest (ROI) query, which allows users to specify their interests with query keywords and returns a route such that (i) its distance is less than a distance threshold and (ii) its relevance to the query keywords is maximized. The ROI query is particularly helpful for tourists and city explorers. For example, a tourist may wish to find a route from a scenic spot to her hotel that covers many artware shops. Efficiently answering ROI queries is challenging due to their NP-hard complexity. We propose a novel adaptive route sampling framework that computes a route within a given response time and gradually improves the quality of the route over time. Moreover, we design a suite of route sampling techniques under this framework. Experiments on real data suggest that our proposed solution can return high-quality routes within a short response time.

Wengen Li, Jiannong Cao, Jihong Guan, Man Lung Yiu, Shuigeng Zhou
Semantic-Aware Trajectory Compression with Urban Road Network

Vehicles are generating more and more trajectories, which are expensive to store and manage; this calls for vehicular trajectory compression techniques. We propose a semantic-aware compression framework that includes two steps: Map Matching (MM) and Semantic Trajectory Compression (STC). On the one hand, due to measurement errors, trajectories cannot be precisely mapped to real roads. In the MM step, we utilize multidimensional information (including distance and direction) to find the best-matching roads and generate aligned trajectories. On the other hand, some unnecessary points in trajectories can be removed based on the roads. In the STC step, we extract a set of crucial points from the aligned trajectory that capture the major driving semantics of the trajectory. Meanwhile, our decompression method is fairly lightweight and can efficiently support various applications. Empirical study on real datasets shows that MM achieves high matching quality, STC achieves a compression ratio of more than 8, and decompression is efficient.

Na Ta, Guoliang Li, Bole Chen, Jianhua Feng
Discovering Underground Roads from Trajectories Without Road Network

With the wide application of GPS-enabled electronic devices, huge amounts of positional data have been accumulated, making it critical to discover the knowledge inherent in such massive data. In this paper, we address this topic through two problems: how to discover underpasses that allow pedestrians to cross roads, and how to discover tunnels that provide passageways for vehicles. We propose a three-step framework to deal with these problems, consisting of an incremental clustering phase, a sub-trajectory detecting phase and a cluster filtering phase. Experiments on real-life datasets demonstrate the effectiveness and efficiency of the proposed framework.

Qiuge Song, Jiali Mao, Cheqing Jin
Ridesharing Recommendation: Whether and Where Should I Wait?

Ridesharing brings significant social and environmental benefits, e.g., saving energy and satisfying people’s commute demand. In this paper, we propose a recommendation framework to predict and recommend whether and where users should wait to rideshare. In the framework, we utilize a large-scale GPS dataset generated by over 7,000 taxis during one month in Nanjing, China to model the arrival patterns of occupied taxis from different sources. The underlying road network is first grouped into a number of road clusters, and GPS data are assigned to clusters according to where their sources are located. We then use a kernel density estimation approach to personalize the arrival pattern of taxis departing from each cluster, rather than using a universal distribution for all clusters. Given a query, we compute the potential for ridesharing and where the user should wait by investigating the probabilities of possible destinations based on the ridesharing requirements. Users are recommended to take a taxi directly if the potential to rideshare with others is not high enough. Experimental results show that the accuracy of the ridesharing decision and the ridesharing success ratio are respectively about 3 times and at most 40 % better than the naive “stay-as-where-you-are” strategy, and that about 500 users can save 4–8 min with our recommendation. Given 9 RMB as the starting taxi fare, and supposing users can save half of the total fare by ridesharing, users can save 10.828–44.062 RMB.

Chengcheng Dai
Keyword-aware Optimal Location Query in Road Network

In this paper, we study a very useful type of optimal location query, motivated by the following real application: for property renting or purchasing, a client often wants to find a residence such that the sum of the distances between the residence and its nearest facilities is minimized, while the residence lies on one of the client-selected road segments (representing where the client prefers to live). The facilities are categorized by keywords, e.g., school, hospital and supermarket, and in this problem one facility per category is required. To the best of our knowledge, this type of query has not been studied before. To tackle this problem, we propose a basic algorithm that divides roads (edges) into sub-intervals and finds the optimal locations by inspecting only the endpoints of the sub-intervals. We also propose an improved algorithm with keyword filtering and edge pruning strategies. Finally, we demonstrate the efficiency of our algorithms with extensive experiments on large-scale real datasets.

Jinling Bao, Xingshan Liu, Rui Zhou, Bin Wang
Point-of-Interest Recommendations by Unifying Multiple Correlations

In recent years, we have witnessed the development of location-based services that benefit users and businesses. This paper provides a unified framework for location-aware recommender systems that considers social, categorical and geographical influences on users’ preferences. In the framework, we model each of the three types of information as a function following a power-law distribution. We then unify the different information in one framework and learn the exact functions using gradient descent methods. Experimental results on real-world datasets show that our recommendations are more effective than baseline methods.

Ce Cheng, Jiajin Huang, Ning Zhong
Top-k Team Recommendation in Spatial Crowdsourcing

With the rapid development of the Mobile Internet and the Online To Offline (O2O) marketing model, various spatial crowdsourcing platforms, such as Gigwalk and Gmission, are getting popular. Most existing studies assume that spatial crowdsourced tasks are simple and trivial. However, many real crowdsourced tasks are complex and need to be collaboratively finished by a team of crowd workers with different skills. Therefore, an important issue for spatial crowdsourcing platforms is to recommend suitable teams of crowd workers that satisfy the skill requirements of a task. In this paper, to address this issue, we first propose a more practical problem, called the Top-k Team Recommendation in spatial crowdsourcing (TopkTR) problem. We prove that the TopkTR problem is NP-hard and design a two-level-based framework, which includes an approximation algorithm with a provable approximation ratio and an exact algorithm with pruning techniques. Finally, we verify the effectiveness and efficiency of the proposed methods through extensive experiments on real and synthetic datasets.

Dawei Gao, Yongxin Tong, Jieying She, Tianshu Song, Lei Chen, Ke Xu
Explicable Location Prediction Based on Preference Tensor Model

Location prediction has been attracting increasing interest from the data mining community. In the real world, however, to provide more targeted and more personal services, applications like location-aware advertising and route recommendation are interested not only in the predicted location but also in its explanation. In this paper, we investigate the problem of Explicable Location Prediction (ELP) from LBSN data, which is not easy due to the complexity of human mobility motivation and data sparsity. We propose a Preference Tensor Model (PTM) to address these challenges. The core component of PTM is a preference tensor, each cell of which represents how much a user prefers a specific place at a specific time point. An explicable location prediction can be made via a retrieval of the preference tensor, and meanwhile a motivation vector is generated as the explanation of the prediction. To model the complicated motivations of human movement, we propose two motivation tensors, a social tensor and a personal tensor, to represent the social and personal causes of human movement. From the motivation tensors, a motivation vector consisting of a social ingredient and a personal ingredient is produced. To deal with data sparsity, we propose a Social Tensor Decomposition Algorithm (STDA) and a Personal Tensor Decomposition Algorithm (PTDA), which fill missing values of a sparse social tensor and a sparse personal tensor, respectively. To achieve higher accuracy, STDA fuses an additional social constraint into the decomposition. Experiments conducted on real-world datasets verify the proposed model and algorithms.

Duoduo Zhang, Ning Yang, Yuchi Ma

Recommender Systems

Frontmatter
Random Partition Factorization Machines for Context-Aware Recommendations

Fusing the hierarchical information implied in contexts can significantly improve predictive accuracy in recommender systems. We propose Random Partition Factorization Machines (RPFM), which adopt random decision trees to split contexts hierarchically and better capture local complex interplay. The intuition is that locally homogeneous contexts tend to generate similar ratings. During prediction, our method goes from the root to the leaves and borrows predictions from higher levels when data at lower levels is sparse. Beyond rating estimation accuracy, RPFM also reduces over-fitting by building an ensemble model over multiple decision trees. We test RPFM on three benchmark contextual datasets. Experimental results demonstrate that RPFM outperforms state-of-the-art context-aware recommendation methods.

Shaoqing Wang, Cuilan Du, Kankan Zhao, Cuiping Li, Yangxi Li, Yang Zheng, Zheng Wang, Hong Chen
A Novel Framework to Process the Quantity and Quality of User Behavior Data in Recommender Systems

Recommender systems have become one of the most popular techniques for coping with information overload. In past years many algorithms have been proposed to obtain accurate recommendations. Such methods usually feed all the collected user data into learning models without careful consideration of the quantity and quality of individual users’ feedback. Yet in real applications, different types of users tend to express preferences and opinions in various ways, resulting in user data of radically diverse quantity and quality. This characteristic of the data influences recommendation performance. However, little attention has been devoted to managing the quantity and quality of user data in recommender systems. In this paper, we propose a generic framework that seamlessly exploits different pre-processing and recommendation approaches for the ratings of different users. More specifically, we first classify users into groups based on the quantity and quality of their behavior data. To handle the user groups differently, we further propose several data pre-processing strategies. Subsequently, we present a novel transfer latent factor model (TLMF) to transfer learnt models between groups. Finally, we conduct extensive experiments on a large dataset and demonstrate the effectiveness of our proposed framework.

Penghua Yu, Lanfen Lin, Yuangang Yao
RankMBPR: Rank-Aware Mutual Bayesian Personalized Ranking for Item Recommendation

Previous work indicated that pairwise methods are state-of-the-art approaches for fitting users’ tastes from implicit feedback. In this paper, we argue that constructing item pairwise samples for a fixed user is insufficient, because taste differences between two users with respect to the same item cannot be explicitly distinguished. Moreover, the rank positions of positive items are not used as a metric to measure the learning magnitude in the next step. Therefore, we first define a confidence function to dynamically control the learning step size for updating model parameters. Subsequently, we introduce a generic way to construct a mutual pairwise loss from both the users’ and the items’ perspectives. Instead of the user-oriented pairwise sampling strategy alone, we incorporate item pairwise samples into a popular pairwise learning framework, Bayesian personalized ranking (BPR), and propose the mutual Bayesian personalized ranking (MBPR) method. In addition, a rank-aware adaptive sampling strategy is proposed to arrive at the final approach, called RankMBPR. Empirical studies are carried out on four real-world datasets, and experimental results in several metrics demonstrate the efficiency and effectiveness of our proposed method compared with other baseline algorithms.

Lu Yu, Ge Zhou, Chuxu Zhang, Junming Huang, Chuang Liu, Zi-Ke Zhang
Unsupervised Expert Finding in Social Network for Personalized Recommendation

Personalized recommendation has drawn increasing attention in academia and industry, as it helps people filter out massive amounts of useless information. Several existing recommender techniques exploit social connections, i.e., friend or trust relations, as auxiliary information to improve recommendation accuracy. However, opinion leaders in each circle tend to have a greater impact on recommendation than friends with different tastes. We therefore devise two unsupervised methods to identify opinion leaders, whom we define as experts. In this paper, we incorporate the influence of experts into circle-based personalized recommendation. Specifically, we first build explicit and implicit social networks by utilizing users’ friendships and similarity, respectively. Then we identify experts on both social networks. Further, we propose a circle-based personalized recommendation approach that fuses experts’ influence into a matrix factorization technique. Extensive experiments conducted on two datasets demonstrate that our approach outperforms existing methods, particularly in handling the cold-start problem.

Junmei Ding, Yan Chen, Xin Li, Guiquan Liu, Aili Shen, Xiangfu Meng
An Approach for Clothing Recommendation Based on Multiple Image Attributes

Currently, many online shopping websites recommend commodities to users according to their purchase history and the behaviors of others with similar histories. Most recommendations are conducted by similarity search based on commodity tags. However, clothing purchases have some special characteristics: users usually don’t want to blindly follow the crowd and will not buy the same clothing twice. Moreover, text tags cannot express clothing features accurately enough. In this paper we propose a novel approach that extracts multiple features from images to analyze their content along different attributes for clothing recommendation. Specifically, a color matrix model is proposed to distinguish split-joint clothing. The ULBP feature is extracted to represent the fabric pattern attribute, and PHOG, Fourier, and GIST features are extracted to describe the collar and sleeve attributes. Classifiers are then trained to classify clothing fabric patterns and split-joint types. Experiments based on each attribute and their combinations have been conducted and have achieved satisfactory results.

Dandan Sha, Daling Wang, Xiangmin Zhou, Shi Feng, Yifei Zhang, Ge Yu
SocialFM: A Social Recommender System with Factorization Machines

The exponential growth of Web 2.0 makes social information an indispensable part of recommender systems for solving the cold-start and sparsity problems. Most existing matrix factorization (MF) based algorithms for social recommender systems factorize the rating matrix into two low-rank matrices. In this paper, we propose improved factorization machines (FMs) with social information, called SocialFM. Our approach effectively simulates influence propagation by estimating interactions between categorical variables and specifying the input feature vectors. We combine user trust values with similarity to compute the influence value between users. We also present social regularization and model regularization to impose constraints on the objective function. Our approach is a general method that can easily be extended to incorporate other context such as user mood, timestamp, location, etc. The experimental results show that our approach outperforms other state-of-the-art recommendation methods.

Juming Zhou, Dong Wang, Yue Ding, Litian Yin
Identifying Linked Data Datasets for sameAs Interlinking Using Recommendation Techniques

Due to the outstanding role of owl:sameAs as the most widely used linking predicate, this paper studies the problem of identifying potential Linked Data datasets for sameAs interlinking. We regard the problem as a recommender systems problem, and therefore employ several classical collaborative filtering techniques. The user-item matrix is constructed with rating values defined by the number of owl:sameAs RDF links between datasets from the Linked Open Data Cloud 2014 dump. Since the similarity measure is key for memory-based collaborative filtering methods, a novel dataset semantic similarity measure is proposed based on the vocabulary information extracted from datasets. We conducted experiments to evaluate the accuracy of both the predicted ratings and the recommended dataset lists of these recommenders. The experiments demonstrate that our customized recommenders outperform the original ones by a large margin and achieve much better metrics in both evaluations.

Haichi Liu, Ting Wang, Jintao Tang, Hong Ning, Dengping Wei, Songxian Xie, Peilei Liu
Query-Biased Multi-document Abstractive Summarization via Submodular Maximization Using Event Guidance

This paper proposes an abstractive multi-document summarization method. Given a document set, the system first generates sentence clusters through an event clustering algorithm using distributed representation. Each cluster is regarded as a subtopic of this set. Then we use a novel multi-sentence compression method to generate K-shortest paths for each cluster. Finally, some preferable paths are selected from these candidates to construct the final summary based on several customized submodular functions, which are designed to measure the summary quality from different perspectives. Experimental results on DUC 2005 and DUC 2007 datasets demonstrate that our method achieves better performance compared with the state-of-the-art systems.

Rui Sun, Zhenchao Wang, Yafeng Ren, Donghong Ji

Graph Data Management

Frontmatter
Inferring Diffusion Network on Incomplete Cascade Data

Inferring the diffusion network from observed cascades is a fundamental problem in the field of information diffusion on networks. All existing methods that aim to infer the network assume that the cascades are complete, without any missing nodes. In the real world, not every infection between nodes can be easily observed or acquired. As a result, real cascades contain missing nodes, which makes the observed cascades incomplete and poses additional challenges for this problem. Being able to recover the incomplete cascades is critical, since a network inferred from incomplete cascades can be inaccurate. In this paper, we tackle the problem with a two-stage framework, which first finds the paths that contain the missing nodes and then estimates the locations and infection times of the missing nodes. Experiments on real and synthetic data show the accuracy of our algorithm in finding the missing nodes on the network, as well as their infection times.

Peng Dou, Sizhen Du, Guojie Song
Anchor Link Prediction Using Topological Information in Social Networks

People today may participate in multiple social networks (Facebook, Twitter, Google+, etc.). Predicting the correspondence of accounts that refer to the same natural person across multiple social networks is a significant and challenging problem. Formally, social networks that outline the relationships of a common group of people are defined as aligned networks, and the correspondences of accounts that refer to the same natural person across aligned networks are defined as anchor links. In this paper, we study the problem of Anchor Link Prediction (ALP). First, two similarity metrics (Bi-Similarity BiS and Reliability Similarity ReS) are proposed to measure the similarity between nodes in aligned networks, and we prove mathematically that the node pair with the maximum BiS has a higher probability of being an anchor link and that a correctly predicted anchor link must have high ReS. Second, we present an iterative algorithm that solves the ALP problem efficiently, and discuss the termination of the algorithm to give a tradeoff between precision and recall. Finally, we conduct a series of experiments on both synthetic and real social networks to confirm the effectiveness of our approach.

Shuo Feng, Derong Shen, Yue Kou, Tiezheng Nie, Ge Yu
Collaborative Partitioning for Multiple Social Networks with Anchor Nodes

Plenty of individuals are getting involved in more than one social network and maintain relationships across several of them. The value behind the integrated information of multiple social networks is high; however, multiple social networks have been less studied. The work presented in this paper taps into the abundant information of multiple social networks and aims to resolve the initial-phase problem of multi-related social network analysis based on MapReduce: partitioning the multi-related social networks into non-intersecting subsets. To concretize our discussion, we propose a new multilevel framework (CPMN), which proceeds in four stages: the Merging Phase, Coarsening Phase, Initial Partitioning Phase, and Uncoarsening Phase. We propose a modified matching strategy in the second stage and a modified refinement algorithm in the fourth stage. We demonstrate the effectiveness of CPMN on both synthetic data and real datasets. Experiments show that the same node in different social networks is assigned to the same partition 100 % of the time without sacrificing load balance or edge-cut too much. We believe that our work will shed light on the study of multiple social networks based on MapReduce.

Fenglan Li, Anming Ji, Songchang Jin, Shuqiang Yang, Qiang Liu
A General Framework for Graph Matching and Its Application in Ontology Matching

Graph matching (GM) is a fundamental problem in computer science. Two issues severely limit the application of GM algorithms. (1) Due to its NP-hard nature, providing a good approximate solution for the GM problem is challenging. (2) On large-scale data, existing GM algorithms can only process graphs with several hundred nodes. We propose a matching framework that contains nine different objective functions for describing, constraining, and optimizing GM problems. By holistically utilizing these objective functions, we provide approximate GM solutions. Moreover, a fragmenting method for large GM problems is introduced into our framework, which increases the scalability of the GM algorithm. The experimental results show that the proposed framework improves accuracy compared to other methods. The experiment with the fragmenting method unveils an innovative application of GM algorithms to ontology matching: it achieves the best performance in matching two large real-world ontologies compared to existing approaches.

Yuda Zang, Jianyong Wang, Xuan Zhu
Internet Traffic Analysis in a Large University Town: A Graphical and Clustering Approach

Campus networks consist of a rich diversity of end hosts, including wired desktops, servers, and wireless BYOD devices such as laptops and smartphones, which are often compromised in insecure networks. Making sense of the traffic behaviors of end hosts in campus networks is a daunting task due to the open nature of the network, heterogeneous devices, high mobility of end users, and a wide range of applications. To address these challenges, this paper applies a combination of graphical approaches and spectral clustering to group the Internet traffic of campus networks into distinctive traffic clusters in a divide-and-conquer manner. Specifically, we first model the data communication between a particular subnet of the campus network and the Internet on a specific application port via bipartite graphs, and subsequently use the one-mode projection to capture the behavior similarity of end hosts in the same subnet for the same network applications. Finally, we apply a spectral clustering algorithm that exploits this behavior similarity to identify distinctive application clusters within each subnet. Our experimental results demonstrate the benefits of the proposed method for analyzing the Internet traffic of a large university town, discovering anomalous behaviors, and uncovering distinctive temporal and spatial traffic patterns.
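The one-mode projection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact construction: the abstract does not specify the edge weighting, so Jaccard similarity of destination sets is assumed here as the host-to-host similarity.

```python
from collections import defaultdict
from itertools import combinations

def one_mode_projection(edges):
    """Project a bipartite graph of (host, destination) edges onto hosts.

    The edge weight between two hosts is the Jaccard similarity of their
    destination sets, capturing behavior similarity for later clustering.
    """
    neighbors = defaultdict(set)
    for host, dest in edges:
        neighbors[host].add(dest)
    proj = {}
    for a, b in combinations(sorted(neighbors), 2):
        shared = neighbors[a] & neighbors[b]
        if shared:
            union = neighbors[a] | neighbors[b]
            proj[(a, b)] = len(shared) / len(union)
    return proj

# Hypothetical mini-trace: h1 and h2 contact the same destinations.
edges = [("h1", "d1"), ("h1", "d2"), ("h2", "d1"), ("h2", "d2"), ("h3", "d9")]
print(one_mode_projection(edges))  # {('h1', 'h2'): 1.0}
```

The resulting weighted host-host graph is the natural input to a spectral clustering algorithm, which groups hosts with similar application behavior.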

Weitao Weng, Kai Lei, Kuai Xu, Xiaoyou Liu, Tao Sun
Conceptual Sentence Embeddings

Most sentence embedding models represent each sentence using only surface words, which leaves these models unable to discriminate between the ubiquitous cases of homonymy and polysemy. To enhance discriminativeness, we employ a conceptualization model to assign associated concepts to each sentence in the text corpus, and learn conceptual sentence embeddings (CSE). The resulting sentence representations are more expressive than those of some widely used document representation models such as latent topic models, especially for short text. In the experiments, we evaluate the CSE models on two tasks, text classification and information retrieval. The experimental results show that the proposed models outperform typical sentence embedding models.

Yashen Wang, Heyan Huang, Chong Feng, Qiang Zhou, Jiahui Gu
Inferring Social Roles of Mobile Users Based on Communication Behaviors

Social roles of mobile users have widespread applications. However, most users' social roles and other personal information are missing due to privacy and other reasons, which makes it difficult to infer users' social roles precisely. Though mobile operators lack information about users' social roles, they have mobile communication data that records users' communication behaviors. Since users with the same social role have similar communication behaviors, it is possible to infer users' social roles from those behaviors. This paper studies the problem of inferring the social roles of mobile users from their communication behaviors. A Mobile Communication Behaviors based framework (MCB) is proposed to infer the social roles of mobile users. MCB solves the difficulties of inferring users' social roles with few labeled users, inaccurate label information, and sparse user feature information. Our study is based on a real-world, large mobile communication dataset, and the experiments show the accuracy and effectiveness of the method.

Yipeng Chen, Hongyan Li, Jinbo Zhang, Gaoshan Miao
Sparse Topical Coding with Sparse Groups

Learning a latent semantic representation from large short-text corpora is of profound practical significance in research and engineering. However, it is difficult to use standard topic models in microblogging environments, since microblogs have short length, large volume, snarled noise, and irregular modality characteristics, which prevent topic models from using the full information of microblogs. In this paper, we propose a novel non-probabilistic topic model called sparse topical coding with sparse groups (STCSG), which is capable of discovering sparse latent semantic representations of large short-text corpora. STCSG relaxes the normalization constraint of the inferred representations with the sparse group lasso, a sparsity-inducing regularizer, which makes it convenient to directly control the sparsity of document, topic, and word codes. Furthermore, the relaxed non-probabilistic STCSG can be effectively learned with the alternating direction method of multipliers (ADMM). Our experimental results on a Twitter dataset demonstrate that STCSG performs well in finding meaningful latent representations of short documents. Therefore, it can substantially improve the accuracy and efficiency of document classification.
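The sparse group lasso regularizer named above has a well-known proximal operator that ADMM-style solvers apply at each iteration: elementwise soft-thresholding followed by groupwise shrinkage. The sketch below shows that generic step only; it assumes the standard penalty form lam1*||x||_1 + lam2*sum_g ||x_g||_2 and is not taken from the STCSG paper itself.

```python
import math

def prox_sparse_group_lasso(v, groups, lam1, lam2):
    """Proximal operator of the sparse group lasso penalty.

    Step 1: elementwise soft-threshold by lam1 (promotes entry sparsity).
    Step 2: shrink each group's l2 norm by lam2 (promotes group sparsity).
    """
    out = [math.copysign(max(abs(x) - lam1, 0.0), x) for x in v]
    for g in groups:
        norm = math.sqrt(sum(out[i] ** 2 for i in g))
        scale = max(1 - lam2 / norm, 0.0) if norm > 0 else 0.0
        for i in g:
            out[i] *= scale
    return out

# Hypothetical code vector with two groups (e.g. word codes of two topics).
codes = prox_sparse_group_lasso([3.0, 0.5, -2.0, 0.1],
                                groups=[[0, 1], [2, 3]],
                                lam1=1.0, lam2=0.5)
print(codes)  # [1.5, 0.0, -0.5, 0.0]
```

Entries below lam1 are zeroed outright, and whole groups are driven toward zero together, which is exactly the document/topic/word-level sparsity control the abstract refers to.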

Min Peng, Qianqian Xie, Jiajia Huang, Jiahui Zhu, Shuang Ouyang, Jimin Huang, Gang Tian

Information Retrieval

Frontmatter
Differential Evolution-Based Fusion for Results Diversification of Web Search

Results diversification has been a key research issue in web search over the last couple of years. Some recent research suggests that data fusion, especially linear combination of multiple result lists, is a good option for dealing with this problem. However, there are many different ways of setting the weights. In this paper, we propose a differential evolution-based method to find optimal weights in the weight space for the linear combination method. Experimental results show that the proposed method is effective compared with state-of-the-art techniques.

Chunlin Xu, Chunlan Huang, Shengli Wu
BMF: An Indexing Structure to Support Multi-element Check

The standard Bloom filter is an efficient structure for checking element membership with low space cost and a low false positive rate. However, the standard Bloom filter assumes that all elements belong to a single set. Given multiple sets of elements, it cannot efficiently check whether or not multiple input elements belong to the same set, a task we call multi-element check. To support the multi-element check, in this paper we design a new data structure, the Bloom Multi-filter (BMF). BMF maintains an array of integers to support (1) the insertion of multiple sets of elements into the BMF and (2) lookups that answer the multi-element check. We propose four techniques to improve the BMF and optimize its false positive rate. We conducted intensive experiments to study the tradeoff between BMF's space cost and lookup precision. Our experimental results indicate that BMF greatly outperforms standard Bloom filters, achieving a false positive rate around 9.82 times lower.
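As a rough illustration of the multi-element check idea only (the paper's actual BMF encoding of its integer array, and its four optimizations, are not reproduced here), the toy sketch below tags each filter cell with a bitmask of candidate set IDs, so a lookup can ask whether several elements plausibly share a set. Like a Bloom filter, it can return false positives but never false negatives.

```python
import hashlib

class ToyBloomMultiFilter:
    """Illustrative sketch of multi-element check over multiple sets."""

    def __init__(self, m=64, k=3):
        self.m, self.k = m, k
        self.cells = [0] * m  # each cell holds a bitmask of set IDs

    def _positions(self, element):
        # k hash positions derived from a salted cryptographic hash.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{element}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def insert(self, element, set_id):
        for p in self._positions(element):
            self.cells[p] |= 1 << set_id

    def candidate_sets(self, element):
        mask = ~0
        for p in self._positions(element):
            mask &= self.cells[p]
        return mask

    def same_set(self, elements):
        """True if some set ID is a candidate for every queried element."""
        mask = ~0
        for e in elements:
            mask &= self.candidate_sets(e)
        return mask != 0

bmf = ToyBloomMultiFilter()
bmf.insert("a", set_id=1)
bmf.insert("b", set_id=1)
bmf.insert("z", set_id=2)
```

Elements inserted under the same set ID are always reported together; a false positive occurs only when unrelated elements' hash positions all collide on a common set bit.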

Chenyang Xu, Qin Liu, Weixiong Rao
Efficient Unique Column Combinations Discovery Based on Data Distribution

Discovering all unique column combinations in a relation is a fundamental research problem for modern data management and knowledge discovery applications. With the rapid growth of data volume and the popularity of distributed platforms, several algorithms attempt to discover uniques in large-scale datasets. However, their performance is not always satisfactory on datasets that have few unique values per column. This paper proposes a parallel algorithm to discover unique column combinations in large-scale datasets on Hadoop. We first construct a prefix tree depicting all unique candidates. Then we parallelize the verification of candidates in the same layer of the prefix tree. Two parallel strategies can be chosen: one parallelizes across all subtrees, the other parallelizes within a single subtree. The parallel strategies and pruning methods are self-adaptive based on the data distribution. Finally, experimental results demonstrate the advantages of the proposed method.
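The underlying verification and pruning logic can be sketched sequentially. This is a minimal single-machine illustration of layer-by-layer candidate checking with superset pruning, not the paper's Hadoop algorithm or its adaptive strategies.

```python
from itertools import combinations

def is_unique(rows, cols):
    """A column combination is unique iff its projection has no duplicates."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in cols)
        if key in seen:
            return False
        seen.add(key)
    return True

def minimal_uniques(rows, num_cols):
    """Level-wise search of the column lattice, pruning supersets of
    already-discovered uniques (the candidates of each level correspond
    to one layer of the prefix tree)."""
    uniques = []
    for size in range(1, num_cols + 1):
        for combo in combinations(range(num_cols), size):
            if any(set(u) <= set(combo) for u in uniques):
                continue  # prune: a subset is already unique
            if is_unique(rows, combo):
                uniques.append(combo)
    return uniques

rows = [(1, "a", "x"), (1, "b", "x"), (2, "a", "y")]
print(minimal_uniques(rows, 3))  # [(0, 1), (1, 2)]
```

In the distributed setting, the candidate checks inside one level (or one subtree) are independent of each other, which is precisely what makes them amenable to parallel verification.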

Chao Wang, Shupeng Han, Xiangrui Cai, Haiwei Zhang, Yanlong Wen
Event Related Document Retrieval Based on Bipartite Graph

Given a short event name, event retrieval is the process of retrieving event-related documents from a document collection. Existing approaches employ state-of-the-art retrieval models to retrieve relevant documents; however, these methods regard the input query merely as several keywords instead of an event, so the special aspects of the event are not considered in the models. Aiming at this problem, we first propose a novel bipartite graph model to describe an event, where one bipartition represents the event type and the other represents the event-specific information. Each edge between the two bipartitions indicates a co-occurrence relationship. We then model an event with a unigram language model estimated from the corresponding bipartite graph. Based on the KL-divergence retrieval framework, the event model is integrated into the query model for a more accurate query representation. Experiments on publicly available TREC datasets show that our method can improve the precision@N of event retrieval.
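The KL-divergence retrieval framework itself scores a document by how well its smoothed language model explains the query model. The sketch below shows that generic scoring step with Dirichlet smoothing; the query model here is a hypothetical stand-in for the event-enriched query model the paper estimates from its bipartite graph.

```python
import math
from collections import Counter

def kl_score(query_model, doc_tokens, collection_model, mu=10.0):
    """Rank-equivalent KL-divergence score: sum over query terms of
    p(w|Q) * log p(w|D), with a Dirichlet-smoothed document model."""
    tf = Counter(doc_tokens)
    dl = len(doc_tokens)
    score = 0.0
    for w, pq in query_model.items():
        pd = (tf[w] + mu * collection_model.get(w, 1e-9)) / (dl + mu)
        score += pq * math.log(pd)
    return score

# Hypothetical event-aware query model for an "earthquake" event.
query_model = {"earthquake": 0.6, "japan": 0.4}
collection_model = {"earthquake": 0.01, "japan": 0.01}
doc_on_topic = ["earthquake", "hits", "japan", "report"]
doc_off_topic = ["weather", "report", "sunny", "today"]
s1 = kl_score(query_model, doc_on_topic, collection_model)
s2 = kl_score(query_model, doc_off_topic, collection_model)
```

Because only the query-model cross-entropy term varies across documents, ranking by this score is equivalent to ranking by negative KL divergence between the query and document models.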

Wenjing Yang, Rui Li, Peng Li, Meilin Zhou, Bin Wang
SPedia: A Semantics Based Repository of Scientific Publications Data

The number of scientific publications, issued by many different publishers, is increasing noticeably. Springer is one such publisher, having published more than nine million scientific documents. SpringerLink is the portal providing the gateway for searching and accessing these published documents. The structure, as well as the way the contents are presented on the portal, provides valuable metadata about documents, such as authors, ISBNs, references, articles, and chapters. However, this metadata is presented for human consumption, supporting only keyword-based searches through the SpringerLink portal. At the same time, this huge body of data about scientific documents lies dormant, as it is neither open nor linked to other datasets. To address these issues, we have created a semantics-based repository called SPedia, which consists of semantically enriched data about documents published by Springer. Currently, SPedia's datasets comprise more than 300 million RDF triples. In this paper, we describe SPedia and examine the quality of its extracted data by performing semantically enriched queries. The results show that SPedia enables users to pose sophisticated queries using Semantic Web techniques instead of keyword-based searches. In addition, SPedia's datasets can be linked to other datasets available in the Linked Open Data (LOD) cloud.

Muhammad Ahtisham Aslam, Naif Radi Aljohani
A Set-Based Training Query Classification Approach for Twitter Search

Learning to rank is a popular technique for building a ranking model for Twitter search by utilizing a rich list of features. As most learning-to-rank algorithms are supervised, their effectiveness is heavily affected by the quality of the labeled training data. Selecting training queries of high quality is thus an important means of improving the effectiveness of a ranking model for Twitter search. The existing approach to this problem learns a query quality classifier that estimates training query quality on a per-query basis but ignores the dependence between queries. This paper proposes a set-based training query classification approach that estimates a training query's quality by taking into consideration its usefulness in combination with other training queries. Evaluation on the standard TREC Microblog track test collection shows the effective retrieval performance brought by the proposed approach.

Qingli Ma, Ben He, Jungang Xu, Bin Wang
NBLucene: Flexible and Efficient Open Source Search Engine

The most popular open source projects for text searching have been designed to support many features. These projects are well written in Java for cross-platform use. When conducting research, however, the execution efficiency of the program is more essential, which is a problem for applications written in Java. It is also difficult for Java to use the parallel mechanisms of modern computer systems, such as SIMD and GPUs. To this end, we extend an open source text searching project written in C++ for research purposes. Our approach is to define a flexible and efficient search engine architecture consisting of extensible application programming interfaces. We aim to provide a flexible architecture that enables researchers to readily implement and modify search engine algorithms and strategies. Moreover, we integrate a generic mathematical encoding library that can be used to compress the inverted index. We also implement an integral framework for result summarization, including snippet generation and caching strategies. Experimental results show that the new architecture achieves a significant improvement over the original work.

Zhaohua Zhang, Benjun Ye, Jiayi Huang, Rebecca Stones, Gang Wang, Xiaoguang Liu
Context-Aware Entity Summarization

Entity summarization aims at selecting a small subset of the attribute-value pairs of an entity from a knowledge graph, providing users with concrete information for an entity-related query. However, previous approaches focus on the "goodness" of the attribute-value pairs, paying little attention to user preference towards them. In this paper, we formalize the task of context-aware entity summarization and propose an algorithm to solve this problem. We model user interest by mining the latent topics in a query log dataset. A modified Personalized PageRank algorithm is utilized to rank attribute-value pairs by leveraging three elements: relevance, informativeness, and topic coherence. We evaluate our approach on real-world datasets and show that it outperforms the state-of-the-art approaches.
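Personalized PageRank, the base ranking machinery named above, can be sketched with a plain power iteration. The mini-graph below is hypothetical; the paper's modified variant and its three-element edge weighting are not reproduced here.

```python
def personalized_pagerank(graph, restart, alpha=0.15, iters=100):
    """Power iteration for Personalized PageRank: with probability alpha,
    jump back to the preference (restart) distribution; otherwise follow
    a uniformly chosen out-edge."""
    nodes = list(graph)
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: alpha * restart.get(n, 0.0) for n in nodes}
        for n in nodes:
            out = graph[n]
            if out:
                share = (1 - alpha) * rank.get(n, 0.0) / len(out)
                for m in out:
                    nxt[m] = nxt.get(m, 0.0) + share
        rank = nxt
    return rank

# Hypothetical mini-graph: a topic node "t" linked to two
# attribute-value nodes of an entity.
graph = {"t": ["av1", "av2"], "av1": ["t"], "av2": []}
rank = personalized_pagerank(graph, restart={"t": 1.0})
```

Seeding the restart distribution with the user's inferred topics biases the stationary scores toward attribute-value pairs reachable from those topics, which is how topic coherence can influence the final ranking.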

Jihong Yan, Yanhua Wang, Ming Gao, Aoying Zhou
Backmatter
Metadata
Title
Web-Age Information Management
Editors
Bin Cui
Nan Zhang
Jianliang Xu
Xiang Lian
Dexi Liu
Copyright Year
2016
Electronic ISBN
978-3-319-39937-9
Print ISBN
978-3-319-39936-2
DOI
https://doi.org/10.1007/978-3-319-39937-9