Skip to main content

2015 | Buch

Web-Age Information Management

16th International Conference, WAIM 2015, Qingdao, China, June 8-10, 2015. Proceedings

insite
SUCHEN

Über dieses Buch

This book constitutes the refereed proceedings of the 16th International Conference on Web-Age Information Management, WAIM 2015, held in Qingdao, China, in June 2015. The 33 full research papers, 31 short research papers, and 6 demonstrations were carefully reviewed and selected from 164 submissions. The focus of the conference is on following topics: advanced database and web applications, big data analytics big data management, caching and replication, cloud computing, content management, crowdsourcing data and information quality, data management for mobile and pervasive computing, data management on new hardware, data mining, data provenance and workflow, data warehousing and OLAP, deep web, digital libraries, entity resolution and entity linking and graph data management and RDF.

Inhaltsverzeichnis

Frontmatter
Erratum to: Web-Age Information Management

Erratum to: J. Li and Y. Sun (Eds.) Web-Age Information Management,

DOI:10.1007/978-3-319-21042-1

In an earlier print and online version of this volume, two editors were not included on the inner title pages and cover. This has been corrected.

The names of the editors are as follows:

Xin Luna Dong, Xiaohui Yu, Jian Li, and Yizhou Sun

Xin Luna Dong, Xiaohui Yu, Jian Li, Yizhou Sun

Graph and Social Network

Frontmatter
An Influence Field Perspective on Predicting User’s Retweeting Behavior

User’s retweeting behavior, which is the key mechanism for information diffusion in the micro-blogging systems, has been widely employed as an important profile for personalized recommendation and many other tasks. Retweeting prediction is of great significance. In this paper, we believe that user’s retweeting behavior is synthetically caused by the influence from other users and the post. By analogy with the concept of electric field in physics, we propose a new conception named “influence field” which is able to incorporate different types of potential influence. Based on this conception, we provide a novel approach to predict user’s retweeting behavior. The experimental results demonstrate the effectiveness of our approach.

Yi Shen, Jianjun Yu, Kejun Dong, Juan Zhao, Kai Nan
Realizing Impact Sourcing by Adaptive Gold Questions: A Socially Responsible Measure for Workers’ Trustworthiness

In recent years, crowd sourcing has emerged as a good solution for digitizing voluminous tasks. What’s more, it offers a

social

solution promising to extend economic opportunities to low-income countries, alleviating the welfare of poor, honest and yet uneducated labor. On the other hand, crowd sourcing’s virtual nature and anonymity encourages fraudulent workers to misuse the service for quick and easy monetary gain. This in turn compromises the quality of results, and forces task providers to employ strict control measures like gold questions or majority voting, which may gravely misjudge honest workers with lower skills, ultimately discarding them from the labor pool. Thus, the problem of fairly distinguishing between fraudulent and honest workers lacking educational skills becomes vital for supporting the vision of Impact Sourcing and its social responsibility. We develop a technique with socially responsible gold questions as an objective measure of workers’ trustworthiness, rather than a mere discarding mechanism. Our statistical model aligns workers’ skill levels and questions’ difficulty levels, which then allows adapting the gold questions’ difficulty for a fair judgment. Moreover, we illustrate how low-skilled workers’ initial payloads, which are usually discarded along with the worker, can be partially recovered for an increased economic gain, and show how low-skilled workers can be seamlessly integrated into high-performing teams. Our experiments prove that about 75% of misjudged workers can be correctly identified and effectively be integrated into teams with high overall result correctness between 70-95%.

Kinda El Maarry, Ulrich Güntzer, Wolf-Tilo Balke
A Coalition Formation Game Theory-Based Approach for Detecting Communities in Multi-relational Networks

Community detection is a very important task in social network analysis. Most existing community detection algorithms are designed for single-relational networks. However, in the real world, social networks are mostly multi-relational. In this paper, we propose a coalition formation game theory-based approach to detect communities in multi-relational social networks. We define the multi-relational communities as the shared communities over multiple single-relational graphs, and model community detection as a coalition formation game process in which actors in a social network are modeled as rational players trying to improve group’s utilities by cooperating with other players to form coalitions. Each player is allowed to join multiple coalitions and coalitions with fewer players can merge into a larger coalition as long as the merge operation could improve the utilities of coalitions merged. We then use a

greedy

agglomerative manner

to identify communities. Experimental results and performance studies verify the effectiveness of our approach.

Lihua Zhou, Peizhong Yang, Kevin Lü, Zidong Zhang, Hongmei Chen
Individual Influence Maximization via Link Recommendation

Recent years have witnessed the increasing interest in exploiting social influence in social networks for many applications, such as viral marketing. Most of the existing research focused on identifying a subset of influential individuals with the maximum influence spread. However, in the real-world scenarios, many individuals also care about the influence of herself and want to improve it. In this paper, we consider such a problem that maximizing a target individual’s influence by recommending new links. Specifically, if a given individual/node makes new links with our recommended nodes then she will get the maximum influence gain. Along this line, we formulate this link recommendation problem as an optimization problem and propose the corresponding objective function. As it is intractable to obtain the optimal solution, we propose greedy algorithms with a performance guarantee by exploiting the submodular property. Furthermore, we study the optimization problem under a specific influence propagation model (i.e., Linear model) and propose a much faster algorithm (

uBound

), which can handle large scale networks without sacrificing accuracy. Finally, the experimental results validate the effectiveness and efficiency of our proposed algorithms.

Guowei Ma, Qi Liu, Enhong Chen, Biao Xiang
Community Detection Based on Minimum-Cut Graph Partitioning

One of the most useful measurements of community detection quality is the modularity, which evaluates how a given division deviates from an expected random graph. This article demonstrates that (i) modularity maximization can be transformed into versions of the standard minimum-cut graph partitioning, and (ii) normalized version of modularity maximization is identical to normalized cut graph partitioning. Meanwhile, we innovatively combine the modularity theory with popular statistical inference method in two aspects: (i) transforming such statistical model into null model in modularity maximization; (ii) adapting the objective function of statistical inference method for our optimization. Based on the demonstrations above, this paper proposes an efficient algorithm for community detection by adapting the Laplacian spectral partitioning algorithm. The experiments, in both real-world and synthetic networks, show that both the quality and the running time of the proposed algorithm rival the previous best algorithms.

Yashen Wang, Heyan Huang, Chong Feng, Zhirun Liu
Coherent Topic Hierarchy: A Strategy for Topic Evolutionary Analysis on Microblog Feeds

Topic evolutionary analysis on microblog feeds can help reveal users’ interests and public concerns in a global perspective. However, it is not easy to capture the evolutionary patterns since the semantic coherence is usually difficult to be expressed and the timeline structure is always intractable to be organized. In this paper, we propose a novel strategy, in which a coherent topic hierarchy is designed to deal with these challenges. First, we incorporate the sparse biterm topic model to extract some coherent topics from microblog feeds. Then the topology of these topics is constructed by the basic Bayesian rose tree combined with topic similarity. Finally, we devise a cross-tree random walk with restart model to bond each pair of sequential trees into a timeline hierarchy. Experimental results on microblog datasets demonstrate that the coherent topic hierarchy is capable of providing meaningful topic interpretations, achieving high clustering performance, as well as presenting motivated patterns for topic evolutionary analysis.

Jiahui Zhu, Xuhui Li, Min Peng, Jiajia Huang, Tieyun Qian, Jimin Huang, Jiping Liu, Ri Hong, Pinglan Liu
Age Detection for Chinese Users in Weibo

Age is one of the most important attributes in one user’s profile. Age detection has many applications like personalized search, targeted advertisement and recommendation. Current research has uncovered the relationship between the use of western language and social identities to some extents. However, the age detection problem for Chinese users is so far unexplored. Due to the cultural and societal difference, some well known features in English may not be applicable to the Chinese users. For example, while the frequency of capitalized letter in English has proved to be a good feature, Chinese users do not have such patterns. Moreover, Chinese has its own characteristics such as rich emoticons, complex syntax and unique lexicon structures. Hence age detection for Chinese users is a new big challenge.

In this paper, we present our age detection study on a corpus of microblogs from 3200 users in Sina Weibo. We construct three types of Chinese language patterns, including stylistic, lexical, and syntactic features, and then investigate their effects on age prediction. We find a number of interesting language patterns: (1) there is a significant topic divergence among Chinese people in various age groups, (2) the young people are open and easy to accept new slangs from the internet or foreign languages, and (3) the young adult people exhibit distinguished syntactic structures from all other people. Our best result reaches an accuracy of 88% when classifying users into four age groups.

Li Chen, Tieyun Qian, Fei Wang, Zhenni You, Qingxi Peng, Ming Zhong
Batch Mode Active Learning for Networked Data with Optimal Subset Selection

Active learning has increasingly become an important paradigm for classification of networked data, where instances are connected with a set of links to form a network. In this paper, we propose a novel batch mode active learning method for networked data (BMALNeT). Our novel active learning method selects the best subset of instances from the unlabeled set based on the correlation matrix that we construct from the dedicated informativeness evaluation of each unlabeled instance. To evaluate the informativeness of each unlabeled instance accurately, we simultaneously exploit content information and the network structure to capture the uncertainty and representativeness of each instance and the disparity between any two instances. Compared with state-of-the-art methods, our experimental results on three real-world datasets demonstrate the effectiveness of our proposed method.

Haihui Xu, Pengpeng Zhao, Victor S. Sheng, Guanfeng Liu, Lei Zhao, Jian Wu, Zhiming Cui

Information and Knowledge

Frontmatter
A Question Answering System Built on Domain Knowledge Base

Interactive Question Answering (QA) system is capable of answering users’ questions with managing/understanding the dialogs between human and computer. With the increasing amount of online information, it is highly needed to answer users’ concerns on a specific domain such as health-related questions. In this paper, we proposed a general framework for domain-specific interactive question answering systems which takes advance of domain knowledge bases. First, a semantic parser is generated to parse users’ questions to the corresponding logical forms on basis of the knowledge base structure. Second, the logical forms are translated into query language to further harvest answers from the knowledge base. Moreover, our framework is featured with automatic dialog strategy development which relies on manual intervention in traditional interactive QA systems. For evaluation purpose, we applied our framework to a Chinese interactive QA system development, and used a health-related knowledge base as domain scenario. It shows promising results in parsing complex questions and holding long history dialog.

Yicheng Liu, Yu Hao, Xiaoyan Zhu, Jiao Li
Fast and Accurate Computation of Role Similarity via Vertex Centrality

There is growing evidence that vertex similarity based on structural context is the basis of many link mining applications in complex networks. As a special case of vertex similarity, role similarity which measures the similarity between two vertices according to their roles in a network can facilitate the search for peer vertices. In

RoleSim

, graph automorphism is encapsulated into the role similarity measure. As a real-valued role similarity,

RoleSim

shows good interpretative power in experiments. However,

RoleSim

is not sufficient for some applications since it is very time-consuming and may assign unreasonable similarities in some cases. In this paper, we present

CentSim

, a novel role similarity metric which obeys all axiomatic properties for role similarity.

CentSim

can quickly calculate the role similarity between any two vertices by directly comparing their corresponding centralities. The experimental results demonstrate that

CentSim

achieves best performance in terms of efficiency and effectiveness compared with the state-of-the-art.

Longjie Li, Lvjian Qian, Victor E. Lee, Mingwei Leng, Mei Chen, Xiaoyun Chen
Multimodal-Based Supervised Learning for Image Search Reranking

The aim of image search reranking is to rerank the images obtained by a conventional text-based image search engine to improve the search precision, diversity and so on. Current image reranking methods are often based on a single modality. However, it is hard to find a general modality which can work well for all kinds of queries. This paper proposes a multimodal-based supervised learning for image search reranking. First, for different modalities, different similarity graphs are constructed and different approaches are utilized to calculate the similarity between images on the graph. Exploiting the similarity graphs and the initial list, we integrate the multiple modality into query-independent reranking features, namely PageRank Pseudo Relevance Feedback, Density Feature, Initial Ranking Score Feature, and then fuse them into a 19-dimensional feature vector for each image. After that, the supervised method is employed to learn the weight of each reranking feature. The experiments constructed on the MSRA-MM Dataset demonstrate the improvement in robust and effectiveness of the proposed method.

Shengnan Zhao, Jun Ma, Chaoran Cui
Novel Word Features for Keyword Extraction

Keyword extraction plays an increasingly crucial role in several texts related researches. Applications that utilize feature word selection include text mining, and information retrieval etc. This paper introduces novel word features for keyword extraction. These new word features are derived according to the background knowledge supplied by patent data. Given a document, to acquire its background knowledge, this paper first generates a query for searching the patent data based on the key facts present in the document. The query is used to find files in patent data that are closely related to the contents of the document. With the patent search result file set, the information of patent inventors, assignees, and citations in each file are used to mining the hidden knowledge and relationship between different patent files. Then the related knowledge is imported to extend the background knowledge base, which would be extracted to derive the novel word features. The newly introduced word features that reflect the document’s background knowledge offer valuable indications on individual words’ importance in the input document and serve as nice complements to the traditional word features derivable from explicit information of a document. The keyword extraction problem can then be regarded as a classification problem and the Support Vector Machine (SVM) is used to extract the keywords. Experiments have been done using two different data sets. The results show our method improves the performance of keyword extraction.

Yiqun Chen, Jian Yin, Weiheng Zhu, Shiding Qiu
A Multi-round Global Performance Evaluation Method for Interactive Image Retrieval

In interactive image retrieval systems, from the image search results, a user can select an image and click to view its similar or related images until he reaches the targets. Existing evaluation approaches for image retrieval methods only focus on local performance of single-round search results on some selected samples. We propose a novel approach to evaluate their performance in the scenario of interactive image retrieval. It provides a global evaluation considering multi-round user interactions and the whole image collection. We model the interactive image search behaviors as navigation on an information network constructed by the image collection by using images as graph nodes. We leverage the properties of this constructed image information network to propose our evaluation metrics. We use a public image dataset and three image retrieval methods to show the usage of our evaluation approach.

Jiyi Li
Resorting Relevance Evidences to Cumulative Citation Recommendation for Knowledge Base Acceleration

Most knowledge bases (KBs) can hardly be kept up-to-date due to time-consuming manual maintenance. Cumulative Citation Recommendation (CCR) is a task to address this problem, whose objective is to filter relevant documents from a chronological stream corpus and then recommend them as candidate citations with certain relevance estimation to target entities in KBs. The challenge of CCR is how to accurately category the candidate documents into different relevance levels, since the boundaries between them are vague under the current definitions. To figure out the boundaries more precisely, we explore three types of relevance evidences including entities’ profiles, existing citations in KBs, and temporal signals, to supplement the definitions of relevance levels. Under the guidance of the refined definitions, we incorporate these evidences into classification and learning to rank approaches and evaluate their performance on TREC-KBA-2013 dataset. The experimental results show that all these approaches outperform the corresponding baselines. Our analysis also reveals various significances of these evidences in estimating relevance levels.

Jingang Wang, Lejian Liao, Dandan Song, Lerong Ma, Chin-Yew Lin, Yong Rui
Relevance Search on Signed Heterogeneous Information Network Based on Meta-path Factorization

Relevance search is a primitive operation in heterogeneous information networks, where the task is to measure the relatedness of objects with different types. Due to the semantics implied by network links, conventional research on relevance search is often based on meta-path in heterogeneous information networks. However, existing approaches mainly focus on studying non-signed information networks, without considering the polarity of the links in the network. In reality, there are many signed heterogeneous networks that the links can be either positive (such as trust, preference, friendship, etc.) or negative (such as distrust, dislike, opposition, etc.). It is challenging to utilize the semantic information of the two kinds of links in meta-paths and integrate them in a unified way to measure relevance.

In this paper, a relevance search measure called SignSim is proposed, which can measure the relatedness of objects in signed heterogeneous information networks based on signed meta-path factorization. SignSim firstly defines the atomic meta-paths and gives the computing paradigm of similarity between objects with the same type based on atomic meta-paths, with collaborative filtering using positive and negative user preferences. Then, on basis of the combination of different atomic meta-paths, SignSim can measure the relatedness between objects with different types based on multi-length signed meta-paths. Experimental results on real-world dataset verify the effectiveness of our proposed approach.

Min Zhu, Tianchen Zhu, Zhaohui Peng, Guang Yang, Yang Xu, Senzhang Wang, Xiangwei Wang, Xiaoguang Hong
Improving the Effectiveness of Keyword Search in Databases Using Query Logs

Using query logs to enhance user experience has been extensively studied in the Web IR literature. However, in the area of keyword search on structured data (relational databases in particular), most existing work has focused on improving search result quality through designing better scoring functions, without giving explicit consideration to query logs. Our work presented in this paper taps into the wealth of information contained in query logs, and aims to enhance the search effectiveness by explicitly taking into account the log information when ranking the query results. To concretize our discussion, we focus on schema-graph-based approaches to keyword search (using the seminal work DISCOVER as an example), which usually proceed in two stages, candidate network (

CN

) generation and

CN

evaluation. We propose a query-log-aware ranking strategy that uses the frequent patterns mined from query logs to help rank the

CN

s generated during the first stage. Given the frequent patterns, we show how to compute the maximal score of a

CN

using a dynamic programming algorithm. We prove that the problem of finding the maximal score is NP-hard. User studies on a real dataset validate the effectiveness of the proposed ranking strategy.

Jing Zhou, Yang Liu, Ziqiang Yu
cluTM: Content and Link Integrated Topic Model on Heterogeneous Information Networks

Topic model is extensively studied to automatically discover the main themes that pervade a large and unstructured collection of documents. Traditional topic models assume the documents are independent and there are no correlations among them. However, in many real scenarios, a document may be interconnected with other documents and objects, and thus form a text related heterogeneous network, such as the DBLP bibliographic network. It is challenging for traditional topic models to capture the link information associated to diverse types of objects in such a network. To this end, we propose a

u

nified

T

opic

M

odel cluTM by incorporating both the document

c

ontent and various

l

inks in the text related heterogeneous network. cluTM combines the textual documents and the link structures by the proposed joint matrix factorization on both the text matrix and link matrices. Joint matrix factorization can derive a common latent semantic space shared by multi-typed objects. With the multi-typed objects represented by the common latent features, the semantic information can be therefore largely enhanced simultaneously. Experimental results on DBLP datasets demonstrate the effectiveness of cluTM in both topic mining and multiple objects clustering in text related heterogeneous networks by comparing against state-of-the-art baselines.

Qian Wang, Zhaohui Peng, Senzhang Wang, Philip S. Yu, Qingzhong Li, Xiaoguang Hong

Recommender Systems

Frontmatter
Learning to Recommend with User Generated Content

In the era of Web 2.0, user generated content (UGC), such as social tag and user review, widely exists on the Internet. However, in recommender systems, most of existing related works only study single kind of UGC in each paper, and different types of UGC are utilized in different ways. This paper proposes a unified way to use different types of UGC to improve the prediction accuracy for recommendation. We build two novel collaborative filtering models based on Matrix Factorization (MF), which are oriented to user features learning and item features learning respectively. In the user side, we construct a novel regularization term which employs UGC to better understand a user’s interest. In the item side, we also construct a novel regularization term to better infer an item’s characteristic. We conduct comprehensive experiments on three real-world datasets, which verify that our models significantly improve the prediction accuracy of missing ratings in recommender systems.

Yueshen Xu, Zhiyuan Chen, Jianwei Yin, Zizheng Wu, Taojun Yao
Combining Positive and Negative Feedbacks with Factored Similarity Matrix for Recommender Systems

Traditional collaborative filtering algorithms like ItemKNN, cannot capture the relationships between items that are not co-rated by at least one user. To cope with this problem, the item-based factor models are put forward to utilize low dimensional space to learn implicit relationships between items. However, these models consider all user’s rated items equally as positive examples, which is unreasonable and fails to interpret the actual preferences of users. To tackle the aforementioned problems, in this paper, we propose a novel item-based latent factor model, which can consider user’s positive and negative feedbacks while learning item-item correlations. In particular, for each user, we divide his rated items into two different parts, i.e., positive examples and negative examples, depending on whether the rating of the item is above the average rating of the user or not. In our model, we assume that the predicted rating of an item should be boosted if the item is similar to most of the positive examples. On the contrary, the predicted rating should be diminished if the item is similar to most of the negative examples. The item-item similarity is approximated by an inner product of two low-dimensional item latent factor matrices which are learned using a structural equation modeling approach. Comprehensive experiments on two benchmark datasets indicate that our method has significant improvements as compared with existing approaches in both rating prediction and top-

N

recommendation.

Mengshuang Wang, Jun Ma, Shanshan Huang, Peizhe Cheng
Review Comment Analysis for Predicting Ratings

Rating prediction is a common task in recommendation systems that aims to predict a rating representing the opinion from a user to an item. In this paper, we propose a comment-based collaborative filtering (CCF) approach that captures correlations between hidden aspects in review comments and numeric ratings. The idea is motivated by the observation that the opinion of a user against an item is represented by different aspects discussed in review comments. In our approach, we first explores topic modeling to discover hidden aspects from review comments. Profiles are then created for users and items separately based on the discovered aspects. In the testing stage, we estimate the aspects of comments based on the profiles of users and items because the comments are not available when testing. Lastly, we build final systems by utilizing the profiles and traditional collaborative filtering methods. We evaluate the proposed approach on a real data set. The experimental results show that our prediction systems outperform several strong baseline systems.

Rong Zhang, Yifan Gao, Wenzhe Yu, Pingfu Chao, Xiaoyan Yang, Ming Gao, Aoying Zhou
Adaptive Temporal Model for IPTV Recommendation

How to help the IPTV service provider make the program recommendation to their clients is the problem we propose to solve in this paper. Here we offer an adaptive temporal model to identify multiple members under a shared IPTV account. The time intervals are first detected and defined in each account. Then, the preference similarity is calculated among the intervals to extract the members. After that, we evaluate our model on the industrial data sets by a famous IPTV provider. The experimental results show that our proposed model is promising and outperform the state-of-the-art algorithms with low computational complexity and versatility without user feedback. Furthermore, the proposed model has been officially adopted by the IPTV provider and applied in their IPTV systems with excellent user satisfaction in 2013.

Yan Yang, Qinmin Hu, Liang He, Minjie Ni, Zhijin Wang
RPCV: Recommend Potential Customers to Vendors in Location-Based Social Network

Location-based social network has received much attention recently. It provides rich information of social and spatial context for researchers to study users’ behaviors from different aspects. A number of recent efforts focus on recommending locations, users, activities, and social medias for users. Unlike previous works, we intend to make recommendations for vendors, assisting vendors in finding potential customers in location-based social network. We propose a framework to recommend potential customers to vendors (called RPCV) in location-based social network effectively and efficiently. To find the best set of customers, RPCV takes both spatial relations and user preference into consideration. A reverse spatial-preference kRanks algorithm, which effectively combines spatial relations with user preference, is also proposed. Our experimental results on real datasets from Foursquare and Brightkite show that our framework has higher performance than other state-of-the-art approaches.

Yuanliu Liu, Pengpeng Zhao, Victor S. Sheng, Zhixu Li, An Liu, Jian Wu, Zhiming Cui
Mining Dependencies Considering Time Lag in Spatio-Temporal Traffic Data

Learning dependency structure is meaningful to characterize causal or statistical relationships. Traditional dependencies learning algorithms only use the same time stamp data of variables. However, in many real-world applications, such as traffic system and climate, time lag is a key feature of hidden temporal dependencies, and plays an essential role in interpreting the cause of discovered temporal dependencies. In this paper, we propose a method for mining dependencies by considering the time lag. The proposed approach is based on a decomposition of the coefficients into products of two-level hierarchical coefficients, where one represents feature-level and the other represents time-level. Specially, we capture the prior information of time lag in spatio-temporal traffic data. We construct a probabilistic formulation by applying some probabilistic priors to these hierarchical coefficients, and devise an expectation-maximization (EM) algorithm to learn the model parameters. We evaluate our model on both synthetic and real-world highway traffic datasets. Experimental results show the effectiveness of our method.

Xiabing Zhou, Haikun Hong, Xingxing Xing, Wenhao Huang, Kaigui Bian, Kunqing Xie
Location Semantics Protection Based on Bayesian Inference

In mobile Internet, popular Location-Based Services (LBSs) recommend Point-of-Interest (POI) data according to physical positions of smartphone users. However, untrusted LBS providers can violate location privacy by analyzing user requests semantically. Therefore, this paper aims at protecting user privacy in location-based applications by evaluating disclosure risks on sensitive location semantics. First, we introduce a novel method to model location semantics for user privacy using Bayesian inference and demonstrate details of computing the semantic privacy metric. Next, we design a cloaking region construction algorithm against the leakage of sensitive location semantics. Finally, a series of experiments evaluate this solution’s performance to show its availability.

Zhengang Wu, Zhong Chen, Jiawei Zhu, Huiping Sun, Zhi Guan

Big Data

Frontmatter
SALA: A Skew-Avoiding and Locality-Aware Algorithm for MapReduce-Based Join

MapReduce is a parallel programming model, which is extensively used to process join operations for large-scale dataset. However, traditional MapReduce-based join is not efficient when handling skewed data, because it can lead to partitioning skew, which further results in longer response time of the whole join process. Additionally, some newly proposed methods usually involve large amounts of intermediate results over the network in the shuffle phase of Mapreduce-based join, which may consume a lot of time and cause performance degradation. Here a novel algorithm called SALA is proposed, which employs volume/locality-aware partitioning instead of hash partitioning for data distribution. Compared with other existing join algorithms, SALA has three typical advantages: (1) makes sure that the data is distributed to reducers evenly when the input datasets are skewed, (2) reduces the amount of intermediate results transferred across the network by utilizing data locality, and (3) does not make any modification of the MapReduce framework. The extensive experimental results show that SALA not only achieves better load balance but reduces network overhead, and therefore speeds up the whole join process significantly in the presence of data skew.

Ziyu Lin, Minxing Cai, Ziming Huang, Yongxuan Lai
Energy-Proportional Query Processing on Database Clusters

Energy efficiency has been a critical issue in database clusters. In this paper, we present an energy-proportional database cluster and propose an energy-proportional query processing approach to reduce the energy consumed by database clusters while keeping high time performance. Particularly, we introduce a

query stream buffer

on top of a database cluster and propose an

unbalanced load allocation

algorithm to distribute workloads among the cluster so as to realize better energy proportionality. Further, we present an adaptive algorithm to turn on/off nodes according to workload changes. With this mechanism, we can reduce energy consumption while keeping high time performance for query processing on database clusters. We build a prototype database cluster and use the TPC-H benchmark to compare our proposal with three baseline methods, where different query patterns are used. The results suggest the superiority of our proposal in energy savings and time performance.

Jiazhuang Xie, Peiquan Jin, Shouhong Wan, Lihua Yue
SparkRDF: In-Memory Distributed RDF Management Framework for Large-Scale Social Data

Considering the scalability and semantic requirements, Resource Description Framework (RDF) and the de-facto query language SPARQL are well suited for managing and querying online social network (OSN) data. Despite some existing works have introduced distributed framework for querying large-scale data, how to improve online query performance is still a challenging task. To address this problem, this paper proposes a scalable RDF data framework, which uses key-value store for offline RDF storage and pipelined in-memory based query strategy. The proposed framework efficiently supports SPARQL Basic Graph Pattern (BGP) queries on large-scale datasets. Experiments on the benchmark dataset demonstrate that the online SPARQL query performance of our framework outperforms existing distributed RDF solutions.

Zhichao Xu, Wei Chen, Lei Gai, Tengjiao Wang
Distributed Grid-Based K Nearest Neighbour Query Processing Over Moving Objects

K

-nearest neighbour (

k

-NN) queries over moving objects is a classic problem with applications to a wide spectrum of location-based services. Abundant algorithms exist for solving this problem in a centralized setting using a single server, but many of them become inapplicable when distributed processing is called for tackling the increasingly large scale of data. To address this challenge, we propose a distributed grid-based solution to

k

-NN query processing over moving objects. First, we design a new grid-based index called Block Grid Index (BGI), which indexes moving objects using a two-layer structure and can be easily constructed and maintained in a distributed setting. We then propose a distributed

k

-NN algorithm based on BGI, called DBGKNN. We implement BGI and DBGKNN in the commonly used master-worker mode, and the efficiency of our solution is verified by extensive experiments with millions of nodes.

Min Yang, Yang Liu, Ziqiang Yu
An Efficient Block Sampling Strategy for Online Aggregation in the Cloud

As the development of social network, mobile Internet, etc., an increasing amount of data are being generated, which beyond the processing ability of traditional data management tools. In many real-life applications, users can accept approximate answers accompanied by accuracy guarantees. One of the most commonly used approaches is online aggregation. Online aggregation responds aggregation queries against the random samples and refines the result as more samples are received. In the era of big data, more and more data analysis applications are migrated to the cloud, so online aggregation in the cloud has also attracted more attention. There can be a huge difference between the number of tuples in each group when dealing with group-by queries. As a result, answers of online aggregation based on uniform random sampling can result in poor accuracy for groups with very few tuples. Data in the cloud are usually organized into blocks and this data organization makes sampling more complex. In this paper, we propose an efficient block sampling which can exactly reflect the importance of different blocks for answering group-by queries. We implement our methods in a cloud online aggregation system called COLA and the experimental results demonstrate our method can get results with higher accuracy.

Xiang Ci, Xiaofeng Meng
Computing Probability Threshold Set Similarity on Probabilistic Sets

Currently, the computation of set similarity has become an increasingly important tool in many real-world applications, such as near-duplicate detection, data cleaning and record linkage, etc., in which sets often are uncertain due to date missing, imprecise and noise, etc. The challenge of evaluating similarity between probabilistic sets mainly stems from the exponential blowup in the number of possible worlds induced by uncertainty. In this paper, we define the probability threshold set similarity (

PTSS

) between two probabilistic sets based on the possible world semantics and propose an exact solution to compute

PTSS

via the dynamic programming. To speed up the computation of the probability threshold set query (

PTSQ

), we derive an efficient and effective pruning rule for

PTSQ

. Finally, we conduct extensive experiments to verify the effectiveness and efficiency of our algorithms using both real and synthetic datasets.

Lei Wang, Ming Gao, Rong Zhang, Cheqing Jin, Aoying Zhou
Region-aware Top-k Similarity Search

Location-based services have attracted significant attention for the ubiquitous smartphones equipped with GPS systems. These services (e.g., Google map, Twitter) generate large amounts of spatio-textual data which contain both geographical location and textual description. Existing location-based services (LBS) assume that the attractiveness of a Point-of-Interest (POI) depends on its spatial proximity from people. However, in most cases, POIs within a certain distance are all acceptable to users and people may concern more about other aspects. In this paper, we study a region-aware top-k similarity search problem: given a set of spatio-textual objects, a spatial region and several input tokens, finds

k

most textual-relevant objects falling in this region. We summarize our main contributions as follows: (1) We propose a hybrid-landmark index which integrates the spatial and textual pruning seamlessly. (2) We explore a priority-based algorithm and extend it to support fuzzy-token distance. (3) We devise a cost model to evaluate the landmark quality and propose a deletion-based method to generate high quality landmarks (4) Extensive experiments show that our method outperforms state-of-the-art algorithms and achieves high performance.

Sitong Liu, Jianhua Feng, Yongwei Wu
A Partition-Based Bi-directional Filtering Method for String Similarity JOINs

A string similarity join finds similar string pairs from two sets of strings, which is frequently found in many applications, such as duplicate detection, data integration and cleaning. Various algorithms have been proposed to address its efficiency issues. Partition-based filtering methods, such as Pass-JOIN, are promising, which quickly screens out possible similar string pairs by searching partitioned parts of a string in another string, in order of increasing length, and then performs similarity verification base on edit-distance. We notice that, filtering with different direction produces different candidate sets, which motivate us using a bi-directional filtering mechanism. This paper proposes a novel bi-directional filtering mechanism to enhance the filtering capability, which pipelines filtered results in forward direction to the process of backward filtering. The substring selection method of Pass-JOIN is adapted for the backward filtering. Experimental results show that the proposed bi-directional filtering algorithm outperforms the origin algorithm on real-world datasets.

Ying Huang, Baoning Niu, Chunhua Song
Fast Multiway Maximum Margin Clustering Based on Genetic Algorithm via the NystrÖm Method

Motivated by theories of support vector machine, the concept of maximum margin has been extended to the applications in the unsupervised scenario, developing a novel clustering method

maximum margin clustering (MMC). MMC shows an outstanding performance in computational accuracy, which is superior to other traditional clustering methods. But the integer programming of labels of data instances induces MMC to be a hard non-convex optimization problem to settle. Currently, many techniques like semi-definite programming, cutting plane etc. are embedded in MMC to tackle this problem. However, the increasing time complexity and premature convergence of these methods limit the analytic capability of MMC for large datasets. This paper proposes a fast multiway maximum margin clustering method based on genetic algorithm (GAM3C). GAM3C initially adopts the NystrÖm method to generate a low-rank approximate kernel matrix in the dual form of MMC, reducing the scale of original problem and speeding up the subsequent analyzing process; and then makes use of the solution-space alternation of genetic algorithm to compute the non-convex optimization of MMC explicitly, obtaining the multiway clustering results simultaneously. Experimental results on real world datasets reflect that GAM3C outperforms the state-of-the-art maximum margin clustering algorithms in terms of computational accuracy and running time.

Ying Kang, Dong Zhang, Bo Yu, Xiaoyan Gu, Weiping Wang, Dan Meng

Short Papers

Frontmatter
DTMF: A User Adaptive Model for Hybrid Recommendation

Due to the uneven distributions of the ratings and social relationships for each user, two types of above recommendation methods should have varying weights when make recommendations. In this paper, we propose a user adaptive hybrid recommendation model, which dynamically combines a trust-aware based method and low-rank matrix factorization with adaptive tradeoff parameters, named as DTMF. It can utilize the advantages of these two methods and learn combinative parameters automatically. We investigate our model’s performance on two social data sets - Epinions and Flixster. Experimental results show that DTMF performs better than other state-of-the-art methods.

Wenlong Yang, Jun Ma, Shanshan Huang
An Adaptive Skew Handling Join Algorithm for Large-scale Data Analysis

Join plays an essential role in large-scale data analysis, but the performance is severely degraded by data skew. Existing works can’t adaptively handle data skew very well and reduce communication cost simultaneously. To address these problems, we firstly propose a mixed data structure comprising Bloom Filter and Histogram(BFH). Based on BFH, Bloom Filter and Histogram Join(BFHJ) is proposed to handle data skew adaptively. BFHJ can reduce communication cost by filtering unnecessary records. Furthermore, BFHJ adopts a heuristic partitioning strategies to balance workload. Experiments on TPC-H demonstrate that BFHJ outperforms the state-of-the-art methods in terms of communication cost, load balance and query time.

Di Wu, Tengjiao Wang, Yuxin Chen, Shun Li, Hongyan Li, Kai Lei
NokeaRM: Employing Non-key Attributes in Record Matching

Record Matching (RM) aims at finding out pairs of instances referring to the same entity between relational tables. Existing RM methods mainly work on key attribute values, but neglect the possible effectiveness of non-key attribute values in RM. As a result, when two instances referring to the same entity do not have similar key attribute values, they are unlikely to be linked as an instance pair. On the other hand, the two instances may share some important non-key attribute values which can also help us identify the relationship between them. With this intuition, we propose to employ non-key attributes in RM. Basically, we propose a rule-based algorithm based on a tree-like structure, which can not only deal with noisy and missing values, but also greatly improve the efficiency of the method by finding out matched instances or filtering unmatched instances as early as possible. The experimental results based on several data sets demonstrate that our method outperforms existing RM methods by reaching a higher precision and recall. Besides, the proposed techniques can greatly improve the efficiency of a baseline.

Qiang Yang, Zhixu Li, Jun Jiang, Pengpeng Zhao, Guanfeng Liu, An Liu, Jia Zhu
Efficient Foreign Key Discovery Based on Nearest Neighbor Search

With rapid growth of data size and schema complexity, many data sets are structured in tables but without explicit foreign key definitions. Automatically identifying foreign keys among relations will be beneficial to query optimization, schema matching, data integration and database design as well. This paper formulates foreign key discovery as a nearest neighbor search problem and proposes a fast foreign key discovery algorithm. To reduce foreign key candidates, we detect inclusion dependencies first. Then we choose statistical features to represent an attribute and define two attributes’s distance. Finally, foreign keys are discovered by finding nearest neighbors of all primary keys. Experiment results on real and synthetic data sets show that our algorithm can discover foreign keys efficiently.

Xiaojie Yuan, Xiangrui Cai, Man Yu, Chao Wang, Ying Zhang, Yanlong Wen
OntoEvent: An Ontology-Based Event Description Language for Semantic Complex Event Processing

In this paper, we propose an ontology-based event description language, OntoEvent, for semantic event modeling in CEP system. In OntoEvent language, complex events are modeled and described in the form of event ontology. We propose the concept of nature language event constructor and define them by synonym set of WordNet database to describe logical and temporal event relationship in OntoEvent. We demonstrate event ontology examples in application domain of smart home, and our analysis shows that OntoEvent is of rich expressiveness compared to other related languages.

Meng Ma, Ping Wang, Jun Yang, Chao Li
Differential Trust Propagation with Community Discovery for Link-Based Web Spam Demotion

In this paper, we propose a novel differential trust propagation scheme with community discovery, which can be applied to all kinds of trust propagation algorithms. We first use a random walk-based community discovery algorithm to preselect suspicious communities in which the members are almost spam pages. We then utilize these suspicious communities to limit the across-community-boundary trust propagation. Experimental results on WEBSPAM-UK2007 and ClueWeb09 demonstrate that the proposed penalizing scheme significantly improves the performance of trust propagation algorithms such as TrustRank, LCRank, CPV.

Xianchao Zhang, Yafei Feng, Hua Shen, Wenxin Liang
Inferring User Preference in Good Abandonment from Eye Movements

Many studies have been done to investigate good abandonment, but only a few have utilized it to improve search engine performance. In this paper, we aim at inferring user preference in good abandonment. Particularly, we use eye movement data to infer which search result has satisfied user’s information need in each good abandonment instance. An eye-tracking experiment was conducted to capture user’s eye movement data in good abandonment search tasks. These data were transformed into histograms and sequences on which we applied popular machine learning algorithms for the inference. Our results show that the approach can infer user preference with reasonable accuracy.

Wanxuan Lu, Yunde Jia
Private Range Queries on Outsourced Databases

With the advent of cloud computing, data owners could upload their databases to the cloud service provider to relief the burden of data storage and management. To protect sensitive data from the cloud, the data owner could publish an encrypted version of the original data. However, this will make data utilization much harder. In this paper, we consider the problem of private range query. Specifically, the data owner wants to obtain all the data within a query region, while keeping the query private to the service provider. Previous works only provide partial security guarantee, and are inefficient to deal with large scale datasets. To solve this problem, we present a fully secure scheme based on private information retrieval (PIR) and batch codes (BC).

Lu Li, Liusheng Huang, An Liu, Yao Shen, Wei Yang, Shengnan Shao
Efficient Influence Maximization Based on Three Degrees of Influence Theory

The study on influence modeling is to understand the information diffusion and word-of-mouth marketing. In this paper, based on Three Degrees of Influence theory, we propose a suitable diffusion model named Three Steps Cascade Model (TSCM) to simulate online social network information diffusion process. We focus on the influence maximization problem under TSCM and devise an efficient algorithm to solve this problem. The experiment results on real-networks show the robustness and utility of our approach.

Yadong Qin, Jun Ma, Shuai Gao
Bichromatic Reverse kNN Query Algorithm on Road Network Distance

A bichromatic reverse nearest neighbor (BRNN) query has been studied in a road network distance recently, however, existing works for BRNN query have shortcomings in the long processing time. In this paper, we propose a BRNN query algorithm in a road network distance by using the simple materialized path view (SMPV) data structure [

1

].

Tin Nilar Win, Htoo Htoo, Yutaka Ohsawa
IIRS: A Novel Framework of Identifying Commodity Entities on E-commerce Big Data

Identification of the same commodity entities is a major challenge in the heterogeneous multi-source e-commerce of big data. This paper introduces a framework based on Map-Reduce, called IIRS, which is made up of data index, data integration, entity recognition and data sorting. IIRS aims to form the unified model and high efficient commodity information with building an index model based on commodity’s attribute/value and constructing a global model map to record commodity’s attribute and value, identify the commodity entities in different e-commerce with measuring the similarity of the commodity’s identity, and then output the same identity commodity sets and their associated properties organized in the inverted index list. Through an extensive experimental study on real e-commerce dataset on Hadoop, IIRS significantly demonstrates its feasibility, accuracy, and high efficiency.

Qiqing Fang, Yamin Hu, Shujun Lv, Lejiang Guo, Lei Xiao, Yahui Hu
Extracting Appraisal Expressions from Short Texts

Short texts such as tweets and E-commerce reviews can reflect people’s opinions on interested events or products, which are much beneficial to many applications. However, one opinion word may have different sentiment polarities when modifying different targets. Therefore, in this paper we propose to extract “

appraisal expressions

” that are represented by tuples of (

opinion word

,

target

), indicating an opinion word and the target modified by the word. By extracting appraisal expressions, we can further construct target-sensitive sentiment dictionaries and improve the effectiveness of sentiment analysis on short texts. Consequently, we propose a

filtering-refinement

framework to extract appraisal expressions from short texts. In the

filtering

step, we extract appraisal-expression candidates, and in the

refinement

step, we use SVM to extract appraisal expressions and present a

dependency-grammar-based

approach to automatically label training data. Comparative experiments between our proposal and three baseline methods suggest the superiority and effectiveness of our proposal.

Peiquan Jin, Yongbo Yu, Jie Zhao, Lihua Yue
Friendship Link Recommendation Based on Content Structure Information

Intuitively, a friendship link between two users can be recommended based on the similarity of their generated text content or structure information. Although this problem has been extensively studied, the challenge of how to effectively incorporate the information from the social interaction and user generated content remains largely open. We propose a model (LRCS) to recommend user’s potential friends by incorporating user’s generated content and structure features. First, network users are clustered based on the similarity of user’s interest and structural features. Users in the same cluster with the query user are considered as the candidate friends. Then, a weighted SimRank algorithm is proposed to recommend the most similar users as the friends. Experiments on two real-life datasets show the superiority of our approach.

Xiaoming Zhang, Qiao Deng, Zhoujun Li
Overlapping Community Detection in Directed Heterogeneous Social Network

In social networks, users and artifacts (documents, discussions or videos) can be modelled as directed bi-type heterogeneous networks. Most existing works for community detection is either with undirected links or in homogeneous networks. In this paper, we propose an efficient algorithm OcdRank (Overlapping Community Detection and Ranking), which combines overlapping community detection and community-member ranking together in directed heterogeneous social network. The algorithm has low time complexity and supports incremental update. Experiments show that our method can detect better community structures as compared to other existing community detection methods.

Changhe Qiu, Wei Chen, Tengjiao Wang, Kai Lei
Efficient MapReduce-Based Method for Massive Entity Matching

Most of the state-of-the-art MapReduce-based entity matching methods inherit traditional Entity Resolution techniques on centralized system and focus on data blocking strategies in order to solve the load balancing problem occurred in distributed environment. In this paper, we propose a MapReduce-based entity matching framework for processing semi-structured and unstructured data. We use a Locality Sensitive Hash (LSH) function to generate low dimensional signatures for high dimensional entities; we introduce a series of random algorithms to ensure that similar signatures will be matched in reduce phase with high probability. Moreover, our framework contains a solution for reducing redundant similarity computation. Experiments show that our approach has a huge advantage on processing speed whilst keeps a high accuracy.

Pingfu Chao, Zhu Gao, Yuming Li, Junhua Fang, Rong Zhang, Aoying Zhou
The Role of Physical Location in Our Online Social Networks

One of the most important properties of social networking sites is its reachability – no physical location constraint. In addition, all social networking sites allow us to search people with common interests, so we can find friends anywhere in the world easier than ever. With the help of social media, it seems that expaning our social networks is physical location independent. Motivated by the above observations, we study the role of physical location in social media. If physical location is no longer a barrier and physical interaction can be ignored, then our online social networks should have the following characteristics: (1) A number of our friends are from different places in the world other than the places that we have been; (2) A number of our friends are not from our physical social circles – they are not our colleagues, not our high school friends, etc.

Jia Zhu, Pui Cheong Gabriel Fung, Kam-fai Wong, Binyang Li, Zhixu Li, Haoye Dong
Hierarchical Community Evolution Mining from Dynamic Networks

Research on community evolution contributes to understanding the nature of network evolution. Previous community evolution studies have two defects: (1) the algorithms do not have sufficient stability or cannot handle the radical structure change of communities, and (2) they cannot reveal the evolutionary regularities with multiple levels. To solve these problems, this paper proposes a new method for mining the evolution of communities from dynamic networks. Experiments demonstrate that compared with traditional methods, our work significantly improves the algorithm performances.

Yonghui Zhang, Chuan Li, Yanmei Li, Changjie Tang, Ning Yang
Exploiting Conceptual Relations of Sentences for Multi-document Summarization

Multi-document Summarization becomes increasingly important in the age of big data. However, existing summarization systems do not or implicitly consider the conceptual relations of sentences. In this paper, we propose a novel method called Multi-document Summarization based on Explicit Semantics of Sentences (MDSES), which explicitly take conceptual relations of sentences into consideration. It is composed of three components: sentence-concept graph construction, concept clustering and summary generation. We first obtain sentence-concept semantic relation to construct a sentence-concept graph. Then we run graph weighting algorithm to get ranked weighted sentences and concepts. Besides, we obtain concept-concept semantic relation for concepts clustering to eliminate redundancy. Finally, we conduct summary generation to get informative summary. Experimental results on DUC dataset using ROUGE metrics demonstrate the good effectiveness of our methods.

Hai-Tao Zheng, Shu-Qin Gong, Ji-Min Guo, Wen-Zhen Wu
Towards Narrative Information Systems

Narrative interfaces promise to improve the user experience of interacting with information systems by adapting a powerful communication concept which comes natural in human interaction. In this paper, we outline how such a narrative information system which answers queries using elaborate stories can be realized. We show how to construct query-dependent plot graphs from unstructured data sources. Also, we conduct an extended user study on our prototype implementation in which users adapt and personalize the plots of query-dependent stories. These studies provide further insights into which parts of a story are relevant and can be used as a starting point for either personalizing stories or for crafting better user-independent stories in later works.

Philipp Wille, Christoph Lofi, Wolf-Tilo Balke
Maximizing the Spread of Competitive Influence in a Social Network Oriented to Viral Marketing

In this paper, we discuss the maximization of the spread of competitive influence when multiple companies market competing product using social network. We first extend the linear threshold model by incorporating the competitive influence spread,obtaining the extended linear threshold model (ELTM). We then give the objective function of selecting the optimal seed set of competing products in a social network, and the objective function is monotone and submodular under the ELTM, thus a greedy algorithm could achieve 1−1/e approximation ration (where e is the base of natural logarithm). Accordingly, preliminary experimental results verify the feasibility of our method.

Hong Wu, Weiyi Liu, Kun Yue, Weipeng Huang, Ke Yang
SEMR: Secure and Efficient Multi-dimensional Range Query Processing in Two-tiered Wireless Sensor Networks

Wireless sensor network (WSN) is an important part of Internet of Things. Due to abundant resources and high efficiency, many future large-scale WSNs are expected to follow a two-tiered architecture, where resource-rich storage nodes are placed in the upper tier and resource-limited sensor nodes are placed at the lower tier. However, the architecture threatens network security. Thus it is challenging to process range query while protecting sensitive data from adversaries. Most existing relate work focuses on single-attribute privacy-preserving range query. In practical applications, sensor nodes usually sense multiple attributes. In this paper, we propose a secure and efficient range query protocol for multi-dimensional data in two-tiered wireless sensor networks. A specified pair of matrices is used to compute characteristic values of sensed data and encrypt range queries. Finally, experimental results confirm the efficiency of our protocol.

Lei Dong, Jianxiang Zhu, Xiaoying Zhang, Hong Chen, Cuiping Li, Hui Sun
A Sampling-Based Framework for Crowdsourced Select Query with Multiple Predicates

In this paper, we consider the crowdsourced select query with multiple predicates. We find that different predicates have different selectivities. An important problem is to determine a good predicate order. However it is rather hard to obtain an optimal order. To address this problem, we propose a sampling-based framework to find a high-quality order. We devise a minimum random selection method by randomly selecting the predicate sequence. Since minimum random selection randomly selects predicate permutations over predicates, which may bring large cost, we propose a filtering based algorithm to further reduce the cost. We evaluate our method using a real-world dataset. Experimental results indicate that our methods significantly reduce the monetary cost.

Jianhong Feng, Huiqi Hu, Xueping Weng, Jianhua Feng, Yongwei Wu
A Customized Schema Design Framework for Multi-tenant Database

Existing multi-tenant database systems provide little support to consider high performance, good scalability, little space consumption and full customization at the same time. In this paper, we propose a

customized

database schema design framework which supports schema customization without sacrificing performance and scalability. We propose the interactive-based method to help tenants better design their customized schemas and integrate the schemas. We propose the graph partition method to reorganize integrated tables. Experimental results show that our method achieves better performance and higher scalability with schema customization property than state-of-the-art methods.

Jiacai Ni, Jianhua Feng, Yongwei Wu
A Multi-attribute Probabilistic Matrix Factorization Model for Personalized Recommendation

Recommender systems can interpret personalized preferences and recommend the most relevant choices to the benefit of countless users in the era of information explosion. Attempts to improve the performance of recommendation systems have hence been the focus of much research effort. Few attempts, however, recommend on the basis of both social relationship and the content of items users have tagged. This paper proposes a new recommending model incorporating social relationship and items’ description information with probabilistic matrix factorization called SCT-PMF. Meanwhile, we take full advantage of the scalability of probabilistic matrix factorization, which helps to overcome data sparsity as well. Experiments demonstrate that SCT-PMF is scalable and outperforms several baselines (PMF, LDA, CTR) for recommending.

Feng Tan, Li Li, Zeyu Zhang, Yunlong Guo
A Novel Knowledge Extraction Framework for Resumes Based on Text Classifier

In the information age, there are plenty of resume data in the internet. Several previous research have been proposed to extract facts from resumes, however, they mainly rely on large amounts of labeled data and the text format information, which made them limited by human efforts and the file format. In this paper, we propose a novel framework, not depending on the file format, to extract knowledge about the person for building a structured resume repository. The proposed framework includes two major processes: the first is to segment text into semi-structured data with some text pretreatment operations. The second is to further extract knowledge from the semi-structured data with text classifier. The experiments on the real dataset demonstrate the improvement when compared to previous researches.

Jie Chen, Zhendong Niu, Hongping Fu
SimRank Based Top-k Query Aggregation for Multi-Relational Networks

SimRank is one measure that compute the similarities between nodes in applications, where the returning of top-k query lists is often required. In this paper, we adopt SimRank as the similarity computation measure and re-write the original inefficient iterative equation into a non-iterative one, we call it Eigen-SimRank. We focus on multi-relational networks, where there may exist different kinds of relationships among nodes and query results may change with different perspectives. In order to compute a top-k query list under any perspective especially compound perspective, we suggest dynamic updating algorithm and rank aggregation methods. We evaluate our algorithms in the experiment section.

Jing Xu, Cuiping Li, Hong Chen, Hui Sun
On Dynamic Top-k Influence Maximization

This paper studies the top-

k

influence maximization problem together with network dynamics. We propose an incremental update framework that takes advantage of smoothness of network changes. Experiments show that the proposed method outperforms the straightforward solution by a wide margin.

Hao Wang, Nana Pan, U Leong Hou, Bohan Zhan, Zhiguo Gong
Community Based Spammer Detection in Social Networks

Social networks (SNS in short) such as Sina Weibo have now become one of the most popular Internet applications. A major challenge to social networks is the huge amount of spammers, or fake accounts generated by computer programmes. Traditional approaches in combating spammers mostly lie on user features, such as the completeness of user profiles, the number of users’ activities, and the content of posted tweets etc. These methods may achieve good results when they are designed. However, spammers will evolve in order to survive. Hence the feature-based approaches will gradually lose their power. After careful analysis on a real SNS, we find that people will rebuild in cyber social networks their communities in the physical world. Interestingly, spammers also have to construct their communities in order to hide themselves, because separated users will be easily detected by anti-spam tools. Moreover, it is very hard for spammers to sneak into normal users’ communities since a community member needs to have links with most other members. Based on this observation, we propose a novel approach to judge a user by the features of both himself/herself and the accounts within the same communities as him/her.

Dehai Liu, Benjin Mei, Jinchuan Chen, Zhiwu Lu, Xiaoyong Du
LDM: A DTD Schema Mapping Language Based on Logic Patterns

Existing XML schema mapping languages often follow the approach for the relational schema mappings to conjunctively combine the correspondences between tuples of elements. However, this approach is inconvenient to present the complex mappings which often involve the hierarchical and heterogeneous data. In this paper, we propose a new pattern-based DTD schema mapping language named LDM. The views in LDM adopt the logical as well as the structural operators of compose the patterns for matching the data elements, so as to flexibly specify the structural relationships among the elements and uniformly organize them in a logical structure. Under the support of a deductive restructuring mechanism of the logical structure, the data elements can be coherently reorganized and thus the mappings can be specified declaratively. We describe the design issues of LDM and demonstrate its features with certain typical mappings on integrating DTD schemas.

Xuhui Li, Yijun Guan, Mengchi Liu, Rui Cai, Ming Zhong, Tieyun Qian
Effective Sampling of Points of Interests on Maps Based on Road Networks

With the rapid development of location-based services, it is particularly important for start-up marketing research to explore and characterize points of interests (PoIs) such as restaurants and hotels on maps. However, due to the lack of a direct access to PoI databases, we have to rely on existing APIs to query PoIs within the region and calculate the PoI statistics. Unfortunately, public APIs generally impose a limit on the maximum number of queries. Therefore, we propose effective and efficient sampling methods based on a road network to sample PoIs on maps and give unbiased estimators to calculate the PoI statistics. Experimental results show that compared with the state-of-the-art methods, our sampling methods improve the efficiency of aggregate statistical estimation.

Ziting Zhou, Pengpeng Zhao, Victor S. Sheng, Jiajie Xu, Zhixu Li, Jian Wu, Zhiming Cui
Associated Index for Big Structured and Unstructured Data

In big data epoch, one of the major challenges is the large volume of mixed structured and unstructured data. Because of different form, structured and unstructured data are often considered apart from each other. However, they may speak about the same entities of the world. If a query involves both structured data and its unstructured counterpart, it is inefficient to execute it separately. The paper presents a novel index structure tailored towards associations between structured and unstructured data, based on entity co-occurrences. It is also a semantic index represented as RDF graphs which describes the semantic relationships among entities. Experiments show that the associated index can not only provide apposite information but also execute queries efficiently.

Chunying Zhu, Qingzhong Li, Lanju Kong, Xiangwei Wang, Xiaoguang Hong

Demo Papers

Frontmatter
Web Knowledge Base Improved OCR Correction for Chinese Business Cards

In the field of Optical Character Recognition(OCR), improving the recognition accuracy has been extensively studied in the past decades. In this paper, different from previously published model-based correction methods, Knowledge Base was applied to OCR correcting system from the perspective of linked knowledge. A pipelined method integrating selectivity-aware pre-filtering, text-level and image-level comparison was explored to identify the best candidate with better efficiency and accuracy. For more reliable comparison of company, the weighted coefficients derived from Wikipedia were applied to distinguish the different importance. Moreover, traditional Levenshtein distance was generalized to Image-based Levenshtein measure to better distinguish strings with similar text similarity. The experimental results demonstrated that the proposed system could perform more effectively than the baseline case.

Xiaoping Wang, Yanghua Xiao, Wei Wang
Shortest Path and Word Vector Based Relation Representation and Clustering

Relation representation plays an important role in text understanding. In this paper, different from previously published supervised methods or semi-supervised methods, an new method of relation representation and clustering based on shortest path and word vector was proposed. By accumulating the word vector along the shortest path within dependency tree, we can not only obtain the essential representation of the relation, but also can map the relation into semantic space simultaneously. Therefore, reliable distance between any two relations could be measured. Moreover, further applications such as relations clustering can be performed conveniently by direct analysis on the collection of vectors.

Xiaoping Wang, Yanghua Xiao, Wei Wang
CrowdSR: A Crowd Enabled System for Semantic Recovering of Web Tables

Without knowing any semantic of tables on web, it’s very difficult for web search to take advantage of those high quality sources of relational information.We present CrowdSR, a system that enables semantic recovering of web tables by crowdsourcing. To minimize the number of tuples posed to the crowd, CrowdSR selects a small number of representative tuples by clustering based on novel integrative distance. An evaluation mechanism is also implemented on Answer Credibility in order to recommend related tasks for workers and decide the final answers for each task more accurately.

Huaxi Liu, Ning Wang, Xiangran Ren
A Personalized News Recommendation System Based on Tag Dependency Graph

The tags of news articles give readers the most important and relevant information regarding the news articles, which are more useful than a simple bag of keywords extracted from news articles. Moreover, latent dependency among tags can be used to assign tags with different weight. Traditional content-based recommendation engines have largely ignored the latent dependency among tags. To solve this problem, we implemented a prototype system called PRST, which is presented in this paper. PRST builds a tag dependency graph to capture the latent dependency among tags. The demonstration shows that PRST makes news recommendation more effectively.

Pengqiang Ai, Yingyuan Xiao, Ke Zhu, Hongya Wang, Ching-Hsien Hsu
SmartInt: A Demonstration System for the Interaction Between Schema Mapping and Record Matching

Schema Mapping and Record Matching are two necessary steps in merging multiple data sources with different schemas. While schema mapping unifies the schemas of different data sets, record matching finds out all pairs of linked instances between the data sets. So far, the two processes are well-studied separately, but no effort has been made in the interaction between them. In this demonstration, we present the SmartInt system which performs schema mapping and record matching interactively in merging data sets, and we show that schema mapping and record matching can benefit each other in reaching a better integration performance. We will demonstrate, step by step, how the interaction works to improve the integration quality.

Jun Jiang, Zhixu Li, Qiang Yang, Pengpeng Zhao, Guanfeng Liu, Lei Zhao
CDSG: A Community Detection System Based on the Game Theory

Community detection is an important task in social network analysis. A game theory-based community detection system (CDSG) is developed in this demonstration. CDSG uses cooperative and non-cooperative game theory to detect communities. The combination of cooperative and non-cooperative game makes utilities of groups and individuals can be taken into account simultaneously and decreases the computational cost, thus CDSG can detect overlapping communities with high accuracy and efficiency, such that it can effectively help users in analyzing and exploring complex networks.

Peizhong Yang, Lihua Zhou, Lizhen Wang, Xuguang Bao, Zidong Zhang
Backmatter
Metadaten
Titel
Web-Age Information Management
herausgegeben von
Xin Luna Dong
Xiaohui Yu
Jian Li
Yizhou Sun
Copyright-Jahr
2015
Electronic ISBN
978-3-319-21042-1
Print ISBN
978-3-319-21041-4
DOI
https://doi.org/10.1007/978-3-319-21042-1