
2017 | Book

Web and Big Data

First International Joint Conference, APWeb-WAIM 2017, Beijing, China, July 7–9, 2017, Proceedings, Part I

Editors: Lei Chen, Christian S. Jensen, Cyrus Shahabi, Xiaochun Yang, Xiang Lian

Publisher: Springer International Publishing

Book Series: Lecture Notes in Computer Science


About this book

This two-volume set, LNCS 10366 and 10367, constitutes the thoroughly refereed proceedings of the First International Joint Conference, APWeb-WAIM 2017, held in Beijing, China, in July 2017.

The 44 full papers presented together with 32 short papers and 10 demonstration papers were carefully reviewed and selected from 240 submissions. The papers are organized around the following topics: spatial data processing and data quality; graph data processing; data mining, privacy and semantic analysis; text and log data management; social networks; data mining and data streams; query processing; topic modeling; machine learning; recommendation systems; distributed data processing and applications; machine learning and optimization.

Table of Contents

Frontmatter

Tutorials

Frontmatter
Meta Paths and Meta Structures: Analysing Large Heterogeneous Information Networks

A heterogeneous information network (HIN) is a graph model in which objects and edges are annotated with types. Large and complex databases, such as YAGO and DBLP, can be modeled as HINs. A fundamental problem in HINs is the computation of closeness, or relevance, between two HIN objects. Relevance measures, such as PCRW, PathSim, and HeteSim, can be used in various applications, including information retrieval, entity resolution, and product recommendation. These measures are based on meta-paths, essentially sequences of node classes and edge types between two nodes in a HIN. In this tutorial, we will give a detailed review of meta-paths, as well as how they are used to define relevance. In a large and complex HIN, retrieving meta-paths manually can be complex, expensive, and error-prone. Hence, we will explore systematic methods for finding meta-paths. In particular, we will study a solution based on the Query-by-Example (QBE) paradigm, which allows us to discover meta-paths in an effective and efficient manner. We further generalise the notion of the meta-path to the “meta-structure”, a directed acyclic graph of object types with edge types connecting them. The meta-structure, which is more expressive than the meta-path, can describe complex relationships between two HIN objects (e.g., two papers in DBLP sharing the same authors and topics). We will discuss three relevance measures based on meta-structures. Due to the computational complexity of these measures, we will also study an algorithm and the data structures proposed to support their evaluation. Finally, we will examine solutions for performing query recommendation based on meta-paths, and discuss future research directions.

Reynold Cheng, Zhipeng Huang, Yudian Zheng, Jing Yan, Ka Yu Wong, Eddie Ng
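As an illustration of the relevance measures reviewed in this tutorial, the PathSim score for a symmetric meta-path can be sketched in a few lines. The author-paper incidence matrix below is hypothetical example data, not taken from the tutorial.

```python
# Sketch of the PathSim relevance measure for the symmetric meta-path
# Author-Paper-Author. The toy incidence matrix W (authors x papers)
# is hypothetical example data.
W = [
    [1, 1, 0],  # author a0 wrote papers p0, p1
    [1, 1, 1],  # author a1 wrote papers p0, p1, p2
    [0, 0, 1],  # author a2 wrote paper p2
]

def commuting_matrix(W):
    """M = W * W^T: M[x][y] counts meta-path instances from x to y."""
    n, m = len(W), len(W[0])
    return [[sum(W[x][k] * W[y][k] for k in range(m)) for y in range(n)]
            for x in range(n)]

def pathsim(M, x, y):
    """PathSim: 2 * |paths x->y| / (|paths x->x| + |paths y->y|)."""
    return 2.0 * M[x][y] / (M[x][x] + M[y][y])

M = commuting_matrix(W)
print(pathsim(M, 0, 1))  # a0 and a1 share two papers -> 0.8
print(pathsim(M, 0, 2))  # a0 and a2 share none -> 0.0
```

Normalizing by the self-path counts is what keeps PathSim from simply favoring highly connected objects, which is the motivation given for it over raw path counts.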

Spatial Data Processing and Data Quality

Frontmatter
TrajSpark: A Scalable and Efficient In-Memory Management System for Big Trajectory Data

The widespread adoption of mobile positioning devices has generated big trajectory data. Existing disk-based trajectory management systems can no longer provide scalable, low-latency query services. In view of this, we present TrajSpark, a distributed in-memory system that consistently offers efficient management of trajectory data. TrajSpark introduces a new abstraction called IndexTRDD to manage trajectory segments, and exploits a global and local indexing mechanism to accelerate trajectory queries. Furthermore, to alleviate the inherent partitioning overhead, it adopts a time-decay model to monitor changes in the data distribution and updates the data-partition structure adaptively. This model avoids repartitioning existing data when a new batch of data arrives. Extensive experiments with three types of trajectory queries on both real and synthetic datasets demonstrate that TrajSpark outperforms state-of-the-art systems.

Zhigang Zhang, Cheqing Jin, Jiali Mao, Xiaolin Yang, Aoying Zhou
A Local-Global LDA Model for Discovering Geographical Topics from Social Media

Micro-blogging services can track users’ geo-locations when users check in at places or use geo-tagging, which implicitly reveals locations. This “geo-tracking” can help find topics triggered by certain events in certain regions. However, discovering such topics is very challenging because of the large amount of noisy messages (e.g., daily conversations). This paper proposes a method to model geographical topics that can filter out irrelevant words by assigning them different weights in the local and global contexts. Our method is based on the Latent Dirichlet Allocation (LDA) model, but each word is generated from either a local or a global topic distribution according to its generation probabilities. We evaluated our model with data collected from Weibo, currently the most popular micro-blogging service in China. The evaluation results demonstrate that our method outperforms baseline methods on several metrics, such as model perplexity, two kinds of entropies, and the KL-divergence of discovered topics.

Siwei Qiang, Yongkun Wang, Yaohui Jin
Team-Oriented Task Planning in Spatial Crowdsourcing

The rapid development of mobile devices has stimulated the popularity of spatial crowdsourcing. Various spatial crowdsourcing platforms, such as Uber, gMission and Gigwalk, are becoming increasingly important in our daily life. A core functionality of spatial crowdsourcing platforms is to allocate tasks or make plans for workers to efficiently finish the published tasks. However, existing studies usually ignore the fact that tasks may impose different skill requirements on workers, which may lead to a decreased number of accomplished tasks in real-world applications. In this work, we propose a practical problem called Team-Oriented Task Planning (TOTP), which not only makes feasible plans for workers but also satisfies the skill requirements that different tasks impose on workers. We prove the NP-hardness of TOTP and propose two greedy-based heuristic algorithms to solve it. Evaluations on both synthetic and real-world datasets verify the effectiveness and efficiency of the proposed algorithms.

Dawei Gao, Yongxin Tong, Yudian Ji, Ke Xu
Negative Survey with Manual Selection: A Case Study in Chinese Universities

The negative survey is a promising method that can protect personal privacy while collecting sensitive data. Most previous works focus on negative survey models under specific hypotheses, e.g., that the probability of selecting negative categories follows a uniform or Gaussian distribution. Moreover, as far as we know, a negative survey has never been conducted with manual selection in the real world. In this paper, we carry out such a negative survey and find that it may not follow the previous hypotheses, and that existing reconstruction methods such as NStoPS and NStoPS-I perform poorly on the survey data. Therefore, we propose a method called NStoPS-MLE, based on maximum likelihood estimation, for reconstructing useful information from the collected data. This method also uses background knowledge to enhance its performance. Experimental results show that our method obtains more accurate aggregated results than previous methods.

Jianguo Wu, Jianwen Xiang, Dongdong Zhao, Huanhuan Li, Qing Xie, Xiaoyi Hu
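For context, the basic reconstruction step behind NStoPS under the uniform-selection hypothesis (each respondent with true category j reports one of the other c-1 categories uniformly at random) can be sketched as follows; the category counts are hypothetical example data.

```python
# Sketch of the basic NStoPS estimator for a uniform negative survey:
# E[r_i] = (N - t_i) / (c - 1), so the true counts can be recovered as
# t_i = N - (c - 1) * r_i from the negative counts r_i.
def nstops(negative_counts):
    c = len(negative_counts)
    n = sum(negative_counts)
    return [n - (c - 1) * r for r in negative_counts]

# Hypothetical example: true counts t = [60, 30, 10] over c = 3 categories.
# Under the uniform model the *expected* negative counts are:
true_counts = [60, 30, 10]
n, c = sum(true_counts), len(true_counts)
expected_negative = [(n - t) / (c - 1) for t in true_counts]  # [20.0, 35.0, 45.0]
print(nstops(expected_negative))  # recovers [60.0, 30.0, 10.0]
```

When manual selection violates the uniform hypothesis, the expected negative counts no longer follow this formula, which is exactly the failure mode the paper reports for NStoPS-style estimators.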
Element-Oriented Method of Assessing Landscape of Sightseeing Spots by Using Social Images

Assessing the quality of sightseeing spots is a key challenge to satisfy the diverse needs of tourists and discover new sightseeing resources (spots). In this paper, we propose an element-oriented method of landscape assessment that analyzes images available on image-sharing web sites. The experimental results demonstrate that our method is superior to the existing ones based on low-level visual features and user behavior analysis.

Yizhu Shen, Chenyi Zhuang, Qiang Ma
Sifting Truths from Multiple Low-Quality Data Sources

In this paper, we study the problem of assessing the quality of co-reference tuples extracted from multiple low-quality data sources and finding true values from them, a critical part of an effective data integration solution. To solve this problem, we first propose a model to specify tuple quality. Then we present a framework to infer tuple quality based on the concept of quality predicates. In particular, we propose an algorithm underlying the framework to find true values for each attribute. Finally, we conduct extensive experiments on real-life data to verify the effectiveness and efficiency of our methods.

Zizhe Xie, Qizhi Liu, Zhifeng Bao

Graph Data Processing

Frontmatter
A Community-Aware Approach to Minimizing Dissemination in Graphs

Given a graph, can we minimize the spread of an entity (such as a meme or a virus) while maintaining the graph’s community structure (defined as groups of nodes with denser intra-connectivity than inter-connectivity)? At first glance, these two objectives seem at odds with each other. To minimize dissemination, nodes or links are often deleted to reduce the graph’s connectivity. These deletions can (and often do) destroy the graph’s community structure, which is an important construct in real-world settings (e.g., communities promote trust among their members). We utilize rewiring of links to achieve both objectives. Examples of rewiring in real life are prevalent, such as purchasing products from a new farm since the local farm has signs of mad cow disease; getting information from a new source after a disaster since your usual source is no longer available, etc. Our community-aware approach, called constrCRlink (short for Constraint Community Relink), preserves (on average) $$98.6\%$$ of the efficacy of the best community-agnostic link-deletion approach (namely, NetMelt$$^{+}$$), but changes the original community structure of the graph by only $$4.5\%$$. In contrast, NetMelt$$^{+}$$ changes $$13.6\%$$ of the original community structure.

Chuxu Zhang, Lu Yu, Chuang Liu, Zi-Ke Zhang, Tao Zhou
Time-Constrained Graph Pattern Matching in a Large Temporal Graph

Graph pattern matching (GPM) is an important operation in graph computation. Most existing work assumes that the query graph or data graph is static, which is contrary to the fact that graphs in real life are intrinsically dynamic. Therefore, in this paper, we propose a new problem of Time-Constrained Graph Pattern Matching (TCGPM) in a large temporal graph. Different from traditional work, ours deals with temporal graphs rather than a series of snapshots. Moreover, the query graph in TCGPM contains two types of time constraints, which are helpful for finding more useful subgraphs. To address TCGPM, a baseline method and an improved method are proposed, along with two pruning rules that further improve efficiency. The improved method runs several orders of magnitude faster than the baseline method, and the effectiveness of TCGPM is several orders of magnitude better than that of GPM. Extensive experiments on three real and semi-real datasets demonstrate the high performance of our proposed methods.

Yanxia Xu, Jinjing Huang, An Liu, Zhixu Li, Hongzhi Yin, Lei Zhao
Efficient Compression on Real World Directed Graphs

Real-world graphs typically exhibit power-law degrees, and many of them are directed. The growing scale of such graphs has made efficient execution of graph computation very challenging. Reducing graph size to fit in memory, for example via lossless compression, is crucial for cutting the cost of large-scale graph computation. Unfortunately, existing work on graph compression still suffers from issues including low compression ratio and high decompression overhead. To address these issues, we propose a novel compression approach. The basic idea is to first cluster the graph adjacency matrix using graph structure information, and then represent the clustered matrix by lists of encoded numbers. Our extensive evaluation on real datasets validates the tradeoff between space savings and graph computation time, and verifies the advantages of our work over the state-of-the-art SlashBurn [1, 2] and Ligra+ [3].

Guohua Li, Weixiong Rao, Zhongxiao Jin
Keyphrase Extraction Using Knowledge Graphs

Extracting keyphrases from documents automatically is an important and interesting task, since keyphrases provide a quick summarization of documents. Although much effort has been devoted to keyphrase extraction, most existing methods (co-occurrence-based and statistics-based methods) do not take semantics fully into consideration. The co-occurrence-based methods depend heavily on the co-occurrence relations between two words in the input document, which may ignore many semantic relations. The statistics-based methods exploit an external text corpus to enrich the document, which inevitably introduces more unrelated relations. In this paper, we propose a novel approach to extract keyphrases using knowledge graphs, based on which we can detect the latent relations between two keyterms (i.e., noun words and named entities) without introducing much noise. Extensive experiments on real data show that our method outperforms state-of-the-art methods, including graph-based co-occurrence methods and statistics-based clustering methods.

Wei Shi, Weiguo Zheng, Jeffrey Xu Yu, Hong Cheng, Lei Zou
Semantic-Aware Partitioning on RDF Graphs

With the development of the Semantic Web, an increasingly large number of organizations represent their data in RDF format. A single machine cannot efficiently process complex queries on large RDF graphs, so it becomes necessary to use a distributed cluster to store and process large-scale RDF datasets, which must therefore be partitioned. In this paper, we propose a semantic-aware partitioning method for RDF graphs. Inspired by the PageRank algorithm, classes in RDF schema graphs are ranked, and a novel partitioning algorithm is proposed that leverages the semantic information of RDF and reduces crossing edges between different fragments. Extensive experiments on both synthetic and real-world datasets show that our semantic-aware RDF graph partitioning outperforms state-of-the-art methods by a large margin.

Qiang Xu, Xin Wang, Junhu Wang, Yajun Yang, Zhiyong Feng
An Incremental Algorithm for Estimating Average Clustering Coefficient Based on Random Walk

The clustering coefficient is an important measure in social network analysis, community detection, and many other applications. However, it is expensive to compute for real-world networks, because many networks, such as Facebook and Twitter, are large and evolve continuously. To improve the performance of clustering coefficient computation for large, evolving networks, we propose an incremental algorithm based on a random walk model. The proposed algorithm stores the previous random walk path and updates the average clustering coefficient estimate by reconstructing partial paths incrementally, instead of recomputing the clustering coefficient from scratch whenever the graph changes. Theoretical analysis suggests that the proposed algorithm effectively improves the performance of clustering coefficient estimation for dynamic graphs without sacrificing accuracy. Extensive experiments on real-world graphs also demonstrate that the proposed algorithm reduces the running time significantly compared with a state-of-the-art algorithm based on random walks.

Qun Liao, Lei Sun, He Du, Yulu Yang
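For reference, the quantity this estimator approximates, the average clustering coefficient, can be computed exactly on a toy graph; the adjacency sets below are hypothetical example data, not the paper's algorithm.

```python
# Exact average clustering coefficient -- the quantity the incremental
# random-walk estimator approximates. Toy undirected graph as adjacency sets.
adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1},
    3: {0},
}

def local_cc(v):
    """Fraction of pairs of v's neighbours that are themselves connected."""
    nbrs = list(adj[v])
    d = len(nbrs)
    if d < 2:
        return 0.0
    links = sum(1 for i in range(d) for j in range(i + 1, d)
                if nbrs[j] in adj[nbrs[i]])
    return 2.0 * links / (d * (d - 1))

avg_cc = sum(local_cc(v) for v in adj) / len(adj)
print(avg_cc)  # (1/3 + 1 + 1 + 0) / 4 = 7/12
```

The exact computation touches every pair of neighbours of every node, which is what becomes too expensive on large, continuously evolving graphs and motivates random-walk estimation.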

Data Mining, Privacy and Semantic Analysis

Frontmatter
Deep Multi-label Hashing for Large-Scale Visual Search Based on Semantic Graph

Huge volumes of images are aggregated over time as many people upload their favorite images to social websites such as Flickr and share them with friends. Accordingly, visual search over large-scale image databases is becoming more and more important. Hashing is an efficient technique for large-scale visual content search, and learning-based hashing approaches have achieved great success thanks to recent advances in deep learning. However, most existing deep hashing methods focus on single-label images, whose hash codes cannot preserve the semantic similarity of images well. In this paper, we propose a novel framework, deep multi-label hashing (DMLH) based on a semantic graph, which consists of three key components: (i) Image labels, semantically similar in terms of co-occurrence relationships, are clustered so that similar labels fall in the same cluster; this provides accurate ground truth for hash learning. (ii) A deep model is trained to simultaneously generate hash codes and feature vectors of images, based on which multi-label image databases are organized into hash tables; this model improves retrieval speed while preserving semantic similarity among images. (iii) A combination of hash-code-based coarse search and feature-vector-based fine ranking provides efficient and accurate retrieval. Extensive experiments on several large image datasets confirm that the proposed DMLH method outperforms state-of-the-art supervised and unsupervised image retrieval approaches, with gains ranging from 6.25% to 38.9% in terms of mean average precision.

Chunlin Zhong, Yi Yu, Suhua Tang, Shin’ichi Satoh, Kai Xing
An Ontology-Based Latent Semantic Indexing Approach Using Long Short-Term Memory Networks

Online data is increasing at an astonishing rate, and the issue of semantic indexing remains an open question. Ontologies and knowledge bases have been widely used to optimize performance. However, researchers have placed increased emphasis on the internal relations of ontologies while neglecting the latent semantic relations between ontologies and documents; they generally annotate instances mentioned in documents that are related to concepts in ontologies. In this paper, we propose an Ontology-based Latent Semantic Indexing approach utilizing Long Short-Term Memory networks (LSTM-OLSI). We utilize an importance-aware topic model to extract document-level semantic features and leverage ontologies to extract word-level contextual features. We then encode these two levels of features and match their embedding vectors using LSTM networks. The experimental results reveal that LSTM-OLSI outperforms existing techniques and demonstrates deep comprehension of instances and articles.

Ningning Ma, Hai-Tao Zheng, Xi Xiao
Privacy-Preserving Collaborative Web Services QoS Prediction via Differential Privacy

Collaborative Web services QoS prediction has become an important tool for generating accurate personalized QoS. While much has been achieved in improving the accuracy of collaborative QoS prediction, little work has been done on protecting user privacy in this process. In this paper, we propose a privacy-preserving collaborative QoS prediction framework that protects the private data of users while retaining the ability to generate accurate QoS predictions. We introduce differential privacy, a rigorous and provable privacy-preserving technique, into the preprocessing of QoS prediction data. We implement the proposed approach using the Laplace mechanism and conduct extensive experiments to study its performance on a real-world dataset. The experiments evaluate the privacy-accuracy trade-off under different settings and show that, under certain constraints, our proposed approach outperforms the baselines.

Shushu Liu, An Liu, Zhixu Li, Guanfeng Liu, Jiajie Xu, Lei Zhao, Kai Zheng
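A minimal sketch of the Laplace mechanism the framework builds on, assuming a numeric QoS value with a given L1 sensitivity; the value, sensitivity, and seeds below are hypothetical.

```python
import math
import random

# Sketch of the Laplace mechanism: perturb a numeric value with noise
# drawn from Laplace(0, sensitivity / epsilon), which yields
# epsilon-differential privacy for a query with the given L1 sensitivity.
def laplace_noise(scale, rng):
    # Inverse-CDF sampling of a Laplace(0, scale) variate.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def laplace_mechanism(value, sensitivity, epsilon, rng):
    return value + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical example: privatize a response-time observation of 120 ms
# with sensitivity 1.0 at two privacy levels.
rng = random.Random(7)
print(laplace_mechanism(120.0, 1.0, 0.1, rng))   # small epsilon: noisier
print(laplace_mechanism(120.0, 1.0, 10.0, rng))  # large epsilon: near 120
```

The noise scale sensitivity/epsilon makes the privacy-accuracy trade-off explicit: shrinking epsilon strengthens the privacy guarantee but inflates the perturbation of the QoS data.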
High-Utility Sequential Pattern Mining with Multiple Minimum Utility Thresholds

High-utility sequential pattern mining has emerged as an important topic in recent decades, and most algorithms were designed to identify the complete set of high-utility sequential patterns under a single minimum utility threshold. In this paper, we first propose a novel framework for mining high-utility sequential patterns with multiple minimum utility thresholds. A mining algorithm, a lexicographic sequence (LS)-tree, and a utility-linked (UL)-list structure are designed to efficiently mine the high-utility sequential patterns (HUSPs). Three pruning strategies are then introduced to lower the upper-bound values of candidate sequences and to reduce the search space by pruning unpromising candidates early. Substantial experiments on real-life datasets show that our proposed algorithms can effectively and efficiently mine the complete set of HUSPs under multiple minimum utility thresholds.

Jerry Chun-Wei Lin, Jiexiong Zhang, Philippe Fournier-Viger
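For background, the basic utility computation underlying high-utility pattern mining, together with one common convention for multiple minimum utility thresholds (taking the minimum of the per-item thresholds), can be sketched as follows. The items, profits, and thresholds are hypothetical example data, not the paper's algorithm.

```python
# Sketch of the basic utility computation in high-utility pattern mining.
# Quantities and unit profits below are hypothetical example data.
profit = {"a": 4, "b": 1, "c": 3}            # external utility per item
db = [                                        # transactions: item -> quantity
    {"a": 2, "b": 5},
    {"a": 1, "b": 1, "c": 2},
    {"b": 3, "c": 1},
]

def utility(pattern, tx):
    """Utility of `pattern` in one transaction (0 if not fully contained)."""
    if not all(i in tx for i in pattern):
        return 0
    return sum(tx[i] * profit[i] for i in pattern)

def db_utility(pattern):
    return sum(utility(pattern, tx) for tx in db)

# One convention for multiple minimum utility thresholds: a pattern's
# threshold is the minimum of its items' per-item thresholds.
mu = {"a": 15, "b": 10, "c": 8}

def is_high_utility(pattern):
    return db_utility(pattern) >= min(mu[i] for i in pattern)

print(db_utility({"a"}))            # 2*4 + 1*4 = 12
print(db_utility({"a", "b"}))       # (8+5) + (4+1) = 18
print(is_high_utility({"a", "b"}))  # 18 >= min(15, 10) -> True
```

With a single global threshold, rare but valuable items are either missed (threshold too high) or drowned in trivial patterns (threshold too low); per-item thresholds are the standard remedy this line of work builds on.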
Extracting Various Types of Informative Web Content via Fuzzy Sequential Pattern Mining

In this paper, we present a web content extraction method that extracts different types of informative content from news web pages. A fuzzy sequential pattern mining method, namely FSP, is developed to gradually discover fuzzy sequential patterns for various types of informative web content. To avoid dependence on HTML tag usage, which may change as web technology develops, fuzzy sequential patterns are mined using a stable feature: the number of tokens in each line of source code. We have conducted extensive experiments and observed good clustering properties for the discovered sequential patterns. Experimental results demonstrate that the FSP method is effective compared with state-of-the-art content extraction methods. Besides the main articles of web pages, it can also effectively find other types of interesting web content, such as article recommendations and article titles.

Ting Huang, Ruizhang Huang, Bowei Liu, Yingying Yan
Exploiting High Utility Occupancy Patterns

Most studies have considered frequency as the sole interestingness measure for identifying high-quality patterns. However, each object is different in nature, in terms of criteria such as utility, risk, or interest. Another limitation of frequent patterns is that they generally have low occupancy and may not be truly representative. Thus, this paper extends the occupancy measure to assess the utility of patterns in transaction databases. The High Utility Occupancy Pattern Mining (HUOPM) algorithm considers user preferences in terms of frequency, utility, and occupancy. Several novel data structures are designed to discover the complete set of high-quality patterns without candidate generation. Extensive experiments have been conducted on several datasets to evaluate the effectiveness and efficiency of HUOPM.

Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao

Text and Log Data Management

Frontmatter
Translation Language Model Enhancement for Community Question Retrieval Using User Adoption Answer

Community Question Answering (CQA) services on the Web provide an important alternative for knowledge acquisition. As an essential component of CQA services, question retrieval can help users save time by finding relevant questions. However, there is a “gap” between queried questions and candidate questions, known as the lexical chasm or word mismatch problem. In this paper, we improve the traditional Topic-inference-based Translation Language Model (T$$^2$$LM) by using the topic information of queries. Moreover, we make use of user information, specifically the number of user adoption answers, to further enhance our proposed model. In our model, the translation model and the topic model “bridge” the word gap by linking different words, while user information, which has no direct relation to semantics, helps us “bypass” the gap. Combining both yields a considerable improvement in question retrieval performance. Experimental results on a real Chinese CQA dataset show that our proposed model improves retrieval performance over the T$$^2$$LM baseline by 7.5% in terms of Mean Average Precision (MAP).

Ming Chen, Lin Li, Qing Xie
Holographic Lexical Chain and Its Application in Chinese Text Summarization

Lexical chains have been widely used in many NLP areas. However, when using them for Web text summarization, especially domain-specific text summarization, we obtained low-accuracy results. The main reason is that traditional lexical chains only take nouns into consideration, so information from other grammatical parts is missing. We introduce lexical chains of predicates and of adjectives (adverbs); these three types of lexical chains together are called holographic lexical chains (HLCs), which capture most of the information in a text. A specifically designed construction method for HLCs is presented. We applied the HLC method to Chinese text summarization and used machine learning methods whose features are adapted to the new method. In a comparative study of Chinese foreign trade texts, we obtained summarization results with an accuracy of 86.88%. Our HLC construction method improves accuracy by 7.02% over the best known methods in Chinese text summarization.

Shengluan Hou, Yu Huang, Chaoqun Fei, Shuhan Zhang, Ruqian Lu
Authorship Identification of Source Codes

Source code authorship identification is a form of document authorship identification: identifying the authors of source code or programs based on source code examples from programmers. Its main applications include software intellectual property infringement, malicious code detection, and software maintenance and update. This paper proposes an approach to constructing author profiles of programmers based on a logic model of continuous and discrete word-level n-grams, and a multi-level context model covering operations, loops, arrays, and methods. Further, we employ sequential minimal optimization for support vector machine training to identify the authorship of source code. The advantage of these author profiles is that they can discover explicit and implicit personal programming preference patterns of and between keywords, identifiers, operators, statements, methods, and classes. Experimental results on programs from two open source websites demonstrate that our approach achieves high accuracy and outperforms the baseline methods.

Chunxia Zhang, Sen Wang, Jiayu Wu, Zhendong Niu
DFDS: A Domain-Independent Framework for Document-Level Sentiment Analysis Based on RST

Document-level sentiment analysis is among the most popular research fields of natural language processing in recent years, and one of its major challenges is that discourse structural information can hardly be captured by existing approaches. In this paper, we propose a domain-independent framework for document-level sentiment classification with weighting rules based on Rhetorical Structure Theory (RST). First, original textual documents are parsed into rhetorical structure trees through a preprocessing pipeline. Next, the sentiment score of each elementary discourse unit is computed via a sentence-level sentiment classification method. Finally, according to the rhetorical relations between neighboring discourse units, we define a weighting schema and composing rules by which the scores of elementary discourse units are summed recursively up to the whole document. Experimental results show that our approach performs better on datasets from different domains than state-of-the-art document-level sentiment analysis systems based on RST, with the best result 15% higher than the baseline.

Zhenyu Zhao, Guozheng Rao, Zhiyong Feng
Fast Follower Recovery for State Machine Replication

State machine replication with a single strong Leader has been widely used in modern cluster-based database systems. In practical applications, recovery speed has a significant impact on system availability. However, to guarantee data consistency, existing Follower recovery protocols in Paxos replication (e.g., Raft) need multiple network round trips or extra data transmission, which may increase recovery time. In this paper, we propose the Follower Recovery using Special mark log entry (FRS) algorithm. FRS is more robust and resilient to Follower failure, and it needs only one network round trip to fetch the minimum number of log entries. This approach is implemented in the open source database system OceanBase. We experimentally show that a system adopting FRS performs well in terms of recovery time.

Jinwei Guo, Jiahao Wang, Peng Cai, Weining Qian, Aoying Zhou, Xiaohang Zhu
Laser: Load-Adaptive Group Commit in Lock-Free Transaction Logging

The log manager is a key component of a DBMS and is considered the most prominent bottleneck in modern in-memory OLTP systems. In this paper, by addressing two performance hurdles in the current logging procedure, we propose a high-performance transaction logging engine, Laser, and integrate it into OceanBase, an in-memory OLTP system. First, we present a lock-free transaction logging framework to eliminate lock contention. Then, based on a theoretical analysis, we propose a judicious grouping strategy that determines an optimized group commit time for different workloads. Experimental results show that Laser improves throughput by 1.4x–2.4x and reduces latency by more than $$60\%$$ compared with current methods.

Huan Zhou, Huiqi Hu, Tao Zhu, Weining Qian, Aoying Zhou, Yukun He

Social Networks

Frontmatter
Detecting User Occupations on Microblogging Platforms: An Experimental Study

User occupation refers to the professional position of a user in the real world. It is very helpful for a number of applications, e.g., personalized recommendation and targeted advertising. However, because of the risk of privacy leaks, many users do not provide their occupation information on microblogging platforms, which makes it hard to detect user occupations there. In this paper, we conduct an experimental study on this issue. In particular, we propose an experimental framework for detecting user occupations on microblogging platforms. We first implement a number of classification models and devise various sets of features for user occupation detection. Then, we construct an occupation-oriented lexicon, built by an iterative extension algorithm that considers semantic similarity and importance between words, and combine the lexicon with a word embedding approach to detect user occupations. We conduct comprehensive experiments and present a set of experimental results, which show that the lexicon-based word embedding method achieves higher accuracy than traditional feature-based classification models.

Xia Lv, Peiquan Jin, Lin Mu, Shouhong Wan, Lihua Yue
Counting Edges and Triangles in Online Social Networks via Random Walk

Online social network (OSN) analysis has attracted much attention in recent years. Edge and triangle counts are both fundamental properties of OSNs. However, for many OSNs, one can only access parts of the network through application programming interfaces (APIs). In such cases, conventional algorithms become infeasible. In this paper, we introduce efficient algorithms for estimating the number of edges and triangles in OSNs based on random walk sampling. We also derive a theoretical bound on the sample size and the number of API calls needed by our algorithms for a probabilistic accuracy guarantee. We ran experiments on several publicly available real-world networks, and the results demonstrate the effectiveness of our algorithms.

Yang Wu, Cheng Long, Ada Wai-Chee Fu, Zitong Chen
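The abstract above does not spell out the estimator; as an illustrative sketch (not necessarily the paper's method), one standard random-walk approach exploits the fact that a walk visits node v with stationary probability d(v)/(2m), so E[1/d(v)] = n/(2m) and the edge count m can be estimated when the node count n is known. All names below are hypothetical.

```python
import random

def random_walk(adj, start, steps, rng):
    """Simple random walk on an adjacency-list graph; returns visited nodes."""
    node, visited = start, []
    for _ in range(steps):
        node = rng.choice(adj[node])
        visited.append(node)
    return visited

def estimate_edges(adj, n, walk):
    """Estimate edge count m: the walk samples v with prob d(v)/(2m),
    so the sample mean of 1/d(v) approximates n/(2m)."""
    inv_deg = [1.0 / len(adj[v]) for v in walk]
    return n / (2.0 * (sum(inv_deg) / len(inv_deg)))

# Toy example: complete graph on 10 nodes, which has 45 edges.
n = 10
adj = {v: [u for u in range(n) if u != v] for v in range(n)}
walk = random_walk(adj, 0, 1000, random.Random(0))
print(estimate_edges(adj, n, walk))  # close to 45.0 (exact on a regular graph)
```

On a regular graph the estimate is exact; on irregular graphs its accuracy depends on the walk length, which is what the paper's theoretical bounds address.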
Fair Reviewer Assignment Considering Academic Social Network

An important task in peer review is assigning papers to appropriate reviewers, known as the Reviewer Assignment (RA) problem. Most existing works focus mainly on the similarity between paper topics and reviewer expertise; few consider the multiple relationships between authors and reviewers in an academic social network. However, these relationships can influence the fairness of reviewer assessments. In this paper, we address the RA problem while taking the academic social network into account, seeking a confident, fair, and balanced assignment. We model papers and reviewers by a matching degree that combines collaboration distance and topic similarity, and propose the Maximum Sum of Matching degree RA (MSMRA) problem. Two algorithms are designed for the MSMRA problem: Simulated Annealing-based Stochastic Approximation, and Maximum Matching and Minimum Deviation. Experiments show that our methods achieve good performance in both overall effectiveness and fairness distribution within reasonable running time.

Kaixia Li, Zhao Cao, Dacheng Qu
Viral Marketing for Digital Goods in Social Networks

Influence maximization is the problem of finding a small set of highly influential individuals in a social network so as to maximize the spread of influence. However, it neglects the distinction between the spread of influence and profit. The problem of profit maximization in social networks extends influence maximization to a realistic setting that aims to gain maximum profit. In this paper, we consider how to sell digital goods (with near-zero marginal cost) by viral marketing in a social network, which can be modeled as a profit maximization problem. We show that the problem is an unconstrained submodular maximization and adopt two efficient algorithms from two approaches. The first is a well-known algorithm from theoretical computer science that achieves a tight linear-time (1/2)-approximation. The second is a profit discount heuristic that we propose to improve efficiency. Through extensive experiments, we demonstrate the efficiency and quality of the algorithms we applied. Based on the results of our research, we also provide some advice for practical viral marketing.

Yu Qiao, Jun Wu, Lei Zhang, Chongjun Wang
Change Detection from Media Sharing Community

This paper investigates how social images and image change detection techniques can be applied to identify the damage caused by natural disasters for disaster assessment. We propose a framework that takes advantage of near-duplicate image detection and robust boundary matching for change detection in disasters. First, we perform near-duplicate detection by local interest point-based matching over image pairs. Then, we propose a novel boundary representation model called relative position annulus (RPA), which is robust to boundary rotation, location shift, and editing operations. A new RPA matching method is proposed by extending dynamic time warping (DTW) from time to position annulus. We have conducted extensive experiments to evaluate the effectiveness and efficiency of our approach.

Naoki Kito, Xiangmin Zhou, Dong Qin, Yongli Ren, Xiuzhen Zhang, James Thom
Measuring the Similarity of Nodes in Signed Social Networks with Positive and Negative Links

Similarity measures in non-signed social networks have been extensively studied for decades. However, how to measure the similarity of two nodes in signed social networks remains an open problem. It is challenging to incorporate both positive and negative relationships simultaneously in signed social networks because of the opposite opinions they imply. In this paper, we study the similarity measure problem in signed social networks. We propose a basic node similarity measure that can utilize both positive and negative relations by comparing the immediate neighbors of two objects. Moreover, we exploit the propagation of similarity in networks. Finally, we perform an extensive experimental comparison of the proposed method against existing algorithms on a real data set. Our experimental results show that our method outperforms other approaches.

Tianchen Zhu, Zhaohui Peng, Xinghua Wang, Xiaoguang Hong
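The abstract does not define the paper's measure. As a minimal sketch in the same spirit (comparing immediate neighbors while respecting link signs), one could reward same-sign neighbor overlap and penalize opposite-sign overlap; the function below is purely illustrative, not the authors' formula.

```python
def signed_similarity(pos, neg, u, v):
    """Illustrative neighbor-overlap similarity for signed graphs:
    shared same-sign neighbors raise the score, opposite-sign overlaps lower it.
    pos/neg map each node to its sets of positively/negatively linked neighbors."""
    pu, nu = pos.get(u, set()), neg.get(u, set())
    pv, nv = pos.get(v, set()), neg.get(v, set())
    agree = len(pu & pv) + len(nu & nv)       # neighbors with matching signs
    disagree = len(pu & nv) + len(nu & pv)    # neighbors with opposite signs
    total = len((pu | nu) | (pv | nv))        # all distinct neighbors of u and v
    return (agree - disagree) / total if total else 0.0

# Two users with identical signed neighborhoods score 1.0.
pos = {"a": {"x", "y"}, "b": {"x", "y"}}
neg = {"a": {"z"}, "b": {"z"}}
print(signed_similarity(pos, neg, "a", "b"))  # 1.0
```

The score lies in [-1, 1], with negative values indicating mostly opposing opinions, which is the kind of behavior a signed similarity measure must capture.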

Data Mining and Data Streams

Frontmatter
Elastic Resource Provisioning for Batched Stream Processing System in Container Cloud

Batched stream processing systems achieve higher throughput than traditional stream processing systems while providing low latency guarantees. Recently, batched stream processing systems have tended to be deployed in the cloud because of their requirements for elasticity and cost efficiency. However, the performance of batched stream processing systems is hardly guaranteed in the cloud, because static resource provisioning for such systems does not accommodate stream fluctuation and uneven workload distribution. In this paper, we propose EStream, an elastic batched stream processing system based on Spark Streaming, which transparently adjusts available resources to handle workload fluctuation and uneven distribution in a container cloud. Specifically, EStream automatically scales the cluster when resource insufficiency or over-provisioning is detected under workload fluctuation. In addition, it schedules resources within the cluster according to the workload distribution. Experimental results show that EStream handles workload fluctuation and uneven distribution transparently and improves resource efficiency compared to the original Spark Streaming.

Song Wu, Xingjun Wang, Hai Jin, Haibao Chen
An Adaptive Framework for RDF Stream Processing

In this paper, we propose a novel framework for RDF stream processing named PRSP. Within this framework, the evaluation of C-SPARQL queries on RDF streams can be reduced to the evaluation of SPARQL queries on RDF graphs. We prove that this reduction is sound and complete. With PRSP, we implement several engines to support C-SPARQL queries by employing current SPARQL query engines such as Jena, gStore, and RDF-3X. The experiments show that PRSP maintains high performance when applying those engines to RDF stream processing, although there are some slight differences among them. Moreover, taking advantage of PRSP, we can process large-scale RDF streams in a distributed context via distributed SPARQL engines such as gStoreD and TriAD. Besides, we can evaluate the performance and correctness of existing SPARQL query engines in processing RDF streams in a unified way, which extends their evaluation from static RDF data to dynamic RDF data.

Qiong Li, Xiaowang Zhang, Zhiyong Feng
Investigating Microstructure Patterns of Enterprise Network in Perspective of Ego Network

In social networks, the behavior of individuals can be studied through the evolution of microstructure. It is well known that the triad is the basic building block of a social network. However, we find that in an Enterprise Network (EN) the quad, rather than the triad, plays this basic role. In particular, we focus on four typical microstructure patterns in ENs: the triad, 4-cycle, 4-chordal-cycle, and 4-clique. We propose algorithms to mine these microstructure patterns and to compute the frequency of each pattern type in an efficient parallel way. We also analyze the structural features of these microstructure patterns from the perspective of ego networks. Additionally, we present the evolutionary rules between these microstructure patterns based on statistical analysis. Finally, we combine these features with traditional methods to solve the link prediction problem. The results show that these features and our combination methods are effective for predicting links between enterprises in an EN.

Xiutao Shi, Liqiang Wang, Shijun Liu, Yafang Wang, Li Pan, Lei Wu
Neural Architecture for Negative Opinion Expressions Extraction

Opinion expression extraction is one of the main tasks in opinion mining. Extracting negative opinions is more difficult than extracting positive opinions because of indirect expressions. In particular, in the domain of consumer reviews, consumers are more easily influenced by negative reviews when making decisions. In this paper, we focus on the extraction of negative opinion expressions from consumer reviews. State-of-the-art methods depend heavily on task-specific knowledge in the form of handcrafted features and data pre-processing. Instead, we use a neural architecture combining word embeddings, a Bi-LSTM, and a CRF: we add a conditional random field (CRF) layer to a bidirectional long short-term memory (Bi-LSTM) recurrent neural network language model, which provides sentence-level tag information and improves experimental results. Our model requires no feature engineering and outperforms feature-dependent methods in experiments on real-world reviews from Amazon.com.

Hui Wen, Minglan Li, Zhili Ye
Identifying the Academic Rising Stars via Pairwise Citation Increment Ranking

Predicting the fast-rising young researchers (the Academic Rising Stars) of the future provides useful guidance to the research community, e.g., offering competitive candidates for young faculty hiring at universities, as they are expected to have successful academic careers. In this work, given a set of young researchers who have published their first first-author paper recently, we solve the problem of effectively predicting the top k% researchers who achieve the highest citation increment in Δt years. We explore a series of factors that can drive an author to be fast-rising and design a novel pairwise citation increment ranking (PCIR) method that leverages those factors to predict the academic rising stars. Experimental results on the large ArnetMiner dataset with over 1.7 million authors demonstrate the effectiveness of PCIR. Specifically, it outperforms all given benchmark methods, with over 8% average improvement. Further analysis shows that temporal features are the best indicators for rising star prediction, while venue features are less relevant.

Chuxu Zhang, Chuang Liu, Lu Yu, Zi-Ke Zhang, Tao Zhou
Fuzzy Rough Incremental Attribute Reduction Applying Dependency Measures

Since data increases over time and space, many incremental rough-set-based reduction techniques have been proposed. Some of these techniques focus on knowledge representation over increasing data, and some focus on inducing rules from it. There is, however, less work on incremental feature selection (i.e., attribute reduction) over increasing data, especially increasing real-valued data. We apply fuzzy rough sets in our incremental method because they can effectively reduce attributes over real-valued data. By analyzing the basic concepts of fuzzy rough sets on incremental datasets, such as the lower approximation and positive region, we derive the incremental mechanisms of these concepts and design an incremental algorithm. Finally, numerical experiments demonstrate that the incremental algorithm is effective and efficient compared with non-incremental attribute reduction algorithms, especially on datasets with a large number of attributes.

Yangming Liu, Suyun Zhao, Hong Chen, Cuiping Li, Yanmin Lu

Query Processing

Frontmatter
SET: Secure and Efficient Top-k Query in Two-Tiered Wireless Sensor Networks

The top-k query is one of the most important queries in wireless sensor networks (WSNs). It returns the k highest or lowest data values collected in the entire network. Owing to their abundant resources and high efficiency, many future large-scale WSNs are expected to follow a two-tiered architecture with resource-rich master nodes. However, the sensor network is unattended and insecure. Since master nodes store data collected by sensor nodes and respond to queries issued by users, they are attractive to adversaries. It is challenging to process top-k queries while protecting sensitive data from adversaries. To address this problem, we propose SET, a framework for secure and efficient top-k query in two-tiered WSNs. A renormalized arithmetic coding scheme is designed to enable each master node to obtain the exact top-k result without knowing the actual data. Besides, a verification scheme is presented to allow the sink to detect compromised master nodes. Finally, theoretical analysis and experimental results confirm the efficiency, accuracy, and security of our proposal.

Xiaoying Zhang, Hui Peng, Lei Dong, Hong Chen, Hui Sun
Top-k Pattern Matching Using an Information-Theoretic Criterion over Probabilistic Data Streams

With the development of data mining technologies for sensor data streams, more sophisticated methods for complex event processing are demanded. Since event recognition results may contain errors, we need to deal with the uncertainty of events. We therefore consider probabilistic event data streams with occurrence probabilities of events, and develop a pattern matching method based on regular expressions. In this paper, we first analyze the semantics of pattern matching over non-probabilistic data streams, and then formulate the problem of top-k pattern matching over probabilistic data streams. We introduce an information-theoretic criterion to select appropriate matches as the result of pattern matching. Then, we present an efficient algorithm to detect top-k matches, and evaluate the effectiveness of our approach using real and synthetic datasets.

Kento Sugiura, Yoshiharu Ishikawa
Sliding Window Top-K Monitoring over Distributed Data Streams

The problem of distributed monitoring has been intensively investigated recently. This paper studies monitoring the top k data objects with the largest aggregate numeric values from distributed data streams within a fixed-size monitoring window W, while minimizing communication cost across the network. We propose a novel algorithm, which reallocates numeric values of data objects among distributed monitoring nodes by assigning revision factors when local constraints are violated, and keeps the local top-k result at distributed nodes in line with the global top-k result. Extensive experiments are conducted on top of Apache Storm to demonstrate the efficiency and scalability of our algorithm.

Zhijin Lv, Ben Chen, Xiaohui Yu
Diversified Top-k Keyword Query Interpretation on Knowledge Graphs

Exploring a knowledge graph through keyword queries to discover meaningful patterns has been studied in many scenarios recently. From the perspective of query understanding, the aim is to find a number of specific interpretations for ambiguous keyword queries. With the assistance of interpretation, users can actively reduce the search space and get more relevant results. In this paper, we propose a novel diversified top-k keyword query interpretation approach for knowledge graphs. Our approach focuses on reducing the redundancy of returned results, namely, enriching the semantics covered by the results. In detail, we (1) formulate a diversified top-k search problem for keyword query interpretation on the schema graph of a knowledge graph; (2) define an effective similarity measure to evaluate the semantic similarity between search results; (3) present an efficient search algorithm that is guaranteed to return the exact top-k results while minimizing similarity calculations; and (4) propose effective pruning strategies to optimize the search algorithm. The experimental results show that our approach improves the diversity of top-k results significantly from the perspectives of both statistics and human cognition. Furthermore, with very limited loss of result precision, our optimization methods greatly improve search efficiency.

Ying Wang, Ming Zhong, Yuanyuan Zhu, Xuhui Li, Tieyun Qian
Group Preference Queries for Location-Based Social Networks

Location-based social networks involve a great number of POIs (points of interest) as well as users' check-in information and their ratings of POIs. We note that users have their own preferences for POI categories. In addition, they have their own networks of friends. Therefore, it is necessary to provide, for a group of users (a circle of friends), a new kind of POI-finding service that considers not only the POI preferences of each user but also other aspects of location-based social networks, such as users' locations and POI ratings. To solve this problem, in this paper we present a new type of query called the Spatial Group Preference (SGP) query. For a group of users, an SGP query returns the top-k POIs that are most likely to satisfy the needs of the users. Specifically, we propose a new evaluation model that considers user preferences for POI categories, POI properties including locations and ratings, and the mutual influence between POIs. Based on this model, we develop algorithms based on the R-tree to evaluate SGP queries. We conduct experiments on a simulation dataset, and the results demonstrate the efficiency of our proposal.

Yuan Tian, Peiquan Jin, Shouhong Wan, Lihua Yue
A Formal Product Search Model with Ensembled Proximity

In this paper, we study the problem of product search, where products are retrieved and ranked based on how well their reviews match the query. Current product search systems are unable to measure the correspondence between a product feature and its desired property. We present a proximity language model that embeds textual adjacency in a frequency-based estimation framework. To tailor it to the product search problem, we explore strategies for distinguishing product features from desired properties, quantifying pairwise proximity based on conditional probability, and aggregating review opinions at the product level. Experiments on a real data set demonstrate the good performance of our model.

Zepeng Fang, Chen Lin, Yun Liang

Topic Modeling

Frontmatter
Incorporating User Preferences Across Multiple Topics into Collaborative Filtering for Personalized Merchant Recommendation

Merchant recommendation, namely recommending personalized merchants to a specific customer, has become increasingly important during the past few years, especially with the prevalence of Location-Based Social Networks (LBSNs). Although many existing methods attempt to address this task, most of them focus on applying a conventional recommendation algorithm (e.g., Collaborative Filtering) while ignoring the hidden information buried in users' reviews. In fact, the real preferences of users on various topics hidden in reviews are very useful for personalized merchant recommendation. To this end, in this paper we propose a graphical model that incorporates users' real preferences on various topics, drawn from their reviews, into the collaborative filtering technique for personalized merchant recommendation. We then develop an optimization algorithm based on a Gaussian model to train our merchant recommendation approach. Finally, we conduct extensive experiments on two real-world datasets to demonstrate the efficiency and effectiveness of our model. The experimental results clearly show that our proposed model outperforms the state-of-the-art benchmark approaches.

Yunfeng Chen, Lei Zhang, Xin Li, Yu Zong, Guiquan Liu, Enhong Chen
Joint Factorizational Topic Models for Cross-City Recommendation

The research of personalized recommendation techniques has mostly parted into two mainstream directions, namely, factorization-based approaches and topic models. In practice, they aim to benefit from numerical ratings and textual reviews, respectively, which constitute the two major information sources in various real-world systems, including Amazon, Yelp, eBay, Netflix, and many others. However, although the two approaches are supposed to be correlated, given their shared goal of accurate recommendation, there is still no clear theoretical understanding of how their objective functions can be mathematically bridged to leverage numerical ratings and textual reviews collectively, or why such a bridge is intuitively reasonable to match up their learning procedures for the rating prediction and top-N recommendation tasks, respectively. In this work, we show with mathematical analysis that vector-level randomization functions to harmonize the optimization objectives of factorizational and topic models unfortunately do not exist, although they are usually pre-assumed and intuitively designed in the literature. Fortunately, we also point out that one can simply avoid seeking such a randomization function by optimizing a Joint Factorizational Topic (JFT) model directly. We further apply our JFT model to the cross-city Point of Interest (POI) recommendation task for performance validation, an extremely difficult task due to its inherent cold-start nature. Experimental results on real-world datasets verify the appealing performance of our approach against previous methods with pre-assumed randomization functions on both rating prediction and top-N recommendation tasks.

Lin Xiao, Zhang Min, Zhang Yongfeng
Aligning Gaussian-Topic with Embedding Network for Summarization Ranking

Query-oriented summarization addresses the problem of information overload and helps people grasp the main ideas within a short time. Summaries are composed of sentences, so the basic idea of composing a salient summary is to select high-quality sentences with respect to both user-specific queries and multiple documents. Sentence embedding has been shown effective in summarization tasks. However, such methods lack the latent topic structure of the content; a summary that lies only in a vector space can hardly capture multi-topical content. In this paper, our proposed model incorporates topical aspects and continuous vector representations, jointly learning semantically rich representations encoded as vectors. Then, leveraging a topic filtering and embedding ranking model, the summarizer can select desirable salient sentences. Experiments demonstrate the outstanding performance of our proposed model from the perspectives of prominent topics and semantic coherence.

Linjing Wei, Heyan Huang, Yang Gao, Xiaochi Wei, Chong Feng
Improving Document Clustering for Short Texts by Long Documents via a Dirichlet Multinomial Allocation Model

Document clustering for short texts has received considerable interest. Traditional document clustering approaches are designed for long documents and perform poorly on short texts because of their sparse representation. To better understand short texts, we observe that words that appear in long documents can enrich the context of short texts and improve their clustering performance. In this paper, we propose a novel model, DDMAfs, which (1) improves the clustering performance of short texts by sharing the structural knowledge of long documents with short texts; (2) automatically identifies the number of clusters; and (3) separates discriminative words from irrelevant words in long documents to obtain high-quality structural knowledge. Our experiments indicate that the DDMAfs model performs well on both synthetic and real datasets. Comparisons between the DDMAfs model and state-of-the-art short text clustering approaches show that the DDMAfs model is effective.

Yingying Yan, Ruizhang Huang, Can Ma, Liyang Xu, Zhiyuan Ding, Rui Wang, Ting Huang, Bowei Liu
Intensity of Relationship Between Words: Using Word Triangles in Topic Discovery for Short Texts

Uncovering latent topics from texts is an important task for helping people cope with information overload, and it has motivated intensive study of topic models. However, most texts produced daily are short, so traditional topic models may perform poorly because of data sparsity. Popular models for short texts concentrate on word co-occurrence patterns in the corpus, but they do not consider the intensity of the relationship between words. We therefore propose a new approach, called the word-network triangle topic model (WTTM), in which we search for word triangles to measure the relations between words. The results of experiments on real-world corpora show that our method performs better under several evaluation measures.

Ming Xu, Yang Cai, Hesheng Wu, Chongjun Wang, Ning Li
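The WTTM abstract above centers on word triangles; how they feed the topic model is not described, but the triangles themselves can be sketched generically: build a word co-occurrence graph from short texts and enumerate triples of mutually linked words. The helper names below are illustrative, not from the paper.

```python
from collections import defaultdict
from itertools import combinations

def build_cooc(docs):
    """Co-occurrence graph: two words are linked if they share a short text."""
    cooc = defaultdict(set)
    for doc in docs:
        for u, v in combinations(set(doc), 2):
            cooc[u].add(v)
            cooc[v].add(u)
    return cooc

def word_triangles(cooc):
    """Word triangles: triples of pairwise co-occurring words."""
    tris = set()
    for w, nbrs in cooc.items():
        for u, v in combinations(sorted(nbrs), 2):
            if v in cooc.get(u, set()):
                tris.add(tuple(sorted((w, u, v))))
    return tris

docs = [["data", "mining", "topic"], ["data", "mining"], ["topic", "model"]]
tris = word_triangles(build_cooc(docs))
print(tris)  # {('data', 'mining', 'topic')}
```

Intuitively, a triangle is stronger evidence of a shared topic than a single co-occurring pair, which is the "intensity of relationship" motivation in the abstract.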
Context-Aware Topic Modeling for Content Tracking in Social Media

Content in social media is difficult to analyse because it is short and informal. Fortunately, some social media data, such as tweets, have rich hashtag information, which can help identify meaningful topics. More importantly, hashtags can better express the context of a tweet. To exploit the effect of hashtags via topic variables, in this paper we propose a context-aware topic model named hashtag-supervised Topic over Time (hsToT), which detects and tracks the evolution of content in social media by integrating hashtag and time information. In hsToT, a document is generated jointly by its words and hashtags, with the hashtags treated as topic indicators of the tweet. Experiments on real data show that hsToT captures hashtag distributions over topics and topic changes over time simultaneously. The model can detect crucial information and successfully track meaningful content and topics.

Jinjing Zhang, Jing Wang, Li Li
Backmatter
Metadata
Title
Web and Big Data
Editors
Lei Chen
Christian S. Jensen
Cyrus Shahabi
Xiaochun Yang
Xiang Lian
Copyright Year
2017
Electronic ISBN
978-3-319-63579-8
Print ISBN
978-3-319-63578-1
DOI
https://doi.org/10.1007/978-3-319-63579-8
