2015 | Book

Web Technologies and Applications

17th Asia-Pacific Web Conference, APWeb 2015, Guangzhou, China, September 18-20, 2015, Proceedings

About this book

This book constitutes the refereed proceedings of the 17th Asia-Pacific Web Conference, APWeb 2015, held in Guangzhou, China, in September 2015.

The 67 full papers, presented together with 3 industrial track papers and 7 demonstration track papers, were carefully reviewed and selected from 146 submissions. The papers cover a wide spectrum of Web-related data management problems and provide a thorough view of the rapid advances in technical solutions.

Table of Contents

Frontmatter

Research Track

Frontmatter
On the Marriage of SPARQL and Keywords

To maximize the advantages of both SPARQL and keyword search, this paper introduces a novel paradigm that combines them and proposes a hybrid query (called an SK query) that integrates SPARQL and keyword search. To answer SK queries efficiently, a structural index is devised and a novel integrated query algorithm is proposed. We evaluate our method on large real RDF graphs, and experiments demonstrate both the effectiveness and the efficiency of our method.

Peng Peng, Lei Zou, Dongyan Zhao
An Online Inference Algorithm for Labeled Latent Dirichlet Allocation

Using topic models to analyze documents is a popular method in text mining. Labeled Latent Dirichlet Allocation (Labeled LDA) is one such model; it is widely used to model tagged documents and to solve related problems, such as tagged document visualization and snippet extraction. However, traditional batch inference for Labeled LDA, which runs over the entire document collection, is computationally expensive and not suitable for large-scale corpora and text streams. In this paper, we develop an efficient online algorithm for Labeled LDA, called online Labeled LDA (online-LLDA). It is based on the particle filter, a Sequential Monte Carlo approximation technique. Our experiments show that online-LLDA significantly outperforms the batch algorithm (batch-LLDA) in time, while preserving equivalent quality.

Qiang Zhou, Heyan Huang, Xian-Ling Mao
Efficient Buffer Management for PCM-Enhanced Hybrid Memory Architecture

Recently, phase change memory (PCM) has been considered in memory architecture to serve as an extension to DRAM, due to its special properties of byte-addressability, low energy consumption, and high read performance. However, PCM has a lower write speed than DRAM and a limited write endurance. Therefore, the co-existence of PCM and DRAM in main memory calls for a careful buffer-management policy that avoids frequent writes to PCM. To address this problem, we present the first approach that reduces PCM writes by efficient page exchanges and page replacements. Specifically, we propose two clock data structures to maintain DRAM and PCM pages, and devise a page exchange method to make recently updated pages reside in DRAM. In addition, unlike previous studies, which do not consider the influence of page replacements on PCM writes, we present a new page replacement algorithm that reduces page replacements on PCM. With this mechanism, we can reduce PCM writes efficiently while keeping a high hit ratio. We conduct trace-driven experiments on both synthetic and real traces. The experimental results suggest that our proposal can greatly reduce PCM writes and maintain a high hit ratio for PCM/DRAM-based hybrid memory architecture.

Kaimeng Chen, Peiquan Jin, Lihua Yue
Efficient Location-Dependent Skyline Queries in Wireless Broadcast Environments

We study the problem of location-dependent skyline query (LDSQ) processing in wireless broadcast environments in this paper. Compared with answering the skyline queries in a conventional setting, two new issues arise while processing location-dependent skyline in wireless broadcast environments. First, the result of an LDSQ is closely related to the query point’s location; secondly, query processing strategies must take the linear property of wireless broadcast media and limited battery life of mobile devices into consideration. To address these new issues, this paper proposes an efficient solution for LDSQ processing in wireless broadcast environments. In particular, data objects to be disseminated are first divided into two parts via pre-computation in the broadcast server, and then a novel air data organization scheme is designed in the broadcast disk. At the mobile client end, an energy-efficient LDSQ processing algorithm is presented. To demonstrate the efficiency of our solution, extensive experiments are conducted along with detailed performance analysis.

Yingyuan Xiao, Pengqiang Ai, Hongya Wang, Ching-Hsien Hsu, Wenxiang Cui
Distance and Friendship: A Distance-Based Model for Link Prediction in Social Networks

With the emergence of location-based social networks, the study of the relationship between human mobility and social relationships has become quantitatively achievable. Understanding it correctly could result in appealing applications, such as targeted advertising and friend recommendation. In this paper, we focus on mining users’ relationships based on their mobility information. More specifically, we propose to use the distance between two users to predict whether they are friends. We first demonstrate that distance is a useful metric to separate friends and strangers. By considering location popularity together with distance, the difference between friends and strangers gets even larger. Next, we show that distance can be used to perform effective link prediction. In addition, we discover that certain periods of the day are more social than others. Finally, we use a machine learning classifier to further improve the prediction performance. Extensive experiments on a Twitter dataset that we collected show that our model outperforms the state-of-the-art solution by 30%.

Yang Zhang, Jun Pang
Multi-Label Emotion Tagging for Online News by Supervised Topic Model

An enormous number of online news services provide users with interactive platforms where they can freely share their subjective emotions, such as sadness, surprise, and anger, towards news articles. Such emotions can not only help in understanding the preferences and perspectives of individual users, but also benefit a number of online applications by letting them provide users with more relevant services. While most previous approaches are intended for recognizing a single emotion of the author, it has been observed that the different emotions of readers are more representative of news articles. Therefore, this paper focuses on predicting the multiple emotions that online news evokes in readers. To the best of our knowledge, this is the first research work to address this task. This paper proposes a novel supervised topic model which introduces an additional emotion layer to associate latent topics with the multiple emotions evoked in readers. In particular, the model generates a set of latent topics from emotions, followed by generating words from each topic. Experiments on a real dataset from an online news service demonstrate the effectiveness of the proposed approach in multi-label emotion tagging for online news.

Ying Zhang, Lili Su, Zhifan Yang, Xue Zhao, Xiaojie Yuan
Distinguishing Specific and Daily Topics

The task of distinguishing specific and daily topics is useful in many applications, such as event chronicle and timeline generation and cross-document event coreference resolution. In this paper, we investigate several numeric features that describe useful statistical information for this task, and propose a novel Bayesian model for distinguishing specific and daily topics in a collection of documents based on the documents’ content. The proposed Bayesian model exploits a mixture of Poisson distributions to model the probability distributions of the numeric features. The experimental results show that our approach is promising for solving this problem.

Tao Ge, Wenzhe Pei, Baobao Chang, Zhifang Sui
Matching Reviews to Object Based on 2-Stage CRF

The development of web data integration poses a new challenge: how to match relevant reviews to integrated database objects and provide users with more complete, holistic views of entities. Based on the features of web data integration and of reviews from the web, we propose a method based on 2-layer Conditional Random Fields (CRF) to match reviews to database objects. On the one hand, our method leverages the integrated structured entities and significantly reduces the dependence on manually labeled training data. On the other hand, we employ a semi-Markov CRF to recognize the structured entities and exploit a variety of entity-level and pattern-level recognition clues available in a database of entities and labeled reviews, thereby effectively resolving entity variety and improving the accuracy of entity recognition. Experiments in multiple domains show that our method is substantially superior to traditional tf-idf based methods as well as a recent language model-based method for the review matching problem.

Zhang Yongxin, Li Qingzhong, Wang Dequan, Ding Yanhui, Liu Congli, Yan Zhongmin
Discovering Restricted Regular Expressions with Interleaving

Discovering a concise schema from given XML documents is an important problem in XML applications. In this paper, we focus on the problem of learning an unordered schema from a given set of XML examples, which is actually a problem of learning a restricted regular expression with interleaving using positive example strings. Schemas with interleaving could present meaningful knowledge that cannot be disclosed by previous inference techniques. Moreover, inference of the minimal schema with interleaving is challenging. The problem of finding a minimal schema with interleaving is shown to be NP-hard. Therefore, we develop an approximation algorithm and a heuristic solution to tackle the problem using techniques different from known inference algorithms. We do experiments on real-world data sets to demonstrate the effectiveness of our approaches. Our heuristic algorithm is shown to produce results that are very close to optimal.

Feifei Peng, Haiming Chen
Efficient Algorithms for Distance-Based Representative Skyline Computation in 2D Space

Representative skyline computation is a fundamental issue in database area, which has attracted much attention in recent years. A notable definition of representative skyline is the distance-based representative skyline (DBRS). Given an integer $k$, a DBRS includes $k$ representative skyline points that aims at minimizing the maximal distance between a non-representative skyline point and its nearest representative. In the 2D space, the state-of-the-art algorithm to compute the DBRS is based on dynamic programming (DP) which takes $O(km^2)$ time complexity, where $m$ is the number of skyline points. Clearly, such a DP-based algorithm cannot be used for handling large scale dataset due to the quadratic time cost. To overcome this problem, in this paper, we propose a new approximate algorithm called ARS, and a new exact algorithm named PSRS, based on a carefully-designed parametric search technique. We show that the ARS algorithm can guarantee a solution that is at most $\varepsilon$ larger than the optimal solution. The proposed ARS and PSRS algorithms run in $O(k \log^2 m \log(T/\varepsilon))$ and $O(k^2 \log^3 m)$ time respectively, where $T$ is no more than the maximal distance between any two skyline points. We conduct extensive experimental studies over both synthetic and real-world datasets, and the results demonstrate the efficiency and effectiveness of the proposed algorithms.
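
The DBRS objective described in the abstract above can be illustrated with a small brute-force sketch. This is not the ARS, PSRS, or DP algorithm from the paper; it simply tries every k-subset, which is only feasible for tiny inputs. Euclidean distance and the function names `dbrs_error` and `brute_force_dbrs` are assumptions made for illustration.

```python
from itertools import combinations
from math import dist


def dbrs_error(skyline, reps):
    """Largest distance from any skyline point to its nearest representative.

    Representatives contribute a distance of 0, so including them in the
    outer loop does not change the maximum.
    """
    return max(min(dist(p, r) for r in reps) for p in skyline)


def brute_force_dbrs(skyline, k):
    """Return the k-subset of skyline points with the smallest DBRS error.

    Exhaustive search over all subsets (exponential in k); illustrative
    only, not one of the algorithms proposed in the paper.
    """
    return min(combinations(skyline, k),
               key=lambda reps: dbrs_error(skyline, reps))


# Hypothetical 2D skyline with two tight clusters.
skyline_points = [(1, 9), (2, 8), (8, 2), (9, 1)]
best = brute_force_dbrs(skyline_points, 2)
print(best, dbrs_error(skyline_points, best))
```

With two representatives, the best choice takes one point from each cluster, so every remaining point lies within a short distance of a representative.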

Taotao Cai, Rong-Hua Li, Jeffrey Xu Yu, Rui Mao, Yadi Cai
Trustworthy Collaborative Filtering through Downweighting Noise and Redundancy

The proliferation of electronic commerce (EC) has revolutionized the way people purchase online. Web-based technologies enable people to interact more actively with merchants and service providers. The resulting purchasing logs and comments have in turn led to a proliferation of recommender systems. Existing recommendation algorithms exploit either prior transactions or customer reviews to predict user interest in certain items. However, fake raters may introduce substantial noise into such information, and information redundancy further complicates recommender systems. In this work, we first examine user reviews and prior transactions to estimate user credibility and item importance, reducing the effect of content polluters. Then we propose to alleviate the redundant information from homogeneous users based on link analysis. Finally, a unified framework is proposed to incorporate them in a mathematical formulation that can be efficiently optimized. Experimental results on real-world data reveal that our model can significantly outperform other baselines.

Qiuxiang Dong, Zhi Guan, Zhong Chen
A Co-ranking Framework to Select Optimal Seed Set for Influence Maximization in Heterogeneous Network

The rising popularity of social media presents new opportunities for one of the enterprise’s most important needs: selecting the most influential individuals in viral marketing, which has attracted increasing attention in both academia and industry. Most recent influence maximization algorithms have demonstrated remarkable successes; however, their applications are limited to homogeneous networks. In this paper, we formulate the problem of influence maximization in heterogeneous networks, and propose a co-ranking framework to simultaneously select seed sets of different types. This framework is flexible and can take full advantage of additional information implicit in the heterogeneous structure. We conduct extensive experiments using data collected from the ACM Digital Library, and the experimental results show that both the quality and the running time of the proposed algorithm rival those of existing algorithms.

Yashen Wang, Heyan Huang, Chong Feng, Xianxiang Yang
Hashtag Sense Induction Based on Co-occurrence Graphs

Twitter hashtags are used to categorize tweets by topic and thereby improve search. However, because people can create and use hashtags freely, one hashtag may have multiple senses. In this paper, we propose a method to induce the senses of a hashtag in a particular time frame. Our assumption is that, for a given sense of a hashtag, the context words around it are similar. We then design a method that uses a co-occurrence graph and a community detection algorithm. Both words and hashtags are nodes of the co-occurrence graph, and an edge represents the relation of two nodes co-occurring in the same tweet. A list of words with a high node degree representing a sense is extracted as a community of the graph. We take Wikipedia disambiguation list pages as a word sense inventory to refine the results by removing non-sense topics.
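
The co-occurrence graph described in the abstract above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: whitespace tokenization, lowercasing, and the function name `cooccurrence_graph` are assumptions, and the community detection step is omitted.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_graph(tweets):
    """Build an undirected graph whose nodes are words and hashtags.

    Two nodes are connected when they appear in the same tweet.
    Tokenization by whitespace is a simplifying assumption.
    """
    graph = defaultdict(set)
    for tweet in tweets:
        tokens = set(tweet.lower().split())
        for a, b in combinations(sorted(tokens), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical tweets in which "#apple" is used in two senses.
sample_tweets = [
    "#apple pie recipe",
    "#apple iphone launch",
    "iphone launch event",
]
sample_graph = cooccurrence_graph(sample_tweets)
# Node degree = number of distinct co-occurring tokens.
print(sorted(sample_graph["#apple"]), len(sample_graph["#apple"]))
```

In this toy input the neighbours of `#apple` split into a food group and a phone group, which is the kind of structure a community detection algorithm would then separate into senses.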

Mengmeng Wang, Mizuho Iwaihara
Hybrid-LSH for Spatio-Textual Similarity Queries

Locality Sensitive Hashing (LSH) is a popular method for high-dimensional indexing and search over large datasets. However, little effort has been devoted to utilizing LSH in mobile applications for processing spatio-textual similarity queries, such as finding nearby shopping centers that have a top-ranked hair salon. In this paper, we present hybrid-LSH, a new LSH method for indexing data objects according to both their spatial location and their keyword similarity. Our hybrid-LSH approach has two salient features. First, hybrid-LSH carefully combines spatial location based LSH and textual similarity based LSH to ensure the correctness of spatial and textual similarity based NN queries. Second, we present an adaptive query-processing model to address the fixed range problem of traditional LSH and to handle queries with varying ranges effectively. Extensive experiments conducted on both synthetic and real datasets validate the efficiency of our hybrid-LSH method.

Mingdong Zhu, Derong Shen, Ling Liu, Ge Yu
Sleep Quality Evaluation of Active Microblog Users

In this paper, we propose a novel method to evaluate the sleep quality of Active Microblog Users (AMUs) based on Sina Microblog data, where Sina Microblog is the largest microblog platform in China, with 500 million registered users. A microblog user is called an AMU if s/he posts more than 100 microblogs during a year. Our study is meaningful because the number of AMUs in China is huge and the results can reflect the lifestyle of these people. The primary contributions of this paper are as follows. First, we obtained 700 million microblogs from 0.55 million microblog users as our dataset. Then, we detected the possible sleep start and end times of each AMU using a novel pattern and algorithm. Finally, we designed an evaluation system to score each AMU’s sleep quality. In the experiments, we compared the sleep quality of AMUs in different cities of China and used LDA to find the differences in topics between the high-score and low-score groups.

Kai Wu, Jun Ma, Zhumin Chen, Pengjie Ren
An Ensemble Matchers Based Rank Aggregation Method for Taxonomy Matching

Taxonomy matching is an important operation in knowledge base merging. Several matchers for automating taxonomy matching have been proposed and evaluated in the knowledge base community. Studies reveal that no single taxonomy matcher is suitable for every domain-specific taxonomy mapping; therefore, an ensemble of taxonomy matchers is essential. In this paper, we propose taxonomy metamatching, a distributed computing framework for assembling taxonomy matchers and generating an optimal taxonomy mapping. We also introduce TRA, a Threshold Rank Aggregation algorithm for this problem. Experimental results show that TRA outperforms state-of-the-art approaches regardless of the domains and scales of the taxonomies, which demonstrates that TRA adapts well to taxonomy matching.

Hailun Lin, Yuanzhuo Wang, Yantao Jia, Jinhua Xiong, Peng Zhang, Xueqi Cheng
Distributed XML Twig Query Processing Using MapReduce

Twig query processing is one of the core operations of XML queries. Centralized holistic twig algorithms suffer great efficiency losses when large-scale XML documents are partitioned and stored in the cloud. Previous work on distributed twig query processing has some limitations, e.g., complete dependence on prior knowledge of query patterns and iteration of MapReduce jobs. In this paper, our arbitrary XML partitioning and storage strategy requires no knowledge of the query pattern; twig queries can be efficiently processed in a single-round MapReduce job with good scalability. Extensive experiments are conducted to verify the efficiency and scalability of our algorithms.

Xin Bi, Guoren Wang, Xiangguo Zhao, Zhen Zhang, Shuang Chen
Sentiment Word Identification with Sentiment Contextual Factors

Sentiment word identification (SWI) refers to the task of automatically identifying whether a given word expresses a positive or negative opinion. SWI is a critical component of sentiment analysis technologies. Traditional sentiment word identification techniques fall short because they rely on seed sentiment words, which leads to low robustness. In this paper, we consider SWI as a matrix factorization problem and propose three models for it. Instead of seed words, we exploit sentiment matching and sentiment consistency for modeling. Extensive experimental studies on three real-world datasets demonstrate that our models outperform the state-of-the-art approaches.

Jiguang Liang, Xiaofei Zhou, Yue Hu, Li Guo, Shuo Bai
Large-Scale Graph Classification Based on Evolutionary Computation with MapReduce

Discriminative subgraph mining from a large collection of graph objects is a crucial problem for graph classification. Several main memory-based approaches have been proposed to mine discriminative subgraphs, but they lack scalability and are not suitable for large-scale graph databases. Based on the MapReduce model, we propose an efficient method, MRGAGC, to process discriminative subgraph mining. MRGAGC employs the iterative MapReduce framework to mine discriminative subgraphs. Each map step applies evolutionary computation and three evolutionary strategies to generate a set of locally optimal discriminative subgraphs, and the reduce step aggregates all the discriminative subgraphs and outputs the result. The iteration loop terminates when the stopping condition threshold is met. In the end, we employ subgraph coverage rules to build graph classifiers using the discriminative subgraphs mined by MRGAGC. Extensive experimental results on both real and synthetic datasets show that MRGAGC clearly outperforms the other approaches in terms of both classification accuracy and runtime efficiency.

Zhanghui Wang, Yuhai Zhao, Guoren Wang, Yurong Cheng
Multiple Attribute Aware Personalized Ranking

Personalized ranking is a typical task of recommender systems. It can provide a set of items for a specific user and help recommender systems direct each item to its user more accurately. Recently, with the dramatic growth of social media, an entity, i.e., a user or an item, is usually associated with multiple kinds of characterizing information, e.g., explicit ratings, implicit feedback, and multi-type attributes (such as the age, sex, occupation, or posts of a user). Intuitively, by considering this information comprehensively, we can obtain better personalized ranking results. However, most conventional methods take only collaborative information (explicit ratings or implicit feedback) or single-type attributes into account. In this work, we investigate how to combine multiple attributes and collaborative information to learn the latent factors for entities and the attribute-aware mappings. As a result, we propose a novel Multiple-attribute-aware Bayesian Personalized Ranking model, Maa-BPR, for personalized ranking, which can learn reliable latent factors for entities as well as effective mappings for multiple attributes. The experimental results show that, compared with the state-of-the-art methods, Maa-BPR not only provides better ranking performance, but is also robust to new entities and incomplete attributes.

Weiyu Guo, Shu Wu, Liang Wang, Tieniu Tan
Knowledge Base Completion Using Matrix Factorization

With the development of the Semantic Web, the automatic construction of large-scale knowledge bases (KBs) has been receiving increasing attention in recent years. Although these KBs are very large, they are still often incomplete. Many existing approaches to KB completion focus on performing inference over a single KB and suffer from the feature sparsity problem. Moreover, traditional KB completion methods ignore the complementarity that exists implicitly among different KBs. In this paper, we treat KB completion as a large matrix completion task and integrate different KBs to infer new facts simultaneously. We present two improvements to the quality of inference over KBs. First, in order to reduce data sparsity, we utilize the type consistency constraints between relations and entities to initialize negative data in the matrix. Second, we incorporate the similarity of relations between different KBs into the matrix factorization model to take full advantage of the complementarity of the various KBs. Experimental results show that our approach performs better than methods that consider only existing facts or only a single knowledge base, achieving significant accuracy improvements in binary relation prediction.

Wenqiang He, Yansong Feng, Lei Zou, Dongyan Zhao
MATAR: Keywords Enhanced Multi-label Learning for Tag Recommendation

Tagging is a popular way to categorize and search online content, and tag recommendation has been widely studied to better support automatic tagging. In this work, we focus on recommending tags for content-based applications such as blogs and question-answering sites. Our key observation is that many tags already appear in the content in these applications. Based on this observation, we first model the tag recommendation problem as a multi-label learning problem and then further incorporate keyword extraction to improve recommendation accuracy. Moreover, we speed up the proposed method using a locality-sensitive hashing strategy. Experimental evaluations on two real data sets demonstrate the effectiveness and efficiency of our proposed methods.

Licheng Li, Yuan Yao, Feng Xu, Jian Lu
Reverse Direction-Based Surrounder Queries

This paper proposes a new spatial query called the reverse direction-based surrounder (RDBS) query, which retrieves a user who is seeing a point of interest (POI) as one of their direction-based surrounders (DBSs). According to a user, one POI can be dominated by a second POI if the POIs are directionally close and the first POI is farther from the user than the second is. Two POIs are directionally close if their included angle with respect to the user is smaller than an angular threshold, θ. If a POI cannot be dominated by another POI, it is a DBS of the user. We also propose an extended query called the competitor RDBS query. POIs that share the same RDBSs with another POI are defined as competitors of that POI. We design algorithms to answer the RDBS queries and competitor queries. The experimental results show that the proposed algorithms can answer the queries efficiently.
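
The dominance and DBS definitions in the abstract above translate directly into a brute-force check. The sketch below is an illustration under assumptions, not the paper's algorithm: it assumes 2D points as `(x, y)` tuples, Euclidean distance, and hypothetical function names, and it does not cover the competitor query.

```python
from math import atan2, dist, pi


def included_angle(user, a, b):
    """Angle in [0, pi] between the directions user->a and user->b."""
    ang_a = atan2(a[1] - user[1], a[0] - user[0])
    ang_b = atan2(b[1] - user[1], b[0] - user[0])
    diff = abs(ang_a - ang_b)
    return min(diff, 2 * pi - diff)


def dominates(user, b, a, theta):
    """True if POI b dominates POI a for this user: the two are
    directionally close (angle < theta) and a is farther than b."""
    return included_angle(user, a, b) < theta and dist(user, a) > dist(user, b)


def surrounders(user, pois, theta):
    """DBSs of a user: POIs not dominated by any other POI."""
    return [a for a in pois
            if not any(dominates(user, b, a, theta) for b in pois if b != a)]


def reverse_surrounders(poi, users, pois, theta):
    """RDBS query by exhaustive check: users who see `poi` as a DBS."""
    return [u for u in users if poi in surrounders(u, pois, theta)]


# Hypothetical data: three POIs, two users, threshold of 30 degrees.
pois = [(1, 0), (2, 0), (0, 3)]
users = [(0, 0), (3, 0)]
print(reverse_surrounders((2, 0), users, pois, pi / 6))
```

For the user at the origin, the POI at `(2, 0)` lies behind `(1, 0)` in the same direction and is dominated, whereas for the user at `(3, 0)` it is the nearer of the two, so only the second user is returned.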

Xi Guo, Yoshiharu Ishikawa, Aziguli Wulamu, Yonghong Xie
Learning to Hash for Recommendation with Tensor Data

Recommender systems usually need to compare user interests and item characteristics in the context of large user and item spaces, making hashing-based algorithms a promising strategy to speed up recommendation. Existing hashing-based recommendation methods only model users and items and deal with matrix data, e.g., the user-item rating matrix. In practice, recommendation scenarios can be rather complex, e.g., collaborative retrieval and personalized tag recommendation. Such scenarios generally need fast search for one type of entity (target entities) using multiple types of entities (source entities). The resulting third- or higher-order tensor data makes conventional hashing algorithms fail in these scenarios. In this paper, a novel hashing method is proposed to solve this problem, where the tensor data is handled by properly designing the similarities between the source entities and target entities in Hamming space. In addition, operator matrices are developed to explore the relationship between different types of source entities, resulting in auxiliary codes for source entities. Extensive experiments on two tasks, i.e., personalized tag recommendation and collaborative retrieval, demonstrate that the proposed method performs well for tensor data.

Qiyue Yin, Shu Wu, Liang Wang
Spare Part Demand Prediction Based on Context-Aware Matrix Factorization

Maintenance spare parts are used to replace and update damaged and old components in equipment. Forecasting spare part demand is notoriously difficult, as demand is typically intermittent and lumpy. Meanwhile, with the development of sensor and internet technology, numerous condition monitoring systems are used to monitor the working condition of equipment, generating a large variety of monitor data at runtime. In this paper, we propose a Spare Part Demand (SPD) model based on a context-aware matrix factorization approach. The SPD model incorporates historical spare part demands, the correlation between spare part demands and working places, and the correlation between spare part demands and monitor data. We evaluate our method in extensive experiments using the historical spare part demands of one important component from more than 10000 concrete pump trucks and the monitor data generated by part of these concrete pump trucks over a period of 9 months. The results demonstrate the advantages of our method over previous studies.

Jianwei Ding, Yingbo Liu, Yuan Cao, Li Zhang, Jianmin Wang
Location Sensitive Friend Recommendation in Social Network

How to recommend friends in social networks has attracted many research efforts. Most current friend recommendation methods are based solely on the assumption that people will become friends if they have common interests, which are usually estimated from the contents of their published posts and their following relationships. However, friends recommended by these methods are only suitable for the virtual social space rather than the real world. In this paper, we propose a new method to recommend friends in social networks from the perspective of not just common interests, but also real-life needs. That is, we focus on finding friends who can communicate with each other through the social network and also participate in real-life activities face to face. The central idea of our approach is that people are more likely to be friends if their lives share more location overlaps in addition to common interests. Currently, most people publish posts containing their real-time location information at any time, which makes it possible to detect and use the location information to recommend friends. Thus, our method combines users’ published posts, their location sequences detected from the posts, and how active they are on Sina Weibo to estimate whether they can become friends not only in the social network but also in the real world. Experiments on a Sina Weibo dataset demonstrate that our method can significantly outperform traditional friend recommendation methods in terms of Precision, Recall and F1 measures.
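
For reference, the three evaluation measures named in the abstract above can be computed for a single user's recommendation list as shown below. This uses the standard set-based definitions; the function name and the example data are hypothetical, and the paper's exact evaluation protocol is not specified here.

```python
def precision_recall_f1(recommended, actual):
    """Standard set-based Precision, Recall and F1 for one user.

    recommended: set of recommended friend ids
    actual:      set of true friend ids (ground truth)
    """
    hits = len(recommended & actual)
    precision = hits / len(recommended) if recommended else 0.0
    recall = hits / len(actual) if actual else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical example: 3 recommendations, 4 true friends, 2 in common.
p, r, f1 = precision_recall_f1({"b", "c", "d"}, {"c", "d", "e", "f"})
print(round(p, 3), round(r, 3), round(f1, 3))
```

F1 is the harmonic mean of Precision and Recall, so it is high only when both are high.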

Xueqin Sui, Zhumin Chen, Jun Ma
A Compression-Based Filtering Mechanism in Content-Based Publish/Subscribe System

Recently, publish/subscribe (pub/sub) has become a popular paradigm in Internet-scale applications for decoupling information publishers and subscribers. The main task of pub/sub is to find the subscriptions that match a given publication. Typically, a matching algorithm utilizes a subscription indexing structure for higher efficiency. Unfortunately, given the diverse interests of publishers and subscribers, the semantic space of pub/sub becomes high-dimensional and very sparse, leading to nontrivial space cost and inefficient matching. Existing work partially tackles both issues, but at the cost of sacrificing one for the other, and few approaches can meet both goals. To overcome this issue, we propose a novel coding approach. The approach not only reduces the space cost of the subscription index but also helps optimize matching efficiency. Our experiments demonstrate that the proposed algorithm achieves both a low space cost for the proposed subscription index and a low running time for the matching algorithm.

Qin Liu, Yiwen Zheng, Kaile Wang
Sentiment Classification for Chinese Product Reviews Based on Semantic Relevance of Phrase

The emotional tendencies of product reviews on the web have an important influence, so analyzing the sentiment of reviews on the Internet has become necessary. In this paper, a new sentiment analysis algorithm is used to analyze the sentiment of Chinese product reviews. At the training stage, a model based on skip-gram is proposed to train phrase vectors separately on positive and negative reviews; these vectors represent the semantic relationships of phrases. The prediction of the emotional tendencies of reviews is based on the phrase vectors. The model does not need any modeling or feature extraction for the review data, and is thus applicable to massive data. Experimental results show that, when dealing with massive data, the algorithm is better than traditional algorithms in both accuracy and learning time.

Heng Chen, Hai Jin, Pingpeng Yuan, Lei Zhu, Hang Zhu
Overlapping Schema Summarization Based on Multi-label Propagation

Modern databases are usually composed of hundreds of tables. Querying an unfamiliar database is a tall order for users before they truly understand its schema. A schema summary can help to provide a succinct overview of the schema and improve the usability of databases. Existing summarization methods assume that each element in a database belongs to only one topic, ignoring the fact that some elements may belong to multiple topics. This paper proposes a new method for generating overlapping summaries. As far as we know, it is the first work to address this task. We first formulate overlapping schema summarization and then introduce a multi-label propagation algorithm from community detection to obtain several groups. To refine the partition, we further cluster the groups using a hierarchical clustering algorithm. Finally, we find representative tables in each cluster to annotate the schema summary. Extensive experiments on both a benchmark database and a real-world database show that our approach not only achieves higher accuracy but also generates more meaningful summaries.

Man Yu, Chao Wang, Xiangrui Cai, Ying Zhang, Yanlong Wen, Xiaojie Yuan
Analysis of Subjective City Happiness Index Based on Large Scale Microblog Data

City Happiness Index (CHI) is an important societal metric for evaluating the living status of the people in a city. In traditional methods, CHI is usually calculated by combining objective city indicators including economy, environment, technology and education. However, happiness is a subjective feeling rather than a measurement of how much material wealth people have obtained. Therefore, we propose a novel method to evaluate the Subjective City Happiness Index (SCHI) by analyzing public sentiment on microblogs. We carried out the analysis by mining the word distribution of microblogs on Sina Weibo, the largest microblog platform in China. As an application, we used the model to calculate the SCHIs of 36 major cities in China based on 55 million microblogs posted by 0.9 million unique users. Furthermore, we investigated how SCHI varied over time at different granularities (month, day, hour) in the year 2013.

Kai Wu, Jun Ma, Zhumin Chen, Pengjie Ren
Tree-Based Metric Learning for Distance Computation in Data Mining

Distance is an essential measurement in data mining, and a good metric often leads to good performance. How to obtain a proper metric systematically is therefore critical. Distance metric learning is a classic method for learning distances between instances on data sets with complex distributions. However, most research on distance metric learning is based on the Mahalanobis metric, which is equivalent to a linear transformation of the distance space and is therefore limited on complex data. To solve this problem, we propose a metric learning method based on a non-linear transformation suitable for complex data. Using a tree model, we can handle non-linearly separable data by rearranging the input data and representing it in another form; the tree model can implicitly map the data to a new distance space with a non-linear activation function. However, a single tree model tends to overfit and thus has a higher generalization error. Therefore, we design a randomized algorithm that combines different tree models, which reduces the generalization error in theory and in practice. We prove the correctness and effectiveness of our algorithm through theoretical analysis. Extensive experiments demonstrate that the algorithm is stable and suitable for data mining.

Ming Yan, Yan Zhang, Hongzhi Wang
GSCS – Graph Stream Classification with Side Information

With the popularity of applications such as the Internet, sensor networks and social networks, which generate graph data in stream form, graph stream classification has become an important problem. Many applications generate side information associated with the graph stream, such as terms and keywords in the authorship graph of research papers, or IP addresses and browsing time in the web click graph of Internet users. Although the side information associated with each graph object is semantically relevant to the graph structure and can contribute much to the accuracy of graph classification, none of the existing graph stream classification techniques consider side information. In this paper, we propose an approach, Graph Stream Classification with Side information (GSCS), which incorporates side information along with graph structure by increasing the dimension of the feature space of the data to build a better graph stream classification model. Empirical analysis on two real-life data sets shows that incorporating side information in the graph stream classification process outperforms the state-of-the-art approaches. The experimental results also show that GSCS is robust enough to be used for classifying graphs in stream form.

Amit Mandal, Mehedi Hasan, Anna Fariha, Chowdhury Farhan Ahmed
A Supervised Parameter Estimation Method of LDA

The Latent Dirichlet Allocation (LDA) probabilistic topic model is a very effective dimension-reduction tool that can automatically extract latent topics and represent text in a lower-dimensional semantic topic space. However, the original LDA and most of its variants are unsupervised and make no reference to the category labels of the documents in the training corpus. Most of them also treat the terms in the vocabulary as equally important, although the weight of each term differs, especially in a skewed corpus in which some categories have many more samples than others. We therefore propose a supervised parameter estimation method based on category and document information, which estimates the parameters of LDA according to term weight. Comparative experiments show that the proposed method is superior for skewed text classification and can largely improve the recall and precision of the minority category.

Liu Zhenyan, Meng Dan, Wang Weiping, Zhang Chunxia
Learning Similarity Functions for Urban Events Detection by Mining Hotline Phone Records

Many cities around the world have established a platform, called a public service hotline, that allows citizens to report city issues (e.g. noise nuisance) or personal problems (e.g. traffic accidents) by making a phone call. As a result of “crowd sensing”, these records contain rich human intelligence that can help to detect urban events. In this paper, we present an approach to detect urban events based on phone records. Specifically, given a set of phone records in a period of time, we first learn a similarity matrix. Each element of the matrix is estimated as the probability that the corresponding pair of records describes the same event. Then, we propose an Improved Affinity Propagation (IAP) clustering approach, which takes the similarity matrix as input and generates clusters as output. Each cluster is an urban event composed of several records. Extensive experiments demonstrate the great improvement of IAP on three standard datasets for clustering and the effectiveness of our event detection approach on real data from a hotline.

Pengjie Ren, Peng Liu, Zhumin Chen, Jun Ma, Xiaomeng Song
Answering Spatial Approximate Keyword Queries in Disks

Spatial approximate keyword queries consist of a spatial condition and a set of keywords as fuzzy textual conditions; they return objects that satisfy the spatial condition and are labeled with keywords similar to the queried keywords. Such queries enable users to find objects of interest in a spatial database and tolerate mismatches between user query keywords and object keywords. With the rapid growth of data, spatial databases storing objects from diverse geographical regions can no longer be held in main memory. Thus, it is essential to answer spatial approximate keyword queries over disk-resident datasets. Existing methods either return incomplete answers or keep their indexes in main memory, so effective disk-based solutions are in demand. This paper presents a novel disk-resident index, the RMB-tree, to support spatial approximate keyword queries. We study the principle of augmenting the R-tree with the capacity for approximate keyword search based on existing solutions, and store multiple bitmaps in R-tree nodes to build an RMB-tree. The RMB-tree supports spatial conditions such as range constraints, combined with keyword similarity metrics such as edit distance, dice, etc. Experimental results against the R-tree on two real-world datasets demonstrate the efficiency of our solution.
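
To make the query semantics concrete, here is a minimal sketch of a range query combined with an edit-distance keyword condition. It is a brute-force linear scan, not the RMB-tree index from the paper; the names `edit_distance`, `in_range`, `query`, `max_edits` and the sample `OBJECTS` are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def in_range(point, rect):
    """True if point (x, y) lies inside the rectangle (x1, y1, x2, y2)."""
    x, y = point
    x1, y1, x2, y2 = rect
    return x1 <= x <= x2 and y1 <= y <= y2


def query(objects, rect, keywords, max_edits=1):
    """Names of objects inside rect whose keywords approximately cover the query.

    Assumption: every query keyword must be within max_edits edits of at
    least one keyword of the object.
    """
    hits = []
    for name, point, obj_keywords in objects:
        if not in_range(point, rect):
            continue
        if all(any(edit_distance(q, k) <= max_edits for k in obj_keywords)
               for q in keywords):
            hits.append(name)
    return hits


# Hypothetical sample data: (name, location, keywords).
OBJECTS = [
    ("cafe", (1.0, 1.0), ["coffee", "bakery"]),
    ("shop", (2.0, 2.0), ["books", "coffee"]),
    ("far_cafe", (9.0, 9.0), ["coffee"]),
]
```

With this data, the misspelled query `query(OBJECTS, (0, 0, 3, 3), ["cofee"])` still matches the two nearby objects labeled `coffee`, which is the tolerance to keyword mismatches that the abstract describes.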

Jinbao Wang, Donghua Yang, Yuhong Wei, Hong Gao, Jianzhong Li, Ye Yuan
Hashing Multi-Instance Data from Bag and Instance Level

In many scenarios, we need to perform similarity search over multi-instance data. Although traditional kernel methods can measure the similarity of bags in the original feature space, their time and storage costs are so high that they cannot deal with large-scale problems. Recently, hashing methods have been widely used for similarity search due to their fast search speed and low storage cost. However, few works consider how to hash multi-instance data. In this paper, we present two multi-instance hashing methods: (1) Bag-level Multi-Instance Hashing (BMIH); (2) Instance-level Multi-Instance Hashing (IMIH). BMIH first maps each bag to a new feature representation by a feature fusion method; then, a supervised hashing method is used to convert the new features to hash codes. To utilize more instance information in each bag, IMIH regards the instances in all bags as training data and applies two types of hash learning methods (unsupervised and supervised, respectively) to convert all instances to binary codes; then, for a test bag, a similarity measure is proposed to search for similar bags. Our experiments on four real-world datasets show that instance-level hashing with supervised information outperforms the other proposed techniques.

Yao Yang, Xin-Shun Xu, Xiaolin Wang, Shanqing Guo, Lizhen Cui
A Multi-news Timeline Summarization Algorithm Based on Aging Theory

This paper focuses on the problem of news event timeline summarization in Multi-Document Summarization, which aims to summarize multiple news articles about the same event along a timeline. The majority of the traditional solutions to this problem consider text surface features and topic-related features, such as the length of each sentence, the position of the sentence in the document, the number of topic words, etc. Traditional methods ignore the fact that every event has a life cycle including birth, growth, maturity and death. In this paper, a novel approach is presented for summarizing multiple news articles on the same topic that considers both the traditional features and the life cycle feature of each event. The proposed approach consists of four steps. First, sentences and their publishing dates are extracted from each news article. Second, the extracted sentences are preprocessed to reduce the influence of noise such as synonyms. Third, life cycle features and four other categories of features that are commonly used in this field are collected. Finally, an SVM model is trained on these features to recognize the summary sentences of the news documents. This approach has been tested on the public datasets DUC-2002 and TAC-2010, and the results show that it is more effective at summarizing multiple news articles along a timeline than existing methods.

Jie Chen, Zhendong Niu, Hongping Fu
Extended Strategies for Document Clustering with Word Co-occurrences

To tackle the sparse data problem of the bag-of-words model for document clustering, recent strategies have been proposed to enrich a document with the relatedness of all the words in a corpus to the document, where the relatedness is estimated by the weighted sum of word co-occurrences. However, the relatedness is overestimated unless the overlaps between word co-occurrences are eliminated. This paper demonstrates that the weighted sum strategy gives the upper bound of the theoretical degree of relatedness. Two strategies are further proposed to approach the theoretical degree of relatedness. The first strategy is established under the extreme assumption that all the words in a document co-occur with each other. By considering the specificities of words, the second strategy gives several extended versions of the weighted sum strategy. Substantial experiments verify that document clustering with the extended strategies achieves a significant performance improvement compared to the state-of-the-art techniques.

Yang Wei, Jinmao Wei, Zhenglu Yang
A Lightweight Evaluation Framework for Table Layouts in MapReduce Based Query Systems

Table layout determines how relational row-column data values are organized and stored. In recent years, a considerable number of candidates have been developed in MapReduce-based query systems; they differ in storage space utilization, data loading time, query performance and so on. Users are often confronted with the problem of choosing the best overall table layout given the workloads and the schema of the tables. The straightforward approach of running queries on generated data and comparing the results is time consuming and is inaccurate because of MapReduce’s nondeterministic execution time. In this paper, we propose a lightweight framework to evaluate table layouts without running the query. The framework adopts the black box method to test critical metrics and the query aware strategy, which extracts table-layout-related operations from the query. Based on the metrics and operations, the framework makes suggestions to users. We conduct extensive experiments to empirically study the popular table layouts. The results show that column projection and compression are the two most prominent factors in general cases. Moreover, we discuss optimization opportunities for the intermediate tables produced in high-level language systems.

Feng Zhu, Jie Liu, Lijie Xu, Dan Ye, Jun Wei, Tao Huang
An Extended Graph-Based Label Propagation Method for Readability Assessment

Readability assessment evaluates the reading difficulty of a document, which can be quantified as a reading level. In this paper, we propose an extended graph-based label propagation method for readability assessment. We employ three vector space models (VSMs) to compute edges and weights for the graphs, along with three graph sparsification techniques. By incorporating the pre-classification results, we develop four strategies to reinforce the graphs before label propagation to capture the ordinal relation among the reading levels. The reinforcement includes recomputing weights for the edges and filtering out edges that link nodes with a large level difference. Experiments are conducted systematically on both English and Chinese datasets. The results demonstrate both the effectiveness and the potential of the proposed method.

Zhiwei Jiang, Gang Sun, Qing Gu, Lixia Yu, Daoxu Chen
Simple is Beautiful: An Online Collaborative Filtering Recommendation Solution with Higher Accuracy

Matrix factorization has high computational complexity. It is unrealistic to directly apply such techniques to online recommendation, where users, items, and ratings grow constantly. Therefore, implementing an online version of recommendation based on incremental matrix factorization is a significant task. Though some results have been achieved in this area, there is plenty of room for improvement. This paper focuses on designing and implementing algorithms to perform online collaborative filtering recommendation based on incremental matrix factorization techniques. Specifically, for the new-user and new-item problems, the Moore-Penrose pseudoinverse is used to perform incremental matrix factorization; for the new-rating problem, an iterative stochastic gradient descent learning procedure is directly applied to obtain the updates. The solutions are simple but efficient: extensive experiments show that they outperform the benchmark and the state of the art in terms of incremental properties.
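
As an illustration of the new-rating case only, the following sketch applies one stochastic gradient descent step to a user factor vector and an item factor vector when a single rating arrives. It uses the textbook regularized squared-error update and is not taken from the paper; `predict`, `sgd_update`, the learning rate `lr` and the regularization weight `reg` are assumed names and values. The pseudoinverse step for new users and items is not shown.

```python
def predict(p, q):
    """Predicted rating: dot product of user factors p and item factors q."""
    return sum(a * b for a, b in zip(p, q))


def sgd_update(p, q, rating, lr=0.05, reg=0.02):
    """One SGD step on the regularized squared error for a single new rating.

    Both vectors are updated from their old values, so the two updates do
    not interfere with each other.
    """
    err = rating - predict(p, q)
    new_p = [pi + lr * (err * qi - reg * pi) for pi, qi in zip(p, q)]
    new_q = [qi + lr * (err * pi - reg * qi) for pi, qi in zip(p, q)]
    return new_p, new_q


# Hypothetical starting factors and one incoming rating of 4.0.
user, item = [0.1, 0.2], [0.3, 0.4]
user_after, item_after = sgd_update(user, item, 4.0)
```

Only the two affected factor vectors change, which is what makes this kind of update cheap enough to apply per rating instead of refactorizing the whole matrix.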

Feng Zhang, Ti Gong, Victor E. Lee, Gansen Zhao, Guangzhi Qu
Random-Based Algorithm for Efficient Entity Matching

Most state-of-the-art MapReduce-based entity matching methods inherit traditional Entity Resolution techniques from centralized systems and focus on data blocking strategies for structured entities in order to solve the load balancing problem that occurs in distributed environments. In this paper, we propose a MapReduce-based framework for Entity Matching on semi-structured and unstructured data. Each entity is represented by a high-dimensional vector generated from description data. In order to reduce network transmission, we produce lower-dimensional bit-vectors, called signatures, for those entity vectors based on a Locality Sensitive Hash (LSH) function. Our LSH is required to preserve cosine similarity. A series of randomized algorithms is designed to ensure the performance of entity matching. Moreover, our design contains a solution for reducing redundant computation with one additional round of MapReduce. Experiments show that our approach has a huge advantage in both processing speed and accuracy compared to the other methods.
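
One well-known LSH family for cosine similarity is random hyperplane hashing, sketched below as an illustration of how a high-dimensional vector can be reduced to a bit signature. Whether the paper uses exactly this family is an assumption; `make_hyperplanes`, `signature`, `hamming` and `estimated_cosine` are hypothetical names.

```python
import math
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian hyperplane normals of dimension dim."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per hyperplane: 1 if vec lies on its non-negative side."""
    return [1 if sum(h * v for h, v in zip(plane, vec)) >= 0 else 0
            for plane in hyperplanes]


def hamming(sig_a, sig_b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(sig_a, sig_b))


def estimated_cosine(sig_a, sig_b):
    """Estimate cosine similarity from the fraction of differing bits.

    For random hyperplanes, the probability that a bit differs equals
    angle / pi, so the angle is estimated as pi * hamming / n_bits.
    """
    theta = math.pi * hamming(sig_a, sig_b) / len(sig_a)
    return math.cos(theta)


PLANES = make_hyperplanes(dim=4, n_bits=64)
SIG = signature([1.0, 2.0, 3.0, 4.0], PLANES)
```

Signatures depend only on the direction of a vector, so scaled copies of a vector share a signature, and comparing 64-bit signatures is far cheaper to ship across a network than comparing the original vectors.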

Pingfu Chao, Zhu Gao, Yuming Li, Junhua Fang, Rong Zhang, Aoying Zhou
A New Similarity Measure Between Semantic Trajectories Based on Road Networks

With the development of positioning technology, studies on trajectories have been growing rapidly in the past decades. As a fundamental part of trajectory recommendation and prediction, trajectory similarity has attracted considerable attention from researchers. However, most existing works focus on raw trajectory similarity by comparing shapes, while very few works study semantic trajectory similarity and none of them takes all of the geographical, semantic, and timestamp information into consideration. In this paper, we model semantic trajectories based on road networks considering all this information, and propose a Constrained Time-based Common Parts (CTCP) approach to measure the similarity. Since the strict time constraint in CTCP may lead to many zero values, we further propose an improved Weighted Constrained Time-based Common Parts (WCTCP) method that relaxes the time constraint to measure the similarity more accurately. We conducted extensive performance studies on real datasets to confirm the effectiveness of our approaches.

Xia Wu, Yuanyuan Zhu, Shengchao Xiong, Yuwei Peng, Zhiyong Peng
Effective Citation Recommendation by Unbiased Reference Priority Recognition

Citation recommendation is a meaningful and challenging research problem. Most prior research makes the simplifying assumption that cited papers are always preferred over uncited ones. Consequently, the unreasonable priority assertion between the cited and uncited papers derived from this assumption makes citation recommendation prone to bias. To address this issue, we first propose the intuitive assumption that the more preferential a reference is, the more easily it can be recognized as a citation. Based on this assumption, we propose two methods, CR and CR+C, which aim to find a less biased priority between the cited and uncited papers with c-SVC. Then, an improved RankSVM model is trained for citation ranking. Experimental results demonstrate that, compared with the RankSVM model, our methods achieve a 5.27% improvement on Recall@50 and an 8.28% improvement on MRR. Moreover, CR+C is more efficient, taking only 18.9% of the time it needs.

Wen-Yang Lu, Yu-Bin Yang, Xiao-Jiao Mao, Qi-Hai Zhu
UserGreedy: Exploiting the Activation Set to Solve Influence Maximization Problem

Influence Maximization is the problem of selecting a small set of seed users in a social network to maximize the spread of influence. Traditional solutions fall into two main directions: greedy-based methods and heuristics-based methods. Greedy-based methods can effectively estimate influence spread using thousands of Monte-Carlo simulations. However, the simulations are so computationally expensive that these methods do not scale to large networks. Heuristics-based methods, which estimate influence spread according to heuristic strategies, have low computational cost but no theoretical guarantees. In order to improve both performance and effectiveness, in this paper we propose a greedy-based algorithm named UserGreedy. In UserGreedy, we first propose a novel concept called Activation Set, which is defined as the set of users that can be activated by a seed user with a certain probability under the most standard and popular independent cascade (IC) model. Based on the computation of such probabilities, we can directly estimate the influence spread without the expensive simulation process. We then design an influence spread function based on the Activation Set and mathematically prove that it is monotone and submodular, which provides a theoretical guarantee for the UserGreedy algorithm. We also propose an efficient method to obtain the Activation Set and hence implement the UserGreedy algorithm. Experiments on real-world social networks demonstrate that our algorithm is much faster than existing greedy-based algorithms and outperforms the state-of-the-art heuristics-based algorithms in terms of effectiveness.
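
For context, the simulation-based greedy baseline that the abstract calls expensive can be sketched as follows. This is the generic Monte-Carlo approach under the independent cascade model, not UserGreedy or its Activation Set; the function names, the `runs` parameter and the toy `GRAPH` are hypothetical.

```python
import random


def simulate_ic(graph, seeds, rng):
    """One independent cascade run; graph maps node -> [(neighbor, prob)].

    Each newly activated node gets a single chance to activate each of its
    inactive neighbors. Returns the number of nodes active at the end.
    """
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        next_frontier = []
        for u in frontier:
            for v, p in graph.get(u, []):
                if v not in active and rng.random() < p:
                    active.add(v)
                    next_frontier.append(v)
        frontier = next_frontier
    return len(active)


def estimate_spread(graph, seeds, runs=1000, seed=0):
    """Average cascade size over `runs` simulations."""
    rng = random.Random(seed)
    return sum(simulate_ic(graph, seeds, rng) for _ in range(runs)) / runs


def greedy_seeds(graph, k, runs=200):
    """Pick k seeds, each time adding the node with the best estimated spread."""
    nodes = set(graph) | {v for nbrs in graph.values() for v, _ in nbrs}
    chosen = []
    for _ in range(k):
        candidates = sorted(nodes - set(chosen))
        best = max(candidates,
                   key=lambda n: estimate_spread(graph, chosen + [n], runs))
        chosen.append(best)
    return chosen


# Toy graph with deterministic edges so the spread is easy to check by hand.
GRAPH = {"a": [("b", 1.0), ("c", 1.0)], "b": [("d", 1.0)], "e": []}
```

Each greedy round calls `estimate_spread` once per candidate node, and each call runs many cascades, which is the cost that motivates estimating spread without simulation.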

Wenxin Liang, Chengguang Shen, Xianchao Zhang
AILabel: A Fast Interval Labeling Approach for Reachability Query on Very Large Graphs

Recently, reachability queries on large graphs have attracted much attention. Many state-of-the-art approaches leverage a spanning tree to construct indexes. However, almost all of these works require the indexes and the original graph to be in memory simultaneously, which limits scalability. Thus, a new interval labeling approach called AILabel (Augmented Interval Label) is proposed in this paper. AILabel labels each node with quadruples. The index construction time of AILabel is O(m + n), which requires only one traversal of the graph. Besides, AILabel needs only the index to answer queries. We further propose D-AILabel, an approach based on AILabel, to handle reachability queries on dynamic graphs. Finally, experiments on real and synthetic datasets show that AILabel scales efficiently to large graphs and that D-AILabel handles reachability queries on dynamic graphs effectively.
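
The basic idea behind interval labeling on a spanning tree can be sketched as below. This is the classic two-number tree labeling, not AILabel's quadruples, and it answers reachability along tree edges only; `label_tree`, `reaches` and the sample `TREE` are hypothetical.

```python
def label_tree(children, root):
    """Assign each node a (start, end) interval with one iterative DFS.

    start is the node's preorder number; end is the largest preorder
    number found in its subtree.
    """
    labels = {}
    counter = 0
    stack = [(root, False)]
    while stack:
        node, done = stack.pop()
        if done:
            # All descendants are numbered, so the subtree ends at counter - 1.
            labels[node] = (labels[node][0], counter - 1)
            continue
        labels[node] = (counter, None)
        counter += 1
        stack.append((node, True))
        for child in reversed(children.get(node, [])):
            stack.append((child, False))
    return labels


def reaches(labels, u, v):
    """u reaches v in the tree iff v's interval is contained in u's."""
    start_u, end_u = labels[u]
    start_v, end_v = labels[v]
    return start_u <= start_v and end_v <= end_u


# Hypothetical tree: r -> a, b and a -> c.
TREE = {"r": ["a", "b"], "a": ["c"]}
LABELS = label_tree(TREE, "r")
```

A query then needs only the two intervals and no access to the graph itself. Handling edges outside the spanning tree requires additional label information, which this sketch omits.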

Feng Shuo, Xie Ning, Shen de-Rong, Li Nuo, Kou Yue, Yu Ge
Graph-Based Hybrid Recommendation Using Random Walk and Topic Modeling

In this paper, we propose a graph-based method for hybrid recommendation. Unlike a simple linear combination of several factors, our method integrates user-based, item-based and content-based techniques more fully. The interaction among different factors is performed not once but through iterative updates. The graph model is composed of the target user’s like-minded neighbors, candidate items, the target user’s historical items and the topics extracted from the items’ contents using topic modeling. By constructing this concise graph, we filter out irrelevant noise and retain only useful information that is highly related to the target user. A top-N recommendation list is finally generated using a personalized random walk. We conduct a series of experiments on two datasets: movielen and lastfm. Evaluation results show that our proposed approach achieves good quality and outperforms existing recommendation methods in terms of accuracy, coverage and novelty.
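
A personalized random walk is commonly implemented as a random walk with restart to the target user. The sketch below assumes that formulation, which the abstract does not spell out; `personalized_pagerank`, `top_n`, the `restart` probability and the toy `GRAPH` of users, items and topics are hypothetical.

```python
def personalized_pagerank(graph, source, restart=0.15, iterations=50):
    """Random walk with restart to `source`, computed by power iteration.

    graph maps node -> list of neighbors. At every step the walker jumps
    back to `source` with probability `restart`.
    """
    nodes = sorted(set(graph) | {v for nbrs in graph.values() for v in nbrs})
    scores = {n: (1.0 if n == source else 0.0) for n in nodes}
    for _ in range(iterations):
        nxt = {n: 0.0 for n in nodes}
        for u in nodes:
            nbrs = graph.get(u, [])
            if nbrs:
                share = (1.0 - restart) * scores[u] / len(nbrs)
                for v in nbrs:
                    nxt[v] += share
            else:
                # Dangling node: send its walking mass back to the source.
                nxt[source] += (1.0 - restart) * scores[u]
        nxt[source] += restart  # total score mass is 1, so restart mass is `restart`
        scores = nxt
    return scores


def top_n(scores, candidates, n):
    """The n highest-scoring candidates, ties broken by name."""
    return sorted(candidates, key=lambda c: (-scores[c], c))[:n]


# Hypothetical graph linking a user, items and one topic.
GRAPH = {
    "user": ["item_a", "topic_x"],
    "item_a": ["user", "topic_x"],
    "topic_x": ["item_a", "item_b", "user"],
    "item_b": ["topic_x"],
    "item_c": [],
}
SCORES = personalized_pagerank(GRAPH, "user")
```

Here `item_a` outranks `item_b` because it is reachable from the user both directly and through the shared topic, while the unconnected `item_c` receives no score at all.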

Hai-Tao Zheng, Yang-Hui Yan, Ying-Min Zhou
RDQS: A Relevant and Diverse Query Suggestion Generation Framework

Traditional query suggestion methods mainly leverage click-through information to find related queries as recommendations, without considering the semantic relatedness between queries. In addition, few studies use the click-through distribution to diversify query suggestions. To address these issues, we propose a novel and effective framework to generate relevant and diversified query suggestions. We combine query semantics and click-through information to generate query suggestion candidates that are highly relevant to the original query, and then use the click-through distribution to diversify the candidates. We evaluate our method on a large-scale search log dataset from a commercial search engine. Experimental results indicate that our framework significantly improves the relevance and diversity of suggested queries compared with four baseline methods.

Hai-Tao Zheng, Yi-Chi Zhang
Minimizing the Cost to Win Competition in Social Network

In social networks, influence propagates among users. Influence maximization is an important problem that has been studied widely in recent years. However, in the real world, users sometimes need to maximize their influence through a social network and defeat their competitors at minimum cost, for example in online voting or when expanding market share. In this paper, we consider the problem of selecting a seed set with the minimum cost to influence more people than other competitors. We show that this problem is NP-hard and propose a cost-effective greedy algorithm that solves the problem approximately and improves efficiency based on submodularity. Furthermore, a new cost-effective Degree Adjust heuristic is proposed to achieve high efficiency. Experimental results show that our cost-effective greedy algorithm achieves better effectiveness than other algorithms, and that the cost-effective Degree Adjust heuristic achieves high efficiency and better effectiveness than the Degree and Random heuristics.

Ziyan Liu, Xiaoguang Hong, Zhaohui Peng, Zhiyong Chen, Weibo Wang, Tianhang Song
Ad Dissemination Game in Ephemeral Networks

The dissemination of ads in an ephemeral network has become an important research topic. A challenge of ad dissemination is the guarantee of robustness such that rational nodes in the ephemeral network have sufficient impetus to forward ads, despite facing the limitation of resources and the risk of privacy leakage. This paper proposes a strategy for ad dissemination in an ephemeral network. Acknowledging the assumption of incomplete information, we propose a bargaining-based game G_D that can be used by nodes to decide whether to forward an ad. There exists a Bayesian Nash equilibrium in G_D, with which the proposed approach gives nodes a strong impetus to disseminate ads with higher dissemination accuracy.

Lihua Yin, Yunchuan Guo, Yanwei Sun, Junyan Qian, Athanasios Vasilakos
User Generated Content Oriented Chinese Taxonomy Construction

The taxonomy is one of the basic components of knowledge graphs, as it establishes types of classes and semantic relations among the classes. Taxonomies are normally constructed either manually or by language-dependent rules or patterns for type and relation extraction or inference. Existing work on building taxonomies for knowledge graphs is mostly set in an English-language environment. In this paper, we propose a novel approach for large-scale Chinese taxonomy construction based on user generated content. We take Chinese Wikipedia as the data source, develop methods to extract classes and their relations from user-tagged categories, and build the taxonomy using a bottom-up strategy. The algorithms can be easily applied to other Wiki-style data sources. The experiments show that the constructed Chinese taxonomy achieves better results in both quality and quantity.

Jinyang Li, Chengyu Wang, Xiaofeng He, Rong Zhang, Ming Gao
Mining Weighted Frequent Itemsets with the Recency Constraint

Weighted Frequent Itemset Mining (WFIM) has been proposed as an alternative to frequent itemset mining that considers not only the frequency of items but also their relative importance. However, an important limitation of WFIM is that it does not consider how recent the patterns are. To address this issue, we extend WFIM to consider the recency of patterns, and thus present Recent Weighted Frequent Itemset Mining (RWFIM). A projection-based algorithm named RWFIM-P is designed to mine Recent Weighted Frequent Itemsets (RWFIs) based on a novel upper-bound downward closure property. Moreover, an improved algorithm named RWFIM-PE is also proposed, which introduces a new pruning strategy named Estimated Weight of 2-itemset Pruning (EW2P) to prune unpromising RWFI candidates early. An experimental evaluation against a state-of-the-art WFIM algorithm on real-world and synthetic datasets shows that the proposed algorithms are highly efficient.

Jerry Chun-Wei Lin, Wensheng Gan, Philippe Fournier-Viger, Tzung-Pei Hong
Boosting Explicit Semantic Analysis by Clustering Paragraph Vectors of Wikipedia Articles

Explicit Semantic Analysis (ESA) is an effective method that uses Wikipedia entries (articles) to represent text and compute semantic relatedness (SR) for text pairs. Like ordinary web search techniques, ESA suffers from redundancy issues due to the ongoing growth in the number of Wikipedia entries. Entry redundancy can lead to a biased representation that places particular emphasis on semantics from a large number of similar entries. In addition, the original ESA for SR has the weakness that it does not consider the correlations or similarities between the Wikipedia articles in the text representations. To tackle these problems, we develop a novel method to cluster redundant or similar entries using a similarity measurement based on Paragraph Vector (PV), a neural network language model. Experimental results on four datasets show that our framework achieves better relatedness accuracy than ESA.

Hai-Tao Zheng, Wenzhen Wu
Research on Semantic Disambiguation in Treebank

The increasingly widespread application of natural language processing technology gives parsing a significant role. As a result, the size and quality of treebanks have become the focus of relevant research. However, data sparseness arises when a treebank is used for parsing. With the help of Cilin semantic information and the contextual information of words, this paper proposes a context-based lexical semantic disambiguation method. After applying this method to CTB (Chinese Treebank) 5.0 and TCT (Tsinghua Chinese Treebank), the Berkeley Parser achieved relatively good results. On the Penn Chinese Treebank, the precision and recall reached 85.35% and 84.34% respectively, and the F value reached 84.84%. Compared with the parsing results on the original corpus, the precision increased by 1.86%, the recall increased by 1.02%, and the comprehensive F value increased by 1.35%. As a consequence, the overall parsing error rate dropped by 8.17%.
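
The reported F value can be checked against the standard harmonic-mean definition, assuming that "F value" here means the balanced F1 score (the abstract does not state this explicitly). The function name `f_measure` is hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported for the Penn Chinese Treebank, in percent.
reported_f = round(f_measure(85.35, 84.34), 2)
```

Under that assumption, `reported_f` evaluates to 84.84, which agrees with the F value stated in the abstract.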

Lin Miao, Xueqiang Lv, Yunfang Wu, Yue Wang
PDMA: A Probabilistic Framework for Diversifying Recommendation Lists

In this paper, we derive a probabilistic ranking framework for diversifying the recommendations of baseline methods. Unlike conventional approaches to balancing relevance and diversity, we produce the diversified list by maximizing the user’s current marginal aspect preference, thus avoiding hyperparameters for making the tradeoff. Before diversification, we adopt clustering to generate a much smaller set of candidate items based on three requirements: efficiency, relevance and diversity. This not only reduces the search space greatly but also yields a slight increase in performance. Our framework can flexibly incorporate new preference aspects and new marginal aspect preference algorithms. Evaluation results show that our method achieves better diversity than others while maintaining accuracy comparable to the baseline methods, and thus strikes a better balance between relevance and diversity.

Yang-Hui Yan, Ying-Min Zhou, Hai-Tao Zheng
User Behavioral Context-Aware Service Recommendation for Personalized Mashups in Pervasive Environments

With the rapid development of the mobile Internet and the increasing number of smart devices, Internet services have been integrated into people’s daily lives. Due to the features of end-user-oriented mashups in pervasive environments, conventional mashup approaches face new challenges, including the complexity of user behaviors and the difficulty of predicting real-time user preferences and other dynamic contexts. In this paper, we propose a new paradigm for behavioral context-based personalized mashup provision in pervasive environments by integrating mashup construction and execution into users’ natural behaviors. In the proposed paradigm, users with similar behavior patterns are identified, and then probability distributions of potential behavior selection for user clusters are discovered from historical mashup logs, which provide support for predicting and recommending user activities for future mashup constructions. Analysis and experiments indicate that our approach can effectively simplify personalized mashup composition, as well as improve the quality of mashup composition and recommendation based on behavioral contexts and personalization in pervasive environments.

Wei He, Guozhen Ren, Lizhen Cui, Hui Li
On Coherent Indented Tree Visualization of RDF Graphs

Indented trees have been widely used to organize information and visualize graph-structured data such as RDF graphs. Given a starting resource in a cyclic RDF graph as root, there are different ways of transforming the graph into a tree representation to be visualized as an indented tree. In this paper, we aim to smooth the user’s reading experience by visualizing an optimally coherent indented tree, in the sense of featuring the fewest reversed edges, which often cause confusion and interrupt the user’s cognitive process due to the lack of an effective way to present them. To achieve this, we propose a two-step approach to generate such an optimal tree representation for a given RDF graph. We empirically show the difference in coherence between tree representations of real-world RDF graphs generated by different approaches. These differences lead to a significant difference in user experience in our user study, which reports a notable degree of dependence between coherence and user experience.

Qingxia Liu, Gong Cheng, Yuzhong Qu
Online Feature Selection Based on Passive-Aggressive Algorithm with Retaining Features

Feature selection is an important topic in data mining and machine learning, and has been extensively studied in the literature. Unlike traditional batch learning methods, online learning is more efficient for real-world applications. Most existing studies of online learning require access to all the features of training instances, but in the real world it is often expensive to acquire the full set of attributes. In the online feature selection process, when a training instance arrives, a fixed small number of features is selected and the other features are ignored. However, those ignored features may be useful and may be selected for later instances. If we only consider the new instances for these special features, it will lead to extreme errors. To address these issues, we propose a novel algorithm based on the Passive-Aggressive algorithm with retained features. We then evaluate the performance of the proposed algorithm for online feature selection on several public datasets; the experiments show that our algorithm consistently surpasses the baseline algorithms in all settings.

Hai-Tao Zheng, Haiyang Zhang
Online Personalized Recommendation Based on Streaming Implicit User Feedback

Since user preference drifts over time, modeling temporal dynamics in recommender systems has proven valuable for accurate recommendation. However, user feedback is updated continuously, while a traditional recommender system is trained offline in batch mode and therefore cannot capture changes in user taste in time. In this paper, we build a dynamic real-time recommendation model based on a stream of implicit user feedback to improve both recommendation accuracy and training efficiency. Moreover, our model has clear advantages over traditional approaches in diversity, interpretability, and robustness to hostile attacks. Finally, we conduct experiments on two real-world datasets to validate the effectiveness of our proposed method and demonstrate its superior performance compared with state-of-the-art approaches.

Zhisheng Wang, Qi Li, Ye Liu, Wei Liu, Jian Yin
A Self-learning Rule-Based Approach for Sci-tech Compound Phrase Entity Recognition

Sci-tech compound phrase entity (e.g., the names of projects, books and patents) recognition is a fundamental task of science and technology data processing and discovery. However, much less work has been done on the problem. In this paper, we first give the characteristics of sci-tech entities that distinguish them from personal names and other traditional entities. Then we introduce a self-learning rule-based approach to address the problem. The approach consists of three stages, namely rule-based text truncation, BlackPOS-based text split and WhiteKey-based confirmation. Constructing the best WhiteKey list is an NP-hard problem. We further design dynamic programming and greedy algorithms to address the problem. Experimental results show that our approach achieves a 94.78% precision rate, an 89.19% recall rate and a 91.9% F1 measure on average. Moreover, our approach is universal and orthogonal to prior named entity recognition work.
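As a quick consistency check on the figures above, and assuming the reported F1 measure is the harmonic mean of the reported precision and recall (the abstract does not say how the average is taken), the three numbers agree:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.9478 \times 0.8919}{0.9478 + 0.8919}
    = \frac{1.6907}{1.8397}
    \approx 0.919
```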

Tingwen Liu, Yang Zhang, Yang Yan, Jinqiao Shi, Li Guo
Batch Mode Active Learning for Geographical Image Classification

In this paper, an innovative batch mode active learning method that combines discriminative and representative information is proposed for hyperspectral image classification with support vector machines. In past years, batch mode active learning has mainly exploited different query functions, which are based on two criteria: the uncertainty criterion and the diversity criterion. Generally, the uncertainty criterion and the diversity criterion are independent of each other, and they cannot ensure that the queried samples are independent and identically distributed. The proposed method focuses on the diversity criterion. In the new diversity criterion, we first derive a novel form of upper bound for the true risk in the active learning setting and minimize this upper bound to measure the discriminative information, which is connected with the uncertainty. Second, for the representative information, the maximum mean discrepancy (MMD), which captures the representative information of the data structure, is adopted to match the distribution of the labeled samples and the query samples, ensuring that the queried samples have a distribution similar to that of the labeled samples and that the queried samples are diversified. Meanwhile, the number of newly queried samples is adaptive and depends on the distribution of the labeled samples. In the experiments, we employ two benchmark remote sensing images, Indian Pines and Washington DC. The experimental results demonstrate the effectiveness of our proposed method compared with state-of-the-art AL methods.

Zengmao Wang, Bo Du, Lefei Zhang, Wenbin Hu, Dacheng Tao, Liangpei Zhang
A Multi-view Retweeting Behaviors Prediction in Social Networks

Retweeting is the most prominent feature in online social networks. It allows users to reshare another user’s tweets with their followers and brings about secondary information diffusion. Predicting retweeting behaviors is an important and essential task for advertising product launches, hot event detection and analysis of human behavior. However, although many methods and systems have been developed for modeling retweeting behaviors, this problem has not been fully explored. In this paper, we first cast the problem of retweeting behavior prediction as a classification task and give a formal definition. We then systematically summarize and extract a number of features, namely user status, content, temporal, and social tie information, for predicting users’ retweeting behaviors. We incorporate these features into a Support Vector Machine (SVM) model for our prediction problem. Finally, we conduct extensive experiments on a real-world dataset collected from Twitter to validate our proposed approach. Our experimental results demonstrate that, by combining the extracted features, our proposed model improves prediction effectiveness compared to baselines that do not.

Bo Jiang, Ying Sha, Lihong Wang
Probabilistic Frequent Pattern Mining by PUH-Mine

To mine frequent itemsets from uncertain data, many existing algorithms rely on expected support based mining. An alternative approach relies on probabilistic based mining, which captures the frequentness probability. While the possible world semantics are widely used, the exponential growth of possible worlds makes the probabilistic based mining computationally challenging when compared to the expected support based mining. In this paper, we propose two efficient approximate hyperlinked structure based algorithms, which generate a collection of all potentially probabilistic frequent itemsets with a novel upper bound and verify if they are truly probabilistic frequent. Experimental results show the efficiency of our algorithms in mining probabilistic frequent itemsets from uncertain data.
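To make the two notions contrasted above concrete, the following is a minimal sketch and not the PUH-Mine algorithm itself. It assumes, hypothetically, that an itemset occurs in each transaction independently with a known probability. Expected support is then the sum of those probabilities, while the frequentness probability is the chance that the actual support reaches a threshold; here it is computed exactly with a simple dynamic program instead of enumerating possible worlds. The function names and the independence assumption are illustrative only.

```python
def expected_support(probs):
    """Expected number of transactions containing the itemset."""
    return sum(probs)


def frequentness_probability(probs, minsup):
    """P(support >= minsup), assuming independent transactions.

    dist[k] holds the probability that exactly k of the transactions
    processed so far contain the itemset.
    """
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for k, q in enumerate(dist):
            nxt[k] += q * (1.0 - p)   # itemset absent from this transaction
            nxt[k + 1] += q * p       # itemset present in this transaction
        dist = nxt
    return sum(dist[minsup:])


# Two transactions, each containing the itemset with probability 0.5.
probs = [0.5, 0.5]
print(expected_support(probs))             # 1.0
print(frequentness_probability(probs, 1))  # 0.75
print(frequentness_probability(probs, 2))  # 0.25
```

The dynamic program is quadratic in the number of transactions, which hints at why approximate methods with upper bounds, such as those the abstract proposes, are attractive on large data.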

Wenzhu Tong, Carson K. Leung, Dacheng Liu, Jialiang Yu
A Secure and Efficient Framework for Privacy Preserving Social Recommendation

The well-known cold start problem in traditional collaborative filtering based recommender systems can be effectively addressed by social recommendation, as demonstrated by a number of recent studies in many application domains. The social graph used in social recommendation is typically owned by a third party such as Facebook or Twitter, and should be hidden from recommender systems for obvious commercial reasons, as well as due to privacy legislation. In this paper, we present a secure and efficient framework for privacy preserving social recommendation. Our framework is built on mature cryptographic building blocks, including the Paillier cryptosystem and Yao’s protocol, which lay a solid foundation for the security of our framework. Using our framework, the owner of sales data and the owner of the social graph can cooperatively compute social recommendations without revealing their private data to each other. We theoretically prove the security and analyze the complexity of our framework. An empirical study shows that our framework has linear complexity with respect to the number of users and items in recommender systems and is practical in real applications.

Shushu Liu, An Liu, Guanfeng Liu, Zhixu Li, Jiajie Xu, Pengpeng Zhao, Lei Zhao
DistDL: A Distributed Deep Learning Service Schema with GPU Accelerating

Deep learning is a hot topic in both industry and academia that integrates the broad field of artificial intelligence with the deployment of deep neural networks in the big data era. Recently, the capability to train neural networks has resulted in state-of-the-art performance in many domains such as computer vision, speech recognition, recommender systems, natural language processing, drug discovery and behavioural analysis. However, existing deep learning systems are inadequate for scaling, especially in current cloud infrastructures where the nodes are distributed across multiple geographical locations. The emerging trend is that deep learning must be optimized collaboratively by both the machine learning and systems fields.

In this paper, we present DistDL, a novel distributed deep learning service schema that reduces training time and communication overhead and achieves extreme parallelism in both data and model. Additionally, we incorporate GPUs into DistDL by leveraging the remarkable competency of heterogeneous computation. The results of our benchmark experiments suggest that DistDL adapts to various scaling patterns with the same accuracy while minimising training time by adopting the GPU, with up to 80% speedup.

Jianzong Wang, Lianglun Cheng
A Semi-supervised Solution for Cold Start Issue on Recommender Systems

The recommender system is the most competitive solution to the information overload problem and has been extensively applied. Current collaborative filtering based recommender systems explore users’ latent interests using their historical online behavior records. They all face the cold start issue. In this work, we propose a background-based semi-supervised tri-training method named BSTM to tackle this problem. In detail, we capture fine-grained background information about users by using a factorization model. By exploring this information, the performance of our recommendation can be significantly improved. In addition, we propose a semi-supervised ensemble algorithm that involves both labeled and unlabeled data. This algorithm assembles diverse weak prediction models, which are generated by exploring samples with diverse background information and by the tri-training tactic. The experimental results show that, with this solution, the accuracy of recommendation is significantly improved and the cold start issue is alleviated.

Zhifeng Hao, Yingchao Cheng, Ruichu Cai, Wen Wen, Lijuan Wang

Industry Track

Frontmatter
Hybrid Cloud Deployment of an ERP-Based Student Administration System

Conceptually, a hybrid cloud approach to deploying an ERP system can effectively achieve a cost-performance balance, especially during seasonal peaks of concurrent accesses. This paper describes an actual implementation of an ERP-based student administration system, where the web tier and application tier are deployed on a hybrid cloud with the databases residing on premises. The key challenge of this implementation is to set up the web tier and application tier for the right cost-performance balance while attaining an appropriate level of resilience. In this paper, a technical evaluation is conducted to demonstrate the cost efficiency of the hybrid cloud deployment, and the benefits are discussed. These provide a useful reference on hybrid cloud deployment of ERP systems.

Simon K. S. Cheung
A Benchmark Evaluation of Enterprise Cloud Infrastructure

As cloud computing matures in enterprise IT systems, agile and scalable evaluation of cloud infrastructure becomes critical. A lightweight benchmark approach is introduced to meet the requirements of low cost, scalability and relevance to typical enterprise environments. Benchmark results are analyzed and compared to industry benchmark data, and conclusions are drawn on implementing a scientific evaluation of enterprise cloud infrastructure.

Yong Chen, Xuanzhong Xie, Jiayin Wu
A Fast Data Ingestion and Indexing Scheme for Real-Time Log Analytics

Structured log data is a kind of append-only time-series data that grows rapidly as new entries are continuously generated and captured. It has become very popular in application domains such as the Internet, sensor networks and telecommunications. In recent years, many systems have been developed to support batch analysis of such structured log data, but they often fail to meet the high throughput requirements of real-time log data ingestion and analytics. An efficient index is very important for accelerating log data analytics while at the same time supporting high throughput data loading. This paper focuses on designing a specialized indexing scheme for real-time log data analytics. The solution adopts a dynamic global hash index to partition the tuples into hash buckets. The tuples in the hash buckets are then sorted and buffered in the sort buffer queue. When the amount of data in the queue reaches a threshold, the data is packed into segments before spilling to disk. Moreover, an intra-segment index is maintained by the meta database. With such an indexing scheme, the database system achieves high throughput and real-time data loading and query performance. As shown in the experiments, the data loading throughput reaches 5 million tuples per second per node. The delay of data loading does not exceed 10 seconds, and sub-second query performance is achieved for the given queries.
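The ingestion path described above (hash partitioning, sorted buffering, threshold-triggered packing into segments, and a metadata index over segments) can be sketched roughly as follows. This is a toy in-memory illustration under stated assumptions, not the authors' system: the class name `LogIndexSketch`, the use of CRC32 as the hash, sorting at flush time, and the per-segment minimum and maximum timestamps standing in for the intra-segment index are all hypothetical choices.

```python
import zlib
from collections import defaultdict


class LogIndexSketch:
    """Toy model: hash buckets -> sort buffers -> segments + metadata."""

    def __init__(self, num_buckets=4, threshold=3):
        self.num_buckets = num_buckets
        self.threshold = threshold          # tuples per bucket before a flush
        self.buffers = defaultdict(list)    # bucket id -> buffered tuples
        self.segments = []                  # stand-in for on-disk segments
        self.meta = []                      # stand-in for the meta database

    def bucket_of(self, key):
        # Deterministic hash so bucket assignment is stable across runs.
        return zlib.crc32(key.encode("utf-8")) % self.num_buckets

    def ingest(self, key, timestamp, payload):
        bucket = self.bucket_of(key)
        self.buffers[bucket].append((timestamp, key, payload))
        if len(self.buffers[bucket]) >= self.threshold:
            self.flush(bucket)

    def flush(self, bucket):
        # Sort the buffered tuples by timestamp and pack them into a segment.
        rows = sorted(self.buffers.pop(bucket, []), key=lambda r: r[0])
        if not rows:
            return
        self.segments.append(rows)
        self.meta.append({
            "segment": len(self.segments) - 1,
            "bucket": bucket,
            "min_ts": rows[0][0],
            "max_ts": rows[-1][0],
        })

    def query(self, key, start_ts, end_ts):
        bucket = self.bucket_of(key)
        hits = []
        for m in self.meta:
            # Skip segments whose bucket or time range cannot match.
            if m["bucket"] != bucket or m["max_ts"] < start_ts or m["min_ts"] > end_ts:
                continue
            hits.extend(r for r in self.segments[m["segment"]]
                        if r[1] == key and start_ts <= r[0] <= end_ts)
        # Freshly ingested tuples are still in the buffer and stay queryable.
        hits.extend(r for r in self.buffers.get(bucket, [])
                    if r[1] == key and start_ts <= r[0] <= end_ts)
        return sorted(hits, key=lambda r: r[0])
```

In this sketch a query consults the metadata first, so segments outside the requested bucket or time range are never scanned, while unflushed tuples remain visible, which mirrors the abstract's goal of combining fast loading with real-time queries.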

Haoqiong Bian, Yueguo Chen, Xiongpai Qin, Xiaoyong Du

Demonstration Track

Frontmatter
A Fair Data Market System with Data Quality Evaluation and Repairing Recommendation

With the development of data markets, data resources play a key role as a part of business resources. However, existing data markets are too simple to reveal the real value of data in practical applications. Motivated by the effectiveness and fairness of the data market, we develop a fair data market system that takes data quality into consideration. In our system, we design a fair data price evaluation mechanism, which aims at meeting the needs of both supply and demand. For the data quality issues in the data market, several critical factors, including accuracy, completeness, consistency, and currency, are integrated in order to give a comprehensive assessment of the data. Moreover, our system can also provide data repairing recommendations based on data quality evaluation.

Xiaoou Ding, Hongzhi Wang, Dan Zhang, Jianzhong Li, Hong Gao
HouseIn: A Housing Rental Platform with Non-redundant Information Integrated from Multiple Sources

Housing Rental Platforms (HRPs) such as rentalhouses.com, 58.com and ganji.com provide convenient ways to find accommodation, but the redundancy of rental advertising information within and across platforms leads to an unpleasant user experience. Besides, rental advertisements are usually presented in the form of a list, which can hardly give users a straightforward big picture of the housing rental market in a particular area of interest. In this demonstration, we introduce HouseIn, a novel HRP that aims to: 1) provide users with a clear big picture, in several aspects, of the housing rental market in a particular area of interest; and 2) detect advertisements referring to the same property, so as to give users a price comparison between various platforms for the same rental apartment. The core challenge in implementing HouseIn lies in the Record Matching problem between these advertisements, given that no exact unit or building number of the apartment is given for privacy reasons, and the detailed information about the house can vary across agencies. To handle the Record Matching (RM) problem between these advertisements, we employ several state-of-the-art RM methods plus Information Extraction (IE) techniques to use both structured and unstructured information in the advertisements. We show the advantages of HouseIn with several demonstration scenarios.

Jian Zhou, Zhixu Li, Qiang Yang, Jun Jiang, Jia Zhu, An Liu, Guanfeng Liu, Lei Zhao
ONCAPS: An Ontology-Based Car Purchase Guiding System

In this paper, an ontology-based car purchase guiding system, ONCAPS, is presented. The system integrates Semantic Web technologies to provide car information to people who want to buy cars. It employs ontology-based data integration technology to gather data from multiple car selling websites in China. It captures personalized weights on car features by an analytic hierarchy process based approach. It computes the personalized distance between a car and a given requirement by a weighted sum model where ontology reasoning is involved. For purchase guiding, ONCAPS suggests to the user all cars that have the minimum personalized distance to the given requirement. The suggestions are shown with explanations of why these cars are recommended.
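The weighted sum model mentioned above can be illustrated with a small sketch. It is only a simplified stand-in: it assumes, hypothetically, that every feature is already normalized to the range [0, 1] and that the per-feature distance is an absolute difference, and it leaves out both the ontology reasoning and the analytic hierarchy process that ONCAPS uses to obtain the weights. All names and numbers below are made up for illustration.

```python
import math


def personalized_distance(car, requirement, weights):
    """Weighted sum of per-feature distances (features assumed in [0, 1])."""
    return sum(w * abs(car[f] - requirement[f]) for f, w in weights.items())


def suggest(cars, requirement, weights):
    """Return the names of all cars at minimum personalized distance."""
    scored = {name: personalized_distance(features, requirement, weights)
              for name, features in cars.items()}
    best = min(scored.values())
    return sorted(name for name, d in scored.items() if math.isclose(d, best))


# Hypothetical normalized feature values and user weights.
cars = {
    "A": {"price": 0.5, "fuel": 0.5},
    "B": {"price": 0.25, "fuel": 0.75},
    "C": {"price": 1.0, "fuel": 0.0},
}
requirement = {"price": 0.25, "fuel": 0.5}
weights = {"price": 0.5, "fuel": 0.5}

print(suggest(cars, requirement, weights))  # ['A', 'B']
```

Returning every car tied at the minimum, and not just one, follows the abstract's statement that all cars with the minimum personalized distance are suggested.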

Jianfeng Du, Jun Zhao, Jiayi Cheng, Qingchao Su, Jiacheng Liang
A Multiple Trust Paths Selection Tool in Contextual Online Social Networks

Online Social Networks (OSNs) are becoming popular, and trust is one of the most important factors in participants’ decision making. This requires efficiently and effectively selecting the social trust paths that yield the most trustworthy trust evaluation results between two unknown participants, in order to establish reasonable trust relationships between them. Thus, we develop a Multiple Trust Paths (MTP) selection tool based on the state-of-the-art trust path selection method we proposed, which considers social contexts, such as the social trust and social intimacy degree between participants, and the role impact factor of participants in trust path selection. Our tool can help users evaluate the trustworthiness of unknown participants in a variety of OSN based applications, for instance, to help a retailer identify new trustworthy customers and introduce products to them, or to help users select trustworthy workers in OSN based crowdsourcing platforms.

Linlin Ma, Guanfeng Liu, Guohao Sun, Lei Li, Zhixu Li, An Liu, Lei Zhao
EPEMS: An Entity Matching System for E-Commerce Products

Entity matching is used to identify records representing the same real-world entities. As e-commerce develops rapidly, online products grow explosively in both number and variety. Applying entity matching to e-commerce data and finding records that represent the same products makes it convenient for customers to compare prices. This paper proposes an entity matching system for e-commerce data, called EPEMS. Compared with existing systems, we improve an existing sorted neighborhood blocking method, which is used to reduce the number of comparisons. In addition, the similarity of product pictures is used to improve matching results.
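For readers unfamiliar with sorted neighborhood blocking, the following minimal sketch shows the standard baseline technique, not the improved variant that EPEMS proposes. Records are sorted by a blocking key and only records that fall within a sliding window of each other become candidate pairs. The key function and the sample product titles are hypothetical.

```python
def sorted_neighborhood_pairs(records, key, window=3):
    """Candidate pairs from a sliding window over key-sorted records.

    Each record is paired only with the next (window - 1) records in
    sorted order, instead of with every other record.
    """
    ordered = sorted(records, key=key)
    pairs = []
    for i, left in enumerate(ordered):
        for right in ordered[i + 1:i + window]:
            pairs.append((left, right))
    return pairs


# Hypothetical product titles; the blocking key ignores spaces.
titles = ["ipad air 2", "galaxy s6", "ipad air2", "galaxy s 6", "thinkpad x1"]
pairs = sorted_neighborhood_pairs(titles, key=lambda s: s.replace(" ", ""), window=2)
print(len(pairs))  # 4 candidate pairs instead of 10 for all-pairs comparison
```

With five records, comparing every pair would need 10 comparisons; a window of 2 keeps 4 candidates while still placing the two spellings of each product next to each other.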

Lei Gao, Pengpeng Zhao, Victor S. Sheng, Zhixu Li, An Liu, Jian Wu, Zhiming Cui
PPS-POI-Rec: A Privacy Preserving Social Point-of-Interest Recommender System

Point-of-Interest (POI) recommendation is an important task for location based service (LBS) providers. Social POI recommendation outperforms traditional, non-social approaches because social relations among users (a.k.a. the social graph) can be used as a data source to calculate user similarities, which are generally hard to evaluate due to the lack of sufficient user check-in data. However, the social graph is typically owned by a social networking service (SNS) provider such as Facebook and should be hidden from the LBS provider for obvious commercial reasons, as well as due to privacy legislation. In this paper, we present PPS-POI-Rec, a novel privacy preserving social POI recommender system that enables an SNS provider and an LBS provider to cooperatively recommend a set of POIs to a target user while keeping their private data secret. We will demonstrate step by step how a social POI recommendation can be jointly made by the SNS provider and the LBS provider without revealing their private data to each other.

Xiao Liu, An Liu, Guanfeng Liu, Zhixu Li, Jiajie Xu, Pengpeng Zhao, Lei Zhao
Incorporating Contextual Information into a Mobile Advertisement Recommender System

The ever-growing popularity of smart mobile devices and the rapid advance of wireless technology have given rise to a new class of advertising system, the mobile advertisement recommender system. Traditional Internet advertising systems have largely ignored the fact that users interact with the system within a particular “context”. In this paper, we implement a prototype mobile advertisement recommender system called MARS. MARS captures users’ contextual information to improve recommendation results. The demonstration shows that MARS makes advertisement recommendation more effective.

Ke Zhu, Yingyuan Xiao, Pengqiang Ai, Hongya Wang, Ching-Hsien Hsu
Backmatter
Metadata
Title
Web Technologies and Applications
Editors
Reynold Cheng
Bin Cui
Zhenjie Zhang
Ruichu Cai
Jia Xu
Copyright Year
2015
Electronic ISBN
978-3-319-25255-1
Print ISBN
978-3-319-25254-4
DOI
https://doi.org/10.1007/978-3-319-25255-1
