Skip to main content
main-content

Über dieses Buch

This LNCS double volume LNCS 9931-9932 constitutes the refereed proceedings of the 18th Asia-Pacific Conference APWeb 2016 held in Suzhou, China, in September 2016.
The 79 full papers and presented together with 24 short papers and 17 demo papers were carefully reviewed and selected from 215 submissions.
the focus of the conference was on following subjects: Spatio-temporal, Textual and Multimedia Data Management
Social Media Data Analysis
Modelling and Learning with Big Data
Streaming and Real-time Data Analysis
Recommendation System
Data Quality and Privacy
Query Optimization and Scalable Data Processing

Inhaltsverzeichnis

Frontmatter

Research Full Paper: Spatio-temporal, Textual and Multimedia Data Management

Frontmatter

Probabilistic Nearest Neighbor Query in Traffic-Aware Spatial Networks

Travel planning and recommendation have received significant attention in recent years. In this light, we study a novel problem of finding probabilistic nearest neighbors and planning the corresponding travel routes in traffic-aware spatial networks (TANN queries) to avoid traffic congestions. We propose and study two probabilistic TANN queries: (1) a time-threshold query like “what is my closest restaurant with the minimum congestion probability to take at most 45 min?”, and (2) a probability-threshold query like “what is the fastest path to my closest petrol station whose congestion probability is less than 20 %?”. We believe that this type of queries may benefit users in many popular mobile applications, such as discovering nearby points of interest and planning convenient travel routes for users. The TANN queries are challenged by two difficulties: (1) how to define probabilistic metrics for nearest neighbor queries in traffic-aware spatial networks, and (2) how to process the TANN queries efficiently under different query settings. To overcome these challenges, we define a series of new probabilistic metrics and develop two efficient algorithms to compute the TANN queries. The performances of TANN queries are verified by extensive experiments on real and synthetic spatial data.

Shuo Shang, Zhewei Wei, Ji-Rong Wen, Shunzhi Zhu

NERank: Bringing Order to Named Entities from Texts

Most entity ranking research aims to retrieve a ranked list of entities from a Web corpus given a query. However, entities in plain documents can be ranked directly based on their relative importance, in order to support entity-oriented Web applications. In this paper, we introduce an entity ranking algorithm NERank to address this issue. NERank first constructs a graph model called Topical Tripartite Graph from a document collection. A ranking function is designed to compute the prior ranks of topics based on three quality metrics. We further propose a meta-path constrained random walk method to propagate prior topic ranks to entities. We evaluate NERank over real-life datasets and compare it with baselines. Experimental results illustrate the effectiveness of our approach.

Chengyu Wang, Rong Zhang, Xiaofeng He, Guomin Zhou, Aoying  Zhou

FTS: A Practical Model for Feature-Based Trajectory Synthesis

Driven by the GPS-enabled devices and wireless communication technologies, the researches and applications on spatio-temporal databases have received significant attention during the past decade. Hence, large trajectory datasets are extremely necessary to test high performance algorithms for these applications and researches. However, real-world datasets are not accessible in many cases due to privacy concerns and business competition. For this reason, we propose a practical model FTS to generate new trajectories in this work. We generate new trajectories based on features extracted from original dataset and validate the result by comparing the features of generated trajectories with the given dataset finally.

Jiapeng Li, Wei Chen, An Liu, Zhixu Li, Lei Zhao

Distributed Text Representation with Weighting Scheme Guidance for Sentiment Analysis

With rapid growth of social media, sentiment analysis has recently attracted growing attention in both academic and industrial fields. One of the most successful paradigms for sentiment analysis is to feed bag-of-words (BOW) features into classifiers. Usually, weighting schemes are required to weight raw BOW features to obtain better accuracy, where important words are assigned more weights while unimportant ones are given less weights. Another line of researches for sentiment analysis focuses on neural models, where dense features are automatically extracted from texts by neural networks. In this paper, we take advantages of techniques in both lines of researches, where weighting schemes are introduced into the neural models to guide neural networks to focus on those important words. Neural models are known for their automatic feature learning abilities, however, we discover that when suitable guidance such as weighting schemes are applied, better features can be extracted for sentiment analysis. Experimental results show that our models outperform or can compete with state-of-the-art approaches on three commonly used sentiment analysis datasets.

Zhe Zhao, Tao Liu, Xiaoyun Hou, Bofang Li, Xiaoyong Du

A Real Time Wireless Interactive Multimedia System

Recent years, various interactive multimedia systems have been applied to relevant fields such as education, entertainment, etc. Researchers exploit sensors, computer vision, ultrasonic, and electromagnetic radiation to achieve human-computer interaction (HCI). This paper proposes an interactive wireless multimedia system which utilizes ubiquitous wireless signals to identify human motions around smart WiFi devices. Compared with related work, our system realizes interactions between human and computer without extra hardware devices. The system identifies human gestures around the smart devices (i.e., a laptop) equipped with the commercial 802.11n NIC, and it maps different gestures into distinguishable computer instructions. We build a proof-of-concept prototype using off-the-shelf laptop and evaluate the system in a laboratory environment with standard WiFi access points. The results show that our system detects human gesture with an accuracy over 95 % and it achieves an average gesture classification accuracy of 89 % for five different users.

Hong Li, Wei Yang, Yang Xu, Jianxin Wang, Liusheng Huang

Mining Co-locations from Continuously Distributed Uncertain Spatial Data

A co-location pattern is a group of spatial features whose instances tend to locate together in geographic space. While traditional co-location mining focuses on discovering co-location patterns from deterministic spatial data sets, in this paper, we study the problem in the context of continuously distributed uncertain data. In particular, we aim to discover co-location patterns from uncertain spatial data where locations of spatial instances are represented as multivariate Gaussian distributions. We first formulate the problem of probabilistic co-location mining based on newly defined prevalence measures. When the locations of instances are represented as continuous variables, the major challenges of probabilistic co-location mining lie in the efficient computation of prevalence measures and the verification of the probabilistic neighborhood relationship between instances. We develop an effective probabilistic co-location mining framework integrated with optimization strategies to address the challenges. Our experiments on multiple datasets demonstrate the effectiveness of the proposed algorithm.

Bozhong Liu, Ling Chen, Chunyang Liu, Chengqi Zhang, Weidong Qiu

An Online Approach for Direction-Based Trajectory Compression with Error Bound Guarantee

With the increasing usage of GPS-enabled devices which can record users’ travel experiences, moving object trajectories are collected in many applications. Raw trajectory data can be of large volume but storage is limited, and direction-based compression to preserve the skeleton of a trajectory became popular recently. In addition, real-time applications and constrained resources often require online processing of incoming data instantaneously. To address this challenge, in this paper we first investigate two approaches extended from Douglas-Peucker and Greedy Deviation algorithms respectively, which are two most popular algorithms for trajectory compression. To further improve the online computational efficiency, we propose a faster approximate algorithm with error bound guarantee named Angular-Deviation. Experimental results demonstrate it can achieve low running time to suit the most constrained computation environments.

Bingqing Ke, Jie Shao, Yi Zhang, Dongxiang Zhang, Yang Yang

A Data Grouping CNN Algorithm for Short-Term Traffic Flow Forecasting

In this paper, a data grouping approach based on convolutional neural network (DGCNN) is proposed for forecasting urban short-term traffic flow. This approach includes the consideration of spatial relations between traffic locations, and utilizes such information to train a convolutional neural network for forecasting. There are three advantages of our approach: (1) the spatial relations of traffic flow are adopted; (2) high-quality features are extracted by CNN; and (3) the accuracy of forecasting short-term traffic flow is improved. To verify our model, extensive experiments are performed on a real data set, and the result shows that the model is more effective than other existing methods.

Donghai Yu, Yang Liu, Xiaohui Yu

Efficient Evaluation of Shortest Travel-Time Path Queries in Road Networks by Optimizing Waypoints in Route Requests Through Spatial Mashups

In the real world, the route with the shortest travel time in a road network is more meaningful than that with the shortest network distance for location-based services (LBS). However, not every LBS provider has adequate resources to compute/estimate travel time for routes by themselves. A cost-effective way for LBS providers to estimate travel time for routes is to issue external requests to Web mapping services (e.g., Google Maps, Bing Maps, and MapQuest Maps). Due to the high cost of processing such external requests and the usage limits of Web mapping services, we take the advantage of direction sharing and waypoints supported by Web mapping services to reduce the number of external requests and the query response time for shortest travel-time route queries in this paper. We model the problem of selecting the optimal waypoints for an external route request as finding the longest simple path in a weighted bipartite digraph. As it is a NP-complete problem, we propose a greedy algorithm to find the best set of waypoints in an external route request. We evaluate the performance of our approach using real Web mapping services, a real road network, real and synthetic data sets. Experimental results show the efficiency, scalability, and applicability of our approach.

Detian Zhang, Chi-Yin Chow, Qing Li, An Liu

Discovering Companion Vehicles from Live Streaming Traffic Data

Companions of moving objects are object groups that move together in a period of time. To quickly identify companion vehicles from a special kind of streaming traffic data, called Automatic Number Plate Recognition (ANPR) data, this paper proposes an approach to discover companion vehicles. Compared to related approaches, we transform the companion discovery into a frequent sequence-mining problem. We make several improvements on top of a recent frequent sequence-mining algorithm, called SeqStream, to handle customized time constraints among sequence elements when discovering traveling companions. We also use pseudo projection technique to improve the performance of our algorithm. Finally, extensive experiments are done using a real dataset to show efficiency and effectiveness of our approach.

Chen Liu, Xiongbin Wang, Meiling Zhu, Yanbo Han

Time-Constrained Sequenced Route Query in Indoor Spaces

Location-based services (LBSs) in indoor spaces have emerged as a new research direction. In this paper, we study a new kind of indoor LBSs that is called Time-Constrained Sequenced Route (TCSR) query. A TCSR query returns a route consisting of a sequence of indoor locations before a given deadline such that each location matches a given location type as well as a given stay-time period. Such queries are popular in indoor spaces, e.g., in a business center, people may want to first stay at a toy shop for 30 min, and then go to a coffee room for one-hour rest, and finally arrive at a cinema before 18:00 PM. Classic route-search algorithms like Dijkstra have to search a large set of possible routes and thus are inefficient for TCSR queries. In addition, they do not consider the multi-floor feature of indoor spaces. In this paper we present a two-stage approach to evaluate a TCSR query. First, we find the optimal floor sequence for a TCSR query. Next, we propose a multi-source Dijkstra algorithm to get the time-constrained sequenced locations in a single floor. We conduct experiments on a synthetic indoor space and the results suggest that our proposal is efficient and scalable.

Wenyi Luo, Peiquan Jin, Lihua Yue

Near-Duplicate Web Video Retrieval and Localization Using Improved Edit Distance

With the development of network, there exists many near-duplicate videos online shared by individuals. These ones cause problems such as copyright infringement and search result redundancy. To solve the issues, this paper proposes a filter-and-refine framework for near-duplicate video retrieval and localization. By regarding video sequences as strings, Edit distance is used and improved in the approach. Firstly, bag-of-words (BOW) model is utilized to measure the similarities between frames. Then, non-near-duplicate videos are filtered out by computing the proposed relative Edit distance similarity (REDS). Next, a dynamic programming strategy is proposed to rank the remained videos and localize the similar segments. Experiments demonstrate the effectiveness and robustness of the method in retrieval and localization.

Hao Liu, Qingjie Zhao, Hao Wang, Cong Zhang

Efficient Group Top-k Spatial Keyword Query Processing

With the proliferation of geo-positioning and geo-tagging, spatial web objects that possess both a geographical location and textual description are gaining in prevalence. Given a spatial location and a set of keywords, a top-k spatial keyword query returns the k best spatio-textual objects ranked according to their proximity to the query location and relevance to the query keywords. To our knowledge, existing study on spatial keyword query processing only focuses on single query point scenario. In this paper, we take the first step to study the problem of multiple query points (or group queries) top-k spatial keyword query processing. We first propose a threshold-based algorithm, which first performs incremental top-k spatial keyword query for each query point and then combines their results. Next, we propose another more efficient algorithm by treating the whole query set as a query unit, which can significantly reduce the objects to be examined, and thus achieve higher performance. Extensive experiments using real datasets demonstrate that our approaches are efficient in terms of runtime and I/O cost, as compared to the baseline algorithm.

Kai Yao, Jianjun Li, Guohui Li, Changyin Luo

Research Full Paper: Social Media Data Analysis

Frontmatter

Learn to Recommend Local Event Using Heterogeneous Social Networks

Event-based social networks (EBSNs), which link the online and offline social networks, are increasing popular online services. Along with dramatic rise of the users and events in EBSNs, it is necessary to recommend event to users. Taking full advantage of social networks information can significantly improve predictive accuracy in recommender systems. The intuition here is that the user’s response to events are determined by his/her instinct and behaviours of friends. We propose a Heterogeneous Social Poisson Factorization(HSPF) model which combines online and offline social networks into one framework, and integrates the tie strength of online and offline friend relationships to the model. We test HSPF on Meetup dataset. Experimental results demonstrate that HSPF outperforms state-of-the-art recommendation methods.

Shaoqing Wang, Zheng Wang, Cuiping Li, Kankan Zhao, Hong Chen

Top-k Temporal Keyword Query over Social Media Data

Analytic jobs over social media data typically need to explore data of different periods. However, most existing keyword search work merely use creation time of items as the measurement of their recency. In this paper we propose top-k temporal keyword query that ranks data by their aggregate sum of shared times during the given time window. A query algorithm that can be executed over a general temporal inverted index is provided. The complexity analysis based on the power law distribution reveals the upper bound of accessed items. Furthermore, two-tiers structure and piecewise maximum approximation sketch are proposed as refinements. Extensive empirical studies on a reallife dataset show the combination of two refinements achieves remarkable performance improvement under different query settings.

Fan Xia, Chengcheng Yu, Weining Qian, Aoying Zhou

When a Friend Online is More Than a Friend in Life: Intimate Relationship Prediction in Microblogs

Microblogging services such as Twitter and Sina Weibo have been an important, if not indespensible, platform for people around the world to connect to one another. The rich content and user interactions on these platforms reveal insightful information about each user that are valuable for various real-life applications. In particular, user offline relationships, especially those intimate ones such as family members and couples, offer distinctive value for many business and social settings. In this study, we focus on using Sina Weibo to discover intimate offline relationships among users. The problem is uniquely interesting and challenging due to the difficulty in mining such sensitive and implicit knowledge across the online-offline boundary. We introduce deep learning approaches to this relationship identity problem and adopt an integrated model to capture features from both user profile and mention message. Our experiments on real data demonstrate the effectiveness of our approach. In addition, we present interesting findings from behavior between intimate users in terms of user features and interaction patterns.

Yunshi Lan, Mengqi Zhang, Feida Zhu, Jing Jiang, Ee-Peng Lim

Community Inference with Bayesian Non-negative Matrix Factorization

In terms of networks, the clustering is based on the topology structure of the network and the groups found are called Communities. We might expect a coherent group to be one which has more links between members of the group than it has to nodes outside the group in other clusters. Detection Communities in a large network can efficiently simplify network structure, help to understand the network topology and learn how the network works.As a dimension reduction method, Non-negative Matrix Factorization (NMF) aims to find two non-negative matrices whose product approximates the original matrix well, and is widely used in graph clustering condition with good physical interpretability and universal applicability. Based on the consideration that there is no any physical meaning to reconstruct a network with negative adjacency matrix, using NMF to obtain new representations of network with non-negativity constraints can achieve much productive effect in community analysis.Incorporating Bayesian methods with prior knowledge for NMF, we can gain further insights into the data and determinate the optimal parameters for detecting model. In this paper, we propose a Bayesian non-negative matrix factorization method with Symmetric assumption (BSNMF), which not only achieve better community detection results in undirected network, but also effectively predict most appropriate count of communities in a large network with Automatic Relevance Determination model. We compare our approaches with other NMF-based methods in Email social networks, and experimental results for community detection show that our approaches are effective to find the communities number and achieve better community detection results.

Xiaohua Shi, Hongtao Lu

Maximizing the Influence Ranking Under Limited Cost in Social Network

Influence maximizing is an important problem which has been studied widely in recent years. There are many situations in which people are more concerned about influence ranking than influence coverage in competition network, because some times only the top-ranked users can win the rewards, while few researchers have studied this problem. In this paper, we consider the problem of selecting a seed set under limited cost to get as high influence ranking as possible. We show this problem is NP-hard and propose a Intelligent Greedy algorithm to approximately solve the problem and improve the efficiency based on the submodularity. Furthermore, a new Cost-Effective Multi-Step Influence Adjust algorithm is proposed to get high efficiency. Experimental results show that our Intelligent Greedy algorithm achieves better effectiveness than other algorithms and the Cost-Effective Multi-Step Influence Adjust algorithm achieves high efficiency and gets better effectiveness than Degree and Random algorithms.

Xiaoguang Hong, Ziyan Liu, Zhaohui Peng, Zhiyong Chen, Hui Li

Towards Efficient Influence Maximization for Evolving Social Networks

Identifying the most influential individuals can provide invaluable help in developing and deploying effective viral marketing strategies. Previous studies mainly focus on designing efficient algorithms or heuristics to find top-K influential nodes on a given static social network. While, as a matter of fact, real-world social networks keep evolving over time and a recalculation upon the changed network inevitably leads to a long running time, significantly affecting the efficiency. In this paper, we observe from real-world traces that the evolution of social network follows the preferential attachment rule and the influential nodes are mainly selected from high-degree nodes. Such observations shed light on the design of IncInf, an incremental approach that can efficiently locate the top-K influential individuals in evolving social networks based on previous information instead of calculation from scratch. In particular, IncInf quantitatively analyzes the influence spread changes of nodes by localizing the impact of topology evolution to only local regions, and a pruning strategy is further proposed to effectively narrow the search space into nodes experiencing major increases or with high degrees. We carried out extensive experiments on real-world dynamic social networks including Facebook, NetHEPT, and Flickr. Experimental results demonstrate that, compared with the state-of-the-art static heuristic, IncInf achieves as much as 21$$\times $$ speedup in execution time while maintaining matching performance in terms of influence spread.

Xiaodong Liu, Xiangke Liao, Shanshan Li, Bin Lin

Detecting Community Pacemakers of Burst Topic in Twitter

Twitter has become one of largest social networks for users to broadcast burst topics. Influential users usually have a large number of followers and play an important role in the diffusion of burst topic. There have been many studies on how to detect influential users. However, traditional influential users detection approaches have largely ignored influential users in user community. In this paper, we investigate the problem of detecting community pacemakers. Community pacemakers are defined as the influential users that promote early diffusion in the user community of burst topic. To solve this problem, we present DCPBT, a framework that can detect community pacemakers in burst topics. In DCPBT, a burst topic user graph model is proposed, which can represent the topology structure of burst topic propagation across a large number of Twitter users. Based on the model, a user community detection algorithm based on random walk is applied to discover user community. For large-scale user community, we propose a ranking method to detect community pacemakers in each large-scale user community. To test our framework, we conduct the framework over Twitter burst topic detection system. Experimental results show that our method is more effective to detect the users that influence other users and promote early diffusion in the early stages of burst topic.

Guozhong Dong, Wu Yang, Feida Zhu, Wei Wang

Dynamic User Attribute Discovery on Social Media

Social media service defines a new paradigm of people communicating, self-expressing and sharing on the Web. Users in today’s social media platforms often post contents, inferring their interests/attributes, which are significant for many Web services such as social recommendation, personalized searching and online advertising. User attributes are temporally dynamic along with internal interest changing and external influence. Based on topic modeling, we present a probabilistic method for dynamic user attribute discovery. Our method automatically detects user attributes and models the dynamics using time windows and decay function, thereby facilitating more accurate recommendation. Evaluation on a Sina Weibo dataset shows the superiority in terms of precision, recall and F-measure as compared to baselines, such as static user attribute modeling.

Xiu Huang, Yang Yang, Yue Hu, Fumin Shen, Jie Shao

The Competition of User Attentions Among Social Network Services: A Social Evolutionary Game Approach

As the total amount of users in the social network services approach the total amount of netizens, social network service providers have to compete with each other for the attention of the existing users, rather than attracting totally new users without any social network service using experiences. Most of the current game theoretical studies on social network services focus on the evolution of the cooperation/defection between users, and the evolution of the structure of the networks for a single social networks, and fail to consider the competition of multiple social networks over a same and stable users group and their user attentions. In this paper, we propose a competitive social evolutionary game model to describe the competition of user attentions among social network services. We introduce the concept of user attention, and the popularity of social networks to describe the local and general user attention distributions, respectively. Our simulation of the competition between two social networks with different initial network structures and cooperation/defection utilities shows that a greater reputation awareness can suppress the influence of the defect temptation value.

Jingyuan Li, Yuanzhuo Wang, Yuan Lu, Xueqi Cheng, Yan Ren

Mechanism Analysis of Competitive Information Synchronous Dissemination in Social Networks

Different group of information, such as advertising and product promotion, compete with each other as they diffuse over social networks. Most of the existing methods analyze the dissemination mechanism mainly upon the information itself, without considering human characteristics. This paper uses a framework of social evolutionary game to simulate the dissemination and adjusts utility function and updating mechanism based on coordination game. We find that individuals consider more about their own reputation and more communication between them, individuals are more cautious in the face of strategy choice. When the benefit of competitive information is nearly 1.2 times of the original one, it can make up the loss of reputation caused by changing strategy. For the specific network environment based on simulation, the actual data on Sina Weibo strongly verify this rule and shows that factor of reputation promotes the cooperation and users won’t easily change their information.

Yuan Lu, Yuanzhuo Wang, Jianye Yu, Jingyuan Li, Li Liu

A Topic-Specific Contextual Expert Finding Method in Social Network

Expert retrieval is a widely studied problem. However, most existing expert finding methods focus on social network which contains topic-irrelevant users and interactions. This results in that the expert results are not topic-specific and practical because many users need to find experts for certain topic. Furthermore, contextual factors of social network also affect the accuracy of expert finding and are seldom concerned comprehensively in existing approaches. To solve above problems, in this paper, we propose a topic-specific contextual expert finding method. At first, we define a topic-specific contextual feature model (TSCFM) which consists of a topic-aware model (TAM) for topical feature and a context-aware model (CAM) for contextual feature. TAM uses LDA and HITS to extract topical feature, and CAM evaluates social relation, time and location factors to extract contextual features. Then based on TSCFM, we learn an expert scoring function which synthetically concerns topical and contextual features using SVM algorithm and rank the experts. The experiments on two datasets demonstrate that our proposed expert finding method is feasible and can improve the accuracy.

Xiaoqin Xie, Yijia Li, Zhiqiang Zhang, Haiwei Pan, Shuai Han

Budget Minimization with Time and Influence Constraints in Social Network

The problem of influence maximization in social network gains increasing attention in recent years. The target is to seek a seed set that maximizes the influence coverage with given budget and infinite time space. However, in many real-world applications, people are eager to achieve the desired influence coverage with limited time and the smallest budget, where we want to find a minimum seed set with constrained time and influence. We refer to the problem as Budget Minimization (BM). The BM problem is much more challenging compared with traditional influence maximization w.r.t. the following reasons: (1) it requires to find both the minimum size of the seed set and the most influential nodes simultaneously, leading to complex and expensive optimization procedure; (2) the estimation of the influence coverage of a given seed set should be accurate enough, since we have to decide whether it can reach the influence threshold exactly. In this paper, we propose an Extended Simulated Annealing on Budget Minimization (ESABM) method to efficiently find the smallest seed set in the considered BM problem. The ESABM method extends the traditional Simulated Annealing (SA) algorithm by importing the ‘delete’ and ‘insert’ operators in addition to the ‘replace’ operator, which is used in the traditional SA algorithm. Based on the operators, some operator selection techniques are proposed with detailed theoretical guarantees. Moreover, since we have to estimate the influence coverage in the ESABM algorithm, we further propose an efficient layered-graph based influence coverage estimation method. Experimental results conducted on five real world data show that our proposed method outperforms the existing state-of-the-art methods in terms of both accuracy and efficiency.

Peng Dou, Sizhen Du, Guojie Song

B-mine: Frequent Pattern Mining and Its Application to Knowledge Discovery from Social Networks

As an important data mining task, frequent pattern mining has drawn attention from many researchers. This has led to the development of many frequent pattern mining algorithms, which include Apriori-based, tree-based, and hyperlinked array structure-based algorithms, as well as vertical mining algorithms. Although these algorithms are efficient and popular, they also suffer from some drawbacks. To tackle these drawbacks, we present in this paper an alternative algorithm called B-mine that uses a bitwise approach to mine frequent patterns. Evaluation results show the space- and time-efficiency of B-mine for frequent pattern mining, as well as the practicality of B-mine for social network analysis and knowledge discovery from social networks.

Fan Jiang, Carson K. Leung, Hao Zhang

An Efficient Online Event Detection Method for Microblogs via User Modeling

Detecting events in microblog is important but still challenging. As tweet stream is a mixture of user interests and external events, its expensive to distinguish them. Existing methods are ineffective since they ignore user interests or only model interests and events on a fixed dataset without scalability. In this paper, we introduce an online learning model User Modeling Based Interest and Event Topic Model (UMIETM). UMIETM (1) exploits user modeling’s information to discover events, which usually capture attentions from users with different interests, and (2) treats the arriving data as stream and run the detection in online learning style. Furthermore, UMIETM can handle dynamic increased vocabulary in tweet stream. The UMIETM is verified on the real dataset which spans one year and contains 16 million tweets, and it outperforms state-of-the-art models in quantitative.

Weijing Huang, Wei Chen, Lamei Zhang, Tengjiao Wang

Man-O-Meter: Modeling and Assessing the Evolution of Language Usage of Individuals on Microblogs

Language usage behavior of users evolves over time, as they interact on social media such as Twitter. We study the evolution of language usage behavior of individuals, across topics, on microblogs. We propose Man-O-Meter, a framework to model such evolution. We model the evolution using a combination of three dimensions: (a) time, (b) content (topics) and (c) influence flow over social relationships. We assert the goodness of our approach, by predicting ranks of experts, with respect to their influence in their respective expertise category, using the change in language used in time. We apply our framework on 2, 273 influential microbloggers on Twitter, across 62 categories, spanning over 10 domains. Our work is applicable in predicting activity and influence, interest evolution, job change and community change expected to happen to a user, in future.

Kuntal Dey, Saroj Kaushik, Hemank Lamba, Seema Nagar

Research Full Paper: Modelling and Learning with Big Data

Frontmatter

Forecasting Career Choice for College Students Based on Campus Big Data

Career indecision is a difficult obstacle in front of adolescents. Traditional vocational assessment research measure it by means of questionnaires and diagnose the potential sources of career indecision. Based on the diagnostic outcomes, career consolers develop the treatment plans tailor to students. However, because of personal motives and the architecture of the mind, it may be difficult for students to know themselves, so that the outcome of questionnaires can not fully reflect their inner states and statuses. Self-perception theory suggest students’ behavior could be used as clue for inference. Thus, we proposed a data-driven framework for forecast student career choice of graduation based on their behavior in and around the campus, playing an important role in supporting career counseling and career guiding. By evaluating on 10M behavior data of over four thousand students, we show the potential of this framework in these functionality.

Min Nie, Lei Yang, Bin Ding, Hu Xia, Huachun Xu, Defu Lian

Online Prediction for Forex with an Optimized Experts Selection Model

Online prediction is a process to repeatedly predict the next element from a sequence of given previous elements. It has a broad range of applications on various areas, such as medical and finance. The biggest challenge of online prediction is sequence data does not have explicit features, which means it is difficult to remain good predictions. One of popular solution is to make prediction with expert advice, and the challenge is to pick the right experts with minimum cumulative loss. In this article, we use forex prediction as a case study, and propose a model that can select a good set of forex experts by learning a set of previous observed sequences. To achieve better performance, our model not only considers the average mistakes made by experts but also takes the average profit earn by experts into account. We demonstrate the merits of our model on a real major currency pairs data set.

Jia Zhu, Jing Yang, Jing Xiao, Changqin Huang, Gansen Zhao, Yong Tang

Fast Rare Category Detection Using Nearest Centroid Neighborhood

Rare category detection is an open challenge in data mining. The existing approaches to this problem often have some flaws, such as inappropriate investigation scopes, high time complexity, and limited applicable conditions, which will degrade their performance and reduce their usability. In this paper, we present FRANC an effective and efficient solution for rare category detection. It adopts an investigation scope based on k-nearest centroid neighbors with an automatically selected k, which helps the algorithm capture the real changes on local densities and data distribution caused by the presence of rare categories. By using our proposed pruning method, the identification of k-nearest centroid neighbors, which is the most computationally expensive step in FRANC, will be much faster for each data example. Extensive experimental results on real data sets demonstrate the effectiveness and efficiency of FRANC.

Song Wang, Hao Huang, Yunjun Gao, Tieyun Qian, Liang Hong, Zhiyong Peng

Latent Semantic Diagnosis in Traditional Chinese Medicine

Traditional Chinese Medicine (TCM) is the main route of disease control for ancient Chinese. Through thousands of years’ development and inheriting, TCM is the most influential traditional medical system which lasts the longest time and used by the largest population. However, the theory of TCM lacks objective and quantitative standards. In this paper, we propose a statistical diagnosis approach to find out the pathogenesises based on the latent semantic analysis of symptoms and the corresponding herbs. We assume that the latent pathogenesis is the inherent connection between symptoms and herbs within a medical case. Previous topic models mostly focus on single content documents, but medical cases have two different contents: symptoms and herbs. We therefore develop a novel muti-content model based on LDA. We used the proposed model to analysis two TCM domains amenorrhea and lung cancer. Experiment results illustrate that the pathogenesises found by our model correspond well with the theory of TCM and it provides a theoretical data-driven approach to establish diagnosis standards.

Wendi Ji, Ying Zhang, Xiaoling Wang, Yiping Zhou

Correlation-Based Weighted K-Labelsets for Multi-label Classification

RAkEL(RAndom k-labELsets) is an effective ensemble multi-label classification method where each sub classifier is trained on a small randomly-selected subset of k labels, called k-labelset. However, random combination of labels may lead to the poor performance of sub classifiers and the method can not make full use of the label correlations. In this paper, we propose a novel ensemble multi-label classification method named LCWkEL(Label Correlations-based Weighted k-labELsets). Instead of randomly choosing subsets, we select a number of k-labelsets based on a label correlation matrix. Furthermore, considering the label correlations in different k-labelsets may have different influence on an instance, we construct a weight coefficient vector for an instance. Each dimension of the vector represents the weight coefficient for each sub classifier. For the multi-label classification of an unlabeled instance, LCWkEL calculates the weighted sum of all sub classifiers’ predictions, which can improve the classification performance effectively. Experimental results on three areas of data sets show that the method proposed in this paper can obtain competitive performance compared with the RAkEL method and other high-performing multi-label classification methods.

Jingyang Xu, Jun Ma

Classifying Relation via Bidirectional Recurrent Neural Network Based on Local Information

Relation classification is an important research task in the field of natural language processing (NLP). In this paper, we apply a bidirectional recurrent neural network upon local windows of entities for relation classification. In contrast to previous approaches, only word tokens around entities are taken into consideration in our model. Upon word tokens, a bidirectional recurrent neural network is used to extract local context features of entities. To retain the important features for classification , we propose to use a novel weighted pooling layer upon hidden layers of RNN. Experiments on the SemEval-2010 dataset show that our proposed method achieves competitive results without introducing any external resources.

Xiaoyun Hou, Zhe Zhao, Tao Liu, Xiaoyong Du

Psychological Stress Detection from Online Shopping

The increasingly faster life pace in modern society makes people always feel stressful and it is of great significance to discover a users suffering stress in time. According to psychological study, shopping is chosen as an effective way for stress relief, especially for females. Compared with non-stress cases, a user may perform different shopping patterns when under stress. An interesting issue then arises: can we detect one’s psychological stress from online shopping data? By investigating stress-related outlier features from both content and behavior of online purchase orders, we learn a users stress status by classification. A real user study of 20 experienced female online customers aged 23–30 verifies the effectiveness of shopping based stress detection, achieving an F1-measure of more than 80 % with J48 classifier. None of the features negatively affect the detection result. Feature combinations bring dramatic improvements than single feature. In total, shopping content features are proved to be more significant than behavior features.

Liang Zhao, Hao Wang, Yuanyuan Xue, Qi Li, Ling Feng

Confidence-Learning Based Collaborative Filtering with Heterogeneous Implicit Feedbacks

Implicit feedbacks, which indirectly reflect opinions through observing user behaviors, have recently received more and more attention in recommendation communities due to their accessibility and richness in real-world applications. Most of the existing implicit-feedback-based recommendation algorithms only exploit one type of implicit feedback. In real-world applications, there is usually more than one type of implicit feedback. Considering the sparsity problem of recommender systems, it is significant to leveraging more available data. In this paper, we study the heterogeneous implicit feedbacks problem, where more than one type of implicit feedback is available. We study the characteristics of different types of implicit feedbacks, and propose a unified approach to infer the confidence that we can believe a user prefers an item. Then we apply the inferred confidence to both point-wise and pair-wise matrix factorization models, and propose a more generic strategy to select training samples for pair-wise methods. Experiments on real-world e-commerce data show that our methods outperform the state-of-art approaches, considering several commonly used ranking oriented evaluation criterions.

Jing Wang, Lanfen Lin, Heng Zhang, Jiaqi Tu

Star-Scan: A Stable Clustering by Statistically Finding Centers and Noises

In this paper, we present a new clustering algorithm, called A $$\mathbf{Sta }$$ble Cluste$$\mathbf{r }$$ing by $$\mathbf{S }$$tatistically Finding $$\mathbf{C }$$enters $$\mathbf{a }$$nd $$\mathbf{N }$$oises (Star-Scan). Star-Scan is a density-based clustering algorithm that can find arbitrary shape clusters and resists to the noise in a dataset. It borrows the idea from Rodriguez’s Clustering by Fast Search and Find of Density Peaks (CFSFDP) that the cluster centers are characterized by the points with both higher density and farther distance to other centers than their neighbors. Different from CFSFDP, instead of manual operation, Star-Scan uses a statistical method, box plot, to select cluster centers automatically. Furthermore, due to inadequate selection of cluster centers in CFSFDP, we apply a merging post-process to the produced clusters to get stable and correct results. Finally, we also use box plot to filter out noises on each of final clusters to solve the problem of over-filtering in CFSFDP. We have demonstrated the good performance of Star-Scan algorithm on several synthetic datasets.

Nan Yang, Qing Liu, Yaping Li, Lin Xiao, Xiaoqing Liu

Aggregating Crowd Wisdom with Instance Grouping Methods

With the blooming of crowdsourcing platforms, utilizing crowd wisdom becomes popular. Label aggregation is one of the key topics in crowdsourcing research. The goal is to infer true labels from multiple labels provided by different users. Most researchers make their efforts in modeling user ability and instance difficulty. However, these methods may suffer from sparsity of labels in practice. In this paper, we consider label aggregation from the view of grouping instances. We assume instances are sampled from latent groups and instances in the same group share the same true label. A probabilistic graphical model named InGroup (Instance Grouping model) is constructed to infer latent group assignments as well as true labels. Further, we combine user ability and group difficulty into InGroup to achieve a better model called InGroup+ (InGroup Plus). The experiments conducted on a real-world dataset show the advantages of instance grouping methods compared with other methods.

Li’ang Yin, Zhengbo Li, Jianhua Han, Yong Yu

CoDS: Co-training with Domain Similarity for Cross-Domain Image Sentiment Classification

Classifying images according to the sentiments expressed therein has a wide range of applications, such as sentiment-based search or recommendation. Most existing methods for image sentiment classification approach this problem by training general classifiers based on certain visual features, ignoring the discrepancies across domains. In this paper, we propose a novel co-training method with domain similarity (CoDS) for cross-domain image sentiment classification in social applications. The key idea underlying our approach is to use both the images and the corresponding textual comments when training classifiers, and to use the labeled data of one domain to make sentiment classification for the images of another domain through co-training. We compute image/text similarity between the source domain and the target domain and set the weighting of the corresponding classifiers to improve performance. We perform extensive experiments on a real dataset collected from Flickr. The experimental results show that our proposed method significantly outperforms the baseline methods.

Linlin Zhang, Meng Chen, Xiaohui Yu, Yang Liu

Feature Selection via Vectorizing Feature’s Discriminative Information

Feature selection is a popular technology for reducing dimensionality. Commonly features are evaluated with univariate scores according to their classification abilities, and the high-score ones are preferred and selected. However, there are two flaws for this strategy. First, feature complementarity is ignored. A subspace constructed by the partially predominant but complementary features is suitable for recognition task, whereas this feature subset cannot be selected by this strategy. Second, feature redundancy for classification cannot be measured accurately. This redundancy weakens the subset’s discriminative performance, but it cannot be reduced by this strategy. In this paper, a new feature selection method is proposed. It assesses feature’s discriminative information for each class and vectorizes this information. Then, features are represented by their corresponding discriminative information vectors, and the most distinct ones are selected. Both feature complementarity and classification redundancy can be easily measured by comparing the differences between these new vectors. Experimental results on both low-dimensional and high-dimensional data testify the new method’s effectiveness.

Jun Wang, Hengpeng Xu, Jinmao Wei

A Label Inference Method Based on Maximal Entropy Random Walk over Graphs

With the rapid development of Internet, graphs have been widely used to model the complex relationships among various entities in real world. However, the labels on the graphs are always incomplete. The accurate label inference is required for many real applications such as personalized service and product recommendation. In this paper, we propose a novel label inference method based on maximal entropy random walk. The main idea is that a small number of vertices in graphs propagate their labels to other unlabeled vertices in a way of random walk with the maximal entropy guidance. We give the algorithm and analyze the time and space complexities. We confirm the effectiveness of our algorithm through conducting experiments on real datasets.

Jing Pan, Yajun Yang, Qinghua Hu, Hong Shi

An Adaptive kNN Using Listwise Approach for Implicit Feedback

Collaborative Filtering is a very popular method in recommendation systems. In item recommendation tasks, a list of items is recommended to users by ranking, but traditional CF methods do not treat it as a ranking problem for implicit feedback datasets. In this paper, we propose MAP-kNN, an adaptive kNN approach using listwise approach for implicit feedback datasets. The similarity matrix is learned by maximizing the Mean Average Precision, which is a well-known measurement in information retrieval for representing the performance of a list of ranked items. An optimization strategy and a new sampling method are proposed to improve the learning efficiency of MAP-kNN. The complexity of our algorithm over each iteration after optimization is lower than other methods that also use listwise approach. Experimental results on two datasets indicate that our approach outperforms other state of the art recommendation approaches.

Bu-Xiao Wu, Jing Xiao, Jia Zhu, Chen Ding

Quantifying the Effect of Sentiment on Topic Evolution in Chinese Microblog

The role of sentiment on topic evolution in social media is an interesting problem and has not been fully investigated. Quantifying the effect of sentiment on topic evolution can help people understand the relationship between sentiment and information diffusion. In this paper, we propose a method to identify the stages of topic evolution and introduce a new metric called popularity strength to measure their popularity. We also classify topics into four categories and quantify the effect of sentiment on different classes. Our findings show that “Good news illumines widely, and bad news flies quickly”, and sentiment has complex dynamics towards topic evolution.

Peng Fu, Zheng Lin, Hailun Lin, Fengcheng Yuan, Weiping Wang, Dan Meng

Measuring Directional Semantic Similarity with Multi-features

Semantic similarity measures between linguistic terms are essential in many Natural Language Processing (NLP) applications. Term similarity is most conventionally perceived as a symmetric relation. However, semantic directional (asymmetric) relations exist in lexical semantics and make symmetric similarity measures less suitable for their identification. Furthermore, directional similarity actually represents even more general conditions and is more practical in some specific NLP applications than symmetric similarity. As the footstone of similarity measures, current semantic features cannot efficiently represent large scale web text collections. Hence, we propose a new directional similarity method, considering feature representations both in linguistic and extra linguistic dimensions. We evaluate our approach on standard word similarity, reporting state-of-the-art performance on multiple datasets. Experiments show that our directional method handles both symmetric and directional semantic relations and leads to clear improvements in entity search and query expansion.

Bo Liu, Xuanhua Shi, Hai Jin

Finding Latest Influential Research Papers Through Modeling Two Views of Citation Links

Finding hidden topics and latest topic influential papers in a corpus can help researchers get a quick overview and recent development of a scientific research field. Existing work focused on finding milestone papers which are usually published many years ago. Finding latest influential papers is a more challenging problem due to lack of enough citation information of newly published papers. In this paper, we study this problem and propose a novel way of modeling citation links with a probabilistic generative model. The key idea is to consider two views of citation, both citing and being cited of each paper. Through this idea, we can not only model topic dependence between cited and citing papers but also incorporate latest papers into our model. Based on these ideas, we jointly model the two views with an extension of topic model, Bi-Citation-LDA model, which can not only find previous important papers but also find newly published influential papers in each topic. Experiments on real dataset and comparison with existing methods indicate that our model can effectively find latest topic influential papers.

Lu Huang, Hongyan Liu, Jun He, Xiaoyong Du

Multi-label Chinese Microblog Emotion Classification via Convolutional Neural Network

Recently, analyzing people’s sentiments in microblogs has attracted more and more attentions from both academic and industrial communities. The traditional methods usually treat the sentiment analysis as a kind of single-label supervised learning problem that classifies the microblog according to sentiment orientation or single-labeled emotion. However, in fact multiple fine-grained emotions may be coexisting in just one tweet or even one sentence of the microblog. In this paper, we regard the emotion detection in microblogs as a multi-label classification problem. We leverage the skip-gram language model to learn distributed word representations as input features, and utilize a Convolutional Neural Network (CNN) based method to solve multi-label emotion classification problem in the Chinese microblog sentences without any manually designed features. Extensive experiments are conducted on two public short text datasets. The experimental results demonstrate that the proposed method outperforms strong baselines by a large margin and achieves excellent performance in terms of multi-label classification metrics.

Yaqi Wang, Shi Feng, Daling Wang, Ge Yu, Yifei Zhang

Mining Recent High Expected Weighted Itemsets from Uncertain Databases

Weighted Frequent Itemset Mining (WFIM) has been proposed as an alternative to frequent itemset mining that considers not only the frequency of items but also their relative importance. However, some limitations of WFIM make it unrealistic in many real-world applications. In this paper, we present a new type of knowledge called Recent High Expected Weighted Itemset (RHEWI) to consider the recency, weight and uncertainty of desired patterns, thus more up-to-date and relevant results can be provided to the users. A projection-based algorithm named RHEWI-P is presented to mine RHEWIs based on a novel upper-bound downward closure (UBDC) property. An improved algorithm named RHEWI-PS is further proposed to introduce a sorted upper-bound downward closure (SUBDC) property for pruning unpromising candidates. An experimental evaluation against the state-of-the-art HEWI-Uapriori algorithm is carried on both real-world and synthetic datasets, and the results show that the proposed algorithms are highly efficient and acceptable to mine the required information.

Wensheng Gan, Jerry Chun-Wei Lin, Philippe Fournier-Viger, Han-Chieh Chao

Context-Aware Chinese Microblog Sentiment Classification with Bidirectional LSTM

Recently, with the fast development of the microblog, analyzing the sentiment orientations of the tweets has become a hot research topic for both academic and industrial communities. Most of the existing methods treat each microblog as an independent training instance. However, the sentiments embedded in tweets are usually ambiguous and context-aware. Even a non-sentiment word might convey a clear emotional tendency in the microblog conversations. In this paper, we regard the microblog conversation as sequence, and leverage bidirectional Long Short-Term Memory (BLSTM) models to incorporate preceding tweets for context-aware sentiment classification. Our proposed method could not only alleviate the sparsity problem in the feature space, but also capture the long distance sentiment dependency in the microblog conversations. Extensive experiments on a benchmark dataset show that the bidirectional LSTM models with context information could outperform other strong baseline algorithms.

Yang Wang, Shi Feng, Daling Wang, Yifei Zhang, Ge Yu

Backmatter

Weitere Informationen

Premium Partner

    Bildnachweise