
2015 | Book

Database Systems for Advanced Applications

20th International Conference, DASFAA 2015, Hanoi, Vietnam, April 20-23, 2015, Proceedings, Part II


About this book

This two volume set LNCS 9049 and LNCS 9050 constitutes the refereed proceedings of the 20th International Conference on Database Systems for Advanced Applications, DASFAA 2015, held in Hanoi, Vietnam, in April 2015.

The 63 full papers presented were carefully reviewed and selected from a total of 287 submissions. The papers cover the following topics: data mining; data streams and time series; database storage and index; spatio-temporal data; modern computing platform; social networks; information integration and data quality; information retrieval and summarization; security and privacy; outlier and imbalanced data analysis; probabilistic and uncertain data; query processing.

Table of Contents

Frontmatter

Outlier and Imbalanced Data Analysis

Frontmatter
A Synthetic Minority Oversampling Method Based on Local Densities in Low-Dimensional Space for Imbalanced Learning

Imbalanced class distribution is a challenging problem in many real-life classification problems. Existing synthetic oversampling methods suffer from the curse of dimensionality because they rely heavily on Euclidean distance. This paper proposes a new method, called Minority Oversampling Technique based on Local Densities in Low-Dimensional Space (or MOT2LD in short). MOT2LD first maps each training sample into a low-dimensional space and clusters the low-dimensional representations. It then assigns a weight to each minority sample, computed as the product of two quantities, local minority density and local majority count, indicating its importance for sampling. Synthetic minority-class samples are generated inside the minority clusters. MOT2LD has been evaluated on 15 real-world data sets. The experimental results show that our method outperforms existing methods, including SMOTE, Borderline-SMOTE, ADASYN, and MWMOTE, in terms of G-mean and F-measure.

Zhipeng Xie, Liyang Jiang, Tengju Ye, Xiaoli Li
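
The weighting step admits a compact illustration. Below is a minimal sketch, assuming the samples have already been mapped to a low-dimensional space (the paper uses its own mapping and clustering); the function name, the use of scikit-learn, and the k-nearest-neighbor approximation of the two local quantities are illustrative assumptions, not the authors' implementation.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def sampling_weights(X_low, y, minority_label=1, k=5):
        """Weight each minority sample by (local minority density) x
        (local majority count), in the spirit of MOT2LD (sketch only)."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_low)
        dist, idx = nn.kneighbors(X_low[y == minority_label])
        dist, idx = dist[:, 1:], idx[:, 1:]          # drop the self-neighbor
        minority_mask = y[idx] == minority_label
        # local minority density: minority neighbors per unit distance
        density = minority_mask.sum(axis=1) / (dist.mean(axis=1) + 1e-12)
        # local majority count: majority samples among the k neighbors
        majority_count = (~minority_mask).sum(axis=1)
        w = density * majority_count
        return w / (w.sum() + 1e-12)   # normalize to a sampling distribution

A SMOTE-style generator would then repeatedly draw a minority sample from this distribution and interpolate towards a neighbor in the same minority cluster.
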
Fast and Scalable Outlier Detection with Approximate Nearest Neighbor Ensembles

Popular outlier detection methods require the pairwise comparison of objects to compute the nearest neighbors. This inherently quadratic problem is not scalable to large data sets, making multidimensional outlier detection for big data still an open challenge. Existing approximate neighbor search methods are designed to preserve distances as well as possible. In this article, we present a highly scalable approach to compute the nearest neighbors of objects that instead focuses on preserving neighborhoods well using an ensemble of space-filling curves. We show that the method has near-linear complexity, can be distributed to clusters for computation, and preserves neighborhoods—but not distances—better than established methods such as locality sensitive hashing and projection indexed nearest neighbors. Furthermore, we demonstrate that, by preserving neighborhoods, the quality of outlier detection based on local density estimates is not only well retained but sometimes even improved, an effect that can be explained by relating our method to outlier detection ensembles. At the same time, the outlier detection process is accelerated by two orders of magnitude.

Erich Schubert, Arthur Zimek, Hans-Peter Kriegel
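
The neighborhood-preserving idea is simple to sketch: sort points along a space-filling curve and take each point's window of curve neighbors as candidates; an ensemble of shifted curves makes it unlikely that true neighbors are separated on every curve. A toy 2-D Z-order version under assumed names and constants (the paper's method is considerably more refined and distributable):

    import numpy as np

    def z_value(x, y, bits=16):
        """Interleave the bits of two integer coordinates (Z-order / Morton)."""
        z = 0
        for i in range(bits):
            z |= ((x >> i) & 1) << (2 * i) | ((y >> i) & 1) << (2 * i + 1)
        return z

    def candidate_neighbors(points, window=10, shifts=(0.0, 0.31, 0.67)):
        """Neighbor candidates from an ensemble of shifted Z-order curves.
        points: (n, 2) float array."""
        n = len(points)
        cands = [set() for _ in range(n)]
        pts = (points - points.min(0)) / (np.ptp(points, axis=0) + 1e-12)
        scale = (1 << 16) - 1
        for s in shifts:                       # each shift = one curve
            grid = (((pts + s) % 1.0) * scale).astype(int)
            order = np.argsort([z_value(gx, gy) for gx, gy in grid])
            for rank, i in enumerate(order):   # window neighbors on the curve
                lo, hi = max(0, rank - window), min(n, rank + window + 1)
                cands[int(i)].update(int(j) for j in order[lo:hi] if j != i)
        return cands

The candidate sets would then feed a local density estimate such as LOF, in place of exact k-nearest neighbors.
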
Rare Category Exploration on Linear Time Complexity

Rare Category Exploration (RCE for short) discovers the remaining data examples of a rare category from a seed. Approaches to this problem often have a high time complexity and are applicable only to rare categories with compact and spherical shapes rather than arbitrary shapes. In this paper, we present FREE, an effective and efficient RCE solution that explores rare categories of arbitrary shapes in time linear in the data set size. FREE first decomposes a data set into equal-sized cells, on which it performs wavelet transform and data density analysis to find the coarse shape of a rare category, and then refines the coarse shape via an M$$k$$NN-based metric. Experimental results on both synthetic and real data sets verify the effectiveness and efficiency of our approach.

Zhenguang Liu, Hao Huang, Qinming He, Kevin Chiew, Yunjun Gao

Probabilistic and Uncertain Data

Frontmatter
FP-CPNNQ: A Filter-Based Protocol for Continuous Probabilistic Nearest Neighbor Query

An increasing number of applications in environmental monitoring and location-based services make use of large-scale distributed sensing provided by wireless sensor networks. In such applications, a large number of sensor devices are deployed to collect useful information such as temperature readings and vehicle positions. However, these distributed sensors usually have limited computational and communication power, and thus the number of sensor queries should be reduced to conserve system resources. At the same time, data captured by such sensors is inherently imprecise due to sensor limitations. We propose an efficient probabilistic filter-based protocol for answering continuous nearest neighbor queries over uncertain sensor data. Experimental evaluation on real-world temperature sensing data and synthetic location data shows a significant reduction in the number of update messages.

Yinuo Zhang, Anand Panangadan, Viktor K. Prasanna
Efficient Queries Evaluation on Block Independent Disjoint Probabilistic Databases

Probabilistic data management has recently drawn much attention in the database research community. This paper investigates safe plans of queries on block independent disjoint (BID) probabilistic databases. This problem is fundamental to evaluating queries in PTIME. We first introduce two new probabilistic table models, the correlated table and the correlated block table, and a hybrid project operator that executes a disjoint project and then performs an independent project in an atomic operation on BID tables. We then propose an algorithm to find safe plans for queries on BID probabilistic databases. Finally, we present experimental results showing that the proposed algorithm can find safe plans for more queries than the state of the art, and that the generated safe plans are efficient and scale well.

Biao Qin
Parallel Top-k Query Processing on Uncertain Strings Using MapReduce

Top-$$k$$ query is an important and essential operator for data analysis over string collections. However, when uncertainty comes into big data, new parallel algorithms are needed for efficient query processing on large-scale uncertain strings. In this paper, we propose a MapReduce-based parallel algorithm, called MUSK, for answering top-$$k$$ queries over large-scale uncertain strings. We use probabilistic $$n$$-grams to generate key-value pairs. To improve performance, a novel lower bound on the expected edit distance is derived to prune strings, based on a newly defined function, the gram mapping distance. By integrating the bound with TA, the filtering power in the Map stage is optimized to decrease the transmission cost. Comprehensive experimental results on both real-world and synthetic datasets show that MUSK outperforms the baseline approach with speedups of up to 6 times in the best case, indicating good scalability over large datasets.

Hui Xu, Xiaofeng Ding, Hai Jin, Wenbin Jiang
Tracing Errors in Probabilistic Databases Based on the Bayesian Network

Data in probabilistic databases may not be absolutely correct and, worse, may be erroneous. Many existing data cleaning methods can be used to detect errors in traditional databases, but they fall short of guiding us to find errors in probabilistic databases, especially in databases with complex correlations among data. In this paper, we propose a method for tracing errors in probabilistic databases by adopting the Bayesian network (BN) as the framework for representing the correlations among data. We first develop techniques to construct an augmented Bayesian network (ABN) for an anomalous query to represent correlations among input data, intermediate data and output data in the query execution. Inspired by the notion of blame in causal models, we then define a notion of blame for ranking candidate errors. Next, we provide an efficient method for computing the degree of blame for each candidate error based on probabilistic inference over the ABN. Experimental results show the effectiveness and efficiency of our method.

Liang Duan, Kun Yue, Cheqing Jin, Wenlin Xu, Weiyi Liu

Data Mining II

Frontmatter
Mining Frequent Spatial-Textual Sequence Patterns

The penetration of GPS-enabled devices has resulted in the generation of large volumes of Spatial-Textual data, which can be mined or analyzed to improve various location-based services. One such kind of data is Spatial-Textual sequential data (Activity-Trajectory data), i.e., a sequence of locations visited by a user, where each location carries the set of activities performed by the user there. Mining such data for knowledge discovery is a cumbersome task due to the complexity of the data type and its representation. In this paper, we propose a mining framework, along with algorithms, for mining Spatial-Textual sequence data to find frequent Spatial-Textual sequence patterns. We study the use of existing sequence mining algorithms in the context of Spatial-Textual sequence data and propose efficient algorithms that outperform existing algorithms in terms of computation time, as observed through extensive experimentation. We also design an external-memory algorithm to mine data too large to fit in main memory. The external-memory algorithm uses the spatial dimension to partition the data into a set of chunks to minimize the number of false positives, and it is shown to outperform a naïve external-memory algorithm that uses random partitioning.

Krishan K. Arya, Vikram Goyal, Shamkant B. Navathe, Sushil Prasad
Effective and Interpretable Document Classification Using Distinctly Labeled Dirichlet Process Mixture Models of von Mises-Fisher Distributions

Document classification is essential to information retrieval and text mining. Accuracy and interpretability are two important aspects of text classifiers. This paper proposes an interpretable classification method (DLDPvMFs) that uses the Dirichlet process mixture (DPM) model to discover the hidden topics distinctly within each label, for classification of directional data based on the von Mises-Fisher (vMF) distribution, which arises naturally for data distributed on the unit hypersphere. We use a mean-field variational inference algorithm to develop DLDPvMFs. By using the label information of the training data explicitly and automatically determining the number of topics for each label in the topical space, the obtained class topics are coherent, relevant and discriminative, which helps us interpret a class's label as well as distinguish between classes. Our experimental results show the advantages of our approach with respect to significant criteria such as separability, interpretability and effectiveness in classifying large datasets with high dimension and complex distribution. Our results are highly competitive with state-of-the-art approaches.

Ngo Van Linh, Nguyen Kim Anh, Khoat Than, Nguyen Nguyen Tat
MPTM: A Topic Model for Multi-Part Documents

Topic models have been successfully applied to uncover hidden probabilistic structures in collections of documents, where documents are treated as unstructured texts. However, it is not uncommon that some documents, which we call multi-part documents, are composed of multiple named parts. To exploit the information buried in the document-part relationships in the process of topic modeling, this paper adopts two assumptions: first, all parts in a given document should have similar topic distributions; second, the multiple versions (corresponding to multiple named parts) of a given topic should have similar word distributions. Based on these two underlying assumptions, we propose a novel topic model for multi-part documents, called the Multi-Part Topic Model (or MPTM in short), and develop its construction and inference method with the aid of collapsed Gibbs sampling and maximum likelihood estimation. Experimental results on real datasets demonstrate that our approach not only achieves significant improvement in the quality of discovered topics, but also boosts performance in information retrieval and document classification.

Zhipeng Xie, Liyang Jiang, Tengju Ye, Zhenying He
Retaining Rough Diamonds: Towards a Fairer Elimination of Low-Skilled Workers

Living the economic dream of globalization in the form of a location- and time-independent world-wide employment market, crowd sourcing companies today offer affordable digital solutions to business problems. At the same time, highly accessible economic opportunities are offered to workers, who often live in low- or middle-income countries. Thus, crowd sourcing can be understood as a flexible social solution that indiscriminately reaches out to poor, yet diligent workers: a win-win situation for employers and crowd workers. On the other hand, its virtual nature opens doors to unethical exploitation by fraudulent workers, compromising in turn the overall quality of the gained results and increasing the costs of continuous result quality assurance, e.g., by gold questions or majority votes. The central question discussed in this paper is how to distinguish between basically honest workers, who might just be lacking educational skills, and plainly unethical workers. We show how current quality control measures misjudge and subsequently discriminate against honest workers with lower skill levels. In contrast, our techniques use statistical models that compute the level of a worker's skill and a task's difficulty to clearly delineate each worker's success zone and detect irrational response patterns, which usually imply fraud. Our evaluation shows that about 50% of misjudged workers can be successfully detected as honest, retained, and subsequently redirected to easier tasks.

Kinda El Maarry, Wolf-Tilo Balke
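
The success-zone idea follows the logic of item response theory: the chance of a correct answer grows with the gap between worker skill and task difficulty, and answer patterns that are improbable under any skill level (failing easy tasks while passing hard ones) indicate fraud rather than low skill. A minimal Rasch-style sketch; the names, the omitted fitting step, and the scoring rule are illustrative assumptions, not the paper's exact model:

    import numpy as np

    def p_correct(skill, difficulty):
        """Rasch-style probability that a worker solves a task."""
        return 1.0 / (1.0 + np.exp(-(skill - difficulty)))

    def irrationality_score(answers, skill, difficulties):
        """Average surprise (negative log-likelihood) of observed answers;
        answers[i] is 1 if task i was answered correctly, else 0."""
        p = p_correct(skill, difficulties)
        ll = answers * np.log(p + 1e-12) + (1 - answers) * np.log(1 - p + 1e-12)
        return -ll.mean()

    diffs = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # easy ... hard
    honest = np.array([1, 1, 1, 0, 0])    # fails only the hard tasks
    erratic = np.array([0, 0, 1, 1, 1])   # fails only the easy ones
    print(irrationality_score(honest, 0.0, diffs) <
          irrationality_score(erratic, 0.0, diffs))  # True: erratic looks fraudulent
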

Spatio-temporal Data II

Frontmatter
Skyline Trips of Multiple POIs Categories

In this paper, we introduce a new and interesting path finding problem, the Skyline Trips of Multiple POIs Categories (STMPC) query. In particular, given a road network with a set of Points of Interest (POIs) from different categories, a list of items the user is planning to purchase, and a pricing function for items at each related POI, find the skyline trips in terms of both trip length and trip aggregated cost. This query has important applications in everyday life. Specifically, it assists people in choosing the most suitable trip among the skyline trips based on two dimensions: trip total length and trip aggregated cost. We prove the problem is NP-hard and distinguish it from existing related problems. We also propose a framework and two effective algorithms to efficiently solve the STMPC query in real time; they produce near-optimal results when tested on real datasets.

Saad Aljubayrin, Zhen He, Rui Zhang
Keyword-Aware Dominant Route Search for Various User Preferences

Route search has been studied extensively. However, existing solutions for route search often lack the flexibility for users to specify their requirements. Recently, a new kind of keyword-aware optimal route (KOR) query was proposed for finding an optimal route such that it covers a set of user-specified keywords, satisfies a budget constraint (e.g., time), and maximizes the total popularity of the route. For example, a user may search for the route passing through a cinema and a bookstore within a travel time of 3 hours that has the highest total rating score. KOR returns only one result regardless of user preferences; however, some users may care more about the bookstore, while others think the cinema more important. Clearly, it is not user-friendly to require users to specify their preferences explicitly. To meet various user preferences, this paper presents a new route search query called the Keyword-aware Dominant Routes (KDR) query, which returns all potentially preferable routes covering a set of keywords under a specified budget constraint. We devise a novel algorithm, DR, for efficiently finding the exact result of a KDR query, and a heuristic algorithm, FDR, to approximately answer a KDR query. Experimental results show that our algorithms are efficient.

Yujiao Li, Weidong Yang, Wu Dan, Zhipeng Xie
Spatial Keyword Range Search on Trajectories

With advances in geo-positioning technologies and the ensuing location-based services, a rapidly growing amount of trajectories associated with textual information is being collected in many emerging applications. For instance, many people nowadays share interesting experiences through Foursquare or Twitter along their travel routes. In this paper, we investigate the problem of spatial keyword range search on trajectories, which is essential to make sense of large amounts of trajectory data. To the best of our knowledge, this is the first work to systematically investigate range search over trajectories where three important aspects, i.e., spatial, temporal and textual, are all taken into consideration. Given a query region, a timespan and a set of keywords, we aim to retrieve trajectories that go through this region during the query timespan and contain all the query keywords. To facilitate the range search, a novel index structure called the IOC-Tree is proposed, based on inverted indexing and octree techniques, to effectively support spatial, temporal and textual pruning. Furthermore, this structure can also support queries with order-sensitive keywords. Comprehensive experiments on several real-life datasets are conducted to demonstrate the efficiency.

Yuxing Han, Liping Wang, Ying Zhang, Wenjie Zhang, Xuemin Lin
TOF: A Throughput Oriented Framework for Spatial Queries Processing in Multi-core Environment

In this paper, we develop a Throughput Oriented Framework (TOF) for efficient processing of spatiotemporal queries in multi-core environments. Traditional approaches to spatial query processing focused on reducing query latency. In the real world, however, most LBS applications emphasize throughput rather than query latency. TOF is designed to achieve maximum throughput. Instead of resorting to complex indexes, TOF executes a batch of queries at each run, so it can maximize data locality and parallelism on multi-core platforms. Using TOF, we design algorithms for processing range queries and kNN queries respectively. Experimental study shows that these algorithms outperform existing approaches significantly in terms of throughput.

Zhong-Bin Xue, Xuan Zhou, Shan Wang

Query Processing

Frontmatter
Identifying and Caching Hot Triples for Efficient RDF Query Processing

Resource Description Framework (RDF) has been used as a general model for conceptual description and information modelling. As the number and volume of RDF datasets have grown recently, many techniques have been developed for accelerating query answering on triple stores, which handle large-scale RDF data. Caching is one of the popular solutions. Non-RDBMS-based triple stores, which leverage the intrinsic nature of RDF graphs, are emerging and attracting more research attention in recent years. However, as their fundamental structure differs from RDBMS triple stores, they cannot leverage the RDBMS caching mechanism. In this paper, we develop a time-aware frequency based caching algorithm to address this issue. Our approach retrieves the accessed triples by analyzing and expanding previous queries, and collects the most frequently accessed triples by evaluating their access frequencies using Exponential Smoothing, a forecasting method. We evaluate our approach using real-world queries from a publicly available SPARQL endpoint. Our theoretical analysis and empirical results show that the proposed approach outperforms state-of-the-art approaches with higher hit rates.

Wei Emma Zhang, Quan Z. Sheng, Kerry Taylor, Yongrui Qin
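
The time-aware frequency score can be sketched in a few lines: every period, each triple's raw access count is folded into an exponentially smoothed score, so recent activity outweighs stale popularity, and the cache keeps the top-scoring triples. A minimal sketch with an arbitrarily chosen smoothing factor; it is not the authors' implementation:

    import heapq

    class SmoothedFrequencyCache:
        """Retain the triples with the highest exponentially smoothed
        access frequency (illustrative sketch)."""
        def __init__(self, capacity, alpha=0.3):
            self.capacity, self.alpha = capacity, alpha
            self.score = {}                    # triple -> smoothed frequency

        def end_of_period(self, access_counts):
            """access_counts: dict mapping triple -> raw count this period."""
            for t in set(self.score) | set(access_counts):
                s_prev = self.score.get(t, 0.0)
                x = access_counts.get(t, 0)
                # exponential smoothing: s_t = alpha * x_t + (1 - alpha) * s_{t-1}
                self.score[t] = self.alpha * x + (1 - self.alpha) * s_prev

        def contents(self):
            return heapq.nlargest(self.capacity, self.score, key=self.score.get)
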
History-Pattern Implementation for Large-Scale Dynamic Multidimensional Datasets and Its Evaluations

In this paper, we present a novel encoding/decoding method for dynamic multidimensional datasets and its implementation scheme. Our method encodes an n-dimensional tuple into a pair of scalar values even if n is sufficiently large. The method also encodes and decodes tuples using only shift and and/or register instructions. One of the most serious problems in multidimensional-array-based tuple encoding is that the size of an encoded result may often exceed the machine word size for large-scale tuple sets. This problem is efficiently resolved in our scheme. We confirmed the advantages of our scheme by analytical and experimental evaluations. The experimental evaluations compared our prototype system with two other systems: (1) a system based on a similar encoding scheme called history-offset encoding, and (2) the PostgreSQL RDBMS. In most cases, both the storage and retrieval costs of our system significantly outperformed those of the other systems.

Masafumi Makino, Tatsuo Tsuji, Ken Higuchi
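
The flavor of tuple-to-scalar encoding with only shift and AND/OR operations can be shown with plain bit interleaving; note that this naive version exhibits exactly the word-size overflow that the paper's history-pattern scheme is designed to avoid (the scheme itself encodes into a pair of scalars and is not reproduced here):

    def interleave(tuple_vals, bits):
        """Naive n-dimensional tuple -> scalar using only shifts and and/or."""
        code = 0
        for b in range(bits):
            for d, v in enumerate(tuple_vals):
                code |= ((v >> b) & 1) << (b * len(tuple_vals) + d)
        return code

    def deinterleave(code, n_dims, bits):
        vals = [0] * n_dims
        for b in range(bits):
            for d in range(n_dims):
                vals[d] |= ((code >> (b * n_dims + d)) & 1) << b
        return vals

    assert deinterleave(interleave([5, 9, 3], 4), 3, 4) == [5, 9, 3]
    # With many dimensions or wide domains, `code` soon exceeds 64 bits --
    # the overflow problem that history-pattern encoding resolves.
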
Scalagon: An Efficient Skyline Algorithm for All Seasons

Skyline queries are well-known in the database community and there are many algorithms for the computation of the Pareto frontier. The most prominent algorithms are based on a block-nested-loop style tuple-to-tuple comparison (BNL). Another approach exploits the lattice structure induced by a Skyline query over low-cardinality domains. In this paper, we present Scalagon, an algorithm which combines the ideas of the lattice approach and a BNL-style algorithm to evaluate Skylines on arbitrary domains. Since multicore processors are going mainstream, we also present a parallel version of Scalagon. We demonstrate through extensive experimentation on synthetic and real datasets that our algorithm can result in a significant performance advantage over existing techniques.

Markus Endres, Patrick Roocks, Werner Kießling
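
The BNL-style comparison that Scalagon hybridizes with the lattice approach is the classic block-nested-loop skyline; a minimal sketch (minimization in every dimension assumed) makes the baseline concrete:

    def dominates(a, b):
        """a dominates b: a is no worse everywhere and better somewhere."""
        return all(x <= y for x, y in zip(a, b)) and \
               any(x < y for x, y in zip(a, b))

    def bnl_skyline(tuples):
        """Block-nested-loop skyline: maintain a window of incomparable tuples."""
        window = []
        for t in tuples:
            if any(dominates(w, t) for w in window):
                continue                                  # t is dominated
            window = [w for w in window if not dominates(t, w)]
            window.append(t)
        return window

    print(bnl_skyline([(1, 4), (2, 2), (3, 1), (3, 3)]))
    # [(1, 4), (2, 2), (3, 1)] -- (3, 3) is dominated by (2, 2)

The lattice approach instead enumerates the small lattice of value combinations on low-cardinality domains, avoiding most tuple-to-tuple comparisons.
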
Towards Order-Preserving SubMatrix Search and Indexing

Order-Preserving SubMatrix (OPSM) has been proved to be important in modelling biologically meaningful subspace clusters, capturing the general tendency of gene expressions across a subset of conditions. Given an OPSM query based on row or column keywords, it is desirable to retrieve OPSMs quickly from a large gene expression dataset or OPSM data via indices. However, mining OPSMs from a gene expression dataset takes a long time and the volume of OPSM data is huge. In this paper, we investigate the issues of indexing the two datasets above. We first present a naive solution, pfTree, by applying the prefix-Tree. Since searching the tree is not efficient, we then give an optimized indexing method, pIndex. Different from pfTree, pIndex employs row and column header tables to traverse related branches in a bottom-up manner. Further, two pruning rules, based on the number and order of keywords, are introduced. To reduce the number of column keyword candidates in fuzzy queries, we introduce a First Item of keywords roTation method (FIT), which reduces the number of candidates from $$n!$$ to $$n$$. We conduct extensive experiments with real datasets on a single machine, Hadoop and Hama, and the experimental results show the efficiency and scalability of the proposed techniques.

Tao Jiang, Zhanhuai Li, Qun Chen, Kaiwen Li, Zhong Wang, Wei Pan

Database Storage and Index II

Frontmatter
Large-Scale Multi-party Counting Set Intersection Using a Space Efficient Global Synopsis

Privacy-preserving set intersection (PPSI) of very large data sets is increasingly being required in many real application areas including health-care, national security, and law enforcement. Various techniques have been developed to address this problem, where the majority of them rely on computationally expensive cryptographic techniques. Moreover, conventional data structures cannot be used efficiently for providing count estimates of the elements of the intersection of very large data sets. We consider the problem of efficient PPSI by integrating sets from multiple (three or more) sources in order to create a global synopsis which is the result of the intersection of efficient data structures, known as Count-Min sketches. This global synopsis furthermore provides count estimates of the intersected elements. We propose two protocols for the creation of this global synopsis which are based on homomorphic computations, a secure distributed summation scheme, and a symmetric noise addition technique. Experiments conducted on large synthetic and real data sets show the efficiency and accuracy of our protocols, while at the same time privacy under the Honest-but-Curious model is preserved.

Dimitrios Karapiperis, Dinusha Vatsalan, Vassilios S. Verykios, Peter Christen
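
Count-Min sketches suit this setting because same-shaped sketches built with shared hash functions can be combined cell by cell: the cell-wise minimum across all parties yields a global synopsis whose point queries over-estimate the counts of intersected elements. A toy, non-private local sketch (the paper's protocols add homomorphic computation and noise on top); all names are illustrative:

    import hashlib
    import numpy as np

    class CountMin:
        def __init__(self, width=1024, depth=4):
            self.width, self.depth = width, depth
            self.table = np.zeros((depth, width), dtype=np.int64)

        def _cells(self, item):
            for row in range(self.depth):   # deterministic, shareable hashes
                h = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8)
                yield row, int.from_bytes(h.digest(), "big") % self.width

        def add(self, item, count=1):
            for row, col in self._cells(item):
                self.table[row, col] += count

        def estimate(self, item):
            return min(int(self.table[r, c]) for r, c in self._cells(item))

    def global_synopsis(sketches):
        """Cell-wise minimum of the parties' sketches: point queries on the
        result over-estimate the counts of elements in the intersection."""
        out = CountMin(sketches[0].width, sketches[0].depth)
        out.table = np.minimum.reduce([s.table for s in sketches])
        return out
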
Improved Weighted Bloom Filter and Space Lower Bound Analysis of Algorithms for Approximated Membership Querying

The elements of a large universe $$U$$ have different membership likelihoods and query frequencies in many applications. Thus, the number of hash functions assigned to each element of $$U$$ in the traditional Bloom filter can be further optimized to minimize the average false positive rate. We propose an improved weighted Bloom filter (IWBF) that assigns an optimal number of hash functions to each element and has a lower average false positive rate than the weighted Bloom filter. We show a tight space lower bound for any approximate membership querying algorithm that represents a small subset $$S$$ of $$U$$ and answers membership queries with predefined false positive rates, when the query frequencies and membership likelihoods of the elements in $$U$$ are known. We also provide an approximate space lower bound for approximate membership querying algorithms that have an average false positive rate, and show that the number of bits used in IWBF is within a factor of $$1.44$$ of this lower bound.

Xiujun Wang, Yusheng Ji, Zhe Dang, Xiao Zheng, Baohua Zhao
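
The departure from a standard Bloom filter is that the number of hash probes is a per-element quantity rather than a global constant, chosen to minimize the average false positive rate over the known query frequencies and membership likelihoods. A minimal sketch in which the per-element probe count k is supplied by the caller (computing the optimal assignment is the subject of the paper; names are illustrative):

    import hashlib

    class WeightedBloom:
        def __init__(self, m_bits):
            self.m = m_bits
            self.bits = bytearray((m_bits + 7) // 8)

        def _positions(self, item, k):
            # k bit positions from a deterministic hash (illustrative choice)
            for i in range(k):
                h = hashlib.sha256(f"{i}:{item}".encode()).digest()
                yield int.from_bytes(h[:8], "big") % self.m

        def add(self, item, k):
            for p in self._positions(item, k):
                self.bits[p // 8] |= 1 << (p % 8)

        def query(self, item, k):
            # k must match the per-element count used at insertion time
            return all(self.bits[p // 8] & (1 << (p % 8))
                       for p in self._positions(item, k))

With a constant k this degenerates to the ordinary Bloom filter; the paper's contribution is the optimal, element-dependent choice of k and the matching space lower bounds.
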
Tree Contraction for Compressed Suffix Arrays on Modern Processors

We propose a novel processor-aware compaction technique for pattern matching, which is widely used in databases, information retrieval, and text mining. As the amount of data increases, it is becoming important to store data efficiently in memory. A compressed suffix array (CSA) is a compact data structure for in-memory pattern matching. However, CSA suffers from tremendous processor penalties, such as a flood of instructions and cache/TLB misses, due to the lack of processor-aware design. To mitigate these penalties, we propose a novel compaction technique for CSA, called suffix trie contraction (STC). The frequently accessed suffixes of CSA are transformed into a trie (e.g., a suffix trie), and inter-connected nodes in the trie are repeatedly 'contracted' into a single node, which enables lightweight sequential scans in a processor-friendly way. In detail, STC consists of two contraction techniques: fixed-length path contraction (FPC) and sub-tree contraction (SC). FPC is applied to the parts of the trie with few branches, and SC to the parts with many branches. Our experimental results indicate that FPC outperforms naive CSA by two orders of magnitude for short pattern queries and by three times for long pattern queries. As the number of branches inside the trie increases, SC gradually becomes superior to CSA and FPC for short pattern queries. Finally, the latency and throughput of STC are 7 times and 72 times better, respectively, than those of CSA on the TREC test data set, at the expense of an additional 7.1% space overhead.

Takeshi Yamamuro, Makoto Onizuka, Toshimori Honjo
Scalable Top-$$k$$ Spatial Image Search on Road Networks

A top-$$k$$ spatial image search on road networks returns $$k$$ images based on both their spatial proximity and the relevance of their contents. Existing solutions for the top-$$k$$ text query are not suitable for this problem since they are not sufficiently scalable to cope with hundreds of query keywords and cannot support very large road networks. In this paper, we model the problem as a top-$$k$$ aggregation problem. We first propose a new separate-index approach based on the visual vocabulary tree image index and the G-tree road network index, and then propose a query processing method called the external combined algorithm (CA) method. Our experimental results demonstrate that our approach outperforms the state-of-the-art hybrid method by more than one order of magnitude.

Pengpeng Zhao, Xiaopeng Kuang, Victor S. Sheng, Jiajie Xu, Jian Wu, Zhiming Cui

Social Networks II

Frontmatter
An Efficient Method to Find the Optimal Social Trust Path in Contextual Social Graphs

Online Social Networks (OSN) have been used as platforms for many emerging applications, where trust is a critical factor for participants' decision making. In order to evaluate the trustworthiness between two unknown participants, we need to perform trust inference along the social trust paths formed by the interactions among the intermediate participants. However, there are usually a large number of social trust paths between two participants. Thus, a challenging problem is how to effectively and efficiently find the optimal social trust path that can yield the most trustworthy evaluation result based on the requirements of participants. In this paper, the core problem of finding the optimal social trust path with multiple constraints of social contexts is modelled as the classical NP-Complete Multi-Constrained Optimal Path (MCOP) selection problem. To make this problem practically solvable, we propose an efficient and effective approximation algorithm, called T-MONTE-K, by combining the Monte Carlo method and our optimised search strategies. Lastly, we conduct extensive experiments based on a real-world OSN dataset and the results demonstrate that the proposed T-MONTE-K algorithm significantly outperforms the state-of-the-art MONTE_K algorithm.

Guanfeng Liu, Lei Zhao, Kai Zheng, An Liu, Jiajie Xu, Zhixu Li, Athman Bouguettaya
Pricing Strategies for Maximizing Viral Advertising in Social Networks

Viral advertising in social networks is playing an important role in the promotion of new products, ideas and innovations. It usually starts from a set of initial adopters and spreads via social links to become viral. Given a limited budget, one central problem in studying viral advertising is influence maximization, in which one needs to target a set of initial adopters such that the number of users accepting the advertising afterwards is maximized. To solve this problem, previous works assume that each user has a fixed cost and will spread the advertising as long as the provider offers a benefit that is equal to the cost. However, this assumption is oversimplified and far from real scenarios. In practice, it is crucial for the provider to understand how to incentivize the initial adopters.

In this paper, we propose the use of concave probability functions to model the user valuation for sharing the advertising. Under the new pricing model, we show that it is NP-hard to find the optimal pricing strategy. Due to the hardness, we then propose a discrete greedy pricing strategy which has a constant approximation performance guarantee. We also discuss how to discretize the budget to provide a good trade-off between the performance and the efficiency. Extensive experiments on different data sets are implemented to validate the effectiveness of our algorithm in practice.

Bolei Zhang, Zhuzhong Qian, Wenzhong Li, Sanglu Lu
Boosting Financial Trend Prediction with Twitter Mood Based on Selective Hidden Markov Models

Financial trend prediction has been a hot topic in both academia and industry. This paper proposes to exploit Twitter mood to boost financial trend prediction based on selective hidden Markov models (sHMM). First, we expand the profile of mood states (POMS) Bipolar lexicon to extract rich society moods from massive tweets. Then, we determine which mood has the most predictive power on the financial index based on Granger causality analysis (GCA). Finally, we extend sHMM to combine the financial index and the selected Twitter mood to predict the next-day trend. Extensive experiments show that our method not only outperforms state-of-the-art methods, but also provides controllability to financial trend prediction.

Yifu Huang, Shuigeng Zhou, Kai Huang, Jihong Guan
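
The mood-selection step (GCA) can be sketched with standard tooling: for each candidate mood series, test whether its lagged values help predict the index, and keep the mood with the strongest evidence. A minimal sketch using statsmodels; the column pairing, lag range, test statistic and significance level are illustrative assumptions, not the paper's exact procedure:

    import numpy as np
    from statsmodels.tsa.stattools import grangercausalitytests

    def most_predictive_mood(index_series, mood_series, maxlag=3, alpha=0.05):
        """mood_series: dict of mood name -> 1-D array aligned with index_series."""
        best, best_p = None, alpha
        for name, series in mood_series.items():
            # column order: test whether column 2 Granger-causes column 1
            data = np.column_stack([index_series, series])
            res = grangercausalitytests(data, maxlag=maxlag, verbose=False)
            p = min(res[lag][0]["ssr_ftest"][1] for lag in range(1, maxlag + 1))
            if p < best_p:
                best, best_p = name, p
        return best
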
k-Consistent Influencers in Network Data

With the prevalence of online social media such as Facebook, Twitter and YouTube, social influence analysis has attracted considerable research interest recently. Existing works on top-k influential node discovery find influential users at a single time point only and do not capture whether the users are consistently influential over a period of time. Finding top-k consistent influencers has many interesting applications, such as targeted marketing, recommendation, expert finding, and stock market analysis. Identifying top-k consistent influencers is a challenging task. First, we need to dynamically compute the total influence of each user at each time point from an action log. Then, to find the consistent top-scorers, we need to sort and rank users at each time point, which is computationally expensive and not scalable. In this paper, we define the consistency of a node based on its influence and volatility over time. With the help of a grid index, we develop an efficient algorithm called TCI to obtain the top-k consistent influencers in a given time period. We conduct extensive experiments on three real-world datasets to evaluate the proposed methods. We also demonstrate the usefulness of top-k consistent influencers in identifying information sources and finding experts. The experimental results demonstrate the efficiency and effectiveness of our methods.

Enliang Xu, Wynne Hsu, Mong Li Lee, Dhaval Patel

Industrial Papers

Frontmatter
Analyzing Electric Vehicle Energy Consumption Using Very Large Data Sets

An electric vehicle (EV) is an interesting vehicle type because it has the potential of reducing the dependence on fossil fuels by using electricity from, e.g., wind turbines. A significant disadvantage of EVs is a very limited range, typically less than 200 km. This paper compares EVs to conventional vehicles (CVs) for private transportation using two very large data sets. The EV data set is collected from 164 vehicles (126 million rows) and the CV data set from 447 vehicles (206 million rows). Both data sets are collected in Denmark throughout 2012, with a logging frequency of 1 Hz. GPS data is collected from both vehicle types. In addition, EVs also log the actual energy consumption every second using the vehicle’s CAN bus. By comparing the two data sets, we observe that EVs are significantly slower on motorways, faster in cities, and drive shorter distances compared to CVs. Further, we study the effects of temperature, wind direction, wind speed, and road inclination. We conclude that the energy consumption (and range) of an EV is very sensitive to head wind, low temperatures, and steep road inclinations.

Benjamin Krogh, Ove Andersen, Kristian Torp
Interactive, Flexible, and Generic What-If Analyses Using In-Memory Column Stores

One well-established method of measuring the success of companies is key performance indicators, whose inter-dependencies can be represented by mathematical models such as value driver trees. While such models have commonly agreed semantics, they lack the right tool support for business simulations, because a flexible implementation that supports multi-dimensional and hierarchical structures on large data sets is complex and computationally challenging. However, in-memory column stores, as the backbone of enterprise applications, provide performance that enables flexible simulation scenarios to be calculated interactively, even on large sets of enterprise data.

In this paper, we present the HPI Business Simulator as a tool to model and run generic what-if analyses in an interactive mode that allows the exploration of scenarios backed by the full enterprise database on the finest level of granularity. The tool comprises a meta-model to describe the dependencies of key performance indicators as a graph, a method to define data bindings for nodes, and a framework to specify rules that describe how to calculate simulation scenarios.

Stefan Klauck, Lars Butzmann, Stephan Müller, Martin Faust, David Schwalb, Matthias Uflacker, Werner Sinzig, Hasso Plattner
Energy Efficient Scheduling of Fine-Granularity Tasks in a Sensor Cloud

Wireless Sensor Networks (WSNs) are frequently used in a number of applications such as unattended environmental monitoring. WSNs have limited battery power; hence, schemes have been proposed to reduce the energy consumption during sensor task processing. Consider a Sensor Cloud where owners of heterogeneous WSNs come together to offer sensing as a service to the users of multiple applications. In a Sensor Cloud environment, it is important to manage different types of task requests from multiple applications efficiently. In this work, we propose a scheduling scheme suitable for multiple applications in a Sensor Cloud. The proposed scheduling scheme is based on TDMA and considers the fine granularity of tasks. In our performance evaluation, we show that the proposed scheme saves sensor energy and provides better throughput and response time in comparison to the most recent related work.

Rashmi Dalvi, Sanjay Kumar Madria

Demo

Frontmatter
Invariant Event Tracking on Social Networks

When an event is emerging and actively discussed on social networks, its related issues may change from time to time. People may focus on different issues of an event at different times. An invariant event is an event with changing subsequent issues that last for a period of time. Examples of invariant events include government elections, natural disasters, and breaking news. This paper describes our demonstration system for tracking invariant events over social networks. Our system is able to summarize continuous invariant events and track their developments along a timeline. We propose invariant event detection by utilizing the Clique Percolation Method (CPM) for community mining. We also present an approach to event tracking based on the relationships between communities. The Twitter messages related to the 2013 Australian Federal Election are used to demonstrate the effectiveness of our approach. As the first of its kind, our system provides a benchmark for further development of monitoring tools for social events.

Sayan Unankard, Xue Li, Guodong Long
EmoTrend: Emotion Trends for Events

In this demo paper we present EmoTrend, a web-based system that supports event-centric temporal analytics of the global mood, as expressed in Twitter. Given a time range, and optionally a set of keywords, the system relies on peak frequencies, and the social graph, to identify relevant events. Subsequently, by performing sentiment analysis on related tweets, the global impact and reception of the events are presented by a visualization of the overall mood trend in the time range.

Yi-Shin Chen, Carlos Argueta, Chun-Hao Chang
A Restaurant Recommendation System by Analyzing Ratings and Aspects in Reviews

Recommender systems are widely deployed to predict the preferences of users for items. They are popular in helping users find movies, books and products in general. In this work, we design a restaurant recommender system based on a novel model that captures correlations between hidden aspects in reviews and numeric ratings. It is motivated by the observation that a user's preference for an item is affected by the different aspects discussed in reviews. Our method first explores topic modeling to discover hidden aspects from review text. Profiles are then created for users and restaurants separately, based on the aspects discovered in their reviews. Finally, we utilize regression models to detect the user-restaurant relationship. Experiments demonstrate the advantages of our approach.

Yifan Gao, Wenzhe Yu, Pingfu Chao, Rong Zhang, Aoying Zhou, Xiaoyan Yang
ENRS: An Effective Recommender System Using Bayesian Model

Traditional content-based news recommender systems strive to use a bag of words or a topic distribution to capture readers' reading preferences. However, they do not take advantage of the named entities extracted from news articles, or the relations among different named entities, to model readers' reading preferences. Named entities contain much more semantic information and richer relations than a bag of words or a topic distribution. In this paper, we design and implement a prototype system named ENRS, which combines named entities with the naïve Bayesian algorithm to recommend news articles to readers. The key technical merit of our work is that we build a probabilistic entity graph to capture the relations among different named entities, based on which ENRS can significantly increase the diversity of recommendations. The architecture of ENRS and its recommendation algorithm are discussed, and a demonstration of ENRS is also presented.

Yingyuan Xiao, Pengqiang Ai, Hongya Wang, Ching-Hsien Hsu, Yukun Li
EPSCS: Simulating and Measuring Energy Proportionality of Server Clusters

Energy proportionality for a server cluster means that energy consumption is proportional to the workload running on the cluster. One problem is that it is too costly, time-consuming, and complex to build a real cluster to evaluate energy-proportional algorithms. To solve this problem, we propose a prototype system that is able to simulate and test the energy proportionality of a server cluster quickly and easily. Our system, named Energy-Proportional Server Cluster Simulator (EPSCS), allows users to configure a virtual server cluster and test energy proportionality on real traces continuously. We implement three energy-proportional algorithms in EPSCS and visualize the real-time results to evaluate time performance and energy consumption. New algorithms can be integrated into EPSCS and compared with existing ones. We first describe the architecture and key designs of EPSCS, and finally present a case study demonstrating EPSCS.

Jiazhuang Xie, Peiquan Jin, Shouhong Wan, Lihua Yue
MAVis: A Multiple Microblogs Analysis and Visualization Tool

An increasing number of people obtain and share information on social networks through short text messages, a.k.a. microblogs. These microblogs propagate widely online based on the followship between users as well as the retweeting mechanism. Regular patterns of retweeting behavior can be discovered by mining historical retweet data, and the key users in the information diffusion process can also be found in this way. This paper gives a novel definition of the information diffusion network and of three categories of nodes in the network. A tool designed to mine the information diffusion network and visualize the results is implemented. This paper introduces the related definitions, the architecture, the mining algorithms and the visualization interface.

Changping Wang, Chaokun Wang, Jingchao Hao, Hao Wang, Xiaojun Ye
Backmatter
Metadata
Title
Database Systems for Advanced Applications
Editors
Matthias Renz
Cyrus Shahabi
Xiaofang Zhou
Muhammad Aamir Cheema
Copyright Year
2015
Electronic ISBN
978-3-319-18123-3
Print ISBN
978-3-319-18122-6
DOI
https://doi.org/10.1007/978-3-319-18123-3
