
2017 | Book


Web and Big Data

First International Joint Conference, APWeb-WAIM 2017, Beijing, China, July 7–9, 2017, Proceedings, Part II

Edited by: Lei Chen, Christian S. Jensen, Cyrus Shahabi, Xiaochun Yang, Xiang Lian

Publisher: Springer International Publishing

Book series: Lecture Notes in Computer Science

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"


About this book

This two-volume set, LNCS 10366 and 10367, constitutes the thoroughly refereed proceedings of the First International Joint Conference, APWeb-WAIM 2017, held in Beijing, China in July 2017.

The 44 full papers presented together with 32 short papers and 10 demonstration papers were carefully reviewed and selected from 240 submissions. The papers are organized around the following topics: spatial data processing and data quality; graph data processing; data mining, privacy and semantic analysis; text and log data management; social networks; data mining and data streams; query processing; topic modeling; machine learning; recommendation systems; distributed data processing and applications; machine learning and optimization.

Table of Contents

Frontmatter

Machine Learning

Frontmatter

Combining Node Identifier Features and Community Priors for Within-Network Classification

With widely available large-scale network data, one hot topic is how to adopt traditional classification algorithms to predict the most probable labels of nodes in a partially labeled network. In this paper, we propose a new algorithm called identifier based relational neighbor classifier (IDRN) to solve the within-network multi-label classification problem. We use the node identifiers in the egocentric networks as features and propose a within-network classification model by incorporating community structure information to predict the most probable classes for unlabeled nodes. We demonstrate the effectiveness of our approach on several publicly available datasets. On average, our approach can provide Hamming score, Micro-F1 score and Macro-F1 score up to 14%, 21% and 14% higher than competing methods respectively in sparsely labeled networks. The experiment results show that our approach is quite efficient and suitable for large-scale real-world classification tasks.

Qi Ye, Changlei Zhu, Gang Li, Feng Wang

An Active Learning Approach to Recognizing Domain-Specific Queries From Query Log

In this paper, we address the problem of recognizing domain-specific queries from a general search engine's query log. Unlike most previous work in query classification, which relies on external resources or annotated training queries, we take the query log as the only resource for recognizing domain-specific queries. In the proposed approach, we represent the query log as a heterogeneous graph and then formulate the task of domain-specific query recognition as graph-based transductive learning. In order to reduce the impact of noisy and insufficient initial annotated queries, we further introduce an active learning strategy into the learning process such that the manual annotations needed are reduced and the recognition results can be continuously refined through interactive human supervision. Experimental results demonstrate that the proposed approach is capable of recognizing a certain amount of high-quality domain-specific queries with only a small number of manually annotated queries.

Weijian Ni, Tong Liu, Haohao Sun, Zhensheng Wei

Event2vec: Learning Representations of Events on Temporal Sequences

Sequential data containing series of events with timestamps is commonly used to record the status of things in all aspects of life, and is referred to as temporal event sequences. Learning vector representations is a fundamental task of temporal event sequence mining, as it is indispensable for further analysis. Temporal event sequences differ from symbol sequences and numerical time series in that each entry is accompanied by a corresponding timestamp and that the entries are usually sparse in time. Therefore, methods designed either for symbolic sequences, such as word2vec, or for numerical time series, such as pattern discovery, perform unsatisfactorily. In this paper, we propose an algorithm called event2vec that solves these problems. We first present the Event Connection Graph to summarize events while taking time into consideration. Then, we construct a training Sample Generator to get clean and endless data. Finally, we feed these data to an embedding neural network to get learned vectors. Experiments on real temporal event sequence data in the medical area demonstrate the effectiveness and efficiency of the proposed method. The procedure is totally unsupervised and needs no expert knowledge. Thus, it can be used to improve the quality of health care without any additional burden.

Shenda Hong, Meng Wu, Hongyan Li, Zhengwu Wu

Joint Emoji Classification and Embedding Learning

In conversational scenarios, emoji are widely used to express people's feelings, which greatly enriches the expressiveness of plain text. People produce large numbers of utterances containing emoji on social media platforms every day, which gives emoji great influence on everyday life. In the academic community, researchers often use utterances containing emoji as annotated data for sentiment analysis, yet pay little attention to emoji themselves. The challenge lies in discriminating among so many different kinds of emoji, especially those with similar meanings, which makes this problem quite different from traditional sentiment analysis. In this paper, in order to gain insight into emoji, we propose a matching architecture using deep neural networks to jointly learn emoji embeddings and perform classification. In particular, we use a convolutional neural network to obtain the embedding of the utterance and match it with the embedding of the corresponding emoji to obtain the best classification, while also training the emoji embeddings. Experiments on a massive dataset demonstrate that our proposed approach is more effective than traditional softmax methods in terms of the p@1, p@5 and MRR evaluation metrics. A human evaluation further shows that the performance could meet the requirements of practical systems.

Xiang Li, Rui Yan, Ming Zhang
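
The p@1, p@5 and MRR metrics named in the abstract above can be illustrated with a minimal sketch. It assumes that each utterance has exactly one gold emoji, so that precision at k reduces to a hit indicator; the function names and the toy labels are hypothetical and are not taken from the paper.

```python
from typing import List


def hit_at_k(ranked: List[List[str]], gold: List[str], k: int) -> float:
    """Fraction of examples whose gold label appears among the top-k predictions."""
    hits = sum(1 for preds, g in zip(ranked, gold) if g in preds[:k])
    return hits / len(gold)


def mean_reciprocal_rank(ranked: List[List[str]], gold: List[str]) -> float:
    """Average of 1/rank of the gold label; a missing gold label contributes 0."""
    total = 0.0
    for preds, g in zip(ranked, gold):
        if g in preds:
            total += 1.0 / (preds.index(g) + 1)
    return total / len(gold)


# Toy rankings (best guess first) for three utterances whose gold emoji is "joy".
ranked = [
    ["joy", "sob", "heart"],   # gold at rank 1
    ["heart", "joy", "sob"],   # gold at rank 2
    ["sob", "heart", "joy"],   # gold at rank 3
]
gold = ["joy", "joy", "joy"]

print(hit_at_k(ranked, gold, 1))
print(mean_reciprocal_rank(ranked, gold))
```

MRR rewards a model for placing the gold emoji near the top even when it is not first, which is why it is often reported alongside p@1 when many classes have similar meanings.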

Target-Specific Convolutional Bi-directional LSTM Neural Network for Political Ideology Analysis

Ideology detection from text plays an important role in identifying the political ideology of politicians who have expressed their beliefs on many issues. Most existing approaches based on bag-of-words features fail to capture semantic information, and other sentence modeling methods are ineffective at extracting the ideological target context, which is significant for identifying political ideology. In this paper, we propose a target-specific Convolutional and Bi-directional Long Short Term Memory neural network (CB-LSTM), which is suited to intensifying ideological target-related context and learning semantic representations of the text at the same time. We conduct experiments on two commonly used datasets and a well-designed dataset extracted from tweets. The experimental results show that the proposed method outperforms the state-of-the-art methods.

Xilian Li, Wei Chen, Tengjiao Wang, Weijing Huang

Boost Clickbait Detection Based on User Behavior Analysis

Articles on the web are often given misleading titles to attract user clicks and raise the click-through rate (CTR). A clickbait title may increase the click-through rate but degrade the user experience. Thus, it is important to identify articles with misleading titles and block them for specific users. Existing methods consider only text features, which hardly produces a satisfactory result. User behavior is useful in clickbait detection: users have different tendencies toward articles with clickbait titles, and user actions on an article usually indicate whether it has a clickbait title. In this paper, we design an algorithm that models user behavior in order to improve clickbait detection. Specifically, we use a classifier to produce an initial clickbait score for articles. Then, we define a loss function on the user behavior and tune the clickbait score toward decreasing the loss function. Experiments show that precision and recall improve when user behavior is used.

Hai-Tao Zheng, Xin Yao, Yong Jiang, Shu-Tao Xia, Xi Xiao

Recommendation Systems

Frontmatter

A Novel Hybrid Friends Recommendation Framework for Twitter

As one of the key features of social networks, friends recommendation is a kind of link prediction task with ranking that has recently been extensively investigated in social network analysis, as users would like to follow people who have interests similar to their own. We use Twitter as a case study and propose a novel hybrid friends recommendation framework that is based not only on friend relationships but also on users' location information, which is recorded by Twitter when they post their tweets. Our framework can recommend friends with similar interests to users based on location features, using collaborative filtering to filter out common places that are meaningless, e.g., a bus station, and to focus on places where people are more likely to become friends, e.g., a dance studio. In addition, we propose a multiple-classifier combination method to leverage the information contained in the friend and location features in order to get better outcomes. We evaluate our framework on two real corpora from Twitter, and the favorable results indicate that our proposed approach is feasible.

Yan Zhao, Jia Zhu, Mengdi Jia, Wenyan Yang, Kai Zheng

A Time and Sentiment Unification Model for Personalized Recommendation

With the rapid development of social media, personalized recommendation has become an essential means to help people discover attractive and interesting items. Intuitively, users buying items online are influenced not only by their preferences and public attention, but also by the crowd sentiment (i.e., the word of mouth) toward the items. Specifically, users are likely to reject an item most of whose crowd reviews are negative. Therefore, a good personalized recommendation model also needs to take crowd sentiment into account, which most current methods do not. In light of this, we propose TSUM, a model that jointly integrates time and crowd sentiment for personalized recommendation. TSUM simultaneously models user-oriented topics related to user preferences, time-oriented topics relevant to temporal context, and crowd sentiment towards items. TSUM combines the influences of user preferences, temporal context and crowd sentiment to model user behavior in a unified way. Extensive experimental results on two large real-world datasets show that our recommender system significantly outperforms state-of-the-art methods by making more effective personalized recommendations.

Qinyong Wang, Hongzhi Yin, Hao Wang

Personalized POI Groups Recommendation in Location-Based Social Networks

With the development of urban modernization, a large number of hot spots cover the entire city; these are defined as Points-of-Interest (POI) Groups, each consisting of POIs. POI Groups have a significant impact on people's lives and urban planning. Every person has her/his own personalized POI Groups (PPGs) based on preferences and friendships in location-based social networks (LBSNs). However, there is almost no research on this aspect in recommendation systems. This paper proposes a novel PPG recommendation algorithm and models PPGs by extending the DBSCAN model. Our model considers the degree to which each PPG covers the target user's POI preferences. The system recommends to the target user the PPGs with the top-N largest scores, which is an NP-hard problem. This paper proposes a greedy algorithm to solve it. Extensive experiments on two LBSN datasets illustrate the effectiveness of our proposed algorithm.

Fei Yu, Zhijun Li, Shouxu Jiang, Xiaofei Yang

Learning Intermediary Category Labels for Personal Recommendation

In many recommender systems, category information has long been used as an additional feature, typically to understand relationships between products in order to surface recommendations that are relevant to a given context. Nevertheless, the fact that categories, as intermediaries, label not only the attributes of products but also the preference characteristics of people has been ignored. Here we propose a framework to learn the intermediary role of categories acting as a bridge between users and items. The framework includes two parts. First, we collect the intermediary factors through which category labels affect the attributes of items and user preferences, respectively. Second, we integrate the category medium, which assembles item attributes and user preferences, into online recommender systems to help users discover similar or complementary products. We evaluate our framework on the Amazon product catalog and demonstrate that hierarchical categories can capture characteristics of users and items simultaneously.

Wenli Yu, Li Li, Jingyuan Wang, Dengbao Wang, Yong Wang, Zhanbo Yang, Min Huang

Skyline-Based Recommendation Considering User Preferences

In this paper, we propose a skyline-based recommendation and ranking function. We suppose that some recommender systems, such as hotel recommender systems, are based not only on user preferences but also on cost performance. For these kinds of applications, we first extract items with good cost performance and then identify items that users prefer, which reduces the computational cost of the online process. Based on the results of our preliminary experiments, we propose user feedback-based scoring and density-aware scoring methods, where items that are highly similar to a user's latent requirements are recommended and attribute values in a dense area are quantized into a single value. The results of the experiments suggest that density-aware scoring provides accuracy equal to or greater than that of the basic scoring.

Shuhei Kishida, Seiji Ueda, Atsushi Keyaki, Jun Miyazaki
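
The first stage described in the abstract above, extracting items with good cost performance, corresponds to a skyline query: keep every item that no other item dominates. The following is a minimal sketch under the assumption of two hypothetical hotel attributes (price, where lower is better, and rating, where higher is better); it is a quadratic-time illustration, not the authors' scoring method.

```python
from typing import List, Tuple

Hotel = Tuple[str, float, float]  # (name, price, rating): hypothetical attributes


def dominates(a: Hotel, b: Hotel) -> bool:
    """a dominates b if a is no worse on both attributes and strictly better on one."""
    no_worse = a[1] <= b[1] and a[2] >= b[2]
    strictly_better = a[1] < b[1] or a[2] > b[2]
    return no_worse and strictly_better


def skyline(items: List[Hotel]) -> List[Hotel]:
    """Return the items that are not dominated by any other item."""
    return [x for x in items if not any(dominates(y, x) for y in items)]


hotels = [
    ("A", 80.0, 3.5),
    ("B", 120.0, 4.5),
    ("C", 100.0, 3.0),  # dominated by A: more expensive and lower rated
    ("D", 150.0, 4.5),  # dominated by B: more expensive, same rating
    ("E", 60.0, 2.5),
]
print([h[0] for h in skyline(hotels)])  # ['A', 'B', 'E']
```

Because the skyline depends only on the item attributes, it can be computed before any user arrives, leaving only the preference-based ranking of the surviving items for the online step.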

Improving Topic Diversity in Recommendation Lists: Marginally or Proportionally?

Diversifying the recommendation lists in recommendation systems could potentially satisfy users' needs. Most diversification techniques are designed to recommend the top-k relevant and diverse items, taking the coverage of user preferences into account. The relevance scores are usually estimated by methods such as latent matrix factorization. In this paper, by contrast, we model users' interests with the topic distributions of the rated items, and then investigate how to improve topic diversification within the recommendation lists. We first estimate the topic distributions of users and items by training Latent Dirichlet Allocation (LDA) on the rating set. After that, we propose two topic diversification methods based on submodular function maximization and proportionality, respectively. Experimental results on the MovieLens and FilmTrust datasets demonstrate that our approach outperforms state-of-the-art techniques in terms of distributional diversity.

Xiaolu Xing, Chaofeng Sha, Junyu Niu
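
Diversification by submodular function maximization, as mentioned in the abstract above, is commonly approximated with a greedy loop that repeatedly adds the item with the largest marginal gain. The sketch below assumes a hypothetical probabilistic topic-coverage objective and toy topic distributions; the paper's actual objective and its LDA-derived distributions are not reproduced here.

```python
from typing import Dict, List


def coverage(selected: List[str], item_topics: Dict[str, List[float]],
             user_topics: List[float]) -> float:
    """Weighted probability that each user topic is covered by a selected item.

    This objective is monotone and submodular: adding an item never hurts, and
    its gain shrinks as the topic it covers is already covered by earlier picks.
    """
    total = 0.0
    for t, weight in enumerate(user_topics):
        miss = 1.0
        for item in selected:
            miss *= 1.0 - item_topics[item][t]
        total += weight * (1.0 - miss)
    return total


def greedy_diversify(item_topics: Dict[str, List[float]],
                     user_topics: List[float], k: int) -> List[str]:
    """Pick up to k items, each time taking the largest marginal coverage gain."""
    selected: List[str] = []
    candidates = set(item_topics)
    while candidates and len(selected) < k:
        base = coverage(selected, item_topics, user_topics)
        # Sorting makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(candidates),
                   key=lambda i: coverage(selected + [i], item_topics, user_topics) - base)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: two topics, a user equally interested in both.
user_topics = [0.5, 0.5]
item_topics = {"m1": [0.9, 0.0], "m2": [0.8, 0.0], "m3": [0.0, 0.7]}
print(greedy_diversify(item_topics, user_topics, 2))  # ['m1', 'm3']
```

After `m1` is chosen, `m2` adds little because its topic is already well covered, so the second slot goes to `m3` even though `m2` would score higher on its own.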

Distributed Data Processing and Applications

Frontmatter

Integrating Feedback-Based Semantic Evidence to Enhance Retrieval Effectiveness for Clinical Decision Support

The goal of Clinical Decision Support (CDS) is to help physicians find useful information from a collection of medical articles with respect to the given patient records, in order to take the best care of their patients. Most of the existing CDS methods do not sufficiently consider the semantic evidence, hence the potential in improving the performance in biomedical articles retrieval. This paper proposes a novel feedback-based approach which considers the semantic association between a retrieved biomedical article and a pseudo feedback set. Evaluation results show that our method outperforms the strong baselines, and is able to improve over the best runs in the CDS tasks of TREC 2014 & 2015.

Chenhao Yang, Ben He, Jungang Xu

Reordering Transaction Execution to Boost High Frequency Trading Applications

High frequency trading (HFT) has always been welcomed because it benefits not only personal interests but also overall social welfare. While the recent advance of portfolio selection in the HFT market generates more profit, it yields highly contended OLTP workloads. Transaction pipeline, the state-of-the-art concurrency control (CC) mechanism, features the exploitation of abundant parallelism; however, it suffers from limited concurrency when confronted with HFT workloads. Its variants, which enable more parallel execution by leveraging fine-grained contention information, also have little effect. To solve this problem, we for the first time observe and formulate the source of restricted concurrency as harmful ordering of transaction statements. To resolve harmful ordering, we propose PARE, a pipeline-aware reordered execution, to improve application performance by rearranging statements in order of their degrees of contention. Concretely, two mechanisms are devised to ensure the correctness of statement rearrangement and to identify the degrees of contention of statements, respectively. Experimental results show that PARE can improve transaction throughput and reduce transaction latency on HFT applications by up to an order of magnitude compared with the state-of-the-art CC mechanism.

Ningnan Zhou, Xuan Zhou, Xiao Zhang, Xiaoyong Du, Shan Wang

Bus-OLAP: A Bus Journey Data Management Model for Non-on-time Events Query

Increasing the on-time rate of bus service can promote people's willingness to travel by bus, which is an effective measure to mitigate city traffic congestion. Queries on bus arrivals can be used to identify and analyze various kinds of non-on-time events that happened during a bus journey, which helps detect the factors behind delays and provides decision support for optimizing bus schedules. We propose a data management model, called Bus-OLAP, for querying bus monitoring data, considering the characteristics of bus monitoring data and the scenarios of on-time analysis. While fulfilling typical requirements of bus monitoring data analysis, Bus-OLAP not only provides a flexible way to manage the data and to implement multiple-granularity data query and update, but also supports distributed query and computation. The experiments on real-world bus monitoring data verify that Bus-OLAP is effective and efficient.

Tinghai Pang, Lei Duan, Jyrki Nummenmaa, Jie Zuo, Peng Zhang

Distributed Data Mining for Root Causes of KPI Faults in Wireless Networks

In the field of wireless network optimization, with the enlargement of network size and the complication of network structure, traditional processing methods cannot effectively identify the causes of network faults in the face of increasing network data. In this paper, we propose a root-cause-analysis method based on distributed data mining (DRCA). Firstly, we put forward an improved decision tree, where the selection of the best split-feature is based on the feature’s purity-gain, and then we skillfully convert the problem of root-cause-analysis into modeling of an improved decision tree and interpretation of the tree model. In order to solve the problem of memory and efficiency associated with large-scale data, we parallelize the algorithm and distribute the tasks to multiple computers. The experiments show that DRCA is an effective, efficient, and scalable method.

Shiliang Fan, Yubin Yang, Wenyang Lu, Ping Song

Precise Data Access on Distributed Log-Structured Merge-Tree

The log-structured merge tree (LSM-tree) decomposes a large database into multiple parts: an in-writing part and several read-only ones. It achieves high write throughput as well as low read latency. However, read requests have to go through multiple structures to find the required data. In a distributed database system, different parts of the LSM-tree are stored in a distributed manner. Data access incurs extra network communication, as a server in the query layer must pull entries from the underlying storage layer. This work proposes a precise data access strategy. A Bloom filter-based structure is designed to test whether an element exists in the in-writing part of the LSM-tree. A lease-based synchronization strategy is used to maintain consistent copies of the Bloom filter on remote query servers. Experiments show that the solution achieves a 6× throughput improvement over existing methods.

Tao Zhu, Huiqi Hu, Weining Qian, Aoying Zhou, Mengzhan Liu, Qiong Zhao
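
The membership test at the heart of the abstract above can be sketched with a plain Bloom filter. This is a minimal illustration, assuming SHA-256 based hashing and arbitrary sizing; the class name, parameters and keys are hypothetical, and the paper's Bloom filter-based structure and lease-based synchronization are not modeled.

```python
import hashlib
from typing import Iterator


class BloomFilter:
    """Set summary with no false negatives and a tunable false-positive rate."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: str) -> Iterator[int]:
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))


# A query server could consult its copy of the filter before asking the
# storage layer whether the in-writing part holds a newer version of a row.
memtable_filter = BloomFilter()
for key in ("row:1", "row:2"):
    memtable_filter.add(key)

print(memtable_filter.might_contain("row:1"))  # True: inserted keys are never missed
```

A negative answer lets the reader skip the remote lookup entirely, while a positive answer only means the lookup is still required, so false positives cost performance but never correctness.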

Cuttle: Enabling Cross-Column Compression in Distributed Column Stores

We observe that, in real-world distributed data warehouse systems, data columns from different sources often exhibit redundancy. Even though these systems can employ both general and column-oriented compression schemes to reduce the data storage pressure, such cross-column redundancy (CCR) is not recognized or exploited effectively. Therefore, we propose Cuttle, a column storage system that enables cross-column compression to reduce CCR. Specifically, we identify three kinds of CCR and develop a referential transformation encoding (RTE) scheme to compress multiple columns of data with CCR. Furthermore, we address the CCR selection problem and propose a greedy algorithm to generate cross-column compression schemes. Our experiments on real-world datasets show that Cuttle can further reduce data size by half after applying both the column-oriented and general compression schemes, and that the query processing performance with Cuttle is improved by 20% without any change to the application programs.

Hao Liu, Jiang Xiao, Xianjun Guo, Haoyu Tan, Qiong Luo, Lionel M. Ni

Machine Learning and Optimization

Frontmatter

Optimizing Window Aggregate Functions via Random Sampling

Window functions have been a part of the SQL standard since 2003 and have been well studied during the past decade. As the demand increases in analytics tools, window functions have seen an increasing amount of potential applications. Although the current mainstream commercial databases support window functions, the existing implementation strategies are inefficient for the real-time processing of big data. Recently, some algorithms based on sampling (e.g., online aggregation) have been proposed to deal with large and complex data in relational databases, which offer us a flexible tradeoff between accuracy and efficiency. However, sampling techniques have not been considered for window functions in databases. In this paper, we first propose two algorithms to deal with window functions based on two sampling techniques, Naive Random Sampling and Incremental Random Sampling. The proposed algorithms are highly efficient and are general enough to aggregate other existing algorithms of window functions. In particular, we evaluated our algorithms in the latest version of PostgreSQL, which demonstrated superior performance over the TPC-H benchmark.

Guangxuan Song, Wenwen Qu, Yilin Wang, Xiaoling Wang
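
The accuracy and efficiency tradeoff described in the abstract above can be illustrated by estimating a moving average from a uniform random sample of each window frame. This is a minimal sketch loosely in the spirit of naive random sampling, not the authors' algorithms; the function name, the frame definition (the current row plus a fixed number of preceding rows) and the seed parameter are assumptions.

```python
import random
from typing import List, Optional


def window_avg(values: List[float], preceding: int,
               sample_size: Optional[int] = None, seed: int = 0) -> List[float]:
    """Moving average over the frame: `preceding` rows before the current row
    through the current row.

    With sample_size=None every row in the frame is used (the exact answer).
    Otherwise at most sample_size rows are drawn uniformly without replacement
    from each frame and the average of the sample is returned as an estimate.
    """
    rng = random.Random(seed)
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - preceding): i + 1]
        if sample_size is not None and sample_size < len(frame):
            frame = rng.sample(frame, sample_size)
        out.append(sum(frame) / len(frame))
    return out


exact = window_avg([1, 2, 3, 4], preceding=1)
print(exact)  # [1.0, 1.5, 2.5, 3.5]

approx = window_avg(list(range(1, 11)), preceding=3, sample_size=2)
print(approx)
```

The exact call mirrors an `AVG(x) OVER (ORDER BY ... ROWS BETWEEN 1 PRECEDING AND CURRENT ROW)` window; the sampled call touches at most two rows per frame, trading some accuracy for less work on wide frames.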

Fast Log Replication in Highly Available Data Store

Modern large-scale data stores widely adopt consensus protocols to achieve high availability and throughput. The recently proposed Raft algorithm is easier to understand and is widely implemented in a large number of open source projects. In consensus algorithms, including Raft, log replication is a common and frequently used operation that has a significant impact on system performance. In particular, since the commit latency is capped by the slowest follower among the majority of followers that responded to the leader, it is important to design a fast scheme for follower nodes to process the replicated logs. Based on an analysis of how a follower node handles received log entries in the Raft algorithm, we identify the main factors influencing the time from when the follower receives the log to when it acknowledges to the leader that the log was received. Based on these factors, we propose an effective log replication scheme that optimizes the process of flushing logs to disk and replaying them, referred to as Raft with Fast Followers (FRaft). Finally, we compare the performance of Raft and FRaft using the YCSB benchmark and Sysbench test tools, and experimental results demonstrate that FRaft has lower latency and higher throughput than Raft using only straightforward pipeline and batch optimization for log replication.

Donghui Wang, Peng Cai, Weining Qian, Aoying Zhou, Tianze Pang, Jing Jiang
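
The statement in the abstract above that commit latency is capped by the slowest follower within the responding majority can be made concrete with a small calculation. This sketch assumes the leader counts its own log write as one vote and ignores the leader's local write time; the function name and the latency figures are hypothetical.

```python
from typing import List


def commit_latency(follower_ack_ms: List[float]) -> float:
    """Time until the leader has acknowledgements from a majority of the cluster.

    The cluster has len(follower_ack_ms) followers plus one leader. A majority is
    cluster_size // 2 + 1 nodes, and the leader supplies one of those votes, so
    cluster_size // 2 follower acknowledgements are still needed.
    """
    cluster_size = len(follower_ack_ms) + 1
    acks_needed = cluster_size // 2
    if acks_needed == 0:
        return 0.0  # single-node cluster: the leader alone is a majority
    return sorted(follower_ack_ms)[acks_needed - 1]


# Five-node cluster: the leader needs 2 of its 4 followers.
print(commit_latency([5.0, 40.0, 8.0, 100.0]))  # 8.0
```

In this toy example the two slow followers do not matter at all, but shaving time off the follower that completes the majority (8 ms here) lowers the commit latency directly, which is why speeding up how followers flush and acknowledge entries pays off.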

New Word Detection in Ancient Chinese Literature

Mining ancient Chinese corpora is not as convenient as mining modern Chinese, because there is no complete dictionary of ancient Chinese words, which leads to poor tokenizer performance. Finding new words in ancient Chinese texts is therefore significant. In this paper, the Apriori algorithm is improved and used to produce candidate character sequences, and a long short-term memory (LSTM) neural network is used to identify the boundaries of words. Furthermore, we design a word confidence feature to measure the confidence score of new words. The experimental results demonstrate that the improved Apriori-like algorithm can greatly improve the recall rate of valid candidate character sequences, and that the average accuracy of our method on new word detection rises to 89.7%.

Tao Xie, Bin Wu, Bai Wang

Identifying Evolutionary Topic Temporal Patterns Based on Bursty Phrase Clustering

We discuss a temporal text mining task on finding evolutionary patterns of topics from a collection of article revisions. To reveal the evolution of topics, we propose a novel method for finding key phrases that are bursty and significant in terms of revision histories. Then we show a time series clustering method to group phrases that have similar burst histories, where additions and deletions are separately considered, and time series is abstracted by burst detection. In clustering, we use dynamic time warping to measure the distance between time sequences of phrase frequencies. Experimental results show that our method clusters phrases into groups that actually share similar bursts which can be explained by real-world events.

Yixuan Liu, Zihao Gao, Mizuho Iwaihara
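
The abstract above uses dynamic time warping (DTW) to measure the distance between time sequences of phrase frequencies. The following is a minimal sketch of the classic DTW recurrence, assuming absolute difference as the local cost and no warping-window constraint; the function name and the toy sequences are hypothetical.

```python
from typing import Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Classic dynamic time warping distance between two numeric sequences."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the cheapest alignment of the first i items of a
    # with the first j items of b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Two phrases with the same burst shape, one delayed by a time step.
early_burst = [0, 1, 5, 1, 0]
late_burst = [0, 0, 1, 5, 1, 0]
print(dtw_distance(early_burst, late_burst))  # 0.0
```

Because DTW may stretch either sequence along the time axis, bursts of the same shape that occur at slightly different times end up close together, which a point-by-point distance would not achieve.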

Personalized Citation Recommendation via Convolutional Neural Networks

Automatic citation recommendation based on citation context, together with consideration of users' preferences and writing patterns, is an emerging research topic. In this paper, we propose a novel personalized convolutional neural network (p-CNN) discriminatively trained by maximizing the conditional likelihood of the cited documents given a citation context. The proposed model not only represents the hierarchical structures of sentences through layer-by-layer composition and pooling, but also includes authorship information. It incorporates each paper's author into the neural network's input layer and thus can generate semantic content features and representative author features simultaneously. The results show that the proposed model can effectively capture salient representations and hence significantly outperforms several baseline methods in the citation recommendation task in terms of recall and Mean Average Precision.

Jun Yin, Xiaoming Li

A Streaming Data Prediction Method Based on Evolving Bayesian Network

In the Big Data era, large volumes of data are continuously and rapidly generated from sensor networks, social networks, the Internet, etc. Learning knowledge from streaming Big Data is an important task since it can support online decision making. Prediction is one of the useful learning tasks, but a fixed model usually does not work well because the data distribution changes over time. In this paper, we propose a streaming data prediction method based on an evolving Bayesian network. The Bayesian network model is inferred based on a Gaussian mixture model and the EM algorithm. To support evolving the model structure and parameters based on streaming data, an evolving hill-climbing algorithm is proposed, which is based on incremental calculation of the score metric when new data arrives. The experimental evaluations show that this method is effective and that it outperforms other popular methods for streaming data prediction.

Yongheng Wang, Guidan Chen, Zengwang Wang

A Learning Approach to Hierarchical Search Result Diversification

The queries that users issue to search engines are often ambiguous. By returning diverse ranking results, we can satisfy different information needs as far as possible. Recently, a hierarchical structure was proposed to represent user intents instead of a flat list of subtopics. Although the hierarchical diversification model performs better than previous models, it utilizes a predefined function to calculate the diversity score, which may not reach the optimal result. The model's parameters need to be tuned manually and repeatedly, which is time-consuming. In this paper, we introduce a learning-based hierarchical diversification model. Thanks to the learning model, the parameter values are determined automatically and are closer to optimal. Experiments show that our approach significantly outperforms several existing diversification models.

Hai-Tao Zheng, Zhuren Wang, Xi Xiao

Demo Papers

Frontmatter

TeslaML: Steering Machine Learning Automatically in Tencent

In this demonstration, we showcase TeslaML, the machine learning (ML) platform in Tencent Inc. TeslaML offers an interactive and visual workspace for users to create an ML pipeline via dragging, placing, and connecting the implemented modules. For the non-experts, TeslaML provides many ready-to-use ML modules to build an ML pipeline without any programming. Besides, TeslaML abstracts many existing ML systems as system modules. The integration of various systems enables the experienced users to use their preferred systems, to test new algorithms, and to obtain the most efficient execution. Furthermore, TeslaML provides many schedulers to meet different scheduling requirements.

Jiawei Jiang, Ming Huang, Jie Jiang, Bin Cui

DPHSim: A Flexible Simulator for DRAM/PCM-Based Hybrid Memory

In this paper, we demonstrate a flexible simulator for DRAM and PCM based hybrid memory systems. PCM (Phase Change Memory), a new kind of non-volatile memory, has received much attention from both academia and industry. While PCM has many new features, such as fast read speed, non-volatility, and low power consumption, how to use PCM in the memory hierarchy still remains a problem. In addition, at present PCM chips and storage devices are not available. This makes it hard to evaluate PCM-related algorithms. Thus, we design a flexible simulator named DPHSim to provide a test bed for PCM-related studies. The unique features of DPHSim are manifold. (1) It supports various kinds of memory hierarchy, including DRAM-only, PCM-only, PCM as the cache of DRAM, and hybrid memory (PCM and DRAM are both used as main memory). (2) It provides flexible configuration options for workload generation and system settings. (3) It offers user-friendly memory allocation APIs for users to simulate memory hierarchies and evaluate performance on DPHSim. After an overview of the features and architecture of DPHSim, we present a case study of DPHSim to demonstrate its feasibility and flexibility.

Dezhi Zhang, Peiquan Jin, Xiaoliang Wang, Chengcheng Yang, Lihua Yue

CrowdIQ: A Declarative Crowdsourcing Platform for Improving the Quality of Web Tables

Web tables provide us with high-quality sources of structured data. However, we cannot use those valuable tables directly owing to various problems such as conflicting data and missing headers. We present CrowdIQ, a scalable platform that integrates crowdsourcing technology for improving the quality of web tables. We design CrowdIQL, a declarative language that aims to help requesters operate on tables more exactly and flexibly. Crowdsourcing tasks are also optimized in this platform by providing candidate items and minimizing useless data, which helps requesters get higher-quality tables at lower cost.

Yihai Xi, Ning Wang, Xiaoyu Wu, Yuqing Bao, Wutong Zhou

OICPM: An Interactive System to Find Interesting Co-location Patterns Using Ontologies

In spatial data mining, the usefulness of co-location patterns is strongly limited by the huge number of patterns delivered. Although many methods have been proposed to reduce the number of co-location patterns, most of them cannot guarantee that the extracted patterns are interesting to the user, because they are generally based on statistical information. This demonstration presents OICPM, an interactive system to discover interesting co-location patterns based on ontologies. With OICPM, the user can efficiently find the patterns of real interest among a massive number of co-location patterns within only a few rounds of selection, and the mined interesting patterns are filtered to support better decisions.

Xuguang Bao, Lizhen Wang, Qing Xiao

BioPW: An Interactive Tool for Biological Pathway Visualization on Linked Data

With the development of Linked Data, large amounts of biological pathway RDF data have been published on the Semantic Web. To make these various datasets available to life scientists, we demonstrate an interactive tool, called BioPW, for visualizing biological pathway Linked Data. Rather than showing biological pathway data from a single dataset, our tool, using the Open PHACTS Linked Data API and users’ exploratory interaction, can clearly illustrate biological pathways and their associated information from the multiple perspectives of linked datasets.

Yuan Liu, Xin Wang, Qiang Xu

ChargeMap: An Electric Vehicle Charging Station Planning System

Optimizing the deployment of charging stations is important for the promotion of electric vehicles. Traditional approaches to charging station planning mostly lack comprehensive consideration. Some factors cannot be fully considered and effectively quantified in those approaches, such as the load of the power grid, charging demand, transportation cost, and construction cost. This demo presents ChargeMap, a novel system for electric vehicle charging station planning based on a multi-factor optimization model. ChargeMap can strike a balance among the factors that have great influence on charging station planning. Moreover, it provides an appropriate approach to quantifying these factors. In this demo, we show the application of ChargeMap to real datasets on the power grid, population, transportation, and real estate of Beijing, which delivers an effective solution to the optimization of charging station planning.

Longlong Xu, Wutao Lin, Xiaorong Wang, Zhenhui Xu, Wei Chen, Tengjiao Wang

Topic Browsing System for Research Papers Based on Hierarchical Latent Tree Analysis

New academic papers appear rapidly in the literature nowadays. This poses a challenge for researchers who are trying to keep up with a given field, especially those who are new to the field and may not know where to start. To address this kind of problem, we have developed a topic browsing system for research papers in which the papers are automatically categorized by a probabilistic topic model. Rather than using Latent Dirichlet Allocation (LDA) for topic modeling, we use a recently proposed method called hierarchical latent tree analysis, which has been shown to perform better than some state-of-the-art LDA-based methods. The resulting topic model contains a hierarchy of topics so that users can browse topics at different levels. The topic model contains a manageable number of general topics at the top level and allows thousands of fine-grained topics at the bottom level.

Leonard K. M. Poon, Chun Fai Leung, Peixian Chen, Nevin L. Zhang

A Tool of Benchmarking Realtime Analysis for Massive Behavior Data

With the growth of platforms serving massive numbers of users, the amount of data generated by these platforms is increasing rapidly. A large number of big data analysis frameworks have been designed to analyze this data, which in turn requires a specific benchmark to evaluate system performance. Realtime analysis (or streaming analysis) has become a hot topic in big data research. However, there is no dedicated benchmark for such streaming analysis systems. This paper introduces a tool for evaluating the performance of such systems. Based on the scenario of e-commerce platforms, the benchmark tool uses a data generator with user models derived from users’ habits on e-commerce platforms. A test suite is developed to simulate mixed workloads for streaming analysis.

Mingyan Teng, Qiao Sun, Buqiao Deng, Lei Sun, Xiongpai Qin

Interactive Entity Centric Analysis of Log Data

Interactive entity-centric analysis of log data can help us gain fine-grained insights into business. In this paper, we first describe a fiber-based partitioning method for log data, which accelerates later entity-centric analysis. Second, we present our fiber-based partitioner, which is used by the Spark SQL query engine. The fiber-based partitioner takes the locations of data blocks into account when loading data from HDFS into RDDs and when shuffling data from upstream operators to downstream operators during joins; this avoids data interchange between nodes and speeds up query processing. Finally, we present experimental results demonstrating that the fiber-based partitioner improves entity-centric queries.

Qiao Sun, Xiongpai Qin, Buqiao Deng, Wei Cui
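
The partitioning idea in the log-analysis abstract above can be illustrated with a minimal sketch. It assumes that a "fiber" is the set of log records belonging to one entity, which the abstract does not define precisely; the function name `fiber_partition`, the `entity_id` field, and the sample records are hypothetical and are not taken from the paper or from Spark. The sketch only shows why co-locating all records of an entity in one partition lets per-entity analysis run without moving data between partitions.

```python
import zlib
from collections import defaultdict


def fiber_partition(records, num_partitions, key="entity_id"):
    """Assign every record of the same entity to the same partition.

    Hypothetical illustration: a stable hash of the entity key picks the
    partition, so later per-entity processing needs no cross-partition shuffle.
    """
    partitions = defaultdict(list)
    for record in records:
        # crc32 is deterministic across runs, unlike Python's built-in
        # hash() for strings, which is randomized per process.
        pid = zlib.crc32(str(record[key]).encode("utf-8")) % num_partitions
        partitions[pid].append(record)
    return dict(partitions)


# Hypothetical log records for two entities.
logs = [
    {"entity_id": "user-1", "event": "login"},
    {"entity_id": "user-2", "event": "view"},
    {"entity_id": "user-1", "event": "purchase"},
]
parts = fiber_partition(logs, num_partitions=4)

# Per-entity analysis can now run independently inside each partition.
for pid, recs in sorted(parts.items()):
    print(pid, [(r["entity_id"], r["event"]) for r in recs])
```

A real partitioner of the kind the abstract describes would also take HDFS block locations into account; this sketch leaves that out.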

A Tool for 3D Visualizing Moving Objects

Visually representing query results and complex data structures in a database system provides a convenient way for users to understand and analyze the data. In this demo, we present a tool for 3D visualization of moving objects, i.e., spatial objects that continuously change location over time. Instead of simply reporting numbers and lines, the tool animates and graphically displays moving objects together with their dynamic attributes in a unified way, which supports a comprehensive understanding of spatio-temporal movement. Furthermore, we show how the tool provides powerful visual metaphors for exploring the index structure, so that one can quickly determine whether the structure has a good shape, i.e., whether it preserves spatio-temporal proximity well. The tool is not standalone software but is embedded in a database system.

Weiwei Wang, Jianqiu Xu

Backmatter

Title: Web and Big Data
Edited by: Lei Chen
Christian S. Jensen
Cyrus Shahabi
Xiaochun Yang
Xiang Lian
Publisher: Springer International Publishing
Electronic ISBN: 978-3-319-63564-4
Print ISBN: 978-3-319-63563-7
DOI: https://doi.org/10.1007/978-3-319-63564-4