Skip to main content
main-content

Über dieses Buch

This two volume set of LNCS 11029 and LNCS 11030 constitutes the refereed proceedings of the 29th International Conference on Database and Expert Systems Applications, DEXA 2018, held in Regensburg, Germany, in September 2018.

The 35 revised full papers presented together with 40 short papers were carefully reviewed and selected from 160 submissions. The papers of the first volume discuss a range of topics including: Big data analytics; data integrity and privacy; decision support systems; data semantics; cloud data processing; time series data; social networks; temporal and spatial databases; and graph data and road networks. The papers of the second volume discuss a range of the following topics: Information retrieval; uncertain information; data warehouses and recommender systems; data streams; information networks and algorithms; database system architecture and performance; novel database solutions; graph querying and databases; learning; emerging applications; data mining; privacy; and text processing.

Inhaltsverzeichnis

Frontmatter

Big Data Analytics

Frontmatter

Scalable Vertical Mining for Big Data Analytics of Frequent Itemsets

Advances in technology and the increasing growth of popularity on Internet of Things (IoT) for many applications have produced huge volume of data at a high velocity. These valuable big data can be of a wide variety or different veracity. Embedded in these big data are useful information and valuable knowledge. This leads to data science, which aims to apply big data analytics to mine implicit, previously unknown and potentially useful information from big data. As a popular data analytic task, frequent itemset mining discovers knowledge about sets of frequently co-occurring items in the big data. Such a task has drawn attention in both academia and industry partially due to its practicality in various real-life applications. Existing mining approaches mostly use serial, distributed or parallel algorithms to mine the data horizontally (i.e., on a transaction basis). In this paper, we present an alternative big data analytic approach. Specifically, our scalable algorithm uses the MapReduce programming model that runs in a Spark environment to mine the data vertically (i.e., on an item basis). Evaluation results show the effectiveness of our algorithm in big data analytics of frequent itemsets.
Carson K. Leung, Hao Zhang, Joglas Souza, Wookey Lee

ScaleSCAN: Scalable Density-Based Graph Clustering

How can we efficiently find clusters (a.k.a. communities) included in a graph with millions or even billions of edges? Density-based graph clustering SCAN is one of the fundamental graph clustering algorithms that can find densely connected nodes as clusters. Although SCAN is used in many applications due to its effectiveness, it is computationally expensive to apply SCAN to large-scale graphs since SCAN needs to compute all nodes and edges. In this paper, we propose a novel density-based graph clustering algorithm named ScaleSCAN for tackling this problem on a multicore CPU. Towards the problem, ScaleSCAN integrates efficient node pruning methods and parallel computation schemes on the multicore CPU for avoiding the exhaustive nodes and edges computations. As a result, ScaleSCAN detects exactly same clusters as those of SCAN with much shorter computation time. Extensive experiments on both real-world and synthetic graphs demonstrate that the performance superiority of ScaleSCAN over the state-of-the-art methods.
Hiroaki Shiokawa, Tomokatsu Takahashi, Hiroyuki Kitagawa

Sequence-Based Approaches to Course Recommender Systems

The scope and order of courses to take to graduate are typically defined, but liberal programs encourage flexibility and may generate many possible paths to graduation. Students and course counselors struggle with the question of choosing a suitable course at a proper time. Many researchers have focused on making course recommendations with traditional data mining techniques, yet failed to take a student’s sequence of past courses into consideration. In this paper, we study sequence-based approaches for the course recommender system. First, we implement a course recommender system based on three different sequence related approaches: process mining, dependency graph and sequential pattern mining. Then, we evaluate the impact of the recommender system. The result shows that all can improve the performance of students while the approach based on dependency graph contributes most.
Ren Wang, Osmar R. Zaïane

Data Integrity and Privacy

Frontmatter

BFASTDC: A Bitwise Algorithm for Mining Denial Constraints

Integrity constraints (ICs) are meant for many data management tasks. However, some types of ICs can express semantic rules that others ICs cannot, or vice versa. Denial constraints (DCs) are known to be a response to this expressiveness issue because they generalize important types of ICs, such as functional dependencies (FDs), conditional FDs, and check constraints. In this regard, automatic DC discovery is essential to avoid the expensive and error-prone task of manually designing DCs. FASTDC is an algorithm that serves this purpose, but it is highly sensitive to the number of records in the dataset. This paper presents BFASTDC, a bitwise version of FASTDC that uses logical operations to form the auxiliary data structures from which DCs are mined. Our experimental study shows that BFASTDC can be more than one order of magnitude faster than FASTDC.
Eduardo H. M. Pena, Eduardo Cunha de Almeida

BOUNCER: Privacy-Aware Query Processing over Federations of RDF Datasets

Data provides the basis for emerging scientific and interdisciplinary data-centric applications with the potential of improving the quality of life for the citizens. However, effective data-centric applications demand data management techniques able to process a large volume of data which may include sensitive data, e.g., financial transactions, medical procedures, or personal data. Managing sensitive data requires the enforcement of privacy and access control regulations, particularly, during the execution of queries against datasets that include sensitive and non-sensitive data. In this paper, we tackle the problem of enforcing privacy regulations during query processing, and propose BOUNCER, a privacy-aware query engine over federations of RDF datasets. BOUNCER allows for the description of RDF datasets in terms of RDF molecule templates, i.e., abstract descriptions of the properties of the entities in an RDF dataset and their privacy regulations. Furthermore, BOUNCER implements query decomposition and optimization techniques able to identify query plans over RDF datasets that not only contain the relevant entities to answer a query, but that are also regulated by policies that allow for accessing these relevant entities. We empirically evaluate the effectiveness of the BOUNCER privacy-aware techniques over state-of-the-art benchmarks of RDF datasets. The observed results suggest that BOUNCER can effectively enforce access control regulations at different granularity without impacting the performance of query processing.
Kemele M. Endris, Zuhair Almhithawi, Ioanna Lytra, Maria-Esther Vidal, Sören Auer

Minimising Information Loss on Anonymised High Dimensional Data with Greedy In-Memory Processing

Minimising information loss on anonymised high dimensional data is important for data utility. Syntactic data anonymisation algorithms address this issue by generating datasets that are neither use-case specific nor dependent on runtime specifications. This results in anonymised datasets that can be re-used in different scenarios which is performance efficient. However, syntactic data anonymisation algorithms incur high information loss on high dimensional data, making the data unusable for analytics. In this paper, we propose an optimised exact quasi-identifier identification scheme, based on the notion of k-anonymity, to generate anonymised high dimensional datasets efficiently, and with low information loss. The optimised exact quasi-identifier identification scheme works by identifying and eliminating maximal partial unique column combination (mpUCC) attributes that endanger anonymity. By using in-memory processing to handle the attribute selection procedure, we significantly reduce the processing time required. We evaluated the effectiveness of our proposed approach with an enriched dataset drawn from multiple real-world data sources, and augmented with synthetic values generated in close alignment with the real-world data distributions. Our results indicate that in-memory processing drops attribute selection time for the mpUCC candidates from 400s to 100s, while significantly reducing information loss. In addition, we achieve a time complexity speed-up of \(O(3^{n/3})\approx O(1.4422^{n})\).
Nikolai J. Podlesny, Anne V. D. M. Kayem, Stephan von Schorlemer, Matthias Uflacker

Decision Support Systems

Frontmatter

A Diversification-Aware Itemset Placement Framework for Long-Term Sustainability of Retail Businesses

In addition to maximizing the revenue, retailers also aim at diversifying product offerings for facilitating sustainable revenue generation in the long run. Thus, it becomes a necessity for retailers to place appropriate itemsets in a limited k number of premium slots in retail stores for achieving the goals of revenue maximization and itemset diversification. In this regard, research efforts are being made to extract itemsets with high utility for maximizing the revenue, but they do not consider itemset diversification i.e., there could be duplicate (repetitive) items in the selected top-utility itemsets. Furthermore, given utility and support thresholds, the number of candidate itemsets of all sizes generated by existing utility mining approaches typically explodes. This leads to issues of memory and itemset retrieval times. In this paper, we present a framework and schemes for efficiently retrieving the top-utility itemsets of any given itemset size based on both revenue as well as the degree of diversification. Here, higher degree of diversification implies less duplicate items in the selected top-utility itemsets. The proposed schemes are based on efficiently determining and indexing the top-λ high-utility and diversified itemsets. Experiments with a real dataset show the overall effectiveness and scalability of the proposed schemes in terms of execution time, revenue and degree of diversification w.r.t. a recent existing scheme.
Parul Chaudhary, Anirban Mondal, Polepalli Krishna Reddy

Global Analysis of Factors by Considering Trends to Investment Support

Understanding the factors affecting financial products is important for making investment decisions. Conventional factor analysis methods focus on revealing the impact of factors over a certain period locally, and it is not easy to predict net asset values. As a reasonable solution for the prediction of net asset values, in this paper, we propose a trend shift model for the global analysis of factors by introducing trend change points as shift interference variables into state space models. In addition, to realize the trend shift model efficiently, we propose an effective trend detection method, TP-TBSM (two-phase TBSM), by extending TBSM (trend-based segmentation method). The experimental results validate the proposed model and method.
Makoto Kirihata, Qiang Ma

Efficient Aggregation Query Processing for Large-Scale Multidimensional Data by Combining RDB and KVS

This paper presents a highly efficient aggregation query processing method for large-scale multidimensional data. Recent developments in network technologies have led to the generation of a large amount of multidimensional data, such as sensor data. Aggregation queries play an important role in analyzing such data. Although relational databases (RDBs) support efficient aggregation queries with indexes that enable faster query processing, increasing data size may lead to bottlenecks. On the other hand, the use of a distributed key-value store (D-KVS) is key to obtaining scale-out performance for data insertion throughput. However, querying multidimensional data sometimes requires a full data scan owing to its insufficient support for indexes. The proposed method combines an RDB and D-KVS to use their advantages complementarily. In addition, a novel technique is presented wherein data are divided into several subsets called grids, and the aggregated values for each grid are precomputed. This technique improves query processing performance by reducing the amount of scanned data. We evaluated the efficiency of the proposed method by comparing its performance with current state-of-the-art methods and showed that the proposed method performs better than the current ones in terms of query and insertion.
Yuya Watari, Atsushi Keyaki, Jun Miyazaki, Masahide Nakamura

Data Semantics

Frontmatter

Learning Interpretable Entity Representation in Linked Data

Linked Data has become a valuable source of factual records. However, because of its simple representations of records (i.e., a set of triples), learning representations of entities is required for various applications such as information retrieval and data mining. Entity representations can be roughly classified into two categories; (1) interpretable representations, and (2) latent representations. Interpretability of learned representations is important for understanding relationship between two entities, like why they are similar. Therefore, this paper focuses on the former category. Existing methods are based on heuristics which determine relevant fields (i.e., predicates and related entities) to constitute entity representations. Since the heuristics require laboursome human decisions, this paper aims at removing the labours by applying a graph proximity measurement. To this end, this paper proposes RWRDoc, an RWR (random walk with restart)-based representation learning method which learns representations of entities by weighted combinations of minimal representations of whole reachable entities w.r.t. RWR. Comprehensive experiments on diverse applications (such as ad-hoc entity search, recommender system using Linked Data, and entity summarization) indicate that RWRDoc learns proper interpretable entity representations.
Takahiro Komamizu

GARUM: A Semantic Similarity Measure Based on Machine Learning and Entity Characteristics

Knowledge graphs encode semantics that describes entities in terms of several characteristics, e.g., attributes, neighbors, class hierarchies, or association degrees. Several data-driven tasks, e.g., ranking, clustering, or link discovery, require for determining the relatedness between knowledge graph entities. However, state-of-the-art similarity measures may not consider all the characteristics of an entity to determine entity relatedness. We address the problem of similarity assessment between knowledge graph entities and devise GARUM, a semantic similarity measure for knowledge graphs. GARUM relies on similarities of entity characteristics and computes similarity values considering simultaneously several entity characteristics. This combination can be manually or automatically defined with the help of a machine learning approach. We empirically evaluate the accuracy of GARUM on knowledge graphs from different domains, e.g., networks of proteins and media news. In the experimental study, GARUM exhibits higher correlation with gold standards than studied existing approaches. Thus, these results suggest that similarity measures should not consider entity characteristics in isolation; contrary, combinations of these characteristics are required to precisely determine relatedness among entities in a knowledge graph. Further, the combination functions found by a machine learning approach outperform the results obtained by the manually defined aggregation functions.
Ignacio Traverso-Ribón, Maria-Esther Vidal

Knowledge Graphs for Semantically Integrating Cyber-Physical Systems

Cyber-Physical Systems (CPSs) are engineered systems that result from the integration of both physical and computational components designed from different engineering perspectives (e.g., mechanical, electrical, and software). Standards related to Smart Manufacturing (e.g., AutomationML) are used to describe CPS components, as well as to facilitate their integration. Albeit expressive, smart manufacturing standards allow for the representation of the same features in various ways, thus hampering a fully integrated description of a CPS component. We tackle this integration problem of CPS components and propose an approach that captures the knowledge encoded in smart manufacturing standards to effectively describe CPSs. We devise SemCPS, a framework able to combine Probabilistic Soft Logic and Knowledge Graphs to semantically describe both a CPS and its components. We have empirically evaluated SemCPS on a benchmark of AutomationML documents describing CPS components from various perspectives. Results suggest that SemCPS enables not only the semantic integration of the descriptions of CPS components, but also allows for preserving the individual characterization of these components.
Irlán Grangel-González, Lavdim Halilaj, Maria-Esther Vidal, Omar Rana, Steffen Lohmann, Sören Auer, Andreas W. Müller

Cloud Data Processing

Frontmatter

Efficient Top-k Cloud Services Query Processing Using Trust and QoS

The growing number of cloud service providers has led to an exploding number of functionally similar cloud services, with a wide range choices of non-functional properties (NFPs). Thus, selecting services based on NFPs becomes of significant importance. Current approaches assume that a cloud service provider offers a single service per functionality, therefore, they do not perform well in the real-life setting where each cloud service provider offers different service plans. In contrast, in this paper, we propose an approach to select top-k cloud services that is built taking into account the real-life setting. Our approach combines the trust, determined by the reputation of the provider, and the QoS. We present different algorithms for processing such selection queries and evaluate them through a set of experiments.
Karim Benouaret, Idir Benouaret, Mahmoud Barhamgi, Djamal Benslimane

Answering Top-k Queries over Outsourced Sensitive Data in the Cloud

The cloud provides users and companies with powerful capabilities to store and process their data in third-party data centers. However, the privacy of the outsourced data is not guaranteed by the cloud providers. One solution for protecting the user data is to encrypt it before sending to the cloud. Then, the main problem is to evaluate user queries over the encrypted data.
In this paper, we consider the problem of answering top-k queries over encrypted data. We propose a novel system, called BuckTop, designed to encrypt and outsource the user sensitive data to the cloud. BuckTop comes with a top-k query processing algorithm that is able to process efficiently top-k queries over the encrypted data, without decrypting the data in the cloud data centers.
We implemented BuckTop and compared its performance for processing top-k queries over encrypted data with that of the popular threshold algorithm (TA) over original (plaintext) data. The results show the effectiveness of BuckTop for outsourcing sensitive data in the cloud and answering top-k queries.
Sakina Mahboubi, Reza Akbarinia, Patrick Valduriez

-Tree: An Efficient Indexing Scheme for Server-Centric Data Center Networks

Index plays a very important role in cloud storage systems, which can support efficient querying tasks for data-intensive applications. However, most of existing indexing schemes for data centers focus on one specific topology and cannot be migrated directly to the other networks. In this paper, based on the observation that server-centric data center networks (DCNs) are recursively defined, we propose pattern vector, which can formulate the server-centric topologies more generally and design \(R^2\)-Tree, a scalable two-layer indexing scheme with a local R-Tree and a global R-Tree to support multi-dimensional query. To show the efficiency of \(R^2\)-Tree, we start from a case study for two-dimensional data. We use a layered global index to reduce the query scale by hierarchy and design a method called Mutex Particle Function (MPF) to determine the potential indexing range. MPF helps to balance the workload and reduce routing cost greatly. Then, we extend \(R^2\)-Tree indexing scheme to handle high-dimensional data query efficiently based on the topology feature. Finally, we demonstrate the superior performance of \(R^2\)-Tree in three typical server-centric DCNs on Amazon’s EC2 platform and validate its efficiency.
Yin Lin, Xinyi Chen, Xiaofeng Gao, Bin Yao, Guihai Chen

Time Series Data

Frontmatter

Monitoring Range Motif on Streaming Time-Series

Recent IoT-based applications generate time-series in a streaming fashion, and they often require techniques that enable environmental monitoring and event detection from generated time-series. Discovering a range motif, which is a subsequence that repetitively appears the most in a time-series, is a promising approach for satisfying such a requirement. This paper tackles the problem of monitoring a range motif of a streaming time-series under a count-based sliding-window setting. Whenever a window slides, a new subsequence is generated and the oldest subsequence is removed. A straightforward solution for monitoring a range motif is to scan all subsequences in the window while computing their occurring counts measured by a similarity function. Because the main bottleneck is similarity computation, this solution is not efficient. We therefore propose an efficient algorithm, namely SRMM. SRMM is simple and its time complexity basically depends only on the occurring counts of the removed and generated subsequences. Our experiments using four real datasets demonstrate that SRMM scales well and shows better performance than a baseline.
Shinya Kato, Daichi Amagata, Shunya Nishio, Takahiro Hara

MTSC: An Effective Multiple Time Series Compressing Approach

As the volume of time series data being accumulated is likely to soar, time series compression has become essential in a wide range of sensor-data applications, like Industry 4.0 and Smart grid. Compressing multiple time series simultaneously by exploiting the correlation between time series is more desirable. In this paper, we present MTSC, a novel approach to approximate multiple time series. First, we define a novel representation model, which uses a base series and a single value to represent each series. Second, two graph-based algorithms, \(MTSC_{mc}\) and \(MTSC_{star}\), are proposed to group time series into clusters. \(MTSC_{mc}\) can achieve higher compression ratio, while \(MTSC_{star}\) is much more efficient by sacrificing the compression ratio slightly. We conduct extensive experiments on real-world datasets, and the results verify that our approach outperforms existing approaches greatly.
Ningting Pan, Peng Wang, Jiaye Wu, Wei Wang

DancingLines: An Analytical Scheme to Depict Cross-Platform Event Popularity

Nowadays, events usually burst and are propagated online through multiple modern media like social networks and search engines. There exists various research discussing the event dissemination trends on individual medium, while few studies focus on event popularity analysis from a cross-platform perspective. In this paper, we design DancingLines, an innovative scheme that captures and quantitatively analyzes event popularity between pairwise text media. It contains two models: TF-SW, a semantic-aware popularity quantification model, based on an integrated weight coefficient leveraging Word2Vec and TextRank; and \(\omega \)DTW-CD, a pairwise event popularity time series alignment model matching different event phases adapted from Dynamic Time Warping. Experimental results on eighteen real-world datasets from an influential social network and a popular search engine validate the effectiveness and applicability of our scheme. DancingLines is demonstrated to possess broad application potentials for discovering knowledge related to events and different media.
Tianxiang Gao, Weiming Bao, Jinning Li, Xiaofeng Gao, Boyuan Kong, Yan Tang, Guihai Chen, Xuan Li

Social Networks

Frontmatter

Community Structure Based Shortest Path Finding for Social Networks

With the rapid expansion of communication data, research about analyzing social networks has become a hotspot. Finding the shortest path (SP) in social networks can help us to investigate the potential social relationships. However, it is an arduous task, especially on large-scale problems. There have been many previous studies on the SP problem, but very few of them considered the peculiarity of social networks. This paper proposed a community structure based method to accelerate answering the SP problem of social networks during online queries. We devise a two-stage strategy to strike a balance between offline pre-computation and online consultations. Our goal is to perform fast and accurate online approximations. Experiments show that our method can instantly return the SP result while satisfying accuracy constraint.
Yale Chai, Chunyao Song, Peng Nie, Xiaojie Yuan, Yao Ge

On Link Stability Detection for Online Social Networks

Link stability detection has been an important and long-standing problem within the link prediction domain. However, it has often been overlooked as being trivial and has not been adequately dealt with in link prediction. In this paper, we present an innovative method: Multi-Variate Vector Autoregression (MVVA) analysis to determine link stability. Our method adopts link dynamics to establish stability confidence scores within a clique sized model structure observed over a period of 30 days. Our method also improves detection accuracy and representation of stable links through a user-friendly interactive interface. In addition, a good accuracy to performance trade-off in our method is achieved through the use of Random Walk Monte Carlo estimates. Experiments with Facebook datasets reveal that our method performs better than traditional univariate methods for stability identification in online social networks.
Ji Zhang, Xiaohui Tao, Leonard Tan, Jerry Chun-Wei Lin, Hongzhou Li, Liang Chang

EPOC: A Survival Perspective Early Pattern Detection Model for Outbreak Cascades

The past few decades have witnessed the booming of social networks, which leads to a lot of researches exploring information dissemination. However, owing to the insufficient information exposed before the outbreak of the cascade, many previous works fail to fully catch its characteristics, and thus usually model the burst process in a rough manner. In this paper, we employ survival theory and design a novel survival perspective Early Pattern detection model for Outbreak Cascades (in abbreviation, EPOC), which utilizes information both from the static nature and its later diffusion process. To classify the cascades, we employ two Gaussian distributions to get the optimal boundary and also provide rigorous proof to testify its rationality. Then by utilizing both the survival boundary and hazard ceiling, we can precisely detect early pattern of outbreak cascades at very early stage. Experiment results demonstrate that under three practical and special metrics, our model outperforms the state-of-the-art baselines in this early-stage task.
Chaoqi Yang, Qitian Wu, Xiaofeng Gao, Guihai Chen

Temporal and Spatial Databases

Frontmatter

Analyzing Temporal Keyword Queries for Interactive Search over Temporal Databases

Querying temporal relational databases is a challenge for non-expert database users, since it requires users to understand the semantics of the database and apply temporal joins as well as temporal conditions correctly in SQL statements. Traditional keyword search approaches are not directly applicable to temporal relational databases since they treat time-related keywords as tuple values and do not consider the temporal joins between relations, which leads to missing answers, incorrect answers and missing query interpretations. In this work, we extend keyword queries to allow the temporal predicates, and design a schema graph approach based on the Object-Relationship-Attribute (ORA) semantics. This approach enables us to identify temporal attributes of objects/relationships and infer the target temporal data of temporal predicates, thus improving the completeness and correctness of temporal keyword search and capturing the various possible interpretations of temporal keyword queries. We also propose a two-level ranking scheme for the different interpretations of a temporal query, and develop a prototype system to support interactive keyword search.
Qiao Gao, Mong Li Lee, Tok Wang Ling, Gillian Dobbie, Zhong Zeng

Implicit Representation of Bigranular Rules for Multigranular Data

Domains for spatial and temporal data are often multigranular in nature, possessing a natural order structure defined by spatial inclusion and time-interval inclusion, respectively. This order structure induces lattice-like (partial) operations, such as join, which in turn lead to join rules, in which a single domain element (granule) is asserted to be equal to, or contained in, the join of a set of such granules. In general, the efficient representation of such join rules is a difficult problem. However, there is a very effective representation in the case that the rule is bigranular; i.e., all of the joined elements belong to the same granularity, and, in addition, complete information about the (non)disjointness of all granules involved is known. The details of that representation form the focus of the paper.
Stephen J. Hegner, M. Andrea Rodríguez

QDR-Tree: An Efficient Index Scheme for Complex Spatial Keyword Query

With the popularity of mobile devices and the development of geo-positioning technology, location-based services (LBS) attract much attention and top-k spatial keyword queries become increasingly complex.It is common to see that clients issue a query to find a restaurant serving pizza and steak, low in price and noise level particularly.However, most of prior works focused only on the spatial keyword while ignoring these independent numerical attributes.
      In this paper we demonstrate, for the first time, the Attributes-Aware Spatial Keyword Query (ASKQ), and devise a two-layer hybrid index structure called Quad-cluster Dual-filtering R-Tree (QDR-Tree). In the keyword cluster layer, a Quad-Cluster Tree (QC-Tree) is built based on the hierarchical clustering algorithm using kernel k-means to classify keywords.In the spatial layer, for each leaf node of the QC-Tree, we attach a Dual-Filtering R-Tree (DR-Tree) with two filtering algorithms, namely, keyword bitmap-based and attributes skyline-based filtering. Accordingly, efficient query processing algorithms are proposed.
      Through theoretical analysis, we have verified the optimization both in processing time and space consumption. Finally, massive experiments with real-data demonstrate the efficiency and effectiveness of QDR-Tree.
Xinshi Zang, Peiwen Hao, Xiaofeng Gao, Bin Yao, Guihai Chen

Graph Data and Road Networks

Frontmatter

Approximating Diversified Top-k Graph Pattern Matching

Graph pattern matching has been increasingly used in e.g., social network analysis. As the matching semantic is typically defined in terms of subgraph isomorphism, several problems are raised: (1) matching computation is often very expensive, due to the intractability of the problem, (2) the semantic is often too strict to identify meaningful matches, and (3) there may exist excessive matches which makes inspection very difficult. On the other hand, users are often interested in diversified top-k matches, rather than entire match set, since result diversification has been proven effective in improving users’ satisfaction, and top-k matches not only eases result understanding but also can save the cost of matching computation. Motivated by these, this paper investigates approximating diversified top-k graph pattern matching. (1) We extend traditional notion of subgraph isomorphism by allowing edge to path mapping, and define matching based on the revised notion. With the extension, more meaningful matches could be captured. (2) We propose two functions for ranking matches: a relevance function \(w(\cdot )\) based on tightness of connectivity, and a distance function \(d(\cdot )\) measuring match diversity. Based on relevance and distance functions, we propose diversification function \(F(\cdot )\), and formalize the diversified top-k graph pattern matching problem using \(F(\cdot )\). (3) Despite hardness of the problem, we provide two approximation algorithms with performance guarantees, and one of them even preserves early termination property. (4) Using real-life and synthetic data, we experimentally verify that our approximation algorithms are effective, and outperform traditional matching algorithms.
Xin Wang, Huayi Zhan

Boosting PageRank Scores by Optimizing Internal Link Structure

We consider and formulate problems of PageRank score boosting motivated by applications such as effective web advertising. More precisely, given a graph and target vertices, one is required to find a fixed-size set of missing edges that maximizes the minimum PageRank score among the targets. We provide theoretical analyses to show that all of them are NP-hard. To overcome the hardness, we develop heuristic-based algorithms for them. We finally perform experiments on several real-world networks to verify the effectiveness of the proposed algorithms compared to baselines. Specifically, our algorithm achieves 100 times improvements of the minimum PageRank score among selected 100 vertices by adding only dozens of edges.
Naoto Ohsaka, Tomohiro Sonobe, Naonori Kakimura, Takuro Fukunaga, Sumio Fujita, Ken-ichi Kawarabayashi

Finding the Most Navigable Path in Road Networks: A Summary of Results

Input to the Most Navigable Path (MNP) problem consists of the following: (a) a road network represented as a directed graph, where each edge is associated with numeric attributes of cost and “navigability score” values; (b) a source and a destination and; (c) a budget value which denotes the maximum permissible cost of the solution. Given the input, MNP aims to determine a path between the source and the destination which maximizes the navigability score while constraining its cost to be within the given budget value. This problem finds its applications in navigation systems for developing nations where streets, quite often, do not display their names. MNP problem would help in such cases by providing routes which are more convenient for a driver to identify and follow. Our problem is modeled as the arc orienteering problem which is known to be NP-hard. The current state-of-the-art for this problem may generate paths having loops, and its adaptation for MNP, that yields simple paths, was found to be inefficient. In this paper, we propose two novel algorithms for the MNP problem. Our experimental results indicate that the proposed solutions yield comparable or better solutions while being orders of magnitude faster than the current state-of-the-art for large real road networks. We also propose an indexing structure for the MNP problem which significantly reduces the running time of our algorithms.
Ramneek Kaur, Vikram Goyal, Venkata M. V. Gunturi

Load Balancing in Network Voronoi Diagrams Under Overload Penalties

Input to the problem of Load Balanced Network Voronoi Diagram (LBNVD) consists of the following: (a) a road network represented as a directed graph; (b) locations of service centers (e.g., schools in a city) as vertices in the graph and; (c) locations of demand (e.g., school children) also as vertices in the graph. In addition, each service center is also associated with a notion of capacity and an overload penalty which is “charged” if the service center gets overloaded. Given the input, the goal of the LBNVD problem is to determine an assignment where each of the demand vertices is allotted to a service center. The objective here is to generate an assignment which minimizes the sum of the following two terms: (i) total distance between demand vertices and their allotted service centers and, (ii) total penalties incurred while overloading the service centers. The problem of LBNVD finds its application in the domain of urban planning. Research literature relevant to this problem either assume infinite capacity or do not consider the concept of “overload penalty.” These assumptions are relaxed in our LBNVD problem. We develop a novel algorithm for the LBNVD problem and provide a theoretical upper bound on its worst-case performance (in terms of solution quality). We also present the time complexity of our algorithm and compare against the related work experimentally using real datasets.
Ankita Mehta, Kapish Malik, Venkata M. V. Gunturi, Anurag Goel, Pooja Sethia, Aditi Aggarwal

Backmatter

Weitere Informationen

Premium Partner

BranchenIndex Online

Die B2B-Firmensuche für Industrie und Wirtschaft: Kostenfrei in Firmenprofilen nach Lieferanten, Herstellern, Dienstleistern und Händlern recherchieren.

Whitepaper

- ANZEIGE -

Best Practices für die Mitarbeiter-Partizipation in der Produktentwicklung

Unternehmen haben das Innovationspotenzial der eigenen Mitarbeiter auch außerhalb der F&E-Abteilung erkannt. Viele Initiativen zur Partizipation scheitern in der Praxis jedoch häufig. Lesen Sie hier  - basierend auf einer qualitativ-explorativen Expertenstudie - mehr über die wesentlichen Problemfelder der mitarbeiterzentrierten Produktentwicklung und profitieren Sie von konkreten Handlungsempfehlungen aus der Praxis.
Jetzt gratis downloaden!

Bildnachweise