
2024 | Book

Databases Theory and Applications

34th Australasian Database Conference, ADC 2023, Melbourne, VIC, Australia, November 1-3, 2023, Proceedings

Editors: Zhifeng Bao, Renata Borovica-Gajic, Ruihong Qiu, Farhana Choudhury, Zhengyi Yang

Publisher: Springer Nature Switzerland

Book Series: Lecture Notes in Computer Science


About this book

This book constitutes the refereed proceedings of the 34th Australasian Database Conference on Databases Theory and Applications, ADC 2023, held in Melbourne, VIC, Australia, during November 1-3, 2023.

The 26 full papers presented in this volume were carefully reviewed and selected from 41 submissions. They are organized in the following topical sections: Mining Complex Types of Data; Natural Language Processing and Text Analysis; Machine Learning and Computer Vision; Database Systems and Data Storage; Data Quality and Fairness for Graphs; and Graph Mining and Graph Algorithms.

Table of Contents

Frontmatter

Mining Complex Types of Data

Frontmatter
kNN Join for Dynamic High-Dimensional Data: A Parallel Approach
Abstract
The k nearest neighbor (kNN) join operation is a fundamental task that combines two high-dimensional databases, enabling data points in the User dataset U to identify their k nearest neighbor points from the Item dataset I. This operation plays a crucial role in various domains, including knowledge discovery, data mining, similarity search applications, and scientific research. However, exact kNN search in high-dimensional spaces is computationally demanding, and existing sequential methods face challenges in handling large datasets. In this paper, we propose an efficient parallel solution for dynamic kNN join over high-dimensional data, leveraging the high-dimensional R tree (HDR Tree) for improved efficiency. Our solution harnesses the power of Simultaneous Multi-Threading (SMT) technologies and Single-Instruction-Multiple-Data (SIMD) instructions in modern CPUs for parallelisation. Importantly, our research is the first to introduce parallel computation for exact kNN join over high-dimensional data. Experimental results demonstrate that our proposed approach outperforms the sequential HDR Tree method by up to 1.2 times with a single thread. Moreover, our solution provides near-linear scalability as the number of threads increases.
Nimish Ukey, Zhengyi Yang, Wenke Yang, Binghao Li, Runze Li
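For illustration, here is a minimal brute-force sketch of the kNN join operation itself, with Python's ThreadPoolExecutor standing in for the SMT parallelism the paper describes; the HDR Tree index and SIMD kernels are the paper's contribution and are not reproduced.

```python
# Minimal kNN-join sketch: for each point in U, find its k nearest
# neighbours in I. Brute force with thread-level parallelism; the
# paper's HDR Tree index and SIMD kernels are not reproduced here.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def knn_join(U, I, k, n_threads=4):
    def knn_of(u):
        d = np.linalg.norm(I - u, axis=1)   # distances to all items
        return np.argpartition(d, k)[:k]    # indices of k nearest (unordered)
    with ThreadPoolExecutor(max_workers=n_threads) as ex:
        return list(ex.map(knn_of, U))

rng = np.random.default_rng(0)
U, I = rng.random((100, 64)), rng.random((1000, 64))
print(knn_join(U, I, k=5)[0])
```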
Take a Close Look at the Optimization of Deep Kernels for Non-parametric Two-Sample Tests
Abstract
The maximum mean discrepancy (MMD) test with a deep kernel is a powerful method to distinguish whether two samples are drawn from the same distribution. Recent studies aim to maximize the test power of the MMD test to find the best deep kernel for testing, where the test power is the ratio of the MMD value to MMD's asymptotic variance. However, in this paper, we find that direct maximization of the test power sometimes leads to an unreasonable case where the MMD value is very small but the test power is large. In this case, the testing performance is not satisfactory. Thus we propose two main methods to simultaneously maximize the test power and the MMD value with a deep kernel, by combining methods from non-smooth optimization with methods from Pareto optimization. Experiments verify the effectiveness of the two methods on two benchmark datasets in the field of two-sample testing.
Xunye Tian, Feng Liu
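For reference, a minimal sketch of the unbiased MMD² estimate with a plain Gaussian kernel (the paper optimizes a deep kernel, and its test-power criterion divides this quantity by the estimated asymptotic variance):

```python
# Unbiased MMD^2 estimate with a Gaussian kernel; the abstract's test
# power criterion is the ratio of this value to its asymptotic variance.
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2_unbiased(X, Y, sigma=1.0):
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = (gaussian_gram(X, X, sigma),
                     gaussian_gram(Y, Y, sigma),
                     gaussian_gram(X, Y, sigma))
    term_x = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))  # drop diagonal
    term_y = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * Kxy.mean()

rng = np.random.default_rng(0)
X, Y = rng.normal(0, 1, (200, 2)), rng.normal(0.5, 1, (200, 2))
print(mmd2_unbiased(X, Y))  # positive in expectation for distinct distributions
```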
Probabilistic Reverse Top-k Query on Probabilistic Data
Abstract
Reverse top-k queries have received much attention from research communities. The result of a reverse top-k query is the set of objects that are of the k-most interest according to users' preferences. Moreover, answering such queries on probabilistic data has been studied in many applications. The most common problem with uncertain queries is how to calculate their probabilities. Currently, there are some proposed solutions for selecting answers to queries and calculating probabilistic values based on users' preferences. In this paper, we study answering reverse top-k queries on probabilistic data. Firstly, we propose a novel method to calculate the probabilities of tuples based on expected values. Secondly, we present the advantages of our approach over the traditional approach. Furthermore, we improve the new algorithm using two techniques: an R-tree index and upper-lower bounding. The experimental results illustrate the efficiency of the proposed algorithm compared to the traditional algorithms in terms of scalability.
Trieu Minh Nhut Le, Jinli Cao
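A minimal sketch of the deterministic reverse top-k semantics under linear preference scores (which users rank a query object among their top-k); the paper's probabilistic machinery, R-tree index, and bounding techniques are not reproduced:

```python
# Deterministic reverse top-k sketch: return the users whose linear
# preference scores place the query object q among their top-k items.
import numpy as np

def reverse_topk(users, items, q, k):
    result = []
    for u, w in users.items():                    # w: preference weights
        scores = items @ w                        # score every item for u
        if (w @ q) >= np.partition(scores, -k)[-k]:
            result.append(u)                      # q reaches u's top-k
    return result

items = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.9]])
users = {"u1": np.array([1.0, 0.0]), "u2": np.array([0.0, 1.0])}
print(reverse_topk(users, items, q=np.array([0.5, 0.5]), k=2))  # ['u1']
```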
Optimizing Taxi Route Planning Based on Taxi Trajectory Data Analysis
Abstract
In daily life, taxis have become one of the most common and convenient modes of public transportation. With the advancement of positioning and mobility technologies, a large amount of taxi trajectory data has been collected, providing valuable data resources for urban planning, traffic management, and personalized route recommendations. However, these huge datasets also pose computational and processing challenges. This study uses a year of taxi trajectory data for the city of Porto obtained from the Kaggle platform, containing more than 1.7 million records, to study data query and analysis in a big data environment. We focus on comparing the efficiency and overhead of two spatial index structures, the K-d tree and the R-tree, in handling such large-scale datasets. Experimental results show that the K-d tree has a time-efficiency advantage in K-nearest-neighbour query tasks, while the R-tree performs better in complex spatial query tasks. These findings provide important references for taxi route planning and other big data applications, especially in scenarios requiring efficient and accurate data retrieval.
Xinyi Yang, Zhi Chen, Yadan Luo
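For concreteness, a minimal K-nearest-neighbour query with SciPy's K-d tree, the kind of index the abstract's experiments compare; the synthetic coordinates below stand in for the Porto GPS points:

```python
# K-nearest-neighbours over 2-D GPS points with a K-d tree; synthetic
# coordinates stand in for the Porto taxi trajectory data.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
points = rng.uniform([41.1, -8.7], [41.2, -8.5], size=(100_000, 2))

tree = cKDTree(points)                       # build once, query many times
dist, idx = tree.query([41.15, -8.61], k=5)  # 5 nearest pickup points
print(idx, dist)
```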

Natural Language Processing and Text Analysis

Frontmatter
Learning Implicit Sentiment for Explainable Review-Based Recommendation
Abstract
Users can publish reviews to express their detailed feelings about items. Positive and negative sentiments about various aspects of an item co-existing in the same review may cause confusion in recommendations and generate inappropriate explanations. However, most current explainable recommendation methods fail to capture users' implicit sentiment behind the reviews. In this paper, we propose a novel Implicit Sentiment learning model for Explainable review-based Recommendation, named ISER, which learns users' implicit sentiments from reviews and exploits them to generate recommendations with more fine-grained explanations. Specifically, we first propose a novel representation learning method to model users/items based on the implicit sentiment behind the reviews. Then we propose two implicit sentiment fusion strategies for rating prediction and explanation generation respectively. Finally, we propose a multi-task learning framework to jointly optimize the rating prediction task and the explanation generation task, which improves the recommendation quality in a mutually promoting manner. The experiments demonstrate the effectiveness and efficiency of our proposed model compared to the baseline models.
Ningning Sun, Yue Kou, Xiangmin Zhou, Derong Shen, Dong Li, Tiezheng Nie
Balanced and Explainable Social Media Analysis for Public Health with Large Language Models
Abstract
As social media becomes increasingly popular, more and more public health activities emerge, which is worth noting for pandemic monitoring and government decision-making. Current techniques for public health analysis involve popular models such as BERT and large language models (LLMs). Although recent progress in LLMs has shown a strong ability to comprehend knowledge by being fine-tuned on specific domain datasets, training an in-domain LLM for every specific public health task is especially expensive. Furthermore, such in-domain datasets from social media are generally highly imbalanced, which hinders the efficiency of LLM tuning. To tackle these challenges, the data imbalance issue can be overcome by sophisticated data augmentation methods for social media datasets. In addition, the ability of the LLMs can be effectively utilised by prompting the model properly. In light of the above discussion, in this paper, a novel ALEX framework is proposed for social media analysis on public health. Specifically, an augmentation pipeline is developed to resolve the data imbalance issue. Furthermore, an LLM explanation mechanism is proposed by prompting an LLM with the predicted results from BERT models. Extensive experiments conducted on three tasks of the Social Media Mining for Health 2023 (SMM4H) competition, with first place achieved in two tasks, demonstrate the superior performance of the proposed ALEX method. Our code has been released at https://github.com/YanJiangJerry/ALEX.
Yan Jiang, Ruihong Qiu, Yi Zhang, Peng-Fei Zhang
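The abstract's explanation mechanism prompts an LLM with a BERT classifier's prediction; a hypothetical prompt-construction sketch of that idea (the template wording is an illustrative assumption, not the paper's actual prompt):

```python
# Hypothetical sketch of prompting an LLM with a BERT model's predicted
# label for explanation; the template wording is illustrative only.
def build_explanation_prompt(post: str, bert_label: str) -> str:
    return (
        "A classifier labelled the following social media post as "
        f"'{bert_label}' for a public health task.\n"
        f"Post: {post}\n"
        "Explain whether this label is justified, then answer yes or no."
    )

print(build_explanation_prompt("Got my flu shot today, arm is sore.",
                               "vaccine mention"))
```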
Prompt-Based Effective Input Reformulation for Legal Case Retrieval
Abstract
Legal case retrieval plays an important role for legal practitioners to effectively retrieve relevant cases given a query case. Most existing neural legal case retrieval models directly encode the whole legal text of a case to generate a case representation, which is then utilised to conduct a nearest neighbour search for retrieval. Although these straightforward methods have achieved improvement over conventional statistical methods in retrieval accuracy, two significant challenges are identified in this paper: (1) Legal feature alignment: the usage of the whole case text as the input will generally incorporate redundant and noisy information because, from the legal perspective, the determining factor of relevant cases is the alignment of key legal features instead of whole text matching; (2) Legal context preservation: furthermore, since the existing text encoding models usually have an input length limit shorter than the case, the whole case text needs to be truncated or divided into paragraphs, which leads to the loss of the global context of legal information. In this paper, a novel legal case retrieval framework, PromptCase, is proposed to tackle these challenges. Firstly, legal facts and legal issues are identified and formally defined as the key features facilitating legal case retrieval based on a thorough study of the definition of relevant cases from a legal perspective. Secondly, with the determining legal features, a prompt-based encoding scheme is designed to conduct an effective encoding with language models. Extensive zero-shot experiments have been conducted on two benchmark datasets in legal case retrieval, which demonstrate the superior retrieval effectiveness of the proposed PromptCase. The code has been released at https://github.com/yanran-tang/PromptCase.
Yanran Tang, Ruihong Qiu, Xue Li
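A hypothetical sketch of the input-reformulation idea: encode only the key legal features (facts and issues) instead of the full case text. The field wording is an illustrative assumption, not the paper's exact template:

```python
# Hypothetical sketch of prompt-based input reformulation: encode only
# the key legal features (facts, issues) instead of the whole case text.
def reformulate_case(facts: str, issues: str) -> str:
    return f"Legal facts: {facts} Legal issues: {issues}"

query = reformulate_case(
    facts="The appellant signed a guarantee without independent advice.",
    issues="Whether the guarantee is unenforceable for unconscionability.",
)
print(query)  # fed to the language-model encoder in place of the full case
```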
Surveying the Landscape: Compound Methods for Aspect-Based Sentiment Analysis
Abstract
Aspect-Based Sentiment Analysis (ABSA) has emerged as a critical research area in natural language processing, facilitating a deeper understanding of user opinions and sentiments expressed in text. This review article comprehensively surveys the landscape of deep learning approaches in ABSA, focusing on triplet and quadruplet ABSA. We delve into the significance of ABSA in diverse domains and present an overview of the critical components and challenges associated with compound ABSA tasks. The review analyzes state-of-the-art models, encompassing pipeline-based and generative solutions. Comparative analysis demonstrates the advantages and limitations of these approaches, including their performance, generalizability, and efficiency. Additionally, we explore the domains and datasets used in ABSA research and highlight the crucial factors contributing to the effectiveness of ABSA task solutions. This comprehensive review highlights current challenges and fosters further advancements in compound ABSA tasks.
Marwah Alharbi, Jiao Yin, Hua Wang
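For concreteness, triplet ABSA extracts (aspect term, opinion term, sentiment polarity) tuples, while quadruplet ABSA additionally predicts an aspect category; the category labels below are illustrative assumptions:

```python
# Output shapes of compound ABSA for "The battery life is great but the
# screen scratches easily": triplets are (aspect, opinion, polarity);
# quadruplets add an aspect category (labels here are illustrative).
triplets = [("battery life", "great", "positive"),
            ("screen", "scratches easily", "negative")]
quadruplets = [("battery life", "great", "positive", "BATTERY#QUALITY"),
               ("screen", "scratches easily", "negative", "SCREEN#QUALITY")]
```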

Machine Learning and Computer Vision

Frontmatter
Towards Reliable and Efficient Vegetation Segmentation for Australian Wheat Data Analysis
Abstract
Automated crop data analysis plays an important role in modern Australian agriculture. As one of the key procedures of this analysis, vegetation segmentation, which aims to predict pixel-level labels of vegetation images, has recently demonstrated advanced results on benchmark datasets. However, the promising results are built upon the assumption that the test data and the training data always follow an identical distribution. Due to differences in vegetation species, country, or illumination conditions, this assumption is commonly violated in real-world scenarios. As a pilot study, this work confirms that a model pre-trained on worldwide vegetation data suffers a degradation issue when applied to Australian wheat data. Instead of conducting expensive pixel-level annotation of Australian wheat data, we propose a self-training strategy that incorporates confidence-estimated pseudo-labeling of the wheat data in the training process to close the distribution gap. Meanwhile, to reduce the computational cost, we equip the lightweight transformer framework with a token clustering and reconstruction module. Extensive experimental results demonstrate that the proposed network can achieve 6.4% higher mIoU and 8.6% lower computational cost than the baseline methods.
Bowen Yuan, Zijian Wang, Xin Yu
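A minimal sketch of confidence-thresholded pseudo-labelling, the core of the self-training loop described above; the paper's confidence estimation and transformer details are not reproduced:

```python
# Confidence-thresholded pseudo-labelling sketch: keep only unlabelled
# samples whose predicted class probability exceeds a threshold, then
# mix them into the next training round.
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    conf = probs.max(axis=1)                    # per-sample confidence
    keep = conf >= threshold
    return np.flatnonzero(keep), probs.argmax(axis=1)[keep]

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=10)      # softmax outputs, 10 samples
idx, labels = select_pseudo_labels(probs, threshold=0.6)
print(idx, labels)                              # confident samples and labels
```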
SMST: A Saliency Map to Scanpath Transformer
Abstract
In the virtual reality (VR) environment, scanpath prediction is critical for saving rendering resources and guiding content design due to its omnidirectional characteristic. However, only a few scanpath prediction models have been proposed, compared with the blossoming of saliency map prediction models. Focusing on scanpath prediction in the omnidirectional image area, this paper introduces a novel model that transforms a predicted saliency map into a potential scanpath, named the saliency map to scanpath transformer (SMST). The model comprises three parts: filtering, clustering, and routing. Given a predicted saliency map in the VR environment, we first filter out mediocre areas to obtain the filtered saliency map. We then acquire a fixation set by clustering the filtered saliency map based on saliency distribution and taking the centers of the resulting clusters as potential fixation points. The fixation points are then connected, with the weights of the connections defined by the cylindrical distance and the primary visual features of the original image. Based on the network composed of fully connected fixation points, the scanpath prediction problem is converted to a routing problem, which aims to find the optimal route that visits every fixation point exactly once. We then propose a scanpath prediction model using an ant colony optimization algorithm. We evaluate the proposed model on multiple kinds of predicted saliency maps, and the prediction performance is promising.
Xi Cao, Yong-Feng Ge, Ying Lin
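A sketch of the filter and cluster steps on a toy saliency map, with a greedy nearest-neighbour route standing in for the paper's ant colony optimization and cylindrical distances:

```python
# Filter/cluster sketch: threshold a saliency map, cluster the salient
# pixels into fixation candidates, then visit them greedily (the paper
# uses ant colony optimization with cylindrical distances instead).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
saliency = rng.random((64, 128))                 # toy equirectangular map
ys, xs = np.nonzero(saliency > 0.95)             # filter mediocre areas
fix = KMeans(n_clusters=5, n_init=10).fit(np.c_[xs, ys]).cluster_centers_

route, left = [0], set(range(1, len(fix)))
while left:                                      # greedy routing stand-in
    last = fix[route[-1]]
    nxt = min(left, key=lambda j: np.linalg.norm(fix[j] - last))
    route.append(nxt); left.remove(nxt)
print(route)
```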
Multi-level Storage Optimization for Intermediate Data in AI Model Training
Abstract
As Transformer-based large models become the mainstream of AI training, the development of hardware devices (e.g., GPUs) cannot keep up with the rapid increase in model scale. Although the development of various parallel training techniques enables models to be trained on multiple GPUs, doing so still incurs costs that most researchers cannot afford. The rising hardware threshold for AI model training has limited the application of deep learning. In fact, CPU memory and external disk storage can be used as a cache, which can reduce the occupation of high-cost GPU memory. In this paper, we analyze two types of intermediate data used in AI model training and propose a multi-level intermediate data offloading policy for the training process. Firstly, we propose a dynamic management policy via warm-up to optimize GPU memory usage according to the characteristics of the AI training process. Secondly, we asynchronously offload a specified ratio of the optimizer state data to the HDD, which can further optimize CPU memory usage. We conduct experiments on the large pre-trained model GPT-2 to verify the effectiveness of our method, and the results indicate that the multi-level storage optimization of intermediate data can help to train larger AI models under constrained hardware resources.
Junfeng Fu, Yang Yang, Gang Hu, Xianqiang Luo, Jie Shao
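A minimal sketch of asynchronously offloading a fraction of the optimizer state to disk, as in the abstract's second policy; NumPy arrays and a thread pool stand in for the real GPU/CPU/HDD tiers:

```python
# Async offload sketch: write a specified ratio of the optimizer state
# to disk in a background thread so the HDD write overlaps compute.
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def offload(state: np.ndarray, ratio: float, path: str, pool):
    split = int(len(state) * ratio)
    cold, hot = state[:split], state[split:]
    future = pool.submit(np.save, path, cold)   # background HDD write
    return hot, future

pool = ThreadPoolExecutor(max_workers=1)
optimizer_state = np.random.rand(1_000_000).astype(np.float32)
hot, fut = offload(optimizer_state, ratio=0.5,
                   path="/tmp/opt_state.npy", pool=pool)
fut.result()   # wait only when the offloaded state is needed again
```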
Enhancing Night-to-Day Image Translation with Semantic Prior and Reference Image Guidance
Abstract
Current unpaired image-to-image translation models handle datasets of unpaired domains effectively but face the challenge of mapping images from information-scarce domains to information-abundant domains due to the degradation of visibility and the loss of semantic information. To further improve the quality of night-to-day translation, we propose a novel image translation method named “RefN2D-Guide GAN” that utilizes reference images to improve the adaptability of the encoder within the generator through a feature matching loss. Moreover, we introduce a segmentation module to assist in preserving the semantic details of the generated images without the need for ground-truth annotations. The incorporation of an embedding consistency loss differentiates the roles of the encoder and decoder and facilitates the transfer of learned representations to both translation directions. Our experimental results show that our proposed method can effectively enhance the quality of night-to-day image translation on the night training set of the ACDC dataset and achieve higher mIoU on the translated images.
Junzhi Ning, Mingming Gong
Batch Level Distributed Training of LSTM with Infinity Norm Gradient Flow
Abstract
The advent of the big data era has led to a substantial increase in available data for analysis and prediction, creating a need for effective utilization of this vast input to improve prediction quality. LSTM-based neural networks have demonstrated exceptional performance in tasks such as time series forecasting. However, the effectiveness of these models can be constrained by the limitations of GPU memory. Distributed computing has emerged as a promising solution to address the challenges posed by large-sample, long-sequence time series forecasting. This work develops a novel distributed training method for LSTM-based time series forecasting under big data scenarios. Infinity norm gradient flow (INGF) is applied to speed up convergence, and acceleration techniques are designed to improve the utilization rate of multiple GPUs. The study provides significant insights into the performance of various distributed strategies and optimization techniques for batch-level distributed training. As a result, we achieve an impressive tenfold increase in efficiency while making only a negligible sacrifice in accuracy.
Linzhe Cai, Chen Liu, Xinghuo Yu, Chaojie Li, Andrew Eberhard
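A hypothetical sketch of one plausible reading of an infinity-norm-normalized gradient step; the paper's exact INGF formulation may differ:

```python
# Hypothetical infinity-norm-normalized gradient step: the update
# direction is g / ||g||_inf, so every coordinate moves by at most lr.
# The paper's exact INGF formulation may differ from this reading.
import numpy as np

def ingf_step(params, grad, lr=0.01, eps=1e-12):
    return params - lr * grad / (np.max(np.abs(grad)) + eps)

w = np.array([0.5, -1.0, 2.0])
g = np.array([0.1, -4.0, 0.01])
print(ingf_step(w, g))  # the largest-magnitude coordinate moves by ~lr
```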

Database Systems and Data Storage

Frontmatter
Health Status Assessment for HDDs Based on Bi-LSTM and N-Dimensional Similarity Metric
Abstract
In order to reduce the economic losses caused by hard disk failures, researchers have proposed various statistical and machine learning methods based on Self-Monitoring, Analysis and Reporting Technology (SMART) attributes. Predicting hard drive health using SMART attributes, as proposed by previous methods, is effective for adopting different passive fault tolerance mechanisms in advance. Despite the effectiveness of these methods, there are still significant limitations. Specifically, these methods define health status according to the remaining time before a disk breaks down. However, they ignore changes in SMART features that reflect deteriorating disk health. In this paper, we propose an N-dimensional similarity metric to evaluate the health of HDDs, which acts on both SMART attributes and time-to-failure of HDDs. In addition, we use hypothesis testing to eliminate abnormal data and propose a Bidirectional LSTM (Bi-LSTM) based model with a weighted categorical cross-entropy loss. Experiments on the Backblaze and Baidu datasets show that our method obtains reasonably accurate health status assessments and outperforms previous methods. Code is available at https://github.com/su26225/HDD-Health-Status.
Bo Su, Xin Man, Hui Xu, Xiangmin Zhou, Jie Shao
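For reference, a minimal sketch of a weighted categorical cross-entropy loss of the kind the abstract names; the class weights below are illustrative (rarer, near-failure classes would get larger weights):

```python
# Weighted categorical cross-entropy sketch: rarer (near-failure) health
# classes get larger weights so the model is penalised more for missing
# them. Class weights here are illustrative.
import numpy as np

def weighted_cce(y_true, y_prob, class_weights, eps=1e-12):
    w = class_weights[y_true]                 # weight per sample
    picked = y_prob[np.arange(len(y_true)), y_true]
    return -(w * np.log(picked + eps)).mean()

y_true = np.array([0, 2, 2])                  # 2 = near-failure class
y_prob = np.array([[0.8, 0.1, 0.1],
                   [0.2, 0.3, 0.5],
                   [0.1, 0.1, 0.8]])
print(weighted_cce(y_true, y_prob, class_weights=np.array([1.0, 2.0, 5.0])))
```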
Why Query Plans Are Different: An Automatic Detection and Inference System
Abstract
Preventing plan regression has always been a demanding task. SQL tuning advisors (e.g., index advisors) and optimizer testing tools are two common solutions. The former are proposed for database users to avoid a sub-optimal plan on a production system, while the latter try to evaluate the optimizer before a new version is released. However, neither helps optimizer developers quickly find the reason why a sub-optimal plan is generated for a query. Manual analysis is both labor-intensive and inefficient for large query plans and large numbers of queries. To this end, we propose AutoDI, an automatic detection and inference system to analyze why a sub-optimal plan is obtained. AutoDI consists of two main modules, Difference Finder and Inference. In Difference Finder, we introduce two types of signatures and design an efficient algorithm to detect the possible sub-optimal parts of the plan in a top-down manner. Inference has an extensible rule-action framework and takes the detected differences as input to explore the possible reasons behind them. We test AutoDI with analytic workloads on a production database. Compared to developers, AutoDI can find the right reason in most cases with much less time. AutoDI has been deployed in TiDB to help optimizer developers analyze plan regression cases.
Hai Lan, Yuanjia Zhang, Yu Dong, Dongxu Huang, Liu Tang, Jian Zhang
An Empirical Analysis of Just-in-Time Compilation in Modern Databases
Abstract
JIT (Just-in-Time) compilation technology has garnered significant attention for improving the efficiency of database execution. It offers higher performance by eliminating interpretation overhead compared to traditional execution engines. LLVM serves as the primary JIT architecture and has been available in PostgreSQL since version 11. However, recent advancements in WASM-based databases, such as Mutable, present an alternative JIT approach. This approach minimizes the extensive engineering efforts associated with the execution engine and focuses on optimizing supported operators for lower latency and higher throughput. In this paper, we perform comprehensive experiments on these two representative open-source databases to gain deeper insights into the effectiveness of different JIT architectures.
Miao Ma, Zhengyi Yang, Kongzhang Hao, Liuyi Chen, Chunling Wang, Yi Jin
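PostgreSQL's LLVM-based JIT can be toggled per session, which is handy for A/B-style measurements like the paper's. A sketch using psycopg2; the DSN and table name are placeholders, and note JIT only engages when plan cost exceeds the jit_above_cost threshold:

```python
# Toggle PostgreSQL's LLVM JIT per session and compare EXPLAIN ANALYZE
# output; DSN and table name are placeholders. JIT engages only above
# cost thresholds (see the jit_above_cost setting).
import psycopg2

conn = psycopg2.connect("dbname=test")   # placeholder DSN
with conn.cursor() as cur:
    for jit in ("off", "on"):
        cur.execute(f"SET jit = {jit}")
        cur.execute("EXPLAIN (ANALYZE) SELECT sum(x) FROM big_table")
        print(jit, cur.fetchall()[-1])   # timing (and JIT stats when on)
```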
Relational Expressions for Data Transformation and Computation
Abstract
Separate programming models for data transformation (declarative) and computation (procedural) impact programmer ergonomics, code reusability and database efficiency. To eliminate the necessity for two models or paradigms, we propose a small but high-leverage innovation: the introduction of complete relations into the relational database. Complete relations, and the discipline of constraint programming which concerns them, are founded on the same algebra as relational databases. We claim that by synthesising the relational database of Codd and Date with the results of the constraint programming community, the relational model holistically offers programmers a single declarative paradigm for both data transformation and computation; reusable code, with computations that are indifferent to what is input and what is output; and efficient applications, with the query engine optimising and parallelising all levels of data transformation and computation.
David Robert Pratten, Luke Mathieson
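The direction-indifference the abstract claims can be illustrated with an off-the-shelf constraint solver (the python-constraint package here, purely for illustration; the paper proposes hosting such relations inside the relational database itself):

```python
# One relation, a + b = c, used as a computation that is indifferent to
# what is input and what is output. Uses python-constraint purely for
# illustration of the constraint-programming idea.
from constraint import Problem

def solve(**known):
    p = Problem()
    for v in ("a", "b", "c"):
        p.addVariable(v, [known[v]] if v in known else list(range(100)))
    p.addConstraint(lambda a, b, c: a + b == c, ("a", "b", "c"))
    return p.getSolutions()

print(solve(a=2, b=3))   # forward: c = 5
print(solve(a=2, c=5))   # backward: b = 3
```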

Data Quality and Fairness for Graphs

Frontmatter
Discovering Graph Differential Dependencies
Abstract
Graph differential dependencies (GDDs) are a novel class of integrity constraints in property graphs for capturing and expressing the semantics of difference in graph data. They are more expressive than, and subsume, other graph dependencies, and are thus more useful for addressing many real-world graph data quality/management problems. In this paper, we study the general discovery problem for GDDs – the task of finding a non-redundant and succinct set of GDDs that hold in a given property graph. Indeed, we present characterisations of GDDs based on their semantics, extend existing data structures, and devise pruning strategies to enable our proposed level-wise discovery algorithm, GDDMiner, to return a minimal cover of valid GDDs efficiently. Further, we perform experiments over three real-world graphs to demonstrate the feasibility, scalability, and effectiveness of our solution.
Yidi Zhang, Selasi Kwashie, Michael Bewong, Junwei Hu, Arash Mahboubi, Xi Guo, Zaiwen Feng
Maximum Fairness-Aware (k, r)-Core Identification in Large Graphs
Abstract
Cohesive subgraph mining is a fundamental problem in attributed graph analysis. The k-core model has been widely used in many studies to measure the cohesiveness of subgraphs. However, none of them considers the fairness of attributes in communities. To address this limitation, we propose a novel model, named fair (k, r)-core, to assess fairness equity within communities captured in attributed graphs. The model not only focuses on cohesive community structure but also eliminates vertices that contribute to unfairness. We prove that the problem of identifying the maximum fair (k, r)-core is NP-hard. To scale to large graphs, we introduce innovative pruning techniques and search methods. Comprehensive experiments are conducted on 6 real-world datasets to evaluate the efficiency and effectiveness of the proposed model and solutions.
Xingyu Tan, Chengyuan Guo, Xiaoyang Wang, Wenjie Zhang, Chen Chen
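For reference, the classical k-core that the fair (k, r)-core builds on can be computed by iterative peeling; the fairness constraint and the r-related pruning are the paper's contribution and are not shown:

```python
# Classical k-core peeling for reference: repeatedly delete vertices of
# degree < k until none remain. The paper's fairness constraint and
# (k, r) pruning sit on top of this and are not reproduced here.
def k_core(adj, k):
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for v in list(alive):
            if sum(u in alive for u in adj[v]) < k:
                alive.remove(v); changed = True
    return alive

adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3]}
print(k_core(adj, 2))   # {1, 2, 3}: vertex 4 has degree 1 and is peeled
```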
IFGNN: An Individual Fairness Awareness Model for Missing Sensitive Information Graphs
Abstract
Graph neural networks (GNNs) provide an approach for analyzing complicated graph data for node, edge, and graph-level prediction tasks. However, due to societal discrimination in real-world applications, the labels in datasets may have certain biases. This bias is magnified as GNNs iteratively obtain information from neighbourhoods through message passing and aggregation, generating unfair embeddings that implicitly affect the prediction results. In real-world datasets, missing sensitive attributes is common due to incomplete data collection and privacy concerns. However, research on the fairness of GNNs in incomplete graph data is limited and mainly focuses on group fairness. Addressing individual unfairness in GNNs when the sensitive attributes are missing remains unexplored. To solve this novel problem, we introduce a model named IFGNN, which leverages a GNN-based encoder and a decoder to generate node embeddings. Additionally, IFGNN adopts the Lipschitz condition to ensure individual fairness. Through comprehensive experiments on four real-world datasets compared with baseline models in node classification tasks, the results demonstrate that IFGNN can achieve individual fairness while maintaining high prediction accuracy.
Kejia Xu, Zeming Fei, Jianke Yu, Yu Kong, Xiaoyang Wang, Wenjie Zhang
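The Lipschitz condition the abstract refers to is the standard individual-fairness requirement that similar individuals receive similar outputs; in its usual form (the exact instantiation used by IFGNN is the paper's):

\[
d_Y\big(f(x_i), f(x_j)\big) \le L \cdot d_X(x_i, x_j) \quad \text{for all pairs } (x_i, x_j),
\]

where \(d_X\) and \(d_Y\) are similarity metrics over inputs and model outputs, and \(L\) is the Lipschitz constant.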
Efficient Maximum Relative Fair Clique Computation in Attributed Graphs
Abstract
Cohesive subgraph mining is a fundamental problem in attributed graph analysis. However, the existing models on attributed graphs ignore the fairness of attributes. In this paper, we propose a novel model, called the maximum relative fair clique, which integrates cohesiveness and fair resource allocation. Specifically, given an attributed graph G and a positive integer \(\delta \), a relative fair clique is a clique in which the number of vertices with the most common attribute minus the number of vertices with the least common attribute is no more than \(\delta \). We aim to find the maximum relative fair clique, i.e., the relative fair clique with the largest size. To solve this problem, we develop an algorithm, MRFCSearch, equipped with a novel heuristic algorithm and an efficient pruning technique. We evaluate the algorithm on four real-world graphs, demonstrating the performance of the proposed techniques.
Yufeng Zhang, Shiyuan Liu, Yanping Wu, Lu Qin, Ying Zhang
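A minimal checker for the relative-fairness condition just defined, counting only attributes that actually appear in the clique (an assumption; the paper's formal definition governs):

```python
# Checker for the relative-fairness condition: in a clique, the count of
# the most common attribute minus that of the least common attribute
# must be at most delta. Only attributes present in the clique count.
from collections import Counter

def is_relative_fair(attrs, delta):
    counts = Counter(attrs)          # attribute -> #vertices in the clique
    return max(counts.values()) - min(counts.values()) <= delta

print(is_relative_fair(["red", "red", "blue"], delta=1))          # True
print(is_relative_fair(["red", "red", "red", "blue"], delta=1))   # False
```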

Graph Mining and Graph Algorithms

Frontmatter
Balanced Hop-Constrained Path Enumeration in Signed Directed Graphs
Abstract
Hop-constrained path enumeration, which aims to output all the paths between two distinct vertices within a given number of hops, is one of the fundamental tasks in graph analysis. Previous works on this problem mainly focus on unsigned graphs. Nevertheless, with the ever-growing popularity of online social media, signed graphs with positive and negative edges are becoming increasingly ubiquitous, and there is a paucity of work on enumerating balanced paths (i.e., paths with an even number of negative edges), which make more sense according to balance theory. Motivated by this, in this paper, we propose a new problem, i.e., enumerating balanced paths with a hop constraint (BPHC). A baseline method is first proposed by extending the DFS method. To further improve efficiency, we construct an almost-satisfy table to instantly correct unbalanced states during the search process. Based on the almost-satisfy table, we develop the algorithm BMAS by incorporating the shortest path length in the process of table updates and lookups. We conduct experiments on four real-world networks to verify the performance of the proposed algorithms.
Zhiyang Tang, Jinghao Wang, Yanping Wu, Xiaoyang Wang, Lu Qin, Ying Zhang
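A minimal DFS baseline for the problem as defined above (simple paths within h hops using an even number of negative edges); the almost-satisfy table and BMAS pruning are the paper's contribution and are not reproduced:

```python
# Baseline DFS sketch for balanced hop-constrained path enumeration:
# output simple s-t paths within h hops with an even number of negative
# edges. Edges are (neighbour, sign) pairs with sign in {+1, -1}.
def balanced_paths(adj, s, t, h):
    out, path = [], [s]
    def dfs(v, negs):
        if v == t:
            if negs % 2 == 0:
                out.append(list(path))
            return
        if len(path) > h:              # hop budget: len(path)-1 hops used
            return
        for u, sign in adj.get(v, []):
            if u not in path:          # keep paths simple
                path.append(u)
                dfs(u, negs + (sign < 0))
                path.pop()
    dfs(s, 0)
    return out

adj = {1: [(2, 1), (3, -1)], 2: [(4, -1)], 3: [(4, -1)], 4: []}
print(balanced_paths(adj, 1, 4, h=3))  # only 1-3-4 has an even negative count
```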
An Experimental Evaluation of Two Methods on Shortest Distance Queries over Small-World Graphs
Abstract
Efficient computation of shortest distances is a fundamental problem in graph databases. In this paper, we report our experimental evaluation and comparison of two state-of-the-art methods for exact distance computation over small-world graphs. Our experiments reveal some results that have not been previously reported. These results provide insights into the pros and cons of each method and into the possible directions in which these methods can be improved. They can also help in choosing the right method for specific application scenarios.
Houyu Di, Junhu Wang
Discovering Densest Subgraph over Heterogeneous Information Networks
Abstract
Densest Subgraph Discovery (DSD) has been a fundamental and challenging problem in the field of graph mining in recent years. Given a graph G, DSD aims to find the subgraph with the maximum density according to a specified density definition. The majority of current research focuses on identifying densest subgraphs in homogeneous information networks. Nevertheless, these techniques cannot be applied directly to Heterogeneous Information Networks (HINs) since the semantics of paths are not considered. This limits the application of these approaches in certain fields, as many graphs, e.g., the DBLP dataset, comprise many types of edges. To address this research gap, this paper proposes approaches for resolving the DSD problem over HINs.
By examining numerous linkage paths between two vertices, we characterize the relationship between two objects of the same type using the concept of meta-path. We further build two new HIN models and provide DSD techniques that perform well in large networks. In addition, comprehensive experiments on various HIN datasets demonstrate the effectiveness and efficiency of the proposed methods.
Haozhe Yin, Kai Wang, Wenjie Zhang, Dong Wen, Xiaoyang Wang, Ying Zhang
Influence Maximization Revisited
Abstract
Influence Maximization (IM) has been extensively studied; it is to select a set of k seed users from a social network so as to maximize the expected number of influenced users in the social network. Many approaches have been proposed under cascade models to find such a single set of k seed users. The computed set may not be unique, as it is most likely that there exist multiple sets, \(S_1, S_2, \cdots \), each of which leads to the same IM value, given that social networks exhibit rich symmetry as reported in the literature. In this paper, first, we study how to select a set of k seed users from a set of \(k'~(\ge k)\) seed users, which can be either a union of sets of seed users, \(\mathbb {S} = \bigcup _i S_i\), where \(S_i\) is a set of k seed users, or simply a computed set of seed users of size \(k'~(\ge k)\), based on a cooperative game using the Shapley value. Second, we develop a visualization system to explore the process of influence spreading from a topological perspective, as IM only gives the expected number of influenced users without much information on how influence spreads in a large social network. We conduct experimental studies to confirm the effectiveness of the seed users selected by our approach.
Yihan Geng, Kunyu Wang, Ziqi Liu, Michael Yu, Jeffrey Xu Yu
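A Monte Carlo Shapley-value sketch for picking k seeds out of a larger candidate seed set, in the spirit of the abstract's cooperative-game view; the value function v below is a toy placeholder for expected influence spread:

```python
# Monte Carlo Shapley values over a candidate seed set; v(S) is a toy
# stand-in for the expected influence spread of seed set S.
import random

def shapley(candidates, v, samples=1000):
    phi = {c: 0.0 for c in candidates}
    for _ in range(samples):
        perm = random.sample(candidates, len(candidates))
        covered = set()
        for c in perm:                       # marginal gain in random order
            phi[c] += (v(covered | {c}) - v(covered)) / samples
            covered.add(c)
    return phi

v = lambda S: len({x for c in S for x in {c, c + 1}})  # toy spread
phi = shapley([1, 2, 3, 10], v)
top_k = sorted(phi, key=phi.get, reverse=True)[:2]     # pick k = 2 seeds
print(phi, top_k)
```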
On Directed Densest Subgraph Detection
Abstract
The well-studied directed densest subgraph problem aims to find two (possibly overlapping) vertex subsets \(S^*\) and \(T^*\) in a given directed graph \(G=(V,E)\) such that \(\rho (S,T) = \frac{|E(S,T)|}{\sqrt{|S||T|}}\) is maximized; here E(S, T) denotes the set of edges from vertices of S to T in G. This problem is polynomial-time solvable, and both exact algorithms and approximation algorithms have been proposed in the literature. However, the existing exact algorithms are time-consuming, while the existing approximation algorithms often yield trivial solutions that consist of the highest-degree vertex and its in-neighbors or out-neighbors. Moreover, there is nothing special about the geometric mean adopted in the existing density measure for combining \(\frac{|E(S,T)|}{|S|}\) and \(\frac{|E(S,T)|}{|T|}\). In this paper, we explore alternative density measures, and propose corresponding algorithms, for directed densest subgraph identification. Specifically, we introduce three density measures that combine \(\frac{|E(S,T)|}{|S|}\) and \(\frac{|E(S,T)|}{|T|}\) by the harmonic mean, the arithmetic mean, and the minimum mean, respectively. Based on these density measures, we formulate the harmonic mean-based directed densest subgraph (HDDS) problem, the arithmetic mean-based directed densest subgraph (ADDS) problem, and the minimum mean-based directed densest subgraph (MDDS) problem. We then propose a 2-approximation algorithm for HDDS, a 2-approximation algorithm for ADDS, and a heuristic algorithm for MDDS; our HDDS and MDDS algorithms run in time linear in the input graph size. Extensive empirical studies on large real-world directed graphs show that our ADDS algorithm produces trivial results similar to those of the existing approximation algorithm, while our HDDS and MDDS algorithms generate nontrivial and much better solutions and scale to large graphs.
Kai Yao, Xin Yang, Lijun Chang
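For reference, writing \(\rho _S = \frac{|E(S,T)|}{|S|}\) and \(\rho _T = \frac{|E(S,T)|}{|T|}\), the classical objective and the three alternatives described in the abstract are:

\[
\rho_{\mathrm{geo}} = \sqrt{\rho_S \rho_T} = \frac{|E(S,T)|}{\sqrt{|S||T|}}, \qquad
\rho_{\mathrm{har}} = \frac{2\rho_S \rho_T}{\rho_S + \rho_T}, \qquad
\rho_{\mathrm{ari}} = \frac{\rho_S + \rho_T}{2}, \qquad
\rho_{\mathrm{min}} = \min (\rho_S, \rho_T),
\]

with the last three giving the HDDS, ADDS, and MDDS objectives respectively.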
Backmatter
Metadata
Title
Databases Theory and Applications
Editors
Zhifeng Bao
Renata Borovica-Gajic
Ruihong Qiu
Farhana Choudhury
Zhengyi Yang
Copyright Year
2024
Electronic ISBN
978-3-031-47843-7
Print ISBN
978-3-031-47842-0
DOI
https://doi.org/10.1007/978-3-031-47843-7
