
2025 | Book

Web Information Systems Engineering – WISE 2024

25th International Conference, Doha, Qatar, December 2–5, 2024, Proceedings, Part V

Edited by: Mahmoud Barhamgi, Hua Wang, Xin Wang

Publisher: Springer Nature Singapore

Book series: Lecture Notes in Computer Science


About this book

This five-volume set LNCS 15436–15440 constitutes the proceedings of the 25th International Conference on Web Information Systems Engineering, WISE 2024, held in Doha, Qatar, in December 2024.

The 110 full papers and 55 short papers presented in these proceedings were carefully reviewed and selected from 368 submissions. The papers are organized in the following topical sections:

Part I: Information Retrieval and Text Processing; Text and Sentiment Analysis; Data Analysis and Optimisation; Query Processing and Information Extraction; Knowledge and Data Management.

Part II: Social Media and News Analysis; Graph Machine Learning on Web and Social; Trustworthy Machine Learning; and Graph Data Management.

Part III: Recommendation Systems; Web Systems and Architectures; and Humans and Web Security.

Part IV: Learning and Optimization; Large Language Models and their Applications; and AI Applications.

Part V: Security, Privacy and Trust; Online Safety and Wellbeing through AI; and Web Technologies.

Table of Contents

Frontmatter

Security, Privacy and Trust

Frontmatter
Privacy-Preserving k-core Decomposition for Graphs

k-core decomposition is an important task in graph data processing in various fields, such as social network analysis, computational biology, and medical research. In applications involving massive graphs, users often outsource both the graph data and the k-core decomposition task to cloud service providers, so as to alleviate the problem of limited computational resources. This solution, however, may threaten the privacy of the data and of the users, for example through leakage of node and edge information, and even of the k value. In this paper, we propose a homomorphic encryption-based scheme called HEkc for k-core decomposition in the graph data outsourcing scenario. We prove that the proposed scheme is secure under the semi-honest model, i.e., it performs k-core decomposition correctly while preserving the privacy of the graph, the k value, and the computation result. In addition, we design two variants, HEkc-fast and HEkc-safe, to improve the efficiency of HEkc and to protect it against known-plaintext attacks, respectively. Finally, we empirically evaluate the effectiveness of our proposed schemes through extensive experiments on eight real datasets; the results show that our schemes for privacy-preserving k-core decomposition are promising.

Xuyang Liu, Rong Zhao, Bingwen Feng, Jilian Zhang
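For context, the plaintext baseline that such outsourcing schemes protect is standard k-core decomposition, computable by repeatedly peeling off minimum-degree nodes. A minimal sketch of that non-private baseline (not the HEkc scheme itself, which operates over homomorphic ciphertexts):

```python
import heapq

def core_numbers(adj):
    """Core number of each node via min-degree peeling: repeatedly remove a
    node of minimum remaining degree; its core number is the largest
    minimum degree seen so far. `adj` maps each node to its neighbor list."""
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    heap = [(d, v) for v, d in deg.items()]
    heapq.heapify(heap)
    removed, core, k = set(), {}, 0
    while heap:
        d, v = heapq.heappop(heap)
        if v in removed or d != deg[v]:
            continue  # stale entry left behind by a lazy decrease-key
        k = max(k, d)
        core[v] = k
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
                heapq.heappush(heap, (deg[u], u))
    return core
```

On a triangle {1, 2, 3} with a pendant node 4 attached to 3, this assigns core number 2 to the triangle nodes and 1 to the pendant.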
Anomaly Detection in Log Streams Based on Time-Contextual Models

Organisations today rely heavily on complex software systems integrated through multiple layers of middleware. This complexity generates substantial operational data in structured and semi-structured formats, recorded in log files. The system's workload fluctuates with specific periods of the day, which affects the amount and quality of data generated in log files. In this paper, we propose a new log anomaly detection approach that leverages a collection of smaller models, each designed to capture workload fluctuations over a specific time interval. We demonstrate its effectiveness in detecting anomalies within log streams. Our evaluation uses log data from servers in a production environment running a complex back-end system that processes hundreds of requests per second. We show that our method outperforms traditional and widely used stream anomaly detection methods in dynamic and time-sensitive workload scenarios.

Daniil Fedotov, Jaroslav Kuchar, Tomas Vitvar
Enhancing Open-Set Recognition with Global Feature Representation

Open-set recognition (OSR) is crucial for classifying known classes while identifying unknown data in real-world applications. Existing methods often rely on discriminative features from deep models but struggle to recognize new data with non-discriminative differences from known classes and are susceptible to adversarial attacks. To address this limitation, we redefine the OSR problem by incorporating the unknown space with non-discriminative differences. Specifically, we introduce global feature representations that encapsulate both discriminative and non-discriminative features. Unlike discriminative features alone, global representations capture variations between arbitrary samples, thereby covering unknown spaces that deviate from known samples in various feature aspects. As an exploratory solution, we propose the Open Calibrated K-Nearest Neighbor (OpenCKNN) classifier based on global feature representation. OpenCKNN, an open extension of the traditional closed-set KNN, preserves all global features without information loss, enabling the recognition of all unknown classes. Moreover, KNN's local neighbor learning and average nearest-neighbor discrimination effectively handle the challenges posed by irregular data distributions and atypical points. We also employ nearest-neighbor distance calibration using a pseudo-extreme value machine to mitigate inconsistencies across different class clusters. Extensive experiments on benchmark vision and intrusion detection datasets demonstrate that our approach significantly enhances open-set recognition and resistance to adversarial attacks.

GuoLou Ping
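The general idea behind open-set nearest-neighbor classifiers can be illustrated with a distance-thresholded 1-NN rule: a query farther from every known sample than some bound is rejected as unknown. This toy sketch omits the paper's global feature representations and pseudo-extreme-value calibration; the fixed threshold here is a hypothetical stand-in for that calibration:

```python
import math

def open_set_1nn(train, query, threshold):
    """Return the label of the nearest training sample, or 'unknown' if
    even the nearest sample lies farther than `threshold`.
    `train` is a list of ((x, y), label) pairs in 2-D feature space."""
    best_dist, best_label = float("inf"), None
    for point, label in train:
        d = math.dist(point, query)
        if d < best_dist:
            best_dist, best_label = d, label
    return best_label if best_dist <= threshold else "unknown"
```

Calibrated approaches replace the single global threshold with per-cluster rejection bounds, which is the inconsistency the abstract's distance calibration targets.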
Dynamic-Parameter Genetic Algorithm for Multi-objective Privacy-Preserving Trajectory Data Publishing

Nowadays, trajectory data is widely accessible and can be beneficial for various practical applications, such as location-based services, personalized recommendation, and traffic management. Despite the immense benefits in these scenarios, trajectories can reveal highly sensitive information about individuals, such as personal characteristics, movement patterns, visited locations, and social connections. Consequently, it is imperative to prioritize protecting privacy when conducting trajectory analyses. Existing privacy-preserving techniques focus on optimizing data utility but often overlook the diverse requirements for privacy preservation. To address this limitation, this paper formulates Privacy-Preserving Trajectory Data Publishing (PPTDP) as a multi-objective optimization problem that maximizes both privacy and utility. We propose a novel algorithm called the Dynamic-Parameter Genetic Algorithm (DPGA), which combines non-dominated sorting multi-objective optimization with a genetic algorithm (GA). Its mutation and crossover strategies dynamically adjust the corresponding parameters to improve solution quality, and a scramble mutation strategy helps achieve better population diversity. Extensive experiments demonstrate the efficiency of the proposed algorithm in terms of solution accuracy and convergence.

Samsad Jahan, Yong-Feng Ge, Hua Wang, Enamul Kabir
A Graph-Based Approach for Software Functionality Classification on the Web

In the context of rising cybersecurity threats within software supply chains, the precise classification of software package functionalities is essential for mitigating risks posed by the exploitation of third-party libraries in web-based systems. This paper introduces a novel approach employing a Heterogeneous Information Network (HIN) and the Metapath2Vec algorithm to elevate the security and reliability of software package classification within the NPM repository, which is crucial for web application development. Our methodology capitalises on intricate package dependencies and metadata to not only enhance classification accuracy but also effectively utilise the complex and dynamic relationships widespread in web ecosystems. Comparative analyses underscore that our framework outstrips conventional methods such as DeepWalk and Node2Vec, with substantial improvements in precision and recall across a majority of functionality classes assessed. This research significantly advances web information systems engineering by providing a robust framework for the dynamic analysis of relationships and functionalities in software packages, thereby strengthening the security resilience of web-based software ecosystems.

Yinhao Jiang, Michael Bewong, Arash Mahboubi, Sajal Halder, Rafiqul Islam, Md Zahidul Islam, Ryan H. L. Ip, Praveen Gauravaram, Jason Xue
FUD-LDP: Fully User Driven Local Differential Privacy

In crowd-sourced data collection for statistical aggregates, Local Differential Privacy (LDP) has become the de facto mechanism for preserving privacy. Current LDP mechanisms primarily enforce a preset privacy level for all participants. Other user-driven mechanisms that attempt to give participants the freedom to choose privacy levels, such as user-driven LDP (UD-LDP) and personalised LDP (PLDP), are limited in achieving this goal: UD-LDP allows users to choose a privacy level from a fixed set of values determined by the data collector, while PLDP allows users to adjust privacy within their multidimensional data, but not their overall privacy level. In this study we present a fully user-driven local differential privacy mechanism, denoted FUD-LDP, which gives participants the freedom to choose their preferred privacy level while enhancing the accuracy of the measured statistics. We also analyse the effects of various privacy-level distributions on the efficiency of FUD-LDP compared to other existing user-driven mechanisms.

Gnanakumar Thedchanamoorthy, Michael Bewong, Meisam Mohammady, Tanveer Zia, Md Zahidul Islam
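The building block that user-driven LDP mechanisms vary per participant is k-ary randomized response. The sketch below lets each user supply their own epsilon and debiases each report with that user's own flip probabilities; this is the standard mechanism such schemes generalize, not the paper's FUD-LDP construction:

```python
import math
import random

def k_rr(value, domain, epsilon, rng=random):
    """k-ary randomized response: keep the true value with probability
    e^eps / (e^eps + k - 1), otherwise report a uniform other value."""
    k = len(domain)
    p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p:
        return value
    return rng.choice([v for v in domain if v != value])

def estimate_count(reports, target, domain, epsilons):
    """Unbiased count of `target` from (report, epsilon) pairs: each user's
    indicator is debiased using that user's own p (keep) and q (flip-to)
    probabilities, so heterogeneous privacy levels are handled per report."""
    k = len(domain)
    total = 0.0
    for report, eps in zip(reports, epsilons):
        p = math.exp(eps) / (math.exp(eps) + k - 1)
        q = (1 - p) / (k - 1)
        total += ((report == target) - q) / (p - q)
    return total
```

Since E[1{report = target}] is p for a true match and q otherwise, each per-user term has expectation exactly 1 or 0, so the sum is unbiased regardless of how epsilons are distributed across users.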
A Privacy-Preserving Encryption Framework for Big Data Analysis

The advent of big data has brought numerous conveniences and benefits but has also heightened users' privacy concerns. Traditional methods like data masking and encryption secure user access control but waste storage space due to data padding, face decoding challenges, and risk exposing confidential information after decryption. To overcome these issues, this study develops a format-preserving encryption (FPE) based privacy-preserving technique that maintains user access control while optimizing anomaly detection accuracy and minimizing information loss. The method first generates a fixed-length key for each algorithm based on specified key-length parameters, then preserves the length and format of the original plaintext in the ciphertext, ensuring compatibility with databases. Our analysis of accuracy, information loss over accuracy, and information loss over root mean square error (RMSE) demonstrates the overall efficacy of the proposed method. Our experiment on brain-computer interface (BCI) based electroencephalogram (EEG) data achieves 96.55% accuracy and requires only 2.41 s of computation for user access control. Remarkably, the use of cryptography does not significantly impact performance compared to a non-privacy-preserving framework. Our framework will guide future researchers in developing more effective privacy protection mechanisms in BCI technology, ensuring the security of confidential information.

Taslima Khanam, Siuly Siuly, Kate Wang, Zhonglong Zheng
Location Nearest Neighbor Query Scheme in Edge Computing Based on Differential Privacy

In edge computing environments, users are frequently exposed to the risk of location privacy disclosure when using location-based services (LBS). To address location privacy breaches in LBS within edge computing, this paper introduces an LBS nearest neighbor query algorithm based on differential privacy. In recognition of the spatial distribution of edge nodes, a novel approach, designated the Geographically Indistinguishable Mechanism in Limited Areas (GIMIA), is proposed to obfuscate the precise location within the coverage of edge nodes. To accommodate the varying privacy budgets required by users across diverse geographical regions, this study further presents a Geographically Indistinguishable Mechanism in Limited Areas based on a Spatial Quad-Tree (QGIMIA), which adapts to regional user density. Experimental findings validate that the data utility and computational delay of the proposed GIMIA and QGIMIA outperform traditional Laplace-based geographically indistinguishable mechanisms in LBS nearest neighbor queries.

Yanni Han, Zhuoqun Li, Jian Zhang, Zhen Wu, Ying Ding
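The Laplace-based baseline these mechanisms improve on is the planar Laplace mechanism of geo-indistinguishability: add 2-D noise whose density decays as exp(-epsilon * distance). A minimal sketch of that standard mechanism, without the limited-area restriction that GIMIA adds:

```python
import math
import random

def planar_laplace(x, y, epsilon, rng=random):
    """Perturb (x, y) with planar Laplace noise: the angle is uniform on
    [0, 2*pi) and the radius follows Gamma(2, 1/epsilon), sampled as the
    sum of two exponentials, so the 2-D density is proportional to
    exp(-epsilon * r) as geo-indistinguishability requires."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    u1 = 1.0 - rng.random()  # in (0, 1], safe for log
    u2 = 1.0 - rng.random()
    r = -(math.log(u1) + math.log(u2)) / epsilon
    return x + r * math.cos(theta), y + r * math.sin(theta)
```

The expected displacement is 2/epsilon, so smaller privacy budgets push the reported location farther from the true one; truncating this noise to an edge node's coverage area is exactly where the limited-area variants depart from the baseline.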
Open Research Challenges for Private Advertising Systems Under Local Differential Privacy

Due to the ongoing deprecation of third-party cookies on mainstream browsers, the digital advertising industry is facing novel challenges regarding how to operate artificial intelligence (AI) systems. One of these bottlenecks lies in the tentative use of local differential privacy (LDP) to obfuscate granular user data, which prevents standard machine learning pipelines from tackling the privacy/utility trade-off. This position paper reviews the main research directions that have been explored to cope with this issue and states our positioning and research guidelines on operating an AI system under LDP, notably by pointing out the main limitations of existing work. More specifically, we highlight the importance of research on multi-task learning under LDP schemes and of seeking prior information to help design privacy-preserving mechanisms.

Matilde Tullii, Solenne Gaucher, Hugo Richard, Eustache Diemert, Vianney Perchet, Alain Rakotomamonjy, Clément Calauzènes, Maxime Vono
Industry-Specific Vulnerability Assessment

Vulnerability databases are instrumental in tracking and understanding software security threats. Among these, the National Vulnerability Database (NVD), maintained by the US government, is a primary source for security professionals, including system administrators, developers, and researchers. While many researchers have explored these databases, industry-based analyses are limited. In this paper, we curated a dataset from the NVD focused on six key industries: Education, Health, Entertainment, Finance, Energy, and Business & Retail. We performed a comprehensive analysis highlighting industry trends and affinities, yielding various insights. This paper presents a unique industry-categorized dataset from the NVD, laying the groundwork for specialized vulnerability classification in future research.

Mohammed Alkinoon, Hattan Althebeiti, Ali Alkinoon, Manar Mohaisen, Saeed Salem, David Mohaisen
Blockchain-Driven Medical Data Shamir Threshold Encryption with Attribute-Based Access Control Scheme

When medical data terminals are used for doctor diagnosis, they can receive the corresponding medical data in real time and transmit it to the cloud for encrypted storage. However, the keys used for data encryption and decryption are usually stored and managed directly by users or third-party organizations, so data owners lose absolute control over their data. This situation can lead to security and privacy issues, and as the number of participants increases, the time overhead grows unfavorably. To address these problems, we propose a blockchain-based medical data Shamir threshold encryption scheme. The scheme uses dynamic two-level Shamir secret sharing to split and mix the encryption private key, replacing a single long polynomial with multiple short polynomials to reduce key-splitting time, and supports adaptive threshold adjustment based on the number of participants. Blockchain supervision is additionally introduced to enhance security. We also propose an attribute-based key distribution access control scheme that constructs an "attribute relationship graph" to simplify permission management, with smart contracts automatically enforcing access control to ensure secure data sharing. Simulation experiments demonstrate that the scheme improves the efficiency of key management, encryption and decryption, and access control, while effectively ensuring the security and privacy of shared medical data.

Wei Shen, Qian Zhou, Jiayang Wu
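The cryptographic primitive underneath this scheme is plain Shamir secret sharing: a secret becomes the constant term of a random degree-(k-1) polynomial over a prime field, and any k evaluations reconstruct it by Lagrange interpolation. A self-contained sketch of that primitive (the paper's dynamic two-level splitting, short-polynomial mixing, and blockchain supervision are not shown):

```python
import random

PRIME = 2**127 - 1  # a Mersenne prime, large enough for ~128-bit secrets

def split(secret, k, n, rng=random):
    """Split `secret` into n shares such that any k reconstruct it:
    shares are points (x, f(x)) on a random degree-(k-1) polynomial
    with constant term `secret`, over GF(PRIME)."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(k - 1)]
    def f(x):
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc
    return [(x, f(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over GF(PRIME); expects exactly
    the threshold number of distinct shares."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num = den = 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = num * (-xj) % PRIME
                den = den * (xi - xj) % PRIME
        # Fermat inverse of den, since PRIME is prime
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret
```

Replacing one long polynomial with several short ones, as the abstract describes, shrinks the degree of each interpolation and hence the per-share evaluation cost.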
Smart Contracts Vulnerability Detection Using Transformers

Smart contracts play a crucial role in the digital transformation of exchanges and transactions. They are self-executing programs, based on blockchain technology, in which the terms of the agreement are directly coded. Service Level Agreements (SLAs) are used as contracts to define non-functional service terms between providers and customers, and several works in the literature have proposed transforming SLAs into smart contracts on the blockchain to ensure decentralization, transparency, and traceability. However, none of these works considered how to ensure the security of smart contracts, which is crucial as they are immutable and cannot be revised or modified after deployment. To address this issue, in a previous work we suggested using formal verification as an effective way to check SLA security before deployment. Despite its effectiveness, formal verification incurs significant time delays and considerable computational resources, which can be impractical when time is a critical factor. To overcome this drawback, in this paper we propose to employ Transformers with a tree positional encoding layer to detect smart contract vulnerabilities in limited time. This layer lets the Transformer model take the tree form of the code into account as input. The technique is used to shortlist susceptible contracts and to reduce the properties that must be checked by formal verification, allowing the largest possible number of contracts to be evaluated within a constrained time frame.

Riham Badra, Layth Sliman, Amine Dhraief
Weibo-FA: A Benchmark Dataset for Fake Account Detection in Weibo Platform

Weibo is one of the top social media platforms in China, similar to X/Twitter. Weibo hosts many fake accounts that can have various negative impacts on the platform and community, yet most existing work on fake-account datasets and detection models concentrates on X/Twitter, with little consideration given to Weibo. In this paper, we first review methods for detecting fake accounts on social networking platforms. Secondly, we introduce Weibo-FA, a dataset collected from the Weibo platform that contains profile information of both fake and genuine accounts. The Weibo-FA dataset is then compared with a baseline dataset using several basic machine learning models and neural networks to evaluate its applicability to traditional models. Subsequently, the Weibo-FA dataset is used to train two state-of-the-art neural network-based models to demonstrate its suitability for advanced models, and the test results are compared with related work. The results indicate that the proposed Weibo-FA dataset exhibits good usability and completeness: it not only enhances the performance of traditional simple models but also achieves high accuracy in detecting fake accounts with advanced complex models.

Zhiqi Li, Weidong Fang, Wuxiong Zhang
More Than Just a Random Number Generator! Unveiling the Security and Privacy Risks of Mobile OTP Authenticator Apps

One-Time Passwords (OTPs) are a crucial component of multi-factor authentication (MFA) systems, providing additional security by requiring users to supply a dynamically generated code for authenticating to web services. The growth in smartphone usage has resulted in a shift from hardware tokens to mobile app-based OTP authenticators; however, these apps also present potential security and privacy threats. In this paper, we present a comprehensive analysis of 182 publicly available OTP apps on Google Play. Our analysis entails an array of passive and active measurements meticulously designed to assess the security and privacy attributes of each OTP application. We investigate the presence of suspicious libraries, usage of binary protections, access to root privileges, secure backup and cryptographic mechanisms, and protection against traffic interception. Our experiments highlight several security and privacy weaknesses in the analyzed OTP apps: 28% are signed using a vulnerable version of the Android application signing mechanism; over 40% include third-party libraries that leak user information to third parties; 31.9% are vulnerable to network interception; and only 13.2% can detect devices that have been jailbroken or rooted, which poses a significant concern. Our study highlights the need for better security and privacy guarantees in OTP apps and the importance of user awareness.

Muhammad Ikram, I. Wayan Budi Sentana, Hassan Asghar, Mohamed Ali Kaafar, Michal Kepkowski
Synthetic Data: Generate Avatar Data on Demand

Anonymization is crucial for sharing personal data in a privacy-aware manner, yet it is a complex task that requires striking a trade-off between the robustness of anonymization (i.e., the privacy level provided) and the quality of the analysis that can be expected from anonymized data (i.e., the resulting utility). Synthetic data has emerged as a promising solution to overcome the limits of classical anonymization methods while achieving statistical properties similar to the original data. Avatar-based approaches are a specific type of synthetic data generation that rely on local stochastic simulation modeling to generate an avatar for each original record. While these approaches have been used in healthcare, their attack surface is not well documented or understood. In this paper, we provide an extensive assessment of such approaches and compare them against other data synthesis methods. We also propose an improvement based on conditional sampling in the latent space, which allows synthetic data to be generated on demand (i.e., of arbitrary size). Our empirical analysis shows that avatar-generated data are subject to the same utility and privacy trade-off as other data synthesis methods, with a higher privacy risk on edge data, i.e., records that have the fewest alter egos in the original data.

Thomas Lebrun, Louis Béziaud, Tristan Allard, Antoine Boutet, Sébastien Gambs, Mohamed Maouche
A Lightweight Detection of Sequential Patterns in File System Events During Ransomware Attacks

Ransomware poses a major threat by encrypting files and demanding ransom for decryption. This paper introduces a lightweight hybrid model for detecting ransomware by analyzing file system events. By combining XGBoost and Long Short-Term Memory (LSTM) networks, the approach identifies and predicts malicious behaviors with high accuracy and low computational cost. A File System Monitor Watchdog was developed to track file activities, collecting a dataset from 20 ransomware families. XGBoost is used for initial pattern detection, and LSTM networks for sequential analysis. The model achieved 97.12% detection accuracy, outperforming traditional methods in accuracy and efficiency, while reducing computational costs.

Arash Mahboubi, Hang Thanh Bui, Hamed Aboutorab, Khanh Luong, Seyit Camtepe, Keyvan Ansari
Detection and Mitigation of Backdoor Attacks on x-Apps

The integration of artificial intelligence (AI) and machine learning (ML) within the Open Radio Access Network (O-RAN) xApps introduces significant enhancements to network automation and anomaly detection. However, this open architecture increases vulnerability to sophisticated attacks, including backdoor threats. This paper proposes the use of an xLSTM autoencoder for detecting anomalies in O-RAN, specifically focusing on its ability to model long-term dependencies in network traffic. xLSTM, with its enhanced memory mechanisms, addresses the limitations of traditional models like LSTM by improving both detection accuracy and computational efficiency in real-time environments.

Rouaa Naim, Hams Gelban, Ahmed Badawy
Cohesive Database Neighborhoods for Differential Privacy: Mapping Relational Databases to RDF

This paper studies how privacy guarantees on relational databases (RDBs) with foreign key constraints can be transposed to Semantic Web (RDF) databases and vice versa. We consider a Differentially Private (DP) model for RDBs based on cascade deletion and demonstrate that it is sometimes, but not consistently, similar to an existing DP graph privacy model. Consequently, we tweak the relational model to propose a new model called restrict deletion. We show that it is equivalent to an existing DP graph privacy model, facilitating the comprehension, design, and implementation of DP mechanisms when mapping RDBs to RDF.

Sara Taki, Adrien Boiret, Cédric Eichler, Benjamin Nguyen
i-Right: Identifying and Classifying GDPR User Rights in Fitness Tracker and Smart Home Privacy Policies

Regulations and laws, such as the EU GDPR, require service providers to inform users about their data collection and processing practices. The existing means of portraying the rights and responsibilities of both users and service providers in terms of data collection, processing, and sharing are privacy policies, which describe the practices an organization or company follows when handling the personal data of its users. In this work, we introduce i-Right, an automated approach that classifies the text of privacy policies from the domains of fitness trackers and smart homes, extracting information on the eight GDPR user rights present (e.g., the Right to Object). Our results show that i-Right classifies the text with high accuracy. The proposed approach could provide a valuable tool for users to understand how their personal data is handled by service providers and to comprehend the possible risks of using their devices. A side contribution of our work is a labelled dataset of 133 privacy policies created to assist this process.

Alexia Dini Kounoudes, Georgia M. Kapitsaki, Ioannis Katakis
AttackER: Towards Enhancing Cyber-Attack Attribution with a Named Entity Recognition Dataset

Cyber-attack attribution is an important process that allows experts to put in place attacker-oriented countermeasures and legal actions. Analysts mainly perform attribution manually, given the complex nature of this task. AI and, more specifically, Natural Language Processing (NLP) techniques can be leveraged to support cybersecurity analysts during the attribution process, but however powerful these techniques may be, they must contend with the lack of datasets in the attack attribution domain. In this work, we fill this gap and provide, to the best of our knowledge, the first dataset on cyber-attack attribution. We designed our dataset with the primary goal of extracting attack attribution information from cybersecurity texts, utilizing named entity recognition (NER) methodologies from the field of NLP. Unlike other cybersecurity NER datasets, ours offers a rich set of annotations with contextual details, including some that span phrases and sentences. We conducted extensive experiments and applied NLP techniques to demonstrate the dataset's effectiveness for attack attribution. These experiments highlight the potential of Large Language Models (LLMs) to improve NER tasks on cybersecurity datasets for cyber-attack attribution.

Pritam Deka, Sampath Rajapaksha, Ruby Rani, Amirah Almutairi, Erisa Karafili
R-CONV: An Analytical Approach for Efficient Data Reconstruction via Convolutional Gradients

In the effort to learn from extensive collections of distributed data, federated learning has emerged as a promising approach for preserving privacy by using a gradient-sharing mechanism instead of exchanging raw data. However, recent studies show that private training data can be leaked through many gradient attacks. While previous analytical attacks have successfully reconstructed input data from fully connected layers, their effectiveness diminishes when applied to convolutional layers. This paper introduces an advanced data leakage method that efficiently exploits the gradients of convolutional layers. We present a surprising finding: even with non-fully invertible activation functions, such as ReLU, we can analytically reconstruct training samples from the gradients. To the best of our knowledge, this is the first analytical approach that successfully reconstructs convolutional layer inputs directly from the gradients, bypassing the need to reconstruct the layers' outputs. Prior research has mainly concentrated on the weight constraints of convolutional layers, overlooking the significance of gradient constraints. Our findings demonstrate that existing analytical methods used to estimate the risk of gradient attacks lack accuracy: in some layers, attacks can be launched with less than 5% of the reported constraints.

Tamer Ahmed Eltaras, Qutaibah Malluhi, Alessandro Savino, Stefano Di Carlo, Adnan Qayyum
Privacy-Preserving Behavioral Anomaly Detection in Dynamic Graphs for Card Transactions

Anomaly detection in financial transactions poses significant privacy challenges. This paper introduces a federated learning (FL) framework for privacy-preserving behavioral anomaly detection using Graph Neural Networks (GNNs) on dynamic graphs that model cardholder transactions. We incorporate anonymization-based and noise-based privacy-preserving methods for feature engineering, and a domain-specific negative sampling technique to train models without labeled data, making the approach suitable for real-world applications. Our results, benchmarked on synthetic and real-world datasets, show that deep learning-based methods outperform clustering-based ones, with F1-scores of 0.91 ± 0.02 and 0.87 ± 0.04, respectively. Additionally, using the anomaly score as a feature in fraud detection models yields a 1.76% ± 0.54% improvement in F1-score, enhancing fraud detection performance while preserving privacy.

Farouk Damoun, Hamida Seba, Radu State

Online Safety and Wellbeing Through AI

Frontmatter
DisCo-FEND: Social Context Veracity Dissemination Consistency-Guided Case Reasoning for Few-Shot Fake News Detection

With the rapid development of the Internet, traditional news channels are being supplanted, leading to an increased prevalence of fake news. Mainstream fake news detection methods based on pre-trained language models (PLMs) follow the "pre-training and fine-tuning" paradigm, relying on full supervision and depending heavily on large, high-quality datasets. In contrast, "pre-training and prompt-tuning" offers more efficient learning, especially in data-scarce scenarios. Meanwhile, extensive analysis of social patterns reveals a tendency driven by user psychology and behavior: users often disseminate information that aligns with their pre-existing beliefs, thereby reinforcing and solidifying their convictions. We term this phenomenon "social context veracity dissemination consistency". Inspired by it, we propose DisCo-FEND, a social context veracity Dissemination Consistency-guided case reasoning augmentation for the Fake News Detection (FEND) task. During model inference, we adopt a novel strategy that enhances reasoning by using multiple FEND cases, leveraging news cases with higher dissemination consistency to refine predictions. Additionally, a high-quality label-word acquisition approach and an adaptive weight allocation-based multi-label-word mapping strategy improve the convergence and generalization of DisCo-FEND.

Weiqiang Jin, Ningwei Wang, Tao Tao, Mengying Jiang, Xiaotian Wang, Biao Zhao, Hao Wu, Haibin Duan, Guang Yang
NLWM: A Robust, Efficient and High-Quality Watermark for Large Language Models

Since the advent of ChatGPT, the popularity of large language models has made distinguishing between model-generated and human-created text a significant challenge. Embedding watermarks during text generation is a crucial method for identifying model-generated content. However, existing approaches predominantly focus on 0-bit watermarks, limiting their capacity to embed more substantive information such as authorship or version information. The development of n-bit watermarks remains in its nascent stage: most n-bit watermarking methods are constrained in text quality and robustness, and often struggle to achieve high accuracy over short text sequences. Hence, we propose an n-bit watermark (NLWM) based on a language model proxy. To address the reduction in text quality observed in current methods, we partition the model’s vocabulary using a fixed random seed and introduce a novel strategy to reweight the original probability distribution. Simultaneously, to enhance watermark robustness and extraction accuracy, we incorporate a BCH error-correcting code mechanism into our method. Since NLWM does not need to access the language model when extracting watermarks, extraction is highly efficient. In open text generation experiments, we compared our method with baseline models. The results demonstrate that NLWM not only enhances watermark robustness and text quality but also improves the efficiency and accuracy of watermark extraction, highlighting the practical value of this method.

Mengting Song, Ziyuan Li, Kai Liu, Min Peng, Gang Tian
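The two core operations the NLWM abstract describes — partitioning the vocabulary with a fixed random seed and reweighting the next-token distribution — follow the general red/green-list pattern of LLM watermarking and can be sketched as below. This is a simplified stand-in under stated assumptions (0-bit-style partitioning, a multiplicative boost `delta`); the actual NLWM scheme embeds n bits and adds BCH error correction, which this sketch omits.

```python
import random

def partition_vocab(vocab, seed=42, green_fraction=0.5):
    """Split the vocabulary into a 'green' and a 'red' set using a
    fixed random seed, so a detector with the same seed can recompute
    the partition without querying the language model."""
    rng = random.Random(seed)
    shuffled = vocab[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * green_fraction)
    return set(shuffled[:cut]), set(shuffled[cut:])

def reweight(probs, green, delta=2.0):
    """Boost the probability of green-list tokens by a factor delta
    and renormalize, gently biasing generation toward green tokens
    (delta is an illustrative parameter, not NLWM's actual strategy)."""
    boosted = {t: p * (delta if t in green else 1.0) for t, p in probs.items()}
    z = sum(boosted.values())
    return {t: p / z for t, p in boosted.items()}
```

A detector then counts the fraction of green tokens in a suspect text: watermarked text shows a statistically significant excess.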
Improving the Robustness of Rumor Detection Models with Metadata-Augmented Evasive Rumor Datasets

Rumors on social media can cause serious harm, and advances in NLP now enable deceptive rumors that closely resemble genuine posts, necessitating more robust detection. One approach is to augment a dataset with adversarial rumors crafted to evade detection models: understanding evasive rumors and adding them to the training data improves model robustness. We demonstrate an effective data augmentation method that significantly improves detection models. State-of-the-art accuracy drops by up to 29.5% against evasive rumors, while our augmentation raises it by up to 14.62%. These results highlight the value of data augmentation for developing detection models that are robust against adversarial attacks.

Larry Huynh, Andrew Gansemer, Hyoungshick Kim, Jin B. Hong
Rumor Alteration for Improving Rumor Generation

This study investigates the impact of rumor alterations on detectability. We identify sentiment as an important qualifier for rumor detection models: altering a rumor’s sentiment can produce more evasive rumors. Using the PLAN rumor detection model and modified PHEME, Twitter15, and Twitter16 datasets, we show that altering positive and neutral sentiments reduces detection metrics by up to 1.8%. Rephrasing rumors with non-rumor content has the most significant effect, decreasing accuracy, precision, recall, and F1 by up to 5.7%. Our findings highlight the challenges of detecting altered rumors and introduce new methodologies for generating altered-rumor datasets, advancing rumor detection research and combating misinformation.

Larry Huynh, Jesse Kilcullen, Jin B. Hong
Generating Effective Answers to People’s Everyday Cybersecurity Questions: An Initial Study

Human users are often the weakest link in cybersecurity, with a large percentage of security breaches attributed to some kind of human error. When confronted with everyday cybersecurity questions - or any other questions, for that matter - users tend to turn to search engines, online forums, and, more recently, chatbots. We report on a study of the effectiveness of answers generated by two popular chatbots to an initial set of questions on typical cybersecurity challenges faced by users (e.g., phishing, use of VPNs, multi-factor authentication). The study looks not only at the accuracy of the answers generated by the chatbots but also at whether these answers are understandable, whether they are likely to motivate users to follow any provided recommendations, and whether those recommendations are actionable. Surprisingly, this initial study suggests that state-of-the-art chatbots are already reasonably good at providing accurate answers to common cybersecurity questions. Yet the study also suggests that the chatbots are not very effective at generating answers that are relevant, actionable, and, most importantly, likely to motivate users to heed their recommendations. The study proceeds with the design and evaluation of prompt engineering techniques intended to improve the effectiveness of the generated answers. Initial results suggest that it is possible to improve the effectiveness of answers, in particular their likelihood of motivating users to heed recommendations and users’ ability to act upon those recommendations, without diminishing their accuracy. We discuss the implications of these initial results and our plans for future work in this area.

Ananya Balaji, Lea Duesterwald, Ian Yang, Aman Priyanshu, Costanza Alfieri, Norman Sadeh
Propaganda to Hate: A Multimodal Analysis of Arabic Memes with Multi-agent LLMs

In the past decade, social media platforms have been used for information dissemination and consumption. While a major portion of the content is posted to promote citizen journalism and public awareness, some content is posted to mislead users. Among different content types such as text, images, and videos, memes (text overlaid on images) are particularly prevalent and can serve as powerful vehicles for propaganda, hate, and humor. In the current literature, there have been efforts to individually detect such content in memes; however, the study of their intersection is very limited. In this study, we explore the intersection between propaganda and hate in memes using a multi-agent LLM-based approach. We extend the propagandistic meme dataset with coarse- and fine-grained hate labels. Our findings suggest that there is an association between propaganda and hate in memes. We provide detailed experimental results that can serve as a baseline for future studies. We will make the experimental resources publicly available to the community ( https://github.com/firojalam/propaganda-and-hateful-memes.git ).

Firoj Alam, Md. Rafiul Biswas, Uzair Shah, Wajdi Zaghouani, Georgios Mikros
Did You Tell a Deadly Lie? Evaluating Large Language Models for Health Misinformation Identification

The rapid spread of health misinformation online poses significant challenges to public health, potentially leading to confusion, undermining trust in health authorities, and hindering effective health interventions. Large Language Models (LLMs) have shown promise in various natural language processing tasks, including misinformation detection. However, their effectiveness in identifying health-specific misinformation has not been extensively benchmarked. This study evaluates the performance of seven state-of-the-art LLMs - GPT-3.5, GPT-4, Gemini, Flan-T5 XL, Gemma, LLaMA-2, and Mistral - on the task of health misinformation detection across four datasets (Monkeypox-V1, Monkeypox-V2, COVID-19, and CoAID). The models were tested under five different settings: zero-shot classification, 5-shot random examples, 10-shot random examples, 5-shot sampled examples, and 10-shot sampled examples. Performance was evaluated using macro F1-score, and inter-model agreement was assessed using Cohen’s Kappa and Fleiss’ Kappa scores. By comprehensively benchmarking these LLMs, this study aims to determine which models excel in particular scenarios and provide insights into their potential for combating health misinformation in online environments.

Surendrabikram Thapa, Kritesh Rauniyar, Hariram Veeramani, Aditya Shah, Imran Razzak, Usman Naseem
Native vs Non-native Language Prompting: A Comparative Analysis

Large language models (LLMs) have shown remarkable abilities in different fields, including standard Natural Language Processing (NLP) tasks. To elicit knowledge from LLMs, prompts, which consist of natural language instructions, play a key role. Most open- and closed-source LLMs are trained on available labeled and unlabeled resources - digital content such as text, images, audio, and video. Hence, these models have better knowledge of high-resource languages but struggle with low-resource languages. Since prompts play a crucial role in probing model capabilities, the language used for prompting remains an important research question. Although there has been significant research in this area, it is still limited, and little has been explored for medium- to low-resource languages. In this study, we investigate different prompting strategies (native vs. non-native) on 11 different NLP tasks associated with 11 different Arabic datasets (8.7K data points). In total, we conducted 198 experiments involving 3 open- and closed-source LLMs (including an Arabic-centric model) and 3 prompting strategies. Our findings suggest that, on average, non-native prompts perform best, followed by mixed and native prompts. All prompts will be made available to the community through the LLMeBench ( https://llmebench.qcri.org/ ) framework.

Mohamed Bayan Kmainasi, Rakif Khan, Ali Ezzat Shahroor, Boushra Bendou, Maram Hasanain, Firoj Alam
DisFact: Fact-Checking Disaster Claims

The rapid proliferation of false information on the internet poses a significant challenge before, during, and after disasters, emphasizing the critical need for domain-specific automatic fact-checking systems. In this study, we introduce DisFact, a new fact-checking pipeline, and a dataset of disaster-related claims generated from Federal Emergency Management Agency (FEMA) press releases and disaster declarations. Our retrieval method involves no model training, making it more efficient and less resource-intensive. It starts by breaking a lengthy document into sentences; we then apply embeddings to compute a relevancy score for each claim-document pair, and a similarity score between the claim and individual sentences to rank the retrieved evidence. For claim verification, we utilize a deep learning approach that combines a transformer-based embedding with a feedforward neural network. The experimental findings demonstrate that our fact-checking models achieve top performance on our custom disaster dataset. Furthermore, our models outperform other state-of-the-art models on the FEVER and SciFact shared tasks, underscoring the effectiveness of our approach and its adaptability in handling longer documents and generalizing across diverse fact-checking datasets. DisFact signifies a pivotal advancement in automated fact-checking, emphasizing simplicity, accuracy, and computational efficiency. The DisFact dataset and code are available on GitHub (DisFact Dataset and Code - https://github.com/abdul0366/DisFact ).

Ademola Adesokan, Haiwei Hu, Sanjay Madria
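The training-free retrieval step the DisFact abstract describes — split a document into sentences, embed, then rank sentences by similarity to the claim — can be sketched as follows. For self-containment this sketch substitutes a toy bag-of-words embedding and cosine similarity for whatever sentence encoder DisFact actually uses; all function names here are illustrative.

```python
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words 'embedding' (stand-in for a real sentence encoder)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    nu = math.sqrt(sum(c * c for c in u.values()))
    nv = math.sqrt(sum(c * c for c in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def rank_evidence(claim, document, top_k=2):
    """Split the document into sentences and return the top_k most
    similar to the claim - mirroring DisFact's no-training retrieval."""
    sentences = [s.strip() for s in document.split(".") if s.strip()]
    claim_vec = embed(claim)
    return sorted(sentences,
                  key=lambda s: cosine(claim_vec, embed(s)),
                  reverse=True)[:top_k]
```

The ranked sentences would then feed the verification model, which classifies the claim against the retrieved evidence.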

Web Technologies

Frontmatter
Multi-perspective Conformance Checking for Email-Driven Processes

Business process conformance checking determines whether the execution of a process instance aligns with a predefined model that specifies its expected behavior. Existing approaches are usually tailored for business processes modelled and executed through traditional business process management systems and therefore fall short when the processes to be checked are described or only partially supported by rather unorthodox systems. Such is the case for email-driven processes, which are generally business process fragments executed through emailing systems. In fact, one of the main limitations of existing conformance checking techniques is that they mostly consider only the behavioral, functional, and structural perspectives of a process. In the case of email-driven processes, however, it is crucial to also take into account the contextual perspective of the events obtained from email bodies and threads. In this paper, we propose an approach for multi-perspective conformance checking of email-driven processes, taking into account both their sequential and their contextual perspectives.

Ralph Bou Nader, Ikram Garfatta, Marwa Elleuch, Walid Gaaloul, Yehia Taher
Progressive Server-Side Rendering with Suspendable Web Templates

Progressive server-side rendering (PSSR) enhances the user experience by optimizing the first contentful paint and supporting incremental rendering as data becomes available. However, constructing web templates for asynchronous data models introduces complexity due to undesired interleaving between data access completion and template processing, potentially resulting in malformed HTML. Existing asynchronous programming idioms, such as async/await or suspending functions, enable developers to create code with a synchronous appearance while maintaining non-blocking progress. In this work, we introduce the first proposal for server-side rendering (SSR) web templates that seamlessly support the async/await pattern. Our proposal addresses the challenges associated with building SSR web templates for asynchronous data models, ensuring the production of well-formed HTML documents and preserving the capability of PSSR.

Fernando Miguel Carvalho
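The key property the PSSR abstract argues for — a template that streams the page head immediately, suspends at the awaited data model, and resumes without ever emitting malformed HTML — can be illustrated with a small async sketch. This is a minimal illustration of the suspendable-template idea under stated assumptions (a simulated data source, string chunks in place of an HTTP response stream), not the paper's actual template engine.

```python
import asyncio

async def fetch_items():
    """Simulated slow asynchronous data source (placeholder for a DB or API call)."""
    await asyncio.sleep(0.01)
    return ["alpha", "beta"]

async def render(out):
    """Emit the static head immediately (enabling an early first
    contentful paint), suspend at the data-bound region, and resume
    once the awaited model arrives - keeping open/close tags in order
    so the streamed document is always well-formed."""
    out.append("<html><body><h1>Items</h1>")  # flushed before data is ready
    items = await fetch_items()               # template suspends here
    out.append("<ul>")
    for item in items:
        out.append(f"<li>{item}</li>")
    out.append("</ul></body></html>")         # closing tags emitted last

chunks = []
asyncio.run(render(chunks))
html = "".join(chunks)
```

Because the template itself awaits the data, there is no interleaving between data-access completion and template processing: the closing tags cannot be written before the list body.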
HMSC-LLMs: A Hierarchical Multi-agent Service Composition Method Based on Large Language Models

The rapid progress of large language models (LLMs) has led to their successful application in service composition and scheduling. However, LLMs exhibit poor performance when dealing with massive service collections and complex tasks. To address this challenge, this paper proposes HMSC-LLMs, a novel hierarchical multi-agent service composition method based on LLMs that interacts with users through prompts. HMSC-LLMs defines five roles for service composition - Planner, Manager, Provider, Executor, and Critic - each implemented as an LLM agent responsible for its own domain task. Specifically, the Planner decomposes complex needs into subtasks; the Manager coordinates the activities of the other roles; the Provider filters candidate services; the Executor generates service parameters and schedules the services; and the Critic supervises the entire service execution workflow to ensure that each role works correctly. In our work, HMSC-LLMs classified 17,000 services and assigned them to multiple agents. Finally, a series of experiments on the ToolBench and RapidAPI datasets shows that HMSC-LLMs outperforms traditional single-agent and multi-agent methods in terms of plan accuracy, parameter accuracy, and hallucination rate.

Xingchuang Liao, Wenjun Wu, Xiaoming Yu, Xin Ji, Yiting Chen, Junting Li
Enhancing Web Spam Detection Through a Blockchain-Enabled Crowdsourcing Mechanism

The proliferation of spam on the Web has necessitated the development of machine learning models to automate its detection. However, the dynamic nature of spam and the sophisticated evasion techniques employed by spammers often lead to low accuracy in these models. Traditional machine learning approaches struggle to keep pace with spammers’ constantly evolving tactics, resulting in a persistent challenge to maintain high detection rates. To address this, we propose blockchain-enabled incentivized crowdsourcing as a novel solution to enhance spam detection systems. We create an incentive mechanism for data collection and labeling by leveraging blockchain’s decentralized and transparent framework. Contributors are rewarded for accurate labels and penalized for inaccuracies, ensuring high-quality data. A smart contract governs the submission and evaluation process, with participants staking cryptocurrency as collateral to guarantee integrity. Simulations show that incentivized crowdsourcing improves data quality, leading to more effective machine learning models for spam detection. This approach offers a scalable and adaptable solution to the challenges faced by traditional methods.

Noah Kader, Inwon Kang, Oshani Seneviratne
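The reward-and-penalty settlement the abstract attributes to the smart contract can be sketched as a simple off-chain simulation. The function name, the `(worker, url, label)` submission format, and the reward/penalty amounts are illustrative assumptions; the paper's actual contract logic (staking, consensus resolution) is not reproduced here.

```python
def settle_stakes(submissions, truth, reward=2, penalty=1):
    """Toy settlement rule for one incentivized labeling round: each
    contributor stakes collateral per submitted label; a label that
    matches the consensus ('truth') label earns `reward`, a mismatch
    forfeits `penalty` of the stake. Returns net balance changes per
    contributor. (Illustrative only, not the paper's contract code.)"""
    balances = {}
    for worker, url, label in submissions:
        delta = reward if truth[url] == label else -penalty
        balances[worker] = balances.get(worker, 0) + delta
    return balances
```

Because inaccurate labels cost contributors real collateral, the mechanism filters out low-effort labeling before the data ever reaches the spam detection model.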
WNSWE: Web-Based Network Simulator for Web Engineering Education

The complexity of current Web Information Systems represents a challenge when educating the next generation of engineers capable of dealing with all aspects of their life cycle. Network simulators are used in university education to address this challenge and provide students with a comprehensive understanding of the interplay of inherently distributed system components and the respective protocols at different network layers. However, existing simulators only partially satisfy educational requirements: they limit educators’ control to focus on selected aspects at the desired conceptual level, offer limited support for demonstrations with interventions at runtime, and lack the flexibility to set up, use, and program the simulators at home and on mobile devices. Thus, we propose WNSWE, a Web-based Network Simulator designed for Web Engineering Education, detailing its client-side architecture, program model, and interactive capabilities. To evaluate WNSWE, we applied it to 26 educational scenarios and report on 3 case studies of employing it for classroom teaching and asynchronous assignments in different university courses and teaching settings, covering topics from basic networking to cloud computing and the security of distributed systems. Furthermore, WNSWE is provided as an open educational resource (OER), and we outline our roadmap for its extension.

Sebastian Heil, Lucas Schröder, Martin Gaedke
Backmatter
Metadata
Title
Web Information Systems Engineering – WISE 2024
Edited by
Mahmoud Barhamgi
Hua Wang
Xin Wang
Copyright Year
2025
Publisher
Springer Nature Singapore
Electronic ISBN
978-981-9605-76-7
Print ISBN
978-981-9605-75-0
DOI
https://doi.org/10.1007/978-981-96-0576-7