
Database Engineered Applications

29th International Symposium, IDEAS 2025, Newcastle upon Tyne, UK, July 14–16, 2025, Proceedings

  • 2026
  • Book

About this book

This LNCS conference volume constitutes the proceedings of the 29th International Symposium on Database Engineered Applications, IDEAS 2025, held in Newcastle upon Tyne, UK, in July 2025. The 13 full papers and 6 short papers in this book were carefully reviewed and selected from 30 submissions. They are organized in the following topical sections: Language and Models; Classification; Distributed Systems; Query Answering and Education; and Data Mining.

Table of Contents

Frontmatter

Language and Models

Frontmatter
Generative Adversarial Networks Reveal Carian, Elder Futhark, Old Hungarian and Old Turkic Script Relationships
Abstract
The precise classification and relationships of several ancient scripts have been the subject of debate for over a century. These controversial scripts include the Carian alphabet, the Elder Futhark alphabet, the Old Hungarian script, and the Yeniseian variant of the Old Turkic script. This paper settles the relationship among these controversial scripts in an objective and algorithmic way by using a Convolutional Neural Network augmented with a Generative Adversarial Network, which gives a probability of the membership of each sign in any script. The results yield a similarity metric between pairs of scripts, which allows the mapping of the evolution of these scripts using a phylogenetic tree algorithm.
Shohaib Shaffiey, Peter Z. Revesz
EHSAN: Leveraging ChatGPT in a Hybrid Framework for Arabic Aspect-Based Sentiment Analysis in Healthcare
Abstract
Arabic-language patient feedback remains under-analysed because dialect diversity and scarce aspect-level sentiment labels hinder automated assessment. To address this gap, we introduce EHSAN, a data-centric hybrid pipeline that merges ChatGPT pseudo-labelling with targeted human review to build the first explainable Arabic aspect-based sentiment dataset for healthcare. Each sentence is annotated with an aspect and sentiment label (positive, negative, or neutral), forming a pioneering Arabic dataset aligned with healthcare themes, with ChatGPT-generated rationales provided for each label to enhance transparency. To evaluate the impact of annotation quality on model performance, we created three versions of the training data: a fully supervised set with all labels reviewed by humans, a semi-supervised set with 50% human review, and an unsupervised set with only machine-generated labels. We fine-tuned two transformer models on these datasets for both aspect and sentiment classification. Experimental results show that our Arabic-specific model achieved high accuracy even with minimal human supervision, reflecting only a minor performance drop when using ChatGPT-only labels. Reducing the number of aspect classes notably improved classification metrics across the board. These findings demonstrate an effective, scalable approach to Arabic aspect-based sentiment analysis (SA) in healthcare, combining large language model annotation with human expertise to produce a robust and explainable dataset. Future directions include generalisation across hospitals, prompt refinement, and interpretable data-driven modelling.
Eman Alamoudi, Ellis Solaiman
Automated Glyph Feature Detection Using Convolutional Neural Networks
Abstract
A glyph is a specific design for a character in a writing system. Analyzing a glyph’s anatomical features can offer insight into its applications, ancestry, and historical context. However, manually identifying features is a subjective, time-consuming task. In this paper we present the Automated Letter Feature Analyzer (ALFA) system for computationally identifying a glyph’s anatomical features. ALFA uses convolutional neural networks (CNNs) along with other image analysis techniques to evaluate glyphs using both learned patterns and explicit shape metrics. A modular web-based framework was created to efficiently render, capture, and label large glyph image datasets for machine learning tasks. CNNs were trained to detect three anatomical features with overall accuracies between 97% and 99%. The system also achieved an accuracy of 98.45% when counting enclosed spaces and objects in positive space, while glyph weight by quadrant was tuned to concur with visual labeling at 97% accuracy. Results show ALFA is not only useful for collecting glyph images and labeling large image datasets but also can facilitate new research in computational linguistics by offering a way to computationally detect a glyph’s anatomical elements.
Michael Mason, Sam Kirchner, Carter Powell
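One of the explicit shape metrics the abstract mentions, counting enclosed spaces, can be computed without any learning at all. The sketch below is our own stdlib illustration of that single metric on a tiny binary bitmap (ALFA itself combines CNNs with such metrics; the bitmaps and the function name `count_enclosed` are ours):

```python
from collections import deque

def count_enclosed(bitmap):
    """Count background regions not connected to the border (4-connectivity)."""
    h, w = len(bitmap), len(bitmap[0])
    seen = [[False] * w for _ in range(h)]

    def flood(r, c):
        q = deque([(r, c)])
        seen[r][c] = True
        while q:
            y, x = q.popleft()
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and not seen[ny][nx] and bitmap[ny][nx] == 0:
                    seen[ny][nx] = True
                    q.append((ny, nx))

    # First flood all background reachable from the border...
    for r in range(h):
        for c in range(w):
            if (r in (0, h - 1) or c in (0, w - 1)) and bitmap[r][c] == 0 and not seen[r][c]:
                flood(r, c)

    # ...then every remaining background region is an enclosed counter.
    holes = 0
    for r in range(h):
        for c in range(w):
            if bitmap[r][c] == 0 and not seen[r][c]:
                holes += 1
                flood(r, c)
    return holes

# A crude 'o'-like glyph with one enclosed counter
glyph_o = [
    [1, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
```

An open shape such as a 'c' would yield zero, since its interior background connects to the border.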
A Vision for Robust and Human-Centric LLM-Based QR Code Security
Abstract
Quick Response (QR) codes are now widely used as a digital communication tool. However, their extensive adoption has made them an attractive target for cyberattacks, particularly through the injection of malicious URLs that redirect users to phishing sites or initiate malware installations. Conventional security approaches such as blacklists and antivirus software are no longer effective against such evolving threats. This vision paper proposes an AI-based framework using fine-tuned Large Language Models (LLMs) to identify malicious URLs embedded within QR codes. To ensure transparency, a novel ensemble Explainable AI (XAI) approach is applied to aggregate insights from various XAI methods and explain the features influencing model predictions, facilitating more robust interpretations. To enhance clarity and usability, the proposed framework incorporates personalized explanations tailored to cybersecurity analysts, system developers, and non-expert end users, informed by a role-specific user study. Furthermore, as XAI methods may expose sensitive model behavior, cyberattackers can craft adversarial inputs to mislead the model or manipulate explanations. This necessitates the integration of adversarial training to ensure model robustness and explanation integrity, evaluated through perturbation consistency checks. The paper outlines key challenges in explanation fidelity and personalization and presents a development roadmap to advance secure, transparent, and human-centric explainable QR code analysis.
Hissah Almousa, Ellis Solaiman

Classification

Frontmatter
Exploring Classification with Spectral Transformation
Abstract
Many classification models assign a real-valued score to each object and apply a threshold to determine class membership. While a variety of well-established methods exist for constructing such scores, the use of spectral techniques has received little attention in this context. In this paper, we explore a novel classification approach that treats the label function of binary training data as a signal over the feature space. Using the Discrete Cosine Transform (DCT), we approximate this signal on a sparse grid and reconstruct a smooth decision function whose values are subjected to a fixed threshold. This formulation inherently emphasizes low-frequency components, which promotes smoothness and potentially improves generalization. We discuss the theoretical motivation, implementation challenges, and present experiments that suggest spectral methods may offer an alternative perspective on binary classification.
Alexander Stahl
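The core idea of the abstract can be shown in one dimension: sample the binary labels on a grid, low-pass them with the DCT, and threshold the smooth reconstruction. The following is our own minimal NumPy sketch of that idea, not the paper's implementation (grid size, cutoff `K`, and the step-function labels are illustrative choices):

```python
import numpy as np

# Treat binary labels on a 1-D feature grid as a signal, keep only the
# low-frequency DCT components, and threshold the smooth reconstruction.
N, K = 64, 8                               # grid size, retained frequencies
grid = np.linspace(0.0, 1.0, N)
labels = (grid > 0.5).astype(float)        # binary "label signal"

# Orthonormal DCT-II basis, built explicitly with NumPy only
n = np.arange(N)
basis = np.cos(np.pi * (n[None, :] + 0.5) * n[:, None] / N)
basis[0] *= np.sqrt(1.0 / N)
basis[1:] *= np.sqrt(2.0 / N)

coeffs = basis @ labels                    # forward transform
coeffs[K:] = 0.0                           # low-pass: drop high frequencies
smooth = basis.T @ coeffs                  # smooth decision function

def predict(x):
    """Classify by thresholding the smooth reconstruction at 0.5."""
    return smooth[np.argmin(np.abs(grid - x))] >= 0.5
```

Away from the decision boundary the Gibbs ripple of the truncated series is small, so the thresholded reconstruction reproduces the training labels.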
Optimizing Classification Accuracy with Simulated Annealing in k-Anonymity
Abstract
In the era of extensive data collection, achieving a balance between individual privacy protection and the preservation of data utility is critical. This paper introduces a novel k-anonymization approach that integrates simulated annealing with generalization hierarchies and suppression constraints to optimize classification accuracy on anonymized datasets. Unlike traditional greedy algorithms, our method probabilistically navigates the anonymization solution space. We validate our approach through extensive experiments on two real-world datasets, Adult and MIMIC-III, comparing against the state-of-the-art ARX framework. Our method improves AUC-ROC scores by up to 3.3% over ARX, and successfully generates feasible anonymizations even under stringent privacy requirements where ARX fails, demonstrating the robustness and effectiveness of our simulated annealing-based anonymization strategy.
Despina Tawadros, Wenhui Yang, Lena Wiese, Volker Meyer
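The probabilistic search the abstract contrasts with greedy algorithms follows the standard simulated-annealing acceptance rule: always accept improvements, accept worsening moves with probability e^(-Δ/t), and cool t geometrically. Below is our own generic stdlib skeleton of that loop under stated assumptions; the paper's actual state space is generalization hierarchies with suppression constraints, whereas the toy cost here is just an integer distance:

```python
import math
import random

def anneal(cost, neighbor, state, t0=1.0, cooling=0.95, steps=200):
    """Generic simulated-annealing loop; returns the best state seen."""
    t = t0
    best = state
    for _ in range(steps):
        cand = neighbor(state)
        delta = cost(cand) - cost(state)
        # Accept better states always; worse states with probability e^(-delta/t)
        if delta <= 0 or random.random() < math.exp(-delta / t):
            state = cand
        if cost(state) < cost(best):
            best = state
        t *= cooling                        # geometric cooling schedule
    return best

# Toy usage: random-walk toward the integer minimizing (x - 7)^2
random.seed(0)
result = anneal(lambda x: (x - 7) ** 2,
                lambda x: x + random.choice([-1, 1]),
                0)
```

Tracking `best` separately from `state` means the occasional accepted worsening move never loses the best anonymization found so far.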
Predicting Gelation in Copolymers Using Deep Learning Through a Comparative Study of ANN, CNN, and LSTM Models with SHAP Explainability
Abstract
This study presents a Deep Learning (DL)-based approach to predict gelation behavior in copolymer systems using compositional and physicochemical descriptors. Three architectures—Artificial Neural Network (ANN), Convolutional Neural Network (CNN), and Long Short-Term Memory (LSTM)—were tested and evaluated under conditions of pronounced class imbalance. The ANN model achieved the best performance, with an Accuracy (ACC) of 94% and an F1-score of 0.57, demonstrating strong discriminative capability in the binary classification of gelation propensity. To enhance robustness, threshold optimization was employed, and SHapley Additive exPlanations (SHAP) was used to identify key predictors, including specific monomer concentrations and degree of polymerization. The findings demonstrate that data-driven methods can effectively capture complex gelation patterns and provide interpretable, mechanistically relevant insights. This study underscores the potential of Artificial Intelligence (AI) to accelerate polymer design while reducing reliance on empirical experimentation.
Selahattin Barış Çelebi, Ammar Aslan, Mutlu Canpolat
A Total Variation Regularized Framework for Epilepsy-Related MRI Image Segmentation
Abstract
Focal Cortical Dysplasia (FCD) is a primary cause of drug-resistant epilepsy and is difficult to detect in brain magnetic resonance imaging (MRI) due to the subtle and small-scale nature of its lesions. Accurate segmentation of FCD regions in 3D multimodal brain MRI images is essential for effective surgical planning and treatment. However, this task remains highly challenging due to the limited availability of annotated FCD datasets, the extremely small size and weak contrast of FCD lesions, the complexity of handling 3D multimodal inputs, and the need for output smoothness and anatomical consistency, which is often not addressed by standard voxel-wise loss functions. This paper presents a new framework for segmenting FCD regions in 3D brain MRI images. We adopt state-of-the-art transformer-enhanced encoder-decoder architecture and introduce a novel loss function combining Dice loss with an anisotropic Total Variation (TV) term. This integration encourages spatial smoothness and reduces false positive clusters without relying on post-processing. The framework is evaluated on a public FCD dataset with 85 epilepsy patients and demonstrates superior segmentation accuracy and consistency compared to standard loss formulations. The model with the proposed TV loss shows an 11.9% improvement on the Dice coefficient and 13.3% higher precision over the baseline model. Moreover, the number of false positive clusters is reduced by 61.6%.
Mehdi Rabiee, Sergio Greco, Reza Shahbazian, Irina Trubitsyna
Enhancing Flight Delay Prediction with Network-Aware Ensemble Learning
Abstract
This study presents a comprehensive framework for predicting departure delays in U.S. domestic aviation by integrating advanced feature engineering, network analysis, and ensemble learning methods. Using a dataset of 2,638,673 flights across 354 airports from May to August 2024, we engineered predictors using temporal features (cyclical time), operational metrics (airport congestion), and network characteristics (in-/out-degree centrality and cluster labels). We extracted data for the five airlines with the highest number of flights: Southwest (WN), American (AA), Delta (DL), United (UA), and SkyWest (OO). A novel greedy mutual information and correlation-based feature selection method was then applied to each dataset to improve prediction performance. Multiple classifiers, including Random Forest (RF), Extra Trees (ET), XGBoost, and LightGBM, were evaluated. RF and ET consistently outperformed the others, motivating their inclusion in a Voting ensemble. The Voting classifier achieved robust performance across all five airlines, with overall accuracy ranging from 88.9% to 91.8%, F1–scores between 88.5% and 91.4%, and AUC–ROC values all above 95%. DL yielded the highest performance (91.8% accuracy and 96.8% AUC–ROC). These results demonstrate that combining network–cluster information with rich historical features substantially improves delay prediction, providing a scalable approach for airlines and air traffic managers to mitigate operational disruptions.
Mary Dufie Afrane, Yao Xu, Lixin Li
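The Voting ensemble the abstract describes reduces, for hard voting, to taking the majority label across base models per flight. A minimal stdlib sketch of that aggregation step, with hypothetical per-model outputs (the paper's actual members are RF, ET, XGBoost, and LightGBM trained on the engineered features):

```python
from collections import Counter

def vote(predictions):
    """Hard-voting: return the majority label among per-model predictions."""
    return Counter(predictions).most_common(1)[0][0]

# Hypothetical per-model outputs for three flights (1 = delayed, 0 = on time)
model_outputs = [
    [1, 1, 0],   # RF, ET, XGBoost on flight A
    [0, 0, 0],   # flight B
    [1, 0, 1],   # flight C
]
ensemble = [vote(p) for p in model_outputs]
```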

Distributed Systems

Frontmatter
FedMod: Vertical Federated Learning Using Multi-server Secret Sharing
Abstract
Vertical Federated Learning (VFL) allows multiple entities with feature-partitioned datasets to collaboratively train machine learning models while keeping their data private. However, many existing VFL approaches either rely on a single trusted server or incur significant computational and communication overhead due to encryption-based techniques. In this paper, we propose FedMod, a scalable and lightweight VFL framework that removes the need for trusted parties or cryptographic primitives. FedMod introduces a novel multi-server architecture combined with additive secret sharing to protect intermediate computations during training. We conduct extensive experiments across multiple real-world datasets and benchmark FedMod against state-of-the-art methods including homomorphic encryption, differential privacy, and functional encryption. Our results show that FedMod achieves comparable or superior accuracy with significantly lower computation time and communication cost. Moreover, FedMod provides strong protection in the semi-honest setting and remains secure even when some parties or servers partially collude. These results highlight FedMod’s practicality for real-world privacy-preserving collaborative learning scenarios.
Kasra Mojallal, Ali Abbasi Tadi, Dima Alhadidi
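The cryptographic primitive FedMod builds on, additive secret sharing, can be illustrated in a few lines: a secret is split into random shares that sum to it modulo a prime, so no proper subset of shares reveals anything. This is our own illustrative sketch of the primitive (modulus choice and function names are ours, not the FedMod code):

```python
import random

P = 2**61 - 1  # a Mersenne prime modulus for the share arithmetic

def share(secret, n_servers):
    """Split `secret` into n additive shares that sum to it mod P."""
    shares = [random.randrange(P) for _ in range(n_servers - 1)]
    shares.append((secret - sum(shares)) % P)
    return shares

def reconstruct(shares):
    """Recover the secret by summing all shares mod P."""
    return sum(shares) % P

shares = share(42, 3)        # one share per server
recovered = reconstruct(shares)
```

Because the scheme is linear, servers can add their shares of two secrets locally and the reconstruction yields the sum, which is what makes intermediate training computations protectable without encryption.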
Throughput-Driven Database Replication Using a Ring-Based Order Protocol
Abstract
We present a database replication architecture that guarantees ACID transaction properties as well as the high throughput expected of modern database systems. Higher throughput results from server replicas processing distinct, non-overlapping subsets of incoming transactions in parallel. Our novel approach addresses all challenges that emerge in ensuring ACID properties across all incoming transactions processed in parallel, even when the access patterns of transactions are not known a priori. At the core of our approach is a high-throughput, ring-based total order protocol which the database replicas use to reach consensus for resolving conflicts among transactions, ensuring serializability and accomplishing atomic commit. After presenting the architecture, protocol performance is evaluated through implementations with replication degrees of two and three, tolerating at most one replica crash. While 2-fold replication requires perfect crash detection, 3-fold replication can make do with weak failure detectors.
Ye Liu, Paul Ezhilchelvan, Yingming Wang, Jim Webber
Blockchain-Backed Fuzzy Search for Semi-structured Translation Data: A Scalable Hybrid Approach with Hyperledger Fabric and Elasticsearch
Abstract
Translation Memory (TM) systems are critical components of modern computer-aided translation (CAT) workflows. However, centralized TM platforms often lack transparency, user control, and verifiable trust guarantees. This paper introduces a decentralized architecture that enables scalable fuzzy search over TM segments while ensuring data provenance and integrity through blockchain validation. The proposed solution integrates Elasticsearch for high-performance approximate matching with Hyperledger Fabric as a trust-enforcing validation layer. The proposed system is designed to be interoperable with standard CAT tools via a backend gateway that performs fuzzy retrieval and verifies match authenticity using smart contracts. Importantly, only hashed metadata is stored on-chain, preserving confidentiality while enabling auditability. We conducted four experimental rounds with datasets of 100k, 500k, 1M and 10M segments to assess the system’s performance and scalability. Results show that the architecture maintains sub-second query times even at scale, with the blockchain validation layer introducing minimal overhead. These findings demonstrate the feasibility of integrating decentralized trust mechanisms into real-time linguistic data systems. The research illustrates how database engineering principles can be effectively combined with blockchain technologies to meet the evolving demands of secure and decentralized collaboration.
Edvan Soares, Valeria Times
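The "only hashed metadata on-chain" design in the abstract amounts to storing a digest of each TM segment on the ledger and checking retrieved matches against it. A stdlib sketch of that verification step, with a hypothetical segment schema (field names are ours; the actual on-chain check runs in Hyperledger Fabric smart contracts):

```python
import hashlib
import json

def digest_of(segment):
    """Canonical SHA-256 digest of a segment's metadata (sorted keys)."""
    return hashlib.sha256(
        json.dumps(segment, sort_keys=True).encode()
    ).hexdigest()

def verify(candidate, on_chain_digest):
    """Check a fuzzy-search result against the digest stored on-chain."""
    return digest_of(candidate) == on_chain_digest

# Hypothetical TM segment; only its digest would be written to the ledger
segment = {"id": "tm-001", "source": "Hello world", "target": "Hallo Welt"}
on_chain = digest_of(segment)
```

Any tampering with the segment's source or target text changes the digest, so the gateway can detect it without the ledger ever holding the confidential text itself.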

Query Answering and Education

Frontmatter
Towards Sustainable DBMS: A Framework for Real-Time Energy Estimation and Query Categorization
Abstract
Energy efficiency in database management systems (DBMS) is increasingly critical due to the rising computational demands of modern applications. Our work proposes a complete framework to analyze energy consumption. We developed a real-time monitoring framework that captures CPU and memory utilization during query execution and estimates energy consumption. We have implemented a query logging mechanism to track and analyze execution time. We propose an energy estimation model that computes power consumption using CPU utilization metrics, together with a query categorization based on energy usage profiles. We studied the correlation between execution time and energy consumption using Pearson correlation. We propose a power-based classification of SQL query types, enabling more energy-aware optimization strategies. The results of our analysis highlight opportunities for power-aware query optimization, making DBMS operations greener and more efficient.
Tidenek Fekadu Kore, David Sarramia, Myoung-Ah Kang, François Pinet
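The Pearson analysis the abstract mentions is a single formula: covariance of execution time and energy divided by the product of their standard deviations. A self-contained stdlib sketch with made-up per-query numbers (the real framework derives these from live CPU utilization metrics):

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

exec_time_s = [0.2, 0.5, 1.1, 2.4, 3.0]    # made-up per-query runtimes (s)
energy_j    = [1.1, 2.6, 5.2, 11.8, 14.9]  # made-up energy estimates (J)
r = pearson(exec_time_s, energy_j)
```

A coefficient near 1 on such data supports using execution time as a cheap proxy when direct power measurement is unavailable.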
Context-Aware Visualization for Explainable AI Recommendations in Social Media: A Vision for User-Aligned Explanations
Abstract
Social media platforms today strive to improve user experience through AI recommendations, yet the value of such recommendations vanishes as users do not understand the reasons behind them. This issue arises because explainability in social media is general and lacks alignment with user-specific needs. In this vision paper, we outline a user-segmented and context-aware explanation layer by proposing a visual explanation system with diverse explanation methods. The proposed system is framed by the variety of user needs and contexts, showing explanations in different visualized forms, including a technically detailed version for AI experts and a simplified one for lay users. Our framework is the first to jointly adapt explanation style (visual vs. numeric) and granularity (expert vs. lay) inside a single pipeline. A public pilot with 30 X users will validate its impact on decision-making and trust.
Banan Mohammad Alkhateeb, Ellis Solaiman
Transparent Adaptive Learning via Data-Centric Multimodal Explainable AI
Abstract
Artificial intelligence-driven adaptive learning systems are reshaping education through data-driven adaptation of learning experiences. Yet many of these systems lack transparency, offering limited insight into how decisions are made. Most explainable AI (XAI) techniques focus on technical outputs but neglect user roles and comprehension. This paper proposes a hybrid framework that integrates traditional XAI techniques with generative AI models and user personalisation to generate multimodal, personalised explanations tailored to user needs. We redefine explainability as a dynamic communication process tailored to user roles and learning goals. We outline the framework’s design, key XAI limitations in education, and research directions on accuracy, fairness, and personalisation. Our aim is to move towards explainable AI that enhances transparency while supporting user-centred experiences.
Maryam Mosleh, Marie Devlin, Ellis Solaiman
Analyzing Student Feedback to Assess NoSQL Education
Abstract
We present a learning analytics study on a Master-level practical database course designed to equip students with hands-on experience in the assessment of four database systems for specific use cases. The course is based on a React web application to monitor task performance, including success rates, executability, processing time, and perceived difficulty. Learning analytics conducted across two semesters reveals trends in student success and challenges, such as superior performance in Schema Evolution tasks with PostgreSQL and increased difficulty in Network Analysis tasks across all databases. While Cassandra’s lack of join capabilities introduces additional learning complexities, Neo4J demonstrates consistently higher executability and ease of syntax.
Vanessa Meyer, Lena Wiese, Ahmed Al-Ghezi

Data Mining

Frontmatter

Open Access

Data Mining for Language Superfamilies Using Congruent Sound Groups
Abstract
There have been several attempts in recent years to prove that some well-known language families can be grouped together into a language superfamily. This paper presents a data mining method to search for a language superfamily. The data mining method is based on the consideration of regular sound changes in various languages. Congruent Sound Groups are derived from the commonly observed regular sound changes. To demonstrate the feasibility of this method, we collected a set of words related to the four basic elements of air, earth, fire and water from seven languages: Hindi, Japanese, Korean, Russian, Sanskrit, Tamil and Telugu. These seven languages are classified into four different language families: Dravidian, Indo-European, Japonic, and Koreanic. The congruent sound group-based analysis enabled the identification of seven cognate groups of words that involve different language families. This suggests that these four different language families originate from a single protolanguage that was likely spoken in Asia more than 10,000 years ago.
Peter Z. Revesz, Mohanendra Siddha
A Cross-Linguistic Analysis of Linear A, Linear B and Swahili
Abstract
Prior computational work on Linear A has focused on isolated statistical models with little cross-linguistic integration. This study introduces a unified computational framework analyzing Linear A, Linear B, and Swahili through probabilistic modeling and pattern mining. We construct syllable-level Markov models, extract transition probability matrices, and apply Jaccard similarity and Apriori rule mining to uncover structural patterns. Results reveal recurring transition clusters in Linear A suggestive of morphological structure and identify significant syllabic co-occurrences in Swahili that align with patterns in ancient scripts. Cross-script analysis highlights potential phonotactic and positional correspondences, offering new computational insights into Linear A’s linguistic organization. The methods presented can help in the decipherment of scripts that lack bilingual texts.
Joslin Ishimwe, Adrian Ratwatte, Prince Ngiruwonsanga
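Two building blocks of the abstract's pipeline, syllable-level Markov transition estimation and Jaccard comparison of observed transitions, fit in a short stdlib sketch. The syllable streams below are toy data of ours, not the Linear A, Linear B, or Swahili corpora:

```python
from collections import defaultdict

def transitions(syllables):
    """Estimate a first-order Markov transition table from a syllable stream."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(syllables, syllables[1:]):
        counts[a][b] += 1
    return {a: {b: c / sum(nxt.values()) for b, c in nxt.items()}
            for a, nxt in counts.items()}

def jaccard(pairs_a, pairs_b):
    """Jaccard similarity of two sets of observed syllable transitions."""
    return len(pairs_a & pairs_b) / len(pairs_a | pairs_b)

script_a = ["ka", "ro", "ka", "da", "ro", "ka"]   # toy syllable streams,
script_b = ["ka", "ro", "ka", "na"]               # not real corpus data

tm = transitions(script_a)
pairs_a = set(zip(script_a, script_a[1:]))
pairs_b = set(zip(script_b, script_b[1:]))
sim = jaccard(pairs_a, pairs_b)
```

In the paper, matrices like `tm` feed the cross-script comparison, while set-based similarity like `sim` quantifies how much transition structure two scripts share.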
Automated Identification of Allographs Among the Indus Valley Script Signs
Abstract
One of the major reasons for the lack of decipherment of the Indus Valley Script is that even its set of signs has not been precisely identified. This paper proposes a fully automated method leveraging computer vision and statistical modeling to analyze the Indus Valley Script. The automated method includes sign segmentation using adaptive thresholding and morphology, followed by visual feature extraction with VGG16 deep learning and dimensionality reduction via principal component analysis. Clustering with K-means grouped the previously proposed 417 Indus Valley Script signs into 50 clusters, which would be a more reasonable number if the Indus Valley Script is a syllabic or alphabetic script. The automated method also builds a first-order Markov chain on the 50 sign clusters. The Markov model reveals some frequent self-loops and other interesting patterns that hint at the grammar of the underlying language of the Indus Valley Script. The grammar could lead to the identification of related languages, aiding the decipherment of the Indus Valley Script. The proposed automated method could be adapted to the study of other undeciphered scripts.
Harsh Tamkiya, Gunjit Agrawal, Chiradeep Debnath, Peter Z. Revesz
Backmatter
Title
Database Engineered Applications
Edited by
Giacomo Bergami
Paul Ezhilchelvan
Yannis Manolopoulos
Sergio Ilarri
Jorge Bernardino
Carson K. Leung
Peter Z. Revesz
Copyright Year
2026
Electronic ISBN
978-3-032-06744-9
Print ISBN
978-3-032-06743-2
DOI
https://doi.org/10.1007/978-3-032-06744-9

The PDF files of this book were created in accordance with the PDF/UA-1 standard to improve accessibility. This includes screen-reader support, described non-text content (images, graphics), bookmarks for easy navigation, keyboard-friendly links and forms, and searchable and selectable text. We recognize the importance of accessibility and welcome inquiries about the accessibility of our products. For questions or accessibility needs, please contact us at accessibilitysupport@springernature.com.
