
## About this book

This book constitutes the proceedings of the 7th International Conference on Future Data and Security Engineering, FDSE 2020. The conference was planned for Quy Nhon, Vietnam, in November 2020, but was held virtually due to the COVID-19 pandemic.

The 24 full papers (of 53 accepted full papers) presented together with 2 invited keynotes were carefully reviewed and selected from 161 submissions. The other 29 accepted full and 8 short papers are included in CCIS 1306. The selected papers are organized into the following topical headings: security issues in big data; big data analytics and distributed systems; advances in big data query processing and optimization; blockchain and applications; industry 4.0 and smart city: data analytics and security; advanced studies in machine learning for security; and emerging data management systems and applications.

## Table of contents

### Blockchain Technology: Intrinsic Technological and Socio-Economic Barriers

FDSE’2020 Keynote
Abstract
Since the introduction of Bitcoin in 2009 and its immense resonance in the media, we have seen a plethora of envisioned blockchain solutions. Usually, such blockchain solutions claim to be disruptive. Often, such disruptiveness takes the form of a proclaimed blockchain revolution. In this paper, we want to look at blockchain technology from a neutral, analytical perspective. Our aim is to understand technological and socio-economic barriers to blockchain solutions that are intrinsic to the blockchain technology stack itself. We look into the permissionless blockchain as well as the permissioned blockchain. We start with a characterization of cryptocurrency as one-tiered uncollateralized M1 money. We proceed with defining essential modes of business communications (message authentication, signature, registered letter, contract, order etc.) and how they are digitized classically. We review potential blockchain solutions for these modes of communications, including socio-economic considerations. At the technical level, we discuss scalability issues and potential denial-of-service attacks. On the other hand, we also look into four successful blockchain solutions and explain their design. Now: what is the blockchain revolution and how realistic is it? Will it shake up our institutions? Or, vice versa: does it have to rely on a re-design of our institutions instead? Can we design useful blockchain solutions independent of fundamental institutional re-design? It is such questions that have motivated us to compile this paper, and we hope that we are able to shed some light on them.
Ahto Buldas, Dirk Draheim, Takehiko Nagumo, Anton Vedeshin

### Data Quality for Medical Data Lakelands

Abstract
Medical research requires biological material and data. Medical studies based on data with unknown or questionable quality are useless or even dangerous, as evidenced by recent examples of withdrawn studies. Medical data sets consist of highly sensitive personal data, which has to be protected carefully and is only available for research after approval by ethics committees. These data sets, therefore, cannot be stored in central data warehouses or even in a common data lake but remain in a multitude of data lakes, which we call Data Lakelands. An example of such Medical Data Lakelands is the collection of samples and their annotations in the European federation of biobanks (BBMRI-ERIC). We discuss the quality dimensions for data sets for medical research and the requirements for providers of data sets in terms of both quality of meta-data and meta-data of data quality documentation, with the aim of helping researchers to identify suitable data sets for medical studies effectively and efficiently.

### Authorization Policy Extension for Graph Databases

Abstract
The sharp increase in the use of graph databases, including for business- and privacy-critical applications, demands a sophisticated, flexible, fine-grained authorization and access control approach. Attribute-based access control (ABAC) supports a fine-grained definition of authorization rules and policies. Attributes can be associated with the subject, the requested resource and action, but also the environment. Thus, this is a promising starting point. However, specific characteristics of graph-structured data, such as attributes on vertices and edges along a path to the resource, are not yet considered. The well-established eXtensible Access Control Markup Language (XACML), which defines a declarative language for fine-grained, attribute-based authorization policies, is the basis for our proposed approach - XACML for Graph-structured data (XACML4G). The additional path-specific constraints, described in graph patterns, demand specialized processing of the rules and policies as well as adapted enforcement and decision making in the access control process. To demonstrate XACML4G and its enforcement process, we present a scenario from the university domain. Due to the project’s environment, the prototype is built with the multi-model database ArangoDB. The results are promising and further studies concerning performance and use in practice are planned.
Aya Mohamed, Dagmar Auer, Daniel Hofer, Josef Küng

### A Model-Driven Approach for Enforcing Fine-Grained Access Control for SQL Queries

Abstract
In this paper we propose a novel, model-driven approach for enforcing fine-grained access control (FGAC) policies when executing SQL queries. More concretely, we define a function $$\mathrm{SecQuery}()$$ that, given an FGAC policy $$\mathcal{S}$$ and a SQL select-statement q, generates a SQL stored-procedure, such that: if a user is authorized, according to $$\mathcal{S}$$, to execute q, then calling this stored-procedure returns the same result as executing q; otherwise, if a user is not authorized, according to $$\mathcal{S}$$, to execute q, then calling this stored-procedure signals an error. We have implemented our approach in an open-source project, called SQL Security Injector (SQLSI).
Phước Bảo Hoàng Nguyễn, Manuel Clavel
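
The contract described in this abstract (same result if authorized, an error otherwise) can be illustrated with a small Python sketch. This is not the SQLSI implementation: SQLite has no stored procedures, so a Python closure stands in for the generated procedure, and the names `sec_query`, `Policy`, `NotAuthorized`, and the toy policy are all assumptions made for illustration. Real FGAC policies are data-dependent, whereas the toy policy below only inspects the user name and the query text.

```python
import sqlite3
from typing import Callable

# Hypothetical policy type: decides whether `user` may run the query text.
Policy = Callable[[str, str], bool]


class NotAuthorized(Exception):
    """Stands in for the error signalled by the generated stored procedure."""


def sec_query(policy: Policy, query: str):
    """Return a callable that plays the role of the generated stored procedure."""
    def procedure(conn: sqlite3.Connection, user: str):
        if not policy(user, query):
            raise NotAuthorized(f"{user} may not execute: {query}")
        # Authorized: return exactly what executing the query returns.
        return conn.execute(query).fetchall()
    return procedure


def demo():
    """Build a toy database, a toy policy, and the secured procedure."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE lecturer (name TEXT, salary INTEGER)")
    conn.executemany("INSERT INTO lecturer VALUES (?, ?)",
                     [("An", 100), ("Binh", 120)])
    q = "SELECT name, salary FROM lecturer ORDER BY name"

    # Toy policy (assumption): only 'admin' may read the salary column.
    def policy(user: str, query: str) -> bool:
        return user == "admin" or "salary" not in query

    return conn, q, sec_query(policy, q)
```

Calling `proc(conn, "admin")` on the objects returned by `demo()` yields the same rows as running the query directly, while `proc(conn, "guest")` raises `NotAuthorized`.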

### On Applying Graph Database Time Models for Security Log Analysis

Abstract
Log files are a crucial source of information for computer security experts. The time domain is of particular interest, since timestamps are sometimes the only link between associated events caused by attackers, faulty systems or similar. The idea of storing and analyzing log information in graph databases raises the question of how to model the time aspect and, in particular, how timestamps should be stored and connected in a proper form.
This paper analyzes three different models in which time information extracted from log files can be represented in graph databases and how the data can be retrieved again in a form that is suitable for further analysis. The first model resembles data stored in a relational database, the second enhances this approach with graph-database-specific amendments, and the last makes almost full use of a graph database’s capabilities. The main focus is on the queries for retrieving the data, their complexity, the expressiveness of the underlying data model and the suitability for usage in graph databases.
Daniel Hofer, Markus Jäger, Aya Mohamed, Josef Küng

### Integrating Web Services in Smart Devices Using Information Platform Based on Fog Computing Model

Abstract
In the present research, we propose an information platform for integrating ordinary web services in smart devices. It is based on a fog computing model that enables a fog node to mediate between web services and smart devices. The proposed platform enables the use of the same services and data regardless of the type of smart devices. As an example of such a platform, we construct a ToDo management service for teams collaborating via the Internet. The presented proposal outlines the way of establishing communications between such a web service and different kinds of smart speakers.
Takeshi Tsuchiya, Ryuichi Mochizuki, Hiroo Hirose, Tetsuyasu Yamada, Norinobu Imamura, Naoki Yokouchi, Keiichi Koyanagi, Quang Tran Minh

### Adaptive Contiguous Scheduling for Data Aggregation in Multichannel Wireless Sensor Networks

Abstract
Multichannel wireless sensor networks (MWSNs) have recently attracted attention for data aggregation because they significantly reduce the data aggregation delay. However, in these environments we must consider not only timeslot collisions but also channel collisions. In addition, the data collection rate and the energy consumption of the networks are important problems that need to be solved. In this paper, we propose a scheduling scheme named Adaptive Contiguous Scheduling for data aggregation in MWSNs. The proposed scheme applies a parent-changing approach and a channel-reuse strategy to reduce the number of channels used to allocate nodes in the network, thereby saving energy. The experimental results show that our scheme reduces the number of channels used by up to 69.57% and 72%, as compared to state-of-the-art algorithms.
Van-Vi Vo, Tien-Dung Nguyen, Duc-Tai Le, Hyunseung Choo

### Relating Network-Diameter and Network-Minimum-Degree for Distributed Function Computation

Abstract
Distributed computing network-systems are modeled as directed/undirected graphs with vertices representing compute elements and adjacency-edges capturing their uni- or bi-directional communication. To quantify an intuitive tradeoff between two graph-parameters: minimum vertex-degree and diameter of the underlying graph, we formulate an extremal problem with the two parameters: for all positive integers n and d, the extremal value $$\nabla (n, d)$$ denotes the least minimum vertex-degree among all connected order-n graphs with diameters of at most d. We prove matching upper and lower bounds on the extremal values of $$\nabla (n, d)$$ for various combinations of n- and d-values.
H. K. Dai, M. Toulouse

### Growing Self-Organizing Maps for Metagenomic Visualizations Supporting Disease Classification

Abstract
Numerous medical models based on the personalized medicine approach have been investigated to provide more efficient treatments and improved health-care service. Metagenomic data - the genomic samples of microbial communities - appear to be one of the most valuable sources to test the hypotheses for these models. However, interpreting this source is hard due to its very high dimension. As a result, some visualization methods have been proposed to deal with metagenomic data. These methods are not only for representing the numerical data but also for leveraging deep learning algorithms on the generated images to improve the diagnosis. In this study, we present an approach that uses Growing Self-Organizing Maps to transform features of three species metagenomic datasets into images. Then, the generated images are fed into a Convolutional Neural Network to perform disease prediction tasks. The proposed method produces promising performance compared to other visualization approaches.
Hai Thanh Nguyen, Bang Anh Nguyen, My N. Nguyen, Quoc-Dinh Truong, Linh Chi Nguyen, Thao Thuy Ngoc Banh, Phung Duong Linh

### On Norm-Based Locality Measures of 2-Dimensional Discrete Hilbert Curves

Abstract
A discrete space-filling curve provides a 1-dimensional indexing or traversal of a multi-dimensional grid space. Sample applications of space-filling curves include multi-dimensional indexing methods, data structures and algorithms, parallel computing, and image compression. Locality preservation reflects proximity between grid points, that is, close-by grid points are mapped to close-by indices or vice versa. The underlying locality measure for our studies, based on the p-normed metric $$d_{p}$$, is the maximum ratio of $$d_{p}(v, u)^{m}$$ to $$d_{p}(\tilde{v}, \tilde{u})$$ over all corresponding point-pairs $$(v, u)$$ and $$(\tilde{v}, \tilde{u})$$ in the m-dimensional grid space and 1-dimensional index space, respectively. Our analytical results close the gaps between the current best lower and upper bounds with exact formulas for $$p \in \{1, 2\}$$, and extend to all reals $$p \ge 2$$, and our empirical results will shed some light on determining the exact formulas for the locality measure for all reals $$p \in (1, 2)$$.
H. K. Dai, H. C. Su
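
The locality measure defined in this abstract can be evaluated by brute force on small grids. The sketch below is an illustration, not the authors' method: it uses the widely known iterative index-to-coordinate conversion for the 2-dimensional Hilbert curve, takes the distance in index space to be the index difference, and the function names `d2xy`, `dist_p`, and `locality` are chosen here for convenience.

```python
from itertools import combinations


def d2xy(order: int, d: int) -> tuple:
    """Map index d to (x, y) on the Hilbert curve over a 2^order x 2^order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/reflect the sub-square so consecutive cells stay adjacent.
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def dist_p(a, b, p: float) -> float:
    """p-normed distance between two grid points."""
    return (abs(a[0] - b[0]) ** p + abs(a[1] - b[1]) ** p) ** (1.0 / p)


def locality(order: int, p: float, m: int = 2) -> float:
    """Maximum of d_p(v, u)^m / |i - j| over all pairs of curve indices i < j."""
    pts = [d2xy(order, d) for d in range(4 ** order)]
    return max(
        dist_p(pts[i], pts[j], p) ** m / (j - i)
        for i, j in combinations(range(len(pts)), 2)
    )
```

For example, `locality(3, 1)` and `locality(3, 2)` give the measure for the Manhattan and Euclidean metrics on an 8 x 8 grid. The brute-force search is quadratic in the number of grid points, so it is only practical for small orders.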

### A Comparative Study of Join Algorithms in Spark

Abstract
In the era of information explosion, the amount of data generated is increasing day by day, reaching the threshold of petabytes or even zettabytes. In order to extract useful information from a variety of huge data sources, we need effective computational operations performed in a parallel and distributed manner on a cluster of computers. These operations involve a lot of complex and expensive processing. One of the typical and frequently used operations in queries is the join operation, which combines two or more datasets into one. Although there are some studies on join operations in Spark, none has provided an adequate and systematic comparison of join algorithms in the Spark environment. Therefore, this study is dedicated to the join operation in Spark. It describes important strategies for implementing the join operation in detail and exposes the advantages and disadvantages of each one. In addition, the work provides a more thorough comparison of the joins by using a mathematical cost model and experimental verification.
Anh-Cang Phan, Thuong-Cang Phan, Thanh-Ngoan Trieu
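
To make the idea of competing join strategies concrete, here is a single-machine Python sketch of two strategies that Spark is known to offer, a broadcast hash join and a sort-merge join. The abstract does not list which algorithms the paper compares, so this choice is an assumption, and the code uses plain Python lists of `(key, value)` pairs rather than any Spark API.

```python
from collections import defaultdict


def broadcast_hash_join(small, large):
    """Build a hash table on the small side, then stream the large side past it."""
    table = defaultdict(list)
    for key, value in small:
        table[key].append(value)
    return [(key, sv, lv) for key, lv in large for sv in table.get(key, [])]


def sort_merge_join(left, right):
    """Sort both sides by key, then merge runs of equal keys."""
    left = sorted(left, key=lambda kv: kv[0])
    right = sorted(right, key=lambda kv: kv[0])
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out
```

Both functions return the same set of `(key, left_value, right_value)` triples for an inner equi-join. The trade-off they illustrate is that the hash variant avoids sorting but needs the small side to fit in memory, while the merge variant pays for sorting but works on inputs of any size.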

### Blockchain-Based Forward and Reverse Supply Chains for E-waste Management

Abstract
In this paper, we propose a novel smart e-waste management system that leverages blockchain technology and smart contracts and considers both forward and reverse supply chains. This allows the proposed system to capture the whole life cycle of e-products, from their manufacturing (as new products) to their disposal (as e-waste) and their recycling back to raw materials. In this context, we address various challenges and limitations that existing blockchain-based solutions face, especially incomplete coverage of e-products’ life cycle, access control, payment mechanism, incentivization, scalability issues, missing experimental validation, etc. We present a prototype implementation of the system as a proof-of-concept using Solidity on the Ethereum platform, and we perform an experimental evaluation to demonstrate its feasibility and performance in terms of execution gas cost.
Swagatika Sahoo, Raju Halder

### A Pragmatic Blockchain Based Solution for Managing Provenance and Characteristics in the Open Data Context

Abstract
Nowadays, open data is a vital input resource for many systems. Information originates from different sources and is reused by many applications for different purposes, which exposes several problems in managing data provenance and characteristics. Meanwhile, blockchain is a newly rising technology that is attracting much attention around the world. With its immutability, transparency, distributed mechanisms, and automation capabilities, blockchain is a possible direction for solving these problems. This paper presents a model design for integrating blockchain into an open data platform to resolve this issue. The research covers related studies, integration mechanisms, and operating procedures, showing the standard model of communication between the two platforms and an experimental real model with CKAN and Hyperledger Fabric. The result shows that this combination is logical and a feasible direction with high applicability. Also, the proposed solution is general and scalable for further requirements in the future.
Tran Khanh Dang, Thu Duong Anh

### OAK: Ontology-Based Knowledge Map Model for Digital Agriculture

Abstract
Nowadays, a huge amount of knowledge has been amassed in digital agriculture. This knowledge and know-how information are collected from various sources, hence the question is how to organise this knowledge so that it can be efficiently exploited. Although knowledge about agricultural practices can be represented using ontologies, rule-based expert systems, or knowledge models built from data mining processes, scalability remains an open issue. In this study, we propose a knowledge representation model, called an ontology-based knowledge map, which can collect knowledge from different sources, store it, and make it exploitable either directly by stakeholders or as an input to the knowledge discovery process (Data Mining). The proposed model consists of two stages: 1) build an ontology as a knowledge base for a specific domain and data mining concepts, and 2) build the ontology-based knowledge map model for representing and storing the knowledge mined from the crop datasets. A framework of the proposed model has been implemented in the agriculture domain. It is an efficient and scalable model, and it can be used as a knowledge repository for digital agriculture.
Quoc Hung Ngo, Tahar Kechadi, Nhien-An Le-Khac

### A Novel Approach to Diagnose ADHD Using Virtual Reality

Abstract
The main procedures of attention-deficit hyperactivity disorder (ADHD) assessment are interviews with the subject, his or her parents and teacher, observation of the subject, and self-screening questionnaires. However, these traditional medical assessments have serious problems. Interviews may be effective for adult subjects; however, adolescent subjects are not used to expressing their emotions and mental status precisely. Observation and self-screening questionnaires take a long time to complete and can easily be falsified by an observer or a subject. To overcome these obstacles, we propose a virtual reality (VR)-based ADHD diagnosis model, in which VR content close to reality (such as a school environment) determines whether a subject is suspected of having ADHD, using various sensors in VR and the subject's responses in the VR content, based on the ADHD categorization. We implement the VR content for diagnosing ADHD using Unity; it identifies major ADHD characteristics according to the ADHD categorization of the Diagnostic and Statistical Manual of Mental Disorders (DSM-5). We present the medical data engineering and security features in our diagnosis method to protect a patient’s information.
Seung Ho Ryu, Soohwan Oh, Sangah Lee, Tai-Myoung Chung

### A Three-Way Energy Efficient Authentication Protocol Using Bluetooth Low Energy

Abstract
Bluetooth Low Energy (BLE) is increasing in popularity. Many scientists are proposing it as a technique for contact tracing to combat COVID-19. Additionally, BLE is being used in applications that transfer sensitive information, such as home security systems. This paper provides a new authentication solution for BLE with enhanced privacy but minimal impact on energy consumption. We also provide a framework to demonstrate that our protocol can be implemented on real devices that support BLE modules and can withstand typical types of cyberattacks.
Thao L. P. Nguyen, Tran Khanh Dang, Tran Tri Dang, Ai Thao Nguyen Thi

### Clustering-Based Deep Autoencoders for Network Anomaly Detection

Abstract
A novel hybrid approach combining clustering methods and autoencoders (AEs) is introduced for detecting network anomalies in a semi-supervised manner. Previous work developed regularized AEs, namely Shrink AE (SAE) and Dirac Delta Variational AE (DVAE), that learn to represent normal data in a very small region close to the origin in their middle hidden layer (the latent representation). That work is based on the assumption that normal data points may share some common characteristics, so they can be forced into a single small cluster. In some scenarios, however, normal network data may contain data from very different network services, which may result in a number of clusters in the normal data. Our proposed hybrid model attempts to automatically discover these clusters in the normal data in the latent representation of AEs. At each iteration, an AE learns to map normal data into the latent representation while a clustering method tries to discover clusters in the latent normal data and force them to be close together. The co-training strategy can help to reveal true clusters in normal data. When a query data point arrives, it is first mapped into the latent representation of the AE, and its distance to the closest cluster center can be used as an anomaly score. The higher the anomaly score of a data point, the more likely it is an anomaly. The method is evaluated with four scenarios in the CTU13 dataset, and experiments illustrate that the proposed hybrid model outperforms SAE on three out of four scenarios.
Van Quan Nguyen, Viet Hung Nguyen, Nhien-An Le-Khac, Van Loi Cao
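
The scoring step described at the end of this abstract, the distance from a latent point to the closest cluster center, is simple enough to sketch directly. The snippet below covers only that step and assumes the latent vector and the cluster centers are already available; the autoencoder, the clustering, and the co-training loop are out of scope, and the function name `anomaly_score` is chosen here for illustration.

```python
import math
from typing import Sequence


def anomaly_score(latent: Sequence[float],
                  centers: Sequence[Sequence[float]]) -> float:
    """Euclidean distance from a latent point to its nearest cluster center.

    Assumes `latent` was produced by an already trained encoder and that
    `centers` were found by clustering normal data in the same latent space.
    """
    if not centers:
        raise ValueError("at least one cluster center is required")
    return min(math.dist(latent, center) for center in centers)
```

A point that lies on a center scores 0, and the score grows as the point moves away from every cluster of normal data, so a threshold on this value separates normal traffic from suspected anomalies.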

### Flexible Platform for Integration, Collection, and Analysis of Social Media for Open Data Providers in Smart Cities

Abstract
Developing infrastructure and intelligent utilities for smart cities is an important trend in the world as well as in Vietnam. Thus, it is important to assist developers in building services for open data and smart city utilities. This motivates our proposal to develop a flexible platform with useful components that can be integrated to build such solutions quickly, to listen to and analyze data of diverse types from different social media sources, and to supply open data providers in smart cities. Our method focuses on the ability to flexibly integrate artificial intelligence applications into the system, both to analyze social events effectively and to serve smart cities in creating open data providers. We do not develop a particular system; instead, we create a platform of different components that are easy to extend and integrate into specific applications. To evaluate our platform, we develop four systems: a face recognition system for celebrity recognition in news videos, an object detection system for brand logo recognition, a video highlighting system for summarizing football matches, and a text analysis system for keyword occurrences and emotional text analysis for university admissions. In these systems, we have collected and analyzed nearly 1000 videos from the CNN, CBSN, and FIFATV channels on YouTube and thousands of posts from university admission pages on Facebook. Each system gives a unique meaning to each specific situation for open data providers in smart cities.
Thanh-Cong Le, Quoc-Vuong Nguyen, Minh-Triet Tran

### Post-quantum Digital-Signature Algorithms on Finite 6-Dimensional Non-commutative Algebras

Abstract
Three methods are introduced for defining finite 6-dimensional associative algebras over the ground finite field GF(p), each of which contains a set of global right-sided units. Formulas describing the set of global units are presented for each of the three considered algebras, which contain $$p^s$$ global units, where $$s=2,3,4$$. The algebras are used as carriers of the hidden discrete logarithm problem, which serves as the base cryptographic primitive of the post-quantum digital signature algorithms.
Nikolay A. Moldovyan, Dmitriy N. Moldovyan, Alexander A. Moldovyan, Hieu Minh Nguyen, Le Hoang Tuan Trinh

### Malicious-Traffic Classification Using Deep Learning with Packet Bytes and Arrival Time

Abstract
Internet technology is rapidly developing through the development of computer technology. However, we have been experiencing problems such as malware with these developments. Various methods of malware detection have been studied for years to respond to malicious code. There are three main ways to classify traffic: port-based, payload-based, and machine learning methods. We attempt to classify malicious traffic using a CNN, which is one of the deep learning algorithms. The features we use for the CNN are the packet’s size and its arrival time. The packet size and arrival time information are extracted and then converted into an image file. The converted image is then used by the CNN to classify what type of attack the traffic is. The accuracy of the proposed technique was 95%, which is very high and shows that such classification is possible.
Ingyom Kim, Tai-Myoung Chung

### Detecting Malware Based on Dynamic Analysis Techniques Using Deep Graph Learning

Abstract
Detecting malware using dynamic analysis techniques is an efficient method. Familiar techniques such as signature-based detection perform poorly when attempting to identify zero-day malware, and it is also a challenging and time-consuming task to manually engineer malicious behaviors. Several studies have tried to detect unknown behaviors automatically. One of the effective approaches introduced in recent years is to use graphs to represent the behavior of an executable and to learn from these graphs. However, current graph representations have ignored much important information such as parameters, variable changes… In this paper, we present a new method for malware detection by applying a graph attention network to multi-edge directional heterogeneous graphs constructed from Windows API calls collected after a file is executed in the Cuckoo sandbox… The experiments show that our model achieves better performance than other baseline models on both TPR and FAR scores.
Nguyen Minh Tu, Nguyen Viet Hung, Phan Viet Anh, Cao Van Loi, Nathan Shone

### Understanding the Decision of Machine Learning Based Intrusion Detection Systems

Abstract
Intrusion Detection Systems (IDSs) are an important research topic in security engineering. The role of an IDS is to detect malicious incoming network flows so that it can protect a computer system from attack. Recent research studies on IDSs focus on using different machine learning techniques to build an IDS. However, due to the black-box nature of machine learning algorithms, it is difficult to understand and gain insight into the system. In this work, we extend the recent studies by explaining the decisions of the IDSs built in the previous studies. Given a deeper understanding of the IDS, users will have more trust in the system, while engineers can rely on the explanation to tweak the system. The experimental results show that we can significantly reduce the computational power requirement of the IDS based on the explanation of the model.
Quang-Vinh Dang

### Combining Support Vector Machines for Classifying Fingerprint Images

Abstract
We propose to combine support vector machine (SVM) models learned from different visual features for efficiently classifying fingerprint images. Real datasets of fingerprint images are collected from students at Can Tho University. The SVM algorithm learns classification models from features extracted from fingerprint images: handcrafted features such as the scale-invariant feature transform (SIFT) with the bag-of-words (BoW) model and the histogram of oriented gradients (HoG), and invariant deep learning features from Xception. Following this, we propose to train a neural network that combines the SVM models trained on these different visual features, improving fingerprint image classification. The empirical test results show that combining SVM models is more accurate than SVM models trained on any single visual feature type. Combining SVM-SIFT-BoW, SVM-HoG, and SVM-Xception improves classification accuracy by 11.17%, 14.07%, and 10.83% over SVM-SIFT-BoW, SVM-HoG, and SVM-Xception, respectively.
The-Phi Pham, Minh-Thu Tran-Nguyen, Minh-Tan Tran, Thanh-Nghi Do

### Toward an Ontology for Improving Process Flexibility

Abstract
Process flexibility supports organisations to deal with changes, uncertainty, variations, and exceptions in business operations. Although several taxonomies of process flexibility have been proposed, the domain still lacks an ontological structure that clarifies and organises the domain. The current study fills this gap by building an ontology for improving process flexibility. Our results identify main business contexts, cases, dynamic modelling techniques, mechanisms to manage process flexibility, and their hierarchy relationships, which are structured into an ontology. The current study is significant as it provides a theoretical blueprint for improving the flexibility of organisational business processes.
Nguyen Hoang Thuan, Hoang Ai Phuong, Majo George, Mathews Nkhoma, Pedro Antunes

### Sentential Semantic Dependency Parsing for Vietnamese

Abstract
A semantic dependency parse is a dependency graph of a sentence. This graph shows the grammatical dependencies between words in a sentence more clearly than the dependency parse because it allows one word to be the dependent in two or more dependencies in a sentence. Therefore, it has been used to represent the meaning of a sentence. In order to represent the meaning of a sentence in Vietnamese, a method of parsing the sentence into semantic dependencies is proposed in this paper. This rule-based method transforms the result of a Vietnamese dependency parser by using semantic constraints from a lexicon ontology called VLO. The test result shows that the proposed method can capture more dependencies than the state-of-the-art Vietnamese dependency parser, with precisions of 0.5328 and 0.3113, respectively.
Tuyen Thi-Thanh Do, Dang Tuan Nguyen

### An In-depth Analysis of OCR Errors for Unconstrained Vietnamese Handwriting

Abstract
OCR post-processing is an essential step to improve the accuracy of OCR-generated texts by detecting and correcting OCR errors. In this paper, the OCR texts are produced by an OCR engine based on the attention-based encoder-decoder model for unconstrained Vietnamese handwriting. We identify various kinds of Vietnamese OCR errors and their possible causes. Detailed statistics of Vietnamese OCR errors are provided and analyzed at both character level and syllable level, using typical OCR error characteristics such as error rate, error mapping/edit, frequency and error length. Furthermore, the statistical analyses are done on training and test sets of a benchmark database to infer whether the test set is an appropriate representative of the training set regarding the OCR error characteristics. We also discuss the choice of designing OCR post-processing approaches at character level or at syllable level, relying on the provided statistics of the studied datasets.
Quoc-Dung Nguyen, Duc-Anh Le, Nguyet-Minh Phan, Ivan Zelinka

### Backmatter
