2021 | Book

Data Science – Analytics and Applications

Proceedings of the 3rd International Data Science Conference – iDSC2020

Edited by: Peter Haber, Dr. Thomas Lampoltshammer, Dr. Manfred Mayr, Dr. Kathrin Plankensteiner

Publisher: Springer Fachmedien Wiesbaden

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"

About this book

Organizations have already moved from the rigid structure of classical project management towards the adoption of agile approaches. This also holds true for software development projects, which need to be flexible to adapt to rapid client requests and to reflect changes required by architectural design decisions. With data science having established itself as a cornerstone within organizations and businesses, it is now imperative to take this crucial step for analytical business processes as well. The non-deterministic nature of data science and its inherent analytical tasks require an interactive approach towards an evolutionary, step-by-step development to realize core business applications and use cases.

The 3rd International Data Science Conference (iDSC 2020) brought together researchers, scientists, and business experts to discuss ways of embracing agile approaches within the various domains of data science, such as machine learning and AI, data mining, or visualization and communication, as well as case studies and best practices from leading research institutions and businesses.

The proceedings include all full papers presented in the scientific track and the short papers from the student track.

Among the topics of interest are:

Artificial Intelligence and Machine Learning
Implementation of data mining processes
Agile Data Science and Visualization
Case Studies and Applications for Agile Data Science

Table of Contents

Frontmatter

Abstracts of Industry Contributions

Peter Haber, Thomas J. Lampoltshammer, Manfred Mayr, Kathrin Plankensteiner

Non Peer-Reviewed Invited Papers

Frontmatter

Shift Planning for Smart Meter Service Operators

Abstract

Today’s world is fast-paced and so is technology. Almost every little device is somehow “smart”, i.e., it collects data and provides some features. In order to cope with this development, the need for efficient optimization techniques grows steadily. In this work, we consider the maintenance process of smart meter devices, i.e., devices that collect data about electricity consumption. These smart meters are installed in private homes and need to be maintained from time to time. The maintenance is conducted by service operators of electricity companies. Since house owners have to be informed about the service at least two weeks prior, the procedure of planning shifts can be performed in advance. We develop and compare two solution approaches for determining shifts for service operators and evaluate the performance and stability of the generated tours with different metrics.

Paul Alexandru Bucur, Philipp Hungerländer, Anna Jellen, Kerstin Maier, Veronika Pachatz

Introducing Natural Language Interface to Databases for Data-Driven Small and Medium Enterprises

This paper summarizes major challenges and current approaches in the context of constructing Natural Language Interfaces to Databases for data-driven small and medium enterprises.

Abstract

Reading text, identifying key ideas, summarizing, making connections and other tasks that require comprehension and context are easy for humans, but training a computer to perform these tasks is a challenge. Recent advances in deep learning make it possible to interpret text effectively and achieve high performance across natural language tasks. Interacting with relational databases through natural language enables users of any background to query and analyze a huge amount of data in a user-friendly way. The purpose of a Natural Language Interface is to allow users to compose questions in natural language and to receive the response in natural language as well. The idea of using natural language instead of SQL has promoted the development of a new type of processing called Natural Language Interface to Database (NLIDB). This paper is an introduction to Natural Language Processing and Natural Language Interfaces to Databases, the significant challenges in this research field, and how to construct a company-specific dataset. It also gives a brief overview of the major techniques used to develop Natural Language Interfaces to Databases.

Dejan Radovanovic

An Easy-to-Use Execution Environment for the Parallelisation of Computationally Intensive Data Science Applications

Abstract

With Cloud Computing and multi-core CPUs, parallel computing resources are becoming more and more affordable and commonly available. Parallel programming should likewise be easily accessible to everyone. Unfortunately, existing frameworks and systems are powerful but often very complex to use for anyone who lacks knowledge of the underlying concepts. This paper introduces a software framework and execution environment whose objective is to provide a system that is easily usable by everyone who could benefit from parallel computing. Some real-world examples are presented with an explanation of all the steps that are necessary for computing in a parallel and distributed manner.

Sabrina Rosmann, Thomas Feilhauer, Steffen Finck, Martin Sobotka

Forecast Aggregation and Error Comparison: An Empirical Study

Abstract

The aim of this paper is to present empirical results associated with forecast performance. It is known that common measures of error fail to be scale invariant, and hence cannot be used to make meaningful error comparisons on forecasts across differing time series. This offers a particular challenge toward forecast improvement when one’s intent is to compare error across different units or granularity. Moreover, although it is prudent to test many forecast methods on a time series, one cannot be sure that a single selected method will not lead to complete forecast failure. We address the aforementioned challenges by analyzing a sizable collection of time series in-house.

German Wehinger, Josh Beal
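
The scale-dependence issue raised in this abstract can be illustrated with a minimal sketch. It compares the mean absolute error (MAE) with the mean absolute scaled error (MASE), a well-known scale-free alternative. MASE is chosen here only for illustration; the abstract does not say which measures the authors use, and all numbers and function names below are hypothetical.

```python
def mae(actual, forecast):
    """Mean absolute error: carries the units of the series."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mase(actual, forecast, history):
    """MAE divided by the in-sample MAE of a naive one-step forecast
    (previous value), which removes the units of the series."""
    naive_mae = sum(
        abs(history[i] - history[i - 1]) for i in range(1, len(history))
    ) / (len(history) - 1)
    return mae(actual, forecast) / naive_mae


# Hypothetical series, once in original units and once scaled by 1000.
history = [10.0, 12.0, 11.0, 13.0]
actual = [14.0, 15.0]
forecast = [13.0, 16.0]

scale = 1000.0
history_k = [scale * x for x in history]
actual_k = [scale * x for x in actual]
forecast_k = [scale * x for x in forecast]

mae_small = mae(actual, forecast)
mae_large = mae(actual_k, forecast_k)
mase_small = mase(actual, forecast, history)
mase_large = mase(actual_k, forecast_k, history_k)

print(mae_small, mae_large)    # MAE grows with the scale of the series
print(mase_small, mase_large)  # MASE is the same for both series
```

Because MAE changes with the units while MASE does not, only the latter allows a like-for-like comparison across series of different magnitude.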

German Abstracts of Peer Reviewed Full Papers

Peter Haber, Thomas J. Lampoltshammer, Manfred Mayr, Kathrin Plankensteiner

Peer-Reviewed Full Papers

Frontmatter

Applying an Adapted Data Mining Methodology (DMME) to a Tribological Optimisation Problem

Abstract

This work provides a guideline for a structural approach towards data mining projects in tribology. Due to the specifics of tribological processes, parts of the DMME methodology need to be refined. The refined data mining methodology is applied to an on-going data mining project in tribology aimed at predicting wear rate and coefficient of friction of nitrocarburised coatings. The applied adapted methodology provides an efficient framework for data generation, preparation and analysis. At the same time, it supports and guides interdisciplinary work between data scientists and tribologists.

Samuel Bitrus, Igor Velkavrh, Eugen Rigger

Implementation of an Automatic Musical Scores Recognition System

Abstract

Fingerprinting is a widely used technique on the web for identifying documents and information. Identifying fingerprints remains largely an arduous process. One area that researchers have insufficiently considered is the search for musical scores. The main purpose of the method proposed in this paper is to deepen the understanding of the internal structure of a musical melody (musical fingerprint) and its impact on the automatic recognition of musical scores, considered at their symbolic level. The method has been verified with an algorithm implemented in a Musical Score Search Engine (MSSE). Results show the identification of a musical score with a precision close to 93%. The paper ends with a series of recommendations for enhanced implementation of automated musical score recognition systems and suggestions for further research.

Michele Della Ventura

Bayesian A/B Testing for Business Decisions

Abstract

Controlled experiments (A/B tests or randomized field experiments) are the de facto standard to make data-driven decisions when implementing changes and observing customer responses. The methodology to analyze such experiments should be easily understandable to stakeholders like product and marketing managers. Bayesian inference recently gained a lot of popularity and, in terms of A/B testing, one key argument is the easy interpretability. For stakeholders, “probability to be best” (with corresponding credible intervals) provides a natural metric to make business decisions. In this paper, we motivate the quintessential questions a business owner typically has and how to answer them with a Bayesian approach. We present three experiment scenarios that are common in our company, how they are modeled in a Bayesian fashion, and how to use the models to draw business decisions. For each of the scenarios, we present a real-world experiment, the results and the final business decisions drawn.

Shafi Kamalbasha, Manuel J. A. Eugster
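
The “probability to be best” metric mentioned in this abstract can be sketched for the simplest case of conversion rates. The example below assumes a Beta-Binomial model with a uniform Beta(1, 1) prior and estimates the metric by Monte Carlo sampling. The model choice, the function name, and the data are illustrative assumptions, not the scenarios or models described in the paper.

```python
import random


def probability_to_be_best(successes, trials, draws=20000, seed=0):
    """Monte Carlo estimate of P(variant has the highest conversion rate).

    Assumes a Beta(1, 1) prior per variant, so each posterior is
    Beta(1 + successes, 1 + failures).
    """
    rng = random.Random(seed)
    wins = [0] * len(successes)
    for _ in range(draws):
        # One joint draw from the posteriors of all variants.
        samples = [
            rng.betavariate(1 + s, 1 + n - s)
            for s, n in zip(successes, trials)
        ]
        wins[samples.index(max(samples))] += 1
    return [w / draws for w in wins]


# Hypothetical data: conversions and visitors for variants A and B.
p_best = probability_to_be_best(successes=[120, 150], trials=[1000, 1000])
print(p_best)  # one probability per variant, summing to 1
```

The result reads directly as "variant B is the better one with probability p", which is the kind of statement the abstract argues is easy for stakeholders to interpret.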

Outlier detection in Bioinformatics with Mixtures of Gaussian and heavy-tailed distributions

Abstract

Starting from approaches in Bioinformatics, we will investigate aspects of Bayesian robustness ideas and compare them to methods from classical robust statistics. Bayesian robustness branches into three aspects: robustifying the prior, the likelihood or the loss function. Our focus will be the likelihood itself. For computational convenience, normal likelihoods are the standard for many basic analyses ranging from simple mean estimation to regression or discriminatory models. However, similar to classical analyses, non-normal data cause problems in the estimation process and are often covered with complex models for the overestimated variance or shrinkage. Most prominently, Bayesian non-parametrics approach this challenge with infinite mixtures of distributions. However, infinite mixture models do not allow an identification of outlying values in “near-Gaussian” scenarios, being almost too flexible for such a purpose. The goal of our work is to allow for a robust estimation of parameters of the “main part of the data”, while being able to identify the outlying part of the data and providing a posterior probability for not fitting the main likelihood model. For this purpose, we propose to mix a Gaussian likelihood with heavy-tailed or skewed distributions of a similar structure which can hierarchically be related to the normal distribution in order to allow a consistent estimation of parameters and efficient simulation. We present an application of this approach in Bioinformatics for the robust estimation of genetic array data by mixing Gaussian and Student’s t distributions with various degrees of freedom. To this effect, we employ microarray data as a case study for this behaviour, as they are well-known for their complicated, over-dispersed noise behaviour. Our secondary goal is to present a methodology which helps not only to identify noisy genes but also to recognise whether single arrays are responsible for this behaviour. Although Bioinformatics research has dropped array technology in favour of sequencing, medical diagnostics has picked up the methodology and thus requires appropriate error estimators.

Alexandra Posekany
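
The idea of a posterior probability for not fitting the main likelihood model can be sketched for a single observation. The example assumes a two-component mixture of a Gaussian and a Student's t distribution that share location and scale, with all parameters fixed. The paper estimates these quantities within a full Bayesian model, so the fixed values, the mixture weight, and the function names here are purely illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def student_t_pdf(x, mu, sigma, nu):
    """Density of a location-scale Student's t distribution."""
    z = (x - mu) / sigma
    log_norm = (
        math.lgamma((nu + 1) / 2)
        - math.lgamma(nu / 2)
        - 0.5 * math.log(nu * math.pi)
    )
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(z * z / nu)) / sigma


def outlier_probability(x, mu=0.0, sigma=1.0, nu=3.0, weight_t=0.1):
    """P(x belongs to the heavy-tailed component | x), parameters fixed."""
    heavy = weight_t * student_t_pdf(x, mu, sigma, nu)
    main = (1.0 - weight_t) * normal_pdf(x, mu, sigma)
    return heavy / (heavy + main)


for value in (0.0, 2.0, 5.0):
    print(value, round(outlier_probability(value), 4))
```

Values near the centre are attributed mostly to the Gaussian component, while values far in the tails are attributed almost entirely to the heavy-tailed component.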

Uncertainty aware deep point based neural network for 3D object classification

Abstract

Efforts in various planning scenarios like factory planning, motion and trajectory planning, product design, etc. tend towards full realization in 3D. This makes point clouds an important 3D data type for capturing and assessing different situations. In this paper, we design a Bayesian extension to the frequentist PointNet classification network [1] by applying Bayesian convolutions and linear layers with variational inference. This approach allows the estimation of the model’s uncertainty in its predictions. Further, we are able to describe how each point in the input point cloud contributes to the prediction level uncertainty. Additionally, our network is compared against the state-of-the-art and shows strong performance. We prove the feasibility of our approach using a ModelNet 3D data set. Further, we generate an industrial 3D point data set at a German automotive assembly plant and apply our network. The results show that we can improve the frequentist baseline on ModelNet by about 6.46 %.

Christina Petschnigg, Jürgen Pilz

Comparison of solution approaches for the propagation of quality requirements of steering gears

Abstract

In the supply chain of the automotive industry the propagation of high quality standards is required. In the daily operations of steering system suppliers, the analysis of End of Line (EOL) vibroacoustic measurements encoded as order spectra for ball nut assemblies (BNA) is indispensable. Our goal is to find quality windows for the given BNA order spectra to detect faulty components. Due to the difficult interpretation of heuristic solutions, we use a Mixed Integer Linear Programming (MILP) formulation to analyze the solution quality of a genetic algorithm for the aforementioned problem. We prepare a carefully constructed benchmark set, which reflects the behavior of real-world EOL order spectra. In the provided computational study, we demonstrate the efficiency of the MILP approach on our benchmark instances with up to 945 order spectra, each consisting of 260 spectral orders.

Philipp Armbrust, Paul Alexandru Bucur, Philipp Hungerländer

Persistent Homology in Data Science

Abstract

Topological data analysis (TDA) applies methods of topology in data analysis and has found many applications in data science over the past decade that go well beyond machine learning. TDA builds upon the observation that data often possesses a certain intrinsic shape such as the shape of a point cloud, the shape of a signal or the shape of a geometric object. Persistent homology is probably the most prominent tool in TDA that gives us the means to describe and quantify topological properties of these shapes.

In this paper, we give an overview of the basic concepts of persistent homology by interweaving intuitive explanations with the formal constructions of persistent homology. In order to illustrate the versatility of TDA and persistent homology, we discuss three domains of applications, namely the analysis of signals and images, the analysis of geometric shapes and topological machine learning. With this paper we intend to contribute to the dissemination of TDA and illustrate its application in fields that have received little recognition so far, like signal processing or CAD/CAM.

Stefan Huber
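
As a rough illustration of how persistent homology quantifies the shape of a point cloud, the sketch below computes only the 0-dimensional part: the scales at which connected components merge as the distance threshold of a Vietoris-Rips filtration grows. It uses a plain union-find structure over sorted pairwise distances. The function name and the point cloud are hypothetical, and higher-dimensional features such as loops are outside the scope of this sketch.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Death scales of connected components in a Vietoris-Rips filtration.

    Every point is born as its own component at scale 0. A component dies
    at the edge length that merges it into another one. One component
    never dies and is not listed.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(len(points)), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge at this scale
            deaths.append(length)
    return deaths


# Hypothetical point cloud: two tight clusters that are far apart.
cloud = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0), (10.0, 1.0)]
deaths = zero_dim_persistence(cloud)
print(deaths)
```

Components that die early reflect local noise or cluster-internal structure, whereas a component that survives up to a much larger scale points to a well-separated cluster.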

Personalization of Deep Learning

Abstract

We discuss training techniques, objectives and metrics toward personalization of deep learning models. In machine learning, personalization addresses the goal of a trained model to target a particular individual by optimizing one or more performance metrics, while conforming to certain constraints. To personalize, we investigate three methods of “curriculum learning” and two approaches for data grouping, i.e., augmenting the data of an individual by adding similar data identified with an auto-encoder. We show that both “curriculum learning” and “personalized” data augmentation lead to improved performance on data of an individual. Mostly, this comes at the cost of reduced performance on a more general, broader dataset.

Johannes Schneider, Michalis Vlachos

NetSEC at High-Speed: Distributed Stream Learning for Security in Big Networking Data

Abstract

Continuous, dynamic and short-term learning is an effective learning strategy when operating in very fast and dynamic environments, where concept drift constantly occurs. We focus on a particularly challenging problem, that of continually learning detection models capable of recognizing network attacks and system intrusions in highly dynamic environments such as communication networks. We consider adaptive learning algorithms for the analysis of continuously evolving network data streams, using a dynamic, variable-length system memory which automatically adapts to concept drifts in the underlying data. By continuously learning and detecting concept drifts to adapt memory length, we show that adaptive learning algorithms can continuously realize high detection accuracy over dynamic network data streams. To deal with big network traffic streams, we deploy the proposed models into a big data analytics platform for network traffic monitoring and analysis tasks, and show that substantial computational speed-ups (up to ×5) can be achieved by parallelizing off-the-shelf stream learning approaches.

Pedro Casas, Pavol Mulinka, Juan Vanerio

DeepMAL - Deep Learning Models for Malware Traffic Detection and Classification

Abstract

Robust network security systems are essential to prevent and mitigate the harmful effects of the ever-growing occurrence of network attacks. In recent years, machine learning-based systems have gained popularity for network security applications, usually considering the application of shallow models, which rely on the careful engineering of expert, handcrafted input features. The main limitation of this approach is that handcrafted features can fail to perform well under different scenarios and types of attacks. Deep Learning (DL) models can solve this limitation using their ability to learn feature representations from raw, non-processed data. In this paper we explore the power of DL models on the specific problem of detection and classification of malware network traffic. As a major advantage with respect to the state of the art, we consider raw measurements coming directly from the stream of monitored bytes as input to the proposed models, and evaluate different raw-traffic feature representations, including packet and flow-level ones. We introduce DeepMAL, a DL model which is able to capture the underlying statistics of malicious traffic, without any sort of expert handcrafted features. Using publicly available traffic traces containing different families of malware traffic, we show that DeepMAL can detect and classify malware flows with high accuracy, outperforming traditional, shallow-like models.

Gonzalo Marín, Pedro Casas, Germán Capdehourat

Human migration as a complex network: appropriate abstraction, and the feasibility of Network Science tools

Abstract

The number of Network Science studies has risen significantly in the last two decades. Various real phenomena are increasingly analyzed as complex networks. Human migration has seldom been analyzed in this way; however, in line with global circumstances, the number of migration-as-network applications has recently grown as well. Those new migration-as-network studies are hands-on implementations of elementary measures and models. Assessments of the right kind of network abstraction of human migration, as well as of the feasibility and interpretability of measures on the phenomenon, have not yet been offered. We investigate these aspects, assessing the congruence of network tools used for analyzing migration, and their informative potential for the policy and decision-making domain.

Dino Pitoski, Thomas J. Lampoltshammer, Peter Parycek

Title: Data Science – Analytics and Applications
Edited by: Peter Haber
Dr. Thomas Lampoltshammer
Dr. Manfred Mayr
Dr. Kathrin Plankensteiner
Publisher: Springer Fachmedien Wiesbaden
Electronic ISBN: 978-3-658-32182-6
Print ISBN: 978-3-658-32181-9
DOI: https://doi.org/10.1007/978-3-658-32182-6