
About this book

This book offers the proceedings of the Second International Data Science Conference (iDSC2019), organized by Salzburg University of Applied Sciences, Austria. The Conference brought together researchers, scientists, and business experts to discuss new ways of embracing agile approaches to various facets of data science, including machine learning and artificial intelligence, data mining, data visualization, and communication. The papers gathered here include case studies of applied techniques, and theoretical papers that push the field into the future. The full-length scientific-track papers on Data Analytics are broadly grouped by category, including Complexity; NLP and Semantics; Modelling; and Comprehensibility.

Included among real-world applications of data science are papers on

- Exploring insider trading using hypernetworks
- Data-driven approach to detection of autism spectrum disorder
- Anonymization and sentiment analysis of Twitter posts

Theoretical papers in the book cover such topics as Optimal Regression Tree Models Through Mixed Integer Programming; Chance Influence in Datasets with a Large Number of Features; and Adversarial Networks — A Technology for Image Augmentation.

Five shorter student-track papers are also published here, on topics such as

- State-of-the-art Deep Learning Methods to effect Neural Machine Translation from Natural Language into SQL
- A Smart Recommendation System to Simplify Projecting for an HMI/SCADA Platform
- Use of Adversarial Networks as a Technology for Image Augmentation
- Using Supervised Learning to Predict the Reliability of a Welding Process

The work collected in this volume of proceedings will provide researchers and practitioners with a detailed snapshot of current progress in the field of data science. Moreover, it will stimulate new study, research, and the development of new applications.



Double-Blind Reviewed Full Papers


Data Analytics | Complexity


Exploring Insider Trading Within Hypernetworks

Insider trading can have crippling effects on the economy, and its prevention is critical to the security and stability of global markets. It is hypothesized that insiders who trade at similar times share information. We analyze 400 companies and 2,000 insiders, identifying interesting trading patterns in these networks that are suggestive of illegal activity. Insiders are classified as either routine or opportunistic traders, allowing us to concentrate on the well-timed and highly profitable trades of the latter. Using trade classification and analyzing each trader’s role in a hypernetwork reveals cliques of opportunistic and routine traders. This idea forms the basis of a graph-based detection algorithm that seeks to identify traders belonging to opportunistic cliques. Trade classification and trading cliques present interesting opportunities to develop more robust policing systems that can automatically flag illegal activity in markets and predict the likelihood that such activity will occur in the future.
Jad Rayes, Priya Mani

Chance influence in datasets with a large number of features

Machine learning research, e.g. in genomics, is often based on sparse datasets that have very large numbers of features but small sample sizes. Such a configuration promotes the influence of chance on the learning process as well as on the evaluation. Prior research has underlined the problem of generalizing models obtained from such data. In this paper, we investigate in depth the influence of chance on classification and regression. We empirically show how considerable the influence of chance on such datasets is, which calls into question conclusions drawn from them. We relate the observations of chance correlation to the problem of method generalization. Finally, we provide a discussion of chance correlation and guidelines that mitigate the influence of chance.
Abdel Aziz Taha, Alexandros Bampoulidis, Mihai Lupu
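The phenomenon the paper studies can be reproduced in a few lines: when purely random features vastly outnumber samples, some feature will correlate strongly with the target by chance alone. The following simulation is a hypothetical sketch to illustrate the effect, not the authors' experimental setup:

```python
import random

random.seed(0)

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# 20 samples, 1000 purely random features: none is truly related to the target.
n_samples, n_features = 20, 1000
target = [random.gauss(0, 1) for _ in range(n_samples)]
features = [[random.gauss(0, 1) for _ in range(n_samples)]
            for _ in range(n_features)]

# The best-correlated random feature typically reaches |r| well above 0.5,
# which would look "significant" if the influence of chance were ignored.
best_r = max(abs(pearson(f, target)) for f in features)
print(round(best_r, 2))
```

With more features (or fewer samples) the spurious maximum correlation grows further, which is exactly why conclusions drawn from such datasets need careful scrutiny.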

Data Analytics | NLP and Semantics


Combining Lexical and Semantic Similarity Methods for News Article Matching

Matching news articles from multiple sources with different narratives is a crucial step towards advanced processing of online news flow. Although there are studies on finding duplicate or near-duplicate documents in several domains, none focus on grouping news texts by their events or sources. A particular event can be narrated from very different perspectives, with different words, concepts, and sentiment, owing to the different political views of publishers. We develop a novel news document matching method which combines several lexical matching scores with similarity scores based on semantic representations of documents and words. Our experimental results show that this method is highly successful in news matching. We also develop a supervised approach: pairs of news documents are labeled as describing the same event or not, structural and temporal features are extracted, and a classification model is trained on them, with the temporal features proving especially valuable. Our results show that the supervised model achieves higher performance and is thus better suited for handling the above-mentioned difficulties of news matching.
Mehmet Umut Sen, Hakki Yagiz Erdinc, Burak Yavuzalp, Murat Can Ganiz
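The general idea of combining lexical and semantic similarity can be sketched as a weighted sum of a token-overlap score and a cosine similarity over averaged word vectors. The scoring functions, weight, and toy embeddings below are illustrative assumptions, not the authors' actual method:

```python
import math

def jaccard(a, b):
    """Lexical similarity: overlap of the token sets of two texts."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def doc_vector(text, embeddings, dim=3):
    """Semantic representation: average of word vectors (zeros for unknown words)."""
    vecs = [embeddings.get(w, [0.0] * dim) for w in text.lower().split()]
    return [sum(c) / len(vecs) for c in zip(*vecs)]

def combined_score(a, b, embeddings, w_lex=0.5):
    """Weighted combination of lexical and semantic similarity."""
    return w_lex * jaccard(a, b) + (1 - w_lex) * cosine(
        doc_vector(a, embeddings), doc_vector(b, embeddings))

# Toy 3-dimensional embeddings (hypothetical; real systems learn these, e.g. word2vec).
emb = {
    "election": [1.0, 0.2, 0.0], "vote": [0.9, 0.3, 0.1],
    "storm": [0.0, 0.1, 1.0], "weather": [0.1, 0.0, 0.9],
}
same_event = combined_score("election vote", "vote election", emb)
diff_event = combined_score("election vote", "storm weather", emb)
print(same_event > diff_event)  # matching articles score higher
```

The semantic component is what lets two articles about the same event match even when their publishers use largely disjoint vocabularies.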

The Effectiveness of the Max Entropy Classifier for Feature Selection

Feature selection is the task of systematically reducing the number of input features for a classification task. In natural language processing, basic feature selection is often achieved by removing common stop words. In order to more drastically reduce the number of input features, actual feature selection methods such as Mutual Information or Chi-Squared are used on a count-based input representation. We suggest a task-oriented approach to select features based on the weights as learned by a Max Entropy classifier trained on the classification task. The remaining features can then be used by other classifiers to do the actual classification. Experiments on different natural language processing tasks confirm that the weight-based method is comparable to count-based methods. The number of input features can be reduced considerably while maintaining the classification performance.
Martin Schnöll, Cornelia Ferner, Stefan Wegenkittl
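The weight-based selection step can be sketched in a few lines: given the per-class weights of a trained Max Entropy (multinomial logistic regression) classifier, rank each feature by its largest absolute weight across classes and keep the top k. The weights below are fabricated for illustration, not taken from the paper:

```python
# Hypothetical per-class weights as learned by a Max Entropy classifier
# on a two-class sentiment task (made-up values for illustration).
weights = {
    "excellent": {"pos": 2.1,  "neg": -1.8},
    "movie":     {"pos": 0.1,  "neg": 0.2},
    "terrible":  {"pos": -1.9, "neg": 2.3},
    "the":       {"pos": 0.01, "neg": -0.02},
}

def select_features(weights, k):
    """Rank features by max |weight| over classes and keep the top k."""
    score = lambda f: max(abs(w) for w in weights[f].values())
    return sorted(weights, key=score, reverse=True)[:k]

print(select_features(weights, 2))  # ['terrible', 'excellent']
```

The reduced feature set can then be fed to any other classifier, which is the task-oriented aspect: selection is driven by the weights the target task actually induces, not by raw counts.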

Impact of Anonymization on Sentiment Analysis of Twitter Postings

The process of policy-modelling and the overall field of policy-making are complex and confront decision-makers with great challenges. One of them is the inclusion of citizens in the decision-making process. This can be done via various forms of E-Participation, with active/passive citizen-sourcing as one way to tap into current discussions about topics and issues of relevance to the general public. An increased understanding of the feelings behind certain topics and the resulting behavior of citizens can provide great insight for public administrations. Yet at the same time, it is more important than ever to respect the privacy of citizens, act in a legally compliant way, and thereby foster public trust. While the introduction of anonymization in order to guarantee privacy preservation is a proper solution to the challenges stated above, it is still unclear if and to what extent the anonymization of data will impact current data analytics technologies. This research paper therefore investigates the impact of anonymization on sentiment analysis of social media in the context of smart governance. Three anonymization algorithms are tested on Twitter data, and the results are analyzed regarding changes in the resulting sentiment. The results reveal that the proposed anonymization approaches indeed have a measurable impact on the sentiment analysis, up to a point where the results become potentially problematic for further use within the policy-modelling domain.
Thomas J. Lampoltshammer, Lőrinc Thurnay, Gregor Eibl

Data Analytics | Modelling


A Data-Driven Approach for Detecting Autism Spectrum Disorders

Autism spectrum disorders (ASDs) are a group of conditions characterized by impairments in reciprocal social interaction and by the presence of restricted and repetitive behaviors. Current ASD detection mechanisms are either subjective (survey-based) or focus only on responses to a single stimulus. In this work, we develop machine learning methods for predicting ASD based on electrocardiogram (ECG) and skin conductance (SC) data collected during a sensory challenge protocol (SCP) in which the reactions to eight stimuli were observed from 25 children with ASD and 25 typically developing children between 5 and 12 years of age. The length of the time series makes it difficult to utilize traditional machine learning algorithms to analyze these types of data. Instead, we developed feature processing techniques which allow efficient analysis of the series without loss of effectiveness. The results of our analysis of the protocol time series confirmed our hypothesis that autistic children are greatly affected by certain sensory stimulation. Moreover, our ensemble ASD prediction model achieved 93.33% accuracy, which is 13.33% higher than the best of 8 different baseline models we tested.
Manika Kapoor, David C. Anastasiu
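The key step the abstract describes, compressing very long physiological recordings into fixed-length feature vectors that standard classifiers can handle, can be sketched as windowed summary statistics. The window count, statistics, and synthetic signal below are illustrative assumptions, not the authors' actual feature set:

```python
import math

def summarize(segment):
    """Compress one stimulus window into a few summary features."""
    n = len(segment)
    mean = sum(segment) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in segment) / n)
    slope = (segment[-1] - segment[0]) / (n - 1)  # crude trend estimate
    return [mean, std, slope]

def extract_features(series, n_windows):
    """Split a long recording into equal windows and summarize each,
    turning an arbitrarily long series into a fixed-length feature vector."""
    size = len(series) // n_windows
    feats = []
    for i in range(n_windows):
        feats.extend(summarize(series[i * size:(i + 1) * size]))
    return feats

# A fake 8000-point recording stands in for ECG/SC data (hypothetical values).
recording = [math.sin(i / 50.0) for i in range(8000)]
vec = extract_features(recording, n_windows=8)
print(len(vec))  # 24 features (3 per window), regardless of recording length
```

Because the output length is fixed, recordings of different durations become directly comparable inputs for the ensemble prediction model.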

Optimal Regression Tree Models through Mixed Integer Programming

Regression analysis is a tool for predicting an output variable from a set of known independent variables. Through regression, a function that captures the relationship between the variables is fitted to the data. Tree regression models are popular in the literature due to their ability to be computed quickly and their simple interpretation. However, overly complex tree structures can overfit the training data, resulting in a poor predictive model. This work introduces a tree regression algorithm that employs mathematical programming to optimally split data into two subregions, called nodes, and a statistical test to assess the quality of the partitioning. A number of publicly available benchmark examples are used to compare the performance of the method against established approaches from the literature.
Ioannis Gkioulekas, Lazaros G. Papageorgiou
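The node-splitting subproblem, partitioning the data into two nodes so that the total squared error around each node's mean is minimal, can be illustrated for a single feature. Exhaustive search over thresholds stands in here for the paper's mixed integer programming formulation; the data is made up:

```python
def best_split(x, y):
    """Find the threshold on x splitting (x, y) into two nodes that minimizes
    the summed squared error around each node's mean. This is the subproblem
    the paper formulates as a mathematical program (solved here by brute force)."""
    def sse(vals):
        if not vals:
            return 0.0
        m = sum(vals) / len(vals)
        return sum((v - m) ** 2 for v in vals)

    pairs = sorted(zip(x, y))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [v for k, v in pairs if k <= threshold]
        right = [v for k, v in pairs if k > threshold]
        total = sse(left) + sse(right)
        if total < best[0]:
            best = (total, threshold)
    return best  # (error, threshold)

# A step-shaped toy dataset: the optimal split should fall near x = 3.5.
x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.1, 0.9, 5.0, 5.2, 4.9]
err, thr = best_split(x, y)
print(thr)  # 3.5
```

The advantage of the optimization view is that, unlike greedy enumeration, it scales to formulations with multiple features and additional constraints, with a statistical test then deciding whether a proposed split is worth keeping.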

A Spatial Data Analysis Approach for Public Policy Simulation in Thermal Energy Transition Scenarios

The paper elaborates on an approach to simulating the effect of public policies on thermal energy transition pathways in urban communities. It discusses the underlying methodologies for calculating the heating energy demand of buildings and the rationale for identifying potential zones for thermal energy systems. In order to simulate the effects of public policies on communities, the authors developed a spatial Agent-based Model in which buildings are the main objects subject to change, based on a number of technical and socio-demographic parameters. To populate the spatial Agent-based Model with data, a number of open-source and commercially available datasets need to be spatially analyzed and merged. The initial results of the spatial Agent-based Model simulation show that public policies for thermal energy transition can be simulated accordingly.
Lina Stanzel, Johannes Scholz, Franz Mauthner

Data Analytics | Comprehensibility


A Probabilistic Approach to Web Waterfall Charts

The purpose of this paper is to propose an efficient and rigorous modeling approach for probabilistic waterfall charts illustrating the timings of web resources, with particular focus on fitting them to big data. An implementation on real-world data is discussed and illustrated with examples. The technique is based on non-parametric density estimation, and we discuss some subtle aspects of it, such as noisy inputs or singular data. We also investigate optimization techniques for the numerical integration that arises as part of the modeling.
Maciej Skorski
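The non-parametric density estimation underlying such charts can be sketched with a plain Gaussian kernel density estimator. The timings and bandwidth below are made-up values, not the paper's data or tuning:

```python
import math

def gaussian_kde(samples, bandwidth):
    """Non-parametric density estimate: an average of Gaussian kernels,
    one centered on each observed sample."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2)
                          for s in samples)
    return density

# Hypothetical resource-load timings in milliseconds: a fast cluster and a slow one.
timings = [12, 14, 15, 15, 16, 40, 42, 43]
f = gaussian_kde(timings, bandwidth=2.0)
print(f(15) > f(28))  # density is higher inside the fast cluster than between clusters
```

Integrating such densities over time intervals (to get probabilities for the chart) is where the numerical integration the paper optimizes comes in.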

Facilitating Public Access to Legal Information

The European legal system is multi-layered and complex, and large quantities of legal documentation have been produced since its inception. This has significant ramifications for European society, whose various constituent actors require regular access to accurate and timely legal information and often struggle with basic comprehension of legalese. The project focused on within this paper proposes to develop a suite of user-centric services that will ensure the real-time provision and visualisation of legal information to citizens, businesses and administrations, based on a platform supported by the proper environment for semantically annotated Big Open Legal Data (BOLD). The objective of this research paper is to critically explore how current user activity interacts with the components of the proposed project platform through the development of a conceptual model. Model Driven Design (MDD) is employed to describe the proposed project architecture. This is complemented by the Agent Oriented Modelling (AOM) technique, based on UML (Unified Modelling Language) user activity diagrams, to develop the proposed platform's user requirements and to show the dependencies between the different components that make up the proposed system.
Shefali Virkar, Chibuzor Udokwu, Anna-Sophie Novak, Sofia Tsekeridou

Do we have a Data Culture?

Nowadays, adopting a “data culture” or operating “data-driven” are desired goals for many managers. However, what does it mean when an organization claims to have a data culture? A clear definition is not available. This paper aims to sharpen the understanding of data culture in organizations by discussing recent usages of the term. It shows that data culture is a kind of organizational culture, and that a data-driven culture is a special form of data culture. We conclude that a data-driven culture is defined by following a specific set of values, behaviors, and norms that enable effective data analytics. Beyond these values, behaviors, and norms, this paper presents the job roles necessary for a data-driven culture. We include the crucial role of the data steward, who elevates a data culture to a data-driven culture by administering data governance. Finally, we propose a definition of data-driven culture that focuses on the commitment to data-based decision making and an ever-improving data analytics process. This paper helps teams and organizations of any size that strive to advance their (not necessarily big) data analytics capabilities by drawing their attention to the often neglected, non-technical requirements: data governance and a suitable organizational culture.
Wolfgang Kremser, Richard Brunauer

Non-Reviewed Short Papers


Neural Machine Translation from Natural Language into SQL with state-of-the-art Deep Learning methods

Reading text, identifying key ideas, summarizing, making connections, and other tasks that require comprehension and context are easy for humans, but training a computer to perform them is a challenge. Recent advances in deep learning make it possible to interpret text effectively and achieve high performance across natural language tasks. Interacting with relational databases through natural language enables users of any background to query and analyze huge amounts of data in a user-friendly way. This paper summarizes the major challenges and different approaches in the context of Natural Language Interfaces to Databases (NLIDB). A state-of-the-art language translation model developed by Google, the Transformer, is used to translate natural language queries into structured queries to simplify the interaction between users and relational database systems.
Dejan Radovanovic

Smart recommendation system to simplify projecting for an HMI/SCADA platform

Modelling and connecting machines and hardware devices of manufacturing plants in HMI/SCADA software platforms is considered time-consuming and requires expertise. A smart recommendation system could help to support and simplify the tasks of the projecting process. In this paper, supervised learning methods are proposed to address this problem. Data characteristics, modelling challenges, and two potential modelling approaches, one-hot encoding and probabilistic topic modelling, are discussed.
The methodology for solving this problem is still in progress. First results are expected by the date of the conference.
Sebastian Malin, Kathrin Plankensteiner, Robert Merz, Reinhard Mayr, Sebastian Schöndorfer, Mike Thomas
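Of the two modelling approaches mentioned, one-hot encoding is the simpler to illustrate: each project becomes a fixed-length binary vector over a vocabulary of components, which supervised learners can then consume. The component vocabulary below is a hypothetical example, not the platform's actual component list:

```python
def one_hot(items, vocabulary):
    """Encode a set of configured components as a fixed-length binary vector."""
    return [1 if v in items else 0 for v in vocabulary]

# Hypothetical component vocabulary of an HMI/SCADA projecting environment.
vocab = ["plc", "sensor_temp", "sensor_pressure", "valve", "pump"]
project = {"plc", "sensor_temp", "pump"}
vec = one_hot(project, vocab)
print(vec)  # [1, 1, 0, 0, 1]
```

A recommender can then, for instance, compare a partially configured project's vector against completed projects to suggest components that similar configurations typically include.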

Adversarial Networks — A Technology for Image Augmentation

A key application of data augmentation is to boost state-of-the-art machine learning models, to complete missing values, and to generate more data from a given dataset. In addition to transformations or patch extraction as augmentation methods, adversarial networks can be used to learn the probability density function of the original data. Generative adversarial networks (GANs) are an adversarial method for generating new data from noise by pitting a generator against a discriminator and training them in a zero-sum game that seeks a Nash equilibrium. The generator can then be used to convert noise into augmentations of the original data. This short paper shows the use of GANs to generate fake face images and gives tips for overcoming the notoriously hard training of GANs.
Maximilian Ernst Tschuchnig

Using supervised learning to predict the reliability of a welding process

In this paper, supervised learning is used to predict the reliability of manufacturing processes in industrial settings. As an example case, lifetime data has been collected from a special device made of sheet metal. It is known that a welding procedure is the critical step during production. To test the quality of the welded area, end-of-life tests have been performed on each of the devices.
For the statistical analysis, not only the acquired lifetime, but also data specifying the device before and after the welding process as well as measured curves from the welding step itself, e.g., current over time, are available.
Typically, the Weibull and log-normal distributions are used to model lifetime, and in our case both are considered appropriate candidate distributions. Although both distributions fit the data well, the log-normal distribution is selected because the Kolmogorov-Smirnov test and the Bayes factor indicate slightly better results.
To model the lifetime depending on the welding parameters, a multivariable linear regression model is used. To find the significant covariates, a mix of forward selection and backward elimination is utilized. The t-test is used to determine each covariate’s importance, while the adjusted coefficient of determination is used as a global goodness-of-fit criterion. After the best-fitting model has been determined, its predictive power is evaluated with non-exhaustive cross-validation and the sum of squared errors.
The results show that the lifetime can be predicted based on the welding settings. For lifetime prediction, the model yields accurate results when interpolation is used. However, an extrapolation beyond the range of available data shows the limits of a purely data-driven model.
Melanie Zumtobel, Kathrin Plankensteiner
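The distribution-selection step, fitting a log-normal to lifetimes by maximum likelihood and checking the fit with the Kolmogorov-Smirnov distance, can be sketched with standard-library tools. The lifetimes below are synthetic stand-ins for the welding data, and this is an illustration of the general technique, not the authors' code:

```python
import math
import random

random.seed(1)

def lognormal_fit(data):
    """MLE for the log-normal: mean and std of the log-lifetimes."""
    logs = [math.log(t) for t in data]
    mu = sum(logs) / len(logs)
    sigma = math.sqrt(sum((l - mu) ** 2 for l in logs) / len(logs))
    return mu, sigma

def lognormal_cdf(t, mu, sigma):
    """CDF of the log-normal via the error function."""
    return 0.5 * (1 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2))))

def ks_statistic(data, cdf):
    """Kolmogorov-Smirnov distance between empirical and fitted model CDF."""
    data = sorted(data)
    n = len(data)
    return max(max(abs((i + 1) / n - cdf(t)), abs(i / n - cdf(t)))
               for i, t in enumerate(data))

# Synthetic lifetimes drawn from a log-normal (hypothetical welding data).
lifetimes = [math.exp(random.gauss(2.0, 0.5)) for _ in range(200)]
mu, sigma = lognormal_fit(lifetimes)
d = ks_statistic(lifetimes, lambda t: lognormal_cdf(t, mu, sigma))
print(round(d, 3))  # a small KS distance indicates a good fit
```

Computing the same distance for a competing Weibull fit and comparing the two values mirrors how the paper chooses between candidate distributions.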