
2024 | Book


Data Science—Analytics and Applications

Proceedings of the 5th International Data Science Conference—iDSC2023

Edited by: Peter Haber, Thomas J. Lampoltshammer, Manfred Mayr

Publisher: Springer Nature Switzerland

Included in: Springer Professional "Wirtschaft+Technik", Springer Professional "Technik", Springer Professional "Wirtschaft"


About this book

Driven by digitalization in all spheres of our lives, Data Science and Artificial Intelligence (AI) are now cornerstones of innovation, problem solving, and business transformation. Data, whether structured or unstructured, numerical, textual, or audiovisual, put in context with other data or analyzed and processed by smart algorithms, are the basis for intelligent concepts and practical solutions. These solutions address many application areas such as Industry 4.0, the Internet of Things (IoT), smart cities, smart energy generation and distribution, and environmental management. Modern data science approaches enable and drive innovation dynamics and business opportunities for effective solutions to essential societal, environmental, or health challenges.

However, Data Science and Artificial Intelligence are forming a new field that needs attention and focused research. Effective data science is only achieved in a broad and diverse discourse – when data science experts cooperate tightly with application domain experts and scientists exchange views and methods with engineers and business experts. Thus, the 5th International Data Science Conference (iDSC 2023) brings together researchers, scientists, business experts, and practitioners to discuss new approaches, methods, and tools made possible by data science.

Table of contents

Frontmatter

Research and Science

Frontmatter

Comparison of Clustering Algorithms for Statistical Features of Vibration Data Sets

Abstract

Vibration-based condition monitoring systems are receiving increasing attention due to their ability to accurately identify different conditions by capturing dynamic features over a broad frequency range. However, there is little research on clustering approaches in vibration data and the resulting solutions are often optimized for a single data set. In this work, we present an extensive comparison of the clustering algorithms K-means clustering, OPTICS, and Gaussian mixture model clustering (GMM) applied to statistical features extracted from the time and frequency domains of vibration data sets. Furthermore, we investigate the influence of feature combinations, feature selection using principal component analysis (PCA), and the specified number of clusters on the performance of the clustering algorithms. We conducted this comparison in terms of a grid search using three different benchmark data sets. Our work showed that averaging (Mean, Median) and variance-based features (Standard Deviation, Interquartile Range) performed significantly better than shape-based features (Skewness, Kurtosis). In addition, K-means outperformed GMM slightly for these data sets, whereas OPTICS performed significantly worse. We were also able to show that feature combinations as well as PCA feature selection did not result in any significant performance improvements. With an increase in the specified number of clusters, clustering algorithms performed better, although there were some specific algorithmic restrictions.

Philipp Sepin, Jana Kemnitz, Safoura Rezapour Lakani, Daniel Schall
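
The six statistical features named in the abstract above can be illustrated with a minimal sketch. This is not the authors' pipeline: the function name `statistical_features`, the use of the population standard deviation, the non-excess kurtosis convention, and the synthetic sine signal are all assumptions chosen for illustration, and only the time domain is covered.

```python
import math
import statistics


def statistical_features(signal):
    """Compute the six features named in the abstract for one signal window.

    Assumptions: population standard deviation, quartiles from
    statistics.quantiles (default "exclusive" method), and kurtosis
    reported as the plain fourth standardized moment (not excess kurtosis).
    """
    n = len(signal)
    mean = statistics.fmean(signal)
    median = statistics.median(signal)
    std = statistics.pstdev(signal)
    q1, _, q3 = statistics.quantiles(signal, n=4)  # three quartile cut points
    # Third and fourth central moments for the shape-based features.
    m3 = sum((x - mean) ** 3 for x in signal) / n
    m4 = sum((x - mean) ** 4 for x in signal) / n
    skewness = m3 / std ** 3 if std > 0 else 0.0
    kurtosis = m4 / std ** 4 if std > 0 else 0.0
    return {
        "mean": mean,
        "median": median,
        "std": std,
        "iqr": q3 - q1,
        "skewness": skewness,
        "kurtosis": kurtosis,
    }


# Hypothetical input: one full sine period sampled at 64 points.
signal = [math.sin(2 * math.pi * k / 64) for k in range(64)]
features = statistical_features(signal)
```

A feature vector like this, computed per window, is the kind of input a clustering algorithm such as K-means, GMM, or OPTICS could then operate on.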

Towards Measuring Vulnerabilities and Exposures in Open-Source Packages

Abstract

Much of the current software depends on open-source components, which in turn have complex dependencies on other open-source libraries. Vulnerabilities in open source therefore have potentially huge impacts. The goal of this work is to get a quantitative overview of the frequency and evolution of existing vulnerabilities in popular software repositories and package managers. To this end, we provide an up-to-date overview of the open source landscape and its most popular package managers, we discuss approaches to map entries of the Common Vulnerabilities and Exposures (CVE) list to open-source libraries and we show the frequency and distribution of existing CVE entries with respect to popular programming languages.

Tobias Dam, Sebastian Neumaier

CSRX: A Novel Crossover Operator for a Genetic Algorithm Applied to the Traveling Salesperson Problem

Abstract

In this paper, we revisit the application of Genetic Algorithm (GA) to the Traveling Salesperson Problem (TSP) and introduce a family of novel crossover operators that outperform the previous state of the art. The novel crossover operators aim to exploit symmetries in the solution space, which allows us to more effectively preserve well-performing individuals, namely the fitness invariance to circular shifts and reversals of solutions. These symmetries are general and not limited to or tailored to TSP specifically.

Martin Uray, Stefan Wintersteller, Stefan Huber
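
The two symmetries mentioned in the abstract above, invariance of fitness under circular shifts and under reversals, can be checked on a toy instance. The sketch below does not implement the CSRX operators themselves; the helper names `tour_length` and `canonical` and the four-city rectangle are hypothetical and serve only to show why several different orderings describe the same tour.

```python
import math


def tour_length(tour, coords):
    """Length of the closed tour that visits the cities in the given order."""
    return sum(
        math.dist(coords[tour[i]], coords[tour[(i + 1) % len(tour)]])
        for i in range(len(tour))
    )


def canonical(tour):
    """Pick one representative among all shifts and reversals of a tour."""
    start = tour.index(min(tour))
    forward = tour[start:] + tour[:start]
    # Same cycle traversed in the opposite direction, same starting city.
    backward = [forward[0]] + forward[:0:-1]
    return min(forward, backward)


# Hypothetical instance: four cities on the corners of a 3 x 4 rectangle.
coords = {0: (0, 0), 1: (3, 0), 2: (3, 4), 3: (0, 4)}
tour = [0, 1, 2, 3]
shifted = [2, 3, 0, 1]        # circular shift of `tour`
reversed_tour = [3, 2, 1, 0]  # reversal of `tour`

lengths = [tour_length(t, coords) for t in (tour, shifted, reversed_tour)]
```

All three orderings have the same length and map to the same canonical form, which is the redundancy in the solution space that a symmetry-aware crossover could exploit.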

First Insight into Social Media User Sentiment Spreading Potential to Enhance the Conceptual Model for Disinformation Detection

Abstract

The networks of digital communication, including social media, have become the primary means for information dissemination. While these networks offer vast benefits, such as fast knowledge exchange, improved integration, as well as entertainment for their users, they also carry many negative aspects, such as the spread of false news and malicious content. In this paper, we propose the Sentiment Spread Potential (SSP) algorithm, which combines sentiment and temporal network analysis to calculate a user’s potential for spreading information of different sentiment. This algorithm should be useful in the process of disinformation detection in the part of user profiling.

Dino Pitoski, Slobodan Beliga, Ana Meštrović

Hateful Messages: A Conversational Data Set of Hate Speech Produced by Adolescents on Discord

Abstract

With the rise of social media, an increase in hateful content online can be observed. Even though the understanding and definitions of hate speech vary, platforms, communities, and legislature all acknowledge the challenge. Adolescents are a new and active group of social media users. The majority of adolescents experience or witness online hate speech. Research in the field of automated hate speech classification has been on the rise and focuses on aspects such as bias, generalizability, and performance. To increase generalizability and performance, it is important to understand biases within the data. This research addresses the bias of youth language within hate speech classification and contributes by providing a modern and anonymized hate speech youth language data set consisting of 88,395 annotated chat messages. The data set consists of publicly available online messages from the chat platform Discord. For 35,553 messages, the user profiles provided age annotations, setting the average author age to under 20 years old. 6.4% of the total messages were classified as hate speech using the annotation schema, which was adapted for this data set.

Jan Fillies, Silvio Peikert, Adrian Paschke

Prediction of Tourism Flow with Sparse Geolocation Data

Abstract

Modern tourism in the 21st century is facing numerous challenges. Among these, the rapidly growing number of tourists visiting space-limited regions like historical cities, museums and bottlenecks such as bridges is one of the biggest. In this context, a proper and accurate prediction of tourism volume and tourism flow within a certain area is important and critical for visitor management tasks such as sustainable treatment of the environment and prevention of overcrowding. Static flow control methods like conventional low-level controllers or limiting access to overcrowded venues have not yet solved the problem. In this paper, we empirically evaluate the performance of state-of-the-art deep-learning methods such as RNNs, GNNs, and Transformers as well as the classic statistical ARIMA method. Granular limited data supplied by a tourism region is extended by exogenous data such as geolocation trajectories of individual tourists, weather and holidays. In the field of visitor flow prediction with sparse data, we are thereby capable of increasing the accuracy of our predictions, incorporating modern input feature handling as well as mapping geolocation data on top of discrete POI data.

Julian Lemmel, Zahra Babaiee, Marvin Kleinlehner, Ivan Majic, Philipp Neubauer, Johannes Scholz, Radu Grosu, Sophie Neubauer

Popular and on the Rise—But Not Everywhere: COVID-19-Infographics on Twitter

Abstract

The coronavirus pandemic has altered many industries around the world. Journalism is one of them. Especially data journalists have gained attention within and outside of their newsrooms. We aim to study the prevalence of journalistic data visualizations before and after COVID-19 in 1.9 million image posts of news organizations on Twitter across six countries using a semi-manual detection approach. We find an increase in the shares of tweets containing infographics. Although this effect is not consistent across countries, we find increases in the prevalence of COVID-19-related content and interactions in infographics throughout all geographies. This study helps to generalize existing qualitative research on a larger, international scale.

Benedict Witzenberger, Angelina Voggenreiter, Jürgen Pfeffer

Taxonomy-Enhanced Document Retrieval with Dense Representations

Abstract

Document retrieval is a task that powers several downstream applications such as search and question answering. One way to approach this task is to take embeddings of the documents to be retrieved, and of the query, and use a similarity function to rank results. In this work, we extend this approach by incorporating knowledge about entities mentioned in either the document or the query, in the form of taxonomic relations and canonical labels of said entities. The method, when applied to a domain-specific corpus, improves retrieval recall over a state-of-the-art method trained on a general domain corpus. It does so without requiring any further retraining of the machine learning models involved, thus making it applicable for use cases where training is not feasible because of data or infrastructure limitations.

Victor Mireles, Artem Revenko, Ioanna Lytra, Anna Breit, Julia Klezl
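
The baseline approach described in the abstract above, ranking documents by a similarity function over embeddings, can be sketched as follows. Cosine similarity is an assumption here, since the abstract does not name the similarity function, and the three-dimensional vectors are made up rather than produced by any model. The taxonomy-based extension that the chapter contributes is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_documents(query_vec, doc_vecs):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Made-up embeddings; a real system would obtain these from an encoder model.
doc_vecs = {
    "doc_a": [1.0, 0.0, 0.0],
    "doc_b": [0.6, 0.8, 0.0],
    "doc_c": [0.0, 0.0, 1.0],
}
query_vec = [0.8, 0.6, 0.0]

ranking = rank_documents(query_vec, doc_vecs)
```

Because ranking only needs precomputed vectors and a similarity function, additional signals can in principle be incorporated without retraining the encoder, which is consistent with the property the abstract highlights.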

Robustness of Sentiment Analysis of Multilingual Twitter Postings

Abstract

Due to increasing digitalisation and access to content published online, the amount of data continues to grow. Opinions, experiences and thoughts are shared on various online platforms. In particular, sharing personal content on social media has become increasingly popular in recent years. Microblogging on social media is mostly done using text. This text data can be processed further to extract information from the posted text. Often, sentiment analysis is used to determine whether texts adopt a positive, neutral, or negative attitude. Such analysis can be relevant for politics, marketing or economics. Whenever text in a different origin language is to be analysed, a translation to English has to be made beforehand, since sentiment analyses are primarily designed for English text. The necessity for translation raises the question of whether machine translation introduces a bias towards a particular sentiment. This work shows that for two different architectures of transformer network-based translations, only minimal changes are detectable. This is demonstrated with examples from different origin languages.

Beatrice Steiner, Alexander Buchelt, Alexander Adrowitzer

Exploratory Analysis of the Applicability of Formalised Knowledge to Personal Experience Narration

Abstract

Some of the victims of Nazi persecution have recorded their personal experiences in the form of diaries of their internment in concentration camps. Such human-centric texts may contrast with the organisation of knowledge about such events that, for example, historians and archivists make. In this work, we analyse six such narrations with the use of Entity Extraction and Named Entity Recognition techniques, present the results of the corresponding exploration, and discuss the suitability of such tools on this corpus. We show that knowledge tools that have been successfully used to organise documents can be lacking when describing personal accounts, and we suggest ways to alleviate this.

Victor Mireles, Stephanie Billib, Artem Revenko, Stefan Jänicke, Frank Uiterwaal, Pavel Pecina

Applications and Use Cases

Frontmatter

Supply Chain Data Spaces–The Next Generation of Data Sharing

Abstract

The economy is heavily dependent on closely coordinated and optimised supply chain processes with an increased demand for data sharing and supply chain visibility. Industrial supply chain data spaces offer a way to deal with this demand. In this work, we present three business cases from different domains – steel industry, food industry, and manufacturing industry – derived from a workshop series with stakeholders from Austrian industries that will contribute significantly to the development of an Austrian supply chain data space concept.

Angela Fessl, Gert Breitfuß, Nina Popanton, Julia Pichler, Carina Hochstrasser, Michael Plasch, Michael Herburger

Condition Monitoring and Anomaly Detection: Real-World Challenges and Successes

Abstract

Data science projects in industry come with many challenges – from idea exploration over proof-of-concept implementation to deployment. Using the detection of anomalies in LED drivers as a use case, this paper shows how to approach such a project successfully. The focus is on anomaly detection using machine learning methods, namely one-class SVMs, isolation forests, and LSTM-based autoencoders. The algorithms show promising results; all detected anomalies can be linked to an abnormality in the data. These anomalies will be analysed by domain experts to optimize the product design and the production process. Furthermore, the successful proof-of-concept implementation justifies the investment into a global deployment of the anomaly detection in other development and production sites.

Katharina Dimovski, Léo Bonal, Thomas Zengerle, Ulrich Hüttinger, Norbert Linder, Doris Entner

Towards Validated Head Tracking On Moving Two-Wheelers

Abstract

We investigate the problem of validating head tracking methods while riding two-wheelers. A low-cost inertial measurement unit and an image-based system using fiducial markers are compared against a wearable motion capture system. Results show that both systems are capable of tracking head motion. However, signal drift correction and hardening against outdoor conditions are required to make the systems viable in real-life use.

Wolfgang Kremser, Sebastian Mayr, Simon Lassl, Marco Holzer, Martin Tiefengrabner

A Framework for Inline Quality Inspection of Mechanical Components in an Industrial Production

Reduction of Development Time and Increase of Classification Performance by Using a Data-Centric Deep Learning Approach

Abstract

Automated quality inspection of components in industrial production environments is one of the main requirements to achieve current productivity and quality goals. Since conventional inspection systems have only partially met these requirements, and research projects in the industry have hardly been practicable in this area, MIBA AG has developed a quality inspection framework of its high-quality components to achieve these goals in its production facilities. Using deep learning and focusing on the data-centric approach are the key success factors of those quality inspection systems. This technical report describes the developed framework, and the results are discussed.

Christian Prechtl, Sebastian Bomberg, Florian Jungreithmaier

A Modular Test Bed for Reinforcement Learning Incorporation into Industrial Applications

Abstract

This application paper explores the potential of using reinforcement learning (RL) to address the demands of Industry 4.0, including shorter time-to-market, mass customization, and batch size one production. Specifically, we present a use case in which the task is to transport and assemble goods through a model factory following predefined rules. Each simulation run involves placing a specific number of goods of random color at the entry point. The objective is to transport the goods to the assembly station, where two rivets are installed in each product, connecting the upper part to the lower part. Following the installation of rivets, blue products must be transported to the exit, while green products are to be transported to storage. The study focuses on the application of reinforcement learning techniques to address this problem and improve the efficiency of the production process.

Reuf Kozlica, Georg Schäfer, Simon Hirländer, Stefan Wegenkittl

Backmatter

Title: Data Science—Analytics and Applications
Edited by: Peter Haber, Thomas J. Lampoltshammer, Manfred Mayr
Publisher: Springer Nature Switzerland
Electronic ISBN: 978-3-031-42171-6
Print ISBN: 978-3-031-42170-9
DOI: https://doi.org/10.1007/978-3-031-42171-6
