
2019 | Book

Applied Data Science

Lessons Learned for the Data-Driven Business

Edited by: Prof. Dr. Martin Braschler, Dr. Thilo Stadelmann, Dr. Kurt Stockinger

Publisher: Springer International Publishing


About this Book

This book has two main goals: to define data science through the work of data scientists and their results, namely data products, while simultaneously providing the reader with relevant lessons learned from applied data science projects at the intersection of academia and industry. As such, it is not a replacement for a classical textbook (i.e., it does not elaborate on fundamentals of methods and principles described elsewhere), but systematically highlights the connection between theory, on the one hand, and its application in specific use cases, on the other.

With these goals in mind, the book is divided into three parts: Part I pays tribute to the interdisciplinary nature of data science and provides a common understanding of data science terminology for readers with different backgrounds. These six chapters are geared towards drawing a consistent picture of data science and were predominantly written by the editors themselves. Part II then broadens the spectrum by presenting views and insights from diverse authors – some from academia and some from industry, ranging from financial to health and from manufacturing to e-commerce. Each of these chapters describes a fundamental principle, method or tool in data science by analyzing specific use cases and drawing concrete conclusions from them. The case studies presented, and the methods and tools applied, represent the nuts and bolts of data science. Finally, Part III was again written from the perspective of the editors and summarizes the lessons learned that have been distilled from the case studies in Part II. The section can be viewed as a meta-study on data science across a broad range of domains, viewpoints and fields. Moreover, it provides answers to the question of what the mission-critical factors for success in different data science undertakings are.

The book targets professionals as well as students of data science: first, practicing data scientists in industry and academia who want to broaden their scope and expand their knowledge by drawing on the authors’ combined experience. Second, decision makers in businesses who face the challenge of creating or implementing a data-driven strategy and who want to learn from success stories spanning a range of industries. Third, students of data science who want to understand both the theoretical and practical aspects of data science, vetted by real-world case studies at the intersection of academia and industry.

Table of Contents

Frontmatter

Foundations

Frontmatter
Chapter 1. Introduction to Applied Data Science
Abstract
What is data science? Attempts to define it can be made in one (prolonged) sentence, while it may take a whole book to demonstrate the meaning of this definition. This book introduces data science in an applied setting, by first giving a coherent overview of the background in Part I, and then presenting the nuts and bolts of the discipline by means of diverse use cases in Part II; finally, specific and insightful lessons learned are distilled in Part III. This chapter introduces the book and provides an answer to the following questions: What is data science? Where does it come from? What are its connections to big data and other mega trends? We claim that multidisciplinary roots and a focus on creating value lead to a discipline in the making that is inherently an interdisciplinary, applied science.
Thilo Stadelmann, Martin Braschler, Kurt Stockinger
Chapter 2. Data Science
Abstract
Even though it has only entered public perception relatively recently, the term “data science” already means many things to many people. This chapter explores both top-down and bottom-up views on the field, on the basis of which we define data science as “a unique blend of principles and methods from analytics, engineering, entrepreneurship and communication that aim at generating value from the data itself.” The chapter then discusses the disciplines that contribute to this “blend,” briefly outlining their contributions and giving pointers for readers interested in exploring their backgrounds further.
Martin Braschler, Thilo Stadelmann, Kurt Stockinger
Chapter 3. Data Scientists
Abstract
What is a data scientist? How can you become one? How can you form a team of data scientists that fits your organization? In this chapter, we trace the skillset of a successful data scientist and define the necessary competencies. We disambiguate the term from other historical and contemporary definitions and show how a career as a data scientist might get started. Finally, we answer the third question, that is, how to build analytics teams within a data-driven organization.
Thilo Stadelmann, Kurt Stockinger, Gundula Heinatz Bürki, Martin Braschler
Chapter 4. Data Products
Abstract
Data science is becoming an established scientific discipline and has delivered numerous useful results so far. We are at the point in time where we begin to understand what results and insights data science can deliver; at the same time, however, it is not yet clear how to systematically deliver these results to the end user. In other words: how do we design data products in a process that guarantees relevant benefit for the user? Additionally, once we have a data product, we need a way to provide economic value for the product owner. That is, we need to design data-centric business models as well.
In this chapter, we propose to view the all-encompassing process of turning data insights into data products as a specific interpretation of service design. This provides the data scientist with a rich conceptual framework to carve the value out of the data in a customer-centric way and to plan the next steps of their endeavor: to design a great data product.
Jürg Meierhofer, Thilo Stadelmann, Mark Cieliebak
Chapter 5. Legal Aspects of Applied Data Science
Abstract
Data scientists operate in a legal context, and knowledge of its rules provides great benefit to any applied data science project under consideration, in particular with a view to later commercialization. Taking legal aspects into account early on may prevent larger legal issues at a subsequent project stage. In this chapter, we present some legal topics to provide data scientists with a frame of reference for their activities from a legal perspective, in particular: (1) comments on the qualification and protection of “data” from a legal perspective, including intellectual property issues; (2) data protection law; and (3) regulatory law. While the legal framework is not the same worldwide and this chapter mainly deals with Swiss law as an example, many of the topics mentioned herein also come up in other jurisdictions.
Michael Widmer, Stefan Hegy
Chapter 6. Risks and Side Effects of Data Science and Data Technology
Abstract
In addition to the familiar and well-known privacy concerns, there are more serious general risks and side effects of data science and data technology. A full understanding requires a broader and more philosophical look at the defining frames and the goals of data science. Is the aim of continuously optimizing decisions based on recorded data still helpful, or have we reached a point where this mind-set produces problems? This contribution provides some arguments toward a skeptical evaluation of data science. The underlying conflict has the nature of a second-order problem: it cannot be solved with the rational mind-set of data science, as it might be this very mind-set that produces the problem in the first place. Moreover, data science impacts society at large—there is no laboratory in which its effects can be studied in a controlled series of experiments and where simple solutions can be generated and tested.
Clemens H. Cap

Use Cases

Frontmatter
Chapter 7. Organization
Abstract
Part II of this book represents its core—the nuts and bolts of applied data science, presented by means of 16 case studies spanning a wide range of methods, tools, and application domains.
Martin Braschler, Thilo Stadelmann, Kurt Stockinger
Chapter 8. What Is Data Science?
Abstract
Data science, a new discovery paradigm, is potentially one of the most significant advances of the early twenty-first century. Originating in scientific discovery, it is being applied to every human endeavor for which there is adequate data. While remarkable successes have been achieved, even greater claims have been made. Benefits, challenges, and risks abound. The science underlying data science has yet to emerge. Maturity is more than a decade away. This claim is based firstly on observing the centuries-long developments of its predecessor paradigms—empirical, theoretical, and Jim Gray’s Fourth Paradigm of Scientific Discovery (Hey et al., The fourth paradigm: data-intensive scientific discovery. Edited by Microsoft Research, 2009) (aka eScience, data-intensive, computational, procedural)—and secondly on my studies of over 150 data science use cases, several data science-based startups, and my scientific advisory role for Insight (https://www.insight-centre.org/), a Data Science Research Institute (DSRI), which requires that I understand the opportunities, state of the art, and research challenges of the emerging discipline of data science. This chapter addresses essential questions for a DSRI: What is data science? What is world-class data science research? A companion chapter (Brodie, On Developing Data Science, in Braschler et al. (Eds.), Applied data science – Lessons learned for the data-driven business, Springer 2019) addresses the development of data science applications and of the data science discipline itself.
Michael L. Brodie
Chapter 9. On Developing Data Science
Abstract
Understanding phenomena based on the facts—on the data—is a touchstone of data science. The power of evidence-based, inductive reasoning distinguishes data science from science. Hence, this chapter argues that, in its initial stages, data science applications and the data science discipline itself be developed inductively and deductively in a virtuous cycle.
The virtues of the twentieth-century Virtuous Cycle (aka the virtuous hardware–software cycle, or Intel–Microsoft virtuous cycle) that built the personal computer industry (National Research Council, The new global ecosystem in advanced computing: Implications for U.S. competitiveness and national security. The National Academies Press, Washington, DC, 2012) were being grounded in reality and being self-perpetuating—more powerful hardware enabled more powerful software that required more powerful hardware, enabling yet more powerful software, and so forth. Being grounded in reality—solving genuine problems at scale—was critical to its success, as it will be for data science. While it lasted, it was self-perpetuating, due to a constant flow of innovation and to benefitting all participants—producers, consumers, the industry, the economy, and society. It is a wonderful success story of twentieth-century applied science. Given the success of virtuous cycles in developing modern technology, virtuous cycles grounded in reality should be used to develop data science, driven by the wisdom of the sixteenth-century proverb, “Necessity is the mother of invention.”
This chapter explores this hypothesis using the example of the evolution of database management systems over the last 40 years. For the application of data science to be successful and virtuous, it should be grounded in a cycle that encompasses industry (i.e., real problems), research, development, and delivery. This chapter proposes applying the principles and lessons of the virtuous cycle to the development of data science applications; to the development of the data science discipline itself, for example, a data science method; and to the development of data science education; all focusing on the critical role of collaboration in data science research and management, thereby addressing the development challenges faced by the more than 150 Data Science Research Institutes (DSRIs) worldwide. A companion chapter (Brodie, What is Data Science, in Braschler et al (Eds.), Applied data science – Lessons learned for the data-driven business, Springer 2019), addresses essential questions that DSRIs should answer in preparation for the developments proposed here: What is data science? What is world-class data science research?
Michael L. Brodie
Chapter 10. The Ethics of Big Data Applications in the Consumer Sector
Abstract
Business applications relying on processing of large amounts of heterogeneous data (Big Data) are considered to be key drivers of innovation in the digital economy. However, these applications also pose ethical issues that may undermine the credibility of data-driven businesses. In our contribution, we discuss ethical problems that are associated with Big Data such as: How are core values like autonomy, privacy, and solidarity affected in a Big Data world? Are some data a public good? Or: Are we obliged to divulge personal data to a certain degree in order to make the society more secure or more efficient? We answer those questions by first outlining the ethical topics that are discussed in the scientific literature and the lay media using a bibliometric approach. Second, referring to the results of expert interviews and workshops with practitioners, we identify core norms and values affected by Big Data applications—autonomy, equality, fairness, freedom, privacy, property-rights, solidarity, and transparency—and outline how they are exemplified in examples of Big Data consumer applications, for example, in terms of informational self-determination, non-discrimination, or free opinion formation. Based on use cases such as personalized advertising, individual pricing, or credit risk management we discuss the process of balancing such values in order to identify legitimate, questionable, and unacceptable Big Data applications from an ethics point of view. We close with recommendations on how practitioners working in applied data science can deal with ethical issues of Big Data.
Markus Christen, Helene Blumer, Christian Hauser, Markus Huppenbauer
Chapter 11. Statistical Modelling
Abstract
In this chapter, we present statistical modelling approaches for predictive tasks in business and science. Most prominent is the ubiquitous multiple linear regression approach where coefficients are estimated using the ordinary least squares algorithm. There are many derivations and generalizations of that technique. In the form of logistic regression, it has been adapted to cope with binary classification problems. Various statistical survival models allow for modelling of time-to-event data. We will detail the many benefits and a few pitfalls of these techniques based on real-world examples. A primary focus will be on pointing out the added value that these statistical modelling tools yield over more black box-type machine-learning algorithms. In our opinion, the added value predominantly stems from the often much easier interpretation of the model, the availability of tools that pin down the influence of the predictor variables in concise form, and finally from the options they provide for variable selection and residual analysis, allowing for user-friendly model development, refinement, and improvement.
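The interpretability the abstract emphasizes can be illustrated with the simplest member of this model family. The following is a minimal, stdlib-only sketch of ordinary least squares for simple linear regression (not code from the chapter; the data are made up for illustration):

```python
def ols_fit(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope: sample covariance of x and y divided by sample variance of x.
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope

# Noise-free example generated from y = 2 + 3x, so OLS recovers
# the coefficients exactly.
x = [1.0, 2.0, 3.0, 4.0]
y = [5.0, 8.0, 11.0, 14.0]
b0, b1 = ols_fit(x, y)
print(b0, b1)  # 2.0 3.0
```

The fitted coefficients carry a direct reading ("one unit more of x adds b1 to the prediction"), which is the kind of interpretive handle that black-box learners typically lack.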
Marcel Dettling, Andreas Ruckstuhl
Chapter 12. Beyond ImageNet: Deep Learning in Industrial Practice
Abstract
Deep learning (DL) methods have gained considerable attention since 2014. In this chapter we briefly review the state of the art in DL and then give several examples of applications from diverse areas of application. We will focus on convolutional neural networks (CNNs), which have since the seminal work of Krizhevsky et al. (ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, pp. 1097–1105, 2012) revolutionized image classification and even started surpassing human performance on some benchmark data sets (Ciresan et al., Multi-column deep neural network for traffic sign classification, 2012a; He et al., Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. CoRR, Vol. 1502.01852, 2015a). While deep neural networks have become popular primarily for image classification tasks, they can also be successfully applied to other areas and problems with some local structure in the data. We will first present a classical application of CNNs on image-like data, in particular, phenotype classification of cells based on their morphology, and then extend the task to clustering voices based on their spectrograms. Next, we will describe DL applications to semantic segmentation of newspaper pages into their corresponding articles based on clues in the pixels, and outlier detection in a predictive maintenance setting. We conclude by giving advice on how to work with DL having limited resources (e.g., training data).
Thilo Stadelmann, Vasily Tolkachev, Beate Sick, Jan Stampfli, Oliver Dürr
Chapter 13. The Beauty of Small Data: An Information Retrieval Perspective
Abstract
This chapter focuses on a class of data science problems that we will refer to as “Small Data” problems. Over the past 20 years, we have accumulated considerable experience working on Information Retrieval applications that allow effective search on collections that do not exceed the order of tens or hundreds of thousands of documents. In this chapter, we want to highlight a number of lessons learned in dealing with such document collections.
The better-known term “Big Data” has in recent years created a lot of buzz, but also frequent misunderstandings. To use a provocative simplification, the magic of Big Data often lies in the fact that sheer volume of data will necessarily bring redundancy, which can be detected in the form of patterns. Algorithms can then be trained to recognize and process these repeated patterns in the data streams.
Conversely, “Small Data” approaches do not operate on volumes of data big enough to exploit repetitive patterns to a successful degree. While there have been spectacular applications of Big Data technology, we are convinced that there are and will remain countless, equally exciting, “Small Data” tasks, across all industrial and public sectors, and also for private applications. They have to be approached in a very different manner from Big Data problems. In this chapter, we will first argue that the task of retrieving documents from large text collections (often termed “full text search”) can become easier as the document collection grows. We then present two exemplary “Small Data” retrieval applications and discuss the best practices that can be derived from such applications.
Martin Braschler
Chapter 14. Narrative Visualization of Open Data
Abstract
Several governments around the globe have recently released significant amounts of open data to the public. The main motivation is for citizens and companies to use these datasets to develop new data products and applications, either by enriching their existing data stores or by smartly combining datasets from various open data portals.
In this chapter, we first describe the development of open data over the last few years and briefly introduce the open data portals of the USA, the EU, and Switzerland. Next we will explain various methods for information visualization. Finally, we describe how we combined methods from open data and information visualization. In particular, we show how we developed visualization applications on top of the Swiss open data portal that enable web-based, interactive information visualization as well as a novel paradigm—narrative visualization.
Philipp Ackermann, Kurt Stockinger
Chapter 15. Security of Data Science and Data Science for Security
Abstract
In this chapter, we present a brief overview of important topics regarding the connection of data science and security. In the first part, we focus on the security of data science and discuss a selection of security aspects that data scientists should consider to make their services and products more secure. In the second part about security for data science, we switch sides and present some applications where data science plays a critical role in pushing the state-of-the-art in securing information systems. This includes a detailed look at the potential and challenges of applying machine learning to the problem of detecting obfuscated JavaScripts.
Bernhard Tellenbach, Marc Rennhard, Remo Schweizer
Chapter 16. Online Anomaly Detection over Big Data Streams
Abstract
In many domains, high-quality data are used as a foundation for decision-making. Anomaly detection is an essential component of assessing data quality. We describe and empirically evaluate the design and implementation of a framework for data quality testing over real-world streams in a large-scale telecommunication network. This approach is both general—by using general-purpose measures borrowed from information theory and statistics—and scalable—through anomaly detection pipelines that are executed in a distributed setting over state-of-the-art big data streaming and batch processing infrastructures. We empirically evaluate our system and discuss its merits and limitations by comparing it to existing anomaly detection techniques, showing its high accuracy and efficiency, as well as its scalability in parallelizing operations across a large number of nodes.
Laura Rettig, Mourad Khayati, Philippe Cudré-Mauroux, Michał Piorkówski
Chapter 17. Unsupervised Learning and Simulation for Complexity Management in Business Operations
Abstract
A key resource in data analytics projects is the data to be analyzed. What can be done in the middle of a project if this data is not available as planned? This chapter explores a potential solution based on a use case from the manufacturing industry where the drivers of production complexity (and thus costs) were supposed to be determined by analyzing raw data from the shop floor, with the goal of subsequently recommending measures to simplify production processes and reduce complexity costs.
The unavailability of the data—often a major threat to the anticipated outcome of a project—has been alleviated in this case study by means of simulation and unsupervised machine learning: a physical model of the shop floor produced the necessary lower-level records from high-level descriptions of the facility. Then, neural autoencoders learned a measure of complexity regardless of any human-contributed labels.
In contrast to conventional complexity measures based on business analysis done by consultants, our data-driven methodology measures production complexity in a fully automated way while maintaining a high correlation to the human-devised measures.
Lukas Hollenstein, Lukas Lichtensteiger, Thilo Stadelmann, Mohammadreza Amirian, Lukas Budde, Jürg Meierhofer, Rudolf M. Füchslin, Thomas Friedli
Chapter 18. Data Warehousing and Exploratory Analysis for Market Monitoring
Abstract
With the growing trend of digitalization, many companies plan to use machine learning to improve their business processes or to provide new data-driven services. These companies often collect data from different locations with sometimes conflicting context. However, before machine learning can be applied, heterogeneous datasets often need to be integrated, harmonized, and cleaned. In other words, a data warehouse is often the foundation for subsequent analytics tasks.
In this chapter, we first provide an overview of best practices for building a data warehouse. In particular, we describe the advantages and disadvantages of the major types of data warehouse architectures based on Inmon and Kimball. Afterward, we describe a use case of building an e-commerce application where the users of this platform are provided with information about healthy products as well as products with sustainable production. Unlike traditional e-commerce applications, where users need to log into the system and thus leave personalized traces when they search for specific products or even buy them afterward, our application allows users full anonymity in case they do not want to log into the system. However, analyzing anonymous user interactions is a much harder problem than analyzing named users. The idea is to apply modern data warehousing, big data technologies, as well as machine learning algorithms to discover patterns in the user behavior and to make recommendations for designing new products.
Melanie Geiger, Kurt Stockinger
Chapter 19. Mining Person-Centric Datasets for Insight, Prediction, and Public Health Planning
Abstract
In order to increase the accuracy and realism of agent-based simulation systems, it is necessary to take the full complexity of human behavior into account. Mobile phone records are capable of capturing this complexity, in the form of latent patterns. These patterns can be discovered via information processing, data mining, and visual analytics. Mobile phone records can be mined to improve our understanding of human societies, and those insights can be encapsulated in population models. Models of geographic mobility, travel, and migration are key components of both population models and the underlying datasets of simulation systems. For example, using such models enables both the analysis of existing traffic patterns and the creation of accurate simulations of real-time traffic flow. The case study presented here demonstrates how latent patterns and insights can be (1) extracted from mobile phone datasets, (2) turned into components of population models, and (3) utilized to improve health-related simulation software. It does so within the context of computational epidemiology, applying the Data Science process to answer nine specific research questions pertaining to factors influencing disease spread in a population. The answers can be used to inform a country’s strategy in case of an epidemic.
Jonathan P. Leidig, Greg Wolffe
Chapter 20. Economic Measures of Forecast Accuracy for Demand Planning: A Case-Based Discussion
Abstract
Successful demand planning relies on accurate demand forecasts. Existing demand planning software typically employs (univariate) time series models for this purpose. These methods work well if the demand of a product follows regular patterns. Their power and accuracy are, however, limited if the patterns are disturbed and the demand is driven by irregular external factors such as promotions, events, or weather conditions. Hence, modern machine-learning-based approaches take external drivers into account for improved forecasting and combine various forecasting approaches with situation-dependent strengths. Yet, to substantiate the strength and the impact of single or new methodologies, one is left with the question of how to measure and compare the performance or accuracy of different forecasting methods. Standard measures such as root mean square error (RMSE) and mean absolute percentage error (MAPE) may allow for ranking the methods according to their accuracy, but in many cases these measures are difficult to interpret or the rankings are incoherent among different measures. Moreover, the impact of forecasting inaccuracies is usually not reflected by standard measures. In this chapter, we discuss this issue using the example of forecasting the demand of food products. Furthermore, we define alternative measures that provide intuitive guidance for decision makers and users of demand forecasting.
Thomas Ott, Stefan Glüge, Richard Bödi, Peter Kauf
Chapter 21. Large-Scale Data-Driven Financial Risk Assessment
Abstract
The state of data in finance makes near real-time and consistent assessment of financial risks almost impossible today. The aggregate measures produced by traditional methods are rigid, infrequent, and not available when needed. In this chapter, we make the point that this situation can be remedied by introducing a suitable standard for data and algorithms at the deep technological level, combined with the use of Big Data technologies. Specifically, we present the ACTUS approach to standardizing the modeling of financial contracts in view of financial analysis, which provides a methodological concept together with a data standard and computational algorithms. We present a proof of concept of ACTUS-based financial analysis with real data provided by the European Central Bank. Our experimental results with respect to the computational performance of this approach in an Apache Spark-based Big Data environment show close-to-linear scalability. The chapter closes with implications for data science.
Wolfgang Breymann, Nils Bundi, Jonas Heitz, Johannes Micheler, Kurt Stockinger
Chapter 22. Governance and IT Architecture
Abstract
Personalized medicine relies on the integration and analysis of diverse sets of health data. Many patients and healthy individuals are willing to play an active role in supporting research, provided there is a trust-promoting governance structure for data sharing as well as a return of information and knowledge. MIDATA.coop provides an IT platform that manages personal data under such a governance structure. As a not-for-profit citizen-owned cooperative, its vision is to allow citizens to collect, store, visualize, and share specific sets of their health-related data with friends and health professionals, and to make anonymized parts of these data accessible to medical research projects in areas that appeal to them. The value generated by this secondary use of personal data is managed collectively to operate and extend the platform and support further research projects. In this chapter, we describe central features of MIDATA.coop and insights gained since the platform went into operation. As an example of a novel patient engagement effort, MIDATA.coop has led to new forms of participation in research besides formal enrolment in clinical trials or epidemiological studies.
Serge Bignens, Murat Sariyar, Ernst Hafen
Chapter 23. Image Analysis at Scale for Finding the Links Between Structure and Biology
Abstract
Image data is growing at a rapid rate, whether from the continuous uploads on video portals, photo-sharing platforms, new satellites, or even medical data. The volumes have grown from tens of gigabytes to exabytes per year in less than a decade. Deeply embedded inside these datasets is detailed information on fashion trends, natural disasters, agricultural output, or looming health risks. The large majority of statistical analysis and data science is performed on numbers either as individuals or sequences. Images, however, do not neatly fit into the standard paradigms and have resulted in “graveyards” of large stagnant image storage systems completely independent of the other standard information collected. In this chapter, we will introduce the basic concepts of quantitative image analysis and show how such work can be used in the biomedical context to link hereditary information (genomic sequences) to the health or quality of bone. Since inheritance studies are much easier to perform if you are able to control breeding, the studies are performed in mice where in-breeding and cross-breeding are possible. Additionally, mice and humans share a large number of genetic and biomechanical similarities, so many of the results are transferable (Ackert-Bicknell et al. Mouse BMD quantitative trait loci show improved concordance with human genome-wide association loci when recalculated on a new, common mouse genetic map. Journal of Bone and Mineral Research 25(8):1808–1820, 2010).
Kevin Mader

Lessons Learned and Outlook

Frontmatter
Chapter 24. Lessons Learned from Challenging Data Science Case Studies
Abstract
In this chapter, we revisit the conclusions and lessons learned of the chapters presented in Part II of this book and analyze them systematically. The goal of the chapter is threefold: firstly, it serves as a directory to the individual chapters, allowing readers to identify which chapters to focus on when they are interested either in a certain stage of the knowledge discovery process or in a certain data science method or application area. Secondly, the chapter serves as a digested, systematic summary of data science lessons that are relevant for data science practitioners. And lastly, we reflect on the perceptions of a broader public toward the methods and tools that we covered in this book and dare to give an outlook toward the future developments that will be influenced by them.
Kurt Stockinger, Martin Braschler, Thilo Stadelmann
Metadata
Title
Applied Data Science
Edited by
Prof. Dr. Martin Braschler
Dr. Thilo Stadelmann
Dr. Kurt Stockinger
Copyright Year
2019
Electronic ISBN
978-3-030-11821-1
Print ISBN
978-3-030-11820-4
DOI
https://doi.org/10.1007/978-3-030-11821-1
