Data Science is, by definition, an interdisciplinary field. It incorporates knowledge from Statistics, Computer Science and Mathematics, and can hence tackle challenging application domains which had remained out of reach because of a combined lack of data and computing power. In what follows we shall illustrate this interdisciplinary nature of Data Science by means of two case studies.
5.1 Case study 1: protein structure prediction
Predicting the correct three-dimensional structure of a protein given its one-dimensional protein sequence is a crucial issue in Life Sciences and Bioinformatics. Massive databases of DNA and protein sequences have become available, and many research groups are actively pursuing their efforts to solve the protein folding problem.
A promising approach has been put forward by the research group of Prof. Thomas Hamelryck from the University of Copenhagen. It combines inputs from Biology, Statistics, Machine Learning, Physics and Computer Science, and hence is a nice example of Data Science in action. One of its main ingredients is graphical models from Machine Learning, such as dynamic Bayesian networks, which the group analyses from a statistical physics standpoint. An essential part of every protein structure is the set of dihedral angles between certain atoms. Predicting their most likely values is a key component in understanding the protein structure at a local level. These pairs of angles, however, are not typical quantities, since 0\(^{\circ }\) and 360\(^{\circ }\) represent the same value; pairs of angles therefore need to be represented as data points on a torus. Devising statistical models and methods for such data is part of a research stream called Directional Statistics (see the book [20] for a recent account) and requires, besides Mathematics, also Computer Science skills. Finally, the Hamelryck group uses probability kinematics to combine their findings on local and non-local structures in a meaningful way.
We refer the interested reader to the monograph [21] for details about this approach.
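To see why angular data demand special treatment, consider the simplest summary statistic: the mean. The sketch below is purely illustrative (it is not part of the Hamelryck group's software); it shows that the naive arithmetic mean fails for angles near the 0\(^{\circ }\)/360\(^{\circ }\) wrap-around, whereas the standard circular mean from Directional Statistics handles it correctly. For a pair of dihedral angles living on a torus, the same computation is applied to each coordinate.

```python
import math

def circular_mean(angles_deg):
    """Mean of angles in degrees, respecting the 0° == 360° wrap-around:
    average the unit vectors (cos a, sin a) and take the resulting angle."""
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    return math.degrees(math.atan2(s, c)) % 360.0

# The naive arithmetic mean of 350° and 10° is 180°, pointing the wrong
# way; the circular mean is (numerically) 0°, as it should be.
print(circular_mean([350.0, 10.0]))
```

The same idea underlies distributions for angular data, such as the von Mises distribution on the circle and its bivariate extensions on the torus.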
5.2 Case study 2: Digital Twins in engineering and personalised medicine
Our second case study is concerned with the problem of data-driven model selection in engineering and medical simulations. We split the discussion into two parts, starting with engineering applications, in which digital twins are the most advanced and where ethical considerations are more easily addressed.
All systems devised today in Engineering fall within the category of Complex Systems, i.e. systems composed of many components which interact with each other. Natural systems such as the human body or the environment are other examples of Complex Systems. It is not possible to study, design and optimise complex systems using analytical methods, i.e. hand calculations. Recourse is always made to some type of mathematical model, usually a set of partial differential equations (PDEs). The resulting problem is solved numerically using a wide variety of discretisation methods, including finite element methods [22–26], finite differences, meshfree methods [27], isogeometric approaches [28, 29], geometry-independent field approximation [30, 31], scaled-boundary finite elements [32–36], boundary element approaches [37], enriched boundary elements [38], or combinations thereof [39–41].
Discretisation methods have been the subject of a large amount of research, but a much more difficult task is the choice of a suitably descriptive mathematical model. In other words, computational engineers need to answer the question: “What is the best model for this system given computational constraints and the quantities I am interested in?” Once the model is chosen, selecting a suitable discretisation approach is usually straightforward.
Let us look at this problem of model selection via two connected examples. First, consider modern engineering materials, such as composites, which have been developed to perform well in increasingly challenging environments.\(^{2}\) The durability of gigantic composite structures such as the Airbus A380, over 79 m in wingspan, is influenced by physical phenomena occurring at the scale of carbon fibres, which are around 5 microns in diameter. The brute-force approach of including all carbon fibres in the simulation of one cubic millimetre of composite material would require solving a set of 8 billion equations in 8 billion unknowns, making the problem intractable at the scale of the aircraft. The task of the computational engineer is therefore to select a model which can deal with engineering-scale simulations in a computationally affordable manner, yet preserves the important effects taking place at the smaller scales.
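The order of magnitude quoted above can be recovered with a back-of-the-envelope calculation. The grid spacing and one-unknown-per-node assumptions below are ours, chosen purely for illustration:

```python
# Back-of-the-envelope estimate (assumptions ours, for illustration only):
# resolving 5-micron fibres calls for a grid spacing of roughly 0.5 microns,
# and we count one scalar unknown per grid node.
edge_um = 1000.0          # edge of a 1 mm cube, in microns
spacing_um = 0.5          # grid spacing needed to resolve 5-micron fibres
nodes_per_edge = edge_um / spacing_um        # 2000 nodes per edge
unknowns = nodes_per_edge ** 3               # 2000**3 = 8e9
print(f"{unknowns:.0e} unknowns per cubic millimetre")  # prints: 8e+09 ...
```

Multiplying by the millions of cubic millimetres in an aircraft-scale component makes the intractability of the brute-force approach evident.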
Once a suitable model has been selected, the associated parameters must be identified in light of experimental observations, i.e. the model must be calibrated. In Materials Engineering, the traditional approach has been to perform experiments within laboratory conditions, which are most often far removed from those which the structure or system will undergo during its service life, in particular when harsh environmental effects are of interest. Statistical approaches can be used, but they only partially overcome the hurdle, as they are reliant upon predefined statistical distributions, which do not account for “unknown unknowns” or in-service conditions which were not considered during the experimental campaigns, “rare events” in particular. Parameter and model identification and selection are, still today, open problems.
Increasingly miniaturised and versatile sensing devices, embedded into engineering and natural systems, offer an exciting alternative to traditional (and insufficient) “experiment-in-the-lab-to-model-behaviour-in-the-field” approaches by leveraging (Big) Data gathered on the fly, during the service life of the system, to drive model selection and parameter identification.
To achieve this, Statistics (namely Bayesian inference) and Machine Learning methods [42–47] have been leveraged in recent years. The Bayesian paradigm, in particular, enables the enrichment of prior (expert) knowledge about the system with new data as it is being acquired.\(^{3}\)
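As a concrete toy illustration of this sequential enrichment, the sketch below performs a conjugate Bayesian update of a Gaussian prior on a single model parameter as sensor readings arrive one at a time. The parameter, numbers and noise levels are hypothetical; real calibration problems involve many parameters and far richer likelihoods.

```python
def update_normal(prior_mean, prior_var, obs, noise_var):
    """Conjugate Bayesian update of a Gaussian prior on a scalar model
    parameter, given one noisy observation with known noise variance."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + obs / noise_var)
    return post_mean, post_var

# Expert prior on a (hypothetical) stiffness parameter, then sensor data
# assimilated one reading at a time, "on the fly".
mean, var = 200.0, 100.0                    # prior: 200 with variance 100
for reading in [212.0, 208.0, 215.0]:
    mean, var = update_normal(mean, var, reading, noise_var=25.0)
print(round(mean, 1), round(var, 2))        # prints: 210.8 7.69
```

Each update shrinks the posterior variance: the data progressively sharpen the expert's prior belief, which is exactly the enrichment described above.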
Whilst important in Engineering, the need to update models on the fly as new data becomes available, in order to better control engineering systems, is strictly necessary in Personalised Medicine, where all patients are different and in vivo experiments are not possible. In this field, it is necessary to infer the best possible model for a patient from a priori knowledge obtained from other patients. Successful approaches have recently been published [19, 47] which enable predictive science in Medicine, for example for the laser treatment of tumours [42]. The reader is referred to [47] for a recent discussion of the emerging field known as “Computer-Guided Predictive Medicine”, to [52] for applications to brain tumour model personalisation, and to [53] for sparse Bayesian image registration.
This quest for on-the-fly data assimilation and fusion into computer models has been fuelling the development of “digital twins”: a digital twin is a digital replica of the real system, which lives a “digital life” in parallel to the real system and can be interrogated to make decisions. These twins require predictive, high-fidelity models able to learn from real-time data acquired during the life of the system, so that predictions account for “real” conditions. Digital twins could make it possible to predict the motion of target areas during surgery with predefined accuracy [54–56], to fuel virtual reality engines [57] enabling surgeons to “see through” the patient, or to investigate the potential response of a patient to a given treatment [58]. Digital twins could also enable the transition from “factors of safety”, and the associated over-engineering, to adaptive structures and systems which adapt to their environment [59–64]. For this revolution to take place, Data Science approaches must be harnessed by computational scientists. This will require significant multi-disciplinary efforts in educating the next generation of computational and data scientists.
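As a minimal sketch of what “learning from real-time data” means in practice, the following toy example (our construction, not any published digital-twin system) uses a scalar Kalman filter as a stand-in for a digital twin: the “real” system evolves with unmodelled noise, and the twin repeatedly predicts with its model, then corrects itself with each incoming sensor reading.

```python
import random

random.seed(0)
a, q, r = 0.95, 0.01, 0.04     # assumed dynamics, process noise, sensor noise
true_x = 1.0                   # hidden state of the "real" system
est_x, est_p = 0.0, 1.0        # the twin's state estimate and its variance

for _ in range(50):
    # The real system evolves; the twin only sees a noisy sensor reading.
    true_x = a * true_x + random.gauss(0.0, q ** 0.5)
    sensor = true_x + random.gauss(0.0, r ** 0.5)

    # Twin: predict with the model, then correct with the measurement.
    pred_x, pred_p = a * est_x, a * a * est_p + q
    gain = pred_p / (pred_p + r)
    est_x = pred_x + gain * (sensor - pred_x)
    est_p = (1.0 - gain) * pred_p

print(f"posterior variance: {est_p:.4f}")  # converges to about 0.015
```

The estimate variance settles at a small steady-state value: the twin stays synchronised with the real system despite never observing it directly, which is the essence of on-the-fly data assimilation.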