The observation “Data Science is not Computer Science” does not mean that there is no computer science in data science. Far from it. In the Introduction I already noted that, to me, data science is the union of datafied research in all of academia. One of these research areas is, of course, computer science. In fact, as I will argue in this section, data science poses new as well as already existing challenges to computer science, challenges that must be met if computer science is truly to become a language for data science.
First of all, as we already noted in the Introduction, storing, manipulating, and analyzing vast amounts of data of a bewildering variety of types is becoming the core of many new approaches to science, any kind of science. That is, the problems colloquially known as Big Data are an integral part of the challenges posed by data science.
It is not just Big Data that makes it challenging for data-driven scientists to use our methods. We already briefly discussed another challenge in the context of language, viz., the large variety of algorithms for the same data mining problem, such as classification. This is a challenge for those who want to use machine learning algorithms to solve their own problems: which one should they use, and with what parameter settings? This is also still very much an open question for the computer scientists working in the area.
The challenges discussed so far could be discussed equally well in a position paper with the title “Challenges Big Data Poses to Computer Science”. This raises the question: is there more to Data Science than Big Data, at least for computer scientists? To argue that there is, I end this section with two new, related challenges: one foundational and one conceptual.
3.1 A foundational challenge
Many of the problems discussed so far are in essence aspects of the same problem: how do we know we have a good model? By becoming a language for other disciplines, computer science becomes foundational for these other sciences. This can only happen if its own foundations are secure. The selection of models from data is one of these foundations for the machine learning and data mining community.
It is well known that the so-called Problem of Induction is unsolvable [68]. That is, there is no surefire way to induce a model from a (finite) data set. The simplest way to illustrate this is by observing that there are infinitely many functions that go through finitely many given data points. Moreover, the No Free Lunch Theorems by Wolpert [70] show that in supervised learning there is not even a best (approximating) algorithm.
This hasn’t deterred computer scientists from developing well-founded theories to learn from data. Probably the best known goes under headings such as Computational Learning Theory (COLT), Probably Approximately Correct (PAC) learning, and Statistical Learning Theory (SLT), arising out of the work of Valiant [62] and, earlier, Vapnik and Chervonenkis [65].
The approach taken in PAC learning is to search for algorithms that almost always lead to almost correct models. More specifically, following [53], PAC learning is defined as follows.
A hypothesis class \(\mathcal{H}\) is agnostic PAC learnable with respect to a set Z and a loss function \(l: Z \times \mathcal{H} \rightarrow \mathbb{R}_{+}\) if there exist a function \(m_{\mathcal{H}}: (0, 1)^2 \rightarrow \mathbb{N}\) and a learning algorithm A with the following property:
- for every \(\epsilon , \delta \in (0, 1)\),
- for every distribution \(\mathcal{D}\) over Z,
- when running A on \(m \ge m_{\mathcal{H}}(\epsilon , \delta )\) i.i.d. samples generated by \(\mathcal{D}\),
- A returns a hypothesis \(h \in \mathcal{H}\) such that, with probability at least \(1 - \delta \),
$$\begin{aligned} L_{\mathcal{D}}(h) \le \min _{h' \in \mathcal{H}} L_{\mathcal{D}}(h') + \epsilon \end{aligned}$$
where \(L_{\mathcal{D}}(h)\) is the expectation of the loss l(D, h) on samples D drawn according to \(\mathcal{D}\).
And then, perhaps surprisingly, a single property decides whether or not a hypothesis class is PAC learnable, viz., its VC dimension. Briefly speaking, the VC dimension of \(\mathcal{H}\) is the size of the largest set it can shatter, i.e., the largest set such that \(\mathcal{H}\) can distinguish all its subsets: for each subset there is a hypothesis that says yes to all elements of that subset and no to all others.
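To make shattering concrete, here is a small illustrative Python sketch, with a hypothesis class of my own choosing (threshold classifiers on the real line), that brute-forces whether a finite class shatters a point set:

```python
def shatters(hypotheses, points):
    """Check whether a finite hypothesis class shatters `points`: every
    subset must be exactly the set labeled "yes" by some hypothesis."""
    realized = {frozenset(p for p in points if h(p)) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Threshold classifiers on the real line: h_t(x) = (x >= t).
thresholds = [lambda x, t=t: x >= t for t in range(0, 11)]
print(shatters(thresholds, [3]))     # True: a single point is shattered
print(shatters(thresholds, [3, 7]))  # False: no h says yes to 3 and no to 7
```

Thresholds shatter any single point but never a pair, so their VC dimension is 1.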
Moreover, if \(\mathcal{H}\) has VC dimension d, we know that we only need a sample size of
$$\begin{aligned} O\left( \frac{d + \log (1/\delta )}{\epsilon ^2} \right) \end{aligned}$$
to achieve the \(\epsilon , \delta \) bounds from the definition of PAC learning. This polynomial bound is certainly valuable in a Big Data world: your data may be too big to handle, but using just a sample of this size will almost always give you an almost optimal result.
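For concreteness, a small sketch evaluating this bound; the constant hidden in the O-notation is made an explicit assumption:

```python
import math

def pac_sample_bound(vc_dim, epsilon, delta, c=1.0):
    """Sample size sufficient for agnostic PAC learning, up to the
    constant c hidden in the O-notation (c = 1 is an assumption here)."""
    return math.ceil(c * (vc_dim + math.log(1.0 / delta)) / epsilon ** 2)

# E.g., VC dimension 10, epsilon = 0.05, delta = 0.01:
print(pac_sample_bound(10, 0.05, 0.01))  # 5843 samples (with c = 1)
```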
In fact, PAC learning can be set up by first analyzing, for a finite \(\mathcal{H}\) in the two-class classification problem, how big a sample one should take. That turning these results into a general learnability definition maintains the polynomial sample size bound is a testament to the reasonableness of PAC learning.
To further illustrate this reasonableness, we briefly look at two relaxations of the PAC learning definition. Firstly, the definition requires that the sample size bound \(m_{\mathcal{H}}\) holds for all hypotheses uniformly. What if we allow larger minimal sample sizes for “complex” hypotheses than for simpler ones? That relaxation leads to Structural Risk Minimization [64], in which one, conceptually if not necessarily practically, computes the optimal hypothesis by a tower of approximations, each learned the PAC learning way.
PAC learning requires that we can approximate the optimal hypothesis arbitrarily closely. What if we relax this assumption and only require that our solution is slightly better than random guessing? The Boosting lemma [51] then tells us that we can approximate (strong) PAC learning with a weighted sum of such weak learners, as the sketch below illustrates.
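As an illustration of the lemma at work, a short sketch using scikit-learn's AdaBoost, which builds exactly such a weighted combination of weak learners (decision stumps); the synthetic data set and parameter choices are mine:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_informative=10, random_state=0)

# A weak learner: a single decision stump.
stump = DecisionTreeClassifier(max_depth=1)
# AdaBoost combines many stumps into a weighted-majority vote
# (a depth-1 tree is also AdaBoostClassifier's default base estimator).
boosted = AdaBoostClassifier(n_estimators=200, random_state=0)

print(cross_val_score(stump, X, y).mean())    # accuracy of one stump
print(cross_val_score(boosted, X, y).mean())  # noticeably higher
```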
Finally, the PAC learning framework is independent of the distribution \(\mathcal{D}\) the data set D has been sampled from. If we are willing to take \(\mathcal{D}\) into account, we can get even better bounds on minimal sample sizes using Rademacher Complexity (also known as Rademacher Averages); see [53].
In other words, PAC learning is in a way as good as it gets: relaxations either approximate it or are approximated by it.
Our discussion so far was essentially for the two-class classification setting, but to a large extent it holds far more generally for supervised learning. In fact, there are also results for unsupervised learning problems such as itemset mining [49] and graph summarization [48]. These are, however, special among unsupervised problems in that they test properties: one can objectively test whether these properties hold or not, leading to a loss function that is, more or less, the same as for two-class classification. PAC learning is essentially supervised learning.
Unfortunately, not all datafied science is about predictions. With the notable exception of Hari Seldon,8 historians seldom attempt to predict anything. Rather, their aim is to try to understand and explain the course of history. We’ll discuss an example in some depth in the next subsection.
The point here is that many sciences are more concerned with unsupervised learning than with supervised learning. And, in general, unsupervised learning is notoriously hard to evaluate.
For example, for clustering there are various, often ad hoc, methods to evaluate the quality of the returned clusters [27]. But there is no established way to compute how close your clustering is to the optimal clustering, insofar as the latter is even defined.
While this is not so much of a problem in a non-academic setting, or where unsupervised learning is used in an exploratory data analysis phase [61] followed by a predictive analysis, it is far from ideal when the unsupervised analysis is the end product. In fact, this is exactly why some machine learners argue [69] that unsupervised learning should never be the end product: it should only be done within a (supervised) context, which gives an objective evaluation criterion, namely that the best clustering is the one that leads to the best classifier later on.
In an extension of PAC learning known as PAC-Bayes [43], this restrictive idea has been taken up to provide objective foundations for clustering, more precisely co-clustering [52]. However, as I already stated above, in some sciences there is simply no predictive context; a historian wants to understand history, not to predict it.
However much statistical tests and p values are misunderstood and misused [32], such objective measures are a necessary quality control in the scientific process. It won’t take long before researchers in other datafied areas request such objective measures for unsupervised learning.
One could argue that not even PAC learning in supervised cases gives you such measures; after all, PAC learning only tells you that the results are probably as good as it gets with the chosen hypothesis set. However, methods like train-and-test or cross-validation will give you further evidence of the quality of your result. Moreover, the bootstrap will give you, empirically, all the test statistics you need [24].
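For instance, a minimal sketch, with hypothetical fold accuracies, of how the bootstrap yields an empirical confidence interval for a test statistic:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05):
    """Percentile bootstrap confidence interval for the mean of a test
    statistic, e.g., per-fold accuracies from cross-validation."""
    scores = np.asarray(scores)
    means = np.array([rng.choice(scores, size=len(scores)).mean()
                      for _ in range(n_resamples)])
    return tuple(np.quantile(means, [alpha / 2, 1 - alpha / 2]))

# Hypothetical accuracies from 10-fold cross-validation:
print(bootstrap_ci([0.81, 0.79, 0.84, 0.80, 0.83,
                    0.78, 0.82, 0.85, 0.80, 0.81]))
```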
All of this is lacking, in general, for the unsupervised case. There is, however, a computer science theory for unsupervised learning, viz., Algorithmic Information Theory, or Kolmogorov complexity [42]. The assumption here is that the model should be computable, not an overly stringent requirement for models one may want to do something with. More precisely, the requirement is that the data set is computed by a model.
In a nutshell, fix a universal Turing machine and consider all inputs on which it generates our data set D and then halts. Out of all these inputs, choose the shortest; its length is known as the Kolmogorov complexity of D.
Why the shortest? One argument refers to Ockham’s razor, paraphrasing: do not do with more what you can do with less; see, e.g., [57] for a thorough analysis of that argument.
Another argument is by compression. If the shortest input string is shorter than D, we have compressed D. Compression works by detecting regularity; hence the shortest input string, the best possible compression, detects all regularity in D. Data sets whose Kolmogorov complexity is not smaller than their own length,
$$\begin{aligned} K(D) \ge |D| \end{aligned}$$
are random data sets, data sets with no discernible structure.
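Although the Kolmogorov complexity itself is not computable (as discussed below), any off-the-shelf compressor gives a computable upper bound, which already suffices to see the contrast between structured and random data in this small sketch:

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Length of a compressed encoding of `data`: a computable upper
    bound on (a crude proxy for) its Kolmogorov complexity."""
    return len(zlib.compress(data, 9))

structured = b"ab" * 5000        # highly regular, highly compressible
random_ish = os.urandom(10_000)  # no discernible structure

print(len(structured), compressed_size(structured))  # 10000 vs a few dozen
print(len(random_ish), compressed_size(random_ish))  # 10000 vs ~10000+
```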
A third, and final, argument is based on algorithmic probability [58]. Simplified, perhaps oversimplified, the reasoning is as follows. All programs computing D are themselves again bit strings. Kraft’s inequality [42] allows you to define a probability distribution on this set; it is slightly more complicated because the set is (countably) infinite, but this is the gist. The shortest program, i.e., the shortest bit string, computing your data set D is the most likely under this distribution among all that compute D.
One may now wonder: why should we use this distribution? It is known as Solomonoff’s universal prior and has all the good properties one would like a non-informative prior to have. For example, it does not suffer from the re-parametrization problems most non-informative priors suffer from. See [47] for a lucid discussion of this prior, its reasonableness, and the theory of universal induction it belongs to.
So, we take the shortest program, but we did not specify which universal Turing machine. The reader may wonder whether this choice makes a difference. The answer is no, because there is always a program, an input string, that turns one UTM into another; a simple consequence of being a universal Turing machine. Hence the complexity with regard to one UTM differs by at most a constant from the complexity with regard to another. Kolmogorov complexity analysis is always “up to a constant” for that reason.
A more serious problem is that the Kolmogorov complexity is not computable, for the simple reason that we asked for the shortest program that computes D and then halts, and it is well known that the Halting Problem is not decidable. It is upper semi-computable: dovetail over all Turing machines, i.e., over all input strings for your chosen UTM, and whenever one outputs D and then halts, you have a new upper bound on the complexity of D. This is, however, not an effective approach in practice.
For a practical approach, note that an input string for your favorite UTM often consists of two parts: first a part that selects a certain Turing machine (the program), followed by a “random” part that lets that program generate D. In such a case, the complexity consists of two parts: firstly, the complexity of the model (the program); secondly, the complexity of the data given that model (the data encoded by the model).
This line of reasoning leads to the Minimum Description Length (MDL) principle [34]. As for PAC learning, we start by choosing a set of hypotheses \(\mathcal{H}\). The principle can be described roughly as follows.
Given a set of models \(\mathcal{H}\), the best model \(H \in \mathcal{H}\) for data set D is the one that minimizes
$$\begin{aligned} L(H) + L(D \mid H) \end{aligned}$$
in which
- L(H) is the length, in bits, of the description of H,
- \(L(D \mid H)\) is the length, in bits, of the description of the data when encoded with H.
If this looks suspiciously like maximum likelihood with Bayes’ rule:
$$\begin{aligned} \mathbb {P}(H \mid D) \propto \mathbb {P}(H) \times \mathbb {P}( D \mid H) \end{aligned}$$
you are very much right. This rule was the original motivation9 for the inventor of MDL, Rissanen [50].
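To make the principle tangible, here is a deliberately crude sketch of a two-part code used to select a polynomial degree. The fixed number of bits per coefficient and the Gaussian code for the residuals are simplifying assumptions of mine, not Rissanen's actual codes:

```python
import numpy as np

rng = np.random.default_rng(1)

def two_part_length(x, y, degree, bits_per_param=32):
    """L(H): a fixed number of bits per coefficient; L(D|H): minus
    log2-likelihood of the residuals under a fitted Gaussian."""
    coeffs = np.polyfit(x, y, degree)
    var = (y - np.polyval(coeffs, x)).var() + 1e-12
    l_model = (degree + 1) * bits_per_param
    l_data = 0.5 * len(x) * np.log2(2 * np.pi * np.e * var)
    return l_model + l_data

# Quadratic data with noise: the sum favors degree 2 over 1 and 9,
# penalizing both underfitting (long L(D|H)) and overfitting (long L(H)).
x = np.linspace(-3, 3, 200)
y = 2 * x**2 - x + rng.normal(0, 1, size=x.size)
for d in (1, 2, 9):
    print(d, round(two_part_length(x, y, d), 1))
```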
Grounded as it is in Algorithmic Information Theory, MDL is a sound approach to inducing models from data. Indeed, it works very well for unsupervised learning; our own work on pattern set mining, starting from [54], could be cited as an example. Clustering, to take that as an example again, can be handled in at least two ways. Firstly, we can simply assume that each cluster \(C_i\) in the data is best described by its own hypothesis \(H_i\). This means that we no longer search for the single hypothesis that minimizes \(L(H) + L(D \mid H)\), but for a set of hypotheses \(\{ H_i \}\) and a partitioning of the data \(D = \bigcup _i D_i\) minimizing the sum
$$\begin{aligned} \sum _i L(H_i) + L(D_i \mid H_i) \end{aligned}$$
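As an illustration, a sketch of this sum for one-dimensional data with one Gaussian hypothesis per cluster, reusing the crude code-length conventions of the previous sketch (again my own simplification):

```python
import numpy as np

rng = np.random.default_rng(2)

def clustering_length(parts, bits_per_param=32):
    """Sum of two-part code lengths over a partition of the data,
    with each part D_i encoded by its own Gaussian H_i."""
    total = 0.0
    for part in parts:
        var = part.var() + 1e-12
        total += 2 * bits_per_param  # L(H_i): mean and variance
        total += 0.5 * len(part) * np.log2(2 * np.pi * np.e * var)  # L(D_i|H_i)
    return total

# Two well-separated groups: splitting them shortens the description.
d = np.concatenate([rng.normal(0, 1, 500), rng.normal(20, 1, 500)])
print(clustering_length([d]))                 # one cluster
print(clustering_length([d[:500], d[500:]]))  # two clusters: smaller total
```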
Alternatively, we can employ the normalized information distance [41], based on the Kolmogorov complexity:
$$\begin{aligned} d(x, y) = \frac{\max \left\{ K(x \mid y^*), K(y \mid x^*)\right\} }{\max \{K(x), K(y)\}} \end{aligned}$$
in which \(x^*\) denotes a shortest program that computes x.
This measure is, obviously, not computable, but it can be approximated by your favorite compression algorithm. Using that approximation to the distance, one can then use any of the traditional clustering algorithms.
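For instance, a minimal sketch of the standard compression-based approximation, the normalized compression distance, with zlib as the compressor and toy strings of my own:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance, the usual computable stand-in
    for the normalized information distance."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    return (len(zlib.compress(x + y)) - min(cx, cy)) / max(cx, cy)

darwin1 = b"the origin of species by means of natural selection" * 20
darwin2 = b"the descent of man and selection in relation to sex" * 20
weather = b"weather report: sunny spells with scattered showers" * 20
# Related texts should typically come out closer than unrelated ones.
print(ncd(darwin1, darwin2), ncd(darwin1, weather))
```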
The Algorithmic Information Theory approach works not only for unsupervised learning but also for supervised learning. In fact, universal induction, which we briefly mentioned above, was developed for predictive problems [58]. Still, AIT, or more particularly MDL, falls short of the foundational theory we are looking for.
Firstly, because, as in the PAC learning scheme, one has to choose a set of hypotheses, but for MDL one also has to choose an encoding scheme. In the limit, this choice may not matter; in practice, it has a major impact on what one discovers. “Up to a constant” is a problem when one has a finite data set, however big it is.
Secondly, one often has to resort to heuristic algorithms to find a good (optimality not guaranteed) hypothesis. Moreover, it is known that MDL-optimal solutions can be hard to approximate; see [2] for details. Hence, the guarantees one has from the theory do not easily carry over to applications.
Finally, like PAC learning, MDL tells you which \(H \in \mathcal{H}\) you should use. But, unlike in the supervised case, there is still no way to measure how good your choice actually is. Therefore, again unlike PAC learning, there is no way to estimate how large a sample should be to get reasonable results.
In other words, computer science has produced at least two beautiful theories for well-founded induction from data. However, both theories fall short, for different reasons, if one wants to use them as inductive foundations for Data Science. For me, this is the biggest challenge computer science has to meet to successfully become the language for Data Science; even bigger than the challenge of a symbolic language one could use to reason about algorithms, which we briefly discussed at the end of Sect. 2.
The fact that we have two theories that fail for different reasons does give hope that they could somehow be combined into a universal algorithmic theory of induction, one that can be used by data scientists on whatever problem they are working on.
One way to approach such an integration may be based on information theory. PAC-Bayes, the extension we briefly discussed, is based on mutual information rather than a simple loss function. AIT is itself an information theory. Clearly, these information theories live in different universes: entropy describes the randomness at the source of messages, while AIT describes the randomness of the object (the message) itself. Still, there is a strong connection between the two [35]:
$$\begin{aligned} \text {Entropy } \approx \text { expected Kolmogorov Complexity} \end{aligned}$$
A long time ago, Haussler stated [37]:
It’s a dangerous thing to try and formalize an enterprise as complex and varied as machine learning...
That doesn’t mean that we shouldn’t try. Even if it wouldn’t fit all possible approaches equally well, such a theory would make computer science a far better language for data science.
3.2 A conceptual challenge
When one collaborates with scientists from other disciplines on data science problems, most of the problems can be handled very well by our existing toolbox. Now and again, a problem fits our concepts very well, but the way to approach it is not immediately obvious. One example would be outlier detection. We know very well what an outlier is [3]:
An outlier is an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism
But if your colleague asks you how to find outlying articles in a newspaper archive, the operationalization of that knowledge is not immediately clear.
Even more rarely, problems come up that do not, yet, fit the concepts we have. In this section, I discuss one such problem. Not only because it illustrates so well why data science is good for computer science (new problems are fun), but also because in this particular example we are looking for an unsupervised solution, further illustrating the need for the theory I argued for in the previous section.
Historians now routinely have access to large digital historical archives, such as all (well, a large selection of) newspapers published in the Netherlands between 1618 and 1995 at http://www.delpher.nl/. With the advent of such resources, new kinds of research questions become attainable. Before such digital archives, archive-based research was limited by what human labor could achieve. Now, tools can be used to quickly select and analyze digital archives.
In one of our discussions, a colleague from the History Department of the Universiteit Utrecht suggested the following challenge. New words and concepts regularly enter our vocabulary, and the concepts, especially, often lead to discussions in newspapers. For example, the concept of Darwinism entered the Dutch vocabulary not long after Darwin published his famous “On the Origin of Species” [19]. And not only our vocabulary, but also the newspapers.
At first, the articles in the newspapers mainly consisted of a discussion between proponents and opponents of the theory of evolution. But gradually this changed: Darwinism acquired a broader meaning. Rather than “just” referring to a biological theory, it could also refer to a social science theory: social Darwinism [17], a theory which held that humans and human society are, in that sense, also subject to a struggle for the survival of the fittest, and that “natural” selection also plays a role in the evolution of human societies.
His goal is to discover and follow such evolving (no pun intended) discourses automatically from, e.g., (digital) newspaper archives. Having access to the complete discourse and its evolution would give him a far more complete view of whether, how, and when the concept of Darwinism became accepted by (a vast majority of) the Dutch population.
Note that this is very much an unsupervised learning task. We do not necessarily know which discourses have taken place, let alone how they evolved. At first sight, this may seem like a relatively straightforward topic modeling task, more specifically a dynamic topic modeling task [10], since that technique was especially designed for this kind of problem. However, it turned out not to be so easy.
Firstly, because it is not about the linear evolution of recognized topics such as Atomic Physics or Neuroscience, but about much more fuzzily defined topics: newspapers are not scientific journals, and characteristic terms may vary between different newspapers and different epochs. Secondly, the evolution is far from linear. Topics may split when the discussion goes in different directions, e.g., the discussion of biological Darwinism continues while the discussion of social Darwinism starts. Topics may merge, and they may die down and then reappear later; for the latter, think of periodic topics like the Olympics.
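To see where the difficulty surfaces, consider a naive baseline, emphatically not the dynamic topic model of [10], which couples consecutive epochs: fit an independent topic model per epoch and match topics across epochs afterwards. All names in this hypothetical Python sketch are my own illustration:

```python
from scipy.spatial.distance import jensenshannon
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def topics_per_epoch(epochs, n_topics=10):
    """Fit an independent LDA model per epoch. `epochs` is a list of
    lists of article texts, one inner list per time period; returns
    per-epoch topic-word distributions over a shared vocabulary."""
    vectorizer = CountVectorizer(max_features=5000, stop_words="english")
    vectorizer.fit([doc for epoch in epochs for doc in epoch])
    all_dists = []
    for docs in epochs:
        lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
        lda.fit(vectorizer.transform(docs))
        # Normalize rows into topic-word probability distributions.
        all_dists.append(lda.components_ /
                        lda.components_.sum(axis=1, keepdims=True))
    return all_dists

def nearest_topics(prev, curr):
    """Match each current topic to its closest predecessor by
    Jensen-Shannon distance; splits and merges show up as several
    topics mapping to the same predecessor."""
    return [min(range(len(prev)), key=lambda i: jensenshannon(prev[i], c))
            for c in curr]
```

The matching step is exactly where the conceptual problem discussed next surfaces: there is no principled threshold for deciding that two such distributions are the same evolving topic.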
This is not only a technical problem but also a conceptual one. Topic modeling is very much a special case of clustering: a topic is a cluster of newspaper articles together with a set of (more or less) characteristic terms. How can we decide that one cluster of articles (with its characteristic terms) is similar to another (with its own characteristic terms)? This is exacerbated by the fact that all the articles and most of the characteristic terms will differ between the two sets.
The goal of this discussion is not to solve this problem; it is to illustrate that Data Science poses new, interesting questions to the computer science community. Moreover, it should also illustrate the need for solid foundations, both for supervised and for unsupervised learning. Perhaps not with my first attempts at a solution, but certainly later, my colleague from the History department will ask: are you sure of this result? How likely is it that there is no better explanation of the course of history?