
About this Book

No statistical model is "true" or "false," "right" or "wrong"; models merely have varying performance, which can be assessed. The main theme of this book is to teach modeling based on the principle that the objective is to extract from the data the information that can be learned with a suggested class of probability models. The intuitive and fundamental concepts of complexity, learnable information, and noise are formalized, which provides a firm information-theoretic foundation for statistical modeling. Inspired by Kolmogorov's structure function in the algorithmic theory of complexity, this is accomplished by finding the shortest code length, called the stochastic complexity, with which the data can be encoded when advantage is taken of the models in the suggested class; this amounts to the MDL (Minimum Description Length) principle. The complexity, in turn, breaks up into the shortest code length for the optimal model in a set of models that can be optimally distinguished from the given data, and the rest, which defines "noise" as the incompressible part of the data, carrying no useful information.

Such a view of the modeling problem permits a unified treatment of any type of parameters, their number, and even their structure. Since only optimally distinguished models are worth testing, we obtain a logically sound and straightforward treatment of hypothesis testing, in which for the first time the confidence in the test result can be assessed. Although the prerequisites include only basic probability calculus and statistics, a moderate level of mathematical proficiency is beneficial. This different and logically unassailable view of statistical modeling should provide excellent grounds for further research and suggest topics for graduate students in all fields of modern engineering, including, but not restricted to, signal and image processing, bioinformatics, pattern recognition, and machine learning.




1. Introduction

Statistical modeling or model building is an activity aimed at learning rules and restrictions in a set of observed data, proverbially called "laws" or "the go of it," as stated by Maxwell. The traditional approach, presumably influenced by physics, is to imagine or assume that the data have been generated as a sample from a population, originally of a parametrically defined probability distribution and, later, more generally, of a so-called nonparametric distribution. The so-imagined unknown distribution is then estimated from the data by use of various principles, such as the least-squares and maximum-likelihood principles, or by minimization of some mean loss function, the mean taken with respect to the imagined "true" data-generating distribution.
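As a tiny sketch of the traditional approach just described (the data and model here are illustrative, not from the book): for data assumed to be a sample from a Gaussian with unknown mean, the least-squares and maximum-likelihood principles yield the same estimate, the sample average.

```python
# Illustrative data, assumed drawn from a Gaussian with unknown mean.
data = [2.1, 1.9, 2.4, 2.0, 1.6]

# Least squares: the constant c minimizing sum (x - c)^2 is the average.
mean_ls = sum(data) / len(data)

# Maximum likelihood for a Gaussian: maximizing prod exp(-(x - mu)^2 / 2)
# means minimizing the same sum of squares, so the estimate coincides.
mean_ml = sum(data) / len(data)

assert mean_ls == mean_ml
print(mean_ls)  # 2.0
```

The coincidence is special to the Gaussian assumption; under other distributions the two principles generally give different estimates.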

Information and Coding


2. Shannon-Wiener Information

In coding we want to transmit or store sequences of elements of a finite set A = {a1, ..., am} in terms of the binary symbols 0 and 1. The set A is called an alphabet and its elements are called symbols, which can be of any kind, often numerals. Sequences of the symbols are called messages, or often just data when the symbols are numerals. We begin by defining the code as a function C : A → B*, taking each symbol in the alphabet into a finite binary string, called a codeword. These define a binary tree (see Figure 2.1), in which the codewords appear as paths starting at the root node. We are interested in codes that are one-to-one maps so that they have an inverse. The code may be extended to sequences x = x1, ..., xn, also written as C, C : A* → B*, by the operation of concatenation: C(x xn+1) = C(x)C(xn+1), where xa denotes the string obtained when the symbol a is appended to the end of the string x.
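A minimal sketch of such a code C : A → B* and its extension to sequences by concatenation; the four-symbol alphabet and its codewords below are our own illustration, not those of Figure 2.1.

```python
# An illustrative code C on the alphabet A = {a, b, c, d}.
C = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

# No codeword is a prefix of another, so the codewords sit at distinct
# leaves of the binary code tree and concatenation is uniquely decodable.
words = list(C.values())
assert not any(u != v and v.startswith(u) for u in words for v in words)

def encode(x):
    """Extend C to sequences by concatenation: C(x a) = C(x) C(a)."""
    return ''.join(C[s] for s in x)

def decode(bits):
    """Invert the extension by walking codeword boundaries in the bit string."""
    inv, out, buf = {w: s for s, w in C.items()}, [], ''
    for b in bits:
        buf += b
        if buf in inv:          # a complete codeword (a leaf) was reached
            out.append(inv[buf])
            buf = ''
    return ''.join(out)

msg = 'abacd'
print(encode(msg))                 # 0100110111
assert decode(encode(msg)) == msg  # the extended code has an inverse
```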

3. Coding of Random Processes

Up to this point we have considered encoding data modeled as outcomes of random variables, which are defined by an alphabet and a single probability distribution. When such a distribution is sampled repeatedly, we actually generate data by an iid (independent identically distributed) random process. A much more important way to model data sequences is to consider them as samples from a general random process, which we discuss next.
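For the iid case just described, the ideal code length of a sequence splits into a sum of per-symbol code lengths; the following sketch (with an illustrative Bernoulli source of our own choosing) computes both the sequence's ideal code length and the single-symbol entropy it approaches per symbol.

```python
import math

# An illustrative iid binary source and a sample sequence from it.
p = {'0': 0.75, '1': 0.25}
x = '0010'

# For an iid process P(x) factors over symbols, so the ideal code
# length -log2 P(x) is the sum of the per-symbol lengths -log2 p(s).
length = sum(-math.log2(p[s]) for s in x)

# The per-symbol cost approaches the entropy of the symbol distribution.
entropy = sum(-q * math.log2(q) for q in p.values())

print(length, entropy)
```

For long sequences from this source, length / n concentrates near the entropy, about 0.81 bits per symbol here.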

Statistical Modeling


4. Kolmogorov Complexity

For a string xn, generated by sampling a probability distribution P(xn), we have already suggested the ideal code length −log P(xn) to serve as its complexity, the Shannon complexity, with the justification that its mean is, for large alphabets, a tight lower bound on the mean prefix code length. The problem, of course, is that this measure of complexity depends very strongly on the distribution P, which in the cases of interest to us is not given. Nevertheless, we feel intuitively that a measure of complexity ought to be linked with the ease of the string's description. For instance, consider the following three types of data strings of length n = 20, where the length actually ought to be taken much larger to make our point:
generate a string by flipping a coin 20 times
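The limitation of the distribution-based measure can be seen in a small sketch (the strings below are our own illustrations): under the fair-coin distribution every length-20 string has probability 2^-20, so the Shannon complexity assigns 20 bits to a random-looking string and to a trivially regular one alike, even though the latter admits a very short description.

```python
import math

random_looking = '01101110010100011011'  # illustrative coin-flip outcome
regular = '01' * 10                      # trivially described pattern

def shannon_complexity(x):
    """-log2 P(x) under the fair-coin distribution: 2^-n for every string."""
    return -math.log2(0.5 ** len(x))

# The distribution-based measure cannot see the regularity that a short
# program (description) could exploit: both strings cost 20 bits.
assert shannon_complexity(random_looking) == shannon_complexity(regular) == 20.0
```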

5. Stochastic Complexity

As we discussed above, the problem of main interest for us is to obtain a measure of both the complexity and the (useful) information in a data set. As in the algorithmic theory, the complexity is the primary notion, which then allows us to define the more intricate notion of information. Our plan is to define the complexity in terms of the shortest code length when the data are encoded with a class of models as codes. In the previous section we saw that this leads to the noncomputability problem if we let the class of models include the set of all computer programs, a "model" identified with a computer program (code) that generates the given data. However, if we select a smaller class, the noncomputability problem can be avoided, but we have to overcome another difficulty: How are we to define the shortest code length? It seems that in order not to fall back on the Kolmogorov complexity we must spell out exactly how the distributions as models are to be used to restrict the coding operations. Here we adopt a different strategy: we define the idea of shortest code length in a probabilistic sense, which turns out to satisfy all practical requirements, unless the data strings are too short. It is clear that if the strings are short there are too many ways to model them. As a matter of fact, the strings must be long even for the desired algorithmic results to hold.
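A small sketch of the shortest-code-length idea for one concrete restricted class, the Bernoulli models (the construction below, a maximized likelihood normalized over all strings of the same length, is an illustration computed by brute-force enumeration, feasible only for tiny n):

```python
import math
from itertools import product

def max_lik(x):
    """Maximized Bernoulli likelihood of a binary tuple x."""
    n, k = len(x), sum(x)
    if k in (0, n):                  # ML parameter 0 or 1 gives probability 1
        return 1.0
    th = k / n                       # ML estimate of the success probability
    return th ** k * (1 - th) ** (n - k)

def code_length(x):
    """-log2 of the maximized likelihood of x, normalized over all
    binary strings of the same length, so that it is a code length."""
    n = len(x)
    norm = sum(max_lik(y) for y in product((0, 1), repeat=n))
    return -math.log2(max_lik(x) / norm)

x = (0, 1, 0, 0, 1, 0)
print(code_length(x))
```

The normalization term exceeds 1, so this code length is strictly larger than −log2 of the maximized likelihood alone: the excess is the price paid for not knowing the parameter in advance.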

6. Structure Function

We extend Kolmogorov’s structure function in the algorithmic theory of complexity to statistical models in order to avoid the problem of noncomputability. For this we have to construct the analog of Kolmogorov complexity and to generalize Kolmogorov’s model, a finite set, to a statistical model. The Kolmogorov complexity K(xn) will be replaced by the stochastic complexity for the model class \( \mathcal{M}_\gamma \) (5.40), and the other analogs required will be discussed next. In this section the structure index γ will be held constant, and to simplify the notation we drop it, writing the models as f(xn; θ) and the class as \( \mathcal{M}_k \). The parameters θ range over a bounded subset ω of Rk.

7. Optimally Distinguishable Models

In [3] and [4] Balasubramanian showed the interesting result that Ck,n, also known as the Riemann volume, for a fixed γ with k parameters gives the maximal number of distinguishable models that can be obtained from data of size n. His idea of distinguishability is based on differential geometry and is somewhat intricate, mainly because it is defined in terms of covers of the parameter space rather than partitions. The fact is that any two models fi and fj, no matter how far apart, have intersecting supports, and hence the sense in which two models can be distinguished does not have a direct interpretation. We wish to refine Balasubramanian’s idea by obtaining a measure of the separation between the models which is both intuitively appealing and easily interpreted. Importantly, we also show that there is a value for the parameter d, and hence for the number of the equivalence classes Bd,n(θi), for which this measure is optimized.

8. The MDL Principle

We have mentioned the MDL principle on several occasions, somewhat loosely, as the principle that calls for finding the model and model class with which the data, together with the model and model class, respectively, can be encoded with the shortest code length. Actually, to apply the principle we must distinguish between two types of models: those for data compression and those for general statistical purposes such as prediction. In data compression, we apply the models to the same data from which the models are determined. Hence these models need not have any predictive power; in fact, to get the shortest code length we do not even need to fit models in the class considered, say \( \mathcal{M}_\gamma \). This is because the universal NML model gives a code length, which we called the stochastic complexity and which we consider to be the shortest for all intents and purposes.
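A simplified two-part illustration of the principle (our own toy construction; the NML code length discussed in the text is sharper): encode a binary string either with the fixed fair-coin model, costing n bits, or with a fitted Bernoulli model, costing −log2 of the maximized likelihood plus roughly (1/2) log2 n bits for the estimated parameter, and pick whichever total is shorter.

```python
import math

def fair_coin_bits(x):
    """Code length under the fixed fair-coin model: one bit per symbol."""
    return float(len(x))

def fitted_bernoulli_bits(x):
    """-log2 maximized likelihood plus (1/2) log2 n bits for the parameter."""
    n, k = len(x), sum(x)
    if k in (0, n):
        return 0.5 * math.log2(n)
    th = k / n
    return -(k * math.log2(th) + (n - k) * math.log2(1 - th)) + 0.5 * math.log2(n)

skewed = (1,) * 18 + (0,) * 2    # strongly biased data
balanced = (0, 1) * 10           # looks like fair coin flips

# The fitted model wins only when the data are skewed enough to pay
# for the extra code length of the estimated parameter.
assert fitted_bernoulli_bits(skewed) < fair_coin_bits(skewed)
assert fitted_bernoulli_bits(balanced) > fair_coin_bits(balanced)
```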

9. Applications

The generally used Neyman-Pearson hypothesis testing rests on the assumption, particularly dangerous in this case, that one of the hypotheses tested is the true data-generating distribution. This means that the theory does not take into account the effect of having to fit the hypotheses as models to the data, and hence whatever is deduced must have a fundamental defect. The second fundamental flaw in the theory is that there is no rational, quantified way to assess the confidence in the test result arrived at. The common test is to decide between the favored null hypothesis and either a single opposing hypothesis or, more generally, one of an uncountable number of them in the case of a so-called composite hypothesis. The null hypothesis is abandoned only when the data fall in the so-called critical region, whose probability under the null hypothesis is less than a chosen level of the test, such as the commonly used value .05. While there is clearly considerable confidence in rejecting the null hypothesis when the data fall far in the tails, because such data are not likely to be typical under the null hypothesis, we have little confidence in accepting it when the data fall just outside the critical region. What, then, is the point of putting a sharp boundary on such obvious but vague considerations? After all, an epsilon variation in the data can swing the decision one way or the other. This single unwarranted act has undoubtedly caused wrong real-life decisions with literally life-and-death implications.
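The epsilon-variation point can be made concrete with a small sketch (the numbers are our own illustration): in a two-sided z-test at level .05 the null is rejected when |z| exceeds roughly 1.96, so two sample means a hair apart land on opposite sides of that sharp boundary.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """Standardized distance of the sample mean from the null mean mu0."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

mu0, sigma, n = 0.0, 1.0, 100

def reject(sample_mean):
    """Two-sided test at level .05: reject when |z| > 1.96 (approx.)."""
    return abs(z_statistic(sample_mean, mu0, sigma, n)) > 1.96

# Sample means 0.1961 and 0.1959 differ by an epsilon, yet the sharp
# boundary forces opposite decisions.
assert reject(0.1961)
assert not reject(0.1959)
```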

