December 8 (Saturday), 2001

Whistler, British Columbia, Canada

(Useful link: Selected Publications on MDL)

Peter Grunwald (chair), Complex Statistical Systems Group, EURANDOM, Netherlands

In-Jae Myung, Department of Psychology, Ohio State University

Mark Pitt, Center for Cognitive Science, Ohio State University

Inductive inference, the process of inferring a general law from observed instances, is at the core of science. The Minimum Description Length (MDL) Principle, which was originally proposed by Jorma Rissanen in 1978 as a computable approximation of Kolmogorov complexity, is a powerful method for inductive inference. The MDL principle states that the best explanation (i.e., model) given a limited set of observed data is the one that permits the greatest compression of the data. That is, the more we are able to compress the data, the more we learn about the underlying regularities that generated the data. This conceptualization originated in algorithmic information theory from the notion that the existence of regularities underlying data necessarily implies redundancy in the information from successive observations.
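The link between regularity and compressibility is easy to see empirically. The following sketch (a toy illustration of ours, not part of the workshop material) compares how a general-purpose compressor handles a structured versus a random byte sequence:

```python
import random
import zlib

# Regularity implies redundancy: a structured sequence compresses far
# better than a random-looking one of the same length.
random.seed(1)
structured = ("01" * 500).encode()                         # 1000 bytes, periodic
noisy = bytes(random.getrandbits(8) for _ in range(1000))  # 1000 random bytes

print(len(zlib.compress(structured)))  # a handful of bytes
print(len(zlib.compress(noisy)))       # roughly the original length
```

In MDL terms, the very short description of the periodic string is evidence of an underlying law; the random string admits no such law and so cannot be compressed.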

Since 1978, significant strides have been made in both the mathematics
and application of MDL. For example, MDL is now being applied in
machine learning, statistical inference, model selection, and psychological
modeling. The purpose of this workshop is to bring together researchers,
both theorists and practitioners, to discuss the latest developments
and share new ideas. In doing so, our intent is to introduce to the
broader NIPS community the current state of the art in the field. More
often than not, current applications of MDL in machine learning, neural
computing and artificial intelligence are based on developments from
the 1970s and 1980s; not many researchers seem to be aware of the *much
sharper* methods that have been developed in the 1990s. The lineup of
workshop participants not only reflects the highly interdisciplinary character
of the field, but should also make the workshop appealing to a wide range
of conference attendees. Further, most of the major researchers in the area
have agreed to be present.

This one-day workshop will be divided thematically into two parts, one focusing on theory (morning session) and the other on applications (evening session). The morning session will begin with a 55-minute tutorial on MDL by Peter Grunwald, who has just written a book on MDL that will appear around the time of the workshop. Following the tutorial there will be talks on recent advances in MDL. Two of these will be given by leaders in the field: Professor Paul Vitanyi will speak on algorithmic information theory, and Dr. Jorma Rissanen, the founder of MDL, will speak on recent developments in MDL theory. The evening session will consist of five talks, mostly on applications of MDL and related ideas in cognitive science, machine learning, and statistics.

All talks will end with a 5-minute question and answer period. In addition,
both the morning and evening sessions will conclude with a 20-minute
panel-audience discussion. Four panelists will have 10 minutes each
to provide their own perspectives and opinions on the talks presented
at the workshop, followed by audience reaction and participation.

__Morning Session (Theory)__

7:30 - 8:25: Tutorial on Modern MDL (P. Grunwald, EURANDOM, Netherlands)

8:25 - 8:50: Algorithmic Statistics (P. Vitanyi, CWI and University of Amsterdam)

8:50 - 9:00: The Cost of an Example: Compression Schemes and MDL (M. K. Warmuth, University of California, Santa Cruz)

9:00 - 9:20: Coffee Break

9:20 - 9:45: MDL Theory as a Foundation for Statistical Modeling (J. Rissanen, IBM)

9:45 - 10:10: MDL, Statistical Physics and Geometry (V. Balasubramanian, University of Pennsylvania)

10:10 - 10:30: Panel Discussion (A. Barron, Yale University; A. Lanterman, Georgia Institute of Technology)

__Evening Session (Application)__

16:00 - 16:25: Exact Minimax Optimal Description Length Criterion for Regression (A. Barron and F. Liang, Yale University)

16:25 - 16:50: A Simplicity Principle for Perception (N. Chater, University of Warwick)

16:50 - 17:15: The Simplicity Principle in Visual Perception (P. A. van der Helm, University of Nijmegen)

17:15 - 17:40: Coffee Break

17:40 - 18:05: MDL and Classification in Machine Learning (H. Tirri, University of Helsinki)

18:05 - 18:30: Model Selection in Cognitive Science (I. J. Myung and M. A. Pitt, Ohio State University)

18:30 - 19:00: Panel Discussion and Wrap-up (Discussants: J. Feldman, Rutgers University; A. Hanson, Indiana University)

We give a general introduction to the MDL Principle for model selection, focusing on 'universal modeling', the key concept behind the advanced MDL methods developed mostly in the 1990s. Beyond this general introduction, we emphasize the differences between the (1995) universal modeling approach and the (1978) two-part code approach that is still adopted in most MDL applications; the latter is now regarded as a crude approximation of the former. By explaining the differences in detail, we clear up several very common misconceptions surrounding MDL. These misconceptions, which usually stem from identifying MDL with two-part code MDL, are:

(1) The MDL Principle may lead to arbitrary results, since by associating codes with models in a different way one can get completely different results for the same data.

(2) MDL can only be justified asymptotically, since there are constants hidden in all formulas involving codelengths, and these constants can only safely be neglected as the sample size goes to infinity.

(3) MDL can always be re-interpreted as Bayesian model selection with certain priors.

We explicitly show that none of these statements is true in general: statements (1) and (2) are never true, while statement (3) is true only in some cases.
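To make the contrast concrete, here is a small numerical sketch (ours, not from any talk) comparing a crude two-part code with the universal NML code on a Bernoulli model class. The 16-point parameter grid is an arbitrary assumption of the sketch:

```python
from math import comb, log2

def ml_prob(k, n):
    """Maximized Bernoulli likelihood of a sequence with k ones out of n."""
    if k in (0, n):
        return 1.0
    p = k / n
    return p ** k * (1 - p) ** (n - k)

def two_part_codelength(k, n, grid=16):
    """Crude 1978-style two-part code: name a parameter from a uniform grid
    (log2(grid) bits), then encode the data with that parameter."""
    data_bits = min(-(k * log2(i / grid) + (n - k) * log2(1 - i / grid))
                    for i in range(1, grid))
    return log2(grid) + data_bits

def nml_codelength(k, n):
    """Universal (normalized maximum likelihood) code length in bits."""
    regret = log2(sum(comb(n, j) * ml_prob(j, n) for j in range(n + 1)))
    return -log2(ml_prob(k, n)) + regret

k, n = 7, 20
print(round(two_part_codelength(k, n), 2), round(nml_codelength(k, n), 2))
```

Here the NML code is shorter, and its length does not depend on any choice of how codes are associated with individual parameter values; this is why misconception (1) fails for universal-model MDL.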

**References:**

A. Barron, J. Rissanen, and B. Yu (1998). The Minimum Description Length principle in coding and modeling. *IEEE Trans. Inform. Theory* (Special Commemorative Issue: Information Theory: 1948-1998), 44, 2743-2760.

P. Grunwald (2002). *Minimum Description Length and Maximum Probability*. Kluwer (forthcoming book).

P. Grunwald (1998). *The Minimum Description Length Principle and Reasoning under Uncertainty*, Chs. 1-2. Ph.D. Thesis, University of Amsterdam, ILLC Dissertation Series 98-3.

M. Hansen and B. Yu (2001). Model selection and the Minimum Description Length principle. *J. Amer. Statist. Assoc.*, 96, 746-774.

I. J. Myung, V. Balasubramanian, and M. A. Pitt (2000). Counting probability distributions: Differential geometry and model selection. *Proceedings of the National Academy of Sciences USA*, 97, 11170-11175.

While Kolmogorov complexity is the accepted absolute measure of the information content of an individual finite object, a similarly absolute notion is needed for the relation between an individual data sample and an individual model summarizing the information in the data, for example, a finite set (or probability distribution) from which the data sample typically came. The statistical theory based on such relations between individual objects can be called algorithmic statistics, in contrast to classical statistical theory, which deals with relations between probabilistic ensembles. We develop the algorithmic theory of statistic, sufficient statistic, and minimal sufficient statistic. This theory is based on two-part codes consisting of the code for the statistic (the model summarizing the regularity, the meaningful information, in the data) and the model-to-data code. In contrast to the situation in probabilistic statistical theory, the algorithmic relation of (minimal) sufficiency is an absolute relation between the individual model and the individual data sample. For traditional problems, dealing with frequencies over small sample spaces, the probabilistic approach is appropriate. But for current novel applications, average relations are often irrelevant, since the part of the support of the probability density function that will ever be observed has about zero measure. This is the case in, for example, complex video and sound analysis: there we need the absolute approach. We distinguish implicit and explicit descriptions of the models, and give characterizations of the algorithmic (Kolmogorov) minimal sufficient statistic for all data samples in both description modes (in the explicit mode under some constraints).

We also strengthen and elaborate earlier results on the "Kolmogorov structure function" and "absolutely non-stochastic objects": those rare objects for which the simplest models that summarize their relevant information (minimal sufficient statistics) are at least as complex as the objects themselves. We demonstrate a close relation between the probabilistic notions and the algorithmic ones: (i) in both cases there is an "information non-increase" law; (ii) a function is a probabilistic sufficient statistic if and only if it is, with high probability in an appropriate sense, an algorithmic sufficient statistic.

The work presented here is part of a triad of papers dealing with the best individual model for individual data: The present work [1] supplies the basic theoretical underpinning by way of two-part codes, [2] derives ideal versions of applied methods (MDL) inspired by the theory, and [3] treats experimental applications thereof.
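Since Kolmogorov complexity itself is uncomputable, the two-part-code idea can be illustrated with a tiny computable stand-in: a model is a finite set containing the data, and the total description length is the cost of naming the set plus log2 of its size. The two-model class below is our own toy assumption, not part of the theory:

```python
from math import comb, log2

# Two-part codes with finite-set models (a toy stand-in for algorithmic
# statistics).  Model A: S = all n-bit strings; nothing to specify.
# Model B: S = all n-bit strings with exactly k ones; specify k first.
def two_part_all(x):
    n = len(x)
    return 0 + n                            # 0 model bits + log2(2^n) data bits

def two_part_hamming(x):
    n, k = len(x), x.count("1")
    return log2(n + 1) + log2(comb(n, k))   # name k, then index x within the set

x_skewed = "0001000000010000"    # 2 ones in 16 bits: strongly regular
x_balanced = "0110100110010110"  # 8 ones in 16 bits: looks typical
for x in (x_skewed, x_balanced):
    print(round(two_part_all(x), 1), round(two_part_hamming(x), 1))
```

For the skewed string the richer model earns its description cost (the data are atypical under the trivial model); for the balanced string the trivial model is already sufficient in this miniature sense.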

**References:**

[1] P. Gacs, J. Tromp, and P. Vitanyi (2001). Algorithmic Statistics. *IEEE Trans. Information Theory*, 47(6), to appear. http://xxx.lanl.gov/abs/math.PR/0006233

[2] P. M. B. Vitanyi and M. Li (2000). Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity. *IEEE Trans. Information Theory*, IT-46(2), 446-464. http://xxx.lanl.gov/abs/cs.LG/9901014

[3] Q. Gao, M. Li, and P. M. B. Vitanyi (2000). Applying MDL to learning best model granularity. *Artificial Intelligence*, 121(1-2), 1-29. http://xxx.lanl.gov/abs/physics/0005062

Basic theory and applications of Kolmogorov complexity:

[4] M. Li and P. M. B. Vitanyi (1997). *An Introduction to Kolmogorov Complexity and its Applications* (Second Edition). Springer-Verlag, New York (xx + 637 pp).

See also the home pages of the author: http://www.cwi.nl/~paulv/ and http://www.cwi.nl/~paulv/learning.html

In this talk we outline some recent developments in MDL theory, which form a foundation for a theory of statistical modeling. Two basic notions about a data set are defined, the *complexity* and the *information*, both relative to a class of probability distributions as models. The former represents the shortest code length, in a probabilistic sense, with which the data can be encoded with use of the models; the latter is the logarithm of the number of models that are optimally distinguishable from the given amount of data. The complexity is the sum of the information and the code length for *noise*, in analogy with the Kolmogorov sufficient statistics decomposition in algorithmic information theory. Any two model classes can be compared by their complexity, and the one with smaller complexity sees the data with less noise.

**References:**

Rissanen, J. (2000). MDL denoising. *IEEE Trans. Information Theory*, IT-46(7).

Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. *IEEE Trans. Information Theory*, IT-47(5), 1-6.

The Minimum Description Length principle selects between competing models of data by picking the one that gives the shortest codelength. Bayesian selection simply picks the most likely model given observed data. I demonstrate that the Bayesian approach is formally identical to solving a certain statistical physics problem on the space of probability distributions. Using physical techniques, I show that MDL and Bayesian methods give essentially equivalent model selection criteria. In this physical analogy, the number of data points acts as an "inverse temperature", while the relative entropy between the data-generating distribution and a model acts as an "energy". Central to the approach is the representation of a parametric model family as a surface embedded in the space of distributions, with a distinguished metric and a reparametrization-invariant measure of volume. I show that the relevant measure is the "Jeffreys prior" of Bayesian inference, which uniformly "counts" the distributions in a model family, and that the Fisher information is the associated metric. In both MDL and Bayesian approaches, simple model families are preferred explanations of observations until enough data accumulates to justify a more complex model. The effective notion of "simplicity" employed by these techniques has a pleasing geometric interpretation: a large fraction of the volume of a simple model, measured in the distinguished metric mentioned earlier, lies close to the truth. The practical relevance of such geometric intuitions and interpretations is also considered.
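For the one-parameter Bernoulli family these geometric quantities can be computed directly. The sketch below (ours, with an assumed sample size n = 100) evaluates the Fisher-metric volume of the model, which is the Jeffreys-prior normalizer, and the resulting geometric complexity penalty:

```python
import math

# Fisher information of the Bernoulli family: I(theta) = 1 / (theta (1 - theta)).
def sqrt_fisher(theta):
    return 1.0 / math.sqrt(theta * (1.0 - theta))

# Riemannian volume of the model under the Fisher metric (the Jeffreys
# normalizer), via the midpoint rule; the analytic value is pi.
N = 200_000
h = 1.0 / N
volume = h * sum(sqrt_fisher((i + 0.5) * h) for i in range(N))

# Geometric complexity penalty (in nats) for k parameters and n observations:
# (k / 2) ln(n / (2 pi)) + ln(volume).
n, k = 100, 1
penalty = (k / 2) * math.log(n / (2 * math.pi)) + math.log(volume)
print(round(volume, 3), round(penalty, 3))
```

The volume term "counts" the distinguishable distributions in the family: for a fixed data set, families with a larger Fisher volume pay a larger complexity penalty.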

**References:**

V. Balasubramanian (1997). Statistical Inference, Occam's Razor and Statistical Mechanics on the Space of Probability Distributions. *Neural Computation*, 9(2), 349-368.

I. J. Myung, V. Balasubramanian, and M. A. Pitt (2000). Counting Probability Distributions: Differential Geometry and Model Selection. *Proceedings of the National Academy of Sciences USA*, 97(21), 11170-11175.

See also the home page of the author: http://dept.physics.upenn.edu/~vbalasub

For problems of model selection in linear regression, we determine the exact minimax universal data compression strategy for the minimum description length (MDL) criterion. The analysis also gives the best invariant and indeed minimax procedure for predictive density estimation in location families, using Kullback-Leibler loss. The exact minimax rule is generalized Bayes using a uniform (Lebesgue measure) prior on the location parameter, which is made proper by conditioning on an initial set of observations.
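The exact minimax rule itself is beyond a short sketch, but the flavor of MDL model selection in regression can be shown with the familiar crude criterion L(k) = (n/2) ln(RSS_k / n) + (k/2) ln n, which the exact analysis refines. The data below are simulated under assumptions of ours (a quadratic truth plus Gaussian noise):

```python
import math
import random

# Crude MDL criterion for choosing the number k of regression coefficients:
# L(k) = (n/2) ln(RSS_k / n) + (k/2) ln n   (in nats).
random.seed(0)
n = 60
xs = [i / n for i in range(n)]
ys = [2.0 + 3.0 * x - 4.0 * x * x + random.gauss(0.0, 0.1) for x in xs]

def fit_poly_rss(xs, ys, k):
    """Residual sum of squares of the least-squares degree-(k-1) polynomial,
    via the normal equations and naive Gaussian elimination (fine for tiny k)."""
    m = len(xs)
    X = [[x ** j for j in range(k)] for x in xs]
    A = [[sum(X[i][a] * X[i][b] for i in range(m)) for b in range(k)]
         for a in range(k)]
    c = [sum(X[i][a] * ys[i] for i in range(m)) for a in range(k)]
    for col in range(k):                      # forward elimination
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for cc in range(col, k):
                A[r][cc] -= f * A[col][cc]
            c[r] -= f * c[col]
    b = [0.0] * k                             # back substitution
    for r in range(k - 1, -1, -1):
        b[r] = (c[r] - sum(A[r][cc] * b[cc] for cc in range(r + 1, k))) / A[r][r]
    return sum((ys[i] - sum(b[j] * X[i][j] for j in range(k))) ** 2
               for i in range(m))

def mdl(k):
    return (n / 2) * math.log(fit_poly_rss(xs, ys, k) / n) + (k / 2) * math.log(n)

for k in range(1, 6):
    print(k, round(mdl(k), 1))
```

The quadratic truth (k = 3 coefficients) yields a large drop in the fit term, after which the (k/2) ln n penalty dominates, so on this simulated data the criterion bottoms out at or very near k = 3.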

**References:**

A. R. Barron, J. Rissanen, and B. Yu (1998). The minimum description length principle in coding and modeling. *IEEE Trans. Inform. Theory*, 44, 2743-2760.

B. S. Clarke and A. R. Barron (1990). Information-theoretic asymptotics of Bayes methods. *IEEE Trans. Inform. Theory*, 36, 453-471.

This talk considers the hypothesis that the perceptual system seeks the simplest (shortest) description of perceptual data. This idea can be traced back to Mach and is consistent with recent work in Kolmogorov complexity and minimum description length statistics. At face value, however, there are clear empirical counterexamples to the viewpoint. I consider the extent to which these counterexamples can be dealt with by formulating a version of the simplicity principle for perception that reflects relevant constraints of the perceptual system.

**Reference:**

N. Chater (1996). Reconciling simplicity and likelihood principles in perceptual organization. *Psychological Review*, 103, 566-581.

The perceptual counterpart of the mathematical MDL principle, i.e., the simplicity principle, states that the human visual system selects the simplest interpretation of a given visual stimulus. This perceptual principle has a longer research history than its mathematical counterpart but, due to its informal formulation, had to rely mainly on empirical evidence. Such evidence is necessary but not sufficient, and has to be complemented with formal theoretical evidence. Perception research has already successfully attacked the problems of defining regularity (rather than defining randomness, as in mathematics), of complexity measurement, and of computability. Yet, theoretically, the veridicality (or predictive power) of the perceptual simplicity principle remained unclear. In this context, mathematical research on the MDL principle has been extremely helpful, by providing the perceptual simplicity principle with a firm formal foundation. Furthermore, it provides a better conceptual framework for theoretical and empirical research into, for instance, the status of perceptual categories, and the relation between viewpoint dependencies and viewpoint independencies in object perception.

**References:**

Hochberg, J. E., & McAlister, E. (1953). A quantitative approach to figural "goodness". *Journal of Experimental Psychology*, 46, 361-364. (Proposal of the simplicity principle in visual perception)

Garner, W. R. (1962). *Uncertainty and structure as psychological concepts*. New York: Wiley. (Conceptual transition from classical to structural information theory)

Simon, H. A. (1972). Complexity and the representation of patterned sequences of symbols. *Psychological Review*, 79, 369-382. (Perceptual coding languages; empirical confirmation of the Invariance Theorem)

Leeuwenberg, E. L. J., & Boselie, F. (1988). Against the likelihood principle in visual form perception. *Psychological Review*, 95, 485-491. (Perceptual categories and structural probabilities)

van der Helm, P. A. (2000). Simplicity versus likelihood in visual perception: From surprisals to precisals. *Psychological Bulletin*, 126, 770-800. (Combining classical, structural, and algorithmic information theory)

The general MDL Principle has deep implications for modeling practices in fields ranging from statistics to psychology. In computer science, issues of modeling are particularly prevalent in areas that focus on building intelligent artificial systems. Symbolic artificial intelligence, neurocomputing, and machine learning are fields that rely heavily on the construction of models, and regularly use statistical techniques to address model construction tasks. One fundamental problem class commonly addressed is classification, i.e., a discrete prediction problem in which each data element from a fixed domain is to be categorized using a predefined (finite) set of classes. As such problems are a reasonable abstraction of many diagnostic problems, they have been intensively studied in machine learning research. Although the MDL Principle has been applied to classification tasks in machine learning, the work has concentrated on a very naive way of applying MDL (usually with simplistic two-part coding), and the more advanced MDL methods developed in the 1990s are still virtually unknown in the community. Machine learning research has a strong emphasis on empirical results with real-life application data sets, as the predictive performance of the developed algorithms can be tested against commonly shared benchmark databases. We will discuss the use of advanced, i.e., Normalized Maximum Likelihood based, methods for classification, and present empirical results on data sets from one of the standard benchmark databases in the machine learning community. We will also discuss some of the difficulties of actually *implementing* the MDL Principle for classification. Part of the work presented has not yet been published and reflects work still in progress.
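As a miniature example of NML-based reasoning in classification (a toy of our own, not the Helsinki group's implementation), exact stochastic complexity for binary class labels can decide whether splitting the data on a feature is worth the extra parameter. The counts below are assumed purely for illustration:

```python
from math import comb, log2

def ml_prob(j, n):
    """Maximized Bernoulli likelihood for j positives out of n labels."""
    if j in (0, n):
        return 1.0
    p = j / n
    return p ** j * (1 - p) ** (n - j)

def nml_codelength(j, n):
    """Exact stochastic complexity (bits) of a binary label sequence."""
    regret = log2(sum(comb(n, i) * ml_prob(i, n) for i in range(n + 1)))
    return -log2(ml_prob(j, n)) + regret

# Hypothetical counts: 40 cases with 25 positives overall; a binary feature
# splits them into one group of 20 with 18 positives and one with 7 positives.
no_split = nml_codelength(25, 40)
split = nml_codelength(18, 20) + nml_codelength(7, 20)
print(round(no_split, 2), round(split, 2))
```

Splitting pays for its second parameter here because the two groups are genuinely different; with nearly equal group rates, the two regret terms would outweigh the gain in fit and NML would keep the single model.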

**Reference:**

P. Kontkanen, P. Myllymäki, T. Silander, H. Tirri, and P. Grünwald (2000). On Predictive Distributions and Bayesian Networks. *Statistics and Computing*, 10, 39-54.

The question of how one should decide among competing
explanations of data is at the heart of the scientific enterprise. Computational
models in cognitive science are increasingly being advanced as explanations
of cognitive behavior. The success of this line of inquiry depends on
the development of robust methods to guide the evaluation and selection
of these models. In this paper, we explore the application of MDL (Rissanen,
1996) to selecting among mathematical models of cognition. MDL provides
an intuitive and theoretically well-grounded understanding of why one model
should be chosen over another. A central, but elusive, concept in model
selection, complexity, can also be derived with the method. The adequacy
of the method is demonstrated in two areas of cognitive modeling: psychophysics
and information integration.

**References:**

M. A. Pitt, I. J. Myung, and S. Zhang (in press). Toward a method of selecting among computational models of cognition. *Psychological Review*.

J. Rissanen (1996). Fisher information and stochastic complexity. *IEEE Trans. Information Theory*, 42, 40-47.

Peter Grunwald (pdg@cwi.nl, +31-20-592-4115)

In-Jae Myung (myung.1@osu.edu, +1-614-292-1862)

Mark Pitt (pitt.2@osu.edu, +1-614-292-4193)

*Last modified: December 19, 2001.*