NIPS*2001 Workshop:
Minimum Description Length:
Developments in Theory and New Applications
December 8 (Saturday), 2001
Whistler, British Columbia, Canada
(Useful link: Selected Publications on MDL)

Workshop Organizers:

Peter Grunwald (chair), Complex Statistical Systems Group, EURANDOM, Netherlands
In-Jae Myung, Department of Psychology, Ohio State University
Mark Pitt, Center for Cognitive Science, Ohio State University


Description

Inductive inference, the process of inferring a general law from observed instances, is at the core of science. The Minimum Description Length (MDL) Principle, which was originally proposed by Jorma Rissanen in 1978 as a computable approximation of Kolmogorov complexity, is a powerful method for inductive inference. The MDL principle states that the best explanation (i.e., model) given a limited set of observed data is the one that permits the greatest compression of the data. That is, the more we are able to compress the data, the more we learn about the underlying regularities that generated the data. This conceptualization originated in algorithmic information theory from the notion that the existence of regularities underlying data necessarily implies redundancy in the information from successive observations.
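To illustrate the compression idea with a concrete (and deliberately crude) example that is not part of the workshop itself, the short Python sketch below selects a polynomial degree for a toy regression data set by minimizing a two-part description length: the bits needed to describe the fitted parameters plus the bits needed to describe the data given those parameters. The coding scheme (k/2 log2 n bits for k parameters, a Gaussian code for the residuals) is a textbook-style approximation chosen only for illustration.

import numpy as np

def two_part_code_length(x, y, degree):
    """Crude two-part MDL score: parameter bits + data bits (smaller is better)."""
    n = len(x)
    coeffs = np.polyfit(x, y, degree)                 # maximum-likelihood fit
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(np.mean(residuals ** 2), 1e-12)      # ML noise variance (guarded)
    k = degree + 1                                    # number of parameters
    data_bits = 0.5 * n * np.log2(2 * np.pi * np.e * sigma2)  # -log2 likelihood, up to constants
    model_bits = 0.5 * k * np.log2(n)                 # price of describing the parameters
    return data_bits + model_bits

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
y = 1.0 + 2.0 * x - 3.0 * x ** 2 + rng.normal(scale=0.1, size=x.size)

for d in range(6):
    print(d, round(two_part_code_length(x, y, d), 1))
# The degree with the shortest total description length (typically 2, the true degree) is selected.

The data compress well under the quadratic model because the residual variance collapses once the underlying regularity is captured; additional parameters buy almost no further compression of the data but cost extra bits for the model.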

Since 1978, significant strides have been made in both the mathematics and application of MDL. For example, MDL is now being applied in machine learning, statistical inference, model selection, and psychological modeling. The purpose of this workshop is to bring together researchers, both theorists and practitioners, to discuss the latest developments and share new ideas. In doing so, our intent is to introduce to the broader NIPS community the current state of the art in the field. More often than not, current applications of MDL in machine learning, neural computing and artificial intelligence are based on developments from the 1970s and 1980s; not many researchers seem to be aware of the much sharper methods that have been developed in the 1990s. The lineup of workshop participants not only reflects the highly interdisciplinary character of the field, but should also make the workshop appealing to a wide range of conference attendees. Further, most of the major researchers in the area have agreed to be present.


Workshop Format

This one-day workshop will be divided thematically into two parts, one focusing on theory (morning session) and the other on applications (evening session). The morning session will begin with a 55-minute tutorial on MDL by Peter Grunwald, who has just written a book on MDL that will appear around the time of the workshop. Following the tutorial there will be talks on recent advances in MDL, two of which will be given by leaders in the field: Professor Paul Vitanyi will speak on algorithmic information theory, and Dr. Jorma Rissanen, the originator of MDL, will speak on recent developments in MDL theory. The evening session will consist of five talks, mostly on applications of MDL and related ideas in cognitive science, machine learning, and statistics.

All talks will end with a 5-minute question and answer period. In addition, both the morning and evening sessions will conclude with a 20-minute panel-audience discussion. Four panelists, two per session, will have 10 minutes each to provide their own perspectives and opinions on the talks presented at the workshop, followed by audience reaction and participation.


Workshop Schedule

Morning Session (Theory)

7:30 - 8:25: Tutorial on Modern MDL (P Grunwald, EURANDOM, Netherlands)
8:25 - 8:50: Algorithmic Statistics (P Vitanyi, CWI and University of Amsterdam)
8:50 - 9:00: The Cost of an Example: Compression Schemes and MDL (M K Warmuth, University of California, Santa Cruz)

9:00 - 9:20: Coffee Break
9:20 - 9:45: MDL Theory as a Foundation for Statistical Modeling (J Rissanen, IBM)
9:45 - 10:10: MDL, Statistical Physics and Geometry (V Balasubramanian, University of Pennsylvania)

10:10 - 10:30: Panel Discussion (A Barron, Yale University; A Lanterman, Georgia Institute of Technology)
10:30 - 16:00: Ski Break

Evening Session (Applications)

16:00 - 16:25: Exact Minimax Optimal Description Length Criterion for Regression (A Barron and F Liang, Yale University)

16:25 - 16:50: A Simplicity Principle for Perception (N Chater, University of Warwick)
16:50 - 17:15: The Simplicity Principle in Visual Perception (P A van der Helm, University of Nijmegen)
17:15 - 17:40: Coffee Break

17:40 - 18:05: MDL and Classification in Machine Learning (H Tirri, University of Helsinki)
18:05 - 18:30: Model Selection in Cognitive Science (I J Myung and M A Pitt, Ohio State University)
18:30 - 19:00: Panel Discussion and Wrap-up (Discussants: J Feldman, Rutgers University; A Hanson, Indiana University)


Abstracts

Tutorial on Modern MDL

Peter Grunwald

EURANDOM, Netherlands
pdg@cwi.nl

We give a general introduction to the MDL Principle for model selection. We focus on 'universal modeling', the key concept of the advanced MDL methods that were developed mostly in the 1990s. Beyond giving a general introduction, we emphasize the differences between the (1995) universal modeling approach and the (1978) two-part code approach that is still adopted in most MDL applications. The latter approach is now regarded as a crude approximation of the universal modeling approach. By explaining the differences in detail, we clear up several very common misconceptions surrounding MDL. These misconceptions, which usually stem from identifying MDL with two-part code MDL, are:

(1) The MDL Principle may lead to arbitrary results, since by associating codes with models in a different way one can get completely different results for the same data.

(2) MDL can only be justified asymptotically since there are constants hidden in all formulas involving codelengths. These constants can only safely be neglected for data sets with sample sizes going to infinity.

(3) MDL can always be re-interpreted as Bayesian model selection with certain priors.

We explicitly show that none of these statements is true in general (while statements (1) and (2) are never true, statement (3) is true in some cases).
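For orientation, the two code lengths being contrasted can be written down explicitly; the following display is a standard formulation from the MDL literature rather than a quotation from the tutorial. For a model class \{ p_\theta : \theta \in \Theta \} and data x^n,

    L_{two\text{-}part}(x^n) = \min_\theta \left[ L(\theta) - \log p_\theta(x^n) \right],

    L_{NML}(x^n) = -\log p_{\hat\theta(x^n)}(x^n) + \log \sum_{y^n} p_{\hat\theta(y^n)}(y^n),

where \hat\theta(\cdot) denotes the maximum-likelihood estimator. The two-part length depends on the code L(\theta) chosen for the parameters, which is the source of misconception (1); the normalized maximum likelihood (universal modeling) length contains no parameter code at all, and its second term, the parametric complexity, is the same for every data sequence of length n.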

References:

A. Barron, J. Rissanen, and B. Yu (1998). The Minimum Description Length principle in coding and modeling. IEEE Trans. Inform. Theory, 44, 2743-2760 (Special Commemorative Issue: Information Theory 1948-1998, Oct. 1998).

P. Grunwald (2002). Minimum Description Length and Maximum Probability. Kluwer (forthcoming book).

P. Grunwald (1998). The Minimum Description Length Principle and Reasoning under Uncertainty, Chapters 1-2. Ph.D. Thesis, University of Amsterdam, ILLC Dissertation Series 98-3.

M. Hansen and B. Yu (2001). Model selection and the principle of Minimum Description Length. J. Amer. Statist. Assoc., 96, 746-774.

Myung, I. J., Balasubramanian, V., & Pitt, M. A. (2000). Counting probability distributions: Differential geometry and model selection. Proceedings of the National Academy of Sciences USA , 97, 11170-11175.


Algorithmic Statistics

Paul Vitanyi

CWI and University of Amsterdam
Paul.Vitanyi@cwi.nl

While Kolmogorov complexity is the accepted absolute measure of the information content of an individual finite object, a similarly absolute notion is needed for the relation between an individual data sample and an individual model summarizing the information in the data, for example, a finite set (or probability distribution) from which the data sample typically came. The statistical theory based on such relations between individual objects can be called algorithmic statistics, in contrast to classical statistical theory, which deals with relations between probabilistic ensembles. We develop the algorithmic theory of statistic, sufficient statistic, and minimal sufficient statistic. This theory is based on two-part codes consisting of the code for the statistic (the model summarizing the regularity, the meaningful information, in the data) and the model-to-data code. In contrast to the situation in probabilistic statistical theory, the algorithmic relation of (minimal) sufficiency is an absolute relation between the individual model and the individual data sample. For traditional problems, dealing with frequencies over small sample spaces, the probabilistic approach is appropriate. But for current novel applications, average relations are often irrelevant, since the part of the support of the probability density function that will ever be observed has about zero measure. This is the case in, for example, complex video and sound analysis; there we need the absolute approach. We distinguish implicit and explicit descriptions of the models. We give characterizations of the algorithmic (Kolmogorov) minimal sufficient statistic for all data samples for both description modes (in the explicit mode under some constraints). We also strengthen and elaborate earlier results on the "Kolmogorov structure function" and "absolutely non-stochastic objects", those rare objects for which the simplest models that summarize their relevant information (minimal sufficient statistics) are at least as complex as the objects themselves. We demonstrate a close relation between the probabilistic notions and the algorithmic ones: (i) in both cases there is an "information non-increase" law; (ii) it is shown that a function is a probabilistic sufficient statistic iff it is with high probability (in an appropriate sense) an algorithmic sufficient statistic.
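As a point of orientation (this is the standard definition from the algorithmic statistics literature, not a quotation from the talk): for a finite-set model S with x \in S, the two-part code length of x is K(S) + \log_2 |S|, and S is an algorithmic sufficient statistic for x when this two-part description is as concise as the best one-part description,

    K(S) + \log_2 |S| = K(x) + O(1);

a minimal sufficient statistic is a sufficient statistic of minimal Kolmogorov complexity K(S).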

The work presented here is part of a triad of papers dealing with the best individual model for individual data: The present work [1] supplies the basic theoretical underpinning by way of two-part codes, [2] derives ideal versions of applied methods (MDL) inspired by the theory, and [3] treats experimental applications thereof.

References:

[1] P. Gacs, J. Tromp, P. Vitanyi, Algorithmic Statistics, IEEE Trans. Information Theory, 47:6 (2001), To appear. http://xxx.lanl.gov/abs/math.PR/0006233

[2] P.M.B. Vitanyi and M. Li, Minimum Description Length Induction, Bayesianism, and Kolmogorov Complexity, IEEE Trans. Information Theory, IT-46:2 (2000), 446-464. http://xxx.lanl.gov/abs/cs.LG/9901014

[3] Q. Gao, M. Li and P.M.B. Vitanyi, Applying MDL to learning best model granularity, Artificial Intelligence, 121:1-2(2000), 1-29. http://xxx.lanl.gov/abs/physics/0005062

Basic theory and applications of Kolmogorov complexity:

[4] M. Li and P.M.B. Vitanyi, An Introduction to Kolmogorov Complexity and its Applications (Second Edition), Springer-Verlag, New York, 1997 (xx + 637 pp).

See also the home page of the author: http://www.cwi.nl/~paulv/ , http://www.cwi.nl/~paulv/learning.html


MDL Theory as a Foundation for Statistical Modeling

Jorma Rissanen

Royal Holloway, University of London, UK
and
Technical University of Tampere
jorma.rissanen@mdl-research.org

In this talk we outline some recent developments in MDL theory, which form a foundation for a theory of statistical modeling. Two basic notions about a data set are defined: the complexity and the information, both relative to a class of probability distributions as models. The former is the shortest code length, in a probabilistic sense, with which the data can be encoded with the help of the models; the latter is the logarithm of the number of models that are optimally distinguishable given the amount of data. The complexity is the sum of the information and the code length for the noise, in analogy with the Kolmogorov sufficient statistics decomposition in the algorithmic theory of information. Any two model classes can be compared by their complexity, and the one with the smaller complexity sees the data with less noise.
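As a concrete illustration (our own toy example using the Bernoulli model class, not an example from the talk), the sketch below computes the stochastic complexity of a binary sequence: the code length of the data under the best-fitting model plus the parametric complexity of the model class, the latter being closely related to the 'information', i.e., the logarithm of the number of distinguishable models.

import math

def log2_max_likelihood(k, n):
    # log2 of max_theta P_theta(sequence with k ones out of n) for the Bernoulli class
    if k == 0 or k == n:
        return 0.0
    p = k / n
    return k * math.log2(p) + (n - k) * math.log2(1 - p)

def log2_parametric_complexity(n):
    # log2 of the sum over all binary sequences y^n of P_{theta_hat(y)}(y), grouped by number of ones
    total = sum(math.comb(n, k) * 2.0 ** log2_max_likelihood(k, n) for k in range(n + 1))
    return math.log2(total)

def stochastic_complexity(bits):
    n, k = len(bits), sum(bits)
    fit = -log2_max_likelihood(k, n)        # code length of the data under the best model
    comp = log2_parametric_complexity(n)    # parametric complexity of the model class
    return fit + comp

x = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
print(stochastic_complexity(x))             # total normalized-maximum-likelihood code length in bits

A richer model class would fit the data better (smaller first term) but carries a larger parametric complexity; the class with the smaller total sees the data with less noise.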

References:

Rissanen, J. (2000). MDL denoising. IEEE Trans. Inform. Theory, IT-46(7).

Rissanen, J. (2001). Strong optimality of the normalized ML models as universal codes and information in data. IEEE Trans. Inform. Theory, IT-47(5), 1-6.


MDL, Statistical Physics and Geometry

Vijay Balasubramanian

University of Pennsylvania
vijay@physics.upenn.edu

The Minimum Description Length principle selects between competing models of data by picking the one that gives the shortest codelength. Bayesian selection simply picks the most likely model given the observed data. I demonstrate that the Bayesian approach is formally identical to solving a certain statistical physics problem on the space of probability distributions. Using physical techniques, I show that MDL and Bayesian methods give essentially equivalent model selection criteria. In the physical analogy, the number of data points acts as an "inverse temperature", while the relative entropy between the data-generating distribution and a model acts as an "energy". Central to the approach is the representation of a parametric model family as a surface embedded in the space of distributions, with a distinguished metric and a reparametrization-invariant measure of volume. I show that the relevant measure is the "Jeffreys prior" of Bayesian inference, which uniformly "counts" the distributions in a model family, and that the Fisher information is the associated metric. In both the MDL and Bayesian approaches, simple model families are preferred explanations of observations until enough data accumulate to justify a more complex model. The effective notion of "simplicity" employed by these techniques has a pleasing geometric interpretation: a large fraction of the volume of a simple model, measured in the distinguished metric mentioned earlier, lies close to the truth. The practical relevance of such geometric intuitions and interpretations is also considered.
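A convenient reference point is the asymptotic form of the resulting selection criterion (the well-known expansion of Rissanen (1996), reproduced here for orientation rather than quoted from the abstract): for a family with k parameters, Fisher information matrix I(\theta), and n observations,

    -\log p_{\hat\theta(x^n)}(x^n) + \frac{k}{2} \log \frac{n}{2\pi} + \log \int_{\Theta} \sqrt{\det I(\theta)}\, d\theta + o(1).

The last two terms make the geometric picture explicit: they measure the volume of the model family in the Fisher information metric, i.e., how many distinguishable distributions the family contains, which is exactly the normalization of the Jeffreys prior.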

References:

V. Balasubramanian (1997). Statistical Inference, Occam's Razor and Statistical Mechanics on The Space of Probability Distributions. Neural Computation, Vol. 9, No. 2, pp. 349-368.

I. J. Myung, V. Balasubramanian and M. A. Pitt (2000). Counting Probability Distributions: Differential Geometry and Model Selection. Proceedings of the National Academy of Sciences USA, Vol. 97, No. 21, pp. 11170-11175.

See also the home page of the author: http://dept.physics.upenn.edu/~vbalasub .


Exact Minimax Optimal Description Length Criterion for Regression

Andrew Barron and Feng Liang

Yale University
{andrew.barron, feng.liang}@yale.edu

For problems of model selection in linear regression, we determine the exact minimax universal data compression strategy for the minimum description length (MDL) criterion. The analysis also gives the best invariant and indeed minimax procedure for predictive density estimation in location families, using Kullback-Leibler loss. The exact minimax rule is generalized Bayes using a uniform (Lebesgue measure) prior on the location parameter, which is made proper by conditioning on an initial set of observations.
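As a minimal sketch of the flavor of this construction (restricted to the simplest location-family case, a normal with known variance, and not the authors' general regression procedure), the generalized-Bayes predictive density under a uniform prior on the location parameter, made proper by conditioning on the first observation, is N(mean of the observations so far, sigma^2 (1 + 1/t)); summing the resulting conditional code lengths gives a description length for the remaining data.

import math

def predictive_code_length(xs, sigma=1.0):
    # Total code length -sum_t log p(x_{t+1} | x_1 .. x_t) in nats, conditioning on x_1.
    total = 0.0
    for t in range(1, len(xs)):
        mean = sum(xs[:t]) / t                # posterior mean under the uniform (Lebesgue) prior
        var = sigma ** 2 * (1.0 + 1.0 / t)    # predictive variance of the next observation
        x = xs[t]
        total += 0.5 * math.log(2 * math.pi * var) + (x - mean) ** 2 / (2 * var)
    return total

data = [0.2, -0.1, 0.4, 0.0, 0.3, -0.2, 0.1]
print(predictive_code_length(data))

The same total is obtained whatever order the remaining observations are encoded in, since it equals the negative log of the proper generalized-Bayes joint density of x_2, ..., x_n given x_1.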

References:

A. R. Barron, J Rissanen, and B. Yu (1998). The minimum description length principle in coding and modeling. IEEE Trans. Inform. Theory, 44: 2743-2760.

B. S. Clarke and A. R. Barron (1990). Information-theoretic asymptotics of Bayes methods. IEEE Trans. Inform. Theory , 36: 453-471.

B. S. Clarke and A. R. Barron (1994). Jeffreys' prior is asymptotically least favorable under entropy risk. J. Statistical Planning and Inference, 41: 37-60.

A Simplicity Principle for Perception

Nick Chater

University of Warwick
N.Chater@warwick.ac.uk

This talk considers the hypothesis that the perceptual system seeks the simplest (shortest) description of perceptual data. This idea can be traced back to Mach and is consistent with recent work in Kolmogorov complexity and minimum description length statistics. At face value, however, there are clear empirical counterexamples to the viewpoint. I consider the extent to which these counterexamples can be dealt with by formulating a version of the simplicity principle for perception that reflects relevant constraints of the perceptual system.

Reference:

N Chater (1996). Reconciling simplicity and likelihood principles in perceptual organization. Psychological Review , 103, 566-581.


The Simplicity Principle in Visual Perception

Peter van der Helm

University of Nijmegen
peterh@nici.kun.nl

The perceptual counterpart of the mathematical MDL principle, i.e., the simplicity principle, states that the human visual system selects the simplest interpretation of a given visual stimulus. This perceptual principle has a longer research history than its mathematical counterpart but, due to its informal formulation, has had to rely mainly on empirical evidence. Such evidence is necessary but not sufficient, and has to be complemented with formal theoretical evidence. Perception research has already successfully attacked the problems of defining regularity (rather than defining randomness, as in mathematics), of measuring complexity, and of computability. Yet, theoretically, the veridicality (or predictive power) of the perceptual simplicity principle remained unclear. In this context, mathematical research on the MDL principle has been extremely helpful, by providing the perceptual simplicity principle with a firm formal foundation. Furthermore, it provides a better conceptual framework for theoretical and empirical research into, for instance, the status of perceptual categories, and the relation between viewpoint dependencies and viewpoint independencies in object perception.

References:

Hochberg, J. E., & McAlister, E. (1953). A quantitative approach to figural "goodness". Journal of Experimental Psychology , 46, 361--364. (Proposal of the simplicity principle in visual perception)

Garner, W. R. (1962). Uncertainty and structure as psychological concepts . New York: Wiley. (Conceptual transition from classical to structural information theory)

Simon, H. A. (1972). Complexity and the representation of patterned sequences of symbols. Psychological Review, 79, 369--382. (Perceptual coding languages; empirical confirmation of the Invariance Theorem)

Leeuwenberg, E. L. J., & Boselie, F. (1988). Against the likelihood principle in visual form perception. Psychological Review , 95, 485--491. (Perceptual categories and structural probabilities)

van der Helm, P. A. (2000). Simplicity versus likelihood in visual perception: From surprisals to precisals. Psychological Bulletin , 126, 770--800. (Combining classical, structural, and algorithmic information theory).


MDL and Classification in Machine Learning

Henri Tirri

Helsinki Institute for Information Technology
Henri.Tirri@cs.Helsinki.fi

The general MDL Principle has deep, fundamental implications for modeling practices in various fields, from statistics to psychology. In computer science, issues of modeling are particularly prevalent in areas that focus on building intelligent artificial systems. General symbolic artificial intelligence research, neurocomputing, and machine learning are fields that rely heavily on the construction of models, and regularly use statistical techniques to address model construction tasks. One fundamental problem class commonly addressed is classification, i.e., a discrete prediction problem in which each data element from a fixed domain is to be categorized using a predefined (finite) set of classes. As such problems abstract a great many diagnostic problems, they have been intensively studied in machine learning research. Although the MDL Principle has also been applied to classification tasks in machine learning, the work has concentrated on a rather naive way of applying MDL (usually with simplistic two-part coding), and the more advanced MDL methods developed in the 1990s are still virtually unknown in the community. Machine learning research has a strong emphasis on empirical results with real-life application data sets, as the predictive performance of the developed algorithms can be tested against a common shared database of data sets. We will discuss the use of advanced, i.e., Normalized Maximum Likelihood based, methods for classification, and present empirical results on data sets from one of the standard benchmark databases in the machine learning community. We will also discuss some of the difficulties of actually *implementing* the MDL Principle for classification. Part of the work presented has not yet been published and reflects work still in progress.

Reference:

P. Kontkanen, P. Myllymäki, T. Silander, H. Tirri, and P. Grünwald (2000). On Predictive Distributions and Bayesian Networks. Statistics and Computing, 10, 39-54.


Model Selection in Cognitive Science

In Jae Myung and Mark A Pitt

Ohio State University
{myung.1, pitt.2}@osu.edu

The question of how one should decide among competing explanations of data is at the heart of the scientific enterprise. Computational models in cognitive science are increasingly being advanced as explanations of cognitive behavior. The success of this line of inquiry depends on the development of robust methods to guide the evaluation and selection of these models. In this talk, we explore the application of MDL (Rissanen, 1996) to selecting among mathematical models of cognition. MDL provides an intuitive and theoretically well-grounded understanding of why one model should be chosen over another. A central, but elusive, concept in model selection, complexity, can also be derived with the method. The adequacy of the method is demonstrated in two areas of cognitive modeling: psychophysics and information integration.
 
References:

M A Pitt, I J Myung and S Zhang (in press). Toward a method of selecting among computational models of cognition. Psychological Review.

J Rissanen (1996). Fisher information and stochastic complexity. IEEE Trans Information Theory, 42, 40-47.


Contact Information

Peter Grunwald (pdg@cwi.nl, +31-20-592-4115)
In-Jae Myung (myung.1@osu.edu, +1-614-292-1862)
Mark Pitt (pitt.2@osu.edu, +1-614-292-4193)

Last modified: December 19, 2001.