Satellite Conference to the 30th Annual Mathematical Psychology Meetings

Symposium on Methods for Model Selection August 3-4, 1997

Indiana University
Bloomington, IN

ORGANIZING COMMITTEE AND SPONSORSHIP

Symposium organizers:
Malcolm Forster (University of Wisconsin)
Peter Grunwald (CWI, Amsterdam)
In Jae Myung (Ohio State University)

The symposium is made possible by generous financial support from the Indiana University Cognitive Science Program (Richard Shiffrin, Director).


INTRODUCTION

The symposium is aimed at the following question:

How does one decide whether one model is better than another?

Psychologists in general, and modelers in particular, are often faced with a choice between competing models fit to the same data set. In certain special cases, such as when one model is a parametric restriction of another, there exist accepted methods for model comparison. In most other cases, especially where the models are of different types, methods of selecting a best model are largely unknown, even to practitioners within the field of Mathematical Psychology. However, recent years have seen many exciting developments in this area; a number of quantitative and statistical techniques have been and are being developed. The issues are of course complex, and choosing a correct technique is made difficult by the many competing approaches, including classical Neyman-Pearson testing, the Akaike Information Criterion, the Bayesian approach, Minimum Description Length methods, cross-validation, and so on. Because the techniques have in many cases arisen independently in quite different fields, their relatedness and relative merits are not easy to see.

Thus the aims of the conference are to present the latest developments, highlight their practical value, provide an understanding of the methods, and provide some insights into why some may work better in some applications than in others. For each of the approaches mentioned above, there will be an introductory, tutorial-like talk given by an expert in the field. Additional talks will focus on applications and comparisons among the approaches. The symposium will be 'user friendly': time will be reserved for discussion and questions, and the speakers will not assume the audience to have any prior knowledge of the methods under discussion.

A special issue of the Journal of Mathematical Psychology, based on this symposium, is being planned.





PROGRAM

The schedule includes a 10-minute informal discussion break between talks, in addition to the 10 to 15 minutes of formal discussion included in each talk.

Sunday, August 3, 1997

1:30 - 1:40 Opening Remarks
1:40 - 2:15 Perception, Information Integration, Model Choice, and Multiple Phenomena
James Cutting (Cornell University)
2:25 - 3:20 Akaike's Information Criterion and Recent Developments in Informational Complexity
Hamparsum Bozdogan (University of Tennessee at Knoxville)
3:20 - 3:50 Coffee Break
3:50 - 4:45 Cross-validation Methods
Michael Browne (Ohio State University)
4:55 - 5:50 The Modeling of Model Selection
Malcolm Forster (University of Wisconsin, Madison)
6:00 - 6:35 Generalization Test Method for Model Comparison
Jerome Busemeyer (Purdue University)

Monday, August 4, 1997

9:00 - 9:55 Bayesian Model Selection
Larry Wasserman (Carnegie Mellon University)
10:05 - 10:40 Model Selection Statistical Tests for Comparing Non-nested and Misspecified Models
Richard Golden (University of Texas at Dallas)
10:40 - 11:10 Coffee Break
11:10 - 12:05 The Minimum Description Length Principle
Peter Grunwald (CWI, Amsterdam)
12:15 - 12:50 Importance of Complexity in Model Selection
In Jae Myung (Ohio State University)







ABSTRACTS


Perception, Information Integration, Model Choice, and Multiple Phenomena

James Cutting
Department of Psychology
Cornell University
jec7@cornell.edu

Bruno and Cutting (1988) manipulated four sources of information affecting the perceived layout and depth of three objects--relative size, height in the visual field, occlusion, and motion perspective. They suggested that an additive model accounted well for their results. Massaro (1988) reanalyzed these data and found that a fuzzy-logical model of perception (FLMP), a Bayesian-type model with the same number of parameters as the additive model, accounted for the data equally well. Cutting, Bruno, Brady, and Moore (1992) ran additional studies and, overall, found that the data of 23 observers were fit better by an additive model, and the data of 21 were fit better by FLMP. They also conducted a series of Monte Carlo simulations and found that the scope of FLMP was considerably greater than that of the additive model, and attributed the difference to model complexity, particularly as captured by minimum description length. Massaro and Cohen (1992) thought this inconsequential, even "magical"; but Li, Lewandowsky, and DeBrunner (1996) and Myung and Pitt (1997) explored relations between the additive model and FLMP in other ways. This small, but potentially important, controversy focused on how best to characterize information integration in the domain of depth perception. Two issues arise from it. One is long-standing and the topic of this conference--how to choose a model. The other, which I claim is prior, is about how to frame one's expectations about what to model. To this end Cutting and Vishton (1995; Cutting, 1997) presented an analysis of the relative potency of nine sources of information about depth and layout as a function of distance. The major results were that the ordinal ranking of the potency of these sources varies with distance from the observer, and that these rankings depend on the assumptions made in using each source of information. Examples will be given in which different sources ratify or falsify the assumptions of other sources, creating different percepts. Thus, one criterion of model choice concerns the degree to which the models capture the multiple phenomena of interest in a principled way. Neither an additive model nor FLMP is successful in this regard.
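
For readers unfamiliar with the two models at issue, the following sketch (an illustrative addition, not material from the talk) contrasts the additive and fuzzy-logical integration rules for two hypothetical cue supports t and c on the unit interval; in actual applications these values are free parameters estimated from factorial data, so the two models have the same number of parameters.

def additive(t, c):
    # Additive/averaging integration of two cue supports.
    return (t + c) / 2.0

def flmp(t, c):
    # Fuzzy-logical (multiplicative, Bayes-like) integration of two cue supports.
    return (t * c) / (t * c + (1.0 - t) * (1.0 - c))

t, c = 0.8, 0.6
print(additive(t, c), flmp(t, c))   # 0.70 vs. about 0.86: FLMP amplifies agreeing evidence

The difference in how sharply the two rules respond to agreeing cues is one source of FLMP's greater scope in the Monte Carlo work cited above.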

References

Bruno, N. & Cutting, J.E. (1988). Minimodularity and the perception of layout. Journal of Experimental Psychology: General, 117, 161-170.
Cutting, J.E. (1997). How the eye measures reality and virtual reality. Behavior Research Methods, Instruments, & Computers, 29, 27-36.
Cutting, J.E., Bruno, N., Brady, N. & Moore, C. (1992). Selectivity, scope, and simplicity of models: A lesson from fitting judgments of perceived depth. Journal of Experimental Psychology: General, 121, 364-381.
Cutting, J.E. & Vishton, P.M. (1995). Perceiving layout and knowing distances: The integration, relative potency, and contextual use of different information about depth. In W. Epstein & S. Rogers (eds.) Perception of space and motion (pp. 69-117). San Diego, CA: Academic Press.
Li, S.-C., Lewandowsky, S., & DeBrunner, V.E. (1996). Using parameter sensitivity and interdependence to predict model scope and falsifiability. Journal of Experimental Psychology: General, 125, 360-369.
Massaro, D.W. (1988). Ambiguity and perception in experimentation. Journal of Experimental Psychology: General, 117, 417-421.
Massaro, D.W. & Cohen, M.M. (1992). The paradigm and the fuzzy-logical model of perception are alive and well. Journal of Experimental Psychology: General, 122, 115-124.
Myung, I.J. & Pitt, M.A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79-95.


Akaike's Information Criterion and Recent Developments in Informational Complexity

Hamparsum Bozdogan
Department of Statistics
University of Tennessee at Knoxville
bozdogan@utk.edu

During the last twenty-five years, Akaike's (1973) entropic information criterion, known as AIC, has had a fundamental impact on statistical model evaluation problems. The introduction of AIC furthered the recognition of the importance of good modeling in statistics. As a result, many important statistical modeling techniques have been developed in statistics, control theory, econometrics, engineering, psychometrics, and many other fields (see Bozdogan, 1994a, b, c). The first half of this paper will study the general theory of Akaike's imaginative work, discuss its meaning and basic philosophy, introduce its analytical extensions using standard results established in mathematical statistics without violating Akaike's basic principles, show their asymptotic properties, and give results on their inferential error rates (Bozdogan, 1987). The second half of this paper will present some recent developments on a new entropic or informational complexity criterion (ICOMP) of Bozdogan (1988, 1990, 1994d, 1996, 1997a, b) for model selection. The analytic formulation of ICOMP takes the "spirit" of Akaike's (1973) AIC, but it is based on the generalization and utilization of an entropic covariance complexity (COVCOMP) index of van Emden (1971) for a multivariate normal distribution in parametric estimation. For a general multivariate linear or nonlinear structural model, ICOMP is designed to estimate a loss function:
Loss=Lack of Fit + Lack of Parsimony + Profusion of Complexity
in several ways using the additivity property of information theory, the developments in Rissanen's (1976) Final Estimation Criterion (FEC) for estimation and model identification problems, as well as AIC and its analytical extensions given in Bozdogan (1987). The first approach to ICOMP uses an information-based characterization of the covariance matrix properties of the parameter estimates and the error terms, starting from their finite sampling distributions. A second approach to quantifying the concept of overall model complexity is based on the use of the estimated inverse Fisher information matrix (IFIM), also known as the Cramér-Rao lower bound matrix. In general, ICOMP controls the risks of both insufficient and overparameterized models. The model with minimum ICOMP is chosen as the best model among all competing alternative models. With ICOMP, complexity is viewed not as the number of parameters in the model, but as the degree of interdependence, i.e., the correlational structure among the parameter estimates. By defining complexity in this way, ICOMP provides a more judicious penalty term than AIC, CAIC of Bozdogan (1987), MDL of Rissanen (1978), and SBC of Schwarz (1978). The lack of parsimony is automatically adjusted across the competing portfolio of models as the parameter spaces of these models are constrained in the model selection process. The use of the complexity of the estimated IFIM in information-theoretic model evaluation criteria takes into account the fact that as we increase the number of free parameters in a model, the accuracy of the parameter estimates decreases. Hence, ICOMP chooses models that provide more accurate and efficient parameter estimates. It takes into account parameter redundancy, stability, and the error structure of the models. Comparisons of ICOMP with other information-based model selection criteria will be made, and its consistency properties will be shown. These are developed in Bozdogan and Haughton (1996) in the case of the usual multiple regression models, where the probabilities of underfitting and overfitting of ICOMP have been established as the sample size n tends to infinity. Selected real applications of the new approach will be presented: factor analytic models, selecting the best predictors of creativity and achievement in cognitive science, choosing the number of clusters in mixture-model cluster analysis, vector autoregressive models using the genetic algorithm (GA), and a wide spectrum of important Bayesian and classical model selection problems, to demonstrate the utility and versatility of the newly proposed approach.
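
As a rough numerical illustration of the ideas above (a simplified sketch, not code from the talk; the function names and the assumption that the inverse Fisher information matrix is supplied directly are editorial), the snippet below computes a van Emden-style entropic complexity of a covariance matrix and uses it to form an ICOMP(IFIM)-style criterion alongside AIC.

import numpy as np

def c1_complexity(cov):
    # Entropic complexity of a covariance matrix:
    # (p/2) * log(mean eigenvalue) - (1/2) * log(product of eigenvalues).
    eigvals = np.linalg.eigvalsh(cov)
    p = eigvals.size
    return 0.5 * p * np.log(eigvals.mean()) - 0.5 * np.sum(np.log(eigvals))

def aic(loglik, k):
    # Akaike's criterion: lack of fit plus a penalty of two per free parameter.
    return -2.0 * loglik + 2.0 * k

def icomp_ifim(loglik, inv_fisher):
    # ICOMP-style criterion: lack of fit plus twice the complexity of the
    # estimated inverse Fisher information matrix (assumed supplied by the user).
    return -2.0 * loglik + 2.0 * c1_complexity(inv_fisher)

On this view, two models with the same number of parameters can receive different penalties if their parameter estimates are differently correlated, which is the key contrast with a pure parameter-counting penalty.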

Key words & phrases

Akaike's Information Criterion (AIC); Informational Complexity (ICOMP); Covariance Complexity (COVCOMP); Model Selection; Bayesian and Classical Model Selection.

References

Akaike, H. (1973). Information Theory and an Extension of the Maximum Likelihood Principle. In Second International Symposium on Information Theory, B.N. Petrov and F. Csaki (eds.), Academiai Kiado, Budapest, 267-281.
Bozdogan, H. (1987). Model Selection and Akaike's Information Criterion (AIC): The General Theory and its Analytical Extensions, Psychometrika, 52, 345-370.
Bozdogan, H. (1988). ICOMP: A New Model Selection Criterion. In Classification and Related Methods of Data Analysis, Hans H. Bock (ed.), North-Holland, Amsterdam.
Bozdogan, H. (1990). On the Information-based Measure of Covariance Complexity and its Application to the Evaluation of Multivariate Linear Models.
Bozdogan, H. (ed.) (1994a). Theory & Methodology of Time Series Analysis, Volume 1, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, the Netherlands.
Bozdogan, H. (ed.) (1994b). Multivariate Statistical Modeling, Volume 2, Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, the Netherlands.
Bozdogan, H. ( ed.) (1994c). Engineering & Scientific Applications of Informational Modeling, Volume 3, Proceedings of First US/Japan Conference on The Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, the Netherlands.
Bozdogan, H. (1994d). Mixture-model Cluster Analysis Using Model Selection Criteria and a New Informational Measure of Complexity. In Multivariate Statistical Modeling, Vol. 2, H. Bozdogan (ed.), Proceedings of the First US/Japan Conference on the Frontiers of Statistical Modeling: An Informational Approach, Kluwer Academic Publishers, the Netherlands.
Bozdogan, H. (1996). A New Informational Complexity Criterion for Model Selection: The General Theory and its Applications. Invited paper.
Bozdogan, H. (1997a). Statistical Modeling and Model Evaluation: A New Informational Approach, a forthcoming book.
Bozdogan, H. (1997b). Informational Complexity and Multivariate Statistical Modeling, a forthcoming book.
Bozdogan, H. and Haughton, D. M. A. (1996). Informational Complexity Criteria for Regression Models, Computational Statistics & Data Analysis.
Rissanen, J. (1976). Minmax Entropy Estimation of Models for Vector Processes. In System Identification, R. K. Mehra and D. G. Lainiotis (eds.).
Rissanen, J. (1978). Modeling by Shortest Data Description, Automatica, 14, 465-471.
Schwarz, G. (1978). Estimating the Dimension of a Model, Annals of Statistics, 6, 461-464.


Cross-validation Methods

Michael Browne
Departments of Psychology and Statistics
Ohio State University
browne.4@osu.edu

Cross-validation methods have been employed in linear regression (e.g. Mosier, 1951) for many years as an aid to selecting the number of independent variables to be included in a regression model. They have also been used to compare alternative models in the analysis of moment structures (Cudeck & Browne, 1983). Cross-validation is of primary interest when a single model is to be selected from a number of competing models. The criterion for selection is the degree to which a model based on the available sample will reflect the behavior of future observations. Cross-validation methods are not intended to find a correct model. No assumption is made that any of the models under consideration are correct. In the classical cross-validation approach, two samples of more or less the same size are required. One, the calibration sample, is employed to estimate parameters in the model. The other, the validation sample, is employed to calculate a cross-validation index. This indicates the extent to which the model, with parameters estimated from the calibration sample, predicts the behavior of future observations. An obvious disadvantage of this approach is that more precise estimates would be possible if observations employed for the validation sample were included in the calibration sample. A possible means of reducing this loss of precision is to use one observation at a time for validation purposes and the rest for calibration, applying the process repeatedly (Geisser, 1975; Stone, 1974). This approach involves a fair amount of computation but is feasible in regression. It is not applied in the analysis of covariance structures because a covariance matrix cannot be calculated from a sample of size one. An alternative is to avoid the use of the validation sample altogether by making distributional assumptions, typically of normality, and estimating the expected value of the cross-validation index using the calibration sample alone. This approach has been applied in regression with fixed independent variables (Mallows, 1973), in regression with stochastic independent variables (Darlington, 1968; Browne, 1975) and in the analysis of covariance structures (Browne & Cudeck, 1989). It is closely related to the Akaike information criterion. Examination of formulae for the expected value of the cross-validation index indicates that the choice of model using the cross-validation approach will depend strongly on the size of the calibration sample. If the calibration sample is small, models with few parameters will be favored over those with many parameters. It is important to bear in mind, therefore, that cross-validation and related methods should not be used with the aim of selecting parsimonious, easily understood models. The aim of cross-validation is to select a model that will yield the best predictions, taking the size of the calibration sample into account. While it does happen that parsimonious models are favored when the calibration sample is small, models with many parameters will exhibit the best performance for very large calibration samples.
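
To make the two strategies discussed above concrete, here is a minimal sketch (an editorial illustration using ordinary least-squares regression; the variable names are hypothetical) of the classical split-sample cross-validation index and its leave-one-out variant.

import numpy as np

def split_sample_cv(X_cal, y_cal, X_val, y_val):
    # Classical approach: estimate regression weights on the calibration sample,
    # then compute the mean squared prediction error on the validation sample.
    beta, *_ = np.linalg.lstsq(X_cal, y_cal, rcond=None)
    return np.mean((y_val - X_val @ beta) ** 2)

def leave_one_out_cv(X, y):
    # Geisser/Stone variant: each observation is predicted from a model
    # fitted to the remaining n - 1 observations.
    n = X.shape[0]
    errors = []
    for i in range(n):
        keep = np.arange(n) != i
        beta, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        errors.append(y[i] - X[i] @ beta)
    return np.mean(np.square(errors))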

References

Browne, M. W. (1975). A comparison of single sample and cross-validation methods for estimating the mean squared error of prediction in multiple linear regression. British Journal of Mathematical and Statistical Psychology, 28, 112-120.
Browne, M. W. & Cudeck, R. (1989). Single sample cross-validation indices for covariance structures. Multivariate Behavioral Research, 24, 445-455.
Cudeck, R. & Browne, M. W. (1983). Cross-validation of covariance structures. Multivariate Behavioral Research, 18, 147-167.
Darlington, R. B. (1968). Multiple regression in psychological research. Psychological Bulletin, 69, 161-182.
Geisser, S. (1975). The predictive sample reuse method with applications. Journal of the American Statistical Association, 70, 320-328.
Mallows, C. L. (1973). Some comments on Cp. Technometrics, 15, 661-675.
Mosier, C. I. (1951). Problems and designs of cross-validation. Educational and Psychological Measurement, 11, 5-11.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions (with Discussion). Journal of the Royal Statistical Society, Series B, 36, 111-147.


The Modeling of Model Selection

Malcolm Forster
Department of Philosophy
University of Wisconsin, Madison
mforster@facstaff.wisc.edu


A single data set may be explained in terms of two or more quite different models. How does one decide which model captures the phenomenon best? A model selection criterion makes a choice from amongst any set of competing models. But is it the right choice? How good is the method? This paper is about comparing the performance of different methods of model selection. I will argue that the average performance of common model selection criteria, such as AIC and BIC, depends on the context in which the selection takes place. There is a sense in which one has to model the model selection context itself in order to select a model selection rule. My computer computations support two such results, which are already written up in a paper that can be downloaded from my home page. I plan to extend these results to non-nested models, add new model selection criteria to the comparison, and illustrate the issues with an example from mathematical psychology.
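
The kind of computation involved can be sketched as follows (an illustrative simulation under editorial assumptions about the data-generating conditions, not Forster's actual code): the rate at which AIC and BIC prefer the larger of two nested regression models depends on the effect size and sample size that define the selection context.

import numpy as np

rng = np.random.default_rng(1)

def selection_rates(n, slope, trials=500):
    # How often AIC and BIC prefer the two-parameter model (intercept + slope)
    # over the intercept-only model, for data simulated with the given slope.
    wins = {"AIC": 0, "BIC": 0}
    for _ in range(trials):
        x = rng.normal(size=n)
        y = slope * x + rng.normal(size=n)
        rss0 = np.sum((y - y.mean()) ** 2)                  # intercept-only fit
        X = np.column_stack([np.ones(n), x])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss1 = np.sum((y - X @ b) ** 2)                     # intercept + slope fit
        ll0 = -0.5 * n * np.log(rss0 / n)   # maximized log-likelihoods,
        ll1 = -0.5 * n * np.log(rss1 / n)   # dropping constants that cancel
        if (-2 * ll1 + 2 * 2) < (-2 * ll0 + 2 * 1):
            wins["AIC"] += 1
        if (-2 * ll1 + 2 * np.log(n)) < (-2 * ll0 + 1 * np.log(n)):
            wins["BIC"] += 1
    return wins

for n, slope in [(20, 0.3), (200, 0.3), (200, 0.0)]:
    print(n, slope, selection_rates(n, slope))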


Generalization Test Method for Model Comparison

Jerome R. Busemeyer
Department of Psychology
Purdue University
jbusemey@indiana.edu

One of the most common methods for evaluating and comparing mathematical models is the cross-validation method. Essentially, the parameters of each model are estimated using the data obtained from the first replication of a fixed research design, and then these same parameters are used to compute predictions for the data from the second replication of the fixed research design. This appears to be a valid method for comparing models that differ both in functional form (non-nested) and in number of parameters, because no parameters are estimated from the second replication, and it is fair to compare the parameter-free predictions of the models. The problem with this reasoning becomes apparent when one considers the logical extreme case where the sample size for each replication approaches infinity. In this limiting case, the model that fits better in the first replication will always win the competition in the second replication because the data distributions are practically identical in the two large samples. Furthermore, the model with more parameters tends to fit better in the first sample, and for this reason the cross-validation method tends to pick the model with more parameters as sample sizes grow (as do the closely related Akaike and Schwarz criteria). The irony is that large sample sizes should provide the ideal conditions for comparing models. The resolution of this paradox is the following fairly obvious change in the method: instead of employing two replications of one fixed design, divide one large design into two interleaved subdesigns. The parameters of each model are estimated from the first subdesign, and then these same parameters are used to make truly parameter-free predictions for the new conditions in the second subdesign. In other words, the models are compared in terms of their ability to interpolate and extrapolate to new design conditions. This is called the generalization test method. The power of the generalization test method over more traditional methods will be illustrated with examples from nonlinear regression and nonlinear time series analysis.
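
A minimal numerical sketch of the idea (an editorial toy example with a simulated quadratic data-generating process and polynomial candidate models, not material from the talk): the design levels are split into two interleaved subdesigns, parameters are estimated on one, and parameter-free predictions are scored on the other.

import numpy as np

rng = np.random.default_rng(0)

# Simulated responses over 20 design conditions (quadratic truth plus noise).
x = np.linspace(0.0, 1.0, 20)
y = 2.0 + 1.5 * x - 0.8 * x ** 2 + rng.normal(0.0, 0.1, size=x.size)

# Interleaved subdesigns: fit on the even-numbered conditions,
# predict the odd-numbered (new) conditions with no further estimation.
fit_idx = np.arange(x.size) % 2 == 0
test_idx = ~fit_idx

for degree in (1, 2, 3):
    coefs = np.polyfit(x[fit_idx], y[fit_idx], degree)
    preds = np.polyval(coefs, x[test_idx])          # parameter-free predictions
    gen_error = np.mean((y[test_idx] - preds) ** 2)
    print(f"degree {degree}: generalization MSE = {gen_error:.4f}")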

References

Browne, M. W. (1975). A comparison of single sample and cross-validation methods for estimating the mean squared error of prediction in multiple linear regression. British Journal of Mathematical and Statistical Psychology, 28, 112-120.
Browne, M. W., & Cudeck, R. (1989). Single sample cross-validation indices for covariance structures. Multivariate Behavioral Research, 24, 445-455.
Cudeck, R. & Browne, M. W. (1983). Cross-validation of covariance structures. Multivariate Behavioral Research, 18, 147-167.
Camstra, A., & Boomsma, A. (1992). Cross-validation in regression and covariance structure analysis. Sociological Methods and Research, 21, 89-115.


Bayesian Model Selection

Larry Wasserman
Department of Statistics
Carnegie Mellon University
larry@stat.cmu.edu

In this talk I will give a tutorial on Bayesian methods for model selection. Throughout the talk I will emphasize the relative advantages and disadvantages of Bayesian methods. I will also discuss modern approaches for implementing these methods. If time permits, I'll discuss some recent applications based on joint work with Randy Bruno and Jay McClelland. I will assume no prior knowledge of Bayesian model selection. Specific topics include:
(1) Quick overview of Bayesian methods
(2) Connections between Bayes and frequentist methods
(3) Bayes factors for model selection
(4) Relation of Bayes factors to BIC (see the sketch following this list)
(5) Asymptotic behavior of Bayes factors
(6) Markov chain Monte Carlo methods for computing Bayes factors
(7) Goodness of fit from the Bayesian perspective
(8) Some applications to psychology
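
As a small illustration of topic (4), the following sketch (an editorial illustration with hypothetical inputs, not material from the talk) uses the Schwarz approximation log p(data | model) ~ maximized log-likelihood - (k/2) log n, under which the log Bayes factor is roughly half the difference in BIC values.

import numpy as np

def bic(loglik, k, n):
    # Schwarz's Bayesian information criterion.
    return -2.0 * loglik + k * np.log(n)

def approx_log_bayes_factor(loglik1, k1, loglik2, k2, n):
    # log BF(model 1 vs. model 2) is approximately (BIC_2 - BIC_1) / 2.
    return 0.5 * (bic(loglik2, k2, n) - bic(loglik1, k1, n))

# Hypothetical maximized log-likelihoods for two models fit to n = 100 observations.
print(approx_log_bayes_factor(loglik1=-120.0, k1=3, loglik2=-124.0, k2=2, n=100))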

References

Kass, R.E. and Raftery, A. (1995). Bayes Factors. Journal of the American Statistical Association, 90, 773-795.
Kass, R.E. and Wasserman, L. (1995). A reference Bayesian test for nested hypotheses with large samples. J. Amer. Statist. Assoc., 90, 928-934.
DiCiccio, T., Kass, R., Raftery, A. and Wasserman, L. (1997). Computing Bayes factors by combining simulation and asymptotic approximations. To appear in J. Amer. Statist. Assoc. Currently available as tech report 630 at: http://www.stat.cmu.edu/cmu-stats/tr/.

Model Selection Statistical Tests for Comparing Non-nested and Misspecified Models

Richard Golden
Cognition and Neuroscience Program
University of Texas at Dallas
golden@utdallas.edu

A number of performance measures have been proposed for model selection. Such criteria include measures of the likelihood of the observed data given a model (log-likelihood functions). They also include log-likelihood functions combined with "penalty" terms for penalizing "overly complex" models, such as the Akaike Information Criterion (AIC) and Minimum Description Length (MDL) measures. The above measures, under general conditions, are special cases of a performance measure that assesses the likelihood of a model for a given data set (i.e., a performance measure that selects the "best model" with a minimum probability of error). A generalization of the minimum-probability-of-error approach to model selection is to use the performance measure that selects the "best model" with the smallest expected loss in a Bayesian sense. In this talk, a corollary to Vuong's (1989) model selection theory is presented which permits the construction of a "large-sample" statistical test for deciding whether: (i) one probability model has a smaller expected loss than another, or (ii) there is not a sufficient amount of observed data for model discrimination (or the models are "equally distant" from the data generating process). For example, such statistical tests may be used to compare two non-nested nonlinear (or linear) regression models in which predictor variable X1 belongs to model #1 but not to model #2 and predictor variable X2 belongs to model #2 but not to model #1. These statistical tests are valid even if the residual errors are not normally distributed, provided the errors are independent and identically distributed. Moreover, in the special case where the probability models happen to be nested and the full model is correctly specified, the proposed statistical test reduces automatically to Wilks' (1938) Generalized Likelihood Ratio Test.
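
A bare-bones version of a Vuong-type statistic (an editorial sketch of the basic form, without the adjustment terms discussed in the talk) can be computed from the per-observation log-likelihoods of the two fitted models.

import numpy as np
from scipy import stats

def vuong_statistic(loglik_per_obs_1, loglik_per_obs_2):
    # Per-observation log-likelihood differences between the two models.
    d = np.asarray(loglik_per_obs_1) - np.asarray(loglik_per_obs_2)
    n = d.size
    z = np.sqrt(n) * d.mean() / d.std(ddof=1)
    p_value = 2.0 * stats.norm.sf(abs(z))   # two-sided: large |z| favors one model;
                                            # small |z| means the data cannot discriminate
    return z, p_value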

References

Golden, R. M. (1995). Making correct statistical inferences using a wrong probability model. Journal of Mathematical Psychology, 38, 3-20.
Vuong, Q. H. (1989). Likelihood ratio tests for model selection and non-nested hypotheses. Econometrica, 57, 307-333.
Also check out my book:
Golden, R. M. (1996). Mathematical Methods for Neural Network Analysis and Design. MIT Press.


The Minimum Description Length Principle

Peter Grunwald
Center for Mathematics and Computer Science
CWI, Amsterdam
Peter.Grunwald@cwi.nl

Much of statistics can be seen as a special case of inductive inference, the process of inferring from a finite set of data the general laws governing this set of data. Any such `law' defines a regularity in the data; any regularity in the data can be used to compress it, i.e. describe it in a shorter way than by just listing all data items. We can then say that we have learned something about the data if we are able to describe it in a short way. Formalizing this idea, which is essentially just a version of `Occam's Razor', leads to the `Minimum Description Length (MDL) Principle'. The MDL Principle has its roots in information theory and theoretical computer science; it was introduced in its current form by J. Rissanen (1989, 1996). MDL can be seen either as a methodology or as a philosophy:
As a methodology, the MDL Principle says that we should always compress our data as much as possible. Since there is a one-to-one correspondence between code lengths (i.e. description lengths) and probabilities, it comes as no surprise that this methodology is closely related to several other paradigms in statistical modeling, most notably Bayesian methods and the Maximum Entropy Principle, but also Prequential Analysis and Cross-Validation. Indeed, it has sometimes been (erroneously) claimed that MDL is just `Bayesianism in disguise': in many contexts, starting out from quite different considerations, MDL and Bayes' methods arrive at the same mathematical formulas. However, as will be shown in the talk, there are also many contexts in which this is not the case.
As a philosophy, the MDL Principle says that in the real world, there is no such thing as a true model; in practically all cases, `a true underlying distribution' which generates the data simply does not exist: if we can compress the data, we have a good description of it and that may help us in making good predictions of future data. But that is all there is to it! This point of view (as we will show in the talk) has several implications for practical model selection problems; for example, it implies that overfitting (i.e. selecting a model that is too complex) is, in a sense, much more harmful than underfitting (selecting a model that is too simple).

In the first part of the talk, I will introduce the MDL Principle as a methodology. I will introduce two-part codes and stochastic complexity, the two main notions in MDL theory. I will show how these notions are related to maximum likelihood estimation, cross-validation, the Bayesian MAP approach and the Bayesian evidence and in particular how MDL sometimes provides an automatic means to generate Bayesian prior distributions. Some theorems justifying the use of MDL will be mentioned. In the second part of the talk, I will show in detail how MDL can be applied in the context of model selection, using a concrete example of a dataset for which several entirely different models are proposed. To this end, I will introduce Rissanen's (1989) universal test statistic. I will end the talk by briefly discussing MDL as a philosophy, and showing how this leads to a view on model selection that is quite different from most other currently held views on the topic.
The talk assumes no prior knowledge of either MDL or information theory.
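
As a crude illustration of the two-part code idea (a simplified editorial sketch, not Rissanen's stochastic complexity), the description length of a Gaussian regression model can be approximated by the negative maximized log-likelihood, playing the role of L(data | model), plus a (k/2) log n code for the k estimated parameters.

import numpy as np

def two_part_description_length(X, y, k):
    # First part: a simplified parameter code of (k/2) * log n nats.
    # Second part: code length of the data given the fitted model,
    # i.e. the negative maximized Gaussian log-likelihood.
    n = y.size
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)
    neg_loglik = 0.5 * n * (np.log(2.0 * np.pi * sigma2) + 1.0)
    return 0.5 * k * np.log(n) + neg_loglik

Under this simplified rule, the candidate design matrix yielding the shortest total description length is the model selected; richer parameter codes lead to the stochastic complexity discussed in the talk.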

References

J. Rissanen (1989). Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore.
A.R. Barron and T.M. Cover (1991). Minimum Complexity Density Estimation. IEEE Transactions on Information Theory, 37(4), 1034-1054.
J. Rissanen (1996). Fisher Information and Stochastic Complexity. IEEE Transactions on Information Theory, 42(1), 40-47.


Importance of Complexity in Model Selection

In Jae Myung
Department of Psychology
Ohio State University
myung.1@osu.edu


In this talk I will try to provide a closing summary of the other symposium talks, touching on a few main issues in model selection. A common theme underlying many model selection methods is the avoidance of unnecessarily complex models. In my view, each method falls under one of two approaches: a generalization-based approach and a likelihood-based approach. In the generalization-based approach, a model is evaluated on its ability to describe not only the present data but also future or unseen observations from the same process. Examples of this approach include cross-validation and the Akaike information criterion (AIC). The likelihood-based approach, on the other hand, selects the model that is most likely to have generated the present observations, with no consideration given to accounting for future observations. Examples of this approach include Bayesian model selection and Schwarz's Bayesian information criterion (BIC). At the heart of both approaches, however, lies the issue of model complexity. Note that in each approach the choice of overly complex models is avoided by selecting a model based not just on its maximum likelihood--which is obtained by finely tuning the parameters to the present data pattern--but rather on some sort of average performance measure: an average over all possible (present and future) data patterns in the generalization-based approach, or an average over all possible parameter values in the likelihood-based approach. Given the similarity in spirit but perhaps trivial differences in formulation, the question one might then ask is "Are the two 'essentially' equivalent?" I will argue that the answer to this question may be yes. If time permits, I will also discuss the issue of model falsifiability, which is closely related to model complexity.
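
The two kinds of averaging can be made concrete with a toy binomial example (an editorial sketch with hypothetical counts): the likelihood-based route averages the likelihood over the parameter, while the generalization-based route (here AIC) penalizes the maximized likelihood by the number of free parameters.

import numpy as np
from scipy import stats

# Hypothetical data: h heads in n tosses; compare a free-bias model (one parameter)
# with a fair-coin model (no free parameters).
n, h = 20, 12

# Likelihood-based route: marginal likelihood = average of the likelihood over theta
# under a uniform prior, approximated here on a grid.
theta = np.linspace(1e-3, 1.0 - 1e-3, 1000)
marginal_free = stats.binom.pmf(h, n, theta).mean()
marginal_fair = stats.binom.pmf(h, n, 0.5)

# Generalization-based route: AIC penalizes the maximized log-likelihood per parameter.
aic_free = -2.0 * stats.binom.logpmf(h, n, h / n) + 2.0 * 1
aic_fair = -2.0 * stats.binom.logpmf(h, n, 0.5) + 2.0 * 0

print(marginal_free, marginal_fair)   # both averages discount the finely tuned fit
print(aic_free, aic_fair)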

References

Bozdogan, H. (1987). Model selection and Akaike's information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52, 345-370.
Kass, R. E., & Raftery, A. E. (1995). Bayes factors. Journal of the American Statistical Association, 90, 773-795.
Linhart, H., & Zucchini, W. (1986). Model Selection. John Wiley & Sons.
Myung, I. J., & Pitt, M. A. (1997). Applying Occam's razor in modeling cognition: A Bayesian approach. Psychonomic Bulletin & Review, 4, 79-95.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.