Information Criteria and Related Pitfalls
Eiji Kurozumi
Professor, Graduate School of Economics, Hitotsubashi University
This column discusses information criteria in light of recent research that derived model selection criteria for the number of regressors and the number of structural changes in multivariate regression models (Kurozumi and Tuvaandorj 2010). Because information criteria are so frequently used to select models in empirical analyses, they are applied not only by researchers but also by many postgraduate students. Typical approaches to model selection include information criteria and the selection of significant variables through repeated hypothesis testing. One advantage of information criteria is their simplicity. For example, if hypothesis testing is used to select the lag order of a time-series model, the selected order may differ depending on the significance level; with an information criterion, a single optimal model is selected. This may lead readers to believe that information criteria are superior to hypothesis testing, but that is by no means true. Hypothesis testing can provide a wider variety of information in some cases, although this is not detailed here, and both information criteria and hypothesis testing are undoubtedly effective statistical methods.
The Akaike information criterion (AIC) and Bayesian information criterion (BIC), which appear in many textbooks, are the most frequently used information criteria. They are defined as follows:
AIC = T*log(s2) + 2*K,  BIC = T*log(s2) + K*log(T).
In these formulas, T is the sample size, s2 is the variance estimator of the error term of the model, and K is the number of coefficients included in the model (some textbooks define the criteria with both expressions divided by T). In other words, once a model has been estimated by ordinary least squares and the variance estimator s2 has been computed from the residuals, AIC and BIC are readily obtained. The only remaining step is to find the model that minimizes the chosen criterion.
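As a concrete illustration, here is a minimal Python sketch of this procedure; the data-generating process and variable names are hypothetical, chosen only to show the mechanics of computing and minimizing the two criteria.

```python
import numpy as np

def aic_bic(y, X):
    """Fit OLS and return (AIC, BIC) as defined above:
    AIC = T*log(s2) + 2*K,  BIC = T*log(s2) + K*log(T),
    with s2 the ML variance estimator computed from the residuals."""
    T, K = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / T
    return T * np.log(s2) + 2 * K, T * np.log(s2) + K * np.log(T)

# Hypothetical data: y depends only on the first two regressors.
rng = np.random.default_rng(0)
T = 200
X_full = np.column_stack([np.ones(T), rng.standard_normal((T, 3))])
y = X_full[:, :2] @ np.array([1.0, 0.5]) + rng.standard_normal(T)

# Evaluate nested models with K = 1..4 coefficients; the model
# minimizing the criterion is the one selected.
for K in range(1, 5):
    aic, bic = aic_bic(y, X_full[:, :K])
    print(f"K={K}: AIC={aic:8.2f}  BIC={bic:8.2f}")
```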
Information criteria can be readily obtained in this way, and this ease of use has encouraged their widespread adoption. However, they come with pitfalls, typically arising from an insufficient understanding of how the criteria were derived and the incorrect use that follows from it. This is detailed below, using AIC as an example.
So, how was AIC derived in the first place? In many textbooks, the 2*K in the second term of AIC is explained as a “penalty term.” It is well known that the sum of squared residuals always decreases when regressors are added to a model. Therefore, if only the first term of AIC were used as a selection criterion, the most complicated model, with the largest number of regressors, would always be selected. A penalty for adding regressors is thus required, and that penalty is 2*K. This interpretation is not incorrect, and it comes naturally to anyone who has learned about the adjusted R-squared. So where did the 2*K penalty come from? It was in fact chosen not arbitrarily but derived properly, in a statistical manner.
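The monotone decline of the sum of squared residuals is easy to see numerically. In the hypothetical simulation below, only the intercept matters, yet the SSR keeps falling as pure-noise regressors are added, while the penalized criterion does not.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 200
# True model: an intercept only; the other five regressors are pure noise.
X = np.column_stack([np.ones(T), rng.standard_normal((T, 5))])
y = 2.0 + rng.standard_normal(T)

for K in range(1, 7):
    beta, *_ = np.linalg.lstsq(X[:, :K], y, rcond=None)
    resid = y - X[:, :K] @ beta
    ssr = resid @ resid                    # falls at every step
    aic = T * np.log(ssr / T) + 2 * K      # the penalty stops the slide
    print(f"K={K}: SSR={ssr:8.2f}  AIC={aic:8.2f}")
```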
One yardstick used in selecting models is the Kullback-Leibler information (KL information), and the following simple example illustrates it. Suppose either Model A or Model B is to be selected. Suppose also that, based on the information the researcher holds before observing any data, the probability of Model A being right (the prior probability of A) is P(A), and similarly the probability of Model B being right (the prior probability of B) is P(B). Once data x are observed, however, new probabilities of Model A and Model B being right (the posterior probabilities) can be obtained from the observed data. In other words, the judgment as to which of the two models is more likely to be right is updated once the data have been observed. Denoting these posterior probabilities by P(A|x) and P(B|x), respectively, the change in the relative plausibility of Models A and B can be measured by the following quantity:
log[P(A|x)/P(B|x)] - log[P(A)/P(B)].
If this value is large, Model A is more likely to be right; if it is small, Model B is more favorable. However, the formula above is evaluated at the particular value x that was observed. If its expected value is taken under the assumption that x is generated by Model A, the result no longer depends on x. This is the KL information, hereinafter denoted KL(A;B).
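In standard notation, with f_A and f_B denoting the densities implied by Models A and B (symbols introduced here for exposition, not in the original column), Bayes' rule reduces the quantity above to a log-likelihood ratio, and its expectation under Model A is the KL information:

```latex
% By Bayes' rule, log[P(A|x)/P(B|x)] - log[P(A)/P(B)] = log[f_A(x)/f_B(x)],
% so taking the expectation under Model A gives
\mathrm{KL}(A;B)
  = \mathbb{E}_A\!\left[\log\frac{f_A(x)}{f_B(x)}\right]
  = \int f_A(x)\,\log\frac{f_A(x)}{f_B(x)}\,dx .
```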
This KL information has two key properties: 1) KL(A;B) is nonnegative, and 2) KL(A;B) equals 0 if and only if A is equal to B. Accordingly, if Model B is selected so as to minimize KL(A;B) when Model A is the true model, a model closer to the right one is obtained. Incidentally, KL information is not a “distance” in the mathematical sense: for that, KL(A;B) and KL(B;A) would have to be identical, but the two values generally differ. The term “Kullback-Leibler distance,” which is encountered on occasion, is therefore incorrect.
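This asymmetry is easy to verify numerically. The sketch below uses the textbook closed-form expression for the KL information between two normal densities; the particular parameter values are arbitrary.

```python
import numpy as np

def kl_normal(m1, s1, m2, s2):
    """KL(A;B) for A = N(m1, s1^2) and B = N(m2, s2^2),
    using the standard closed-form expression."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

# The two directions give different values, so KL is not a distance:
print(kl_normal(0.0, 1.0, 1.0, 2.0))   # KL(A;B) ~ 0.443
print(kl_normal(1.0, 2.0, 0.0, 1.0))   # KL(B;A) ~ 1.307
```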
The issue here is that the KL information generally cannot be known, because it depends on the true Model A. One possible solution is to estimate the KL information and select the model that minimizes the estimate, and AIC is a model selection criterion based on this principle. The first step is to estimate the KL information using the log-likelihood, which corresponds to the first term of AIC. It is known, however, that the log-likelihood is a biased, not an unbiased, estimator of the KL information. The bias correction turns out to be 2*K, which is why 2*K appears in the second term of AIC.
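This bias can be made visible by simulation. The following sketch, built on an illustrative Gaussian regression design of my own choosing, compares the maximized in-sample log-likelihood with the log-likelihood the fitted model attains on fresh data from the same process. The average gap is roughly the number of estimated parameters; multiplied by two (AIC works with -2 times the log-likelihood), this is the order of the correction the penalty term supplies.

```python
import numpy as np

rng = np.random.default_rng(42)
T, p, reps = 100, 3, 5000
X = np.column_stack([np.ones(T), rng.standard_normal((T, p - 1))])
beta = np.ones(p)

gaps = []
for _ in range(reps):
    y = X @ beta + rng.standard_normal(T)        # estimation sample
    y_new = X @ beta + rng.standard_normal(T)    # fresh sample, same process
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.sum((y - X @ b) ** 2) / T            # ML variance estimator
    # In-sample maximized log-likelihood vs. plug-in log-likelihood
    # evaluated on the fresh sample:
    ll_in = -T / 2 * (np.log(2 * np.pi * s2) + 1)
    ll_out = (-T / 2 * np.log(2 * np.pi * s2)
              - np.sum((y_new - X @ b) ** 2) / (2 * s2))
    gaps.append(ll_in - ll_out)

# Average gap is close to p + 1 = 4 (the p coefficients plus the variance).
print(np.mean(gaps))
```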
As described above, AIC is simple in form but is in fact backed by statistical theory. Because of this theoretical foundation and its simple form, AIC has been widely applied in empirical analyses.
AIC must therefore not be modified without careful consideration. For example, we should be wary of extending AIC, which was derived for stationary models, to non-stationary models, because the theoretical validity of such an extension is not known a priori. One example is the estimation of cointegration models. A common way to estimate a cointegrating regression is a dynamic regression that includes leads and lags of the first differences of the regressors, and some empirical studies use the traditional AIC to select the numbers of leads and lags. Cointegration models, however, are non-stationary, so there is no guarantee that the classical AIC is an unbiased estimator of the KL information, and AIC should not be applied to them without careful consideration. As it turns out, the validity of using AIC for cointegration models is proved in Discussion Paper Series No. 6 (Choi and Kurozumi 2008), so those empirical analyses were ultimately not problematic.
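For concreteness, here is a rough sketch of such a leads-and-lags regression with the lead/lag order chosen by an information criterion. The data-generating process and function names are illustrative only, and the sketch glosses over practical details that matter in applied work (for instance, computing the criterion over a common trimmed sample for all candidate orders); it is not the exact procedure analyzed in Choi and Kurozumi (2008).

```python
import numpy as np

def dols_ic(y, x, k):
    """Regress y_t on a constant, x_t, and leads/lags -k..k of the
    first difference of x; return the AIC-type criterion value."""
    T = len(y)
    dx = np.diff(x)
    rows = range(k, T - 1 - k)     # indices with all leads/lags available
    Z = np.asarray([[1.0, x[t]] + [dx[t + j] for j in range(-k, k + 1)]
                    for t in rows])
    yk = y[list(rows)]
    beta, *_ = np.linalg.lstsq(Z, yk, rcond=None)
    resid = yk - Z @ beta
    n, K = Z.shape
    s2 = resid @ resid / n
    return n * np.log(s2) + 2 * K

# Hypothetical cointegrated pair: x is a random walk, y = 1 + 0.5*x + u.
rng = np.random.default_rng(7)
T = 300
x = np.cumsum(rng.standard_normal(T))
y = 1.0 + 0.5 * x + rng.standard_normal(T)

k_best = min(range(5), key=lambda k: dols_ic(y, x, k))
print("selected number of leads/lags:", k_best)
```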
Another example is models with structural changes. Suppose model selection is conducted using the classical AIC when the number of structural changes is unknown. For a regression model with K coefficients and no structural change, the penalty is 2*K. If there is one structural change, the coefficients are re-estimated after the break, so the penalty becomes 2*(2*K), or 4*K; in the same way, the penalty becomes 6*K with two structural changes. The question is whether this penalty is correct, and the answer is no. Discussion Paper Series No. 144 (Kurozumi and Tuvaandorj 2010) derived the following penalty for an assumed number of structural changes m:
2*(number of unknown coefficients) + 6*m
This means that the penalty of the classical AIC is insufficient: an additional 6*(number of structural changes) must be added. This penalty, too, was derived by precisely calculating the bias of the log-likelihood, and is therefore supported by statistical theory. Simulations also show that while the classical AIC almost never selects the true model with structural changes, the modified AIC chooses it frequently.
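Translated into code, the modified criterion might look like the following sketch, written here for a single-equation model even though the paper's setting is multivariate. The sample size and residual variances are hypothetical, and the estimation of the break dates needed to obtain those variances is assumed to have been carried out elsewhere.

```python
import numpy as np

def modified_aic(T, s2, n_coef, m):
    """Goodness of fit plus the structural-change penalty from the text:
    2*(number of unknown coefficients) + 6*m, where m is the assumed
    number of breaks.  With m breaks and K coefficients per regime,
    pass n_coef = (m + 1) * K."""
    return T * np.log(s2) + 2 * n_coef + 6 * m

# Hypothetical example: residual variances for m = 0, 1, 2 breaks,
# with K = 3 coefficients per regime; one break improves the fit a lot.
T, K = 200, 3
s2 = {0: 2.10, 1: 1.05, 2: 1.02}
for m, v in s2.items():
    print(f"m={m}: modified AIC = {modified_aic(T, v, (m + 1) * K, m):7.2f}")
# The number of breaks minimizing the criterion would be selected.
```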
The discussion above should make clear that although AIC has been applied widely thanks to its simple form, convenience, and ease of interpretation, it was derived with the support of statistical theory, and its use must not be extended without careful consideration. With this point in mind, it is to be hoped that information criteria will continue to be widely and correctly used, and that many useful empirical studies will result.
References
Choi, I., and E. Kurozumi (2008), “Model Selection Criteria for the Leads-and-Lags Cointegrating Regression,” Global COE Hi-Stat Discussion Paper Series No. 6, Hitotsubashi University, forthcoming in Journal of Econometrics.
Kurozumi, E., and P. Tuvaandorj (2010), “Model Selection Criteria in Multivariate Models with Multiple Structural Changes,” Global COE Hi-Stat Discussion Paper Series No. 144, Hitotsubashi University.