Appendix

Hierarchical and nonhierarchical clustering can be viewed as complementary techniques. That is, a hierarchical clustering method can be used to identify the number of clusters and cluster seeds, and then the resulting clustering solution can be refined using a nonhierarchical clustering technique. This appendix briefly discusses the SAS commands that can be used to achieve these objectives. The data in Table 7.1 are used for illustration purposes. It will be assumed that the data set was analyzed using various hierarchical procedures to determine which hierarchical algorithm gave the best cluster solution and the number of clusters in the data set.

Table A7.1 gives the SAS commands. The data are first subjected to hierarchical clustering and an SAS output data set TREE is obtained. In the PROC TREE procedure the NCLUSTERS option specifies that a three-cluster solution is desired. The OUT=CLUS3 option requests a new data set, CLUS3, which contains the variable CLUSTER whose value gives the membership of each observation. For example, if the third observation is in cluster 2 then the value of CLUSTER for the third observation will

Table A7.1 Using a Nonhierarchical Clustering Technique to Refine a Hierarchical Cluster Solution

OPTIONS NOCENTER;
TITLE1 HIERARCHICAL ANALYSIS FOR DATA IN TABLE 7.1;
DATA TABLE1;
INPUT SID $ 1-2 INCOME 4-5 EDUC 7-8;
CARDS;
insert data here
*Commands for hierarchical clustering;
PROC CLUSTER NOPRINT METHOD=CENTROID NONORM OUT=TREE;
ID SID;
VAR INCOME EDUC;
*Commands for creating CLUS3 data set;
PROC TREE DATA=TREE OUT=CLUS3 NCLUSTERS=3 NOPRINT;
ID SID;
COPY INCOME EDUC;
PROC SORT; BY CLUSTER;
TITLE2 '3-CLUSTER SOLUTION';
*Commands for obtaining cluster means or centroids;
PROC MEANS NOPRINT; BY CLUSTER;
OUTPUT OUT=INITIAL MEAN=INCOME EDUC;
VAR INCOME EDUC;
*Commands for non-hierarchical clustering;
PROC FASTCLUS DATA=TABLE1 SEED=INITIAL LIST DISTANCE
MAXCLUSTERS=3 MAXITER=30;
VAR INCOME EDUC;
TITLE3 'NONHIERARCHICAL CLUSTERING';
be 2. The COPY command requests that the CLUS3 data set should also contain the values of the INCOME and EDUC variables. The PROC MEANS command uses the CLUS3 data set to compute the means of each variable for each cluster, and this information is contained in the SAS data set INITIAL. Finally, PROC FASTCLUS, which is the nonhierarchical clustering procedure in SAS, is employed to obtain a nonhierarchical cluster solution for the data in Table 7.1. The DATA option in the FASTCLUS procedure specifies that the data set for clustering is in the SAS data set TABLE1, and the SEED option specifies that the initial cluster seeds are in the SAS data set INITIAL.
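The same two-stage logic can be expressed outside of SAS. The following Python sketch is purely illustrative and is not part of the book's SAS code; the data values are hypothetical stand-ins for the INCOME and EDUC columns of Table 7.1. It obtains a three-cluster centroid-linkage solution, uses the cluster means as seeds, and then refines them with k-means (the nonhierarchical step).

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans

# Hypothetical stand-in for the INCOME and EDUC columns of Table 7.1
X = np.array([[5., 5.], [6., 6.], [15., 14.], [16., 15.],
              [25., 20.], [30., 19.], [4., 6.], [27., 21.]])

# Step 1: hierarchical clustering (centroid method) and a 3-cluster cut
tree = linkage(X, method="centroid")
labels = fcluster(tree, t=3, criterion="maxclust")

# Step 2: cluster means (centroids) of the hierarchical solution become seeds
seeds = np.array([X[labels == g].mean(axis=0) for g in np.unique(labels)])

# Step 3: nonhierarchical (k-means) clustering started from those seeds
km = KMeans(n_clusters=3, init=seeds, n_init=1, max_iter=30).fit(X)
print("hierarchical labels:", labels)
print("refined k-means labels:", km.labels_)
print("refined centroids:\n", km.cluster_centers_)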
CHAPTER 8

Two-Group Discriminant Analysis

Consider the following examples:

• The IRS is interested in identifying variables or factors that significantly differentiate between audited tax returns that resulted in underpayment of taxes and those that did not. The IRS also wants to know if it is possible to use the identified factors to form a composite index that will parsimoniously represent the differences between the two groups of tax returns. Finally, can the computed index be used to predict which future tax returns should be audited?

• A medical researcher is interested in determining factors that significantly differentiate between patients who have had a heart attack and those who have not yet had a heart attack. The medical researcher then wants to use the identified factors to predict whether a patient is likely to have a heart attack in the future.

• A criminologist is interested in determining differences between on-parole prisoners who have and who have not violated their parole, then using this information for making future parole decisions.

• The marketing manager of a consumer packaged goods firm is interested in identifying salient attributes that successfully differentiate between purchasers and nonpurchasers of brands, and employing this information to predict purchase intentions of potential customers.

Each of the above examples attempts to meet the following three objectives:

1. Identify the variables that discriminate "best" between the two groups.
2. Use the identified variables or factors to develop an equation or function for computing a new variable or index that will parsimoniously represent the differences between the two groups.
3. Use the identified variables or the computed index to develop a rule to classify future observations into one of the two groups.

Discriminant analysis is one of the available techniques for achieving the preceding objectives. This chapter discusses the case of two groups. The next chapter presents the case of more than two groups.

8.1 GEOMETRIC VIEW OF DISCRIMINANT ANALYSIS

The data given in Table 8.1 are used for the discussion of the geometric approach to discriminant analysis. The table gives financial ratios for a sample of 24 firms, the 12
most-admired firms and the 12 least-admired firms. The financial ratios are: EBITASS, earnings before interest and taxes to total assets, and ROTC, return on total capital.

Table 8.1 Financial Data for Most-Admired and Least-Admired Firms

Group 1: Most-Admired                          Group 2: Least-Admired
Firm                                           Firm
Number   EBITASS   ROTC     Z                  Number   EBITASS   ROTC      Z
 1        0.158    0.182    0.240               13      -0.012    -0.031   -0.030
 2        0.210    0.206    0.294               14       0.036     0.053    0.063
 3        0.207    0.188    0.279               15       0.038     0.036    0.052
 4        0.280    0.236    0.365               16      -0.063    -0.074   -0.097
 5        0.197    0.193    0.276               17      -0.054    -0.119   -0.122
 6        0.227    0.173    0.283               18       0.000    -0.005   -0.004
 7        0.148    0.196    0.243               19       0.005     0.039    0.031
 8        0.254    0.212    0.329               20       0.091     0.112    0.151
 9        0.079    0.147    0.160               21      -0.036    -0.072   -0.076
10        0.149    0.128    0.196               22       0.045     0.064    0.077
11        0.200    0.150    0.247               23      -0.026    -0.024   -0.035
12        0.187    0.191    0.267               24       0.016     0.026    0.030

Note: Z computed using w1 = .707 and w2 = .707.

8.1.1 Identifying the "Best" Set of Variables

Figure 8.1 gives a plot of the data, which can be used to visually assess the extent to which the two ratios discriminate between the two groups. The projections of the points onto the two axes, representing EBITASS and ROTC, give the values for the respective ratios. It is clear that the two groups of firms are well separated with respect to each ratio. In other words, each ratio does discriminate between the two groups of firms. Examining differences between groups with respect to a single variable is referred to as a univariate analysis. That is, does each variable (ratio) discriminate between the two groups? The univariate term is used to emphasize the fact that the differences between the two groups are assessed for each variable independent of the remaining variables.

It is also clear that the two groups are well separated in the two-dimensional space, which implies that both ratios combined or jointly provide a good separation of the two groups of firms. Examining differences with respect to two or more variables simultaneously is referred to as multivariate analysis. The multivariate term is used to emphasize that the differences in the means of the two groups are assessed simultaneously for all the variables.

In the above example, based on a visual analysis, both of the variables do seem to discriminate between the two groups. This may not always be the case. Consider, for instance, the case where data on four financial ratios, X1, X2, X3, and X4, are available for the two groups of firms. Figure 8.2 portrays the distribution of each financial ratio. From the figure it is apparent that there is a greater difference between most-admired and least-admired firms with respect to ratios X1 and X2 than with respect to ratios X3 and X4. That is, ratios X1 and X2 are the variables that provide the "best" discrimination between the two groups. Identifying a set of variables that "best" discriminates between the two groups is the first objective of discriminant analysis. Variables providing the best discrimination are called discriminator variables.
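As a numerical companion to this visual inspection, the short Python sketch below (illustrative only, not part of the book's SAS/SPSS examples; it simply re-enters the Table 8.1 values) computes the group means and an independent-samples t statistic for each ratio, which is exactly the univariate view of group separation described above.

import numpy as np
from scipy.stats import ttest_ind

# EBITASS and ROTC for the 12 most-admired (g1) and 12 least-admired (g2) firms of Table 8.1
g1 = np.array([[0.158, 0.182], [0.210, 0.206], [0.207, 0.188], [0.280, 0.236],
               [0.197, 0.193], [0.227, 0.173], [0.148, 0.196], [0.254, 0.212],
               [0.079, 0.147], [0.149, 0.128], [0.200, 0.150], [0.187, 0.191]])
g2 = np.array([[-0.012, -0.031], [0.036, 0.053], [0.038, 0.036], [-0.063, -0.074],
               [-0.054, -0.119], [0.000, -0.005], [0.005, 0.039], [0.091, 0.112],
               [-0.036, -0.072], [0.045, 0.064], [-0.026, -0.024], [0.016, 0.026]])

for j, name in enumerate(["EBITASS", "ROTC"]):
    t, p = ttest_ind(g1[:, j], g2[:, j])          # univariate two-group t-test
    print(f"{name}: mean1={g1[:, j].mean():.3f}  mean2={g2[:, j].mean():.3f}  "
          f"t={t:.3f}  p={p:.4f}")

The t values printed should agree, up to rounding, with those reported later in Table 8.4.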
Figure 8.1 Plot of data in Table 8.1 and new axis. (Scatterplot of the 24 firms with EBITASS on the horizontal axis and ROTC on the vertical axis; a dotted line divides the space into regions R1 and R2, and a new axis Z is drawn through the plot.)

Figure 8.2 Distributions of financial ratios. (Distributions of ratios X1, X2, X3, and X4 for most-admired and least-admired firms.)

8.1.2 Identifying a New Axis

In Figure 8.1, consider a new axis, Z, in the two-dimensional space which makes an angle of, say, 45° with the EBITASS axis. The projection of any point, say P, on Z will be given by:

Zp = w1 × EBITASS + w2 × ROTC
where Zp is the projection of point or firm P on the Z axis. According to Eq. 2.24 of Chapter 2, w1 = cos 45° = .707 and w2 = sin 45° = .707. Therefore,

Zp = .707 × EBITASS + .707 × ROTC.

This equation clearly represents a linear combination of the financial ratios, EBITASS and ROTC, for firm P. That is, the projection of points onto the Z axis gives a new variable Z, which is a linear combination of the original variables. Table 8.1 also gives the values of this new variable.

The total sum of squares (SSt), the between-group sum of squares (SSb), and the within-group sum of squares (SSw) for Z are, respectively, 0.513, 0.411, and 0.102. The ratio, λ, of the between-group to the within-group sum of squares is 4.029 (0.411 ÷ 0.102). Table 8.2 gives SSt, SSw, SSb, and λ for various angles between Z and EBITASS.

Table 8.2 Summary Statistics for Various Linear Combinations

               Weights                  Sums of Squares
θ (degrees)    w1       w2       SSt      SSw      SSb      λ (SSb/SSw)
 0             1.000    0.000    0.265    0.053    0.212    4.000
10             0.985    0.174    0.351    0.069    0.282    4.087
20             0.940    0.342    0.426    0.083    0.343    4.133
21             0.934    0.358    0.432    0.084    0.348    4.143
30             0.866    0.500    0.481    0.094    0.387    4.117
40             0.766    0.643    0.510    0.101    0.409    4.050
50             0.643    0.766    0.509    0.102    0.407    3.999
60             0.500    0.866    0.479    0.098    0.381    3.888
70             0.342    0.940    0.422    0.089    0.333    3.742
80             0.174    0.985    0.347    0.077    0.270    3.506
90             0.000    1.000    0.261    0.062    0.199    3.210

Figure 8.3 gives a plot of λ against θ, the angle between Z and EBITASS. From the table and the figure we see that:

1. When θ = 0° or θ = 90°, only the corresponding variable, EBITASS or ROTC, is used for forming the new variable.
2. The value of λ changes as θ changes.
3. There is one and only one angle (i.e., θ = 21°) that results in a maximum value for λ.¹
4. The following linear combination results in the maximum value for λ:

   Z = cos 21° × EBITASS + sin 21° × ROTC    (8.1)
     = 0.934 × EBITASS + 0.358 × ROTC.

The new axis, Z, is chosen such that the new variable Z gives a maximum value for λ. As shown below, the maximum value of λ implies that the new variable Z provides the maximum separation between the two groups.

¹The other angle that gives a maximum value is 201° (180° + 21°), and the resulting axis is merely a reflection of the Z axis.
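The search over angles summarized in Table 8.2 is easy to reproduce. The Python sketch below is an illustration, not part of the text's software; it re-enters the Table 8.1 data, computes SSw and SSb of the projection Z = cos θ × EBITASS + sin θ × ROTC on a grid of angles, and reports the angle at which λ is largest.

import numpy as np

# Table 8.1 data: rows 0-11 are the most-admired firms, rows 12-23 the least-admired firms
X = np.array([
    [0.158, 0.182], [0.210, 0.206], [0.207, 0.188], [0.280, 0.236],
    [0.197, 0.193], [0.227, 0.173], [0.148, 0.196], [0.254, 0.212],
    [0.079, 0.147], [0.149, 0.128], [0.200, 0.150], [0.187, 0.191],
    [-0.012, -0.031], [0.036, 0.053], [0.038, 0.036], [-0.063, -0.074],
    [-0.054, -0.119], [0.000, -0.005], [0.005, 0.039], [0.091, 0.112],
    [-0.036, -0.072], [0.045, 0.064], [-0.026, -0.024], [0.016, 0.026]])
grp = np.repeat([1, 2], 12)

def lam(theta_deg):
    """Ratio SSb/SSw for the projection Z = cos(theta)*EBITASS + sin(theta)*ROTC."""
    w = np.array([np.cos(np.radians(theta_deg)), np.sin(np.radians(theta_deg))])
    z = X @ w
    ss_t = ((z - z.mean()) ** 2).sum()
    ss_w = sum(((z[grp == g] - z[grp == g].mean()) ** 2).sum() for g in (1, 2))
    return (ss_t - ss_w) / ss_w          # SSb = SSt - SSw

for theta in range(0, 91, 10):           # the grid of Table 8.2
    print(f"theta = {theta:2d}  lambda = {lam(theta):.3f}")

best = max(range(91), key=lam)
print("lambda is largest at about", best, "degrees")   # about 21 degrees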
Figure 8.3 Plot of lambda (λ) versus theta (θ). (λ rises from 4.00 at θ = 0°, reaches its maximum at about θ = 21°, and declines to 3.21 at θ = 90°.)

For Z to provide maximum separation, the following two conditions must be satisfied:

1. The means of Z for the two groups should be as far apart as possible (which is equivalent to having a maximum value for the between-group sum of squares).
2. Values of Z for each group should be as homogeneous as possible (which is equivalent to having a minimum value for the within-group sum of squares).

Satisfying just one of these conditions will not result in a maximum separation. For example, compare the distributions of two possible new variables, Z, shown in Panels I and II of Figure 8.4. The difference in the means of the two groups for the new variable shown in Panel I is greater than that of the new variable shown in Panel II. However, the new variable in Panel II provides a better separation or discrimination than that in Panel I because the two groups of firms in Panel II, compared to Panel I, are more homogeneous with respect to the new variable. A measure of group homogeneity is provided by the within-group sum of squares, and a good measure of the difference in the means of the two groups is given by the between-group sum of squares. Therefore, it is obvious that for maximum separation or discrimination the new axis, Z, should be selected such that the ratio of SSb to SSw for the new variable is maximum. The second objective of discriminant analysis is to identify a new axis, Z, such that the new variable Z, given by the projection of observations onto this new axis, provides the maximum separation or discrimination between the two groups.

Note the similarity and the difference between discriminant analysis and principal components analysis. In both cases, a new axis is identified and a new variable is formed that is a linear combination of the original variables. That is, the new variable is given by the projection of the points onto this new axis. The difference is with respect to the criterion used to identify the new axis. In principal components analysis, a new axis is identified such that the projection of the points onto the new axis accounts for maximum variance in the data, which is equivalent to maximizing SSt, because there
is no criterion variable for dividing the sample into groups. In discriminant analysis, on the other hand, the objective is not to account for maximum variance in the data (i.e., maximize SSt), but to maximize the between-group to within-group sum of squares ratio (i.e., SSb/SSw) that results in the best discrimination between the groups.

Figure 8.4 Examples of linear combinations. (Panel I and Panel II each show the distributions of a candidate new variable Z for the least-admired and most-admired groups.)

The new axis, or the linear combination, that is identified is called the linear discriminant function, henceforth referred to as the discriminant function. The projection of a point onto the discriminant function (i.e., the value of the new variable) is called the discriminant score. For the data set given in Table 8.1 the discriminant function is given by Eq. 8.1 and the discriminant scores are given in Table 8.3.

8.1.3 Classification

The third objective of discriminant analysis is to classify future observations into one of the two groups. Actually, classification can be considered as an independent procedure unrelated to discriminant analysis. However, most textbooks and computer programs treat it as a part of the discriminant analysis procedure. We will discuss both approaches: classification as a separate procedure and as a part of discriminant analysis. It should be noted that, under certain conditions, both procedures give identical classification results. Section 8.3.3 provides further discussion of the two classification approaches.

Classification as a Part of Discriminant Analysis

Classification of future observations is done by using the discriminant scores. Figure 8.5 gives a one-dimensional plot of the discriminant scores, commonly referred to as a plot of the observations in the discriminant space.
Table 8.3 Discriminant Scores and Classification for Most-Admired and Least-Admired Firms (w1 = .934 and w2 = .358)

Group 1                                             Group 2
Firm       Discriminant                             Firm       Discriminant
Number     Score          Classification            Number     Score          Classification
 1          0.213         1                          13         -0.022         2
 2          0.270         1                          14          0.053         2
 3          0.261         1                          15          0.048         2
 4          0.346         1                          16         -0.085         2
 5          0.253         1                          17         -0.093         2
 6          0.274         1                          18         -0.002         2
 7          0.208         1                          19          0.019         2
 8          0.313         1                          20          0.129         1
 9          0.126         1                          21         -0.059         2
10          0.185         1                          22          0.065         2
11          0.241         1                          23         -0.033         2
12          0.243         1                          24          0.024         2
Average     0.244                                    Average     0.00367
Figure 8.5 Plot of discriminant scores. (One-dimensional plot of the 24 discriminant scores, roughly from -0.1 to 0.4, with the cutoff value separating region R2 from region R1.)

Classification of observations is done as follows. First, the discriminant space is divided into two mutually exclusive and collectively exhaustive regions, R1 and R2. Now since there is only one discriminant score, the plot shown in Figure 8.5 is a one-dimensional plot. Consequently, a point will divide the space into two regions. The value of the discriminant score that divides the one-dimensional space into the two regions is called the cutoff value. Next, the discriminant score of a given firm is plotted in the discriminant space, and the firm is classified as most-admired if the computed discriminant score falls in region R1 and least-admired if it falls in region R2. In other words, a given firm is classified as least-admired if the discriminant score of the observation is less than the cutoff value, and most-admired if the discriminant score is greater than the cutoff value. Once again, the estimated cutoff value is one that minimizes a given criterion (e.g., misclassification errors or misclassification costs).

Classification as an Independent Procedure

Classification essentially reduces to first partitioning a given p-dimensional variable space into two mutually exclusive and collectively exhaustive regions. Next, any given observation is plotted or located in the p-dimensional space and the observation is assigned to the group in whose region it falls. For example, in Figure 8.1 the dotted line divides the two-dimensional space into regions R1 and R2. Observations falling in region R1 are classified as most-admired firms, and those firms falling in region R2 are classified as least-admired firms. It is clear that the classification problem reduces to developing an algorithm or a rule for dividing the given variable space into mutually exclusive and collectively exhaustive regions. Obviously, one would like to divide the space such that a certain criterion, say the number of incorrect classifications or misclassification costs, is minimized. The various algorithms or classification rules differ mainly with respect to the minimization criterion used to divide the p-dimensional space into the appropriate number of regions. Section A8.2 of the Appendix discusses in detail some of the commonly used criteria for dividing the variable space into classification regions and the ensuing classification rules.
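To make the cutoff idea concrete, the short Python sketch below is illustrative only: the weights come from Eq. 8.1, and the cutoff value of 0.124 is derived later, in Section 8.3.3, as the midpoint of the two group mean scores of Table 8.3. It scores a firm and assigns it to the region in which its score falls.

import numpy as np

w = np.array([0.934, 0.358])       # discriminant weights from Eq. 8.1
cutoff = 0.124                     # midpoint of the two group mean scores (Section 8.3.3)

def classify(ebitass, rotc):
    z = w @ np.array([ebitass, rotc])           # discriminant score
    group = "most-admired (group 1)" if z > cutoff else "least-admired (group 2)"
    return z, group

# Firm 1 and firm 13 of Table 8.1
for ebitass, rotc in [(0.158, 0.182), (-0.012, -0.031)]:
    z, group = classify(ebitass, rotc)
    print(f"EBITASS={ebitass:6.3f}  ROTC={rotc:6.3f}  Z={z:6.3f}  -> {group}")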
8.2 ANALYTICAL APPROACH TO DISCRIMINANT ANALYSIS

The data given in Table 8.1 are used for discussing the analytical approach to discriminant analysis.

8.2.1 Selecting the Discriminator Variables

Table 8.4 gives means and standard deviations for the two groups.

Table 8.4 Means, Standard Deviations, and t-values for Most- and Least-Admired Firms

              Group 1                   Group 2
Variable      Mean      Std. Dev.       Mean      Std. Dev.      t-value
EBITASS       .191      .053            .003      .045           9.367
ROTC          .184      .030            .001      .069           8.337

The differences in the means of the two groups can be assessed by using an independent sample t-test. The t-values for testing equality of the means of the two groups are 9.367 for EBITASS and 8.337 for ROTC. The t-test suggests that the two groups are significantly different with respect to both of the financial ratios at a significance level of .05. That is, both financial ratios do discriminate between the two groups and consequently will be used to form the discriminant function. This conclusion is based on a univariate approach. That is, a separate independent t-test is done for each financial ratio. However, a preferred approach is to perform a multivariate test in which both financial ratios are tested simultaneously or jointly. A discussion of the multivariate test is provided in Section 8.3.2.

8.2.2 Discriminant Function and Classification

Let the linear combination or the discriminant function that forms the new variable (or the discriminant score) be

Z = w1 × EBITASS + w2 × ROTC    (8.2)

where Z is the discriminant function.² Analytically, the objective of discriminant analysis is to identify the weights, w1 and w2, of the above discriminant function such that

λ = between-group sum of squares / within-group sum of squares    (8.3)

is maximized. The discriminant function, given by Eq. 8.2, is obtained by maximizing Eq. 8.3 and is referred to as Fisher's linear discriminant function. This is clearly an optimization problem, and its technical details are provided in the Appendix.

Normally the cutoff value selected for classification purposes is the one that minimizes the number of incorrect classifications or misclassification costs. Details pertaining to the formulae used for obtaining the cutoff value are given in the Appendix.

²Note that Eq. 8.2 can be viewed as a general linear model of the form Z = βX, where Z and X are vectors of dependent and independent variables, respectively, and β is a vector of coefficients. In the present case, Z is the dependent variable, EBITASS and ROTC are the independent variables, and w1 and w2 are the coefficients.

8.3 DISCRIMINANT ANALYSIS USING SPSS

The data given in Table 8.1 are used to discuss the output generated by the discriminant analysis procedure in SPSS. Table 8.5 gives the SPSS commands.

Table 8.5 SPSS Commands for Discriminant Analysis of Data in Table 8.1

DATA LIST FREE
  /EBITASS ROTC EXCELL
BEGIN DATA
insert data here
END DATA
DISCRIMINANT GROUPS=EXCELL(1,2)
  /VARIABLES=EBITASS ROTC
  /ANALYSIS=EBITASS ROTC
  /METHOD=DIRECT
  /STATISTICS=ALL
  /PLOT=ALL

Following is a brief description of the procedure commands for discriminant analysis. The GROUPS subcommand specifies the variable defining group membership. The ANALYSIS
subcommand gives a potential list of variables to be used for forming the discriminant function. The METHOD subcommand specifies the method to be used for selecting variables to form the discriminant function. The DIRECT method is used when all the variables specified in the ANALYSIS subcommand are used to formulate the discriminant function. However, many times the researcher is not sure which of the potential discriminating variables should be used to form the discriminant function. In such cases, a list of potential variables is provided in the ANALYSIS subcommand and the program selects the "best" set of variables using a given statistical criterion. The selection of the best set of variables by the program is referred to as stepwise discriminant analysis.³ A number of statistical criteria are available for conducting a stepwise discriminant analysis. These criteria will be discussed in Section 8.6. The STATISTICS and the PLOT subcommands are used for obtaining the relevant statistics and plots for interpretation purposes. The ALL option indicates that the program should compute and print all the possible statistics and plots.

Exhibit 8.1 gives the partial output and is labeled for discussion purposes. The following discussion is keyed to the bracketed numbers in the output. For presentation clarity, most values reported in the text have been rounded to three significant digits; any differences between computations reported in the text and the output are due to rounding errors.

8.3.1 Evaluating the Significance of Discriminating Variables

The first step is to assess the significance of the discriminating variables. Do the selected discriminating variables significantly differentiate between the two groups? It appears that the means of each variable are different for the two groups [1]. A discussion of the formal statistical test for testing the difference between means of the two groups follows. The null and the alternative hypotheses for each discriminating variable are:

H0: μ1 = μ2
Ha: μ1 ≠ μ2

where μ1 and μ2 are the population means, respectively, for groups 1 and 2. In Section 8.2.1, the above hypotheses were tested using an independent sample t-test. Alternatively, one can use the Wilks' Λ test statistic. Wilks' Λ is computed using the following

³The concept is similar to that used in stepwise multiple regression.
Exhibit 8.1 Discriminant analysis for most-admired and least-admired firms

[1] Group means

EXCELL       EBITASS      ROTC
1            .19133       .18350
2            .00333       .00125
Total        .09733       .09238

[2] Pooled within-groups covariance matrix with 22 degrees of freedom

             EBITASS            ROTC
EBITASS      2.4261515E-03
ROTC         2.0336815E-03      2.8341477E-03

[3] Pooled within-groups correlation matrix

             EBITASS      ROTC
EBITASS      1.00000
ROTC          .77969      1.00000

[4] Wilks' Lambda (U-statistic) and univariate F-ratio with 1 and 22 degrees of freedom

Variable     Wilks' Lambda      F            Significance
EBITASS      .20108             87.4076      .0000
ROTC         .23638             71.0699      .0000

[5] Total covariance matrix with 23 degrees of freedom

             EBITASS      ROTC
EBITASS      .0115
ROTC         .0109        .0113

[6] Minimum tolerance level..................    .00100

Canonical Discriminant Functions

Maximum number of functions..................    1
Minimum cumulative percent of variance.......    100.00
Maximum significance of Wilks' Lambda........    1.0000

Prior probability for each group is .50000

[7] Classification function coefficients
(Fisher's linear discriminant functions)

EXCELL =           1                  2
EBITASS            61.2374430         2.5511703
ROTC               21.0268971        -1.4044441
(Constant)         -8.4807470         -.6965214

(continued)
Exhibit 8.1 (continued)

[8] Canonical Discriminant Functions

                 Pct of     Cum       Canonical   :   After   Wilks'
Fcn  Eigenvalue  Variance   Pct       Corr        :   Fcn     Lambda     Chi-square   df   Sig
                                                  :   0       .195162    34.312        2   .0000
1*   4.1239      100.00     100.00    .8971       :

* Marks the 1 canonical discriminant functions remaining in the analysis.

[9] Standardized canonical discriminant function coefficients

             Func 1
EBITASS      .74337
ROTC         .30547

[10] Structure matrix: Pooled within-groups correlations between discriminating variables and canonical discriminant functions
(Variables ordered by size of correlation within function)

             Func 1
EBITASS      .98154
ROTC         .88506

[11] Unstandardized canonical discriminant function coefficients

             Func 1
EBITASS      15.0919163
ROTC          5.7685027
(Constant)   -2.0018120

[12] Canonical discriminant functions evaluated at group means (group centroids)

Group        Func 1
1             1.94429
2            -1.94429

[13] Test of Equality of Group Covariance Matrices Using Box's M

The ranks and natural logarithms of determinants printed are those of the group covariance matrices.

Group Label                                 Rank    Log Determinant
1                                           2       -13.516047
2                                           2       -12.834397
Pooled within-groups covariance matrix      2       -13.107651

Box's M        Approximate F        Degrees of freedom        Significance
21.5...        ...365               3, 87120.0                .0002

[14] Case      Actual     Highest Probability               2nd Highest
     Number    Group      Group    P(D/G)    P(G/D)         Group    P(G/D)      Discrim Scores
     1         1          1        .6088     .9962          2        .0038        1.4326
     2         1          1        .6807     .9999          2        .0001        2.3558

(continued)
Exhibit 8.1 (continued)

     3         1          1        .7930     .9998          2        .0002        2.2067
     4         1          1        .1008     1.0000         2        .0000        3.5853
     ...
     20        2   **     1        .0616     .5727          2        .4273         .0753
     21        2          2        .3096     1.0000         1        .0000       -2.9605
     22        2          2        .3218     .9761          1        .0239        -.9535
     23        2          2        .5563     .9999          1        .0001       -2.5326
     24        2          2        .7384     .9981          1        .0019       -1.6104

[15] Symbols used in plots

Symbol     Group     Label
1          1
2          2

All-groups Stacked Histogram
Canonical Discriminant Function 1

(Frequency histogram of the 24 discriminant scores: the group 2 cases, plotted with the symbol 2, fall between about -4.0 and 0, and the group 1 cases, plotted with the symbol 1, fall between about 0 and 4.0; the class boundary is at 0 and the group centroids are at -1.94 and 1.94.)

[16] Classification results -

                        No. of       Predicted Group Membership
Actual Group            Cases        1              2
Group 1                 12           12             0
                                     100.0%         .0%
Group 2                 12           1              11
                                     8.3%           91.7%

Percent of "grouped" cases correctly classified: 95.83%
formula:

Λ = SSw / SSt    (8.4)

SSw is obtained from the SSCPw matrix, which in turn can be computed by multiplying the Sw [2] matrix by its pooled degrees of freedom.⁴

SSCPw = | 0.0534   0.0447 |
        | 0.0447   0.0617 |

SSt is obtained from the SSCPt matrix, which in turn is computed by multiplying St [5] by the total degrees of freedom.⁵ Therefore, SSCPt is equal to

SSCPt = | 0.265   0.250 |
        | 0.250   0.261 |

Using Eq. 8.4, Wilks' Λ's for EBITASS and ROTC are, respectively, equal to .202 (i.e., .0534 ÷ .265) and .236 (i.e., 0.0617 ÷ 0.261), which, within rounding error, are the same as reported in the output [4]. Note that the smaller the value for Λ the greater the probability that the null hypothesis will be rejected, and vice versa. To assess the statistical significance of the Wilks' Λ, it can be converted into an F-ratio using the following transformation:

F = [(1 - Λ)/Λ] × [(n1 + n2 - p - 1)/p]    (8.5)

where p (which is 1 in this case) is the number of variable(s) for which the statistic is computed. Given that the null hypothesis is true, the F-ratio follows an F-distribution with p and n1 + n2 - p - 1 degrees of freedom. The corresponding F-ratios, using Eq. 8.5, are 86.911 and 71.220, respectively, which again, within rounding errors, are the same as reported in the output [4]. Based on the critical F-value, the null hypotheses for both variables can be rejected at a significance level of .05. That is, the two groups are different with respect to EBITASS and ROTC. Once again, because the two groups were compared separately for each variable, the statistical significance tests are referred to as univariate tests.

The univariate Wilks' Λ test for the difference in means of two groups is identical to the t-test discussed in Section 8.2.1. In fact, for two groups F = t². Note that the t² values obtained from Table 8.4, within rounding errors, are the same as the F-values computed above (i.e., 9.367² = 87.741 and 8.337² = 69.506). The reason for employing the Wilks' Λ test statistic instead of the t-test will become apparent in the next section.

⁴Recall that the pooled degrees of freedom is equal to n1 + n2 - 2, where n1 and n2 are, respectively, the numbers of observations for groups 1 and 2.

⁵Recall that the total degrees of freedom is equal to n1 + n2 - 1.
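These quantities can be computed directly from the SSCP matrices above. The Python sketch below is illustrative only; it uses the rounded SSCPw and SSCPt values just given, reproduces the univariate Λ and F statistics, and, anticipating the multivariate test of Section 8.3.2, also forms the determinant ratio and its chi-square approximation.

import numpy as np

sscp_w = np.array([[0.0534, 0.0447],
                   [0.0447, 0.0617]])     # pooled within-groups SSCP
sscp_t = np.array([[0.265, 0.250],
                   [0.250, 0.261]])       # total SSCP
n1 = n2 = 12

# Univariate Wilks' lambda and F (Eqs. 8.4 and 8.5) for each variable
for j, name in enumerate(["EBITASS", "ROTC"]):
    lam = sscp_w[j, j] / sscp_t[j, j]
    F = (1 - lam) / lam * (n1 + n2 - 1 - 1) / 1
    print(f"{name}: lambda = {lam:.3f}   F = {F:.1f}")

# Multivariate Wilks' lambda as a ratio of determinants, and its chi-square (Eq. 8.7)
lam_mv = np.linalg.det(sscp_w) / np.linalg.det(sscp_t)
chi2 = -(24 - 1 - (2 + 2) / 2) * np.log(lam_mv)
print(f"multivariate lambda = {lam_mv:.3f}   chi-square = {chi2:.2f}")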
8.3.2 The Discriminant Function

Options for Computing the Discriminant Function

The program prints the various parameters or options that have been selected for computing the discriminant function and for classifying observations in the sample [6]. A discussion of these options follows.

Since discriminant analysis involves the inversion of within-group matrices (see the Appendix), the accuracy of the computations is severely affected if the matrices are near singular (i.e., some of the discriminator variables are highly correlated or are linear combinations of other variables). The tolerance level provides a control for the desired amount of computational accuracy or the degree of multicollinearity that one is willing to tolerate. The tolerance of any variable is equal to 1 - R², where R² is the squared multiple correlation between this variable and the other variables in the discriminant function. The higher the multiple correlation between a given variable and the variables in the discriminant function, the lower the tolerance, and vice versa. That is, tolerance is a measure of the amount of multicollinearity among the discriminator variables. If the tolerance of a given variable is less than the specified value, then the variable is not included in the discriminant function. Tolerance, therefore, is used to specify the degree to which the variables can be correlated and still be used to form the discriminant function. A default value of .001 is used by the SPSS program; however, one can specify any desired level for the tolerance (see the SPSS manual for the necessary commands).

The maximum number of discriminant functions that can be computed is the minimum of G - 1 or p, where G is the number of groups and p is the number of variables. Since the number of groups is equal to 2, only one discriminant function is possible. The minimum cumulative percent of variance is discussed in the next chapter, because the amount of variance extracted is only meaningful when there are more than two groups. The maximum level of significance for Wilks' Λ is only meaningful for stepwise discriminant analysis, and is discussed in Section 8.6.

The prior probability of each group is the probability of any random observation belonging to that group (i.e., Group 1 or Group 2). That is, it is the probability that a given firm is a most- or least-admired firm if no other information about the firm is available. In the present case, it is assumed that the priors are equal; that is, the prior probabilities of a given firm being most- or least-admired are each equal to 0.50.

Estimate of the Discriminant Function

The unstandardized estimate of the discriminant function is [11]:

Z = -2.00181 + 15.0919 × EBITASS + 5.769 × ROTC    (8.6)

This is referred to as the unstandardized discriminant function because unstandardized (i.e., raw) data are used for computing the discriminant function. The discriminant function is also referred to as the canonical discriminant function, because discriminant analysis is a special case of canonical correlation analysis.

The estimated weights of the discriminant function given by Eq. 8.6 appear to be different from the weights of the discriminant function in Eq. 8.1. However, as shown below, the weights of the two equations, in a relative sense, are the same. The coefficients of the discriminant function are not unique; they are unique only in a relative sense. That is, only the ratio of the coefficients is unique. For example, the ratio of the weights given in Eq. 8.6 is 2.616 (15.0919 ÷ 5.769), which, within rounding error, is the ratio of the coefficients given in Eq. 8.1 (0.934 ÷ 0.358 = 2.609). The coefficients given in Eq. 8.6 can be normalized so that their squares sum to one by dividing each coefficient by √(w1² + w2²).⁶ Normalized discriminant function coefficients are

w1 = 15.0919 / √(15.0919² + 5.769²) = .934

⁶Since normalizing is done by dividing each coefficient by a constant,
it does not change the relative value of the discriminant score.
and

w2 = 5.769 / √(15.0919² + 5.769²) = .357.

As can be seen, the normalized weights, within rounding errors, are the same as the ones used in Eq. 8.1. A constant is added to the unstandardized discriminant function so that the average of the discriminant scores is zero; the constant simply adjusts the scale of the discriminant scores.

Statistical Significance of the Discriminant Function

Differences in the means of the two groups for each discriminator variable were tested using the univariate Wilks' Λ test statistic in Section 8.3.1. However, in the case of more than one discriminator variable it is desirable to test the differences between the two groups for all the variables jointly or simultaneously. This multivariate test of significance has the following null and alternative hypotheses:

H0: (μ1,EBITASS, μ1,ROTC) = (μ2,EBITASS, μ2,ROTC)
Ha: (μ1,EBITASS, μ1,ROTC) ≠ (μ2,EBITASS, μ2,ROTC)

The test statistic for testing these multivariate hypotheses is a direct generalization of the univariate Wilks' Λ statistic and is given by

Λ = |SSCPw| / |SSCPt|

where | | represents the determinant of a given matrix.⁷ Wilks' Λ can be approximated as a chi-square statistic using the following transformation:

χ² = -[n - 1 - (p + G)/2] ln Λ    (8.7)

The statistic is distributed as a chi-square with p(G - 1) degrees of freedom. The Wilks' Λ for the discriminant function is .195 [8], and its equivalent χ² value is

χ² = -[24 - 1 - (2 + 2)/2] ln(.195) = 34.330

which, within rounding error, is the same as given in the output [8]. The null hypothesis can be rejected at an alpha level of 0.05, implying that the two groups are significantly different with respect to EBITASS and ROTC taken jointly. Since the discriminant function is a linear combination of the discriminator variables, it can also be concluded that the discriminant function is statistically significant. That is, the means of the discriminant scores for the two groups are significantly different.

⁷Various test statistics such as the t-statistic, the F-statistic, and Hotelling's T² are special cases of the Wilks' Λ test statistic.

Statistical significance of the discriminant function can also be assessed by transforming
Wilks' Λ into an exact F-ratio using Eq. 8.5. That is,

F = [(1 - .195)/.195] × [(24 - 2 - 1)/2] = 43.346

which is statistically significant at an alpha level of .05.

Practical Significance of the Discriminant Function

It is quite possible that the difference between the two groups is statistically significant even though, for all practical purposes, the differences between the groups may not be large. This can occur for large sample sizes. Practical significance relates to assessing how large or how meaningful the differences between the two groups are. The output reports the canonical correlation, which is equal to 0.897 [8]. As discussed below, the square of the canonical correlation can be used as a measure of the practical significance of the discriminant function. It can be shown that the squared canonical correlation (CR²) is equal to

CR² = SSb / SSt    (8.8)

or

CR = √(SSb / SSt)    (8.9)

From Eq. 8.8 it is obvious that CR² gives the proportion of the total sum of squares for the discriminant score that is due to the differences between the groups. Furthermore, as will be shown in Section 8.4, two-group discriminant analysis can also be formulated as a multiple regression problem. The corresponding multiple R that would be obtained is the same as the canonical correlation. Recall that in regression analysis R² is a measure of the amount of variance in the dependent variable that is accounted for by the independent variables, and therefore it is a measure of the strength of the relationship between the dependent and the independent variables. Since the discriminant score is a linear function of the discriminating variables, CR² gives the amount of variation between the groups that is explained by the discriminating variables. Hence, CR² is a measure of the strength of the discriminant function.

From the output, CR is equal to 0.897 [8] and therefore CR² is equal to .804. That is, about 80% of the variation between the two groups is accounted for by the discriminating variables, which appears to be quite high. Although CR² ranges between zero and one, there are no guidelines to suggest how high is "high." The researcher therefore should compare this CR² to those obtained in other similar applications and determine if the strength of the relationship is relatively strong, moderate, or weak.

Assessing the Importance of Discriminant Variables and the Meaning of the Discriminant Function

If discriminant analysis is done on standardized data, then the resulting discriminant function is referred to as the standardized canonical discriminant function. However, a separate analysis is not needed, as standardized coefficients can be computed from the unstandardized coefficients by using the following transformation:

b*j = bj × sj

where b*j, bj, and sj are, respectively, the standardized coefficient, the unstandardized coefficient, and the pooled standard deviation of variable j. The standardized coefficients for
EBITASS and ROTC, respectively, are equal to .743 (i.e., 15.0919 × √0.0024261) and .305 (i.e., 5.769 × √0.0028341), which are the same as given in the output [9]. Note that 0.0024261 and 0.0028341 are the pooled variances for variables EBITASS and ROTC, respectively, and are obtained from Sw [2].

Standardized coefficients are normally used for assessing the relative importance of the discriminator variables forming the discriminant function. The greater the standardized coefficient, the greater the relative importance of a given variable, and vice versa. Therefore, it appears that ROTC is relatively less important than EBITASS in forming the discriminant function. However, caution is advised in such an interpretation when the variables are correlated among themselves. Depending on the severity of multicollinearity present in the sample data, the relative importance of the variables could change from sample to sample. Consequently, in the presence of multicollinearity in the data, it is recommended that inferences regarding the importance of the discriminator variables be avoided. The problem is similar to that posed by multicollinearity in multiple regression analysis.

Since the discriminant score is a composite index or a linear combination of the original variables, it might be interesting to know what exactly the discriminant score represents. In other words, just as in principal components and factor analysis, a label can be assigned to the discriminant function. The loadings or the structure coefficients are helpful for assigning the label and also for interpreting the contribution of each variable to the formation of the discriminant function. The loading of a given discriminator variable is simply the correlation coefficient between the discriminant score and the discriminator variable, and the value of the loading will lie between +1 and -1. The closer the absolute value of the loading of a variable is to 1, the more communality there is between the discriminating variable and the discriminant function, and vice versa. Loadings are given in the structure matrix [10]. Alternatively, they can be computed using the following formula:

li = Σ (j = 1 to p) rij × b*j    (8.10)

where li is the loading of variable i, rij is the pooled correlation between variable i and variable j, and b*j is the standardized coefficient of variable j. For example, the loading of EBITASS (i = 1) is given by

l1 = 1.000 × .743 + .780 × .305

and is equal to .981. Note that 0.780 is the pooled correlation between EBITASS and ROTC [3]. Since the loadings of both of the discriminator variables are high, the discriminant score can be interpreted as a measure of the financial health of a given firm. Once again, how "high" is high is a judgmental question; many researchers have used a value of 0.50 as the cutoff value. Also, the contribution of both variables toward the formation of the discriminant function is high because both variables have high loadings.

8.3.3 Classification Methods

A number of methods are available for classifying sample and future observations. Some of the commonly used methods are:

1. Cutoff-value method.
2. Statistical decision theory method.
3. Classification function method.
4. Mahalanobis distance method.

Cutoff-Value Method

As discussed earlier, classification of observations essentially reduces to dividing the discriminant space into two regions. The value of the discriminant score that divides the space into the two regions is called the cutoff value. Following is a discussion of how the cutoff value is computed.

Table 8.3 gives the discriminant score for each observation that was formed by using Eq. 8.1. As can be seen from Eq. 8.1, the greater the values for EBITASS and ROTC, the greater the value for the discriminant score, and vice versa. Since financially healthy firms will have higher values for the two financial ratios, most-admired firms will have a greater discriminant score than least-admired firms. Therefore, any given firm will be classified as a most-admired firm if its discriminant score is greater than the cutoff value, and as a least-admired firm if its discriminant score is less than the cutoff value.

Normally the cutoff value selected is the one that minimizes the number of incorrect classifications or misclassification errors. A commonly used cutoff value that minimizes the number of incorrect classifications for the sample data is

cutoff value = (Z̄1 + Z̄2) / 2    (8.11)

where Z̄j is the average discriminant score for group j. This formula assumes equal sample sizes for the two groups. For unequal sample sizes the cutoff value is given by

cutoff value = (n1 Z̄1 + n2 Z̄2) / (n1 + n2)    (8.12)

where ng is the number of observations in group g. From Table 8.3, the averages of the discriminant scores for groups 1 and 2 are, respectively, 0.244 and 0.00367, and the cutoff value will be

cutoff value = (0.244 + 0.00367) / 2 = 0.124.

Table 8.3 also gives the classification of the firms based on the computed discriminant score and the cutoff value of 0.124. Note that only the 20th observation is misclassified, giving a correct classification rate of 95.83% (i.e., 23 ÷ 24). Identical results are obtained if classification is done using the discriminant scores computed from the unstandardized discriminant function given by Eq. 8.6 (i.e., the one reported in the output [11]). Discriminant scores resulting from Eq. 8.6 are also given in the output [14]. The averages of the discriminant scores for groups 1 and 2, respectively, are 1.944 and -1.944 [12], giving a cutoff value of zero. Once again the 20th observation is misclassified.

A summary of the classification results is provided in a matrix known as the classification matrix or the confusion matrix. The classification matrix is given at the end of the output [16]. All but one of the observations have been correctly classified.

There are other rules for computing cutoff values and for classifying future observations. Equation 8.11 assumes equal misclassification costs and equal priors. Equal
misclassification costs implies that the penalty or the cost of misclassifying observations in groups 1 or 2 is the same. That is, the cost of misclassifying a most-admired firm is the same as misclassifying a least-admired firm. Equal priors imply that the prior probabilities are equal. That is, any given firm selected at random will have an equal chance of being either a most- or a least-admired firm. Alternative classification procedures or rules that relax these assumptions are discussed below.

Statistical Decision Theory

SPSS uses the statistical decision theory method for classifying sample observations into various groups. This method minimizes misclassification errors, taking into account prior probabilities and misclassification costs. For example, the data set given in Table 8.1 consists of an equal number of most- and least-admired firms. This does not imply that the population also has an equal number of most- and least-admired firms. It is quite possible that, say, 70% of the firms in the population are most-admired firms and only 30% of the firms are least-admired. That is, the probability of any given firm being most-admired is .7. This probability is known as the prior probability. Misclassification costs also may not be equal. For example, in the case of a jury verdict the "social" cost of finding an innocent person guilty might be much more than that of finding a guilty person innocent. Or, in the case of studies dealing with bankruptcy prediction, it might be more costly to classify a healthy firm as a potential candidate for bankruptcy than a potentially bankrupt firm as healthy.

Classification rules that incorporate prior probabilities and misclassification costs are based on Bayesian theory. Bayesian theory essentially revises prior probabilities based on additional available information. That is, if nothing is known about a given firm, then the probability that it belongs to the most-admired group is p1, where p1 is the prior probability. Based on additional information about the firm (i.e., its values for EBITASS and ROTC) the prior probability can be revised to q1. Revising the prior probability p1 to a posterior or revised probability q1, based on additional information, is the whole essence of Bayesian theory. A classification rule incorporating prior probabilities is given by: Assign the observation to group 1 if

Z ≥ (Z̄1 + Z̄2)/2 + ln(p2/p1)    (8.13)

and assign to group 2 if

Z < (Z̄1 + Z̄2)/2 + ln(p2/p1)    (8.14)

where Z is the discriminant score for a given observation, Z̄j is the average discriminant score for group j, and pj is the prior probability of group j.

Misclassification costs also can be incorporated into the above classification rule. For example, consider the 2 × 2 misclassification cost table given in Table 8.6.

Table 8.6 Misclassification Costs

                             Actual Membership
Predicted Membership         Group 1          Group 2
Group 1                      Zero cost        C(1/2)
Group 2                      C(2/1)           Zero cost

In the table, C(i/j) is the cost of misclassifying into group i an observation that belongs to group j. The rule for classifying observations which incorporates prior probabilities and misclassification costs is given by: Assign the observation to group 1 if

Z ≥ (Z̄1 + Z̄2)/2 + ln[p2 C(1/2) / (p1 C(2/1))]    (8.15)
and assign to group 2 if

Z < (Z̄1 + Z̄2)/2 + ln[p2 C(1/2) / (p1 C(2/1))]    (8.16)

Equations 8.15 and 8.16 give the general classification rule based on statistical decision theory, and this rule minimizes misclassification errors. The classification rule is derived assuming that the discriminator variables have a multivariate normal distribution (see the Appendix). From Eqs. 8.15 and 8.16, it is obvious that the cutoff value is shifted toward the group that has a lower prior or a lower cost of misclassification. Or, geometrically, the classification region or space increases for groups that have a higher prior or a higher misclassification cost.

Note that for equal misclassification costs and equal priors, Eqs. 8.15 and 8.16 reduce to: Assign the observation to group 1 if

Z ≥ (Z̄1 + Z̄2)/2

and assign the observation to group 2 if

Z < (Z̄1 + Z̄2)/2.

The right-hand side of the above equations is the cutoff value used in the cutoff-value classification method. Therefore, the cutoff-value classification method is the same as the statistical decision theory method with equal priors, equal misclassification costs, and the assumption that the data come from a multivariate normal distribution.

In many instances the researcher is interested not in the classification of the observations but in their posterior probabilities. SPSS computes the posterior probabilities under the assumption that the data come from a multivariate normal distribution and that the covariance matrices of the two groups are equal. The interested reader is referred to the Appendix for further details and for the formula used for computing posterior probabilities. The posterior probabilities (given by P(G/D), where G represents the group and D is the discriminant score) are given in the output [14]. The posteriors can be used for classifying observations. An observation is assigned to the group with the highest posterior probability. For example, the posterior probabilities of Observation 20 for groups 1 and 2 are, respectively, 0.573 and 0.427. Therefore, once again, the observation is misclassified into group 1.

Classification Functions

Classification can also be done by using the classification functions computed for each group. Classifications based on classification functions are identical to those given by
Eqs. 8.15 and 8.16. SPSS computes the classification functions. The classification functions reported in the output are [7]:

C1 = -8.481 + 61.237 × EBITASS + 21.0269 × ROTC

for group 1, and

C2 = -0.697 + 2.551 × EBITASS - 1.404 × ROTC

for group 2. Observations are assigned to the group with the largest classification score. The coefficients of the classification functions are not interpreted. These functions are used solely for classification purposes. Furthermore, as will be seen later, prior probabilities and misclassification costs only affect the constant of the preceding equations; the coefficients of the classification functions are not affected.

Mahalanobis Distance Method

Observations can also be classified using the Mahalanobis or statistical distance (computed by employing the original variables) of each observation from the centroid of each group. The observation is assigned to the group to which it is the closest as measured by the Mahalanobis distance. For example, using Eq. 3.9 of Chapter 3, the Mahalanobis distance of Observation 1 from the centroid of group 1 (i.e., most-admired firms) is equal to

MD² = [1/(1 - .780²)] × [(.158 - .191)²/.00243 + (.182 - .184)²/.0028
      - 2 × .780 × (.158 - .191)(.182 - .184)/(√.00243 × √.0028)]
    = 1.047.

Table 8.7 gives the Mahalanobis distance for each observation and its classification into the respective group (see the Appendix for further details). Again, only the 20th observation is misclassified. Classifications employing the Mahalanobis distance method assume equal priors, equal misclassification costs, and multivariate normality for the discriminating variables.

How Good Is the Classification Rate?

The correct classification rate is 95.83% [16]. How good is this classification rate? Huberty (1984) has proposed approximate test statistics that can be used to evaluate the statistical and the practical significance of the overall classification rate and the classification rate for each group.

STATISTICAL TESTS. The test statistics for assessing the statistical significance of the classification rate for any group and for the overall classification rate are given by

Zg = (og - eg)√ng / √[eg(ng - eg)]    (8.17)

Z = (o - e)√n / √[e(n - e)]    (8.18)
Table 8.7 Classification Based on Mahalanobis Distance

                 Group 1                                                  Group 2
         Mahalanobis Distance from                               Mahalanobis Distance from
Firm     Group 1    Group 2    Classification           Firm     Group 1    Group 2    Classification
 1        1.047     12.240     1                         13      18.808      0.440     2
 2        0.182     18.494     1                         14       9.888      0.982     2
 3        0.186     17.310     1                         15      10.994      0.524     2
 4        3.722     31.509     1                         16      28.424      2.165     2
 5        0.029     16.234     1                         17      33.437      6.113     2
 6        2.077     20.807     1                         18      15.784      0.015     2
 7        2.862     13.556     1                         19      14.342      1.207     2
 8        2.192     25.858     1                         20       4.546      5.207     1
 9        8.102      8.548     1                         21      25.171      2.119     2
10        1.122      8.753     1                         22       8.777      1.423     2
11        1.607     16.148     1                         23      20.010      0.355     2
12        0.104     15.062     1                         24      12.723      0.248     2
eg = ng² / n    (8.19)

e = Σ (g = 1 to G) ng² / n    (8.20)

where og is the number of correct classifications for group g; eg is the expected number of correct classifications due to chance for group g; ng is the number of observations in group g; o is the total number of correct classifications; e is the expected number of correct classifications due to chance for the total sample; and n is the total number of observations. The test statistics, Zg and Z, follow an approximately normal probability distribution. From Eqs. 8.19 and 8.20, e1 = e2 = 6 and e = 12, and from Eqs. 8.17 and 8.18:

Z1 = (12 - 6)√12 / √[6(12 - 6)] = 3.464
Z2 = (11 - 6)√12 / √[6(12 - 6)] = 2.887
Z  = (23 - 12)√24 / √[12(24 - 12)] = 4.491.

The statistics Z1, Z2, and Z are significant at an alpha level of .05, suggesting that the number of correct classifications is significantly greater than that due to chance.

An alternative definition of total classifications due to chance is based on the use of a naive prediction rule. In the naive prediction rule, all observations are classified into the largest group. If the sizes of the groups are equal, then the naive prediction rule would give n/G correct classifications due to chance, where n is the number of observations and G is the number of groups. In the present case, use of the naive classification rule would result in 12 correct classifications due to chance and Z = 4.491, which is significant at p < .05.

PRACTICAL SIGNIFICANCE. The practical significance of classification is the extent to which the classification rate obtained via the various classification techniques is better than the classification rate obtained due to chance alone. That is, what is the extent of the improvement over chance classification if one uses the various classification techniques? The index to measure the improvement is given by (Huberty 1984)

I = [(o/n - e/n) / (1 - e/n)] × 100    (8.21)

The I in the above equation gives the percent reduction in error over chance classification that would result if a given classification method is used. Using Eq. 8.21 [16],

I = [(23/24 - 12/24) / (1 - 12/24)] × 100 = 91.667.

That is, by using the classification method in the discriminant analysis procedure a 91.667% reduction in error over chance is obtained. In other words, 11 (i.e., 0.91667 × 12) observations over and above chance classifications are obtained by using the classification method in discriminant analysis.
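A compact way to check these chance-corrected figures is sketched below in Python (illustrative only; the counts are those of the classification matrix in Exhibit 8.1 [16]).

import math

n1 = n2 = 12
n = n1 + n2
o1, o2 = 12, 11                 # correct classifications per group (Exhibit 8.1 [16])
o = o1 + o2
e1, e2 = n1**2 / n, n2**2 / n   # expected correct classifications by chance, Eq. 8.19
e = e1 + e2                     # Eq. 8.20

def z_stat(o, e, n):            # Eqs. 8.17 and 8.18
    return (o - e) * math.sqrt(n) / math.sqrt(e * (n - e))

print("Z1 =", round(z_stat(o1, e1, n1), 3))    # about 3.464
print("Z2 =", round(z_stat(o2, e2, n2), 3))    # about 2.887
print("Z  =", round(z_stat(o, e, n), 3))       # about 4.491
print("I  =", round((o/n - e/n) / (1 - e/n) * 100, 3), "percent")   # Eq. 8.21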
Using Misclassification Costs and Priors for Classification in SPSS

Unequal priors can be specified in SPSS using the PRIORS subcommand, but misclassification costs cannot be directly specified in the SPSS program. However, the program can be tricked into considering misclassification costs. The trick is to incorporate the misclassification costs into the priors. Suppose that the prior probabilities for the data set in Table 8.1 are p1 = .7 and p2 = .3, and it is four times as costly to misclassify a most-admired firm as a least-admired firm than it is to misclassify a least-admired firm as a most-admired firm. These costs of misclassification can be stated as

C(1/2) = 1
C(2/1) = 4.

Using the above information, first compute

p1 × C(2/1) = .7 × 4 = 2.800

and

p2 × C(1/2) = .3 × 1 = .300.

New prior probabilities are computed by normalizing the above quantities to sum to one. That is,

new p1 = 2.8 / (2.8 + 0.3) = .90
new p2 = 0.3 / (2.8 + 0.3) = .10.

Discriminant analysis can be rerun with the above new priors by including the following PRIORS subcommand:

PRIORS = .9 .1

The resulting classification functions are

C1 = -7.893 + 61.237 × EBITASS + 21.027 × ROTC

for group 1 and

C2 = -2.306 + 2.551 × EBITASS - 1.404 × ROTC

for group 2. Note that only the constant has changed; the coefficients or weights of EBITASS and ROTC have not changed. Table 8.8 gives that part of the SPSS output which contains discriminant scores, classification, and posterior probabilities. Once again, note that the discriminant scores have not changed even though the posterior probabilities have changed.

Summary of Classification Methods

In the present case all the methods provided the same classification. However, this may not always be the case. All four methods will give the same results: (1) if the data come from a multivariate normal distribution; (2) if the covariance matrices of the two groups are equal; and (3) if the misclassification costs and priors are equal. In general the results may be different depending upon which of the assumptions are satisfied. The effect of the first two assumptions on classification is discussed in Section 8.5.

Note that classification methods based on the Mahalanobis distance do not employ the discriminant function or the discriminant scores. That is, classification by the Mahalanobis method can be viewed as a separate technique that is independent of discriminant analysis. In fact, as discussed earlier, many textbooks prefer to treat discriminant analysis and classification as separate problems because all the classification methods discussed above can be shown to be independent of the discriminant function.
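The posterior probabilities reported in Table 8.8 (which follows) can be reproduced from the group centroids and the cost-adjusted priors derived above. The Python sketch below is illustrative only; it assumes, as the SPSS output here does, normally distributed discriminant scores with unit within-group variance, and it uses the centroids ±1.944 from the output [12].

import math

zbar = {1: 1.944, 2: -1.944}      # group centroids of the discriminant scores [12]
priors = {1: 0.9, 2: 0.1}         # cost-adjusted priors derived above

def posteriors(z):
    # Normal densities with unit within-group variance, weighted by the priors
    f = {g: priors[g] * math.exp(-0.5 * (z - zbar[g]) ** 2) for g in (1, 2)}
    total = sum(f.values())
    return {g: f[g] / total for g in (1, 2)}

for z in (-1.1517, 0.0750):       # e.g., the scores of firms 14 and 20 in Table 8.8
    post = posteriors(z)
    group = max(post, key=post.get)
    print(f"score {z:7.4f}:  P(1/D)={post[1]:.4f}  P(2/D)={post[2]:.4f}  -> group {group}")

For these two scores the printed posteriors should match, up to rounding, the values shown for firms 14 and 20 in Table 8.8.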
Table 8.8 Discriminant Scores, Classification, and Posterior Probability for Unequal Priors

                          Highest Probability              2nd Highest
Case      Actual                                                               Discrim
Number    Group      Group    P(D/G)     P(G/D)         Group    P(G/D)        Scores
   1        1          1      0.6086     0.9996           2      0.0004        1.4326
   2        1          1      0.6821     1.0000           2      0.0000        2.3543
   3        1          1      0.7955     1.0000           2      0.0000        2.2039
   4        1          1      0.1008     1.0000           2      0.0000        3.5859
   5        1          1      0.8862     1.0000           2      0.0000        2.0878
   6        1          1      0.6334     1.0000           2      0.0000        2.4216
   7        1          1      0.5633     0.9995           2      0.0005        1.3668
   8        1          1      0.2675     1.0000           2      0.0000        3.0535
   9        1          1      0.0570     0.9135           2      0.0865        0.0413
  10        1          1      0.3371     0.9976           2      0.0024        0.9848
  11        1          1      0.9497     0.9999           2      0.0001        1.8817
  12        1          1      0.9823     0.9999           2      0.0001        1.9225
  13        2          2      0.6774     0.9991           1      0.0009       -2.3607
  14        2          2      0.4278     0.9074           1      0.0926       -1.1517
  15        2          2      0.4699     0.9280           1      0.0720       -1.2221
  16        2          2      0.1510     1.0000           1      0.0000       -3.3808
  17        2          2      0.1184     1.0000           1      0.0000       -3.5064
  18        2          2      0.9329     0.9966           1      0.0034       -2.0289
  19        2          2      0.8083     0.9881           1      0.0119       -1.7020
  20        2**        1      0.0615     0.9234           2      0.0766        0.0750
  21        2          2      0.3085     0.9999           1      0.0001       -2.9632
  22        2          2      0.3216     0.8193           1      0.1807       -0.9536
  23        2          2      0.5561     0.9995           1      0.0005       -2.5334
  24        2          2      0.7370     0.9830           1      0.0170       -1.6089

** Misclassified observations.

8.3.4 Histograms for the Discriminant Scores

This section of the output gives a histogram for both the groups [15]. The histogram provides a visual display of group separation with respect to the discriminant score. As can be seen, there appears to be virtually no overlap between the two groups.

8.4 REGRESSION APPROACH TO DISCRIMINANT ANALYSIS

As mentioned earlier, two-group discriminant analysis can also be formulated as a multiple regression problem. The dependent variable is the group membership and is binary (i.e., 0 or 1). For the present example, we can arbitrarily code 0 for least-admired firms and 1 for most-admired firms. The independent variables are the discriminator variables. Exhibit 8.2 gives the multiple regression output for the data set given in Table 8.1. Multiple R is equal to .897 [1], and it is the same as the canonical correlation reported in the discriminant analysis output given in Exhibit 8.1. The multiple regression equation is given by

Y = 0.086 + 3.124 × EBITASS + 1.194 × ROTC.
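Any least-squares routine will reproduce these weights up to a constant of proportionality. The sketch below only illustrates the regression formulation: it assumes NumPy arrays X (the EBITASS and ROTC columns of Table 8.1, not reproduced here) and y (1 for most-admired, 0 for least-admired firms) have already been built, and the function name is ours.

    import numpy as np

    def regression_discriminant(X, y):
        # Regress the 0/1 group indicator on the discriminator variables.
        # For two groups the slope coefficients are proportional to the
        # discriminant-function weights, so their ratio (and the normalized
        # coefficients) matches the discriminant analysis results.
        X1 = np.column_stack([np.ones(len(y)), X])            # add the intercept column
        b = np.linalg.lstsq(X1, y, rcond=None)[0]
        slopes = b[1:]
        return slopes, slopes / np.sqrt((slopes ** 2).sum())  # raw and normalized weights

With the Table 8.1 data this should return slopes of about 3.124 and 1.194 and normalized weights of about .934 and .357, matching Exhibit 8.2 and the normalized coefficients discussed after it.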
Exhibit 8.2 Multiple regression approach to discriminant analysis

(1) Multiple R           .89713
    R Square             .80484
    Adjusted R Square    .78625
    Standard Error       .23614

    Analysis of Variance
                     DF    Sum of Squares    Mean Square
    Regression        2        4.82903         2.41451
    Residual         21        1.17097          .05576

    F = 43.30142        Signif F = .0000

    ------------------ Variables in the Equation ------------------
    Variable            B          SE B        Beta        T      Sig T
    EBITASS        3.123638     1.483193     .657003     2.106    .0474
    ROTC           1.193931     1.495806     .249005      .798    .4337
    (Constant)      .085677      .065683                 1.304    .2062

As stated previously, coefficients of the discriminant function are not unique. As before, ignoring the constant, the ratio of the coefficients is equal to 2.616 (3.124/1.194). The normalized coefficients,

EBITASS = 3.124 / √(3.124² + 1.194²) = .934
ROTC    = 1.194 / √(3.124² + 1.194²) = .357,

are, within rounding error, the same as those given by Eq. 8.1 and the normalized coefficients reported in Section 8.3.2. However, caution is advised in interpreting the statistical significance tests from regression analysis, as the multivariate normality and homoscedasticity assumptions will be violated due to the binary nature of the dependent variable.

8.5 ASSUMPTIONS

Discriminant analysis assumes that data come from a multivariate normal distribution and that the covariance matrices of the groups are equal. Procedures to test for the violation of these assumptions are discussed in Chapter 12. In the following sections we discuss the effect of the violation of these assumptions on the results of discriminant analysis.

8.5.1 Multivariate Normality

The assumption of multivariate normality is necessary for the significance tests of the discriminator variables and the discriminant function. If the data do not come from a multivariate normal distribution then, in theory, none of the significance tests are valid. Classification results, in theory, are also affected if the data do not come from a multivariate normal distribution. The real issue is the degree to which data can
deviate from normality without substantially affecting the results. Unfortunately, there is no clear-cut answer as to "how much" nonnormality is acceptable. However, the researcher should be aware that studies have shown that, although the overall classification error is not affected, the classification error of some groups might be overestimated and for other groups it might be underestimated (Lachenbruch, Sneeringer, and Revo 1973). If there is reason to believe that the multivariate normality assumption is clearly being violated, then one can use logistic regression analysis because it does not make any distributional assumptions for the independent variables. Logistic regression analysis is discussed in Chapter 10.

8.5.2 Equality of Covariance Matrices

Linear discriminant analysis assumes that the covariance matrices of the two groups are equal. Violation of this assumption affects the significance tests and the classification results. Research studies have shown that the degree to which they are affected depends on the number of discriminator variables and the sample size of each group (Holloway and Dunn 1967, Gilbert 1969, and Marks and Dunn 1974). Specifically, the null hypothesis of equal mean vectors is rejected more often than it should be when the number of discriminator variables is large or the sample sizes of the groups are different. That is, the significance level is inflated. Furthermore, as the number of variables increases the significance level becomes more sensitive to unequal sample sizes. The classification rate is also affected, and the various rules do not result in a minimum amount of misclassification error.

If the assumption of equal covariance matrices is rejected, one could use a quadratic discriminant function for classification purposes. However, it has been found that for small sample sizes the performance of the linear discriminant function is superior to that of the quadratic discriminant function, as the number of parameters that need to be estimated for the quadratic discriminant function is nearly doubled.

A statistical test is available for testing the equality of the covariance matrices. The null and alternate hypotheses for the statistical test are

H0: Σ1 = Σ2        Ha: Σ1 ≠ Σ2

where Σg is the covariance matrix for group g. The appropriate test statistic is Box's M and can be approximated as an F-statistic. SPSS reports the test statistic and, as can be seen, the null hypothesis is rejected (see the part circled 13 in Exhibit 8.1). However, the preceding test is sensitive to sample size in that for a large sample even small differences between the covariance matrices will be statistically significant.

To summarize, violation of the assumptions of equality of covariance matrices and of multivariate normality affects the statistical significance tests and classification results. As indicated previously, it has been shown that discriminant analysis is quite robust to the violations of these assumptions. Nevertheless, when interpreting results the researcher should be aware of the possible effects due to violation of these assumptions.

8.6 STEPWISE DISCRIMINANT ANALYSIS

Until now it was assumed that the best set of discriminator variables is known, and the known discriminator variables are used to form the discriminant function. Situations
do arise when a number of potential discriminator variables are known, but there is no indication as to which would be the best set of variables for forming the discriminant function. Stepwise discriminant analysis is a useful technique for selecting the best set of discriminating variables to form the discriminant function.

8.6.1 Stepwise Procedures

The best set of variables for forming the discriminant function can be selected using a forward, a backward, or a stepwise procedure. Each of these procedures is discussed below.

Forward Selection

In forward selection, the variable that is entered first into the discriminant function is the one that provides the most discrimination between the groups, as measured by a given statistical criterion. In the next step, the variable that is entered is the one that adds the maximum amount of additional discriminating power to the discriminant function, as measured by the statistical criterion. The procedure continues until no additional variables are entered into the discriminant function.

Backward Selection

Backward selection begins with all the variables in the discriminant function. At each step one variable is removed, the variable removed being the one whose removal results in the smallest decrease in discriminating power, as measured by the statistical criterion. The procedure continues until no more variables can be removed.

Stepwise Selection

Stepwise selection is a combination of the forward and backward elimination procedures. It begins with no variables in the discriminant function; then at each step a variable is either added or removed. A variable already in the discriminant function is removed if its removal does not significantly lower the discriminating power, as measured by the statistical criterion. If no variable is removed at a given step, then the variable that adds the most significant discriminating power, as measured by the statistical criterion, is added to the discriminant function. The procedure stops when at a given step no variable is added to or removed from the discriminant function.

Each of the three procedures gives the same discriminant function if the variables are not correlated among themselves. However, the results could be very different if there is a substantial amount of multicollinearity in the data. Consequently, the researcher should exercise caution in the use of stepwise procedures if a substantial amount of multicollinearity is suspected in the data. The problem of multicollinearity in discriminant analysis is similar to that in the case of multiple regression analysis. The problem of multicollinearity and its effect on the results is further discussed in Section 8.6.4.

8.6.2 Selection Criteria

As mentioned previously, a statistical criterion is used for determining the addition or removal of variables in the discriminant function. A number of criteria have been suggested. A discussion of commonly used criteria follows.
Wilks' Λ

Wilks' Λ is the ratio of the within-group sum of squares to the total sum of squares. At each step the variable that is included is the one with the smallest Wilks' Λ after the effect of variables already in the discriminant function is removed or partialled out. Since the change in Wilks' Λ can be approximated by an F-ratio, the rule is tantamount to entering the variable that has the highest partial F-ratio. Because Wilks' Λ is equal to

Λ = SSw / SSt = SSw / (SSb + SSw),

minimizing Wilks' Λ implies that the within-group sum of squares is minimized and the between-groups sum of squares is maximized. That is, the Wilks' Λ selection criterion considers both between-groups separation and within-group homogeneity.

Rao's V

Rao's V is based on the Mahalanobis distance, and concentrates on separation between the groups, as measured by the distance of the centroid of each group from the centroid of the total sample. Rao's V, and the change in it when adding or deleting a variable, can be approximated as a statistic that follows a χ² distribution. However, although it maximizes between-groups separation, Rao's V does not take into consideration group homogeneity. Therefore, the use of Rao's V may produce a discriminant function that does not have maximum within-group homogeneity.

Mahalanobis Squared Distance

Wilks' Λ and Rao's V maximize the total separation among all the groups. In the case of more than two groups, the result could be that all pairs of groups may not have an optimal separation. The Mahalanobis squared distance tries to ensure that there is separation among all pairs of groups. At each step the procedure enters (removes) the variable that provides the maximum increase (minimum decrease) in separation, as measured by the Mahalanobis squared distance, between the pairs of groups that are closest to each other.

Between-Groups F-ratio

In computing the Mahalanobis distance all groups are given equal weight. To overcome this limitation, the Mahalanobis distance is converted into an F-ratio. The formula used to compute the F-ratio takes into account the sizes of the groups such that larger groups receive more weight than smaller groups. The between-groups F-ratio measures the separation between a given pair of groups.

Each of the above-mentioned criteria may result in a discriminant function with a different subset of the potential discriminating variables. Although there are no absolute rules regarding the best statistical criterion, the researcher should be aware of the objectives of the various criteria in selecting the criterion for performing stepwise discriminant analysis. Wilks' Λ is the most commonly used statistical criterion.

8.6.3 Cutoff Values for Selection Criteria

As discussed above, the stepwise procedures select a variable such that there is an increase in the discriminating power of the function. Therefore, a cutoff value needs to be specified, below which the discriminating power, as measured by the statistical criterion,
is considered to be insignificant. That is, minimum conditions for the selection of variables need to be specified.

If the objective is to include only those variables that improve the discriminating power of the discriminant function at a given significance level, then the normal procedure is to specify a significance level (e.g., p = 0.05) for inclusion and/or removal of variables. However, a much lower significance level than that desired should be specified because, when such tests are performed for many variables, the overall probability of incorrectly including or deleting at least one variable can far exceed the nominal level. For example, if 10 independent hypotheses are tested, each at a significance level of .05, the overall significance level (i.e., the probability of a Type I error) is 0.40 (i.e., 1 - .95¹⁰).

If the objective is to maximize the total discriminating power of the discriminant function irrespective of how small the discriminating power of each variable is, then a moderate significance level should be specified. Costanza and Afifi (1979) recommend a p-value between .1 and .25.

8.6.4 Stepwise Discriminant Analysis Using SPSS

Consider the case of most- and least-admired firms. Assume that in addition to EBITASS and ROTC, the following financial ratios for the firms are also available: ROE, return on equity; REASS, return on assets; and MKTBOOK, market to book value. Table 8.9 gives the five financial ratios for the 24 firms. Our objective is to select the best set of discriminating variables for forming the discriminant function.

Table 8.9 Financial Data for Most-Admired and Least-Admired Firms

Firm   Group   MKTBOOK     ROTC      ROE      REASS    EBITASS
  1      1      2.304      0.182     0.191    0.377     0.158
  2      1      2.703      0.206     0.205    0.469     0.110
  3      1      2.385      0.188     0.182    0.581     0.207
  4      1      5.981      0.236     0.258    0.491     0.280
  5      1      2.762      0.193     0.178    0.587     0.197
  6      1      2.984      0.173     0.178    0.546     0.227
  7      1      2.070      0.196     0.178    0.443     0.148
  8      1      2.762      0.212     0.219    0.472     0.254
  9      1      1.345      0.147     0.148    0.297     0.079
 10      1      1.716      0.128     0.118    0.597     0.149
 11      1      3.000      0.150     0.157    0.530     0.200
 12      1      3.006      0.191     0.194    0.575     0.187
 13      2      0.975     -0.031    -0.280    0.105    -0.012
 14      2      0.945      0.053     0.019    0.306     0.036
 15      2      0.270      0.036     0.012    0.269     0.038
 16      2      0.739     -0.074    -0.150    0.204    -0.063
 17      2      0.833     -0.119    -0.358    0.155    -0.054
 18      2      0.716     -0.005    -0.305    0.027     0.000
 19      2      0.574      0.039    -0.042    0.268     0.005
 20      2      0.800      0.122     0.080    0.339     0.091
 21      2      2.028     -0.072    -0.836   -0.185    -0.036
 22      2      1.225      0.064    -0.430   -0.057     0.045
 23      2      1.502     -0.024    -0.545   -0.050    -0.026
 24      2      0.714      0.026    -0.110    0.021     0.016
As mentioned previously, only variables that meet a given statistical criterion are selected to form the discriminant function. A stepwise discriminant analysis will be employed with the following criteria:

1. Wilks' Λ is used as the selection criterion. That is, at each step a variable is either added to or deleted from the discriminant function according to the value of Wilks' Λ.
2. A tolerance level of .001 is used.
3. Priors are assumed to be equal.
4. A .15 probability level is used for entering and removing variables.

The necessary SPSS commands for a stepwise discriminant analysis are given in Table 8.10.

Table 8.10 SPSS Commands for Stepwise Discriminant Analysis

DISCRIMINANT GROUPS=EXCELL(1,2)
  /VARIABLES=MKTBOOK ROTC ROE REASS EBITASS
  /ANALYSIS=MKTBOOK ROTC ROE REASS EBITASS
  /METHOD=WILKS
  /PIN=.15
  /POUT=.15
  /STATISTICS=ALL
  /PLOT=ALL

The PIN and POUT subcommands specify the significance levels that should be used, respectively, for entering and removing variables. Exhibit 8.3 gives the relevant part of the resulting output. Once again, the discussion is keyed to the circled numbers in the output.

The output gives the means of all the variables along with the necessary statistics for univariate significance tests [1, 2]. As can be seen, the means of all the variables are significantly different for the two groups of firms [2].

For each variable not included in the discriminant function, the Wilks' Λ and its significance level are also printed [3]. Since the significance levels of all the variables are below the entry cutoff and their tolerance values are above the tolerance cutoff, all the variables are candidates for inclusion in the discriminant function. In the first step, EBITASS is included in the discriminant function because it provides the maximum discrimination, as evidenced by the selection criterion, Wilks' Λ. The corresponding F-ratio for evaluating the overall discriminant function is also provided [4a]. An F-value of 87.408 is significant (p < .0000), indicating that the discriminant function is statistically significant [4a]. Statistics for evaluating the extent to which the discriminant function formed so far differentiates between pairs of groups are also provided. The value of 87.408 [4d] for the F-statistic indicates that the difference between group 1 (most-admired) and group 2 (least-admired) with respect to the discriminant score is statistically significant (p < .0000) [4d]. Since there are only two groups, the F-statistic measuring the separation of groups 1 and 2 is the same as the F-value of 87.408 for the overall discriminant function.⁸

⁸This is not the case when there are more than two groups. For example, in the case of three groups there are three pairwise F-ratios for comparing differences between groups 1 and 2, groups 1 and 3, and groups 2 and 3. The equivalent F-ratio for the Wilks' Λ of the overall discriminant function measures the overall significance of all the groups and not pairs of groups.
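The mechanics of the Wilks' Λ criterion are easy to mimic. The sketch below is a simplified, forward-only illustration rather than the SPSS algorithm: it ignores the PIN/POUT significance tests and the tolerance check, the function names are ours, and it assumes a NumPy array X whose columns are the five ratios of Table 8.9 together with an array of group labels.

    import numpy as np

    def wilks_lambda(X, groups):
        # Wilks' Lambda = |W| / |T|, the ratio of the within-group to the total
        # SSCP determinant; smaller values indicate better group separation.
        d = X - X.mean(axis=0)
        T = d.T @ d
        W = np.zeros_like(T)
        for g in np.unique(groups):
            dg = X[groups == g] - X[groups == g].mean(axis=0)
            W += dg.T @ dg
        return np.linalg.det(W) / np.linalg.det(T)

    def forward_select(X, groups, names):
        # Greedy forward selection: at each step add the variable that gives
        # the smallest Wilks' Lambda for the enlarged variable set.
        selected, remaining = [], list(range(X.shape[1]))
        while remaining:
            lam, j = min((wilks_lambda(X[:, selected + [k]], groups), k) for k in remaining)
            selected.append(j)
            remaining.remove(j)
            print(names[j], round(lam, 5))
        return [names[j] for j in selected]

With the Table 8.9 data the first two steps should reproduce the Λ values reported in Exhibit 8.3 (about .20108 for EBITASS alone and .18071 after adding REASS). Unlike the real stepwise procedure, this sketch has no stopping rule; SPSS stops when no remaining variable meets the PIN significance level, which is what terminates the analysis after step 2 below.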
Exhibit 8.3 Stepwise discriminant analysis

(1) Group means

EXCELL       MKTBOOK      ROTC        ROE       REASS     EBITASS
  1           2.75150    .18350     .18383     .49708     .19133
  2            .94342    .00125    -.24542     .11683     .00333
Total         1.84746    .09238    -.03019     .30696     .09733

(2) Wilks' Lambda (U-statistic) and univariate F-ratio
    with 1 and 22 degrees of freedom

Variable     Wilks' Lambda          F        Significance
--------     -------------     -----------   ------------
MKTBOOK         .46135           25.6866        .0000
ROTC            .23638           71.0699        .0000
ROE             .42427           29.8532        .0000
REASS           .31570           47.6858        .0000
EBITASS         .20108           87.4076        .0000

(3) ------------- Variables not in the Analysis after Step 0 -------------

                             Minimum       Signif. of
Variable      Tolerance      Tolerance     F to Enter     Wilks' Lambda
MKTBOOK      1.0000000      1.0000000       .0000447        .4613459
ROTC         1.0000000      1.0000000       .0000000        .2363816
ROE          1.0000000      1.0000000       .0000113        .4242745
REASS        1.0000000      1.0000000       .0000006        .3157028
EBITASS      1.0000000      1.0000000       .0000000        .2010830

(4a) At step 1, EBITASS was included in the analysis.

                               Degrees of Freedom    Signif.   Between Groups
Wilks' Lambda     .20108          1   1   22.0
Equivalent F    87.40757          1       22.0        .0000

(4b) ------------- Variables in the Analysis after Step 1 -------------

Variable      Tolerance      Signif. of F to Remove     Wilks' Lambda
EBITASS      1.0000000              .0000

(4c) ------------- Variables not in the Analysis after Step 1 -------------

                             Minimum       Signif. of
Variable      Tolerance      Tolerance     F to Enter     Wilks' Lambda
MKTBOOK       .7541675       .7541675       .8293020        .2006277
ROTC          .3920789       .3920789       .4336968        .1951621
ROE           .8187493       .8187493       .4804895        .1962610
REASS         .8453627       .8453627       .1388271        .1807110

(continued)
Exhibit 8.3 (continued)

(4d) F statistics and significances between pairs of groups after step 1
     Each F statistic has 1 and 22 degrees of freedom.

                  Group  1
     Group  2     87.4076
                    .0000

(5a) At step 2, REASS was included in the analysis.

                               Degrees of Freedom    Signif.   Between Groups
Wilks' Lambda     .18071          2   1   22.0
Equivalent F    47.60381          2       21.0        .0000

(5b) ------------- Variables in the Analysis after Step 2 -------------

Variable      Tolerance      Signif. of F to Remove     Wilks' Lambda
REASS         .8453627              .1388                  .2010830
EBITASS       .8453627              .0007                  .3157028

     ------------- Variables not in the Analysis after Step 2 -------------

                             Minimum       Signif. of
Variable      Tolerance      Tolerance     F to Enter     Wilks' Lambda
MKTBOOK       .6162599       .5323841       .3804981        .1737252
ROTC          .3918749       .3690001       .4882           .1763149
ROE           .3722500       .3722500       .5727441        .1778

     F statistics and significances between pairs of groups after step 2
     Each F statistic has 2 and 21 degrees of freedom.

                  Group  1
     Group  2     47.6038
                    .0000

F level or tolerance or VIN insufficient for further computation.

(6) Summary Table

Step    Entered    Removed    Vars in    Wilks' Lambda    Sig.    Label
  1     EBITASS                  1          .20108        .0000
  2     REASS                    2          .18071        .0000

Classification function coefficients
(Fisher's linear discriminant functions)

EXCELL               1               2
REASS           18.92H21:2      '.3c32759
EBITASS         56.48H4~:!    -15.:;551023
(Constant)     -10.99:6677     -1.:12~60(l

(continued)
Exhibit 8.3 (continued)

Canonical Discriminant Functions

                 Pct of    Cum     Canonical        After    Wilks'
Fcn  Eigenvalue Variance   Pct       Corr      :     Fcn     Lambda     Chi-square   df    Sig
                                               :      0      .180711      35.928      2   .0000
1*     4.5337    100.00   100.00               :

* Marks the 1 canonical discriminant function remaining in the analysis.

Standardized canonical discriminant function coefficients

              Func 1
REASS         .38246
EBITASS       .78573

(7) Structure matrix:
    Pooled within-groups correlations between discriminating variables
    and canonical discriminant functions
    (Variables ordered by size of correlation within function)

              Func 1
EBITASS       .93613
ROTC          .73492
REASS         .69144
ROE           .63352
MKTBOOK       .33356

(8) Classification results -

                    No. of       Predicted Group Membership
Actual Group        Cases            1               2
Group       1         12            11               1
                                  91.7%            8.3%
Group       2         12             0              12
                                    .0%          100.0%

Percent of "grouped" cases correctly classified:  95.83%

The output also gives the significance level of the partial F-ratio and the tolerance for the variables that form the discriminant function [4b]. If the p-value of the partial F-ratio or the tolerance level of any variable does not meet the specified cutoff values, then that variable is removed from the function. Since this is not the case, no variables are removed at this step.

A list of the variables that have not been included in the discriminant function, along with the p-values of their partial F-ratios and their tolerance levels, is provided next [4c]. This table is used to determine which variable will enter the discriminant function in the next step. Based on the tolerance levels and the partial F-ratios of the variables that are not in the discriminant function, the variable REASS is entered in Step 2 [5a]. After Step 2, none of the variables can be removed from the function, and of the variables that are not in the function none are candidates for inclusion in the function [5b]. Consequently, the stepwise procedure is terminated, with the final discriminant function composed of the variables REASS and EBITASS. The output also gives the summary of all the steps
and other pertinent statistics for evaluating the final discriminant function [6]. These statistics have been discussed previously.

A few remarks regarding the structure matrix [7] are in order. The structure matrix gives the loadings, or the correlations between the original variables and the discriminant score; SPSS gives the loadings for all the potential variables. One would have expected the loadings for the selected variables EBITASS and REASS to be the highest, but this is not the case. The loading of ROTC is higher than that of REASS due to the presence of multicollinearity in the data. This point is further discussed in the next section.

The summary of classification results indicates that one firm belonging to group 1 has been misclassified into group 2, giving an overall classification rate of 95.83% [8].

Multicollinearity and Stepwise Discriminant Analysis

Based on the univariate Wilks' Λ's and the equivalent F-ratios given in Exhibit 8.3 [2], it appears that EBITASS and ROTC would provide the best discrimination because they have the lowest values for Wilks' Λ. However, stepwise discriminant analysis did not select ROTC.⁹ Why? The reason is that the variables are correlated among themselves. In other words, there is multicollinearity in the data. Table 8.11 gives the correlation matrix among the independent variables. It is clear from this table that the correlations among the independent variables are substantial. For example, the correlation between ROTC and EBITASS is .951, suggesting that one of the two ratios is redundant. Consequently, it may not be necessary to include both variables in the discriminant function. This does not imply that just because ROTC is not included in the discriminant function it is not an important variable. All that is being implied is that one of the two is redundant. Also, the correlation between EBITASS and REASS is .839, implying that these two variables have a lot in common, and one variable may dominate the other or might be suppressed by the other. Therefore, just because the standardized coefficient of EBITASS is much larger than that of REASS does not mean that EBITASS is more important than REASS.

The important points to remember from the preceding discussion are that in the presence of multicollinearity:

1. The selection of discriminator variables is affected by multicollinearity. Just because a given variable is not included does not imply that it is not important. It may very well be that this variable is important and does discriminate between the groups, but is not included due to its high correlation with other variables.

Table 8.11 Correlation Matrix for Discriminating Variables

Variable    MKTBOOK    ROTC     ROE     REASS    EBITASS
MKTBOOK      1.000     1.000    1.000   1.000     1.000
ROTC          .691      .848     .914
ROE           .458      .510     .802    .839
REASS         .551      .951
EBITASS       .807

⁹Note that the variables used to form the discriminant function in Eq. 8.1 were EBITASS and ROTC.
2. The use of standardized coefficients or other measures to determine the importance of variables in the discriminant function is not appropriate. It is quite possible, as in our example, that one variable may lessen the importance of variables with which it is correlated. Consequently, it is recommended that statements concerning the importance of each variable should not be made when multicollinearity is present in the data.

3. In the presence of multicollinearity in the data, stepwise discriminant analysis may or may not be appropriate depending on the source of the multicollinearity. In population-based multicollinearity the pattern of correlations among the independent variables, within sampling errors, is the same from sample to sample. In such a case, use of stepwise discriminant analysis is appropriate, as the relationship among the variables is a population characteristic and the results will not change from sample to sample. On the other hand, the results of stepwise discriminant analysis could differ from sample to sample if the pattern of correlations among the independent variables varies across samples. This is called sample-based multicollinearity. In such a case stepwise discriminant analysis is not appropriate.

8.7 EXTERNAL VALIDATION OF THE DISCRIMINANT FUNCTION

If discriminant analysis is used for classifying observations then the external validity needs to be examined. External validity refers to the accuracy with which the discriminant function can classify observations that are from another sample. The error rate obtained by classifying observations that have also been used to estimate the discriminant function is biased and, therefore, should not be used as a measure for validating the discriminant function. Following are three suggested techniques that are commonly employed to validate the discriminant function.

8.7.1 Holdout Method

The sample is randomly divided into two groups. The discriminant function is estimated using one group and the function is then used to classify observations of the second group. The result will be an unbiased estimate of the classification rate. One could also use the second group to estimate the discriminant function and classify observations in the first group. This is referred to as double cross-validation. Obviously this technique requires a large sample size. The SPSS commands to implement the holdout method are given in Table 8.12. The first COMPUTE command creates a new variable SAMP whose values come from a uniform distribution and range between 0 and 1. The second COMPUTE command rounds the values of SAMP to the nearest whole integer. Consequently, SAMP will have a value of 0 or 1. The SELECT subcommand requests discriminant analysis using only those observations whose value for the SAMP variable is equal to 1. Classification is done separately for observations whose values for the SAMP variable are 0 and 1.

Table 8.12 SPSS Commands for Holdout Validation

SET WIDTH=80
TITLE VALIDATION USING HOLDOUT METHOD
 /MKTBOOK ROTC ROE REASS EBITASS EXCELL
COMPUTE SAMP=UNIFORM(1)
COMPUTE SAMP=RND(SAMP)
BEGIN DATA
insert data here
END DATA
DISCRIMINANT GROUPS=EXCELL(1,2)
 /VARIABLES=EBITASS ROTC
 /SELECT=SAMP(1)
 /ANALYSIS=EBITASS ROTC
 /METHOD=DIRECT
 /STATISTICS=ALL
 /PLOT=ALL
FINISH

8.7.2 U-Method

The U-method, proposed by Lachenbruch (1967), holds out one observation at a time, estimates the discriminant function using the remaining n - 1 observations, and classifies
the held-out observation. This is equivalent to running n discriminant analyses and classifying the n held-out observations. This method gives an almost unbiased estimate of the classification rate. However, the U-method has been criticized in that it does not provide an error rate that has the smallest variance or mean-squared error (see Glick 1978; McLachlan 1974; Toussaint 1974).

8.7.3 Bootstrap Method

Bootstrapping is a procedure where repeated samples are drawn from the sample, discriminant analysis is conducted on the samples drawn, and an error rate is computed. The overall error rate and its sampling distribution are obtained from the error rates of the repeated samples that are drawn. Bootstrapping techniques require a considerable amount of computer time. However, with the advent of fast and cheap computing, they are gaining popularity as a viable procedure for obtaining sampling distributions of statistics whose theoretical sampling distributions are not known. See Efron (1987) for an in-depth discussion of various bootstrapping techniques, and see Bone, Sharma, and Shimp (1989) for their application in covariance structure analysis.

8.8 SUMMARY

This chapter discussed two-group discriminant analysis. Discriminant analysis is a technique for first identifying the "best" set of variables, known as the discriminator variables, that provide the best discrimination between the two groups. Then a discriminant function is estimated, which is a linear combination of the discriminator variables. The values resulting from the discriminant function are known as discriminant scores. The discriminant function is estimated such that the ratio of the between-groups sum of squares to the within-group sum of squares for the discriminant scores is maximum. The final objective of discriminant analysis is to classify future observations into one of the two groups, based on the values of their discriminant scores. In the next chapter we discuss discriminant analysis for more than two groups. This is known as multiple-group discriminant analysis.
QUESTIONS

8.1 Tables Q8.1(a), (b), and (c) present data on two variables, X1 and X2. In each case plot the data in two-dimensional space and comment on the following:
(i) Discrimination provided by X1
(ii) Discrimination provided by X2
(iii) Joint discrimination provided by X1 and X2.

Table Q8.1(a)              Table Q8.1(b)              Table Q8.1(c)
Observation  X1  X2        Observation  X1  X2        Observation  X1  X2
     1        5   8             1        7   4             1        2   3
     2        6   7             2        3   3             2        6   8
     3        5   2             3        6   3             3        1   2
     4        6   3             4        2   4             4        5   8
     5        6   8             5        8   5             5        3   3
     6        5   4             6        1   3             6        7   6

8.2 In a consumer survey 200 respondents evaluated the taste and health benefits of 20 brands of breakfast cereals. Table Q8.2 presents the average ratings on 4 brands that consumers rated most likely to buy and 4 brands that they rated least likely to buy.

Table Q8.2

         Overall    Taste     Health
Brand    Rating     Rating    Rating
  A         0         5         3
  B         1         6         8
  C         1         7         9
  D         0         3         2
  E         1         7         7
  F         0         4         6
  G         0         3         3
  H         1         8         9

Notes:
1. Overall rating: 0 = least likely to buy; 1 = most likely to buy.
2. Taste and health attributes of the cereals were rated on a 10-point scale with 1 = extremely poor and 10 = extremely good.

(a) Plot the data in two-dimensional space and comment on the discrimination provided by: (i) taste, (ii) health, and (iii) taste and health.
(b) Consider a new axis in two-dimensional space that makes an angle θ with the "taste" axis. Derive an equation to compute the projection of the points on the new axis.
(c) Calculate the between-groups, within-group, and total sums of squares for various values of θ.
(d) Define λ as the ratio of the between-groups to within-group sum of squares. Calculate λ for each value of θ.
(e) Plot the values of λ against those of θ. Use this plot to find the value of θ that gives the maximum value of λ.
(f) Use the λ-maximizing value of θ from (e) to derive the linear discriminant function.

Hint: A spreadsheet may be used to facilitate the calculations considerably.
8.3 Consider Question 8.2.
(a) Use the discriminant function derived in Question 8.2(f) to calculate the discriminant scores for each brand of cereal.
(b) Plot the discriminant scores in one-dimensional space.
(c) Use the plot of the discriminant scores to obtain a suitable cutoff value.
(d) Comment on the accuracy of classification provided by the discriminant function and the cutoff value.

8.4 Use SPSS (or any other software) to perform discriminant analysis on the data shown in Table Q8.2. Compare the results with those obtained in Questions 8.2 and 8.3.

8.5 Cluster analysis and discriminant analysis can both be used for purposes of classification. Discuss the similarities and differences in the two approaches. Under what circumstances is each approach more appropriate?

8.6 Two-group discriminant analysis and multiple regression share a lot of similarities, and are in fact computationally indistinguishable. However, there are some fundamental differences between these two approaches. Discuss the differences between these two methods.

8.7 File DEPRES.DAT gives data from a study in which the respondents were adult residents of Los Angeles County.ª The major objectives of the study were to provide estimates of the prevalence and incidence of depression and to identify causal factors and outcomes associated with this condition. The major instrument used for classifying depression is the Depression Index (CESD) of the National Institute of Mental Health, Center for Epidemiologic Studies. A description of the relevant variables is provided in file DEPRES.DOC.
(a) Use discriminant analysis to estimate whether an individual is likely to be depressed (use the variable "CASES" as the dependent variable) based on his/her income and education.
(b) Include the variables ACUTEILL, SEX, AGE, HEALTH, BEDDAYS, and CHRONILL as additional independent variables and check whether there is an improvement in the prediction. Hint: Since the number of independent variables is large, you may want to use stepwise discriminant analysis.
(c) Interpret the discriminant analysis solution.

8.8 Refer to the mass transportation data in file MASST.DAT (for a description of the data refer to file MASST.DOC). Use the cluster analysis results from Question 7.3 (Chapter 7) to form two groups: "users" and "nonusers" of mass transportation. Using this new grouping as the dependent variable and determinant attributes of mass transportation as independent variables, perform a discriminant analysis and discuss the differences between the two groups. Note: Use variables ID and V1 to V18 for this question; ignore the other variables.

8.9 For the two-groups case show that

B = [n1 n2 / (n1 + n2)] (μ1 - μ2)(μ1 - μ2)'

where B is the between-groups SSCP matrix for p variables, μ1 and μ2 are the p × 1 vectors of means for group 1 and group 2, and n1 and n2 are the number of observations in group 1 and group 2.

ªAfifi, A.A., and Virginia Clark (1984), Computer-Aided Multivariate Analysis, Lifetime Learning Publications, Belmont, California, pp. 30-39.
Appendix

A8.1 FISHER'S LINEAR DISCRIMINANT FUNCTION

Let X be a p × 1 random vector of p variables whose variance-covariance matrix is given by Σ and whose total SSCP matrix is given by T. Let γ be a p × 1 vector of weights. The discriminant function will be given by

ξ = X'γ.    (A8.1)

The sum of squares for the resulting discriminant scores will be given by

SSξ = (X'γ)'(X'γ) = γ'XX'γ = γ'Tγ    (A8.2)

where T = XX' is the total SSCP matrix for the p variables. Since T = B + W, where B and W are, respectively, the between-groups and within-group SSCP matrices for the p variables, Eq. A8.2 can be written as

SSξ = γ'(B + W)γ = γ'Bγ + γ'Wγ.    (A8.3)

In Eq. A8.3, γ'Bγ and γ'Wγ are, respectively, the between-groups and within-group sums of squares for the discriminant score ξ. The objective of discriminant analysis is to estimate the weight vector, γ, of the discriminant function given by Eq. A8.1 such that

λ = γ'Bγ / γ'Wγ    (A8.4)

is maximized. The vector of weights, γ, can be obtained by differentiating λ with respect to γ and equating to zero. That is,

∂λ/∂γ = [2(Bγ)(γ'Wγ) - 2(γ'Bγ)(Wγ)] / (γ'Wγ)² = 0.

Or, dividing through by γ'Wγ,

2(Bγ - λWγ) / γ'Wγ = 0    (A8.5)
(B - λW)γ = 0
(W⁻¹B - λI)γ = 0.

Equation A8.5 is a system of homogeneous equations, and for a nontrivial solution

|W⁻¹B - λI| = 0.    (A8.6)

That is, the problem reduces to finding the eigenvalues and eigenvectors of the nonsymmetric matrix W⁻¹B, with the eigenvectors giving the weights for forming the discriminant function.

For the two-groups case, Eq. A8.5 can be further simplified. It can be shown that for two groups B
is equal to

B = [n1 n2 / (n1 + n2)] (μ1 - μ2)(μ1 - μ2)'    (A8.7)

where μ1 and μ2, respectively, are the p × 1 vectors of means for group 1 and group 2; n1 and n2 are the number of observations in group 1 and group 2, respectively; and C = n1 n2 / (n1 + n2), which is a constant. Therefore, Eq. A8.5 can be written as

[W⁻¹C(μ1 - μ2)(μ1 - μ2)' - λI]γ = 0    (A8.8)
CW⁻¹(μ1 - μ2)(μ1 - μ2)'γ = λγ
(C/λ)[W⁻¹(μ1 - μ2)(μ1 - μ2)'γ] = γ.

Since (μ1 - μ2)'γ is a scalar, Eq. A8.8 can be written as

γ = KW⁻¹(μ1 - μ2)    (A8.9)

where K = C(μ1 - μ2)'γ/λ is a scalar and therefore is a constant. Since the within-group variance-covariance matrix, Σw, is proportional to W and it is assumed that Σ1 = Σ2 = Σw = Σ, Eq. A8.9 can also be written as

γ = KΣ⁻¹(μ1 - μ2).    (A8.10)

Assuming a value of one for the constant K, γ = Σ⁻¹(μ1 - μ2) and Eq. A8.1 can be written as

ξ = (μ1 - μ2)'Σ⁻¹X.    (A8.11)

The discriminant function given by Eq. A8.11 is Fisher's discriminant function. It is obvious that different values of the constant K give different values for γ, and therefore the absolute weights of the discriminant function are not unique. The weights are unique only in a relative sense; that is, only the ratio of the weights is unique.

A8.2 CLASSIFICATION

Geometrically, classification involves the partitioning of the discriminant or the variable space into two mutually exclusive regions. Figure A8.1 gives a hypothetical plot of two groups measured on only one variable, X. The classification problem reduces to determining a cutoff value that divides the space into two regions, R1 and R2.

Figure A8.1 Classification in one-dimensional space.

Observations falling in region R1 are clas-
sified into group 1 and those falling in region R2 are classified into group 2. Similarly, Figure A8.2 gives a hypothetical plot of observations measured on two variables, X1 and X2. The classification problem now reduces to identifying a line that divides the two-dimensional space into two regions. If the observations are measured on three variables then the classification problem reduces to identifying a plane that divides the three-dimensional space into two regions. In general, for a p-dimensional space the problem reduces to finding a (p - 1)-dimensional hyperplane that divides the p-dimensional space into two regions.

Figure A8.2 Classification in two-dimensional space.

In classifying observations, two types of errors can occur, as given in Table A8.1. An observation coming from group 1 can be misclassified into group 2. Let C(2/1) be the cost of this misclassification. Similarly, C(1/2) is the cost of misclassifying an observation from group 2 into group 1. Obviously, one would prefer to use the criterion that minimizes misclassification costs. The following section discusses the statistical decision theory method for dividing the total space into the classification regions, i.e., for developing classification rules.

Table A8.1 Misclassification Costs

Group Assignment            Actual Group Membership
by Decision Maker               1               2
        1                    No cost          C(1/2)
        2                    C(2/1)           No cost

A8.2.1 Statistical Decision Theory Method for Developing Classification Rules

Consider the case where we have only one discriminating variable, X. Let π1 and π2 represent the populations for the two groups and f1(x) and f2(x) be the respective probability density functions for the 1 × 1 random vector, X. Figure A8.3 depicts the two density functions and the cutoff value, c, for one discriminating variable (i.e., p = 1). The conditional probability of correctly classifying observations in group 1 is given by

P(1/1) = ∫R1 f1(x) dx
Figure A8.3 Density functions for one discriminating variable.

and that of correctly classifying observations in group 2 is given by

P(2/2) = ∫R2 f2(x) dx

where P(i/j) is the conditional probability of classifying observations into group i given that they belong to group j. The conditional probabilities of misclassification are given by

P(1/2) = ∫R1 f2(x) dx    (A8.12)

and

P(2/1) = ∫R2 f1(x) dx.    (A8.13)

Now assume that the prior probabilities of a given observation belonging to group 1 and group 2, respectively, are p1 and p2. The various classification probabilities are given by

P(correctly classified into π1) = P(1/1)p1
P(correctly classified into π2) = P(2/2)p2
P(misclassified into π1) = P(1/2)p2    (A8.14)
P(misclassified into π2) = P(2/1)p1.    (A8.15)

The total probability of misclassification (TPM) is given by the sum of Eqs. A8.14 and A8.15. That is,

TPM = P(1/2)p2 + P(2/1)p1.    (A8.16)

Equation A8.16 does not consider the misclassification costs. Given the misclassification costs specified in Table A8.1, the total cost of misclassification (TCM) is then given by

TCM = C(1/2)P(1/2)p2 + C(2/1)P(2/1)p1.    (A8.17)

Classification rules can be obtained by minimizing either Eq. A8.16 or Eq. A8.17. In other words, the value c in Figure A8.3 should be chosen such that either TPM or TCM is minimized. If Eq. A8.16 is minimized then the resulting classification rule assumes equal costs of misclassification. Minimization of Eq. A8.17 will result in a classification rule that assumes unequal priors and unequal misclassification costs. Consequently, minimization of Eq. A8.17 results in a more general classification rule. It can be shown that the rule which minimizes Eq. A8.17 is given by: Assign an observation to π1 if

f1(x)/f2(x) ≥ [C(1/2)/C(2/1)][p2/p1]    (A8.18)
and assign to π2 if

f1(x)/f2(x) < [C(1/2)/C(2/1)][p2/p1].    (A8.19)

Table A8.2 gives the classification rules for various combinations of priors and misclassification costs. These classification rules can be readily generalized to more than one variable by defining the function f(x) as f(x) = f(x1, x2, ..., xp).

Table A8.2 Summary of Classification Rules

Misclassification
Costs        Priors      Assign to Group 1 (π1) if:                        Assign to Group 2 (π2) if:
Unequal      Unequal     f1(x)/f2(x) ≥ [C(1/2)/C(2/1)][p2/p1]              f1(x)/f2(x) < [C(1/2)/C(2/1)][p2/p1]
Equal        Unequal     f1(x)/f2(x) ≥ p2/p1                               f1(x)/f2(x) < p2/p1
Unequal      Equal       f1(x)/f2(x) ≥ C(1/2)/C(2/1)                       f1(x)/f2(x) < C(1/2)/C(2/1)
Equal        Equal       f1(x)/f2(x) ≥ 1                                   f1(x)/f2(x) < 1

Posterior Probabilities

It is also possible to compute the posterior probabilities based on Bayesian theory. The posterior probability, P(π1/xi), that any observation i belongs to group 1 given the value of variable xi is given by

P(π1/xi) = P(π1 occurs and we observe xi) / P(observe xi)
         = P(xi/π1)P(π1) / P(xi)
         = P(π1)P(xi/π1) / [P(xi/π1)P(π1) + P(xi/π2)P(π2)]
         = p1 f1(xi) / [p1 f1(xi) + p2 f2(xi)].    (A8.20)

Similarly, the posterior probability P(π2/xi) is given by

P(π2/xi) = p2 f2(xi) / [p1 f1(xi) + p2 f2(xi)].    (A8.21)

It is clear that for classification purposes one must know, at a minimum, the density function f(x). In the following section the above classification rules for a multivariate normal density function are developed.

A8.2.2 Classification Rules for Multivariate Normal Distributions

The joint density function of any group i for a p × 1 random vector x is given by

fi(x) = (2π)^(-p/2) |Σi|^(-1/2) exp[-½(x - μi)'Σi⁻¹(x - μi)].    (A8.22)
Assuming Σ1 = Σ2 = Σ and substituting Eq. A8.22 in Eqs. A8.18 and A8.19 results in the following classification rules: Assign to π1 if

exp[-½(x - μ1)'Σ⁻¹(x - μ1)] / exp[-½(x - μ2)'Σ⁻¹(x - μ2)] ≥ [C(1/2)/C(2/1)][p2/p1]    (A8.23)

and assign to π2 if

exp[-½(x - μ1)'Σ⁻¹(x - μ1)] / exp[-½(x - μ2)'Σ⁻¹(x - μ2)] < [C(1/2)/C(2/1)][p2/p1].    (A8.24)

Taking the natural log of both sides of Eqs. A8.23 and A8.24 and rearranging terms, the rule becomes: Assign to π1 if

(μ1 - μ2)'Σ⁻¹x ≥ ½(μ1 - μ2)'Σ⁻¹(μ1 + μ2) + ln{[C(1/2)/C(2/1)][p2/p1]}    (A8.25)

and assign to π2 if

(μ1 - μ2)'Σ⁻¹x < ½(μ1 - μ2)'Σ⁻¹(μ1 + μ2) + ln{[C(1/2)/C(2/1)][p2/p1]}.    (A8.26)

Further simplification of the preceding equations results in: Assign to π1 if

(μ1 - μ2)'Σ⁻¹x ≥ .5(μ1 - μ2)'Σ⁻¹μ1 + .5(μ1 - μ2)'Σ⁻¹μ2 + ln{[C(1/2)/C(2/1)][p2/p1]}    (A8.27)

and assign to π2 if

(μ1 - μ2)'Σ⁻¹x < .5(μ1 - μ2)'Σ⁻¹μ1 + .5(μ1 - μ2)'Σ⁻¹μ2 + ln{[C(1/2)/C(2/1)][p2/p1]}.    (A8.28)

The quantity (μ1 - μ2)'Σ⁻¹ in Eqs. A8.27 and A8.28 is the discriminant function and, therefore, the equations can be written as: Assign to π1 if

ξ ≥ .5ξ̄1 + .5ξ̄2 + ln{[C(1/2)/C(2/1)][p2/p1]}    (A8.29)

and assign to π2 if

ξ < .5ξ̄1 + .5ξ̄2 + ln{[C(1/2)/C(2/1)][p2/p1]}    (A8.30)

where ξ is the discriminant score and ξ̄j is the mean discriminant score for group j. It can be clearly seen that for equal misclassification costs and equal priors the preceding classification rules reduce to: Assign to π1 if

ξ ≥ .5ξ̄1 + .5ξ̄2    (A8.31)

and assign to π2 if

ξ < .5ξ̄1 + .5ξ̄2    (A8.32)

because ln(1) = 0.
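The cutoff rule of Eqs. A8.29-A8.30 is a one-line computation once the discriminant scores and their group means are available. The following is a minimal sketch under those equations' assumptions (multivariate normality and equal covariance matrices); the function name and argument names are ours.

    import math

    def classify_score(score, mean1, mean2, p1=0.5, p2=0.5, c12=1.0, c21=1.0):
        # Eqs. A8.29-A8.30: assign to group 1 when the discriminant score is at
        # least the average of the two group mean scores plus
        # ln[(C(1/2)/C(2/1)) (p2/p1)].  The rule assumes group 1 has the larger
        # mean score, as with the function of Eq. A8.11.  With equal costs and
        # priors the log term is ln(1) = 0, giving Eqs. A8.31-A8.32.
        cutoff = 0.5 * (mean1 + mean2) + math.log((c12 / c21) * (p2 / p1))
        return 1 if score >= cutoff else 2

For example, with equal priors and costs, mean scores of 2.0 and -2.0 give a cutoff of zero, so a score of 0.6 would be assigned to group 1; raising p1 or C(2/1) lowers the cutoff and enlarges the group 1 region.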
Therefore, the cutoff value used in the cutoff-value method presented in the chapter implicitly assumes a multivariate normal distribution, equal priors, equal misclassification costs, and equal covariance matrices.

Posterior Probabilities

Substituting Eq. A8.22 in Eq. A8.20 and assuming Σ1 = Σ2 = Σ gives the posterior probability, P(π1/x), of classifying an observation in group 1. That is,

P(π1/x) = p1 exp[-½(x - μ1)'Σ⁻¹(x - μ1)] / {p1 exp[-½(x - μ1)'Σ⁻¹(x - μ1)] + p2 exp[-½(x - μ2)'Σ⁻¹(x - μ2)]}.    (A8.33)

Equation A8.33 can be further simplified as

P(π1/x) = 1 / {1 + (p2/p1) exp[-½((x - μ2)'Σ⁻¹(x - μ2) - (x - μ1)'Σ⁻¹(x - μ1))]}
        = 1 / {1 + (p2/p1) exp[-((μ1 - μ2)'Σ⁻¹x - .5ξ̄1 - .5ξ̄2)]}
        = 1 / {1 + (p2/p1) exp[-(ξ - .5ξ̄1 - .5ξ̄2)]}    (A8.34)

as (μ1 - μ2)'Σ⁻¹x is the discriminant function.

We have so far assumed that the population parameters μ1, μ2, and Σ1 = Σ2 are known. Furthermore, it has been assumed that Σ1 = Σ2 = Σ. If the population parameters are not known, then sample-based estimates x̄j and S⁻¹, respectively, of the population parameters μj and Σ⁻¹ are used. However, using the sample estimates in the classification equations does not ensure that the TCM will be minimized. But, as sample size increases, the bias induced in the TCM by the use of sample estimates is reduced. If Σ1 ≠ Σ2 then the procedure for developing classification rules is the same; however, the rules become quite cumbersome.

A8.2.3 Mahalanobis Distance Method

Once again assuming Σ1 = Σ2 = Σ, the Mahalanobis, or statistical, distance of any observation i from group 1 is given by

(xi - μ1)'Σ⁻¹(xi - μ1)    (A8.35)

and from group 2 it is given by

(xi - μ2)'Σ⁻¹(xi - μ2).    (A8.36)

That is, an observation i is assigned to group 1 if

(xi - μ1)'Σ⁻¹(xi - μ1) ≤ (xi - μ2)'Σ⁻¹(xi - μ2)

and to group 2 otherwise. This rule can be rewritten as

xi'Σ⁻¹(μ1 - μ2) ≥ .5(μ1'Σ⁻¹μ1 - μ2'Σ⁻¹μ2)

or
xi'Σ⁻¹(μ1 - μ2) ≥ .5(μ1 + μ2)'Σ⁻¹(μ1 - μ2).

This results in the following classification rule: Assign to π1 if

xi'Σ⁻¹(μ1 - μ2) ≥ .5(μ1 + μ2)'Σ⁻¹(μ1 - μ2)    (A8.37)

and assign to π2 if

xi'Σ⁻¹(μ1 - μ2) < .5(μ1 + μ2)'Σ⁻¹(μ1 - μ2).    (A8.38)

As can be seen, the rule given by Eqs. A8.37 and A8.38 is the same as that of the cutoff-value method. It is also the same as that given by the statistical decision theory method under the assumption of equal priors, equal misclassification costs, a multivariate normal distribution for the discriminator variables, and equal covariance matrices. The only difference between the Mahalanobis distance method and the cutoff-value method employing discriminant scores is that in the former, classification regions are formed in the original variable space, and in the latter, classification regions are formed in the discriminant score space. Obviously, forming classification regions in the discriminant score space gives a more parsimonious representation of the classification problem.

A8.3 ILLUSTRATIVE EXAMPLE

We use a numerical example to illustrate some of the concepts presented in the Appendix. First, we illustrate the procedures for any known distribution and then for normally distributed variables.

A8.3.1 Any Known Distribution

Given the following information,

f1(xi) = .3,  f2(xi) = .4,  C(2/1) = 5,  C(1/2) = 10,  p1 = .8,  p2 = .2,

classify the observation i. According to Eq. A8.18 the cutoff value is equal to

(10 × 0.2) / (5 × 0.8) = 0.50,

and

f1(x)/f2(x) = .3/.4 = 0.75,

and therefore observation i is assigned to group 1. If the misclassification costs are equal, then the cutoff value is equal to 0.2/0.8 = 0.25 (see Table A8.2) and the observation is once again assigned to group 1. For equal priors the cutoff value is equal to 10/5 = 2 and the observation is assigned to group 2 (see Table A8.2). If the priors and the misclassification costs are equal, then the observation will be assigned to group 2, as the cutoff value is equal to one (see Table A8.2). The posterior probabilities can be computed from Eqs. A8.20 and A8.21 and are equal to

P(π1/xi) = (0.8 × 0.3) / (0.8 × 0.3 + 0.2 × 0.4) = .75

and

P(π2/xi) = (0.2 × 0.4) / (0.8 × 0.3 + 0.2 × 0.4) = .25.

Therefore, the observation will be assigned to group 1.
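The whole of Section A8.3.1 can be checked with a few lines of arithmetic; the sketch below simply replays the example's numbers.

    f1, f2 = 0.3, 0.4            # density values for observation i under groups 1 and 2
    p1, p2 = 0.8, 0.2            # prior probabilities
    c12, c21 = 10.0, 5.0         # misclassification costs C(1/2) and C(2/1)

    ratio = f1 / f2                          # 0.75
    cutoff = (c12 * p2) / (c21 * p1)         # 0.50 (Eq. A8.18)
    group = 1 if ratio >= cutoff else 2      # assigned to group 1

    post1 = p1 * f1 / (p1 * f1 + p2 * f2)    # 0.75 (Eq. A8.20)
    post2 = p2 * f2 / (p1 * f1 + p2 * f2)    # 0.25 (Eq. A8.21)

Dropping the costs (cutoff = p2/p1 = 0.25) keeps the assignment in group 1, while dropping the priors (cutoff = C(1/2)/C(2/1) = 2) or dropping both (cutoff = 1) moves it to group 2, exactly as in the text.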