Ability and Item Parameter Estimation

Alternative approaches to estimation are available. One approach is to obtain Bayesian estimates of the parameters using prior distributions. Swaminathan and Gifford (1982, 1985, 1986) have developed Bayesian procedures for the one-, two-, and three-parameter models in which prior distributions are placed on the item and ability parameters. This procedure eliminates the problems encountered in the joint maximum likelihood procedure, namely that of improper estimates for certain response patterns.

The problem of inconsistent joint maximum likelihood estimates occurs because both the item and ability parameters are estimated simultaneously. This problem disappears if the item parameters can be estimated without any reference to the ability parameters. If we consider the examinees as having been selected randomly from a population, then, by specifying a distribution for the ability parameters, we can integrate them out of the likelihood function (integrating out the ability parameters has the same effect as "averaging" over the ability distribution to obtain a marginal likelihood function in terms of the item parameters). The resulting "marginal maximum likelihood estimates" do have desirable asymptotic properties; that is, the item parameter estimates are consistent as the number of examinees increases. This marginal maximum likelihood estimation procedure was developed by Bock and Lieberman (1970), refined by Bock and Aitkin (1981), and implemented in the computer program BILOG by Mislevy and Bock (1984). The marginal maximum likelihood procedure is computationally more intensive than the joint maximum likelihood procedure because of the integration that is required. Moreover, in order to obtain the marginal likelihood function of the item parameters, it is necessary to approximate the distribution of ability.
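The marginalization step can be sketched numerically. The fragment below is a minimal illustration, not BILOG's algorithm; the item parameters and quadrature size are invented for the example. It integrates a two-parameter item characteristic curve over a standard normal ability distribution using Gauss-Hermite quadrature, which is essentially how the integral in the marginal likelihood is approximated in practice.

```python
import numpy as np

D = 1.7  # logistic scaling constant used throughout the book

def icc(theta, a, b):
    """Two-parameter logistic item characteristic curve."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def marginal_p(a, b, n_points=21):
    """Probability of a correct response with ability integrated out
    over a N(0, 1) population, using Gauss-Hermite quadrature
    (change of variable theta = sqrt(2) * x)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    theta = np.sqrt(2.0) * nodes
    w = weights / np.sqrt(np.pi)  # normalized so the weights sum to 1
    return float(np.sum(w * icc(theta, a, b)))

# Hypothetical item with a = 1.0, b = 0.0: by symmetry the marginal
# probability of success in a N(0, 1) population is 0.5.
print(marginal_p(1.0, 0.0))
```

In a full marginal maximum likelihood program the same quadrature is applied to each examinee's entire response pattern, and the item parameters are then chosen to maximize the product of the resulting marginal probabilities (the EM approach of Bock and Aitkin, 1981).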
For a good approximation of the ability distribution, the availability of a large number of examinees is important. Hence, the marginal maximum likelihood procedure should be carried out only with sufficiently large numbers of examinees. Once the item parameters have been estimated using the marginal maximum likelihood procedure, the item parameter estimates may be treated as known and the abilities of the examinees can be estimated using the method outlined earlier in this chapter. Again, the larger the number of items, the better the ability parameter estimates. Either the
FUNDAMENTALS OF ITEM RESPONSE THEORY

maximum likelihood estimates of ability or, if desired, the EAP estimates of ability may be obtained.

In some situations, even the marginal maximum likelihood procedure may fail; that is, the numerical procedure may fail to yield a satisfactory result even after a large number of iterations. This failure happens primarily in the estimation of the c parameter in the three-parameter model. Poor estimates of c, in turn, degrade estimates of other item parameters and of ability (Swaminathan & Gifford, 1985). Bayesian estimation (Mislevy, 1986) solves this problem (in fact, within the BILOG computer program, a prior distribution is placed on the c parameter values as the default option).

Standard Errors of Item Parameter Estimates

The concept of the information function, briefly introduced earlier, is a generic concept that relates to the variance of a maximum likelihood estimator. When the maximum likelihood estimate of the ability parameter is obtained, its variance is given as the reciprocal of the corresponding information function. Similarly, when maximum likelihood estimates of item parameters are obtained, the variance-covariance matrix of the estimates is given as the inverse of the information matrix of item parameter estimates (since, in the case of the two- and three-parameter models, each item is characterized by two and three parameters, respectively). The elements of the information matrix for the joint maximum likelihood estimates for each item are arranged in the following manner (since the matrix is symmetric, only the upper triangle elements are given):

          | I(a,a)  I(a,b)  I(a,c) |
    I_i = |         I(b,b)  I(b,c) |        i = 1, 2, ..., n
          |                 I(c,c) |

The expressions for the elements are given below (Hambleton & Swaminathan, 1985; Lord, 1980).
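Although the full three-parameter expressions are lengthy, the two-parameter case is compact enough to sketch. In the fragment below (the abilities are simulated and the item parameters invented; treating the abilities as known is the simplification), each element of the 2 x 2 information matrix has the form sum of P*Q times the product of derivatives of eta = D*a*(theta - b), because the model is logistic in eta. The 3PL analogue is a 3 x 3 matrix.

```python
import numpy as np

D = 1.7

def icc(theta, a, b):
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

def item_info_matrix(theta, a, b):
    """Fisher information matrix for (a, b) of one 2PL item, summed
    over a vector of (assumed known) examinee abilities."""
    P = icc(theta, a, b)
    w = P * (1.0 - P)
    d_a = D * (theta - b)   # d eta / d a, varies over examinees
    d_b = -D * a            # d eta / d b, constant over examinees
    Iaa = np.sum(w * d_a * d_a)
    Iab = np.sum(w * d_a * d_b)
    Ibb = np.sum(w * d_b * d_b)
    return np.array([[Iaa, Iab], [Iab, Ibb]])

# Hypothetical calibration sample of 1000 abilities
theta = np.random.default_rng(0).standard_normal(1000)
cov = np.linalg.inv(item_info_matrix(theta, a=1.0, b=0.0))
print(np.sqrt(np.diag(cov)))  # standard errors of a-hat and b-hat
```

Inverting the matrix, rather than taking reciprocals of its diagonal, is what accounts for the correlation between the a and b estimates.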
[Expressions for the individual elements, which involve P_i, Q_i, and (P_i - c_i), are omitted here; see Hambleton & Swaminathan (1985) and Lord (1980) for the complete formulas.]

Simple expressions for the variance-covariance matrix of marginal maximum likelihood estimates are not available, but a description of the procedure for obtaining them is given in Mislevy and Bock (1984) and Mislevy (1986). The variance-covariance matrix of the item parameter estimates is important when comparing the item parameters in two groups, a problem that arises in bias or differential item functioning studies (Lord, 1980).

Summary of Parameter Estimation Methods

In the preceding sections, maximum likelihood, marginal maximum likelihood, and Bayesian estimation procedures were described. These are the most widely used estimation procedures. For reviews of current procedures, refer to Baker (1987) and Swaminathan (1983). Several other approaches to estimation were not described in this chapter. A list
of the available estimation procedures with brief descriptions is given below.

• Joint maximum likelihood procedure (Lord, 1974, 1980), applicable to the one-, two-, and three-parameter models. The ability and item parameters are estimated simultaneously.
• Marginal maximum likelihood procedure (Bock & Aitkin, 1981), applicable to the one-, two-, and three-parameter models. The ability parameters are integrated out, and the item parameters are estimated. With the item parameter estimates determined, the ability parameters are estimated.
• Conditional maximum likelihood procedure (Andersen, 1972, 1973; Rasch, 1960), applicable only to the one-parameter model. Here, the likelihood function is conditioned on the number-right score.
• Joint and marginal Bayesian estimation procedures (Mislevy, 1986; Swaminathan & Gifford, 1982, 1985, 1986), applicable to the one-, two-, and three-parameter models. Prior distributions are placed on the item and ability parameters, eliminating some of the problems, such as improper estimation of parameters and nonconvergence, encountered with joint and marginal maximum likelihood procedures.
• Heuristic estimation procedure (Urry, 1974, 1978), applicable primarily to the two- and three-parameter models.
• Method based on nonlinear factor analysis procedures (McDonald, 1967, 1989), applicable to the two-parameter and a modified case of the three-parameter model in which the c-values are fixed.

In addition, when item parameters are known, estimation of ability can be carried out by obtaining the mode of the likelihood function, or, in the case of Bayesian procedures, either the mean or the mode of the posterior density function of θ. The procedures summarized above are implemented in computer programs described in the next section.
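The conditioning idea in the one-parameter procedure can be made concrete: given an examinee's number-right score r, the probability of the response pattern no longer involves ability at all. A minimal sketch (the difficulty values are invented; gamma_r is the elementary symmetric function of the values eps_i = exp(-b_i)):

```python
import numpy as np

def esf(eps):
    """Elementary symmetric functions gamma_0 ... gamma_n of the
    values in eps, computed by the standard recurrence."""
    g = np.zeros(len(eps) + 1)
    g[0] = 1.0
    for e in eps:
        g[1:] = g[1:] + e * g[:-1]
    return g

def cond_prob(pattern, b):
    """Conditional probability of a 0/1 response pattern given its
    number-right score, under the one-parameter (Rasch) model.
    Ability cancels out of this expression entirely."""
    eps = np.exp(-np.asarray(b, dtype=float))
    u = np.asarray(pattern)
    g = esf(eps)
    return float(np.prod(eps ** u) / g[int(u.sum())])

# Hypothetical difficulties for a four-item test
b = [-1.0, -0.2, 0.4, 1.1]
print(cond_prob([1, 1, 0, 0], b))
```

Because ability drops out, maximizing the product of these conditional probabilities over examinees yields difficulty estimates that do not depend on the ability distribution, which is the basis of programs such as PML and RIDA described below.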
Computer Programs for Parameter Estimation

Until recently, few computer programs were available for estimation of the parameters of the IRT models introduced earlier. In the 1970s the most widely known and used programs were BICAL (Wright et al., 1979) and LOGIST (Wingersky, Barton, & Lord, 1982). BICAL fits the one-parameter model; LOGIST fits the one-, two-, and three-parameter models. Both programs use joint maximum likelihood estimation procedures, and both remain widely used. LOGIST remains the standard by which new estimation programs are judged.

Other programs available in the 1970s were PML (Gustafsson, 1980a) and ANCILLES (Urry, 1974, 1978). PML fits the one-parameter model using the conditional maximum likelihood procedure, while ANCILLES fits the three-parameter model using a heuristic procedure. PML has not been used widely in the United States, and ANCILLES is not used often because its estimation procedure is not well grounded theoretically and other programs have been shown to produce better estimates.

In the 1980s several new estimation programs were introduced. Most notable of these were BILOG (Mislevy & Bock, 1984) and ASCAL (Assessment Systems Corporation, 1988). BILOG fits the one-, two-, and three-parameter models using marginal maximum likelihood procedures with optional Bayesian procedures; ASCAL fits the three-parameter model using Bayesian procedures. BILOG is available in both mainframe and microcomputer versions, while ASCAL is a microcomputer program.

Also available in the early 1980s was the program NOHARM (Fraser & McDonald, 1988), which fits two- and three-parameter models (with fixed c-values) using a nonlinear factor analysis approach. NOHARM has not received much attention in the United States. Other developments included microcomputer programs for fitting the one-parameter model, MICROSCALE (Mediax Interactive Technologies, 1986) and RASCAL (Assessment Systems Corporation, 1988). A microcomputer version of LOGIST is being developed and is expected to be released in 1991 or 1992. RIDA (Glas, 1990) is a new microcomputer program for analyzing dichotomous data using the one-parameter model. Both marginal and conditional maximum likelihood estimation procedures are available. A special feature is the capability of analyzing various incomplete test designs that often arise in test equating (see chapter 9). Most recently,
interest in IRT programs that handle polytomous data (Thissen, 1986; Wright et al., 1989) and multidimensional data (Carlson, 1987) has developed, but work on the latter topic is only just beginning and considerable amounts of research are needed before Carlson's program can be used operationally. A summary of the programs listed above and their advantages and key features is given in Table 3.2. Sources for the programs are listed in Appendix B.
TABLE 3.2 Currently Available IRT Parameter Estimation Programs

(Each entry lists: Program | Source | Model | Estimation Procedure | Computer Requirements; followed by Pros (+), Cons (-), and Features (•).)

BICAL (replaced by BIGSCALE) | Wright et al. (1979); Wright et al. (1989) | 1P | Unconditional Maximum Likelihood | Most mainframes
  + Inexpensive
  + Gives standard errors
  + Gives graphical/statistical fit analysis

MICROSCALE | Mediax Interactive Technologies (1986) | 1P, multicategory | Unconditional Maximum Likelihood | PC
  • PC version of BICAL
  • Data can be input in a spreadsheet

PML | Gustafsson (1980a) | 1P | Conditional Maximum Likelihood | Unknown
  + Estimates are consistent
  - Computationally intensive
  • Not widely used in the U.S.

RASCAL | Assessment Systems Corp. (1988) | 1P | Unconditional Maximum Likelihood | PC
  + Includes analyses of fit
  • Incorporated in the MicroCAT package

RIDA | Glas (1990) | 1P | Conditional or Marginal Maximum Likelihood | PC
  + Provides a complete analysis of examinees and items
  + Handles incomplete designs for test equating
  + Includes fit analysis

ANCILLES | Urry (1974, 1978) | 3P | Heuristic | Most mainframes
  + Inexpensive
  - Often deletes items/examinees
  - Estimation procedure not well grounded theoretically
  • Not widely used

ASCAL | Assessment Systems Corp. (1988) | 1P, 2P, 3P | Modified Bayesian | PC
  + Includes analysis of fit
  + Incorporated in the MicroCAT package
  • Uses Bayesian procedures

LOGIST | Wingersky (1983); Wingersky et al. (1982) | 1P, 2P, 3P | Unconditional Maximum Likelihood | IBM/CDC mainframes (Version IV)
  + LOGIST V gives standard errors
  + Flexible, many options
  + Allows omits/not reached
  - Input specifications are complex
  - Expensive to run
  - Difficult to convert for non-IBM equipment
  - Places many constraints on the parameters to obtain convergence

BILOG | Mislevy & Bock (1984) | 1P, 2P, 3P | Marginal Maximum Likelihood | IBM mainframe; PC version
  + Optional Bayes's estimates
  + Priors prevent extreme estimates
  - Expensive to run on mainframe
  - Wrong priors may give bad estimates

NOHARM | Fraser & McDonald (1988) | 1P, 2P, 3P | Least Squares | Most mainframes; PC
  + Fits a multidimensional model
  + Includes residual analysis
  - c parameter is fixed
  • Not widely used in the U.S.

MULTILOG | Thissen (1986) | Multicategory | … | IBM mainframe
  • Generalization of BILOG to handle multicategory data

MIRTE | Carlson (1987) | 1P, 2P, 3P | Unconditional Maximum Likelihood | IBM mainframe; PC
  + Fits a multidimensional model
  + Gives standard errors
  + Includes residual analysis
  - c parameter is fixed
Exercises for Chapter 3

1. For the five items given in Table 3.1, the responses of an examinee are 1 0 0 1 1.
   a. What is the likelihood function for this examinee? State the assumption that must be made in determining the likelihood function.
   b. Plot the likelihood function at θ values from -1 to 0 in increments of 0.1. Based on the graph, determine the maximum likelihood estimate of θ.

2. The item parameters (obtained using a two-parameter model) for four items are given in Table 3.3.

TABLE 3.3

Item    b     a
 1     0.0   1.0
 2     1.0   1.0
 3     1.0   2.0
 4     1.5   2.0

The maximum likelihood estimate of ability for an examinee who takes this four-item test is 1.5.
   a. Determine the standard error of the estimate.
   b. Construct a 95% confidence interval for θ.

3. Consider three examinees with ability values θ = -1, 0, 1. The responses of the three examinees to an item are 0, 0, 1, respectively. Assume that the one-parameter model with a certain (unknown) b value fits the item.
   a. Write down the likelihood function in terms of the unknown b value, and state the assumptions that are made.
   b. Plot the likelihood function at b values of 0 to 1 in increments of 0.1. Based on the plot, determine the maximum likelihood estimate of b.

4. a. For the one-parameter model, write down the information and standard error of the item difficulty estimate.
   b. Compute the standard error of the difficulty parameter estimate for the data given in Exercise 3.
Answers to Exercises for Chapter 3

1. a. Since we are looking at the response of one examinee on five items, we make the assumption that local independence holds.
   b. See Table 3.4 (L = likelihood).

TABLE 3.4

θ   -1.0   -0.9   -0.8   -0.7   -0.6   -0.5   -0.4   -0.3   -0.2   -0.1    0
L   0.201  0.213  0.225  0.234  0.241  0.244  0.243  0.238  0.228  0.213  0.195

   The maximum likelihood estimate is θ = -0.45.

2. a. I(θ) = D² Σ a_i² P_i Q_i = 5.19. SE(θ) = 1/√5.19 = 0.44.
   b. 95% confidence interval for θ = θ ± (1.96)SE = 1.5 ± (1.96)(0.44) = 1.5 ± 0.86 = (0.64, 2.36)

3. a. Since the responses of different examinees are independent, and θ₁, θ₂, θ₃ are given,

   P(U₁, U₂, U₃ | θ₁, θ₂, θ₃) = P(U₁|θ₁) P(U₂|θ₂) P(U₃|θ₃).

   The likelihood function is, therefore,

   L(u₁, u₂, u₃ | θ₁, θ₂, θ₃) = Q₁ Q₂ P₃
     = [1/(1 + e^{1.7(-1-b)})] [1/(1 + e^{1.7(0-b)})] [e^{1.7(1-b)}/(1 + e^{1.7(1-b)})]

   b. See Table 3.5.

TABLE 3.5

b   0      0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
L   0.357  0.386  0.411  0.432  0.447  0.455  0.458  0.454  0.444  0.429  0.409

   The maximum value of the likelihood occurs at b = 0.6. Therefore, the maximum likelihood estimate of b is 0.6.

4. a. I(b) = D² Σ (j = 1 to N) P(θ_j)Q(θ_j); SE(b) = 1/√I(b)
   b. I(b) = 2.89 (0.062 × 0.938 + 0.265 × 0.735 + 0.664 × 0.336) = 1.376; SE(b) = 0.85
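The arithmetic in Answers 2 and 4 is easy to check by machine. The following sketch reproduces the book's numbers from the Table 3.3 item parameters and the Exercise 3 data (with D = 1.7 throughout):

```python
import numpy as np

D = 1.7

def p(theta, a, b):
    """Two-parameter logistic ICC (a = 1 gives the one-parameter model)."""
    return 1.0 / (1.0 + np.exp(-D * a * (theta - b)))

# Answer 2: test information and SE at theta-hat = 1.5 (Table 3.3 items)
a = np.array([1.0, 1.0, 2.0, 2.0])
b = np.array([0.0, 1.0, 1.0, 1.5])
P = p(1.5, a, b)
info = D**2 * np.sum(a**2 * P * (1.0 - P))
se = 1.0 / np.sqrt(info)
print(round(info, 2), round(se, 2))      # close to the book's 5.19 and 0.44

# Answer 4: information and SE of the difficulty estimate b-hat = 0.6
thetas = np.array([-1.0, 0.0, 1.0])
Pb = p(thetas, 1.0, 0.6)
info_b = D**2 * np.sum(Pb * (1.0 - Pb))
se_b = 1.0 / np.sqrt(info_b)
print(round(info_b, 3), round(se_b, 2))  # close to the book's 1.376 and 0.85
```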
4 Assessment of Model-Data Fit

Item response theory (IRT) has great potential for solving many problems in testing and measurement. The success of specific IRT applications is not assured, however, simply by processing test data through one of the computer programs described in Table 3.2. The advantages of item response models can be obtained only when the fit between the model and the test data of interest is satisfactory. A poorly fitting IRT model will not yield invariant item and ability parameters.

In many IRT applications reported in the literature, model-data fit and the consequences of misfit have not been investigated adequately. As a result, less is known about the appropriateness of particular IRT models for various applications than might be assumed from the voluminous IRT literature. In some cases goodness-of-fit studies have been conducted using what now appear to be inappropriate statistics (see, for example, Divgi, 1986; Rogers & Hattie, 1987), which may have resulted in erroneous decisions about the appropriateness of the model applied.

A further problem with many IRT goodness-of-fit studies is that too much reliance has been placed on statistical tests of model fit. These tests have a well-known and serious flaw: their sensitivity to examinee sample size. Almost any empirical departure from the model under consideration will lead to rejection of the null hypothesis of model-data fit if the sample size is sufficiently large. If sample sizes are small, even large model-data discrepancies may not be detected due to the low statistical power associated with significance tests. Moreover, parameter estimates based on small samples will be of limited usefulness because they will have large standard errors.
In addition, the sampling distributions of some IRT goodness-of-fit statistics are not what they have been claimed to be; errors may be made when these statistics are interpreted in light of tabulated values of known distributions (see, for example, Divgi, 1986; Rogers & Hattie, 1987).
TABLE 4.1 Number of Misfitting Items Detected Using the Q1 Statistic

Sample Size   Slight Misfit (2% to 3%)   Minor Misfit (4% to 5%)
 150                3                          4
 300                5                          6
 600               10                         11
1200               11 (22%)                    …
2400               18 (36%)                    …

The sensitivity of goodness-of-fit statistics to sample size is illustrated in Table 4.1. A computer program, DATAGEN (Hambleton & Rovinelli, 1973), was used to simulate the item responses of 2400 examinees on a 50-item test. The items were described by two-parameter logistic ICCs, and examinee ability was simulated to have a standard normal distribution. Two simulated tests were generated: In the first, item discrimination parameters were set at either 0.90 or 1.10, with equal numbers of items at each value. This difference corresponds to a 2% or 3% difference, on the average, in the ICCs, for ability scores over the interval [-3, 3]. In the second simulated test, item discrimination parameters were set at either 0.80 or 1.20, again with equal numbers of items at each value. This difference corresponds to a 4% or 5% difference, on the average, in the ICCs, over the interval [-3, 3]. With these item discrimination values, the test data represented "slight" and "minor" departures from the assumptions of the one-parameter model. Item difficulty values were chosen to be similar to those commonly found in practice (-2.00 to +2.00).

The one-parameter model was fitted to the generated two-parameter data, and ability estimates were obtained for five overlapping samples of examinees: the first 150, the first 300, the first 600, the first 1200, and the total sample of 2400 examinees. Then, with each of the five data sets, the Q1 statistic (Yen, 1981) was used to determine the number of misfitting test items at the 0.05 level of significance. The statistics in Table 4.1 clearly show the influence of sample size on detection of model-data misfit. With small samples,
almost no items were detected or identified as misfitting the model; considerably more items were detected with large samples. With large sample sizes, however, even minor empirical departures from the model will result in many items
being identified as misfitting, although in practice they would function quite acceptably.

Fortunately, an alternative exists to placing undue emphasis on the results of significance tests in choosing IRT models. Hambleton and Swaminathan (1985) have recommended that judgments about the fit of the model to the test data be based on three types of evidence:

1. Validity of the assumptions of the model for the test data
2. Extent to which the expected properties of the model (e.g., invariance of item and ability parameters) are obtained
3. Accuracy of model predictions using real and, if appropriate, simulated test data

Some promising goodness-of-fit analyses for amassing the three types of useful evidence have been described by Hambleton (1989) and Hambleton and Swaminathan (1985) and are summarized in Table 4.2. Evidence of the first type, bearing on the assumptions of the model, often can be helpful in selecting IRT models for use in investigating the second and third types of evidence. Evidence of the second type, involving investigations of parameter invariance, is essential regardless of the intended application, since all IRT applications depend on this property. Evidence of the third type involves assessment of the extent to which the IRT model accounts for, or explains, the actual test results and helps in understanding the nature of model-data discrepancies and their consequences. Fitting more than one model to the test data and comparing the results to the results obtained with simulated data that were generated to fit the model of interest are especially helpful activities in choosing an appropriate model (see, for example, Hambleton & Rogers, in press).

Checking Assumptions

Model selection can be aided by an investigation of the principal assumptions underlying the popular unidimensional item response models.
Two assumptions common to all these models are that the data are unidimensional and the test administration was not speeded. An additional assumption of the two-parameter model is that guessing is minimal; a further assumption for the one-parameter model is that all item discrimination indices are equal.
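The equal-discrimination assumption lends itself to a quick descriptive screen. In the sketch below, everything is simulated for illustration: responses are generated from a two-parameter model with deliberately unequal a-values, and the distribution of item-test point-biserial correlations from a standard item analysis is then inspected.

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented 0/1 response matrix: 400 examinees by 15 items generated
# from a two-parameter model with unequal discriminations.
n, k = 400, 15
theta = rng.standard_normal(n)
a = rng.uniform(0.5, 1.5, k)
b = rng.uniform(-1.5, 1.5, k)
P = 1.0 / (1.0 + np.exp(-1.7 * a * (theta[:, None] - b)))
X = (rng.random((n, k)) < P).astype(float)

# Item-test (point-biserial) correlations
total = X.sum(axis=1)
r_pb = np.array([np.corrcoef(X[:, i], total)[0, 1] for i in range(k)])

# A reasonably homogeneous distribution of these correlations supports
# a model assuming equal a-values; a wide spread argues against it.
print(np.round(r_pb, 2), round(float(r_pb.max() - r_pb.min()), 2))
```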
TABLE 4.2 Approaches for Assessing Goodness of Fit

Checking Model Assumptions

1. Unidimensionality
• Eigenvalue plot (from largest to smallest) of the interitem correlation matrix (tetrachoric correlations are usually preferable to phi correlations). The plot of eigenvalues is studied to determine whether a dominant first factor is present (Reckase, 1979).
• Comparison of the plots of eigenvalues from the interitem correlation matrix using the test data, and an interitem correlation matrix of random data (the random data consist of random normal deviates in a data set with the same sample size and with the same number of variables as the test data). The two eigenvalue plots are compared. If the unidimensionality assumption is met in the test data, the two plots should be similar except for the first eigenvalue of the plot of eigenvalues for the real data. The first eigenvalue should be substantially larger than its counterpart in the random data plot (Horn, 1965). Recent modifications and examples of this method can be found in the work of Drasgow and Lissak (1983).
• Investigation of the assumption of local independence by checking the variance-covariance or correlation matrix for examinees within different intervals on the ability or test score scale (McDonald, 1981; Tucker, Humphreys, & Roznowski, 1986). The entries in the off-diagonal elements of the matrices will be small and close to zero when the unidimensionality assumption is (approximately) met.
• Fitting a nonlinear one-factor analysis model to the interitem correlation matrix and studying the residuals (Hattie, 1985; McDonald, 1981). Promising results from this approach were obtained by Hambleton and Rovinelli (1986).
• Using a method of factor analysis based directly on IRT (Bock, Gibbons, & Muraki, 1988): A multidimensional version of the three-parameter normal ogive model is assumed to account for the vector of item responses. Estimation of model parameters is time-consuming and complicated, but results can be obtained, and the results to date have been promising. Of special interest is the fit of a one-dimensional solution to the test data.
• Items that appear likely to violate the assumption are checked to see whether they function differently. The b-values for these items are calibrated separately as a subtest and then again in the full test. The context of item calibration is unimportant if model assumptions are met. If the plot of b-values calibrated in the two contexts is linear with the spread comparable with the standard errors associated with the item parameter estimates, the unidimensionality assumption is viable (Bejar, 1980).

2. Equal Discrimination Indices
• The distribution of item-test score correlations (biserial or point-biserial correlations) from a standard item analysis can be reviewed. When the distribution is reasonably homogeneous, the selection of a model that assumes equal item discrimination may be viable.
,-\" A.I.\\'('HIlIt'/11 \"l \"'\"II('I-O\"la hI 57 TABLE 4.2 (Conlinued) - - - - - - - -•.- - - - - ['''.uiM,' M\"lh\"d, J. Minimal Guessing The lJerfOriHAUCe of low~ahililY ~ludl'llls nn the rHosf difficult il~n1s call hoe, ch\"cketl. If performnnce levels are dose to zero, the asslImption is viahle. Plots of item·test score regressions can he helpful (Baker, 1964, 1965). Near- zero item performance for low-scoring eXllminees will le/HI support for the via- hility of the as,~umption. The test diffk-lIlty. time (jmits, Hntl it('111 fOUIlllt shollid he reviewed to assess the possible role of guessing in test performance, 4. Nonspeeded (power) Test Administration The variance of numher of omitted item~ .~hould he compared to the variance of number of items answered incorrectly (Gulliksen, 1(50). The assumption is nlet when the ratio is close to 1.ero. • The test scores of examinees under the specified time limit and without a time limit are compared. High overlap in pl'rformance is evidence for the viahility of the assumption. • The percentage of euminees completing the test, percenwge of examinees com- pleting 75% of the test, and the number of items completed hy RO% of the exami- nees are reviewed. When nearly all e~aminees compkte nearly all of the items. speed is assllmed to be lin unimportant fllctor in test performance. Cht'cKing Expected Model Feature., I. Invarillnce of Ahility Parameter Estimates • Ability estimates are compared for different samples of test items (for example, hard and easy items; or tests reOet·ting differelll content categories within the item pool of interest). Invariance is estahlished when the estimates do nOl differ in e.~cess of the measurement errors associated with the estimates (Wright, 1968). 2. 
Invariance of Item Parameter Estimates
• Comparisons of model item parameter estimates (e.g., b-values, a-values, and/or c-values) obtained in two or more subgroups of the population for whom the test is intended (for example, males and females; blacks, whites, and Hispanics; instructional groups; high- and low-test performers; examinees in different geographic regions). When the estimates are invariant, the plot should be linear with the amount of scatter reflecting errors due to the sample size only. Baseline plots can be obtained by using randomly equivalent samples (Shepard, Camilli, & Williams, 1984).

Checking Model Predictions of Actual and Simulated Test Results
• Investigation of residuals and standardized residuals of model fit to a data set. Determination of the nature of model misfit is of value in choosing a satisfactory IRT model (see Hambleton & Swaminathan, 1985; Ludlow, 1985, 1986; Wright & Stone, 1979).
• Comparisons of observed and predicted test score distributions, obtained from assuming all model parameter estimates are correct. Chi-square statistics (or other statistics) or graphical methods can be used to report the results (Hambleton & Traub, 1973).
• Investigations of the effects of item placement (Kingston & Dorans, 1984; Yen, 1980), practice effects, test speededness and cheating (Drasgow, Levine, & McLaughlin, 1987), boredom (Wright & Stone, 1979), curriculum (Phillips & Mehrens, 1987), poor choice of model (Wainer & Thissen, 1987), recency of instruction (Cook, Eignor, & Taft, 1988), cognitive processing variables (Tatsuoka, 1987), and other threats to the validity of IRT results can be carried out and used to provide evidence appropriate for addressing particular IRT model uses.
• Scatterplot of ability estimates and corresponding test scores. The relationship should be strong with scatter around the test characteristic curve (reflecting measurement error) when the fit is acceptable (Lord, 1974).
• Application of a myriad of statistical tests to determine overall model fit, item fit, and person fit (see, for example, Andersen, 1973; Gustafsson, 1980b; Ludlow, 1985, 1986; Traub & Wolfe, 1981; Wright & Stone, 1979; Yen, 1981).
• Comparisons of true and estimated item and ability parameters using computer simulation methods (Hambleton & Cook, 1983).
• Investigations of model robustness using computer simulation methods. For example, the implications of fitting one-dimensional IRT models to multidimensional data can be studied (Ansley & Forsyth, 1985; Drasgow & Parsons, 1983).

Methods of studying these assumptions are summarized in Table 4.2.
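The eigenvalue-comparison method of Horn (1965) listed in Table 4.2 can be sketched as follows. The "test data" here are simulated from a unidimensional model, so only the first real-data eigenvalue should exceed its random-data counterpart; phi correlations are used for simplicity, although the table notes that tetrachorics are usually preferable.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented test data: 500 examinees, 20 items driven by one ability
n, k = 500, 20
theta = rng.standard_normal(n)
b = np.linspace(-1.5, 1.5, k)
P = 1.0 / (1.0 + np.exp(-1.7 * (theta[:, None] - b)))
X = (rng.random((n, k)) < P).astype(float)
eig_real = np.sort(np.linalg.eigvalsh(np.corrcoef(X, rowvar=False)))[::-1]

# Random normal data of the same dimensions (Horn, 1965)
R = rng.standard_normal((n, k))
eig_rand = np.sort(np.linalg.eigvalsh(np.corrcoef(R, rowvar=False)))[::-1]

# Unidimensionality is supported when only the first real-data
# eigenvalue clearly exceeds its random-data counterpart.
print(np.round(eig_real[:3], 2), np.round(eig_rand[:3], 2))
```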
Regarding the assumption of unidimensionality, Hattie (1985) provided a comprehensive review of 88 indices for assessing unidimensionality and concluded that many of the methods in the older psychometric literature are unsatisfactory; methods based on nonlinear factor analysis and the analysis of residuals are the most successful. The methods described in Table 4.2 for assessing unidimensionality appear to be among the most promising at this time. Considerable research remains to be conducted on this topic, however.

The checks of other model assumptions are more straightforward. The methods in Table 4.2 use descriptive evidence provided by classical item statistics, but they still can be informative. For example, in an analysis of NAEP mathematics items (Hambleton & Swaminathan, 1985), it was learned that the item biserial correlations ranged from 0.02
\", .'i9 I or1\\,1.11''-1'111('1/1 Mllt/d·[)(}/(! FII to 0.1)71 This infonmJlioll indicated that iI was highly unlikely that a Oil!' parameter model would fif the test data. Checkill~ Invariance The invariance of model parameters can be assessed by means of several straightforward methods. Two of these methods are highlighted in the next section. The invariance of ability parameters can be studied by IIdministering examinees tW() (or mort\") item sets in which the items in each set vary widely ill difficulty. The item sets are constructed from the pool of test items over which ability is defined (Wright. 1968). It is common to conduct this type of study by administering both sets of lesl items 10 examinees within the SlIllIe les!. Ahility eslimates nre obtained for each examinee. one from each set of items. Then the pairs of ability estimates are ploUed on a graph. This plot should define a straight line with a slope of I because the expected ability score for each examinee docs not depend on the choice of test items (provided the item response model under investigation fits the test data). Some scalier of points about the line is to be expected, however, because of measurement error. When a linear relationship with a slope of I and an intercept of 0 is nol obtained, or the scalier exceeds Ihat eXI)ected from knowledge of the standard errors of the ability eSlimates, one or more of the assumptions underlying the item response model may not hold for the data set. Checking Model Predictions Several methods of checking model predictions are described in Table 4.2, One of the most promising of these me1hods involves the analysis of item residuals. In this method, an item response model is chosen. item and ability parameters are estimated. and predictions about the performance of various ability groups are made. assuming the validity of the chosen model. 
Predicted results are compared then with actual results (see, for example, Hambleton & Swaminathan, 1985; Kingston & Dorans, 1985). A residual r_ij (sometimes called a raw residual) is the difference between observed item performance for a subgroup of examinees and the subgroup's expected item performance:

r_ij = p_ij − E(p_ij)

where i denotes the item, j denotes the ability category (subgroup), p_ij is the observed proportion of correct responses on item i in the jth ability category, and E(p_ij) is the expected proportion of correct responses obtained using the hypothesized item response model. The parameters of the hypothesized model are estimated, and the estimates are used to calculate the probability of a correct response. This probability is taken as the expected proportion correct for the ability category. In practice, the ability continuum usually is divided into intervals of equal width (10 to 15 intervals) for the purpose of computing residuals. The intervals should be wide enough that the number of examinees in each interval is not too small, since statistics may be unstable in small samples. On the other hand, the intervals should be narrow enough that the examinees within each category are homogeneous in terms of ability. The observed proportion correct is obtained by counting the number of examinees in an ability category who got the item right and dividing by the number of examinees in the category. To determine the expected proportion correct in an ability category, a θ-value is needed. One approach is to use the midpoint of the ability category as a representative ability value for the category and to compute the probability of a correct response using this value. Alternatively, the probability of a correct response for each examinee within the ability category can be obtained, and the average of these probabilities can be used as the expected proportion. A limitation of the raw residual is that it does not take into account the sampling error associated with the expected proportion-correct score within an ability category.
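The residual computation just described can be sketched in a few lines. This is a sketch under assumed values: the three-parameter logistic form uses the usual scaling constant D = 1.7, and the item parameters and response vector below are hypothetical, chosen only to illustrate the calculation.

```python
import math

def p3pl(theta, a, b, c, D=1.7):
    """Three-parameter logistic probability of a correct response."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def raw_residual(responses, theta_mid, a, b, c):
    """Raw residual for one ability category: observed minus expected
    proportion correct, using the category midpoint as the theta-value."""
    observed = sum(responses) / len(responses)   # proportion answering correctly
    expected = p3pl(theta_mid, a, b, c)          # model-based expected proportion
    return observed - expected

# Hypothetical item (a = 1.0, b = 0.0, c = 0.2) and a category centered at
# theta = 0.5 in which 14 of 20 examinees answered correctly:
r = raw_residual([1] * 14 + [0] * 6, 0.5, 1.0, 0.0, 0.2)
```

The alternative approach mentioned above, averaging each examinee's predicted probability rather than evaluating the model at the interval midpoint, would simply replace the `expected` line with a mean taken over the examinees' individual θ estimates.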
To take this sampling error into account, the standardized residual z_ij is computed by dividing the raw residual by the standard error of the expected proportion correct, that is,

z_ij = (p_ij − E(p_ij)) / √[E(p_ij)(1 − E(p_ij)) / N_j]

where N_j is the number of examinees in ability category j. When choosing an IRT model, a study of residuals, standardized residuals (residuals divided by their standard errors), or both, obtained
for several models, can provide valuable information, as will be demonstrated in the next section. Statistical tests, usually chi-square tests, also are applied to determine model-data fit. An extensive review of goodness-of-fit statistics is provided by Traub and Lam (1985) and Traub and Wolfe (1981). The Q1 chi-square statistic (Yen, 1981) is typical of the chi-square statistics proposed by researchers for addressing model fit. The Q1 statistic for item i is given as

Q1_i = Σ_{j=1}^{m} N_j [p_ij − E(p_ij)]² / {E(p_ij)[1 − E(p_ij)]}    (4.1)

where examinees are divided into m ability categories on the basis of their ability estimates. p_ij and E(p_ij) were defined earlier. The statistic Q1 is distributed as a chi-square with degrees of freedom equal to m − k, where k is the number of parameters in the IRT model. If the observed value of the statistic exceeds the critical value (obtained from the chi-square table), the null hypothesis that the ICC fits the data is rejected and a better fitting model must be found.

Examples of Goodness-of-Fit Analyses

The purpose of this section is to provide an example of procedures for investigating model-data fit using 75 items from the 1982 version of the New Mexico High School Proficiency Test. The items on this test are multiple-choice items with four choices, and we had access to the item responses of 2000 examinees. Normally, the first steps in the investigation would be as follows:

1. Conduct a classical item analysis.
2. Determine the dominance of the first factor, and check other IRT model assumptions.
3. Make a preliminary selection of promising IRT models.
4. Obtain item and ability parameter estimates for the models of interest.
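The standardized residual and the Q1 statistic of Equation 4.1 can be sketched together in a few lines. The category counts and proportions below are hypothetical illustrations; the chi-square critical value 5.99 for df = 2 at α = .05 is a standard table value.

```python
import math

def standardized_residual(p_obs, p_exp, n_j):
    """z_ij: raw residual divided by the standard error of the expected
    proportion correct, sqrt[p_exp * (1 - p_exp) / N_j]."""
    return (p_obs - p_exp) / math.sqrt(p_exp * (1 - p_exp) / n_j)

def q1_statistic(sizes, p_obs, p_exp):
    """Yen's Q1 (Equation 4.1) for one item: the sum over m ability
    categories of N_j (p_ij - E(p_ij))^2 / [E(p_ij)(1 - E(p_ij))]."""
    return sum(n * (p - e) ** 2 / (e * (1 - e))
               for n, p, e in zip(sizes, p_obs, p_exp))

# Hypothetical item with m = 4 ability categories under a 2P model (k = 2):
sizes = [50, 50, 50, 50]
p_obs = [0.30, 0.45, 0.62, 0.81]
p_exp = [0.28, 0.47, 0.60, 0.83]
q = q1_statistic(sizes, p_obs, p_exp)
df = len(sizes) - 2              # degrees of freedom = m - k
# q falls far below the chi-square critical value for df = 2 (5.99 at
# alpha = .05), so the hypothesis that the ICC fits would not be rejected.
```

Note that Q1 is simply the sum across ability categories of the squared standardized residuals, which is why the two quantities carry much the same diagnostic information.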
The results of the item analysis are reported in Appendix A. If we had found substantial variation in the item point-biserial correlations, our interest in the one-parameter model would have been low. If all of the items were relatively easy, or if the test had consisted of short free-response items, we probably would not have worked with the three-parameter model, at least at the outset. The item analysis reveals that the variation in item difficulties and discrimination indices is substantial and, therefore, the one-parameter model may not be appropriate. Nevertheless, for illustrative purposes, we will fit the one-, two-, and three-parameter models and compare the results. In general, comparisons of the fits of different models will facilitate the choice of an appropriate model. Figure 4.1 clearly shows the dominance of the first factor. The largest eigenvalue of the correlation matrix for the 75 items is over five times larger than the second largest, and the second largest eigenvalue is hardly distinguishable from the smaller ones. Had the plot of eigenvalues produced a less conclusive result, the method of Drasgow and Lissak (1983) should have been used. In this method the plot of eigenvalues resulting from a correlation matrix derived from (uncorrelated) normal deviates is obtained and is used to provide an indication of the eigenvalues that result from chance factors alone. This plot serves as a baseline for interpreting eigenvalues and (ultimately) the dimensionality of the real data. Appendix A contains the item parameter estimates obtained from fitting one-, two-, and three-parameter logistic models. These statistics were obtained by using LOGIST and scaling the ability scores to a mean of 0 and a standard deviation of 1. The next activity was to investigate the invariance of the item parameters for the three-parameter model.
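The Drasgow and Lissak (1983) baseline just described can be sketched as follows. This is a sketch only: the sample size and item count echo the example above, numpy is assumed to be available, and the random seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def baseline_eigenvalues(n_examinees, n_items):
    """Eigenvalues of the correlation matrix computed from uncorrelated
    normal deviates: a baseline showing the eigenvalue sizes that arise
    from chance factors alone."""
    noise = rng.normal(size=(n_examinees, n_items))
    eigenvalues = np.linalg.eigvalsh(np.corrcoef(noise, rowvar=False))
    return np.sort(eigenvalues)[::-1]

eigs = baseline_eigenvalues(2000, 75)
# With no real factor present, even the largest eigenvalue stays close to 1;
# a dominant first factor in real data should tower over this baseline plot.
```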
(Similar analyses were carried out with the one- and two-parameter models but are not reported here.) The sample of 2000 examinees was split into two randomly equivalent groups of 1000. In a second split, two ability groups were formed: the top half of the distribution and the bottom half of the distribution. A total of four groups of 1000 was available for subsequent analyses. Through use of the ability scores obtained from the total-group analysis, four three-parameter-model LOGIST analyses were conducted, one with each group, to obtain item parameter estimates. Figure 2.6 provides the baseline information for the b parameter. This figure provides an indication of the variability that could be expected in the item parameter estimates due to the use of randomly equivalent groups of
[Figure 4.1. Plot of the Largest 15 Eigenvalues]

size n = 1000, that is, due to sampling error. If item difficulty parameter invariance has been obtained, a scatterplot similar to that shown in Figure 2.6 should be obtained from the high- and low-performing groups of examinees. In fact, Figures 2.6 and 4.2 are quite similar, indicating that item parameter invariance is present. What also is revealed in Figure 4.2 is that item parameters for easy items are not well estimated in the high-performing group, nor are those for the hard items in the low-performing group, as demonstrated by the "dumbbell"-shaped scatterplots. The implications for parameter estimation are clear: Heterogeneous samples of examinees are needed to obtain stable item parameter estimates (see, for example, Stocking, 1990). Invariance of ability parameters across different samples of items was investigated next. Invariance of ability parameters over randomly equivalent forms (e.g., ability estimates based on examinee performance on the odd-numbered items and on the even-numbered items) indicates the variability due to the sampling of test items. A more rigorous test of invariance would be a comparison of ability estimates over (say) tests consisting of the easiest and hardest items in the item bank.
[Figure 4.2. Plot of 3P Item Difficulty Values Based on Samples of Differing Ability (axes: Low Group Difficulty vs. High Group Difficulty)]

Figures 4.3 and 4.4 provide comparisons between ability estimates obtained with the randomly equivalent subtests and the hard versus easy subtests for the three-parameter model. Item parameter estimates used in the calculation of ability scores were obtained from the total sample (N = 2000) and are reported in the Appendix. These comparisons provide evidence of the invariance of ability parameters over tests of varying difficulty (note that the two plots are similar and scattered about the line with slope 1). These plots also show the generally large errors in ability estimation for low- and high-ability examinees (Figure 4.3) and even larger errors in ability estimation for low-ability examinees on hard tests and for high-ability examinees on easy tests (Figure 4.4). These findings may have more to say about improper test design than parameter invariance, however. Based on the plots, it appears that item and ability parameter invariance was obtained with the three-parameter model. The plots also indicate that satisfactory ability estimation requires that examinees be administered test items that are matched with their ability levels and that satisfactory item parameter estimation requires heterogeneous ability distributions.
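A minimal numerical version of the invariance check shown in these plots is to regress one group's difficulty estimates on the other's and inspect the fitted line. The b-values below are hypothetical; under invariance the scatter should fall on a line with slope 1 and intercept 0.

```python
import numpy as np

def invariance_check(b_group1, b_group2):
    """Least-squares line through the scatter of paired difficulty
    estimates, plus their correlation."""
    slope, intercept = np.polyfit(b_group1, b_group2, deg=1)
    r = np.corrcoef(b_group1, b_group2)[0, 1]
    return slope, intercept, r

# Hypothetical 3P difficulty estimates from low- and high-ability groups:
b_low = np.array([-1.2, -0.5, 0.1, 0.8, 1.5])
b_high = np.array([-1.1, -0.6, 0.2, 0.7, 1.6])
slope, intercept, r = invariance_check(b_low, b_high)
# A slope near 1 and an intercept near 0 are consistent with invariance;
# scatter beyond what sampling error predicts would argue against it.
```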
[Figure 4.3. Plot of 3P Ability Estimates Based on Equivalent Test Halves (Odd vs. Even Items)]

[Figure 4.4. Plot of 3P Ability Estimates Based on Tests of Differing Difficulty (Hardest vs. Easiest Items)]
[Figure 4.5. Observed and Expected Proportion Correct (1P Model) for Item 6]

Perhaps the most valuable goodness-of-fit data of all are provided by residuals (and/or standardized residuals). Normally, these are best interpreted with the aid of graphs. Figures 4.5 to 4.7 provide the residuals (computed in 12 equally spaced ability categories between -3.0 and +3.0) obtained with the one-, two-, and three-parameter models, respectively, for Item 6. The best fitting ICCs using the item parameter estimates given in the Appendix also appear in the figures. When the residuals are small and randomly distributed about the ICC, we can conclude that the ICC fits the item performance data. Figure 4.5 clearly shows that the one-parameter model does not match examinee performance data at the low and high ends of the ability scale. The fit is improved with the two-parameter model (Figure 4.6) because the discrimination parameter adjusts the slope of the ICC. The fit is further improved with the three-parameter model (Figure 4.7) because the c parameter takes into account the performance of the low-ability examinees. An analysis of residuals, as reflected in Figures 4.5 to 4.7, is helpful in judging model-data fit. Figures 4.8 to 4.10 show the plots of standardized residuals against ability levels obtained with the one-, two-, and three-parameter models, respectively, for Item 6. The observed pattern of standardized residuals shown in Figure 4.8 is due to the fact
[Figure 4.6. Observed and Expected Proportion Correct (2P Model) for Item 6]

[Figure 4.7. Observed and Expected Proportion Correct (3P Model) for Item 6]
[Figure 4.8. 1P Standardized Residuals for Item 6]

[Figure 4.9. 2P Standardized Residuals for Item 6]
[Figure 4.10. 3P Standardized Residuals for Item 6]

that the item is less discriminating than the average level of discrimination adopted for all items in the one-parameter model. Clear improvements are gained by using the two-parameter model. The gains from using the three-parameter over the two-parameter model are much smaller but noticeable. Item 6 was selected for emphasis because of its pedagogical value, but in general the two- and three-parameter models fit data for the 75 test items better than the one-parameter model. With 12 ability categories and a 75-item test, 900 standardized residuals were available for analysis. The expected distribution of standardized residuals under the null hypothesis that the model fits the test data is unknown, although one might expect the distribution of standardized residuals to be (approximately) normal with mean 0 and standard deviation 1. Rather than make the assumption of a normal distribution, however, it is possible to use computer simulation methods to generate a distribution of standardized residuals under the null hypothesis that the model fits the data, and use this distribution as a basis for interpreting the actual distribution. To generate the distribution of standardized residuals for the one-parameter model when the model fits the test data, item and ability
[Figure 4.11. Frequency Distributions of Standardized Residuals for 1P Real and Simulated Data]

parameter estimates for the model (reported in the Appendix) are assumed to be true. Item response data then can be generated (Hambleton & Rovinelli, 1973) using these parameter values, and a one-parameter model fitted to the data. Standardized residuals are obtained, and the distribution of standardized residuals is formed. This distribution serves as the empirically generated "sampling distribution" of standardized residuals under the null hypothesis that the model fits the data. This distribution serves as the baseline for interpreting the distribution of standardized residuals obtained with the real test data. In Figure 4.11, the real and simulated distributions of standardized residuals for the one-parameter model are very different. The simulated data were distributed normally; the real data were distributed more uniformly. Clearly, since the distributions are very different, the one-parameter model does not fit the data. Figures 4.12 and 4.13 show the real and simulated distributions of standardized residuals obtained with the two- and three-parameter models, respectively. The evidence is clear that substantial improvements in fit are obtained with the more general models, with the three-parameter model fitting the data very well. The real and simulated distributions for the three-parameter model are nearly identical.
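The simulation baseline described above can be sketched as follows. This is a simplified sketch: the fitted 1P item and ability estimates are treated as true and residuals are recomputed directly from them, omitting the intermediate step of refitting the model to the generated data; all parameter values below are hypothetical.

```python
import math
import random

random.seed(1)

def p1pl(theta, b, D=1.7):
    """One-parameter logistic ICC."""
    return 1 / (1 + math.exp(-D * (theta - b)))

def null_residuals(b_items, categories):
    """Standardized residuals computed from data generated by the model
    itself; pooled over items and categories, they form an empirical
    'sampling distribution' under the hypothesis that the model fits.
    `categories` is a list of (theta_midpoint, n_examinees) pairs."""
    zs = []
    for b in b_items:
        for theta, n in categories:
            p_exp = p1pl(theta, b)
            # simulate n Bernoulli responses at this ability level
            p_obs = sum(random.random() < p_exp for _ in range(n)) / n
            se = math.sqrt(p_exp * (1 - p_exp) / n)
            zs.append((p_obs - p_exp) / se)
    return zs

# 11 ability categories from -2.5 to 2.5, 150 simulated examinees in each:
cats = [(t / 2, 150) for t in range(-5, 6)]
zs = null_residuals([-1.0, 0.0, 1.0], cats)
# The simulated residuals scatter around 0 with spread near 1; their
# histogram is the baseline against which the real-data histogram is judged.
```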
[Figure 4.12. Frequency Distributions of Standardized Residuals for 2P Real and Simulated Data]

[Figure 4.13. Frequency Distributions of Standardized Residuals for 3P Real and Simulated Data]
[Figure 4.14. Plot of 1P Average Absolute Standardized Residuals Against Point-Biserial Correlations]

Other types of goodness-of-fit evidence also can be obtained. Figures 4.14, 4.15, and 4.16 show the relationship between item misfit statistics and item point-biserial correlations for the one-, two-, and three-parameter models, respectively. In this analysis, item misfit was determined by averaging the absolute values of standardized residuals obtained after fitting the model of interest to the item data. Figure 4.14 shows the inadequacy of the one-parameter model in fitting items with high or low discrimination indices. Figure 4.15 shows that the pattern of item misfit changes substantially with the two-parameter model. Figure 4.16, for the three-parameter model, is similar to Figure 4.15 except the sizes of the item misfit statistics are generally a bit smaller. The complete set of analyses described here (and others that were not described because of space limitations) are helpful in choosing an IRT model. For these test data, evidence was found that the test was unidimensional and that the fit of the three-parameter model was very good and substantially better than that of the one-parameter model and somewhat better than that of the two-parameter model. Many of the baseline results were especially helpful in judging model-data fit.
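The item misfit index used in these plots, the average absolute standardized residual for an item, is a one-line computation (the residual values in the usage example are hypothetical):

```python
def item_misfit(standardized_residuals):
    """Average of the absolute standardized residuals obtained for one
    item across the ability categories."""
    return sum(abs(z) for z in standardized_residuals) / len(standardized_residuals)

# e.g., standardized residuals from 4 ability categories for one item:
misfit = item_misfit([1.0, -1.0, 2.0, -2.0])   # 1.5
```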
[Figure 4.15. Plot of 2P Average Absolute Standardized Residuals Against Point-Biserial Correlations]

[Figure 4.16. Plot of 3P Average Absolute Standardized Residuals Against Point-Biserial Correlations]
Summary

In assessing model-data fit, the best approach involves (a) designing and conducting a variety of analyses designed to detect expected types of misfit, (b) considering the full set of results carefully, and (c) making a judgment about the suitability of the model for the intended application. Analyses should include investigations of model assumptions, of the extent to which desired model features are obtained, and of differences between model predictions and actual data. Statistical tests may be carried out, but care must be taken in interpreting the statistical information. The number of investigations that may be conducted is almost limitless. The amount of effort and money expended in collecting, analyzing, and interpreting results should be consistent with the importance and nature of the intended application.

Exercises for Chapter 4

1. Suppose that a three-parameter model was fitted to a set of test data. The item parameter estimates for a particular item were a = 1.23, b = 0.76, c = 0.25. In order to assess the fit of the model to this item, the examinees were divided into five ability groups on the basis of their ability estimates, with 20 examinees in each group. The item responses for the examinees in each ability group are given in Table 4.3.

TABLE 4.3

θ Level    Item Responses
-2.0       0000 00000 000000 1 0 1
-1.0       0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0
0.0 1.0    000 0000 00 00 1 0 1
2.0        0 1 0 0 0 1 0 00

a. Calculate the observed proportion correct at each ability level.
b. Calculate the expected proportion correct at each ability level (using the parameter estimates given).
c. Calculate the Q1 goodness-of-fit statistic for this item. What are the degrees of freedom for the chi-square test?
d. Does the three-parameter model appear to fit the data for this item?
2. Suppose that one- and two-parameter models also were fitted to the data. The item parameter estimates for the two models are given below:

One-parameter model: b = 0.17
Two-parameter model: b = 0.18; a = 0.55

a. Calculate the Q1 statistics for assessing the fit of the one- and two-parameter models (assume that the ability intervals are the same).
b. Does the one- or two-parameter model appear to fit the data?
c. Based on your results, which IRT model appears to be most appropriate for the data given?

Answers to Exercises for Chapter 4

1. a. θ = -2: p = 0.20; θ = -1: p = 0.25; θ = 0: p = 0.40; θ = 1: p = 0.75; θ = 2: p = 0.90.
   b. P(θ = -2) = 0.25; P(θ = -1) = 0.27; P(θ = 0) = 0.38; P(θ = 1) = 0.72; P(θ = 2) = 0.95.
   c. Q1 = Σ N_j [p_j − E(p_j)]² / {E(p_j)[1 − E(p_j)]}

      Q1 = 20(0.20 − 0.25)²/(0.25 × 0.75) + 20(0.25 − 0.27)²/(0.27 × 0.73) + 20(0.40 − 0.38)²/(0.38 × 0.62) + 20(0.75 − 0.72)²/(0.72 × 0.28) + 20(0.90 − 0.95)²/(0.95 × 0.05) = 1.48

      degrees of freedom = 5 − 3 = 2
   d. χ²₂(0.05) = 5.99. Since the calculated value does not exceed the tabulated value, we can conclude that the three-parameter model fits the data for this item.

2. a. Q1 = Σ N_j [p_j − E(p_j)]² / {E(p_j)[1 − E(p_j)]}
For the one-parameter model,

Q1 = 20(0.20 − 0.02)²/(0.02 × 0.98) + 20(0.25 − 0.12)²/(0.12 × 0.88) + 20(0.40 − 0.43)²/(0.43 × 0.57) + 20(0.75 − 0.80)²/(0.80 × 0.20) + 20(0.90 − 0.96)²/(0.96 × 0.04) = 38.52

For the two-parameter model,

Q1 = 20(0.20 − 0.11)²/(0.11 × 0.89) + 20(0.25 − 0.25)²/(0.25 × 0.75) + 20(0.40 − 0.46)²/(0.46 × 0.54) + ⋯ = 2.67

b. The one-parameter model does not fit the data, but the two-parameter model does.
c. While the three-parameter model fits the data better than the two-parameter model, the two-parameter model fits the data almost as well. In the interests of parsimony, the model of choice would be the two-parameter model.
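The Q1 values for the one- and three-parameter models in these answers can be reproduced with a short script (observed and expected proportions copied from answers 1a, 1b, and 2a; group sizes of 20 from the exercise):

```python
def q1(sizes, p_obs, p_exp):
    """Yen's Q1 for one item, summed over ability categories."""
    return sum(n * (p - e) ** 2 / (e * (1 - e))
               for n, p, e in zip(sizes, p_obs, p_exp))

sizes = [20] * 5
p_obs = [0.20, 0.25, 0.40, 0.75, 0.90]   # observed proportions (answer 1a)
e_3p  = [0.25, 0.27, 0.38, 0.72, 0.95]   # 3P expected proportions (answer 1b)
e_1p  = [0.02, 0.12, 0.43, 0.80, 0.96]   # 1P expected proportions (answer 2a)

q1_3p = q1(sizes, p_obs, e_3p)   # 1.48: below 5.99, so the 3P model fits
q1_1p = q1(sizes, p_obs, e_1p)   # 38.52: the 1P model is rejected
```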
5 The Ability Scale The ultimate purpose of testing is to assign a \"score\" to an examiflee that reflects the examinee's level of attainment of a skill (or level of acquisi- tion of an attribute) 85 measured by the test. Needless to say, the assigned score must be interpreted with care and must be valid for its intended use (Linn. 1990). In typical achievement measurement, the examinee is assigned a score on the basis of his or her responses to a set of dichotomously scored items. In the classical test theory framework, the assigned score is the number of correct responses. This number-right score is an unbiased estimate of the examinee's true score on the test. In item response theory it is assumed that an examinee has an underlying \"ability\" e that determines his or her probability of giving a correct response to an item. Again, based on the responses of the examinee to a set of items. an ability score e is assigned to an examinee. Unfortu- nately. e is not as easily determined as the number-right score. Through use of the methods described in chapter 3, an estimate of ability e is obtained for an examinee. As we shall see later, [he ability 4} of an examinee is monotonically related to the examinee's true score (assum- ing that the item response model fits the data); that is, the relationship between 4} and the true score is nonlinear and strictly increasing. The general scheme for determining an examinee's ability is as follows: I. An e:JIarninee's responses to a set of items are obtained and coded as I (for correct answers) or 0 (for incorrect answers). 2. When the item parameters that chamcteri1.e an item lIrt\" assumed to be known (as happens when item parameters are availahle for a bank of items). the ahility e is eslimatcd using one of Ihe methods indicated in chapter 3. 77
711 ,'UN[)AMENTALS OF ITEM RESPONSE TlIEOln 3. Whcn (he ilem paramclers Ihal characterize Ihe ilellls are not known, Ihe ilem and ahility paranwlel's mllsl he eslinmlcd from the sallie respollse dala, and one of the procedures descrihed in chaptCl' :; must he employed. 4. The cstimatell ability vallie is reporled ~s is, or is transformed using a lincar or a nonlincar transformation to II Illore convenient !':l'llie (e.g., without ncgalivcs or dedl1l(1ls) In uid in Ihe inlnprelaljon of Ihe Sl·ore. The SAT and NAEP reporting scales are well-known examples of scales ohlained by transforming original score scales. .• At all stages of analysis and interpretation, the issue of the validity of the ahility score must be considered (Linn, 1990). Every attempt must he made to validate the ability score. The validity information available for the number-right score or any transformation of it may not he relevant or appropriate for the ability score and, hence, a validity study specifically designed for the ahility score may he needed. Refer to Hambleton and Swaminathan (1985, chapter 4) for more details on this important issue. What is the nature of the ability score? On what scale is it measured'! What transformations are meaningful? These important issues are dis- cussed next. The Nature of the Ahility Scale As mentioned above. the number-right score, denoted as X, is an unbiased estimate of the true score, 'to By definition, 'E(X) ::::: 't The number-right score, X, may be divided hy the number of items (i.c., linearly transformed) to yield a proportion-correct score. The propor- tion-correct score is meaningful and appropriate when the test is sub- divided into subtests, each with different numbers of items, measuring (say) a numher of different objectives. This is usually the practice with criterion-referenced tests. When the test is norm-referenced, other lin- ear transformations may he used to yield standard scores. 
In addition, when it is necessary to compare examinees, the score X may be trans·· formed nonlinearly to yield slanines, percentiles, and so on. While the above transformations greatly facilitate the interpretation of the score X. its major drawhack remains. The SC(lf(.~ X is not indepen-
\", The Anility Smle 79 dent of the item!l to which the examinee responds, and the transformed scores are not independent of the group of examinees to which Ihey are referenced. The ahility score 9, on the other hand, possesses sllch eindependence. As described previollsly, is independent of the partic- ular set of items administered to the examinees, and the population to which the examinee belongs. This invariance propcrty is what dis- atinguishes the score from the score X. Since it is possible to compare examinees who respond to different sets of items when using the a score, the a scale mny be thought of as an absolute scale with respect to the trait or ahility that is being measured. It is important, at this point, to discllss the natllle or meaning of the term ahility or trait. Clearly, these are lahels that descrihe what the set of test items measures. An ability or trait may be defined broadly as <lptitude or achievement, a narrowly defined achievement variable (e.g., ability to add three two-digit integers), or a personality variable (e.g., self-concept, motivation). An ability or trait is not necessnrily some- thing iunate or immutable. In fact, the term ahility or trait may be improper or misleading to the extent that it connotes a fixed character- istic of the examinee; the term pr(~ricirf/(,.v Irvel. for example, may be more appropriate in many instances. What is the nature of the scale on which 0 is defined? Clearly, the observed score X is not defined on a ratio scale. In fact, X may not even he defined on an interval scale. At best, we may treat X as heing defined 011 an ordinnl scale. The same applies to the scale on which a is defined. In some instances, however, a \"limited\" ratio-scale interpretation of the O-scale may be possible. Transformation of the O-Scale In item response llIodels, the prohabil ity of a correct response is given by the item response function, P(O). 
If, in Equations 2.2 or 2.3, a is replaced hy a' = aO + ~, b by h· = ah + ~, and a by iI\" = a/a, then P(O') = p(a) Thus. a, h. and a may be transformed linearly without altering the probability of a correct response (the implications of this \"indeter- minacy\" will be discussed further in later chllpters), mellning that
RO FUNDAMENTALS OF ITEM RESPONSE THEORY the O-scille may be transformed linearly tis I(JIlS as the item pnrameter values also are transformed appropriately. Recall that 0 is defined in the interval (-00, (0). Woodcock (1978), in defining the scale for the Woodcock-Johnson Psycho-Educational Bat- tery, employed the one-parameter model and the scale ,1 that is, used a logarithmic scale to the base 9. Since and then We :.: 9.10 + 500 Thus, the Woodcock-Johnson scale is a linear scale. The item difficul- ties were transformed in the same manner, W\" = 9.lb + 500 The We scale has the property that the differences (wo - Wh) = 20, 10, 0, -10, -20 correspond to the probabilities of correct responses of 0.90, 0.75, 0.50, 0.25, and 0.10, respectively. Wright (1977) modified this scale as w=9.IO+100 and termed it the WITs scale. The transformations of the O-scale described above are linear. Non- linear transformations of the O-scale may be more useful in some cases. Consider the nonlinear transformation
Th~ AIliTity ,kalt' HI and Ihe corresponding Iransfonllalion of Ihe difriculty parameter Then, for the one-parameter model o· e; ,/\"+ Hence, , = \" \"0+'-0-\" P(O) h- It is of interest to note that Rasch first developed the one-parameter model lIsing the form given above for the probability of success. The probability of an incorrect response on the 0' scale, Q(O') = I - P(O'), is The odds 0 for success, defined as P(O') I Q(O'). nre then Consider two examinees with ability 0; and 9; responding to an item, and denote their odds for success as 0, and O2, Then 0; and O2 \"\" e: • 0 1 :; h' h'
82 FUNDAMENTALS OF ITEM RESPONSE THEORY The ratio of their odds for success is 01 = 0; 0; 0; Thus, an examinee with twice the ability of another examinee:. measured on the O'-scale, has twice the odds of successfully answering the item. In this sense. the O'-scale has the properties of a ratio scale. The same property also holds for the item; for an examinee responding to two items with difficulty values h~ and hi (measured on the h*-scalc), the odds for success are 0 1 = 0' I b; and (h == O' I hi. The ratio of the odds for the examinee is If h; == 2h; (i.e., the first item is twice as easy as the second item), the odds for successfully answering the easy item are twice those for successfully answering the harder item. The ratio-scale property for 0'· and b·-scales. as defined above, holds only for the one-parameter model. For the two- and three-parameter models the scale must be defined differently (see Hambleton & Swaminathan. 1985). Another nonlinear transformation that is meaningful for the one- parameter model is the \"log-odds\" transformation. Since. for two ex- aminees responding to the same item 0; = 0;0 1 0; e DIl, eO(II, 92) , e /J92 In O.... .- D(OI OJ) O2 where In is the natural logarithm (to the base e). Typically, III the one-parameter model, [) is omitted so that e(ll-h) P(O) == [I + e(Il··· h»)
-') Rl Omitting /) in \"log-odds\" expressions. we have If abililies dillcr hy one poinl. then 2.718 Thus, a difference of one point on the ability scale corresponds to a r'Ktm or 2.72 in odds for success on the O-seale. Similarly, if 1111 examinee responds to two items with difficulty vlllues bl and h2' As before, a difference of one unit in item difficulties corresponds to a factor of 2.72 in the odds for success. The units on the log-odds SCIlIe are called /ORits, The logit units can be obtained directly as follows: Since e(O - h) and Q(O) P(O) - < - « - - - - - + e(O - h) thus P(O) == c(O h) lienee Q(O) In \"(0) 0- h ,,< Q(O)
Transformation to the True-Score Scale

The most important transformation of the θ-scale is to the true-score scale. Let X, the number-right score, be defined as

    X = U₁ + U₂ + … + Uₙ

where Uⱼ is the 1 or 0 response of an examinee to item j. If we denote the true score by τ, then

    τ = E(X) = E(U₁ + U₂ + … + Uₙ)

By the linear nature of the expectation operator,

    E(U₁ + U₂ + … + Uₙ) = E(U₁) + E(U₂) + … + E(Uₙ)

Now, if a random variable Y takes on values y₁ and y₂ with probabilities p₁ and p₂, then

    E(Y) = y₁p₁ + y₂p₂

Since Uⱼ takes on the value 1 with probability Pⱼ(θ) and the value 0 with probability Qⱼ(θ) = 1 − Pⱼ(θ), it follows that

    E(Uⱼ) = 1 · Pⱼ(θ) + 0 · Qⱼ(θ) = Pⱼ(θ)

Thus,

    τ = Σⱼ Pⱼ(θ)   (j = 1, …, n)

that is, the true score of an examinee with ability θ is the sum of the item characteristic curves. The true score, τ in this case, is called the
test characteristic curve because it is the sum of the item characteristic curves. In the strict sense, the above relationship holds only when the item response model fits the data. To emphasize this, the true score τ is indicated as τ | θ; that is,

    τ | θ = Σⱼ Pⱼ(θ)   (j = 1, …, n)

When no ambiguity exists, the notation τ | θ will be shortened to τ. The true score τ and θ are monotonically related; that is, the true score may be considered to be a nonlinear transformation of θ. Since each Pⱼ(θ) is between 0 and 1, τ is between 0 and n. Hence, τ is on the same scale as the number-right score, except that τ can assume non-integer as well as integer values. The transformation from θ to τ is useful in reporting ability values; instead of the θ values, τ values that lie in the range 0 to n are reported. Alternatively, π, the true proportion correct or domain score, obtained by dividing τ by the number of items n, can be reported. In this case,

    π = τ / n = (1/n) Σⱼ Pⱼ(θ)

While −∞ < θ < ∞, π lies between 0 and 1 (or, in terms of percentages, between 0% and 100%). The lower limit for π for the one- and two-parameter models is zero. For the three-parameter model, however, as θ approaches −∞, Pⱼ(θ) approaches cⱼ, the lower asymptote. Thus, the lower limit for π is Σⱼ cⱼ / n. Correspondingly, the lower limit for τ is Σⱼ cⱼ.

The transformation of θ to the true score or the domain score has important implications. First, and most obvious, is that negative scores are eliminated. Second, the transformation yields a scale that ranges from 0 to n (or 0% to 100% if the domain score is used), which is readily interpretable. When pass-fail decisions must be made, it is often difficult to set a cut-off score on the θ-scale. Since the domain-score scale is familiar, a cut-off score (such as 80% mastery) is typically set on the domain-score scale. The domain score is plotted against θ, and the θ value corresponding to the domain-score value is identified as the
cut-off score on the θ-scale (see, for example, Hambleton & de Gruijter, 1983). Alternatively, all θ values can be converted to domain-score values and the pass-fail decision made with respect to the domain-score scale.

TABLE 5.1 Item Parameters for Five Test Items

            Item Parameters
    Item      b       a       c
      1    -2.00    0.80    0.00
      2    -1.00    1.00    0.00
      3     0.00    1.20    0.10
      4     1.00    1.50    0.15
      5     2.00    2.00    0.20

To illustrate the conversion of θ values to domain-score values, we shall consider a test with five items. The item parameter values for these items are given in Table 5.1.

1. The probability of a correct response is computed for each item at θ = −3, −2, −1, 0, 1, 2, 3 using the three-parameter model (Equation 2.3).
2. These probabilities are summed over the five items at each of the θ values to yield τ = Σⱼ Pⱼ(θ).
3. The domain score π is obtained at each θ value by dividing the sum in the equation above by 5.
4. The resulting relationship between π and θ at the θ values is tabulated.

We now have the functional relationship between π and θ at θ = −3, −2, −1, 0, 1, 2, 3. This is a monotonically increasing relationship and can be plotted as a graph. The computations are given in Table 5.2.

The final implication of the θ to τ (or θ to π) conversion is that the true score τ (or π) of an examinee whose ability value is known (or estimated) can be computed on a set of items not administered to the examinee! When the item parameters for a set of items are given, an
examinee's true score can be computed as long as his or her θ value is known, as indicated in the illustration, allowing the projection or prediction of an examinee's true score or pass-fail status on a new set of items. This feature is used in "customized testing" (Linn & Hambleton, 1990). The fact that an examinee's true score can be computed on any set of items also is used in the procedure for determining the scaling constants for placing the item parameters of two tests on a common scale (see chapter 9 on equating).

TABLE 5.2 Relationship Between θ and π

     θ    P₁(θ)   P₂(θ)   P₃(θ)   P₄(θ)   P₅(θ)   τ = Σ Pⱼ(θ)   π = τ/n
    -3    0.20    0.03    0.10    0.15    0.20       0.69        0.14
    -2    0.50    0.15    0.11    0.15    0.20       1.12        0.22
    -1    0.80    0.50    0.20    0.16    0.20       1.85        0.37
     0    0.94    0.85    0.55    0.21    0.20       2.75        0.55
     1    0.98    0.97    0.90    0.58    0.22       3.65        0.73
     2    0.99    0.99    0.99    0.94    0.60       4.51        0.90
     3    1.00    1.00    1.00    1.00    0.96       4.96        0.99

Summary

The ability θ of an examinee, together with the item parameter values, determines the examinee's probability of responding correctly to an item. Based on the examinee's responses to a set of items, an ability score may be assigned to the examinee. The most important feature of the θ score that distinguishes it from the number-right score is that θ does not depend on the particular set of items administered to the examinee. Examinees who are administered different sets of items can be compared with respect to their θ values. In this sense, the θ-scale may be considered an absolute scale.

The θ values may be transformed linearly to facilitate interpretation. The θ-scale, or any linear transformation of it, however, does not possess the properties of ratio or interval scales, although it is popular and reasonable to assume that the θ-scale has equal-interval scale properties.

In some instances, however, a nonlinear transformation of the θ-scale may provide a ratio-scale type of interpretation.
The transformation θ* = e^θ and b* = e^b for the one-parameter model provides a ratio-scale interpretation for the odds for success. The "log-odds" transformation also enables such interpretations. For the two- and three-parameter models such simple transformations are not available.

The most important nonlinear transformation of the θ-scale is the transformation that yields the true-score scale. When the item response model fits the data, the true score is the sum of the item characteristic curves evaluated at a specified value of θ. Because it is the sum of item characteristic curves, the ability to true-score conversion is also known as the test characteristic curve. The true score is on the same metric as the number-right score. If desired, the true score may be converted to the domain score (or proportion-correct score) by dividing the true score by the number of items. The true score or the proportion-correct score has intuitive appeal and, hence, is often employed to set cut-off scores for making mastery-nonmastery decisions. The true score or the domain score can be computed for any set of items (including those not taken by the examinee) as long as the examinee's ability and the item parameters are known. Such "predictions" of an examinee's true score on a set of "new" items may provide valuable information regarding the use or inclusion of these items in a test.

Exercises for Chapter 5

1. Suppose that ability estimates for a group of examinees on a test are in the range (−4, 4).
   a. What linear transformation is appropriate to produce a scale on which the scores are positive integers (assuming that θ is obtained to two decimal places)?
   b. What nonlinear transformation is appropriate for producing scores that range from 0 to 100?

2. In Table 5.2, the relationship between θ and π is tabulated.
   a. Plot a graph of π (on the y-axis) against θ (on the x-axis).
      What can you say about the shape of the curve?
   b. Suppose that only students who have answered at least 80% of the items correctly are considered "masters"; that is, the cut-off score is set at π = 0.80. If an examinee has an ability score of 1.2, may this examinee be considered a master?
   c. What is the θ value that corresponds to a cut-off score of π = 0.80?
3. An examinee has an ability score of θ = 1.5 as determined by his or her performance on a test.
   a. What is the examinee's true score on the five-item test with item parameters given in Table 5.1?
   b. An examinee must answer at least four items on this test correctly to be considered a master. Would this examinee be considered a master?
   c. What is the cut-off score in part b on the θ scale?

4. For a two-parameter model,
   a. show that the odds for success, O, is

          O = e^(Da(θ − b))

   b. The odds for success on an item for two examinees with abilities θ₁ and θ₂ are O₁ and O₂, respectively. Show that the odds ratio for the two examinees is

          O₁ / O₂ = e^(Da(θ₁ − θ₂))

   c. If the abilities of the examinees differ by one unit, what is the value of the odds ratio? What is the log of the odds ratio?
   d. What is the value of the odds ratio and the log of the odds ratio if the examinees differ by k units?

Answers to Exercises for Chapter 5

1. a. y = 100(θ + 4).
   b. y = (100/n) Σⱼ Pⱼ(θ).

2. a. π is a monotonically increasing function of θ. It is bounded between 0.09 and 1. In fact, π(θ) looks like an item characteristic curve.
   b. The graph shows that the examinee with ability 1.2 has a domain score less than 0.8. Hence, the examinee may not be considered a master.
   c. From the graph, π = 0.8 corresponds to θ = 1.45.
3. a. τ = Σⱼ Pⱼ(θ = 1.5) ≈ 4.1.
   b. Yes.
   c. θ ≈ 1.45 (the θ value at which τ = 4, i.e., π = 0.8; see Exercise 2c).

4. a. Q(θ) = 1 / [1 + e^(Da(θ − b))]. Hence,

          O = P / Q = e^(Da(θ − b))

   b. O₁ = e^(Da(θ₁ − b)) and O₂ = e^(Da(θ₂ − b)). Hence,

          O₁ / O₂ = e^(Da(θ₁ − b)) / e^(Da(θ₂ − b)) = e^(Da(θ₁ − θ₂))

   c. If θ₁ − θ₂ = 1, then O₁/O₂ = e^(Da); ln (O₁/O₂) = Da.
   d. If θ₁ − θ₂ = k, then O₁/O₂ = e^(Dak); ln (O₁/O₂) = Dak.
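The θ-to-τ conversion of Table 5.2, and the true-score computation of Exercise 3a, can be reproduced with a short script. A minimal sketch in Python, using the Table 5.1 item parameters with the three-parameter model and the usual scaling constant D = 1.7 (the function names are ours, not the book's):

```python
import math

# Item parameters from Table 5.1: (b, a, c) for each of the five items
items = [(-2.0, 0.8, 0.00), (-1.0, 1.0, 0.00), (0.0, 1.2, 0.10),
         (1.0, 1.5, 0.15), (2.0, 2.0, 0.20)]

def p_3pl(theta, b, a, c, D=1.7):
    """Three-parameter logistic model (Equation 2.3)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def true_score(theta):
    """Test characteristic curve: tau = sum of the ICCs at theta."""
    return sum(p_3pl(theta, *item) for item in items)

# Reproduce Table 5.2: theta, tau, and pi = tau / n
for theta in range(-3, 4):
    tau = true_score(theta)
    print(theta, round(tau, 2), round(tau / len(items), 2))

# Exercise 3a: true score of an examinee with theta = 1.5
print(round(true_score(1.5), 2))   # approximately 4.08
```

The printed τ and π values agree with Table 5.2 to rounding.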
6

Item and Test Information and Efficiency Functions

Basic Concepts

A powerful method of describing items and tests, selecting test items, and comparing tests is provided by item response theory. The method involves the use of item information functions, denoted Iᵢ(θ):

    Iᵢ(θ) = [Pᵢ′(θ)]² / [Pᵢ(θ) Qᵢ(θ)],   i = 1, 2, …, n     [6.1]

where Iᵢ(θ) is the "information" provided by item i at θ, Pᵢ′(θ) is the derivative of Pᵢ(θ) with respect to θ, Pᵢ(θ) is the item response function, and Qᵢ(θ) = 1 − Pᵢ(θ). Equation 6.1 applies to dichotomously scored logistic item response models like those given in Equations 2.1 to 2.3. In the case of the three-parameter logistic model, Equation 6.1 simplifies to (Birnbaum, 1968, chapter 17)

    Iᵢ(θ) = 2.89 aᵢ² (1 − cᵢ) / {[cᵢ + e^(1.7aᵢ(θ − bᵢ))] [1 + e^(−1.7aᵢ(θ − bᵢ))]²}     [6.2]

From Equation 6.2 it is relatively easy to infer the role of the b, a, and c parameters in the item information function: (a) information is higher when the b value is close to θ than when the b value is far from θ, (b) information is generally higher when the a parameter is high, and (c) information increases as the c parameter goes to zero.

Item information functions can play an important role in test development and item evaluation in that they display the contribution items
make to ability estimation at points along the ability continuum. This contribution depends to a great extent on an item's discriminating power (the higher it is, the steeper the slope of Pᵢ), and the location at which this contribution will be realized depends on the item's difficulty. Birnbaum (1968) showed that an item provides its maximum information at θmax, where

    θmax = bᵢ + (1 / Daᵢ) ln [0.5 (1 + √(1 + 8cᵢ))]     [6.3]

If guessing is minimal, that is, cᵢ = 0, then θmax = bᵢ. In general, when cᵢ > 0, an item provides its maximum information at an ability level slightly higher than its difficulty.

The utility of item information functions in test development and evaluation depends on the fit of the item characteristic curves (ICCs) to the test data. If the fit of the ICCs to the data is poor, then the corresponding item statistics and item information functions will be misleading. Even when the fit is good, an item may have limited value in all tests if the a parameter is low and the c parameter is high. Moreover, the usefulness of test items (or tasks) will depend on the specific needs of the test developer within a given test. An item may provide considerable information at one end of the ability continuum but be of no value if information is needed elsewhere on the ability scale.

Examples

Figure 6.1 shows the item information functions for the six test items presented in Figure 2.4 and Table 2.1. Figure 6.1 highlights several important points:

1. Maximum information provided by an item is at its difficulty level or slightly higher when c > 0. (This is seen by comparing the point on the ability scale where information is greatest to the b values of the corresponding items.)

2. The item discrimination parameter substantially influences the amount of information for assessing ability that is provided by an item. (This can be seen by comparing the item information functions for Items 1 and 2.)
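Equations 6.2 and 6.3 are straightforward to compute. A minimal sketch in Python for the three-parameter model (the item parameter values below are hypothetical, chosen only for illustration):

```python
import math

def info_3pl(theta, b, a, c, D=1.7):
    """Item information for the three-parameter logistic model (Equation 6.2)."""
    L = D * a * (theta - b)
    return (D * D * a * a * (1.0 - c)) / (
        (c + math.exp(L)) * (1.0 + math.exp(-L)) ** 2
    )

def theta_max(b, a, c, D=1.7):
    """Ability at which the item provides maximum information (Equation 6.3)."""
    return b + (1.0 / (D * a)) * math.log(0.5 * (1.0 + math.sqrt(1.0 + 8.0 * c)))

# Hypothetical item: b = 0.0, a = 1.5, c = 0.2
b, a, c = 0.0, 1.5, 0.2
tm = theta_max(b, a, c)
print(round(tm, 3))   # 0.105 -- slightly above b, because c > 0
# Information is indeed highest at theta_max:
print(info_3pl(tm, b, a, c) >= info_3pl(tm - 0.5, b, a, c))  # True
print(info_3pl(tm, b, a, c) >= info_3pl(tm + 0.5, b, a, c))  # True
```

With c = 0, `info_3pl` reduces to the familiar two-parameter result D²a²P(θ)Q(θ), and `theta_max` reduces to b, consistent with the points made above.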