
Fundamentals of Item Response Theory


Test Score Equating 141

b. Place the b values of the common items in the experimental test on the same scale as the b values of the common items in the bank.
c. How similar are the difficulty values of the common items for the one- and two-parameter models? Carry out this comparison by plotting the scaled difficulty values for the one- and two-parameter models against the "true" item bank values.

4. Two tests, A and B, with 10 common items were administered to two groups of examinees, and a three-parameter model was fitted to the data. The means and standard deviations for the b values of the common items are given in Table 9.4.

TABLE 9.4

        Test A   Test B
Mean     3.5      4.2
SD       1.8      2.2

The difficulty and discrimination values for an item in test B are -1.4 and 0.9, respectively. Place these values on the same scale as test A.

Answers to Exercises for Chapter 9

1. Standardize the item difficulty parameter estimates.
2. Since a common set of examinees has taken both tests, their abilities must be the same. Because of standardization during the estimation phase, however, the θ values will be related linearly according to

θ_X = αθ_Y + β

The means and SDs of the common θ values are used to determine α and β, as indicated for the anchor item equating procedure. With the relationship established, the abilities of examinees taking test Y and the item difficulties for test Y can be mapped onto the scale defined for test X. The item discrimination indices for test Y are mapped onto the test X scale using the transformation a_X = a_Y / α

144 FUNDAMENTALS OF ITEM RESPONSE THEORY

3. a. α = 0.97, β = 0.42.
   b. See Table 9.5.

TABLE 9.5

                               Common Item
                           1      2      3      4      5
Scaled common items (2P): 1.67   1.15  -0.78  -1.25   2.52
Scaled common items (1P): 1.70   1.16  -0.83  -1.31   2.58
Common items from bank:   1.65   1.20  -0.80  -1.25   2.50

   c. The estimates of item difficulty for the one- and two-parameter models are fairly similar, but the estimates for the two-parameter model are closer to the values in the bank.

4. The scaling constants for placing items in test B on the same scale as test A (let X = test A and Y = test B) are α = 0.82 and β = 0.06. The scaled item difficulty and discrimination values are -1.09 and 1.10, respectively.
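The answer to Exercise 4 can be reproduced with the mean/sigma method for determining the scaling constants. The sketch below is a minimal illustration, not the book's code; the function and variable names are mine.

```python
# Mean/sigma scale transformation for IRT equating: difficulties on
# scale Y map to scale X via b_X = alpha * b_Y + beta, and
# discriminations via a_X = a_Y / alpha.

def mean_sigma_constants(b_common_x, b_common_y):
    """Compute alpha and beta from the common items' b values."""
    def mean(v):
        return sum(v) / len(v)
    def sd(v):
        m = mean(v)
        return (sum((x - m) ** 2 for x in v) / len(v)) ** 0.5
    alpha = sd(b_common_x) / sd(b_common_y)
    beta = mean(b_common_x) - alpha * mean(b_common_y)
    return alpha, beta

# Exercise 4: place the test B item on the scale of test A using the
# summary statistics of Table 9.4 (mean 3.5, SD 1.8 for A; 4.2, 2.2 for B).
alpha = round(1.8 / 2.2, 2)          # 0.82
beta = round(3.5 - alpha * 4.2, 2)   # 0.06
b_scaled = alpha * (-1.4) + beta     # about -1.09
a_scaled = 0.9 / alpha               # about 1.10
```

Working from the summary statistics alone (rather than the common items' individual b values) is exactly what the exercise requires, since only the means and SDs are given.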

10

Computerized Adaptive Testing

Background

In previous chapters, it was shown that a test provides the most precise measurement of an examinee's ability when the difficulty of the test is matched to the ability level of the examinee. Any single test administered to a group of examinees cannot provide the same precision of measurement for every examinee. The ideal testing situation would be to give every examinee a test that is "tailored," or adapted, to the examinee's ability level. The earliest application of tailored or adaptive testing was in the work of Binet on intelligence testing in 1908 (Weiss, 1985). Little additional work on adaptive testing took place, however, until Fred Lord at the Educational Testing Service began a comprehensive research program in the late 1960s (for a review of his work, see Lord, 1980). Lord pursued adaptive testing because he felt fixed-length tests were inefficient for most examinees, but especially for low- and high-ability examinees. Lord felt that tests could be shortened without any loss of measurement precision if the test items administered to each examinee were chosen so as to provide maximum information about the examinee's ability. In theory, each examinee would be administered a unique set of items.

Adaptive testing became feasible only with the advent of computers. The computer's immense power to store test information (e.g., test items and their indices) and to produce, administer, and score tests has enabled the potential of adaptive testing to be fully realized (Bunderson, Inouye, & Olsen, 1989; Wainer, 1990). Since the late 1960s a substantial amount of research has been supported by the U.S. Armed Services, the U.S. Office of Personnel Management, and other federal agencies; special conferences have been held, and hundreds of papers

on adaptive testing have been published (see, for example, Wainer, 1990; Weiss, 1983).

In computerized adaptive testing (CAT), the sequence of items administered to an examinee depends on the examinee's performance on earlier items in the test. Based on the examinee's prior performance, items that are maximally informative about the examinee's ability level are administered. In this way, tests may be shortened without any loss of measurement precision. High-ability examinees do not need to be administered relatively easy items, and low-ability examinees do not need to be administered the most difficult items, because such items provide little or no information about the examinee's ability.

After an examinee responds to a set of test items (sometimes only two or three items) presented at a computer terminal, an initial ability estimate for the examinee is obtained. The computer is programmed to select the next set of administered items from the available item bank that will contribute the most information about the examinee's ability, based on the initial estimate. Details of how test items are selected and ability estimates are obtained are provided in the following sections. The administration of items to the examinee continues until some specified number of items is administered or a desired level of measurement precision (i.e., standard error) of the ability estimate is attained.

Promise of IRT

Item response models are particularly suitable for adaptive testing because it is possible to obtain ability estimates that are independent of the particular set of test items administered. In fact, adaptive testing would not be feasible without item response theory. Even though each examinee receives a different set of items, differing in difficulty, item response theory provides a framework for comparing the ability estimates of different examinees.
In applying item response theory to measurement problems, as was mentioned in chapter 2, a common assumption is that one dominant factor or ability accounts for item performance. This assumption is made, for example, in nearly all of the current applications of adaptive testing. The IRT model most appropriate in adaptive testing is the three-parameter logistic model (Green, Bock, Humphreys, Linn, & Reckase, 1984; Lord, 1980; Weiss, 1983). The main reason for choosing

the three-parameter model is that it generally fits multiple-choice item data better than the one- or two-parameter models. The item information function plays a critical role in adaptive testing. Items that contribute maximally to the precision of measurement (see chapters 6 and 7) are selected for administration. Items providing the most information are, in general, items on which the examinee has an (approximately) 50% to 60% chance of answering correctly.

Basic Approach

In adaptive testing within an IRT framework, an attempt is made to match the difficulties of test items to the ability level of the examinee being measured. To match test items to ability levels requires a large pool of items whose statistical characteristics are known, so that suitable items may be drawn (Millman & Arter, 1984). According to Lord (1980), the computer must be programmed to accomplish the following in order to tailor a test to an examinee.

1. Predict from the examinee's previous responses how the examinee would respond to various test items not yet administered.
2. Make effective use of this knowledge to select the test item to be administered next.
3. Assign at the end of testing a numerical score that represents the ability of the examinee tested.

The advantages of computerized adaptive testing, in addition to shortening tests without loss of measurement precision, are numerous. Some of these advantages are

• enhanced test security
• testing on demand
• no need for answer sheets
• test pace that is keyed to the individual
• immediate test scoring and reporting
• the minimization of test frustration for some examinees
• greater test standardization
• easy removal of "defective items" from the item bank when they are identified

• greater flexibility in the choice of item formats
• reduction in test supervision time

Adaptive testing research to date has been focused in six areas: choice of IRT model, item bank, starting point for testing, selection of subsequent test items, scoring/ability estimation, and choice of method for deciding when to terminate the test administration. Refer to Hambleton, Zaal, and Pieters (1991) for a discussion of research in these six areas. A brief discussion of two of these, item selection and ability estimation, follows.

Two procedures are used currently for item selection in an adaptive mode (Kingsbury & Zara, 1989). The first, maximum information (Weiss, 1982), involves the selection of an item that provides maximum information (i.e., minimizes the standard error) at the examinee's ability level. To avoid the same items being selected time and time again (items with the highest levels of discriminating power, in general, provide the most information) and thereby (possibly) affecting test security and, subsequently, test validity, Green et al. (1984) have suggested that items be selected on a random basis from among items that provide the greatest information at the ability level of interest. Thus, for practical reasons, slightly less than optimal items usually are administered to examinees.

The second method, Bayesian item selection (Owen, 1975), involves the selection of the test item that minimizes the variance of the posterior distribution of the examinee's ability (see chapter 3). As more test items are administered, the posterior distribution becomes more concentrated, reflecting the precision with which the examinee's ability is estimated. Bayesian methods require specification of a prior belief about the examinee's ability; hence, the success of the method depends in part on the appropriateness of the prior distribution.
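The 3PL response probability and its item information function, on which maximum-information selection rests, can be sketched as follows. This is a minimal illustration, not the book's code; the function names are mine, and the usual D = 1.7 scaling constant is assumed.

```python
import math

def p_3pl(theta, b, a, c, D=1.7):
    """3PL probability of a correct response at ability theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def item_information(theta, b, a, c, D=1.7):
    """Fisher information of a 3PL item at theta (see chapters 6 and 7)."""
    P = p_3pl(theta, b, a, c, D)
    return (D * a) ** 2 * ((1.0 - P) / P) * ((P - c) / (1.0 - c)) ** 2

# For an item like Item 3 of Table 10.1 (b = -0.55, a = 1.78, c = 0.22),
# information peaks slightly above b, where the chance of a correct
# response falls in roughly the 50%-70% range noted in the text.
grid = [i / 100.0 for i in range(-300, 301)]
theta_star = max(grid, key=lambda t: item_information(t, -0.55, 1.78, 0.22))
p_at_peak = p_3pl(theta_star, -0.55, 1.78, 0.22)
```

Note that when the pseudo-chance parameter c is zero the peak sits exactly at b (where P = 0.5); a positive c pushes the peak, and the optimal success rate, somewhat higher.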
The impact of the prior distribution diminishes as more items are administered.

An important advantage of computerized adaptive testing is that test scoring/ability estimation is carried out while the test is being administered; thus, feedback of results to examinees may be provided at the completion of testing. In obtaining ability estimates, the two estimation procedures commonly used are maximum likelihood and Bayesian (see Weiss, 1982, and chapter 3, this volume). Maximum likelihood estimation poses problems when the number of test items is small. Bayesian procedures overcome the problems encountered with maximum likelihood procedures but may produce biased estimates of ability if inappropriate prior distributions are chosen.

This example, which highlights the features of CAT ability estimation and item selection, was prepared by Reshetar (1990). For the purposes of the example, Reshetar created a bank of 13 test items, contained in Table 10.1.

TABLE 10.1 Item Parameters

Item      b       a       c
 1       0.09    1.11    0.22
 2       0.47    1.21    0.24
 3      -0.55    1.78    0.22
 4       1.01    1.39    0.08
 5      -1.88    1.22    0.07
 6      -0.82    1.52    0.09
 7       1.77    1.49    0.02
 8       1.92    0.71    0.19
 9       0.69    1.41    0.13
10      -0.28    0.98    0.01
11       1.47    1.59    0.04
12       0.23    0.72    0.02
13       1.21    0.58    0.17

Source: From Computer Adaptive Testing: Development and Application (p. 9) by R. Reshetar, 1990, Amherst: University of Massachusetts, School of Education.

In practice, an item bank would consist of hundreds, and possibly thousands, of test items. A sequence of events that might occur in computerized adaptive testing is as follows:

1. Item 3 is selected; this item is of average difficulty and high discrimination. Suppose the examinee answers Item 3 correctly. A maximum likelihood estimate of ability cannot be obtained until the examinee has answered at least one item correctly and one item incorrectly. (Zero or perfect scores correspond to -∞ and +∞ ability estimates, respectively.)

TABLE 10.2 Maximum Likelihood Ability Estimates and Standard Errors for One Examinee at the End of Each CAT Stage

Stage   Item Number   Response    θ̂      I(θ̂)    SE(θ̂)ᵃ
  1          3           1
  2         12           1
  3          7           0       1.03    0.97    1.02
  4          4           1       1.46    2.35    0.65
  5         11           0       1.13    3.55    0.55
  6          9           1       1.24    4.61    0.47
  7          2           1       1.29    5.05    0.45
  8          1           1       1.31    5.27    0.44
  9          8           0       1.25    5.47    0.43

a. SE(θ̂) = 1/√I(θ̂)

2. Another item is selected. Item 12 is chosen because it is more difficult than the previously administered item. Suppose the examinee correctly answers Item 12. Again, a maximum likelihood estimate of ability cannot be obtained.

3. Item 7 is chosen next; it is more difficult than Items 3 and 12. Suppose the examinee answers this item incorrectly. The examinee's item response vector for the three items may be represented as (1, 1, 0). Through use of the maximum likelihood procedure for estimating ability with known item parameters, an ability estimate can be obtained (θ̂ = 1.03). The test information for the three items at this ability level is I(θ = 1.03) = 0.97, and the corresponding standard error is SE(θ̂) = 1.02. These values appear in Table 10.2.

4. Next, the information provided by each of the remaining items in the bank is computed at θ = 1.03. These values are reported in Table 10.3. Item 4 is selected next because it provides the most information at θ = 1.03. Suppose that Item 4 is administered and then is answered correctly by the examinee. A new ability estimate is obtained for the response pattern (1, 1, 0, 1). The new ability estimate is θ̂ = 1.46.

5. The item information at θ̂ = 1.46 for the remaining items is computed. The process described above for administering an item, estimating ability, determining the information provided by unadministered items, and choosing an item to be administered next based on the information it provides is continued. To continue this procedure, Item 11 is chosen next, followed by Item 9, then by Items 2, 1, and finally, 8.
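One stage of this sequence can be sketched in code: grid-search maximum likelihood estimation of ability, followed by maximum-information selection of the next item. The item parameters are those of Table 10.1; the helper names are mine, and the usual D = 1.7 scaling is assumed. This is an illustration, not Reshetar's program.

```python
import math

# Item bank of Table 10.1: item -> (b, a, c)
BANK = {
    1: (0.09, 1.11, 0.22),   2: (0.47, 1.21, 0.24),   3: (-0.55, 1.78, 0.22),
    4: (1.01, 1.39, 0.08),   5: (-1.88, 1.22, 0.07),  6: (-0.82, 1.52, 0.09),
    7: (1.77, 1.49, 0.02),   8: (1.92, 0.71, 0.19),   9: (0.69, 1.41, 0.13),
    10: (-0.28, 0.98, 0.01), 11: (1.47, 1.59, 0.04),  12: (0.23, 0.72, 0.02),
    13: (1.21, 0.58, 0.17),
}

def prob(theta, b, a, c, D=1.7):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info(theta, b, a, c, D=1.7):
    P = prob(theta, b, a, c, D)
    return (D * a) ** 2 * ((1.0 - P) / P) * ((P - c) / (1.0 - c)) ** 2

def mle(items, responses):
    """Maximize the log-likelihood over a fine grid of theta values."""
    grid = [i / 100.0 for i in range(-400, 401)]
    def loglik(theta):
        total = 0.0
        for item, u in zip(items, responses):
            P = prob(theta, *BANK[item])
            total += math.log(P if u == 1 else 1.0 - P)
        return total
    return max(grid, key=loglik)

# Stage 3: Items 3, 12, 7 have been answered (1, 1, 0).
theta = mle([3, 12, 7], [1, 1, 0])           # close to the book's 1.03
administered = {3, 12, 7}
next_item = max((i for i in BANK if i not in administered),
                key=lambda i: info(theta, *BANK[i]))   # Item 4
```

A grid search is used here for transparency; Newton-Raphson iteration on the log-likelihood derivative is the usual production choice and converges in a handful of steps.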
The procedure stops when the standard error of the examinee's ability estimate stops decreasing

TABLE 10.3 Information Provided by Unadministered Items at Each CAT Stage

                                Information Provided by Item
Stage    θ̂       1      2      4      5      6      8      9     10     11     13
  4    1.03   0.332  0.547  1.192  0.010  0.051  0.143  1.008  0.251  1.109  0.166
  5    1.46   0.179  0.319         0.004  0.017  0.205  0.579  0.136  1.683  0.175
  6    1.13   0.292  0.494         0.008  0.039  0.159  0.917  0.219         0.170
  7    1.24   0.249  0.433         0.006  0.029  0.175         0.187         0.173
  8    1.29   0.232                0.006  0.026  0.182         0.175         0.174
  9    1.31                        0.005  0.024  0.186         0.168         0.174
 10    1.25                        0.006  0.028                0.184         0.173

by a specified amount. As can be seen from Table 10.2, the decrease in the standard error when Item 8 is administered in stage 9 compared with the standard error at stage 8 is 0.01. The procedure stops at this point. The estimate of the examinee's ability is taken as θ̂ = 1.25.

Weiss and Kingsbury (1984) described several other examples of application of CAT to educational testing problems.

Exercise for Chapter 10

For the example in the chapter, suppose that an examinee was administered Items 3, 12, and 7 and responded (1, 1, 0). Item 4 was chosen to be administered next, and the examinee answered it incorrectly. The maximum likelihood estimate of ability was computed to be 0.45. Compute the information function for the remaining items at this θ value. Which item should be administered to the examinee next?

Answer to Exercise for Chapter 10

The item information values at θ = 0.45 are given in Table 10.4.

TABLE 10.4

Item           1     2     5     6     8     9    10    11    13
Information  0.56  0.64  0.03  0.21  0.07  0.94  0.48  0.24  0.13

Item 9 has the highest information at θ = 0.45. It is administered next.
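The exercise answer can be checked numerically with the 3PL information function of chapters 6 and 7. The parameters are those of Table 10.1; the helper names are mine, and D = 1.7 is assumed.

```python
import math

# Unadministered items after 3, 12, 7, and 4 have been given:
# item -> (b, a, c), from Table 10.1.
REMAINING = {
    1: (0.09, 1.11, 0.22),  2: (0.47, 1.21, 0.24),  5: (-1.88, 1.22, 0.07),
    6: (-0.82, 1.52, 0.09), 8: (1.92, 0.71, 0.19),  9: (0.69, 1.41, 0.13),
    10: (-0.28, 0.98, 0.01), 11: (1.47, 1.59, 0.04), 13: (1.21, 0.58, 0.17),
}

def information(theta, b, a, c, D=1.7):
    P = c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))
    return (D * a) ** 2 * ((1.0 - P) / P) * ((P - c) / (1.0 - c)) ** 2

infos = {i: information(0.45, *REMAINING[i]) for i in REMAINING}
best = max(infos, key=infos.get)   # Item 9, as in the answer
```

Item 9's advantage at θ = 0.45 comes from its combination of high discrimination (a = 1.41) and a difficulty (b = 0.69) close to the current ability estimate.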

11

Future Directions of Item Response Theory

We hope that Dr. Testmaker and other applied measurement specialists will find the contents of this book helpful. Many important concepts, models, features, and applications were introduced, and many examples were provided; this material should prepare our readers for the next steps in the learning process. No book, by itself, can prepare measurement specialists to use IRT models successfully in their work. Applied work with various data sets and IRT computer programs is an essential component of training in IRT. The practitioner must be ready to handle the many problems that arise in practice.

Although IRT provides solutions to many testing problems that previously were unsolved, it is not a magic wand that can be waved to overcome such deficiencies as poorly written test items and poor test designs. In the hands of careful test developers, however, IRT models and methods can become powerful tools in the design and construction of sound educational and psychological instruments, and in reporting and interpreting test results.

Research on IRT models and their applications is being conducted at a phenomenal rate (see Thissen & Steinberg, 1986, for a taxonomy of models). Entire issues of several journals have been devoted to developments in IRT. For the future, two directions for research appear to be especially important: polytomous unidimensional response models and both dichotomous and polytomous multidimensional response models. Research in both directions is well underway (Bock, 1972; Masters & Wright, 1984; Samejima, 1969, 1972, 1973, 1974). With the growing interest in "authentic measurement," special attention must be given to IRT models that can handle polytomous scoring, since authentic measurement is linked to performance testing and to nondichotomous scoring of examinee performance.

Multidimensional IRT models were introduced originally by Lord and Novick (1968) and Samejima (1974) and, more recently, by Embretson (1984) and McDonald (1989). Multidimensional models offer the prospect of better fitting current test data and providing multidimensional representations of both items and examinee abilities. It remains to be seen whether parameters for these multidimensional models can be estimated properly and whether multidimensional representations of items and examinees are useful to practitioners.

Goldstein and Wood (1989) have argued for more IRT model building in the future but feel that more attention should be given to placing IRT models within an explicit linear modeling framework. Advantages, according to Goldstein and Wood, include model parameters that are simpler to understand, easier to estimate, and that have well-known statistical properties.

In addition to the important IRT applications addressed in earlier chapters, three others are likely to draw special attention from educators and psychologists in the coming years. First, large-scale state, national, and international assessments are attracting considerable attention and will continue to do so for the foreseeable future. Item response models are being used at the all-important reporting stages in these assessments. It will be interesting to see what technical controversies arise from this type of application. One feature that plays an important role in reporting is the ICC. Are ICCs invariant to the nature and amounts of instruction? The assumption is that ICCs are invariant, but substantially more research is needed to establish this point.

Second, cognitive psychologists such as Embretson (1984) are interested in using IRT models to link examinees' task performances to their abilities through complex models that attempt to estimate parameters for the cognitive components needed to complete the tasks.
This line of research is also consistent with Goldstein and Wood's (1989) goal of seeking more meaningful psychological models that help explain examinee test performance. Much of the IRT research to date has emphasized the use of mathematical models that provide little in the way of psychological interpretations of examinee item and test performance.

Third, educators and psychologists are making the argument for using test scores to do more than simply rank order examinees on their abilities or determine whether they have met a particular achievement level or standard. Diagnostic information is becoming increasingly important to users of test scores. Inappropriateness measurement developed by M. Levine and F. Drasgow (see, for example, Drasgow

et al., 1987), which incorporates IRT models, provides a framework for identifying aberrant responses of examinees and special groups of examinees on individual items and groups of items. Such information may be helpful in successful diagnostic work. Greater use of IRT models in providing diagnostic information is anticipated in the coming years.

Appendix A

Classical and IRT Parameter Estimates for the New Mexico State Proficiency Exam

TABLE A.1 Classical and IRT Item Parameter Estimates for the One-, Two-, and Three-Parameter Models

           Classical        1P           2P                 3P
Item       p      r         b         b      a         b      a      c
  1       0.45   0.41      0.22      0.21   0.61      0.58   0.84   0.17
  2       0.70   0.45     -1.00     -0.83   0.82     -0.51   0.91   0.17
  3       0.65   0.50     -0.75     -0.60   0.92     -0.31   1.10   0.17
  4       0.77   0.20     -1.45     -2.25   0.34     -1.69   0.37   0.17
  5       0.75   0.37     -1.34     -1.25   0.66     -0.97   0.68   0.17
  6       0.39   0.27      0.52      0.71   0.40      1.11   0.68   0.17
  7       0.76   0.40     -1.36     -1.67   0.75     -0.90   0.79   0.17
  8       0.60   0.35     -0.52     -0.56   0.52     -0.09   0.67   0.17
  9       0.78   0.29     -1.51     -1.70   0.50     -1.36   0.53   0.17
 10       0.55   0.32     -0.27     -0.32   0.47      0.19   0.62   0.17
 11       0.61   0.37     -0.53     -0.55   0.56     -0.14   0.68   0.17
 12       0.59   0.21     -0.47     -0.81   0.29     -0.11   0.37   0.17
 13       0.55   0.30     -0.25     -0.30   0.43      0.22   0.56   0.17
 14       0.73   0.44     -1.18     -0.97   0.82     -0.67   0.88   0.17
 15       0.38   0.54      0.58      0.49   0.75      0.76   1.30   0.18
 16       0.62   0.54     -0.58     -0.45   1.04     -0.04   1.53   0.21
 17       0.80   0.34     -1.67     -1.53   0.67     -1.32   0.66   0.17
 18       0.65   0.45     -0.74     -0.78   0.59     -0.32   0.68   0.17
 19       0.49   0.43      0.04      0.03   0.68      0.51   1.23   0.22
 20       0.64   0.40     -0.70     -0.68   0.65     -0.31   0.73   0.17

TABLE A.1 continued

           Classical        1P           2P                 3P
Item       p      r         b         b      a         b      a      c
 21       0.69   0.34     -0.99     -1.07   0.53     -0.68   0.59   0.17
 22       0.67   0.41     -0.85     -0.78   0.68     -0.46   0.74   0.10
 23       0.46   0.35      0.18      0.20   0.50      0.63   0.74   0.17
 24       0.74   0.52     -1.26     -0.89   1.15     -0.64   1.25   0.17
 25       0.61   0.47     -0.56     -0.48   0.80     -0.12   0.98   0.17
 26       0.34   0.30      0.78      0.97   0.44      1.18   0.65   0.12
 27       0.70   0.50     -1.05     -0.80   0.99     -0.52   1.08   0.17
 28       0.61   0.44     -0.56     -0.50   0.71     -0.12   0.91   0.17
 29       0.73   0.35     -1.23     -1.24   0.58     -0.91   0.62   0.17
 30       0.74   0.44     -1.28     -1.03   0.85     -0.81   0.86   0.17
 31       0.57   0.32     -0.35     -0.41   0.46      0.08   0.58   0.17
 32       0.74   0.38     -1.20     -1.17   0.63     -0.90   0.68   0.17
 33       0.44   0.35      0.29      0.33   0.51      0.78   0.87   0.19
 34       0.60   0.45     -0.52     -0.46   0.75     -0.13   1.30   0.20
 35       0.28   0.29      1.14      1.47   0.46      1.40   0.94   0.15
 36       0.64   0.46     -0.99     -0.82   0.83     -0.50   0.94   0.17
 37       0.29   0.27      1.11      1.46   0.41      1.54   0.63   0.10
 38       0.77   0.35     -1.41     -1.19   0.64     -1.01   0.68   0.17
 39       0.60   0.38     -0.50     -0.51   0.51     -0.09   0.69   0.17
 40       0.43   0.48      0.33      0.26   0.81      0.58   1.30   0.17
 41       0.43   0.41      0.33      0.31   0.62      0.58   0.99   0.17
 42       0.60   0.46     -0.51     -0.45   0.75     -0.09   0.93   0.17
 43       0.46   0.37      0.17      0.18   0.56      0.70   1.31   0.25
 44       0.52   0.23     -0.12     -0.19   0.32      0.44   0.41   0.17
 45       0.26   0.28      1.24      1.53   0.45      1.46   1.14   0.15
 46       0.64   0.44     -0.68     -0.61   0.73     -0.25   0.84   0.17
 47       0.75   0.40     -1.34     -1.16   0.74     -0.89   0.78   0.17
 48       0.79   0.39     -1.57     -1.30   0.79     -1.08   0.80   0.17
 49       0.76   0.36     -1.37     -1.28   0.68     -1.00   0.68   0.17
 50       0.57   0.30     -0.34     -0.43   0.41      0.10   0.51   0.17
 51       0.49   0.35      0.04      0.05   0.53      0.57   0.94   0.20
 52       0.34   0.33      0.81      0.93   0.59      1.01   1.06   0.14
 53       0.50   0.39     -0.04     -0.41   0.59      0.53   1.01   0.23
 54       0.74   0.33     -1.26     -1.32   0.55     -0.94   0.61   0.17
 55       0.48   0.61      0.12      0.05   1.21      0.23   1.41   0.08

TABLE A.1 continued

           Classical        1P           2P                 3P
Item       p      r         b         b      a         b      a      c
 56       0.51   0.34     -0.03     -0.03   0.48      0.43   0.67   0.17
 57       0.64   0.32     -0.71     -0.82   0.49     -0.37   0.56   0.17
 58       0.50   0.43     -0.02     -0.03   0.66      0.35   0.70   0.17
 59       0.81   0.26     -1.88     -2.18   0.48     -1.82   0.62   0.17
 60       0.47   0.35      0.15      0.18   0.49      0.61   0.56   0.17
 61       0.71   0.35     -1.09     -1.13   0.56     -0.77   0.89   0.17
 62       0.73   0.38     -1.21     -1.15   0.64     -0.85          0.17
 63       0.79   0.30     -1.57     -1.69   0.53     -1.37   0.31   0.17
 64       0.63   0.23     -0.63     -0.97   0.33     -0.34   1.22   0.17
 65       0.59   0.47     -0.43     -0.38   0.77     -0.05   0.74   0.17
 66       0.77   0.16     -1.45     -2.85   0.26     -1.97   0.35   0.17
 67       0.54   0.52     -0.20     -0.17   0.90      0.17   0.84   0.17
 68       0.66   0.41     -0.80     -0.75   0.65     -0.40   1.14   0.17
 69       0.72   0.37     -1.12     -1.10   0.61     -0.77   1.41   0.17
 70       0.53   0.21     -0.14     -0.26   0.26      0.46
 71       0.78   0.41     -1.49     -1.21   0.83     -0.98
 72       0.78   0.37     -1.53     -1.34   0.72     -1.06
 73       0.64   0.53     -0.68     -0.53   0.98     -0.23
 74       0.60   0.28     -0.48     -0.62   0.41     -0.07
 75       0.46   0.23      0.17      0.31   0.30      0.91
 76                                                   1.26
 77                                                  -1.47
 78                                                  -1.61
 79                                                   0.60
 80                                                   0.63
 81                                                  -1.45
 82                                                  -0.91
 83                                                  -0.69
 84                                                   1.15
 85                                                   1.02
 86                                                   0.91
 87                                                  -0.39
 88                                                   2.11
 89                                                   1.78
 90                                                   1.96
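The classical indices (p, r) and the IRT parameters in Table A.1 are related: under normal-ogive assumptions there is a well-known rough conversion, a ≈ r/√(1 - r²) and b ≈ z/r with Φ(z) = 1 - p (see, e.g., Lord, 1980). The sketch below implements that approximation; it is emphatically not the maximum likelihood estimation that produced the table, and the function names are mine.

```python
import math

def inv_norm_cdf(q, lo=-8.0, hi=8.0):
    """Invert the standard normal CDF by bisection (no SciPy needed)."""
    for _ in range(80):
        mid = (lo + hi) / 2.0
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

def classical_to_irt(p, r):
    """Approximate (b, a) from classical difficulty p and item-total r."""
    a = r / math.sqrt(1.0 - r * r)
    b = inv_norm_cdf(1.0 - p) / r
    return b, a
```

The approximation reproduces the qualitative pattern visible in the table: easy items (large p) get negative b, and highly discriminating items (large r) get large a. The numerical agreement with ML estimates is only loose, since guessing, scaling conventions, and estimation error all intervene.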

Appendix B

Sources for IRT Computer Programs

Program            Source

BICAL, BIGSCALE    Dr. Benjamin Wright
                   University of Chicago
                   Statistical Laboratory, Department of Education
                   5835 Kimbark Ave.
                   Chicago, IL 60637, U.S.A.

MICROSCALE         Mediax Interactive Technologies
                   21 Charles Street
                   Westport, CT 06880, U.S.A.

PML                Dr. Jan-Eric Gustafsson
                   University of Goteborg
                   Institute of Education
                   Fack S-431 20 Molndal, SWEDEN

RASCAL, ASCAL      Assessment Systems Corporation
                   2233 University Avenue, Suite 440
                   St. Paul, MN 55114, U.S.A.

RIDA               Dr. Cees Glas
                   National Institute for Educational Measurement
                   P.O. Box 1034
                   6801 MG Arnhem, The Netherlands

Program            Source

LOGIST             Educational Testing Service
                   Rosedale Road
                   Princeton, NJ 08541, U.S.A.

BILOG, MULTILOG    Scientific Software, Inc.
                   1369 Neitzel Road
                   Mooresville, IN 46158, U.S.A.

NOHARM             Dr. Colin Fraser
                   Centre for Behavioral Studies
                   University of New England
                   Armidale, N.S.W. 2351, AUSTRALIA

MIRTE              Dr. Mark Reckase
                   American College Testing Program
                   P.O. Box 168
                   Iowa City, IA 52243, U.S.A.

References

American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1985). Standards for educational and psychological testing. Washington, DC: American Psychological Association.
Andersen, E. B. (1972). The numerical solution of a set of conditional estimation equations. Journal of the Royal Statistical Society, Series B, 34, 42-54.
Andersen, E. B. (1973). A goodness of fit test for the Rasch model. Psychometrika, 38, 123-140.
Andrich, D. (1978a). Application of a psychometric rating model to ordered categories which are scored with successive integers. Applied Psychological Measurement, 2, 581-594.
Andrich, D. (1978b). A binomial latent trait model for the study of Likert-style attitude questionnaires. British Journal of Mathematical and Statistical Psychology, 31, 84-98.
Andrich, D. (1978c). A rating formulation for ordered response categories. Psychometrika, 43, 561-573.
Andrich, D. (1982). An extension of the Rasch model for ratings providing both location and dispersion parameters. Psychometrika, 47, 105-113.
Angoff, W. H. (1971). Scales, norms, and equivalent scores. In R. L. Thorndike (Ed.), Educational measurement (2nd ed.) (pp. 508-600). Washington, DC: American Council on Education.
Ansley, T. N., & Forsyth, R. A. (1985). An examination of the characteristics of unidimensional IRT parameter estimates derived from two-dimensional data. Applied Psychological Measurement, 9(1), 37-48.
Assessment Systems Corporation. (1988). User's manual for the MicroCAT testing system (Version 3). St. Paul, MN: Author.
Baker, F. B. (1964). An intersection of test score interpretation and item analysis. Journal of Educational Measurement, 1, 23-28.
Baker, F. B. (1965). Origins of the item parameters x50 and β as a modern item analysis technique. Journal of Educational Measurement, 2, 167-180.
Baker, F. B. (1985). The basics of item response theory. Portsmouth, NH: Heinemann.
Baker, F. B. (1987). Methodology review: Item parameter estimation under the one-, two-, and three-parameter logistic models. Applied Psychological Measurement, 11, 111-142.

Bejar, I. I. (1980). A procedure for investigating the unidimensionality of achievement tests based on item parameter estimates. Journal of Educational Measurement, 17, 283-296.
Birnbaum, A. (1968). Some latent trait models and their use in inferring an examinee's ability. In F. M. Lord & M. R. Novick, Statistical theories of mental test scores (chapters 17-20). Reading, MA: Addison-Wesley.
Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika, 37, 29-51.
Bock, R. D., & Aitkin, M. (1981). Marginal maximum likelihood estimation of item parameters: Application of an EM algorithm. Psychometrika, 46, 443-459.
Bock, R. D., Gibbons, R., & Muraki, E. (1988). Full information item factor analysis. Applied Psychological Measurement, 12(3), 261-280.
Bock, R. D., & Lieberman, M. (1970). Fitting a response curve model for dichotomously scored items. Psychometrika, 35, 179-198.
Bock, R. D., & Mislevy, R. J. (1982). Adaptive EAP estimation of ability in a microcomputer environment. Applied Psychological Measurement, 6(4), 431-444.
Bunderson, C. V., Inouye, D. K., & Olsen, J. B. (1989). The four generations of computerized educational measurement. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 367-407). New York: Macmillan.
Carlson, J. E. (1987). Multidimensional item response theory estimation: A computer program (Research Report ONR87-2). Iowa City, IA: American College Testing.
Cook, L. L., & Eignor, D. R. (1983). Practical considerations regarding the use of item response theory to equate tests. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 175-195). Vancouver, BC: Educational Research Institute of British Columbia.
Cook, L. L., & Eignor, D. R. (1989). Using item response theory in test score equating.
International Journal of Educational Research, 13(2), 161-173.
Cook, L. L., Eignor, D. R., & Taft, H. L. (1988). A comparative study of the effects of recency of instruction on the stability of IRT and conventional item parameter estimates. Journal of Educational Measurement, 25(1), 31-45.
de Gruijter, D. N. M. (1986). Small N does not always justify the Rasch model. Applied Psychological Measurement, 10, 187-194.
de Gruijter, D. N. M., & Hambleton, R. K. (1983). Using item response models in criterion-referenced test item selection. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 142-154). Vancouver, BC: Educational Research Institute of British Columbia.
Divgi, D. R. (1986). Does the Rasch model really work for multiple choice items? Not if you look closely. Journal of Educational Measurement, 23, 283-298.
Drasgow, F., Levine, M. V., & McLaughlin, M. E. (1987). Detecting inappropriate test scores with optimal and practical appropriateness indices. Applied Psychological Measurement, 11(1), 59-79.
Drasgow, F., & Lissak, R. I. (1983). Modified parallel analysis: A procedure for examining the latent dimensionality of dichotomously scored item responses. Journal of Applied Psychology, 68, 363-373.
Drasgow, F., & Parsons, C. K. (1983). Application of unidimensional item response theory models to multidimensional data. Applied Psychological Measurement, 7, 189-199.
Embretson, S. E. (1984). A general latent trait model for response processes. Psychometrika, 49, 175-186.

Fraser, C., & McDonald, R. P. (1988). NOHARM: Least squares item factor analysis. Multivariate Behavioral Research, 23, 267-269.
Glas, C. (1990). RIDA: Rasch incomplete design analysis. Arnhem, The Netherlands: National Institute for Educational Measurement.
Goldstein, H., & Wood, R. (1989). Five decades of item response modelling. British Journal of Mathematical and Statistical Psychology, 42, 139-167.
Green, B. F., Bock, R. D., Humphreys, L. G., Linn, R. L., & Reckase, M. D. (1984). Technical guidelines for assessing computerized adaptive tests. Journal of Educational Measurement, 21(4), 347-360.
Green, D. R., Yen, W. M., & Burket, G. R. (1989). Experiences in the application of item response theory in test construction. Applied Measurement in Education, 2(4), 297-312.
Gulliksen, H. (1950). Theory of mental tests. New York: John Wiley.
Gustafsson, J. E. (1980a). A solution of the conditional estimation problem for long tests in the Rasch model for dichotomous items. Educational and Psychological Measurement, 40, 377-385.
Gustafsson, J. E. (1980b). Testing and obtaining fit of data to the Rasch model. British Journal of Mathematical and Statistical Psychology, 33, 205-233.
Haebara, T. (1980). Equating logistic ability scales by weighted least squares method. Japanese Psychological Research, 22, 144-149.
Hambleton, R. K. (Ed.). (1983). Applications of item response theory. Vancouver, BC: Educational Research Institute of British Columbia.
Hambleton, R. K. (1989). Principles and selected applications of item response theory. In R. L. Linn (Ed.), Educational measurement (3rd ed.) (pp. 147-200). New York: Macmillan.
Hambleton, R. K., & Cook, L. L. (1983). Robustness of item response models and effects of test length and sample size on the precision of ability estimates. In D. Weiss (Ed.), New horizons in testing (pp. 31-49). New York: Academic Press.
Hambleton, R. K., & de Gruijter, D. N. M. (1983). Application of item response models to criterion-referenced test item selection. Journal of Educational Measurement, 20, 355-367.
Hambleton, R. K., Jones, R. W., & Rogers, H. J. (1990, August). Influence of item parameter estimation errors in test development. Paper presented at the meeting of American Psychological Association, Boston.
Hambleton, R. K., & Rogers, H. J. (1989). Detecting potentially biased test items: Comparison of IRT area and Mantel-Haenszel methods. Applied Measurement in Education, 2(4), 313-334.
Hambleton, R. K., & Rogers, H. J. (in press). Assessment of IRT model fit. Applied Psychological Measurement.
Hambleton, R. K., & Rovinelli, R. J. (1973). A Fortran IV program for generating examinee response data from logistic test models. Behavioral Science, 17, 73-74.
Hambleton, R. K., & Rovinelli, R. J. (1986). Assessing the dimensionality of a set of test items. Applied Psychological Measurement, 10, 287-302.
Hambleton, R. K., & Swaminathan, H. (1985). Item response theory: Principles and applications. Boston: Kluwer.
Hambleton, R. K., & Traub, R. E. (1973). Analysis of empirical data using two logistic latent trait models. British Journal of Mathematical and Statistical Psychology, 26, 273-281.

Hambleton, R. K., & van der Linden, W. J. (1982). Advances in item response theory and applications: An introduction. Applied Psychological Measurement, 6, 373-378.
Hambleton, R. K., Zaal, J. N., & Pieters, J. M. P. (1991). Computerized adaptive testing: Theory, applications, and standards. In R. K. Hambleton & J. N. Zaal (Eds.), Advances in educational and psychological testing: Theory and applications (pp. 341-366). Boston: Kluwer.
Harris, D. (1989). Comparison of 1-, 2-, and 3-parameter IRT models. Educational Measurement: Issues and Practice, 8, 35-41.
Hattie, J. A. (1985). Methodological review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139-164.
Holland, P. W., & Thayer, D. T. (1988). Differential item performance and the Mantel-Haenszel procedure. In H. Wainer & H. I. Braun (Eds.), Test validity (pp. 129-145). Hillsdale, NJ: Lawrence Erlbaum.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179-185.
Kendall, M. G., & Stuart, A. (1961). The advanced theory of statistics (Vol. 1). New York: Hafner.
Kingsbury, G. G., & Zara, A. R. (1989). Procedures for selecting items for computerized adaptive tests. Applied Measurement in Education, 2(4), 359-375.
Kingston, N. M., & Dorans, N. J. (1984). Item location effects and their implications for IRT equating and adaptive testing. Applied Psychological Measurement, 8, 141-154.
Kingston, N. M., & Dorans, N. J. (1985). The analysis of item-ability regressions: An exploratory IRT model fit tool. Applied Psychological Measurement, 9, 281-288.
Kingston, N. M., & Stocking, M. (1986, August). Psychometric issues in IRT-based test construction. Paper presented at the meeting of American Psychological Association, Washington, DC.
Klein, L. W., & Jarjoura, D. (1985). The importance of content representation for common-item equating with non-random groups. Journal of Educational Measurement, 22(3), 197-206.
Kolen, M. J. (1988). Traditional equating methodology. Educational Measurement: Issues and Practice, 7(4), 29-36.
Linn, R. L. (1990). Has item response theory increased the validity of achievement test scores? Applied Measurement in Education, 3(2), 115-141.
Linn, R. L., & Hambleton, R. K. (1990). Customized tests and customized norms (CRESST Technical Report). Los Angeles: UCLA, School of Education.
Linn, R. L., & Harnisch, D. L. (1981). Interactions between item content and group membership on achievement test items. Journal of Educational Measurement, 18, 109-118.
Linn, R. L., Levine, M. V., Hastings, C. N., & Wardrop, J. L. (1981). Item bias in a test of reading comprehension. Applied Psychological Measurement, 5, 159-173.
Lord, F. M. (1952). A theory of test scores (Psychometric Monograph No. 1). Iowa City, IA: Psychometric Society.
Lord, F. M. (1974). Estimation of latent ability and item parameters when there are omitted responses. Psychometrika, 39, 247-264.
Lord, F. M. (1977). Practical applications of item characteristic curve theory. Journal of Educational Measurement, 14, 117-138.

Lord, F. M. (1980). Applications of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.
Lord, F. M. (1984). Standard errors of measurement at different ability levels. Journal of Educational Measurement, 21, 239-243.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
Ludlow, L. H. (1985). A strategy for the graphical representation of Rasch model residuals. Educational and Psychological Measurement, 45, 851-859.
Ludlow, L. H. (1986). Graphical analysis of item response theory residuals. Applied Psychological Measurement, 10, 217-229.
Masters, G. N. (1982). A Rasch model for partial credit scoring. Psychometrika, 47, 149-174.
Masters, G. N., & Wright, B. D. (1984). The essential process in a family of measurement models. Psychometrika, 49, 529-544.
McDonald, R. P. (1967). Non-linear factor analysis (Psychometric Monograph No. 15). Iowa City, IA: Psychometric Society.
McDonald, R. P. (1981). The dimensionality of tests and items. British Journal of Mathematical and Statistical Psychology, 34, 100-117.
McDonald, R. P. (1989). Future directions for item response theory. International Journal of Educational Research, 13(2), 205-220.
McLaughlin, M. E., & Drasgow, F. (1987). Lord's chi-square test of item bias with estimated and with known person parameters. Applied Psychological Measurement, 11, 161-173.
Mediax Interactive Technologies. (1986). MicroScale. Black Rock, CT: Author.
Mellenbergh, G. (1989). Item bias and item response theory. International Journal of Educational Research, 13(2), 127-143.
Millman, J., & Arter, J. A. (1984). Issues in item banking. Journal of Educational Measurement, 21, 315-330.
Mislevy, R. J. (1986). Bayes modal estimation in item response models. Psychometrika, 51, 177-195.
Mislevy, R. J., & Bock, R. D. (1984). BILOG: Maximum likelihood item analysis and test scoring with logistic models. Mooresville, IN: Scientific Software.
Owen, R. J. (1975). A Bayesian sequential procedure for quantal response in the context of adaptive mental testing. Journal of the American Statistical Association, 70, 351-356.
Phillips, S. E., & Mehrens, W. A. (1987). Curricular differences and unidimensionality of achievement test data: An exploratory analysis. Journal of Educational Measurement, 24, 1-16.
Raju, N. S. (1988). The area between two item characteristic curves. Psychometrika, 53, 495-502.
Raju, N. S. (1990). Determining the significance of estimated signed and unsigned areas between two item response functions. Applied Psychological Measurement, 14(2), 197-207.
Rasch, G. (1960). Probabilistic models for some intelligence and attainment tests. Copenhagen: Danish Institute for Educational Research.
Reckase, M. D. (1979). Unifactor latent trait models applied to multi-factor tests: Results and implications. Journal of Educational Statistics, 4, 207-230.

Reshetar, R. (1990). Computer adaptive testing: Development and application (Laboratory of Psychometric and Evaluative Research Report No. 204). Amherst: University of Massachusetts, School of Education.
Rogers, H. J., & Hambleton, R. K. (1989). Evaluation of computer simulated baseline statistics for use in item bias studies. Educational and Psychological Measurement, 49, 355-369.
Rogers, H. J., & Hattie, J. A. (1987). A Monte Carlo investigation of several person and item fit statistics for item response models. Applied Psychological Measurement, 11, 47-57.
Rudner, L. M., Getson, P. R., & Knight, D. L. (1980). Biased item detection techniques. Journal of Educational Statistics, 5, 213-233.
Safrit, M. J., Costa, M. G., & Cohen, A. S. (1989). Item response theory and the measurement of motor behavior. Research Quarterly for Exercise and Sport, 60, 325-335.
Samejima, F. (1969). Estimation of latent ability using a response pattern of graded scores (Psychometric Monograph No. 17). Iowa City, IA: Psychometric Society.
Samejima, F. (1972). A general model for free response data (Psychometric Monograph No. 18). Iowa City, IA: Psychometric Society.
Samejima, F. (1973). Homogeneous case of the continuous response model. Psychometrika, 38, 203-219.
Samejima, F. (1974). Normal ogive model on the continuous response level in the multidimensional latent space. Psychometrika, 39, 111-121.
Samejima, F. (1977). A use of the information function in tailored testing. Applied Psychological Measurement, 1, 233-247.
Shepard, L. A., Camilli, G., & Averill, M. (1981). Comparison of procedures for detecting test-item bias with both internal and external ability criteria. Journal of Educational Statistics, 6, 317-375.
Shepard, L. A., Camilli, G., & Williams, D. M. (1984). Accounting for statistical artifacts in item bias research. Journal of Educational Statistics, 9, 93-128.
Shepard, L. A., Camilli, G., & Williams, D. M. (1985). Validity of approximation techniques for detecting item bias. Journal of Educational Measurement, 22(2), 77-105.
Spray, J. (1990). One-parameter item response theory models for psychomotor tests involving repeated, independent attempts. Research Quarterly for Exercise and Sport, 61(2), 162-168.
Stocking, M. L. (1990). Specifying optimum examinees for item parameter estimation in item response theory. Psychometrika, 55(3), 461-475.
Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.
Subkoviak, M. J., Mack, J. S., Ironson, G. H., & Craig, R. D. (1984). Empirical comparison of selected item bias detection procedures with bias manipulation. Journal of Educational Measurement, 21(1), 49-58.
Swaminathan, H. (1983). Parameter estimation in item response models. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 24-44). Vancouver, BC: Educational Research Institute of British Columbia.
Swaminathan, H., & Gifford, J. A. (1982). Bayesian estimation in the Rasch model. Journal of Educational Statistics, 7, 175-191.

Swaminathan, H., & Gifford, J. A. (1983). Estimation of parameters in the three-parameter latent trait model. In D. Weiss (Ed.), New horizons in testing (pp. 13-30). New York: Academic Press.
Swaminathan, H., & Gifford, J. A. (1985). Bayesian estimation in the two-parameter logistic model. Psychometrika, 50, 349-364.
Swaminathan, H., & Gifford, J. A. (1986). Bayesian estimation in the three-parameter logistic model. Psychometrika, 51, 589-601.
Swaminathan, H., & Rogers, H. J. (1990). Detecting differential item functioning using logistic regression procedures. Journal of Educational Measurement, 27(4), 361-370.
Tatsuoka, K. K. (1987). Validation of cognitive sensitivity for item response curves. Journal of Educational Measurement, 24, 233-245.
Thissen, D. M. (1986). MULTILOG: Item analysis and scoring with multiple category response models (Version 5). Mooresville, IN: Scientific Software.
Thissen, D. M., & Steinberg, L. (1986). A taxonomy of item response models. Psychometrika, 51, 567-577.
Traub, R. E., & Lam, R. (1985). Latent structure and item sampling models for testing. Annual Review of Psychology, 36, 19-48.
Traub, R. E., & Wolfe, R. G. (1981). Latent trait theories and the assessment of educational achievement. In D. C. Berliner (Ed.), Review of research in education (pp. 377-435). Washington, DC: American Educational Research Association.
Tucker, L. R., Humphreys, L. G., & Roznowski, M. A. (1986). Comparative accuracy of five indices of dimensionality of binary items. Champaign-Urbana: University of Illinois, Department of Psychology.
Urry, V. W. (1974). Approximations to item parameters of mental test models and their uses. Educational and Psychological Measurement, 34, 253-269.
Urry, V. W. (1978). ANCILLES: Item parameter estimation program with normal ogive and logistic three-parameter model options. Washington, DC: U.S. Civil Service Commission, Personnel Research and Development Center.
Vale, C. D. (1986). Linking item parameters onto a common scale. Applied Psychological Measurement, 10(4), 333-344.
van der Linden, W. J., & Boekkooi-Timminga, E. (1989). A maximin model for test design with practical constraints. Psychometrika, 54(2), 237-247.
Wainer, H., et al. (Eds.). (1990). Computerized adaptive testing: A primer. Hillsdale, NJ: Lawrence Erlbaum.
Wainer, H., & Thissen, D. (1987). Estimating ability with the wrong model. Journal of Educational Statistics, 12, 339-368.
Weiss, D. J. (1982). Improving measurement quality and efficiency with adaptive testing. Applied Psychological Measurement, 6, 473-492.
Weiss, D. J. (Ed.). (1983). New horizons in testing. New York: Academic Press.
Weiss, D. J. (1985). Adaptive testing by computer. Journal of Consulting and Clinical Psychology, 53, 774-789.
Weiss, D. J., & Kingsbury, G. G. (1984). Application of computerized adaptive testing to educational problems. Journal of Educational Measurement, 21, 361-375.
Wingersky, M. S. (1983). LOGIST: A program for computing maximum likelihood procedures for logistic test models. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 45-56). Vancouver, BC: Educational Research Institute of British Columbia.

Wingersky, M. S., Barton, M. A., & Lord, F. M. (1982). LOGIST user's guide. Princeton, NJ: Educational Testing Service.
Woodcock, R. W. (1978). Development and standardization of the Woodcock-Johnson Psycho-Educational Battery. Hingham, MA: Teaching Resources Corporation.
Wright, B. D. (1968). Sample-free test calibration and person measurement. Proceedings of the 1967 Invitational Conference on Testing Problems. Princeton, NJ: Educational Testing Service.
Wright, B. D. (1977). Solving measurement problems with the Rasch model. Journal of Educational Measurement, 14, 97-116.
Wright, B. D., Mead, R. J., & Bell, S. R. (1979). BICAL: Calibrating items with the Rasch model (Statistical Laboratory Research Memorandum No. 239). Chicago: University of Chicago, School of Education.
Wright, B. D., Schulz, M., & Linacre, J. M. (1989). BIGSCALE: Rasch analysis computer program. Chicago: MESA Press.
Wright, B. D., & Stone, M. H. (1979). Best test design. Chicago: MESA Press.
Yen, W. M. (1980). The extent, causes, and importance of context effects on item parameters for two latent trait models. Journal of Educational Measurement, 17, 297-311.
Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological Measurement, 5, 245-262.
Yen, W. M. (1983). Use of the three-parameter logistic model in the development of a standardized achievement test. In R. K. Hambleton (Ed.), Applications of item response theory (pp. 123-141). Vancouver, BC: Educational Research Institute of British Columbia.
Yen, W. M., Burket, G. R., & Sykes, R. C. (in press). Non-unique solutions to the likelihood equation for the three-parameter logistic model. Psychometrika.

Index

ability scores, 10, 77-88
ability scores, validity of, 78
achievement tests, 102
adaptive testing. See computerized adaptive testing
Aitkin, M., 43, 46
American Educational Research Association, 1
American Psychological Association, 1
anchor test design, 128-129, 136-141
ANCILLES, 47, 49
Andersen, E. B., 46, 58
Andrich, D., 26
Angoff, W. H., 123, 125
Ansley, T. N., 58
area statistic, 113-115
Arter, J. A., 147
ASCAL, 47, 49
Assessment Systems Corporation, 47-49
assumptions, 55-59
assumptions, checking of, 55-59, 61-63
authentic measurement, 153
Averill, M., 120
Baker, F. B., 15, 45, 57
Barton, M. A., 46
Bayesian estimation procedures, 38-39, 43-44
Bayesian item selection, 148
Bejar, I. I., 56
Bell, S. R., 42
BICAL, 42, 46, 48
BIGSCALE, 42, 48
BILOG, 41, 47, 50
binomial trials model, 27
Birnbaum, A., 14, 91-92, 94
Bock, R. D., 26, 37, 43, 46-47, 49, 56, 146, 153
Boekkooi-Timminga, E., 101, 103
Bunderson, C. V., 145
Burket, G. R., 37, 95
Camilli, G., 57, 120
Carlson, J. E., 47, 49
characteristic curve method, 134-135
classical item difficulty, 2, 5
classical item discrimination, 5
classical measurement, 2-5
classical true score model, 4, 7
cognitive components, 154
Cohen, A. S., 27
common persons design, 128
computerized adaptive testing, 145-152
concurrent calibration, 135-136
conditional maximum likelihood estimation, 46
Cook, L. L., 5, 58
Costa, M. G., 27
Craig, R. D., 120
criterion-referenced tests, 78, 102
customized testing, 117
cut-off score, 85
DATAGEN, 54
de Gruijter, D. N. M., 86, 94, 102
differential item functioning, 109-120
Divgi, D. R., 53
domain score, 85

Dorans, N. J., 58-59
Drasgow, F., 56, 58, 62, 113, 154
EAP estimates. See expected a posteriori estimates
eigenvalue plot, 56
Eignor, D. R., 5, 58
Embretson, S. E., 154
equating, classical, 123-126
equating, definition, 123
equating, IRT, 126-142
equipercentile equating, 123-124
equivalent group design, 128
expected a posteriori estimates, 39, 44
Forsyth, R. A., 58
Fraser, C., 47, 49
Getson, P. R., 109
Gibbons, R., 56
Gifford, J. A., 42-44, 46
Glas, C., 47-48
Goldstein, H., 154
goodness of fit, 24, 53-74
goodness-of-fit statistics, 61
graded response model, 26-27
Green, B. F., 146, 148
Green, D. R., 95
group dependent item statistics, 3
growth, measurement of, 102
Gulliksen, H., 57
Gustafsson, J. E., 47-48, 58
Haebara, T., 134
Hambleton, R. K., 4-5, 36-37, 43, 54-59, 70, 78, 82, 86-87, 102-103, 112, 114, 133, 136, 148
Harnisch, D. L., 120
Harris, D., 15
Hastings, C. N., 112
Hattie, J. A., 53, 56, 58
heuristic estimation, 46
Holland, P. W., 115, 119
Horn, J. L., 56
Humphreys, L. G., 56, 146
inappropriateness measurement, 154-155
indeterminacy problem, 41-42
information function. See test or item information function
information matrix, 44
Inouye, D. K., 145
invariance, assessment of, 57, 59, 63-66
invariance, property of, 18-25
invariant model parameters, 18
Ironson, G. H., 120
item bias, 109
item characteristic curve, 7
item characteristic function, 7
item difficulty parameter, 13
item discrimination parameter, 15
item information function, 91-94, 99-106, 147
item misfit statistics, 61, 72-74
item selection, 99-106
Jarjoura, D., 128
joint Bayesian estimation, 46
joint maximum likelihood estimation, 41-44, 46
Jones, R. W., 103
Kendall, M. G., 39
Kingsbury, G. G., 148, 152
Kingston, N. M., 58-59, 102
Klein, L. W., 128
Knight, D. L., 109
Kolen, M. J., 123
Lam, R., 61
latent space, 10
Levine, M. V., 58, 110, 154-155
Lieberman, M., 43
likelihood function, 34
Linacre, J. M., 42
linear equating, 124-125
linear programming, 103
linking designs, 128-129
Linn, R. L., 78, 87, 110, 120, 132, 146
Lissak, R. I., 56, 62
local independence, 9-12
LOGIST, 42, 46-47, 49, 135
logistic models: one-parameter, 12-14, 81-83; two-parameter, 14-17; three-parameter, 17-18

logistic regression procedure, 119
logits, 83
Lord, F. M., 4, 11, 14, 43, 45-46, 57-59, 95, 100-101, 110, 112, 125, 133-135, 147-148, 154
lower asymptote, 13, 16-17
Ludlow, L. H., 58
Mack, J. S., 120
Mantel-Haenszel method, 115
marginal Bayesian estimation, 46
marginal maximum likelihood estimation, 43, 48
Masters, G. N., 26, 153
maximum information item selection, 148
maximum likelihood criterion, 33
maximum likelihood estimation, 33-38
McDonald, R. P., 10, 26, 46-47, 49, 56, 154
McLaughlin, M. E., 58, 113, 155
Mead, R. J., 42
mean and sigma method, 131-132, 139-141
Mediax Interactive Technologies, 47-48
Mehrens, W. A., 58
Mellenbergh, G., 120
MICROSCALE, 47-48
Millman, J., 147
MIRTE, 50
Mislevy, R. J., 37, 43-47, 49
model fit. See goodness of fit
multidimensional models, 10, 154-155
MULTILOG, 50
Muraki, E., 56
NAEP, 102
National Council on Measurement in Education, 1
Newton-Raphson procedure, 36, 40
NOHARM, 47, 50
nominal response model, 26
non-linear factor analysis, 46, 56
normal ogive model, 14-15
Novick, M. R., 11, 154
odds for success, 81-83
Olsen, J. B., 145
optimal item selection. See item selection
Owen, R. J., 148
parallel tests, 4
Parsons, C. K., 58
Phillips, S. E., 58
Pieters, J. M. P., 148
PML, 47-48
Poisson counts model, 28
polytomous response models, 26-28, 153-154
predictions, 59-61
pseudo-chance level parameter, 17
Q1 statistic, 54-61
Raju, N. S., 113-114, 121
RASCAL, 48
Rasch, G., 14, 46
Rasch model, 14
Reckase, M. D., 56, 146
regression method, 130
regression models, 19, 32
relative efficiency, 95-96
reliability, classical, 4, 94
Reshetar, R., 149
residuals, 59-61, 66-67
RIDA, 47-48, 135
Rogers, H. J., 53, 55, 103, 114-115, 119
Rovinelli, R. J., 54, 56, 70
Roznowski, M. A., 56
Rudner, L. M., 109-110, 113
Safrit, M. J., 27
Samejima, F., 26, 95, 153-154
scaling, 125-126
scaling constants, 129-135
Schulz, M., 42
score transformations, 78-87
Shepard, L. A., 57, 120
single-group design, 128
speededness, assessment of, 57
Spray, J., 26-27
standard error. See standard error of estimation
standard error of estimation, 44-45, 94-95, 106
standard error of measurement, 4, 94

standardized residuals, 59-61, 68-72
Steinberg, L., 153
Stocking, M. L., 63, 102, 133-135
Stone, M. H., 5, 14, 57-58
Stuart, A., 39
Subkoviak, M. J., 120
Swaminathan, H., 5, 36-37, 42-46, 55, 57-59, 70, 82, 112, 119, 133, 136
Sykes, R. C., 37
Taft, H. L., 58
tailored testing. See computerized adaptive testing
target information function, 101-102
Tatsuoka, K. K., 58
test characteristic curve, 85
test dependent ability scores, 5
test development, 92-96, 99-106
test fairness, 109
test information function, 38, 44, 94-95, 100-106
Test Standards, 1
Thayer, D. T., 115, 119
Thissen, D. M., 47, 49, 58, 153
Traub, R. E., 58, 61
true proportion-correct score, 85
true score, 2, 84-87
Tucker, L. R., 56
unidimensionality, 9-10, 56-57
Urry, V. W., 46-49
Vale, C. D., 128, 142
van der Linden, W. J., 4, 101, 103
Wainer, H., 58, 145-146
Wardrop, J. L., 112
Weiss, D. J., 5, 145-146, 148, 152
Williams, D. M., 57, 120
Wingersky, M. S., 42, 46, 48
WITS scale, 80
Wolfe, R. G., 58, 61
Wood, R., 154
Woodcock, R. W., 80
Woodcock-Johnson Psycho-Educational Battery, 80
Wright, B. D., 5, 14, 26, 42, 46-48, 57-59, 78, 153
Yen, W. M., 37, 54, 58, 61, 95, 103
Zaal, J. N., 148
Zara, A. R., 148

About the Authors

Ronald K. Hambleton is Professor of Education and Psychology and Chairman of the Laboratory of Psychometric and Evaluative Research at the University of Massachusetts at Amherst. He received his Ph.D. in psychometric methods from the University of Toronto in 1969. His principal research interests are in the areas of criterion-referenced measurement and item response theory. His most recent books are Item Response Theory: Principles and Applications (co-authored with H. Swaminathan), A Practical Guide to Criterion-Referenced Testing, and, forthcoming, Advances in Educational and Psychological Testing (co-edited with Jac Zaal). He has served as an Associate Editor to the Journal of Educational Statistics (1981-1989) and currently serves on the editorial boards of Applied Measurement in Education, Multivariate Behavioral Research, Applied Psychological Measurement, Journal of Educational Measurement, Educational and Psychological Measurement, Evaluation and the Health Professions, and Psicothema. He has served also as President of the International Test Commission (1990-1994) and as President of the National Council on Measurement in Education (1989-1990).

H. Swaminathan is Professor of Education and Psychology at the University of Massachusetts at Amherst. He received his Ph.D. in psychometric methods and statistics from the University of Toronto in 1971. He has held the positions of Associate Dean of Academic Affairs and Acting Dean of the School of Education. He has served as an Associate Editor to the Journal of Educational Statistics and currently is an Associate Editor of Psicothema and Revista Portuguesa de Educação. He has served also as the President of Educational Statisticians, a special interest group of the American Educational Research Association; co-program chair of Division D of AERA; and as a member of the

Graduate Record Examinations Board. His principal research interests are in the areas of item response theory, multivariate statistics, and Bayesian analysis.

H. Jane Rogers is Assistant Professor at Teachers College, Columbia University. She received her Ph.D. in psychometric methods from the University of Massachusetts in 1989. Her research interests include item response theory, large-scale assessment, methods for the detection of differential item functioning, Bayesian methods, and multivariate statistics.

