Fundamentals of Item Response Theory

Ronald K. Hambleton
H. Swaminathan
H. Jane Rogers

SAGE PUBLICATIONS
The International Professional Publishers
Newbury Park  London  New Delhi
Copyright © 1991 by Sage Publications, Inc.

All rights reserved. No part of this book may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage and retrieval system, without permission in writing from the publisher.

For information address:

SAGE Publications, Inc.
2455 Teller Road
Newbury Park, California 91320

SAGE Publications Ltd.
6 Bonhill Street
London EC2A 4PU
United Kingdom

SAGE Publications India Pvt. Ltd.
M-32 Market
Greater Kailash I
New Delhi 110 048 India

Printed in the United States of America

Library of Congress Cataloging-in-Publication Data

Hambleton, Ronald K.
Fundamentals of item response theory / Ronald K. Hambleton, H. Swaminathan, H. Jane Rogers.
p. cm. (Measurement methods for the social sciences; v. 2)
Includes bibliographical references and index.
ISBN 0-8039-3646-X (cloth). ISBN 0-8039-3647-8 (pbk.)
1. Item response theory. I. Swaminathan, H. II. Rogers, H. Jane. III. Title. IV. Series.
BF176.H34 1991
150'.28'7 dc20    91-22005

FIRST PRINTING, 1991

Sage Production Editor: Diane S. Foster
Contents

Series Editor's Foreword  vii
Preface  ix
1. Background  1
2. Concepts, Models, and Features  7
3. Ability and Item Parameter Estimation  32
4. Assessment of Model-Data Fit  53
5. The Ability Scale  77
6. Item and Test Information and Efficiency Functions  91
7. Test Construction  99
8. Identification of Potentially Biased Test Items  109
9. Test Score Equating  123
10. Computerized Adaptive Testing  145
11. Future Directions of Item Response Theory  153
Appendix A: Classical and IRT Parameter Estimates for the New Mexico State Proficiency Exam  156
Appendix B: Sources for IRT Computer Programs  159
References  161
Index  169
About the Authors  173
Series Editor's Foreword

In the last decade we have witnessed a revolution in educational and psychological measurement as the application of classical measurement theory has given way to the use of item response theory (IRT). Today, IRT is used commonly by the largest testing companies in the United States and Europe for design of tests, test assembly, test scaling and calibration, construction of test item banks, investigations of test item bias, and other common procedures in the test development process. Measurement researchers, public school systems, the military, and several civilian branches of the federal government as well, have endorsed and employed IRT with increasing enthusiasm and frequency.

This book provides a lucid but rigorous introduction to the fundamental concepts of item response theory, followed by thorough, accessible descriptions of the application of IRT methods to problems in test construction, identification of potentially biased test items, test equating, and computerized-adaptive testing. A summary of new directions in IRT research and development completes the book. Hambleton, Swaminathan, and Rogers have developed IRT theory and application through carefully wrought arguments, appeals to familiar concepts from classical measurement methods and basic statistics, and extensively described, step-by-step numerical examples. The book is well illustrated with tables containing the results of actual IRT analyses and figures that portray the influence on IRT results of such fundamental issues as models with differing numbers of item parameters, examinees at differing levels of ability, and varying degrees of congruence ("fit") between sets of data and IRT models. Although equations necessary to compute most IRT statistics are provided in the book, their mathematical derivations have been omitted. Nevertheless, this is not a "cookbook" on IRT methods.
The reader will find thorough discussion of alternative procedures for estimating IRT
parameters: maximum likelihood estimation, marginal maximum likelihood estimation, Bayesian estimation, and so on. Knowledge of the underlying calculus is not required to understand the origins of these procedures and the distinctions among them. Hambleton et al. have been faithful to the goal of the Measurement Methods for the Social Sciences series, to make complex measurement concepts, topics, and methods accessible to readers with limited mathematical backgrounds but a keen desire to understand, as well as use, methods that are on the cutting edge of social science assessment. This book introduces powerful new measurement concepts and applications in ways that can be understood and used correctly by thousands for whom IRT heretofore has been no more than a fascinating mystery.

RICHARD M. JAEGER
University of North Carolina at Greensboro
Preface

The popular (or classical) measurement models and procedures for constructing educational and psychological tests and interpreting test scores have served testing specialists well for a long time. A review of test catalogs and recent editions of the Mental Measurements Yearbook and Test Critiques would reveal that numerous achievement, aptitude, and personality tests have been constructed using these classical models and procedures. The ways in which educational and psychological tests usually are constructed, evaluated, and used have many well-documented shortcomings, however (see, for example, Hambleton, 1989). These shortcomings include (a) use of item indices whose values depend on the particular group of examinees with which they are obtained, and (b) examinee ability estimates that depend on the particular choice of items selected for a test.

Psychometricians have advanced a new measurement system, item response theory (IRT), to address these and other shortcomings of common measurement practices. In the 1980s, item response theory was one of the dominant topics of study among measurement specialists. Many IRT models of current interest will be described in this book. Because item response theory provides a useful framework for solving a wide variety of measurement problems, many test publishers, state and provincial departments of education, credentialing agencies, school districts, armed forces, and industries use item response theory to assist in building tests, identifying potentially biased test items, equating scores from different tests or different forms of the same test, and reporting test scores. Item response theory has many other promising applications as well. Several of these applications will be discussed in some detail in this book.

Why publish an IRT book at this time? Interest in learning about this new measurement theory and in applying it is worldwide, and the need exists for practical instructional material.
The purpose of this book,
therefore, is to provide a comprehensive and practical introduction to the field of item response theory. The limitations of classical measurement procedures are addressed to provide a rationale for an alternative psychometric model. The fundamentals of item response theory, including models, assumptions, and properties, as well as parameter estimation, procedures for assessing model-data fit, alternative reporting scales, and item and test information and efficiency, constitute the central part of the book. Several important IRT applications are described in later chapters. Connections between classical test theory and item response theory are made wherever possible to enhance the clarity of the material.

Since the book is intended for newcomers to the IRT field with modest statistical skills, our approach focuses on the conceptual basis of item response theory and avoids discussion of mathematical derivations or complex statistical aspects of the theory. Follow-up references are given for these important aspects. Examples and illustrations are used as often as possible. Exercises and complete answers are included at the end of each chapter to enable practitioners to gain experience with IRT models and procedures. Finally, some of the popular IRT computer programs are introduced, along with a discussion of their strengths and weaknesses. Information about the computer programs should facilitate the successful application of IRT models.

In summary, IRT consists of a family of models that have been demonstrated to be useful in the design, construction, and evaluation of educational and psychological tests. As further research is carried out, the remaining technical problems associated with applying the models should be resolved.
In addition, it is expected that newer and more applicable IRT models will be developed in the coming years, enabling IRT to provide even better solutions to important measurement problems. We hope that this book will be useful to measurement specialists who wish to explore the utility of IRT in their own work.

We are grateful to several colleagues, former students, and current students who provided extensive reviews of an earlier draft of this book: Lloyd Bond, University of North Carolina at Greensboro; Linda L. Cook and Daniel Eignor, Educational Testing Service; Wendy Yen and Anne Fitzpatrick, CTB/Macmillan/McGraw-Hill; and Russell W. Jones, University of Massachusetts at Amherst. Their comments often forced us to clarify our discussions and positions on various technical matters. The book is more readable and technically correct because of our reviewers' insights and experience.
1

Background

Consider a typical measurement practitioner. Dr. Testmaker works for a company that specializes in the development and analysis of achievement and aptitude tests. The tests developed by Dr. Testmaker's company are used in awarding high school diplomas, promoting students from one grade to the next, evaluating the quality of education, identifying workers in need of training, and credentialing practitioners in a wide variety of professions. Dr. Testmaker knows that the company's clients expect high quality tests, tests that meet their needs and that can stand up technically to legal challenges. Dr. Testmaker refers to the AERA/APA/NCME Standards for Educational and Psychological Testing (1985) and is familiar with the details of a number of lawsuits that have arisen because of questions about test quality or test misuse.

Dr. Testmaker's company uses classical test theory models and methods to address most of its technical problems (e.g., item selection, reliability assessment, test score equating), but recently its clients have been suggesting, and sometimes requiring, that item response theory (IRT) be used with their tests. Dr. Testmaker has only a rudimentary knowledge of item response theory and no previous experience in applying it, and consequently he has many questions, such as the following:

1. What IRT models are available, and which model should be used?
2. Which of the many available algorithms should be used to estimate parameters?
3. Which IRT computer program should be used to analyze the data?
4. How can the fit of the chosen IRT model to the test data be determined?
5. What is the relationship between test length and the precision of ability estimates?
6. How can IRT item statistics be used to construct tests to meet content and technical specifications?
7. How can IRT be used to evaluate the statistical consequences of changing items in a test?
8. How can IRT be used to assess the relative utility of different tests that are measuring the same ability?
9. How can IRT be used to detect the presence of potentially biased test items?
10. How can IRT be used to place test item statistics obtained from nonequivalent samples of examinees on a common scale?

The purpose of this book is to provide an introduction to item response theory that will address the above questions and many others. Specifically, it will (a) introduce the basic concepts and most popular models of item response theory, (b) address parameter estimation and available computer programs, (c) demonstrate approaches to assessing model-data fit, (d) describe the scales on which abilities and item characteristics are reported, and (e) describe the application of IRT to test construction, detection of differential item functioning, equating, and adaptive testing. The book is intended to be practically oriented, and numerous examples are presented to highlight selected technical points.

Limitations of Classical Measurement Models

Dr. Testmaker's clients are turning toward item response theory because classical testing methods and measurement procedures have a number of shortcomings. Perhaps the most important shortcoming is that examinee characteristics and test characteristics cannot be separated: each can be interpreted only in the context of the other. The examinee characteristic we are interested in is the "ability" measured by the test. What do we mean by ability?
In the classical test theory framework, the notion of ability is expressed by the true score, which is defined as "the expected value of observed performance on the test of interest." An examinee's ability is defined only in terms of a particular test. When the test is "hard," the examinee will appear to have low ability; when the test is "easy," the examinee will appear to have higher ability. What do we mean by "hard" and "easy" tests? The difficulty of a test item is defined as "the proportion of examinees in a group of
interest who answer the item correctly." Whether an item is hard or easy depends on the ability of the examinees being measured, and the ability of the examinees depends on whether the test items are hard or easy! Item discrimination and test score reliability and validity are also defined in terms of a particular group of examinees. Test and item characteristics change as the examinee context changes, and examinee characteristics change as the test context changes. Hence, it is very difficult to compare examinees who take different tests and very difficult to compare items whose characteristics are obtained using different groups of examinees. (This is not to say that such comparisons are impossible: Measurement specialists have devised procedures to deal with these problems in practice, but the conceptual problem remains.)

Let us look at the practical consequences of item characteristics that depend on the group of examinees from which they are obtained, that is, are group-dependent. Group-dependent item indices are of limited use when constructing tests for examinee populations that are dissimilar to the population of examinees with which the item indices were obtained. This limitation can be a major one for test developers, who often have great difficulty securing examinees for field tests of new instruments, especially examinees who can represent the population for whom the test is intended. Consider, for example, the problem of field-testing items for a state proficiency test administered in the spring of each year. Examinees included in a field test in the fall will, necessarily, be less capable than those examinees tested in the spring. Hence, items will appear more difficult in the field test than they will appear in the spring test administration. A variation on the same problem arises with item banks, which are becoming widely used in test construction.
Suppose the goal is to expand the bank by adding a new set of test items along with their item indices. If these new item indices are obtained on a group of examinees different from the groups who took the items already in the bank, the comparability of item indices must be questioned.

What are the consequences of examinee scores that depend on the particular set of items administered, that is, are test-dependent? Clearly, it is difficult to compare examinees who take different tests: The scores on the two tests are on different scales, and no functional relationship exists between the scales. Even if the examinees are given the same or parallel tests, a problem remains. When the examinees are of different ability (i.e., the test is more difficult for one group than for the other), their test scores contain different amounts of error. To demonstrate this
point intuitively, consider an examinee who obtains a score of zero: This score tells us that the examinee's ability is low but provides no information about exactly how low. On the other hand, when an examinee gets some items right and some wrong, the test score contains information about what the examinee can and cannot do, and thus gives a more precise measure of ability. If the test scores for two examinees are not equally precise measures of ability, how may comparisons between the test scores be made? To obtain scores for two examinees that contain equal amounts of error (i.e., scores that are equally reliable), we can match test difficulty with the approximate ability levels of the examinees; yet, when several forms of a test that differ substantially in difficulty are used, test scores are, again, not comparable. Consider two examinees who perform at the 50% level on two tests that differ substantially in difficulty: These examinees cannot be considered equivalent in ability. How different are they? How may two examinees be compared when they receive different scores on tests that differ in difficulty but measure the same ability? These problems are difficult to resolve within the framework of classical measurement theory.

Two more sources of dissatisfaction with classical test theory lie in the definition of reliability and what may be thought of as its conceptual converse, the standard error of measurement. Reliability, in a classical test theory framework, is defined as "the correlation between test scores on parallel forms of a test." In practice, satisfying the definition of parallel tests is difficult, if not impossible. The various reliability coefficients available provide either lower bound estimates of reliability or reliability estimates with unknown biases (Hambleton & van der Linden, 1982).
The problem with the standard error of measurement, which is a function of test score reliability and variance, is that it is assumed to be the same for all examinees. But as pointed out above, scores on any test are unequally precise measures for examinees of different ability. Hence, the assumption of equal errors of measurement for all examinees is implausible (Lord, 1984).

A final limitation of classical test theory is that it is test oriented rather than item oriented. The classical true score model provides no consideration of how examinees respond to a given item. Hence, no basis exists for determining how well a particular examinee might do when confronted with a test item. More specifically, classical test theory does not enable us to make predictions about how an individual or a group of examinees will perform on a given item. Such questions
as, What is the probability of an examinee answering a given item correctly? are important in a number of testing applications. Such information is necessary, for example, if a test designer wants to predict test score characteristics for one or more populations of examinees or to design tests with particular characteristics for certain populations of examinees. For example, a test intended to discriminate well among scholarship candidates may be desired.

In addition to the limitations mentioned above, classical measurement models and procedures have provided less-than-ideal solutions to many testing problems, for example, the design of tests (Lord, 1980), the identification of biased items (Lord, 1980), adaptive testing (Weiss, 1983), and the equating of test scores (Cook & Eignor, 1983, 1989). For these reasons, psychometricians have sought alternative theories and models of mental measurement. The desirable features of an alternative test theory would include (a) item characteristics that are not group-dependent, (b) scores describing examinee proficiency that are not test-dependent, (c) a model that is expressed at the item level rather than at the test level, (d) a model that does not require strictly parallel tests for assessing reliability, and (e) a model that provides a measure of precision for each ability score. It has been shown that these features can be obtained within the framework of an alternative test theory known as item response theory (Hambleton, 1983; Hambleton & Swaminathan, 1985; Lord, 1980; Wright & Stone, 1979).

Exercises for Chapter 1

1. Identify four of the limitations of classical test theory that have stimulated measurement specialists to pursue alternative measurement models.

2. Item responses on a test item and total test scores for 30 examinees are given in Table 1.1. The first 15 examinees were classified as "low ability" based on their total scores; the second 15 examinees were classified as "high ability."

a.
Calculate the proportion of examinees in each group who answered the item correctly (this is the classical item difficulty index in each group).
b. Compute the item-total correlation in each group (this is the classical item discrimination index in each group).
c. What can you conclude regarding the invariance of the classical item indices?
TABLE 1.1

Low-Ability Group                     High-Ability Group
Examinee  Item Response  Total Score  Examinee  Item Response  Total Score
 1        0               8           16        1              33
 2        0              12           17        0              28
 3        0               6           18        1              29
 4        0              12           19        1              30
 5        0               8           20        1              29
 6        0               8           21        0              28
 7        0               8           22        1              33
 8        0              11           23        1              32
 9        1              13           24        1              32
10        0               4           25        1              33
11        1              14           26        0              34
12        1              13           27        1              35
13        0              10           28        1              34
14        0               9           29        1              38
15        0               8           30        1              37

Answers to Exercises for Chapter 1

1. Item-dependent ability scores; sample-dependent item statistics; no probability information available about how examinees of specific abilities might perform on certain test items; restriction of equal measurement errors for all examinees.

2. a. Low-scoring group: p = 0.2. High-scoring group: p = 0.8.
   b. Low-scoring group: r = 0.68. High-scoring group: r = 0.39.
   c. Classical item indices are not invariant across subpopulations.
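The exercise statistics can be reproduced with a short script. The helper functions below are illustrative (they are not from the book); the item-total correlation is computed as the point-biserial correlation between the dichotomous item response and the total score:

```python
def item_difficulty(responses):
    # Classical item difficulty: proportion of examinees answering correctly.
    return sum(responses) / len(responses)

def point_biserial(responses, scores):
    # Classical item discrimination: correlation between the dichotomous
    # item response (0/1) and the total test score.
    n = len(responses)
    p = item_difficulty(responses)
    mean_1 = sum(s for u, s in zip(responses, scores) if u == 1) / sum(responses)
    mean_0 = sum(s for u, s in zip(responses, scores) if u == 0) / (n - sum(responses))
    mean_all = sum(scores) / n
    sd = (sum((s - mean_all) ** 2 for s in scores) / n) ** 0.5  # population SD
    return (mean_1 - mean_0) / sd * (p * (1 - p)) ** 0.5

# Table 1.1 data: item responses and total scores for the two groups.
low_u  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0]
low_x  = [8, 12, 6, 12, 8, 8, 8, 11, 13, 4, 14, 13, 10, 9, 8]
high_u = [1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1]
high_x = [33, 28, 29, 30, 29, 28, 33, 32, 32, 33, 34, 35, 34, 38, 37]

print(round(item_difficulty(low_u), 2), round(item_difficulty(high_u), 2))  # 0.2 0.8
print(round(point_biserial(low_u, low_x), 2),
      round(point_biserial(high_u, high_x), 2))                             # 0.68 0.39
```

Note how the same item yields very different statistics in the two groups, which is exactly the group-dependence discussed in the chapter.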
2

Concepts, Models, and Features

Basic Ideas

Item response theory (IRT) rests on two basic postulates: (a) the performance of an examinee on a test item can be predicted (or explained) by a set of factors called traits, latent traits, or abilities; and (b) the relationship between examinees' item performance and the set of traits underlying item performance can be described by a monotonically increasing function called an item characteristic function or item characteristic curve (ICC). This function specifies that as the level of the trait increases, the probability of a correct response to an item increases.

Figure 2.1 shows an item characteristic function for the case when only one trait underlies performance on the item, together with distributions of ability for two groups of examinees. Observe that examinees with higher values on the trait have higher probabilities of answering the item correctly than do examinees with lower values on the trait, regardless of group membership.

Many possible item response models exist, differing in the mathematical form of the item characteristic function and/or the number of parameters specified in the model. All IRT models contain one or more parameters describing the item and one or more parameters describing the examinee. The first step in any IRT application is to estimate these parameters. Procedures for parameter estimation are discussed in chapter 3.

Item response models, unlike the classical true score model, are falsifiable models. A given item response model may or may not be appropriate for a particular set of test data; that is, the model may not adequately predict or explain the data. In any IRT application, it is essential to assess the fit of the model to the data. Procedures for assessing model-data fit are discussed in chapter 4.
[Figure 2.1: an item characteristic curve, plotting probability of a correct response against ability, shown together with the ability distributions of two groups of examinees.]

Figure 2.1. An Item Characteristic Curve and Distributions of Ability for Two Groups of Examinees

When a given IRT model fits the test data of interest, several desirable features are obtained. Examinee ability estimates are not test-dependent, and item indices are not group-dependent. Ability estimates obtained from different sets of items will be the same (except for measurement errors), and item parameter estimates obtained in different groups of examinees will be the same (except for measurement errors). In item response theory, item and ability parameters are said to be invariant. The property of invariance of item and ability parameters is obtained by incorporating information about the items into the ability-estimation process and by incorporating information about the examinees' abilities into the item-parameter-estimation process. The invariance of item parameters is illustrated in Figure 2.1, which shows distributions of ability for two groups of examinees. Note that examinees of the same ability have the same probability of giving a correct response to the item, regardless of whether they are from Group 1 or Group 2. Since the probability of success for an examinee with given ability is determined by the item's parameters, the item parameters must also be the same for the two groups.

In addition to the desirable features mentioned above, IRT provides estimates of standard errors for individual ability estimates, rather than
a single estimate of error for all examinees, as is the case in classical test theory.

The mathematical models employed in IRT specify that an examinee's probability of answering a given item correctly depends on the examinee's ability or abilities and the characteristics of the item. IRT models include a set of assumptions about the data to which the model is applied. Although the viability of assumptions cannot be determined directly, some indirect evidence can be collected and assessed, and the overall fit of the model to the test data can be assessed as well (see chapter 4).

An assumption common to the IRT models most widely used is that only one ability is measured by the items that make up the test. This is called the assumption of unidimensionality. A concept related to unidimensionality is that of local independence. Unidimensionality and local independence are discussed in the next section.

Another assumption made in all IRT models is that the item characteristic function specified reflects the true relationship among the unobservable variables (abilities) and observable variables (item responses). Assumptions are made also about the item characteristics that are relevant to an examinee's performance on an item. The major distinction among the IRT models in common use is in the number and type of item characteristics assumed to affect examinee performance. These assumptions will be discussed shortly.

Unidimensionality

As stated above, a common assumption of IRT models is that only one ability is measured by a set of items in a test. This assumption cannot be strictly met because several cognitive, personality, and test-taking factors always affect test performance, at least to some extent.
These factors might include level of motivation, test anxiety, ability to work quickly, tendency to guess when in doubt about answers, and cognitive skills in addition to the dominant one measured by the set of test items. What is required for the unidimensionality assumption to be met adequately by a set of test data is the presence of a "dominant" component or factor that influences test performance. This dominant
component or factor is referred to as the ability measured by the test; it should be noted, however, that ability is not necessarily inherent or unchangeable. Ability scores may be expected to change over time because of learning, forgetting, and other factors.

Item response models in which a single dominant ability is presumed sufficient to explain or account for examinee performance are referred to as unidimensional models. Models in which it is assumed that more than one ability is necessary to account for examinee test performance are referred to as multidimensional. These latter models are more complex and, to date, have not been well developed (McDonald, 1981).

Local Independence

Local independence means that when the abilities influencing test performance are held constant, examinees' responses to any pair of items are statistically independent. In other words, after taking examinees' abilities into account, no relationship exists between examinees' responses to different items. Simply put, this means that the abilities specified in the model are the only factors influencing examinees' responses to test items. This set of abilities represents the complete latent space. When the assumption of unidimensionality holds, the complete latent space consists of only one ability.

To state the definition of local independence more formally, let θ be the complete set of abilities assumed to influence the performance of an examinee on the test. Let Ui be the response of a randomly chosen examinee to item i (i = 1, 2, ..., n). Let P(Ui | θ) denote the probability of the response of a randomly chosen examinee with ability θ; P(Ui = 1 | θ) denotes the probability of a correct response, and P(Ui = 0 | θ) denotes the probability of an incorrect response. The property of local independence can be stated mathematically in the following way:

P(U1, U2, ..., Un | θ) = P(U1 | θ) P(U2 | θ) ... P(Un | θ) = ∏ P(Ui | θ)   (product over i = 1, ..., n)

The property of local independence means that for a given examinee (or all examinees at a given ability value) the probability of a response
pattern on a set of items is equal to the product of probabilities associated with the examinee's responses to the individual items. For example, if the response pattern for an examinee on three items is (1, 1, 0), that is, U1 = 1, U2 = 1, and U3 = 0, then the assumption of local independence implies that

P(U1 = 1, U2 = 1, U3 = 0 | θ) = P(U1 = 1 | θ) P(U2 = 1 | θ) P(U3 = 0 | θ) = P1 P2 Q3

where

Pi = P(Ui = 1 | θ)   and   Qi = 1 - Pi

The notion of local independence described above may seem counterintuitive. An examinee's responses to several test items cannot be expected to be uncorrelated; that is, the responses are unlikely to be independent. In what sense, then, can local independence hold? When variables are correlated, they have some traits in common. When these traits are "partialled out" or "held constant," the variables become uncorrelated. This is the basic principle underlying factor analysis. Similarly, in item response theory, the relationships among an examinee's responses to several test items are due to the traits (abilities) influencing performance on the items. After "partialling out" the abilities (i.e., conditioning on ability), the examinee's responses to the items are likely to be independent. For this reason, the assumption of local independence can also be referred to as the assumption of conditional independence.

When the assumption of unidimensionality is true, local independence is obtained; in this sense, the two concepts are equivalent (Lord, 1980; Lord & Novick, 1968). Local independence can be obtained, however, even when the data set is not unidimensional. Local independence will be obtained whenever the complete latent space has been specified; that is, when all the ability dimensions influencing performance have been taken into account. Conversely, local independence does not hold when the complete latent space has not been specified.
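The three-item product rule is easy to check numerically. A minimal sketch (the function name and the item probabilities are illustrative, not from the text):

```python
def pattern_probability(item_probs, pattern):
    """Probability of a dichotomous response pattern under local independence.

    item_probs: P_i = P(U_i = 1 | theta) for each item, at one fixed ability.
    pattern:    observed responses u_i (1 = correct, 0 = incorrect).
    """
    prob = 1.0
    for p_i, u_i in zip(item_probs, pattern):
        prob *= p_i if u_i == 1 else (1.0 - p_i)  # factor is P_i or Q_i = 1 - P_i
    return prob

# Three items with P1 = 0.8, P2 = 0.6, P3 = 0.5 at some fixed theta:
# the pattern (1, 1, 0) has probability P1 * P2 * Q3 = 0.8 * 0.6 * 0.5.
print(round(pattern_probability([0.8, 0.6, 0.5], [1, 1, 0]), 4))  # 0.24
```

Conditioning on a single ability value is essential here; the same product would not hold for probabilities pooled over examinees of different abilities.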
For example, on a mathematics test item that requires a high level of reading skill, examinees with poor reading skills will not answer the item correctly regardless of their
mathematical proficiency. Hence, a dimension other than mathematical proficiency will influence performance on the item; if a unidimensional IRT model is fitted to the data, local independence will not hold. On the other hand, if all the examinees have the requisite reading skills, only mathematical proficiency will influence performance on the item and local independence will be obtained when a unidimensional model is fitted.

Local independence also may not hold when a test item contains a clue to the correct answer, or provides information that is helpful in answering another item. In this case, some examinees will detect the clue and some examinees will not. The ability to detect the clue is a dimension other than the ability being tested. If a unidimensional model is fitted, local independence will not hold.

Popular Models in Item Response Theory

An item characteristic function or item characteristic curve (ICC) is a mathematical expression that relates the probability of success (i.e., giving a correct response) on an item to the ability measured by the test and the characteristics of the item. While it is possible to conceive of an infinite number of IRT models, only a few models are in current use. A primary distinction among the most popular unidimensional item response models is in the number of parameters used to describe items. The choice of model is up to the user, but this choice involves assumptions about the data that can be verified later by examining how well the model "explains" the observed test results. The three most popular unidimensional IRT models are the one-, two-, and three-parameter logistic models, so named because of the number of item parameters each incorporates. These models are appropriate for dichotomous item response data.

One-Parameter Logistic Model

The one-parameter logistic model is one of the most widely used IRT models.
Item characteristic curves for the one-parameter logistic model are given by the equation

P_i(θ) = e^{(θ − b_i)} / (1 + e^{(θ − b_i)})     i = 1, 2, ..., n     [2.1]
where

P_i(θ) is the probability that a randomly chosen examinee with ability θ answers item i correctly,
b_i is the item i difficulty parameter,
n is the number of items in the test, and
e is a transcendental number (like π) whose value is 2.718 (correct to three decimals).

P_i(θ) is an S-shaped curve with values between 0 and 1 over the ability scale.

The b_i parameter for an item is the point on the ability scale where the probability of a correct response is 0.5. This parameter is a location parameter, indicating the position of the ICC in relation to the ability scale. The greater the value of the b_i parameter, the greater the ability that is required for an examinee to have a 50% chance of getting the item right; hence, the harder the item. Difficult items are located to the right or the higher end of the ability scale; easy items are located to the left or the lower end of the ability scale. When the ability values of a group are transformed so that their mean is 0 and their standard deviation is 1, the values of b_i vary (typically) from about −2.0 to +2.0. Values of b_i near −2.0 correspond to items that are very easy, and values of b_i near 2.0 correspond to items that are very difficult for the group of examinees.

Some sample ICCs for the one-parameter model are shown in Figure 2.2. The item parameters are as follows: for Item 1, b_1 = 1.0; for Item 2, b_2 = 2.0; for Item 3, b_3 = −1.0; and for Item 4, b_4 = 0.0. Note that the curves differ only by their location on the ability scale. In the one-parameter model, it is assumed that item difficulty is the only item characteristic that influences examinee performance. No item parameter corresponds to the classical test theory item discrimination index; in effect, this is equivalent to the assumption that all items are equally discriminating.
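Equation 2.1 is easy to evaluate directly; a sketch using the four difficulty values of Figure 2.2, confirming that each curve passes through 0.5 at θ = b_i:

```python
import math

def p_1pl(theta, b):
    """Equation 2.1: one-parameter logistic item characteristic curve."""
    return math.exp(theta - b) / (1.0 + math.exp(theta - b))

# Difficulty parameters for the four items of Figure 2.2
b_values = {1: 1.0, 2: 2.0, 3: -1.0, 4: 0.0}

for item, b in b_values.items():
    # At theta = b, the probability of a correct response is exactly 0.5;
    # the value at theta = 0 shows how the curves are shifted along the scale.
    print(item, p_1pl(b, b), round(p_1pl(0.0, b), 3))
```

Because all four curves share the same shape, an easier item's curve is simply the harder item's curve translated to the left.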
Note also that the lower asymptote of the ICC is zero: this specifies that examinees of very low ability have zero probability of correctly answering the item. Thus, no allowance is made for the possibility that low-ability examinees may guess, as they are likely to do on multiple-choice items.

Clearly, the one-parameter model is based on restrictive assumptions. The appropriateness of these assumptions depends on the nature of the data and the importance of the intended application. For example, the assumptions may be quite acceptable for relatively easy tests
Figure 2.2. One-Parameter Item Characteristic Curves for Four Typical Items

constructed from a homogeneous bank of test items. Such situations may arise with some criterion-referenced tests following effective instruction.

The one-parameter logistic model is often called the Rasch model, in honor of its developer. While the form of Rasch's model is different from that presented here, the one-parameter logistic model is mathematically equivalent to Rasch's model. For details of the development of the Rasch model, refer to Rasch (1960) and Wright and Stone (1979).

Two-Parameter Logistic Model

Lord (1952) was the first to develop a two-parameter item response model, based on the cumulative normal distribution (normal ogive). Birnbaum (1968) substituted the two-parameter logistic function for the two-parameter normal ogive function as the form of the item characteristic function. Logistic functions have the important advantage of being more convenient to work with than normal ogive functions. The logistic model is more mathematically tractable than the normal ogive model because the latter involves integration, whereas the former is an explicit function of item and ability parameters and also has important statistical properties.
Item characteristic curves for the two-parameter logistic model developed by Birnbaum are given by the equation

P_i(θ) = e^{Da_i(θ − b_i)} / (1 + e^{Da_i(θ − b_i)})     i = 1, 2, ..., n     [2.2]

where the parameters P_i(θ) and b_i are defined just as in Equation 2.1. As is easily seen, the two-parameter logistic model resembles the one-parameter model except for the presence of two additional elements. The factor D is a scaling factor introduced to make the logistic function as close as possible to the normal ogive function. It has been shown that when D = 1.7, values of P_i(θ) for the two-parameter normal ogive and the two-parameter logistic models differ in absolute value by less than 0.01 for all values of θ.

The second additional element of the two-parameter model is the parameter a_i, which is called the item discrimination parameter. The a_i parameter is proportional to the slope of the ICC at the point b_i on the ability scale. Items with steeper slopes are more useful for separating examinees into different ability levels than are items with less steep slopes. In fact, the usefulness of an item for discriminating among examinees near an ability level θ (separating examinees with abilities ≤ θ from examinees with abilities > θ) is proportional to the slope of the ICC at θ.

The item discrimination parameter is defined, theoretically, on the scale (−∞, +∞). Negatively discriminating items are discarded from ability tests, however, because something is wrong with an item (such as miskeying) if the probability of answering it correctly decreases as examinee ability increases. Also, it is unusual to obtain a_i values larger than 2. Hence, the usual range for item discrimination parameters is (0, 2). High values of a_i result in item characteristic functions that are very "steep," and low values of a_i lead to item characteristic functions that increase gradually as a function of ability.
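The claim about the D = 1.7 scaling factor can be checked numerically; a sketch comparing the logistic curve with the normal ogive over a grid of θ values:

```python
import math

D = 1.7

def logistic(theta, a, b):
    """Two-parameter logistic model (Equation 2.2)."""
    z = D * a * (theta - b)
    return math.exp(z) / (1.0 + math.exp(z))

def normal_ogive(theta, a, b):
    """Two-parameter normal ogive model: Phi(a * (theta - b))."""
    z = a * (theta - b)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Maximum absolute difference between the two curves over theta in [-4, 4]
a, b = 1.0, 0.0
diff = max(abs(logistic(t / 100.0, a, b) - normal_ogive(t / 100.0, a, b))
           for t in range(-400, 401))
print(diff)  # less than 0.01, as the text states
```

The maximum gap is reached at moderate distances from b; near θ = b and in the far tails the two curves nearly coincide.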
Readers interested in experimenting by changing values of item parameters to determine their effects on ICCs are referred to some computer software for the IBM PC and the Apple computers by Baker (1985), and to an introductory article on logistic models by Harris (1989).

The two-parameter model is obviously a generalization of the one-parameter model that allows for differently discriminating items. Some sample ICCs for the two-parameter model are shown in Figure 2.3.
Figure 2.3. Two-Parameter Item Characteristic Curves for Four Typical Items

For Item 1, b_1 = 1.0 and a_1 = 1.0; for Item 2, b_2 = 1.0 and a_2 = 0.5; for Item 3, b_3 = −1.0 and a_3 = 1.5; for Item 4, b_4 = 0.0 and a_4 = 1.2. The ICCs are not parallel, as they were for the one-parameter model. Each ICC has a different slope, reflecting the fact that the discrimination parameter values are different. Note again that the lower asymptote of each curve is zero; hence, the two-parameter model, like the one-parameter model, makes no allowance for guessing behavior. The assumption of no guessing is most plausible with free-response items, but it often can be met approximately with multiple-choice items when a test is not too difficult for the examinees. For example, this assumption may be met when competency tests are administered to students following effective instruction.

An alternative and somewhat more convenient way to write P_i(θ) for the two-parameter logistic model (and the three-parameter model, too) is this: If the numerator and denominator of Equation 2.2 are divided by e^{Da_i(θ − b_i)}, then P_i(θ) becomes

P_i(θ) = 1 / (1 + e^{−Da_i(θ − b_i)})

which can be written more compactly as

P_i(θ) = [1 + e^{−Da_i(θ − b_i)}]^{−1}

Three-Parameter Logistic Model

The mathematical expression for the three-parameter logistic model is

P_i(θ) = c_i + (1 − c_i) e^{Da_i(θ − b_i)} / (1 + e^{Da_i(θ − b_i)})     i = 1, 2, ..., n     [2.3]

where P_i(θ), b_i, a_i, and D are defined as for the two-parameter model. The additional parameter in the model, c_i, is called the pseudo-chance-level parameter. This parameter provides a (possibly) nonzero lower asymptote for the item characteristic curve and represents the probability of examinees with low ability answering the item correctly.

The parameter c_i is incorporated into the model to take into account performance at the low end of the ability continuum, where guessing is a factor in test performance on selected-response (e.g., multiple-choice) test items. Typically, c_i assumes values that are smaller than the value that would result if examinees guessed randomly on the item. As Lord (1974) has noted, this phenomenon probably can be attributed to the ingenuity of item writers in developing attractive but incorrect choices. For this reason, c_i should not be called the "guessing parameter."

Six typical three-parameter logistic ICCs are displayed in Figure 2.4. The corresponding item parameters are displayed in Table 2.1. The comparison of Items 1 to 3 with Items 4 to 6 (but especially the comparison of Items 1 and 4) highlights the role of the item difficulty parameter in the location of ICCs. More difficult items (Items 1, 2, 3) are shifted to the higher end of the ability scale, while easier items are shifted to the lower end of the ability scale. The comparison of Items 1 and 2 (or Items 1, 3, and 4 with Items 2, 5, and 6) highlights the influence of the item discrimination parameter on the steepness of ICCs. Finally, a comparison of Items 1 and 3 highlights the role of the c parameter (c_i) in the shape of ICCs. A comparison of the different lower asymptotes of Items 3, 5, and 6 is also informative.
Figure 2.4. Three-Parameter Item Characteristic Curves for Six Typical Items

The Property of Invariance

The property of invariance of item and ability parameters is the cornerstone of IRT and its major distinction from classical test theory. This property implies that the parameters that characterize an item do not depend on the ability distribution of the examinees, and the parameter that characterizes an examinee does not depend on the set of test items.

TABLE 2.1 Item Parameters for Six Typical Test Items

            Item Parameter
Test Item    b_i      a_i     c_i
1            1.00     1.80    0.00
2            1.00     0.80    0.00
3            1.00     1.80    0.25
4           -1.50     1.80    0.00
5           -0.50     1.20    0.10
6            0.50     0.40    0.15
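As a quick check of Equation 2.3, a sketch in Python (the parameter values below are illustrative, mirroring two items that differ only in c): at θ = b_i the three-parameter curve passes through (1 + c_i)/2 rather than 0.5.

```python
import math

D = 1.7

def p_3pl(theta, a, b, c):
    """Three-parameter logistic model (Equation 2.3)."""
    e = math.exp(D * a * (theta - b))
    return c + (1.0 - c) * e / (1.0 + e)

# Two illustrative items that differ only in the pseudo-chance-level parameter
print(p_3pl(1.0, 1.8, 1.0, 0.00))   # at theta = b: (1 + 0.00)/2 = 0.5
print(p_3pl(1.0, 1.8, 1.0, 0.25))   # at theta = b: (1 + 0.25)/2 = 0.625
print(p_3pl(-3.0, 1.8, 1.0, 0.25))  # far below b: close to the asymptote c
```

Raising c lifts the whole lower portion of the curve, which is why the midpoint of the curve moves up from 0.5 to (1 + c)/2.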
As noted earlier, the property of invariance of item parameters can be observed in Figure 2.1. When the IRT model fits the data, the same ICC is obtained for the test item regardless of the distribution of ability in the group of examinees used to estimate the item parameters. Hence, the ICC is invariant across the two populations.

To some researchers, the property of item invariance may seem surprising. The property, however, is a well-known feature of the linear regression model. In the linear regression model, the regression line for predicting a variable Y from a variable X is obtained as the line joining the means of the Y variable for each value of the X variable. When the regression model holds, the same regression line will be obtained within any restricted range of the X variable, that is, in any subpopulation on X, meaning that the slope and intercept of the line will be the same in any subpopulation on X. A derived index such as the correlation coefficient, which is not a parameter that characterizes the regression line, is not invariant across subpopulations. The difference between the slope parameter and the correlation coefficient is that the slope parameter does not depend on the characteristics of the subpopulation, such as its variability, whereas the correlation coefficient does (note, however, that the proper estimation of the line does require a heterogeneous sample). The same concepts also apply in item response models, which can be regarded as nonlinear regression models.

To illustrate the property of invariance of item parameters and to understand the conditions under which invariance holds, consider the following example, in which the responses of 90 examinees to a 40-item test were generated to fit a two-parameter logistic item response model (see Equation 2.2).
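A data set of the kind just described can be generated directly; a minimal sketch (the 40 item parameters below are hypothetical; only the design, 10 examinees at each of 9 ability levels, follows the example):

```python
import math
import random

D = 1.7

def p_2pl(theta, a, b):
    """Two-parameter logistic model (Equation 2.2)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def simulate(thetas, items, seed=0):
    """Generate 0/1 responses: one row per examinee, one column per item."""
    rng = random.Random(seed)
    return [[1 if rng.random() < p_2pl(t, a, b) else 0 for (a, b) in items]
            for t in thetas]

# 10 examinees at each of 9 ability levels; 40 hypothetical items
levels = [-1.716, -1.129, -0.723, -0.398, -0.100, 0.198, 0.523, 0.919, 1.516]
thetas = [t for t in levels for _ in range(10)]
items = [(1.0, (i - 20) / 10.0) for i in range(40)]  # a = 1.0, b from -2.0 to 1.9
data = simulate(thetas, items)
print(len(data), len(data[0]))  # 90 40
```

Each simulated response is a Bernoulli draw with success probability given by the model, so observed proportions correct converge to the model probabilities only as the number of examinees grows.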
A summary of the responses of the examinees (10 at each of 9 ability levels), showing their responses to a particular item on the test and their total scores on the test, is given in Table 2.2. The corresponding plot of probability of success on the selected item against ability, θ, is given in Figure 2.5.

The classical item difficulty, or p-value, for the item of interest is 0.5, and the classical item discrimination, or point-biserial correlation between the item score and total score, is 0.65. As a demonstration of the lack of invariance of classical item indices, let us consider the examinees as forming two ability groups: examinees at θ of −1.72, −1.13, and −0.72 forming the low-ability group and the examinees at θ of 0.52, 0.92, and 1.52 forming the high-ability group. For the low-ability examinees the p-value (based on 30 examinees) is 0.2 and the point-biserial correlation
Figure 2.5. Relationship Between Ability and Probability of Success on an Item

is 0.56. For the high-ability examinees the p-value is 0.8 and the point-biserial correlation is 0.47. These calculations demonstrate that the classical difficulty and discrimination indices change when the examinee ability distribution changes (obviously, restriction of range results in lower point-biserial correlations for each subgroup than for the total group).

Let us now fit separately a two-parameter item response model for the entire group and for the high- and low-ability groups. If invariance holds, the parameters obtained should be identical. Since in the two-parameter model the probability of success for an examinee with ability θ is given by

P = e^{Da(θ − b)} / (1 + e^{Da(θ − b)})

and

1 − P = 1 / (1 + e^{Da(θ − b)})
TABLE 2.2 Ability Level θ, Probability of Success on an Item, Response to the Item, and Total Score for 90 Examinees

                                    Examinee
θ        P(θ)                 1   2   3   4   5   6   7   8   9  10
-1.716   0.1  Item Response:  0   0   0   0   0   0   0   0   0   1
              Total Score:    8  12   -  12   8   8  11  13   4   -
-1.129   0.2  Item Response:  0   0   0   0   0   0   0   0   1   1
              Total Score:   10  14   9   8  10  11  13  12   7   7
-0.723   0.3  Item Response:  0   0   0   0   0   0   0   1   1   1
              Total Score:   11  15  14  13  15  15  13  11  15  13
-0.398   0.4  Item Response:  0   0   0   1   0   1   0   0   1   1
              Total Score:   13  12  18  12  17  10  16  15  12  19
-0.100   0.5  Item Response:  0   0   0   0   0   1   1   1   1   1
              Total Score:   17  21  25  25  21  19  18  19  20  15
 0.198   0.6  Item Response:  0   0   1   0   1   1   0   1   1   1
              Total Score:   21  19  26  22  25  22  24  24  28  19
 0.523   0.7  Item Response:  1   0   0   0   1   1   1   1   1   1
              Total Score:   27  26  25  24  24  30  28  24  29  29
 0.919   0.8  Item Response:  0   0   1   1   1   1   1   1   1   1
              Total Score:   33  28  29  30  29  28  33  32  32  33
 1.516   0.9  Item Response:  0   1   1   1   1   1   1   1   1   1
              Total Score:   34  35  34  38  37  37  36  35  37  39

it follows that

ln [P / (1 − P)] = Da(θ − b) = αθ + β

where α = Da and β = −Dab. The above relationship is a linear function of θ with two unknowns, α and β (the slope and intercept of the line, respectively), and, hence, their values can be determined exactly if
P and θ are known at two points. (In reality, determination of item parameters cannot be carried out in this way since θ will never be known; this procedure is used here for pedagogical purposes only.)

To determine the item parameters based on the entire range of ability, we can choose (arbitrarily) θ = −1.716 and θ = 1.516 with corresponding P values of 0.1 and 0.9. Thus, the two equations to be solved are

ln (0.1 / 0.9) = α(−1.716) + β   and   ln (0.9 / 0.1) = α(1.516) + β

Subtracting the first equation from the second, we have

ln (0.9 / 0.1) − ln (0.1 / 0.9) = α(1.516) − α(−1.716)

Solving for α, we obtain α = 1.360. Substituting this value in the second equation gives β = 0.136. The values of a and b now can be determined: a = 0.8 and b = −0.1.

In the low-ability subgroup, α and β can be determined using the two points θ = −1.716 and θ = −0.723 with the corresponding P values of 0.1 and 0.3. The equations to be solved are

ln (0.1 / 0.9) = α(−1.716) + β   and   ln (0.3 / 0.7) = α(−0.723) + β

Solving these equations in the same manner as previously, we obtain α = 1.359 and β = 0.136, which in turn yield a = 0.8 and b = −0.1.

In the high-ability group, we determine α and β using the points θ = 0.523 and θ = 1.516 with corresponding P values of 0.7 and 0.9. The equations to be solved in this case are

ln (0.7 / 0.3) = α(0.523) + β   and   ln (0.9 / 0.1) = α(1.516) + β
Solving these equations, we obtain α = 1.359 and β = 0.136, which yield the same a and b values as before.

What we have demonstrated is the simple fact that α and β are the slope and intercept of the line that relates ln [P / (1 − P)], the log odds ratio, to θ. In any range of θ, the line is the same and hence α and β, and therefore a and b, must be the same. This example shows that, in contrast with the classical item difficulty and discrimination indices, the parameters of the item response model are invariant across ability subpopulations.

We must, however, note several points in relation to the property of invariance. Referring back to Figure 2.5, we see that an exact relationship exists between the probabilities of success and the θ values. Furthermore, from Table 2.2 we see that at each θ level the observed probability of success (observed proportion correct on the item) is exactly equal to P; that is, the model fits the data exactly in the population. If the model does not fit the data exactly in the population, ln [P / (1 − P)] will not be an exact linear function of θ, and, hence, different α and β will be obtained when different sets of points are chosen. In other words, invariance only holds when the fit of the model to the data is exact in the population. This situation is identical to that in linear regression, where the regression coefficients are invariant only when the linear model fits the data exactly in the population.

A second point to be noted is that invariance is a property of the population. By definition, the item characteristic curve is the regression of item response on ability,

P = E(U | θ)

where E is the expected value. Hence, P (for a given θ) is the average of all item responses in the subpopulation of examinees with the specified ability value θ. In the low-ability and high-ability subpopulations described in the example, the observed probability of success at each θ was exactly equal to E(U | θ).
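The log-odds calculations above reduce to solving a two-equation linear system; a sketch that reproduces all three cases:

```python
import math

D = 1.7

def slope_intercept(theta1, p1, theta2, p2):
    """Solve ln(P/(1-P)) = alpha*theta + beta from two known (theta, P) points."""
    y1 = math.log(p1 / (1.0 - p1))
    y2 = math.log(p2 / (1.0 - p2))
    alpha = (y2 - y1) / (theta2 - theta1)
    beta = y2 - alpha * theta2
    return alpha, beta

for pts in [((-1.716, 0.1), (1.516, 0.9)),   # entire range
            ((-1.716, 0.1), (-0.723, 0.3)),  # low-ability subgroup
            ((0.523, 0.7), (1.516, 0.9))]:   # high-ability subgroup
    alpha, beta = slope_intercept(*pts[0], *pts[1])
    a, b = alpha / D, -beta / alpha          # since alpha = Da and beta = -Dab
    print(round(a, 1), round(b, 1))          # 0.8 -0.1 in every case
```

Whichever pair of points is chosen, the same line, and hence the same a and b, is recovered, which is exactly the invariance property.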
Therefore, the exact linear relationship between ln [P / (1 − P)] and θ held; in other words, the invariance property was observed. On the other hand, if a sample is obtained from the subpopulation of examinees with the specified ability value θ, it is extremely unlikely that the average of the item responses, or the observed probability of a correct response, will be exactly equal to E(U | θ). Even if, by some chance, the observed probability was equal to E(U | θ) at one value of θ, it would almost never occur at all values of θ. Hence, in samples, an exact linear relationship between ln [P / (1 − P)] and θ will not
be observed. Therefore, we cannot expect to observe invariance, in the strict sense, in samples even when the model fits the data exactly in the population from which the sample has been drawn. This problem is further exacerbated by the errors introduced when the item and examinee parameters are estimated. Nevertheless, it is important to determine whether invariance holds, since every application of item response theory capitalizes on this property.

Although invariance is clearly an all-or-none property in the population and can never be observed in the strict sense, we can assess the "degree" to which it holds when we use samples of test data. For example, if two samples of different ability are drawn from the population and item parameters are estimated in each sample, the congruence between the two sets of estimates of each item parameter can be taken as an indication of the degree to which invariance holds. The degree of congruence can be assessed by examining the correlation between the two sets of estimates of each item parameter or by studying the corresponding scatterplot. Figure 2.6 shows a plot of the difficulty values for 75 items based on two samples from a population of examinees. Suppose that the samples differed with respect to ability. Since the difficulty estimates based on the two samples lie on a straight line, with some scatter, it can be concluded that the invariance property of item parameters holds. Some degree of scatter can be expected because of the use of samples; a large amount of scatter would indicate a lack of invariance that might be caused either by model-data misfit or poor item parameter estimation (which, unfortunately, are confounded). The assessment of invariance described above is clearly subjective but is used because no objective criteria are currently available.
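The congruence check described here amounts to correlating the two vectors of difficulty estimates; a sketch with made-up estimates standing in for the two samples:

```python
import math

def pearson(x, y):
    """Pearson correlation between two sets of item parameter estimates."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

# Hypothetical difficulty estimates for 10 items from two samples:
# similar values with some scatter, as in Figure 2.6
sample1 = [-2.1, -1.5, -0.9, -0.4, 0.0, 0.3, 0.8, 1.2, 1.7, 2.2]
sample2 = [-2.0, -1.6, -1.0, -0.3, 0.1, 0.4, 0.7, 1.3, 1.6, 2.3]
r = pearson(sample1, sample2)
print(round(r, 3))  # close to 1: invariance plausibly holds for these data
```

A high correlation is necessary but not sufficient evidence of invariance; the scatterplot should also be inspected for curvature or outlying items.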
Such investigations of the degree to which invariance holds are, as seen above, investigations of the fit of the model to the data, since invariance and model-data fit are equivalent concepts. This approach to assessing model-data fit is discussed in detail in chapter 4.

The discussion and example given above relate to the invariance of item parameters in different subpopulations of examinees. The invariance property also holds with respect to the ability parameters, meaning that the ability value of an examinee does not depend on the set of test items administered. To see this for the two-parameter model, we note that in the equation

ln [P / (1 − P)] = Da(θ − b)
Figure 2.6. Plot of 3P Item Difficulty Values Based on Two Groups of Examinees

if we consider a and b to be variables, then the log odds ratio is a linear function of the item parameters a and b. As a and b change (as we consider items with different discrimination and difficulty parameters), θ remains the same, showing that, no matter which items are used, the ability θ remains invariant. This is the same argument as was used to explain the invariance of item parameters.

The demonstration of invariance of item and ability parameters is obviously not restricted to the two-parameter model. Since the one-parameter model is a special case of the two-parameter model, at least mathematically, the ability and difficulty parameters will be invariant also for this model. For the three-parameter model the parameters a, b, and c characterize the item response function. Since the mathematical form of the function remains the same no matter which range of θ is considered, the parameters that describe the function must be the same, that is, invariant. A similar argument applies to θ as a, b, and c vary.

The importance of the property of invariance of item and ability parameters cannot be overstated. This property is the cornerstone of item response theory and makes possible such important applications as equating, item banking, investigation of item bias, and adaptive testing.
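Invariance of θ can also be illustrated numerically: when the model holds, maximum-likelihood estimates of θ from two different item sets should agree within sampling error. A sketch using a crude grid search (all parameter values are hypothetical, and this is not an efficient estimation routine):

```python
import math
import random

D = 1.7

def p_2pl(theta, a, b):
    """Two-parameter logistic model (Equation 2.2)."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def log_likelihood(theta, items, responses):
    ll = 0.0
    for (a, b), u in zip(items, responses):
        p = p_2pl(theta, a, b)
        ll += math.log(p) if u == 1 else math.log(1.0 - p)
    return ll

def estimate_theta(items, responses):
    """Crude maximum-likelihood estimate of theta by grid search."""
    grid = [g / 100.0 for g in range(-400, 401)]
    return max(grid, key=lambda t: log_likelihood(t, items, responses))

rng = random.Random(1)
true_theta = 0.5
easy = [(1.0, -1.0 + 0.05 * i) for i in range(40)]  # an easier item set
hard = [(1.0, 0.0 + 0.05 * i) for i in range(40)]   # a harder item set
resp = lambda items: [1 if rng.random() < p_2pl(true_theta, a, b) else 0
                      for a, b in items]
est_easy = estimate_theta(easy, resp(easy))
est_hard = estimate_theta(hard, resp(hard))
print(est_easy, est_hard)  # both near the generating theta of 0.5
```

The two estimates differ only because of sampling error in the simulated responses; the harder and easier item sets target the same underlying θ.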
Other Promising Models

In addition to the one-, two-, and three-parameter logistic models, many other IRT models have been developed, including several models that can be applied to nondichotomous test data (see, for example, Andrich, 1978a, 1978b, 1978c, 1982; Masters, 1982; Masters & Wright, 1984; McDonald, 1989; Spray, 1990). For example, Bock (1972) developed a two-parameter logistic model that can be applied to all of the answer choices in a multiple-choice test item. The purpose of his nominal response model was to maximize the precision of ability estimates by using all the information contained in the examinees' responses, not just whether the item was answered correctly. Bock (1972) assumed that the probability that an examinee would select a particular item option k (from m available options) to item i could be represented as

P_ik(θ) = e^{a_ik(θ − b_ik)} / Σ_{h=1}^{m} e^{a_ih(θ − b_ih)}     i = 1, 2, ..., n;  k = 1, 2, ..., m     [2.4]

At each θ, the sum of probabilities across the m options, Σ_{k=1}^{m} P_ik, is one. The quantities (b_ik, a_ik) are item parameters related to the kth option. The model assumes no a priori ordering of the options.

The graded response model of Samejima (1969) assumes, in addition to the usual assumptions, that the available categories to which an examinee responds can be ordered. Examples would include a 5-point Likert rating scale or, say, a 4-point rating scale for grading essays, or other scales representing levels of accomplishment or partial credit. This model, like the Bock model, attempts to obtain more information from examinees' responses than simply whether they give correct or incorrect answers. With the current interest in polytomous scoring models, Samejima's extension of the two-parameter logistic model to polytomous ordered categories is likely to receive increasing attention. (Also, see Masters & Wright,
1984, for various extensions of the one-parameter model to handle polytomous response data.)

Suppose the scoring categories for an item are arranged in order from low to high and denoted x_i = 0, 1, ..., m_i, where (m_i + 1) is the number of scoring categories for the ith item. The probability of an examinee
responding to an item in a particular category or higher can be given by a minor extension of the two-parameter logistic model:

P*_ix(θ) = e^{Da_i(θ − b_ix)} / (1 + e^{Da_i(θ − b_ix)})     [2.5]

where b_ix is the "difficulty level" for category x_i. Other parameters in the model were defined earlier. With (m_i + 1) categories, m_i difficulty values need to be estimated for each item, plus one item discrimination parameter. The actual probability of an examinee receiving a score of x_i is given by the expression

P_ix(θ) = P*_ix(θ) − P*_i(x+1)(θ)     [2.6]

With, say, 50 items in a test, and a 5-point proficiency scale for each item, a total of (50 × 4) + 50 = 250 item parameter values would need to be estimated.

The field of psychomotor assessment, too, has been influenced by item response models, and this influence has spawned new applications of relatively unknown IRT models (see Safrit, Costa, & Cohen, 1989; Spray, 1990). Instead of ability variables such as numerical ability and reading comprehension, variables such as physical fitness, basketball shooting ability, and abdominal strength are of interest in psychomotor assessment. In the simple binomial trials model, for example,

P(X = x | θ) = C(n, x) P(θ)^x Q(θ)^{n − x}     [2.7]

where P(X = x | θ) represents the probability that an examinee completes x of n trials (e.g., shoots 8 out of 10 baskets), and C(n, x) is the binomial coefficient. This probability could be represented by any of the logistic test models; however, the item parameters in the logistic model that would describe the trials, that is, items, would be equal for each trial, and, hence, item-parameter estimation would be considerably simplified. Trials would need to be independent and scored as pass or fail for this model to be applicable. If, for example, the binomial trials model is applied to basketball shooting data (e.g., number of successful shots), θ would be basketball shooting ability. As with all IRT applications, parameter invariance
would be critical. Task (item) difficulty should be invariant across different groups of examinees, and abilities should be invariant across tasks that vary in difficulty.

Another IRT model that has been applied successfully is the Poisson counts model:

P(X = x | θ, b) = e^{x(θ − b)} / (x! e^{e^{(θ − b)}})     [2.8]

where x is the number of (say) sit-ups or push-ups completed in a minute and b represents the difficulty of the task. These and other IRT models aimed at handling polytomous response data can be expected to receive increasing use in the future, as fewer assessments are based on dichotomously scored data.

Exercises for Chapter 2

1. Item parameter values for six items are given in Table 2.3.

TABLE 2.3

Item     b      a      c
1       1.0    1.8    0.00
2       1.0    0.7    0.00
3       1.0    1.8    0.25
4      -0.5    1.2    0.10
5       0.5    1.2    0.00
6       0.0    0.4    0.10

a. For each item, compute P(θ) at θ = −3, −2, −1, 0, 1, 2, and 3. Plot the item characteristic curves.
b. Which item is the easiest?
c. Which item is the least discriminating?
d. Which item does an examinee with an ability of θ = 0 have the highest probability of answering correctly? What is the examinee's probability of getting this item wrong?

2. Use the ICCs in Figure 2.4 to answer the following questions:
a. Which item is the easiest at θ = −1.0?
b. Which item is the hardest at θ = 0.0?
c. Which two items are equally difficult at θ = −1.0?
d. Which item is most discriminating at θ = 2.0?

3. Use the four two-parameter ICCs in Figure 2.3 to answer the following questions:
a. What is the value of P_1(θ = 1.0)?
b. Which item is the least discriminating?
c. How do the ICCs in Figure 2.3 differ from those in Figure 2.4?

4. For the three-parameter model, show that the probability of a correct response P(θ) at θ = b is

P(θ) = (1 + c) / 2

5. The probability of a correct response at certain values of θ for three items is given in Table 2.4.

TABLE 2.4

θ:      -3.0  -2.5  -2.0  -1.5  -1.0  -0.5   0.0   0.5   1.0   1.5   2.0   2.5   3.0
Item 1  0.01  0.01  0.02  0.04  0.07  0.13  0.22  0.35  0.50  0.65  0.78  0.87  0.93
Item 2  0.00  0.00  0.01  0.04  0.11  0.26  0.50  0.74  0.89  0.96  0.99  0.99  0.99
Item 3  0.20  0.20  0.20  0.20  0.20  0.21  0.23  0.26  0.32  0.44  0.60  0.76  0.88

Plot the ICCs for the three items.
a. For Items 1 and 2, c = 0. Determine from the plot the b values for these two items.
b. For Item 3, c = 0.2. Determine from the plot the b value for this item.
c. How would you determine the a value of an item from a plot of the ICC? Use this procedure to determine the a value for each of the three items.

6. Responses of 40 examinees at a given ability level θ to two items are given in Table 2.5.

TABLE 2.5

Item   Examinee Responses
1      00001 00000 01111 00100 01110 00000 11001 10101
2      01100 00111 10000 11111 11111 11100 00110 01111
Construct a 2 × 2 table of correct and incorrect responses on the two items. Using a chi-square test for independence, determine if local independence holds for these two items at this ability level.

Answers to Exercises for Chapter 2

1. a. See Table 2.6.

TABLE 2.6

θ:      −3    −2    −1    0     1     2     3
[values of P(θ) for Items 1 to 6 are not legible in the source scan]

b. Item 4.
c. Item 6.
d. Item 4. P(failure) = 1 − P(θ) = 1 − 0.788 = 0.212.

2. a. Item 4. b. Item 1. c. Items 5 and 6. d. Item 2.

3. a. Approximately 0.50.
b. Item 2.
c. In Figure 2.3, the lower asymptotes of the ICCs are all zero; in Figure 2.4, the lower asymptotes of the ICCs are not all zero.

4. P(θ = b) = c + (1 − c) / [1 + e^(−Da(b − b))]
            = c + (1 − c) / (1 + e^0)
            = c + (1 − c) / (1 + 1)
            = c + (1 − c) / 2
            = (2c + 1 − c) / 2
            = (1 + c) / 2

5. a. Item 1: b = 1.0. Item 2: b = 0.0.
b. (1 + c) / 2 = (1 + 0.2) / 2 = 0.6; b = θ value at which P(θ) = 0.6; b = 2.0.
c. a ∝ slope of the ICC at b. Draw the tangent to the curve at θ = b and determine its slope by taking any two points on the tangent and dividing the y increment by the x increment. Use this slope for each of the three items.

6. See Table 2.7.
TABLE 2.7

                        Item 2
                   Correct   Incorrect
Item 1  Correct       8          8         16
        Incorrect    20          4         24
                     28         12         40

χ² = N(AD − BC)² / [(A + B)(C + D)(A + C)(B + D)]
   = 40(8 × 4 − 20 × 8)² / [(8 + 20)(20 + 4)(4 + 8)(8 + 8)]
   = 5.08 > χ²₍.₀₅,₁₎ = 3.84

Since the computed χ² exceeds the tabled value, we can reject the hypothesis of independence. Local independence does not hold at this ability level. We would, therefore, conclude that a unidimensional model does not fit the data.

Note

1. For convenience, P(Uⱼ = 1 | θ) will be written as Pⱼ(θ); this notation will be used in specifying item characteristic functions.
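The chi-square computation in the answer to Exercise 6 (Table 2.7) can be carried out directly. Below is a minimal Python sketch (the function name is ours, not the book's); the cell counts follow Table 2.7.

```python
def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for a 2 x 2 table [[a, b], [c, d]] (df = 1)."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

# Cell counts from Table 2.7: correct/incorrect on Items 1 and 2 for 40 examinees
chi2 = chi_square_2x2(8, 8, 20, 4)
print(round(chi2, 2))  # 5.08, which exceeds the tabled value 3.84 at alpha = .05
```

The statistic exceeds 3.84, so the hypothesis of independence is rejected, as in the text.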
3

Ability and Item Parameter Estimation

The first and most important step in applying item response theory to test data is that of estimating the parameters that characterize the chosen item response model. In fact, the successful application of item response theory hinges on the availability of satisfactory procedures for estimating the parameters of the model.

In item response models, the probability of a correct response depends on the examinee's ability, θ, and the parameters that characterize the item. Both ability and item parameters are unknown; what is known are the responses of the examinees to the test items. The problem of estimation is to determine the value of θ for each examinee and the item parameters from the item responses. This problem is similar to that encountered in regression analysis where, from observed responses to a variable, the parameters that characterize the regression model, the regression coefficients, must be estimated.

Two major differences distinguish regression models and item response models. First, the regression model is usually linear, while item response models are nonlinear. Second, and most important, the regressor (independent) variable in regression analysis is observable; that is, scores on this variable can be observed. In item response models the "regressor variable" θ is unobservable. If θ were observable or known, the problem of estimation of item parameters, or the "regression coefficients," would simplify considerably, although we would still be dealing with a nonlinear regression model. Similarly, if the item parameters are known, the estimation of ability is reasonably straightforward.

Estimation of parameters can be accomplished in several ways. In the unlikely event that the model fits the data exactly, and when θ is known, the procedure demonstrated in the section on parameter invariance could be used.
In this case, only as many points as there are item parameters in the model are needed to solve for the unknown
Ability and Item Parameter Estimation 33

parameters. When a sample is obtained, the above procedure cannot be used because the model will not fit the data exactly. In this case, our strategy is to find the parameter values that will produce the "best fitting" curve. In linear regression, best fit is often defined in terms of the least squares criterion. In IRT models the least squares criterion is not used because it is difficult to determine the properties of least squares estimates in nonlinear models. Alternatively, the parameters could be estimated using a maximum likelihood criterion. The sampling distributions of maximum likelihood estimates are known in large samples, and this information can be used in a variety of ways in IRT applications. We shall first describe the maximum likelihood procedure for estimating ability when the item parameters are known, and then describe the procedures for estimating item parameters.

Estimation of Ability

Suppose that a randomly chosen examinee responds to a set of n items with response pattern (U₁, U₂, ..., Uⱼ, ..., Uₙ), where Uⱼ is either 1 (a correct response) or 0 (an incorrect response) on item j. By the assumption of local independence, the joint probability of observing the response pattern is the product of the probabilities of observing each item response, that is,

P(U₁, U₂, ..., Uⱼ, ..., Uₙ | θ) = P(U₁ | θ) P(U₂ | θ) ... P(Uⱼ | θ) ... P(Uₙ | θ)

which may be expressed more compactly as

P(U₁, U₂, ..., Uₙ | θ) = ∏ⱼ₌₁ⁿ P(Uⱼ | θ)

Since Uⱼ is either 1 or 0, this can be taken into account by writing the likelihood function as

P(U₁, U₂, ..., Uₙ | θ) = ∏ⱼ₌₁ⁿ P(Uⱼ | θ)^Uⱼ [1 − P(Uⱼ | θ)]^(1 − Uⱼ)

or simply as
P(U₁, U₂, ..., Uₙ | θ) = ∏ⱼ₌₁ⁿ Pⱼ^Uⱼ Qⱼ^(1 − Uⱼ)    [3.1]

where Pⱼ = P(Uⱼ | θ) and Qⱼ = 1 − P(Uⱼ | θ). Equation 3.1 is an expression of the joint probability of a response pattern. When the response pattern is observed, Uⱼ = uⱼ; the probabilistic interpretation is no longer appropriate; the expression for the joint probability is now called the likelihood function and is denoted as L(u₁, u₂, ..., uⱼ, ..., uₙ | θ), where uⱼ is the observed response to item j. Thus,

L(u₁, u₂, ..., uₙ | θ) = ∏ⱼ₌₁ⁿ Pⱼ^uⱼ Qⱼ^(1 − uⱼ)    [3.2]

Since Pⱼ and Qⱼ are functions of θ and the item parameters, the likelihood function is also a function of these parameters.

As an example, consider the responses of five examinees to five items with known item parameter values, given in Table 3.1. The likelihood function for any examinee may be written using the general expression above. For Examinee 3, for example, u₁ = 0, u₂ = 0, u₃ = 0, u₄ = 1, u₅ = 1. Hence, the likelihood function for this examinee is

L(u | θ) = Q₁ Q₂ Q₃ P₄ P₅

Since P (and hence Q) are item response functions whose forms depend on the item parameters, and the item parameters are known in this example, the exact values of the likelihood function for a given θ can be computed. In particular, a graph of the likelihood function as θ varies can be plotted. Since the likelihood function is a product of quantities, each bounded between 0 and 1, its value will be very small. A better scaling of the likelihood function can be obtained by transforming it using logarithms. Furthermore, because of the following properties of logarithms,

ln xy = ln x + ln y
TABLE 3.1 Item Parameters and Response Patterns for Five Examinees on Five Test Items

          Item Parameters
Item      a       b        c
1         1.27    1.19     0.10
2         1.34    0.59     0.15
3         1.14    0.15     0.15
4         1.00    −0.59    0.20
5         0.61    −2.00    0.01

[the item-response columns for the five examinees are not legible in the source scan]

and

ln xᵃ = a ln x

using logarithms simplifies the computations (and, as we shall see, computation of the first derivative) considerably. Using the above two properties, the general expression for the logarithm of the likelihood function (log-likelihood, for short) may be written as

ln L(u | θ) = Σⱼ₌₁ⁿ [uⱼ ln Pⱼ + (1 − uⱼ) ln (1 − Pⱼ)]

Here, u is the vector of item responses.

Graphs of the logarithms of the likelihood for Examinees 3, 4, and 5 are given in Figure 3.1. The log-likelihood for Examinee 3 peaks at θ = −0.5, while for Examinee 4 the log-likelihood peaks at θ = 1. For Examinee 5 the peak is at θ = −1.5. The value of θ that makes the likelihood function (or, correspondingly, the log-likelihood) for an examinee a maximum is defined as the maximum likelihood estimate of θ for that examinee.

The problem of finding the maximum value of a function is not a trivial one. The graphical procedure described above was used for illustration and is not feasible when many examinees and many items are used. The value that maximizes the function may be found using a search procedure with a computer. More efficient procedures use the fact that, at the point where the function reaches a maximum, the slope of the function (the first derivative) is zero. Thus, the maximum likelihood estimate may be determined by solving the equation obtained by
Figure 3.1. Log-Likelihood Functions for Three Examinees

setting the first derivative of the likelihood or log-likelihood function equal to zero. Again, this equation cannot be solved directly, and approximation methods must be used. The most popular of the approximation methods is the Newton-Raphson procedure described in detail in Hambleton and Swaminathan (1985).

Unfortunately, the likelihood (or log-likelihood) function might not have a finite value as its maximum, as when an examinee answers all items correctly or all items incorrectly. In this case, the maximum likelihood estimate will be θ̂ = +∞ or θ̂ = −∞. Some peculiar response patterns (which cannot be discerned as such a priori) may result also in likelihood functions that do not have a finite absolute maximum. The log-likelihood functions for the first two examinees from Table 3.1 are shown in Figure 3.2. For Examinee 2, the log-likelihood function appears to have a maximum at the point θ = 0.9; however, the function has a higher value at θ = −∞ (values of the function are shown in the figure only to θ = −6). For Examinee 1, too, the likelihood function has its maximum at θ = −∞. Hence, for both examinees, maximum likelihood estimates do not exist. The reason for this situation is that the response patterns of these two examinees are aberrant: The
Figure 3.2. Log-Likelihood Functions for Two Examinees with Aberrant Responses

examinees answered some relatively difficult and discriminating items correctly and answered some of the easier items incorrectly. In cases like this the numerical procedures used to find the maximum usually will diverge. The problem noted above with aberrant responses occurs only with the three-parameter model and not with the one- or two-parameter models (see Hambleton & Swaminathan [1985], and Yen, Burket, & Sykes [in press] for discussions of this issue), and may occur even for tests with as many as 40 items.

The maximum likelihood estimates (MLEs), when they exist, have well-known asymptotic (i.e., large sample) properties. Since we are dealing with an examinee, asymptotic refers to increasing test length. As test length increases, the MLE of θ, denoted as θ̂, is distributed normally with mean θ. This implies that the asymptotic distribution of θ̂ is centered on the true value of θ; hence, the MLE is unbiased in long tests. The standard deviation of θ̂, or the standard error, denoted as SE(θ̂), is a function of θ and is given as

SE(θ̂) = 1 / √I(θ)
where I(θ) is what is called the information function. Since θ is not known, the information function must be computed by substituting θ̂ in the above expression. Computation of the information function, its properties, and its role in test construction are described in detail in chapter 6.

The normality of θ̂ can be used to construct a confidence interval for θ. The (1 − α)% confidence interval for θ is given by

( θ̂ − z₍α/2₎ SE(θ̂), θ̂ + z₍α/2₎ SE(θ̂) )

where SE(θ̂) is the standard error evaluated at θ̂, and z₍α/2₎ is the upper (1 − α/2) percentile point of the normal distribution. For the 95% confidence interval, α = 0.05 and z₍α/2₎ = 1.96.

The problem of not finding maximum likelihood estimates in some situations can be overcome if a Bayesian estimation procedure is used. The basic idea is to modify the likelihood function to incorporate any prior information we may have about the ability parameters. For example, we may be able to say, based on some previous experience, that θ is distributed normally with mean μ and standard deviation σ. In this case, the prior information can be expressed in the form of a density function and denoted as f(θ). Bayes' theorem states that the probability of an event A given B is

P(A | B) ∝ P(B | A) P(A)

where P(A) is the prior probability of event A occurring. The above relationship is also true for density functions, where A is θ and B is the observed item response pattern, u. Bayes' theorem can be written then as

f(θ | u) ∝ f(u | θ) f(θ)

Now, f(u | θ) is, in fact, the likelihood function and, hence,

f(θ | u) ∝ L(u | θ) f(θ)

The revised likelihood function f(θ | u) is called the posterior density, and its mode is the "most probable" value for θ and can be taken as an
estimate of θ. Note that if we assume a uniform prior distribution for θ (i.e., f(θ) = k, a constant), then

f(θ | u) ∝ L(u | θ)

In this case the Bayesian estimate is numerically identical to the maximum likelihood estimate. We emphasize numerically because the philosophical basis underlying the Bayesian procedure is very different from the classical or relative frequency notion of probability (see Kendall & Stuart [1961] for details on this issue). Using a Bayesian approach solves some of the difficulties encountered with the maximum likelihood approach. Bayesian estimates of θ can be obtained for zero items correct and perfect response patterns, and for "aberrant" response patterns.

The posterior distribution of θ may be described in many ways. The mode of the distribution, the Bayesian modal estimate, provides only one description. The mean of the distribution also may be used as an estimate. The mean can be computed by approximating the posterior distribution of θ in a finite interval with a histogram, that is, forming a frequency distribution with k values of θ. The frequency at the point θⱼ (j = 1, ..., k) is f(θⱼ | u). The mean can then be obtained in the usual way:

μ(θ | u) = Σⱼ₌₁ᵏ θⱼ f(θⱼ | u) / Σⱼ₌₁ᵏ f(θⱼ | u)    [3.3]

Bock and Mislevy (1982) have called this estimate the Expected A Posteriori (EAP) estimate.

Estimation of Item Parameters

In describing the procedures for estimating θ, we assumed that the item parameters were known. At some point, we have to face the fact that the item parameters also must be estimated! For estimating the
ability of an examinee when item parameters are known, we administer many items to the examinee and obtain the likelihood function for the responses of the examinee to n items. Conversely, if we want to estimate item parameters when θ is known for each examinee, we administer the item of interest to many examinees and obtain the likelihood function for the responses of N examinees to the item, that is,

L(u₁, u₂, ..., u_N | θ, a, b, c) = ∏ᵢ₌₁ᴺ Pᵢ^uᵢ Qᵢ^(1 − uᵢ)

where a, b, and c are the item parameters (assuming a three-parameter model). The difference between the likelihood function for an examinee and that for an item is that, for an item, the assumption of local independence need not be invoked; we merely assume that the responses of N examinees to an item are independent, a standard assumption in statistics. The assumption of local independence is more stringent in that we must assume that the responses of an examinee to two or more items are independent.

When the θ values are known, the estimation of item parameters is straightforward and is comparable to the procedure described in the previous section. The difference is that the likelihood function for an item, unlike that for an examinee, is multidimensional in the item parameters; that is, it is a function of three parameters. Thus, to find the MLEs of the parameters a, b, and c, we must find the values of a, b, and c that correspond to the maximum value of a surface in three dimensions. This is accomplished by finding the first derivative of the likelihood function with respect to each of the parameters a, b, and c, setting these derivatives to zero, and solving simultaneously the resulting system of nonlinear equations in three unknowns. Obviously, we solve for two unknowns when the two-parameter model is used, and for only one unknown when the one-parameter model is used. Again, the Newton-Raphson procedure, in its multivariate form,
is used commonly to solve these equations. When the ability of each examinee is known, each item may be considered separately without reference to the other items. Thus, the estimation procedure must be repeated n times, once for each item.
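The ability-estimation machinery described above can be sketched concretely. The fragment below is a minimal Python illustration, not the Newton-Raphson procedure used by operational programs: it locates the maximum of the log-likelihood for Examinee 3 of Table 3.1 by a simple grid search, then approximates SE(θ̂) = 1/√I(θ) from the curvature of the log-likelihood at the maximum. The item parameters are those of Table 3.1, and D = 1.7 is the usual scaling constant; the peak falls near θ = −0.5, matching Figure 3.1.

```python
import math

D = 1.7  # scaling constant of the logistic model
# Item parameters (a, b, c) from Table 3.1
ITEMS = [(1.27, 1.19, 0.10), (1.34, 0.59, 0.15), (1.14, 0.15, 0.15),
         (1.00, -0.59, 0.20), (0.61, -2.00, 0.01)]
U3 = [0, 0, 0, 1, 1]  # Examinee 3's response pattern

def prob(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

def log_lik(theta, u):
    """Log-likelihood of a response pattern, assuming local independence."""
    return sum(math.log(prob(theta, a, b, c) if x else 1 - prob(theta, a, b, c))
               for x, (a, b, c) in zip(u, ITEMS))

# Grid search over [-4, 4] for the maximum likelihood estimate of theta
grid = [g / 100 for g in range(-400, 401)]
theta_hat = max(grid, key=lambda t: log_lik(t, U3))

# SE from the observed information: the negative second derivative of the
# log-likelihood at the maximum, approximated by finite differences
h = 1e-3
info = -(log_lik(theta_hat + h, U3) - 2 * log_lik(theta_hat, U3)
         + log_lik(theta_hat - h, U3)) / h ** 2
se = 1 / math.sqrt(info)
ci_95 = (theta_hat - 1.96 * se, theta_hat + 1.96 * se)
```

With only five items the information is small and the standard error correspondingly large, which is why the confidence interval is wide; chapter 6 takes up the information function in detail.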
Joint Estimation of Item and Ability Parameters

It is apparent that at some point neither θ nor the item parameters will be known. This is the most common situation and presents the most difficult problem. In this case the responses of all the examinees to all the items must be considered simultaneously. The likelihood function when N examinees respond to n items, using the assumption of local independence, is

L(u₁, u₂, ..., u_N | θ, a, b, c) = ∏ᵢ₌₁ᴺ ∏ⱼ₌₁ⁿ Pᵢⱼ^uᵢⱼ Qᵢⱼ^(1 − uᵢⱼ)

where uᵢ is the response pattern of examinee i to the n items; θ is the vector of N ability parameters; and a, b, and c are the vectors of item parameters for the n-item test. The number of item parameters is 3n in the three-parameter model (2n for the two- and n for the one-parameter model, respectively). Local independence must be assumed since the θs are not known. The number of ability parameters is N and, hence, for the three-parameter model a total of 3n + N parameters is to be estimated. Before the estimation can proceed, however, the problem of indeterminacy must be addressed.

In the likelihood function given above, the item and ability parameters are not uniquely determined. In the item response function for, say, the three-parameter model (see Equation 2.3), if we replace θ by θ* = αθ + β, b by b* = αb + β, and a by a* = a/α, the probability of a correct response remains unchanged:

P(θ) = P(θ*)

Since α and β are arbitrary scaling constants, the likelihood function will not have a unique maximum. Any numerical procedure employed to find the maximum of the likelihood function will fail because of this indeterminacy. This problem does not arise in the estimation of θ when item parameters are known, or in the parallel situation in which item parameters are estimated in the presence of known ability parameters, because there is no indeterminacy in these situations.
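The indeterminacy can be verified numerically: rescaling θ and b by θ* = αθ + β and b* = αb + β while dividing a by α leaves the item response function untouched. A small sketch (the parameter values are purely illustrative; c is unaffected by the rescaling):

```python
import math

D = 1.7

def prob(theta, a, b, c):
    """Three-parameter logistic item response function."""
    return c + (1 - c) / (1 + math.exp(-D * a * (theta - b)))

theta, a, b, c = 1.0, 1.2, 0.5, 0.20   # illustrative values
alpha, beta = 2.0, -0.7                # arbitrary scaling constants

p_original = prob(theta, a, b, c)
p_rescaled = prob(alpha * theta + beta, a / alpha, alpha * b + beta, c)
print(abs(p_original - p_rescaled) < 1e-12)  # True: P(theta) = P(theta*)
```

Because D(a/α)[(αθ + β) − (αb + β)] = Da(θ − b), the logit, and hence the probability, is identical for every choice of α and β, which is exactly why an arbitrary scale must be fixed before joint estimation can proceed.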
The problem of indeterminacy may be eliminated by choosing an arbitrary scale for the ability values (or the b values); usually, the mean and standard deviation of the N ability values (or the n item difficulty values) are set to 0 and 1, respectively. As we shall see later, this scaling must be taken into account when comparing estimates of item parameters for two or more groups.

Once the indeterminacy is eliminated, the values of the item and ability parameters that maximize the likelihood function can be determined. In the simultaneous or joint maximum likelihood estimation procedure, this determination must be done in two stages. In the first stage, initial values for the ability parameters are chosen. The logarithm of the ratio of number-right score to number-wrong score for each examinee provides good starting values. These values are then standardized (to eliminate the indeterminacy) and, treating the ability values as known, the item parameters are estimated. In the second stage, treating the item parameters as known, the ability parameters are estimated. This procedure is repeated until the values of the estimates do not change between two successive estimation stages. This joint maximum likelihood procedure is implemented in LOGIST (Wingersky, 1983) for the one-, two-, and three-parameter models, and in BICAL (Wright, Mead, & Bell, 1979) and BIGSCALE (Wright, Schulz, & Linacre, 1989) for the one-parameter model.

The joint maximum likelihood procedure, while conceptually appealing, has some disadvantages. First, ability estimates for perfect and zero scores do not exist. Second, item parameter estimates for items that are answered correctly (or incorrectly) by all examinees do not exist. Items and examinees exhibiting these patterns must be eliminated before estimation can proceed. Third,
in the two- and three-parameter models the joint maximum likelihood procedure does not yield consistent estimates of item and ability parameters. (Swaminathan & Gifford [1983] have shown empirically that consistent estimates may be obtained for item and ability parameters if both the number of examinees and the number of items become large.) Fourth, in the three-parameter model, unless restrictions are placed on the values the item and ability parameters take, the numerical procedure for finding the estimates may fail.
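The two-stage joint procedure just described can be illustrated for the one-parameter model. The following is a minimal sketch, not LOGIST or BICAL: it simulates responses for an illustrative set of 200 examinees and 10 items, starts the abilities at the log of the number-right to number-wrong ratio, standardizes the ability estimates each cycle to remove the scale indeterminacy, and then alternates between estimating difficulties (abilities fixed) and abilities (difficulties fixed), using grid search in place of Newton-Raphson.

```python
import math
import random

def p_rasch(theta, b):
    """One-parameter (Rasch) item response function."""
    return 1 / (1 + math.exp(-(theta - b)))

# Simulated data (illustrative): 200 examinees, 10 items
random.seed(1)
true_b = [-1.5 + 3 * j / 9 for j in range(10)]
thetas = [random.gauss(0, 1) for _ in range(200)]
resp = [[1 if random.random() < p_rasch(t, b) else 0 for b in true_b]
        for t in thetas]

GRID = [g / 20 for g in range(-80, 81)]  # search grid over [-4, 4]

def argmax(log_lik):
    return max(GRID, key=log_lik)

# Starting values: ln(number right / number wrong), with 0 and perfect
# scores edged inward so the logarithm is defined
n = len(true_b)
theta_hat = []
for row in resp:
    r = min(max(sum(row), 0.5), n - 0.5)
    theta_hat.append(math.log(r / (n - r)))

for _ in range(5):
    # standardize the ability estimates to remove the scale indeterminacy
    m = sum(theta_hat) / len(theta_hat)
    s = (sum((t - m) ** 2 for t in theta_hat) / len(theta_hat)) ** 0.5
    theta_hat = [(t - m) / s for t in theta_hat]
    # Stage 1: item difficulties, treating the abilities as known
    b_hat = [argmax(lambda b, j=j: sum(
        math.log(p_rasch(t, b) if row[j] else 1 - p_rasch(t, b))
        for t, row in zip(theta_hat, resp))) for j in range(n)]
    # Stage 2: abilities, treating the item difficulties as known
    theta_hat = [argmax(lambda t, row=row: sum(
        math.log(p_rasch(t, b) if x else 1 - p_rasch(t, b))
        for x, b in zip(row, b_hat))) for row in resp]
```

After a few cycles the estimated difficulties track the generating values closely, though, as noted in the text, with a fixed test length the joint estimates of the two- and three-parameter models are not consistent, and in practice the cycling continues until successive estimates stop changing rather than for a fixed number of iterations.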