Medical Statistics at a Glance (Wiley-Blackwell, 2000)

less) than zero. However, the sign test does not incorporate information on the sizes of these differences.
The Wilcoxon signed ranks test takes account not only of the signs of the differences, but also their magnitude, and therefore is a more powerful test (Topic 18). The individual difference is calculated for each pair of results. Ignoring zero differences, these are then classed as being either positive or negative. In addition, the differences are placed in order of size, ignoring their signs, and are ranked accordingly. The smallest difference thus gets the value 1, the second smallest gets the value 2, etc. up to the largest difference, which is assigned the value n', if there are n' non-zero differences. If two or more of the differences are the same, they each receive the average of the ranks these values would have received if they had not been tied. Under the null hypothesis of no difference, the sums of the ranks relating to the positive and negative differences should be the same.

1 Define the null and alternative hypotheses under study
H0: the median difference in the population equals zero
H1: the median difference in the population does not equal zero.
2 Collect relevant data from two related samples
3 Calculate the value of the test statistic specific to H0
Calculate the difference for each pair of results. Rank all n' non-zero differences, assigning the value 1 to the smallest difference and the value n' to the largest. Sum the ranks of the positive (T+) and negative (T−) differences.
If n' ≤ 25, the test statistic, T, takes the value T+ or T−, whichever is smaller.
If n' > 25, calculate the test statistic z, where:
z = [T − n'(n' + 1)/4] / √[n'(n' + 1)(2n' + 1)/24]
which follows a Normal distribution (its value has to be adjusted if there are many tied values1).
4 Compare the value of the test statistic to values from a known probability distribution
If n' ≤ 25, refer T to Appendix A8
If n' > 25, refer z to Appendix A1.
5 Interpret the P-value and results
Interpret the P-value and calculate a confidence interval for the median difference (Topic 19).

Examples
Ninety-six new recruits, all men aged between 16 and 20 years, had their teeth examined when they enlisted in the Royal Air Force. After receiving the necessary treatment to make their teeth dentally fit, they were examined one year later. A complete mouth, excluding wisdom teeth, has 28 teeth and, in this study, every tooth had four sites of periodontal interest; each recruit had a minimum of 83 and a maximum of 112 measurable sites on each occasion. It was of interest to examine the effect of treatment on pocket depth, a measure of gum disease (greater pocket depth indicates worse disease). As pocket depth (taking the average over the measurable sites in a mouth) was approximately Normally distributed in this sample of recruits, a paired t-test was performed to determine whether the average pocket depth was the same before and after treatment. Full computer output is shown in Appendix C.

1 Siegel, S. & Castellan, N.J. (1988) Nonparametric Statistics for the Behavioural Sciences, 2nd edn. McGraw-Hill, New York.

H0: the mean difference in a man's average pocket depth before and after treatment in the population of recruits equals zero; H1: the mean difference does not equal zero. The paired t-test gave evidence to reject the null hypothesis, so we infer that a recruit's average pocket depth was reduced after treatment. Of course, we have to be careful here if we want to conclude that it is the treatment that has reduced average pocket depth, as we have no control group of recruits who did not receive treatment; the improvement may be a consequence of time or of a change in dental hygiene habits, and may not be due to the treatment received.
A second outcome, loss of attachment, reflects more advanced disease than that assessed by pocket depth. The percentage of measurable sites with loss of attachment, before and after treatment, was calculated for each recruit; as these differences were not Normally distributed, a Wilcoxon signed ranks test was performed to investigate whether treatment had any effect on loss of attachment.
H0: the median of the differences (before and after treatment) in the percentages of sites with loss of attachment equals zero in the population of recruits; H1: the median of the differences does not equal zero. There was one zero difference; of the remaining n' = 13 differences, three were positive and 10 were negative. Since n' ≤ 25, T was referred to Appendix A8: P > 0.05. There is insufficient evidence to reject the null hypothesis of no change in the percentage of sites with loss of attachment. The median difference was approximately −3%, indicating that, on average, the percentage of sites with loss of attachment was slightly greater after treatment, but this difference was not significant and the approximate 95% confidence interval for the median difference included zero.
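As a practical illustration of the procedure described above, the following short Python sketch applies the Wilcoxon signed ranks test to a set of entirely hypothetical before/after measurements (SciPy is assumed to be available; these are not the recruit data from the example). scipy.stats.wilcoxon drops zero differences, ranks the rest, and reports the smaller of T+ and T− together with a P-value.

# Hypothetical paired data (not the recruit data from the example):
# the same measurement taken on each subject before and after treatment.
from scipy import stats

before = [5.2, 4.8, 6.1, 5.5, 4.9, 6.3, 5.0, 5.8, 6.0, 5.4]
after  = [4.9, 4.9, 5.6, 5.1, 4.7, 6.4, 4.6, 5.5, 5.8, 5.0]

# scipy.stats.wilcoxon works on the paired differences, ignores zero
# differences by default and returns T (the smaller of T+ and T-) with
# a two-sided P-value; for small n' an exact distribution is used.
T, p = stats.wilcoxon(before, after)
print(f"T = {T}, two-sided P = {p:.3f}")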

21 Numerical data: two unrelated groups

The problem
We have samples from two independent (unrelated) groups of individuals and one numerical or ordinal variable of interest. We are interested in whether the mean or distribution of the variable is the same in the two groups. For example, we may wish to compare the weights in two groups of children, each child being randomly allocated to receive either a dietary supplement or placebo.

The unpaired (two-sample) t-test

Assumptions
In the population, the variable is Normally distributed and the variances of the two groups are the same. In addition, we have reasonable sample sizes so that we can check the assumptions of Normality and equal variances.

Rationale
We consider the difference in the means of the two groups. Under the null hypothesis that the population means in the two groups are the same, this difference will equal zero. Therefore, we use a test statistic that is based on the difference in the two sample means, and on the value of the difference in population means under the null hypothesis (i.e. zero). This test statistic, often referred to as t, follows the t-distribution.

Notation
Our two samples are of size n1 and n2. Their means are x̄1 and x̄2; their standard deviations are s1 and s2.

1 Define the null and alternative hypotheses under study
H0: the population means in the two groups are equal
H1: the population means in the two groups are not equal.
2 Collect relevant data from two samples of individuals
3 Calculate the value of the test statistic specific to H0
If s is an estimate of the pooled standard deviation of the two groups,
s = √{[(n1 − 1)s1² + (n2 − 1)s2²] / (n1 + n2 − 2)}
then the test statistic is given by t, where:
t = [(x̄1 − x̄2) − 0] / SE(x̄1 − x̄2) = (x̄1 − x̄2) / [s √(1/n1 + 1/n2)]
which follows the t-distribution with (n1 + n2 − 2) degrees of freedom.
4 Compare the value of the test statistic to values from a known probability distribution
Refer t to Appendix A2. When the sample sizes in the two groups are large, the t-distribution approximates a Normal distribution, and then we reject the null hypothesis at the 5% level if the absolute value (i.e. ignoring the sign) of t is greater than 1.96.
5 Interpret the P-value and results
Interpret the P-value and calculate a confidence interval for the difference in the two means. The 95% confidence interval is given by:
(x̄1 − x̄2) ± t0.05 × SE(x̄1 − x̄2)
where t0.05 is the percentage point of the t-distribution with (n1 + n2 − 2) degrees of freedom which gives a two-tailed probability of 0.05.

Interpretation of the confidence interval
The upper and lower limits of the confidence interval can be used to assess whether the difference between the two mean values is clinically important. For example, if the lower limit is close to zero, this indicates that the true difference may be very small and clinically meaningless, even if the test is statistically significant.

If the assumptions are not satisfied
When the sample sizes are reasonably large, the t-test is fairly robust (Topic 32) to departures from Normality. However, it is less robust to unequal variances. There is a modification of the unpaired t-test that allows for unequal variances, and results from it are often provided in computer output. However, if you are concerned that the assumptions are not satisfied, then you either transform the data (Topic 9) to achieve approximate Normality and/or equal variances, or use a non-parametric test such as the Wilcoxon rank sum test.
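The steps above are easy to reproduce with standard software. The sketch below (Python with SciPy; the two samples are hypothetical values, not data from the book) computes the pooled standard deviation and the unpaired t statistic by hand, then shows the same test and the unequal-variance (Welch) modification mentioned in the last paragraph.

import math
from scipy import stats

group1 = [3.2, 3.5, 2.9, 3.8, 3.6, 3.1, 3.4, 3.7]   # hypothetical values
group2 = [2.8, 3.0, 3.3, 2.7, 3.1, 2.9, 3.2, 2.6]

n1, n2 = len(group1), len(group2)
m1, m2 = sum(group1) / n1, sum(group2) / n2
s1 = math.sqrt(sum((v - m1) ** 2 for v in group1) / (n1 - 1))
s2 = math.sqrt(sum((v - m2) ** 2 for v in group2) / (n2 - 1))

# Pooled standard deviation and test statistic, as defined in step 3 above
s = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))
t = (m1 - m2) / (s * math.sqrt(1 / n1 + 1 / n2))
print(f"pooled SD = {s:.3f}, t = {t:.2f} on {n1 + n2 - 2} df")

# The same test (equal variances assumed) and the unequal-variance
# (Welch) modification, both directly from SciPy
print(stats.ttest_ind(group1, group2))                   # equal variances
print(stats.ttest_ind(group1, group2, equal_var=False))  # Welch's t-test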

The Wilcoxon rank sum (two-sample) test

Rationale
The Wilcoxon rank sum test makes no distributional assumptions and is the non-parametric equivalent to the unpaired t-test. The test is based on the sum of the ranks of the values in each of the two groups; these should be comparable after allowing for differences in sample size if the groups have similar distributions. An equivalent test, known as the Mann-Whitney U test, gives identical results although it is slightly more complicated to carry out by hand.

1 Define the null and alternative hypotheses under study
H0: the two groups have the same distribution in the population
H1: the two groups have different distributions in the population.
2 Collect relevant data from two samples of individuals
3 Calculate the value of the test statistic specific to H0
All observations are ranked as if they were from a single sample. Tied observations are given the average of the ranks the values would have received if they had not been tied. The sum of the ranks, T, is then calculated in the group with the smaller sample size.
If the sample size in each group is 15 or less, T is the test statistic.
If at least one of the groups has a sample size of more than 15, calculate the test statistic
z = [T − nS(nS + nL + 1)/2] / √[nS nL(nS + nL + 1)/12]
which follows a Normal distribution, where nS and nL are the sample sizes of the smaller and larger groups, respectively. z must be adjusted if there are many tied values1.
4 Compare the value of the test statistic to values from a known probability distribution
If the sample size in each group is 15 or less, refer T to Appendix A9.
If at least one of the groups has a sample size of more than 15, refer z to Appendix A1.
5 Interpret the P-value and results
Interpret the P-value and obtain a confidence interval for the difference in the two medians. This is time-consuming to calculate by hand so details have not been included; some statistical packages will provide the CI. If this confidence interval is not included in your package, you can quote a confidence interval for the median in each of the two groups.

Example 1
In order to determine the effect of regular prophylactic inhaled corticosteroids on wheezing episodes associated with viral infection in school age children, a randomized, double-blind controlled trial was carried out comparing inhaled beclomethasone dipropionate with placebo. In this investigation, the primary endpoint was the mean forced expiratory volume (FEV1) over the study period. After checking the assumptions of Normality and constant variance (see Fig. 4.2), we performed an unpaired t-test to compare the means in the two groups. Full computer output is shown in Appendix C.

1 H0: the mean FEV1 in the population of school age children is the same in the two treatment groups
H1: the mean FEV1 in the population of school age children is not the same in the two treatment groups.
2 Treated group: sample size, n1 = 50; mean, x̄1 = 1.64 litres; standard deviation, s1 = 0.29 litres
Placebo group: sample size, n2 = 48; mean, x̄2 = 1.54 litres; standard deviation, s2 = 0.25 litres
3 Pooled standard deviation,
s = √{[(49 × 0.29²) + (47 × 0.25²)] / (50 + 48 − 2)} = 0.2670 litres
Test statistic, t = (1.64 − 1.54) / [0.2670 × √(1/50 + 1/48)]

1 Siegel, S. & Castellan, N.J. (1988) Nonparametric Statistics for the Behavioural Sciences, 2nd edn. McGraw-Hill, New York.

5 We have insufficient evidence to reject the null hypothesis at the 5% level. However, as the P-value is only just greater than 0.05, there may be an indication that the two population means are different. The estimated difference between the two means is 1.64 − 1.54 = 0.10 litres, and the 95% confidence interval for the true difference in the two means ranges from approximately −0.006 to 0.206 litres.

Example 2
In order to study whether the mechanisms involved in fatal asthma are different from those in mild asthma, the numbers of CD3+ T cells were compared in seven cases of fatal asthma and a group of cases of mild asthma. Because of the small sample sizes and obviously skewed data, a Wilcoxon rank sum test was performed to compare the distributions.
H0: the distributions of CD3+ T-cell numbers in the two groups in the population are the same; H1: the distributions in the two populations are not the same. The sum of the ranks was calculated in the smaller group and, because there are 10 or fewer values in each group, the P-value was obtained from Appendix A9. The result was significant, so there is evidence to reject the null hypothesis that the distributions of CD3+ T-cell levels are the same in the two groups, suggesting that CD3+ T cells are reduced in fatal asthma.
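A minimal sketch of the rank sum calculation in Python follows, using hypothetical observations rather than the FEV1 or CD3+ data above. It computes T, the sum of the ranks in the smaller group, and compares it with scipy.stats.mannwhitneyu, the Mann-Whitney U formulation mentioned in the rationale (U and T differ only by a constant).

from scipy import stats

# Hypothetical data: values of some measurement in two unrelated groups
small = [12.1, 14.3, 9.8, 11.5, 13.0]           # smaller group (nS = 5)
large = [10.2, 8.9, 9.5, 11.0, 7.8, 10.7, 9.1]  # larger group  (nL = 7)

# Rank all values as a single sample (average ranks for ties),
# then sum the ranks belonging to the smaller group.
combined = small + large
ranks = stats.rankdata(combined)
T = sum(ranks[: len(small)])
print(f"T (sum of ranks in smaller group) = {T}")

# Mann-Whitney U test: identical P-value, with U = T - nS(nS + 1)/2
U, p = stats.mannwhitneyu(small, large, alternative="two-sided")
print(f"U = {U}, two-sided P = {p:.3f}")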

22 Numerical data: more than two groups

The problem
We have samples from a number of independent groups. We have a single numerical or ordinal variable and are interested in whether the values of the variable vary between the groups, e.g. whether platelet counts vary between women of different ethnic backgrounds. Although we could perform tests to compare the values in each pair of groups, the high Type I error rate, resulting from the large number of comparisons, means that we may draw incorrect conclusions (Topic 18). Therefore, we carry out a single global test to determine whether there are differences between any groups.

One-way analysis of variance

Assumptions
The groups are defined by the levels of a single factor (e.g. different ethnic backgrounds). In the population of interest, the variable is Normally distributed and the variance in each group is the same. We have a reasonable sample size so that we can check these assumptions.

Rationale
The one-way analysis of variance separates the total variability in the data into that which can be attributed to differences between the individuals from the different groups (the between-group variation), and to the random variation between the individuals within each group (the within-group variation, sometimes called unexplained or residual variation). These components of variation are measured using variances, hence the name analysis of variance (ANOVA). Under the null hypothesis that the group means are the same, the between-group variance will be similar to the within-group variance. If, however, there are differences between the groups, then the between-group variance will be larger than the within-group variance. The test is based on the ratio of these two variances.

Notation
We have k independent samples, each defining a different group. The sample sizes, means and standard deviations in each group are ni, x̄i and si, respectively (i = 1, 2, . . ., k). The total sample size is n = n1 + n2 + . . . + nk.

1 Define the null and alternative hypotheses under study
H0: all group means in the population are equal
H1: at least one group mean in the population differs from the others.
2 Collect relevant data from samples of individuals
3 Calculate the value of the test statistic specific to H0
The test statistic for ANOVA is a ratio, F, of the between-group variance to the within-group variance. This F-statistic follows the F-distribution (Topic 8) with (k − 1), (n − k) degrees of freedom in the numerator and denominator, respectively.
The calculations involved in ANOVA are complex and are not shown here. Most computer packages will output the values directly in an ANOVA table, which usually includes the F-ratio and P-value (see Example).
4 Compare the value of the test statistic to values from a known probability distribution
Refer the F-ratio to Appendix A5. Because the between-group variation is greater than the within-group variation, we look at the one-sided P-values.
5 Interpret the P-value and results
If we obtain a significant result at this initial stage, we may consider performing specific pairwise post-hoc comparisons. We can use one of a number of special tests devised for this purpose (e.g. Duncan's, Scheffé's) or we can use the unpaired t-test (Topic 21) adjusted for multiple hypothesis testing (Topic 18). We can also calculate a confidence interval for each individual group mean (Topic 11). Note that we use a pooled estimate of the variance of the values from all groups when calculating confidence intervals and performing t-tests. Most packages refer to this estimate of the variance as the residual variance or residual mean square; it is found in the ANOVA table.

Although the two tests appear to be different, the unpaired t-test and ANOVA give equivalent results when there are only two groups of individuals.
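As the text notes, the F-ratio and its P-value are usually read off an ANOVA table produced by software. A minimal Python sketch is shown below, using hypothetical measurements in three groups (not the platelet data of the example) and scipy.stats.f_oneway.

from scipy import stats

# Hypothetical data: a numerical variable measured in k = 3 independent groups
group_a = [24.1, 26.3, 25.0, 27.8, 24.9, 26.1]
group_b = [22.5, 23.9, 24.2, 21.8, 23.0, 22.7]
group_c = [25.6, 27.2, 26.8, 28.1, 26.0, 27.5]

# One-way ANOVA: F is the ratio of the between-group variance to the
# within-group variance, on (k - 1, n - k) degrees of freedom.
F, p = stats.f_oneway(group_a, group_b, group_c)

k = 3
n = len(group_a) + len(group_b) + len(group_c)
print(f"F = {F:.2f} on ({k - 1}, {n - k}) df, P = {p:.4f}")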

If the assumptions are not satisfied
Although ANOVA is relatively robust (Topic 32) to moderate departures from Normality, it is not robust to unequal variances. Therefore, before carrying out the analysis, we check for Normality, and test whether the variances are similar in the groups either by eyeballing them, or by using Levene's test or Bartlett's test (Topic 32). If the assumptions are not satisfied, we can either transform the data (Topic 9), or use the non-parametric equivalent of one-way ANOVA, the Kruskal-Wallis test.

The Kruskal-Wallis test

Rationale
This non-parametric test is an extension of the Wilcoxon rank sum test (Topic 21). Under the null hypothesis of no differences in the distributions between the groups, the sums of the ranks in each of the k groups should be comparable after allowing for any differences in sample size.

1 Define the null and alternative hypotheses under study
H0: each group has the same distribution of values in the population
H1: each group does not have the same distribution of values in the population.
2 Collect relevant data from samples of individuals
3 Calculate the value of the test statistic specific to H0
Rank all n values and calculate the sum of the ranks in each of the groups: these sums are R1, . . ., Rk. The test statistic (which should be modified if there are many tied values1) is given by:
H = [12 / (n(n + 1))] Σ (Ri²/ni) − 3(n + 1)
which follows a Chi-squared distribution with (k − 1) df.
4 Compare the value of the test statistic to values from a known probability distribution
Refer H to Appendix A3.
5 Interpret the P-value and results
Interpret the P-value and, if significant, perform two-sample non-parametric tests, adjusting for multiple testing. Calculate a confidence interval for the median in each group.

We use one-way ANOVA when the groups relate to a single factor and are independent. We can use other forms of ANOVA when the study design is more complex2.

1 Siegel, S. & Castellan, N.J. (1988) Nonparametric Statistics for the Behavioral Sciences. McGraw-Hill, New York.
2 Hand, D.J. & Taylor, C.C. (1987) Multivariate Analysis of Variance and Repeated Measures. Chapman and Hall, London.

Example 1
A total of 150 women of different ethnic backgrounds were included in a cross-sectional study of factors related to blood clotting. We compared mean platelet levels in the four groups using a one-way ANOVA. The assumptions (Normality, constant variance) were met, as shown in the computer output (Appendix C).
1 H0: there are no differences in the mean platelet levels in the four groups in the population
H1: at least one group mean platelet level differs from the others in the population.
2 The following table summarizes the data in each group.

Group summaries (sample size, mean, standard deviation and 95% CI for the mean, using the pooled standard deviation from the ANOVA) were tabulated for the four ethnic groups: Caucasian (n = 90), Afro-Caribbean (n = 21), Mediterranean (n = 18) and other (n = 21).
3 The ANOVA table extracted from the computer output gives the between-ethnic-group and within-ethnic-group sums of squares, degrees of freedom, mean squares, the F-ratio and its P-value, together with the pooled standard deviation.
4 The ANOVA table gives P = 0.70. (We would have referred F to Appendix A5 with (3, 146) degrees of freedom to determine the P-value.)
5 There is insufficient evidence to reject the null hypothesis that the mean platelet levels in the four groups in the population are the same.

Example 2
Quality-of-life scores, measured using the SF-36 questionnaire, were obtained in three groups of individuals: those with severe haemophilia, those with mild/moderate haemophilia, and normal controls. Each group comprised a sample of 20 individuals. Scores on the physical functioning scale (PFS), which can take values from 0 to 100, were compared in the three groups. As visual inspection of Fig. 22.1 showed that the data were not Normally distributed, we performed a Kruskal-Wallis test.

Fig. 22.1 Dot plot showing physical functioning scores (from the SF-36 questionnaire) in individuals with severe and mild/moderate haemophilia and in normal controls; the horizontal bars are the medians. Each group has a sample size of 20 and scores range from 0 to 100; the medians (95% CI) are 47.5 (30 to 80) for severe haemophilia, 87.5 (75 to 95) for mild/moderate haemophilia and 100 (90 to 100) for the controls.

1 H0: each group has the same distribution of PFS scores in the population
H1: at least one of the groups has a different distribution of PFS scores in the population.
2 The data are shown in Fig. 22.1.
3 The sums of the ranks were calculated in the severe haemophilia, mild/moderate haemophilia and normal control groups and used to calculate H.
4 We refer H to Appendix A3: P < 0.001.
5 There is substantial evidence to reject the null hypothesis that the distribution of PFS scores is the same in the three groups. Pairwise comparisons were carried out using Wilcoxon rank sum tests, adjusting the P-values for the number of tests performed. The individuals with severe and mild/moderate haemophilia both had significantly lower PFS scores than the controls, but the distributions of the scores in the two haemophilia groups were not significantly different from each other.
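The H statistic described above is straightforward to compute from the rank sums. The sketch below (Python with SciPy, using hypothetical scores rather than the haemophilia data) does the calculation by hand and checks it against scipy.stats.kruskal, which also applies the correction for ties.

from scipy import stats

# Hypothetical ordinal scores in k = 3 independent groups
g1 = [55, 60, 42, 71, 66]
g2 = [80, 75, 90, 68, 85]
g3 = [95, 88, 92, 99, 90]

groups = [g1, g2, g3]
values = [v for g in groups for v in g]
n = len(values)
ranks = stats.rankdata(values)            # average ranks for tied values

# Sum of ranks, Ri, in each group
rank_sums, start = [], 0
for g in groups:
    rank_sums.append(sum(ranks[start:start + len(g)]))
    start += len(g)

# H = 12/(n(n+1)) * sum(Ri^2/ni) - 3(n+1), referred to a chi-squared distribution
H = 12 / (n * (n + 1)) * sum(R**2 / len(g) for R, g in zip(rank_sums, groups)) - 3 * (n + 1)
print(f"H (no tie correction) = {H:.3f}")

print(stats.kruskal(g1, g2, g3))          # H with tie correction, plus P-value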

23 Categorical data: a single proportion

The problem
We have a single sample of n individuals; each individual either 'possesses' a characteristic of interest (e.g. is male, is pregnant, has died) or does not possess that characteristic (e.g. is female, is not pregnant, is still alive). A useful summary of the data is provided by the proportion of individuals with the characteristic. We are interested in determining whether the true proportion in the population of interest takes a particular value.

The test of a single proportion

Assumptions
Our sample of individuals is selected from the population of interest. Each individual either has or does not have the particular characteristic.

Notation
r individuals in our sample of size n have the characteristic. The estimated proportion with the characteristic is p = r/n. The proportion of individuals with the characteristic in the population is π. We are interested in determining whether π takes a particular value, π1.

Rationale
The number of individuals with the characteristic follows the Binomial distribution (Topic 8), but this can be approximated by the Normal distribution, providing np and n(1 − p) are each greater than 5. Then p is approximately Normally distributed with an estimated mean = p and an estimated standard deviation = √[p(1 − p)/n]. Therefore, our test statistic, which is based on p, also follows the Normal distribution.

1 Define the null and alternative hypotheses under study
H0: the population proportion, π, is equal to a particular value, π1
H1: the population proportion, π, is not equal to π1.
2 Collect relevant data from a sample of individuals
3 Calculate the value of the test statistic specific to H0
z = [|p − π1| − 1/(2n)] / √[p(1 − p)/n]
which follows a Normal distribution.
The 1/(2n) in the numerator is a continuity correction: it is included to make an allowance for the fact that we are approximating the discrete Binomial distribution by the continuous Normal distribution.
4 Compare the value of the test statistic to values from a known probability distribution
Refer z to Appendix A1.
5 Interpret the P-value and results
Interpret the P-value and calculate a confidence interval for the true population proportion, π. The 95% confidence interval for π is:
p ± 1.96 √[p(1 − p)/n]
We can use this confidence interval to assess the clinical or biological importance of the results. A wide confidence interval is an indication that our estimate has poor precision.

The sign test applied to a proportion

Rationale
The sign test (Topic 19) can be used if the response of interest can be expressed as a preference (e.g. in a cross-over trial, patients may have a preference for either treatment A or treatment B). If there is no preference overall, then we would expect the proportion preferring A, say, to equal 1/2. We use the sign test to assess whether this is so.
Although this formulation of the problem and its test statistic appear to be different from those of Topic 19, both approaches to the sign test produce the same result.

1 Define the null and alternative hypotheses under study
H0: the proportion, π, of preferences for A in the population is equal to 1/2
H1: the proportion of preferences for A in the population is not equal to 1/2.
2 Collect relevant data from a sample of individuals
3 Calculate the value of the test statistic specific to H0
Ignore any individuals who have no preference and reduce the sample size from n to n' accordingly. Then p = r/n', where r is the number of preferences for A.
If n' ≤ 10, count r, the number of preferences for A.
If n' > 10, calculate the test statistic:
z' = [|p − 1/2| − 1/(2n')] / √[p(1 − p)/n']
z' follows the Normal distribution. Note that this formula is based on the test statistic, z, used in the previous box to test the null hypothesis that the population proportion equals π1; here we replace n by n', and π1 by 1/2.
4 Compare the value of the test statistic to values from a known probability distribution
If n' ≤ 10, refer r to Appendix A6
If n' > 10, refer z' to Appendix A1.
5 Interpret the P-value and results
Interpret the P-value and calculate a confidence interval for the proportion of preferences for A in the entire sample of size n.

Example
Human herpesvirus 8 (HHV-8) has been linked to Kaposi's sarcoma, primary effusion lymphoma and certain types of multicentric Castleman's disease. It has been suggested that HHV-8 can be transmitted sexually. In order to assess the relationships between sexual behaviour and HHV-8 infection, the prevalence of antibodies to HHV-8 was determined in a group of 271 homo/bisexual men attending a London sexually transmitted disease clinic. In the blood donor population in the UK, the seroprevalence of HHV-8 has been documented to be 2.7%. Initially the seroprevalence from this study was compared to 2.7% using a single proportion test.

1 H0: the seroprevalence of HHV-8 in the population of homo/bisexual men equals 2.7%
H1: the seroprevalence of HHV-8 in the population of homo/bisexual men does not equal 2.7%.
2 Sample size, n = 271; number who are seropositive to HHV-8, r = 50
Seroprevalence, p = 50/271 = 0.185 (i.e. 18.5%)
3 Test statistic is:
z = [|0.185 − 0.027| − 1/(2 × 271)] / √[0.185 × (1 − 0.185)/271] = 6.62
4 We refer z to Appendix A1: P < 0.0001.
5 There is substantial evidence that the seroprevalence of HHV-8 in homo/bisexual men attending sexually transmitted disease clinics in the UK is higher than that in the blood donor population. The 95% confidence interval for the seroprevalence of HHV-8 in the population of homo/bisexual men is 13.9% to 23.1%, calculated as
0.185 ± 1.96 × √[0.185 × (1 − 0.185)/271].

Data kindly provided by Drs N.A. Smith, D. Barlow and B.S. Peters, Department of Genitourinary Medicine, Guy's and St Thomas' NHS Trust, London, and Dr J. Best, Department of Virology, Guy's, King's College and St Thomas' School of Medicine, King's College, London, UK.
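The arithmetic in the HHV-8 example can be reproduced with a few lines of Python (SciPy is used only for the Normal distribution). Small discrepancies from the quoted figures reflect rounding of p to 0.185 in the worked example.

import math
from scipy import stats

r, n = 50, 271          # seropositive men out of 271, from the example
pi1 = 0.027             # seroprevalence in UK blood donors
p = r / n               # estimated proportion, about 0.185

# Test statistic with continuity correction, as in step 3 above
se = math.sqrt(p * (1 - p) / n)
z = (abs(p - pi1) - 1 / (2 * n)) / se
p_value = 2 * (1 - stats.norm.cdf(z))     # two-sided P from the Normal distribution

# 95% confidence interval for the true population proportion
ci = (p - 1.96 * se, p + 1.96 * se)
print(f"z = {z:.2f}, P = {p_value:.2g}, 95% CI = {ci[0]:.3f} to {ci[1]:.3f}")
# Gives z of about 6.6, P < 0.0001 and a CI of about 0.138 to 0.231; the worked
# example quotes 6.62 and 13.9% to 23.1% after rounding p to 0.185.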

Example
In a double-blind cross-over study, 36 adults with perennial allergic rhinitis were treated with subcutaneous injections of either inhalant allergens or placebo, each treatment being given daily for a defined period. The patients were asked whether they preferred the active drug or the placebo. The sign test was performed to investigate whether the proportions preferring the two preparations were the same.

1 H0: the proportion preferring the active preparation in the population equals 0.5
H1: the proportion preferring the active preparation in the population does not equal 0.5.
2 Of the 36 adults, 27 expressed a preference; 21 preferred the active preparation. Of those with a preference, the proportion preferring the active preparation, p = 21/27 = 0.778.
3 Test statistic is:
z' = [|0.778 − 0.5| − 1/(2 × 27)] / √[0.778 × (1 − 0.778)/27]
4 We refer z' to Appendix A1: P = 0.001.
5 There is substantial evidence to reject the null hypothesis that the two preparations are preferred equally in the population. The 95% confidence interval for the true proportion is 0.63 to 0.94, calculated as
0.778 ± 1.96 × √[0.778 × (1 − 0.778)/27].
Therefore, at the very least, we believe that almost two-thirds of individuals in the population prefer the active preparation.

Data adapted from: Radcliffe, M.J., Lampe, F.C. & Brostoff, J. (1996) Allergen-specific low-dose immunotherapy in perennial allergic rhinitis: a double-blind placebo-controlled crossover study. Journal of Investigational Allergology and Clinical Immunology, 6, 242-247.
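The same calculation for the sign test in the preference example, as a short Python sketch using the counts quoted above (27 adults with a preference, 21 preferring the active drug):

import math
from scipy import stats

n_prime, r = 27, 21               # adults with a preference; preferences for active drug
p = r / n_prime                   # about 0.778

# z' replaces n by n' and pi1 by 1/2 in the single-proportion test statistic
z_dash = (abs(p - 0.5) - 1 / (2 * n_prime)) / math.sqrt(p * (1 - p) / n_prime)
p_value = 2 * (1 - stats.norm.cdf(z_dash))
print(f"z' = {z_dash:.2f}, two-sided P = {p_value:.4f}")
# Gives z' of about 3.2 and P of about 0.001, as in the worked example.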

24 Categorical data: two proportions

The problem
We have two independent groups of individuals (e.g. homosexual men with and without a history of gonorrhoea). We want to know if the proportions of individuals with a characteristic (e.g. infected with human herpesvirus-8, HHV-8) are the same in the two groups.
Alternatively, we have two related groups, e.g. individuals may be matched, or measured twice in different circumstances (say, before and after treatment). We want to know if the proportions with a characteristic (e.g. raised test result) are the same in the two groups.

Independent groups: the Chi-squared test

Terminology
The data are obtained, initially, as frequencies, i.e. the numbers with and without the characteristic in each sample. A table in which the entries are frequencies is called a contingency table; when this table has two rows and two columns it is called a 2 × 2 table. Table 24.1 shows the observed frequencies in the four cells corresponding to each row/column combination, the four marginal totals (the frequency in a specific row or column, e.g. a + b), and the overall total, n. We can calculate (see Rationale) the frequency that we would expect in each of the four cells of the table if H0 were true (the expected frequencies).

Assumptions
We have samples of sizes n1 and n2 from two independent groups of individuals. We are interested in whether the proportions of individuals who possess the characteristic are the same in the two groups. Each individual is represented only once in the study. The rows (and columns) of the table are mutually exclusive, implying that each individual can belong in only one row and only one column. The usual, albeit conservative, approach requires that the expected frequency in each of the four cells is at least five.

Rationale
If the proportions with the characteristic in the two groups are equal, we can estimate the overall proportion of individuals with the characteristic by p = (a + b)/n; we expect n1 × p of them to be in Group 1 and n2 × p to be in Group 2. We evaluate expected numbers without the characteristic similarly. Therefore, each expected frequency is the product of the two relevant marginal totals divided by the overall total. A large discrepancy between the observed (O) and the corresponding expected (E) frequencies is an indication that the proportions in the two groups differ. The test statistic is based on this discrepancy.

Table 24.1 Observed frequencies.

                              Group 1          Group 2          Total
Characteristic present        a                b                a + b
Characteristic absent         c                d                c + d
Total                         n1 = a + c       n2 = b + d       n = a + b + c + d
Proportion with
characteristic                p1 = a/n1        p2 = b/n2        p = (a + b)/n

1 Define the null and alternative hypotheses under study
H0: the proportions of individuals with the characteristic are equal in the two groups in the population
H1: these population proportions are not equal.
2 Collect relevant data from samples of individuals
3 Calculate the value of the test statistic specific to H0
χ² = Σ (|O − E| − 1/2)² / E
where O and E are the observed and expected frequencies, respectively, in each of the four cells of the table. The vertical lines around O − E indicate that we ignore its sign. The 1/2 in the numerator is the continuity correction (Topic 19). The test statistic follows the Chi-squared distribution with 1 degree of freedom.
4 Compare the value of the test statistic to values from a known probability distribution
Refer χ² to Appendix A3.
5 Interpret the P-value and results
Interpret the P-value and calculate the confidence interval for the difference in the true population proportions. The 95% confidence interval is given by:
(p1 − p2) ± 1.96 √[p1(1 − p1)/n1 + p2(1 − p2)/n2]

If the assumptions are not satisfied
If E < 5 in any one cell, we use Fisher's exact test to obtain a P-value that does not rely on the approximation to the Chi-squared distribution. This is best left to a computer program as the calculations are tedious to perform by hand.
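A minimal sketch of the Chi-squared test for a 2 × 2 table in Python follows; the four cell counts here are purely illustrative and are not taken from any example in the book. scipy.stats.chi2_contingency applies Yates' continuity correction to 2 × 2 tables by default, matching the 1/2 correction in step 3, and also returns the expected frequencies.

import numpy as np
from scipy import stats

# Illustrative 2 x 2 table of observed frequencies:
# rows = characteristic present/absent, columns = Group 1 / Group 2
observed = np.array([[30, 15],
                     [70, 85]])

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=True)
print(f"chi-squared = {chi2:.2f} on {dof} df, P = {p:.4f}")
print("expected frequencies:\n", expected)

# 95% CI for the difference in the two proportions
n1, n2 = observed[:, 0].sum(), observed[:, 1].sum()
p1, p2 = observed[0, 0] / n1, observed[0, 1] / n2
se = np.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
print(f"p1 - p2 = {p1 - p2:.3f}, "
      f"95% CI = {p1 - p2 - 1.96 * se:.3f} to {p1 - p2 + 1.96 * se:.3f}")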

Related groups: McNemar's test

Assumptions
The two groups are related or dependent, e.g. each individual may be measured in two different circumstances. Every individual is classified according to whether the characteristic is present in both circumstances, one circumstance only, or in neither (Table 24.2).

Table 24.2 Observed frequencies of pairs in which the characteristic is present or absent.

                                 Circumstance 1
                                 Present      Absent      Total no. of pairs
Circumstance 2     Present       w            x           w + x
                   Absent        y            z           y + z
Total                            w + y        x + z       m = w + x + y + z

Rationale
The observed proportions with the characteristic in the two circumstances are (w + y)/m and (w + x)/m. They will differ if x and y differ. Therefore, to compare the proportions with the characteristic, we ignore those individuals who agree in the two circumstances, and concentrate on the discordant pairs, x and y.

1 Define the null and alternative hypotheses under study
H0: the proportions with the characteristic are equal in the two groups in the population
H1: these population proportions are not equal.
2 Collect relevant data from two samples
3 Calculate the value of the test statistic specific to H0
χ² = (|x − y| − 1)² / (x + y)
which follows the Chi-squared distribution with 1 degree of freedom. The 1 in the numerator is a continuity correction (Topic 19).
4 Compare the value of the test statistic with values from a known probability distribution
Refer χ² to Appendix A3.
5 Interpret the P-value and results
Interpret the P-value and calculate the confidence interval for the difference in the true population proportions. The approximate 95% CI is:
(x − y)/m ± (1.96/m) √[x + y − (x − y)²/m]

Example 1
In order to assess the relationship between sexual behaviour and HHV-8 infection (study described in Topic 23), the prevalence of seropositivity to HHV-8 was compared in homo/bisexual men who had a previous history of gonorrhoea and those who had not previously had gonorrhoea, using the Chi-squared test. A typical computer output is shown in Appendix C.

1 H0: the seroprevalence rate is the same in those with and without a history of gonorrhoea in the population
H1: the seroprevalence rates are not the same in the two groups in the population.
3 The expected frequencies are shown in the four cells of the contingency table, and the test statistic is calculated from the observed and expected frequencies.

The contingency table classifies the 271 men by previous history of gonorrhoea (yes/no) and HHV-8 serostatus (seropositive/seronegative). The column totals are 43 men with and 228 without a previous history of gonorrhoea, and 50 of the 271 men are seropositive overall, giving expected frequencies of 43 × 50/271 = 7.93 and 228 × 50/271 = 42.07 seropositive men, and 43 × 221/271 = 35.07 and 228 × 221/271 = 185.93 seronegative men, in the two groups respectively.

Example 2
In order to compare two methods of establishing the cavity status (present or absent) of teeth, a dentist assessed the condition of 100 first permanent molar teeth that had either tiny or no cavities using radiographic techniques. These results were compared with those obtained using the more objective approach of visually assessing a section of each tooth. The percentages of teeth detected as having cavities by the two methods of assessment were compared using McNemar's test.

1 H0: the two methods of assessment identify the same percentage of teeth with cavities in the population
H1: these percentages are not equal.
2 The frequencies for the matched pairs are displayed in the table:

                               Radiographic diagnosis
Diagnosis on section           Cavities absent    Cavities present    Total
Cavities absent                45                 4                   49
Cavities present               17                 34                  51
Total                          62                 38                  100

3 Test statistic, χ² = (|17 − 4| − 1)² / (17 + 4) = 6.86
4 We refer χ² to Appendix A3 with 1 degree of freedom: 0.001 < P < 0.01.
5 There is substantial evidence to reject the null hypothesis that the same percentage of teeth are detected as having cavities using the two methods of assessment. The radiographic method has a tendency to fail to detect cavities. We estimate the difference in percentages of teeth detected as having cavities as 51% − 38% = 13%. An approximate confidence interval for the true difference in the percentages is given by 4.4% to 21.6%, calculated as
{ (17 − 4)/100 ± (1.96/100) × √[(17 + 4) − (17 − 4)²/100] } × 100%.

Adapted from Ketley, C.E. & Holt, R.D. (1993) Visual and radiographic diagnosis of occlusal caries in first permanent molars and in second primary molars. British Dental Journal, 174, 364-370.
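McNemar's statistic and the confidence interval from Example 2 are easy to verify by hand; here is a short Python sketch using the discordant counts from the dental example (x = 17, y = 4, m = 100 pairs).

import math
from scipy import stats

x, y, m = 17, 4, 100     # discordant pairs and total pairs, from Example 2

# McNemar's test statistic with continuity correction, 1 df
chi2 = (abs(x - y) - 1) ** 2 / (x + y)
p_value = 1 - stats.chi2.cdf(chi2, df=1)
print(f"chi-squared = {chi2:.2f}, P = {p_value:.4f}")   # 6.86, P about 0.009

# Approximate 95% CI for the difference in the two proportions
diff = (x - y) / m
half_width = 1.96 / m * math.sqrt(x + y - (x - y) ** 2 / m)
print(f"difference = {diff:.2f}, "
      f"95% CI = {diff - half_width:.3f} to {diff + half_width:.3f}")
# about 0.044 to 0.216, i.e. 4.4% to 21.6% as quoted in the example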

25 Categorical data: more than two categories

Chi-squared test: large contingency tables

The problem
Individuals can be classified by two factors. For example, one factor may represent disease severity (mild, moderate or severe) and the other factor may represent blood group (A, B, O, AB). We are interested in whether the two factors are associated. Are individuals of a particular blood group likely to be more severely ill?

Assumptions
The data may be presented in an r × c contingency table with r rows and c columns (Table 25.1). The entries in the table are frequencies; each cell contains the number of individuals in a particular row and a particular column. Every individual is represented once, and can only belong in one row and in one column, i.e. the categories of each factor are mutually exclusive. At least 80% of the expected frequencies are greater than or equal to 5.

Rationale
The null hypothesis is that there is no association between the two factors. Note that if there are only two rows and two columns, then this test of no association is the same as that of two proportions (Topic 24). We calculate the frequency that we expect in each cell of the contingency table if the null hypothesis is true. As explained in Topic 24, the expected frequency in a particular cell is the product of the relevant row total and relevant column total, divided by the overall total. We calculate a test statistic that focuses on the discrepancy between the observed and expected frequencies in every cell of the table. If the overall discrepancy is large, then it is unlikely the null hypothesis is true.

Table 25.1 Observed frequencies in an r × c table.

Row categories   Col 1   Col 2   Col 3   ...   Col c   Total
Row 1            f11     f12     f13     ...   f1c     R1
Row 2            f21     f22     f23     ...   f2c     R2
Row 3            f31     f32     f33     ...   f3c     R3
...              ...     ...     ...     ...   ...     ...
Row r            fr1     fr2     fr3     ...   frc     Rr
Total            C1      C2      C3      ...   Cc      n

1 Define the null and alternative hypotheses under study
H0: there is no association between the categories of one factor and the categories of the other factor in the population
H1: the two factors are associated in the population.
2 Collect relevant data from a sample of individuals
3 Calculate the value of the test statistic specific to H0
χ² = Σ (O − E)² / E
where O and E are the observed and expected frequencies in each cell of the table. The test statistic follows the Chi-squared distribution with degrees of freedom equal to (r − 1) × (c − 1).
Because the approximation to the Chi-squared distribution is reasonable if the degrees of freedom are greater than one, we do not need to include a continuity correction (as we did in Topic 24).
4 Compare the value of the test statistic to values from a known probability distribution
Refer χ² to Appendix A3.
5 Interpret the P-value and results

If the assumptions are not satisfied
If more than 20% of the expected frequencies are less than 5, we try to combine, appropriately (i.e. so that it makes scientific sense), two or more rows and/or two or more columns of the contingency table. We then recalculate the expected frequencies of this reduced table, and carry on reducing the table, if necessary, to ensure that the E ≥ 5 condition is satisfied. If we have reduced our table to a 2 × 2 table so that it can be reduced no further and we still have small expected frequencies, we use Fisher's exact test (Topic 24) to evaluate the exact P-value. Some computer packages will compute Fisher's exact P-values for larger contingency tables.

Chi-squared test for trend

The problem
Sometimes we investigate relationships in categorical data when one of the two factors has only two categories (e.g. the presence or absence of a characteristic) and the second factor can be categorised into k, say, mutually exclusive categories that are ordered in some sense.

For example, one factor might be whether or not an individual responds to treatment, and the ordered categories of the other factor may represent four different age (in years) categories: 65-69, 70-74, 75-79 and ≥80. We can then assess whether there is a trend in the proportions with the characteristic over the categories of the second factor. For example, we may wish to know if the proportions responding to treatment tend to increase (say) with increasing age.

Table 25.2 Observed frequencies and assigned scores in a 2 × k table.

Characteristic   Col 1   Col 2   Col 3   ...   Col k   Total
Present          f11     f12     f13     ...   f1k     R1
Absent           f21     f22     f23     ...   f2k     R2
Total            C1      C2      C3      ...   Ck      n
Score            w1      w2      w3      ...   wk

1 Define the null and alternative hypotheses under study
H0: there is no trend in the proportions with the characteristic in the population
H1: there is a trend in the proportions in the population.
2 Collect relevant data from a sample of individuals
We estimate the proportions with the characteristic in each of the k categories. We assign a score to each of the column categories. Typically these are the successive values, 1, 2, 3, . . ., k, but, depending on how we have classified the column factor, they could be numbers that in some way suggest the relative values of the ordered categories (e.g. the midpoint of the age range defining each category) or the trend we wish to investigate (e.g. linear or quadratic). The use of any equally spaced numbers (e.g. 1, 2, 3, . . ., k) allows us to investigate a linear trend.
3 Calculate the value of the test statistic specific to H0
using the notation of Table 25.2, with the sums extending over all the k categories. The test statistic follows the Chi-squared distribution with (k − 1) degrees of freedom.
4 Compare the value of the test statistic to values from a known probability distribution
Refer χ² to Appendix A3.
5 Interpret the P-value and results
Interpret the P-value and calculate a confidence interval for each of the k proportions (Topic 11).

Example
A cross-sectional survey was carried out among the elderly population living in Southampton, with the objective of measuring the frequency of cardiovascular disease. A total of 259 individuals, ranging between 65 and 95 years of age, were interviewed. Individuals were grouped into four age groups (65-69, 70-74, 75-79 and 80+ years) at the time of interview. We used the Chi-squared test to determine whether the prevalence of chest pain differed in the four age groups.

1 H0: there is no association between age and chest pain in the population
H1: there is an association between age and chest pain in the population.
2 The observed frequencies (%) and expected frequencies are shown in the following table.

3 Test statistic,
χ² = (15 − 9.7)²/9.7 + . . . + (41 − 39.1)²/39.1 = 4.839
4 We refer χ² to Appendix A3 with 3 degrees of freedom: P > 0.10 (computer output gives P = 0.18).
5 There is insufficient evidence to reject the null hypothesis of no association between chest pain and age in the population of elderly people. The estimated proportions (95% confidence intervals) with chest pain for the four successive age groups, starting with the youngest, are: 0.20 (0.11, 0.29), 0.12 (0.04, 0.19), 0.10 (0.02, 0.17) and 0.09 (0.02, 0.21).

                            Age group (years)
Chest pain         65-69         70-74         75-79         80+           Total
Yes  Observed      15 (20.3%)    9 (11.5%)     6 (9.7%)      4 (8.9%)      34
     Expected      9.7           10.2          8.1           5.9
No   Observed      59 (79.7%)    69 (88.5%)    56 (90.3%)    41 (91.1%)    225
     Expected      64.3          67.8          53.9          39.1
Total              74            78            62            45            259

As the four age groups in this study are ordered, it is also possible to analyse these data using a Chi-squared test for trend, which takes into account the ordering of the groups. We may obtain a significant result from this test, even though the general test of association gave a non-significant result. We assign the scores of 1, 2, 3 and 4 to each of the four age groups, respectively, and because of their even spacing, can test for a linear trend.

1 H0: there is no linear association between age and chest pain in the population
H1: there is a linear association between age and chest pain in the population.
2 The data are displayed in the table above. We assign scores of 1, 2, 3 and 4 to the four age groups, respectively.
3 Test statistic, χ² = 3.79, calculated from the observed frequencies in the table, the assigned scores 1 to 4, and the totals 34 and 259.
4 We refer χ² to Appendix A3 with 3 degrees of freedom: P > 0.10 (computer output gives P = 0.29).
5 There is insufficient evidence to reject the null hypothesis of no linear association between chest pain and age in the population of elderly people.

Adapted from: Dewhurst, G., Wood, D.A., Walker, E. et al. (1991) A population survey of cardiovascular disease in elderly people: design, methods and prevalence results. Age and Ageing, 20, 353-360.
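Both analyses above can be reproduced with a short Python sketch using the observed frequencies in the table. scipy.stats.chi2_contingency gives the general test of association; the trend statistic is computed here from the standard chi-squared-for-trend formula (an assumption, since the book's own algebraic layout is not reproduced above), which yields the same value of 3.79.

import numpy as np
from scipy import stats

# Observed frequencies: chest pain (yes/no) by age group (65-69, 70-74, 75-79, 80+)
observed = np.array([[15,  9,  6,  4],
                     [59, 69, 56, 41]])
scores = np.array([1, 2, 3, 4])          # equally spaced scores for a linear trend

# General test of association (no continuity correction for tables larger than 2 x 2)
chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)
print(f"association: chi-squared = {chi2:.2f} on {dof} df, P = {p:.2f}")   # 4.84, P = 0.18

# Chi-squared test for trend (standard linear-trend statistic)
f1 = observed[0]                         # numbers with chest pain
col_totals = observed.sum(axis=0)
n, R1 = observed.sum(), f1.sum()
p_bar = R1 / n
num = (np.sum(f1 * scores) - R1 * np.sum(col_totals * scores) / n) ** 2
den = p_bar * (1 - p_bar) * (np.sum(col_totals * scores**2)
                             - np.sum(col_totals * scores) ** 2 / n)
print(f"trend: chi-squared = {num / den:.2f}")                             # 3.79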

26 Correlation

Introduction
Correlation analysis is concerned with measuring the degree of association between two variables, x and y. Initially, we assume that both x and y are numerical, e.g. height and weight.
Suppose we have a pair of values, (x, y), measured on each of the n individuals in our sample. We can mark the point corresponding to each individual's pair of values on a two-dimensional scatter diagram (Topic 4). Conventionally, we put the x variable on the horizontal axis, and the y variable on the vertical axis in this diagram. Plotting the points for all n individuals, we obtain a scatter of points that may suggest a relationship between the two variables.

Pearson correlation coefficient
We say that we have a linear relationship between x and y if a straight line drawn through the midst of the points provides the most appropriate approximation to the observed relationship. We measure how close the observations are to the straight line that best describes their linear relationship by calculating the Pearson product moment correlation coefficient, usually simply called the correlation coefficient. Its true value in the population, ρ (the Greek letter, rho), is estimated in the sample by r, where
r = Σ(x − x̄)(y − ȳ) / √[Σ(x − x̄)² Σ(y − ȳ)²]
which is usually obtained from computer output.

Properties
- r lies between −1 and +1.
- Its sign indicates whether one variable increases as the other variable increases (positive r) or whether one variable decreases as the other increases (negative r) (see Fig. 26.1).
- Its magnitude indicates how close the points are to the straight line. In particular, if r = +1 or −1, then there is perfect correlation with all the points lying on the line (this is most unusual, in practice); if r = 0, then there is no linear correlation (although there may be a non-linear relationship). The closer r is to the extremes, the greater the degree of linear association (Fig. 26.1).
- It is dimensionless, i.e. it has no units of measurement.
- Its value is valid only within the range of values of x and y in the sample. You cannot infer that it will have the same value when considering values of x or y that are more extreme than the sample values.
- x and y can be interchanged without affecting the value of r.
- A correlation between x and y does not necessarily imply a 'cause and effect' relationship.
- r² represents the proportion of the variability of y that can be attributed to its linear relationship with x (Topic 28).

Fig. 26.1 Five diagrams indicating values of r in different situations.

When not to calculate r
It may be misleading to calculate r when:
- there is a non-linear relationship between the two variables (Fig. 26.2a), e.g. a quadratic relationship (Topic 30);
- the data include more than one observation on each individual;
- in the presence of outliers (Fig. 26.2b);
- the data comprise subgroups of individuals for which the mean levels of the observations on at least one of the variables are different (Fig. 26.2c).

Hypothesis test for the Pearson correlation coefficient
We want to know if there is any linear correlation between two numerical variables. Our sample consists of n independent pairs of values of x and y. We assume that at least one of the two variables is Normally distributed.

1 Define the null and alternative hypotheses under study
H0: ρ = 0
H1: ρ ≠ 0
2 Collect relevant data from a sample of individuals
3 Calculate the value of the test statistic specific to H0
Calculate r.
If n ≤ 200, r is the test statistic.
If n > 200, calculate T = r √[(n − 2)/(1 − r²)], which follows a t-distribution with n − 2 degrees of freedom.
4 Compare the value of the test statistic to values from a known probability distribution
If n ≤ 150, refer r to Appendix A10.
If n > 150, refer T to Appendix A2.
5 Interpret the P-value and results
Calculate a confidence interval for ρ. Provided both variables are approximately Normally distributed, the approximate 95% confidence interval for ρ is:
from (e^(2z1) − 1)/(e^(2z1) + 1) to (e^(2z2) − 1)/(e^(2z2) + 1)
where z1 = z − 1.96/√(n − 3), z2 = z + 1.96/√(n − 3), and z = (1/2) ln[(1 + r)/(1 − r)].
Note that, if the sample size is large, H0 may be rejected even if r is quite close to zero. Alternatively, even if r is large, H0 may not be rejected if the sample size is small. For this reason, it is particularly helpful to calculate r², the proportion of the total variance explained by the relationship. For example, if r = 0.40 then P < 0.05 for a sample size of 25, but the relationship is only explaining 16% (= 0.40² × 100) of the variability of one variable.

Fig. 26.2 Diagrams showing when it is inappropriate to calculate the correlation coefficient. (a) Relationship not linear, r = 0. (b) In the presence of outlier(s). (c) Data comprise subgroups.

Spearman's rank correlation coefficient
We calculate Spearman's rank correlation coefficient, the non-parametric equivalent to Pearson's correlation coefficient, if one or more of the following points is true:
- at least one of the variables, x or y, is measured on an ordinal scale;
- neither x nor y is Normally distributed;
- the sample size is small;
- we require a measure of the association between two variables when their relationship is non-linear.

Calculation
To estimate the population value of Spearman's rank correlation coefficient, ρs, by its sample value, rs:
1 Arrange the values of x in increasing order, starting with the smallest value, and assign successive ranks (the numbers 1, 2, 3, . . ., n) to them. Tied values receive the average of the ranks these values would have received had there been no ties.
2 Assign ranks to the values of y in a similar manner.
3 rs is the Pearson correlation coefficient between the ranks of x and y.

Properties and hypothesis tests
These are the same as for Pearson's correlation coefficient, replacing r by rs, except that:
- rs provides a measure of association (not necessarily linear) between x and y;
- when testing the null hypothesis that ρs = 0, refer to Appendix A11 if the sample size is less than or equal to 10;
- we do not calculate rs² (it does not represent the proportion of the total variation in one variable that can be attributed to its relationship with the other).
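In practice both coefficients and their P-values come straight from software. A minimal Python sketch with hypothetical (x, y) pairs:

from scipy import stats

# Hypothetical paired observations on n individuals
x = [150, 152, 155, 158, 160, 163, 165, 168, 170, 173]
y = [92, 95, 93, 98, 97, 101, 100, 104, 103, 107]

r, p_r = stats.pearsonr(x, y)        # Pearson correlation and two-sided P-value
rs, p_s = stats.spearmanr(x, y)      # Spearman rank correlation and P-value

print(f"Pearson r = {r:.2f} (P = {p_r:.4f}), r-squared = {r*r:.2f}")
print(f"Spearman rs = {rs:.2f} (P = {p_s:.4f})")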

Example
As part of a study to investigate the factors associated with changes in blood pressure in children, information was collected on demographic and lifestyle factors, and clinical and anthropometric measures, in 4245 children aged from 5 to 7 years. The relationship between height (cm) and systolic blood pressure (mmHg) in a sample of 100 of these children is shown in the scatter diagram (Fig. 28.1); there is a tendency for taller children in the sample to have higher blood pressures. Pearson's correlation coefficient between these two variables was investigated. Appendix C contains a computer output from the analysis.

1 H0: the population value of the Pearson correlation coefficient, ρ, is zero
H1: the population value of the Pearson correlation coefficient is not zero.
2 We can show (Fig. 34.1) that the sample values of both height and systolic blood pressure are approximately Normally distributed.
3 We calculate r as 0.33. This is the test statistic since n < 200.
4 We refer r to Appendix A10 with a sample size of 100: P < 0.001.
5 There is strong evidence to reject the null hypothesis; we conclude that there is a linear relationship between systolic blood pressure and height in the population of such children. However, r² = 0.33 × 0.33 = 0.11. Therefore, despite the highly significant result, the relationship between height and systolic blood pressure explains only a small percentage, 11%, of the variation in systolic blood pressure.

In order to determine the 95% confidence interval for the true correlation coefficient, we calculate:
z = 0.5 ln(1.33/0.67) = 0.3428
z1 = 0.3428 - 1.96/√97 = 0.1438
z2 = 0.3428 + 1.96/√97 = 0.5418
Thus the confidence interval ranges from (e^(2 × 0.1438) - 1)/(e^(2 × 0.1438) + 1) to (e^(2 × 0.5418) - 1)/(e^(2 × 0.5418) + 1), i.e. from 0.14 to 0.49. We are thus 95% certain that ρ lies between 0.14 and 0.49.

As we might expect, given that each variable is Normally distributed, Spearman's rank correlation coefficient between these variables gave a comparable estimate of 0.32. To test H0: ρs = 0, we refer this value to Appendix A10 and again find P < 0.001.

Data kindly provided by Ms O. Papacosta and Dr P. Whincup, Department of Primary Care and Population Sciences, Royal Free and University College Medical School, Royal Free Campus, London, UK.
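The confidence-interval arithmetic quoted in this example (r = 0.33, n = 100) can be checked in a few lines. This Python fragment simply re-runs the calculation above and is not part of the original text.

```python
import numpy as np

r, n = 0.33, 100
z = 0.5 * np.log((1 + r) / (1 - r))        # 0.3428
z1 = z - 1.96 / np.sqrt(n - 3)             # 0.1438
z2 = z + 1.96 / np.sqrt(n - 3)             # 0.5418
lower = (np.exp(2 * z1) - 1) / (np.exp(2 * z1) + 1)
upper = (np.exp(2 * z2) - 1) / (np.exp(2 * z2) + 1)
print(round(lower, 2), round(upper, 2))    # 0.14 and 0.49, as in the example
```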

27 The theory of linear regression

What is linear regression?
To investigate the relationship between two continuous variables, x and y, we measure the values of x and y on each of the n individuals in our sample. We plot the points on a scatter diagram (Topics 4 and 26), and say that we have a linear relationship if the data approximate a straight line. If we believe y is dependent on x, with a change in y being attributed to a change in x, rather than the other way round, we can determine the linear regression line (the regression of y on x) that best describes the straight line relationship between the two variables.

The regression line
The mathematical equation which estimates the simple linear regression line is:

Y = a + bx

• x is called the independent, predictor or explanatory variable;
• for a given value of x, Y is the value of y (called the dependent, outcome or response variable) which lies on the estimated line. It is the value we expect for y (i.e. its average) if we know the value of x, and is called the fitted value of y;
• a is the intercept of the estimated line; it is the value of Y when x = 0 (Fig. 27.1);
• b is the slope or gradient of the estimated line; it represents the amount by which Y increases on average if we increase x by one unit (Fig. 27.1).
a and b are called the regression coefficients of the estimated line, although this term is often reserved only for b. We show how to evaluate these coefficients in Topic 28. Simple linear regression can be extended to include more than one explanatory variable; in this case, it is known as multiple linear regression (Topic 29).

Method of least squares
We perform regression analysis using a sample of observations. a and b are the sample estimates of the true parameters, α and β, which define the linear regression line in the population. a and b are determined by the method of least squares in such a way that the 'fit' of the line Y = a + bx to the points in the scatter diagram is optimal. We assess this by considering the residuals (the vertical distance of each point from the line, i.e. residual = observed y - fitted Y, Fig. 27.2). The line of best fit is chosen so that the sum of the squared residuals is a minimum.

Assumptions
1 There is a linear relationship between x and y.
2 The observations in the sample are independent. The observations are independent if there is no more than one pair of observations on each individual.
3 For each value of x, there is a distribution of values of y in the population; this distribution is Normal. The mean of this distribution of y values lies on the true regression line (Fig. 27.3).
4 The variability of the distribution of the y values in the population is the same for all values of x, i.e. the variance, σ², is constant (Fig. 27.3).
5 The x variable can be measured without error. Note that we do not make any assumptions about the distribution of the x variable.

Many of the assumptions which underlie regression analysis relate to the distribution of the y population for a specified value of x, but they may be framed in terms of the residuals. It is easier to check the assumptions (Topic 28) by studying the residuals than the values of y.

Fig. 27.1 Linear regression line showing the intercept, a, and the slope, b (the increase in Y for a unit increase in x).
Fig. 27.2 Linear regression line showing the residual (vertical dotted line) for each point.
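The method of least squares described above chooses a and b so that the sum of the squared residuals is as small as possible. The short Python sketch below illustrates this with invented data (the numbers are not taken from any study in this book): the least squares line has a smaller residual sum of squares than nearby alternative lines.

```python
import numpy as np

# Illustrative data only
x = np.array([3.1, 4.0, 5.2, 6.3, 7.1, 8.4])
y = np.array([9.8, 11.0, 13.1, 14.2, 16.3, 18.0])

# Least squares estimates of the slope (b) and intercept (a)
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()

def rss(intercept, slope):
    """Sum of squared residuals (observed y minus fitted Y) for a candidate line."""
    return np.sum((y - (intercept + slope * x)) ** 2)

# The least squares line minimises the residual sum of squares
print(rss(a, b))            # smallest value
print(rss(a + 0.5, b))      # larger
print(rss(a, b * 1.1))      # larger
```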

Fig. 27.3 Illustration of assumptions made in linear regression.

Analysis of variance table
Description
Usually the computer output in a regression analysis contains an analysis of variance table. In analysis of variance (Topic 22), the total variation of the variable of interest, in this case 'y', is partitioned into its component parts. Because of the linear relationship of y on x, we expect y to vary as x varies; we call this the variation which is due to or explained by the regression. The remaining variability is called the residual error or unexplained variation. The residual variation should be as small as possible; if so, most of the variation in y will be explained by the regression, and the points will lie close to the line; i.e. the line is a good fit.

Purposes
The analysis of variance table enables us to do the following.
1 Assess how well the line fits the data points. From the information provided in the table, we can calculate the proportion of the total variation in y that is explained by the regression. This proportion, usually expressed as a percentage and denoted by R² (in simple linear regression it is r², the square of the correlation coefficient; Topic 26), allows us to assess subjectively the goodness-of-fit of the regression equation.
2 Test the null hypothesis that the true slope of the line, β, is zero; a significant result indicates that there is evidence of a linear relationship between x and y.
3 Obtain an estimate of the residual variance. We need this for testing hypotheses about the slope or the intercept, and for calculating confidence intervals for these parameters and for predicted values of y.
We provide details of the more common procedures in Topic 28.

Regression to the mean
The statistical use of the word 'regression' derives from a phenomenon known as regression to the mean, attributed to Sir Francis Galton in 1889. He demonstrated that although tall fathers tend to have tall sons, the average height of the sons is less than that of their tall fathers. The average height of the sons has 'regressed' or 'gone back' towards the mean height of all the fathers in the population. So, on average, tall fathers have shorter (but still tall) sons and short fathers have taller (but still short) sons.

We observe regression to the mean in screening and in clinical trials, when a subgroup of patients may be selected for treatment because their levels of a certain variable, say cholesterol, are extremely high (or low). If the measurement is repeated some time later, the average value for the second reading for the subgroup is usually less than that of the first reading, tending towards (i.e. regressing to) the average of the age- and sex-matched population, irrespective of any treatment they may have received. Patients recruited into a clinical trial on the basis of a high cholesterol level on their first examination are thus likely to show a drop in cholesterol levels on average at their second examination, even if they remain untreated during this period.
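The partition of the total variation described in the analysis of variance table can be written out explicitly. The Python sketch below is only an illustration of that decomposition (R² and the F-ratio with 1 and n - 2 degrees of freedom); it assumes simple linear regression of y on x and uses no data from the book.

```python
import numpy as np
from scipy import stats

def regression_anova(x, y):
    """Partition the total variation in y into explained and residual parts."""
    n = len(y)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    fitted = a + b * x

    ss_total = np.sum((y - y.mean()) ** 2)       # total variation in y
    ss_residual = np.sum((y - fitted) ** 2)      # unexplained (residual) variation
    ss_regression = ss_total - ss_residual       # variation explained by the regression

    r_squared = ss_regression / ss_total         # proportion of variation explained
    f_ratio = (ss_regression / 1) / (ss_residual / (n - 2))
    p_value = stats.f.sf(f_ratio, 1, n - 2)      # test of H0: true slope is zero
    return r_squared, f_ratio, p_value
```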

28 Performing a linear regression analysis

The linear regression line
After selecting a sample of size n from our population, and drawing a scatter diagram to confirm that the data approximate a straight line, we estimate the regression of y on x as:

Y = a + bx

where Y is the fitted or predicted value of y, a is the intercept, and b is the slope that represents the average change in Y for a unit change in x (Topic 27).

Drawing the line
To draw the line Y = a + bx on the scatter diagram, we choose three values of x, x1, x2 and x3, along its range. We substitute x1 in the equation to obtain the corresponding value of Y, namely Y1 = a + bx1; Y1 is our fitted value for x1, which corresponds to the observed value, y1. We repeat the procedure for x2 and x3 to obtain the corresponding values of Y2 and Y3. We plot these points on the scatter diagram and join them to produce a straight line.

Checking the assumptions
For each observed value of x, the residual is the observed y minus the corresponding fitted Y. Each residual may be either positive or negative. We can use the residuals to check the following assumptions underlying linear regression.
1 There is a linear relationship between x and y: either plot y against x (the data should approximate a straight line), or plot the residuals against x (we should observe a random scatter of points rather than any systematic pattern).
2 The observations are independent: the observations are independent if there is no more than one pair of observations on each individual.
3 The residuals are Normally distributed with a mean of zero: draw a histogram, stem-and-leaf plot, box-and-whisker plot (Topic 4) or Normal plot (Topic 32) of the residuals and 'eyeball' the result.
4 The residuals have the same variability (constant variance) for all the fitted values of y: plot the residuals against the fitted values, Y, of y; we should observe a random scatter of points. If the scatter of residuals progressively increases or decreases as Y increases, then this assumption is not satisfied.
5 The x variable can be measured without error.

Failure to satisfy the assumptions
If the linearity, Normality and/or constant variance assumptions are in doubt, we may be able to transform x or y (Topic 9), and calculate a new regression line for which these assumptions are satisfied. It is not always possible to find a satisfactory transformation. The linearity and independence assumptions are the most important. If you are dubious about the Normality and/or constant variance assumptions, you may proceed, but the P-values in your hypothesis tests, and the estimates of the standard errors, may be affected. Note that the x variable is rarely measured without any error; provided the error is small, this is usually acceptable because the effect on the conclusions is minimal.

Outliers and influential points
An outlier is a value that is inconsistent with most of the values in the data set (Topic 3). We can often detect outliers by looking at the scatter diagram or the residual plots.
An influential point has the effect of substantially altering the estimates of the slope and the intercept of the regression line when it is included in the analysis. If formal methods of detection are not available, you may have to rely on intuition; you should recalculate the regression line without the point and note the effect.
Do not discard outliers or influential points routinely because their omission may affect your conclusions. Always investigate the reasons for their presence and report them.

Assessing goodness-of-fit
We can judge how well the line fits the data by calculating R² (usually expressed as a percentage), which is equal to the square of the correlation coefficient (Topics 26 and 27). This represents the proportion or percentage of the variability of y that can be explained by its relationship with x. Its complement, (100 - R²), represents the percentage of the variation in y that is unexplained by the relationship. There is no formal test to assess R²; we have to rely on subjective judgement to evaluate the fit of the regression line.

Investigating the slope
If the slope of the line is zero, there is no linear relationship between x and y: changing x has no effect on y. There are two approaches, with identical results, to testing the null hypothesis that the true slope, β, is zero.
• Examine the F-ratio (equal to the ratio of the 'explained' to the 'unexplained' mean squares) in the analysis of variance table. It follows the F-distribution and has (1, n - 2) degrees of freedom in the numerator and denominator, respectively.
• Calculate the test statistic = b/SE(b), which follows the t-distribution with n - 2 degrees of freedom, where SE(b) is the standard error of b.

In either case, a significant result, usually if P < 0.05, leads to rejection of the null hypothesis.

We calculate the 95% confidence interval for β as b ± t0.05 SE(b), where t0.05 is the percentage point of the t-distribution with n - 2 degrees of freedom which gives a two-tailed probability of 0.05. It is the interval that contains the true slope with 95% certainty. For large samples, say n > 45, we can approximate t0.05 by 1.96.

Regression analysis is rarely performed by hand; computer output from most statistical packages will provide all of this information.

Using the line for prediction
We can use the regression line for predicting values of y for values of x within the observed range (never extrapolate beyond these limits). We predict the mean value of y for individuals who have a certain value of x by substituting that value of x into the equation of the line. So, if x = x0, we predict y as Y0 = a + bx0. We use this predicted value, and its standard error, to evaluate the confidence interval for the true mean value of y in the population. Repeating this procedure for various values of x allows us to construct confidence limits for the line. This is a band or region that contains the true line with, say, 95% certainty. Similarly, we can calculate a wider region within which we expect most (usually 95%) of the observations to lie.

Useful formulae for hand calculations
x̄ = Σx/n and ȳ = Σy/n
a = ȳ - b x̄
b = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
s²res = Σ(y - Y)² / (n - 2), the estimated residual variance

Example
The relationship between height (measured in cm) and systolic blood pressure (SBP, measured in mmHg) in the children described in Topic 26 is shown in Fig. 28.1. We performed a simple linear regression analysis of systolic blood pressure on height. Assumptions underlying this analysis are verified in Figs 28.2 to 28.4. A typical computer output is shown in Appendix C. There is a significant linear relationship between height and systolic blood pressure, as can be seen by the significant F-ratio in the analysis of variance table in Appendix C (F = 12.03 with 1 and 98 degrees of freedom in the numerator and denominator, respectively, P = 0.0008). The R² of the model is 10.9%. Only approximately a tenth of the variability in the

systolic blood pressure can thus be explained by the model; that is, by differences in the heights of the children.

The computer output provides the following information:

Variable    Parameter estimate   Standard error   Test statistic   P-value
Intercept   46.2817              16.7845          2.7574           0.0070
Height      0.48                 0.14             3.4684           0.0008

The parameter estimate for 'Intercept' corresponds to a, and that for 'Height' corresponds to b (the slope of the regression line). So, the equation of the estimated regression line is:

SBP = 46.28 + 0.48 × height

In this example, the intercept is of no interest in its own right (it relates to the predicted blood pressure for a child who has a height of zero centimetres, clearly out of the range of values seen in the study). However, we can interpret the slope coefficient: in these children, systolic blood pressure is predicted to increase by 0.48 mmHg, on average, for each centimetre increase in height.

P = 0.0008 for the hypothesis test for height (i.e. H0: true slope equals zero) is identical to that obtained from the analysis of variance table in Appendix C, as expected.

A 95% confidence interval can be calculated for the true slope. This is given by b ± t0.05 SE(b). Therefore, the 95% confidence interval for the slope ranges from 0.21 to 0.75 mmHg per cm increase in height. This confidence interval does not include zero, confirming the finding that the slope is significantly different from zero.

We can use the regression equation to predict the systolic blood pressure we expect a child of a given height to have. For example, a child who is 115 cm tall has a predicted systolic blood pressure of 46.28 + (0.48 × 115) = 101.48 mmHg; a child who is 130 cm tall has a predicted systolic blood pressure of 46.28 + (0.48 × 130) = 108.68 mmHg.

Fig. 28.4 There is no tendency for the residuals to increase or decrease systematically with the fitted values. Hence the constant variance assumption is satisfied.
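The hand-calculation formulae and the slope test described in this topic can be collected into a single routine. The Python sketch below is an illustration only; the function name and arguments are not from the original text, and any prediction it returns is valid only within the observed range of x.

```python
import numpy as np
from scipy import stats

def simple_linear_regression(x, y, x0=None, alpha=0.05):
    """Slope, intercept, t-test and CI for the slope; optional prediction at x0."""
    n = len(y)
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    a = y.mean() - b * x.mean()
    s2_res = np.sum((y - (a + b * x)) ** 2) / (n - 2)     # estimated residual variance
    se_b = np.sqrt(s2_res / np.sum((x - x.mean()) ** 2))  # standard error of the slope

    t_stat = b / se_b                                     # test of H0: beta = 0
    p_value = 2 * stats.t.sf(abs(t_stat), df=n - 2)
    t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
    ci_b = (b - t_crit * se_b, b + t_crit * se_b)         # b +/- t0.05 * SE(b)

    prediction = a + b * x0 if x0 is not None else None   # do not extrapolate beyond the data
    return a, b, t_stat, p_value, ci_b, prediction
```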

29 Multiple linear regression

What is it?
We may be interested in the effect of several explanatory variables, x1, x2, . . . , xk, on a response variable, y. If we believe that these x's may be inter-related, we should not look, in isolation, at the effect on y of changing the value of a single x, but should simultaneously take into account the values of the other x's. For example, as there is a strong relationship between a child's height and weight, we may want to know whether the relationship between height and systolic blood pressure (Topic 28) is changed when we take the child's weight into account. Multiple linear regression allows us to investigate the joint effect of these explanatory variables on y. Note that, although the explanatory variables are sometimes called independent variables, this is a misnomer because they may be related.

We take a sample of n individuals, and measure the value of each of the variables on every individual. The multiple linear regression equation which estimates the relationships in the population is:

Y = a + b1x1 + b2x2 + . . . + bkxk

• xi is the ith explanatory variable or covariate (i = 1, 2, 3, . . . , k);
• Y is the predicted, expected or fitted value of y, which corresponds to a particular set of values of x1, x2, . . . , xk;
• a is the constant term, sometimes called the intercept; it is the value of y when all the x's are zero;
• b1, b2, . . . , bk are the estimated partial regression coefficients; b1 represents the amount by which y increases on average if we increase x1 by one unit but keep all the other x's constant (i.e. adjust for them). If there is a relationship between x1 and the other x's, b1 differs from the estimate of the regression coefficient obtained by regressing y on only x1, because the latter approach does not adjust for the other variables. b1 represents the effect of x1 on y that is independent of the other x's.
Invariably, you will perform a multiple regression analysis on the computer, and so we omit the formulae for these estimated parameters.

Why do it?
To be able to:
• determine the extent to which each of the explanatory variables is linearly related to the dependent variable, after adjusting for the other variables;
• predict the value of the dependent variable from the explanatory variables.

Assumptions
The assumptions in multiple regression are the same (if we replace 'x' by 'each of the x's') as those in simple linear regression (Topic 27), and are checked in the same way. Failure to satisfy the linearity or independence assumptions is particularly important. We can transform (Topic 9) the y variable and/or some or all of the x variables if the assumptions are in doubt, and then repeat the analysis (including checking the assumptions) on the transformed data.

Categorical explanatory variables
We can perform a multiple regression analysis using categorical explanatory variables. In particular, if we have a binary variable, x1 (e.g. male = 0, female = 1), and we increase x1 by one unit, we are 'changing' from males to females. b1 thus represents the difference in the estimated mean values of y between females and males, after adjusting for the other x's.

If we have a nominal explanatory variable that has more than two categories of response (Topic 1), we have to create a number of new (dummy) binary variables1. Some computer packages will do this automatically. However, if we have an ordinal explanatory variable and its three or more categories can be assigned values on a meaningful linear scale (e.g. social classes 1-5), then we can use these values directly in the multiple regression equation.

Analysis of covariance
An extension of analysis of variance (ANOVA, Topic 22) is the analysis of covariance, in which we compare the response of interest between groups of individuals (e.g. two or more treatment groups) when other variables measured on each individual are taken into account. Such data can be analysed using multiple regression techniques by creating one or more binary variables to differentiate between the groups. So, if we wish to compare the average values of y in two treatment groups, while controlling for the effect of variables, x2, . . . , xk (e.g. age, weight, . . .), we create a binary variable, x1, to represent 'treatment' (e.g. x1 = 0 for treatment A, x1 = 1 for treatment B). In the multiple regression equation, b1 is the estimated difference in the mean responses on y between treatments B and A, adjusting for the other x's.

Choice of explanatory variables
As a rule of thumb, we should not perform a multiple regression analysis if the number of variables is greater

1 Armitage, P. & Berry, G. (1994) Statistical Methods in Medical Research, 3rd edn. Blackwell Scientific Publications, Oxford.

than the number of individuals divided by 10. Most computer packages have automatic procedures for selecting variables, e.g. stepwise selection (Topic 31). These are particularly useful when many of the explanatory variables are related. A particular problem arises when collinearity is present, i.e. when pairs of explanatory variables are extremely highly correlated, and the standard errors of their partial regression coefficients are very large. Then we may find that a group of very highly correlated variables explains much of the variability in the response variable (as judged by R²), even though each partial regression coefficient in the group is non-significant. In this situation, caution is suggested when interpreting the results.

Analysis
Most computer output contains the following items.
1 An assessment of goodness-of-fit
The adjusted R² represents the proportion of the variability of y which can be explained by its relationship with the x's. R² is adjusted so that models with different numbers of explanatory variables can be compared. If it has a low value (judged subjectively), the model is a poor fit. Goodness-of-fit is particularly important when we use the multiple regression equation for prediction.
2 The F-test in the ANOVA table
This tests the null hypothesis that all the partial regression coefficients in the population, β1, β2, . . . , βk, are zero. A significant result indicates that there is a linear relationship between y and at least one of the x's.
3 The t-test of each partial regression coefficient, βi (i = 1, 2, . . . , k)
Each t-test relates to one explanatory variable, and is relevant if we want to determine whether that explanatory variable affects the response variable, while controlling for the effects of the other covariates. To test H0: βi = 0, we calculate the test statistic = bi/SE(bi), which follows the t-distribution with (n - number of explanatory variables - 1) degrees of freedom. Computer output includes the values of each bi, SE(bi) and the related test statistic with its P-value. Sometimes the 95% confidence interval for βi is included; if not, it can be calculated as bi ± t0.05 SE(bi).

Example
In Topic 28, we studied the relationship between systolic blood pressure and height in 100 children. It is known that height and weight are positively correlated. We therefore performed a multiple regression analysis to investigate the effects of height (cm), weight (kg) and sex (0 = boy, 1 = girl) on systolic blood pressure (mmHg) in these children. Assumptions underlying this analysis are verified in Figs 29.1 to 29.4.

Fig. 29.1 There is no systematic pattern to the residuals when plotted against weight. (Note that, similarly to Fig. 28.2, a plot of the residuals from this model against height also shows no systematic pattern.)
Fig. 29.2 The distribution of the residuals is approximately Normal and the variance is slightly less than that from the simple regression model (Topic 28), reflecting the improved fit of the multiple regression model over the simple model.

A typical output from a computer analysis of these data is contained in Appendix C. The analysis of variance table indicates that at least one of the explanatory variables is related to systolic blood pressure (F = 14.95 with 3 and 96 degrees of freedom in the numerator and denominator, respectively, P = 0.0001). The adjusted R² value of 0.297 indicates that 29.7% of the variability in systolic blood pressure can be explained by the model, that is, by differences in the height, weight and sex of the children. Thus this provides a much better fit to the data than the simple

linear regression in Topic 28, in which R² = 0.11. Typical computer output contains the following information about the explanatory variables in the model:

Variable   Parameter estimate   P-value
Intercept  79.44                0.0001
Height     -0.03                0.86
Weight     1.18                 <0.0001
Sex        4.23                 0.01

The multiple regression equation is given by:

SBP = 79.44 - (0.03 × height) + (1.18 × weight) + (4.23 × sex)

The relationship between weight and systolic blood pressure is highly significant (P < 0.0001), with a one kilogram increase in weight being associated with an average increase of 1.18 mmHg in systolic blood pressure, after adjusting for height and sex. However, after adjusting for the weight and sex of the child, the relationship between height and systolic blood pressure becomes non-significant (P = 0.86). This suggests that the significant relationship between height and systolic blood pressure in the simple regression analysis reflects the fact that taller children tend to be heavier than shorter children. There is a significant relationship (P = 0.01) between sex and systolic blood pressure: systolic blood pressure in girls tends to be 4.23 mmHg higher, on average, than that of boys, even after taking account of possible differences in height and weight. Hence, both weight and sex are independent predictors of a child's systolic blood pressure.

We can calculate the systolic blood pressures we would expect for children of given heights and weights. If the first child mentioned in Topic 28, who is 115 cm tall, is a girl and weighs 37 kg, she now has a predicted systolic blood pressure of 119.65 mmHg (higher than the 101.48 mmHg predicted in Topic 28); if the second child, who is 130 cm tall, is a boy and weighs 30 kg, he now has a predicted systolic blood pressure of 106.71 mmHg (lower than the 108.68 mmHg predicted in Topic 28).

Fig. 29.3 As with the univariate model, there is no tendency for the residuals to increase or decrease systematically with the fitted values. Hence the constant variance assumption is satisfied.
Fig. 29.4 The distribution of the residuals is similar in boys and girls, suggesting that the model fits equally well in the two groups.
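A multiple regression of this kind is normally run on the computer. The Python sketch below shows one possible way of doing so, assuming the statsmodels package is available; the data are invented purely for illustration and are not the study data used in this example.

```python
import numpy as np
import statsmodels.api as sm

# Invented data for illustration only: height (cm), weight (kg),
# sex (0 = boy, 1 = girl) and systolic blood pressure (mmHg)
rng = np.random.default_rng(0)
n = 100
height = rng.normal(115, 5, n)
weight = 0.4 * height - 25 + rng.normal(0, 2, n)     # height and weight correlated
sex = rng.integers(0, 2, n)
sbp = 70 + 0.05 * height + 1.0 * weight + 4.0 * sex + rng.normal(0, 8, n)

X = sm.add_constant(np.column_stack([height, weight, sex]))  # adds the intercept term a
model = sm.OLS(sbp, X).fit()

print(model.params)        # a, then the partial regression coefficients b1, b2, b3
print(model.bse)           # their standard errors SE(bi)
print(model.pvalues)       # t-test of each partial coefficient (H0: beta_i = 0)
print(model.conf_int())    # 95% confidence intervals, bi +/- t0.05 x SE(bi)
print(model.rsquared_adj)  # adjusted R-squared
```

For a binary covariate such as sex, the fitted coefficient is the adjusted difference in mean response between the two groups, mirroring the interpretation given in the example above.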

30 Polynomial and logistic regression

Polynomial regression
When we plot y against x, we may find that a linear relationship is not appropriate, and that a polynomial (e.g. quadratic, cubic) relationship is preferable. We can modify a multiple regression equation (Topic 29) to accommodate polynomial regression. We just introduce terms into the equation that represent the relevant higher orders of x. So, for example, if we have a cubic relationship, our estimated regression equation is Y = a + b1x + b2x² + b3x³. Usually, the polynomial regression equation will include the highest power of x deemed appropriate (e.g. x³ for a cubic relationship) and all lower powers of x (i.e. x and x² for a cubic relationship). We fit this model, and proceed with the analysis in exactly the same way as if the quadratic and cubic terms represented different variables (x2 and x3, say) in a multiple regression analysis. For example, we may fit a cubic model that comprises the explanatory 'variables' height, height² and height³.

Logistic regression
Reasoning
Logistic regression is very similar to linear regression; we use it when we have a binary dependent variable (e.g. the presence/absence of a symptom, or an individual who does/does not have a disease) and a number of explanatory variables. We want to know which explanatory variables influence the outcome, and can then use the equation to predict the outcome category into which an individual will fall from values of his/her explanatory variables.

We start by creating a binary variable to represent the two outcomes of the dependent variable (e.g. y = 1 designates 'has disease', y = 0 designates 'does not have disease'). As this variable is binary, the assumptions underlying linear regression are not met. Furthermore, we cannot interpret predicted values that are not equal to zero or one. So, instead, we predict the probability, p, that an individual will be classified into a particular category of outcome, say 'has disease'. To overcome mathematical difficulties, we use the logistic or logit transformation (Topic 9) of p in the logistic equation. The logit of this probability is the natural logarithm (i.e. to base e) of the odds of 'disease', i.e.

logit(p) = ln[p/(1 - p)]

The equation and its interpretation
An iterative process, rather than ordinary least squares regression (so we cannot use linear regression software), produces, from the sample data, an estimated logistic regression equation of the form:

logit(P) = a + b1x1 + b2x2 + . . . + bkxk

• xi is the ith explanatory variable (i = 1, 2, 3, . . . , k);
• logit(P) is the predicted value of logit(p);
• a is the constant term;
• b1, b2, . . . , bk are the estimated logistic regression coefficients.
We interpret the exponential of a particular coefficient, for example e^b1, as an odds ratio (Topic 16). It is the chance of having the disease compared with that of not having the disease if we increase x1 by one unit while adjusting for all other x's in the equation. The odds ratio is an estimate of relative risk. If the relative risk is equal to one (unity), then the 'risks' of having and not having the disease are the same when x1 increases by one unit. A value of the relative risk above one indicates an increased risk of having the disease, and a value below one indicates a decreased risk of having the disease, as x1 increases by one unit.

Computer output
For each explanatory variable
Comprehensive computer output for a logistic regression analysis includes, for each explanatory variable, the estimated logistic regression coefficient with standard error, the estimated odds ratio (i.e. the exponential of the coefficient) with a confidence interval for its true value, and a Wald test statistic (testing the null hypothesis that the relative risk of 'disease' associated with this variable is unity) and associated P-value. We use this information to determine whether each variable is related to the outcome of interest (e.g. disease), and to quantify the extent to which this is so. Automatic selection procedures (Topic 31) can be used, as in multiple linear regression, to select the best combination of explanatory variables.

To assess the adequacy of the model
Usually, interest is centred on examining the explanatory variables and their effect on the outcome. This information is routinely available in all advanced statistical computer packages. However, there are inconsistencies between the packages in the way in which the adequacy of the model is assessed, and in the way it is described. Your computer output may contain the following (in one guise or another).
• A quantity called -2 log likelihood: it has an approximately Chi-squared distribution, and indicates how poorly the model fits with all the explanatory variables in the model (a significant result indicates poor prediction).

• The model Chi-square or the Chi-square for covariates: this tests the null hypothesis that all the regression coefficients in the model are zero. A significant result suggests that at least one covariate is significantly associated with the dependent variable. It can be modified to compare models with differing numbers of covariates.
• The percentages of individuals correctly predicted as 'diseased' or 'disease-free' by the model. This information may be in a classification table.
• A histogram: this has the predicted probabilities along the horizontal axis, and uses symbols (such as 1 and 0) to designate the group ('diseased' or 'disease-free') to which an individual belongs. A good model will separate the symbols into two groups which show little or no overlap.
• Indices of predictive efficiency: these are not routinely available in every computer package. Our advice is to refer to more advanced texts for further information1.

Example
In a study of the relationship between human herpesvirus type 8 (HHV-8) infection (described in Topic 23) and sexual behaviour, 271 homo/bisexual men were asked questions relating to their past history of a number of sexually transmitted diseases (gonorrhoea, syphilis, herpes simplex type 2 [HSV-2] and HIV). In Topic 24 we showed that men who had a history of gonorrhoea had a higher seroprevalence of HHV-8 than those without a previous history of gonorrhoea. A multiple logistic regression analysis was performed to investigate whether this effect was simply a reflection of the relationships between HHV-8 and the other infections and/or the men's age. The explanatory variables were the presence of each of the four infections, each coded as '0' if the patient had no history of the particular infection or '1' if he had a history of that infection, and the patient's age in years.

A typical computer output is displayed in Appendix C. It shows that the Chi-square for covariates equals 24.598 on 5 degrees of freedom (P = 0.0002), indicating that at least one of the covariates is significantly associated with HHV-8 serostatus. The table below summarises the information about each variable in the model.

These results indicate that HSV-2 positivity (P = 0.04) and HIV status (P = 0.007) are independently associated with HHV-8 infection; individuals who are HSV-2 seropositive have 2.21 times (= exp[0.7910]) the risk of being HHV-8 seropositive as those who are HSV-2 seronegative, after adjusting for the other infections. In other words, the risk of HHV-8 seropositivity in these individuals is increased by 121%. The upper limit of the confidence interval for this odds ratio shows that this increased risk could be as much as 371%. HSV-2 infection is a well-documented marker of sexual activity. Thus, rather than HSV-2 being a cause of HHV-8 infection, the association may be a reflection of the sexual activity of the individual.

In addition, there is a tendency for a history of syphilis to be associated with HHV-8 serostatus. Although this is marginally non-significant (P = 0.09), we should note that the confidence interval does include values for the odds ratio as high as 13.28. In contrast, there is no indication of an independent relationship between a history of gonorrhoea and HHV-8 seropositivity. Thus gonorrhoea presumably appeared, by the Chi-squared test (Topic 24), to be associated with HHV-8 serostatus because of the fact that many men who had a history of one of the other sexually transmitted diseases in the past also had a history of gonorrhoea. There is no significant relationship between HHV-8 seropositivity and age; the odds ratio indicates that the risk of HHV-8 seropositivity increases by 0.6% for each additional year of age.

Variable           Parameter estimate   Standard error   Wald Chi-square   P-value   Odds ratio   95% CI for odds ratio
Intercept          -2.2242              0.6512           11.6670           0.0006    -            -
Gonorrhoea         0.5093               0.4363           1.3626            0.2431    1.664        (0.71-3.91)
Syphilis           1.1924               0.7111           2.8122            0.0935    3.295        (0.82-13.28)
HSV-2 positivity   0.7910               0.3871           4.1753            0.0410    2.206        (1.03-4.71)
HIV                1.6357               0.6028           7.3625            0.0067    5.133        (1.57-16.73)
Age                0.0062               0.0201           0.0911            0.7628    1.006        (0.97-1.05)

1 Menard, S. (1995) Applied logistic regression analysis. In: Sage University Paper Series on Quantitative Applications in the Social Sciences, Series no. 07-106. Sage University Press, Thousand Oaks, California.
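An analysis of the kind shown in this example is fitted by an iterative procedure on the computer. The Python sketch below assumes the statsmodels package and uses simulated data (the variable names echo this example, but the numbers are invented and are not the study data); it shows how the coefficients, odds ratios and their confidence intervals are obtained.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulated data for illustration only: two binary exposures, age, and a binary outcome
n = 200
hsv2 = rng.integers(0, 2, n)
hiv = rng.integers(0, 2, n)
age = rng.normal(35, 8, n)
true_logit = -2.0 + 0.8 * hsv2 + 1.6 * hiv + 0.01 * age
outcome = rng.binomial(1, 1 / (1 + np.exp(-true_logit)))

X = sm.add_constant(np.column_stack([hsv2, hiv, age]))
res = sm.Logit(outcome, X).fit(disp=False)     # iterative maximum likelihood fit

print(res.params)              # a and the logistic regression coefficients b_i
print(np.exp(res.params))      # exp(b_i) = estimated odds ratio per unit increase in x_i
print(np.exp(res.conf_int()))  # confidence intervals for the odds ratios
```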

31 Statistical modelling

Statistical modelling includes the use of simple and multiple linear regression, polynomial regression, logistic regression and methods that deal with survival data. All these methods rely on generating the mathematical model that describes the relationship between two or more variables. In general, any model can be expressed in the form:

g(Y) = a + b1x1 + b2x2 + . . . + bkxk

where Y is the fitted value of the dependent variable, g(.) is some optional transformation of it (for example, the logit transformation), x1, . . . , xk are the predictor or explanatory variables (which may include polynomial terms or be categorical), b1, . . . , bk are estimated coefficients that relate to these explanatory variables, and a is a constant term.

Model selection
In order for a model to be acceptable, it should be sensible from a clinical standpoint. The inclusion of large numbers of variables in a model, especially those that are highly correlated, may lead to spurious results that are inconsistent with expectations. Therefore, explanatory variables should only be considered for inclusion in the model if there is reason to suppose, from a biological or clinical standpoint, that they are related to the dependent variable.

There is always the danger of over-fitting models by including a very large number of explanatory variables. At its extreme, a model is saturated when there are as many (or more) variables as individuals. Although explaining the data very well, an over-fitted or saturated model is generally of little use for predicting future outcomes. A usual rule-of-thumb is to ensure that there are at least 10 times as many individuals as explanatory variables.

Often, we have a large number of explanatory variables that we believe may be related to the dependent variable. For example, many factors may appear to be related to systolic blood pressure, including age, dietary and other lifestyle factors. A first step is to assess the relationship between each variable, one-by-one, and the dependent variable, e.g. using simple regression. We then consider, for further investigation, only those explanatory variables that appear to be related to the dependent variable. Automatic selection procedures, performed on the computer, provide a means of creating the optimal model, by selecting some of these variables.
• All subsets: every combination of explanatory variables is considered; that which provides the best fit, as described by the model R² (Topic 27) or some other measure, is selected.
• Forwards selection: variables that contribute most to the R² of the model are progressively added until no further variable contributes significantly to R².
• Backwards selection: all possible variables are included. Those that contribute least to R² are progressively removed until none of the remaining variables can be removed without leading to significant loss of R².
• Stepwise selection: a combination of forwards and backwards selection that starts by progressing forwards and then, at the end of each 'step', checks backwards to ensure that all of the included variables are still required.

Although these procedures remove much of the manual aspect of model selection, they have some disadvantages. First, it is possible that two or more models will fit the data equally well, leading to different conclusions. Second, the resulting models, although mathematically justifiable, may not be sensible. Therefore, a combination of these procedures and common sense should be applied when selecting the best fitting model.

Numerical explanatory variables
When a numerical explanatory variable, x, is added to a model it is usually assumed that the relationship between the dependent variable, y, and that variable is linear, i.e. there is a straight line relationship between the two variables. However, a polynomial relationship (Topic 30), or some other non-linear relationship, may be more appropriate.
• Numerical dependent variable: we show in Topic 32 how to check for linearity. If the relationship is not linear, then either we take a transformation of one or other of the variables (e.g. by taking logs, Topic 9) or categorize the explanatory variable (Topic 29) before including it in the model.
• Binary dependent variable (y = 0 or 1): we can check for linearity by categorizing individuals into groups according to their values of the explanatory variable; we observe whether a linear trend is present in the proportions with a specific outcome (y = 1, say) in each group (Topic 25).

Prognostic indices and risk scores for a binary response
Given a large number of demographic or clinical features of an individual, we may want to predict whether that individual is likely to develop disease. Models, often fitted using proportional hazards regression (Topic 41), logistic regression (Topic 30) or a similar method known as discriminant analysis, can be used to identify factors that are significantly associated with outcome. A prognostic index or risk score can then be generated from the coefficients of this model,

and the score calculated for an individual to assess his/her likelihood of disease. However, a model that explains a large proportion of the variability in the data may not necessarily be good at predicting which patients will develop disease. Therefore, once we have derived a predictive score based on a model, we should assess the validity of that score.

Validating the score
We can validate our score in a number of ways.
• We produce a prediction table based on our data set, showing the number of individuals in whom we correctly and incorrectly predict the disease status (similar to the table in Topic 35). Measures, including sensitivity and specificity, can be calculated for this table, or
• We categorize individuals according to their score and consider disease rates in the different categories (see Example); we should see a relationship between the categories and disease rates, e.g. with higher scored categories having greater disease rates.

Clearly, any model will always perform well on the data set that was used to generate the model. Therefore, in order to provide a true assessment of the usefulness of the score, it should be validated on other, independent, data sets.

Where this is impractical, we may separate the data into two roughly equally sized sub-samples. The first sub-sample, known as the training sample, is used to generate the model. The second sub-sample, known as the validation sample, is used for validating the results from the training sample. As a consequence of the smaller sample size, fewer explanatory variables can be included in the model.

Jackknifing
This is a way of both estimating and validating a score in an unbiased manner. Each individual is removed from the sample, one at a time, and the remaining (n - 1) individuals are used to estimate the parameters of the model. This process is repeated for each of the n individuals in the sample, and the results are averaged over all n samples. Because this score is generated from many different data sets, it can be validated on the complete data set without taking sub-samples.

Example
Although there are wide differences in prognosis between patients with AIDS, they are often thought of as a single homogeneous group. In order to group patients according to their likely prognosis, a prognostic score was generated on the basis of the experience of 363 AIDS patients at a single centre in London. A total of 159 (43.8%) of these patients died over a follow-up period of 6 years.

The score was the weighted sum of the number of each grade (mild, severe or very severe) of AIDS-defining diseases the patient had experienced and his/her minimum CD4 cell count (measured in cells/mm³), and was equal to:

Score = 300 × number of very severe AIDS events (lymphoma)
+ 100 × number of severe AIDS events (all events other than those listed as very severe or mild)
+ 20 × number of mild AIDS events (oesophageal candida, Kaposi's sarcoma, pneumonia, extrapulmonary tuberculosis)
- 1 × minimum CD4 cell count measured since AIDS

In order to aid the interpretation of this score, and to validate it, three groups were identified:
AIDS Grade I: score < 0
AIDS Grade II: score 0-99
AIDS Grade III: score ≥ 100

Validation of the score was assessed by considering the death rate (number of deaths divided by the total person-years of follow-up) in each AIDS grade, calculated first in the patients on whom the score was generated. There was a clear trend towards increasing death rates as the score increased. The score was also validated on a group of patients from a second London centre, with the death rate again calculated by AIDS grade. Remarkably similar results were seen, thus confirming the value of this scoring system.

Adapted from: Mocroft, A.J., Johnson, M.A., Sabin, C.A., et al. (1995) Staging system for clinical AIDS patients. Lancet, 346, 12-17.
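A risk score of this kind is easy to apply once the coefficients are fixed. The Python sketch below simply codes up the weights and grade cut-offs quoted in the example above; the patient in the final lines is hypothetical and used only to show how the score and grade would be assigned.

```python
def aids_score(n_very_severe, n_severe, n_mild, min_cd4):
    """Prognostic score from the example: weighted AIDS events minus the minimum CD4 count."""
    return 300 * n_very_severe + 100 * n_severe + 20 * n_mild - min_cd4

def aids_grade(score):
    """Grades used for validation: < 0 is Grade I, 0-99 is Grade II, >= 100 is Grade III."""
    if score < 0:
        return "Grade I"
    elif score < 100:
        return "Grade II"
    return "Grade III"

# Hypothetical patient: one severe and one mild event, minimum CD4 count of 50 cells/mm3
s = aids_score(n_very_severe=0, n_severe=1, n_mild=1, min_cd4=50)
print(s, aids_grade(s))   # 70, Grade II
```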

32 Checking assumptions

Why bother?
Computer analysis of data offers the opportunity of handling large data sets that might otherwise be beyond our capabilities. However, do not be tempted to 'have a go' at statistical analyses simply because they are available on the computer. The validity of the conclusions drawn relies on the appropriate analysis being conducted in any given circumstance, and a requirement that the underlying assumptions inherent in the proposed statistical analysis are satisfied. We say that an analysis is robust to violations of its assumptions if its P-value and power (Topic 18) are not appreciably affected by the violations. Performing a non-robust analysis could lead to misleading conclusions.

Are the data Normally distributed?
Many analyses make assumptions about the underlying distribution of the data. The following procedures verify approximate Normality, the most common of the distributional assumptions.
• We produce a dot plot (for small samples) or a histogram, stem-and-leaf plot (Fig. 4.2) or box plot to show the empirical frequency distribution of the data (Topic 4). We conclude that the distribution is approximately Normal if it is bell-shaped and symmetrical. The median in a box plot should cut the rectangle defining the first and third quartiles in half, and the two whiskers should be of equal length if the data are Normally distributed.
• Alternatively, we can produce a Normal plot (preferably on the computer) which plots the cumulative frequency distribution of the data (on the horizontal axis) against that of the Normal distribution. Lack of Normality is indicated by the resulting plot producing a curve that appears to deviate from a straight line (Fig. 32.1).
Although both approaches are subjective, the Normal plot is more effective for smaller samples. The Kolmogorov-Smirnov and Shapiro-Wilk tests, both performed on the computer, can be used to assess Normality more objectively.

Are two or more variances equal?
We explained how to use the t-test (Topic 21) to compare two means, and ANOVA (Topic 22) to compare more than two means. Underlying these analyses is the assumption that the variability of the observations in each group is the same, i.e. we require equal variances, described as homogeneity of variance or homoscedasticity. We have heterogeneity of variance if the variances are unequal.

We can use Levene's test, using a computer program, to test for homogeneity of variance in two or more groups. The null hypothesis is that all the variances are equal. Levene's test has the advantage that it is not strongly dependent on the assumption of Normality. Bartlett's test can also be used to compare more than two variances, but it is non-robust to departures from Normality.

We can use the F-test (variance-ratio test) described in the box to compare two variances, provided the data in each group are approximately Normally distributed (the test is non-robust to a violation of this assumption). The two estimated variances are s1² and s2², calculated from n1 and n2 observations, respectively. By convention, we choose s1² to be the larger of the two variances, if they differ.

1 Define the null and alternative hypotheses under study
H0: the two population variances are equal
H1: the two population variances are unequal.
2 Collect relevant data from a sample of individuals
3 Calculate the value of the test statistic specific to H0
F = s1²/s2²
which follows an F-distribution with n1 - 1 df in the numerator, and n2 - 1 df in the denominator. By choosing s1² ≥ s2², we have ensured that the F-ratio will always be ≥ 1. This allows us to use the tables of the F-distribution, which are tabulated only for values ≥ 1.
4 Compare the value of the test statistic to values from a known probability distribution
Refer F to Appendix A5. Our two-sided alternative hypothesis leads to a two-tailed test.
5 Interpret the P-value and results
Note that we are rarely interested in the variances per se, so we do not usually calculate confidence intervals for them.

Are variables linearly related?
Most of the techniques which we discussed in Topics 26-30 assume that there is a linear (straight line) relationship between two or, sometimes, more than two variables. Any inferences drawn from such analyses rely on the linearity assumption being satisfied. The simplest way of checking for linearity between two variables is to plot one variable against the other, and 'eyeball' the resulting scatter of
Most of the techniques which we discussed in Topics 26-30 Are two or more variances equal? assume that there is a linear (straight line) relationship between two or, sometimes, more than two variables. Any We explained how to use the t-test (Topic 21) to compare inferences drawn from such analyses rely on the linearity two means, and ANOVA (Topic 22) to compare more than two assumption being satisfied. The simplest way of checking means. Underlying these analyses is the assumption that the for linearity between two variables is to plot one variable variability of the observations in each group is the same,i.e. against the other, and 'eyeball' the resulting scatter of we require equal variances, described as homogeneity of variance or homoscedasticity. We have heterogeneity of variance if the variances are unequal. We can use Levene's test, using a computer program, to test for homogeneity of variance in two or more groups.The null hypothesis is that all the variances are equal. Levene's

points which should broadly follow a straight line.Alterna- -3 1 II LI I tively, we can plot the residuals against the values of the explanatory variable (x); we should observe a random 0 2 468 10 scatter of points (Fig.28.2). (a) Triglyceride (mmollL) What if the assumptions are not satisfied? -31 -I I I I , We have various options. -0.6 -0.4 -0.2 0 0.2 0.4 0.6 0.8 1 Proceed as planned, recognizing that this may result in a (b) Log,o (Triglyceride) non-robust analysis. Be aware of the implications if you do this. Do not be fooled into an inappropriate analysis just because others, in similar circumstances, have done one in the past! Take an appropriate transformation of the raw data so that the transformed data satisfy the assumptions of the proposed analysis (Topic 9); If feasible, perform a non-parametricanalysis (Topic 17) that does not make any assumption about the distribution of the data (e.g. Normality). Fig. 32.1 (a) Normal plot of untransformed triglyceride levels described inTopic 19.These are skewed and the resulting Normal plot shows a distinct curve. (b) Normal plot of log triglyceride 1evels.The approximately straight line indicates that the log transformation has been successful at removing the skewness in the data. Example assurance that the underlying assumptions (Normality and constant variance) are satisfied. The stem-and-leaf Consider the unpaired I-test example of Topic 21. A total plots in Fig. 4.2 show that the data in each group are of 98 school age children were randomly assigned to approximately Normally distributed. We performed the receive either inhaled beclomethasone dipropionate or a F-test to investigate the assumption of constant variance placebo to determine their effects on wheezing. We used in the two groups. the unpaired I-test to compare the mean forced expiratory volume (FEVl) in each group over the 6 months. but need 1 H,:the vniance of FEVl ~easurementsin the PoPu- 4 We refer F = 1.34to Appendix A5 for a two-sided test lation of school age children is the same in the two treat- at the 5% level of significance.Because Appendix A5 ment groups is restricted to entries of 25 and infinity df in the H,: the variance of FEW measurements in the popu- lation of school age children is not the same in two treat- numerator, and 30 and 50 df ir ominator, we ment groups. have to intevolate (Topic21).' red tabulated 2 Treated group: sample size. n , = 50, standard devia- tion,~=, 0.291 value at the 5% level of signific between 1.57 Placebo group: sample size. n2= 48. standard deviation, s2 = 0.251 and 2.12; thus P > 0.05 because 1.34 is less than the minimum of these values. 0.29? 0.0841 5 There is insufficient evidence to reject the null == hypothesis that the variances are equal. It is reason- - -3 The able to use the unpaired t-test,which assumes Normal- test statistic. -s f -- 1.336 ity and homogeneity of variance. to compare the mean FEVl values in the two groups. F= ss 0.25v . 0 6 2 5 which follows an F-distribution with 50 - 1 = 49 and 48 - 1 =47 df in the numerator and denominator, respec- tively.

33 Sample size calculations

The importance of sample size
The number of patients in a study is usually restricted because of ethical, cost and time considerations. If our sample size is too small, however, we may not be able to detect an important existing effect [i.e. the power (Topic 18) of the test will be inadequate], and we shall have wasted all our resources. We therefore have to optimize the sample size, striking a balance between sample size and the factors (such as power, the size of the treatment effect and the significance level) that affect it. Unfortunately, in order to calculate the sample size required, we have to have some idea of the results we expect in the study.

Requirements
We shall explain how to calculate the optimal sample size in simple situations; often more complex designs can be simplified for the purpose of calculating the sample size. If our investigation involves a number of tests, we focus on the most important or evaluate the sample size required for each and choose the largest.

To calculate sample size, we need to specify the following quantities at the design stage of the investigation.
• Power (Topic 18)—the chance of detecting, as statistically significant, a specified effect if it exists. We usually choose the power to equal 70–80% or more.
• Significance level, α (Topic 17)—the cut-off level below which we will reject the null hypothesis, i.e. it is the maximum probability of incorrectly concluding that there is an effect. We usually fix this as 0.05, or occasionally 0.01, and reject the null hypothesis if the P-value is less than this value.
• Variability of the observations, e.g. the standard deviation, if we have a numerical variable.
• Smallest effect of interest—the magnitude of the effect that is clinically important and that we do not want to overlook. This is often a difference (e.g. difference in means or proportions). Sometimes it is expressed as a multiple of the standard deviation of the observations (the standardized difference).

It is relatively simple to choose the power and significance level of the test that suits the particular requirements of our study. Given a particular clinical scenario, it is possible to specify the effect we regard as clinically important. The real difficulty lies in providing an estimate of the variation in a numerical variable before we have collected the data.

Methodology
We can calculate sample size in a number of ways, each of which requires essentially the same information (described in Requirements) in order to proceed.
• General formulae—these can be complex.
• Quick formulae—these exist for particular power values and significance levels for some hypothesis tests (e.g. Lehr's formulae¹, see below).
• Special tables²—these exist for particular hypothesis tests, e.g. the unpaired t-test or Chi-squared test.
• Altman's nomogram—this is an easy-to-use diagram that is appropriate for various tests. Details are given in the next section.
• Computer software—this has the advantage that results can be presented graphically or in tables to show the consequence of changing the factors (e.g. power, size of effect) on the required sample size.

Altman's nomogram
Notation
We show in Table 33.1 the notation for using Altman's nomogram to estimate the sample size of two equally sized groups of observations for three frequently used hypothesis tests of means and proportions.

Method
For each test, we calculate the standardized difference and join its value on the left-hand axis of the nomogram (Appendix B) to the power we have specified on the right-hand vertical axis. The required sample size is indicated at the point at which the resulting line and the sample size axis meet.

Note that we can also use the nomogram to evaluate the power of a hypothesis test for a given sample size. Occasionally, this is useful if we wish to know, retrospectively, whether we can attribute lack of significance in a hypothesis test to an inadequately sized sample. Remember, also, that a wide confidence interval for the effect of interest indicates poor power (Topic 11).

Quick formulae
For the unpaired t-test and Chi-squared test, we can use Lehr's formula¹ for calculating the sample size for a power of 80% and a two-sided significance level of 0.05.

¹Lehr, R. (1992) Sixteen s squared over d squared: a relation for crude sample size estimates. Statistics in Medicine, 11, 1099–1102.
²Machin, D. & Campbell, M.J. (1995) Statistical Tables for the Design of Clinical Trials, 2nd edn. Blackwell Scientific Publications, Oxford.

The required sample size in each group is:

16/(standardized difference)²

If the standardized difference is small, this overestimates the sample size. Note that a numerator of 21 (instead of 16) relates to a power of 90%.

Power statement
It is often essential and always useful to include a power statement in a study protocol or in the methods section of a paper to show that careful thought has been given to sample size at the design stage of the investigation. A typical statement might be '84 patients in each group were required to have a 90% chance of detecting a difference in means of 2.5 days (SD = 5 days) at the 5% level of significance using the unpaired t-test' (see Example 1).

Table 33.1 Information for using Altman's nomogram.

Unpaired t-test (Topic 21). Standardized difference: δ/σ. N: N/2 observations in each group. Terminology: δ is the smallest difference in means that is clinically important; σ is the assumed equal standard deviation of the observations in each of the two groups. You can estimate it using results from a similar study conducted previously or from published information. Alternatively, you could perform a pilot study to estimate it. Another approach is to express δ as a multiple of the standard deviation (e.g. the ability to detect a difference of two standard deviations).

Paired t-test (Topic 20). Standardized difference: 2δ/σd. N: N pairs of observations. Terminology: δ is the smallest mean difference that is clinically important; σd is the standard deviation of the differences in response, usually estimated from a pilot study.

Chi-squared test (Topic 24). Standardized difference: (p1 − p2)/√[p̄(1 − p̄)], where p̄ = (p1 + p2)/2. N: N/2 observations in each group. Terminology: p1 − p2 is the smallest difference in the proportions of 'success' in the two groups that is clinically important. One of these proportions is often known, and the relevant difference evaluated by considering what value the other proportion must take in order to constitute a noteworthy change.

Example 1 Comparing means in independent groups using the unpaired t-test
Objective—to examine the effectiveness of aciclovir suspension (15 mg/kg) for treating 1–7-year-old children with herpetic gingivostomatitis lasting less than 72 h.
Design—randomized, double-blind placebo-controlled trial with 'treatment' administered five times a day for 7 days.
Main outcome measure for determining sample size—duration of oral lesions.
Sample size question—how many children are required to have a 90% power of detecting a 2.5 day difference in duration of oral lesions between the two groups at the 5% level of significance? The authors assume that the standard deviation of the duration of oral lesions is approximately 5 days.

Using the nomogram:
δ = 2.5 days and σ = 5 days. Thus the standardized difference equals δ/σ = 2.5/5 = 0.50.
The line connecting a standardized difference of 0.50 and a power of 90% cuts the sample size axis at approximately 160. Therefore, about 80 children are required in each group (note: if δ were increased to 3 days, the standardized difference would equal 0.6 and the required sample size would decrease to approximately 118 in total, i.e. 59 in each group).
Quick formula:
If the power is 90%, the required sample size in each group is:
21/(standardized difference)² = 21/(0.50)² = 84.

Amir, J., Harel, L., Smetana, Z., Varsano, I. (1997) Treatment of herpes simplex gingivostomatitis with aciclovir in children: a randomised double-blind placebo-controlled study. British Medical Journal, 314, 1800–1803.

Example 2 Comparing two proportions in independent groups using the Chi-squared test
Objective—to compare the effectiveness of corticosteroid injections with physiotherapy for the treatment of painful stiff shoulder.
Design—randomized controlled trial (RCT) in which patients are randomly allocated to 6 weeks of treatment, comprising either a maximum of three injections or twelve 30 min sessions of physiotherapy for each patient.
Main outcome measure for determining sample size—treatment is regarded as a success after 7 weeks if the patient rates him/herself as having made a complete recovery or as having improvement (on a six-point Likert scale).
Sample size question—how many patients are required in order to have an 80% power of detecting a clinically important difference in success rates of 25% between the two groups at the 5% level of significance? The authors assume a success rate of 40% in the group having the least successful treatment.
Using the nomogram:
p1 = 0.40 and p2 = 0.65, so p̄ = (0.4 + 0.65)/2 = 0.525.
Standardized difference = (p2 − p1)/√[p̄(1 − p̄)] = 0.25/√(0.525 × 0.475) = 0.50.
The line connecting a standardized difference of 0.50 and a power of 80% cuts the sample size axis at 120. Therefore approximately 60 patients are required in each group (note: if the power were increased to 85%, the required sample size would increase to approximately 140 in total, i.e. 70 patients would be required in each group).
Quick formula:
If the power is 80%, the required sample size in each group is:
16/(standardized difference)² = 16/(0.50)² = 64.

van der Windt, D.A.W.M., Koes, B.W., Devillé, W., de Jong, B.A., Bouter, M. (1998) Effectiveness of corticosteroid injections with physiotherapy for treatment of painful shoulder in primary care: randomised trial. British Medical Journal, 317, 1292–1296.

Figures 18.1 and 18.2 show power curves for these examples.
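The quick-formula calculations in Examples 1 and 2 are simple to script. Below is a minimal sketch in Python; the numerators 16 (80% power) and 21 (90% power) and the standardized differences are those given in the text, while the function name and structure are our own.

```python
import math

def lehr_n_per_group(standardized_difference, power=0.80):
    """Lehr's quick formula: approximate n per group, two-sided 5% significance level."""
    numerator = {0.80: 16, 0.90: 21}[power]
    return numerator / standardized_difference ** 2

# Example 1: unpaired t-test, delta = 2.5 days, sigma = 5 days, 90% power
print(lehr_n_per_group(2.5 / 5, power=0.90))          # 84 children per group

# Example 2: Chi-squared test, p1 = 0.40, p2 = 0.65, 80% power
p1, p2 = 0.40, 0.65
p_bar = (p1 + p2) / 2
std_diff = (p2 - p1) / math.sqrt(p_bar * (1 - p_bar))  # approximately 0.50
print(lehr_n_per_group(std_diff, power=0.80))          # approximately 64 patients per group
```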

34 Presenting results

Introduction
An essential facet of statistics is the ability to summarize the important features of the analysis. We must know what to include and how to display our results in a manner that enables others to obtain relevant and important information easily and draw correct conclusions. This topic describes the key features of presentation.

Numerical results
• Give figures only to the degree of accuracy that is appropriate (as a guideline, one significant figure more than the raw data). If analysing the data by hand, only round up or down at the end of the calculations.
• Give the number of items on which any summary measure (e.g. a percentage) is based.
• Describe any outliers and explain how they are handled (Topic 3).
• Include the units of measurement.
• When interest is focused on a parameter (e.g. the mean, regression coefficient), always indicate the precision of its estimate. We recommend using a confidence interval for this but the standard error is also acceptable. Avoid using the ± symbol, as in mean ± SEM (Topic 10), because by adding and subtracting the SEM, we create a 67% confidence interval that can be misleading for those used to 95% confidence intervals. It is better to show the standard error in brackets after the parameter estimate [e.g. mean = 16.6 g (SEM 0.5 g)].
• When interest is focused on the distribution of observations, always indicate a measure of the 'spread' of the data. The range of values that excludes outliers (typically, the range of values containing the central 95% of the observations—Topic 6) is a useful descriptor. If the data are Normally distributed, this range is approximated by the sample mean ± 1.96 × standard deviation (Topic 7). You can quote the mean and the standard deviation [e.g. mean = 35.9 mm (SD 2.8 mm)] instead, but this leaves the reader to evaluate the range.

Tables
• Do not give too much information in a table.
• Include a concise, informative and unambiguous title.
• Label each row and column.
• Remember that it is easier to scan information down columns rather than across rows.

Diagrams
• Keep a diagram simple and avoid unnecessary frills (e.g. making a pie chart three-dimensional).
• Include a concise, informative and unambiguous title.
• Label all axes, segments and bars, and explain the meaning of symbols.
• Avoid distorting results by exaggerating the scale on an axis.
• Indicate where two or more observations lie in the same position on a scatter diagram, e.g. by using a different symbol.
• Ensure that all the relevant information is contained in the diagram (e.g. link paired observations).

Presenting results in a paper
When presenting results in a paper, we should ensure that the paper contains enough information for the reader to understand what has been done. He/she should be able to reproduce the results, given the appropriate computer package and data. All aspects of the design of the study and the statistical methodology must be fully described.

Results of a hypothesis test
• Include a relevant diagram, if appropriate.
• Indicate the hypotheses of interest.
• Name the test and state whether it is one- or two-tailed.
• Justify the assumptions (if any) underlying the test (e.g. Normality, constant variance), and describe any transformations (Topic 9) required to meet these assumptions (e.g. taking logarithms).
• Specify the observed value of the test statistic, its distribution (and degrees of freedom, if relevant), and, if possible, the exact P-value (e.g. P = 0.03) rather than an interval estimate of it (e.g. 0.01 < P < 0.05) or a star system (e.g. *, **, *** for increasing levels of significance). Avoid writing 'n.s.' when P > 0.05; an exact P-value is preferable even when the result is non-significant.
• Include an estimate of the relevant effect of interest (e.g. the difference in means for the two-sample t-test, or the mean difference for the paired t-test) with a confidence interval (preferably) or standard error.
• Draw conclusions from the results (e.g. reject the null hypothesis), interpret any confidence interval and explain the implications.

Results of a regression analysis
Here we include simple (Topics 27 and 28) and multiple linear regression (Topic 29), logistic regression (Topic 30) and proportional hazards regression (Topic 41). Full details of these analyses are explained in the associated topics.
• Include relevant diagrams (e.g. a scatter plot with the fitted line for simple regression).
• Clearly state which is the dependent variable and which is (are) the explanatory variable(s).

• Justify the underlying assumptions.
• Describe any transformations, and explain their purpose.
• Where appropriate, describe the possible numerical values taken by any categorical variable (e.g. male = 0, female = 1), how dummy variables were created, and the units of continuous variables.
• Give an indication of the goodness-of-fit of the model (e.g. quote R²).
• If appropriate (e.g. in multiple regression), give the results of the overall F-test from the ANOVA table.
• Provide estimates of all the coefficients in the model (including those which are not significant) together with the confidence intervals for the coefficients or standard errors of their estimates. In logistic regression (Topic 30) and proportional hazards regression (Topic 41), convert the coefficients to estimated odds ratios or relative hazards (with confidence intervals). Interpret the relevant coefficients.
• Show the results of the hypothesis tests on the coefficients (i.e. include the test statistics and the P-values). Draw appropriate conclusions from these tests.

Complex analyses
There are no simple rules for the presentation of the more complex forms of statistical analysis. Be sure to describe the design of the study fully (e.g. the factors in the analysis of variance and whether there is a hierarchical arrangement), and include a validation of underlying assumptions, relevant descriptive statistics (with confidence intervals), test statistics and P-values. A brief description of what the analysis is doing helps the uninitiated; this should be accompanied by a reference for further details. Specify which computer package has been used.

Example
Table 34.1: Information relating to first births in women with bleeding disorders (the study is described in Topic …), stratified by bleeding disorder (haemophilia A, haemophilia B, von Willebrand's disease and factor XI deficiency). For each group and for the total, the table gives the number of women with live births; the median (range) of the mother's age at the birth of the baby (years); the median (range) of the gestational age of the baby (weeks); the median (range) of the weight of the baby (kg); the sex of the baby (frequencies and percentages); and the interventions received during labour (frequencies and percentages receiving inhaled gas, intramuscular pethidine, intravenous pethidine and epidural). Annotations to the table point out its informative and unambiguous title, the fully labelled rows and columns, the units of measurement, the estimates of location and spread, and the use of percentages alongside frequencies.

Fig. 34.1 Histograms showing the distribution of (a) systolic blood pressure (mmHg) and (b) height (cm) in a sample of 100 children (Topic 26). Annotations to the diagrams point out the clear titles, the labelled axes and the units of measurement.

35 Diagnostic tools

An individual's health is often characterized by a number of numerical or categorical measures. We can use reference intervals (Topics 6 and 7) and diagnostic tests to determine whether the measurement seen in an individual is likely to be the consequence of undiagnosed illness or may be indicative of disease.

Reference intervals
A reference interval (often referred to as a normal range) for a single numerical variable, calculated from a very large sample, provides a range of values that are typically seen in healthy individuals. If an individual's value is above the upper limit, or below the lower limit, we consider it to be unusually high (or low) relative to healthy individuals.

Calculating reference intervals
Two approaches can be taken.
• We make the assumption that the data are Normally distributed. Approximately 95% of the data values lie within 1.96 standard deviations of the mean (Topic 7). We use our data to calculate these two limits (mean ± 1.96 standard deviations).
• An alternative approach, which does not make any assumptions about the distribution of the measurement, is to use a central range which encompasses 95% of the data values (Topic 6). We put our values in order of magnitude and use the 2.5th and 97.5th percentiles as our limits.

The effect of other factors on reference intervals
Sometimes the values of a numerical variable depend on other factors, such as age or sex. It is important to interpret a particular value only after considering these other factors. For example, we generate reference intervals for systolic blood pressure separately for men and women.

Diagnostic tests
The gold-standard test that provides a definitive diagnosis of a particular condition may sometimes be impractical. We would like a simple test, depending on the presence or absence of some marker, which provides an accurate guide to whether or not the patient has the condition.

We take a group of individuals whose true disease status is known from the gold-standard test. We can draw up the 2 × 2 table of frequencies (Table 35.1):

Table 35.1 Table of frequencies.
                         Gold standard test
Test result       Disease      No disease      Total
Positive          a            b               a + b
Negative          c            d               c + d
Total             a + c        b + d           n = a + b + c + d

Of the n individuals studied, a + c individuals have the disease. The prevalence (Topic 12) of the disease in this sample is (a + c)/n.

Of the a + c individuals who have the disease, a have positive test results (true positives) and c have negative test results (false negatives). Of the b + d individuals who do not have the disease, d have negative test results (true negatives) and b have positive test results (false positives).

Assessing reliability: sensitivity and specificity
Sensitivity = proportion of individuals with the disease who are correctly identified by the test = a/(a + c)
Specificity = proportion of individuals without the disease who are correctly identified by the test = d/(b + d)

These are usually expressed as percentages. As with all estimates, we should calculate confidence intervals for these measures (Topic 11).

We would like to have a sensitivity and specificity that are both as close to 1 (or 100%) as possible. However, in practice, we may gain sensitivity at the expense of specificity, and vice versa. Whether we aim for a high sensitivity or high specificity depends on the condition we are trying to detect, along with the implications for the patient and/or the population of either a false negative or false positive test result. For conditions that are easily treatable, we prefer a high sensitivity; for those that are serious and untreatable, we prefer a high specificity in order to avoid making a false positive diagnosis.

Predictive values
Positive predictive value = proportion of individuals with a positive test result who have the disease = a/(a + b)
Negative predictive value = proportion of individuals with a negative test result who do not have the disease = d/(c + d)

We calculate confidence intervals for these predictive values, often expressed as percentages, using the methods described in Topic 11.

These predictive values provide information about how likely it is that the individual has or does not have the disease, given his/her test result. Predictive values are dependent on the prevalence of the disease in the population being studied. In populations where the disease is common, the positive predictive value will be much higher than in populations where the disease is rare. The converse is true for negative predictive values.

The use of a cut-off value
Sometimes we wish to make a diagnosis on the basis of a continuous measurement. Often there is no threshold above (or below) which disease definitely occurs. In these situations, we need to define a cut-off value ourselves, above (or below) which we believe an individual has a very high chance of having the disease. A useful approach is to use the upper (or lower) limit of the reference interval. We can evaluate this cut-off value by calculating its associated sensitivity, specificity and predictive values. If we choose a different cut-off, these values may change as we become more or less stringent. We choose the cut-off to optimize these measures as desired.

Receiver operating characteristic curves
These provide a way of assessing whether a particular type of test provides useful information, and can be used to compare two different tests, and to select an optimal cut-off value for a test. For a given test, we consider all cut-off points that give a unique pair of values for sensitivity and specificity, and plot the sensitivity against 1 minus the specificity (thus comparing the probabilities of a positive test result in those with and without disease) and connect these points by lines (Fig. 35.1).

The receiver operating characteristic (ROC) curve for a test that has some use will lie to the left of the diagonal of the graph. Two or more tests can be compared by considering the area under each curve—the test with the greater area is better. Depending on the implications of false positive and false negative results, and the prevalence of the condition, we can choose the optimal cut-off value for a test from this graph.

Is a test useful?
The likelihood ratio (LR) for a positive result is the ratio of the chance of a positive result if the patient has the disease to the chance of a positive result if he/she does not have the disease. Likelihood ratios can also be generated for negative test results. For example, a LR of 2 for a positive result indicates that a positive result is twice as likely to occur in an individual with disease than in one without it. A high likelihood ratio for a positive result suggests that the test provides useful information, as does a likelihood ratio close to zero for a negative result.

It can be shown that:
LR for a positive result = Sensitivity/(1 − specificity)

We discuss the LR further in Topic 42.
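All of these measures follow directly from the cells of Table 35.1. The sketch below is our own illustration (not part of the original text): it uses the a, b, c, d notation of the table and attaches approximate 95% confidence intervals based on the Normal approximation for a proportion (Topic 11).

```python
import math

def diagnostic_measures(a, b, c, d):
    """Sensitivity, specificity, predictive values and LR+ from a 2 x 2 table."""
    def ci(p, n):  # approximate 95% confidence interval for a proportion
        se = math.sqrt(p * (1 - p) / n)
        return (p - 1.96 * se, p + 1.96 * se)

    sens, spec = a / (a + c), d / (b + d)
    ppv, npv = a / (a + b), d / (c + d)
    return {
        "prevalence": (a + c) / (a + b + c + d),
        "sensitivity": (sens, ci(sens, a + c)),
        "specificity": (spec, ci(spec, b + d)),
        "PPV": (ppv, ci(ppv, a + b)),
        "NPV": (npv, ci(npv, c + d)),
        "LR+": sens / (1 - spec),
    }

# Frequencies from the CMV example that follows: a = 7, b = 6, c = 8, d = 28
print(diagnostic_measures(7, 6, 8, 28))
```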

Example
Cytomegalovirus (CMV) is a common viral infection to which many individuals are exposed during childhood. Although infection with the virus does not usually lead to any major problems, individuals who have been infected with CMV in the past may suffer serious disease after certain transplant procedures, such as bone marrow transplants, if the virus is either reactivated or acquired from the donor. It is thought that the amount of virus in the blood after transplantation (the viral load) may predict which individuals will get severe disease. In order to study this hypothesis, CMV viral load was measured in a group of 49 bone marrow transplant recipients. Fifteen of the 49 patients developed severe disease during follow-up. Viral load values in all patients ranged from 2.7 log10 genomes/mL to 6.0 log10 genomes/mL. As a starting point, a value in excess of 4.5 log10 genomes/mL was considered an indication of the possible future development of severe disease. The table of frequencies shows the results obtained, together with calculations of the measures of interest.

                                 Severe disease
Viral load (log10 genomes/mL)    Yes    No    Total
> 4.5                              7     6      13
≤ 4.5                              8    28      36
Total                             15    34      49

Sensitivity = 7/15 × 100% = 47% (95% CI 21% to 72%)
Specificity = 28/34 × 100% = 82% (95% CI 69% to 95%)
Positive predictive value = 7/13 × 100% = 54% (95% CI 27% to 81%)
Negative predictive value = 28/36 × 100% = 78% (95% CI 64% to 92%)
Likelihood ratio for a positive result = 0.47/(1 − 0.82) = 2.6 (95% CI 1.1 to 6.5, obtained from computer output)

Therefore, for this cut-off value, we have a relatively high specificity and a moderate sensitivity. The LR of 2.6 indicates that the test is useful, in that a viral load > 4.5 log10 genomes/mL is more than twice as likely in an individual with severe disease as in one without severe disease. However, in order to investigate other cut-off values, a ROC curve was plotted (Fig. 35.1). The plotted line falls just to the left of the diagonal of the graph. For this example, the most useful cut-off value (5.0 log10 genomes/mL) is that which gives a sensitivity of 40% and a specificity of 97%; then the LR equals 13.3.

Fig. 35.1 Receiver operating characteristic (ROC) curve, indicating the results from two possible cut-off values. The optimal cut-off value of 5.0 log10 genomes/mL gives a sensitivity of 40% and a specificity of 97%; a cut-off value of 4.5 log10 genomes/mL gives a sensitivity of 47%, a specificity of 82% and a LR of 2.6.

Data kindly provided by Dr V.C. Emery and Dr D. Gor, Department of Virology, Royal Free and University College Medical School, Royal Free Campus, London, UK.
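The ROC curve in Fig. 35.1 is built by computing the sensitivity and 1 − specificity at every possible cut-off. A minimal sketch of that calculation is shown below; the individual patient data are not reproduced in the text, so the arrays here are synthetic stand-ins, not the study data.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
# Synthetic stand-in data: 15 'severe disease' patients tend to have higher
# log10 viral loads than the 34 patients without severe disease.
severe = np.r_[np.ones(15), np.zeros(34)]
viral_load = np.r_[rng.normal(4.8, 0.6, 15), rng.normal(3.9, 0.6, 34)]

fpr, tpr, cutoffs = roc_curve(severe, viral_load)   # fpr is 1 - specificity
print("Area under the curve:", roc_auc_score(severe, viral_load))
for cut, sens, one_minus_spec in zip(cutoffs, tpr, fpr):
    print(f"cut-off {cut:.2f}: sensitivity {sens:.2f}, specificity {1 - one_minus_spec:.2f}")
```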

36 Assessing agreement

Introduction
There are many occasions on which we wish to compare results that should concur. In particular, we may want to assess and, if possible, quantify the following two types of agreement.
• Reproducibility (method/observer agreement). Do two techniques used to measure a particular variable, in otherwise identical circumstances, produce the same result? Do two or more observers using the same method of measurement obtain the same results?
• Repeatability. Does a single observer obtain the same results when he/she takes repeated measurements in identical circumstances?

Both reproducibility and repeatability can be approached in the same way. In each case, the method of analysis depends on whether the variable is categorical (e.g. poor/average/good) or numerical (e.g. systolic blood pressure). For simplicity, we shall restrict the problem to that of comparing only two sets of paired results (e.g. two methods/two observers/duplicate measurements).

Categorical variables
Suppose two observers assess the same patients for disease severity using a categorical scale of measurement, and we wish to evaluate the extent to which they agree. We present the results in a two-way contingency table of frequencies with the rows and columns indicating the categories of response for each observer. Table 36.1 is an example, showing the results of two observers' assessments of the condition of tooth surfaces. The frequencies with which the observers agree are shown along the diagonal of the table. We calculate the corresponding frequencies that would be expected if the categorizations were made at random in the same way as we calculated expected frequencies in the Chi-squared test of association (Topic 24); i.e. each expected frequency is the product of the relevant row and column totals divided by the overall total. Then we measure the agreement by:

Cohen's kappa, κ = (Od/m − Ed/m)/(1 − Ed/m)

which represents the chance-corrected proportional agreement, where:
• m is the total observed frequency (e.g. total number of patients);
• Od is the sum of observed frequencies along the diagonal;
• Ed is the sum of expected frequencies along the diagonal;
• 1 in the denominator represents maximum agreement.

κ = 1 implies perfect agreement and κ = 0 suggests that the agreement is no better than that which would be obtained by chance. There are no objective criteria for judging intermediate values. However, kappa is often judged as providing agreement¹ which is:
• poor if κ ≤ 0.20;
• fair if 0.21 ≤ κ ≤ 0.40;
• moderate if 0.41 ≤ κ ≤ 0.60;
• substantial if 0.61 ≤ κ ≤ 0.80;
• good if κ > 0.80.

Note that kappa is dependent both on the number of categories (i.e. its value is greater if there are fewer categories) and the prevalence of the condition, so care must be taken when comparing kappas from different studies. For ordinal data, we can also calculate a weighted kappa², which takes into account the extent to which the observers disagree (the non-diagonal frequencies) as well as the frequencies of agreement (along the diagonal).

Numerical variables
Suppose an observer takes duplicate measurements of a numerical variable on n individuals (just replace the word 'repeatability' by 'reproducibility' if considering the similar problem of method agreement).

If the average difference (e.g. the true mean difference, estimated by d̄) is zero (as assessed by the paired t-test, sign test or signed ranks test—Topics 19 and 20) then we can infer that there is no bias in the results. This implies that, on average, the duplicate readings agree.

The estimated standard deviation of the differences (sd) provides a measure of agreement that can be used as a comparative tool. However, it is more usual to calculate the British Standards Institution repeatability coefficient = 2sd. This indicates the maximum difference that is likely to occur between two measurements if there is no bias. Assuming a Normal distribution of differences, we expect approximately 95% of the differences in the population to lie between d̄ ± 2sd. The upper and lower limits of this interval are called the limits of agreement; from them, we can decide (subjectively) whether the agreement between pairs of readings in a given situation is acceptable.

¹Landis, J.R. & Koch, G.G. (1977) The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
²Cohen, J. (1968) Weighted kappa: nominal scale agreement with provision for scale disagreement or partial credit. Psychological Bulletin, 70, 213–220.

Precautions
• It makes no sense to calculate a single measure of repeatability if the extent to which the observations in a pair disagree depends on the magnitude of the measurement. We can check this by determining both the mean of, and the difference between, each pair of readings, and plotting the n differences against their corresponding means³ (Fig. 36.1). If there is no relationship, then we should observe a random scatter of points (evenly distributed above and below zero if no bias is present). If, however, we observe a funnel effect, with the variation in the differences being greater (say) for larger mean values, then we must reassess the problem. We may be able to find an appropriate transformation of the raw data (Topic 9), so that when we repeat the process on the transformed observations, the required condition is satisfied. We can also use the plot to detect outliers (Topic 3).
• Be wary of producing a scatter diagram with the results from the first occasion plotted against those from the second occasion (or the data from one method/observer plotted against the other), and calculating the correlation coefficient (Topic 26). We are not really interested in whether the points lie on a straight line; we want to know whether they conform to the 45° line, i.e. the line of equality. This will not be established by a hypothesis test of the null hypothesis that the true correlation coefficient is zero. Furthermore, bear in mind the fact that it is possible to increase the magnitude of the correlation coefficient by increasing the range of values of the measurements.

More complex situations
Sometimes you may come across more complex problems when assessing agreement. For example, there may be more than two replicates, or more than two observers, or each of a number of observers may have replicate observations. You can find details of the analysis of such problems in Streiner and Norman⁴.

Example 1 Assessing agreement—categorical variable
Two observers, an experienced dentist and a dental student, assessed the condition of 2104 tooth surfaces in school-aged children. Every surface was coded as '0' (sound), '1' (with at least one 'small' cavity), '2' (with at least one 'big' cavity) or '3' (with at least one filling, with or without cavities) by each individual. The observed frequencies are shown in Table 36.1; the bold figures along the diagonal are the observed frequencies of agreement, with the corresponding expected frequencies in brackets.

Table 36.1 Observed (and expected) frequencies of tooth surface assessments, with the dentist's codes defining the rows and the dental student's codes the columns. The diagonal (agreement) frequencies are 1785 (expected 1602.1) for code 0, 154 (21.3) for code 1, 20 (0.5) for code 2 and 14 (0.3) for code 3; the dentist's row totals are 1838, 223, 25 and 18, giving 2104 surfaces in total.

We calculated Cohen's kappa to assess the agreement between the two observers. We estimate Cohen's kappa as:

κ = (1973/2104 − 1624.2/2104)/(1 − 1624.2/2104) = (0.938 − 0.772)/(1 − 0.772) = 0.73

There appears to be substantial agreement between the student and the experienced dentist in the coding of the children's tooth surfaces.

Data kindly provided by Dr R.D. Holt, Department of Transcultural Oral Health, Eastman Dental Institute for Oral Health Care Sciences, University College London, London, UK.

³Bland, J.M. & Altman, D.G. (1986) Statistical methods for assessing agreement between two pairs of clinical measurement. Lancet, i, 307–310.
⁴Streiner, D.L. & Norman, G.R. (1990) Health Measurement Scales: A Practical Guide to their Development and Use. Oxford University Press, Oxford.
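Cohen's kappa can be computed directly from the diagonal observed and expected frequencies and the overall total, as in the example above. The short sketch below is our own illustration of that calculation.

```python
def cohens_kappa(observed_diagonal, expected_diagonal, m):
    """Unweighted Cohen's kappa from the diagonal frequencies of a square table."""
    od, ed = sum(observed_diagonal), sum(expected_diagonal)
    return (od / m - ed / m) / (1 - ed / m)

# Diagonal frequencies from Table 36.1 (observed and expected) with m = 2104 surfaces
kappa = cohens_kappa([1785, 154, 20, 14], [1602.1, 21.3, 0.5, 0.3], 2104)
print(round(kappa, 2))   # approximately 0.73, i.e. substantial agreement
```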

Example 2 Assessing agreement—numerical variable
The Rosenberg self-esteem index is used to judge a patient's evaluation of his or her own self-esteem. The maximum value of the index (an indication of high self-esteem) for a person is 50, comprising the sum of the individual values from 10 questions, each scored from zero to five. Part of a study that examined the effectiveness of a particular type of surgery for facial deformity examined the change in a patient's psychological profile by comparing the values of the Rosenberg index in the patient before and after surgery. The investigators were concerned about the extent to which the Rosenberg score would be reliable for a set of patients, and decided to assess the repeatability of the measure on the first 25 patients requesting treatment for facial deformity. They obtained a value for the Rosenberg index when the patient initially presented at the clinic and then asked the patient for a second assessment 4 weeks later. The results are shown in Table 36.2 (the pre-treatment values, first and second, of the Rosenberg index obtained on the 25 patients).

The differences (first value − second value) can be shown to be approximately Normally distributed; they have a mean, d̄ = 0.56, and standard deviation, sd = 1.83. The test statistic for the paired t-test is equal to 1.53 (degrees of freedom = 24), giving P = 0.14. This non-significant result indicates that there is no evidence of any bias.

The British Standards Institution repeatability coefficient is 2sd = 2 × 1.83 = 3.7. Approximately 95% of the differences in the population of such patients would be expected to lie between d̄ ± 2sd, i.e. between −3.1 and 4.3. These limits are indicated in Fig. 36.1, which shows that the differences are randomly scattered around a mean of approximately zero. On the basis of these results, the investigators felt that the Rosenberg index was reliable, and used it to evaluate the patients' perceptions of the effectiveness of the facial surgery.

Fig. 36.1 Difference between the first and second Rosenberg self-esteem values plotted against their average for the 25 patients, with the mean difference and the upper and lower limits of agreement marked (× indicates three coincident points).

Adapted from: Cunningham, S.J., Hunt, N.P., Feinmann, C. (1996) Perceptions of outcome following orthognathic surgery. British Journal of Oral and Maxillofacial Surgery, 34, 210–213.
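The bias test and the limits of agreement above are straightforward to compute. The sketch below is our own, and works from the summary statistics quoted in the example rather than from the individual paired values (which are not reproduced here).

```python
import math
from scipy import stats

n = 25
mean_diff, sd_diff = 0.56, 1.83        # summary statistics quoted in Example 2

# Paired t-test for bias (two-sided)
t = mean_diff / (sd_diff / math.sqrt(n))
p = 2 * stats.t.sf(abs(t), df=n - 1)

repeatability = 2 * sd_diff                                  # BSI repeatability coefficient
limits = (mean_diff - 2 * sd_diff, mean_diff + 2 * sd_diff)  # limits of agreement

print(f"t = {t:.2f}, P = {p:.2f}")
print(f"repeatability coefficient = {repeatability:.1f}, "
      f"limits of agreement = ({limits[0]:.1f}, {limits[1]:.1f})")
```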

37 Evidence-based medicine

Sackett et al.¹ describe evidence-based medicine (EBM) as 'the conscientious, explicit and judicious use of current best evidence in making decisions about the care of individual patients'. To practice EBM, you must be able to locate the research relevant to the care of your patients, and judge its quality. Only then can you think about applying the findings in clinical practice.

Sackett et al. suggest the following approach to EBM. For convenience, we have phrased the third and fourth points below in terms of clinical trials (Topic 14) and observational studies (Topics 15 and 16), but they can be modified to suit other forms of investigations (e.g. diagnostic tests, Topic 35).

1 Formulate the problem
You must decide what is of interest to you—how you define the patient population, which intervention (e.g. treatment) or comparison is relevant, and what outcome you are looking at (e.g. reduced mortality).

2 Locate the relevant information (e.g. on diagnosis, prognosis or therapy)
Often the relevant information will be found in published papers, but you should also consider other possibilities, such as conference abstracts. You must know what databases (e.g. Medline) and other sources of evidence are available, how they are organized, which search terms to use, and how to operate the searching software.

3 Critically appraise the methods in order to assess the validity (closeness to the truth) of the evidence
The following questions should be asked.
• Have all important outcomes been considered?
• Was the study conducted using an appropriate spectrum of patients?
• Do the results make biological sense?
• Was the study designed to eliminate bias? For example, in a clinical trial, was the study controlled, was randomization used in the assignment of patients, was the assessment of response 'blind', were any patients lost to follow-up, were the groups treated in similar fashion, aside from the fact that they received different treatments, and was an 'intention-to-treat' analysis performed?
• Are the statistical methods appropriate (e.g. have underlying assumptions been verified; have dependencies in the data (e.g. pairing) been taken into account in the analysis)?

4 Extract the most useful results and determine whether they are important

Extracting the most useful results
You should ask the following questions:
(a) What is the main outcome variable (i.e. that which relates to the major objective)?
(b) How large is the effect of interest, expressed in terms of the main outcome variable? If this variable is:
• Binary (e.g. died/survived)
(i) What are the rates of occurrence of this event (e.g. death) in the (two) comparison groups?
(ii) The effect of interest may be the difference in rates (the absolute reduction in risk) or the ratio of rates (the relative risk or odds ratio)—what is its magnitude?
• Numerical (e.g. systolic blood pressure)
(i) What is the mean (or median) value of the variable in each of the comparison groups?
(ii) What is the effect of interest, i.e. the difference in means (medians)?
(c) How precise is the effect of interest? Ideally, the research being scrutinized should include the confidence interval for the true effect (a wide confidence interval is an indication of poor precision). Is this confidence interval quoted? If not, is sufficient information (e.g. the standard error of the effect of interest) provided so that the confidence interval can be determined?

Deciding whether the results are important
• Consider the confidence interval for the effect of interest (e.g. the difference in treatment means):
(i) Would you regard the observed effect clinically important (irrespective of whether or not the result of the relevant hypothesis test is statistically significant) if the lower limit of the confidence interval represented the true value of the effect?
(ii) Would you regard the observed effect clinically important if the upper limit of the confidence interval represented the true value of the effect?
(iii) Are your answers to the above two points sufficiently similar to declare the results of the study unambiguous and important?
• To assess therapy in a randomized controlled trial, evaluate the number of patients you need to treat (NNT) with the experimental treatment rather than the control treatment in order to prevent one of them developing the 'bad' outcome (such as post-partum haemorrhage, see Example). The NNT can be determined in various ways depending on the information available. It is, for example, the reciprocal of the difference in the proportions of individuals with the bad outcome in the control and experimental groups (see Example).

¹Sackett, D.L., Richardson, W.S., Rosenberg, W., Haynes, R.B. (1997) Evidence-based Medicine: How to Practice and Teach EBM. Churchill-Livingstone, London.

5 Apply the results in clinical practice
If the results are to help you in caring for your patients, you must ensure that:
• your patient is similar to those on whom the results were obtained;
• the results can be applied to your patient;
• all clinically important outcomes have been considered;
• the likely benefits are worth the potential harms and costs.

6 Evaluate your performance
Self-evaluation involves questioning your abilities to complete tasks 1 to 5 successfully. Are you then able to integrate the critical appraisal into clinical practice, and have you audited your performance? You should also ask yourself whether you have learnt from past experience so that you are now more efficient and are finding the whole process of EBM easier.

Example
The abstract of a randomized controlled trial, reproduced below, is annotated to show how its contents relate to the steps of critical appraisal (the spectrum of patients, the blinding of the technicians, the magnitude and precision of the effect of interest, and the importance of the findings for the individual).

Objective: To test the hypothesis that active management (prophylactic oxytocic within 2 minutes of the baby's birth, immediate cutting and clamping of the cord, delivery of the placenta by controlled cord traction or maternal effort) of the third stage of labour lowers the rates of primary postpartum haemorrhage (PPH) compared with expectant management (no prophylactic oxytocic, delivery of the placenta by maternal effort), in a setting where both managements are commonly practised, and that this effect is not mediated by maternal posture.

Subjects: 1512 women judged to be at low risk of PPH (blood loss > 500 mL) were randomly assigned to active or expectant management. Exclusion criteria were placenta praevia, previous PPH, antepartum haemorrhage after 20 weeks' gestation, anaemia, non-cephalic presentation, multiple pregnancy, intrauterine death, epidural anaesthesia, parity greater than five, uterine fibroid, oxytocin infusion, anticoagulant therapy, intended operative/instrumental delivery, duration of pregnancy less than 32 weeks. The trial profile is shown in Topic 14.

Design: A randomized controlled parallel group trial of active or expectant management. Women were also randomly assigned to upright or supine posture. The treatment allocation could not be concealed because active and expectant management require different actions on the part of both midwife and mother. The technicians who did the antenatal and postnatal blood tests were unaware of the allocation.

Findings: Analyses were by intention-to-treat. The rate of PPH was significantly lower with active than with expectant management (51 [6.8%] of 748 vs 126 [16.5%] of 764; relative risk 2.42 [95% CI 1.78–3.30], P < 0.0001). Posture had no effect on this risk (upright 92 [12%] of 755 vs supine … of 757). Objective measures of blood loss confirmed the results. There was more […] in the active group but no other important differences were detected.

Interpretation: Active management of the third stage of labour reduces the risk of PPH, whatever the woman's posture, even where both approaches are commonly practised. It is recommended that clinical guidelines advocate active management; care should take into account the effect of an intervention-free third stage on blood loss compared with active management.

The margin annotations highlight the magnitude of the effect of interest (the rate of PPH is 2.42, and could be up to 3.3, times greater with expectant management), the precision of the main effect (the 95% confidence interval) and the importance of the findings for the individual: from the proportions with PPH (0.068 and 0.165), NNT = 1/(0.165 − 0.068), i.e. we need to treat approximately 10 women with active management to prevent one PPH.

Adapted from Rogers, J., Wood, J.,
McCandlish, R., Ayers, S., Truesdale, A., Elbourne, D. (1998) Active versus expectant management of third stage of labour: the Hinchingbrooke randomised controlled trial. Lancet, 351, 693–699, with permission.
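The NNT used in the annotation follows directly from the event rates in the two arms. Below is a minimal sketch of the calculation; the function and its naming are our own.

```python
def two_arm_summary(events_control, n_control, events_treated, n_treated):
    """Absolute risk reduction, relative risk and NNT for a two-arm trial."""
    risk_control = events_control / n_control
    risk_treated = events_treated / n_treated
    arr = risk_control - risk_treated          # absolute risk reduction
    rr = risk_control / risk_treated           # relative risk (control vs treated)
    nnt = 1 / arr                              # number needed to treat
    return risk_control, risk_treated, arr, rr, nnt

# Hinchingbrooke trial: PPH in 126 of 764 with expectant vs 51 of 748 with active management
print(two_arm_summary(126, 764, 51, 748))      # NNT is approximately 10
```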

38 Systematic reviews and meta-analysis

The systematic review
What is it?
A systematic review¹ is a formalized and stringent process of combining the information from all relevant studies (both published and unpublished) of the same health condition; these studies are usually clinical trials (Topic 14) of the same or similar treatments but may be observational studies (Topics 15 and 16). Clearly, a systematic review is an integral part of evidence-based medicine (EBM; Topic 37), which applies the results of the best available evidence, together with clinical expertise, to the care of patients. So important is its role in EBM that it has become the focus of an international network of clinicians, methodologists and consumers who have formed the Cochrane Collaboration. They have created the Cochrane Controlled Trials Register, and publish continually updated systematic reviews in various forms (e.g. on CD-ROM).

What does it achieve?
• Refinement and reduction—large quantities of information are refined and reduced to a manageable size.
• Efficiency—the systematic review is usually quicker and less costly to perform than a new study. It may prevent others embarking on unnecessary studies, and can shorten the time lag between medical developments and their implementation.
• Generalizability and consistency—results can often be generalized to a wider patient population in a broader setting than would be possible from a single study. Consistencies in the results from different studies can be assessed, and any inconsistencies determined.
• Reliability—the systematic review aims to reduce errors, and so tends to improve the reliability and accuracy of recommendations when compared with haphazard reviews or single studies.
• Power and precision—the quantitative systematic review (see meta-analysis) has greater power (Topic 18) to detect effects of interest and provides more precise estimates of them than a single study.

Meta-analysis
What is it?
A meta-analysis or overview is a particular type of systematic review that focuses on the numerical results. The main aim of a meta-analysis is to combine the results from individual studies to produce, if appropriate, an estimate of the overall or average effect of interest (e.g. the relative risk, RR; Topic 15). The direction and magnitude of this average effect, together with a consideration of the associated confidence interval and hypothesis test result, can be used to make decisions about the therapy under investigation and the management of patients.

Statistical approach
1. We decide on the effect of interest and, if the raw data are available, evaluate it for each study. However, in practice, we may have to extract these effects from published results. If the outcome in a clinical trial comparing two treatments is:
• numerical—the effect may be the difference in treatment means. A zero difference implies no treatment effect;
• binary (e.g. died/survived)—we consider the risks of the outcome (e.g. death) in the treatment groups. The effect may be the difference in risks or their ratio, the RR. If the difference in risks equals zero or RR = 1, then there is no treatment effect.
2. Obtain an estimate of statistical heterogeneity and check for statistical homogeneity—we have statistical heterogeneity when there is considerable variation between the estimates of the effect of interest from the different studies. We can measure it, and perform a hypothesis test to investigate whether the individual estimates are compatible (i.e. homogeneous). If there is significant statistical heterogeneity, we should proceed cautiously, investigate the reasons for its presence and modify our approach accordingly.
3. Estimate the average effect of interest (with a confidence interval), and perform the appropriate hypothesis test on the effect (e.g. that the true RR = 1)—you may come across the terms 'fixed-effects' and 'random-effects' models in this context. Although the underlying concepts are beyond the scope of this book, note that we generally use a fixed-effects model if there is no evidence of statistical heterogeneity, and a random-effects model otherwise.
4. Interpret the results and present the findings—it is helpful to summarize the results from each trial (e.g. the sample size, baseline characteristics, effect of interest such as the RR, and related confidence intervals, CI) in a table (see Example). The most common graphical display is a forest plot (Fig. 38.1) in which the estimated effect (with CI) for each trial, and their average, are marked along the length of a vertical line which represents 'no treatment effect' (e.g. this line corresponds to the value 'one' if the effect is a RR). Initially, we examine whether the estimated effects from the different studies are on the same side of the line.

¹Chalmers, I. & Altman, D.G. (eds) (1995) Systematic Reviews. British Medical Journal Publishing Group, London.
Although the underlying concepts and so tends to improve the reliability and accuracy of rec- are beyond the scope of this book, note that we generally ommendations when compared with haphazard reviews or use a fixed-effects model if there is no evidence of statistical single studies. heterogeneity, and a random-effects model otherwise. 4. Interpret the results and present the findings-it is Power and precision- the quantitative systematic review helpful to summarize the results from each trial (e.g. the (see meta-analysis) has greater power (Topic 18) to detect sample size, baseline characteristics, effect of interest such effects of interest and provides more precise estimates of as the RR, and related confidence intervals, CI) in a table them than a single study. (see Example). The most common graphical display is a forest plot (Fig.38.1) in which the estimated effect (with CI) Meta-analysis for each trial, and their average, are marked along the length of a vertical line which represents 'no treatment What is it? effect' (e.g. this line corresponds to the value 'one' if the A meta-analysis or overview is a particular type of sys- effect is a RR). Initially,we examine whether the estimated tematic review that focuses on the numerical results. The main aim of a meta-analysis is to combine the results from individual studies to produce, if appropriate, an estimate of 1Chalmers, I. &Altman,D.G. (eds) (1995) Systematic Reviews. British Medical Journal Publishing Group, London.

Then we can use the CIs to judge whether the results are compatible (if the CIs overlap), to determine whether incompatible results can be explained by small sample sizes (if the CIs are wide) and to assess the significance of the individual and overall effects (by observing whether the vertical line crosses some or all of the CIs).

Advantages and disadvantages
As a meta-analysis is a particular form of systematic review, it offers all the advantages of the latter (see 'What does it achieve?'). In particular, a meta-analysis, because of its inflated sample size, is able to detect treatment effects with greater power and estimate these effects with greater precision than any single study. Its advantages, together with the introduction of meta-analysis software, have led meta-analyses to proliferate. However, improper use can lead to erroneous conclusions regarding treatment efficacy. The following principal problems should be thoroughly investigated and resolved before a meta-analysis is performed.
• Publication bias—the tendency to include in the analysis only the results from published papers; these favour statistically significant findings.
• Clinical heterogeneity—in which differences in the patient population, outcome measures, definition of variables, and/or duration of follow-up of the studies included in the analysis create problems of non-compatibility.
• Quality differences—the design and conduct of the studies may vary in their quality. Although giving more weight to the better studies is one solution to this dilemma, any weighting system can be criticized on the grounds that it is arbitrary.
• Dependence—the results from studies included in the analysis may not be independent, e.g. when results from a study are published on more than one occasion.

Example
A patient with severe angina will often be eligible for either percutaneous transluminal coronary angioplasty (PTCA) or coronary artery bypass graft (CABG) surgery. Results from eight published randomized trials were combined in a collaborative meta-analysis of 3371 patients (1661 CABG, 1710 PTCA) with a mean follow-up of 2.7 years. The main features of the trials are shown in Table 38.1. Results for the composite endpoint of cardiac death plus non-fatal myocardial infarction (MI) in the first year of follow-up are shown in Fig. 38.1. The estimated relative

Table 38.1 Characteristics of eight randomized trials comparing percutaneous transluminal coronary angioplasty and coronary artery bypass graft. For each trial, the table gives the country, the principal investigator, whether single- or multi-vessel disease was studied, the numbers of patients randomized to CABG and to PTCA, and the years of follow-up. The trials are: the Coronary Angioplasty versus Bypass Revascularisation Investigation (CABRI); the Randomised Intervention on Treatment of Angina Trial (RITA); the Emory Angioplasty versus Surgery Trial (EAST); the German Angioplasty Bypass Surgery Investigation (GABI); the Toulouse trial (Toulouse); the Medicine, Angioplasty or Surgery Study (MASS); the Lausanne trial (Lausanne); and the Argentine Trial of PTCA versus CABG (ERACI).
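Step 3 of the statistical approach (estimating the average effect) is often carried out by applying inverse-variance weights to the log relative risks. The sketch below is our own illustration of a fixed-effects pooling of this kind; the three trials in it are placeholders, not the results of the eight PTCA/CABG trials.

```python
import math

def fixed_effects_pooled_rr(trials):
    """trials: list of (events_1, n_1, events_2, n_2); returns pooled RR and 95% CI."""
    weights, log_rrs = [], []
    for e1, n1, e2, n2 in trials:
        log_rr = math.log((e1 / n1) / (e2 / n2))
        var = 1 / e1 - 1 / n1 + 1 / e2 - 1 / n2    # approximate variance of log RR
        weights.append(1 / var)
        log_rrs.append(log_rr)
    pooled = sum(w * lr for w, lr in zip(weights, log_rrs)) / sum(weights)
    se = math.sqrt(1 / sum(weights))
    return (math.exp(pooled),
            math.exp(pooled - 1.96 * se),
            math.exp(pooled + 1.96 * se))

# Placeholder data: (events, n) in each arm for three hypothetical trials
print(fixed_effects_pooled_rr([(10, 100, 15, 100), (8, 120, 14, 118), (20, 250, 26, 255)]))
```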

