

Earl R. Babbie, The Practice of Social Research

Published by dinakan, 2021-08-12 20:16:58

Description: This e-book is for reading purposes only and is not for profit.


Chapter 16: Statistical Analyses

FIGURE 16-2 A Scattergram of the Values of Two Variables with Regression Line Added (Hypothetical)

could estimate crime rates of cities if we knew their populations.

To improve your guessing, you construct a regression line, stated in the form of a regression equation that permits the estimation of values on one variable from values on the other. The general format for this equation is $Y' = a + b(X)$, where $a$ and $b$ are computed values, $X$ is a given value on one variable, and $Y'$ is the estimated value on the other. The values of $a$ and $b$ are computed to minimize the differences between actual values of $Y$ and the corresponding estimates ($Y'$) based on the known value of $X$. The sum of squared differences between actual and estimated values of $Y$ is called the unexplained variation because it represents errors that still exist even when estimates are based on known values of $X$.

The explained variation is the difference between the total variation and the unexplained variation. Dividing the explained variation by the total variation produces a measure of the proportionate reduction of error corresponding to the similar quantity in the computation of lambda. In the present case, this quantity is the correlation squared: $r^2$. Thus, if $r = 0.7$, then $r^2 = 0.49$, meaning that about half the variation has been explained. In practice, we compute $r$ rather than $r^2$, because the product-moment correlation can take either a positive or a negative sign, depending on the direction of the relationship between the two variables. (Computing $r^2$ and taking a square root would always produce a positive quantity.) You can consult any standard statistics textbook for the method of computing $r$, although I anticipate that most readers using this measure will have access to computer programs designed for this function.

Unfortunately (or perhaps fortunately), social life is so complex that the simple linear regression model often does not sufficiently represent the state of affairs. As we saw in Chapter 14, it's possible, using percentage tables, to analyze more than two variables. As the number of variables increases, such tables become increasingly complicated and hard to read. The regression model offers a useful alternative in such cases.

Multiple Regression

Very often, social researchers find that a given dependent variable is affected simultaneously by several independent variables. Multiple regression

analysis provides a means of analyzing such situations. This was the case when Beverly Yerg (1981) set about studying teacher effectiveness in physical education. She stated her expectations in the form of a multiple regression equation:

$F = b_0 + b_1 I + b_2 X_1 + b_3 X_2 + b_4 X_3 + b_5 X_4 + e$, where

$F$ = Final pupil-performance score
$I$ = Initial pupil-performance score
$X_1$ = Composite of guiding and supporting practice
$X_2$ = Composite of teacher mastery of content
$X_3$ = Composite of providing specific, task-related feedback
$X_4$ = Composite of clear, concise task presentation
$b$ = Regression weight
$e$ = Residual

(Adapted from Yerg 1981: 42)

Notice that in place of the single X variable in a linear regression, there are several X's, and there are also several b's instead of just one. Also, Yerg has chosen to represent $a$ as $b_0$ in this equation but with the same meaning as discussed previously. Finally, the equation ends with a residual factor ($e$), which represents the variance in Y that is not accounted for by the X variables analyzed.

Beginning with this equation, Yerg calculated the values of the several b's to show the relative contributions of the several independent variables in determining final student-performance scores. She also calculated the multiple-correlation coefficient as an indicator of the extent to which all six variables predict the final scores. This follows the same logic as the simple bivariate correlation discussed earlier, and it's traditionally reported as a capital R. In this case, $R = 0.877$, meaning that 77 percent of the variance ($0.877^2 = 0.77$) in final scores is explained by the six variables acting in concert.

multiple regression analysis: A form of statistical analysis that seeks the equation representing the impact of two or more independent variables on a single dependent variable.

Partial Regression

In exploring the elaboration model in Chapter 15, we paid special attention to the relationship between two variables when a third test variable was held constant. Thus, we might examine the effect of education on prejudice with age held constant, testing the independent effect of education. To do so, we would compute the tabular relationship between education and prejudice separately for each age group.

Partial regression analysis is based on this same logical model. The equation summarizing the relationship between variables is computed on the basis of the test variables remaining constant. As in the case of the elaboration model, the result may then be compared with the uncontrolled relationship between the two variables to clarify further the overall relationship.

partial regression analysis: A form of regression analysis in which the effects of one or more variables are held constant, similar to the logic of the elaboration model.

Curvilinear Regression

Up to now, we've been discussing the association among variables as represented by a straight line. The regression model is even more general than our discussion thus far has implied.

You may already know that curvilinear functions, as well as linear ones, can be represented by equations. For example, the equation $X^2 + Y^2 = 25$ describes a circle with a radius of 5. Raising variables to powers greater than 1 has the effect of producing curves rather than straight lines. In the real world there is no reason to assume that the relationship among every set of variables will be linear. In some cases, then, curvilinear regression analysis can provide a better understanding of empirical relationships than any linear model can.

curvilinear regression analysis: A form of regression analysis that allows relationships among variables to be expressed with curved geometric lines instead of straight ones.

Recall, however, that a regression line serves two functions. It describes a set of empirical observations, and it provides a general model for

making inferences about the relationship between two variables in the general population that the observations represent. A very complex equation might produce an erratic line that would indeed pass through every individual point. In this sense, it would perfectly describe the empirical observations. There would be no guarantee, however, that such a line could adequately predict new observations or that it in any meaningful way represented the relationship between the two variables in general. Thus, it would have little or no inferential value.

Earlier in this book, we discussed the need for balancing detail and utility in data reduction. Ultimately, researchers attempt to provide the most faithful, yet also the simplest, representation of their data. This practice also applies to regression analysis. Data should be presented in the simplest fashion that best describes the actual data; as such, linear regressions are the ones most frequently used. Curvilinear regression analysis adds a new option to the researcher in this regard, but it does not solve the problems altogether. Nothing does that.

Cautions in Regression Analysis

The use of regression analysis for statistical inferences is based on the same assumptions made for correlational analysis: simple random sampling, the absence of nonsampling errors, and continuous interval data. Because social science research seldom completely satisfies these assumptions, you should use caution in assessing the results in regression analyses.

Also, regression lines (linear or curvilinear) can be useful for interpolation (estimating cases lying between those observed), but they are less trustworthy when used for extrapolation (estimating cases that lie beyond the range of observations). This limitation on extrapolations is important in two ways. First, you're likely to come across regression equations that seem to make illogical predictions. An equation linking population and crimes, for example, might seem to suggest that small towns with, say, a population of 1,000 should produce 123 crimes a year. This failure in predictive ability does not disqualify the equation but dramatizes that its applicability is limited to a particular range of population sizes. Second, researchers sometimes overstep this limitation, drawing inferences that lie outside their range of observation, and you'd be right in criticizing them for that.

The preceding sections have introduced some of the techniques for measuring associations among variables at different levels of measurement. Matters become slightly more complex when the two variables represent different levels of measurement. Though we aren't going to pursue this issue in this textbook, "Measures of Association and Levels of Measurement," by Peter Nardi, may be a useful resource if you ever have to address such situations.

Inferential Statistics

Many, if not most, social science research projects involve the examination of data collected from a sample drawn from a larger population. A sample of people may be interviewed in a survey; a sample of divorce records may be coded and analyzed; a sample of newspapers may be examined through content analysis. Researchers seldom if ever study samples just to describe the samples per se; in most instances, their ultimate purpose is to make assertions about the larger population from which the sample has been selected. Frequently, then, you'll wish to interpret your univariate and multivariate sample findings as the basis for inferences about some population.

This section examines inferential statistics: the statistical measures used for making inferences from findings based on sample observations to a larger population. We'll begin with univariate data and move to multivariate.

inferential statistics: The body of statistical computations relevant to making inferences from findings based on sample observations to some larger population.

Univariate Inferences

Chapter 14 dealt with methods of presenting univariate data. Each summary measure was intended as a method of describing the sample studied. Now

we'll use such measures to make broader assertions about a population. This section addresses two univariate measures: percentages and means.

Measures of Association and Levels of Measurement
Peter Nardi, Pitzer College

Note that this table itself is set up with the dependent variables in the rows and the independent variable in the columns, as tables are commonly organized. Also, notice that the levels of measurement are themselves an ordinal scale. If you want to use an interval/ratio-level variable in a crosstab, you must first recode it into an ordinal-level variable.

Independent variable (columns): Nominal, Ordinal, Interval/Ratio
Dependent variable (rows): Nominal, Ordinal, Interval/Ratio
Measures in the table: Crosstabs, Chi square, Lambda, Gamma, Kendall's tau, Somers' d, Means, t-test, ANOVA, Correlate, Pearson r, Regression (R)

If 50 percent of a sample of people say they had colds during the past year, 50 percent is also our best estimate of the proportion of colds in the total population from which the sample was drawn. (This estimate assumes a simple random sample, of course.) It's rather unlikely, however, that precisely 50 percent of the population had colds during the year. If a rigorous sampling design for random selection has been followed, however, we'll be able to estimate the expected range of error when the sample finding is applied to the population.

Chapter 7, on sampling theory, covered the procedures for making such estimates, so I'll only review them here. In the case of a percentage, the quantity

$\sqrt{\frac{p \times q}{n}}$

where $p$ is a proportion, $q$ equals $(1 - p)$, and $n$ is the sample size, is called the standard error. As noted in Chapter 7, this quantity is very important in the estimation of sampling error. We may be 68 percent confident that the population figure falls within plus or minus one standard error of the sample figure; we may be 95 percent confident that it falls within plus or minus two standard errors; and we may be 99.9 percent confident that it falls within plus or minus three standard errors.
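The arithmetic behind a statement of sampling error can be checked with a short sketch. It assumes a simple random sample and uses the rounded multipliers given above (one, two, or three standard errors). The function names and the illustrative sample of 1,600 people, half of whom report colds, are assumptions made here for the example.

```python
import math


def standard_error(p, n):
    """Standard error of a sample proportion p from a simple random sample of size n."""
    q = 1 - p
    return math.sqrt(p * q / n)


def confidence_interval(p, n, num_standard_errors=2):
    """Return (low, high): p plus or minus the given number of standard errors."""
    margin = num_standard_errors * standard_error(p, n)
    return p - margin, p + margin


# Illustrative values (assumed): 50 percent of a sample of 1,600 report colds.
p, n = 0.50, 1600
se = standard_error(p, n)
low, high = confidence_interval(p, n, num_standard_errors=2)

print(f"standard error: {se:.4f}")
print(f"two standard errors either side: {low:.3f} to {high:.3f}")
```

With these values the standard error is 0.0125, so two standard errors either side of 50 percent gives a range of about 47.5 to 52.5 percent.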

Any statement of sampling error, then, must contain two essential components: the confidence level (for example, 95 percent) and the confidence interval (for example, ±2.5 percent). If 50 percent of a sample of 1,600 people say they had colds during the year, we might say we're 95 percent confident that the population figure is between 47.5 percent and 52.5 percent.

In this example we've moved beyond simply describing the sample into the realm of making estimates (inferences) about the larger population. In doing so, we must take care in several ways.

First, the sample must be drawn from the population about which inferences are being made. A sample taken from a telephone directory cannot legitimately be the basis for statistical inferences about the population of a city, but only about the population of telephone subscribers with listed numbers.

Second, the inferential statistics assume several things. To begin with, they assume simple random sampling, which is virtually never the case in sample surveys. The statistics also assume sampling with replacement, which is almost never done, but this is probably not a serious problem. Although systematic sampling is used more frequently than random sampling, it, too, probably presents no serious problem if done correctly. Stratified sampling, because it improves representativeness, clearly presents no problem. Cluster sampling does present a problem, however, because the estimates of sampling error may be too small. Quite clearly, street-corner sampling does not warrant the use of inferential statistics. Finally, the calculation of standard error in sampling assumes a 100 percent completion rate, that is, that everyone in the sample completed the survey. The seriousness of this problem increases as the completion rate decreases.

Third, inferential statistics are addressed to sampling error only, not nonsampling error such as coding errors or misunderstandings of questions by respondents. Thus, although we might state correctly that between 47.5 and 52.5 percent of the population (95 percent confidence) would report having colds during the previous year, we couldn't so confidently guess the percentage who had actually had them. Because nonsampling errors are probably larger than sampling errors in a respectable sample design, we need to be especially cautious in generalizing from our sample findings to the population.

nonsampling error: Those imperfections of data quality that are a result of factors other than sampling error. Examples include misunderstandings of questions by respondents, erroneous recordings by interviewers and coders, and keypunch errors.

Tests of Statistical Significance

There is no scientific answer to the question of whether a given association between two variables is significant, strong, important, interesting, or worth reporting. Perhaps the ultimate test of significance rests in your ability to persuade your audience (present and future) of the association's significance. At the same time, there is a body of inferential statistics to assist you in this regard called parametric tests of significance. As the name suggests, parametric statistics are those that make certain assumptions about the parameters describing the population from which the sample is selected. They allow us to determine the statistical significance of associations. "Statistical significance" does not imply "importance" or "significance" in any general sense. It refers simply to the likelihood that relationships observed in a sample could be attributed to sampling error alone. Researchers often distinguish between statistical significance and substantive significance in this regard, with the latter referring to whether the relationship between variables is big enough to make a meaningful difference. Whereas statistical significance can be calculated, substantive significance is always a judgment call.

statistical significance: A general term referring to the likelihood that relationships observed in a sample could be attributed to sampling error alone.

tests of statistical significance: A class of statistical computations that indicate the likelihood that the relationship observed between variables in a sample can be attributed to sampling error only.

Although tests of statistical significance are widely reported in social science literature, the logic underlying them is rather subtle and often misunderstood. Tests of significance are based on

the same sampling logic discussed elsewhere in this book. To understand that logic, let's return for a moment to the concept of sampling error in regard to univariate data.

Recall that a sample statistic normally provides the best single estimate of the corresponding population parameter, but the statistic and the parameter seldom correspond precisely. Thus, we report the probability that the parameter falls within a certain range (confidence interval). The degree of uncertainty within that range is due to normal sampling error. The corollary of such a statement is, of course, that it is improbable that the parameter would fall outside the specified range only as a result of sampling error. Thus, if we estimate that a parameter (99.9 percent confidence) lies between 45 percent and 55 percent, we say by implication that it is extremely improbable that the parameter is actually, say, 90 percent if our only error of estimation is due to normal sampling. This is the basic logic behind tests of statistical significance.

The Logic of Statistical Significance

I think I can illustrate the logic of statistical significance best in a series of diagrams representing the selection of samples from a population. Here are the elements in the logic:

1. Assumptions regarding the independence of two variables in the population study
2. Assumptions regarding the representativeness of samples selected through conventional probability-sampling procedures
3. The observed joint distribution of sample elements in terms of the two variables

FIGURE 16-3 A Hypothetical Population of Men and Women Who Either Favor or Oppose Sexual Equality

Figure 16-3 represents a hypothetical population of 256 people; half are women, half are men. The diagram also indicates how each person feels about seeing women as equal to men. In the diagram, those favoring equality have open circles, those opposing it have their circles filled in.
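The point that a sample statistic seldom matches the population parameter exactly, yet rarely strays far from it, can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions. It builds the hypothetical population of 256 people (half women, half men) and repeatedly draws simple random samples from it. The sample size of 64, the 1,000 repetitions, the random seed, and all function and variable names are choices made here for the example.

```python
import random


def sample_proportion_women(population, sample_size, rng):
    """Draw one simple random sample (without replacement) and return the share of women."""
    sample = rng.sample(population, sample_size)
    return sum(1 for person in sample if person == "woman") / sample_size


# Hypothetical population: 256 people, half women and half men.
population = ["woman"] * 128 + ["man"] * 128

SAMPLE_SIZE = 64    # assumed for illustration
NUM_SAMPLES = 1000  # assumed number of repeated samples
rng = random.Random(42)  # fixed seed so the run is repeatable

proportions = [
    sample_proportion_women(population, SAMPLE_SIZE, rng)
    for _ in range(NUM_SAMPLES)
]

mean_proportion = sum(proportions) / NUM_SAMPLES
exact_hits = sum(1 for p in proportions if p == 0.5)

print(f"mean sample proportion of women: {mean_proportion:.3f}")
print(f"samples matching the parameter exactly: {exact_hits} of {NUM_SAMPLES}")
print(f"smallest and largest sample proportions: {min(proportions):.3f}, {max(proportions):.3f}")
```

Across repeated samples the proportion of women clusters around the population value of 0.5, while any single sample usually misses it by a little. That gap is the normal sampling error that tests of significance take into account.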

The question we'll be investigating is whether there is any relationship between gender and feelings about equality for men and women. More specifically, we'll see if women are more likely to favor equality than men are, because women would presumably benefit more from it. Take a moment to look at Figure 16-3 and see what the answer to this question is.

The illustration in the figure indicates no relationship between gender and attitudes about equality. Exactly half of each group favors equality and half opposes it. Recall the earlier discussion of proportionate reduction of error. In this instance, knowing a person's gender would not reduce the "errors" we'd make in guessing his or her attitude toward equality. The table in Figure 16-3 provides a tabular view of what you can observe in the graphic diagram.

FIGURE 16-4 A Representative Sample

Figure 16-4 represents the selection of a one-fourth sample from the hypothetical population. In terms of the graphic illustration, a "square" selection from the center of the population provides a representative sample. Notice that our sample contains 16 of each type of person: Half are men and half are women; half of each gender favors equality, and the other half opposes it.

The sample selected in Figure 16-4 would allow us to draw accurate conclusions about the relationship between gender and equality in the larger population. Following the sampling logic we saw in Chapter 7, we'd note there was no relationship between gender and equality in the sample; thus, we'd conclude there was similarly no relationship in the larger population, because we've presumably selected a sample in accord with the conventional rules of sampling.

Of course, real-life samples are seldom such perfect reflections of the populations from which they are drawn. It would not be unusual for us to have selected, say, one or two extra men who

opposed equality and a couple of extra women who favored it, even if there was no relationship between the two variables in the population. Such minor variations are part and parcel of probability sampling, as we saw in Chapter 7.

FIGURE 16-5 An Unrepresentative Sample

Figure 16-5, however, represents a sample that falls far short of the mark in reflecting the larger population. Notice that it includes far too many supportive women and opposing men. As the table shows, three-fourths of the women in the sample support equality, but only one-fourth of the men do so. If we had selected this sample from a population in which the two variables were unrelated to each other, we'd be sorely misled by our sample.

As you'll recall, it's unlikely that a properly drawn probability sample would ever be as inaccurate as the one shown in Figure 16-5. In fact, if we actually selected a sample that gave us the results this one does, we'd look for a different explanation. Figure 16-6 illustrates the more likely situation.

Notice that the sample selected in Figure 16-6 also shows a strong relationship between gender and equality. The reason is quite different this time. We've selected a perfectly representative sample, but we see that there is actually a strong relationship between the two variables in the population at large. In this latest figure, women are more likely to support equality than men are: That's the case in the population, and the sample reflects it.

In practice, of course, we never know what's so for the total population; that's why we select samples. So if we selected a sample and found the

strong relationship presented in Figures 16-5 and 16-6, we'd need to decide whether that finding accurately reflected the population or was simply a product of sampling error.

FIGURE 16-6 A Representative Sample from a Population in Which the Variables Are Related

The fundamental logic of tests of statistical significance, then, is this: Faced with any discrepancy between the assumed independence of variables in a population and the observed distribution of sample elements, we may explain that discrepancy in either of two ways: (1) we may attribute it to an unrepresentative sample, or (2) we may reject the assumption of independence. The logic and statistics associated with probability sampling methods offer guidance about the varying probabilities of varying degrees of unrepresentativeness (expressed as sampling error). Most simply put, there is a high probability of a small degree of unrepresentativeness and a low probability of a large degree of unrepresentativeness.

The statistical significance of a relationship observed in a set of sample data, then, is always expressed in terms of probabilities. "Significant at the .05 level ($p \le .05$)" simply means that the probability that a relationship as strong as the observed one can be attributed to sampling error alone is no more than 5 in 100. Put somewhat differently, if two variables are independent of each other in the population, and if 100 probability samples are selected from that population, no more than 5 of those samples should provide a relationship as strong as the one that has been observed.

There is, then, a corollary to confidence intervals in tests of significance, which represents the probability of the measured associations being due only to sampling error. This is called the level of significance. Like confidence intervals, levels of significance are derived from a logical model in which several samples are drawn from a given

population. In the present case, we assume that there is no association between the variables in the population, and then we ask what proportion of the samples drawn from that population would produce associations at least as great as those measured in the empirical data. Three levels of significance are frequently used in research reports: .05, .01, and .001. These mean, respectively, that the chances of obtaining the measured association as a result of sampling error are 5/100, 1/100, and 1/1,000.

level of significance: In the context of tests of statistical significance, the degree of likelihood that an observed, empirical relationship could be attributable to sampling error. A relationship is significant at the .05 level if the likelihood of its being only a function of sampling error is no greater than 5 out of 100.

Researchers who use tests of significance normally follow one of two patterns. Some specify in advance the level of significance they'll regard as sufficient. If any measured association is statistically significant at that level, they'll regard it as representing a genuine association between the two variables. In other words, they're willing to discount the possibility of its resulting from sampling error only.

Other researchers prefer to report the specific level of significance for each association, disregarding the conventions of .05, .01, and .001. Rather than reporting that a given association is significant at the .05 level, they might report significance at the .023 level, indicating the chances of its having resulted from sampling error as 23 out of 1,000.

Chi Square

Chi square ($\chi^2$) is a frequently used test of significance in social science. It's based on the null hypothesis: the assumption that there is no relationship between two variables in the total population (as you may recall from Chapter 2). Given the observed distribution of values on the two separate variables, we compute the conjoint distribution that would be expected if there were no relationship between the two variables. The result of this operation is a set of expected frequencies for all the cells in the contingency table. We then compare this expected distribution with the distribution of cases actually found in the sample data, and we determine the probability that the discovered discrepancy could have resulted from sampling error alone. An example will illustrate this procedure.

Let's assume we're interested in the possible relationship between church attendance and gender for the members of a particular church. To test this relationship, we select a sample of 100 church members at random. We find that our sample is made up of 40 men and 60 women and that 70 percent of our sample say they attended church during the preceding week, whereas the remaining 30 percent say they did not.

If there is no relationship between gender and church attendance, then 70 percent of the men in the sample should have attended church during the preceding week, and 30 percent should have stayed away. Moreover, women should have attended in the same proportion. Table 16-6 (part I) shows that, based on this model, 28 men and 42 women would have attended church, with 12 men and 18 women not attending.

Part II of Table 16-6 presents the observed attendance for the hypothetical sample of 100 church members. Note that 20 of the men report having attended church during the preceding week, and the remaining 20 say they did not. Among the women in the sample, 50 attended church and 10 did not. Comparing the expected and observed frequencies (parts I and II), we note that somewhat fewer men attended church than expected, whereas somewhat more women attended than expected.

Chi square is computed as follows. For each cell in the tables, the researcher (1) subtracts the expected frequency for that cell from the observed frequency, (2) squares this quantity, and (3) divides the squared difference by the expected frequency. This procedure is carried out for each cell in the tables; part III of Table 16-6 presents the cell-by-cell computations. The several results are then added together to find the value of chi square: 12.70 in the example.

TABLE 16-6 A Hypothetical Illustration of Chi Square

I. Expected Cell Frequencies
                         Men    Women   Total
Attended church           28      42      70
Did not attend church     12      18      30
Total                     40      60     100

II. Observed Cell Frequencies
                         Men    Women   Total
Attended church           20      50      70
Did not attend church     20      10      30
Total                     40      60     100

III. (Observed − Expected)² ÷ Expected
                         Men    Women
Attended church          2.29    1.52
Did not attend church    5.33    3.56

$\chi^2 = 12.70$, $p < .001$

This value is the overall discrepancy between the observed conjoint distribution in the sample and the distribution we would expect if the two variables were unrelated to each other. Of course, the mere discovery of a discrepancy does not prove that the two variables are related, because normal sampling error might produce discrepancies even when there is no relationship in the total population. The magnitude of the value of chi square, however, permits us to estimate the probability of that having happened.

To determine the statistical significance of the observed relationship, we must use a standard set of chi square values. This will require the computation of the degrees of freedom, which refer to the possibilities for variation within a statistical model. Suppose I challenge you to find three numbers whose mean is 11. There are infinite solutions to this problem: (11, 11, 11), (10, 11, 12), (−11, 11, 33), and so on. Now, suppose I require that one of the numbers be 7. There would still be an infinite number of possibilities for the other two numbers.

If I told you one number had to be 7 and another 10, however, there would be only one possible value for the third. If the average of three numbers is 11, their sum must be 33. If two of the numbers total 17, the third must be 16. In this situation, we say there are two degrees of freedom. Two of the numbers could have any values we choose, but once they are specified, the third number is determined.

More generally, whenever we're examining the mean of N values, we can see that the degrees of freedom equal $N - 1$. Thus, in the case of the mean of 23 values, we could make 22 of them anything we liked, but the 23rd would then be determined.

A similar logic applies to bivariate tables, such as those analyzed by chi square. Consider a table reporting the relationship between two dichotomous variables: gender (men/women) and abortion attitude (approve/disapprove). Notice that the table provides the marginal frequencies of both variables.

Abortion Attitude    Men    Women   Total
Approve                              500
Disapprove                           500
Total                500     500   1,000

Despite the conveniently round numbers in this hypothetical example, notice that there are numerous possibilities for the cell frequencies. For example, it could be the case that all 500 men approve and all 500 women disapprove, or it could be just the reverse. Or there could be 250 cases













































































