
Elementary Statistics 10th Ed.

Published by Junix Kaalim, 2022-09-12 13:26:53

Description: Triola, Mario F.


Statistics @ Work

“It is possible to be a journalist and not be comfortable with statistics, but you’re definitely limited in what you can do.”

Joel B. Obermayer
Newspaper reporter for The News & Observer

Joel B. Obermayer writes about medical issues and health affairs for The News & Observer, a newspaper that covers the eastern half of North Carolina. He reports on managed care, public health, and research at academic medical centers including Duke University and the University of North Carolina at Chapel Hill.

What concepts of statistics do you use?

I use ideas like statistical significance, error rates, and probability. I don’t need to do anything incredibly sophisticated, but I need to be very comfortable with the math and with asking questions about it.

I use statistics to look at medical research to decide whether different studies are significant and to decide how to write about them. Mostly, I need to be able to read statistics and understand them, rather than develop them myself. I use statistics to develop good questions and to bolster the arguments I make in print. I also use statistics to decide if someone is trying to give me a positive spin on something that might be questionable. For example, someone at a local university once sent me a press release about miracle creams that supposedly help slim you down by dissolving fat cells. Well, I doubt that those creams work. The study wasn’t too hot either. They were trying to make claims based on a study of only 11 people. The researcher argued that 11 people were enough to make good empirical health claims. Not too impressive. People try to manipulate the media all the time. Good verifiable studies with good verifiable statistical bases make it easier to avoid being manipulated.

Is your use of probability and statistics increasing, decreasing, or remaining stable?

Increasing. People’s interest in new therapies that may only be in the clinical trial stage is increasing, partly because of the emphasis on AIDS research and on getting new drugs approved and delivered to patients sooner. It’s more important than ever for a medical writer to use statistics to make sure that the studies really prove what the public relations people say they prove.

Should prospective employees have studied some statistics?

It is possible to be a journalist and not be comfortable with statistics, but you’re definitely limited in what you can do. If you write about government-sponsored education programs and whether they’re effective, or if you write about the dangers of particular contaminants in the environment, you’re going to need to use statistics.

In my field, editors often don’t think about statistics in the interview process. They worry more about writing skills. A knowledge of statistics is more important for what you can do once you are on the job.

Chapter 7 Estimates and Sample Sizes

7-1 Overview
7-2 Estimating a Population Proportion
7-3 Estimating a Population Mean: $\sigma$ Known
7-4 Estimating a Population Mean: $\sigma$ Not Known
7-5 Estimating a Population Variance

CHAPTER PROBLEM

Does touch therapy work?

Many patients pay $25 to $50 for a session of touch therapy in which the touch therapist moves his or her hands within a few inches of the patient’s body without actually making physical contact. The objective is to cure a wide variety of medical conditions, including cancer, AIDS, asthma, heart disease, headaches, burns, and bone fractures. The basic theory is that a professionally trained touch therapist can detect poor alignments in the patient’s energy field, and can then reposition energy fields to create an energy balance that fosters the healing process.

When she was in the fourth grade, nine-year-old Emily Rosa chose the topic of touch therapy for a science fair project. She convinced 21 experienced touch therapists to participate in a simple test of their ability to detect a human energy field. Emily constructed a cardboard partition with two holes for hands. Each touch therapist would put both hands through the two holes, and Emily would place her hand just above one of the therapist’s hands; then the therapist was asked to identify the hand that Emily had selected. Emily used a coin toss to randomly select the hand to be used. This test was repeated 280 times. If the touch therapists really did have the ability to sense a human energy field, they should have identified the correct hand much more than 50% of the time. If they did not have the ability to detect the energy field and they just guessed, they should have been correct about 50% of the time. Here are the results that Emily obtained: Among the 280 trials, the touch therapists identified the correct hand 123 times, for a success rate of 44%. Emily, with the help of her mother, a statistician, and a physician, submitted her findings for publication in the prestigious Journal of the American Medical Association. After a careful and thorough review of the experimental design and results, the article “A Close Look at Therapeutic Touch” was published (Journal of the American Medical Association, Vol. 279, No. 13). Emily became the youngest researcher to be published in that magazine. And she also won a blue ribbon for her science fair project.

Let’s consider the key results from Emily’s project. Among the 280 trials, the touch therapists were correct 123 times. We have a sample proportion with $n = 280$ and $x = 123$. Arguments against the validity of the study might include the claim that the number of trials is too small to be meaningful, or that the touch therapists just had a bad day and, because of chance, they were not as successful as the population of all touch therapists. We will consider such issues in this chapter.

It should also be noted that Emily Rosa’s project was relatively simple. Remember, she did the project while she was a fourth-grade student. Emily’s project is the type of activity that could be conducted by any student in an introductory statistics course. After understanding concepts taught in the typical introductory statistics course, students have the ability to accomplish significant and meaningful work.

7-1 Overview

In this chapter we begin working with the true core of inferential statistics as we use sample data to make inferences about populations. Specifically, we will use sample data to make estimates of population parameters. For example, the Chapter Problem refers to touch therapists who correctly identified the human energy field in only 44% of 280 trials. Based on the sample statistic of 44%, we will estimate the percentage of correct identifications for the entire population of all touch therapists.

The two major applications of inferential statistics involve the use of sample data to (1) estimate the value of a population parameter, and (2) test some claim (or hypothesis) about a population. In this chapter we introduce methods for estimating values of these important population parameters: proportions, means, and variances. We also present methods for determining the sample sizes necessary to estimate those parameters. In Chapter 8 we will introduce the basic methods for testing claims (or hypotheses) that have been made about a population parameter.

7-2 Estimating a Population Proportion

Key Concept In this section we present important methods for using a sample proportion to estimate the value of a population proportion with a confidence interval. We also present methods for finding the size of the sample that is needed to estimate a population proportion. This section introduces general concepts that are used in the following sections and following chapters, so it is important to understand this section well.

A Study Strategy: The time devoted to this section will be well spent because we introduce the concept of a confidence interval, and that same general concept will be applied to the following sections of this chapter. We suggest that you first read this section with the limited objective of simply trying to understand what confidence intervals are, what they accomplish, and why they are needed.
Second, try to develop the ability to construct confidence interval estimates of population proportions. Third, learn how to interpret a confidence interval correctly. Fourth, read the section once again and try to understand the underlying theory. You will always enjoy much greater success if you understand what you are doing, instead of blindly applying mechanical steps in order to obtain an answer that may or may not make any sense.

This section will consider only cases in which the normal distribution can be used to approximate the sampling distribution of sample proportions. The following requirements apply to the methods of this section.

Requirements

1. The sample is a simple random sample.
2. The conditions for the binomial distribution are satisfied. That is, there is a fixed number of trials, the trials are independent, there are two categories of outcomes, and the probabilities remain constant for each trial. (See Section 5-3.)
3. There are at least 5 successes and at least 5 failures. (With $p$ and $q$ unknown, we estimate their values using the sample proportion, so this requirement is a way of verifying that $np \ge 5$ and $nq \ge 5$ are both satisfied, so the normal distribution is a suitable approximation to the binomial distribution. Also, there are procedures for dealing with situations in which the normal distribution is not a suitable approximation. See Exercise 51.)

Notation for Proportions

$p$ = population proportion
$\hat{p} = \dfrac{x}{n}$ = sample proportion of $x$ successes in a sample of size $n$
$\hat{q} = 1 - \hat{p}$ = sample proportion of failures in a sample of size $n$

Proportion, Probability, and Percent This section focuses on the population proportion $p$, but we can also work with probabilities or percentages. When working with a percentage, express it in decimal form. (For example, express 44% as 0.44, so that $\hat{p} = 0.44$.) If we want to estimate a population proportion with a single value, the best estimate is $\hat{p}$. Because $\hat{p}$ consists of a single value, it is called a point estimate.

Definition

A point estimate is a single value (or point) used to approximate a population parameter.

The sample proportion $\hat{p}$ is the best point estimate of the population proportion $p$.

We use $\hat{p}$ as the point estimate of $p$ because it is unbiased and is the most consistent of the estimators that could be used. It is unbiased in the sense that the distribution of sample proportions tends to center about the value of $p$; that is, sample proportions $\hat{p}$ do not systematically tend to underestimate $p$, nor do they systematically tend to overestimate $p$. (See Section 6-4.) The sample proportion $\hat{p}$ is the most consistent estimator in the sense that the standard deviation of sample proportions tends to be smaller than the standard deviation of any other unbiased estimators.

EXAMPLE Touch Therapy Success Rate In the Chapter Problem we noted that in 280 trials involving touch therapists, the correct hand was selected in 123 trials, so the success rate is $\hat{p} = 123/280 = 0.44$. Using these test results, find the best point estimate of the proportion of all correct selections that would be made if all touch therapists were tested.

SOLUTION Because the sample proportion is the best point estimate of the population proportion, we conclude that the best point estimate of $p$ is 0.44. (Using common sense, we might examine the design of the experiment and we might conclude that the true population proportion is actually 0.5, but using only the sample results leads to 0.44 as the best estimate of the population proportion $p$.)

Why Do We Need Confidence Intervals?

In the preceding example we saw that 0.44 was our best point estimate of the population proportion $p$, but we have no indication of just how good our best estimate is. Because a point estimate has the serious flaw of not revealing anything about how good it is, statisticians have cleverly developed another type of estimate. This estimate, called a confidence interval or interval estimate, consists of a range (or an interval) of values instead of just a single value.

Definition

A confidence interval (or interval estimate) is a range (or an interval) of values used to estimate the true value of a population parameter. A confidence interval is sometimes abbreviated as CI.

A confidence interval is associated with a confidence level, such as 0.95 (or 95%). The confidence level gives us the success rate of the procedure used to construct the confidence interval. The confidence level is often expressed as the probability or area $1 - \alpha$ (lowercase Greek alpha), where $\alpha$ is the complement of the confidence level. For a 0.95 (or 95%) confidence level, $\alpha = 0.05$. For a 0.99 (or 99%) confidence level, $\alpha = 0.01$.

Definition

The confidence level is the probability $1 - \alpha$ (often expressed as the equivalent percentage value) that is the proportion of times that the confidence interval actually does contain the population parameter, assuming that the estimation process is repeated a large number of times. (The confidence level is also called the degree of confidence, or the confidence coefficient.)

The most common choices for the confidence level are 90% (with $\alpha = 0.10$), 95% (with $\alpha = 0.05$), and 99% (with $\alpha = 0.01$). The choice of 95% is most common because it provides a good balance between precision (as reflected in the width of the confidence interval) and reliability (as expressed by the confidence level).

Here’s an example of a confidence interval based on the sample data of 280 trials of touch therapists, with 44% of the trials resulting in correct identification of the hand that was selected:

The 0.95 (or 95%) confidence interval estimate of the population proportion $p$ is $0.381 < p < 0.497$.

Small Sample

The Children’s Defense Fund was organized to promote the welfare of children. The group published Children Out of School in America, which reported that in one area, 37.5% of the 16- and 17-year-old children were out of school. This statistic received much press coverage, but it was based on a sample of only 16 children. Another statistic was based on a sample size of only 3 students. (See “Firsthand Report: How Flawed Statistics Can Make an Ugly Picture Look Even Worse,” American School Board Journal, Vol. 162.)

Interpreting a Confidence Interval

We must be careful to interpret confidence intervals correctly. There is a correct interpretation and many different and creative wrong interpretations of the confidence interval $0.381 < p < 0.497$.

Correct: “We are 95% confident that the interval from 0.381 to 0.497 actually does contain the true value of $p$.” This means that if we were to select many different samples of size 280 and construct the corresponding confidence intervals, 95% of them would actually contain the value of the population proportion $p$. (Note that in this correct interpretation, the level of 95% refers to the success rate of the process being used to estimate the proportion, and it does not refer to the population proportion itself.)

Wrong: “There is a 95% chance that the true value of $p$ will fall between 0.381 and 0.497.”

At any specific point in time, a population has a fixed and constant value $p$, and a confidence interval constructed from a sample either includes $p$ or does not. Similarly, if a baby has just been born and the doctor is about to announce its gender, it’s wrong to say that there is a probability of 0.5 that the baby is a girl; the baby is a girl or is not, and there’s no probability involved. A population proportion $p$ is like the baby that has been born: the value of $p$ is fixed, so the confidence interval limits either contain $p$ or do not, and that is why it’s wrong to say that there is a 95% chance that $p$ will fall between values such as 0.381 and 0.497.

A confidence level of 95% tells us that the process we are using will, in the long run, result in confidence interval limits that contain the true population proportion 95% of the time. Suppose that the true proportion of all correct hand identifications made by touch therapists is $p = 0.5$. Then the confidence interval obtained from the given sample data would not contain the population proportion, because the true population proportion 0.5 is not between 0.381 and 0.497. This is illustrated in Figure 7-1. Figure 7-1 shows typical confidence intervals resulting from 20 different samples. With 95% confidence, we expect that 19 out of 20 samples should result in confidence intervals that do contain the true value of $p$, and Figure 7-1 illustrates this with 19 of the confidence intervals containing $p$, while one confidence interval does not contain $p$.

Figure 7-1 Confidence Intervals from 20 Different Samples

Using Confidence Intervals for Comparisons

Caution: Confidence intervals can be used informally to compare different data sets, but the overlapping of confidence intervals should not be used for making formal and final conclusions about equality of proportions. (See “On Judging the Significance of Differences by Examining the Overlap Between Confidence Intervals,” by Schenker and Gentleman, American Statistician, Vol. 55, No. 3.)
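The long-run interpretation described above can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it supposes a hypothetical true proportion of 0.5 (pure guessing), draws repeated samples of 280 trials, builds the interval $\hat{p} \pm 1.96\sqrt{\hat{p}\hat{q}/n}$ for each sample, and reports the fraction of intervals that contain the true proportion. The function name, the seed, and the number of repetitions are choices made for this example and do not come from the text.

```python
import math
import random


def coverage_rate(p_true, n, z, repetitions, seed=0):
    """Estimate how often p_hat +/- z*sqrt(p_hat*q_hat/n) contains p_true."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(repetitions):
        # Simulate n independent trials, each a success with probability p_true.
        x = sum(1 for _ in range(n) if rng.random() < p_true)
        p_hat = x / n
        margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
        if p_hat - margin < p_true < p_hat + margin:
            hits += 1
    return hits / repetitions


# Assumed values: p = 0.5 (guessing), n = 280 trials, z = 1.96 for 95% confidence.
rate = coverage_rate(p_true=0.5, n=280, z=1.96, repetitions=2000)
print(f"Fraction of intervals containing p: {rate:.3f}")
```

The printed fraction should land near 0.95, although it varies a little with the seed and the number of repetitions. The 95% describes the procedure over many samples, not any single interval.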

Critical Values

The methods of this section and many of the other statistical methods found in the following chapters include the use of a standard z score that can be used to distinguish between sample statistics that are likely to occur and those that are unlikely. Such a z score is called a critical value (defined below). Critical values are based on the following observations:

1. We know from Section 6-6 that under certain conditions, the sampling distribution of sample proportions can be approximated by a normal distribution, as in Figure 7-2.
2. Sample proportions have a relatively small chance (with probability denoted by $\alpha$) of falling in one of the red tails of Figure 7-2.
3. Denoting the area of each shaded tail by $\alpha/2$, there is a total probability of $\alpha$ that a sample proportion will fall in either of the two red tails.
4. By the rule of complements (from Chapter 4), there is a probability of $1 - \alpha$ that a sample proportion will fall within the inner green-shaded region of Figure 7-2.
5. The z score separating the right-tail region is commonly denoted by $z_{\alpha/2}$, and is referred to as a critical value because it is on the borderline separating sample proportions that are likely to occur from those that are unlikely to occur.

Figure 7-2 Critical Value $z_{\alpha/2}$ in the Standard Normal Distribution. (Each tail has area $\alpha/2$, the center is at $z = 0$, and $z_{\alpha/2}$ is found from Table A-2 as the z score corresponding to an area of $1 - \alpha/2$.)

These observations can be formalized with the following notation and definition.

Notation for Critical Value

The critical value $z_{\alpha/2}$ is the positive z value that is at the vertical boundary separating an area of $\alpha/2$ in the right tail of the standard normal distribution. (The value of $-z_{\alpha/2}$ is at the vertical boundary for the area of $\alpha/2$ in the left tail.) The subscript $\alpha/2$ is simply a reminder that the z score separates an area of $\alpha/2$ in the right tail of the standard normal distribution.
Definition

A critical value is the number on the borderline separating sample statistics that are likely to occur from those that are unlikely to occur. The number $z_{\alpha/2}$ is a critical value that is a z score with the property that it separates an area of $\alpha/2$ in the right tail of the standard normal distribution. (See Figure 7-2.)

EXAMPLE Finding a Critical Value Find the critical value $z_{\alpha/2}$ corresponding to a 95% confidence level.

SOLUTION Caution: To find the critical z value for a 95% confidence level, do not look up 0.95 in the body of Table A-2. A 95% confidence level corresponds to $\alpha = 0.05$. See Figure 7-3, where we show that the area in each of the red-shaded tails is $\alpha/2 = 0.025$. We find $z_{\alpha/2} = 1.96$ by noting that all of the area to its left must be $1 - 0.025$, or 0.975. We can refer to Table A-2 and find that the area of 0.9750 (found in the body of the table) corresponds exactly to a z score of 1.96. For a 95% confidence level, the critical value is therefore $z_{\alpha/2} = 1.96$. To find the critical z score for a 95% confidence level, look up 0.9750 in the body of Table A-2, not 0.95.

Figure 7-3 Finding $z_{\alpha/2}$ for a 95% Confidence Level. (Confidence level: 95%. The total area to the left of the boundary at $z_{\alpha/2}$ is 0.975.)

The preceding example showed that a 95% confidence level results in a critical value of $z_{\alpha/2} = 1.96$. This is the most common critical value, and it is listed with two other common values in the table that follows.

Confidence Level    $\alpha$    Critical Value, $z_{\alpha/2}$
90%                 0.10        1.645
95%                 0.05        1.96
99%                 0.01        2.575

Margin of Error

When we collect a set of sample data, such as Emily Rosa’s touch therapy data given in the Chapter Problem (with 44% of 280 trials resulting in correct identifications), we can calculate the sample proportion $\hat{p}$ and that sample proportion is typically different from the population proportion $p$. The difference between the sample proportion and the population proportion can be thought of as an error. We now define the margin of error $E$ as follows.

Definition

When data from a simple random sample are used to estimate a population proportion $p$, the margin of error, denoted by $E$, is the maximum likely (with probability $1 - \alpha$) difference between the observed sample proportion $\hat{p}$ and the true value of the population proportion $p$. The margin of error $E$ is also called the maximum error of the estimate and can be found by multiplying the critical value and the standard deviation of sample proportions, as shown in Formula 7-1.

Formula 7-1 (margin of error for proportions)

$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$
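As a cross-check on the table lookup, the critical value and Formula 7-1 can be sketched in a few lines of Python. This is a minimal illustration, not part of the text: the function names are invented for the example, and `statistics.NormalDist` from the Python standard library stands in for Table A-2.

```python
import math
from statistics import NormalDist


def critical_value(confidence_level):
    """Return z_{alpha/2}: the z score with area alpha/2 to its right."""
    alpha = 1 - confidence_level
    # The cumulative area to the left of z_{alpha/2} is 1 - alpha/2.
    return NormalDist().inv_cdf(1 - alpha / 2)


def margin_of_error(p_hat, n, z):
    """Formula 7-1: E = z * sqrt(p_hat * q_hat / n)."""
    q_hat = 1 - p_hat
    return z * math.sqrt(p_hat * q_hat / n)


for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {critical_value(level):.3f}")
```

Software works from the exact normal distribution rather than a printed table, so the 99% value comes out as about 2.576 where the table lists 2.575. Results can therefore differ slightly in the third decimal place.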

Given the way that the margin of error $E$ is defined, there is a probability of $\alpha$ that the sample proportion will be in error by more than $E$.

Confidence Interval (or Interval Estimate) for the Population Proportion $p$

$$\hat{p} - E < p < \hat{p} + E \quad \text{where} \quad E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}$$

The confidence interval is often expressed in the following equivalent formats:

$$\hat{p} \pm E \quad \text{or} \quad (\hat{p} - E,\ \hat{p} + E)$$

In Chapter 4, when probabilities were given in decimal form, we rounded to three significant digits. We use that same rounding rule here.

Round-Off Rule for Confidence Interval Estimates of $p$

Round the confidence interval limits for $p$ to three significant digits.

Curbstoning

The glossary for the 2000 Census defines curbstoning as “the practice by which a census enumerator fabricates a questionnaire for a residence without actually visiting it.” Curbstoning occurs when a census enumerator sits on a curbstone (or anywhere else) and fills out survey forms by making up responses. Because data from curbstoning are not real, they can affect the validity of the Census. The extent of curbstoning has been investigated in several studies, and one study showed that about 4% of Census enumerators practiced curbstoning at least some of the time. The methods of Section 7-2 assume that the sample data have been collected in an appropriate way, so if much of the sample data have been obtained through curbstoning, then the resulting confidence interval estimates might be very flawed.

Based on the preceding results, we can summarize the procedure for constructing a confidence interval estimate of a population proportion $p$ as follows.

Procedure for Constructing a Confidence Interval for $p$

1. Verify that the requirements are satisfied. (The sample is a simple random sample, the conditions for the binomial distribution are satisfied, and there are at least 5 successes and at least 5 failures.)
2. Refer to Table A-2 and find the critical value $z_{\alpha/2}$ that corresponds to the desired confidence level. (For example, if the confidence level is 95%, the critical value is $z_{\alpha/2} = 1.96$.)
3. Evaluate the margin of error $E = z_{\alpha/2}\sqrt{\hat{p}\hat{q}/n}$.
4. Using the value of the calculated margin of error $E$ and the value of the sample proportion $\hat{p}$, find the values of $\hat{p} - E$ and $\hat{p} + E$. Substitute those values in the general format for the confidence interval: $\hat{p} - E < p < \hat{p} + E$ or $\hat{p} \pm E$ or $(\hat{p} - E,\ \hat{p} + E)$.
5. Round the resulting confidence interval limits to three significant digits.
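The five steps above can be sketched as one function. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the critical value comes from the Python standard library instead of Table A-2, and only the count requirement in step 1 can be checked from the numbers alone. Whether the sample is random and the trials are binomial has to be judged from the study design.

```python
import math
from statistics import NormalDist


def round_sig(value, digits=3):
    """Round to the given number of significant digits."""
    return float(f"{value:.{digits}g}")


def proportion_confidence_interval(x, n, confidence_level=0.95):
    """Return (lower, upper) limits for p, rounded to three significant digits."""
    # Step 1: at least 5 successes and at least 5 failures.
    # (Random sampling and the binomial conditions cannot be checked from x and n.)
    if x < 5 or n - x < 5:
        raise ValueError("need at least 5 successes and at least 5 failures")
    # Step 2: critical value z_{alpha/2}.
    alpha = 1 - confidence_level
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Step 3: margin of error E = z * sqrt(p_hat * q_hat / n).
    p_hat = x / n
    q_hat = 1 - p_hat
    margin = z * math.sqrt(p_hat * q_hat / n)
    # Steps 4 and 5: form the limits and round them.
    return round_sig(p_hat - margin), round_sig(p_hat + margin)


# Chapter Problem data: 123 correct identifications in 280 trials.
lower, upper = proportion_confidence_interval(123, 280, 0.95)
print(f"{lower} < p < {upper}")
```

With the Chapter Problem data this gives the same limits as the interval quoted earlier in the section, 0.381 < p < 0.497.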

EXAMPLE Touch Therapy Success Rate In the Chapter Problem we noted that touch therapists participated in 280 trials of their ability to sense a human energy field. In each trial, a touch therapist was asked to identify which hand was just below the hand of Emily Rosa. Among the 280 trials, there were 123 correct identifications. The sample results are $n = 280$, and $\hat{p} = 123/280 = 0.439286$. (Instead of using 0.44 for the sample proportion, we carry extra decimal places so that subsequent calculations will not be affected by a rounding error.)

a. Find the margin of error $E$ that corresponds to a 95% confidence level.
b. Find the 95% confidence interval estimate of the population proportion $p$.
c. Based on the results, what can we conclude about the effectiveness of touch therapy?

SOLUTION

REQUIREMENT We should first verify that the necessary requirements are satisfied. (Earlier in this section we listed the requirements for using a normal distribution as an approximation to a binomial distribution.) Given the design of the experiment, it is reasonable to assume that the sample is a simple random sample. The conditions for a binomial experiment are satisfied, because there is a fixed number of trials (280), the trials are independent (because the result from one trial doesn’t affect the probability of the result of another trial), there are two categories of outcome (correct, wrong), and the probability of being correct remains constant. Also, with 123 correct identifications in 280 trials, there are 157 wrong identifications, so the number of successes (123) and the number of failures (157) are both at least 5. The check of requirements has been successfully completed.

a. The margin of error is found by using Formula 7-1 with $z_{\alpha/2} = 1.96$ (as found in the preceding example), $\hat{p} = 0.439286$, $\hat{q} = 1 - 0.439286 = 0.560714$, and $n = 280$.

$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}} = 1.96\sqrt{\frac{(0.439286)(0.560714)}{280}} = 0.058133$$

b. Constructing the confidence interval is quite easy now that we have the values of $\hat{p}$ and $E$. We simply substitute those values to obtain this result:

$$\hat{p} - E < p < \hat{p} + E$$
$$0.439286 - 0.058133 < p < 0.439286 + 0.058133$$
$$0.381 < p < 0.497 \quad \text{(rounded to three significant digits)}$$

This same result could be expressed in the format of $0.439 \pm 0.058$ or (0.381, 0.497). If we want the 95% confidence interval for the true population percentage, we could express the result as $38.1\% < p < 49.7\%$. This confidence interval is often reported with a statement such as this: “It is estimated that the success rate is 44%, with a margin of error of plus or minus 6 percentage points.” That statement is a verbal expression of this format for the confidence interval: $44\% \pm 6\%$. The level of confidence should also be reported, but it rarely is in the media. The media typically use a 95% confidence level but omit any reference to it.

c. To interpret the results, note that pure guessing would result in about 50% of the identifications being correct. If the touch therapists actually had an ability to sense a human energy field, their success rate should have been greater than 50% by a significant amount. However, the touch therapists did slightly worse than what we might expect from coin tossing. Here is a statement from the article published in the Journal of the American Medical Association: “Their (touch therapists) failure to substantiate TT’s (touch therapy’s) most fundamental claim is unrefuted evidence that the claims of TT (touch therapy) are groundless and that further professional use is unjustified.” Based on the results from Emily Rosa’s science fair project, it appears that touch therapy is ineffective.

Rationale for the Margin of Error

Because the sampling distribution of proportions is approximately normal (because the conditions $np \ge 5$ and $nq \ge 5$ are both satisfied), we can use results from Section 6-6 to conclude that $\mu$ and $\sigma$ are given by $\mu = np$ and $\sigma = \sqrt{npq}$. Both of these parameters pertain to $n$ trials, but we convert them to a per-trial basis by dividing by $n$ as follows:

Mean of sample proportions:

$$\mu = \frac{np}{n} = p$$

Standard deviation of sample proportions:

$$\sigma = \frac{\sqrt{npq}}{n} = \sqrt{\frac{npq}{n^2}} = \sqrt{\frac{pq}{n}}$$

The first result may seem trivial, because we already stipulated that the true population proportion is $p$. The second result is nontrivial and is useful in describing the margin of error $E$, but we replace the product $pq$ by $\hat{p}\hat{q}$ because we don’t know the value of $p$ (it is the value we are trying to estimate). Formula 7-1 for the margin of error reflects the fact that $\hat{p}$ has a probability of $1 - \alpha$ of being within $z_{\alpha/2}\sqrt{pq/n}$ of $p$. The confidence interval for $p$, as given previously, reflects the fact that there is a probability of $1 - \alpha$ that $\hat{p}$ differs from $p$ by less than the margin of error $E = z_{\alpha/2}\sqrt{pq/n}$.
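The two per-trial results in this rationale, a mean of $p$ and a standard deviation of $\sqrt{pq/n}$, can be compared against simulated sample proportions. The sketch below is illustrative only: the population proportion of 0.5, the seed, the number of repetitions, and the function name are all assumptions made for the example.

```python
import math
import random
from statistics import mean, pstdev


def simulate_sample_proportions(p, n, repetitions, seed=0):
    """Return a list of sample proportions from repeated samples of size n."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    return [
        sum(1 for _ in range(n) if rng.random() < p) / n
        for _ in range(repetitions)
    ]


p, n = 0.5, 280  # assumed population proportion and sample size
proportions = simulate_sample_proportions(p, n, repetitions=2000)
observed_mean = mean(proportions)
observed_sd = pstdev(proportions)
theoretical_sd = math.sqrt(p * (1 - p) / n)

print(f"mean of sample proportions: {observed_mean:.4f} (theory: {p})")
print(f"SD of sample proportions:   {observed_sd:.4f} (theory: {theoretical_sd:.4f})")
```

The simulated values should sit close to the theoretical ones but will not match them exactly, because they come from a finite number of repetitions.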
Determining Sample Size

Suppose we want to collect sample data with the objective of estimating some population proportion. How do we know how many sample items must be obtained? If we take the expression for the margin of error $E$ (Formula 7-1), then solve for $n$, we get Formula 7-2. Formula 7-2 requires $\hat{p}$ as an estimate of the population proportion $p$, but if no such estimate is known (as is often the case), we replace $\hat{p}$ by 0.5 and replace $\hat{q}$ by 0.5, with the result given in Formula 7-3.

Sample Size for Estimating Proportion $p$

When an estimate $\hat{p}$ is known: Formula 7-2

$$n = \frac{[z_{\alpha/2}]^2\,\hat{p}\hat{q}}{E^2}$$

When no estimate $\hat{p}$ is known: Formula 7-3

$$n = \frac{[z_{\alpha/2}]^2 \cdot 0.25}{E^2}$$
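Formulas 7-2 and 7-3 can be sketched as a single function that switches on whether an estimate of $\hat{p}$ is supplied. This is a minimal sketch with assumed names and parameters. The margin of error is passed as a decimal (0.03, not 3), the critical value is passed in directly (for example 1.96 for 95% confidence), and a fractional result is rounded up so that the sample is at least as large as required.

```python
import math


def sample_size(margin_of_error, z, p_hat=None):
    """Sample size needed to estimate a proportion within the given margin of error.

    Uses Formula 7-2 when an estimate p_hat is supplied, and Formula 7-3
    (p_hat * q_hat replaced by 0.25) when it is not.
    """
    if p_hat is None:
        product = 0.25
    else:
        product = p_hat * (1 - p_hat)
    n = z ** 2 * product / margin_of_error ** 2
    # Round up so the sample is at least as large as required.
    return math.ceil(n)


# Assumed illustration: 95% confidence (z = 1.96), margin of error of 3 percentage points.
print(sample_size(0.03, 1.96))
```

Using 0.25 when no estimate is available is the cautious choice: the product $p(1 - p)$ is largest at $p = 0.5$, so Formula 7-3 never gives a smaller sample than Formula 7-2 would.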

Round-Off Rule for Determining Sample Size

In order to ensure that the required sample size is at least as large as it should be, if the computed sample size is not a whole number, round it up to the next higher whole number.

Use Formula 7-2 when reasonable estimates of $\hat{p}$ can be made by using previous samples, a pilot study, or someone’s expert knowledge. Otherwise, use Formula 7-3. Note that Formulas 7-2 and 7-3 do not include the population size $N$, so the size of the population is irrelevant. (Exception: When sampling is without replacement from a relatively small finite population. See Exercise 49.)

EXAMPLE Sample Size for E-Mail Survey

The ways that we communicate have been dramatically affected by the use of answering machines, fax machines, voice mail, and e-mail. Suppose a sociologist wants to determine the current percentage of U.S. households using e-mail. How many households must be surveyed in order to be 95% confident that the sample percentage is in error by no more than four percentage points?

a. Use this result from an earlier study: In 1997, 16.9% of U.S. households used e-mail (based on data from the World Almanac and Book of Facts).
b. Assume that we have no prior information suggesting a possible value of $\hat{p}$.

SOLUTION

a. The prior study suggests that $\hat{p} = 0.169$, so $\hat{q} = 0.831$ (found from $\hat{q} = 1 - 0.169$). With a 95% level of confidence, we have $\alpha = 0.05$, so $z_{\alpha/2} = 1.96$. Also, the margin of error is $E = 0.04$ (the decimal equivalent of “four percentage points”). Because we have an estimated value of $\hat{p}$, we use Formula 7-2 as follows:

$$n = \frac{[z_{\alpha/2}]^2 \, \hat{p}\hat{q}}{E^2} = \frac{[1.96]^2 (0.169)(0.831)}{0.04^2} = 337.194 \rightarrow 338 \text{ (rounded up)}$$

We must survey at least 338 randomly selected households.

b. As in part (a), we again use $z_{\alpha/2} = 1.96$ and $E = 0.04$, but with no prior knowledge of $\hat{p}$ (or $\hat{q}$), we use Formula 7-3 as follows:

$$n = \frac{[z_{\alpha/2}]^2 \cdot 0.25}{E^2} = \frac{[1.96]^2 \cdot 0.25}{0.04^2} = 600.25 \rightarrow 601 \text{ (rounded up)}$$

INTERPRETATION To be 95% confident that our sample percentage is within four percentage points of the true percentage for all households, we should randomly select and survey 601 households. By comparing this result to the sample size of 338 found in part (a), we can see that if we have no knowledge of a prior study, a larger sample is required to achieve the same results as when the value of $\hat{p}$ can be estimated. But now let’s use some common sense: We know that the use of e-mail is growing so rapidly that the 1997 estimate is too old to be of much use. Today, substantially more than 16.9% of households use e-mail. Realistically, we need a sample larger than 338 households. Assuming that we don’t really know the current rate of e-mail usage, we should randomly select 601 households. With 601 households, we will be 95% confident that we are within four percentage points of the true percentage of households using e-mail.

Common Errors

Try to avoid these two common errors when calculating sample size: (1) Don’t make the mistake of using $E = 4$ as the margin of error corresponding to “four percentage points.” (2) Be sure to substitute the critical z score for $z_{\alpha/2}$. For example, if you are working with 95% confidence, be sure to replace $z_{\alpha/2}$ with 1.96. Don’t make the mistake of replacing $z_{\alpha/2}$ with 0.95 or 0.05.

Population Size

Many people incorrectly believe that the sample size should be some percentage of the population, but Formula 7-3 shows that the population size is irrelevant. (In reality, the population size is sometimes used, but only in cases in which we sample without replacement from a relatively small population. See Exercise 49.) Polls commonly use sample sizes in the range of 1000 to 2000 and, even though such polls may involve a very small percentage of the total population, they can provide results that are quite good.

Finding the Point Estimate and E from a Confidence Interval

Sometimes we want to better understand a confidence interval that might have been obtained from a journal article, or it might have been generated using software or a calculator.
If we already know the confidence interval limits, the sample proportion $\hat{p}$ and the margin of error $E$ can be found as follows:

Point estimate of p:

$$\hat{p} = \frac{(\text{upper confidence limit}) + (\text{lower confidence limit})}{2}$$

Margin of error:

$$E = \frac{(\text{upper confidence limit}) - (\text{lower confidence limit})}{2}$$

EXAMPLE The article “High-Dose Nicotine Patch Therapy,” by Dale, Hurt, et al. (Journal of the American Medical Association, Vol. 274, No. 17) includes this statement: “Of the 71 subjects, 70% were abstinent from smoking at 8 weeks (95% confidence interval [CI], 58% to 81%).” Use that statement to find the point estimate $\hat{p}$ and the margin of error $E$.

SOLUTION From the given statement, we see that the 95% confidence interval is $0.58 < p < 0.81$. The point estimate $\hat{p}$ is the value midway between the upper and lower confidence interval limits, so we get

$$\hat{p} = \frac{(\text{upper confidence limit}) + (\text{lower confidence limit})}{2} = \frac{0.81 + 0.58}{2} = 0.695$$

The margin of error can be found as follows:

$$E = \frac{(\text{upper confidence limit}) - (\text{lower confidence limit})}{2} = \frac{0.81 - 0.58}{2} = 0.115$$

Better-Performing Confidence Intervals

Important note: The exercises for this section are based on the confidence interval described above, not the confidence intervals described in the following discussion.

The confidence interval described in this section has the format typically presented in introductory statistics courses, but it does not perform as well as some other confidence intervals. The adjusted Wald confidence interval performs better in the sense that its probability of containing the true population proportion $p$ is closer to the confidence level that is used. The adjusted Wald confidence interval uses this simple procedure: Add 2 to the number of successes $x$, add 2 to the number of failures (so that the number of trials $n$ is increased by 4), then find the confidence interval as described in this section. For example, if we use the methods of this section with $x = 10$ and $n = 20$, we get this 95% confidence interval: $0.281 < p < 0.719$. With $x = 10$ and $n = 20$ we use the adjusted Wald confidence interval by letting $x = 12$ and $n = 24$ to get this confidence interval: $0.300 < p < 0.700$. The chance that the confidence interval $0.300 < p < 0.700$ contains $p$ is closer to 95% than the chance that $0.281 < p < 0.719$ contains $p$.

Another confidence interval that performs better than the one described in this section and the adjusted Wald confidence interval is the Wilson score confidence interval. It has the lower confidence interval limit shown below, and the upper confidence interval limit is expressed by changing the minus sign to a plus sign. (It is easy to see why this approach is not used much in introductory courses.) Using $x = 10$ and $n = 20$, the 95% Wilson score confidence interval is $0.299 < p < 0.701$.
$$\frac{\hat{p} + \dfrac{z_{\alpha/2}^2}{2n} - z_{\alpha/2}\sqrt{\dfrac{\hat{p}\hat{q} + \dfrac{z_{\alpha/2}^2}{4n}}{n}}}{1 + \dfrac{z_{\alpha/2}^2}{n}}$$

For a discussion of these and other confidence intervals for $p$, see “Approximation Is Better than ‘Exact’ for Interval Estimation of Binomial Proportions,” by Agresti and Coull, American Statistician, Vol. 52, No. 2.
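The three intervals discussed above can be compared with a short Python sketch. The function names are illustrative, and the critical value is fixed at the rounded value 1.96 used throughout this section, so the functions as written apply to 95% confidence unless a different `z` is passed in.

```python
import math


def wald_interval(x, n, z=1.96):
    """Interval from this section: p_hat +/- z * sqrt(p_hat * q_hat / n)."""
    p = x / n
    e = z * math.sqrt(p * (1 - p) / n)
    return p - e, p + e


def adjusted_wald_interval(x, n, z=1.96):
    """Add 2 successes and 2 failures, then proceed as before."""
    return wald_interval(x + 2, n + 4, z)


def wilson_interval(x, n, z=1.96):
    """Wilson score interval: the minus sign gives the lower limit,
    the plus sign gives the upper limit."""
    p = x / n
    center = p + z ** 2 / (2 * n)
    spread = z * math.sqrt((p * (1 - p) + z ** 2 / (4 * n)) / n)
    denom = 1 + z ** 2 / n
    return (center - spread) / denom, (center + spread) / denom


for name, interval in [
    ("Wald", wald_interval),
    ("Adjusted Wald", adjusted_wald_interval),
    ("Wilson score", wilson_interval),
]:
    lower, upper = interval(10, 20)
    print(f"{name}: {lower:.3f} < p < {upper:.3f}")
```

For $x = 10$ and $n = 20$ the three printed intervals are 0.281 < p < 0.719, 0.300 < p < 0.700, and 0.299 < p < 0.701, matching the values quoted in the discussion.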

Using Technology for Confidence Intervals

STATDISK Select Analysis, then Confidence Intervals, then Proportion One Sample, and proceed to enter the requested items.

MINITAB Select Stat, Basic Statistics, then 1 Proportion. In the dialog box, click on the button for Summarized Data. Also click on the Options button and enter the desired confidence level (the default is 95%). Instead of using a normal approximation, Minitab’s default procedure is to determine the confidence interval limits by using an exact method. To use the normal approximation method presented in this section, click on the Options button and then click on the box with this statement: “Use test and interval based on normal distribution.”

EXCEL Use the Data Desk XL add-in that is a supplement to this book. First enter the number of successes in cell A1, then enter the total number of trials in cell B1. Click on DDXL and select Confidence Intervals, then select Summ 1 Var Prop Interval (which is an abbreviated form of “confidence interval for a proportion using summary data for one variable”). Click on the pencil icon for “Num successes” and enter A1. Click on the pencil icon for “Num trials” and enter B1. Click OK. In the dialog box, select the level of confidence, then click on Compute Interval.

TI-83/84 PLUS Press STAT, select TESTS, then select 1-PropZInt and proceed to enter the required items. The accompanying display shows the result for the confidence interval example in this section.

Using Technology for Sample Size Determination

STATDISK Select Analysis, then Sample Size Determination, then Estimate Proportion. Proceed to enter the required items in the dialog box.

Sample size determination is not available as a built-in function with Minitab, Excel, or the TI-83/84 Plus calculator.

7-2 BASIC SKILLS AND CONCEPTS

Statistical Literacy and Critical Thinking

1. Critical Value What is a critical value for a normal distribution?
2. Margin of Error What is a margin of error?
3. Confidence Interval When surveying 500 people, we get 200 yes responses to a particular question, so the proportion of yes responses from the whole population is estimated to be 0.4. Given that we have the estimated value of 0.4, why would we need a confidence interval? That is, what additional information does the confidence interval provide?
4. Sampling A student surveys 100 classmates by asking each if they have outstanding loans. After finding the sample proportion for this sample of $n = 100$ subjects, can the methods of this section be used to estimate the proportion of all adults who have outstanding loans? Why or why not?

Finding Critical Values. In Exercises 5–8, find the critical value $z_{\alpha/2}$ that corresponds to the given confidence level.

5. 99%
6. 90%
7. 98%
8. 99.5%
9. Express the confidence interval $0.222 < p < 0.444$ in the form of $\hat{p} \pm E$.
10. Express the confidence interval $0.600 < p < 0.800$ in the form of $\hat{p} \pm E$.
11. Express the confidence interval (0.206, 0.286) in the form of $\hat{p} \pm E$.
12. Express the confidence interval $0.337 \pm 0.050$ in the form of $\hat{p} - E < p < \hat{p} + E$.

Interpreting Confidence Interval Limits. In Exercises 13–16, use the given confidence interval limits to find the point estimate $\hat{p}$ and the margin of error $E$.

13. (0.868, 0.890)
14. $0.325 < p < 0.375$
15. $0.607 < p < 0.713$
16. $0.0144 < p < 0.0882$

Finding Margin of Error. In Exercises 17–20, assume that a sample is used to estimate a population proportion $p$. Find the margin of error $E$ that corresponds to the given statistics and confidence level.

17. $n = 500$, $x = 200$, 95% confidence
18. $n = 1200$, $x = 800$, 99% confidence
19. 98% confidence; the sample size is 1068, of which 25% are successes.
20. 90% confidence; the sample size is 2107, of which 65% are successes.

Constructing Confidence Intervals. In Exercises 21–24, use the sample data and confidence level to construct the confidence interval estimate of the population proportion $p$.

21. $n = 500$, $x = 200$, 95% confidence
22. $n = 1200$, $x = 800$, 99% confidence
23. $n = 1068$, $x = 267$, 98% confidence
24. $n = 4500$, $x = 2925$, 90% confidence

Determining Sample Size. In Exercises 25–28, use the given data to find the minimum sample size required to estimate a population proportion or percentage.

25. Margin of error: 0.020; confidence level: 95%; $\hat{p}$ and $\hat{q}$ unknown
26. Margin of error: 0.050; confidence level: 99%; $\hat{p}$ and $\hat{q}$ unknown
27. Margin of error: three percentage points; confidence level: 95%; from a prior study, $\hat{p}$ is estimated by the decimal equivalent of 27%.
28. Margin of error: five percentage points; confidence level: 90%; from a prior study, $\hat{p}$ is estimated by the decimal equivalent of 65%.

29. Gender Selection The Genetics and IVF Institute conducted a clinical trial of the XSORT method designed to increase the probability of conceiving a girl. As this book was being written, 325 babies were born to parents using the XSORT method, and 295 of them were girls. Use the sample data to construct a 99% confidence interval estimate of the percentage of girls born to parents using the XSORT method. Based on the result, does the XSORT method appear to be effective? Why or why not?

30. Gender Selection The Genetics and IVF Institute conducted a clinical trial of the YSORT method designed to increase the probability of conceiving a boy. As this book was being written, 51 babies were born to parents using the YSORT method, and 39 of them were boys. Use the sample data to construct a 99% confidence interval estimate of the percentage of boys born to parents using the YSORT method. Based on the result, does the YSORT method appear to be effective? Why or why not?

31. Postponing Death An interesting and popular hypothesis is that individuals can temporarily postpone their death to survive a major holiday or important event such as a birthday. In a study of this phenomenon, it was found that in the week before and the week after Thanksgiving, there were 12,000 total deaths, and 6062 of them occurred in the week before Thanksgiving (based on data from “Holidays, Birthdays, and Postponement of Cancer Death” by Young and Hade, Journal of the American Medical Association, Vol. 292, No. 24). Construct a 95% confidence interval estimate of the proportion of deaths in the week before Thanksgiving to the total deaths in the week before and the week after Thanksgiving. Based on the result, does there appear to be any indication that people can temporarily postpone their death to survive the Thanksgiving holiday? Why or why not?

32. Medical Malpractice An important issue facing Americans is the large number of medical malpractice lawsuits and the expenses that they generate. In a study of 1228 randomly selected medical malpractice lawsuits, it is found that 856 of them were later dropped or dismissed (based on data from the Physician Insurers Association of America). Construct a 99% confidence interval estimate of the proportion of medical malpractice lawsuits that are dropped or dismissed. Does it appear that the majority of such suits are dropped or dismissed?

33. Mendelian Genetics When Mendel conducted his famous genetics experiments with peas, one sample of offspring consisted of 428 green peas and 152 yellow peas.
a. Find a 95% confidence interval estimate of the percentage of yellow peas.
b. Based on his theory of genetics, Mendel expected that 25% of the offspring peas would be yellow. Given that the percentage of offspring yellow peas is not 25%, do the results contradict Mendel’s theory? Why or why not?

34. Misleading Survey Responses In a survey of 1002 people, 701 said that they voted in a recent presidential election (based on data from ICR Research Group). Voting records show that 61% of eligible voters actually did vote.
a. Find a 99% confidence interval estimate of the proportion of people who say that they voted.
b. Are the survey results consistent with the actual voter turnout of 61%? Why or why not?

35. Cell Phones and Cancer A study of 420,095 Danish cell phone users found that 135 of them developed cancer of the brain or nervous system. Prior to this study of cell phone use, the rate of such cancer was found to be 0.0340% for those not using cell phones. The data are from the Journal of the National Cancer Institute.
a. Use the sample data to construct a 95% confidence interval estimate of the percentage of cell phone users who develop cancer of the brain or nervous system.
b. Do cell phone users appear to have a rate of cancer of the brain or nervous system that is different from the rate of such cancer among those not using cell phones? Why or why not?

36. Cloning Survey A recent Gallup poll consisted of 1012 randomly selected adults who were asked whether “cloning of humans should or should not be allowed.” Results showed that 901 of those surveyed indicated that cloning should not be allowed. A news reporter wants to determine whether these survey results constitute strong evidence that the majority (more than 50%) of people are opposed to such cloning. Construct a 99% confidence interval estimate of the proportion of adults believing that cloning of humans should not be allowed. Based on that result, is there strong evidence supporting the claim that the majority is opposed to such cloning?

37. Bias in Jury Selection In the case of Castaneda v. Partida, it was found that during a period of 11 years in Hidalgo County, Texas, 870 people were selected for grand jury duty, and 39% of them were Mexican-Americans. Use the sample data to construct a 99% confidence interval estimate of the proportion of grand jury members who were Mexican-Americans. Given that among the people eligible for jury duty, 79.1% of them were Mexican-Americans, does it appear that the jury selection process was somehow biased against Mexican-Americans? Why or why not?

38. Detecting Fraud When working for the Brooklyn District Attorney, investigator Robert Burton analyzed the leading digits of amounts on checks from companies that were suspected of fraud. Among 784 checks, 61% had amounts with leading digits of 5. Construct a 99% confidence interval estimate of the proportion of checks having amounts with leading digits of 5. When checks are issued in the normal course of honest transactions, it is expected that 7.9% of the checks will have amounts with leading digits of 5. What does the confidence interval suggest?

39. Telephone Households In 1920 only 35% of U.S. households had telephones, but that rate is now much higher. A recent survey of 4276 randomly selected households showed that 94% of them had telephones (based on data from the U.S. Census Bureau). Using those survey results, construct a 99% confidence interval estimate of the proportion of households with telephones. Given that the survey involves only 4276 households out of 115 million households, do we really have enough evidence to say that the percentage of households with telephones is now more than the 35% rate in 1920?

40. Internet Shopping In a Gallup poll, 1025 randomly selected adults were surveyed and 29% of them said that they used the Internet for shopping at least a few times a year.
a. Find the point estimate of the percentage of adults who use the Internet for shopping.
b. Find a 99% confidence interval estimate of the percentage of adults who use the Internet for shopping.
c. If a traditional retail store wants to estimate the percentage of adult Internet shoppers in order to determine the maximum impact of Internet shoppers on its sales, what percentage of Internet shoppers should be used?

Determining Sample Size. In Exercises 41–44, find the minimum sample size required to estimate a population proportion or percentage.

41. Sample Size for Internet Purchases Many states are carefully considering steps that would help them collect sales taxes on items purchased through the Internet. How many randomly selected sales transactions must be surveyed to determine the percentage that transpired over the Internet? Assume that we want to be 99% confident that the sample percentage is within two percentage points of the true population percentage for all sales transactions.

42. Sample Size for Downloaded Songs The music industry must adjust to the growing practice of consumers downloading songs instead of buying CDs. It therefore becomes important to estimate the proportion of songs that are currently downloaded. How many randomly selected song purchases must be surveyed to determine the percentage that were obtained by downloading? Assume that we want to be 95% confident that the sample percentage is within one percentage point of the true population percentage of songs that are downloaded.

43. Nitrogen in Tires A recent campaign was designed to convince car owners that they should fill their tires with nitrogen instead of air. At a cost of about $5 per tire, nitrogen supposedly has the advantage of leaking at a much slower rate than air, so that the ideal tire pressure can be maintained more consistently. Before spending huge sums to advertise the nitrogen, it would be wise to conduct a survey to determine the percentage of car owners who would pay for the nitrogen. How many randomly selected car owners should be surveyed? Assume that we want to be 98% confident that the sample percentage is within three percentage points of the true percentage of all car owners who would be willing to pay for the nitrogen.

44. Sunroof and Side Air Bags Toyota provides an option of a sunroof and side air bag package for its Corolla model. This package costs $1400 ($1159 invoice price). Assume that prior to offering this option package, Toyota wants to determine the percentage of Corolla buyers who would pay $1400 extra for the sunroof and side air bags. How many Corolla buyers must be surveyed if we want to be 95% confident that the sample percentage is within four percentage points of the true percentage for all Corolla buyers?

Using Appendix B Data Sets. In Exercises 45–48, use the indicated data set from Appendix B.

45. Blue M&M Candies Refer to Data Set 13 in Appendix B and find the sample proportion of M&Ms that are blue. Use that result to construct a 95% confidence interval estimate of the population percentage of M&Ms that are blue. Is the result consistent with the 24% rate that is reported by the candy maker Mars?

46. Alcohol and Tobacco Use in Children’s Movies Refer to Data Set 5 in Appendix B.
a. Construct a 95% confidence interval estimate of the percentage of animated children’s movies showing any tobacco use.
b. Construct a 95% confidence interval estimate of the percentage of animated children’s movies showing any alcohol use.
c. Compare the preceding results. Does either tobacco or alcohol appear in a greater percentage of animated children’s movies?
d. In using the results from parts (a) and (b) as measures of the depiction of unhealthy habits, what important characteristic of the data is not included?

47. Precipitation in Boston Refer to Data Set 10 in Appendix B, and consider days with precipitation values different from 0 to be days with precipitation. Construct a 95% confidence interval estimate of the proportion of Wednesdays with precipitation, and also construct a 95% confidence interval estimate of the proportion of Sundays with precipitation. Compare the results. Does precipitation appear to occur more on either day?

48. Accuracy of Forecast Temperatures Refer to Data Set 8 in Appendix B. Construct a 95% confidence interval estimate of the proportion of days with an actual high temperature that is more than 2° different from the high temperature that was forecast one day before. Then construct a 95% confidence interval estimate of the proportion of days with an actual high temperature that is more than 2° different from the high temperature that was forecast five days before. Compare the results.

7-2 BEYOND THE BASICS

49. Using Finite Population Correction Factor This section presented Formulas 7-2 and 7-3, which are used for determining sample size. In both cases we assumed that the population is infinite or very large and that we are sampling with replacement. When we have a relatively small population with size $N$ and sample without replacement, we modify $E$ to include the finite population correction factor shown here, and we can solve for $n$ to obtain the result given here. Use this result to repeat Exercise 44, assuming that we limit our population to 1250 Toyota Corolla buyers in one region.

$$E = z_{\alpha/2}\sqrt{\frac{\hat{p}\hat{q}}{n}}\sqrt{\frac{N - n}{N - 1}} \qquad n = \frac{N\hat{p}\hat{q}\,[z_{\alpha/2}]^2}{\hat{p}\hat{q}\,[z_{\alpha/2}]^2 + (N - 1)E^2}$$

50. One-Sided Confidence Interval A one-sided confidence interval for $p$ can be expressed as $p < \hat{p} + E$ or $p > \hat{p} - E$, where the margin of error $E$ is modified by replacing $z_{\alpha/2}$ with $z_{\alpha}$. If Air America wants to report an on-time performance of at least x percent with 95% confidence, construct the appropriate one-sided confidence interval and then find the percent in question. Assume that a simple random sample of 750 flights results in 630 that are on time.

51. Confidence Interval from Small Sample Special tables are available for finding confidence intervals for proportions involving small numbers of cases, where the normal distribution approximation cannot be used. For example, given $x = 3$ successes among $n = 8$ trials, the 95% confidence interval found in Standard Probability and Statistics Tables and Formulae (CRC Press) is $0.085 < p < 0.755$. Find the confidence interval that would result if you were to use the normal distribution incorrectly as an approximation to the binomial distribution. Are the results reasonably close?

52. Interpreting Confidence Interval Limits Assume that a coin is modified so that it favors heads, and 100 tosses result in 95 heads. Find the 99% confidence interval estimate of the proportion of heads that will occur with this coin. What is unusual about the results obtained by the methods of this section? Does common sense suggest a modification of the resulting confidence interval?

53. Rule of Three Suppose $n$ trials of a binomial experiment result in no successes. According to the Rule of Three, we have 95% confidence that the true population proportion has an upper bound of $3/n$. (See “A Look at the Rule of Three,” by Jovanovic and Levy, American Statistician, Vol. 51, No. 2.)
a. If $n$ independent trials result in no successes, why can’t we find confidence interval limits by using the methods described in this section?
b. If 20 patients are treated with a drug and there are no adverse reactions, what is the 95% upper bound for $p$, the proportion of all patients who experience adverse reactions to this drug?

54. Poll Accuracy A New York Times article about poll results states, “In theory, in 19 cases out of 20, the results from such a poll should differ by no more than one percentage point in either direction from what would have been obtained by interviewing all voters in the United States.” Find the sample size suggested by this statement.

7-3 Estimating a Population Mean: σ Known

Key Concept

Section 7-2 introduced the point estimate and confidence interval as tools for using a sample proportion to estimate a population proportion, and this section presents methods for using sample data to find a point estimate and confidence interval estimate of a population mean. A key requirement in this section is that in addition to having sample data, we also know $\sigma$, the standard deviation of the population. This section also presents a method for finding the sample size that would be required to estimate a population mean.

Requirements

1. The sample is a simple random sample. (All samples of the same size have an equal chance of being selected.)
2. The value of the population standard deviation $\sigma$ is known.
3. Either or both of these conditions is satisfied: The population is normally distributed or $n > 30$.

Knowledge of σ

The above requirements include knowledge of the population standard deviation $\sigma$, but the following section presents methods for estimating a population mean without knowledge of the value of $\sigma$.

Normality Requirement

The requirements include the property that either the population is normally distributed or $n > 30$. If $n \le 30$, the population need not have a distribution that is exactly normal, but it should be approximately normal. We can consider the normality requirement to be satisfied if there are no outliers and a histogram of the sample data is not too far from being bell-shaped. (The methods of this section are said to be robust, which means that these methods are not strongly affected by departures from normality, provided that those departures are not too extreme.)

Sample Size Requirement

This section uses the normal distribution as the distribution of sample means. If the original population is not itself normally distributed, then we say that means of samples with size $n > 30$ have a distribution that can be approximated by a normal distribution. The condition that the sample size is $n > 30$ is commonly used as a guideline, but it is not possible to identify a specific minimum sample size that is sufficient for all cases. The minimum sample size actually depends on how much the population distribution departs from a normal distribution. Sample sizes of 15 to 30 are adequate if the population appears to have a distribution that is not far from being normal, but some other populations have distributions that are extremely far from normal and sample sizes of 50 or even 100 or higher might be necessary. We will use the simplified criterion of $n > 30$ as justification for treating the distribution of sample means as a normal distribution.

In Section 7-2 we saw that the sample proportion $\hat{p}$ is the best point estimate of the population proportion $p$. For similar reasons, the sample mean $\bar{x}$ is the best point estimate of the population mean $\mu$.

The sample mean $\bar{x}$ is the best point estimate of the population mean.

The sample mean $\bar{x}$ usually provides the best estimate, for the following two reasons:

1. For all populations, the sample mean $\bar{x}$ is an unbiased estimator of the population mean $\mu$, meaning that the distribution of sample means tends to center about the value of the population mean $\mu$. [That is, sample means do not systematically tend to overestimate the value of $\mu$, nor do they systematically tend to underestimate $\mu$. Instead, they tend to target the value of $\mu$ itself (as illustrated in Section 6-4).]
2. For many populations, the distribution of sample means $\bar{x}$ tends to be more consistent (with less variation) than the distributions of other sample statistics.

EXAMPLE Pulse Rates of Females Pulse rates of people are quite important. Without them, where would we be? Data Set 1 in Appendix B includes pulse rates (in beats per minute) of randomly selected women, and here are the statistics: $n = 40$, $\bar{x} = 76.3$, and $s = 12.5$. Use this sample to find the best point estimate of the population mean $\mu$ of pulse rates for all women.

SOLUTION For the sample data, $\bar{x} = 76.3$. Because the sample mean $\bar{x}$ is the best point estimate of the population mean $\mu$, we conclude that the best point estimate of the pulse rate for all women is 76.3.

Estimating Wildlife Population Sizes

The National Forest Management Act protects endangered species, including the northern spotted owl, with the result that the forestry industry was not allowed to cut vast regions of trees in the Pacific Northwest. Biologists and statisticians were asked to analyze the problem, and they concluded that survival rates and population sizes were decreasing for the female owls, known to play an important role in species survival. Biologists and statisticians also studied salmon in the Snake and Columbia Rivers in Washington State, and penguins in New Zealand. In the article “Sampling Wildlife Populations” (Chance, Vol. 9, No. 2), authors Bryan Manly and Lyman McDonald comment that in such studies, “biologists gain through the use of modeling skills that are the hallmark of good statistics. Statisticians gain by being introduced to the reality of problems by biologists who know what the crucial issues are.”

Confidence Intervals

Although a point estimate is the best single value for estimating a population parameter, it does not give us any indication of just how good the best estimate is. However, a confidence interval gives us information that enables us to better understand the accuracy of the estimate. The confidence interval is associated with a confidence level, such as 0.95 (or 95%). The confidence level gives us the success rate of the procedure used to construct the confidence interval. As in Section 7-2, $\alpha$ is the complement of the confidence level. For a 0.95 (or 95%) confidence level, $\alpha = 0.05$. For a 0.99 (or 99%) confidence level, $\alpha = 0.01$.

Margin of Error

When we collect a set of sample data, such as the set of 40 pulse rates of women listed in Data Set 1 from Appendix B, we can calculate the sample mean $\bar{x}$, and that sample mean is typically different from the population mean $\mu$. The difference between the sample mean and the population mean is an error. In Section 6-5 we saw that $\sigma/\sqrt{n}$ is the standard deviation of sample means. Using $\sigma/\sqrt{n}$ and the $z_{\alpha/2}$ notation introduced in Section 7-2, we now use the margin of error $E$ expressed as follows:

Formula 7-4: $$E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \qquad \text{margin of error for mean (based on known } \sigma)$$

Formula 7-4 reflects the fact that the sampling distribution of sample means $\bar{x}$ is exactly a normal distribution with mean $\mu$ and standard deviation $\sigma/\sqrt{n}$ whenever the population has a normal distribution with mean $\mu$ and standard deviation $\sigma$. If the population is not normally distributed, large samples yield sample means with a distribution that is approximately normal. (Formula 7-4 requires that you know the population standard deviation $\sigma$, but Section 7-4 will present a method for calculating the margin of error $E$ when $\sigma$ is not known.)

Using the margin of error $E$, we can now identify the confidence interval for the population mean $\mu$ (if the requirements for this section are satisfied). The three commonly used formats for expressing the confidence interval are shown in the following box.

Confidence Interval Estimate of the Population Mean μ (With σ Known)

$$\bar{x} - E < \mu < \bar{x} + E \quad \text{where} \quad E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}}$$

or

$$\bar{x} \pm E$$

or

$$(\bar{x} - E,\ \bar{x} + E)$$

Definition

The two values $\bar{x} - E$ and $\bar{x} + E$ are called confidence interval limits.

Captured Tank Serial Numbers Reveal Population Size

During World War II, Allied intelligence specialists wanted to determine the number of tanks Germany was producing. Traditional spy techniques provided unreliable results, but statisticians obtained accurate estimates by analyzing serial numbers on captured tanks. As one example, records show that Germany actually produced 271 tanks in June 1941. The estimate based on serial numbers was 244, but traditional intelligence methods resulted in the extreme estimate of 1550. (See “An Empirical Approach to Economic Intelligence in World War II,” by Ruggles and Brodie, Journal of the American Statistical Association, Vol. 42.)

Procedure for Constructing a Confidence Interval for μ (with Known σ)

1. Verify that the requirements are satisfied. (We have a simple random sample, $\sigma$ is known, and either the population appears to be normally distributed or $n > 30$.)
2. Refer to Table A-2 and find the critical value $z_{\alpha/2}$ that corresponds to the desired confidence level. (For example, if the confidence level is 95%, the critical value is $z_{\alpha/2} = 1.96$.)
3. Evaluate the margin of error $E = z_{\alpha/2} \cdot \sigma/\sqrt{n}$.
4. Using the value of the calculated margin of error $E$ and the value of the sample mean $\bar{x}$, find the values of $\bar{x} - E$ and $\bar{x} + E$. Substitute those values in the general format for the confidence interval: $\bar{x} - E < \mu < \bar{x} + E$ or $\bar{x} \pm E$ or $(\bar{x} - E,\ \bar{x} + E)$.
5. Round the resulting values by using the following round-off rule.

Round-Off Rule for Confidence Intervals Used to Estimate μ
1. When using the original set of data to construct a confidence interval, round the confidence interval limits to one more decimal place than is used for the original set of data.
2. When the original set of data is unknown and only the summary statistics (n, x̄, s) are used, round the confidence interval limits to the same number of decimal places used for the sample mean.

Interpreting a Confidence Interval

As in Section 7-2, be careful to interpret confidence intervals correctly. After obtaining a confidence interval estimate of the population mean μ, such as a 95% confidence interval of 72.4 < μ < 80.2, there is a correct interpretation and many wrong interpretations.

Correct: “We are 95% confident that the interval from 72.4 to 80.2 actually does contain the true value of μ.” This means that if we were to select many different samples of the same size and construct the corresponding confidence intervals, in the long run 95% of them would actually contain the value of μ. (As in Section 7-2, this correct interpretation refers to the success rate of the process being used to estimate the population mean.)

Wrong: Because μ is a fixed constant, it would be wrong to say “there is a 95% chance that μ will fall between 72.4 and 80.2.” It would also be wrong to say that “95% of all data values are between 72.4 and 80.2.” It would also be wrong to say that “95% of sample means fall between 72.4 and 80.2.”

EXAMPLE Pulse Rates of Females For the sample of pulse rates of women in Data Set 1 in Appendix B, we have n = 40 and x̄ = 76.3, and the sample is a simple random sample. Assume that σ is known to be 12.5. Using a 0.95 confidence level, find both of the following:
a. The margin of error E
b. The confidence interval for μ.

SOLUTION
REQUIREMENT We must first verify that the requirements are satisfied. The sample is a simple random sample.
The value of σ is assumed to be known (12.5). With n > 30, we satisfy the requirement that “the population is normally distributed or n > 30.” The requirements are therefore satisfied and we can proceed with the methods of this section.

a. The 0.95 confidence level implies that α = 0.05, so z_{α/2} = 1.96 (as was shown in an example in Section 7-2). The margin of error E is calculated by using Formula 7-4 as follows. Extra decimal places are used to minimize rounding errors in the confidence interval found in part (b).

\[ E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} = 1.96 \cdot \frac{12.5}{\sqrt{40}} = 3.8737901 \]

b. With x̄ = 76.3 and E = 3.8737901 we construct the confidence interval as follows:

\[ \bar{x} - E < \mu < \bar{x} + E \]
\[ 76.3 - 3.8737901 < \mu < 76.3 + 3.8737901 \]
\[ 72.4 < \mu < 80.2 \qquad \text{(rounded to one decimal place as in } \bar{x}\text{)} \]

INTERPRETATION This result could also be expressed as 76.3 ± 3.9 or as (72.4, 80.2). Based on the sample with n = 40, x̄ = 76.3 and σ assumed to be 12.5, the confidence interval for the population mean μ is 72.4 < μ < 80.2 and this interval has a 0.95 confidence level. This means that if we were to select many different random samples of 40 women and construct the confidence intervals as we did here, 95% of them would actually contain the value of the population mean μ.

Rationale for the Confidence Interval

The basic idea underlying the construction of confidence intervals relates to the central limit theorem, which indicates that if we collect simple random samples of the same size from a normally distributed population, sample means are normally distributed with mean μ and standard deviation σ/√n. If we collect simple random samples all of size n > 30 from any population, the distribution of sample means is approximately normal with mean μ and standard deviation σ/√n. The confidence interval format is really a variation of the equation that was already used with the central limit theorem. In the expression z = (x̄ − μ_{x̄})/σ_{x̄}, replace σ_{x̄} with σ/√n, replace μ_{x̄} with μ, then solve for μ to get

\[ \mu = \bar{x} - z \cdot \frac{\sigma}{\sqrt{n}} \]

Using the positive and negative values for z results in the confidence interval limits we are using.

Let’s consider the specific case of a 95% confidence level, so α = 0.05 and z_{α/2} = 1.96. For this case, there is a probability of 0.05 that a sample mean will be more than 1.96 standard deviations (or z_{α/2} · σ/√n, which we denote by E) away from the population mean μ. Conversely, there is a 0.95 probability that a sample mean will be within 1.96 standard deviations (or z_{α/2} · σ/√n) of μ. (See Figure 7-4.)
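The long-run claim in this rationale can be checked numerically. The following is a minimal Python sketch, not part of the text's procedure: it recomputes the pulse-rate interval from Formula 7-4 and then simulates many samples to see how often the interval captures μ. The function names, the simulated population mean of 75.0, the number of trials, and the seed are all illustrative assumptions. The critical value comes from the standard library's `NormalDist().inv_cdf`, so z_{α/2} is 1.959964 rather than the rounded table value 1.96, and E differs slightly from 3.8737901 in the later decimal places.

```python
import math
import random
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Return (E, lower, upper) for a mean when the population sigma is known."""
    alpha = 1.0 - confidence
    # Critical value z_{alpha/2}: an area of alpha/2 lies in the right tail.
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    margin = z_crit * sigma / math.sqrt(n)
    return margin, x_bar - margin, x_bar + margin


def coverage_rate(mu, sigma, n, confidence=0.95, trials=4000, seed=1):
    """Fraction of simulated intervals that contain the true mean mu.

    Samples are drawn from a normal population (an assumption of this sketch).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        _, lower, upper = z_confidence_interval(x_bar, sigma, n, confidence)
        if lower < mu < upper:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Pulse-rate example: n = 40, sample mean 76.3, sigma assumed to be 12.5.
    margin, lower, upper = z_confidence_interval(76.3, 12.5, 40)
    print(f"E = {margin:.4f}, interval = ({lower:.1f}, {upper:.1f})")
    # Hypothetical population mean of 75.0, used only for the simulation.
    print(f"simulated coverage = {coverage_rate(75.0, 12.5, 40):.3f}")
```

The simulated coverage should land near 0.95, which is the success rate of the procedure and not a probability statement about any single interval.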
If the sample mean x̄ is within z_{α/2} · σ/√n of the population mean μ, then μ must be between x̄ − z_{α/2} · σ/√n and x̄ + z_{α/2} · σ/√n; this is expressed in the general format of our confidence interval (with z_{α/2} · σ/√n denoted as E): x̄ − E < μ < x̄ + E.

[Figure 7-4 Distribution of Sample Means with Known σ. There is a 1 − α probability that a sample mean will be in error by less than E, or z_{α/2} · σ/√n. There is a probability of α that a sample mean will be in error by more than E (in one of the red tails, each with area α/2).]

Determining Sample Size Required to Estimate μ

We now address this important question: When we plan to collect a simple random sample of data that will be used to estimate a population mean μ, how many sample values must be obtained? For example, suppose we want to estimate the mean weight of airline passengers (an important value for reasons of safety). How many passengers must be randomly selected and weighed? Determining the size of a simple random sample is a very important issue, because samples that are needlessly large waste time and money, and samples that are too small may lead to poor results. If we begin with the expression for the margin of error E (Formula 7-4) and solve for the sample size n, we get the following.

Sample Size for Estimating Mean μ

Formula 7-5
\[ n = \left[ \frac{z_{\alpha/2}\,\sigma}{E} \right]^2 \]
where
z_{α/2} = critical z score based on the desired confidence level
E = desired margin of error
σ = population standard deviation

Formula 7-5 is remarkable because it shows that the sample size does not depend on the size (N) of the population; the sample size depends on the desired confidence level, the desired margin of error, and the value of the standard deviation σ. (See Exercise 40 for dealing with cases in which a relatively large sample is selected without replacement from a finite population.)

The sample size must be a whole number, because it represents the number of sample values that must be found. However, Formula 7-5 usually gives a result that is not a whole number, so we use the following round-off rule. (It is based on the principle that when rounding is necessary, the required sample size should be rounded upward so that it is at least adequately large as opposed to slightly too small.)

Round-Off Rule for Sample Size n
When finding the sample size n, if the use of Formula 7-5 does not result in a whole number, always increase the value of n to the next larger whole number.
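Formula 7-5 and its round-off rule can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed implementation: the function name and its `z_crit` parameter are hypothetical, and the inputs in the usage lines (σ = 15, E = 2 or 4, 95% confidence) are the values used in the IQ example of this section. When `z_crit` is omitted, the critical value is computed with the standard library's `NormalDist().inv_cdf` instead of being read from Table A-2.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin, confidence=0.95, z_crit=None):
    """Formula 7-5 with the round-up rule: n = ceil((z * sigma / E) ** 2).

    sigma  : population standard deviation (or a conservative stand-in)
    margin : desired margin of error E
    z_crit : optional table value such as 1.96; computed if omitted
    """
    if sigma <= 0 or margin <= 0:
        raise ValueError("sigma and margin must be positive")
    if z_crit is None:
        alpha = 1.0 - confidence
        z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    raw = (z_crit * sigma / margin) ** 2
    # Round up so the sample is at least adequately large.
    return math.ceil(raw)


if __name__ == "__main__":
    print(sample_size_for_mean(15, 2, z_crit=1.96))  # table value of z
    print(sample_size_for_mean(15, 4, z_crit=1.96))  # doubled margin of error
    print(sample_size_for_mean(15, 2))               # computed z
```

Note that the population size N does not appear anywhere in the function, which mirrors the remark above about Formula 7-5.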

Dealing with Unknown σ When Finding Sample Size

When applying Formula 7-5, there is a practical dilemma: The formula requires that we substitute some value for the population standard deviation σ, but in reality, it is usually unknown. When determining a required sample size (not constructing a confidence interval), here are some ways that we can work around this problem:

1. Use the range rule of thumb (see Section 3-3) to estimate the standard deviation as follows: σ ≈ range/4. (With a sample of 87 or more values randomly selected from a normally distributed population, range/4 will yield a value that is greater than or equal to σ at least 95% of the time. See “Using the Sample Range as a Basis for Calculating Sample Size in Power Calculations,” by Richard Browne, American Statistician, Vol. 55, No. 4.)
2. Conduct a pilot study by starting the sampling process. Start the sample collection process and, using the first several values, calculate the sample standard deviation s and use it in place of σ. The estimated value of σ can then be improved as more sample data are obtained, and the sample size can be refined accordingly.
3. Estimate the value of σ by using the results of some other study that was done earlier.

In addition, we can sometimes be creative in our use of other known results. For example, IQ tests are typically designed so that the mean is 100 and the standard deviation is 15. Statistics professors have IQ scores with a mean greater than 100 and a standard deviation less than 15 (because they are a more homogeneous group than people randomly selected from the general population). We do not know the specific value of σ for statistics professors, but we can play it safe by using σ = 15. Using a value for σ that is larger than the true value will make the sample size larger than necessary, but using a value for σ that is too small would result in a sample size that is inadequate.
When calculating the sample size n, any errors should always be conservative in the sense that they make n too large instead of too small.

EXAMPLE IQ Scores of Statistics Professors Assume that we want to estimate the mean IQ score for the population of statistics professors. How many statistics professors must be randomly selected for IQ tests if we want 95% confidence that the sample mean is within 2 IQ points of the population mean?

SOLUTION The values required for Formula 7-5 are found as follows:

z_{α/2} = 1.96 (This is found by converting the 95% confidence level to α = 0.05, then finding the critical z score as described in Section 7-2.)
E = 2 (Because we want the sample mean to be within 2 IQ points of μ, the desired margin of error is 2.)
σ = 15 (See the discussion in the paragraph that immediately precedes this example.)

With z_{α/2} = 1.96, E = 2, and σ = 15, we use Formula 7-5 as follows:

\[ n = \left[ \frac{z_{\alpha/2}\,\sigma}{E} \right]^2 = \left[ \frac{1.96 \cdot 15}{2} \right]^2 = 216.09 \]

so n = 217 (rounded up).

INTERPRETATION Among the thousands of statistics professors, we need to obtain a simple random sample of at least 217 of them, then we need to get their IQ scores. With a simple random sample of only 217 statistics professors, we will be 95% confident that the sample mean x̄ is within 2 IQ points of the true population mean μ.

If we are willing to settle for less accurate results by using a larger margin of error, such as 4, the sample size drops to 54.0225, which is rounded up to 55. Doubling the margin of error causes the required sample size to decrease to one-fourth its original value. Conversely, halving the margin of error quadruples the sample size. Consequently, if you want more accurate results, the sample size must be substantially increased. Because large samples generally require more time and money, there is often a need for a tradeoff between the sample size and the margin of error E.

Using Technology

Confidence Intervals See the end of Section 7-4 for the confidence interval procedures that apply to the methods of this section as well as those of Section 7-4. STATDISK, Minitab, Excel, and the TI-83/84 Plus calculator can all be used to find confidence intervals when we want to estimate a population mean and the requirements of this section (including a known value of σ) are all satisfied.

Sample Size Determination Sample size calculations are not included with the TI-83/84 Plus calculator, or Minitab, or Excel. The STATDISK procedure for determining the sample size required to estimate a population mean μ is described below.

STATDISK Select Analysis from the main menu bar at the top, then select Sample Size Determination, followed by Estimate Mean. You must now enter the confidence level (such as 0.95) and the error E. You can also enter the population standard deviation σ if it is known. There is also an option that allows you to enter the population size N, assuming that you are sampling without replacement from a finite population. (See Exercise 40.)

7-3 BASIC SKILLS AND CONCEPTS

Statistical Literacy and Critical Thinking

1. Confidence Interval Based on sample data, the following 95% confidence interval is obtained: 2.5 < μ < 6.0. Write a statement that correctly interprets that confidence interval.
2. Unbiased Estimator One of the features of the sample mean that makes it a good estimator of a population mean μ is that the sample mean is an unbiased estimator. What does it mean for a statistic to be an unbiased estimator of a population parameter?

3. Confidence Interval A manufacturer of amusement park rides needs a confidence interval estimate of the force that can be exerted when riders push on a leg safety restraint. Unable to find data, a sample is obtained by measuring the force from 100 high school students participating in a science fair. Will the resulting confidence interval be a good estimate of the mean force for the population of all potential riders? Why or why not?
4. Sample Size A researcher calculates the sample size needed to estimate the force that can be exerted by legs of people on amusement park rides, and the sample size of 120 is obtained. If the researcher cannot obtain a random sample and must rely instead on a convenience sample consisting of friends and relatives, can he or she compensate and get good results by using a much larger sample size?

Finding Critical Values. In Exercises 5–8, find the critical value z_{α/2} that corresponds to the given confidence level.

5. 95%
6. 96%
7. 92%
8. 99%

Verifying Requirements and Finding the Margin of Error. In Exercises 9–12, calculate the margin of error E = z_{α/2} · σ/√n if the necessary requirements are satisfied. If the requirements are not all satisfied, state that the margin of error cannot be calculated by using the methods of this section.

9. The confidence level is 95%, the sample size is n = 100, and σ = 15.
10. The confidence level is 95%, the sample size is n = 9, and σ is not known.
11. The confidence level is 99%, the sample size is n = 9, σ = 15, and the original population is normally distributed.
12. The confidence level is 99%, the sample size is n = 12, σ is not known, and the original population is normally distributed.

Finding a Confidence Interval. In Exercises 13–16, use the given confidence level and sample data to find a confidence interval for estimating the population mean μ.

13. Salaries of college graduates who took a statistics course in college: 95% confidence; n = 41, x̄ = $67,200, and σ is known to be $18,277.
14. Speeds of drivers ticketed in a 55 mi/h zone: 95% confidence; n = 90, x̄ = 66.2 mi/h, and σ is known to be 3.4 mi/h.
15. FICO (Fair, Isaac, and Company) credit rating scores of applicants for credit cards: 99% confidence; n = 70, x̄ = 688, and σ is known to be 68.
16. Amounts lost by gamblers who took a bus to an Atlantic City casino: 99% confidence; n = 40, x̄ = $189, and σ is known to be $87.

Finding Sample Size. In Exercises 17–20, use the given margin of error, confidence level, and population standard deviation σ to find the minimum sample size required to estimate an unknown population mean μ.

17. Margin of error: 0.5 in., confidence level: 95%, σ = 2.5 in.
18. Margin of error: 0.25 sec, confidence level: 99%, σ = 5.40 sec.
19. Margin of error: $1, confidence level: 90%, σ = $12.
20. Margin of error: 1.5 mm, confidence level: 95%, σ = 8.7 mm.

Interpreting Results. In Exercises 21–24, refer to the accompanying Minitab display of a 95% confidence interval generated using the methods of this section. The sample display results from using a random sample of speeds of drivers ticketed on a section of Interstate 95 in Connecticut.

MINITAB
Variable    N     Mean   StDev  SE Mean              95% CI
Speed      81  67.3849  3.3498   0.3722  (66.6554, 68.1144)

21. Identify the value of the point estimate of the population mean μ.
22. Express the confidence interval in the format of x̄ − E < μ < x̄ + E.
23. Express the confidence interval in the format of x̄ ± E.
24. Write a statement that interprets the 95% confidence interval.
25. Length of Time to Earn Bachelor’s Degree In a study of the length of time that students require to earn bachelor’s degrees, 80 students are randomly selected and they are found to have a mean of 4.8 years (based on data from the National Center for Education Statistics). Assuming that σ = 2.2 years, construct a 95% confidence interval estimate of the population mean. Does the resulting confidence interval contradict the fact that 39% of students earn their bachelor’s degrees in four years?
26. Ages of Motorcyclists Killed in Crashes A study of the ages of motorcyclists killed in crashes involves the random selection of 150 drivers with a mean of 37.1 years (based on data from the Insurance Institute for Highway Safety). Assuming that σ = 12.0 years, construct a 99% confidence interval estimate of the mean age of all motorcyclists killed in crashes. If the confidence interval limits do not include ages below 20 years, does it mean that motorcyclists under the age of 20 rarely die in crashes?
27. Perception of Time Randomly selected statistics students of the author participated in an experiment to test their ability to determine when 1 min (or 60 seconds) has passed. Forty students yielded a sample mean of 58.3 sec.
Assuming that σ = 9.5 sec, construct a 95% confidence interval estimate of the population mean of all statistics students. Based on the result, is it likely that their estimates have a mean that is reasonably close to 60 sec?
28. Cotinine Levels of Smokers When people smoke, the nicotine they absorb is converted to cotinine, which can be measured. A sample of 40 smokers has a mean cotinine level of 172.5. Assuming that σ is known to be 119.5, find a 90% confidence interval estimate of the mean cotinine level of all smokers. What aspect of this problem is not realistic?
29. Blood Pressure Levels When 14 different second-year medical students at Bellevue Hospital measured the blood pressure of the same person, they obtained the results listed below. Assuming that the population standard deviation is known to be 10 mmHg, construct a 95% confidence interval estimate of the population mean. Ideally, what should the confidence interval be in this situation?
138 130 135 140 120 125 120 130 130 144 143 140 130 150
30. World’s Smallest Mammal The world’s smallest mammal is the bumblebee bat, also known as the Kitti’s hog-nosed bat (or Craseonycteris thonglongyai). Such bats are roughly the size of a large bumblebee. Listed below are weights (in grams) from a sample of these bats. Assuming that the weights of all such bats have a standard deviation of 0.30 g, construct a 95% confidence interval estimate of their mean weight. Use the confidence interval to determine whether this sample of bats is from the same population with a known mean of 1.8 g.
1.7 1.6 1.5 2.0 2.3 1.6 1.6 1.8 1.5 1.7 2.2 1.4 1.6 1.6 1.6
31. Weights of Quarters from Appendix B Use the weights of post-1964 quarters listed in Data Set 14 from Appendix B. Assuming that quarters are minted to produce weights with a population standard deviation of 0.068 g, use the sample of weights to construct a 99% confidence interval estimate of the mean weight. U.S. mint specifications require that quarters have weights between 5.443 g and 5.897 g. What does the confidence interval suggest about the production process?
32. Forecast Errors from Appendix B Refer to Data Set 8 in Appendix B and subtract each actual high temperature from the high temperature that was forecast one day before. The result is a list of errors. Assuming that all such errors have a standard deviation of 2.5°, construct a 95% confidence interval estimate of the mean of all such errors. What does the result suggest about the accuracy of the forecast temperatures?

Finding Sample Size. In Exercises 33–38, find the indicated sample size.

33. Sample Size for Mean IQ of Statistics Students The Wechsler IQ test is designed so that the mean is 100 and the standard deviation is 15 for the population of normal adults. Find the sample size necessary to estimate the mean IQ score of statistics students. We want to be 95% confident that our sample mean is within 2 IQ points of the true mean. The mean for this population is clearly greater than 100. The standard deviation for this population is probably less than 15 because it is a group with less variation than a group randomly selected from the general population; therefore, if we use σ = 15, we are being conservative by using a value that will make the sample size at least as large as necessary. Assume then that σ = 15 and determine the required sample size.
34. Sample Size for Weights of Quarters The Tyco Video Game Corporation finds that it is losing income because of slugs used in its video games. The machines must be adjusted to accept coins only if they fall within set limits. In order to set those limits, the mean weight of quarters in circulation must be estimated. A sample of quarters will be weighed in order to determine the mean. How many quarters must we randomly select and weigh if we want to be 99% confident that the sample mean is within 0.025 g of the true population mean for all quarters? Based on results from a sample of quarters, we can estimate the population standard deviation as 0.068 g.
35. Sample Size for Atkins Diet You want to estimate the mean weight loss of people one year after using the Atkins diet. How many dieters must be surveyed if we want to be 95% confident that the sample mean weight loss is within 0.25 lb of the true population mean? Assume that the population standard deviation is known to be 10.6 lb (based on data from “Comparison of the Atkins, Ornish, Weight Watchers, and Zone Diets for Weight Loss and Heart Disease Risk Reduction,” by Dansinger et al., Journal of the American Medical Association, Vol. 293, No. 1).
36. Sample Size for Television Viewing Nielsen Media Research wants to estimate the mean amount of time (in minutes) that full-time college students spend watching television each weekday. Find the sample size necessary to estimate that mean with a 15-min margin of error. Assume that a 96% confidence level is desired. Also assume that a pilot study showed that the standard deviation is estimated to be 112.2 min.
37. Sample Size Using Range Rule of Thumb You have just been hired by the marketing division of General Motors to estimate the mean amount of money now being spent on the purchase of new cars in the United States. First use the range rule of thumb to make a rough estimate of the standard deviation of the amounts spent. It is reasonable to assume that typical amounts range from $12,000 to about $70,000. Then use that estimated standard deviation to determine the sample size corresponding to 95% confidence and a $100 margin of error. Is the sample size practical? If not, what should be changed to get a practical sample size?
38. Sample Size Using Sample Data You want to estimate the mean pulse rate of adult males. Refer to Data Set 1 in Appendix B and find the maximum and minimum pulse rates for males, then use those values with the range rule of thumb to estimate σ. How many adult males must you randomly select and test if you want to be 95% confident that the sample mean pulse rate is within 2 beats (per minute) of the true population mean μ? If, instead of using the range rule of thumb, the standard deviation of the male pulse rates in Data Set 1 is used as an estimate of σ, is the required sample size very different? Which sample size is likely to be closer to the correct sample size?

7-3 BEYOND THE BASICS

39. Confidence Interval with Finite Population Correction Factor The standard error of the mean is σ/√n provided that the population size is infinite. If the population size is finite and is denoted by N, then the correction factor √((N − n)/(N − 1)) should be used whenever n > 0.05N. This correction factor multiplies the margin of error E given in Formula 7-4, so that the margin of error is as shown below. Find the 95% confidence interval for the mean of 250 IQ scores if a sample of 35 of those scores produces a mean of 110. Assume that σ = 15.

\[ E = z_{\alpha/2} \cdot \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}} \]

40. Sample Size with Finite Population Correction Factor In Formula 7-4 for the margin of error E, we assume that the population is infinite, that we are sampling with replacement, or that the population is very large.
If we have a relatively small population and sample without replacement, we should modify E to include a finite population correction factor, so that the margin of error is as shown in Exercise 39, where N is the population size. That expression for the margin of error can be solved for n to yield

\[ n = \frac{N \sigma^2 (z_{\alpha/2})^2}{(N - 1)E^2 + \sigma^2 (z_{\alpha/2})^2} \]

Repeat Exercise 33, assuming that the statistics students are randomly selected without replacement from a population of N = 200 statistics students.

7-4 Estimating a Population Mean: σ Not Known

Key Concept This section presents methods for finding a confidence interval estimate of a population mean when the population standard deviation is not known. (Section 7-3 presented methods for estimating μ when σ is known.) With σ unknown, we will use the Student t distribution (instead of the normal distribution), assuming that certain requirements (given below) are satisfied. Because σ is typically unknown in real circumstances, the methods of this section are very realistic, practical, and they are used often.

Requirements
1. The sample is a simple random sample.
2. Either the sample is from a normally distributed population or n > 30.

As in Section 7-3, the requirement of a normally distributed population is not a strict requirement, so we can usually consider the population to be normally distributed after using the sample data to confirm that there are no outliers and the histogram has a shape that is not very far from a normal distribution. Also, as in Section 7-3, the requirement that the sample size is n > 30 is commonly used as a guideline, but the minimum sample size actually depends on how much the population distribution departs from a normal distribution. [If a population is known to be normally distributed, the distribution of sample means x̄ is exactly a normal distribution with mean μ and standard deviation σ/√n; if the population is not normally distributed, large (n > 30) samples yield sample means with a distribution that is approximately normal with mean μ and standard deviation σ/√n.]

As in Section 7-3, the sample mean x̄ is the best point estimate (or single-valued estimate) of the population mean μ.

The sample mean x̄ is the best point estimate of the population mean μ.

Here is a major point of this section: If σ is not known, but the above requirements are satisfied, we use a Student t distribution (instead of a normal distribution) developed by William Gosset (1876–1937). Gosset was a Guinness Brewery employee who needed a distribution that could be used with small samples. The Irish brewery where he worked did not allow the publication of research results, so Gosset published under the pseudonym Student. (In the interest of research and better serving his readers, the author visited the Guinness Brewery and sampled some of the product. Such commitment.)
Because we do not know the value of σ, we estimate it with the value of the sample standard deviation s, but this introduces another source of unreliability, especially with small samples. In order to keep a confidence interval at some desired level, such as 95%, we compensate for this additional unreliability by making the confidence interval wider: We use critical values t_{α/2} (from a Student t distribution) that are larger than the critical values of z_{α/2} from the normal distribution.

Student t Distribution
If a population has a normal distribution, then the distribution of

\[ t = \frac{\bar{x} - \mu}{s / \sqrt{n}} \]

is a Student t distribution for all samples of size n. A Student t distribution, often referred to as a t distribution, is used to find critical values denoted by t_{α/2}.

We will soon discuss some of the important properties of the t distribution, but we first present the components needed for the construction of confidence intervals.

Let’s start with the critical value denoted by t_{α/2}. A value of t_{α/2} can be found in Table A-3 by locating the appropriate number of degrees of freedom in the left column and proceeding across the corresponding row until reaching the number directly below the appropriate area at the top.

Definition
The number of degrees of freedom for a collection of sample data is the number of sample values that can vary after certain restrictions have been imposed on all data values.

For example, if 10 students have quiz scores with a mean of 80, we can freely assign values to the first 9 scores, but the 10th score is then determined. The sum of the 10 scores must be 800, so the 10th score must equal 800 minus the sum of the first 9 scores. Because those first 9 scores can be freely selected to be any values, we say that there are 9 degrees of freedom available. For the applications of this section, the number of degrees of freedom is simply the sample size minus 1.

degrees of freedom = n − 1

Excerpts from a Department of Transportation Circular

The following excerpts from a Department of Transportation circular concern some of the accuracy requirements for navigation equipment used in aircraft. Note the use of the confidence interval. “The total of the error contributions of the airborne equipment, when combined with the appropriate flight technical errors listed, should not exceed the following with a 95% confidence (2-sigma) over a period of time equal to the update cycle.” “The system of airways and routes in the United States has widths of route protection used on a VOR system with accuracy of ±4.5 degrees on a 95% probability basis.”

EXAMPLE Finding a Critical Value A sample of size n = 23 is a simple random sample selected from a normally distributed population. Find the critical value t_{α/2} corresponding to a 95% confidence level.

SOLUTION Because n = 23, the number of degrees of freedom is given by n − 1 = 22. Using Table A-3, we locate the 22nd row by referring to the column at the extreme left. As in Section 7-2, a 95% confidence level corresponds to α = 0.05, so we find the column listing values for an area of 0.05 in two tails. The value corresponding to the row for 22 degrees of freedom and the column for an area of 0.05 in two tails is 2.074, so t_{α/2} = 2.074.

Now that we know how to find critical values denoted by t_{α/2} we can describe the margin of error E and the confidence interval.

Margin of Error E for the Estimate of μ (With σ Not Known)

Formula 7-6
\[ E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]
where t_{α/2} has n − 1 degrees of freedom. Table A-3 lists values of t_{α/2}.

Confidence Interval for the Estimate of μ (With σ Not Known)
\[ \bar{x} - E < \mu < \bar{x} + E \qquad \text{where} \qquad E = t_{\alpha/2} \cdot \frac{s}{\sqrt{n}} \]
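Formula 7-6 can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name is hypothetical, and because the Python standard library has no Student t distribution, the critical value is passed in as a number read from Table A-3. The usage values (n = 23, x̄ = 47.0, s = 7.2, t_{α/2} = 2.074) are the ones used in the worked example of this section.

```python
import math


def t_confidence_interval(x_bar, s, n, t_crit):
    """Return (E, lower, upper) from Formula 7-6: E = t_crit * s / sqrt(n).

    x_bar  : sample mean
    s      : sample standard deviation (sigma is not known)
    n      : sample size; t_crit must correspond to n - 1 degrees of freedom
    t_crit : critical value t_{alpha/2} read from a t table
    """
    if n < 2:
        raise ValueError("need n >= 2 so that n - 1 degrees of freedom exist")
    margin = t_crit * s / math.sqrt(n)
    return margin, x_bar - margin, x_bar + margin


if __name__ == "__main__":
    # df = 23 - 1 = 22 and 95% confidence give t = 2.074 in Table A-3.
    margin, lower, upper = t_confidence_interval(47.0, 7.2, 23, 2.074)
    print(f"E = {margin:.4f}, interval = ({lower:.1f}, {upper:.1f})")
```

Passing the table value in explicitly keeps the sketch dependency-free; a statistics library that provides the inverse t distribution could supply `t_crit` instead.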

Chapter 7 Estimates and Sample Sizes
7-4 Estimating a Population Mean: $\sigma$ Not Known

The following procedure uses the above margin of error in the construction of confidence interval estimates of $\mu$.

Procedure for Constructing a Confidence Interval for $\mu$ (With $\sigma$ Unknown)

1. Verify that the requirements are satisfied. (We have a simple random sample, and either the population appears to be normally distributed or $n > 30$.)
2. Using $n - 1$ degrees of freedom, refer to Table A-3 and find the critical value $t_{\alpha/2}$ that corresponds to the desired confidence level. (For the confidence level, refer to the “Area in Two Tails.”)
3. Evaluate the margin of error $E = t_{\alpha/2} \cdot s/\sqrt{n}$.
4. Using the value of the calculated margin of error E and the value of the sample mean $\bar{x}$, find the values of $\bar{x} - E$ and $\bar{x} + E$. Substitute those values in the general format for the confidence interval:
   $\bar{x} - E < \mu < \bar{x} + E$   or   $\bar{x} \pm E$   or   $(\bar{x} - E, \bar{x} + E)$
5. Round the resulting confidence interval limits. If using the original set of data, round to one more decimal place than is used for the original set of data. If using summary statistics ($n$, $\bar{x}$, $s$), round the confidence interval limits to the same number of decimal places used for the sample mean.

EXAMPLE Constructing a Confidence Interval Listed in the accompanying stemplot are the ages of applicants who were unsuccessful in winning promotion (based on data from “Debating the Use of Statistical Evidence in Allegations of Age Discrimination,” by Barry and Boland, American Statistician, Vol. 58, No. 2). There is an important larger issue of whether these applicants suffered age discrimination, but for now we will focus on the simple issue of using those values as a sample for the purpose of estimating the mean of a larger population. Assume that the sample is a simple random sample and use the sample data with a 95% confidence level to find both of the following:

a. The margin of error E
b. The confidence interval for $\mu$

Stemplot of Ages
3 | 4778
4 | 12344555689
5 | 3344567
6 | 0

SOLUTION

REQUIREMENT We must first verify that the requirements are satisfied. We are assuming that the sample is a simple random sample. We now address the requirement that “the population is normally distributed or $n > 30$.” Because $n = 23$, we must check that the distribution is approximately normal. The shape of the stemplot does suggest a normal distribution. Also, a normal quantile plot confirms that the sample data are from a population with a distribution that is approximately normal. The requirements are therefore satisfied and we can proceed with the methods of this section.

a. The 0.95 confidence level implies that $\alpha = 0.05$, so $t_{\alpha/2} = 2.074$ (use Table A-3 with df $= n - 1 = 22$, as was shown in the preceding example). After finding that the sample statistics are $n = 23$, $\bar{x} = 47.0$, and $s = 7.2$, the margin of error E is calculated by using Formula 7-6 as follows. Extra decimal places are used to minimize rounding errors in the confidence interval found in part (b).

   $E = t_{\alpha/2} \dfrac{s}{\sqrt{n}} = 2.074 \cdot \dfrac{7.2}{\sqrt{23}} = 3.11370404$

b. With $\bar{x} = 47.0$ and $E = 3.11370404$, we construct the confidence interval as follows:

   $\bar{x} - E < \mu < \bar{x} + E$
   $47.0 - 3.11370404 < \mu < 47.0 + 3.11370404$
   $43.9 < \mu < 50.1$   (rounded to one more decimal place than the original data)

INTERPRETATION This result could also be expressed in the format of $47.0 \pm 3.1$ or $(43.9, 50.1)$. On the basis of the given sample results, we are 95% confident that the limits of 43.9 years and 50.1 years actually do contain the value of the population mean $\mu$.

We now list the important properties of the t distribution that we are using in this section.

Important Properties of the Student t Distribution

1. The Student t distribution is different for different sample sizes. (See Figure 7-5 for the cases $n = 3$ and $n = 12$.)
2. The Student t distribution has the same general symmetric bell shape as the standard normal distribution, but it reflects the greater variability (with wider distributions) that is expected with small samples.

[Figure 7-5 Student t Distributions for n = 3 and n = 12. Curves shown, all centered at 0: the standard normal distribution, the Student t distribution with n = 12, and the Student t distribution with n = 3. The Student t distribution has the same general shape and symmetry as the standard normal distribution, but it reflects the greater variability that is expected with small samples.]

3. The Student t distribution has a mean of $t = 0$ (just as the standard normal distribution has a mean of $z = 0$).
4. The standard deviation of the Student t distribution varies with the sample size, but it is greater than 1 (unlike the standard normal distribution, which has $\sigma = 1$).
5. As the sample size n gets larger, the Student t distribution gets closer to the standard normal distribution.

Choosing the Appropriate Distribution

It is sometimes difficult to decide whether to use the standard normal z distribution or the Student t distribution. The flowchart in Figure 7-6 and the accompanying Table 7-1 both summarize the key points to be considered when constructing confidence intervals for estimating $\mu$, the population mean. In Figure 7-6 or Table 7-1, note that if we have a small ($n \le 30$) sample drawn from a distribution that differs dramatically from a normal distribution, we can’t use the methods described in this chapter. One alternative is to use nonparametric methods (see Chapter 13), and another alternative is to use the computer bootstrap method. In both of those approaches, no assumptions are made about the original population. The bootstrap method is described in the Technology Project at the end of this chapter.

[Figure 7-6 Choosing Between z and t. Flowchart. Start: Is $\sigma$ known? If yes: Is the population normally distributed? If yes, use the normal distribution (z); if no, is $n > 30$? If yes, use the normal distribution (z); if no, use nonparametric or bootstrapping methods. If $\sigma$ is not known: Is the population normally distributed? If yes, use the t distribution; if no, is $n > 30$? If yes, use the t distribution; if no, use nonparametric or bootstrapping methods.]
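The decision logic of Figure 7-6 is small enough to write out as code. The following is a minimal sketch, assuming a hypothetical function name and string labels of our own choosing; it encodes only the three questions in the flowchart (is $\sigma$ known, is the population normally distributed, is $n > 30$) and none of the judgment needed to decide whether a population "appears" normal.

```python
def choose_distribution(sigma_known, population_normal, n):
    """Follow the Figure 7-6 flowchart.

    Returns "z", "t", or "nonparametric or bootstrap".
    population_normal is the analyst's own judgment (for example from a
    histogram or normal quantile plot); this function does not check it.
    """
    if population_normal or n > 30:
        # Requirement on the population or sample size is met;
        # the only remaining question is whether sigma is known.
        return "z" if sigma_known else "t"
    # Small sample from a clearly non-normal population.
    return "nonparametric or bootstrap"


print(choose_distribution(sigma_known=False, population_normal=True, n=8))
print(choose_distribution(sigma_known=True, population_normal=False, n=150))
print(choose_distribution(sigma_known=False, population_normal=False, n=8))
```

The cutoff of 30 is the commonly used guideline from the text, not a hard rule, so a real analysis would treat borderline cases with more care than this function does.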

Table 7-1 Choosing between z and t

Method                                           Conditions
Use normal (z) distribution.                     $\sigma$ known and normally distributed population, or $\sigma$ known and $n > 30$
Use t distribution.                              $\sigma$ not known and normally distributed population, or $\sigma$ not known and $n > 30$
Use a nonparametric method or bootstrapping.     Population is not normally distributed and $n \le 30$.

Notes:
1. Criteria for deciding whether the population is normally distributed: Population need not be exactly normal, but it should appear to be somewhat symmetric with one mode and no outliers.
2. Sample size $n > 30$: This is a commonly used guideline, but sample sizes of 15 to 30 are adequate if the population appears to have a distribution that is not far from being normal and there are no outliers. For some population distributions that are extremely far from normal, the sample size might need to be larger than 50 or even 100.

Estimating Sugar in Oranges
In Florida, members of the citrus industry make extensive use of statistical methods. One particular application involves the way in which growers are paid for oranges used to make orange juice. An arriving truckload of oranges is first weighed at the receiving plant, then a sample of about a dozen oranges is randomly selected. The sample is weighed and then squeezed, and the amount of sugar in the juice is measured. Based on the sample results, an estimate is made of the total amount of sugar in the entire truckload. Payment for the load of oranges is based on the estimate of the amount of sugar because sweeter oranges are more valuable than those less sweet, even though the amounts of juice may be the same.

The following example focuses on choosing the correct approach by using the methods of this section and Section 7-3.

EXAMPLE Choosing Distributions Assuming that you plan to construct a confidence interval for the population mean $\mu$, use the given data to determine whether the margin of error E should be calculated using a critical value of $z_{\alpha/2}$ (from the normal distribution), a critical value of $t_{\alpha/2}$ (from a t distribution), or neither (so that the methods of Section 7-3 and this section cannot be used).

a. $n = 150$, $\bar{x} = 100$, $s = 15$, and the population has a skewed distribution.
b. $n = 8$, $\bar{x} = 100$, $s = 15$, and the population has a normal distribution.
c. $n = 8$, $\bar{x} = 100$, $s = 15$, and the population has a very skewed distribution.
d. $n = 150$, $\bar{x} = 100$, $\sigma = 15$, and the distribution is skewed. (This situation almost never occurs.)
e. $n = 8$, $\bar{x} = 100$, $\sigma = 15$, and the distribution is extremely skewed. (This situation almost never occurs.)

SOLUTION Refer to Figure 7-6 or Table 7-1 to determine the following:

a. Because the population standard deviation $\sigma$ is not known and the sample is large ($n > 30$), the margin of error is calculated using $t_{\alpha/2}$ in Formula 7-6.
b. Because the population standard deviation $\sigma$ is not known and the population is normally distributed, the margin of error is calculated using $t_{\alpha/2}$ in Formula 7-6.

c. Because the sample is small and the population does not have a normal distribution, the margin of error E should not be calculated using a critical value of $z_{\alpha/2}$ or $t_{\alpha/2}$. The methods of Section 7-3 and this section do not apply.
d. Because the population standard deviation $\sigma$ is known and the sample is large ($n > 30$), the margin of error is calculated using $z_{\alpha/2}$ in Formula 7-4.
e. Because the population is not normally distributed and the sample is small ($n \le 30$), the margin of error E should not be calculated using a critical value of $z_{\alpha/2}$ or $t_{\alpha/2}$. The methods of Section 7-3 and this section do not apply.

Estimates to Improve the Census
In the decennial Census, not everyone is counted, and some people are counted more than once. Methods of statistics can be used to improve population counts with adjustments in each county of each state. Some argue that the Constitution specifies that the Census be an “actual enumeration” which does not allow for adjustment. A Supreme Court ruling prohibits use of adjusted population counts for reapportionment of congressional seats, but a recent ruling by a federal appeals court ordered that the adjusted counts be released, even if they can’t be used for that purpose. According to the Associated Press, “The Census Bureau has left open the possibility of using adjusted data for federal funding in the future,” so the use of powerful statistics methods might eventually result in better allocation of federal and state funds.

EXAMPLE Confidence Interval for Birth Weights In a study of the effects of prenatal cocaine use on infants, the following sample data were obtained for weights at birth: $n = 190$, $\bar{x} = 2700$ g, $s = 645$ g (based on data from “Cognitive Outcomes of Preschool Children with Prenatal Cocaine Exposure,” by Singer et al., Journal of the American Medical Association, Vol. 291, No. 20). The design of the study justifies the assumption that the sample can be treated as a simple random sample. Use the sample data to construct a 95% confidence interval estimate of $\mu$, the mean birth weight of all infants born to mothers who used cocaine.

SOLUTION

REQUIREMENT We must first verify that the requirements are satisfied. The sample is a simple random sample. Because $n = 190$, we satisfy the requirement that “the population is normally distributed or $n > 30$.” The requirements are therefore satisfied. (This is Step 1 in the five-step procedure listed earlier, and we can now proceed with the remaining steps.)

Step 2: The critical value is $t_{\alpha/2} = 1.972$. It is found in Table A-3 as the critical value corresponding to $n - 1 = 189$ degrees of freedom (left column of Table A-3) and an area in two tails of 0.05. (Because Table A-3 does not include df = 189, we use the closest critical value of 1.972. We can use software to find that a more accurate critical value is 1.973, so the approximation is quite good here.)

Step 3: Find the margin of error E: The margin of error $E = 92.276226$ is computed using Formula 7-6 as shown below, with extra decimal places used to minimize rounding error in the confidence interval found in Step 4.

   $E = t_{\alpha/2} \dfrac{s}{\sqrt{n}} = 1.972 \cdot \dfrac{645}{\sqrt{190}} = 92.276226$

Step 4: Find the confidence interval: The confidence interval can now be found by using $\bar{x} = 2700$ and $E = 92.276226$ as shown below:

   $\bar{x} - E < \mu < \bar{x} + E$
   $2700 - 92.276226 < \mu < 2700 + 92.276226$
   $2607.7238 < \mu < 2792.2762$

Step 5: Round the confidence interval limits. Because the sample mean is rounded as a whole number, round the confidence interval limits to get this result: $2608 < \mu < 2792$.

INTERPRETATION On the basis of the sample data, we are 95% confident that the limits of 2608 g and 2792 g actually do contain the value of the mean birth weight. We can now compare this result to a confidence interval constructed for birth weights of children whose mothers did not use cocaine. (See Exercise 17.)

Estimating Crowd Size
There are sophisticated methods of analyzing the size of a crowd. Aerial photographs and measures of people density can be used with reasonably good accuracy. However, reported crowd size estimates are often simple guesses. After the Boston Red Sox won the World Series for the first time in 86 years, Boston city officials estimated that the celebration parade was attended by 3.2 million fans. Boston police provided an estimate of around 1 million, but it was admittedly based on guesses by police commanders. A photo analysis led to an estimate of around 150,000. Boston University Professor Farouk El-Baz used images from the U.S. Geological Survey to develop an estimate of at most 400,000. MIT physicist Bill Donnelly said that “it’s a serious thing if people are just putting out any number. It means other things aren’t being vetted that carefully.”

Finding Point Estimate and E from a Confidence Interval

Later in this section we will describe how software and calculators can be used to find a confidence interval. A typical usage requires that you enter a confidence level and sample statistics, and the display shows the confidence interval limits. The sample mean $\bar{x}$ is the value midway between those limits, and the margin of error E is one-half the difference between those limits (because the upper limit is $\bar{x} + E$ and the lower limit is $\bar{x} - E$, the distance separating them is $2E$).

Point estimate of $\mu$:

   $\bar{x} = \dfrac{(\text{upper confidence limit}) + (\text{lower confidence limit})}{2}$

Margin of error:

   $E = \dfrac{(\text{upper confidence limit}) - (\text{lower confidence limit})}{2}$

EXAMPLE Ages of Stowaways In analyzing the ages of all Queen Mary stowaways (based on data from the Cunard Line), the Minitab display shown below is obtained. Use the given confidence interval to find the point estimate $\bar{x}$ and the margin of error E. Treat the values as sample data randomly selected from a large population.

Minitab
95.0% CI
(24.065, 27.218)

SOLUTION In the following calculations, results are rounded to one decimal place, which is one additional decimal place beyond the rounding used for the original list of ages.

   $\bar{x} = \dfrac{(\text{upper confidence limit}) + (\text{lower confidence limit})}{2} = \dfrac{27.218 + 24.065}{2} = 25.6$ years

   $E = \dfrac{(\text{upper confidence limit}) - (\text{lower confidence limit})}{2} = \dfrac{27.218 - 24.065}{2} = 1.6$ years
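The two formulas above are easy to apply to any interval reported by software. Here is a minimal sketch, with a hypothetical function name, that recovers the point estimate and the margin of error from a pair of limits and applies it to the Minitab interval (24.065, 27.218) from the stowaway example.

```python
def point_estimate_and_margin(lower, upper):
    """Recover (xbar, E) from confidence interval limits.

    xbar is the midpoint of the limits and E is half their difference,
    because the limits are xbar - E and xbar + E.
    """
    if upper < lower:
        raise ValueError("upper limit must not be below the lower limit")
    xbar = (upper + lower) / 2
    e = (upper - lower) / 2
    return xbar, e


xbar, e = point_estimate_and_margin(24.065, 27.218)
# Rounded to one decimal place, as in the worked solution.
print(round(xbar, 1), round(e, 1))
```

The same function works for a z interval or a t interval, since both are symmetric about the sample mean.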

[Figure 7-7 BMI Indexes of Females and Males. Confidence intervals for Female and Male, plotted on a BMI (Body Mass Index) axis.]

Using Confidence Intervals to Describe, Explore, or Compare Data

In some cases, we might use a confidence interval to achieve an ultimate goal of estimating the value of a population parameter. In other cases, a confidence interval might be one of several different tools used to describe, explore, or compare data sets. Figure 7-7 shows graphs of confidence intervals for the BMI indexes of a sample of females and males (see Data Set 1 in Appendix B). Because the confidence intervals overlap, there does not appear to be a significant difference between the mean BMI index of females and males.

Using Technology

The following procedures apply to confidence intervals for estimating a mean $\mu$, and they include the confidence intervals described in Section 7-3 as well as the confidence intervals presented in this section. Before using software or a calculator to generate a confidence interval, be sure to first check that the relevant requirements are satisfied. See the requirements listed near the beginning of this section and Section 7-3.

STATDISK You must first find the sample size $n$, the sample mean $\bar{x}$, and the sample standard deviation $s$. (See the STATDISK procedure described in Section 3-3.) Select Analysis from the main menu bar, select Confidence Intervals, then select Population Mean. Proceed to enter the items in the dialog box, then click the Evaluate button. The confidence interval will be displayed.

MINITAB Minitab Release 14 now allows you to use either the summary statistics $n$, $\bar{x}$, and $s$ or a list of the original sample values. Select Stat and Basic Statistics. If $\sigma$ is not known, select 1-sample t and enter the summary statistics or enter C1 in the box located at the top right. (If $\sigma$ is known, select 1-sample Z and enter the summary statistics or enter C1 in the box located at the top right. Also enter the value of $\sigma$ in the “Standard Deviation” or “Sigma” box.) Use the Options button to enter the confidence level.

EXCEL Use the Data Desk XL add-in that is a supplement to this book. Click on DDXL and select Confidence Intervals. Under the Function Type options, select 1 Var t Interval if $\sigma$ is not known. (If $\sigma$ is known, select 1 Var z Interval.) Click on the pencil icon and enter the range of data, such as A1:A12 if you have 12 values listed in column A. Click OK. In the dialog box, select the level of confidence. (If using 1 Var z Interval, also enter the value of $\sigma$.) Click on Compute Interval and the confidence interval will be displayed.

The use of Excel’s tool for finding confidence intervals is not recommended. It assumes that $\sigma$ is known, and you must first find the sample size $n$ and the sample standard deviation $s$ (which can be found using fx, Statistical, STDEV). Instead of generating the completed confidence interval with specific limits, this tool calculates only the margin of error E. You must then subtract this result from $\bar{x}$ and add it to $\bar{x}$ so that you can identify the actual confidence interval limits. To use this tool when $\sigma$ is known, click on fx, select the function category of Statistical, then select the item of CONFIDENCE. In the dialog box, enter the value of $\alpha$ (called the significance level), the standard deviation, and the sample size. The result will be the value of the margin of error E.

TI-83/84 PLUS The TI-83/84 Plus calculator can be used to generate confidence intervals for original sample values stored in a list, or you can use the summary statistics $n$, $\bar{x}$, and $s$. Either enter the data in list L1 or have the summary statistics available, then press the STAT key. Now select TESTS and choose TInterval if $\sigma$ is not known. (Choose ZInterval if $\sigma$ is known.) After making the required entries, the calculator display will include the confidence interval in the format of $(\bar{x} - E, \bar{x} + E)$.

Caution: As in Sections 7-2 and 7-3, confidence intervals can be used informally to compare different data sets, but the overlapping of confidence intervals should not be used for making formal and final conclusions about equality of means. Later chapters will include procedures for deciding whether two populations have equal means, and those methods will not have the pitfalls associated with comparisons based on the overlap of confidence intervals.

Do not use the overlapping of confidence intervals as the basis for making formal conclusions about the equality of means.

7-4 BASIC SKILLS AND CONCEPTS

Statistical Literacy and Critical Thinking

1. What’s Wrong? A “snapshot” in USA Today noted that “Consumers will spend an estimated average of $483 on merchandise” for back-to-school spending. It was reported that the value is based on a survey of 8453 consumers, and the margin of error is “±1 percentage point.” What’s wrong with this information?

2. Confidence Interval The Newport Chronicle issued a report stating that based on a sample of homes, the mean tax bill is $4626 with a margin of error of $591. Express the confidence interval in the format of $\bar{x} - E < \mu < \bar{x} + E$.

3. Interpreting a Confidence Interval Using the systolic blood pressure levels of the 40 men listed in Data Set 1 in Appendix B, we get this 99% confidence interval: $114.4 < \mu < 123.4$. Write a statement that correctly interprets that confidence interval.

4. Checking Requirements Suppose that we want to construct a confidence interval estimate of the amounts of precipitation on Mondays in Boston, and we plan to use the amounts listed in Data Set 10 from Appendix B. We can examine those amounts to see that among the 52 Mondays, there are 33 that have amounts of 0. Based on that observation, do the amounts of precipitation on Mondays appear to be normally distributed?
Assuming that the sample can be treated as a simple random sample, can we use the methods of this section to construct a confidence interval estimate of the population mean? Why or why not?

Using Correct Distribution. In Exercises 5–12, do one of the following, as appropriate: (a) Find the critical value $z_{\alpha/2}$, (b) find the critical value $t_{\alpha/2}$, (c) state that neither the normal nor the t distribution applies.

5. 95%; $n = 12$; $\sigma$ is unknown; population appears to be normally distributed.
6. 99%; $n = 15$; $\sigma$ is unknown; population appears to be normally distributed.
7. 99%; $n = 4$; $\sigma$ is known; population appears to be very skewed.
8. 95%; $n = 50$; $\sigma$ is known; population appears to be very skewed.
9. 90%; $n = 200$; $\sigma$ is unknown; population appears to be normally distributed.
10. 98%; $n = 16$; $\sigma = 5.0$; population appears to be very skewed.
11. 98%; $n = 18$; $\sigma = 21.5$; population appears to be normally distributed.
12. 90%; $n = 33$; $\sigma$ is unknown; population appears to be normally distributed.

Finding Confidence Intervals. In Exercises 13 and 14, use the given confidence level and sample data to find (a) the margin of error and (b) the confidence interval for the population mean $\mu$. Assume that the population has a normal distribution.

13. Weight lost on Weight Watchers diet: 95% confidence; $n = 40$, $\bar{x} = 3.0$ kg, $s = 4.9$ kg.
14. Life span of desktop PC: 99% confidence; $n = 21$, $\bar{x} = 6.8$ years, $s = 2.4$ years.

Interpreting Display. In Exercises 15 and 16, use the given data and the corresponding display to express the confidence interval in the format of $\bar{x} - E < \mu < \bar{x} + E$. Also write a statement that interprets the confidence interval.

15. IQ scores of statistics students: 95% confidence; $n = 25$, $\bar{x} = 118.0$, $s = 10.7$.

Minitab
N     Mean       StDev     SE Mean    95% CI
25    118.000    10.700    2.140      (113.583, 122.417)

16. Life span of cell phone: 99% confidence; $n = 27$, $\bar{x} = 4.6$ years, $s = 1.9$ years. (See the TI-83/84 Plus calculator display in the margin.)

Constructing Confidence Intervals. In Exercises 17–26, construct the confidence interval.

17. Birth Weights A random sample of the birth weights of 186 babies has a mean of 3103 g and a standard deviation of 696 g (based on data from “Cognitive Outcomes of Preschool Children with Prenatal Cocaine Exposure,” by Singer et al., Journal of the American Medical Association, Vol. 291, No. 20). These babies were born to mothers who did not use cocaine during their pregnancies. Construct a 95% confidence interval estimate of the mean birth weight for all such babies. Compare the result to the confidence interval obtained in the example in this section that involved birth weights of babies born to mothers who used cocaine during pregnancy. Does cocaine use appear to affect the birth weight of a baby?

18. Mean Body Temperature Data Set 2 in Appendix B includes 106 body temperatures for which $\bar{x} = 98.20$°F and $s = 0.62$°F. Using the sample statistics, construct a 99% confidence interval estimate of the mean body temperature of all healthy humans. Do the confidence interval limits contain 98.6°F? What does the sample suggest about the use of 98.6°F as the mean body temperature?

19. Forecast and Actual Temperatures Data Set 8 in Appendix B includes a list of actual high temperatures and the corresponding list of three-day-forecast high temperatures. If the difference for each day is found by subtracting the three-day-forecast high temperature from the actual high temperature, the result is a list of 35 values with a mean of −1.3° and a standard deviation of 4.7°.
a. Construct a 99% confidence interval estimate of the mean difference between all actual high temperatures and three-day-forecast high temperatures.
b. Does the confidence interval include 0°? If a meteorologist claims that three-day-forecast high temperatures tend to be too high because the mean difference of the sample is −1.3°, does that claim appear to be valid? Why or why not?

20. Shoveling Heart Rates Because cardiac deaths appear to increase after heavy snowfalls, an experiment was designed to compare cardiac demands of snow shoveling to those of using an electric snow thrower. Ten subjects cleared tracts of snow using both methods, and their maximum heart rates (beats per minute) were recorded during both activities. The following results were obtained (based on data from “Cardiac Demands of Heavy Snow Shoveling,” by Franklin et al., Journal of the American Medical Association, Vol. 273, No. 11):
Manual snow shoveling maximum heart rates: $n = 10$, $\bar{x} = 175$, $s = 15$.
Electric snow thrower maximum heart rates: $n = 10$, $\bar{x} = 124$, $s = 18$.
a. Find the 95% confidence interval estimate of the population mean for those people who shovel snow manually.
b. Find the 95% confidence interval estimate of the population mean for those people who use the electric snow thrower.
c. If you are a physician with concerns about cardiac deaths fostered by manual snow shoveling, what single value in the confidence interval from part (a) would be of greatest concern?
d. Compare the confidence intervals from parts (a) and (b) and interpret your findings.

21. Monitoring Lead in Air Listed below are measured amounts of lead (in micrograms per cubic meter, or μg/m³) in the air. The Environmental Protection Agency has established an air quality standard for lead of 1.5 μg/m³. The measurements shown below were recorded at Building 5 of the World Trade Center site on different days immediately following the destruction caused by the terrorist attacks of September 11, 2001. After the collapse of the two World Trade Center buildings, there was considerable concern about the quality of the air. Use the given values to construct a 95% confidence interval estimate of the mean amount of lead in the air. Is there anything about this data set suggesting that the confidence interval might not be very good? Explain.

5.40  1.10  0.42  0.73  0.48  1.10

22. Constructing a Confidence Interval The stemplot below lists the ages of applicants who were successful in winning promotion (based on data from “Debating the Use of Statistical Evidence in Allegations of Age Discrimination,” by Barry and Boland, American Statistician, Vol. 58, No. 2). Assume that the sample is a simple random sample and construct a 95% confidence interval estimate of the mean age of all such successful people. Compare the result to the confidence interval for the ages of those who were not successful (see the example in this section).

3 | 367889
4 | 2233444555566778899
5 | 1124

23. Credit Rating When consumers apply for credit, their credit is rated using FICO (Fair, Isaac, and Company) scores. Credit ratings are given below for a sample of applicants for car loans. Use the sample data to construct a 99% confidence interval for the mean FICO score of all applicants for credit. If one bank requires a credit rating of at least 620 for a car loan, does it appear that almost all applicants will have suitable credit ratings?

661  595  548  730  791  678  672  491
492  583  762  624  769  729  734  706

24. World’s Smallest Mammal The world’s smallest mammal is the bumblebee bat, also known as the Kitti’s hog-nosed bat (or Craseonycteris thonglongyai). Such bats are roughly the size of a large bumblebee. Listed below are weights (in grams) from a

sample of these bats. Construct a 95% confidence interval estimate of their mean weight. Are the confidence interval limits very different from the limits of 1.56 and 1.87 that are found when assuming that $\sigma$ is known to be 0.30 g?

1.7  1.6  1.5  2.0  2.3  1.6  1.6  1.8  1.5  1.7  2.2  1.4  1.6  1.6  1.6

25. Estimating Car Pollution In a sample of seven cars, each car was tested for nitrogen-oxide emissions (in grams per mile) and the following results were obtained: 0.06, 0.11, 0.16, 0.15, 0.14, 0.08, 0.15 (based on data from the Environmental Protection Agency). Assuming that this sample is representative of the cars in use, construct a 98% confidence interval estimate of the mean amount of nitrogen-oxide emissions for all cars. If the Environmental Protection Agency requires that nitrogen-oxide emissions be less than 0.165 g/mi, can we safely conclude that this requirement is being met?

26. Skull Breadths Maximum breadths of samples of male Egyptian skulls from 4000 B.C. and 150 A.D. (based on data from Ancient Races of the Thebaid by Thomson and Randall-Maciver):

4000 B.C.:  131  119  138  125  129  126  131  132  126  128  128  131
150 A.D.:   136  130  126  126  139  141  137  138  133  131  134  129

Changes in head sizes over time suggest interbreeding with people from other regions. Use confidence intervals to determine whether the head sizes appear to have changed from 4000 B.C. to 150 A.D. Explain your result.

Appendix B Data Sets. In Exercises 27 and 28, use the data sets from Appendix B.

27. Pulse Rates A physician wants to develop criteria for determining whether a patient’s pulse rate is atypical, and she wants to determine whether there are significant differences between males and females. Use the sample pulse rates in Data Set 1 from Appendix B.
a. Construct a 95% confidence interval estimate of the mean pulse rate for males.
b. Construct a 95% confidence interval estimate of the mean pulse rate for females.
c. Compare the preceding results. Can we conclude that the population means for males and females are different? Why or why not?

28. Comparing Regular and Diet Pepsi Refer to Data Set 12 in Appendix B and use the sample data.
a. Construct a 95% confidence interval estimate of the mean weight of cola in cans of regular Pepsi.
b. Construct a 95% confidence interval estimate of the mean weight of cola in cans of Diet Pepsi.
c. Compare the results from parts (a) and (b) and interpret them. Does there appear to be a difference? If so, identify a reason for the difference.

7-4 BEYOND THE BASICS

29. Effect of an Outlier Test the effect of an outlier as follows: Use the sample data from Exercise 22 to find a 95% confidence interval estimate of the population mean, after changing the last age from 54 years to 540 years. This value is not realistic, but such an error can easily occur during a data entry process. Does the confidence interval change much when 54 years is changed to 540 years? Are confidence interval limits sensitive to outliers? How should you handle outliers when they are found in sample data sets that will be used for the construction of confidence intervals?

30. Alternative Method Figure 7-6 and Table 7-1 summarize the decisions made when choosing between the normal and t distributions. An alternative method included in some textbooks (but almost never included in professional journals) is based on this criterion: Substitute the sample standard deviation $s$ for $\sigma$ whenever $n > 30$, then proceed as if $\sigma$ is known. Assume that for a simple random sample, $n = 35$, $\bar{x} = 50.0$, and $s = 10.0$, then construct 95% confidence interval estimates of $\mu$ using the method of this section and using the alternative method. Compare the results.

31. Finite Population Correction Factor If a simple random sample of size $n$ is selected without replacement from a finite population of size $N$, and the sample size is more than 5% of the population size ($n > 0.05N$), better results can be obtained by using the finite population correction factor, which involves multiplying the margin of error E by $\sqrt{(N - n)/(N - 1)}$. For the sample of 100 weights of M&M candies in Data Set 13 from Appendix B, we get $\bar{x} = 0.8565$ g and $s = 0.0518$ g. First construct a 95% confidence interval estimate of $\mu$ assuming that the population is large, then construct a 95% confidence interval estimate of the mean weight of M&Ms in the full bag from which the sample was taken. The full bag has 465 M&Ms. Compare the results.

32. Using the Wrong Distribution Assume that a small simple random sample is selected from a normally distributed population for which $\sigma$ is unknown. Construction of a confidence interval should use the t distribution, but how are the confidence interval limits affected if the normal distribution is incorrectly used instead?

33. Confidence Interval for Sample of Size n = 1 When a manned NASA spacecraft lands on Mars, the astronauts encounter a single adult Martian, who is found to be 12.0 ft tall. It is reasonable to assume that the heights of all Martians are normally distributed.
a. The methods of this chapter require information about the variation of a variable. If only one sample value is available, can it give us any information about the variation of the variable?
b. When using the methods of this section, what happens when you try to use the single height in constructing a 95% confidence interval?
c. Based on the article “An Effective Confidence Interval for the Mean with Samples of Size One and Two,” by Wall, Boen, and Tweedie (American Statistician, Vol. 55, No. 2), a 95% confidence interval for $\mu$ can be found (using methods not discussed in this book) for a sample of size $n = 1$ randomly selected from a normally distributed population, and it can be expressed as $x \pm 9.68|x|$. Use this result to construct a 95% confidence interval using the single sample value of 12.0 ft, and express it in the format of $x - E < \mu < x + E$. Based on the result, is it likely that some other randomly selected Martian might be 50 ft tall?

7-5 Estimating a Population Variance

Key Concept This section presents methods for (1) finding a confidence interval estimate of a population standard deviation or variance and (2) determining the sample size required to estimate a population standard deviation or variance. In this section we introduce the chi-square distribution, which is used for finding a confidence interval estimate of $\sigma$ or $\sigma^2$.

Requirements
1. The sample is a simple random sample.
2. The population must have normally distributed values (even if the sample is large).

The assumption of a normally distributed population was made in earlier sections, but that requirement is much more critical here. For the methods of this section, departures from normal distributions can lead to gross errors. Consequently, the requirement of having a normal distribution is much stricter, and we should check the distribution of data by constructing histograms and normal quantile plots, as described in Section 6-7.

When we considered estimates of proportions and means, we used the normal and Student t distributions. When developing estimates of variances or standard deviations, we use another distribution, referred to as the chi-square distribution. We will examine important features of that distribution before proceeding with the development of confidence intervals.

Chi-Square Distribution
In a normally distributed population with variance $\sigma^2$, assume that we randomly select independent samples of size $n$ and, for each sample, compute the sample variance $s^2$ (which is the square of the sample standard deviation $s$). The sample statistic $\chi^2 = (n - 1)s^2/\sigma^2$ has a sampling distribution called the chi-square distribution.

Chi-Square Distribution
Formula 7-7    $\chi^2 = \dfrac{(n - 1)s^2}{\sigma^2}$
where
$n$ = sample size
$s^2$ = sample variance
$\sigma^2$ = population variance

We denote chi-square by $\chi^2$, pronounced "kigh square." To find critical values of the chi-square distribution, refer to Table A-4. The chi-square distribution is determined by the number of degrees of freedom, and in this chapter we use $n - 1$ degrees of freedom.

degrees of freedom $= n - 1$

In later chapters we will encounter situations in which the degrees of freedom are not $n - 1$, so we should not make the incorrect generalization that the number of degrees of freedom is always $n - 1$.
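The sampling distribution described by Formula 7-7 can be explored with a small simulation. The following is only a sketch under stated assumptions: the population values (mean 100, standard deviation 15), the number of repetitions, and the function names `chi_square_statistic` and `simulate` are all illustrative choices, not taken from the text. It draws many samples of size 10 from a normal population and computes $(n - 1)s^2/\sigma^2$ for each one.

```python
import random
import statistics

def chi_square_statistic(sample, sigma_squared):
    """Formula 7-7 for one sample: (n - 1) * s^2 / sigma^2."""
    n = len(sample)
    s_squared = statistics.variance(sample)  # sample variance (divisor n - 1)
    return (n - 1) * s_squared / sigma_squared

def simulate(n, mu, sigma, repetitions, seed=0):
    """Draw `repetitions` normal samples of size n and return their chi-square statistics."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    results = []
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        results.append(chi_square_statistic(sample, sigma ** 2))
    return results

# Hypothetical population: mean 100, standard deviation 15 (assumed for illustration).
values = simulate(n=10, mu=100.0, sigma=15.0, repetitions=20000)
print("smallest simulated value:", min(values))          # never negative
print("average simulated value:", statistics.mean(values))  # expected to be near n - 1
```

The average of the simulated statistics should land close to the number of degrees of freedom, because the mean of a chi-square distribution equals its degrees of freedom (a standard fact that this section does not state explicitly).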
Properties of the Distribution of the Chi-Square Statistic
1. The chi-square distribution is not symmetric, unlike the normal and Student t distributions (see Figure 7-8). (As the number of degrees of freedom increases, the distribution becomes more symmetric, as Figure 7-9 illustrates.)
2. The values of chi-square can be zero or positive, but they cannot be negative (see Figure 7-8).
3. The chi-square distribution is different for each number of degrees of freedom (see Figure 7-9), and the number of degrees of freedom is given by df $= n - 1$ in this section. As the number of degrees of freedom increases, the chi-square distribution approaches a normal distribution.

Figure 7-8 Chi-Square Distribution (the curve is not symmetric; all values of $\chi^2$ are nonnegative)

Because the chi-square distribution is skewed instead of symmetric, the confidence interval does not fit a format of $s^2 \pm E$ and we must do separate calculations for the upper and lower confidence interval limits. If using Table A-4 for finding critical values, note the following feature of that table:

In Table A-4, each critical value of $\chi^2$ corresponds to an area given in the top row of the table, and that area represents the cumulative area located to the right of the critical value.

Table A-2 for the standard normal distribution provides cumulative areas from the left, but Table A-4 for the chi-square distribution provides cumulative areas from the right.

Figure 7-9 Chi-Square Distribution for df = 10 and df = 20 (horizontal axis: $\chi^2$, from 0 to 45)
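The shapes sketched in Figures 7-8 and 7-9 can be checked numerically. The sketch below relies on the standard chi-square density formula, which this section does not give, so treat it as an outside assumption; the function names `chi_square_density` and `summarize`, the grid limits, and the step size are likewise illustrative. For df = 10 and df = 20 it locates the peak of the curve, confirms that the total area is about 1, and compares the peak with the mean.

```python
import math

def chi_square_density(x, df):
    """Chi-square density at x for df degrees of freedom (sketch assumes df > 2)."""
    if x <= 0:
        return 0.0  # no probability below zero; the density is 0 at x = 0 when df > 2
    half = df / 2.0
    log_density = ((half - 1.0) * math.log(x) - x / 2.0
                   - half * math.log(2.0) - math.lgamma(half))
    return math.exp(log_density)

def summarize(df, upper=150.0, step=0.01):
    """Approximate peak location, total area, mean, and standard deviation on a grid."""
    grid = [i * step for i in range(int(upper / step) + 1)]
    heights = [chi_square_density(x, df) for x in grid]
    peak = grid[heights.index(max(heights))]
    area = sum(heights) * step  # simple Riemann sum
    mean = sum(x * h for x, h in zip(grid, heights)) * step
    variance = sum((x - mean) ** 2 * h for x, h in zip(grid, heights)) * step
    return peak, area, mean, math.sqrt(variance)

for df in (10, 20):
    peak, area, mean, sd = summarize(df)
    # (mean - peak) / sd is a rough, informal measure of how lopsided the curve is.
    print(df, round(peak, 2), round(area, 3), round(mean, 2),
          round((mean - peak) / sd, 3))
```

In both cases the peak sits to the left of the mean, which reflects the right skew, and the gap between them is smaller relative to the spread for df = 20 than for df = 10, consistent with the curve becoming more symmetric as the degrees of freedom increase.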

EXAMPLE Critical Values Find the critical values of $\chi^2$ that determine critical regions containing an area of 0.025 in each tail. Assume that the relevant sample size is 10 so that the number of degrees of freedom is $10 - 1$, or 9.

Figure 7-10 Critical Values of the Chi-Square Distribution (an area of 0.025 lies in each tail)
Left critical value $\chi^2_L$: To obtain this critical value, locate 9 in the left column for degrees of freedom and then locate 0.975 across the top. The total area to the right of this critical value is 0.975, which we get by subtracting 0.025 from 1.
Right critical value $\chi^2_R$: To obtain this critical value, locate 9 in the left column for degrees of freedom and then locate 0.025 across the top.

SOLUTION See Figure 7-10 and refer to Table A-4. The critical value to the right ($\chi^2 = 19.023$) is obtained in a straightforward manner by locating 9 in the degrees-of-freedom column at the left and 0.025 across the top. The critical value of $\chi^2 = 2.700$ to the left once again corresponds to 9 in the degrees-of-freedom column, but we must locate 0.975 (found by subtracting 0.025 from 1) across the top because the values in the top row are always areas to the right of the critical value. Refer to Figure 7-10 and see that the total area to the right of $\chi^2 = 2.700$ is 0.975. Figure 7-10 shows that, for a sample of 10 values taken from a normally distributed population, the chi-square statistic $(n - 1)s^2/\sigma^2$ has a 0.95 probability of falling between the chi-square critical values of 2.700 and 19.023.

When obtaining critical values of $\chi^2$ from Table A-4, note that the numbers of degrees of freedom are consecutive integers from 1 to 30, followed by 40, 50, 60, 70, 80, 90, and 100. When a number of degrees of freedom (such as 52) is not found in the table, you can usually use the closest critical value. For example, if the number of degrees of freedom is 52, refer to Table A-4 and use 50 degrees of freedom.
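The table lookup in the example above can be cross-checked numerically. What follows is a minimal sketch, not the method used to build Table A-4: it assumes the standard relationship between the chi-square distribution and the regularized incomplete gamma function, evaluates that function with a simple power series, and then searches for the critical value by bisection. The function names are hypothetical. Like Table A-4, `critical_value` takes the area to the right of the critical value.

```python
import math

def regularized_lower_gamma(a, x):
    """P(a, x) from its power series (adequate for the moderate x values used here)."""
    if x <= 0:
        return 0.0
    term = 1.0 / a
    total = term
    k = 0
    while term > 1e-17 * total:  # stop once further terms no longer matter
        k += 1
        term *= x / (a + k)
        total += term
    return total * math.exp(a * math.log(x) - x - math.lgamma(a))

def right_tail_area(x, df):
    """Area under the chi-square curve to the right of x (the Table A-4 convention)."""
    return 1.0 - regularized_lower_gamma(df / 2.0, x / 2.0)

def critical_value(area_to_right, df, tolerance=1e-10):
    """Chi-square value with the given area to its right (0 < area_to_right < 1)."""
    low, high = 0.0, float(df)
    while right_tail_area(high, df) > area_to_right:
        high *= 2.0  # widen the bracket until it contains the critical value
    while high - low > tolerance:
        middle = (low + high) / 2.0
        if right_tail_area(middle, df) > area_to_right:
            low = middle   # too much area to the right: move rightward
        else:
            high = middle
    return (low + high) / 2.0

df = 9
left = critical_value(0.975, df)   # compare with 2.700 in the example
right = critical_value(0.025, df)  # compare with 19.023 in the example
print(round(left, 3), round(right, 3))
print(round(right_tail_area(left, df) - right_tail_area(right, df), 3))  # area between them
```

Because the function works with right-tail areas, the left critical value is requested with 0.975 and the right critical value with 0.025, mirroring the way the top row of Table A-4 is read.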
(If the number of degrees of freedom is exactly midway between table values, such as 55, simply find the mean of the two $\chi^2$ values.) For numbers of

