

MCA645 CU-MCA-SEM-II-Statistical & Numerical Methods

Published by kuljeet.singh, 2021-01-04 06:57:50


(D) 0, 18, 24

5. The process of finding the values inside the interval (X0, Xn) is called
(A) Interpolation
(B) Extrapolation
(C) Iterative
(D) Polynomial equation

Answers: 1 - C; 2 - D; 3 - B; 4 - A; 5 - A

6.9 REFERENCES
 Rajaraman, V. (1993). Computer Oriented Numerical Methods. New Delhi: Prentice Hall.
 Salaria, R.S. (2016). Computer Oriented Numerical Methods. Delhi: Khanna Book Publishing Company.
 Gupta, S.P. and Kapoor, V.K. (2014). Fundamentals of Mathematical Statistics. Delhi: Sultan Chand and Sons.
 Anderson (1990). Statistical Modelling. New York: McGraw Publishing House.
 Gupta, S.P. and Kapoor, V.K. (2015). Fundamentals of Applied Statistics. Delhi: Sultan Chand & Sons.
 Graybill (1990). Introduction to Statistics. New York: McGraw Publishing House.
 Numerical Methods & Analysis – Engineering App – Google Play Store
 https://en.wikibooks.org/wiki/Statistics/Numerical_Methods/Numerical_Comparison_of_Statistical_Software

CU IDOL SELF LEARNING MATERIAL (SLM)

UNIT 7 - INTERPOLATION AND APPROXIMATION 2

Structure
7.0 Learning Objective
7.1 Introduction
7.2 Polynomial Fitting
7.3 Other Curve Fitting
7.4 Summary
7.5 Keywords
7.6 Learning Activity
7.7 Unit End Questions
7.8 References

7.0 LEARNING OBJECTIVE
After studying this unit, you will be able to:
 State the concept of polynomial fitting
 Comprehend other curve fitting

7.1 INTRODUCTION
Curve fitting, also called regression analysis, is the process of fitting a function to a set of data points. The function can then be used as a mathematical model of the data. Since there are many types of functions (linear, polynomial, power, exponential, etc.), curve fitting can be a complicated process. Often one has some idea of the type of function that might fit the given data and needs only to determine the coefficients of the function. In other situations, where nothing is known about the data, it is possible to make different types of plots that provide information about possible forms of functions that might fit the data well.

7.2 POLYNOMIAL FITTING
Curve Fitting with Polynomials; The polyfit Function
Polynomials can be used to fit data points in two ways. In one, the polynomial passes through all the data points; in the other, the polynomial does not necessarily pass through any of the points but overall gives a good approximation of the data. The two options are described below.

Polynomials that pass through all the points: When n points (xi, yi) are given, it is possible to write a polynomial of degree n - 1 that passes through all the points. For example, if two points are given it is possible to write a linear equation in the form y = mx + b that

passes through the points. With three points the equation has the form y = ax^2 + bx + c. With n points the polynomial has the form f(x) = a(n-1)x^(n-1) + ... + a1x + a0. The coefficients of the polynomial are determined by substituting each point in the polynomial and then solving the n equations for the coefficients.

Polynomials that do not necessarily pass through any of the points: When n points are given, it is possible to write a polynomial of degree lower than n - 1 that does not necessarily pass through any of the points but overall approximates the data. The most common method of finding the best fit to data points is the method of least squares. In this method the coefficients of the polynomial are determined by minimizing the sum of the squares of the residuals at all the data points. The residual at each point is defined as the difference between the value of the polynomial and the value of the data. For example, consider the case of finding the equation of a straight line that best fits four data points (x1, y1), (x2, y2), (x3, y3) and (x4, y4), as shown in Fig 7.1.

Fig 7.1: Least squares fitting of first degree polynomial to four points

The polynomial of the first degree can be written as f(x) = a1x + a0. The residual Ri at each point is the difference between the value of the function at xi and yi, Ri = f(xi) - yi. The sum of the squares of the residuals Ri of all the points is given by
R = [f(x1) - y1]^2 + [f(x2) - y2]^2 + [f(x3) - y3]^2 + [f(x4) - y4]^2
or, after substituting the equation of the polynomial at each point, by:
R = [a1x1 + a0 - y1]^2 + [a1x2 + a0 - y2]^2 + [a1x3 + a0 - y3]^2 + [a1x4 + a0 - y4]^2

At this stage R is a function of a1 and a0. The minimum of R can be determined by taking the partial derivatives of R with respect to a1 and a0 (two equations) and equating them to zero. This results in a system of two equations with two unknowns, a1 and a0. The solution of these equations gives the values of the coefficients of the polynomial that best fits the data. The same procedure can be followed with more points and higher-order polynomials. More details on the least squares method can be found in books on numerical analysis.

Curve fitting with polynomials is done in MATLAB with the polyfit function, which uses the least squares method. The basic form of the polyfit function is p = polyfit(x, y, n), where x and y are vectors with the coordinates of the data points, n is the degree of the polynomial, and p is a vector of the polynomial's coefficients. For the same set of m points, the polyfit function can be used to fit polynomials of any order up to m - 1. If n = 1 the polynomial is a straight line, if n = 2 the polynomial is a parabola, and so on. The polynomial passes through all the points if n = m - 1 (the order of the polynomial is one less than the number of points). It should be pointed out here that a polynomial that passes through all the points, or polynomials of higher order, do not necessarily give a better fit overall. High-order polynomials can deviate significantly between the data points. Fig 7.2 shows how polynomials of different degrees fit the same set of data points. A set of seven points is given by (0.9, 0.9), (1.5, 1.5), (3, 2.5), (4, 5.1),

(6, 4.5), (8, 4.9), and (9.5, 6.3) (Fig 7.2: Fitting data with polynomials of different order). The points are fitted using the polyfit function with polynomials of degrees 1 through 6. Each plot in Fig 7.2 shows the same data points, marked with circles, and a curve-fitted line that corresponds to a polynomial of the specified degree. It can be seen that the polynomial with n = 1 is a straight line, and with n = 2 is a slightly curved line. As the degree of the polynomial increases, the line develops more bends such that it passes closer to more points. When n = 6, which is one less than the number of points, the line passes through all the points. However, between some of the points, the line deviates significantly from the trend of the data. The script file used to generate one of the plots in Fig 7.2 (the polynomial with n = 3) is shown below. Note that in

order to plot the polynomial (the line) a new vector xp with small spacing is created. This vector is then used with the function polyval to create a vector yp with the value of the polynomial for each element of xp. When the script file is executed, the resulting vector p of coefficients is displayed in the Command Window.

7.3 OTHER CURVE FITTING
Many situations in science and engineering require fitting functions that are not polynomials to given data. Theoretically, any function can be used to model data within some range. For a particular data set, however, some functions provide a better fit than others. In addition, determining the best-fitting coefficients can be more difficult for some functions than for others. This section covers curve fitting with power, exponential, logarithmic, and reciprocal functions, which are commonly used. The forms of these functions are:

Power function: y = bx^m
Exponential function: y = be^(mx) or y = b(10)^(mx)
Logarithmic function: y = m ln(x) + b or y = m log(x) + b
Reciprocal function: y = 1/(mx + b)

All of these functions can easily be fitted to given data with the polyfit function. This is done by rewriting each function in a form that can be fitted with a linear polynomial (n = 1). The logarithmic function is already in this form, and the power, exponential and reciprocal equations can be rewritten as:

Power function: ln(y) = m ln(x) + ln(b)
Exponential function: ln(y) = mx + ln(b) or log(y) = mx + log(b)
Reciprocal function: 1/y = mx + b

These equations describe a linear relationship between ln(y) and ln(x) for the power function, between ln(y) and x for the exponential function, between y and ln(x) or log(x) for the logarithmic function, and between 1/y and x for the reciprocal function. This means that the polyfit(x, y, 1) function can be used to determine the best-fit constants m and b if, instead of x and y, the following arguments are used:

Power function: polyfit(log(x), log(y), 1)
Exponential function: polyfit(x, log(y), 1) or polyfit(x, log10(y), 1)
Logarithmic function: polyfit(log(x), y, 1)
Reciprocal function: polyfit(x, 1./y, 1)

The result of the polyfit function is assigned to p, which is a two-element vector. The first element, p(1), is the constant m. The second element, p(2), is b for the logarithmic and reciprocal functions, ln(b) or log(b) for the exponential function (b = e^p(2) or b = 10^p(2)), and ln(b) for the power function (b = e^p(2)).

For given data it is possible to estimate, to some extent, which of the functions has the potential for providing a good fit. This is done by plotting the data using different combinations of linear and logarithmic axes. If the data points in one of the plots appear to fit a straight line, the corresponding function can provide a good fit: a straight line on linear axes indicates a linear function, a straight line with a log scale on the y axis only indicates an exponential function, a straight line on log-log axes indicates a power function, a straight line with a log scale on the x axis only indicates a logarithmic function, and a straight line of 1/y against x on linear axes indicates a reciprocal function.
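The MATLAB calls above cannot be executed in this text, but the linearization idea is easy to sketch. The following Python sketch (standard library only; the data points are made up from y = 2e^(0.5x) purely for illustration) fits an exponential function by fitting a straight line to (x, ln y), mirroring polyfit(x, log(y), 1):

```python
import math

def fit_line(xs, ys):
    # Least-squares straight line y = m*x + c via the normal equations.
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    c = (sy - m * sx) / n
    return m, c

def fit_exponential(xs, ys):
    # Fit y = b*e^(m*x) by fitting a line to (x, ln y);
    # the slope is m and the intercept is ln(b), so b = e^intercept.
    m, lnb = fit_line(xs, [math.log(y) for y in ys])
    return m, math.exp(lnb)

# Hypothetical data generated from y = 2*e^(0.5*x); the fit should recover it.
xs = [0, 1, 2, 3, 4]
ys = [2 * math.exp(0.5 * x) for x in xs]
m, b = fit_exponential(xs, ys)
```

Because the data lie exactly on an exponential curve, the fit recovers m = 0.5 and b = 2; with noisy data the same code returns the least-squares estimates.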

Other considerations in choosing a function:
• Exponential functions cannot pass through the origin.
• Exponential functions can fit only data with all positive y's or all negative y's.
• Logarithmic functions cannot model x = 0 or negative values of x.
• For the power function, y = 0 when x = 0.
• The reciprocal equation cannot model y = 0.

Example 1: Fitting an equation to data points - The following data points are given. Determine a function w = f(t) (t is the independent variable, w is the dependent variable) with a form discussed in this section that best fits the data.

Solution
The data is first plotted with linear scales on both axes.

Fig 7.3: Data plotted with linear scales

The figure indicates that a linear function will not give the best fit, since the points do not appear to line up along a straight line. Of the other possible functions, the logarithmic function is excluded since for the first point t = 0, and the power function is excluded since at t = 0, w ≠ 0. To check whether the other two functions (exponential and reciprocal) might give a better fit, two additional plots, shown below, are made. The plot on the left has a log scale on the vertical axis and a linear horizontal axis. In the plot on the right both axes have linear scales, and the quantity 1/w is plotted on the vertical axis.

Fig 7.4 (a): Log scale on the vertical axis and linear horizontal axis
Fig 7.4 (b): Both axes have linear scales

In the left figure the data points appear to line up along a straight line. This indicates that an exponential function of the form w = be^(mt) can give a good fit to the data. A program in a script file that determines the constants b and m, and that plots the data points and the function, is given below. When the program is executed, the values of the constants m and b are displayed in the Command Window.
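The MATLAB script referred to above did not survive this extraction, and neither did the unit's (t, w) data table. As a rough stand-in, the following Python sketch (standard library only, with hypothetical data since the original table is missing) performs the same fitting step as polyfit(t, log(w), 1) and recovers m and b = e^p(2):

```python
import math

# Hypothetical (t, w) data -- the original table was lost in extraction.
t = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5]
w = [6.0, 4.8, 4.1, 3.4, 2.9, 2.4]

# Equivalent of p = polyfit(t, log(w), 1): straight-line fit to (t, ln w).
n = len(t)
lw = [math.log(v) for v in w]
st, sl = sum(t), sum(lw)
stt = sum(x * x for x in t)
stl = sum(x * y for x, y in zip(t, lw))
m = (n * stl - st * sl) / (n * stt - st * st)   # slope = m
b = math.exp((sl - m * st) / n)                  # b = e^(intercept)
print("m =", m, "b =", b)
```

For this illustrative data the fitted curve is roughly w = 5.9 e^(-0.36 t); the real example's constants depend on the missing table.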

The plot generated by the program, which shows the data points and the function (with axis labels added with the Plot Editor), is shown in Fig 7.5.

Fig 7.5: Plot generated by the program

7.4 SUMMARY
Curve fitting is one of the most powerful and most widely used analysis tools in Origin. Curve fitting examines the relationship between one or more predictors (independent variables) and a response variable (dependent variable), with the goal of defining a "best fit" model of the relationship.

7.5 KEYWORDS
 Curve fitting: also called regression analysis, the process of fitting a function to a set of data points.
 Script file: a script, in most general terms, is simply a text file with one command on each line.
 Linear functions: functions whose graph is a straight line.
 Linear scale: also called a bar scale, scale bar, graphic scale, or graphical scale, a means of visually showing the scale of a map, nautical chart, engineering drawing, or architectural drawing.
 Elements of a vector: to refer to elements in a vector, MATLAB uses round brackets.

7.6 LEARNING ACTIVITY
1. Fit a straight line into the following data.
x: 0 1 2 3 4 5
y: 3 6 8 11 13 14
---------------------------------------------------------------------------------------------------------------------

2. Fit a straight line y = a + bx into the given data by the Actual Mean Method. What is the value of b?
x: 10 20 30 40 50
y: 22 23 27 28 30
---------------------------------------------------------------------------------------------------------------------

7.7 UNIT END QUESTIONS
A. Descriptive Type Questions
1. What is the polyfit function?
2. Explain: polynomials that do not necessarily pass through any of the points.
3. How is curve fitting with polynomials done in MATLAB?
4. Discuss other curve fitting.
5. State other considerations in choosing a function.

B. Multiple Choice Questions
1. Fit a straight line y = a + bx into the given data: (x, y): (5, 12) (10, 13) (15, 14) (20, 15) (25, 16).
a) y = 11
b) y = 0.2x

c) y = 11 + 0.2x
d) y = 1.1 + 0.2x

2. Fit a straight line y = a + bx into the given data. Also estimate the production in the year 2000.
Year (x): 1966 1976 1986 1996 2006
Production in lbs (y): 10 12 13 16 17
a) 12.33
b) 14.96
c) 11.85
d) 18.67

3. Fit a straight line y = a + bx into the given data. What is the value of y when x = 8?
x: 1 2 3 4 5 6
y: 20 21 22 23 24 25
a) 45.2
b) 26
c) 28
d) 37

4. If n = __________ the polynomial is a parabola.
a) 2
b) 1
c) 3
d) 4

5. Exponential functions cannot pass through the ______
a) X-axis
b) Y-axis
c) Z-axis
d) origin

Answers:

1 - c; 2 - b; 3 - b; 4 - a; 5 - d

7.8 REFERENCES
 Salaria, R.S. A Textbook of Statistical and Numerical Methods in Engineering. Delhi: Khanna Book Publishing Company.
 Gupta, S.P. and Kapoor, V.K. (2014). Fundamentals of Mathematical Statistics. Delhi: Sultan Chand and Sons.
 Sujatha Sinha and Sushma Pradhan. Numerical Analysis and Statistical Methods. Academic Publishers.
 Gupta, S.P. and Kapoor, V.K. (2015). Fundamentals of Applied Statistics. Delhi: Sultan Chand & Sons.
 J.H. Wilkinson. The Algebraic Eigenvalue Problem (Numerical Mathematics and Scientific Computation). Clarendon Press.
 Gupta and Dey. Numerical Methods. McGraw Hill Education.
 Numerical Methods & Analysis – Engineering App – Google Play Store
 https://en.wikibooks.org/wiki/Statistics/Numerical_Methods/Numerical_Comparison_of_Statistical_Software

UNIT 8 - STATISTICAL METHODS

Structure
8.0 Learning Objective
8.1 Introduction
8.2 Sample Distributions
8.3 Properties of Sampling Distributions
8.4 Creating a Sampling Distribution
8.5 Test of Significance
8.5.1 Definition of Significance Testing
8.5.2 Tests of Significance in Statistics
8.5.3 Process of Significance Testing
8.5.4 Types of Errors
8.5.5 Types of Statistical Tests
8.5.6 What is p-Value Testing?
8.6 t and F Test
8.6.1 T-Test Solved Examples
8.7 Summary
8.8 Keywords
8.9 Learning Activity
8.10 Unit End Questions
8.11 References

8.0 LEARNING OBJECTIVE
After studying this unit, you will be able to:
 Explain the concept of sample distributions
 Enumerate tests of significance
 Comprehend the t and F test

8.1 INTRODUCTION
A sampling distribution is a probability distribution of a statistic that is obtained by drawing a large number of samples from a specific population. Researchers use sampling distributions in order to simplify the process of statistical inference. The sampling distribution of a given population is the

distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population. In statistics, a population is the entire pool from which a statistical sample is drawn. A population may refer to an entire group of people, objects, events, hospital visits, or measurements. A population can thus be said to be an aggregate observation of subjects grouped together by a common feature.

8.2 SAMPLE DISTRIBUTIONS
A sampling distribution is a probability distribution of a statistic obtained from a larger number of samples drawn from a specific population. The sampling distribution of a given population is the distribution of frequencies of a range of different outcomes that could possibly occur for a statistic of a population.

Inferential statistics involves generalizing from a sample to a population. A critical part of inferential statistics involves determining how far sample statistics are likely to vary from each other and from the population parameter. These determinations are based on sampling distributions. The sampling distribution of a statistic is the distribution of that statistic, considered as a random variable, when derived from a random sample of size n. It may be considered as the distribution of the statistic for all possible samples from the same population of a given size. Sampling distributions allow analytical considerations to be based on the sampling distribution of a statistic rather than on the joint probability distribution of all the individual sample values. The sampling distribution depends on: the underlying distribution of the population, the statistic being considered, the sampling procedure employed, and the sample size used. For example, consider a normal population with mean μ and variance σ².
Assume we repeatedly take samples of a given size from this population and calculate the arithmetic mean for each sample. This statistic is then called the sample mean. Each sample has its own average value, and the distribution of these averages is called the "sampling distribution of the sample mean." This distribution is normal since the underlying population is normal, although sampling distributions may also often be close to normal even when the population distribution is not. An alternative to the sample mean is the sample median. When calculated from the same population, it has a different sampling distribution from that of the mean and is generally not normal (but it may be close for large sample sizes).

8.3 PROPERTIES OF SAMPLING DISTRIBUTIONS
Knowledge of the sampling distribution can be very useful in making inferences about the overall population.

Sampling Distributions and Inferential Statistics
Sampling distributions are important for inferential statistics. In practice, one will collect sample data and, from these data, estimate parameters of the population distribution. Thus, knowledge of the sampling distribution can be very useful in making inferences about the overall population. For example, knowing the degree to which means from different samples differ from each other and from the population mean would give you a sense of how close your particular sample mean is likely to be to the population mean. Fortunately, this information is directly available from a sampling distribution. The most common measure of how much sample means differ from each other is the standard deviation of the sampling distribution of the mean. This standard deviation is called the standard error of the mean.

Standard Error
The standard deviation of the sampling distribution of a statistic is referred to as the standard error of that quantity. For the case where the statistic is the sample mean, and samples are uncorrelated, the standard error is
SE(x̄) = s/√n
where s is the sample standard deviation and n is the size (number of items) of the sample. An important implication of this formula is that the sample size must be quadrupled (multiplied by 4) to achieve half the measurement error. When designing statistical studies where cost is a factor, this may have a role in understanding cost-benefit trade-offs. If all the sample means were very close to the population mean, then the standard error of the mean would be small. On the other hand, if the sample means varied considerably, then the standard error of the mean would be large. To be specific, assume your sample mean is 125 and you estimated that the standard error of the mean is 5.
If you had a normal distribution, then it would be likely that your sample mean would be within 10 units of the population mean, since most of a normal distribution is within two standard deviations of the mean.

More Properties of Sampling Distributions
1. The overall shape of the distribution is symmetric and approximately normal.
2. There are no outliers or other important deviations from the overall pattern.
3. The centre of the distribution is very close to the true population mean.

A statistical study can be said to be biased when one outcome is systematically favoured over another. However, the study can be said to be unbiased if the mean of its sampling distribution is equal to the true value of the parameter being estimated. Finally, the variability of a statistic is described by the spread of its sampling distribution. This spread is determined by the sampling design and the size of the sample. Larger samples give smaller spread. As long as the population is much larger than the sample (at least 10 times as large), the spread of the sampling distribution is approximately the same for any population size.
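A quick simulation illustrates the standard error in action: the spread of sample means shrinks as s/√n, so quadrupling the sample size roughly halves it. This Python sketch (standard library only; population mean 125 chosen to match the example above) draws repeated samples and measures the spread of their means:

```python
import random
import statistics

random.seed(1)  # reproducible draws

def sample_mean(n):
    # Mean of one sample of size n from a normal population (mu=125, sigma=10).
    return statistics.fmean(random.gauss(125, 10) for _ in range(n))

# Spread of the sample means over many repeated samples.
se_25 = statistics.stdev(sample_mean(25) for _ in range(2000))    # theory: 10/sqrt(25) = 2.0
se_100 = statistics.stdev(sample_mean(100) for _ in range(2000))  # theory: 10/sqrt(100) = 1.0
```

Quadrupling n from 25 to 100 roughly halves the observed spread of the sample means, in line with SE = s/√n.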

8.4 CREATING A SAMPLING DISTRIBUTION
Learn to create a sampling distribution from a discrete set of data. We will illustrate the concept of sampling distributions with a simple example. Consider three pool balls, each with a number on it. Two of the balls are selected randomly (with replacement), and the average of their numbers is computed. There are nine equally likely outcomes: (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2) and (3, 3). Notice that all the means are either 1.0, 1.5, 2.0, 2.5, or 3.0, with frequencies 1, 2, 3, 2 and 1 respectively. The relative frequencies are equal to the frequencies divided by nine because there are nine possible outcomes.
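The nine outcomes and the resulting distribution can be enumerated exactly in a few lines of Python (standard library only), which is a useful sanity check on the frequencies above:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# All 9 equally likely ordered draws (with replacement) from balls {1, 2, 3}.
outcomes = list(product([1, 2, 3], repeat=2))
means = [Fraction(a + b, 2) for a, b in outcomes]

# Sampling distribution of the mean: relative frequency of each possible mean.
dist = {m: Fraction(c, 9) for m, c in Counter(means).items()}
```

Here dist maps the means 1.0, 1.5, 2.0, 2.5 and 3.0 to the probabilities 1/9, 2/9, 3/9, 2/9 and 1/9.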

The figure below shows a relative frequency distribution of the means. This distribution is also a probability distribution, since the y-axis is the probability of obtaining a given mean from a sample of two balls in addition to being the relative frequency.

Fig 8.1: Relative Frequency Distribution

The distribution shown in the above figure is called the sampling distribution of the mean. Specifically, it is the sampling distribution of the mean for a sample size of 2 (N = 2). For this simple example, the distribution of pool balls and the sampling distribution are both discrete distributions. The pool balls have only the numbers 1, 2, and 3, and a sample mean can have one of only five possible values. There is an alternative way of conceptualizing a sampling distribution that will be useful for more complex distributions. Imagine that two balls are sampled (with replacement), and the mean of the two balls is computed and recorded. This process is repeated for a second sample, a third sample, and eventually thousands of samples. After thousands of samples are taken and the mean is computed for each, a relative frequency distribution is drawn. The more samples, the closer the relative frequency distribution will come to the sampling distribution shown in the above figure. As the number of samples approaches infinity, the frequency distribution will approach the sampling distribution. This means that you can conceive of a sampling distribution as being a frequency distribution based on a very large number of samples. To be strictly correct, the sampling distribution only equals the frequency distribution exactly when there is an infinite number of samples.

Example 1: A prototype automotive tire has a design life of 38,500 miles with a standard deviation of 2,500 miles. Five such tires are manufactured and tested.
On the assumption that the actual population mean is 38,500 miles and the actual population standard deviation is 2,500

miles, find the probability that the sample mean will be less than 36,000 miles. Assume that the distribution of lifetimes of such tires is normal.

Solution
For simplicity we use units of thousands of miles. Then the sample mean X̄ has mean μ(X̄) = μ = 38.5 and standard deviation σ(X̄) = σ/√n = 2.5/√5 ≈ 1.118, so
P(X̄ < 36) = P(Z < (36 - 38.5)/1.118) = P(Z < -2.24) ≈ 0.0125.
That is, if the tires perform as designed, there is only about a 1.25% chance that the average of a sample of this size would be so low.

Example 2: An automobile battery manufacturer claims that its midgrade battery has a mean life of 50 months with a standard deviation of 6 months. Suppose the distribution of battery lives of this particular brand is approximately normal.
a. On the assumption that the manufacturer's claims are true, find the probability that a randomly selected battery of this type will last less than 48 months.
b. On the same assumption, find the probability that the mean of a random sample of 36 such batteries will be less than 48 months.

Solution
a. Since the population is known to have a normal distribution,
P(X < 48) = P(Z < (48 - 50)/6) = P(Z < -0.33) ≈ 0.3707.
b. The sample mean has mean μ(X̄) = 50 and standard deviation σ(X̄) = 6/√36 = 1, so
P(X̄ < 48) = P(Z < (48 - 50)/1) = P(Z < -2) ≈ 0.0228.
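Both examples reduce to evaluating the standard normal CDF, Φ(z) = (1 + erf(z/√2))/2, which can be sketched with the Python standard library (MATLAB's normcdf plays the same role):

```python
import math

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Example 1: P(sample mean < 36) with mu = 38.5, sigma_mean = 2.5/sqrt(5)
p_tires = phi((36 - 38.5) / (2.5 / math.sqrt(5)))   # about 0.0127

# Example 2(a): P(X < 48) with mu = 50, sigma = 6
p_batt_one = phi((48 - 50) / 6)                     # about 0.369
# Example 2(b): P(sample mean < 48) with sigma_mean = 6/sqrt(36) = 1
p_batt_mean = phi((48 - 50) / 1)                    # about 0.0228
```

Note how averaging over a sample sharpens the result: a single battery falls below 48 months about 37% of the time, but a sample mean of 36 batteries does so only about 2% of the time.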

8.5 TEST OF SIGNIFICANCE
In statistics, tests of significance are methods of reaching a conclusion to reject or support a claim based on sample data. Statistics is a branch of mathematics that deals with the collection and analysis of numerical data, and is well known for research based on statistical surveys. During a statistical process, a very common as well as important term we come across is "significance". Statistical significance is very important in research, not only in mathematics but in several different fields such as medicine, psychology and biology. There are many methods through which significance can be tested. These are known as significance tests. Let us learn about significance testing in detail.

8.5.1 Definition of Significance Testing
In statistics, it is important to know whether the result of an experiment is significant enough or not. In order to measure the significance, there are some predefined tests which can be applied. These tests are called the tests of significance or simply the significance tests. This statistical testing is subject to some degree of error. For some experiments, the researcher is required to define the probability of sampling error in advance. In any test which does not consider the entire population, sampling error does exist. The testing of significance is very important in statistical research. The significance level is the level at which it can be decided whether a given event is statistically significant; this is closely tied to the p-value. It is observed that bigger samples are less prone to chance, thus sample size plays a vital role in measuring statistical significance. One should use only representative and random samples for significance testing. In short, the significance is the probability that a relationship exists. Significance tests tell us about the probability that a relationship we found is due to random chance, and to which level.
This indicates the error we would make if the relationship found is assumed to exist when it does not.

8.5.2 Tests of Significance in Statistics
Technically speaking, statistical significance refers to the probability of a result of some statistical test or research occurring by chance. The main purpose of performing statistical research is basically to find the truth. In this process, the researcher has to make sure about the quality of the sample, accuracy, and good measures, which need a number of steps to be done. The researcher has to determine whether the findings of experiments have occurred due to a good study or just by fluke. The significance is a number which represents the probability that the result of some study has occurred purely by chance. Statistical significance may be weak or strong. It does not necessarily indicate practical significance. Sometimes, when a researcher does not carefully use language in the report of an experiment, the significance may be misinterpreted.

Psychologists and statisticians conventionally look for a probability of 5% or less, meaning at most 5% of such results would occur by chance. This also indicates that there is a 95% chance of the results occurring NOT by chance. Whenever it is found that the result of our experiment is statistically significant, it means we can be 95% sure the results are not due to chance.

8.5.3 Process of Significance Testing
In the process of testing for statistical significance, there are the following steps:
1. Stating a Hypothesis for Research
2. Stating a Null Hypothesis
3. Selecting a Probability of Error Level
4. Selecting and Computing a Statistical Significance Test
5. Interpreting the results

8.5.4 Types of Errors
There are basically two types of errors:
 Type I
 Type II

Type I Error
A type I error occurs when the researcher concludes that the relationship assumed in the research hypothesis exists, when in reality the evidence is that it does not. In this type of error, the researcher is supposed to reject the research hypothesis and accept the null hypothesis, but the opposite happens. The probability that researchers commit a Type I error is denoted by alpha (α).

Type II Error
The type II error is just the opposite of the type I error. It occurs when it is assumed that a relationship does not exist, but in reality it does. In this type of error, the researcher is supposed to accept the research hypothesis and reject the null hypothesis, but the opposite happens. The probability that a type II error is committed is represented by beta (β).

8.5.5 Types of Statistical Tests
One-tailed and two-tailed tests are two types of statistical tests that are used alternatively for the computation of the statistical significance of some parameter in a given set of data. These are also termed one-sided and two-sided tests.
 In research, the one-tailed test can be used when deviations of the estimated parameter in one direction from an assumed benchmark value are considered theoretically possible.
 On the other hand, the two-tailed test should be used when deviations in both directions from the benchmark value are considered theoretically possible.

The word "tail" is used in the names of these tests because the extreme regions of the distribution, in which observations lead to rejecting the null hypothesis, have small probability and "tail off" to zero, as in the bell curve of the normal distribution. The choice of a one-tailed or two-tailed significance test depends upon the research hypothesis.

Example
1. The one-tailed test can be utilized for the test of a null hypothesis such as: boys will not score significantly higher marks than girls in 10 Standard. In this example, the null hypothesis does indirectly assume the direction of the difference.
2. The two-tailed test could be utilized in the testing of the null hypothesis: there is no significant difference in scores of boys and girls in 10 Standard.

8.5.6 What is p-Value Testing?
In the context of the statistical significance of data, the p-value is an important term in hypothesis testing. The p-value is a function of the observed sample results which is used for testing a statistical hypothesis. A threshold value is to be selected before the test is performed. This value is known as the significance level, traditionally 1% or 5%. It is denoted by α. When the p-value is smaller than or equal to the significance level (α), the data is said to be inconsistent with our assumption that the null hypothesis is true. Therefore, the null hypothesis should be rejected and an alternative hypothesis is accepted or assumed as true. Note that the smaller the p-value, the bigger the significance, as it indicates that the null hypothesis does not adequately explain the observation. If the p-value is calculated accurately, then such a test controls the type I error rate to be no greater than the significance level (α). The use of p-values in statistical hypothesis testing is very commonly seen in a wide variety of areas such as psychology, sociology, science, economics, social science, biology, criminal justice etc.
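The tail direction determines how a p-value is computed from a test statistic. A small Python sketch (standard library only; the observed z value is purely illustrative) makes the one-tailed/two-tailed distinction concrete:

```python
import math

def phi(z):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def p_value(z, tails=1):
    # Upper-tail p-value for a z statistic; doubled for a two-tailed test.
    p = 1.0 - phi(z)
    return 2.0 * p if tails == 2 else p

alpha = 0.05
z_obs = 1.8  # hypothetical observed statistic
one_tailed = p_value(z_obs, tails=1)   # about 0.036 -> significant at 5%
two_tailed = p_value(z_obs, tails=2)   # about 0.072 -> not significant at 5%
```

The same statistic can be significant under a one-tailed test but not under a two-tailed test, which is why the choice of test must follow from the research hypothesis, not from the data.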
Example 1: You have a coin and you would like to check whether it is fair or biased. More specifically, let θ be the probability of heads, θ = P(H). Suppose that you need to choose between the following hypotheses:

H0 (the null hypothesis): The coin is fair, i.e., θ = θ0 = 1/2.
H1 (the alternative hypothesis): The coin is not fair, i.e., θ > 1/2.

We toss the coin 100 times and observe 60 heads.
1. Can we reject H0 at significance level α = 0.05?
2. Can we reject H0 at significance level α = 0.01?

3. What is the P-value?

Solution
Let X be the random variable showing the number of observed heads. In our experiment, we observed X = 60. Since n = 100 is relatively large, assuming H0 is true, the random variable

W = (X − 50) / 5

is (approximately) a standard normal random variable, N(0, 1), because under H0 we have E[X] = nθ0 = 50 and SD(X) = √(nθ0(1 − θ0)) = 5. If H0 is true, we expect X to be close to 50, while if H1 is true, we expect X to be larger. Thus, we can suggest the following test: We choose a threshold c. If W ≤ c, we accept H0; otherwise, we accept H1. To calculate P(type I error), we can write

P(type I error) = P(W > c | H0) = 1 − Φ(c), so the threshold for level α is c = zα.

1. If we require significance level α = 0.05, then c = z0.05 = 1.645. The above value can be obtained in MATLAB using the command norminv(1−0.05). Since we have W = 2 > 1.645, we reject H0 and accept H1.
2. If we require significance level α = 0.01, then c = z0.01 = 2.33. The above value can be obtained in MATLAB using the command norminv(1−0.01). Since we have W = 2 ≤ 2.33, we fail to reject H0, so we accept H0.
3. The P-value is the lowest significance level α that results in rejecting H0. Here, since W = 2, we will reject H0 if and only if c < 2. Note that c = zα, thus α = 1 − Φ(c).

If c = 2, we obtain α = 1 − Φ(2) = 0.023. Therefore, we reject H0 for α > 0.023. Thus, the P-value is equal to 0.023.

8.6 T AND F TEST
T-Test: The t-test is any statistical hypothesis test in which the test statistic follows a Student’s t-distribution under the null hypothesis. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. The t-test uses the means and standard deviations of two samples to make a comparison. The formula for the t-test is:

t = (x̄1 − x̄2) / √(S1²/n1 + S2²/n2)

The formula for the (sample) standard deviation is:

S = √( Σ(x − x̄)² / (n − 1) )

Where,
x = values given
x̄ = mean
n = total number of values.

8.6.1 T-Test Solved Examples
Question 1: Find the t-test value for the following two sets of values: 7, 2, 9, 8 and 1, 2, 3, 4.
Solution:

Number of terms in first set: n1 = 4
Mean for first set of data: x̄1 = 6.5

Construct the following table for the standard deviation:

x      x − x̄1    (x − x̄1)²
7       0.5        0.25
2      −4.5       20.25
9       2.5        6.25
8       1.5        2.25
                  Σ = 29.00

Standard deviation for the first set of data: S1 = √(29/3) = 3.11

Number of terms in second set: n2 = 4
Mean for second set of data: x̄2 = 2.5

Construct the following table for the standard deviation:

x      x − x̄2    (x − x̄2)²
1      −1.5        2.25
2      −0.5        0.25
3       0.5        0.25
4       1.5        2.25
                  Σ = 5.00

Standard deviation for the second set of

data: S2 = 1.29

Formula for the t-test value:

t = (x̄1 − x̄2) / √(S1²/n1 + S2²/n2) = (6.5 − 2.5) / √(3.11²/4 + 1.29²/4) ≈ 2.38

F-Test: A test statistic which has an F-distribution under the null hypothesis is called an F-test. It is used to compare statistical models fitted to the data set provided or available. George W. Snedecor, in honour of Sir Ronald A. Fisher, named this formula the F-test formula. The F-test formula is used to compare the variances of two different sets of values. To apply it to the F-distribution under the null hypothesis, we first need to find the mean of each of the two sets of observations and then calculate their variances.
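Before turning to the F-test, the two-sample t computation above can be cross-checked with a short script (a sketch using only Python's standard library; `statistics.stdev` uses the same n − 1 denominator as the worked example):

```python
import math
import statistics

def two_sample_t(a, b):
    """Unpooled two-sample t statistic: (mean1 - mean2) / sqrt(s1^2/n1 + s2^2/n2)."""
    m1, m2 = statistics.mean(a), statistics.mean(b)
    s1, s2 = statistics.stdev(a), statistics.stdev(b)   # sample SDs, n - 1 denominator
    return (m1 - m2) / math.sqrt(s1 ** 2 / len(a) + s2 ** 2 / len(b))

t = two_sample_t([7, 2, 9, 8], [1, 2, 3, 4])
print(round(t, 2))  # 2.38
```

The rounded intermediate values in the worked example (3.11, 1.29) reproduce the same result to two decimal places.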

The F-distribution is generated by drawing two samples from the same normal population, so it can be used to test the hypothesis that two samples come from populations with the same variance. You would have two samples (one of size n1 and one of size n2) and the sample variance from each. Obviously, if the two variances are very close to being equal, the two samples could easily have come from populations with equal variances. Because the F-statistic is the ratio of two sample variances, when the two sample variances are close to equal, the F-score is close to one. If you compute the F-score, and it is close to one, you accept your hypothesis that the samples come from populations with the same variance.

This is the basic method of the F-test. Hypothesize that the samples come from populations with the same variance. Compute the F-score by finding the ratio of the sample variances. If the F-score is close to one, conclude that your hypothesis is correct and that the samples do come from populations with equal variances. If the F-score is far from one, then conclude that the populations probably have different variances.

The basic method must be fleshed out with some details if you are going to use this test at work. There are two sets of details: first, formally writing hypotheses, and second, using the F-distribution tables so that you can tell if your F-score is close to one or not. Formally, two hypotheses are needed for completeness. The first is the null hypothesis that there is no difference (hence null). It is usually denoted as H0. The second is that there is a difference, and it is called the alternative, and is denoted H1 or Ha. Using the F-tables to decide how close to one is close enough to accept the null hypothesis (truly formal statisticians would say “fail to reject the null”) is fairly tricky, because the F-distribution tables are fairly tricky.
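The basic method can be sketched in code. The samples below are hypothetical, and in practice the critical value is read from an F-table for the relevant degrees of freedom (5.05 is the upper 5% point for (5, 5) df):

```python
import statistics

def f_score(a, b):
    """F statistic: ratio of sample variances, larger variance on top."""
    v1, v2 = statistics.variance(a), statistics.variance(b)
    return max(v1, v2) / min(v1, v2)

# Hypothetical samples from two processes
sample1 = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]
sample2 = [12.4, 10.9, 13.2, 11.1, 12.9, 10.8]

f = f_score(sample1, sample2)
critical = 5.05  # F-table value for (5, 5) df at alpha = 0.05
if f < critical:
    print("fail to reject H0: variances may be equal")
else:
    print("reject H0: variances differ")
```

Putting the larger variance in the numerator keeps the score at or above one, matching the one-sided table lookup described above.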
Before using the tables, the researcher must decide how much chance he or she is willing to take that the null will be rejected when it is really true.

Example 1: Conduct an F-test on the following samples:
Sample-1 having variance = 109.63, sample size = 41.
Sample-2 having variance = 65.99, sample size = 21.
Solution:
Step-1:- First write the hypothesis statements as:

H0: No difference in variances.
Ha: Difference in variances.
Step-2:- Calculate the F statistic. Here take the highest variance as the numerator and the lowest variance as the denominator:
F = 109.63 / 65.99 = 1.66
Step-3:- Calculate the degrees of freedom. The degrees of freedom in the table will be the sample size − 1, so for sample-1 it is 40 and for sample-2 it is 20.
Step-4:- Choose the alpha level. As no alpha level was given in the question, we may use the standard level of 0.05. This needs to be halved for a two-tailed test, so use 0.025.
Step-5:- Find the critical F-value using the F-table with 0.025. Critical-F for (40, 20) at alpha (0.025) is 2.287.
Step-6:- Compare the calculated value to the table value. If the calculated value is higher than the table value, then we may reject the null hypothesis. Here, 1.66 < 2.287, so we cannot reject the null hypothesis.

8.7 SUMMARY
Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the joint probability distribution of all the individual sample values.

A sample is a subset of the possible values of a random variable. The set of all possible values is called the population. Samples are available; populations are not. Samples are used to infer characteristics of the population. A good sample is representative of the population, in terms of
1. The values
2. The frequency of the values.

Imagine that you have taken all of the samples with n=10 from a population for which you knew the mean, found the t-distribution for 9 df by computing a t-score for each sample, and generated a relative frequency distribution of the t’s. When you were finished, someone brought you another sample (n=10) wondering if that new sample came from the original population. You could use your sampling distribution of t’s to test if the new sample comes from the original population or not.

To conduct the test, first hypothesize that the new sample comes from the original population. With this hypothesis, you have hypothesized a value for μ, the mean of the original population, to use to compute a t-score for the new sample. If the t for the new sample is close to zero—if the t-score for the new sample could easily have come from the middle of the t-distribution you generated—your hypothesis that the new sample comes from a population with the hypothesized mean seems reasonable, and you can conclude that the data support the new sample coming from the original population. If the t-score from the new sample is far above or far below zero, your hypothesis that this new sample comes from the original population seems unlikely to be true, for few samples from the original population would have t-scores far from zero. In that case, conclude that the data support the idea that the new sample comes from some other population.

This is the basic method of using this t-test. Hypothesize the mean of the population you think a sample might come from. Using that mean, compute the t-score for the sample. If the t-score is close to zero, conclude that your hypothesis was probably correct and that you know the mean of the population from which the sample came. If the t-score is far from zero, conclude that your hypothesis is incorrect, and the sample comes from a population with a different mean.
Years ago, statisticians discovered that when pairs of samples are taken from a normal population, the ratios of the variances of the samples in each pair will always follow the same distribution. Not surprisingly, over the intervening years, statisticians have found that the ratios of sample variances collected in a number of different ways follow this same distribution, the F-distribution.

8.8 KEYWORDS
 Inferential statistics: A branch of mathematics that involves drawing conclusions about a population based on sample data drawn from it.
 Sampling distribution: The probability distribution of a given statistic based on a random sample.
 Frequency distribution: A representation, either in graphical or tabular format, which displays the number of observations within a given interval.
 Sample: A small part or quantity intended to show what the whole is like.
 Population: In statistics, a population is the entire pool from which a statistical sample is drawn.
 Tests of significance: In statistics, tests of significance are methods of reaching a conclusion to reject or support claims based on sample data.

 The p-value is said to be a function of observed sample results which is used for testing a statistical hypothesis.

8.9 LEARNING ACTIVITY
1. How is a sampling distribution created? Explain with an example.
---------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------
2. Explain the types of statistical tests.
---------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------

8.10 UNIT END QUESTIONS
A. Descriptive Type Questions
1. Explain sampling distribution.
2. State the properties of a sampling distribution.
3. Discuss tests of significance.
4. Write a note on the t-test.
5. Explain in brief the F-test.

B. Multiple Choice Questions
1. The difference between the expected sample value and the estimated value of the parameter is called?
a) bias
b) error
c) contradiction
d) difference

2. In which of the following types of sampling is the information collected under the opinion of an expert?
a) quota sampling
b) convenience sampling
c) purposive sampling
d) judgement sampling

3. Which of the following is a subset of a population?
a) distribution
b) sample
c) data
d) set

4. Any population which we want to study is referred to as?
a) standard population
b) final population
c) infinite population
d) target population

5. Suppose we want to make a voters list for the 2019 general elections; then we require __________
a) sampling error
b) random error
c) census
d) simple error

Answers: 1 – a; 2 – d; 3 – b; 4 – d; 5 – c

8.11 REFERENCES
 Salaria, R.S. A Textbook of Statistical and Numerical Methods in Engineering. Delhi: Khanna Book Publishing Company.
 Gupta, S.P. and Kapoor, V.K. (2014). Fundamentals of Mathematical Statistics. Delhi: Sultan Chand and Sons.
 Sinha, Sujatha and Pradhan, Sushma. Numerical Analysis and Statistical Methods. Academic Publishers.
 Gupta, S.P. and Kapoor, V.K. (2015). Fundamentals of Applied Statistics. Delhi: Sultan Chand & Sons.
 Wilkinson, J.H. The Algebraic Eigenvalue Problem (Numerical Mathematics and Scientific Computation). Clarendon Press.
 Gupta, Dey. Numerical Methods. McGraw Hill Education.
 Numerical Methods & Analysis – Engineering App – Google Play Store
 https://en.wikibooks.org/wiki/Statistics/Numerical_Methods/Numerical_Comparison_of_Statistical_Software

UNIT 9 - ANALYSIS OF VARIANCE
Structure
9.0 Learning Objective
9.1 Introduction
9.2 Definitions and Assumptions
9.3 Cochran’s Theorem
9.4 ANOVA Table
9.5 One Way ANOVA
9.5.1 Limitations of one-way ANOVA
9.6 Two-way classification (with one observation per cell)
9.7 Summary
9.8 Keywords
9.9 Learning Activity
9.10 Unit End Questions
9.11 References

9.0 LEARNING OBJECTIVE
After studying this unit, you will be able to:
 Explain the concept of analysis of variance
 Enumerate Cochran’s Theorem
 Comprehend the ANOVA table
 State the one-way classification and two-way classification (with one observation per cell)

9.1 INTRODUCTION
Analysis of Variance (ANOVA) is a parametric statistical technique used to compare datasets. This technique was invented by R.A. Fisher, and is thus often referred to as Fisher’s ANOVA as well. It is similar in application to techniques such as the t-test and z-test, in that it is used to compare means and the relative variance between them. However, ANOVA is best applied where more than two populations or samples are to be compared.

Suppose, for example, that patients with the same ailment are given one of three different medication treatments. A common approach to figure out a reliable treatment method would be to analyse the days it took the patients to be cured. We can use a statistical technique which can compare these three treatment samples and depict how different these samples are from one another. Such a technique, which compares the samples on the basis of their means, is called ANOVA.

Analysis of variance (ANOVA) is a statistical technique that is used to check if the means of two or more groups are significantly different from each other. ANOVA checks the impact of one or more factors by comparing the means of different samples. We can use ANOVA to prove or disprove whether all the medication treatments were equally effective.

Another measure to compare the samples is the t-test. When we have only two samples, the t-test and ANOVA give the same results. However, using a t-test would not be reliable in cases where there are more than two samples: if we conduct multiple t-tests for comparing more than two samples, it will have a compounded effect on the error rate of the result.

9.2 DEFINITIONS AND ASSUMPTIONS
Terminologies related to ANOVA you need to know
Before we get started with the applications of ANOVA, I would like to introduce some common terminologies used in the technique.

Grand Mean
The mean is a simple or arithmetic average of a range of values. There are two kinds of means that we use in ANOVA calculations: the separate sample means (μ1, μ2, μ3) and the grand mean (μ). The grand mean is the mean of the sample means, or the mean of all observations combined, irrespective of the sample.

Hypothesis
Considering our above medication example, we can assume that there are two possible cases – either the medication will have an effect on the patients or it won’t. These statements are called hypotheses. A hypothesis is an educated guess about something in the world around us. It should be testable either by experiment or observation. Just like any other kind of hypothesis that you might have studied in statistics, ANOVA also uses a null hypothesis and an alternate hypothesis. The null hypothesis in ANOVA is valid when all the sample means are equal, or they don’t have any significant difference. Thus, they can be considered as part of a larger set of the population. On the other hand, the alternate hypothesis is valid when at least one of the sample means is different from the rest of the sample means. In mathematical form, they can be represented as:

H0: μ1 = μ2 = … = μk
H1: μl ≠ μm for at least one pair (l, m),

where μl and μm belong to any two sample means out of all the samples considered for the test. In other words, the null hypothesis states that all the sample means are equal, or the factor did not have
On the other hand, the alternate hypothesis is valid when at least one of the sample means is different from the rest of the sample means. In mathematical form, they can be represented as: Where belong to any two sample means out of all the samples considered for the test. In other words, the null hypothesis states that all the sample means are equal or the factor did not have 182 CU IDOL SELF LEARNING MATERIAL (SLM)

any significant effect on the results. Whereas, the alternate hypothesis states that at least one of the sample means is different from another. But we still can’t tell which one specifically. For that, we will use other methods that we will discuss later in this article. Between Group Variability Consider the distributions of the below two samples. As these samples overlap, their individual means won’t differ by a great margin. Hence the difference between their individual means and grand mean won’t be significant enough. Now consider these two sample distributions. As the samples differ from each other by a big margin, their individual means would also differ. The difference between the individual means and grand mean would therefore also be significant. Fig 9.1: Two Sample Distributions Such variability between the distributions called Between-group variability. It refers to variations between the distributions of individual groups (or levels) as the values within each group are different. Each sample is looked at and the difference between its mean and grand mean is calculated to calculate the variability. If the distributions overlap or are close, the grand mean will be similar to the individual means whereas if the distributions are far apart, difference between means and grand mean would be large. 183 CU IDOL SELF LEARNING MATERIAL (SLM)

Fig 9.2: Different types of Discrimination We will calculate Between Group Variability just as we calculate the standard deviation. Given the sample means and Grand mean, we can calculate it as: Fig 9.3: sample means and Grand mean, We also want to weigh each squared deviation by the size of the sample. In other words, a deviation is given greater weight if it’s from a larger sample. Hence, we’ll multiply each squared deviation by each sample size and add them up. This is called the sum-of-squares for between-group variability 184 CU IDOL SELF LEARNING MATERIAL (SLM)

There’s one more thing we have to do to derive a good measure of between-group variability. Again, recall how we calculate the sample standard deviation. We find the sum of each squared deviation and divide it by the degrees of freedom. For our between- group variability, we will find each squared deviation, weigh them by their sample size, sum them up, and divide by the degrees of freedom ( ), which in the case of between-group variability is the number of sample means (k) minus 1 9.3 COCHRAN’S THEOREM 185 CU IDOL SELF LEARNING MATERIAL (SLM)

9.4 ANOVA TABLE The ANOVA table breaks down the components of variation in the data into variation between treatments and error or residual variation. Statistical computing packages also produce ANOVA tables as part of their standard output for ANOVA, and the ANOVA table is set up as follows: The ANOVA table above is organized as follows.  The first column is entitled \"Source of Variation\" and delineates the between treatment and error or residual variation. The total variation is the sum of the between treatment and error variation.  The second column is entitled \"Sums of Squares (SS)\". The between treatment sums of squares is and is computed by summing the squared differences between each treatment (or group) mean and the overall mean. The squared differences are weighted by the sample sizes per group (nj). The error sums of squares is: and is computed by summing the squared differences between each observation and its group mean (i.e., the squared differences between each observation in group 1 and the group 1 mean, the squared differences between each observation in group 2 and the group 2 mean, and so on). The double 186 CU IDOL SELF LEARNING MATERIAL (SLM)

summation ( SS ) indicates summation of the squared differences within each treatment and then summation of these totals across treatments to produce a single value. (This will be illustrated in the following examples). The total sums of squares are: and is computed by summing the squared differences between each observation and the overall sample mean. In an ANOVA, data are organized by comparison or treatment groups. If all of the data were pooled into a single sample, SST would reflect the numerator of the sample variance computed on the pooled or total sample. SST does not figure into the F statistic directly. However, SST = SSB + SSE, thus if two sums of squares are known, the third can be computed from the other two.  The third column contains degrees of freedom. The between treatment degrees of freedom is df1 = k-1. The error degrees of freedom is df2 = N - k. The total degrees of freedom is N-1 (and it is also true that (k-1) + (N-k) = N-1).  The fourth column contains \"Mean Squares (MS)\" which are computed by dividing sums of squares (SS) by degrees of freedom (df), row by row. Specifically, MSB=SSB/(k-1) and MSE=SSE/(N-k). Dividing SST/(N-1) produces the variance of the total sample. The F statistic is in the rightmost column of the ANOVA table and is computed by taking the ratio of MSB/MSE. 9.5 ONE WAY ANOVA As we now understand the basic terminologies behind ANOVA, let’s dive deep into its implementation using a few examples. A recent study claims that using music in a class enhances the concentration and consequently helps students absorb more information. As a teacher, your first reaction would be scepticism. What if it affected the results of the students in a negative way? Or what kind of music would be a good choice for this? Considering all this, it would be immensely helpful to have some proof that it actually works. To figure this out, we decided to implement it on a smaller group of randomly selected students from three different classes. 
The idea is similar to conducting a survey. We take three different groups of ten randomly selected students (all of the same age) from three different classrooms. Each classroom was provided with a different environment for students to study. Classroom A had constant music being played in the background, classroom B had variable music being played and classroom C was a regular class with no music playing. After one month, we conducted a test for all the three groups and collected their test scores. The test scores that we obtained were as follows:

Now, we will calculate the sample means and the grand mean for our case. Looking at the above table, we might assume that the mean score of students from Group A is definitely greater than the other two groups, so the treatment must be helpful. Maybe it’s true, but there is also a slight chance that we happened to select the best students from class A, which resulted in better test scores (remember, the selection was done at random). This leads to a few questions, like:
1. How do we decide that these three groups performed differently because of the different situations and not merely by chance?
2. In a statistical sense, how different are these three samples from each other?
3. What is the probability of group A students performing so differently than the other two groups?
To answer all these questions, first we will calculate the F-statistic, which can be expressed as the ratio of between-group variability to within-group variability:

F = MSbetween / MSwithin

Let’s complete the ANOVA test for our example with α = 0.05.
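Since the score table and the worked calculation appeared as figures in the original, here is a sketch of the same one-way ANOVA procedure on hypothetical scores (three groups of ten, matching the study design, but not the original data):

```python
def one_way_anova(groups):
    """One-way ANOVA: returns (SSB, SSW, F) for a list of groups of observations."""
    all_vals = [x for g in groups for x in g]
    grand = sum(all_vals) / len(all_vals)
    k, n_total = len(groups), len(all_vals)

    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    msb = ssb / (k - 1)           # between-group mean square
    msw = ssw / (n_total - k)     # within-group mean square
    return ssb, ssw, msb / msw

# Hypothetical test scores for classrooms A, B and C
a = [78, 84, 81, 90, 76, 88, 83, 79, 85, 82]
b = [72, 75, 70, 78, 74, 71, 77, 73, 76, 69]
c = [74, 70, 73, 76, 68, 72, 75, 71, 77, 70]

ssb, ssw, f = one_way_anova([a, b, c])
# Compare f with the critical value F(2, 27) at alpha = 0.05 (about 3.35 from an F-table)
```

If the computed F exceeds the tabulated critical value, at least one classroom mean differs significantly.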

9.5.1 Limitations of one-way ANOVA
A one-way ANOVA tells us that at least two groups are different from each other. But it won’t tell us which groups are different. If our test returns a significant F-statistic, we may need to run a post-hoc test to tell us exactly which groups have a difference in means.

Merits
· Layout is very simple and easy to understand.
· Gives maximum degrees of freedom for error.

Demerits
· Population variances of experimental units for different treatments need to be equal.
· Verification of the normality assumption may be difficult.

Example 1: Three different techniques, namely medication, exercise and special diet, are randomly assigned to individuals diagnosed with high blood pressure in order to lower it. After four weeks the reduction in each person’s blood pressure is recorded. Test at the 5% level whether there is a significant difference in the mean reduction of blood pressure among the three techniques.
Solution:
Step 1: Hypotheses
Null Hypothesis: H0: µ1 = µ2 = µ3
That is, there is no significant difference among the three groups in the average reduction in blood pressure.
Alternative Hypothesis: H1: μi ≠ μj for at least one pair (i, j); i, j = 1, 2, 3; i ≠ j.
That is, there is a significant difference in the average reduction in blood pressure in at least one pair of treatments.
Step 2: Data

Step 3: Level of significance
α = 0.05
Step 4: Test statistic
F0 = MST / MSE
Step 5: Calculation of the test statistic

Step 6: Critical value
f(2, 12),0.05 = 3.8853
Step 7: Decision
As F0 = 9.17 > f(2, 12),0.05 = 3.8853, the null hypothesis is rejected. Hence, we conclude that there exists a significant difference in the reduction of average blood pressure in at least one pair of techniques.

Example 2: Three composition instructors recorded the number of spelling errors which their students made on a research paper. At the 5% level of significance, test whether there is a significant difference in the average number of errors in the three classes of students.
Solution:
Step 1: Hypotheses
Null Hypothesis: H0: µ1 = µ2 = µ3
That is, there is no significant difference among the mean numbers of errors in the three classes of students.
Alternative Hypothesis: H1: μi ≠ μj for at least one pair (i, j); i, j = 1, 2, 3; i ≠ j.
That is, at least one pair of groups differ significantly in the mean number of errors.
Step 2: Data

Step 3: Level of significance
α = 5%
Step 4: Test statistic
F0 = MST / MSE
Step 5: Calculation of the test statistic
Individual squares

ANOVA Table
Step 6: Critical value
The critical value = f(15, 2),0.05 = 3.6823
Step 7: Decision

As F0 = 0.710 < f(15, 2),0.05 = 3.6823, the null hypothesis is not rejected. There is not enough evidence to reject the null hypothesis, and hence we conclude that the mean numbers of errors made by these three classes of students do not differ significantly.

9.6 TWO-WAY ANOVA / TWO-WAY CLASSIFICATION (WITH ONE OBSERVATION PER CELL)
Using one-way ANOVA, we found out that the music treatment was helpful in improving the test results of our students. But this treatment was conducted on students of the same age. What if the treatment were to affect different age groups of students in different ways? Or maybe the treatment had varying effects depending upon the teacher who taught the class. Moreover, how can we be sure which factor(s) is affecting the results of the students more? Maybe the age group is a more dominant factor responsible for a student’s performance than the music treatment. For such cases, when the outcome or dependent variable (in our case the test scores) is affected by two independent variables/factors, we use a slightly modified technique called two-way ANOVA.

In the one-way ANOVA test, we found out that the groups subjected to ‘variable music’ and ‘no music at all’ performed more or less equally. It means that the variable music treatment did not have any significant effect on the students. So, while performing two-way ANOVA, we will not consider the “variable music” treatment, for simplicity of calculation. Rather, a new factor, age, will be introduced to find out how the treatment performs when applied to students of different age groups. This time our dataset looks like this:

Here, there are two factors – class group and age group – with two and three levels respectively. So we now have six different groups of students based on the different combinations of class groups and age groups, and each group has a sample size of 5 students.

A few questions that two-way ANOVA can answer about this dataset are:
1. Is music treatment the main factor affecting performance? In other words, do groups subjected to different music differ significantly in their test performance?
2. Is age the main factor affecting performance? In other words, do students of different ages differ significantly in their test performance?
3. Is there a significant interaction between the factors? In other words, how do age and music interact with regard to a student’s test performance? For example, it might be that younger students and elder students reacted differently to such a music treatment.
4. Can any differences in one factor be found within another factor? In other words, can any differences in music and test performance be found in different age groups?

Two-way ANOVA tells us about the main effect and the interaction effect. The main effect is similar to a one-way ANOVA, where the effect of music and of age would be measured separately. The interaction effect is the one where both music and age are considered at the same time. That’s why a two-way ANOVA can have up to three hypotheses, which are as follows:

Two null hypotheses will be tested if we have placed only one observation in each cell. For this example, those hypotheses will be:
H1: All the music treatment groups have equal mean scores.
H2: All the age groups have equal mean scores.
For multiple observations in cells, we would also be testing a third hypothesis:
H3: The factors are independent, or the interaction effect does not exist.

We have calculated all the means – sound class means, age group means and the mean of every group combination – in the above table. Now, calculate the sum of squares (SS) and degrees of freedom (df) for sound class, age group and the interaction between the factors. We already know how to calculate SSwithin/dfwithin from our one-way ANOVA section, but in two-way ANOVA the formula is different. Let’s look at the calculation of two-way ANOVA: In two-way ANOVA, we also calculate SSinteraction and dfinteraction, which define the combined effect of the two factors.
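The degrees-of-freedom bookkeeping for this two-way layout can be sketched directly; the cell counts come from the example above (2 sound classes, 3 age groups, 5 students per cell), and the identity df_total = df_A + df_B + df_AB + df_within is asserted:

```python
def twoway_df(a, b, n):
    """Degrees of freedom for a two-way ANOVA: a and b factor levels, n observations per cell."""
    df_a = a - 1                   # main effect of factor A
    df_b = b - 1                   # main effect of factor B
    df_ab = (a - 1) * (b - 1)      # interaction A x B
    df_within = a * b * (n - 1)    # within-cell (error); zero when n == 1
    df_total = a * b * n - 1
    assert df_a + df_b + df_ab + df_within == df_total
    return df_a, df_b, df_ab, df_within

print(twoway_df(2, 3, 5))  # (1, 2, 2, 24)
```

With one observation per cell (n = 1) the within-cell term vanishes, which is why the interaction then cannot be tested separately, as noted in the hypotheses above.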

Since we have more than one source of variation (main effects and interaction effects), it is obvious that we will have more than one F-statistic as well. Now, using these variances, we compute the value of the F-statistic for the main and interaction effects. So, the values of the F-statistics are:
F1 = 12.16
F2 = 15.98
F12 = 0.36
We can see the critical values from the F-table:
Fcrit1 = 4.25
Fcrit2 = 3.40
Fcrit12 = 3.40
If, for a particular effect, its F value is greater than its respective F-critical value (obtained from the F-table), then we reject the null hypothesis for that particular effect.

Merits
· Any number of blocks and treatments can be used.
· The number of units in each block should be equal.
· It is the most used design in view of the smaller total sample size, since we are studying two variables at a time.

Demerits
· If the number of treatments is large, it becomes difficult to maintain the homogeneity of the blocks.
· If there is a missing value, it cannot be ignored. It has to be replaced with some function of the existing values, and certain adjustments have to be made in the analysis. This makes the analysis slightly complex.

Comparison between one-way ANOVA and two-way ANOVA

Example 1: A reputed marketing agency in India has three different training programs for its salesmen. The three programs are Methods A, B and C. To assess the success of the programs, 4 salesmen from each of the programs were sent to the field. Their performances in terms of sales are given in the following table. Test whether there is a significant difference among methods and among salesmen.
Solution:
Step 1: Hypotheses
Null Hypotheses:
H01: μM1 = μM2 = μM3 (for treatments)
That is, there is no significant difference among the three programs in their mean sales.
H02: μS1 = μS2 = μS3 = μS4 (for blocks)
That is, there is no significant difference among the four salesmen in their mean sales.
Alternative Hypotheses:
H11: At least one average is different from the others, among the three programs.
H12: At least one average is different from the others, among the four salesmen.
Step 2: Data
Step 3: Level of significance
α = 5%

Step 4: Test statistic
Step 5: Calculation of the test statistic

