14. Quasi-Poisson Regression

It is an alternative to negative binomial regression and can also be used for overdispersed count data. Both methods give similar results, though they differ in how they estimate the effects of covariates: the variance of a quasi-Poisson model is a linear function of the mean, while the variance of a negative binomial model is a quadratic function of the mean.

15. Cox Regression

Cox regression is suitable for time-to-event data – for example, the time from when a customer opens an account until attrition.

16. Tobit Regression

It is used to estimate linear relationships between variables when censoring exists in the dependent variable. Censoring means that we observe the independent variable for all observations, but we know the true value of the dependent variable only for a restricted range of observations.

10.7 Techniques of Regression (Practical Problems)

1. Regression Equation

Regression equations are algebraic expressions of the regression lines. Since there are two regression lines, there are two regression equations – the regression equation of X on Y is used to describe the variations in the values of X for given changes in Y, and the regression equation of Y on X is used to describe the variations in the values of Y for given changes in X.

The Two Regression Equations

The two regression equations are as follows:

(i) Regression equation of X on Y:

(X − X̄) = r (σx / σy) (Y − Ȳ)

Equivalently, writing the line as X = a + bY, the values of a and b are obtained by solving the two simultaneous (normal) equations:

ΣX = Na + bΣY
ΣXY = aΣY + bΣY²
(ii) Regression equation of Y on X:

(Y − Ȳ) = r (σy / σx) (X − X̄)

Writing the line as Y = a + bX, the values of a and b are obtained by solving the two simultaneous (normal) equations:

ΣY = Na + bΣX
ΣXY = aΣX + bΣX²

2. Regression Lines

A regression line describes the average relationship between two variables, say X and Y. It reveals the mean value of X for a given value of Y. The equation of a regression line is known as the "regression equation".

For example: Obtain the two regression equations from the following:

              X – Series    Y – Series
Mean              20            25
Variance           4             9

Coefficient of correlation = 0.75

The regression equation of X on Y is:

(X − X̄) = r (σx / σy) (Y − Ȳ)

Here σx = √4 = 2 and σy = √9 = 3, so:

(X − 20) = 0.75 × (2/3) × (Y − 25)
X − 20 = 0.50 (Y − 25)
X = 0.50Y − 12.5 + 20
X = 0.50Y + 7.5 ...................................................... (i)
The regression equation of Y on X is:

(Y − Ȳ) = r (σy / σx) (X − X̄)

(Y − 25) = 0.75 × (3/2) × (X − 20)
(Y − 25) = 1.125 (X − 20)
Y − 25 = 1.125X − 22.5
Y = 1.125X − 22.5 + 25
Y = 1.125X + 2.5 ................................... (ii)

What the Regression Lines Signify

The further the two regression lines are from each other, the lesser the degree of correlation; the nearer the two regression lines are to each other, the higher the degree of correlation. If the variables are independent, r is zero and the lines of regression are at right angles, i.e., parallel to OX and OY.

For two variables X and Y, there are always two lines of regression:

Regression Line of X on Y: Gives the best estimate for the value of X for any specific given value of Y.

X = a + bY

where,
a = X-intercept
b = Slope of the line
X = Dependent variable
Y = Independent variable
For two variables X and Y, there are always two lines of regression:

Regression Line of Y on X: Gives the best estimate for the value of Y for any specific given value of X.

Y = a + bX

where,
a = Y-intercept
b = Slope of the line
Y = Dependent variable
X = Independent variable

3. Regression Coefficient

The rate of change of one variable for a unit change in the other variable is called the regression coefficient of the former on the latter. Since there are two regression lines, there are two regression coefficients. The rate of change of X for a unit change in Y is called the regression coefficient of X on Y. It is the coefficient of Y in the regression equation when it is in the form X = a + bY, and it is denoted by bxy. Regression coefficient values can be ascertained directly with the help of bxy and byx.

The regression coefficient of X on Y is:

bxy = r (σx / σy)   or   bxy = (ΣXY − nX̄Ȳ) / (ΣY² − nȲ²)

Similarly, the regression coefficient of Y on X is:

byx = r (σy / σx)   or   byx = (ΣXY − nX̄Ȳ) / (ΣX² − nX̄²)
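The two coefficients are easy to compute directly. Below is a minimal sketch in Python (NumPy), using small hypothetical data arrays, that evaluates both the raw-sum form and the r·σ form and checks that they agree:

```python
import numpy as np

# Illustrative data (hypothetical values, for demonstration only)
X = np.array([20, 22, 25, 19, 24, 26, 21, 23])
Y = np.array([25, 27, 31, 24, 29, 32, 26, 28])

n = len(X)
x_bar, y_bar = X.mean(), Y.mean()

# Raw-score forms of the two regression coefficients
bxy = (np.sum(X * Y) - n * x_bar * y_bar) / (np.sum(Y**2) - n * y_bar**2)
byx = (np.sum(X * Y) - n * x_bar * y_bar) / (np.sum(X**2) - n * x_bar**2)

# Cross-check against the sigma form: bxy = r*(sigma_x/sigma_y), byx = r*(sigma_y/sigma_x)
r = np.corrcoef(X, Y)[0, 1]
sx, sy = X.std(), Y.std()   # population standard deviations
assert np.isclose(bxy, r * sx / sy)
assert np.isclose(byx, r * sy / sx)

print(f"bxy = {bxy:.4f}, byx = {byx:.4f}, r = {r:.4f}")
```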
Properties of the Regression Coefficients

(i) The coefficient of correlation is the geometric mean of the two regression coefficients:

r = √(byx × bxy)

(ii) If byx is positive, then bxy should also be positive, and vice versa.

(iii) If one regression coefficient is greater than one, then the other must be less than one.

(iv) The coefficient of correlation will have the same sign as that of the regression coefficients.

(v) The arithmetic mean of byx and bxy is equal to or greater than the coefficient of correlation:

(byx + bxy) / 2 ≥ r

(vi) Regression coefficients are independent of change of origin but not of scale.

Standard Error of Estimate

The standard error of estimate is the measure of variation around the computed regression line. The standard error of estimate of Y measures the variability of the observed values of Y around the regression line; in other words, it gives a measure of the scatter of the observations about the line of regression.

The standard error of estimate of Y on X is:

Se = √( Σ(Y − Ye)² / (n − 2) )

where,
Y = Observed value of Y
Ye = Estimated value from the estimating equation that corresponds to each Y value
(Y − Ye) = The error term
n = Number of observations in the sample
The convenient (shortcut) formula is:

Se = √( (ΣY² − aΣY − bΣXY) / (n − 2) )

where,
X = Value of the independent variable
Y = Value of the dependent variable
a = Y-intercept
b = Slope of the estimating equation
n = Number of data points

• Regression Coefficient of X on Y: The regression coefficient of X on Y is represented by the symbol bxy; it measures the change in X for a unit change in Y. When the deviations are taken from the actual means of X and Y (x = X − X̄, y = Y − Ȳ), it can be obtained as:

bxy = Σxy / Σy²

When the deviations are taken from assumed means, the following formula is used:

bxy = (NΣdxdy − Σdx Σdy) / (NΣdy² − (Σdy)²)

• Regression Coefficient of Y on X: The symbol byx is used to measure the change in Y corresponding to a unit change in X. When the deviations are taken from the actual means:

byx = Σxy / Σx²

When the deviations are taken from assumed means:

byx = (NΣdxdy − Σdx Σdy) / (NΣdx² − (Σdx)²)
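As a numerical check on the two forms of the standard error, here is a small Python sketch with hypothetical data; np.polyfit supplies the least-squares line, and the definitional and shortcut formulas (both over n − 2) agree exactly:

```python
import numpy as np

# Hypothetical paired observations
X = np.array([2, 4, 6, 8, 10])
Y = np.array([3, 7, 5, 11, 14])

n = len(X)

# Fit the regression line of Y on X by least squares: Y = a + bX
b, a = np.polyfit(X, Y, 1)  # np.polyfit returns [slope, intercept]
Ye = a + b * X              # estimated values on the regression line

# Definitional form: scatter of observed Y around the fitted line
se_def = np.sqrt(np.sum((Y - Ye) ** 2) / (n - 2))

# Convenient (shortcut) form using sums, as in the text
se_short = np.sqrt((np.sum(Y**2) - a * np.sum(Y) - b * np.sum(X * Y)) / (n - 2))

assert np.isclose(se_def, se_short)
print(f"Standard error of estimate of Y on X: {se_def:.4f}")
```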
The regression coefficient is also called a slope coefficient, because it determines the slope of the line, i.e., the change in the dependent variable for a unit change in the independent variable.

Illustration 1

Compute the two regression coefficients when r = 0.8, σx = 5, σy = 7.

bxy = r (σx / σy) = 0.8 × 5/7 = 0.57
byx = r (σy / σx) = 0.8 × 7/5 = 1.12

Illustration 2

If bxy = 0.8 and byx = 0.6, then find r.

r = √(bxy × byx) = √(0.8 × 0.6) = +0.693

Illustration 3

State the relationship between the correlation coefficient of two variables and their regression coefficients.

Regression coefficient of X on Y: bxy = r (σx / σy)
Regression coefficient of Y on X: byx = r (σy / σx)

bxy × byx = r (σx / σy) × r (σy / σx) = r²

∴ r = √(bxy × byx)

It can be said that the correlation coefficient is the geometric mean of the regression coefficients.
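All three illustrations rest on the single identity r² = bxy × byx. A quick check in Python, reproducing the numbers above:

```python
import math

# Illustration 1: r = 0.8, sigma_x = 5, sigma_y = 7
r, sx, sy = 0.8, 5, 7
bxy = r * sx / sy   # ~0.5714
byx = r * sy / sx   # 1.12

# Recovering r as the geometric mean of the two coefficients
r_back = math.sqrt(bxy * byx)
print(f"bxy = {bxy:.4f}, byx = {byx:.4f}, r recovered = {r_back:.4f}")

# Illustration 2 as stated: bxy = 0.8, byx = 0.6
# r is positive here because both regression coefficients are positive
print(f"r = {math.sqrt(0.8 * 0.6):.4f}")  # ~0.6928
```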
Illustration 4

Obtain the two regression equations from the following:

              X – Series    Y – Series
Mean              20            25
Variance           4             9

Coefficient of correlation = 0.75

Solution: The regression equation of X on Y is:

(X − X̄) = r (σx / σy) (Y − Ȳ)

(X − 20) = 0.75 × (2/3) × (Y − 25)
(X − 20) = 0.50 (Y − 25)
X − 20 = 0.50Y − 12.5
X = 0.50Y + 20 − 12.5
X = 0.50Y + 7.5 ................................... (i)

The regression equation of Y on X is:

(Y − Ȳ) = r (σy / σx) (X − X̄)

(Y − 25) = 0.75 × (3/2) × (X − 20)
(Y − 25) = 1.125 (X − 20)
Y − 25 = 1.125X − 22.5
Y = 1.125X − 22.5 + 25
Y = 1.125X + 2.5 ................................... (ii)

Illustration 5

Given the following information: X̄ = 65, Ȳ = 67, σx = 25, variance of Y = 12.25 (so σy = 3.5) and r = 0.8. Obtain: (i) the two regression lines, and (ii) the estimate of X when Y = 70 and of Y when X = 58.
Solution:

Regression equation of Y on X:

(Y − Ȳ) = r (σy / σx) (X − X̄)
Y − 67 = 0.8 × (3.5/25) × (X − 65)
Y − 67 = 0.112 (X − 65)
Y − 67 = 0.112X − 7.28
Y = 0.112X + 67 − 7.28 = 0.112X + 59.72 ................ (i)

Regression equation of X on Y:

(X − X̄) = r (σx / σy) (Y − Ȳ)
X − 65 = 0.8 × (25/3.5) × (Y − 67)
X − 65 = 5.714 (Y − 67)
X − 65 = 5.714Y − 382.84
X = 5.714Y − 382.84 + 65 = 5.714Y − 317.84 ................ (ii)

Calculation of X, when Y = 70:

X = 5.714Y − 317.84
X = 5.714 × 70 − 317.84
X = 399.98 − 317.84 = 82.14

Calculation of Y, when X = 58:

Y = 0.112X + 59.72 = (0.112 × 58) + 59.72 = 6.496 + 59.72 = 66.216
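The same solution can be reproduced numerically. A minimal Python sketch follows (any tiny differences from the hand calculation come from rounding 5.714):

```python
# Illustration 5 reproduced from the given summary statistics
x_bar, y_bar = 65, 67
sx, sy = 25, 12.25 ** 0.5   # variance of Y = 12.25, so sigma_y = 3.5
r = 0.8

byx = r * sy / sx           # 0.112   (slope of Y on X)
bxy = r * sx / sy           # 5.7143  (slope of X on Y)

# Each regression line passes through the point of means (x_bar, y_bar)
def estimate_y(x):          # regression of Y on X
    return y_bar + byx * (x - x_bar)

def estimate_x(y):          # regression of X on Y
    return x_bar + bxy * (y - y_bar)

print(f"Y on X: Y = {byx:.3f}X + {y_bar - byx * x_bar:.2f}")
print(f"X on Y: X = {bxy:.3f}Y - {bxy * y_bar - x_bar:.2f}")
print(f"X when Y = 70: {estimate_x(70):.2f}")    # ~82.14
print(f"Y when X = 58: {estimate_y(58):.3f}")    # ~66.216
```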
10.8 Summary

Regression analysis is primarily used for two conceptually distinct purposes. First, regression analysis is widely used for prediction and forecasting, where its use has substantial overlap with the field of machine learning. Second, in some situations, regression analysis can be used to infer causal relationships between the independent and dependent variables.

Regression is the measure of the average relationship between two or more variables in terms of the original units of the data. It is a statistical measurement used in finance, investing and other disciplines that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables.

The term regression analysis refers to the methods by which estimates are made of the values of a variable from knowledge of the values of one or more other variables, and to the measurement of the errors involved in this estimation process. Regression analysis is a mathematical measure of the average relationship between two or more variables. It is a statistical tool used in the prediction of the value of an unknown variable from a known variable. It is a very powerful tool in the field of statistical analysis for predicting the value of one variable, given the value of another variable, when those variables are related to each other.

Regression equations are algebraic expressions of the regression lines. Since there are two regression lines, there are two regression equations – the regression equation of X on Y is used to describe the variations in the values of X for given changes in Y, and the regression equation of Y on X is used to describe the variations in the values of Y for given changes in X.

A regression line describes the average relationship between two variables, say X and Y. It reveals the mean value of X for a given value of Y. The equation of a regression line is known as the "regression equation".
The rate of change of one variable for a unit change in the other variable is called the regression coefficient of the former on the latter. Since there are two regression lines, there are two regression coefficients. The rate of change of X for a unit change in Y is called the regression coefficient of X on Y. It is the coefficient of Y in the regression equation when it is in the form X = a + bY.

Linear regression is the simplest form of regression. It is a technique in which the dependent variable is continuous in nature, and the relationship between the dependent variable and the independent variables is assumed to be linear. For example, a plot of the mileage of cars against their displacement shows a roughly linear relationship between the two.

Quantile regression is an extension of linear regression, and we generally use it when outliers, high skewness or heteroscedasticity exist in the data. In linear regression, we predict the mean of the dependent variable for given independent variables. Since the mean does not describe the whole distribution, modelling the mean is not a full description of the relationship between the dependent and independent variables. So, we can use quantile regression, which predicts a quantile (or percentile) of the dependent variable for given independent variables.
10.9 Key Words/Abbreviations

Regression: Regression is a statistical measurement used in finance and other disciplines to estimate the relationship between variables.

Regression Analysis: Regression analysis is primarily used for two conceptually distinct purposes: prediction/forecasting and inferring causal relationships.

Regression Equation: Regression equations are algebraic expressions of the regression lines.

Regression Lines: A regression line describes the average relationship between two variables, say X and Y.

Regression Coefficient: The rate of change of one variable for a unit change in the other variable is called the regression coefficient.

Linear Regression: Linear regression is the simplest form of regression.

Polynomial Regression: It is a technique to fit a nonlinear equation by taking polynomial functions of the independent variable.

Quantile Regression: Quantile regression is an extension of linear regression.

Ridge Regression: Ridge regression is a way to create a parsimonious model.

10.10 Learning Activity

1. You are suggested to give the interpretation of solving any two problems related to the regression coefficient.

_________________________________________________________________
_________________________________________________________________

2. You are required to analyse linear regression and polynomial regression.

_________________________________________________________________
_________________________________________________________________
10.11 Unit End Exercises (MCQs and Descriptive)

Descriptive Type Questions

1. Define the term regression.
2. What are the regression lines?
3. What is a regression coefficient?
4. Mention the two regression coefficients.
5. Explain the properties of Spearman's correlation coefficient.
6. Explain the properties of regression coefficients with examples.

Multiple Choice Questions

1. Which of the following is a set of statistical processes for estimating the relationships between a dependent variable and one or more independent variables?
(a) Correlation (b) Regression Analysis
(c) Standard Deviation (d) None of the above

2. Which of the following is the measure of the average relationship between two or more variables in terms of the original units of the data?
(a) Regression (b) Correlation
(c) Skewness (d) Hypothesis

3. Which of the following is an assumption in regression analysis?
(a) Existence of an actual linear relationship
(b) The regression analysis is used to estimate the values within the range for which it is valid
(c) The relationship between the dependent and independent variables remains the same till the regression equation is calculated
(d) All the above

4. Which of the following is a technique of regression?
(a) Regression Equation (b) Regression Lines
(c) Scatter Diagram (d) Regression Coefficient
5. Which of the following is a type of regression?
(a) Linear Regression (b) Polynomial Regression
(c) Logistic Regression (d) All the above

Answers: 1. (b), 2. (a), 3. (d), 4. (c), 5. (d)

10.12 References

References of this unit have been given at the end of the book.
UNIT 11  PARAMETRIC AND NON-PARAMETRIC TEST

Structure:
11.1 Introduction
11.2 Parametric Test
11.3 Types of Parametric Test
11.4 Non-parametric Test
11.5 Types of Non-parametric Tests
11.6 Advantages and Disadvantages of Non-parametric Tests
11.7 Summary
11.8 Key Words/Abbreviations
11.9 Learning Activity
11.10 Unit End Exercises (MCQs and Descriptive)
11.11 References

11.0 Learning Objectives

After studying this unit, you will be able to:

Describe the parametric and non-parametric tests
Explain non-parametric tests
11.1 Introduction

Parametric tests assume underlying statistical distributions in the data. Therefore, several conditions of validity must be met so that the result of a parametric test is reliable. The t-test for two independent samples is reliable only if each sample follows a normal distribution and if the sample variances are homogeneous. Non-parametric tests do not rely on any distribution. They can thus be applied even if the parametric conditions of validity are not met. Parametric tests often have non-parametric equivalents.

11.2 Parametric Test

A parametric test is a hypothesis testing procedure based on the assumption that the observed data are distributed according to some distribution of well-known form (e.g., normal, Bernoulli, and so on) up to some unknown parameter(s) on which we want to make inference (say the mean, or the success probability). Parametric tests assume a normal distribution of values, or a "bell-shaped curve." For example, height is roughly a normal distribution: if you were to graph heights from a group of people, you would see a typical bell-shaped curve. This distribution is also called a Gaussian distribution. Parametric tests are in general more powerful (require a smaller sample size) than non-parametric tests.

11.3 Types of Parametric Test

1. T-test

A statistical examination of two population means. A two-sample t-test examines whether two samples are different and is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size.

Formula:

t = (x̄ − μ₀) / (s / √n)

where x̄ is the sample mean, μ₀ is the specified value to be tested, s is the sample standard deviation and n is the size of the sample. The significance of the computed value is then looked up in a table of the t-distribution.
When the standard deviation of the sample is substituted for the standard deviation of the population, the statistic does not have a normal distribution; it has what is called the t-distribution. Because there is a different t-distribution for each sample size, it is not practical to list a separate area-of-the-curve table for each one. Instead, critical t-values for common alpha levels (0.10, 0.05, 0.01, and so forth) are usually given in a single table for a range of sample sizes. For very large samples, the t-distribution approximates the standard normal (z) distribution. In practice, it is best to use t-distributions any time the population standard deviation is not known. Values in the t-table are not actually listed by sample size but by degrees of freedom (df). The number of degrees of freedom for a problem involving the t-distribution for sample size n is simply n − 1 for a one-sample mean problem.

Uses of T-test

Among the most frequently used t-tests are:

1. A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis.

2. A two-sample location test of the null hypothesis that the means of two normally distributed populations are equal. All such tests are usually called Student's t-tests, though strictly speaking that name should only be used if the variances of the two populations are also assumed to be equal; the form of the test used when this assumption is dropped is sometimes called Welch's t-test. These tests are often referred to as "unpaired" or "independent samples" t-tests, as they are typically applied when the statistical units underlying the two samples being compared are non-overlapping.

3. A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero. For example, suppose we measure the size of a cancer patient's tumor before and after a treatment. If the treatment is effective, we expect the tumor size for many of the patients to be smaller following the treatment. This is often referred to as the "paired" or "repeated measures" t-test.

4. A test of whether the slope of a regression line differs significantly from 0.
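A minimal sketch of the one-sample case in Python, using hypothetical data, computing t both by the formula above and with SciPy:

```python
import numpy as np
from scipy import stats

# Hypothetical sample; test H0: population mean = 50
sample = np.array([52.1, 48.3, 55.0, 49.7, 51.2, 53.8, 47.9, 50.6])
mu0 = 50.0

# By the formula: t = (x_bar - mu0) / (s / sqrt(n)), with df = n - 1
n = len(sample)
t_manual = (sample.mean() - mu0) / (sample.std(ddof=1) / np.sqrt(n))

# The same test via SciPy
t_scipy, p_value = stats.ttest_1samp(sample, mu0)

print(f"t (by hand) = {t_manual:.4f}, t (SciPy) = {t_scipy:.4f}, p = {p_value:.4f}")
```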
Assumptions

Most t-test statistics have the form T = Z/s, where Z and s are functions of the data. Typically, Z is designed to be sensitive to the alternative hypothesis (i.e., its magnitude tends to be larger when the alternative hypothesis is true), whereas s is a scaling parameter that allows the distribution of T to be determined.

As an example, in the one-sample t-test, Z = (X̄ − μ)/(σ/√n), where X̄ is the sample mean of the data, n is the sample size, and σ is the population standard deviation of the data; s in the one-sample t-test is σ̂/σ, where σ̂ is the sample standard deviation.

The assumptions underlying a t-test are that:

1. Z follows a standard normal distribution under the null hypothesis.
2. ps² follows a χ² distribution with p degrees of freedom under the null hypothesis, where p is a positive constant.
3. Z and s are independent.

Unpaired and Paired Two-sample T-tests

Two-sample t-tests for a difference in mean can be either unpaired or paired. Paired t-tests are a form of blocking, and have greater power than unpaired tests when the paired units are similar with respect to "noise factors" that are independent of membership in the two groups being compared. In a different context, paired t-tests can be used to reduce the effects of confounding factors in an observational study.

Unpaired

The unpaired, or "independent samples", t-test is used when two separate sets of independent and identically distributed samples are obtained, one from each of the two populations being compared. For example, suppose we are evaluating the effect of a medical treatment, and we enroll 100 subjects into our study, then randomise 50 subjects to the treatment group and 50 subjects to the control group. In this case, we have two independent samples and would use the unpaired form of the t-test.
The randomisation is not essential here – if we contacted 100 people by phone and obtained each person's age and gender, and then used a two-sample t-test to see whether the mean ages differ by gender, this would also be an independent samples t-test, even though the data are observational.

Paired

Dependent samples (or "paired") t-tests typically consist of a sample of matched pairs of similar units, or one group of units that has been tested twice (a "repeated measures" t-test). A typical example of the repeated measures t-test would be where subjects are tested prior to a treatment, say for high blood pressure, and the same subjects are tested again after treatment with a blood-pressure lowering medication.

A dependent t-test based on a "matched-pairs sample" results from an unpaired sample that is subsequently used to form a paired sample, by using additional variables that were measured along with the variable of interest. The matching is carried out by identifying pairs of values consisting of one observation from each of the two samples, where the pair is similar in terms of other measured variables.

Example: Volunteers count the number of breeding horseshoe crabs on beaches on Delaware Bay every year; here are data from 2011 and 2012. The measurement variable is number of horseshoe crabs, one nominal variable is 2011 vs. 2012, and the other nominal variable is the name of the beach. Each beach has one pair of observations of the measurement variable, one from 2011 and one from 2012. The biological question is whether the number of horseshoe crabs has gone up or down between 2011 and 2012. (A code sketch of the paired test on these data follows the table.)

Beach               2011     2012    2012–2011
Bennetts Pier      35282    21814      –13468
Big Stone         359350    83500     –275850
Broadkill          45705    13290      –32415
Cape Henlopen      49005    30150      –18855
Fortescue          68978   125190       56212
Fowler              8700     4620       –4080
Gandys             18780    88926       70146
Higbees            13622     1205      –12417
Highs              24936    29800        4864
Kimbles            17620    53640       36020
Kitts Hummock     117360    68400      –48960
Norburys Landing  102425    74552      –27873
North Bowers       59566    36790      –22776
North Cape May     32610     4350      –28260
Pickering         137250   110550      –26700
Pierces Point      38003    43435        5432
Primehook         101300    20580      –80720
Reeds              62179    81503       19324
Slaughter         203070    53940     –149130
South Bowers      135309    87055      –48254
South CSL         150656   112266      –38390
Ted Harvey        115090    90670      –24420
Townbank           44022    21942      –22080
Villas             56260    32140      –24120
Woodland             125     1260        1135
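Here is a minimal SciPy sketch running both the paired and the unpaired test on the table's data; the P values should come out near the 0.045 (paired) and 0.110 (two-sample) quoted in the discussion below:

```python
import numpy as np
from scipy import stats

# Horseshoe crab counts per beach (from the table above, in row order)
y2011 = np.array([35282, 359350, 45705, 49005, 68978, 8700, 18780, 13622,
                  24936, 17620, 117360, 102425, 59566, 32610, 137250, 38003,
                  101300, 62179, 203070, 135309, 150656, 115090, 44022,
                  56260, 125])
y2012 = np.array([21814, 83500, 13290, 30150, 125190, 4620, 88926, 1205,
                  29800, 53640, 68400, 74552, 36790, 4350, 110550, 43435,
                  20580, 81503, 53940, 87055, 112266, 90670, 21942,
                  32140, 1260])

# Paired test: operates on the per-beach differences
t_paired, p_paired = stats.ttest_rel(y2012, y2011)

# Two-sample (unpaired) test, for comparison
t_unpaired, p_unpaired = stats.ttest_ind(y2012, y2011)

print(f"paired:   t = {t_paired:.3f}, P = {p_paired:.3f}")
print(f"unpaired: t = {t_unpaired:.3f}, P = {p_unpaired:.3f}")
```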
There is a lot of variation from one beach to the next. If the difference between years is small relative to the variation within years, it would take a very large sample size to get a significant two-sample t-test comparing the means of the two years. A paired t-test just looks at the differences, so if the two sets of measurements are correlated with each other, the paired t-test will be more powerful than a two-sample t-test. For the horseshoe crabs, the P value for a two-sample t-test is 0.110, while the paired t-test gives a P value of 0.045.

You can only use the paired t-test when there is just one observation for each combination of the nominal values. If you have more than one observation for each combination, you have to use a two-way ANOVA with replication. For example, if you had multiple counts of horseshoe crabs at each beach in each year, you would have to do the two-way ANOVA.

You can only use the paired t-test when the data are in pairs. If you wanted to compare horseshoe crab abundance in 2010, 2011 and 2012, you would have to do a two-way ANOVA without replication.

"Paired t-test" is just a different name for "two-way ANOVA without replication, where one nominal variable has just two values"; the results are mathematically identical. The paired design is a common one, and if all you are doing is paired designs, you should call your test the paired t-test; it will sound familiar to more people. But if some of your data sets are in pairs, and some are in sets of three or more, you should call all of your tests two-way ANOVAs; otherwise people will think you're using two different tests.

2. ANOVA

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyse the differences among group means in a sample. ANOVA was developed by the statistician and evolutionary biologist Ronald Fisher. The ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalises the t-test beyond two means.
History

While the analysis of variance reached fruition in the 20th century, antecedents extend centuries into the past, according to Stigler. These include hypothesis testing, the partitioning of sums of squares, experimental techniques and the additive model. Laplace was performing hypothesis testing in the 1770s. The development of least-squares methods by Laplace and Gauss circa 1800 provided an improved method of combining observations (over the existing practices then used in astronomy and geodesy). It also initiated much study of the contributions to sums of squares. Laplace knew how to estimate a variance from a residual (rather than a total) sum of squares. By 1827, Laplace was using least-squares methods to address ANOVA problems regarding measurements of atmospheric tides.

Before 1800, astronomers had isolated observational errors resulting from reaction times (the "personal equation") and had developed methods of reducing the errors. The experimental methods used in the study of the personal equation were later accepted by the emerging field of psychology, which developed strong (full factorial) experimental methods, to which randomisation and blinding were soon added. An eloquent non-mathematical explanation of the additive effects model was available in 1885.

Ronald Fisher introduced the term variance and proposed its formal analysis in a 1918 article, "The Correlation between Relatives on the Supposition of Mendelian Inheritance". His first application of the analysis of variance was published in 1921. Analysis of variance became widely known after being included in Fisher's 1925 book Statistical Methods for Research Workers.

Randomisation models were developed by several researchers. The first was published in Polish by Jerzy Neyman in 1923.

One of the attributes of ANOVA that ensured its early popularity was computational elegance. The structure of the additive model allows solution for the additive coefficients by simple algebra rather than by matrix calculations. In the era of mechanical calculators, this simplicity was critical. The determination of statistical significance also required access to tables of the F function, which were supplied by early statistics texts.
The ANOVA Test

An ANOVA test is a way to find out if survey or experiment results are significant. In other words, it helps you to figure out if you need to reject the null hypothesis or accept the alternate hypothesis. Basically, you are testing groups to see if there is a difference between them. Examples of when you might want to test different groups:

(a) A group of psychiatric patients are trying three different therapies: counselling, medication and biofeedback. You want to see if one therapy is better than the others.
(b) A manufacturer has two different processes to make light bulbs. They want to know if one process is better than the other.
(c) Students from different colleges take the same exam. You want to see if one college outperforms the other.

Types of ANOVA Tests

There are two main types: one-way and two-way. Two-way tests can be with or without replication.

1. One-way ANOVA between groups: used when you want to test two or more groups to see if there is a difference between them.
2. Two-way ANOVA without replication: used when you have one group and you're double-testing that same group.

1. One-way ANOVA

A one-way ANOVA is used to compare the means of two or more independent (unrelated) groups using the F-distribution. The null hypothesis for the test is that the means are equal. Therefore, a significant result means that at least two of the means are unequal.
Examples of When to Use a One-way ANOVA

Situation 1: You have a group of individuals randomly split into smaller groups and completing different tasks. For example, you might be studying the effects of tea on weight loss and form three groups: green tea, black tea and no tea.

Situation 2: Similar to Situation 1, but in this case the individuals are split into groups based on an attribute they possess. For example, you might be studying leg strength of people according to weight. You could split participants into weight categories (obese, overweight and normal) and measure their leg strength on a weight machine.

Limitations of the One-way ANOVA

A one-way ANOVA will tell you that at least two groups were different from each other. But it will not tell you which groups were different. If your test returns a significant F-statistic, you may need to run a post hoc test (like the Least Significant Difference test) to tell you exactly which groups had a difference in means.

Example: A scientist wants to know if all children from schools A, B and C have equal mean IQ scores. Each school has 1,000 children. It takes too much time and money to test all 3,000 children. So, a simple random sample of n = 10 children from each school is tested. Part of these data are shown below:

Figure 11.1: Limitations of the One-way ANOVA
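Since the original data sheet is not reproduced here, the sketch below simulates three samples of n = 10 and runs the one-way ANOVA with SciPy; the school names and values are illustrative assumptions:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical IQ samples of n = 10 children from each of three schools
school_a = rng.normal(100, 15, 10)
school_b = rng.normal(105, 15, 10)
school_c = rng.normal(98, 15, 10)

# One-way ANOVA: H0 is that all three population means are equal
f_stat, p_value = stats.f_oneway(school_a, school_b, school_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
# A significant result says at least two means differ, but not which ones;
# a post hoc test (e.g., Tukey's HSD) is needed to locate the difference.
```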
2. Two-way ANOVA

A two-way ANOVA is an extension of the one-way ANOVA. With a one-way, you have one independent variable affecting a dependent variable. With a two-way ANOVA, there are two independents. Use a two-way ANOVA when you have one measurement variable (i.e., a quantitative variable) and two nominal variables. In other words, if your experiment has a quantitative outcome and you have two categorical explanatory variables, a two-way ANOVA is appropriate.

For example, you might want to find out if there is an interaction between income and gender for anxiety level at job interviews. The anxiety level is the outcome, or the variable that can be measured. Gender and income are the two categorical variables. These categorical variables are also the independent variables, which are called factors in a two-way ANOVA. The factors can be split into levels. In the above example, income level could be split into three levels: low, middle and high income. Gender could be split into three levels: male, female and transgender. Treatment groups are all possible combinations of the factors. In this example, there would be 3 × 3 = 9 treatment groups.

Main Effect and Interaction Effect

The results from a two-way ANOVA will calculate a main effect and an interaction effect. The main effect is similar to a one-way ANOVA: each factor's effect is considered separately. With the interaction effect, all factors are considered at the same time. Interaction effects between factors are easier to test if there is more than one observation in each cell. For the above example, multiple stress scores could be entered into cells. If you do enter multiple observations into cells, the number in each cell must be equal.

Two null hypotheses are tested if you are placing one observation in each cell. For this example, those hypotheses would be:

H01: All the income groups have equal mean stress.
H02: All the gender groups have equal mean stress.

For multiple observations in cells, you would also be testing a third hypothesis:

H03: The factors are independent, or the interaction effect does not exist.
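A sketch of how such a two-way layout could be analysed in Python; statsmodels' formula interface is one common route, and the column names here (score, income, gender) and simulated values are illustrative assumptions, not data from the text:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)

# Hypothetical balanced design: 3 income levels x 3 gender levels,
# with several anxiety scores per cell so the interaction can be tested
incomes = np.repeat(["low", "middle", "high"], 30)
genders = np.tile(np.repeat(["male", "female", "transgender"], 10), 3)
score = rng.normal(50, 10, 90)

df = pd.DataFrame({"income": incomes, "gender": genders, "score": score})

# Two-way ANOVA with interaction: score ~ income + gender + income:gender
model = ols("score ~ C(income) * C(gender)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))
```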
Assumptions for Two-way ANOVA

1. The population must be close to a normal distribution.
2. Samples must be independent.
3. Population variances must be equal.
4. Groups must have equal sample sizes.

Example: The following results were calculated using the Quattro Pro spreadsheet. It provides the p-values, and the critical values are for alpha = 0.05.

Source of Variation        SS       df       MS         F      P-value    F-crit
Seed                   512.8667      2   256.4333   28.283    0.000008     3.682
Fertilizer             449.4667      4   112.3667   12.393    0.000119     3.056
Interaction            143.1333      8    17.8917    1.973    0.122090     2.641
Within                 136.0000     15     9.0667
Total                 1241.4667     29

From the above results, we can see that the main effects are both significant, but there is no interaction between them, i.e., the types of seed are not all equal, and the types of fertilizer are not all equal, but the type of seed does not interact with the type of fertilizer.

3. Pearson's Correlation

Pearson's correlation coefficient is the test statistic that measures the statistical relationship, or association, between two continuous variables. It is known as the best method of measuring the association between variables of interest because it is based on the method of covariance. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship.
Pearson's correlation coefficient, also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC) or the bivariate correlation, is a measure of the linear correlation between two variables X and Y. According to the Cauchy-Schwarz inequality, it has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s, and the mathematical formula was derived and published by Auguste Bravais in 1844. The naming of the coefficient is thus an example of Stigler's Law.

Pearson's correlation coefficient is the covariance of the two variables divided by the product of their standard deviations:

r = cov(X, Y) / (σX σY)

The form of the definition involves a "product moment", i.e., the mean (the first moment about the origin) of the product of the mean-adjusted random variables; hence the modifier product-moment in the name.

Mathematical Properties

The absolute values of both the sample and population Pearson correlation coefficients are less than or equal to 1. Correlations equal to 1 or −1 correspond to data points lying exactly on a line (in the case of the sample correlation), or to a bivariate distribution entirely supported on a line (in the case of the population correlation).

The Pearson's correlation coefficient is symmetric: corr(X, Y) = corr(Y, X).

A key mathematical property of the Pearson's correlation coefficient is that it is invariant under separate changes in location and scale in the two variables, i.e., we may transform X to a + bX and transform Y to c + dY, where a, b, c and d are constants with b, d > 0, without changing the correlation coefficient. (This holds for both the population and sample Pearson correlation coefficients.)
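A short Python sketch, on simulated data, that computes r with SciPy and checks the symmetry and location/scale-invariance properties just stated:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two hypothetical, linearly related variables
x = rng.normal(0, 1, 200)
y = 2.0 * x + rng.normal(0, 1, 200)

r, p = stats.pearsonr(x, y)
print(f"r = {r:.4f}, p = {p:.3g}")

# Invariance under location/scale changes: corr(a + b*x, c + d*y) = corr(x, y)
# (here a = 3, b = 10, c = -1, d = 0.5, with b, d > 0)
r2, _ = stats.pearsonr(3 + 10 * x, -1 + 0.5 * y)
assert np.isclose(r, r2)

# Symmetry: corr(x, y) = corr(y, x)
assert np.isclose(r, stats.pearsonr(y, x)[0])
```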
Assumptions

(i) Independence of Cases: Cases should be independent of each other.
(ii) Linear Relationship: The two variables should be linearly related to each other. This can be assessed with a scatter plot: plot the values of the variables on a scatter diagram and check if the plot yields a relatively straight line.
(iii) Homoscedasticity: The residuals scatter plot should be roughly rectangular-shaped.

Properties

(i) Limit: Coefficient values can range from +1 to −1, where +1 indicates a perfect positive relationship, −1 indicates a perfect negative relationship, and 0 indicates that no relationship exists.
(ii) Pure Number: It is independent of the unit of measurement. For example, if one variable's unit of measurement is in inches and the second variable is in quintals, even then, Pearson's correlation coefficient value does not change.
(iii) Symmetric: The correlation coefficient between two variables is symmetric. This means that the coefficient value will remain the same whether computed between X and Y or between Y and X.

4. Z-test

A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Because of the central limit theorem, many test statistics are approximately normally distributed for large samples. For each significance level, the Z-test has a single critical value (e.g., 1.96 for 5% two-tailed), which makes it more convenient than the Student's t-test, which has separate critical values for each sample size. Therefore, many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known. If the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large, the Student's t-test may be more appropriate.
General Form

The most general way to obtain a Z-test is to define a numerical test statistic that can be calculated from a collection of data, such that the sampling distribution of the statistic is approximately normal under the null hypothesis. Statistics that are averages of approximately independent data values are generally well-approximated by a normal distribution. An example of a statistic that would not be well-approximated by a normal distribution would be an extreme value such as the sample maximum.

If T is a statistic that is approximately normally distributed under the null hypothesis, the next step in performing a Z-test is to determine the expected value θ₀ of T under the null hypothesis, and then obtain an estimate s of the standard deviation of T. We then calculate the standard score Z = (T − θ₀)/s, from which one-tailed and two-tailed p-values can be calculated as Φ(−|Z|) and 2Φ(−|Z|), respectively, where Φ is the standard normal cumulative distribution function.

Use in Location Testing

The term Z-test is often used to refer specifically to the one-sample location test comparing the mean of a set of measurements to a given constant. If the observed data X1, ..., Xn are (i) uncorrelated, (ii) have a common mean μ, and (iii) have a common variance σ², then the sample average X̄ has mean μ and variance σ²/n. If our null hypothesis is that the mean value of the population is a given number μ₀, we can use X̄ − μ₀ as a test statistic, rejecting the null hypothesis if X̄ − μ₀ is large.

To calculate the standardised statistic Z = (X̄ − μ₀)/s, we need to either know or have an approximate value for σ², from which we can calculate s² = σ²/n. In some applications, σ² is known, but this is uncommon. If the sample size is moderate or large, we can substitute the sample variance for σ², giving a plug-in test. The resulting test will not be an exact Z-test, since the uncertainty in the sample variance is not accounted for; however, it will be a good approximation unless the sample size is small. A t-test can be used to account for the uncertainty in the sample variance when the sample size is small and the data are exactly normal. There is no universal constant at which the sample size is generally considered large enough to justify use of the plug-in test. Typical rules of thumb range from 20 to 50 samples. For larger sample sizes, the t-test procedure gives almost identical p-values as the Z-test procedure.
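A minimal sketch of the one-sample location Z-test in Python, assuming the population standard deviation is known (the data are simulated for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# Hypothetical data with known population standard deviation sigma = 4;
# test H0: population mean = 100
sigma, mu0 = 4.0, 100.0
sample = rng.normal(101.5, sigma, 60)

n = len(sample)
s = sigma / np.sqrt(n)            # standard deviation of the sample mean
z = (sample.mean() - mu0) / s     # standard score Z = (X_bar - mu0) / s

# Phi(-|Z|) one-tailed and 2*Phi(-|Z|) two-tailed, via the normal CDF
p_one = stats.norm.cdf(-abs(z))
p_two = 2 * stats.norm.cdf(-abs(z))
print(f"Z = {z:.3f}, one-tailed p = {p_one:.4f}, two-tailed p = {p_two:.4f}")
```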
Conditions

For the Z-test to be applicable, certain conditions must be met:

(i) Nuisance parameters should be known, or estimated with high accuracy (an example of a nuisance parameter would be the standard deviation in a one-sample location test). Z-tests focus on a single parameter and treat all other unknown parameters as being fixed at their true values. In practice, due to Slutsky's theorem, "plugging in" consistent estimates of nuisance parameters can be justified. However, if the sample size is not large enough for these estimates to be reasonably accurate, then the Z-test may not perform well.

(ii) The test statistic should follow a normal distribution. Generally, one appeals to the central limit theorem to justify assuming that a test statistic varies normally. There is a great deal of statistical research on the question of when a test statistic varies approximately normally. If the variation of the test statistic is strongly non-normal, then a Z-test should not be used.

(iii) If estimates of nuisance parameters are plugged in as discussed above, then it is important to use estimates appropriate for the way the data were sampled. In the special case of Z-tests for the one- or two-sample location problem, the usual sample standard deviation is only appropriate if the data were collected as an independent sample.

(iv) In some situations, it is possible to devise a test that properly accounts for the variation in plug-in estimates of nuisance parameters. In the case of one- and two-sample location problems, a t-test does this.

Z-tests Other than Location Tests

Location tests are the most familiar Z-tests. Another class of Z-tests arises in maximum likelihood estimation of the parameters in a parametric statistical model. Maximum likelihood estimates are approximately normal under certain conditions, and their asymptotic variance can be calculated in terms of the Fisher information. The maximum likelihood estimate divided by its standard error can be used as a test statistic for the null hypothesis that the population value of the parameter equals zero.

When using a Z-test for maximum likelihood estimates, it is important to be aware that the normal approximation may be poor if the sample size is not sufficiently large. Although there is no simple, universal rule stating how large the sample size must be to use a Z-test, simulation can give a good idea as to whether a Z-test is appropriate in a given situation. Z-tests are employed whenever it can be argued that a test statistic follows a normal distribution under the null hypothesis of interest. Many non-parametric test statistics, such as U statistics, are approximately normal for large enough sample sizes, and hence are often performed as Z-tests.

11.4 Non-parametric Test

Non-parametric tests are used in cases where parametric tests are not appropriate. Most non-parametric tests use some way of ranking the measurements and testing for weirdness of the distribution. Typically, a parametric test is preferred because it has better ability to distinguish between the two arms; in other words, it is better at highlighting the weirdness of the distribution. Non-parametric tests are about 95% as powerful as parametric tests.

However, non-parametric tests are often necessary. Some common situations for using non-parametric tests are when the distribution is not normal (the distribution is skewed), the distribution is not known, or the sample size is too small (<30) to assume a normal distribution. Also, if there are extreme values or values that are clearly "out of range," non-parametric tests should be used.

Sometimes it is not clear from the data whether the distribution is normal. If this is the case, previous studies using the variables can help distinguish between the two. The source of variability can also help: if numerous independent factors are affecting the variability, the distribution is more likely to be normal. You might think you could formally test to determine whether the distribution is normal, but unfortunately these tests require large sample sizes, typically larger than required for the tests of significance being used, and at levels where the choice of parametric or non-parametric tests is less important. At large sample sizes, either the parametric or the non-parametric tests work adequately.

Also, non-parametric tests are used when the measure being used does not lend itself to a normal distribution or where "distribution" has no meaning, such as colour of eyes or the Expanded Disability Status Scale (EDSS). In other words, nominal or ordinal measures in many cases require a non-parametric test.
A non-parametric test (sometimes called a distribution-free test) does not assume anything about the underlying distribution (e.g., that the data come from a normal distribution). That is in contrast to a parametric test, which makes assumptions about a population's parameters (e.g., the mean or standard deviation). When the word "non-parametric" is used in stats, it does not quite mean that you know nothing about the population. It usually means that you know the population data does not have a normal distribution. For example, one assumption for the one-way ANOVA is that the data come from a normal distribution. If your data is not normally distributed, you cannot run an ANOVA, but you can run its non-parametric alternative, the Kruskal-Wallis test.

If at all possible, you should use parametric tests, as they tend to be more accurate. Parametric tests have greater statistical power, which means they are more likely to find a true significant effect. Use non-parametric tests only if you have to (i.e., you know that assumptions like normality are being violated). Non-parametric tests can perform well with non-normal continuous data if you have a sufficiently large sample size (generally 15-20 items in each group).

11.5 Types of Non-parametric Tests

When the word "parametric" is used in stats, it usually means tests like ANOVA or a t-test. Those tests both assume that the population data has a normal distribution. Non-parametric tests do not assume that the data is normally distributed. The only non-parametric test you are likely to come across in elementary stats is the chi-square test. However, there are several others; e.g., the Kruskal-Wallis test is the non-parametric alternative to the one-way ANOVA, and the Mann-Whitney is the non-parametric alternative to the two-sample t-test. The main non-parametric tests are:

1. One-sample sign test. Use this test to estimate the median of a population and compare it to a reference value or target value.

2. One-sample Wilcoxon signed rank test. With this test, you also estimate the population median and compare it to a reference/target value. However, the test assumes your data comes from a symmetric distribution (like the Cauchy distribution or uniform distribution).
3. Friedman test. This test is used to test for differences between groups with ordinal dependent variables. It can also be used for continuous data if the one-way ANOVA with repeated measures is inappropriate (i.e., some assumption has been violated).

4. Goodman and Kruskal's Gamma test. A test of association for ranked variables.

5. Kruskal-Wallis test. Use this test instead of a one-way ANOVA to find out if two or more medians are different. Ranks of the data points are used for the calculations, rather than the data points themselves.

6. The Mann-Kendall Trend test. This test looks for trends in time-series data.

7. Mann-Whitney test. Use this test to compare differences between two independent groups when dependent variables are either ordinal or continuous.

8. Mood's Median test. Use this test instead of the sign test when you have two independent samples.

9. Spearman Rank Correlation. Use when you want to find a correlation between two sets of data.

11.6 Advantages and Disadvantages of Non-parametric Tests

Compared to parametric tests, non-parametric tests have several advantages, including:

1. More statistical power when assumptions for the parametric tests have been violated. When assumptions have not been violated, they can be almost as powerful.

2. Fewer assumptions (i.e., the assumption of normality does not apply).

3. Small sample sizes are acceptable.

4. They can be used for all data types, including nominal variables, interval variables, or data that has outliers or that has been measured imprecisely.
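To make the parametric/non-parametric trade-off concrete before turning to the disadvantages, here is a minimal sketch: the data are simulated and deliberately skewed, so the rank-based Mann-Whitney test is the safer choice (all names and values are illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Two hypothetical skewed samples (lognormal, so normality fails)
group_a = rng.lognormal(mean=0.0, sigma=1.0, size=18)
group_b = rng.lognormal(mean=0.6, sigma=1.0, size=18)

# Parametric route: two-sample t-test (assumes normality)
t_stat, p_t = stats.ttest_ind(group_a, group_b)

# Non-parametric alternative: Mann-Whitney U test (rank-based)
u_stat, p_u = stats.mannwhitneyu(group_a, group_b, alternative="two-sided")

print(f"t-test:       t = {t_stat:.3f}, p = {p_t:.4f}")
print(f"Mann-Whitney: U = {u_stat:.1f}, p = {p_u:.4f}")
```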
However, they do have their disadvantages. The most notable ones are:

1. Less powerful than parametric tests if assumptions have not been violated.

2. More labour-intensive to calculate by hand (for computer calculations, this is not an issue).

3. Critical value tables for many tests are not included in many computer software packages, whereas tables for parametric tests (like the Z-table or t-table) usually are.

11.7 Summary

Parametric tests assume underlying statistical distributions in the data. Therefore, several conditions of validity must be met so that the result of a parametric test is reliable. The t-test for two independent samples is reliable only if each sample follows a normal distribution and if the sample variances are homogeneous. Non-parametric tests do not rely on any distribution; they can thus be applied even if parametric conditions of validity are not met. Parametric tests often have non-parametric equivalents.

A parametric test is a hypothesis testing procedure based on the assumption that the observed data are distributed according to some distribution of well-known form (e.g., normal, Bernoulli, and so on) up to some unknown parameter(s) on which we want to make inference (say the mean, or the success probability).

Analysis of variance (ANOVA) is a collection of statistical models and their associated estimation procedures (such as the "variation" among and between groups) used to analyse the differences among group means in a sample. ANOVA was developed by the statistician and evolutionary biologist Ronald Fisher. The ANOVA is based on the law of total variance, where the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether two or more population means are equal, and therefore generalises the t-test beyond two means.

Pearson's correlation coefficient is the test statistic that measures the statistical relationship, or association, between two continuous variables. It is known as the best method of measuring the association between variables of interest because it is based on the method of covariance. It gives information about the magnitude of the association, or correlation, as well as the direction of the relationship.
A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution. Because of the central limit theorem, many test statistics are approximately normally distributed for large samples. For each significance level, the Z-test has a single critical value (e.g., 1.96 for 5% two-tailed), which makes it more convenient than the Student's t-test, which has separate critical values for each sample size. Therefore, many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known. If the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large, the Student's t-test may be more appropriate.

Non-parametric tests are used in cases where parametric tests are not appropriate. Most non-parametric tests use some way of ranking the measurements and testing for weirdness of the distribution. Typically, a parametric test is preferred because it has better ability to distinguish between the two arms; in other words, it is better at highlighting the weirdness of the distribution. Non-parametric tests are about 95% as powerful as parametric tests.

11.8 Key Words/Abbreviations

Parametric Test: Parametric tests assume underlying statistical distributions in the data.

ANOVA: The ANOVA is based on the law of total variance.

Pearson's Correlation: Pearson's correlation coefficient is the test statistic that measures the statistical relationship between two continuous variables.

Z-test: A Z-test is any statistical test for which the distribution of the test statistic under the null hypothesis can be approximated by a normal distribution.

Non-parametric Test: Non-parametric tests are used in cases where parametric tests are not appropriate.

11.9 Learning Activity

1. You are required to select and perform a hypothesis test when the sample size is 400, and give the reasons for selecting the specific type of testing tool.

_________________________________________________________________
_________________________________________________________________
2. You are suggested to compare unpaired and paired two-sample t-tests and prepare a report on the same.

_________________________________________________________________
_________________________________________________________________

3. You are instructed to describe the main effect and interaction effect of ANOVA.

_________________________________________________________________
_________________________________________________________________

11.10 Unit End Exercises (MCQs and Descriptive)

Descriptive Type Questions

1. What is a Parametric Test? Explain the various types of parametric tests.
2. What is a t-test? Explain the uses of the t-test.
3. Discuss unpaired and paired two-sample t-tests.
4. What is ANOVA? Explain the various types of ANOVA tests.
5. What are the assumptions for two-way ANOVA? Discuss.
6. What is Pearson's Correlation?
7. What is a Z-test?
8. What is a Non-parametric Test? Explain the various types of non-parametric tests.

Multiple Choice Questions

1. Which of the following is a hypothesis testing procedure based on the assumption that observed data are distributed according to some distributions of well-known form up to some unknown parameter(s) on which we want to make inference?
(a) Parametric test (b) Non-parametric test
(c) Analysis (d) Interpretation
2. Which of the following tests are used when the measure being used does not lend itself to a normal distribution or where "distribution" has no meaning?
(a) Parametric test (b) Non-parametric test
(c) Analysis (d) Interpretation

3. Which of the following is a type of parametric test?
(a) T-test (b) Z-test
(c) ANOVA test (d) All the above

4. Which of the following is not a type of parametric test?
(a) T-test (b) Z-test
(c) Mann-Whitney test (d) ANOVA test

5. Which of the following is an assumption for two-way ANOVA?
(a) The population must be close to a normal distribution.
(b) Samples must be independent.
(c) Population variances must be equal.
(d) All the above

Answers: 1. (a), 2. (b), 3. (d), 4. (c), 5. (d)

11.11 References

References of this unit have been given at the end of the book.
References

1. Lakshmi, T. and Umesh Kumar, Y. (2015), Business Research Methods, Himalaya Publishing House, Delhi.
2. C.R. Kothari (2006), Research Methodology: Methods and Techniques, New Age Publication, Delhi.
3. John W. Creswell and J. David Creswell (2019), Research Design.
4. William, M.K. and Trochim, James P. (2017), Research Methods Knowledge Base, Educators Technology.
5. Dr. Usha Devi and Santhosh Kumar, A.V. (2016), Business Research Methods, Vision Book House, Bengaluru.
6. S.C. Gupta (2017), Fundamentals of Statistics, Himalaya Publishing House, Delhi.
7. P. Gupta, N. Aruna Rani and M. Haritha, Operations Research and Quantitative Techniques, Himalaya Publishing House, Delhi.
8. www.researchget.net
9. www.goodreads.com>shelf>show>researchmethods
10. www.educatorstechnology.com>2017/01>researchmethodology