
21ODMPT655-Research Methods and Statistics –I

Published by Teamlease Edtech Ltd (Amita Chitroda), 2022-04-04 07:32:17


Data Analysis

Z = l + ((fm − f1) / (2fm − f1 − f2)) × c
  = 30 + ((36 − 24) / (2(36) − 24 − 20)) × 10
  = 30 + (12 / (72 − 44)) × 10
  = 30 + 120/28
  = 30 + 4.285

Z = 34.285

Empirical Relationship between Mean, Median and Mode

When the mode is ill-defined, it is difficult to find its value directly. A rough empirical relationship, however, exists among the mean, median and mode: the median lies between the mode and the mean, and the mode departs from the median by about twice the distance by which the mean departs from it (in a positively skewed distribution the mode lies to the left of the median and the mean to its right; in a negatively skewed distribution the positions are reversed). Karl Pearson expressed this relationship as Z = 3M − 2X̄.

Illustration 7

If M is 28 and AM is 29, find the mode.

Solution:

Z = 3M − 2X̄ = 3(28) − 2(29) = 84 − 58 = 26

Check: 29 > 28 > 26, i.e., mean > median > mode (positive skewness).

Illustration 8

M = ?, AM = 39, Z = 36.5

CU IDOL SELF LEARNING MATERIAL (SLM)

Solution:

Z = 3M − 2X̄
36.5 = 3M − 2(39)
36.5 = 3M − 78
3M = 78 + 36.5 = 114.5
M = 114.5 / 3 ≈ 38.17

Illustration 9

M = 79.5, X̄ = ?, Z = 78.1

Solution:

Z = 3M − 2X̄
78.1 = 3(79.5) − 2X̄
78.1 = 238.5 − 2X̄
2X̄ = 238.5 − 78.1 = 160.4
X̄ = 160.4 / 2 = 80.2

Locating Mode Graphically

Illustration 10

Monthly profits of 100 shops are distributed as follows:

Profit per Shop (₹ '000)    No. of Shops
0 - 50                      12
50 - 100                    18
100 - 150                   27
150 - 200                   20
200 - 250                   17
250 - 300                   6

Locate the mode graphically.

Solution:

Profit per Shop (₹ '000)    No. of Shops
0 - 50                      12
50 - 100                    18
100 - 150                   27
150 - 200                   20
200 - 250                   17
250 - 300                   6

From the histogram, the mode is located in the modal class 100 - 150 at Z = 128.

When a distribution has two or more equal highest frequencies, a grouping table and an analysis table must be applied to find the modal value (or, in a continuous series, the modal class); in the example worked below, the analysis table reports the modal class as 40 - 50. When the highest frequencies are equal, the distribution is bimodal or multimodal and the mode is ill-defined. However, merely finding two or more large frequencies in a distribution is not, by itself, grounds for concluding that the mode is ill-defined; the grouping and analysis tables must be applied first.
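As a cross-check on the interpolation formula and Pearson's empirical relationship used in the illustrations above, here is a minimal Python sketch (the function names are ours, not from the text):

```python
# Mode of a grouped (continuous) series by interpolation:
# Z = l + (fm - f1) / (2*fm - f1 - f2) * c
def grouped_mode(l, fm, f1, f2, c):
    return l + (fm - f1) / (2 * fm - f1 - f2) * c

# Pearson's empirical relationship Z = 3M - 2*mean ...
def empirical_mode(median, mean):
    return 3 * median - 2 * mean

# ... rearranged for the median when the mode and mean are known.
def median_from(mode, mean):
    return (mode + 2 * mean) / 3

# Figures from the worked example above: l = 30, fm = 36, f1 = 24, f2 = 20, c = 10
print(round(grouped_mode(30, 36, 24, 20, 10), 3))   # 34.286

# Illustration 7: M = 28, mean = 29
print(empirical_mode(28, 29))                        # 26

# Illustration 8: Z = 36.5, mean = 39
print(round(median_from(36.5, 39), 2))               # 38.17
```

The same rearrangement solves Illustration 9 for the mean: mean = (3M − Z) / 2.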

Mean (X̄), Median (M) and Mode (Z) for the grouped series analysed above:

Mean:
X̄ = Σfx / N = 2580 / 50 = 51.6

Median:
The median is the (N/2)th item = 50/2 = 25th item.
M = l + ((N/2 − cf) / f) × c = 40 + ((25 − 14) / 13) × 10 = 40 + 110/13 = 40 + 8.46 = 48.46

Mode:
Z = l + ((fm − f1) / (2fm − f1 − f2)) × c = 40 + ((13 − 7) / (2(13) − 7 − 13)) × 10 = 40 + (6/6) × 10 = 40 + 10 = 50

8.6 Standard Deviation and Variance

Standard deviation is defined as the square root of the mean of the squares of deviations from the mean. The concept of standard deviation was introduced by Karl Pearson in 1893. It is the most important and widely used measure of dispersion, and it overcomes the defects of the earlier methods. Standard deviation is "the square root of the arithmetic mean of the squared deviations of the various values from their arithmetic mean". It is denoted by SD or σ (sigma). Symbolically,

σ = √(Σd² / n)

Standard deviation is an absolute measure of dispersion of a distribution. A greater standard deviation indicates more variation and less uniformity; a smaller standard deviation indicates less variation and more consistency in a series.

Properties of Standard Deviation

1. Standard deviation is used only to measure spread or dispersion around the mean of a data set.
2. Standard deviation is never negative.

3. Standard deviation is sensitive to outliers. A single outlier can raise the standard deviation and, in turn, distort the picture of spread.
4. For data with approximately the same mean, the greater the spread, the greater the standard deviation.
5. If all values of a data set are the same, the standard deviation is zero (because each value is equal to the mean).
6. When analysing normally distributed data, the standard deviation can be used together with the mean to calculate data intervals. If S = standard deviation, x̄ = the mean and x = a value in the data set, then:
(i) about 68% of the data lie in the interval: x̄ − S < x < x̄ + S
(ii) about 95% of the data lie in the interval: x̄ − 2S < x < x̄ + 2S
(iii) about 99% of the data lie in the interval: x̄ − 3S < x < x̄ + 3S

Mathematical Properties of Standard Deviation

The important properties of SD are:
• Coefficient of Variation (CV)
• Variance
• Combined Standard Deviation

Coefficient of Variation

The coefficient of variation is a relative measure of dispersion of a distribution. To compare the variation of two different series, a relative measure of standard deviation must be calculated. This is known as the coefficient of variation (or the coefficient of standard deviation). It is defined as the ratio of the SD to the mean. Symbolically,

CV = (Standard Deviation / Mean) × 100
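The coefficient of variation defined above can be sketched in Python as follows (a minimal illustration with made-up data; the function name is ours):

```python
import math

def coeff_of_variation(values):
    """CV = (standard deviation / mean) * 100."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sd / mean * 100

# Two series with the same absolute spread but different means:
# the one with the smaller mean has the larger CV, i.e., it is
# relatively more variable and less consistent.
a = [10, 12, 14, 16, 18]
b = [100, 102, 104, 106, 108]
print(round(coeff_of_variation(a), 2))   # 20.2
print(round(coeff_of_variation(b), 2))   # 2.72
```

This illustrates why CV, not SD alone, is used to compare the consistency of series measured on different scales.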

The series for which the coefficient of variation (CV) is greater is said to show more variation and less uniformity. On the other hand, the series for which the CV is smaller is said to be less variable and more stable or consistent.

Variance

Variance is the square of the standard deviation. Hence, variance = σ², or σ = √Variance. Variance is defined as the mean of the squares of deviations from the mean; it is also called the mean square deviation. To calculate the variance, first find the mean of the series. The deviation of each value from the mean is found, the deviations are squared and added up, and the sum of the squared deviations is divided by the total number of items. The result is the variance:

Variance = Σd² / N

Combined Standard Deviation

The combined standard deviation can be calculated by following the same method as calculating the combined mean of two or more groups. Denoted by σ12, the formula for the combined standard deviation of two groups is:

σ12 = √[(N1σ1² + N2σ2² + N1d1² + N2d2²) / (N1 + N2)]

where d1 = x̄1 − x̄12 and d2 = x̄2 − x̄12, x̄12 being the combined mean.

Merits of Standard Deviation

The advantages of standard deviation are as follows:
(i) Standard deviation is rigidly defined and its value is always a definite figure.
(ii) It takes into account every value in the series; thus it is based on all the observations of the data.
(iii) It is suitable for further mathematical treatment.
(iv) It is less affected by fluctuations of sampling.

Demerits of Standard Deviation

The following are the demerits of standard deviation:
(i) Compared to other measures, it is difficult to compute.
(ii) It gives more weight to values which differ greatly from the mean (i.e., extreme values) and less weight to values which are nearer to the mean. Suppose the mean of a series is 20, and 40 and 19 are two values in the series. The deviations of these values from the mean are 20 and −1, and the squares of the deviations are 400 and 1. Thus, the standard deviation gives more weight to extreme values. This is the reason why it is not very useful in many economic studies.

Actual Mean Method

(a) First calculate the arithmetic mean, x̄.
(b) Calculate the deviation of each value from the arithmetic mean, i.e., dx = x − x̄.
(c) Square the deviations in order to remove their signs, i.e., dx² = (x − x̄)².
(d) Multiply each dx² by the respective frequency; the result is fdx² (only in discrete and continuous series).
(e) Total the dx² column in the case of an individual series, or the fdx² column in the case of discrete and continuous series.
(f) Apply the formula to get the standard deviation.

Assumed Mean Method

(a) Calculate the deviation of each value from an assumed mean A, i.e., dx = x − A (or dx = (x − A)/c in the step deviation method).
(b) Square the deviations in order to remove their signs, i.e., dx².

(c) Total the dx² column in the case of an individual series, or the fdx² column in the case of discrete and continuous series.
(d) Apply the formula to get the standard deviation.

Formula for Calculating Standard Deviation and its Coefficient

                              Individual Series                    Discrete and Continuous Series
Actual Mean Method            σ = √(Σ(x − x̄)²/n) = √(Σdx²/n)      σ = √(Σf(x − x̄)²/N) = √(Σfdx²/N)
(a) Short-cut Method          σ = √(Σdx²/n − (Σdx/n)²)             σ = √(Σfdx²/N − (Σfdx/N)²)
(b) Step Deviation Method     σ = √(Σdx²/n − (Σdx/n)²) × c         σ = √(Σfdx²/N − (Σfdx/N)²) × c

(i) Individual Series

Example 1: Calculate the SD under the actual mean method and the assumed mean method.

X: 120 125 130 135 140 145 150 155 160 165

Solution:

Actual Mean Method

Variable (x)    dx = x − x̄    dx²
120             − 22.5         506.25
125             − 17.5         306.25
130             − 12.5         156.25
135             − 7.5          56.25
140             − 2.5          6.25
145             2.5            6.25
150             7.5            56.25
155             12.5           156.25
160             17.5           306.25
165             22.5           506.25
Σx = 1425                      Σdx² = 2062.5

x̄ = Σx / n = 1425 / 10 = 142.5

σ = √(Σdx² / n) = √(2062.5 / 10) = √206.25 = 14.36

Assumed Mean Method

I. Short-cut Method

Variable (x)    dx = x − A     dx²
120             − 20           400
125             − 15           225
130             − 10           100
135             − 5            25
140 (A)         0              0
145             5              25
150             10             100
155             15             225
160             20             400
165             25             625
n = 10          Σdx = 25       Σdx² = 2125

σ = √(Σdx²/n − (Σdx/n)²) = √(2125/10 − (25/10)²) = √(212.5 − 6.25) = √206.25 = 14.36

II. Step Deviation Method

Variable (x)    dx = (x − A)/c, c = 5    dx²
120             − 4                       16
125             − 3                       9
130             − 2                       4
135             − 1                       1
140 (A)         0                         0
145             1                         1
150             2                         4
155             3                         9
160             4                         16
165             5                         25
n = 10          Σdx = 5                   Σdx² = 85

σ = √(Σdx²/n − (Σdx/n)²) × c = √(85/10 − (5/10)²) × 5 = √(8.5 − 0.25) × 5 = √8.25 × 5 = 2.872 × 5 = 14.36

(ii) Discrete Series

Example 2: Calculate the SD and variance from the following data:

Wages:           50  60  70  80  90  100  110  120
No. of Workers:   8   5   9   4   6    7    3    2

Solution:

Actual Mean Method

Wages (x)   No. of Workers (f)   fx     dx = x − x̄   dx²       fdx²
50          8                    400    − 28.64       820.25    6561.98
60          5                    300    − 18.64       347.45    1737.25
70          9                    630    − 8.64        74.65     671.85
80          4                    320    1.36          1.85      7.40
90          6                    540    11.36         129.05    774.30
100         7                    700    21.36         456.25    3193.75
110         3                    330    31.36         983.45    2950.35
120         2                    240    41.36         1710.65   3421.30
N = 44           Σfx = 3460                                     Σfdx² = 19318.18

x̄ = Σfx / N = 3460 / 44 = 78.64

σ = √(Σfdx² / N) = √(19318.18 / 44) = √439.05 = 20.95

Variance = σ² = 439.05

Short-cut Method

Wages (x)   No. of Workers (f)   dx = x − A   dx²     fdx     fdx²
50          8                    − 30          900     − 240   7200
60          5                    − 20          400     − 100   2000
70          9                    − 10          100     − 90    900
80 (A)      4                    0             0       0       0
90          6                    10            100     60      600
100         7                    20            400     140     2800
110         3                    30            900     90      2700
120         2                    40            1600    80      3200
N = 44                                         Σfdx = − 60     Σfdx² = 19400

σ = √(Σfdx²/N − (Σfdx/N)²) = √(19400/44 − (−60/44)²) = √(440.91 − 1.86) = √439.05 = 20.95

Step Deviation Method

Wages (x)   No. of Workers (f)   dx = (x − A)/c   dx²   fdx    fdx²
50          8                    − 3               9     − 24   72
60          5                    − 2               4     − 10   20
70          9                    − 1               1     − 9    9
80 (A)      4                    0                 0     0      0
90          6                    1                 1     6      6
100         7                    2                 4     14     28
110         3                    3                 9     9      27
120         2                    4                 16    8      32
N = 44                           Σdx = 4                 Σfdx = − 6    Σfdx² = 194

Calculation of SD (with c = 10):

σ = √(Σfdx²/N − (Σfdx/N)²) × c = √(194/44 − (−6/44)²) × 10 = √(4.409 − 0.019) × 10 = √4.390 × 10 = 2.095 × 10 = 20.95

(iii) Continuous Series

Example 3: Find the standard deviation of the following data:

Wages:   0-10  10-20  20-30  30-40  40-50  50-60  60-70  70-80
No.:     5     9      8      12     10     4      3      2

Solution:

Actual Mean Method

Wages     f     Midpoint (x)   fx     dx = x − x̄   dx²        fdx²
0-10      5     5              25     − 28.87       833.48     4167.39
10-20     9     15             135    − 18.87       356.08     3204.72
20-30     8     25             200    − 8.87        78.68      629.42
30-40     12    35             420    1.13          1.28       15.32
40-50     10    45             450    11.13         123.88     1238.77
50-60     4     55             220    21.13         446.48     1785.91
60-70     3     65             195    31.13         969.08     2907.23
70-80     2     75             150    41.13         1691.68    3383.35
N = 53          Σfx = 1795                                     Σfdx² = 17332.11

(The midpoint of each class is (lower limit + upper limit) / 2.)

x̄ = Σfx / N = 1795 / 53 = 33.87

σ = √(Σfdx² / N) = √(17332.11 / 53) = √327.02 = 18.08

Short-cut Method

Wages     f     Midpoint (x)   dx = x − A   dx²     fdx     fdx²
0-10      5     5              − 30          900     − 150   4500
10-20     9     15             − 20          400     − 180   3600
20-30     8     25             − 10          100     − 80    800
30-40     12    35 (A)         0             0       0       0
40-50     10    45             10            100     100     1000
50-60     4     55             20            400     80      1600
60-70     3     65             30            900     90      2700
70-80     2     75             40            1600    80      3200
N = 53                                       Σfdx = − 60     Σfdx² = 17400

σ = √(Σfdx²/N − (Σfdx/N)²) = √(17400/53 − (−60/53)²) = √(328.30 − 1.28) = √327.02 = 18.08

Step Deviation Method

Wages     f     Midpoint (x)   dx = (x − A)/c   fdx    dx²   fdx²
0-10      5     5              − 3               − 15   9     45
10-20     9     15             − 2               − 18   4     36
20-30     8     25             − 1               − 8    1     8
30-40     12    35 (A)         0                 0      0     0
40-50     10    45             1                 10     1     10
50-60     4     55             2                 8      4     16
60-70     3     65             3                 9      9     27
70-80     2     75             4                 8      16    32
N = 53                                           Σfdx = − 6         Σfdx² = 174

σ = √(Σfdx²/N − (Σfdx/N)²) × c = √(174/53 − (−6/53)²) × 10 = √(3.283 − 0.013) × 10 = √3.270 × 10 = 1.808 × 10 = 18.08

Illustration 1

Calculate the standard deviation of the following data:

Age (in years): 23, 27, 28, 29, 30, 31, 33, 35, 36, 38

Solution:

Calculation of Standard Deviation

Age in Years    dx = x − x̄    dx²
23              − 8            64
27              − 4            16
28              − 3            9
29              − 2            4
30              − 1            1
31              0              0
33              + 2            4
35              + 4            16
36              + 5            25
38              + 7            49
Σx = 310                       Σdx² = 188

x̄ = Σx / n = 310 / 10 = 31 years

σ = √(Σdx² / n) = √(188 / 10) = √18.8 = 4.336
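The worked examples above can be verified in a few lines of Python. The sketch below recomputes Example 1 by all three methods and Example 3 from class midpoints (variable names are ours):

```python
import math

# Example 1 (individual series): the three methods must agree.
x = [120, 125, 130, 135, 140, 145, 150, 155, 160, 165]
n = len(x)

mean = sum(x) / n                                   # actual mean method
sd_actual = math.sqrt(sum((v - mean) ** 2 for v in x) / n)

A, c = 140, 5                                       # assumed mean and class interval
dx = [v - A for v in x]                             # short-cut method
sd_short = math.sqrt(sum(d * d for d in dx) / n - (sum(dx) / n) ** 2)

dxc = [(v - A) / c for v in x]                      # step deviation method
sd_step = math.sqrt(sum(d * d for d in dxc) / n - (sum(dxc) / n) ** 2) * c

print(round(sd_actual, 2), round(sd_short, 2), round(sd_step, 2))  # 14.36 14.36 14.36

# Example 3 (continuous series): class midpoints weighted by frequency.
mids = [5, 15, 25, 35, 45, 55, 65, 75]
freq = [5, 9, 8, 12, 10, 4, 3, 2]
N = sum(freq)
fmean = sum(f * m for f, m in zip(freq, mids)) / N
sd_cont = math.sqrt(sum(f * (m - fmean) ** 2 for f, m in zip(freq, mids)) / N)
print(round(sd_cont, 2))                            # 18.08
```

The short-cut and step-deviation forms are algebraic rearrangements of the definition, which is why all three agree exactly rather than approximately.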

8.7 Skewness

Skewness refers to distortion or asymmetry relative to the symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified as a representation of the extent to which a given distribution varies from a normal distribution. A normal distribution has a skew of zero, while a lognormal distribution, for example, would exhibit some degree of right skew. Negatively skewed distributions are also known as left-skewed distributions. Skewness is used along with kurtosis to better judge the likelihood of events falling in the tails of a probability distribution.

Coefficient of Skewness = (Mean − Mode) / Standard Deviation

or

Coefficient of Skewness SK(p) = (X̄ − Z) / σ

Where the mode is ill-defined, this formula is modified as follows:

Coefficient of Skewness SK(p) = 3(Mean − Median) / S.D.

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive or negative, or undefined. For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right. In cases where one tail is long but the other tail is fat, skewness does not obey a simple rule. For example, a zero value means that the tails on both sides of the mean balance out overall; this is the case for a symmetric distribution, but it can also be true for an asymmetric distribution where one tail is long and thin and the other is short but fat.

Besides positive and negative skew, distributions can also be said to have zero or undefined skew. In the curve of a distribution, the data on the right side of the curve may taper differently from the data on the left side. These taperings are known as "tails". Negative skew refers to a longer or fatter tail on the left side of the distribution, while positive skew refers to a longer or fatter tail on the right.

Moment-based measure of skewness, for T observations x1, ..., xT with mean x̄ and standard deviation σ:

Sk = Σ(i = 1 to T) (xi − x̄)³ / (T σ³)

The mean of positively skewed data will be greater than the median. In a distribution that is negatively skewed, the exact opposite is the case: the mean of negatively skewed data will be less than the median. If the data graph symmetrically, the distribution has zero skewness, regardless of how long or fat the tails are.

There are several ways to measure skewness. Pearson's first and second coefficients of skewness are two common ones. Pearson's first coefficient of skewness, or Pearson mode skewness, subtracts the mode from the mean and divides the difference by the standard deviation. Pearson's second coefficient of skewness, or Pearson's median skewness, subtracts the median from the mean, multiplies the difference by three and divides the product by the standard deviation.

Types of Skewness

There are two types of skewness: (1) positive skewness and (2) negative skewness.

Figure 8.1: Types of Skewness
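Pearson's two coefficients described above can be sketched in Python as follows (a minimal illustration; the helper names and figures are ours):

```python
def pearson_mode_skew(mean, mode, sd):
    """Pearson's first coefficient: (mean - mode) / sd."""
    return (mean - mode) / sd

def pearson_median_skew(mean, median, sd):
    """Pearson's second coefficient: 3 * (mean - median) / sd."""
    return 3 * (mean - median) / sd

# A symmetric distribution (mean = median = mode) has zero skewness.
print(pearson_mode_skew(50, 50, 5), pearson_median_skew(50, 50, 5))  # 0.0 0.0

# A mean pulled above the median signals positive (right) skew.
print(pearson_median_skew(55, 52, 5))  # 1.8
```

The sign of either coefficient, not its magnitude alone, is what distinguishes positive from negative skewness.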

1. Positive Skewness

Positive skewness means that the tail on the right side of the distribution is longer or fatter. The mean and median will be greater than the mode. A series is said to have positive skewness when the following characteristics are noticed:

Mean > Median > Mode

The right tail of the curve is longer than its left tail when the data are plotted as a histogram or a frequency polygon, and the skewness formula and its coefficient give positive figures.

2. Negative Skewness

Negative skewness means that the tail on the left side of the distribution is longer or fatter than the tail on the right side. The mean and median will be less than the mode. A series is said to have negative skewness when the following characteristics are noticed:

Mode > Median > Mean

The left tail of the curve is longer than the right tail when the data are plotted as a histogram or a frequency polygon, and the skewness formula and its coefficient give negative figures.

8.8 Kurtosis

Kurtosis is all about the tails of the distribution, not the peakedness or flatness. It is used to describe the extreme values in one tail versus the other; it is, in effect, a measure of the outliers present in the distribution. High kurtosis in a data set is an indicator that the data have heavy tails or outliers. If there is high kurtosis, we need to investigate why there are so many outliers; it may indicate wrong data entry, among other things. Low kurtosis in a data set is an indicator that the data have light tails or a lack of outliers. If we get suspiciously low kurtosis, we likewise need to investigate before trimming the data set of unwanted results.
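A minimal sketch of kurtosis as the fourth standardized moment, using the conventional normal-reference value of 3 (the function name and the simulated samples are ours):

```python
import math, random

def kurtosis(values):
    """Fourth standardized moment: mean((x - mean)^4) / variance^2.
    Roughly 3 for normal data (mesokurtic), above 3 for heavy tails
    (leptokurtic), below 3 for light tails (platykurtic)."""
    n = len(values)
    mean = sum(values) / n
    var = sum((x - mean) ** 2 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return m4 / var ** 2

random.seed(42)
normal_sample = [random.gauss(0, 1) for _ in range(50_000)]
uniform_sample = [random.uniform(0, 1) for _ in range(50_000)]

print(round(kurtosis(normal_sample), 1))    # close to 3.0
print(round(kurtosis(uniform_sample), 1))   # close to 1.8 (light tails, platykurtic)
```

The uniform distribution, which has no tails at all, illustrates why platykurtic data produce values well below 3.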

Figure 8.2: Kurtosis

1. Mesokurtic: This distribution has a kurtosis statistic similar to that of the normal distribution, meaning that its extreme values are similar to those of a normal distribution. Under this definition, the standard normal distribution has a kurtosis of three.

2. Leptokurtic (Kurtosis > 3): The distribution is longer and its tails are fatter. The peak is higher and sharper than in the mesokurtic case, which means that the data are heavy-tailed, with a profusion of outliers. Outliers stretch the horizontal axis of the histogram, which makes the bulk of the data appear in a narrow ("skinny") vertical range, giving a leptokurtic distribution its "skinniness".

3. Platykurtic (Kurtosis < 3): The distribution is shorter and its tails are thinner than those of the normal distribution. The peak is lower and broader than in the mesokurtic case, which means that the data are light-tailed, lacking outliers; the extreme values are less extreme than those of the normal distribution.

8.9 Range

The range of a set of data is the difference between the largest and smallest values. "Difference" here is specific: the range is the result of subtracting the smallest value from the largest value. In descriptive statistics, the range is the size of the smallest interval which contains all the data, and it provides an indication of statistical dispersion. It is measured in the same units as the data. Since it depends on only two of the observations, it is most useful in representing the dispersion of small data sets.

For n independent and identically distributed continuous random variables X1, X2, ..., Xn with cumulative distribution function G(x) and probability density function g(x), let T denote the range of a sample of size n from a population with distribution function G(x).

Range = H − L

where H = highest value and L = lowest value.

Coefficient of Range = (H − L) / (H + L)

8.10 Summary

Data analysis is a process of inspecting, cleansing, transforming and modelling data with the goal of discovering useful information, informing conclusions and supporting decision-making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, and is used in different business, science and social science domains. Common tasks include record matching, identifying inaccuracies in data, assessing the overall quality of existing data, deduplication, and column segmentation. Such data problems can also be identified through a variety of analytical techniques.

Descriptive statistics is the type of statistics that probably springs to most people's minds when they hear the word "statistics". Here, the goal is to describe: numerical measures are used to tell about features of a set of data.

Skewness refers to distortion or asymmetry relative to the symmetrical bell curve, or normal distribution, in a set of data. If the curve is shifted to the left or to the right, it is said to be skewed. Skewness can be quantified as a representation of the extent to which a given distribution varies from a normal distribution. A normal distribution has a skew of zero, while a lognormal distribution, for example, would exhibit some degree of right skew. Negatively skewed distributions are also known as left-skewed distributions. Skewness is used along with kurtosis to better judge the likelihood of events falling in the tails of a probability distribution.

Skewness is a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean. The skewness value can be positive, negative, or undefined.

For a unimodal distribution, negative skew commonly indicates that the tail is on the left side of the distribution, and positive skew indicates that the tail is on the right. In cases where one tail is long but the other tail is fat, skewness does not obey a simple rule. For example, a zero value means that the tails on both sides of the mean balance out overall; this is the case for a symmetric distribution, but it can also be true for an asymmetric distribution where one tail is long and thin and the other is short but fat.

Positive skewness means that the tail on the right side of the distribution is longer or fatter; the mean and median will be greater than the mode. Negative skewness means that the tail on the left side of the distribution is longer or fatter than the tail on the right side; the mean and median will be less than the mode.

Kurtosis is all about the tails of the distribution, not the peakedness or flatness. It is used to describe the extreme values in one tail versus the other; it is, in effect, a measure of the outliers present in the distribution. High kurtosis in a data set is an indicator that the data have heavy tails or outliers. If there is high kurtosis, we need to investigate why there are so many outliers; it may indicate wrong data entry, among other things.

8.11 Key Words/Abbreviations

 Arithmetic Mean: Arithmetic mean is defined as the value obtained by dividing the total of all values of the items in a series by their number.

 Median: Median is defined as the value of the item which divides the series into two equal halves.

 Mode: The word "mode" is derived from the French expression "la mode", meaning fashion.

 Standard Deviation: Standard deviation is defined as the square root of the mean of the squares of deviations from the mean.

 Variance: Variance is the square of the standard deviation.

 Skewness: Skewness refers to distortion or asymmetry relative to the symmetrical bell curve, or normal distribution, in a set of data.

 Kurtosis: Kurtosis is all about the tails of the distribution, not the peakedness or flatness.

 Range: The range of a set of data is the difference between the largest and smallest values.

8.12 Learning Activity

1. Identify and prepare a report on the applications of the arithmetic mean.
_________________________________________________________________
_________________________________________________________________

2. Suggest applications of standard deviation and variance.
_________________________________________________________________
_________________________________________________________________

8.13 Unit End Exercises (MCQs and Descriptive)

Descriptive Type Questions

1. What is descriptive statistics? Explain in detail.
2. What is the arithmetic mean? Explain the applications of the arithmetic mean.
3. What is the median? Discuss the uses of the median.
4. What is the mode? Explain the procedure for finding the mode.
5. Explain standard deviation in detail.
6. Discuss skewness in detail.
7. Explain kurtosis in detail.
8. What is the range? Explain the range in detail.

Multiple Choice Questions

1. What is the process of inspecting, cleansing, transforming and modelling data with the goal of discovering useful information, informing conclusions and supporting decision-making?
(a) Data Analysis (b) Data Collection
(c) Review of Literature (d) All of the above

2. Data cleaning is the process of preventing and correcting errors in data; common tasks include __________.
(a) Record matching (b) Identifying inaccuracies in data
(c) Assessing the overall quality of existing data (d) All of the above

3. Which of the following is the type of statistics that probably springs to most people's minds when they hear the word "statistics"?
(a) Descriptive Statistics (b) Proactive Statistics
(c) Data Analysis (d) None of the above

4. Which of the following refers to distortion or asymmetry in a symmetrical bell curve, or normal distribution, in a set of data?
(a) Skewness (b) Kurtosis
(c) Range (d) All of the above

5. Which of the following is the difference between the largest and smallest values of a set of data?
(a) Skewness (b) Kurtosis
(c) Range (d) All of the above

Answers: 1. (a), 2. (d), 3. (a), 4. (a), 5. (c)

8.14 References

References for this unit have been given at the end of the book.

UNIT 9  INFERENTIAL STATISTICS

Structure:
9.0 Learning Objectives
9.1 Introduction
9.2 Meaning and Definitions of Correlation
9.3 Uses of Correlation
9.4 Advantages of Correlation
9.5 Disadvantages of Correlation
9.6 Types of Correlation
9.7 Methods of Determining Correlation
9.8 Methods of Determining Correlation (Practical Problems)
9.9 Summary
9.10 Key Words/Abbreviations
9.11 Learning Activity
9.12 Unit End Exercises (MCQs and Descriptive)
9.13 References

9.0 Learning Objectives

After studying this unit, you will be able to:
 Describe correlation
 Explain the uses of correlation

9.1 Introduction

Francis Galton was the first person to measure correlation, which he originally termed "co-relation", a name that makes sense considering that it concerns the relationship between a couple of different variables. In "Co-relations and their Measurement" he said: "The statures of kinsmen are co-related variables; thus, the stature of the father is correlated to that of the adult son, and so on; but the index of co-relation … is different in the different cases." It is worth noting that Galton mentioned in his paper that he had borrowed the term from biology, where "correlation and correlation of structure" was in use, but that up to the time of his paper it had not been properly defined. In 1892, the British statistician Francis Ysidro Edgeworth published a paper called "Correlated Averages" (Philosophical Magazine, 5th Series, 34, 190-204) in which he used the term "coefficient of correlation". It was not until 1896 that the British mathematician Karl Pearson used "coefficient of correlation" in two papers: Contributions to the Mathematical Theory of Evolution and Mathematical Contributions to the Theory of Evolution III: Regression, Heredity and Panmixia.

Correlation is a bivariate analysis that measures the strength of association between two variables. In statistics, the value of the correlation coefficient varies between +1 and −1. When the value of the correlation coefficient lies near ±1, there is said to be a perfect degree of association between the two variables; as the correlation coefficient moves towards 0, the relationship between the two variables becomes weaker. The goal of a correlation analysis is to see whether two measurement variables co-vary and to quantify the strength of the relationship between them. Correlational analysis and research is useful in providing links between variables that can further be investigated. However, because correlation does not establish cause, this type of research can also be affected by mediating factors, and it lacks internal validity.

9.2 Meaning and Definitions of Correlation

Meaning of Correlation

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.

Inferential Statistics 221 Definitions of Correlation According to W.I. King, “Correlation means that between two series or groups of data there exists some causal connections.” At another place, he says, “If it is proved true that in a large number of instances two variables tend always to fluctuate in the same or in opposite directions, we consider that the fact is established and that a relationship exists. The relationship is called correlation.” According to E. Davenport, “The whole subject of correlation refers to that interrelation between separate character by which they tend, in some degree atleast, to move together.” According to Prof. Boddington, “Whenever some definite connection exists between the two or more groups, classes or series of data, there is said to be correlation.” According to Croxton and Cowden, “When the relationship is of quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation.” Tippett states that “the effect of correlation is to reduce the range of uncertainty of our prediction.” In the words L.R. Conner, “If two or more quantities vary in sympathy so that movements in one tend to be accompanied by corresponding movement in the other then they are said to be correlated.” The statistical technique which deals with the association between two or more variables is known as correlation analysis. A.M. Tuttle defines correlation analysis as the co-variation between two or more variables. 9.3 Uses of Correlation There are three main uses for correlation and regression: 1. It is used to test hypotheses about cause-and-effect relationships. In this case, the experimenter determines the values of the X-variable and sees whether variation in X causes variation in Y. For example, giving people different amounts of a drug and measuring their blood pressure. CU IDOL SELF LEARNING MATERIAL (SLM)
222 Research Methods and Statistics - I 2. The second main use for correlation and regression is to see whether two variables are associated, without necessarily inferring a cause-and-effect relationship. In this case, neither variable is determined by the experimenter; both are naturally variable. If an association is found, the inference is that variation in X may cause variation in Y, or variation in Y may cause variation in X, or variation in some other factor may affect both X and Y. 3. The third common use of linear regression is estimating the value of one variable corresponding to a particular value of the other variable. 9.4 Advantages of Correlation 1. Correlation research allows researchers to collect much more data than experiments. 2. Correlation research opens up a great deal of further research to other scholars. 3. It allows researchers to determine the strength and direction of a relationship so that later studies can narrow the findings down and, if possible, determine causation experimentally. 4. It yields quantitative data which can be easily analysed. 5. No manipulation of behaviour is required. 6. The correlation coefficient can readily quantify observational data. 9.5 Disadvantages of Correlation 1. Correlation research only uncovers a relationship; it cannot provide a conclusive reason for why there is a relationship. 2. A correlative finding does not reveal which variable influences the other. For example, finding that wealth correlates highly with education does not explain whether having wealth leads to more education or whether education leads to more wealth. 3. Reasons for either can be assumed, but until more research is done, causation cannot be determined. 4. Here, a third, unknown variable might be causing both. For instance, living in the city of Bangalore can lead to both wealth and education.
9.6 Types of Correlation I. On the Basis of Direction 1. Positive Correlation The correlation is said to be positive when the values of the two variables change in the same direction. Example: Production expenses and sales, height and weight, water consumption and temperature, study time and grades, etc. 2. Negative Correlation The correlation is said to be negative when the values of the variables change in opposite directions. Example: Price and quantity demanded, alcohol consumption and driving ability, etc. 3. Partial Correlation In partial correlation, more than two variables are recognised but only two variables are allowed to influence each other; the effect of the other influencing variable is kept constant. For example, if we limit a correlation analysis of crop yield and rainfall by keeping the fertilizer variable constant, it becomes a problem of partial correlation. II. On the Basis of Number of Sets 1. Simple Correlation When only two variables are studied, it is a case of simple correlation. 2. Multiple Correlation When three or more variables are studied, it is known as multiple correlation. For example, when we study the relationship between the yield of rice per acre and both the amount of rainfall and the amount of fertilizer used, it is a case of multiple correlation.
III. On the Basis of Change 1. Linear Correlation Correlation is said to be linear when the amount of change in one variable tends to bear a constant ratio to the amount of change in the other. The graph of the variables having a linear relationship will form a straight line. Example: X = 1, 2, 3, 4, 5, 6, 7, 8, ... Y = 5, 7, 9, 11, 13, 15, 17, 19, ... (Y = 3 + 2X) 2. Non-linear Correlation The correlation would be non-linear if the amount of change in one variable does not bear a constant ratio to the amount of change in the other variable. Example: X = 1, 2, 3, 4, 5; Y = 1, 4, 9, 16, 25 (Y = X²) 9.7 Methods of Determining Correlation The various methods of studying correlation are as follows: 1. Scatter Diagram A scatter plot (also called a scattergram, scatter graph, scatter chart, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data. If the points are coded (colour/shape/size), one additional variable can be displayed. The data are displayed as a collection of points, each having the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis. A scatter plot can be used either when one continuous variable is under the control of the experimenter and the other depends on it, or when both continuous variables are independent. If a parameter exists that is systematically incremented and/or decremented by the other, it is called the control parameter or independent variable and is customarily plotted along the horizontal axis. The
measured or dependent variable is customarily plotted along the vertical axis. If no dependent variable exists, either type of variable can be plotted on either axis, and a scatter plot will illustrate only the degree of correlation (not causation) between two variables. A scatter plot can suggest various kinds of correlations between variables with a certain confidence interval. For example, with weight and height, weight would be on the y-axis and height on the x-axis. Correlations may be positive (rising), negative (falling), or null (uncorrelated). If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation between the variables being studied. If the pattern of dots slopes from upper left to lower right, it indicates a negative correlation. A line of best fit (alternatively called a ‘trendline’) can be drawn in order to study the relationship between the variables. An equation for the correlation between the variables can be determined by established best-fit procedures. For a linear correlation, the best-fit procedure is known as linear regression and is guaranteed to generate a correct solution in a finite time. No universal best-fit procedure is guaranteed to generate a correct solution for arbitrary relationships. A scatter plot is also very useful when we wish to see how two comparable data sets agree and to show nonlinear relationships between variables. The ability to do this can be enhanced by adding a smooth line such as LOESS. Furthermore, if the data are represented by a mixture model of simple relationships, these relationships will be visually evident as superimposed patterns. 2. Karl Pearson’s Coefficient of Correlation The Karl Pearson’s product-moment correlation coefficient (or simply, the Pearson’s correlation coefficient) is a measure of the strength of a linear association between two variables and is denoted by r or rxy (x and y being the two variables involved).
This method of correlation attempts to draw a line of best fit through the data of two variables, and the value of the Pearson correlation coefficient, r, indicates how far away all these data points are from this line of best fit. In everyday situations too, many factors vary together: whether you go to a movie, for instance, may depend on your interest in that movie, your budget, and whether your friends turn up. To analyse such a situation in detail, you would note down your similar past experiences and form a sort of distribution from that data. It is at this point that you require a correlation coefficient, which provides a single value from which the strength of the association can be judged. Karl Pearson’s Coefficient of Correlation is one such measure, which we shall study in this section.
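The direction rules described for scatter plots can be illustrated numerically: the sign of Pearson’s r matches the slope of the dot cloud. A minimal Python sketch with two small hypothetical data sets (the data and function name are illustrative, not from the text):

```python
import math

def pearson_r(xs, ys):
    # Pearson's r: covariance of x and y divided by the product
    # of their standard deviations.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rising  = pearson_r([1, 2, 3, 4, 5], [2, 4, 5, 4, 6])   # dots slope upward
falling = pearson_r([1, 2, 3, 4, 5], [9, 7, 6, 7, 4])   # dots slope downward
print(round(rising, 2), round(falling, 2))
```

An upward-sloping pattern yields a positive r and a downward-sloping pattern a negative r, exactly as the scatter-plot rules state.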
3. Rank Correlation The Spearman’s Correlation Coefficient, represented by ρ (rho) or rs, is a non-parametric measure of the strength and direction of the association that exists between two ranked variables. It determines the degree to which a relationship is monotonic, i.e., whether there is a monotonic component of the association between two continuous or ordered variables. Monotonicity is “less restrictive” than a linear relationship. Although monotonicity is not actually a requirement of Spearman’s correlation, it will not be meaningful to pursue Spearman’s correlation to determine the strength and direction of a monotonic relationship if we already know the relationship between the two variables is not monotonic. 9.8 Methods of Determining Correlation (Practical Problems) Scatter Diagram A scatter diagram is a graph of observed plotted points where each point represents the values of X and Y as a coordinate. It portrays the relationship between these two variables graphically. Sl. No. Maths Statistics 1 55 60 2 70 65 3 35 50 4 40 60 5 65 75 6 40 70 7 60 50 8 20 40 9 30 60 10 50 30
11. 10 30 12. 20 10 [Figure: Scatter diagram of marks in Maths (x-axis, 0–70) against marks in Statistics (y-axis, 0–70), with an estimating line drawn through the points, showing a low degree of positive correlation.] Advantages of Scatter Diagram 1. It is a very simple and non-mathematical method. 2. It is not influenced by the size of extreme items. 3. It is the first step in investigating the relationship between two variables. Disadvantage of Scatter Diagram 1. It cannot indicate the exact degree of correlation. Karl Pearson’s Coefficient of Correlation Karl Pearson’s Coefficient of Correlation is denoted by ‘r’ (–1 ≤ r ≤ +1). The coefficient of correlation ‘r’ measures the degree of linear relationship between two variables, say x and y. The degree of correlation is expressed by the value of the coefficient. The direction of change is indicated by the sign, (+ve) or (–ve). When deviations are taken from the actual mean: r = Σxy / √(Σx² × Σy²)
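The scatter diagram’s verdict can be checked numerically with Pearson’s actual-mean formula, r = Σxy / √(Σx² × Σy²), applied to the twelve Maths/Statistics pairs from the table above. A short Python sketch:

```python
import math

# The twelve pairs of marks from the scatter-diagram table.
maths = [55, 70, 35, 40, 65, 40, 60, 20, 30, 50, 10, 20]
stats = [60, 65, 50, 60, 75, 70, 50, 40, 60, 30, 30, 10]

n = len(maths)
mx, my = sum(maths) / n, sum(stats) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(maths, stats))
sxx = sum((a - mx) ** 2 for a in maths)
syy = sum((b - my) ** 2 for b in stats)
r = sxy / math.sqrt(sxx * syy)
print(round(r, 2))
```

The result is a positive value, consistent with the upward-sloping cloud of points in the diagram.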
228 Research Methods and Statistics - I xy  (x )(y) n When deviation taken from an assumed mean: r = x 2  (x )2 y2  (y)2 n n Interpretation of Correlation Coefficient (r) The value of correlation coefficient ‘r’ ranges from –1 to +1. (a) If r = +1, then the correlation between the two variables is said to be perfect and positive. (b) If r = –, then the correlation between the two variables is said to be perfect and negative. (c) If r = 0, then there exists no correlation between the variables. Properties of Correlation Coefficient (a) The correlation coefficient lies between –1 and +1 symbolically (–1  r  1) (b) The correlation coefficient is independent of the change of origin and scale. (c) The coefficient of correlation is the geometric mean of two regression coefficient. r = bxy  byx (d) If one regression coefficient is (+ve) and the other regression coefficient is also (+ve), then the correlation coefficient is (+ve). Coefficient of Determination The convenient way of interpreting the value of correlation coefficient is to use of square of coefficient of correlation which is called Coefficient of Determination. The Coefficient of Determination = r2 . Suppose: r = 0.9, r2 = 0.81 This would mean that 81% of the variation in the dependent variable has been explained by the independent variable. CU IDOL SELF LEARNING MATERIAL (SLM)
The maximum value of r² is 1 because it is possible to explain all of the variation in y, but it is not possible to explain more than all of it. Coefficient of Determination = Explained Variation / Total Variation. Procedure for Computing the Correlation Coefficient 1. Calculate the means of the two series ‘X’ and ‘Y’. 2. Calculate the deviations ‘x’ and ‘y’ in the two series from their respective means. 3. Square each deviation of ‘x’ and ‘y’, then obtain the sum of the squared deviations, i.e., Σx² and Σy². 4. Multiply each deviation under x with each deviation under y and obtain the product ‘xy’. 5. Then obtain the sum of the products of x and y, i.e., Σxy. 6. Substitute the values in the formula. Probable Error The probable error of the coefficient of correlation gives us the two limits within which the coefficient of correlation of a series selected at random from the same universe is likely to fall. The formula for the probable error of r is as follows: PE = 0.6745 × (1 – r²)/√N where, r = Coefficient of correlation, N = Number of pairs of observations. Conditions Necessary for the Use of Probable Error The measure of probable error can be properly used only when the following three conditions exist: 1. The data must approximate a normal frequency curve, i.e., a bell-shaped curve. 2. The statistical measure for which the PE is computed must have been calculated from a sample.
3. The sample must have been selected in an unbiased manner and the individual items must be independent. Illustration 1 If r = 0.6 and N = 64 of a distribution, find out the probable error. Solution: PE = 0.6745 × (1 – r²)/√N = 0.6745 × (1 – (0.6)²)/√64 = 0.6745 × 0.64/8 = 0.6745 × 0.08 = 0.054 Illustration 2 Given the following, calculate the value of N: r = 0.61 and PE = 0.1312. Solution: PE = 0.6745 × (1 – r²)/√N 0.1312 = 0.6745 × (1 – (0.61)²)/√N 0.1312 × √N = 0.6745 × 0.6279 = 0.4236 √N = 0.4236/0.1312 = 3.2280 N = (3.2280)² = 10.42
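Both illustrations can be verified with a short Python sketch of the probable-error formula PE = 0.6745 × (1 – r²)/√N:

```python
import math

def probable_error(r, n):
    # PE = 0.6745 * (1 - r^2) / sqrt(N)
    return 0.6745 * (1 - r ** 2) / math.sqrt(n)

# Illustration 1: r = 0.6, N = 64.
pe = probable_error(0.6, 64)
print(round(pe, 3))

# Illustration 2: solve for N given r = 0.61 and PE = 0.1312.
r, target = 0.61, 0.1312
root_n = 0.6745 * (1 - r ** 2) / target     # this equals sqrt(N)
print(round(root_n ** 2, 2))
```

The first call reproduces the probable error of Illustration 1, and squaring √N recovers the N of Illustration 2.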
Illustration 3 For a given distribution, the value of correlation is 0.64 and its probable error is 0.13274. Find the number of items in the series. Solution: PE = 0.6745 × (1 – r²)/√n Given, PE = 0.13274, Correlation (r) = 0.64, n = ? 0.13274 = 0.6745 × (1 – (0.64)²)/√n = 0.6745 × (1 – 0.4096)/√n = 0.6745 × 0.5904/√n = 0.3982/√n √n = 0.3982/0.13274 = 2.999 n = 9 (Approx.) I. Direct Method Type 1: This method is used when the given variables are small in magnitude. Type 2: It is a direct formula to find r. This formula can effectively be used where x and y are not in fractions. The formula is: r = Σxy / √(Σx² × Σy²) Illustration 1 The following data relate to the age of employees and the number of days they were reported sick in a month. Age 30 32 35 40 48 50 52 55 57 61 Sick days 1 0 2 5 2 4 6 5 7 8 Calculate Karl Pearson’s Coefficient of Correlation.
Solution: Calculation of Karl Pearson’s Coefficient of Correlation Age X | x = X – X̄ | x² | Sick days Y | y = Y – Ȳ | y² | xy 30 | –16 | 256 | 1 | –3 | 9 | +48 32 | –14 | 196 | 0 | –4 | 16 | +56 35 | –11 | 121 | 2 | –2 | 4 | +22 40 | –6 | 36 | 5 | +1 | 1 | –6 48 | +2 | 4 | 2 | –2 | 4 | –4 50 | +4 | 16 | 4 | 0 | 0 | 0 52 | +6 | 36 | 6 | +2 | 4 | +12 55 | +9 | 81 | 5 | +1 | 1 | +9 57 | +11 | 121 | 7 | +3 | 9 | +33 61 | +15 | 225 | 8 | +4 | 16 | +60 ΣX = 460, Σx² = 1092, ΣY = 40, Σy² = 64, Σxy = 230 X̄ = 460/10 = 46, Ȳ = 40/10 = 4 r = Σxy / √(Σx² × Σy²) = 230/√(1092 × 64) = 230/264.36 = 0.87 There is a high degree of positive correlation between age and the number of days reported sick.
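The column sums and the final coefficient above can be verified with a few lines of Python:

```python
import math

age  = [30, 32, 35, 40, 48, 50, 52, 55, 57, 61]
sick = [1, 0, 2, 5, 2, 4, 6, 5, 7, 8]

n = len(age)
mx, my = sum(age) / n, sum(sick) / n            # means: 46 and 4
sxy = sum((a - mx) * (b - my) for a, b in zip(age, sick))
sxx = sum((a - mx) ** 2 for a in age)
syy = sum((b - my) ** 2 for b in sick)
r = sxy / math.sqrt(sxx * syy)                  # r = Σxy / √(Σx²·Σy²)
print(sxy, sxx, syy, round(r, 2))
```

The sums Σxy = 230, Σx² = 1092 and Σy² = 64 match the worked table, and r rounds to 0.87.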
Illustration 2 Compute Karl Pearson’s Coefficient of Correlation from the following data: Marks in Accountancy 77 54 27 52 14 35 90 25 56 60 Marks in English 35 58 60 40 50 40 35 56 34 42 Solution: Calculation of Karl Pearson’s Coefficient of Correlation X | x = X – 49 | x² | Y | y = Y – 45 | y² | xy 77 | +28 | 784 | 35 | –10 | 100 | –280 54 | +5 | 25 | 58 | +13 | 169 | +65 27 | –22 | 484 | 60 | +15 | 225 | –330 52 | +3 | 9 | 40 | –5 | 25 | –15 14 | –35 | 1,225 | 50 | +5 | 25 | –175 35 | –14 | 196 | 40 | –5 | 25 | +70 90 | +41 | 1,681 | 35 | –10 | 100 | –410 25 | –24 | 576 | 56 | +11 | 121 | –264 56 | +7 | 49 | 34 | –11 | 121 | –77 60 | +11 | 121 | 42 | –3 | 9 | –33 ΣX = 490, Σx² = 5,150, ΣY = 450, Σy² = 920, Σxy = –1,449 r = Σxy / √(Σx² × Σy²) = –1449/√(5150 × 920) = –1449/2176.69 = –0.666 There is a moderate degree of negative correlation between marks in Accountancy and marks in English.
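The same figure follows from the shortcut (assumed-mean) form of the formula; a short Python check using the deviations from the assumed means 49 and 45 shown in the table:

```python
import math

acc = [77, 54, 27, 52, 14, 35, 90, 25, 56, 60]
eng = [35, 58, 60, 40, 50, 40, 35, 56, 34, 42]

# Deviations x and y are taken from the assumed means 49 and 45.
x = [a - 49 for a in acc]
y = [b - 45 for b in eng]
n = len(x)
num = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
den = math.sqrt((sum(a * a for a in x) - sum(x) ** 2 / n) *
                (sum(b * b for b in y) - sum(y) ** 2 / n))
print(round(num / den, 3))
```

The negative sign of Σxy carries through to r, confirming the negative correlation between the two sets of marks.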
9.9 Summary Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases. Correlation is said to be positive if the values of the two variables change in the same direction. Correlation is said to be negative when the values of the variables change in opposite directions. Example: Price and quantity demanded, alcohol consumption and driving ability, etc. In partial correlation, more than two variables are recognised but only two variables influence each other; the effect of the other influencing variable is kept constant. For example, if we limit our correlation analysis of yield and rainfall by keeping the fertilizer variable constant, it becomes a problem of partial correlation. A scatter plot can suggest various kinds of correlations between variables with a certain confidence interval. For example, with weight and height, weight would be on the y-axis and height on the x-axis. Correlations may be positive (rising), negative (falling), or null (uncorrelated). If the pattern of dots slopes from lower left to upper right, it indicates a positive correlation between the variables being studied. If the pattern of dots slopes from upper left to lower right, it indicates a negative correlation. A line of best fit (alternatively called a ‘trendline’) can be drawn in order to study the relationship between the variables. An equation for the correlation between the variables can be determined by established best-fit procedures. For a linear correlation, the best-fit procedure is known as linear regression and is guaranteed to generate a correct solution in a finite time.
No universal best-fit procedure is guaranteed to generate a correct solution for arbitrary relationships. A scatter plot is also very useful when we wish to see how two comparable data sets agree to show nonlinear relationships between variables. The ability to do this can be enhanced by adding a smooth
line such as LOESS. Furthermore, if the data are represented by a mixture model of simple relationships, these relationships will be visually evident as superimposed patterns. The Karl Pearson’s product-moment correlation coefficient (or simply, the Pearson’s correlation coefficient) is a measure of the strength of a linear association between two variables and is denoted by r or rxy (x and y being the two variables involved). This method of correlation attempts to draw a line of best fit through the data of two variables, and the value of the Pearson’s correlation coefficient, r, indicates how far away all these data points are from this line of best fit. Monotonicity is “less restrictive” than a linear relationship. Although monotonicity is not actually a requirement of Spearman’s correlation, it will not be meaningful to pursue Spearman’s correlation to determine the strength and direction of a monotonic relationship if we already know the relationship between the two variables is not monotonic. 9.10 Key Words/Abbreviations  Correlation: Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together.  Positive Correlation: The correlation is said to be positive if the values of the two variables change in the same direction.  Negative Correlation: The correlation is said to be negative when the values of the variables change in opposite directions.  Linear Correlation: Correlation is said to be linear when the amount of change in one variable tends to bear a constant ratio to the amount of change in the other.  Karl Pearson’s Coefficient of Correlation: The Karl Pearson’s product-moment correlation coefficient, a measure of the strength of the linear association between two variables.  Scatter Diagram: A scatter plot is a type of plot or mathematical diagram.
9.11 Learning Activity 1. You are required to interpret the solutions of any two problems with respect to Positive Correlation and Negative Correlation. _________________________________________________________________ _________________________________________________________________ 2. You are suggested to prepare a report on the implementation of Karl Pearson’s Coefficient of Correlation and Rank Correlation. _________________________________________________________________ _________________________________________________________________ 9.12 Unit End Exercises (MCQs and Descriptive) Descriptive Type Questions 1. What is Correlation? 2. Mention any two uses of correlation. 3. How do we measure the reliability of correlation? 4. Name any two types of correlation. 5. What is ‘Positive Correlation’? 6. Define ‘Negative Correlation’. 7. What is a linear correlation? 8. What do you mean by coefficient of correlation? 9. What is a Probable Error? 10. What is the use of ‘Probable Error’?
11. State the assumptions of Karl Pearson’s Coefficient of Correlation. 12. What is rank correlation? 13. What is a ‘Scatter Diagram’? 14. Write down the Spearman’s Rank Coefficient Correlation Formula. 15. Define Rank Correlation Coefficient. 16. Explain types of correlation. Multiple Choice Questions 1. Who was the first person to measure correlation, originally termed “co-relation”? (a) Francis Galton (b) Karl Pearson (c) Heredity and Panmixia (d) None of the above 2. When did the British statistician Francis Ysidro Edgeworth publish a paper called “Correlated Averages”? (a) 1890 (b) 1899 (c) 1892 (d) 1888 3. Which of the following is a statistical measure that indicates the extent to which two or more variables fluctuate together? (a) Regression (b) Correlation (c) Skewness (d) All the above 4. Which of the following is an advantage of Correlation? (a) Correlation research allows researchers to collect much more data than experiments (b) Correlation research opens up a great deal of further research to other scholars
(c) No manipulation of behaviour is required (d) All the above 5. Which of the following is a method of determining Correlation? (a) Scatter Diagram (b) Karl Pearson’s Coefficient of Correlation (c) Rank Correlation (d) All the above Answers: 1. (a), 2. (c), 3. (b), 4. (d), 5. (d) 9.13 References References of this unit have been given at the end of the book.
Regression 239 UNIT 10 REGRESSION Structure: 10.0 Learning Objectives 10.1 Introduction 10.2 Meaning of Regression 10.3 Regression Analysis 10.4 Assumptions in Regression Analysis 10.5 Techniques of Regression 10.6 Types of Regression 10.7 Techniques of Regression (Practical Problems) 10.8 Summary 10.9 Key Words/Abbreviations 10.10 Learning Activity 10.11 Unit End Exercises (MCQs and Descriptive) 10.12 References 10.0 Learning Objectives After studying this unit, you will be able to:  Describe regression  Explain the assumptions in regression analysis
10.1 Introduction Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the ‘outcome variable’) and one or more independent variables (often called ‘predictors’, ‘covariates’, or ‘features’). The most common form of regression analysis is linear regression, in which a researcher finds the line (or a more complex linear function) that most closely fits the data according to a specific mathematical criterion. For example, the method of ordinary least squares computes the unique line (or hyperplane) that minimises the sum of squared distances between the true data and that line (or hyperplane). For specific mathematical reasons (see linear regression), this allows the researcher to estimate the conditional expectation (or population average value) of the dependent variable when the independent variables take on a given set of values. Less common forms of regression use slightly different procedures to estimate alternative location parameters (e.g., quantile regression or Necessary Condition Analysis) or estimate the conditional expectation across a broader collection of non-linear models (e.g., non-parametric regression). 10.2 Meaning of Regression Regression is the measure of the average relationship between two or more variables in terms of the original units of the data. Regression is a statistical measurement used in finance, investing and other disciplines that attempts to determine the strength of the relationship between one dependent variable (usually denoted by Y) and a series of other changing variables (known as independent variables). Regression helps investment and financial managers to value assets and understand the relationships between variables, such as commodity prices and the stocks of businesses dealing in those commodities.
10.3 Regression Analysis The term regression analysis refers to the methods by which estimates are made of the values of a variable from knowledge of the values of one or more other variables, and to the measurement of the errors involved in this estimation process. Regression analysis is a mathematical measure of the average relationship between two or more variables. It is a statistical tool used in predicting the value of an unknown variable from a known variable.
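The prediction idea can be sketched with a minimal ordinary least squares fit in Python (the data are hypothetical, chosen so the line y = 3 + 2x fits exactly):

```python
def fit_line(x, y):
    # Ordinary least squares for y = a + b*x:
    # b = S_xy / S_xx and a = mean(y) - b * mean(x).
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sxx = sum((u - mx) ** 2 for u in x)
    b = sxy / sxx
    return my - b * mx, b

x = [1, 2, 3, 4, 5]
y = [5, 7, 9, 11, 13]            # exactly y = 3 + 2x
a, b = fit_line(x, y)
print(a, b, a + b * 6)           # predict y for an unseen x = 6
```

Once a and b are estimated from known pairs, a + b·x predicts the unknown value of y for a new x, which is exactly the estimation task regression analysis addresses.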
It is a very powerful tool in the field of statistical analysis in predicting the value of one variable, given the value of another variable, when those variables are related to each other. The two basic types of regression are linear regression and multiple linear regression, although there are non-linear regression methods for more complicated data and analysis. Linear regression uses one independent variable to explain or predict the outcome of the dependent variable Y, while multiple regression uses two or more independent variables to predict the outcome. Regression can help finance and investment professionals as well as professionals in other businesses. Regression can also help predict sales for a company based on weather, previous sales, GDP growth or other types of conditions. The capital asset pricing model (CAPM) is an often-used regression model in finance for pricing assets and discovering costs of capital. 10.4 Assumptions in Regression Analysis 1. Existence of an actual linear relationship. 2. The regression analysis is used to estimate the values within the range for which it is valid. 3. The relationship between the dependent and independent variables remains the same till the regression equation is calculated. 4. The dependent variable takes any random value but the values of the independent variables are fixed. 5. In regression, we have only one dependent variable in our estimating equation. However, we can use more than one independent variable. 10.5 Techniques of Regression 1. Regression Equation Regression equations are algebraic expressions of the regression lines. Since there are two regression lines, there are two regression equations – the regression equation of X on Y is used to describe the variations in the values of X for given changes in Y, and the regression equation of Y on X is used to describe the variations in the values of Y for given changes in X.
2. Regression Lines A regression line describes the average relationship between the two variables, say X and Y. It reveals the mean value of one variable for a given value of the other. The equation of a regression line is known as the “Regression Equation”. 3. Regression Coefficient The rate of change of one variable for a unit change in the other variable is called the regression coefficient of the former on the latter. Since there are two regression lines, there are two regression coefficients. The rate of change of X for a unit change in Y is called the regression coefficient of X on Y. It is the coefficient of Y in the regression equation when it is in the form X = a + bY. 10.6 Types of Regression Every regression technique has some assumptions attached to it which we need to meet before running the analysis. These techniques differ in terms of the type of dependent and independent variables and their distribution. 1. Linear Regression It is the simplest form of regression. It is a technique in which the dependent variable is continuous in nature. The relationship between the dependent variable and the independent variables is assumed to be linear in nature. For example, the mileage and engine displacement of cars typically show a roughly linear relationship. 2. Polynomial Regression It is a technique to fit a non-linear equation by taking polynomial functions of the independent variable. In situations where the relation between the dependent and independent variable seems to be non-linear, a polynomial curve can fit the data better than a straight line, and we can deploy polynomial regression models. 3. Logistic Regression In logistic regression, the dependent variable is binary in nature (having two categories). Independent variables can be continuous or binary. In multinomial logistic regression, you can have more than two categories in your dependent variable.
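The two regression coefficients, and their link to the correlation coefficient (r is the geometric mean of byx and bxy, as stated in the correlation unit), can be sketched in Python with hypothetical data:

```python
import math

x = [1, 2, 3, 4, 5]
y = [2, 5, 3, 8, 7]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((u - mx) * (v - my) for u, v in zip(x, y))
sxx = sum((u - mx) ** 2 for u in x)
syy = sum((v - my) ** 2 for v in y)

byx = sxy / sxx           # regression coefficient of Y on X
bxy = sxy / syy           # regression coefficient of X on Y
r = math.sqrt(byx * bxy)  # geometric mean; both coefficients are +ve here
print(round(byx, 2), round(bxy, 2), round(r, 3))
```

The value of r obtained this way agrees with the direct formula Σxy/√(Σx² × Σy²), illustrating why the two regression coefficients always have the same sign as r.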
4. Quantile Regression Quantile regression is an extension of linear regression, and we generally use it when outliers, high skewness and heteroscedasticity exist in the data. In linear regression, we predict the mean of the dependent variable for given independent variables. Since the mean does not describe the whole distribution, modelling the mean is not a full description of the relationship between dependent and independent variables. So, we can use quantile regression, which predicts a quantile (or percentile) for given independent variables. The term “quantile” is the same as “percentile”. 5. Ridge Regression Ridge regression is a way to create a parsimonious model when the number of predictor variables in a set exceeds the number of observations, or when a data set has multicollinearity (correlations between predictor variables). Tikhonov’s method is basically the same as ridge regression, except that Tikhonov’s has a larger set. It can produce solutions even when your data set contains a lot of statistical noise (unexplained variation in a sample). Least squares regression is not defined at all when the number of predictors exceeds the number of observations. It does not differentiate “important” from “less-important” predictors in a model, so it includes all of them. This leads to overfitting a model and failure to find unique solutions. Least squares also has issues dealing with multicollinearity in data. Ridge regression avoids all of these problems. It works in part because it does not require unbiased estimators; while least squares produces unbiased estimates, their variances can be so large that they may be wholly inaccurate. Ridge regression adds just enough bias to make the estimates reasonably reliable approximations to true population values. 6. Lasso Regression Lasso stands for Least Absolute Shrinkage and Selection Operator. It makes use of the L1 regularisation technique in the objective function.
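The shrinkage idea behind ridge regression can be sketched for the simplest possible case — a single predictor already centred on zero, where the ridge slope has the closed form b = S_xy / (S_xx + λ). The data below are hypothetical:

```python
x = [-2, -1, 0, 1, 2]                  # centred predictor
y = [-4.1, -1.9, 0.2, 2.1, 3.9]        # roughly y = 2x plus noise

sxy = sum(a * b for a, b in zip(x, y))
sxx = sum(a * a for a in x)

# One-feature ridge: b = S_xy / (S_xx + lam). lam = 0 gives ordinary
# least squares; a larger lam shrinks the slope towards zero.
slopes = {lam: sxy / (sxx + lam) for lam in (0, 1, 10)}
print(slopes)
```

As λ grows the estimated slope shrinks monotonically, which is the "just enough bias" trade-off described above: a slightly biased but more stable estimate.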
Lasso regression is a type of linear regression that uses shrinkage. Shrinkage is where data values are shrunk towards a central point, like the mean. The lasso procedure encourages simple, sparse models (i.e., models with fewer parameters).
7. Elastic Net Regression Elastic Net Regression is preferred over both ridge and lasso regression when one is dealing with highly correlated independent variables. 8. Principal Components Regression (PCR) PCR is a regression technique which is widely used when you have many independent variables or multicollinearity exists in your data. 9. Partial Least Squares (PLS) Regression It is an alternative technique to principal components regression when you have highly correlated independent variables. It is also useful when there are a large number of independent variables. 10. Support Vector Regression Support vector regression can solve both linear and non-linear models. SVM uses non-linear kernel functions (such as polynomial) to find the optimal solution for non-linear models. 11. Ordinal Regression Ordinal regression is used to predict ranked values. In simple words, this type of regression is suitable when the dependent variable is ordinal in nature. Examples of ordinal variables are survey responses (1 to 6 scale), patient reaction to drug dose (none, mild, severe), etc. 12. Poisson Regression Poisson regression is used when the dependent variable has count data. Applications of Poisson Regression: (i) Predicting the number of calls in customer care related to a particular product (ii) Estimating the number of emergency service calls during an event 13. Negative Binomial Regression Like Poisson regression, it also deals with count data. The question arises: how is it different from Poisson regression? The answer is that negative binomial regression does not assume the distribution of counts to have variance equal to its mean, while Poisson regression assumes the variance is equal to its mean.