
CU-BCA-SEM-III-PROBABILITY AND STATICS- Second Draft-converted

Published by Teamlease Edtech Ltd (Amita Chitroda), 2021-05-10 06:50:16



Example 9: In a study about viral fever, the number of people affected in a town were noted as Find its standard deviation. 101 CU IDOL SELF LEARNING MATERIAL (SLM)

Example 10: The measurements of the diameters (in cm) of the plates prepared in a factory are given below. Find their standard deviation.

Example 11: The times taken by 50 students to complete a 100-meter race are given below. Find their standard deviation.

Population and sample standard deviation

Standard deviation measures the spread of a data distribution. It measures the typical distance between each data point and the mean. The formula we use for standard deviation depends on whether the data is being considered a population of its own, or the data is a sample representing a larger population.
• If the data is being considered a population on its own, we divide by the number of data points, N.
• If the data is a sample from a larger population, we divide by one fewer than the number of data points in the sample, n-1.
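The difference between dividing by N and dividing by n-1 can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name `std_dev` and the two small data sets are assumptions chosen for the sketch, and the result is cross-checked against Python's standard `statistics` module.

```python
import math
import statistics

def std_dev(data, sample=False):
    """Standard deviation: divide by N (population) or by n - 1 (sample)."""
    n = len(data)
    mean = sum(data) / n
    squared_deviations = [(x - mean) ** 2 for x in data]
    divisor = n - 1 if sample else n
    variance = sum(squared_deviations) / divisor
    return math.sqrt(variance)

scores = [6, 2, 3, 1]    # illustrative data treated as a complete population
pencils = [2, 2, 5, 7]   # illustrative data treated as a sample

population_sd = std_dev(scores)
sample_sd = std_dev(pencils, sample=True)

# Cross-check against the standard library implementations.
assert math.isclose(population_sd, statistics.pstdev(scores))
assert math.isclose(sample_sd, statistics.stdev(pencils))

print(round(population_sd, 2), round(sample_sd, 2))
```

Only the divisor changes between the two cases; every other step (mean, deviations, squares, sum, square root) is shared.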

Population standard deviation:
$$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$$
Sample standard deviation:
$$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$$
The steps in each formula are all the same except for one: we divide by one less than the number of data points when dealing with sample data. We'll go through each formula step by step in the examples below.

Population standard deviation
Here's the formula again for population standard deviation:
$$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{N}}$$
Here's how to calculate population standard deviation:
Step 1: Calculate the mean of the data; this is μ in the formula.
Step 2: Subtract the mean from each data point. These differences are called deviations. Data points below the mean will have negative deviations, and data points above the mean will have positive deviations.
Step 3: Square each deviation to make it positive.
Step 4: Add the squared deviations together.

Step 5: Divide the sum by the number of data points in the population. The result is called the variance.
Step 6: Take the square root of the variance to get the standard deviation.

Example: Four friends were comparing their scores on a recent essay. Calculate the standard deviation of their scores: 6, 2, 3, 1
Step 1: Find the mean. μ = (6 + 2 + 3 + 1)/4 = 12/4 = 3
Step 2: Subtract the mean from each score. The deviations are 3, -1, 0, -2.
Step 3: Square each deviation. The squared deviations are 9, 1, 0, 4.
Step 4: Add the squared deviations.

9 + 1 + 0 + 4 = 14
Step 5: Divide the sum by the number of scores. 14/4 = 3.5
Step 6: Take the square root of the result from Step 5. √3.5 ≈ 1.87
The standard deviation is approximately 1.87.

Sample standard deviation
Here's the formula again for sample standard deviation:
$$s = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$$
Here's how to calculate sample standard deviation:
Step 1: Calculate the mean of the data; this is x̄ in the formula.
Step 2: Subtract the mean from each data point. These differences are called deviations. Data points below the mean will have negative deviations, and data points above the mean will have positive deviations.
Step 3: Square each deviation to make it positive.
Step 4: Add the squared deviations together.
Step 5: Divide the sum by one less than the number of data points in the sample. The result is called the variance.
Step 6: Take the square root of the variance to get the standard deviation.

Example: A sample of 4 students was taken to see how many pencils they were carrying. Calculate the sample standard deviation of their responses: 2, 2, 5, 7
Step 1: Find the mean. x̄ = (2 + 2 + 5 + 7)/4 = 16/4 = 4
Step 2: Subtract the mean from each response. The deviations are -2, -2, 1, 3.
Step 3: Square each deviation. The squared deviations are 4, 4, 1, 9.

Step 4: Add the squared deviations. 4 + 4 + 1 + 9 = 18
Step 5: Divide the sum by one less than the number of data points. 18/(4 - 1) = 18/3 = 6
Step 6: Take the square root of the result from Step 5. √6 ≈ 2.45
The sample standard deviation is approximately 2.45.

5.6 QUARTILE, PERCENTILE, DECILES
The procedure for computing quartiles, deciles, etc., is the same as for the median. While computing these values in individual and discrete series we add 1 to N, whereas in continuous series we do not add 1. Thus:
Q1 = size of the $\frac{N+1}{4}$th item (individual observations and discrete series)
Q1 = size of the $\frac{N}{4}$th item (continuous series)
Q3 = size of the $\frac{3(N+1)}{4}$th item (individual and discrete series)
Q3 = size of the $\frac{3N}{4}$th item (continuous series)
D4 = size of the $\frac{4(N+1)}{10}$th item (individual and discrete series)
D4 = size of the $\frac{4N}{10}$th item (continuous series)
P60 = size of the $\frac{60(N+1)}{100}$th item (individual and discrete series)
P60 = size of the $\frac{60N}{100}$th item (continuous series)

Quartiles, Quartile Deviation and Coefficient of Quartile Deviation
The Quartile Deviation is a simple way to estimate the spread of a distribution about a measure of its central tendency (usually the median). It gives you an idea of the range within which the central 50% of your sample data lies. Based on the quartile deviation, the Coefficient of Quartile Deviation can be defined, which makes it easy to compare the spread of two or more different distributions. Since both of these topics are based on the concept of quartiles, we'll first understand how to calculate the quartiles of a dataset before working with the direct formulae.

Quartiles
Just as the median divides a sorted dataset into two equal halves, the quartiles divide it into four equal parts. There are therefore three quartiles for a given distribution, and the second quartile is the median itself. We'll deal with the other two quartiles in this section.
• The first quartile or the lower quartile or the 25th percentile, denoted by Q1, corresponds to the value that lies halfway between the lowest value and the median of the distribution (sorted in ascending order). Hence, it marks the boundary below which the first 25% of the data lies.

• Similarly, the third quartile or the upper quartile or the 75th percentile, denoted by Q3, corresponds to the value that lies halfway between the median and the highest value in the distribution (sorted in ascending order). It therefore marks the boundary below which the first 75% of the data lies, or above which the last 25% lies.

The Quartile Deviation
Formally, the Quartile Deviation is equal to half of the Inter-Quartile Range, and thus we can write it as
$$Q_d = \dfrac{Q_3 - Q_1}{2}$$
Therefore, we also call it the Semi Inter-Quartile Range.
• The Quartile Deviation doesn't take into account the extreme points of the distribution. Thus, the dispersion or the spread of only the central 50% of the data is considered.
• If the scale of the data is changed, Qd also changes in the same ratio.
• It is the best measure of dispersion for open-ended distributions (which have open-ended extreme classes).
• It is less affected by sampling fluctuations in the dataset than the range (another measure of dispersion).
• Since it depends solely on the central values of the distribution, the result is affected drastically if these values are abnormal or inaccurate in an experiment.

The Coefficient of Quartile Deviation

Based on the quartiles, a relative measure of dispersion, known as the Coefficient of Quartile Deviation, can be defined for any distribution. It is formally defined as
$$\text{Coefficient of Quartile Deviation} = \dfrac{Q_3 - Q_1}{Q_3 + Q_1}$$
Since it is a ratio of two quantities of the same dimensions, it is unit-less. Thus, it can act as a suitable parameter for comparing two or more different datasets which may or may not involve quantities with the same dimensions.

Example 1: The number of vehicles sold by a major Toyota showroom in a day was recorded for 10 working days. The data is given as

Day        1   2   3   4   5   6   7   8   9   10
Frequency  20  15  18  5   10  17  21  19  25  28

Find the Quartile Deviation and its coefficient for the given discrete distribution.

Solution: We first need to sort the data given to us before calculating the quartiles.
Sorted data: 5, 10, 15, 17, 18, 19, 20, 21, 25, 28
n (number of data points) = 10
Now, to find the quartiles, we use the logic that the first quartile lies halfway between the lowest value and the median, and the third quartile lies halfway between the median and the largest value.
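One way to carry out this step is sketched below in Python. It assumes the positional rule stated earlier for individual series (Q1 is the (N+1)/4th item and Q3 is the 3(N+1)/4th item), with linear interpolation between neighbouring items when the position is fractional. The helper name `item_at` is hypothetical, and other quartile conventions (for example, taking the median of each half) can give somewhat different values.

```python
def item_at(sorted_data, position):
    """Value at a 1-based, possibly fractional, position (linear interpolation)."""
    lower = int(position)            # whole part of the position
    fraction = position - lower      # fractional part of the position
    if lower < 1:
        return sorted_data[0]
    if lower >= len(sorted_data):
        return sorted_data[-1]
    below = sorted_data[lower - 1]   # item at the whole-numbered position
    above = sorted_data[lower]       # next item up
    return below + fraction * (above - below)

sales = sorted([20, 15, 18, 5, 10, 17, 21, 19, 25, 28])
n = len(sales)

q1 = item_at(sales, (n + 1) / 4)       # 2.75th item
q3 = item_at(sales, 3 * (n + 1) / 4)   # 8.25th item

quartile_deviation = (q3 - q1) / 2
coefficient = (q3 - q1) / (q3 + q1)

print(q1, q3, quartile_deviation, round(coefficient, 4))
```

Under this convention the 2.75th item lies three quarters of the way from the 2nd value (10) to the 3rd value (15), and the 8.25th item lies one quarter of the way from the 8th value (21) to the 9th value (25).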

Using the values for Q1 and Q3, we can now calculate the Quartile Deviation and its coefficient.

Example 2: For the following open-ended data, calculate the Quartile Deviation and its coefficient.

Marks            0-10  10-20  20-30  30-40  40-50  50-60
No. of Students  10    20     30     50     40     30

Solution: For a grouped-data distribution, we can find the quartiles through the following steps:
⇒ Construct a cumulative frequency table for the given data alongside the given distribution

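⇒ From the total number of data values, identify the classes that contain the Lower and Upper Quartiles
⇒ Use the following formulae to then calculate the quartiles:
$$Q_1 = L + \dfrac{N/4 - c.f.}{f} \times i, \qquad Q_3 = L + \dfrac{3N/4 - c.f.}{f} \times i$$
where L is the lower limit of the quartile class, N is the total frequency, c.f. is the cumulative frequency of the class preceding the quartile class, f is the frequency of the quartile class and i is the class width.

For the given data, we can form the required table with the cumulative frequency as

Marks   Frequency  Cumulative Frequency
0-10    10         10
10-20   20         30
20-30   30         60
30-40   50         110
40-50   40         150
50-60   30         180

Since the total number of students is 180, the first quartile must lie at the position of the 180/4 = 45th student.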

Similarly, the third quartile must lie at the position of the 180 × 3/4 = 135th student. From the cumulative frequencies, the first quartile lies in the 20-30 marks class and the third quartile lies in the 40-50 marks class.
Q1 = 20 + ((45 - 30)/30) × 10 = 25
Q3 = 40 + ((135 - 110)/40) × 10 = 46.25
Now, using the values for Q1 and Q3, we can calculate the Quartile Deviation and its coefficient as follows:
Quartile Deviation = (Q3 - Q1)/2 = (46.25 - 25)/2 = 10.625
Coefficient of Quartile Deviation = (Q3 - Q1)/(Q3 + Q1) = 21.25/71.25 ≈ 0.298
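The grouped-data procedure of Example 2 can be sketched as a small Python function. This is a minimal sketch under stated assumptions: the function name `grouped_quantile` and its argument layout are illustrative, and it applies the interpolation formula L + ((k - c.f.)/f) × i, where k is the target position (N/4 for Q1, 3N/4 for Q3).

```python
def grouped_quantile(classes, frequencies, fraction):
    """Quantile of a continuous grouped series.

    classes     -- list of (lower_limit, upper_limit) pairs in ascending order
    frequencies -- class frequencies, in the same order
    fraction    -- e.g. 1/4 for Q1, 3/4 for Q3, 3/10 for D3, 20/100 for P20
    """
    total = sum(frequencies)
    target = fraction * total        # position of the required item
    cumulative = 0                   # c.f. of the classes already passed
    for (lower, upper), f in zip(classes, frequencies):
        if f > 0 and cumulative + f >= target:
            width = upper - lower
            return lower + (target - cumulative) / f * width
        cumulative += f
    raise ValueError("fraction must lie between 0 and 1")

marks = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50), (50, 60)]
students = [10, 20, 30, 50, 40, 30]

q1 = grouped_quantile(marks, students, 1 / 4)
q3 = grouped_quantile(marks, students, 3 / 4)
quartile_deviation = (q3 - q1) / 2
coefficient = (q3 - q1) / (q3 + q1)

print(q1, q3, quartile_deviation, round(coefficient, 3))
```

Because the fraction is a parameter, the same function also covers deciles and percentiles of a continuous series.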

Example 3: Calculate the lower and upper quartiles, third decile and 20th percentile from the following data:

Central value:  2.5  7.5  12.5  17.5  22.5
Frequency:      7    18   25    30    20

Solution: Since we are given mid-points, we will first find the lower and upper limits of the various classes. The method is to take the difference between two successive central values and divide it by 2, then subtract the result from each central value to get the lower limit and add it to get the upper limit. In the given case (7.5 - 2.5)/2 = 5/2 = 2.5. The first class shall be 0-5, the second 5-10, etc.

CALCULATION OF Q1, Q3, D3, P20

Class group  f    c.f.
0-5          7    7
5-10         18   25
10-15        25   50
15-20        30   80
20-25        20   100
             N = 100

Lower Quartile
Q1 = size of the N/4th item = 100/4 = 25th item (continuous series)
Q1 lies in the class 5-10.
Q1 = L + ((N/4 - c.f.)/f) × i
L = 5, N/4 = 25, c.f. = 7, f = 18, i = 5

Q1 = 5 + ((25 - 7)/18) × 5 = 5 + 5 = 10

Upper Quartile
Q3 = size of the 3N/4th item = 75th item
Q3 lies in the class 15-20.
Q3 = L + ((3N/4 - c.f.)/f) × i
L = 15, 3N/4 = 75, c.f. = 50, f = 30, i = 5
Q3 = 15 + ((75 - 50)/30) × 5 = 15 + 4.17 = 19.17

Third Decile
D3 = size of the 3N/10th item = 30th item (continuous series)
D3 lies in the class 10-15.
D3 = L + ((3N/10 - c.f.)/f) × i
L = 10, 3N/10 = 30, c.f. = 25, f = 25, i = 5
D3 = 10 + ((30 - 25)/25) × 5 = 10 + 1 = 11

Twentieth Percentile
P20 = size of the 20N/100th item = (20 × 100)/100 = 20th item (continuous series)
P20 lies in the class 5-10.
P20 = L + ((20N/100 - c.f.)/f) × i
L = 5, 20N/100 = 20, c.f. = 7, f = 18, i = 5
P20 = 5 + ((20 - 7)/18) × 5 = 5 + 3.61 = 8.61

5.7 SUMMARY
• The coefficient of range is the ratio of the difference between the two extreme items of the distribution to their sum.
• The coefficient of quartile deviation is the ratio of the difference between the Upper Quartile and the Lower Quartile of a distribution to their sum.
• Mean deviation is an absolute measure of dispersion.
• The quartiles of a data set are formed by the two boundaries on either side of the median, which divide the set into four equal sections.
• The lowest 25% of the data is found below the first quartile value, also called the lower quartile (Q1).
• The median or second quartile divides the set into two equal sections.

5.8 KEYWORDS
• Absolute measure: A measure of dispersion expressed in the same unit as the original data set.
• Deviation: A measure of the difference between the observed value and the expected value of a variable.
• Coefficient of variation (CV): The ratio of the standard deviation to the mean, a relative measure of the dispersion of data points in a data series around the mean.
• Quartile 1 (Q1): The median (middle value) of the lower half of the data.
• Quartile 3 (Q3): The median (middle value) of the upper half of the data.
• Inter Quartile Range (IQR): The difference between Q3 and Q1.

5.9 LEARNING ACTIVITY
Students will be learning the definitions of quartile deviation and standard deviation and will be encouraged to work together to get a class average. Students are introduced to statistics and why it is important to daily life.
The number of problems worked out by a student on the seven days of a week was as follows: 5, 9, 15, 11, 13, 17, 7

Find the (i) lower quartile, (ii) upper quartile, (iii) inter quartile range, (iv) semi-inter quartile range, and (v) range for the distribution.
________________________________________________________________________________
_______________________________________________________________

5.10 UNIT END QUESTIONS
A. Descriptive Questions
Short Questions
1. What is the relationship between quartile deviation and standard deviation?
2. Find the lower quartile for the following data: 2, 1, 0, 3, 1, 2, 3, 4, 3, 5
3. How do you find standard deviation from quartile deviation?
4. Which is better, IQR or standard deviation?
5. Find the inter quartile range for the following data: 5, 9, 15, 11, 13, 17, 7
Long Questions
1. Find the median and quartiles of each of the following sets of numbers. These represent the "four cases" that you should be able to compute using the rules in this course. 23, 35, 28, 33, 5, 12, 40, 25, 20, 18, 1, 16
2. The following are the golf scores of 12 members of a women's golf team: 89 90 87 95 86 81 102 105 83 88 91 79. Compute the mean, median, five number summary, IQR, and standard deviation of the scores. Are there any outliers, according to our rule of thumb?
3. Find for the given distribution (i) the lower quartile, (ii) the upper quartile, and (iii) the inter quartile range.

Variate    1  2  3  4   5  6  7   8
Frequency  8  1  7  15  1  6  10  5

4. Find for the given distribution (i) the lower quartile, (ii) the upper quartile, and (iii) the inter quartile range. Hint: Arrange the variates in ascending order.

Variate    30  40  10  20  50  60
Frequency  11  30  15  8   12  9

5. Find for the given distribution (i) the lower quartile, (ii) the upper quartile, and (iii) the inter quartile range.

Variate               5  10  20  30  50  60  80
Cumulative Frequency  7  12  21  35  42  50  56

B. Multiple Choice Questions
1. The mean and variance of 7 observations are 8 and 16. If 5 of the observations are 2, 4, 10, 12, 14, the remaining 2 observations are:
a. x = 6, y = 8
b. x = 5, y = 7
c. x = 7, y = 3
d. None of these
2. The variance of 15 observations is 4. If each observation is increased by 9, the variance of the resulting observations is:
a. 2
b. 3
c. 4
d. 5
3. The mean of 5 observations is 4.4 and their variance is 8.24. If 3 of the observations are 1, 2, 6, the other 2 observations are:

a. 9, 4
b. 7, 8
c. 6, 5
d. 4, 8
4. In a symmetrical distribution Q1 = 20 and median = 30. The value of Q3 is
a. 35
b. 40
c. 45
d. 50
5. The lower and upper quartiles of a distribution are 80 and 120 respectively, while the median is 100. The shape of the distribution is
a. Negatively skewed
b. Positively skewed
c. Normal
d. Symmetrical
Answers
1. a, 2. c, 3. a, 4. b, 5. d

UNIT 6: CORRELATION AND REGRESSION
Structure
6.0 Learning Objectives
6.1 Introduction
6.2 Correlation Coefficient Formula
6.3 Simple Linear Regression Equation
6.4 Significance of Correlation
6.5 Correlation and Causation
6.6 Summary
6.7 Keywords
6.8 Learning Activity
6.9 Unit End Questions
6.10 References

6.0 LEARNING OBJECTIVES
After studying this unit students will be able to
• Identify the direction and strength of a linear correlation between two factors.
• Compute and interpret the Pearson correlation coefficient and the coefficient of determination, and test for significance.
• Identify and explain three assumptions and three limitations for evaluating a correlation coefficient.
• Delineate the use of the Spearman, point-biserial, and phi correlation coefficients.
• Distinguish between a predictor variable and a criterion variable.
• Compute and interpret the method of least squares.
• Identify each source of variation in an analysis of regression, compute an analysis of regression and interpret the results.
• Compute and interpret the standard error of estimate.

6.1 INTRODUCTION
Correlation Analysis

Correlation analysis is applied in quantifying the association between two continuous variables, for example, between a dependent and an independent variable or between two independent variables.

Regression Analysis
Regression analysis refers to assessing the relationship between an outcome variable and one or more other variables. The outcome variable is known as the dependent or response variable, and the risk factors and confounders are known as predictors or independent variables. In regression analysis the dependent variable is denoted by "y" and the independent variables by "x".

In correlation analysis, a sample correlation coefficient is estimated. It is denoted by r, ranges between -1 and +1, and quantifies the strength and direction of the linear association between two variables. The correlation between two variables can be either positive, i.e., a higher level of one variable is related to a higher level of the other, or negative, i.e., a higher level of one variable is related to a lower level of the other. The sign of the coefficient of correlation shows the direction of the association. The magnitude of the coefficient shows the strength of the association. For example, a correlation of r = 0.8 indicates a positive and strong association between two variables, while a correlation of r = -0.3 shows a negative and weak association. A correlation near zero indicates the absence of a linear association between two continuous variables.

Linear Regression
Linear regression is a linear approach to modelling the relationship between a scalar response and one or more independent variables. If the regression has one independent variable, it is known as simple linear regression. If it has more than one independent variable, it is known as multiple linear regression. Linear regression focuses on the conditional probability distribution of the response given the values of the predictors, rather than on the joint probability distribution of all the variables.
In general, most real-world regression models involve multiple predictors, so the term linear regression often refers to multiple linear regression.

6.2 CORRELATION COEFFICIENT FORMULA
Let X and Y be two random variables. The population correlation coefficient for X and Y is given by the formula:
$$\rho_{XY} = \dfrac{\operatorname{Cov}(X, Y)}{\sigma_X \sigma_Y} = \dfrac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
where,

ρXY = Population correlation coefficient between X and Y
μX = Mean of the variable X
μY = Mean of the variable Y
σX = Standard deviation of X
σY = Standard deviation of Y
E = Expected value operator
Cov = Covariance

The above formula can also be written as:
$$\rho_{XY} = \dfrac{E(XY) - E(X)E(Y)}{\sqrt{E(X^2) - E(X)^2}\,\sqrt{E(Y^2) - E(Y)^2}}$$
The sample correlation coefficient formula is:
$$r_{xy} = \dfrac{n\sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n\sum x_i^2 - \left(\sum x_i\right)^2}\,\sqrt{n\sum y_i^2 - \left(\sum y_i\right)^2}}$$
These formulas are used to find the correlation coefficient for the given data. Based on the value obtained, we can determine how strong the association between the two variables is. The value will always be a number between -1 and 1 (inclusive).
• If r is close to 1, we say that the variables are positively correlated. This means there is likely a strong linear relationship between the two variables, with a positive slope.
• If r is close to -1, we say that the variables are negatively correlated. This means there is likely a strong linear relationship between the two variables, with a negative slope.
• If r is close to 0, we say that the variables are not correlated. This means that there is likely no linear relationship between the two variables; however, the variables may still be related in some other way.
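A sample correlation coefficient can be computed from paired observations with a short Python function. This is a sketch rather than a reference implementation: it assumes the computational form r = (n∑xy - ∑x∑y) / √([n∑x² - (∑x)²][n∑y² - (∑y)²]), and the function name `pearson_r` and the small data set are hypothetical, chosen only for illustration.

```python
import math

def pearson_r(xs, ys):
    """Sample correlation coefficient of paired observations xs and ys."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

hours = [1, 2, 3, 4, 5]   # hypothetical illustrative data
marks = [2, 4, 5, 4, 5]   # hypothetical illustrative data

r = pearson_r(hours, marks)
print(round(r, 4))
```

A value near +1 or -1 from this function indicates a strong linear relationship with positive or negative slope; a value near 0 indicates little linear relationship.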

Example 1: The time x in years that an employee spent at a company and the employee's hourly pay, y, for 5 employees are listed in the table below. Calculate and interpret the correlation coefficient r. Include a plot of the data in your discussion.

Example 2: The table below shows the number of absences, x, in a Calculus course and the final exam grade, y, for 7 students. Find the correlation coefficient and interpret your result.

x  1   0   2   6   4   3   3
y  95  90  90  55  70  80  85

Example 3: The table below shows the height, x, in inches and the pulse rate, y, per minute, for 9 people. Find the correlation coefficient and interpret your result.

x  68  72  65  70   62   75  78  64  68
y  90  85  88  100  105  98  70  65  72

6.3 SIMPLE LINEAR REGRESSION EQUATION
As we know, linear regression is used to model the relationship between two variables. Thus, a simple linear regression equation can be written as:
Y = a + bX
where,
Y = Dependent variable
X = Independent variable
a = [(∑y)(∑x²) - (∑x)(∑xy)] / [n(∑x²) - (∑x)²]
b = [n(∑xy) - (∑x)(∑y)] / [n(∑x²) - (∑x)²]

Example 1: Find the equation of the regression line for each of the two examples and two practice problems for Example 1.
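The formulas for a and b given above can be sketched in Python as follows. The function names `fit_line_from_sums` and `fit_line` and the sample data are assumptions made for this illustration; the arithmetic follows the two formulas directly.

```python
def fit_line_from_sums(n, sum_x, sum_y, sum_x2, sum_xy):
    """Intercept a and slope b of Y = a + bX from the summary sums."""
    denominator = n * sum_x2 - sum_x ** 2
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator
    b = (n * sum_xy - sum_x * sum_y) / denominator
    return a, b

def fit_line(xs, ys):
    """Intercept a and slope b of Y = a + bX from paired observations."""
    return fit_line_from_sums(
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )

xs = [1, 2, 3, 4, 5]   # hypothetical illustrative data
ys = [2, 4, 5, 4, 5]   # hypothetical illustrative data

a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f}X")
```

Keeping a version that works from the sums alone is convenient because many textbook problems supply only n, ∑x, ∑y, ∑x² and ∑xy rather than the raw observations.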

Example 2: Calculate the regression coefficients and obtain the lines of regression for the following data.
Solution:
Regression coefficient of X on Y
(i) Regression equation of X on Y
(ii) Regression coefficient of Y on X
(iii) Regression equation of Y on X
Y = 0.929X - 3.716 + 11 = 0.929X + 7.284
The regression equation of Y on X is Y = 0.929X + 7.284

Example 3: Calculate the two regression equations of X on Y and Y on X from the data given below, taking deviations from the actual means of X and Y. Estimate the likely demand when the price is Rs. 20.
Solution: Calculation of regression equations
(i) Regression equation of X on Y
(ii) Regression equation of Y on X
When X is 20, Y will be
= -0.25(20) + 44.25
= -5 + 44.25
= 39.25 (when the price is Rs. 20, the likely demand is 39.25)

Example 4: Obtain the regression equation of Y on X and estimate Y when X = 55 from the following data.
Solution:
(i) Regression coefficient of Y on X
(ii) Regression equation of Y on X
Y - 51.57 = 0.942(X - 48.29)
Y = 0.942X - 45.49 + 51.57
Y = 0.942X + 6.08

The regression equation of Y on X is Y = 0.942X + 6.08
Estimation of Y when X = 55: Y = 0.942(55) + 6.08 = 57.89

Example 5: Find the means of X and Y variables and the coefficient of correlation between them from the following two regression equations: 2Y - X - 50 = 0 and 3Y - 2X - 10 = 0.
Solution: We are given
2Y - X - 50 = 0 ... (1)
3Y - 2X - 10 = 0 ... (2)
Solving equations (1) and (2), we get Y = 90. Putting the value of Y in equation (1), we get X = 130. The means are therefore X̄ = 130 and Ȳ = 90.
Calculating the correlation coefficient:
Let us assume equation (1) is the regression equation of Y on X:
2Y = X + 50, so Y = 0.5X + 25 and byx = 0.5
Then equation (2) is the regression equation of X on Y:
2X = 3Y - 10, so X = 1.5Y - 5 and bxy = 1.5
r = √(byx × bxy) = √(0.5 × 1.5) = √0.75 ≈ 0.866

NOTE: In the above problem one of the regression coefficients is greater than 1 and the other is less than 1, and their product does not exceed 1. Therefore, our assumption about the given equations is correct.

Example 5: Find the means of X and Y variables and the coefficient of correlation between them from the following two regression equations: 4X - 5Y + 33 = 0 and 20X - 9Y - 107 = 0.
Solution: We are given
4X - 5Y + 33 = 0 ... (1)
20X - 9Y - 107 = 0 ... (2)
Solving equations (1) and (2), we get Y = 17. Putting the value of Y in equation (1), we get X = 13.
Calculating the correlation coefficient:
Let us assume equation (1) is the regression equation of X on Y: 4X = 5Y - 33, so bxy = 5/4 = 1.25.
Let us assume equation (2) is the regression equation of Y on X: 9Y = 20X - 107, so byx = 20/9 ≈ 2.22.
But this is not possible because both regression coefficients are greater than 1. So our assumption is wrong. Therefore, we treat equation (1) as the regression equation of Y on X and equation (2) as the regression equation of X on Y. So, we get
5Y = 4X + 33, so byx = 4/5 = 0.8
20X = 9Y + 107, so bxy = 9/20 = 0.45
r = √(0.8 × 0.45) = √0.36 = 0.6

Example 6:

The following table shows the sales and advertisement expenditure of a firm. Coefficient of correlation r = 0.9. Estimate the likely sales for a proposed advertisement expenditure of Rs. 10 crores.
Solution: When the advertisement expenditure is 10 crores, i.e., Y = 10, then sales X = 6(10) + 4 = 64, which implies that sales are 64.

Example 7: There are two series of index numbers, P for price index and S for stock of the commodity. The mean and standard deviation of P are 100 and 8, and of S are 103 and 4 respectively. The correlation coefficient between the two series is 0.4. With these data obtain the regression lines of P on S and S on P.
Solution: Let us consider X for price P and Y for stock S. Then the mean and SD of P are X̄ = 100 and σx = 8, and the mean and SD of S are Ȳ = 103 and σy = 4. The correlation coefficient between the series is r(X, Y) = 0.4.
Let the regression line of X on Y be
X - X̄ = r(σx/σy)(Y - Ȳ)
X - 100 = 0.4 × (8/4)(Y - 103) = 0.8(Y - 103)
X = 0.8Y + 17.6
The regression line of Y on X is
Y - Ȳ = r(σy/σx)(X - X̄)
Y - 103 = 0.4 × (4/8)(X - 100) = 0.2(X - 100)
Y = 0.2X + 83
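When only the means, standard deviations and correlation coefficient are known, as in Example 7, both regression lines follow from bxy = r·σx/σy and byx = r·σy/σx. The Python sketch below illustrates this under that assumption; the function name `regression_lines` and its return layout are hypothetical.

```python
def regression_lines(mean_x, sd_x, mean_y, sd_y, r):
    """Return (b_xy, c_x, b_yx, c_y) for the two regression lines
    X = b_xy * Y + c_x  (X on Y)  and  Y = b_yx * X + c_y  (Y on X)."""
    b_xy = r * sd_x / sd_y          # regression coefficient of X on Y
    b_yx = r * sd_y / sd_x          # regression coefficient of Y on X
    c_x = mean_x - b_xy * mean_y    # both lines pass through the means
    c_y = mean_y - b_yx * mean_x
    return b_xy, c_x, b_yx, c_y

# Price index P as X (mean 100, SD 8), stock S as Y (mean 103, SD 4), r = 0.4
b_xy, c_x, b_yx, c_y = regression_lines(100, 8, 103, 4, 0.4)

print(f"X on Y: X = {b_xy:.2f}Y + {c_x:.2f}")
print(f"Y on X: Y = {b_yx:.2f}X + {c_y:.2f}")
```

The product of the two regression coefficients equals r², which gives a quick consistency check on any pair of lines obtained this way.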

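Example 8: For 5 pairs of observations the following results are obtained: ∑X = 15, ∑Y = 25, ∑X² = 55, ∑Y² = 135, ∑XY = 83. Find the equations of the lines of regression and estimate the value of X on the first line when Y = 12 and the value of Y on the second line if X = 8.
Solution:
X̄ = ∑X/N = 15/5 = 3 and Ȳ = ∑Y/N = 25/5 = 5
bxy = (N∑XY - ∑X∑Y) / (N∑Y² - (∑Y)²) = (415 - 375)/(675 - 625) = 0.8
byx = (N∑XY - ∑X∑Y) / (N∑X² - (∑X)²) = (415 - 375)/(275 - 225) = 0.8
Regression line of X on Y:
X - 3 = 0.8(Y - 5)
X = 0.8Y - 1
When Y = 12 the value of X is estimated as 0.8(12) - 1 = 8.6
Regression line of Y on X:
Y - 5 = 0.8(X - 3)
Y = 0.8X + 2.6
When X = 8 the value of Y is estimated as
= 0.8(8) + 2.6
= 9

Example 9: The two regression lines are 3X + 2Y = 26 and 6X + 3Y = 31. Find the correlation coefficient.
Solution: Let the regression equation of Y on X be 3X + 2Y = 26, so that 2Y = -3X + 26 and byx = -3/2 = -1.5.
Let the regression equation of X on Y be 6X + 3Y = 31, so that 6X = -3Y + 31 and bxy = -3/6 = -0.5.
Since both regression coefficients are negative, the correlation coefficient is negative: r = -√(1.5 × 0.5) = -√0.75 ≈ -0.866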

Example 10: In a laboratory experiment on correlation research study the equations of the two regression lines were found to be 2X - Y + 1 = 0 and 3X - 2Y + 7 = 0. Find the means of X and Y. Also work out the values of the regression coefficients and the correlation between the two variables X and Y.
Solution: Solving the two regression equations we get the mean values of X and Y: X̄ = 5 and Ȳ = 11.
Taking 3X - 2Y + 7 = 0 as the regression equation of Y on X gives Y = 1.5X + 3.5 and byx = 1.5. Taking 2X - Y + 1 = 0 as the regression equation of X on Y gives X = 0.5Y - 0.5 and bxy = 0.5. The product byx × bxy = 0.75 does not exceed 1, so this assignment is valid, and r = √0.75 ≈ 0.866.

Example 11: For the given lines of regression 3X - 2Y = 5 and X - 4Y = 7, find (i) the regression coefficients and (ii) the coefficient of correlation.
Solution:
(i) First convert the given equations into the standard forms of Y on X and X on Y and find their regression coefficients.
Given regression lines are
3X - 2Y = 5 ... (1)
X - 4Y = 7 ... (2)
Let the line of regression of X on Y be 3X - 2Y = 5:
3X = 2Y + 5, so X = (2/3)Y + 5/3 and bxy = 2/3
Let the line of regression of Y on X be X - 4Y = 7:
4Y = X - 7, so Y = (1/4)X - 7/4 and byx = 1/4
(ii) Coefficient of correlation
Since the two regression coefficients are positive, the correlation coefficient is also positive and it is given by
r = √(bxy × byx) = √((2/3) × (1/4)) = √(1/6) ≈ 0.408
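The reasoning used in the examples above (solve the two lines for the means, then decide which line is Y on X by checking that the product of the regression coefficients does not exceed 1) can be sketched in Python. This is an illustrative sketch only: the function name `analyse_regression_pair` and the (a, b, c) encoding of a line aX + bY + c = 0 are assumptions made for this example.

```python
import math

def analyse_regression_pair(line1, line2):
    """Each line is (a, b, c), meaning aX + bY + c = 0.
    Returns (mean_x, mean_y, b_yx, b_xy, r)."""
    a1, b1, c1 = line1
    a2, b2, c2 = line2
    # Both regression lines pass through (mean of X, mean of Y),
    # so the means solve the 2x2 system (Cramer's rule).
    det = a1 * b2 - a2 * b1
    mean_x = (b1 * c2 - b2 * c1) / det
    mean_y = (a2 * c1 - a1 * c2) / det
    # Try both ways of labelling the lines; a valid labelling has
    # 0 <= b_yx * b_xy <= 1, because the product equals r squared.
    for (ay, by, _), (ax, bx, _) in ((line1, line2), (line2, line1)):
        b_yx = -ay / by   # slope after solving the first line for Y
        b_xy = -bx / ax   # slope after solving the second line for X
        product = b_yx * b_xy
        if 0 <= product <= 1:
            # r carries the common sign of the regression coefficients
            r = math.copysign(math.sqrt(product), b_yx)
            return mean_x, mean_y, b_yx, b_xy, r
    raise ValueError("no valid labelling of the two lines")

# Example 11: 3X - 2Y = 5 and X - 4Y = 7, rewritten as aX + bY + c = 0
mean_x, mean_y, b_yx, b_xy, r = analyse_regression_pair((3, -2, -5), (1, -4, -7))
print(round(b_yx, 3), round(b_xy, 3), round(r, 3))
```

The sign of r is taken from the regression coefficients, which always share a sign in a valid pair, so the same function handles negatively correlated cases such as Example 9.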

6.4 SIGNIFICANCE OF CORRELATION
The study of correlation is widely used in practical life for the following reasons:
• Most variables show some kind of relationship with one another, for example price and supply, or income and expenditure. Correlation analysis helps in measuring the degree of relationship that exists between the different variables.
• Once the relationship among variables is known, we can estimate the value of one variable from the value of another. This is done with the help of regression analysis.
• Correlation analysis contributes to the understanding of economic behaviour, aids in locating the critically important variables on which others depend, may reveal to the economist the connections by which disturbances spread, and suggests the paths through which stabilizing forces may become effective.
• In business, correlation analysis enables the executive to estimate costs, sales, prices and other variables on the basis of some other series with which these costs, sales or prices may be functionally related. Some of the guesswork can be removed from decisions when the relationship between the variable to be estimated and the variables on which it depends is known.
• The coefficient of correlation is a widely used and widely abused statistical measure. It is abused in the sense that one often overlooks the fact that it measures nothing but the strength of a linear relationship. As a result, it does not necessarily imply a cause-effect relationship.
• Progress in science and philosophy has been characterized by increased knowledge of relationships or correlations. In nature, too, one finds a multiplicity of interrelated forces.
• Correlation reduces the range of uncertainty. A forecast based on correlation analysis is likely to be more valuable and realistic.
6.5 CORRELATION AND CAUSATION Correlation analysis helps us in determining the degree of relationship between two or more variables; it does not tell us anything about cause-and-effect relationship among variables. Existence of a high degree of correlation does not necessarily indicate that there is a relationship of cause and effect between the variables or we can say that correlation does not essentially imply a functional relationship though the existence of causation. This shows that it always implies 143 CU IDOL SELF LEARNING MATERIAL (SLM)

correlation by itself and it establishes only co- variation. The conventional dictum that \"correlation does not imply causation\" means that correlation cannot be used to infer a causal relationship among the variables. This saying should not be taken to denote that correlations cannot indicate causal relationships. On the other hand, the cause’s essential for correlation may be indirect and unknown. As a result, we establish a correlation among two variables may not be a sufficient condition to establish a causal relationship in either direction. We have seen in surroundings that a correlation between age and height in children is transparent, but on the other hand if I want to see correlation between mood and health in people then it is difficult to comment on relationship. Can you predict that improved mood lead to improved health; or if a person has good health, then he has good mood; or both? Really unpredictable and it varies from person to person. We can conclude from above example that a correlation can be an evidence for a possible causal relationship, but it cannot indicate what type of causal relationship is there or it might be existing. 6.6 SUMMARY • Correlation shows the relationship between the two variables, while regression allows us to see how one affects the other. • The data shown with regression establishes a cause and effect, when one changes, so does the other, and not always in the same direction. With correlation, the variables move together • a core concern in regression analysis is first take a step back and reflect on the reasons why they are needed, spurious relationships, they measure the impact of any given variable above and bogging down the discussion in cautions, let us look at its application and interpretation. 6.7 KEYWORDS • Correlation coefficients: A measure the strength of association between two variables. • Mean: The average of the numbers. 
• Standard deviation: A statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance.
• Expectation: The probabilistic expected value of the result (measurement) of an experiment.
• Covariance: A measure of the relationship between two random variables.
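The keywords above can be tied together with a short numerical sketch. The Python below is only an illustration under stated assumptions: the data values and the function name `describe_pair` are hypothetical and do not come from this unit, and the population (divide-by-n) formulas are used throughout. It computes the mean, standard deviation and covariance of two variables, then derives the correlation coefficient r = cov(X, Y) / (σx · σy) and the two regression coefficients b_yx = cov(X, Y) / σx² and b_xy = cov(X, Y) / σy².

```python
import math


def describe_pair(xs, ys):
    """Return means, standard deviations, covariance, r and regression slopes.

    Hypothetical helper for illustration; uses population (divide-by-n) formulas.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n

    # Variance: average squared deviation from the mean.
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n

    # Covariance: average product of paired deviations.
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n

    sd_x = math.sqrt(var_x)
    sd_y = math.sqrt(var_y)

    return {
        "mean_x": mean_x,
        "mean_y": mean_y,
        "sd_x": sd_x,
        "sd_y": sd_y,
        "cov_xy": cov_xy,
        "r": cov_xy / (sd_x * sd_y),  # correlation coefficient
        "b_yx": cov_xy / var_x,       # slope of the regression of Y on X
        "b_xy": cov_xy / var_y,       # slope of the regression of X on Y
    }


# Made-up illustrative data, not taken from the unit.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

result = describe_pair(xs, ys)
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

For this made-up data, r is about 0.775, and the product of the two regression coefficients (0.6 × 1.0) equals r², which is why r is described as the geometric mean of the regression coefficients. None of these numbers says anything about which variable, if either, causes the other.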

6.8 LEARNING ACTIVITY
Students will learn the definitions of correlation and regression and will be encouraged to work together to find a class average. Students are introduced to statistics and why it is important in daily life.

A survey was conducted to study the relationship between expenditure on accommodation (X) and expenditure on food and entertainment (Y), and the following results were obtained:

Write down the regression equation and estimate the expenditure on food and entertainment if the expenditure on accommodation is Rs. 200.
________________________________________________________________________________

6.9 UNIT END QUESTIONS
A. Descriptive Questions
Short Questions
1. What is the difference between correlation and regression?
2. What are correlation and regression in statistics?
3. What is the difference between correlation and simple linear regression?
4. For 5 observations of pairs (X, Y) of the variables X and Y, the following results are obtained: ∑X = 15, ∑Y = 25, ∑X² = 55, ∑Y² = 135, ∑XY = 83. Find the equations of the lines of regression and estimate the value of X when Y = 8 and the value of Y when X = 12.
5. The two regression lines were found to be 4X − 5Y + 33 = 0 and 20X − 9Y − 107 = 0. Find the mean values and the coefficient of correlation between X and Y.

6. The equations of two lines of regression obtained in a correlation analysis are 2X = 8 − 3Y and 2Y = 5 − X. Obtain the values of the regression coefficients and the correlation coefficient.

Long Questions
1. From the data given below, find (a) the two regression equations, (b) the coefficient of correlation between marks in Economics and Statistics, (c) the most likely marks in Statistics when the marks in Economics are 30.
2. The heights (in cm) of a group of fathers and sons are given below. Find the lines of regression and estimate the height of the son when the height of the father is 164 cm.
3. The following data give the height in inches (X) and the weight in lb (Y) of a random sample of 10 students from a large group of students aged 17 years. Estimate the weight of a student of height 69 inches.
4. Obtain the two regression lines from the following data: N = 20, ∑X = 80, ∑Y = 40, ∑X² = 1680, ∑Y² = 320 and ∑XY = 480.
5. Given the following data, what will be the possible yield when the rainfall is 29?

Coefficient of correlation between rainfall and production is 0.8.
6. The following data relate to advertisement expenditure (in lakhs of rupees) and the corresponding sales (in crores of rupees). Estimate the sales corresponding to an advertising expenditure of Rs. 30 lakhs.
7. You are given the following data. If the correlation coefficient between X and Y is 0.66, find (i) the two regression coefficients, (ii) the most likely value of Y when X = 10.
8. The table below shows the number of absences, x, in a Calculus course and the final exam grade, y, for 7 students. Find the correlation coefficient and interpret your result.
x: 1 0 2 6 4 3 3
y: 85 80 70 55 90 90 95

B. Multiple Choice Questions
1. The correlation coefficient is the ________ of the two regression coefficients:
a. Geometric mean
b. Arithmetic mean
c. Harmonic mean
d. Median
2. When the two regression coefficients bear the same algebraic sign, the correlation coefficient is:
a. Positive
b. Negative
c. According to the two signs
d. Zero

3. It is possible that the two regression coefficients have:
a. Opposite signs
b. Same signs
c. No sign
d. Difficult to tell
4. The regression coefficient is independent of:
a. Units of measurement
b. Scale and origin
c. Both (a) and (b)
d. None of these
5. In the regression line Y = a + bX:
a. ∑X = ∑X̂
b. ∑Y = ∑Ŷ
c. ∑X = ∑Y
d. X = Y

Answers
1.a, 2.c, 3.b, 4.c, 5.b

6.10 REFERENCES
Reference Books:
• Dr. J. Ravichandran, Probability & Statistics for Eng., Wiley Publications
• Dr. B. Krishna Gandhi, Dr. T.K.V Iyengar, M.V.S.S.N. Prasad, Probability and
• S.P. Gupta, Statistical Methods, Sultan Chand & Sons, 2008 Edition.
• C. R. Kothari, Research Methodology, Vikas Publishing House
Textbooks:
• S.C. Gupta, V.K. Kapoor, Fundamental of Mathematical Statistics, Sultan Chand and Company.

• Seymour Lipschutz, Jack Schiller, Introduction to Probability & Statistics, McGraw-Hill Publishers.
• T. Subbi Reddy, Research Methodology and Statistical Methods, Reliance Publishing House
• Dimitris Bertsimas and John, Introduction to Linear Optimization

UNIT 7: TYPES OF CORRELATION, MEANING, USES OF REGRESSION ANALYSIS
Structure
7.0 Learning Objectives
7.1 Introduction
7.2 Types of Correlation
7.3 Simple, Partial and Multiple Correlations
7.4 Linear and Non-Linear (Curvilinear) Correlation
7.5 Rank Correlation
7.6 Coefficient of Correlation
7.7 Introduction to Regression
7.8 Uses of Regression
7.9 Summary
7.10 Keywords
7.11 Learning Activity
7.12 Unit End Questions
7.13 References

7.0 LEARNING OBJECTIVES
After studying this unit, students will be able to:
• Identify the direction and strength of a correlation between two factors.
• Compute and interpret the Pearson correlation coefficient and test for significance.
• Compute and interpret the coefficient of determination.
• Define linearity and normality and explain why each assumption is necessary to appropriately interpret a significant correlation coefficient.
• Compute and interpret the Spearman correlation coefficient and test for significance.

