5. A random variable that assumes a finite or a countably infinite number of values is called ___________
      a. Continuous random variable
      b. Discrete random variable
      c. Irregular random variable
      d. Uncertain random variable

6. In a discrete probability distribution, the sum of all probabilities is always?
      a. 0
      b. Infinite
      c. 1
      d. Undefined

7. The covariance of two independent random variables is ___________
      a. 1
      b. 0
      c. -1
      d. Undefined

8. What would be the probability of an event 'G' if H denotes its complement, according to the axioms of probability?
      a. P(G) = 1 / P(H)
      b. P(G) = 1 - P(H)
      c. P(G) = 1 + P(H)
      d. P(G) = P(H)

9. The expected value of a discrete random variable 'x' is given by ___________
      a. P(x)
      b. ∑ P(x)
      c. ∑ x P(x)
      d. 1

Answers
1-a, 2-a, 3-b, 4-a, 5-b, 6-c, 7-b, 8-b, 9-c

10.10 REFERENCES

Text Books:
      • Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, Updated for Python 3, Shroff/O'Reilly Publishers, 2016.
      • Michael Urban, Joel Murach, Mike Murach, "Murach's Python Programming", Dec. 2016.

Reference Books:
      • Guido van Rossum and Fred L. Drake Jr., "An Introduction to Python", Revised and updated for Python 3.2.
      • Jake VanderPlas, "Python Data Science Handbook", O'Reilly Publishers, 2016.

CU IDOL SELF LEARNING MATERIAL (SLM)
UNIT - 11: QUANTITATIVE EXPLORATORY DATA ANALYSIS (EDA)

Structure
    11.0. Learning Objectives
    11.1. Introduction
    11.2. Summary of Categorical Data
    11.3. Summary of Continuous Data
    11.4. Summary
    11.5. Keywords
    11.6. Learning Activity
    11.7. Unit End Questions
    11.8. References

11.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
      • Describe summarization in EDA
      • Summarize continuous and categorical data

11.1 INTRODUCTION

Exploratory data analysis (EDA) is one of the best practices used in data science today. People starting a career in data science often do not know the difference between data analysis and exploratory data analysis. The difference is not large, but the two serve different purposes. A summary analysis is simply a numeric reduction of a historical data set. It is quite passive, and its focus is on the past. Commonly, its purpose is simply to arrive at a few key statistics (for example, the mean and standard deviation), which may then either replace the data set or be added to it in the form of a summary table.
11.2 SUMMARY OF CATEGORICAL DATA

Categorical variables do not admit any mathematical operations. We cannot sum them, or even sort them. We can only count them. As such, summaries of categorical variables always start with counting the frequency of each category.

Summary of Univariate Categorical Data

# Make some data
gender <- c(rep('Boy', 10), rep('Girl', 12))
drink <- c(rep('Coke', 5), rep('Sprite', 3), rep('Coffee', 6), rep('Tea', 7), rep('Water', 1))
# age is sampled at random, so counts that involve age differ between runs
age <- sample(c('Young', 'Old'), size = length(gender), replace = TRUE)

# Count frequencies
table(gender)

## gender
##  Boy Girl
##   10   12

table(drink)

## drink
## Coffee   Coke Sprite    Tea  Water
##      6      5      3      7      1

If instead of the level counts you want the proportions, you can use prop.table:

prop.table(table(gender))

## gender
##       Boy      Girl
## 0.4545455 0.5454545

Summary of Bivariate Categorical Data

library(magrittr)
cbind(gender, drink) %>% head # bind vectors into matrix and inspect

##      gender drink
## [1,] "Boy"  "Coke"
## [2,] "Boy"  "Coke"
## [3,] "Boy"  "Coke"
## [4,] "Boy"  "Coke"
## [5,] "Boy"  "Coke"
## [6,] "Boy"  "Sprite"

table1 <- table(gender, drink) # count frequencies of bivariate combinations
table1

##       drink
## gender Coffee Coke Sprite Tea Water
##   Boy       2    5      3   0     0
##   Girl      4    0      0   7     1

Summary of Multivariate Categorical Data

table2.1 <- table(gender, drink, age) # A machine readable table.
table2.1

## , , age = Old
##
##       drink
## gender Coffee Coke Sprite Tea Water
##   Boy       1    2      1   0     0
##   Girl      2    0      0   3     1
##
## , , age = Young
##
##       drink
## gender Coffee Coke Sprite Tea Water
##   Boy       1    3      2   0     0
##   Girl      2    0      0   4     0

table.2.2 <- ftable(gender, drink, age) # A human readable table.
table.2.2

##               age Old Young
## gender drink
## Boy    Coffee       1     1
##        Coke         2     3
##        Sprite       1     2
##        Tea          0     0
##        Water        0     0
## Girl   Coffee       2     2
##        Coke         0     0
##        Sprite       0     0
##        Tea          3     4
##        Water        1     0

If you want proportions instead of counts, you need to specify the denominator, i.e., the margins. Think: what is the margin in each of the following outputs?

prop.table(table1, margin = 1)

##       drink
## gender     Coffee       Coke     Sprite        Tea      Water
##   Boy  0.20000000 0.50000000 0.30000000 0.00000000 0.00000000
##   Girl 0.33333333 0.00000000 0.00000000 0.58333333 0.08333333

prop.table(table1, margin = 2)

##       drink
## gender    Coffee      Coke    Sprite       Tea     Water
##   Boy  0.3333333 1.0000000 1.0000000 0.0000000 0.0000000
##   Girl 0.6666667 0.0000000 0.0000000 1.0000000 1.0000000

11.3 SUMMARY OF CONTINUOUS DATA

Continuous variables admit many more operations than categorical variables. We can compute sums, means, quantiles, and more.

Summary of Univariate Continuous Data
We distinguish between several types of summaries, each capturing a different property of the data.

Summary of Location

The mean, or average, of a sample $x := (x_1, \dots, x_n)$, denoted $\bar{x}$, is defined as

$$\bar{x} := n^{-1} \sum_{i=1}^{n} x_i.$$

The sample mean is non-robust: a single large observation may inflate the mean indefinitely. For this reason, we define several other summaries of location that are more robust, i.e., less affected by "contaminations" of the data.

The α quantile of a sample x, denoted $x_{\alpha}$, is (non-uniquely) defined as a value above 100α% of the sample and below 100(1−α)% of it. We emphasize that sample quantiles are non-uniquely defined. See ?quantile for the 9(!) different definitions that R provides.

The α trimmed mean of a sample x, denoted $\bar{x}_{\alpha}$, is the average of the sample after removing the α proportion of largest and the α proportion of smallest observations.

The simple mean and the median are instances of the α trimmed mean: $\bar{x}_{0}$ and $\bar{x}_{0.5}$, respectively.

Summary of Scale

The scale of the data, sometimes known as spread, can be thought of as its variability.

The standard deviation of a sample x, denoted S(x), is defined as

$$S(x) := \sqrt{(n-1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}.$$

For reasons of robustness, we define other, more robust, measures of scale.

The Median Absolute Deviation from the median, denoted MAD(x), is defined as

$$MAD(x) := c \, |x - x_{0.5}|_{0.5},$$

that is, c times the median of the absolute deviations from the median, where c is some constant, typically set to c = 1.4826 so that MAD(x) and S(x) have the same large sample limit.

The Interquartile Range of a sample x, denoted IQR(x), is defined as

$$IQR(x) := x_{0.75} - x_{0.25}.$$
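The summaries of location and scale defined above can be sketched with Python's standard `statistics` module. This is a minimal illustration, not a reference implementation: the helper names `trimmed_mean`, `mad` and `iqr` and the two small samples are assumptions made for this example, and the quantile convention used (`method="inclusive"`) is only one of the several definitions mentioned above.

```python
import statistics

def trimmed_mean(x, alpha):
    """Average after dropping the alpha proportion of smallest and largest values."""
    xs = sorted(x)
    k = int(alpha * len(xs))              # observations removed from each end
    if 2 * k >= len(xs):                  # everything trimmed: use the median
        return statistics.median(xs)
    return statistics.fmean(xs[k:len(xs) - k])

def mad(x, c=1.4826):
    """c times the median of the absolute deviations from the median."""
    med = statistics.median(x)
    return c * statistics.median([abs(v - med) for v in x])

def iqr(x):
    """x_0.75 - x_0.25 under the 'inclusive' quantile convention (an assumption)."""
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q3 - q1

clean = [2.0, 3.0, 4.0, 5.0, 6.0]
dirty = [2.0, 3.0, 4.0, 5.0, 600.0]       # same sample with one contaminated value

for name, sample in (("clean", clean), ("dirty", dirty)):
    print(name,
          "mean:", statistics.fmean(sample),
          "median:", statistics.median(sample),
          "trimmed mean (0.2):", trimmed_mean(sample, 0.2),
          "sd:", round(statistics.stdev(sample), 3),
          "MAD:", mad(sample),
          "IQR:", iqr(sample))
```

Comparing the two printed rows shows the point about robustness: the mean and standard deviation change drastically with the single contaminated value, while the median, trimmed mean, MAD and IQR stay the same for this sample.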
Summary of Asymmetry

Summaries of asymmetry, also known as skewness, quantify the departure of x from a symmetric sample.

The Yule measure of asymmetry, denoted Yule(x), is defined as

$$Yule(x) := \frac{\frac{1}{2}(x_{0.75} + x_{0.25}) - x_{0.5}}{\frac{1}{2} IQR(x)}.$$

Summary of Bivariate Continuous Data

When dealing with bivariate, or multivariate, data, we can obviously compute univariate summaries for each variable separately. This is not the topic of this section, in which we want to summarize the association between the variables, and not within them.

The covariance between two samples, x and y, of the same length n, is defined as

$$Cov(x, y) := (n-1)^{-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$$

We emphasize that this is not the covariance you learned about in probability classes, since it is not the covariance between two random variables but rather between two samples. For this reason, some authors call it the empirical covariance, or sample covariance.

Pearson's correlation coefficient, a.k.a. Pearson's product moment correlation, or simply the correlation, denoted r(x, y), is defined as

$$r(x, y) := \frac{Cov(x, y)}{S(x) S(y)}.$$

If you find this definition enigmatic, just think of the correlation as the covariance between x and y after transforming each to the unitless scale of z-scores.

The z-scores of a sample x are defined as the mean-centered, scale-normalized observations:

$$z_i(x) := \frac{x_i - \bar{x}}{S(x)}.$$

We thus have that $r(x, y) = Cov(z(x), z(y))$.

11.4 SUMMARY

     • A summary analysis is simply a numeric reduction of a historical data set.
     • Summaries of categorical variables always start with counting the frequency of each category.
     • Continuous variables admit many more operations than categorical variables.
     • The sample mean is non-robust. A single large observation may inflate the mean indefinitely.
     • The scale of the data, sometimes known as spread, can be thought of as its variability.
     • Summaries of asymmetry, also known as skewness, quantify the departure of x from a symmetric sample.
     • The z-scores of a sample x are defined as the mean-centered, scale-normalized observations.

11.5 KEYWORDS

      • EDA: Exploratory Data Analysis
      • Categorical data: data that take on only a limited, and usually fixed, number of possible values
      • Univariate: data that contain only one attribute or characteristic
      • Continuous data: numeric values that can be meaningfully subdivided into finer and finer increments
      • Bivariate analysis: analysis that finds the relationship between two variables

11.6 LEARNING ACTIVITY

1. Mean and median cannot be used to summarize all kinds of data. Comment.

2. Suppose you want to measure how well two variables are related. Can Pearson's correlation be used? Comment.

11.7 UNIT END QUESTIONS

A. Descriptive Questions
Short Questions
1. What is the need for summary in EDA?
2. Discuss mean and median.
3. What is harmonic mean?
4. How is a summary of univariate continuous data performed?
5. Compare univariate and bivariate data.
Long Questions
1. Illustrate the concepts of summary in EDA.
2. Describe how a summary of continuous data is done.
3. Illustrate various strategies for summarizing categorical data.
4. Discuss the parameters used in summarizing categorical data.

B. Multiple Choice Questions
1. What is exploratory data analysis?
           a. A rigid framework by which we analyze data
           b. An initial way by which we can get a feel for data
           c. A type of purely quantitative method of data analysis
           d. A set of scientific principles for analyzing data in a categorical manner

2. Most often, EDA relies on _____.
           a. visual techniques
           b. assumptions
           c. fixed models
           d. testing for statistical significance

3. Which of the following is a principle of analytic graphics?
           a. Don't plot more than two variables at a time
           b. Only do what your tools allow you to do
           c. Show box plots (univariate summaries)
           d. Integrate multiple modes of evidence

4. What is the role of exploratory graphs in data analysis?
           a. They are typically made very quickly.
           b. They are made for formal presentations.
           c. Only a few are constructed.
           d. Axes, legends, and other details are clean and exactly detailed.
5. Which of the following is true about the base plotting system?
           a. The system is most useful for conditioning plots.
           b. Plots are created and annotated with separate functions.
           c. Plots are typically created with a single function call.
           d. Margins and spacings are adjusted automatically depending on the type of plot and the data.

Answers
1-b, 2-a, 3-d, 4-a, 5-b

11.8 REFERENCES

Text Books:
      • Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, Updated for Python 3, Shroff/O'Reilly Publishers, 2016.
      • Michael Urban, Joel Murach, Mike Murach, "Murach's Python Programming", Dec. 2016.

Reference Books:
      • Guido van Rossum and Fred L. Drake Jr., "An Introduction to Python", Revised and updated for Python 3.2.
      • Jake VanderPlas, "Python Data Science Handbook", O'Reilly Publishers, 2016.
UNIT - 12: STATISTICS 1

Structure
    12.0. Learning Objectives
    12.1. Introduction
    12.2. Mean
    12.3. Median
    12.4. Percentile
    12.5. Quartiles
    12.6. Outliers
    12.7. Box Plot
    12.8. Summary
    12.9. Keywords
    12.10. Learning Activity
    12.11. Unit End Questions
    12.12. References

12.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
      • Learn the basics of statistics
      • Use percentiles and quartiles for data visualization
      • Perform outlier detection for real-time scenarios

12.1 INTRODUCTION

Probability is the study of random events. Most people have an intuitive understanding of degrees of probability, which is why you can use words like "probably" and "unlikely" without special training, but we will talk about how to make quantitative claims about those degrees. Statistics is the discipline of using data samples to support claims about populations. Most statistical analysis is based on probability, which is why these pieces are usually presented together. Computation is a tool that is well suited to quantitative analysis, and computers are commonly used to process statistics. Computational experiments are also useful for exploring concepts in probability and statistics.
12.2 MEAN

The mean of a set of numbers, sometimes simply called the average, is the sum of the data divided by the number of data points. The most popular and widely used way of representing an entire data set by one value is what most laymen call an 'average' and what statisticians call the arithmetic mean. Its value is obtained by adding together all the items and dividing this total by the number of items. The arithmetic mean may be either:
        • Simple arithmetic mean, or
        • Weighted arithmetic mean

Merits and Limitations of Arithmetic Mean: The merits and demerits are as follows:

Merits: The arithmetic mean is the most widely used average in practice for the following reasons:
      • It is the simplest average to understand and the easiest to compute. Neither the arraying of data required for calculating the median nor the grouping of data required for calculating the mode is needed while calculating the mean.
      • It is affected by the value of every item in the series.
      • It is defined by a rigid mathematical formula, with the result that everyone who computes the average gets the same answer.
      • Being determined by a rigid formula, it lends itself to subsequent algebraic treatment better than the median or mode.
      • It is relatively reliable in the sense that it does not vary too much when repeated samples are taken from one and the same population, at least not as much as some other kinds of statistical descriptions. The mean is typical in the sense that it is the center of gravity, balancing the values on either side of it.
      • It is a calculated value, and not based on position in the series.

Limitations: Since the value of the mean depends upon each and every item of the series, extreme items, i.e., very small and very large items, unduly affect the value of the average.
For example, if in a tutorial group there are 4 students and their marks in a test are 60, 70, 10 and 80, the average marks would be

$$\frac{60 + 70 + 10 + 80}{4} = \frac{220}{4} = 55.$$

One single item, i.e., 10, has reduced the average marks considerably. The smaller the number of observations, the greater the impact of an extreme value is likely to be.

It is important to understand the following:
      • In a distribution with open-end classes, the value of the mean cannot be computed without making assumptions regarding the size of the class interval of the open-end classes. If such classes contain a large proportion of the values, then the mean may be subject to substantial error. However, the values of the median and mode can be computed where there are open-end classes without making any assumptions about the size of the class interval.
      • The arithmetic mean is not always a good measure of central tendency. The mean provides a "characteristic" value, in the sense of indicating where most of the values lie, only when the distribution of the variable is reasonably normal (bell-shaped). In the case of a U-shaped distribution the mean is not likely to serve a useful purpose.

12.3 MEDIAN

The median by definition refers to the middle value in a distribution. One-half of the items in the distribution have a value equal to or smaller than the median, and one-half have a value equal to or larger than the median. The median is just the 50th percentile, the value below which 50 per cent of the values in the sample fall. It splits the observations into two halves. As distinct from the arithmetic mean, which is calculated from the value of every item in the series, the median is what is called a positional average. The term 'position' refers to the place of a value in a series. The place of the median in a series is such that an equal number of items lie on either side of it.

Example:

If the incomes of five employees are Rs. 5,900, 6,950, 7,020, 7,200 and 8,280, the median would be Rs. 7,020.
5,900
6,950
7,020 « value at middle position of the array
7,200
8,280

For the above example the calculation of the median was simple because of the odd number of observations. When an even number of observations is listed, there is no single middle position value and the median is taken to be the arithmetic mean of the two middlemost items. For example, if in the above case we are given the incomes of six employees as 5,900, 6,950, 7,020, 7,200, 8,280, 9,300, the median income would be found as follows:

5,900
6,950
7,020
7,200
8,280
9,300

There are two middle position values, 7,020 and 7,200:

$$\text{Median} = \frac{7{,}020 + 7{,}200}{2} = \frac{14{,}220}{2} = \text{Rs. } 7{,}110$$

Hence, in the case of an even number of observations the median may be found by averaging the two middle position values. Thus, when N is odd the median is an actual value, with the remainder of the series in two equal parts on either side of it. If N is even the median is a derived figure, i.e., half the sum of the two middle values.

Calculation of Median-Individual Observations: The steps involved are:
        • Arrange the data in ascending or descending order of magnitude. (Both arrangements would give the same answer.)
        • In a group composed of an odd number of values, such as 7, add 1 to the total number of values and divide by 2. Thus, 7 + 1 would be 8, which divided by 2 gives 4; the 4th value, counted from either end of the numerically arranged group, will be the median value. In a large group the same method may be followed. In a group of 199 items the middle value would be the 100th value, determined by (199 + 1)/2.

In the form of a formula:

$$\text{Median} = \text{size of the } \left(\frac{N+1}{2}\right)\text{th item}$$

Example 1

From the following data of the wages of 7 workers, compute the median wage:

Wages (in Rs.): 4100, 4150, 6080, 7120, 5200, 6160, 7400

Solution:

CALCULATION OF MEDIAN

    S. No.    Wages arranged in ascending order (Rs.)
    1         4100
    2         4150
    3         5200
    4         6080
    5         6160
    6         7120
    7         7400

$$\text{Median} = \text{size of the } \left(\frac{N+1}{2}\right)\text{th item} = \text{size of the } \left(\frac{7+1}{2}\right)\text{th item} = \text{size of the 4th item}$$

Size of the 4th item = 6080. Hence the median wage = Rs. 6080.

We thus find that the median is the middlemost item: 3 persons get a wage less than Rs. 6080 and an equal number, i.e., 3, get more than Rs. 6080.

The procedure for determining the median of an even-numbered group of items is not as obvious as above. If there were, for instance, 10 different values in a group, the median is not really determinable, since both the 5th and 6th values are in the centre. In practice the median value for a group composed of an even number of items is estimated by finding the arithmetic mean of the two middle values, that is, adding the two values in the middle and dividing by two. Expressed in the form of a formula, it amounts to:

$$\text{Median} = \text{size of the } \left(\frac{N+1}{2}\right)\text{th item}$$

For N = 10 this gives the 5.5th item, i.e., the value halfway between the 5th and 6th items. Thus, we find that whether N is odd or even, 1 (one) has to be added to N to determine the position of the median.

12.4 PERCENTILE

A percentile is a comparison score between a particular score and the scores of the rest of a group. It shows the percentage of scores that a particular score surpassed. For example, if you score 75 points on a test and are ranked in the 85th percentile, it means that the score 75 is higher than 85% of the scores.

The percentile rank is calculated using the formula

$$R = \frac{P}{100}(N)$$

where P is the desired percentile and N is the number of data points.

Example 1:

If the scores of a set of students in a math test are 20, 30, 15 and 75, what is the percentile rank of the score 30?

Arrange the numbers in ascending order and assign ranks ranging from 1 for the lowest to 4 for the highest.
    Number    Rank
    15        1
    20        2
    30        3
    75        4

Use the formula:

$$3 = \frac{P}{100}(4) \quad\Rightarrow\quad 3 = \frac{P}{25} \quad\Rightarrow\quad P = 75$$

Therefore, the score 30 is at the 75th percentile.

Note that, if the percentile rank R is an integer, the Pth percentile is the score with rank R when the data points are arranged in ascending order.

If R is not an integer, then the Pth percentile is calculated as follows. Let I be the integer part and D the decimal part of R. Find the scores with the ranks I and I+1. Multiply the difference of these scores by the decimal part D. The Pth percentile is the sum of this product and the score with the rank I.

Example 2:

Determine the 35th percentile of the scores 7, 3, 12, 15, 14, 4 and 20.

Arrange the numbers in ascending order and assign ranks ranging from 1 for the lowest to 7 for the highest.

    Number    Rank
    3         1
    4         2
    7         3
    12        4
    14        5
    15        6
    20        7

Use the formula:

$$R = \frac{35}{100}(7) = 2.45$$

The integer part of R is 2, so find the scores corresponding to the ranks 2 and 3. They are 4 and 7. The product of their difference and the decimal part is 0.45(7−4) = 1.35.

Therefore, the 35th percentile is the score with rank 2 plus this product: 4 + 1.35 = 5.35.

12.5 QUARTILES

A quartile is a percentile measure that divides the total of 100% into four equal parts, with boundaries at 25%, 50%, 75% and 100%. A particular quartile is the border between two neighboring quarters of the distribution.
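Because quartiles are themselves percentiles (the 25th, 50th and 75th), the rank formula R = (P/100)·N from Section 12.4 covers both. The following is a minimal Python sketch of that rule, assuming the interpolation steps described in Section 12.4; the function names `percentile_rank` and `percentile` are hypothetical and chosen only for this illustration.

```python
import math

def percentile_rank(scores, score):
    """Percentile P of `score`, found by solving R = (P / 100) * N for P."""
    ordered = sorted(scores)
    rank = ordered.index(score) + 1       # 1-based rank (first match if tied)
    return 100 * rank / len(ordered)

def percentile(scores, p):
    """P-th percentile via the rank R = (P / 100) * N with linear interpolation."""
    ordered = sorted(scores)
    r = p / 100 * len(ordered)
    i = math.floor(r)                     # integer part I of the rank
    d = r - i                             # decimal part D of the rank
    if i < 1:                             # rank below the first score
        return ordered[0]
    if i >= len(ordered):                 # rank at or beyond the last score
        return ordered[-1]
    lower, upper = ordered[i - 1], ordered[i]   # scores with ranks I and I + 1
    return lower + d * (upper - lower)

# Example 1 of Section 12.4: rank 3 of 4 scores
print(percentile_rank([20, 30, 15, 75], 30))

# Example 2 of Section 12.4: R = 2.45, so the result is 4 + 0.45 * (7 - 4)
print(round(percentile([7, 3, 12, 15, 14, 4, 20], 35), 2))
```

Other conventions for locating a percentile exist (for example, using N + 1 instead of N in the rank), so results from statistical software may differ slightly from this sketch.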
Figure 12.1 Quartile

Q1 (quartile 1) separates the bottom 25% of the ranked data from the top 75%. (Data is ranked when it is arranged in order.) Q2 (quartile 2) is the median. Q3 (quartile 3) separates the top 25% of the ranked data from the bottom 75%. More precisely, at least 25% of the data will be less than or equal to Q1 and at least 75% will be greater than or equal to Q1. At least 75% of the data will be less than or equal to Q3, while at least 25% of the data will be greater than or equal to Q3.

Example 1:

Find the 1st quartile, median, and 3rd quartile of the following set of data.

24, 26, 29, 35, 48, 72, 150, 161, 181, 183, 183

There are 11 numbers in the data set, already arranged from least to greatest. The 6th number, 72, is the middle value. So, 72 is the median.

Once we remove 72, the lower half of the data set is

24, 26, 29, 35, 48

Here, the middle number is 29. So, Q1 = 29.

The top half of the data set is

150, 161, 181, 183, 183

Here, the middle number is 181. So, Q3 = 181.

The inter-quartile range or IQR is the distance between the first and third quartiles. It is sometimes called the H-spread and is a stable measure of dispersion. It is obtained by evaluating Q3−Q1.

12.6 OUTLIERS

An outlier is an observation that lies an abnormal distance from other values in a random sample from a population. In a sense, this definition leaves it up to the analyst (or a consensus process) to decide what will be considered abnormal. Before abnormal observations can be singled out, it is necessary to characterize normal observations.
Two activities are essential for characterizing a set of data:
        • Examination of the overall shape of the graphed data for important features, including symmetry and departures from assumptions.
        • Examination of the data for unusual observations that are far removed from the mass of data. These points are often referred to as outliers. Two graphical techniques for identifying outliers are scatter plots and box plots.

Example

The data set of N = 90 ordered observations shown below is examined for outliers:

30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322, 336, 346, 351, 370, 390,
404, 409, 411, 436, 437, 439, 441, 444, 448, 451, 453, 470, 480, 482, 487, 494, 495, 499,
503, 514, 521, 522, 527, 548, 550, 559, 560, 570, 572, 574, 578, 585, 592, 592, 607, 616,
618, 621, 629, 637, 638, 640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792,
794, 802, 818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068, 1441

The computations are as follows:
     Median = ((N+1)/2)th ordered point = 45.5th ordered point = the average of the 45th and 46th ordered points = (559 + 560)/2 = 559.5
     Lower quartile = .25(N+1)th ordered point = 22.75th ordered point = 411 + .75(436 - 411) = 429.75
     Upper quartile = .75(N+1)th ordered point = 68.25th ordered point = 739 + .25(752 - 739) = 742.25
     Interquartile range = 742.25 - 429.75 = 312.5
     Lower inner fence = 429.75 - 1.5(312.5) = -39.0
     Upper inner fence = 742.25 + 1.5(312.5) = 1211.0
     Lower outer fence = 429.75 - 3.0(312.5) = -507.75
     Upper outer fence = 742.25 + 3.0(312.5) = 1679.75

A point beyond an inner fence is considered a mild outlier, and a point beyond an outer fence is considered an extreme outlier. From an examination of the fence points and the data, one point (1441) exceeds the upper inner fence and stands out as a mild outlier; there are no extreme outliers.
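The fence computations above can be reproduced with a short Python sketch. It assumes the same quartile convention as the worked example, namely the p(N+1)th ordered point with linear interpolation; the names `ordered_point` and `fence_report` are hypothetical helpers introduced for this illustration.

```python
# The 90 ordered observations from the example above.
DATA = [
    30, 171, 184, 201, 212, 250, 265, 270, 272, 289, 305, 306, 322, 322, 336,
    346, 351, 370, 390, 404, 409, 411, 436, 437, 439, 441, 444, 448, 451, 453,
    470, 480, 482, 487, 494, 495, 499, 503, 514, 521, 522, 527, 548, 550, 559,
    560, 570, 572, 574, 578, 585, 592, 592, 607, 616, 618, 621, 629, 637, 638,
    640, 656, 668, 707, 709, 719, 737, 739, 752, 758, 766, 792, 792, 794, 802,
    818, 830, 832, 843, 858, 860, 869, 918, 925, 953, 991, 1000, 1005, 1068,
    1441,
]

def ordered_point(xs, position):
    """Value at a 1-based, possibly fractional, position in the sorted list xs.

    Assumes 1 <= position <= len(xs); fractional positions are linearly
    interpolated between the two neighbouring ordered points.
    """
    i = int(position)                     # integer part of the position
    frac = position - i                   # fractional part of the position
    if frac == 0:
        return float(xs[i - 1])
    return xs[i - 1] + frac * (xs[i] - xs[i - 1])

def fence_report(data):
    """Quartiles, fences and flagged points for a list of numbers."""
    xs = sorted(data)
    n = len(xs)
    q1 = ordered_point(xs, 0.25 * (n + 1))
    q3 = ordered_point(xs, 0.75 * (n + 1))
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    extreme = [v for v in xs if v < outer[0] or v > outer[1]]
    # Mild outliers lie beyond an inner fence but not beyond an outer fence.
    mild = [v for v in xs
            if (v < inner[0] or v > inner[1]) and outer[0] <= v <= outer[1]]
    return {
        "median": ordered_point(xs, 0.5 * (n + 1)),
        "q1": q1, "q3": q3, "iqr": iqr,
        "inner": inner, "outer": outer,
        "mild": mild, "extreme": extreme,
    }

report = fence_report(DATA)
for key, value in report.items():
    print(f"{key}: {value}")
```

Running the sketch on the 90 observations flags only 1441 as a mild outlier, which agrees with the hand computation. Library routines may use a different quartile convention and can therefore place the fences slightly differently.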
A histogram with an overlaid box plot is shown below.

Figure 12.2 Histogram overlaid Box Plot

The outlier is identified as the largest value in the data set, 1441, and appears as the circle to the right of the box plot.

Outliers should be investigated carefully. Often, they contain valuable information about the process under investigation or the data gathering and recording process. Before considering the possible elimination of these points from the data, one should try to understand why they appeared and whether it is likely that similar values will continue to appear. Of course, outliers are often bad data points.

12.7 BOX PLOT

In descriptive statistics, a box plot or boxplot (also known as a box and whisker plot) is a type of chart often used in exploratory data analysis. Box plots visually show the distribution of numerical data and its skewness by displaying the data quartiles (or percentiles) and averages.

Box plots show the five-number summary of a set of data: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.
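As a small illustration, the five-number summary can be computed for the data set of Example 1 in Section 12.5. This Python sketch assumes the median-of-halves convention used in that example (the median is left out of both halves when N is odd) and does not exclude outliers; the function name `five_number_summary` is hypothetical.

```python
import statistics

def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule.

    Assumes at least two observations. When the number of observations is odd,
    the median itself is left out of both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]                               # values below the median position
    upper = xs[half + 1:] if n % 2 else xs[half:]   # values above the median position
    return (
        xs[0],
        statistics.median(lower),
        statistics.median(xs),
        statistics.median(upper),
        xs[-1],
    )

scores = [24, 26, 29, 35, 48, 72, 150, 161, 181, 183, 183]
minimum, q1, median, q3, maximum = five_number_summary(scores)
print("min:", minimum, "Q1:", q1, "median:", median, "Q3:", q3, "max:", maximum)
print("IQR:", q3 - q1)
```

For these scores the sketch returns Q1 = 29, median = 72 and Q3 = 181, the same values found by hand in Section 12.5. Plotting libraries often use an interpolating quartile rule instead, so the box edges they draw may differ slightly.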
Figure 12.3 Box Plot

Minimum Score
The lowest score, excluding outliers (shown at the end of the left whisker).

Lower Quartile
Twenty-five percent of scores fall below the lower quartile value (also known as the first quartile).

Median
The median marks the mid-point of the data and is shown by the line that divides the box into two parts (sometimes known as the second quartile). Half the scores are greater than or equal to this value and half are less.

Upper Quartile
Seventy-five percent of the scores fall below the upper quartile value (also known as the third quartile). Thus, 25% of the data are above this value.

Maximum Score
The highest score, excluding outliers (shown at the end of the right whisker).

Whiskers
The upper and lower whiskers represent scores outside the middle 50% (i.e., the lower 25% of scores and the upper 25% of scores).

The Interquartile Range (or IQR)
This is the box plot showing the middle 50% of scores (i.e., the range between the 25th and  75th percentile).  Why are box plots useful?  Box plots divide the data into sections that each contain approximately 25% of the data in  that set.                                          Figure12.4: Sample Box Plot  Box plots are useful as they provide a visual summary of the data enabling researchers to  quickly identify mean values, the dispersion of the data set, and signs of skewness.  Note the image above represents data which is a perfect normal distribution, and most box  plots will not conform to this symmetry (where each quartile is the same length).  Box plots are useful as they show the average score of a data set.  The median is the average value from a set of data and is shown by the line that divides the  box into two parts. Half the scores are greater than or equal to this value and half are less.  Box plots are useful as they show the skewness of a data set  The box plot shape will show if a statistical data set is normally distributed or skewed.                                          273    CU IDOL SELF LEARNING MATERIAL (SLM)
Figure 12.5: Distribution in Box Plot

When the median is in the middle of the box, and the whiskers are about the same length on both sides of the box, the distribution is symmetric.

When the median is closer to the bottom of the box, and the whisker is shorter on the lower end of the box, the distribution is positively skewed (skewed right).

When the median is closer to the top of the box, and the whisker is shorter on the upper end of the box, the distribution is negatively skewed (skewed left).

Box plots are useful as they show the dispersion of a data set.

In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. The smallest value and largest value are found at the ends of the ‘whiskers’ and provide a visual indicator of the spread of scores (e.g., the range).

Figure 12.6: IQR in Box Plot
The interquartile range (IQR) is the length of the box, which spans the middle 50% of scores. It is calculated by subtracting the lower quartile from the upper quartile (Q3 − Q1).

Box plots are useful as they show outliers within a data set.

An outlier is an observation that is numerically distant from the rest of the data.

When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.

How to compare box plots

Box plots are a useful way to visualize differences among samples or groups. They provide a lot of statistical information, including medians, ranges, and outliers.

Note that although box plots have been presented horizontally in this unit, it is more common to see them drawn vertically in research papers.

Step 1: Compare the medians of box plots

Compare the respective medians of each box plot. If the median line of one box plot lies outside the box of the box plot it is compared with, then there is likely to be a difference between the two groups.

Figure 12.7: Comparing Box Plot
Step 2: Compare the interquartile ranges and whiskers of box plots

Compare the interquartile ranges (that is, the box lengths) to examine how the data is dispersed in each sample. The longer the box, the more dispersed the data. The shorter the box, the less dispersed the data.

Figure 12.8: Comparing interquartile range

Next, look at the overall spread as shown by the extreme values at the ends of the two whiskers. This shows the range of scores (another type of dispersion). Larger ranges indicate a wider distribution, that is, more scattered data.

Step 3: Look for potential outliers (see the figure above)

When reviewing a box plot, an outlier is defined as a data point that is located outside the whiskers of the box plot.

Step 4: Look for signs of skewness

If the data do not appear to be symmetric, does each sample show the same kind of asymmetry?
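The quantities that a box plot displays can also be computed directly. The following is a minimal sketch rather than a definitive method: the function name `box_plot_summary` and the scores are invented for illustration, the quartiles use the "inclusive" method of Python's `statistics.quantiles` (other quartile conventions give slightly different values), and the rule that whiskers stop at 1.5 × IQR from the box is a common convention assumed here, not one stated in this unit.

```python
import statistics


def box_plot_summary(scores):
    """Return the quantities a box plot displays for a list of numbers."""
    ordered = sorted(scores)
    # The three quartiles split the ordered data into four parts.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    iqr = q3 - q1  # length of the box
    # Assumed convention: points more than 1.5 * IQR from the box are outliers.
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    inside = [x for x in ordered if low_fence <= x <= high_fence]
    outliers = [x for x in ordered if x < low_fence or x > high_fence]
    return {
        "minimum": inside[0],   # end of the lower whisker
        "q1": q1,
        "median": median,
        "q3": q3,
        "maximum": inside[-1],  # end of the upper whisker
        "iqr": iqr,
        "outliers": outliers,
    }


# Hypothetical scores, invented for illustration only.
scores = [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]
summary = box_plot_summary(scores)
for name, value in summary.items():
    print(name, value)
```

With these scores the box runs from 57.25 to 63.75, so the IQR is 6.5, and the value 120 falls outside the upper whisker and is reported as an outlier. The same function can be applied to two samples to compare their medians and box lengths as in Steps 1 and 2.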
Figure 12.9: Symmetry in Box Plot

12.8 SUMMARY

• The mean is the sum of the data values divided by the number of values.
• The median is the middle value in a distribution: one-half of the items have a value equal to the median or smaller, and one-half have a value equal to the median or larger.
• A percentile compares a particular score with the scores of the rest of a group. It shows the percentage of scores that a particular score surpassed.
• A quartile is a percentile measure that divides the total of 100% into four equal parts: 25%, 50%, 75% and 100%. A particular quartile is the border between two neighboring quarters of the distribution.
• An outlier is an observation that lies an abnormal distance from other values in a random sample from a population.
• A box plot or boxplot (also known as a box and whisker plot) is a type of chart often used in exploratory data analysis. Box plots visually show the distribution and skewness of numerical data by displaying the data quartiles (or percentiles) and the median.
• Box plots show the five-number summary of a set of data: the minimum score, first (lower) quartile, median, third (upper) quartile, and maximum score.
12.9 KEYWORDS

• Mean - average of the numbers
• Median - middle number in a sorted list
• Percentile - how a score compares to other scores in the same set
• Quartile - divides the number of data points into four parts
• Outlier - data points that are far from other data points
• Whisker Plot - displays the data distribution through its quartiles

12.10 LEARNING ACTIVITY

1. Find the outliers for the following data set: 3, 10, 14, 19, 22, 29, 32, 36, 49, 70

2. From the following data, find the value of the median:

    Income (Rs.)    No. of persons
    ------------    --------------
    4,000           24
    4,500           26
    5,800           16
    5,060           20
    6,600            6
    5,380           30

12.11 UNIT END QUESTIONS

A. Descriptive Questions
Short Questions
1. Define mean and median.
2. Compare percentile and quartile.
3. What is an outlier?
4. How do you construct a whisker plot?
5. List the categories of data represented in a box plot.

Long Questions
1. Illustrate the role of mean and median in data analysis.
2. How is visualization done using percentiles and quartiles?
3. Describe the concept of the whisker plot.
4. Discuss the role of outlier detection.
5. Outliers sometimes contain important information. Comment.

B. Multiple Choice Questions

1. Any measure indicating the centre of a set of data, arranged in an increasing or decreasing order of magnitude, is called a measure of:
    a. Skewness
    b. Symmetry
    c. Central tendency
    d. Dispersion

2. Scores that differ greatly from the measures of central tendency are called:
    a. Raw scores
    b. The best scores
    c. Extreme scores
    d. Z-scores

3. The measure of central tendency listed below is:
    a. The raw score
    b. The mean
    c. The range
    d. Standard deviation

4. The total of all the observations divided by the number of observations is called:
    a. Arithmetic mean
    b. Geometric mean
    c. Median
    d. Harmonic mean

5. While computing the arithmetic mean of a frequency distribution, each value of a class is considered equal to:
    a. Class mark
    b. Lower limit
    c. Upper limit
    d. Lower class boundary

6. Change of origin and scale is used for calculation of the:
    a. Arithmetic mean
    b. Geometric mean
    c. Weighted mean
    d. Lower and upper quartiles

7. The arithmetic mean is highly affected by:
    a. Moderate values
    b. Extremely small values
    c. Odd values
    d. Extremely large values

8. Which of the following statements is always true?
    a. The mean has an effect on extreme scores
    b. The median has an effect on extreme scores
    c. Extreme scores have an effect on the mean
    d. Extreme scores have an effect on the median
9. The midpoint of the values after they have been ordered from the smallest to the largest or the largest to the smallest is called:
    a. Mean
    b. Median
    c. Lower quartile
    d. Upper quartile

10. If a set of data has one mode and its value is less than the mean, then the distribution is called:
    a. Positively skewed
    b. Negatively skewed
    c. Symmetrical
    d. Normal

Answers
1-c, 2-c, 3-b, 4-a, 5-a, 6-a, 7-d, 8-c, 9-b, 10-a

12.12 REFERENCES

Text Books:
• Allen B. Downey, “Think Python: How to Think Like a Computer Scientist”, 2nd edition, Updated for Python 3, Shroff/O'Reilly Publishers, 2016.
• Michael Urban, Joel Murach, Mike Murach, “Murach's Python Programming”, Dec. 2016.

Reference Books:
• Guido van Rossum and Fred L. Drake Jr, “An Introduction to Python – Revised and updated for Python 3.2”.
• Jake VanderPlas, “Python Data Science Handbook”, O'Reilly Publishers, 2016.
UNIT - 13: STATISTICS II

Structure
13.0. Learning Objectives
13.1. Introduction to Variance
13.2. Standard Deviation
13.3. Covariance
13.4. Scatter Plot
13.5. Pearson Correlation Coefficient
13.6. Summary
13.7. Keywords
13.8. Learning Activity
13.9. Unit End Questions
13.10. References

13.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Compare variance and covariance
• Describe standard deviation
• Analyze positive and negative correlation among the data

13.1 INTRODUCTION TO VARIANCE

Variance is a measure of statistical dispersion, that is, the variation among the different samples in a data set. It is the average of the squared differences from the mean.

Variance is a numerical value that shows how widely the individual figures in a set of data are spread about the mean, and hence describes how far each value in the data set lies from the mean value.

Variance is the square of the standard deviation. If you do not know the standard deviation, you can use the following procedure to determine the variance.

Procedure for Finding the Variance:

1. Find the mean of the scores (x̄).
2. Subtract the mean from each individual score (x − x̄).
3. Square each of the differences obtained above: (x − x̄)².
4. Add all of the squares obtained in step 3: Σ(x − x̄)².
5. Divide the total from step 4 by the number (n − 1), where n is the total number of scores used. The result is the sample variance.

13.2 STANDARD DEVIATION

Standard deviation is a measure of how spread out your data is. It is a statistic that tells you how closely all of the examples are gathered around the mean (average) in a data set. The narrower and steeper the bell curve, the smaller the standard deviation. If the examples are spread far apart, the bell curve will be much flatter, meaning the standard deviation is large. In many business settings a smaller standard deviation is preferred, because it indicates more consistent results.

Procedure for Finding the Standard Deviation:

1. Find the mean of the scores (x̄).
2. Subtract the mean from each individual score (x − x̄).
3. Square each of the differences obtained above: (x − x̄)².
4. Add all of the squares obtained in step 3: Σ(x − x̄)².
5. Divide the total from step 4 by the number (n − 1), where n is the total number of scores used.
6. Find the square root of the result of step 5.

Be careful not to round the mean too much, as the resulting standard deviation can be in error. Try not to round any intermediate results. Round only at the end.

13.3 COVARIANCE

Covariance is a measure of how much two random variables vary together. It is similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
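The step-by-step procedures for the variance (section 13.1) and the standard deviation (section 13.2) can be sketched in Python as follows. This is an illustrative sketch only: the function names and the scores are hypothetical, and the result is cross-checked against the standard library's `statistics.variance` and `statistics.stdev`, which also divide by n − 1.

```python
import math
import statistics


def sample_variance(scores):
    """Sample variance, following steps 1 to 5 of the procedure."""
    n = len(scores)
    mean = sum(scores) / n                    # step 1: mean of the scores
    deviations = [x - mean for x in scores]   # step 2: subtract the mean
    squares = [d ** 2 for d in deviations]    # step 3: square each difference
    total = sum(squares)                      # step 4: add the squares
    return total / (n - 1)                    # step 5: divide by n - 1


def sample_std_dev(scores):
    """Sample standard deviation: step 6 is the square root of the variance."""
    return math.sqrt(sample_variance(scores))


# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
variance = sample_variance(scores)
std_dev = sample_std_dev(scores)
print(variance, std_dev)
# Cross-check against the standard library.
print(statistics.variance(scores), statistics.stdev(scores))
```

For these scores the mean is 5 and the squared differences add up to 32, so the variance is 32 / 7. No intermediate value is rounded, in line with the advice to round only at the end.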
Figure 13.1 Covariance

The Covariance Formula

The formula for the sample covariance is:

    Cov(X, Y) = Σ (X − μ)(Y − ν) / (n − 1)

where:
    X and Y are random variables, and the sum runs over all paired observations
    E(X) = μ is the expected value (the mean) of the random variable X
    E(Y) = ν is the expected value (the mean) of the random variable Y
    n = the number of items in the data set

Example

Calculate the covariance for the following data set:
x: 2.1, 2.5, 3.6, 4.0 (mean = 3.05, rounded to 3.1 in the working below)
y: 8, 10, 12, 14 (mean = 11)

Substitute the values into the formula and solve:

    Cov(X, Y) = Σ (X − μ)(Y − ν) / (n − 1)
              = [(2.1 − 3.1)(8 − 11) + (2.5 − 3.1)(10 − 11) + (3.6 − 3.1)(12 − 11) + (4.0 − 3.1)(14 − 11)] / (4 − 1)
              = [(−1)(−3) + (−0.6)(−1) + (0.5)(1) + (0.9)(3)] / 3
              = (3 + 0.6 + 0.5 + 2.7) / 3
              = 6.8 / 3
              = 2.267

Rounding the mean of x does not change the total here, because the deviations of y from its mean add up to zero. The result is positive, meaning that the variables are positively related.
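The worked example above can be reproduced with a few lines of Python. This is a minimal sketch, and the function name `sample_covariance` is hypothetical. It uses the unrounded means and divides by n − 1, as in the formula.

```python
def sample_covariance(xs, ys):
    """Sample covariance of two equally long lists of numbers."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the paired deviations from the two means.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Data from the worked example.
x = [2.1, 2.5, 3.6, 4.0]
y = [8, 10, 12, 14]
cov_xy = sample_covariance(x, y)
print(round(cov_xy, 3))  # 2.267
```

Swapping the two arguments gives the same value, since covariance is symmetric in X and Y.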
13.4 SCATTER PLOT

A scatter plot can be used for data in the form of ordered pairs of numbers. The result will be a bunch of points "scattered" around the plane.

If the general tendency is for the points to rise from the left to the right of the graph, then we say there is a positive correlation between the two variables measured. If the points tend to fall from the left to the right of the graph, we say there is a negative correlation. If there is no general tendency, then there is no correlation.

If the tendency is not very pronounced, that is, the points are scattered widely, then we say the variables are weakly correlated. If the correlation is more pronounced, we say the variables are strongly correlated.

Figure 13.2 Weak Positive Correlations
Figure 13.3 Strong Negative Correlations

Figure 13.4 No Correlations

Examples:

If you graphed a person's height on one axis and their weight on the other, you would probably get a strong positive correlation (because taller people generally weigh more).

If you graphed a man's age and the number of hairs on his head, you would probably get a weak negative correlation (because some men have a tendency for baldness as they get older).
If you graphed a woman's shoe size and the length of her hair, you would probably get no correlation. (These variables are unrelated.)

13.5 PEARSON CORRELATION COEFFICIENT

Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson Correlation. The full name is the Pearson Product Moment Correlation (PPMC). In layman's terms, it is a number between +1 and −1 that represents how strongly two variables are associated. Put more simply, it measures the strength of the linear association between two variables. In effect, the PPMC draws a line of best fit through the data of the two variables, and the Pearson correlation coefficient “r” indicates how far the data points lie from that line.

The value of “r” ranges from +1 to −1, where:

r = +1 or r = −1 means that all the data points lie exactly on the line of best fit, i.e., no data point shows any variation from the line.

Figure 13.5 Data points with r=1

• Hence, the stronger the association between the two variables, the closer r will be to +1 or −1.
• r = 0 means that there is no linear correlation between the two variables.
• Values of r between +1 and −1 indicate that there is variation of the data around the line.
• The closer the value of r is to 0, the greater the variation of the data points around the line of best fit.

Formula of Pearson Correlation coefficient:

    r = [n Σ(X*Y) − Σ(X) Σ(Y)] / √( [n Σ(X²) − (Σ(X))²] × [n Σ(Y²) − (Σ(Y))²] )

Example
Find the value of the correlation coefficient from the following table:

Age and Glucose levels of 6 subjects

We’ll calculate the value of r using the formula mentioned above. To use that formula, we need to compute Σ(X*Y), Σ(X), Σ(Y), Σ(X²), Σ(Y²).

The table below shows the computed values of all the summations mentioned above.
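Once those five summations are known, r follows from simple arithmetic. The sketch below is illustrative only and its function names are hypothetical. `pearson_r_from_sums` takes the totals directly, here the totals that this example reports for its six subjects, and `pearson_r` builds the same totals from two lists of raw values.

```python
import math


def pearson_r_from_sums(n, sum_x, sum_y, sum_xy, sum_x2, sum_y2):
    """Pearson correlation coefficient computed from summary totals."""
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from paired raw values."""
    return pearson_r_from_sums(
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * y for x, y in zip(xs, ys)),
        sum(x * x for x in xs),
        sum(y * y for y in ys),
    )


# Totals reported for the age and glucose example (n = 6 subjects).
r = pearson_r_from_sums(
    n=6, sum_x=247, sum_y=486, sum_xy=20485, sum_x2=11409, sum_y2=40022
)
print(round(r, 4))  # 0.5298
```

As a quick sanity check, points that lie exactly on a rising straight line, such as (1, 2), (2, 4), (3, 6), give r = 1, and points on a falling line give r = −1.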
From our table we get:

    Σ(X) = 247
    Σ(Y) = 486
    Σ(X*Y) = 20,485
    Σ(X²) = 11,409
    Σ(Y²) = 40,022
    n is the sample size, in our case = 6

    r = [6(20,485) − (247 × 486)] / √( [6(11,409) − 247²] × [6(40,022) − 486²] )
    r = 2,868 / √(7,445 × 3,936)
    r = 0.5298

The range of the correlation coefficient is from −1 to +1. Our result is 0.5298, which means the variables have a moderate positive correlation.

Problems with Pearson correlation

The PPMC is not able to tell the difference between dependent variables and independent variables. For example, if you are trying to find the correlation between a high calorie diet and diabetes, you might find a high correlation of 0.8. However, you would get the same result with the variables switched around. In other words, you could say that diabetes causes a high calorie diet. That obviously makes no sense. Therefore, as a researcher you have to be aware of the data you are plugging in. In addition, the PPMC will not give you any information about the slope of the line; it only tells you whether there is a relationship.

13.6 SUMMARY

• Variance is a measure of statistical dispersion, that is, the variation among the different samples in a data set. It is the average of the squared differences from the mean.
• Standard deviation is a measure of how spread out your data is. It is a statistic that tells you how closely all of the examples are gathered around the mean (average) in a data set. The steeper the bell curve, the smaller the standard deviation.
• Covariance is a measure of how much two random variables vary together. It is similar to variance, but where variance tells you how a single variable varies, covariance tells you how two variables vary together.
• A scatter plot can be used for data in the form of ordered pairs of numbers. The result will be a bunch of points "scattered" around the plane.
• Correlation between sets of data is a measure of how well they are related. The most common measure of correlation in statistics is the Pearson Correlation.
• Pearson's Correlation Coefficient is a linear correlation coefficient that returns a value between −1 and +1.

13.7 KEYWORDS

• Variance - measure of variability
• SD - Standard Deviation
• Covariance - measure of the directional relationship between two random variables
• Scatter Plot - used to observe and visually display the relationship between variables
• Pearson Correlation - correlation between sets of data

13.8 LEARNING ACTIVITY

1. Find the standard deviation of 4, 9, 11, 12, 17, 5, 8, 12, 14
2. Calculate the covariance of the daily returns of two stocks using their closing prices.

13.9 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. Define variance.
2. Compare variance and covariance.
3. What is the use of a scatter plot?
4. Differentiate between a box plot and a scatter plot.
5. List the disadvantages of the Pearson correlation coefficient.

Long Questions
1. Illustrate the role of variance and standard deviation in data analysis.
2. How is visualization done using a scatter plot?
3. Compare the performance of the box plot and the scatter plot.
4. Discuss the role of the Pearson correlation coefficient.
5. Describe positive and negative correlation.

B. Multiple Choice Questions

1. The mean and variance of a Poisson distribution are the same.
    a. True
    b. False

2. What are the mean and variance of the standard normal distribution?
    a. Mean is 0 and variance is 1
    b. Mean is 1 and variance is 0
    c. Mean is 0 and variance is ∞
    d. Mean is ∞ and variance is 0

3. Variance of a random variable X is given by _________
    a. E(X)
    b. E(X²)
    c. E(X²) − (E(X))²
    d. (E(X))²

4. Mean of a constant ‘a’ is ___________
    a. 0
    b. a
    c. a/2
    d. 1

5. Variance of a constant ‘a’ is _________
    a. 0
    b. a
    c. a/2
    d. 1

6. The covariance is:
    a. A measure of the strength of relationship between two variables.
    b. Dependent on the units of measurement of the variables.
    c. An unstandardized version of the correlation coefficient.
    d. All the above.
7. If Pearson’s correlation coefficient between stress level and workload is 0.8, how much variance in stress level is not accounted for by workload?
    a. 20%
    b. 2%
    c. 8%
    d. 36%

8. How much variance has been explained by a correlation of 0.9?
    a. 18%
    b. 9%
    c. 81%
    d. None of these

9. Correlation analysis is a _____________
    a. Univariate analysis
    b. Bivariate analysis
    c. Multivariate analysis
    d. Both b and c

10. When the amount of change in one variable leads to a constant ratio of change in the other variable, then correlation is said to be _____________
    a. Linear
    b. Non-linear
    c. Positive
    d. Negative

Answers
1-a, 2-a, 3-c, 4-b, 5-a, 6-d, 7-d, 8-c, 9-d, 10-a

13.10 REFERENCES

Text Books:
• Allen B. Downey, “Think Python: How to Think Like a Computer Scientist”, 2nd edition, Updated for Python 3, Shroff/O'Reilly Publishers, 2016.
• Michael Urban, Joel Murach, Mike Murach, “Murach's Python Programming”, Dec. 2016.

Reference Books:
• Guido van Rossum and Fred L. Drake Jr, “An Introduction to Python – Revised and updated for Python 3.2”.
• Jake VanderPlas, “Python Data Science Handbook”, O'Reilly Publishers, 2016.
UNIT - 14: WEB APPLICATION 1

Structure
14.0. Learning Objectives
14.1. Introduction
14.2. Virtual Environment
14.3. Creating a Django Project
14.4. Summary
14.5. Keywords
14.6. Learning Activity
14.7. Unit End Questions
14.8. References

14.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Identify the benefits of Django
• Design a web application using Django
• Create a project using Django

14.1 INTRODUCTION

Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design. Built by experienced developers, it takes care of much of the hassle of Web development, so you can focus on writing your app without needing to reinvent the wheel. Django is a widely used free, open-source, and high-level web development framework. It provides a lot of features to developers "out of the box," so development can be rapid. At the same time, websites built with it are secure, scalable, and maintainable.

Required Setup

1. Git Bash (or any Unix-style terminal): all the Django-related commands and Unix commands in this unit are run through it.
2. Text Editor: any text editor, such as Sublime Text or Visual Studio Code, can be used. For the following project, Sublime Text is used.
3. Python 3: the latest version of Python can be downloaded from the official Python website.

14.2 VIRTUAL ENVIRONMENT

A virtual environment holds the dependencies of a Python project. It works as a self-contained, isolated environment where all the Python packages, in the versions required by a specific project, are installed. Since newer versions of Python, Django, and other packages keep rolling out, a virtual environment lets you keep working with the older versions that are specific to your project. In short, you can start one independent project that uses Django 2.0 and another independent project that uses Django 3.0 on the same computer.

Creating a Virtual Environment

To work with Django, we’ll first set up a virtual environment to work in. A virtual environment is a place on your system where you can install packages and isolate them from all other Python packages. Separating one project’s libraries from other projects is beneficial, and will be necessary when the Learning Log project is later deployed to a server. Create a new directory for your project called learning_log, switch to that directory in a terminal, and create a virtual environment. If you’re using Python 3, you should be able to create a virtual environment with the following command:

    learning_log$ python -m venv ll_env
    learning_log$

Here we’re running the venv module and using it to create a virtual environment named ll_env. If this works, move on to “Activating the Virtual Environment”.

Installing virtualenv

If you’re using an earlier version of Python, or if your system isn’t set up to use the venv module correctly, you can install the virtualenv package. To install virtualenv, enter the following:

    $ pip install --user virtualenv

If you’re using Linux and this still doesn’t work, you can install virtualenv through your system’s package manager. On Ubuntu, for example, the command sudo apt-get install python-virtualenv will install virtualenv (on recent releases the package is named python3-virtualenv).
Change to the learning_log directory in a terminal, and create a virtual environment like this:

    learning_log$ virtualenv ll_env
    New python executable in ll_env/bin/python
    Installing setuptools, pip...done.
    learning_log$

If you have more than one version of Python installed on your system, you should specify the version for virtualenv to use. For example, the command virtualenv ll_env --python=python3 will create a virtual environment that uses Python 3.

Activating the Virtual Environment

Now that we have a virtual environment set up, we need to activate it with the following command:

    learning_log$ source ll_env/bin/activate
    (ll_env)learning_log$

This command runs the script activate in ll_env/bin. When the environment is active, you’ll see the name of the environment in parentheses; then you can install packages to the environment and use packages that have already been installed. Packages you install in ll_env will be available only while the environment is active.

To stop using a virtual environment, enter deactivate:

    (ll_env)learning_log$ deactivate
    learning_log$

Installing Django

Once you’ve created your virtual environment and activated it, install Django:

    (ll_env)learning_log$ pip install Django
    Installing collected packages: Django
    Successfully installed Django
    Cleaning up...
    (ll_env)learning_log$

Because we’re working in a virtual environment, this command is the same on all systems. There’s no need to use the --user flag, and there’s no need to use longer commands like python -m pip install package_name. Keep in mind that Django will be available only when the environment is active.
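The environment-creation step from section 14.2 can also be scripted with Python's standard venv module, which is the same module that python -m venv runs. The following is a minimal sketch under stated assumptions: the helper name `create_virtual_environment` is hypothetical, the project directory is a throwaway temporary folder standing in for learning_log, and pip is deliberately left out of the new environment so the example runs quickly and offline (the command-line form installs pip by default).

```python
import os
import tempfile
import venv


def create_virtual_environment(project_dir, env_name="ll_env"):
    """Create a virtual environment inside project_dir and return its path."""
    env_dir = os.path.join(project_dir, env_name)
    # with_pip=False skips installing pip; this is a choice made for the sketch.
    venv.create(env_dir, with_pip=False)
    return env_dir


# A temporary directory stands in for the learning_log project folder.
with tempfile.TemporaryDirectory() as project_dir:
    env_dir = create_virtual_environment(project_dir)
    # Every environment made by venv contains a pyvenv.cfg file at its root.
    created = os.path.isfile(os.path.join(env_dir, "pyvenv.cfg"))
    print(created)
```

Activating the environment and installing Django are still done from the terminal as described above, because activation changes the state of the current shell session.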
14.3 CREATING A DJANGO PROJECT        1. The first step is creating your project by using the 'django-admin           startproject project_name' command, where 'project_name' is 'django_blog' in your           case. Also, it will generate a lot of files inside our newly created project, which you           can research further in Django documentation if needed.        2. Change the directory to the newly created project using 'cd' command and to view the           created file using 'ls' command.    3. You can run your project by using 'python manage.py runserver'.    4. The project can be viewed in your favorite browser (Google Chrome, Mozilla Firefox,      etc.).You can come into your browser and type 'localhost:8000' or '127.0.0.1:8000'.    14.4 SUMMARY       • Django is a high-level Python Web framework that encourages rapid development and          clean, pragmatic design       • Django is a widely used free, open-source, and high-level web development                                                                        298    CU IDOL SELF LEARNING MATERIAL (SLM)
framework. It provides a lot of features to the developers \"out of the box,\" so          development can be rapid.       • Virtual Environment acts as dependencies to the Python-related projects. It works as a          self-contained container or an isolated environment where all the Python-related          packages and the required versions related to a specific project are installed       • A virtual environment is a place on your system where you can install packages and          isolate them from all other Python packages.       • Before installing Django, it's recommended to install Virtualenv that creates new          isolated environments to isolates your Python files on a per-project basis. This will          ensure that any changes made to your website won't affect other websites you're          developing       • Activating a virtual environment will put the virtual environment-specific python and          pip executables into your shell's PATH    14.5 KEYWORDS        • Django-web framework that enables rapid development of secure and maintainable           websites        • Web Application-uses a web browser to perform a particular function      • Virtual Environment-to execute an application      • Django Project-entire application and all its parts    14.6 LEARNING ACTIVITY    1. Build a couple of empty projects and look at what it creates. Make a new folder with a      simple name, like InstaBook or FaceGram (outside of your learning_log directory),      navigate to that folder in a terminal, and create a virtual environment        Install Django, and run the command django-admin.py startproject instabook .        (make sure you include the dot at the end of the command).    2. Look at the files and folders this command creates, and compare them to Learning Log.                                          299    CU IDOL SELF LEARNING MATERIAL (SLM)
Do this a few times until you're familiar with what Django creates when starting a new project. Then delete the project directories if you wish.

14.7 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is the difference between a web application and a desktop application?
2. Define client-server architecture.
3. What is the need for a virtual environment?
4. List the benefits of Django.
5. Is it necessary to activate a virtual environment?

Long Questions
1. Illustrate the process of creating a web application with Django.
2. Describe the benefits of using Django for application creation.
3. Discuss how a virtual environment is used with Django.
4. Illustrate the steps in creating a Django project.
5. Describe how to activate and deactivate virtual environments.

B. Multiple Choice Questions

1. What is a Django App?
   a. A Django app is an extended package whose base package is Django.
   b. A Django app is a Python package with its own components.
   c. Both a and b
   d. All of these

2. What are Migrations in Django?
   a. They are files saved in the migrations directory.
   b. They are created when you run the makemigrations command.
   c. Migrations are files where Django stores changes to your models.
   d. All of these
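
Worked sketch for Sections 14.3 and 14.6: the virtual-environment and project-creation steps described in this unit can be tried out from Python itself. This is a minimal sketch under stated assumptions, not the unit's prescribed method. It uses the standard-library venv module in place of Virtualenv; the helper names create_env, bin_dir and next_steps and the folder name 'env' are hypothetical and chosen only for illustration. The Django commands are printed rather than executed, because running them requires Django to be installed in the environment.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_env(project_dir):
    """Create an isolated environment in <project_dir>/env and return its path."""
    env_dir = Path(project_dir) / "env"  # folder name is an assumption
    # with_pip=False keeps this sketch fast and offline. In practice use
    # with_pip=True so that 'pip install django' works inside the environment.
    venv.create(env_dir, with_pip=False)
    return env_dir


def bin_dir(env_dir):
    """Return the directory that activation places on the shell's PATH."""
    return Path(env_dir) / ("Scripts" if os.name == "nt" else "bin")


def next_steps(project_name):
    """Commands to run by hand once the environment is activated."""
    return [
        "pip install django",
        f"django-admin startproject {project_name} .",
        "python manage.py runserver",
    ]


if __name__ == "__main__":
    # A temporary folder stands in for the InstaBook folder of the activity.
    with tempfile.TemporaryDirectory() as tmp:
        env = create_env(tmp)
        print("Environment executables live in:", bin_dir(env))
        for command in next_steps("instabook"):
            print("$", command)
```

Keeping the environment inside the project folder makes it easy to delete the folder and repeat the activity, as Section 14.6 suggests.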