Figure 8.15: Sample Box Plot

A box plot can be colorized by passing the color keyword. You can pass a dict whose keys are boxes, whiskers, medians and caps. If some keys are missing from the dict, default colors are used for the corresponding artists. boxplot also has a sym keyword to specify the flier style.

When you pass any other type of argument via the color keyword, it is passed directly to matplotlib to color all the boxes, whiskers, medians and caps.

The colors are applied to every box that is drawn. If you want more complicated colorization, you can get each drawn artist by passing return_type.

In [39]: color = {
   ....:     "boxes": "DarkGreen",
   ....:     "whiskers": "DarkOrange",
   ....:     "medians": "DarkBlue",
   ....:     "caps": "Gray",
   ....: }
   ....:

CU IDOL SELF LEARNING MATERIAL (SLM)
In [40]: df.plot.box(color=color, sym="r+");

Figure 8.16: Colored Box Plot

You can also pass other keywords supported by matplotlib's boxplot. For example, a horizontal box plot with custom positions can be drawn with the vert=False and positions keywords.

In [41]: df.plot.box(vert=False, positions=[1, 4, 5, 6, 8]);

Figure 8.17: Horizontal Box Plot

The existing DataFrame.boxplot interface can still be used to draw box plots.
In [42]: df = pd.DataFrame(np.random.rand(10, 5))
In [43]: plt.figure();
In [44]: bp = df.boxplot()

Figure 8.18: Box Plot with existing interface

You can create a stratified box plot by using the by keyword argument to create groupings. For instance,

In [45]: df = pd.DataFrame(np.random.rand(10, 2), columns=["Col1", "Col2"])
In [46]: df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
In [47]: plt.figure();
In [48]: bp = df.boxplot(by="X")
Figure 8.19: Grouped Box Plot

You can also pass a subset of columns to plot, as well as group by multiple columns:

In [49]: df = pd.DataFrame(np.random.rand(10, 3), columns=["Col1", "Col2", "Col3"])
In [50]: df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
In [51]: df["Y"] = pd.Series(["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"])
In [52]: plt.figure();
In [53]: bp = df.boxplot(column=["Col1", "Col2"], by=["X", "Y"])
Figure 8.20: Box Plot grouped by multiple columns

In boxplot, the return type can be controlled by the return_type keyword. The valid choices are {"axes", "dict", "both", None}. Faceting, created by DataFrame.boxplot with the by keyword, affects the output type as well.

Groupby.boxplot always returns a Series of return_type.

In [54]: np.random.seed(1234)
In [55]: df_box = pd.DataFrame(np.random.randn(50, 2))
In [56]: df_box["g"] = np.random.choice(["A", "B"], size=50)
In [57]: df_box.loc[df_box["g"] == "B", 1] += 3
In [58]: bp = df_box.boxplot(by="g")
Figure 8.21: Box plot grouped by g

The subplots above are split by the numeric columns first, then by the value of the g column. Below, the subplots are first split by the value of g, then by the numeric columns.

In [59]: bp = df_box.groupby("g").boxplot()

Area plot

Area charts depict a time-series relationship. Unlike line charts, they can also visually represent volume. Information is graphed on two axes, using data points connected by line segments. The area between the axis and this line is commonly emphasized with color or shading for legibility.

You can create area plots with Series.plot.area() and DataFrame.plot.area(). Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values.

When the input data contains NaN, it is automatically filled with 0. If you want to drop the missing values or fill them with a different value, use DataFrame.dropna() or DataFrame.fillna() before calling plot.

In [60]: df = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
In [61]: df.plot.area();
Figure 8.22: Area Plot

To produce an unstacked plot, pass stacked=False. The alpha value is set to 0.5 unless otherwise specified:

In [62]: df.plot.area(stacked=False);

Figure 8.23: Unstacked Area plot
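Stacked area plots require every column to be entirely positive or entirely negative. The sketch below shows one way to check this before plotting and to fall back to an unstacked plot otherwise. The helper name can_stack and the sample values are hypothetical and are not part of pandas.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd


def can_stack(frame):
    """Hypothetical helper: True if every column is entirely >= 0 or entirely <= 0."""
    non_negative = (frame >= 0).all()
    non_positive = (frame <= 0).all()
    return bool((non_negative | non_positive).all())


rng = np.random.default_rng(0)
positive_df = pd.DataFrame(rng.random((10, 3)), columns=["a", "b", "c"])
mixed_df = positive_df.copy()
mixed_df.loc[0, "a"] = -1.0  # column "a" now mixes positive and negative values

# Stack only when the data allows it; otherwise draw an unstacked plot.
ax_stacked = positive_df.plot.area(stacked=can_stack(positive_df))
ax_unstacked = mixed_df.plot.area(stacked=can_stack(mixed_df))
```

Passing stacked=True for mixed_df would raise an error, so checking first keeps the plotting code from failing on unexpected data.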
Scatter plot

A scatter plot is a set of data points plotted on an x and a y axis to represent two variables. The shape those data points create tells the story, most often revealing a correlation (positive or negative) in a large amount of data.

A scatter plot can be drawn by using the DataFrame.plot.scatter() method. A scatter plot requires numeric columns for the x and y axes. These can be specified by the x and y keywords.

In [63]: df = pd.DataFrame(np.random.rand(50, 4), columns=["a", "b", "c", "d"])
In [64]: df.plot.scatter(x="a", y="b");

Figure 8.24: Scatter Plot

To plot multiple column groups on a single axes, repeat the plot method and specify the target ax. It is recommended to specify the color and label keywords to distinguish each group.

In [65]: ax = df.plot.scatter(x="a", y="b", color="DarkBlue", label="Group 1")
In [66]: df.plot.scatter(x="c", y="d", color="DarkGreen", label="Group 2", ax=ax);
Figure 8.25: Coloured and Labeled Scatter Plot

The keyword c may be given as the name of a column to provide colors for each point:

In [67]: df.plot.scatter(x="a", y="b", c="c", s=50);

Figure 8.26: Column Colored Scatter Plot

Hexagonal bin plot

Hexagonal binning is another way to manage the problem of having too many points that start to overlap. Hexagonal binning plots density rather than individual points. Points are binned into gridded hexagons, and the distribution (the number of points per hexagon) is displayed using either the color or the area of the hexagons.
You can create hexagonal bin plots with DataFrame.plot.hexbin(). Hexbin plots can be a useful alternative to scatter plots if your data are too dense to plot each point individually.

In [69]: df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
In [70]: df["b"] = df["b"] + np.arange(1000)
In [71]: df.plot.hexbin(x="a", y="b", gridsize=25);

Figure 8.27: Hexagonal Bin Plot

A useful keyword argument is gridsize; it controls the number of hexagons in the x-direction and defaults to 100. A larger gridsize means more, smaller bins.

By default, a histogram of the counts around each (x, y) point is computed. You can specify alternative aggregations by passing values to the C and reduce_C_function arguments. C specifies the value at each (x, y) point, and reduce_C_function is a function of one argument that reduces all the values in a bin to a single number (e.g., mean, max, sum, std). In this example the positions are given by columns a and b, while the value is given by column z. The bins are aggregated with NumPy's max function.

In [72]: df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
In [73]: df["b"] = df["b"] + np.arange(1000)
In [74]: df["z"] = np.random.uniform(0, 3, 1000)
In [75]: df.plot.hexbin(x="a", y="b", C="z", reduce_C_function=np.max, gridsize=25);
Figure 8.28: Aggregated Bin Plot

Pie plot

You can create a pie plot with DataFrame.plot.pie() or Series.plot.pie(). If your data includes any NaN, they will be automatically filled with 0. A ValueError will be raised if there are any negative values in your data.

In [76]: series = pd.Series(3 * np.random.rand(4), index=["a", "b", "c", "d"], name="series")
In [77]: series.plot.pie(figsize=(6, 6));

Figure 8.29: Pie Plot
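The two rules for pie plots (NaN is filled with 0, negative values raise a ValueError) can be seen in a short sketch. The series names and sample values below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

labels = ["a", "b", "c", "d"]

# A NaN entry is accepted: it is treated as 0 when the pie is drawn.
with_nan = pd.Series([3.0, np.nan, 1.0, 2.0], index=labels, name="with_nan")
ax = with_nan.plot.pie(figsize=(6, 6))

# A negative entry is rejected with a ValueError.
with_negative = pd.Series([3.0, -1.0, 1.0, 2.0], index=labels, name="with_negative")
try:
    with_negative.plot.pie(figsize=(6, 6))
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

If negative values are possible in your data, decide how to treat them (for example, by filtering them out) before calling the pie plot.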
For pie plots it's best to use square figures, i.e., a figure aspect ratio of 1. You can create the figure with equal width and height, or force the aspect ratio to be equal after plotting by calling ax.set_aspect('equal') on the returned axes object.

Note that a pie plot with a DataFrame requires that you either specify a target column with the y argument or pass subplots=True. When y is specified, a pie plot of the selected column is drawn. If subplots=True is specified, pie plots for each column are drawn as subplots. A legend is drawn in each pie plot by default; specify legend=False to hide it.

In [78]: df = pd.DataFrame(
   ....:     3 * np.random.rand(4, 2), index=["a", "b", "c", "d"], columns=["x", "y"]
   ....: )
   ....:
In [79]: df.plot.pie(subplots=True, figsize=(8, 4));

Figure 8.30: Pie Plot with selected columns

You can use the labels and colors keywords to specify the labels and colors of each wedge.

If you want to hide wedge labels, specify labels=None. If fontsize is specified, the value is applied to the wedge labels. Other keywords supported by matplotlib.pyplot.pie() can also be used.

If you pass values whose sum is less than 1.0, older versions of matplotlib draw only part of the circle (for example, a semicircle when the values sum to 0.5); recent versions rescale the values so that they sum to 1 and draw a full pie.

In [81]: series = pd.Series([0.1] * 4, index=["a", "b", "c", "d"], name="series2")
In [82]: series.plot.pie(figsize=(6, 6));
Figure 8.31: Pie plot as semicircle

Plotting with missing data

pandas tries to be pragmatic about plotting DataFrames or Series that contain missing data. Missing values are dropped, left out, or filled depending on the plot type.

Plot Type         NaN Handling
Line              Leave gaps at NaNs
Line (stacked)    Fill with 0
Bar               Fill with 0
Scatter           Drop NaNs
Histogram         Drop NaNs (column-wise)
Box               Drop NaNs (column-wise)
Area              Fill with 0
KDE               Drop NaNs (column-wise)
Hexbin            Drop NaNs
Pie               Fill with 0

Table 8.2: Handling Missing Data

If any of these defaults are not what you want, or if you want to be explicit about how missing values are handled, consider using fillna() or dropna() before plotting.
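To handle missing values explicitly rather than relying on the defaults in Table 8.2, clean the data before plotting. The following is a minimal sketch with made-up values; filling with the column mean is only one possible choice.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 4.0], "b": [2.0, 2.5, np.nan, 1.0]}
)

dropped = df.dropna()           # keep only the rows that have no missing values
filled = df.fillna(df.mean())   # replace each NaN with the mean of its column

# Bar plots fill NaN with 0 by default; here the column mean is used instead.
ax_bar = filled.plot.bar()

# Scatter plots drop NaNs by default; dropping first makes that choice visible.
ax_scatter = dropped.plot.scatter(x="a", y="b")
```

Cleaning first also means the same, known data set is used if several plot types are drawn from one DataFrame.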
8.5 SUMMARY

• A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
• A pandas DataFrame can be created using various inputs, such as:
    • Lists
    • dict
    • Series
    • NumPy ndarrays
    • Another DataFrame
• Plotting methods allow for a handful of plot styles other than the default line plot.
• There are several plotting functions in pandas.plotting that take a Series or DataFrame as an argument.
• A histogram is a chart that groups numeric data into bins, displaying the bins as segmented columns. Histograms are used to depict the distribution of a dataset: how often values fall into ranges.
• A bar plot displays categorical data with rectangular bars whose length or height corresponds to the value of each data point. Bar plots can be drawn with vertical or horizontal bars.
• A box plot allows you to examine the distribution of data. One box plot appears for each attribute element. Each box plot displays the minimum, first quartile, median, third quartile, and maximum values.
• Area plots depict a time-series relationship.
• A scatter plot is a set of data points plotted on an x and a y axis to represent two variables.

8.6 KEYWORDS

• Data frame - 2-dimensional labeled data structure with columns of potentially different types
• Visualization - technique for creating images, diagrams, or animations
• Box Plot - graphical depiction of groups of numerical data
• Area Plot - visual display of quantitative data
• Histogram - representation of the distribution of numerical data
• Scatter Plot - diagram where each value in the data set is represented by a dot

8.7 LEARNING ACTIVITY

1. A number raised to the third power is a cube. Plot the first five cubic numbers, and then plot the first 5000 cubic numbers.
2. A number raised to the third power is a cube. Plot the first five cubic numbers, and then plot the first 5000 cubic numbers. Apply a colormap to your cubes plot.

8.8 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is a DataFrame?
2. List the ways to create a DataFrame.
3. Differentiate between a bar plot and a pie plot.
4. Is it possible to create a pie plot in a semicircle shape?
5. What is a histogram?

Long Questions
1. Illustrate the key concepts of a DataFrame.
2. Describe various ways of creating a DataFrame.
3. Illustrate the various visualization methods being used.
4. Discuss the various kinds of plots available for visualization.
5. Which plot is chosen for a particular application? Justify.

B. Multiple Choice Questions

1. Observe the output figure. Identify the code that produces this output.
   a. import matplotlib.pyplot as plt
      plt.plot([1,2,3],[4,5,1])
      plt.show()
   b. import matplotlib.pyplot as plt
      plt.plot([1,2],[4,5])
      plt.show()
   c. import matplotlib.pyplot as plt
      plt.plot([2,3],[5,1])
      plt.show()
   d. import matplotlib.pyplot as plt
      plt.plot([1,3],[4,1])
      plt.show()

2. Read the code:
      import matplotlib.pyplot as plt
      plt.plot(3,2)
      plt.show()
   Identify the output of the above code.
   a. [figure]   b. [figure]   c. [figure]   d. [figure]

3. Identify the right type of chart using the following hints.
   Hint 1: This chart is often used to visualize a trend in data over intervals of time.
   Hint 2: The line in this type of chart is often drawn chronologically.
   a. Line chart
   b. Bar chart
   c. Pie chart
   d. Scatter plot

4. Read the statements given below. Identify the right option from the following for a pie chart.
   Statement A: To make a pie chart with Matplotlib, we can use the plt.pie() function.
   Statement B: The autopct parameter allows us to display the percentage value using Python string formatting.
   a. Statement A is correct
   b. Statement B is correct
   c. Both the statements are correct
   d. Both the statements are wrong

5. Point out the wrong combination with regard to the kind keyword for graph plotting.
   a. 'scatter' for scatter plots
   b. 'kde' for hexagonal bin plots
   c. 'pie' for pie plots
   d. None of these

6. We can create a scatter plot matrix using the __________ method in pandas.plotting.
   a. sca_matrix
   b. scatter_matrix
   c. DataFrame.plot
   d. All of these

7. Which of the following values of the kind keyword produces a bar plot?
   a. bar
   b. kde
   c. hexbin
   d. None of these

8. __________ plots are used to visually assess the uncertainty of a statistic.
   a. Lag
   b. RadViz
   c. Bootstrap
   d. None of these
9. Which of the following plots are often used for checking randomness in time series?
   a. Autocausation
   b. Autorank
   c. Autocorrelation
   d. None of these

Answers:
1 - a, 2 - c, 3 - a, 4 - c, 5 - b, 6 - b, 7 - a, 8 - c, 9 - c

8.9 REFERENCES

Text Books:
• Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, Updated for Python 3, Shroff/O'Reilly Publishers, 2016.
• Michael Urban, Joel Murach, Mike Murach, "Murach's Python Programming", December 2016.

Reference Books:
• Guido van Rossum and Fred L. Drake Jr., "An Introduction to Python - Revised and Updated for Python 3.2".
• Jake VanderPlas, "Python Data Science Handbook", O'Reilly Publishers, 2016.
UNIT - 9: GRAPHICAL EXPLORATORY DATA ANALYSIS (EDA) 1

Structure
9.0. Learning Objectives
9.1. Introduction
9.2. Introduction to Dataset and 2D Scatter Plot
9.3. Pair Plots
9.4. Histogram and Introduction to Probability Density Function
9.5. Summary
9.6. Keywords
9.7. Learning Activity
9.8. Unit End Questions
9.9. References

9.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Discuss data analysis using Python
• Learn about 2D and 3D scatter plots
• Describe the probability density function

9.1 INTRODUCTION

Exploratory data analysis (EDA) is a method of analyzing and investigating data sets to summarize their main characteristics. Scientists often use data visualization methods to discover patterns, spot anomalies, check assumptions or test a hypothesis through summary statistics and graphical representations.

EDA goes beyond formal modeling or hypothesis testing to give maximum insight into the data set and its structure, and to identify influential variables. It can also help in selecting the most suitable data analysis technique for a given project. Specific knowledge, such as a ranked list of relevant factors to be used as guidelines, can also be obtained using EDA.
Types of EDA

EDA techniques are either graphical or quantitative (non-graphical). Graphical methods summarize the data in a diagrammatic or visual way, whereas quantitative methods involve the calculation of summary statistics. These two types of methods are further divided into univariate and multivariate methods.

Univariate methods consider one variable (data column) at a time, while multivariate methods consider two or more variables at a time to explore relationships. Thus, there are four types of EDA in all: univariate graphical, multivariate graphical, univariate non-graphical, and multivariate non-graphical. The graphical methods provide a more subjective analysis, and the quantitative methods are more objective.

Univariate non-graphical: This is the simplest form of data analysis among the four options. In this type of analysis, the data being analysed consists of just a single variable. The main purpose of this analysis is to describe the data and to find patterns.

Univariate graphical: Unlike the non-graphical method, the graphical method provides the full picture of the data. The three main methods of analysis under this type are the histogram, the stem and leaf plot, and the box plot. The histogram represents the total count of cases for a range of values. Along with the data values, the stem and leaf plot shows the shape of the distribution. The box plot graphically depicts a summary of the minimum, first quartile, median, third quartile, and maximum.

Multivariate non-graphical: The multivariate non-graphical type of EDA generally depicts the relationship between multiple variables of data through cross-tabulation or statistics.

Multivariate graphical: This type of EDA displays the relationship between two or more sets of data.
A common example is a grouped bar chart, where each group represents a level of one of the variables and each bar within the group represents a level of the other variable.

9.2 INTRODUCTION TO DATASET AND 2D SCATTER PLOT

Iris flower data set

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One
class is linearly separable from the other 2; the latter are not linearly separable from each other.

Here, the four given features (sepal length, sepal width, petal length and petal width) are used to determine whether a flower is Setosa, Versicolor or Virginica.

Sepal length, sepal width, petal length and petal width are called features (also variables, input variables or independent variables).

Species is called the label (also dependent variable, output variable, class, class label or response label).

Figure 9.1: Iris Data Set

Scatter Plot

In machine learning and data science, the scatter plot is one of the most commonly used plots for simple data visualization. It shows where each point in the dataset lies with respect to any two or three features (columns). Scatter plots are available in 2D as well as 3D. The 2D scatter plot is the most common one; it is used primarily to find patterns, clusters and the separability of the data. The code snippet for a scatter plot is shown below.

    # iris is assumed to be a pandas DataFrame of the Iris data with
    # columns such as sepal_length, sepal_width and species
    import matplotlib.pyplot as plt

    plt.scatter(iris['sepal_length'], iris['sepal_width'])
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.title('Scatter plot on Iris dataset')

Figure 9.2: Scatter Plot on Iris Data set

Here we can see that all the points are marked at their corresponding positions with respect to their values of x and y. Let's tweak the call so that the points are drawn in different colours, one colour per species.

    # One colour per species; the species names used as keys are assumed
    # to match the values in the species column.
    species_colors = {'setosa': 'r', 'versicolor': 'b', 'virginica': 'g'}
    plt.scatter(iris['sepal_length'], iris['sepal_width'], c=iris['species'].map(species_colors))
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.title('Scatter plot on Iris dataset')
Figure 9.3: Colored Scatter plot on Iris Data set

3D Scatter Plot

A 3D scatter plot is the most basic form of three-dimensional plotting. It displays three variables of a dataset as points in Cartesian coordinates. Matplotlib's mplot3d toolkit can be used to create 3D scatter plots; the example below uses Plotly Express instead.

    import plotly.express as px

    df = px.data.iris()
    fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
                        color='species')
    fig.show()
Figure 9.4: 3-D Scatter Plot

The style of the figure can be customized through the parameters of px.scatter_3d for some options, or by updating the traces or the layout of the figure.

    import plotly.express as px

    df = px.data.iris()
    fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
                        color='petal_length', size='petal_length', size_max=18,
                        symbol='species', opacity=0.7)
    # tight layout
    fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))
    fig.show()
Figure 9.5: Customized 3D Scatter Plot

9.3 PAIR PLOTS

Pair plots are a simple (one-line-of-code simple!) way to visualize the relationships between all pairs of variables. A pair plot produces a matrix of relationships between the variables in your data for an instant examination of the data.

We can use scatter plots for 2D data with Matplotlib, and for 3D data we can use Plotly. What should we do when we have four or more dimensions? This is where the pair plot from the seaborn package comes into play.

Suppose the data has n features. The pair plot creates an (n x n) grid of plots in which each diagonal plot is a histogram of the feature corresponding to that row, and every other plot is a scatter plot with the feature of its row on the y axis and the feature of its column on the x axis.

The code snippet for a pair plot of the Iris dataset is provided below.

    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set_style("whitegrid")
    # height sets the size of each facet (older seaborn versions called this keyword size)
    sns.pairplot(iris, hue="species", height=3)
    plt.show()
Figure 9.6: Pair Plot on Iris Dataset

The pair plot gives a high-level overview from which we can see which two features best explain or separate the data; we can then use a scatter plot of those two features to explore further. From the above plot we can conclude that petal length and petal width are the two features that separate the data very well.

Since we get n x n plots for n features, a pair plot may become hard to read when there are many features, say 10 or more. In such cases, the best bet is to use a dimensionality reduction technique to map the data onto a 2D plane and visualize it with a 2D scatter plot.

9.4 HISTOGRAM AND INTRODUCTION TO PROBABILITY DENSITY FUNCTION

A histogram is a summary of the variation in a measured variable. It shows the number of samples that occur in a category: this is called a frequency distribution. Histograms make
sense for categorical variables, but a histogram can also be derived from a continuous variable.

Histogram on Sepal Length

    # data is assumed to be the Iris DataFrame with columns such as
    # SepalLengthCm and PetalLengthCm
    import matplotlib.pyplot as plt

    plt.figure(figsize=(10, 7))
    x = data["SepalLengthCm"]
    plt.hist(x, bins=20, color="green")
    plt.title("Sepal Length in cm")
    plt.xlabel("Sepal_Length_cm")
    plt.ylabel("Count")
    plt.show()

Figure 9.7: Histogram on Iris Data Set based on Sepal Length

Histogram for Petal Length

    plt.figure(figsize=(10, 7))
    x = data.PetalLengthCm
    plt.hist(x, bins=20, color="green")
    plt.title("Petal Length in cm")
    plt.xlabel("Petal_Length_cm")
    plt.ylabel("Count")
    plt.show()

Figure 9.8: Histogram on Iris Data Set based on Petal Length

Instead of the relative frequencies, we can also make a histogram with the empirical density distribution. Writing n_j for the number of samples in bin j, n for the total number of samples and dy for the bin width, the empirical density is defined as

$$f_j = \frac{n_j / n}{dy}$$

i.e., it is equal to the empirical probability divided by the interval length, or bin width.

The advantage is that the empirical densities are insensitive to changes in the bin width dy, in contrast to the relative frequencies, since a smaller bin width results in a smaller relative frequency.

Probability Density Function

A probability density function (PDF) is the continuous version of the histogram with densities (you can see this by imagining infinitesimally small bin widths); it specifies how the probability density is distributed over the range of values that a random variable can take.

The PDF of a random variable is often described by an analytical function. A large number of such statistical distribution functions have been defined; well-known examples are the uniform distribution and the normal distribution.
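The claim that empirical densities, unlike relative frequencies, are insensitive to the bin width can be checked numerically. The sketch below uses NumPy's histogram function on synthetic, normally distributed data; the sample, its parameters and the two bin counts are arbitrary choices for illustration, not the Iris measurements.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = rng.normal(loc=5.8, scale=0.8, size=1000)  # synthetic data for illustration

results = {}
for bins in (10, 40):
    counts, edges = np.histogram(sample, bins=bins)
    widths = np.diff(edges)                   # bin width dy (equal for all bins here)
    relative_freq = counts / counts.sum()     # empirical probability per bin
    density = relative_freq / widths          # empirical density per bin
    results[bins] = (relative_freq, density, widths)
    # Peak relative frequency shrinks as the bins get narrower, while the
    # peak density stays on the same scale and the total area remains 1.
    print(bins, relative_freq.max(), density.max(), (density * widths).sum())
```

The density computed this way matches what np.histogram returns with density=True, which is also what plt.hist draws when density=True is passed.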
For example, given a random sample of a variable, we might want to know things like the shape of the probability distribution, the most likely value, the spread of values, and other properties.

Knowing the probability distribution of a random variable helps in calculating moments of the distribution, like the mean and variance, and is also useful for more general considerations, like determining whether an observation is unlikely or very unlikely and might therefore be an outlier or anomaly.

The problem is that we may not know the probability distribution of a random variable. We rarely know the distribution because we do not have access to all possible outcomes of the random variable. In fact, all we have access to is a sample of observations. As such, we must select a probability distribution.

This problem is referred to as probability density estimation, or simply "density estimation", as we are using the observations in a random sample to estimate the general density of probabilities beyond just the sample of data we have available.

As with the histogram, the probability that a random variable Y takes a value in a certain interval [y_j, y_m] is equal to the area below the function f(y); see the figure below. It can be calculated by taking the integral over the interval:

$$P(y_j \le Y \le y_m) = \int_{y_j}^{y_m} f(y)\, dy$$
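As a numerical sketch of this integral, consider a normal PDF. The mean, standard deviation and interval below are arbitrary illustrative values, and the function names are introduced only for this example. The area under the PDF is approximated with the trapezoidal rule and compared with the closed form based on the error function.

```python
import math


def normal_pdf(y, mu, sigma):
    """Density of the normal distribution at y."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_cdf(y, mu, sigma):
    """Cumulative probability P(Y <= y), computed with the error function."""
    return 0.5 * (1.0 + math.erf((y - mu) / (sigma * math.sqrt(2.0))))


def interval_probability(lower, upper, mu, sigma, steps=10_000):
    """Approximate the integral of the PDF over [lower, upper] (trapezoidal rule)."""
    h = (upper - lower) / steps
    total = 0.5 * (normal_pdf(lower, mu, sigma) + normal_pdf(upper, mu, sigma))
    for k in range(1, steps):
        total += normal_pdf(lower + k * h, mu, sigma)
    return total * h


mu, sigma = 5.8, 0.8       # illustrative values only
lower, upper = 5.0, 6.6    # the interval [y_j, y_m], here one sigma either side of the mean
numeric = interval_probability(lower, upper, mu, sigma)
exact = normal_cdf(upper, mu, sigma) - normal_cdf(lower, mu, sigma)
print(round(numeric, 4), round(exact, 4))
```

Both values come out at about 0.6827, the familiar probability of a normal variable falling within one standard deviation of its mean.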
Figure 9.9: Histogram and PDF on Iris Data set

9.5 SUMMARY

• Exploratory data analysis (EDA) is a method of analyzing and investigating data sets to summarize their main characteristics.
• EDA goes beyond formal modeling or hypothesis testing to give maximum insight into the data set and its structure, and to identify influential variables. It can also help in selecting the most suitable data analysis technique for a given project.
• EDA techniques are either graphical or quantitative (non-graphical).
• The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper.
• The 2D scatter plot is the most common one; it is used primarily to find patterns, clusters and the separability of the data.
• A 3D scatter plot is the most basic form of three-dimensional plotting; it displays three variables of a dataset as points in Cartesian coordinates.
• Pair plots are a really simple (one-line-of-code simple!) way to visualize the relationships between all pairs of variables.
• A histogram is a summary of the variation in a measured variable. It shows the number of samples that occur in a category: this is called a frequency distribution.
9.6 KEYWORDS

• EDA - Exploratory Data Analysis
• PDF - Probability Density Function
• Pair Plots - pairwise relationships in a dataset
• 3D Scatter Plot - the most basic form of three-dimensional plotting

9.7 LEARNING ACTIVITY

1. Plot a histogram of the petal lengths of the 50 samples of Iris versicolor using matplotlib/seaborn's default settings. Recall that to specify the default seaborn style, you can use sns.set(), where sns is the alias that seaborn is imported as.
2. "A 3D scatter plot is constructed using the seaborn library." Comment.

9.8 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is the need for EDA?
2. Compare EDA and data visualization.
3. What is the advantage of a pair plot?
4. What is the benefit of using a probability density function?
5. What is a histogram?

Long Questions
1. Illustrate the concepts of EDA.
2. Describe how 2D and 3D scatter plots are used in Data Analysis.
3. Illustrate histogram plotting using the Iris Dataset.
4. Compare Scatter plot, Histogram and Pair Plot.
5. Discuss the role of the Probability Density Function.

B. Multiple Choice Questions

1. Which of the following graphs has properties in the below figure?
    a. Exploratory
    b. Inferential
    c. Causal
    d. None of these

2. Which of the following dimension type graph is shown in the below figure?
    a. one-dimensional
    b. two-dimensional
    c. three-dimensional
    d. None of these

3. Which of the following gave rise to need of graphs in data analysis?
    a. Data visualization
    b. Communicating results
    c. Decision making
    d. All of these

4. Which of the following is characteristic of exploratory graph?
    a. Made slowly
    b. Axes are not cleaned up
    c. Color is used for personal information
    d. All of these

5. Point out the correct statement.
    a. coplots are one dimensional data graph
    b. Exploratory graphs are made quickly
    c. Exploratory graphs are made relatively less in number
    d. All of these

6. Scatter diagram is graphical component of ____________
    a. Regression analysis
    b. Demand
    c. Supply
    d. Profit

7. A scatter diagram represents the relationship between _________ and ________
    a. Cause, effects
    b. Cause, problem
    c. Effects, output
    d. Production, productivity

8. A histogram gives _____ nature of process variability.
    a. Static
    b. Dynamic
    c. Negative
    d. Positive
9. Which of the following standard probability distributions is applicable to discrete random variables?
    a. Gaussian Distribution
    b. Poisson Distribution
    c. Rayleigh Distribution
    d. Exponential Distribution

10. A table with all possible values of a random variable and their corresponding probabilities is called ___________
    a. Probability Mass Function
    b. Probability Density Function
    c. Cumulative distribution function
    d. Probability Distribution

Answers
1-a, 2-b, 3-d, 4-c, 5-b, 6-a, 7-a, 8-a, 9-b, 10-d.

9.9 REFERENCES

Text Books:
• Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, updated for Python 3, Shroff/O'Reilly Publishers, 2016.
• Michael Urban, Joel Murach, Mike Murach, "Murach's Python Programming", December 2016.

Reference Books:
• Guido van Rossum and Fred L. Drake Jr., "An Introduction to Python", revised and updated for Python 3.2.
• Jake VanderPlas, "Python Data Science Handbook", O'Reilly Publishers, 2016.
UNIT - 10: GRAPHICAL EXPLORATORY DATA ANALYSIS (EDA) II

Structure
10.0. Learning Objectives
10.1. Introduction
10.2. Adjusting the Number of Bins in a Histogram
10.3. Bee Swarm Plot
10.4. Univariate Analysis using PDF and CDF
10.5. Plot Data in ECDF
10.6. Summary
10.7. Keywords
10.8. Learning Activity
10.9. Unit End Questions
10.10. References

10.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Perform data analysis using Python
• Describe bee swarm plotting
• Perform univariate analysis using PDF and CDF

10.1 INTRODUCTION

Exploratory Data Analysis is mainly performed using the following methods:
• Univariate analysis: provides summary statistics for each field in the raw data set, i.e. a summary of one variable at a time. Ex: CDF, PDF, box plot, violin plot.
• Bivariate analysis: is performed to find the relationship between each variable in the data set and the target variable of interest, i.e. it uses two variables and finds the relationship between them. Ex: box plot, violin plot.
• Multivariate analysis: is performed to understand interactions between different fields in the data set, i.e. interactions among more than two variables. Ex: pair plot and 3D scatter plot.
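As a rough numerical illustration of the three levels of analysis listed above, the sketch below uses a small synthetic data set. The variable names (length, width, height) and the way the values are generated are assumptions made purely for illustration; they do not come from this unit.

```python
import numpy as np

# Synthetic data, for illustration only (hypothetical features)
rng = np.random.default_rng(0)                      # fixed seed for repeatable numbers
n = 200
length = rng.normal(5.0, 1.0, n)                    # feature 1
width = 0.5 * length + rng.normal(0.0, 0.2, n)      # feature 2, built to depend on length
height = rng.normal(2.0, 0.5, n)                    # feature 3, generated independently
data = np.column_stack([length, width, height])

# Univariate: summarise one variable on its own
univariate = {
    "mean": length.mean(),
    "std": length.std(ddof=1),
    "median": np.median(length),
}

# Bivariate: relationship between two variables (Pearson correlation)
r_length_width = np.corrcoef(length, width)[0, 1]

# Multivariate: all pairwise relationships at once (3 x 3 correlation matrix)
corr_matrix = np.corrcoef(data, rowvar=False)

print(univariate)
print(r_length_width)
print(corr_matrix)
```

The graphical counterparts named in the list (PDF/CDF plots, box plots, pair plots) present the same three levels of information visually.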
Axis Labels
Graphs are often self-explanatory, but a title, axis labels, and a legend that explains what each line represents are frequently necessary.
To start:
    import matplotlib.pyplot as plt

    x = [1, 2, 3]
    y = [5, 7, 4]
    x2 = [1, 2, 3]
    y2 = [10, 14, 12]

This gives us two lines to plot. Next:
    plt.plot(x, y, label='First Line')
    plt.plot(x2, y2, label='Second Line')
Here we plot as before, but this time we pass an additional parameter, label. It assigns a name to each line, which can later be shown in the legend. The rest of the code:
    plt.xlabel('Plot Number')
    plt.ylabel('Important var')
    plt.title('Interesting Graph\nCheck it out')
    plt.legend()
    plt.show()
With plt.xlabel and plt.ylabel, we assign labels to the respective axes. We then set the plot's title with plt.title and invoke the default legend with plt.legend(). The resulting graph:
Figure 10.1 Plot with Label

Example 2: Simple axes labels
Label the axes of a plot.
    import numpy as np
    import matplotlib.pyplot as plt

    fig = plt.figure()
    fig.subplots_adjust(top=0.8)
    ax1 = fig.add_subplot(211)
    ax1.set_ylabel('volts')
    ax1.set_title('a sine wave')
    t = np.arange(0.0, 1.0, 0.01)
    s = np.sin(2 * np.pi * t)
    line, = ax1.plot(t, s, lw=2)
    # Fixing random state for reproducibility
    np.random.seed(19680801)
    ax2 = fig.add_axes([0.15, 0.1, 0.7, 0.3])
    n, bins, patches = ax2.hist(np.random.randn(1000), 50)
    ax2.set_xlabel('time (s)')
    plt.show()

Figure 10.2 Sub Plot with Label

10.2 ADJUSTING THE NUMBER OF BINS IN A HISTOGRAM

Tools that generate histograms usually have a default algorithm for selecting bin boundaries, but you will likely want to experiment with the binning parameters to find something that is representative of your data. Wikipedia has an extensive section on rules of thumb for choosing an appropriate number of bins and their sizes, but ultimately it is worth combining domain knowledge with a fair amount of experimentation to find what works best for your purposes.

The bin size has an inverse relationship with the number of bins. The larger the bins, the fewer of them are needed to cover the whole range of the data; the smaller the bins, the more of them are needed. It is worth taking some time to try different bin sizes to see how the distribution looks in each one, then choose the plot that represents the data
best. If you have too many bins, the data distribution will look rough, and it will be difficult to discern the signal from the noise. On the other hand, with too few bins, the histogram will lack the detail needed to discern any useful pattern in the data. Setting the bin size of a Matplotlib histogram specifies the size of the groups into which the data is sorted.

TO CREATE BINS OF A DESIRED WIDTH
Use matplotlib.pyplot.hist(data, bins=list), where list is an iterable containing the bin edges, i.e. the start and end point of each bin.
For example, setting bins to [10, 20, 30] creates a histogram with one bin containing values between 10 and 20 and another containing values between 20 and 30.

    import numpy as np
    import matplotlib.pyplot as plt

    data = np.random.normal(50, 10, size=10000)
    ax = plt.hist(data)  # default binning; plt.hist returns (counts, bin edges, patches)

Figure 10.3 Bins of Desired Width

    bins_list = [-10, 20, 40, 50, 60, 80, 110]
    ax = plt.hist(data, bins=bins_list)

TO CREATE BINS OF EQUAL SIZE
Use matplotlib.pyplot.hist(data, bins=n) to create a histogram with n equal-width bins. To create bins of approximately a given width w, set bins to math.ceil(x) with x equal to (data.max() - data.min())/w.
    ax = plt.hist(data)
    import math

    w = 3
    n = math.ceil((data.max() - data.min()) / w)  # number of bins needed for width about w
    ax = plt.hist(data, bins=n)

Figure 10.4 Bins of Equal Size

10.3 BEE SWARM PLOT

A bee swarm plot is a one-dimensional scatter plot similar to a stripchart, but with various methods to separate coincident points so that each point is visible. Beeswarm also introduces features unavailable in stripchart, such as the ability to control the color and plotting character of each point.
Seaborn's swarmplot is similar to stripplot, except that the points are adjusted so that they do not overlap, which gives a better representation of the distribution of values. A swarm plot can be drawn on its own, but it is also a good complement to a box or violin plot when you want to show all observations along with a summary of the underlying distribution. Passing the data as a long-form DataFrame is preferable because the associated column names will be used to annotate the axes. This type of plot is sometimes known as a "beeswarm".

Syntax: seaborn.swarmplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, dodge=False, orient=None, color=None, palette=None, size=5, edgecolor='gray', linewidth=0, ax=None, **kwargs)

Parameters:
    x, y, hue: Inputs for plotting long-form data.
    data: Dataset for plotting.
    color: Color for all of the elements.
    size: Radius of the markers, in points.
Example
    import seaborn as sns
    sns.set_theme(style="whitegrid")
    tips = sns.load_dataset("tips")
    ax = sns.swarmplot(x=tips["total_bill"])

Figure 10.5 Horizontal Swarm Plot

Group the swarms by a categorical variable:
    ax = sns.swarmplot(x="day", y="total_bill", data=tips)
Figure 10.6 Grouped Swarm Plot

10.4 UNIVARIATE ANALYSIS USING PDF AND CDF

Univariate analysis, as the name says, simply means analysis using a single variable. It gives the frequency/count of occurrences of the variable and lets us understand how that variable is distributed over its values.

Probability Density Function
In a PDF plot, the X-axis is the feature being analyzed and the Y-axis shows how densely the data occur around each X value (a smoothed, normalized version of the count/frequency). Hence the term "density" in PDF.
    import seaborn as sns
    sns.set_style("whitegrid")
Seaborn is the library that provides various types of plots for analysis. (Note that sns.distplot is deprecated in seaborn 0.11 and later, where histplot and kdeplot replace it.)
    # haberman_data is a DataFrame with 'age' and 'surv_status' columns
    sns.FacetGrid(haberman_data, hue='surv_status', height=5).map(sns.distplot, 'age').add_legend()
Figure 10.7 Examples for PDF

From a PDF plot alone, we cannot say exactly how many data points fall within a range, below a value, or above a value.

Cumulative Distribution Function
The cumulative distribution function (CDF) of a random variable is another way to describe its distribution. The advantage of the CDF is that it can be defined for any kind of random variable (discrete, continuous, and mixed).
The CDF is very useful for finding the fraction of data points below or above a particular value.
For a discrete random variable,
    $F_X(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i)$
For a continuous random variable with density $f$,
    $F_X(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$

Uniform Probability Distribution

The Uniform Distribution, also known as the Rectangular Distribution, is a type of Continuous Probability Distribution.
It has a continuous random variable X restricted to a finite interval [a, b], and its probability function f(x) has a constant density over this interval. The uniform probability density function is defined as:
    $f(x) = \begin{cases} \dfrac{1}{b-a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$

Matplotlib is a plotting library for Python that builds on the NumPy numerical library. The cumulative distribution function (CDF) of a real-valued random variable X, or just the distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x.
Properties of CDF:
• Every cumulative distribution function F(x) is non-decreasing.
• The maximum value of the CDF is 1: F(x) approaches 1 as x increases.
• The CDF ranges from 0 to 1.
Example:
    # importing the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    %matplotlib inline  # IPython/Jupyter magic; omit in a plain Python script

    # number of data points
    N = 500

    # initializing random values
    data = np.random.randn(N)

    # getting the histogram counts and bin edges
    count, bins_count = np.histogram(data, bins=10)
    # finding the PDF of the histogram: normalize the counts so they sum to 1
    pdf = count / sum(count)

    # using np.cumsum to calculate the CDF
    # (the same result can be obtained by looping over the PDF values and adding them)
    cdf = np.cumsum(pdf)

    # plotting PDF and CDF against the right-hand bin edges
    plt.plot(bins_count[1:], pdf, color="red", label="PDF")
    plt.plot(bins_count[1:], cdf, label="CDF")
    plt.legend()

Figure 10.8 Plotting PDF and CDF

10.5 PLOT DATA IN ECDF

ECDF stands for "empirical cumulative distribution function". An ECDF maps every data point in the data set to a quantile, a number between 0 and 1 that indicates the cumulative fraction of data points smaller than or equal to that data point.

Like histograms, ECDFs show the distribution of a single variable, but in a more efficient way. We have seen previously how histograms can be misleading because of different bin sizing options. That is not the case with ECDFs: they show every data point, and the plot can be interpreted in only one way.
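To make the contrast with histograms concrete, here is a minimal sketch that summarizes the same sample in both ways. The sample is synthetic and generated only for illustration, and the variable names are assumptions rather than anything defined in this unit. The bin counts change with the chosen number of bins, whereas the ECDF is fully determined by the data.

```python
import numpy as np

# Synthetic sample, for illustration only
rng = np.random.default_rng(42)
sample = rng.normal(loc=50, scale=10, size=200)

# Histogram: the summary depends on the (arbitrary) number of bins
counts_5, edges_5 = np.histogram(sample, bins=5)
counts_20, edges_20 = np.histogram(sample, bins=20)

# ECDF: sort the data; the i-th smallest point is assigned the value i / n
x = np.sort(sample)
y = np.arange(1, len(x) + 1) / len(x)

# A tiny case that can be checked by hand
small = np.array([3, 1, 2, 2])
xs = np.sort(small)                          # 1, 2, 2, 3
ys = np.arange(1, len(xs) + 1) / len(xs)     # 0.25, 0.5, 0.75, 1.0

print(counts_5)
print(counts_20)
print(xs, ys)
```

Plotting x against y, for example with plt.scatter(x, y), would give the ECDF curve for the sample.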
Think of ECDFs as scatter plots, because they also have points along the X and Y axes. To be more precise, here is what ECDFs show on each axis:
    X-axis: the quantity we are measuring (Age in the example above)
    Y-axis: the percentage of data points that are smaller than or equal to the respective X value (at each point X, Y% of the values are smaller than or identical to X)
To make this sort of visualization, we need to do a bit of calculation first. Two arrays are required:
    X: sorted data (the Age column sorted from lowest to highest)
    Y: a list of evenly spaced values whose maximum is 1 (as in 100%)
The following Python snippet can be used to calculate the X and Y values for a single column in a Pandas DataFrame:
    import numpy as np
    import matplotlib.pyplot as plt

    def ecdf(df, column):
        # sorted values on X, cumulative fraction (i / n) on Y
        x = np.sort(df[column])
        y = np.arange(1, len(x) + 1) / len(x)
        return x, y

    def plot_ecdf(x, y, size=(14, 8), title='ECDF', xlab='Age', ylab='Percentage',
                  color='#087E8B'):
        plt.figure(figsize=size)
        plt.scatter(x, y, color=color)
        plt.title(title, size=20)
        plt.xlabel(xlab, size=14)
        plt.ylabel(ylab, size=14)
Let's use these functions to make an ECDF plot of the Age attribute:
    x, y = ecdf(df, 'Age')
    plot_ecdf(x, y, title='ECDF of passenger ages', xlab='Age')
    plt.show()
Figure 10.9 ECDF plot

10.6 SUMMARY

• Univariate analysis: provides summary statistics for each field in the raw data set, i.e. a summary of one variable at a time. Ex: CDF, PDF, box plot, violin plot.
• Bivariate analysis: is performed to find the relationship between each variable in the data set and the target variable of interest, i.e. it uses two variables and finds the relationship between them. Ex: box plot, violin plot.
• Multivariate analysis: is performed to understand interactions between different fields in the data set, i.e. interactions among more than two variables. Ex: pair plot and 3D scatter plot.
• The bin size has an inverse relationship with the number of bins. The larger the bins, the fewer of them are needed to cover the whole range of the data.
• A bee swarm plot is a one-dimensional scatter plot similar to a strip chart, but with various methods to separate coincident points so that each point is visible.
• Univariate analysis, as the name says, simply means analysis using a single variable. It gives the frequency/count of occurrences of the variable and lets us understand how that variable is distributed over its values.
• ECDF stands for "empirical cumulative distribution function". An ECDF maps every data point in the data set to a quantile, a number between 0 and 1 that indicates the cumulative fraction of data points smaller than or equal to that data point.
10.7 KEYWORDS

• EDA - Exploratory Data Analysis
• PDF - Probability Density Function
• CDF - Cumulative Distribution Function
• Univariate - describes a type of data that contains only one attribute or characteristic
• Bee Swarm Plot - one-dimensional scatter plot

10.8 LEARNING ACTIVITY

1. Suppose you toss a coin twice. Let X be the number of observed heads. Find the CDF of X.

2. Consider the Iris data set. Compute ECDFs for each of the three species using the ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set, x_vers, y_vers and x_virg, y_virg, respectively.

10.9 UNIT END QUESTIONS

A. Descriptive Questions
Short Questions
1. What is the need for EDA?
2. Compare EDA and Data Visualization.
3. What is the advantage of a pair plot?
4. What is the benefit of using a probability density function?
5. What is a histogram?
Long Questions
1. Illustrate the concepts of EDA.
2. Describe how 2D and 3D scatter plots are used in Data Analysis.
3. Illustrate histogram plotting using the Iris Dataset.
4. Compare Scatter plot, Histogram and Pair Plot.
5. Discuss the role of the Probability Density Function.

B. Multiple Choice Questions

1. A variable that can assume any value between two given points is called ___________
    a. Continuous random variable
    b. Discrete random variable
    c. Irregular random variable
    d. Uncertain random variable

2. If the values taken by a random variable are negative, the negative values will have ___________
    a. Positive probability
    b. Negative Probability
    c. May have negative or positive probabilities
    d. Insufficient data

3. If f(x) is a probability density function of a continuous random variable, then $\int_{-\infty}^{\infty} f(x)\,dx = ?$
    a. 0
    b. 1
    c. undefined
    d. Insufficient data

4. The variable that assigns a real number value to an event in a sample space is called ___________
    a. Random variable
    b. Defined variable
    c. Uncertain variable
    d. Static variable
                                
                                