
CU-MCA-Python Programming- Second Draft-converted

Published by Teamlease Edtech Ltd (Amita Chitroda), 2021-05-02 15:59:34


CU IDOL SELF LEARNING MATERIAL (SLM)

Figure 8.15: Sample Box Plot

A box plot can be colorized by passing the color keyword. You can pass a dict whose keys are boxes, whiskers, medians and caps. If some keys are missing from the dict, default colors are used for the corresponding artists. The sym keyword specifies the style of the fliers (the outlier markers).

When you pass any other type of argument via the color keyword, it is passed directly to matplotlib to color all the boxes, whiskers, medians and caps. The colors are applied to every box that is drawn. For more complicated colorization, you can get each drawn artist by passing return_type.

In [39]: color = {
   ....:     "boxes": "DarkGreen",
   ....:     "whiskers": "DarkOrange",
   ....:     "medians": "DarkBlue",
   ....:     "caps": "Gray",
   ....: }
   ....:

In [40]: df.plot.box(color=color, sym="r+");

Figure 8.16: Colored Box Plot

You can also pass other keywords supported by matplotlib's boxplot. For example, a horizontal box plot with custom positions can be drawn with the vert=False and positions keywords.

In [41]: df.plot.box(vert=False, positions=[1, 4, 5, 6, 8]);

Figure 8.17: Horizontal Box Plot

The existing interface DataFrame.boxplot can still be used to draw box plots.

In [42]: df = pd.DataFrame(np.random.rand(10, 5))
In [43]: plt.figure();
In [44]: bp = df.boxplot()

Figure 8.18: Box Plot with existing interface

You can create a stratified box plot by using the by keyword argument to create groupings. For instance:

In [45]: df = pd.DataFrame(np.random.rand(10, 2), columns=["Col1", "Col2"])
In [46]: df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
In [47]: plt.figure();
In [48]: bp = df.boxplot(by="X")

Figure 8.19: Grouped Box Plot

You can also pass a subset of columns to plot, as well as group by multiple columns:

In [49]: df = pd.DataFrame(np.random.rand(10, 3), columns=["Col1", "Col2", "Col3"])
In [50]: df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
In [51]: df["Y"] = pd.Series(["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"])
In [52]: plt.figure();
In [53]: bp = df.boxplot(column=["Col1", "Col2"], by=["X", "Y"])

Figure 8.20: Box Plot grouped by multiple columns

In boxplot, the return type can be controlled with the return_type keyword. The valid choices are {"axes", "dict", "both", None}. Faceting, created by DataFrame.boxplot with the by keyword, affects the output type as well. Groupby.boxplot always returns a Series of return_type.

In [54]: np.random.seed(1234)
In [55]: df_box = pd.DataFrame(np.random.randn(50, 2))
In [56]: df_box["g"] = np.random.choice(["A", "B"], size=50)
In [57]: df_box.loc[df_box["g"] == "B", 1] += 3
In [58]: bp = df_box.boxplot(by="g")

Figure 8.21: Box plot grouped by g

The subplots above are split by the numeric columns first, then by the value of the g column. Grouping first, with df_box.groupby("g").boxplot(), instead splits the subplots by the value of g first, then by the numeric columns.

Area plot

Area charts depict a time-series relationship. Unlike line charts, they can also visually represent volume. Information is graphed on two axes, using data points connected by line segments. The area between the axis and this line is commonly emphasized with color or shading for legibility.

You can create area plots with Series.plot.area() and DataFrame.plot.area(). Area plots are stacked by default. To produce a stacked area plot, each column must contain either all positive or all negative values.

When the input data contains NaN, it is automatically filled with 0. If you want to drop the missing values or fill them with different values, use dataframe.dropna() or dataframe.fillna() before calling plot.

In [60]: df = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
In [61]: df.plot.area();

Figure 8.22: Area Plot

To produce an unstacked plot, pass stacked=False. The alpha value is set to 0.5 unless otherwise specified:

In [62]: df.plot.area(stacked=False);

Figure 8.23: Unstacked Area plot
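The two requirements stated for area plots (a consistent sign within each column for stacking, and explicit treatment of NaN) can be checked before plotting. The following is a minimal sketch on made-up random data. The helper can_stack is a hypothetical name introduced here for illustration, not part of pandas, and linear interpolation is only one possible way of filling the gap.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import numpy as np
import pandas as pd


def can_stack(frame):
    """Hypothetical helper: True if every column is all >= 0 or all <= 0 (NaN ignored)."""
    all_non_negative = (frame.fillna(0) >= 0).all()
    all_non_positive = (frame.fillna(0) <= 0).all()
    return bool((all_non_negative | all_non_positive).all())


rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((10, 3)), columns=["a", "b", "c"])
df.loc[3, "b"] = np.nan  # introduce one missing value

# Fill the gap by linear interpolation instead of relying on the default fill with 0.
filled = df.interpolate()

# Stack only when the sign requirement is met; otherwise fall back to an unstacked plot.
stacked_ok = can_stack(filled)
ax = filled.plot.area(stacked=stacked_ok)
```

Checking the sign condition first avoids the error that a stacked area plot raises when a column mixes positive and negative values.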

Scatter plot

A scatter plot is a set of data points plotted on an x and a y axis to represent two variables. The shape formed by the data points tells the story, most often revealing a correlation (positive or negative) in a large amount of data.

A scatter plot can be drawn with the DataFrame.plot.scatter() method. It requires numeric columns for the x and y axes, which are specified by the x and y keywords.

In [63]: df = pd.DataFrame(np.random.rand(50, 4), columns=["a", "b", "c", "d"])
In [64]: df.plot.scatter(x="a", y="b");

Figure 8.24: Scatter Plot

To plot multiple column groups on a single axis, repeat the plot method and specify the target ax. It is recommended to specify the color and label keywords to distinguish each group.

In [65]: ax = df.plot.scatter(x="a", y="b", color="DarkBlue", label="Group 1")
In [66]: df.plot.scatter(x="c", y="d", color="DarkGreen", label="Group 2", ax=ax);

Figure 8.25: Coloured and Labeled Scatter Plot

The keyword c may be given as the name of a column to provide colors for each point:

In [67]: df.plot.scatter(x="a", y="b", c="c", s=50);

Figure 8.26: Column Colored Scatter Plot

Hexagonal bin plot

Hexagonal binning is another way to manage the problem of having too many points that start to overlap. Hexagonal binning plots density rather than individual points. Points are binned into gridded hexagons, and the distribution (the number of points per hexagon) is displayed using either the color or the area of the hexagons.
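The idea of plotting density rather than points can be sketched with NumPy alone. Note the substitution: this sketch uses square cells (np.histogram2d) rather than hexagons, purely to show what "number of points per bin" means. The data are synthetic and the grid size of 25 is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(42)
n_points = 10_000
x = rng.standard_normal(n_points)
y = rng.standard_normal(n_points)

# Count how many points fall into each cell of a 25 x 25 grid.
counts, x_edges, y_edges = np.histogram2d(x, y, bins=25)

print("total points binned:", int(counts.sum()))
print("points in the busiest cell:", int(counts.max()))
```

Ten thousand overlapping markers collapse into a 25 x 25 table of counts, which is the quantity a binned plot maps to color.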

You can create hexagonal bin plots with DataFrame.plot.hexbin(). Hexbin plots can be a useful alternative to scatter plots if your data are too dense to plot each point individually.

In [69]: df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
In [70]: df["b"] = df["b"] + np.arange(1000)
In [71]: df.plot.hexbin(x="a", y="b", gridsize=25);

Figure 8.27: Hexagonal Bin Plot

A useful keyword argument is gridsize; it controls the number of hexagons in the x-direction and defaults to 100. A larger gridsize means more, smaller bins.

By default, a histogram of the counts around each (x, y) point is computed. You can specify alternative aggregations by passing values to the C and reduce_C_function arguments. C specifies the value at each (x, y) point, and reduce_C_function is a function of one argument that reduces all the values in a bin to a single number (e.g., mean, max, sum, std). In this example the positions are given by columns a and b, while the value is given by column z. The bins are aggregated with NumPy's max function.

In [72]: df = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
In [73]: df["b"] = df["b"] + np.arange(1000)
In [74]: df["z"] = np.random.uniform(0, 3, 1000)
In [75]: df.plot.hexbin(x="a", y="b", C="z", reduce_C_function=np.max, gridsize=25);

Figure 8.28: Aggregated Bin Plot

Pie plot

You can create a pie plot with DataFrame.plot.pie() or Series.plot.pie(). If your data includes any NaN, they will be automatically filled with 0. A ValueError will be raised if there are any negative values in your data.

In [76]: series = pd.Series(3 * np.random.rand(4), index=["a", "b", "c", "d"], name="series")
In [77]: series.plot.pie(figsize=(6, 6));

Figure 8.29: Pie Plot

For pie plots it is best to use square figures, i.e., a figure aspect ratio of 1. You can create the figure with equal width and height, or force the aspect ratio to be equal after plotting by calling ax.set_aspect('equal') on the returned axes object.

Note that a pie plot with a DataFrame requires that you either specify a target column with the y argument or pass subplots=True. When y is specified, a pie plot of the selected column is drawn. If subplots=True is specified, a pie plot for each column is drawn as a subplot. A legend is drawn in each pie plot by default; specify legend=False to hide it.

In [78]: df = pd.DataFrame(
   ....:     3 * np.random.rand(4, 2), index=["a", "b", "c", "d"], columns=["x", "y"]
   ....: )
   ....:
In [79]: df.plot.pie(subplots=True, figsize=(8, 4));

Figure 8.30: Pie Plot with selected columns

You can use the labels and colors keywords to specify the labels and colors of each wedge. If you want to hide the wedge labels, specify labels=None. If fontsize is specified, the value is applied to the wedge labels. Other keywords supported by matplotlib.pyplot.pie() can also be used.

If you pass values whose sum is less than 1.0, older Matplotlib releases draw only a partial pie (for example, a semicircle when the values sum to 0.5). Recent releases rescale the values so that they sum to 1 by default; passing normalize=False keeps the partial pie.

In [81]: series = pd.Series([0.1] * 4, index=["a", "b", "c", "d"], name="series2")
In [82]: series.plot.pie(figsize=(6, 6));

Figure 8.31: Pie plot as semicircle

Plotting with missing data

pandas tries to be pragmatic about plotting DataFrames or Series that contain missing data. Missing values are dropped, left out, or filled depending on the plot type.

Plot Type         NaN Handling
Line              Leave gaps at NaNs
Line (stacked)    Fill 0's
Bar               Fill 0's
Scatter           Drop NaNs
Histogram         Drop NaNs (column-wise)
Box               Drop NaNs (column-wise)
Area              Fill 0's
KDE               Drop NaNs (column-wise)
Hexbin            Drop NaNs
Pie               Fill 0's

Table 8.2: Handling Missing Data

If any of these defaults are not what you want, or if you want to be explicit about how missing values are handled, consider using fillna() or dropna() before plotting.
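Being explicit about missing values, as suggested above, might look like the following sketch. The small DataFrame is made up for illustration, and the three strategies shown (drop, fill with 0, fill with the column mean) are examples rather than recommendations.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, 4.0], "b": [2.0, 5.0, np.nan, 1.0]}
)

dropped = df.dropna()               # keep only rows without any missing value
zero_filled = df.fillna(0)          # treat every missing value as 0
mean_filled = df.fillna(df.mean())  # replace each NaN with the mean of its column

print(dropped)
print(mean_filled)
```

Any of the resulting frames can then be passed to a plotting method such as plot.line() or plot.bar(), so the figure no longer depends on the per-plot defaults in Table 8.2.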

8.5 SUMMARY

• A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.
• A pandas DataFrame can be created from various inputs, such as:
  • Lists
  • dict
  • Series
  • NumPy ndarrays
  • Another DataFrame
• Plotting methods allow for a handful of plot styles other than the default line plot.
• There are several plotting functions in pandas.plotting that take a Series or DataFrame as an argument.
• A histogram is a chart that groups numeric data into bins and displays the bins as segmented columns. Histograms are used to depict the distribution of a dataset: how often values fall into ranges.
• A bar plot displays categorical data with rectangular bars whose length or height corresponds to the value of each data point. Bar plots can be drawn with vertical or horizontal bars.
• A box plot lets you examine the distribution of data. One box plot appears for each attribute element. Each box plot displays the minimum, first quartile, median, third quartile, and maximum values.
• Area plots depict a time-series relationship.
• A scatter plot is a set of data points plotted on an x and a y axis to represent two variables.

8.6 KEYWORDS

• Data frame: two-dimensional labeled data structure with columns of potentially different types
• Visualization: technique for creating images, diagrams, or animations

• Box Plot: graphical depiction of groups of numerical data
• Area Plot: visual display of quantitative data
• Histogram: representation of the distribution of numerical data
• Scatter Plot: diagram where each value in the data set is represented by a dot

8.7 LEARNING ACTIVITY

1. A number raised to the third power is a cube. Plot the first five cubic numbers, and then plot the first 5000 cubic numbers.
2. A number raised to the third power is a cube. Plot the first five cubic numbers, and then plot the first 5000 cubic numbers. Apply a colormap to your cubes plot.

8.8 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is a dataframe?
2. List the ways to create a dataframe.
3. Differentiate between a bar plot and a pie plot.
4. Is it possible to create a pie plot in a semicircle shape?
5. What is a histogram?

Long Questions
1. Illustrate the key concepts of a dataframe.
2. Describe various ways of creating a dataframe.
3. Illustrate the various visualization methods being used.

4. Discuss the various kinds of plot available for visualization.
5. Which plot is chosen for a particular application? Justify.

B. Multiple Choice Questions

1. Observe the output figure. Identify the code that produces this output.
a. import matplotlib.pyplot as plt
   plt.plot([1,2,3],[4,5,1])
   plt.show()
b. import matplotlib.pyplot as plt
   plt.plot([1,2],[4,5])
   plt.show()
c. import matplotlib.pyplot as plt
   plt.plot([2,3],[5,1])
   plt.show()
d. import matplotlib.pyplot as plt
   plt.plot([1,3],[4,1])
   plt.show()

2. Read the code:
   import matplotlib.pyplot as plt
   plt.plot(3,2)
   plt.show()
Identify the output of the above code.

(Options a to d are figures.)

3. Identify the right type of chart using the following hints.
Hint 1: This chart is often used to visualize a trend in data over intervals of time.
Hint 2: The line in this type of chart is often drawn chronologically.
a. Line chart
b. Bar chart
c. Pie chart
d. Scatter plot

4. Read the statements given below. Identify the right option for a pie chart.

Statement A: To make a pie chart with Matplotlib, we can use the plt.pie() function.
Statement B: The autopct parameter allows us to display the percentage value using Python string formatting.
a. Statement A is correct
b. Statement B is correct
c. Both the statements are correct
d. Both the statements are wrong

5. Point out the wrong combination with regard to the kind keyword for graph plotting.
a. 'scatter' for scatter plots
b. 'kde' for hexagonal bin plots
c. 'pie' for pie plots
d. None of these

6. We can create a scatter plot matrix using the __________ method in pandas.plotting (pandas.tools.plotting in older pandas releases).
a. sca_matrix
b. scatter_matrix
c. DataFrame.plot
d. All of these

7. Which of the following values is passed to the kind keyword for a bar plot?
a. bar
b. kde
c. hexbin
d. None of these

8. __________ plots are used to visually assess the uncertainty of a statistic.
a. Lag
b. RadViz
c. Bootstrap
d. None of these

9. Which of the following plots are often used for checking randomness in time series?
a. Autocausation
b. Autorank
c. Autocorrelation
d. None of these

Answers: 1-a, 2-c, 3-a, 4-c, 5-b, 6-b, 7-a, 8-c, 9-c

8.9 REFERENCES

Text Books:
• Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, updated for Python 3, Shroff/O'Reilly Publishers, 2016.
• Michael Urban, Joel Murach, Mike Murach, "Murach's Python Programming", Dec. 2016.

Reference Books:
• Guido van Rossum and Fred L. Drake Jr., "An Introduction to Python", revised and updated for Python 3.2.
• Jake VanderPlas, "Python Data Science Handbook", O'Reilly Publishers, 2016.

UNIT - 9: GRAPHICAL EXPLORATORY DATA ANALYSIS (EDA) I

Structure
9.0. Learning Objectives
9.1. Introduction
9.2. Introduction to Dataset and 2D Scatter Plot
9.3. Pair Plots
9.4. Histogram and Introduction to Probability Density Function
9.5. Summary
9.6. Keywords
9.7. Learning Activity
9.8. Unit End Questions
9.9. References

9.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Discuss data analysis using Python
• Learn about 2D and 3D scatter plots
• Describe the probability density function

9.1 INTRODUCTION

Exploratory data analysis (EDA) is a method of analyzing and investigating data sets to summarize their main characteristics. Scientists often use data visualization methods to discover patterns, spot anomalies, check assumptions or test a hypothesis through summary statistics and graphical representations.

EDA goes beyond formal modeling or hypothesis testing to give maximum insight into the data set and its structure, and to identify influential variables. It can also help in selecting the most suitable data analysis technique for a given project. Specific knowledge, such as a ranked list of relevant factors to be used as guidelines, can also be obtained using EDA.
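The summary-statistics side of EDA mentioned above can be sketched in a few lines of pandas. The measurements below are made-up values chosen for illustration, not a real data set, and the column names are assumptions.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "length": [5.1, 4.9, 6.3, 5.8, 7.1, 6.5],
        "width": [3.5, 3.0, 3.3, 2.7, 3.0, 3.0],
        "group": ["A", "A", "B", "B", "C", "C"],
    }
)

# Count, mean, standard deviation and quartiles for each numeric column.
summary = df.describe()

# Average of each numeric column within each group.
group_means = df.groupby("group")[["length", "width"]].mean()

print(summary)
print(group_means)
```

A table like summary is often the first thing inspected, before any plot is drawn, because it exposes ranges and suspicious values at a glance.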

Types of EDA

EDA techniques are either graphical or quantitative (non-graphical). Graphical methods summarize the data in a diagrammatic or visual way, whereas quantitative methods involve the calculation of summary statistics. Both types are further divided into univariate and multivariate methods. Univariate methods consider one variable (data column) at a time, while multivariate methods consider two or more variables at a time to explore relationships. Thus, there are four types of EDA in all: univariate graphical, multivariate graphical, univariate non-graphical, and multivariate non-graphical. The graphical methods provide a more subjective analysis, and the quantitative methods are more objective.

Univariate non-graphical: This is the simplest form of data analysis among the four options. The data being analysed consists of just a single variable. The main purpose of this analysis is to describe the data and to find patterns.

Univariate graphical: Unlike the non-graphical method, the graphical method provides the full picture of the data. The three main methods of analysis under this type are the histogram, the stem and leaf plot, and the box plot. The histogram represents the total count of cases for a range of values. Along with the data values, the stem and leaf plot shows the shape of the distribution. The box plot graphically depicts a summary of the minimum, first quartile, median, third quartile, and maximum.

Multivariate non-graphical: This type of EDA generally depicts the relationship between multiple variables through cross-tabulation or statistics.

Multivariate graphical: This type of EDA displays the relationship between two or more sets of data. A common example is a grouped bar chart, where each group represents a level of one of the variables and each bar within the group represents a level of the other variable.

9.2 INTRODUCTION TO DATASET AND 2D SCATTER PLOT

Iris flower data set

The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper. The data set contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One

class is linearly separable from the other two; the latter are not linearly separable from each other.

The four features (sepal length, sepal width, petal length, and petal width) are used to determine whether a flower is Setosa, Versicolor or Virginica. Sepal length, sepal width, petal length and petal width are called features, variables, input variables or independent variables. Species is called the label, dependent variable, output variable, class, class label or response label.

Figure 9.1: Iris Data Set

Scatter Plot

In machine learning and data science, one of the most commonly used plots for simple data visualization is the scatter plot. It shows where each point of the dataset lies with respect to any two or three features (columns). Scatter plots are available in 2D as well as 3D. The 2D scatter plot is the most common one, and it is primarily used to find patterns, clusters and the separability of the data.

The code snippet for a scatter plot is shown below. The snippets in this section assume that matplotlib.pyplot is imported as plt and that iris is a DataFrame holding the data set, with columns such as sepal_length, sepal_width and species (for example, as returned by seaborn's load_dataset("iris")).

    plt.scatter(iris['sepal_length'], iris['sepal_width'])
    plt.xlabel('Sepal length')

    plt.ylabel('Sepal width')
    plt.title('Scatter plot on Iris dataset')

Figure 9.2: Scatter Plot on Iris Data set

Here all the points are marked at the positions given by their x and y values. Let's now color the points by species. Passing a list of three colors for 150 points does not work, so each point is given the color of its species:

    # species labels are assumed to be 'setosa', 'versicolor' and 'virginica'
    colors = iris['species'].map({'setosa': 'r', 'versicolor': 'b', 'virginica': 'g'})
    plt.scatter(iris['sepal_length'], iris['sepal_width'], c=colors)
    plt.xlabel('Sepal length')
    plt.ylabel('Sepal width')
    plt.title('Scatter plot on Iris dataset')

Figure 9.3: Colored Scatter plot on Iris Data set

3D Scatter Plot

A 3D scatter plot is the most basic form of three-dimensional plotting. It displays three variables of a dataset as points in Cartesian coordinates. Matplotlib's mplot3d toolkit enables three-dimensional plotting; the example below uses Plotly Express instead, which provides the Iris data set as px.data.iris().

    import plotly.express as px

    df = px.data.iris()
    fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
                        color='species')
    fig.show()

Figure 9.4: 3-D Scatter Plot

It is possible to customize the style of the figure through the parameters of px.scatter_3d for some options, or by updating the traces or the layout of the figure.

    import plotly.express as px

    df = px.data.iris()
    fig = px.scatter_3d(df, x='sepal_length', y='sepal_width', z='petal_width',
                        color='petal_length', size='petal_length', size_max=18,
                        symbol='species', opacity=0.7)
    # tight layout
    fig.update_layout(margin=dict(l=0, r=0, b=0, t=0))

Figure 9.5: Customized 3D Scatter Plot

9.3 PAIR PLOTS

Pair plots are a simple (one-line-of-code simple!) way to visualize the relationships between all pairs of variables. A pair plot produces a matrix of relationships between the variables in your data for an instant examination of the data.

We can use scatter plots for 2D data with Matplotlib, and for 3D data we can use Plotly. What do we do when we have four or more dimensions? This is where the pair plot from the seaborn package comes into play.

Say we have n features in the data. The pair plot creates an (n x n) figure in which each diagonal plot is a histogram of the feature corresponding to that row, and every other plot combines the feature of its row on the y axis with the feature of its column on the x axis.

The code snippet for a pair plot of the Iris dataset is provided below. Recent seaborn releases use the height keyword for the size of each facet; older releases called it size.

    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.set_style("whitegrid")
    sns.pairplot(iris, hue="species", height=3)
    plt.show()

Figure 9.6: Pair Plot on Iris Dataset

From the high-level overview that the pair plot gives, we can see which two features best explain or separate the data, and then use a scatter plot of those two features to explore further. From the above plot we can conclude that petal length and petal width are the two features that separate the data very well.

Since we get n x n plots for n features, a pair plot may become hard to read when there are many features, say 10 or more. In such cases, the best bet is to use a dimensionality reduction technique to map the data onto a 2D plane and visualize it with a 2D scatter plot.

9.4 HISTOGRAM AND INTRODUCTION TO PROBABILITY DENSITY FUNCTION

A histogram is a summary of the variation in a measured variable. It shows the number of samples that occur in a category: this is called a frequency distribution. Histograms make

sense for categorical variables, but a histogram can also be derived from a continuous variable.

Histogram on Sepal Length

Here data is assumed to be a DataFrame holding the Iris data set with columns named SepalLengthCm and PetalLengthCm.

    plt.figure(figsize=(10, 7))
    x = data["SepalLengthCm"]
    plt.hist(x, bins=20, color="green")
    plt.title("Sepal Length in cm")
    plt.xlabel("Sepal_Length_cm")
    plt.ylabel("Count")

Figure 9.7: Histogram on Iris Data Set based on Sepal Length

Histogram for Petal Length

    plt.figure(figsize=(10, 7))
    x = data.PetalLengthCm
    plt.hist(x, bins=20, color="green")
    plt.title("Petal Length in cm")
    plt.xlabel("Petal_Length_cm")

    plt.ylabel("Count")
    plt.show()

Figure 9.8: Histogram on Iris Data Set based on Petal Length

Instead of the relative frequencies, we can also make a histogram of the empirical density distribution. Writing $n_j$ for the number of observations in bin $j$, $N$ for the total number of observations, $p_j = n_j / N$ for the empirical probability and $dy$ for the bin width, the empirical density is defined as $f_j = p_j / dy = n_j / (N \, dy)$, i.e., it is equal to the empirical probability divided by the interval length, or bin width. The advantage is that the empirical densities are insensitive to changes in the bin width dy, in contrast to the relative frequencies, since a smaller bin width results in a smaller relative frequency.

Probability Density Function

A probability density function (PDF) is the continuous version of the histogram with densities (you can see this by imagining infinitesimally small bin widths). It specifies how the probability density is distributed over the range of values that a random variable can take. The PDF of a random variable is often described by an analytical function. A large number of such statistical distribution functions have been defined; well-known examples are the uniform distribution and the normal distribution.
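The difference between relative frequency and empirical density can be checked numerically. The sketch below uses a synthetic, normally distributed sample (not the Iris data), and empirical_density is a helper name introduced here for illustration.

```python
import numpy as np


def empirical_density(values, n_bins):
    """Return relative frequencies, empirical densities and bin widths."""
    counts, edges = np.histogram(values, bins=n_bins)
    widths = np.diff(edges)
    rel_freq = counts / counts.sum()  # empirical probability per bin
    density = rel_freq / widths       # empirical probability divided by bin width
    return rel_freq, density, widths


rng = np.random.default_rng(1)
sample = rng.normal(loc=5.8, scale=0.8, size=5000)  # synthetic observations

results = {n_bins: empirical_density(sample, n_bins) for n_bins in (10, 40)}
for n_bins, (rel_freq, density, widths) in results.items():
    print(n_bins, "bins:",
          "largest relative frequency =", round(float(rel_freq.max()), 3),
          "largest density =", round(float(density.max()), 3))
```

With four times as many bins the largest relative frequency drops sharply, while the largest density stays on a similar scale. The same densities are returned by np.histogram(sample, bins=n_bins, density=True).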

For example, given a random sample of a variable, we might want to know the shape of the probability distribution, the most likely value, the spread of values, and other properties. Knowing the probability distribution of a random variable helps to calculate moments of the distribution, like the mean and variance. It is also useful for more general considerations, like determining whether an observation is unlikely or very unlikely and might be an outlier or anomaly.

The problem is that we may not know the probability distribution of a random variable. We rarely do, because we do not have access to all possible outcomes of a random variable. In fact, all we have access to is a sample of observations. As such, we must select a probability distribution. This problem is referred to as probability density estimation, or simply "density estimation," as we are using the observations in a random sample to estimate the general density of probabilities beyond just the sample of data we have available.

As with the histogram, the probability that a random variable $Y$ with PDF $f$ takes a value in a certain interval $[y_j, y_m]$ is equal to the area below the function over that interval (see the figure below). It can be calculated by taking the integral over the interval: $P(y_j \le Y \le y_m) = \int_{y_j}^{y_m} f(y) \, dy$.
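The integral over an interval can be approximated numerically. The following sketch rests on stated assumptions: the sample is synthetic, the density is estimated parametrically by assuming a normal shape and estimating its mean and standard deviation, and normal_pdf is a helper defined here rather than a library function.

```python
import math
import numpy as np


def normal_pdf(y, mean, std):
    """Density of the normal distribution with the given mean and standard deviation."""
    return np.exp(-0.5 * ((y - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))


rng = np.random.default_rng(7)
sample = rng.normal(loc=5.8, scale=0.8, size=2000)  # synthetic observations

# Parametric density estimation: assume a normal shape and estimate its parameters.
mean_hat = sample.mean()
std_hat = sample.std(ddof=1)

# The probability of landing in [y_j, y_m] is the area under the PDF (trapezoid rule).
y_j, y_m = 5.0, 6.5
grid = np.linspace(y_j, y_m, 1001)
pdf_values = normal_pdf(grid, mean_hat, std_hat)
prob_from_pdf = float(np.sum((pdf_values[:-1] + pdf_values[1:]) / 2.0 * np.diff(grid)))

# Compare with the fraction of observations that actually fall in the interval.
prob_from_sample = float(np.mean((sample >= y_j) & (sample <= y_m)))

print("area under the estimated PDF:", round(prob_from_pdf, 3))
print("fraction of the sample in the interval:", round(prob_from_sample, 3))
```

The two numbers should be close when the assumed distribution fits the data; a large gap would suggest that the normal shape is a poor choice for this variable.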

Figure 9.9: Histogram and PDF on Iris Data set

9.5 SUMMARY

• Exploratory data analysis (EDA) is a method of analyzing and investigating data sets to summarize their main characteristics.
• EDA goes beyond formal modeling or hypothesis testing to give maximum insight into the data set and its structure, and to identify influential variables. It can also help in selecting the most suitable data analysis technique for a given project.
• EDA techniques are either graphical or quantitative (non-graphical).
• The Iris flower data set, or Fisher's Iris data set, is a multivariate data set introduced by the British statistician, eugenicist, and biologist Ronald Fisher in his 1936 paper.
• The 2D scatter plot is the most common one, and it is primarily used to find patterns, clusters and the separability of the data.
• A 3D scatter plot is the most basic form of three-dimensional plotting; it displays three variables of a dataset as points in Cartesian coordinates.
• Pair plots are a really simple (one-line-of-code simple!) way to visualize the relationships between all pairs of variables.
• A histogram is a summary of the variation in a measured variable. It shows the number of samples that occur in a category: this is called a frequency distribution.

9.6 KEYWORDS

• EDA: Exploratory Data Analysis
• PDF: Probability Density Function
• Pair Plots: plots of the pairwise relationships in a dataset
• 3D Scatter Plot: the most basic form of three-dimensional plotting

9.7 LEARNING ACTIVITY

1. Plot a histogram of the petal lengths of the 50 samples of Iris versicolor using matplotlib/seaborn's default settings. Recall that to specify the default seaborn style, you can use sns.set(), where sns is the alias that seaborn is imported as.
2. "A 3D scatter plot is constructed using the seaborn library." Comment on this statement.

9.8 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is the need for EDA?
2. Compare EDA and data visualization.
3. What is the advantage of a pair plot?
4. What is the benefit of using a probability density function?
5. What is a histogram?

Long Questions
1. Illustrate the concepts of EDA.

2. Describe how 2D and 3D scatter plots are used in data analysis.
3. Illustrate histogram plotting using the Iris dataset.
4. Compare the scatter plot, histogram and pair plot.
5. Discuss the role of the probability density function.

B. Multiple Choice Questions

1. Which of the following graphs has the properties shown in the figure below?
a. Exploratory
b. Inferential
c. Causal
d. None of these

2. Which of the following dimension types of graph is shown in the figure below?
a. one-dimensional
b. two-dimensional
c. three-dimensional
d. None of these

3. Which of the following gave rise to the need for graphs in data analysis?
a. Data visualization
b. Communicating results

c. Decision making
d. All of these

4. Which of the following is a characteristic of an exploratory graph?
a. Made slowly
b. Axes are not cleaned up
c. Color is used for personal information
d. All of these

5. Point out the correct statement.
a. Coplots are one-dimensional data graphs
b. Exploratory graphs are made quickly
c. Exploratory graphs are made relatively less in number
d. All of these

6. A scatter diagram is a graphical component of ____________
a. Regression analysis
b. Demand
c. Supply
d. Profit

7. A scatter diagram represents the relationship between _________ and ________
a. Cause, effects
b. Cause, problem
c. Effects, output
d. Production, productivity

8. A histogram gives the _____ nature of process variability.
a. Static
b. Dynamic
c. Negative
d. Positive

9. Which of the following standard probability density functions is applicable to discrete random variables?
a. Gaussian Distribution
b. Poisson Distribution
c. Rayleigh Distribution
d. Exponential Distribution

10. A table with all possible values of a random variable and their corresponding probabilities is called ___________
a. Probability Mass Function
b. Probability Density Function
c. Cumulative distribution function
d. Probability Distribution

Answers: 1-a, 2-b, 3-d, 4-c, 5-a, 6-a, 7-a, 8-a, 9-b, 10-d

9.9 REFERENCES

Text Books:
• Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, updated for Python 3, Shroff/O'Reilly Publishers, 2016.
• Michael Urban, Joel Murach, Mike Murach, "Murach's Python Programming", Dec. 2016.

Reference Books:
• Guido van Rossum and Fred L. Drake Jr., "An Introduction to Python", revised and updated for Python 3.2.
• Jake VanderPlas, "Python Data Science Handbook", O'Reilly Publishers, 2016.

UNIT - 10: GRAPHICAL EXPLORATORY DATA ANALYSIS (EDA) II

Structure
10.0. Learning Objectives
10.1. Introduction
10.2. Adjusting the Number of Bins in a Histogram
10.3. Bee Swarm Plot
10.4. Univariate Analysis using PDF and CDF
10.5. Plot Data in ECDF
10.6. Summary
10.7. Keywords
10.8. Learning Activity
10.9. Unit End Questions
10.10. References

10.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Perform data analysis using Python
• Describe bee swarm plotting
• Perform univariate analysis using PDF and CDF

10.1 INTRODUCTION

Exploratory data analysis is mainly performed using the following methods:
• Univariate analysis: provides summary statistics for each field in the raw data set, i.e., a summary of one variable only. Examples: CDF, PDF, box plot, violin plot.
• Bivariate analysis: is performed to find the relationship between each variable in the dataset and the target variable of interest, i.e., it uses two variables and finds the relationship between them. Examples: box plot, violin plot.
• Multivariate analysis: is performed to understand interactions between different fields in the dataset, i.e., interactions between more than two variables. Examples: pair plot and 3D scatter plot.
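The three levels of analysis can be illustrated numerically before any plot is drawn. In this sketch the measurements are made up, the CDF is approximated by the empirical CDF of the sample, and ecdf is a helper name introduced here.

```python
import numpy as np

# Made-up measurements: three variables observed on the same six items.
data = np.array([
    [5.1, 3.5, 1.4],
    [4.9, 3.0, 1.4],
    [6.3, 3.3, 4.7],
    [5.8, 2.7, 4.1],
    [7.1, 3.0, 5.9],
    [6.5, 3.0, 5.8],
])


def ecdf(values):
    """Return the sorted values and the fraction of observations <= each of them."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y


# Univariate: empirical CDF of the first variable.
x, y = ecdf(data[:, 0])

# Bivariate: correlation between the first and the third variable.
pair_corr = np.corrcoef(data[:, 0], data[:, 2])[0, 1]

# Multivariate: correlation matrix of all three variables at once.
corr_matrix = np.corrcoef(data, rowvar=False)

print(np.column_stack([x, y]))
print(np.round(corr_matrix, 2))
```

Plotting y against x gives the step-shaped curve of an empirical CDF, and the correlation matrix is the numerical counterpart of a pair plot.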

Axis Labels
Graphs are often self-explanatory, but a title, axis labels, and a legend that explains what each line represents are frequently necessary. To start:

    import matplotlib.pyplot as plt

    x = [1, 2, 3]
    y = [5, 7, 4]
    x2 = [1, 2, 3]
    y2 = [10, 14, 12]

This way, we have two lines that we can plot. Next:

    plt.plot(x, y, label='First Line')
    plt.plot(x2, y2, label='Second Line')

Here, we plot as we have seen already, only this time we add another parameter, "label". This assigns a name to the line, which we can later show in the legend. The rest of our code:

    plt.xlabel('Plot Number')
    plt.ylabel('Important var')
    plt.title('Interesting Graph\nCheck it out')
    plt.legend()
    plt.show()

With plt.xlabel and plt.ylabel, we assign labels to the respective axes. Next, we set the plot's title with plt.title, and then we invoke the default legend with plt.legend(). The resulting graph:

Figure 10.1 Plot with Label

Example 2: Simple axes labels
Label the axes of a plot.

    import numpy as np
    import matplotlib.pyplot as plt

    fig = plt.figure()
    fig.subplots_adjust(top=0.8)
    ax1 = fig.add_subplot(211)
    ax1.set_ylabel('volts')
    ax1.set_title('a sine wave')

    t = np.arange(0.0, 1.0, 0.01)
    s = np.sin(2 * np.pi * t)
    line, = ax1.plot(t, s, lw=2)

    # Fixing random state for reproducibility
    np.random.seed(19680801)

    ax2 = fig.add_axes([0.15, 0.1, 0.7, 0.3])
    n, bins, patches = ax2.hist(np.random.randn(1000), 50)
    ax2.set_xlabel('time (s)')

    plt.show()

Figure 10.2 Sub Plot with Label

10.2 ADJUSTING THE NUMBER OF BINS IN A HISTOGRAM
Tools that generate histograms usually have a default algorithm for selecting bin boundaries, but you will likely want to experiment with the binning parameters to choose something that is representative of your data. Wikipedia has an extensive section on rules of thumb for choosing an appropriate number of bins and their sizes, but ultimately it is worth combining domain knowledge with a fair amount of experimentation to find what works best for your purposes.

Bin size has an inverse relationship with the number of bins. The larger the bin size, the fewer bins are needed to cover the whole range of data. The smaller the bin size, the more bins are needed. It is worth taking some time to test different bin sizes to see how the distribution looks in each one, then choose the plot that represents the data

best. If you have too many bins, the data distribution will look rough, and it will be difficult to discern the signal from the noise. On the other hand, with too few bins, the histogram will lack the detail needed to discern any useful pattern in the data. Setting the bin size of a Matplotlib histogram specifies the size of the groups into which the data is sorted.

TO CREATE BINS OF A DESIRED WIDTH
Use matplotlib.pyplot.hist(data, bins=edges), where edges is a list containing the boundaries of the bins. For example, setting bins to [10, 20, 30] creates a histogram with one bin containing values between 10 and 20 and another containing values between 20 and 30.

    data = np.random.normal(50, 10, size=10000)
    ax = plt.hist(data)

Figure 10.3 Bins of Desired Width

    bins_list = [-10, 20, 40, 50, 60, 80, 110]
    ax = plt.hist(data, bins=bins_list)

TO CREATE BINS OF EQUAL SIZE
Use matplotlib.pyplot.hist(data, bins=n) to create a histogram with n bins of equal width. To create bins of approximately a given width w, set bins to math.ceil(x) with x equal to (data.max() - data.min())/w.

    ax = plt.hist(data)

    import math

    w = 3
    n = math.ceil((data.max() - data.min()) / w)
    ax = plt.hist(data, bins=n)

Figure 10.4 Bins of Equal Size

10.3 BEE SWARM PLOT
A bee swarm plot is a one-dimensional scatter plot similar to a strip chart, but with various methods to separate coincident points so that each point is visible. Beeswarm also introduces features unavailable in a strip chart, such as the ability to control the color and plotting character of each point.

Seaborn's swarmplot is similar to stripplot, except that the points are adjusted so that they do not overlap, which gives a better representation of the distribution of values. A swarm plot can be drawn on its own, but it is also a good complement to a box plot. Passing pandas objects as input is preferable because the associated names will be used to annotate the axes. This type of plot is sometimes known as a "beeswarm".

Syntax: seaborn.swarmplot(x=None, y=None, hue=None, data=None, order=None, hue_order=None, dodge=False, orient=None, color=None, palette=None, size=5, edgecolor='gray', linewidth=0, ax=None, **kwargs)

Parameters:
x, y, hue: Inputs for plotting long-form data.
data: Dataset for plotting.

color: Color for all of the elements.
size: Radius of the markers, in points.

Example

    import seaborn as sns

    sns.set_theme(style="whitegrid")
    tips = sns.load_dataset("tips")
    ax = sns.swarmplot(x=tips["total_bill"])

Figure 10.5 Horizontal Swarm Plot

Group the swarms by a categorical variable:

    ax = sns.swarmplot(x="day", y="total_bill", data=tips)

Figure 10.6 Grouped Swarm Plot

10.4 UNIVARIATE ANALYSIS USING PDF AND CDF
Univariate analysis, as the name says, simply means analysis using a single variable. This analysis gives the frequency or count of occurrences of the variable and lets us understand the distribution of that variable across its values.

Probability Density Function
In a PDF plot, the X-axis is the feature being analyzed and the Y-axis is the density: the frequency of occurrence of values around that X value, scaled so that the total area under the curve is 1. Hence the term "Density" in PDF.

Seaborn is a library that provides various types of plots for analysis. In the code below, haberman_data is a DataFrame with the columns age and surv_status.

    import seaborn as sns

    sns.set_style("whitegrid")
    sns.FacetGrid(haberman_data, hue='surv_status', height=5).map(sns.distplot, 'age').add_legend()

Note that sns.distplot is deprecated since Seaborn 0.11; sns.histplot or sns.kdeplot can be mapped in its place.

Figure 10.7 Examples for PDF

From a PDF plot, we cannot read off exactly how many data points lie within a range, below a value, or above a value.

Cumulative Distribution Function
The cumulative distribution function (CDF) of a random variable is another way to describe the distribution of a random variable. The advantage of the CDF is that it can be defined for any kind of random variable (discrete, continuous, and mixed). The CDF is very useful for finding the proportion of data points below or above a particular value.

For a discrete random variable,

$$F(x) = P(X \le x) = \sum_{x_i \le x} P(X = x_i)$$

For a continuous random variable with probability density function f,

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt$$

Uniform Probability Distribution: The Uniform Distribution, also known as the Rectangular Distribution, is a type of Continuous Probability Distribution.

It has a continuous random variable X restricted to a finite interval [a, b], and its probability density function f(x) is constant over this interval. The Uniform probability density function is defined as:

$$f(x) = \begin{cases} \dfrac{1}{b - a}, & a \le x \le b \\ 0, & \text{otherwise} \end{cases}$$

Matplotlib is a plotting library for Python that works together with NumPy, Python's numerical mathematics library.

The cumulative distribution function (CDF) of a real-valued random variable X, or just the distribution function of X, evaluated at x, is the probability that X will take a value less than or equal to x.

Properties of CDF:
• Every cumulative distribution function F(x) is non-decreasing.
• The maximum value of the CDF is 1: F(x) approaches 1 as x increases towards the upper end of the range of X.
• The CDF ranges from 0 to 1.

Example:

    # defining the libraries
    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd
    # Jupyter notebook magic; omit this line when running as a plain script
    %matplotlib inline

    # No of Data points
    N = 500

    # initializing random values
    data = np.random.randn(N)

    # getting data of the histogram
    count, bins_count = np.histogram(data, bins=10)

    # finding the PDF of the histogram using count values
    # (each entry is the fraction of data points falling in that bin)
    pdf = count / sum(count)

    # using numpy np.cumsum to calculate the CDF
    # (the same result can be obtained by looping over the PDF values and adding)
    cdf = np.cumsum(pdf)

    # plotting PDF and CDF against the right edge of each bin
    plt.plot(bins_count[1:], pdf, color="red", label="PDF")
    plt.plot(bins_count[1:], cdf, label="CDF")
    plt.legend()

Figure 10.8 Plotting PDF and CDF

10.5 PLOT DATA IN ECDF
ECDF stands for "empirical cumulative distribution function". An ECDF maps every data point in the dataset to a quantile, a number between 0 and 1 that indicates the cumulative fraction of data points less than or equal to that data point.

Like histograms, ECDFs show the distribution of a single variable, but in a more efficient way. We have seen previously how histograms can be misleading because of different bin sizing options. That is not the case with ECDFs: they show every data point, and the plot can be interpreted in only one way.
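The contrast with histograms can be checked numerically. The snippet below is a minimal sketch on synthetic data; the sample, the seed and the three bin counts are assumptions chosen for illustration. It shows that the histogram summary changes with the number of bins, while the ECDF coordinates (sorted values against cumulative fractions) depend only on the data.

```python
import numpy as np

# Synthetic sample (an assumption for this sketch, not data from the unit).
rng = np.random.default_rng(0)
data = rng.normal(50, 10, size=1000)

# Histogram: the summary depends on the chosen number of bins.
for n_bins in (5, 20, 80):
    counts, edges = np.histogram(data, bins=n_bins)
    width = edges[1] - edges[0]
    print(f"{n_bins:>2} bins: width={width:.2f}, tallest bin holds {counts.max()} points")

# ECDF: one (x, y) pair per data point, with no binning parameter to choose.
x = np.sort(data)
y = np.arange(1, len(x) + 1) / len(x)

# Fraction of observations less than or equal to 50, read directly from the sorted data.
frac_le_50 = np.searchsorted(x, 50, side="right") / len(x)
print(f"ECDF at 50: {frac_le_50:.3f}")
```

Whatever bin count is used, every histogram accounts for the same 1000 points; only the grouping changes. The ECDF has no comparable parameter, which is why it admits a single reading.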

Think of ECDFs as scatter plots, because they also have points along the X and Y axes. To be more precise, here is what ECDFs show on both axes:
X-axis: the quantity we are measuring (Age in this example)
Y-axis: the proportion of data points that are less than or equal to the respective X value (at each point X, Y% of the values are smaller than or equal to X)

To make this sort of visualization, we need to do a bit of calculation first. Two arrays are required:
X: sorted data (the Age column sorted from lowest to highest)
Y: evenly spaced values whose maximum is 1 (as in 100%)

The following Python snippet can be used to calculate the X and Y values for a single column in a Pandas DataFrame, and to plot them:

    def ecdf(df, column):
        # sorted values on the X axis, cumulative fractions on the Y axis
        x = np.sort(df[column])
        y = np.arange(1, len(x) + 1) / len(x)
        return x, y

    def plot_ecdf(x, y, size=(14, 8), title='ECDF', xlab='Age', ylab='Percentage', color='#087E8B'):
        plt.figure(figsize=size)
        plt.scatter(x, y, color=color)
        plt.title(title, size=20)
        plt.xlabel(xlab, size=14)
        plt.ylabel(ylab, size=14)

Let's use these functions to make an ECDF plot of the Age attribute:

    x, y = ecdf(df, 'Age')
    plot_ecdf(x, y, title='ECDF of passenger ages', xlab='Age')
    plt.show()

Figure 10.9 ECDF plot

10.6 SUMMARY
• Univariate analysis: provides summary statistics for each field in the raw data set, that is, a summary of one variable at a time. Ex: CDF, PDF, Box plot, Violin plot.
• Bivariate analysis: finds the relationship between each variable in the dataset and the target variable of interest, that is, the relationship between two variables. Ex: Box plot, Violin plot.
• Multivariate analysis: examines interactions between different fields in the dataset, that is, interactions among more than two variables. Ex: Pair plot and 3D scatter plot.
• Bin size has an inverse relationship with the number of bins. The larger the bin size, the fewer bins are needed to cover the whole range of data.
• A bee swarm plot is a one-dimensional scatter plot similar to a strip chart, but with various methods to separate coincident points so that each point is visible.
• Univariate analysis, as the name says, simply means analysis using a single variable. This analysis gives the frequency or count of occurrences of the variable and lets us understand the distribution of that variable across its values.
• ECDF stands for "empirical cumulative distribution function". An ECDF maps every data point in the dataset to a quantile, a number between 0 and 1 that indicates the cumulative fraction of data points less than or equal to that data point.

10.7 KEYWORDS
• EDA: Exploratory Data Analysis
• PDF: Probability Density Function
• CDF: Cumulative Distribution Function
• Univariate: describes a type of data that contains only one attribute or characteristic
• Bee Swarm Plot: a one-dimensional scatter plot in which coincident points are separated

10.8 LEARNING ACTIVITY
1. Suppose you toss a coin twice. Let X be the number of observed heads. Find the CDF of X.
2. Consider the Iris data set. Compute ECDFs for each of the three species using the ecdf() function. The variables setosa_petal_length, versicolor_petal_length, and virginica_petal_length are all in your namespace. Unpack the ECDFs into x_set, y_set, x_vers, y_vers and x_virg, y_virg, respectively.

10.9 UNIT END QUESTIONS
A. Descriptive Questions
Short Questions
1. What is the need for EDA?
2. Compare EDA and Data Visualization.
3. What is the advantage of a pair plot?
4. What is the benefit of using a probability density function?
5. What is a histogram?
Long Questions
1. Illustrate the concepts of EDA.

2. Describe how 2D and 3D scatter plots are used in Data Analysis.
3. Illustrate histogram plotting using the Iris Dataset.
4. Compare Scatter Plot, Histogram and Pair Plot.
5. Discuss the role of the Probability Density Function.

B. Multiple Choice Questions
1. A variable that can assume any value between two given points is called ___________
a. Continuous random variable
b. Discrete random variable
c. Irregular random variable
d. Uncertain random variable

2. If the values taken by a random variable are negative, the negative values will have ___________
a. Positive probability
b. Negative Probability
c. May have negative or positive probabilities
d. Insufficient data

3. If f(x) is a probability density function of a continuous random variable, then $\int_{-\infty}^{\infty} f(x)\,dx$ = ?
a. 0
b. 1
c. undefined
d. Insufficient data

4. The variable that assigns a real number value to an event in a sample space is called ___________
a. Random variable
b. Defined variable
c. Uncertain variable
d. Static variable

