If no color is specified, Matplotlib will automatically cycle through a set of default colors for multiple lines. Similarly, you can adjust the line style using the linestyle keyword:

In[7]: plt.plot(x, x + 0, linestyle='solid')
       plt.plot(x, x + 1, linestyle='dashed')
       plt.plot(x, x + 2, linestyle='dashdot')
       plt.plot(x, x + 3, linestyle='dotted');

       # For short, you can use the following codes:
       plt.plot(x, x + 4, linestyle='-')   # solid
       plt.plot(x, x + 5, linestyle='--')  # dashed
       plt.plot(x, x + 6, linestyle='-.')  # dashdot
       plt.plot(x, x + 7, linestyle=':');  # dotted

Figure 6.4: Line Styles in Plot

Axes Limits

Matplotlib does a decent job of choosing default axes limits for your plot, but sometimes it's nice to have finer control. The most basic way to adjust axis limits is to use the plt.xlim() and plt.ylim() functions:

In[9]: plt.plot(x, np.sin(x))
       plt.xlim(-1, 11)
       plt.ylim(-1.5, 1.5);

151 CU IDOL SELF LEARNING MATERIAL (SLM)
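The line-style codes and axis limits described above can be tried together in one short sketch. It assumes that a color code and a line-style code may be combined in a single format string (standard Matplotlib behavior), and the choice of sine and cosine curves is only illustrative. Calling plt.xlim() or plt.ylim() with no arguments returns the current limits, which is a quick way to confirm the settings.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)

# A color code and a line-style code can share one format string.
plt.plot(x, np.sin(x), '--g')   # dashed green line
plt.plot(x, np.cos(x), ':r')    # dotted red line

# Set explicit axis limits, then read them back to confirm them.
plt.xlim(-1, 11)
plt.ylim(-1.5, 1.5)
x_limits = plt.xlim()
y_limits = plt.ylim()
print(x_limits, y_limits)
```

In a notebook with %matplotlib inline the figure appears automatically; in a script, add plt.show() at the end.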
Figure 6.5: Example for Axes Limit

Labeling Plots

Titles and axis labels are the simplest labels for a plot; there are functions that can be used to quickly set them:

In[14]: plt.plot(x, np.sin(x))
        plt.title("A Sine Curve")
        plt.xlabel("x")
        plt.ylabel("sin(x)");

Figure 6.6: Examples of axis labels and title

Simple Scatter Plots

Another commonly used plot type is the simple scatter plot, a close cousin of the line plot.
Instead of points being joined by line segments, here the points are represented individually with a dot, circle, or other shape. We'll start by setting up the notebook for plotting and importing the functions we will use:

In[1]: %matplotlib inline
       import matplotlib.pyplot as plt
       plt.style.use('seaborn-whitegrid')  # named 'seaborn-v0_8-whitegrid' in Matplotlib 3.6 and later
       import numpy as np

Scatter Plots with plt.plot

In[2]: x = np.linspace(0, 10, 30)
       y = np.sin(x)
       plt.plot(x, y, 'o', color='black');

Figure 6.7: Example of scatter plot

The third argument in the function call is a character that represents the type of symbol used for the plotting. Just as you can specify options such as '-' and '--' to control the line style, the marker style has its own set of short string codes. The full list of available symbols can be seen in the documentation of plt.plot, or in Matplotlib's online documentation.

Scatter Plots with plt.scatter

A more powerful method of creating scatter plots is the plt.scatter function, which can be used very similarly to the plt.plot function:

In[6]: plt.scatter(x, y, marker='o');
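The marker codes mentioned above for plt.plot can be compared side by side with a short sketch. The four extra markers ('x', '+', 's', '^') and the vertical offsets are illustrative choices rather than part of the original example, and plt.legend() is used only to label them.

```python
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 30)
y = np.sin(x)

# '-' draws a solid line, 'o' adds circle markers, 'k' selects black.
line, = plt.plot(x, y, '-ok')

# A few other marker codes, each shifted upward so they stay visible.
for offset, marker in enumerate(['x', '+', 's', '^'], start=1):
    plt.plot(x, y + offset, marker, label=f"marker='{marker}'")

plt.legend()
```

Passing only a marker code, as in the loop, plots the points without a connecting line, which is how plt.plot produces a scatter plot.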
Figure 6.8: Simple Scatter plot

The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where the properties of each individual point (size, face color, edge color, etc.) can be individually controlled or mapped to data.

6.4 SUMMARY

• NumPy provides an efficient interface to store and operate on dense data buffers.
• NumPy arrays are like Python's built-in list type, but NumPy arrays provide much more efficient storage and data operations as the arrays grow larger in size.
• numpy.reciprocal() returns the reciprocal of its argument, element-wise.
• numpy.power() treats elements in the first input array as the base and returns each one raised to the power of the corresponding element in the second input array.
• Matplotlib is a plotting library for Python. It is used along with NumPy to provide an environment that is an effective open-source alternative to MATLAB.
• The simple scatter plot is a commonly used plot type.
• A more powerful method of creating scatter plots is the plt.scatter function, which can be used very similarly to the plt.plot function.

6.5 KEYWORDS

• NumPy: library used for working with arrays
• Scatter plot: diagram where each value in the data set is represented by a dot
• Matplotlib: plotting library used for 2D graphics
• pyplot: collection of functions that make Matplotlib work like MATLAB

6.6 LEARNING ACTIVITY

1. Consider you are plotting a graph using Matplotlib and NumPy. Can different characters be used to display the points in the plot? Comment.
2. Consider you have two arrays. The first array is of size 3 × 3 and the second array is of size 1 × 3. Is it possible to perform arithmetic operations on these arrays? Justify.

6.7 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is the difference between NumPy arrays and arrays?
2. List the numeric operations performed on a NumPy array.
3. Give the difference between narray and ndarray.
4. How are plots labeled in Matplotlib?
5. Compare the divide and floor_divide functions.

Long Questions
1. Illustrate the numeric operations in NumPy.
2. Describe the methods used for changing the colour and style in graphs.
3. Illustrate how graphs are drawn using NumPy and Matplotlib.
4. Describe the construction of scatter plots.
5. Discuss how pie charts are constructed using Matplotlib.

B. Multiple Choice Questions

1. The most important object defined in NumPy is an N-dimensional array type called?
a. ndarray
b. narray
c. nd_array
d. darray

2. Which of the following statements is false?
a. ndarray is also known as the axis array.
b. ndarray.dataitemSize is the buffer containing the actual elements of the array.
c. NumPy's main object is the homogeneous multidimensional array.
d. In NumPy, dimensions are called axes.

3. To create sequences of numbers, NumPy provides a function __________ analogous to range that returns arrays instead of lists.
a. arange
b. aspace
c. aline
d. All of these

4. Which of the following returns an array of ones with the same shape and type as a given array?
a. all_like
b. ones_like
c. one_alike
d. All of these

5. The ________ function returns its argument with a modified shape, whereas the ________ method modifies the array itself.
a. resize, reshape
b. reshape2, resize
c. reshape, resize
d. All of these

Answers
1 – a, 2 – a, 3 – a, 4 – b, 5 – c
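The behaviors behind questions 3 to 5 can be checked directly with a small sketch. It uses only np.arange, np.ones_like, np.reshape and the ndarray.resize method; the array sizes are arbitrary illustrative choices.

```python
import numpy as np

a = np.arange(6)             # like range(6), but returns an array
ones = np.ones_like(a)       # same shape and dtype as a, filled with ones

b = np.reshape(a, (2, 3))    # function: returns a reshaped result
print(a.shape, b.shape)      # a keeps its shape: (6,) (2, 3)

c = np.arange(6)
c.resize((2, 3))             # method: changes c itself
print(c.shape)               # (2, 3)
```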
6.8 REFERENCES

Text Books:
• Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, updated for Python 3, Shroff/O'Reilly Publishers, 2016.
• Michael Urban, Joel Murach, Mike Murach, "Murach's Python Programming", December 2016.

Reference Books:
• Guido van Rossum and Fred L. Drake Jr, "An Introduction to Python", revised and updated for Python 3.2.
• Jake VanderPlas, "Python Data Science Handbook", O'Reilly Publishers, 2016.
UNIT - 7: PYTHON FOR DATA ANALYSIS AND VISUALIZATION II

Structure
7.0. Learning Objectives
7.1. Introduction
7.2. Plotting a Simple Line Graph
7.3. Random Walks
7.4. Rolling Dice with Plotly
7.5. Pandas-Data frame
    7.5.1 The Pandas Data Frame Object
    7.5.2 Data Selection in DataFrame
7.6. Summary
7.7. Keywords
7.8. Learning Activity
7.9. Unit End Questions
7.10. References

7.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Perform data analysis using Python
• Learn about plotting graphs
• Describe the concept of random walks

7.1 INTRODUCTION

Data visualization is the process of interpreting data and presenting it in a pictorial or graphical format. Currently, we are living in the era of big data, where data has been described as a raw material for business. The volume of data used in businesses, industries, research organizations, and technological development is massive, and it is rapidly growing every day. The more data we collect and analyze, the more capable we can be in making critical business decisions. However, with the enormous growth of data, it has become harder for businesses to extract crucial information from the available data. That is where the importance of data visualization becomes clear. Data visualization helps people understand
the significance of data by summarizing and presenting a huge amount of data in a simple and easy-to-understand format in order to communicate the information clearly and effectively.

Data and visual analytics is an emerging field concerned with analyzing, modeling, and visualizing complex high-dimensional data. Python provides numerous libraries for data analysis and visualization, mainly NumPy, pandas, Matplotlib, and seaborn. In this unit, we discuss the pandas library for data analysis and visualization, an open-source library built on top of NumPy. It allows fast analysis as well as data cleaning and preparation.

7.2 PLOTTING A SIMPLE LINE GRAPH

For plotting graphs in Python, we will use the Matplotlib library. Matplotlib is used along with NumPy data to plot many types of graph. From Matplotlib we use the pyplot module, which is used to plot two-dimensional data.

The functions used in plotting a graph are given below:
• np.arange(start, end): Returns evenly spaced values from the half-open interval [start, end).
• plt.title(): Gives a title to the graph. The title is passed as the argument to this function.
• plt.xlabel(): Sets the label of the X-axis. The label is passed as the argument to this function.
• plt.ylabel(): Sets the label of the Y-axis. The label is passed as the argument to this function.
• plt.plot(): Plots the y values passed to it against the x values.
• plt.show(): Displays all open figures.

Example:

    # importing the modules
    import numpy as np
    import matplotlib.pyplot as plt

    # data to be plotted
    x = np.arange(1, 11)
    y = x * x

    # plotting
    plt.title("Line graph")
    plt.xlabel("X axis")
    plt.ylabel("Y axis")
    plt.plot(x, y, color="red")
    plt.show()
Output: (figure not reproduced)

Example:

    # importing the library
    import numpy as np
    import matplotlib.pyplot as plt

    # data to be plotted
    x = np.arange(1, 11)
    y = np.array([100, 10, 300, 20, 500, 60, 700, 80, 900, 100])

    # plotting
    plt.title("Line graph")
    plt.xlabel("X axis")
    plt.ylabel("Y axis")
    plt.plot(x, y, color="green")
    plt.show()

Output: (figure not reproduced)
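Because np.arange uses a half-open interval, it is easy to be off by one when preparing data for a plot. The following minimal sketch prints the values used in the first line graph example; the step-size call at the end is an additional illustration, not part of the original examples.

```python
import numpy as np

# np.arange(start, end) includes start and excludes end.
x = np.arange(1, 11)
y = x * x

print(x.tolist())  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(y.tolist())  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]

# An optional third argument sets the spacing between values.
half_steps = np.arange(0, 2, 0.5)
print(half_steps.tolist())  # [0.0, 0.5, 1.0, 1.5]
```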
7.3 RANDOM WALKS

A random walk is a path that has no clear direction but is determined by a series of random decisions, each of which is left entirely to chance. You might imagine a random walk as the path an ant would take if it had lost its mind and took every step in a random direction.

Random walks have practical applications in nature, physics, biology, chemistry, and economics. For example, a pollen grain floating on a drop of water moves across the surface of the water because it is constantly being pushed around by water molecules. Molecular motion in a water drop is random, so the path a pollen grain traces out on the surface is a random walk. The code we're about to write models many real-world situations.

Creating the RandomWalk Class

To create a random walk, we'll create a RandomWalk class, which will make random decisions about which direction the walk should take. The class needs three attributes: one variable to store the number of points in the walk and two lists to store the x- and y-coordinate values of each point in the walk.

    from random import choice

    class RandomWalk():
        """A class to generate random walks."""

        def __init__(self, num_points=5000):
            """Initialize attributes of a walk."""
            self.num_points = num_points

            # All walks start at (0, 0).
            self.x_values = [0]
            self.y_values = [0]

To make random decisions, we'll store possible choices in a list and use choice() to decide which choice to use each time a decision is made. We then set the default number of points in a walk to 5000, which is large enough to generate some interesting patterns but small enough to generate walks quickly. Then we make two lists to hold the x- and y-values, and we start each walk at point (0, 0).

Choosing Directions

We will use fill_walk(), as shown here, to fill our walk with points and determine the direction of each step. Add this method to the RandomWalk class in random_walk.py:

        def fill_walk(self):
            """Calculate all the points in the walk."""

            # Keep taking steps until the walk reaches the desired length.
            while len(self.x_values) < self.num_points:

                # Decide which direction to go and how far to go in that direction.
                x_direction = choice([1, -1])
                x_distance = choice([0, 1, 2, 3, 4])
                x_step = x_direction * x_distance

                y_direction = choice([1, -1])
                y_distance = choice([0, 1, 2, 3, 4])
                y_step = y_direction * y_distance

                # Reject moves that go nowhere.
                if x_step == 0 and y_step == 0:
                    continue

                # Calculate the next x and y values.
                next_x = self.x_values[-1] + x_step
                next_y = self.y_values[-1] + y_step
                self.x_values.append(next_x)
                self.y_values.append(next_y)

The main part of this method tells Python how to simulate four random decisions: Will the walk go right or left? How far will it go in that direction? Will it go up or down? How far will it go in that direction?

We use choice([1, -1]) to choose a value for x_direction, which returns either 1 for right movement or -1 for left. Next, choice([0, 1, 2, 3, 4]) tells Python how far to move in that direction (x_distance) by randomly selecting an integer from 0 to 4. (The inclusion of a 0 allows us to take steps along the y-axis alone as well as steps that have movement along both axes.)

We determine the length of each step in the x and y directions by multiplying the direction of movement by the distance chosen. A positive result for x_step moves us right, a negative result moves us left, and 0 moves us vertically. A positive result for y_step means move up, negative means move down, and 0 means move horizontally. If both x_step and y_step are 0, the walk would stay where it is, so we use continue to skip that step and choose again.

To get the next x-value for our walk, we add the value in x_step to the last value stored in x_values and do the same for the y-values. Once we have these values, we append them to x_values and y_values.

Plotting the Random Walk

    import matplotlib.pyplot as plt

    from random_walk import RandomWalk

    # Make a random walk and plot the points.
    rw = RandomWalk()
    rw.fill_walk()
    plt.scatter(rw.x_values, rw.y_values, s=15)
    plt.show()
Figure 7.1: Random Walk

Generating Multiple Random Walks

Every random walk is different, and it's fun to explore the various patterns that can be generated. One way to use the preceding code to make multiple walks without having to run the program several times is to wrap it in a while loop:

    import matplotlib.pyplot as plt

    from random_walk import RandomWalk

    # Keep making new walks, as long as the program is active.
    while True:
        # Make a random walk and plot the points.
        rw = RandomWalk()
        rw.fill_walk()
        plt.scatter(rw.x_values, rw.y_values, s=15)
        plt.show()

        keep_running = input("Make another walk? (y/n): ")
        if keep_running == 'n':
            break

This code will generate a random walk, display it in Matplotlib's viewer, and pause with the viewer open. When you close the viewer, you'll be asked whether you want to generate another walk. Answer y, and you should be able to generate walks that stay near the starting point, that wander off mostly in one direction, that have thin sections connecting larger groups of points, and so on.

Styling the Walk

To style the walk, we identify the characteristics we want to emphasize, such as where the walk began, where it ended, and the path taken. Next, we identify the characteristics to deemphasize, like tick marks and labels. The result should be a simple visual representation that clearly communicates the path taken in each random walk.

Coloring the Points
We'll use a colormap to show the order of the points in the walk and then remove the black outline from each dot so the color of the dots will be clearer. To color the points according to their position in the walk, we pass the c argument a list containing the position of each point:

    while True:
        # Make a random walk and plot the points.
        rw = RandomWalk()
        rw.fill_walk()

        point_numbers = list(range(rw.num_points))
        plt.scatter(rw.x_values, rw.y_values, c=point_numbers,
                    cmap=plt.cm.Blues, edgecolor='none', s=15)
        plt.show()

        keep_running = input("Make another walk? (y/n): ")
        if keep_running == 'n':
            break

Plotting the Starting and Ending Points

In addition to coloring points to show their position along the walk, it would be nice to see where each walk begins and ends. To do so, we can plot the first and last points individually once the main series has been plotted. We'll make the end points larger and color them differently to make them stand out:

    while True:
        # (create the walk and point_numbers as in the previous listing)
        plt.scatter(rw.x_values, rw.y_values, c=point_numbers,
                    cmap=plt.cm.Blues, edgecolor='none', s=15)

        # Emphasize the first and last points.
        plt.scatter(0, 0, c='green', edgecolors='none', s=100)
        plt.scatter(rw.x_values[-1], rw.y_values[-1], c='red',
                    edgecolors='none', s=100)

        plt.show()
        # (prompt to continue as in the previous listing)

To show the starting point, we plot point (0, 0) in green in a larger size (s=100) than the rest of the points. To mark the end point, we plot the last x- and y-value in the walk in red with a size of 100. Make sure you insert this code just before the call to plt.show() so the starting and ending points are drawn on top of all the other points.

When you run this code, you should be able to spot exactly where each walk begins and ends.
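The walk logic can also be checked without any plotting. The sketch below repeats the RandomWalk class so that it runs on its own, with one change that is an assumption of this example rather than part of the text above: a hypothetical helper method get_step() combines the direction and distance choices. The seed(42) call is likewise only there to make reruns repeatable.

```python
from random import choice, seed


class RandomWalk:
    """Generate a two-dimensional random walk."""

    def __init__(self, num_points=5000):
        self.num_points = num_points
        # All walks start at (0, 0).
        self.x_values = [0]
        self.y_values = [0]

    def get_step(self):
        """Hypothetical helper: return a direction times a distance."""
        direction = choice([1, -1])
        distance = choice([0, 1, 2, 3, 4])
        return direction * distance

    def fill_walk(self):
        """Add points until the walk has num_points points."""
        while len(self.x_values) < self.num_points:
            x_step = self.get_step()
            y_step = self.get_step()
            # Reject moves that go nowhere.
            if x_step == 0 and y_step == 0:
                continue
            self.x_values.append(self.x_values[-1] + x_step)
            self.y_values.append(self.y_values[-1] + y_step)


seed(42)  # assumption: fixed seed so that reruns give the same walk
rw = RandomWalk(1000)
rw.fill_walk()

# Recover the individual steps from consecutive points.
steps = [
    (x1 - x0, y1 - y0)
    for x0, x1, y0, y1 in zip(rw.x_values, rw.x_values[1:],
                              rw.y_values, rw.y_values[1:])
]

print("points:", len(rw.x_values))
print("start:", (rw.x_values[0], rw.y_values[0]))
print("end:", (rw.x_values[-1], rw.y_values[-1]))
print("largest single move:", max(max(abs(dx), abs(dy)) for dx, dy in steps))
```

Every step moves at most 4 units along each axis and never stays in place, which is what the continue statement in fill_walk() guarantees.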
Adding Plot Points

Let's increase the number of points to give us more data to work with. To do so, we increase the value of num_points when we make a RandomWalk instance and adjust the size of each dot when drawing the plot:

    while True:
        # Make a random walk and plot the points.
        rw = RandomWalk(50000)
        rw.fill_walk()

        # Plot the points and show the plot.
        point_numbers = list(range(rw.num_points))
        plt.scatter(rw.x_values, rw.y_values, c=point_numbers,
                    cmap=plt.cm.Blues, edgecolor='none', s=1)
        # (emphasize the end points, show the plot, and prompt as in the previous listings)

Figure 7.2

This example creates a random walk with 50,000 points (to mirror real-world data) and plots each point at size s=1. The resulting walk is wispy and cloud-like.

7.4 ROLLING DICE WITH PLOTLY

Rolling two dice results in larger numbers and a different distribution of results. The following code creates two six-sided dice (D6) to simulate the way we roll a pair of dice:

    from die import Die
    from plotly.graph_objs import Bar, Layout
    from plotly import offline

    # Create two D6 dice.
    die_1 = Die()
    die_2 = Die()

    # Make some rolls, and store results in a list.
    results = []
    for roll_num in range(1000):
        result = die_1.roll() + die_2.roll()
        results.append(result)

    # Analyze the results.
    frequencies = []
    max_result = die_1.num_sides + die_2.num_sides
    for value in range(2, max_result+1):
        frequency = results.count(value)
        frequencies.append(frequency)

    # Visualize the results.
    x_values = list(range(2, max_result+1))
    data = [Bar(x=x_values, y=frequencies)]

    x_axis_config = {'title': 'Result', 'dtick': 1}
    y_axis_config = {'title': 'Frequency of Result'}
    my_layout = Layout(title='Results of rolling two D6 dice 1000 times',
                       xaxis=x_axis_config, yaxis=y_axis_config)
    offline.plot({'data': data, 'layout': my_layout}, filename='d6_d6.html')
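The analysis step of the listing above can be sketched without Plotly to see why the two-dice distribution peaks in the middle. The sketch repeats a minimal Die class so that it runs on its own, and it makes two illustrative substitutions that are assumptions of this example: collections.Counter replaces the repeated results.count() calls, and seed(0) fixes the random sequence. The expected column is the exact count of ways to reach each total out of the 36 equally likely outcomes, scaled to 1000 rolls.

```python
from collections import Counter
from random import randint, seed


class Die:
    """A single die with a configurable number of sides."""

    def __init__(self, num_sides=6):
        self.num_sides = num_sides

    def roll(self):
        return randint(1, self.num_sides)


seed(0)  # assumption: fixed seed so that reruns give the same rolls
die_1, die_2 = Die(), Die()

# Roll both dice 1000 times and tally the totals in one pass.
results = [die_1.roll() + die_2.roll() for _ in range(1000)]
counts = Counter(results)

max_result = die_1.num_sides + die_2.num_sides
frequencies = [counts[value] for value in range(2, max_result + 1)]

# Number of ways each total can occur with two six-sided dice.
ways = Counter(a + b for a in range(1, 7) for b in range(1, 7))

for value, freq in zip(range(2, max_result + 1), frequencies):
    expected = 1000 * ways[value] / 36
    print(f"{value:2d}  observed={freq:4d}  expected={expected:6.1f}")
```

A total of 7 can be made in six ways and a total of 2 or 12 in only one way, so the bar chart is tallest at 7 and falls off symmetrically toward both ends.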
ROLLING DICE WITH PYGAL

Pygal produces scalable vector graphics (SVG) files, which are useful in visualizations that are presented on differently sized screens because they scale automatically to fit the viewer's screen. If you plan to use your visualizations online, consider using Pygal so your work will look good on any device people use to view your visualizations.

The study of rolling dice is often used in mathematics to explain various types of data analysis. But it also has real-world applications in casinos and other gambling scenarios, as well as in the way games like Monopoly and many role-playing games are played.

Creating the Die Class

    from random import randint

    class Die():
        """A class representing a single die."""

        def __init__(self, num_sides=6):
            """Assume a six-sided die."""
            self.num_sides = num_sides

        def roll(self):
            """Return a random value between 1 and number of sides."""
            return randint(1, self.num_sides)

The __init__() method takes one optional argument. With this class, when an instance of our die is created, the number of sides will be six if no argument is included. If an argument is included, that value is used to set the number of sides on the die.

The roll() method uses the randint() function to return a random number between 1 and the number of sides, inclusive.

    from die import Die

    # Create a D6.
    die = Die()

    # Make some rolls, and store results in a list.
    results = []
    for roll_num in range(100):
        result = die.roll()
        results.append(result)

    print(results)

7.5 PANDAS-DATAFRAME

Pandas is a newer package built on top of NumPy that provides an efficient implementation of a DataFrame. DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data. As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs. Pandas, and in particular its Series and DataFrame objects, builds on the NumPy array structure and provides efficient access to the sorts of "data munging" tasks that occupy much of a data scientist's time.

7.5.1 The Pandas Data Frame Object

The DataFrame can be thought of either as a generalization of a NumPy array, or as a specialization of a Python dictionary.

DataFrame as a generalized NumPy array

If a Series is an analog of a one-dimensional array with flexible indices, a DataFrame is an analog of a two-dimensional array with both flexible row indices and flexible column names. Just as you might think of a two-dimensional array as an ordered sequence of aligned one-dimensional columns, you can think of a DataFrame as a sequence of aligned Series objects. Here, by "aligned" we mean that they share the same index.

In[18]: import pandas as pd
        area_dict = {'California': 423967, 'Texas': 695662, 'New York': 141297,
                     'Florida': 170312, 'Illinois': 149995}
        area = pd.Series(area_dict)
        area

(The output shows the labels sorted alphabetically, as older pandas versions did; recent versions keep the dictionary's insertion order.)

Out[18]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
         dtype: int64

In the next example, population is a Series of state populations built in the same way as area:

In[19]: population = pd.Series({'California': 38332521, 'Texas': 26448193,
                                'New York': 19651127, 'Florida': 19552860,
                                'Illinois': 12882135})
        states = pd.DataFrame({'population': population, 'area': area})
        states
Out[19]:               area  population
         California  423967    38332521
         Florida     170312    19552860
         Illinois    149995    12882135
         New York    141297    19651127
         Texas       695662    26448193

Like the Series object, the DataFrame has an index attribute that gives access to the index labels:

In[20]: states.index
Out[20]: Index(['California', 'Florida', 'Illinois', 'New York', 'Texas'], dtype='object')

Additionally, the DataFrame has a columns attribute, which is an Index object holding the column labels:

In[21]: states.columns
Out[21]: Index(['area', 'population'], dtype='object')

Thus, the DataFrame can be thought of as a generalization of a two-dimensional NumPy array, where both the rows and columns have a generalized index for accessing the data.

DataFrame as specialized dictionary

Similarly, we can also think of a DataFrame as a specialization of a dictionary. Where a dictionary maps a key to a value, a DataFrame maps a column name to a Series of column data.

In[22]: states['area']
Out[22]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
         Name: area, dtype: int64

Constructing DataFrame objects

A Pandas DataFrame can be constructed in a variety of ways.

From a single Series object. A DataFrame is a collection of Series objects, and a single-column DataFrame can be constructed from a single Series:

In[23]: pd.DataFrame(population, columns=['population'])
Out[23]:             population
         California    38332521
         Florida       19552860
         Illinois      12882135
         New York      19651127
         Texas         26448193

From a list of dicts. Any list of dictionaries can be made into a DataFrame. We'll use a simple list comprehension to create some data:

In[24]: data = [{'a': i, 'b': 2 * i} for i in range(3)]
        pd.DataFrame(data)
Out[24]:    a  b
         0  0  0
         1  1  2
         2  2  4

From a dictionary of Series objects. As we saw before, a DataFrame can be constructed from a dictionary of Series objects as well:
In[26]: pd.DataFrame({'population': population, 'area': area})
Out[26]:               area  population
         California  423967    38332521
         Florida     170312    19552860
         Illinois    149995    12882135
         New York    141297    19651127
         Texas       695662    26448193

7.5.2 Data Selection in DataFrame

Recall that a DataFrame acts in many ways like a two-dimensional or structured array, and in other ways like a dictionary of Series structures sharing the same index. These analogies can be helpful to keep in mind as we explore data selection within this structure.

DataFrame as a dictionary

The first analogy we will consider is the DataFrame as a dictionary of related Series objects. Let's return to our example of areas and populations of states:

In[18]: area = pd.Series({'California': 423967, 'Texas': 695662,
                          'New York': 141297, 'Florida': 170312,
                          'Illinois': 149995})
        pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                         'New York': 19651127, 'Florida': 19552860,
                         'Illinois': 12882135})
        data = pd.DataFrame({'area': area, 'pop': pop})
        data
Out[18]:               area       pop
         California  423967  38332521
         Florida     170312  19552860
         Illinois    149995  12882135
         New York    141297  19651127
         Texas       695662  26448193

The individual Series that make up the columns of the DataFrame can be accessed via dictionary-style indexing of the column name:

In[19]: data['area']
Out[19]: California    423967
         Florida       170312
         Illinois      149995
         New York      141297
         Texas         695662
         Name: area, dtype: int64

DataFrame as two-dimensional array

We can also view the DataFrame as an enhanced two-dimensional array. We can examine the raw underlying data array using the values attribute. The output below has a third column because it assumes that a derived density column has first been added with data['density'] = data['pop'] / data['area']:

In[24]: data.values
Out[24]: array([[ 4.23967000e+05, 3.83325210e+07, 9.04139261e+01],
                [ 1.70312000e+05, 1.95528600e+07, 1.14806121e+02],
                [ 1.49995000e+05, 1.28821350e+07, 8.58837628e+01],
                [ 1.41297000e+05, 1.96511270e+07, 1.39076746e+02],
                [ 6.95662000e+05, 2.64481930e+07, 3.80187404e+01]])

With this picture in mind, we can do many familiar array-like operations on the DataFrame itself. For example, we can transpose the full DataFrame to swap rows and columns:

    data.T

7.6 SUMMARY

• For plotting graphs in Python, we use the Matplotlib library.
• Matplotlib is used along with NumPy data to plot many types of graph. From Matplotlib we use the pyplot module, which is used to plot two-dimensional data.
• np.arange(start, end): Returns evenly spaced values from the half-open interval [start, end).
• plt.title(): Gives a title to the graph. The title is passed as the argument to this function.
• plt.xlabel(): Sets the label of the X-axis. The label is passed as the argument to this function.
• A random walk is a path that has no clear direction but is determined by a series of random decisions, each of which is left entirely to chance.
• DataFrames are essentially multidimensional arrays with attached row and column labels, and often with heterogeneous types and/or missing data.
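As a worked recap of Section 7.5.2, the following sketch builds the states DataFrame, adds a derived column with dictionary-style assignment, and transposes the result. The column name density is taken to mean population divided by area, which is an assumption that matches the third column of the data.values output shown in that section.

```python
import pandas as pd

area = pd.Series({'California': 423967, 'Texas': 695662,
                  'New York': 141297, 'Florida': 170312,
                  'Illinois': 149995})
pop = pd.Series({'California': 38332521, 'Texas': 26448193,
                 'New York': 19651127, 'Florida': 19552860,
                 'Illinois': 12882135})
data = pd.DataFrame({'area': area, 'pop': pop})

# Dictionary-style assignment adds a new column.
data['density'] = data['pop'] / data['area']

# Array-style view: the raw values and the transpose.
print(data.values.shape)          # (5, 3)
transposed = data.T
print(transposed.index.tolist())  # ['area', 'pop', 'density']
print(round(data['density']['California'], 2))  # 90.41
```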
7.7 KEYWORDS

• Pandas: open-source data analysis and manipulation tool
• DataFrame: two-dimensional labeled data structure with columns of potentially different types
• Random walk: a path that consists of a succession of random steps
• Plotly: an interactive open-source plotting library

7.8 LEARNING ACTIVITY

1. Modified Random Walks: In the class RandomWalk, x_step and y_step are generated from the same set of conditions. The direction is chosen randomly from the list [1, -1] and the distance from the list [0, 1, 2, 3, 4]. Modify the values in these lists to see what happens to the overall shape of your walks. Try a longer list of choices for the distance, such as 0 through 8, or remove the -1 from the x or y direction list.
2. When you roll two dice, you usually add the two numbers together to get the result. Create a visualization that shows what happens if you multiply these numbers instead.

7.9 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is the difference between NumPy arrays and arrays?
2. What is a random walk?
3. What is the need for Matplotlib?
4. How is a DataFrame used in data visualization?
5. Can we change the label names of the X and Y axes?
Long Questions
1. Illustrate the key benefits of using Matplotlib for data visualization.
2. How are simple line plots and scatter plots drawn? Give an example.
3. Describe the working of a random walk.
4. Illustrate the implementation of rolling two dice with Plotly.
5. Discuss the role of the DataFrame in data visualization.

B. Multiple Choice Questions

1. NumPy is often used along with packages like?
a. Node.js
b. Matplotlib
c. SciPy
d. Both b and c

2. What will be the output of the following code?

    import numpy as np
    a = np.array([1, 2, 3])
    print(a)

a. [[1, 2, 3]]
b. [1]
c. [1 2 3]
d. Error

3. If a dimension is given as ____ in a reshaping operation, the other dimensions are automatically calculated.
a. Zero
b. One
c. Negative one
d. Infinite
4. Each built-in data type has a character code that uniquely identifies it. What is the meaning of the code "M"?
a. timedelta
b. datetime
c. objects
d. Unicode

5. What will be the output of the following code?

    import numpy as np
    a = np.array([1, 2, 3], dtype=complex)
    print(a)

a. [[1.+0.j 2.+0.j 3.+0.j]]
b. [1.+0.j]
c. Error
d. [1.+0.j 2.+0.j 3.+0.j]

Answers
1 – d, 2 – c, 3 – c, 4 – b, 5 – d

7.10 REFERENCES

Text Books:
• Allen B. Downey, "Think Python: How to Think Like a Computer Scientist", 2nd edition, updated for Python 3, Shroff/O'Reilly Publishers, 2016.
• Michael Urban, Joel Murach, Mike Murach, "Murach's Python Programming", December 2016.

Reference Books:
• Guido van Rossum and Fred L. Drake Jr, "An Introduction to Python", revised and updated for Python 3.2.
• Jake VanderPlas, "Python Data Science Handbook", O'Reilly Publishers, 2016.
UNIT - 8: PYTHON FOR DATA ANALYSIS AND VISUALIZATION III

Structure
8.0. Learning Objectives
8.1. Introduction to Data frame
8.2. Creating Data frame
8.3. Visualization
8.4. Other Plots
8.5. Summary
8.6. Keywords
8.7. Learning Activity
8.8. Unit End Questions
8.9. References

8.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Perform data analysis using Python
• Learn about the features of the DataFrame
• Describe various tools for visualization

8.1 INTRODUCTION TO DATA FRAME

A DataFrame is a two-dimensional data structure, i.e., data is aligned in a tabular fashion in rows and columns.

Features of DataFrame
• Columns can be of different types
• Size is mutable
• Labeled axes (rows and columns)
• Arithmetic operations can be performed on rows and columns

Structure

Let us assume that we are creating a DataFrame with students' data.
Figure 8.1: Sample Data Frame

pandas.DataFrame

A pandas DataFrame can be created using the following constructor:

    pandas.DataFrame(data, index, columns, dtype, copy)

The parameters of the constructor are as follows:

Sr.No | Parameter | Description
1 | data | Takes various forms, such as ndarray, Series, map, list, dict, constants, or another DataFrame.
2 | index | Row labels to use for the resulting frame. Optional; defaults to np.arange(n) if no index is passed.
3 | columns | Column labels to use for the resulting frame. Optional; defaults to np.arange(n) if no column labels are passed.
4 | dtype | Data type of each column.
5 | copy | Boolean flag that controls whether the input data is copied (default False in older pandas versions).

Table 8.1: Parameters of Constructor

8.2 CREATING DATA FRAME

A pandas DataFrame can be created using various inputs, such as:
• Lists
• dict
• Series
• NumPy ndarrays
• Another DataFrame

In the subsequent sections of this chapter, we will see how to create a DataFrame using these inputs.

Create an Empty DataFrame

The most basic DataFrame that can be created is an empty DataFrame.

Example

    # import the pandas library and alias it as pd
    import pandas as pd
    df = pd.DataFrame()
    print(df)

Output:

    Empty DataFrame
    Columns: []
Index: []

Create a DataFrame from Lists
The DataFrame can be created using a single list or a list of lists.

Example 1
import pandas as pd
data = [1,2,3,4,5]
df = pd.DataFrame(data)
print(df)

Output
   0
0  1
1  2
2  3
3  4
4  5

Example 2
import pandas as pd
data = [['Alex',10],['Bob',12],['Clarke',13]]
df = pd.DataFrame(data, columns=['Name','Age'])
print(df)

Output
     Name  Age
0    Alex   10
1     Bob   12
2  Clarke   13
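The optional constructor parameters from Table 8.1 can be combined with list input. The following is a minimal sketch; the row labels 'r1' to 'r3' and the choice of float for dtype are illustrative assumptions, not part of the original examples.

```python
import pandas as pd

data = [['Alex', 10], ['Bob', 12], ['Clarke', 13]]

# index supplies the row labels, columns supplies the column labels.
df = pd.DataFrame(data, index=['r1', 'r2', 'r3'], columns=['Name', 'Age'])
print(df)

# dtype forces a single type on the data; here a numeric-only list
# is stored as floating point instead of integers.
ages = pd.DataFrame([10, 12, 13], columns=['Age'], dtype=float)
print(ages.dtypes)
```

With custom row labels in place, a row can be looked up by its label, for example df.loc['r2'].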
Create a DataFrame from Dict of ndarrays / Lists
All the ndarrays must be of the same length. If an index is passed, its length must equal the length of the arrays. If no index is passed, then by default the index will be range(n), where n is the array length.

Example 1
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data)
print(df)

Output
    Name  Age
0    Tom   28
1   Jack   34
2  Steve   29
3  Ricky   42

Example 2
Let us now create an indexed DataFrame using arrays.
import pandas as pd
data = {'Name':['Tom', 'Jack', 'Steve', 'Ricky'],'Age':[28,34,29,42]}
df = pd.DataFrame(data, index=['rank1','rank2','rank3','rank4'])
print(df)

Output
        Name  Age
rank1    Tom   28
rank2   Jack   34
rank3  Steve   29
rank4  Ricky   42

Create a DataFrame from List of Dicts
A list of dictionaries can be passed as input data to create a DataFrame. The dictionary keys are taken as column names by default.

Example 1
The following example shows how to create a DataFrame by passing a list of dictionaries.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data)
print(df)

Output
   a   b     c
0  1   2   NaN
1  5  10  20.0

Example 2
The following example shows how to create a DataFrame by passing a list of dictionaries and the row indices.
import pandas as pd
data = [{'a': 1, 'b': 2},{'a': 5, 'b': 10, 'c': 20}]
df = pd.DataFrame(data, index=['first', 'second'])
print(df)

Output
        a   b     c
first   1   2   NaN
second  5  10  20.0

Create a DataFrame from Dict of Series
A dictionary of Series can be passed to form a DataFrame. The resultant index is the union of all the Series indexes passed.

Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df)

Output
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

Column Selection
We will understand this by selecting a column from the DataFrame.

Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df['one'])
Output
a    1.0
b    2.0
c    3.0
d    NaN
Name: one, dtype: float64

Column Addition
We will understand this by adding a new column to an existing DataFrame.

Example
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)

# Adding a new column to an existing DataFrame object with a column label by passing a new Series
print("Adding a new column by passing as Series:")
df['three'] = pd.Series([10,20,30], index=['a','b','c'])
print(df)

print("Adding a new column using the existing columns in DataFrame:")
df['four'] = df['one'] + df['three']
print(df)

Its output is as follows:

Adding a new column by passing as Series:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Adding a new column using the existing columns in DataFrame:
   one  two  three  four
a  1.0    1   10.0  11.0
b  2.0    2   20.0  22.0
c  3.0    3   30.0  33.0
d  NaN    4    NaN   NaN

Column Deletion
Columns can be deleted or popped; let us take an example to understand how.

Example
# Using the previous DataFrame, we will delete a column
# using del function
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd']),
     'three' : pd.Series([10,20,30], index=['a','b','c'])}
df = pd.DataFrame(d)
print("Our data frame is:")
print(df)

# using del function
print("Deleting the first column using DEL function:")
del df['one']
print(df)

# using pop function
print("Deleting another column using POP function:")
df.pop('two')
print(df)

Its output is as follows:

Our data frame is:
   one  two  three
a  1.0    1   10.0
b  2.0    2   20.0
c  3.0    3   30.0
d  NaN    4    NaN
Deleting the first column using DEL function:
   two  three
a    1   10.0
b    2   20.0
c    3   30.0
d    4    NaN
Deleting another column using POP function:
   three
a   10.0
b   20.0
c   30.0
d    NaN

Row Selection, Addition, and Deletion
We will now understand row selection, addition, and deletion through examples. Let us begin with the concept of selection.

Selection by Label
Rows can be selected by passing a row label to the loc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.loc['b'])

Its output is as follows:
one    2.0
two    2.0
Name: b, dtype: float64

The result is a Series whose labels are the column names of the DataFrame. The name of the Series is the label with which it was retrieved.

Selection by integer location
Rows can be selected by passing an integer location to the iloc indexer.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df.iloc[2])

Output
one    3.0
two    3.0
Name: c, dtype: float64

Slice Rows
Multiple rows can be selected using the ':' operator.
import pandas as pd
d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
     'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df[2:4])

Output
   one  two
c  3.0    3
d  NaN    4

Addition of Rows
Add new rows to a DataFrame using the pd.concat function, which places the new rows at the end. (Older pandas versions offered a DataFrame.append method for this; it was removed in pandas 2.0.)
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
print(df)

Output
   a  b
0  1  2
1  3  4
0  5  6
1  7  8

Deletion of Rows
Use an index label to delete or drop rows from a DataFrame. If the label is duplicated, then multiple rows will be dropped. Notice that in the above example the labels are duplicated. Let us drop a label and see how many rows get dropped.
import pandas as pd
df = pd.DataFrame([[1, 2], [3, 4]], columns = ['a','b'])
df2 = pd.DataFrame([[5, 6], [7, 8]], columns = ['a','b'])
df = pd.concat([df, df2])
# Drop rows with label 0
df = df.drop(0)
print(df)

Output
   a  b
1  3  4
1  7  8

Two rows were dropped because both carried the label 0.

8.3 VISUALIZATION
We use the standard convention for referencing the matplotlib API:
In [1]: import numpy as np
   ...: import pandas as pd
   ...: import matplotlib.pyplot as plt
In [2]: plt.close("all")

pandas provides the basics for easily creating decent-looking plots. See the ecosystem section of the pandas documentation for visualization libraries that go beyond the basics documented here.

Basic plotting: plot
The plot method on Series and DataFrame is just a simple wrapper around plt.plot():
In [3]: ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
In [4]: ts = ts.cumsum()
In [5]: ts.plot();

Figure 8.2 Basic Plot

If the index consists of dates, it calls gcf().autofmt_xdate() to try to format the x-axis nicely, as shown above.
On DataFrame, plot() is a convenience to plot all of the columns with labels:
In [6]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))
In [7]: df = df.cumsum()
In [8]: plt.figure();
In [9]: df.plot();
Figure 8.3 Basic Plot on single column

You can plot one column versus another using the x and y keywords in plot():
In [10]: df3 = pd.DataFrame(np.random.randn(1000, 2), columns=["B", "C"]).cumsum()
In [11]: df3["A"] = pd.Series(list(range(len(df))))
In [12]: df3.plot(x="A", y="B");

Figure 8.4: Plot of one column vs another

8.4 OTHER PLOTS
Plotting methods allow for a handful of plot styles other than the default line plot. These methods can be provided as the kind keyword argument to plot(), and include:
• ‘bar’ or ‘barh’ for bar plots
• ‘hist’ for histogram
• ‘box’ for boxplot
• ‘kde’ or ‘density’ for density plots
• ‘area’ for area plots
• ‘scatter’ for scatter plots
• ‘hexbin’ for hexagonal bin plots
• ‘pie’ for pie plots

Bar Plot
A bar plot displays categorical data with rectangular bars whose length or height corresponds to the value of each data point. Bar plots can be drawn with vertical or horizontal bars.
A bar plot can be created the following way:
In [13]: plt.figure();
In [14]: df.iloc[5].plot(kind="bar");

Figure 8.5: Bar Plot

You can also create these other plots using the methods DataFrame.plot.<kind> instead of providing the kind keyword argument. This makes it easier to discover plot methods and the specific arguments they use:
In [15]: df = pd.DataFrame()
In [16]: df.plot.<TAB>
df.plot.area     df.plot.barh     df.plot.density  df.plot.hist     df.plot.line     df.plot.scatter
df.plot.bar      df.plot.box      df.plot.hexbin   df.plot.kde      df.plot.pie

In addition to these kinds, there are the DataFrame.hist() and DataFrame.boxplot() methods, which use a separate interface.
Finally, there are several plotting functions in pandas.plotting that take a Series or DataFrame as an argument. These include:
• Scatter Matrix
• Andrews Curves
• Parallel Coordinates
• Lag Plot
• Autocorrelation Plot
• Bootstrap Plot
• RadViz

Plots may also be adorned with errorbars or tables.
For labeled, non-time series data, you may wish to produce a bar plot. (Here df is again the DataFrame of cumulative sums created in In [6], not the empty frame used in In [15] to illustrate tab completion.)
In [17]: plt.figure();
In [18]: df.iloc[5].plot.bar();
In [19]: plt.axhline(0, color="k");
Figure 8.6: Bar Plot

Calling a DataFrame’s plot.bar() method produces a multiple bar plot:
In [20]: df2 = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])
In [21]: df2.plot.bar();

Figure 8.7: Multiple Bar Plot

To produce a stacked bar plot, pass stacked=True:
In [22]: df2.plot.bar(stacked=True);
Figure 8.8: Stacked Bar Plot

To get horizontal bar plots, use the barh method:
In [23]: df2.plot.barh(stacked=True);

Figure 8.9: Horizontal Bar Plot
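As a self-contained sketch of the bar plot variants above, the following builds a small labeled DataFrame and draws grouped, stacked, and horizontal bars side by side. The city names and sales figures are invented for illustration, and the Agg backend is selected only so that the code also runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical quarterly sales per city (illustrative values only).
sales = pd.DataFrame(
    {"Q1": [4, 7, 3], "Q2": [5, 6, 4], "Q3": [6, 8, 2]},
    index=["Delhi", "Mumbai", "Pune"],
)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Grouped bars: one bar per column for every row label.
sales.plot.bar(ax=axes[0])

# Stacked bars: the columns of each row are piled on top of each other.
sales.plot.bar(stacked=True, ax=axes[1])

# Horizontal stacked bars: same data, bars run left to right.
sales.plot.barh(stacked=True, ax=axes[2])

fig.tight_layout()
```

Passing ax= places each plot on an existing subplot, which makes the three variants easy to compare on the same data.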
Histograms
A histogram is a chart that groups numeric data into bins and displays the bins as segmented columns. Histograms are used to depict the distribution of a dataset, that is, how often values fall into each range.
Histograms can be drawn by using the DataFrame.plot.hist() and Series.plot.hist() methods.
In [24]: df4 = pd.DataFrame(
   ....:     {
   ....:         "a": np.random.randn(1000) + 1,
   ....:         "b": np.random.randn(1000),
   ....:         "c": np.random.randn(1000) - 1,
   ....:     },
   ....:     columns=["a", "b", "c"],
   ....: )
   ....:
In [25]: plt.figure();
In [26]: df4.plot.hist(alpha=0.5);

Figure 8.10: Histogram
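To see what a histogram actually counts, the binning can be reproduced numerically with numpy.histogram, which matplotlib's hist uses internally. This is a minimal sketch; the fixed seed and the choice of 10 bins are assumptions made for repeatability and illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
values = pd.Series(rng.standard_normal(1000))

# counts[i] is the number of values between edges[i] and edges[i + 1].
counts, edges = np.histogram(values, bins=10)

# Print one line per bin (the last bin also includes its right edge).
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:6.2f} to {right:6.2f}: {count}")

# Every value lands in exactly one bin.
print(counts.sum())
```

The heights of the bars drawn by values.plot.hist(bins=10) correspond to the entries of counts.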
A histogram can be stacked using stacked=True. The number of bins can be changed using the bins keyword.
In [27]: plt.figure();
In [28]: df4.plot.hist(stacked=True, bins=20);

Figure 8.11: Stacked Histogram

You can pass other keywords supported by matplotlib hist. For example, horizontal and cumulative histograms can be drawn by orientation='horizontal' and cumulative=True.
In [29]: plt.figure();
In [30]: df4["a"].plot.hist(orientation="horizontal", cumulative=True);
Figure 8.12 Horizontal Histogram

See the hist method and the matplotlib hist documentation for more.
The existing DataFrame.hist interface for plotting histograms can still be used.
In [31]: plt.figure();
In [32]: df["A"].diff().hist();

DataFrame.hist() plots the histograms of the columns on multiple subplots:
In [33]: plt.figure();
In [34]: df.diff().hist(color="k", alpha=0.5, bins=50);
Figure 8.13: Histogram with multiple subplots

The by keyword can be specified to plot grouped histograms:
In [35]: data = pd.Series(np.random.randn(1000))
In [36]: data.hist(by=np.random.randint(0, 4, 1000), figsize=(6, 4));
Figure 8.14: Grouped Histogram

Box plots
A box plot visualization allows you to examine the distribution of data. One box plot appears for each attribute element. Each box plot displays the minimum, first quartile, median, third quartile, and maximum values. (By default, matplotlib draws the whiskers only as far as the most extreme points within 1.5 times the interquartile range of the box and marks any points beyond them individually as outliers.)
A boxplot can be drawn by calling Series.plot.box() and DataFrame.plot.box(), or DataFrame.boxplot(), to visualize the distribution of values within each column.
For instance, here is a boxplot representing five trials of 10 observations of a uniform random variable on [0,1).
In [37]: df = pd.DataFrame(np.random.rand(10, 5), columns=["A", "B", "C", "D", "E"])
In [38]: df.plot.box();
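The numbers that a box plot summarizes can also be computed directly. The sketch below uses DataFrame.quantile to obtain the minimum, quartiles, median, and maximum of each column; the fixed seed and the row labels of the summary table are assumptions added for repeatability and readability.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
df = pd.DataFrame(rng.random((10, 5)), columns=["A", "B", "C", "D", "E"])

# Rows of the result: minimum, first quartile, median, third quartile, maximum.
summary = df.quantile([0.0, 0.25, 0.5, 0.75, 1.0])
summary.index = ["min", "Q1", "median", "Q3", "max"]
print(summary)

# The interquartile range (Q3 - Q1) is the height of each box.
iqr = summary.loc["Q3"] - summary.loc["Q1"]
print(iqr)
```

Comparing this table with the drawn plot shows which line of each box corresponds to which statistic.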