MASTER OF COMPUTER APPLICATIONS PARALLEL AND DISTRIBUTED COMPUTING MCA635 Dr. Abdullah Khan Gezahagn Haileslassie Gezahagn
CHANDIGARH UNIVERSITY Institute of Distance and Online Learning Course Development Committee Chairman Prof. (Dr.) R.S. Bawa Vice Chancellor, Chandigarh University, Punjab Advisors Prof. (Dr.) Bharat Bhushan, Director, IGNOU Prof. (Dr.) Majulika Srivastava, Director, CIQA, IGNOU Programme Coordinators & Editing Team Master of Business Administration (MBA) Bachelor of Business Administration (BBA) Co-ordinator - Prof. Pragya Sharma Co-ordinator - Dr. Rupali Arora Master of Computer Applications (MCA) Bachelor of Computer Applications (BCA) Co-ordinator - Dr. Deepti Rani Sindhu Co-ordinator - Dr. Raju Kumar Master of Commerce (M.Com.) Bachelor of Commerce (B.Com.) Co-ordinator - Dr. Shashi Singhal Co-ordinator - Dr. Minakshi Garg Master of Arts (Psychology) Bachelor of Science (Travel & TourismManagement) Co-ordinator - Dr. Samerjeet Kaur Co-ordinator - Dr. Shikha Sharma Master of Arts (English) Bachelor of Arts (General) Co-ordinator - Dr. Ashita Chadha Co-ordinator - Ms. Neeraj Gohlan Master of Arts (Mass Communication and Bachelor of Arts (Mass Communication and Journalism) Journalism) Co-ordinator - Dr. Chanchal Sachdeva Suri Co-ordinator - Dr. Kamaljit Kaur Academic and Administrative Management Prof. (Dr.) Pranveer Singh Satvat Prof. (Dr.) S.S. Sehgal Pro VC (Academic) Registrar Prof. (Dr.) H. Nagaraja Udupa Prof. (Dr.) Shiv Kumar Tripathi Director – (IDOL) Executive Director – USB © No part of this publication should be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording and/or otherwise without the prior written permission of the author and the publisher. SLM SPECIALLY PREPARED FOR CU IDOL STUDENTS Printed and Published by: Himalaya Publishing House Pvt. Ltd., E-mail: [email protected], Website: www.himpub.com For: CHANDIGARH UNIVERSITY Institute of Distance and Online Learning CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel and Distributed Computing
Course Code: MCA635    Credits: 2

Course Objectives:
To provide students with contemporary knowledge in parallel and distributed systems.
To equip students with skills to analyze and design parallel and distributed applications.

Syllabus
Unit 1: Introduction: Parallel Computing, Parallel Architecture, Architectural Classification Scheme, Performance of Parallel Computers, Performance Metrics for Processors, Parallel Programming Models, Parallel Algorithms.
Unit 2: Pipeline Processing: Introduction, Pipeline Performance, Arithmetic Pipelines, Pipelined Instruction Processing, Pipeline Stage Design, Hazards, Dynamic Instruction Scheduling.
Unit 3: Synchronous Parallel Processing: Introduction, Example-SIMD Architecture and Programming Principles, SIMD Parallel Algorithms, Data Mapping.
Unit 4: Introduction to Distributed Systems: Definition, Issues, Goals, Types of distributed systems, Distributed System Models, Hardware concepts, Software concepts, Models of Middleware, Services offered by middleware, Client-Server model.
Unit 5: Communication: Layered Protocols, Remote Procedure Call, Remote Object Invocation, Message-Oriented Communication, Stream-Oriented Communication.
Unit 6: Resource Management: Desirable Features of a Global Scheduling Algorithm, Task Assignment Approach, Load Balancing Approach, Load Sharing Approach.
Unit 7: Process Management: Introduction to Process Management, Process Migration, Threads, Virtualization, Clients, Servers, Code Migration.

Text Books:
1. Bhujade, M.R. (2009), Parallel Computing, 2nd edition, New Delhi: New Age International Publishers.
2. Tanenbaum, A.S., Steen, M.V. (2007), Distributed Systems: Principles and Paradigms, 2nd edition, Delhi: Pearson Education.

Reference Books:
1. Coulouris, G., Dollimore, J., Kindberg, T. (2012), Distributed Systems: Concepts and Design, 5th edition, Delhi: Pearson Education.
CONTENTS
Unit 1: Parallel Computing 1 – 36
Unit 2: Pipeline Processing 37 – 71
Unit 3: Synchronous Parallel Processing 72 – 127
Unit 4: Introduction to Distributed Systems 128 – 170
Unit 5: Communication 171 – 204
Unit 6: Resource Management 205 – 221
Unit 7: Process Management 222 – 246
References 247
Parallel Computing 1 UNIT 1 PARALLEL COMPUTING Structure: 1.0 Learning Objectives 1.1 Introduction 1.2 Parallel Architecture 1.3 Architectural Classification Scheme 1.4 Performance of Parallel Computers 1.5 Performance Metrics for Processors 1.6 Parallel Programming Models 1.7 Parallel Algorithms 1.8 Summary 1.9 Key Words/Abbreviations 1.10 Learning Activity 1.11 Unit End Questions (MCQ and Descriptive) 1.12 References CU IDOL SELF LEARNING MATERIAL (SLM)
1.0 Learning Objectives

After studying this unit, you will be able to:
Define Parallel Computing
Illustrate Parallel Architecture
Elaborate the Performance of Parallel Computers
Describe Parallel Programming Models
Explain Parallel Algorithms
List the Performance Metrics for Processors
Describe the Architectural Classification Scheme

1.1 Introduction

Introduction to Parallel Computing

Before diving into parallel computing, let us first look at how computer software was traditionally executed and why that approach fell short in the modern era. Computer software was conventionally written for serial computing. This means that, to solve a problem, an algorithm divides the problem into a sequence of smaller instructions. These discrete instructions are then executed on the Central Processing Unit (CPU) of a computer one by one; only after one instruction is finished does the next one start. A real-life analogy is people standing in a queue for movie tickets with only one cashier, who issues tickets to one person at a time. The situation becomes even more complex when there are two queues and still only one cashier. So, in short, serial computing works as follows:
1. A problem statement is broken into discrete instructions.
2. The instructions are executed one by one.
3. Only one instruction is executed at any moment of time.
Look at point 3. This was a serious problem for the computing industry, because only one instruction executed at any moment of time. It wasted hardware resources, since only one part of the hardware was active for a particular instruction at a given time. As problem statements grew heavier and bulkier, so did the time needed to execute them. Examples of such processors are the Pentium 3 and Pentium 4.

Now let us come back to our real-life analogy. The situation clearly improves when there are two queues and two cashiers issuing tickets to two persons simultaneously. This is an example of parallel computing.

Parallel Computing

Parallel computing is the use of multiple processing elements simultaneously to solve a problem. The problem is broken down into instructions that are solved concurrently, with each resource applied to the work operating at the same time.

Advantages of Parallel Computing over Serial Computing are as follows:
1. It saves time and money, as many resources working together reduce the execution time and cut potential costs.
2. It can be impractical to solve larger problems with serial computing.
3. It can take advantage of non-local resources when the local resources are finite.
4. Serial computing "wastes" potential computing power, whereas parallel computing makes better use of the hardware.

Types of Parallelism

1. Bit-level parallelism: This form of parallelism is based on increasing the processor's word size. It reduces the number of instructions that the system must execute in order to perform a task on large-sized data.
Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers. It must first add the 8 lower-order bits and then the 8 higher-order bits, thus requiring two instructions to perform the operation. A 16-bit processor can perform the operation with a single instruction.
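A small C sketch of this example follows. It adds two 16-bit integers once the way an 8-bit machine would (two add steps plus an explicit carry) and once as a single 16-bit operation. The byte-splitting shown here is only an illustration of the idea, not a model of any particular processor's instruction set.

```c
#include <stdint.h>
#include <stdio.h>

/* Add two 16-bit numbers the way an 8-bit machine would:
   low bytes first, then high bytes plus the carry. */
static uint16_t add16_on_8bit(uint16_t a, uint16_t b)
{
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint16_t lo_sum = (uint16_t)a_lo + b_lo;      /* first instruction  */
    uint8_t  carry  = lo_sum >> 8;
    uint8_t  hi_sum = a_hi + b_hi + carry;        /* second instruction */

    return ((uint16_t)hi_sum << 8) | (lo_sum & 0xFF);
}

int main(void)
{
    uint16_t a = 0x1234, b = 0x0FCD;
    printf("two 8-bit steps : 0x%04X\n", add16_on_8bit(a, b));
    printf("one 16-bit step : 0x%04X\n", (uint16_t)(a + b));  /* wider word size */
    return 0;
}
```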
2. Instruction-level parallelism: A processor executes a stream of instructions, but those instructions can be re-ordered and grouped so that several of them are executed concurrently within a single clock cycle without affecting the result of the program. This is called instruction-level parallelism.

3. Task parallelism: Task parallelism employs the decomposition of a task into subtasks and then allocates each of the subtasks for execution. The processors execute the subtasks concurrently. (A short code sketch of this idea follows the applications list below.)

Why Parallel Computing?
The whole real world runs in a dynamic manner, i.e., many things happen at a certain time but at different places concurrently. The data describing it is extremely large and hard to manage.
Real-world data needs more dynamic simulation and modelling, and for achieving this, parallel computing is the key.
Parallel computing provides concurrency and saves time and money.
Complex, large datasets and their management can be organized only by using parallel computing's approach.
It ensures effective utilization of resources. The hardware is guaranteed to be used effectively, whereas in serial computation only some part of the hardware is used and the rest remains idle.
Also, it is impractical to implement real-time systems using serial computing.

Applications of Parallel Computing:
Databases and data mining.
Real-time simulation of systems.
Science and engineering.
Advanced graphics, augmented reality and virtual reality.
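Here is the task-parallelism sketch promised above, written with OpenMP tasks (a minimal illustration, not the only way to express it; build with an OpenMP-capable compiler, e.g. `gcc -fopenmp`). The two subtasks — each summing one half of an array — stand in for any independent pieces of work; they execute concurrently and their results are combined at the end. Without OpenMP the pragmas are ignored and the program simply runs the subtasks one after the other.

```c
#include <stdio.h>

#define N 1000000

int main(void)
{
    static double data[N];
    double sum_lo = 0.0, sum_hi = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = 1.0;                       /* dummy workload */

    #pragma omp parallel
    #pragma omp single                       /* one thread creates the tasks */
    {
        #pragma omp task shared(sum_lo)      /* subtask 1: lower half */
        for (int i = 0; i < N / 2; i++)
            sum_lo += data[i];

        #pragma omp task shared(sum_hi)      /* subtask 2: upper half */
        for (int i = N / 2; i < N; i++)
            sum_hi += data[i];

        #pragma omp taskwait                 /* wait for both subtasks */
    }

    printf("total = %.0f\n", sum_lo + sum_hi);   /* combine the results */
    return 0;
}
```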
Limitations of Parallel Computing:
It introduces issues such as communication and synchronization between multiple sub-tasks and processes, which are difficult to get right.
The algorithms must be designed in such a way that they can be handled by the parallel mechanism.
The algorithms or programs must have low coupling and high cohesion, and it is difficult to create such programs.
Only more technically skilled and experienced programmers can code a parallelism-based program well.

Future of Parallel Computing:
The computational landscape has undergone a great transition from serial computing to parallel computing. Tech giants such as Intel have already taken a step towards parallel computing by employing multicore processors. Parallel computation will revolutionize the way computers work in the future, for the better. With the whole world more connected than ever before, parallel computing plays an important role in keeping it that way, and with faster networks, distributed systems, and multiprocessor computers, it becomes even more necessary.

1.2 Parallel Architecture

Hardware Architecture (Parallel Computing)

Let us discuss parallel computing and the hardware architecture of parallel computing in this section. Note that there are two types of computing, but we will study only parallel computing here. Before doing so, we should know the following terms.

1. Era of computing – The two fundamental and dominant models of computing are sequential and parallel. The sequential computing era began in the 1940s, and the parallel (and distributed) computing era followed it within a decade.

2. Computing – So, what is computing? Computing is any goal-oriented activity requiring, benefiting from, or creating computers. Computing includes designing, developing and building hardware and software systems; designing a mathematical sequence of steps known as an algorithm; and processing, structuring and managing various kinds of information.

3. Types of computing – The two types of computing are:
Parallel computing
Distributed computing

Parallel Computing

Since this unit deals with parallel computing, what is parallel processing? Processing of multiple tasks simultaneously on multiple processors is called parallel processing. A parallel program consists of multiple active processes (tasks) simultaneously solving a given problem. Having seen what parallel processing is, we can now go deeper into the topic and understand the hardware architecture of parallel computing.

Hardware Architecture of Parallel Computing

The hardware architecture of parallel computing is classified into the following categories:
1. Single-instruction, single-data (SISD) systems
2. Single-instruction, multiple-data (SIMD) systems
3. Multiple-instruction, single-data (MISD) systems
4. Multiple-instruction, multiple-data (MIMD) systems

Computer Architecture: Flynn's Taxonomy

Parallel computing is a form of computing in which jobs are broken into discrete parts that can be executed concurrently. Each part is further broken down into a series of instructions, and instructions from each part execute simultaneously on different CPUs. Parallel systems deal with the simultaneous use of multiple computer resources, which can include a single computer with multiple processors, a number of computers connected by a network to form a parallel processing cluster, or a combination of both.
Parallel Computing 7 Parallel systems are more difficult to program than computers with a single processor because the architecture of parallel computers varies accordingly and the processes of multiple CPUs must be coordinated and synchronized. The crux of parallel processing are CPUs. Based on the number of instruction and data streams that can be processed simultaneously, computing systems are classified into four major categories: Flynn’s Classification 1. Single-instruction, single-data (SISD) systems – An SISD computing system is a uniprocessor machine which is capable of executing a single instruction, operating on a single data stream. In SISD, machine instructions are processed in a sequential manner and computers adopting this model are popularly called sequential computers. Most conventional computers have SISD architecture. All the instructions and data to be processed have to be stored in primary memory. Figure 1.1: SISD The speed of the processing element in the SISD model is limited(dependent) by the rate at which the computer can transfer information internally. Dominant representative SISD systems are IBM PC, workstations. 2. Single-instruction, multiple-data (SIMD) systems – An SIMD system is a multiprocessor machine capable of executing the same instruction on all the CPUs but operating on different data streams. Machines based on an SIMD model are well suited to scientific computing since they involve lots of vector and matrix operations. So that the information can be passed to all the processing elements (PEs) organized data CU IDOL SELF LEARNING MATERIAL (SLM)
8 Parallel and Distributed Computing elements of vectors can be divided into multiple sets (N-sets for N PE systems) and each PE can process one data set. Figure 1.2: SIMD Dominant representative SIMD systems is Cray’s vector processing machine. 3. Multiple-instruction, single-data (MISD) systems – An MISD computing system is a multiprocessor machine capable of executing different instructions on different PEs but all of them operating on the same dataset. Figure 1.3: MISD Example Z = sin(x) + cos(x) + tan(x) CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 9 The system performs different operations on the same data set. Machines built using the MISD model are not useful in most of the application, a few machines are built, but none of them are available commercially. 4. Multiple-instruction, multiple-data (MIMD) systems – An MIMD system is a multiprocessor machine which is capable of executing multiple instructions on multiple data sets. Each PE in the MIMD model has separate instruction and data streams; therefore machines built using this model are capable to any kind of application. Unlike SIMD and MISD machines, PEs in MIMD machines work asynchronously. Figure 1.4: MIMD MIMD machines are broadly categorized into shared-memory MIMD and distributed- memory MIMD based on the way PEs are coupled to the main memory. In the shared memory MIMD model (tightly coupled multiprocessor systems), all the PEs are connected to a single global memory and they all have access to it. The communication between PEs in this model takes place through the shared memory, modification of the data stored in the global memory by one PE is visible to all other PEs. Dominant representative shared memory MIMD systems are Silicon Graphics machines and Sun/IBM’s SMP (Symmetric Multi-Processing). In Distributed memory MIMD machines (loosely coupled multiprocessor systems) all PEs have a local memory. The communication between PEs in this model takes place CU IDOL SELF LEARNING MATERIAL (SLM)
10 Parallel and Distributed Computing through the interconnection network (the inter process communication channel, or IPC). The network connecting PEs can be configured to tree, mesh or in accordance with the requirement. The shared-memory MIMD architecture is easier to program but is less tolerant to failures and harder to extend with respect to the distributed memory MIMD model. Failures in a shared-memory MIMD affect the entire system, whereas this is not the case of the distributed model, in which each of the PEs can be easily isolated. Moreover, shared memory MIMD architectures are less likely to scale because the addition of more PEs leads to memory contention. This is a situation that does not happen in the case of distributed memory, in which each PE has its own memory. As a result of practical outcomes and user’s requirement , distributed memory MIMD architecture is superior to the other existing models. Hardware Computing Computer hardware is the collection of physical parts of a computer system. This includes the computer case, monitor, keyboard, and mouse. It also includes all the parts inside the computer case, such as the hard disk drive, motherboard, video card, and many others. Computer hardware is what you can physically touch. 1.3 Architectural Classification Scheme Check Convergence of Parallel Architectures Parallel machines have been developed with several distinct architecture. In this section, we will discuss different parallel computer architecture and the nature of their convergence. Communication Architecture Parallel architecture enhances the conventional concepts of computer architecture with communication architecture. Computer architecture defines critical abstractions (like user-system boundary and hardware-software boundary) and organizational structure, whereas communication architecture defines the basic communication and synchronization operations. It also addresses the organizational structure. CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 11 Figure 1.5: Physical Communication Medium Programming model is the top layer. Applications are written in programming model. Parallel programming models include: Shared address space Message passing Data parallel programming Shared address programming is just like using a bulletin board, where one can communicate with one or many individuals by posting information at a particular location, which is shared by all other individuals. Individual activity is coordinated by noting who is doing what task. Message passing is like a telephone call or letters where a specific receiver receives information from a specific sender. Data parallel programming is an organized form of cooperation. Here, several individuals perform an action on separate elements of a data set concurrently and share information globally. CU IDOL SELF LEARNING MATERIAL (SLM)
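To make these programming models a little more concrete, here is a small data-parallel sketch in the shared-address-space style, written with OpenMP (one possible realization among several; MPI or CUDA could express the message-passing and data-parallel styles equally well). Each thread works on separate elements of the same shared array, which is exactly the "action on separate elements of a data set" described above. The array contents are made up purely for illustration.

```c
#include <stdio.h>

#define N 16

int main(void)
{
    double a[N];

    /* Data-parallel, shared-address-space style: the loop iterations are
       divided among the threads, and every thread sees the same array 'a'. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;            /* each thread updates its own elements */

    /* All results are visible in the shared address space afterwards. */
    for (int i = 0; i < N; i++)
        printf("a[%d] = %.1f\n", i, a[i]);

    return 0;
}
```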
12 Parallel and Distributed Computing Shared Memory Shared memory multiprocessors are one of the most important classes of parallel machines. It gives better throughput on multiprogramming workloads and supports parallel programs. Figure 1.6: Shared Memory In this case, all the computer systems allow a processor and a set of I/O controller to access a collection of memory modules by some hardware interconnection. The memory capacity is increased by adding memory modules and I/O capacity is increased by adding devices to I/O controller or by adding additional I/O controller. Processing capacity can be increased by waiting for a faster processor to be available or by adding more processors. All the resources are organized around a central memory bus. Through the bus access mechanism, any processor can access any physical address in the system. As all the processors are equidistant from all the memory locations, the access time or latency of all the processors is same on a memory location. This is called symmetric multiprocessor. Message-Passing Architecture Message passing architecture is also an important class of parallel machines. It provides communication among processors as explicit I/O operations. In this case, the communication is CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 13 combined at the I/O level, instead of the memory system. In message passing architecture, user communication executed by using operating system or library calls that perform many lower level actions, which includes the actual communication operation. As a result, there is a distance between the programming model and the communication operations at the physical hardware level. Send and receive is the most common user level communication operations in message passing system. Send specifies a local data buffer (which is to be transmitted) and a receiving remote processor. Receive specifies a sending process and a local data buffer in which the transmitted data will be placed. In send operation, an identifier or a tag is attached to the message and the receiving operation specifies the matching rule like a specific tag from a specific processor or any tag from any processor. The combination of a send and a matching receive completes a memory-to-memory copy. Each end specifies its local data address and a pair wise synchronization event. Convergence Development of the hardware and software has faded the clear boundary between the shared memory and message passing camps. Message passing and a shared address space represents two distinct programming models; each gives a transparent paradigm for sharing, synchronization and communication. However, the basic machine structures have converged towards a common organization. Data Parallel Processing Another important class of parallel machine is variously called − processor arrays, data parallel architecture and single-instruction-multiple-data machines. The main feature of the programming model is that operations can be executed in parallel on each element of a large regular data structure (like array or matrix). Data parallel programming languages are usually enforced by viewing the local address space of a group of processes, one per processor, forming an explicit global space. As all the processors communicate together and there is a global view of all the operations, so either a shared address space or message passing can be used. CU IDOL SELF LEARNING MATERIAL (SLM)
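The send/receive operations just described map directly onto MPI, the most widely used message-passing library. The sketch below is a minimal illustration (not taken from the textbook): process 0 sends a buffer to process 1 with a tag, and process 1 posts a matching receive, which completes the memory-to-memory copy. With a typical MPI installation it can be built and run with, for example, `mpicc prog.c && mpirun -np 2 ./a.out`.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, data[4] = {10, 20, 30, 40};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send: local buffer, count, type, destination, tag, communicator. */
        MPI_Send(data, 4, MPI_INT, 1, 99, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recv_buf[4];
        /* Receive: matching source and tag complete the copy into recv_buf. */
        MPI_Recv(recv_buf, 4, MPI_INT, 0, 99, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("process 1 received %d %d %d %d\n",
               recv_buf[0], recv_buf[1], recv_buf[2], recv_buf[3]);
    }

    MPI_Finalize();
    return 0;
}
```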
14 Parallel and Distributed Computing Fundamental Design Issues Development of programming model only cannot increase the efficiency of the computer nor can the development of hardware alone do it. However, development in computer architecture can make the difference in the performance of the computer. We can understand the design problem by focusing on how programs use a machine and which basic technologies are provided. In this section, we will discuss about the communication abstraction and the basic requirements of the programming model. Communication Abstraction Communication abstraction is the main interface between the programming model and the system implementation. It is like the instruction set that provides a platform so that the same program can run correctly on many implementations. Operations at this level must be simple. Communication abstraction is like a contract between the hardware and software, which allows each other the flexibility to improve without affecting the work. Programming Model Requirements A parallel program has one or more threads operating on data. A parallel programming model defines what data the threads can name, which operations can be performed on the named data, and which order is followed by the operations. To confirm that the dependencies between the programs are enforced, a parallel program must coordinate the activity of its threads. Brief Explanation and Classification of Parallel Architecture Classification of Parallel Computers (a) Pipeline computers (b) Array Processors (c) Multiprocessors (d) Systolic Architecture (e) Data flow Architecture CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 15 Classification based on Architectural Schemes (a) Flynn’s classification (b) Shores classification (c) Feng’s classification (d) Handle’s classification Classification based on Memory Access (a) Shared (b) Distributed (c) Hybrid Classification based on Inter connections between processing elements and memory modules. (a) Multiple PE and Memory modules Classification based on characteristic of processing elements. (a) Complex instruction set (b) Reduced Instruction set (c) DSP and vector processing Classification based on Memory Access Shared Memory General Characteristics: Hybrid Distributed-Shared Memory General Characteristics: The largest and fastest computers in the world today employ both shared and distributed memory architectures. Shared memory parallel computers vary widely, but generally have in common the ability for all processors to access all memory as global address space. Multiple processors can operate independently but share the same memory resources. CU IDOL SELF LEARNING MATERIAL (SLM)
16 Parallel and Distributed Computing Changes in a memory location effected by one processor are visible to all other processors. Historically, shared memory machines have been classified as UMA and NUMA, based upon memory access times. Uniform Memory Access (UMA): Most commonly represented today by Symmetric Multiprocessor (SMP) machines. Identical processors. Equal access and access times to memory. Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means if one processor updates a location in shared memory, all the other processors know about the update. Cache coherency is accomplished at the hardware level. Non-Uniform Memory Access (NUMA): Often made by physically linking two or more SMPs One SMP can directly access memory of another SMP Not all processors have equal access time to all memories Memory access across link is slower If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent NUMA. Advantages: Global address space provides a user-friendly programming perspective to memory. Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs Disadvantages: Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increases traffic on the shared memory-CPU path, and for cache CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 17 coherent systems, geometrically increase traffic associated with cache/memory management. Programmer responsibility for synchronization constructs that ensure “correct” access of global memory. Figure 1.7: (a) Uniform Memory Access Figure 1.7: (b) Shared Memory (NUMA) CU IDOL SELF LEARNING MATERIAL (SLM)
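Before moving on to distributed memory, the "programmer responsibility for synchronization constructs" listed among the disadvantages above is easy to see in code. In the illustrative OpenMP sketch below (not from the text), many threads increment two counters in shared memory: the unsynchronized counter races and usually ends up wrong when the code is compiled with OpenMP and run with several threads, while the one protected by an atomic directive is always correct.

```c
#include <stdio.h>

#define N 100000

int main(void)
{
    long racy = 0, safe = 0;

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        racy++;                 /* unsynchronized update: a data race */

        #pragma omp atomic      /* synchronized update: always correct */
        safe++;
    }

    /* 'safe' is always N; 'racy' is typically smaller when threads collide. */
    printf("racy = %ld, safe = %ld (expected %d)\n", racy, safe, N);
    return 0;
}
```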
18 Parallel and Distributed Computing Distributed Memory General Characteristics: Like shared memory systems, distributed memory systems vary widely but share a common characteristic. Distributed memory systems require a communication network to connect inter- processor memory. Processors have their own local memory. Memory addresses in one processor do not map to another processor, so there is no concept of global address space across all processors. Because each processor has its own local memory, it operates independently. Changes it makes to its local memory have no effect on the memory of other processors. Hence, the concept of cache coherency does not apply. When a processor needs access to data in another processor, it is usually the task of the programmer to explicitly define how and when data is communicated. Synchronization between tasks is likewise the programmer's responsibility. The network “fabric” used for data transfer varies widely, though it can be as simple as Ethernet. Advantages: Memory is scalable with the number of processors. Increase the number of processors and the size of memory increases proportionately. Each processor can rapidly access its own memory without interference and without the overhead incurred with trying to maintain global cache coherency. Cost effectiveness: can use commodity, off-the-shelf processors and networking. Disadvantages: The programmer is responsible for many of the details associated with data communication between processors. It may be difficult to map existing data structures, based on global memory, to this memory organization. Non-uniform memory access times - data residing on a remote node takes longer to access than node local data. CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 19 Figure 1.8: Distributed Memory Access Hybrid Distributed-Shared Memory General Characteristics: The largest and fastest computers in the world today employ both shared and distributed memory architectures. The shared memory component can be a shared memory machine and/or graphics processing units (GPU). The distributed memory component is the networking of multiple shared memory/GPU machines, which know only about their own memory – not the memory on another machine. Therefore, network communications are required to move data from one machine to another. Current trends seem to indicate that this type of memory architecture will continue to prevail and increase at the high end of computing for the foreseeable future. Advantages and Disadvantages: Whatever is common to both shared and distributed memory architectures. Increased scalability is an important advantage. Increased programmer complexity is an important disadvantage. CU IDOL SELF LEARNING MATERIAL (SLM)
20 Parallel and Distributed Computing Figure 1.9: Hybrid Distributed Memory Access Hybrid Distributed Memory Access with GPU’s Classification based on interconnection between PEs and memory modules They are also classified in term of interconnecting network arrangement. So that various PEs can communicate with each other for real time signal processing and controlling architecture. They follow MIMD architectures (multiple PEs and memory modules) 1.4 Performance of Parallel Computers Two key goals to be achieved with the design of parallel applications are: Performance – the capacity to reduce the time needed to solve a problem as the computing resources increase Scalability – the capacity to increase performance as the size of the problem increases The main factors limiting the performance and the scalability of an application can be divided into: Architectural limitations Algorithmic limitations CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 21 Factors Limiting Performance Architectural Limitations: Latency and bandwidth Data coherency Algorithmic Limitations: Missing parallelism (sequential code) Communication frequency Synchronization frequency Poor scheduling (task granularity/load balancing) Parallel Computing In simple terms, parallel computing is breaking up a task into smaller pieces and executing those pieces at the same time, each on their own processor or on a set of computers that have been networked together. Let’s look at a simple example. Say we have the following equation: Y = (4 × 5) + (1 × 6) + (5 × 3) On a single processor, the steps needed to calculate a value for Y might look like: Step 1: Y = 20 + (1 × 6) + (5 × 3) Step 2: Y = 20 + 6 + (5 × 3) Step 3: Y = 20 + 6 + 15 Step 4: Y = 41 But in a parallel computing scenario, with three processors or computers, the steps look something like: Step 1: Y = 20 + 6 + 15 Step 2: Y = 41 Now, this is a simple example, but the idea is clear. Break the task down into pieces and execute those pieces simultaneously. CU IDOL SELF LEARNING MATERIAL (SLM)
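The three multiplications in this example are independent, so they can be evaluated at the same time and then combined. A minimal OpenMP sketch of exactly this computation (one section per product, then a final sum) might look as follows; using OpenMP sections is just one convenient way to express it, and without OpenMP the program still produces the same result serially.

```c
#include <stdio.h>

int main(void)
{
    int p1, p2, p3;

    /* Each section is an independent piece of the task Y = 4*5 + 1*6 + 5*3.
       With three threads available, the products are computed simultaneously. */
    #pragma omp parallel sections
    {
        #pragma omp section
        p1 = 4 * 5;

        #pragma omp section
        p2 = 1 * 6;

        #pragma omp section
        p3 = 5 * 3;
    }

    printf("Y = %d\n", p1 + p2 + p3);   /* combine the partial results: 41 */
    return 0;
}
```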
1.5 Performance Metrics for Processors

Performance Metrics

There are two distinct classes of performance metrics:
Performance metrics for processors/cores – assess the performance of a processing unit, normally by measuring the speed or the number of operations that it performs in a certain period of time.
Performance metrics for parallel applications – assess the performance of a parallel application, normally by comparing the execution time with multiple processing units against the execution time with just one processing unit.
Here, we are mostly interested in metrics that measure the performance of parallel applications.

Performance Metrics for Processors/Cores

Some of the best known metrics are:
MIPS – Millions of Instructions Per Second
MFLOPS – Millions of FLoating point Operations Per Second
SPECint – SPEC (Standard Performance Evaluation Corporation) benchmarks that evaluate processor performance on integer arithmetic (first release in 1992)
SPECfp – SPEC benchmarks that evaluate processor performance on floating point operations (first release in 1989)
Whetstone – synthetic benchmarks to assess processor performance on floating point operations (first release in 1972)
Dhrystone – synthetic benchmarks to assess processor performance on integer arithmetic (first release in 1984)
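Before turning to the parallel-application metrics, here is a rough illustration of one of these processor metrics. The sketch times a loop of floating-point additions and reports an MFLOPS figure; it is only a toy measurement (no warm-up, no guard against aggressive optimisation beyond the volatile accumulator), not a real benchmark such as Whetstone or SPECfp. It uses the POSIX clock_gettime timer.

```c
#include <stdio.h>
#include <time.h>

#define OPS 100000000L   /* number of floating-point additions to time */

int main(void)
{
    struct timespec t0, t1;
    volatile double acc = 0.0;   /* volatile: keep the loop from being removed */

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < OPS; i++)
        acc += 1.0;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("%.1f MFLOPS (%.3f s for %ld additions)\n",
           OPS / (seconds * 1e6), seconds, OPS);
    return 0;
}
```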
Performance Metrics for Parallel Applications

Some of the best known metrics are:
Speedup
Efficiency
Redundancy
Utilization

There are also some laws/metrics that try to explain and predict the potential performance of a parallel application. The best known are:
Amdahl's law
Gustafson-Barsis' law
Karp-Flatt metric
Isoefficiency metric

1.6 Parallel Programming Models

In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality: how well a range of different problems can be expressed for a variety of different architectures, and its performance: how efficiently the compiled programs can execute. The implementation of a parallel programming model can take the form of a library invoked from a sequential language, an extension to an existing language, or an entirely new language. Consensus around a particular programming model is important because it leads to different parallel computers being built with support for the model, thereby facilitating portability of software. In this sense, programming models are referred to as bridging between hardware and software.
Classification of Parallel Programming Models

Classifications of parallel programming models can be divided broadly into two areas: process interaction and problem decomposition.

Process Interaction

Process interaction relates to the mechanisms by which parallel processes are able to communicate with each other. The most common forms of interaction are shared memory and message passing, but interaction can also be implicit (invisible to the programmer).

Shared Memory (Interprocess Communication)

Shared memory is an efficient means of passing data between processes. In a shared-memory model, parallel processes share a global address space that they read and write to asynchronously. Asynchronous concurrent access can lead to race conditions, and mechanisms such as locks, semaphores and monitors can be used to avoid these. Conventional multi-core processors directly support shared memory, which many parallel programming languages and libraries, such as Cilk, OpenMP and Threading Building Blocks, are designed to exploit.

Message Passing

In a message-passing model, parallel processes exchange data through passing messages to one another. These communications can be asynchronous, where a message can be sent before the receiver is ready, or synchronous, where the receiver must be ready. The Communicating Sequential Processes (CSP) formalisation of message passing uses synchronous communication channels to connect processes, and led to important languages such as Occam, Limbo and Go. In contrast, the actor model uses asynchronous message passing and has been employed in the design of languages such as D, Scala and SALSA.

Implicit Interaction (Implicit Parallelism)

In an implicit model, no process interaction is visible to the programmer; instead, the compiler and/or runtime is responsible for performing it. Two examples of implicit parallelism are with domain-specific languages where the concurrency within high-level operations is
Parallel Computing 25 prescribed, and with functional programming languages because the absence of side-effects allows non-dependent functions to be executed in parallel. However, this kind of parallelism is difficult to manage and functional languages such as Concurrent Haskell and Concurrent ML provide features to manage parallelism explicitly. Problem Decomposition A parallel program is composed of simultaneously executing processes. Problem decomposition relates to the way in which the constituent processes are formulated. Task Parallelism A task-parallel model focuses on processes, or threads of execution. These processes will often be behaviourally distinct, which emphasises the need for communication. Task parallelism is a natural way to express message-passing communication. In Flynn's taxonomy, task parallelism is usually classified as MIMD/MPMD or MISD. Data Parallelism A data-parallel model focuses on performing operations on a data set, typically a regularly structured array. A set of tasks will operate on this data, but independently on disjoint partitions. In Flynn's taxonomy, data parallelism is usually classified as MIMD/SPMD or SIMD. Implicit Parallelism As with implicit process interaction, an implicit model of parallelism reveals nothing to the programmer as the compiler, the runtime or the hardware is responsible. For example, in compilers, automatic parallelization is the process of converting sequential code into parallel code, and in computer architecture, superscalar execution is a mechanism whereby instruction-level parallelism is exploited to perform operations in parallel. Terminology Parallel programming models are closely related to models of computation. A model of parallel computation is an abstraction used to analyze the cost of computational processes, but it does not necessarily need to be practical, in that it can be implemented efficiently in hardware and/or software. A programming model, in contrast, does specifically imply the practical considerations of hardware and software implementation. A parallel programming language may CU IDOL SELF LEARNING MATERIAL (SLM)
be based on one or a combination of programming models. For example, High Performance Fortran is based on shared-memory interactions and data-parallel problem decomposition, and Go provides mechanisms for shared-memory and message-passing interaction.

Table 1.1: Example Parallel Programming Models

Name – Class of Interaction – Class of Decomposition – Example Implementations
Actor model – Asynchronous message passing – Task – D, Erlang, Scala, SALSA
Bulk synchronous parallel – Shared memory – Task – Apache Giraph, Apache Hama, BSPlib
Communicating sequential processes – Synchronous message passing – Task – Ada, Occam, VerilogCSP, Go
Circuits – Message passing – Task – Verilog, VHDL
Dataflow – Message passing – Task – Lustre, TensorFlow, Apache Flink
Functional – Message passing – Task – Concurrent Haskell, Concurrent ML
LogP machine – Synchronous message passing – Not specified – None
Parallel random access machine – Shared memory – Data – Cilk, CUDA, OpenMP, Threading Building Blocks, XMTC

How to Evaluate the Performance of a Parallel Program

The development of parallel programming created the need for performance metrics and software tools to evaluate the performance of a parallel algorithm, in order to decide whether its use is convenient or not. Indeed, the focus of parallel computing is to solve large problems in a relatively short time. The factors that contribute to this objective are, for example, the type of hardware used, the degree of parallelism of the problem, and the parallel programming model adopted. To facilitate this analysis, some basic concepts are introduced that compare the parallel algorithm with the original sequential one. Performance is evaluated by analyzing and quantifying the number of threads and/or the number of processes used.
To analyze this, a few performance indexes are introduced: speedup, efficiency, and scaling. The limits of a parallel computation are described by Amdahl's law, while Gustafson's law is used to evaluate the degree of efficiency achieved by parallelizing a sequential algorithm.

Speedup

Speedup is the measure that displays the benefit of solving a problem in parallel. It is defined as the ratio of the time taken to solve a problem on a single processing element, TS, to the time required to solve the same problem on p identical processing elements, TP. We denote speedup by S = TS / TP. We have a linear speedup if S = p, meaning that the speed of execution increases in proportion to the number of processors. Of course, this is an ideal case. The speedup is absolute when TS is the execution time of the best sequential algorithm, and relative when TS is the execution time of the parallel algorithm run on a single processor. Let's recap these conditions:
S = p is linear or ideal speedup
S < p is real speedup
S > p is superlinear speedup

Efficiency

In an ideal world, a parallel system with p processing elements can give us a speedup equal to p. However, this is very rarely achieved. Usually, some time is wasted in either idling or communicating. Efficiency is a performance metric estimating how well-utilized the processors are in solving a task, compared to how much effort is wasted in communication and synchronization. We denote it by E and define it as E = S / p = TS / (p × TP). Algorithms with linear speedup have E = 1; in other cases, E is less than 1. The three cases are identified as follows:
When E = 1, it is a linear case
When E < 1, it is a real case
When E << 1, it is a problem that is parallelizable with low efficiency

Scaling

Scaling is defined as the ability to remain efficient on a parallel machine as the problem grows. It relates the computing power (speed of execution) to the number of processors. If the size of the problem is increased together with the number of processors, there is no loss in terms of performance. A scalable system, depending on how the different factors are increased, may maintain the same efficiency or improve it.

Amdahl's Law

Amdahl's law is widely used in the design of processors and parallel algorithms. It states that the maximum speedup that can be achieved is limited by the serial component of the program: S = 1 / (1 – P), where P is the parallelizable fraction and 1 – P denotes the serial (not parallelized) component of the program. This means that, for example, for a program in which 90 percent of the code can be made parallel but 10 percent must remain serial, the maximum achievable speedup is 10, even with an infinite number of processors.

Gustafson's Law

Gustafson's law is based on the following considerations:
While increasing the dimension of a problem, its sequential parts remain constant.
While increasing the number of processors, the work required on each of them still remains the same.
It states that S(P) = P – α(P – 1), where P is the number of processors, S is the speedup, and α is the non-parallelizable fraction of any parallel process. This is in contrast to Amdahl's law, which takes the single-process execution time to be the fixed quantity and compares it to a shrinking per-process parallel execution time. Thus, Amdahl's law is based on the assumption of a fixed problem size; it assumes that the overall workload of a program does
Parallel Computing 29 not change with respect to the machine size (that is, the number of processors). Gustafson’s law addresses the deficiency of Amdahl’s law, which does not take into account the total number of computing resources involved in solving a task. It suggests that the best way to set the time allowed for the solution of a parallel problem is to consider all the computing resources and on the basis of this information, it fixes the problem. 1.7 Parallel Algorithms A parallel algorithm can be executed simultaneously on many different processing devices and then combined together to get the correct result. Parallel algorithms are highly useful in processing huge volumes of data in quick time. This Section provides an introduction to the design and analysis of parallel algorithms. In addition, it explains the models followed in parallel algorithms, their structures, and implementation. An algorithm is a sequence of steps that take inputs from the user and after some computation, produces an output. A parallel algorithm is an algorithm that can execute several instructions simultaneously on different processing devices and then combine all the individual outputs to produce the final result. What is Parallelism? Parallelism is the process of processing several set of instructions simultaneously. It reduces the total computational time. Parallelism can be implemented by using parallel computers, i.e., a computer with many processors. Parallel computers require parallel algorithm, programming languages, compilers and operating system that support multitasking. In this tutorial, we will discuss only about parallel algorithms. Before moving further, let us first discuss about algorithms and their types. What is an Algorithm? An algorithm is a sequence of instructions followed to solve a problem. While designing an algorithm, we should consider the architecture of computer on which the algorithm will be executed. As per the architecture, there are two types of computers − CU IDOL SELF LEARNING MATERIAL (SLM)
30 Parallel and Distributed Computing Sequential Computer Parallel Computer Depending on the architecture of computers, we have two types of algorithms − Sequential Algorithm − An algorithm in which some consecutive steps of instructions are executed in a chronological order to solve a problem. Parallel Algorithm − The problem is divided into sub-problems and are executed in parallel to get individual outputs. Later on, these individual outputs are combined together to get the final desired output. It is not easy to divide a large problem into sub-problems. Sub-problems may have data dependency among them. Therefore, the processors have to communicate with each other to solve the problem. It has been found that the time needed by the processors in communicating with each other is more than the actual processing time. So, while designing a parallel algorithm, proper CPU utilization should be considered to get an efficient algorithm. To design an algorithm properly, we must have a clear idea of the basic model of computation in a parallel computer. Model of Computation Both sequential and parallel computers operate on a set (stream) of instructions called algorithms. These set of instructions (algorithm) instruct the computer about what it has to do in each step. 1.8 Summary Parallel computing is a type of computing architecture in which several processors execute or process an application or computation simultaneously. Parallel computing helps in performing large computations by dividing the workload between more than one processor, all of which work through the computation at the same time. Most supercomputers employ parallel computing principles to operate. CU IDOL SELF LEARNING MATERIAL (SLM)
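As a small illustration of the divide/solve/combine pattern described in this section, and of the speedup and efficiency formulas introduced earlier, the POSIX-threads sketch below splits an array sum into two sub-problems, solves them in parallel, combines the partial results, and compares the measured time against a plain sequential sum (S = TS/TP, E = S/p with p = 2). The two-way split, the array size and its contents are arbitrary choices for the example, and the measured numbers will vary from machine to machine. Compile with `-pthread`.

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <pthread.h>

#define N 20000000L
#define P 2                     /* number of worker threads (sub-problems) */

static double *data;

struct chunk { long lo, hi; double sum; };

/* Solve one sub-problem: sum the slice [lo, hi). */
static void *partial_sum(void *arg)
{
    struct chunk *c = arg;
    c->sum = 0.0;
    for (long i = c->lo; i < c->hi; i++)
        c->sum += data[i];
    return NULL;
}

static double now(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec / 1e9;
}

int main(void)
{
    data = malloc(N * sizeof *data);
    if (!data) return 1;
    for (long i = 0; i < N; i++)
        data[i] = 1.0;                                  /* dummy input */

    /* Sequential run: TS. */
    double t0 = now();
    struct chunk whole = {0, N, 0.0};
    partial_sum(&whole);
    double Ts = now() - t0;

    /* Parallel run: divide into P sub-problems, solve, combine: TP. */
    t0 = now();
    pthread_t tid[P];
    struct chunk part[P];
    for (int k = 0; k < P; k++) {
        part[k].lo = N * k / P;
        part[k].hi = N * (k + 1) / P;
        pthread_create(&tid[k], NULL, partial_sum, &part[k]);
    }
    double total = 0.0;
    for (int k = 0; k < P; k++) {
        pthread_join(tid[k], NULL);
        total += part[k].sum;                           /* combine outputs */
    }
    double Tp = now() - t0;

    double S = Ts / Tp;                                 /* speedup */
    printf("sum = %.0f, Ts = %.3f s, Tp = %.3f s\n", total, Ts, Tp);
    printf("speedup S = %.2f, efficiency E = %.2f\n", S, S / P);
    free(data);
    return 0;
}
```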
Parallel Computing 31 Parallel computing is also known as parallel processing. Parallel processing is generally implemented in operational environments/scenarios that require massive computation or processing power. The primary objective of parallel computing is to increase the available computation power for faster application processing or task resolution. Typically, parallel computing infrastructure is housed within a single facility where many processors are installed in a server rack or separate servers are connected together. The application server sends a computation or processing request that is distributed in small chunks or components, which are concurrently executed on each processor/server. Parallel computation can be classified as bit-level, instructional level, data and task parallelism. Parallel computer architecture adds a new dimension in the development of computer system by using more and more number of processors. In principle, performance achieved by utilizing large number of processors is higher than the performance of a single processor at a given point of time. In computing, a parallel programming model is an abstraction of parallel computer architecture, with which it is convenient to express algorithms and their composition in programs. The value of a programming model can be judged on its generality: how well a range of different problems can be expressed for a variety of different architectures, and its performance: how efficiently the compiled programs can execute. The implementation of a parallel programming model can take the form of a library invoked from a sequential language, as an extension to an existing language, or as an entirely new language. Consensus around a particular programming model is important because it leads to different parallel computers being built with support for the model, thereby facilitating portability of software. In this sense, programming models are referred to as bridging between hardware and software. 1.9 Key Words/Abbreviations Atomicity: either all changes occur or none occur. Consistency: transactions cannot bring database into an invalid state (note that this is rather different than consistency in the CAP sense) CU IDOL SELF LEARNING MATERIAL (SLM)
32 Parallel and Distributed Computing Isolation: concurrent execution doesn't change results. Durability: effects survive failures. Paxos is the original distributed consensus algorithm, with modified versions used by Chubby and many others. (Zookeeper's ZAB is similar to Paxos as well.) RAFT is a much simpler consensus algorithm. Used by Consul. SWIM is a distributed gossip protocol for group membership (e.g. for determining members of a caching ring, etc) Two-phase commit is a protocol for atomic distributed commits 1.10 Learning Activity 1. What is parallel computing with example? ----------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------- 2. What is meant by parallel computing? ----------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------- 3. What is parallel and distributed computing? ----------------------------------------------------------------------------------------------------------- ----------------------------------------------------------------------------------------------------------- 1.11 Unit End Questions (MCQ and Descriptive) A. Descriptive Types Questions 1. What is Shared-memory Architecture? 2. What is Numa Memory Architecture? 3. Name some Network Architectures Prevalent in Machines Supporting the Message Passing Paradigm? CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 33 4. What is Data-parallel Computation? 5. What is Task-parallel Computation? 6. What is Task-latency? 7. What is Task-throughput? 8. What is Speed-up? 9. What is Parallel Efficiency? 10. What is An Inherently Sequential Task? 11. What is the Maximum Time Speed-up Possible According to Amdahl’s Law? 12. What is SIMD? 13. What is Cache Coherence? 14. What is a Hypercube Connection? 15. What is the Diameter of an N-node Hypercube? 16. How does Openmp Provide a Shared-memory Programming Environment.? 17. What is the Memory Consistency Model Supported by Openmp? 18. How are Threads Allocated to Processors when there are more threads than the Number of Processors? 19. What is Common Crcw Pram? 20. What is the Impact of Limiting Pram Model to a Fixed Number of Processors or a Fixed Memory Size? 21. What is the Impact of Eliminating Shared Write from Pram? 22. What is the Significance of Work Complexity Analysis? 23. What does Bulk Synchronous Model Add to Pram for Parallel Algorithm Analysis? 24. Is it True that all Nc Problems Parallelize Well? 25. Is User Locking Required to Control the Order of Access to Guarantee Sequential Consistency? CU IDOL SELF LEARNING MATERIAL (SLM)
34 Parallel and Distributed Computing 26. What is the Difference Between Processor And FIFO Consistency? 27. What is False Sharing? 28. What is a Task Dependency Graph? 29. When can an MPI Send Call Return? 30. What is a Collective Communication Call? 31. How Cam MPI be Used For Shared Memory Style Programming? 32. What is the Complexity of Prefix Sum in Pram Model? 33. What is the Time Complexity of Optimal Merge Algorithm (on Pram)? 34. What is Accelerated Cascading? 35. Why Must Cuda Divide Computation Twice: Into Grids and then Blocks? 36. How do Memory Operations in Gpus Differ from those in Cpus? 37. How can Two Gpu Threads Communicate Through Shared Memory? 38. How can Prefix Minima be Found in O(1) Time? 39. How Long does Bitonic Sorting Require on Pram? 40. How Long does Batcher’s Odd-even Merge Require? 41. In Order To Balance Load For Parallel Bucket Sort Of N Elements, Uniformly Spaced Splitters Need To Be Selected. This Can Be Done By First Dividing The List Into B Lists And Choosing B Equi-spaced Samples From Each. The Final B Splitters Are Chosen Uniformly Spaced From These Samples. How Balanced Are The Buckets If These Splitters Are Used? 42. How Fast Can Two Sorted Lists of Size N Each Be Merged Into One Using P Processors? 43. How Fast Can A List Be Sorted Using N Processors Using Local Sorting Of N/p Elements Each Followed By Optimal Multi-way Merge? 44. When Stealing Load From A Random Loaded Processor, What Type Of Synchronization Is Needed? CU IDOL SELF LEARNING MATERIAL (SLM)
Parallel Computing 35 B. Multiple Choice/Objective Type Questions 1. Execution of several activities at the same time. (a) processing (b) parallel processing (c) serial processing (d) multitasking 2. A term for simultaneous access to a resource, physical or logical. (a) Multiprogramming (b) Multitasking (c) Threads (d) Concurrency 3. ____________ leads to concurrency. (a) Serialization (b) Parallelism (c) Serial processing (d) Distribution 4. A parallelism based on increasing processor word size. (a) Increasing (b) Count based (c) Bit based (d) Bit level 5. A type of parallelism that uses micro architectural techniques. (a) instructional (b) bit level (c) bit based (d) increasing 6. MIPS stands for? (a) Mandatory Instructions/sec (b) Millions of Instructions/sec (c) Most of Instructions/sec (d) Many Instructions/sec 7. The measure of the “effort” needed to maintain efficiency while adding processors. (a) Maintainablity (b) Efficiency (c) Scalabilty (d) Effectiveness 8. The rate at which the problem size need to be increased to maintain efficiency. (a) Isoeffciency (b) Efficiency (c) Scalabilty (d) Effectiveness CU IDOL SELF LEARNING MATERIAL (SLM)
36 Parallel and Distributed Computing 9. Several instructions execution simultaneously in ____________. (a) processing (b) parallel processing (c) serial processing (d) multitasking Answers 1. (b), 2. (d), 3. (b), 4. (d), 5. (a), 6. (b), 7. (c), 8. (a), 9. (b) 1.12 References References of this unit have been given at the end of the book. CU IDOL SELF LEARNING MATERIAL (SLM)
UNIT 2 PIPELINE PROCESSING
Structure:
2.0 Learning Objectives
2.1 Introduction
2.2 Pipeline Processing Introduction
2.3 Pipeline Performance
2.4 Arithmetic Pipelines
2.5 Pipelined Instruction Processing
2.6 Pipeline Stage Design
2.7 Hazards
2.8 Dynamic Instruction Scheduling
2.9 Summary
2.10 Key Words/Abbreviations
2.11 Learning Activity
2.12 Unit End Questions (MCQ and Descriptive)
2.13 References
2.0 Learning Objectives
After studying this unit, you will be able to:
Define pipeline performance
Explain arithmetic pipelines
Describe pipelined instruction processing
Explain pipeline stage design
Elaborate dynamic instruction scheduling
Understand pipeline hazards

2.1 Introduction
Pipelining
The term pipelining refers to a technique of decomposing a sequential process into sub-operations, with each sub-operation being executed in a dedicated segment that operates concurrently with all other segments. The most important characteristic of a pipeline technique is that several computations can be in progress in distinct segments at the same time. The overlapping of computation is made possible by associating a register with each segment in the pipeline. The registers provide isolation between segments so that each segment can operate on distinct data simultaneously.
The structure of a pipeline organization can be represented simply by including an input register for each segment followed by a combinational circuit.
Let us consider an example of a combined multiplication and addition operation to get a better understanding of the pipeline organization. The combined multiplication and addition operation is performed on a stream of numbers such as:
Ai × Bi + Ci for i = 1, 2, 3, ......., 7
The operation to be performed on the numbers is decomposed into sub-operations, with each sub-operation implemented in a segment within the pipeline. The sub-operations performed in each segment of the pipeline are defined as:
R1 ← Ai, R2 ← Bi        Input Ai and Bi
R3 ← R1 * R2, R4 ← Ci   Multiply, and input Ci
R5 ← R3 + R4            Add Ci to the product
The following block diagram represents the combined operation as well as the sub-operations performed in each segment of the pipeline.
Figure 2.1: Pipeline Processing
The registers hold the data, and the combinational circuit of each segment operates on the data held in that segment.
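To make the three segments above concrete, the following small C program simulates the pipeline one clock cycle at a time. It is an illustrative sketch only; the array names and sample values are invented here and are not part of the text. The later segments are updated first in each cycle so that they consume the values latched by the preceding segment on the previous clock edge.

#include <stdio.h>

#define N 7  /* seven (Ai, Bi, Ci) triples, as in the example above */

int main(void) {
    int A[N] = {1, 2, 3, 4, 5, 6, 7};   /* sample inputs (assumed values) */
    int B[N] = {7, 6, 5, 4, 3, 2, 1};
    int C[N] = {1, 1, 1, 1, 1, 1, 1};

    /* Segment registers: R1, R2 feed the multiplier; R3, R4 feed the adder;
       R5 holds the final result of each item. */
    int R1 = 0, R2 = 0, R3 = 0, R4 = 0, R5 = 0;

    /* The pipeline needs N + 2 clock cycles: 2 cycles to fill, then one
       completed result per cycle. */
    for (int clock = 0; clock < N + 2; clock++) {
        /* Evaluate the later segments first, so each segment uses the
           values latched by the previous segment on the previous clock. */
        if (clock >= 2) {
            R5 = R3 + R4;                     /* Segment 3: add Ci to product */
            printf("cycle %d: A[%d]*B[%d]+C[%d] = %d\n",
                   clock, clock - 2, clock - 2, clock - 2, R5);
        }
        if (clock >= 1 && clock <= N) {
            R3 = R1 * R2;                     /* Segment 2: multiply          */
            R4 = C[clock - 1];                /*            and input Ci      */
        }
        if (clock < N) {
            R1 = A[clock];                    /* Segment 1: input Ai and Bi   */
            R2 = B[clock];
        }
    }
    return 0;
}

After the two filling cycles, one result Ai × Bi + Ci emerges on every subsequent clock cycle, which is exactly the overlapping of computation described above.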
The output generated by the combinational circuit in a given segment is applied to the input register of the next segment. For instance, from the block diagram, we can see that register R3 is used as one of the input registers for the combinational adder circuit.
In general, the pipeline organization is applicable to two areas of computer design:
1. Arithmetic Pipeline
2. Instruction Pipeline

2.2 Pipeline Processing Introduction
What is Pipelining?
Pipelining is the process of feeding instructions to the processor through a pipeline. It allows instructions to be stored and executed in an orderly manner. It is also known as pipeline processing.
Pipelining is a technique in which multiple instructions are overlapped during execution. The pipeline is divided into stages, and these stages are connected with one another to form a pipe-like structure. Instructions enter from one end and exit from the other end. Pipelining increases the overall instruction throughput.
In a pipelined system, each segment consists of an input register followed by a combinational circuit. The register is used to hold data, and the combinational circuit performs operations on it. The output of the combinational circuit is applied to the input register of the next segment.
Figure 2.2: Combinational Circuit
A pipelined system is like a modern-day assembly line in a factory. For example, in a car manufacturing plant, huge assembly lines are set up, and at each point there are robotic arms to perform a certain task; the car then moves ahead to the next arm.

Working of Pipeline
Pipelining organizes the execution of multiple instructions simultaneously and improves the throughput of the system. In pipelining, an instruction is divided into subtasks, and each subtask is handled by a dedicated stage. The instruction is divided into 5 subtasks: instruction fetch, instruction decode, operand fetch, instruction execution and operand store. The instruction fetch subtask performs only the instruction fetching operation, the instruction decode subtask only decodes the fetched instruction, and so on for the other subtasks.

Have you ever visited an industrial plant and seen the assembly lines there? A product passes through the assembly line and, while passing, is worked on at different stages simultaneously. For example, take a car manufacturing plant. At the first stage, the automobile chassis is prepared; in the next stage, workers add the body to the chassis; further on, the engine is installed, then painting work is done, and so on. The group of workers, after working on the chassis of the first car, do not sit idle: they start working on the chassis of the next car, while the next group takes the first chassis and adds a body to it. The same thing is repeated at every stage; after finishing work on the current car, each group takes on the next car, which is the output of the previous stage. Although the first car takes several hours or days to complete, the assembly line arrangement makes it possible to have a new car roll off the line at short, regular intervals.

The concept of pipelining works in the same way. The output of one stage becomes the input of the next stage. It is like a set of data processing units connected in series to utilize the processor to its maximum. An instruction is divided into 5 subtasks, namely:
Instruction Fetch → Instruction Decode → Operand Fetch → Instruction Execute → Operand Store
1. In the first subtask, the instruction is fetched.
2. The fetched instruction is decoded in the second stage.
3. In the third stage, the operands of the instruction are fetched.
4. In the fourth stage, arithmetic and logical operations are performed on the operands to execute the instruction.
5. In the fifth stage, the result is stored in memory.
Now that we understand the division of an instruction into subtasks, let us see how n instructions in a program are pipelined. Look at the figure below, in which 5 instructions are pipelined. The first instruction completes in 5 clock cycles; after that, a new instruction completes its execution in every subsequent clock cycle.
Figure 2.3: Pipelining of 5 Instructions
Observe that when the instruction fetch operation of the first instruction is completed, the instruction fetch of the second instruction starts in the next clock cycle. This way the hardware never sits idle; it is always busy performing some operation. However, no two instructions can occupy the same stage in the same clock cycle.
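The overlap shown in Figure 2.3 can also be generated programmatically. The short C sketch below prints the space-time table for 5 instructions flowing through the 5 stages; the abbreviations IF, ID, OF, EX, OS and the variable names are my own shorthand for the five subtasks listed above, not notation from the text. Instruction i enters a stage exactly one cycle after instruction i-1 does.

#include <stdio.h>

#define STAGES        5
#define INSTRUCTIONS  5

int main(void) {
    const char *stage_name[STAGES] = {"IF", "ID", "OF", "EX", "OS"};
    int total_cycles = STAGES + INSTRUCTIONS - 1;   /* 5 + 5 - 1 = 9 cycles */

    printf("cycle:");
    for (int c = 1; c <= total_cycles; c++) printf("%5d", c);
    printf("\n");

    /* Instruction i (0-based) occupies stage (c - i) in cycle c, i.e. it is
       fetched one cycle after the previous instruction was fetched. */
    for (int i = 0; i < INSTRUCTIONS; i++) {
        printf("I%d:   ", i + 1);
        for (int c = 0; c < total_cycles; c++) {
            int s = c - i;                          /* stage occupied this cycle */
            if (s >= 0 && s < STAGES)
                printf("%5s", stage_name[s]);
            else
                printf("%5s", ".");
        }
        printf("\n");
    }
    return 0;
}

Note that n instructions take k + n - 1 cycles on a k-stage pipeline; for 5 instructions and 5 stages this is 9 cycles, matching the figure.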
Types of Pipelining
In 1977, Handler and Ramamoorthy classified pipeline processors according to their functionality.
1. Arithmetic Pipelining
It is designed to perform high-speed floating-point addition, multiplication and division. Here, multiple arithmetic logic units are built into the system to perform parallel arithmetic computation on various data formats. Examples of arithmetic pipelined processors are the Star-100, TI-ASC, Cray-1 and Cyber-205.
2. Instruction Pipelining
Here, a number of instructions are pipelined, and the execution of the current instruction is overlapped with the execution of subsequent instructions. It is also called instruction lookahead.
3. Processor Pipelining
Here, the processors are pipelined to process the same data stream. The data stream is processed by the first processor and the result is stored in a memory block. The result in the memory block is accessed by the second processor. The second processor reprocesses the result obtained by the first processor and then passes the refined result to the third processor, and so on.
4. Unifunction vs. Multifunction Pipelining
A pipeline that performs the same fixed function every time is a unifunctional pipeline. On the other hand, a pipeline that performs multiple functions at different times, or multiple functions at the same time, is a multifunction pipeline.
5. Static vs. Dynamic Pipelining
The static pipeline performs a fixed function each time; it is unifunctional. The static pipeline executes the same type of instruction continuously, and frequent changes in the type of instruction may degrade its performance. The dynamic pipeline performs several functions simultaneously; it is a multifunction pipeline.
6. Scalar vs. Vector Pipelining
Scalar pipelining processes instructions with scalar operands. Vector pipelining processes instructions with vector operands.
Pipeline Conflicts
There are some factors that cause the pipeline to deviate from its normal performance. Some of these factors are given below:
1. Timing Variations
All stages cannot take the same amount of time. This problem generally occurs in instruction processing, where different instructions have different operand requirements and thus different processing times.
2. Data Hazards
When several instructions are in partial execution and they reference the same data, a problem arises. We must ensure that a later instruction does not attempt to access the data before the current instruction has produced it, because this would lead to incorrect results.
3. Branching
In order to fetch and execute the next instruction, we must know what that instruction is. If the present instruction is a conditional branch, whose result decides the next instruction, then the next instruction may not be known until the current one is processed.
4. Interrupts
Interrupts insert unwanted instructions into the instruction stream and thereby affect the execution of instructions.
5. Data Dependency
It arises when an instruction depends upon the result of a previous instruction, but that result is not yet available.
Advantages of Pipelining
1. The cycle time of the processor is reduced.
2. It increases the throughput of the system.
3. It makes the system reliable.
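The data-hazard and data-dependency cases can be made more tangible with a toy model. The C sketch below is not from the text: it assumes a 5-stage pipeline without forwarding, in which operands are read in the operand-fetch stage and a result becomes usable only in the cycle after its operand-store stage, and it counts the stall cycles a dependent instruction must wait. The instruction strings and register numbers are invented for illustration.

#include <stdio.h>

/* A toy instruction: one destination register and up to two source registers. */
typedef struct {
    const char *text;
    int dest;      /* register written, -1 if none */
    int src1;      /* registers read, -1 if unused */
    int src2;
} Instr;

int main(void) {
    /* I2 reads R1 immediately after I1 writes it: a read-after-write dependency. */
    Instr prog[] = {
        {"I1: R1 = R2 + R3", 1, 2, 3},
        {"I2: R4 = R1 + R5", 4, 1, 5},
        {"I3: R6 = R7 + R8", 6, 7, 8},
    };
    int n = sizeof prog / sizeof prog[0];

    int ready[16] = {0};   /* cycle at which each register's latest value is usable */
    int issue = 1;         /* cycle in which the current instruction enters IF       */

    for (int i = 0; i < n; i++) {
        int need = issue + 2;   /* cycle of its operand-fetch stage (stage 3) */
        int wait = 0;
        if (prog[i].src1 >= 0 && ready[prog[i].src1] - need > wait)
            wait = ready[prog[i].src1] - need;
        if (prog[i].src2 >= 0 && ready[prog[i].src2] - need > wait)
            wait = ready[prog[i].src2] - need;

        issue += wait;          /* stall cycles inserted before this instruction */
        printf("%s  issues in cycle %d (%d stall%s)\n",
               prog[i].text, issue, wait, wait == 1 ? "" : "s");

        if (prog[i].dest >= 0)
            ready[prog[i].dest] = issue + 5;  /* usable the cycle after operand store */
        issue += 1;             /* next instruction normally issues one cycle later */
    }
    return 0;
}

Running this model, I2 incurs two stall cycles because it needs R1 before I1 has stored it, while the independent I3 proceeds without delay; hardware forwarding or dynamic scheduling (Section 2.8) exists precisely to reduce such stalls.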
Disadvantages of Pipelining
1. The design of a pipelined processor is complex and costly to manufacture.
2. The latency of an individual instruction increases.

2.3 Pipeline Performance
Performance of Pipelined Execution
The following parameters serve as criteria to estimate the performance of pipelined execution:
Speed Up
Efficiency
Throughput
1. Speed Up
It gives an idea of "how much faster" the pipelined execution is as compared to non-pipelined execution. It is calculated as:
Speed Up (S) = (Non-pipelined execution time) / (Pipelined execution time)
2. Efficiency
The efficiency of pipelined execution is calculated as:
Efficiency (η) = (Speed Up) / (Number of stages in the pipelined architecture)
OR
Efficiency (η) = (Number of boxes utilized in the phase-time diagram) / (Total number of boxes in the phase-time diagram)
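As a worked example of these ratios, assume the common textbook model (not stated explicitly above) in which a k-stage pipeline with clock period tp processes n tasks in (k + n - 1) cycles, while a non-pipelined unit needs k·tp per task. The small C program below evaluates speed-up, efficiency and throughput under that assumption; the variable names and sample numbers are my own.

#include <stdio.h>

int main(void) {
    int    k  = 5;       /* number of pipeline stages (e.g., IF, ID, OF, EX, OS) */
    int    n  = 100;     /* number of instructions/tasks                         */
    double tp = 1.0;     /* clock period of one pipeline stage, in ns (assumed)  */

    /* Assumed model: a non-pipelined processor takes k*tp per task, while the
       pipeline needs k cycles for the first task and 1 cycle for each of the rest. */
    double t_nonpipe = (double)n * k * tp;
    double t_pipe    = (double)(k + n - 1) * tp;

    double speedup    = t_nonpipe / t_pipe;    /* = n*k / (k + n - 1)            */
    double efficiency = speedup / k;           /* fraction of stage-slots in use */
    double throughput = n / t_pipe;            /* tasks completed per ns         */

    printf("Speed up   S = %.3f (ideal limit = %d)\n", speedup, k);
    printf("Efficiency e = %.3f\n", efficiency);
    printf("Throughput W = %.3f tasks/ns\n", throughput);
    return 0;
}

As n grows, the speed-up approaches k, so under this model the number of stages is the upper bound on the ideal speed-up, and the efficiency approaches 1.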