
MCA 635_Parallel and Distributed Computing


3. Throughput

Throughput is defined as the number of instructions executed per unit time. It is calculated as:

Throughput = Number of instructions executed / Total time taken

2.4 Arithmetic Pipelines

Arithmetic pipelines are mostly used in high-speed computers. They are used to implement floating-point operations, multiplication of fixed-point numbers, and similar computations encountered in scientific problems.

To understand the concept of an arithmetic pipeline more conveniently, let us consider an example of a pipeline unit for floating-point addition and subtraction. The inputs to the floating-point adder pipeline are two normalized floating-point numbers defined as X = A × 2^a and Y = B × 2^b, where A and B are two fractions that represent the mantissas and a and b are the exponents. For the worked example, decimal values are used:

X = 0.9504 × 10^3
Y = 0.8200 × 10^2

The combined operation of floating-point addition and subtraction is divided into four segments. Each segment contains the corresponding suboperation to be performed in the given pipeline. The suboperations in the four segments are:

1. Compare the exponents by subtraction.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalize the result.

We will discuss each suboperation in more detail later in this section.
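Before the detailed walkthrough, the following short Python sketch (added here as an illustration; the function name and the decimal representation of the operands are assumptions, not part of the original material) models the four segments acting on the example operands X = 0.9504 × 10^3 and Y = 0.8200 × 10^2.

```python
# Minimal model of the four sub-operations of a floating-point add/subtract pipeline.
# Mantissas are held as plain Python floats purely for illustration.

def fp_add_pipeline(mant_a, exp_a, mant_b, exp_b):
    # Segment 1: compare the exponents by subtraction; the larger one is kept.
    diff = exp_a - exp_b
    exponent = max(exp_a, exp_b)

    # Segment 2: align the mantissa associated with the smaller exponent.
    if diff > 0:
        mant_b /= 10 ** diff
    else:
        mant_a /= 10 ** (-diff)

    # Segment 3: add (or subtract) the aligned mantissas.
    mantissa = mant_a + mant_b

    # Segment 4: normalize so that the mantissa is a fraction less than 1.
    while abs(mantissa) >= 1.0:
        mantissa /= 10
        exponent += 1

    return mantissa, exponent

print(fp_add_pipeline(0.9504, 3, 0.8200, 2))
# -> roughly (0.10324, 4), i.e. 0.10324 x 10^4, matching the worked example below
```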

The following block diagram represents the suboperations performed in each segment of the pipeline (pipeline organization for floating-point addition and subtraction):

Figure 2.4: Arithmetic Operation

Note: Registers are placed after each suboperation to store the intermediate results.

1. Compare the Exponents by Subtraction

The exponents are compared by subtracting them to determine their difference. The larger exponent is chosen as the exponent of the result.

The difference of the exponents, i.e., 3 – 2 = 1, determines how many times the mantissa associated with the smaller exponent must be shifted to the right.

2. Align the Mantissas

The mantissa associated with the smaller exponent is shifted right according to the difference of exponents determined in segment one:

X = 0.9504 × 10^3
Y = 0.08200 × 10^3

3. Add the Mantissas

The two mantissas are added in segment three:

Z = X + Y = 1.0324 × 10^3

4. Normalize the Result

After normalization, the result is written as:

Z = 0.10324 × 10^4

2.5 Pipelined Instruction Processing

1. In the 8086, to speed up program execution, the fetching and execution of instructions are overlapped with each other.
2. This process of fetching the next instruction while the present instruction is being executed is called pipelining.
3. In pipelining, while the nth instruction is being executed, the (n+1)th instruction is fetched, and thus the processing speed is increased.
4. Pipelining has become possible due to the use of a queue.
5. The BIU (Bus Interface Unit) fills the queue until the entire queue is full.
6. The BIU resumes filling the queue when at least two locations of the queue are vacant.

The Instruction Queue

1. The execution unit (EU) decodes and executes instructions.
2. Decoding does not require the use of the buses.
3. While the EU is busy decoding and executing an instruction, the BIU fetches up to six instruction bytes for the next instructions.
4. These bytes are called pre-fetched bytes, and they are stored in a first-in first-out (FIFO) register set called a queue.

Significance of the Queue

Figure 2.5: Significance of Queue

1. As shown in the above figure, while the EU is busy decoding the instruction at memory location 100F0, the BIU fetches the next six instruction bytes from locations 100F1 to 100F6, numbered 1 to 6.
2. These instruction bytes are stored in the 6-byte queue on a first-in first-out (FIFO) basis.
3. When the EU completes execution of the current instruction and becomes ready for the next one, it simply reads the instruction bytes in the sequence 1, 2, … from the queue.

4. Thus, the queue always holds the instruction bytes of the next instructions to be executed by the EU.

Advantages of Pipelining

1. The EU always reads the next instruction byte from the queue in the BIU. This is much faster than sending out an address to memory and waiting for the next instruction byte to arrive.
2. In short, pipelining eliminates the waiting time of the EU and speeds up processing.
3. The 8086 BIU will not initiate a fetch until there are two empty bytes in its queue; it normally obtains two instruction bytes per fetch.

Disadvantages of Pipelining

1. While pipelining can sharply cut the time taken to execute a program, there are problems that cause it to not work as well as it should.
2. The three stages of the instruction execution process do not necessarily take an equal amount of time, with the time taken for 'execute' generally being longer than 'fetch'. This makes it much harder to synchronise the various stages of different instructions.
3. Also, some instructions may depend on the results of earlier instructions. This can arise when data produced earlier needs to be used, or when a conditional branch based on a previous outcome is taken.

Instruction Pipeline

Pipeline processing can occur not only in the data stream but in the instruction stream as well. Most digital computers with complex instructions require an instruction pipeline to carry out operations such as fetching, decoding and executing instructions. In general, the computer needs to process each instruction with the following sequence of steps:

1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.

5. Execute the instruction.
6. Store the result in the proper place.

Each step is executed in a particular segment, and there are times when different segments take different times to operate on the incoming information. Moreover, there are times when two or more segments require memory access at the same time, causing one segment to wait until another is finished with the memory.

The organization of an instruction pipeline will be more efficient if the instruction cycle is divided into segments of equal duration. One of the most common examples of this type of organization is the four-segment instruction pipeline.

A four-segment instruction pipeline combines two or more different steps into a single segment. For instance, the decoding of the instruction can be combined with the calculation of the effective address into one segment.

The following block diagram shows a typical example of a four-segment instruction pipeline. The instruction cycle is completed in four segments.

Figure 2.6: Four-segment Instruction Pipeline

Segment 1: The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.

Segment 2: The instruction fetched from memory is decoded in the second segment, and eventually the effective address is calculated in a separate arithmetic circuit.

Segment 3: An operand from memory is fetched in the third segment.

Segment 4: The instructions are finally executed in the last segment of the pipeline organization.

2.6 Pipeline Stage Design

Pipeline Stages

A RISC processor has a 5-stage instruction pipeline to execute all the instructions in the RISC instruction set. Following are the 5 stages of the RISC pipeline with their respective operations:

• Stage 1 (Instruction Fetch): In this stage the CPU reads the instruction from the memory address held in the program counter.
• Stage 2 (Instruction Decode): In this stage, the instruction is decoded and the register file is accessed to get the values of the registers used in the instruction.
• Stage 3 (Instruction Execute): In this stage, ALU operations are performed.
• Stage 4 (Memory Access): In this stage, memory operands are read from or written to the memory location referenced by the instruction.
• Stage 5 (Write Back): In this stage, the computed or fetched value is written back to the register specified in the instruction.

Four-Stage Pipeline

In a four-stage pipelined architecture, the execution of each instruction is completed in the following 4 stages:

1. Instruction fetch (IF)
2. Instruction decode (ID)
3. Instruction execute (IE)
4. Write back (WB)

To implement the four-stage pipeline:

• The hardware of the CPU is divided into four functional units.
• Each functional unit performs a dedicated task.

Figure 2.7: Four-Stage Instruction Pipeline

Stage-01: At stage-01,
• The first functional unit performs instruction fetch.
• It fetches the instruction to be executed.

Stage-02: At stage-02,
• The second functional unit performs instruction decode.
• It decodes the instruction to be executed.

Stage-03: At stage-03,
• The third functional unit performs instruction execution.
• It executes the instruction.

Stage-04: At stage-04,
• The fourth functional unit performs write back.
• It writes back the result obtained after executing the instruction.

Execution: In a pipelined architecture,
• Instructions of the program execute in parallel.
• When one instruction goes from the nth stage to the (n + 1)th stage, another instruction goes from the (n – 1)th stage to the nth stage.

Phase-Time Diagram
• A phase-time diagram shows the execution of instructions in the pipelined architecture.
• The following diagram shows the execution of three instructions in a four-stage pipeline architecture.

Figure 2.8: Phase-Time Diagram

Time taken to execute three instructions in the four-stage pipelined architecture = 6 clock cycles.
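The figure of 6 clock cycles generalizes: on an ideal k-stage pipeline (one cycle per stage, no stalls), n instructions need k + (n - 1) cycles. The following small Python check is added here for illustration only and is not part of the original text.

```python
# Cycle counts for an ideal k-stage pipeline with one cycle per stage and no stalls.

def pipelined_cycles(k, n):
    # The first instruction takes k cycles to fill the pipe; afterwards,
    # one instruction completes in every cycle.
    return k + (n - 1)

def non_pipelined_cycles(k, n):
    return k * n

k, n = 4, 3
print(pipelined_cycles(k, n))       # 6 clock cycles, as stated above
print(non_pipelined_cycles(k, n))   # 12 clock cycles without pipelining
print(non_pipelined_cycles(k, n) / pipelined_cycles(k, n))   # speed-up = 2.0
```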

2.7 Hazards

Pipeline Hazards

There are situations, called hazards, that prevent the next instruction in the instruction stream from executing during its designated clock cycle. Hazards reduce the performance from the ideal speedup gained by pipelining. There are three classes of hazards:

• Structural hazards. They arise from resource conflicts when the hardware cannot support all possible combinations of instructions in simultaneous overlapped execution.
• Data hazards. They arise when an instruction depends on the result of a previous instruction in a way that is exposed by the overlapping of instructions in the pipeline.
• Control hazards. They arise from the pipelining of branches and other instructions that change the PC.

Hazards in pipelines can make it necessary to stall the pipeline. The processor can stall on different events:

• A cache miss. A cache miss stalls all the instructions in the pipeline, both before and after the instruction causing the miss.
• A hazard in the pipeline. Eliminating a hazard often requires that some instructions in the pipeline be allowed to proceed while others are delayed. When an instruction is stalled, all the instructions issued later than the stalled instruction are also stalled. Instructions issued earlier than the stalled instruction must continue, since otherwise the hazard will never clear.

A hazard causes pipeline bubbles to be inserted: during a stall, no new instructions are fetched, and the completion of the instructions already in the pipeline is delayed by the same number of cycles.

1. Data Dependency

Consider the following two instructions and their pipeline execution:

Table 2.1: Data Dependency

In the figure above, you can see that the result of the Add instruction is stored in register R2, and the final result is available only at the end of the execution of the instruction, which happens at clock cycle t4. But the Sub instruction needs the value of register R2 at cycle t3. So the Sub instruction has to stall for two clock cycles; if it does not stall, it will generate an incorrect result. This dependence of one instruction on another instruction for data is called data dependency.

2. Memory Delay

When an instruction or data item is required, it is first searched for in the cache memory; if it is not found there, it is a cache miss. The data is then searched for in main memory, which may take ten or more cycles. The pipeline has to stall for that number of cycles, and this is a memory delay hazard. A cache miss also delays all the subsequent instructions.

3. Branch Delay

Suppose four instructions I1, I2, I3, I4 are pipelined in sequence. Instruction I1 is a branch instruction and its target instruction is Ik. Processing starts and instruction I1 is fetched and decoded, and the target address is computed at the 4th stage in cycle t3. By then, however, the instructions I2, I3 and I4 have already been fetched in cycles 1, 2 and 3, before the target branch address is computed. As I1 is found to be a branch instruction, the instructions I2, I3 and I4 have to be discarded because instruction Ik has to be processed next after I1. This delay of three cycles (1, 2 and 3) is a branch delay.

Table 2.2: Branch Delay

Prefetching the branch target will reduce the branch delay. For example, if the branch target is identified at the decode stage, the branch delay reduces to 1 clock cycle.

Table 2.3: Branch Delay when the Branch Target is Determined at the Decode Stage

2.8 Dynamic Instruction Scheduling

Introduction

The order in which the instructions comprising a program are to be executed is normally assumed to be given by the order in which the instructions are held in program storage and by the sequencing control indicated by transfer and conditional transfer instructions. However, a programmer or compiler can produce many different but equivalent versions of a program merely by making minor alterations to the sequence in which instructions are placed.

Normally, the actual choice among these alternative sequences will be somewhat arbitrary, though careful programming or compilation often involves an attempt to design a program whose detailed sequences are tailored to make best use of a computer's control and functional capabilities. This can be particularly worthwhile for computers whose internal organization has been designed to overlap the use of its various functional capabilities.

Take, for example, a computer which initiates execution of instructions in strict sequence, without necessarily awaiting the completion of one instruction before execution of the next, provided that the operands of the second instruction are ready and the necessary busses and functional units are available. In such a computer the sequence (written here for convenience in a 3-address format)

R1 + R2 -> R3
R1 × R4 -> R5
R6 + R2 -> R7
R3 × R6 -> R8

might well be preferable to

R1 + R2 -> R3
R6 + R2 -> R7
R1 × R4 -> R5
R3 × R6 -> R8

if the adder and multiplier were independent functional units. Thus, if really effective use is to be made of the internal capabilities of such a computer, careful attention must be paid to the detailed sequencing of instructions in frequently executed portions of a program. This 'scheduling' can be done by an ambitious optimizing compiler, or an extremely conscientious hand-coder. There is often, however, a difficulty in achieving really optimum sequencing by such means -- that of the effects of memory interference.

If present, such interference will cause variations in the times that operands take to reach the arithmetic and control unit from storage. The effects of such memory interference will not usually be calculable in advance of program execution, particularly if the interference is caused by autonomous I/O units using the memory. Thus there is often cause to consider the possibility of supplementing (or even replacing) the static scheduling performed by coder or compiler with dynamic scheduling performed by the computer as it executes a program.

Here we describe a technique of dynamic scheduling permitting non-sequential instruction execution. Furthermore, the technique presented is capable of controlling the simultaneous execution of two or more instructions at a time on machines with sufficiently generous bussing and functional capabilities. In any actual computer design, care would of course have to be taken to ensure that any possible gains achieved by such dynamic scheduling were not offset by the cost (both in speed and in circuits) of the extra hardware necessary to perform the scheduling. The scheme presented uses a very general, but conceptually simple, method of controlling non-sequential instruction execution, and of identifying groups of instructions which are mutually independent and can be executed simultaneously.

Non-sequential Instruction Execution

In this section we restrict our attention to the sequencing of straight-line code comprised of instructions the locations of whose operands and results can be determined directly from the instructions themselves, rather than needing any address computation to be performed.

The sequence in which a series of instructions has been written implies the total effect that these instructions are intended to have when executed. Each separate instruction contributes to this total effect by performing its operations on the contents of certain registers (accumulators, index registers, indicators, etc.) and setting its results into other registers. A dynamic scheduling technique has to ensure that any instructions obeyed out of sequence do not change the contents of any registers which are to be used by any instructions whose execution has been temporarily delayed.

A simple set of rules for determining whether a given instruction can be obeyed out of sequence is as follows:

(i) The required busses and functional units are available.
(ii) The instruction must not use any registers which are used as result registers by instructions whose execution has been initiated but not yet completed.
(iii) The instruction must not use as result registers any registers which are used as operand registers by any preceding instructions which have not yet been initiated.
(iv) The instruction must not use any registers (either as result or operand registers) which are used as result registers by any preceding instructions which have not yet been initiated.

These checks can be made in a systematic fashion using what are here called 'sequencing matrices'. Two matrices are used, namely a 'source matrix' (S) and a 'destination matrix' (D). At each cycle, when the machine is attempting to choose an instruction to be executed, rows in these matrices are set up corresponding to each of the instructions which are being considered by the scheduling mechanism. (The cycle referred to here is a clock cycle, which corresponds to the maximum rate at which instructions can be initiated, and will presumably be much shorter than a storage cycle.)

The elements in each row of the matrices indicate whether a given register is being used, or will be affected, by the corresponding instruction. The element S(i, j) is set to one if the ith instruction uses the contents of register j as an operand. The element D(i, j) is set to one if execution of the ith instruction will cause the contents of register j to be replaced.

Take, for example, a very simple machine with eight registers and a 3-address format, using a scheduling mechanism that processes four instructions per cycle. A typical situation would be:

Figure 2.9: Unary Form

Thus each row has been set up by processing the register address fields of the corresponding instructions, and converting these addresses into unary form. However, in more realistic machines the setting up of the matrix elements would not be so straightforward. Almost certainly it would involve decoding the operation-code part of the instruction to determine what implied registers are used by an instruction in addition to those indicated by address fields.

In addition to the matrices, which provide a conveniently coded form of indicating the register requirements of instructions awaiting execution, a 'busy vector' (B) is used to indicate the current status of the machine registers. The length of the vector is equal to the number of registers. The element B(j) is set to one when execution of an instruction which will cause the contents of register j to be replaced is initiated; it is reset to zero when the replacement has been completed.

Once the sequencing matrices and the busy vector have been set up as described, the basic algorithm for choosing an instruction to be executed can be described as follows. Starting with the top row of the matrices, each instruction is checked; instruction i can be executed if:

(i) The required busses and functional units are available.
(ii) The elements of B corresponding to the non-zero elements of the ith rows of S and D are zero.
(iii) The elements above row i of the columns of D corresponding to the non-zero elements of row i of S contain only zeroes.
(iv) The elements above row i of the columns of S and D corresponding to non-zero elements of row i of D contain only zeroes.
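The issue test defined by rules (i)-(iv) can be expressed compactly in software. The following Python sketch is an illustration added here under assumed data structures (it is not taken from the original description): each row of S and D is represented as a set of register indices, B is a list of 0/1 flags, and rule (i), the availability of busses and functional units, is simply assumed to hold.

```python
# S[i]: operand (source) registers of the ith pending instruction
# D[i]: result (destination) registers of the ith pending instruction
# B[j]: 1 while register j is still waiting to have its contents replaced

def can_issue(i, S, D, B):
    # Rule (ii): no register used by instruction i may still be busy.
    if any(B[j] for j in S[i] | D[i]):
        return False
    for k in range(i):  # preceding instructions that have not yet been initiated
        # Rule (iii): no earlier pending instruction may write a register that i reads.
        # Rule (iv): no earlier pending instruction may read or write a register that i writes.
        if (S[i] & D[k]) or (D[i] & (S[k] | D[k])):
            return False
    return True

def choose_instruction(S, D, B):
    # Scan from the top row and return the first instruction that can be initiated.
    for i in range(len(S)):
        if can_issue(i, S, D, B):
            return i
    return None  # nothing can be initiated this cycle

# Hypothetical situation (registers numbered 0-7; registers 3 and 6 are still busy):
S = [{1, 3}, {1, 2}, {4, 5}]   # operand registers per pending instruction
D = [{2},    {7},    {0}]      # result registers per pending instruction
B = [0, 0, 0, 1, 0, 0, 1, 0]
print(choose_instruction(S, D, B))   # -> 2: the first row fails rule (ii),
                                     #       the second row fails rule (iii)
```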

Returning to the previous example, suppose the busy vector is set up to indicate that certain registers, 3 and 6 for instance, are still to have their contents replaced by the action of previously initiated instructions:

Figure 2.10: Busy Vector

Instruction 1 cannot be executed because of rule (ii).
Instruction 2 cannot be executed because of rules (iii) and (iv).
However, instruction 3 can be executed, provided that the necessary bussing and functional capabilities are available.

Each cycle, while the scheduling mechanism is attempting to choose an instruction to initiate, a decoding mechanism could be processing a further instruction, taken from the address in the instruction store given by an instruction counter. In contrast to a conventional instruction counter, this counter does not indicate which instruction is currently being executed, but rather which instruction is next in line for processing by the scheduling mechanism. With non-sequential instruction sequencing it is not possible to have a conventional instruction counter. This can in certain circumstances be a disadvantage of the system, and is discussed further below.

At the end of a cycle, if an instruction has been chosen (it is of course possible that none of the instructions can be initiated until some of the non-zero elements of the busy vector become zero), the rows corresponding to that instruction are removed from the matrices. The remaining rows are then pushed upwards to fill in any gap, the bottom row of the matrix is replenished using the instruction which has just been decoded, and the instruction counter is incremented. All is then ready for the scheduling mechanism to again scan the matrices in an attempt to choose another instruction to initiate.

In the above example, the situation at the start of the next cycle might be (assuming that registers 3 and 6 have still not had their contents replaced) as shown in Fig. 2.11. During this cycle the Divide instruction will be chosen for execution.

Figure 2.11: Cycle of Instruction

In the above general description of the proposed technique for non-sequential instruction execution, the discussion has been limited to the scheduling of straight-line code composed of instructions whose register requirements can be determined immediately from inspection of the instructions. The next two sections deal with the effect of unconditional and conditional branch instructions, and with a technique for scheduling instructions which refer to indexed addresses in storage.

Unconditional and Conditional Branching

There is one kind of branch instruction, namely the unconditional branch to an explicit instruction address, which can be handled very simply, without recourse to the sequencing matrices. The instruction is executed as soon as it has been decoded, causing the appropriate modification to the instruction counter which indicates the location from which the sequencing matrices are to be replenished.

The other types of branch instructions, where the branch address and/or the question of whether the branch is to be taken cannot be determined directly from the instruction, but rather depend on the contents of one or more registers, cause rows to be entered into the sequencing matrices in the usual way. However, refilling of the matrices then stops until the branch instruction has been executed and any necessary modification has been made to the instruction counter. Thus, once such a branch instruction has entered the matrices, the matrices will gradually empty until the execution of the branch instruction permits refilling to begin.

This means that every effort should be made to initiate execution of the branch instruction as soon as possible, and that once the branch instruction has been executed, empty rows of the matrix should be replenished as quickly as possible. Otherwise the matrices will spend much of their time only partly full, and the chances of finding an executable instruction each cycle will be considerably reduced.

Since a scan of the matrices enables all the executable instructions to be identified, what is required is to ensure that a branch instruction is given priority over any other executable instructions. The simplest way of doing this, since there can never be any instructions in the matrices below a branch instruction, is to always choose the lowest executable instruction, whether or not this is a branch instruction. However, it could be argued that this takes unnecessary liberties with the sequencing of a program, which will cause undue complications in program debugging. The alternative is to arrange some system whereby an executable branch instruction is initiated if one exists, but otherwise the highest executable instruction is chosen.

The second requirement, that of speedy replenishment of the matrices once a branch instruction has been executed, requires decoding facilities operating in parallel on several instructions. The alternative of relying solely on the normal decoding and replenishment mechanism, which fills only one row each cycle, is unlikely to be adequate.

An 'Execute' instruction, which can be regarded as a temporary branch for the duration of a single instruction, involves only slight extensions to the above system. Filling of the matrices is halted once an Execute instruction has been reached, until it can be obeyed and the instruction which it specifies can be fetched and placed in the matrices. Unless this is another Execute instruction, or a branch instruction, filling of the matrices can then be resumed, starting with the instruction following the original Execute instruction.

2.9 Summary

Pipelining is the process of feeding instructions to the processor through a pipeline. It allows storing and executing instructions in an orderly process, and is also known as pipeline processing. Pipelining is a technique in which multiple instructions are overlapped during execution.

The pipeline is divided into stages, and these stages are connected with one another to form a pipe-like structure. Instructions enter from one end and exit from the other end. Pipelining increases the overall instruction throughput.

In a pipeline system, each segment consists of an input register followed by a combinational circuit. The register is used to hold data and the combinational circuit performs operations on it. The output of the combinational circuit is applied to the input register of the next segment.

Pipelined Execution

In a pipelined architecture, multiple instructions are executed in parallel. This style of executing instructions is highly efficient.

Instruction Pipelining

Instruction pipelining is a technique that implements a form of parallelism called instruction-level parallelism within a single processor. A pipelined processor does not wait until the previous instruction has executed completely. Rather, it fetches the next instruction and begins its execution.

Pipelined Architecture

In a pipelined architecture,

• The hardware of the CPU is split up into several functional units.
• Each functional unit performs a dedicated task.
• The number of functional units may vary from processor to processor.
• These functional units are called the stages of the pipeline.
• The control unit manages all the stages using control signals.
• There is a register associated with each stage that holds the data.
• There is a global clock that synchronizes the working of all the stages.
• At the beginning of each clock cycle, each stage takes the input from its register.
• Each stage then processes the data and feeds its output to the register of the next stage.

2.10 Key Words/Abbreviations

• Speed-up: It gives an idea of how much faster the pipelined execution is compared to non-pipelined execution.
• Efficiency: The efficiency of pipelined execution, i.e., the ratio of the speed-up obtained to the number of pipeline stages.
• Throughput: The number of instructions executed per unit time.
• Arithmetic Pipeline: A pipeline designed to perform high-speed floating-point addition, multiplication and division.
• Instruction Pipeline: Here, a number of instructions are pipelined and the execution of the current instruction is overlapped by the execution of the subsequent instruction.
• Processor Pipelining: Here, the processors are pipelined to process the same data stream.

2.11 Learning Activity

1. Propose a change to the 5-stage MIPS pipeline (as shown in the notes) to accommodate this addressing mode. Draw the new pipeline and explain the changes.

------------------------------------------------------------------------
------------------------------------------------------------------------

2. What new data hazards are created by this addressing mode using your modified pipeline? Give an example of each of the hazards you find. Explain where in the pipeline the hazard occurs.

------------------------------------------------------------------------
------------------------------------------------------------------------

3. List all changes or additions which must be made to the hardware to accommodate this new addressing mode.

------------------------------------------------------------------------
------------------------------------------------------------------------

2.12 Unit End Questions (MCQ and Descriptive)

A. Descriptive Type Questions

1. Consider a pipeline having 4 phases with durations 60, 50, 90 and 80 ns. The latch delay is 10 ns. Calculate:
(i) Pipeline cycle time
(ii) Non-pipeline execution time
(iii) Speed-up ratio
(iv) Pipeline time for 1000 tasks
(v) Sequential time for 1000 tasks
(vi) Throughput

2. A four-stage pipeline has stage delays of 150, 120, 160 and 140 ns respectively. Registers are used between the stages and have a delay of 5 ns each. Assuming a constant clocking rate, the total time taken to process 1000 data items on the pipeline will be:
(i) 120.4 microseconds
(ii) 160.5 microseconds
(iii) 165.5 microseconds
(iv) 590.0 microseconds

3. Consider a non-pipelined processor with a clock rate of 2.5 gigahertz and an average of 4 cycles per instruction. The same processor is upgraded to a pipelined processor with five stages, but due to the internal pipeline delay the clock speed is reduced to 2 gigahertz. Assume there are no stalls in the pipeline. The speed-up achieved in this pipelined processor is:
(i) 3.2
(ii) 3.0
(iii) 2.2
(iv) 2.0

4. The stage delays in a 4-stage pipeline are 800, 500, 400 and 300 picoseconds. The first stage is replaced with a functionally equivalent design involving two stages with respective delays 600 and 350 picoseconds. The throughput increase of the pipeline is __________ %.

5. A non-pipelined single-cycle processor operating at 100 MHz is converted into a synchronous pipelined processor with five stages requiring 2.5 ns, 1.5 ns, 2 ns, 1.5 ns and 2.5 ns respectively. The delay of the latches is 0.5 ns.

The speed-up of the pipelined processor for a large number of instructions is:
(i) 4.5
(ii) 4.0
(iii) 3.33
(iv) 3.0

6. We have two designs, D1 and D2, for a synchronous pipelined processor. D1 has a 5-stage pipeline with stage execution times of 3 ns, 2 ns, 4 ns, 2 ns and 3 ns, while design D2 has 8 pipeline stages each with a 2 ns execution time. How much time can be saved using design D2 over design D1 for executing 100 instructions?
(i) 214 ns
(ii) 202 ns
(iii) 86 ns
(iv) 200 ns

7. Consider an instruction pipeline with four stages (S1, S2, S3 and S4), each with a combinational circuit only. Pipeline registers are required between each stage and at the end of the last stage. Delays for the stages and for the pipeline registers are as given in the figure. What is the approximate speed-up of the pipeline in steady state under ideal conditions when compared to the corresponding non-pipelined implementation?
(i) 4.0
(ii) 2.5
(iii) 1.1
(iv) 3.0

8. Consider a 4-stage pipeline processor. The number of cycles needed by the four instructions I1, I2, I3 and I4 in stages S1, S2, S3 and S4 is shown below:

      S1   S2   S3   S4
I1    2    1    1    1
I2    1    3    2    2
I3    2    1    1    3
I4    1    2    2    2

What is the number of cycles needed to execute the following loop?
for (i = 1 to 2) { I1; I2; I3; I4; }
(i) 16
(ii) 23
(iii) 28
(iv) 30

9. Consider a pipelined processor with the following four stages:
IF: Instruction Fetch
ID: Instruction Decode and Operand Fetch
EX: Execute
WB: Write Back
The IF, ID and WB stages take one clock cycle each to complete their operation. The number of clock cycles for the EX stage depends on the instruction. The ADD and SUB instructions need 1 clock cycle and the MUL instruction needs 3 clock cycles in the EX stage. Operand forwarding is used in the pipelined processor. What is the number of clock cycles taken to complete the following sequence of instructions?
ADD R2, R1, R0    R2 ← R0 + R1
MUL R4, R3, R2    R4 ← R3 × R2
SUB R6, R5, R4    R6 ← R5 – R4
(i) 7
(ii) 8
(iii) 10
(iv) 14

10. Consider the following processors. Assume that the pipeline registers have zero latency.
P1: 4-stage pipeline with stage latencies 1 ns, 2 ns, 2 ns, 1 ns
P2: 4-stage pipeline with stage latencies 1 ns, 1.5 ns, 1.5 ns, 1.5 ns
P3: 5-stage pipeline with stage latencies 0.5 ns, 1 ns, 1 ns, 0.6 ns, 1 ns
P4: 5-stage pipeline with stage latencies 0.5 ns, 0.5 ns, 1 ns, 1 ns, 1.1 ns
Which processor has the highest peak clock frequency?
(i) P1
(ii) P2
(iii) P3
(iv) P4

11. Consider a 3 GHz (gigahertz) processor with a three-stage pipeline and stage latencies T1, T2 and T3 such that T1 = 3T2/4 = 2T3. If the longest pipeline stage is split into two pipeline stages of equal latency, the new frequency is ____ GHz, ignoring delays in the pipeline registers.

B. Multiple Choice/Objective Type Questions

1. ____________ have been developed specifically for pipelined systems.
(a) Utility software
(b) Speed-up utilities
(c) Optimizing compilers
(d) None of the mentioned

2. The pipelining process is also called ____________.
(a) Superscalar operation
(b) Assembly line operation
(c) Von Neumann cycle
(d) None of the mentioned

3. The fetch and execution cycles are interleaved with the help of ____________.
(a) Modification in processor architecture
(b) Clock
(c) Special unit
(d) Control unit

4. Each stage in pipelining should be completed within ____________ cycle(s).
(a) 1
(b) 2
(c) 3
(d) 4

5. If a unit completes its task before the allotted time period, then ____________.
(a) It will perform some other task in the remaining time
(b) Its time gets reallocated to a different task
(c) It will remain idle for the remaining time
(d) None of the mentioned

6. To increase the speed of memory access in pipelining, we make use of ____________.
(a) Special memory locations
(b) Special purpose registers
(c) Cache
(d) Buffers

7. The periods of time when the unit is idle are called ____________.
(a) Stalls
(b) Bubbles
(c) Hazards
(d) Both stalls and bubbles

8. The contention for the usage of a hardware device is called ____________.
(a) Structural hazard
(b) Stalk
(c) Deadlock
(d) None of the mentioned

9. The situation wherein the data of operands are not available is called ____________.
(a) Data hazard
(b) Stock
(c) Deadlock
(d) Structural hazard

Answers

1. (c), 2. (b), 3. (b), 4. (a), 5. (c), 6. (c), 7. (d), 8. (a), 9. (a)

2.13 References

References for this unit have been given at the end of the book.

UNIT 3 SYNCHRONOUS PARALLEL PROCESSING

Structure:
3.0 Learning Objectives
3.1 Introduction
3.2 Parallel Processing
3.3 Introduction of Single Instruction, Multiple Data (SIMD)
3.4 SIMD Architecture (Single Instruction Multiple Data)
3.5 SIMD Parallel Process
3.6 Programming Principle
3.7 Data Mapping
3.8 Summary
3.9 Key Words/Abbreviations
3.10 Learning Activity
3.11 Unit End Questions (MCQ and Descriptive)
3.12 References

3.0 Learning Objectives

After studying this unit, you will be able to:

• Understand synchronous parallel processing.
• Describe SIMD architecture and programming principles.
• Explain SIMD parallel algorithms.

3.1 Introduction

Parallel processing is a method of breaking program tasks into parts and running them simultaneously on multiple microprocessors, thereby reducing processing time. Parallel processing may be accomplished via a computer with two or more processors or via a computer network. Parallel processing is also called parallel computing.

Parallel Computing

Parallel computing is the use of multiple processing elements simultaneously to solve a problem. Problems are broken down into instructions and solved concurrently, since each resource that has been applied to the work is active at the same time.

Advantages of parallel computing over serial computing are as follows:

1. It saves time and money, as many resources working together reduce the time taken and cut potential costs.
2. It can be impractical to solve larger problems with serial computing.
3. It can take advantage of non-local resources when the local resources are finite.
4. Serial computing 'wastes' potential computing power; parallel computing makes better use of the hardware.

Types of Parallelism

1. Bit-level parallelism: This is the form of parallel computing based on increasing the processor's word size. It reduces the number of instructions that the system must execute in order to perform a task on large-sized data.

Example: Consider a scenario where an 8-bit processor must compute the sum of two 16-bit integers. It must first add the 8 lower-order bits and then add the 8 higher-order bits together with the carry, thus requiring two instructions to perform the operation. A 16-bit processor can perform the operation with just one instruction.

2. Instruction-level parallelism: A processor can issue only a limited number of instructions in each clock cycle. These instructions can be re-ordered and grouped so that they are later executed concurrently without affecting the result of the program. This is called instruction-level parallelism.

3. Task parallelism: Task parallelism employs the decomposition of a task into subtasks and then allocates each of the subtasks for execution. The processors execute the subtasks concurrently.

Why Parallel Computing?

• The real world is dynamic in nature, i.e., many things happen at the same time but at different places concurrently, and the resulting data is extremely large to manage.
• Real-world data needs dynamic simulation and modelling, and for achieving that, parallel computing is the key.
• Parallel computing provides concurrency and saves time and money.
• Complex, large datasets and their management can be organized only by using parallel computing's approach.
• It ensures the effective utilization of resources. The hardware is guaranteed to be used effectively, whereas in serial computation only some part of the hardware is used and the rest is rendered idle.
• Also, it is impractical to implement real-time systems using serial computing.

Parallel processing is particularly useful when running programs that perform complex computations, and it provides a viable option in the quest for cheaper computing alternatives. Supercomputers commonly have hundreds of thousands of microprocessors for this purpose. Parallel processing should not be confused with concurrency, which refers to multiple tasks that run simultaneously.
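To make the bit-level parallelism example above concrete, the following small Python sketch (added here for illustration; the function name is hypothetical) shows how an 8-bit ALU would add two 16-bit unsigned integers in two instructions: the lower bytes first, then the upper bytes plus the carry.

```python
def add16_on_8bit_alu(x, y):
    low = (x & 0xFF) + (y & 0xFF)                          # first instruction: lower 8 bits
    carry = low >> 8
    high = ((x >> 8) & 0xFF) + ((y >> 8) & 0xFF) + carry   # second instruction: upper 8 bits plus carry
    return ((high & 0xFF) << 8) | (low & 0xFF)             # overflow beyond 16 bits is ignored

# A 16-bit ALU would produce the same result with a single add instruction.
print(add16_on_8bit_alu(0x1234, 0x0FCD) == (0x1234 + 0x0FCD) & 0xFFFF)   # True
```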

In SIMD computers, each of the processing units has its own local memory unit to store both data and instructions, and the processors need to communicate among themselves. This is done through shared memory or through an interconnection network. While some of the processors execute a set of instructions, the remaining processors wait for their next set of instructions. Instructions from the control unit decide which processors will be active (execute instructions) and which will be inactive (wait for the next instruction).

3.2 Parallel Processing

Parallel processing can be described as a class of techniques which enables a system to carry out data-processing tasks simultaneously in order to increase its computational speed. A parallel processing system can carry out simultaneous data-processing to achieve a faster execution time. For instance, while an instruction is being processed in the ALU component of the CPU, the next instruction can be read from memory.

The primary purpose of parallel processing is to enhance the computer's processing capability and increase its throughput, i.e., the amount of processing that can be accomplished during a given interval of time.

A parallel processing system can be achieved by having a multiplicity of functional units that perform identical or different operations simultaneously. The data can be distributed among the various functional units. The following diagram shows one possible way of separating the execution unit into eight functional units operating in parallel. The operation performed in each functional unit is indicated in each block of the diagram:

Figure 3.1: Parallel Processing

• The adder and integer multiplier perform arithmetic operations on integer numbers.
• The floating-point operations are separated into three circuits operating in parallel.
• The logic, shift, and increment operations can be performed concurrently on different data. All units are independent of each other, so one number can be shifted while another number is being incremented.

3.3 Introduction of Single Instruction, Multiple Data (SIMD)

Single instruction, multiple data (SIMD) is a class of parallel computers in Flynn's taxonomy. It describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously. Such machines exploit data-level parallelism, but not concurrency: there are simultaneous (parallel) computations, but only a single process (instruction) at a given moment.

SIMD is particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio. Most modern CPU designs include SIMD instructions to improve the performance of multimedia use.

Multivector and SIMD Computers

In this section, we will discuss supercomputers and parallel processors for vector processing and data parallelism.

Vector Supercomputers

In a vector computer, a vector processor is attached to the scalar processor as an optional feature. The host computer first loads the program and data into main memory. Then the scalar control unit decodes all the instructions. If the decoded instructions are scalar operations or program operations, the scalar processor executes those operations using the scalar functional pipelines. On the other hand, if the decoded instructions are vector operations, then the instructions are sent to the vector control unit.

Figure 3.2: Vector Supercomputers

SIMD Supercomputers

In SIMD computers, 'N' processors are connected to a control unit and all the processors have their individual memory units. All the processors are connected by an interconnection network.

Figure 3.3: SIMD Supercomputers

3.4 SIMD Architecture (Single Instruction Multiple Data)

• A single instruction is applied to multiple data items to produce the same operation on each of them.
• A master instruction works on a vector of operands.
• A number of processors run the same instruction in the same clock cycle under a strict lock-step approach.
• It is a form of data-level parallelism.
• A communication network allows parallel synchronous communication between several processing element/memory modules.

Figure 3.4: SIMD Processor Architecture

The following two SIMD architectures represent fundamentally different approaches to parallel processing.

Data communication based on the message-passing paradigm: Here the memory is part of each PE, which therefore communicates through the interconnection network for passing data.

Figure 3.5: SIMD Architectures

Shared memory between processors: Here the memories are not local; data is read and aligned by the alignment network, which aligns the data between the PEs and the memory modules.

Figure 3.6: SIMD Architectures
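As a software analogue of the SIMD idea, the following NumPy snippet (an illustration added here; the use of NumPy is an assumption, not something the original text prescribes) applies a single operation to every element of an array at once, much like the digital-audio volume adjustment mentioned earlier.

```python
import numpy as np

samples = np.array([0.10, -0.25, 0.40, 0.05, -0.60], dtype=np.float32)

louder = samples * 1.5                  # one "instruction", applied to multiple data points
clipped = np.clip(louder, -1.0, 1.0)    # another element-wise operation on the whole vector
print(clipped)
```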

3.5 SIMD Parallel Process

• During the execution of a program, it is often necessary to mask off a PE from doing processing, which is equivalent to having some autonomous control within the PE.
• Each PE has a mask bit which can be set during the processing of an instruction.
• When the mask bit in a PE is set, the PE treats the instruction received from the control unit as a no-operation.
• A PE executes the instruction when its mask bit is reset.
• Each PE has one or more index registers whose contents are added to the global addresses supplied by the CU instruction.
• The arithmetic logic unit has a few general-purpose registers and pointer registers to support data and address manipulation.

SIMD Mesh-Connected Architecture

• Here we are dealing with a machine built using the mesh-connected architecture.
• Each node of such a machine has four ports: a top port, a left port, a right port and a bottom port.
• The instruction set belongs to the CU, with the PEs executing those instructions that are prefixed with P to indicate that they shall be executed on the PEs in parallel.
• Each PE also has four bidirectional ports for communication with its four neighbours.

3.6 Programming Principle

Programming Style

Programming style refers to the technique used in writing the source code for a computer program. Most programming styles are designed to help programmers quickly read and understand the program as well as avoid making errors. (Older programming styles also focused on conserving screen space.) A good coding style can overcome the many deficiencies of a first programming language, while poor style can defeat the intent of an excellent language. The goal of good programming style is to provide understandable, straightforward, elegant code. The programming style used in a particular program may be derived from the coding standards or code conventions of a company or other computing organization, as well as the preferences of the actual programmer.

Some general rules or guidelines in respect of programming style are given below.

Figure 3.7: Programming Style

1. Clarity and simplicity of expression: Programs should be designed in such a manner that the objective of the program is clear.

2. Naming: In a program, you are required to name modules, processes, variables, and so on. Care should be taken that the naming style is not cryptic or unrepresentative. For example:

a = 3.14 × r × r
area_of_circle = 3.14 × radius × radius;

The second form makes the intent obvious.

3. Control constructs: It is desirable to use single-entry, single-exit constructs as much as possible.

4. Information hiding: The information held in data structures should be hidden from the rest of the system where possible. Information hiding can decrease the coupling between modules and make the system more maintainable.

5. Nesting: Deep nesting of loops and conditions greatly harms the static and dynamic behaviour of a program. It also makes the program logic difficult to understand, so it is desirable to avoid deep nesting.

6. User-defined types: Make good use of user-defined data types like enum, class, structure, and union. These data types make your program code easier to write and easier to understand.

7. Module size: The module size should be uniform. The size of a module should not be too big or too small. If the module size is too large, it is generally not functionally cohesive. If the module size is too small, it leads to unnecessary overhead.

8. Module interface: A module with a complex interface should be carefully examined.

9. Side-effects: When a module is invoked, it sometimes has the side effect of modifying the program state. Such side effects should be avoided where possible.

7 Common Programming Principles

Programming is just like telling a story to a fellow programmer, where variables are characters in your story (some play their role till the end and some end up in the middle), different functions tell different parts of your story, and connecting all the classes or functions in a specific order completes the story. To continue the story, you want everything in a specific order so that you can understand it easily and carry on adding your own lines from where it was left.

No matter how good a coder you are, in programming your job is not just writing code that works and gives you the desired output; your job is also writing code that is maintainable, extensible and easy to understand, so that whoever later continues or maintains your project can understand it without having to go through a horror story that gives them nightmares.

1. KISS: Nobody in programming loves to debug, maintain or make changes in complex code. "Keep It Simple, Stupid (KISS)" states that most systems work best if they are kept simple rather than made complex, so when you are writing code your solution should not be so complicated that it takes a lot of time and effort to understand. If your code is simple, then other developers won't face any problem understanding the code logic and can easily proceed further with your code. So always try to simplify your code using different approaches, such as breaking a complex problem into smaller chunks or removing unnecessary code you have written.

2. DRY: Duplication of data, logic or functionality in code not only makes your code lengthy but also wastes a lot of time when it comes to maintaining, debugging or modifying the code. If you need to make a small change in your code, you then need to make it in several places. The goal of the "Don't Repeat Yourself (DRY)" principle is to reduce the repetition of code. It states that a piece of logic should be implemented in just one place in the source code. The opposite of the DRY principle is WET ("write everything twice" or "waste everyone's time"), which breaks the DRY principle when the same logic is written at several places. You can create a common function or abstract your code to avoid repetition in your code.

3. YAGNI: Your software or program can become larger and more complex if you write code that you may need in the future but not at the moment. The "You Aren't Gonna Need It (YAGNI)" principle states that you should not implement something until it is necessary, because in most cases you are not going to use that piece of code in the future. Many programmers, while implementing software, think about future possibilities and add code or logic for other features which they don't need at present. They add unnecessary classes and functionality which they may never use in the future. Doing this is a mistake: you will eventually end up writing bloated code, and your project will become complicated and difficult to maintain. We recommend all programmers avoid this mistake to save a lot of time and effort.

4. SOLID: The SOLID acronym stands for five principles: Single responsibility, Open-closed, Liskov substitution, Interface segregation, and Dependency inversion. These principles were formulated by Robert C. Martin, and you can read about the SOLID principles in more detail elsewhere.

5. Separation of Concerns (SoC): The Separation of Concerns principle partitions a complicated application into different sections or domains. Each section or domain addresses a separate concern or has a specific job. Each section is independent of the others, which is why each section can be tackled independently; it also becomes easier to maintain, update and reuse the code. For example, in a web application the business logic (the content of the web page) is one concern and the user interface is a different concern. One good example of SoC is the MVC pattern, where the data ("model"), the logic ("controller"), and what the end user sees ("view") are divided into three different sections, each handled independently. Saving data to a database has nothing to do with rendering the data on the web.

6. Avoid Premature Optimization: It is true that optimization helps in speeding up a program or algorithm, but according to this principle you do not need to optimize your algorithm at an early stage of development. If you optimize prematurely, you will not know where a program's bottlenecks will be, and maintenance will become harder. If you optimize your code at the beginning and the requirements then change, your efforts will be wasted and your code will be thrown away. So it is better to optimize the algorithm at the right time to get the right benefit from it.

7. Law of Demeter: This principle was first introduced by Ian Holland in 1987 at Northeastern University. It is also known as the principle of least knowledge. This principle divides the responsibility between classes or different units, and it can be summarized in three points:

• Each unit should have only limited knowledge about other units: only units "closely" related to the current unit.
• Each unit should only talk to its friends; don't talk to strangers.
• Only talk to your immediate friends.

The Law of Demeter helps in maintaining independent classes and makes your code less coupled, which is very important in software development if an application is to remain flexible, stable, maintainable and understandable.

A parallel algorithm is an algorithm that can execute several instructions simultaneously on different processing devices and then combine all the individual outputs to produce the final result. The model of a parallel algorithm is developed by considering a strategy for dividing the data and the processing method, and by applying a suitable strategy to reduce interactions. In this chapter, we will discuss the following parallel algorithm models:

• Data parallel model
• Task graph model
• Work pool model

Data Parallel

In the data parallel model, tasks are assigned to processes and each task performs similar types of operations on different data. Data parallelism is a consequence of a single operation being applied to multiple data items.

The data-parallel model can be applied to shared-address-space and message-passing paradigms. In the data-parallel model, interaction overheads can be reduced by selecting a locality-preserving decomposition, by using optimized collective interaction routines, or by overlapping computation with interaction.

The primary characteristic of data-parallel problems is that the intensity of data parallelism increases with the size of the problem, which in turn makes it possible to use more processes to solve larger problems.

Example − Dense matrix multiplication.
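As a minimal sketch of the data-parallel model just described (the element-wise squaring task, the chunk size and the use of Python's multiprocessing pool are illustrative assumptions, not part of the text), the same operation is applied to different pieces of the data by a pool of worker processes:

from multiprocessing import Pool

def square(x):
    # The "similar operation" that every task applies to its own data item.
    return x * x

if __name__ == "__main__":
    data = list(range(16))           # the data set to be partitioned
    with Pool(processes=4) as pool:  # four worker processes
        # Each worker applies the same operation to a different chunk of the data.
        results = pool.map(square, data, chunksize=4)
    print(results)

As the text notes, the larger the data set, the more worker processes such a decomposition can usefully employ.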

Figure 3.8: Data Parallel Model

Task Graph Model

In the task graph model, parallelism is expressed by a task graph, which can be either trivial or nontrivial. In this model, the correlations among the tasks are exploited to promote locality or to minimize interaction costs. The model is used to solve problems in which the quantity of data associated with the tasks is large compared to the amount of computation associated with them. Tasks are assigned so as to reduce the cost of data movement among them.

Examples − Parallel quick sort, sparse matrix factorization, and parallel algorithms derived via the divide-and-conquer approach.

Figure 3.9: Task Graph Model

Here, problems are divided into atomic tasks and implemented as a graph. Each task is an independent unit of work that may depend on one or more antecedent tasks. After an antecedent task completes, its output is passed to the dependent task. A task that has antecedent tasks starts execution only when all of its antecedent tasks have completed. The final output of the graph is obtained when the last dependent task completes (Task 6 in the above figure).
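The dependency behaviour described above can be sketched in a few lines (the example graph, the task body and the use of Python threads are assumptions made for illustration): a task is submitted for execution only once all of its antecedent tasks have completed, and its output is passed on to its dependants.

from concurrent.futures import ThreadPoolExecutor

# Task graph: each task names the antecedent tasks whose outputs it needs.
deps = {"t1": [], "t2": [], "t3": ["t1"], "t4": ["t1", "t2"], "t5": ["t3", "t4"]}

def run_task(name, inputs):
    # A stand-in task body: combine the outputs of the antecedent tasks.
    return name + "(" + ",".join(inputs) + ")"

def run_graph(deps):
    done = {}
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            # A task is ready once all of its antecedent tasks have completed.
            ready = [t for t in deps
                     if t not in done and all(d in done for d in deps[t])]
            futures = {t: pool.submit(run_task, t, [done[d] for d in deps[t]])
                       for t in ready}
            for t, f in futures.items():
                done[t] = f.result()
    return done

print(run_graph(deps)["t5"])   # output of the final dependent task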

Work Pool Model

In the work pool model, tasks are dynamically assigned to processes so that the load stays balanced; therefore, any process may potentially execute any task. This model is used when the quantity of data associated with the tasks is comparatively smaller than the computation associated with them.

There is no fixed pre-assignment of tasks to processes, and the assignment of tasks may be centralized or decentralized. Pointers to the tasks may be kept in a physically shared list, in a priority queue, in a hash table or tree, or they may be kept in a physically distributed data structure.

The tasks may all be available at the beginning, or they may be generated dynamically. If tasks are generated dynamically and assignment is decentralized, then a termination detection algorithm is required so that all the processes can detect the completion of the entire program and stop looking for more tasks.

Example − Parallel tree search.

Figure 3.10: Work Pool Model
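A minimal work-pool sketch follows (the task list, the worker count and the shared queue are illustrative assumptions): no task is pre-assigned to a particular process; idle workers simply keep pulling the next available task from a shared queue until a termination sentinel is seen.

from multiprocessing import Process, Queue

def worker(tasks, results):
    # Each worker keeps pulling tasks until it sees the sentinel value.
    while True:
        n = tasks.get()
        if n is None:
            break
        results.put((n, n * n))    # the "computation" associated with a task

if __name__ == "__main__":
    tasks, results = Queue(), Queue()
    for n in range(10):            # the pool of tasks
        tasks.put(n)
    workers = [Process(target=worker, args=(tasks, results)) for _ in range(3)]
    for _ in workers:
        tasks.put(None)            # one sentinel per worker signals termination
    for w in workers:
        w.start()
    out = [results.get() for _ in range(10)]   # one result per task
    for w in workers:
        w.join()
    print(sorted(out))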

Parallel Algorithm – Design Techniques

Selecting a proper design technique for a parallel algorithm is the most difficult and important task, and most parallel programming problems have more than one solution. In this chapter, we will discuss the following design techniques for parallel algorithms:

• Divide and conquer
• Greedy method
• Dynamic programming
• Backtracking
• Branch and bound
• Linear programming

Divide and Conquer Method

In the divide and conquer approach, the problem is divided into several small sub-problems. The sub-problems are then solved recursively, and their solutions are combined to obtain the solution of the original problem. The divide and conquer approach involves the following steps at each level:

• Divide: The original problem is divided into sub-problems.
• Conquer: The sub-problems are solved recursively.
• Combine: The solutions of the sub-problems are combined to give the solution of the original problem.

The divide and conquer approach is applied in the following algorithms:

• Binary search
• Quick sort
• Merge sort
• Integer multiplication
• Matrix inversion
• Matrix multiplication
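As a small sketch of the divide, conquer and combine steps listed above, here is merge sort, one of the algorithms in the list (the code itself is illustrative and not taken from the text):

def merge_sort(items):
    # Divide: split the problem into two sub-problems.
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    # Conquer: solve each sub-problem recursively.
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # Combine: merge the two sorted halves into the final answer.
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]

print(merge_sort([5, 2, 9, 1, 7]))   # [1, 2, 5, 7, 9]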

Greedy Method

In a greedy algorithm, the choice that looks best at the moment is the one that is made. A greedy algorithm is very easy to apply to complex problems: at each step it decides which option appears to contribute most towards the solution.

The algorithm is called greedy because, once it has chosen the optimal solution to the smaller instance, it does not reconsider the program as a whole; once a choice has been made, a greedy algorithm never revisits it. A greedy algorithm works by recursively building a solution from the smallest possible component parts. Recursion is a procedure for solving a problem in which the solution to a specific problem depends on the solution of smaller instances of that problem.

Dynamic Programming

Dynamic programming is an optimization technique that divides the problem into smaller sub-problems and, after solving each sub-problem, combines their solutions to obtain the ultimate solution. Unlike the divide and conquer method, dynamic programming reuses the solutions of the sub-problems many times. The recursive computation of the Fibonacci series is a standard example of dynamic programming.

Backtracking Algorithm

Backtracking is a technique for solving combinatorial problems. It is applied to both programmatic and real-life problems; the eight queens problem, Sudoku puzzles and finding a way through a maze are popular examples where the backtracking algorithm is used. In backtracking, we start with a partial solution that satisfies all the required conditions. We then move to the next level and, if that level does not produce a satisfactory solution, we return one level back and try a new option.

Branch and Bound

A branch and bound algorithm is an optimization technique for obtaining an optimal solution to a problem. It looks for the best solution to a given problem in the entire space of solutions. Bounds on the function to be optimized are combined with the value of the latest best solution, which allows the algorithm to discard parts of the solution space that cannot contain a better solution. The purpose of a branch and bound search is to maintain the lowest-cost path to a target; once a solution is found, it can keep improving that solution. Branch and bound search can be implemented as depth-bounded search or depth-first search.
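To make the reuse of sub-problem solutions concrete, here is a minimal sketch of the Fibonacci example mentioned above (the memo table and function name are illustrative assumptions): each sub-problem is solved once, stored, and looked up on every later request instead of being recomputed.

def fib(n, memo=None):
    # Dynamic programming: each sub-problem fib(k) is solved once,
    # stored in the memo table, and reused thereafter.
    if memo is None:
        memo = {0: 0, 1: 1}
    if n not in memo:
        memo[n] = fib(n - 1, memo) + fib(n - 2, memo)
    return memo[n]

print(fib(40))   # computed in linear time instead of exponential time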

Linear Programming

Linear programming describes a wide class of optimization problems in which both the optimization criterion and the constraints are linear functions. It is a technique for obtaining the best outcome, such as maximum profit, the shortest path, or the lowest cost. In linear programming we have a set of variables, and we have to assign values to them so as to satisfy a set of linear constraints and to maximize or minimize a given linear objective function.

Parallel Algorithm – Matrix Multiplication

A matrix is a set of numerical and non-numerical data arranged in a fixed number of rows and columns. Matrix multiplication is an important computation pattern in parallel computing. Here, we will discuss the implementation of matrix multiplication on various communication networks such as mesh and hypercube. Mesh and hypercube have higher network connectivity, so they allow faster algorithms than networks with lower connectivity, such as a ring network.

Mesh Network

A topology where a set of nodes forms a p-dimensional grid is called a mesh topology. Here, all the edges are parallel to the grid axes and all adjacent nodes can communicate with one another.

Total number of nodes = (number of nodes in a row) × (number of nodes in a column)

A mesh network can be evaluated using the following factors:

• Diameter
• Bisection width

Diameter − In a mesh network, the longest distance between two nodes is its diameter. A p-dimensional mesh network having k^p nodes has a diameter of p(k − 1).

Bisection width − Bisection width is the minimum number of edges that must be removed from the network in order to divide the mesh network into two halves.
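As a small worked check of the two factors just defined, the helper below computes them for a mesh with k nodes along each of p dimensions (the function is illustrative and not from the text; the bisection-width formula k^(p−1) is the standard result for even k and no wraparound):

def mesh_metrics(k, p):
    # Metrics of a p-dimensional mesh with k nodes per dimension.
    nodes = k ** p                      # total number of nodes
    diameter = p * (k - 1)              # longest shortest path, as defined above
    bisection_width = k ** (p - 1)      # standard result for even k, no wraparound
    return nodes, diameter, bisection_width

# A 2-D 4 x 4 mesh: 16 nodes, diameter 6, bisection width 4.
print(mesh_metrics(4, 2))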

Matrix Multiplication Using Mesh Network

We consider a 2D mesh-network SIMD model with wraparound connections. We will design an algorithm to multiply two n × n matrices using n^2 processors in a particular amount of time.

Matrices A and B have elements aij and bij respectively. Processing element PEij holds aij and bij. Arrange the matrices A and B in such a way that every processor has a pair of elements to multiply. The elements of matrix A move in the left direction and the elements of matrix B move in the upward direction; each such shift presents every processing element PE with a new pair of values to multiply.

Steps in Algorithm

• Stagger the two matrices.
• Calculate all products aik × bkj.
• Calculate the sums when step 2 is complete.

Algorithm

Procedure MatrixMulti
Begin
   // stagger the two matrices
   for k = 1 to n-1
      for all Pij, where i and j range from 1 to n
         if i is greater than k then
            rotate a in the left direction
         end if
         if j is greater than k then
            rotate b in the upward direction
         end if

   // every processor multiplies its current pair, then rotates and accumulates
   for all Pij, where i and j range from 1 to n
      compute the product of a and b and store it in c
   for k = 1 to n-1 step 1
      for all Pij, where i and j range from 1 to n
         rotate a in the left direction
         rotate b in the upward direction
         c = c + a × b
End
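The procedure above can be checked on a single machine with a short sequential simulation (an illustrative sketch, not the book's code; it assumes NumPy is available and uses np.roll to stand in for the wraparound rotations): the matrices are staggered, every position multiplies its pair, and the rotate-multiply-accumulate step is repeated n − 1 more times.

import numpy as np

def mesh_matmul_sim(A, B):
    # Sequential simulation of the wraparound-mesh algorithm: position (i, j)
    # plays the role of processing element PEij holding one element of A and B.
    n = A.shape[0]
    A = np.array(A, dtype=float)
    B = np.array(B, dtype=float)
    # Stagger: shift row i of A left by i, and column j of B up by j.
    for i in range(n):
        A[i, :] = np.roll(A[i, :], -i)
    for j in range(n):
        B[:, j] = np.roll(B[:, j], -j)
    C = A * B                          # every PE multiplies its current pair
    for _ in range(n - 1):
        A = np.roll(A, -1, axis=1)     # rotate a in the left direction
        B = np.roll(B, -1, axis=0)     # rotate b in the upward direction
        C += A * B                     # c = c + a × b
    return C

X = np.arange(9).reshape(3, 3)
Y = np.arange(9).reshape(3, 3) + 1
print(np.allclose(mesh_matmul_sim(X, Y), X @ Y))   # True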

Hypercube Network

A hypercube is an n-dimensional construct in which the edges are perpendicular to one another and are of the same length. An n-dimensional hypercube is also known as an n-cube or an n-dimensional cube.

Features of a hypercube with 2^k nodes

• Diameter = k
• Bisection width = 2^(k-1)
• Number of edges incident on each node = k

Matrix Multiplication Using Hypercube Network

General specification of hypercube networks:

• Let N = 2^m be the total number of processors, and let the processors be P0, P1, ..., PN-1.
• Let i and ib be two integers, 0 ≤ i, ib ≤ N-1, whose binary representations differ only in position b, 0 ≤ b < m.
• Let us consider two n × n matrices, matrix A and matrix B.
• Step 1 − The elements of matrix A and matrix B are assigned to the n^3 processors such that the processor in position (i, j, k) holds aji and bik.
• Step 2 − Every processor in position (i, j, k) computes the product C(i, j, k) = A(i, j, k) × B(i, j, k).
• Step 3 − The sum C(0, j, k) = Σ C(i, j, k) for 0 ≤ i ≤ n-1, where 0 ≤ j, k ≤ n-1.
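The three steps above can likewise be traced with a small sequential sketch (illustrative only; a real hypercube implementation would distribute these loops over the processors and use the cube's links for the broadcasts and the summation, none of which is modelled here): each logical position (i, j, k) forms one partial product aji × bik, and the products are then summed over i.

import numpy as np

def hypercube_matmul_sim(A, B):
    n = A.shape[0]
    # Steps 1 and 2: logical position (i, j, k) holds aji and bik
    # and forms one partial product.
    partial = np.zeros((n, n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                partial[i, j, k] = A[j, i] * B[i, k]
    # Step 3: sum the partial products along the i dimension;
    # entry (j, k) of the result is the usual dot product.
    return partial.sum(axis=0)

X = np.arange(9).reshape(3, 3).astype(float)
Y = (np.arange(9).reshape(3, 3) + 1).astype(float)
print(np.allclose(hypercube_matmul_sim(X, Y), X @ Y))   # True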

3.7 Data Mapping

As discussed earlier, cache memory bridges the speed mismatch between the processor and the main memory.

When a Cache Hit Occurs

• The required word is present in the cache memory.
• The required word is delivered to the CPU from the cache memory.

When a Cache Miss Occurs

• The required word is not present in the cache memory.
• The block containing the required word has to be brought in from the main memory.
• This mapping is performed using cache mapping techniques.

In this section, we will discuss the different cache mapping techniques.

Cache Mapping

Cache mapping defines how a block from the main memory is mapped to the cache memory in the case of a cache miss. In other words, cache mapping is a technique by which the contents of main memory are brought into the cache memory.

The following diagram illustrates the mapping process:

