CU-MCA-SEM II-Parallel and Distributed Computing
Pipelining is a technique in which the processor begins fetching the next instruction while earlier instructions are still being executed, so that instructions are stored and executed in an orderly, overlapped fashion. It is also called pipeline processing. Before proceeding with pipelining, review the following topics to better understand the concept:

• Memory management
• Virtual memory and memory mapping
• Multithreaded processing

Pipelining executes several instructions at the same time. The pipeline is divided into stages, each linked to the next to form a pipe-like structure: instructions enter at one end and leave at the other, and the overall flow of instructions improves.

Figure 3.1 Basic Pipeline Model

Each segment of a pipeline system consists of an input register followed by a combinational circuit. The register holds the data, and the combinational circuit operates on it; the circuit's output is applied to the input register of the following segment. A pipeline system is similar to a modern factory assembly line. In the automobile industry, for example, large assembly lines are set up with robotic arms performing a specific task at each stage, after which the vehicle moves on to the next arm.

Pipelines fall into two groups:

1. Arithmetic pipeline
2. Instruction pipeline
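The gain from overlapping instructions can be made concrete with a little arithmetic. The sketch below is illustrative and assumes an ideal pipeline in which every stage takes exactly one clock cycle:

```python
# Illustrative timing model for an ideal n-stage pipeline: every stage is
# assumed to take exactly one clock cycle.

def sequential_cycles(n_stages: int, k_instructions: int) -> int:
    """Without pipelining, each instruction uses all stages in turn."""
    return n_stages * k_instructions

def pipelined_cycles(n_stages: int, k_instructions: int) -> int:
    """With pipelining, the first instruction takes n cycles; after that,
    one instruction completes per cycle."""
    return n_stages + (k_instructions - 1)

n, k = 4, 100
seq = sequential_cycles(n, k)         # 400 cycles
pipe = pipelined_cycles(n, k)         # 103 cycles
print(f"speedup = {seq / pipe:.2f}")  # approaches n for large k
```

For large instruction counts the speedup approaches the number of stages, which is why pipelining raises throughput even though each individual instruction still passes through every stage.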

3.1.1 Arithmetic Pipeline

Arithmetic pipelines are found in almost all computers. They are used for floating-point operations, fixed-point multiplication, and similar tasks. Consider, for example, the inputs to a floating-point adder pipeline:

X = A × 2^a
Y = B × 2^b

Here A and B are mantissas (the significant digits of the floating-point numbers), while a and b are exponents. The floating-point addition proceeds in four steps:

1. Compare the exponents.
2. Align the mantissas.
3. Add or subtract the mantissas.
4. Normalise the result.

Registers between the stages store the intermediate results.

3.1.2 Instruction Pipeline

By overlapping the fetch, decode, and execute phases of the instruction cycle, a stream of instructions can be executed concurrently. This approach is used to increase the throughput of the computer system. An instruction pipeline fetches instructions from memory while earlier instructions are being processed in other segments of the pipeline, so several instructions can be in flight at the same time. The pipeline is more efficient if the instruction cycle is divided into segments of equal length.

3.1.3 Pipeline Conflicts

Certain conditions cause the pipeline to deviate from its normal performance. Some of these factors are:

Timing variations. Not all stages take the same amount of time. This issue arises mainly in instruction processing, where different instructions require different operands and therefore take different amounts of time to process.

Data hazards. A problem occurs when several partially executed instructions all refer to the same data. We must make sure that a later instruction does not attempt to access data before the current instruction has finished with it, since this would produce incorrect results.

Branching. To fetch and execute the next instruction, we need to know what it is. If the current instruction is a conditional branch whose outcome determines the next instruction, the next instruction may not be known until the current one completes.

Interrupts. Interrupts inject unwanted instructions into the instruction stream and so affect how instructions are executed.

Data dependency. This occurs when a later instruction depends on the result of an earlier instruction that is not yet available.

3.1.4 Advantages of Pipelining

1. The processor's cycle time is reduced.
2. It increases the system's throughput.
3. It makes the system reliable.

3.1.5 Disadvantages of Pipelining

1. Designing and manufacturing a pipelined processor is complex and costly.
2. Instruction latency increases.

3.2 PERFORMANCE OF PIPELINE ARCHITECTURE

3.2.1 Introduction

The pace of data creation has risen as technology has advanced. In many application areas it is a fundamental requirement to handle such data in real time, as it arrives, rather than with a collect-then-process approach. Many applications use the pipeline design to handle data in this streaming manner. The pipeline architecture is a parallelization technique that enables a programme to execute in chunks. It is made up of several stages, each of which has a queue and a worker. Each stage receives the previous stage's output as its input, processes it, and passes its own output on as the next stage's input. In this section we first look at the effect of the number of stages on performance, and we show that the number of stages that gives the best results depends on the workload characteristics.

3.2.2 Background

The pipeline architecture is a popular choice when developing applications for multithreaded settings. It can be thought of as a collection of interconnected components (or stages), each of which has a queue (buffer) and a worker.
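As a rough illustration of this architecture, the sketch below wires two stages together with Python threads and queues; the stage functions, task values, and sentinel-based shutdown are illustrative choices, not part of any particular system:

```python
import queue
import threading

def make_stage(inbox, outbox, work):
    """One pipeline stage: a worker that drains its queue and feeds the next."""
    def run():
        while True:
            item = inbox.get()
            if item is None:              # sentinel: shut this stage down
                if outbox is not None:
                    outbox.put(None)      # propagate shutdown downstream
                break
            if outbox is not None:
                outbox.put(work(item))
    return threading.Thread(target=run)

q1, q2, done = queue.Queue(), queue.Queue(), queue.Queue()
w1 = make_stage(q1, q2, lambda x: x * 2)    # worker W1 reads Q1, feeds Q2
w2 = make_stage(q2, done, lambda x: x + 1)  # worker W2 reads Q2
w1.start(); w2.start()

for task in range(5):
    q1.put(task)          # tasks arrive at Q1 and are served FCFS
q1.put(None)

w1.join(); w2.join()
results = [v for v in iter(done.get, None)]  # drain up to the sentinel
print(results)            # [1, 3, 5, 7, 9]
```

Because each queue is FIFO and each stage has a single worker, tasks leave the pipeline in arrival order, while the two workers overlap their processing.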

An example of a pipeline architecture is shown in Figure 3.2.

Figure 3.2 Architecture of the Pipeline

Let m be the number of stages in the pipeline, and let Si denote stage i. Let Qi and Wi denote the queue and worker of stage i (i.e., of Si). A fresh task (request) arrives at Q1 first and is processed by W1 in First-Come-First-Served (FCFS) order. W1's output is placed in Q2, where it remains until W2 processes it. This cycle repeats until Wm completes the task, at which point it leaves the system.

The linked nature of the pipeline architecture lets the workers process tasks in parallel, which is a significant benefit: throughput can rise as a consequence. Pipelined designs are therefore widely used in a variety of systems. For example, to achieve high throughput, stream processing systems such as WSO2 SP, which is built on WSO2 Siddhi, use a pipeline design. The pipelining concept applies to many scenarios: in sentiment analysis, for instance, several data preparation steps are required, such as sentiment categorization and sentiment summarization. The pipeline design is also widely used in image processing, 3D visualization, big data analytics, and document categorization.

3.2.3 Details of the Experiment

This section explains how we carry out our experiments. The workloads considered in this section are CPU-intensive. Our first goal is to see how the number of stages in the pipeline affects performance in various situations. A pipeline with n stages is referred to as an n-stage pipeline. We conduct a series of tests to better understand the behaviour, varying the following parameters:

1. The total number of stages (workers + queues)
2. The processing time (service time)
3. The arrival rate (into the system)

The tests were carried out on a system with a Core i7 CPU (2.00 GHz × 4 cores) and 8 GB of RAM. To assess performance we use two metrics: throughput and (average) latency. Throughput is defined as the rate at which the system processes tasks, while latency is defined as the time between a task's arrival in the system and its departure.

We run each scenario five times and average the results to obtain the throughput and average latency.

We design a pipeline scenario in which the arrival of each request (task) into the system causes the pipeline workers to generate a message of a specified size. We examine messages of six sizes, the smallest being 10 bytes. Let's look at how the pipeline creates a 10-byte response. Assume the pipeline contains just one stage (a 1-stage pipeline): a request arrives at Q1 and remains there until W1 processes it, where "process" means that W1 creates a 10-byte message. When the pipeline has two stages, W1 creates the first half of the message (5 bytes) and puts the partly produced message in Q2; W2 reads the message from Q2 and builds the second half. In general, when there are m stages in the pipeline, each worker constructs a fragment of size 10 bytes/m. Note that a worker's processing time is proportional to the size of the message it constructs. Taking this into account, we have divided task processing times into six classes.

Table 3.1 Processing Time of Messages

We use a single stage to measure processing time, taking the difference between the time a request (task) leaves the worker and the time the worker begins processing it. Using various message sizes gives a broad range of processing times: class 1 denotes very short processing times, whereas class 6 denotes very long ones. Processing time has an effect on performance.
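The two metrics defined above can be computed from per-task timestamps. The helper below is a hypothetical sketch (the function name and sample values are invented) of how such measurements might be aggregated:

```python
def metrics(arrivals, departures):
    """arrivals[i]/departures[i]: seconds at which task i enters/leaves."""
    latencies = [d - a for a, d in zip(arrivals, departures)]
    avg_latency = sum(latencies) / len(latencies)
    span = max(departures) - min(arrivals)   # observation window
    throughput = len(departures) / span      # tasks completed per second
    return throughput, avg_latency

t, l = metrics([0.0, 0.1, 0.2], [0.5, 0.7, 0.9])
print(f"throughput={t:.2f} tasks/s, avg latency={l:.2f} s")  # 3.33, 0.60
```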

Let's start by looking at how the number of stages in the pipeline affects throughput and average latency (assuming a constant request rate of 1000 requests per second). The graphs below illustrate how throughput and average latency change as the number of stages increases.

Figure 3.3 Throughput Variation

Figure 3.4 Average Latency Variation

As task processing times rise, we can plainly observe a decrease in throughput; similarly, we observe an increase in average latency. This is expected behaviour: as processing time rises, end-to-end latency increases and the number of requests the system can handle decreases.

Let's look at how the number of stages affects performance for the various workload classes. The main observations are summarised in the table below.

Table 3.2 Impact on Different Workloads

Let us now attempt to explain this behaviour. It is critical to realise that processing requests in a pipelined manner incurs some overhead. Because tasks are executed on different threads, having many stages in the pipeline introduces context-switch overhead, and this overhead has a direct effect on performance, particularly latency. Additionally, there is a cost involved in transmitting data from one stage to the next: transferring data between two stages may require extra processing (for example, to construct a transfer object). Furthermore, there is contention from the use of shared data structures such as queues, which also affects performance.

For tasks with short processing times (e.g., class 1 and class 2), the total overhead is considerable in comparison to the task processing time. As a result, having more than one stage in the pipeline brings no benefit for such workloads; indeed, as seen in the graphs above, performance may even deteriorate. As task processing times rise (e.g., class 4, class 5, and class 6), we can obtain performance improvements by using more than one stage. For instance, in high-processing-time situations, the 5-stage pipeline provides the greatest throughput and lowest average latency. There is thus clearly an advantage to having more than one stage for high-processing-time use cases, as it enables the pipeline to enhance performance by using the available resources.

The influence of the arrival rate on performance

We reported the findings in the preceding section using a steady arrival rate of 1000 requests per second. This section looks at how the arrival rate affects the pipeline's performance. We can see that the arrival rate has an effect on the optimum number of stages. The graphs below illustrate how throughput and average latency for class 1 and class 5 change with various arrival rates.

Figure 3.5 Throughput Variation

Figure 3.6 Latency Variation

As the arrival rate rises, throughput rises, and average latency rises owing to the increased queueing time, as seen in the graphs above. Now let's look at the effect of the arrival rate on the class 1 workload type. We can see that the pipeline with only one stage has the best performance: as noted earlier, we obtain no gain from using more than one stage for tasks with short processing times (see the findings above for class 1), and this holds for all of the arrival rates examined. The tendency is different for the class 5 workload, where the number of stages that yields the greatest performance changes with the arrival rate.

In this section we looked at how the number of stages affects the pipeline model's performance, and demonstrated that the number of stages required to get the optimum results is determined by the workload characteristics. The key takeaways are:

• The number of stages in a pipeline design that yields the greatest performance is determined by the workload characteristics.
• We can obtain better performance by using a small number of stages if task processing times are reasonably short.
• Using a pipeline with an ill-chosen number of stages may lead to poor performance.
• Under variable (non-stationary) traffic conditions, dynamically adjusting the number of stages in the pipeline design may result in improved performance.

3.3 PIPELINE FOR ARITHMETIC

Arithmetic pipelines are often seen in high-performance computers. They are used to implement floating-point operations, fixed-point multiplication, and other calculations that arise in scientific problems. To better grasp the principles of the arithmetic pipeline, let's consider the example of a pipeline unit for floating-point addition and subtraction. The inputs to the floating-point adder pipeline are two normalised floating-point numbers (shown here with decimal exponents for clarity):

X = A × 10^a = 0.9504 × 10^3
Y = B × 10^b = 0.8200 × 10^2

Here a and b are the exponents, and A and B are the fractions that constitute the mantissas. The floating-point addition and subtraction process is split into four segments, each performing the appropriate suboperation in the pipeline. The four segments carry out the following suboperations:

Figure 3.7 Pipeline Organisation

1. Compare the exponents by subtracting them.
2. Align the mantissas.

3. Add or subtract the mantissas.
4. Normalise the result.

Each suboperation is described in more detail below. Note that registers are inserted after each suboperation to hold the intermediate results.

1. Compare the exponents by subtracting them: The difference between the exponents is calculated by subtraction, and the larger exponent is chosen as the exponent of the result. The exponent difference, 3 − 2 = 1, specifies the number of places the mantissa associated with the lower exponent must be shifted to the right.

2. Align the mantissas: The mantissa associated with the lower exponent is shifted according to the difference in exponents established in segment one:

X = 0.9504 × 10^3
Y = 0.08200 × 10^3

3. Add the mantissas: In segment three, the two mantissas are added:

Z = X + Y = 1.0324 × 10^3

4. Normalise the result: After normalisation, the result is written as:

Z = 0.10324 × 10^4
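Here is a compact, illustrative rendering of the four suboperations in code, using base-10 exponents as in the worked example; real hardware would latch each step's result into a register between segments:

```python
def fp_add(a_mant, a_exp, b_mant, b_exp):
    # 1. Compare the exponents by subtraction; keep the larger one.
    if a_exp < b_exp:
        a_mant, a_exp, b_mant, b_exp = b_mant, b_exp, a_mant, a_exp
    diff = a_exp - b_exp
    # 2. Align the mantissas: shift the smaller-exponent mantissa right.
    b_mant = b_mant / (10 ** diff)
    # 3. Add the mantissas.
    z_mant, z_exp = a_mant + b_mant, a_exp
    # 4. Normalise the result so the mantissa is a fraction below 1.0.
    while z_mant >= 1.0:
        z_mant /= 10
        z_exp += 1
    return z_mant, z_exp

print(fp_add(0.9504, 3, 0.8200, 2))   # approximately (0.10324, 4)
```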

3.4 PIPELINE OF INSTRUCTIONS

Pipeline processing can occur not only in the data stream but also in the instruction stream. To perform activities such as fetching, decoding, and executing instructions, most digital computers with complex instructions use an instruction pipeline. In general, the computer must process each instruction through the following steps:

1. Fetch the instruction from memory.
2. Decode the instruction.
3. Calculate the effective address.
4. Fetch the operands from memory.
5. Execute the instruction.
6. Store the result in the appropriate location.

Each step is carried out in its own segment, and different segments may take different amounts of time to process the incoming data. Furthermore, there are instances when two or more segments need memory access at the same time, requiring one segment to wait until the memory access of another is completed. The organisation of an instruction pipeline is more efficient if the instruction cycle is divided into segments of equal length. One of the most common examples of this kind of structure is a four-segment instruction pipeline, which combines two or more of the smaller steps into single segments. For example, the decoding of the instruction and the calculation of the effective address can be merged into one segment.

A four-segment instruction pipeline is shown in the block diagram below. The instruction cycle is divided into four segments:

Segment 1: The instruction fetch segment can be implemented using a first-in, first-out (FIFO) buffer.
Segment 2: In the second segment, the instruction fetched from memory is decoded, and the effective address is then calculated in a separate arithmetic circuit.
Segment 3: In the third segment, the operands are fetched from memory.
Segment 4: In the final segment of the pipeline organisation, the instruction is executed.
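To make the overlap concrete, the illustrative trace below prints, for each instruction, the clock cycle at which it occupies each of the four segments (FI = fetch instruction, DA = decode and compute address, FO = fetch operand, EX = execute), assuming no branch or memory conflicts:

```python
STAGES = ["FI", "DA", "FO", "EX"]

def trace(n_instructions: int) -> None:
    for i in range(1, n_instructions + 1):
        # instruction i reaches the segment at position p at clock cycle i + p
        cells = "  ".join(f"{s}@{i + p}" for p, s in enumerate(STAGES))
        print(f"I{i}: {cells}")

trace(4)
# I1: FI@1  DA@2  FO@3  EX@4
# I2: FI@2  DA@3  FO@4  EX@5   <- once full, one instruction completes per cycle
```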

Figure 3.8 Pipeline block diagram

The amount of time it takes to process tasks has an impact on performance. Let's start with how the number of pipeline stages affects throughput and average latency; as the number of stages grows, the graphs below show how throughput and average latency vary. We can clearly see a reduction in throughput as task processing times increase. Similarly, we see an increase in average latency as task processing times grow. This is anticipated behaviour: as processing time grows, so does end-to-end latency, while the number of requests the system can manage falls.

Let's see how the number of stages affects the different workload classes; the following table summarises the major findings.

Let us now try to explain this behaviour. It is crucial to understand that processing requests in a pipelined fashion adds overhead. When there are multiple stages in the pipeline, there is context-switch overhead, because we execute tasks using separate threads; this overhead has a direct impact on performance, especially latency. Furthermore, there is a cost associated with moving data from one stage to the next: transferring data between stages may need additional processing (e.g., to create a transfer object), lowering performance. There is also a performance impact from the use of shared data structures such as queues.

The overall overhead is significant in proportion to the task processing time for activities with short processing times (e.g., class 1, class 2). As a consequence, having more than one stage in the pipeline is useless for such workloads; indeed, as shown in the graphs, they may suffer performance degradation. When task processing times increase, we may be able to enhance performance by using more than one stage in the pipeline. The 5-stage pipeline, for example, offers the highest throughput and lowest average latency in high-processing-time scenarios. As a consequence, having more than one stage for high-processing-time use cases is clearly advantageous, as it allows the pipeline to improve performance by utilising the available resources.

3.4.1 The Impact of the Arrival Rate on Performance

We used a constant arrival rate of 1000 requests per second to present the results in the previous section. This section examines the impact of the arrival rate on the pipeline's performance. In this instance, we can observe that the arrival rate has an impact on the optimal number of stages. The graphs below show how different arrival rates affect throughput and average latency for class 1 and class 5.

As the arrival rate increases, throughput increases, and average latency increases due to increased waiting time, as shown in the graphs. Let's take a look at how the arrival rate affects the class 1 workload type. We can observe that the pipeline with just one stage performs best: when we use more than one stage for tasks that need only short processing times (e.g., see the results above for class 1), we gain nothing, and this is true for all the arrival rates studied. For the class 5 workload, by contrast, the number of stages that yields the best performance varies with the arrival rate.

We have looked at how the number of stages affects the performance of the pipeline model. We showed that the workload characteristics determine the number of stages needed to get the best outcomes. The following are essential takeaways:

• The workload characteristics dictate the number of stages in the pipeline design that may result in the best performance.
• If task processing times are quite short, we can get greater performance by limiting the number of stages.
• Using a pipeline with an ill-chosen number of stages may result in poor performance.
• Dynamically adjusting the number of stages in the pipeline design can yield better performance under variable (non-stationary) traffic conditions.

3.5 SUMMARY

• A pipeline, sometimes known as a data pipeline, is a set of data processing units connected in sequence, with one element's output becoming the input of the next. Pipeline elements are frequently processed in parallel or in time-sliced fashion.

• Pipelining refers to a parallel processing approach in which an operation or a computation is partitioned into discrete stages. Pipelining is common in processor control and arithmetic units. By separating calculations into stages, applications can also take advantage of pipelined parallelism.

• Pipelining is common in everyday life. For example, on a car factory assembly line, each task, such as installing the engine, the hood, or the wheels, is frequently performed by a different work station. The stations work in tandem, each on a different car, to complete their tasks. Once a car has completed one task, it moves on to the next station. Variations in the time required to perform the tasks can be managed by "buffering" (parking one or more cars between stations) and/or "stalling" (temporarily halting the upstream stations) until the next station becomes available.

• Assume that putting together one car requires three tasks that take 20, 10, and 15 minutes respectively. If all three tasks were completed by a single station, the factory would produce one car every 45 minutes. Using a three-station pipeline, the factory could produce the first car in 45 minutes and then a new one every 20 minutes.

• As this example demonstrates, pipelining does not reduce latency, that is, the overall time it takes for a single item to pass through the entire system. It does, however, increase the system's throughput: the rate at which subsequent items are processed after the first one has been handled.

• If all processing elements are synchronised and take the same amount of time to process an item, each item can be received by each element in a single clock cycle. Like waves in a water channel, the items flow through the pipeline at a consistent speed. Aside from the storage required for the data items, no synchronisation or buffering is required between the stages in such "wave pipelines".
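A quick, illustrative check of the arithmetic in the assembly-line example above:

```python
stations = [20, 10, 15]               # minutes per station, from the summary

single_station_rate = 60 / sum(stations)  # cars/hour with one station
first_car = sum(stations)                 # 45 minutes to fill the pipeline
steady_interval = max(stations)           # bottleneck station: 20 minutes
pipelined_rate = 60 / steady_interval     # cars/hour once the pipe is full

print(first_car, steady_interval, round(pipelined_rate, 2))  # 45 20 3.0
```

Latency stays at 45 minutes per car, but throughput rises from one car every 45 minutes to one every 20, limited by the slowest station.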

3.6 KEYWORDS

• Abstract data type. A data type given by its set of allowed values and the available operations on those values. The values and operations are defined independently of a particular representation of the values or implementation of the operations. In a programming language that directly supports ADTs, the interface of the type reveals the operations on it, but the implementation is hidden and can (in principle) be changed without affecting clients that use the type. The classic example of an ADT is a stack, which is defined by its operations, typically push and pop. Many different internal representations are possible.

• Atomic. Atomic has slightly different meanings in different contexts. An atomic operation at the hardware level is uninterruptible, for example load and store, or atomic test-and-set instructions. In the database world, an atomic operation (or transaction) is one that appears to execute completely or not at all. In parallel programming, an atomic operation is one for which sufficient synchronization has been provided that it cannot be interfered with by other units of execution. Atomic operations must also be guaranteed to terminate (e.g., no infinite loops).

• Data parallel. A type of parallel computing in which the concurrency is expressed by applying a single stream of instructions simultaneously to the elements of a data structure.

• Incremental parallelism. A technique for parallelizing an existing program, in which the parallelization is introduced as a sequence of incremental changes, parallelizing one loop at a time. Following each transformation, the program is tested to ensure that its behaviour is the same as that of the original program, greatly decreasing the chances of introducing undetected bugs.

• Massively parallel processor. A distributed-memory parallel computer designed to scale to hundreds if not thousands of processors. To better support high scalability, the computing elements or nodes in an MPP machine are custom-designed for use in a scalable computer. This typically includes tight integration between the computing elements and the scalable network.

• Multiprocessor. A parallel computer with multiple processors that share an address space.

3.7 LEARNING ACTIVITY

1. How might the clocks in two computers that are linked by a local network be synchronized without reference to an external time source? What factors limit the accuracy of the procedure you have described? How could the clocks in a large number of computers connected by the Internet be synchronized? Discuss the accuracy of that procedure.

2. A user arrives at a railway station that they have never visited before, carrying a PDA that is capable of wireless networking. Suggest how the user could be provided with information about the local services and amenities at that station, without entering the station's name or attributes. What technical challenges must be overcome?

3.8 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. Draw the architecture of pipelining.
2. Define pipeline in parallel computing.
3. What is the influence of the arrival rate on performance?
4. Give some examples of pipelines for arithmetic.
5. What is the data parallel model?

Long Questions
1. Describe the architecture of pipelining.
2. What is the procedure of the instruction pipeline?
3. Explain the impact of the arrival rate on performance.
4. Describe the advantages and disadvantages of pipelining.
5. Explain the arithmetic pipeline.

B. Multiple Choice Questions

1. The pipelining process is also called as ______
a. Assembly line operation
b. Superscalar operation
c. Von Neumann architecture
d. Utility software

2. The fetch and execution cycles are interleaved with the help of ________
a. Modification in architecture

b. Pipelining
c. Clock
d. Timestamps

3. Each stage in pipelining should be completed within ___________ cycle(s)
a. 2
b. 3
c. 1
d. 4

4. In pipelining, the task which requires the least time is performed ____.
a. First
b. Second
c. Third
d. Fourth

5. If a unit completes its task before the allotted time period, then _______
a. It gets reallocated
b. It gets saved
c. It will remain dynamic
d. It will remain idle for the remaining time

Answers
1-a, 2-c, 3-c, 4-a, 5-d

3.9 REFERENCES

Reference Books
• George Coulouris, Jean Dollimore, Tim Kindberg, "Distributed Systems: Concepts and Design" (4th Edition), Addison Wesley/Pearson Education.
• Pradeep K. Sinha, "Distributed Operating Systems: Concepts and Design", IEEE Computer Society Press.

Text Book References
• M.R. Bhujade, "Parallel Computing", 2nd edition, New Age International Publishers, 2009.
• Andrew S. Tanenbaum and Maarten Van Steen, "Distributed Systems: Principles and Paradigms", 2nd edition, Pearson Education, Inc., 2007, ISBN: 0-13-239227-5.

Websites
• https://www.geeksforgeeks.org/introduction-to-parallel
• https://hpc.llnl.gov/training/tutorials/introduction-parallel-computing-tutorial
• https://www.javatpoint.com/what-is-parallel-computing

UNIT 4 - SYNCHRONOUS PARALLEL PROCESSING

STRUCTURE
4.0 Learning Objectives
4.1 Introduction to Parallel Processing
4.2 SIMD Architecture
4.3 Examples of SIMD
4.4 SIMD Parallel Algorithm
4.5 Master-Slave Model
4.6 Summary
4.7 Keywords
4.8 Learning Activity
4.9 Unit End Questions
4.10 References

4.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Outline the basics of parallel processing
• Explain the SIMD architecture
• Describe the procedure of the SIMD parallel algorithm
• Explain the master-slave model

4.1 INTRODUCTION TO PARALLEL PROCESSING

Parallel processing is a set of techniques that allows a computer system to perform many data-processing jobs at the same time in order to increase its computational speed. A parallel processing system can process multiple pieces of data simultaneously, resulting in faster execution. For example, the next instruction can be read from memory while the current instruction is being processed in the ALU of the CPU. The fundamental goal of parallel processing is to improve the computer's processing capability and throughput, that is, the amount of computing that can be done in a given length of time.

A parallel processing system is made up of a number of functional units that perform the same or different activities at the same time, and the data can be distributed across these functional units.

Traditionally, software was written for serial computing: an algorithm divides a problem into smaller instructions, and these discrete instructions are then executed one by one on a computer's central processing unit. Only after one instruction completes does the next one begin. A real-life example of this is people waiting in line for a movie ticket with only one cashier, who hands out tickets to the individuals one by one; the situation becomes worse when there are two lines and still just one cashier.

In a nutshell, serial computing works as follows:

1. A problem statement is broken down into discrete instructions.
2. The instructions are then carried out one by one.
3. At any one time, just one instruction is executed.

Look at point three. It posed a significant difficulty in the computing industry: because only one instruction executed at any given time, only one section of the hardware was active for that instruction, a massive waste of hardware resources. As problem statements grew in size and complexity, so did the time needed to execute them. Pentium 3 and Pentium 4 CPUs are examples of such processors.

Now let's return to our real-life problem. When there are two queues and two cashiers delivering tickets to two people at the same time, we can safely say the wait is reduced. This is parallel computing in action: the simultaneous use of several processing elements to solve a problem. Problems are broken down into instructions and solved in parallel, since every resource applied to the work is working at the same time.

The computational landscape has undergone a significant transformation from serial to parallel computing. By using multicore CPUs, tech giants like Intel have already taken a step toward parallel computing, and in the future parallel computation will change the way computers work for the better. Parallel computing plays a big role in keeping us connected now that the world is more connected than ever; it is even more important with faster networks, distributed systems, and multi-processor computers.
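The contrast can be sketched in a few lines of Python; the sub-task function and inputs below are hypothetical stand-ins for real work:

```python
from concurrent.futures import ProcessPoolExecutor

def subtask(n: int) -> int:
    return sum(i * i for i in range(n))   # stand-in for real work

if __name__ == "__main__":
    inputs = [200_000, 300_000, 400_000, 500_000]

    # Serial: sub-tasks executed one after another on a single core.
    serial = [subtask(n) for n in inputs]

    # Parallel: the same independent sub-tasks spread across CPU cores.
    with ProcessPoolExecutor() as pool:
        parallel = list(pool.map(subtask, inputs))

    assert parallel == serial   # same answers, potentially less wall time
```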

4.2 SIMD ARCHITECTURE

The acronym SIMD stands for 'Single Instruction, Multiple Data Stream.' It denotes an organisation with a large number of processing units supervised by a common control unit. The control unit sends the same instruction to all processors, but they work on separate data. To communicate with all of the processors at the same time, the shared memory unit must have multiple modules.

Figure 4.1 Architecture of SIMD

As the name implies, such systems contain several incoming data streams as well as a large number of processing units, each acting on a single instruction at any given time. They follow the same parallel computing architecture as multiprocessor systems. The SIMD computer is one of Flynn's four computer classifications; SISD, MISD, and MIMD computers are the other three.

Figure 4.2 SIMD processing technique

In a SIMD computer, multiple processing elements are supervised by the common control unit, as shown in the picture above. The control unit broadcasts the same instruction to all of the processing elements (ALUs), but they operate on separate data sets drawn from different data streams. As the diagram shows, the control unit sends a common instruction stream to each processing element. The shared memory subsystem, which includes several modules, is critical.

Typically there are many small, simple processors, each equipped with its own local memory. Each is usually connected only to its neighbours (north, east, west, and south, the "NEWS" grid), and the processors are usually custom-designed rather than off-the-shelf. Each instruction is issued by a single "control processor", and every processor then performs the same operation on its local data at the same time. On any given instruction, some processors may be turned off; this enables the execution of conditional logic.
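NumPy gives a convenient, if loose, software analogue of this masked execution: one operation is issued over a whole array, and a mask decides which lanes take effect. This is an illustration of the idea, not the hardware mechanism itself:

```python
import numpy as np

data = np.array([3, -1, 4, -1, 5])
mask = data > 0                      # which "processing elements" are active

# One instruction, applied only where the mask enables the lane:
result = np.where(mask, data * 10, data)
print(result)                        # [30 -1 40 -1 50]
```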

Figure 4.3 Processor flow

The main benefit of these machines is the ease with which they can be programmed. Their disadvantage is that they are specialised machines, not suitable for other kinds of applications. These architectures support the interesting data-parallelism model: in data-parallel computing, the parallelism is in the data, not in the control. Data-parallel computing also uses the vector notion of parallelism; this paradigm assumes multiple data and serial control.

Figure 4.4 Memory unit system

Figure 4.4 depicts a vector architecture (vector RAM or VRAM). The programme is stored in scalar memory, and there is only one programme and one programme counter.

Parallel processing has emerged as a useful technique in recent computers to address the demand for faster performance, lower costs, and more accurate results in real-world applications. Concurrent events are common in today's computers due to the practice of multiprogramming, multiprocessing, or multicomputing.

Software packages on modern computers are powerful and broad. To examine the evolution of computer performance, we must first understand the basic evolution of hardware and software.

Milestones in Computer Development. Computer development passed through two major phases, mechanical and electromechanical, before the development of electronic components gave rise to modern computers. In electronic computers, the moving parts of mechanical computers were replaced by high-mobility electrons, and electric signals, which travel at nearly the speed of light, replaced mechanical gears and levers for data transmission.

Components of Today's Computers. Computer hardware, instruction sets, application programmes, system software, and the user interface make up a contemporary computer system. Computing problems fall into three types: numerical computing, logical reasoning, and transaction processing. Some complicated tasks may require all three processing modes.

Computer architecture has undergone revolutionary changes over the last four decades, from the Von Neumann architecture to multicomputers and multiprocessors. A SIMD system is a multiprocessor machine that executes the same instruction on all CPUs while processing data from distinct data streams. Machines based on a SIMD model are well suited to scientific computing, which involves a great deal of vector and matrix operations. The data elements of vectors can be divided into sets (N sets for an N-PE system) so that the information can be distributed to all processing elements (PEs), each of which handles one data set.

Figure 4.5 SIMD system

The efficiency of a computer system. A computer system's performance is determined by both machine capability and programme behaviour. Better hardware technology, advanced architectural features, and effective resource management can all improve machine capability. Programme behaviour, however, is unpredictable, because it depends on the application and on run-time factors.

4.3 EXAMPLES OF SIMD

Examples of SIMD computers include:

• Illiac-IV
• PEPE
• BSP

Graphics cards are the best modern illustration of SIMD: there are hundreds of processing units on these cards. If we compare the computation in SISD and SIMD, a SISD architecture would have to conduct three separate add operations for the arrays [5, 15, 20] and [15, 25, 10], whereas the SIMD design allows us to add them all in a single operation.

Advantages of SIMD. The following are some of the benefits of SIMD architecture:

• A single instruction can perform the same operation on several elements.
• The system's throughput can be increased by raising the processor's core count.
• The processing speed is higher than that of the SISD design.

Disadvantages of SIMD. The following are some of the drawbacks of SIMD architecture:

• Communication among the CPU cores is complex.
• The cost is greater than for a SISD architecture.

4.4 SIMD PARALLEL ALGORITHM

The processors work in a synchronous manner: at each step, every processor executes the same instruction on a different datum. The instruction can be basic (such as adding or comparing two integers) or complex (such as merging two lists of numbers). Similarly, the datum can be simple (a single number) or complicated (several numbers). Sometimes only a subset of the processors needs to execute an instruction. A flag carried in the instruction itself can indicate whether the processor should be active (and execute the instruction) or inactive (and wait for the next instruction).

A mechanism such as a global clock ensures lock-step operation, so processors that are idle during an instruction, or that finish it earlier than others, remain idle until the next instruction is issued. The time between two instructions may be fixed or may vary depending on the instruction being executed.

Most interesting problems that we want to solve on a SIMD computer require the processors to be able to communicate with one another during the computation, so that data or intermediate results can be exchanged. This can be accomplished in two ways, giving rise to two subclasses: SIMD computers that communicate via shared memory, and SIMD computers that communicate via an interconnection network.

To create a model of a parallel algorithm, consider a strategy for dividing the data and the processing, then apply an appropriate technique to limit the interactions. The following models of parallel algorithms will be discussed in this chapter:

• Data parallel model
• Task graph model
• Work pool model
• Master-slave model
• Producer-consumer or pipeline model
• Hybrid model

Data Parallel Model

In the data parallel model, tasks are assigned to processes and each task performs similar operations on different data. Data parallelism is the result of a single operation being applied to numerous data items at the same time. The data-parallel model can be used with shared address spaces and message-passing paradigms. Interaction overheads can be reduced by adopting a locality-preserving decomposition, by optimising collective interaction routines, or by overlapping computation with interaction. A key feature of data-parallel problems is that the degree of data parallelism increases with the problem size, allowing more processes to be used to tackle larger problems.

Example: dense matrix multiplication.
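The SIMD addition example from section 4.3 and the dense-matrix example above can both be written as single whole-array operations; the NumPy sketch below is illustrative:

```python
import numpy as np

a = np.array([5, 15, 20])
b = np.array([15, 25, 10])
print(a + b)                 # one vector add instead of three scalar adds

# Dense matrix multiplication, the classic data-parallel workload:
m = np.arange(9).reshape(3, 3)
print(m @ m)                 # the same operation applied over many elements
```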

Figure 4.6 Parallel system

Task Graph Model

In the task graph model, parallelism is expressed by a task graph, which can be either simple or complex. The relationships among the tasks are exploited to promote locality or to reduce interaction costs. This model is used to solve problems in which the amount of data associated with the tasks is large relative to the amount of computation required to complete them. Tasks are assigned so as to reduce the cost of data movement between them. Examples include parallel quicksort, sparse matrix factorization, and divide-and-conquer parallel algorithms.

Figure 4.7 Task graph model

Problems are broken down into atomic tasks and visualised as a graph. Each task is a stand-alone job that depends on one or more antecedent tasks. After a task completes, its output is delivered to the dependent tasks. A task with an antecedent begins execution only when its antecedent has finished. The final output of the graph is produced when the last dependent task completes (Task 6 in Figure 4.7).
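An illustrative execution of such a graph is sketched below; the exact edges of Figure 4.7 are not given in the text, so the dependency table is hypothetical:

```python
deps = {                 # task -> tasks it depends on (hypothetical edges)
    1: [], 2: [], 3: [1], 4: [1, 2], 5: [3, 4], 6: [5],
}

done, order = set(), []
while len(done) < len(deps):
    for task, antecedents in deps.items():
        if task not in done and all(a in done for a in antecedents):
            order.append(task)        # the task would run here
            done.add(task)
print(order)                          # [1, 2, 3, 4, 5, 6]
```

Tasks with no unfinished antecedents (here 1 and 2) could run in parallel; each remaining task starts only once everything it depends on has completed.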

Work Pool Model

In the work pool model, tasks are dynamically assigned to processes to balance the load, so any process may potentially execute any task. This model is used when the amount of data associated with tasks is small compared to the computation associated with them.

Figure 4.8 Work pool model

There is no desired pre-assignment of tasks to processes. Task assignment can be centralised or decentralised. Pointers to the tasks are stored in a physically shared list, priority queue, hash table, or tree, or in a physically distributed data structure. The work may all be available at the start, or it may be generated dynamically. If tasks are generated dynamically and assignment is decentralised, a termination detection technique is required so that all processes can recognise the completion of the entire programme and stop looking for new tasks. An example is a parallel tree search.
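Here is a minimal sketch of the work pool idea, with a shared queue and several workers pulling from it; the task values are illustrative, and a dynamically growing pool would additionally need the termination detection just mentioned:

```python
import queue
import threading

pool = queue.Queue()
for task in range(10):        # a static pool of tasks, known up front
    pool.put(task)

results, lock = [], threading.Lock()

def worker():
    while True:
        try:
            task = pool.get_nowait()   # pull whatever work is available
        except queue.Empty:
            return                     # pool drained: this worker stops
        value = task * task            # stand-in for real computation
        with lock:
            results.append(value)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))
```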

4.5 MASTER-SLAVE MODEL

In the master-slave architecture, one or more master processes generate tasks and distribute them to slave processes. The tasks may be assigned ahead of time if the master can estimate their size, if a random assignment balances the load satisfactorily, or if slaves are assigned smaller pieces of work at different times.

Because the interaction is naturally two-way, this model is equally suitable for shared-address-space and message-passing paradigms.

Figure 4.9 Master-slave model

In some instances a task may have to be performed in phases, with each phase's tasks finished before the tasks of the following phase can be generated. The master-slave paradigm can then be extended to a hierarchical, multi-level master-slave model, in which the top-level master assigns a major portion of the work to second-level masters, who divide it among their own slaves and may complete part of it themselves.

Use the master-slave model with caution: it is important to ensure that the master does not become a bottleneck, which can happen if the tasks are too small or the workers are too fast. Tasks should be chosen so that the cost of performing a task outweighs the cost of communication and synchronisation. Asynchronous interaction may help overlap the master's task generation with the computation.
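A minimal sketch of the pattern, using a process pool as the set of slaves (the task function and inputs are illustrative):

```python
from multiprocessing import Pool

def slave(task: int) -> int:
    return task * 2            # each slave works independently

if __name__ == "__main__":
    tasks = list(range(8))     # the master generates the work...
    with Pool(processes=4) as slaves:
        results = slaves.map(slave, tasks)   # ...distributes it and collects
    print(results)
```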

Pipeline Model

This model is also known as the producer-consumer model. A stream of data is passed through a succession of processes, each of which performs some task on it. The arrival of new data triggers the execution of a new task by a process in the queue. The processes may form a queue in the shape of linear or multidimensional arrays, trees, or general graphs with or without cycles.

Figure 4.10 Pipeline model

This model is a chain of producers and consumers. Each process in the queue can be thought of as a consumer of the sequence of data items produced by the process preceding it in the queue, and as a producer of data for the process following it. The queue need not be a linear chain; it can be a directed graph. The most common interaction-minimisation technique for this model is overlapping interaction with computation.

4.6 SUMMARY

• Parallel processing is a computing technique in which two or more processors (CPUs) handle different sections of a larger task. Splitting the distinct parts of a task across multiple processors reduces the amount of time it takes to run a programme.

• Computational astrophysics, geoprocessing (or seismic surveying), climate modelling, agriculture estimates, financial risk management, video colour correction, computational fluid dynamics, medical imaging, and drug discovery are all examples of applications for parallel processing (also known as parallel computing).

• A parallel programme consists of numerous active processes (tasks) that work together to solve a problem. A divide-and-conquer strategy splits a task into several subtasks, each of which is processed on a separate central processing unit (CPU).

• Single Instruction Multiple Data (SIMD) architecture, on the other hand, is not a new concept in computing; it has been around since the Cray-1 supercomputer. SIMD works by packing a vector with data and operating on the elements in parallel, so that one instruction is carried out on multiple pieces of data at the same time.

• In the master/slave (also known as boss/worker) model, a master entity takes one or more requests and then creates slave entities to execute them. Typically, the master controls the number of slaves and the tasks each slave performs. Slaves operate independently of one another.

4.7 KEYWORDS

• Call stack. The list of called subroutines that has to be unwound by jumps back to the previous calling routine.

• Clock cycle. The small interval of time between operations in the computer, based on the clock frequency of the system.

• Code coverage. A metric of how many lines of the source code are executed and therefore "covered" by running a test suite. It is usually expressed as a percentage of the source lines of code.

• Centralized version control system. A version control system implemented as a single centralized system.

• Computational kernel. A section of the application which is both computationally intensive and conceptually self-contained.

• MIMD. Multiple instructions, multiple data; a component of Flynn's taxonomy, represented by a multi-core system.

4.8 LEARNING ACTIVITY

1. Give an example of an interleaving of two transactions that is serially equivalent at each server but is not serially equivalent globally.

2. Explain why making some replica managers read-only may improve the performance of a gossip system.

4.9 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is the goal of parallel processing?
2. Define vector architecture.
3. Expand SIMD.
4. Give some examples of SIMD architecture.
5. What is the data parallel model?

Long Questions
1. Describe the SIMD parallel algorithm.
2. What is the procedure of the task graph model?
3. Explain the SIMD architecture.
4. Describe the advantages and disadvantages of SIMD.
5. Explain the master-slave model.

B. Multiple Choice Questions

1. A term for simultaneous access to a resource, physical or logical___________
a. Concurrency
b. Multiprogramming
c. Tasking
d. Token ring

2. A parallelism based on increasing processor word size___________
a. Increasing
b. Count based
c. Bit level
d. Framing

3. A type of parallelism that uses micro-architectural techniques___________
a. bit level
b. bit based

c. increasing
d. instructional

4. The measure of the "effort" needed to maintain efficiency while adding processors_________
a. Maintainability
b. Transparency
c. Scalability
d. Occupational

5. SIMD stands for ____
a. Single Instruction Multiple Data
b. Simple Instruction Multiple Decoding
c. Sequential Instruction Multiple Decoding
d. System Information Mutable Data

Answers
1-b, 2-c, 3-d, 4-a, 5-a

4.10 REFERENCES

Reference Books
• George Coulouris, Jean Dollimore, Tim Kindberg, "Distributed Systems: Concepts and Design" (4th Edition), Addison Wesley/Pearson Education.
• Pradeep K. Sinha, "Distributed Operating Systems: Concepts and Design", IEEE Computer Society Press.

Text Book References
• M.R. Bhujade, "Parallel Computing", 2nd edition, New Age International Publishers, 2009.
• Andrew S. Tanenbaum and Maarten Van Steen, "Distributed Systems: Principles and Paradigms", 2nd edition, Pearson Education, Inc., 2007, ISBN: 0-13-239227-5.

Websites
• https://www.geeksforgeeks.org/introduction-to-parallel

• https://hpc.llnl.gov/training/tutorials/introduction-parallel-computing-tutorial
• https://www.javatpoint.com/what-is-parallel-computing

UNIT 5 - SYNCHRONOUS PARALLEL PROCESSING

STRUCTURE
5.0 Learning Objectives
5.1 Introduction to Data Mapping
5.2 Array Processors
5.3 Data Mapping and Memory in Array Processors
5.4 Case Studies of SIMD Parallel Processors
5.5 Summary
5.6 Keywords
5.7 Learning Activity
5.8 Unit End Questions
5.9 Suggested Reading

5.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Describe data mapping
• Explain the working of array processors
• Understand the case studies of SIMD parallel processors

5.1 INTRODUCTION TO DATA MAPPING

Many data operations depend on data mapping to be successful. A single blunder in data mapping can spread throughout your company, resulting in duplicated errors and, eventually, faulty analyses.

Almost every company will have to transfer data between systems at some point, and different systems store similar data in different ways. As a result, in order to move and consolidate data for analysis or other tasks, a roadmap is required to ensure that the data arrives at its destination in a timely manner. The quality of the data mapping determines the quality of the data to be analysed for insights in processes including data integration, data migration, data storage automation, data synchronisation, automated data extraction, and other data management projects.

The process of connecting fields from one database to another is known as data mapping. It is the first step in making data transfer, data integration, and other data management tasks more straightforward.

Before data can be analysed for business insights, it must be homogenised into a form that makes it accessible to decision makers. Data now comes from many sources, and each source can define similar data elements in different ways. For example, a source system's state field may record Illinois as "Illinois", while the destination system stores it as "IL". When data is moved from a source to a destination, data mapping bridges the gaps between the two systems and data models, ensuring that the data is accurate and usable.

Data mapping has long been a standard corporate function, but as the volume of data and the number of sources have grown, the process has become more complex, requiring automated tools to make it feasible for large data sets. Data mapping is the key to data management: many data management activities depend on it, and if data is not mapped correctly it may become corrupted on the way to its destination. The quality of the data mapping is critical for getting the most out of your data in data migrations, integrations, transformations, and in populating a data warehouse.

Migration of data. Data migration is the process of moving data from one system to another in a one-time event. This is usually data that does not change over time. After the migration, the destination becomes the new source of the migrated data, and the original source is retired. Data mapping aids migration by mapping source fields to destination fields.

Integration of data. Data integration is the process of moving data from one system to another on a regular basis. The integration can be scheduled, for example quarterly or monthly, or it can be triggered by an event. Data is stored and maintained at both the source and the destination. Like data migration, data maps for integrations match source fields with destination fields.

Transformation of data. Data transformation is the process of changing data from one format to another. This can include changing data types, removing nulls or duplicates, aggregating data, enriching data, or performing other transformations. For example, "Illinois" can be transformed to "IL" to match the destination format. These transformation formulas are part of the data map: as the data is transferred, the data map applies the transformation formulas to convert the data into the correct format for analysis.

Data warehousing. A data warehouse is typically used when the goal is to consolidate data into a single source for analysis or other purposes.

When you run a query, a report, or an analysis, the data comes from the warehouse; it has already been migrated, integrated, and transformed there. Data mapping ensures that the data enters the warehouse, and reaches its intended destination, in the correct format.

What are the steps in the data mapping procedure?

Step 1: Define. Define the data to be moved, including the tables, the fields within the tables, and the format of each field after the move. For data integrations, the frequency of data transfer is also specified.

Step 2: Map the data. Match the source fields to the destination fields.

Step 3: Transformation. If a field requires transformation, the transformation formula or rule is coded.

Step 4: Test. Run the transfer on a test system with sample data from the source, check the results, and make any necessary adjustments.

Step 5: Deploy. Once the data transformation is confirmed to work as expected, schedule the migration or integration go-live.

Step 6: Maintain and update. The data map is a living entity that will require updates and adjustments as new data sources are added, as existing data sources change, or as destination requirements change.
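Here is a tiny, hypothetical sketch of a data map with one transformation rule, echoing the "Illinois" to "IL" example above; all field names and rules are invented for illustration:

```python
STATE_ABBREV = {"Illinois": "IL", "Texas": "TX"}

FIELD_MAP = {                 # source field -> destination field
    "cust_name": "customer_name",
    "st": "state",
}

def transform(source_record: dict) -> dict:
    """Apply the field map, plus the transformation rule for state names."""
    dest = {}
    for src_field, dest_field in FIELD_MAP.items():
        value = source_record[src_field]
        if dest_field == "state":                 # transformation rule
            value = STATE_ABBREV.get(value, value)
        dest[dest_field] = value
    return dest

print(transform({"cust_name": "Ada", "st": "Illinois"}))
# {'customer_name': 'Ada', 'state': 'IL'}
```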

What can a good data mapping tool do for you?

Figure 5.1 Data Mapping

Advanced cloud-based data mapping and transformation solutions can help businesses get more value from their data without breaking the bank. In the data mapping example of Figure 5.1, data fields are mapped from a source to a destination. Previously, data mappings were documented on paper, which was sufficient at the time. The landscape, however, has become far more complicated: paper-based systems cannot keep up with more data, more mappings, and rapid modifications. They lack transparency and do not keep track of the inevitable changes to the data models. Hand-mapping also entails coding modifications by hand, which is time-consuming and error-prone.

Transparency for analysts and architects
Because data quality is critical, data analysts and architects require a detailed, real-time view of the data at its source and destination. Data mapping tools give them a unified picture of the data structures being mapped, allowing them to examine the data content, flow, and transformations.

Complex format optimization
With so much data pouring in from various sources, data compatibility becomes a potential issue. Good data mapping solutions simplify the translation process with built-in capabilities that ensure accurate conversion of complex formats, saving time and reducing the risk of human error.

Fewer obstacles when data models change
Data maps are not a one-time project. Maps must be updated as data standards, reporting requirements, and systems change. With a cloud-based data mapping tool, stakeholders no longer have to worry about losing change documentation. Good data mapping tools let users track the impact of changes when maps are updated, and maps can be reused, so users do not have to start from scratch every time.

Figure 5.2 Array Processors

Array Processors Come in a Variety of Shapes and Sizes
Array processors are divided into two categories:
 Attached array processors
 SIMD array processors

Attached Array Processors
An attached array processor is a processor that is attached to a general-purpose computer with the goal of enhancing the machine's performance on numerical computational tasks. It achieves high performance by using numerous functional units in parallel.

SIMD Array Processors
SIMD refers to the configuration of a single computer with several parallel processing units. The processing units are designed to work together under the supervision of a single control unit, resulting in a single instruction stream operating on multiple data streams.

A general block diagram of an array processor is given below. It is made up of a number of identical processing elements (PEs), each with its own local memory M. Each processing element includes an ALU and registers. The master control unit controls the operations of the processing elements; it also decodes instructions and determines how they should be carried out. The program is stored in the main memory, and instructions are fetched by the control unit. Vector instructions are broadcast to all PEs simultaneously, and the results are stored in memory.

Figure 5.3 Vector Instructions
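Before turning to a concrete machine, the single-instruction, multiple-data idea can be made concrete with a minimal Python sketch in which a master control unit broadcasts one vector instruction and every PE applies it to its own local operands. The PE count and the data values are arbitrary illustrative assumptions.

    # A toy SIMD array: one instruction, broadcast to all PEs,
    # each PE applying it to its own local data element.
    # PE count and data values are arbitrary illustrative assumptions.
    local_a = [1, 2, 3, 4]        # operand A, one element per PE
    local_b = [10, 20, 30, 40]    # operand B, one element per PE

    def broadcast(instruction, a, b):
        """The master control unit issues one instruction; every PE executes it."""
        return [instruction(x, y) for x, y in zip(a, b)]

    # One vector ADD is executed by all four PEs on their own operands.
    result = broadcast(lambda x, y: x + y, local_a, local_b)
    print(result)  # [11, 22, 33, 44]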

The ILLIAC IV computer, manufactured by the Burroughs Corporation, is the most well-known SIMD array processor. Single Instruction, Multiple Data (SIMD) processors are highly specialised computers: they are suited only to numerical problems that can be expressed as vectors or matrices, and not to other kinds of computation.

Why should you utilise an Array Processor?
 Array processors boost the overall speed of instruction processing.
 Because most array processors run asynchronously from the host CPU, the system's overall capacity is increased.
 Array processors have their own local memory, providing additional memory to systems with limited memory.

5.3 DATA MAPPING AND MEMORY IN ARRAY PROCESSORS

The development of array processors for high-performance signal processing has been intimately linked to the development of high-performance digital computers over the previous fifty years. The von Neumann architecture was the first and most well-known computer design. A SISD machine has a single processor that executes a single instruction stream (SI) on a single stream of data (SD). The digital computer revolution was a huge success because of this design's conceptual simplicity and the accompanying technology. The use of improved device technology with faster switching speeds has been an obvious strategy for attaining higher computational throughput.

This method, which relies solely on device technology advancements, has frequently proven insufficient. Alternative basic principles of computer processing and architecture have therefore been proposed. One technique has been to employ various types of concurrent operation. Concurrency refers to a computer system's ability to perform many operations at the same time, which can be accomplished through parallelism, pipelining, or both. Parallelism exploits concurrency by replicating processing elements (PEs); a high throughput rate is obtained by operating these units simultaneously on distinct parts of the programme. Several parallel-processing, vector-data processor arrays use the single-instruction (SI), multiple-data (MD) SIMD mode. Pipelining, on the other hand, achieves concurrency by splitting a difficult task into many smaller, simpler portions handled by appropriate PEs, so that the processing is carried out in an assembly-line fashion. This digital pipe is set up so that it can process instructions and data regardless of the number of PEs in the pipe, and fast PEs allow a high flow rate through the pipe. Because distinct parts of a single data (SD) stream are processed simultaneously by several instructions (MI), a computer that uses pipelining in this way is referred to as an MISD machine. Finally, multiple-instruction, multiple-data-stream (MIMD) computers have a large number of PEs, each with its own processing capability.

For many years, these broad "parallel" processing computer architectures were reserved for highly computation-intensive, off-line scientific, economic, meteorological, and military applications on specialised mainframe and minicomputer array processing computers. These parallel processing computers were almost always prohibitively expensive, and they were rarely used for real-time signal processing applications. Only with the introduction of VLSI semiconductor fabrication technology have such parallel processing systems become important in high-performance modern signal processing.

The concept of systolic arrays is one of the most fundamental and crucial links between concurrent computation/architecture and high-throughput modern signal processing. H. T. Kung and C. E. Leiserson coined the term "systolic array" to describe a simple form of concurrent processor in which processed data moves in a regular and periodic manner, analogous to the heart's systolic pumping action. The most basic idea of a systolic array is that once data is available, it is used effectively inside multiple PEs in the array to increase throughput; as a result, a systolic array can exploit both parallelism and pipelining in various algorithms. According to H. T. Kung's motivational and tutorial work, an array qualifies as systolic when there are only a few types of PEs in the array, each type performing the same specialised operation; all activities are carried out synchronously, regardless of the data being processed; the synchronous clock signals are the only control data transmitted to the PEs; and the PEs communicate only with their nearest neighbours. A systolic array's regular structure and local communication are consistent with modern VLSI design.
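The pumping behaviour can be imitated in a few lines of Python. The sketch below is a simplified, semi-systolic version of a 3-tap correlation array: the weights stay fixed in the PEs, the partial sums move one PE to the right on every clock tick, and each new input sample is applied to all PEs in the same cycle (a fully systolic design would also pass the samples through nearest-neighbour links). The weights and the input stream are arbitrary illustrative values.

    # A simplified (semi-systolic) sketch of a 3-tap correlation array:
    # weights stay in the PEs, partial sums pump one PE to the right per cycle,
    # and each new sample is applied to every PE. Values are assumptions.
    weights = [2, 1, 3]                 # w0, w1, w2 held in PE0, PE1, PE2
    sums = [0, 0, 0]                    # partial results held in the PEs

    def clock_tick(sample):
        """One synchronous cycle; returns the value emerging from the last PE."""
        global sums
        sums = [0] + sums[:-1]                               # pump partial sums right
        sums = [s + w * sample for s, w in zip(sums, weights)]
        return sums[-1]

    stream = [1, 4, 2, 5, 3]
    outputs = [clock_tick(x) for x in stream]
    # After a 2-cycle fill latency, each cycle yields w0*x[t-2] + w1*x[t-1] + w2*x[t]:
    # outputs[2] == 2*1 + 1*4 + 3*2 == 12
    print(outputs)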

Intuitively, a number of systolic designs for correlation/convolution were created. These examples show how the same signal processing technique can have a variety of systolic implementations, each with its own hardware requirements and ramifications.

5.4 CASE STUDIES OF SIMD PARALLEL PROCESSORS

The SIMD computer that Goodyear Aerospace Corporation of the United States constructed and delivered to NASA in 1982 is known as the Massively Parallel Processor (MPP). The machine's processing capacity comes from 16 384 bit-serial processing elements, which can perform, for example, up to 6553 million 8-bit integer additions per second. MPP's basic structure is depicted in the figure. The array unit (ARU) performs array computations under the control of the array control unit, which also does scalar arithmetic. The programme and data management unit is in charge of the overall flow of data and control in the MPP; it is also used to develop programmes and execute diagnostic tasks. Data is transferred to and from the ARU through two switches, via a staging memory and two 128-bit wide I/O interfaces.

The ARU's PEs are laid out in a 128 × 128 square array. Each PE is a simple bit-serial processor with 1024 bits of memory. A VLSI chip contains a subarray of 2 × 4 processors (without memory); only by employing such large-scale integration could the scale of MPP be kept manageable. For fault tolerance, the array has four extra columns of PEs (for a total of 132); four columns were chosen because of the 2 × 4 arrangement of PEs in the VLSI chips. If there are no faults in the array, an arbitrary set of four columns is deactivated. When a fault occurs, the set of four columns containing the faulty PE is disabled instead, and processing can resume at full speed. Because data is frequently corrupted when a fault occurs, the currently running programme is either restarted from the beginning or rerun from a checkpoint at which all data and system state are known.

MPP uses the PE-to-PE organisation. A variant of the simple FNN (four-nearest-neighbour) interprocessor communication network was chosen because of the huge number of processing elements, the simplicity of each PE, and the projected computational tasks. Within each column, the top and bottom connections can be left open or connected. Within each row, the left and right edges can be left open or joined in an open spiral or a closed spiral. In the open spiral, PE i is bidirectionally connected to PE i+1 for 0 ≤ i ≤ 16382 (PEs 0 and 16 383 are not connected); in the closed spiral, PEs 0 and 16 383 are also connected. Several edge connection modes can be selected under software control; the algorithm in use determines which edge connections are appropriate. As with any FNN network, several data transfers may be required in some circumstances to get data to its intended destination.
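A rough feel for nearest-neighbour routing on such a mesh is given by the Python sketch below, which shifts a whole plane of data one PE to the right per step; the 4 × 4 grid size and the wrap-around (closed) edge mode are illustrative assumptions, not MPP parameters.

    # Toy nearest-neighbour shift on a PE mesh: every PE passes its value
    # to its right-hand neighbour in one step. Grid size and the wrap-around
    # (closed) edge mode are illustrative assumptions.
    grid = [[r * 4 + c for c in range(4)] for r in range(4)]

    def shift_right(plane, wrap=True):
        """One routing step: each PE receives from its left neighbour."""
        shifted = []
        for row in plane:
            incoming = row[-1] if wrap else 0   # edge behaviour depends on mode
            shifted.append([incoming] + row[:-1])
        return shifted

    # Moving data k columns takes k single-step transfers, as in an FNN network.
    plane = shift_right(shift_right(grid))
    print(plane)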

A PE is depicted in the figure. The bit-serial PEs operate at a rate of one operation per 100 nanoseconds. Six 1-bit registers are available: A, B, C, G, P and S. They communicate with one another and with the random access memory over a data bus. Arithmetic is performed by a 1-bit full adder together with a shift register of configurable length. The P register connects to the interconnection network, with a bidirectional link to each of the PE's nearest neighbours. To route data from top to bottom, for example, each PE moves the contents of its P register into the P register of the PE below, receiving data from its neighbour above. Additional logic associated with the P register can perform all Boolean operations on the content of the P register and the bit currently on the data bus, with the outcome stored in the P register. The adder adds the contents of the A and P registers and the carry in the C register, storing the sum in the B register and the carry in the C register. The shift register's length can be set to 2, 6, 10, 14, 18, 22, 26, or 30 bits; the A and B registers can be regarded as additional shift register elements, giving total shift register lengths of 4 to 32 bits. With just these few components, all basic arithmetic operations may be performed, including addition, subtraction, multiplication, division, and floating-point operations.

In many applications (such as the smoothing example), PEs must be disabled. The G register is used for this purpose: only those PEs with a 1 in their G register participate in the execution of a masked instruction. Certain algorithms, such as floating-point number normalisation, benefit from the use of a comparator. The sum-OR tree, a tree of inclusive-OR gates, receives input from all PEs and sends its output to the array control unit; it supports global minimum and maximum value searches as well as other global operations.
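The masking role of the G register can be sketched in a few lines of Python: every PE receives the same instruction, but only those whose mask bit is 1 update their state. The data values and the mask pattern below are arbitrary illustrative assumptions.

    # Masked SIMD execution: all PEs receive the instruction, but only
    # those with a 1 in their G (mask) register take part.
    # Data values and the mask pattern are arbitrary illustrative assumptions.
    data = [5, -3, 8, -1]
    g_register = [1 if x < 0 else 0 for x in data]   # enable PEs holding negatives

    # Masked instruction: negate the local value (e.g., to compute absolute values).
    data = [-x if g else x for x, g in zip(data, g_register)]
    print(data)  # [5, 3, 8, 1]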

The S-registers are where data enters and leaves the ARU. Each PE has one 1-bit S-register, and the 128 × 128 S-registers together form a data-shifting plane. Each S-register operates independently of the rest of the PE. An S-register sends its content to its right neighbour and receives content from its left neighbour, so that 128-bit data columns are shifted across the S-register plane. The input switches provide the 128 bits of data that are shifted concurrently into the S-registers of the leftmost column of PEs, without interfering with normal processing. After 128 shift cycles, a complete 128 × 128-bit plane is present in the S-registers. This whole 128 × 128-bit data plane is transferred from the S-registers into PE memory in a single 100 ns cycle, with each S-register sending one bit to its PE. In a second cycle, if necessary, all PEs can simultaneously move a memory bit from an old plane into their associated S-register. Then a new 128-bit plane can be shifted in while the old plane is shifted out. The output switches accept the data moved out of the rightmost PEs, one column (128 bits) per cycle. This overlapping shift-in and shift-out method is an extremely effective means of getting data into and out of the MPP PEs.

The staging memory allows the ARU to communicate with magnetic tapes, discs, terminals, line printers, and a host computer. Data from the outside world, such as images from a camera, will arrive arranged as a stream of pixel values. The same is true for output: a smoothed image sent to an image display device, for example, must normally be represented as a stream of pixel values. The PEs, on the other hand, must be loaded with a bit slice from all 128 × 128 pixel values at the same time. The staging memory (which has a capacity of up to 40 Mbytes) eases this reformatting of the input and output data.

5.5 SUMMARY

 The process of matching fields from one database to another is known as data mapping. It is the first step in making data transfer, data integration, and other data management activities more straightforward. Data must be homogenised into a form that makes it accessible to decision makers before it can be evaluated for business insights.
 Data mapping, in its most basic form, allows a company's databases to communicate with one another. In practice, data professionals connect data sources by linking attributes and values. Consider the case where a customer's information is stored in two distinct databases.
 Multiprocessors and vector processors are other terms for array processors. They work with massive data sets to accomplish calculations, and as a result they are used to boost a computer's performance.
 A vector processor, also known as an array processor, is a central processing unit (CPU) that implements an instruction set designed to operate efficiently and effectively on large one-dimensional data arrays known as vectors.

5.6 KEYWORDS

 Asynchronous communication – A type of communication in which the send operation is non-blocking and receive can be blocking (more common) or non-blocking.
 Asynchronous interaction model – A model of distributed systems in which process execution speeds, message transmission delays and clock drift rates are arbitrary.
 Datagram – The form of a packet sent across a network. Each datagram has a header that identifies both the sender and receiver, followed by data.
 Dependability – The set of requirements placed on a computer system which ensures its correctness, security and fault-tolerance.
 Distributed system – A system of networked computers which communicate and coordinate their actions only by passing messages.

5.7 LEARNING ACTIVITY

1. Consider the behavior of two machines in a distributed system. Both have clocks that are supposed to tick 1000 times per millisecond. One of them actually does, but the other ticks only 990 times per millisecond. If UTC updates come in once a minute, what is the maximum clock skew that will occur?
___________________________________________________________________________
___________________________________________________________________________

2. When a node synchronizes its clock to that of another node, it is generally a good idea to take previous measurements into account as well. Why? Also, give an example of how such past readings could be taken into account.
___________________________________________________________________________
___________________________________________________________________________

5.8 UNIT END QUESTIONS

A. Descriptive Questions
Short questions
1. What is data mapping?
2. Define VLSI architecture.
3. Expand SISD.
4. What are the categories of array processors?
5. Define complex format optimization.

Long questions
1. What can a good data mapping tool do for you?
2. What are the steps in the data mapping procedure?
3. Explain about data mapping.
4. Describe about array processors.
5. Why should you utilise an Array Processor?

B. Multiple Choice Questions
1. The software is stored in the ______
a. Processor
b. Main memory

c. Resistance
d. Secondary memory

2. Multiprocessors and vector processors are other terms for _____
a. Multiware
b. Middleware
c. Array processors
d. Supply

3. Processes and threads on the CPUs are ______.
a. Scheduled
b. Administered
c. Complex
d. Work stations

4. _______ require a detailed, real-time view of the data.
a. Data analysts
b. Data scientists
c. Data managers
d. Software interrupts

5. _____ refers to the process of storing data in a central repository.
a. Program
b. TCP
c. Data warehousing
d. Provision

6. The process of moving data from one system to another in a one-time event is known as _____
a. Data migration

b. Movement of data
c. Shifting data
d. Dynamic data

Answers: 1-b, 2-c, 3-a, 4-a, 5-c, 6-a

5.9 REFERENCES

Reference books
 George Coulouris, Jean Dollimore, Tim Kindberg, "Distributed Systems: Concepts and Design" (4th Edition), Addison Wesley/Pearson Education.
 Pradeep K Sinha, "Distributed Operating Systems: Concepts and Design", IEEE Computer Society Press.

Text Book References
 M.R. Bhujade, "Parallel Computing", 2nd edition, New Age International Publishers, 2009.
 Andrew S. Tanenbaum and Maarten Van Steen, "Distributed Systems: Principles and Paradigms", 2nd edition, Pearson Education, Inc., 2007, ISBN: 0-13-239227-5.

Websites
 https://www.geeksforgeeks.org/introduction-to-parallel
 https://hpc.llnl.gov/training/tutorials/introduction-parallel-computing-tutorial
 https://www.javatpoint.com/what-is-parallel-computing

UNIT 6 - DISTRIBUTED SYSTEMS 1

STRUCTURE
6.0 Learning Objectives
6.1 Introduction to parallel computing
6.2 Types of distributed systems
6.3 Advantages of distributed systems
6.4 Disadvantages of distributed systems
6.5 Goals of distributed system
6.6 Distributed system models
6.7 Hardware concepts
6.8 Software concepts
6.9 Summary
6.10 Keywords
6.11 Learning Activity
6.12 Unit End Questions
6.13 References

6.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
 Explain the basics of parallel computing
 Outline the importance of parallel computing
 Describe the architecture of parallel computing
 List the advantages and disadvantages of parallel computing

6.1 INTRODUCTION TO PARALLEL COMPUTING

In its most basic form, a distributed system is a collection of computers that work together to appear to the end user as a single computer. These computers share a common state, run in parallel, and can fail without affecting the overall system's uptime.

Let's make a database! Traditional databases are maintained on the filesystem of a single machine, and you must communicate directly with that machine to retrieve or insert data.

Figure 6.1 Distributed operating system

We would have to run this database system on numerous machines at the same time if we wanted to share it: if a user inserts a record into node #1, node #3 must be able to return that record. A distributed system is made up of numerous nodes which are physically separate but connected via a network. All of the nodes in such a system communicate with one another and work together to complete tasks, and a small portion of the distributed operating system software runs on each of these nodes.

The explosion of networked workstations and the demise of the centralised mainframe have been the most dramatic transformation in information technology in the previous two decades. This shift has given end-users access to more computing power and distributed hardware resources. When computers are connected via a network, the network can pool the power of all connected computers to complete difficult tasks. There are two styles of computation in networks of processing nodes: centralised and distributed. In a centralised solution, one node is selected as the compute node, which runs the entire programme locally, and this central system is shared by all users at all times; as a result, there is a single point of control and a single point of failure. The availability of low-cost, high-performance processors and network technologies is driving the emergence of decentralised computing. When several powerful computers are linked together and communicate with one another, the total computing power available is astounding, and such a system can outperform a single supercomputer. Distributed computing, a decentralised computing strategy, has the potential to harness this pooled computing power.
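The requirement that a record inserted at one node be readable from another can be sketched in Python. The toy version below simply writes synchronously to every replica and ignores the failure handling and consistency protocols that real distributed databases need; the node count and the key and value used are illustrative assumptions.

    # A toy replicated key-value store: a write to any node is copied to all
    # replicas, so a later read can be served by any of them. This ignores
    # failures, ordering and consistency - it only illustrates the idea.
    class Node:
        def __init__(self, name):
            self.name = name
            self.store = {}

    nodes = [Node(f"node{i}") for i in range(1, 4)]   # node1..node3 (assumed count)

    def put(key, value):
        for node in nodes:                 # synchronous full replication
            node.store[key] = value

    def get(key, node):
        return node.store.get(key)

    put("user:42", "Ada")                  # insert via "node #1"
    print(get("user:42", nodes[2]))        # read from node #3 -> 'Ada'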

