EXECUTIVE M. TECH IN BLOCKCHAIN AND BIG DATA
FIRST SEMESTER
DBMS, DESIGN & IMPLEMENTATION
Preface
CONTENTS
UNIT - 2: Parallel Database
UNIT - 2: PARALLEL DATABASE STRUCTURE
2.0 Learning Objectives
2.1 Introduction
2.2 Architectural Pattern
2.2.1 Client-Server Architectural Pattern
2.2.2 Layered Architectures
2.2.3 Microservices Architecture
2.2.4 Event-Driven Architecture (EDA)
2.2.5 MVC (Model-View-Controller)
2.3 Data Partitioning Strategies
2.3.1 Horizontal Partitioning
2.3.2 Vertical Partitioning
2.3.3 Key-Based Partitioning
2.3.4 Hash-Based Partitioning
2.3.5 Round-Robin Data Partitioning
2.4 Interquery and Intraquery Parallelism
2.4.1 Intraquery Parallelism
2.4.1.1 Shared-Memory Parallelism
2.4.1.2 Message-Passing Parallelism
2.4.1.3 Data Flow Parallelism
2.4.2 Interquery Parallelism
2.5 Parallel Query Optimization
2.5.1 Parallel Decomposition in Parallel Optimization
2.5.2 Parallel Enumeration in Parallel Optimization
2.5.3 Parallel Search in Parallel Optimization
2.5.4 Parallel Query Optimization
2.6 Summary
2.7 Keywords
2.8 Learning Activity
2.9 Unit End Questions
2.10 References

2.0 LEARNING OBJECTIVES

After studying this unit, you will be able to:
• Describe architectural patterns and data partitioning strategies
• Identify the scope of interquery and intraquery parallelism
• State the need for and importance of parallel query optimization

2.1 INTRODUCTION

Architecture and data partitioning strategies are crucial components of designing scalable and efficient software systems. The architecture of a system refers to its overall structure, including the components that make up the system and how they interact with each other. Data partitioning strategies, on the other hand, are techniques used to divide large datasets into smaller, more manageable subsets that can be processed more efficiently.

2.2 ARCHITECTURAL PATTERN

An architectural pattern is a general, reusable solution to a recurring problem in software architecture design. It is a high-level design pattern that defines the overall structure and organization of a software system, including its components, relationships, and interactions. An architectural pattern provides a blueprint for the design of software systems and helps to ensure that they are scalable, maintainable, and efficient. There are several popular architectural patterns used in software systems, including:
• Layered Architecture: In this pattern, the system is organized into layers, with each layer responsible for a specific function. The layers are arranged in a hierarchical structure, with each layer dependent on the layer below it. This architecture pattern promotes separation of concerns and makes it easier to manage and maintain the system.
• Client-Server Architecture: In this pattern, the system is divided into two parts, the client and the server. The client is responsible for user interface and presentation, while
the server provides data and services. This pattern is widely used in distributed systems and enables communication between different parts of the system.
• Microservices Architecture: In this pattern, the system is broken down into a set of small, independent services that communicate with each other using APIs. Each service is responsible for a specific function and can be developed, deployed, and maintained independently. This pattern promotes modularity, scalability, and flexibility.
• Model-View-Controller (MVC) Architecture: In this pattern, the system is divided into three parts: the model, the view, and the controller. The model represents the data and business logic, the view represents the user interface, and the controller mediates between the two. This pattern separates the concerns of the system and makes it easier to develop and maintain.
• Event-Driven Architecture: In this pattern, the system is designed to respond to events in real time. Events trigger actions in the system, enabling real-time processing and analysis of data. This pattern is useful in systems where data processing and analysis are critical, such as financial trading systems and sensor networks.

2.2.1 Client-Server Architectural Pattern

The Client-Server architectural pattern is a widely used design pattern in software systems. In this pattern, the system is divided into two parts: the client and the server. The client is responsible for user interface and presentation, while the server provides data and services. The client sends requests to the server for data or services, and the server responds with the requested information. The communication between the client and server can be synchronous or asynchronous, depending on the system requirements. The benefits of the client-server architectural pattern include:
• Scalability: The client-server pattern allows for easy scaling of the system.
As the number of clients increases, additional servers can be added to the system to handle the load.
• Separation of concerns: The pattern separates the presentation layer from the data layer, making it easier to manage and maintain the system.
• Centralized data management: The server acts as a central point for data management, ensuring that data is consistent and up to date across all clients.
• Security: The pattern enables secure communication between the client and server, ensuring that sensitive data is protected.
• Platform independence: The client and server can be developed on different platforms, making it easier to build and deploy the system across different devices and platforms.
Some common examples of client-server architecture include web applications, email systems, and file-sharing systems. The pattern is well suited to systems that require centralized data management, secure communication, and scalability. However, it may not be suitable for systems that require real-time processing or complex data processing, as the server may become a bottleneck.

2.2.2 Layered Architectures

Layered architecture is a popular design pattern in software systems. In this pattern, the system is organized into layers, with each layer responsible for a specific function. The layers are arranged in a hierarchical structure, with each layer dependent on the layer below it. The most common layers in a layered architecture are:
• Presentation Layer: This layer is responsible for handling user interaction and displaying information to the user. It includes user interface components such as buttons, forms, and menus.
• Business Layer: This layer contains the business logic of the system. It defines the rules and policies that govern the system's behavior and handles complex computations and data processing.
• Data Access Layer: This layer provides access to the data storage system, such as a database or file system. It manages the communication between the business layer and the data storage system.
The benefits of the layered architecture pattern include:
• Separation of concerns: Each layer is responsible for a specific function, promoting separation of concerns and making it easier to manage and maintain the system.
• Modular design: The layered architecture promotes a modular design, making it easier to add or remove functionality from the system.
• Reusability: The layered architecture promotes reusability of code, as each layer can be used independently of the others.
• Flexibility: The layered architecture is flexible and can be adapted to different types of systems, making it suitable for a wide range of applications.
Some common examples of layered architecture systems include enterprise applications, web applications, and desktop applications. The pattern is well suited to systems
that require separation of concerns, modularity, and flexibility. However, it may not be suitable for systems that require real-time processing or complex data processing, as the layered architecture may introduce additional processing overhead.

2.2.3 Microservices Architecture

Microservices architecture is a design pattern in software systems that structures the application as a collection of small, independent services that communicate with each other using APIs. Each service is responsible for a specific function and can be developed, deployed, and maintained independently of the other services. The benefits of the microservices architecture pattern include:
• Scalability: The microservices architecture allows for easy scaling of the system. Each service can be scaled independently based on its usage and demand, allowing the system to handle large volumes of traffic.
• Modularity: The microservices architecture promotes modularity and separation of concerns. Each service is responsible for a specific function, making it easier to manage and maintain the system.
• Flexibility: The microservices architecture is flexible and can be adapted to different types of systems, making it suitable for a wide range of applications.
• Resilience: The microservices architecture promotes resilience by allowing services to fail independently of the rest of the system. This ensures that the system can continue to function even if some services are not available.
• Continuous Delivery: The microservices architecture enables continuous delivery by allowing each service to be developed, deployed, and tested independently of the other services.
Some common examples of microservices architecture systems include e-commerce websites, social networking applications, and financial trading systems. The pattern is well suited to systems that require modularity, scalability, flexibility, and resilience.
However, it may introduce additional complexity to the system and require additional effort in managing the communication between the services.

2.2.4 Event-Driven Architecture (EDA)

Event-driven architecture is a software architecture pattern that promotes the production, detection, consumption of, and reaction to events. In an event-driven architecture, components exchange events asynchronously and communicate via event channels or message
queues. Events can be any kind of change in the state of a system, such as a user submitting a form, a sensor reading, or a database record being updated. When an event occurs, it triggers one or more actions, which can be performed by any component that is subscribed to that event.
Event-driven architecture has several benefits, such as scalability, flexibility, and decoupling of components. By breaking down applications into smaller, more modular components, it allows for easier development and maintenance, as well as better fault tolerance and resilience.
Common technologies for building event-driven systems include message brokers like Apache Kafka and RabbitMQ, as well as serverless computing platforms like AWS Lambda and Google Cloud Functions. These technologies allow developers to build event-driven systems quickly and easily, without having to worry about infrastructure and scaling issues.
Overall, event-driven architecture is a powerful pattern that can help developers build more flexible and scalable systems, which can react quickly to changing conditions and provide better user experiences.

2.2.5 MVC (Model-View-Controller)

MVC is a software architecture pattern commonly used in the development of web applications. It separates an application into three interconnected components: Model, View, and Controller.
• Model: The Model is responsible for managing the data of the application. It represents the application's data and logic, including data validation, business rules, and database interactions.
• View: The View is responsible for displaying the data to the user. It is a user interface that presents the data in a visually appealing way. It receives input from the user and sends it to the Controller for further processing.
• Controller: The Controller is responsible for processing the user's input and controlling the flow of the application.
It interacts with both the Model and the View to ensure that the user's input is correctly processed and the appropriate response is displayed.
The key benefits of the MVC architecture pattern are:
• Separation of concerns: The MVC architecture separates the application into three distinct components, each with its own responsibilities. This separation of concerns makes the application easier to develop, test, and maintain.
• Reusability: Each component in the MVC architecture can be reused in other applications. For example, the same Model component can be used in multiple applications without any modification.
• Scalability: The MVC architecture allows the application to be scaled easily. Since the components are loosely coupled, it is possible to add or remove components without affecting the other components.
Overall, the MVC architecture pattern is a popular choice for building web applications because of its flexibility, scalability, and ease of maintenance.

2.3 DATA PARTITIONING STRATEGIES

Data partitioning is the process of dividing a large dataset into smaller, more manageable pieces, with the goal of improving the performance and scalability of processing. Here are some common data partitioning strategies:
• Horizontal partitioning (sharding): In this strategy, the dataset is divided into smaller subsets based on a particular attribute or criterion. Each subset is then stored on a separate node or server. This approach is useful when the data can be easily divided into smaller, independent units.
• Vertical partitioning (columnar partitioning): In this strategy, the dataset is divided into smaller subsets based on the columns in the dataset. Each subset contains a subset of columns, with each subset being stored on a separate node or server. This approach is useful when the dataset has a large number of columns, but only a small number of them are frequently used in queries.
• Key-based partitioning (range partitioning): In this strategy, the dataset is partitioned based on a specific key or range of keys. Each partition contains a range of keys and the data associated with those keys. This approach is useful when the data is sorted or indexed by a specific key, such as a timestamp or a customer ID.
• Hash-based partitioning: In this strategy, the dataset is partitioned based on a hash function that assigns each record to a specific partition.
The hash function distributes the records evenly across the available partitions, which can improve performance by reducing data skew. This approach is useful when the data is not naturally sorted or indexed by a specific key.
• Round-robin partitioning: In this strategy, the dataset is partitioned in a round-robin fashion across the available nodes. Each record is assigned to the next available partition in a circular fashion. This approach is useful when the data is not naturally sorted or indexed by a specific key, and the goal is to balance the data evenly across all partitions.
Each of these data partitioning strategies has its own advantages and limitations, and the choice of strategy depends on the specific requirements of the application and the dataset being processed.

2.3.1 Horizontal Partitioning

Horizontal partitioning, also known as sharding, is a technique used in database architecture where large databases are divided into smaller, more manageable parts called shards. Each shard contains a subset of the database's data, and all shards together make up the entire database.
In horizontal partitioning, data is partitioned based on a specific criterion, such as geographic location, customer ID, or date. For example, if a company has a large customer database, it might shard the database based on the first letter of the customer's last name, so that all customers with last names starting with A-F would be in one shard, G-L in another shard, and so on.
The benefits of horizontal partitioning include improved performance, scalability, and availability. By dividing a large database into smaller, more manageable parts, queries can be processed faster because the data is more distributed. Also, if a server fails, only the data in that specific shard is affected, and the rest of the database remains available.
However, horizontal partitioning can also have some downsides, such as increased complexity in database design and the potential for data inconsistency across shards. Additionally, certain types of queries may not be possible or may require more complex implementation due to the partitioning of the data.
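The last-name routing described above can be sketched as a small shard-lookup function. This is a minimal illustration; the shard names and letter ranges are assumptions for the example, not part of any particular database product.

```python
# Illustrative sketch: route customer rows to shards by the first
# letter of the last name, mirroring the A-F / G-L example above.
SHARD_RANGES = {
    "shard_1": ("A", "F"),   # last names A-F
    "shard_2": ("G", "L"),   # last names G-L
    "shard_3": ("M", "R"),   # last names M-R
    "shard_4": ("S", "Z"),   # last names S-Z
}

def shard_for(last_name: str) -> str:
    """Return the shard that should hold this customer's row."""
    first = last_name[0].upper()
    for shard, (lo, hi) in SHARD_RANGES.items():
        if lo <= first <= hi:
            return shard
    raise ValueError(f"no shard covers last name {last_name!r}")

print(shard_for("Brown"))   # shard_1
print(shard_for("Khan"))    # shard_2
```

In a real system this routing logic lives in a middleware layer or in the database itself, and queries that span several shards (for example, a count over all customers) must fan out to every shard and combine the results.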
2.3.2 Vertical Partitioning

Vertical partitioning, also known as columnar partitioning or vertical sharding, is a database partitioning technique that divides a table into smaller tables vertically by columns, rather than horizontally by rows.
In other words, vertical partitioning involves breaking a table into smaller tables that contain a subset of the original table's columns. This technique is commonly used in distributed database systems to improve query performance by reducing the amount of data that needs to be accessed and transmitted over the network.
For example, if a table contains customer information such as name, address, phone number, and order history, vertical partitioning may involve splitting the table into two tables, one with customer information and the other with order history information.
One advantage of vertical partitioning is that it can improve the efficiency of queries that only need to access a subset of the columns in a table. However, it can also create additional complexity in the database schema and application code, as well as increase the potential for data consistency issues.

2.3.3 Key-Based Partitioning

Key-based partitioning, also known as range partitioning, is a database partitioning technique that divides a table into smaller tables based on the values of one or more columns in the table, usually the primary key. In key-based partitioning, each partition contains a subset of the table's rows based on the range of values in the specified key column(s).
For example, if a table contains customer orders with a primary key of order ID, a range partitioning scheme may divide the table into partitions based on the range of order IDs, such as orders with order IDs between 1 and 100 in one partition, and orders with order IDs between 101 and 200 in another partition.
Key-based partitioning is commonly used in distributed database systems to improve query performance by allowing queries to be executed on smaller subsets of data. It also provides scalability by allowing new partitions to be added as the amount of data in the table grows.
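The order-ID range scheme described above can be sketched as a lookup from key to partition index. The specific bounds here follow the 1-100 / 101-200 example and are illustrative assumptions only.

```python
import bisect

# Illustrative range-partitioning scheme for order IDs.
# Upper bounds are inclusive: partition 0 holds IDs 1-100,
# partition 1 holds 101-200, partition 2 holds 201-300.
PARTITION_BOUNDS = [100, 200, 300]

def partition_for(order_id: int) -> int:
    """Map an order ID to the index of the range partition that holds it."""
    idx = bisect.bisect_left(PARTITION_BOUNDS, order_id)
    if idx == len(PARTITION_BOUNDS):
        raise ValueError(f"order ID {order_id} is beyond the last partition")
    return idx

print(partition_for(42))    # 0  (IDs 1-100)
print(partition_for(150))   # 1  (IDs 101-200)
```

Because the bounds are sorted, a binary search (`bisect`) finds the right partition in logarithmic time, which is how real range-partitioned systems prune partitions during query planning.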
One potential disadvantage of key-based partitioning is that it can lead to data skew if the key column(s) are not uniformly distributed, resulting in some partitions containing
significantly more data than others. This can lead to performance issues and require additional measures to balance the workload across partitions.

2.3.4 Hash-Based Partitioning

Hash-based partitioning is a technique used in distributed computing and database systems to partition data across multiple nodes in a cluster based on a hash function. In this technique, data is partitioned into multiple partitions based on the output of a hash function applied to a partition key. The hash function maps the partition key to a hash value, which is used to determine the partition in which the data should be stored. Each partition is typically stored on a separate node in the cluster, allowing for parallel processing of data.
One advantage of hash-based partitioning is that it can help evenly distribute data across partitions. This can be particularly useful in scenarios where data is not uniformly distributed across the partition key, as it can help prevent hotspots that could lead to performance issues.
Hash-based partitioning is commonly used in distributed databases and data warehousing systems, where it enables scalable and efficient processing of large datasets across multiple nodes in a cluster.

2.3.5 Round-Robin Data Partitioning

Round-robin data partitioning is a technique used in distributed computing systems for dividing a large dataset into smaller partitions and distributing them among multiple processing nodes or machines. Each partition is assigned to a processing node in a cyclic manner.
For example, suppose there are four processing nodes and a dataset consisting of 12 records. The dataset is divided into four partitions of three records each, and each partition is assigned to a processing node in a round-robin manner. The first partition is assigned to processing node 1, the second partition to processing node 2, the third
partition to processing node 3, and the fourth partition to processing node 4. Then, the next partition is assigned to processing node 1, and the cycle continues until all partitions are processed.
Round-robin data partitioning is commonly used in distributed computing systems such as Hadoop and Spark for parallel processing of large datasets. It ensures that each processing node gets an equal share of the data, which can improve performance and reduce processing time. It also provides fault tolerance, as the data can be replicated across multiple nodes to prevent data loss in case of node failures.

2.4 INTERQUERY AND INTRAQUERY PARALLELISM

Interquery and intraquery parallelism are two techniques used in parallel computing to improve the performance of database queries and other data processing tasks.
Intraquery parallelism is the technique of breaking down a single query into smaller parts and processing them concurrently on multiple processors or cores within a single machine. This approach is used to speed up the processing of complex queries that involve large amounts of data. For example, a query that involves sorting a large dataset can be broken down into smaller parts, with each part being processed concurrently on a separate processor or core.
Interquery parallelism is the technique of processing multiple independent queries simultaneously on separate processors or cores within a single machine or across multiple machines in a distributed computing environment. This approach is used to improve the overall throughput of the system by processing multiple queries concurrently. For example, a database server might process multiple SELECT statements concurrently, with each query being processed on a separate processor or core.
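The interquery pattern just described can be sketched with a thread pool running several independent "queries" at once. The in-memory tables and query functions below are stand-ins for SQL against a real server; only the concurrency shape is the point of the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in "tables" (illustrative data, not a real database).
orders = [{"id": i, "total": i * 10} for i in range(1, 101)]
users = [{"id": i, "active": i % 2 == 0} for i in range(1, 51)]

# Three unrelated queries, as a server might receive from different clients.
def q_total_revenue():
    return sum(row["total"] for row in orders)

def q_active_users():
    return sum(1 for row in users if row["active"])

def q_max_order():
    return max(row["id"] for row in orders)

# Interquery parallelism: each query runs on its own worker,
# improving overall throughput without changing any single query.
with ThreadPoolExecutor(max_workers=3) as pool:
    futures = [pool.submit(q) for q in (q_total_revenue, q_active_users, q_max_order)]
    results = [f.result() for f in futures]

print(results)  # [50500, 25, 100]
```

Note that the queries never coordinate with each other; that independence is what distinguishes interquery parallelism from the intraquery techniques discussed next.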
2.4.1 Intraquery Parallelism

Intraquery parallelism is a technique used in parallel computing to break down a single query into smaller parts and process them concurrently on multiple processors or cores within a single machine. This technique is used to speed up the processing of complex queries that involve large amounts of data.
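The parallel-sort example mentioned above can be sketched as follows: the input is split into chunks, each chunk is sorted by its own worker, and the sorted runs are merged at the end. Threads are used here purely for brevity; a real engine would spread CPU-bound work across cores or nodes.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

# One logical query ("sort this dataset") split into parallel parts.
data = random.sample(range(100_000), 10_000)
n_parts = 4
chunks = [data[i::n_parts] for i in range(n_parts)]   # partition the input

# Each worker sorts its own chunk concurrently.
with ThreadPoolExecutor(max_workers=n_parts) as pool:
    sorted_runs = list(pool.map(sorted, chunks))

# Merge the sorted runs into the final answer (an n-way merge).
result = list(heapq.merge(*sorted_runs))
assert result == sorted(data)
```

The partition / process-in-parallel / combine shape is the same whether the operator is a sort, a filter, or an aggregation.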
Intraquery parallelism can be applied to various parts of a query, such as sorting, filtering, and aggregating data. The basic idea is to partition the data into smaller subsets and process each subset on a separate processor or core. There are several approaches to implementing intraquery parallelism, including:
• Shared-memory parallelism: In this approach, multiple processors or cores share the same memory and can access the same data simultaneously. Each processor or core is assigned a subset of the data to process, and they communicate with each other through shared memory. This approach is commonly used in multi-core processors and can achieve high levels of parallelism.
• Message-passing parallelism: In this approach, multiple processors or cores communicate with each other through message passing. Each processor or core is assigned a subset of the data to process, and they exchange messages to coordinate their processing. This approach is commonly used in distributed computing environments and can scale to large numbers of processors or cores.
• Dataflow parallelism: In this approach, the data is partitioned into smaller subsets, and each subset is processed by a separate processing unit. The processing units are connected in a pipeline, and the data flows through the pipeline from one unit to the next. This approach can achieve high levels of parallelism and is commonly used in streaming data processing applications.
Intraquery parallelism can significantly improve the performance of database queries and other data processing tasks by utilizing the processing power of multiple processors or cores. However, it requires careful consideration of data partitioning, load balancing, and communication overhead to ensure optimal performance.

2.4.1.1 Shared-Memory Parallelism

Shared-memory parallelism is a type of parallel computing where multiple processors or cores share a common memory space and can access the same data simultaneously.
In shared-memory parallelism, each processor or core is assigned a portion of the data to process, and they communicate with each other through shared memory. Shared-memory parallelism is commonly used in multi-core processors, where multiple processing units are integrated on a single chip. In this case, each core can access the
same memory and can work on different portions of the data simultaneously, which can significantly improve the performance of data-intensive tasks.
One of the main advantages of shared-memory parallelism is its simplicity and ease of programming. Since all processors or cores can access the same memory, it is easy to share data and communicate between them. This makes it easier to implement parallel algorithms and applications.
However, shared-memory parallelism also has some limitations. As the number of processors or cores increases, contention for shared memory can become a bottleneck, which can limit the scalability of the system. Additionally, the performance of shared-memory parallelism is highly dependent on the memory hierarchy and the caching behavior of the system, which can affect the performance of different applications and algorithms.
Overall, shared-memory parallelism is a powerful technique for improving the performance of data-intensive tasks on multi-core processors and can be used in various applications such as scientific simulations, data analytics, and machine learning.

2.4.1.2 Message-Passing Parallelism

Message-passing parallelism is a type of parallel computing where multiple processors or cores communicate with each other by exchanging messages. In message-passing parallelism, each processor or core has its own memory and works on a portion of the data. When data needs to be shared between processors or cores, messages are sent between them to transfer the necessary data.
Message-passing parallelism is commonly used in distributed computing environments, where multiple machines are connected by a network. In this case, each machine can work on a portion of the data and communicate with other machines by exchanging messages over the network.
One of the main advantages of message-passing parallelism is its scalability. Since each processor or core has its own memory, the system can scale to a large number of
processors or cores without running into memory contention issues. Additionally, message-passing parallelism can be used in a wide range of applications, including scientific simulations, data analytics, and machine learning.
However, message-passing parallelism can be more difficult to program than shared-memory parallelism, as it requires explicit communication between processors or cores using message-passing libraries such as MPI (Message Passing Interface). Additionally, message-passing parallelism can be sensitive to network latency and bandwidth, which can impact the performance of the system.
Overall, message-passing parallelism is a powerful technique for improving the performance of data-intensive tasks in distributed computing environments and can be used in a wide range of applications. However, it requires careful consideration of communication overhead and load balancing to ensure optimal performance.

2.4.1.3 Data Flow Parallelism

Data flow parallelism is a type of parallel processing where computations are performed on different data elements simultaneously. In this approach, data is split into smaller pieces and processed independently by different processors or threads in parallel. The output from each processor is combined to produce the final result.
The key idea behind data flow parallelism is to exploit the inherent parallelism in the data rather than the control flow of the program. In other words, the program is designed to work with data elements that can be processed independently, rather than using sequential steps to process the data.
Data flow parallelism is commonly used in applications such as image processing, signal processing, and scientific simulations. It can improve the performance of these applications by reducing the time required to process large amounts of data.
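A pipeline of processing units connected by channels, as described above, can be sketched with two stage threads joined by queues. The stage functions (square, then increment) are arbitrary placeholders; the point is that records flow through the stages as they arrive, so both stages can be busy at once.

```python
import queue
import threading

# A sentinel marks the end of the data stream.
SENTINEL = object()

def stage(fn, inbox, outbox):
    """One pipeline stage: apply fn to each record flowing through."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)   # propagate end-of-stream downstream
            return
        outbox.put(fn(item))

# Wire up two stages: square, then add one.
q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
threading.Thread(target=stage, args=(lambda x: x * x, q_in, q_mid)).start()
threading.Thread(target=stage, args=(lambda x: x + 1, q_mid, q_out)).start()

# Feed records into the pipeline.
for x in range(5):
    q_in.put(x)
q_in.put(SENTINEL)

# Collect the output until end-of-stream.
results = []
while (item := q_out.get()) is not SENTINEL:
    results.append(item)
print(results)  # [1, 2, 5, 10, 17]
```

Because each stage reads from a FIFO queue on its own thread, record order is preserved while stage 2 processes record n as stage 1 is already working on record n+1.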
One popular approach to implementing data flow parallelism is through the use of data parallelism, where multiple processors or threads work on different parts of the data simultaneously. Another approach is through the use of task parallelism, where different processors or threads work on different tasks, which may or may not operate
on the same data. Both approaches can be used in combination to achieve even greater levels of parallelism.

2.4.2 Interquery Parallelism

Interquery parallelism is a technique for improving the performance of a database system by executing multiple queries in parallel. This approach is particularly useful for systems that handle large volumes of data and complex queries, as it can significantly reduce the overall execution time.
In interquery parallelism, each query is assigned to a different processor or thread, and the queries are executed simultaneously. The results from each query are then combined to produce the final output. This approach can be implemented in different ways, depending on the database system and the available hardware.
One common approach to implementing interquery parallelism is through the use of parallel database systems. These systems are designed to support parallel processing of multiple queries, and typically use a shared-nothing architecture. In this architecture, each processor has its own memory and disk storage, and the data is partitioned across the processors. This allows each processor to work independently on a subset of the data, and the results are combined at the end.
Another approach to implementing interquery parallelism is through the use of query optimization techniques, such as parallel query execution and pipelined execution. Parallel query execution is a technique that divides a query into smaller sub-queries, each of which can be executed in parallel. Pipelined execution is a technique that divides a query into stages, each of which can be executed in parallel. These techniques can be used to identify opportunities for parallelism in a query and optimize its execution accordingly.
Interquery parallelism can provide significant performance benefits for database systems that handle large volumes of data and complex queries.
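The parallel query execution technique mentioned above (one sub-query per data partition, combined at the end) can be sketched as follows. The partitioned lists stand in for shared-nothing nodes, and the query itself, a filtered count, is an illustrative assumption.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of:  SELECT COUNT(*) FROM orders WHERE total > 500
# split into one sub-query per partition, then combined.
# Four partitions of 50 rows each stand in for shared-nothing nodes.
partitions = [
    [{"id": i, "total": i * 7} for i in range(start, start + 50)]
    for start in (0, 50, 100, 150)
]

def sub_query(partition):
    """Evaluate the filter and count on one partition only."""
    return sum(1 for row in partition if row["total"] > 500)

# Each sub-query scans only its own partition, in parallel.
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = pool.map(sub_query, partitions)

# Combine step: partial counts sum to the final answer.
total = sum(partial_counts)
print(total)  # 128
```

The combine step depends on the operator: counts and sums add, a MAX takes the maximum of partial maxima, and a sort requires a merge of sorted runs.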
However, it also requires careful planning and management to ensure that the parallel execution of queries does not result in conflicts or inconsistencies in the data. Additionally, the performance gains from interquery parallelism may be limited by factors such as
network bandwidth and processor speed, and may require significant hardware investments.

2.5 PARALLEL QUERY OPTIMIZATION

Parallel query optimization is a technique used in database systems to optimize the execution of queries by taking advantage of parallel processing. In this approach, the query optimization process is divided into multiple parallel tasks that can be executed simultaneously on multiple processors or threads.
The main goal of parallel query optimization is to reduce the time required to optimize a query, especially for complex queries that involve large amounts of data. By using parallelism, the query optimizer can explore different optimization strategies in parallel and identify the most efficient execution plan in a shorter amount of time.
There are several techniques that can be used to implement parallel query optimization, including:
• Parallel decomposition: This involves dividing the query into smaller sub-queries that can be executed in parallel. Each sub-query is optimized separately, and the results are combined to produce the final output.
• Parallel enumeration: This involves generating multiple query plans in parallel and evaluating them to identify the most efficient one. Each query plan is evaluated on a different processor or thread, and the results are combined to produce the final output.
• Parallel search: This involves searching the space of possible query plans in parallel. Multiple processors or threads explore different parts of the search space simultaneously, and the results are combined to produce the final output.

2.5.1 Parallel Decomposition in Parallel Optimization
Parallel decomposition is a technique used in parallel optimization to break down a large optimization problem into smaller sub-problems that can be solved concurrently. Each sub-problem is solved on a separate processor or core, and the results are combined to obtain the solution to the overall problem.

There are several ways to perform parallel decomposition. One approach is to decompose the problem into independent sub-problems that can be solved simultaneously. Another is to decompose it into smaller sub-problems that are solved sequentially, but in parallel with other sub-problems. The choice of decomposition technique depends on the structure of the problem and the available hardware resources; in some cases, a combination of both approaches gives the best performance.

Parallel decomposition can significantly reduce the time required to solve large optimization problems by leveraging the power of parallel computing. However, it requires careful management of communication and synchronization between the sub-problems to ensure that the results are accurate and consistent.

2.5.2 Parallel enumeration in Parallel Optimization

Parallel enumeration is a technique used in parallel optimization to explore multiple solutions simultaneously by distributing the search process across multiple processors or cores. It is particularly useful for discrete optimization problems, where the search space is finite and a brute-force search may be required to find the optimal solution.

In parallel enumeration, the search space is partitioned into multiple sub-spaces, and each sub-space is explored by a separate processor or core. Because the sub-spaces are searched in parallel, the time required to find the optimal solution can be reduced significantly, especially for large search spaces. There are several methods to perform parallel enumeration in parallel optimization.
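Whatever the method, the core pattern is the same: split the finite search space into chunks, enumerate each chunk on its own worker, and keep the best result found. A minimal Python sketch of this pattern follows; the objective function, the size of the search space, and the worker count are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def objective(x):
    # Hypothetical discrete objective to minimize; global optimum at x = 137.
    return (x - 137) ** 2

def best_in_chunk(chunk):
    # Brute-force enumeration of one partition of the search space.
    return min(chunk, key=objective)

search_space = range(1000)   # finite, discrete search space
n_workers = 4
chunk_size = len(search_space) // n_workers

# Equal-sized partitioning of the search space across the workers.
chunks = [search_space[i * chunk_size:(i + 1) * chunk_size] for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    local_bests = list(pool.map(best_in_chunk, chunks))

# Combine the per-chunk winners into the global optimum.
best = min(local_bests, key=objective)
print(best)  # → 137
```

For a CPU-bound objective, a process pool would typically replace the thread pool in CPython, since threads there share the global interpreter lock; the structure of the sketch is otherwise unchanged.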
One approach is to divide the search space into equal-sized sub-spaces and assign each sub-space to a separate processor or core. Another approach is to use a load-balancing strategy that distributes the workload among processors or cores dynamically.

Parallel enumeration can be combined with other optimization techniques, such as branch and bound or dynamic programming, to further improve the efficiency of the search process. However, it can be computationally intensive and may require significant communication and synchronization between processors or cores, especially for large search spaces. Overall, parallel enumeration is a powerful technique for solving discrete optimization problems and can provide significant performance gains when executed properly.

2.5.3 Parallel search in Parallel Optimization

Parallel search is a technique used in parallel optimization to explore multiple candidate solutions simultaneously, with the goal of finding the optimal solution faster. The search process is distributed across multiple processors or cores, allowing multiple candidate solutions to be evaluated concurrently.

There are several methods for performing parallel search. One approach is a divide-and-conquer strategy, in which the search space is partitioned into smaller sub-spaces and each sub-space is searched independently by a separate processor or core. Another is to use a parallel version of a search algorithm, such as depth-first or breadth-first search, in which each processor or core explores a different branch of the search tree.

Parallel search can significantly reduce the time required to find the optimal solution, especially for large search spaces. However, it requires careful management of communication and synchronization between processors or cores to ensure that the search process is efficient and the results are accurate and consistent.
Parallel search can be combined with other optimization techniques, such as heuristics or local search, to further improve the efficiency of the search process. However, its effectiveness depends on the characteristics of the problem being solved and the hardware resources available for parallel computing.
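One common form of that combination is parallel multi-start local search: each worker runs an independent local search (here, simple hill climbing over the integers) from a different starting point, and the best local optimum found is kept. The following Python sketch is purely illustrative; the objective function and starting points are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def f(x):
    # Hypothetical objective with two basins: a local minimum at x = 20
    # (value 5) and the global minimum at x = 80 (value 0).
    return min((x - 20) ** 2 + 5, (x - 80) ** 2)

def hill_climb(start):
    # Deterministic local search: repeatedly move to the better neighbor.
    x = start
    while True:
        step = min((x - 1, x + 1), key=f)
        if f(step) >= f(x):
            return x  # local optimum reached
        x = step

# Parallel search combined with local search: each worker explores
# independently from its own starting point (multi-start strategy).
starts = [0, 40, 60, 99]
with ThreadPoolExecutor(max_workers=len(starts)) as pool:
    local_optima = list(pool.map(hill_climb, starts))

best = min(local_optima, key=f)
print(best, f(best))  # → 80 0
```

Starts 0 and 40 get trapped in the local basin at 20, while starts 60 and 99 reach the global optimum at 80; running the searches in parallel costs no extra wall-clock time but greatly improves the chance of escaping poor local optima.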
2.6 SUMMARY

• Choosing the right architectural pattern depends on the specific requirements of the system, including its size, complexity, scalability, and performance needs. It is important to consider these factors when designing a software system to ensure that it meets the needs of its users and is easy to maintain and scale over time.

• Intraquery parallelism is focused on speeding up the processing of a single query, while interquery parallelism is focused on improving the overall throughput of the system by processing multiple independent queries concurrently. Both techniques can be used together to further improve the performance of database queries and other data processing tasks in parallel computing environments.

• Parallel query optimization can provide significant performance benefits for database systems that handle large volumes of data and complex queries. However, it requires careful management to ensure that the parallel optimization process does not introduce conflicts or inconsistencies, and its gains may be limited by factors such as network bandwidth and processor speed, possibly requiring significant hardware investment.

• Parallel search is a powerful technique for solving optimization problems, particularly those with large search spaces. It can provide significant performance gains when executed properly and can be combined with other optimization techniques to further improve efficiency.
2.7 KEYWORDS

• Data Partitioning: The process of dividing a large dataset into smaller, more manageable pieces, with the goal of improving the performance and scalability of processing.

• Intraquery Parallelism: The technique of breaking down a single query into smaller parts and processing them concurrently on multiple processors or cores within a single machine.

2.8 LEARNING ACTIVITY

1. Define data partitioning.
___________________________________________________________________________
___________________________________________________________________________

2. Differentiate between interquery and intraquery parallelism.
___________________________________________________________________________
___________________________________________________________________________

2.9 UNIT END QUESTIONS

A. Descriptive Questions

Short Questions
1. What is an architectural pattern? Explain its types.

Long Questions
1. Elaborate on the data partitioning techniques available to improve the performance of data processing.
2. Explain in detail the parallel query optimization techniques used in database systems to optimize the execution of queries.

2.10 REFERENCES

TEXT BOOKS:
1. Avi Silberschatz, Henry F. Korth, and S. Sudarshan, "Database System Concepts", 6th edition, McGraw Hill, 2010.
2. Ramez Elmasri and Shamkant B. Navathe, "Fundamentals of Database Systems", 7th edition, Addison Wesley, 2014.
REFERENCE BOOKS:
1. S. K. Singh, "Database Systems: Concepts, Design and Applications", 2nd edition, Pearson Education, 2011.
2. Joe Fawcett, Danny Ayers, and Liam R. E. Quin, "Beginning XML", 5th edition, Wiley India Private Limited, 2012.
3. Thomas M. Connolly and Carolyn Begg, "Database Systems: A Practical Approach to Design, Implementation, and Management", 6th edition, Pearson India, 2015.