EXECUTIVE M. TECH IN BLOCKCHAIN AND BIG-DATA
FIRST SEMESTER
DBMS, DESIGN & IMPLEMENTATION
CONTENT
UNIT - 3: Distributed Database
UNIT - 3: DISTRIBUTED DATABASE

STRUCTURE
3.0 Learning Objectives
3.1 Introduction
3.2 Features of Distributed Databases
3.3 Distributed Database Architecture
3.4 Fragmentation and Replication
3.5 Distributed Query Processing
3.6 Distributed Transactions Processing
3.7 Summary
3.8 Keywords
3.9 Learning Activity
3.10 Unit End Questions
3.11 References

3.0 LEARNING OBJECTIVES
After studying this unit, you will be able to:
• Describe distributed database architecture
• Explain the role of fragmentation and replication
• State the need for and importance of distributed transaction processing

3.1 INTRODUCTION
A distributed database is a database system that consists of multiple computers, or nodes, connected by a network. Each node stores a portion of the database, and the nodes work together to process queries and transactions. The data is spread across the nodes, typically by partitioning it, replicating it, or both.

Distributed databases are designed to provide several advantages over centralized databases, including improved performance, scalability, fault tolerance, and data availability. Distributing the database across multiple nodes spreads the workload, which allows queries to be processed faster. Additionally, if one node fails, the database can continue to function by accessing data stored on other nodes. This provides fault tolerance and keeps data available in the event of hardware or network failures.

Distributed databases are commonly used in large-scale applications where data needs to be accessed and processed quickly and reliably. Examples include e-commerce websites, social media platforms, and financial systems.

3.2 FEATURES OF DISTRIBUTED DATABASES
Distributed databases have several features that distinguish them from centralized databases. The main features are:
• Data distribution: Data is divided into smaller pieces and distributed across multiple nodes, allowing for better performance and scalability.
• Data replication: Data can be replicated across multiple nodes, providing redundancy and keeping data available even if some nodes fail.
• Distributed query processing: Queries can be processed in parallel across multiple nodes, allowing for faster query processing.
• Concurrency control: Distributed databases must implement mechanisms that let multiple users access and modify data simultaneously without causing conflicts or inconsistencies.
• Fault tolerance: Distributed databases are designed to continue functioning even if some nodes in the network fail.
• Security: Distributed databases must implement security measures to protect data from unauthorized access and ensure data privacy.
• Scalability: Distributed databases can be scaled by adding or removing nodes, allowing the database to handle large volumes of data and user requests.
• Consistency: Distributed databases must ensure that all nodes hold consistent copies of the data, even if multiple nodes are modifying the data at the same time.
• Availability: Distributed databases aim to keep data available to users even when some nodes are offline or unreachable.
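The replication, fault tolerance, and availability features listed above can be illustrated with a minimal sketch. It assumes a made-up in-memory `Node` class and a simulated outage; the names `Node`, `NodeDown`, and `read_with_failover` are hypothetical and not part of any real DBMS.

```python
class NodeDown(Exception):
    """Raised when a (simulated) node cannot be reached."""


class Node:
    """A toy node holding a replica of some key-value data (assumption)."""

    def __init__(self, name, data, online=True):
        self.name = name
        self.data = dict(data)
        self.online = online

    def read(self, key):
        if not self.online:
            raise NodeDown(self.name)
        return self.data[key]


def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful read."""
    for node in replicas:
        try:
            return node.name, node.read(key)
        except NodeDown:
            continue  # this replica is unreachable, try the next one
    raise NodeDown("no replica available for key %r" % key)


# Two replicas of the same data; the first one is simulated as failed.
replicas = [
    Node("node_a", {"order:1": "shipped"}, online=False),
    Node("node_b", {"order:1": "shipped"}),
]
served_by, value = read_with_failover(replicas, "order:1")
print(served_by, value)
```

The read succeeds because a second copy exists on another node. A real system would also have to keep the replicas consistent when data is updated, which this sketch does not attempt.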
3.3 DISTRIBUTED DATABASE ARCHITECTURE

3.3.1 Homogeneous database architecture
Homogeneous database architecture refers to a database system in which all components, including the hardware, software, and data types, are standardized across the entire system. In other words, all the databases in the system are the same in terms of structure, format, and processing capabilities.

Figure 1: Homogeneous Distributed Database
Src: https://www.filibeto.org/sun/lib/nonsun/oracle/11.1.0.6.0/B28359_01/server.111/b28310/ds_concepts001.htm

In a homogeneous database architecture, every site runs the same database management system (DBMS), and that DBMS is used throughout the entire organization. This keeps data consistent and makes it easy to access and share between different departments and applications.

One advantage of a homogeneous database architecture is that it simplifies the management and administration of the database system. Since all components are standardized, it is easier to configure, monitor, and maintain the system.

Another advantage is that it allows for better integration between different applications and systems within the organization. Because all the databases are the same, it is easier to share data and perform complex data analysis and reporting.

Homogeneous database architecture may not be suitable for all organizations, especially those that require specialized databases or applications that are not supported by the standardized system. Additionally, it may be more difficult to extend a homogeneous system, as adding a component that does not match the standard may require significant changes to the existing system.
3.3.2 Heterogeneous distributed database
A heterogeneous distributed database refers to a database system in which the components, including hardware, software, and data types, are not standardized and may vary across the nodes of the distributed system. The nodes in the network may use different DBMSs, operating systems, or hardware architectures.

A heterogeneous distributed database may be used when an organization has multiple, diverse systems that need to share data across different platforms. For example, a large corporation may have a mix of mainframe systems, UNIX servers, and Windows servers, each using its own DBMS. By creating a heterogeneous distributed database, the organization can create a unified view of its data, allowing applications running on different systems to access and share data.

The main challenge of a heterogeneous distributed database is the integration of data from different sources, which may have different data models, data types, and query languages. The distributed database must provide mechanisms for data mapping, data transformation, and data synchronization across the nodes in the network. This requires specialized middleware that can translate between the different systems and provide a unified view of the data.

Another challenge is ensuring data consistency and integrity across the network. Since the nodes may use different DBMSs, there may be differences in data representation, storage, and processing. Therefore, the distributed database must provide mechanisms for data consistency and concurrency control, such as distributed transactions and two-phase commit protocols.

Overall, a heterogeneous distributed database is more complex and difficult to manage than a homogeneous database system. However, it can provide significant benefits in terms of data integration, flexibility, and scalability, allowing organizations to share and access data across different platforms and systems.

3.3.3 Client Server Architecture
Client-server distributed database architecture divides the system into two main components: a client component and a server component. The client component is responsible for sending requests to the server component, while the server component is responsible for processing the requests and returning the results to the client.

In this architecture, the data is stored on the server side, and the clients connect to the server to access the data. The server manages the data and provides services to the clients, such as data storage, query processing, and transaction management.

This architecture provides several benefits, including:
• Scalability: Since the server is responsible for managing the data, it can be scaled up or down depending on the demand for the database.
• Centralized management: The server provides centralized management and control of the data, making it easier to ensure data consistency and security.
• Reduced network traffic: Because queries are processed on the server side, only the results need to be transmitted over the network rather than the underlying data.
• Improved performance: The server component can perform complex data processing and indexing, allowing for faster query processing.

However, there are also some drawbacks to the client-server distributed database architecture, such as:
• Single point of failure: If the server fails, the entire system may be affected, making it vulnerable to downtime and data loss.
• Limited flexibility: The client-server architecture may not be suitable for all types of applications, since it may be challenging to integrate with other systems or architectures.
• Security concerns: The centralized nature of the server component may make it a target for security attacks, making it essential to implement robust security measures.

Client-server distributed database architecture is a popular choice for applications that require centralized management and control of data, such as enterprise resource planning (ERP) systems or online transaction processing (OLTP) systems.
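The request and response pattern described above can be sketched in a few lines. This is an in-process illustration only: there is no real network, the `Server` and `Client` classes, the `accounts` table, and the `balances_above` operation are all hypothetical, and the rows are made-up sample data. It uses Python's standard `sqlite3` module as the server-side store.

```python
import sqlite3


class Server:
    """Holds the data and processes requests on behalf of clients."""

    def __init__(self):
        self.db = sqlite3.connect(":memory:")
        self.db.execute(
            "CREATE TABLE accounts (id INTEGER PRIMARY KEY, owner TEXT, balance REAL)"
        )
        # Made-up sample rows for illustration.
        self.db.executemany(
            "INSERT INTO accounts VALUES (?, ?, ?)",
            [(1, "a", 120.0), (2, "b", 15.5), (3, "c", 980.0)],
        )
        self.db.commit()

    def handle(self, request):
        # The filter runs on the server, so only matching rows are returned.
        if request["op"] == "balances_above":
            cursor = self.db.execute(
                "SELECT id, balance FROM accounts WHERE balance > ? ORDER BY id",
                (request["threshold"],),
            )
            return cursor.fetchall()
        raise ValueError("unknown operation: %r" % request["op"])


class Client:
    """Sends requests to the server and receives results."""

    def __init__(self, server):
        self.server = server

    def balances_above(self, threshold):
        return self.server.handle({"op": "balances_above", "threshold": threshold})


client = Client(Server())
rows = client.balances_above(100)
print(rows)
```

Only two of the three rows cross the client-server boundary here, which is the sense in which server-side processing reduces the data sent to the client.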
3.4 FRAGMENTATION AND REPLICATION
Fragmentation and replication are two common techniques used in distributed database systems to improve data availability, performance, and scalability.

Fragmentation refers to dividing a database into smaller subsets called fragments, which can be stored on different nodes of a distributed system. There are two main types of fragmentation: horizontal and vertical.

Horizontal fragmentation divides a table into subsets of rows based on a specific criterion, such as geographic location or customer segment. Each fragment contains a subset of the rows of the table, and the fragments can be stored on different nodes. Horizontal fragmentation can improve query performance since queries can be executed in parallel on different nodes, reducing the overall response time.

Vertical fragmentation divides a table into subsets of columns based on the data access patterns. Each fragment contains a subset of the columns of the table, and the fragments can be stored on different nodes. Vertical fragmentation can improve performance by reducing the amount of data that needs to be transmitted over the network.

Replication, on the other hand, involves creating multiple copies of data and storing them on different nodes of the distributed system. There are two main types of replication: full replication and partial replication.

Full replication creates a complete copy of the entire database on each node. This improves data availability, since if one node fails, the data can still be accessed from another node.

Partial replication creates copies of only a subset of the data on different nodes. This keeps frequently used data close to the sites that need it while avoiding the storage and update cost of copying everything.
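As a rough illustration of vertical fragmentation as described above, the sketch below splits a small table by columns and then rebuilds the original rows. The table contents, the column names, and the `emp_id` key are hypothetical sample data. The key is repeated in every fragment so that the rows can be rejoined.

```python
# Made-up sample table; emp_id is an assumed primary key.
employees = [
    {"emp_id": 1, "name": "Asha", "address": "Pune", "salary": 50000, "department": "Sales"},
    {"emp_id": 2, "name": "Ravi", "address": "Delhi", "salary": 62000, "department": "IT"},
]


def vertical_fragment(rows, key, columns):
    """Keep only the key and the requested columns of every row."""
    return [{key: row[key], **{c: row[c] for c in columns}} for row in rows]


def reconstruct(key, *fragments):
    """Rebuild full rows by joining the fragments on the shared key."""
    merged = {}
    for fragment in fragments:
        for row in fragment:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]


contact_fragment = vertical_fragment(employees, "emp_id", ["name", "address"])
payroll_fragment = vertical_fragment(employees, "emp_id", ["salary", "department"])

# A query that needs only names and addresses touches one fragment.
print(contact_fragment)

# Joining the fragments on emp_id gives back the original table.
print(reconstruct("emp_id", contact_fragment, payroll_fragment) == employees)
```

The design point is that each fragment is narrower than the original table, but a query needing columns from both fragments has to pay for the join.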
Replication and fragmentation can be used together to improve the performance and availability of a distributed database system. For example, a database may be horizontally fragmented into subsets of rows, and each fragment may be replicated on multiple nodes to improve both performance and availability.

3.4.1 Fragmentation

3.4.1.1 Horizontal Fragmentation
Horizontal fragmentation is a database design technique that divides a table into smaller subsets of rows based on a specific criterion, such as geographic location or customer segment. Each subset is called a fragment and can be stored on a different node in a distributed database system.

Horizontal fragmentation is used to improve query performance by allowing queries to be executed in parallel on different nodes. Since each fragment contains a subset of the rows of the table, each node can process its fragment independently and return the results to the client. This can significantly reduce the overall response time of the query.

For example, consider a customer database that contains customer information for a company with locations in multiple cities. A horizontal fragmentation approach may divide the customer table into subsets based on the city of the customer. Each fragment would contain customer information only for the customers located in that city. These fragments could be stored on different nodes, allowing queries to be executed in parallel and improving the overall query performance.

Horizontal fragmentation can be performed on both partitioned and non-partitioned tables. Partitioned tables are tables that are divided into multiple partitions based on a specific criterion, such as a range of values or a hash function. Horizontal partitioning is a specific type of partitioning that divides a table based on rows. In contrast, non-partitioned tables are tables that are not divided into partitions.

There are several advantages of horizontal fragmentation, including:
• Improved performance: Since queries can be executed in parallel on different nodes, horizontal fragmentation can significantly improve query performance.
• Reduced network traffic: By storing each fragment on the node where it is most often used, horizontal fragmentation can reduce the amount of data that needs to be transmitted over the network.
• Improved scalability: Horizontal fragmentation can improve scalability by allowing additional nodes to be added to the distributed database system as the data volume grows.

However, there are also some disadvantages to horizontal fragmentation, including:
• Increased complexity: Horizontal fragmentation can increase the complexity of database design and management, making it more difficult to maintain and modify the database. A query that spans several fragments must also gather and combine results from several nodes.
• Possible data redundancy: Disjoint fragments do not duplicate rows, but if fragments overlap or are replicated, the same rows are stored more than once. This can increase storage requirements and maintenance costs.
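The city-based example above can be sketched as follows. This is a minimal illustration under stated assumptions: the customer rows are made-up sample data, each fragment is just a Python list standing in for a node, and threads from the standard `concurrent.futures` module stand in for parallel execution on separate machines. The function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Made-up sample table.
customers = [
    {"id": 1, "city": "Mumbai", "orders": 3},
    {"id": 2, "city": "Delhi", "orders": 7},
    {"id": 3, "city": "Mumbai", "orders": 9},
    {"id": 4, "city": "Chennai", "orders": 1},
]


def horizontal_fragment(rows, column):
    """Group rows into fragments by the value of one column."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def query_fragment(fragment, predicate):
    """Work done locally by the node that stores one fragment."""
    return [row for row in fragment if predicate(row)]


def parallel_query(fragments, predicate):
    """Run the same query on every fragment and combine the partial results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(
            lambda fragment: query_fragment(fragment, predicate),
            fragments.values(),
        )
        return [row for part in partials for row in part]


fragments = horizontal_fragment(customers, "city")
busy = parallel_query(fragments, lambda row: row["orders"] > 2)
print(sorted(row["id"] for row in busy))
```

Each row lands in exactly one fragment, so the fragments are disjoint and their union is the original table. A query restricted to a single city would need to contact only the node holding that city's fragment.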
3.4.1.2 Vertical fragmentation
Vertical fragmentation is a database design technique that divides a table into smaller subsets of columns based on the data access patterns. Each subset is called a fragment and can be stored on a different node in a distributed database system.

Vertical fragmentation is used to improve performance by reducing the amount of data that needs to be transmitted over the network. Since each fragment contains a subset of the columns of the table, only the necessary columns are transmitted when a query is executed. This can significantly reduce network traffic and improve query performance.

For example, consider an employee database that contains information such as name, address, salary, and department. A vertical fragmentation approach may divide the employee table into subsets based on the access patterns. One fragment could contain name and address information, while another could contain salary and department information. Each fragment also keeps a row identifier, such as an employee ID, so that the original rows can be reconstructed by joining the fragments. These fragments could be stored on different nodes, allowing queries to be executed more efficiently by transmitting only the necessary columns.

Vertical fragmentation can be performed on both partitioned and non-partitioned tables. Partitioned tables are tables that are divided into multiple partitions based on a specific criterion, such as a range of values or a hash function. Vertical partitioning is a specific type of partitioning that divides a table based on columns. In contrast, non-partitioned tables are tables that are not divided into partitions.

There are several advantages of vertical fragmentation, including:
• Improved performance: Since only the necessary columns are transmitted over the network, vertical fragmentation can significantly improve query performance.
• Reduced network traffic: By transmitting only the necessary columns, vertical fragmentation can reduce the amount of data that needs to be transmitted over the network.
• Improved scalability: Vertical fragmentation can improve scalability by allowing additional nodes to be added to the distributed database system as the data volume grows.

However, there are also some disadvantages to vertical fragmentation, including:
• Increased complexity: Vertical fragmentation can increase the complexity of database design and management, making it more difficult to maintain and modify the database. A query that needs columns from several fragments must join them across nodes.
• Increased data redundancy: Every fragment must repeat the primary key (or another row identifier) so that rows can be reconstructed. This can increase storage requirements and maintenance costs.

3.5 DISTRIBUTED QUERY PROCESSING
Distributed query processing is the process of executing a query that involves data stored across multiple nodes in a distributed database system. The goal is to execute the query in a way that minimizes network traffic and processing time while still ensuring the correctness of the results.

Distributed query processing can be divided into two main phases: query optimization and query execution.

Query Optimization: The query optimization phase involves analyzing the query and developing an execution plan that minimizes network traffic and processing time. The optimization process takes into account factors such as the size of the tables, the distribution of the data across the nodes, and the cost of transmitting data over the network.

One of the most common techniques the optimizer relies on is data partitioning. Data partitioning divides a large table into smaller subsets based on a specific criterion, such as geographic location or customer segment. Each subset, or fragment, can be stored on a different node, allowing queries to be executed in parallel on different nodes and improving the overall query performance.

Another technique is replication, which creates multiple copies of data across different nodes to reduce network traffic and processing time. In this approach, each query is executed on a node that holds a replica of the necessary data, reducing the need to transmit data over the network.
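To make the cost reasoning in the optimization phase concrete, here is a minimal sketch of one decision an optimizer might make: at which site to join two tables that live on different nodes. It assumes a deliberately simplified cost model in which cost is just the number of bytes shipped; the table names, sizes, and site names are hypothetical.

```python
def transfer_cost(table_sizes, locations, join_site):
    """Bytes that must be shipped so every table is present at join_site."""
    return sum(
        size
        for table, size in table_sizes.items()
        if locations[table] != join_site
    )


def choose_join_site(table_sizes, locations):
    """Pick the site that minimizes the data shipped over the network."""
    sites = sorted(set(locations.values()))
    return min(sites, key=lambda site: transfer_cost(table_sizes, locations, site))


# Assumed sizes in bytes and assumed placement of each table.
table_sizes = {"customers": 2_000_000, "orders": 50_000_000}
locations = {"customers": "site_1", "orders": "site_2"}

best = choose_join_site(table_sizes, locations)
print(best, transfer_cost(table_sizes, locations, best))
```

Under this model the smaller table is shipped to the site holding the larger one. Real optimizers also weigh processing cost, available indexes, and intermediate result sizes, none of which are modeled here.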
Query Execution: Once the query optimization phase is complete, the query execution phase begins. The execution plan developed in the optimization phase is used to execute the query in parallel on multiple nodes. The results are then combined to produce the final result set.

During the execution phase, several challenges can arise, such as node failures, network outages, and data inconsistencies. To address these challenges, distributed database systems often use techniques such as transaction management, data replication, and data consistency checks to ensure the correctness of the results.

3.6 DISTRIBUTED TRANSACTIONS PROCESSING
Distributed transaction processing is the process of managing transactions that involve multiple nodes in a distributed database system. A distributed transaction is a transaction that accesses and updates data on multiple nodes. The goal of distributed transaction processing is to ensure the correctness and consistency of the transactions, even in the presence of failures and network delays.

Distributed transaction processing can be divided into three main phases: transaction initiation, transaction execution, and transaction commit.

Transaction Initiation: The transaction initiation phase involves starting the transaction on the client node and identifying the data that needs to be accessed and updated on the different nodes. The client node sends a transaction request to the first node, which becomes the coordinator for the transaction.

Transaction Execution: The transaction execution phase involves executing the transaction on the different nodes. Each node executes its portion of the transaction, which involves reading and writing data as necessary. If a node fails or the network connection is lost, the transaction can be aborted to ensure the consistency of the database.

Transaction Commit: The transaction commit phase involves committing the transaction once all nodes have successfully executed their portion of the transaction. The coordinator node sends a commit request to each node, which confirms that the transaction has been executed successfully. If any node reports an error or a failure, the transaction is aborted to ensure the consistency of the database.
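The commit rule described above (commit only if every node succeeded, otherwise abort everywhere) can be sketched as a simplified voting exchange between a coordinator and its participants, which is the pattern used by the two-phase commit protocol. The `Participant` class and `run_transaction` function are hypothetical, and the sketch deliberately omits timeouts, logging, and crash recovery, all of which a real protocol needs.

```python
class Participant:
    """A toy node taking part in a distributed transaction (assumption)."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulated outcome of local execution
        self.state = "active"

    def prepare(self):
        """Vote on whether this node is able to commit its part."""
        self.state = "ready" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def run_transaction(participants):
    """Coordinator logic: collect votes, then commit or abort everywhere."""
    votes = [p.prepare() for p in participants]  # phase 1: voting
    if all(votes):
        for p in participants:  # phase 2: global commit
            p.commit()
        return "committed"
    for p in participants:  # phase 2: global abort
        p.abort()
    return "aborted"


healthy = [Participant("node_a"), Participant("node_b")]
print(run_transaction(healthy))

one_failure = [Participant("node_a"), Participant("node_b", can_commit=False)]
print(run_transaction(one_failure))
```

A single negative vote is enough to abort the whole transaction, which is what keeps the nodes from ending up in different states.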
To ensure the correctness and consistency of distributed transactions, several techniques are used: commit protocols such as the two-phase commit protocol and the three-phase commit protocol, together with concurrency control techniques such as optimistic concurrency control.

Two-phase commit is the most commonly used commit protocol. It involves a coordinator node and multiple participant nodes. The coordinator sends a prepare message to each participant, and each participant responds with a vote either to commit or to abort. If all participants vote to commit, the coordinator sends a commit message to all nodes. Otherwise, the coordinator sends an abort message to all nodes.

3.7 SUMMARY
Distributed databases provide a more robust and scalable solution for managing large volumes of data and handling high volumes of user requests. However, they require more complex management and administration than centralized databases.

Distributed query processing is a complex process that involves several optimization and execution techniques to improve the performance and scalability of distributed database systems.

Distributed transaction processing is a complex process that requires careful management to ensure the correctness and consistency of the database. The use of transaction management techniques such as the two-phase commit protocol and optimistic concurrency control is critical to the success of distributed transaction processing.

3.8 KEYWORDS
Fragmentation - refers to dividing a database into smaller subsets called fragments, which can be stored on different nodes of a distributed system.
Replication - involves creating multiple copies of data and storing them on different nodes of the distributed system.

3.9 LEARNING ACTIVITY
1. Define Horizontal Fragmentation
___________________________________________________________________________
___________________________________________________________________________
2. Differentiate Full and Partial replication
___________________________________________________________________________
___________________________________________________________________________

3.10 UNIT END QUESTIONS
A. Descriptive Questions
Short Questions
1. Describe the distributed database architecture and explain its types.
Long Questions
1. Elaborate on the major phases of distributed query processing.
2. Explain the process of managing transactions that involve multiple nodes in a distributed database system.

Multiple Choice Questions
1. An autonomous homogeneous environment is which of the following?
A. Same DBMS is at each node and each DBMS works independently.
B. Same DBMS is at each node and a central DBMS coordinates database access.
C. Different DBMS is at each node and each DBMS works independently.
D. Different DBMS is at each node and a central DBMS coordinates database access.
2. A distributed database has which of the following advantages over a centralized database?
A. Software cost
B. Software complexity
C. Slow response
D. Modular growth
3. A heterogeneous distributed database is which of the following?
A. The same DBMS is used at each location and data are not distributed across all nodes.
B. The same DBMS is used at each location and data are distributed across all nodes.
C. A different DBMS is used at each location and data are not distributed across all nodes.
D. A different DBMS is used at each location and data are distributed across all nodes.

3.11 REFERENCES
TEXT BOOKS:
1. Avi Silberschatz, Hank Korth, and S. Sudarshan, "Database System Concepts", 6th edition, McGraw Hill, 2010.
2. Ramez Elmasri and Shamkant B. Navathe, "Fundamentals of Database Systems", 7th edition, Addison Wesley, 2014.
REFERENCE BOOKS:
1. S. K. Singh, "Database Systems: Concepts, Design and Applications", 2nd edition, Pearson Education, 2011.
2. Joe Fawcett, Danny Ayers, and Liam R. E. Quin, "Beginning XML", 5th edition, Wiley India Private Limited, 2012.
3. Thomas M. Connolly and Carolyn Begg, "Database Systems: A Practical Approach to Design, Implementation, and Management", 6th edition, Pearson India, 2015.