MODULE 3 REPLICATION AND CONSISTENCY

3.0 Introduction
3.1 Unit Objectives
3.2 Replication
    3.2.1 Concept of Replication
    3.2.2 Need for Replication in Distributed Systems
    3.2.3 Issues with Replication
3.3 Consistency
    3.3.1 Conflicting Operations
    3.3.2 Types of Consistency
    3.3.3 Consistency Models
3.4 Replica Management
    3.4.1 Distribution Protocols
    3.4.2 Consistency Protocols
3.5 Fault Tolerance
    3.5.1 Attributes of Fault-Tolerant Systems
    3.5.2 Errors, Failures and Faults
    3.5.3 Fault-Tolerant Design Techniques
3.6 Group Communication
    3.6.1 Types of Groups
    3.6.2 Group Management
3.7 Transactions
    3.7.1 Definition
    3.7.2 Two-Phase Commit
    3.7.3 Three-Phase Commit
3.8 Checkpointing
    3.8.1 Checkpoint
    3.8.2 Types of Checkpoints
3.9 Summary
MODULE 3 MCA-303 SYSTEM SOFTWARE ADMN 2009 - '12

3.0 INTRODUCTION

This unit introduces you to replication, which is an integral part of a distributed system. Since reliability is one of the most important features of a distributed system, replication frees data from single-point failure by distributing it across different locations. It also provides clients with faster access to data from a server location available in their proximity. Replication raises the concern of consistency of data. In this unit, you will learn that a distributed system design should make a decision about the consistency model it should follow before setting up the replicas. A distributed system should exhibit fault tolerance and withstand component failures to be able to provide quality services to its clients. You will learn that to achieve fault tolerance, systems use two-phase and three-phase commit protocols and group communication techniques to provide resilient process coordination. Furthermore, to be able to recover gracefully from process crashes, distributed systems employ checkpointing and backward recovery mechanisms.

3.1 UNIT OBJECTIVES

After going through this unit, you will be able to:
• Understand the objectives of replication in distributed systems
• Identify the issues associated with replication
• Understand the various consistency models
• Study the protocols associated with consistency models
• Understand the concept of fault tolerance with respect to distributed systems
• Analyze the requirement of group communication methodologies in distributed systems
• Discuss the two-phase commit and three-phase commit
• Understand the concept of checkpointing

3.2 REPLICATION

3.2.1 Concept of Replication

Replication in distributed systems is the process of maintaining multiple copies of data at different distributed locations. It is an integral part of a distributed system.
Database replication, disk storage replication and distributed shared memory replication are multiple facets of replication in distributed systems. Database replication is concerned with creating and maintaining multiple copies of the same database at different locations. One possible way of maintaining database replication is to maintain a master copy of the database along with servers which maintain slave copies of the database. Disk storage replication is generally provided by the mirroring technique. To maintain uniformity in the rest of the text, we will use the general term 'data reserve' to represent a distributed database, a distributed file system and a distributed shared memory.

Dept. of Computer Science And Applications, SJCET, Palai
The following two scenarios present the use of replication in distributed systems:

1. On the Internet, a website is replicated in its entirety and put on another site. Such a site is called a mirror site, as it is the exact replica of the actual website. For a given website there can be multiple mirror sites. The number of mirror sites is a load-oriented, business-specific decision.

2. For a large bank with multiple branches distributed across the country, the database replication technique is employed. At the main branch, the central database is maintained with a supporting infrastructure, and at the other branches replicas are maintained. Database operations are divided among all the database servers that are replicas of the central database, dispersed over a network of interconnected computers, which results in a large performance advantage due to load sharing.

3.2.2 Need for Replication in Distributed Systems

The performance requirements of a distributed system are far more stringent than those of centralized systems. Replication helps in intensifying the efficiency of a distributed system by providing:

• Better and faster accessibility of services, thereby increasing the performance of the distributed system. Clients that are far from the centralized data reserve can access the data locally. They do not need to connect to a remote database server over a network. This results in faster accessibility of data and hence, faster execution of transactions.

• Better availability of data, thereby providing reliability against system failures: If a local copy of data is unavailable, the client can access data from a remote replica of the data reserve. This keeps the system running even in case of failures.

We will identify the benefits of replication with the help of Scenario 1 specified above. Mirror sites can be used for websites that are popular for downloading and experience a larger number of hits.
A request from a global user can be directed to a mirror site that is geographically near to the user or has faster servers. This helps in reducing network traffic by ensuring better availability of the website. It also enables the site or downloaded files to arrive more quickly for users close to the mirror site. Also, in case of failure of the web server maintaining the website, the requests can be forwarded to its mirror site. This provides the business a reliable solution against system failures. Hence, the system stands out with an improved performance and a stronger reliability quotient. Replication supports the system's infrastructure backbone if the system needs to scale out in numbers and geographical area.

3.2.3 Issues with Replication

Replication proves to be a boon to the accomplishment of a distributed system, but it is not free of issues. First, the replication manager should make sure that the replication itself is transparent to an external user. To the user of the system it should appear that he is transacting with only one data reserve. Access to a replicated entity should be uniform with the access to a single, non-replicated entity. If the data replication details, such as the copy of the data reserve in which the changes made by the client should be updated, are implemented in the client logic, then programming each client becomes costly, prone to error, and difficult to maintain. Instead, data replication should be kept transparent to the client process. Replication transparency is built into a server (instead of into client applications) and it automatically handles the details of locating and maintaining data replicas.
Second, maintaining the consistency of data at the replicas is a big concern. Since multiple processes update the replicas simultaneously, it is extremely vital that the transactions of each process result in correct and consistent results. Whenever a copy is modified, it should be updated with new changes, whereas other replicas still have the old data. So the updates need to be executed on all replicas to maintain consistency. The cost of these updates, in terms of network bandwidth and the decision complexity as to when and how these updates should take place, determines the actual price of replication. Updating replicated data at several locations takes much more processing time and cost than updating a single object.

3.3 CONSISTENCY

The operations that we are concerned with are the read operation, which reads a data item's value from the data reserve, and the write operation, which updates the value of a data item in any copy of the data reserve. A read operation is represented as Ri(x)a, where process i reads the value 'a' of data item x. A write operation is represented as Wi(x)a, where process i updates data item x with the value 'a'.

3.3.1 Conflicting Operations

In a distributed data reserve, the operations that affect consistency are the conflicting operations. The conflicting operations are stated as combinations of read and write operations:

• Read-write conflict: Two different processes perform read and write operations concurrently on the same data item x in different replicas.

• Write-write conflict: Two different processes perform write operations on the same data item concurrently in different replicas.

The inconsistent state of the replicas of a data reserve is mostly attributed to the wrong order of execution of conflicting operations in different replicas.
To keep replicas consistent, it is to be ensured that all conflicting operations are done in the same order everywhere.

3.3.2 Types of Consistency

Consistency among replicas of a data reserve can be of two types: strong consistency and weak consistency. These are also known as the degrees of consistency.

3.3.2.1 Strong consistency

Also known as tight consistency, it ensures that at any time all the copies of a data reserve are exactly the same. It guarantees a global ordering on conflicting operations. If a process updates a local copy, then that change is synchronously propagated to all copies of the data reserve. As a result, any process executing a read operation on a data item retrieves the same value from any copy of the data reserve. The operations of passing the updates to all replicas should be done atomically to avoid any anomalies. Since every update needs to be propagated to all the replicas, maintaining this kind of consistency in the data reserve is very expensive in terms of synchronization time and network bandwidth consumption. Moreover, if the number of replicas increases, the effort for synchronizing all the replicas also increases. Hence, scaling out a distributed system that requires tight consistency may
result in poor performance, as the time to synchronize the entire data reserve through global synchronization becomes more than the access time of the data. Thus, tight consistency may result in downgrading the scalability.

3.3.2.2 Weak consistency

It is also known as loose consistency. It weakens the consistency requirements so that global synchronization can be avoided. The system does not guarantee that subsequent accesses will return the updated value, and this is acceptable to the clients of the system. Such systems can tolerate a relatively high degree of inconsistency. One such form of weak consistency is known as eventual consistency. Eventual consistency is acceptable to distributed systems that involve mostly read operations and very few simultaneous updates. Examples of such systems include the World Wide Web and worldwide naming services such as the DNS. It can be defined as follows: if no new updates are made to the data item, eventually all accesses to that data item, at the multiple replicas, will return the last updated value. Hence, if no new updates take place for a long time, all replicas will gradually become consistent. During the time when the updates are propagated to all the replicas, the processes may read inconsistent values and find this inconsistency acceptable. Whether inconsistency is acceptable depends foremost on the client application. The following are two conditions associated with eventual consistency:

• First, all the updates should eventually reach all the replicas of the data reserve.
• Second, the client should be accessing the data from a fixed replica and should not be mobile.

The most popular system that implements eventual consistency is DNS (Domain Name System). Eventual consistency encourages scaling of distributed systems.

3.3.3 Consistency Models

As stated earlier, one of the big issues with maintaining replicas is to maintain the uniformity of data among all the replicas.
Some applications have strict consistency requirements, whereas others can loosen the consistency to attain efficient solutions.

A consistency model outlines the set of rules that are to be followed by the processes that require services from the replicated application. It is an agreement between the processes and the data reserve for obtaining the correct result for a transaction executed on any replica.

3.3.3.1 Data-centric consistency model

This model works on the principle that when an update is made, it is visible to all the processes. This consistency model is utilized in distributed systems that require a system-wide consistency. A distributed system has processes running at different locations and accessing the data at the various replicas simultaneously. Since there can be simultaneous updates, the use of synchronization mechanisms is required. It is essential for the distributed system to maintain consistency of data at all the replicas of the data reserve. The following section discusses some consistency models based on the data-centric consistency model:

Sequential Consistency Model: This model ensures consistency based on the ordering of the operations. It is used in the concurrent programming domain where multiple processes execute concurrently on multiple machines. It is required that each process maintains its program order.
Sequential consistency is defined as follows: "The result of an execution is the same as if the operations of all processes were executed in some sequential order, and the operations of each individual process appear in this sequence in the order specified by its program."

This model allows the interleaving of the operations of different processes running on different machines, but the order of interleaving visible to all processes should be the same. This model assumes the absence of a global time clock, and hence time does not play a role in ensuring the consistency of the data reserve; only the order of the conflicting operations is important here.

Figure 3.1(a) shows four processes P1, P2, P3 and P4 which are involved in the concurrent execution of read and write operations on the variable y.

P1: W(y)a
P2: W(y)b
P3: R(y)b R(y)a
P4: R(y)b R(y)a

Fig. 3.1(a) Processes involved in the Concurrent Execution of Read and Write Operations

P1 writes the value of variable y in its local copy as 'a'. P2 writes the value of variable y in its local copy as 'b'. Since these two concurrent operations are performed independently of each other, their effect is currently visible only in the local data reserve copies of P1 and P2 respectively. It so happens that the operation performed by P2 is updated earlier in all the copies globally. At this time, a read performed by P3 on variable y results in the value 'b'. A read operation by P4 also results in the same value. Once the effect of the write operation of P1 is visible, the subsequent read operations by P3 and P4 show the value 'a'. This example depicts that only sequence, and not time, plays a role in assuring consistency in the sequential consistency model.

P1: W(y)a
P2: W(y)b
P3: R(y)b R(y)a
P4: R(y)a R(y)b

Fig. 3.1(b) Diagram showing Violation of the Sequential Consistency Model

Figure 3.1(b) shows the violation of the sequential consistency model. P1 writes the value of variable y in its local copy as 'a'.
P2 writes the value of variable y in its local copy as 'b'. The updated value of variable y written by the process P2 is visible to P3, but not to P4, which first sees the value of y as updated by P1. Since the same order of updation is not maintained, the subsequent reads of P3 and P4 do not yield consistent values.
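The difference between Figures 3.1(a) and 3.1(b) can be sketched in code. The following minimal sketch (illustrative only; the function name and structure are not from the text) asks whether a single global order of the two writes W(y)a and W(y)b can explain what every process read; sequential consistency requires exactly one such shared order.

```python
from itertools import permutations

def explained_by_one_order(reads_per_process, writes=('a', 'b')):
    """Return True if one total order of the writes matches the sequence
    of values that every process observed."""
    for order in permutations(writes):
        if all(reads == list(order) for reads in reads_per_process):
            return True
    return False

# Fig. 3.1(a): P3 and P4 both observe 'b' then 'a'; the single global order
# W(y)b -> W(y)a explains both, so the execution is sequentially consistent.
print(explained_by_one_order([['b', 'a'], ['b', 'a']]))   # True

# Fig. 3.1(b): P3 observes 'b' then 'a', but P4 observes 'a' then 'b'; no
# single order of the two writes explains both, so consistency is violated.
print(explained_by_one_order([['b', 'a'], ['a', 'b']]))   # False
```

The brute-force search over permutations is only workable for tiny examples like this one, but it captures the definition: time plays no role, only the existence of one agreed order.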
Causal Consistency Model: This consistency model relaxes the 'order of operation requirements' defined by the sequential consistency model. The causal consistency model makes a distinction between the operations that are causally related and those that are not.

Two events are said to be causally related if one is influenced by the other. If there is a read operation followed later by a write operation, then the two operations are potentially causally related. For example, if P1 writes the data item x, and P2 reads this value and then modifies the variable y, then, since P2 uses the value of x to obtain the value of y, R2(x) and W2(y) are causally related to each other. Also, if there is a read operation R1(y) after the write W2(y) by P2, then these operations are causally related.

The causal consistency model states that the write operations that are potentially causally related should be seen in the same order by all the processes, whereas concurrent write operations can be seen in different orders. Figure 3.2(a) shows the causal consistency model.

P1: W(y)a W(y)c
P2: R(y)a W(y)b
P3: R(y)a R(y)c R(y)b
P4: R(y)a R(y)b R(y)c

Fig. 3.2(a) Causal Consistency Model

The read operation R(y)a by P2 is causally related to W(y)a by P1, and W(y)b is causally related to R(y)a. So, using the transitivity rule, W(y)a and W(y)b are causally related and should be seen by all processes in the same order. But W(y)c executed by P1 is concurrent to W(y)b, and is visible to processes in any order. Hence, the consistency of data is maintained.

In Figure 3.2(b), causal consistency is violated. W(y)b is causally related to W(y)a, as the value b may be determined by the read action R(y)a. But processes P3 and P4 do not see these two write operations in the same order.
P1: W(y)a
P2: R(y)a W(y)b
P3: R(y)b R(y)a
P4: R(y)a R(y)b

Fig. 3.2(b) Diagram showing Violation of the Causal Consistency Model

3.3.3.2 Client-centric consistency model

This model looks at consistency requirements from the point of view of the client. Recall that one requirement of ensuring eventual consistency is that the client should always access the same data reserve replica. Consider a scenario where a client is travelling and accessing a replica at location A. As he moves to another location, the server provides him access to replica B, as it is geographically nearer to him. He starts reading and updating the data in replica B. In such a case, the client can face data anomalies, as the client's
updates at A may not yet have been propagated to B, the client may be reading newer entries than the ones available at A, or the client's updates at B may eventually conflict with those at A. For the client, the data is consistent if the entries he updated and/or read at replica A are present in replica B.

The client-centric consistency model ensures consistency for a single client accessing multiple replicas. It does not guarantee the consistency of concurrent access by different clients. The following are the models based on the client-centric consistency model:

Monotonic Reads: This model provides the most basic consistency as required by a mobile user. It states that if a process reads the value of a data item, any successive read operation by that process will always return that same or a more recent value. It provides consistency between two or more read operations initiated by the same client at different times. This consistency model ensures that the data value read by a user from a replica of the data reserve is always consistent with the value read from another replica at a later time.

As an example of this model, consider a client who reads and updates his personal calendar on a web server database replica.
On moving to some other continent, he transparently accesses another web server replica and is able to read all updates made by him from the previous location. Monotonic reads guarantees that the client sees his updates, no matter from which copy of the data reserve the reading takes place.

Monotonic Writes: This consistency model ensures consistency for a single client by requiring that the write operations on a data item by a single client be performed in the same order as initiated. It is defined as "a write operation by a process on a data item is completed before any successive write operation on the same data item by the same process." In simpler terms, monotonic writes imposes the condition that an updated value of a data item should be propagated to all the replicas before a successive write by the same process. The write operations can be arranged in a first-in first-out manner and applied to the data item in the same order.

As an example of this model, consider a scenario where a program version X is present in the data reserve. A client needs to update the program X to a newer version Y in replica A and then further upgrade it to a version Z in replica B. Since the latest upgrade is based on the previous version of the program, it is of utmost importance that the upgrade operations are performed in the same order as initiated, no matter on which copy of the data reserve the updates take place.

Read Your Writes: This model is an extension of the monotonic reads model. It is defined as "the effect of a write operation by a process on a data item will always be seen by a successive read operation by the same process." This model ensures that a read operation on a data item always returns a value at least as recent as the one written by a preceding write operation of the same process.

As an example of this model, consider the scenario of updating a web page.
If the web server follows the 'read your writes' model, then it guarantees that the client web browser shows the newest version of the web page instead of its cached copy.

Writes Follow Reads: It is defined as "a write operation by a process on a data item x, following a previous read operation on x by the same process, is guaranteed to take place on the same or a more recent value of x that was read." This model ensures that any successive write operation by a process on a data item x will be performed on a replica of x that is up to date with the value most recently read by that process. Such a model is required in distributed scenarios where the value a data item will be updated to is determined by the existing latest value of the data item. For instance, if a client wants to post an answer to a query in a forum, he first needs to read the query itself and only then can decide on the response. Hence, any response to the query will be written to the forum only after the query has been written as well.
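The calendar example above can be sketched as a session guarantee. In the following minimal sketch (the class and method names are illustrative, not from the text), the client's session remembers the highest version of an item it has read, and a replica serving an older version is rejected, which is exactly what Monotonic Reads requires when the client moves between replicas A and B.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {}                 # item -> (version, value)

    def write(self, item, version, value):
        self.data[item] = (version, value)

    def read(self, item):
        return self.data.get(item, (0, None))

class ClientSession:
    def __init__(self):
        self.read_versions = {}        # item -> highest version already read

    def monotonic_read(self, replica, item):
        version, value = replica.read(item)
        if version < self.read_versions.get(item, 0):
            # Serving this value would roll the client back in time.
            raise RuntimeError(f"{replica.name} is stale for {item!r}")
        self.read_versions[item] = version
        return value

# Replica A holds the client's calendar update (version 2); the update has
# not yet propagated to replica B, which still holds version 1.
a, b = Replica("A"), Replica("B")
a.write("calendar", 2, "meeting at 10")
b.write("calendar", 1, "empty")

session = ClientSession()
print(session.monotonic_read(a, "calendar"))   # "meeting at 10"
try:
    session.monotonic_read(b, "calendar")      # violates monotonic reads
except RuntimeError as e:
    print(e)
```

In a real system the rejected read would typically be redirected to an up-to-date replica, or the stale replica would first pull the missing updates; the sketch only shows the detection side of the guarantee.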
3.4 REPLICA MANAGEMENT

Managing replicas can be divided into two steps, namely distribution protocols and consistency protocols. First, a distribution protocol needs to be selected. Distribution protocols deal with the placement of replica servers. Once the replica servers are in place, the content needs to be placed. This requires finding the best server for placing content. It also involves deciding on a suitable technique for propagating the updated content to the relevant servers. Second, a decision has to be made for choosing the consistency protocol.

3.4.1 Distribution Protocols

Distribution protocols are divided into the following three phases:
1. Replica server placement
2. Content replication and placement
3. Content distribution

1. Replica server placement

Replica server placement aims at finding the best K locations out of N locations for placing K replica servers. The best place for placing a replica server is generally attributed to those areas where there is a burst of requests, or the location that is geographically closest to the clients so that the bandwidth required for downloading is less.

• The first solution identifies the location for the replica server on the basis of proximity to the clients. Select the best location out of the remaining (N - K) locations such that the distance from the location (replica server) to the clients is minimal; this means choosing the best location when K servers have already been placed. Then choose the next best server using the same criteria. The first chosen server location minimizes the average distance to all the clients. The distance is measured in terms of the network bandwidth.

• The second solution considers the network as made up of autonomous systems. An autonomous system is a network of nodes that communicate using the same routing protocol and is managed by a single administrative body.
This solution ignores the clients' positions and only takes the topology of the autonomous system's network into account. It selects the largest cluster containing the maximum number of nodes and makes one of those nodes the replica server. It repeats the same procedure with the next largest autonomous cluster till it finds the K locations.

Both the above solutions are computationally expensive, with an algorithmic complexity greater than O(N²), where N is the number of locations to search from. As the number of locations to inspect increases, the time required for finding the best location also increases, thereby making them inappropriate for real-time use.

• The next algorithm eases the problem of finding the best location by identifying the most demanding region and taking it as the location to place a replica server. A region consists of the nodes that access the same content, and the most demanding region is the one that contains the maximum number of nodes. The algorithm identifies areas with the highest density and places a replica server in each such area. It is computationally cheap and can be applied in real time.

2. Content replication and placement
Content replication deals with finding the best replica server for placing the content. Replicas can be logically distinguished into the following types:

(i) Permanent replicas: These are the initial set of replicas that constitute the distributed data reserve. Consider the example of mirroring provided in Section 3.2.1. A website is replicated in its entirety and put on another site. Such a site is called a mirror site, as it is the exact replica of the actual website; for a given website there can be multiple mirror sites.

(ii) Server-initiated replicas: As the name suggests, server-initiated replicas are dynamically initiated on the request of a server in the data reserve. A server may request hosting a replica in case it is overloaded and the number of requests it is required to serve increases above a threshold value. Hence, for performance reasons it is worthwhile to install new replicas to share the load on the server. A web server can be stated as an example of the above situation. Let the web server be located in Location X and handle requests from clients globally. It so happens that the web server receives a burst of requests from Area Y due to the holiday season (Area Y is far from Location X). This results in a reduction of performance of the web server. In such a case, the web server initiates a request for hosting a replica in Area Y that can handle the requests of that area till the holiday season is over.

Also, there can be a situation where a set of clients issues a very frequent, large number of requests for specific files. In such a scenario, a server requests a replica on which such files can be replicated, and the replica can be placed in the proximity of the clients that issue requests for those files. One algorithm for implementing such situations is described by Rabinovich (1999):

• Each server keeps track of the access counts per file and the location of requests. If two clients C1 and C2 share the same closest host P,
all access requests for file F at server Q from C1 and C2 are jointly registered at Q as a single access count cntQ(P,F).

• When the number of requests for a specific file F at server S drops below a deletion threshold del(S,F), that file can be removed from S. Before removing the file, care has to be taken that at least one copy of the file remains in the system.

• Another parameter, called the replication threshold rep(S,F), is compared with the number of accesses for a particular file, and if the number of requests is greater than rep(S,F) then the file needs to be replicated to another server. If the number of accesses is between the deletion and replication thresholds, the file can be migrated.

For example, if any server wants to re-evaluate the placement of files, it checks the access count for each file. If the count is less than the deletion threshold, the file is deleted. The decision to migrate a file F to another server P is taken if, for some server P, cntQ(P,F) exceeds more than one-half of the total requests for F at Q, and also the total number of access requests for F on Q is greater than the replication threshold value rep(Q,F).

(iii) Client-initiated replicas: These are the replicas that can be dynamically hosted on a request from the client. Such replicas are also known as client caches. Client caches are used to reduce the access time of fetching the data from the server. In scenarios where the client requests mainly consist of repeated read operations from the data reserve, data is stored in a cache memory to reduce the access time of fetching it from the server. The cache can be located on the client's machine itself or on a separate
machine belonging to the client machine's LAN network. It becomes the client's responsibility to update stale data to the latest version.

3. Content distribution

A decision needs to be made about what needs to be propagated to the replicas and what methodology should be used to propagate the updated data to the replicas.

When a data item is updated at any of the replicas of the data reserve, the following three ways can be used to send the updates of a data item x to the other replicas that contain a copy of the data item:

• Only a notification of the update is propagated to the replicas. An invalidation report is sent to the replicas, informing them that their copy of the data item x has become invalid as it has been updated. The replicas update their copy of data item x as and when required. Sending only the notification reduces the bandwidth consumption.

• The updated data itself can be transmitted to all replicas. This alternative is efficient when the read-to-write ratio is high, that is, the updated data is read before the next updation takes place.

• The update operation itself is propagated to the replicas and they perform the operation locally. Though the bandwidth consumption is less, more processing power is needed at the replicas to perform the operations.

The push or pull model can be used to propagate the updates. If the distributed system has strict consistency requirements, the server pushes the updates to the replicas. Thus, whenever an update operation is performed, the updates are pushed to the replicas, thereby assuring consistency. The push model is often used with permanent and server-initiated replicas. Moreover, the push model proves to be efficient when the read-to-write ratio is high.

The pull model is mostly used with client-initiated replicas. The updates are propagated to the replicas when the client or the server asks for the updates.
This model fits the scenario where a client cache that has cached some data for local use checks with the server whether the cached data items are still up to date. If the server has some updates, the client pulls them from the server.

3.4.2 Consistency Protocols

Once a specific consistency model that is suitable for the distributed application has been decided, a protocol is needed for the implementation of the model. A consistency protocol focuses on the actual implementation of a specific consistency model. It defines the rules that need to be followed to keep the data at the various replicas consistent. Generally, these rules involve the read and write operations carried out by the participating processes.

The following section is divided into two parts: the first part discusses consistency protocols for the data-centric (sequential) consistency model and the second discusses protocols for the client-centric consistency model.

Protocols for Data-Centric Consistency Models

These protocols are also known as primary-based protocols. These protocols ensure consistency of data by maintaining a primary server for each data item. The primary server takes the responsibility of performing any update operation on item x and also of forwarding the update operation's result to all the replica servers. The working of a primary-based protocol is distinguished on the basis of the location of the primary server.
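The primary-based idea can be sketched in a few lines. The following single-process sketch (illustrative only; the class names and "ack" handshake are assumptions, not from the text) shows a primary applying a write locally and forwarding it to every backup replica before acknowledging, so that any copy may then serve reads with the same value.

```python
class BackupServer:
    def __init__(self):
        self.store = {}

    def apply_update(self, item, value):
        self.store[item] = value
        return "ack"                    # acknowledgement back to the primary

class PrimaryServer:
    def __init__(self, backups):
        self.store = {}
        self.backups = backups

    def write(self, item, value):
        self.store[item] = value
        # Block until every backup acknowledges: this is what gives the
        # scheme its ordering guarantee, and also what makes writes slow.
        acks = [b.apply_update(item, value) for b in self.backups]
        assert all(a == "ack" for a in acks)
        return "done"

    def read_any(self, item):
        # After a completed write, reads from any copy agree.
        return self.store.get(item)

backups = [BackupServer(), BackupServer()]
primary = PrimaryServer(backups)
primary.write("x", 42)
print(primary.read_any("x"), backups[0].store["x"], backups[1].store["x"])
```

In a real deployment the forwarding and acknowledgements would be network messages and the client would remain blocked for the whole round trip, which is the performance cost discussed below.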
Remote-write protocol: When a client wants to perform a read operation, it uses the local copy of the data item. While updating the data, however, the update operation is forwarded to the primary server of the data item, which is remote to the client; the client process remains blocked till it receives a signal from the primary server. The remote primary server sends an invalidate message to all the replicas containing a copy of data item x. The primary server then performs the update on its local copy of x and forwards the update to all the replica servers. The replicas, after updating data item x in their local copies, send an acknowledgement back to the primary server. The primary server then informs the client process to proceed with its execution. Remote-write protocols ensure sequential consistency, as all the processes see the write operations in the same order and are able to access consistent data irrespective of which backup server they access. But performance decreases due to remote updates and also due to blocking of the client process till the update is performed by all the replicas.

Migrate-write protocol: To avoid the remote update, the client contacts the primary server of the data item x and fetches the most recent state of the primary. This can be viewed as migrating the primary server to the local replica of the client as and when an update for a particular data item is needed. The client then sends an invalidate message to all the copies of data item x and performs the update locally. The primary then informs the replicas to update their copies of data item x. Thus, this variant of the primary-based protocol does not require blocking of the client process.

Quorum-based protocol: These protocols are client-centric consistency protocols. They allow update operations to be performed on multiple replicas based on majority voting. The clients acquire the permission of multiple servers before either reading or writing a replicated data item.
A version number is associated with each copy of the data item; on changing the data item, the version number of that copy is updated to a newer value. The client performs the following steps for reading a data item:
1. The client needs the votes of one-half plus one of the servers on which the data item is replicated. So the client contacts at least one-half plus one replica servers and requests them to send the version number associated with the data item. If all the version numbers are the same, then the copy is the latest and the client can perform the read operation.
2. To perform an update operation, the client needs the votes of one-half plus one of the servers on which the data item is replicated. So the client contacts at least one-half plus one replica servers and makes them agree to perform the update. Once the data item is updated, its version number is also updated to a newer value.
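The one-half plus one rule can be sketched as follows. This is a single-process illustration with invented names, not a networked implementation; the point is that any read majority must intersect any write majority, so a reader always sees the highest version number installed by a completed write:

```python
# Hedged sketch of quorum-based voting: a client contacts at least
# (N // 2) + 1 of the N replica servers. A read returns the newest version
# among the contacted copies; a write installs a new version on a majority.

class Replica:
    def __init__(self):
        self.version = 0
        self.value = None

def quorum(replicas):
    # "one-half plus one" servers
    return len(replicas) // 2 + 1

def quorum_read(replicas):
    voters = replicas[:quorum(replicas)]           # contact a majority
    latest = max(voters, key=lambda r: r.version)  # pick the newest copy
    return latest.version, latest.value

def quorum_write(replicas, value):
    version, _ = quorum_read(replicas)             # learn the current version
    for replica in replicas[:quorum(replicas)]:    # majority agrees to update
        replica.version = version + 1
        replica.value = value
```

With five replicas the quorum size is three, so every pair of quorums shares at least one server, which is what guarantees the reader sees the latest write.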
3.5 FAULT TOLERANCE

In a distributed system, it often happens that one or more components of the system undergo failure, leaving the other, unaffected components functioning normally. This quality of the system to continue functioning even when some of its components have failed is termed fault tolerance.

The ability to function, albeit with reduced capability, in the situation of a failure in itself or in one or more of its associated components is the basic characteristic of a fault-tolerant system. Such failures might arise due to power failures, disk failures, communication failures, memory failures and general component failures. In the event of any of these happenings, a fault-tolerant system will resist a complete failure and will continue to operate in a reduced-functionality mode. This attribute of a fault-tolerant system is known as graceful degradation. The decrease in the operational quality is proportional to the failure.

Fault-tolerant systems are deployed in mission-critical systems, life-support systems, aviation, hazard-oriented systems, financial systems, banking, stock markets, etc. Common examples include air traffic control, medical apparatus used in intensive care units, bank ATMs, electronic transaction equipment at merchant establishments and power backup systems in data centres.

Fault tolerance is a holistic attribute of a system, a system being composed of smaller constituent participating entities. It is a cooperative characteristic of the system to recover and continue functioning in the event of a failure.

Fault Tolerance in Network Systems

As an example, connection-oriented network protocols exhibit fault tolerance. In the event of a link failure at a network node, the sender employs an acknowledgement policy: if the sender does not receive an acknowledgement from the receiver within a specific time period, the packet is retransmitted.
The retransmitted packet is then routed through a network path in which all the intermediate links are functional. The TCP transport protocol is an example of such a fault-tolerant mechanism.

An interesting quality of a fault-tolerant system is that it allows for correction of the faulty component without having to shut down the entire system. For example, in the event of a network failure due to a link breakdown, the router (a component in the distributed network system) will send packets over an alternate link till the fault is repaired.

Fault tolerance is often achieved by employing redundancy in systems. In certain cases, a fail-safe operation is duplicated across multiple systems, and if a failure happens on one of the systems, a fall-back system takes over the operation and continues from the point where the original system left off.

Systems that are fault tolerant in nature have a well-defined recovery path. Two widely used recovery techniques are roll-back and roll-forward. In the case of the former, in the event of a failure, the system reverts its state to the last non-erroneous point of operation and resumes processing; in the case of the latter, the system, on detecting an error, corrects itself and continues with the processing.

3.5.1 Attributes of Fault-Tolerant Systems

For a distributed system to be able to resist complete failures, or in other words to be fault tolerant, it should be dependable. A dependable system should satisfy the requirements of availability, reliability, safety and maintainability.
MODULE 3 MCA-303SYSTEM SOFTWARE ADMN 2009 - ‘12 Availability is the metrics of a system which tells us the probability system is functioningproperly at any given instance in time. If a system fails and consequently recovers within a verysmall amount of time-—the system to a unavailable only for that very short duration. Thus, theprobability of the system being available would be high. On the other hand, reliability is the metrics of a system which indicate the ability of a system to operate without failures over a period of time. In the context of reliability, it is necessary for a system to function without failures. A highly available system does not necessary imply a highly reliable system. A system which stops every now and then, but for a very small duration of time, is a highly available system as the probability of finding it up and running is high; but since the system function without interruptions over an extended period of time, it cannot be said to be reliable. A system is said to exhibit safety if a failure in it does not lead to typical example would bethe flight control systems found in passenger airplanes are designed taking into account that a failure inone of its component does n< functioning of the system. In the context of dependable systems, maintainability implies that in' a failure, the system canbe repaired within a reasonable timeframe and without much effort. A maintainable system can insome cases be designed to automatically; correct errors. These self-recovery mechanisms are oftenpresent in high i systems.3.5.2 Errors, Failures and FaultsA distributed system provides a set of services to its users. When one or m< services becomeunavailable and does not perform as expected, the system is said to have failed. A failure in suchsystems is typically due to the erroneous state reaches as a result of a fault in one or more of itscomponents. 
In order to build fault-tolerant distributed systems, it is very important to study how failures occur. Failures are the effect of errors in the system; an error manifests itself as a component failure in the distributed system, and occurs due to a fault. This cause-effect sequence, i.e., the relationship between fault, error and failure, holds in all cases of system malfunction. Analysis of each of these components helps us design effective fault-tolerant systems.

3.5.2.1 Fault categories

Faults can be categorized into the following types:

Permanent: A permanent fault is one that remains until the defective part is replaced. As an example, a failure in a database server due to a hard disk fault remains until the bad disk is replaced with a good one, followed by restoration of data.

Transient: A transient fault happens once and then does not reappear; for example, a cellular phone fails to detect the mobile network in places such as tunnels and underground railway systems, but once the person comes out into the open, the system re-establishes connection with the nearest base station. Another example of a transient fault would be the disruption in wireless transmission due to solar flares. Solar flares are explosions in the sun's outer atmosphere which result in a huge release of electromagnetic radiation into space. These radiations sometimes disrupt telecommunications; the disruption automatically disappears once the effect of the solar flare subsides.

Sporadic: As the term suggests, these types of faults occur on and off. For example, due to a malfunction in the cooling system, a processor shuts down to prevent its circuit burning out, thereby bringing the system to a halt; once the temperature drops, the processor starts functioning again. Sporadic failures are the most difficult to detect and isolate. This makes them the most difficult to correct, and they account for most system failures.
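Because a transient fault vanishes on its own, a common way to mask it is simply to retry the operation a bounded number of times. The retry strategy is my addition for illustration, not something the text prescribes, and the names below are invented:

```python
# Hedged sketch of masking a transient fault by retrying. ConnectionError
# stands in for any transient fault (e.g. a momentary loss of signal); a
# permanent fault would keep failing and eventually be re-raised.

def with_retries(operation, attempts=3):
    """Run operation(); on a transient failure, retry up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except ConnectionError as error:   # stand-in for a transient fault
            last_error = error
    raise last_error                       # fault persisted: give up
```

A permanent fault simply exhausts the retry budget, which is one practical way the distinction between the two categories shows up in code.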
3.5.2.2 Failure models

In accordance with a specific failure classification scheme, failures can be categorized by their types into crash failures, omission failures, timing failures, response failures and arbitrary failures.

Crash failure: This occurs when a server stops due to a system malfunction. Before the crash happened, the server was functioning correctly. A server program terminated by the operating system due to an illegal memory address access would be a typical example of a crash. To recover from such situations, the server program needs to be restarted.

Omission failure: Such a failure takes place when a server does not send its response to a request. It might very well be the case that the server never received the request to start with. It might also be due to a transient failure in the communication media, which resulted in the total loss of the network traffic. An omission failure can also occur when the server has failed to transmit the response after processing the request.

Timing failures: These are noticed when a system is not able to respond to the requestor within a predefined time period. As an example, if an HTTP request takes too long to be completed by a web server, a timing failure happens, the result of which is the familiar "408 Request Timeout" response displayed by the browser. A timing failure might also happen if the server responds later than it is expected to. Timing failures might arise due to high server load when the request was issued, low communication bandwidth, etc. Timing failures also arise due to a response sent by the server too quickly, when the client was not expecting it.

Response failures: These are more severe forms of failures. When a system responds incorrectly to a request, a response failure is said to have taken place. Response failures are of two distinct types: value failure, in which the response to the request is incorrect, and state transition failure.
A state transition failure occurs when the system receives a request that it is not designed to handle. The system might process the request incorrectly and generate an incorrect response.

Arbitrary failures: These are the most serious forms of failures. They are also known as Byzantine failures. Byzantine refers to the Byzantine Generals' problem, in which a number of generals separated by distances need to decide whether to attack the enemy or retreat. Each general communicates the decision to the nearest general. To aggravate the problem, it might be that some of these generals are traitors and might maliciously alter the message. An arbitrary failure happens when the system responds incorrectly because of the presence of faulty components within it. For example, if a server interacts with several other servers to process a request and some of these servers are faulty, then the system fails to produce the correct output.

The manner in which a server system exhibits a crash failure can differ from system to system. Some systems might halt altogether due to a fault; the exigency handling in such systems might broadcast that the system is about to stop. Such failures are known as fail-stop. Fail-silent systems are those which do not intimate the interested parties that they are about to halt. In these systems, observer processes monitor the servers and initiate appropriate action, such as restarting the failed system, informing interested parties, and so on. The other class of systems is referred to as fail-safe. These systems, in the event of a failure, will respond to requests in such a manner that other processes would understand that all is not well with the server.

3.5.3 Fault Tolerant Design Techniques

3.5.3.1 Redundancy

Redundancy is often used to build fault-tolerant systems. Take the case of a passenger aeroplane, which typically has more than one engine. In the event of an engine failure, the other takes over; in other words, a redundant engine is installed to be used in the event of primary engine failure. Similarly,
enterprise databases are replicated and hosted on different systems. In the event of a disk crash on any of the systems, the others still continue to function normally. In yet another example, multiple copies of a critical process are run on a server. If any of the processes fails, the others still continue to function, thus providing uninterrupted service. This type of redundancy, where more than one instance of a critical system is deployed, is known as physical redundancy. Patterns exist that are employed to address fault tolerance using physical redundancy. Triple Modular Redundancy (TMR) is a technique where three instances are deployed for each of the components in a system. Each component at a particular stage receives input from all three components of the previous stage, and so on. A component in a particular stage accepts an input if it was provided by two or more of the components of the previous stage. If three different inputs are received, the system state is undefined.

Temporal redundancy, on the other hand, involves recording a series of events that happen in a system, and playing them back in case of a failure. This technique is heavily used in database transactions. A transaction can be started, a series of operations can be made, and then the transaction can be committed. If a system failure occurs, transactions that have been logged can be replayed to bring the system to the state it was in before the crash happened. Temporal redundancy is used in scenarios where transient faults or sporadic faults occur.

Another technique used in the design of fault-tolerant systems is information redundancy. In this case, extra data is added to the transmission, which helps in detecting and correcting errors at the receiver's end.
An example of such an algorithm is the Hamming code. Extra parity bits are added to the transmitted data in such a way that the receiver, on inspecting the data, will be able to ascertain whether the data is in error, and in certain cases could also correct the data.

3.6 GROUP COMMUNICATION

A system that employs physical redundancy (either hardware or software) uses groups to achieve fault tolerance. Group communication deals with the semantics of how requests and responses are propagated within a group of such redundant components. It is also related to the structure of such groups and the techniques employed in managing them.

The fundamental property of a group is that there exist multiple components that can handle a service request. When a request is sent to the group, all members of the group receive it. Thus, in the event of a system failure within the group, another system can act on the request. Formally, process groups are logical structures in which processes are managed, and the group itself behaves as an entity responsible for fulfilling requests. Process groups can vary over time. As an analogy, email groups found within an organization are created to respond to requests from clients; if a member of the group is away from office, another group member can step in and respond to the client's email.

3.6.1 Types of Groups

Groups can be flat or hierarchical. The presence of a group leader (or manager) within a group distinguishes the group as a hierarchical one. The group leader process is entrusted with a different set of responsibilities apart from processing client requests. As an example, a process within a group might be entrusted with delegating requests depending on the load, compute capability, etc., of the worker processes. As can be observed, hierarchical groups tend to reduce the fault-tolerance capability of a system by introducing a single point of failure, the group leader: if the group leader fails, the system becomes unusable.
On the other hand, in the case of a flat group, a single point of failure does not exist, but the decision-making process becomes complex. Consider the example of a web server. If a web server is designed as a flat group, when an HTTP request arrives, more than one process can
service the request. In such situations, a ballot is taken to determine the process which would respond to this request. If more than one process is handling it, the system must ensure that a single response is sent back to the client, instead of multiple responses from each of the processes that handled the request.

3.6.2 Group Management

Group communication, as described above, requires the ability to create or delete groups. Also required is the ability to join or leave a group. To address these requirements, several techniques have been proposed. In one such technique, a group server is identified and delegated with the responsibility of maintaining the group databases. All group management is done by this system. The major disadvantage of this scheme is that it introduces a single point of failure.

In contrast, if the group membership management is distributed over all group members, then messages for joining or leaving a group by an entity need to be transmitted to all the members of the group. In this scheme, every group member needs to maintain a database of currently available peers and should update this information in case a member unexpectedly ceases to exist due to a crash.

In order to create resilient hierarchical groups, the single point of failure needs to be eliminated. This is done by replicating the primary member, or group leader. The primary is responsible for coordinating all write operations within the group. The primary process is fixed, but in the event of a crash, any one of the backups can become the primary. The primary selection in such cases typically takes place through some voting algorithm.

Till now, you have learned the qualitative aspects of process groups. In order to create a process group, the size of the group is important.
Questions like how many processes need to be replicated need to be answered. A quantitative study of these attributes is used in the design of fault-tolerant systems. The resilience factor, R, of a fault-tolerant system is the number of component failures that the system can withstand, i.e., in spite of R component failures, the system continues to function as per its specification. Such systems are known as R-fault-tolerant systems. If the component failures are not arbitrary (not Byzantine), a total of R+1 processes per group would be
sufficient to create an R-fault-tolerant system. This is because if R processes fail, the (R+1)th process will continue to provide the service.

If the component failures are Byzantine in nature, the situation becomes complex. In such cases, it becomes difficult to distinguish between faulty components and correctly running components. To achieve the distinction, the number of processes required in such systems is 3R+1, out of which 2R+1 processes need to function correctly. Lamport et al. (1982) studied this problem as the Byzantine agreement problem and proved the above conclusions.

Fig. 3.4 An R-fault-tolerant system with a faulty process

3.7 TRANSACTIONS

3.7.1 Definition

A transaction is a collection of actions that produces a specific result in the system. The actions within a transaction are treated as indivisible: if any one of the actions fails, the entire transaction is said to have failed; the transaction succeeds if all the actions that constitute it succeed. When a transaction succeeds, the changes are made permanent through a commit process.

A distributed transaction is a special case in which the individual actions happen over two or more networked computer systems. In this case, to make the changes permanent, a distributed commit is performed. A distributed commit is a special case of a distributed transaction: here, the transaction is the commit itself.

A system capable of performing distributed commits employs a coordinator mechanism. The coordinator is responsible for communicating to the participant processes (also known as cohorts) whether a commit needs to be performed or not. This method is known as one-phase commit. The fundamental weakness of this scheme is that the coordinator is unaware of whether or not the cohorts were able to successfully commit the transaction and hence cannot take corrective measures in case of a commit failure.
To circumvent the problem, other advanced techniques have been developed. Here, you will learn the two-phase and three-phase commits.
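The weakness of the one-phase scheme is easy to see in a sketch: the coordinator fires DO_COMMIT at every cohort and never learns whether any of them succeeded. The class names and the `healthy` flag are illustrative assumptions:

```python
# Hedged sketch of one-phase commit, illustrating its stated weakness: the
# coordinator tells every cohort to commit but receives no feedback, so a
# cohort failure goes unnoticed and cannot be corrected.

class Cohort:
    def __init__(self, healthy=True):
        self.healthy = healthy
        self.committed = False

    def do_commit(self):
        if self.healthy:          # a faulty cohort silently fails to commit
            self.committed = True

def one_phase_commit(cohorts):
    for cohort in cohorts:
        cohort.do_commit()        # fire-and-forget: no acknowledgement
    # The coordinator returns with no idea whether all cohorts committed.
```

After a run with one faulty cohort, the system is left inconsistent (some cohorts committed, some did not), which is exactly the problem the two-phase and three-phase protocols below address.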
3.7.2 Two-Phase Commit

The two-phase commit (2PC) protocol is outlined as follows:
1. The cohorts wait for a QUERY_TO_COMMIT message from the coordinator. A cohort in this situation is said to be in the INIT state.
2. The coordinator multicasts a QUERY_TO_COMMIT message and waits for responses from the cohorts. The coordinator is now in the LISTEN state.
3. On receiving a QUERY_TO_COMMIT message, a cohort:
 a. Responds with a READY_TO_COMMIT message if it is prepared to locally commit the transaction, or
 b. Responds with a NOT_READY_TO_COMMIT message if it is not prepared to locally commit the transaction,
and waits for further messages from the coordinator. This state of the cohort is the READY state.
4. If the coordinator receives a READY_TO_COMMIT response from all the cohorts, it multicasts a DO_COMMIT message and transitions to the COMMIT state.
5. On the other hand, if the coordinator receives a NOT_READY_TO_COMMIT message from any of the cohorts, it abandons the transaction and multicasts a DO_ABORT message. The coordinator is said to be in the ABORT state.
6. When a cohort receives a DO_COMMIT message, it commits the ongoing transaction locally and transitions to the COMMIT state. On the contrary, if it receives a DO_ABORT message, it locally aborts the current transaction and transitions to the ABORT state.

As can be noticed, in this protocol there are situations in which either the coordinator or a cohort waits infinitely for a message.
2. Likewise, a coordinator might time out in the LISTEN state waiting for either READY_TO_COMMIT or NOT_READY_TO_COMMIT messages from the cohorts. In such cases, it would multicast a DO_ABORT message and transition to the ABORT state.
3. If the coordinator times out in the PRE_COMMIT state, it will deduce that one of the cohorts has crashed, and since all the cohorts have agreed to commit (as the coordinator is in the PRE_COMMIT state), it will multicast a DO_COMMIT message and transition to the COMMIT state.
The cohort which had failed would recover, and on recovery would commit the transaction.
4. If a cohort times out in the READY state or the PRE_COMMIT state (due to a coordinator failure), it will query the other cohorts, as in the case of 2PC. If it finds any cohort in the INIT state or in the ABORT state, it can safely abort the transaction. If it finds a cohort in the COMMIT state, it will commit the transaction as well. If it finds that the majority of the cohorts are in the PRE_COMMIT state, it can safely commit the transaction. On the other hand, if all the cohorts queried are in the READY state, the cohort should abort the transaction.

The salient feature of 3PC is that a cohort can be in the INIT state if and only if no other cohort is in the PRE_COMMIT state. This is due to the fact that a cohort can transition to the PRE_COMMIT state only if the coordinator has transitioned to the PRE_COMMIT state, which can happen only when the coordinator has received a READY_TO_COMMIT message from all the cohorts (i.e., all the cohorts have progressed from the INIT state).
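The 2PC message flow outlined above can be sketched as follows. The message and state names follow the text; everything else (classes, the `can_commit` flag, running the whole exchange in one process with no timeouts) is an illustrative assumption:

```python
# Hedged sketch of two-phase commit.
# Phase 1: the coordinator multicasts QUERY_TO_COMMIT and collects votes.
# Phase 2: it multicasts DO_COMMIT only if every cohort voted
# READY_TO_COMMIT; otherwise it multicasts DO_ABORT.

class Cohort:
    def __init__(self, can_commit=True):
        self.can_commit = can_commit
        self.state = "INIT"

    def vote(self):                        # reply to QUERY_TO_COMMIT
        self.state = "READY"
        return "READY_TO_COMMIT" if self.can_commit else "NOT_READY_TO_COMMIT"

    def decide(self, message):             # act on the coordinator's decision
        self.state = "COMMIT" if message == "DO_COMMIT" else "ABORT"

def two_phase_commit(cohorts):
    votes = [c.vote() for c in cohorts]            # phase 1 (LISTEN)
    if all(v == "READY_TO_COMMIT" for v in votes): # phase 2
        decision = "DO_COMMIT"
    else:
        decision = "DO_ABORT"
    for cohort in cohorts:
        cohort.decide(decision)
    return decision
```

A single NOT_READY_TO_COMMIT vote aborts the whole transaction, and every cohort ends in the same final state, which is the property 2PC exists to provide.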
3.8 CHECKPOINTING

In distributed systems, there are situations when a participating process might crash. If the system is fault tolerant, it is imperative to have measures in place that allow the process to recover from such a crash. Crash recovery for a process is defined as the mechanism that enables the process to get back to the state it was in before the crash occurred. In order to do this, the process should be able to safely store its state from time to time.

A stable storage is defined as a data store that can withstand system failures, process crashes, media failures, etc.

3.8.1 Checkpoint

A checkpoint is a snapshot of the state of a process at a particular instant. The snapshot is saved to stable storage. Typically, a process checkpoint is created after a major operation has been completed by the process, and also at regular intervals of the process's execution. In the event of a process crash and subsequent restart, the process is started and brought to the last state it was in by loading the latest checkpoint from the stable storage. This technique is known as backward recovery. A fault-tolerant distributed system uses backward recovery to re-establish the state of a previously crashed process to the last known good state before the crash. The backward recovery is achieved by using the process space and state data saved in a checkpoint. In the case of a distributed system, the checkpoint data, apart from storing individual process states, should also contain the last known good global state of the entire system. This state is known as the distributed snapshot of the system.

3.8.2 Types of Checkpoints

Various checkpointing techniques have been developed for distributed systems. One of them is independent checkpointing. In this case, a process creates a checkpoint without coordinating with other processes.
The main advantage of this mechanism is low overhead, since the processes do not have to adhere to additional protocols for creating checkpoints. The main disadvantage is that since the processes have no knowledge of when other processes have taken a checkpoint, it is quite possible that when the processes are recovered to a particular checkpoint, the process states do not represent a globally coherent system state; hence, the processes might have to be rolled back to the previous checkpoint, and so on and so forth. As a result, the system might reach its initial state instead of some fruitful intermediate state. This phenomenon is known as the domino effect.

A variant of the independent checkpointing mechanism is coordinated checkpointing. In this protocol, all processes synchronize before they store the checkpoint. This ensures that the last known good global state (or the globally consistent system state) is written to the checkpoint. In the case of a crash recovery, since the checkpoints of each of the participating processes were synchronized, the system is guaranteed to reach a consistent state on rollback. Coordinated checkpointing can be implemented along the same lines as 2PC. In this technique, the coordinator multicasts a READY_TO_CHECKPOINT message to all cohorts (the processes that will do the checkpointing). A cohort, on receiving the message, creates a local checkpoint, queues all subsequent operations that it was executing, and sends an acknowledgement to the coordinator. The coordinator, upon receiving an acknowledgement from all the
cohorts, multicasts a CHECKPOINT_DONE message. When a cohort receives this message, it resumes the queued operations. As can be noticed, this method is blocking in nature.

3.9 SUMMARY

Distributed systems are abundant in the current computing landscape and form the backbone of numerous critical services. It is therefore imperative that such systems are dependable and provide service with a high degree of availability in an error-free manner. In this unit, you have learned the different types of failures that a distributed system can encounter, and classified these failures into categories based on their nature and severity. The unit also discussed the concept of fault tolerance, and why fault tolerance is important in the context of distributed systems. You have learned the techniques available to build robust systems, the principles behind these techniques, and how these techniques can be applied to build real-world fault-tolerant systems.

The unit discussed replication and coherency models and protocols, classified fault tolerance, and looked at problems faced by designers of fault-tolerant systems and the steps they take to circumvent them. You have learned how clusters and groups of processes can be effective in the design of fault-tolerant systems. Moreover, you have learned how the concepts of checkpoints, transactions, commit, abort and rollback are used to develop systems that recover automatically in the event of a fault.

3.10 KEY TERMS
 • Replication: The process of maintaining multiple copies of data at different distributed locations.
 • Multicasting: The process of transmitting a message to a group of recipients.
 • Broadcasting: The process of sending a message to every entity in a particular network.
 • Weak consistency: Weakens the consistency requirements so that global synchronization can be avoided.
 • Permanent replicas: The initial set of replicas that constitute a distributed data reserve.
 • Fault tolerance: The quality of a system to continue functioning even when some of its components have failed.