CU IDOL SELF LEARNING MATERIAL (SLM)

COMPONENTS OF A DISTRIBUTED DBMS

Figure 3.4 Components of DBMS

First Major Component

1. The user interface handler is responsible for interpreting user commands as they come in, and for formatting the result data as it is sent to the user.
2. The semantic data controller uses the integrity constraints and authorizations that are defined as part of the global conceptual schema to check whether the user query can be processed.
3. The global query optimizer and decomposer determines an execution strategy that minimizes a cost function, and translates the global queries into local ones using the global and local conceptual schemas as well as the global directory. The global query optimizer is responsible, among other things, for generating the best strategy to execute distributed join operations.
4. The distributed execution monitor coordinates the distributed execution of the user request. The execution monitor is also called the distributed transaction manager. In executing queries in a distributed fashion, the execution monitors at various sites may, and usually do, communicate with one another.

Second Major Component

1. The local query optimizer, which actually acts as the access path selector, is responsible for choosing the best access path to any data item.
2. The local recovery manager is responsible for making sure that the local database remains consistent even when failures occur.
3. The run-time support processor physically accesses the database according to the physical commands in the schedule generated by the query optimizer. The run-time support processor is the interface to the operating system and contains the database buffer (or cache) manager.

Distributed Database Design

Top-Down Design Process

Figure 3.5 Top down design process
Reasons for Fragmentation: Parallel Execution, Level of Concurrency

Degree of Fragmentation: Horizontal and Vertical

Figure 3.6 Design fragments
Figure 3.7 Design fragments

Hybrid Fragmentation

Figure 3.8 Hybrid Fragmentation
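Horizontal and vertical fragmentation, named above, can be illustrated with a small sketch. Horizontal fragmentation splits a relation by rows, while vertical fragmentation splits it by columns and repeats the key in every fragment. The EMPLOYEE relation, its column names, and the helper functions below are hypothetical and chosen only for illustration; a real DDBMS does this inside the database engine rather than in application code.

```python
# Hypothetical EMPLOYEE relation; names and values are illustrative only.
EMPLOYEE = [
    {"emp_id": 1, "name": "Asha", "city": "Delhi", "salary": 52000},
    {"emp_id": 2, "name": "Ravi", "city": "Mumbai", "salary": 61000},
    {"emp_id": 3, "name": "Meena", "city": "Delhi", "salary": 48000},
]


def horizontal_fragments(rows, column):
    """Split rows into fragments, one per distinct value of `column`."""
    fragments = {}
    for row in rows:
        fragments.setdefault(row[column], []).append(row)
    return fragments


def vertical_fragment(rows, key, columns):
    """Project every row onto the key plus the chosen columns."""
    return [{c: row[c] for c in (key, *columns)} for row in rows]


def rebuild_from_horizontal(fragments, key):
    """Union of the horizontal fragments, ordered by key."""
    merged = [row for frag in fragments.values() for row in frag]
    return sorted(merged, key=lambda r: r[key])


def rebuild_from_vertical(left, right, key):
    """Join two vertical fragments on the shared key."""
    right_by_key = {row[key]: row for row in right}
    return [{**row, **right_by_key[row[key]]} for row in left]


by_city = horizontal_fragments(EMPLOYEE, "city")                  # rows split by city
personal = vertical_fragment(EMPLOYEE, "emp_id", ["name", "city"])
payroll = vertical_fragment(EMPLOYEE, "emp_id", ["salary"])
```

In this sketch the original relation is recovered by a union of the horizontal fragments or by a join of the vertical fragments on emp_id. Applying both kinds of split in turn would give a hybrid fragmentation.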
Hybrid = Horizontal + Vertical Fragmentation

Replication: Non-Replicated, Partially Replicated, Fully Replicated

Correctness Rules of Fragmentation

1. Completeness. If a relation instance R is decomposed into fragments F_R = {R_1, R_2, ..., R_n}, each data item that can be found in R can also be found in one or more of the R_i's. This property is identical to the lossless decomposition property.
2. Reconstruction. If a relation R is decomposed into fragments F_R = {R_1, R_2, ..., R_n}, it should be possible to define a relational operator that reconstructs R from the fragments.
3. Disjointness. If a relation R is horizontally decomposed into fragments F_R = {R_1, R_2, ..., R_n} and data item d_i is in R_j, it is not in any other fragment R_k (k ≠ j).

TYPES OF DISTRIBUTED DATABASES

Distributed databases can be broadly classified into homogeneous and heterogeneous distributed database environments, each with further sub-divisions, as shown in the following illustration.

Figure 3.9 Types of distributed database

i. Homogeneous Distributed Databases

In a homogeneous distributed database, all the sites use identical DBMS and operating systems. Its properties are −

The sites use very similar software.
The sites use identical DBMS or DBMS from the same vendor.
Each site is aware of all other sites and cooperates with other sites to process user requests.
The database is accessed through a single interface as if it were a single database.

Types of Homogeneous Distributed Database

There are two types of homogeneous distributed database −

Autonomous − Each database is independent and functions on its own. The databases are integrated by a controlling application and use message passing to share data updates.
Non-autonomous − Data is distributed across the homogeneous nodes, and a central or master DBMS co-ordinates data updates across the sites.

ii. Heterogeneous Distributed Databases

In a heterogeneous distributed database, different sites have different operating systems, DBMS products and data models. Its properties are −

Different sites use dissimilar schemas and software.
The system may be composed of a variety of DBMSs such as relational, network, hierarchical or object oriented.
Query processing is complex due to dissimilar schemas.
Transaction processing is complex due to dissimilar software.
A site may not be aware of other sites, so there is limited co-operation in processing user requests.

Types of Heterogeneous Distributed Databases

Federated − The heterogeneous database systems are independent in nature and are integrated so that they function as a single database system.
Un-federated − The database systems employ a central coordinating module through which the databases are accessed.

Distributed DBMS Architectures
DDBMS architectures are generally developed depending on three parameters −

Distribution − It states the physical distribution of data across the different sites.
Autonomy − It indicates the distribution of control of the database system and the degree to which each constituent DBMS can operate independently.
Heterogeneity − It refers to the uniformity or dissimilarity of the data models, system components and databases.

Architectural Models

Some of the common architectural models are −

Client-Server Architecture for DDBMS
Peer-to-Peer Architecture for DDBMS
Multi-DBMS Architecture

a. Client-Server Architecture for DDBMS

This is a two-level architecture where the functionality is divided between servers and clients. The server functions primarily encompass data management, query processing, optimization and transaction management. Client functions consist mainly of the user interface, although clients also perform some functions such as consistency checking and transaction management.

The two client-server architectures are −

Single Server Multiple Client
Multiple Server Multiple Client (shown in the following diagram)
Figure 3.10 Client-Server architecture database

b. Peer-to-Peer Architecture for DDBMS

In these systems, each peer acts both as a client and as a server for imparting database services. The peers share their resources with other peers and co-ordinate their activities.

This architecture generally has four levels of schemas −

Global Conceptual Schema − Depicts the global logical view of data.
Local Conceptual Schema − Depicts logical data organization at each site.
Local Internal Schema − Depicts physical data organization at each site.
External Schema − Depicts user view of data.

Figure 3.11 Peer to peer architecture database

c. Multi-DBMS Architectures

This is an integrated database system formed by a collection of two or more autonomous database systems.

Multi-DBMS can be expressed through six levels of schemas −

Multi-database View Level − Depicts multiple user views comprising subsets of the integrated distributed database.
Multi-database Conceptual Level − Depicts the integrated multi-database, comprising global logical multi-database structure definitions.
Multi-database Internal Level − Depicts the data distribution across different sites and the multi-database to local data mapping.
Local database View Level − Depicts public view of local data.
Local database Conceptual Level − Depicts local data organization at each site.
Local database Internal Level − Depicts physical data organization at each site.

There are two design alternatives for multi-DBMS −

Model with multi-database conceptual level.
Model without multi-database conceptual level.

Figure 3.12 Multi-DBMS
Figure 3.13 Model without Multi database conceptual level

Design Alternatives

The distribution design alternatives for the tables in a DDBMS are as follows −

Non-replicated and non-fragmented
Fully replicated
Partially replicated
Fragmented
Mixed

Non-replicated & Non-fragmented

In this design alternative, different tables are placed at different sites. Data is placed in close proximity to the site where it is used most. This alternative is most suitable for database systems in which only a small percentage of queries need to join information from tables placed at different sites. If an appropriate distribution strategy is adopted, this design alternative helps to reduce the communication cost during data processing.
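The placement idea described above, putting each table at the site that uses it most, can be sketched in a few lines. The table names, site names, and access counts below are hypothetical, and counting remote accesses is only a rough stand-in for communication cost, since a real design would also weigh data volume and update traffic.

```python
# Hypothetical workload: ACCESS_LOG[table][site] = queries per day.
ACCESS_LOG = {
    "CUSTOMER": {"Delhi": 120, "Mumbai": 40, "Chennai": 15},
    "INVOICE": {"Delhi": 30, "Mumbai": 95, "Chennai": 20},
    "PRODUCT": {"Delhi": 10, "Mumbai": 25, "Chennai": 80},
}


def place_tables(access_log):
    """Assign each table to the single site that queries it most often."""
    return {table: max(sites, key=sites.get) for table, sites in access_log.items()}


def remote_accesses(access_log, placement):
    """Count the accesses that must cross the network under a placement."""
    return sum(
        count
        for table, sites in access_log.items()
        for site, count in sites.items()
        if site != placement[table]
    )


placement = place_tables(ACCESS_LOG)
everything_in_delhi = {table: "Delhi" for table in ACCESS_LOG}

print(placement)
print(remote_accesses(ACCESS_LOG, placement))
print(remote_accesses(ACCESS_LOG, everything_in_delhi))
```

With these assumed numbers, placing each table at its busiest site leaves fewer remote accesses than keeping every table at one site, which is the reduction in communication cost that the paragraph refers to.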
Fully Replicated

In this design alternative, one copy of all the database tables is stored at each site. Since each site has its own copy of the entire database, queries are very fast and require negligible communication cost. On the other hand, the massive redundancy in data makes update operations very costly. Hence, this alternative is suitable for systems that must handle a large number of queries but only a small number of database updates.

Partially Replicated

Copies of tables or portions of tables are stored at different sites. The tables are distributed according to their frequency of access, which takes into account the fact that the frequency of accessing the tables varies considerably from site to site. The number of copies of the tables (or portions) depends on how frequently the access queries execute and on the sites that generate them.

Fragmented

In this design, a table is divided into two or more pieces referred to as fragments or partitions, and each fragment can be stored at a different site. This reflects the fact that it seldom happens that all data stored in a table is required at a given site. Moreover, fragmentation increases parallelism and provides better disaster recovery. Here, there is only one copy of each fragment in the system, i.e. no redundant data.

The three fragmentation techniques are −

Vertical fragmentation
Horizontal fragmentation
Hybrid fragmentation

Mixed Distribution

This is a combination of fragmentation and partial replication. Here, the tables are initially fragmented in any form (horizontal or vertical), and then these fragments are partially replicated across the different sites according to the frequency of accessing the fragments.

TRANSPARENCY FEATURES

DDBMS Transparency Features

This topic discusses five types of transparency that give the illusion of working with a local DBMS. Each of them has its own feature set:

Distribution transparency - This feature hides the distributed aspects of the data from the end user. Distribution transparency is the property of distributed databases by virtue of which the internal details of the distribution are hidden from the users. The DDBMS designer may choose to fragment tables, replicate the fragments and store them at different sites. However, since users are oblivious of these details, they find the distributed database as easy to use as any centralized database.

Distribution transparency can be implemented at one of three levels, which differ in how much they hide:

a. Fragmentation transparency - At this level, neither a user nor a programmer has to reference the database fragment (by name or location) in which a particular data structure resides. This is high transparency. Fragmentation transparency enables users to query any table as if it were unfragmented. Thus, it hides the fact that the table the user is querying is actually a fragment or a union of some fragments. It also conceals the fact that the fragments are located at diverse sites. This is somewhat similar to SQL views, where the user may not know that they are using a view of a table instead of the table itself.

b. Location transparency - At this level, a user or a programmer has to reference the database fragment by name but not by location. This is medium transparency. Location transparency ensures that the user can query any table(s) or fragment(s) of a table as if they were stored locally at the user's site. The fact that the table or its fragments are stored at a remote site in the distributed database system should be completely hidden from the end user. The address of the remote site(s) and the access mechanisms are completely hidden.
In order to incorporate location transparency, the DDBMS should have access to an updated and accurate data dictionary and a DDBMS directory that contains the details of the locations of data.

c. Local mapping transparency - At this level, a user or a programmer has to reference the database fragment by name and by location. This is low transparency, which is not very transparent.

Textbook examples of the query syntax for each of the three cases often use the word "NODE" to stand for a location; it is only symbolic, not the actual keyword that would be used. Distribution transparency is supported through shared documentation, either a distributed data dictionary (DDD) or a distributed data catalogue (DDC), which documents the fragments at each location.

Transaction transparency - This feature allows a transaction to update data at more than one location, as may be required for consistency of the database. This involves protocols to update the database locally and remotely, and to roll back or commit transactions in both locations. Vocabulary for this feature:

a. remote request - allows a single command to be carried out by one remote data processor
b. remote transaction - allows a set of commands (a transaction) to be carried out by one remote data processor
c. distributed request - allows one command to be carried out by more than one remote data processor
d. distributed transaction - allows a set of commands (a transaction) to be carried out by more than one remote data processor

Failure transparency - This is a redundancy feature of the distributed service: it continues to provide service if one of the data fragments is inaccessible. In the event of an individual node failure, the DDBMS should ensure that the failed node's operations continue at other sites. Failure transparency is difficult to provide because a node failure may occur for many different reasons, including communication or software failures. Complicated concurrency control techniques, as well as data fragmentation and replication, also add to the complexity of providing continuous, transparent operation whilst ensuring database integrity and consistency.

Performance transparency - This is a rather optimistic concept: that the system will not be slowed down by distance, by fragmentation itself, or by translation from one platform to another. In case of a slow link or an unresponsive segment, the network will access an alternative, replicated copy of the data elsewhere. This leads to replica transparency, which means that the particular data replica being used for a transaction is not important to the user, since all replicas of the same data will be synchronized as soon as possible. A DDBMS provides performance transparency if the system performs like a single centralised system without any significant decline in database performance.
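The four terms in the transaction transparency vocabulary above (remote request, remote transaction, distributed request, distributed transaction) differ along two axes: one command versus a set of commands, and one remote data processor versus several. The sketch below encodes just that distinction. The function name and its parameters are hypothetical and are not part of any DDBMS interface.

```python
def classify(num_commands, remote_sites):
    """Name the access pattern using the vocabulary a-d above.

    num_commands: how many commands are submitted as one unit of work
    remote_sites: set of remote data processors that the unit touches
    """
    if num_commands < 1 or not remote_sites:
        raise ValueError("need at least one command and one remote site")
    # One remote data processor -> "remote"; more than one -> "distributed".
    scope = "remote" if len(remote_sites) == 1 else "distributed"
    # A single command -> "request"; a set of commands -> "transaction".
    unit = "request" if num_commands == 1 else "transaction"
    return f"{scope} {unit}"


print(classify(1, {"site_B"}))            # remote request
print(classify(3, {"site_B"}))            # remote transaction
print(classify(1, {"site_B", "site_C"}))  # distributed request
print(classify(2, {"site_B", "site_C"}))  # distributed transaction
```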
Optimal database performance with minimal associated costs is the goal of query optimisation. Querying is one of the main functions of a database; it is used to transform data in the database into information. Most current databases require the user to specify what is required in a query, but not how to acquire it. A typical query may require data from multiple tables, contain nested subqueries and use aggregate operators. Even in a centralised database system, a query can be executed in a number of ways to produce the required result, so the DBMS must consider the possible access strategies. The DBMS then selects the most efficient execution strategy, which includes re-ordering operations for better performance as well as minimising the overhead costs of disk access, memory use etc. This process is referred to as query optimisation.

Heterogeneity transparency - This one is harder to spell, so it must be important. It means that the DDBMS will appear to any user as a single system, even when it is actually a federation of systems of different types: network, relational, or hierarchical. In this sense, "federate" means to give trust to and receive trust from another system.

SUMMARY

A single database can be divided into several fragments. The fragments can be stored on different computers within a network. Processing, too, can be dispersed among several different network sites, or nodes. The multisite database forms the core of the distributed database system.

The growth of distributed database systems has been fostered by the dispersion of business operations across the country and around the world, along with the rapid pace of technological change that has made local and wide area networks practical and more reliable. The network-based distributed database system is very flexible: it can serve the needs of a small business operating two stores in the same town while at the same time meeting the needs of a global business.

Although a distributed database system requires a more sophisticated DBMS, the end user should not be burdened by increased operational complexity. That is, the greater complexity of a distributed database system should be transparent to the end user.

The distributed database management system (DDBMS) treats a distributed database as a single logical database; therefore, the basic design concepts you learned in earlier chapters apply. However, although the end user need not be aware of the distributed database's special characteristics, the distribution of data among different sites in a computer network clearly adds to a system's complexity.
For example, the design of a distributed database must consider the location of the data and the partitioning of the data into database fragments.

KEY WORDS/ABBREVIATIONS

Database clustering: Connecting two or more servers and instances to a database, often for the advantages of fault tolerance, load balancing, and parallel processing.
Data warehouse: A central repository of data integrated from one or more sources and organised for reporting and analysis rather than for day-to-day transaction processing.
Distributed relational database: A database that contains objects such as tables that are part of different yet interconnected systems.
Distributed system: A collection of individual computers that work together and appear to work as a single system. This requires access to a central database, multiple copies of a database on each computer, or database partitions on each machine.
Sharding: Also known as "horizontal partitioning," sharding is where a database is split into several pieces, usually to improve the speed and reliability of an application.
Strong consistency: A database concept that refers to the inability to commit transactions that violate a database's rules for data validity.
Site autonomy: Each server participating in a distributed database is administered independently (for security and backup operations) from the other databases, as though each database were a non-distributed database.
Remote query: A query that selects information from one or more remote tables, all of which reside at the same remote node.
Remote update: An update that modifies data in one or more tables, all of which are located at the same remote node.
Index: A separate structure that allows fast access to a table's rows based on the data values of the columns used in the index.
Replication: A process where selected modifications in a master database are replicated (re-played) into another database.

LEARNING ACTIVITY

1. How are the different transparencies implemented in a DDBMS?
2. Carry out a comparative study of distributed databases and client/server architecture with reference to different company profiles.
UNIT END QUESTIONS (MCQ AND DESCRIPTIVE)

A. Descriptive Type Questions

1. Explain the difference between a distributed database and distributed processing.
2. Explain a fully distributed database management system.
3. Explain the components of a DDBMS.
4. List and explain the transparency features of a DDBMS.
5. Define and explain the different types of distribution transparency.

B. Multiple Choice Questions

1. Which of the following is not a promise of distributed database?
a) Network Transparency
b) Replication Transparency
c) Fragmentation Transparency
d) None of the above

2. The mode used when a transaction wants to edit a data item is called .......
a) Exclusive Mode
b) Shared Mode
c) Inclusive Mode
d) Unshared Mode

3. In distributed system, each processor has its own
a) local memory
b) Clock
c) Both local memory and clock
d) None of the mentioned
4. Processes on the remote systems are identified by
a) host ID
b) Host name and identifier
c) Identifier
d) Process ID

5. Which routing technique is used in a distributed system?
a) Fixed routing
b) Virtual routing
c) Dynamic routing
d) all of the mentioned

6. The capability of a system to adapt to an increased service load is called
a) scalability
b) tolerance
c) capacity
d) None of the mentioned

7. If one site fails in distributed system
a) The remaining sites can continue operating
b) All the sites will stop working
c) Directly connected sites will stop working
d) None of the mentioned

8. Transparency that enables multiple instances of resources to be used, is called
a) Replication transparency
b) Scaling transparency
c) Concurrency transparency
d) Performance transparency

9. A paradigm of multiple autonomous computers, having a private memory, communicating through a computer network, is known as
a) Distributed computing
b) Cloud computing
c) Centralized computing
d) Parallel computing

Answer

1. d  2. a  3. c  4. b  5. d  6. a  7. a  8. a  9. a

REFERENCES

Elmasri R., Navathe S.B. (2015). Fundamentals of Database Systems. New Delhi: Pearson Education.
Date C.J. (2004). An Introduction to Database Systems. 7th Edition, New Delhi: Pearson Education.
Bipin Desai (2012). Introduction to Database Management system. New Delhi: Galgotia Pub.
Ashdown, Lance; Kyte, Tom (September 2011). "Oracle Database Concepts, 11g Release 2 (11.2)". Oracle Corporation. Archived from the original on 2013-07-15. Retrieved 2013-07-17. "Distributed SQL synchronously accesses and updates data distributed among multiple databases. [...] Distributed SQL includes distributed queries and distributed transactions."
"Clusterpoint database distributed storage multi-datacenter replication". Clusterpoint.
"Riak database multi-datacenter replication". Basho.
Security, Networx. "Distributed Database". www.networxsecurity.org. Retrieved 2018-02-06.
UNIT 4: DISTRIBUTED DATABASE MANAGEMENT SYSTEM - 2

Structure

Learning Objectives
Introduction - Levels of Data
Process of Distribution
Single-Site Processing, Single-Site Data (SPSD)
Multiple-Site Processing, Single-Site Data (MPSD)
Multiple-Site Processing, Multiple-Site Data (MPMD)
Summary
Key Words/Abbreviations
Learning Activity
Unit End Questions (MCQ and Descriptive)
References

LEARNING OBJECTIVES

After studying this unit, you will be able to:

Explain the levels of data
Describe the process of distribution

INTRODUCTION

The DDBMS must include at least the following components:

Computer workstations or remote devices (sites or nodes) that form the network system. The distributed database system must be independent of the computer system hardware.

Network hardware and software components that reside in each workstation or device. The network components allow all sites to interact and exchange data. Because the components (computers, operating systems, network hardware, and so on) are likely to be supplied by different vendors, it is best to ensure that distributed database functions can be run on multiple platforms.

Communications media that carry the data from one node to another. The DDBMS must be communications media-independent; that is, it must be able to support several types of communications media.

The transaction processor (TP), which is the software component found in each computer or device that requests data. The transaction processor receives and processes the application's data requests (remote and local). The TP is also known as the application processor (AP) or the transaction manager (TM).

The data processor (DP), which is the software component residing on each computer or device that stores and retrieves data located at the site. The DP is also known as the data manager (DM). A data processor may even be a centralized DBMS.

The protocols determine how the distributed database system will:

• Interface with the network to transport data and commands between data processors (DPs) and transaction processors (TPs).
• Synchronize all data received from DPs (TP side) and route retrieved data to the appropriate TPs (DP side).
• Ensure common database functions in a distributed system. Such functions include security, concurrency control, backup, and recovery.

DPs and TPs can be added to the system without affecting the operation of the other components. A TP and a DP can reside on the same computer, allowing the end user to access local as well as remote data transparently. In theory, a DP can be an independent centralized DBMS with proper interfaces to support remote access from other independent DBMSs in the network.

Levels of Data

Current database systems can be classified on the basis of how process distribution and data distribution are supported. For example, a DBMS may store data in a single site (centralized DB) or in multiple sites (distributed DB) and may support data processing at a single site or at multiple sites.
Figure 4.1 Levels of Data
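The classification above, by whether processing and data each sit at one site or at several, can be written as a small lookup. This is only a sketch. The labels SPSD, MPSD, and MPMD are the scenario names from this unit's outline, and the function name is hypothetical. The text does not name the combination of single-site processing with multiple-site data, so the sketch returns None for it rather than inventing a label.

```python
def classify_system(processing_sites, data_sites):
    """Map the number of processing sites and data sites to a scenario label."""
    if processing_sites < 1 or data_sites < 1:
        raise ValueError("a system needs at least one processing site and one data site")
    if data_sites == 1:
        # Single-site data: centralized database.
        return "SPSD" if processing_sites == 1 else "MPSD"
    if processing_sites == 1:
        # Multiple-site data with single-site processing is not named in this unit.
        return None
    # Multiple-site processing and multiple-site data: fully distributed.
    return "MPMD"


print(classify_system(1, 1))  # SPSD
print(classify_system(4, 1))  # MPSD
print(classify_system(4, 3))  # MPMD
```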
PROCESS DISTRIBUTION

In distributed processing, a database's logical processing is shared among two or more physically independent sites that are connected through a network. For example, the data input/output (I/O), data selection, and data validation might be performed on one computer, and a report based on that data might be created on another computer.

A distributed database, on the other hand, stores a logically related database over two or more physically independent sites. The sites are connected via a computer network. In contrast, the distributed processing system uses only a single-site database but shares the processing chores among several sites. In a distributed database system, a database is composed of several parts known as database fragments. The database fragments are located at different sites and can be replicated among various sites. Each database fragment is, in turn, managed by its local database process.

The different types of data and process distribution are as follows.

Single-Site Processing, Single-Site Data (SPSD):

In the single-site processing, single-site data (SPSD) scenario, all processing is done on a single host computer (single-processor server, multiprocessor server, mainframe system) and all data are stored on the host computer's local disk system. Processing cannot be done on the end user's side of the system. Such a scenario is typical of most mainframe and midrange server computer DBMSs. The DBMS is located on the host computer, which is accessed by dumb terminals connected to it.

Figure 4.2 Single site processing
In the above figure you can see that the functions of the TP and the DP are embedded within the DBMS located on a single computer. The DBMS usually runs under a time-sharing, multitasking operating system, which allows several processes to run concurrently on a host computer accessing a single DP. All data storage and data processing are handled by a single host computer. This scenario is also typical of the first generation of single-user microcomputer databases.

Figure 4.3 Single data processing

Multiple-Site Processing, Single-Site Data (MPSD):

Under the multiple-site processing, single-site data (MPSD) scenario, multiple processes run on different computers sharing a single data repository. Typically, the MPSD scenario requires a network file server running conventional applications that are accessed through a network. Many multiuser accounting applications running under a personal computer network fit such a description. Consider the following figure.

Figure 4.4 Multiple data processing
As you examine the figure above, note that:

• The TP on each workstation acts only as a redirector to route all network data requests to the file server.
• The end user sees the file server as just another hard disk. Because only the data storage input/output (I/O) is handled by the file server's computer, the MPSD offers limited capabilities for distributed processing.
• The end user must make a direct reference to the file server in order to access remote data. All record- and file-locking activities are done at the end-user location.
• All data selection, search, and update functions take place at the workstation, thus requiring that entire files travel through the network for processing at the workstation. Such a requirement increases network traffic, slows response time, and increases communication costs.

For example, suppose the file server computer stores a CUSTOMER table containing 10,000 data rows, 50 of which have balances greater than $1,000, and a workstation submits the query:

SELECT * FROM CUSTOMER WHERE CUS_BALANCE > 1000;

Because the selection is evaluated at the workstation, all 10,000 CUSTOMER rows must travel through the network even though only 50 of them satisfy the condition.

Client/server architecture is similar to that of the network file server except that all database processing is done at the server site, which reduces network traffic. Although both the network file server and the client/server systems perform multiple-site processing, the latter's processing is distributed. Note that a network file server approach requires the database to be located at a single site. In contrast, the client/server architecture is capable of supporting data at multiple sites.

Multiple-Site Processing, Multiple-Site Data (MPMD):

The multiple-site processing, multiple-site data (MPMD) scenario describes a fully distributed DBMS with support for multiple data processors and transaction processors at multiple sites. Depending on the level of support for various types of centralized DBMSs, DDBMSs are classified as either homogeneous or heterogeneous.
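To make the CUSTOMER example from the MPSD discussion above concrete, the sketch below uses Python's built-in sqlite3 module as a stand-in for the data store and counts how many rows would have to be shipped in each style. The data is made up to match the numbers in the example (10,000 rows, 50 with a balance above 1,000), and the CUS_CODE column is a hypothetical key added for illustration. An in-memory database involves no real network, so the row counts are only a proxy for network traffic.

```python
import sqlite3

# In-memory stand-in for the CUSTOMER table held at the server site.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CUSTOMER (CUS_CODE INTEGER PRIMARY KEY, CUS_BALANCE REAL)"
)
# 10,000 rows, of which the first 50 have balances greater than 1,000.
rows = [(i, 1500.0 if i <= 50 else 200.0) for i in range(1, 10001)]
conn.executemany("INSERT INTO CUSTOMER VALUES (?, ?)", rows)

# File-server style: every row is shipped and the workstation does the selection.
shipped_all = conn.execute("SELECT * FROM CUSTOMER").fetchall()
filtered_at_workstation = [row for row in shipped_all if row[1] > 1000]

# Client/server style: the server evaluates the WHERE clause and ships only matches.
shipped_matching = conn.execute(
    "SELECT * FROM CUSTOMER WHERE CUS_BALANCE > 1000"
).fetchall()

print(len(shipped_all))       # rows shipped in the file-server style
print(len(shipped_matching))  # rows shipped in the client/server style
```

Both approaches end with the same 50 rows at the workstation. The difference is how many rows had to be moved to get there.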
Homogeneous DDBMSs integrate only one type of centralized DBMS over a network. Thus, the same DBMS will be running on different server platforms (single processor server, multiprocessor server, server farms, or server blades). In contrast, heterogeneous DDBMSs integrate different types of centralized DBMSs over a network. A fully heterogeneous DDBMS will support different DBMSs that may even support different data models (relational, hierarchical, or network) running under different computer systems, such as mainframes and PCs.

Some DDBMS implementations support several platforms, operating systems, and networks and allow remote data access to another DBMS. However, such DDBMSs are still subject to certain restrictions. For example:

• Remote access is provided on a read-only basis and does not support write privileges.
• Restrictions are placed on the number of remote tables that may be accessed in a single transaction.
• Restrictions are placed on the number of distinct databases that may be accessed.
• Restrictions are placed on the database model that may be accessed. Thus, access may be provided to relational databases but not to network or hierarchical databases.

SUMMARY

A distributed database management system (DDBMS) governs the storage and processing of logically related data over interconnected computer systems in which both data and processing are distributed among several sites.

The use of a centralized database required that corporate data be stored in a single central site, usually a mainframe computer. Data access was provided through dumb terminals. The centralized approach worked well to fill the structured information needs of corporations, but it fell short when quickly moving events required faster response times and equally quick access to information.
The slow progression from information request to approval to specialist to user simply did not serve decision makers well in a dynamic environment.

Several developments have since increased the demand for distributed data access:

The growing acceptance of the Internet as the platform for data access and distribution, which has made it a repository for distributed data.

The wireless revolution. The widespread use of wireless digital devices, such as smart phones like the iPhone and BlackBerry and personal digital assistants (PDAs), has created high demand for data access. Such devices access data from geographically dispersed locations and require varied data exchanges in multiple formats (data, voice, video, music, pictures, etc.). Although distributed data access does not necessarily imply distributed databases, performance and failure tolerance requirements often make use of data replication techniques similar to the ones found in distributed databases.

The accelerated growth of companies providing “application as a service” type of services. This new type of service provides remote application services to companies wanting to outsource their application development, maintenance, and operations. The company data is generally stored on central servers and is not necessarily distributed. Just as with wireless data access, this type of service may not require fully distributed data functionality; however, other factors such as performance and failure tolerance often require the use of data replication techniques similar to the ones found in distributed databases.

The increased focus on data analysis that led to data mining and data warehousing. Although a data warehouse is not usually a distributed database, it does rely on techniques such as data replication and distributed queries that facilitate data extraction and integration.

KEY WORDS/ABBREVIATIONS

Client/server architecture: Refers to the arrangement of hardware and software components to form a system composed of clients, servers, and middleware. The client/server architecture features a user of resources, or a client, and a provider of resources, or a server.
Conceptual design: A process that uses data modelling techniques to create a model of a database structure that represents the real-world objects in the most realistic way possible.
Both software- and hardware-independent.
Data allocation: In a distributed DBMS, describes the process of deciding where to locate data fragments.
Database life cycle (DBLC): The history of a database within an information system. Divided into six phases: initial study, design, implementation and loading, testing and evaluation, operation and maintenance, and evolution.
Data inconsistency: A condition in which different versions of the same data yield different (inconsistent) results.
Data independence: A condition that exists when data access is unaffected by changes in the physical data storage characteristics.
Data integrity: A condition in which given data always yield the same result. Data integrity is mandatory in any database. In a relational database, refers to the characteristic that allows a DBMS to maintain entity and referential integrity.
Data management: A process that focuses on data collection, storage, and retrieval. Common data management functions include addition, deletion, modification, and listing.
Decentralized design: A process in which conceptual design is used to model subsets of an organization’s database requirements. After verification of the views, processes, and constraints, the subsets are then aggregated into a complete design. Such modular designs are typical of complex systems in which the data component consists of a relatively large number of objects and procedures. Compare to centralized design.
Distributed processing: The activity of sharing (dividing) the logical processing of a database over two or more sites connected by a network.
Distributed request: A database request that allows a single SQL statement to access data in several remote DPs in a distributed database.
Distributed transaction: A database transaction that accesses data in several remote DPs in a distributed database.

LEARNING ACTIVITY

1. Discuss how SPSD, MPSD, and MPMD are both similar and different.
2. Study the networks of two organizations, one with a centralized database and the other with a decentralized database. Prepare a comparative study of the advantages and disadvantages of both.

UNIT END QUESTIONS (MCQ AND DESCRIPTIVE)

A. Descriptive Type Questions
1. Differentiate between homogeneous DDBMSs and heterogeneous DDBMSs.
2. Define fully heterogeneous DDBMSs.
3. Discuss client/server architecture.
4. Differentiate between a transaction processor (TP) and a data processor (DP).
5. Explain single-site processing and multiple-site processing.
B. Multiple Choice Questions
1. A distributed database is which of the following?
a) A single logical database that is spread to multiple locations and is interconnected by a network
b) A loose collection of files that is spread to multiple locations and is interconnected by a network
c) A single logical database that is limited to one location
d) A loose collection of files that is limited to one location
2. A distributed database can use which of the following strategies?
a) Totally centralized at one location and accessed by many sites
b) Partially or totally replicated across sites
c) Partitioned into segments at different sites
d) All of the above
3. Which of the following is not one of the stages in the evolution of distributed DBMS?
a) Unit of work
b) Remote unit of work
c) Distributed unit of work
d) Distributed request
4. Depending on the situation, each node in a distributed database system can act as which of the following?
a) A client
b) A server
c) Both a and b
d) None of the above
5. A distributed database has which of the following advantages over a centralized database?
a) Software cost
b) Software complexity
c) Slow response
d) Modular growth
6. A heterogeneous distributed database is which of the following?
a) The same DBMS is used at each location and data are not distributed across all nodes.
b) The same DBMS is used at each location and data are distributed across all nodes.
c) A different DBMS is used at each location and data are not distributed across all nodes.
d) A different DBMS is used at each location and data are distributed across all nodes.
7. Storing some of the columns of a relation at different sites is which of the following?
a) Data replication
b) Horizontal partitioning
c) Vertical partitioning
d) Horizontal and vertical partitioning
8. A homogeneous distributed database is which of the following?
a) The same DBMS is used at each location and data are not distributed across all nodes.
b) The same DBMS is used at each location and data are distributed across all nodes.
c) A different DBMS is used at each location and data are not distributed across all nodes.
d) A different DBMS is used at each location and data are distributed across all nodes.

Answers: 1. a, 2. d, 3. a, 4. c, 5. d, 6. d, 7. c, 8. b
UNIT 5: OBJECT ORIENTED DATABASES

Structure
Learning Objectives
Introduction
Object Identity, and Objects versus Literals
Complex Type Structures for Objects and Literals
Encapsulation of Operations and Persistence of Objects
Type and Class Hierarchies and Inheritance
Complex Objects
Current Trends of Database Technology
Summary
Key Words/Abbreviations
Learning Activity
Unit End Questions (MCQ and Descriptive)
References

LEARNING OBJECTIVES

This unit introduces the concepts of object-oriented databases. It covers the basic concepts of objects, inheritance, and encapsulation, explains type and class hierarchies, and surveys current trends in database technology. After studying this unit, you will be able to:
Explain objects and complex objects
Describe encapsulation and inheritance
State trends in database technology
INTRODUCTION

The term object-oriented, abbreviated OO or O-O, has its origins in OO programming languages, or OOPLs. Today OO concepts are applied in the areas of databases, software engineering, knowledge bases, artificial intelligence, and computer systems in general. OOPLs have their roots in the SIMULA language, which was proposed in the late 1960s. The programming language Smalltalk, developed at Xerox PARC in the 1970s, was one of the first languages to explicitly incorporate additional OO concepts, such as message passing and inheritance. It is known as a pure OO programming language, meaning that it was explicitly designed to be object-oriented. This contrasts with hybrid OO programming languages, which incorporate OO concepts into an already existing language. An example of the latter is C++, which incorporates OO concepts into the popular C programming language.

An object typically has two components: state (value) and behaviour (operations). It can have a complex data structure as well as specific operations defined by the programmer. Objects in an OOPL exist only during program execution; therefore, they are called transient objects. An OO database can extend the existence of objects so that they are stored permanently in a database, and hence the objects become persistent objects that exist beyond program termination and can be retrieved later and shared by other programs. In other words, OO databases store persistent objects permanently in secondary storage, and allow the sharing of these objects among multiple programs and applications. This requires the incorporation of other well-known features of database management systems, such as indexing mechanisms to efficiently locate the objects, concurrency control to allow object sharing among concurrent programs, and recovery from failures. An OO database system will typically interface with one or more OO programming languages to provide persistent and shared object capabilities.
The internal structure of an object in OOPLs includes the specification of instance variables, which hold the values that define the internal state of the object. An instance variable is similar to the concept of an attribute in the relational model, except that instance variables may be encapsulated within the object and thus are not necessarily visible to external users. Instance variables may also be of arbitrarily complex data types. Object-oriented systems allow definition of the operations or functions (behaviour) that can be applied to objects of a particular type. In fact, some OO models insist that all operations a user can apply to an object must be predefined. This forces a complete encapsulation of objects. This rigid approach has been relaxed in most OO data models for two reasons. First, database users often need to know the attribute names so they can specify selection conditions on the attributes to retrieve specific objects. Second, complete encapsulation implies that any simple retrieval requires a predefined operation, thus making ad hoc queries difficult to specify on the fly.
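The relaxed form of encapsulation described above can be sketched in Python. This is only an illustration under stated assumptions: the class Department, its attributes dnumber and dname, the hidden variable _employee_ssns, and the sample values are all hypothetical, and Python marks hidden state by naming convention rather than enforcing it.

```python
class Department:
    """Toy object with visible attributes and one hidden instance variable."""

    def __init__(self, dnumber, dname):
        self.dnumber = dnumber        # visible attribute: usable in ad hoc selections
        self.dname = dname            # visible attribute
        self._employee_ssns = set()   # hidden by convention: reached only via operations

    # Predefined operations: the only intended way to touch the hidden state.
    def add_employee(self, ssn):
        self._employee_ssns.add(ssn)

    def no_of_emps(self):
        return len(self._employee_ssns)


departments = [Department(5, "Research"), Department(4, "Administration")]
departments[0].add_employee("123456789")
departments[0].add_employee("333445555")

# Ad hoc selection condition on a visible attribute; no predefined operation is needed.
selected = [d.dname for d in departments if d.dnumber == 5]
print(selected)                      # ['Research']
print(departments[0].no_of_emps())   # 2
```

If every attribute were hidden, the selection on dnumber would itself need a predefined operation, which is the difficulty with ad hoc queries noted above.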
To encourage encapsulation, an operation is defined in two parts. The first part, called the signature or interface of the operation, specifies the operation name and arguments (or parameters). The second part, called the method or body, specifies the implementation of the operation, usually written in some general-purpose programming language. Operations can be invoked by passing a message to an object, which includes the operation name and the parameters. The object then executes the method for that operation. This encapsulation permits modification of the internal structure of an object, as well as the implementation of its operations, without the need to disturb the external programs that invoke these operations. Hence, encapsulation provides a form of data and operation independence.

Another OO concept is operator overloading, which refers to an operation’s ability to be applied to different types of objects; in such a situation, an operation name may refer to several distinct implementations, depending on the type of object it is applied to. This feature is also called operator polymorphism. For example, an operation to calculate the area of a geometric object may differ in its method (implementation), depending on whether the object is of type triangle, circle, or rectangle. This may require the use of late binding of the operation name to the appropriate method at runtime, when the type of object to which the operation is applied becomes known.

OBJECT IDENTITY, AND OBJECTS VERSUS LITERALS

One goal of an ODMS (Object Data Management System) is to maintain a direct correspondence between real-world and database objects so that objects do not lose their integrity and identity and can easily be identified and operated upon. Hence, an ODMS provides a unique identity to each independent object stored in the database. This unique identity is typically implemented via a unique, system-generated object identifier (OID).
The value of an OID is not visible to the external user, but is used internally by the system to identify each object uniquely and to create and manage inter-object references. The OID can be assigned to program variables of the appropriate type when needed. The main property required of an OID is that it be immutable; that is, the OID value of a particular object should not change. This preserves the identity of the real-world object being represented. Hence, an ODMS must have some mechanism for generating OIDs and preserving the immutability property. It is also desirable that each OID be used only once; that is, even if an object is removed from the database, its OID should not be assigned to another object. These two properties imply that the OID should not depend on any attribute values of the object, since the value of an attribute may be changed or corrected. We can compare this with the relational model, where each relation must have a primary key attribute whose value identifies each tuple uniquely. In the relational model, if the value of the primary key is changed, the tuple will have a new identity, even though it may still represent the same real-world object. Alternatively, a real-world object may have different names for key attributes in different relations, making it difficult to ascertain that the keys represent the same real-world object (for example, the object identifier may be represented as Emp_id in one relation and as Ssn in another).

It is inappropriate to base the OID on the physical address of the object in storage, since the physical address can change after a physical reorganization of the database. However, some early ODMSs have used the physical address as the OID to increase the efficiency of object retrieval. If the physical address of the object changes, an indirect pointer can be placed at the former address, which gives the new physical location of the object. It is more common to use long integers as OIDs and then to use some form of hash table to map the OID value to the current physical address of the object in storage.

Some early OO data models required that everything, from a simple value to a complex object, was represented as an object; hence, every basic value, such as an integer, string, or Boolean value, has an OID. This allows two identical basic values to have different OIDs, which can be useful in some cases. For example, the integer value 50 can sometimes be used to mean a weight in kilograms and at other times to mean the age of a person. Then, two basic objects with distinct OIDs could be created, but both objects would represent the integer value 50. Although useful as a theoretical model, this is not very practical, since it leads to the generation of too many OIDs. Hence, most OO database systems allow for the representation of both objects and literals (or values). Every object must have an immutable OID, whereas a literal value has no OID and its value just stands for itself. Thus, a literal value is typically stored within an object and cannot be referenced from other objects. In many systems, complex structured literal values can also be created without having a corresponding OID if needed.
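The OID properties discussed in this section (system-generated, immutable, never reused, and mapped to a physical address through a hash table) can be sketched as follows. This is a minimal sketch, not how any particular ODMS works: the class ObjectStore, its method names, and the integer "addresses" are assumptions made for illustration.

```python
import itertools


class ObjectStore:
    """Toy store: integer OIDs mapped to storage addresses via a hash table."""

    def __init__(self):
        self._next_oid = itertools.count(1)   # only ever counts up, so no OID is reused
        self._address_of = {}                 # hash table: OID -> current address
        self._storage = {}                    # simulated storage: address -> state

    def create(self, state, address):
        oid = next(self._next_oid)
        self._address_of[oid] = address
        self._storage[address] = state
        return oid

    def fetch(self, oid):
        return self._storage[self._address_of[oid]]

    def relocate(self, oid, new_address):
        """Physical reorganization: the address changes, the OID does not."""
        old_address = self._address_of[oid]
        self._storage[new_address] = self._storage.pop(old_address)
        self._address_of[oid] = new_address

    def delete(self, oid):
        del self._storage[self._address_of.pop(oid)]


store = ObjectStore()
weight = store.create(50, address=0x10)   # 50 meaning a weight in kilograms
age = store.create(50, address=0x20)      # 50 meaning an age in years

# Two objects with the same value but distinct identities.
same_value_distinct_identity = store.fetch(weight) == store.fetch(age) and weight != age

store.relocate(weight, 0x30)              # the OID still resolves after the move
store.delete(age)
reused = store.create(7, address=0x20)    # the deleted object's OID is not handed out again
```

Because the OID is derived from a counter rather than from any attribute value or address, changing the stored value or moving the object leaves its identity untouched.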
5.2.1 Complex Type Structures for Objects and Literals

Another feature of an ODMS (and ODBs in general) is that objects and literals may have a type structure of arbitrary complexity in order to contain all of the necessary information that describes the object or literal. In contrast, in traditional database systems, information about a complex object is often scattered over many relations or records, leading to loss of direct correspondence between a real-world object and its database representation. In ODBs, a complex type may be constructed from other types by nesting of type constructors. The three most basic constructors are atom, struct (or tuple), and collection.

One type constructor has been called the atom constructor, although this term is not used in the latest object standard. This includes the basic built-in data types of the object model, which are similar to the basic types in many programming languages: integers, strings, floating point numbers, enumerated types, Booleans, and so on. They are called single-valued or atomic types, since each value of the type is considered an atomic (indivisible) single value.
A second type constructor is referred to as the struct (or tuple) constructor. This can create standard structured types, such as the tuples (record types) in the basic relational model. A structured type is made up of several components, and is also sometimes referred to as a compound or composite type. More accurately, the struct constructor is not considered to be a type, but rather a type generator, because many different structured types can be created. For example, two different structured types that can be created are: struct Name<FirstName: string, MiddleInitial: char, LastName: string>, and struct CollegeDegree<Major: string, Degree: string, Year: date>. To create complex nested type structures in the object model, the collection type constructors are needed, which we discuss next. Notice that the type constructors atom and struct are the only ones available in the original (basic) relational model.

Collection (or multivalued) type constructors include the set(T), list(T), bag(T), array(T), and dictionary(K, T) type constructors. These allow part of an object or literal value to include a collection of other objects or values when needed. These constructors are also considered to be type generators because many different types can be created. For example, set(string), set(integer), and set(Employee) are three different types that can be created from the set type constructor. All the elements in a particular collection value must be of the same type. For example, all values in a collection of type set(string) must be string values.

The atom constructor is used to represent all basic atomic values, such as integers, real numbers, character strings, Booleans, and any other basic data types that the system supports directly. The tuple constructor can create structured values and objects of the form <a1:i1, a2:i2, ..., an:in>, where each aj is an attribute name and each ij is a value or an OID.
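As a rough analogy, the struct and collection constructors can be mimicked with Python's built-in types. The sketch below rebuilds the Name and CollegeDegree structs from the text and nests them inside a hypothetical Employee struct; the field names skills, ratings, and phone_by_kind and all sample values are assumptions for illustration. Python does not enforce the element types named in the comments.

```python
from collections import Counter
from datetime import date
from typing import NamedTuple


class Name(NamedTuple):            # struct Name<FirstName, MiddleInitial, LastName>
    first_name: str
    middle_initial: str
    last_name: str


class CollegeDegree(NamedTuple):   # struct CollegeDegree<Major, Degree, Year>
    major: str
    degree: str
    year: date


class Employee(NamedTuple):        # nesting: a struct built from structs and collections
    name: Name
    degrees: list                  # list(CollegeDegree): ordered, duplicates allowed
    skills: frozenset              # set(string): unordered, distinct elements
    ratings: Counter               # bag(integer): unordered, duplicates counted
    phone_by_kind: dict            # dictionary(string, string): key retrieves value


emp = Employee(
    name=Name("John", "B", "Smith"),
    degrees=[CollegeDegree("Computer Science", "BS", date(2001, 6, 1))],
    skills=frozenset({"SQL", "Python", "SQL"}),   # the duplicate "SQL" collapses
    ratings=Counter([4, 5, 4]),                   # the duplicate 4 is kept and counted
    phone_by_kind={"office": "555-0100"},
)
```

The set and the bag are built from inputs that both contain a repeated element, which makes the difference between the two constructors visible: the set keeps two elements, while the bag records that 4 occurs twice.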
The other commonly used constructors are collectively referred to as collection types, but have individual differences among them. The set constructor will create objects or literals that are a set of distinct elements {i1, i2, ..., in}, all of the same type. The bag constructor (sometimes called a multiset) is similar to a set except that the elements in a bag need not be distinct. The list constructor will create an ordered list [i1, i2, ..., in] of OIDs or values of the same type. A list is similar to a bag except that the elements in a list are ordered, and hence we can refer to the first, second, or jth element. The array constructor creates a single-dimensional array of elements of the same type. The main difference between array and list is that a list can have an arbitrary number of elements whereas an array typically has a maximum size. Finally, the dictionary constructor creates a collection of 2-tuples (K, V), where the value of a key K can be used to retrieve the corresponding value V.

The main characteristic of a collection type is that its objects or values will be a collection of objects or values of the same type that may be unordered (such as a set or a bag) or ordered (such as a list or an array). The tuple type constructor is often called a structured type, since it corresponds to the struct construct in the C and C++ programming languages.

ENCAPSULATION OF OPERATIONS AND PERSISTENCE OF OBJECTS

The concept of encapsulation is one of the main characteristics of OO languages and systems. It is also related to the concepts of abstract data types and information hiding in programming languages. In traditional database models and systems this concept was not applied, since it is customary to make the structure of database objects visible to users and external programs. In these traditional models, a number of generic database operations are applicable to objects of all types. For example, in the relational model, the operations for selecting, inserting, deleting, and modifying tuples are generic and may be applied to any relation in the database. The relation and its attributes are visible to users and to external programs that access the relation by using these operations.

The concept of encapsulation is applied to database objects in ODBs by defining the behaviour of a type of object based on the operations that can be externally applied to objects of that type. Some operations may be used to create (insert) or destroy (delete) objects; other operations may update the object state; and others may be used to retrieve parts of the object state or to apply some calculations. Still other operations may perform a combination of retrieval, calculation, and update. In general, the implementation of an operation can be specified in a general-purpose programming language that provides flexibility and power in defining the operations. The external users of the object are only made aware of the interface of the operations, which defines the name and arguments (parameters) of each operation.
The implementation is hidden from the external users; it includes the definition of any hidden internal data structures of the object and the implementation of the operations that access these structures. The interface part of an operation is sometimes called the signature, and the operation implementation is sometimes called the method.

For database applications, the requirement that all objects be completely encapsulated is too stringent. One way to relax this requirement is to divide the structure of an object into visible and hidden attributes (instance variables). Visible attributes can be seen by and are directly accessible to the database users and programmers via the query language. The hidden attributes of an object are completely encapsulated and can be accessed only through predefined operations. Most ODMSs employ high-level query languages for accessing visible attributes.

The term class is often used to refer to a type definition, along with the definitions of the operations for that type. An operation is typically applied to an object by using the dot notation. For example, if d is a reference to a DEPARTMENT object, we can invoke an operation such as no_of_emps by writing d.no_of_emps. Similarly, by writing d.destroy_dept, the object referenced by d is destroyed (deleted). The only exception is the constructor operation, which returns a reference to a new DEPARTMENT object. Hence, it is customary in some OO models to have a default name for the constructor operation that is the name of the class itself, although this was not used in Figure 11.2. The dot notation is also used to refer to attributes of an object; for example, by writing d.Dnumber or d.Mgr_Start_date.

Specifying Object Persistence via Naming and Reachability. An ODBS is often closely coupled with an object-oriented programming language (OOPL). The OOPL is used to specify the method (operation) implementations as well as other application code. Not all objects are meant to be stored permanently in the database. Transient objects exist in the executing program and disappear once the program terminates. Persistent objects are stored in the database and persist after program termination. The typical mechanisms for making an object persistent are naming and reachability.

The naming mechanism involves giving an object a unique persistent name within a particular database. This persistent object name can be given via a specific statement or operation in the program. The named persistent objects are used as entry points to the database through which users and applications can start their database access. Obviously, it is not practical to give names to all objects in a large database that includes thousands of objects, so most objects are made persistent by using the second mechanism, called reachability. The reachability mechanism works by making the object reachable from some other persistent object. An object B is said to be reachable from an object A if a sequence of references in the database leads from object A to object B.
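Persistence by naming and reachability can be sketched as a graph traversal that starts from the named entry points. The following is a simplified model under stated assumptions, not the mechanism of any real ODMS: the class DbObject, the function reachable_from, the persistent name research_dept, and the object labels are all hypothetical.

```python
class DbObject:
    """Toy object; refs holds references to other objects."""

    def __init__(self, label):
        self.label = label
        self.refs = []


def reachable_from(roots):
    """Return the labels of every object reachable from the given roots."""
    seen, stack, labels = set(), list(roots), []
    while stack:
        obj = stack.pop()
        if id(obj) in seen:          # already visited; also guards against cycles
            continue
        seen.add(id(obj))
        labels.append(obj.label)
        stack.extend(obj.refs)       # follow the object's references
    return sorted(labels)


research = DbObject("Research")
smith = DbObject("Smith")
wong = DbObject("Wong")
product_x = DbObject("ProductX")

research.refs.append(smith)          # Research -> Smith
smith.refs.append(product_x)         # Smith -> ProductX
product_x.refs.append(research)      # ProductX -> Research (a cycle)

# Naming: one object receives a persistent name and becomes an entry point.
named_roots = {"research_dept": research}

# Reachability: everything reachable from a named root would be made persistent.
persistent = reachable_from(named_roots.values())
print(persistent)                    # ['ProductX', 'Research', 'Smith']
```

The object labelled Wong is neither named nor referenced by a persistent object, so in this model it stays transient and would disappear when the program ends.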
TYPE HIERARCHIES AND INHERITANCE

A key concept in OO systems is that of type and class hierarchies and inheritance. This permits specification of new types or classes that inherit much of their structure and/or operations from previously defined types or classes. This makes it easier to develop the data types of a system incrementally, and to reuse existing type definitions when creating new types of objects.

One problem in early OO database systems involved representing relationships among objects. The insistence on complete encapsulation in early OO data models led to the argument that relationships should not be explicitly represented, but should instead be described by defining appropriate methods that locate related objects. However, this approach does not work very well for complex databases with many relationships because it is useful to identify these relationships and make them visible to users. The ODMG object database standard has recognized this need and it explicitly represents binary relationships via a pair of inverse references.
Simplified Model for Inheritance. Another main characteristic of ODBs is that they allow type hierarchies and inheritance. Inheritance allows the definition of new types based on other predefined types, leading to a type (or class) hierarchy.

A type is defined by assigning it a type name, and then defining a number of attributes (instance variables) and operations (methods) for the type. In the simplified model we use in this section, the attributes and operations are together called functions, since attributes resemble functions with zero arguments. A function name can be used to refer to the value of an attribute or to refer to the resulting value of an operation (method). We use the term function to refer to both attributes and operations, since they are treated similarly in a basic introduction to inheritance.

A type in its simplest form has a type name and a list of visible (public) functions. When specifying a type in this section, we use the following format, which does not specify arguments of functions, to simplify the discussion:

TYPE_NAME: function, function, ..., function

For example, a type that describes characteristics of a PERSON may be defined as follows:

PERSON: Name, Address, Birth_date, Age, Ssn

In the PERSON type, the Name, Address, Ssn, and Birth_date functions can be implemented as stored attributes, whereas the Age function can be implemented as an operation that calculates the Age from the value of the Birth_date attribute and the current date.

The concept of subtype is useful when the designer or user must create a new type that is similar but not identical to an already defined type. The subtype then inherits all the functions of the predefined type, which is referred to as the supertype.
For example, suppose that we want to define two new types EMPLOYEE and STUDENT as follows:

EMPLOYEE: Name, Address, Birth_date, Age, Ssn, Salary, Hire_date, Seniority
STUDENT: Name, Address, Birth_date, Age, Ssn, Major, Gpa

Since both STUDENT and EMPLOYEE include all the functions defined for PERSON plus some additional functions of their own, we can declare them to be subtypes of PERSON. Each will inherit the previously defined functions of PERSON, namely Name, Address, Birth_date, Age, and Ssn. For STUDENT, it is only necessary to define the new (local) functions Major and Gpa, which are not inherited. Presumably, Major can be defined as a stored attribute, whereas Gpa may be implemented as an operation that calculates the student’s grade point average by accessing the Grade values that are internally stored (hidden) within each STUDENT object as hidden attributes. For EMPLOYEE, the Salary and Hire_date functions may be stored attributes, whereas Seniority may be an operation that calculates Seniority from the value of Hire_date. Therefore, we can declare EMPLOYEE and STUDENT as follows:

EMPLOYEE subtype-of PERSON: Salary, Hire_date, Seniority
STUDENT subtype-of PERSON: Major, Gpa

In general, a subtype includes all of the functions that are defined for its supertype plus some additional functions that are specific only to the subtype. Hence, it is possible to generate a type hierarchy to show the supertype/subtype relationships among all the types declared in the system.

As another example, consider a type that describes objects in plane geometry, which may be defined as follows:

GEOMETRY_OBJECT: Shape, Area, Reference_point

For the GEOMETRY_OBJECT type, Shape is implemented as an attribute (its domain can be an enumerated type with values ‘triangle’, ‘rectangle’, ‘circle’, and so on), and Area is a method that is applied to calculate the area. Reference_point specifies the coordinates of a point that determines the object location. Now suppose that we want to define a number of subtypes for the GEOMETRY_OBJECT type, as follows:

RECTANGLE subtype-of GEOMETRY_OBJECT: Width, Height
TRIANGLE subtype-of GEOMETRY_OBJECT: Side1, Side2, Angle
CIRCLE subtype-of GEOMETRY_OBJECT: Radius

Notice that the Area operation may be implemented by a different method for each subtype, since the procedure for area calculation is different for rectangles, triangles, and circles. Similarly, the attribute Reference_point may have a different meaning for each subtype; it might be the center point for RECTANGLE and CIRCLE objects, and the vertex point between the two given sides for a TRIANGLE object.

Notice that type definitions describe objects but do not generate objects on their own. When an object is created, typically it belongs to one or more of these types that have been declared.
For example, a circle object is of type CIRCLE and GEOMETRY_OBJECT (by inheritance). Each object also becomes a member of one or more persistent collections of objects (or extents), which group together objects that are persistently stored in the database.
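The GEOMETRY_OBJECT hierarchy described above can be sketched in Python. This is only an illustration under stated assumptions: the class and field names mirror the type declarations in the text, Shape is derived from the class name rather than stored as an enumerated attribute, and Angle is assumed to be the angle in radians between Side1 and Side2.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GeometryObject:
    """Supertype: every geometry object has a reference point."""
    reference_point: Tuple[float, float]

    @property
    def shape(self) -> str:
        # Simplification: derive Shape from the class name.
        return type(self).__name__.lower()

    def area(self) -> float:
        raise NotImplementedError("each subtype supplies its own Area method")


@dataclass
class Rectangle(GeometryObject):
    width: float
    height: float

    def area(self) -> float:
        return self.width * self.height


@dataclass
class Triangle(GeometryObject):
    side1: float
    side2: float
    angle: float  # assumed: radians, between side1 and side2

    def area(self) -> float:
        return 0.5 * self.side1 * self.side2 * math.sin(self.angle)


@dataclass
class Circle(GeometryObject):
    radius: float

    def area(self) -> float:
        return math.pi * self.radius ** 2


shapes = [
    Rectangle((0.0, 0.0), 2.0, 3.0),
    Triangle((0.0, 0.0), 3.0, 4.0, math.pi / 2),
    Circle((0.0, 0.0), 1.0),
]
# The same Area call dispatches to a different method for each subtype.
areas = [s.area() for s in shapes]
```

Each object in shapes is an instance of its own subtype and, by inheritance, of GeometryObject, which is the sense in which a circle object is of type CIRCLE and GEOMETRY_OBJECT.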
Constraints on Extents Corresponding to a Type Hierarchy

In most ODBs, an extent is defined to store the collection of persistent objects for each type or subtype. In this case, the constraint is that every object in an extent that corresponds to a subtype must also be a member of the extent that corresponds to its supertype. Some OO database systems have a predefined system type (called the ROOT class or the OBJECT class) whose extent contains all the objects in the system. Classification then proceeds by assigning objects into additional subtypes that are meaningful to the application, creating a type hierarchy (or class hierarchy) for the system. All extents for system-defined and user-defined classes are subsets of the extent corresponding to the class OBJECT, directly or indirectly.

An extent is a named persistent object whose value is a persistent collection of objects of the same type that are stored permanently in the database. The objects can be accessed and shared by multiple programs. It is also possible to create a transient collection, which exists temporarily during the execution of a program but is not kept when the program terminates. For example, a transient collection may be created in a program to hold the result of a query that selects some objects from a persistent collection and copies those objects into the transient collection. The program can then manipulate the objects in the transient collection, and once the program terminates, the transient collection ceases to exist. In general, numerous collections, transient or persistent, may contain objects of the same type.

COMPLEX OBJECTS

Complex objects are built from simpler ones by applying constructors to them. The simplest objects are objects such as integers, characters, byte strings of any length, booleans and floats (one might add other atomic types). There are various complex object constructors: tuples, sets, bags, lists, and arrays are examples.
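As a rough illustration of these constructors, the sketch below builds one complex object from atomic values using Python's built-in tuple, list, and set types. The Point and Polygon names and the sample data are hypothetical and are not part of any object database API.

```python
from typing import FrozenSet, List, NamedTuple


class Point(NamedTuple):
    """Tuple constructor applied to atomic values (two floats)."""
    x: float
    y: float


class Polygon(NamedTuple):
    """Tuple constructor applied to an atomic value, a list, and a set."""
    name: str
    vertices: List[Point]   # list constructor: order is significant
    tags: FrozenSet[str]    # set constructor: unordered, no duplicates


square = Polygon(
    name="unit square",
    vertices=[Point(0.0, 0.0), Point(1.0, 0.0), Point(1.0, 1.0), Point(0.0, 1.0)],
    tags=frozenset({"convex", "regular"}),
)

# A list of complex objects: constructors can be applied to constructed objects.
drawing: List[Polygon] = [square]
```

Here a list is applied to tuples, and a tuple is applied to a list and a set, so the constructors are being combined freely rather than in one fixed nesting order.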
The minimal set of constructors that the system should have is set, list and tuple. Sets are critical because they are a natural way of representing collections from the real world. Tuples are critical because they are a natural way of representing properties of an entity. Of course, both sets and tuples are important because they gained wide acceptance as object constructors through the relational model. Lists or arrays are important because they capture order, which occurs in the real world, and they also arise in many scientific applications, where people need matrices or time series data.

The object constructors must be orthogonal: any constructor should apply to any object. The constructors of the relational model are not orthogonal, because the set construct can only be applied to tuples and the tuple constructor can only be applied to atomic values. Other
examples are non-first normal form relational models in which the top level construct must always be a relation.

Note that supporting complex objects also requires that appropriate operators be provided for dealing with such objects (whatever their composition). That is, operations on a complex object must propagate transitively to all its components. Examples include the retrieval or deletion of an entire complex object or the production of a "deep" copy (in contrast to a "shallow" copy, where components are not replicated but are instead referenced by the copy of the object root only). Additional operations on complex objects may be defined, of course, by users of the system. However, this capability requires some system-provided provisions such as two distinguishable types of references ("is-part-of" and "general").

Figure 5.1 Complex Object

CURRENT TRENDS OF DATABASE TECHNOLOGY

Database Management Systems (DBMS) technology is a key technology in most Systems Integration (SI) projects. The fast pace of development in this area makes it difficult for IS professionals to keep up with the latest advances, and to appreciate the limitations of the present generation of products. The existing relational DBMS (RDBMS) technology has been successfully applied to many application domains. RDBMS technology has proved to be an effective solution for data management requirements in large and small organizations, and today this technology forms a key component of most information systems. However, the advances in computer hardware,
and the emergence of new application requirements such as multimedia and mobile databases produced a situation where the basic underlying principles of data management need to be re-evaluated. Some technology observers see RDBMS technology as obsolete in the context of today's application requirements and advocate a shift towards Object-Oriented (OO) databases. The OO approach and the associated OO technologies are often seen as a universal solution in computing today, including databases. While complex objects play an important role in many applications, they represent only a subset of the problems which database technology needs to address.

Recent Achievements and Current Status of Database Technology

During the last decade the early versions of RDBMS systems have evolved into mature database server technology with the capability to support distributed applications and operate in heterogeneous environments. As a result of these developments, the present generation of database technology addresses the requirements of most business-style applications.

From Relational Model to Commercial DBMS Implementations

Starting from the relational model formulated by E.F. Codd in 1970 and the early prototype System R developed at IBM, database research focused on solving practical problems associated with the management of large databases. The main accomplishments of database research in the following years include the development of the high-level data management language SQL (Structured Query Language), the theory of query optimization, and transaction management techniques. Numerous other technical solutions to problems related to the management of large amounts of data shared by multiple, concurrent users have been found. These include the development of buffer management, indexing, and physical storage techniques which dramatically improve the performance of commercial DBMS systems.

SQL Language

SQL has undergone major revisions from the original SQL86 standard.
SQL86 lacked many of the key features of the relational model including referential integrity and domains. The next version of the standard (SQL89) rectified many of the shortcomings of SQL86, and the current SQL92 standard incorporates declarative integrity, domain definitions, and numerous other features that make SQL a powerful data management language. SQL is universally accepted as the basis of all leading DBMS systems today and is likely to be the dominant database language in the foreseeable future. The current standardization efforts of the ISO (International Organization for Standardization) centre on extending SQL to incorporate Object-Oriented (OO) features, and to provide support for complex types of data such as text and multimedia.

Query Optimization
SQL is a declarative language which does not specify the implementation details of database operations and uses query optimization to determine an efficient data access plan. Early versions of relational DBMSs were often criticized for poor performance. Advanced query optimization techniques based on extensive research, combined with hardware advances, make relational DBMSs the fastest available database technology today, suitable for operation in environments where a high level of performance is mandatory. More recently, query optimization techniques were developed for distributed databases, and effective solutions exist today for running queries across multiple databases. Further work on query optimization is likely to focus on producing techniques capable of taking full advantage of multiprocessor computer architectures.

Transaction Management

The two key problems associated with transaction management, concurrency and recovery, have been effectively solved, resulting in reliable and fast database technology capable of supporting large numbers of concurrent users. Most DBMS systems today use row-level (record-level) locking, group commits and other techniques resulting in high overall transaction rates. Similar to query optimization, transaction management techniques were extended to handle transactions spanning multiple database sites. Reliable recovery in distributed database environments is implemented using the 2PC (two-phase commit) protocol, which maintains database consistency following failures during distributed update operations.

Database Server Technology

The emergence of client/server computing produced database server technology with advanced features designed to support operation in distributed environments. Database server systems, in addition to the standard database features, incorporate database stored procedures, database triggers, and remote procedure calls (RPCs).
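The two-phase commit protocol mentioned under Transaction Management can be sketched as follows. This is a minimal in-memory illustration, not a production protocol: the Participant class, its can_commit flag, and the state strings are hypothetical, and the logging, timeouts, and coordinator-failure handling that a real implementation needs are left out.

```python
from enum import Enum
from typing import List


class Vote(Enum):
    COMMIT = "commit"
    ABORT = "abort"


class Participant:
    """Hypothetical database site holding part of a distributed update."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self) -> Vote:
        # Phase 1: the site votes; a COMMIT vote is a promise that it can commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return Vote.COMMIT if self.can_commit else Vote.ABORT

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def two_phase_commit(participants: List[Participant]) -> bool:
    """Return True if the transaction committed at every site."""
    # Phase 1 (voting): the coordinator collects a vote from every site.
    votes = [p.prepare() for p in participants]
    decision = all(v is Vote.COMMIT for v in votes)
    # Phase 2 (decision): the same outcome is sent to every site.
    for p in participants:
        if decision:
            p.commit()
        else:
            p.abort()
    return decision


sites_ok = [Participant("site_a"), Participant("site_b")]
sites_bad = [Participant("site_a"), Participant("site_b", can_commit=False)]
committed = two_phase_commit(sites_ok)
rolled_back = not two_phase_commit(sites_bad)
```

The point of the second phase is that every site applies the same decision, so a single ABORT vote leaves no site with a partially applied update.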
Database triggers and stored procedures improve the performance of applications, and the integrity and security of data. Most database server systems incorporate facilities for data replication with support for both synchronous and asynchronous operation. Advanced database server systems are designed to operate in open systems environments and support standard APIs (Application Programming Interfaces) and gateways. For example, the ODBC (Open Database Connectivity) API developed by Microsoft and based on the X/Open CLI (Call Level Interface) specification is supported by most database servers today.

New Database Technology Trends

As noted above, commercially available database technology supports the requirements of most business-style applications. Such applications use structured data, i.e. information which can be represented as records with pre-defined standard data types (e.g. character, number, date, etc.). Business applications are characterized by well-defined (short)
transactions in which individual users update simple records, typically a single row in a database table (e.g. debit/credit banking transactions). While the management of structured data remains important, other types of information including image data, audio, video, and text are used increasingly in applications today.

New Application Requirements

A number of new application areas have emerged recently which require data management support that extends well beyond the traditional data management functions. Equally importantly, advances in computer hardware, network and user interface technology have re-defined the context in which data management needs to be performed. In this section we discuss three new database application areas: complex object databases, high-volume databases, and mobile databases.

Complex Object Database Applications

Applications such as CAD/CAM (Computer Aided Design/Computer Aided Manufacturing), document management, and multimedia are forcing the developers of DBMS technology to re-assess the basic underlying database principles and produce new technical solutions. The distinguishing features of these applications are complex objects and unstructured data, which are difficult to represent as tables in a relational system. Applications that use complex objects often involve complicated, long-duration transactions which can take several days or even weeks to complete. For example, transactions encountered during design activities tend to involve co-operation between several users and cannot be served effectively by traditional transaction mechanisms, which rely on locking and transaction rollbacks to resolve contention. Many database researchers regard the relational model as unsuitable for applications of this type and advocate the use of object-oriented technology. OO technology has been applied successfully to programming and user interface development, but its impact on databases is still rather unclear.
Several pure OODBMS (Object-Oriented DBMS) products are available commercially (e.g. Versant, O2, ObjectStore), but they remain focused on specialized applications and have not achieved wide market acceptance. Some research problems and many practical issues remain to be resolved before OODBMS technology can be applied to large-scale, mission-critical applications. Hybrid solutions which support both structured and unstructured information using relational and OO technologies are being developed for commercial use by database vendors. For example, Oracle Media Server combines Oracle database server technology with text and multimedia servers.

High-Volume Database Applications
Another active area of database research and development concerns applications which use very large volumes of data. Examples of such applications include retail chain applications which store every cashier transaction in a historical database and perform ad-hoc analysis to determine customer buying patterns and the popularity of individual products. Traditionally, the main factor limiting database performance was the speed of disk read and write operations (I/O throughput). Advanced caching techniques used in modern database server systems have resolved the I/O bottleneck and produced CPU-bound (central processing unit bound) systems. An obvious solution to the CPU bottleneck is to use multiple-processor systems configured as either SMP (symmetric multiprocessor) or MPP (massively parallel processor) systems.

Parallel computing architectures have been used in scientific computing for some time, and they are now beginning to have an impact on database applications. Proprietary database machines (e.g. Teradata, Tandem, etc.) built using specialized hardware and software were the first parallel database technologies on the market. Today, multiprocessor systems are widely available from a variety of vendors including Sequent, Encore, NCUBE, NCR, and others. The use of multiprocessor machines for database applications is becoming an attractive solution to performance problems associated with high-volume databases as the cost of multiprocessor hardware falls. MPP architectures present a particularly cost-effective solution as MPP systems are constructed using commodity processors (i.e. Intel 486 and 586) and offer excellent price-performance ratios and scalability. Several complex technical problems need to be overcome to enable database server systems to achieve good scalability on MPP architectures.
To start with, MPP systems are based on shared-nothing architectures; each processor has its own RAM (Random Access Memory) and communicates with other processors via an interconnection network. Consequently, conventional concurrency techniques, which rely on shared memory, do not work in this message-based environment. Also, to take full advantage of MPP architectures all database operations need to run in parallel. This applies in particular to query and update operations, but also to table and index creation, database loads and backups. Interestingly, the SQL language, because of its non-procedural nature and theoretical foundation in relational algebra, lends itself to parallelization. SQL queries can be expressed in functional form, and operations such as aggregation, sorts, joins, and table scans can be run simultaneously on multiple processors. Production versions of database server technology are available today which perform at least some database operations in parallel. The impact of parallel computing on database processing is likely to be highly significant, creating new opportunities for applications which cannot be accommodated with single-processor architectures.

Multimedia Databases

Another important trend in database systems is the inclusion of multimedia data. By "multimedia" we mean information that represents a signal of some sort. Common forms of multimedia data include video, audio, radar signals, satellite images, and documents or
pictures in various encodings. These forms have in common that they are much larger than the earlier forms of data (integers, character strings of fixed length, and so on) and of vastly varying sizes. The storage of multimedia data has forced DBMSs to expand in several ways. For example, the operations that one performs on multimedia data are not the simple ones suitable for traditional data forms. Thus, while one might search a bank database for accounts that have a negative balance, comparing each balance with the real number 0.0, it is not feasible to search a database of pictures for those that show a face that "looks like" a particular image.

To allow users to create and use complex data operations such as image processing, DBMSs have had to incorporate the ability of users to introduce functions of their own choosing. Often, the object-oriented approach is used for such extensions, even in relational systems, which are then dubbed "object-relational."

The size of multimedia objects also forces the DBMS to modify the storage manager so that objects or tuples of a gigabyte or more can be accommodated. Among the many problems that such large elements present is the delivery of answers to queries. In a conventional, relational database, an answer is a set of tuples. These tuples would be delivered to the client by the database server as a whole. However, suppose the answer to a query is a video clip a gigabyte long. It is not feasible for the server to deliver the gigabyte to the client as a whole. For one reason, it takes too long and will prevent the server from handling other requests. For another, the client may want only a small part of the film clip, but doesn't have a way to ask for exactly what it wants without seeing the initial portion of the clip.
For a third reason, even if the client wants the whole clip, perhaps in order to play it on a screen, it is sufficient to deliver the clip at a fixed rate over the course of an hour (the amount of time it takes to play a gigabyte of compressed video). Thus, the storage system of a DBMS supporting multimedia data has to be prepared to deliver answers in an interactive mode, passing a piece of the answer to the client on request.

Mobile Databases

With the emergence of mobile computing, the corporate information system also includes notebook computers and other portable devices. Similar to desktop computing users, mobile users need access to information stored on corporate database servers. In some cases, mobile users carry fragments of the corporate database with them in notebook databases and from time to time need to synchronize their information with the information held on the remote server. Mobile computing can be regarded as a special type of distributed environment and presents a number of technical problems for database implementation. Firstly, mobile computing is characterized by relatively slow communications based on wireless networks (2 to 9 kbps) or modems using phone lines (up to 28.8 kbps). Mobile communications are non-continuous and generally unreliable, causing poor system availability. Synchronous client/server computing techniques, which need fast and reliable networks, are not suitable for communications between the server and the client application running on a mobile
workstation. Asynchronous processing using store-and-forward techniques is best suited to this type of environment. During periods when the mobile station is disconnected, messages are stored in a queue and transmission is resumed when communications are re-established. In situations where the mobile station also carries data, the resulting environment can be characterized as a distributed database. Asynchronous replication techniques play an important role in maintaining information stored on mobile stations. Database and communication technologies suitable for mobile operation were announced recently and will be available commercially in the near future.

Fuzzy Databases

Most of our traditional tools for formal modelling, reasoning and computing are crisp, deterministic and precise in nature. Precision assumes that the parameters of a model represent exactly either our perception of the phenomenon modelled or the features of the real system that has been modelled. Certainty indicates that we assume the structures and parameters of the model to be definitely known. If the model or theory asserts factuality, then the modelling language has to be suited to model the characteristics of the situation under study appropriately, and here a problem arises. For factual models or modelling languages, two major complications arise:

1. Real situations are very often not crisp and deterministic and cannot be described precisely, i.e. real situations are very often uncertain or vague in a number of ways.
2. A complete description of a real system would require far more detailed data than a human being could ever recognize and process simultaneously.

Hence, among the various paradigmatic changes in science and mathematics in the last century, one has concerned the concept of uncertainty.
In science this change is manifested by a gradual transition from a view which held that uncertainty is undesirable to an alternative view that accepts uncertainty as an integral part of the whole system, essential to modelling the real world. Three basic types of uncertainty are discussed in the literature:

1. Fuzziness: Lack of definite or sharp distinctions. Alternative terms for it are:
(i) Vagueness
(ii) Cloudiness
(iii) Haziness
2. Discord: Disagreement in choosing among several alternatives. Synonyms for it are:
Dissonance
Incongruity
Discrepancy
3. Nonspecificity: Two or more alternatives are left unspecified. Synonyms for it are:
Variety
Generality
Diversity

The last two types of uncertainty can be classified as a higher uncertainty type, ambiguity, which means any situation in which it remains unclear which of several alternatives should be accepted as the genuine one. In general, ambiguity results from a lack of certain distinctions characterizing an object, from conflicting distinctions, or from both of these.

An important point in the evolution of the modern concept of uncertainty was the publication of a seminal paper by Lotfi A. Zadeh, in which Zadeh introduced a theory whose objects, fuzzy sets, are sets with boundaries that are not precise; membership in a fuzzy set is not a matter of true or false, but rather a matter of degree. This concept was called Fuzziness and the theory was called Fuzzy Set Theory. Fuzziness can be defined as the vagueness concerning the semantic meaning of events, phenomena or statements themselves. It is particularly frequent in all areas in which human judgment, evaluation and decisions are important.

One of the major concerns in the design and implementation of fuzzy databases is efficiency, i.e. these systems must be fast enough to make interaction with human users feasible. In general, we have two feasible ways to incorporate fuzziness in databases:

1. Making fuzzy queries to classical databases
2. Adding fuzzy information to the system

Spatial Databases

Spatial databases were developed to correlate data in space. They provide answers to questions such as: How much money have we spent within a 20 mile radius of this specific location? How far has waste product extended from the spill location? How many miles away is the closest hospital to this house? Most spatial databases don't stand on their own, but instead are just an extension to relational databases.
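The radius and closest-hospital questions above can be sketched outside any database as plain distance computations. The example below is an illustration only: the function names and sample coordinates are hypothetical, distances use the haversine great-circle formula with an approximate Earth radius, and a real spatial database would answer such questions with built-in spatial functions and spatial indexes rather than a linear scan.

```python
import math
from typing import Dict, List, Tuple

EARTH_RADIUS_MILES = 3958.8  # approximate mean Earth radius

Location = Tuple[float, float]  # (latitude, longitude) in degrees


def distance_miles(a: Location, b: Location) -> float:
    """Great-circle distance between two points (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def within_radius(points: Dict[str, Location], center: Location,
                  radius: float) -> List[str]:
    """Names of all points no farther than radius miles from center."""
    return sorted(name for name, loc in points.items()
                  if distance_miles(center, loc) <= radius)


def nearest(points: Dict[str, Location], origin: Location) -> str:
    """Name of the point closest to origin."""
    return min(points, key=lambda name: distance_miles(origin, points[name]))


# Hypothetical sample data.
hospitals = {"A": (40.0, -75.0), "B": (41.0, -75.0)}
house = (40.1, -75.0)
closest = nearest(hospitals, house)
nearby = within_radius(hospitals, house, 20.0)
```

A scan like this examines every stored point for every question, which is the cost that the specialized spatial fields and indexes of a spatial database are meant to avoid.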
Spatial extensions typically follow the Simple Features Specification for SQL (SFSQL), which simply adds spatial functions to SQL, such as distance, touches, centroid, inside, area, and extent. In fact, most spatial databases store spatial data in relational databases, but in specialized fields used to hold spatial data. Examples of spatial
databases are Oracle Spatial (which sits on top of Oracle), ESRI ArcSDE (which can sit on top of a Microsoft SQL Server or Oracle database), PostGIS (which sits on top of PostgreSQL), and DB2 Spatial Extender (which adds spatial functionality to IBM DB2 databases); even MySQL is providing limited functionality for spatial data in its upcoming 4.1 version.

On-line Analytical Processing Databases (OLAP)

OLAP databases are geared toward analyzing data rather than updating data. They are used to drive business processes based on statistical analysis of data and what-if analysis. The main features of OLAP databases are speed of querying and multi-dimensionality. Most real OLAP databases allow you to slice data along an arbitrary number of dimensions, e.g. by time, product line, and sales group. These databases are fed most often by relational databases. Many OLAP databases have their own query language specifically designed to deal with the multi-dimensionality of OLAP data. One example is Microsoft SQL Server's Analysis Services, which uses an SQL-like query language called Multidimensional Expressions (MDX).

SUMMARY

New types of applications have created a situation where the existing database server technology is no longer sufficient. Both the size and the complexity of databases have dramatically increased, forcing database researchers to develop new data and transaction models. The result of this research is now beginning to have an impact on commercially available database server technology. High-volume database applications are already a reality in many organizations today and parallel database technology is beginning to be used to address the requirements of such applications. Complex object databases and mobile databases are likely to become significant areas of activity in the future. It is now clear that object-oriented features will play an important role in future database applications.
Object-oriented database techniques are necessary to accommodate complex objects and unstructured types of information. The OO approach is also highly suited for database operations in distributed environments. It is equally clear today that the OO approach is not a universal solution for all data management requirements. Relational technology is capable of further evolution as evidenced, for example, by the implementation of parallel database operations in SQL. Future database systems are most likely to include a combination of relational and object technologies covering a range of application requirements. It is likely that many of the desirable OO features will be incorporated into relational DBMS systems along the lines of the SQL3 ISO draft standard. Future versions of
relational products will support user-defined data types, allowing database users to define arbitrarily complex data types and corresponding methods. Support for inheritance will improve the reusability and reliability of database and application objects. The process of incorporating OO features into SQL will however take some time, and the full impact of SQL3 is unlikely to affect the user community before the end of the decade. Of more immediate practical benefit to database users will be improved reliability, security and performance of database server technology in distributed environments, and performance gains related to advances in parallel database computing.

KEY WORDS/ABBREVIATIONS

Object-oriented database management system: A system based on a data model in which data is stored in the form of objects, which are instances of classes. These classes and objects together make up an object-oriented data model.
Object structure: The structure of an object refers to the properties that an object is made up of. Each of these properties is referred to as an attribute. Thus, an object is a real-world entity with certain attributes that make up the object structure.
Messages: A message provides an interface or acts as a communication medium between an object and the outside world.
Update message: If the invoked method changes the value of a variable, then the invoking message is said to be an update message.
Methods: When a message is passed, the body of code that is executed is known as a method.
Read-only method: When the value of a variable is not affected by a method, it is known as a read-only method.
Update method: When the value of a variable is changed by a method, it is known as an update method.
Variables: Variables store the data of an object. The data stored in the variables makes objects distinguishable from one another.
Inheritance mechanism: A mechanism that allows a class to inherit properties (attributes and methods) from its superclasses.