
Data Abstraction

1. The major purpose of a database system is to provide users with an abstract view of the system. The system hides certain details of how data is stored, created and maintained. Complexity should be hidden from database users.
2. There are several levels of abstraction:
   1. Physical Level:
      o How the data are stored. E.g. index, B-tree, hashing.
      o Lowest level of abstraction.
      o Complex low-level structures described in detail.
   2. Conceptual Level:
      o Next highest level of abstraction.
      o Describes what data are stored.
      o Describes the relationships among data.
      o Database administrator level.
   3. View Level:
      o Highest level.
      o Describes part of the database for a particular group of users.
      o Can be many different views of a database.
      o E.g. tellers in a bank get a view of customer accounts, but not of payroll data.
Fig. 1.1 (figure 1.1 in the text) illustrates the three levels.

Figure 1.1: The three levels of data abstraction

The E-R Model

1. The entity-relationship model is based on a perception of the world as consisting of a collection of basic objects (entities) and relationships among these objects.
   o An entity is a distinguishable object that exists.
   o Each entity has associated with it a set of attributes describing it. E.g. number and balance for an account entity.
   o A relationship is an association among several entities. E.g. a cust_acct relationship associates a customer with each account he or she has.
   o The set of all entities or relationships of the same type is called the entity set or relationship set.
   o Another essential element of the E-R diagram is the mapping cardinalities, which express the number of entities to which another entity can be associated via a relationship set.
   We'll see later how well this model works to describe real world situations.
2. The overall logical structure of a database can be expressed graphically by an E-R diagram:
   o rectangles: represent entity sets.
   o ellipses: represent attributes.
   o diamonds: represent relationships among entity sets.
   o lines: link attributes to entity sets and entity sets to relationships.
   See figure 1.2 for an example.

Figure 1.2: A sample E-R diagram.

The Object-Oriented Model

1. The object-oriented model is based on a collection of objects, like the E-R model.
   o An object contains values stored in instance variables within the object.
   o Unlike the record-oriented models, these values are themselves objects.
   o Thus objects contain objects to an arbitrarily deep level of nesting.
   o An object also contains bodies of code that operate on the object.
   o These bodies of code are called methods.
   o Objects that contain the same types of values and the same methods are grouped into classes.
   o A class may be viewed as a type definition for objects.
   o Analogy: the programming language concept of an abstract data type.
   o The only way in which one object can access the data of another object is by invoking a method of that other object.
   o This is called sending a message to the object.
   o Internal parts of the object, the instance variables and method code, are not visible externally.
   o Result is two levels of data abstraction.
   For example, consider an object representing a bank account.
   o The object contains instance variables number and balance.
   o The object contains a method pay-interest which adds interest to the balance.
   o Under most data models, changing the interest rate entails changing code in application programs.
   o In the object-oriented model, this only entails a change within the pay-interest method.
2. Unlike entities in the E-R model, each object has its own unique identity, independent of the values it contains:
   o Two objects containing the same values are distinct.
   o Distinction is created and maintained at the physical level by assigning distinct object identifiers.
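The bank-account example can be sketched in Python. This is only an illustration under assumptions: the class name Account, the method names, the 5% rate and the sample values are hypothetical and do not come from the text.

```python
class Account:
    """A bank account object; all names and the rate are illustrative."""

    INTEREST_RATE = 0.05  # assumed rate; changing it touches only this class

    def __init__(self, number, balance):
        self._number = number    # instance variable, hidden by convention
        self._balance = balance  # instance variable, hidden by convention

    def pay_interest(self):
        """Method: add interest to the balance."""
        self._balance += self._balance * self.INTEREST_RATE

    def balance(self):
        """Other objects read the balance by sending this message."""
        return self._balance


a = Account(101, 200.0)
b = Account(101, 200.0)   # same values as a, but a different object
a.pay_interest()
print(a.balance())
print(a is b)             # identity is independent of the stored values
```

Application code only ever calls pay_interest, so a change to the rate stays inside the class. The identity test shows that two objects holding equal values are still distinct.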


Data Independence
1. The ability to modify a scheme definition in one level without affecting a scheme definition in a higher level is called data independence.
2. There are two kinds:
   o Physical data independence
     The ability to modify the physical scheme without causing application programs to be rewritten.
     Modifications at this level are usually to improve performance.
   o Logical data independence
     The ability to modify the conceptual scheme without causing application programs to be rewritten.
     Usually done when the logical structure of the database is altered.
3. Logical data independence is harder to achieve, as the application programs are usually heavily dependent on the logical structure of the data. An analogy is made to abstract data types in programming languages.

Data Definition Language (DDL)


1. Used to specify a database scheme as a set of definitions expressed in a DDL.
2. DDL statements are compiled, resulting in a set of tables stored in a special file called a data dictionary or data directory.
3. The data directory contains metadata (data about data).
4. The storage structure and access methods used by the database system are specified by a set of definitions in a special type of DDL called a data storage and definition language.
5. Basic idea: hide implementation details of the database schemes from the users.

Data Manipulation Language (DML)


1. Data Manipulation is:
   o retrieval of information from the database
   o insertion of new information into the database
   o deletion of information in the database
   o modification of information in the database
2. A DML is a language which enables users to access and manipulate data. The goal is to provide efficient human interaction with the system.
3. There are two types of DML:
   o procedural: the user specifies what data is needed and how to get it
   o nonprocedural: the user only specifies what data is needed
     Easier for the user.
     May not generate code as efficient as that produced by procedural languages.
4. A query language is a portion of a DML involving information retrieval only. The terms DML and query language are often used synonymously.

Entities and Entity Sets


An entity is an object that exists and is distinguishable from other objects. For instance, John Harris with S.I.N. 890-123-456 is an entity, as he can be uniquely identified as one particular person in the universe.

An entity may be concrete (a person or a book, for example) or abstract (like a holiday or a concept).

An entity set is a set of entities of the same type (e.g., all persons having an account at a bank).

Entity sets need not be disjoint. For example, the entity set employee (all employees of a bank) and the entity set customer (all customers of the bank) may have members in common.

An entity is represented by a set of attributes.
o E.g. name, S.I.N., street, city for the "customer" entity.
o The domain of the attribute is the set of permitted values (e.g. a telephone number must consist of seven digits).

Formally, an attribute is a function which maps an entity set into a domain.
o Every entity is described by a set of (attribute, data value) pairs.
o There is one pair for each attribute of the entity set.
o E.g. a particular customer entity is described by the set {(name, Harris), (S.I.N., 890-123-456), (street, North), (city, Georgetown)}.

An analogy can be made with the programming language notion of type definition.

The concept of an entity set corresponds to the programming language type definition. A variable of a given type has a particular value at a point in time. Thus, a programming language variable corresponds to an entity in the E-R model.
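The analogy with type definitions can be made concrete with a small Python sketch. The class name Customer and the field name sin are illustrative assumptions; the attribute values are the ones used for the Harris entity above.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class Customer:
    """Plays the role of the type definition behind the customer entity set."""
    name: str
    sin: str
    street: str
    city: str


# One entity: a value of that type at a point in time.
harris = Customer("Harris", "890-123-456", "North", "Georgetown")

# The same entity viewed as a set of (attribute, data value) pairs,
# one pair per attribute of the entity set.
pairs = set(asdict(harris).items())
print(sorted(pairs))
```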

Figure 2-1 shows two entity sets. We will be dealing with five entity sets in this section:

o branch, the set of all branches of a particular bank. Each branch is described by the attributes branch-name, branch-city and assets.
o customer, the set of all people having an account at the bank. Attributes are customer-name, S.I.N., street and customer-city.
o employee, with attributes employee-name and phone-number.
o account, the set of all accounts created and maintained in the bank. Attributes are account-number and balance.
o transaction, the set of all account transactions executed in the bank. Attributes are transaction-number, date and amount.

Relationships & Relationship Sets


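A relationship is an association between several entities. A relationship set is a set of relationships of the same type. Formally it is a mathematical relation on $n \ge 2$ (possibly non-distinct) sets. If $E_1, E_2, \ldots, E_n$ are entity sets, then a relationship set $R$ is a subset of

$$\{(e_1, e_2, \ldots, e_n) \mid e_1 \in E_1, e_2 \in E_2, \ldots, e_n \in E_n\}$$

where $(e_1, e_2, \ldots, e_n)$ is a relationship.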

For example, consider the two entity sets customer and account (Fig. 2.1 in the text). We define the relationship CustAcct to denote the association between customers and their accounts. This is a binary relationship set (see Figure 2.2 in the text).

Going back to our formal definition, the relationship set CustAcct is a subset of all the possible customer and account pairings. Occasionally there are relationships involving more than two entity sets.

The role of an entity is the function it plays in a relationship. For example, the relationship works-for could be ordered pairs of employee entities. The first employee takes the role of manager, and the second one takes the role of worker.

A relationship may also have descriptive attributes. For example, date (last date of account access) could be an attribute of the CustAcct relationship set.
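The formal definition can be checked on a toy instance. The following Python sketch uses made-up customers, account numbers and dates purely for illustration; only the names CustAcct and date come from the notes.

```python
from itertools import product

# Hypothetical entity sets, each entity represented by its identifier.
customers = {"Harris", "Smith"}
accounts = {259, 630, 401}

# All possible (customer, account) pairings: the Cartesian product.
all_pairings = set(product(customers, accounts))

# The CustAcct relationship set is some subset of those pairings.
cust_acct = {("Harris", 259), ("Smith", 630), ("Smith", 401)}
print(cust_acct <= all_pairings)   # subset test

# A descriptive attribute (date of last access) belongs to the
# relationship itself, so it is keyed by the pair, not by either entity.
last_access = {
    ("Harris", 259): "1990-05-24",
    ("Smith", 630): "1990-05-17",
    ("Smith", 401): "1990-05-23",
}
```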

Attributes

It is possible to define a set of entities and the relationships among them in a number of different ways. The main difference is in how we deal with attributes.

o Consider the entity set employee with attributes employee-name and phone-number.
o We could argue that the phone be treated as an entity itself, with attributes phone-number and location.
o Then we have two entity sets, and the relationship set EmpPhn defining the association between employees and their phones.
o This new definition allows employees to have several (or zero) phones.
o The new definition may more accurately reflect the real world.
o We cannot extend this argument easily to making employee-name an entity.

The question of what constitutes an entity and what constitutes an attribute depends mainly on the structure of the real world situation being modeled, and the semantics associated with the attribute in question.

Mapping Constraints
An E-R scheme may define certain constraints to which the contents of a database must conform.

Mapping Cardinalities: express the number of entities to which another entity can be associated via a relationship. For binary relationship sets between entity sets A and B, the mapping cardinality must be one of:
1. One-to-one: An entity in A is associated with at most one entity in B, and an entity in B is associated with at most one entity in A. (Figure 2.3)
2. One-to-many: An entity in A is associated with any number in B. An entity in B is associated with at most one entity in A. (Figure 2.4)
3. Many-to-one: An entity in A is associated with at most one entity in B. An entity in B is associated with any number in A. (Figure 2.5)
4. Many-to-many: Entities in A and B are associated with any number from each other. (Figure 2.6)
The appropriate mapping cardinality for a particular relationship set depends on the real world being modeled. (Think about the CustAcct relationship...)
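The four cases can be told apart mechanically for a given set of (a, b) pairs. The sketch below is a hypothetical helper, not part of the notes. It reports the tightest cardinality that the current data happens to satisfy, whereas a mapping cardinality is really a constraint on every legal instance.

```python
from collections import Counter


def mapping_cardinality(pairs):
    """Classify a binary relationship instance, read from A to B."""
    pairs = set(pairs)
    a_counts = Counter(a for a, _ in pairs)  # number of B entities per A entity
    b_counts = Counter(b for _, b in pairs)  # number of A entities per B entity
    a_has_many = any(n > 1 for n in a_counts.values())
    b_has_many = any(n > 1 for n in b_counts.values())
    if not a_has_many and not b_has_many:
        return "one-to-one"
    if a_has_many and not b_has_many:
        return "one-to-many"
    if not a_has_many and b_has_many:
        return "many-to-one"
    return "many-to-many"


# A customer with two accounts, each account held by one customer.
print(mapping_cardinality({("c1", "a1"), ("c1", "a2"), ("c2", "a3")}))
```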

Existence Dependencies: if the existence of entity X depends on the existence of entity Y, then X is said to be existence dependent on Y. (Or we say that Y is the dominant entity and X is the subordinate entity.) For example,
o Consider account and transaction entity sets, and a relationship log between them.
o This is one-to-many from account to transaction.
o If an account entity is deleted, its associated transaction entities must also be deleted.
o Thus account is dominant and transaction is subordinate.

Keys
Differences between entities must be expressed in terms of attributes.

A superkey is a set of one or more attributes which, taken collectively, allow us to identify uniquely an entity in the entity set. For example, in the entity set customer, customer-name and S.I.N. together form a superkey. Note that customer-name alone is not a superkey, as two customers could have the same name.

A superkey may contain extraneous attributes, and we are often interested in the smallest superkey. A superkey for which no proper subset is a superkey is called a candidate key. In the example above, S.I.N. is a candidate key, as it is minimal and uniquely identifies a customer entity.

A primary key is a candidate key (there may be more than one) chosen by the DB designer to identify entities in an entity set.

An entity set that does not possess sufficient attributes to form a primary key is called a weak entity set. One that does have a primary key is called a strong entity set. For example,

The entity set transaction has attributes transaction-number, date and amount. Different transactions on different accounts could share the same number. These are not sufficient to form a primary key (uniquely identify a transaction). Thus transaction is a weak entity set.

For a weak entity set to be meaningful, it must be part of a one-to-many relationship set. This relationship set should have no descriptive attributes. (Why?) The idea of strong and weak entity sets is related to the existence dependencies seen earlier.

A member of a strong entity set is a dominant entity. A member of a weak entity set is a subordinate entity.

A weak entity set does not have a primary key, but we need a means of distinguishing among the entities. The discriminator of a weak entity set is a set of attributes that allows this distinction to be made. The primary key of a weak entity set is formed by taking the primary key of the strong entity set on which its existence depends (see Mapping Constraints) plus its discriminator. To illustrate:

transaction is a weak entity. It is existence-dependent on account. The primary key of account is account-number. transaction-number distinguishes transaction entities within the same account (and is thus the discriminator).

So the primary key for transaction would be (account-number, transaction-number).

Just Remember: The primary key of a weak entity is found by taking the primary key of the strong entity on which it is existence-dependent, plus the discriminator of the weak entity set.
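A small Python check makes the rule concrete. The transaction rows below are invented for illustration, and is_unique is a hypothetical helper. It shows that transaction-number alone does not identify a transaction, while the pair (account-number, transaction-number) does.

```python
# Rows are (account-number, transaction-number, date, amount); data is made up.
transactions = [
    (259, 1, "1990-05-11", 50),
    (259, 2, "1990-05-23", -30),
    (630, 1, "1990-05-17", 70),   # same transaction number, different account
]


def is_unique(rows, columns):
    """True if no two rows agree on all of the given column positions."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys) == len(set(keys))


discriminator_only = is_unique(transactions, [1])   # transaction-number
full_key = is_unique(transactions, [0, 1])          # account-number + discriminator
print(discriminator_only, full_key)
```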

Primary Keys for Relationship Sets


The attributes of a relationship set are the attributes that comprise the primary keys of the entity sets involved in the relationship set. For example:

S.I.N. is the primary key of customer, and account-number is the primary key of account. The attributes of the relationship set custacct are then (account-number, S.I.N.).

This is enough information to enable us to relate an account to a person. If the relationship has descriptive attributes, those are also included in its attribute set. For example, we might add the attribute date to the above relationship set, signifying the date of last access to an account by a particular customer. Note that this attribute cannot instead be placed in either entity set, as it relates to both a customer and an account, and the relationship is many-to-many.

The primary key of a relationship set $R$ depends on the mapping cardinality and the presence of descriptive attributes. With no descriptive attributes:

o many-to-many: all attributes in $R$.
o one-to-many: primary key for the "many" entity.

Descriptive attributes may be added, depending on the mapping cardinality and the semantics involved (see text).

The Entity Relationship Diagram


We can express the overall logical structure of a database graphically with an E-R diagram. Its components are:

o rectangles representing entity sets.
o ellipses representing attributes.
o diamonds representing relationship sets.
o lines linking attributes to entity sets and entity sets to relationship sets.

In the text, lines may be directed (have an arrow on the end) to signify mapping cardinalities for relationship sets. Figures 2.8 to 2.10 show some examples.

Figure 2.7: An E-R diagram

Figure 2.8: One-to-many from customer to account

Figure 2.9: Many-to-one from customer to account

Figure 2.10: One-to-one from customer to account

Go back and review mapping cardinalities. They express the number of entities to which an entity can be associated via a relationship. The arrow positioning is simple once you get it straight in your mind, so do some examples. Think of the arrow head as pointing to the entity that "one" refers to.

Other Styles of E-R Diagram


The text uses one particular style of diagram. Many variations exist.

Some of the variations you will see are:

o Diamonds being omitted - a link between entities indicates a relationship.
  o Fewer symbols, clearer picture.
  o What happens with descriptive attributes?
  o In this case, we have to create an intersection entity to possess the attributes.
o Numbers instead of arrowheads indicating cardinality.
  o Symbols 1, n and m used.
  o E.g. 1 to 1, 1 to n, n to m.
  o Easier to understand than arrowheads.
o A range of numbers indicating optionality of relationship. (See Elmasri & Navathe, p 58.)
  o E.g. (0,1) indicates minimum zero (optional), maximum 1.
  o Can also use (0,n), (1,1) or (1,n).
  o Typically used on near end of link - confusing at first, but gives more information.
  o E.g. entity 1 (0,1) -- (1,n) entity 2 indicates that entity 1 is related to between 0 and 1 occurrences of entity 2 (optional).
  o Entity 2 is related to at least 1 and possibly many occurrences of entity 1 (mandatory).
o Multivalued attributes may be indicated in some manner.
  o Means attribute can have more than one value.
  o E.g. hobbies.
  o Has to be normalized later on.
o Extended E-R diagrams allowing more details/constraints in the real world to be recorded. (See Elmasri & Navathe, chapter 21.)
  o Composite attributes.
  o Derived attributes.
  o Subclasses and superclasses.
  o Generalization and specialization.

Roles in E-R Diagrams

The function that an entity plays in a relationship is called its role. Roles are normally implicit and not specified. They are useful when the meaning of a relationship set needs clarification. For example, the entity sets of a relationship may not be distinct. The relationship works-for might be ordered pairs of employees (first is manager, second is worker). In the E-R diagram, this can be shown by labelling the lines connecting entities (rectangles) to relationships (diamonds). (See figure 2.11.)

Figure 2.11: E-R diagram with role indicators

Weak Entity Sets in E-R Diagrams

A weak entity set is indicated by a doubly-outlined box. For example, the previously-mentioned weak entity set transaction is dependent on the strong entity set account via the relationship set log. Figure 2.12 shows this example.

Figure 2.12: E-R diagram with a weak entity set

Nonbinary Relationships

Non-binary relationships can easily be represented. Figure 2.13 shows an example.

Figure 2.13: E-R diagram with a ternary relationship

This E-R diagram says that a customer may have several accounts, each located in a specific bank branch, and that an account may belong to several different customers.

Reducing E-R Diagrams to Tables


A database conforming to an E-R diagram can be represented by a collection of tables. We'll use the E-R diagram of Figure 2.14 as our example.

Figure 2.14: E-R diagram with strong and weak entity sets

For each entity set and relationship set, there is a unique table which is assigned the name of the corresponding set. Each table has a number of columns with unique names. (E.g. Figs. 2.14 - 2.18 in the text).

Representation of Strong Entity Sets


We use a table with one column for each attribute of the set. Each row in the table corresponds to one entity of the entity set. For the entity set account, see the table of figure 2.14. We can add, delete and modify rows (to reflect changes in the real world).

A row of a table will consist of an n-tuple where n is the number of attributes. Actually, the table contains a subset of the set of all possible rows. We refer to the set of all possible rows as the cartesian product of the sets of all attribute values. We may denote this as

$$D_1 \times D_2$$

for the account table, where $D_1$ and $D_2$ denote the set of all account numbers and all account balances, respectively.

In general, for a table of n columns, we may denote the cartesian product of $D_1, D_2, \ldots, D_n$ by

$$D_1 \times D_2 \times \cdots \times D_n$$

Representation of Weak Entity Sets


For a weak entity set, we add columns to the table corresponding to the primary key of the strong entity set on which the weak set is dependent. For example, the weak entity set transaction has three attributes: transaction-number, date and amount. The primary key of account (on which transaction depends) is account-number. This gives us the table of figure 2.16.

Representation of Relationship Sets


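Let $R$ be a relationship set involving entity sets $E_1, E_2, \ldots, E_m$.

The table corresponding to the relationship set $R$ has the following attributes:

$$\bigcup_{i=1}^{m} \text{primary-key}(E_i)$$

If the relationship has k descriptive attributes $a_1, a_2, \ldots, a_k$, we add them too:

$$\bigcup_{i=1}^{m} \text{primary-key}(E_i) \cup \{a_1, a_2, \ldots, a_k\}$$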

An example:

The relationship set CustAcct involves the entity sets customer and account. Their respective primary keys are S.I.N. and account-number. CustAcct also has a descriptive attribute, date. This gives us the table of figure 2.17.

Non-binary Relationship Sets

The ternary relationship of Figure 2.13 gives us the table of figure 2.18. As required, we take the primary keys of each entity set. There are no descriptive attributes in this example.

Linking a Weak to a Strong Entity

These relationship sets are many-to-one, and have no descriptive attributes. The primary key of the weak entity set is the primary key of the strong entity set it is existence-dependent on, plus its discriminator. The table for the relationship set would have the same attributes, and is thus redundant.
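The reduction rules above can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions: the SQL column names, the table name account_transaction and the sample rows are illustrative choices, not the tables of the text's figures. The weak entity set carries the primary key of account, so no separate table is created for the log relationship set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
conn.executescript("""
CREATE TABLE account (                 -- strong entity set
    account_number INTEGER PRIMARY KEY,
    balance        INTEGER
);
CREATE TABLE customer (                -- strong entity set
    sin           TEXT PRIMARY KEY,
    customer_name TEXT,
    street        TEXT,
    customer_city TEXT
);
CREATE TABLE cust_acct (               -- relationship set: both primary keys + date
    sin            TEXT    REFERENCES customer(sin),
    account_number INTEGER REFERENCES account(account_number),
    date           TEXT,
    PRIMARY KEY (sin, account_number)
);
CREATE TABLE account_transaction (     -- weak entity set: owner key + discriminator
    account_number     INTEGER REFERENCES account(account_number) ON DELETE CASCADE,
    transaction_number INTEGER,
    date               TEXT,
    amount             INTEGER,
    PRIMARY KEY (account_number, transaction_number)
);
""")

conn.execute("INSERT INTO account VALUES (259, 1000)")
conn.executemany(
    "INSERT INTO account_transaction VALUES (?, ?, ?, ?)",
    [(259, 1, "1990-05-11", 50), (259, 2, "1990-05-23", -30)],
)

# Deleting the dominant entity removes its subordinate entities as well.
conn.execute("DELETE FROM account WHERE account_number = 259")
remaining = conn.execute("SELECT COUNT(*) FROM account_transaction").fetchone()[0]
print(remaining)
```

The ON DELETE CASCADE clause is one way to express the existence dependency of transaction on account. Other designs, such as rejecting the delete, are equally possible.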

Generalization
Consider extending the entity set account by classifying accounts as being either savings-account or chequing-account. Each of these is described by the attributes of account plus additional attributes. (savings has interest-rate and chequing has overdraft-amount.) We can express the similarities between the entity sets by generalization. This is the process of forming containment relationships between a higher-level entity set and one or more lower-level entity sets. In E-R diagrams, generalization is shown by a triangle, as shown in Figure 2.19.

Figure 2.19: Generalization


o Generalization hides differences and emphasizes similarities.
o Distinction made through attribute inheritance.
o Attributes of higher-level entity are inherited by lower-level entities.
o Two methods for conversion to a table form:
  o Create a table for the high-level entity, plus tables for the lower-level entities containing also their specific attributes.
  o Create only tables for the lower-level entities.

Aggregation
The E-R model cannot express relationships among relationships. When would we need such a thing? Consider a DB with information about employees who work on a particular project and use a number of machines doing that work. We get the E-R diagram shown in Figure 2.20.

Figure 2.20: E-R diagram with redundant relationships

Relationship sets work and uses could be combined into a single set. However, they shouldn't be, as this would obscure the logical structure of this scheme. The solution is to use aggregation.

o Aggregation is an abstraction through which relationships are treated as higher-level entities.
o For our example, we treat the relationship set work and the entity sets employee and project as a higher-level entity set called work.
o Figure 2.21 shows the E-R diagram with aggregation.

Figure 2.21: E-R diagram with aggregation

Transforming an E-R diagram with aggregation into tabular form is easy. We create a table for each entity and relationship set as before. The table for relationship set uses contains a column for each attribute in the primary key of machinery and work.

Design of an E-R Database Scheme


The E-R data model provides a wide range of choice in designing a database scheme to accurately model some real-world situation. Some of the decisions to be made are

o Using a ternary relationship versus two binary relationships.
o Whether an entity set or a relationship set best fits a real-world concept.
o Whether to use an attribute or an entity set.
o Use of a strong or weak entity set.
o Appropriateness of generalization.
o Appropriateness of aggregation.

Mapping Cardinalities
The ternary relationship of Figure 2.13 could be replaced by a pair of binary relationships, as shown in Figure 2.22.

Figure 2.22: Representation of Figure 2.13 using binary relationships

However, there is a distinction between the two representations:

o In Figure 2.13, a relationship between a customer and account can be made only if there is a corresponding branch.
o In Figure 2.22, an account can be related to either a customer or a branch alone.
o The design of figure 2.13 is more appropriate, as in the banking world we expect to have an account relate to both a customer and a branch.

Use of Entity or Relationship Sets


It is not always clear whether an object is best represented by an entity set or a relationship set.

Both Figure 2.13 and Figure 2.22 show account as an entity. Figure 2.23 shows how we might model an account as a relationship between a customer and a branch.

Figure 2.23: E-R diagram with account as a relationship set


This new representation cannot model adequately the situation where customers may have joint accounts. (Why not?) If every account is held by only one customer, this method works.

Structure of Relational Database


1. A relational database consists of a collection of tables, each having a unique name. A row in a table represents a relationship among a set of values. Thus a table represents a collection of relationships. 2. There is a direct correspondence between the concept of a table and the mathematical concept of a relation. A substantial theory has been developed for relational databases.

Basic Structure
1. Figure 3.1 shows the deposit and customer tables for our banking example.

Figure 3.1: The deposit and customer relations.


o The deposit relation has four attributes.
o For each attribute there is a permitted set of values, called the domain of that attribute.
o E.g. the domain of bname is the set of all branch names.

Let $D_1$ denote the domain of bname, and $D_2$, $D_3$ and $D_4$ the remaining attributes' domains respectively.

Then, any row of deposit consists of a four-tuple $(v_1, v_2, v_3, v_4)$ where

$$v_1 \in D_1, \; v_2 \in D_2, \; v_3 \in D_3, \; v_4 \in D_4$$

In general, deposit contains a subset of the set of all possible rows. That is, deposit is a subset of

$$D_1 \times D_2 \times D_3 \times D_4$$

In general, a table of n columns must be a subset of

$$D_1 \times D_2 \times \cdots \times D_n$$

2. Mathematicians define a relation to be a subset of a Cartesian product of a list of domains. You can see the correspondence with our tables.

We will use the terms relation and tuple in place of table and row from now on.

3. Some more formalities:
   o Let the tuple variable $t$ refer to a tuple of the relation $r$.
   o We say $t \in r$ to denote that the tuple $t$ is in relation $r$.
   o Then $t$[bname] = $t$[1] = the value of $t$ on the bname attribute.
   o So $t$[bname] = $t$[1] = "Downtown",
   o and $t$[cname] = $t$[3] = "Johnson".
4. We'll also require that the domains of all attributes be indivisible units.
   o A domain is atomic if its elements are indivisible units.
   o For example, the set of integers is an atomic domain.
   o The set of all sets of integers is not.
   o Why? Integers do not have subparts, but sets do - the integers comprising them.
   o We could consider integers non-atomic if we thought of them as ordered lists of digits.
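The two ways of addressing a tuple component, by attribute name and by position, can be illustrated with a Python named tuple. The account number and balance below are made up, the field is called account because account# is not a valid Python identifier, and Python positions start at 0 where the notes count from 1.

```python
from collections import namedtuple

# Field order follows Deposit-scheme = (bname, account#, cname, balance).
Deposit = namedtuple("Deposit", ["bname", "account", "cname", "balance"])

t = Deposit("Downtown", 101, "Johnson", 500)   # hypothetical tuple of deposit

# t[bname] = t[1] in the notes corresponds to t.bname == t[0] here.
print(t.bname, t[0])
# t[cname] = t[3] in the notes corresponds to t.cname == t[2] here.
print(t.cname, t[2])
```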

Database Scheme
1. We distinguish between a database scheme (logical design) and a database instance (data in the database at a point in time).
2. A relation scheme is a list of attributes and their corresponding domains.
3. The text uses the following conventions:
   o italics for all names
   o lowercase names for relations and attributes
   o names beginning with an uppercase for relation schemes
   These notes will do the same.

For example, the relation scheme for the deposit relation:

o Deposit-scheme = (bname, account#, cname, balance)

We may state that deposit is a relation on scheme Deposit-scheme by writing deposit(Deposit-scheme). If we wish to specify domains, we can write:

o (bname: string, account#: integer, cname: string, balance: integer).

Note that customers are identified by name. In the real world, this would not be allowed, as two or more customers might share the same name.

Figure 3.2 shows the E-R diagram for a banking enterprise.

Figure 3.2: E-R diagram for the banking enterprise


4. The relation schemes for the banking example used throughout the text are:
   o Branch-scheme = (bname, assets, bcity)
   o Customer-scheme = (cname, street, ccity)
   o Deposit-scheme = (bname, account#, cname, balance)
   o Borrow-scheme = (bname, loan#, cname, amount)
   Note: some attributes appear in several relation schemes (e.g. bname, cname). This is legal, and provides a way of relating tuples of distinct relations.
5. Why not put all attributes in one relation?

Suppose we use one large relation instead of customer and deposit:

o Account-scheme = (bname, account#, cname, balance, street, ccity)
o If a customer has several accounts, we must duplicate her or his address for each account.
o If a customer has an account but no current address, we cannot build a tuple, as we have no values for the address.
o We would have to use null values for these fields.
o Null values cause difficulties in the database.
o By using two separate relations, we can do this without using null values.
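The duplication problem is easy to see on a small instance. In this Python sketch the tuples are invented for illustration, and the single large relation is built by matching deposit and customer tuples on cname.

```python
# Hypothetical instances of the two separate relations.
customer = {("Johnson", "Alma", "Palo Alto")}            # (cname, street, ccity)
deposit = {
    ("Downtown", 101, "Johnson", 500),                   # (bname, account#, cname, balance)
    ("Mianus", 215, "Johnson", 700),
}

# The single large relation on Account-scheme.
account = {
    (bname, number, cname, balance, street, ccity)
    for (bname, number, cname, balance) in deposit
    for (name, street, ccity) in customer
    if cname == name
}

# The address is now stored once per account instead of once per customer.
addresses_stored = [(row[4], row[5]) for row in account]
print(len(customer), len(addresses_stored))
```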

Keys
1. The notions of superkey, candidate key and primary key all apply to the relational model.
2. For example, in Branch-scheme,
   o {bname} is a superkey.
   o {bname, bcity} is a superkey.
   o {bname, bcity} is not a candidate key, as the superkey {bname} is contained in it.
   o {bname} is a candidate key.
   o {bcity} is not a superkey, as branches may be in the same city.
   o We will use {bname} as our primary key.
3. The primary key for Customer-scheme is {cname}.
4. More formally, if we say that a subset $K$ of $R$ is a superkey for $R$, we are restricting consideration to relations $r(R)$ in which no two distinct tuples have the same values on all attributes in $K$. In other words,
   o If $t_1$ and $t_2$ are in $r$, and
   o $t_1 \ne t_2$,
   o then $t_1[K] \ne t_2[K]$.
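The formal condition translates directly into a check on a relation instance. The Python sketch below uses hypothetical helper names and made-up branch tuples. A key is a constraint on every legal relation, so testing one instance can only refute a proposed key, never prove it.

```python
from itertools import combinations


def is_superkey(rows, key):
    """True if no two distinct tuples agree on all attributes in key."""
    seen = set()
    for row in rows:
        value = tuple(row[a] for a in key)
        if value in seen:
            return False
        seen.add(value)
    return True


def is_candidate_key(rows, key):
    """A superkey none of whose proper, non-empty subsets is a superkey."""
    if not is_superkey(rows, key):
        return False
    return not any(
        is_superkey(rows, subset)
        for size in range(1, len(key))
        for subset in combinations(key, size)
    )


# Hypothetical instance of a relation on Branch-scheme = (bname, assets, bcity).
branch = [
    {"bname": "Downtown", "assets": 9000000, "bcity": "Vancouver"},
    {"bname": "SFU", "assets": 1700000, "bcity": "Burnaby"},
    {"bname": "Lougheed", "assets": 2100000, "bcity": "Burnaby"},
]

print(is_superkey(branch, ["bname", "bcity"]))       # superkey
print(is_candidate_key(branch, ["bname", "bcity"]))  # not minimal
print(is_candidate_key(branch, ["bname"]))           # minimal
print(is_superkey(branch, ["bcity"]))                # two branches share a city
```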

Query Languages
1. A query language is a language in which a user requests information from a database. These are typically higher-level than programming languages.

   They may be one of:
   o Procedural, where the user instructs the system to perform a sequence of operations on the database. This will compute the desired information.
   o Nonprocedural, where the user specifies the information desired without giving a procedure for obtaining the information.
2. A complete query language also contains facilities to insert and delete tuples as well as to modify parts of existing tuples.

The Relational Algebra


1. The relational algebra is a procedural query language.
   o Six fundamental operations:
     select (unary)
     project (unary)
     rename (unary)
     cartesian product (binary)
     union (binary)
     set-difference (binary)
   o Several other operations, defined in terms of the fundamental operations:
     set-intersection
     natural join
     division
     assignment
   o Operations produce a new relation as a result.

Fundamental Operations
1. The Select Operation

Select selects tuples that satisfy a given predicate. Select is denoted by a lowercase Greek sigma ($\sigma$), with the predicate appearing as a subscript. The argument relation is given in parentheses following the $\sigma$. For example, to select tuples (rows) of the borrow relation where the branch is "SFU", we would write

$$\sigma_{bname = \text{"SFU"}}(borrow)$$

Let Figure 3.3 be the borrow and branch relations in the banking example.

Figure 3.3: The borrow and branch relations.

The new relation created as the result of this operation consists of one tuple: the borrow tuple in Figure 3.3 whose bname is "SFU". We allow comparisons using $=$, $\ne$, $<$, $\le$, $>$ and $\ge$ in the selection predicate. We also allow the logical connectives $\vee$ (or) and $\wedge$ (and), so a predicate may take the form $P_1 \wedge P_2$ for comparisons $P_1$ and $P_2$.

Figure 3.4: The client relation.

Suppose there is one more relation, client, shown in Figure 3.4, with the scheme

Client-scheme = (cname, banker)

we might write

$$\sigma_{cname = banker}(client)$$

to find clients who have the same name as their banker.
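Selection can be sketched in Python by treating a relation as a set of tuples and the predicate as a function. The helper name select and all tuples below are illustrative assumptions and not the contents of Figures 3.3 and 3.4. The attribute name banker for the client relation is likewise assumed.

```python
from collections import namedtuple

Borrow = namedtuple("Borrow", ["bname", "loan", "cname", "amount"])
Client = namedtuple("Client", ["cname", "banker"])


def select(relation, predicate):
    """Sigma: keep exactly the tuples that satisfy the predicate."""
    return {t for t in relation if predicate(t)}


# Made-up relation instances.
borrow = {
    Borrow("SFU", 15, "Hayes", 1500),
    Borrow("Downtown", 17, "Jones", 1000),
    Borrow("Downtown", 23, "Smith", 2000),
}
client = {Client("Turner", "Johnson"), Client("Johnson", "Johnson")}

# A single comparison.
sfu_loans = select(borrow, lambda t: t.bname == "SFU")

# Two comparisons combined with the connective "and".
big_downtown = select(borrow, lambda t: t.bname == "Downtown" and t.amount > 1200)

# A comparison between two attributes of the same tuple.
same_name = select(client, lambda t: t.cname == t.banker)

print(sfu_loans, big_downtown, same_name)
```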


2. The Project Operation

Project copies its argument relation for the specified attributes only. Since a relation is a set, duplicate rows are eliminated. Projection is denoted by the Greek capital letter pi ($\pi$). The attributes to be copied appear as subscripts. For example, to obtain a relation showing customers and branches, but ignoring amount and loan#, we write

$\pi_{bname, cname}(borrow)$

We can perform these operations on the relations resulting from other operations. To get the names of customers having the same name as their bankers,

$\pi_{cname}(\sigma_{cname = banker}(client))$

Think of select as taking rows of a relation, and project as taking columns of a relation.
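The rows-versus-columns picture can be sketched in a few lines of Python. This is only an illustration: the relation is modelled as a list of dictionaries, the helper names `select` and `project` are made up for this sketch, and the sample rows are hypothetical (they are not the contents of Figure 3.3).

```python
# Hypothetical sample rows for borrow; not the data of Figure 3.3.
borrow = [
    {"bname": "SFU", "loan#": 15, "cname": "Hayes", "amount": 1500},
    {"bname": "Downtown", "loan#": 17, "cname": "Jones", "amount": 1000},
    {"bname": "Downtown", "loan#": 23, "cname": "Hayes", "amount": 2000},
]

def select(pred, r):
    """sigma_pred(r): keep the tuples (rows) that satisfy pred."""
    return [t for t in r if pred(t)]

def project(attrs, r):
    """pi_attrs(r): keep the named columns and drop duplicate rows."""
    seen, out = set(), []
    for t in r:
        row = tuple(t[a] for a in attrs)
        if row not in seen:
            seen.add(row)
            out.append(dict(zip(attrs, row)))
    return out

# sigma_{bname = "SFU"}(borrow)
sfu_loans = select(lambda t: t["bname"] == "SFU", borrow)
# pi_{cname}(borrow): Hayes appears twice in borrow but only once in the result.
names = project(["cname"], borrow)
```

Because `project` removes duplicates, its result behaves like a set of tuples, which matches the algebra's definition.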

3. The Cartesian Product Operation

The cartesian product of two relations is denoted by a cross ($\times$), written

$r_1 \times r_2$

The result of $r_1 \times r_2$ is a new relation with a tuple for each possible pairing of tuples from $r_1$ and $r_2$.

In order to avoid ambiguity, the attribute names have attached to them the name of the relation from which they came. If no ambiguity will result, we drop the relation name.

The result is a very large relation. If $r_1$ has $n_1$ tuples, and $r_2$ has $n_2$ tuples, then $r = r_1 \times r_2$ will have $n_1 n_2$ tuples.

The resulting scheme is the concatenation of the schemes of $r_1$ and $r_2$, with relation names added as mentioned.

To find the clients of banker Johnson and the city in which they live, we need information in both client and customer relations. We can get this by writing

$\sigma_{banker = \text{``Johnson''}}(client \times customer)$

However, the customer.cname column contains customers of bankers other than Johnson. (Why?) We want rows where client.cname = customer.cname. So we can write

$\sigma_{client.cname = customer.cname}(\sigma_{banker = \text{``Johnson''}}(client \times customer))$

to get just these tuples. Finally, to get just the customer's name and city, we need a projection:

$\pi_{client.cname, ccity}(\sigma_{client.cname = customer.cname}(\sigma_{banker = \text{``Johnson''}}(client \times customer)))$

4. The Rename Operation

The rename operation solves the problems that occur with naming when performing the cartesian product of a relation with itself. Suppose we want to find the names of all the customers who live on the same street and in the same city as Smith. We can get the street and city of Smith by writing

$\pi_{street, ccity}(\sigma_{cname = \text{``Smith''}}(customer))$

To find other customers with the same information, we need to reference the customer relation again:

$\sigma_{P}(customer \times \pi_{street, ccity}(\sigma_{cname = \text{``Smith''}}(customer)))$

where $P$ is a selection predicate requiring street and ccity values to be equal.

Problem: how do we distinguish between the two street values appearing in the Cartesian product, as both come from a customer relation? Solution: use the rename operator, denoted by the Greek letter rho ($\rho$). We write

$\rho_{x}(r)$

to get the relation $r$ under the name of $x$. If we use this to rename one of the two customer relations we are using, the ambiguities will disappear.

5. The Union Operation

The union operation is denoted $\cup$ as in set theory. It returns the union (set union) of two compatible relations. For a union operation $r \cup s$ to be legal, we require that
o $r$ and $s$ must have the same number of attributes.
o The domains of the corresponding attributes must be the same.

To find all customers of the SFU branch, we must find everyone who has a loan or an account or both at the branch. We need both borrow and deposit relations for this:

$\pi_{cname}(\sigma_{bname = \text{``SFU''}}(borrow)) \cup \pi_{cname}(\sigma_{bname = \text{``SFU''}}(deposit))$

As in all set operations, duplicates are eliminated, giving the relation of Figure 3.5(a).

Figure 3.5: The union and set-difference operations.


6. The Set Difference Operation

Set difference is denoted by the minus sign ($-$). It finds tuples that are in one relation, but not in another. Thus $r - s$ results in a relation containing tuples that are in $r$ but not in $s$.

To find customers of the SFU branch who have an account there but no loan, we write

$\pi_{cname}(\sigma_{bname = \text{``SFU''}}(deposit)) - \pi_{cname}(\sigma_{bname = \text{``SFU''}}(borrow))$

The result is shown in Figure 3.5(b). We can do more with this operation. Suppose we want to find the largest account balance in the bank. Strategy:
o Find a relation $r$ containing the balances that are not the largest.
o Compute the set difference of the balances in the deposit relation and $r$.

To find $r$, we write

$r = \pi_{deposit.balance}(\sigma_{deposit.balance < d.balance}(deposit \times \rho_{d}(deposit)))$

This resulting relation contains all balances except the largest one. (See Figure 3.6(a)). Now we can finish our query by taking the set difference:

$\pi_{balance}(deposit) - r$

Figure 3.6(b) shows the result.

Figure 3.6: Find the largest account balance in the bank.
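The two-step strategy for the largest balance can be tried out with Python sets. This is a sketch under assumptions: the deposit rows are hypothetical, and pairing `deposit` with itself through `itertools.product` plays the role of the cartesian product of deposit with a renamed copy.

```python
from itertools import product

# Hypothetical deposit rows; only the balance column matters for this query.
deposit = [
    {"cname": "A", "balance": 500},
    {"cname": "B", "balance": 900},
    {"cname": "C", "balance": 900},
    {"cname": "D", "balance": 700},
]

# pi_balance(deposit)
balances = {t["balance"] for t in deposit}

# r: every balance that is smaller than some other balance,
# found by pairing each tuple t with each tuple d of the renamed copy.
not_largest = {
    t["balance"] for t, d in product(deposit, deposit)
    if t["balance"] < d["balance"]
}

# pi_balance(deposit) - r leaves only the largest balance.
largest = balances - not_largest
```

Note that the difference is taken from the full set of balances; subtracting in the other direction would always give the empty set.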

Formal Definition of Relational Algebra


1. A basic expression consists of either
o A relation in the database.
o A constant relation.
2. General expressions are formed out of smaller subexpressions $E_1$ and $E_2$ using
o $\sigma_{P}(E_1)$ select ($P$ a predicate)
o $\pi_{S}(E_1)$ project ($S$ a list of attributes)
o $\rho_{x}(E_1)$ rename ($x$ a relation name)
o $E_1 \cup E_2$ union
o $E_1 - E_2$ set difference
o $E_1 \times E_2$ cartesian product

Additional Operations
1. Additional operations are defined in terms of the fundamental operations. They do not add power to the algebra, but are useful to simplify common queries.
2. The Set Intersection Operation

Set intersection is denoted by $\cap$, and returns a relation that contains tuples that are in both of its argument relations. It does not add any power as

$r \cap s = r - (r - s)$

To find all customers having both a loan and an account at the SFU branch, we write

$\pi_{cname}(\sigma_{bname = \text{``SFU''}}(borrow)) \cap \pi_{cname}(\sigma_{bname = \text{``SFU''}}(deposit))$

3. The Natural Join Operation

Often we want to simplify queries on a cartesian product. For example, to find all customers having a loan at the bank and the cities in which they live, we need borrow and customer relations:

$\pi_{borrow.cname, ccity}(\sigma_{borrow.cname = customer.cname}(borrow \times customer))$

Our selection predicate obtains only those tuples pertaining to only one cname. This type of operation is very common, so we have the natural join, denoted by a $\bowtie$ sign. Natural join combines a cartesian product and a selection into one operation. It performs a selection forcing equality on those attributes that appear in both relation schemes. Duplicates are removed as in all relation operations. To illustrate, we can rewrite the previous query as

$\pi_{cname, ccity}(borrow \bowtie customer)$

The resulting relation is shown in Figure 3.7.

Figure 3.7: Joining borrow and customer relations.

We can now make a more formal definition of natural join.
o Consider $R$ and $S$ to be sets of attributes.
o We denote attributes appearing in both relations by $R \cap S$.
o We denote attributes in either or both relations by $R \cup S$.
o Consider two relations $r(R)$ and $s(S)$.
o The natural join of $r$ and $s$, denoted by $r \bowtie s$, is a relation on scheme $R \cup S$.
o It is a projection onto $R \cup S$ of a selection on $r \times s$ where the predicate requires $r.A = s.A$ for each attribute $A$ in $R \cap S$.

Formally,

$r \bowtie s = \pi_{R \cup S}(\sigma_{r.A_1 = s.A_1 \wedge r.A_2 = s.A_2 \wedge \cdots \wedge r.A_n = s.A_n}(r \times s))$

where $R \cap S = \{A_1, A_2, \ldots, A_n\}$.

To find the assets and names of all branches which have depositors living in Stamford, we need customer, deposit and branch relations:

$\pi_{bname, assets}(\sigma_{ccity = \text{``Stamford''}}(customer \bowtie deposit \bowtie branch))$

Note that $\bowtie$ is associative.

To find all customers who have both an account and a loan at the SFU branch:

$\pi_{cname}(\sigma_{bname = \text{``SFU''}}(borrow \bowtie deposit))$

This is equivalent to the set intersection version we wrote earlier. We see now that there can be several ways to write a query in the relational algebra. If two relations $r(R)$ and $s(S)$ have no attributes in common, then $R \cap S = \emptyset$, and $r \bowtie s = r \times s$.
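As a rough illustration of the formal definition, the natural join can be written as a nested loop that keeps a pair of tuples only when they agree on every shared attribute. The function name `natural_join` and the sample rows are assumptions made for this sketch, not part of the notes.

```python
def natural_join(r, s):
    """r natural-join s, for relations given as lists of dicts."""
    out = []
    for t in r:
        for u in s:
            common = t.keys() & u.keys()          # attributes in R intersect S
            if all(t[a] == u[a] for a in common):
                row = {**t, **u}                  # tuple on scheme R union S
                if row not in out:                # relations hold no duplicates
                    out.append(row)
    return out

# Hypothetical sample data.
borrow = [
    {"bname": "SFU", "loan#": 15, "cname": "Hayes", "amount": 1500},
    {"bname": "Downtown", "loan#": 17, "cname": "Jones", "amount": 1000},
]
customer = [
    {"cname": "Hayes", "street": "Main", "ccity": "Burnaby"},
    {"cname": "Smith", "street": "North", "ccity": "Rye"},
]

joined = natural_join(borrow, customer)
# With no shared attributes the join degenerates to the cartesian product.
crossed = natural_join([{"a": 1}, {"a": 2}], [{"b": 3}])
```

When the two schemes share no attribute, `common` is empty and every pair qualifies, which is the $r \bowtie s = r \times s$ case noted above.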

4. The Division Operation

Division, denoted $\div$, is suited to queries that include the phrase ``for all''.

Suppose we want to find all the customers who have an account at all branches located in Brooklyn. Strategy: think of it as three steps. We can obtain the names of all branches located in Brooklyn by

$r_1 = \pi_{bname}(\sigma_{bcity = \text{``Brooklyn''}}(branch))$

Figure 3.19 in the textbook shows the result. We can also find all cname, bname pairs for which the customer has an account by

$r_2 = \pi_{cname, bname}(deposit)$

Figure 3.20 in the textbook shows the result. Now we need to find all customers who appear in $r_2$ with every branch name in $r_1$.

The divide operation provides exactly those customers:

$r_2 \div r_1$

which is simply

$\pi_{cname, bname}(deposit) \div \pi_{bname}(\sigma_{bcity = \text{``Brooklyn''}}(branch))$

Formally,
o Let $r(R)$ and $s(S)$ be relations.
o Let $S \subseteq R$.
o The relation $r \div s$ is a relation on scheme $R - S$.
o A tuple $t$ is in $r \div s$ if for every tuple $t_s$ in $s$ there is a tuple $t_r$ in $r$ satisfying both of the following:

$t_r[S] = t_s[S]$
$t_r[R - S] = t[R - S]$

These conditions say that the $R - S$ portion of a tuple $t$ is in $r \div s$ if and only if there are tuples with the $R - S$ portion and the $S$ portion in $r$ for every value of the $S$ portion in relation $s$.

We will look at this explanation in class more closely. The division operation can be defined in terms of the fundamental operations.

$r \div s = \pi_{R - S}(r) - \pi_{R - S}((\pi_{R - S}(r) \times s) - r)$

Read the text for a more detailed explanation.
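A small Python sketch may make the ``for every tuple in $s$'' condition more concrete. The helper `divide`, its argument `s_attrs`, and the branch and customer names below are all hypothetical; the logic follows the two conditions of the formal definition rather than the expression built from fundamental operations.

```python
def divide(r, s, s_attrs):
    """r divided by s; s_attrs lists the attributes of s (a subset of r's)."""
    rest = [a for a in r[0] if a not in s_attrs] if r else []   # scheme R - S

    def key(t, attrs):
        return tuple(t[a] for a in attrs)

    r_rows = {(key(t, rest), key(t, s_attrs)) for t in r}
    s_rows = {key(u, s_attrs) for u in s}
    candidates = {head for head, _ in r_rows}
    # Keep a candidate only if it is paired in r with every tuple of s.
    return [
        dict(zip(rest, head)) for head in sorted(candidates)
        if all((head, tail) in r_rows for tail in s_rows)
    ]

# r1: names of the Brooklyn branches (hypothetical).
r1 = [{"bname": "Downtown"}, {"bname": "Brighton"}]
# r2: (cname, bname) pairs taken from deposit (hypothetical).
r2 = [
    {"cname": "Johnson", "bname": "Downtown"},
    {"cname": "Johnson", "bname": "Brighton"},
    {"cname": "Smith", "bname": "Downtown"},
]

all_brooklyn = divide(r2, r1, ["bname"])
```

Smith is dropped because no (Smith, Brighton) pair exists in `r2`, so the condition fails for one tuple of `r1`.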


5. The Assignment Operation

Sometimes it is useful to be able to write a relational algebra expression in parts using a temporary relation variable (as we did with $r_1$ and $r_2$ in the division example). The assignment operation, denoted $\leftarrow$, works like assignment in a programming language.

We could rewrite our division definition as

$temp \leftarrow \pi_{R - S}(r)$
$temp - \pi_{R - S}((temp \times s) - r)$

No extra relation is added to the database, but the relation variable created can be used in subsequent expressions. Assignment to a permanent relation would constitute a modification to the database.

The Tuple Relational Calculus


1. The tuple relational calculus is a nonprocedural language. (The relational algebra was procedural.) We must provide a formal description of the information desired.
2. A query in the tuple relational calculus is expressed as

$\{t \mid P(t)\}$

i.e. the set of tuples $t$ for which predicate $P$ is true.

3. We also use the notation
o $t[a]$ to indicate the value of tuple $t$ on attribute $a$.
o $t \in r$ to show that tuple $t$ is in relation $r$.

Example Queries
1. For example, to find the branch-name, loan number, customer name and amount for loans over $1200:

$\{t \mid t \in borrow \wedge t[amount] > 1200\}$

This gives us all attributes, but suppose we only want the customer names. (We would use project in the algebra.) We need to write an expression for a relation on scheme (cname).

$\{t \mid \exists s \in borrow\ (t[cname] = s[cname] \wedge s[amount] > 1200)\}$

In English, we may read this equation as ``the set of all tuples $t$ such that there exists a tuple $s$ in the relation borrow for which the values of $t$ and $s$ for the cname attribute are equal, and the value of $s$ for the amount attribute is greater than 1200.'' The notation $\exists t \in r\ (Q(t))$ means ``there exists a tuple $t$ in relation $r$ such that predicate $Q(t)$ is true''.

How did we get the above expression? We needed tuples on scheme cname such that there were tuples in borrow pertaining to that customer name with amount attribute $> 1200$. The tuples $t$ get the scheme cname implicitly as that is the only attribute $t$ is mentioned with. Let's look at a more complex example. Find all customers having a loan from the SFU branch, and the cities in which they live:

$\{t \mid \exists s \in borrow\ (t[cname] = s[cname] \wedge s[bname] = \text{``SFU''} \wedge \exists u \in customer\ (u[cname] = s[cname] \wedge t[ccity] = u[ccity]))\}$

In English, we might read this as ``the set of all (cname,ccity) tuples for which cname is a borrower at the SFU branch, and ccity is the city of cname''. Tuple variable $s$ ensures that the customer is a borrower at the SFU branch.

Tuple variable $u$ is restricted to pertain to the same customer as $s$, and also ensures that ccity is the city of the customer. The logical connectives $\wedge$ (AND) and $\vee$ (OR) are allowed, as well as $\neg$ (negation).

We also use the existential quantifier $\exists$ and the universal quantifier $\forall$. Some more examples:

1. Find all customers having a loan, an account, or both at the SFU branch:

$\{t \mid \exists s \in borrow\ (t[cname] = s[cname] \wedge s[bname] = \text{``SFU''}) \vee \exists u \in deposit\ (t[cname] = u[cname] \wedge u[bname] = \text{``SFU''})\}$

Note the use of the $\vee$ connective. As usual, set operations remove all duplicates.

2. Find all customers who have both a loan and an account at the SFU branch. Solution: simply change the $\vee$ connective in 1 to a $\wedge$.

3. Find customers who have an account, but not a loan at the SFU branch.

$\{t \mid \exists u \in deposit\ (t[cname] = u[cname] \wedge u[bname] = \text{``SFU''}) \wedge \neg \exists s \in borrow\ (t[cname] = s[cname] \wedge s[bname] = \text{``SFU''})\}$

4. Find all customers who have an account at all branches located in Brooklyn. (We used division in relational algebra.) For this example we will use implication, denoted by a pointing finger in the text, but by $\Rightarrow$ here. The formula $P \Rightarrow Q$ means $P$ implies $Q$, or, if $P$ is true, then $Q$ must be true.

$\{t \mid \forall u \in branch\ (u[bcity] = \text{``Brooklyn''} \Rightarrow \exists s \in deposit\ (t[cname] = s[cname] \wedge u[bname] = s[bname]))\}$

In English: the set of all cname tuples $t$ such that for all tuples $u$ in the branch relation, if the value of $u$ on attribute bcity is Brooklyn, then the customer has an account at the branch whose name appears in the bname attribute of $u$. Division is difficult to understand. Think it through carefully.
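The universally quantified query reads almost word for word as a Python comprehension, with `all` for the universal quantifier and `any` for the existential one. This is only a sketch: the rows are hypothetical, the implication $P \Rightarrow Q$ is written as `not P or Q`, and candidate names are drawn from deposit so that the result stays finite.

```python
# Hypothetical sample data.
branch = [
    {"bname": "Downtown", "bcity": "Brooklyn"},
    {"bname": "Brighton", "bcity": "Brooklyn"},
    {"bname": "SFU", "bcity": "Burnaby"},
]
deposit = [
    {"cname": "Johnson", "bname": "Downtown"},
    {"cname": "Johnson", "bname": "Brighton"},
    {"cname": "Smith", "bname": "Downtown"},
    {"cname": "Smith", "bname": "SFU"},
]

result = {
    t["cname"]
    for t in deposit                      # candidate customer names
    if all(                               # for all u in branch
        u["bcity"] != "Brooklyn"          # "not P" half of the implication
        or any(                           # there exists s in deposit
            s["cname"] == t["cname"] and s["bname"] == u["bname"]
            for s in deposit
        )
        for u in branch
    )
}
```

The SFU branch is not in Brooklyn, so the implication holds for it trivially and it places no requirement on a customer.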

Safety of Expressions
1. A tuple relational calculus expression may generate an infinite relation, e.g.

$\{t \mid \neg (t \in borrow)\}$

2. There are an infinite number of tuples that are not in borrow! Most of these tuples contain values that do not appear in the database.
3. Safe Tuple Expressions

We need to restrict the relational calculus a bit.
o The domain of a formula $P$, denoted dom($P$), is the set of all values referenced in $P$.
o These include values mentioned in $P$ as well as values that appear in a tuple of a relation mentioned in $P$.
o So, the domain of $P$ is the set of all values explicitly appearing in $P$ or that appear in relations mentioned in $P$.
o dom($t \in borrow \wedge t[amount] > 1200$) is the set of all values appearing in borrow, together with the constant 1200 mentioned in the formula.
o dom($\neg (t \in borrow)$) is the set of all values appearing in borrow.

We may say an expression $\{t \mid P(t)\}$ is safe if all values that appear in the result are values from dom($P$).

4. A safe expression yields a finite number of tuples as its result. Otherwise, it is called unsafe.

Expressive Power of Languages


1. The tuple relational calculus restricted to safe expressions is equivalent in expressive power to the relational algebra.

The Domain Relational Calculus


1. Domain variables take on values from an attribute's domain, rather than values for an entire tuple.

Formal Definitions
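1. An expression is of the form

$\{\langle x_1, x_2, \ldots, x_n \rangle \mid P(x_1, x_2, \ldots, x_n)\}$

where the $x_i$, $1 \leq i \leq n$, represent domain variables, and $P$ is a formula.

2. An atom in the domain relational calculus is of the following forms
o $\langle x_1, \ldots, x_n \rangle \in r$, where $r$ is a relation on $n$ attributes, and $x_i$, $1 \leq i \leq n$, are domain variables or constants.
o $x \mathrel{\Theta} y$, where $x$ and $y$ are domain variables, and $\Theta$ is a comparison operator.
o $x \mathrel{\Theta} c$, where $c$ is a constant.
3. Formulae are built up from atoms using the following rules:
o An atom is a formula.
o If $P_1$ is a formula, then so are $\neg P_1$ and $(P_1)$.
o If $P_1$ and $P_2$ are formulae, then so are $P_1 \vee P_2$, $P_1 \wedge P_2$ and $P_1 \Rightarrow P_2$.
o If $P_1(x)$ is a formula where $x$ is a domain variable, then so are $\exists x\ (P_1(x))$ and $\forall x\ (P_1(x))$.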

Example Queries
1. Find branch name, loan number, customer name and amount for loans of over $1200.

2. Find all customers who have a loan for an amount > than $1200.

3. Find all customers having a loan from the SFU branch, and the city in which they live.

4. Find all customers having a loan, an account or both at the SFU branch.

5. Find all customers who have an account at all branches located in Brooklyn.

If you find this example difficult to understand, try rewriting this expression using implication, as in the tuple relational calculus example. Here's my attempt:

I've used two letter variable names to get away from the problem of having to remember what stands for.

Modifying the Database


1. Up until now, we have looked at extracting information from the database. We also need to add, remove and change information. Modifications are expressed using the assignment operator.

Deletion
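1. Deletion is expressed in much the same way as a query. Instead of displaying, the selected tuples are removed from the database. We can only delete whole tuples.

In relational algebra, a deletion is of the form

$r \leftarrow r - E$

where $r$ is a relation and $E$ is a relational algebra query. Tuples in $r$ for which $E$ is true are deleted.

2. Some examples:

1. Delete all of Smith's account records.

$deposit \leftarrow deposit - \sigma_{cname = \text{``Smith''}}(deposit)$

2. Delete all loans with loan numbers between 1300 and 1500.

$borrow \leftarrow borrow - \sigma_{loan\# \geq 1300 \wedge loan\# \leq 1500}(borrow)$

3. Delete all accounts at Branches located in Needham.

$r_1 \leftarrow \sigma_{bcity = \text{``Needham''}}(deposit \bowtie branch)$
$r_2 \leftarrow \pi_{bname, account\#, cname, balance}(r_1)$
$deposit \leftarrow deposit - r_2$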

Insertions
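1. To insert data into a relation, we either specify a tuple, or write a query whose result is the set of tuples to be inserted. Attribute values for inserted tuples must be members of the attribute's domain.
2. An insertion is expressed by

$r \leftarrow r \cup E$

where $r$ is a relation and $E$ is a relational algebra expression.

3. Some examples:

1. To insert a tuple for Smith who has $1200 in account 9372 at the SFU branch.

$deposit \leftarrow deposit \cup \{(\text{``SFU''}, 9372, \text{``Smith''}, 1200)\}$

2. To provide all loan customers in the SFU branch with a $200 savings account.

$r_1 \leftarrow \sigma_{bname = \text{``SFU''}}(borrow)$
$r_2 \leftarrow \pi_{bname, loan\#, cname}(r_1)$
$deposit \leftarrow deposit \cup (r_2 \times \{(200)\})$

(Here the loan number serves as the number of the new account.)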

Updating
1. Updating allows us to change some values in a tuple without necessarily changing all.

We use the update operator, $\delta$, with the form

$\delta_{A \leftarrow E}(r)$

where $r$ is a relation with attribute $A$, which is assigned the value of expression $E$.

The expression $E$ is any arithmetic expression involving constants and attributes in relation $r$. Some examples:

1. To increase all balances by 5 percent.

$\delta_{balance \leftarrow balance * 1.05}(deposit)$

This statement is applied to every tuple in deposit.

2. To make two different rates of interest payment, depending on balance amount:

Note: in this example the order of the two operations is important. (Why?)
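The effect of ordering can be checked with a small Python experiment. Everything specific here is an assumption made for illustration: the threshold of 10000, the rates of 6 and 5 percent, and the sample balances are hypothetical, and integer arithmetic is used to keep the results exact.

```python
def raise_high(balances):
    # Plays the role of an update restricted to balances over the threshold:
    # delta_{balance <- balance * 1.06}(sigma_{balance > 10000}(deposit))
    return [b * 106 // 100 if b > 10000 else b for b in balances]

def raise_low(balances):
    # delta_{balance <- balance * 1.05}(sigma_{balance <= 10000}(deposit))
    return [b * 105 // 100 if b <= 10000 else b for b in balances]

deposit = [9600, 20000]          # hypothetical balances

high_then_low = raise_low(raise_high(deposit))
low_then_high = raise_high(raise_low(deposit))
```

Running the low-rate update first pushes the 9600 balance over the threshold, so the second update pays it interest again. Applying the high-rate update first avoids this, because a balance that has already been raised can never fall back under the threshold.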

Views
1. We have assumed up to now that the relations we are given are the actual relations stored in the database. 2. For security and convenience reasons, we may wish to create a personalized collection of relations for a user. 3. We use the term view to refer to any relation, not part of the conceptual model, that is made visible to the user as a ``virtual relation''. 4. As relations may be modified by deletions, insertions and updates, it is generally not possible to store views. (Why?) Views must then be recomputed for each query referring to them.

View Definition
1. A view is defined using the create view command:

create view $v$ as <query expression>

where <query expression> is any legal query expression. The view created is given the name $v$.

2. To create a view all-customer of all branches and their customers:

create view all-customer as $\pi_{bname, cname}(deposit) \cup \pi_{bname, cname}(borrow)$

3. Having defined a view, we can now use it to refer to the virtual relation it creates. View names can appear anywhere a relation name can.
4. We can now find all customers of the SFU branch by writing

$\pi_{cname}(\sigma_{bname = \text{``SFU''}}(\text{all-customer}))$

Updates Through Views and Null Values


1. Updates, insertions and deletions using views can cause problems. The modifications on a view must be transformed to modifications of the actual relations in the conceptual model of the database. 2. An example will illustrate: consider a clerk who needs to see all information in the borrow relation except amount.

Let the view loan-info be given to the clerk:

create view loan-info as $\pi_{bname, loan\#, cname}(borrow)$

3. Since a view name may appear anywhere a relation name may appear, the clerk can write:

$\text{loan-info} \leftarrow \text{loan-info} \cup \{(\text{``SFU''}, 3, \text{``Ruth''})\}$

This insertion is represented by an insertion into the actual relation borrow, from which the view is constructed. However, we have no value for amount. A suitable response would be
o Reject the insertion and inform the user.
o Insert (``SFU'',3,``Ruth'',null) into the relation.

The symbol null represents a null or place-holder value. It says the value is unknown or does not exist.
4. Another problem with modification through views: consider the view

This view lists the cities in which the borrowers of each branch live.

Now consider the insertion

Using nulls is the only possible way to do this (see Figure 3.22 in the textbook). If we do this insertion with nulls, now consider the expression the view actually corresponds to:

As comparisons involving nulls are always false, this query misses the inserted tuple. To understand why, think about the tuples that got inserted into borrow and customer. Then think about how the view is recomputed for the above query.

Relational Database Design


The goal of relational database design is to generate a set of schemas that allow us to
o Store information without unnecessary redundancy.
o Retrieve information easily (and accurately).

Pitfalls in Relational DB Design


A bad design may have several properties, including:
o Repetition of information.
o Inability to represent certain information.
o Loss of information.

Representation of Information
1. Suppose we have a schema, Lending-schema,

Lending-schema = (bname, bcity, assets, cname, loan#, amount)

and suppose an instance of the relation is Figure 7.1.

Figure 7.1: Sample lending relation.

2. A tuple t in the new relation has the following attributes:
o t[assets] is the assets for t[bname]
o t[bcity] is the city for t[bname]
o t[loan#] is the loan number made by branch t[bname] to t[cname].
o t[amount] is the amount of the loan for t[loan#]
3. If we wish to add a loan to our database, the original design would require adding a tuple to borrow:

(SFU, L-31, Turner, 1K)

4. In our new design, we need a tuple with all the attributes required for Lending-schema. Thus we need to insert

(SFU, Burnaby, 2M, Turner, L-31, 1K)

5. We are now repeating the assets and branch city information for every loan.
o Repetition of information wastes space.
o Repetition of information complicates updating.
6. Under the new design, we need to change many tuples if the branch's assets change.
7. Let's analyze this problem:
o We know that a branch is located in exactly one city.
o We also know that a branch may make many loans.
o The functional dependency bname $\to$ bcity holds on Lending-schema.
o The functional dependency bname $\to$ loan# does not.
o These two facts are best represented in separate relations.
8. Another problem is that we cannot represent the information for a branch (assets and city) unless we have a tuple for a loan at that branch.
9. Unless we use nulls, we can only have this information when there are loans, and must delete it when the last loan is paid off.

Decomposition
1. The previous example might seem to suggest that we should decompose schemas as much as possible. Careless decomposition, however, may lead to another form of bad design.
2. Consider a design where Lending-schema is decomposed into two schemas

Branch-customer-schema = (bname, bcity, assets, cname)
Customer-loan-schema = (cname, loan#, amount)

3. We construct our new relations from lending by:

branch-customer = $\pi_{bname, bcity, assets, cname}(lending)$
customer-loan = $\pi_{cname, loan\#, amount}(lending)$

Figure 7.2: The decomposed lending relation.

4. It appears that we can reconstruct the lending relation by performing a natural join on the two new relations.
5. Figure 7.3 shows what we get by computing branch-customer $\bowtie$ customer-loan.

Figure 7.3: Join of the decomposed relations.

6. We notice that there are tuples in branch-customer $\bowtie$ customer-loan that are not in lending.
7. How did this happen?
o The intersection of the two schemas is cname, so the natural join is made on the basis of equality in the cname.
o If two lendings are for the same customer, there will be four tuples in the natural join.
o Two of these tuples will be spurious - they will not appear in the original lending relation, and should not appear in the database.
o Although we have more tuples in the join, we have less information.
o Because of this, we call this a lossy or lossy-join decomposition.
o A decomposition that is not lossy-join is called a lossless-join decomposition.
o The only way we could make a connection between branch-customer and customer-loan was through cname.
8. If we decompose Lending-schema into Branch-schema and Loan-info-schema, we will not have a similar problem. Why not?

Branch-schema = (bname, bcity, assets)
Loan-info-schema = (bname, cname, loan#, amount)

o The only way we could represent a relationship between tuples in the two relations is through bname.
o This will not cause problems.
o For a given branch name, there is exactly one assets value and branch city.
9. For a given branch name, there is exactly one assets value and exactly one bcity; whereas a similar statement cannot be made for a customer name, since one customer may hold several loans with different amounts.
10. We'll make a more formal definition of lossless-join:
o Let R be a relation schema.
o A set of relation schemas $\{R_1, R_2, \ldots, R_n\}$ is a decomposition of R if

$R = R_1 \cup R_2 \cup \cdots \cup R_n$

o That is, every attribute in R appears in at least one $R_i$ for $1 \leq i \leq n$.
o Let r be a relation on R, and let

$r_i = \pi_{R_i}(r)$ for $1 \leq i \leq n$

o That is, $\{r_1, r_2, \ldots, r_n\}$ is the database that results from decomposing R into $\{R_1, R_2, \ldots, R_n\}$.
o It is always the case that:

$r \subseteq r_1 \bowtie r_2 \bowtie \cdots \bowtie r_n$

o To see why this is, consider a tuple $t \in r$.
o When we compute the relations $r_1, r_2, \ldots, r_n$, the tuple t gives rise to one tuple $t_i$ in each $r_i$.
o These n tuples combine together to regenerate t when we compute the natural join of the $r_i$.
o Thus every tuple in r appears in $r_1 \bowtie r_2 \bowtie \cdots \bowtie r_n$.
o However, in general,

$r \neq r_1 \bowtie r_2 \bowtie \cdots \bowtie r_n$

o We saw an example of this inequality in our decomposition of lending into branch-customer and customer-loan.
o In order to have a lossless-join decomposition, we need to impose some constraints on the set of possible relations.
o Let C represent a set of constraints on the database.
o A decomposition $\{R_1, R_2, \ldots, R_n\}$ of a relation schema R is a lossless-join decomposition for R if, for all relations r on schema R that are legal under C:

$r = \pi_{R_1}(r) \bowtie \pi_{R_2}(r) \bowtie \cdots \bowtie \pi_{R_n}(r)$

11. In other words, a lossless-join decomposition is one in which, for any legal relation r, if we decompose r and then ``recompose'' r, we get what we started with - no more and no less.
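The difference between the two decompositions can be seen by decomposing and recomposing a tiny relation in Python. This is a sketch only: the first lending row reuses the tuple given earlier in these notes, the second row is hypothetical, and `project` and `natural_join` are helper names introduced for the illustration.

```python
def project(attrs, r):
    """pi_attrs(r) as a list of dicts without duplicates."""
    out = []
    for t in r:
        row = {a: t[a] for a in attrs}
        if row not in out:
            out.append(row)
    return out

def natural_join(r, s):
    """Join on equality of all attributes the two schemes share."""
    out = []
    for t in r:
        for u in s:
            if all(t[a] == u[a] for a in t.keys() & u.keys()):
                row = {**t, **u}
                if row not in out:
                    out.append(row)
    return out

lending = [
    {"bname": "SFU", "bcity": "Burnaby", "assets": "2M",
     "cname": "Turner", "loan#": "L-31", "amount": "1K"},
    # Hypothetical second loan for the same customer at another branch.
    {"bname": "Downtown", "bcity": "Vancouver", "assets": "8M",
     "cname": "Turner", "loan#": "L-17", "amount": "3K"},
]

# Lossy: the two schemas share only cname.
branch_customer = project(["bname", "bcity", "assets", "cname"], lending)
customer_loan = project(["cname", "loan#", "amount"], lending)
rejoined = natural_join(branch_customer, customer_loan)
spurious = [t for t in rejoined if t not in lending]

# Lossless: the two schemas share bname, which determines bcity and assets.
branch = project(["bname", "bcity", "assets"], lending)
loan_info = project(["bname", "cname", "loan#", "amount"], lending)
recomposed = natural_join(branch, loan_info)
```

In both cases every original tuple reappears in the join; only the first decomposition adds tuples that were never in lending.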

Normalization Using Functional Dependencies


We can use functional dependencies to design a relational database in which most of the problems we have seen do not occur. Using functional dependencies, we can define several normal forms which represent ``good'' database designs.

Desirable Properties of Decomposition


We'll take another look at the schema
Lending-schema = (bname, assets, bcity, loan#, cname, amount)

which we saw was a bad design.


The set of functional dependencies we required to hold on this schema was:

bname $\to$ assets bcity
loan# $\to$ amount bname

If we decompose it into

Branch-schema = (bname, assets, bcity)
Loan-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)

we claim this decomposition has several desirable properties.


Lossless-Join Decomposition

1. We claim the above decomposition is lossless. How can we decide whether a decomposition is lossless?
o Let R be a relation schema.
o Let F be a set of functional dependencies on R.
o Let $R_1$ and $R_2$ form a decomposition of R.
o The decomposition is a lossless-join decomposition of R if at least one of the following functional dependencies is in $F^+$:
1. $R_1 \cap R_2 \to R_1$
2. $R_1 \cap R_2 \to R_2$

Why is this true? Simply put, it ensures that the attributes involved in the natural join ($R_1 \cap R_2$) are a superkey for at least one of the two relations.

This ensures that we can never get the situation where spurious tuples are generated, as for any value on the join attributes there will be a unique tuple in one of the relations.
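The test can be sketched in Python using attribute closure: $R_1 \cap R_2 \to R_1$ is in $F^+$ exactly when $R_1$ is contained in the closure of $R_1 \cap R_2$ under F. The function names `closure` and `is_lossless` are assumptions of this sketch; the schemas and dependencies are the ones used for Lending-schema in these notes.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            # If the left side is already implied, so is the right side.
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def is_lossless(r1, r2, fds):
    """True if (r1 & r2) -> r1 or (r1 & r2) -> r2 follows from fds."""
    c = closure(r1 & r2, fds)
    return r1 <= c or r2 <= c

F = [
    ({"bname"}, {"assets", "bcity"}),
    ({"loan#"}, {"amount", "bname"}),
]

good = is_lossless({"bname", "bcity", "assets"},
                   {"bname", "cname", "loan#", "amount"}, F)
bad = is_lossless({"bname", "bcity", "assets", "cname"},
                  {"cname", "loan#", "amount"}, F)
```

The second pair is the branch-customer and customer-loan decomposition: its common attribute cname determines nothing else, so the test fails.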

2. We'll now show our decomposition is lossless-join by showing a set of steps that generate the decomposition:
o First we decompose Lending-schema into

Branch-schema = (bname, bcity, assets)
Loan-info-schema = (bname, cname, loan#, amount)

o Since bname $\to$ assets bcity, the augmentation rule for functional dependencies implies that

bname $\to$ bname assets bcity

o Since Branch-schema $\cap$ Loan-info-schema = bname, our decomposition is lossless-join.
o Next we decompose Loan-info-schema into

Loan-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)

o As loan# is the common attribute, and

loan# $\to$ amount bname

this is also a lossless-join decomposition.

Dependency Preservation

1. Another desirable property in database design is dependency preservation.
o We would like to check easily that updates to the database do not result in illegal relations being created.
o It would be nice if our design allowed us to check updates without having to compute natural joins.
o To know whether joins must be computed, we need to determine what functional dependencies may be tested by checking each relation individually.
o Let F be a set of functional dependencies on schema R.
o Let $\{R_1, R_2, \ldots, R_n\}$ be a decomposition of R.
o The restriction of F to $R_i$ is the set $F_i$ of all functional dependencies in $F^+$ that include only attributes of $R_i$.
o Functional dependencies in a restriction can be tested in one relation, as they involve attributes in one relation schema.
o The set of restrictions $F_1, F_2, \ldots, F_n$ is the set of dependencies that can be checked efficiently.
o We need to know whether testing only the restrictions is sufficient.
o Let $F' = F_1 \cup F_2 \cup \cdots \cup F_n$.
o F' is a set of functional dependencies on schema R, but in general, $F' \neq F$.
o However, it may be that $F'^+ = F^+$.
o If this is so, then every functional dependency in F is implied by F', and if F' is satisfied, then F must also be satisfied.
o A decomposition having the property that $F'^+ = F^+$ is a dependency-preserving decomposition.

2. The algorithm for testing dependency preservation follows this method:

compute $F^+$;
for each schema $R_i$ in D do
begin
  $F_i$ := the restriction of $F^+$ to $R_i$;
end
$F'$ := $\emptyset$;
for each restriction $F_i$ do
begin
  $F'$ := $F' \cup F_i$;
end
compute $F'^+$;
if ($F'^+ = F^+$) then return (true)
else return (false);

3. We can now show that our decomposition of Lending-schema is dependency preserving.
o The functional dependency

bname $\to$ assets bcity

can be tested in one relation on Branch-schema.
o The functional dependency

loan# $\to$ amount bname

can be tested in Loan-schema.

4. As the above example shows, it is often easier not to apply the algorithm shown to test dependency preservation, as computing $F^+$ takes exponential time.
5. An Easier Way To Test For Dependency Preservation

Really we only need to know whether the functional dependencies in F and not in F' are implied by those in F'. In other words, are the functional dependencies not easily checkable logically implied by those that are? Rather than compute $F^+$ and $F'^+$, and see whether they are equal, we can do this:
o Find F - F', the functional dependencies not checkable in one relation.
o See whether this set is obtainable from F' by using Armstrong's Axioms.
o This should take a great deal less work, as we have (usually) just a few functional dependencies to work on.

Use this simpler method on exams and assignments (unless you have exponential time available to you).

Repetition of Information

1. Our decomposition does not suffer from the repetition of information problem.
o Branch and loan data are separated into distinct relations.
o Thus we do not have to repeat branch data for each loan.
o If a single loan is made to several customers, we do not have to repeat the loan amount for each customer.
o This lack of redundancy is obviously desirable.
o We will see how this may be achieved through the use of normal forms.

Boyce-Codd Normal Form


1. A relation schema R is in Boyce-Codd Normal Form (BCNF) with respect to a set F of functional dependencies if for all functional dependencies in $F^+$ of the form $\alpha \to \beta$, where $\alpha \subseteq R$ and $\beta \subseteq R$, at least one of the following holds:
o $\alpha \to \beta$ is a trivial functional dependency (i.e. $\beta \subseteq \alpha$).
o $\alpha$ is a superkey for schema R.
2. A database design is in BCNF if each member of the set of relation schemas is in BCNF.
3. Let's assess our example banking design:

Customer-schema = (cname, street, ccity)
cname $\to$ street ccity

Branch-schema = (bname, assets, bcity)
bname $\to$ assets bcity

Loan-info-schema = (bname, cname, loan#, amount)
loan# $\to$ amount bname

Customer-schema and Branch-schema are in BCNF.


4. Let's look at Loan-info-schema:
o We have the non-trivial functional dependency loan# $\to$ amount, and
o loan# is not a superkey.
o Thus Loan-info-schema is not in BCNF.
o We also have the repetition of information problem.
o For each customer associated with a loan, we must repeat the branch name and amount of the loan.
o We can eliminate this redundancy by decomposing into schemas that are all in BCNF.
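For small schemas the BCNF condition can be checked by brute force: try every proper subset $\alpha$ of the schema and see whether it determines some further attribute of the schema without determining all of them. The sketch below does this with attribute closure. The names `closure` and `bcnf_violation` are hypothetical, and the search is exponential in the number of attributes, so it is meant only as an illustration.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) sets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return result

def bcnf_violation(schema, fds):
    """Return (alpha, beta) for a nontrivial alpha -> beta on schema where
    alpha is not a superkey, or None if the schema is in BCNF."""
    for k in range(1, len(schema)):
        for alpha in combinations(sorted(schema), k):
            determined = closure(alpha, fds) & schema
            if determined != set(alpha) and determined != schema:
                return set(alpha), determined - set(alpha)
    return None

F = [
    ({"bname"}, {"assets", "bcity"}),
    ({"loan#"}, {"amount", "bname"}),
]

loan_info = bcnf_violation({"bname", "cname", "loan#", "amount"}, F)
loan = bcnf_violation({"bname", "loan#", "amount"}, F)
borrow = bcnf_violation({"cname", "loan#"}, F)
```

For Loan-info-schema the search reports loan# as a determinant that is not a superkey, which is the violation described above; the two smaller schemas produce no violation.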

5. If we decompose into

Loan-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)

we have a lossless-join decomposition. (Remember why?) To see whether these schemas are in BCNF, we need to know what functional dependencies apply to them.
o For Loan-schema, we have loan# $\to$ amount bname applying.
o Only trivial functional dependencies apply to Borrow-schema.
o Thus both schemas are in BCNF.

We also no longer have the repetition of information problem. Branch name and loan amount information are not repeated for each customer in this design.
6. Now we can give a general method to generate a collection of BCNF schemas.

result := $\{R\}$;
done := false;
compute $F^+$;
while (not done) do
  if (there is a schema $R_i$ in result that is not in BCNF)
  then begin
    let $\alpha \to \beta$ be a nontrivial functional dependency that holds on $R_i$
      such that $\alpha \to R_i$ is not in $F^+$, and $\alpha \cap \beta = \emptyset$;
    result := (result $- R_i$) $\cup$ ($R_i - \beta$) $\cup$ ($\alpha, \beta$);
  end
  else done := true;

7. This algorithm generates a lossless-join BCNF decomposition. Why?
o We replace a schema $R_i$ with $(R_i - \beta)$ and $(\alpha, \beta)$.
o The dependency $\alpha \to \beta$ holds on $R_i$.
o So we have $(R_i - \beta) \cap (\alpha, \beta) = \alpha$, and thus a lossless join.
8. Let's apply this algorithm to our earlier example of poor database design:

Lending-schema = (bname, assets, bcity, loan#, cname, amount)

The set of functional dependencies we require to hold on this schema is

bname $\to$ assets bcity
loan# $\to$ amount bname

A candidate key for this schema is {loan#, cname}. We will now proceed to decompose:
o The functional dependency

bname $\to$ assets bcity

holds on Lending-schema, but bname is not a superkey.
o We replace Lending-schema with

Branch-schema = (bname, assets, bcity)
Loan-info-schema = (bname, loan#, cname, amount)

o Branch-schema is now in BCNF.
o The functional dependency

loan# $\to$ amount bname

holds on Loan-info-schema, but loan# is not a superkey.
o We replace Loan-info-schema with

Loan-schema = (bname, loan#, amount)
Borrow-schema = (cname, loan#)

o These are both now in BCNF.
o We saw earlier that this decomposition is both lossless-join and dependency-preserving.
9. Not every decomposition is dependency-preserving.
o Consider the relation schema
o o o o o o o o o o o o o o o o Banker-schema = (bname, cname, banker-name)

o o

The set F of functional dependencies is


banker-name bname banker-name cname bname

The schema is not in BCNF as banker-name is not a superkey. If we apply our algorithm, we may obtain the decomposition
Banker-branch-schema = (bname, banker-name) Cust-banker-schema = (cname, banker-name)

The decomposed schemas preserve only the first functional dependency (and the trivial ones). The closure of this dependency does not include the second one. Thus a violation of $\text{cname bname} \rightarrow \text{banker-name}$ cannot be detected unless a join is computed.
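This can be checked mechanically without computing a join. The sketch below uses the standard test that grows the left-hand side of a dependency using only closures restricted to the individual schemas; a dependency is preserved when its right-hand side is reached this way. The function names are hypothetical.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def is_preserved(lhs, rhs, decomposition, fds):
    """True if lhs -> rhs follows from dependencies checkable on single schemas."""
    result = set(lhs)
    changed = True
    while changed:
        changed = False
        for r_i in decomposition:
            # What the attributes already reached determine inside r_i.
            gained = closure(result & r_i, fds) & r_i
            if not gained <= result:
                result |= gained
                changed = True
    return rhs <= result

F = [
    (frozenset({"banker-name"}), frozenset({"bname"})),
    (frozenset({"cname", "bname"}), frozenset({"banker-name"})),
]
DECOMPOSITION = [
    frozenset({"bname", "banker-name"}),   # Banker-branch-schema
    frozenset({"cname", "banker-name"}),   # Cust-banker-schema
]

for lhs, rhs in F:
    print(sorted(lhs), "->", sorted(rhs), is_preserved(lhs, rhs, DECOMPOSITION, F))
```

Starting from {cname, bname}, neither schema contains both attributes, so nothing new is ever reached and banker-name stays out of the result.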

This shows us that not every BCNF decomposition is dependency-preserving.
52. It is not always possible to satisfy all three design goals:
o BCNF.
o Lossless join.
o Dependency preservation.
53. We can see that any BCNF decomposition of Banker-schema must fail to preserve

$\text{cname bname} \rightarrow \text{banker-name}$

56. Some Things To Note About BCNF
o There is sometimes more than one BCNF decomposition of a given schema.
o The algorithm given produces only one of these possible decompositions.
o Some of the BCNF decompositions may also yield dependency preservation, while others may not.
o Changing the order in which the functional dependencies are considered by the algorithm may change the decomposition.
o For example, try running the BCNF algorithm on

Then change the order of the last two functional dependencies and run the algorithm again. Check the two decompositions for dependency preservation.

Third Normal Form


1. When we cannot meet all three design criteria, we abandon BCNF and accept a weaker form called third normal form (3NF).
2. It is always possible to find a dependency-preserving lossless-join decomposition that is in 3NF.
3. A relation schema R is in 3NF with respect to a set F of functional dependencies if for all functional dependencies in $F^+$ of the form $\alpha \rightarrow \beta$, where $\alpha \subseteq R$ and $\beta \subseteq R$, at least one of the following holds:
o $\alpha \rightarrow \beta$ is a trivial functional dependency.
o $\alpha$ is a superkey for schema R.
o Each attribute A in $\beta - \alpha$ is contained in a candidate key for R.
4. A database design is in 3NF if each member of the set of relation schemas is in 3NF.
5. We now allow functional dependencies satisfying only the third condition. These dependencies are called transitive dependencies, and are not allowed in BCNF.
6. As every dependency on a BCNF schema satisfies one of the first two conditions, a schema in BCNF is also in 3NF.
7. BCNF is a more restrictive constraint than 3NF.
8. Our Banker-schema did not have a dependency-preserving lossless-join decomposition into BCNF. The schema was already in 3NF though (check it out).
9. We now present an algorithm for finding a dependency-preserving lossless-join decomposition into 3NF.
10. Note that we require the set F of functional dependencies to be in canonical form.

    let $F_c$ be a canonical cover for F;
    i := 0;
    for each functional dependency $\alpha \rightarrow \beta$ in $F_c$ do
        if none of the schemas $R_j$, $1 \le j \le i$, contains $\alpha\beta$
        then begin
            i := i + 1;
            $R_i$ := $\alpha\beta$;
        end
    if none of the schemas $R_j$, $1 \le j \le i$, contains a candidate key for R
    then begin
        i := i + 1;
        $R_i$ := any candidate key for R;
    end
    return ($R_1, R_2, \ldots, R_i$)

39. Each relation schema $R_i$ is in 3NF. Why? (A proof is given in [Ullman 1988].)
40. The design is dependency-preserving as a schema is built for each given dependency. Lossless-join is guaranteed by the requirement that a candidate key for R be in at least one of the schemas.
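The earlier invitation to check that Banker-schema is in 3NF but not in BCNF can be followed with a short script. This is a sketch under stated assumptions: the helper names are hypothetical, candidate keys are found by brute force, and only the dependencies in F are tested, which is enough when F is defined on the schema being checked.

```python
from itertools import combinations

def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def candidate_keys(schema, fds):
    """All minimal attribute sets whose closure covers the schema."""
    keys = []
    attrs = sorted(schema)
    for size in range(1, len(attrs) + 1):
        for combo in combinations(attrs, size):
            k = frozenset(combo)
            if any(key <= k for key in keys):
                continue  # contains a smaller key, so not minimal
            if closure(k, fds) >= schema:
                keys.append(k)
    return keys

def normal_form_checks(schema, fds):
    """Return (is_bcnf, is_3nf) for a schema on which fds is defined."""
    prime = set().union(*candidate_keys(schema, fds))
    is_bcnf = is_3nf = True
    for lhs, rhs in fds:
        if rhs <= lhs or closure(lhs, fds) >= schema:
            continue  # trivial dependency, or lhs is a superkey
        is_bcnf = False
        if not (rhs - lhs) <= prime:
            is_3nf = False  # some attribute of rhs - lhs is in no candidate key
    return is_bcnf, is_3nf

BANKER = frozenset({"bname", "cname", "banker-name"})
F = [
    (frozenset({"banker-name"}), frozenset({"bname"})),
    (frozenset({"cname", "bname"}), frozenset({"banker-name"})),
]

print([sorted(k) for k in candidate_keys(BANKER, F)])
print(normal_form_checks(BANKER, F))
```

Both {banker-name, cname} and {bname, cname} are candidate keys, so every attribute is contained in a candidate key and the third condition covers the dependency whose left side is banker-name.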
41. To review our Banker-schema, consider an extension to our example:

Banker-info-schema = (bname, cname, banker-name, office#)

The set F of functional dependencies is

$\text{banker-name} \rightarrow \text{bname office\#}$

$\text{cname bname} \rightarrow \text{banker-name}$

The for loop in the algorithm gives us the following decomposition:


Banker-office-schema = (banker-name, bname, office#)

Banker-schema = (cname, bname, banker-name)

Since Banker-schema contains a candidate key for Banker-info-schema, the process is finished.
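The 3NF algorithm can be sketched as follows. The sketch assumes that the dependencies passed in are already a canonical cover (it does not compute one), and the function names are hypothetical. "Contains a candidate key" is tested by checking whether some schema's closure covers all of R, since any such schema is a superkey and so contains a candidate key.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a list of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def any_candidate_key(schema, fds):
    """Shrink the full schema to one minimal key by dropping redundant attributes."""
    key = set(schema)
    for attr in sorted(schema):
        if closure(key - {attr}, fds) >= schema:
            key.remove(attr)
    return frozenset(key)

def synthesize_3nf(schema, canonical_cover):
    """Build one schema per dependency, then add a key schema if none holds a key."""
    schemas = []
    for lhs, rhs in canonical_cover:
        candidate = lhs | rhs
        if not any(candidate <= existing for existing in schemas):
            schemas.append(candidate)
    if not any(closure(s, canonical_cover) >= schema for s in schemas):
        schemas.append(any_candidate_key(schema, canonical_cover))
    return schemas

BANKER_INFO = frozenset({"bname", "cname", "banker-name", "office#"})
F_C = [  # assumed to be a canonical cover already
    (frozenset({"banker-name"}), frozenset({"bname", "office#"})),
    (frozenset({"cname", "bname"}), frozenset({"banker-name"})),
]

for s in synthesize_3nf(BANKER_INFO, F_C):
    print(sorted(s))
```

On this input the second schema built by the loop already determines every attribute, so the final step adds nothing, as in the walk-through above.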

Comparison of BCNF and 3NF


1. We have seen BCNF and 3NF.
o It is always possible to obtain a 3NF design without sacrificing lossless-join or dependency-preservation.
o If we do not eliminate all transitive dependencies, we may need to use null values to represent some of the meaningful relationships.
o Repetition of information occurs.
2. These problems can be illustrated with Banker-schema.
o As $\text{banker-name} \rightarrow \text{bname}$, we may want to express relationships between a banker and his or her branch.

Figure 7.4: An instance of Banker-schema.


o Figure 7.4 shows how we must either have a corresponding value for customer name, or include a null.
o Repetition of information also occurs.
o Every occurrence of the banker's name must be accompanied by the branch name.
3. If we must choose between BCNF and dependency preservation, it is generally better to opt for 3NF.
o If we cannot check for dependency preservation efficiently, we either pay a high price in system performance or risk the integrity of the data.
o The limited amount of redundancy in 3NF is then a lesser evil.
4. To summarize, our goal for a relational database design is
o BCNF.
o Lossless-join.
o Dependency-preservation.
5. If we cannot achieve this, we accept
o 3NF.
o Lossless-join.
o Dependency-preservation.
6. A final point: there is a price to pay for decomposition. When we decompose a relation, we have to use natural joins or Cartesian products to put the pieces back together. This takes computational time.
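To make the cost of putting the pieces back together concrete, here is a minimal sketch of a nested-loop natural join over Loan-schema and Borrow-schema. The sample tuples are invented for illustration and the helper names are hypothetical; a real system would use indexes or hashing and not compare every pair.

```python
def make_relation(attrs, rows):
    """A relation as a set of rows; each row is a frozenset of (attribute, value)."""
    return {frozenset(zip(attrs, row)) for row in rows}

def project(relation, attrs):
    """Projection onto attrs; duplicates disappear because the result is a set."""
    return {frozenset((a, v) for a, v in row if a in attrs) for row in relation}

def natural_join(r, s):
    """Nested-loop natural join; also returns the number of pairs compared."""
    joined = set()
    comparisons = 0
    for x in r:
        dx = dict(x)
        for y in s:
            dy = dict(y)
            comparisons += 1
            # Rows combine when they agree on every shared attribute.
            if all(dx[a] == dy[a] for a in dx.keys() & dy.keys()):
                joined.add(frozenset({**dx, **dy}.items()))
    return joined, comparisons

# Hypothetical instance of Loan-info-schema = (bname, loan#, cname, amount).
loan_info = make_relation(
    ("bname", "loan#", "cname", "amount"),
    [
        ("Downtown", 17, "Jones", 1000),
        ("Downtown", 17, "Smith", 1000),
        ("Redwood", 23, "Smith", 2000),
    ],
)

loan = project(loan_info, {"bname", "loan#", "amount"})   # Loan-schema
borrow = project(loan_info, {"cname", "loan#"})           # Borrow-schema

rebuilt, comparisons = natural_join(loan, borrow)
print(rebuilt == loan_info, comparisons)
```

The join recovers exactly the original tuples because the decomposition is lossless, but it does so by examining every pair of rows from the two pieces.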

Normalization Using Multivalued Dependencies (not to be covered)


1. Suppose that in our banking example, we had an alternative design including the schema:

BC-schema = (loan#, cname, street, ccity)

We can see this is not BCNF, as the functional dependency

$\text{cname} \rightarrow \text{street ccity}$

holds on this schema, and cname is not a superkey.
4. If we have customers who have several addresses, though, then we no longer wish to enforce this functional dependency, and the schema is in BCNF.
5. However, we now have the repetition of information problem. For each address, we must repeat the loan numbers for a customer, and vice versa.

Multivalued Dependencies
1. Functional dependencies rule out certain tuples from appearing in a relation.

If $A \rightarrow B$, then we cannot have two tuples with the same A value but different B values.
2. Multivalued dependencies do not rule out the existence of certain tuples.

Instead, they require that other tuples of a certain form be present in the relation.
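3. Let R be a relation schema, and let $\alpha \subseteq R$ and $\beta \subseteq R$. The multivalued dependency

$\alpha \twoheadrightarrow \beta$

holds on R if in any legal relation r(R), for all pairs of tuples $t_1$ and $t_2$ in r such that $t_1[\alpha] = t_2[\alpha]$, there exist tuples $t_3$ and $t_4$ in r such that:

$t_1[\alpha] = t_2[\alpha] = t_3[\alpha] = t_4[\alpha]$
$t_3[\beta] = t_1[\beta]$
$t_3[R - \beta] = t_2[R - \beta]$
$t_4[\beta] = t_2[\beta]$
$t_4[R - \beta] = t_1[R - \beta]$

4. Figure 7.5 (textbook 6.10) shows a tabular representation of this. It looks horrendously complicated, but is really rather simple. A simple example is a table with the schema (name, address, car), as shown in Figure 7.6.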
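The definition translates directly into a check over all pairs of tuples. In the sketch below only $t_3$ is constructed, because the loop also visits each pair in the opposite order, which yields $t_4$. The function name and the sample rows for (name, address, car) are assumptions made for illustration; they are not taken from Figure 7.6.

```python
def satisfies_mvd(relation, schema, alpha, beta):
    """True if alpha ->> beta holds in relation, a set of tuples ordered as schema."""
    index = {attr: i for i, attr in enumerate(schema)}
    rows = set(relation)
    for t1 in rows:
        for t2 in rows:
            if all(t1[index[a]] == t2[index[a]] for a in alpha):
                # t3 takes its beta values from t1 and everything else from t2.
                t3 = tuple(
                    t1[i] if attr in beta else t2[i]
                    for i, attr in enumerate(schema)
                )
                if t3 not in rows:
                    return False
    return True

SCHEMA = ("name", "address", "car")
cars = {
    ("Tom", "North Rd.", "Toyota"),
    ("Tom", "North Rd.", "Honda"),
    ("Tom", "Oak St.", "Toyota"),
    ("Tom", "Oak St.", "Honda"),
}

print(satisfies_mvd(cars, SCHEMA, {"name"}, {"address"}))

# Dropping one combination breaks the dependency.
incomplete = cars - {("Tom", "Oak St.", "Toyota")}
print(satisfies_mvd(incomplete, SCHEMA, {"name"}, {"address"}))
```

Every address is paired with every car for the same name in `cars`, which is what the dependency requires; `incomplete` lacks one such pairing.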

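Figure 7.5: Tabular representation of $\alpha \twoheadrightarrow \beta$

Figure 7.6: (name, address, car) where $\text{name} \twoheadrightarrow \text{address}$ and $\text{name} \twoheadrightarrow \text{car}$.

o Intuitively, $\alpha \twoheadrightarrow \beta$ says that the relationship between $\alpha$ and $\beta$ is independent of the relationship between $\alpha$ and $R - \beta$.
o If the multivalued dependency $\alpha \twoheadrightarrow \beta$ is satisfied by all relations on schema R, then we say it is a trivial multivalued dependency on schema R.
o Thus $\alpha \twoheadrightarrow \beta$ is trivial if $\beta \subseteq \alpha$ or $\beta \cup \alpha = R$.
5. Look at the example relation bc in Figure 7.7 (textbook 6.11).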

Figure 7.7: Relation bc, an example of redundancy in a BCNF relation.


o We must repeat the loan number once for each address a customer has.
o We must repeat the address once for each loan the customer has.
o This repetition is pointless, as the relationship between a customer and a loan is independent of the relationship between a customer and his or her address.
o If a customer, say "Smith", has loan number 23, we want all of Smith's addresses to be associated with that loan.
o Thus the relation of Figure 7.8 (textbook 6.12) is illegal.
o If we look at our definition of multivalued dependency, we see that we want the multivalued dependency

$\text{cname} \twoheadrightarrow \text{street ccity}$

to hold on BC-schema.

Figure 7.8: An illegal bc relation.

6. Note that if a relation r fails to satisfy a given multivalued dependency, we can construct a relation r' that does satisfy the multivalued dependency by adding tuples to r.
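A small sketch of this construction: keep adding the tuples the definition demands until nothing more is missing. The function name and the two-row bc relation are assumptions for illustration (the contents of Figure 7.8 are not reproduced here).

```python
def complete_for_mvd(relation, schema, alpha, beta):
    """Smallest superset of relation that satisfies alpha ->> beta."""
    index = {attr: i for i, attr in enumerate(schema)}
    rows = set(relation)
    changed = True
    while changed:
        changed = False
        for t1 in list(rows):
            for t2 in list(rows):
                if all(t1[index[a]] == t2[index[a]] for a in alpha):
                    # The required tuple: beta values from t1, the rest from t2.
                    t3 = tuple(
                        t1[i] if attr in beta else t2[i]
                        for i, attr in enumerate(schema)
                    )
                    if t3 not in rows:
                        rows.add(t3)
                        changed = True
    return rows

BC_SCHEMA = ("loan#", "cname", "street", "ccity")
bc = {  # hypothetical rows: each loan is recorded with only one of Smith's addresses
    (23, "Smith", "North", "Rye"),
    (27, "Smith", "Main", "Manhattan"),
}

completed = complete_for_mvd(bc, BC_SCHEMA, {"cname"}, {"street", "ccity"})
for row in sorted(completed):
    print(row)
```

The loop terminates because new tuples only recombine values that already occur in the relation, so there are finitely many candidates.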

Theory of Multivalued Dependencies


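1. We will need to compute all the multivalued dependencies that are logically implied by a given set of multivalued dependencies.
o Let D denote a set of functional and multivalued dependencies.
o The closure $D^+$ of D is the set of all functional and multivalued dependencies logically implied by D.
o We can compute $D^+$ from D using the formal definitions, but it is easier to use a set of inference rules.
2. The following set of inference rules is sound and complete. The first three rules are Armstrong's axioms from Chapter 5.
   1. Reflexivity rule: if $\alpha$ is a set of attributes and $\beta \subseteq \alpha$, then $\alpha \rightarrow \beta$ holds.
   2. Augmentation rule: if $\alpha \rightarrow \beta$ holds, and $\gamma$ is a set of attributes, then $\gamma\alpha \rightarrow \gamma\beta$ holds.
   3. Transitivity rule: if $\alpha \rightarrow \beta$ holds, and $\beta \rightarrow \gamma$ holds, then $\alpha \rightarrow \gamma$ holds.
   4. Complementation rule: if $\alpha \twoheadrightarrow \beta$ holds, then $\alpha \twoheadrightarrow R - \beta - \alpha$ holds.
   5. Multivalued augmentation rule: if $\alpha \twoheadrightarrow \beta$ holds, and $\gamma \subseteq R$ and $\delta \subseteq \gamma$, then $\gamma\alpha \twoheadrightarrow \delta\beta$ holds.
   6. Multivalued transitivity rule: if $\alpha \twoheadrightarrow \beta$ holds, and $\beta \twoheadrightarrow \gamma$ holds, then $\alpha \twoheadrightarrow \gamma - \beta$ holds.
   7. Replication rule: if $\alpha \rightarrow \beta$ holds, then $\alpha \twoheadrightarrow \beta$.
   8. Coalescence rule: if $\alpha \twoheadrightarrow \beta$ holds, and $\gamma \subseteq \beta$, and there is a $\delta$ such that $\delta \subseteq R$, and $\delta \cap \beta = \emptyset$ and $\delta \rightarrow \gamma$, then $\alpha \rightarrow \gamma$ holds.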

An example of multivalued transitivity rule is as follows. . Thus we have .

An example of coalescence rule is as follows. If we have , and .


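Let's do an example:
o Let R=(A,B,C,G,H,I) be a relation schema.
o Suppose $A \twoheadrightarrow BC$ holds.
o The definition of multivalued dependencies implies that if $t_1[A] = t_2[A]$, then there exist tuples $t_3$ and $t_4$ such that:

$t_1[A] = t_2[A] = t_3[A] = t_4[A]$
$t_3[BC] = t_1[BC]$
$t_3[GHI] = t_2[GHI]$
$t_4[BC] = t_2[BC]$
$t_4[GHI] = t_1[GHI]$

o The complementation rule states that if $A \twoheadrightarrow BC$ then $A \twoheadrightarrow GHI$.
o Tuples $t_3$ and $t_4$ satisfy $A \twoheadrightarrow GHI$ if we simply change the subscripts.

We can simplify calculating $D^+$, the closure of D, by using the following rules, derivable from the previous ones:
o Multivalued union rule: if $\alpha \twoheadrightarrow \beta$ holds and $\alpha \twoheadrightarrow \gamma$ holds, then $\alpha \twoheadrightarrow \beta\gamma$ holds.
o Intersection rule: if $\alpha \twoheadrightarrow \beta$ holds and $\alpha \twoheadrightarrow \gamma$ holds, then $\alpha \twoheadrightarrow \beta \cap \gamma$ holds.
o Difference rule: if $\alpha \twoheadrightarrow \beta$ holds and $\alpha \twoheadrightarrow \gamma$ holds, then $\alpha \twoheadrightarrow \beta - \gamma$ holds and $\alpha \twoheadrightarrow \gamma - \beta$ holds.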

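An example will help:

Let R=(A,B,C,G,H,I) with the set of dependencies:

$A \twoheadrightarrow B$
$B \twoheadrightarrow HI$
$CG \rightarrow H$

We list some members of $D^+$:
o $A \twoheadrightarrow CGHI$: since $A \twoheadrightarrow B$, complementation rule implies that $A \twoheadrightarrow R - B - A$, and R - B - A = CGHI.
o $A \twoheadrightarrow HI$: since $A \twoheadrightarrow B$ and $B \twoheadrightarrow HI$, multivalued transitivity rule implies that $A \twoheadrightarrow HI - B$, and HI - B = HI.
o $B \rightarrow H$: coalescence rule can be applied. $B \twoheadrightarrow HI$ holds, $H \subseteq HI$ and $CG \rightarrow H$ and $CG \cap HI = \emptyset$, so we can satisfy the coalescence rule with $\alpha$ being B, $\beta$ being HI, $\delta$ being CG, and $\gamma$ being H. We conclude that $B \rightarrow H$.
o $A \twoheadrightarrow CG$: now we know that $A \twoheadrightarrow CGHI$ and $A \twoheadrightarrow HI$. By the difference rule, $A \twoheadrightarrow CGHI - HI = CG$.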
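The attribute arithmetic behind these four derivations can be replayed with Python sets. This is only a sketch of the right-hand-side computations: the helper names are hypothetical, and the functions assume the caller already knows that the premises of each rule hold.

```python
R = frozenset("ABCGHI")

def complementation(alpha, beta, schema):
    """From alpha ->> beta: the right side of alpha ->> schema - beta - alpha."""
    return schema - beta - alpha

def mv_transitivity(beta, gamma):
    """From alpha ->> beta and beta ->> gamma: the right side of alpha ->> gamma - beta."""
    return gamma - beta

def difference(beta, gamma):
    """From alpha ->> beta and alpha ->> gamma: right sides beta - gamma and gamma - beta."""
    return beta - gamma, gamma - beta

def coalescence_side_conditions(beta, gamma, delta):
    """gamma ⊆ beta and delta ∩ beta = ∅; alpha ->> beta and delta -> gamma are assumed."""
    return gamma <= beta and not (delta & beta)

A, B = frozenset("A"), frozenset("B")
H, HI, CG = frozenset("H"), frozenset("HI"), frozenset("CG")

cghi = complementation(A, B, R)                  # A ->> CGHI
hi = mv_transitivity(B, HI)                      # A ->> HI
b_determines_h = coalescence_side_conditions(HI, H, CG)  # B -> H if True
cg, _ = difference(cghi, hi)                     # A ->> CG

print(sorted(cghi), sorted(hi), b_determines_h, sorted(cg))
```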

Fourth Normal Form (4NF)


1. We saw that BC-schema was in BCNF, but still was not an ideal design as it suffered from repetition of information. We had the multivalued dependency $\text{cname} \twoheadrightarrow \text{street ccity}$, but no non-trivial functional dependencies.

2. We can use the given multivalued dependencies to improve the database design by decomposing it into fourth normal form.
3. A relation schema R is in 4NF with respect to a set D of functional and multivalued dependencies if for all multivalued dependencies in $D^+$ of the form $\alpha \twoheadrightarrow \beta$, where $\alpha \subseteq R$ and $\beta \subseteq R$, at least one of the following hold:
o $\alpha \twoheadrightarrow \beta$ is a trivial multivalued dependency.
o $\alpha$ is a superkey for schema R.
4. A database design is in 4NF if each member of the set of relation schemas is in 4NF.
5. The definition of 4NF differs from the BCNF definition only in the use of multivalued dependencies.
o Every 4NF schema is also in BCNF.
o To see why, note that if a schema is not in BCNF, there is a non-trivial functional dependency $\alpha \rightarrow \beta$ holding on R, where $\alpha$ is not a superkey.
o Since $\alpha \rightarrow \beta$ implies $\alpha \twoheadrightarrow \beta$, by the replication rule, R cannot be in 4NF.
6. We have an algorithm similar to the BCNF algorithm for decomposing a schema into 4NF:

    result := $\{R\}$;
    done := false;
    compute $D^+$;
    while (not done) do
        if (there is a schema $R_i$ in result that is not in 4NF)
        then begin
            let $\alpha \twoheadrightarrow \beta$ be a nontrivial multivalued dependency that holds on $R_i$
                such that $\alpha \rightarrow R_i$ is not in $D^+$, and $\alpha \cap \beta = \emptyset$;
            result := (result $- R_i$) $\cup$ $(R_i - \beta)$ $\cup$ $(\alpha, \beta)$;
        end
        else done := true;
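A simplified Python sketch of this loop follows. It does not compute $D^+$: the caller supplies the multivalued dependencies to consider (including any obtained from functional dependencies by replication), and the superkey test uses the closure under the functional dependencies only. Both are simplifying assumptions, and the function names are hypothetical. A supplied dependency is restricted to a schema $R_i$ by intersecting its right side with $R_i$.

```python
def closure(attrs, fds):
    """Attribute closure of attrs under fds, a sequence of (lhs, rhs) frozensets."""
    result = set(attrs)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in fds:
            if lhs <= result and not rhs <= result:
                result |= rhs
                changed = True
    return frozenset(result)

def find_4nf_violation(r_i, mvds, fds):
    """First supplied MVD whose restriction to r_i is nontrivial with a non-superkey left side."""
    for alpha, beta in mvds:
        if not alpha <= r_i:
            continue
        restricted = (beta & r_i) - alpha
        if not restricted or alpha | restricted == r_i:
            continue  # trivial on r_i
        if closure(alpha, fds) >= r_i:
            continue  # alpha is a superkey of r_i
        return alpha, restricted
    return None

def decompose_4nf(schema, mvds, fds=()):
    result = [frozenset(schema)]
    done = False
    while not done:
        done = True
        for r_i in result:
            violation = find_4nf_violation(r_i, mvds, fds)
            if violation:
                alpha, beta = violation
                result.remove(r_i)
                result += [r_i - beta, alpha | beta]
                done = False
                break  # restart the scan on the updated list
    return set(result)

BC = frozenset({"loan#", "cname", "street", "ccity"})
BC_MVDS = [(frozenset({"cname"}), frozenset({"loan#"}))]

for s in sorted(decompose_4nf(BC, BC_MVDS), key=sorted):
    print(sorted(s))
```

Each split produces two strictly smaller schemas, so the loop terminates. Because the sketch only sees the dependencies it is given, it can miss violations that a full computation of $D^+$ would reveal.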

35. If we apply this algorithm to BC-schema:
o $\text{cname} \twoheadrightarrow \text{loan\#}$ is a nontrivial multivalued dependency and cname is not a superkey for the schema.
o We then replace BC-schema by two schemas:

Cust-loan-schema = (cname, loan#)

Customer-schema = (cname, street, ccity)

o These two schemas are in 4NF.
36. We can show that our algorithm generates only lossless-join decompositions.
o Let R be a relation schema and D a set of functional and multivalued dependencies on R.
o Let $R_1$ and $R_2$ form a decomposition of R.
o This decomposition is lossless-join if and only if at least one of the following multivalued dependencies is in $D^+$:

$R_1 \cap R_2 \twoheadrightarrow R_1$

$R_1 \cap R_2 \twoheadrightarrow R_2$

o We saw similar criteria for functional dependencies.
o This says that for every lossless-join decomposition of R into two schemas $R_1$ and $R_2$, one of the two above dependencies must hold.
o You can see, by inspecting the algorithm, that this must be the case for every decomposition.
37. Dependency preservation is not as simple to determine as with functional dependencies.
o Let R be a relation schema.
o Let $R_1, R_2, \ldots, R_n$ be a decomposition of R.
o Let D be the set of functional and multivalued dependencies holding on R.
o The restriction of D to $R_i$ is the set $D_i$ consisting of:
   1. All functional dependencies in $D^+$ that include only attributes of $R_i$.
   2. All multivalued dependencies of the form $\alpha \twoheadrightarrow \beta \cap R_i$ where $\alpha \subseteq R_i$ and $\alpha \twoheadrightarrow \beta$ is in $D^+$.
o A decomposition of schema R is dependency preserving with respect to a set D of functional and multivalued dependencies if for every set of relations $r_1(R_1), r_2(R_2), \ldots, r_n(R_n)$ such that for all i, $r_i$ satisfies $D_i$, there exists a relation r(R) that satisfies D and for which $r_i = \Pi_{R_i}(r)$ for all i.
38. What does this formal statement say? It says that a decomposition is dependency preserving if, for every set of relations on the decomposition schemas satisfying only the restrictions of D, there exists a relation r on the entire schema R from which the decomposed relations can be derived, and r also satisfies the functional and multivalued dependencies.
39. We'll do an example using our decomposition algorithm and check the result for dependency preservation.
o Let R=(A,B,C,G,H,I).
o Let D be

$A \twoheadrightarrow B$
$B \twoheadrightarrow HI$
$CG \rightarrow H$

o R is not in 4NF, as we have $A \twoheadrightarrow B$ and A is not a superkey.
o The algorithm causes us to decompose using this dependency into (A,B) and (A,C,G,H,I).
o (A,B) is now in 4NF, but (A,C,G,H,I) is not.
o Applying the multivalued dependency $CG \twoheadrightarrow H$ (how did we get this?), our algorithm then decomposes (A,C,G,H,I) into (C,G,H) and (A,C,G,I).
o (C,G,H) is now in 4NF, but (A,C,G,I) is not. Why?
o As $A \twoheadrightarrow HI$ is in $D^+$ (why?) then the restriction of this dependency to (A,C,G,I) gives us $A \twoheadrightarrow I$.
o Applying this dependency in our algorithm finally decomposes (A,C,G,I) into (A,I) and (A,C,G).
o The algorithm terminates, and our decomposition is (A,B), (C,G,H), (A,I) and (A,C,G).
40. Let's analyze the result.

Figure 7.9: Projection of relation r onto a 4NF decomposition of R.


o This decomposition is not dependency preserving as it fails to preserve $B \twoheadrightarrow HI$.
o Figure 7.9 (textbook 6.14) shows four relations $r_1$(A,B), $r_2$(C,G,H), $r_3$(A,I) and $r_4$(A,C,G) that may result from projecting a relation onto the four schemas of our decomposition.
o The restriction of D to (A,B) is $A \twoheadrightarrow B$ and some trivial dependencies.
o We can see that $r_1$ satisfies $A \twoheadrightarrow B$ as there are no pairs with the same A value.
o Also, $r_2$ satisfies all functional and multivalued dependencies since no two tuples have the same value on any attribute.
o We can say the same for $r_3$ and $r_4$.
o So our decomposed version satisfies all the dependencies in the restriction of D.
o However, there is no relation r on (A,B,C,G,H,I) that satisfies D and decomposes into $r_1$, $r_2$, $r_3$ and $r_4$.
o Figure 7.10 (textbook 6.15) shows $r = r_1 \bowtie r_2 \bowtie r_3 \bowtie r_4$.
o Relation r does not satisfy $B \twoheadrightarrow HI$.
o Any relation s containing r and satisfying $B \twoheadrightarrow HI$ must include a further tuple that is not in r.
o However, $\Pi_{CGH}(s)$ then includes a tuple that is not in $r_2$.
o Thus our decomposition fails to detect a violation of $B \twoheadrightarrow HI$.

Figure 7.10: A relation r(R) that does not satisfy $B \twoheadrightarrow HI$
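The argument can be replayed on a small instance. The four relations below are an assumption chosen to match the properties described above (no two tuples of the relation on (A,B) share an A value, and so on); they are not copied from Figures 7.9 and 7.10, and the helper names are hypothetical.

```python
def relation(attrs, rows):
    """A relation as a set of rows; each row is a frozenset of (attribute, value)."""
    return {frozenset(zip(attrs, row)) for row in rows}

def natural_join(r, s):
    joined = set()
    for x in r:
        dx = dict(x)
        for y in s:
            dy = dict(y)
            if all(dx[a] == dy[a] for a in dx.keys() & dy.keys()):
                joined.add(frozenset({**dx, **dy}.items()))
    return joined

def project(rel, attrs):
    return {frozenset((a, v) for a, v in row if a in attrs) for row in rel}

def required_tuples(rel, alpha, beta):
    """Tuples that alpha ->> beta forces into any superset of rel but that rel lacks."""
    missing = set()
    for t1 in rel:
        d1 = dict(t1)
        for t2 in rel:
            d2 = dict(t2)
            if all(d1[a] == d2[a] for a in alpha):
                # beta values from t1, everything else from t2
                t3 = frozenset((a, d1[a] if a in beta else d2[a]) for a in d2)
                if t3 not in rel:
                    missing.add(t3)
    return missing

r1 = relation("AB", [("a1", "b1"), ("a2", "b1")])
r2 = relation("CGH", [("c1", "g1", "h1"), ("c2", "g2", "h2")])
r3 = relation("AI", [("a1", "i1"), ("a2", "i2")])
r4 = relation("ACG", [("a1", "c1", "g1"), ("a2", "c2", "g2")])

r = natural_join(natural_join(natural_join(r1, r2), r3), r4)
forced = required_tuples(r, {"B"}, {"H", "I"})

print(len(r), len(forced))
print(project(forced, {"C", "G", "H"}) <= r2)
```

In this instance r violates the dependency from B to HI, and every tuple needed to repair it projects onto (C,G,H) as a row that the relation on (C,G,H) does not contain, so no relation on the full schema both satisfies D and projects back onto the four given relations.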

41. We have seen that if we are given a set of functional and multivalued dependencies, it is best to find a database design that meets the three criteria:
o 4NF.
o Dependency Preservation.
o Lossless-join.
42. If we only have functional dependencies, the first criterion is just BCNF.
43. We cannot always meet all three criteria. When this occurs, we compromise on 4NF, and accept BCNF, or even 3NF if necessary, to ensure dependency preservation.
