
Jamnalal Bajaj Institute of Management Studies MMS I Alpha 10/11/2011

DATABASE CONCEPTS WITH EMERGING TRENDS

Submitted by: Abhijeet Shinde (93) Akshay Wagle (115) Amrit Pradhan (80) Rahul Tambakhe (104) Saiprasad Sale (85) Sagar Patil (74) Srinath Subramanian (100) Vaibhav Thakur (106)

Table of Contents
Introduction

Database Design Methodologies

Database management systems

Relational Database Management Systems

Multimedia Database Management Systems

Distributed Databases

Document Oriented Database

Conclusion

Introduction
Data

The term data refers to qualitative or quantitative attributes of a variable or set of variables. Data is typically the result of measurements and can be the basis of graphs, images, or observations of a set of variables. Data is often viewed as the lowest level of abstraction from which information and then knowledge are derived. Raw data, i.e. unprocessed data, refers to a collection of numbers, characters, images, or other outputs from devices that collect information to convert physical quantities into symbols. Data is also a representation of a fact, figure, or idea. Such usage is the origin of data as a concept in computer science: data are numbers, words, images, etc., accepted as they stand.

The terms information and knowledge are frequently used for overlapping concepts. The main difference is in the level of abstraction being considered. Data is the lowest level of abstraction, information is the next level, and knowledge is the highest level of the three. Data on its own carries no meaning. For data to become information, it must be interpreted and take on a meaning. For example, the height of Mt. Everest is generally considered "data", a book on Mt. Everest's geological characteristics may be considered "information", and a report containing practical information on the best way to reach Mt. Everest's peak may be considered "knowledge".

Information as a concept bears a diversity of meanings, from everyday usage to technical settings. Generally speaking, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, and representation. Beynon-Davies uses the concept of a sign to distinguish between data and information: data are symbols, while information occurs when symbols are used to refer to something. It is people and computers who collect data and impose patterns on it. These patterns are seen as information which can be used to enhance knowledge, and they may be interpreted as truth and judged against aesthetic and ethical criteria. Events that leave behind perceivable physical or virtual remains can be traced back through data. Marks are no longer considered data once the link between the mark and observation is broken.

Raw data refers to a collection of numbers, characters, images, or other outputs from devices that convert physical quantities into symbols, in unprocessed form. Such data is typically further processed by a human or input into a computer, stored and processed there, or transmitted (output) to another human or computer (possibly through a data cable). Raw data is a relative term; data processing commonly occurs in stages, and the "processed data" from one stage may be considered the "raw data" of the next.

Mechanical computing devices are classified according to the means by which they represent data. An analog computer represents a datum as a voltage, distance, position, or other physical quantity. A digital computer represents a datum as a sequence of symbols drawn from a fixed alphabet. The most common digital computers use a binary alphabet, that is, an alphabet of two characters, typically denoted "0" and "1". More familiar representations, such as numbers or letters, are then constructed from the binary alphabet.

Some special forms of data are distinguished. A computer program is a collection of data which can be interpreted as instructions. Most computer languages make a distinction between programs and the other data on which programs operate, but in some languages, notably Lisp and similar languages, programs are essentially indistinguishable from other data. It is also useful to distinguish metadata, that is, a description of other data. A similar yet earlier term for metadata is "ancillary data". The prototypical example of metadata is the library catalog, which is a description of the contents of books. Experimental data refers to data generated within the context of a scientific investigation by observation and recording. Field data refers to raw data collected in an uncontrolled in situ environment.

Example: ROBCOR Company tracks all sales for its two divisions through invoices. Each invoice contains raw facts such as these: Invoice number = 300124, Invoice date = 12-Jan-2004, and Sales amount = $125.98. Suppose that ROBCOR's two divisions have generated 1,380,456 and 1,453,907 invoices, respectively, between the first quarter of 1999 and the first quarter of 2004. These millions of invoices are raw facts that carry no meaning unless and until they are processed into information. Now suppose that, for the purpose of drawing conclusions and making decisions, ROBCOR's sales managers want information about sales productivity per employee for each of the two divisions. The generated sales productivity figures reveal the meaning of the data in the sales invoices. An application program in an information system will generate the required information, as in the sketch below.

When data is stored electronically in files, it can be used as input for an information system. An information system has programs that process (or transform) data to produce information as output. Information reveals the meaning of data. For example, student data values such as ID, Name, Address, Major, and Phone number represent raw facts; a class roll is a list showing the IDs and names of the students enrolled in a particular class.
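To make the step from raw facts to information concrete, here is a minimal sketch of what such an application program might compute. The per-division invoice totals are taken from the text; the employee counts are hypothetical placeholders invented for illustration.

```python
# Illustrative sketch only: invoice totals are from the ROBCOR example;
# the employee counts are invented placeholders.
divisions = {
    "Division 1": {"invoices": 1_380_456, "employees": 120},
    "Division 2": {"invoices": 1_453_907, "employees": 145},
}

# Processing raw facts (invoice counts) into information
# (sales productivity per employee).
for name, figures in divisions.items():
    per_employee = figures["invoices"] / figures["employees"]
    print(f"{name}: {per_employee:,.0f} invoices per employee")
```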
Summary of Data
- Facts and statistics used for reference or analysis.
- Numbers, characters, symbols, images, etc., which can be processed by a computer.
- Data must be interpreted, by a human or machine, to derive meaning.
- "Data is a representation of information."
- From the Latin 'datum', meaning "that which is given".
- Data is plural; datum is singular.

Information

Information in its most restricted technical sense is a message (utterance or expression) or collection of messages that consists of an ordered sequence of symbols, or it is the meaning that can be interpreted from such a message or collection of messages. Information can be recorded or transmitted: it can be recorded as signs, or conveyed as signals by waves. Information is any kind of event that affects the state of a dynamic system. The concept has numerous other meanings in different contexts. Moreover, the concept of information is closely related to notions of constraint, communication, control, data, form, instruction, knowledge, meaning, mental stimulus, pattern, perception, representation, and especially entropy.

Records are a specialized form of information. Essentially, records are information produced consciously or as by-products of business activities or transactions and retained because of their value. Primarily their value is as evidence of the activities of the organization, but they may also be retained for their informational value. Sound records management ensures that the integrity of records is preserved for as long as they are required. The international standard on records management, ISO 15489, defines records as "information created, received, and maintained as evidence and information by an organization or person, in pursuance of legal obligations or in the transaction of business". The International Council on Archives (ICA) Committee on Electronic Records defined a record as "a specific piece of recorded information generated, collected or received in the initiation, conduct or completion of an activity and that comprises sufficient content, context and structure to provide proof or evidence of that activity".

Records may be maintained to retain the corporate memory of the organization or to meet legal, fiscal, or accountability requirements imposed on the organization. Willis (2005) expressed the view that sound management of business records and information delivered "six key requirements for good corporate governance: transparency; accountability; due process; compliance; meeting statutory and common law requirements; and security of personal and corporate information."
Summary of Information
- Knowledge derived from study, experience (by the senses), or instruction.
- Communication of intelligence.
- Any kind of knowledge that is exchangeable amongst people, about things, facts, concepts, etc., in some context.

Information is interpreted data.

Figure: Data processed into information. Students' enrolment data flows through the Student Registration System to produce a Class Roll.

Data constitute the building blocks of information: Information is produced by processing data. Information is used to reveal the meaning of data. Good, relevant, and timely information is the key to good decision making. Good decision making is the key to organizational survival in a global environment. Timely and useful information requires accurate data. To achieve accurate information, the data must be stored and generated properly. Also, the data must be stored in a format that is easy to access and process. And like any basic resource, the data environment must be managed carefully. Thus, data management is a discipline that focuses on the proper generation, storage, and retrieval of data.

Database Introduction

The term database is correctly applied to the data and their supporting data structures, and not to the DBMS, which is the software system used to manage the database. The structure of a database is generally too complex to be handled without its DBMS, and any attempt to do otherwise is very likely to result in database corruption. DBMSs are packaged as computer software products. Well-known and highly utilized DBMSs include the proprietary Oracle, IBM DB2, and Microsoft SQL Server (ranked by market share), and the open-source DBMSs PostgreSQL and the simpler, lower-end (but satisfactory in many cases) MySQL and SQLite. Each such DBMS product currently supports many thousands of databases all over the world; the mostly embedded DBMS SQLite probably supports many millions of instances. A database is not generally portable across different DBMSs, but different DBMSs can inter-operate to some degree (while each DBMS type controls a database of its own database type) by using standards like SQL and ODBC to support a single application together.

A successful general-purpose DBMS is designed in such a way that it can satisfy as many different applications and application designers and builders as possible. A DBMS also needs to provide effective run-time execution to properly support (e.g., in terms of performance, availability, and security) as many end-users (the database's application users) as needed. Sometimes the combination of a database and its respective DBMS is referred to as a database system (DBS).

The design, construction, and maintenance of a complex database requires specialist skills: the staff performing these functions are referred to as database application programmers (distinct from the DBMS developers/programmers) and database administrators. Their tasks are supported by tools provided either as part of the DBMS or as free-standing software products. These tools include specialized database languages, including data definition languages (DDLs), data manipulation languages (DMLs), and query languages. These can be seen as special-purpose programming languages, tailored specifically to manipulate databases; sometimes they are provided as extensions of existing programming languages, with added special database commands. Database languages are generally specific to one data model, and in many cases they are specific to one DBMS type. The most widely supported standard database language is SQL, which was developed for the relational data model and combines the roles of DDL, DML, and query language. A sketch of these three roles follows.
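As a hedged illustration of these three language roles, the sketch below uses Python's built-in sqlite3 module; the table and rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define the database structure.
cur.execute("CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# DML: manipulate the data itself.
cur.execute("INSERT INTO student (id, name) VALUES (?, ?)", (1, "Asha"))

# Query language: interrogate the database.
for row in cur.execute("SELECT id, name FROM student"):
    print(row)  # (1, 'Asha')

conn.close()
```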

The database concept

The database concept has evolved since the 1960s to ease increasing difficulties in designing, building, and maintaining complex information systems (typically with many concurrent end-users and a large, diverse amount of data). It has evolved together with database management systems (DBMSs), which enable the effective handling of databases. Though the terms database and DBMS define different entities, they are inseparable: a database's properties are determined by its supporting DBMS, and vice versa. The Oxford English Dictionary cites a 1962 technical report as the first to use the term "data-base". With progress in processor, computer memory, computer storage, and computer network technology, the sizes, capabilities, and performance of databases and their respective DBMSs have grown by orders of magnitude. For decades it has been unlikely that a complex information system can be built effectively without a proper database supported by a DBMS. The utilization of databases is now so widespread that virtually every technology and product relies on databases and DBMSs for its development and commercialization, or even has them embedded in it. Also, organizations and companies, from small to large, depend heavily on databases for their operations.

No widely accepted exact definition exists for DBMS. However, a system needs to provide considerable functionality to qualify as a DBMS, and accordingly its supported data collection needs to meet respective usability requirements to qualify as a database. Thus, a database and its supporting DBMS are defined here by the set of general requirements listed below. Virtually all existing mature DBMS products meet these requirements to a great extent, while less mature products either meet them or converge to meet them.

Parts of a database
There are four major parts to a database: tables, forms, reports, and queries.
- Tables: where all the information and data is stored within a database.
- Forms: used to enter, edit, or delete data in a database, one record at a time.
- Reports: a report produces a paper copy of the data contained within a database.
- Queries and sorts: queries are a special database tool that allows you to search a database for specific records based on certain information.

Definition of a key
A key is a set of columns that can be used to identify or access a particular row or rows. The key is identified in the description of a table, index, or referential constraint. The same column can be part of more than one key. A unique key is a key that is constrained so that no two of its values are equal. The columns of a unique key cannot contain NULL values. For example, an employee number column can be defined as a unique key, because each value in the column identifies only one employee; no two employees can have the same employee number. The mechanism used to enforce the uniqueness of the key is called a unique index. The unique index of a table is a column, or an ordered collection of columns, for which each value identifies (functionally determines) a unique row. A unique index can contain NULL values.

Primary key
The primary key of a relational table uniquely identifies each record in the table. It can either be a normal attribute that is guaranteed to be unique (such as Social Security Number in a table with no more than one record per person) or it can be generated by the DBMS (such as a globally unique identifier, or GUID, in Microsoft SQL Server). Primary keys may consist of a single attribute or multiple attributes in combination. For example, imagine we have a STUDENTS table that contains a record for each student at a university. The student's unique student ID number would be a good choice for a primary key in the STUDENTS table. The student's first and last name would not be a good choice, as there is always the chance that more than one student might have the same name.

Secondary key
A secondary key is defined on a field that you would like to be indexed for faster searches. A secondary key is used for identification of rows but is not usually unique, and a table can have multiple secondary keys. The attributes used for a secondary key need not be the ones used for a superkey; indeed, a secondary key need not be a superkey at all. The sketch below illustrates both kinds of key.
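A hedged sqlite3 sketch of both kinds of key; the STUDENTS example follows the text, while the column names and rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Primary key: uniquely identifies each record.
cur.execute("""
    CREATE TABLE students (
        student_id INTEGER PRIMARY KEY,   -- unique identifier
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    )
""")

# Secondary key: an index on last_name for faster searches; it helps
# identify rows but need not be unique.
cur.execute("CREATE INDEX idx_students_last_name ON students (last_name)")

cur.executemany("INSERT INTO students VALUES (?, ?, ?)",
                [(1, "Asha", "Rao"), (2, "Ravi", "Rao")])  # duplicate last names allowed
print(cur.execute(
    "SELECT student_id FROM students WHERE last_name = ?", ("Rao",)).fetchall())
conn.close()
```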


Database Design Methodologies


Database design
It is a structured approach that uses procedures, techniques, tools, and documentation aids to support and facilitate the process of designing a database.

Critical success factors in database design:
1. Use diagrams to represent as much of the data models as possible.
2. Provide good documentation of additional data semantics.
3. Build a data dictionary to supplement the data model diagrams.
4. Be willing to repeat steps.
5. Work interactively with the users as much as possible.
6. Follow a structured methodology throughout the data modeling process.
7. Employ a data-driven approach.
8. Incorporate structural and integrity considerations into the data models.
9. Combine conceptualization, normalization, and transaction validation techniques into the data modeling methodology.

The three main phases of database design:
I. Conceptual database design: construct a model of the information used, independent of all physical considerations.
II. Logical database design: construct a model of the information used, based on a specific data model (e.g. relational), but independent of a particular DBMS and other physical considerations.
III. Physical database design: produce a description of the implementation of the database on secondary storage; it describes the storage structures and access methods used to achieve efficient access to the data.

Methodology Overview
Step 1 Build local conceptual data model for each user view
Step 2 Build and validate local logical data model
Step 3 Build and validate global logical data model
Step 4 Translate global logical data model for target DBMS
Step 5 Design physical representation
Step 6 Design and implement security mechanisms
Step 7 Monitor and tune the operational system

Methodology Overview: Conceptual Database Design
Step 1 Build local conceptual data model for each user view
Step 1.1 Identify entity types
Step 1.2 Identify relationship types
Step 1.3 Identify and associate attributes with entity or relationship types
Step 1.4 Determine attribute domains
Step 1.5 Determine candidate and primary key attributes
Step 1.6 Specialize/generalize entity types (optional step)
Step 1.7 Draw Entity-Relationship diagram
Step 1.8 Review local conceptual data model with user

Methodology Overview: Logical Database Design for the Relational Model
Step 2 Build and validate local logical data model
Step 2.1 Map local conceptual data model to local logical data model
Step 2.2 Derive relations from local logical data model
Step 2.3 Validate model using normalization
Step 2.4 Validate model against user transactions
Step 2.5 Draw Entity-Relationship diagram
Step 2.6 Define integrity constraints
Step 2.7 Review local logical data model with user

Step 3 Build and validate global logical data model
Step 3.1 Merge local logical data models into global model
Step 3.2 Validate global logical data model
Step 3.3 Check for future growth
Step 3.4 Draw final Entity-Relationship diagram
Step 3.5 Review global logical data model with users

Methodology Overview: Physical Database Design for Relational Databases
Step 4 Translate global logical data model for target DBMS
Step 4.1 Design base relations for target DBMS
Step 4.2 Design integrity rules for target DBMS

Step 5 Design physical representation
Step 5.1 Analyze transactions
Step 5.2 Choose file organization
Step 5.3 Choose secondary indexes
Step 5.4 Consider the introduction of controlled redundancy
Step 5.5 Estimate disk space

Step 6 Design and implement security mechanisms
Step 6.1 Design user views
Step 6.2 Design access rules

Step 7 Monitor and tune the operational system

Entity
An entity is defined as a person, place, thing, or concept which is (a) of interest to the corporation, (b) capable of being described in real terms, and (c) relevant within the context of the specific environment of the firm. An entity is something that has a distinct, separate existence, although it need not be a material existence. Entities are used in system development models that display communications and internal processing of, say, documents compared to order processing. Entities are represented on the diagram by labelled boxes.


Types of entities:

Strong entity: Objects are represented by their attributes and, as objects are distinguishable from one another, a subset of these attributes forms a primary key for uniquely identifying an instance of an entity. Entity types that have primary keys are called strong entities.

Weak entity: Instances of a weak entity set associated with the same instance of a strong entity must be distinguishable from each other by a subset of the attributes of the weak entity. This subset of attributes is called the discriminator. Weak entities do not have a primary key of their own.

Entity set: An entity set is a set of entities of the same type that share the same properties, or attributes. For example, all software engineers in the department involved in Internet projects can be defined as the entity set InternetGroup. The individual entities that constitute a set are called the extension of the entity set. Thus, all individual software engineers on the Internet projects are the extension of the entity set InternetGroup. Entity sets need not be disjoint. For example, we can define an entity set Employee. An employee may or may not be working on an Internet project, so InternetGroup will contain some entries that are also in the Employee entity set; therefore, the entity sets Employee and InternetGroup are not disjoint.

A database usually contains groups of entities that are similar. For example, employees of a company share the same attributes. However, every employee entity has its own values for each attribute. An entity type defines a set of entities that have the same attributes. A name and a list of attributes describe each entity type. An entity type is represented in ER diagrams as a rectangular box, with the corresponding attributes shown in ovals attached to the entity type by straight lines. An entity type is basically the schema (or intension, or structure) for the set of entities that share the same structure, whereas the individual entities of a particular entity type are collectively called the entity set. The entity set is also called the extension of the entity type.

Entity Sub-Types
Sometimes it is useful to generalise about a group of entities which have similar characteristics. For example, a VEHICLE is a generalisation of a CAR, a TRUCK and a MOTORCYCLE. Conversely, it can be useful to identify specialised sub-types of an entity. For example, a CAR is a specialised type of VEHICLE, a TRUCK is a specialised type of VEHICLE and a MOTORCYCLE is a specialised type of VEHICLE. A sketch of one way to implement this specialisation follows.
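One common way to realise sub-types in a relational schema is a table per sub-type whose primary key doubles as a foreign key to the generalised entity. A minimal sketch, assuming sqlite3 and invented column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Generalised entity: attributes common to every VEHICLE.
cur.execute("""
    CREATE TABLE vehicle (
        vehicle_id INTEGER PRIMARY KEY,
        make       TEXT NOT NULL
    )
""")

# Specialised sub-type: a CAR is a VEHICLE, so its primary key is
# also a foreign key to the generalised entity.
cur.execute("""
    CREATE TABLE car (
        vehicle_id INTEGER PRIMARY KEY REFERENCES vehicle (vehicle_id),
        seats      INTEGER
    )
""")
conn.close()
```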

Attribute
An attribute is an aspect or quality of an entity which describes it or its actions. An attribute may be a physical aspect such as height or weight, an aspect of the entity's location such as place of residence or place of birth, or a quality such as the level of a particular skill or the educational degree achieved. An attribute is a property or descriptor of an entity; for example, Customer Name is an attribute of the entity Customer. Each attribute will eventually be represented by one or more entity attributes in the physical database structure.

Attributes Define Entities
Collectively, attributes define an entity. An attribute is meaningless by itself; for example, date of birth takes its meaning from the context of the entity to which it is assigned, such as the date of birth of an employee. Attributes are not shown on the Entity-Relationship Model but are recorded in the underlying data dictionary, which contains the definitions of attributes for all entities and relationships identified in the model.

An attribute should not have facts recorded about it. In practice, however, there are exceptions. For example, you might wish to show address as an attribute of Customer. Address is not significant enough to be modelled as an entity in its own right and would typically be shown as an attribute of Customer. However, at the detailed level, it may itself have attributes such as an indicator for mailing address or home address. Attributes do not have to be recognized and defined during the early stages of entity definition. Entity definition is an iterative process, and it is unlikely that a completely satisfactory Entity-Relationship Model will be obtained on the first iteration.

The following figure shows two entity types, Employee and Product, together with their attribute lists and a few members of each entity type.

Identifying Attributes
To identify entity attributes, examine:
1. All external entities from the Context Diagram,
2. The data flows passed by the external entities,
3. Existing automated data,
4. Each entity (i.e., generate a list of entity attributes that describe the entity).

Attributes Versus Data Elements
Attributes have a looser description than data elements. For instance, whereas an attribute may have only a descriptive name, a data element needs:
1. a size and range,
2. a format and length,
3. an accurate and detailed description,
4. valid values,
5. defined edit rules.

Some attributes may be converted into many data elements. For instance, the attribute "address" may become four data elements representing:
1. Street Address,
2. City/Town,
3. State/Province,
4. Postal or Zip Code.

Additional data elements may also be defined as a result of customer requirements. For example, the customer may require a list of all companies by county. For the purposes of data modelling, attributes and data elements are often considered identical because attributes in the data model typically become data elements in the database.

Categories of Attributes
Attributes fall into three categories depending on the information that the attribute captures:
1. Descriptive attributes provide facts intrinsic to each instance of the entity. For example, the Salary of Employee or the Address of Customer.
2. Naming attributes provide facts about arbitrary labels and names carried by each instance of an entity. For example, the Employee Name of Employee or the Employee ID of Employee.
3. Referential attributes (i.e., foreign keys) provide facts which tie an instance of one entity to an instance of another entity. For example, the Department Number of the Department to which an Employee is assigned ties the Employee to the Department.

Attribute Domains
A domain is a set of values for an attribute (i.e., the properties or characteristics of entities). The value set conforms to a common definition for the domain (e.g., type, format, syntax, meaning). Specify domains in one of the following ways:
1. List all possible values (e.g., for an attribute named color, possible values are red, green, blue).
2. Identify a source (e.g., procedures manual) that contains the valid values.
3. List an acceptable range of values for the domain (e.g., for an attribute named weight, possible values range from one to five pounds).
4. Define a business rule that permits determination of the validity of a value assigned to an attribute (e.g., discounts greater than five percent only apply to commercial customers).

Derived Attributes
Derived attributes are attributes whose values are generated from other attributes using calculations, algorithms, or procedures. For example, Account Balance is derived by subtracting Total Debit from Total Credit (see the sketch below). Generally, the specifications for calculating derived attributes are a concern of the processing aspects of the information system (e.g., the process model). Derived attributes may be included in the data model if the rules for calculating the attribute values would otherwise be lost. Clearly indicate in the data model when an attribute is derived, ensure that the rules needed to derive or calculate the attribute value are captured in the model, and verify that all attributes needed to calculate the derived attribute are present in the data model. Once the physical data model is constructed, some derived attributes are added to the model to improve performance of the system.
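As a hedged sketch of the Account Balance example, a SQL view can capture the derivation rule without storing the derived value; the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE account (
        account_id   INTEGER PRIMARY KEY,
        total_credit REAL NOT NULL,
        total_debit  REAL NOT NULL
    )
""")
cur.execute("INSERT INTO account VALUES (1, 500.00, 125.98)")

# Derived attribute: balance is computed from other attributes, so the
# derivation rule lives in a view rather than in a stored column.
cur.execute("""
    CREATE VIEW account_balance AS
    SELECT account_id, total_credit - total_debit AS balance
    FROM account
""")
print(cur.execute("SELECT * FROM account_balance").fetchall())  # [(1, 374.02)]
conn.close()
```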

Relationships
Frequently, a meaningful relationship exists between two different types of entity. For example:
- EMPLOYEEs work in a DEPARTMENT
- LAWYERs advise CLIENTs
- EQUIPMENT is allocated to PROJECTs
- TRUCK is a type of VEHICLE

There are potentially three types of relationship which can exist between two different entities:
1. One-to-One Relationships
2. One-to-Many Relationships
3. Many-to-Many Relationships

One-to-One Relationships: This type of relationship occurs when a single occurrence of an entity is related to just one occurrence of a second entity. For example, a ROOF covers one BUILDING; a BUILDING is covered by one ROOF. A one-to-one relationship is shown on the diagram by a line connecting the two entities.

One-to-Many Relationships: This type of relationship occurs when a single occurrence of an entity is related to many occurrences of a second entity. For example, an EMPLOYEE works in one DEPARTMENT; a DEPARTMENT has many EMPLOYEEs. A one-to-many relationship is shown on the diagram by a line connecting the two entities with a crow's foot symbol denoting the "many" end of the relationship.


Many-to-Many Relationships: This type of relationship occurs when many occurrences of an entity are related to many occurrences of a second entity. For example, EQUIPMENT is allocated to many PROJECTs; a PROJECT is allocated many items of EQUIPMENT. A many-to-many relationship is shown on the diagram by a line connecting the two entities with a crow's foot at each end of the line.

Eliminating Many-to-Many Relationships
Many-to-many relationships in an entity relationship diagram tend to conceal areas of poor understanding. Almost always, a many-to-many relationship conceals a hidden entity. For this reason, many-to-many relationships are eliminated by identifying the hidden entity and adding it to the model. The new entity is related to the two original entities by a pair of one-to-many relationships, as in the sketch below.
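A hedged sqlite3 sketch of the EQUIPMENT/PROJECT example: the hidden entity (here called allocation, an invented name) replaces the many-to-many relationship with two one-to-many relationships.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE equipment (equipment_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE project   (project_id   INTEGER PRIMARY KEY, name TEXT)")

# The hidden entity: one row per allocation of one piece of equipment
# to one project. EQUIPMENT and PROJECT each now have a one-to-many
# relationship with ALLOCATION.
cur.execute("""
    CREATE TABLE allocation (
        equipment_id INTEGER REFERENCES equipment (equipment_id),
        project_id   INTEGER REFERENCES project (project_id),
        allocated_on TEXT,
        PRIMARY KEY (equipment_id, project_id)
    )
""")
conn.close()
```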

Choosing the Right Relationship
Depending on the purpose of the model, the length of time involved, and the definition of the entities participating in the relationship, the type of relationship between two entities can change. For example, if the definition of a ROOF entity is an apex or flat surface covering a building, then a BUILDING is covered by many ROOFs. Likewise, over a period of time, an EMPLOYEE works in many DEPARTMENTs.

Optional and Mandatory Relationships
Sometimes the participation of an entity in a relationship is optional; at other times it is mandatory. For example, EQUIPMENT may be allocated to PROJECTs, an EMPLOYEE must work in a DEPARTMENT, and a DEPARTMENT may have EMPLOYEEs working in it. Optional participation in a relationship is shown by a circle placed on the relationship line next to the entity that optionally participates in the relationship. Mandatory participation is shown in a similar manner by placing a bar on the relationship line.


Recursive Relationships
Relationships can exist between different occurrences of the same type of entity. For example, a PERSON dances with another PERSON, an EMPLOYEE manages other EMPLOYEEs, and a COMPONENT is made from other COMPONENTs, as the sketch below illustrates.
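A recursive relationship maps naturally to a self-referencing foreign key. A minimal sketch of the EMPLOYEE-manages-EMPLOYEE case, with invented names and rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Self-referencing foreign key: an employee's manager is itself an
# employee, so manager_id points back at the same table.
cur.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        manager_id  INTEGER REFERENCES employee (employee_id)  -- NULL for the top manager
    )
""")
cur.executemany("INSERT INTO employee VALUES (?, ?, ?)",
                [(1, "Mackenzie", None), (2, "Asha", 1), (3, "Ravi", 1)])

# Join the table to itself to list each employee with their manager.
print(cur.execute("""
    SELECT e.name, m.name FROM employee e
    LEFT JOIN employee m ON e.manager_id = m.employee_id
""").fetchall())
conn.close()
```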

ER Model
The whole purpose of ER modelling is to create an accurate reflection of the real world in a database. The ER model doesn't actually give us a database description; it gives us an intermediate step from which it is easy to define a database. Let's look at an example. (You will see much more detail on these concepts in the rest of this chapter. For now, just try to understand the overall process.) Suppose you are presented with the following situation and are told to create a database for it: Every department within our company is in only one division. Each division has more than one department in it. We don't have an upper limit on the number of departments that a division can have. For example, the New Business Development department (the one managed by Mackenzie) and the Higher Education department are both in the Marketing division.


This is a fairly clear description of a situation. Many things that we understand about the situation are nevertheless left unsaid. For example: each division has a name, and that name is unique within the company.

Deriving an E-R Diagram
An E-R diagram is used to represent the E-R model. It contains all known entities for the given scenario, their attributes, identifiers, and the relationships that exist among the entities. Deriving the entities, their attributes, identifiers, and the relationships among them is one of the most important activities carried out during database development, so it is imperative that it is done accurately. From developing a number of databases, I have come up with a technique that allows for rapid database design and development. This technique was initially taught to me by consultants from the CRC SOGEMA INC. group in 1998, and a similar technique is described in the IBM Informix Database Design and Implementation Guide (2005).

Steps
1. Discover the entities and the basic attributes for each entity. This process usually involves performing a careful analysis of the processes and documentation used by the users in the problem domain. Make a list of each of the entities discovered along with the basic attributes of each entity.
2. Identify the relationships among the entities. Using an Entity-Entity Matrix (E-E Matrix) is one of the fastest methods for deriving all the relationships among all the entities. This matrix consists of an equal number of rows and columns, with each entity discovered heading a row and a column. The intersection of the rows and columns represents relationships that may exist between the entities: it is possible that no relationship will exist between entities, that more than one relationship will exist between entities, and that an entity will be related to itself.
3. Use the information from the E-E matrix to construct a simple initial E-R diagram. This diagram will contain the entities and their basic attributes (and possible identifiers), and the relationships indicated in the E-E matrix.
4. Derive optionalities and cardinalities for each relationship in the initial E-R diagram and write assertions for each relationship. This records the assertions that can be made about the data (relationships) in the users' domain. Optionality says what can and must happen in a relationship, and cardinality indicates the number of entity occurrences in a relationship:
   Optionality 0: Can (optional)
   Optionality 1: Must (obligatory)
   Cardinality 1: Only one
   Cardinality N: Many/At least one

5. Construct a detailed E-R diagram that includes all the derived information. This diagram will include optionalities, cardinalities, attributes, identifiers, and relationships. Unique identifiers are indicated with an asterisk (*), and non-unique identifiers with a plus (+).

Implementing the E-R Model: The Relational Model
While an E-R diagram displays the logical nature of the data that exists in the users' domain, the relational model shows how this data will be represented in a Relational Database Management System (RDBMS). This model is important because the most common and popular database management systems in use today are relational, and relational model diagrams can be directly implemented on any RDBMS. The relational model is similar to the E-R model but is based on a branch of mathematics called relational algebra, and as such, there are strict definitions and rules regarding the elements of this model. Here are some of these definitions and rules (see the sketch after this list):
- A relation is defined as a two-dimensional table that contains rows and columns. Rows are instances of the entity and columns are attributes of the entity.
- No two rows of the table may be identical: each row (the contents of the row, that is) must be unique, and the order of the rows is not important.
- Each column must have a unique name, and the order of the columns is not important. Columns must also contain single values: multiple values and repeating groups are not allowed.
- A key is a group of one or more attributes that uniquely identifies a row; each table must have a key. Since each row in the table is unique, the key is what is used to distinguish one row from another. It can be comprised of a single attribute or it can be composite, comprising more than one attribute. (Kroenke D, 1999, p. 113-117)
- Every column (attribute) in a table must depend solely on the primary key. It should not depend on any other column in the table, or on part of the primary key if it is a composite key.
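A hedged sketch of how an RDBMS enforces some of these rules, reusing the department/division example from earlier; sqlite3, with invented names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A relation as a two-dimensional table: single-valued columns, and a
# key (dept_id) that uniquely identifies each row.
cur.execute("""
    CREATE TABLE department (
        dept_id   INTEGER PRIMARY KEY,
        dept_name TEXT NOT NULL UNIQUE,
        division  TEXT NOT NULL
    )
""")
cur.execute("INSERT INTO department VALUES (1, 'New Business Development', 'Marketing')")

# A second row with the same key value would break row uniqueness,
# so the DBMS rejects it.
try:
    cur.execute("INSERT INTO department VALUES (1, 'Higher Education', 'Marketing')")
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
conn.close()
```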


Database management systems


A database management system (DBMS) is a software package with computer programs that control the creation, maintenance, and use of a database. It allows organizations to conveniently develop databases for various applications with the help of database administrators (DBAs) and other specialists. A database is an integrated collection of data records, files, and other database objects. A DBMS allows different user application programs to concurrently access the same database. DBMSs may use a variety of database models, such as the relational model or object model, to conveniently describe and support applications. A DBMS typically supports query languages, which are in fact high-level programming languages: dedicated database languages that considerably simplify writing database application programs. Database languages also simplify the database organization as well as retrieving and presenting information from it. A DBMS provides facilities for controlling data access, enforcing data integrity, managing concurrency control, recovering the database after failures and restoring it from backup files, as well as maintaining database security.

Overview
A DBMS is a set of software programs that controls the organization, storage, management, and retrieval of data in a database. DBMSs are categorized according to their data structures or types. The DBMS accepts requests for data from an application program and instructs the operating system to transfer the appropriate data. The queries and responses must be submitted and received according to a format that conforms to one or more applicable protocols. When a DBMS is used, information systems can be changed more easily as the organization's information requirements change. New categories of data can be added to the database without disruption to the existing system.

Database servers are dedicated computers that hold the actual databases and run only the DBMS and related software. Database servers are usually multiprocessor computers, with generous memory and RAID disk arrays used for stable storage. Hardware database accelerators, connected to one or more servers via a high-speed channel, are also used in large-volume transaction processing environments. DBMSs are found at the heart of most database applications. DBMSs may be built around a custom multitasking kernel with built-in networking support, but modern DBMSs typically rely on a standard operating system to provide these functions.

History
Databases have been in use since the earliest days of electronic computing. Unlike modern systems, which can be applied to widely different databases and needs, the vast majority of older systems were tightly linked to custom databases in order to gain speed at the expense of flexibility. Originally, DBMSs were found only in large organizations with the computer hardware needed to support large data sets.

1960s: Navigational DBMS
As computers grew in speed and capability, a number of general-purpose database systems emerged; by the mid-1960s there were a number of such systems in commercial use. Interest in a standard began to grow, and Charles Bachman, author of one such product, the Integrated Data Store (IDS), founded the "Database Task Group" within CODASYL, the group responsible for the creation and standardization of COBOL. In 1971 they delivered their standard, which generally became known as the "Codasyl approach", and soon a number of commercial products based on this approach were made available.

The Codasyl approach was based on the "manual" navigation of a linked data set which was formed into a large network. When the database was first opened, the program was handed back a link to the first record in the database, which also contained pointers to other pieces of data. To find any particular record, the programmer had to step through these pointers one at a time until the required record was returned. Simple queries like "find all the people in India" required the program to walk the entire data set and collect the matching results one by one. There was, essentially, no concept of "find" or "search". This may sound like a serious limitation today, but in an era when most data was stored on magnetic tape, such operations were too expensive to contemplate anyway.

IBM also had its own DBMS by 1968, known as IMS. IMS was a development of software written for the Apollo program on the System/360. IMS was generally similar in concept to Codasyl, but used a strict hierarchy for its model of data navigation instead of Codasyl's network model. Both concepts later became known as navigational databases due to the way data was accessed, and Bachman's 1973 Turing Award presentation was The Programmer as Navigator. IMS is classified as a hierarchical database; IDMS and CINCOM's TOTAL database are classified as network databases.


1970s: Relational DBMS
Edgar Codd worked at IBM in San Jose, California, in one of their offshoot offices that was primarily involved in the development of hard disk systems. He was unhappy with the navigational model of the Codasyl approach, notably the lack of a "search" facility. In 1970, he wrote a number of papers outlining a new approach to database construction that eventually culminated in the groundbreaking A Relational Model of Data for Large Shared Data Banks. In this paper, he described a new system for storing and working with large databases. Instead of records being stored in some sort of linked list of free-form records as in Codasyl, Codd's idea was to use a "table" of fixed-length records. A linked-list system would be very inefficient when storing "sparse" databases, where some of the data for any one record could be left empty. The relational model solved this by splitting the data into a series of normalized tables, with optional elements being moved out of the main table to where they would take up room only if needed.

In the relational model, related records are linked together with a "key".


For instance, a common use of a database system is to track information about users: their names, login information, various addresses, and phone numbers. In the navigational approach, all of these data would be placed in a single record, and unused items would simply not be placed in the database. In the relational approach, the data would be normalized into a user table, an address table, and a phone number table (for instance). Records would be created in these optional tables only if the address or phone numbers were actually provided. Linking the information back together is the key to this system. In the relational model, some bit of information was used as a key, uniquely defining a particular record. When information was being collected about a user, information stored in the optional (or related) tables would be found by searching for this key. For instance, if the login name of a user is unique, addresses and phone numbers for that user would be recorded with the login name as its key (see the sketch below).

This "re-linking" of related data back into a single collection is something that traditional computer languages are not designed for. Just as the navigational approach would require programs to loop in order to collect records, the relational approach would require loops to collect information about any one record. Codd's solution to the necessary looping was a set-oriented language, a suggestion that would later spawn the ubiquitous SQL. Using a branch of mathematics known as tuple calculus, he demonstrated that such a system could support all the operations of normal databases (inserting, updating, etc.) as well as providing a simple system for finding and returning sets of data in a single operation.

Codd's paper was picked up by two people at Berkeley, Eugene Wong and Michael Stonebraker. They started a project known as INGRES using funding that had already been allocated for a geographical database project, using student programmers to produce code. Beginning in 1973, INGRES delivered its first test products, which were generally ready for widespread use in 1979. During this time, a number of people had moved "through" the group: perhaps as many as 30 people worked on the project, about five at a time. INGRES was similar to System R in a number of ways, including the use of a "language" for data access, known as QUEL. QUEL was in fact relational, having been based on Codd's own Alpha language, but it has since been corrupted to follow SQL, thus violating much the same concepts of the relational model as SQL itself.
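A hedged sketch of the normalization just described, with invented table and column names: optional addresses and phone numbers move into their own tables keyed by the unique login name, and a join "re-links" the data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized tables: the main user record stays small, and optional
# data lives in separate tables keyed by the login name.
cur.execute("CREATE TABLE user    (login TEXT PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE address (login TEXT REFERENCES user (login), street TEXT)")
cur.execute("CREATE TABLE phone   (login TEXT REFERENCES user (login), number TEXT)")

cur.execute("INSERT INTO user  VALUES ('asha', 'Asha Rao')")
cur.execute("INSERT INTO phone VALUES ('asha', '555-0100')")
# No address row for 'asha': a sparse record wastes no space.

# Re-linking the related data with the key:
print(cur.execute("""
    SELECT u.name, p.number FROM user u JOIN phone p ON u.login = p.login
""").fetchall())
conn.close()
```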


IBM itself did one test implementation of the relational model, PRTV, and a production one, Business System 12, both now discontinued. Honeywell did MRDS for Multics, and now there are two new implementations: Alphora Dataphor and Rel. All other DBMS implementations usually called relational are actually SQL DBMSs.

In 1968, the University of Michigan began development of the Micro DBMS. It was used to manage very large data sets by the US Department of Labor, the Environmental Protection Agency, and researchers from the University of Alberta, the University of Michigan, and Wayne State University. It ran on mainframe computers using the Michigan Terminal System. The system remained in production until 1996.

Late 1970s: SQL DBMS
IBM started working on a prototype system loosely based on Codd's concepts as System R in the early 1970s. The first version was ready in 1974/5, and work then started on multi-table systems in which the data could be split so that all of the data for a record (some of which is optional) did not have to be stored in a single large "chunk". Subsequent multi-user versions were tested by customers in 1978 and 1979, by which time a standardized query language, SQL, had been added. Codd's ideas were establishing themselves as both workable and superior to Codasyl, pushing IBM to develop a true production version of System R, known as SQL/DS and, later, Database 2 (DB2).

Many of the people involved with INGRES became convinced of the future commercial success of such systems and formed their own companies to commercialize the work, but with an SQL interface. Sybase, Informix, NonStop SQL, and eventually Ingres itself were all being sold as offshoots of the original INGRES product in the 1980s. Even Microsoft SQL Server is actually a re-built version of Sybase, and thus of INGRES. Only Larry Ellison's Oracle started from a different chain, based on IBM's papers on System R, and beat IBM to market when the first version was released in 1978.

Stonebraker went on to apply the lessons from INGRES to develop a new database, Postgres, which is now known as PostgreSQL. PostgreSQL is often used for global mission-critical applications (the .org and .info domain name registries use it as their primary data store, as do many large companies and financial institutions).


In Sweden, Codd's paper was also read, and Mimer SQL was developed from the mid-1970s at Uppsala University. In 1984, this project was consolidated into an independent enterprise. In the early 1980s, Mimer introduced transaction handling for high robustness in applications, an idea that was subsequently implemented in most other DBMSs.

1980s: Object-oriented databases
The 1980s, along with a rise in object-oriented programming, saw a growth in how data in various databases were handled. Programmers and designers began to treat the data in their databases as objects. That is to say that if a person's data were in a database, that person's attributes, such as their address, phone number, and age, were now considered to belong to that person instead of being extraneous data. This allows relations between data to be relations to objects and their attributes, and not to individual fields.

Another big game changer for databases in the 1980s was the focus on increasing reliability and access speeds. In 1989, two professors from the University of Wisconsin at Madison published an article at an ACM-associated conference outlining their methods for increasing database performance. The idea was to replicate specific important, and often queried, information and store it in a smaller temporary database that linked these key features back to the main database. This meant that a query could search the smaller database much more quickly, rather than searching the entire dataset. This eventually led to the practice of indexing, which is used by almost every operating system from Windows to the system that operates Apple iPod devices.

21st century: NoSQL databases
In the 21st century a new trend of NoSQL databases started. These non-relational databases are significantly different from the classic relational databases. They often do not require fixed table schemas, avoid join operations by storing denormalized data, and are designed to scale horizontally. Most of them can be classified as either key-value stores or document-oriented databases (a toy illustration follows at the end of this section).

In recent years there has been a high demand for massively distributed databases with high partition tolerance, but according to the CAP theorem it is impossible for a distributed system to simultaneously provide consistency, availability, and partition tolerance guarantees. A distributed system can satisfy any two of these guarantees at the same time, but not all three. For that reason, many NoSQL databases use what is called eventual consistency to provide both availability and partition tolerance guarantees with a maximum level of data consistency. The most popular software in this category includes memcached, Redis, MongoDB, CouchDB, Apache Cassandra, and HBase.

Current trends
In 1998, database management was in need of a new style of databases to solve current database management problems. Researchers realized that the old trends of database management were becoming too complex, and there was a need for automated configuration and management. Surajit Chaudhuri, Gerhard Weikum, and Michael Stonebraker were the pioneers that dramatically affected thinking about database management systems. They believed that database management needed a more modular approach, and that too many specifications were demanded of users. Since this new development process of database management, there are more possibilities. Database management is no longer limited to monolithic entities; many solutions have been developed to satisfy the individual needs of users. The development of numerous database options has created flexibility in database management.

There are several ways database management has affected the field of technology. Because organizations' demand for directory services has grown as they expand in size, businesses use directory services that provide prompted searches for company information. Mobile devices are able to store more than just the contact information of users, and can cache and display a large amount of information on smaller displays. Search engine queries are able to locate data within the World Wide Web. Retailers have also benefited from the developments of data warehousing, recording customer transactions. Online transactions have become tremendously popular for e-business. Consumers and businesses are able to make payments securely through some company websites.
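Returning to the key-value versus document-oriented distinction above: the toy sketch below mimics both styles with plain Python dictionaries, purely for illustration (no real NoSQL product works this simply, and the keys and documents are invented).

```python
# Key-value store: opaque values looked up by a single key.
kv_store = {}
kv_store["session:42"] = b"serialized-state"   # the store never looks inside
print(kv_store["session:42"])

# Document store: values are structured, schema-free documents, so
# different documents may carry different fields (denormalized data).
doc_store = {}
doc_store["user:asha"] = {"name": "Asha", "phones": ["555-0100"]}
doc_store["user:ravi"] = {"name": "Ravi", "city": "Mumbai"}  # no phones field
print(doc_store["user:ravi"]["city"])
```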


Components

DBMS engine accepts logical requests from various other DBMS subsystems, converts them into physical equivalents, and actually accesses the database and data dictionary as they exist on a storage device.

Data definition subsystem helps the user create and maintain the data dictionary and define the structure of the files in a database. Data manipulation subsystem helps the user to add, change, and delete information in a database and query it for valuable information. Software tools within the data manipulation subsystem are most often the primary interface between the user and the information contained in a database. It allows users to specify their logical information requirements.

Application generation subsystem contains facilities to help users develop transaction-intensive applications. It usually requires that the user perform a detailed series of tasks to process a transaction. It facilitates easy-to-use data entry screens, programming languages, and interfaces.

Data administration subsystem helps users manage the overall database environment by providing facilities for backup and recovery, security management, query optimization, concurrency control, and change management.

Modeling language
A modeling language is a data modeling language used to define the schema of each database hosted in the DBMS, according to the DBMS database model. Database management systems are designed to use one of five database structures to provide simple access to information stored in databases. The five database structures are: the hierarchical model, the network model, the relational model, the multidimensional model, and the object model. Inverted lists and other methods are also used. A given database management system may provide one or more of the five models. The optimal structure depends on the natural organization of the application's data and on the application's requirements, which include transaction rate (speed), reliability, maintainability, scalability, and cost.

- The hierarchical structure was used in early mainframe DBMSs. Record relationships form a treelike model. This structure is simple but inflexible because the relationship is confined to a one-to-many relationship. IBM's IMS system and the RDM Mobile are examples of hierarchical database systems with multiple hierarchies over the same data. RDM Mobile is a newly designed embedded database for a mobile computer system. The hierarchical structure is used primarily today for storing geographic information and file systems.
- The network structure consists of more complex relationships. Unlike the hierarchical structure, it can relate to many records and accesses them by following one of several paths. In other words, this structure allows for many-to-many relationships.
- The relational structure is the most commonly used today. It is used by mainframe, midrange, and microcomputer systems. It uses two-dimensional rows and columns to store data. The tables of records can be connected by common key values. While working for IBM, E. F. Codd designed this structure in 1970. The model is not easy for the end user to run queries with, because it may require a complex combination of many tables.
- The multidimensional structure is similar to the relational model. The dimensions of the cube-like model have data relating to elements in each cell. This structure gives a spreadsheet-like view of data. It is easy to maintain because records are stored as fundamental attributes (the same way they are viewed), and the structure is easy to understand. Its high performance has made it the most popular database structure when it comes to enabling online analytical processing (OLAP).
- The object-oriented structure has the ability to handle graphics, pictures, voice, and text (types of data) without difficulty, unlike the other database structures. This structure is popular for multimedia Web-based applications. It was designed to work with object-oriented programming languages such as Java.


The dominant model in use today is the ad hoc one embedded in SQL, despite the objections of purists who believe this model is a corruption of the relational model, since it violates several fundamental principles for the sake of practicality and performance. Many DBMSs also support the Open Database Connectivity API, which provides a standard way for programmers to access the DBMS.

Before the database management approach, organizations relied on file processing systems to organize, store, and process data files. End users criticized file processing because the data was stored in many different files, each organized in a different way. Each file was specialized for use with a specific application. File processing was bulky, costly and inflexible when it came to supplying needed data accurately and promptly. Data redundancy was an issue with the file processing system because the independent data files produced duplicate data, so when updates were needed each separate file had to be updated. Another issue was the lack of data integration: the data was dependent on other data to organize and store it. Lastly, there was no consistency or standardization of the data in a file processing system, which made maintenance difficult. For these reasons, the database management approach was produced.

Data structure

Data structures (fields, records, files and objects) are optimized to deal with very large amounts of data stored on a permanent data storage device (which implies relatively slow access compared to volatile main memory).

Database query language

A database query language and report writer allow users to interactively interrogate the database, analyze its data and update it according to the user's privileges on the data. It also controls the security of the database. Data security prevents unauthorized users from viewing or updating the database. Using passwords, users are allowed access to the entire database or subsets of it called subschemas. For example, an employee database can contain all the data about an individual employee, but one group of users may be authorized to view only payroll data, while others are allowed access to only work history and medical data.
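A minimal SQL sketch of such a subschema, assuming a hypothetical Employee table and a payroll_group of users, restricts one group to the payroll subset through a view:

-- The subschema is exposed as a view over the full employee data
CREATE VIEW Payroll_view AS
    SELECT employee_id, emp_name, salary, tax_code
    FROM Employee;

GRANT SELECT ON Payroll_view TO payroll_group;  -- this group sees only payroll data
-- Other groups would receive analogous views over work history or medical data.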


If the DBMS provides a way to interactively enter and update the database, as well as interrogate it, this capability allows for managing personal databases. However, it may not leave an audit trail of actions or provide the kinds of controls necessary in a multi-user organization. These controls are only available when a set of application programs is customized for each data entry and updating function.

Transaction mechanism

A database transaction mechanism ideally guarantees ACID properties in order to ensure data integrity despite concurrent user accesses (concurrency control) and faults (fault tolerance). It also maintains the integrity of the data in the database. The DBMS can maintain the integrity of the database by not allowing more than one user to update the same record at the same time. The DBMS can also help prevent duplicate records via unique index constraints; for example, no two customers with the same customer number (key field) can be entered into the database.
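A minimal sketch of both mechanisms in standard SQL (table and column names are assumptions for illustration): a funds transfer treated as one atomic transaction, and a unique index that rejects duplicate customer numbers.

-- Either both updates take effect, or neither does (atomicity)
START TRANSACTION;
UPDATE Account SET balance = balance - 100 WHERE account_no = 1;
UPDATE Account SET balance = balance + 100 WHERE account_no = 2;
COMMIT;

-- Duplicate prevention via a unique index on the key field
CREATE UNIQUE INDEX customer_no_idx ON Customer (customer_number);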

External, logical and internal view

Traditional view of data

A DBMS provides the ability for many different users to share data and processing resources. As there can be many different users, there are many different database needs. The question is: how can a single, unified database meet the varying requirements of so many users?


A DBMS minimizes these problems by providing three views of the database data: an external view (or user view), a logical view (or conceptual view) and a physical (or internal) view. The user's view of a database program represents data in a format that is meaningful to the user and to the software programs that process those data. One strength of a DBMS is that while there is typically only one conceptual (or logical) and physical (or internal) view of the data, there can be an endless number of different external views. This feature allows users to see database information in a more business-related way rather than from a technical, processing viewpoint. Thus the logical view refers to the way the user views the data, and the physical view refers to the way the data are physically stored and processed.

Features and capabilities

Alternatively, and especially in connection with the relational model of database management, the relation between attributes drawn from a specified set of domains can be seen as being primary. For instance, the database might indicate that a car that was originally "red" might fade to "pink" in time, provided it was of some particular "make" with an inferior paint job. Such higher arity relationships provide information on all of the underlying domains at the same time, with none of them being privileged above the others.

Simple definition

A database management system is a system in which related data is stored in an efficient and compact manner. "Efficient" means that the stored data can be accessed quickly, and "compact" means that it takes up little space in the computer's memory. The phrase "related data" means that the data stored pertains to a particular topic.

Specialized databases have existed for scientific, imaging, document storage and similar uses. Functionality drawn from such applications has begun appearing in mainstream DBMSs as well. However, the main focus, at least when aimed at the commercial data processing market, is still on descriptive attributes on repetitive record structures.


Thus, the DBMSs of today roll together frequently needed services and features of attribute management. By externalizing such functionality to the DBMS, applications effectively share code with each other and are relieved of much internal complexity. Features commonly offered by database management systems include:

Query ability

Querying is the process of requesting attribute information from various perspectives and combinations of factors. Example: "How many 2-door cars in Texas are green?" A database query language and report writer allow users to interactively interrogate the database, analyze its data and update it according to the user's privileges on the data.

Backup and replication

Copies of attributes need to be made regularly in case primary disks or other equipment fails. A periodic copy of attributes may also be created for a distant organization that cannot readily access the original. DBMSs usually provide utilities to facilitate the process of extracting and disseminating attribute sets. When data is replicated between database servers, so that the information remains consistent throughout the database system and users cannot tell or even know which server in the DBMS they are using, the system is said to exhibit replication transparency.

Rule enforcement

Often one wants to apply rules to attributes so that the attributes are clean and reliable. For example, we may have a rule that says each car can have only one engine associated with it (identified by engine number). If somebody tries to associate a second engine with a given car, we want the DBMS to deny such a request and display an error message. However, with changes in the model specification, such as, in this example, hybrid gas-electric cars, rules may need to change. Ideally such rules should be able to be added and removed as needed without significant data layout redesign. (A small SQL sketch of querying and rule enforcement follows the Security paragraph below.)

Security

For security reasons, it is desirable to limit who can see or change specific attributes or groups of attributes. This may be managed directly on an individual basis, or by the assignment of individuals and privileges to groups, or (in the most elaborate models) through the assignment of individuals and groups to roles which are then granted entitlements.
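A minimal sketch of the query ability and rule enforcement features above, with all table and column names invented for illustration:

-- Query ability: "How many 2-door cars in Texas are green?"
SELECT COUNT(*)
FROM Cars
WHERE doors = 2 AND state = 'TX' AND colour = 'green';

-- Rule enforcement: each car can have at most one engine.
-- The UNIQUE constraint on car_id rejects a second engine for the same car.
CREATE TABLE Engines (
    engine_number VARCHAR(20) PRIMARY KEY,
    car_id        INT UNIQUE REFERENCES Cars(car_id)
);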


Computation

Common computations requested on attributes are counting, summing, averaging, sorting, grouping, cross-referencing, and so on. Rather than have each computer application implement these from scratch, applications can rely on the DBMS to supply such calculations. (A small SQL sketch follows at the end of this list.)

Change and access logging

This describes who accessed which attributes, what was changed, and when it was changed. Logging services allow this by keeping a record of access occurrences and changes.

Automated optimization

For frequently occurring usage patterns or requests, some DBMSs can adjust themselves to improve the speed of those interactions. In some cases the DBMS will merely provide tools to monitor performance, allowing a human expert to make the necessary adjustments after reviewing the statistics collected.

Meta-data repository

Metadata is data describing data. For example, a listing that describes what attributes are allowed to be in data sets is called "meta-information".

Advanced DBMS

An example of an advanced DBMS is the Distributed Database Management System (DDBMS), a collection of data which logically belongs to the same system but is spread over the sites of a computer network. The two aspects of a distributed database are distribution and logical correlation:

Distribution: the fact that the data are not resident at the same site, so that we can distinguish a distributed database from a single, centralized database.

Logical correlation: the fact that the data have some properties which tie them together, so that we can distinguish a distributed database from a set of local databases or files which are resident at different sites of a computer network.
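The computations listed above map directly onto SQL aggregates; a hedged sketch over the same hypothetical Cars table used earlier:

-- Counting, averaging and grouping supplied by the DBMS rather than the application
SELECT state, colour, COUNT(*) AS num_cars, AVG(price) AS avg_price
FROM Cars
GROUP BY state, colour
ORDER BY state;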


Access Control in Database Management Systems

Introduction

A result of the use of database management systems is the existence of one data set on which all users and applications in an organization work. On one side, the hard problems of duplication, inconsistency and application dependency of data disappear. On the other side, much more dangerous security problems appear; a brief introduction to them follows. The design of a secure DBMS assumes the identification of security risks and the selection of the right security policies (what the security system is supposed to do) and mechanisms (the way we are going to achieve that) for their neutralization. A secure database system should satisfy three basic requirements on data protection: security, integrity and availability. What is the content of those words?

Ensuring security - preventing, detecting and deterring improper disclosure of information. This is especially important in strongly protected environments (e.g., the army).

Ensuring integrity - preventing, detecting and deterring improper changes of information. The proper function of any organization depends on proper operations on proper data.

Ensuring system availability - preventing improper denial of the services that the DBMS provides.

Security Policies

Security policies are guidelines describing all (if possible) actions pertinent to the information system. Logical access control belongs to that area, and so security policies should define the principles on which the design of a secure DBMS is based. Generally, policies should give answers to basic security questions. Policies can be divided into two basic types - minimal privilege (the army) and maximal privilege (a university). A system with such a policy is called closed or open, respectively.


Talking about access control, the way access rules are administered should be determined. Common schemes include the following (a small SQL sketch follows the list):

hierarchical decentralized - a central authorizer distributes responsibilities among dependent subauthorizers

ownership based - the owner of an object (its author) determines access to the object

cooperative authorization - authorization of special rights for special resources is approved by all members of a predefined group
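In SQL terms (a hedged sketch; alice and bob are hypothetical users), ownership-based and decentralized administration look like this:

-- Ownership based: the creator of My_table owns it and hands out rights
GRANT SELECT, UPDATE ON My_table TO alice;

-- Decentralized: WITH GRANT OPTION makes alice a subauthorizer
GRANT SELECT ON My_table TO alice WITH GRANT OPTION;
-- (executed by alice) GRANT SELECT ON My_table TO bob;

-- Revocation can cascade through the rights alice passed on
REVOKE SELECT ON My_table FROM alice CASCADE;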

Security Mechanisms

Once the policies are defined, mechanisms that fulfill them can be selected. Mechanisms are either external:

administrative controls

physical controls

or internal, forming part of the information system itself:

authentication - the user's identity is verified; this process is based on knowledge of something, ownership of an object, or a physical characteristic of the user

authorization - the system answers only those queries that the user is authorized for (access control)

audit - composed of two phases: logging of actions in the system and reporting of the logged information

Security Threat

This term has been used several times already. A security threat is any hostile agent that, randomly or with the use of specialized techniques, can obtain or change information in the information system. Random security threats are:

natural or accidental disasters - earthquake, water damage or fire. Both data and hardware are damaged, which leads to integrity violation and denial of service.

errors and bugs in hardware and software - cause improper application of security policies.


human errors - unintentional violations such as incorrect input or wrong use of applications.

Intentional security threats can be categorized according to their originator:


authorized users - abuse of their privileges

hostile agents - various hostile programs such as viruses, Trojan horses and back-doors

Requirements on DBMS Security

At this moment we have a basic picture of information system security, and we can take a look at the concrete aspects that should be covered by DBMS security mechanisms.

1. Protection from improper access - only authorized users should be granted access to objects of the DBMS. This control should also apply to smaller objects (a record, an attribute, a value).

2. Protection from inference - inference of confidential information from available data should be avoided. This regards mainly statistical DBMSs.

3. Database integrity - ensured partly by system controls of the DBMS (atomic transactions) and various back-up and recovery procedures, and partly by security procedures.

4. Operational data integrity - logical consistency of data during concurrent transactions (concurrency manager), serializability and isolation of transactions (locking techniques).

5. Semantic data integrity - ensuring that attribute values are in allowed ranges. This is ensured with integrity constraints.

6. Accountability and auditing - it should be possible to log all data accesses.

7. User authentication - there should be unambiguous identification of each DBMS user. This is the basis for all authorization mechanisms.

8. Management and protection of sensitive data - access should be granted only to a narrow circle of users.

9. Multilevel security - data may be classified according to their sensitivity. Access granting should then depend on that classification.

10. Confinement (subject isolation) - it is necessary to isolate subjects to avoid uncontrolled data flow between programs (memory channels, covert channels).


At least five aspects from the previous list must be ensured with special techniques that do not exist in insecure DBMSs. There are three basic ways to do it:

flow control - information flows within the DBMS are controlled

inference control - dependencies among data are controlled

access control - access to the information in the DBMS is restricted

Cryptographic techniques can be added to these controls.

Flow Control

Flow control regulates the distribution (flow) of information among available objects, e.g. reading information from object X and writing it directly into object Y. Flow control policies need a list of acceptable information flows and their constraints. Flow constraints are often based on classification of system elements and definition of acceptable flows between classification levels.

Inference Control

The aim of inference control is to avoid indirect disclosure of information (a set of data X that can be read by a user can be used to determine a data set Y, where Y = f(X)). Generally there are three ways to unauthorized data disclosure:

1. correlated data - the typical channel, when visible data X are semantically related to invisible data Y

2. missing data - the result of a query contains NULL values that mask sensitive data. The existence of that data may be detected that way.

3. statistical inference - typical for databases that provide statistical information about entities. Statistical databases do not allow direct access to data, and the user may query only statistical data. Attacks on these DBMSs can be faced with two different approaches:

1. data perturbation - concrete data are replaced with statistical results

2. query controls - more frequently used; mostly based on the minimal and maximal number of items that a query concerns. The results are satisfactory, but this technique is expensive and difficult to administer.
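A minimal sketch of such a set-size query control in SQL (the table name and the threshold of 5 are assumptions for illustration): the statistical interface publishes a statistic only when enough records contribute to it.

-- Suppress statistics computed over fewer than 5 individuals
SELECT department, AVG(salary) AS avg_salary
FROM Employee
GROUP BY department
HAVING COUNT(*) >= 5;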


Access Control

Access control is responsible for enforcing the rules determined by security policies for all direct accesses to the system. Traditional control systems work with the notions of subject, object and operation.
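Those three notions can themselves be kept in a relation; a hedged sketch (table and names are invented for illustration):

-- Subjects, objects and permitted operations as rows of one relation
CREATE TABLE Access_rule (
    subject_id VARCHAR(30),   -- who is requesting access
    object_id  VARCHAR(30),   -- what is being accessed
    operation  VARCHAR(10),   -- e.g. 'read', 'write'
    PRIMARY KEY (subject_id, object_id, operation)
);

-- An access request (s, o, op) is allowed if a matching row exists:
SELECT COUNT(*)
FROM Access_rule
WHERE subject_id = 'smith' AND object_id = 'salaries' AND operation = 'read';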

Fig. 1. Schema of a secure database management system

Security Models

The aim of security modeling is to create an abstract, software-independent, conceptual model from the requirements specification for system protection. A security model should provide a rich semantic representation which allows description of the functional and structural properties of the security system. It should also provide definitions for protection requirements and system policies. A proof of the model's properties should be available, too.


It is clear that the level on which we decide to describe access control can differ greatly, as we will see in the next paragraph. A description of basic approaches to access control is introduced in chapter 2.2. A description of concrete models follows.

Abstractions of Access Models

Access control models can be classified on several levels, as proposed by LaPadula and Williams. The following levels proceed from the general to the more implementation dependent:

1. trust objectives - definition of basic requirements on the system's trustworthiness

2. requirements on the external interface - security requirements on the system-environment interface

3. internal requirements - requirements that must be satisfied in system components

4. operational rules - describe the assurance of internal requirements

5. functional design - functional description of the behavior of system components

The models that are going to be described are on level 3. The TBAC and RBAC model families partly reach into level 2 because they are much more general and cover a wider range of problems.

Types of Access Control Models

Security models can be classified according to many aspects, for example target system, type of security policy, addressed security aspects, type of control (direct or of flows) and so on. Here we will talk about the type of security policy that is enforced by the model. Two basic model types arose very early - discretionary and mandatory access control. In the former, the owner of the data governs access. This is the most common form of authorization administration - ownership based. That policy is very flexible but also very difficult to control from a global point of view. Models with mandatory access control enforce a global policy by controlling flows among security levels that are assigned to objects. It seemed that nothing else would exist, but OO technologies have encouraged new approaches that reflect OO DBMSs and the new requirements of the commercial sphere. The first is RBAC - access control based on roles - and the second is TBAC, which is based on the concept of a task. TBAC brings absolutely new ideas and the notion of active security.


We finish this introduction and try to describe the policy types on concrete models.

HRU - Matrix Access Model

The HRU (Harrison-Ruzzo-Ullman) model covers security of data for DBMSs and operating systems. It was proposed in 1971 and formalized in 1976.

Authorization State

The authorization state is derived from relations between objects and subjects. It is defined as a triple Q = (S, O, A), where S is the set of subjects, O is the set of objects (which also contains the subjects) and A is the access matrix (A[s, o] = r). The authorization state for a DBMS can be enriched by predicates on access rights. Those predicates may be data, time, context or history dependent. Six primitive operations are defined for authorization state administration: granting and revoking of a right r in A[s, o], creating and deleting of a subject s, and creating and deleting of an object o. The operations are used to compose commands for authorization state modifications. Harrison assumes that modification commands have two parts, conditional and executive; the executive part is performed when the conditional part is true. The set of possible access rights depends on the object. The model knows read, write, append, execute and own. The last one determines the owner, who administers access to the object.

Administration of Authorizations

The owner of an object may grant or revoke any rights (except own) to the given object for other subjects - even rights that he does not possess at the moment. This approach breaks the principle of attenuation of privileges. In some systems there are modifications of access rights that allow other subjects to administer access rights. This is expressed by an explicit flag on the access right that may take two forms.

copy flag - m* in A[s1, o] allows s1 to grant the access right m for object o to other subjects; the rights of subject s1 do not change and subject s2 receives the right m for object o


Fig. 2. Use of copy flag

transfer flag - subject s1, holding m+ (m with the transfer flag), grants the right m for object o to subject s2; subject s1 loses the right

Fig. 3. Use of transfer flag

Model Extensions

The HRU model is very flexible for the expression and control of access rights to information. The most important problem is the safety question (is there a state, reachable from the given starting authorization state, in which a subject receives a certain right for a given object?), which is undecidable. This problem is connected with the danger of Trojan horses in DAC. The model has been extended in several ways to make the safety question decidable:

Schematic protection model (Sandhu)

Extended schematic protection model (Ammann and Sandhu)

Typed access matrix (TAM) model (Sandhu) - each object and subject has a type that is assigned at the creation of the entity and does not change during its lifecycle

Bell-LaPadula Model

This model is an extension of the HRU model, oriented toward the definition of security requirements in complex systems whose elements can be classified into security levels. The model was proposed in 1973 for operating systems and became the reference model for mandatory access control (MAC).

Role-Based Access Control

This model family has been proposed to satisfy the requirements of the commercial sphere. The basic idea is the notion of a role. Users are granted roles that allow them to perform certain actions assigned to those roles. Roles are usually created according to the functions in an organization. The moment we decide to revoke or change a user's access, the only thing we have to do is remove him from a group or reassign him to another one. This change is very easy for an administrator. This approach offers a very suitable abstraction for expressing the security policies of large companies and seems to be a promising alternative to the traditional MAC and DAC models. RBAC is suitable for environments where MAC models are too strict. It has been proved that RBAC can simulate mandatory access models and, in addition, that MAC models are a specific case of RBAC. One may say that RBAC is based on something very similar to the groups used in many older models. Groups of users are used to associate subjects with the same access rights. Such a group does not represent a role but common needs for the execution of certain system actions. Roles in RBAC exploit groups to abstract the access rights needed for some concrete actions that are later accessible to role members. The narrow, specific meaning of roles cannot exhaustively express the status of users. It seems that for an effective implementation of role-based access control we need to define three aspects: role, individuality and locality. We can demonstrate this on an example in which a firm engages a new person - Smith. His position is seller (his role). During the trial period you do not want him to do transactions with banks (individuality), and finally you do not want him to access your system anywhere other than in New York (locality). What are the advantages of the system? You are not bound during policy selection; the model offers strict separation of duties, easy expression of organizational policy and, of course, easy administration of access rights.
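Standard SQL supports the core of this idea directly; a minimal sketch (role, table and user names are invented for illustration):

-- Rights are attached to the role, not to individual users
CREATE ROLE seller;
GRANT SELECT, INSERT ON orders TO seller;
GRANT SELECT ON price_list TO seller;

-- Engaging Smith as a seller is a single statement...
GRANT seller TO smith;

-- ...and so is changing or revoking his access later
REVOKE seller FROM smith;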


Relational Database Management Systems (RDBMS)


A relational database management system (RDBMS) is a database management system (DBMS) that is based on the relational model. A short definition of an RDBMS is: a DBMS in which data is stored in tables and the relationships among the data are also stored in tables. The data can be accessed or reassembled in many different ways without having to change the table forms.

The relational database concept was obviously thought about by folk who on a day-to-day basis had to manage large production data structures in the 1960's and 1970's and 1980's and 1990's and today. But it was probably Codd (Dr. E.F.) who first published (in an IBM Research Report of August 1969) some form of specification (set) of rules by which such a species of database would be known:

CODD'S TWELVE RULES

Rule 1: The Information Rule: All information in a relational database is represented explicitly at the logical level in exactly one way - by values in tables.

Rule 2: Guaranteed Access Rule: Each and every datum (atomic value) in a relational database is guaranteed to be logically accessible by resorting to a table name, primary key value, and column name.

Rule 3: Systematic Treatment of Null Values: Null values (distinct from an empty character string or a string of blank characters and distinct from zero or any other number) are supported in the fully relational DBMS for representing missing information in a systematic way, independent of data type.

Rule 4: Dynamic On-line Catalog Based on the Relational Model: The database description is represented at the logical level in the same way as ordinary data, so authorized users can apply the same relational language to its interrogation as they apply to regular data.

Rule 5: Comprehensive Data Sublanguage Rule: A relational system may support several languages and various modes of terminal use (for example, the fill-in-blanks mode). However, there must be at least one language whose statements are expressible, per some well-defined syntax, as character strings and that is comprehensive in supporting all of the following: data definition, view definition, data manipulation (interactive and by program), integrity constraints, and transaction boundaries (begin, commit, and rollback).


Rule 6: View Updating Rule: All views that are theoretically updateable are also updateable by the system.

Rule 7: High-level Insert, Update, and Delete: The capability of handling a base relation or a derived relation as a single operand applies not only to the retrieval of data but also to the insertion, update, and deletion of data.

Rule 8: Physical Data Independence: Application programs and terminal activities remain logically unimpaired whenever any changes are made in either storage representation or access methods.

Rule 9: Logical Data Independence: Application programs and terminal activities remain logically unimpaired when information-preserving changes of any kind that theoretically permit unimpairment are made to the base tables.

Rule 10: Integrity Independence: Integrity constraints specific to a particular relational database must be definable in the relational data sublanguage and storable in the catalog, not in the application programs. A minimum of the following two integrity constraints must be supported:

1. Entity integrity: No component of a primary key is allowed to have a null value.

2. Referential integrity: For each distinct non-null foreign key value in a relational database, there must exist a matching primary key value from the same domain.

Rule 11: Distribution Independence: A relational DBMS has distribution independence. Distribution independence implies that users should not have to be aware of whether a database is distributed.

Rule 12: Non-subversion Rule: If a relational system has a low-level (single-record-at-a-time) language, that low-level language cannot be used to subvert or bypass the integrity rules or constraints expressed in the higher-level (multiple-records-at-a-time) relational language.
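A hedged SQL sketch of the two constraints Rule 10 demands (Orders and Customers are hypothetical tables):

CREATE TABLE Orders (
    order_id    INT NOT NULL,   -- entity integrity: no component of the
    line_no     INT NOT NULL,   -- primary key may be null
    customer_id INT NOT NULL,
    PRIMARY KEY (order_id, line_no),
    -- referential integrity: every non-null foreign key value
    -- must match an existing primary key value
    FOREIGN KEY (customer_id) REFERENCES Customers (customer_id)
);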


Note: There is a rider to these 12 rules known as Rule Zero: "For any system that is claimed to be a relational database management system, that system must be able to manage data entirely through its relational capabilities." On the basis of the above rules, there is no fully relational DBMS available today.

By the mid 1980's there were a number of database vendors who had already made the claim that their DBMS was in reality an RDBMS. Previously known as Relational Software, Oracle Corporation released the first commercially available database management system, packaged with a DBMS language of its own called SQL (Structured Query Language). IBM also pioneered research and development into making the DBMS more of an RDBMS, and into the development of the language which was to become known as SQL. It was not until almost the 1990's that Microsoft joined these two major league RDBMS/DBMS vendors on the field with the SQL Server RDBMS/DBMS product.

Are these products RDBMSs or DBMSs? What is it about an RDBMS that is missing from a DBMS, with the exception of the obvious R for Relational? Some insist it is nothing but a rigorous adherence to Codd's 12 rules, while others insist a non-relational DBMS structure can be "made relational" by the application of specific housekeeping and data integrity exception management subsystems.

From the perspective of IT management, however, whatever DBMS or RDBMS it was that housed the organisational data in production real-time, it was this very ever-changing database of information which needed to be safeguarded with the utmost diligence. The daily backups needed to be performed and checked off, and if they were automated then they needed to be checked. Periodically, restores of such backups needed to be scheduled to test the integrity of the backup and restore process, or media, or device. A backup was useless if it could not be restored.

More importantly, the integrity of the database had to be maintained at the highest possible level. If this duty were not attended to by some responsible party internal or external to the


organisation, then the value of the database as an entity would be suspect to varying degrees. With experience in the field one understands that there are indeed usually an entire series of specific forms of data integrity issues and problems which may arise at any time within any given database, due to a number of different environmental factors. At the top of the list, we see that Rules 2 and 3 may have a direct bearing on data integrity issues if they are not enforced. But it is Rule 10 that is often quoted as the one which is not successfully established in certain instances by default within the DBMS, to make it truly an RDBMS.

As CJ Date points out in his article entitled Business Rules and the Relational Model, the title of Codd's original paper was "Derivability, Redundancy, and Consistency of Relations Stored in Large Data Banks". He goes on to reinforce that the term "consistency of stored relations" from Codd is implicitly the same thing as the term integrity constraints, and to note that another formal description for the set of these things is "the relation predicate".

When all is said and done, however, the RDBMS vendors of today have a far more powerful and robust product than they had a decade ago, and while it is often very interesting examining the theoretical threads of the beginnings of the relational database concept, this article is bound closely to follow the path and perspective of the IT management staff who have used all these evolving products, and seen the decades come and go. For this reason, in terms of the technical evolution of the major RDBMS products available since the 1990's, we are going to skip this era, and move right up to the present day in 2002 and have a look under the bonnet of one of these major RDBMS products, just to see what it does.

Again, let me state clearly that while the academic approach to the failure of RDBMS vendors to supply greater support for integrity constraints, or for a number of technical issues by which their products may not conform 100% to the relational model, is often warranted and heeded by the vendors, my approach has not had the luxury of time.


When we are dealing with a production database, alive and breathing through all forms of organisational processes, we cannot afford the time to wait for the longer-term solutions to such problems. Production database integrity, and by this I mean real live day-to-day production sites, is a concern that needs to be addressed without delay. If you as an IT manager designed a series of data integrity exception alerts to identify the quantitative failure of specific integrity constraints, and had such resolved, such a mechanism might work around potential major problems.

Perhaps the real difference between a DBMS and an RDBMS is similar to the difference between a 6-cylinder car and an 8-cylinder car? If we looked under the bonnet we might be able to tell, so let's choose one of the 3 major RDBMS vendors and have a little look. In this article we have selected to check under the bonnet of the SQL Server product from Microsoft Corporation.

Relational model

The relational model for database management is a database model based on first-order predicate logic, first formulated and proposed in 1969 by Edgar F. Codd.

Diagram of an example database according to the Relational model


In the relational model, related records are linked together with a "key". The purpose of the relational model is to provide a declarative method for specifying data and queries: users directly state what information the database contains and what information they want from it, and let the database management system software take care of describing data structures for storing the data and retrieval procedures for answering queries.

The fundamental assumption of the relational model is that all data is represented as mathematical n-ary relations, an n-ary relation being a subset of the Cartesian product of n domains. In the mathematical model, reasoning about such data is done in two-valued predicate logic, meaning there are two possible evaluations for each proposition: either true or false (and in particular no third value such as unknown, or not applicable, either of which is often associated with the concept of NULL). Some think two-valued logic is an important part of the relational model, while others think a system that uses a form of three-valued logic can still be considered relational. Data are operated upon by means of a relational calculus or relational algebra, these being equivalent in expressive power.
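Stated compactly (a restatement of the assumption above, not additional theory), an n-ary relation R over domains D_1, ..., D_n is simply

\[
R \subseteq D_1 \times D_2 \times \cdots \times D_n
\]

and a proposition such as "tuple t belongs to R" evaluates to either true or false.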


The relational model of data permits the database designer to create a consistent, logical representation of information. Consistency is achieved by including declared constraints in the database design, which is usually referred to as the logical schema. The theory includes a process of database normalization whereby a design with certain desirable properties can be selected from a set of logically equivalent alternatives. The access plans and other implementation and operation details are handled by the DBMS engine, and are not reflected in the logical model. This contrasts with common practice for SQL DBMSs, in which performance tuning often requires changes to the logical model.

The basic relational building block is the domain or data type, usually abbreviated nowadays to type. A tuple is an ordered set of attribute values. An attribute is an ordered pair of attribute name and type name. An attribute value is a specific valid value for the type of the attribute. This can be either a scalar value or a more complex type.

A relation consists of a heading and a body. A heading is a set of attributes. A body (of an n-ary relation) is a set of n-tuples. The heading of the relation is also the heading of each of its tuples.

A relation is defined as a set of n-tuples. In both mathematics and the relational database model, a set is an unordered collection of unique, non-duplicated items, although some DBMSs impose an order on their data. In mathematics, a tuple has an order, and allows for duplication. E.F. Codd originally defined tuples using this mathematical definition. Later, it was one of E.F. Codd's great insights that using attribute names instead of an ordering would be so much more convenient (in general) in a computer language based on relations. This insight is still being used today. Though the concept has changed, the name "tuple" has not. An immediate and important consequence of this distinguishing feature is that in the relational model the Cartesian product becomes commutative.

A table is an accepted visual representation of a relation; a tuple is similar to the concept of a row, but note that in the database language SQL the columns of a table are ordered and a table may contain duplicate rows.


A relvar is a named variable of some specific relation type, to which at all times some relation of that type is assigned, though the relation may contain zero tuples. The basic principle of the relational model is the Information Principle: all information is represented by data values in relations. In accordance with this principle, a relational database is a set of relvars and the result of every query is presented as a relation.

The consistency of a relational database is enforced, not by rules built into the applications that use it, but rather by constraints, declared as part of the logical schema and enforced by the DBMS for all applications. In general, constraints are expressed using relational comparison operators, of which just one, "is subset of" (⊆), is theoretically sufficient. In practice, several useful shorthands are expected to be available, of which the most important are candidate key (really, superkey) and foreign key constraints.

SQL

SQL, often referred to as Structured Query Language, is a programming language designed for managing data in relational database management systems (RDBMS). Originally based upon relational algebra and tuple relational calculus, its scope includes data insert, query, update and delete, schema creation and modification, and data access control. SQL was one of the first commercial languages for Edgar F. Codd's relational model, as described in his influential 1970 paper, "A Relational Model of Data for Large Shared Data Banks". Despite not adhering to the relational model as described by Codd, it became the most widely used database language. Though often described as a declarative language, and to a great extent it is one, SQL also includes procedural elements.

SQL became a standard of the American National Standards Institute (ANSI) in 1986 and of the International Organization for Standardization (ISO) in 1987. Since then the standard has been enhanced several times with added features. However, issues of SQL code portability between major RDBMS products still exist due to lack of full compliance with, or different interpretations of, the standard. Among the reasons mentioned are the large size and incomplete specification of the standard, as well as vendor lock-in.

The SQL language is sub-divided into several language elements, including:


Clauses, which are constituent components of statements and queries. (In some cases, these are optional.)

Expressions, which can produce either scalar values or tables consisting of columns and rows of data.

Predicates, which specify conditions that can be evaluated to SQL three-valued logic (3VL) or Boolean (true/false/unknown) truth values and which are used to limit the effects of statements and queries, or to change program flow.

Queries, which retrieve the data based on specific criteria. This is the most important element of SQL.

Statements, which may have a persistent effect on schemata and data, or which may control transactions, program flow, connections, sessions, or diagnostics.

SQL statements also include the semicolon (";") statement terminator. Though not required on every platform, it is defined as a standard part of the SQL grammar. Insignificant whitespace is generally ignored in SQL statements and queries, making it easier to format SQL code for readability.

Data manipulation

The Data Manipulation Language (DML) is the subset of SQL used to add, update and delete data:

INSERT adds rows (formally tuples) to an existing table, e.g.,:

INSERT INTO My_table (field1, field2, field3) VALUES ('test', 'N', NULL);

UPDATE modifies a set of existing table rows, e.g.,:

UPDATE My_table SET field1 = 'updated value' WHERE field2 = 'N';

DELETE removes existing rows from a table, e.g.,:


DELETE FROM My_table WHERE field2 = 'N';

MERGE is used to combine the data of multiple tables. It combines the INSERT and UPDATE elements. It is defined in the SQL:2003 standard; prior to that, some databases provided similar functionality via different syntax, sometimes called "upsert", as sketched below.
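MERGE syntax varies somewhat by vendor; the following hedged sketch follows the SQL:2003 pattern, with New_rows as a hypothetical source table:

MERGE INTO My_table t
USING New_rows n ON (t.field1 = n.field1)
WHEN MATCHED THEN
    UPDATE SET t.field2 = n.field2
WHEN NOT MATCHED THEN
    INSERT (field1, field2) VALUES (n.field1, n.field2);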

Data definition

The Data Definition Language (DDL) manages table and index structure. The most basic items of DDL are the CREATE, ALTER, RENAME, DROP and TRUNCATE statements:

CREATE creates an object (a table, for example) in the database, e.g.,:

CREATE TABLE My_table (
    my_field1 INT,
    my_field2 VARCHAR(50),
    my_field3 DATE NOT NULL,
    PRIMARY KEY (my_field1, my_field2)
);

ALTER modifies the structure of an existing object in various ways, for example adding a column to an existing table or adding a constraint, e.g.,:

ALTER TABLE My_table ADD my_field4 NUMBER(3) NOT NULL;

TRUNCATE deletes all data from a table in a very fast way, deleting the data inside the table and not the table itself. It usually implies a subsequent COMMIT operation, i.e., it cannot be rolled back.

TRUNCATE TABLE My_table;


DROP deletes an object in the database, usually irretrievably, i.e., it cannot be rolled back, e.g.,:

DROP TABLE My_table;

People who work with databases

1. Systems analyst

A systems analyst researches problems, plans solutions, recommends software and systems, and coordinates development to meet business or other requirements. They will be familiar with multiple varieties of programming languages, operating systems, and computer hardware platforms. Because they often translate user requests into technical specifications, systems analysts are the liaisons between vendors and information technology professionals. They may be responsible for developing cost analyses, design considerations, and implementation timelines. A systems analyst is a person who conducts a study, identifies activities and objectives and determines a procedure to achieve the objectives. Designing and implementing systems to suit organizational needs are the functions of the systems analyst. Their responsibilities include:

Communicate with each prospective database user group in order to understand its information needs and processing needs

Develop a specification of each user group's information and processing needs

Develop a specification integrating the information and processing needs of the user groups

Document the specification

2. Database designers

The database designer defines the tables, indexes, views, constraints, triggers, stored procedures, table spaces or storage parameters, and other database-specific constructs needed to store, retrieve, and delete persistent objects. Their responsibilities include:


Choose appropriate structures to represent the information specified by the systems analysts

Choose appropriate structures to store the information in a normalized manner in order to guarantee integrity and consistency of data

Choose appropriate structures to guarantee an efficient system

Document the database design

3. Application developers

A database developer writes and modifies databases. Data in a database can be reorganized, dispersed, and accessed in a number of ways. Databases are important to companies and organizations because they contain records or files, such as sales transactions, product catalogs and inventories, and customer profiles. Database developers create management systems to provide effective and efficient access to information stored in databases. They provide expertise and guidance in the design, implementation and maintenance of database management systems. Database developers control the design and use of the database, and they also determine the way the filing systems will be organized and accessed. An important part of this work involves implementing and controlling security procedures to protect the database from accidental or intentional damage or loss.

They must be good at communicating not only in computer languages, but with people as well. They not only develop the programs, but write up descriptions about them, prepare manuals and help screens, or explain the new systems in person to users. They might have to adapt the programs once they're in use, to make them more suitable to the users. In any business or office environment it is the database developers who are the authorities on the use of these management systems. This means they may spend a lot of their day answering questions on the correct usage of the database and correcting any glitches that may come up within the system. Their responsibilities can be summarized as follows:

Implement the database design

Implement the application programs to meet the program specifications

Test and debug the database implementation and the application programs

Document the database implementation and the application programs


4. Database administrators

A database administrator (DBA) is a person responsible for the design, implementation, maintenance and repair of an organization's database. They are also known by the titles Database Coordinator or Database Programmer, and the role is closely related to the Database Analyst, Database Modeller, Programmer Analyst, and Systems Manager. The role includes the development and design of database strategies, monitoring and improving database performance and capacity, and planning for future expansion requirements. They may also plan, co-ordinate and implement security measures. A database administrator's duties include:

Schema definition

Storage structure and access method definition

Schema and physical organization modification

Granting user authority to access the database

Specifying integrity constraints

Acting as liaison with users

Monitoring performance and responding to changes in requirements

Database backup and recovery

Database security

5. Database users

Users are differentiated by the way they expect to interact with the system. The different types of users are:

Application programmers

An application programmer is someone who has acquired high-level skills in a variety of computer programming languages. They are able to use these skills to modify and create new software applications or programs. An application programmer is also known as a developer or systems analyst. They interact with the system through DML calls.

Sophisticated users


These users form requests in a database query language, e.g. engineers, scientists, and business analysts who thoroughly familiarize themselves with the facilities of the DBMS in order to implement applications that meet their complex requirements.

Specialized users

These users write specialized database applications that do not fit into the traditional data processing framework. They maintain personal databases by using ready-made program packages that provide easy-to-use menu- or graphics-based interfaces. An example is the user of a tax package that stores a variety of personal financial data for tax purposes. Typically they become very proficient in using a specific software package.

Naive users

These users invoke one of the permanent application programs that have been written previously. Their characteristic is constant querying and updating of the database, using standard types of queries and updates called canned transactions that have been carefully programmed and tested. For example: bank tellers check account balances and post withdrawals and deposits; reservation clerks for airlines, hotels, and car rental companies check availability for a given request and make reservations; clerks at receiving stations for courier mail enter package identifications via bar codes and descriptive information through buttons to update a central database of received and in-transit packages.

Disadvantages of RDBMS
The main problem when using a relational database is the complexity that arises when it is first created. It is absolutely vital that the defined relationships between the tables are correct and that each set of information is linked to its pair. Although less information has to be entered in total than with other databases, making sure every point is set up correctly is a slow process. Furthermore, the relationships can become extremely intricate when a relational database contains more than just two tables.

It was well known from the outset that the relational model did not and could not model some kinds of data. One common data structure that the relational model didn't accommodate, for example, was a hierarchy. You could implement laborious workarounds to accommodate such data within a relational database, but the simple reality was that the relational model never got round to defining hierarchies properly. Insofar as the relational database included a mathematical model of data, it was incomplete.

With the development of object-oriented programming languages in the early 1990s, the limits of the relational model began to become awkwardly visible. What quickly became clear was that the view of data that an OO program had was not compatible with the way an RDBMS stored data. An easy way to explain this problem is this: think of a car. When you have finished using it, you park it in your garage and leave it. When you park it you do not take out the seats, take off the tires, remove the doors, pull the steering wheel off, take out the engine, etc. However, RDBMSs with their normalization processes would often insist on storing a car in exactly that way and then reassembling it for you when you next needed to use it.

The limitations of RDBMSs became even more obvious with the advent of the Internet. RDBMSs had no way of representing web pages, which were, by and large, collections of data and mark-up tags. So web pages were stored as character strings by RDBMSs and their inner structure was buried.
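The hierarchy complaint above is easy to make concrete. A hedged sketch of the usual workaround (table and column names invented): the tree is flattened into a relation in which every row points at its parent, and even reading two levels back out requires a self-join.

CREATE TABLE Part (
    part_id   INT PRIMARY KEY,
    parent_id INT REFERENCES Part(part_id),  -- NULL marks the root of the tree
    part_name VARCHAR(50)
);

SELECT child.part_name, parent.part_name AS assembled_into
FROM Part child LEFT JOIN Part parent ON child.parent_id = parent.part_id;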


Multimedia Database Management Systems


INTRODUCTION

Due to the heterogeneous nature of multimedia data, systems intended to store, transport, display and in general manage such data must have considerably more capabilities than conventional information-management systems. In this paper, we elaborate on issues pertaining to multimedia database management and discuss the technical challenges involved in developing a general-purpose multimedia DBMS. We can view these issues as specification requirements for this DBMS and focus our discussion on a reference architecture for such a system. A list of prominent issues in multimedia database management follows.

Development of formal semantic modelling techniques for multimedia information, especially for video and image data. These models should be rich in capabilities for abstracting multimedia information and capturing semantics. They should be able to provide canonical representations of complex images, scenes, and events in terms of objects and their spatio-temporal behavior. These models need to be compared and evaluated.

Design of powerful indexing, searching, and organization methods for multimedia data. Search in multimedia databases can be computationally intensive, especially if content-based retrieval is needed for image and video data stored in compressed or uncompressed form.

Development of models for specifying the media synchronization/integration requirements. Integration of these models with the monomedia database schema will be required. Subsequently, in order to determine the synchronization requirements at retrieval time, transformation of these models into a metaschema is needed. This entails designing object-retrieval algorithms for the operating systems. Similarly, integration of these models with higher-level information abstractions such as hypermedia or object-oriented models may be required.

Designing formal multimedia query languages. These languages should have strong capabilities to express arbitrarily complex semantic and spatio-temporal schemas associated with composite multimedia information. They must support manipulation of content-based functions for multimedia objects.

Development of efficient data placement schemes for physical storage management. These schemes are needed to manage real-time multimedia data, for both single and parallel disk systems.

Design and development of suitable architecture and operating system support. The heterogeneity of multimedia information dictates that the architecture of a general-purpose multimedia database management system must support a rich set of data-management and computational functionalities. The operating system must also support the real-time requirements of multimedia data.

Management of distributed multimedia databases. In a networked environment, extensive coordination and management capabilities are needed among the distributed sites to provide location-transparent access and support real-time delivery of data to distributed users.

A REFERENCE ARCHITECTURE FOR MULTIMEDIA DBMS

Many types of real-world knowledge can be represented by describing the interplay among objects (persons, buildings, events, etc.) in the course of time and their relationships in space. An application may need to store and access information about this knowledge that can be expressed by complex spatio-temporal logic. A video database is a typical replica of this real-world environment [Day et al. 1995]. The spatio-temporal specification process is reversed while composing multimedia information documents [Little and Ghafoor 1993; Iino et al. 1994]. In this case, a user synthetically creates interplay among various monomedia objects (akin to physical objects), both in space and time. In multimedia databases these objects may represent individual data entities that serve as components of some multimedia document [Little and Ghafoor 1993; Stotts and Furuta 1989]. Furthermore, these entities/documents can be grouped together for efficient management and access. In either case, it is essential that the user be able to identify and address different objects and be able to express their relations in time and space. These relations should be representable in a suitable structure that is powerful enough to specify higher-level contents and semantic abstractions. It is, therefore, desirable that a general framework for spatio-temporal and semantic modeling be available. Such a framework can provide a reasonable approach to addressing the issues discussed in the previous section and can be used in designing a general-purpose multimedia database-management system.

Various conceptual models for multimedia information have been proposed in the literature [Little and Ghafoor 1993; Stotts and Furuta 1989]. These models are either aimed at the synchronization aspects of the multimedia data or are concerned with the browsing aspects of information.


Irrespective of the media type, it is imperative that a model be able to generate a clear specification of the metaschema, and it must be integrable with the underlying data models of the various monomedia. Such a model must also facilitate the efficient development of high-level user interfaces. From this discussion, we can conceive a reference architecture for a general-purpose multimedia DBMS as shown in Figure 1(a). The architecture consists of three layers: a monomedia database management layer, a multimedia composition/management layer, and a user interface layer. We now describe the functionalities of each layer and discuss their role in managing the issues mentioned in the previous section.

The monomedia database-management layer provides the functionalities essential for managing individual media. The key aspects of each DBMS at this level include maintaining efficient indexing mechanisms and allowing users to develop semantics-based modeling and grouping of complex information associated with each medium. In this respect, this layer deals with the first two issues discussed in the previous section. The primary objective is to process content-based queries and facilitate retrieval of appropriate pieces of monomedia data, such as video clips, parts of an image, or audio segments. For example, for video data, abstraction hierarchies in space and time, as shown in Figure 1(b), are needed in order to express events and allow content-based indexing and retrieval. Spatio-temporal logic can provide a formal framework for representing such events and building these hierarchical abstractions [Day et al. 1995]. Subsequently this process can lead to a schema definition for each monomedia database. Such a schema may also contain some unconventional information, for example the quality of service (QoS) parameters for presentation, such as speed, volume, resolution, delay bounds, and the like. The objective is to allow retrieval of monomedia data with the desired quality, if possible. Accordingly, a suitable database language is needed at this level to manipulate hierarchical abstractions and to query about events of interest. It is important to mention that considerable computational demand can be placed on this layer, especially for image and video data, if a fully automated system is desired that is capable of generating content-based indices based on object recognition [Gudivada and Raghavan 1995]. Such computation can span a broad spectrum of effective and robust computer-vision and image-processing algorithms that are intermingled with data and entities at the various levels of abstraction of Figure 1(b).
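To make the idea of content-based queries with QoS parameters concrete, here is a heavily hedged relational sketch (the schema is invented; a real monomedia layer would be far richer):

-- Video clips carry QoS presentation parameters...
CREATE TABLE Video_clip (
    clip_id    INT PRIMARY KEY,
    title      VARCHAR(100),
    frame_rate INT             -- QoS parameter, frames per second
);

-- ...and semantic annotations tied to temporal positions
CREATE TABLE Clip_event (
    clip_id   INT REFERENCES Video_clip(clip_id),
    event     VARCHAR(50),     -- e.g. 'goal scored'
    start_sec INT,
    end_sec   INT
);

-- A content-based query with a QoS condition:
SELECT v.clip_id, v.title, e.start_sec
FROM Video_clip v JOIN Clip_event e ON v.clip_id = e.clip_id
WHERE e.event = 'goal scored' AND v.frame_rate >= 25;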

Another important function incorporated within this layer is the physical management of the individual databases. Such management may require, as mentioned previously, efficient placement of data on a single or parallel disk system. For example, due to the time-variant characteristics of compressed video, it is difficult to predict disk seek latencies. A poor data-placement strategy can result in significant degradation in quality due to excessive dropping of video frames during multiple concurrent sessions.

The primary objective of the middle layer is to deal with the third issue, the integration of monomedia to compose multimedia documents and to crosslink information stored in the monomedia databases. There is a growing demand for management of multimedia documents and libraries, and the need for efficient integration models is one of the key research issues in developing a general-purpose multimedia DBMS. Integration of media can span multiple dimensions including space, time, and semantics. Therefore, this layer needs to maintain a multidimensional monomedia integration model, in the form of a meta-schema, along with the QoS parameters associated with each medium. Basically, the layer processes users' queries for composite multimedia information and generates appropriate subqueries for the monomedia databases. Retrieval of the monomedia data is controlled by the integration model maintained by the meta-schema [Little and Ghafoor 1993]. It is important to note that, for the purpose of consistency, this meta-schema needs to be interfaced with the schemas associated with the monomedia databases present at the lower layer, because both schemas share information about the monomedia, including their content-based semantics and QoS parameters. Another important function of this layer is to provide coordination among the monomedia databases if they are distributed. As mentioned before, one of the major objectives in this case is to provide location-transparent access to the different database sites and maintain synchronization among media streams originating from heterogeneous hosts. From the database point of view, information about the location of the various objects of composite multimedia and their schema must be maintained by this layer.

The interactive layer consists of various user-interface facilities that support multimedia presentation functionalities such as display of images, playout of video clips or audio segments, and the like. These interactive facilities may require some formal query language that can be identical to that used at the bottom layer, with some enhanced capabilities to manipulate composite multimedia information. Alternatively, some graphical query interface can be used for that purpose. Additional capabilities may include browsing, media editing, and so on.
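To make the layering concrete, here is a minimal Python sketch of how a composition (middle) layer might fan a composite query out as subqueries to the monomedia databases beneath it. All class, method, and field names are invented for illustration; a real MMDBMS would add meta-schema consistency checks, QoS negotiation, and stream synchronization.

```python
# Toy model of the three-layer reference architecture: monomedia DBMSs
# at the bottom, a composition layer holding the integration model in
# the middle, and a thin user-facing call on top. Names are illustrative.

class MonomediaDB:
    """Bottom layer: one database per medium (video, audio, image, ...)."""
    def __init__(self, medium, items):
        self.medium = medium
        self.items = items  # dicts carrying content annotations and QoS info

    def query(self, **conditions):
        # Content-based selection: items matching every given condition.
        return [it for it in self.items
                if all(it.get(k) == v for k, v in conditions.items())]

class CompositionLayer:
    """Middle layer: decomposes composite queries into monomedia subqueries."""
    def __init__(self, databases):
        self.databases = {db.medium: db for db in databases}

    def composite_query(self, subqueries):
        # One subquery per medium; the results are the pieces from which
        # a multimedia document would be assembled and synchronized.
        return {medium: self.databases[medium].query(**conds)
                for medium, conds in subqueries.items()}

video = MonomediaDB("video", [{"id": "v1", "event": "goal", "fps": 30}])
audio = MonomediaDB("audio", [{"id": "a1", "event": "goal"}])
layer = CompositionLayer([video, audio])

# The user-interface layer would render this; here we just print it.
print(layer.composite_query({"video": {"event": "goal"},
                             "audio": {"event": "goal"}}))
```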

DEVELOPMENT CONSIDERATIONS FOR THE REFERENCE ARCHITECTURE

The development of a general-purpose multimedia database system, including the design of a multilevel meta-schema, the management of a large number of indices and physical databases, as well as interfacing with the operating system for real-time retrieval, can be a technologically daunting task. In this section, we focus on two development issues that are unique to multimedia databases and are crucial for realizing the reference architecture of Figure 1(a): spatiotemporal modeling of monomedia/composite multimedia, and the suitability of two well-known data-modeling paradigms, relational versus object-oriented, for developing a general-purpose multimedia DBMS.

A number of attempts have been made to develop synchronization models for multimedia information. HyTime, one such model, has been recommended as an ISO standard. However, this model currently suffers from the drawback that the extraction of various spatiotemporal and content semantics from it can be quite cumbersome. On the other hand, the Petri-net-based model proposed in Little and Ghafoor [1993] not only allows extraction of the desired semantics and generation of a database schema in a rather straightforward manner, but also has the additional advantage of pictorially illustrating the synchronization aspects of the information. In this regard this model is unique, and it is thus also well suited for the third layer of the reference architecture, where visual orchestration of multimedia information can be highly desirable. The model uses a set of generalized temporal operators (a small sketch of the interval relations underlying such operators appears at the end of this section). An expanded version of the model also allows specification of the spatial semantics of information [Iino et al. 1994]. The notion of generalized spatiotemporal relations has recently been extended to specify complex events in video data, a process that is almost a reversal of the multimedia composition process [Day et al. 1995]. Numerous extensions to this model have been proposed in the literature, including provision for user interaction, modeling of distributed object synchronization, development of synchronization protocols for communication, and the like. In summary, the Petri-net-based model in Little and Ghafoor [1993] can be a good candidate for spatiotemporal modeling and generation of meta-schemas for the reference architecture.

We now discuss the second development issue. As the application domain for a given multimedia application can be highly complex and ill-structured, both relational calculus and semantics-based object-oriented approaches need to be scrutinized. We can evaluate these approaches on two important criteria:

their ability to express the semantics associated with hierarchical abstractions, which need to be managed by almost all the monomedia DBMSs of the bottom layer of Figure 1(a), and their expressive power for the specification of the meta-schema for media integration, which is managed by the middle layer. Hierarchical abstractions can result in considerable semantic heterogeneity, which has been a difficult problem for relational database models. On the other hand, semantics-based object-oriented models can provide a richer set of abstractions that are particularly useful for users to extract/define views of information at the various levels of these hierarchies [Day et al. 1995]. From the media integration and composition point of view, relational technology again provides no elegant mechanisms for maintaining the complex logical structures associated with composite multimedia objects. Although some relational DBMSs support access to multimedia objects by using pointers to BLOBs (binary large objects), there is no provision for interactively accessing various portions of these objects, because a BLOB is treated as a single entity in its entirety. It has been demonstrated that object-oriented technology can provide a powerful paradigm to meet the requirements of multimedia composition and of developing semantic models [Day et al. 1995; Iino et al. 1994]. Its data and computational encapsulation features provide effective mechanisms for media synchronization.
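As promised above, here is a small hedged sketch of the Allen-style binary interval relations on which generalized temporal operators of this kind build. The function names and the toy events are our own illustration, not the notation of Little and Ghafoor [1993].

```python
# Binary temporal relations over intervals, the building blocks of
# generalized temporal operators. Intervals are (start, end) pairs on a
# common timeline; the event names below are invented for illustration.

def before(a, b):
    return a[1] < b[0]           # a ends before b starts

def meets(a, b):
    return a[1] == b[0]          # a ends exactly when b starts

def overlaps(a, b):
    return a[0] < b[0] < a[1] < b[1]

def during(a, b):
    return b[0] < a[0] and a[1] < b[1]

# Two annotated video events, in seconds from the start of a clip:
free_kick, goal = (10, 14), (13, 20)
print(overlaps(free_kick, goal))   # True: the kick overlaps the goal event
print(before(goal, (25, 30)))      # True: the goal ends before t = 25
```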

Multimedia Database Systems: Where are we now?

Introduction

A Multimedia Database Management System (MMDBMS) must support multimedia data types in addition to providing facilities for traditional DBMS functions like database creation, data modeling, data retrieval, data access and organization, and data independence. The area and its applications have experienced tremendous growth.

Especially with the rapid development of network technology, multimedia database systems have developed enormously, and multimedia information exchange has become very important. This section reviews the history and the current state of the art in MMDBMSs.

The first wave

The first MMDBMSs relied mainly on the operating system for storing and querying files. These were ad-hoc systems that served mostly as repositories. The mid-90s saw a first wave of commercial, from-the-scratch, and full-fledged MMDBMSs. Some of them were MediaDB, now MediaWay [1], JASMINE [2], and ITASCA, the commercial successor of ORION [3]. They were all able to handle diverse kinds of data and provided mechanisms for querying, retrieving, inserting, and updating data. Most of these products disappeared from the market after some years of existence, and only some of them continued and adapted themselves successfully to hardware and software advances as well as to application changes. For instance, MediaWay provided, early on, very specific support for a wide variety of media types. Specifically, different media file formats, varying from images and video to PowerPoint documents, can be managed, segmented, linked and searched.

Designing MMDBs

Many inherent characteristics of multimedia data have direct and indirect impacts on the design of MMDBs. These include the huge size of MMDBs, their temporal nature, richness of content, complexity of representation, and subjective interpretation. The major challenges in designing multimedia databases arise from several requirements they need to satisfy, such as the following:

1. Manage different types of input, output, and storage devices. Data input can be from a variety of devices such as scanners and digital cameras for images, microphones and MIDI devices for audio, and video cameras. Typical output devices are high-resolution monitors for images and video, and speakers for audio.

2. Handle a variety of data compression and storage formats. The data encoding has a variety of formats even within a single application. For instance, in medical applications, MRI images of the brain require lossless or very stringent lossy coding techniques, while X-ray images of bones can be less stringent. Also, radiological image data, ECG data, other patient data, etc. have widely varying formats.

3. Support different computing platforms and operating systems. Different users operate computers and devices suited to their needs and tastes, but they need the same kind of user-level view of the database.

4. Integrate different data models. Some data, such as numeric and textual data, are best handled using a relational database model, while others, such as video documents, are better handled using an object-oriented database model. So these two models should coexist in MMDBs.

5. Offer a variety of user-friendly query systems suited to different kinds of media. From a user's point of view, easy-to-use queries and fast, accurate retrieval of information are highly desirable. The query for the same item can be in different forms. For example, a portion of interest in a video can be queried by using either 1) a few sample video frames as an example, 2) a clip of the corresponding audio track, or 3) a textual description using keywords.

6. Handle different kinds of indices. The inexact and subjective nature of multimedia data has rendered the keyword-based indices and exact and range searches used in traditional databases ineffective. For example, the retrieval of records of persons based on social security number is precisely defined, but the retrieval of records of persons having certain facial features from a database of facial images requires content-based queries and similarity-based retrievals. This requires indices that are content dependent, in addition to keyword indices.

7. Develop measures of data similarity that correspond well with perceptual similarity. Measures of similarity for different media types need to be quantified to correspond well with the perceptual similarity of objects of those data types. These need to be incorporated into the search process.

8. Provide a transparent view of geographically distributed data. MMDBs are likely to be distributed in nature. The media data may reside in many different storage units, possibly spread out geographically. This is partly due to the changing nature of computation and computing resources, from centralized to networked and distributed.

9. Adhere to real-time constraints for the transmission of media data. Video and audio are inherently temporal in nature. For example, the frames of a video need to be presented at a rate of at least 30 frames/sec for the eye to perceive continuity in the video.

10. Synchronize different media types while presenting to the user. It is likely that different media types corresponding to a single multimedia object are stored in different formats, on different devices, and have different rates of transfer. Thus they need to be periodically synchronized for presentation.

The recent growth in using multimedia data in applications has been phenomenal. Multimedia databases are essential for efficient management and effective use of huge amounts of data. The diversity of applications using multimedia data, the rapidly changing technology, and the inherent complexities in the semantic representation, interpretation and comparison for similarity pose many challenges. MMDBs are still in their infancy. Today's MMDBs are closely bound to narrow application areas. The experiences acquired from developing and using novel multimedia applications will help advance multimedia database technology.

Benefits of using Multimedia Database

The incorporation of multimedia database systems will improve the quantity and quality of information manipulated by computer users in all fields, including computer-aided design and information retrieval. The area of intelligent multimedia content analysis and retrieval techniques is an emerging discipline. Techniques for representing and extracting semantic information from media such as speech, images, and video are required. When a multimedia application lacks a database, the data structure is buried in the script, where all of its value is lost. This omission also makes the script more complicated and less flexible. Using a multimedia database makes the data structure logic available to other multimedia applications and simplifies the script so that many scripts can share the same multimedia metadata. In addition, when a multimedia or abstract-data database is organized and annotated for one application, other applications can use those annotations without going through the same time-consuming process. This capability adds great value to the data through reuse and controlled redundancy. When multimedia application content is controlled by the multimedia database, multimedia content can be added, deleted, or modified without modifying the application script. For example, interactive kiosks that display, describe, and demonstrate products can be updated automatically without reprogramming the application script. Furthermore, a

multimedia application such as a multimedia textbook can actually control the operation of book topics that have the same look and feel. This control lets the script perform as a template: an entire series of math textbooks (algebra, calculus, trigonometry, and geometry), including text and video, can use the same multimedia application because all data is physically separate.

Search and retrieval operations are critical in interactive multimedia applications; they must be equally efficient and powerful. Search and retrieval of multimedia and abstract data is challenging, but multimedia databases make it feasible through internal storage format flexibility and efficient operation. The DBMS should have significant knowledge about the data and its structure to enable powerful semantic optimizations and intelligent searches. Search and retrieval operations also give the application access to media components so that they can be dynamically and seamlessly processed when necessary.

Features of Multimedia Database

Query in Multimedia DBMS

In a DBMS, declarative user queries are processed by defining a query language as part of the DBMS. It is known that the query facility is one of the most important parts of a DBMS. Since queries get used by both naive and expert users, it is critical to use effective query languages in order to promote user friendliness. Multimedia query languages must deal with the complex spatial and temporal relationships inherent in the wide range of multimedia data types. Powerful query languages could help manipulate a multimedia DBMS and maintain the desired independence between the database and the application. Due to the complexity of multimedia data, we need a query language that can support semantic data retrieval, which is done by using keywords, indexes on keywords, and the contents of multimedia objects. However, the query language of a traditional DBMS deals only with exact key-match queries on data types, which means it is sufficient only for queries posed against the metadata and annotations of multimedia data. There are two types of queries that are used in a database system. The first is called a well-defined query, in which the properties of objects are represented by a well-defined set of labels and the conditional operators are also well defined. In a well-defined query, the user must have exact knowledge of the underlying database and the desired query result. This query only works with exact matches.

The second type of query is called the fuzzy query, in which the properties of the query objects are unclear or the comparison operators in the query do not give exact matches. This results in sets of different possible answers. (A short sketch contrasting the two query types follows the note on charts below.)

Charts and Graphs

A multimedia database supports all types of charts and graphs, which is considered a great feature of an MDBS. Creating those charts and graphs requires special queries, which are created using SQL+D.
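The following minimal Python sketch contrasts the two query types described above. The records and the string-similarity measure are illustrative assumptions only; a real multimedia DBMS would compute similarity over media features rather than text labels.

```python
# Well-defined (exact-match) query versus fuzzy (similarity-ranked) query.
from difflib import SequenceMatcher

records = [{"id": 1, "label": "sunset over sea"},
           {"id": 2, "label": "sunrise over sea"},
           {"id": 3, "label": "mountain lake"}]

def exact_query(label):
    # Well-defined query: the label must match exactly.
    return [r for r in records if r["label"] == label]

def fuzzy_query(label, threshold=0.6):
    # Fuzzy query: rank records by similarity to the request and keep
    # everything above a threshold, yielding a set of possible answers.
    scored = [(SequenceMatcher(None, r["label"], label).ratio(), r)
              for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [r for score, r in scored if score >= threshold]

print(exact_query("sunset over sea"))      # one exact hit
print(fuzzy_query("sunset over the sea"))  # ranked approximate hits
```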

Multimedia Presentation

The terms multimedia document and multimedia (or temporal) presentation have been used interchangeably in the literature. A multimedia database can store large amounts of media such as audio and video, all of which are also used in the presentation process. An MDBS also stores multimedia documents and presentations.

Applications of MMDB

Data Streaming and E-mail Technologies

The use of streaming video and audio media is becoming an increasingly desirable feature in network technologies. As networks become more saturated with traffic, the timely delivery of latency-sensitive data, such as the data generated by video conferencing and Internet Protocol (IP) telephony programs, becomes more problematic. ATM and QoS are two technologies that can be used to facilitate the timely delivery of latency-sensitive data. POP3 is an e-mail protocol that can be used to facilitate mail delivery to computers that are not part of a network or that do not have an e-mail server on the local network.

Interactive video technology

A few interactive video technologies have been developed in the last few years that utilize a new way of encoding videos, allowing users to click on any person, place, or object in the video. Various existing trends in the usage of the technology include:
Customizable online interactive videos
Conversational online interactive videos
Exploratory online interactive videos
Interactive video in early computer games
Interactive video in cinema
Interactive video on YouTube
Interactive video art
Interactive video in cell phone gaming
Interactive video in VJing
Hypervideo

Interactive Video ads

These are designed to meet the marketing needs of advertisers in a wide variety of industries, including Entertainment, CPG, Automotive, Retail, Telecommunications, and Travel, with more categories being added on an ongoing basis. Some of the flexible features and functionality:
Multiple videos within a single ad unit, including interactive video, augmented reality, and dynamic content
Social media integration and RSS feeds
In-player video games, polling, surveys, and quizzes
Localized geographic data, store location, and mapping
Product features/carousel
Coupons and click-to-schedule/purchase

Digital library

A digital library is a library in which collections are stored in digital formats (as opposed to print, microform, or other media) and are accessible by computers.

The digital content may be stored locally, or accessed remotely via computer networks. A digital library is a type of information retrieval system. A distinction is often made between content that was created in a digital format, known as born-digital, and information that has been converted from a physical medium, e.g., paper, by digitizing. The term hybrid library is sometimes used for libraries that have both physical and digital collections. Large-scale digitization projects are underway at Google, the Million Book Project, and the Internet Archive. With continued improvements in book handling and presentation technologies such as optical character recognition and ebooks, and the development of alternative depositories and business models, digital libraries are rapidly growing in popularity, as demonstrated by the efforts of Google, Yahoo!, and MSN. Just as libraries have ventured into audio and video collections, so have digital libraries such as the Internet Archive.

According to Larry Lannom, Director of Information Management Technology at the nonprofit Corporation for National Research Initiatives, "all the problems associated with digital libraries are wrapped up in archiving." He goes on to state, "If in 100 years people can still read your article, we'll have solved the problem." Daniel Akst, author of The Webster Chronicle, proposes that "the future of libraries, and of information, is digital." Peter Lyman and Hal Varian, information scientists at the University of California, Berkeley, estimate that "the world's total yearly production of print, film, optical, and magnetic content would require roughly 1.5 billion gigabytes of storage." Therefore, they believe that "soon it will be technologically possible for an average person to access virtually all recorded information."

Most digital libraries provide a search interface which allows resources to be found. These resources are typically deep web (or invisible web) resources, since they frequently cannot be located by search engine crawlers. Some digital libraries create special pages or sitemaps to allow search engines to find all their resources. Digital libraries frequently use the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH) to expose their metadata to other digital libraries, and search engines like Google Scholar, Yahoo! and Scirus can also use OAI-PMH to find these deep web resources.[9] There are two general strategies for searching a federation of digital libraries:

1. distributed searching, and
2. searching previously harvested metadata.

Distributed searching typically involves a client sending multiple search requests in parallel to a number of servers in the federation (sketched below). The results are gathered, duplicates are eliminated or clustered, and the remaining items are sorted and presented back to the client. Protocols like Z39.50 are frequently used in distributed searching. A benefit of this approach is that the resource-intensive tasks of indexing and storage are left to the respective servers in the federation. A drawback is that the search mechanism is limited by the differing indexing and ranking capabilities of each database, making it difficult to assemble a combined result consisting of the most relevant items found.

Searching over previously harvested metadata involves searching a locally stored index of information that has previously been collected from the libraries in the federation. When a search is performed, the search mechanism does not need to make connections with the digital libraries it is searching - it already has a local representation of the information. This approach requires the creation of an indexing and harvesting mechanism which operates regularly, connecting to all the digital libraries and querying the whole collection in order to discover new and updated resources. OAI-PMH is frequently used by digital libraries for allowing metadata to be harvested. A benefit of this approach is that the search mechanism has full control over indexing and ranking algorithms, possibly allowing more consistent results. A drawback is that harvesting and indexing systems are more resource-intensive and therefore expensive.
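As a concrete illustration of the distributed-searching strategy, here is a small Python sketch that fans a query out to simulated sites, deduplicates the merged hits, and ranks the result. The sites, documents, and scores are invented; a real federation would speak a protocol such as Z39.50 over the network.

```python
# Federated search sketch: fan out, merge, deduplicate, rank.

def search_site(site, term):
    # Each "site" returns (doc_id, score) pairs whose text matches the term.
    return [(doc_id, score) for doc_id, text, score in site if term in text]

site_a = [("doc1", "digital library history", 0.9),
          ("doc2", "library metadata", 0.7)]
site_b = [("doc1", "digital library history", 0.8),   # duplicate of doc1
          ("doc3", "metadata harvesting", 0.6)]

def federated_search(sites, term):
    best = {}
    for site in sites:
        for doc_id, score in search_site(site, term):
            # Deduplicate: keep the best-scoring copy of each document.
            best[doc_id] = max(score, best.get(doc_id, 0.0))
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

print(federated_search([site_a, site_b], "library"))
```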

Advantages of a digital library

The advantages of digital libraries as a means of easily and rapidly accessing books, archives and images of various types are now widely recognized by commercial interests and public bodies alike.[14] Traditional libraries are limited by storage space; digital libraries have the potential to store much more information, simply because digital information requires very little physical space to contain it. As such, the cost of maintaining a digital library is much lower than that of a traditional library. A traditional library must spend large sums of money paying for staff, book maintenance, rent, and additional books. Digital libraries may reduce or, in some instances, do away with these fees. Both types of library require cataloguing input to allow users to locate and retrieve material. Digital libraries may be more willing to adopt innovations in technology, providing users with improvements in electronic and audio book technology as well as presenting new forms of communication such as wikis and blogs; conventional libraries may consider that providing online access to their OPAC catalogue is sufficient. An important advantage of digital conversion is increased accessibility to users. Digital libraries also increase availability to individuals who may not be traditional patrons of a library, due to geographic location or organizational affiliation.

No physical boundary: The user of a digital library need not go to the library physically; people from all over the world can gain access to the same information, as long as an Internet connection is available.
Round-the-clock availability: A major advantage of digital libraries is that people can gain access to the information 24/7.
Multiple access: The same resources can be used simultaneously by a number of institutions and patrons. This may not be the case for copyrighted material: a library may have a license for "lending out" only one copy at a time; this is achieved with a system of digital rights management where a resource can become inaccessible after expiration of the lending period or after the lender chooses to make it inaccessible (equivalent to returning the resource).
Information retrieval: The user is able to use any search term (word, phrase, title, name, subject) to search the entire collection. Digital libraries can provide very user-friendly interfaces, giving clickable access to their resources.
Preservation and conservation: Digitization is not a long-term preservation solution for physical collections, but it does succeed in providing access copies for materials that would otherwise fall to degradation from repeated use.
Space: Traditional libraries are limited by storage space; digital libraries have the potential to store much more information, simply because digital information requires very little physical space to contain it and media storage technologies are more affordable than ever before.
Added value: Certain characteristics of objects, primarily the quality of images, may be improved. Digitization can enhance legibility and remove visible flaws such as stains and discoloration.
Easy accessibility.

Image retrieval

An image retrieval system is a computer system for browsing, searching and retrieving images from a large database of digital images. Most traditional and common methods of image retrieval utilize some method of adding metadata such as captioning, keywords, or descriptions to the images so that retrieval can be performed over the annotation words. Manual image annotation is time-consuming, laborious and expensive; to address this, there has been a large amount of research done on automatic image annotation. Additionally, the increase in social web applications and the semantic web have inspired the development of several web-based image annotation tools. Image search is a specialized data search used to find images. To search for images, a user may provide query terms such as a keyword or an image file/link, or click on some image, and the system will return images "similar" to the query. The similarity used for the search criteria could be meta tags, color distribution in images, region/shape attributes, etc. Two broad approaches exist:
Image meta search: search of images based on associated metadata such as keywords, text, etc.
Content-based image retrieval (CBIR): the application of computer vision to image retrieval.
(A small sketch of similarity-based retrieval follows the POP3 overview below.)

POP3

The Post Office Protocol version 3 (POP3) service enables a server to host e-mail accounts for subscribers of Internet service providers (ISPs) or for small businesses. The most common use is for home ISP clients or small businesses that cannot locally implement and support a mail server, Simple Mail Transfer Protocol (SMTP) service, and related mail-handling infrastructure. The POP3 protocol is used to retrieve mail that is being held by a mail server that offers SMTP service. The POP3 service performs the tasks of downloading messages, moving the messages from a folder in the file system of a mail server to clients on small networks, or to stand-alone home computers. Typically, after mail is downloaded from the mail server, the mail server deletes the mail from its folder.
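The following toy Python sketch shows the CBIR idea referred to above: images reduced to normalized color histograms, compared by histogram intersection, and ranked by similarity rather than exact match. The file names and histogram values are made up.

```python
# Content-based image retrieval sketch: query-by-example over
# normalized 4-bin color histograms (toy stand-ins for real features).

def histogram_intersection(h1, h2):
    # Similarity in [0, 1]: overlap between two color distributions.
    return sum(min(a, b) for a, b in zip(h1, h2)) / sum(h1)

database = {
    "beach.jpg":  [0.10, 0.20, 0.30, 0.40],
    "forest.jpg": [0.05, 0.60, 0.30, 0.05],
    "desert.jpg": [0.15, 0.15, 0.30, 0.40],
}

def query_by_example(query_hist, top_k=2):
    ranked = sorted(database.items(),
                    key=lambda kv: histogram_intersection(kv[1], query_hist),
                    reverse=True)
    return ranked[:top_k]

# "Find images similar to this one" rather than an exact keyword match:
print(query_by_example([0.12, 0.18, 0.30, 0.40]))
```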

Recent Developments

News-On-Demand

The Canadian Institute for Telecommunications Research (CITR) Broadband Services group studies and prototypes enabling technologies for distributed multimedia applications using an object-oriented multimedia database system. Such applications include multimedia news, distance learning, and home shopping. A successful product of their project is the News-on-Demand application. This application incorporates a distributed database server storing up-to-date news items from various sources, such as TV, radio, newspapers, and magazines. News-on-Demand allows a subscriber to retrieve one or more news items using an article query interface. Various news items from different sources are annotated and organized into multimedia documents by service providers. News-on-Demand uses a distributed service where a client can access news over the broadband network.

Video-On-Demand

Multimedia technologies are attracting more and more interest every day. Video-on-Demand is one of the buzzwords today and is now available to the public. Content providers such as publishers, broadcasting companies and audio/video production firms must be able to archive and index their productions for later retrieval. This is a formidable task, and it is even more so when the material to be handled encompasses several media types and covers a time span of several years. In order for such a vast amount of data to be easily available, existing multimedia database design models, indexing and retrieval methodologies and delivery methods have to be improved and refined. In addition, video, image and audio data cannot be effectively managed with the exclusive use of older, keyword-based search techniques.

KMeD

The UCLA Knowledge-Based Multimedia Medical Distributed Database (KMeD) project is a joint project between the Computer Science Department and the Radiological Science Department.

The KMeD project has five major objectives:
1. Query medical multimedia distributed databases by image, alphabetical, and numerical content.
2. Model the temporal, spatial, and evolutionary nature of medical objects.
3. Formulate queries using conceptual and imprecise medical terms to support cooperative processing.
4. Develop a domain-independent, high-level query language and a medical domain user interface to support KMeD functionality.
5. Provide analysis and presentation methods for visualization of knowledge and data models.

The KMeD database is represented by features and objects. The selected objects of interest in medical images such as X-ray and MRI images are segmented using knowledge-based model-guiding techniques. As with other multimedia database systems, features and contents of the medical image are extracted and stored in a feature and content database. In KMeD, Type Abstraction Hierarchies (TAHs) are used to represent the knowledge about the image in three structures. A query process supports operators such as "similar to" and "nearby" and conceptual terms such as "small" and "large" to find approximate matches to the features and contents. A technique has been developed for a visual interface using point-click-and-drag input. This new development by UCLA, namely the KMeD database, can change the way hospitals maintain and utilize all of their data.

EtherMed

EtherMed is an experimental database of Internet-accessible multimedia courseware in health professions education. EtherMed contains records of educational materials that are freely accessible on the web and links to the actual courseware. EtherMed records are descriptive and non-evaluative. EtherMed is an experiment in collaborative database development and maintenance. OHPCC staff and collaborating consultants at distant locations can directly enter, modify, and delete records online. The aim is to determine whether distributed management of database resources is possible and can become self-sustaining, and to determine how well database approaches compare to those employing search engines. Online videoconferencing and other collaboration tools are used to support this effort. ImageMed is an experimental system that uses:

(1) the Unified Medical Language System (UMLS) Meta-thesaurus for indexing and retrieval of medical images, and
(2) a distributed image database.

Future Work

Future requirements of distributed multimedia systems will be even more demanding than they are now. It is envisaged that users will be heavily mobile and will require ubiquitous access to, and satisfactory presentation of, multimedia data, regardless of the actual connectivity and the specific presentation device they are currently using (e.g., wire-based vs. wireless network, high-resolution graphics screen vs. Web-enabled cellular phone). Moreover, users will expect to find information and multimedia contents faster and more easily, and will interact with the contents much more intensively than today.

In this scope, the CODAC project and its sister project aim at realizing a quality-adaptive end-to-end multimedia system; i.e., they shall provide means for indexing and retrieving multimedia data by their content and adaptation capabilities, and develop methods to guarantee quality-adaptive video transport to the client. (The architectural view of this end-to-end multimedia system is depicted in a figure not reproduced here.) To realize this, a number of activities need to be carried out; the following are the first work items under development in the CODAC project:

1. Development of a multimedia cartridge in the core of an Oracle 9i DBMS: The advantages of the cartridge technology, as proposed by Oracle and other database vendors, are reusability, extensibility and, especially, a very clean interface to components of the database system, like the query processor, optimizer and page-access management. This multimedia cartridge realizes the meta-database and provides access to the clients for complex search functionality, supported by advanced indexing structures (like a combination of X-trees, SS-trees, etc.).

2. Realization of a Processing Unit for MPEG-4 videos: The Processing Unit is supposed to be situated between the video server and the meta-database.

It shall extract the necessary quality adaptation capability information from the A/V streams, to be stored as meta-data in the database. Upon the insertion of a video (the insertion could be on demand or on a regular basis), the Processing Unit shall apply effects such as transformation and scaling to the MPEG-4 encoded videos, report the results (performance and variation information) to the meta-database, and write the adapted videos back to the video server.

3. Implementation of an indexing structure for the access of MPEG-7 files, possibly in BiM format: A composition mechanism for different MPEG-7 Access Units shall be developed, and the mapping process of MPEG-7 Access Units to MPEG-4 Access Units shall be implemented.

4. Realization of the cross-referencing between MPEG-7 and MPEG-4, i.e., how to access media data from meta-data and vice-versa: the indexing structures of the previous work item shall be employed and integrated into a reference stream which allows efficient access to the meta-data from the media data. Referencing of MPEG-7 and MPEG-4 will be employed in the proxy-cache and in the active routers.

The application scenarios we focus upon are sport-event videos, the M3box, and tele-teaching. Sport events are an interesting application scenario since these videos provide semantically rich content. The M3box is an adaptive multimedia message box, developed by Siemens Corporate Technology in cooperation with our Institute. The final application is tele-teaching. This application differs from the previous ones, as here multimedia data plays a supporting role rather than being the central data component. Therefore, the meta-database has to keep, besides descriptive information on the multimedia data, information on the tele-teaching material and the teaching process.

Future Improvement

The most interesting and exciting thing about multimedia databases is how quickly they're evolving. This growth, along with the emergence of inexpensive removable storage devices such as DVDs, which store tens of gigabytes, will ignite an explosion of multimedia applications.

This explosion, in turn, will fuel an intense need for powerful multimedia databases.

Companies, institutes and other organizations that interact directly with multimedia databases

Laboratories and Research Groups

Research in this area involves the representation, indexing, transmission, retrieval and presentation of multimedia in a distributed environment, content-based retrieval, data mining, and bioinformatics. The main projects built on multimedia databases are:

MultiStore
MultiStore is a storage infrastructure project to construct, manage, and store multidimensional data sets. The project will carry out geographic imaging research, bioinformatics research, pharmacogenomic research, and basic research on large-scale data sets and information visualization. Collaborators include David Mark, Director of the NCGIA, Nobel laureate Herbert Hauptman, and Norma Nowak, Director of the DNA Microarray Facility at the Roswell Park Cancer Institute.

NetMedia
Consistent and robust retrieval, transmission and presentation of multimedia data in distributed multimedia database systems.

Data Mining and Bioinformatics
While available information retrieval techniques rely on human coding of semantic content, Zhang's intelligent content analysis analyzes images automatically. The picture analyzed depicts Bell Hall, home to the main office of the Department of Computer Science and Engineering on UB's North Campus.

Multimedia Database at RMIT
The Multimedia Database Systems Group at RMIT is one of Australia's leading centers of research in electronic document management. The group develops systems for document management and undertakes consultation within the wider area of database management and information retrieval.

The TeraText Database System

RMIT MDS is responsible for the development and maintenance of the TeraText database system (formerly called the Structured Information Manager (SIM)), a text database system with native XML support. The TeraText DBS is designed for building web-based, text-intensive applications for large enterprises. The TeraText product is now sold in both Australia and the USA and is used by a large number of organizations in both the public and private sectors.

DELTA Collaborative Multimedia Database Project

The DELTA Collaborative Multimedia Database Project is designed to provide faculty and students with access to a rich diversity of digital instructional materials (DIM) for self-study and for use by faculty in developing courseware and electronic presentations for instructional purposes. The materials have copyright protection, but are available for noncommercial use without charge. Currently, databases of digital still images, sounds, and video have been developed from analog materials provided by faculty who have certified copyright on the original work. The material contained in this database may only be used for non-profit, educational purposes. Any other use is prohibited without the express written consent of the Trustees of the California State University or the copyright owner.

World War II Multimedia Database

Originally completed as its creator's thesis at Fordham University, the World War II Multimedia Database is dedicated to the 50 million people killed in the Second World War. Currently the World War II Multimedia Database has 1850 photos, 93 video clips, and a virtual radio.

Distributed Databases
In today's world of universal dependence on information systems, all sorts of people need access to companies' databases. In addition to a company's own employees, these include the company's customers, potential customers, suppliers, and vendors of all types. It is possible for a company to have all of its databases concentrated at one mainframe computer site, with worldwide access to this site provided by telecommunications networks, including the Internet. Although the management of such a centralized system and its databases can be controlled in a well-contained manner, and this can be advantageous, it poses some problems as well. For example, if the single site goes down, then everyone is blocked from accessing the databases until the site comes back up again. Also, the communications costs from the many far-flung PCs and terminals to the central site can be expensive.

One solution to such problems, and an alternative design to the centralized database concept, is known as the distributed database. The idea is that instead of having one centralized database, we spread the data out among the cities on the distributed network, each of which has its own computer and data storage facilities. All of this distributed data is still considered to be a single logical database. When a person or process anywhere on the distributed network queries the database, it is not necessary to know where on the network the data being sought is located. The user just issues the query, and the result is returned. This feature is known as location transparency. This can become rather complex very quickly, and it must be managed by sophisticated software known as a distributed database management system, or distributed DBMS.

A distributed database (DDB) is a collection of multiple, logically interrelated databases distributed over a computer network. A distributed database management system (DDBMS) is the software that manages the DDB and provides an access mechanism that makes this distribution transparent to the user. Collections of data (e.g. in a database) can be distributed across multiple physical locations. A distributed database can reside on network servers on the Internet, on corporate intranets or extranets, or on other company networks. The replication and distribution of databases improves database performance at end-user worksites.

Although there are a number of advantages to using a distributed DBMS, there are also a number of problems and implementation issues. Finally, data in a distributed DBMS can be partitioned or replicated, or both.

Homogeneous distributed databases

A homogeneous distributed database is one where all sites are homogeneous or similar. All sites have identical software. They are aware of each other and agree to cooperate in processing user requests.

Heterogeneous distributed databases

In a heterogeneous distributed database system, at least one of the databases uses different schemas and software. A database system having different schemas may cause a major problem for query processing. A database system having different software may cause a major problem for transaction processing. Therefore, special attention needs to be paid to integrating a heterogeneous distributed database.

Methods to achieve a distributed database

A distributed database is one that is held in several locations. There are three methods of achieving this:
1. Each remote processor has data on its own customers, stock, etc. The database is uploaded each night and changes made.
2. The entire database is duplicated at the remote site. The database is uploaded each night and changes made.
3. The central database contains only an index of entries. The actual records are held at the remote site. A query to the central database will locate where the record is held. This system is used by very large databases.

Fragmentation

In a DDBS we need to define the logical unit of DB distribution and allocation. In some cases it might be more efficient to split the tables into smaller units (fragments) and allocate them to different sites. Fragmentation has three different types:

1. Horizontal Fragmentation

In horizontal partitioning, the rows of a table are divided up among several sites on the network.

Each such partition must include the primary key attribute(s) of the table. An example of horizontal fragmentation is an employees table, where it makes sense for the company to split the table into partitions based on the employees who work at each site. This makes management, queries, and transactions convenient and efficient. The downside of this choice is that whenever a query involves all records, the system has to request all partitions from all sites and perform a union on them, as the sketch below illustrates.
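A minimal sketch of this trade-off, using two in-memory SQLite databases to stand in for the two sites. The table layout and data are invented for illustration.

```python
# Horizontal fragmentation: each "site" holds the rows for its own
# employees; a site-local query is cheap, a global query must union
# the fragments collected from every site.
import sqlite3

def make_site(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE employee "
                "(emp_id INTEGER PRIMARY KEY, name TEXT, site TEXT)")
    con.executemany("INSERT INTO employee VALUES (?, ?, ?)", rows)
    return con

mumbai = make_site([(1, "Asha", "Mumbai"), (2, "Ravi", "Mumbai")])
delhi  = make_site([(3, "Meena", "Delhi")])

# Local query: touches one fragment only.
print(mumbai.execute("SELECT name FROM employee").fetchall())

# Global query: must visit every site and union the partial results.
all_rows = []
for site in (mumbai, delhi):
    all_rows.extend(site.execute("SELECT * FROM employee").fetchall())
print(sorted(all_rows))
```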

2. Vertical Fragmentation

In vertical partitioning, the columns of a table are divided up among several sites on the network. Each such partition must include the primary key attribute(s) of the table. This arrangement can make sense when different sites are responsible for processing different functions involving an entity. For example, the salary attributes of a personnel table might be stored in one city while the skills attributes of the table might be stored in another city. Both partitions would include the employee number, the primary key of the full table. A downside of this option is that a query involving the entire table would have to request all portions from all sites and do a join on them.

3. Hybrid Fragmentation

In this type of fragmentation scheme, the table is divided into arbitrary blocks, based on the needed requirements. Each fragment can then be allocated to a specific site. This type of fragmentation is the most complex one, and it needs more management.

Query Processing

A DDBS adds to the conventional centralized DBS some other types of processing expenses, because of the additional design (hardware and software) needed to handle the distribution. These expenses manifest as the cost of data transfer over the network. The data transferred could be intermediate files resulting from local sites, or final results that need to be sent back to the original site that issued the query. Therefore, database designers are concerned with query optimization, which targets minimizing the cost of transferring data across the network. One method to optimize queries on a DDBS works as follows: a relation R1 sends its entire join column CR1 to the target relation R2; the site containing R2 then performs the join on CR1 and projects on the requested attributes; the resulting tuples are shipped back to R1 for further processing. This can significantly enhance query efficiency, since the data transferred over the network is minimized.
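Here is a small Python sketch of that semijoin-style optimization, with the two relations held as in-memory lists standing in for the two sites; the attribute names and data are illustrative.

```python
# Semijoin sketch: ship only the join column, join remotely, ship back
# only the matching, projected tuples.

r1 = [(1, "Asha"), (2, "Ravi"), (3, "Meena")]   # (emp_id, name) at site 1
r2 = [(1, "Sales", 50000), (3, "HR", 45000),
      (7, "IT", 60000)]                         # (emp_id, dept, salary) at site 2

# Step 1: send only the join column of R1 across the network.
join_column = {emp_id for emp_id, _ in r1}

# Step 2: site 2 joins against that column and projects the needed
# attributes, so only matching tuples travel back.
shipped_back = [(emp_id, dept) for emp_id, dept, _ in r2
                if emp_id in join_column]

# Step 3: site 1 completes the join locally.
names = dict(r1)
result = [(names[emp_id], dept) for emp_id, dept in shipped_back]
print(result)   # [('Asha', 'Sales'), ('Meena', 'HR')]
```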

Concurrency and Recovery

The DDBS design of concurrency and recovery has to consider aspects beyond those of a centralized DBS. These aspects include:
Multiple copies of data: concurrency control has to keep the data copies consistent. Recovery, on the other hand, has to make a copy consistent with the others whenever a site recovers from a failure.
Failure of communication links.
Failure of individual sites.
Distributed commit: during transaction commit some sites may fail, so the two-phase commit protocol is used to solve this problem (a sketch follows this list).
Deadlocks on multiple sites.
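A hedged sketch of the two-phase commit idea mentioned in the list above. The Site class and its methods are invented stand-ins for real participants; there is no networking, logging, or timeout handling here.

```python
# Two-phase commit sketch: the coordinator asks every site to vote
# (phase 1) and only tells them to commit if all voted yes (phase 2).

class Site:
    def __init__(self, name, healthy=True):
        self.name, self.healthy = name, healthy

    def prepare(self):
        # Phase 1 vote: a failed site votes "no".
        return self.healthy

    def commit(self):
        print(f"{self.name}: commit")

    def abort(self):
        print(f"{self.name}: abort")

def two_phase_commit(sites):
    if all(site.prepare() for site in sites):   # phase 1: voting
        for site in sites:                      # phase 2: decision
            site.commit()
    else:
        for site in sites:
            site.abort()

two_phase_commit([Site("A"), Site("B")])                 # both commit
two_phase_commit([Site("A"), Site("B", healthy=False)])  # all abort
```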

The following two sections describe two suggested approaches to managing concurrency control.

1. Distinguished Copy of a Data Item

There are three variations to this method:

Primary site
In this method, a single site is designated as the coordinator site. All locks and unlocks for all data units are controlled by this site. One advantage is that it is easy to implement. However, two downsides of this method are overloading of the coordinator site, and the fact that this site forms a single point of failure for the entire DDBS.

Primary site with backup site
This technique addresses the second disadvantage of the primary-site technique by designating a backup site that can take over as the new coordinator in case of failure, in which case another backup site has to be selected.

Primary copy technique
This method distributes the load to the sites that have a designated primary copy of a data unit, as opposed to centralizing all the data units at one coordinator site. This way, if a site goes down, only transactions involving the primary copies residing on that site will be affected.

2. Voting

This method does not designate any distinguished copy or site as the coordinator, as the previous two methods do. When a site attempts to lock a data unit, requests must be sent to all sites holding a copy of it, asking to lock that copy. If the requesting transaction is not granted the lock by a majority vote of the sites, the transaction fails and sends cancellations to all of them. Otherwise it keeps the lock and informs all sites that it has been granted the lock.
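A minimal sketch of the majority-voting rule just described; each copy's vote is simulated by a boolean, and everything else (messaging, timeouts, informing the sites) is reduced to comments.

```python
# Majority voting for a distributed lock: the lock is granted only if
# more than half of the sites holding a copy vote yes.

def request_lock(copy_votes):
    yes_votes = sum(1 for vote in copy_votes if vote)
    if yes_votes > len(copy_votes) / 2:
        return True    # majority granted: keep the lock, inform all sites
    return False       # majority denied: fail and send cancellations

print(request_lock([True, True, False]))    # 2 of 3 votes -> granted
print(request_lock([True, False, False]))   # 1 of 3 votes -> denied
```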

3. Recovery

The first step in dealing with the recovery problem is to identify that there was a failure, what type it was, and at which site it happened. Dealing with distributed recovery requires mechanisms such as database logs, update protocols, and transaction-failure recovery protocols.

Advantages of Distributed Databases

Management of distributed data with different levels of transparency, like fragmentation transparency, replication transparency, etc.
Increased reliability and availability: Reliability is defined as the probability that the system is up at a given time. Availability is defined as the probability that the system is up continuously during a given time period. These important system parameters are improved with a DDBS. In a centralized DBS, if any component of the DB goes down, the entire system goes down, whereas in a DDBS only the affected site is down and the rest of the system is not affected. Furthermore, if the data is replicated at different sites, the effects are greatly minimized.
Easier expansion.
Reflects organizational structure: Database fragments are located in the departments they relate to.
Local autonomy or site autonomy: A department can control its own data (as its staff are the ones familiar with it).
Protection of valuable data: If there were ever a catastrophic event such as a fire, all of the data would not be in one place, but distributed in multiple locations.

Improved performance: Data is located near the site of greatest demand, and the database systems themselves are parallelized, allowing load on the databases to be balanced among servers. (A high load on one module of the database won't affect other modules of the database in a distributed database.)
Economics: It costs less to create a network of smaller computers with the power of a single large computer.
Modularity: Systems can be modified, added and removed from the distributed database without affecting other modules (systems).
Reliable transactions: Due to replication of the database.
Hardware, Operating System, Network, Fragmentation, DBMS, Replication and Location Independence.
Continuous operation.
Distributed query processing.
Distributed transaction management.

Disadvantages of Distributed Databases

Complexity: Extra work must be done by the DBAs to ensure that the distributed nature of the system is transparent. Extra work must also be done to maintain multiple disparate systems, instead of one big one. Extra database design work must also be done to account for the disconnected nature of the database; for example, joins become prohibitively expensive when performed across multiple systems.
Economics: Increased complexity and a more extensive infrastructure mean extra labour costs.
Security: Remote database fragments must be secured, and since they are not centralized the remote sites must be secured as well. The infrastructure must also be secured (e.g., by encrypting the network links between remote sites).
Difficult to maintain integrity: In a distributed database, enforcing integrity over a network may require too much of the network's resources to be feasible.
Inexperience: Distributed databases are difficult to work with, and as a young field there is not much readily available experience on proper practice.
Lack of standards: There are no tools or methodologies yet to help users convert a centralized DBMS into a distributed DBMS.

Database design more complex: Besides the normal difficulties, the design of a distributed database has to consider fragmentation of data, allocation of fragments to specific sites, and data replication.
Additional software is required.
The operating system should support the distributed environment.
Concurrency control: This is a major issue. It is solved by locking and timestamping.

Document Oriented Database


A document-oriented database is a computer program designed for storing, retrieving, and managing document-oriented, or semi-structured, information. Document-oriented databases are one of the main categories of so-called NoSQL databases, and the popularity of the term "document-oriented database" (or "document store") has grown with the use of the term NoSQL itself.

Documents

The central concept of a document-oriented database is the notion of a Document. While each document-oriented database implementation differs on the details of this definition, in general they all assume documents encapsulate and encode data (or information) in some standard format(s) (or encoding(s)). Encodings in use include XML, YAML, JSON and BSON, as well as binary forms like PDF and Microsoft Office documents (MS Word, Excel, and so on). Documents inside a document-oriented database are similar, in some ways, to records or rows in relational databases, but they are less rigid. They are not required to adhere to a standard schema, nor will they all have the same sections, slots, parts, keys, or the like. For example, here's a document: FirstName="Bob", Address="5 Oak St.", Hobby="sailing". Another document could be: FirstName="Jonathan", Address="15 Wanamassa Point Road", Children=[{Name:"Michael", Age:10}, {Name:"Jennifer", Age:8}, {Name:"Samantha", Age:5}, {Name:"Elena", Age:2}]. Both documents have some similar information and some different. Unlike a relational database, where each record would have the same set of fields and unused fields might be kept empty, there are no empty 'fields' in either document (record) in this case. This system allows new information to be added, and it doesn't require explicitly stating when other pieces of information are left out.

Keys, Retrieval, and Organization

Keys
Documents are addressed in the database via a unique key that represents that document. Often, this key is a simple string. In some cases, this string is a URI or path. Regardless, you can use this key to retrieve the document from the database. Typically, the database retains an index on the key so that document retrieval is fast.

Retrieval
One of the other defining characteristics of a document-oriented database is that, beyond the simple key-document (or key-value) lookup that you can use to retrieve a document, the database will offer an API or query language that allows you to retrieve documents based on their contents. For example, you may want a query that gets you all the documents with a certain field set to a certain value. The set of query APIs or query language features available, as well as the expected performance of the queries, varies significantly from one implementation to the next. (A minimal sketch of both styles of lookup appears after the Organization notes below.)

Organization
Implementations offer a variety of ways of organizing documents, including notions of:
Collections
Tags
Non-visible Metadata
Directory hierarchies
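A minimal Python document-store sketch covering both the key-based lookup and the content-based query described above, using the Bob/Jonathan documents from the earlier example. The put/get/find API is invented for illustration and is not any particular product's interface.

```python
# Toy document store: schema-less dicts addressed by a unique key,
# with a content-based query on top.

store = {}  # key -> document

def put(key, document):
    store[key] = document

def get(key):
    return store.get(key)             # fast key-document lookup

def find(**conditions):
    # Content-based retrieval: documents whose fields match every condition.
    return [doc for doc in store.values()
            if all(doc.get(k) == v for k, v in conditions.items())]

put("u1", {"FirstName": "Bob", "Address": "5 Oak St.", "Hobby": "sailing"})
put("u2", {"FirstName": "Jonathan",
           "Address": "15 Wanamassa Point Road",
           "Children": [{"Name": "Michael", "Age": 10},
                        {"Name": "Elena", "Age": 2}]})

print(get("u1"))                      # lookup by key
print(find(FirstName="Bob"))          # lookup by content; no empty fields
```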

XML database implementations

An XML database is a data persistence software system that allows data to be stored in XML format. This data can then be queried, exported and serialized into the desired format. Two major classes of XML database exist:

XML-enabled: these map all XML to a traditional database (such as a relational database), accepting XML as input and rendering XML as output. This term implies that the database does the conversion itself (as opposed to relying on middleware).


Native XML (NXD): the internal model of such databases depends on XML and uses XML documents as the fundamental unit of storage, which are, however, not necessarily stored in the form of text files.

Rationale for XML in Databases

O'Connell gives one reason for the use of XML in databases: the increasingly common use of XML for data transport, which has meant that "data is extracted from databases and put into XML documents and vice-versa". It may prove more efficient (in terms of conversion costs) and easier to store the data in XML format.

Native XML Databases

The term "native XML database" (NXD) can lead to confusion. Many NXDs do not function as standalone databases at all, and do not really store the native (text) form. The formal definition from the XML:DB initiative (which appears to have been inactive since 2003) states that a native XML database:

Defines a (logical) model for an XML document, as opposed to the data in that document, and stores and retrieves documents according to that model. At a minimum, the model must include elements, attributes, PCDATA, and document order. Examples of such models include the XPath data model, the XML Infoset, and the models implied by the DOM and the events in SAX 1.0.

Has an XML document as its fundamental unit of (logical) storage, just as a relational database has a row in a table as its fundamental unit of (logical) storage.

Need not have any particular underlying physical storage model. For example, NXDs can use relational, hierarchical, or object-oriented database structures, or use a proprietary storage format (such as indexed, compressed files).

Additionally, many XML databases provide a logical model of grouping documents, called "collections". Databases can set up and manage many collections at one time. In some implementations, a hierarchy of collections can exist, much in the same way that an operating system's directory structure works.

All XML databases now support at least one form of querying syntax. Minimally, just about all of them support XPath for performing queries against documents or collections of documents. XPath provides a simple pathing system that allows users to identify nodes that match a particular set of criteria.
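As a quick illustration of XPath-style node selection, the sketch below uses Python's standard xml.etree.ElementTree module, which supports a limited XPath subset; the sample document is invented for this example:

import xml.etree.ElementTree as ET

# A small invented XML document of the kind an NXD might store.
xml_doc = """
<library>
  <book category="database"><title>XML Data Management</title></book>
  <book category="web"><title>Building Web Services</title></book>
  <book category="database"><title>Native XML Storage</title></book>
</library>
"""

root = ET.fromstring(xml_doc)

# XPath-style query: every <book> whose category attribute is "database".
for book in root.findall(".//book[@category='database']"):
    print(book.find("title").text)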


In addition to XPath, many XML databases support XSLT as a method of transforming documents or query results retrieved from the database. XSLT provides a declarative language written using an XML grammar. It aims to define a set of XPath filters that can transform documents (in part or in whole) into other formats, including plain text, XML, or HTML.

Many XML databases also support XQuery to perform querying. XQuery includes XPath as a node-selection method, but extends XPath to provide transformational capabilities. Users sometimes refer to its syntax as "FLWOR" (pronounced 'flower') because a query may include the following clauses: 'for', 'let', 'where', 'order by' and 'return'.

Traditional RDBMS vendors (who historically had SQL-only engines) are now shipping hybrid SQL and XQuery engines. Hybrid SQL/XQuery engines make it possible to query XML data alongside relational data in the same query expression, which helps in combining relational and XML data.

Some XML databases support an API called the XML:DB API (or XAPI) as a form of implementation-independent access to the XML datastore. In XML databases, XAPI resembles ODBC and JDBC as used with relational databases. On 24 June 2009, the Java Community Process released the final version of the XQuery API for Java specification (XQJ): "a common API that allows an application to submit queries conforming to the W3C XQuery 1.0 specification and to process the results of such queries".
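To show what the FLWOR clauses correspond to, here is an invented XQuery expression (over the library document used earlier) alongside a plain-Python equivalent; the data and names are illustrative only:

# An illustrative FLWOR query:
#
#   for $b in //book
#   let $t := $b/title
#   where $b/@category = "database"
#   order by $t
#   return $t
#
# The same for / where / order by / return pipeline in Python:

books = [
    {"category": "database", "title": "XML Data Management"},
    {"category": "web", "title": "Building Web Services"},
    {"category": "database", "title": "Native XML Storage"},
]

titles = sorted(                       # order by
    b["title"]                         # let / return
    for b in books                     # for
    if b["category"] == "database"     # where
)
print(titles)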


Implementations of Document-Oriented Databases

Notable implementations, with publisher, license, implementation language and notes where known, include:

Lotus Notes (IBM; Proprietary)
askSam (askSam Systems; Proprietary)
Apstrata (Apstrata; Proprietary)
Datawasp (Significant Data Systems; Proprietary)
Clusterpoint (Clusterpoint Ltd.; free community license / commercial; C++): Scalable, high-performance, schema-free, document-oriented database management system platform with server-based data storage, fast full-text search engine functionality, and information ranking for search relevance and clustering.
CRX (Day Software; Proprietary; Java)
MUMPS Database (Proprietary and GNU Affero GPL; MUMPS): Commonly used in health applications.
UniVerse (Rocket Software; Proprietary)
UniData (Rocket Software; Proprietary)
Jackrabbit (Apache Software Foundation; Apache License; Java)
CouchDB (Couchbase, Apache Software Foundation; Apache License; Erlang): JSON over REST/HTTP with Multi-Version Concurrency Control and ACID properties; uses map and reduce for views and queries.
FleetDB (FleetDB; MIT License; Clojure): A JSON-based schema-free database optimized for agile development.
MongoDB (GNU AGPL v3.0; C++): Fast, document-oriented database optimized for highly transient data.
GemFire Enterprise (VMWare; Commercial; Java, .NET, C++): Memory-oriented, fast, key-value database with indexing and querying support.
OrientDB (Orient Technologies; Apache License; Java): JSON over HTTP.
RavenDB (RavenDB; commercial or GNU AGPL v3.0; .NET): A .NET LINQ-enabled document database focused on providing a high-performance, transactional, schema-less, flexible and scalable NoSQL data store for the .NET and Windows platforms.
Redis (BSD License; ANSI C): Key-value store supporting lists and sets with a fast, simple and binary-safe protocol. Alpha software.
StrokeDB (MIT License)
Terrastore (Apache License; Java): JSON/HTTP.
ThruDB (BSD License; C++, Java): Built on top of the Apache Thrift framework; provides indexing and document storage services for building and scaling websites. An alternate implementation is being developed in Java. Alpha software.
Persevere (Persevere; BSD License): A JSON database and JavaScript application server; provides a RESTful JSON interface for create, read, update and delete access to data, and also supports JSONQuery/JSONPath querying.
DBSlayer (DBSlayer; Apache License; C): A database abstraction layer (over MySQL) used by the New York Times; JSON over HTTP.
Eloquera DB (Eloquera; Proprietary; .NET): High performance; based on dynamic objects; supports LINQ and SQL queries.


Mobile Database

A mobile database is a database that can be connected to by a mobile computing device over a mobile network. The client and server have wireless connections. A cache is maintained to hold frequently used data and transactions so that they are not lost due to connection failure. A database is a structured way to organize information; this could be a list of contacts, price information or distance travelled.

The use of laptops, mobile phones and PDAs is increasing and is likely to increase in the future, with more and more applications residing on mobile systems. While analysts cannot tell us exactly which applications will prove most popular, it is clear that a large percentage will require the use of a database of some sort. Many applications would require the ability to download information from an information repository and operate on this information even when out of range or disconnected. An example of this is a mobile workforce, in which users need to access and update information from files in the home directories on a server, or customer records from a database. The type of access and workload generated by such users is different from the traditional workloads seen in the client-server systems of today. With the advent of mobile databases, users can load up their smartphones or PDAs with mobile databases to exchange mission-critical data remotely, without worrying about time or distance. Mobile databases let employees enter data on the fly; information can be synchronized with a server database at a later time, as sketched below.

Embedded Database

An embedded database system is a database management system (DBMS) which is tightly integrated with application software that requires access to stored data, such that the database system is hidden from the application's end-user and requires little or no ongoing maintenance. It is actually a broad technology category that includes database systems with differing application programming interfaces (SQL as well as proprietary, native APIs); database architectures (client/server and in-process); storage modes (on-disk, in-memory and combined); database models (relational, object-oriented, Entity-Attribute-Value model and network/CODASYL); and target markets.
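The following minimal Python sketch illustrates the store-and-forward synchronization just described for mobile databases: transactions are captured in a local cache while disconnected and drained to the server later. Every name in it (record_sale, sync, FakeServer) is invented for illustration rather than taken from any product:

import queue

pending = queue.Queue()          # local cache of not-yet-synced operations

def record_sale(item, qty):
    """Capture a transaction locally, even while disconnected."""
    pending.put({"op": "sale", "item": item, "qty": qty})

def connected():
    """Stub: a real client would probe the network or server here."""
    return True

def sync(server):
    """Drain the local queue to the server once a connection exists."""
    while connected() and not pending.empty():
        txn = pending.get()
        server.apply(txn)        # hypothetical server-side API

class FakeServer:                # stand-in for the back-end database
    def apply(self, txn):
        print("applied", txn)

record_sale("widget", 3)         # entered in the field, offline
record_sale("gadget", 1)
sync(FakeServer())               # later, when connectivity returns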


The term "embedded database" can be confusing because only a small subset of embedded database products is used in real-time embedded systems such as telecommunications switches and consumer electronics devices.

Mobile and Embedded DBMS Characteristics

The data access and management requirements of the applications described above are significantly different from those of traditional server DBMSs. These new applications must be able to run on multiple tiers, ranging from devices to servers to the web, and would benefit from various existing database mechanisms. However, these database mechanisms (such as query, indexing, and persistence) must be unlocked from the traditional monolithic DBMSs and made available as embeddable components (e.g. DLLs) that can be embedded within applications, enabling them to meet the requirements described above. Such mobile and embedded DBMSs have the following characteristics; a brief illustrative sketch follows the list.

1. Embeddable in applications: Mobile and embedded DBMSs form an integral part of the application or the application infrastructure, often requiring no administration. Database functionality is delivered as part of the application (or application infrastructure). While the database must be embeddable as a DLL in applications, it must also be possible to deploy it as a stand-alone DBMS with support for multiple transactions and applications.

2. Small footprint: For many applications, especially those that are downloadable, it is important to minimize the DBMS footprint. Since the database system is part of the application, the size of the DBMS affects the overall application footprint. In addition to the small footprint, it is also desirable to have short code paths for efficient application execution. Most of these applications do not require the full functionality of commercial DBMSs; they require simple queries and execute in constrained environments.

3. Run on mobile devices: The DBMSs that run on mobile devices tend to be specialized versions of mobile and embedded DBMSs. In addition to handling the memory, disk and processor limitations of these devices, the DBMS must also run on specialized operating systems. The DBMS must be able to store and forward data to the back-end databases, as synchronization with back-end systems is critical for these applications.

4. Componentized DBMS: Often, to support the small-footprint requirement, it is important to include only the functionality that is required by the applications. For example, many simple applications just require ISAM-like record-oriented access; for these applications, there is no need to include the query processor, which would only increase the footprint. Similarly,


many mobile and mid-tier applications require only a small set of relational operators, while others require XML access rather than relational access. So, it should be possible to pick and choose the desired components.

5. Self-managed DBMS: The embedded DBMS is invisible to the application user. There can be no DBA to manage the database, and operations like backups, recovery, indexing and tuning cannot be initiated by a DBA. If the database crashes, recovery must start instantaneously. The database must be self-managed or managed by the application. Also, an embedded DBMS must auto-install with the application; it should not be installed explicitly (through user action) or independently. Similarly, when the application is shut down, the DBMS must transparently shut down.

6. In-memory DBMS: These are specialized DBMSs serving applications that require high performance on data that is small enough to be contained in main memory. In-memory DBMSs require specialized query processing and indexing techniques that are optimized for main-memory usage. Such DBMSs may also support data that never gets persisted.

7. Portable databases: There are many applications which require very simple deployment: installing the application should install the database associated with it. This requires the database to be highly portable. Typically, single-file databases (like Microsoft Access databases) are ideally suited for this purpose. Again, there should be no need to install the DBMS separately: installing the application installs the DBMS, and copying the database file completes the application migration.

8. No code in the database: A portable database must also be safe. Executable code can be a carrier of viruses or other threats. By eliminating any code storage in the database, it can be made safer and more portable.

9. Synchronize with back-end data sources: In mobile and cached scenarios, it must be possible to synchronize the data with the back-end data sources. In typical mid-tier (application server) caches, the data is fetched from the back-end databases into the cache, operated on, and synchronized with the back-end database.

10. Remote management: While mobile and embedded DBMSs must be self-managed, it is important to allow them to be managed remotely as well, especially those on mobile devices. In enterprises (e.g. FedEx, UPS), mobile devices must be configured and managed in a manner compliant with company standards. Therefore, centralized remote management of these devices is necessary.


11. Custom programming interfaces: An important usage of embedded DBMSs is in specialized data-centric applications. Such applications use a variety of data models (e.g. relational, XML, streams, and analytics) and query languages. The embedded DBMSs must be componentized and extensible to allow (application) domain-specific query languages and programming surfaces.
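As the sketch promised above, Python's bundled sqlite3 module demonstrates several of these characteristics: the engine runs inside the application process with no administration (characteristics 1 and 5), can run purely in main memory (6), and stores a whole database in one portable file (7). The file name and schema below are invented for illustration:

import sqlite3

# Characteristics 1 and 5: the DBMS runs in-process, installed with the
# application; there is no server to start and no DBA-driven administration.
# Characteristic 7: the entire database lives in one portable file.
conn = sqlite3.connect("app_data.db")   # invented file name
conn.execute("CREATE TABLE IF NOT EXISTS readings (ts TEXT, value REAL)")
conn.execute("INSERT INTO readings VALUES (?, ?)", ("2011-10-11", 42.0))
conn.commit()
print(conn.execute("SELECT COUNT(*) FROM readings").fetchone())
conn.close()                             # shuts down with the application

# Characteristic 6: the same engine can also run entirely in main
# memory, never persisting data to disk.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE cache (k TEXT PRIMARY KEY, v TEXT)")
mem.execute("INSERT INTO cache VALUES ('greeting', 'hello')")
print(mem.execute("SELECT v FROM cache WHERE k = 'greeting'").fetchone())
mem.close()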


Conclusion
With the increased interest in specialized applications and database systems (such as streaming, analytics, mobile, and sensor networks), the need for mobile and embedded DBMSs is increasing. There have been mobile and embedded DBMS products in the market that satisfy some of the characteristics described above, but they were designed as low-end database products and do not factor in recent hardware and application trends. For example, some products focused on a small footprint but were not designed with componentization for embedded applications; similarly, there are products that provide only embedded (well-componentized) storage engines, without any query processing components. The hardware trends in multi-core CPUs, large main memory, and NVRAMs (e.g. flash) are beginning to disrupt traditional database architectures. In particular, advances in large memory and flash storage open doors for further research in database architectures. Componentized mobile and embedded DBMSs are in a position to adapt to these changing hardware and software trends more rapidly.

