
Data Warehouse Overview

Abstract

A data warehouse is a large database system designed for data analysis. Its design differs from that of an operational database. A data warehouse reads data from multiple operational databases instead of receiving it from end-user transaction input. Because a data warehouse does not have to perform transaction processing, it can run computation-intensive queries for data analysis. Its data model also differs from that of an operational database in order to make data browsing easier. This document presents an overview of the data warehouse in terms of its architecture, its data model, and its user interface.

1.0 Introduction

A data warehouse is a large database system designed for data analysis. Its data comes from many operational database systems. The source data can be, for example, accounting information, operation information, inventory information, or customer information. The data warehouse builds cross reference information between these different data sources to enable data analysis. It groups data into subject areas so that users can find data more easily. It maintains historical data for trend analysis. Since a data warehouse is not used for end-user transaction processing, it can afford the resources to run computation-intensive queries for data analysis. From a business perspective, a data warehouse provides a single, consistent data source. It makes the data collection process much easier and faster for the users. The users can answer different business questions by issuing queries to the data warehouse. Potentially, better business decisions can be made in a shorter period of time.

2.0 Business Driving Force for Data Warehouse


With the increase in business competition, there is a need to obtain and analyze business data faster. A lot of important business data is held in operational database systems. However, these systems are not designed for business data analysis, for the following reasons:

- The data model is normalized for transaction speed, not for data analysis.
- The data model is not grouped into subject areas for analysis.
- The data model is not dimensional.
- An operational database can't afford the resources to run computation-intensive queries for analysis.
- There is no cross reference information between data from different operational databases, e.g. between the financial and the operational database.
- Historical data needed for trend analysis may not be kept in the operational database.
- Operational databases lack the data description and data browsing facilities that a data warehouse has.
- Operational data changes over time.

With a data warehouse, a user could for example ask the following questions:

- What is the ROI for a new type of distribution mechanism?
- When are beer buyers most likely to also buy snacks?
- How likely are we to meet our fourth quarter projections?
- What is our growth rate in the southwest versus competitor X?
- What is the financial and the operational information for a geographical area?

A data warehouse can provide easy information access for business people, helping to increase revenue, profit, customer satisfaction, savings, and market share. The system can be used by different departments in the organization.

3.0 Development Steps

The development steps for a data warehouse project are similar to those for other information systems. The following outlines some key steps. The outline is divided into three sections: planning and design, building and testing, and roll out and maintenance.

Planning and Design

- Business drivers
- Objectives
- User needs
- User and sponsor expectations
- Application orientation
- Data sources
- Data quality
- Whether to build a data warehouse or a data mart
- Project risk
- Budget plan
- Time frame
- Cost benefit analysis
- Project team composition (DA, DBA, OLAP development, GUI development, query development, report development, user training, network management, system integration)
- Logical and physical data model (depends on access and usage)

The logical and physical data model design depends on how the data is accessed and used. In the planning and design phase, the data model is only a preliminary design.

Building and Testing

- Hardware, software, transformation software, middleware, OLAP software, system management software
- Network infrastructure and management
- Connection to source databases (flat file, ongoing connection, direct access)
- Summarizing or aggregating data
- Prototyping
- Data mining to find data patterns

Some data extraction and transformation software is very useful for developing data transformation routines. These tools are useful in both the construction and the maintenance phase. In addition, system management software can control the processes that extract data from other database systems into the data warehouse.

Roll Out and Maintenance

- System growth
- Performance management
- System maintenance
- Security
- Backup and recovery
- Data updates

Risk management is important to the success of a data warehouse project. Some of the project risks are:

- Technology risk: e.g. technology that is new to the market place, technology that is new to the organization, coexistence of technologies.
- Complexity risk: e.g. complex data model and database processes, business process change, mission critical requirements, a large number of installations, a distributed system, data re-modeling required for a legacy system.
- Integration risk: e.g. integration with other information systems, real time requirements for the interfaces.
- Project team risk: e.g. team member experience, business user involvement.

4.0 Architecture

A data warehouse reads source data from different database systems in the organization. The source databases are usually operational databases. Figure 1 shows one possible logical architecture for a data warehouse.

Figure 1: Data warehouse architecture

A data warehouse reads data from multiple operational databases. The data is cleaned, transformed, or aggregated. The data is either updated in or inserted into the data warehouse, depending on the trend analysis requirement. In addition, cross reference data is generated from the new data. For example, accounting data from the accounting database needs to be cross referenced with the facility data from the facility database.

Depending on the data model, the amount of data, and the particular query, performance can be a problem for a data warehouse system. Some tables in the data warehouse can contain millions of entries, and queries on these tables can take a long time, for example when a query performs an aggregate operation to summarize the data. If the query needs many join operations or sub-queries, it will be even slower. These long queries can be run overnight to minimize the performance impact on the end users. Some of them can be sped up by modifying the data model or tuning the database.

Scalability is an important consideration in choosing software, hardware, and system architecture for the data warehouse. Both the database size and the number of users can increase substantially over time. The software and hardware must be scalable to support the new requirements.

There are different types of database management system, such as relational, object oriented, and hierarchical database systems. A relational database is usually the choice for implementing a data warehouse, for the following reasons:

- The relational database is the most commonly used database system in the commercial environment. Many developers already have experience with relational database products, which reduces their learning curve.
- Most operational database systems are built on relational databases. If the data warehouse is also built on a relational database, the data conversion process between the operational database and the data warehouse can be simpler. There are also many database products that enable direct data transfer between relational database systems.
- The relational database is more mature than other types of database system in terms of scalability, stability, and efficiency.
- The relational database has fewer proprietary functions than other database systems, which increases platform independence.

An object oriented database is sometimes used in database applications because it has a richer set of constructs to represent the data model. For example, a hierarchical data structure can be represented more naturally than in a relational database. An object oriented database also provides better integration between data and functions. It is therefore a good fit for applications that have both complex data structures and complex functions (e.g. CAD applications, simulation applications).

4.1 Differences Between Data Warehouse and Data Mart

A data mart has functions similar to those of a data warehouse, except that a data mart is much smaller and has a smaller group of users. For example, a department can design a data mart that is tailored to its specific needs. The data mart can contain additional domain specific information for the department. A data mart costs less time and money to build, and its design can be more flexible. Some software products can merge multiple data marts into a data warehouse so that the data can be shared by the entire organization. Such a product provides data management capabilities that extract a subset of the data from the data marts to form the data warehouse. Some suggest that this approach to data warehouse development is more realistic because it builds the data warehouse step by step.

Figure 2: Data mart architecture

5.0 Data Source and Data Extraction

A data warehouse reads data from multiple data sources. These data sources are usually operational databases, such as an accounting information database, financial information database, facility information database, ERP (Enterprise Resources Planning, e.g. SAP) system, operational information database, research and engineering database, or GIS (Geographical Information System). External data sources can include industry data, economic data, credit data, commodity (raw material) data, meteorological data, competitor related data, and demographic data. Depending on the business requirements and the types of data, the data can be loaded just once, once a day, once a week, or once a month. Once a day is the most common. Ongoing data and system administration work is required to maintain the data warehouse.

There are different ways to implement data extraction processes, depending on the requirements and the technical environment. Some implementations require more maintenance effort than others. The following are some implementation methods for the data extraction process:

- The data can be extracted into an ASCII report file, in fixed width or CSV format. The report file is generated through a standard report function on the operational database system. In some situations, a custom report function is developed.
- The data can be extracted directly from the source database system. The source database can provide a single database view that contains all the necessary information. Using this view, the data transformation process can request the information directly and load it into the data warehouse.

The data loading process can encounter errors. The problems can be referential integrity errors, data format errors, data range errors, or other data quality errors. In these situations, the source data has to be corrected before it can be loaded into the data warehouse.

Depending on the data source, the source database system may need to be remodeled in order to produce the data required by the data warehouse. This data re-modeling work can take a lot of time.
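The loading checks described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report layout, the column names (facility_id, report_date, book_value), and the validation rules are hypothetical and do not come from any particular source system.

```python
import csv
import io
from datetime import datetime

# Hypothetical CSV report produced by an operational system.
# The column names and values are assumptions made for this sketch.
REPORT = """facility_id,report_date,book_value
F001,1998-09-01,125000.50
F002,1998-13-01,98000.00
F003,1998-09-01,-40.00
F004,1998-09-01,abc
"""


def validate(row):
    """Return a list of data quality problems found in one record."""
    problems = []
    try:
        datetime.strptime(row["report_date"], "%Y-%m-%d")
    except ValueError:
        problems.append("data format error: report_date")
    try:
        if float(row["book_value"]) < 0:
            problems.append("data range error: book_value")
    except ValueError:
        problems.append("data format error: book_value")
    return problems


def load(report_text):
    """Split the report into loadable records and rejected records."""
    accepted, rejected = [], []
    for row in csv.DictReader(io.StringIO(report_text)):
        problems = validate(row)
        if problems:
            rejected.append((row["facility_id"], problems))
        else:
            accepted.append(row)
    return accepted, rejected


accepted, rejected = load(REPORT)
for facility_id, problems in rejected:
    print(facility_id, problems)
```

Rejected records are kept together with the reason for rejection so that the source data can be corrected and reloaded, rather than being silently dropped.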

6.0 Data Modeling

Data modeling is one of the most important steps in building a data warehouse. A data warehouse uses dimensional modeling in a relational database environment. There are two types of table: the dimension table and the fact table. A dimension table contains information that is relatively static over time. A fact table contains transaction type information that changes over time. A fact table contains multiple foreign keys to dimension tables and has some attributes of its own. In comparison, entity relationship modeling has the data table, primary table, lookup table, characteristic table, virtual table, and summarized table.

Data modeling is a creative process, and there can be different modeling solutions for the same set of data. The purpose of data modeling is to organize data to meet business objectives and to provide good performance for database operations.

Meta data is important information in data modeling. It is the information about the data model. For example, consider "$5.64 sales amount": without meta data, the value appears only as "5.64" and we don't know what it means. Meta data captures business rules for data such as data name, description, value range, data version, data source, and referential integrity information. Meta data can be organized into a technical level and a business level. The following tables describe the information to be stored in the meta data repository.

Technical Level

- Data physical location
- Data access method
- Program and script name
- Dependencies
- Data transformation logic
- Data refresh rules
- Rules to resolve data inconsistencies
- Rules for data derivation (e.g. aggregation)

Business Level

- Mapping data source & target
- Valid user entries
- Frequency of update and usage
- Data update responsibilities
- Data security
- Other business rules
- Data ownership
- Table size estimates
- Data access, drill down, and roll-up
- Predefined queries and reports

Meta data can be used as a semantic layer that lets users navigate through the data warehouse without having to understand the complex physical data structure. Some meta data can be extracted from the database management system or from the data modeling CASE tool.

6.1 Star Schema

A star schema is a relational data model. Each schema has one fact table associated with multiple dimension tables. A data warehouse can contain many star schemas. A star schema organizes data for end-user analysis and is easy for end users to understand. Many OLAP tools also support star schema analysis. Figure 3 is an example of a star schema.

Figure 3: Star Schema

In figure 3, the dimension tables are the Student table, Instructor table, Course table, and Semester table. Information in these dimension tables is relatively static over time. Each star schema can have multiple dimension tables but only one fact table. The fact table in figure 3 is the Attendance table. It contains foreign keys to the four dimension tables, and its primary key is the composite of those four foreign keys. Since the fact table contains transaction type information and the dimension tables contain relatively static information, the fact table holds far more data than the dimension tables.

The above data model can, for example, provide the following query results:

- List of students in a course, a major, or a minor
- Instructors for a course
- Courses taught by an instructor
- List of instructors in a faculty
- List of students taught by an instructor
- List of instructors that teach a student
- Summary of a student's grades
- Total credits obtained by a student
- List of courses taken by a student in a semester or a year
- Number of students in a course
- Number of openings in a course
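A star schema of this shape can be sketched with Python's built-in sqlite3 module. The table names follow the description above, but the column names, the grade and credits attributes, and the sample rows are assumptions made for illustration; they are not taken from the original figure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
PRAGMA foreign_keys = ON;

-- Dimension tables: relatively static descriptive data.
CREATE TABLE student    (student_id INTEGER PRIMARY KEY, name TEXT, major TEXT);
CREATE TABLE instructor (instructor_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE course     (course_id INTEGER PRIMARY KEY, title TEXT, credits INTEGER);
CREATE TABLE semester   (semester_id INTEGER PRIMARY KEY, label TEXT);

-- Fact table: one row per attendance, keyed by the four foreign keys.
CREATE TABLE attendance (
    student_id    INTEGER REFERENCES student(student_id),
    instructor_id INTEGER REFERENCES instructor(instructor_id),
    course_id     INTEGER REFERENCES course(course_id),
    semester_id   INTEGER REFERENCES semester(semester_id),
    grade         TEXT,
    PRIMARY KEY (student_id, instructor_id, course_id, semester_id)
);

INSERT INTO student VALUES (1, 'John Smith', 'Computer science'), (2, 'Ann Lee', 'Physics');
INSERT INTO instructor VALUES (10, 'Dr. Brown');
INSERT INTO course VALUES (100, 'Databases', 3), (101, 'Algorithms', 4);
INSERT INTO semester VALUES (1000, 'Fall 1998');
INSERT INTO attendance VALUES
    (1, 10, 100, 1000, 'A'),
    (1, 10, 101, 1000, 'B'),
    (2, 10, 100, 1000, 'A');
""")

# Number of students in each course.
students_per_course = conn.execute("""
    SELECT c.title, COUNT(*)
    FROM attendance a JOIN course c ON c.course_id = a.course_id
    GROUP BY c.title ORDER BY c.title
""").fetchall()

# Total credits obtained by each student.
credits_per_student = conn.execute("""
    SELECT s.name, SUM(c.credits)
    FROM attendance a
    JOIN student s ON s.student_id = a.student_id
    JOIN course c ON c.course_id = a.course_id
    GROUP BY s.name ORDER BY s.name
""").fetchall()

print(students_per_course)
print(credits_per_student)
```

Both queries start from the fact table and join outward to one or two dimension tables, which is the typical access pattern for a star schema.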

6.2 Historical Data and Trend Analysis

Historical data is stored in the data warehouse for trend analysis. Trend analysis is a very important feature of the data warehouse. The fact table contains transaction type information. Data is inserted into the fact table without overwriting existing data, so the fact table already captures historical information. For a dimension table, some data modeling changes are required to capture historical information, because the data in a dimension table is not transaction type information. For example, based on the data model in figure 3, suppose a user wants to change a student's address. Each of the following methods has a problem:

Method: Modify the address field in the Student table.
Problem: The new address overwrites the old address.

Method: Create a new record in the Student table. The new record has the same information as the old record except the address information.
Problem: This violates the referential integrity of the Student table and creates a database error.

Method: Create a new record in the Student table. The new record has the same information as the old record except the address and the student ID.
Problem: The existing references to the old student record won't have the new address information.

To capture historical data in the dimension table, the data model has to be modified. For example, based on the data model in figure 3, the relationship between the Attendance table and the Student table becomes:

Figure 4: New relationship between the Attendance table and the Student table to capture historical data

With the data model in figure 4, the student's address can be modified by inserting a new record into the Student table.

                       Old record          New record
Student Entity ID      123                 478
Student ID             981215              981215
Student Name           John Smith          John Smith
Major                  Computer science    Computer science
Minor                  (NULL)              (NULL)
Address                1210 10 Ave. SW     1215 15 Ave. SW
Phone                  456-7815            456-7815
Effective From Date    Sept 1, 96          Sept 1, 98
Effective To Date      August 31, 98       (NULL)
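The insert-a-new-version approach can be sketched as follows with Python's sqlite3 module. The column names, the change_address helper, and the reading of "96" and "98" as 1996 and 1998 are assumptions made for this illustration. The surrogate key of the new record is assigned by the database, so it will not match the 478 shown in the table above.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE student (
        student_entity_id INTEGER PRIMARY KEY,  -- surrogate key, one per version
        student_id        INTEGER NOT NULL,     -- business key, shared by all versions
        name              TEXT,
        address           TEXT,
        effective_from    TEXT NOT NULL,        -- ISO dates stored as text
        effective_to      TEXT                  -- NULL marks the current version
    )
""")
conn.execute(
    "INSERT INTO student VALUES "
    "(123, 981215, 'John Smith', '1210 10 Ave. SW', '1996-09-01', NULL)"
)


def change_address(conn, student_id, new_address, effective_from):
    """Close the current version of a student and insert a new one."""
    entity_id, name = conn.execute(
        "SELECT student_entity_id, name FROM student "
        "WHERE student_id = ? AND effective_to IS NULL",
        (student_id,),
    ).fetchone()
    # The old version stays valid up to the day before the change.
    last_day = effective_from - timedelta(days=1)
    conn.execute(
        "UPDATE student SET effective_to = ? WHERE student_entity_id = ?",
        (last_day.isoformat(), entity_id),
    )
    cursor = conn.execute(
        "INSERT INTO student (student_id, name, address, effective_from, effective_to) "
        "VALUES (?, ?, ?, ?, NULL)",
        (student_id, name, new_address, effective_from.isoformat()),
    )
    return cursor.lastrowid


new_entity_id = change_address(conn, 981215, "1215 15 Ave. SW", date(1998, 9, 1))
history = conn.execute(
    "SELECT address, effective_from, effective_to FROM student "
    "WHERE student_id = 981215 ORDER BY effective_from"
).fetchall()
print(history)
```

Fact rows that reference the old surrogate key keep pointing at the old address, while new fact rows can reference the new version, so neither the overwrite problem nor the referential integrity problem arises.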

6.3 Snow Flake Schema

A snow flake schema is similar to a star schema. It normalizes the dimension tables to save data storage space. It can be used to represent hierarchies of information.

Figure 5: Snow Flake Schema

The Student table is normalized to contain foreign keys to the Major and Minor tables. The relationship between the Student table and the Major table is many-to-one. In other situations, if the relationship is many-to-many, normalization creates a chain of tables for the dimension table. This makes the data model more difficult for the end user to use and understand. Therefore, the use of a snow flake schema can decrease browsing performance. In addition, the storage space saved in the dimension tables is not significant in comparison to the size of the fact table, which is usually many times larger than the dimension tables.

6.4 Information Grouping for Analysis

Information can be grouped for data analysis. The grouping information can come from the original data source or from the end user. For example, the original data source contains geographical information for each facility, which allows facilities to be grouped by geographical area. A link from each facility to its book value information then allows the total book value for a geographical area to be calculated. End users can provide other custom grouping information. Storage space and a user interface are needed for end users to maintain this type of grouping information.

6.5 Summary Information

Some information is summarized before it is loaded into the data warehouse. This depends on the level of detail the users require. For example, a supermarket may have a few thousand transactions each day. These transactions can be summarized by product before being loaded into the data warehouse.
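As a minimal sketch of this summarization step, the following Python code collapses individual sales into one row per day and product. The record layout and the sample values are hypothetical, and amounts are held in cents to avoid floating-point rounding.

```python
from collections import defaultdict

# Hypothetical point-of-sale records: (sale_date, product, quantity, amount_cents).
transactions = [
    ("1998-09-01", "beer", 2, 1198),
    ("1998-09-01", "snacks", 1, 249),
    ("1998-09-01", "beer", 1, 599),
    ("1998-09-02", "snacks", 3, 747),
]


def summarize_by_product(rows):
    """Aggregate quantity and amount per (sale_date, product)."""
    totals = defaultdict(lambda: [0, 0])
    for sale_date, product, quantity, amount_cents in rows:
        totals[(sale_date, product)][0] += quantity
        totals[(sale_date, product)][1] += amount_cents
    return {key: tuple(value) for key, value in totals.items()}


daily_summary = summarize_by_product(transactions)
for (sale_date, product), (quantity, amount_cents) in sorted(daily_summary.items()):
    print(sale_date, product, quantity, amount_cents)
```

Only the summary rows would be loaded into the fact table; the trade-off is that questions about individual transactions can no longer be answered from the warehouse.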

6.6 Cross Reference Information

Cross reference information between information from different databases is very important for data analysis. For example, financial information and operational information can come from two different database systems. The financial database contains cost and revenue information for each facility. The operational database contains operation information for each facility. Cross reference information between these two database systems can enable cost analysis on operation activity.

6.7 Data Model Prototype

Prototyping is a good way to analyze the data model at an early development stage. It can demonstrate the benefit of the data warehouse strategy. To help present the data model, it can be divided into two views: the business view and the developer view. End users can use the business view to understand the system functionality. There are some limitations to prototyping. A prototype may not show:

- Data migration processes
- The system with all the data
- Performance issues
- Security features

6.8 Stage Older Data

A terabyte data warehouse requires 500 to 1000 physical disk drives plus 100 or more disk controllers. Some old data should be archived as new data is loaded into the warehouse. Old data in the fact table can be archived. For example, sales data from 7 years ago may be required by end users for analysis.

7.0 Data Transformation and Loading


The data transformation and loading process puts source data into the data warehouse. The programming logic is usually simple. Depending on the quality of the source data, however, the transformation and loading process can be time consuming. For example, if the source data is manually maintained, a lot of effort may be needed to clean it. Some records may contain bad data that has to be corrected before loading into the data warehouse.

If data comes from two different database systems and the data warehouse has to build cross reference information between them, there can be referential integrity problems. For example, one database contains operational data for each facility and the other contains financial data for each facility. The operational data may refer to a facility that does not exist in the financial data.

Every time the source data definition or the target data model changes, the associated data transformation and loading process has to be modified accordingly. If there are many changes, a lot of time is required to modify the processes. Some visual development tools specialize in developing these data transformation and loading processes. These tools have a graphical interface that allows the developer to specify the data transformation logic, which makes the processes easier to develop and maintain.

There are other ways to implement the data transformation and loading processes. They can be implemented in conventional languages such as C and COBOL. Using C can achieve a fast execution speed, which is necessary for some computation-intensive data warehouse processes.
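The referential integrity problem described above can be detected during transformation with a simple lookup. The following Python sketch assumes two hypothetical extracts keyed by a shared facility_id; the field names and figures are invented for illustration.

```python
# Hypothetical extracts from two source systems, keyed by facility_id.
operational = [
    {"facility_id": "F001", "output": 120},
    {"facility_id": "F002", "output": 80},
    {"facility_id": "F009", "output": 15},   # unknown to the financial system
]
financial = [
    {"facility_id": "F001", "cost": 900},
    {"facility_id": "F002", "cost": 650},
    {"facility_id": "F003", "cost": 300},
]


def cross_reference(operational, financial):
    """Join operational rows to financial rows and report unmatched facilities."""
    cost_by_facility = {row["facility_id"]: row["cost"] for row in financial}
    matched, orphans = [], []
    for row in operational:
        facility_id = row["facility_id"]
        if facility_id not in cost_by_facility:
            orphans.append(facility_id)  # referential integrity problem
            continue
        cost = cost_by_facility[facility_id]
        matched.append({
            "facility_id": facility_id,
            "output": row["output"],
            "cost": cost,
            "cost_per_unit": cost / row["output"],
        })
    return matched, orphans


matched, orphans = cross_reference(operational, financial)
print(matched)
print("facilities missing from financial data:", orphans)
```

The orphan list gives the data administrator a concrete set of records to correct in the source systems before the cross reference information is loaded.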

8.0 Process Control and Scheduling

Many data transformation and loading processes are needed to populate the data warehouse. These processes must run in a particular sequence and have dependencies on one another, so a control process is required to coordinate them. The control process may need to access multiple computer systems. For example, it can start a data extraction process on another database system and transfer the data to the data warehouse server for transformation and loading.
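A control process of this kind can be sketched with the graphlib module from the Python standard library (Python 3.9 or later). The process names and the run callback are hypothetical; a real control process would start jobs on the relevant systems instead of calling a local function.

```python
from graphlib import TopologicalSorter

# Hypothetical loading processes: each name maps to the set of
# processes that must finish before it can start.
dependencies = {
    "extract_financial": set(),
    "extract_operational": set(),
    "load_facility_dimension": {"extract_financial", "extract_operational"},
    "load_cost_facts": {"load_facility_dimension"},
    "build_cross_reference": {"load_cost_facts"},
}


def run_all(dependencies, run):
    """Run every process in dependency order and record its status.

    A process is skipped when any of its prerequisites did not succeed.
    """
    status = {}
    for name in TopologicalSorter(dependencies).static_order():
        prerequisites = dependencies.get(name, ())
        if any(status.get(dep) != "ok" for dep in prerequisites):
            status[name] = "skipped"
            continue
        try:
            run(name)
            status[name] = "ok"
        except Exception as exc:
            status[name] = f"failed: {exc}"
    return status


executed = []
status = run_all(dependencies, executed.append)
print(executed)
print(status)
```

TopologicalSorter raises a CycleError if the dependencies are circular, which surfaces a scheduling mistake before any process starts. The returned status dictionary corresponds to the execution status that the warehouse can store for process tracking.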

9.0 User Interfaces

There are a few types of user interface for the data warehouse system. These user interfaces can be used by the system administrator or the end user.

With these user interfaces, the system administrator can:

- Maintain user accounts
- Monitor and control data loading and transformation processes

The end user can:

- Analyze data (OLAP tool)
- Create and generate reports
- Create and execute queries
- Maintain user input data (e.g. grouping information for data analysis)

9.1 OLAP Tools

An Online Analytical Processing (OLAP) tool is used for data analysis, especially with a dimensional data model. The tool provides a front-end user interface for the user to access the data warehouse. Through the tool, the user can perform data analysis and design custom reports or queries. The user can perform joins, aggregations, sorts, roll-ups, and roll-downs on the data. Roll-down is done by adding row headers from the dimension tables, which shows the data in more detail. Roll-up is done by subtracting row headers, which summarizes the data.

Security features can be implemented with database views. A view is a logical table derived from the physical tables in the database. It provides a logical layer through which the user accesses the physical tables. For example, Employee is a physical table with the following attributes:

Employee attribute    View A    View B
First Name            X         X
Last Name             X         X
Department            X         X
Position              X         X
Phone Number          X         X
Age                   X
Salary                X

The Employee table has both View A and View B. View A can access all attributes in the Employee table. View B can access all attributes except Age and Salary. View A is used by managers in the company. View B is used by all other users.
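The two views can be sketched with Python's sqlite3 module. The table and column names are assumptions based on the attribute list above, and the sample row is invented. SQLite has no user accounts or GRANT statement, so the sketch only shows how a view narrows the visible columns; in a server database the views would be combined with access privileges for each group of users.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee (
    first_name TEXT, last_name TEXT, department TEXT, position TEXT,
    phone_number TEXT, age INTEGER, salary INTEGER
);
INSERT INTO employee VALUES ('Ann', 'Lee', 'Sales', 'Manager', '555-0100', 41, 90000);

-- View A exposes every attribute (intended for managers).
CREATE VIEW view_a AS SELECT * FROM employee;

-- View B hides age and salary (intended for all other users).
CREATE VIEW view_b AS
    SELECT first_name, last_name, department, position, phone_number
    FROM employee;
""")


def columns(conn, relation):
    """Return the column names visible through a table or view."""
    cursor = conn.execute(f"SELECT * FROM {relation}")
    return [description[0] for description in cursor.description]


print(columns(conn, "view_a"))
print(columns(conn, "view_b"))
```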

9.2 Query, Report, and Application

The data warehouse can have predefined queries and reports. With some reporting tools, users can access the reports through the company intranet. A user can subscribe to a predefined report, and the report can be generated in advance to save both user and system time. Applications can also be developed for the data warehouse. Such applications are for data analysis, not for transaction processing that updates the data in the data warehouse.

9.3 System Administration and Maintenance

Some data in the data warehouse is manually maintained. This can be system related data that the data warehouse needs in order to operate, and it is usually maintained by the system administrator. For example, the data warehouse holds information about all the data loading processes. A scheduling program can use this information to execute the data loading processes, and the execution status can be stored in the data warehouse for process tracking. The system administrator can also maintain information about user accounts and access privileges. A user interface can be developed for the administrator to maintain this information. Some lookup data and grouping data is also manually maintained. This data is used for data analysis.

10.0 Conclusion

A data warehouse is a good solution for storing and analyzing large amounts of data. It reads data from multiple operational databases on an ongoing basis. Cross reference information is generated between the data from the different databases. The data model is designed to provide good browsing performance to the end user. A data warehouse can be seen as a centralized data repository that provides both current and historical data to the end user.

References

Akmal B. Chaudhri, Mary Loomis (1998). Object Databases in Practice. Hewlett-Packard Company, Prentice-Hall.
DCI (1997). Database & Client/Server World and Data Warehouse World Seminars.
DCI (1997). The Roadmap for Data Warehouse Implementation.
Kimball (1996). Data Warehouse Toolkit. John Wiley & Sons, Inc.
