Source System: An operational system of record whose function is to capture the transactions of the business. A source system is often called a legacy system in a mainframe environment. The main priorities of the source system are uptime and availability. Queries against source systems are narrow, account-based queries that are part of the normal transaction flow and are severely restricted in their demands on the legacy system. Source systems have keys that make certain things unique, such as product keys or customer keys. We call these source system keys production keys, and we treat them as attributes, just like any other textual description. We never use the production keys as the keys within our data warehouse.

Data Staging Area: A storage area and set of processes that clean, transform, combine, de-duplicate, household, archive, and prepare source data for use in the data warehouse. The data staging area is everything between the source system and the presentation server. The key defining restriction on the data staging area is that it does not provide query and presentation services.

Presentation Server: The target physical machine on which the data warehouse data is organized and stored for direct querying by end users, report writers, and other applications. Three very different systems are required for a data warehouse to function: the source system, the data staging area, and the presentation server. The source system should be thought of as outside the data warehouse, since we have no control over the content and format of the data in the legacy system. The data staging area is the initial storage and cleaning system for data that is moving toward the presentation server, and it may consist of nothing more than a system of flat files. It is the presentation server where we insist that the data be presented and stored in a dimensional framework. If the presentation server is based on a relational database, the tables will be organized as star schemas. If the presentation server is based on non-relational on-line analytic processing (OLAP) technology, the data will still have recognizable dimensions.

Dimensional Model: A specific discipline for modeling data that is an alternative to entity-relationship (E/R) modeling. A dimensional model contains the same information as an E/R model but packages the data in a symmetric format whose design goals are user understandability, query performance, and resilience to change. The main components of a dimensional model are fact tables and dimension tables. A fact table is the primary table in each dimensional model and is meant to contain measurements of the business; we use the word fact to represent a business measure. Every fact table represents a many-to-many relationship, and every fact table contains a set of two or more foreign keys that join to their respective dimension tables. A dimension table is one of a set of companion tables to a fact table. Each dimension is defined by its primary key, which serves as the basis for referential integrity with any fact table to which it is joined. Most dimension tables contain many textual attributes (fields).
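To make the star schema idea concrete, here is a minimal sketch using Python's standard sqlite3 module as a stand-in relational presentation server. The table and column names (sales_fact, product_dim, date_dim) are illustrative, not prescribed by the architecture; note that the dimension tables carry warehouse-assigned surrogate keys, while the production key (the SKU) appears only as an attribute.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables: surrogate primary keys plus descriptive textual attributes.
# The production key (sku) from the source system is stored only as an attribute.
cur.execute("""
    CREATE TABLE product_dim (
        product_key INTEGER PRIMARY KEY,   -- surrogate key owned by the warehouse
        sku         TEXT,                  -- production key, treated as an attribute
        description TEXT,
        category    TEXT
    )""")
cur.execute("""
    CREATE TABLE date_dim (
        date_key  INTEGER PRIMARY KEY,
        full_date TEXT,
        month     TEXT,
        year      INTEGER
    )""")

# Fact table: foreign keys to each dimension plus numeric measurements (facts).
cur.execute("""
    CREATE TABLE sales_fact (
        date_key     INTEGER REFERENCES date_dim(date_key),
        product_key  INTEGER REFERENCES product_dim(product_key),
        units_sold   INTEGER,
        dollar_sales REAL
    )""")

cur.execute("INSERT INTO product_dim VALUES (1, 'A-100', 'Widget', 'Hardware')")
cur.execute("INSERT INTO date_dim VALUES (1, '2024-01-15', 'January', 2024)")
cur.execute("INSERT INTO sales_fact VALUES (1, 1, 3, 29.85)")

# A typical dimensional query: constrain on dimension attributes, aggregate the facts.
cur.execute("""
    SELECT p.category, d.month, SUM(f.dollar_sales)
    FROM sales_fact f
    JOIN product_dim p ON f.product_key = p.product_key
    JOIN date_dim d    ON f.date_key    = d.date_key
    GROUP BY p.category, d.month
""")
print(cur.fetchall())   # [('Hardware', 'January', 29.85)]
```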
Business Process: A coherent set of business activities that make sense to the business users of our data warehouses. A business process is usually a set of activities such as order processing or customer pipeline management, but business processes can overlap.

Data Mart: A logical subset of the complete data warehouse. A data warehouse is made up of the union of all its data marts. A data mart is typically sponsored and built by a single part of the business and is usually organized around a single business process. Every data mart must be represented by a dimensional model and, within a single data warehouse, all such data marts must be built from conformed dimensions and conformed facts. This is the basis of the Data Warehouse Bus Architecture. There are two contrasting points of view about top-down vs. bottom-up data warehouses. The extreme top-down perspective is that a completely centralized, tightly designed master database must be completed before parts of it are summarized and published as individual data marts. The extreme bottom-up perspective is that an enterprise data warehouse can be assembled from disparate and unrelated data marts.

Data Warehouse: The queryable source of data in the enterprise. The data warehouse is nothing more than the union of all the constituent data marts. A data warehouse is fed from the data staging area. The data warehouse manager is responsible both for the data warehouse and the data staging area.

Operational Data Store (ODS): The ODS was meant to serve as the point of integration for operational systems. This was especially important for legacy systems that grew up independent of each other. Banks, for example, typically had several independent systems set up to support different products: loans, checking accounts, savings accounts, and so on. The advent of teller support computers and the ATM helped push many banks to create an operational data store to integrate current balances and recent history from these separate accounts under one customer number.

OLAP (On-Line Analytic Processing): OLAP vendors' technology is non-relational and is almost always based on an explicit multidimensional cube of data. OLAP databases are also known as multidimensional databases, or MDDBs. OLAP installations would be classified as small, individual data marts when viewed against the full range of data warehouse applications.

ROLAP (Relational OLAP): A set of user interfaces and applications that give a relational database a dimensional flavor.

MOLAP (Multidimensional OLAP): A set of user interfaces, applications, and proprietary database technologies that have a strongly dimensional flavor.
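As a rough illustration of the difference, the sketch below (with made-up dimension values and measures) contrasts a ROLAP-style request, which is answered by aggregating relational rows on demand, with a MOLAP-style cube, in which every combination of dimension values is pre-aggregated into a directly addressable cell.

```python
from itertools import product as cross

# Detailed facts as they might sit in relational star schema tables (illustrative data).
facts = [
    {"month": "Jan", "category": "Hardware", "dollar_sales": 100.0},
    {"month": "Jan", "category": "Software", "dollar_sales": 250.0},
    {"month": "Feb", "category": "Hardware", "dollar_sales": 175.0},
]

# ROLAP flavor: each request aggregates the relational rows on demand
# (in practice the tool would emit SQL with a GROUP BY against the star schema).
def rolap_total(month, category):
    return sum(f["dollar_sales"] for f in facts
               if f["month"] == month and f["category"] == category)

# MOLAP flavor: pre-compute an explicit cube so each cell is a direct lookup.
months = ["Jan", "Feb"]
categories = ["Hardware", "Software"]
cube = {(m, c): rolap_total(m, c) for m, c in cross(months, categories)}

print(rolap_total("Jan", "Hardware"))   # 100.0, computed on demand
print(cube[("Jan", "Hardware")])        # 100.0, read from a pre-aggregated cell
```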
End User Application: A collection of tools that query, analyze, and present information targeted to support a business need.

End User Data Access Tool: A client of the data warehouse. In a relational data warehouse, such a client maintains a session with the presentation server, sending a stream of separate SQL requests to the server. Eventually the end user data access tool is done with the SQL session and turns around to present a screen of data or a report, a graph, or some other higher form of analysis to the user.

Ad Hoc Query Tool: A specific kind of end user data access tool that invites the user to form their own queries by directly manipulating relational tables and their joins.

Modeling Applications: A sophisticated kind of data warehouse client with analytic capabilities that transform or digest the output from the data warehouse. Modeling applications include:
- Forecasting models that try to predict the future
- Behavior scoring models that cluster and classify customer purchase behavior or customer credit behavior
- Allocation models that take cost data from the data warehouse and spread the costs across product groupings or customer groupings
- Most data mining tools

Metadata: All of the information in the data warehouse environment that is not the actual data itself.

Basic Processes of the Data Warehouse

Data staging is a major process that includes, among others, the following sub-processes: extracting, transforming, loading and indexing, and quality assurance checking.

Extracting. The extract step is the first step of getting data into the data warehouse environment. Extracting means reading and understanding the source data, and copying the parts that are needed to the data staging area for further work.

Transforming. Once the data is extracted into the data staging area, there are many possible transformation steps, including:
- Cleaning the data by correcting misspellings, resolving domain conflicts, dealing with missing data elements, and parsing into standard formats
- Purging selected fields from the legacy data that are not useful for the data warehouse
- Combining data sources, by matching exactly on key values or by performing fuzzy matches on non-key attributes, including looking up textual equivalents of legacy system codes
- Creating surrogate keys for each dimension record in order to avoid a dependence on legacy-defined keys, where the surrogate key generation process enforces referential integrity between the dimension tables and the fact tables
- Building aggregates for boosting the performance of common queries
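A minimal sketch of two of these transformation steps, assuming the staged extract arrives as simple Python dictionaries with made-up field names: misspelled category values are cleaned into a standard domain, and each distinct production key is assigned a warehouse surrogate key that the fact rows then reference.

```python
# Raw extract as it might land in the staging area (illustrative records).
staged_rows = [
    {"sku": "A-100", "category": "hrdware",  "units": 3, "sale_date": "2024-01-15"},
    {"sku": "B-200", "category": "Software", "units": 1, "sale_date": "2024-01-15"},
    {"sku": "A-100", "category": "Hardware", "units": 2, "sale_date": "2024-01-16"},
]

# Cleaning: resolve misspellings and domain conflicts into a standard set of values.
CATEGORY_FIXES = {"hrdware": "Hardware", "hardware": "Hardware", "software": "Software"}

def clean_category(value):
    return CATEGORY_FIXES.get(value.lower(), value.title())

# Surrogate key assignment: the production key (sku) becomes just an attribute;
# the warehouse hands out its own integer keys in arrival order.
surrogate_keys = {}   # production key -> surrogate key
product_dim = []      # dimension records
fact_rows = []        # fact records referencing surrogate keys only

for row in staged_rows:
    sku = row["sku"]
    if sku not in surrogate_keys:
        surrogate_keys[sku] = len(surrogate_keys) + 1
        product_dim.append({"product_key": surrogate_keys[sku],
                            "sku": sku,
                            "category": clean_category(row["category"])})
    fact_rows.append({"product_key": surrogate_keys[sku],
                      "sale_date": row["sale_date"],
                      "units": row["units"]})

print(product_dim)  # two dimension records, surrogate keys 1 and 2
print(fact_rows)    # three fact records joined to the dimension via surrogate keys
```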
Loading and Indexing. At the end of the transformation process, the data is in the form of load record images. Loading in the data warehouse environment usually takes the form of replicating the dimension tables and fact tables and presenting these tables to the bulk loading facilities of each recipient data mart. Bulk loading is a very important capability, to be contrasted with record-at-a-time loading, which is far slower. The target data mart must then index the newly arrived data for query performance, if it has not already done so.

Quality Assurance Checking. Quality assurance can be checked by running a comprehensive exception report over the entire set of newly loaded data. All reported values must be consistent with the time series of similar values that preceded them. The exception report is probably built with the data mart's end user report writing facility. (A small sketch of such a consistency check appears at the end of this section.)

Release/Publishing. When each data mart has been freshly loaded and quality assured, the user community must be notified that the new data is ready. Publishing also communicates the nature of any changes that have occurred in the underlying dimensions and any new assumptions that have been introduced into the measured or calculated facts.

Updating. Contrary to the original religion of the data warehouse, modern data marts may well be updated, sometimes frequently. Incorrect data should obviously be corrected. Changes in labels, changes in hierarchies, changes in status, and changes in corporate ownership often trigger necessary changes in the original data stored in the data marts that comprise the data warehouse, but in general these are managed load updates, not transactional updates.

Querying. Querying is a broad term that encompasses all the activities of requesting data from a data mart, including ad hoc querying by end users, report writing, complex decision support applications, requests from models, and full-fledged data mining. Querying never takes place in the data staging area; by definition, querying takes place on a data warehouse presentation server. Querying, obviously, is the whole point of using the data warehouse.

Auditing. At times it is critically important to know where the data came from and what calculations were performed. Audit records are linked directly to the real data in such a way that a user can ask for the audit record (the lineage) of the data at any time.

Securing. The valuable, sensitive data in the warehouse must be protected from hackers, snoopers, and industrial spies. The data warehouse team must now include a new senior member: the data warehouse security architect. Data warehouse security must be managed centrally, from a single console.

Backing Up and Recovering. Since data warehouse data is a flow of data from the legacy systems on through to the data marts and eventually onto the users' desktops, a real question arises about where to take the necessary snapshots of the data for archival purposes and disaster recovery. Additionally, it may be even more complicated to back up and recover all of the metadata that greases the wheels of the data warehouse operation.
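The quality assurance idea above can be made concrete with one very small check: compare a newly loaded daily total against the recent history of the same measure and flag anything that deviates too far. The data and the three-standard-deviation threshold are made up for illustration; a real exception report would apply many such rules across the whole load.

```python
import statistics

# Recent history of one measure (e.g., daily dollar sales) already in the data mart,
# followed by the value just loaded. All numbers are illustrative.
history = [1020.0, 980.5, 1010.0, 995.0, 1005.5, 990.0, 1015.0]
newly_loaded = 2400.0

mean = statistics.mean(history)
stdev = statistics.stdev(history)

# Flag the new value if it falls more than 3 standard deviations from recent history.
THRESHOLD_SIGMAS = 3
if abs(newly_loaded - mean) > THRESHOLD_SIGMAS * stdev:
    print(f"EXCEPTION: {newly_loaded} is inconsistent with recent history "
          f"(mean {mean:.1f}, stdev {stdev:.1f})")
else:
    print("Newly loaded value is consistent with the preceding time series.")
```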