
EXPERIMENT 3

DATE- 11.2.11

TO UNDERSTAND THE CONCEPTS INVOLVED IN DATA WAREHOUSING IMPLEMENTATION


FULL TABLE SCAN
If a query cannot use an index, solidDB must perform a full table scan to execute the query. This involves reading all rows of a table sequentially. Each row is examined to determine whether it meets the criteria of the query's WHERE clause. Finding a single row with an indexed query can be substantially faster than finding the row with a full table scan. On the other hand, a query that selects more than 15% of a table's rows may be performed faster by a full table scan than by an indexed query. You should check every query using the EXPLAIN PLAN statement. (You should use your real data when doing this, since the best plan will depend upon the actual amount of data and the characteristics of that data.) The output from the EXPLAIN PLAN statement allows you to detect whether an index is really used and if necessary you can redo the query or the index. Full table scans often cause slow response time for SELECT queries, as well as excessive disk activity. To perform a full table scan, every block in the table is read. For each block, every row stored in the block is read. To perform an indexed query, the rows are read in the order in which they appear in the index, regardless of which blocks contain them. If a block contains more than one selected row it may be read more than once. So, there are cases when a full table scan requires less I/O than an indexed query, if the result set is relatively large.
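As a rough illustration (using a hypothetical EMPLOYEE table; the exact syntax and output vary by database product), a query can be checked with EXPLAIN PLAN like this:

explain plan for
    select employee_name from employee where employee_id = 1234;

If the resulting plan shows a full table scan where index access was expected, the query or the index definition can then be revised.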

INDEX SCAN
An index scan retrieves all the rows from the table, while an index seek retrieves only selected rows. Index scan: since a scan touches every row in the table whether or not it qualifies, the cost is proportional to the total number of rows in the table. Thus, a scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate. Index seek: since a seek only touches rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table. An index scan is simply a scan of the data pages from the first page to the last. If there is an index on a table and the query touches a large amount of data, say more than 50 to 90 percent of the rows, the optimizer will simply scan all the data pages to retrieve the data rows. If there is no index at all, a table scan appears in the execution plan instead.

Index range scan: this is the retrieval of one or more ROWIDs from an index. Indexed values are generally scanned in ascending order. Index unique scan: this is the retrieval of a single ROWID from an index.
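For contrast, a minimal (hypothetical) example of a query that would normally lead to an index unique scan is an equality predicate on a primary key column:

select employee_name from employee where employee_id = 1001;

Only a single ROWID is fetched from the index, whereas the range scan discussed next may fetch many.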

INDEX RANGE SCAN
The index range scan is one of the most common access methods. During an index range scan, Oracle accesses adjacent index entries and then uses the ROWID values in the index to retrieve the table rows. An example of an index range scan is the following query:
select employee_name from employee where home_city = 'Rocky Ford';
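For this query to be resolved with a range scan, an index on the home_city column must exist; a minimal sketch (the index name is illustrative) would be:

create index employee_home_city_idx on employee (home_city);

Oracle can then read the adjacent index entries for 'Rocky Ford' and use their ROWID values to fetch the matching employee rows.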

INDEX SKIP SCAN


In previous releases a composite index could only be used if the first column, the leading edge, of the index was referenced in the WHERE clause of a statement. In Oracle9i this restriction is removed because the optimizer can perform skip scans to retrieve rowids for values that do not use the prefix.

How It Works
Rather than restricting the search path using a predicate from the statement, skip scans are initiated by probing the index for distinct values of the prefix column. Each of these distinct values is then used as a starting point for a regular index search. The result is several separate searches of a single index that, when combined, eliminate the effect of the prefix column. Essentially, the index has been searched from the second level down.

The optimizer uses statistics to decide if a skip scan would be more efficient than a full table scan.
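As a hedged illustration, assume a hypothetical EMPLOYEES table with a composite index whose leading column has very few distinct values:

create index emp_gender_id_idx on employees (gender, employee_id);

select * from employees where employee_id = 101;

Even though the WHERE clause does not reference gender, the optimizer can probe the index once for each distinct gender value and run a regular index search under each, instead of falling back to a full table scan.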

Advantages
This approach is advantageous because it reduces the number of indexes needed to support a range of queries. This increases performance by reducing index maintenance and decreases the wasted space associated with multiple indexes. The prefix column should be the most discriminating and the most widely used in queries. These two conditions do not always go hand in hand, which makes the decision difficult. In these situations skip scanning reduces the impact of making the "wrong" decision.

INDEX FAST FULL SCAN


In Oracle there are some SQL queries that can be resolved by reading the index without touching the table data. An INDEX FAST FULL SCAN is the equivalent of a FULL TABLE SCAN, but for an index: it reads the whole index using multiblock reads, but the results are NOT returned sorted. For a query to make use of an Index FFS, at least one column of the index must be defined as NOT NULL (for a single-column index, the indexed column itself; for a composite index, at least one of its columns).
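A typical candidate, sketched here on a hypothetical EMPLOYEES table, is a count over the whole table when an index exists on a NOT NULL column:

create index emp_dept_idx on employees (department_id);

select count(*) from employees;

Assuming department_id is declared NOT NULL, every row has an entry in emp_dept_idx, so the optimizer can answer the count with a fast full scan of the index using multiblock reads, without touching the table.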

INDEX JOIN
A join Index is a cross between a view and an index. It is like a view in that it is created using a query to specify the structure, composition and source of the contents. It is like an index in the way that it is used automatically by the database system to improve the performance of a query. Join indexes use the "classic" space-time tradeoff trading disk space for storage of the join index to get improved performance for queries. Unlike many other materialized view implementations, Teradata join indexes are updated immediately and automatically when changes are made to the base tables. There is never a concern that you might be using stale data when the system chooses to use a join index in the query plan. Also, Teradata uses its sophisticated coverage testing algorithm to minimize the cost associated with join index maintenance.
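A minimal sketch of a Teradata join index, with purely illustrative table and column names, might look like this:

create join index order_customer_ji as
select o.order_id, o.order_total, c.customer_name
from orders o, customers c
where o.customer_id = c.customer_id;

When a query joins orders to customers on customer_id and only needs these columns, the optimizer can satisfy it from the join index instead of performing the join at run time.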

BITMAP JOIN
A bitmap join index is a bitmap index for the join of two or more tables. For each value in a table column, the index stores the rowids of the corresponding rows in the indexed table. In contrast, a standard bitmap index is created on a single table. A bitmap join index is an efficient means of reducing the volume of data that must be joined by performing restrictions in advance. For an example of when a bitmap join index would be useful, assume that users often query the number of employees with a particular job type.
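Continuing that job-type example with hypothetical EMPLOYEES and JOBS tables, an Oracle bitmap join index that performs the join in advance could be sketched as:

create bitmap index emp_job_title_bjix
on employees (jobs.job_title)
from employees, jobs
where employees.job_id = jobs.job_id;

A query counting employees by job title can then be answered largely from the index, without joining the two tables at query time.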

PARTITIONING METHODS
Partitioning is the distribution of a table over multiple subtables that may reside on different databases or servers in order to improve read/write performance. SQL Server partitioning is typically done at the table level, and a database is considered partitioned when groups of related tables have been distributed. Tables are normally partitioned horizontally or vertically. The following tip will help you understand these SQL Server partitioning methods and determine when to use one over the other.

Vertical partitioning improves access to data
In a vertically partitioned table, columns are removed from the main table and placed in child tables through a process called denormalization. This type of partitioning makes tables narrower so that more rows fit on a database page, improving data-access performance: a single I/O operation will return more rows. By vertically partitioning your data, however, you may have to resort to joins to return the denormalized columns.

Horizontal partitioning improves overall read/write performance, sort of
A horizontally partitioned table distributes groups of rows into different tables or databases (sometimes residing on different servers) according to a criterion, which may be a primary key value, a date column or a location identifier.
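As a hedged sketch of horizontal partitioning in SQL Server (all object names are illustrative), an orders table could be spread across partitions by order date:

create partition function pf_order_year (datetime)
as range right for values ('2010-01-01', '2011-01-01');

create partition scheme ps_order_year
as partition pf_order_year all to ([PRIMARY]);

create table orders (
    order_id int not null,
    order_date datetime not null
) on ps_order_year (order_date);

Rows are then routed to a partition according to order_date, so reads, writes and maintenance operations can be confined to the relevant date range.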

DATA NORMALIZATION
What is Normalization? Normalization is the process of efficiently organizing data in a database. There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.

The Normal Forms
The database community has developed a series of guidelines for ensuring that databases are normalized. These are referred to as normal forms and are numbered from one (the lowest form of normalization, referred to as first normal form or 1NF) through five (fifth normal form or 5NF). In practical applications, you'll often see 1NF, 2NF, and 3NF along with the occasional 4NF. Fifth normal form is very rarely seen and won't be discussed here. Before we begin our discussion of the normal forms, it's important to point out that they are guidelines and guidelines only. Occasionally, it becomes necessary to stray from them to meet practical business requirements. However, when variations take place, it's extremely important to evaluate any possible ramifications they could have on your system and account for possible inconsistencies. That said, let's explore the normal forms.

First Normal Form (1NF)
First normal form (1NF) sets the very basic rules for an organized database: eliminate duplicative columns from the same table, and create separate tables for each group of related data, identifying each row with a unique column or set of columns (the primary key).

Second Normal Form (2NF)
Second normal form (2NF) further addresses the concept of removing duplicative data: meet all the requirements of the first normal form, remove subsets of data that apply to multiple rows of a table and place them in separate tables, and create relationships between these new tables and their predecessors through the use of foreign keys.

Third Normal Form (3NF)
Third normal form (3NF) goes one large step further: meet all the requirements of the second normal form, and remove columns that are not dependent upon the primary key.

Fourth Normal Form (4NF)
Finally, fourth normal form (4NF) has one additional requirement: meet all the requirements of the third normal form; a relation is in 4NF if it has no multi-valued dependencies.
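As a small illustration (table and column names are hypothetical), an orders table that repeats the customer's name and city in every order row violates third normal form, because those columns depend on the customer rather than on the order key. Normalizing moves the customer attributes into their own table:

create table customers (
    customer_id   int primary key,
    customer_name varchar(100),
    city          varchar(100)
);

create table orders (
    order_id    int primary key,
    order_date  date,
    customer_id int references customers (customer_id)
);

Each customer's name and city are now stored once, and orders refer to them through the customer_id foreign key.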

DATA DENORMALIZATION
Data denormalization is a process in which the internal schema is developed from the conceptual schema. Although it is done by adding redundant data, denormalization is actually a process of optimizing a relational database's performance. It is often resorted to when a relational database management system implements the relational model poorly; at the logical level, a true relational database management system would allow a fully normalized database while providing physical storage designed to perform at very high speed. Data denormalization is an important aspect of data modeling, the process of creating and exploring data-oriented structures taken from the real-life activities of an organization. There are generally three categories of data model: the conceptual data model, which is used to explore domain concepts with project stakeholders; the logical data model, which is used to explore the relationships among domain concepts; and the physical data model, which is used to design the internal schema of the database, focusing on the data columns of tables and the relationships between tables.
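To make the idea concrete (with hypothetical names), suppose a normalized design stores customer names only in a customers table, with orders referencing it by customer_id. Denormalizing for a read-heavy workload could copy the customer name into the orders table, trading extra storage and update cost for fewer joins:

alter table orders add customer_name varchar(100);

update orders
set customer_name = (select c.customer_name
                     from customers c
                     where c.customer_id = orders.customer_id);

Reports that need only the order and the customer's name can now skip the join, at the price of keeping the copied column in sync.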

PARALLELISM
Bit-level parallelism
From the advent of very-large-scale integration (VLSI) computer chip fabrication technology in the 1970s until about 1986, advancements in computer architecture were driven by doubling the computer word size, the amount of information the processor can manipulate per cycle. Increasing the word size reduces the number of instructions the processor must execute in order to perform an operation on variables whose sizes are greater than the length of the word. For example, consider a case where an 8-bit processor must add two 16-bit integers. The processor must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition. Thus an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction. Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with the introduction of 32-bit processors, which have been a standard in general-purpose computing for two decades. Only recently, with the advent of x86-64 architectures, have 64-bit processors become commonplace.

Instruction level parallelism


A computer program is, in essence, a stream of instructions executed by a processor. These instructions can be re-ordered and combined into groups which are then executed in parallel without changing the result of the program. This is known as instruction level parallelism. Advancements in instruction level parallelism dominated computer architecture from the mid-1980s until the mid-1990s. Modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on that instruction in that stage. In other words, a processor with N pipeline stages can have up to N different instructions at different stages of completion. The canonical example of a pipelined processor is a RISC processor with five stages: instruction fetch, decode, execute, memory access, write back. The Pentium 4 processor had a 35-stage pipeline. In addition to instruction level parallelism from pipelining, some processors can issue more than one instruction at a time. These are known as superscalar processors. Instructions can be grouped together only if there is no data dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction level parallelism.

Data parallelism
Data parallelism is parallelism inherent in program loops, which focuses on distributing the data across different computing nodes to be processed in parallel. "Parallelizing loops often leads to similar (not necessarily identical) operation sequences or functions being performed on elements of a large data structure." Many scientific and engineering applications exhibit data parallelism. A loop-carried dependency is the property of a loop iteration that it depends on the output of one or more previous iterations. Loop-carried dependencies prevent the parallelization of loops. For example, consider the following pseudocode that computes the first few Fibonacci numbers:

prev2 := 0
prev1 := 1
cur := 1
do:
    cur := prev1 + prev2
    prev2 := prev1
    prev1 := cur
while (cur < 10)

This loop cannot be parallelized because cur depends on its own previous value (carried forward in prev1) and on prev2, both of which are computed in each loop iteration. Since each iteration depends on the result of the previous one, the iterations cannot be done in parallel. As the size of a problem gets bigger, the amount of data parallelism available usually does as well.

Cardinality
In the implementation of structured query language (SQL), the term data cardinality is used to mean the uniqueness of the data values contained in a particular column, known as an attribute, of a database table. High data cardinality refers to the case where the values of a data column are mostly unique. For example, a data column holding social security numbers should always be unique for each person; this is an example of very high cardinality. The same goes for email addresses and user names. Automatically generated numbers are of very high data cardinality: for instance, a column named USER_ID would contain values that are automatically incremented every time a new user is added. Normal data cardinality refers to the case where values of a data column are somewhat uncommon but not unique. For example, a CLIENT table having a data column containing LAST_NAME values can be said to be of normal data cardinality, as there may be several entries of the same last name, like Jones, alongside many other varied names in the same column. On close inspection of the LAST_NAME column, one can see clumps of identical last names side by side with unique last names.

Low data cardinality refers to the case where the values of a data column are very common. Some table columns take only a very limited set of values; for instance, Boolean columns can only take 0 or 1, yes or no, true or false. Other table columns with low cardinality are status flags. Yet another example of low data cardinality is a gender attribute, which can take only the two values male or female.
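A rough way to gauge the cardinality of a column (using the CLIENT table from the example above) is to compare the number of distinct values with the total row count:

select count(distinct last_name) as distinct_last_names,
       count(*) as total_rows
from client;

A distinct count close to the total row count indicates high cardinality, while only a handful of distinct values (as for a gender or status column) indicates low cardinality.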

BITMAP INDEXING
A bitmap index is a special kind of database index that uses bitmaps. Bitmap indexes have traditionally been considered to work well for data such as gender, which has a small number of distinct values, for example male and female, but many occurrences of those values. This would happen if, for example, you had gender data for each resident in a city. Bitmap indexes have a significant space and performance advantage over other structures for such data. However, some researchers argue that bitmap indexes are also useful for unique-valued data which is not updated frequently.[1] Bitmap indexes use bit arrays (commonly called bitmaps) and answer queries by performing bitwise logical operations on these bitmaps. Bitmap indexes are also useful in data warehousing applications for joining a large fact table to smaller dimension tables such as those arranged in a star schema. In other scenarios, a B-tree index would be more appropriate.
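A minimal sketch of creating such an index in Oracle (the table and column names are illustrative):

create bitmap index resident_gender_bix on residents (gender);

A query such as counting residents by gender can then be answered by combining the per-value bitmaps with bitwise operations rather than by visiting the table rows.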

QUERY OPTIMIZER
The query optimizer is the component of a database management system that attempts to determine the most efficient way to execute a query. The optimizer considers the possible query plans for a given input query, and attempts to determine which of those plans will be the most efficient. Cost-based query optimizers assign an estimated "cost" to each possible query plan, and choose the plan with the smallest cost. Costs are used to estimate the runtime cost of evaluating the query, in terms of the number of I/O operations required, the CPU requirements, and other factors determined from the data dictionary. The set of query plans examined is formed by examining the possible access paths (e.g. index scan, sequential scan) and join algorithms (e.g. sort-merge join, hash join, nested loop join). The search space can become quite large depending on the complexity of the SQL query.
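Because those cost estimates come from statistics held in the data dictionary, keeping the statistics current is what allows the optimizer to compare plans sensibly. In Oracle, for example, table statistics can be refreshed with the DBMS_STATS package (the schema and table names here are illustrative):

exec dbms_stats.gather_table_stats('SALES_OWNER', 'ORDERS');

With up-to-date row counts and column statistics, the optimizer is better placed to choose between, say, a nested loop join driven by an index and a hash join over full scans.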
