
SIES GRADUATE SCHOOL OF TECHNOLOGY NERUL, NAVI MUMBAI DEPARTMENT OF COMPUTER ENGG

SEM: VI    DATA WAREHOUSING & MINING

BRANCH: CE

LIST OF PROGRAMS:
1. Build & Edit Cube
2. Design Storage and Process the Cube
3. K-Nearest Neighbors (KNN) Algorithm
4. K-Means Algorithm
5. Naïve Bayesian Classifier
6. Decision Tree
7. Nearest Neighbor Clustering Algorithm
8. Agglomerative Clustering Algorithm
9. DBSCAN Clustering Algorithm
10. Apriori Algorithm


PROGRAM NO. 1: Build & Edit Cube

Aim: To build and edit a cube.

Theory: Build a Cube
A cube is a multidimensional structure of data. Cubes are defined by a set of dimensions and measures. Modeling data multidimensionally facilitates online business analysis and query performance. Analysis Manager allows you to turn data stored in relational databases into meaningful, easy-to-navigate business information by creating a data cube. The most common way of managing relational data for multidimensional use is with a star schema. A star schema consists of a single fact table and multiple dimension tables linked to the fact table.

Scenario: You are a database administrator working for the FoodMart Corporation. FoodMart is a large grocery store chain with sales in the United States, Mexico, and Canada. The marketing department wants to analyze all of the sales by products and customers that were made during the 1998 calendar year. Using data that is stored in the company's data warehouse, you will build a multidimensional data structure (a cube) to enable fast response times when the marketing analysts query the database. We will build a cube that will be used for sales analysis.

How to open the Cube Wizard
In the Analysis Manager tree pane, under the Tutorial database, right-click the Cubes folder, point to New Cube, and then click Wizard.


How to add measures to the cube
Measures are the quantitative values in the database that you want to analyze. Commonly used measures are sales, cost, and budget data. Measures are analyzed against the different dimension categories of a cube.
1. In the Welcome step of the Cube Wizard, click Next.
2. In the Select a fact table from a data source step, expand the Tutorial data source, and then click sales_fact_1998.
3. You can view the data in the sales_fact_1998 table by clicking Browse data. After you finish browsing data, close the Browse data window, and then click Next.
4. To define the measures for your cube, under Fact table numeric columns, double-click store_sales. Repeat this procedure for the store_cost and unit_sales columns, and then click Next.

How to build your Time dimension
1. In the Select the dimensions for your cube step of the wizard, click New Dimension. This calls the Dimension Wizard.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A single dimension table, and then click Next.
4. In the Select the dimension table step, click time_by_day. You can view the data contained in the time_by_day table by clicking Browse Data. When you are finished viewing the time_by_day table, click Next.
5. In the Select the dimension type step, select Time dimension, and then click Next.


6. Next, you will define the levels for your dimension. In the Create the time dimension levels step, click Select time levels, click Year, Quarter, Month, and then click Next.
7. In the Select advanced options step, click Next.
8. In the last step of the wizard, enter Time for the name of your new dimension.
9. Click Finish to return to the Cube Wizard.
10. In the Cube Wizard, you should now see the Time dimension in the Cube dimensions list.


How to build your Product dimension
1. Click New Dimension again. In the Welcome to the Dimension Wizard step, click Next.
2. In the Choose how you want to create the dimension step, select Snowflake Schema: Multiple, related dimension tables, and then click Next.
3. In the Select the dimension tables step, double-click product and product_class to add them to Selected tables. Click Next.
4. The two tables you selected in the previous step and the existing join between them are displayed in the Create and edit joins step of the Dimension Wizard. Click Next.

5. To define the levels for your dimension, under Available columns, double-click the product_category, product_subcategory, and brand_name columns, in that order. After you double-click each column, its name appears under Dimension levels. Click Next after you have selected all three columns.
6. In the Specify the member key columns step, click Next.
7. In the Select advanced options step, click Next.
8. In the last step of the wizard, enter Product in the Dimension name box, and leave the Share this dimension with other cubes box selected. Click Finish.

9. You should see the Product dimension in the Cube dimensions list.

How to build your Customer dimension
1. Click New Dimension.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A single dimension table, and then click Next.
4. In the Select the dimension table step, click Customer, and then click Next.
5. In the Select the dimension type step, click Next.
6. To define the levels for your dimension, under Available columns, double-click the Country, State_Province, City, and lname columns, in that order. After you double-click each column, its name appears under Dimension levels. After you have selected all four columns, click Next.
7. In the Specify the member key columns step, click Next.
8. In the Select advanced options step, click Next.
9. In the last step of the wizard, enter Customer in the Dimension name box, and leave the Share this dimension with other cubes box selected. Click Finish.
10. In the Cube Wizard, you should see the Customer dimension in the Cube dimensions list.

How to build your Store dimension
1. Click New Dimension.
2. In the Welcome step, click Next.
3. In the Choose how you want to create the dimension step, select Star Schema: A single dimension table, and then click Next.
4. In the Select the dimension table step, click Store, and then click Next.
5. In the Select the dimension type step, click Next.
6. To define the levels for your dimension, under Available columns, double-click the store_country, store_state, store_city, and store_name columns, in that order. After you double-click each column, its name will appear under Dimension levels. After you have selected all four columns, click Next.
7. In the Specify the member key columns step, click Next.
8. In the Select advanced options step, click Next.
9. In the last step of the wizard, enter Store in the Dimension name box, and leave the Share this dimension with other cubes box selected. Click Finish.
10. In the Cube Wizard, you should see the Store dimension in the Cube dimensions list.

How to finish building your cube
1. In the Cube Wizard, click Next.
2. Click Yes when prompted by the Fact Table Row Count message.

3. In the last step of the Cube Wizard, name your cube Sales, and then click Finish.
4. The wizard closes and then launches Cube Editor, which contains the cube you just created. By clicking on the blue or yellow title bars, arrange the tables so that they match the following illustration.


Edit a Cube
You can make changes to an existing cube by using Cube Editor. You may want to browse a cube's data and examine or edit its structure. In addition, Cube Editor allows you to perform other procedures (these are described in SQL Server Books Online).

Scenario: You realize that you need to add another level of information to the cube, so that you can analyze customers based on their demographic information.

How to edit your cube in Cube Editor
You can use two methods to get to Cube Editor:
- In the Analysis Manager tree pane, right-click an existing cube, and then click Edit.
- Create a new cube using Cube Editor directly. This method is not recommended unless you are an advanced user.
If you are continuing from the previous section, you should already be in Cube Editor. In the schema pane of Cube Editor, you can see the fact table (with a yellow title bar) and the joined dimension tables (blue title bars). In the Cube Editor tree pane, you can preview the structure of your cube in a hierarchical tree. You can edit the properties of the cube by clicking the Properties button at the bottom of the left pane.

How to add a dimension to an existing cube
At this point, you decide you need a new dimension to provide data on product promotions. You can easily build this dimension in Cube Editor.
1. In Cube Editor, on the Insert menu, click Tables.
2. In the Select table dialog box, click the promotion table, click Add, and then click Close.
3. To define the new dimension, double-click the promotion_name column in the promotion table.
4. In the Map the Column dialog box, select Dimension, and then click OK.

5. Select the Promotion Name dimension in the tree view.



6. On the Edit menu, click Rename.
7. Type Promotion, and then press ENTER.
8. Save your changes.
9. Close Cube Editor. When prompted to design the storage, click No. You will design storage in a later section.

Conclusion: Thus, the cube is successfully built and edited.


PROGRAM NO. 2: Design Storage and Process the Cube

Aim: To design storage and process the cube

Theory: You can design storage options for the data and aggregations in your cube. Before you can use or browse the data in your cubes, you must process them. You can choose from three storage modes: multidimensional OLAP (MOLAP), relational OLAP (ROLAP), and hybrid OLAP (HOLAP). Microsoft SQL Server 2000 Analysis Services allows you to set up aggregations. Aggregations are precalculated summaries of data that greatly improve the efficiency and response time of queries. When you process a cube, the aggregations designed for the cube are calculated and the cube is loaded with the calculated aggregations and data. For more information, see SQL Server Books Online.

Scenario: Now that you have designed the structure of the Sales cube, you need to choose the storage mode it will use and designate the amount of precalculated values to store. After this is done, the cube needs to be populated with data. In this section you will select MOLAP as your storage mode, create the aggregation design for the Sales cube, and then process the cube. Processing the Sales cube loads data from the ODBC source and calculates the summary values as defined in the aggregation design.

How to design storage by using the Storage Design Wizard
1. In the Analysis Manager tree pane, expand the Cubes folder, right-click the Sales cube, and then click Design Storage.
2. In the Welcome step, click Next.
3. Select MOLAP as your data storage type, and then click Next.

4. Under Set Aggregation Options, click Performance gain reaches. In the box, enter 40 to indicate the percentage. You are instructing Analysis Services to give a performance boost of up to 40 percent, regardless of how much disk space this requires. Administrators can use this tuning ability to balance the need for query performance against the disk space required to store aggregation data.
5. Click Start.
6. You can watch the Performance vs. Size graph in the right side of the wizard while Analysis Services designs the aggregations. Here you can see how increasing performance gain requires additional disk space utilization. When the process of designing aggregations is complete, click Next.
7. Under What do you want to do?, select Process now, and then click Finish. Note: Processing the aggregations may take some time.
8. In the window that appears, you can watch your cube while it is being processed. When processing is complete, a message appears confirming that the processing was completed successfully.
9. Click Close to return to the Analysis Manager tree pane.


Browse Cube Data
Using Cube Browser, you can look at data in different ways: you can filter the amount of dimension data that is visible, you can drill down to see greater detail, and you can drill up to see less detail.

Scenario: Now that the Sales cube is processed, data is available for analysis. In this section, you will use Cube Browser to slice and dice through the sales data.

How to view cube data using Cube Browser
1. In the Analysis Manager tree pane, right-click the Sales cube, and then click Browse Data.
2. Cube Browser appears, displaying a grid made up of one dimension and the measures of your cube. The additional four dimensions appear at the top of the browser.

How to replace a dimension in the grid
1. To replace one dimension in the grid with another, drag the dimension from the top box and drop it directly on top of the column you want to exchange it with. Make sure the pointer appears with a double-ended arrow during this process.


2. Using this drag-and-drop technique, select the Product dimension button and drag it to the grid, dropping it directly on top of Measures. The Product and Measures dimensions will switch positions in Cube Browser.

How to filter your data by time
1. Click the arrow next to the Time dimension.
2. Expand All Time and 1998, and then click Quarter 1. The data in the grid is filtered to reflect figures for only that one quarter.


How to drill down
1. Switch the Product and Customer dimensions using the drag-and-drop technique. Click Product and drag it on top of Country.
2. Double-click the cell in your grid that contains Baking Goods. The cube expands to include the subcategory column.

Use the above techniques to move dimensions to and from the grid. This will help you understand how Analysis Manager puts information about complex data relationships at your fingertips.
3. When you are finished, click Close to close Cube Browser.

Conclusion: Thus, we have successfully designed storage and processed the cube.


PROGRAM NO. 3: K-Nearest Neighbors (KNN) Algorithm

Aim: To implement KNN algorithm in Java

Theory: KNN is a non-parametric method for pattern classification. In pattern recognition, the k-nearest neighbor algorithm (KNN) classifies objects based on the closest training examples in the feature space. KNN is a type of instance-based learning, or lazy learning, where the function is only approximated locally and all computation is deferred until classification. The k-nearest neighbor algorithm is among the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, the object is simply assigned to the class of its nearest neighbor. In the classification phase, k is a user-defined constant. Usually, Euclidean distance is used as the distance metric.

Consider a two-class problem where each sample consists of two measurements (x, y). For a given query point q, assign the class of the nearest neighbour. Compute the k nearest neighbors and assign the class by majority vote.


[Figures: classification of the query point q for K = 1 and for K = 3]

For classification, compute the confidence for each class as Ci / K, where Ci is the number of patterns among the K nearest patterns belonging to class i. The classification for the input pattern is the class with the highest confidence.
Advantages: No training is required, and a confidence level can be obtained.
Disadvantages: Classification accuracy is low if a complex decision-region boundary exists, and large storage is required.
Conclusion: Thus, KNN is successfully implemented in Java and tested on a training database.
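The following minimal Java sketch illustrates the KNN procedure described above (majority vote with confidences Ci / K). It is an illustrative sketch only; the training points, labels, query point and value of k in main() are arbitrary example values.

import java.util.*;

public class KnnDemo {

    // Classify the query by majority vote among its k nearest training points
    // (Euclidean distance); also prints the confidence Ci / K of each class.
    static int classify(double[][] train, int[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // sort training indices by distance to the query point
        Arrays.sort(idx, Comparator.comparingDouble(
                i -> Math.hypot(train[i][0] - query[0], train[i][1] - query[1])));
        Map<Integer, Integer> votes = new HashMap<>();
        for (int n = 0; n < k; n++) votes.merge(labels[idx[n]], 1, Integer::sum);
        int best = -1, bestVotes = -1;
        for (Map.Entry<Integer, Integer> e : votes.entrySet()) {
            System.out.printf("class %d confidence = %.2f%n",
                    e.getKey(), e.getValue() / (double) k);
            if (e.getValue() > bestVotes) { best = e.getKey(); bestVotes = e.getValue(); }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] train = {{1, 1}, {2, 1}, {1, 2}, {6, 6}, {7, 5}, {6, 7}};
        int[] labels     = { 0,      0,      0,      1,      1,      1 };
        System.out.println("predicted class = "
                + classify(train, labels, new double[]{5.5, 5.0}, 3));
    }
}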


PROGRAM NO. 4: K-Means Algorithm

Aim: To implement the K-means algorithm in Java

Theory: Clustering allows for unsupervised learning. That is, the machine/software will learn on its own, using the data (learning set), and will classify the objects into a particular class. K-means is a partitional clustering approach: each cluster is associated with a centroid (center point), each point is assigned to the cluster with the closest centroid, and the number of clusters, K, must be specified.

Algorithm:
1. Select K points as the initial centroids.
2. repeat
3. Form K clusters by assigning all points to the closest centroid.
4. Recompute the centroid of each cluster.
5. until the centroids don't change

Initial centroids are often chosen randomly. The centroid is (typically) the mean of the points in the cluster. Closeness is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge, i.e., after some iterations the centroids stop moving.

K-means Example:
Problem: Cluster the following eight points (with (x, y) representing locations) into three clusters: A1(2, 10), A2(2, 5), A3(8, 4), A4(5, 8), A5(7, 5), A6(6, 4), A7(1, 2), A8(4, 9). The initial cluster centers are A1(2, 10), A4(5, 8) and A7(1, 2). The distance function between two points a = (x1, y1) and b = (x2, y2) is defined as ρ(a, b) = |x2 - x1| + |y2 - y1|. Use the k-means algorithm to find the three cluster centers after the second iteration.

Solution: First we list all points in the first column of the table below. The initial cluster centers (means) are (2, 10), (5, 8) and (1, 2), chosen randomly. Next, we calculate the distance from the first point (2, 10) to each of the three means, using the distance function:

For the point (2, 10) and mean1 (2, 10):
ρ(point, mean1) = |x2 - x1| + |y2 - y1| = |2 - 2| + |10 - 10| = 0 + 0 = 0

Iteration 1:

Point        Dist. to Mean 1 (2, 10)   Dist. to Mean 2 (5, 8)   Dist. to Mean 3 (1, 2)   Cluster
A1 (2, 10)            0                         5                        9                  1
A2 (2, 5)             5                         6                        4                  3
A3 (8, 4)            12                         7                        9                  2
A4 (5, 8)             5                         0                       10                  2
A5 (7, 5)            10                         5                        9                  2
A6 (6, 4)            10                         5                        7                  2
A7 (1, 2)             9                        10                        0                  3
A8 (4, 9)             3                         2                       10                  2

Cluster 1: (2, 10)
Cluster 2: (8, 4), (5, 8), (7, 5), (6, 4), (4, 9)
Cluster 3: (2, 5), (1, 2)

Next, we need to re-compute the new cluster centers (means). We do so by taking the mean of all points in each cluster. For Cluster 1, we only have one point, A1(2, 10), which was the old mean, so the cluster center remains the same.

For Cluster 2, we have ((8 + 5 + 7 + 6 + 4)/5, (4 + 8 + 5 + 4 + 9)/5) = (6, 6).
For Cluster 3, we have ((2 + 1)/2, (5 + 2)/2) = (1.5, 3.5).
That was Iteration 1. Next, we go to Iteration 2, Iteration 3, and so on until the means do not change anymore. In Iteration 2, we repeat the process from Iteration 1, this time using the new means we computed.
After the 2nd iteration, the result is 1: {A1, A8}, 2: {A3, A4, A5, A6}, 3: {A2, A7}, with centers C1 = (3, 9.5), C2 = (6.5, 5.25) and C3 = (1.5, 3.5).
After the 3rd iteration, the result is 1: {A1, A4, A8}, 2: {A3, A5, A6}, 3: {A2, A7}, with centers C1 = (3.66, 9), C2 = (7, 4.33) and C3 = (1.5, 3.5).
Conclusion: Thus, we have successfully implemented K-means in Java and tested it on a variety of training databases.
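For illustration, a minimal Java sketch of the K-means loop described above is given below. It is an illustrative sketch rather than a full program; it simply reuses the eight points, the Manhattan distance and the initial centers A1, A4, A7 from the worked example.

import java.util.*;

public class KMeansDemo {

    static double manhattan(double[] a, double[] b) {
        return Math.abs(a[0] - b[0]) + Math.abs(a[1] - b[1]);
    }

    public static void main(String[] args) {
        double[][] pts = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
        double[][] centroids = {{2,10},{5,8},{1,2}};        // initial centers A1, A4, A7
        int k = centroids.length;
        int[] assign = new int[pts.length];

        boolean changed = true;
        while (changed) {                                   // repeat until centroids stop moving
            changed = false;
            // assignment step: each point goes to its closest centroid
            for (int i = 0; i < pts.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++)
                    if (manhattan(pts[i], centroids[c]) < manhattan(pts[i], centroids[best])) best = c;
                assign[i] = best;
            }
            // update step: recompute each centroid as the mean of its points
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0; int n = 0;
                for (int i = 0; i < pts.length; i++)
                    if (assign[i] == c) { sx += pts[i][0]; sy += pts[i][1]; n++; }
                if (n == 0) continue;                        // leave empty clusters unchanged
                double[] mean = {sx / n, sy / n};
                if (manhattan(mean, centroids[c]) > 1e-9) { centroids[c] = mean; changed = true; }
            }
        }
        for (int c = 0; c < k; c++)
            System.out.println("cluster " + (c + 1) + " center = " + Arrays.toString(centroids[c]));
        System.out.println("assignments = " + Arrays.toString(assign));
    }
}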


PROGRAM NO. 5: Naïve Bayesian Classifier

Aim: To implement the Naïve Bayesian Classifier

Theory: The Naive Bayes Classifier technique is based on the so-called Bayesian theorem and is particularly suited when the dimensionality of the inputs is high. Despite its simplicity, Naive Bayes can often outperform more sophisticated classification methods.

To demonstrate the concept of Naïve Bayes classification, consider the example displayed in the illustration above. As indicated, the objects can be classified as either GREEN (light color) or RED (dark color). Our task is to classify new cases as they arrive, i.e., decide to which class label they belong, based on the currently existing objects. Since there are twice as many GREEN objects as RED, it is reasonable to believe that a new case (which hasn't been observed yet) is twice as likely to have membership GREEN rather than RED. In Bayesian analysis, this belief is known as the prior probability. Prior probabilities are based on previous experience, in this case the percentage of GREEN and RED objects, and are often used to predict outcomes before they actually happen. Thus, we can write:

Prior probability of GREEN = number of GREEN objects / total number of objects
Prior probability of RED = number of RED objects / total number of objects


Since there is a total of 60 objects, 40 of which are GREEN and 20 RED, our prior probabilities for class membership are:

Prior probability of GREEN = 40/60
Prior probability of RED = 20/60

Having formulated our prior probability, we are now ready to classify a new object X (the WHITE circle). Since the objects are well clustered, it is reasonable to assume that the more GREEN (or RED) objects in the vicinity of X, the more likely it is that the new case belongs to that particular color. To measure this likelihood, we draw a circle around X which encompasses a number (to be chosen a priori) of points irrespective of their class labels. Then we calculate the number of points in the circle belonging to each class label. From this we calculate the likelihood:

Likelihood of X given GREEN = number of GREEN objects in the vicinity of X / total number of GREEN objects
Likelihood of X given RED = number of RED objects in the vicinity of X / total number of RED objects

From the illustration above, it is clear that the likelihood of X given GREEN is smaller than the likelihood of X given RED, since the circle encompasses 1 GREEN object and 3 RED ones. Thus:

Likelihood of X given GREEN = 1/40
Likelihood of X given RED = 3/20


Although the prior probabilities indicate that X may belong to GREEN (given that there are twice as many GREEN as RED), the likelihood indicates otherwise: the class membership of X is RED (given that there are more RED objects in the vicinity of X than GREEN). In Bayesian analysis, the final classification is produced by combining both sources of information, i.e., the prior and the likelihood, to form a posterior probability using the so-called Bayes' rule:

Posterior probability of X being GREEN = prior probability of GREEN × likelihood of X given GREEN = (40/60) × (1/40) = 1/60
Posterior probability of X being RED = prior probability of RED × likelihood of X given RED = (20/60) × (3/20) = 1/20

Finally, we classify X as RED since its class membership achieves the largest posterior probability.
Conclusion: Thus, we have successfully implemented the Naïve Bayesian Classifier in Java and tested it on a variety of training databases.
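The small Java sketch below reproduces the prior/likelihood/posterior arithmetic of the GREEN vs RED example. It is illustrative only; the counts 40, 20, 1 and 3 are taken from the text above.

public class NaiveBayesDemo {

    public static void main(String[] args) {
        int totalGreen = 40, totalRed = 20;      // all objects in the example
        int nearGreen = 1, nearRed = 3;          // objects inside the circle around X

        double priorGreen = totalGreen / (double) (totalGreen + totalRed);   // 40/60
        double priorRed   = totalRed   / (double) (totalGreen + totalRed);   // 20/60

        double likelihoodGreen = nearGreen / (double) totalGreen;            // 1/40
        double likelihoodRed   = nearRed   / (double) totalRed;              // 3/20

        double posteriorGreen = priorGreen * likelihoodGreen;                // Bayes' rule (unnormalized)
        double posteriorRed   = priorRed   * likelihoodRed;

        System.out.printf("posterior(GREEN) = %.4f%n", posteriorGreen);
        System.out.printf("posterior(RED)   = %.4f%n", posteriorRed);
        System.out.println("X classified as " + (posteriorRed > posteriorGreen ? "RED" : "GREEN"));
    }
}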


PROGRAM NO. 6: Decision Tree

Aim: To implement Decision Tree using ID3 algorithm in Java

Theory:

Decision Tree
Decision trees are a most useful, powerful and popular tool for classification and prediction due to their simplicity, accuracy, ease of use and understanding, and speed. The decision tree approach divides the search space into rectangular regions. A decision tree represents rules. Rules can be easily expressed and understood by humans. They can also be used directly in the database access language SQL, so that records falling into a particular category may be retrieved. A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a classification or decision.

For example:


ID3
ID3 stands for Iterative Dichotomiser 3, invented by J. Ross Quinlan in 1979. It builds the tree from the top down, with no backtracking. Information gain is used to select the most useful attribute for classification. ID3 is a precursor to the C4.5 algorithm. The main aim is to minimize the expected number of comparisons. The basic idea of the ID3 algorithm is to construct the decision tree by employing a top-down, greedy search through the given sets to test each attribute at every tree node. In order to select the attribute that is most useful for classifying a given set, we use a metric: information gain.

The main ideas behind the ID3 algorithm are:
- Each non-leaf node of a decision tree corresponds to an input attribute, and each arc to a possible value of that attribute. A leaf node corresponds to the expected value of the output attribute when the path from the root node to that leaf node describes the input attributes.
- In a good decision tree, each non-leaf node should correspond to the input attribute which is the most informative (lowest entropy) about the output attribute amongst all the input attributes not yet considered in the path from the root node to that node.
- Entropy is used to determine how informative a particular input attribute is about the output attribute for a subset of the training data.

ID3 Process:
1. Take all unused attributes and calculate their entropies.
2. Choose the attribute that has the lowest entropy, i.e., for which the information gain is maximum.
3. Make a node containing that attribute.


Entropy: The concept used to quantify information is called entropy. Entropy measures the randomness in data. For example: a completely homogeneous sample has an entropy of 0 (if all values are the same, the entropy is zero, as there is no randomness), while an equally divided sample has an entropy of 1 (if the values vary, the entropy is non-zero, because there is randomness).

Formula of Entropy:
Entropy(S) = - Σ p_i log2(p_i)
where S is the data sample and p_i is the proportion of S belonging to class i.
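A small Java helper showing how this entropy formula is evaluated is given below; it is an illustrative sketch, and the class counts used in main() are example values rather than data from the text.

public class EntropyDemo {

    // Entropy(S) = -sum(p_i * log2(p_i)) over the class proportions p_i.
    static double entropy(int[] classCounts) {
        int total = 0;
        for (int c : classCounts) total += c;
        double h = 0.0;
        for (int c : classCounts) {
            if (c == 0) continue;                 // 0 * log(0) is taken as 0
            double p = c / (double) total;
            h -= p * (Math.log(p) / Math.log(2)); // logarithm to base 2
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(entropy(new int[]{5, 5}));   // equally divided sample -> 1.0
        System.out.println(entropy(new int[]{10, 0}));  // homogeneous sample     -> 0.0
    }
}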


Conclusion: Thus, a decision tree using ID3 is successfully implemented in Java and tested on a training database.

PROGRAM NO. 7: Nearest Neighbor Clustering Algorithm

Aim: To implement the Nearest Neighbor Clustering Algorithm in Java

Theory:
Basic idea: A new instance either forms a new cluster or is merged into an existing one, depending on how close it is to the existing clusters. A threshold t is used to determine whether to merge or to create a new cluster. The number of clusters k is not required as an input. Complexity depends on the number of items: in each loop, each item must be compared to every item already placed in a cluster (n comparisons in the worst case). Time complexity: O(n^2); space complexity: O(n^2).

Example: Given 5 items with the distances between them, cluster them using the nearest neighbor algorithm with threshold t = 1.5.

Item   A   B   C   D   E
A      0   1   2   2   3
B      1   0   2   4   3
C      2   2   0   1   5
D      2   4   1   0   3
E      3   3   5   3   0


Item A is put into cluster K1 = {A}.
For item B: dist(A, B) = 1, which is less than the threshold, so B is included in cluster K1. K1 = {A, B}.
For item C: dist(A, C) = 2 and dist(B, C) = 2 are both more than the threshold. The threshold is not satisfied, so a new cluster is created: K2 = {C}.
For item D: dist(A, D) = 2 and dist(B, D) = 4 are more than the threshold, but dist(C, D) = 1 is less than the threshold, so D is included in cluster K2. K1 = {A, B}, K2 = {C, D}.
For item E: dist(A, E) = 3, dist(B, E) = 3, dist(C, E) = 5 and dist(D, E) = 3 are all more than the threshold. The threshold is not satisfied, so a new cluster is created: K3 = {E}.
Final clustering output: K1 = {A, B}, K2 = {C, D}, K3 = {E}.
Conclusion: Thus, we have successfully implemented nearest neighbor clustering in Java and tested it on a variety of training databases.
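A minimal Java sketch of this threshold-based nearest neighbor clustering is shown below; it is illustrative only, reusing the 5-item distance matrix and threshold t = 1.5 from the example.

import java.util.*;

public class NearestNeighborClusteringDemo {

    public static void main(String[] args) {
        String[] items = {"A", "B", "C", "D", "E"};
        double[][] dist = {
            {0, 1, 2, 2, 3},
            {1, 0, 2, 4, 3},
            {2, 2, 0, 1, 5},
            {2, 4, 1, 0, 3},
            {3, 3, 5, 3, 0}};
        double threshold = 1.5;

        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < items.length; i++) {
            List<Integer> target = null;
            // merge into an existing cluster if some member is within the threshold
            for (List<Integer> c : clusters) {
                for (int j : c) {
                    if (dist[i][j] <= threshold) { target = c; break; }
                }
                if (target != null) break;
            }
            if (target == null) {                 // otherwise start a new cluster
                target = new ArrayList<>();
                clusters.add(target);
            }
            target.add(i);
        }
        for (int c = 0; c < clusters.size(); c++) {
            StringBuilder sb = new StringBuilder();
            for (int i : clusters.get(c)) sb.append(items[i]).append(' ');
            System.out.println("K" + (c + 1) + " = { " + sb.toString().trim() + " }");
        }
    }
}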


PROGRAM NO. 8: Agglomerative Clustering Algorithm

Aim: To implement the Agglomerative Clustering Algorithm

Theory: Agglomerative hierarchical clustering
Data objects are grouped in a bottom-up fashion. Initially, each data object is in its own cluster. These atomic clusters are then merged into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. The user can specify a termination condition, such as the desired number of clusters. The output is a dendrogram, which can be represented as a set of ordered triples <d, k, K>, where d is the threshold distance, k is the number of clusters, and K is the set of clusters.

Dendrogram: It is a tree data structure which illustrates hierarchical clustering techniques. Each level shows the clusters for that level:
- Leaf: individual clusters
- Root: one cluster
A cluster at level i is the union of its child clusters at level i + 1.


Given a set of N items to be clustered and an N x N distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to a cluster, so that if you have N items, you now have N clusters, each containing just one item. Let the distances (similarities) between the clusters be the same as the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that you now have one cluster less.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size N.
Step 3 can be done in different ways, which is what distinguishes single-linkage from complete-linkage and average-linkage clustering.

In single-linkage clustering (also called the connectedness or minimum method), we consider the distance between one cluster and another cluster to be equal to the shortest distance from any member of one cluster to any member of the other cluster.
In complete-linkage clustering (also called the diameter or maximum method), we consider the distance between one cluster and another cluster to be equal to the greatest distance from any member of one cluster to any member of the other cluster.
In average-linkage clustering, we consider the distance between one cluster and another cluster to be equal to the average distance from any member of one cluster to any member of the other cluster.
This kind of hierarchical clustering is called agglomerative because it merges clusters iteratively.

Complexity of hierarchical clustering: The space complexity of the hierarchical algorithm is O(n^2), because this is the space required for the adjacency matrix. The space required for the dendrogram is O(kn), which is much less than O(n^2). The time complexity of hierarchical algorithms is O(kn^2), because there is one iteration for each level in the dendrogram.

Conclusion: Thus, we have successfully implemented the Agglomerative Clustering Algorithm in Java and tested it on a variety of training databases.
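The following is a minimal single-linkage agglomerative clustering sketch in Java, given for illustration only; the 4-item distance matrix and the stopping condition of 2 clusters are made-up example values, not data from the text.

import java.util.*;

public class AgglomerativeDemo {

    public static void main(String[] args) {
        double[][] d = {            // example pairwise distances between 4 items
            {0, 1, 4, 5},
            {1, 0, 3, 6},
            {4, 3, 0, 2},
            {5, 6, 2, 0}};
        int targetClusters = 2;     // termination condition specified by the user

        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < d.length; i++) clusters.add(new ArrayList<>(List.of(i)));

        while (clusters.size() > targetClusters) {
            int bi = 0, bj = 1;
            double best = Double.MAX_VALUE;
            // find the closest pair of clusters under single linkage
            for (int i = 0; i < clusters.size(); i++)
                for (int j = i + 1; j < clusters.size(); j++) {
                    double link = singleLink(clusters.get(i), clusters.get(j), d);
                    if (link < best) { best = link; bi = i; bj = j; }
                }
            System.out.println("merging " + clusters.get(bi) + " and " + clusters.get(bj)
                    + " at distance " + best);
            clusters.get(bi).addAll(clusters.remove(bj));   // merge the two closest clusters
        }
        System.out.println("final clusters: " + clusters);
    }

    // Single linkage: shortest distance between any member of a and any member of b.
    static double singleLink(List<Integer> a, List<Integer> b, double[][] d) {
        double min = Double.MAX_VALUE;
        for (int i : a) for (int j : b) min = Math.min(min, d[i][j]);
        return min;
    }
}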


PROGRAM NO. 9: DBSCAN Clustering Algorithm

Aim: To implement the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) algorithm

Theory:
Major features:
- Discovers clusters of arbitrary shape
- Handles noise
- One scan
- Needs density parameters as a termination condition
DBSCAN is used to create clusters of minimum size and density. Density is defined as a minimum number of points within a certain distance of each other. There are two global parameters:
- Eps (ε): maximum radius of the neighbourhood
- MinPts: minimum number of points in an Eps-neighbourhood of that point
Core object: an object with at least MinPts objects within a radius Eps (its ε-neighbourhood).
Border object: an object that lies on the border of a cluster.

Basic concepts: ε-neighbourhood and core objects (example: ε = 1 cm)
The neighbourhood within a radius ε of a given object is called the ε-neighbourhood of the object.

If the ε-neighbourhood of an object contains at least a minimum number, MinPts, of objects, then the object is called a core object.
Example: ε = 1 cm, MinPts = 3. Objects m and p are core objects because their ε-neighbourhoods contain at least 3 points.

Directly density-reachable objects
An object p is directly density-reachable from object q if p is within the ε-neighbourhood of q and q is a core object.

Example: q is directly density-reachable from m; m is directly density-reachable from p and vice versa.

Density-reachable objects
An object p is density-reachable from object q with respect to ε and MinPts if there is a chain of objects p1, ..., pn, where p1 = q and pn = p, such that pi+1 is directly density-reachable from pi with respect to ε and MinPts.


Example: q is density-reachable from p because q is directly density-reachable from m and m is directly density-reachable from p. However, p is not density-reachable from q because q is not a core object.

Density-connectivity
An object p is density-connected to object q with respect to ε and MinPts if there is an object O such that both p and q are density-reachable from O with respect to ε and MinPts.

Example: p, q and m are all density-connected.

DBSCAN algorithm steps:
1. Arbitrarily select a point p.
2. Retrieve all points density-reachable from p with respect to Eps and MinPts.
3. If p is a core point, a cluster is formed.
4. If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.
5. Continue the process until all of the points have been processed.


Example: If Epsilon is 2 and MinPts is 2, what are the clusters that DBSCAN would discover for the following points? A1 = (2, 10), A2 = (2, 5), A3 = (8, 4), A4 = (5, 8), A5 = (7, 5), A6 = (6, 4), A7 = (1, 2), A8 = (4, 9). (Epsilon = 2, MinPts = 2.)

Pairwise Euclidean distances, marked by whether each pair is within Epsilon:

           A1     A2     A3     A4     A5     A6     A7     A8
A1 (2,10)   0     >2     >2     >2     >2     >2     >2     >2
A2 (2,5)   >2      0     >2     >2     >2     >2     >2     >2
A3 (8,4)   >2     >2      0     >2     ≤2     ≤2     >2     >2
A4 (5,8)   >2     >2     >2      0     >2     >2     >2     ≤2
A5 (7,5)   >2     >2     ≤2     >2      0     ≤2     >2     >2
A6 (6,4)   >2     >2     ≤2     >2     ≤2      0     >2     >2
A7 (1,2)   >2     >2     >2     >2     >2     >2      0     >2
A8 (4,9)   >2     >2     >2     ≤2     >2     >2     >2      0

Eps-neighbourhoods (Eps = 2):
N2(A1) = {}
N2(A2) = {}
N2(A3) = {A5, A6}
N2(A4) = {A8}
N2(A5) = {A3, A6}
N2(A6) = {A3, A5}
N2(A7) = {}
N2(A8) = {A4}

So A1, A2, and A7 are outliers, while we have two clusters: C1 = {A4, A8} and C2 = {A3, A5, A6}. If Epsilon is square root(10), then the neighbourhood of some points increases: A1 would join cluster C1, and A2 would join with A7 to form cluster C3 = {A2, A7}.

Complexity: Space complexity O(log n); time complexity O(n log n).

Conclusion: Thus, we have successfully implemented the DBSCAN clustering algorithm in Java and tested it on a variety of training databases.
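A compact Java sketch of DBSCAN is given below for illustration. It runs on the eight points of the example with Eps = 2 and MinPts = 2; note that, following the standard definition, the Eps-neighbourhood here includes the point itself, which reproduces the clusters C1 = {A4, A8} and C2 = {A3, A5, A6} and the outliers A1, A2, A7.

import java.util.*;

public class DbscanDemo {

    static final int NOISE = -1, UNVISITED = 0;

    public static void main(String[] args) {
        double[][] p = {{2,10},{2,5},{8,4},{5,8},{7,5},{6,4},{1,2},{4,9}};
        double eps = 2.0;
        int minPts = 2;

        int[] label = new int[p.length];        // 0 = unvisited, -1 = noise, >0 = cluster id
        int clusterId = 0;
        for (int i = 0; i < p.length; i++) {
            if (label[i] != UNVISITED) continue;
            List<Integer> seeds = neighbours(p, i, eps);
            if (seeds.size() < minPts) { label[i] = NOISE; continue; }   // not a core point
            clusterId++;
            label[i] = clusterId;
            // expand the cluster through density-reachable points
            Deque<Integer> queue = new ArrayDeque<>(seeds);
            while (!queue.isEmpty()) {
                int q = queue.poll();
                if (label[q] == NOISE) label[q] = clusterId;             // border point
                if (label[q] != UNVISITED) continue;
                label[q] = clusterId;
                List<Integer> qn = neighbours(p, q, eps);
                if (qn.size() >= minPts) queue.addAll(qn);               // q is also a core point
            }
        }
        System.out.println(Arrays.toString(label));   // -1 entries are outliers
    }

    // Eps-neighbourhood of point i (including i itself, as in the standard definition).
    static List<Integer> neighbours(double[][] p, int i, double eps) {
        List<Integer> n = new ArrayList<>();
        for (int j = 0; j < p.length; j++) {
            double dx = p[i][0] - p[j][0], dy = p[i][1] - p[j][1];
            if (Math.sqrt(dx * dx + dy * dy) <= eps) n.add(j);
        }
        return n;
    }
}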


PROGRAM NO. 10: Apriori Association Algorithm

Aim: To implement the Apriori association algorithm in the Java programming language.

Theory:
Basics: The Apriori algorithm is an influential algorithm for mining frequent itemsets for Boolean association rules.
Key concepts:
- Frequent itemsets: the sets of items which have minimum support (denoted by Li for the i-th itemset).
- Apriori property: any subset of a frequent itemset must be frequent.
- Join operation: to find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself.

Find the frequent itemsets, i.e., the sets of items that have minimum support:
- A subset of a frequent itemset must also be a frequent itemset; i.e., if {A, B} is a frequent itemset, both {A} and {B} should be frequent itemsets.
- Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
Then use the frequent itemsets to generate association rules.


Apriori Algorithm: Pseudo code

The Apriori Algorithm: Example
Consider a database, D, consisting of 9 transactions. Suppose the minimum support count required is 2 (i.e., min_sup = 2/9 = 22%), and let the minimum confidence required be 70%. We have to first find the frequent itemsets using the Apriori algorithm. Then, association rules will be generated using minimum support and minimum confidence.


Step 1: Generating 1-itemset Frequent Pattern

The set of frequent 1-itemsets, L1, consists of the candidate 1-itemsets satisfying minimum support. In the first iteration of the algorithm, each item is a member of the set of candidate 1-itemsets, C1.

Step 2: Generating 2-itemset Frequent Pattern


To discover the set of frequent 2-itemsets, L2, the algorithm uses L1 Join L1 to generate a candidate set of 2-itemsets, C2. Next, the transactions in D are scanned and the support count for each candidate itemset in C2 is accumulated (as shown in the middle table). The set of frequent 2-itemsets, L2, is then determined, consisting of those candidate 2-itemsets in C2 having minimum support. Note: we haven't used the Apriori property yet.


Step 3: Generating 3-itemset Frequent Pattern

The generation of the set of candidate 3-itemsets, C3, involves use of the Apriori property. In order to find C3, we compute L2 Join L2: C3 = L2 Join L2 = {{I1, I2, I3}, {I1, I2, I5}, {I1, I3, I5}, {I2, I3, I4}, {I2, I3, I5}, {I2, I4, I5}}. Now the join step is complete, and the prune step will be used to reduce the size of C3. The prune step helps to avoid heavy computation due to a large Ck. Based on the Apriori property that all subsets of a frequent itemset must also be frequent, we can determine that the four latter candidates cannot possibly be frequent. How? For example, let's take {I1, I2, I3}. Its 2-item subsets are {I1, I2}, {I1, I3} and {I2, I3}. Since all 2-item subsets of {I1, I2, I3} are members of L2, we will keep {I1, I2, I3} in C3. Let's take another example, {I2, I3, I5}, which shows how the pruning is performed. Its 2-item subsets are {I2, I3}, {I2, I5} and {I3, I5}. But {I3, I5} is not a member of L2 and hence it is not frequent, violating the Apriori property. Thus, we have to remove {I2, I3, I5} from C3. Therefore, C3 = {{I1, I2, I3}, {I1, I2, I5}} after checking all members of the result of the join operation for pruning.


Now, the transactions in D are scanned in order to determine L3, consisting of those candidate 3-itemsets in C3 having minimum support.

Step 4: Generating 4-itemset Frequent Pattern
The algorithm uses L3 Join L3 to generate a candidate set of 4-itemsets, C4. Although the join results in {{I1, I2, I3, I5}}, this itemset is pruned since its subset {I2, I3, I5} is not frequent. Thus, C4 = {} (the empty set), and the algorithm terminates, having found all of the frequent itemsets. This completes our Apriori algorithm. These frequent itemsets will be used to generate strong association rules (where strong association rules satisfy both minimum support and minimum confidence).

Step 5: Generating Association Rules from Frequent Itemsets
Procedure:
- For each frequent itemset l, generate all nonempty subsets of l.
- For every nonempty subset s of l, output the rule s -> (l - s) if support_count(l) / support_count(s) >= min_conf, where min_conf is the minimum confidence threshold.
Example: We had L = {{I1}, {I2}, {I3}, {I4}, {I5}, {I1,I2}, {I1,I3}, {I1,I5}, {I2,I3}, {I2,I4}, {I2,I5}, {I1,I2,I3}, {I1,I2,I5}}. Let's take l = {I1, I2, I5}. All its nonempty subsets are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2}, {I5}.

Let the minimum confidence threshold be, say, 70%. The resulting association rules are shown below, each listed with its confidence.
R1: I1 ^ I2 -> I5. Confidence = sc{I1,I2,I5} / sc{I1,I2} = 2/4 = 50%. R1 is rejected.
R2: I1 ^ I5 -> I2. Confidence = sc{I1,I2,I5} / sc{I1,I5} = 2/2 = 100%. R2 is selected.
R3: I2 ^ I5 -> I1. Confidence = sc{I1,I2,I5} / sc{I2,I5} = 2/2 = 100%. R3 is selected.
R4: I1 -> I2 ^ I5. Confidence = sc{I1,I2,I5} / sc{I1} = 2/6 = 33%. R4 is rejected.
R5: I2 -> I1 ^ I5. Confidence = sc{I1,I2,I5} / sc{I2} = 2/7 = 29%. R5 is rejected.
R6: I5 -> I1 ^ I2. Confidence = sc{I1,I2,I5} / sc{I5} = 2/2 = 100%. R6 is selected.
In this way, we have found three strong association rules.

Conclusion: Thus, we have successfully implemented the Apriori association algorithm in Java and tested it on a variety of training databases.
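For illustration, a compact Java sketch of the Apriori frequent-itemset phase (join, prune, and support counting) is given below. It is an illustrative sketch only; the five transactions and min_sup = 2 used in main() are made-up example data, not the nine-transaction database used in the walkthrough above.

import java.util.*;

public class AprioriDemo {

    public static void main(String[] args) {
        List<Set<String>> db = List.of(
            Set.of("I1", "I2", "I5"), Set.of("I2", "I4"), Set.of("I2", "I3"),
            Set.of("I1", "I2", "I4"), Set.of("I1", "I3"));
        int minSup = 2;

        // L1: frequent 1-itemsets
        Map<Set<String>, Integer> current = countCandidates(singletons(db), db, minSup);
        while (!current.isEmpty()) {
            current.forEach((items, sup) -> System.out.println(items + " support=" + sup));
            // join Lk-1 with itself, prune by the Apriori property, then count support
            Set<Set<String>> candidates = join(current.keySet());
            current = countCandidates(candidates, db, minSup);
        }
    }

    static Set<Set<String>> singletons(List<Set<String>> db) {
        Set<Set<String>> c = new HashSet<>();
        for (Set<String> t : db) for (String item : t) c.add(Set.of(item));
        return c;
    }

    // Join step: union pairs of (k-1)-itemsets that differ by exactly one item,
    // keeping only candidates whose (k-1)-subsets are all frequent (prune step).
    static Set<Set<String>> join(Set<Set<String>> prev) {
        Set<Set<String>> out = new HashSet<>();
        for (Set<String> a : prev) for (Set<String> b : prev) {
            Set<String> u = new TreeSet<>(a);
            u.addAll(b);
            if (u.size() == a.size() + 1 && allSubsetsFrequent(u, prev)) out.add(u);
        }
        return out;
    }

    static boolean allSubsetsFrequent(Set<String> candidate, Set<Set<String>> prev) {
        for (String item : candidate) {
            Set<String> sub = new TreeSet<>(candidate);
            sub.remove(item);
            if (!prev.contains(sub)) return false;          // violates the Apriori property
        }
        return true;
    }

    // Count each candidate's support in the database and keep the frequent ones.
    static Map<Set<String>, Integer> countCandidates(Set<Set<String>> candidates,
                                                     List<Set<String>> db, int minSup) {
        Map<Set<String>, Integer> freq = new LinkedHashMap<>();
        for (Set<String> c : candidates) {
            int sup = 0;
            for (Set<String> t : db) if (t.containsAll(c)) sup++;
            if (sup >= minSup) freq.put(c, sup);
        }
        return freq;
    }
}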

