
Amity Campus Uttar Pradesh India 201303

Subject Name Warehousing and Mining Study COUNTRY Roll Number (Reg.No.) 2012059 Student Name : Data : UGANDA BSCIT01152009:NAJJITA LILIAN

INSTRUCTIONS
a) Students are required to submit all three assignment sets.

ASSIGNMENT      DETAILS                                    MARKS
Assignment A    Five Subjective Questions                  10
Assignment B    Three Subjective Questions + Case Study    10
Assignment C    Objective or one line                      10

b) Total weightage given to these assignments is 30%, or 30 marks.
c) All assignments are to be completed as typed in Word/PDF.
d) All questions are required to be attempted.
e) All three assignments are to be completed by the due dates and need to be submitted for evaluation by Amity University.
f) Students have to attach a scanned signature in the form.

Signature :
Date :




( ) Tick mark in front of the assignments submitted
Assignment A    Assignment B    Assignment C

Data Warehousing and Mining

Assignment A
Q1. Discuss various types of concept hierarchies by providing two examples for each type.

Schema hierarchies
A schema hierarchy is a total or partial order among attributes in the database schema. It may formally express existing semantic relationships between attributes, and it provides metadata information.
Example: location hierarchy
street < city < province/state < country

Set-grouping hierarchies
A set-grouping hierarchy organizes the values of a given attribute into groups, sets, or ranges of values. A total or partial order can be defined among the groups. It is used to refine or enrich schema-defined hierarchies, and it is typically used for small sets of object relationships.

Example: Set-grouping hierarchy for age
{young, middle_aged, senior} ⊂ all(age)
{20, ..., 29} ⊂ young
{40, ..., 59} ⊂ middle_aged
{60, ..., 89} ⊂ senior

Operation-derived hierarchies
An operation-derived hierarchy is based on specified operations. These operations may include decoding of information-encoded strings, information extraction from complex data objects, and data clustering.
Example: a URL or email address gives
login name < dept. < univ. < country

Rule-based hierarchies
A rule-based hierarchy occurs when the whole or a portion of a concept hierarchy is defined as a set of rules and is evaluated dynamically based on the current database data and the rule definitions.
Example: The following rules categorize items as low_profit_margin, medium_profit_margin, and high_profit_margin.
low_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) < 50)
medium_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) >= 50) ^ ((P1-P2) <= 250)
high_profit_margin(X) <= price(X,P1) ^ cost(X,P2) ^ ((P1-P2) > 250)

Q2. Illustrate the typical requirements of clustering in data mining.

Scalability: Many clustering algorithms work well on small data sets containing fewer than 200 data objects. However, a large database may contain millions of objects. Clustering on a sample of a given large data set may lead to biased results. Highly scalable clustering algorithms are needed.

Ability to deal with different types of attributes: Many algorithms are designed to cluster interval-based (numerical) data. However, applications may require clustering other types of data, such as binary, categorical (nominal), and ordinal data, or mixtures of these data types.

Discovery of clusters with arbitrary shape: Many clustering algorithms determine clusters based on Euclidean or Manhattan distance measures. Algorithms based on such distance measures tend to find spherical clusters with similar size and density. However, a cluster could be of any shape. It is important to develop algorithms that can detect clusters of arbitrary shape.
Minimal requirements for domain knowledge to determine input parameters: Many clustering algorithms require users to input certain parameters in cluster analysis (such as the number of desired clusters). The clustering results can be quite sensitive to input parameters. Parameters are often hard to determine, especially for data sets containing high-dimensional objects. This not only burdens users, but also makes the quality of clustering difficult to control.

Ability to deal with noisy data: Most real-world databases contain outliers or missing, unknown, or erroneous data. Some clustering algorithms are sensitive to such data and may produce clusters of poor quality.

Insensitivity to the order of input records: Some clustering algorithms are sensitive to the order of input data; for example, the same set of data, when presented in different orderings to such an algorithm, may generate dramatically different clusters. It is important to develop algorithms that are insensitive to the order of input.

High dimensionality: A database or a data warehouse can contain several dimensions or attributes. Many clustering algorithms are good at handling low-dimensional data, involving only two to three dimensions. Human eyes are good at judging the quality of clustering for up to three dimensions. It is challenging to cluster data objects in high-dimensional space, especially considering that such data can be very sparse and highly skewed.

Constraint-based clustering: Real-world applications may need to perform clustering under various kinds of constraints. Suppose that your job is to choose the locations for a given number of new automatic cash-dispensing machines (ATMs) in a city. To decide upon this, you may cluster households while considering constraints such as the city's rivers and highway networks and customer requirements per region. A challenging task is to find groups of data with good clustering behavior that satisfy the specified constraints.

Interpretability and usability: Users expect clustering results to be interpretable, comprehensible, and usable. That is, clustering may need to be tied to specific semantic interpretations and applications. It is important to study how an application's goal may influence the selection of clustering methods.

Q3. State various evaluation criteria that are essential for classification and prediction methods.
Bayesian classification, based on Bayes' theorem.
The decision tree, using the ID3 algorithm.
K-nearest neighbors, using distance.

Q4. What is meant by data reduction? Discuss any two data reduction strategies for obtaining a reduced data representation.
Data reduction is the process of minimizing the amount of data that needs to be stored in a data storage environment.
Data reduction can increase storage efficiency and reduce costs.

Reduction strategies
Data archiving and data compression can reduce the amount of data that needs to be stored on primary storage systems. Data archiving works by moving infrequently accessed data to secondary data storage systems. Data compression reduces the size of a file by removing redundant information so that less disk space is required.

Data reduction obtains a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results. Strategies include:
Data aggregation (e.g., building a data cube).
Attribute subset selection (e.g., removing irrelevant attributes through correlation analysis).
Dimensionality reduction (e.g., using encoding schemes such as minimum length encoding or wavelets).
Numerosity reduction (e.g., replacing the data by alternative, smaller representations such as clusters or parametric models).
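
Two of the strategies named above, data aggregation and numerosity reduction, can be illustrated with a minimal sketch. The sales records, the price list, and the function names are hypothetical and chosen only for illustration; an equal-width histogram is used here as one simple form of numerosity reduction.

```python
from collections import Counter, defaultdict

# Hypothetical quarterly sales records: (year, quarter, amount).
sales = [
    (2003, "Q1", 224), (2003, "Q2", 408), (2003, "Q3", 350), (2003, "Q4", 586),
    (2004, "Q1", 300), (2004, "Q2", 416), (2004, "Q3", 380), (2004, "Q4", 610),
]


def aggregate_by_year(records):
    """Data aggregation: roll quarterly amounts up to one total per year."""
    totals = defaultdict(int)
    for year, _quarter, amount in records:
        totals[year] += amount
    return dict(totals)


def equal_width_histogram(values, width):
    """Numerosity reduction: keep only a count per bucket of the given width."""
    counts = Counter((value // width) * width for value in values)
    return dict(sorted(counts.items()))


# Eight quarterly rows shrink to two yearly rows.
yearly = aggregate_by_year(sales)

# Twenty hypothetical prices shrink to four (bucket start, count) pairs.
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 12, 14, 14, 15, 18, 18, 20, 21, 25, 28, 30]
buckets = equal_width_histogram(prices, 10)

print(yearly)
print(buckets)
```

In both cases the reduced form is much smaller than the input, while totals per year and the overall shape of the price distribution are preserved.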

Q5. Differentiate between STAR and SNOWFLAKE schemas.
A snowflake schema design is usually more complex than a star schema. In a star schema, a fact table is surrounded by multiple dimension tables. This is also how the snowflake schema is designed. However, in a snowflake schema, the dimension tables can be further broken down into sub-dimensions. Hence, data in a snowflake schema is more stable and standardized than in a star schema.

A snowflake schema is a more normalized form of a star schema. In a star schema, one fact table is stored with a number of dimension tables. In a snowflake schema, on the other hand, one dimension table can have multiple sub-dimensions. This means that in a star schema, each dimension table is independent, without any sub-dimensions. In a star schema the dimension table itself contains the hierarchies of the dimension, whereas in a snowflake schema the hierarchies are split into different tables. Data can be drilled down from the topmost hierarchies to the lowermost hierarchies.

Q6. State the salient differences between data query and knowledge query.
A data query involves the use of raw data, whereas a knowledge query integrates the use of processed data and answers questions such as "how", "who", and "which".

Case Study
Q1. Suppose that a data warehouse consists of four dimensions (date, viewer, cinema hall, and movie) and two measures (count and charge), where charge is the ticket fee that the viewer pays for watching the movie on a given date. The viewers can be children below 5, children above 5, adults, or seniors, with each category having its own charge rate.
i) Draw a star schema diagram for the data warehouse.
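
For part i), the star schema can be sketched as table definitions rather than a drawing. The following is a minimal sketch, not a definitive design: all table and column names (date_dim, viewer_dim, hall_dim, movie_dim, sales_fact, ticket_count) and all sample rows are hypothetical, and Python's built-in sqlite3 module is used only to make the sketch runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One central fact table holding the two measures, with a key into each of
# the four dimension tables. All names here are illustrative assumptions.
conn.executescript("""
CREATE TABLE date_dim   (date_key INTEGER PRIMARY KEY, day INTEGER, month INTEGER, year INTEGER);
CREATE TABLE viewer_dim (viewer_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE hall_dim   (hall_key INTEGER PRIMARY KEY, hall_name TEXT);
CREATE TABLE movie_dim  (movie_key INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE sales_fact (
    date_key     INTEGER REFERENCES date_dim(date_key),
    viewer_key   INTEGER REFERENCES viewer_dim(viewer_key),
    hall_key     INTEGER REFERENCES hall_dim(hall_key),
    movie_key    INTEGER REFERENCES movie_dim(movie_key),
    ticket_count INTEGER,   -- the "count" measure
    charge       REAL       -- the "charge" measure
);
""")

# Made-up sample rows, only to show how the tables relate.
conn.executemany("INSERT INTO date_dim VALUES (?, ?, ?, ?)",
                 [(1, 15, 3, 2004), (2, 20, 7, 2005)])
conn.executemany("INSERT INTO viewer_dim VALUES (?, ?)",
                 [(1, "adult"), (2, "senior")])
conn.executemany("INSERT INTO hall_dim VALUES (?, ?)",
                 [(1, "Paradise"), (2, "Other Hall")])
conn.executemany("INSERT INTO movie_dim VALUES (?, ?)", [(1, "Movie A")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?, ?)", [
    (1, 1, 1, 1, 2, 20.0),  # adult, Paradise, 2004
    (1, 2, 1, 1, 1, 6.0),   # senior, Paradise, 2004
    (2, 1, 1, 1, 1, 10.0),  # adult, Paradise, 2005
    (1, 1, 2, 1, 1, 10.0),  # adult, Other Hall, 2004
])

tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]

# The fact table joins to every dimension through its key.
total = conn.execute("""
    SELECT SUM(f.charge)
    FROM sales_fact f
    JOIN date_dim d   ON f.date_key = d.date_key
    JOIN viewer_dim v ON f.viewer_key = v.viewer_key
    JOIN hall_dim h   ON f.hall_key = h.hall_key
    WHERE v.category = 'adult' AND h.hall_name = 'Paradise' AND d.year = 2004
""").fetchone()[0]

print(tables)
print(total)
```

The layout is what makes this a star schema: every dimension table is referenced directly by the fact table, and no dimension table references another.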

ii) Starting with the base cuboid [date, viewer, cinema hall, movie], what specific OLAP operations should one perform in order to list the total charge paid by adults at the cinema hall Paradise in 2004?
Roll-up
Drill-down
Slice and dice

Q2. Give an example to show that items in a strong association rule may actually be negatively correlated.
If itemsets X and Y are both frequent but rarely occur together (that is, sup(X U Y) < sup(X) * sup(Y)), then itemsets X and Y are negatively correlated, and X U Y is a negatively correlated pattern. If sup(X U Y) << sup(X) * sup(Y), then X and Y are strongly negatively correlated, and the pattern X U Y is a strongly negatively correlated pattern.
For example, suppose a sewing store sells needle packages A and B. The store sold 100 packages each of A and B, but only one transaction contains both A and B. Intuitively, A is negatively correlated with B, since the purchase of one does not seem to encourage the purchase of the other. If there are 200 transactions, sup(A U B) = 1/200 = 0.005 and sup(A) * sup(B) = (100/200) * (100/200) = 0.25. Thus sup(A U B) << sup(A) * sup(B), which shows that A and B are strongly negatively correlated.

Q3. What are Bayesian classifiers? Explain the theorem on which Bayesian classification is based.
Bayesian classifiers are statistical classifiers that predict the probability that a given sample is a member of a particular class. Bayesian classifiers are based on Bayes' theorem. Bayesian classification shows good accuracy and speed when applied to large databases. The basic underlying assumption (also called class conditional independence) for the simplest form of classification, called naive Bayesian classification, is: the effect of an attribute value on a given class is independent of the values of the other attributes.

Bayes' theorem: Assume the following:
X is the data sample whose class is to be determined.
H is the hypothesis that the data sample X belongs to a class C.
P(H|X) is the probability that hypothesis H holds for data sample X. It is also called the posterior probability that H holds given the sample X.
P(H) is the prior probability of H, based on the training data.
P(X|H) is the posterior probability of the sample X, given that H is true.
P(X) is the prior probability of the sample X.

Bayes' theorem states:
P(H|X) = P(X|H) P(H) / P(X)

We can calculate P(X), P(X|H), and P(H) from the data sample X and the training data. Only P(H|X), which defines the probability that X belongs to class C, cannot be calculated directly. Bayes' theorem provides exactly this calculation.

Q4. Explain the application of data mining in CRM in Healthcare. How can data mining algorithms be implemented in CRM?
A pharmaceutical company can analyze its recent sales force activity and the results to improve targeting of high-value physicians and to determine which marketing activities will have the greatest impact in the next few months. The data needs to include competitor market activity as well as information about the local health care systems. The results can be distributed to the sales force via a wide-area network that enables the representatives to review the recommendations from the perspective of the key attributes in the decision process. Data mining in health care is also applied to image screening for subsequent manual processing and to diagnostics through machine learning.

Data mining algorithms in CRM can be implemented through:
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k >= 1). Sometimes called the k-nearest neighbor technique.
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.

ASSIGNMENT C

1) Which of the following statements correctly describe a Dimension table in Dimensional Modeling?
1: Dimension tables contain fields that describe the facts.
2: Dimension tables do not contain numeric fields.
3: Dimension tables are typically larger than fact tables.
4: Dimension tables do not need system-generated keys.
5: Dimension tables usually have fewer fields than fact tables.

2) How are dimensions in a Multi-Dimensional Database related?
1: Hierarchically.
2: Through foreign keys.
3: Through a hierarchy and foreign keys.
4: Through a network.
5: Through an inverse list.

3) What is a primary risk of a 'phased' implementation?
1: Previous implementations may need to be reworked.
2: The project may lose momentum.
3: Business Analysts will find problems in the data sooner.
4: Executives will lose focus.
5: The project budget may be exceeded.

4) How do highly distributed source systems impact the Data Warehouse or Data Mart project?
1: The source data exists in multiple environments.
2: The location of the source systems has minimal impact on the Data Warehouse or Data Mart implementation.
3: The timing and coordination of software development, extraction, and data updates are more complex.
4: Large volumes of data must be moved between locations.
5: Additional network and data communication hardware will be needed.

5) OLAP tool (as described above)?
1: Drill down to another level of detail.
2: Display the top 10 items that meet a specific selection criteria.
3: Trend analysis.
4: Calculate a rolling average on a set of data.
5: Display a report based on specific selection criteria.

6) In a Data Mart Only architecture, what will the Data Mart Development Team(s) encounter?
1: There is little or no minimal data redundancy across all of the Data Mart databases.
2: Issues such as inconsistent definitions and dirty data in extracting data from multiple source systems will be addressed several times.
3: Database design will be easier than expected because Data Mart databases support only a single user.
4: There is ease in consolidating the Data Marts to create a Data Warehouse.

5: It is easy to develop the data extraction system due to the use of the warehouse as a single datasource.

7) What is the primary responsibility of the 'project sponsor' during a Data Warehouse project?
1: To manage the day-to-day project activity.
2: To review and approve all decisions concerning the project.
3: To approve and monitor the project budget.
4: To ensure cooperation and support from all 'involved' departments.
5: To communicate project status to higher management and the board of directors.

8) What are Metadata?
1: Data used only by the IS organization.
2: Information that describes and defines the organization's data.
3: Definitions of data elements.
4: Any business data occurring in large volumes.
5: Summarized data.

9) How can the managers of a department best understand the cost of their use of the data warehouse?
1: A percentage of the business department's budget should be directed to the maintenance and enhancement of the Data Warehouse.
2: Institute a charge-back system of computer costs for the access to the Data Warehouse.
3: Develop a training program for department management.
4: Provide executive management with computer utilization reports that show what percentage of utilization is due to the Data Warehouse.
5: Business managers should participate in the acquisition process for computer hardware and software.

10) Which of the following is NOT a consequence of the creation of independent Data Marts?
1: Potentially different answers to a single business question if the question is asked of more than one Data Mart.
2: Increase in data redundancy due to duplication of data between the Data Marts.
3: Consistent definitions of the data in the Data Marts.
4: Creation of multiple application systems that have duplicate processing due to the duplication of data between the Data Marts.
5: Increased costs of hardware as the databases in the Data Marts grow.

11) What is meant by artificial intelligence when it is applied to data cleansing and transformation tools?
1: The tool can perform highly complex mathematical and statistical calculations to create derived data elements.
2: The tool can accomplish highly complex code translations when data comes from multiple source systems.
3: The tool can determine through heuristics the changes needed for a set of dirty data and then make the changes.
4: The tool can perform highly complex summarizations across multiple databases.
5: The tool can identify data that appears to be inconsistent between multiple source systems and provide reporting to assist in the clean up of the source system data.

12) Which of the following classes of corporations can gain the most insights from their legacy data?
1: A corporation that wants to determine the attitude of its customers towards the corporation.
2: A corporation that offers new products and services.
3: A new corporation.
4: A corporation that has existed for a long time.
5: A corporation that is constantly introducing new and different products and services.

13) Which of the following is NOT found in an Entity Relationship Model?
1: A definition for each Entity and Data Element.
2: Entity Relationship Diagram.
3: Entity and Data Element Names.
4: Fact and Dimension Tables.
5: Business Rules associated with the entities, entity relationships, and the data elements.

14) What is Data Mining?
1: The capability to drill down into an organization's data once a question has been raised.
2: The setting up of queries to alert management when certain criteria are met.
3: The process of performing trend analysis on the financial data of an organization.
4: The automated process of discovering patterns and relationships in an organization's data.
5: A class of tools that support the manual process of identifying patterns in large databases.

15) What does implementing a Data Warehouse or Data Mart help reduce?
1: The data gathering effort for data analysis.
2: Hardware costs.
3: User requests for custom reports.
4: Costs when management downsizes the organization.
5: All of the above.

16) Profitability Analysis is one of the most common applications of data warehousing. Why is Profitability Analysis in data warehousing more difficult than usually expected?
1: Almost every manager in an organization wants to get profitability reports.
2: Revenue data cannot be tracked accurately.
3: Expense data is often tracked at a higher level of detail than revenue data.
4: Revenue data is difficult to collect and organize.
5: Transaction grain data is required to properly compute profitability figures.

17) An operational system is which of the following?
A. A system that is used to run the business in real time and is based on historical data.
B. A system that is used to run the business in real time and is based on current data.
C. A system that is used to support decision making and is based on current data.
D. A system that is used to support decision making and is based on historical data.

18) A data warehouse is which of the following?
A. Can be updated by end users.
B. Contains numerous naming conventions and formats.
C. Organized around important subject areas.
D. Contains only current data.

19) The load and index is which of the following?
A. A process to reject data from the data warehouse and to create the necessary indexes
B. A process to load the data in the data warehouse and to create the necessary indexes
C. A process to upgrade the quality of data after it is moved into a data warehouse
D. A process to upgrade the quality of data before it is moved into a data warehouse

20) The extract process is which of the following?
A. Capturing all of the data contained in various operational systems
B. Capturing a subset of the data contained in various operational systems
C. Capturing all of the data contained in various decision support systems
D. Capturing a subset of the data contained in various decision support systems

21) A star schema has what type of relationship between a dimension and fact table?
A. Many-to-many
B. One-to-one
C. One-to-many
D. All of the above.

22) What does the term 'Ad-hoc Analysis' mean?
1: Business analysts use a subset of the data for analysis.
2: Business analysts access the Data Warehouse data infrequently.
3: Business analysts access the Data Warehouse data from different locations.
4: Business analysts do not know data requirements prior to beginning work.
5: Business analysts use sampling techniques.

23) What should be the business analyst's involvement in monitoring the performance of a Data Warehouse or Data Mart?
1: Be patient when load monitoring on the Data Warehouse or Data Mart is taking place.
2: Become experts in SQL queries.
3: No involvement in performance monitoring.
4: Contact IT if a query takes too long or does not complete.
5: Complete all required training on the query tools they will be using.

24) What factor heavily influences data warehouse size estimates?
1: The design of the warehouse schemas
2: The size of the source system schemas
3: The record size of the source tables
4: The number of expected data warehouse users
5: The number of customers an organization has

25) Data warehouses or data marts allow organizations to define 'alert' conditions -- an alert is raised when something noteworthy has taken place. For implementing a facility of 'alerts', what is the advantage of using a WEB interface over a client/server approach?
1: Access to the 'Alert' report is possible through a highly accessible means already available within the organization.
2: The selection criteria used in determining when an 'alert' needs to be issued is easier to implement using a WEB browser.
3: As long as the appropriate individual can access the 'alert', how it is implemented does not present an advantage.
4: 'Alerts' can be directed only to the requestor of the 'alert'.
5: Access to the 'alert' data can be tightly controlled.

26) Transient data is which of the following?
A. Data in which changes to existing records cause the previous version of the records to be eliminated
B. Data in which changes to existing records do not cause the previous version of the records to be eliminated
C. Data that are never altered or deleted once they have been added
D. Data that are never deleted once they have been added

27) A multifield transformation does which of the following?
A. Converts data from one field into multiple fields
B. Converts data from multiple fields into one field
C. Converts data from multiple fields into multiple fields
D. All of the above

28) A snowflake schema is which of the following types of tables?
A. Fact
B. Dimension
C. Helper
D. All of the above

29) The generic two-level data warehouse architecture includes which of the following?
A. At least one data mart
B. Data that can be extracted from numerous internal and external sources
C. Near real-time updates
D. All of the above.

30) Fact tables are which of the following?
A. Completely denormalized
B. Partially denormalized
C. Completely normalized
D. Partially normalized

31. Data transformation includes which of the following?
A. A process to change data from a detailed level to a summary level
B. A process to change data from a summary level to a detailed level
C. Joining data from one source into various sources of data
D. Separating data from one source into various sources of data

32. Information is
a. Data
b. Processed Data
c. Manipulated input
d. Computer output

33. Data by itself is not useful unless
a. It is massive
b. It is processed to obtain information
c. It is collected from diverse sources
d. It is properly stated

34. What are the three essential components of a learning system? Give a definition of each. Give an example of each, including equations where necessary. (1 mark)
A. Model, gradient descent, learning algorithm
B. Error function, model, learning algorithm
C. Accuracy, Sensitivity, Specificity
D. Model, error function, cost function

35. The error function most suited for gradient descent using logistic regression is
A. The entropy function
B. The squared error
C. The cross-entropy function
D. The number of mistakes

36. After SVM learning, each Lagrange multiplier ai takes either zero or non-zero value. What does it indicate in each situation?
A. A non-zero ai indicates the datapoint i is a support vector, meaning it touches the margin boundary.
B. A non-zero ai indicates that the learning has not yet converged to a global minimum.
C. A zero ai indicates that the datapoint i has become a support vector datapoint, on the margin.
D. A zero ai indicates that the learning process has identified support for vector i.

37. A Bayesian Network is most accurately described as
A. A special case of a neural network that makes use of Bayes Theorem.
B. The network variant of Bayes Theorem, assuming independent features.
C. A probabilistic model of which Naive Bayes is a special case.
D. A network of probabilistic learning functions, connected by Bayes Rule.

38. Data scrubbing is which of the following?
A. A process to reject data from the data warehouse and to create the necessary indexes
B. A process to load the data in the data warehouse and to create the necessary indexes
C. A process to upgrade the quality of data after it is moved into a data warehouse
D. A process to upgrade the quality of data before it is moved into a data warehouse

39. The active data warehouse architecture includes which of the following?
A. At least one data mart
B. Data that can be extracted from numerous internal and external sources
C. Near real-time updates
D. All of the above.

40. A goal of data mining includes which of the following?
A. To explain some observed event or condition
B. To confirm that data exists
C. To analyze data for expected relationships
D. To create a new data warehouse