
Data Warehousing & Data Mining

UNIT IV 2 marks:

1. What is classification?
A bank loan officer wants to analyze which loan applicants are safe and which are risky for the bank. A marketing manager needs data analysis to help guess whether a customer with a given profile will buy a new computer. In both examples the data analysis task is classification, where a model or classifier is constructed to predict categorical labels such as "safe" or "risky" for loan application data.

2. What is prediction?
Suppose the marketing manager would like to predict how much a given customer will spend during a sale. This data analysis task is an example of numeric prediction; the term prediction is commonly used to refer to numeric prediction.

3. How does classification work? (Or) Explain the steps involved in data classification.
Data classification is a two-step process:
Step 1: A classifier is built describing a predetermined set of data classes or concepts. This is the learning step (training phase), where a classification algorithm builds the classifier by analyzing, or "learning from", a training set made up of database tuples and their associated class labels.
Step 2: The model is used for classification. A test set, made up of test tuples and their associated class labels, is used to estimate the accuracy of the classifier before it is applied to new data.

(a) Learning: training data are analyzed by a classification algorithm, which produces classification rules.

    Training data
    Name          Age          Income   loan_decision
    Sandy Jones   young        low      risky
    Caroline      middle aged  high     safe
    Susan Lake    senior       low      safe

    Classification rules
    IF age = young THEN loan_decision = risky
    IF income = high THEN loan_decision = safe

(b) Classification: test data are used to estimate the accuracy of the classification rules. If the accuracy is acceptable, the rules are applied to the classification of new data, e.g. the new tuple (John Henry, middle aged, low) is assigned loan_decision = risky.

4. What is supervised learning?
In supervised learning the class label of each training tuple is provided (i.e. the learning of the classifier is "supervised" in that it is told to which class each training tuple belongs). For example, in the learning step, training data such as the following are analyzed by a classification algorithm:

    Name          Age          Income   loan_decision
    Sandy Jones   young        low      risky
    Caroline      middle aged  high     safe
    Susan Lake    senior       low      safe

Here the class label attribute is loan_decision, and the learned model or classifier may be represented in the form of classification rules, e.g.:
IF age = young THEN loan_decision = risky
IF income = high THEN loan_decision = safe
IF age = middle aged AND income = low THEN loan_decision = risky

5. What is unsupervised learning?
In unsupervised learning (or clustering) the class label of each training tuple is not known, and the number or set of classes to be learned may not be known in advance. For example, if we did not have the loan_decision data available for the training set, we could use clustering to try to determine groups of like tuples, which may correspond to risk groups within the loan application data.
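A minimal sketch of the supervised learning step in Python, assuming scikit-learn is available; the tiny loan table above is encoded by hand purely for illustration:

    # Hypothetical sketch: learn a classifier from labeled loan tuples.
    from sklearn.tree import DecisionTreeClassifier

    # Encode age (young=0, middle aged=1, senior=2) and income (low=0, high=1).
    X = [[0, 0],   # Sandy Jones: young, low
         [1, 1],   # Caroline:    middle aged, high
         [2, 0]]   # Susan Lake:  senior, low
    y = ["risky", "safe", "safe"]          # class label: loan_decision

    clf = DecisionTreeClassifier()
    clf.fit(X, y)                          # the "learning" step
    print(clf.predict([[1, 0]]))           # classify a new tuple: middle aged, low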

6. What are the preprocessing steps of the classification and prediction process?
The following preprocessing steps may be applied to the data to help improve the accuracy, efficiency, and scalability of the classification or prediction process:
(a) Data cleaning: preprocessing of the data in order to remove or reduce noise (by applying smoothing techniques) and to treat missing values.
(b) Relevance analysis: many of the attributes in the data may be redundant. A strong correlation between attributes A1 and A2 would suggest that one of the two could be removed from further analysis. Attribute subset selection can be used to find a reduced set of attributes. Relevance analysis, in the form of correlation analysis and attribute subset selection, can be used to delete attributes that do not contribute to the classification or prediction task.
(c) Data transformation and reduction: the data may be transformed by normalization. Data can also be transformed by generalizing it to higher-level concepts, e.g. the attribute income can be generalized to discrete ranges such as low, medium, and high.

7. What are the criteria used in comparing classification and prediction methods?
Accuracy: the accuracy of a classifier refers to its ability to correctly predict the class label of new or previously unseen data (i.e. tuples without class label information). The accuracy of a predictor refers to how well it can guess the value of the predicted attribute for new or unseen data.
Speed: the computational cost involved in generating and using the given classifier or predictor.
Robustness: the ability to make correct predictions given noisy data or data with missing values.
Scalability: the ability to construct the classifier or predictor efficiently given large amounts of data.
Interpretability: the level of understanding and insight that is provided by the classifier or predictor.

9. What is information gain?
The decision tree algorithm known as ID3 uses information gain as its attribute selection measure. The measure is based on pioneering work by Claude Shannon in information theory, which studied the value or "information content" of messages. The expected information needed to classify a tuple in D is given by

    Info(D) = - Σ (i = 1 to m) pi log2(pi)

where pi is the probability that an arbitrary tuple in D belongs to class Ci. Information gain is defined as the difference between the original information requirement (i.e. based on just the proportion of the classes) and the new requirement (i.e. obtained after partitioning on attribute A):

    Gain(A) = Info(D) - InfoA(D)
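A minimal sketch of these two formulas in Python (pure standard library, no data mining API assumed); the toy attribute and class values below are hypothetical:

    import math
    from collections import Counter

    def info(labels):
        """Expected information (entropy) Info(D) of a list of class labels."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gain(rows, attr_index, labels):
        """Gain(A) = Info(D) - InfoA(D) for the attribute at attr_index."""
        n = len(labels)
        partitions = {}
        for row, label in zip(rows, labels):
            partitions.setdefault(row[attr_index], []).append(label)
        info_a = sum(len(part) / n * info(part) for part in partitions.values())
        return info(labels) - info_a

    # Toy example: attribute 0 is age, the class label is loan_decision.
    rows = [("young",), ("middle",), ("senior",), ("young",)]
    labels = ["risky", "safe", "safe", "risky"]
    print(gain(rows, 0, labels))   # information gain of splitting on age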

10. What is tree pruning? What are the common approaches used for tree pruning?
When a decision tree is built, many of the branches will reflect anomalies in the training data due to noise or outliers. Tree pruning methods address this problem of overfitting the data.

Prepruning approach: a tree is "pruned" by halting its construction early. Upon halting, the node becomes a leaf. The leaf may hold the most frequent class among the subset tuples, or the probability distribution of those tuples.
Postpruning approach: the second and more common approach, which removes subtrees from a "fully grown" tree. A subtree at a given node is pruned by removing its branches and replacing it with a leaf. The leaf is labeled with the most frequent class among the subtree being replaced.
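As an illustration only, scikit-learn's decision tree exposes both styles of pruning (assuming scikit-learn is installed; the parameter values below are arbitrary, not prescribed by this material):

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)

    # Prepruning: halt tree construction early via stopping conditions.
    pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X, y)

    # Postpruning: grow the full tree, then prune subtrees by
    # cost-complexity pruning (larger ccp_alpha prunes more aggressively).
    post = DecisionTreeClassifier(ccp_alpha=0.01).fit(X, y)

    print(pre.get_depth(), post.get_depth())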

11. What are Bayesian classifiers?
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Bayesian classifiers have exhibited high accuracy and speed when applied to large databases.

12. Define Bayes' theorem.
Let X be a data tuple. In Bayesian terms, X is considered "evidence". Let H be some hypothesis, such as "the data tuple X belongs to a specified class C". P(H|X) is the posterior probability of H conditioned on X. Suppose X is a 35-year-old customer with an income of $40,000 and H is the hypothesis that X will buy a computer; then P(H|X) is the probability that X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability of H: the probability that any given customer will buy a computer, regardless of age, income, or any other information. P(X|H) is the posterior probability of X conditioned on H, i.e. the probability that a customer X is 35 years old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X: the probability that a person from our set of customers is 35 years old and earns $40,000.
How are the probabilities estimated? P(H), P(X|H), and P(X) may be estimated from the given data. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H|X) from P(H), P(X|H), and P(X). Bayes' theorem is

    P(H|X) = [ P(X|H) P(H) ] / P(X)
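A small numeric sketch of Bayes' theorem in Python; all of the probability values below are made up purely for illustration:

    # Hypothetical estimates from customer data:
    p_h = 0.40          # P(H): a customer buys a computer
    p_x_given_h = 0.15  # P(X|H): customer is 35 and earns $40,000, given a buyer
    p_x = 0.10          # P(X): customer is 35 and earns $40,000

    # Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
    p_h_given_x = p_x_given_h * p_h / p_x
    print(p_h_given_x)  # 0.6: probability this customer buys a computer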

13. What are Bayesian belief networks? Give an example.
Bayesian belief networks specify joint conditional probability distributions. They provide a graphical model of causal relationships, on which learning can be performed, and they can be used for classification. A belief network is defined by two components: a directed acyclic graph (DAG) and a set of conditional probability tables. Each node in the DAG represents a random variable. The variables may correspond to actual attributes given in the data, or to "hidden variables" believed to form a relationship (e.g. in the case of medical data, a hidden variable may indicate a syndrome, representing a number of symptoms that together characterize a specific disease). Each arc represents a probabilistic dependence: if an arc is drawn from node Y to node Z, then Y is the parent or immediate predecessor of Z, and Z is a descendant of Y.

(a) A simple Bayesian belief network over six variables: FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea. FamilyHistory and Smoker are the parents of LungCancer.

          FH,S   FH,~S   ~FH,S   ~FH,~S
    LC    0.8    0.5     0.7     0.1
    ~LC   0.2    0.5     0.3     0.9

(b) The conditional probability table (CPT) for the variable LungCancer (LC), showing each possible combination of the values of its parents, FamilyHistory (FH) and Smoker (S). A belief network has one CPT for each variable. The CPT for a variable Y specifies the conditional distribution P(Y | Parents(Y)), where Parents(Y) are the parents of Y.
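A minimal sketch of how such a CPT can be stored and queried in Python; the dictionary layout is just one possible representation, not part of the original material:

    # CPT for LungCancer: P(LC = yes | FamilyHistory, Smoker), keyed by (FH, S).
    cpt_lc = {
        (True, True):   0.8,
        (True, False):  0.5,
        (False, True):  0.7,
        (False, False): 0.1,
    }

    def p_lung_cancer(lc, fh, s):
        """Return P(LC = lc | FH = fh, S = s) from the table."""
        p_true = cpt_lc[(fh, s)]
        return p_true if lc else 1.0 - p_true

    print(p_lung_cancer(True, fh=True, s=False))    # 0.5
    print(p_lung_cancer(False, fh=False, s=False))  # 0.9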

14. What is rule-based classification?
Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form
IF condition THEN conclusion
Eg: R1: IF age = youth AND student = yes THEN buys_computer = yes
Explanation: the IF part (or left-hand side) of a rule is known as the rule antecedent or precondition. The THEN part (or right-hand side) is the rule consequent. The rule R1 can also be written as
R1: (age = youth) ∧ (student = yes) => (buys_computer = yes)

15. What is the sequential covering algorithm? How is it different from decision tree induction?
IF-THEN rules can be extracted directly from the training data (i.e. without having to generate a decision tree first) using a sequential covering algorithm. Sequential covering algorithms are the most widely used approach to mining disjunctive sets of classification rules. Popular sequential covering algorithms are AQ, CN2, and the more recent RIPPER. The general strategy is as follows: rules are learned one at a time; each time a rule is learned, the tuples covered by the rule are removed, and the process repeats on the remaining tuples. This sequential learning of rules is in contrast to decision tree induction, where the path to each leaf in a decision tree corresponds to a rule.

16. What is backpropagation?
Backpropagation is a neural network learning algorithm. A neural network is a set of connected input/output units in which each connection has a weight associated with it. During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples. Backpropagation learns by iteratively processing a data set of training tuples, comparing the network's prediction for each tuple with the actual target value.

17. What is associative classification?
In associative classification, association rules are generated and analyzed for use in classification. The general idea is that we can search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels. Decision tree induction considers only one attribute at a time, whereas association rules explore highly confident associations among multiple attributes. Various associative classification methods are:
a) CBA (Classification Based on Association): uses an iterative approach to frequent itemset mining.
b) CMAR (Classification based on Multiple Association Rules): differs from CBA in its strategy for frequent itemset mining and in its construction of the classifier.

18. What are k-nearest-neighbour classifiers?
Nearest-neighbour classifiers are based on learning by analogy, i.e. by comparing a given tuple with training tuples that are similar to it. The training tuples are described by n attributes, so each tuple represents a point in an n-dimensional space. In this way, all of the training tuples are stored in an n-dimensional pattern space. When given an unknown tuple, a k-nearest-neighbour classifier searches the pattern space for the k training tuples that are closest to the unknown tuple.
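A minimal k-nearest-neighbour sketch in pure Python (no library assumed); the tiny numeric dataset and the value of k are arbitrary:

    import math
    from collections import Counter

    def knn_classify(query, training, k=3):
        """training is a list of (point, label); return the majority label
        among the k points closest to query (Euclidean distance)."""
        dist = lambda p, q: math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        nearest = sorted(training, key=lambda t: dist(t[0], query))[:k]
        return Counter(label for _, label in nearest).most_common(1)[0][0]

    training = [((1.0, 1.0), "safe"), ((1.2, 0.8), "safe"),
                ((5.0, 5.0), "risky"), ((5.5, 4.5), "risky")]
    print(knn_classify((1.1, 0.9), training, k=3))   # -> "safe"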

19. What is regression analysis?
Regression analysis can be used to model the relationship between one or more independent or predictor variables and a dependent or response variable (which is continuous-valued). The predictor variables are the attributes of interest describing the tuple (i.e. making up the attribute vector). In general, the values of the predictor variables are known; the response variable is what we want to predict. Given a tuple described by predictor variables, we want to predict the associated value of the response variable. Many problems can be solved by linear regression, and several packages exist to solve regression problems; examples include SAS, SPSS, and S-Plus.

20. What is non-linear regression?

If a given response variable and predictor variable have a relationship that may be modeled by a polynomial function, it is called non-linear regression or polynomial regression. It can be modeled by adding polynomial terms to the basic linear model.
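A short sketch of linear versus polynomial regression in Python, assuming NumPy is available; the sample data are invented for illustration:

    import numpy as np

    # Predictor and response values (hypothetical).
    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 4.9, 10.2, 17.1, 26.0])

    # Linear model: y = w1*x + w0
    linear = np.polyfit(x, y, deg=1)

    # Polynomial (non-linear) model: y = w2*x^2 + w1*x + w0,
    # i.e. the basic linear model extended with a polynomial term.
    quadratic = np.polyfit(x, y, deg=2)

    print(np.polyval(quadratic, 6.0))   # predict the response for x = 6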

21. What are the different clustering methods?
a) Partitioning methods: given a database of n objects or data tuples, a partitioning method constructs k partitions of the data, where each partition represents a cluster and k <= n. That is, it classifies the data into k groups such that (i) each group contains at least one object and (ii) each object belongs to exactly one group.
b) Hierarchical methods: a hierarchical method creates a hierarchical decomposition of the given set of data objects.
c) Density-based methods.
d) Grid-based methods.
e) Model-based methods.

22. Explain clustering by k-means partitioning.
The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. Cluster similarity is measured with regard to the mean value of the objects in a cluster (the cluster centroid or center of gravity).
How does the k-means algorithm work? It proceeds as follows. First, it randomly selects k of the objects, each of which initially represents a cluster mean or center. Each of the remaining objects is assigned to the cluster to which it is the most similar, based on the distance between the object and the cluster mean. The algorithm then computes the new mean for each cluster. This process iterates until the criterion function converges. Typically, the square-error criterion is used, defined as

    E = Σ (i = 1 to k) Σ (p ∈ Ci) |p - mi|^2

where E is the sum of the square errors for all objects in the data set, p is the point in space representing a given object, and mi is the mean of cluster Ci.
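A compact k-means sketch in pure Python for one-dimensional points (no library assumed; the data and k are arbitrary), following the steps above:

    def kmeans(points, k, iterations=20):
        # Step 1: arbitrarily pick k objects as the initial cluster means.
        means = points[:k]
        for _ in range(iterations):
            # Step 2: assign each object to the cluster with the closest mean.
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k), key=lambda j: abs(p - means[j]))
                clusters[i].append(p)
            # Step 3: recompute each cluster mean; repeat until convergence.
            new_means = [sum(c) / len(c) if c else means[i]
                         for i, c in enumerate(clusters)]
            if new_means == means:
                break
            means = new_means
        return means, clusters

    data = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
    print(kmeans(data, k=2))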

23. What is the k-medoids method?
The k-means algorithm is sensitive to outliers because an object with an extremely large value may substantially distort the distribution of the data. Instead of taking the mean value of the objects in a cluster as a reference point, we can pick actual objects to represent the clusters, using one representative object per cluster. Each remaining object is clustered with the representative object to which it is the most similar. The partitioning method is then performed based on the principle of minimizing the sum of the dissimilarities between each object and its corresponding reference point. The absolute-error criterion is defined as

    E = Σ (j = 1 to k) Σ (p ∈ Cj) |p - oj|

where E is the sum of the absolute errors for all objects in the data set, p is the point in space representing a given object in cluster Cj, and oj is the representative object of Cj. The algorithm iterates until, eventually, each representative object is actually the medoid, or most centrally located object, of its cluster. This is the basis of the k-medoids method for grouping n objects into k clusters.

24. What is outlier detection and analysis?
"One person's noise could be another person's signal." Outliers are data objects that do not comply with the general behavior or model of the data. Outliers can be caused by measurement or execution errors. Many data mining algorithms try to minimize the influence of outliers or eliminate them altogether. However, outliers may be of particular interest, as in the case of fraud detection.

25. What is outlier mining?
Outlier detection and analysis is an interesting data mining task referred to as outlier mining. Outlier mining has wide applications; for example, it can be used in fraud detection (detecting unusual usage of credit cards, etc.). Outlier mining can be described as follows: given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are considerably dissimilar, exceptional, or inconsistent with respect to the remaining data.

26. Explain in brief the different outlier detection approaches.
Computer-based methods for outlier detection fall into four approaches:
1. The statistical approach.
2. The distance-based approach.
3. The density-based local outlier approach.
4. The deviation-based approach.
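A minimal sketch of the statistical style of outlier detection in Python, using a simple z-score heuristic; the data and the threshold of 2 standard deviations are arbitrary choices, not part of the original material:

    import statistics

    def statistical_outliers(values, threshold=2.0):
        """Flag values lying more than `threshold` standard deviations
        from the mean (a simple statistical-approach heuristic)."""
        mean = statistics.mean(values)
        stdev = statistics.stdev(values)
        return [v for v in values if abs(v - mean) > threshold * stdev]

    spending = [120, 95, 110, 105, 130, 98, 2500]   # 2500 looks anomalous
    print(statistical_outliers(spending))            # -> [2500]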
