
TAM002 Data mining with GUHA
Part 1: Does my data contain something interesting?

Esko Turunen
Tampere University of Technology

Esko Turunen

http://www.vrtuosi.com

The aim of data mining is to answer the question: Does my data contain something interesting? In this chapter we introduce the following basic concepts: knowledge discovery in databases and data mining; data, typical data mining tasks and the outputs of data mining tasks; and GUHA and B-course, two dissimilar approaches to data mining.

We also get acquainted with a real-life data set collected in Indonesia. We use this data set to illustrate issues throughout the course.


Knowledge discovery in databases (KDD) was initially defined as the nontrivial extraction of implicit, previously unknown, and potentially useful information from data [1]. A revised version of this definition states that KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [2]. According to this definition, data mining is the step in the KDD process concerned with applying computational techniques (i.e., data mining algorithms implemented as computer programs) to actually find patterns in the data. In a sense, data mining is the central step in the KDD process. The other steps are concerned with preparing data for data mining and with evaluating the discovered patterns, that is, the results of data mining.


I Data

The input to a data mining algorithm is most commonly a single flat table comprising a number of fields (columns) and records (rows). In general, each row represents an object and the columns represent properties of objects.

II Typical data mining tasks

Prediction: one task is to predict the value of one field (the class) from the other fields. If the class is continuous, the task is called regression. If the class is discrete, the task is called classification.

Clustering is concerned with grouping objects into classes of similar objects. A cluster is a collection of objects that are similar to each other and dissimilar to objects in other clusters.

Association analysis is the discovery of association rules. Association rules specify correlations between frequent item sets.

Data characterization sums up the general characteristics or features of the target class of data; this class is typically collected by a database query.

Outlier detection is concerned with finding data objects that do not fit the general behavior or model of the data; these objects are called outliers.

Evolution analysis describes and models regularities or trends for objects whose behavior changes over time.

III Outputs of data mining procedures

The outputs of data mining procedures can be, for example:

Equations, e.g. TotalSpent = 189.5275 × Age + 7146 [$]
Predictive rules, e.g. IF income is 100.000 [$] and Gender = Male THEN Not a Big Spender
Association rules, e.g. {Gender = Female, Age 52} → {Big Spender = Yes}
Probabilistic models, e.g. Bayesian networks
Distance and similarity measures, decision trees
Many others
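As a rough illustration of two of these output types, the sketch below evaluates the example equation and measures the quality of an association rule on a toy table. The customer records are invented for this example, the function names are hypothetical, and reading the age condition of the rule as Age >= 52 is an assumption made only for this sketch.

```python
# Toy customer table; every value here is made up for illustration.
customers = [
    {"gender": "Female", "age": 61, "big_spender": True},
    {"gender": "Female", "age": 58, "big_spender": True},
    {"gender": "Female", "age": 55, "big_spender": False},
    {"gender": "Female", "age": 30, "big_spender": False},
    {"gender": "Male", "age": 64, "big_spender": True},
    {"gender": "Male", "age": 41, "big_spender": False},
]


def total_spent(age):
    """Evaluate the example equation TotalSpent = 189.5275 * Age + 7146."""
    return 189.5275 * age + 7146


def rule_quality(rows, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of all rows satisfying both sides of the rule
    confidence = share of antecedent rows that also satisfy the consequent
    """
    n_antecedent = sum(1 for r in rows if antecedent(r))
    n_both = sum(1 for r in rows if antecedent(r) and consequent(r))
    support = n_both / len(rows)
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Assumed reading of the rule: {Gender = Female, Age >= 52} -> {Big Spender = Yes}
support, confidence = rule_quality(
    customers,
    antecedent=lambda r: r["gender"] == "Female" and r["age"] >= 52,
    consequent=lambda r: r["big_spender"],
)
print(f"support={support:.2f} confidence={confidence:.2f}")
print(f"predicted spending at age 40: {total_spent(40):.1f}")
```

Support and confidence are the two most common quality measures for association rules; other measures can be plugged in by changing what `rule_quality` returns.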


Our aim is to study in detail a particular data mining method called GUHA. Its principle was formulated in a paper by Hájek, Havel and Chytil as early as 1966 [3]. GUHA is an acronym for General Unary Hypotheses Automaton. Its computer implementation, LISp-Miner, has been developed at the University of Economics, Prague, by Jan Rauch and Milan Šimůnek. LISp-Miner is freely downloadable from http://lispminer.vse.cz/ . The GUHA approach is suitable, for example, for association analysis, classification, clustering and outlier detection tasks. We start by introducing a real-life data set, which will serve as a benchmark throughout the course. To show how GUHA differs from a Bayesian approach, we take a brief look at B-course, see http://b-course.cs.helsinki.fi/obc/.


The data we use is Tjen-Sien Lim's publicly available data set from the 1987 National Indonesia Contraceptive Prevalence Survey. It contains the interview responses of m = 1473 married women who were not pregnant at the time of the interview. The challenge is to learn to predict a woman's contraceptive method from her demographic and socio-economic characteristics. The 10 survey response variables and their types are

Variable                     Type
Age                          integer, 16–49
Education                    4 categories
Husband's education          4 categories
Number of children borne     integer, 0–15
Islamic                      binary (yes/no)
Working                      binary (yes/no)
Husband's occupation         4 categories
Standard of living           4 categories
Good media exposure          binary (yes/no)
Contraceptive method used    3 categories (None, Long-term, Short-term)
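The variable list above can be written down as a small schema and used to check records before mining. This is only a sketch: the field names, the coding of the 4-category variables as the integers 1 to 4, the "yes"/"no" strings, and the helper `invalid_fields` are assumptions of this example, not part of the data set's documentation.

```python
FOUR_LEVELS = {1, 2, 3, 4}   # assumed coding of the 4-category variables
YES_NO = {"yes", "no"}       # assumed coding of the binary variables

# Allowed values per variable, following the types listed above.
SCHEMA = {
    "age": set(range(16, 50)),          # integer 16..49
    "education": FOUR_LEVELS,
    "husband_education": FOUR_LEVELS,
    "children": set(range(0, 16)),      # integer 0..15
    "islamic": YES_NO,
    "working": YES_NO,
    "husband_occupation": FOUR_LEVELS,
    "standard_of_living": FOUR_LEVELS,
    "good_media_exposure": YES_NO,
    "method": {"None", "Long-term", "Short-term"},
}


def invalid_fields(record):
    """Return the sorted names of fields that are missing or out of range."""
    return sorted(
        name for name, allowed in SCHEMA.items()
        if record.get(name) not in allowed
    )


# A made-up record, used only to exercise the check.
example = {
    "age": 24, "education": 2, "husband_education": 3, "children": 3,
    "islamic": "yes", "working": "no", "husband_occupation": 2,
    "standard_of_living": 3, "good_media_exposure": "yes", "method": "None",
}
print(invalid_fields(example))                 # no problems expected
print(invalid_fields({**example, "age": 15}))  # age is below the range
```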


http://b-course.cs.helsinki.fi/obc/


Speculating about causalities

Remember that dependencies are not necessarily causalities. However, the theory of inferred causation makes it possible to speculate about the causal relations that may have produced the dependencies of the model. There are two different speculations (called the naive model and the not so naive model), which are based on different background assumptions.


How to read a naive causal model?

Naive causal models are easy to read, but they are built on an assumption that is often unrealistic, namely that there are no latent (unmeasured) variables in the domain that cause the dependencies between variables. A simple example of a situation where this assumption is violated can be found in Finland, where the cold winter covers the lakes and the sea with ice. Because of that, most drowning accidents happen in summertime. The warm summer also makes people eat much more ice cream than in wintertime. If you measure both the number of drowning accidents and the ice cream consumption, but do not include a variable indicating the season, there is a clear dependency between ice cream consumption and drowning. Evidently this dependency is not causal (ice cream does not cause drowning, or the other way round), but is due to the excluded variable summer (technically this is called confounding). Naive causal models are built on the assumption that there is no confounding.

In naive causal models there may be two kinds of connections between variables: undirected arcs and directed arcs. Directed arcs denote causal influence from cause to effect, and undirected arcs denote causal influence whose direction cannot be automatically inferred from the data.

You can also read a naive causal model as representing the set of dependency models sharing the same directed arcs. Unfortunately, this does not give you the freedom to re-orient the undirected arcs any way you want. You are free to re-orient the undirected arcs only as long as re-orienting them does not create new V-structures in the graph. A V-structure is a system of three variables A, B, C such that there is a directed arc from A to B and a directed arc from C to B, but there is no arc (neither directed nor undirected) between A and C.
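The V-structure condition can be checked mechanically. The following is a minimal sketch, assuming a graph is given as a collection of directed arcs (cause, effect) and a collection of undirected arcs (pairs of variables); the function names are hypothetical and this is not B-course's own code. It checks only the V-structure rule stated above and does not test whether an orientation introduces cycles.

```python
from itertools import combinations


def v_structures(directed, undirected=()):
    """Return all V-structures (A, B, C): A -> B <- C with no arc between A and C."""
    directed = set(directed)
    # Two variables are adjacent if any arc, directed or not, joins them.
    adjacent = {frozenset(e) for e in directed} | {frozenset(e) for e in undirected}
    parents = {}
    for cause, effect in directed:
        parents.setdefault(effect, set()).add(cause)
    found = set()
    for child, pars in parents.items():
        for a, c in combinations(sorted(pars), 2):
            if frozenset((a, c)) not in adjacent:
                found.add((a, child, c))
    return found


def orientation_allowed(directed, undirected, new_arcs):
    """True if orienting some undirected arcs as new_arcs leaves the V-structures unchanged."""
    before = v_structures(directed, undirected)
    oriented = {frozenset(e) for e in new_arcs}
    remaining = [e for e in undirected if frozenset(e) not in oriented]
    after = v_structures(set(directed) | set(new_arcs), remaining)
    return after == before


# Chain X - Y - Z with both arcs undirected.
chain = [("X", "Y"), ("Y", "Z")]
print(orientation_allowed([], chain, [("X", "Y"), ("Y", "Z")]))  # X -> Y -> Z
print(orientation_allowed([], chain, [("X", "Y"), ("Z", "Y")]))  # X -> Y <- Z
```

In the chain example, orienting the arcs as X -> Y -> Z keeps the set of V-structures empty, whereas X -> Y <- Z creates the new V-structure (X, Y, Z) and is therefore rejected.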


How to read a causal graph produced by B-course?


Causal models are not difficult to read once you learn the difference between the different kinds of arcs. Arcs are drawn with two kinds of lines, solid and dashed. Solid lines indicate relations that can be determined from the data. Dashed lines are used when we know that there is a dependency, but we are not sure about its exact nature. The table below lists the different types of arcs that can be found in causal models.

Solid arc from A to B
    A has direct causal influence on B (direct meaning that the causal influence is not mediated by any other variable that is included in the study).

Dashed arc from A to B
    There are two possibilities, but we do not know which holds. Either A is a cause of B, or there is a latent cause for both A and B.

Dashed line without any arrow heads between A and B
    There is a dependency, but we do not know whether A causes B, or B causes A, or there is a latent cause of them both that produces the dependency (confounding).
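One way to keep the three arc types apart is to record, for each type, the set of explanations it leaves open. The sketch below does this; the arc type labels and function names are hypothetical and chosen only for this example.

```python
# Explanations that each arc type between A and B leaves open,
# following the three cases in the table above.
ARC_MEANINGS = {
    "solid_arrow": {"A causes B"},
    "dashed_arrow": {"A causes B", "latent common cause"},
    "dashed_line": {"A causes B", "B causes A", "latent common cause"},
}


def possible_explanations(arc_type):
    """Return the explanations compatible with the given arc type, sorted."""
    return sorted(ARC_MEANINGS[arc_type])


def is_unambiguous(arc_type):
    """True only when the arc type pins down a single causal explanation."""
    return len(ARC_MEANINGS[arc_type]) == 1


for arc in ARC_MEANINGS:
    print(arc, possible_explanations(arc), is_unambiguous(arc))
```

Only the solid arc commits to a single reading; each dashed variant adds one more alternative that the data cannot rule out.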


[1] W. Frawley, G. Piatetsky-Shapiro and C. Matheus: Knowledge Discovery in Databases: An Overview. In Knowledge Discovery in Databases, eds. G. Piatetsky-Shapiro and W. Frawley (1991) 1–27. Cambridge, Mass.: AAAI Press / The MIT Press.

[2] U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and P. Uthurusamy: Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press (1996).

[3] I. Havel, M. Chytil and P. Hájek: The GUHA Method of Automatic Hypotheses Determination. Computing, Vol. 1 (1966) 293–308.
