
Data Mining Techniques

Statistical
Neural Computing
Intelligent Agents
Genetic Algorithms

Statistical

Point Estimation
Models Based on Summarization
Bayes Theorem and Decision Tree
Hypothesis Testing
Regression and Correlation

Point Estimation
Point estimate: an estimate of a population parameter given as a single number.
May be made by calculating the parameter for a sample.
May be used to predict the value of missing data.

Example: R contains 100 employees.
99 have salary information.
The mean salary of these 99 is $50,000.
Use $50,000 as the value of the remaining employee's salary.
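The salary example can be sketched in a few lines of Python. This is only an illustration: the function name `impute_with_mean` and the three sample salaries are made up for the sketch, and `None` is assumed to mark a missing value.

```python
from statistics import mean


def impute_with_mean(values):
    """Replace missing entries (None) with the mean of the known entries."""
    known = [v for v in values if v is not None]
    estimate = mean(known)  # point estimate of the population mean
    return [estimate if v is None else v for v in values]


# Hypothetical data: three known salaries and one missing salary.
salaries = [40_000, 50_000, 60_000, None]
filled = impute_with_mean(salaries)
print(filled)
```

The sample mean stands in for the unknown value, which is exactly the point-estimate idea: one number summarizes what is known and is reused where data is missing.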

Estimation Error
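Bias: the difference between the expected value of the estimate and the actual value: $\mathrm{Bias} = E[\hat{\theta}] - \theta$
Mean Squared Error (MSE): the expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$

Root Mean Square Error (RMSE): the square root of the MSE: $\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})}$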
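As a rough illustration, the three error measures can be approximated from a set of repeated estimates of a known value. The function name `estimation_errors` and the numbers are hypothetical; the averages over the list stand in for the expected values.

```python
import math


def estimation_errors(estimates, actual):
    """Return (bias, mse, rmse) of repeated estimates of one actual value."""
    n = len(estimates)
    bias = sum(estimates) / n - actual                     # mean estimate minus actual
    mse = sum((e - actual) ** 2 for e in estimates) / n    # mean squared difference
    rmse = math.sqrt(mse)                                  # same units as the estimate
    return bias, mse, rmse


# Hypothetical estimates of a quantity whose actual value is 50.0.
estimates = [48.0, 52.0, 51.0, 53.0]
bias, mse, rmse = estimation_errors(estimates, 50.0)
print(bias, mse, rmse)
```

RMSE is often easier to read than MSE because it is expressed in the same units as the estimated quantity.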



Models Based on Summarization


Basic concepts to provide an abstraction and summarization of the data as a whole.
Statistical concepts: mean, variance, median, mode, etc.

Visualization: display the structure of the data graphically.


Line graphs, Pie charts, Histograms, Scatter plots, Hierarchical graphs
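The statistical summaries listed above are available in Python's standard `statistics` module. The sketch below uses a made-up data set and assumes the population variance is wanted (`pvariance`); `statistics.variance` would give the sample variance instead.

```python
import statistics

# Hypothetical data set to summarize.
data = [2, 4, 4, 5, 7, 9, 4]

summary = {
    "mean": statistics.mean(data),
    "variance": statistics.pvariance(data),  # population variance
    "median": statistics.median(data),
    "mode": statistics.mode(data),
}
print(summary)
```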

Bayes Theorem
Posterior probability: $P(h_1 \mid x_i)$
Prior probability: $P(h_1)$
Bayes theorem: $P(h_1 \mid x_i) = \dfrac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$

Assign probabilities of hypotheses given a data value.
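A small sketch of assigning posterior probabilities to hypotheses for one observed data value. The function name `posterior`, the hypothesis labels, and all probabilities are illustrative assumptions; the evidence term P(x) is computed by summing P(x|h) P(h) over the hypotheses, which assumes they are mutually exclusive and exhaustive.

```python
def posterior(priors, likelihoods):
    """Apply Bayes theorem.

    priors:      dict mapping hypothesis -> P(h)
    likelihoods: dict mapping hypothesis -> P(x | h) for the observed x
    returns:     dict mapping hypothesis -> P(h | x)
    """
    evidence = sum(priors[h] * likelihoods[h] for h in priors)  # P(x)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}


# Hypothetical priors and likelihoods for two hypotheses.
priors = {"h1": 0.6, "h2": 0.4}
likelihoods = {"h1": 0.2, "h2": 0.5}
post = posterior(priors, likelihoods)
print(post)
```

Here the observation shifts belief toward h2 even though h1 had the larger prior, because the data value is more likely under h2.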



Decision Tree
A decision tree can be defined as a root followed by internal nodes.
Each node is labeled with a question, and its branches cover all possible responses.
Used in classification and clustering methods to break problems down into increasingly discrete subsets, working from general to more specific information.
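One possible way to picture this structure is a nested dictionary in which each internal node holds a question and one branch per possible response, and each leaf holds a class label. The attribute names (`income`, `has_guarantor`) and labels are invented for the sketch, and the tree is written by hand rather than learned from data.

```python
# Hypothetical hand-built tree: internal nodes ask a question, leaves carry a label.
tree = {
    "question": "income",
    "branches": {
        "high": {"label": "approve"},
        "low": {
            "question": "has_guarantor",
            "branches": {
                True: {"label": "approve"},
                False: {"label": "reject"},
            },
        },
    },
}


def classify(node, record):
    """Walk from the root to a leaf, following the record's answer at each node."""
    while "label" not in node:
        answer = record[node["question"]]
        node = node["branches"][answer]
    return node["label"]


print(classify(tree, {"income": "low", "has_guarantor": False}))
```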

Hypothesis Testing
Find a model to explain behavior by creating and then testing a hypothesis about the data.
This is the exact opposite of the usual data mining (DM) approach.
H0: null hypothesis, the hypothesis to be tested.
H1: alternative hypothesis.

Chi-Square Test
One technique to perform hypothesis testing.
Used to test the association between two observed variable values and to determine whether a set of observed values is statistically different from what is expected.
The chi-squared statistic is defined as:

$\chi^2 = \sum \dfrac{(O - E)^2}{E}$

O: observed value
E: expected value based on the hypothesis
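The statistic can be computed directly from paired observed and expected values. The function name `chi_squared` and the sample counts below are illustrative only; deciding significance would additionally require comparing the result with a critical value from a chi-squared table, which this sketch does not do.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over paired observed and expected values."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical observed values, each expected to be 75 under the null hypothesis.
observed = [50, 93, 67, 78, 87]
expected = [75, 75, 75, 75, 75]
stat = chi_squared(observed, expected)
print(round(stat, 2))
```

A larger statistic means the observed values sit further from what the hypothesis predicts.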



Regression
Predict future values based on past values.
Fit a set of points to a curve.
Linear regression assumes a linear relationship exists:
$y = c_0 + c_1 x_1 + \dots + c_n x_n$
n input variables (called regressors or predictors)
One output variable, called the response
n+1 constants, chosen during the modeling process to match the input examples


Linear Regression -- with one input value
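For a single input variable the model reduces to y = c0 + c1 x. The sketch below chooses the two constants by ordinary least squares, which is one common fitting method and an assumption here, since the slides do not say how the constants are chosen. The function name `fit_line` and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = c0 + c1 * x; returns (c0, c1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    c1 = sxy / sxx              # slope
    c0 = mean_y - c1 * mean_x   # intercept
    return c0, c1


# Hypothetical past observations.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.1, 4.9, 7.2, 8.8]
c0, c1 = fit_line(xs, ys)

# Predict the response for a new input value.
prediction = c0 + c1 * 5.0
print(c0, c1, prediction)
```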


Correlation
Examine the degree to which the values of two variables behave similarly.
Correlation coefficient r:
1 = perfect correlation
-1 = perfect but opposite correlation
0 = no correlation
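The slide's formula for r did not survive, so the sketch below assumes the usual Pearson correlation coefficient, $r = \dfrac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$, which has the stated range from -1 to 1. The function name `pearson_r` and the data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical variables illustrating the three reference values of r.
x = [1, 2, 3, 4]
r_same = pearson_r(x, [2, 4, 6, 8])        # moves with x
r_opposite = pearson_r(x, [8, 6, 4, 2])    # moves against x
r_none = pearson_r(x, [1, -1, -1, 1])      # no linear relationship with x
print(r_same, r_opposite, r_none)
```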


Neural Computing
Neural networks use many connected nodes to examine large amounts of data and find patterns, so that large amounts of data can be processed quickly.
They can be used to model complex relationships between inputs and outputs or to find patterns in data.
Using neural networks as a tool, data warehousing firms harvest information from datasets.
A neural network essentially comprises three pieces: the architecture or model, the learning algorithm, and the activation functions.
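The three pieces can be seen in the smallest possible case, a single-node perceptron. In this sketch the architecture is one node with two inputs and a bias, the activation function is a step function, and the learning algorithm is the classic perceptron update rule. These choices, the function names, and the logical-AND training data are assumptions made for illustration; practical networks have many nodes and use other activations and training methods.

```python
def step(z):
    """Activation function: fire (1) only when the weighted sum is positive."""
    return 1 if z > 0 else 0


def predict(weights, bias, inputs):
    """Architecture: one node computing a weighted sum followed by the activation."""
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, epochs=20, lr=1):
    """Learning algorithm: nudge weights toward the target after each mistake."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias


# Hypothetical training data: the logical AND of two binary inputs.
AND_SAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(AND_SAMPLES)
print([predict(weights, bias, inputs) for inputs, _ in AND_SAMPLES])
```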

Intelligent Agents
An intelligent agent is software that assists people and acts on their behalf.
Intelligent agents work by allowing people to delegate work that they could have done themselves to the agent software.
These are special types of software applications used for data filtering and analysis, information brokering, condition monitoring and alarm generation, workflow management, personal assistance, simulation and gaming, etc.

Genetic Algorithms

Genetic algorithms work on the principle of expanding the set of possible outcomes.
They are used for clustering and association rules.
Given a fixed number of possible outcomes, they seek to define new and better solutions.
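A toy sketch of the idea: start from a fixed-size population of candidate solutions and repeatedly combine and mutate the better ones to produce new candidates. Everything specific here is an assumption for illustration: solutions are bit lists, fitness is simply the number of 1 bits, and the selection, single-point crossover, mutation, and elitism choices are one simple variant among many.

```python
import random


def evolve(fitness, length=12, pop_size=20, generations=40,
           mutation_rate=0.05, seed=0):
    """Return (best initial solution, best final solution) as bit lists."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    initial_best = max(population, key=fitness)
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[:pop_size // 2]   # selection: keep the fitter half
        children = [population[0]]             # elitism: carry over the best as-is
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, length)     # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    return initial_best, max(population, key=fitness)


# Hypothetical fitness function: count of 1 bits in the solution.
initial_best, best = evolve(fitness=sum)
print(sum(initial_best), sum(best))
```

Because the best solution of each generation is carried over unchanged, the best fitness in the population can never decrease from one generation to the next.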
