
MACHINE LEARNING

Short description of concepts and terms

UNIT-1

1. INTRODUCTION

1. Machine learning uses the theory of statistics in building mathematical
   models.

2. Machine learning is useful in: (i) constructing good and useful
   approximations, (ii) detecting patterns and regularities, and (iii)
   making future predictions from past data.

3. Application of machine learning methods to large databases is called
   data mining.

4. EXAMPLES:
 Association rule: Basket analysis (finding associations
between products bought by customers, e.g., customers who bought
both X and Y: bread & jam, pencil & sharpener).

 Classification: Financial institution or bank (classifying high-
risk & low-risk customers based on income and savings).
o Discriminant: Function that separates the examples of
different classes.
o Prediction: Predicting High-risk & Low-risk customers,
Medical diagnosis from past history (Making predictions for
the novel (new) instances if the future is similar to the past).
o Pattern Recognition:
 OCR (Optical Character Recognition) - Recognizing
character codes from their images. (used to read zip
codes on envelopes or amounts on checks).
 Face Recognition: Input is an image and the task is to associate face
images with identities. (More difficult than OCR because faces are
three-dimensional, lighting causes changes, and glasses may hide the eyes.)
 Medical Diagnosis: The inputs describe the patient and the
classes are the illnesses. (Inputs contain the patient’s age, gender,
past medical history, and current symptoms. Wrong
decisions may lead to wrong treatment.)
 Speech recognition: Input is sound (acoustic) and
classes are words that can be uttered. (Different people
(age, gender, accent) may pronounce the same word
differently)
 Biometrics: Recognition or authentication of people
using their physiological and/or behavioral
characteristics that requires an integration of inputs
from different modalities. (Images of face, finger print,
palm, Iris, Voice, Signature)

o Knowledge Extraction: Learning a rule from data.
(Learning a discriminant function separating high-risk &
low-risk customers and extracting knowledge to target
potential low-risk customers.)
o Compression: Fitting a rule to the data that is simpler than the
data itself; it requires less memory to store and less computation
to process.
o Outlier Detection: Finding the instances that do not obey
the rule and the expectations (e.g., detecting fraud in banking).

 Regression: Example: Predicting the price of a used car, an old
house, etc. (Used car: the independent variables are brand,
year, mileage, and color; the single dependent variable is the price of
the car. In a regression problem the output is a number.)

 Supervised Learning: Labeled data, i.e., learning from past
known data. (Both regression and classification are examples.)
o Regression example: Navigating a mobile robot or
driving an autonomous car without hitting obstacles by
adjusting the route. (Using video sensors, GPS, etc.)

 Unsupervised Learning: Unlabeled data. Only the input is given and
the aim is to find regularities in the input.
o Density Estimation: Finding certain patterns in the input
space that occur more often than others.
o Clustering: Finding clusters or groupings of input.
Example:
 Customer segmentation – Grouping based on
age, gender, education, etc (demographics)
 Customer relationship management – Grouping
based on strategies of a company for different
services and products to different customer
groups (aged, adult, male, female, etc)
 Image Compression – Image pixels are represented
as RGB values; similar or frequently occurring colors
are grouped together.
 Document clustering – Grouping similar
documents. (Example: News reports – Politics,
Sports, Business, Weather, etc.)

 Reinforcement Learning: Learning a policy from past
action sequences. (Examples: Game-playing policy – finding the
sequence of right moves that is good; a robot navigating in an
environment – finding the right sequence of actions and
directions to reach its goal state without hitting the obstacles.)

2. SUPERVISED LEARNING (Unit-1)

1. Learning a class from examples

Example: Learning the class of a “family car”.


 Positive examples: Cars that people look at and label as family cars.
 Negative examples: All the other cars.
 Input representation: Two attributes: Price & Engine
power.
 Hypothesis class: The set of axis-aligned rectangles; the learning
algorithm finds a particular hypothesis [the model chosen to
proceed further] from this class.
 Empirical error: The proportion of training instances where the
predictions of the hypothesis do not match the required (labeled) values.
 Generalization: How well the chosen hypothesis/model
correctly classifies or works for future examples (new
instances).
o Generalization error: The error made when the chosen
hypothesis/model does not classify or work well for
new instances.
 The most specific hypothesis (S): The tightest rectangle
that includes all the positive examples and none of the negative
examples.
 The most general hypothesis (G): The largest rectangle
that includes all the positive examples and none of the negative
examples.
 Version space: The set of valid hypotheses (between S and G) that
make no error and are said to be consistent with the training set.
 Margin: Distance between the boundary and the instances
closest to it.
 Doubt: The case where an instance cannot be labeled with certainty due
to lack of data. In this case, the system rejects the instance and defers
the decision to a human expert.
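
Since the family-car example above uses a rectangle in the (price, engine power) space as the hypothesis class, the following is a minimal Python sketch of one such hypothesis and its empirical error; the rectangle boundaries and the toy labeled data are hypothetical values chosen only for illustration.

```python
# A minimal sketch of a rectangle hypothesis for the "family car" class.
# The thresholds (price in $1000s, engine power in cc) are hypothetical.

def family_car_hypothesis(price, engine_power,
                          p1=15.0, p2=25.0, e1=1200.0, e2=1800.0):
    """Return 1 (positive) if the instance falls inside the rectangle, else 0."""
    return int(p1 <= price <= p2 and e1 <= engine_power <= e2)

def empirical_error(hypothesis, examples):
    """Fraction of labeled examples the hypothesis gets wrong.
    examples: list of ((price, engine_power), label) pairs."""
    wrong = sum(hypothesis(x[0], x[1]) != label for x, label in examples)
    return wrong / len(examples)

if __name__ == "__main__":
    data = [((18.0, 1500.0), 1), ((30.0, 2500.0), 0), ((20.0, 1600.0), 1)]
    print(empirical_error(family_car_hypothesis, data))  # 0.0 on this toy set
```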

2. Vapnik-Chervonenkis (VC) Dimension: The maximum number of
   points that can be shattered by H, denoted VC(H); it measures the
   capacity of H. {H – hypothesis class}

3. Probably Approximately Correct (PAC) Learning: Finding a
   hypothesis that is approximately correct (the region of difference between
   the true class C and the hypothesis h is small) with high probability;
   both the error bound and the probability can be specified.

4. Noise: Any unwanted anomaly in the data (wrongly recorded input data,
   irrelevant attributes, missing input data, hidden attributes).
    Teacher’s noise: Labeling a positive instance as negative, and vice versa.
    Principle of Occam’s razor: Unnecessary complexity should be
   shaved off (simpler explanations are preferred).

5. Learning Multiple Classes: Learning more than two classes.


 Example: Three classes: family car, sports car, and luxury sedan.
Reject: We cannot choose a class, and this is the case of doubt
and the classifier rejects such cases.
6. Regression: Output is a numeric value. Learning a numeric function.
 Interpolation: There is no noise in the task. Finding the
function that passes through these points.
 Extrapolation: Predicting the output for an input x that is
outside the range of the training set.
 Time-series prediction: We have data up to the present and
we want to predict the future.
Example: With two inputs X1, X2 (0, 1) there are four possible
cases [(0,0), (0,1), (1,0), (1,1)] and there are sixteen possible
Boolean functions (h1, h2, …….., h16).

7. Model Selection and Generalization:


 Learning a Boolean function: For d binary inputs, there are 2^d possible
input cases and 2^(2^d) possible Boolean functions of d inputs. (Refer to
the previous example: d = 2 gives 4 cases and 16 functions.)
 Ill-posed problem: The data alone are not sufficient to find a unique
solution.
 Bias: Erroneous or extra assumptions.
 Inductive bias: A set of extra assumptions made to find a unique
solution with the given data (e.g., assuming a linear model and, among
lines, choosing the one that minimizes squared error).
 Model selection: How to choose the right bias.
 Generalization: How well a model trained on the training set
predicts the right output for new instances.
 Underfitting: The hypothesis is too simple for the data, so we end up
with a bad hypothesis.
 Overfitting: An overcomplex hypothesis learns not only the
underlying function but also the noise in the data.
 Triple Trade-off: Three factors: (1) complexity of the hypothesis, (2)
amount of training data, (3) generalization error on new
examples. As the amount of training data increases, the generalization
error decreases; as complexity increases, the generalization error first
decreases and then increases.
 Validation set: Used to test the generalization ability.
 Cross validation: Finding the best hypothesis that is the most
accurate on the validation set.
 Test set: Also known as publication set. It contains examples that
are not used in training or validation.
Example: When we are taking a course:
(i) Training set: The example problems that the instructor
solves in class while teaching a subject.
(ii) Validation set: Exam questions
(iii) Test set: Problems we solve in our later, professional life.

8. Dimensions of a Supervised Machine Learning Algorithm: The
   sample is independent and identically distributed (iid). The aim is to
   build a good and useful approximation. Three decisions we must make:
   (1) Model, which we use in learning.
   (2) Loss function, to compute the difference between the desired
       output and our approximation to it.
   (3) Optimization procedure, that minimizes the total error.

UNIT-2

3. BAYESIAN DECISION THEORY

1. Introduction:

 In classification, Bayes’ rule is used to calculate the probabilities
of the classes. (Probability theory is used to analyze the problem.)
 Example: Tossing a coin
 It is a random process because we cannot predict whether the
outcome will be head or tail.
 We can only guess the probability that the next toss would be
head or tail.
 Extra knowledge needed to predict the exact outcome
(head/tail): Exact composition of the coin, its initial position,
force and its direction applied to the coin while tossing it, etc.
 Observable variable: Only the outcome of the toss.
 Unobservable variable: The extra pieces of knowledge to which we do
not have access.
 Sample: It contains examples drawn from the probability
distribution of the observables. In this example, the sample
contains the outcomes of the past N tosses.

[Estimated by: P(heads) ≈ #{tosses with outcome heads} / #{total number of tosses}]
Example: Given the sample {heads, heads, heads, tails, heads,
tails, tails, heads, heads}, we have X = {1, 1, 1, 0, 1, 0, 0, 1, 1} and
the estimate is 6/9. (heads -> 1, tails -> 0)

2. Classification: (Using Bayesian decision theory)

Example: Learning the class of “high-risk customer” in a bank.


 Observable variables: Customer’s income and savings.
 Unobservable variables: State of economy in full detail, full
knowledge of the customer, intention, moral codes, etc.
 Bernoulli random variable: Credibility of the customer,
denoted by C. [C=1->high risk customer, C=0-> low risk
customer]
 Bayes’ rule: Used to calculate the probabilities.
 Prior probability: Calculating the probability that a customer
is high-risk [P(C=1)]. It is the prior knowledge that has to be
valued before looking at the observables.
 Class likelihood: The conditional probability that an event belonging
to a class (e.g., a high-risk customer) has the associated observation value.
 Evidence: The marginal probability that an observation is seen,
regardless of whether it is a positive or a negative example.
 Posterior probability: Combining the prior and what the data
tells using Bayes’ rule. [Posterior = (prior × likelihood) /
evidence].
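
A minimal sketch of the posterior computation described above for the two-class (high-risk / low-risk) case; all prior and likelihood values below are hypothetical numbers chosen only for illustration.

```python
# Posterior = (prior x likelihood) / evidence, for classes C=1 (high risk)
# and C=0 (low risk). All numbers are hypothetical.

def posterior(prior_c1, likelihood_x_given_c1, likelihood_x_given_c0):
    prior_c0 = 1.0 - prior_c1
    # Evidence p(x) = sum over classes of p(x|C) * P(C)
    evidence = likelihood_x_given_c1 * prior_c1 + likelihood_x_given_c0 * prior_c0
    return likelihood_x_given_c1 * prior_c1 / evidence

# P(C=1) = 0.3; p(x|C=1) = 0.6; p(x|C=0) = 0.2 for some observed (income, savings) x
p_c1_given_x = posterior(0.3, 0.6, 0.2)
print(p_c1_given_x)        # ~0.5625
print(p_c1_given_x > 0.5)  # choose C=1 if its posterior is the larger one
```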

3. Losses and Risks: Example: A financial institution must analyze
   both the potential gain and the potential loss when making a decision
   about a loan applicant.
   (i) Accepting a high-risk applicant erroneously increases loss & risk.
   (ii) Rejecting a low-risk applicant erroneously decreases the gain,
        but there is no loss & risk as in the previous case.
 Other domains: Medical diagnosis and Earthquake prediction.
 Loss function & Expected risk: Calculating the loss incurred and
expected risk while taking a wrong action.
 Reject: Wrong decisions and misclassifications may have very
high cost and additional actions are required. (Reject/Doubt)
 Example: Using an optical digit recognizer to read postal codes
on envelopes, wrongly recognizing the code causes the
envelope to be sent to a wrong destination.
4. Discriminant Functions: They separate the examples of different classes.
    The maximum discriminant function corresponds to the minimum
   conditional risk.
    When there are two classes, a single discriminant function is enough.
    Decision regions: The feature space is divided into decision
   regions, and the regions are separated by decision boundaries.
    Classification system: (K – number of decision regions/classes)
    Dichotomizer: When K = 2.
    Polychotomizer: When K > 2.

5. Utility Theory: Used for making rational decisions when we are
   uncertain about the state.
    Utility function: Measures how good it is to take a given action
   when the world is in a given state.
    Expected utility: The rational decision maker chooses the
   action that maximizes the expected utility.
    Maximizing expected utility is equivalent to minimizing expected risk.

6. Association Rules: Example: Basket analysis (pencil & eraser,
   bread & jam, etc.)
    Finding the dependency between two items X and Y.
    An association rule is an implication of the form X → Y.
    X is the antecedent and Y is the consequent.
    Three measures are frequently used (see the Python sketch at the end
   of this section):

   (i) Support: Shows the statistical significance of the rule.
       Support(X, Y) ≡ P(X, Y) = #{customers who bought X and Y} / #{customers}

   (ii) Confidence: Shows the strength of the rule.
       Confidence(X → Y) ≡ P(Y|X) = #{customers who bought X and Y} / #{customers who bought X}

   (iii) Lift (Interest): Lift(X → Y) = P(X, Y) / (P(X)P(Y)) = P(Y|X) / P(Y)
        If the lift value > 1, then X makes Y more likely.
        If the lift value < 1, then X makes Y less likely.
 Apriori algorithm: Two steps:
(1) Finding frequent itemsets, that is, those which have enough
support.
(2) Converting them to rules with enough confidence, by
splitting the items into two (antecedent & consequent).

 Hidden variables: Variables whose values are not known through evidence.
Example: Basket analysis – For a customer who buys the items “baby
food, diapers and milk”, there is a dependency between these
three items. “Baby at home” may be designated as a hidden node that is
the cause of the consumption of these three items.
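
A minimal Python sketch of the three measures (support, confidence, lift) defined above, computed on a toy set of hypothetical transactions.

```python
# Support, confidence and lift for the rule X -> Y on hypothetical baskets.

transactions = [
    {"bread", "jam", "milk"},
    {"bread", "jam"},
    {"bread", "milk"},
    {"pencil", "eraser"},
    {"bread", "jam", "eraser"},
]

def support(x, y):
    """P(X, Y): fraction of customers who bought both X and Y."""
    return sum(1 for t in transactions if x in t and y in t) / len(transactions)

def confidence(x, y):
    """P(Y | X): among customers who bought X, the fraction who also bought Y."""
    bought_x = [t for t in transactions if x in t]
    return sum(1 for t in bought_x if y in t) / len(bought_x)

def lift(x, y):
    """P(Y | X) / P(Y): > 1 means X makes Y more likely."""
    p_y = sum(1 for t in transactions if y in t) / len(transactions)
    return confidence(x, y) / p_y

print(support("bread", "jam"))     # 3/5 = 0.6
print(confidence("bread", "jam"))  # 3/4 = 0.75
print(lift("bread", "jam"))        # 0.75 / 0.6 = 1.25
```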

4. PARAMETRIC METHODS (Unit-2)

1. Introduction:

 The sample is drawn from some distribution that obeys a known
model. Example: Gaussian (a bell-shaped curve that follows the
normal distribution, with an equal number of measurements above and
below the mean value).
 The model is defined up to a small (fixed) number of parameters.
Example: mean, variance.
 Once these parameters are estimated, the whole distribution is
known.
 Estimate the parameters of the distribution from the given
sample.
 Plug-in these estimates to the assumed model.
 Get an estimated distribution to make a decision.
 Method used to estimate parameters of distribution: Maximum
likelihood estimation.

2. Maximum Likelihood Estimation (MLE): The sample is independent and
   identically distributed (iid).
    Likelihood: The product of the likelihoods of the individual
   points (because the sampling is independent).
    Maximum Likelihood Estimation: (i) Find the parameter values that make
   the sample most likely to have been drawn. (ii) Maximize the log of the
   likelihood, which does not change the maximizing value. (iii) The log
   converts the product into a sum, which simplifies the computation.
 Different distributions:
(i) Bernoulli Density: Used for two-class problem. There are two
outcomes -- an event occurs or it does not occur.
(ii) Multinomial Density: For k>2 classes. The outcome of a
random event is one of K mutually exclusive and exhaustive
states.
(iii) Gaussian (normal) Density: Frequently used for modeling
class conditional input densities with numeric input. (Distributed
with mean and variance).
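
A minimal sketch of maximum likelihood estimation for the Bernoulli and Gaussian densities listed above; the samples are hypothetical, and the coin sample repeats the 6/9 example from the earlier section.

```python
# Maximum likelihood estimates under iid sampling; the samples are hypothetical.
import math

# Bernoulli: the MLE of p is the sample proportion of 1s (the 6/9 coin example).
coin = [1, 1, 1, 0, 1, 0, 0, 1, 1]
p_hat = sum(coin) / len(coin)            # 6/9 ≈ 0.667

# Gaussian: the MLEs are the sample mean and the (biased) sample variance.
x = [2.1, 1.9, 2.4, 2.0, 1.6]
mean_hat = sum(x) / len(x)
var_hat = sum((xi - mean_hat) ** 2 for xi in x) / len(x)   # divide by N, not N-1
std_hat = math.sqrt(var_hat)

print(p_hat, mean_hat, var_hat, std_hat)
```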

3. Evaluating an Estimator: Bias and Variance


 Mean Square Error (MSE): The average squared difference between
the value of the estimator and the parameter being estimated.
 Unbiased estimator: The expected value of the estimator over samples
is equal to the parameter being estimated.
 Variance: How much, on average, the estimates vary around their
own expected value.
 Bias: How much the expected value of the estimator differs from the
correct (true) parameter value.
[Note: Low Bias and Low variance are always preferable]
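
The note above reflects the standard decomposition of the mean square error of an estimator d for a parameter θ into a bias term and a variance term:

```latex
r(d, \theta) = E\big[(d - \theta)^2\big]
             = \underbrace{\big(E[d] - \theta\big)^2}_{\text{bias}^2}
             + \underbrace{E\big[(d - E[d])^2\big]}_{\text{variance}}
```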

4. The Bayes’ Estimator: Treats the parameter as a random variable and
   models the uncertainty in its value (especially useful when the sample is
   small and prior information about the parameter is available).
 θ is the parameter being estimated.
 Prior density: Tells us the likely values that θ may take before
looking at the sample.
 Posterior density: Tells us the likely θ values after looking at the
sample (combines the prior density with the likelihood of the sample).
 Maximum A Posteriori (MAP) estimate: Replaces the whole posterior
density with a single point, assuming it has a narrow peak around
its mode (thus getting rid of the integral).
 Bayes’ Estimator: Defined as the expected value of the posteriori
density.
 The Bayes’ estimator is a weighted average of the prior
mean and the sample mean, with weights being inversely
proportional to their variances.
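
For instance, in the standard Gaussian case (sample x^t ~ N(θ, σ²) with known σ², prior θ ~ N(μ₀, σ₀²), sample mean m), the Bayes’ estimator is the variance-weighted average:

```latex
E[\theta \mid X] = \frac{N/\sigma^{2}}{N/\sigma^{2} + 1/\sigma_{0}^{2}}\, m
                 + \frac{1/\sigma_{0}^{2}}{N/\sigma^{2} + 1/\sigma_{0}^{2}}\, \mu_{0}
```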
5. Regression: There are one or more independent variables and one dependent
   variable.
    The inputs are called the independent variables and the numeric
   output is called the dependent variable.
 Assume the numeric output is the sum of a deterministic function
of the input and random noise.
 Use maximum likelihood to learn the parameters.
 Least squares estimate: The parameter (θ) that minimizes the
most frequently used error function.
 Measures to make regression a good fit:
 Linear model: Assume a linear model; the parameters are found by
taking the derivative of the sum of squared errors and setting it to zero.
 Polynomial regression: The model is a polynomial in x
(input) of order k.
 Relative Square Error (RSE): If this value is close to 0, the fit is
good; if it is close to 1, the model does no better than simply
predicting the average of the output.
 Coefficient of determination: R² = 1 – RSE; for a good fit, the R²
value should be close to 1.
 For best generalization, adjust and tune the complexity of the
model to best fit.
 In polynomial regression, choose the best order that minimizes
the generalization error.
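
A minimal NumPy sketch of least squares fitting, polynomial regression, and the RSE / R² measures described above; the noisy data are hypothetical.

```python
# Least squares estimates for a linear and a polynomial model (hypothetical data).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2.0 * x + 0.5 + rng.normal(0, 0.1, size=x.shape)   # deterministic function + noise

# Linear model: minimize the sum of squared errors over (w1, w0).
w1, w0 = np.polyfit(x, y, deg=1)

# Polynomial regression of order k = 3.
coeffs = np.polyfit(x, y, deg=3)
y_hat = np.polyval(coeffs, x)

# Relative square error and coefficient of determination.
rse = np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
r2 = 1.0 - rse
print(w1, w0, rse, r2)
```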

6. Model Selection Procedures:

 Cross validation: A method used to find the optimal complexity.
Bias and variance cannot be calculated separately, but the total error can
be calculated. The data are divided into two parts: a training set and a
validation set. As the model complexity increases, the training error keeps
decreasing, while the validation error first decreases and then starts to
increase.
 Regularization: Used for tuning the function by adding an
additional penalty term in the error function. Minimize
augmented error function and penalize complex model to
decrease variance.
 Structural Risk Minimization: It uses a set of models ordered in
terms of their complexities. (Example: Polynomial order) It
corresponds to finding the simplest and best model in terms of
order and empirical error in data.
 Minimum Description Length (MDL): Uses an information-theoretic
measure. The Kolmogorov complexity of a dataset is defined as the length
of the shortest program that outputs the data, i.e., the shortest
description of the data. If the data is simple, it has a short
description (low complexity).
 Bayesian Model Selection: Used when we have some prior
knowledge about appropriate class of approximating functions.

 When the prior is chosen so that we give higher probabilities to
simpler models (following Occam’s razor), the Bayesian
approach, regularization, SRM, and MDL are equivalent.

 Cross-validation is different from all the other methods for model
selection in that it makes no prior assumption about the model.
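
A minimal sketch of the cross-validation idea above: the polynomial order is chosen by the error on a held-out validation set rather than by the training error. The data and the split are hypothetical.

```python
# Choosing the polynomial order with a held-out validation set (hypothetical data).
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=x.shape)

# Split into a training set and a validation set.
idx = rng.permutation(len(x))
train, valid = idx[:40], idx[40:]

def validation_error(order):
    coeffs = np.polyfit(x[train], y[train], deg=order)
    residuals = y[valid] - np.polyval(coeffs, x[valid])
    return np.mean(residuals ** 2)

errors = {k: validation_error(k) for k in range(1, 9)}
best_order = min(errors, key=errors.get)
print(errors)
print("best order:", best_order)  # training error always drops; validation error does not
```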

UNIT-3

5. MULTIVARIATE METHODS
1. Introduction:

 Multivariate data have multiple inputs (continuous or discrete).
 The output is a class code or a continuous value, and it is a function of
these inputs.
o Continuous – Values are continuous and
measurable. It can be subdivided. Example: age,
height, temperature, etc.
o Discrete – Values are countable but not measurable
and it cannot be subdivided into parts. Example:
No. of students in a class, No. of questions answered
correctly, No. of books in the shelf, etc.

2. Multivariate Data:

 Sample: It is viewed as a data matrix with d columns, the d
variables (columns) denoting the results of measurements on an individual
or event.
 Measurements: Made on an individual or event, generating an
observation vector.
 Input: The columns of the sample are called inputs, features or attributes.
The N rows correspond to independent and identically distributed
(iid) observations, examples or instances on N individuals or
events.

 Example: In deciding on a loan application, an observation vector is
associated with each customer (age, marital status, yearly income, etc.).
 These measurements may be of different scales.
o Age in numbers & yearly income in monetary units.
o Age is continuous & marital status may be discrete.
 All these variables are correlated for multivariate analysis.
 Multivariate classification: Predicted variable is discrete.
 Multivariate regression problem: Predicted variable is numeric.

3. Parameter Estimation:

 Mean vector: The mean vector μ is defined such that each of its
elements is the mean of one column of X.
 Covariance matrix: Represented as d x d matrix. Diagonal terms
are the variances, off diagonal terms (other terms in matrix) are
covariances and the matrix is symmetric (equal to its transpose).
 Correlation: Represents the correlation between variables (a value between
-1 and +1). If two variables are independent, then their correlation is 0.
 Sample mean: The maximum likelihood estimator for the mean.
 Estimation of sample covariance and sample correlation.
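
A minimal NumPy sketch of the estimates above (mean vector, covariance matrix, correlation matrix) on a small hypothetical sample; note that np.cov uses the unbiased N−1 denominator.

```python
# Sample mean vector, covariance matrix and correlation matrix (hypothetical data).
import numpy as np

# N = 5 observations, d = 3 variables (e.g. age, income, savings); values are hypothetical.
X = np.array([
    [25, 30_000,  2_000],
    [32, 45_000,  5_000],
    [47, 52_000,  9_000],
    [51, 40_000,  7_500],
    [38, 61_000, 12_000],
], dtype=float)

mean_vector = X.mean(axis=0)                 # one mean per column
cov_matrix = np.cov(X, rowvar=False)         # d x d, symmetric; diagonal = variances
corr_matrix = np.corrcoef(X, rowvar=False)   # entries between -1 and +1

print(mean_vector)
print(cov_matrix)
print(corr_matrix)
```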

4. Estimation of missing values:

 Imputation: Fill-in the missing entries by estimating them.


 Mean Imputation:
 (i) For a numeric variable, substitute the mean (average)
of the available data for that variable in the sample.
 (ii) For a discrete variable, fill-in the most likely value i.e,
the value most often seen in the data.
 Imputation by regression: Trying to predict the value of a
missing variable from other variables whose values are known for
that case.
 Sometimes the fact that a certain attribute value is missing may itself be
important. Example: In a credit card application, the applicant may not
declare a telephone number, which may be a critical piece of
information. In such cases, this is represented as a separate value
to indicate that the value is missing, and it is used as such.
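
A minimal sketch of mean imputation for a numeric attribute and most-likely-value (mode) imputation for a discrete attribute; the small columns below are hypothetical, with None marking a missing entry.

```python
# Mean imputation for a numeric column and mode imputation for a discrete column.
import statistics

ages = [25, 32, None, 51, 38]                                   # numeric attribute
marital = ["single", None, "married", "married", "married"]     # discrete attribute

age_mean = statistics.mean(a for a in ages if a is not None)
ages_filled = [a if a is not None else age_mean for a in ages]

marital_mode = statistics.mode(m for m in marital if m is not None)
marital_filled = [m if m is not None else marital_mode for m in marital]

print(ages_filled)      # missing age replaced by the average of the observed ages
print(marital_filled)   # missing status replaced by the most frequent value
```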

5. Multivariate Normal Distribution:

 Mahalanobis distance: The distance between two points in
multivariate space, measured relative to the centroid and taking the
covariance of the variables into account.
 Σ is the covariance matrix.
 Use of Inverse of the Σ:
(i) If the variable has a larger variance than another, it
receives less weight in Mahalanobis distance.
(ii) Two highly correlated variables do not contribute as
much as two less correlated variables.
 Z-normalization: Standardizing variables to zero mean and unit variance.
 The density becomes an ellipse if the variances are different. The
density rotates depending on the sign of the covariance
(correlation).
 From the equation of an ellipse: When ρ > 0, the major axis of
the ellipse has a positive slope and if ρ < 0, the major axis has a
negative slope.
 In the expanded Mahalanobis distance of equation:
(i) Each variable is normalized to have unit variance.
(ii) There is the cross-term that corrects for the correlation
between the two variables.
 The density depends on five parameters: the two means, the two
variances, and the correlation.
 If ρ is +1 or −1, the two variables are linearly related.
 If ρ = 0, then the two variables are independent.
 Small |Σ| may also indicate that there is high correlation between
variables.
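
A minimal NumPy sketch of the Mahalanobis distance of a point from the sample mean, using the inverse covariance matrix Σ⁻¹ as above; the sample and the query point are hypothetical.

```python
# Mahalanobis distance of a point x from the mean of a sample (hypothetical data).
import numpy as np

X = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.5], [4.0, 3.0], [5.0, 5.0]])
mu = X.mean(axis=0)
sigma = np.cov(X, rowvar=False)      # 2 x 2 covariance matrix
sigma_inv = np.linalg.inv(sigma)

x = np.array([4.5, 2.0])
diff = x - mu
d_mahalanobis = np.sqrt(diff @ sigma_inv @ diff)

# For comparison: Euclidean distance ignores variances and correlations.
d_euclidean = np.linalg.norm(diff)
print(d_mahalanobis, d_euclidean)
```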

6. Multivariate Classification:

 Main advantages:
 Analytical simplicity
 Useful approximation for real data.
 Robust due to its mathematical tractability (easily handled).
 Requirements: The sample of a class should form a single group. If
there are multiple groups, one should use a mixture model.

 Example: Predicting the type of car that a customer would be
interested in.
 Classes – Different cars.
 Observable data (x) – Age and Income of the customers.
 Vector (μi) – Mean age and income of the customers who buy car
type i.
 Covariance matrix (Σi) - σ²i1 and σ²i2 are the age and income
variances and σi12 is the covariance of age and income in the
group of customers who buy car type i.

 Procedure
 Define the discriminant function.
 Estimates for the mean and covariances are found using
maximum likelihood separately for each class.
 These are plugged into the discriminant function to get the
estimates for the discriminants.
 Defines a quadratic discriminant.
 Estimate the number of parameters for means and covariance
matrices.
 Estimate a common (shared) covariance matrix for all classes; the
decision boundaries then become linear, which leads to a linear discriminant.
 Unequal priors shift the boundary toward the less likely class.
 Methods:
 Naive Bayes’ classifier: Used for further simplification, by
assuming all off-diagonals of the covariance matrix to be 0, thus
assuming independent variables.
 Euclidean distance: Used for further simplification from the
above, if we assume all variances to be equal, the Mahalanobis
distance reduces to Euclidean distance.
 Nearest Mean Classifier: Assigns the input to the class of the
nearest mean.
 Template matching procedure: Each mean is thought of as the
ideal prototype or template for the class.

 Finally, instead of learning the discriminant functions, a suitable distance
function can be learned.
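
A minimal sketch of the nearest mean (template matching) classifier described above, using Euclidean distance; the class names and the (age, income) training data are hypothetical.

```python
# Nearest-mean classifier: each class is represented by the mean (template) of
# its training examples; a new instance is assigned to the class of the nearest
# mean. The training data (age, income in $1000s) are hypothetical.
import numpy as np

train = {
    "economy": np.array([[22, 25], [30, 32], [27, 28]], dtype=float),
    "luxury":  np.array([[45, 90], [52, 110], [48, 95]], dtype=float),
}

means = {c: X.mean(axis=0) for c, X in train.items()}

def predict(x):
    x = np.asarray(x, dtype=float)
    return min(means, key=lambda c: np.linalg.norm(x - means[c]))

print(predict([25, 30]))   # -> "economy"
print(predict([50, 100]))  # -> "luxury"
```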

7. Tuning Complexity:

 Risk of introducing bias: When we make simplifying assumptions
about the covariance matrices and decrease the number of parameters
to be estimated.
 Large variance on small datasets: If no simplifying assumption is
made and the matrices are arbitrary, the quadratic discriminant may
have large variance on small datasets.
 For small dataset: It may be better to assume a shared covariance
matrix. A single covariance matrix has fewer parameters and it can be
estimated using more data, that is, instances of all classes. This
corresponds to using linear discriminants, which is very frequently used
in classification.
 To measure similarity: Euclidean distance assumes independent variables
with equal variances; the Naive Bayes’ classifier assumes independent
variables (a diagonal covariance matrix) with possibly different variances;
when the variables are dependent, the full Mahalanobis distance is needed.
 Regularized Discriminant Analysis(RDA): A method that combines all
special cases. Regularization is done when one starts with high variance
and constrains toward lower variance, at the risk of increasing bias.
 Bayesian approach: Used for regularization by defining priors,
when the dataset is small.

8. Discrete features:

 Discrete attributes: Each takes one of n different values.
Example: An attribute may be color ∈ {red, blue, green, black}, or
another attribute may be pixel ∈ {on, off}.
 Document categorization: Example: Classifying news reports into
various categories, such as, politics, sports, fashion, etc.
 Bag of words: We choose a priori d words that we believe give
information regarding the class. Xj is 1 if word j occurs in the document
and is 0 otherwise.
Example: In news classification, words such as “missile,” “athlete,” and
“couture” are useful, rather than ambiguous words such as “model,” or
even “runway.” (Specific words are more useful than general words.)
 Estimate the word probabilities per class. Words whose probabilities
are similar across different classes do not convey much information.
 Words whose probability is high for one class and low for the other
classes are the useful ones.
 Spam filtering: Another example of document categorization.
 Example: There are two classes of emails: spam and legitimate.
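
A minimal sketch of the bag-of-words representation described above; the vocabulary and the document are hypothetical.

```python
# A minimal bag-of-words representation for document categorization.
# The vocabulary and the document are hypothetical.

vocabulary = ["missile", "athlete", "couture", "election", "goal"]

def bag_of_words(document):
    """Binary vector: x_j = 1 if word j occurs in the document, 0 otherwise."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

doc = "The athlete scored the winning goal in the final"
print(bag_of_words(doc))   # [0, 1, 0, 0, 1]
```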

9. Multivariate Regression:

 The numeric output is assumed to be written as a linear function, i.e., a
weighted sum of several input variables plus noise.
 In statistics, this is called multiple regression; the term multivariate
regression is used when there are multiple outputs.
 Maximizing the likelihood is equivalent to minimizing the sum of
squared errors.
 Define: (i) the multivariate linear model,
(ii) the corresponding vectors and matrices.
 Solve for the parameters.
 Multivariate polynomial regression: Used if necessary. In
multivariate regression, we rarely use polynomials of an order higher
than linear.
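
A minimal NumPy sketch of multivariate (multiple) linear regression by least squares, which is equivalent to maximizing the likelihood under Gaussian noise as noted above; the data are hypothetical.

```python
# Multivariate (multiple) linear regression by least squares (hypothetical data).
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3
X = rng.normal(size=(N, d))                        # N observations, d input variables
true_w = np.array([1.5, -2.0, 0.7])
y = X @ true_w + 0.4 + rng.normal(0, 0.1, size=N)  # weighted sum + intercept + noise

# Augment with a column of ones for the intercept and solve the least squares problem
# (minimizing the sum of squared errors).
X_aug = np.hstack([np.ones((N, 1)), X])
w_hat, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
print(w_hat)   # approximately [0.4, 1.5, -2.0, 0.7]
```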
