
MACHINE LEARNING: LINEAR REGRESSION

Cost Function / OLS
• RSS = \sum_i (y_i - \hat{y}_i)^2
• \hat{\beta} = (X^T X)^{-1} X^T Y

Gradient Descent
• w_{n+1} = w_n - \eta \frac{\partial RSS}{\partial w_n}

Stochastic Gradient Descent
• The derivative of the cost function is computed on a sample of the data, not on the whole dataset.

L1 Regularization (Lasso)
• Cost = RSS + \alpha \sum_i |w_i|

L2 Regularization (Ridge)
• Cost = RSS + \alpha \sum_i w_i^2

Elastic Net
• Cost = RSS + \lambda \, L1\_ratio \sum_i |\beta_i| + \lambda \, (1 - L1\_ratio) \sum_i \beta_i^2

Grid Search CV
• To tune hyper-parameters, use grid search. You can use 3-fold or 5-fold CV, depending on the hardware you have access to.

Bias-Variance
• In-sample error reflects bias; out-of-sample error reflects variance.
• Simple models have high bias and low variance compared to complex models.
• Bias and variance can't be reduced simultaneously (the bias-variance trade-off).
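
A minimal scikit-learn sketch tying these pieces together; the data, alpha grid and l1_ratio grid are illustrative assumptions, not prescriptions from the slides:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.model_selection import GridSearchCV

# Illustrative data: 200 rows, 5 predictors
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=0.5, size=200)

# OLS: minimizes RSS; closed form beta = (X'X)^-1 X'Y
ols = LinearRegression().fit(X, y)

# Regularized variants: L1 (Lasso), L2 (Ridge) and Elastic Net
lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# Grid search with 5-fold CV to tune the regularization strength
param_grid = {"alpha": [0.01, 0.1, 1.0, 10.0], "l1_ratio": [0.2, 0.5, 0.8]}
search = GridSearchCV(ElasticNet(), param_grid, cv=5).fit(X, y)
print(search.best_params_)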
MACHINE LEARNING: LOGISTIC REGRESSION

Model: Binary
• Models the probability of the positive class as a sigmoid of a linear combination of the predictors.

L1 Regularization
• Cost = Logistic\ Loss + \frac{1}{C} \sum_i |w_i|

L2 Regularization
• Cost = Logistic\ Loss + \frac{1}{C} \sum_i w_i^2

ROC Curve
• A plot of TPR vs. FPR (TPR on the y-axis, FPR on the x-axis) across classification thresholds.

AUC
• Area under the ROC curve; a useful classifier must have an AUC greater than 0.5 (0.5 corresponds to random guessing).
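
A minimal scikit-learn sketch of regularized binary logistic regression with an ROC/AUC check; the synthetic data and C values are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Illustrative binary classification data
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# L2 penalty (default); smaller C means stronger regularization
clf_l2 = LogisticRegression(penalty="l2", C=1.0).fit(X_train, y_train)

# L1 penalty requires a solver that supports it, e.g. liblinear or saga
clf_l1 = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(X_train, y_train)

# ROC curve (TPR vs. FPR) and AUC from predicted probabilities
probs = clf_l2.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, probs)
print("AUC:", roc_auc_score(y_test, probs))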
MACHINE LEARNING: LOGISTIC REGRESSION

Classification Report
• Summarizes precision, recall and F1-score for each class; see the sketch below.

Grid Search
• Tune C (and the penalty type) with grid search and k-fold CV.

Multiclass: OVR
• One-vs-rest: one binary classifier is fitted per class, and the class with the highest score wins.

Multiclass: Multinomial
• A single model with a softmax over all the classes.
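
A sketch of the workflow these items refer to, under illustrative assumptions (the dataset, the C grid and the 3-class setup are mine, not from the slides):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_iris(return_X_y=True)            # 3 classes
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Grid search over the regularization strength with 5-fold CV (multinomial/softmax model)
grid = GridSearchCV(
    LogisticRegression(max_iter=1000, multi_class="multinomial"),
    param_grid={"C": [0.01, 0.1, 1, 10]},
    cv=5,
).fit(X_train, y_train)

# Per-class precision, recall and F1
print(classification_report(y_test, grid.predict(X_test)))

# One-vs-rest alternative
ovr = LogisticRegression(max_iter=1000, multi_class="ovr").fit(X_train, y_train)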
MACHINE LEARNING: TREE BASED MODELS
Tree: Classifier
• Classifiers use gini or entropy as the purity metric:
  Gini = 1 - \sum_i p_i^2
  Entropy = -\sum_i p_i \log_2 p_i

Tree: Regressor
• The purity metric is MSE or RSS:
  MSE = \frac{1}{n} \sum_i (y_i - \mu)^2

Sklearn: Tree Classifier / Tree Regressor
• DecisionTreeClassifier and DecisionTreeRegressor; see the sketch below.

Sklearn: Hyperparameters
• Max depth
• Min samples split
• Max features, etc.

Grid Search
• Tune the hyperparameters above with grid search and k-fold CV.

Visualize: Classifier / Regressor
• The fitted tree structure and split rules can be plotted; see the sketch below.
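
A minimal scikit-learn sketch covering the fit, grid search and visualization steps above; the dataset and the parameter grid are illustrative assumptions:

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

X, y = load_iris(return_X_y=True)

# Classifier: gini (default) or entropy as the purity criterion
clf = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# Regressor: mean squared error as the purity criterion (default)
reg = DecisionTreeRegressor(max_depth=3)

# Grid search over common hyperparameters with 5-fold CV
param_grid = {
    "max_depth": [2, 3, 5, None],
    "min_samples_split": [2, 5, 10],
    "max_features": [None, "sqrt"],
}
search = GridSearchCV(DecisionTreeClassifier(), param_grid, cv=5).fit(X, y)
print(search.best_params_)

# Visualize the fitted tree (split rules and leaf purities)
plot_tree(clf, filled=True)
plt.show()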
MACHINE LEARNING: ENSEMBLE MODELS
Bagged Tree Model
• Base learner is a decision tree
• Each tree model is overfitted on a bootstrapped sample
• The user can decide on:
  • The number of trees to be included in the ensemble
  • How deep each tree should grow
  • All the hyperparameters associated with a tree-based model
• Feature importance can be computed to ascertain which predictors are most informative
• Parameter tuning can be done by tracking the OOB error

Random Forest
• Base learner is a decision tree
• Each tree model is overfitted on a bootstrapped sample
• While fitting a tree model, only a random sample of columns is used to decide the relevant variable for each split
• The user can decide on:
  • The number of trees to be included in the ensemble
  • How deep each tree should grow
  • All the hyperparameters associated with a tree-based model
• Feature importance can be computed to ascertain which predictors are most informative
• Parameter tuning can be done by tracking the OOB error

Boosted Tree Model
• Base learner is a tree model
• Tree models aren't overfitted on the data, unlike Random Forest and Bagged Tree models
• Tree models are built sequentially, one after the other
• In an AdaBoost model, the rows where the preceding model makes an error get a higher weight when the next tree model is built
• In Gradient Boosted Trees, each succeeding tree model is fitted on the residuals of the preceding model
• Feature importance can be computed to ascertain which predictors are most informative
• Boosted models don't have out-of-bag observations, so the OOB error can't be used for parameter tuning; k-fold CV is used instead (see the sketch below)
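
A minimal scikit-learn sketch of the three ensemble types, with OOB tracking for the bagged/forest models and k-fold CV for the boosted model; the dataset and settings are illustrative assumptions:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import (BaggingClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Bagged trees: deep (overfitted) trees on bootstrapped samples, OOB error for tuning
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=200,
                        oob_score=True, random_state=0).fit(X, y)
print("Bagging OOB accuracy:", bag.oob_score_)

# Random forest: additionally samples a random subset of columns at each split
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            oob_score=True, random_state=0).fit(X, y)
print("RF OOB accuracy:", rf.oob_score_)
print("RF feature importances:", rf.feature_importances_[:5])

# Gradient boosting: shallow trees fitted sequentially on residuals; no OOB, so use k-fold CV
gbm = GradientBoostingClassifier(n_estimators=200, max_depth=2, learning_rate=0.1)
print("GBM 5-fold CV accuracy:", cross_val_score(gbm, X, y, cv=5).mean())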
DEEP LEARNING: MLP
Activation Functions
• Used to introduce non-linearity.
• ReLU, sigmoid and tanh are popular activation functions:
  relu(x) = \max(0, x)
  \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}
  sigmoid(x) = \frac{1}{1 + e^{-x}}

MLP Architecture
• There is one input layer, at least one hidden layer and one output layer.
• (Diagram: input layer -> hidden layer -> output layer)
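
A small numeric check of the three activation functions (a sketch with arbitrary sample inputs):

import numpy as np

def relu(x):
    return np.maximum(0, x)

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))       # [0. 0. 2.]
print(np.tanh(x))    # [-0.964  0.     0.964]
print(sigmoid(x))    # [0.119  0.5    0.881]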


DEEP LEARNING: MLP
MLP Regressor
• In the output layer we use a linear activation.
• (Diagram: input layer -> hidden layer -> output layer with linear activation)

MLP Classifier
• In the output layer there are as many neurons as the number of classes, and the activation function used is softmax.
• (Diagram: input layer -> hidden layer -> output layer with softmax activation)
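
A minimal Keras sketch of the two output-layer choices; the input shape, hidden-layer size and number of classes are illustrative assumptions:

from keras.models import Sequential
from keras.layers import Dense

# MLP regressor: a single output neuron with linear activation
regressor = Sequential()
regressor.add(Dense(64, activation='relu', input_shape=(10,)))
regressor.add(Dense(1, activation='linear'))

# MLP classifier: one output neuron per class, softmax activation
n_classes = 3
classifier = Sequential()
classifier.add(Dense(64, activation='relu', input_shape=(10,)))
classifier.add(Dense(n_classes, activation='softmax'))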


DEEP LEARNING: MLP
MLP: Terminology
• Epochs: number of times the data does a complete pass through the network
• Batch size: number of data points fed through the network in each step
• Backpropagation: helps in computing the gradients required to do gradient descent
• Adagrad/Adam/SGD/RMSprop: optimizers used during model training

MLP: Keras
• A compile/fit example is sketched below.
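
A minimal Keras MLP sketch wiring the terminology together: compile() picks the optimizer and fit() sets the epochs and batch size. The data shape, layer sizes and 10-class output are illustrative assumptions:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Illustrative data: 1000 samples, 20 features, 10 one-hot encoded classes
X = np.random.rand(1000, 20)
y = np.eye(10)[np.random.randint(0, 10, size=1000)]

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(20,)))   # hidden layer
model.add(Dense(64, activation='relu'))                      # hidden layer
model.add(Dense(10, activation='softmax'))                   # output layer: one neuron per class

# Optimizer choices include 'sgd', 'adam', 'adagrad' and 'rmsprop'
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# epochs = complete passes over the data, batch_size = data points per gradient step
model.fit(X, y, epochs=10, batch_size=32, validation_split=0.2)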
DEEP LEARNING: CNN
Convolution Layers
• Convolution layers contain convolving kernels K that act as filters.
• The output size of a convolution operation is related to the kernel size, stride and zero padding:
  n_{out} = \frac{n_{in} + 2p - k}{s} + 1

Pooling Layers
• Pooling layers help in reducing the size of the convolved output.
• In a pooling layer it is common to use either max pooling, where the maximum pixel value is chosen, or average pooling, where the average of the pixel values is chosen.
• Pooling layers don't have any "weight" terms.
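
A quick numeric check of the output-size formula; the input size, kernel, padding and stride values are illustrative:

def conv_output_size(n_in, k, p, s):
    # n_out = (n_in + 2p - k) / s + 1, using integer division
    return (n_in + 2 * p - k) // s + 1

print(conv_output_size(28, 3, 1, 1))   # 28: 3x3 kernel, 'same'-style padding (p=1), stride 1
print(conv_output_size(28, 3, 0, 1))   # 26: 3x3 kernel, 'valid' padding (p=0), stride 1
print(conv_output_size(28, 2, 0, 2))   # 14: 2x2 max pooling with stride 2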

https://adeshpande3.github.io/A-Beginner%27s-Guide-To-
Understanding-Convolutional-Neural-Networks/
DEEP LEARNING: CNN
CNN Architecture

from keras.models import Sequential
from keras.layers.convolutional import Conv2D
from keras.layers.core import Dense, Flatten
from keras.layers.pooling import MaxPooling2D
from keras.utils import np_utils
from keras.layers import Dropout

model = Sequential()
# Convolution + pooling blocks
model.add(Conv2D(filters=6, kernel_size=(3,3), padding='same', input_shape=(28,28,1)))
model.add(MaxPooling2D(pool_size=(2,2)))
model.add(Conv2D(filters=16, kernel_size=(3,3), padding='valid'))
model.add(MaxPooling2D(pool_size=(2,2)))
# Flatten the feature maps and add fully connected layers
model.add(Flatten())
model.add(Dropout(0.2, seed=100))
model.add(Dense(120, activation='relu'))
model.add(Dense(84, activation='relu'))
model.add(Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])
model.fit(X, y, epochs=10, batch_size=32)

Transfer Learning

import numpy as np
import os
from keras.applications.inception_v3 import InceptionV3
from keras.models import Sequential, Model
from keras.layers import Dense, GlobalAveragePooling2D, Dropout, Flatten

## Create the base model, pre-trained on ImageNet, without its classification head
base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(150,150,3))
x = base_model.output
x = GlobalAveragePooling2D()(x)
# let's add a fully-connected layer
x = Dense(1024, activation='relu')(x)
# and a logistic layer -- let's say we have 2 classes
predictions = Dense(2, activation='softmax')(x)
# this is the model we will train
model = Model(inputs=base_model.input, outputs=predictions)
## Freeze the base layers so only the new head is trained
for layer in base_model.layers:
    layer.trainable = False
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=['accuracy'])
model.fit(X, y, batch_size=32, epochs=10)

https://adeshpande3.github.io/A-Beginner%27s-Guide-To-Understanding-Convolutional-
Neural-Networks/
DEEP LEARNING: CNN
Data Augmentation

import os
from keras.preprocessing import image

# Generator that applies random transformations to the training images
data_gen = image.ImageDataGenerator(rotation_range=40,
                                    shear_range=0.2,
                                    horizontal_flip=True,
                                    vertical_flip=False,
                                    zoom_range=0.2,
                                    fill_mode='nearest')
train_generator = data_gen.flow_from_directory(os.path.join(base_dir, "train"), target_size=(150,150))
valid_generator = data_gen.flow_from_directory(os.path.join(base_dir, "test"), target_size=(150,150))
model.fit_generator(train_generator, epochs=3, validation_data=valid_generator)

https://towardsdatascience.com/image-augmentation-
14a0aafd0498
DEEP LEARNING: RNN/LSTM
Embedding Layer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, Flatten

seq_len = 16
max_words = 10000
# Tokenize the raw text and pad each sequence to a fixed length
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train.tolist())
sequence = tokenizer.texts_to_sequences(X_train.tolist())
train_features = pad_sequences(sequence, maxlen=seq_len)
y_train = to_categorical(y_train)

model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=64, input_length=seq_len))
model.add(Flatten())
model.add(Dense(1024, activation='relu'))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(train_features, y_train, epochs=3, batch_size=32, validation_split=0.20)

• Converts a sequence of tokens into a sequence of dense vectors.
• Learns to retain the context of the words.
DEEP LEARNING: RNN/LSTM
RNN Layer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, SimpleRNN

seq_len = 16
max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train.tolist())
sequence = tokenizer.texts_to_sequences(X_train.tolist())
train_features = pad_sequences(sequence, maxlen=seq_len)
y_train = to_categorical(y_train)

model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=64, input_length=seq_len))
model.add(SimpleRNN(100))   # 100 RNN units; output is the final hidden state
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(train_features, y_train, epochs=3, batch_size=32, validation_split=0.20)

• An RNN layer contains RNN cells.
• RNN layers help in making sure that the sequence of words is taken into account.
• State update: s_t = f(w x_t + u s_{t-1} + b), where f = tanh, w = input weights, u = state weights, b = bias term.
DEEP LEARNING: RNN/LSTM
LSTM Layer

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Embedding, LSTM

seq_len = 16
max_words = 10000
tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(X_train.tolist())
sequence = tokenizer.texts_to_sequences(X_train.tolist())
train_features = pad_sequences(sequence, maxlen=seq_len)
y_train = to_categorical(y_train)

model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=64, input_length=seq_len))
model.add(LSTM(100))   # 100 LSTM units; output is the final hidden state
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['acc'])
model.fit(train_features, y_train, epochs=3, batch_size=32, validation_split=0.20)

• An LSTM layer contains LSTM cells.
• LSTM layers help in making sure that the sequence of words is taken into account, in a manner better than what a simple RNN can achieve.
• Gate equations (f = sigmoid for the gates, tanh for the candidate state; w = input weights, u = state weights, b = bias term):
  i_t = f(w_i x_t + u_i s_{t-1} + b_i)
  f_t = f(w_f x_t + u_f s_{t-1} + b_f)
  o_t = f(w_O x_t + u_O s_{t-1} + b_O)
