
Project area # Natural Language Processing

L4-L5: Hands-on implementation of Text classification

January 18, 2020

Objective: To classify the type of review from Amazon users using a Machine Learning
technique.

About the Data-set: We use the Amazon Review data set, which has 10,000 rows of
text data. The data is divided into two classes: “Label 1” (negative review) and “Label 2”
(positive review). The data set has two columns, “Text” and “Label”.

Code:

In [50]: import pandas as pd


import numpy as np
import nltk
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

In [51]: #Set Random seed


np.random.seed(500)

In [52]: # Add the Data using pandas


Corpus = pd.read_csv(r"corpus.csv",encoding='latin-1')
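As a quick sanity check before pre-processing (a minimal sketch; the column names 'text' and 'label' are assumed from the cells below), the shape and class balance of the loaded corpus can be inspected:

print(Corpus.shape)                    # expected: (10000, 2)
print(Corpus.columns.tolist())         # ['text', 'label'] (assumed names)
print(Corpus['label'].value_counts())  # number of rows per review class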

In [53]: # Step - 1: Data Pre-processing - This will help in getting better results through the ML algorithms used later

# Step - 1a : Remove blank rows if any.


Corpus['text'].dropna(inplace=True)

In [54]: # Step - 1b : Change all the text to lower case. This is required as python interprets 'dog' and 'DOG' differently
Corpus['text'] = [entry.lower() for entry in Corpus['text']]

In [55]: # Step - 1c : Tokenization : In this each entry in the corpus will be broken into a set of words
Corpus['text']= [word_tokenize(entry) for entry in Corpus['text']]

In [56]: # Step - 1d : Remove Stop words and non-alphabetic tokens, and perform Word Stemming/Lemmatization.

# WordNetLemmatizer requires Pos tags to understand if the word is a noun, verb, adjective etc.
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

for index,entry in enumerate(Corpus['text']):


# Declaring Empty List to store the words that follow the rules for this step
Final_words = []
# Initializing WordNetLemmatizer()
word_Lemmatized = WordNetLemmatizer()
# pos_tag function below will provide the 'tag' i.e if the word is Noun(N), Verb(V) etc.
for word, tag in pos_tag(entry):
# Below condition is to check for Stop words and consider only alphabets
if word not in stopwords.words('english') and word.isalpha():
word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
Final_words.append(word_Final)
# The final processed set of words for each iteration will be stored in 'text_final'
Corpus.loc[index,'text_final'] = str(Final_words)

In [57]: print(Corpus['text_final'].head())

0 ['stun', 'even', 'sound', 'track', 'beautiful'...


1 ['best', 'soundtrack', 'ever', 'anything', 're...
2 ['amaze', 'soundtrack', 'favorite', 'music', '...
3 ['excellent', 'soundtrack', 'truly', 'like', '...
4 ['remember', 'pull', 'jaw', 'floor', 'hear', '...
Name: text_final, dtype: object

In [67]: # Step - 2: Split the corpus into Train and Test data sets
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'], Corpus['label'], test_size=0.3)  # label column name and 30% test size assumed from the 7000/3000 shapes below
print(Train_X.shape)
print(Train_Y.shape)
print(Test_X.shape)

(7000,)
(7000,)
(3000,)

In [68]: # Step - 3: Label encode the target variable - This is done to transform Categorical data of string type into numerical values the model can understand
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.fit_transform(Test_Y)
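For intuition, a minimal sketch (toy labels, not the actual dataset) of what LabelEncoder does; note that calling fit_transform on Test_Y as above only works safely because both classes occur in both splits, so the learned mapping is identical:

from sklearn.preprocessing import LabelEncoder
toy_encoder = LabelEncoder()
# string classes are mapped to integers in sorted order: 'negative' -> 0, 'positive' -> 1
print(toy_encoder.fit_transform(['positive', 'negative', 'positive']))   # [1 0 1]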

In [69]: # Step - 4: Vectorize the words by using TF-IDF Vectorizer - This is done to find how important a word in a document is in comparison to the whole corpus
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])

Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
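To see what the vectorizer produces, here is a minimal sketch on a toy corpus (the documents and names are illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer
toy_docs = ["good sound track", "bad sound quality", "good quality"]
toy_vect = TfidfVectorizer()
toy_tfidf = toy_vect.fit_transform(toy_docs)
print(sorted(toy_vect.vocabulary_))    # ['bad', 'good', 'quality', 'sound', 'track']
print(toy_tfidf.shape)                 # (3 documents, 5 unique terms)
print(toy_tfidf.toarray().round(2))    # terms shared by many documents receive lower weights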

In [70]: # Step - 5: Now we can run different algorithms to classify our data and check for accuracy

# Classifier - Algorithm - Naive Bayes


# fit the training dataset on the classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)

Out[70]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [71]: # predict the labels on validation dataset


predictions_NB = Naive.predict(Test_X_Tfidf)

# Use accuracy_score function to get the accuracy


print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score -> 83.53333333333333

In [72]: # Classifier - Algorithm - SVM


# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)

Out[72]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,


decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

In [73]: # predict the labels on validation dataset


predictions_SVM = SVM.predict(Test_X_Tfidf)

# Use accuracy_score function to get the accuracy


print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score -> 84.86666666666667

In [ ]:

Project area # 5G_6G wireless networks
L4-L5: Hands-on implementation of Primary User detection
in fifth generation Cognitive Radio Networks

January 18, 2020

Objective: To classify the presence or absence of a primary user in CR Networks.

About the Data-set: The data is obtained using a USRP from an empirical testbed setup. The
USRP is tuned to the UHF band. The data is a single-column matrix, where the column corresponds
to the tuned frequency and the rows correspond to time instants.

Code:

1 Importing and Initializing data


The list of packages that will be used in the study:

1. numpy - For vectorized implementation


2. math - Inbuilt support for mathematical operations
3. time - measuring runtime
4. keras - for LSTM model, training and testing

In [1]: import numpy as np


import math
from scipy.stats import norm
import time
import pandas as pd
import statsmodels.api as sm
from keras.layers import Dense, Dropout, LSTM, Embedding
from keras.models import Sequential
from matplotlib import pyplot as plt
#import plotly.offline as py
#import plotly.graph_objs as go
#py.init_notebook_mode(connected=True)
%matplotlib inline

Using TensorFlow backend.

2 Retrieving data
Retrieving data from the files for all four wireless technologies.

In [2]: # Importing data from file (536.5 MHz)


UHF = np.fromfile('536_5.dat', dtype=np.float32)

# Reshaping to convert to a proper NUMPY vector


UHF = np.reshape(UHF, (UHF.shape[0], 1))

# Shape of UHF signal vector


print("Size of UHF: " + str(UHF.shape))

Size of UHF: (20000000, 1)

3 Creating NUMPY equivalent for bandpower function


A NumPy equivalent of MATLAB's bandpower function is created.

In [3]: def bandpower(signal):


return np.mean(signal ** 2)

In [4]: bandpower(UHF)

Out[4]: 0.0016754537

4 Creating NUMPY equivalent for AWGN (Additive White Gaussian Noise) function
First of all, the SNR (Signal-to-noise ratio) is converted from the decibel scale to the linear scale using the given formula:

$SNR_{linear} = 10^{SNR_{dB}/10}$

The power of the signal is then adjusted relative to the noise variance using the formula given below:

$SNR_{linear} = \frac{Power(signal)}{Var(noise)} \implies Power(signal) = Var(noise) \times SNR_{linear}$
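A quick standalone numerical check of this conversion (illustrative only, not part of the captured data):

import math
snr_db = 4.0
snr_linear = math.pow(10, snr_db / 10.0)   # 10^(4/10) is approximately 2.51
print(snr_linear)
print(10.0 * math.log10(snr_linear))       # converting back recovers 4.0 dB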

In [5]: def awgn(signal, desired_snr):

# Converting the SNR from dB scale to linear scale


snr_linear = math.pow(10, desired_snr / 10)

# Standard normally distributed noise


noise = np.random.randn(signal.shape[0], 1)

# Using the boxed formula
var_signal = bandpower(noise) * snr_linear

# Normalizing the signal to have the given variance


normalized_signal = math.sqrt(var_signal) * (signal / math.sqrt(bandpower(signal)))

print("SNR = " + str(10 * math.log10(bandpower(normalized_signal) / bandpower(noise)))

return normalized_signal + noise

5 Filtering the data


The datasets are filtered to remove any transient peaks. Values between $10^{-7}$ and 1 are retained;
the rest are discarded.

In [6]: # Datasets are filtered to contain values between 10 ^ -7 and 1

UHF = UHF[np.logical_and(UHF > math.pow(10, -7), UHF < 1)]


UHF = UHF.reshape(UHF.shape[0], 1)

# Shape of UHF signal vector


print("Size of UHF: " + str(UHF.shape))

print(awgn(UHF[0:100000], 4).shape)

Size of UHF: (9992850, 1)


SNR = 4.0000002944270365
(100000, 1)

6 Making the dataset ready


The following will create a dataset for the signal with a given SNR, number of samples, and size
of the samples in the sensing event. The dataset is constructed from energy values: the energy E
of a sensing event of N samples is given by

$E = \sum_{n=1}^{N} y[n]^2$
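As a quick standalone illustration of this statistic (toy data only, not the captured UHF trace), the energy of one sensing window is just the sum of its squared samples:

import numpy as np
window = np.random.randn(100, 1)   # one sensing event of N = 100 samples
E = np.sum(window ** 2)            # E = sum over n of y[n]^2
print(E)                           # for unit-variance noise, E is close to N = 100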

In [7]: def create_dataset(signal, desired_snr, samples, sample_size):

# Creating the signal with desired SNR


snr_signal = awgn(signal[0:samples * sample_size], desired_snr)

# Allocating zeros to the dataset


X = np.zeros((samples, 1))

for i in range(0, samples):

# Extracting the sample based on sample size


sampled_signal = snr_signal[i * sample_size : (i + 1) * sample_size]

# Sorting the sampled signal


sampled_signal = np.sort(sampled_signal, axis=0)

# Energy detection
E = np.sum(sampled_signal ** 2)

# Assigning values to the dataset


X[i][0] = E

return X

In [9]: a = time.time()
print(create_dataset(UHF[50000:], 4, 15000, 100).shape)
b = time.time()

# Printing the time taken for execution


print(b - a)

SNR = 4.0000010176512495
(15000, 1)
0.5682578086853027

Making the dataset for all the SNRs in the range -20 dB to 4 dB with a step size of 2 dB. The following
function takes a range of SNRs as input and outputs the dataset. The other inputs are the
signal, the number of samples per SNR, and the sample size.

In [10]: def final_dataset(signal, snr_range, samples_per_snr, sample_size):


X = {}

for snr in snr_range:


# Creating dataset for the given SNR
X_snr = create_dataset(signal, snr, samples_per_snr, sample_size)

# Indexing within the final dataset matrix X


X[snr] = X_snr

return X

In [11]: a = time.time()

# UHF
X_UHF = {**final_dataset(UHF[100000:], range(-20, -4, 2), 5000, 100), **final_dataset(UHF[100000:], range(-4, 6, 2), 12000, 100)}  # second call's arguments assumed from the shapes printed below

X_test_UHF = final_dataset(UHF[300000:], range(-20, 6, 2), 5129, 100)
b = time.time()

# Printing the time taken for execution


print("Time taken :- " + str(b - a))
#print(X_UHF.shape)

SNR = -20.000000947867104
SNR = -17.999999975177595
SNR = -16.000000564536986
SNR = -14.00000024874285
SNR = -12.000000647780782
SNR = -9.999999378102368
SNR = -7.999999424928018
SNR = -6.0000007320692585
SNR = -3.999998851007047
SNR = -1.9999968889338702
SNR = 8.946170328922171e-07
SNR = 2.000001013654348
SNR = 4.000000474544495
SNR = -19.999999636204638
SNR = -17.999999682125054
SNR = -16.000000286119587
SNR = -13.999999966226724
SNR = -12.000001115697067
SNR = -9.999999844380007
SNR = -8.000000337259307
SNR = -5.999999833073398
SNR = -3.999999084560373
SNR = -1.999999456449275
SNR = 4.338867090785353e-07
SNR = 2.0000011775885964
SNR = 4.000000351050954
Time taken :- 4.981256484985352

7 Generating White noise sequence


White noise of variance 1 is generated and is labelled as 0.

In [12]: def create_noise_sequence(samples, sample_size):

# Creating white noise sequence of variance 1


noise = np.random.randn(samples * sample_size, 1)

# Allocating zeros to the dataset


X = np.zeros((samples, 1))

for i in range(0, samples):

# Extracting the sample based on sample size


sampled_signal = noise[i * sample_size : (i + 1) * sample_size]

# Sorting the sampled signal


sampled_signal = np.sort(sampled_signal, axis=0)

# Energy detection
E = np.sum(sampled_signal ** 2)

# Assigning values to the dataset


X[i][0] = E

return X

In [13]: a = time.time()
X_noise = create_noise_sequence(100000, 100)
b = time.time()

print("Time taken = " + str(b - a))

print(X_noise.shape)

Time taken = 2.516165256500244


(100000, 1)

8 DataSet with Lookback for ANN


We use the look-back concept to reduce the effect of sudden, abrupt changes in the signal.

In [17]: # Function for changing the dataset for look back


def create_look_back(X, look_back=1):

# Look back dataset is initialized to be empty


look_back_X = []

for i in range(len(X) - look_back + 1):


# Extracting an example from the dataset
a = X[i:(i + look_back), :]

a = a.flatten() # (For flattening)

# Appending to the dataset


look_back_X.append(a)

look_back_Y = []

# Returning in numpy's array format
return np.array(look_back_X)
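A quick usage sketch of create_look_back on a toy sequence (illustrative values only); with look_back=2, each row pairs an energy value with its predecessor:

import numpy as np
toy = np.arange(5.0).reshape(5, 1)          # energies 0, 1, 2, 3, 4
print(create_look_back(toy, look_back=2))
# [[0. 1.]
#  [1. 2.]
#  [2. 3.]
#  [3. 4.]]   -> 5 - 2 + 1 = 4 rows of 2 columns each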

The following function will insert look backs into our dataset for all the SNRs.

In [18]: def dataset_look_back(X_tech, snr_range, look_back):


X_tech_lb = {}

# Look backs for all SNRs


for snr in snr_range:
X_tech_lb[snr] = create_look_back(X_tech[snr], look_back)

return X_tech_lb

In [19]: look_back = 2

X_UHF_lb = dataset_look_back(X_UHF, range(-20, 6, 2), look_back)


print(X_UHF_lb[-20].shape)

X_noise_lb = create_look_back(X_noise, look_back)


print(X_noise_lb.shape)

X = X_UHF_lb[-20]
y = []

for snr in range(-18, 6, 2):


X = np.concatenate((X, X_UHF_lb[snr]), axis=0)

y = np.ones((X.shape[0], 1))

print(X.shape)
print(X_noise_lb.shape)
X = np.concatenate((X, X_noise_lb), axis=0)

y = np.concatenate((y, np.zeros((X_noise_lb.shape[0], 1))))


print(X.shape)
print(X)
print(y.shape)
print(y)

(4999, 2)
(99999, 2)
(99987, 2)
(99999, 2)
(199986, 2)
[[ 60.67028369 112.33572595]
[112.33572595 124.46628501]
[124.46628501 91.88249011]

...
[ 79.39234195 109.61704045]
[109.61704045 110.6177588 ]
[110.6177588 84.34582643]]
(199986, 1)
[[1.]
[1.]
[1.]
...
[0.]
[0.]
[0.]]

9 Creating the ANN model


In [20]: seed = 9
np.random.seed(seed)

#ANN Model
# create model
model = Sequential() # This means it's a sequential model, i.e. layers are stacked from one direction to the other
model.add(Dense(7, input_dim=2, kernel_initializer='uniform', activation='relu'))
#model.add(Dense(10, init='uniform', activation='relu')) #You can add as many hidden layers as needed
#model.add(Dense(5,init='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid')) #Output layer

# Compile model
#Explore this function in case you want to do the mathematical analysis
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

10 Training the ANN model


In [21]: # Fit the model
#Training and batch size
model.fit(X, y, epochs=20, batch_size=150, verbose=2)

#Evaluate the model


scores = model.evaluate(X, y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Epoch 1/40
- 3s - loss: 0.5897 - acc: 0.6314
Epoch 2/40
- 2s - loss: 0.4589 - acc: 0.8142
Epoch 3/40
- 2s - loss: 0.4123 - acc: 0.8157

Epoch 4/40
- 2s - loss: 0.4010 - acc: 0.8161
Epoch 5/40
- 2s - loss: 0.3973 - acc: 0.8163
Epoch 6/40
- 2s - loss: 0.3964 - acc: 0.8166
Epoch 7/40
- 2s - loss: 0.3955 - acc: 0.8168
Epoch 8/40
- 2s - loss: 0.3949 - acc: 0.8174
Epoch 9/40
- 2s - loss: 0.3949 - acc: 0.8167
Epoch 10/40
- 2s - loss: 0.3940 - acc: 0.8177
Epoch 11/40
- 2s - loss: 0.3943 - acc: 0.8176
Epoch 12/40
- 2s - loss: 0.3944 - acc: 0.8176
Epoch 13/40
- 2s - loss: 0.3934 - acc: 0.8181
Epoch 14/40
- 2s - loss: 0.3936 - acc: 0.8176
Epoch 15/40
- 2s - loss: 0.3934 - acc: 0.8183
Epoch 16/40
- 2s - loss: 0.3929 - acc: 0.8178
Epoch 17/40
- 2s - loss: 0.3944 - acc: 0.8169
Epoch 18/40
- 2s - loss: 0.3946 - acc: 0.8176
Epoch 19/40
- 2s - loss: 0.3933 - acc: 0.8180
Epoch 20/40
- 2s - loss: 0.3934 - acc: 0.8179
Epoch 21/40
- 2s - loss: 0.3933 - acc: 0.8175
Epoch 22/40
- 2s - loss: 0.3928 - acc: 0.8182
Epoch 23/40
- 2s - loss: 0.3932 - acc: 0.8177
Epoch 24/40
- 2s - loss: 0.3935 - acc: 0.8175
Epoch 25/40
- 2s - loss: 0.3937 - acc: 0.8175
Epoch 26/40
- 2s - loss: 0.3932 - acc: 0.8180
Epoch 27/40
- 2s - loss: 0.3930 - acc: 0.8177

Epoch 28/40
- 2s - loss: 0.3928 - acc: 0.8178
Epoch 29/40
- 2s - loss: 0.3927 - acc: 0.8182
Epoch 30/40
- 2s - loss: 0.3934 - acc: 0.8172
Epoch 31/40
- 2s - loss: 0.3933 - acc: 0.8175
Epoch 32/40
- 2s - loss: 0.3930 - acc: 0.8178
Epoch 33/40
- 2s - loss: 0.3929 - acc: 0.8184
Epoch 34/40
- 2s - loss: 0.3932 - acc: 0.8180
Epoch 35/40
- 2s - loss: 0.3937 - acc: 0.8171
Epoch 36/40
- 2s - loss: 0.3937 - acc: 0.8177
Epoch 37/40
- 2s - loss: 0.3932 - acc: 0.8177
Epoch 38/40
- 2s - loss: 0.3933 - acc: 0.8179
Epoch 39/40
- 2s - loss: 0.3931 - acc: 0.8177
Epoch 40/40
- 2s - loss: 0.3929 - acc: 0.8175
199986/199986 [==============================] - 3s 16us/step

acc: 81.96%

In [22]: pd_UHF = {}

for snr in range(-20, 6, 2):


y_snr = np.ones((X_UHF_lb[snr].shape[0], 1))
scores = model.evaluate(X_UHF_lb[snr], y_snr)
print("At SNR = " + str(snr) + "\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*10
pd_UHF[snr] = scores[1]

plt.plot(range(-20, 6, 2), list(pd_UHF.values()))

4999/4999 [==============================] - 0s 29us/step


At SNR = -20
acc: 7.60%
4999/4999 [==============================] - 0s 25us/step
At SNR = -18
acc: 8.52%
4999/4999 [==============================] - 0s 19us/step

At SNR = -16
acc: 10.66%
4999/4999 [==============================] - 0s 21us/step
...
At SNR = -8
acc: 47.27%
4999/4999 [==============================] - 0s 20us/step
At SNR = -6
acc: 76.38%
11999/11999 [==============================] - 0s 18us/step
At SNR = -4
acc: 96.49%
11999/11999 [==============================] - 0s 18us/step
At SNR = -2
acc: 99.91%
11999/11999 [==============================] - 0s 17us/step
At SNR = 0
acc: 100.00%
11999/11999 [==============================] - 0s 19us/step
At SNR = 2
acc: 100.00%
11999/11999 [==============================] - 0s 18us/step
At SNR = 4
acc: 100.00%

Out[22]: [<matplotlib.lines.Line2D at 0x7f16674aa550>]

Project area # Biology / Bioinformatics
L4-L5: Hands-on implementation to classify the Cancer
patients with AML or ALL

January 18, 2020

Objective: To classify patients with acute myeloid leukemia (AML) and acute lymphoblastic
leukemia (ALL) using the SVM algorithm.

About the Data-set:

1. Each row represents a different gene.

2. Columns 1 and 2 are descriptions about that gene.

3. Each numbered column is a patient in label data.

4. Each patient has 7129 gene expression values, i.e. each patient has one value for each gene.

5. The training data contain gene expression values for patients 1 through 38.

6. The test data contain gene expression values for patients 39 through 72

Code:

In [100]: import pandas as pd


import numpy as np
from numpy import transpose as T
import matplotlib.pyplot as plt
import math

In [101]: Train_Data = pd.read_csv("data_set_ALL_AML_train.csv")


Test_Data = pd.read_csv("data_set_ALL_AML_independent.csv")
labels = pd.read_csv("actual.csv", index_col = 'patient')
Train_Data.head()

Out[101]: Gene Description Gene Accession Number 1 call 2 \


0 AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -214 A -139
1 AFFX-BioB-M_at (endogenous control) AFFX-BioB-M_at -153 A -73
2 AFFX-BioB-3_at (endogenous control) AFFX-BioB-3_at -58 A -1
3 AFFX-BioC-5_at (endogenous control) AFFX-BioC-5_at 88 A 283
4 AFFX-BioC-3_at (endogenous control) AFFX-BioC-3_at -295 A -264

call.1 3 call.2 4 call.3 ... 29 call.33 30 call.34 31 \
0 A -76 A -135 A ... 15 A -318 A -32
1 A -49 A -114 A ... -114 A -192 A -49
2 A -307 A 265 A ... 2 A -95 A 49
3 A 309 A 12 A ... 193 A 312 A 230
4 A -376 A -419 A ... -51 A -139 A -367

call.35 32 call.36 33 call.37


0 A -124 A -135 A
1 A -79 A -186 A
2 A -37 A -70 A
3 P 330 A 337 A
4 A -188 A -407 A

[5 rows x 78 columns]

In [102]: print(Train_Data.isna().sum().max())
print(Test_Data.isna().sum().max())

0
0

In [103]: cols = [col for col in Test_Data.columns if 'call' in col]


test = Test_Data.drop(cols, 1)
cols = [col for col in Train_Data.columns if 'call' in col]
train = Train_Data.drop(cols, 1)

In [104]: patients = [str(i) for i in range(1, 73, 1)]


df_all = pd.concat([train, test], axis = 1)[patients]
#import Transpose as T
df_all = df_all.T
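The raw files store one gene per row and one patient per column, so the transpose above flips the frame: each row becomes a patient and each of the 7129 columns a gene, which is the orientation scikit-learn expects. A quick shape check:

print(df_all.shape)   # expected: (72, 7129) -> 72 patients, 7129 gene-expression features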

In [105]: df_all["patient"] = pd.to_numeric(patients)


labels["cancer"]= pd.get_dummies(labels.cancer, drop_first=True)

In [106]: Data = pd.merge(df_all, labels, on="patient")


Data.head()

Out[106]: 0 1 2 3 4 5 6 7 8 9 ... 7121 7122 7123 \


0 -214 -153 -58 88 -295 -558 199 -176 252 206 ... -125 389 -37
1 -139 -73 -1 283 -264 -400 -330 -168 101 74 ... -36 442 -17
2 -76 -49 -307 309 -376 -650 33 -367 206 -215 ... 33 168 52
3 -135 -114 265 12 -419 -585 158 -253 49 31 ... 218 174 -110
4 -106 -125 -76 168 -230 -284 4 -122 70 252 ... 57 504 -26

7124 7125 7126 7127 7128 patient cancer


0 793 329 36 191 -37 1 0

1 782 295 11 76 -14 2 0
2 1138 777 41 228 -41 3 0
3 627 170 -50 126 -91 4 0
4 250 314 14 56 -25 5 0

[5 rows x 7131 columns]


In [107]: X, y = Data.drop(columns=["cancer"]), Data["cancer"]
In [108]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)  # random_state value assumed
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
(54, 7130)
(54,)
(18, 7130)

In [109]: from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
(54, 7130)
(54,)
(18, 7130)

In [110]: from sklearn.decomposition import PCA


pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
total=sum(pca.explained_variance_)
k=0
current_variance=0
while current_variance/total < 0.90:
current_variance += pca.explained_variance_[k]
k=k+1
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
(54, 54)
(54,)
(18, 54)
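The while loop above counts how many leading principal components are needed to reach 90% of the explained variance (38 here, as used in the next cell); an equivalent, more compact sketch of the same computation:

# first index at which the cumulative explained-variance ratio crosses 0.90
k = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.90) + 1
print(k)   # expected to agree with the loop above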

In [111]: from sklearn.decomposition import PCA
pca = PCA(n_components = 38)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
cum_sum = pca.explained_variance_ratio_.cumsum()
cum_sum = cum_sum*100
plt.bar(range(38), cum_sum)
plt.ylabel("Cumulative Explained Variance")
plt.xlabel("Principal Components")
plt.title("Around 90% of variance is explained by the First 38 columns ")
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)

(54, 38)
(54,)
(18, 38)

In [130]: from sklearn.model_selection import GridSearchCV


from sklearn import model_selection, naive_bayes, svm
from sklearn.svm import SVC
# parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']}, {'C': [1, 10, 100, 1000], 'kernel': ['rbf']}]  # second grid reconstructed
# search = GridSearchCV(SVC(), parameters, n_jobs=-1, verbose=1)
# search.fit(X_train, y_train)

In [131]: #best_parameters = search.best_estimator_

In [132]: model = SVC(C=1.0, kernel='linear', degree=3, gamma='auto')


model.fit(X_train, y_train)

Out[132]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,


decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

In [133]: y_pred=model.predict(X_test)

In [134]: from sklearn.metrics import accuracy_score, confusion_matrix


from sklearn import metrics
print('Accuracy Score:',round(accuracy_score(y_test, y_pred),2))
#confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Output:
# Accuracy Score: 0.67

Accuracy Score: 0.67

In [135]: from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
class_names = ['ALL', 'AML']
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="viridis" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Out[135]: Text(0.5,257.44,'Predicted label')

In [ ]:

Intelligent Transportation System
Project Hands-on

1. Objective:

• Random traffic generation through Simulation of Urban Mobility (SUMO) using OpenStreetMap.
• To model propagation links for Vehicle to Vehicle communication using a Ray tracing simulator.
• To perform Classification for beam selection in Vehicle to Infrastructure communication and compare various classifiers.

2. Hardware Used : NA

3. Software Used :

• For Traffic Generation : Simulation of Urban Mobility (SUMO) (https://www.dlr.de/ts/en/desktopdefault.aspx/tabid-9883/16931_read-41000/)
• For Ray Tracing : Geometry-based Efficient Propagation Model for Vehicular Communication (GEMV²) (http://vehicle2x.net/)
• Platform : MATLAB 2018, Python 2.7
• Editor : Notepad++

4. Expected Outcomes :

Figure 1: Expected Outcome for Exercise 1 : Random Traffic Generation in SUMO

5. Basic Instructions :

(a) Login with Windows 8.


(b) Go to Start and press ”cmd” to open command prompt.
(c) Create an empty folder named "map".
Table 1: Expected Outcome for Exercise : Classification Results (Accuracy, %)

Classifier             All Data    Only NLOS
Linear SVM             31          11
Decision Tree          54          28
Deep Neural Network    65          37

Aim: Random traffic generation through Simulation of Urban Mobility (SUMO) using OpenStreetMap.

Execution steps to generate random traffic

1. Make two folders on your desktop named ”sumo” and ”map”

2. Go to http://sumo.dlr.de/wiki/Networks/Import/OpenStreetMap

3. Scroll down to ”Importing additional polygons”

4. Copy the script into Notepad++

5. Remove ”Power” feature from the script

6. Save as ”typemap.xml”

7. Go to https://www.openstreetmap.org: Choose your city or area in which you want to generate traffic.
8. Click on Export.

9. Save as ”map.osm”

10. Go to the sumo directory: and write: run start-command-line.bat


11. Now leave the ”SUMO” directory and Go to the map directory.

12. Generate the "netconvert" file. Type: netconvert --osm-files map.osm -o map.net.xml

13. Generate the "polyconvert" file. Type: polyconvert --net-file map.net.xml --osm-files map.osm --type-file typemap.xml -o map.poly.xml

14. We are looking for random traffic. Go to your sumo folder, search for "randomTrips.py" and note its parameters (number of vehicles, simulation time, route length, etc.).
Type: python C:/Users/Sagar/Desktop/sumo/sumo/tools/randomTrips.py -n map.net.xml -e 100 -l

15. Type: python C:/Users/Sagar/Desktop/sumo/sumo/tools/randomTrips.py -n map.net.xml -r map.rou.xml -e 100 -l to generate the route file.
16. Go to the SUMO folder and search for ”test.sumo.cfg” and copy that file in ”map”
folder.

17. Edit ”test.sumo.cfg”

18. Go to command prompt and type: sumo-gui map.sumo.cfg

19. Press Enter and SUMO will open. Customize it as per your convenience. Press RUN
20. Note your simulation parameters and save files.

Aim: To model propagation links for Vehicle to Vehicle communication using Ray tracing.

Execution Steps
1. To generate LOS and NLOS links, we need to import the .xml file of our specific route
to the folder: ”inputmobilitySUMO”.
2. Open the folder "inputmobilitySUMO" from SUMO. You can see that it already contains .xml files
of previously simulated data. (You can replace the existing .xml file with your own and modify the
.m file as per your requirement; here we only use the previously simulated data.)

3. Run: ”runSimulation.m”

4. Open folder: ”outputKML”, you can see file named as per current date.

5. Go to :”https://earth.google.com/web/”

6. On the left, click My Places.

7. Click Import KML file.

8. Choose the location of the file you want to upload.

9. Select and open the KML file. A preview of the list will open in Google Earth.

10. To keep these places in your list, click Save and customize as per your requirement.
Aim: To perform DNN Classification.

Execution Steps:

1. Step 1: Initialization
from __future__ import print_function
import keras
#from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Activation
#from keras.layers import Conv1D, MaxPooling1D
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import Adagrad
import numpy as np
from sklearn.preprocessing import minmax_scale
import keras.backend as K
import copy
#one can disable the imports below if not plotting/saving
from keras.utils import plot_model
import matplotlib.pyplot as plt

2. Load training and testing data sets


batch_size = 32
epochs = 50
numUPAAntennaElements = 4*4  #4 x 4 UPA
trainFileName = '../datasets/all_train_classification.npz'  #(22256, 24, 362)
print("Reading dataset...", trainFileName)
train_cache_file = np.load(trainFileName)

testFileName = '../datasets/all_test_classification.npz'  #(22256, 24, 362)
print("Reading dataset...", testFileName)
test_cache_file = np.load(testFileName)

#input features (X_test and X_train) are arrays with matrices. Here we will convert matrices to 1-d array
X_train = train_cache_file['position_matrix_array']  #inputs
train_best_tx_rx_array = train_cache_file['best_ray_array']  #outputs, one integer for Tx and another for Rx
X_test = test_cache_file['position_matrix_array']  #inputs
test_best_tx_rx_array = test_cache_file['best_ray_array']  #outputs, one integer for Tx and another for Rx
#print(position_matrix_array.shape)
#print(best_tx_rx_array.shape)

#X_train and X_test have values -4, -3, -1, 0, 2. Simplify it to using only -1 for blockers and 1 for the receiver
X_train[X_train == -4] = -1
X_train[X_train == -3] = -1
X_train[X_train == 2] = 1
X_test[X_test == -4] = -1
X_test[X_test == -3] = -1
X_test[X_test == 2] = 1
3. Load classes and features to find pairs
train_fully = (train_best_tx_rx_array[:, 0] * numUPAAntennaElements + train_best_tx_rx_array[:, 1]).astype(np.int)
test_fully = (test_best_tx_rx_array[:, 0] * numUPAAntennaElements + test_best_tx_rx_array[:, 1]).astype(np.int)
train_classes = set(train_fully)  #find unique pairs
test_classes = set(test_fully)  #find unique pairs
classes = train_classes.union(test_classes)

y_train = np.empty(train_best_tx_rx_array.shape[0])
y_test = np.empty(test_best_tx_rx_array.shape[0])
for idx, cl in enumerate(classes):  #map in single index, cl is the original class number, idx is its index
    cl_idx = np.nonzero(train_fully == cl)
    y_train[cl_idx] = idx
    cl_idx = np.nonzero(test_fully == cl)
    y_test[cl_idx] = idx
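The mapping above folds each (Tx, Rx) beam-index pair into a single class label, label = Tx * numUPAAntennaElements + Rx; with a 4x4 UPA (16 elements), the pair (2, 3) becomes 2*16 + 3 = 35. A tiny standalone sketch (pair values are illustrative only):

import numpy as np
numUPAAntennaElements = 4 * 4
pairs = np.array([[0, 0], [2, 3], [15, 15]])               # illustrative (Tx, Rx) index pairs
labels = pairs[:, 0] * numUPAAntennaElements + pairs[:, 1]
print(labels)                                              # [  0  35 255]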

4. Step 4: Load labels and set up the dimensionality


numClasses = len(classes)  #total number of labels

train_nexamples = len(X_train)
test_nexamples = len(X_test)
nrows = len(X_train[0])
ncolumns = len(X_train[0][0])

print('test_nexamples = ', test_nexamples)
print('train_nexamples = ', train_nexamples)
print('input matrices size = ', nrows, ' x ', ncolumns)
print('numClasses = ', numClasses)

#here, do not convert matrix into 1-d array
#X_train = X_train.reshape(train_nexamples, nrows * ncolumns)
#X_test = X_test.reshape(test_nexamples, nrows * ncolumns)

#fraction to be used for training set
validationFraction = 0.2  #from 0 to 1

#Keras is requiring an extra dimension: I will add it with reshape
X_train = X_train.reshape(X_train.shape[0], nrows, ncolumns, 1)
X_test = X_test.reshape(X_test.shape[0], nrows, ncolumns, 1)
input_shape = (nrows, ncolumns, 1)  #the input matrix with the extra dimension requested by Keras

print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print(X_test.shape[0] + X_train.shape[0], 'total samples')
print("Finished reading datasets")
5. Step 5: Perform classification
#from sklearn.preprocessing import OneHotEncoder
#encoder = OneHotEncoder()
#y_train = encoder.fit_transform(y_train.reshape(-1, 1))
y_train = keras.utils.to_categorical(y_train, numClasses)
original_y_test = copy.deepcopy(y_test).astype(int)
y_test = keras.utils.to_categorical(y_test, numClasses)

#declare model: Convnet with two Conv2D layers followed by a MaxPooling layer, and two dense layers
#Dropout layer consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting.
model = Sequential()
model.add(Conv2D(50, (12, 12), padding="SAME", activation='relu'))
model.add(MaxPooling2D(pool_size=(6, 6)))
model.add(Conv2D(20, (10, 10), padding="SAME", activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(4, activation='relu'))
model.add(Flatten())
#model.add(Activation('tanh'))
#model.add(Activation('softmax'))  #softmax for probability
model.add(Dense(numClasses, activation='softmax'))

model.summary()

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    shuffle=True,
                    validation_split=validationFraction)
                    #validation_data=(X_test, y_test))

#print results
score = model.evaluate(X_test, y_test, verbose=0)
print(model.metrics_names)
#print('Test loss rmse:', np.sqrt(score[0]))
#print('Test accuracy:', score[1])
print(score)

val_acc = history.history['val_acc']
acc = history.history['acc']
f = open('classification_output.txt', 'w')
f.write('validation acc\n')
f.write(str(val_acc))
f.write('\ntrain acc\n')
f.write(str(acc))
f.close()
6. Step 6: Plotting
#enable if want to plot images
if False:
    from keras.utils import plot_model
    import matplotlib.pyplot as plt

    #install graphviz: sudo apt-get install graphviz and then pip install related packages
    plot_model(model, to_file='classification_model.png', show_shapes=True)

    pred_test = model.predict(X_test)
    for i in range(len(y_test)):
        if (original_y_test[i] != np.argmax(pred_test[i])):
            myImage = X_test[i].reshape(nrows, ncolumns)
            plt.imshow(myImage)
            plt.show()
            print("Type <ENTER> for next")
            input()
Aim: Comparison of Various classifiers.

1. Step 1: Initialization

import numpy as np
#enable if want to plot images:
#import matplotlib
#matplotlib.use('WebAgg')
#matplotlib.use('Qt5Agg')
#matplotlib.use('agg')
#matplotlib.inline()
#import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

2. Step 2: Load Classifiers

names = [  #"Naive Bayes",
    "Decision Tree", "Random Forest",
    "AdaBoost",
    "Linear SVM", "RBF SVM", "Gaussian Process",
    "Neural Net",
    "QDA", "Nearest Neighbors"]

classifiers = [
    #GaussianNB(),
    DecisionTreeClassifier(max_depth=100),
    RandomForestClassifier(max_depth=100, n_estimators=30, max_features=20),
    AdaBoostClassifier(),
    LinearSVC(C=10, loss="hinge"),  #linear SVM (maximum margin perceptron)
    SVC(gamma=1, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    MLPClassifier(alpha=1),
    QuadraticDiscriminantAnalysis(),
    KNeighborsClassifier(3)]
3. Step 3: Read test and train datasets for Beam selection
numUPAAntennaElements = 4*4  #4 x 4 UPA
#trainFileName = '../datasets/all_train_classification.npz'  #(22256, 24, 362)
trainFileName = '../datasets/nlos_train_classification.npz'  #(22256, 24, 362)
print("Reading dataset...", trainFileName)
train_cache_file = np.load(trainFileName)

#testFileName = '../datasets/all_test_classification.npz'  #(22256, 24, 362)
testFileName = '../datasets/nlos_test_classification.npz'  #(22256, 24, 362)
print("Reading dataset...", testFileName)
test_cache_file = np.load(testFileName)

#input features (X_test and X_train) are arrays with matrices. Here we will convert matrices to 1-d array
X_train = train_cache_file['position_matrix_array']  #inputs
train_best_tx_rx_array = train_cache_file['best_tx_rx_array']  #outputs, one integer for Tx and another for Rx
X_test = test_cache_file['position_matrix_array']  #inputs
test_best_tx_rx_array = test_cache_file['best_tx_rx_array']  #outputs, one integer for Tx and another for Rx
#print(position_matrix_array.shape)
#print(best_tx_rx_array.shape)

#X_train and X_test have values -4, -3, -1, 0, 2. Simplify it to using only -1 for blockers and 1 for the receiver
X_train[X_train == -4] = -1
X_train[X_train == -3] = -1
X_train[X_train == 2] = 1
X_test[X_test == -4] = -1
X_test[X_test == -3] = -1
X_test[X_test == 2] = 1

4. Step 4: Perform Classification


#convert output (i, j) to single number (the class label) and eliminate pairs that do not appear
train_fully = (train_best_tx_rx_array[:, 0] * numUPAAntennaElements + train_best_tx_rx_array[:, 1]).astype(np.int)
test_fully = (test_best_tx_rx_array[:, 0] * numUPAAntennaElements + test_best_tx_rx_array[:, 1]).astype(np.int)
train_classes = set(train_fully)  #find unique pairs
test_classes = set(test_fully)  #find unique pairs
classes = train_classes.union(test_classes)

y_train = np.empty(train_best_tx_rx_array.shape[0])
y_test = np.empty(test_best_tx_rx_array.shape[0])
for idx, cl in enumerate(classes):  #map in single index, cl is the original class number, idx is its index
    cl_idx = np.nonzero(train_fully == cl)
    y_train[cl_idx] = idx
    cl_idx = np.nonzero(test_fully == cl)
    y_test[cl_idx] = idx

#new_classes = set(y)
numClasses = len(classes)  #total number of labels

train_nexamples = len(X_train)
test_nexamples = len(X_test)
nrows = len(X_train[0])
ncolumns = len(X_train[0][0])

print('test_nexamples = ', test_nexamples)
print('train_nexamples = ', train_nexamples)
print('input matrices size = ', nrows, ' x ', ncolumns)
print('numClasses = ', numClasses)

#convert matrix into 1-d array
X_train = X_train.reshape(train_nexamples, nrows * ncolumns)
X_test = X_test.reshape(test_nexamples, nrows * ncolumns)

print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print(X_test.shape[0] + X_train.shape[0], 'total samples')
print("Finished reading datasets")

5. Step 5: Output

#iterate over classifiers
for name, model in zip(names, classifiers):
    print("#### Training classifier ", name)
    model.fit(X_train, y_train)
    print('\nPrediction accuracy for the test dataset')
    pred_test = model.predict(X_test)
    print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
    #now with the train set
    pred_train = model.predict(X_train)
    print('\nPrediction accuracy for the train dataset')
    print('{:.2%}\n'.format(metrics.accuracy_score(y_train, pred_train)))
Weather Forecasting Using Machine Learning

1 Using Machine Learning To Predict Weather in Charlotte, NC

In [47]: import pandas as pd


import numpy as np
import datetime
import matplotlib
import matplotlib.pyplot as plt
#from scipy.special import logsumexp
#from scipy.misc import logsumexp
import seaborn as sns
#import statsmodels.api as sm
import sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, median_absolute_error
from sklearn.preprocessing import StandardScaler
matplotlib.style.use('ggplot')

1.1 Raw Data Preview


Charlotte, NC Climate Data from 2013 to 2018 (downloaded from the NOAA NCEI site -
https://www.ncei.noaa.gov/)

In [48]: clt_climate_df = pd.read_csv("All Datasets/Charlotte_climate_info_2013_to_2018.csv", low_memory=False)  # low_memory value assumed


clt_climate_df.head()

Out[48]: STATION STATION_NAME ELEVATION LATITUDE \


0 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236
1 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236
2 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236
3 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236
4 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236

LONGITUDE DATE REPORTTPYE HOURLYSKYCONDITIONS HOURLYVISIBILITY \


0 -80.9552 6/1/2013 0:52 FM-15 FEW:02 55 SCT:04 250 10

1 -80.9552 6/1/2013 1:00 FM-12 NaN NaN
2 -80.9552 6/1/2013 1:52 FM-15 BKN:07 65 10
3 -80.9552 6/1/2013 2:52 FM-15 BKN:07 75 10
4 -80.9552 6/1/2013 3:52 FM-15 FEW:02 75 SCT:04 110 10

HOURLYPRSENTWEATHERTYPE ... \
0 NaN ...
1 NaN ...
2 NaN ...
3 NaN ...
4 NaN ...

MonthlyMaxSeaLevelPressureTime MonthlyMinSeaLevelPressureValue \
0 -9999 NaN
1 -9999 NaN
2 -9999 NaN
3 -9999 NaN
4 -9999 NaN

MonthlyMinSeaLevelPressureDate MonthlyMinSeaLevelPressureTime \
0 -9999 -9999
1 -9999 -9999
2 -9999 -9999
3 -9999 -9999
4 -9999 -9999

MonthlyTotalHeatingDegreeDays MonthlyTotalCoolingDegreeDays \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

MonthlyDeptFromNormalHeatingDD MonthlyDeptFromNormalCoolingDD \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

MonthlyTotalSeasonToDateHeatingDD MonthlyTotalSeasonToDateCoolingDD
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

[5 rows x 90 columns]

1.2 Data Preparation & Cleanup
In [49]: # list all the columns to determine which is needed
clt_climate_df.columns

Out[49]: Index(['STATION', 'STATION_NAME', 'ELEVATION', 'LATITUDE', 'LONGITUDE', 'DATE',


'REPORTTPYE', 'HOURLYSKYCONDITIONS', 'HOURLYVISIBILITY',
'HOURLYPRSENTWEATHERTYPE', 'HOURLYDRYBULBTEMPF', 'HOURLYDRYBULBTEMPC',
'HOURLYWETBULBTEMPF', 'HOURLYWETBULBTEMPC', 'HOURLYDewPointTempF',
'HOURLYDewPointTempC', 'HOURLYRelativeHumidity', 'HOURLYWindSpeed',
'HOURLYWindDirection', 'HOURLYWindGustSpeed', 'HOURLYStationPressure',
'HOURLYPressureTendency', 'HOURLYPressureChange',
'HOURLYSeaLevelPressure', 'HOURLYPrecip', 'HOURLYAltimeterSetting',
'DAILYMaximumDryBulbTemp', 'DAILYMinimumDryBulbTemp',
'DAILYAverageDryBulbTemp', 'DAILYDeptFromNormalAverageTemp',
'DAILYAverageRelativeHumidity', 'DAILYAverageDewPointTemp',
'DAILYAverageWetBulbTemp', 'DAILYHeatingDegreeDays',
'DAILYCoolingDegreeDays', 'DAILYSunrise', 'DAILYSunset', 'DAILYWeather',
'DAILYPrecip', 'DAILYSnowfall', 'DAILYSnowDepth',
'DAILYAverageStationPressure', 'DAILYAverageSeaLevelPressure',
'DAILYAverageWindSpeed', 'DAILYPeakWindSpeed', 'PeakWindDirection',
'DAILYSustainedWindSpeed', 'DAILYSustainedWindDirection',
'MonthlyMaximumTemp', 'MonthlyMinimumTemp', 'MonthlyMeanTemp',
'MonthlyAverageRH', 'MonthlyDewpointTemp', 'MonthlyWetBulbTemp',
'MonthlyAvgHeatingDegreeDays', 'MonthlyAvgCoolingDegreeDays',
'MonthlyStationPressure', 'MonthlySeaLevelPressure',
'MonthlyAverageWindSpeed', 'MonthlyTotalSnowfall',
'MonthlyDeptFromNormalMaximumTemp', 'MonthlyDeptFromNormalMinimumTemp',
'MonthlyDeptFromNormalAverageTemp', 'MonthlyDeptFromNormalPrecip',
'MonthlyTotalLiquidPrecip', 'MonthlyGreatestPrecip',
'MonthlyGreatestPrecipDate', 'MonthlyGreatestSnowfall',
'MonthlyGreatestSnowfallDate', 'MonthlyGreatestSnowDepth',
'MonthlyGreatestSnowDepthDate', 'MonthlyDaysWithGT90Temp',
'MonthlyDaysWithLT32Temp', 'MonthlyDaysWithGT32Temp',
'MonthlyDaysWithLT0Temp', 'MonthlyDaysWithGT001Precip',
'MonthlyDaysWithGT010Precip', 'MonthlyDaysWithGT1Snow',
'MonthlyMaxSeaLevelPressureValue', 'MonthlyMaxSeaLevelPressureDate',
'MonthlyMaxSeaLevelPressureTime', 'MonthlyMinSeaLevelPressureValue',
'MonthlyMinSeaLevelPressureDate', 'MonthlyMinSeaLevelPressureTime',
'MonthlyTotalHeatingDegreeDays', 'MonthlyTotalCoolingDegreeDays',
'MonthlyDeptFromNormalHeatingDD', 'MonthlyDeptFromNormalCoolingDD',
'MonthlyTotalSeasonToDateHeatingDD',
'MonthlyTotalSeasonToDateCoolingDD'],
dtype='object')

In [50]: # Create new dataframe with necessary columns only


new_clt_climate_df = clt_climate_df.loc[:, ['STATION_NAME', 'DATE', 'DAILYMaximumDryBulbTemp',
    'DAILYMinimumDryBulbTemp', 'DAILYAverageDryBulbTemp', 'DAILYAverageRelativeHumidity',
    'DAILYAverageDewPointTemp', 'DAILYPrecip']]
new_clt_climate_df.head()

Out[50]: STATION_NAME DATE DAILYMaximumDryBulbTemp \


0 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 0:52 NaN
1 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 1:00 83.0
2 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 1:52 NaN
3 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 2:52 NaN
4 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 3:52 NaN

DAILYMinimumDryBulbTemp DAILYAverageDryBulbTemp \
0 NaN NaN
1 70.0 76.0
2 NaN NaN
3 NaN NaN
4 NaN NaN

DAILYAverageRelativeHumidity DAILYAverageDewPointTemp DAILYPrecip


0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

In [51]: # Reindex by date


new_clt_climate_df['DATE'] = pd.to_datetime(new_clt_climate_df['DATE'])
new_clt_climate_df.set_index('DATE', inplace=True)
new_clt_climate_df.index = new_clt_climate_df.index.normalize()
new_clt_climate_df.head()

Out[51]: STATION_NAME DAILYMaximumDryBulbTemp \


DATE
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US NaN
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US 83.0
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US NaN
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US NaN
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US NaN

DAILYMinimumDryBulbTemp DAILYAverageDryBulbTemp \
DATE
2013-06-01 NaN NaN
2013-06-01 70.0 76.0
2013-06-01 NaN NaN
2013-06-01 NaN NaN
2013-06-01 NaN NaN

DAILYAverageRelativeHumidity DAILYAverageDewPointTemp DAILYPrecip


DATE

2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN

In [52]: # Drop rows with NaN values


new_clt_climate_df = new_clt_climate_df.dropna()
# Replace T values (trace amount) with zero for daily precipitation and convert the data to float
new_clt_climate_df['DAILYPrecip'].replace(['T'], '0', inplace=True)
new_clt_climate_df[['DAILYPrecip']] = new_clt_climate_df[['DAILYPrecip']].apply(pd.to_numeric)
# Rename column names
new_clt_climate_df = new_clt_climate_df.rename(columns={'DAILYMaximumDryBulbTemp': 'DailyMaxTemp',
    'DAILYMinimumDryBulbTemp': 'DailyMinTemp', 'DAILYAverageDryBulbTemp': 'DailyAvgTemp',
    'DAILYAverageRelativeHumidity': 'DailyAvgRelHumidity', 'DAILYAverageDewPointTemp': 'DailyAvgDewPointTemp',
    'DAILYPrecip': 'DailyPrecip'})
new_clt_climate_df.head()

Out[52]: STATION_NAME DailyMaxTemp DailyMinTemp \


DATE
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 66.0
2013-06-02 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 69.0
2013-06-03 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 67.0
2013-06-04 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 68.0
2013-06-05 CHARLOTTE DOUGLAS AIRPORT NC US 81.0 68.0

DailyAvgTemp DailyAvgRelHumidity DailyAvgDewPointTemp \


DATE
2013-06-01 75.0 68.0 65.0
2013-06-02 77.0 78.0 66.0
2013-06-03 75.0 83.0 67.0
2013-06-04 76.0 70.0 64.0
2013-06-05 74.0 81.0 67.0

DailyPrecip
DATE
2013-06-01 0.00
2013-06-02 0.19
2013-06-03 2.33
2013-06-04 0.00
2013-06-05 0.03

In [53]: # Verify date range and total number of rows in the new dataframe
new_clt_climate_df.index

Out[53]: DatetimeIndex(['2013-06-01', '2013-06-02', '2013-06-03', '2013-06-04',


'2013-06-05', '2013-06-06', '2013-06-07', '2013-06-08',
'2013-06-09', '2013-06-10',
...
'2018-05-21', '2018-05-22', '2018-05-23', '2018-05-24',
'2018-05-25', '2018-05-26', '2018-05-27', '2018-05-28',

'2018-05-29', '2018-05-30'],
dtype='datetime64[ns]', name='DATE', length=1651, freq=None)

In [54]: # Verify data types


new_clt_climate_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1651 entries, 2013-06-01 to 2018-05-30
Data columns (total 7 columns):
STATION_NAME 1651 non-null object
DailyMaxTemp 1651 non-null float64
DailyMinTemp 1651 non-null float64
DailyAvgTemp 1651 non-null float64
DailyAvgRelHumidity 1651 non-null float64
DailyAvgDewPointTemp 1651 non-null float64
DailyPrecip 1651 non-null float64
dtypes: float64(6), object(1)
memory usage: 103.2+ KB

1.3 Visualizing the Average Daily Temperature for Charlotte, NC - 2013 to 2018
In [55]: # Visualize some of the 'cleaned' data by plotting the daily avg temperature in Charlotte
new_clt_climate_df['DailyAvgTemp'].plot(figsize=(20,7), color="green")
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Average Temperature - Fahrenheit')
plt.show()

[Figure: daily average temperature in Charlotte, NC, June 2013 to May 2018]

1.4 Derive Features for Weather Prediction Experiment


In [56]: features = ['DailyMaxTemp','DailyMinTemp','DailyAvgTemp','DailyAvgRelHumidity','DailyAvgDewPointTemp','DailyPrecip']
# Function that creates columns representing Nth prior measurements of feature
# None values maintain the consistent rows length for each N
def derive_nth_day_feature(new_clt_climate_df, feature, N):
rows = new_clt_climate_df.shape[0]
nth_prior_measurements = [None]*N + [new_clt_climate_df[feature][i-N] for i in range(N, rows)]
col_name = "{}_{}".format(feature, N)
new_clt_climate_df[col_name] = nth_prior_measurements
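For intuition, a minimal sketch (toy values taken from the first few days above) of what the derived N-th prior-measurement columns look like; the same effect can also be obtained with pandas' shift():

toy = pd.DataFrame({'DailyAvgTemp': [75.0, 77.0, 75.0, 76.0]})
toy['DailyAvgTemp_1'] = toy['DailyAvgTemp'].shift(1)   # value from 1 day earlier
toy['DailyAvgTemp_2'] = toy['DailyAvgTemp'].shift(2)   # value from 2 days earlier
print(toy)                                             # earliest rows hold None/NaN, as in the function above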

In [57]: # Call the above function using a loop through each feature
for feature in features:
if feature != 'DATE':
for N in range(1, 4):
derive_nth_day_feature(new_clt_climate_df, feature, N)

In [58]: new_clt_climate_df.head(32)

Out[58]: STATION_NAME DailyMaxTemp DailyMinTemp \


DATE
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 66.0
2013-06-02 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 69.0
2013-06-03 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 67.0
2013-06-04 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 68.0
2013-06-05 CHARLOTTE DOUGLAS AIRPORT NC US 81.0 68.0
2013-06-06 CHARLOTTE DOUGLAS AIRPORT NC US 78.0 68.0
2013-06-07 CHARLOTTE DOUGLAS AIRPORT NC US 82.0 68.0
2013-06-08 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 67.0
2013-06-09 CHARLOTTE DOUGLAS AIRPORT NC US 86.0 67.0
2013-06-10 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 68.0
2013-06-11 CHARLOTTE DOUGLAS AIRPORT NC US 87.0 67.0
2013-06-12 CHARLOTTE DOUGLAS AIRPORT NC US 90.0 64.0
2013-06-13 CHARLOTTE DOUGLAS AIRPORT NC US 92.0 69.0
2013-06-14 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 60.0
2013-06-15 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 57.0
2013-06-16 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 65.0
2013-06-17 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 71.0
2013-06-18 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 70.0
2013-06-19 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 69.0
2013-06-20 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 68.0
2013-06-21 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 66.0
2013-06-22 CHARLOTTE DOUGLAS AIRPORT NC US 86.0 62.0
2013-06-23 CHARLOTTE DOUGLAS AIRPORT NC US 86.0 69.0
2013-06-24 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 69.0
2013-06-25 CHARLOTTE DOUGLAS AIRPORT NC US 89.0 72.0
2013-06-26 CHARLOTTE DOUGLAS AIRPORT NC US 90.0 70.0
2013-06-27 CHARLOTTE DOUGLAS AIRPORT NC US 89.0 72.0
2013-06-28 CHARLOTTE DOUGLAS AIRPORT NC US 91.0 69.0
2013-06-29 CHARLOTTE DOUGLAS AIRPORT NC US 86.0 71.0
2013-07-01 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 68.0
2013-07-02 CHARLOTTE DOUGLAS AIRPORT NC US 80.0 71.0
2013-07-03 CHARLOTTE DOUGLAS AIRPORT NC US 82.0 72.0

DailyAvgTemp DailyAvgRelHumidity DailyAvgDewPointTemp \


DATE
2013-06-01 75.0 68.0 65.0
2013-06-02 77.0 78.0 66.0
2013-06-03 75.0 83.0 67.0

2013-06-04 76.0 70.0 64.0
2013-06-05 74.0 81.0 67.0
2013-06-06 73.0 94.0 69.0
2013-06-07 75.0 88.0 68.0
2013-06-08 75.0 80.0 67.0
2013-06-09 76.0 82.0 68.0
2013-06-10 75.0 87.0 69.0
2013-06-11 77.0 71.0 66.0
2013-06-12 77.0 73.0 69.0
2013-06-13 81.0 77.0 70.0
2013-06-14 72.0 60.0 57.0
2013-06-15 71.0 66.0 60.0
2013-06-16 75.0 69.0 64.0
2013-06-17 78.0 83.0 69.0
2013-06-18 77.0 86.0 70.0
2013-06-19 77.0 69.0 65.0
2013-06-20 76.0 67.0 63.0
2013-06-21 75.0 60.0 60.0
2013-06-22 74.0 71.0 64.0
2013-06-23 78.0 76.0 69.0
2013-06-24 77.0 83.0 70.0
2013-06-25 81.0 76.0 70.0
2013-06-26 80.0 78.0 70.0
2013-06-27 81.0 82.0 73.0
2013-06-28 80.0 78.0 71.0
2013-06-29 79.0 79.0 70.0
2013-07-01 75.0 83.0 69.0
2013-07-02 76.0 90.0 71.0
2013-07-03 77.0 90.0 72.0

DailyPrecip DailyMaxTemp_1 DailyMaxTemp_2 DailyMaxTemp_3 \


DATE
2013-06-01 0.00 NaN NaN NaN
2013-06-02 0.19 85.0 NaN NaN
2013-06-03 2.33 84.0 85.0 NaN
2013-06-04 0.00 83.0 84.0 85.0
2013-06-05 0.03 84.0 83.0 84.0
2013-06-06 1.12 81.0 84.0 83.0
2013-06-07 0.72 78.0 81.0 84.0
2013-06-08 0.00 82.0 78.0 81.0
2013-06-09 0.12 83.0 82.0 78.0
2013-06-10 0.62 86.0 83.0 82.0
2013-06-11 0.00 83.0 86.0 83.0
2013-06-12 0.00 87.0 83.0 86.0
2013-06-13 0.49 90.0 87.0 83.0
2013-06-14 0.00 92.0 90.0 87.0
2013-06-15 0.00 83.0 92.0 90.0
2013-06-16 0.00 84.0 83.0 92.0

2013-06-17 0.24 85.0 84.0 83.0
2013-06-18 0.41 85.0 85.0 84.0
2013-06-19 0.00 84.0 85.0 85.0
2013-06-20 0.00 84.0 84.0 85.0
2013-06-21 0.00 84.0 84.0 84.0
2013-06-22 0.00 84.0 84.0 84.0
2013-06-23 0.01 86.0 84.0 84.0
2013-06-24 0.01 86.0 86.0 84.0
2013-06-25 0.00 85.0 86.0 86.0
2013-06-26 0.00 89.0 85.0 86.0
2013-06-27 0.32 90.0 89.0 85.0
2013-06-28 0.27 89.0 90.0 89.0
2013-06-29 0.00 91.0 89.0 90.0
2013-07-01 0.07 86.0 91.0 89.0
2013-07-02 0.34 83.0 86.0 91.0
2013-07-03 0.11 80.0 83.0 86.0

... DailyAvgTemp_3 DailyAvgRelHumidity_1 \


DATE ...
2013-06-01 ... NaN NaN
2013-06-02 ... NaN 68.0
2013-06-03 ... NaN 78.0
2013-06-04 ... 75.0 83.0
2013-06-05 ... 77.0 70.0
2013-06-06 ... 75.0 81.0
2013-06-07 ... 76.0 94.0
2013-06-08 ... 74.0 88.0
2013-06-09 ... 73.0 80.0
2013-06-10 ... 75.0 82.0
2013-06-11 ... 75.0 87.0
2013-06-12 ... 76.0 71.0
2013-06-13 ... 75.0 73.0
2013-06-14 ... 77.0 77.0
2013-06-15 ... 77.0 60.0
2013-06-16 ... 81.0 66.0
2013-06-17 ... 72.0 69.0
2013-06-18 ... 71.0 83.0
2013-06-19 ... 75.0 86.0
2013-06-20 ... 78.0 69.0
2013-06-21 ... 77.0 67.0
2013-06-22 ... 77.0 60.0
2013-06-23 ... 76.0 71.0
2013-06-24 ... 75.0 76.0
2013-06-25 ... 74.0 83.0
2013-06-26 ... 78.0 76.0
2013-06-27 ... 77.0 78.0
2013-06-28 ... 81.0 82.0
2013-06-29 ... 80.0 78.0

2013-07-01 ... 81.0 79.0
2013-07-02 ... 80.0 83.0
2013-07-03 ... 79.0 90.0

DailyAvgRelHumidity_2 DailyAvgRelHumidity_3 \
DATE
2013-06-01 NaN NaN
2013-06-02 NaN NaN
2013-06-03 68.0 NaN
2013-06-04 78.0 68.0
2013-06-05 83.0 78.0
2013-06-06 70.0 83.0
2013-06-07 81.0 70.0
2013-06-08 94.0 81.0
2013-06-09 88.0 94.0
2013-06-10 80.0 88.0
2013-06-11 82.0 80.0
2013-06-12 87.0 82.0
2013-06-13 71.0 87.0
2013-06-14 73.0 71.0
2013-06-15 77.0 73.0
2013-06-16 60.0 77.0
2013-06-17 66.0 60.0
2013-06-18 69.0 66.0
2013-06-19 83.0 69.0
2013-06-20 86.0 83.0
2013-06-21 69.0 86.0
2013-06-22 67.0 69.0
2013-06-23 60.0 67.0
2013-06-24 71.0 60.0
2013-06-25 76.0 71.0
2013-06-26 83.0 76.0
2013-06-27 76.0 83.0
2013-06-28 78.0 76.0
2013-06-29 82.0 78.0
2013-07-01 78.0 82.0
2013-07-02 79.0 78.0
2013-07-03 83.0 79.0

DailyAvgDewPointTemp_1 DailyAvgDewPointTemp_2 \
DATE
2013-06-01 NaN NaN
2013-06-02 65.0 NaN
2013-06-03 66.0 65.0
2013-06-04 67.0 66.0
2013-06-05 64.0 67.0
2013-06-06 67.0 64.0
2013-06-07 69.0 67.0

2013-06-08 68.0 69.0
2013-06-09 67.0 68.0
2013-06-10 68.0 67.0
2013-06-11 69.0 68.0
2013-06-12 66.0 69.0
2013-06-13 69.0 66.0
2013-06-14 70.0 69.0
2013-06-15 57.0 70.0
2013-06-16 60.0 57.0
2013-06-17 64.0 60.0
2013-06-18 69.0 64.0
2013-06-19 70.0 69.0
2013-06-20 65.0 70.0
2013-06-21 63.0 65.0
2013-06-22 60.0 63.0
2013-06-23 64.0 60.0
2013-06-24 69.0 64.0
2013-06-25 70.0 69.0
2013-06-26 70.0 70.0
2013-06-27 70.0 70.0
2013-06-28 73.0 70.0
2013-06-29 71.0 73.0
2013-07-01 70.0 71.0
2013-07-02 69.0 70.0
2013-07-03 71.0 69.0

DailyAvgDewPointTemp_3 DailyPrecip_1 DailyPrecip_2 \


DATE
2013-06-01 NaN NaN NaN
2013-06-02 NaN 0.00 NaN
2013-06-03 NaN 0.19 0.00
2013-06-04 65.0 2.33 0.19
2013-06-05 66.0 0.00 2.33
2013-06-06 67.0 0.03 0.00
2013-06-07 64.0 1.12 0.03
2013-06-08 67.0 0.72 1.12
2013-06-09 69.0 0.00 0.72
2013-06-10 68.0 0.12 0.00
2013-06-11 67.0 0.62 0.12
2013-06-12 68.0 0.00 0.62
2013-06-13 69.0 0.00 0.00
2013-06-14 66.0 0.49 0.00
2013-06-15 69.0 0.00 0.49
2013-06-16 70.0 0.00 0.00
2013-06-17 57.0 0.00 0.00
2013-06-18 60.0 0.24 0.00
2013-06-19 64.0 0.41 0.24
2013-06-20 69.0 0.00 0.41

2013-06-21 70.0 0.00 0.00
2013-06-22 65.0 0.00 0.00
2013-06-23 63.0 0.00 0.00
2013-06-24 60.0 0.01 0.00
2013-06-25 64.0 0.01 0.01
2013-06-26 69.0 0.00 0.01
2013-06-27 70.0 0.00 0.00
2013-06-28 70.0 0.32 0.00
2013-06-29 70.0 0.27 0.32
2013-07-01 73.0 0.00 0.27
2013-07-02 71.0 0.07 0.00
2013-07-03 70.0 0.34 0.07

DailyPrecip_3
DATE
2013-06-01 NaN
2013-06-02 NaN
2013-06-03 NaN
2013-06-04 0.00
2013-06-05 0.19
2013-06-06 2.33
2013-06-07 0.00
2013-06-08 0.03
2013-06-09 1.12
2013-06-10 0.72
2013-06-11 0.00
2013-06-12 0.12
2013-06-13 0.62
2013-06-14 0.00
2013-06-15 0.00
2013-06-16 0.49
2013-06-17 0.00
2013-06-18 0.00
2013-06-19 0.00
2013-06-20 0.24
2013-06-21 0.41
2013-06-22 0.00
2013-06-23 0.00
2013-06-24 0.00
2013-06-25 0.00
2013-06-26 0.01
2013-06-27 0.01
2013-06-28 0.00
2013-06-29 0.00
2013-07-01 0.32
2013-07-02 0.27
2013-07-03 0.00

[32 rows x 25 columns]

In [59]: # Evaluate the distribution of the feature data; transpose describe() so each feature is a row
spread = new_clt_climate_df.describe().T
spread

Out[59]: count mean std min 25% 50% 75% \


DailyMaxTemp 1651.0 73.609933 15.500866 26.0 62.0 76.0 86.00
DailyMinTemp 1651.0 51.928528 16.196924 5.0 39.0 55.0 67.00
DailyAvgTemp 1651.0 62.815263 15.366170 16.0 51.0 65.0 76.00
DailyAvgRelHumidity 1651.0 64.890975 14.879809 26.0 55.0 65.0 76.00
DailyAvgDewPointTemp 1651.0 49.496669 17.611304 -4.0 36.0 54.0 65.00
DailyPrecip 1651.0 0.121532 0.344428 0.0 0.0 0.0 0.03
DailyMaxTemp_1 1650.0 73.603030 15.503027 26.0 62.0 76.0 86.00
DailyMaxTemp_2 1649.0 73.596119 15.505187 26.0 62.0 76.0 86.00
DailyMaxTemp_3 1648.0 73.593447 15.509513 26.0 62.0 76.0 86.00
DailyMinTemp_1 1650.0 51.916364 16.194288 5.0 39.0 55.0 67.00
DailyMinTemp_2 1649.0 51.904184 16.191640 5.0 39.0 55.0 67.00
DailyMinTemp_3 1648.0 51.892597 16.189714 5.0 39.0 55.0 67.00
DailyAvgTemp_1 1650.0 62.806061 15.366277 16.0 51.0 65.0 76.00
DailyAvgTemp_2 1649.0 62.796847 15.366378 16.0 51.0 65.0 76.00
DailyAvgTemp_3 1648.0 62.789442 15.368099 16.0 51.0 65.0 76.00
DailyAvgRelHumidity_1 1650.0 64.880000 14.877634 26.0 55.0 65.0 76.00
DailyAvgRelHumidity_2 1649.0 64.864767 14.869269 26.0 55.0 65.0 76.00
DailyAvgRelHumidity_3 1648.0 64.848301 14.858737 26.0 55.0 65.0 76.00
DailyAvgDewPointTemp_1 1650.0 49.483030 17.607919 -4.0 36.0 54.0 65.00
DailyAvgDewPointTemp_2 1649.0 49.468769 17.603726 -4.0 36.0 54.0 65.00
DailyAvgDewPointTemp_3 1648.0 49.455704 17.601070 -4.0 36.0 54.0 65.00
DailyPrecip_1 1650.0 0.121606 0.344519 0.0 0.0 0.0 0.03
DailyPrecip_2 1649.0 0.121364 0.344484 0.0 0.0 0.0 0.03
DailyPrecip_3 1648.0 0.121317 0.344583 0.0 0.0 0.0 0.03

max
DailyMaxTemp 100.00
DailyMinTemp 78.00
DailyAvgTemp 88.00
DailyAvgRelHumidity 98.00
DailyAvgDewPointTemp 75.00
DailyPrecip 3.89
DailyMaxTemp_1 100.00
DailyMaxTemp_2 100.00
DailyMaxTemp_3 100.00
DailyMinTemp_1 78.00
DailyMinTemp_2 78.00
DailyMinTemp_3 78.00
DailyAvgTemp_1 88.00
DailyAvgTemp_2 88.00
DailyAvgTemp_3 88.00

DailyAvgRelHumidity_1 98.00
DailyAvgRelHumidity_2 98.00
DailyAvgRelHumidity_3 98.00
DailyAvgDewPointTemp_1 75.00
DailyAvgDewPointTemp_2 75.00
DailyAvgDewPointTemp_3 75.00
DailyPrecip_1 3.89
DailyPrecip_2 3.89
DailyPrecip_3 3.89

In [60]: # Drop rows with NaN values


new_clt_climate_df = new_clt_climate_df.dropna()
new_clt_climate_df.head()

Out[60]: STATION_NAME DailyMaxTemp DailyMinTemp \


DATE
2013-06-04 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 68.0
2013-06-05 CHARLOTTE DOUGLAS AIRPORT NC US 81.0 68.0
2013-06-06 CHARLOTTE DOUGLAS AIRPORT NC US 78.0 68.0
2013-06-07 CHARLOTTE DOUGLAS AIRPORT NC US 82.0 68.0
2013-06-08 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 67.0

DailyAvgTemp DailyAvgRelHumidity DailyAvgDewPointTemp \


DATE
2013-06-04 76.0 70.0 64.0
2013-06-05 74.0 81.0 67.0
2013-06-06 73.0 94.0 69.0
2013-06-07 75.0 88.0 68.0
2013-06-08 75.0 80.0 67.0

DailyPrecip DailyMaxTemp_1 DailyMaxTemp_2 DailyMaxTemp_3 \


DATE
2013-06-04 0.00 83.0 84.0 85.0
2013-06-05 0.03 84.0 83.0 84.0
2013-06-06 1.12 81.0 84.0 83.0
2013-06-07 0.72 78.0 81.0 84.0
2013-06-08 0.00 82.0 78.0 81.0

... DailyAvgTemp_3 DailyAvgRelHumidity_1 \


DATE ...
2013-06-04 ... 75.0 83.0
2013-06-05 ... 77.0 70.0
2013-06-06 ... 75.0 81.0
2013-06-07 ... 76.0 94.0
2013-06-08 ... 74.0 88.0

DailyAvgRelHumidity_2 DailyAvgRelHumidity_3 \
DATE

2013-06-04 78.0 68.0
2013-06-05 83.0 78.0
2013-06-06 70.0 83.0
2013-06-07 81.0 70.0
2013-06-08 94.0 81.0

DailyAvgDewPointTemp_1 DailyAvgDewPointTemp_2 \
DATE
2013-06-04 67.0 66.0
2013-06-05 64.0 67.0
2013-06-06 67.0 64.0
2013-06-07 69.0 67.0
2013-06-08 68.0 69.0

DailyAvgDewPointTemp_3 DailyPrecip_1 DailyPrecip_2 \


DATE
2013-06-04 65.0 2.33 0.19
2013-06-05 66.0 0.00 2.33
2013-06-06 67.0 0.03 0.00
2013-06-07 64.0 1.12 0.03
2013-06-08 67.0 0.72 1.12

DailyPrecip_3
DATE
2013-06-04 0.00
2013-06-05 0.19
2013-06-06 2.33
2013-06-07 0.00
2013-06-08 0.03

[5 rows x 25 columns]

In [61]: # Assess the linearity between variables using the Pearson correlation coefficient.
df_linear = new_clt_climate_df.corr()[['DailyAvgTemp']].sort_values('DailyAvgTemp')
df_linear

Out[61]: DailyAvgTemp
DailyPrecip_2 -0.038175
DailyPrecip_3 -0.019563
DailyPrecip_1 -0.010878
DailyPrecip 0.010496
DailyAvgRelHumidity_3 0.208985
DailyAvgRelHumidity_2 0.219423
DailyAvgRelHumidity_1 0.295778
DailyAvgRelHumidity 0.334309
DailyAvgDewPointTemp_3 0.757075
DailyAvgDewPointTemp_2 0.790034
DailyMinTemp_3 0.801019

DailyMaxTemp_3 0.808100
DailyAvgTemp_3 0.826574
DailyMinTemp_2 0.831432
DailyMaxTemp_2 0.845655
DailyAvgTemp_2 0.861624
DailyAvgDewPointTemp_1 0.873512
DailyMinTemp_1 0.898852
DailyMaxTemp_1 0.912172
DailyAvgTemp_1 0.930317
DailyAvgDewPointTemp 0.939037
DailyMaxTemp 0.971456
DailyMinTemp 0.973825
DailyAvgTemp 1.000000
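A side note: the predictor list chosen in the next cell could also be derived programmatically from this correlation table. The sketch below keeps only the lagged (derived) features whose absolute correlation with DailyAvgTemp exceeds 0.6; the 0.6 cutoff is an illustrative assumption, not a value taken from the notebook.

corr = new_clt_climate_df.corr()['DailyAvgTemp']
# keep only the _1/_2/_3 lagged columns that correlate strongly with the target
predictors = [col for col in corr.index
              if col != 'DailyAvgTemp'
              and col.endswith(('_1', '_2', '_3'))
              and abs(corr[col]) > 0.6]
print(predictors)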

1.5 Visualizing Feature Relationships


In [62]: # Create new dataframe with features of interest
predictors = ['DailyMaxTemp_1', 'DailyMaxTemp_2', 'DailyMaxTemp_3',
              'DailyMinTemp_1', 'DailyMinTemp_2', 'DailyMinTemp_3',
              'DailyAvgTemp_1', 'DailyAvgTemp_2', 'DailyAvgTemp_3',
              'DailyAvgDewPointTemp_1', 'DailyAvgDewPointTemp_2', 'DailyAvgDewPointTemp_3']
new_clt_climate_df2 = new_clt_climate_df[['DailyAvgTemp'] + predictors]
new_clt_climate_df2.head()

Out[62]: DailyAvgTemp DailyMaxTemp_1 DailyMaxTemp_2 DailyMaxTemp_3 \


DATE
2013-06-04 76.0 83.0 84.0 85.0
2013-06-05 74.0 84.0 83.0 84.0
2013-06-06 73.0 81.0 84.0 83.0
2013-06-07 75.0 78.0 81.0 84.0
2013-06-08 75.0 82.0 78.0 81.0

DailyMinTemp_1 DailyMinTemp_2 DailyMinTemp_3 DailyAvgTemp_1 \


DATE
2013-06-04 67.0 69.0 66.0 75.0
2013-06-05 68.0 67.0 69.0 76.0
2013-06-06 68.0 68.0 67.0 74.0
2013-06-07 68.0 68.0 68.0 73.0
2013-06-08 68.0 68.0 68.0 75.0

DailyAvgTemp_2 DailyAvgTemp_3 DailyAvgDewPointTemp_1 \


DATE
2013-06-04 77.0 75.0 67.0
2013-06-05 75.0 77.0 64.0
2013-06-06 76.0 75.0 67.0
2013-06-07 74.0 76.0 69.0
2013-06-08 73.0 74.0 68.0

DailyAvgDewPointTemp_2 DailyAvgDewPointTemp_3
DATE
2013-06-04 66.0 65.0

2013-06-05 67.0 66.0
2013-06-06 64.0 67.0
2013-06-07 67.0 64.0
2013-06-08 69.0 67.0

In [63]: %matplotlib inline

# Manually set the parameters of the figure to an appropriate size


plt.rcParams['figure.figsize'] = [16, 22]

# Call subplots specifying the desired grid structure


# The y axes should be shared
fig, axes = plt.subplots(nrows=4, ncols=3, sharey=True)

# Loop through the features that will be the predictors to build the plot
# Rearrange data into a 2D array of 4 rows and 3 columns
arr = np.array(predictors).reshape(4, 3)

# Use enumerate to loop over the 2D array of rows and columns


# Create scatter plots of each feature vs DailyAvgTemp; DailyAvgTemp is the dependent variable in every plot
for row, col_arr in enumerate(arr):
    for col, feature in enumerate(col_arr):
        axes[row, col].scatter(new_clt_climate_df2[feature], new_clt_climate_df2['DailyAvgTemp'])
        if col == 0:
            axes[row, col].set(xlabel=feature, ylabel='DailyAvgTemp')
        else:
            axes[row, col].set(xlabel=feature)
plt.show()

[Figure: 4 x 3 grid of scatter plots, one per lagged predictor, each plotted against DailyAvgTemp]

1.6 Using Step-wise Regression to Build a Model


To test for the effects of interactions on the significance of any one variable in a linear regression model, a technique known as step-wise regression is often applied. Using step-wise regression, you add or remove variables from the model and assess the statistical significance of each variable on the resulting model.
A backward elimination technique will be applied using the following steps:
1. Select a significance level α against which the hypothesis for each variable is tested, to determine whether that variable should stay in the model.
2. Fit the model with all predictor variables.
3. Evaluate the p-values of the βj coefficients and take the one with the greatest p-value. If that p-value > α, proceed to step 4; otherwise the current model is the final model.
4. Remove the predictor identified in step 3.
5. Fit the model again, this time without the removed variable, and cycle back to step 3.
These steps help select statistically meaningful predictors (features); the cells below apply them one round at a time, and a compact loop automating the procedure is sketched just after this list.
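As referenced above, the per-round cells can be collapsed into a single loop. The following is a minimal sketch of that automation (it assumes statsmodels is imported as sm and that X and y are constructed exactly as in the next cell; it is not the notebook's own code):

import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    # Repeatedly fit OLS and drop the predictor with the largest p-value, as long as that p-value exceeds alpha
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop('const', errors='ignore')  # never drop the intercept term
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            return model, X
        X = X.drop(worst, axis=1)

With the same alpha, backward_eliminate(X, y) should end at the same reduced predictor set that the manual rounds below arrive at.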

In [ ]: # Separate predictor variables (X) from the outcome variable y


X = new_clt_climate_df2[predictors]
y = new_clt_climate_df2['DailyAvgTemp']

# Add a constant to the predictor variable set to represent the β0 (intercept) term


X = sm.add_constant(X)
X.iloc[:5, :5]

In [ ]: # Step 1 - Select a significance value


alpha = 0.05

# Step 2 - Fit the model


model = sm.OLS(y, X).fit()

# Step 3 - Evaluate the coefficients' p-values


model.summary()

In [ ]: # Step 3 (cont.) - Identify the predictor with the greatest p-value and check whether it exceeds alpha
# Based on the summary table, DailyAvgTemp_1 has the greatest p-value, and it is greater than alpha (0.05)

# Step 4 - Use pandas drop function to remove this column from X


X = X.drop('DailyAvgTemp_1', axis=1)

# Step 5 - Fit the model


model = sm.OLS(y, X).fit()
model.summary()

In [ ]: # Repeat steps 1 - 5 to continue identifying predictors whose greatest p-value exceeds alpha
# ROUND 2
X = X.drop('DailyMinTemp_2', axis=1)
model = sm.OLS(y, X).fit()
model.summary()

In [ ]: # Repeat steps 1 - 5 to continue identifying predictors whose greatest p-value exceeds alpha
# ROUND 3
X = X.drop('DailyAvgTemp_3', axis=1)
model = sm.OLS(y, X).fit()
model.summary()

In [ ]: # Repeat steps 1 - 5 to continue identifying predictors whose greatest p-value exceeds alpha
# ROUND 4
X = X.drop('DailyMaxTemp_3', axis=1)
model = sm.OLS(y, X).fit()
model.summary()

In [ ]: # Repeat steps 1 - 5 to continue identifying predictors whose greatest p-value exceeds alpha
# ROUND 5
X = X.drop('DailyMaxTemp_2', axis=1)
model = sm.OLS(y, X).fit()
model.summary()

1.7 Using the SciKit-Learn Linear Regression Module to Predict the Weather
The training and testing datasets are split into 80% training and 20% testing.
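The import cell for this part is not reproduced here; based on the calls that follow, roughly these scikit-learn imports are assumed (a sketch of what is needed, not a copy of the notebook's import cell):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, median_absolute_error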

In [ ]: # A random_state of 12 is fixed so that the same random train/test split is produced on every run.


# This random_state parameter is very useful for reproducibility of results.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [ ]: # Instantiate the regressor class


regressor = LinearRegression()

# Fit and build the model by fitting the regressor to the training data
regressor.fit(X_train, y_train)

# Make a prediction set using the test set


prediction = regressor.predict(X_test)

# Evaluate the model on the test set; regressor.score() returns R^2 (the coefficient of determination)


print("Accuracy of Linear Regression: %.2f" % regressor.score(X_test, y_test))
print("The Mean Absolute Error: %.2f degrees fahrenheit" % mean_absolute_error(y_test, prediction))
print("The Median Absolute Error: %.2f degrees fahrenheit" % median_absolute_error(y_test, prediction))

1.8 Visualizing Weather Forecast Predictions


In [ ]: # 365 days will be the number of forecast days
forecast_out = int(365)

# X contains the last 'n=forecast_out' rows for which we don't have label data
# Put those rows in a separate matrix X_forecast_out, i.e. X_forecast_out = X[-forecast_out:]
X_forecast_out = X[-forecast_out:]
X = X[:-forecast_out]
print ("Length of X_forecast_out:", len(X_forecast_out), "& Length of X :", len(X))

In [ ]: # Predict average temp for the next 365 days using our Model
forecast_prediction = regressor.predict(X_forecast_out)
print(forecast_prediction)

In [ ]: # Plotting data with the 365-day forecast included


new_clt_climate_df['Forecast'] = np.nan
last_date = new_clt_climate_df.iloc[-1].name
last_unix = last_date.timestamp()
# Number of seconds in a day
one_day = 86400

next_unix = last_unix + one_day

for i in forecast_prediction:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    # every existing column gets NaN for a forecast date; the predicted value goes into the 'Forecast' column
    new_clt_climate_df.loc[next_date] = [np.nan for _ in range(len(new_clt_climate_df.columns) - 1)] + [i]
new_clt_climate_df['DailyAvgTemp'].plot(figsize=(20,7), color="green")
new_clt_climate_df['Forecast'].plot(figsize=(20,7), color="orange")
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Average Temperature - Fahrenheit')
plt.show()
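The unix-timestamp arithmetic above can also be expressed with a pandas date range; a brief equivalent sketch (same outcome, not the notebook's code, assuming pandas is imported as pd):

# build the 365 future dates directly and pair them with the forecast values
future_dates = pd.date_range(start=last_date + pd.Timedelta(days=1),
                             periods=len(forecast_prediction), freq='D')
forecast_series = pd.Series(forecast_prediction, index=future_dates)
# forecast_series could then be assigned into the 'Forecast' column instead of the loop above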

Create Database through Python SQL - Part 1

In [14]: # Python SQL toolkit and Object Relational Mapper


import sqlalchemy
from sqlalchemy import create_engine, MetaData
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric, Text, Float
import pandas as pd

In [15]: # Create an engine to the SQLite dB


engine = create_engine("sqlite:///Global_Land_Temps.sqlite")

In [16]: # Create a connection to the engine called `conn`


conn = engine.connect()

In [17]: # Load the cleaned csv file into a pandas dataframe


new_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/US_Cit

In [18]: # Verify the datatypes


new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294522 entries, 0 to 294521
Data columns (total 7 columns):
dt 294522 non-null object
AverageTemperature 294522 non-null float64
AverageTemperatureUncertainty 294522 non-null float64
City 294522 non-null object
Country 294522 non-null object
Latitude 294522 non-null object
Longitude 294522 non-null object
dtypes: float64(2), object(5)
memory usage: 15.7+ MB

In [19]: # Use `declarative_base` from SQLAlchemy to model the table as an ORM class
Base = declarative_base()
class US_Cities(Base):
__tablename__ = 'US_Cities_GLT'

id = Column(Integer, primary_key=True)
dt = Column(Text)

AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
City = Column(Text)
Country = Column(Text)
Latitude = Column(Text)
Longitude = Column(Text)

    def __repr__(self):
        return f"id={self.id}, city={self.City}"

In [20]: # Use `create_all` to create the table in the database


Base.metadata.create_all(engine)

In [21]: # Use Orient='records' to create a list of data to write


# to_dict() cleans out DataFrame metadata as well
data = new_df.to_dict(orient='records')

In [22]: # Data is just a list of dictionaries that represent each row of data
print(data[:5])

[{'dt': '1918-01-01', 'AverageTemperature': 1.2830000000000004, 'AverageTemperatureUncertainty':

In [23]: # Use MetaData from SQLAlchemy to reflect the tables


metadata = MetaData(bind=engine)
metadata.reflect()

In [24]: # Save the reference to the table as a variable called `table`


table = sqlalchemy.Table('US_Cities_GLT', metadata, autoload=True)

In [25]: # Use `table.insert()` to insert the data into the table


# The SQL table is populated during this step
conn.execute(table.insert(), data)

Out[25]: <sqlalchemy.engine.result.ResultProxy at 0x7efdc828ce10>

In [26]: # Test that the insert works by fetching the first 5 rows.
conn.execute("select * from US_Cities_GLT limit 5").fetchall()

Out[26]: [(1, '1918-01-01', 1.2830000000000004, 0.325, 'Abilene', 'United States', '32.95N', '10
(2, '1918-02-01', 9.244, 0.319, 'Abilene', 'United States', '32.95N', '100.53W'),
(3, '1918-03-01', 14.636, 0.41600000000000004, 'Abilene', 'United States', '32.95N', '
(4, '1918-04-01', 16.227999999999998, 0.44299999999999995, 'Abilene', 'United States',
(5, '1918-05-01', 23.049, 0.486, 'Abilene', 'United States', '32.95N', '100.53W')]
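The rows can also be read back through the ORM class rather than raw SQL; the snippet below is a minimal sketch using a SQLAlchemy Session on the same engine (not part of the original notebook).

from sqlalchemy.orm import Session

session = Session(bind=engine)
# fetch the first five mapped US_Cities objects and print a few attributes
for row in session.query(US_Cities).limit(5):
    print(row.id, row.dt, row.AverageTemperature, row.City)
session.close()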

Create Database through Python SQL - Part 2

In [1]: import sqlalchemy


from sqlalchemy import create_engine, MetaData
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric, Text, Float
import pandas as pd

In [2]: engine = create_engine("sqlite:///Global_Land_Temps.sqlite")


conn = engine.connect()

In [3]: state_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/G


country_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data
GT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/GLT100.c

In [4]: Base = declarative_base()


class States_GLT(Base):
__tablename__ = 'GLT_By_State'

id = Column(Integer, primary_key=True)
dt = Column(Text)
AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
State = Column(Text)
Country = Column(Text)

class Countries_GLT(Base):
__tablename__ = 'GLT_By_Country'

id = Column(Integer, primary_key=True)
dt = Column(Text)
AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
Country = Column(Text)

class GLT_General(Base):
__tablename__ = 'GLT'

id = Column(Integer, primary_key=True)
dt = Column(Text)
LandAverageTemperature = Column(Float)

LandAverageTemperatureUncertainty = Column(Float)
LandMaxTemperature = Column(Float)
LandMaxTemperatureUncertainty = Column(Float)
LandMinTemperature = Column(Float)
LandMinTemperatureUncertainty = Column(Float)
LandAndOceanAverageTemperature = Column(Float)
LandAndOceanAverageTemperatureUncertainty = Column(Float)

    def __repr__(self):
        return f"id={self.id}, dt={self.dt}"

In [5]: Base.metadata.create_all(engine)
data2 = state_GLT_df.to_dict(orient='records')
data3 = country_GLT_df.to_dict(orient='records')
data4 = GT_df.to_dict(orient='records')

In [6]: print(data2[:5])
print(data3[:5])
print(data4[:5])

[{'dt': '1918-01-01', 'AverageTemperature': 24.22300000000001, 'AverageTemperatureUncertainty':


[{'dt': '1918-01-01', 'AverageTemperature': -5.4339999999999975, 'AverageTemperatureUncertainty'
[{'dt': '1918-01-01', 'LandAverageTemperature': 1.934, 'LandAverageTemperatureUncertainty': 0.25

In [7]: metadata = MetaData(bind=engine)


metadata.reflect()

In [8]: table2 = sqlalchemy.Table('GLT_By_State', metadata, autoload=True)


table3 = sqlalchemy.Table('GLT_By_Country', metadata, autoload=True)
table4 = sqlalchemy.Table('GLT', metadata, autoload=True)

In [9]: conn.execute(table2.insert(), data2)


conn.execute(table3.insert(), data3)
conn.execute(table4.insert(), data4)

Out[9]: <sqlalchemy.engine.result.ResultProxy at 0x7f0171a91550>

In [10]: conn.execute("select * from GLT_By_State limit 5").fetchall()

Out[10]: [(1, '1918-01-01', 24.22300000000001, 0.573, 'Acre', 'Brazil'),


(2, '1918-02-01', 24.663, 1.286, 'Acre', 'Brazil'),
(3, '1918-03-01', 24.882, 0.7120000000000001, 'Acre', 'Brazil'),
(4, '1918-04-01', 25.038, 0.461, 'Acre', 'Brazil'),
(5, '1918-05-01', 25.27, 0.562, 'Acre', 'Brazil')]

In [11]: conn.execute("select * from GLT_By_Country limit 5").fetchall()

Out[11]: [(1, '1918-01-01', -5.4339999999999975, 0.5579999999999999, 'Åland'),
(2, '1918-02-01', -2.636, 0.449, 'Åland'),
(3, '1918-03-01', -1.0500000000000005, 0.612, 'Åland'),
(4, '1918-04-01', 2.615, 0.418, 'Åland'),
(5, '1918-05-01', 7.162999999999999, 0.343, 'Åland')]

In [12]: conn.execute("select * from GLT limit 5").fetchall()

Out[12]: [(1, '1918-01-01', 1.934, 0.251, 7.5520000000000005, 0.261, -3.8020000000000014, 0.371,


(2, '1918-02-01', 2.455, 0.342, 8.256, 0.314, -3.568, 0.344, 13.312, 0.156),
(3, '1918-03-01', 4.811, 0.257, 10.704, 0.197, -1.2670000000000001, 0.31, 14.034, 0.14
(4, '1918-04-01', 7.643999999999999, 0.258, 13.706, 0.255, 1.426, 0.38, 14.794, 0.1480
(5, '1918-05-01', 10.54, 0.304, 16.48, 0.332, 4.386, 0.359, 15.732999999999999, 0.156)
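For later analysis, any of these tables can be pulled straight back into pandas; a minimal sketch, assuming the same conn and the pandas import from the top of this notebook:

# load the global temperature table into a DataFrame, parsing dt as a datetime index
glt_df = pd.read_sql("select * from GLT", conn, parse_dates=['dt'], index_col='dt')
print(glt_df.head())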

CSV-to-SQLite-dB Part III

January 15, 2020

In [1]: import sqlalchemy


from sqlalchemy import create_engine, MetaData
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric, Text, Float
import pandas as pd

In [2]: engine = create_engine("sqlite:///Global_Land_Temps.sqlite")


conn = engine.connect()

In [3]: new_df = pd.read_csv("All Datasets/US-events-1980-2018.csv", low_memory=False)


new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 6 columns):
Name 233 non-null object
Disaster 233 non-null object
BeginDate 233 non-null int64
EndDate 233 non-null int64
Total_CPI_Adjusted_Cost_Millions 233 non-null float64
Deaths 233 non-null int64
dtypes: float64(1), int64(3), object(2)
memory usage: 11.0+ KB

In [4]: Base = declarative_base()


class US_Disaster_Events(Base):
__tablename__ = 'US_Disasters'

id = Column(Integer, primary_key=True)
Name = Column(Text)
Disaster = Column(Text)
BeginDate = Column(Integer)
EndDate = Column(Integer)
Total_CPI_Adjusted_Cost_Millions = Column(Float)
Deaths = Column(Integer)

    def __repr__(self):
        return f"id={self.id}, name={self.Name}"

In [5]: Base.metadata.create_all(engine)

In [6]: data = new_df.to_dict(orient='records')


print(data[:5])

[{'Name': 'Texas Hail Storm (June 2018)', 'Disaster': 'Severe Storm', 'BeginDate': 20180606, 'En

In [7]: metadata = MetaData(bind=engine)


metadata.reflect()

In [8]: table = sqlalchemy.Table('US_Disasters', metadata, autoload=True)

In [9]: conn.execute(table.insert(), data)

Out[9]: <sqlalchemy.engine.result.ResultProxy at 0x7f2a89bc08d0>

In [10]: conn.execute("select * from US_Disasters limit 5").fetchall()

Out[10]: [(1, 'Texas Hail Storm (June 2018)', 'Severe Storm', 20180606, 20180606, 1150.0, 0),
(2, 'Central and Eastern Severe Weather (May 2018)', 'Severe Storm', 20180513, 2018051
(3, 'Central and Northeastern Severe Weather (May 2018)', 'Severe Storm', 20180501, 20
(4, 'Southeastern Severe Storms and Tornadoes (March 2018)', 'Severe Storm', 20180318,
(5, 'Northeast Winter Storm (March 2018)', 'Winter Storm', 20180301, 20180303, 2216.0,
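BeginDate and EndDate are stored as YYYYMMDD integers. If proper timestamps are needed downstream, they can be converted as in this sketch (an optional step, not performed in the notebook):

# convert the YYYYMMDD integer columns to pandas datetimes
new_df['BeginDate'] = pd.to_datetime(new_df['BeginDate'], format='%Y%m%d')
new_df['EndDate'] = pd.to_datetime(new_df['EndDate'], format='%Y%m%d')
print(new_df[['Name', 'BeginDate', 'EndDate']].head())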

Data Preview & Prep - Part I

January 15, 2020

In [1]: import pandas as pd


import numpy as np
import datetime

1 Earth Surface Temperature Datasets


1.0.1 Data by City - US Only (includes longitude & latitude)
In [2]: us_cities_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-da
state_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/G
country_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data
GT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/GlobalTe

In [3]: us_cities_GLT_df.head()

Out[3]: dt AverageTemperature AverageTemperatureUncertainty City \


0 1918-01-01 1.283 0.325 Abilene
1 1918-02-01 9.244 0.319 Abilene
2 1918-03-01 14.636 0.416 Abilene
3 1918-04-01 16.228 0.443 Abilene
4 1918-05-01 23.049 0.486 Abilene

Country Latitude Longitude


0 United States 32.95N 100.53W
1 United States 32.95N 100.53W
2 United States 32.95N 100.53W
3 United States 32.95N 100.53W
4 United States 32.95N 100.53W

In [4]: # Set date as the index and drop the extra date column
us_cities_GLT_df = us_cities_GLT_df.set_index(pd.DatetimeIndex(us_cities_GLT_df['dt']))
us_cities_GLT_df.drop('dt', axis=1, inplace=True)
us_cities_GLT_df.head()

Out[4]: AverageTemperature AverageTemperatureUncertainty City \


dt
1918-01-01 1.283 0.325 Abilene
1918-02-01 9.244 0.319 Abilene

1918-03-01 14.636 0.416 Abilene
1918-04-01 16.228 0.443 Abilene
1918-05-01 23.049 0.486 Abilene

Country Latitude Longitude


dt
1918-01-01 United States 32.95N 100.53W
1918-02-01 United States 32.95N 100.53W
1918-03-01 United States 32.95N 100.53W
1918-04-01 United States 32.95N 100.53W
1918-05-01 United States 32.95N 100.53W
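An equivalent way to obtain the DatetimeIndex in a single step is to let read_csv parse and set the index directly; a sketch (the file name below is shortened and illustrative, since the full path is truncated in the listing above):

# parse 'dt' while reading and use it as the index in one call
us_cities_GLT_df = pd.read_csv("All Datasets/.../US_cities.csv",   # illustrative, shortened path
                               parse_dates=['dt'], index_col='dt')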

In [5]: us_cities_GLT_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 294522 entries, 1918-01-01 to 2013-06-01
Data columns (total 6 columns):
AverageTemperature 294522 non-null float64
AverageTemperatureUncertainty 294522 non-null float64
City 294522 non-null object
Country 294522 non-null object
Latitude 294522 non-null object
Longitude 294522 non-null object
dtypes: float64(2), object(4)
memory usage: 15.7+ MB

1.0.2 Data By State - All Countries


In [6]: state_GLT_df.head()

Out[6]: dt AverageTemperature AverageTemperatureUncertainty State Country


0 1855-05-01 25.544 1.171 Acre Brazil
1 1855-06-01 24.228 1.103 Acre Brazil
2 1855-07-01 24.371 1.044 Acre Brazil
3 1855-08-01 25.427 1.073 Acre Brazil
4 1855-09-01 25.675 1.014 Acre Brazil

In [7]: # Drop rows with NaN values and set date as index
new_state_GLT_df = state_GLT_df.dropna()

# Set date as the index and drop the extra date column
new_state_GLT_df = new_state_GLT_df.set_index(pd.DatetimeIndex(new_state_GLT_df['dt']))
new_state_GLT_df.drop('dt', axis=1, inplace=True)

# Only keep data from 1918-01-01 through 2013-06-01


state_GLT_100_df = new_state_GLT_df['1918-01-01':'2013-06-01']
state_GLT_100_df.head()

Out[7]: AverageTemperature AverageTemperatureUncertainty State Country
dt
1918-01-01 24.223 0.573 Acre Brazil
1918-02-01 24.663 1.286 Acre Brazil
1918-03-01 24.882 0.712 Acre Brazil
1918-04-01 25.038 0.461 Acre Brazil
1918-05-01 25.270 0.562 Acre Brazil

In [8]: state_GLT_100_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 276186 entries, 1918-01-01 to 2013-06-01
Data columns (total 4 columns):
AverageTemperature 276186 non-null float64
AverageTemperatureUncertainty 276186 non-null float64
State 276186 non-null object
Country 276186 non-null object
dtypes: float64(2), object(2)
memory usage: 10.5+ MB

In [9]: state_GLT_100_df.index

Out[9]: DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',


'1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
'1918-09-01', '1918-10-01',
...
'2012-09-01', '2012-10-01', '2012-11-01', '2012-12-01',
'2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01',
'2013-05-01', '2013-06-01'],
dtype='datetime64[ns]', name='dt', length=276186, freq=None)

In [10]: #Filter out US info


us_states_df = state_GLT_100_df.query("Country == 'United States'")
us_states_df.head()

Out[10]: AverageTemperature AverageTemperatureUncertainty State \


dt
1918-01-01 2.733 0.490 Alabama
1918-02-01 11.486 0.316 Alabama
1918-03-01 16.338 0.356 Alabama
1918-04-01 15.583 0.218 Alabama
1918-05-01 22.087 0.200 Alabama

Country
dt
1918-01-01 United States
1918-02-01 United States
1918-03-01 United States

1918-04-01 United States
1918-05-01 United States

In [11]: # Output final df to CSV


state_GLT_100_df.to_csv('GLTbyState100.csv', encoding='utf-8')

1.0.3 Data By Country


In [12]: country_GLT_df.head()

Out[12]: dt AverageTemperature AverageTemperatureUncertainty Country


0 1743-11-01 4.384 2.294 Åland
1 1743-12-01 NaN NaN Åland
2 1744-01-01 NaN NaN Åland
3 1744-02-01 NaN NaN Åland
4 1744-03-01 NaN NaN Åland

In [13]: # Drop rows with NaN values and set date as index
new_country_GLT_df = country_GLT_df.dropna()

# Set date as the index and drop the extra date column
new_country_GLT_df = new_country_GLT_df.set_index(pd.DatetimeIndex(new_country_GLT_df['
new_country_GLT_df.drop('dt', axis=1, inplace=True)

# Only keep data from 1918-01-01 through 2013-06-01


country_GLT_100_df = new_country_GLT_df['1918-01-01':'2013-06-01']
country_GLT_100_df.head()

Out[13]: AverageTemperature AverageTemperatureUncertainty Country


dt
1918-01-01 -5.434 0.558 Åland
1918-02-01 -2.636 0.449 Åland
1918-03-01 -1.050 0.612 Åland
1918-04-01 2.615 0.418 Åland
1918-05-01 7.163 0.343 Åland

In [14]: country_GLT_100_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 276596 entries, 1918-01-01 to 2013-06-01
Data columns (total 3 columns):
AverageTemperature 276596 non-null float64
AverageTemperatureUncertainty 276596 non-null float64
Country 276596 non-null object
dtypes: float64(2), object(1)
memory usage: 8.4+ MB

In [15]: country_GLT_100_df.index

Out[15]: DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',
'1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
'1918-09-01', '1918-10-01',
...
'2012-09-01', '2012-10-01', '2012-11-01', '2012-12-01',
'2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01',
'2013-05-01', '2013-06-01'],
dtype='datetime64[ns]', name='dt', length=276596, freq=None)

In [16]: # Output final df to CSV


country_GLT_100_df.to_csv('GLTbyCountry100.csv', encoding='utf-8')

1.0.4 Global Temperature Data


In [17]: GT_df.head()

Out[17]: dt LandAverageTemperature LandAverageTemperatureUncertainty \


0 1750-01-01 3.034 3.574
1 1750-02-01 3.083 3.702
2 1750-03-01 5.626 3.076
3 1750-04-01 8.490 2.451
4 1750-05-01 11.573 2.072

LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature \


0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

LandMinTemperatureUncertainty LandAndOceanAverageTemperature \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

LandAndOceanAverageTemperatureUncertainty
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN

In [18]: # Drop rows with NaN values and set date as index
new_GT_df = GT_df.dropna()

# Set date as the index


new_GT_df = new_GT_df.set_index(pd.DatetimeIndex(new_GT_df['dt']))

new_GT_df.drop('dt', axis=1, inplace=True)

# Only keep data from 1918-01-01 through 2013-06-01


GT_100_df = new_GT_df['1918-01-01':'2013-06-01']
GT_100_df.head()

Out[18]: LandAverageTemperature LandAverageTemperatureUncertainty \


dt
1918-01-01 1.934 0.251
1918-02-01 2.455 0.342
1918-03-01 4.811 0.257
1918-04-01 7.644 0.258
1918-05-01 10.540 0.304

LandMaxTemperature LandMaxTemperatureUncertainty \
dt
1918-01-01 7.552 0.261
1918-02-01 8.256 0.314
1918-03-01 10.704 0.197
1918-04-01 13.706 0.255
1918-05-01 16.480 0.332

LandMinTemperature LandMinTemperatureUncertainty \
dt
1918-01-01 -3.802 0.371
1918-02-01 -3.568 0.344
1918-03-01 -1.267 0.310
1918-04-01 1.426 0.380
1918-05-01 4.386 0.359

LandAndOceanAverageTemperature \
dt
1918-01-01 13.129
1918-02-01 13.312
1918-03-01 14.034
1918-04-01 14.794
1918-05-01 15.733

LandAndOceanAverageTemperatureUncertainty
dt
1918-01-01 0.141
1918-02-01 0.156
1918-03-01 0.147
1918-04-01 0.148
1918-05-01 0.156

In [19]: GT_100_df.info()

<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 1146 entries, 1918-01-01 to 2013-06-01
Data columns (total 8 columns):
LandAverageTemperature 1146 non-null float64
LandAverageTemperatureUncertainty 1146 non-null float64
LandMaxTemperature 1146 non-null float64
LandMaxTemperatureUncertainty 1146 non-null float64
LandMinTemperature 1146 non-null float64
LandMinTemperatureUncertainty 1146 non-null float64
LandAndOceanAverageTemperature 1146 non-null float64
LandAndOceanAverageTemperatureUncertainty 1146 non-null float64
dtypes: float64(8)
memory usage: 80.6 KB

In [20]: GT_100_df.index

Out[20]: DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',


'1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
'1918-09-01', '1918-10-01',
...
'2012-09-01', '2012-10-01', '2012-11-01', '2012-12-01',
'2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01',
'2013-05-01', '2013-06-01'],
dtype='datetime64[ns]', name='dt', length=1146, freq=None)

In [21]: # Output final df to CSV


GT_100_df.to_csv('GLT100.csv', encoding='utf-8')

Data Preview & Prep - Part II

January 15, 2020

In [1]: import pandas as pd


import numpy as np
import datetime

1 Hurricanes and Typhoons Datasets


In [2]: atlantic_df = pd.read_csv("All Datasets/hurricanes-and-typhoons-1851-2014/atlantic.csv",
pacific_df = pd.read_csv("All Datasets/hurricanes-and-typhoons-1851-2014/pacific.csv", l

In [3]: atlantic_df.head()

Out[3]: ID Name Date Time Event Status Latitude \


0 AL011851 UNNAMED 18510625 0 HU 28.0N
1 AL011851 UNNAMED 18510625 600 HU 28.0N
2 AL011851 UNNAMED 18510625 1200 HU 28.0N
3 AL011851 UNNAMED 18510625 1800 HU 28.1N
4 AL011851 UNNAMED 18510625 2100 L HU 28.2N

Longitude Maximum Wind Minimum Pressure ... Low Wind SW \


0 94.8W 80 -999 ... -999
1 95.4W 80 -999 ... -999
2 96.0W 80 -999 ... -999
3 96.5W 80 -999 ... -999
4 96.8W 80 -999 ... -999

Low Wind NW Moderate Wind NE Moderate Wind SE Moderate Wind SW \


0 -999 -999 -999 -999
1 -999 -999 -999 -999
2 -999 -999 -999 -999
3 -999 -999 -999 -999
4 -999 -999 -999 -999

Moderate Wind NW High Wind NE High Wind SE High Wind SW High Wind NW
0 -999 -999 -999 -999 -999
1 -999 -999 -999 -999 -999
2 -999 -999 -999 -999 -999
3 -999 -999 -999 -999 -999

4 -999 -999 -999 -999 -999

[5 rows x 22 columns]

In [4]: pacific_df.tail()

Out[4]: ID Name Date Time Event Status Latitude \


26132 EP222015 SANDRA 20151128 1200 LO 21.7N
26133 EP222015 SANDRA 20151128 1800 LO 22.4N
26134 EP222015 SANDRA 20151129 0 LO 23.1N
26135 EP222015 SANDRA 20151129 600 LO 23.5N
26136 EP222015 SANDRA 20151129 1200 LO 24.2N

Longitude Maximum Wind Minimum Pressure ... Low Wind SW \


26132 109.0W 35 1002 ... 0
26133 108.7W 30 1007 ... 0
26134 108.3W 30 1008 ... 0
26135 107.9W 25 1009 ... 0
26136 107.7W 20 1010 ... 0

Low Wind NW Moderate Wind NE Moderate Wind SE Moderate Wind SW \


26132 0 0 0 0
26133 0 0 0 0
26134 0 0 0 0
26135 0 0 0 0
26136 0 0 0 0

Moderate Wind NW High Wind NE High Wind SE High Wind SW \


26132 0 0 0 0
26133 0 0 0 0
26134 0 0 0 0
26135 0 0 0 0
26136 0 0 0 0

High Wind NW
26132 0
26133 0
26134 0
26135 0
26136 0

[5 rows x 22 columns]
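The -999 values in the wind-radii and pressure columns appear to be missing-data sentinels. If that reading is correct, they can be masked before any numeric analysis; a sketch under that assumption (using the numpy import from the top of this notebook, and not a step taken here):

# treat -999 as missing in both basins' dataframes
atlantic_df = atlantic_df.replace(-999, np.nan)
pacific_df = pacific_df.replace(-999, np.nan)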

2 Tsunami Datasets
In [5]: sources_df = pd.read_csv("All Datasets/seismic-waves/sources.csv", low_memory=False)
waves_df = pd.read_csv("All Datasets/seismic-waves/waves.csv", low_memory=False)

In [6]: sources_df.head()

Out[6]: SOURCE_ID YEAR MONTH DAY HOUR MINUTE CAUSE VALIDITY FOCAL_DEPTH \
0 1 -2000 NaN NaN NaN NaN 1.0 1.0 NaN
1 3 -1610 NaN NaN NaN NaN 6.0 4.0 NaN
2 4 -1365 NaN NaN NaN NaN 1.0 1.0 NaN
3 5 -1300 NaN NaN NaN NaN 0.0 2.0 NaN
4 6 -760 NaN NaN NaN NaN 0.0 2.0 NaN

PRIMARY_MAGNITUDE ... ALL_INJURIES INJURY_TOTAL \


0 NaN ... NaN NaN
1 NaN ... NaN NaN
2 NaN ... NaN NaN
3 6.0 ... NaN NaN
4 NaN ... NaN NaN

ALL_FATALITIES FATALITY_TOTAL ALL_DAMAGE_MILLIONS DAMAGE_TOTAL \


0 NaN 3.0 NaN 4.0
1 NaN 3.0 NaN 3.0
2 NaN NaN NaN 3.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

ALL_HOUSES_DAMAGED HOUSE_DAMAGE_TOTAL ALL_HOUSES_DESTROYED \


0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

HOUSE_DESTRUCTION_TOTAL
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN

[5 rows x 45 columns]

In [7]: waves_df.tail()

Out[7]: SOURCE_ID WAVE_ID YEAR MONTH DAY REGION_CODE COUNTRY \


26198 5636 32278 2016 12.0 17.0 82.0 PAPUA NEW GUINEA
26199 5636 32277 2016 12.0 17.0 82.0 SOLOMON ISLANDS
26200 5637 32398 2016 12.0 25.0 89.0 CHILE
26201 5639 32404 2017 1.0 3.0 81.0 FIJI
26202 5642 32418 2017 1.0 22.0 82.0 SOLOMON ISLANDS

STATE/PROVINCE LOCATION LATITUDE \
26198 NaN D55023 BPR, ETD CORAL SEA -14.8030
26199 NaN TAREKUKURE, TARO ISLAND -6.6928
26200 AYSEN PUERTO MELINKA -43.8983
26201 NaN SUVA, KING'S WHARF -18.1330
26202 NaN TAREKUKURE, TARO ISLAND -6.6928

... INJURIES INJURY_ESTIMATE FATALITIES \


26198 ... NaN NaN NaN
26199 ... NaN NaN NaN
26200 ... NaN NaN NaN
26201 ... NaN NaN NaN
26202 ... NaN NaN NaN

FATALITY_ESTIMATE DAMAGE_MILLIONS_DOLLARS DAMAGE_ESTIMATE \


26198 NaN NaN NaN
26199 NaN NaN NaN
26200 NaN NaN NaN
26201 NaN NaN NaN
26202 NaN NaN NaN

HOUSES_DAMAGED HOUSE_DAMAGE_ESTIMATE HOUSES_DESTROYED \


26198 NaN NaN NaN
26199 NaN NaN NaN
26200 NaN NaN NaN
26201 NaN NaN NaN
26202 NaN NaN NaN

HOUSE_DESTRUCTION_ESTIMATE
26198 NaN
26199 NaN
26200 NaN
26201 NaN
26202 NaN

[5 rows x 30 columns]

3 Tornadoes Dataset
In [8]: tornadoes_df = pd.read_csv("All Datasets/1950-2017_all_tornadoes.csv", low_memory=False)
tornadoes_df.head()

Out[8]: om yr mo dy date time tz st stf stn ... len wid ns sn \


0 1 1950 1 3 1/3/50 11:00:00 3 MO 29 1 ... 9.5 150 2 0
1 1 1950 1 3 1/3/50 11:00:00 3 MO 29 1 ... 6.2 150 2 1
2 1 1950 1 3 1/3/50 11:10:00 3 IL 17 1 ... 3.3 100 2 1
3 2 1950 1 3 1/3/50 11:55:00 3 IL 17 2 ... 3.6 130 1 1

4 3 1950 1 3 1/3/50 16:00:00 3 OH 39 1 ... 0.1 10 1 1

sg f1 f2 f3 f4 fc
0 1 0 0 0 0 0
1 2 189 0 0 0 0
2 2 119 0 0 0 0
3 1 135 0 0 0 0
4 1 161 0 0 0 0

[5 rows x 29 columns]

In [9]: tornadoes_df.tail()

Out[9]: om yr mo dy date time tz st stf stn ... \


63676 2017-01425 2017 12 20 12/20/17 1:59:00 3 LA 22 0 ...
63677 2017-01426 2017 12 20 12/20/17 2:17:00 3 LA 22 0 ...
63678 2017-01427 2017 12 20 12/20/17 6:39:00 3 MS 28 0 ...
63679 2017-01428 2017 12 20 12/20/17 10:40:00 3 AL 1 0 ...
63680 2017-01429 2017 12 20 12/20/17 12:15:00 3 GA 13 0 ...

len wid ns sn sg f1 f2 f3 f4 fc
63676 3.05 600 1 1 1 19 0 0 0 0
63677 4.70 600 1 1 1 19 0 0 0 0
63678 0.17 50 1 1 1 147 0 0 0 0
63679 3.17 16 1 1 1 111 0 0 0 0
63680 3.17 125 1 1 1 199 0 0 0 0

[5 rows x 29 columns]
