
Project area # Natural Language Processing

L4-L5: Hands-on implementation of Text classification

January 18, 2020

Objective: To classify the type of review from Amazon users using a Machine Learning
technique.

About the Data-set: We use the Amazon Review data set, which has 10,000 rows of
text data. The data is divided into two classes: “Label 1” (negative review) and “Label 2”
(positive review). The data set has two columns, “Text” and “Label”.

Code:

In [50]: import pandas as pd


import numpy as np
import nltk
# nltk.download('punkt')
# nltk.download('wordnet')
# nltk.download('averaged_perceptron_tagger')
# nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score

In [51]: #Set Random seed


np.random.seed(500)

In [52]: # Add the Data using pandas


Corpus = pd.read_csv(r"corpus.csv",encoding='latin-1')
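As a quick sanity check before pre-processing (a minimal sketch; the column names 'text' and 'label' are assumed from the cells below), the shape and class balance of the loaded corpus can be inspected:

print(Corpus.shape)                    # expected: (10000, 2)
print(Corpus.columns.tolist())         # ['text', 'label'] (assumed names)
print(Corpus['label'].value_counts())  # number of rows per review class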

In [53]: # Step - 1: Data Pre-processing - This will help in getting better results through the ML algorithms used later

# Step - 1a : Remove blank rows if any.


Corpus['text'].dropna(inplace=True)

In [54]: # Step - 1b : Change all the text to lower case. This is required as python interprets 'dog' and 'DOG' differently
Corpus['text'] = [entry.lower() for entry in Corpus['text']]

In [55]: # Step - 1c : Tokenization : In this each entry in the corpus will be broken into a set of words
Corpus['text']= [word_tokenize(entry) for entry in Corpus['text']]

In [56]: # Step - 1d : Remove Stop words and non-alphabetic tokens, and perform Word Stemming/Lemmatization.

# WordNetLemmatizer requires Pos tags to understand if the word is a noun, verb, adjective etc.
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV

for index,entry in enumerate(Corpus['text']):


# Declaring Empty List to store the words that follow the rules for this step
Final_words = []
# Initializing WordNetLemmatizer()
word_Lemmatized = WordNetLemmatizer()
# pos_tag function below will provide the 'tag' i.e if the word is Noun(N), Verb(V) etc.
for word, tag in pos_tag(entry):
# Below condition is to check for Stop words and consider only alphabets
if word not in stopwords.words('english') and word.isalpha():
word_Final = word_Lemmatized.lemmatize(word,tag_map[tag[0]])
Final_words.append(word_Final)
# The final processed set of words for each iteration will be stored in 'text_final'
Corpus.loc[index,'text_final'] = str(Final_words)

In [57]: print(Corpus['text_final'].head())

0 ['stun', 'even', 'sound', 'track', 'beautiful'...


1 ['best', 'soundtrack', 'ever', 'anything', 're...
2 ['amaze', 'soundtrack', 'favorite', 'music', '...
3 ['excellent', 'soundtrack', 'truly', 'like', '...
4 ['remember', 'pull', 'jaw', 'floor', 'hear', '...
Name: text_final, dtype: object

In [67]: # Step - 2: Split the corpus into Train and Test data sets
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'], Corpus['label'], test_size=0.3)  # label column name and 30% test size assumed from the 7000/3000 shapes below
print(Train_X.shape)
print(Train_Y.shape)
print(Test_X.shape)

(7000,)
(7000,)
(3000,)

In [68]: # Step - 3: Label encode the target variable - This is done to transform Categorical data of string type into numerical values the model can understand
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.fit_transform(Test_Y)
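For intuition, a minimal sketch (toy labels, not the actual dataset) of what LabelEncoder does; note that calling fit_transform on Test_Y as above only works safely because both classes occur in both splits, so the learned mapping is identical:

from sklearn.preprocessing import LabelEncoder
toy_encoder = LabelEncoder()
# string classes are mapped to integers in sorted order: 'negative' -> 0, 'positive' -> 1
print(toy_encoder.fit_transform(['positive', 'negative', 'positive']))   # [1 0 1]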

In [69]: # Step - 4: Vectorize the words by using TF-IDF Vectorizer - This is done to find how important a word in a document is in comparison to the whole corpus
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])

Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
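To see what the vectorizer produces, here is a minimal sketch on a toy corpus (the documents and names are illustrative only):

from sklearn.feature_extraction.text import TfidfVectorizer
toy_docs = ["good sound track", "bad sound quality", "good quality"]
toy_vect = TfidfVectorizer()
toy_tfidf = toy_vect.fit_transform(toy_docs)
print(sorted(toy_vect.vocabulary_))    # ['bad', 'good', 'quality', 'sound', 'track']
print(toy_tfidf.shape)                 # (3 documents, 5 unique terms)
print(toy_tfidf.toarray().round(2))    # terms shared by many documents receive lower weights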

In [70]: # Step - 5: Now we can run different algorithms to classify our data and check for accuracy

# Classifier - Algorithm - Naive Bayes


# fit the training dataset on the classifier
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf,Train_Y)

Out[70]: MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [71]: # predict the labels on validation dataset


predictions_NB = Naive.predict(Test_X_Tfidf)

# Use accuracy_score function to get the accuracy


print("Naive Bayes Accuracy Score -> ",accuracy_score(predictions_NB, Test_Y)*100)

Naive Bayes Accuracy Score -> 83.53333333333333

In [72]: # Classifier - Algorithm - SVM


# fit the training dataset on the classifier
SVM = svm.SVC(C=1.0, kernel='linear', degree=3, gamma='auto')
SVM.fit(Train_X_Tfidf,Train_Y)

Out[72]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,


decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

In [73]: # predict the labels on validation dataset


predictions_SVM = SVM.predict(Test_X_Tfidf)

# Use accuracy_score function to get the accuracy


print("SVM Accuracy Score -> ",accuracy_score(predictions_SVM, Test_Y)*100)

SVM Accuracy Score -> 84.86666666666667

In [ ]:

Project area # 5G_6G wireless networks
L4-L5: Hands-on implementation of Primary User detection
in fifth generation Cognitive Radio Networks

January 18, 2020

Objective: To classify the presence or absence of a primary user in CR Networks.

About the Data-set: The data is obtained using a USRP from an empirical testbed setup. The
USRP is tuned to the UHF band. The data is a single-column matrix, where the column corresponds
to the tuned frequency and the rows correspond to time instants.

Code:

1 Importing and Initializing data


The list of packages that will be used in the study:

1. numpy - For vectorized implementation


2. math - Inbuilt support for mathematical operations
3. time - measuring runtime
4. keras - for LSTM model, training and testing

In [1]: import numpy as np


import math
from scipy.stats import norm
import time
import pandas as pd
import statsmodels.api as sm
from keras.layers import Dense, Dropout, LSTM, Embedding
from keras.models import Sequential
from matplotlib import pyplot as plt
#import plotly.offline as py
#import plotly.graph_objs as go
#py.init_notebook_mode(connected=True)
%matplotlib inline

Using TensorFlow backend.

2 Retrieving data
Retrieving data from the files for all four wireless technologies.

In [2]: # Importing data from file (536.5 MHz)


UHF = np.fromfile('536_5.dat', dtype=np.float32)

# Reshaping to convert to a proper NUMPY vector


UHF = np.reshape(UHF, (UHF.shape[0], 1))

# Shape of UHF signal vector


print("Size of UHF: " + str(UHF.shape))

Size of UHF: (20000000, 1)

3 Creating NUMPY equivalent for bandpower function


A NumPy equivalent of MATLAB's bandpower function is created.

In [3]: def bandpower(signal):


return np.mean(signal ** 2)

In [4]: bandpower(UHF)

Out[4]: 0.0016754537

4 Creating NUMPY equivalent for AWGN (Additive White Gaussian Noise) function
First of all, the SNR (Signal-to-noise ratio) is converted from the decibel scale to the linear scale using the given formula:

$SNR_{linear} = 10^{SNR_{dB}/10}$

The power of the signal is then adjusted relative to the noise variance using the formula given below:

$SNR_{linear} = \frac{Power(signal)}{Var(noise)} \implies Power(signal) = Var(noise) \times SNR_{linear}$
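A quick standalone numerical check of this conversion (illustrative only, not part of the captured data):

import math
snr_db = 4.0
snr_linear = math.pow(10, snr_db / 10.0)   # 10^(4/10) is approximately 2.51
print(snr_linear)
print(10.0 * math.log10(snr_linear))       # converting back recovers 4.0 dB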

In [5]: def awgn(signal, desired_snr):

# Converting the SNR from dB scale to linear scale


snr_linear = math.pow(10, desired_snr / 10)

# Standard normally distributed noise


noise = np.random.randn(signal.shape[0], 1)

# Using the boxed formula
var_signal = bandpower(noise) * snr_linear

# Normalizing the signal to have the given variance


normalized_signal = math.sqrt(var_signal) * (signal / math.sqrt(bandpower(signal)))

print("SNR = " + str(10 * math.log10(bandpower(normalized_signal) / bandpower(noise)))

return normalized_signal + noise

5 Filtering the data


The datasets are filtered to remove any transient peaks. Values between $10^{-7}$ and 1 are retained;
the rest are discarded.

In [6]: # Datasets are filtered to contain values between 10 ^ -7 and 1

UHF = UHF[np.logical_and(UHF > math.pow(10, -7), UHF < 1)]


UHF = UHF.reshape(UHF.shape[0], 1)

# Shape of UHF signal vector


print("Size of UHF: " + str(UHF.shape))

print(awgn(UHF[0:100000], 4).shape)

Size of UHF: (9992850, 1)


SNR = 4.0000002944270365
(100000, 1)

6 Making the dataset ready


The following will create a dataset for the signal with a given SNR, number of samples, and size
of the samples in the sensing event. The dataset is constructed from energy values: the energy E
of a sensing event of N samples is given by

$E = \sum_{n=1}^{N} y[n]^2$
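As a quick standalone illustration of this statistic (toy data only, not the captured UHF trace), the energy of one sensing window is just the sum of its squared samples:

import numpy as np
window = np.random.randn(100, 1)   # one sensing event of N = 100 samples
E = np.sum(window ** 2)            # E = sum over n of y[n]^2
print(E)                           # for unit-variance noise, E is close to N = 100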

In [7]: def create_dataset(signal, desired_snr, samples, sample_size):

# Creating the signal with desired SNR


snr_signal = awgn(signal[0:samples * sample_size], desired_snr)

# Allocating zeros to the dataset


X = np.zeros((samples, 1))

for i in range(0, samples):

# Extracting the sample based on sample size


sampled_signal = snr_signal[i * sample_size : (i + 1) * sample_size]

# Sorting the sampled signal


sampled_signal = np.sort(sampled_signal, axis=0)

# Energy detection
E = np.sum(sampled_signal ** 2)

# Assigning values to the dataset


X[i][0] = E

return X

In [9]: a = time.time()
print(create_dataset(UHF[50000:], 4, 15000, 100).shape)
b = time.time()

# Printing the time taken for execution


print(b - a)

SNR = 4.0000010176512495
(15000, 1)
0.5682578086853027

Making the dataset for all the SNRs in the range -20 dB to 4 dB with a step size of 2 dB. The following
function takes a range of SNRs as input and outputs the dataset. The other inputs are the
signal, the number of samples per SNR, and the sample size.

In [10]: def final_dataset(signal, snr_range, samples_per_snr, sample_size):


X = {}

for snr in snr_range:


# Creating dataset for the given SNR
X_snr = create_dataset(signal, snr, samples_per_snr, sample_size)

# Indexing within the final dataset matrix X


X[snr] = X_snr

return X

In [11]: a = time.time()

# UHF
X_UHF = {**final_dataset(UHF[100000:], range(-20, -4, 2), 5000, 100), **final_dataset(UHF[100000:], range(-4, 6, 2), 12000, 100)}  # second call's arguments assumed from the shapes printed below

X_test_UHF = final_dataset(UHF[300000:], range(-20, 6, 2), 5129, 100)
b = time.time()

# Printing the time taken for execution


print("Time taken :- " + str(b - a))
#print(X_UHF.shape)

SNR = -20.000000947867104
SNR = -17.999999975177595
SNR = -16.000000564536986
SNR = -14.00000024874285
SNR = -12.000000647780782
SNR = -9.999999378102368
SNR = -7.999999424928018
SNR = -6.0000007320692585
SNR = -3.999998851007047
SNR = -1.9999968889338702
SNR = 8.946170328922171e-07
SNR = 2.000001013654348
SNR = 4.000000474544495
SNR = -19.999999636204638
SNR = -17.999999682125054
SNR = -16.000000286119587
SNR = -13.999999966226724
SNR = -12.000001115697067
SNR = -9.999999844380007
SNR = -8.000000337259307
SNR = -5.999999833073398
SNR = -3.999999084560373
SNR = -1.999999456449275
SNR = 4.338867090785353e-07
SNR = 2.0000011775885964
SNR = 4.000000351050954
Time taken :- 4.981256484985352

7 Generating White noise sequence


White noise of variance 1 is generated and is labelled as 0.

In [12]: def create_noise_sequence(samples, sample_size):

# Creating white noise sequence of variance 1


noise = np.random.randn(samples * sample_size, 1)

# Allocating zeros to the dataset


X = np.zeros((samples, 1))

for i in range(0, samples):

# Extracting the sample based on sample size


sampled_signal = noise[i * sample_size : (i + 1) * sample_size]

# Sorting the sampled signal


sampled_signal = np.sort(sampled_signal, axis=0)

# Energy detection
E = np.sum(sampled_signal ** 2)

# Assigning values to the dataset


X[i][0] = E

return X

In [13]: a = time.time()
X_noise = create_noise_sequence(100000, 100)
b = time.time()

print("Time taken = " + str(b - a))

print(X_noise.shape)

Time taken = 2.516165256500244


(100000, 1)

8 DataSet with Lookback for ANN


We use the look-back concept to reduce the effect of sudden, abrupt changes in the signal.

In [17]: # Function for changing the dataset for look back


def create_look_back(X, look_back=1):

# Look back dataset is initialized to be empty


look_back_X = []

for i in range(len(X) - look_back + 1):


# Extracting an example from the dataset
a = X[i:(i + look_back), :]

a = a.flatten() # (For flattening)

# Appending to the dataset


look_back_X.append(a)

look_back_Y = []

# Returning in numpy's array format
return np.array(look_back_X)
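A quick usage sketch of create_look_back on a toy sequence (illustrative values only); with look_back=2, each row pairs an energy value with its predecessor:

import numpy as np
toy = np.arange(5.0).reshape(5, 1)          # energies 0, 1, 2, 3, 4
print(create_look_back(toy, look_back=2))
# [[0. 1.]
#  [1. 2.]
#  [2. 3.]
#  [3. 4.]]   -> 5 - 2 + 1 = 4 rows of 2 columns each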

The following function will insert look backs into our dataset for all the SNRs.

In [18]: def dataset_look_back(X_tech, snr_range, look_back):


X_tech_lb = {}

# Look backs for all SNRs


for snr in snr_range:
X_tech_lb[snr] = create_look_back(X_tech[snr], look_back)

return X_tech_lb

In [19]: look_back = 2

X_UHF_lb = dataset_look_back(X_UHF, range(-20, 6, 2), look_back)


print(X_UHF_lb[-20].shape)

X_noise_lb = create_look_back(X_noise, look_back)


print(X_noise_lb.shape)

X = X_UHF_lb[-20]
y = []

for snr in range(-18, 6, 2):


X = np.concatenate((X, X_UHF_lb[snr]), axis=0)

y = np.ones((X.shape[0], 1))

print(X.shape)
print(X_noise_lb.shape)
X = np.concatenate((X, X_noise_lb), axis=0)

y = np.concatenate((y, np.zeros((X_noise_lb.shape[0], 1))))


print(X.shape)
print(X)
print(y.shape)
print(y)

(4999, 2)
(99999, 2)
(99987, 2)
(99999, 2)
(199986, 2)
[[ 60.67028369 112.33572595]
[112.33572595 124.46628501]
[124.46628501 91.88249011]

...
[ 79.39234195 109.61704045]
[109.61704045 110.6177588 ]
[110.6177588 84.34582643]]
(199986, 1)
[[1.]
[1.]
[1.]
...
[0.]
[0.]
[0.]]

9 Creating the ANN model


In [20]: seed = 9
np.random.seed(seed)

#ANN Model
# create model
model = Sequential() # This means it's a sequential model, i.e. layers are stacked from one direction to the other
model.add(Dense(7, input_dim=2, kernel_initializer='uniform', activation='relu'))
#model.add(Dense(10, init='uniform', activation='relu')) #You can add as many hidden layers as needed
#model.add(Dense(5,init='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid')) #Output layer

# Compile model
#Explore this function in case you want to do the mathematical analysis
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

10 Training the ANN model


In [21]: # Fit the model
#Training and batch size
model.fit(X, y, epochs=20, batch_size=150, verbose=2)

#Evaluate the model


scores = model.evaluate(X, y)
print("\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))

Epoch 1/40
- 3s - loss: 0.5897 - acc: 0.6314
Epoch 2/40
- 2s - loss: 0.4589 - acc: 0.8142
Epoch 3/40
- 2s - loss: 0.4123 - acc: 0.8157

Epoch 4/40
- 2s - loss: 0.4010 - acc: 0.8161
Epoch 5/40
- 2s - loss: 0.3973 - acc: 0.8163
Epoch 6/40
- 2s - loss: 0.3964 - acc: 0.8166
Epoch 7/40
- 2s - loss: 0.3955 - acc: 0.8168
Epoch 8/40
- 2s - loss: 0.3949 - acc: 0.8174
Epoch 9/40
- 2s - loss: 0.3949 - acc: 0.8167
Epoch 10/40
- 2s - loss: 0.3940 - acc: 0.8177
Epoch 11/40
- 2s - loss: 0.3943 - acc: 0.8176
Epoch 12/40
- 2s - loss: 0.3944 - acc: 0.8176
Epoch 13/40
- 2s - loss: 0.3934 - acc: 0.8181
Epoch 14/40
- 2s - loss: 0.3936 - acc: 0.8176
Epoch 15/40
- 2s - loss: 0.3934 - acc: 0.8183
Epoch 16/40
- 2s - loss: 0.3929 - acc: 0.8178
Epoch 17/40
- 2s - loss: 0.3944 - acc: 0.8169
Epoch 18/40
- 2s - loss: 0.3946 - acc: 0.8176
Epoch 19/40
- 2s - loss: 0.3933 - acc: 0.8180
Epoch 20/40
- 2s - loss: 0.3934 - acc: 0.8179
Epoch 21/40
- 2s - loss: 0.3933 - acc: 0.8175
Epoch 22/40
- 2s - loss: 0.3928 - acc: 0.8182
Epoch 23/40
- 2s - loss: 0.3932 - acc: 0.8177
Epoch 24/40
- 2s - loss: 0.3935 - acc: 0.8175
Epoch 25/40
- 2s - loss: 0.3937 - acc: 0.8175
Epoch 26/40
- 2s - loss: 0.3932 - acc: 0.8180
Epoch 27/40
- 2s - loss: 0.3930 - acc: 0.8177

Epoch 28/40
- 2s - loss: 0.3928 - acc: 0.8178
Epoch 29/40
- 2s - loss: 0.3927 - acc: 0.8182
Epoch 30/40
- 2s - loss: 0.3934 - acc: 0.8172
Epoch 31/40
- 2s - loss: 0.3933 - acc: 0.8175
Epoch 32/40
- 2s - loss: 0.3930 - acc: 0.8178
Epoch 33/40
- 2s - loss: 0.3929 - acc: 0.8184
Epoch 34/40
- 2s - loss: 0.3932 - acc: 0.8180
Epoch 35/40
- 2s - loss: 0.3937 - acc: 0.8171
Epoch 36/40
- 2s - loss: 0.3937 - acc: 0.8177
Epoch 37/40
- 2s - loss: 0.3932 - acc: 0.8177
Epoch 38/40
- 2s - loss: 0.3933 - acc: 0.8179
Epoch 39/40
- 2s - loss: 0.3931 - acc: 0.8177
Epoch 40/40
- 2s - loss: 0.3929 - acc: 0.8175
199986/199986 [==============================] - 3s 16us/step

acc: 81.96%

In [22]: pd_UHF = {}

for snr in range(-20, 6, 2):


y_snr = np.ones((X_UHF_lb[snr].shape[0], 1))
scores = model.evaluate(X_UHF_lb[snr], y_snr)
print("At SNR = " + str(snr) + "\n%s: %.2f%%" % (model.metrics_names[1], scores[1]*10
pd_UHF[snr] = scores[1]

plt.plot(range(-20, 6, 2), list(pd_UHF.values()))

4999/4999 [==============================] - 0s 29us/step


At SNR = -20
acc: 7.60%
4999/4999 [==============================] - 0s 25us/step
At SNR = -18
acc: 8.52%
4999/4999 [==============================] - 0s 19us/step

At SNR = -16
acc: 10.66%
4999/4999 [==============================] - 0s 21us/step
...
At SNR = -8
acc: 47.27%
4999/4999 [==============================] - 0s 20us/step
At SNR = -6
acc: 76.38%
11999/11999 [==============================] - 0s 18us/step
At SNR = -4
acc: 96.49%
11999/11999 [==============================] - 0s 18us/step
At SNR = -2
acc: 99.91%
11999/11999 [==============================] - 0s 17us/step
At SNR = 0
acc: 100.00%
11999/11999 [==============================] - 0s 19us/step
At SNR = 2
acc: 100.00%
11999/11999 [==============================] - 0s 18us/step
At SNR = 4
acc: 100.00%

Out[22]: [<matplotlib.lines.Line2D at 0x7f16674aa550>]

Project area # Biology / Bioinformatics
L4-L5: Hands-on implementation to classify the Cancer
patients with AML or ALL

January 18, 2020

Objective: To classify patients with acute myeloid leukemia (AML) and acute lymphoblastic
leukemia (ALL) using the SVM algorithm.

About the Data-set:

1. Each row represents a different gene.

2. Columns 1 and 2 are descriptions about that gene.

3. Each numbered column is a patient in label data.

4. Each patient has 7129 gene expression values, i.e. each patient has one value for each gene.

5. The training data contain gene expression values for patients 1 through 38.

6. The test data contain gene expression values for patients 39 through 72

Code:

In [100]: import pandas as pd


import numpy as np
from numpy import transpose as T
import matplotlib.pyplot as plt
import math

In [101]: Train_Data = pd.read_csv("data_set_ALL_AML_train.csv")


Test_Data = pd.read_csv("data_set_ALL_AML_independent.csv")
labels = pd.read_csv("actual.csv", index_col = 'patient')
Train_Data.head()

Out[101]: Gene Description Gene Accession Number 1 call 2 \


0 AFFX-BioB-5_at (endogenous control) AFFX-BioB-5_at -214 A -139
1 AFFX-BioB-M_at (endogenous control) AFFX-BioB-M_at -153 A -73
2 AFFX-BioB-3_at (endogenous control) AFFX-BioB-3_at -58 A -1
3 AFFX-BioC-5_at (endogenous control) AFFX-BioC-5_at 88 A 283
4 AFFX-BioC-3_at (endogenous control) AFFX-BioC-3_at -295 A -264

call.1 3 call.2 4 call.3 ... 29 call.33 30 call.34 31 \
0 A -76 A -135 A ... 15 A -318 A -32
1 A -49 A -114 A ... -114 A -192 A -49
2 A -307 A 265 A ... 2 A -95 A 49
3 A 309 A 12 A ... 193 A 312 A 230
4 A -376 A -419 A ... -51 A -139 A -367

call.35 32 call.36 33 call.37


0 A -124 A -135 A
1 A -79 A -186 A
2 A -37 A -70 A
3 P 330 A 337 A
4 A -188 A -407 A

[5 rows x 78 columns]

In [102]: print(Train_Data.isna().sum().max())
print(Test_Data.isna().sum().max())

0
0

In [103]: cols = [col for col in Test_Data.columns if 'call' in col]


test = Test_Data.drop(cols, 1)
cols = [col for col in Train_Data.columns if 'call' in col]
train = Train_Data.drop(cols, 1)

In [104]: patients = [str(i) for i in range(1, 73, 1)]


df_all = pd.concat([train, test], axis = 1)[patients]
#import Transpose as T
df_all = df_all.T
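The raw files store one gene per row and one patient per column, so the transpose above flips the frame: each row becomes a patient and each of the 7129 columns a gene, which is the orientation scikit-learn expects. A quick shape check:

print(df_all.shape)   # expected: (72, 7129) -> 72 patients, 7129 gene-expression features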

In [105]: df_all["patient"] = pd.to_numeric(patients)


labels["cancer"]= pd.get_dummies(labels.cancer, drop_first=True)

In [106]: Data = pd.merge(df_all, labels, on="patient")


Data.head()

Out[106]: 0 1 2 3 4 5 6 7 8 9 ... 7121 7122 7123 \


0 -214 -153 -58 88 -295 -558 199 -176 252 206 ... -125 389 -37
1 -139 -73 -1 283 -264 -400 -330 -168 101 74 ... -36 442 -17
2 -76 -49 -307 309 -376 -650 33 -367 206 -215 ... 33 168 52
3 -135 -114 265 12 -419 -585 158 -253 49 31 ... 218 174 -110
4 -106 -125 -76 168 -230 -284 4 -122 70 252 ... 57 504 -26

7124 7125 7126 7127 7128 patient cancer


0 793 329 36 191 -37 1 0

1 782 295 11 76 -14 2 0
2 1138 777 41 228 -41 3 0
3 627 170 -50 126 -91 4 0
4 250 314 14 56 -25 5 0

[5 rows x 7131 columns]


In [107]: X, y = Data.drop(columns=["cancer"]), Data["cancer"]
In [108]: from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)  # random_state value assumed
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
(54, 7130)
(54,)
(18, 7130)

In [109]: from sklearn.preprocessing import StandardScaler


sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
(54, 7130)
(54,)
(18, 7130)

In [110]: from sklearn.decomposition import PCA


pca = PCA()
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
total=sum(pca.explained_variance_)
k=0
current_variance=0
while current_variance/total < 0.90:
current_variance += pca.explained_variance_[k]
k=k+1
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
(54, 54)
(54,)
(18, 54)
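The while loop above counts how many leading principal components are needed to reach 90% of the explained variance (38 here, as used in the next cell); an equivalent, more compact sketch of the same computation:

# first index at which the cumulative explained-variance ratio crosses 0.90
k = np.argmax(np.cumsum(pca.explained_variance_ratio_) >= 0.90) + 1
print(k)   # expected to agree with the loop above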

In [111]: from sklearn.decomposition import PCA
pca = PCA(n_components = 38)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
cum_sum = pca.explained_variance_ratio_.cumsum()
cum_sum = cum_sum*100
plt.bar(range(38), cum_sum)
plt.ylabel("Cumulative Explained Variance")
plt.xlabel("Principal Components")
plt.title("Around 90% of variance is explained by the First 38 columns ")
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)

(54, 38)
(54,)
(18, 38)

In [130]: from sklearn.model_selection import GridSearchCV


from sklearn import model_selection, naive_bayes, svm
from sklearn.svm import SVC
# parameters = [{'C': [1, 10, 100, 1000], 'kernel': ['linear']}, {'C': [1, 10, 100, 1000], 'kernel': ['rbf']}]  # second grid reconstructed
# search = GridSearchCV(SVC(), parameters, n_jobs=-1, verbose=1)
# search.fit(X_train, y_train)

In [131]: #best_parameters = search.best_estimator_

In [132]: model = SVC(C=1.0, kernel='linear', degree=3, gamma='auto')


model.fit(X_train, y_train)

Out[132]: SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,


decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
max_iter=-1, probability=False, random_state=None, shrinking=True,
tol=0.001, verbose=False)

In [133]: y_pred=model.predict(X_test)

In [134]: from sklearn.metrics import accuracy_score, confusion_matrix


from sklearn import metrics
print('Accuracy Score:',round(accuracy_score(y_test, y_pred),2))
#confusion matrix
cm = confusion_matrix(y_test, y_pred)
# Output:
# Accuracy Score: 0.67

Accuracy Score: 0.67

In [135]: from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
class_names = ['ALL', 'AML']
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="viridis" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')

Out[135]: Text(0.5,257.44,'Predicted label')

In [ ]:

Intelligent Transportation System
Project Hands-on

1. Objective:

• Random traffic generation through Simulation of Urban Mobility (SUMO) using OpenStreetMap.
• To model propagation links for Vehicle to Vehicle communication using a Ray tracing simulator.
• To perform Classification for beam selection in Vehicle to Infrastructure communication and compare various classifiers.

2. Hardware Used : NA

3. Software Used :

• For Traffic Generation : Simulation of Urban Mobility (SUMO) (https://www.dlr.de/ts/en/desktopdefault.aspx/tabid-9883/16931_read-41000/)
• For Ray Tracing : Geometry-based Efficient Propagation Model for Vehicular Communication (GEMV²) (http://vehicle2x.net/)
• Platform : MATLAB 2018, Python 2.7
• Editor : Notepad++

4. Expected Outcomes :

Figure 1: Expected Outcome for Exercise 1 : Random Traffic Generation in SUMO

5. Basic Instructions :

(a) Login with Windows 8.


(b) Go to Start and press ”cmd” to open command prompt.
(c) Create an empty folder named "map".
Table 1: Expected Outcome for Exercise : Classification Results (Accuracy, %)

Classifier             All Data    Only NLOS
Linear SVM             31          11
Decision Tree          54          28
Deep Neural Network    65          37

Aim: Random traffic generation through Simulation of Urban Mobility (SUMO) using OpenStreetMap.

Execution steps to generate random traffic

1. Make two folders on your desktop named ”sumo” and ”map”

2. Go to http://sumo.dlr.de/wiki/Networks/Import/OpenStreetMap

3. Scroll down to ”Importing additional polygons”

4. Copy the script into Notepad++

5. Remove ”Power” feature from the script

6. Save as ”typemap.xml”

7. Go to https://www.openstreetmap.org: Choose your city or area in which you want to generate traffic.
8. Click on Export.

9. Save as ”map.osm”

10. Go to the sumo directory: and write: run start-command-line.bat


11. Now leave the ”SUMO” directory and Go to the map directory.

12. Generate the "netconvert" file. Type: netconvert --osm-files map.osm -o map.net.xml

13. Generate the "polyconvert" file. Type: polyconvert --net-file map.net.xml --osm-files map.osm --type-file typemap.xml -o map.poly.xml

14. We are looking for random traffic. Go to your sumo folder, search for "randomTrips.py" and note its parameters (number of vehicles, simulation time, route length, etc.).
Type: python C:/Users/Sagar/Desktop/sumo/sumo/tools/randomTrips.py -n map.net.xml -e 100 -l

15. Type: python C:/Users/Sagar/Desktop/sumo/sumo/tools/randomTrips.py -n map.net.xml -r map.rou.xml -e 100 -l to generate the route file.
16. Go to the SUMO folder and search for ”test.sumo.cfg” and copy that file in ”map”
folder.

17. Edit ”test.sumo.cfg”

18. Go to command prompt and type: sumo-gui map.sumo.cfg

19. Press Enter and SUMO will open. Customize it as per your convenience. Press RUN
20. Note your simulation parameters and save files.

Aim: To model propagation links for Vehicle to Vehicle communication using Ray tracing.

Execution Steps
1. To generate LOS and NLOS links, we need to import the .xml file of our specific route
to the folder: ”inputmobilitySUMO”.
2. Open the folder "inputmobilitySUMO" from SUMO. You can see that it already contains .xml files
of previously simulated data. (You can replace the existing .xml file with your own and modify the
.m file as per your requirement; here we only use the previously simulated data.)

3. Run: ”runSimulation.m”

4. Open folder: ”outputKML”, you can see file named as per current date.

5. Go to :”https://earth.google.com/web/”

6. On the left, click My Places.

7. Click Import KML file.

8. Choose the location of the file you want to upload.

9. Select and open the KML file. A preview of the list will open in Google Earth.

10. To keep these places in your list, click Save and customize as per your requirement.
Aim: To perform DNN Classification.

Execution Steps:

1. Step 1: Initialization
from __future__ import print_function
import keras
#from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Activation
#from keras.layers import Conv1D, MaxPooling1D
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import Adagrad
import numpy as np
from sklearn.preprocessing import minmax_scale
import keras.backend as K
import copy
#one can disable the imports below if not plotting/saving
from keras.utils import plot_model
import matplotlib.pyplot as plt

2. Load training and testing data sets


batch_size = 32
epochs = 50
numUPAAntennaElements = 4*4  #4 x 4 UPA
trainFileName = '../datasets/all_train_classification.npz'  #(22256, 24, 362)
print("Reading dataset...", trainFileName)
train_cache_file = np.load(trainFileName)

testFileName = '../datasets/all_test_classification.npz'  #(22256, 24, 362)
print("Reading dataset...", testFileName)
test_cache_file = np.load(testFileName)

#input features (X_test and X_train) are arrays with matrices. Here we will convert matrices to 1-d array
X_train = train_cache_file['position_matrix_array']  #inputs
train_best_tx_rx_array = train_cache_file['best_ray_array']  #outputs, one integer for Tx and another for Rx
X_test = test_cache_file['position_matrix_array']  #inputs
test_best_tx_rx_array = test_cache_file['best_ray_array']  #outputs, one integer for Tx and another for Rx
#print(position_matrix_array.shape)
#print(best_tx_rx_array.shape)

#X_train and X_test have values -4, -3, -1, 0, 2. Simplify it to using only -1 for blockers and 1 for the receiver
X_train[X_train == -4] = -1
X_train[X_train == -3] = -1
X_train[X_train == 2] = 1
X_test[X_test == -4] = -1
X_test[X_test == -3] = -1
X_test[X_test == 2] = 1
3. Load classes and features to find pairs
train_fully = (train_best_tx_rx_array[:, 0] * numUPAAntennaElements + train_best_tx_rx_array[:, 1]).astype(np.int)
test_fully = (test_best_tx_rx_array[:, 0] * numUPAAntennaElements + test_best_tx_rx_array[:, 1]).astype(np.int)
train_classes = set(train_fully)  #find unique pairs
test_classes = set(test_fully)  #find unique pairs
classes = train_classes.union(test_classes)

y_train = np.empty(train_best_tx_rx_array.shape[0])
y_test = np.empty(test_best_tx_rx_array.shape[0])
for idx, cl in enumerate(classes):  #map in single index, cl is the original class number, idx is its index
    cl_idx = np.nonzero(train_fully == cl)
    y_train[cl_idx] = idx
    cl_idx = np.nonzero(test_fully == cl)
    y_test[cl_idx] = idx
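The mapping above folds each (Tx, Rx) beam-index pair into a single class label, label = Tx * numUPAAntennaElements + Rx; with a 4x4 UPA (16 elements), the pair (2, 3) becomes 2*16 + 3 = 35. A tiny standalone sketch (pair values are illustrative only):

import numpy as np
numUPAAntennaElements = 4 * 4
pairs = np.array([[0, 0], [2, 3], [15, 15]])               # illustrative (Tx, Rx) index pairs
labels = pairs[:, 0] * numUPAAntennaElements + pairs[:, 1]
print(labels)                                              # [  0  35 255]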

4. Step 4: Load labels and set up the dimensionality


numClasses = len(classes)  #total number of labels

train_nexamples = len(X_train)
test_nexamples = len(X_test)
nrows = len(X_train[0])
ncolumns = len(X_train[0][0])

print('test_nexamples = ', test_nexamples)
print('train_nexamples = ', train_nexamples)
print('input matrices size = ', nrows, ' x ', ncolumns)
print('numClasses = ', numClasses)

#here, do not convert matrix into 1-d array
#X_train = X_train.reshape(train_nexamples, nrows * ncolumns)
#X_test = X_test.reshape(test_nexamples, nrows * ncolumns)

#fraction to be used for training set
validationFraction = 0.2  #from 0 to 1

#Keras is requiring an extra dimension: I will add it with reshape
X_train = X_train.reshape(X_train.shape[0], nrows, ncolumns, 1)
X_test = X_test.reshape(X_test.shape[0], nrows, ncolumns, 1)
input_shape = (nrows, ncolumns, 1)  #the input matrix with the extra dimension requested by Keras

print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print(X_test.shape[0] + X_train.shape[0], 'total samples')
print("Finished reading datasets")
5. Step 5: Perform classification
#from sklearn.preprocessing import OneHotEncoder
#encoder = OneHotEncoder()
#y_train = encoder.fit_transform(y_train.reshape(-1, 1))
y_train = keras.utils.to_categorical(y_train, numClasses)
original_y_test = copy.deepcopy(y_test).astype(int)
y_test = keras.utils.to_categorical(y_test, numClasses)

#declare model: Convnet with two Conv2D layers followed by a MaxPooling layer, and two dense layers
#Dropout layer consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting.
model = Sequential()
model.add(Conv2D(50, (12, 12), padding="SAME", activation='relu'))
model.add(MaxPooling2D(pool_size=(6, 6)))
model.add(Conv2D(20, (10, 10), padding="SAME", activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(4, activation='relu'))
model.add(Flatten())
#model.add(Activation('tanh'))
#model.add(Activation('softmax'))  #softmax for probability
model.add(Dense(numClasses, activation='softmax'))

model.summary()

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    shuffle=True,
                    validation_split=validationFraction)
                    #validation_data=(X_test, y_test))

#print results
score = model.evaluate(X_test, y_test, verbose=0)
print(model.metrics_names)
#print('Test loss rmse:', np.sqrt(score[0]))
#print('Test accuracy:', score[1])
print(score)

val_acc = history.history['val_acc']
acc = history.history['acc']
f = open('classification_output.txt', 'w')
f.write('validation acc\n')
f.write(str(val_acc))
f.write('\ntrain acc\n')
f.write(str(acc))
f.close()
6. Step 6: Plotting
#enable if want to plot images
if False:
    from keras.utils import plot_model
    import matplotlib.pyplot as plt

    #install graphviz: sudo apt-get install graphviz and then pip install related packages
    plot_model(model, to_file='classification_model.png', show_shapes=True)

    pred_test = model.predict(X_test)
    for i in range(len(y_test)):
        if (original_y_test[i] != np.argmax(pred_test[i])):
            myImage = X_test[i].reshape(nrows, ncolumns)
            plt.imshow(myImage)
            plt.show()
            print("Type <ENTER> for next")
            input()
Aim: Comparison of Various classifiers.

1. Step 1: Initialization

import numpy as np
#enable if want to plot images:
#import matplotlib
#matplotlib.use('WebAgg')
#matplotlib.use('Qt5Agg')
#matplotlib.use('agg')
#matplotlib.inline()
#import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

2. Step 2: Load Classifiers

names = [  #"Naive Bayes",
    "Decision Tree", "Random Forest",
    "AdaBoost",
    "Linear SVM", "RBF SVM", "Gaussian Process",
    "Neural Net",
    "QDA", "Nearest Neighbors"]

classifiers = [
    #GaussianNB(),
    DecisionTreeClassifier(max_depth=100),
    RandomForestClassifier(max_depth=100, n_estimators=30, max_features=20),
    AdaBoostClassifier(),
    LinearSVC(C=10, loss="hinge"),  #linear SVM (maximum margin perceptron)
    SVC(gamma=1, C=1),
    GaussianProcessClassifier(1.0 * RBF(1.0)),
    MLPClassifier(alpha=1),
    QuadraticDiscriminantAnalysis(),
    KNeighborsClassifier(3)]
3. Step 3: Read test and train datasets for Beam selection
numUPAAntennaElements = 4*4  #4 x 4 UPA
#trainFileName = '../datasets/all_train_classification.npz'  #(22256, 24, 362)
trainFileName = '../datasets/nlos_train_classification.npz'  #(22256, 24, 362)
print("Reading dataset...", trainFileName)
train_cache_file = np.load(trainFileName)

#testFileName = '../datasets/all_test_classification.npz'  #(22256, 24, 362)
testFileName = '../datasets/nlos_test_classification.npz'  #(22256, 24, 362)
print("Reading dataset...", testFileName)
test_cache_file = np.load(testFileName)

#input features (X_test and X_train) are arrays with matrices. Here we will convert matrices to 1-d array
X_train = train_cache_file['position_matrix_array']  #inputs
train_best_tx_rx_array = train_cache_file['best_tx_rx_array']  #outputs, one integer for Tx and another for Rx
X_test = test_cache_file['position_matrix_array']  #inputs
test_best_tx_rx_array = test_cache_file['best_tx_rx_array']  #outputs, one integer for Tx and another for Rx
#print(position_matrix_array.shape)
#print(best_tx_rx_array.shape)

#X_train and X_test have values -4, -3, -1, 0, 2. Simplify it to using only -1 for blockers and 1 for the receiver
X_train[X_train == -4] = -1
X_train[X_train == -3] = -1
X_train[X_train == 2] = 1
X_test[X_test == -4] = -1
X_test[X_test == -3] = -1
X_test[X_test == 2] = 1

4. Step 4: Perform Classification


#convert output (i, j) to single number (the class label) and eliminate pairs that do not appear
train_fully = (train_best_tx_rx_array[:, 0] * numUPAAntennaElements + train_best_tx_rx_array[:, 1]).astype(np.int)
test_fully = (test_best_tx_rx_array[:, 0] * numUPAAntennaElements + test_best_tx_rx_array[:, 1]).astype(np.int)
train_classes = set(train_fully)  #find unique pairs
test_classes = set(test_fully)  #find unique pairs
classes = train_classes.union(test_classes)

y_train = np.empty(train_best_tx_rx_array.shape[0])
y_test = np.empty(test_best_tx_rx_array.shape[0])
for idx, cl in enumerate(classes):  #map in single index, cl is the original class number, idx is its index
    cl_idx = np.nonzero(train_fully == cl)
    y_train[cl_idx] = idx
    cl_idx = np.nonzero(test_fully == cl)
    y_test[cl_idx] = idx

#new_classes = set(y)
numClasses = len(classes)  #total number of labels

train_nexamples = len(X_train)
test_nexamples = len(X_test)
nrows = len(X_train[0])
ncolumns = len(X_train[0][0])

print('test_nexamples = ', test_nexamples)
print('train_nexamples = ', train_nexamples)
print('input matrices size = ', nrows, ' x ', ncolumns)
print('numClasses = ', numClasses)

#convert matrix into 1-d array
X_train = X_train.reshape(train_nexamples, nrows * ncolumns)
X_test = X_test.reshape(test_nexamples, nrows * ncolumns)

print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print(X_test.shape[0] + X_train.shape[0], 'total samples')
print("Finished reading datasets")

5. Step 5: Output

#iterate over classifiers
for name, model in zip(names, classifiers):
    print("#### Training classifier ", name)
    model.fit(X_train, y_train)
    print('\nPrediction accuracy for the test dataset')
    pred_test = model.predict(X_test)
    print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
    #now with the train set
    pred_train = model.predict(X_train)
    print('\nPrediction accuracy for the train dataset')
    print('{:.2%}\n'.format(metrics.accuracy_score(y_train, pred_train)))
Weather Forecasting Using Machine Learning

1 Using Machine Learning To Predict Weather in Charlotte, NC

In [47]: import pandas as pd


import numpy as np
import datetime
import matplotlib
import matplotlib.pyplot as plt
#from scipy.special import logsumexp
#from scipy.misc import logsumexp
import seaborn as sns
#import statsmodels.api as sm
import sklearn.model_selection
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, median_absolute_error
from sklearn.preprocessing import StandardScaler
matplotlib.style.use('ggplot')

1.1 Raw Data Preview


Charlotte, NC Climate Data from 2013 to 2018 (downloaded from the NOAA NCEI site -
https://www.ncei.noaa.gov/)

In [48]: clt_climate_df = pd.read_csv("All Datasets/Charlotte_climate_info_2013_to_2018.csv", low_memory=False)  # low_memory value assumed


clt_climate_df.head()

Out[48]: STATION STATION_NAME ELEVATION LATITUDE \


0 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236
1 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236
2 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236
3 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236
4 WBAN:13881 CHARLOTTE DOUGLAS AIRPORT NC US 221.9 35.2236

LONGITUDE DATE REPORTTPYE HOURLYSKYCONDITIONS HOURLYVISIBILITY \


0 -80.9552 6/1/2013 0:52 FM-15 FEW:02 55 SCT:04 250 10

1 -80.9552 6/1/2013 1:00 FM-12 NaN NaN
2 -80.9552 6/1/2013 1:52 FM-15 BKN:07 65 10
3 -80.9552 6/1/2013 2:52 FM-15 BKN:07 75 10
4 -80.9552 6/1/2013 3:52 FM-15 FEW:02 75 SCT:04 110 10

HOURLYPRSENTWEATHERTYPE ... \
0 NaN ...
1 NaN ...
2 NaN ...
3 NaN ...
4 NaN ...

MonthlyMaxSeaLevelPressureTime MonthlyMinSeaLevelPressureValue \
0 -9999 NaN
1 -9999 NaN
2 -9999 NaN
3 -9999 NaN
4 -9999 NaN

MonthlyMinSeaLevelPressureDate MonthlyMinSeaLevelPressureTime \
0 -9999 -9999
1 -9999 -9999
2 -9999 -9999
3 -9999 -9999
4 -9999 -9999

MonthlyTotalHeatingDegreeDays MonthlyTotalCoolingDegreeDays \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

MonthlyDeptFromNormalHeatingDD MonthlyDeptFromNormalCoolingDD \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

MonthlyTotalSeasonToDateHeatingDD MonthlyTotalSeasonToDateCoolingDD
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

[5 rows x 90 columns]

1.2 Data Preparation & Cleanup
In [49]: # list all the columns to determine which is needed
clt_climate_df.columns

Out[49]: Index(['STATION', 'STATION_NAME', 'ELEVATION', 'LATITUDE', 'LONGITUDE', 'DATE',


'REPORTTPYE', 'HOURLYSKYCONDITIONS', 'HOURLYVISIBILITY',
'HOURLYPRSENTWEATHERTYPE', 'HOURLYDRYBULBTEMPF', 'HOURLYDRYBULBTEMPC',
'HOURLYWETBULBTEMPF', 'HOURLYWETBULBTEMPC', 'HOURLYDewPointTempF',
'HOURLYDewPointTempC', 'HOURLYRelativeHumidity', 'HOURLYWindSpeed',
'HOURLYWindDirection', 'HOURLYWindGustSpeed', 'HOURLYStationPressure',
'HOURLYPressureTendency', 'HOURLYPressureChange',
'HOURLYSeaLevelPressure', 'HOURLYPrecip', 'HOURLYAltimeterSetting',
'DAILYMaximumDryBulbTemp', 'DAILYMinimumDryBulbTemp',
'DAILYAverageDryBulbTemp', 'DAILYDeptFromNormalAverageTemp',
'DAILYAverageRelativeHumidity', 'DAILYAverageDewPointTemp',
'DAILYAverageWetBulbTemp', 'DAILYHeatingDegreeDays',
'DAILYCoolingDegreeDays', 'DAILYSunrise', 'DAILYSunset', 'DAILYWeather',
'DAILYPrecip', 'DAILYSnowfall', 'DAILYSnowDepth',
'DAILYAverageStationPressure', 'DAILYAverageSeaLevelPressure',
'DAILYAverageWindSpeed', 'DAILYPeakWindSpeed', 'PeakWindDirection',
'DAILYSustainedWindSpeed', 'DAILYSustainedWindDirection',
'MonthlyMaximumTemp', 'MonthlyMinimumTemp', 'MonthlyMeanTemp',
'MonthlyAverageRH', 'MonthlyDewpointTemp', 'MonthlyWetBulbTemp',
'MonthlyAvgHeatingDegreeDays', 'MonthlyAvgCoolingDegreeDays',
'MonthlyStationPressure', 'MonthlySeaLevelPressure',
'MonthlyAverageWindSpeed', 'MonthlyTotalSnowfall',
'MonthlyDeptFromNormalMaximumTemp', 'MonthlyDeptFromNormalMinimumTemp',
'MonthlyDeptFromNormalAverageTemp', 'MonthlyDeptFromNormalPrecip',
'MonthlyTotalLiquidPrecip', 'MonthlyGreatestPrecip',
'MonthlyGreatestPrecipDate', 'MonthlyGreatestSnowfall',
'MonthlyGreatestSnowfallDate', 'MonthlyGreatestSnowDepth',
'MonthlyGreatestSnowDepthDate', 'MonthlyDaysWithGT90Temp',
'MonthlyDaysWithLT32Temp', 'MonthlyDaysWithGT32Temp',
'MonthlyDaysWithLT0Temp', 'MonthlyDaysWithGT001Precip',
'MonthlyDaysWithGT010Precip', 'MonthlyDaysWithGT1Snow',
'MonthlyMaxSeaLevelPressureValue', 'MonthlyMaxSeaLevelPressureDate',
'MonthlyMaxSeaLevelPressureTime', 'MonthlyMinSeaLevelPressureValue',
'MonthlyMinSeaLevelPressureDate', 'MonthlyMinSeaLevelPressureTime',
'MonthlyTotalHeatingDegreeDays', 'MonthlyTotalCoolingDegreeDays',
'MonthlyDeptFromNormalHeatingDD', 'MonthlyDeptFromNormalCoolingDD',
'MonthlyTotalSeasonToDateHeatingDD',
'MonthlyTotalSeasonToDateCoolingDD'],
dtype='object')

In [50]: # Create new dataframe with necessary columns only


new_clt_climate_df = clt_climate_df.loc[:, ['STATION_NAME', 'DATE', 'DAILYMaximumDryBulbTemp',
    'DAILYMinimumDryBulbTemp', 'DAILYAverageDryBulbTemp', 'DAILYAverageRelativeHumidity',
    'DAILYAverageDewPointTemp', 'DAILYPrecip']]
new_clt_climate_df.head()

Out[50]: STATION_NAME DATE DAILYMaximumDryBulbTemp \


0 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 0:52 NaN
1 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 1:00 83.0
2 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 1:52 NaN
3 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 2:52 NaN
4 CHARLOTTE DOUGLAS AIRPORT NC US 6/1/2013 3:52 NaN

DAILYMinimumDryBulbTemp DAILYAverageDryBulbTemp \
0 NaN NaN
1 70.0 76.0
2 NaN NaN
3 NaN NaN
4 NaN NaN

DAILYAverageRelativeHumidity DAILYAverageDewPointTemp DAILYPrecip


0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

In [51]: # Reindex by date


new_clt_climate_df['DATE'] = pd.to_datetime(new_clt_climate_df['DATE'])
new_clt_climate_df.set_index('DATE', inplace=True)
new_clt_climate_df.index = new_clt_climate_df.index.normalize()
new_clt_climate_df.head()

Out[51]: STATION_NAME DAILYMaximumDryBulbTemp \


DATE
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US NaN
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US 83.0
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US NaN
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US NaN
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US NaN

DAILYMinimumDryBulbTemp DAILYAverageDryBulbTemp \
DATE
2013-06-01 NaN NaN
2013-06-01 70.0 76.0
2013-06-01 NaN NaN
2013-06-01 NaN NaN
2013-06-01 NaN NaN

DAILYAverageRelativeHumidity DAILYAverageDewPointTemp DAILYPrecip


DATE

2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN

In [52]: # Drop rows with NaN values


new_clt_climate_df = new_clt_climate_df.dropna()
# Replace T values (trace amount) with zero for daily precipitation and convert the data to float
new_clt_climate_df['DAILYPrecip'].replace(['T'], '0', inplace=True)
new_clt_climate_df[['DAILYPrecip']] = new_clt_climate_df[['DAILYPrecip']].apply(pd.to_numeric)
# Rename column names
new_clt_climate_df = new_clt_climate_df.rename(columns={'DAILYMaximumDryBulbTemp': 'DailyMaxTemp',
    'DAILYMinimumDryBulbTemp': 'DailyMinTemp', 'DAILYAverageDryBulbTemp': 'DailyAvgTemp',
    'DAILYAverageRelativeHumidity': 'DailyAvgRelHumidity', 'DAILYAverageDewPointTemp': 'DailyAvgDewPointTemp',
    'DAILYPrecip': 'DailyPrecip'})
new_clt_climate_df.head()

Out[52]: STATION_NAME DailyMaxTemp DailyMinTemp \


DATE
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 66.0
2013-06-02 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 69.0
2013-06-03 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 67.0
2013-06-04 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 68.0
2013-06-05 CHARLOTTE DOUGLAS AIRPORT NC US 81.0 68.0

DailyAvgTemp DailyAvgRelHumidity DailyAvgDewPointTemp \


DATE
2013-06-01 75.0 68.0 65.0
2013-06-02 77.0 78.0 66.0
2013-06-03 75.0 83.0 67.0
2013-06-04 76.0 70.0 64.0
2013-06-05 74.0 81.0 67.0

DailyPrecip
DATE
2013-06-01 0.00
2013-06-02 0.19
2013-06-03 2.33
2013-06-04 0.00
2013-06-05 0.03

In [53]: # Verify date range and total number of rows in the new dataframe
new_clt_climate_df.index

Out[53]: DatetimeIndex(['2013-06-01', '2013-06-02', '2013-06-03', '2013-06-04',


'2013-06-05', '2013-06-06', '2013-06-07', '2013-06-08',
'2013-06-09', '2013-06-10',
...
'2018-05-21', '2018-05-22', '2018-05-23', '2018-05-24',
'2018-05-25', '2018-05-26', '2018-05-27', '2018-05-28',

'2018-05-29', '2018-05-30'],
dtype='datetime64[ns]', name='DATE', length=1651, freq=None)

In [54]: # Verify data types


new_clt_climate_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1651 entries, 2013-06-01 to 2018-05-30
Data columns (total 7 columns):
STATION_NAME 1651 non-null object
DailyMaxTemp 1651 non-null float64
DailyMinTemp 1651 non-null float64
DailyAvgTemp 1651 non-null float64
DailyAvgRelHumidity 1651 non-null float64
DailyAvgDewPointTemp 1651 non-null float64
DailyPrecip 1651 non-null float64
dtypes: float64(6), object(1)
memory usage: 103.2+ KB

1.3 Visualizing the Average Daily Temperature for Charlotte, NC - 2013 to 2018
In [55]: # Visualize some of the 'cleaned' data by plotting the daily avg temperature in Charlotte
new_clt_climate_df['DailyAvgTemp'].plot(figsize=(20,7), color="green")
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Average Temperature - Fahrenheit')
plt.show()

[Figure: daily average temperature in Charlotte, NC, June 2013 to May 2018]

1.4 Derive Features for Weather Prediction Experiment


In [56]: features = ['DailyMaxTemp','DailyMinTemp','DailyAvgTemp','DailyAvgRelHumidity','DailyAvgDewPointTemp','DailyPrecip']
# Function that creates columns representing Nth prior measurements of feature
# None values maintain the consistent rows length for each N
def derive_nth_day_feature(new_clt_climate_df, feature, N):
rows = new_clt_climate_df.shape[0]
nth_prior_measurements = [None]*N + [new_clt_climate_df[feature][i-N] for i in range(N, rows)]
col_name = "{}_{}".format(feature, N)
new_clt_climate_df[col_name] = nth_prior_measurements
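For intuition, a minimal sketch (toy values taken from the first few days above) of what the derived N-th prior-measurement columns look like; the same effect can also be obtained with pandas' shift():

toy = pd.DataFrame({'DailyAvgTemp': [75.0, 77.0, 75.0, 76.0]})
toy['DailyAvgTemp_1'] = toy['DailyAvgTemp'].shift(1)   # value from 1 day earlier
toy['DailyAvgTemp_2'] = toy['DailyAvgTemp'].shift(2)   # value from 2 days earlier
print(toy)                                             # earliest rows hold None/NaN, as in the function above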

In [57]: # Call the above function using a loop through each feature
for feature in features:
if feature != 'DATE':
for N in range(1, 4):
derive_nth_day_feature(new_clt_climate_df, feature, N)

In [58]: new_clt_climate_df.head(32)

Out[58]: STATION_NAME DailyMaxTemp DailyMinTemp \


DATE
2013-06-01 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 66.0
2013-06-02 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 69.0
2013-06-03 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 67.0
2013-06-04 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 68.0
2013-06-05 CHARLOTTE DOUGLAS AIRPORT NC US 81.0 68.0
2013-06-06 CHARLOTTE DOUGLAS AIRPORT NC US 78.0 68.0
2013-06-07 CHARLOTTE DOUGLAS AIRPORT NC US 82.0 68.0
2013-06-08 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 67.0
2013-06-09 CHARLOTTE DOUGLAS AIRPORT NC US 86.0 67.0
2013-06-10 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 68.0
2013-06-11 CHARLOTTE DOUGLAS AIRPORT NC US 87.0 67.0
2013-06-12 CHARLOTTE DOUGLAS AIRPORT NC US 90.0 64.0
2013-06-13 CHARLOTTE DOUGLAS AIRPORT NC US 92.0 69.0
2013-06-14 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 60.0
2013-06-15 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 57.0
2013-06-16 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 65.0
2013-06-17 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 71.0
2013-06-18 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 70.0
2013-06-19 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 69.0
2013-06-20 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 68.0
2013-06-21 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 66.0
2013-06-22 CHARLOTTE DOUGLAS AIRPORT NC US 86.0 62.0
2013-06-23 CHARLOTTE DOUGLAS AIRPORT NC US 86.0 69.0
2013-06-24 CHARLOTTE DOUGLAS AIRPORT NC US 85.0 69.0
2013-06-25 CHARLOTTE DOUGLAS AIRPORT NC US 89.0 72.0
2013-06-26 CHARLOTTE DOUGLAS AIRPORT NC US 90.0 70.0
2013-06-27 CHARLOTTE DOUGLAS AIRPORT NC US 89.0 72.0
2013-06-28 CHARLOTTE DOUGLAS AIRPORT NC US 91.0 69.0
2013-06-29 CHARLOTTE DOUGLAS AIRPORT NC US 86.0 71.0
2013-07-01 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 68.0
2013-07-02 CHARLOTTE DOUGLAS AIRPORT NC US 80.0 71.0
2013-07-03 CHARLOTTE DOUGLAS AIRPORT NC US 82.0 72.0

DailyAvgTemp DailyAvgRelHumidity DailyAvgDewPointTemp \


DATE
2013-06-01 75.0 68.0 65.0
2013-06-02 77.0 78.0 66.0
2013-06-03 75.0 83.0 67.0

2013-06-04 76.0 70.0 64.0
2013-06-05 74.0 81.0 67.0
2013-06-06 73.0 94.0 69.0
2013-06-07 75.0 88.0 68.0
2013-06-08 75.0 80.0 67.0
2013-06-09 76.0 82.0 68.0
2013-06-10 75.0 87.0 69.0
2013-06-11 77.0 71.0 66.0
2013-06-12 77.0 73.0 69.0
2013-06-13 81.0 77.0 70.0
2013-06-14 72.0 60.0 57.0
2013-06-15 71.0 66.0 60.0
2013-06-16 75.0 69.0 64.0
2013-06-17 78.0 83.0 69.0
2013-06-18 77.0 86.0 70.0
2013-06-19 77.0 69.0 65.0
2013-06-20 76.0 67.0 63.0
2013-06-21 75.0 60.0 60.0
2013-06-22 74.0 71.0 64.0
2013-06-23 78.0 76.0 69.0
2013-06-24 77.0 83.0 70.0
2013-06-25 81.0 76.0 70.0
2013-06-26 80.0 78.0 70.0
2013-06-27 81.0 82.0 73.0
2013-06-28 80.0 78.0 71.0
2013-06-29 79.0 79.0 70.0
2013-07-01 75.0 83.0 69.0
2013-07-02 76.0 90.0 71.0
2013-07-03 77.0 90.0 72.0

DailyPrecip DailyMaxTemp_1 DailyMaxTemp_2 DailyMaxTemp_3 \


DATE
2013-06-01 0.00 NaN NaN NaN
2013-06-02 0.19 85.0 NaN NaN
2013-06-03 2.33 84.0 85.0 NaN
2013-06-04 0.00 83.0 84.0 85.0
2013-06-05 0.03 84.0 83.0 84.0
2013-06-06 1.12 81.0 84.0 83.0
2013-06-07 0.72 78.0 81.0 84.0
2013-06-08 0.00 82.0 78.0 81.0
2013-06-09 0.12 83.0 82.0 78.0
2013-06-10 0.62 86.0 83.0 82.0
2013-06-11 0.00 83.0 86.0 83.0
2013-06-12 0.00 87.0 83.0 86.0
2013-06-13 0.49 90.0 87.0 83.0
2013-06-14 0.00 92.0 90.0 87.0
2013-06-15 0.00 83.0 92.0 90.0
2013-06-16 0.00 84.0 83.0 92.0

2013-06-17 0.24 85.0 84.0 83.0
2013-06-18 0.41 85.0 85.0 84.0
2013-06-19 0.00 84.0 85.0 85.0
2013-06-20 0.00 84.0 84.0 85.0
2013-06-21 0.00 84.0 84.0 84.0
2013-06-22 0.00 84.0 84.0 84.0
2013-06-23 0.01 86.0 84.0 84.0
2013-06-24 0.01 86.0 86.0 84.0
2013-06-25 0.00 85.0 86.0 86.0
2013-06-26 0.00 89.0 85.0 86.0
2013-06-27 0.32 90.0 89.0 85.0
2013-06-28 0.27 89.0 90.0 89.0
2013-06-29 0.00 91.0 89.0 90.0
2013-07-01 0.07 86.0 91.0 89.0
2013-07-02 0.34 83.0 86.0 91.0
2013-07-03 0.11 80.0 83.0 86.0

... DailyAvgTemp_3 DailyAvgRelHumidity_1 \


DATE ...
2013-06-01 ... NaN NaN
2013-06-02 ... NaN 68.0
2013-06-03 ... NaN 78.0
2013-06-04 ... 75.0 83.0
2013-06-05 ... 77.0 70.0
2013-06-06 ... 75.0 81.0
2013-06-07 ... 76.0 94.0
2013-06-08 ... 74.0 88.0
2013-06-09 ... 73.0 80.0
2013-06-10 ... 75.0 82.0
2013-06-11 ... 75.0 87.0
2013-06-12 ... 76.0 71.0
2013-06-13 ... 75.0 73.0
2013-06-14 ... 77.0 77.0
2013-06-15 ... 77.0 60.0
2013-06-16 ... 81.0 66.0
2013-06-17 ... 72.0 69.0
2013-06-18 ... 71.0 83.0
2013-06-19 ... 75.0 86.0
2013-06-20 ... 78.0 69.0
2013-06-21 ... 77.0 67.0
2013-06-22 ... 77.0 60.0
2013-06-23 ... 76.0 71.0
2013-06-24 ... 75.0 76.0
2013-06-25 ... 74.0 83.0
2013-06-26 ... 78.0 76.0
2013-06-27 ... 77.0 78.0
2013-06-28 ... 81.0 82.0
2013-06-29 ... 80.0 78.0

2013-07-01 ... 81.0 79.0
2013-07-02 ... 80.0 83.0
2013-07-03 ... 79.0 90.0

DailyAvgRelHumidity_2 DailyAvgRelHumidity_3 \
DATE
2013-06-01 NaN NaN
2013-06-02 NaN NaN
2013-06-03 68.0 NaN
2013-06-04 78.0 68.0
2013-06-05 83.0 78.0
2013-06-06 70.0 83.0
2013-06-07 81.0 70.0
2013-06-08 94.0 81.0
2013-06-09 88.0 94.0
2013-06-10 80.0 88.0
2013-06-11 82.0 80.0
2013-06-12 87.0 82.0
2013-06-13 71.0 87.0
2013-06-14 73.0 71.0
2013-06-15 77.0 73.0
2013-06-16 60.0 77.0
2013-06-17 66.0 60.0
2013-06-18 69.0 66.0
2013-06-19 83.0 69.0
2013-06-20 86.0 83.0
2013-06-21 69.0 86.0
2013-06-22 67.0 69.0
2013-06-23 60.0 67.0
2013-06-24 71.0 60.0
2013-06-25 76.0 71.0
2013-06-26 83.0 76.0
2013-06-27 76.0 83.0
2013-06-28 78.0 76.0
2013-06-29 82.0 78.0
2013-07-01 78.0 82.0
2013-07-02 79.0 78.0
2013-07-03 83.0 79.0

DailyAvgDewPointTemp_1 DailyAvgDewPointTemp_2 \
DATE
2013-06-01 NaN NaN
2013-06-02 65.0 NaN
2013-06-03 66.0 65.0
2013-06-04 67.0 66.0
2013-06-05 64.0 67.0
2013-06-06 67.0 64.0
2013-06-07 69.0 67.0

2013-06-08 68.0 69.0
2013-06-09 67.0 68.0
2013-06-10 68.0 67.0
2013-06-11 69.0 68.0
2013-06-12 66.0 69.0
2013-06-13 69.0 66.0
2013-06-14 70.0 69.0
2013-06-15 57.0 70.0
2013-06-16 60.0 57.0
2013-06-17 64.0 60.0
2013-06-18 69.0 64.0
2013-06-19 70.0 69.0
2013-06-20 65.0 70.0
2013-06-21 63.0 65.0
2013-06-22 60.0 63.0
2013-06-23 64.0 60.0
2013-06-24 69.0 64.0
2013-06-25 70.0 69.0
2013-06-26 70.0 70.0
2013-06-27 70.0 70.0
2013-06-28 73.0 70.0
2013-06-29 71.0 73.0
2013-07-01 70.0 71.0
2013-07-02 69.0 70.0
2013-07-03 71.0 69.0

DailyAvgDewPointTemp_3 DailyPrecip_1 DailyPrecip_2 \


DATE
2013-06-01 NaN NaN NaN
2013-06-02 NaN 0.00 NaN
2013-06-03 NaN 0.19 0.00
2013-06-04 65.0 2.33 0.19
2013-06-05 66.0 0.00 2.33
2013-06-06 67.0 0.03 0.00
2013-06-07 64.0 1.12 0.03
2013-06-08 67.0 0.72 1.12
2013-06-09 69.0 0.00 0.72
2013-06-10 68.0 0.12 0.00
2013-06-11 67.0 0.62 0.12
2013-06-12 68.0 0.00 0.62
2013-06-13 69.0 0.00 0.00
2013-06-14 66.0 0.49 0.00
2013-06-15 69.0 0.00 0.49
2013-06-16 70.0 0.00 0.00
2013-06-17 57.0 0.00 0.00
2013-06-18 60.0 0.24 0.00
2013-06-19 64.0 0.41 0.24
2013-06-20 69.0 0.00 0.41

2013-06-21 70.0 0.00 0.00
2013-06-22 65.0 0.00 0.00
2013-06-23 63.0 0.00 0.00
2013-06-24 60.0 0.01 0.00
2013-06-25 64.0 0.01 0.01
2013-06-26 69.0 0.00 0.01
2013-06-27 70.0 0.00 0.00
2013-06-28 70.0 0.32 0.00
2013-06-29 70.0 0.27 0.32
2013-07-01 73.0 0.00 0.27
2013-07-02 71.0 0.07 0.00
2013-07-03 70.0 0.34 0.07

DailyPrecip_3
DATE
2013-06-01 NaN
2013-06-02 NaN
2013-06-03 NaN
2013-06-04 0.00
2013-06-05 0.19
2013-06-06 2.33
2013-06-07 0.00
2013-06-08 0.03
2013-06-09 1.12
2013-06-10 0.72
2013-06-11 0.00
2013-06-12 0.12
2013-06-13 0.62
2013-06-14 0.00
2013-06-15 0.00
2013-06-16 0.49
2013-06-17 0.00
2013-06-18 0.00
2013-06-19 0.00
2013-06-20 0.24
2013-06-21 0.41
2013-06-22 0.00
2013-06-23 0.00
2013-06-24 0.00
2013-06-25 0.00
2013-06-26 0.01
2013-06-27 0.01
2013-06-28 0.00
2013-06-29 0.00
2013-07-01 0.32
2013-07-02 0.27
2013-07-03 0.00

[32 rows x 25 columns]

In [59]: # Evaluate the distribution of the feature data; transpose describe() so each feature is a row
spread = new_clt_climate_df.describe().T
spread

Out[59]: count mean std min 25% 50% 75% \


DailyMaxTemp 1651.0 73.609933 15.500866 26.0 62.0 76.0 86.00
DailyMinTemp 1651.0 51.928528 16.196924 5.0 39.0 55.0 67.00
DailyAvgTemp 1651.0 62.815263 15.366170 16.0 51.0 65.0 76.00
DailyAvgRelHumidity 1651.0 64.890975 14.879809 26.0 55.0 65.0 76.00
DailyAvgDewPointTemp 1651.0 49.496669 17.611304 -4.0 36.0 54.0 65.00
DailyPrecip 1651.0 0.121532 0.344428 0.0 0.0 0.0 0.03
DailyMaxTemp_1 1650.0 73.603030 15.503027 26.0 62.0 76.0 86.00
DailyMaxTemp_2 1649.0 73.596119 15.505187 26.0 62.0 76.0 86.00
DailyMaxTemp_3 1648.0 73.593447 15.509513 26.0 62.0 76.0 86.00
DailyMinTemp_1 1650.0 51.916364 16.194288 5.0 39.0 55.0 67.00
DailyMinTemp_2 1649.0 51.904184 16.191640 5.0 39.0 55.0 67.00
DailyMinTemp_3 1648.0 51.892597 16.189714 5.0 39.0 55.0 67.00
DailyAvgTemp_1 1650.0 62.806061 15.366277 16.0 51.0 65.0 76.00
DailyAvgTemp_2 1649.0 62.796847 15.366378 16.0 51.0 65.0 76.00
DailyAvgTemp_3 1648.0 62.789442 15.368099 16.0 51.0 65.0 76.00
DailyAvgRelHumidity_1 1650.0 64.880000 14.877634 26.0 55.0 65.0 76.00
DailyAvgRelHumidity_2 1649.0 64.864767 14.869269 26.0 55.0 65.0 76.00
DailyAvgRelHumidity_3 1648.0 64.848301 14.858737 26.0 55.0 65.0 76.00
DailyAvgDewPointTemp_1 1650.0 49.483030 17.607919 -4.0 36.0 54.0 65.00
DailyAvgDewPointTemp_2 1649.0 49.468769 17.603726 -4.0 36.0 54.0 65.00
DailyAvgDewPointTemp_3 1648.0 49.455704 17.601070 -4.0 36.0 54.0 65.00
DailyPrecip_1 1650.0 0.121606 0.344519 0.0 0.0 0.0 0.03
DailyPrecip_2 1649.0 0.121364 0.344484 0.0 0.0 0.0 0.03
DailyPrecip_3 1648.0 0.121317 0.344583 0.0 0.0 0.0 0.03

max
DailyMaxTemp 100.00
DailyMinTemp 78.00
DailyAvgTemp 88.00
DailyAvgRelHumidity 98.00
DailyAvgDewPointTemp 75.00
DailyPrecip 3.89
DailyMaxTemp_1 100.00
DailyMaxTemp_2 100.00
DailyMaxTemp_3 100.00
DailyMinTemp_1 78.00
DailyMinTemp_2 78.00
DailyMinTemp_3 78.00
DailyAvgTemp_1 88.00
DailyAvgTemp_2 88.00
DailyAvgTemp_3 88.00

DailyAvgRelHumidity_1 98.00
DailyAvgRelHumidity_2 98.00
DailyAvgRelHumidity_3 98.00
DailyAvgDewPointTemp_1 75.00
DailyAvgDewPointTemp_2 75.00
DailyAvgDewPointTemp_3 75.00
DailyPrecip_1 3.89
DailyPrecip_2 3.89
DailyPrecip_3 3.89

In [60]: # Drop rows with NaN values


new_clt_climate_df = new_clt_climate_df.dropna()
new_clt_climate_df.head()

Out[60]: STATION_NAME DailyMaxTemp DailyMinTemp \


DATE
2013-06-04 CHARLOTTE DOUGLAS AIRPORT NC US 84.0 68.0
2013-06-05 CHARLOTTE DOUGLAS AIRPORT NC US 81.0 68.0
2013-06-06 CHARLOTTE DOUGLAS AIRPORT NC US 78.0 68.0
2013-06-07 CHARLOTTE DOUGLAS AIRPORT NC US 82.0 68.0
2013-06-08 CHARLOTTE DOUGLAS AIRPORT NC US 83.0 67.0

DailyAvgTemp DailyAvgRelHumidity DailyAvgDewPointTemp \


DATE
2013-06-04 76.0 70.0 64.0
2013-06-05 74.0 81.0 67.0
2013-06-06 73.0 94.0 69.0
2013-06-07 75.0 88.0 68.0
2013-06-08 75.0 80.0 67.0

DailyPrecip DailyMaxTemp_1 DailyMaxTemp_2 DailyMaxTemp_3 \


DATE
2013-06-04 0.00 83.0 84.0 85.0
2013-06-05 0.03 84.0 83.0 84.0
2013-06-06 1.12 81.0 84.0 83.0
2013-06-07 0.72 78.0 81.0 84.0
2013-06-08 0.00 82.0 78.0 81.0

... DailyAvgTemp_3 DailyAvgRelHumidity_1 \


DATE ...
2013-06-04 ... 75.0 83.0
2013-06-05 ... 77.0 70.0
2013-06-06 ... 75.0 81.0
2013-06-07 ... 76.0 94.0
2013-06-08 ... 74.0 88.0

DailyAvgRelHumidity_2 DailyAvgRelHumidity_3 \
DATE

2013-06-04 78.0 68.0
2013-06-05 83.0 78.0
2013-06-06 70.0 83.0
2013-06-07 81.0 70.0
2013-06-08 94.0 81.0

DailyAvgDewPointTemp_1 DailyAvgDewPointTemp_2 \
DATE
2013-06-04 67.0 66.0
2013-06-05 64.0 67.0
2013-06-06 67.0 64.0
2013-06-07 69.0 67.0
2013-06-08 68.0 69.0

DailyAvgDewPointTemp_3 DailyPrecip_1 DailyPrecip_2 \


DATE
2013-06-04 65.0 2.33 0.19
2013-06-05 66.0 0.00 2.33
2013-06-06 67.0 0.03 0.00
2013-06-07 64.0 1.12 0.03
2013-06-08 67.0 0.72 1.12

DailyPrecip_3
DATE
2013-06-04 0.00
2013-06-05 0.19
2013-06-06 2.33
2013-06-07 0.00
2013-06-08 0.03

[5 rows x 25 columns]

In [61]: # Assess the linearity between variables using the Pearson correlation coefficient.
df_linear = new_clt_climate_df.corr()[['DailyAvgTemp']].sort_values('DailyAvgTemp')
df_linear

Out[61]: DailyAvgTemp
DailyPrecip_2 -0.038175
DailyPrecip_3 -0.019563
DailyPrecip_1 -0.010878
DailyPrecip 0.010496
DailyAvgRelHumidity_3 0.208985
DailyAvgRelHumidity_2 0.219423
DailyAvgRelHumidity_1 0.295778
DailyAvgRelHumidity 0.334309
DailyAvgDewPointTemp_3 0.757075
DailyAvgDewPointTemp_2 0.790034
DailyMinTemp_3 0.801019

DailyMaxTemp_3 0.808100
DailyAvgTemp_3 0.826574
DailyMinTemp_2 0.831432
DailyMaxTemp_2 0.845655
DailyAvgTemp_2 0.861624
DailyAvgDewPointTemp_1 0.873512
DailyMinTemp_1 0.898852
DailyMaxTemp_1 0.912172
DailyAvgTemp_1 0.930317
DailyAvgDewPointTemp 0.939037
DailyMaxTemp 0.971456
DailyMinTemp 0.973825
DailyAvgTemp 1.000000
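A side note: the predictor list chosen in the next cell could also be derived programmatically from this correlation table. The sketch below keeps only the lagged (derived) features whose absolute correlation with DailyAvgTemp exceeds 0.6; the 0.6 cutoff is an illustrative assumption, not a value taken from the notebook.

corr = new_clt_climate_df.corr()['DailyAvgTemp']
# keep only the _1/_2/_3 lagged columns that correlate strongly with the target
predictors = [col for col in corr.index
              if col != 'DailyAvgTemp'
              and col.endswith(('_1', '_2', '_3'))
              and abs(corr[col]) > 0.6]
print(predictors)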

1.5 Visualizing Feature Relationships


In [62]: # Create new dataframe with features of interest
predictors = ['DailyMaxTemp_1', 'DailyMaxTemp_2', 'DailyMaxTemp_3',
              'DailyMinTemp_1', 'DailyMinTemp_2', 'DailyMinTemp_3',
              'DailyAvgTemp_1', 'DailyAvgTemp_2', 'DailyAvgTemp_3',
              'DailyAvgDewPointTemp_1', 'DailyAvgDewPointTemp_2', 'DailyAvgDewPointTemp_3']
new_clt_climate_df2 = new_clt_climate_df[['DailyAvgTemp'] + predictors]
new_clt_climate_df2.head()

Out[62]: DailyAvgTemp DailyMaxTemp_1 DailyMaxTemp_2 DailyMaxTemp_3 \


DATE
2013-06-04 76.0 83.0 84.0 85.0
2013-06-05 74.0 84.0 83.0 84.0
2013-06-06 73.0 81.0 84.0 83.0
2013-06-07 75.0 78.0 81.0 84.0
2013-06-08 75.0 82.0 78.0 81.0

DailyMinTemp_1 DailyMinTemp_2 DailyMinTemp_3 DailyAvgTemp_1 \


DATE
2013-06-04 67.0 69.0 66.0 75.0
2013-06-05 68.0 67.0 69.0 76.0
2013-06-06 68.0 68.0 67.0 74.0
2013-06-07 68.0 68.0 68.0 73.0
2013-06-08 68.0 68.0 68.0 75.0

DailyAvgTemp_2 DailyAvgTemp_3 DailyAvgDewPointTemp_1 \


DATE
2013-06-04 77.0 75.0 67.0
2013-06-05 75.0 77.0 64.0
2013-06-06 76.0 75.0 67.0
2013-06-07 74.0 76.0 69.0
2013-06-08 73.0 74.0 68.0

DailyAvgDewPointTemp_2 DailyAvgDewPointTemp_3
DATE
2013-06-04 66.0 65.0

2013-06-05 67.0 66.0
2013-06-06 64.0 67.0
2013-06-07 67.0 64.0
2013-06-08 69.0 67.0

In [63]: %matplotlib inline

# Manually set the parameters of the figure to an appropriate size


plt.rcParams['figure.figsize'] = [16, 22]

# Call subplots specifying the desired grid structure


# The y axes should be shared
fig, axes = plt.subplots(nrows=4, ncols=3, sharey=True)

# Loop through the features that will be the predictors to build the plot
# Rearrange data into a 2D array of 4 rows and 3 columns
arr = np.array(predictors).reshape(4, 3)

# Use enumerate to loop over the 2D array of rows and columns


# Create scatter plots of each feature vs DailyAvgTemp; DailyAvgTemp is the dependent variable in every plot
for row, col_arr in enumerate(arr):
    for col, feature in enumerate(col_arr):
        axes[row, col].scatter(new_clt_climate_df2[feature], new_clt_climate_df2['DailyAvgTemp'])
        if col == 0:
            axes[row, col].set(xlabel=feature, ylabel='DailyAvgTemp')
        else:
            axes[row, col].set(xlabel=feature)
plt.show()

[Figure: 4 x 3 grid of scatter plots, one per lagged predictor, each plotted against DailyAvgTemp]

1.6 Using Step-wise Regression to Build a Model


To test for the effects of interactions on the significance of any one variable in a linear regression model, a technique known as step-wise regression is often applied. Using step-wise regression, you add or remove variables from the model and assess the statistical significance of each variable on the resulting model.
A backward elimination technique will be applied using the following steps:
1. Select a significance level α against which the hypothesis for each variable is tested, to determine whether that variable should stay in the model.
2. Fit the model with all predictor variables.
3. Evaluate the p-values of the βj coefficients and take the one with the greatest p-value. If that p-value > α, proceed to step 4; otherwise the current model is the final model.
4. Remove the predictor identified in step 3.
5. Fit the model again, this time without the removed variable, and cycle back to step 3.
These steps help select statistically meaningful predictors (features); the cells below apply them one round at a time, and a compact loop automating the procedure is sketched just after this list.
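As referenced above, the per-round cells can be collapsed into a single loop. The following is a minimal sketch of that automation (it assumes statsmodels is imported as sm and that X and y are constructed exactly as in the next cell; it is not the notebook's own code):

import statsmodels.api as sm

def backward_eliminate(X, y, alpha=0.05):
    # Repeatedly fit OLS and drop the predictor with the largest p-value, as long as that p-value exceeds alpha
    while True:
        model = sm.OLS(y, X).fit()
        pvalues = model.pvalues.drop('const', errors='ignore')  # never drop the intercept term
        worst = pvalues.idxmax()
        if pvalues[worst] <= alpha:
            return model, X
        X = X.drop(worst, axis=1)

With the same alpha, backward_eliminate(X, y) should end at the same reduced predictor set that the manual rounds below arrive at.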

In [ ]: # Separate predictor variables (X) from the outcome variable y


X = new_clt_climate_df2[predictors]
y = new_clt_climate_df2['DailyAvgTemp']

# Add a constant to the predictor variable set to represent the β0 (intercept) term


X = sm.add_constant(X)
X.iloc[:5, :5]

In [ ]: # Step 1 - Select a significance value


alpha = 0.05

# Step 2 - Fit the model


model = sm.OLS(y, X).fit()

# Step 3 - Evaluate the coefficients' p-values


model.summary()

In [ ]: # Step 3 (cont.) - Identify the predictor with the greatest p-value and check whether it exceeds alpha
# Based on the summary table, DailyAvgTemp_1 has the greatest p-value, and it is greater than alpha (0.05)

# Step 4 - Use pandas drop function to remove this column from X


X = X.drop('DailyAvgTemp_1', axis=1)

# Step 5 - Fit the model


model = sm.OLS(y, X).fit()
model.summary()

In [ ]: # Repeat steps 1 - 5 to continue identifying predictors whose greatest p-value exceeds alpha
# ROUND 2
X = X.drop('DailyMinTemp_2', axis=1)
model = sm.OLS(y, X).fit()
model.summary()

In [ ]: # Repeat steps 1 - 5 to continue identifying predictors whose greatest p-value exceeds alpha
# ROUND 3
X = X.drop('DailyAvgTemp_3', axis=1)
model = sm.OLS(y, X).fit()
model.summary()

In [ ]: # Repeat steps 1 - 5 to continue identifying predictors whose greatest p-value exceeds alpha
# ROUND 4
X = X.drop('DailyMaxTemp_3', axis=1)
model = sm.OLS(y, X).fit()
model.summary()

In [ ]: # Repeat steps 1 - 5 to continue identifying predictors whose greatest p-value exceeds alpha
# ROUND 5
X = X.drop('DailyMaxTemp_2', axis=1)
model = sm.OLS(y, X).fit()
model.summary()

1.7 Using the SciKit-Learn Linear Regression Module to Predict the Weather
The training and testing datasets are split into 80% training and 20% testing.
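The import cell for this part is not reproduced here; based on the calls that follow, roughly these scikit-learn imports are assumed (a sketch of what is needed, not a copy of the notebook's import cell):

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, median_absolute_error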

In [ ]: # A random_state of 12 is fixed so that the same random train/test split is produced on every run.


# This random_state parameter is very useful for reproducibility of results.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=12)

In [ ]: # Instantiate the regressor class


regressor = LinearRegression()

# Fit and build the model by fitting the regressor to the training data
regressor.fit(X_train, y_train)

# Make a prediction set using the test set


prediction = regressor.predict(X_test)

# Evaluate the model on the test set; regressor.score() returns R^2 (the coefficient of determination)


print("Accuracy of Linear Regression: %.2f" % regressor.score(X_test, y_test))
print("The Mean Absolute Error: %.2f degrees fahrenheit" % mean_absolute_error(y_test, prediction))
print("The Median Absolute Error: %.2f degrees fahrenheit" % median_absolute_error(y_test, prediction))

1.8 Visualizing Weather Forecast Predictions


In [ ]: # 365 days will be the number of forecast days
forecast_out = int(365)

# X contains the last 'n=forecast_out' rows for which we don't have label data
# Put those rows in a separate matrix X_forecast_out, i.e. X_forecast_out = X[-forecast_out:]
X_forecast_out = X[-forecast_out:]
X = X[:-forecast_out]
print ("Length of X_forecast_out:", len(X_forecast_out), "& Length of X :", len(X))

In [ ]: # Predict average temp for the next 365 days using our Model
forecast_prediction = regressor.predict(X_forecast_out)
print(forecast_prediction)

In [ ]: # Plotting data with the 365-day forecast included


new_clt_climate_df['Forecast'] = np.nan
last_date = new_clt_climate_df.iloc[-1].name
last_unix = last_date.timestamp()
# Number of seconds in a day
one_day = 86400

next_unix = last_unix + one_day

for i in forecast_prediction:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += one_day
    # every existing column gets NaN for a forecast date; the predicted value goes into the 'Forecast' column
    new_clt_climate_df.loc[next_date] = [np.nan for _ in range(len(new_clt_climate_df.columns) - 1)] + [i]
new_clt_climate_df['DailyAvgTemp'].plot(figsize=(20,7), color="green")
new_clt_climate_df['Forecast'].plot(figsize=(20,7), color="orange")
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Average Temperature - Fahrenheit')
plt.show()
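The unix-timestamp arithmetic above can also be expressed with a pandas date range; a brief equivalent sketch (same outcome, not the notebook's code, assuming pandas is imported as pd):

# build the 365 future dates directly and pair them with the forecast values
future_dates = pd.date_range(start=last_date + pd.Timedelta(days=1),
                             periods=len(forecast_prediction), freq='D')
forecast_series = pd.Series(forecast_prediction, index=future_dates)
# forecast_series could then be assigned into the 'Forecast' column instead of the loop above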

Create Database through Python SQL - Part 1

In [14]: # Python SQL toolkit and Object Relational Mapper


import sqlalchemy
from sqlalchemy import create_engine, MetaData
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric, Text, Float
import pandas as pd

In [15]: # Create an engine to the SQLite dB


engine = create_engine("sqlite:///Global_Land_Temps.sqlite")

In [16]: # Create a connection to the engine called `conn`


conn = engine.connect()

In [17]: # Load the cleaned csv file into a pandas dataframe


new_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/US_Cit

In [18]: # Verify the datatypes


new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294522 entries, 0 to 294521
Data columns (total 7 columns):
dt 294522 non-null object
AverageTemperature 294522 non-null float64
AverageTemperatureUncertainty 294522 non-null float64
City 294522 non-null object
Country 294522 non-null object
Latitude 294522 non-null object
Longitude 294522 non-null object
dtypes: float64(2), object(5)
memory usage: 15.7+ MB

In [19]: # Use `declarative_base` from SQLAlchemy to model the table as an ORM class
Base = declarative_base()
class US_Cities(Base):
__tablename__ = 'US_Cities_GLT'

id = Column(Integer, primary_key=True)
dt = Column(Text)

AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
City = Column(Text)
Country = Column(Text)
Latitude = Column(Text)
Longitude = Column(Text)

    def __repr__(self):
        return f"id={self.id}, city={self.City}"

In [20]: # Use `create_all` to create the table in the database


Base.metadata.create_all(engine)

In [21]: # Use Orient='records' to create a list of data to write


# to_dict() cleans out DataFrame metadata as well
data = new_df.to_dict(orient='records')

In [22]: # Data is just a list of dictionaries that represent each row of data
print(data[:5])

[{'dt': '1918-01-01', 'AverageTemperature': 1.2830000000000004, 'AverageTemperatureUncertainty':

In [23]: # Use MetaData from SQLAlchemy to reflect the tables


metadata = MetaData(bind=engine)
metadata.reflect()

In [24]: # Save the reference to the table as a variable called `table`


table = sqlalchemy.Table('US_Cities_GLT', metadata, autoload=True)

In [25]: # Use `table.insert()` to insert the data into the table


# The SQL table is populated during this step
conn.execute(table.insert(), data)

Out[25]: <sqlalchemy.engine.result.ResultProxy at 0x7efdc828ce10>

In [26]: # Test that the insert works by fetching the first 5 rows.
conn.execute("select * from US_Cities_GLT limit 5").fetchall()

Out[26]: [(1, '1918-01-01', 1.2830000000000004, 0.325, 'Abilene', 'United States', '32.95N', '10
(2, '1918-02-01', 9.244, 0.319, 'Abilene', 'United States', '32.95N', '100.53W'),
(3, '1918-03-01', 14.636, 0.41600000000000004, 'Abilene', 'United States', '32.95N', '
(4, '1918-04-01', 16.227999999999998, 0.44299999999999995, 'Abilene', 'United States',
(5, '1918-05-01', 23.049, 0.486, 'Abilene', 'United States', '32.95N', '100.53W')]
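The rows can also be read back through the ORM class rather than raw SQL; the snippet below is a minimal sketch using a SQLAlchemy Session on the same engine (not part of the original notebook).

from sqlalchemy.orm import Session

session = Session(bind=engine)
# fetch the first five mapped US_Cities objects and print a few attributes
for row in session.query(US_Cities).limit(5):
    print(row.id, row.dt, row.AverageTemperature, row.City)
session.close()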

Create Database through Python SQL - Part 2

In [1]: import sqlalchemy


from sqlalchemy import create_engine, MetaData
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric, Text, Float
import pandas as pd

In [2]: engine = create_engine("sqlite:///Global_Land_Temps.sqlite")


conn = engine.connect()

In [3]: state_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/G


country_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data
GT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/GLT100.c

In [4]: Base = declarative_base()


class States_GLT(Base):
__tablename__ = 'GLT_By_State'

id = Column(Integer, primary_key=True)
dt = Column(Text)
AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
State = Column(Text)
Country = Column(Text)

class Countries_GLT(Base):
__tablename__ = 'GLT_By_Country'

id = Column(Integer, primary_key=True)
dt = Column(Text)
AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
Country = Column(Text)

class GLT_General(Base):
__tablename__ = 'GLT'

id = Column(Integer, primary_key=True)
dt = Column(Text)
LandAverageTemperature = Column(Float)

LandAverageTemperatureUncertainty = Column(Float)
LandMaxTemperature = Column(Float)
LandMaxTemperatureUncertainty = Column(Float)
LandMinTemperature = Column(Float)
LandMinTemperatureUncertainty = Column(Float)
LandAndOceanAverageTemperature = Column(Float)
LandAndOceanAverageTemperatureUncertainty = Column(Float)

    def __repr__(self):
        return f"id={self.id}, dt={self.dt}"

In [5]: Base.metadata.create_all(engine)
data2 = state_GLT_df.to_dict(orient='records')
data3 = country_GLT_df.to_dict(orient='records')
data4 = GT_df.to_dict(orient='records')

In [6]: print(data2[:5])
print(data3[:5])
print(data4[:5])

[{'dt': '1918-01-01', 'AverageTemperature': 24.22300000000001, 'AverageTemperatureUncertainty':


[{'dt': '1918-01-01', 'AverageTemperature': -5.4339999999999975, 'AverageTemperatureUncertainty'
[{'dt': '1918-01-01', 'LandAverageTemperature': 1.934, 'LandAverageTemperatureUncertainty': 0.25

In [7]: metadata = MetaData(bind=engine)


metadata.reflect()

In [8]: table2 = sqlalchemy.Table('GLT_By_State', metadata, autoload=True)


table3 = sqlalchemy.Table('GLT_By_Country', metadata, autoload=True)
table4 = sqlalchemy.Table('GLT', metadata, autoload=True)

In [9]: conn.execute(table2.insert(), data2)


conn.execute(table3.insert(), data3)
conn.execute(table4.insert(), data4)

Out[9]: <sqlalchemy.engine.result.ResultProxy at 0x7f0171a91550>

In [10]: conn.execute("select * from GLT_By_State limit 5").fetchall()

Out[10]: [(1, '1918-01-01', 24.22300000000001, 0.573, 'Acre', 'Brazil'),


(2, '1918-02-01', 24.663, 1.286, 'Acre', 'Brazil'),
(3, '1918-03-01', 24.882, 0.7120000000000001, 'Acre', 'Brazil'),
(4, '1918-04-01', 25.038, 0.461, 'Acre', 'Brazil'),
(5, '1918-05-01', 25.27, 0.562, 'Acre', 'Brazil')]

In [11]: conn.execute("select * from GLT_By_Country limit 5").fetchall()

Out[11]: [(1, '1918-01-01', -5.4339999999999975, 0.5579999999999999, 'Åland'),
(2, '1918-02-01', -2.636, 0.449, 'Åland'),
(3, '1918-03-01', -1.0500000000000005, 0.612, 'Åland'),
(4, '1918-04-01', 2.615, 0.418, 'Åland'),
(5, '1918-05-01', 7.162999999999999, 0.343, 'Åland')]

In [12]: conn.execute("select * from GLT limit 5").fetchall()

Out[12]: [(1, '1918-01-01', 1.934, 0.251, 7.5520000000000005, 0.261, -3.8020000000000014, 0.371,


(2, '1918-02-01', 2.455, 0.342, 8.256, 0.314, -3.568, 0.344, 13.312, 0.156),
(3, '1918-03-01', 4.811, 0.257, 10.704, 0.197, -1.2670000000000001, 0.31, 14.034, 0.14
(4, '1918-04-01', 7.643999999999999, 0.258, 13.706, 0.255, 1.426, 0.38, 14.794, 0.1480
(5, '1918-05-01', 10.54, 0.304, 16.48, 0.332, 4.386, 0.359, 15.732999999999999, 0.156)
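For later analysis, any of these tables can be pulled straight back into pandas; a minimal sketch, assuming the same conn and the pandas import from the top of this notebook:

# load the global temperature table into a DataFrame, parsing dt as a datetime index
glt_df = pd.read_sql("select * from GLT", conn, parse_dates=['dt'], index_col='dt')
print(glt_df.head())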

CSV-to-SQLite-dB Part III

January 15, 2020

In [1]: import sqlalchemy


from sqlalchemy import create_engine, MetaData
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy import Column, Integer, String, Numeric, Text, Float
import pandas as pd

In [2]: engine = create_engine("sqlite:///Global_Land_Temps.sqlite")


conn = engine.connect()

In [3]: new_df = pd.read_csv("All Datasets/US-events-1980-2018.csv", low_memory=False)


new_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 6 columns):
Name 233 non-null object
Disaster 233 non-null object
BeginDate 233 non-null int64
EndDate 233 non-null int64
Total_CPI_Adjusted_Cost_Millions 233 non-null float64
Deaths 233 non-null int64
dtypes: float64(1), int64(3), object(2)
memory usage: 11.0+ KB

In [4]: Base = declarative_base()


class US_Disaster_Events(Base):
__tablename__ = 'US_Disasters'

id = Column(Integer, primary_key=True)
Name = Column(Text)
Disaster = Column(Text)
BeginDate = Column(Integer)
EndDate = Column(Integer)
Total_CPI_Adjusted_Cost_Millions = Column(Float)
Deaths = Column(Integer)

    def __repr__(self):
        return f"id={self.id}, name={self.Name}"

In [5]: Base.metadata.create_all(engine)

In [6]: data = new_df.to_dict(orient='records')


print(data[:5])

[{'Name': 'Texas Hail Storm (June 2018)', 'Disaster': 'Severe Storm', 'BeginDate': 20180606, 'En

In [7]: metadata = MetaData(bind=engine)


metadata.reflect()

In [8]: table = sqlalchemy.Table('US_Disasters', metadata, autoload=True)

In [9]: conn.execute(table.insert(), data)

Out[9]: <sqlalchemy.engine.result.ResultProxy at 0x7f2a89bc08d0>

In [10]: conn.execute("select * from US_Disasters limit 5").fetchall()

Out[10]: [(1, 'Texas Hail Storm (June 2018)', 'Severe Storm', 20180606, 20180606, 1150.0, 0),
(2, 'Central and Eastern Severe Weather (May 2018)', 'Severe Storm', 20180513, 2018051
(3, 'Central and Northeastern Severe Weather (May 2018)', 'Severe Storm', 20180501, 20
(4, 'Southeastern Severe Storms and Tornadoes (March 2018)', 'Severe Storm', 20180318,
(5, 'Northeast Winter Storm (March 2018)', 'Winter Storm', 20180301, 20180303, 2216.0,
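BeginDate and EndDate are stored as YYYYMMDD integers. If proper timestamps are needed downstream, they can be converted as in this sketch (an optional step, not performed in the notebook):

# convert the YYYYMMDD integer columns to pandas datetimes
new_df['BeginDate'] = pd.to_datetime(new_df['BeginDate'], format='%Y%m%d')
new_df['EndDate'] = pd.to_datetime(new_df['EndDate'], format='%Y%m%d')
print(new_df[['Name', 'BeginDate', 'EndDate']].head())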

Data Preview & Prep - Part I

January 15, 2020

In [1]: import pandas as pd


import numpy as np
import datetime

1 Earth Surface Temperature Datasets


1.0.1 Data by City - US Only (includes longitude & latitude)
In [2]: us_cities_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-da
state_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/G
country_GLT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data
GT_df = pd.read_csv("All Datasets/climate-change-earth-surface-temperature-data/GlobalTe

In [3]: us_cities_GLT_df.head()

Out[3]: dt AverageTemperature AverageTemperatureUncertainty City \


0 1918-01-01 1.283 0.325 Abilene
1 1918-02-01 9.244 0.319 Abilene
2 1918-03-01 14.636 0.416 Abilene
3 1918-04-01 16.228 0.443 Abilene
4 1918-05-01 23.049 0.486 Abilene

Country Latitude Longitude


0 United States 32.95N 100.53W
1 United States 32.95N 100.53W
2 United States 32.95N 100.53W
3 United States 32.95N 100.53W
4 United States 32.95N 100.53W

In [4]: # Set date as the index and drop the extra date column
us_cities_GLT_df = us_cities_GLT_df.set_index(pd.DatetimeIndex(us_cities_GLT_df['dt']))
us_cities_GLT_df.drop('dt', axis=1, inplace=True)
us_cities_GLT_df.head()

Out[4]: AverageTemperature AverageTemperatureUncertainty City \


dt
1918-01-01 1.283 0.325 Abilene
1918-02-01 9.244 0.319 Abilene

1918-03-01 14.636 0.416 Abilene
1918-04-01 16.228 0.443 Abilene
1918-05-01 23.049 0.486 Abilene

Country Latitude Longitude


dt
1918-01-01 United States 32.95N 100.53W
1918-02-01 United States 32.95N 100.53W
1918-03-01 United States 32.95N 100.53W
1918-04-01 United States 32.95N 100.53W
1918-05-01 United States 32.95N 100.53W
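An equivalent way to obtain the DatetimeIndex in a single step is to let read_csv parse and set the index directly; a sketch (the file name below is shortened and illustrative, since the full path is truncated in the listing above):

# parse 'dt' while reading and use it as the index in one call
us_cities_GLT_df = pd.read_csv("All Datasets/.../US_cities.csv",   # illustrative, shortened path
                               parse_dates=['dt'], index_col='dt')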

In [5]: us_cities_GLT_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 294522 entries, 1918-01-01 to 2013-06-01
Data columns (total 6 columns):
AverageTemperature 294522 non-null float64
AverageTemperatureUncertainty 294522 non-null float64
City 294522 non-null object
Country 294522 non-null object
Latitude 294522 non-null object
Longitude 294522 non-null object
dtypes: float64(2), object(4)
memory usage: 15.7+ MB

1.0.2 Data By State - All Countries


In [6]: state_GLT_df.head()

Out[6]: dt AverageTemperature AverageTemperatureUncertainty State Country


0 1855-05-01 25.544 1.171 Acre Brazil
1 1855-06-01 24.228 1.103 Acre Brazil
2 1855-07-01 24.371 1.044 Acre Brazil
3 1855-08-01 25.427 1.073 Acre Brazil
4 1855-09-01 25.675 1.014 Acre Brazil

In [7]: # Drop rows with NaN values and set date as index
new_state_GLT_df = state_GLT_df.dropna()

# Set date as the index and drop the extra date column
new_state_GLT_df = new_state_GLT_df.set_index(pd.DatetimeIndex(new_state_GLT_df['dt']))
new_state_GLT_df.drop('dt', axis=1, inplace=True)

# Only keep data from 1918-01-01 through 2013-06-01


state_GLT_100_df = new_state_GLT_df['1918-01-01':'2013-06-01']
state_GLT_100_df.head()

Out[7]: AverageTemperature AverageTemperatureUncertainty State Country
dt
1918-01-01 24.223 0.573 Acre Brazil
1918-02-01 24.663 1.286 Acre Brazil
1918-03-01 24.882 0.712 Acre Brazil
1918-04-01 25.038 0.461 Acre Brazil
1918-05-01 25.270 0.562 Acre Brazil

In [8]: state_GLT_100_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 276186 entries, 1918-01-01 to 2013-06-01
Data columns (total 4 columns):
AverageTemperature 276186 non-null float64
AverageTemperatureUncertainty 276186 non-null float64
State 276186 non-null object
Country 276186 non-null object
dtypes: float64(2), object(2)
memory usage: 10.5+ MB

In [9]: state_GLT_100_df.index

Out[9]: DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',


'1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
'1918-09-01', '1918-10-01',
...
'2012-09-01', '2012-10-01', '2012-11-01', '2012-12-01',
'2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01',
'2013-05-01', '2013-06-01'],
dtype='datetime64[ns]', name='dt', length=276186, freq=None)

In [10]: #Filter out US info


us_states_df = state_GLT_100_df.query("Country == 'United States'")
us_states_df.head()

Out[10]: AverageTemperature AverageTemperatureUncertainty State \


dt
1918-01-01 2.733 0.490 Alabama
1918-02-01 11.486 0.316 Alabama
1918-03-01 16.338 0.356 Alabama
1918-04-01 15.583 0.218 Alabama
1918-05-01 22.087 0.200 Alabama

Country
dt
1918-01-01 United States
1918-02-01 United States
1918-03-01 United States

1918-04-01 United States
1918-05-01 United States

In [11]: # Output final df to CSV


state_GLT_100_df.to_csv('GLTbyState100.csv', encoding='utf-8')

1.0.3 Data By Country


In [12]: country_GLT_df.head()

Out[12]: dt AverageTemperature AverageTemperatureUncertainty Country


0 1743-11-01 4.384 2.294 Åland
1 1743-12-01 NaN NaN Åland
2 1744-01-01 NaN NaN Åland
3 1744-02-01 NaN NaN Åland
4 1744-03-01 NaN NaN Åland

In [13]: # Drop rows with NaN values and set date as index
new_country_GLT_df = country_GLT_df.dropna()

# Set date as the index and drop the extra date column
new_country_GLT_df = new_country_GLT_df.set_index(pd.DatetimeIndex(new_country_GLT_df['
new_country_GLT_df.drop('dt', axis=1, inplace=True)

# Only keep data from 1918-01-01 through 2013-06-01


country_GLT_100_df = new_country_GLT_df['1918-01-01':'2013-06-01']
country_GLT_100_df.head()

Out[13]: AverageTemperature AverageTemperatureUncertainty Country


dt
1918-01-01 -5.434 0.558 Åland
1918-02-01 -2.636 0.449 Åland
1918-03-01 -1.050 0.612 Åland
1918-04-01 2.615 0.418 Åland
1918-05-01 7.163 0.343 Åland

In [14]: country_GLT_100_df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 276596 entries, 1918-01-01 to 2013-06-01
Data columns (total 3 columns):
AverageTemperature 276596 non-null float64
AverageTemperatureUncertainty 276596 non-null float64
Country 276596 non-null object
dtypes: float64(2), object(1)
memory usage: 8.4+ MB

In [15]: country_GLT_100_df.index

Out[15]: DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',
'1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
'1918-09-01', '1918-10-01',
...
'2012-09-01', '2012-10-01', '2012-11-01', '2012-12-01',
'2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01',
'2013-05-01', '2013-06-01'],
dtype='datetime64[ns]', name='dt', length=276596, freq=None)

In [16]: # Output final df to CSV


country_GLT_100_df.to_csv('GLTbyCountry100.csv', encoding='utf-8')

1.0.4 Global Temperature Data


In [17]: GT_df.head()

Out[17]: dt LandAverageTemperature LandAverageTemperatureUncertainty \


0 1750-01-01 3.034 3.574
1 1750-02-01 3.083 3.702
2 1750-03-01 5.626 3.076
3 1750-04-01 8.490 2.451
4 1750-05-01 11.573 2.072

LandMaxTemperature LandMaxTemperatureUncertainty LandMinTemperature \


0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

LandMinTemperatureUncertainty LandAndOceanAverageTemperature \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN

LandAndOceanAverageTemperatureUncertainty
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN

In [18]: # Drop rows with NaN values and set date as index
new_GT_df = GT_df.dropna()

# Set date as the index


new_GT_df = new_GT_df.set_index(pd.DatetimeIndex(new_GT_df['dt']))

new_GT_df.drop('dt', axis=1, inplace=True)

# Only keep data from 1918-01-01 through 2013-06-01


GT_100_df = new_GT_df['1918-01-01':'2013-06-01']
GT_100_df.head()

Out[18]: LandAverageTemperature LandAverageTemperatureUncertainty \


dt
1918-01-01 1.934 0.251
1918-02-01 2.455 0.342
1918-03-01 4.811 0.257
1918-04-01 7.644 0.258
1918-05-01 10.540 0.304

LandMaxTemperature LandMaxTemperatureUncertainty \
dt
1918-01-01 7.552 0.261
1918-02-01 8.256 0.314
1918-03-01 10.704 0.197
1918-04-01 13.706 0.255
1918-05-01 16.480 0.332

LandMinTemperature LandMinTemperatureUncertainty \
dt
1918-01-01 -3.802 0.371
1918-02-01 -3.568 0.344
1918-03-01 -1.267 0.310
1918-04-01 1.426 0.380
1918-05-01 4.386 0.359

LandAndOceanAverageTemperature \
dt
1918-01-01 13.129
1918-02-01 13.312
1918-03-01 14.034
1918-04-01 14.794
1918-05-01 15.733

LandAndOceanAverageTemperatureUncertainty
dt
1918-01-01 0.141
1918-02-01 0.156
1918-03-01 0.147
1918-04-01 0.148
1918-05-01 0.156

In [19]: GT_100_df.info()

<class 'pandas.core.frame.DataFrame'>

DatetimeIndex: 1146 entries, 1918-01-01 to 2013-06-01
Data columns (total 8 columns):
LandAverageTemperature 1146 non-null float64
LandAverageTemperatureUncertainty 1146 non-null float64
LandMaxTemperature 1146 non-null float64
LandMaxTemperatureUncertainty 1146 non-null float64
LandMinTemperature 1146 non-null float64
LandMinTemperatureUncertainty 1146 non-null float64
LandAndOceanAverageTemperature 1146 non-null float64
LandAndOceanAverageTemperatureUncertainty 1146 non-null float64
dtypes: float64(8)
memory usage: 80.6 KB

In [20]: GT_100_df.index

Out[20]: DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',


'1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
'1918-09-01', '1918-10-01',
...
'2012-09-01', '2012-10-01', '2012-11-01', '2012-12-01',
'2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01',
'2013-05-01', '2013-06-01'],
dtype='datetime64[ns]', name='dt', length=1146, freq=None)

In [21]: # Output final df to CSV


GT_100_df.to_csv('GLT100.csv', encoding='utf-8')

Data Preview & Prep - Part II

January 15, 2020

In [1]: import pandas as pd


import numpy as np
import datetime

1 Hurricanes and Typhoons Datasets


In [2]: atlantic_df = pd.read_csv("All Datasets/hurricanes-and-typhoons-1851-2014/atlantic.csv",
pacific_df = pd.read_csv("All Datasets/hurricanes-and-typhoons-1851-2014/pacific.csv", l

In [3]: atlantic_df.head()

Out[3]: ID Name Date Time Event Status Latitude \


0 AL011851 UNNAMED 18510625 0 HU 28.0N
1 AL011851 UNNAMED 18510625 600 HU 28.0N
2 AL011851 UNNAMED 18510625 1200 HU 28.0N
3 AL011851 UNNAMED 18510625 1800 HU 28.1N
4 AL011851 UNNAMED 18510625 2100 L HU 28.2N

Longitude Maximum Wind Minimum Pressure ... Low Wind SW \


0 94.8W 80 -999 ... -999
1 95.4W 80 -999 ... -999
2 96.0W 80 -999 ... -999
3 96.5W 80 -999 ... -999
4 96.8W 80 -999 ... -999

Low Wind NW Moderate Wind NE Moderate Wind SE Moderate Wind SW \


0 -999 -999 -999 -999
1 -999 -999 -999 -999
2 -999 -999 -999 -999
3 -999 -999 -999 -999
4 -999 -999 -999 -999

Moderate Wind NW High Wind NE High Wind SE High Wind SW High Wind NW
0 -999 -999 -999 -999 -999
1 -999 -999 -999 -999 -999
2 -999 -999 -999 -999 -999
3 -999 -999 -999 -999 -999

4 -999 -999 -999 -999 -999

[5 rows x 22 columns]

In [4]: pacific_df.tail()

Out[4]: ID Name Date Time Event Status Latitude \


26132 EP222015 SANDRA 20151128 1200 LO 21.7N
26133 EP222015 SANDRA 20151128 1800 LO 22.4N
26134 EP222015 SANDRA 20151129 0 LO 23.1N
26135 EP222015 SANDRA 20151129 600 LO 23.5N
26136 EP222015 SANDRA 20151129 1200 LO 24.2N

Longitude Maximum Wind Minimum Pressure ... Low Wind SW \


26132 109.0W 35 1002 ... 0
26133 108.7W 30 1007 ... 0
26134 108.3W 30 1008 ... 0
26135 107.9W 25 1009 ... 0
26136 107.7W 20 1010 ... 0

Low Wind NW Moderate Wind NE Moderate Wind SE Moderate Wind SW \


26132 0 0 0 0
26133 0 0 0 0
26134 0 0 0 0
26135 0 0 0 0
26136 0 0 0 0

Moderate Wind NW High Wind NE High Wind SE High Wind SW \


26132 0 0 0 0
26133 0 0 0 0
26134 0 0 0 0
26135 0 0 0 0
26136 0 0 0 0

High Wind NW
26132 0
26133 0
26134 0
26135 0
26136 0

[5 rows x 22 columns]
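The -999 values in the wind-radii and pressure columns appear to be missing-data sentinels. If that reading is correct, they can be masked before any numeric analysis; a sketch under that assumption (using the numpy import from the top of this notebook, and not a step taken here):

# treat -999 as missing in both basins' dataframes
atlantic_df = atlantic_df.replace(-999, np.nan)
pacific_df = pacific_df.replace(-999, np.nan)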

2 Tsunami Datasets
In [5]: sources_df = pd.read_csv("All Datasets/seismic-waves/sources.csv", low_memory=False)
waves_df = pd.read_csv("All Datasets/seismic-waves/waves.csv", low_memory=False)

In [6]: sources_df.head()

Out[6]: SOURCE_ID YEAR MONTH DAY HOUR MINUTE CAUSE VALIDITY FOCAL_DEPTH \
0 1 -2000 NaN NaN NaN NaN 1.0 1.0 NaN
1 3 -1610 NaN NaN NaN NaN 6.0 4.0 NaN
2 4 -1365 NaN NaN NaN NaN 1.0 1.0 NaN
3 5 -1300 NaN NaN NaN NaN 0.0 2.0 NaN
4 6 -760 NaN NaN NaN NaN 0.0 2.0 NaN

PRIMARY_MAGNITUDE ... ALL_INJURIES INJURY_TOTAL \


0 NaN ... NaN NaN
1 NaN ... NaN NaN
2 NaN ... NaN NaN
3 6.0 ... NaN NaN
4 NaN ... NaN NaN

ALL_FATALITIES FATALITY_TOTAL ALL_DAMAGE_MILLIONS DAMAGE_TOTAL \


0 NaN 3.0 NaN 4.0
1 NaN 3.0 NaN 3.0
2 NaN NaN NaN 3.0
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN

ALL_HOUSES_DAMAGED HOUSE_DAMAGE_TOTAL ALL_HOUSES_DESTROYED \


0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 NaN NaN NaN
4 NaN NaN NaN

HOUSE_DESTRUCTION_TOTAL
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN

[5 rows x 45 columns]

In [7]: waves_df.tail()

Out[7]: SOURCE_ID WAVE_ID YEAR MONTH DAY REGION_CODE COUNTRY \


26198 5636 32278 2016 12.0 17.0 82.0 PAPUA NEW GUINEA
26199 5636 32277 2016 12.0 17.0 82.0 SOLOMON ISLANDS
26200 5637 32398 2016 12.0 25.0 89.0 CHILE
26201 5639 32404 2017 1.0 3.0 81.0 FIJI
26202 5642 32418 2017 1.0 22.0 82.0 SOLOMON ISLANDS

STATE/PROVINCE LOCATION LATITUDE \
26198 NaN D55023 BPR, ETD CORAL SEA -14.8030
26199 NaN TAREKUKURE, TARO ISLAND -6.6928
26200 AYSEN PUERTO MELINKA -43.8983
26201 NaN SUVA, KING'S WHARF -18.1330
26202 NaN TAREKUKURE, TARO ISLAND -6.6928

... INJURIES INJURY_ESTIMATE FATALITIES \


26198 ... NaN NaN NaN
26199 ... NaN NaN NaN
26200 ... NaN NaN NaN
26201 ... NaN NaN NaN
26202 ... NaN NaN NaN

FATALITY_ESTIMATE DAMAGE_MILLIONS_DOLLARS DAMAGE_ESTIMATE \


26198 NaN NaN NaN
26199 NaN NaN NaN
26200 NaN NaN NaN
26201 NaN NaN NaN
26202 NaN NaN NaN

HOUSES_DAMAGED HOUSE_DAMAGE_ESTIMATE HOUSES_DESTROYED \


26198 NaN NaN NaN
26199 NaN NaN NaN
26200 NaN NaN NaN
26201 NaN NaN NaN
26202 NaN NaN NaN

HOUSE_DESTRUCTION_ESTIMATE
26198 NaN
26199 NaN
26200 NaN
26201 NaN
26202 NaN

[5 rows x 30 columns]

3 Tornadoes Dataset
In [8]: tornadoes_df = pd.read_csv("All Datasets/1950-2017_all_tornadoes.csv", low_memory=False)
tornadoes_df.head()

Out[8]: om yr mo dy date time tz st stf stn ... len wid ns sn \


0 1 1950 1 3 1/3/50 11:00:00 3 MO 29 1 ... 9.5 150 2 0
1 1 1950 1 3 1/3/50 11:00:00 3 MO 29 1 ... 6.2 150 2 1
2 1 1950 1 3 1/3/50 11:10:00 3 IL 17 1 ... 3.3 100 2 1
3 2 1950 1 3 1/3/50 11:55:00 3 IL 17 2 ... 3.6 130 1 1

4 3 1950 1 3 1/3/50 16:00:00 3 OH 39 1 ... 0.1 10 1 1

sg f1 f2 f3 f4 fc
0 1 0 0 0 0 0
1 2 189 0 0 0 0
2 2 119 0 0 0 0
3 1 135 0 0 0 0
4 1 161 0 0 0 0

[5 rows x 29 columns]

In [9]: tornadoes_df.tail()

Out[9]: om yr mo dy date time tz st stf stn ... \


63676 2017-01425 2017 12 20 12/20/17 1:59:00 3 LA 22 0 ...
63677 2017-01426 2017 12 20 12/20/17 2:17:00 3 LA 22 0 ...
63678 2017-01427 2017 12 20 12/20/17 6:39:00 3 MS 28 0 ...
63679 2017-01428 2017 12 20 12/20/17 10:40:00 3 AL 1 0 ...
63680 2017-01429 2017 12 20 12/20/17 12:15:00 3 GA 13 0 ...

len wid ns sn sg f1 f2 f3 f4 fc
63676 3.05 600 1 1 1 19 0 0 0 0
63677 4.70 600 1 1 1 19 0 0 0 0
63678 0.17 50 1 1 1 147 0 0 0 0
63679 3.17 16 1 1 1 111 0 0 0 0
63680 3.17 125 1 1 1 199 0 0 0 0

[5 rows x 29 columns]
