Objective: To classify the type of review from Amazon users using a Machine Learning technique.
About the Data-set: We will use the Amazon Review data set, which has 10,000 rows of text data. The data set is divided into two classes: "Label 1", the negative reviews, and "Label 2", the positive reviews. The data set has two columns, "Text" and "Label".
Code:
In [53]: # Step - 1: Data Pre-processing - This will help in getting better results through the classification algorithms
In [54]: # Step - 1b : Change all the text to lower case. This is required as Python interprets 'dog' and 'DOG' differently
Corpus['text'] = [entry.lower() for entry in Corpus['text']]
In [55]: # Step - 1c : Tokenization : In this each entry in the corpus will be broken into a set of words
Corpus['text']= [word_tokenize(entry) for entry in Corpus['text']]
In [56]: # Step - 1d : Remove stop words and non-alphabetic tokens, and perform word stemming/lemmatization.
# WordNetLemmatizer requires POS tags to understand if the word is a noun, verb, or adjective
tag_map = defaultdict(lambda : wn.NOUN)
tag_map['J'] = wn.ADJ
tag_map['V'] = wn.VERB
tag_map['R'] = wn.ADV
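The loop that applies this tag map and builds the text_final column is not shown in the extract; a minimal sketch, assuming NLTK's pos_tag, stopwords, and WordNetLemmatizer have been imported:

for index, entry in enumerate(Corpus['text']):
    Final_words = []
    word_Lemmatized = WordNetLemmatizer()
    # pos_tag returns (word, tag) pairs; tag_map collapses the tag to WordNet's classes
    for word, tag in pos_tag(entry):
        # keep only alphabetic, non-stop-word tokens
        if word not in stopwords.words('english') and word.isalpha():
            Final_words.append(word_Lemmatized.lemmatize(word, tag_map[tag[0]]))
    Corpus.loc[index, 'text_final'] = str(Final_words)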
In [57]: print(Corpus['text_final'].head())
In [67]: # Step - 2: Split the corpus into Train and Test data sets
Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(Corpus['text_final'], Corpus['label'], test_size=0.3)  # test_size inferred from the 7000/3000 split below
print(Train_X.shape)
print(Train_Y.shape)
print(Test_X.shape)
(7000,)
(7000,)
(3000,)
In [68]: # Step - 3: Label encode the target variable - This is done to transform categorical data of type string into numerical values the model can understand
Encoder = LabelEncoder()
Train_Y = Encoder.fit_transform(Train_Y)
Test_Y = Encoder.transform(Test_Y)  # reuse the encoder fitted on the training labels
In [69]: # Step - 4: Vectorize the words by using the TF-IDF Vectorizer - This is done to find how important a word in a document is in relation to the whole corpus
Tfidf_vect = TfidfVectorizer(max_features=5000)
Tfidf_vect.fit(Corpus['text_final'])
Train_X_Tfidf = Tfidf_vect.transform(Train_X)
Test_X_Tfidf = Tfidf_vect.transform(Test_X)
In [70]: # Step - 5: Now we can run different algorithms to classify our data and check for accuracy
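A minimal sketch of this step, assuming the Naive Bayes and SVM classifiers from scikit-learn (the original cell body is not preserved):

from sklearn import naive_bayes, svm
from sklearn.metrics import accuracy_score

# Naive Bayes on the TF-IDF features
Naive = naive_bayes.MultinomialNB()
Naive.fit(Train_X_Tfidf, Train_Y)
predictions_NB = Naive.predict(Test_X_Tfidf)
print("Naive Bayes Accuracy Score -> ", accuracy_score(Test_Y, predictions_NB) * 100)

# SVM on the same features
SVM = svm.SVC(C=1.0, kernel='linear')
SVM.fit(Train_X_Tfidf, Train_Y)
predictions_SVM = SVM.predict(Test_X_Tfidf)
print("SVM Accuracy Score -> ", accuracy_score(Test_Y, predictions_SVM) * 100)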
Project area # 5G_6G wireless networks
L4-L5: Hands-on implementation of Primary User detection
in fifth-generation Cognitive Radio Networks
About the Data-set: The data is obtained using a USRP from an empirical testbed setup. The USRP is tuned to the UHF band. The data is a single-column matrix where the column indicates the tuned frequency and the rows indicate the time instants.
Code:
2 Retrieving data
Retrieving data from the files for all four wireless technologies.
In [4]: bandpower(UHF)
Out[4]: 0.0016754537
The variance of the signal is adjusted using the formula given below:
$$\mathrm{SNR}_{\mathrm{linear}} = \frac{\mathrm{Power}(\mathrm{signal})}{\mathrm{Var}(\mathrm{noise})} \;\Longrightarrow\; \mathrm{Power}(\mathrm{signal}) = \mathrm{Var}(\mathrm{noise}) \times \mathrm{SNR}_{\mathrm{linear}}$$
# Using the formula above
var_signal = bandpower(noise) * snr_linear
print(awgn(UHF[0:100000], 4).shape)
def create_dataset(signal, snr, samples, sample_size):  # surrounding definition reconstructed; details assumed
    noisy, X = awgn(signal, snr), np.zeros((samples, 1))  # add noise at the requested SNR
    for i in range(0, samples):
        sampled_signal = noisy[i * sample_size:(i + 1) * sample_size]
        # Energy detection: the test statistic is the summed sample energy
        E = np.sum(sampled_signal ** 2); X[i] = E
    return X
In [9]: a = time.time()
print(create_dataset(UHF[50000:], 4, 15000, 100).shape)
b = time.time()
print(b - a)  # elapsed seconds, printed last below
SNR = 4.0000010176512495
(15000, 1)
0.5682578086853027
Making the dataset for all the SNRs in the range -20 to 4 with a step size of 2. The following function takes a range of SNRs as input and outputs the dataset; its other inputs are the sample size, the signal, and the number of samples per SNR.
return X
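Only the closing return of this function survived extraction; a minimal sketch of the whole function, assuming the create_dataset helper above:

def final_dataset(signal, snr_range, samples, sample_size):
    # Build a dict keyed by SNR, one dataset per SNR value in the range
    X = {}
    for snr in snr_range:
        X[snr] = create_dataset(signal, snr, samples, sample_size)
    return X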
In [11]: a = time.time()
# UHF
X_UHF = {**final_dataset(UHF[100000:], range(-20, -4, 2), 5000, 100),
         **final_dataset(UHF[200000:], range(-6, 6, 2), 12000, 100)}  # second call truncated in the source; arguments assumed from the outputs below
X_test_UHF = final_dataset(UHF[300000:], range(-20, 6, 2), 5129, 100)
b = time.time()
SNR = -20.000000947867104
SNR = -17.999999975177595
SNR = -16.000000564536986
SNR = -14.00000024874285
SNR = -12.000000647780782
SNR = -9.999999378102368
SNR = -7.999999424928018
SNR = -6.0000007320692585
SNR = -3.999998851007047
SNR = -1.9999968889338702
SNR = 8.946170328922171e-07
SNR = 2.000001013654348
SNR = 4.000000474544495
SNR = -19.999999636204638
SNR = -17.999999682125054
SNR = -16.000000286119587
SNR = -13.999999966226724
SNR = -12.000001115697067
SNR = -9.999999844380007
SNR = -8.000000337259307
SNR = -5.999999833073398
SNR = -3.999999084560373
SNR = -1.999999456449275
SNR = 4.338867090785353e-07
SNR = 2.0000011775885964
SNR = 4.000000351050954
Time taken :- 4.981256484985352
def create_noise_sequence(samples, sample_size):  # definition reconstructed; details assumed
    X = np.zeros((samples, 1))
    for i in range(0, samples):
        sampled_signal = np.random.normal(0, 1, sample_size)  # unit-variance AWGN window
        # Energy detection
        E = np.sum(sampled_signal ** 2); X[i] = E
    return X
In [13]: a = time.time()
X_noise = create_noise_sequence(100000, 100)
b = time.time()
print(X_noise.shape)
def create_look_back(X, look_back):  # definition reconstructed; details assumed
    look_back_X = []
    look_back_Y = []
    for i in range(len(X) - look_back + 1):
        look_back_X.append(X[i:i + look_back, 0])  # overlapping windows of length look_back
    # Returning in numpy's array format
    return np.array(look_back_X)
The following function will insert look-backs into our dataset for all the SNRs; only its return statement survives, and a sketch is given below.
return X_tech_lb
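A minimal sketch, assuming the create_look_back helper above:

def insert_look_back(X_tech, look_back):  # hypothetical name; the original signature was lost
    # Apply the look-back windowing to every SNR entry of the dict
    X_tech_lb = {}
    for snr in X_tech:
        X_tech_lb[snr] = create_look_back(X_tech[snr], look_back)
    return X_tech_lb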
In [19]: look_back = 2
X = X_UHF_lb[-20]
y = np.ones((X.shape[0], 1))  # label 1: primary user present
print(X.shape)
print(X_noise_lb.shape)
X = np.concatenate((X, X_noise_lb), axis=0)
y = np.concatenate((y, np.zeros((X_noise_lb.shape[0], 1))), axis=0)  # assumed: label 0 for the noise-only rows
(4999, 2)
(99999, 2)
(99987, 2)
(99999, 2)
(199986, 2)
[[ 60.67028369 112.33572595]
[112.33572595 124.46628501]
[124.46628501 91.88249011]
...
[ 79.39234195 109.61704045]
[109.61704045 110.6177588 ]
[110.6177588 84.34582643]]
(199986, 1)
[[1.]
[1.]
[1.]
...
[0.]
[0.]
[0.]]
#ANN Model
# create model
model = Sequential() # Sequential model: layers are stacked from input to output
model.add(Dense(7, input_dim=2, kernel_initializer='uniform', activation='relu'))
#model.add(Dense(10, init='uniform', activation='relu')) # You can add as many hidden layers as you want
#model.add(Dense(5, init='uniform', activation='relu'))
model.add(Dense(1, kernel_initializer='uniform', activation='sigmoid')) # Output layer
# Compile model
# Explore this function if you want to do the mathematical analysis
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
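The fit and evaluate calls that produced the log below are not shown; a minimal sketch, assuming the X and y arrays built above (the batch size is an assumption):

# Assumed training call; 40 epochs and the terse log format match the output below
model.fit(X, y, epochs=40, batch_size=64, verbose=2)
# Assumed evaluation on the full dataset, giving the accuracy line after the log
scores = model.evaluate(X, y)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1] * 100))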
Epoch 1/40
- 3s - loss: 0.5897 - acc: 0.6314
Epoch 2/40
- 2s - loss: 0.4589 - acc: 0.8142
Epoch 3/40
- 2s - loss: 0.4123 - acc: 0.8157
Epoch 4/40
- 2s - loss: 0.4010 - acc: 0.8161
Epoch 5/40
- 2s - loss: 0.3973 - acc: 0.8163
Epoch 6/40
- 2s - loss: 0.3964 - acc: 0.8166
Epoch 7/40
- 2s - loss: 0.3955 - acc: 0.8168
Epoch 8/40
- 2s - loss: 0.3949 - acc: 0.8174
Epoch 9/40
- 2s - loss: 0.3949 - acc: 0.8167
Epoch 10/40
- 2s - loss: 0.3940 - acc: 0.8177
Epoch 11/40
- 2s - loss: 0.3943 - acc: 0.8176
Epoch 12/40
- 2s - loss: 0.3944 - acc: 0.8176
Epoch 13/40
- 2s - loss: 0.3934 - acc: 0.8181
Epoch 14/40
- 2s - loss: 0.3936 - acc: 0.8176
Epoch 15/40
- 2s - loss: 0.3934 - acc: 0.8183
Epoch 16/40
- 2s - loss: 0.3929 - acc: 0.8178
Epoch 17/40
- 2s - loss: 0.3944 - acc: 0.8169
Epoch 18/40
- 2s - loss: 0.3946 - acc: 0.8176
Epoch 19/40
- 2s - loss: 0.3933 - acc: 0.8180
Epoch 20/40
- 2s - loss: 0.3934 - acc: 0.8179
Epoch 21/40
- 2s - loss: 0.3933 - acc: 0.8175
Epoch 22/40
- 2s - loss: 0.3928 - acc: 0.8182
Epoch 23/40
- 2s - loss: 0.3932 - acc: 0.8177
Epoch 24/40
- 2s - loss: 0.3935 - acc: 0.8175
Epoch 25/40
- 2s - loss: 0.3937 - acc: 0.8175
Epoch 26/40
- 2s - loss: 0.3932 - acc: 0.8180
Epoch 27/40
- 2s - loss: 0.3930 - acc: 0.8177
Epoch 28/40
- 2s - loss: 0.3928 - acc: 0.8178
Epoch 29/40
- 2s - loss: 0.3927 - acc: 0.8182
Epoch 30/40
- 2s - loss: 0.3934 - acc: 0.8172
Epoch 31/40
- 2s - loss: 0.3933 - acc: 0.8175
Epoch 32/40
- 2s - loss: 0.3930 - acc: 0.8178
Epoch 33/40
- 2s - loss: 0.3929 - acc: 0.8184
Epoch 34/40
- 2s - loss: 0.3932 - acc: 0.8180
Epoch 35/40
- 2s - loss: 0.3937 - acc: 0.8171
Epoch 36/40
- 2s - loss: 0.3937 - acc: 0.8177
Epoch 37/40
- 2s - loss: 0.3932 - acc: 0.8177
Epoch 38/40
- 2s - loss: 0.3933 - acc: 0.8179
Epoch 39/40
- 2s - loss: 0.3931 - acc: 0.8177
Epoch 40/40
- 2s - loss: 0.3929 - acc: 0.8175
199986/199986 [==============================] - 3s 16us/step
acc: 81.96%
In [22]: pd_UHF = {}
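The body of this per-SNR evaluation cell was lost; a minimal sketch, assuming the look-back dict X_UHF_lb built earlier:

for snr in sorted(X_UHF_lb):
    X_eval = X_UHF_lb[snr]
    y_eval = np.ones((X_eval.shape[0], 1))  # signal-present labels
    print("At SNR =", snr)
    scores = model.evaluate(X_eval, y_eval)
    print("acc: %.2f%%" % (scores[1] * 100))
    pd_UHF[snr] = scores[1]  # record detection probability per SNR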
At SNR = -16
acc: 10.66%
4999/4999 [==============================] - 0s 21us/step
...
At SNR = -8
acc: 47.27%
4999/4999 [==============================] - 0s 20us/step
At SNR = -6
acc: 76.38%
11999/11999 [==============================] - 0s 18us/step
At SNR = -4
acc: 96.49%
11999/11999 [==============================] - 0s 18us/step
At SNR = -2
acc: 99.91%
11999/11999 [==============================] - 0s 17us/step
At SNR = 0
acc: 100.00%
11999/11999 [==============================] - 0s 19us/step
At SNR = 2
acc: 100.00%
11999/11999 [==============================] - 0s 18us/step
At SNR = 4
acc: 100.00%
Project area # Biology / Bioinformatics
L4-L5: Hands-on implementation to classify Cancer
patients with AML or ALL
Objective: To classify patients with acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) using the SVM algorithm.
4. Each patient has 7129 gene expression values, i.e., each patient has one value for each gene.
5. The training data contain gene expression values for patients 1 through 38.
6. The test data contain gene expression values for patients 39 through 72.
Code:
call.1 3 call.2 4 call.3 ... 29 call.33 30 call.34 31 \
0 A -76 A -135 A ... 15 A -318 A -32
1 A -49 A -114 A ... -114 A -192 A -49
2 A -307 A 265 A ... 2 A -95 A 49
3 A 309 A 12 A ... 193 A 312 A 230
4 A -376 A -419 A ... -51 A -139 A -367
[5 rows x 78 columns]
In [102]: print(Train_Data.isna().sum().max())
print(Test_Data.isna().sum().max())
0
0
1 782 295 11 76 -14 2 0
2 1138 777 41 228 -41 3 0
3 627 170 -50 126 -91 4 0
4 250 314 14 56 -25 5 0
In [111]: from sklearn.decomposition import PCA
pca = PCA(n_components = 38)
X_train = pca.fit_transform(X_train)
X_test = pca.transform(X_test)
cum_sum = pca.explained_variance_ratio_.cumsum()
cum_sum = cum_sum*100
plt.bar(range(38), cum_sum)
plt.ylabel("Cumulative Explained Variance")
plt.xlabel("Principal Components")
plt.title("Around 90% of variance is explained by the First 38 columns ")
print(X_train.shape)
print(y_train.shape)
print(X_test.shape)
(54, 38)
(54,)
(18, 38)
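The grid-search cell that defines search and model is not preserved; a minimal sketch, assuming scikit-learn's GridSearchCV over an SVC (the parameter grid is an assumption):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)
model = search.best_estimator_  # referenced by the cells below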
In [131]: #best_parameters = search.best_estimator_
In [133]: y_pred=model.predict(X_test)
In [135]: from sklearn.metrics import confusion_matrix
import seaborn as sns
cm = confusion_matrix(y_test, y_pred)
class_names = ['ALL', 'AML']
fig, ax = plt.subplots()
tick_marks = np.arange(len(class_names))
plt.xticks(tick_marks, class_names)
plt.yticks(tick_marks, class_names)
sns.heatmap(pd.DataFrame(cm), annot=True, cmap="viridis" ,fmt='g')
ax.xaxis.set_label_position("top")
plt.tight_layout()
plt.title('Confusion matrix', y=1.1)
plt.ylabel('Actual label')
plt.xlabel('Predicted label')
Intelligent Transportation System
Project Hands-on
1. Objective:
2. Hardware Used : NA
3. Software Used :
4. Expected Outcomes :
5. Basic Instructions :
2. Go to http://sumo.dlr.de/wiki/Networks/Import/OpenStreetMap
6. Save as "typemap.xml"
9. Save as "map.osm"
14. We are looking for random traffic. Go to your sumo folder, search for "randomTrips.py", and note the parameters (number of vehicles, simulation time, route length, etc.).
Type: python C:/Users/Sagar/Desktop/sumo/sumo/tools/randomTrips.py -n map.net.xml -e 100 -l
19. Press Enter and SUMO will open. Customize it as per your convenience. Press RUN
20. Note your simulation parameters and save files.
Execution Steps
1. To generate LOS and NLOS links, we need to import the .xml file of our specific route into the folder "inputmobilitySUMO".
2. Open the folder "inputmobilitySUMO" from SUMO. You can see it already has .xml files of previously simulated data. (You can replace an existing .xml file with your own and modify the .m file as per your requirement; here we only use the previously simulated data.)
3. Run "runSimulation.m".
4. Open the folder "outputKML"; you can see a file named as per the current date.
5. Go to https://earth.google.com/web/
9. Select and open the KML file. A preview of the list will open in Google Earth.
10. To keep these places in your list, click Save and customize as per your requirement.
Aim: To perform DNN Classification.
Execution Steps:
1. Step 1: Initialization
from __future__ import print_function
import keras
#from keras.datasets import mnist
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Activation
#from keras.layers import Conv1D, MaxPooling1D
from keras.layers import Conv2D, MaxPooling2D
from keras.optimizers import Adagrad
import numpy as np
from sklearn.preprocessing import minmax_scale
import keras.backend as K
import copy
# one can disable the imports below if not plotting / saving
from keras.utils import plot_model
import matplotlib.pyplot as plt

testFileName = '../datasets/all_test_classification.npz'  # (22256, 24, 362)
print("Reading dataset...", testFileName)
test_cache_file = np.load(testFileName)
# train_cache_file is assumed to be loaded analogously from a training .npz (cell lost in extraction)
# input features (X_test and X_train) are arrays with matrices. Here we will convert matrices to 1-d arrays
X_train = train_cache_file['position_matrix_array']  # inputs
train_best_tx_rx_array = train_cache_file['best_ray_array']  # outputs, one integer for Tx and another for Rx
X_test = test_cache_file['position_matrix_array']  # inputs
test_best_tx_rx_array = test_cache_file['best_ray_array']  # outputs, one integer for Tx and another for Rx
#print(position_matrix_array.shape)
#print(best_tx_rx_array.shape)
# X_train and X_test have values -4, -3, -1, 0, 2. Simplify it to using only -1 for blockers and 1 for the receiver
X_train[X_train == -4] = -1
X_train[X_train == -3] = -1
X_train[X_train == 2] = 1
X_test[X_test == -4] = -1
X_test[X_test == -3] = -1
X_test[X_test == 2] = 1
3. Load classes and features to find pairs
train_full_y = (train_best_tx_rx_array[:, 0] * numUPAAntennaElements + train_best_tx_rx_array[:, 1]).astype(np.int)
test_full_y = (test_best_tx_rx_array[:, 0] * numUPAAntennaElements + test_best_tx_rx_array[:, 1]).astype(np.int)
train_classes = set(train_full_y)  # find unique pairs
test_classes = set(test_full_y)  # find unique pairs
classes = train_classes.union(test_classes)
numClasses = len(classes)  # assumed; numClasses is used below but its definition was lost

train_nexamples = len(X_train)
test_nexamples = len(X_test)
nrows = len(X_train[0])
ncolumns = len(X_train[0][0])

print('test_nexamples = ', test_nexamples)
print('train_nexamples = ', train_nexamples)
print('input matrices size = ', nrows, 'x', ncolumns)
print('numClasses = ', numClasses)

# fraction to be used for training set
validationFraction = 0.2  # from 0 to 1

# Keras is requiring an extra dimension: I will add it with reshape
X_train = X_train.reshape(X_train.shape[0], nrows, ncolumns, 1)
X_test = X_test.reshape(X_test.shape[0], nrows, ncolumns, 1)
input_shape = (nrows, ncolumns, 1)  # the input matrix with the extra dimension requested by Keras

print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print(X_test.shape[0] + X_train.shape[0], 'total samples')
print("Finished reading datasets")
5. Step 5: Perform classification
#from sklearn.preprocessing import OneHotEncoder
#encoder = OneHotEncoder()
#y_train = encoder.fit_transform(y_train.reshape(-1, 1))
y_train = train_full_y  # assumed: the labels built in the previous step
y_test = test_full_y    # assumed: the labels built in the previous step
batch_size, epochs = 32, 10  # assumed; the original values were lost
y_train = keras.utils.to_categorical(y_train, numClasses)
original_y_test = copy.deepcopy(y_test).astype(int)
y_test = keras.utils.to_categorical(y_test, numClasses)

# declare model: Convnet with two conv layers followed by a MaxPooling layer, and two dense layers
# A Dropout layer consists in randomly setting a fraction rate of input units to 0 at each update during training time, which helps prevent overfitting.
model = Sequential()
model.add(Conv2D(50, (12, 12), padding="SAME", activation='relu', input_shape=input_shape))  # input_shape assumed here
model.add(MaxPooling2D(pool_size=(6, 6)))
model.add(Conv2D(20, (10, 10), padding="SAME", activation='relu'))
model.add(Dropout(0.3))
model.add(Dense(4, activation='relu'))
model.add(Flatten())
#model.add(Activation('tanh'))
#model.add(Activation('softmax'))  # softmax for probability
model.add(Dense(numClasses, activation='softmax'))

model.summary()

model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])

history = model.fit(X_train, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    shuffle=True,
                    validation_split=validationFraction)
                    #validation_data=(X_test, y_test))

pred_test = model.predict(X_test)
for i in range(len(y_test)):
    if original_y_test[i] != np.argmax(pred_test[i]):
        myImage = X_test[i].reshape(nrows, ncolumns)
        plt.imshow(myImage)
        plt.show()
        print("Type <ENTER> for next")
        input()
Aim: Comparison of various classifiers.
1. Step 1: Initialization
import numpy as np
# enable if you want to plot images:
#import matplotlib
#matplotlib.use('WebAgg')
#matplotlib.use('Qt5Agg')
#matplotlib.use('agg')
#matplotlib.inline()
#import matplotlib.pyplot as plt
#from matplotlib.colors import ListedColormap
from sklearn import metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.svm import LinearSVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

X_train = train_cache_file['position_matrix_array']  # inputs
train_best_tx_rx_array = train_cache_file['best_tx_rx_array']  # outputs, one integer for Tx and another for Rx
X_test = test_cache_file['position_matrix_array']  # inputs
test_best_tx_rx_array = test_cache_file['best_tx_rx_array']  # outputs, one integer for Tx and another for Rx
#print(position_matrix_array.shape)
#print(best_tx_rx_array.shape)

train_nexamples = len(X_train)
test_nexamples = len(X_test)
nrows = len(X_train[0])
ncolumns = len(X_train[0][0])

print('test_nexamples = ', test_nexamples)
print('train_nexamples = ', train_nexamples)
print('input matrices size = ', nrows, 'x', ncolumns)
print('numClasses = ', numClasses)

print(X_train.shape[0], 'train samples')
print(X_test.shape[0], 'test samples')
print(X_test.shape[0] + X_train.shape[0], 'total samples')
print("Finished reading datasets")
5. Step 4: Output
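The names and classifiers lists iterated below are not defined in the extract; a minimal sketch using a few of the estimators imported in Step 1 (the selection and hyper-parameters are assumptions):

names = ["Nearest Neighbors", "Linear SVM", "Decision Tree", "Random Forest", "Neural Net", "Naive Bayes"]
classifiers = [
    KNeighborsClassifier(3),
    LinearSVC(),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10),
    MLPClassifier(alpha=1),
    GaussianNB(),
]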
# iterate over classifiers
for name, model in zip(names, classifiers):
    print("#### Training classifier ", name)
    model.fit(X_train, y_train)
    print('\nPrediction accuracy for the test dataset')
    pred_test = model.predict(X_test)
    print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
    # now with the train set
    pred_train = model.predict(X_train)
    print('\nPrediction accuracy for the train dataset')
    print('{:.2%}\n'.format(metrics.accuracy_score(y_train, pred_train)))
Weather Forecasting Using Machine Learning
1 -80.9552 6/1/2013 1:00 FM-12 NaN NaN
2 -80.9552 6/1/2013 1:52 FM-15 BKN:07 65 10
3 -80.9552 6/1/2013 2:52 FM-15 BKN:07 75 10
4 -80.9552 6/1/2013 3:52 FM-15 FEW:02 75 SCT:04 110 10
HOURLYPRSENTWEATHERTYPE ... \
0 NaN ...
1 NaN ...
2 NaN ...
3 NaN ...
4 NaN ...
MonthlyMaxSeaLevelPressureTime MonthlyMinSeaLevelPressureValue \
0 -9999 NaN
1 -9999 NaN
2 -9999 NaN
3 -9999 NaN
4 -9999 NaN
MonthlyMinSeaLevelPressureDate MonthlyMinSeaLevelPressureTime \
0 -9999 -9999
1 -9999 -9999
2 -9999 -9999
3 -9999 -9999
4 -9999 -9999
MonthlyTotalHeatingDegreeDays MonthlyTotalCoolingDegreeDays \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
MonthlyDeptFromNormalHeatingDD MonthlyDeptFromNormalCoolingDD \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
MonthlyTotalSeasonToDateHeatingDD MonthlyTotalSeasonToDateCoolingDD
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
[5 rows x 90 columns]
1.2 Data Preparation & Cleanup
In [49]: # list all the columns to determine which is needed
clt_climate_df.columns
'DAILYAverageDewPointTemp', 'DAILYPrecip']]
new_clt_climate_df.head()
DAILYMinimumDryBulbTemp DAILYAverageDryBulbTemp \
0 NaN NaN
1 70.0 76.0
2 NaN NaN
3 NaN NaN
4 NaN NaN
DAILYMinimumDryBulbTemp DAILYAverageDryBulbTemp \
DATE
2013-06-01 NaN NaN
2013-06-01 70.0 76.0
2013-06-01 NaN NaN
2013-06-01 NaN NaN
2013-06-01 NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
2013-06-01 NaN NaN NaN
DailyPrecip
DATE
2013-06-01 0.00
2013-06-02 0.19
2013-06-03 2.33
2013-06-04 0.00
2013-06-05 0.03
In [53]: # Verify date range and total number of rows in the new dataframe
new_clt_climate_df.index
'2018-05-29', '2018-05-30'],
dtype='datetime64[ns]', name='DATE', length=1651, freq=None)
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1651 entries, 2013-06-01 to 2018-05-30
Data columns (total 7 columns):
STATION_NAME 1651 non-null object
DailyMaxTemp 1651 non-null float64
DailyMinTemp 1651 non-null float64
DailyAvgTemp 1651 non-null float64
DailyAvgRelHumidity 1651 non-null float64
DailyAvgDewPointTemp 1651 non-null float64
DailyPrecip 1651 non-null float64
dtypes: float64(6), object(1)
memory usage: 103.2+ KB
1.3 Visualizing the Average Daily Temperature for Charlotte, NC - 2013 to 2018
In [55]: # Visualize some of the 'cleaned' data by plotting the daily avg temperature in Charlot
new_clt_climate_df['DailyAvgTemp'].plot(figsize=(20,7), color="green")
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Average Temperature - Fahrenheit')
plt.show()
[Figure output_13_0.png: daily average temperature in Charlotte, NC, 2013 to 2018]
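The derive_nth_day_feature helper called below is not shown in the extract; a minimal sketch of a reconstruction:

# For a given feature, add a column holding that feature's value from N days prior
def derive_nth_day_feature(df, feature, N):
    rows = df.shape[0]
    nth_prior_measurements = [None] * N + [df[feature][i - N] for i in range(N, rows)]
    col_name = "{}_{}".format(feature, N)
    df[col_name] = nth_prior_measurements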
In [57]: # Call the above function using a loop through each feature
for feature in features:
if feature != 'DATE':
for N in range(1, 4):
derive_nth_day_feature(new_clt_climate_df, feature, N)
In [58]: new_clt_climate_df.head(32)
2013-06-04 76.0 70.0 64.0
2013-06-05 74.0 81.0 67.0
2013-06-06 73.0 94.0 69.0
2013-06-07 75.0 88.0 68.0
2013-06-08 75.0 80.0 67.0
2013-06-09 76.0 82.0 68.0
2013-06-10 75.0 87.0 69.0
2013-06-11 77.0 71.0 66.0
2013-06-12 77.0 73.0 69.0
2013-06-13 81.0 77.0 70.0
2013-06-14 72.0 60.0 57.0
2013-06-15 71.0 66.0 60.0
2013-06-16 75.0 69.0 64.0
2013-06-17 78.0 83.0 69.0
2013-06-18 77.0 86.0 70.0
2013-06-19 77.0 69.0 65.0
2013-06-20 76.0 67.0 63.0
2013-06-21 75.0 60.0 60.0
2013-06-22 74.0 71.0 64.0
2013-06-23 78.0 76.0 69.0
2013-06-24 77.0 83.0 70.0
2013-06-25 81.0 76.0 70.0
2013-06-26 80.0 78.0 70.0
2013-06-27 81.0 82.0 73.0
2013-06-28 80.0 78.0 71.0
2013-06-29 79.0 79.0 70.0
2013-07-01 75.0 83.0 69.0
2013-07-02 76.0 90.0 71.0
2013-07-03 77.0 90.0 72.0
2013-06-17 0.24 85.0 84.0 83.0
2013-06-18 0.41 85.0 85.0 84.0
2013-06-19 0.00 84.0 85.0 85.0
2013-06-20 0.00 84.0 84.0 85.0
2013-06-21 0.00 84.0 84.0 84.0
2013-06-22 0.00 84.0 84.0 84.0
2013-06-23 0.01 86.0 84.0 84.0
2013-06-24 0.01 86.0 86.0 84.0
2013-06-25 0.00 85.0 86.0 86.0
2013-06-26 0.00 89.0 85.0 86.0
2013-06-27 0.32 90.0 89.0 85.0
2013-06-28 0.27 89.0 90.0 89.0
2013-06-29 0.00 91.0 89.0 90.0
2013-07-01 0.07 86.0 91.0 89.0
2013-07-02 0.34 83.0 86.0 91.0
2013-07-03 0.11 80.0 83.0 86.0
2013-07-01 ... 81.0 79.0
2013-07-02 ... 80.0 83.0
2013-07-03 ... 79.0 90.0
DailyAvgRelHumidity_2 DailyAvgRelHumidity_3 \
DATE
2013-06-01 NaN NaN
2013-06-02 NaN NaN
2013-06-03 68.0 NaN
2013-06-04 78.0 68.0
2013-06-05 83.0 78.0
2013-06-06 70.0 83.0
2013-06-07 81.0 70.0
2013-06-08 94.0 81.0
2013-06-09 88.0 94.0
2013-06-10 80.0 88.0
2013-06-11 82.0 80.0
2013-06-12 87.0 82.0
2013-06-13 71.0 87.0
2013-06-14 73.0 71.0
2013-06-15 77.0 73.0
2013-06-16 60.0 77.0
2013-06-17 66.0 60.0
2013-06-18 69.0 66.0
2013-06-19 83.0 69.0
2013-06-20 86.0 83.0
2013-06-21 69.0 86.0
2013-06-22 67.0 69.0
2013-06-23 60.0 67.0
2013-06-24 71.0 60.0
2013-06-25 76.0 71.0
2013-06-26 83.0 76.0
2013-06-27 76.0 83.0
2013-06-28 78.0 76.0
2013-06-29 82.0 78.0
2013-07-01 78.0 82.0
2013-07-02 79.0 78.0
2013-07-03 83.0 79.0
DailyAvgDewPointTemp_1 DailyAvgDewPointTemp_2 \
DATE
2013-06-01 NaN NaN
2013-06-02 65.0 NaN
2013-06-03 66.0 65.0
2013-06-04 67.0 66.0
2013-06-05 64.0 67.0
2013-06-06 67.0 64.0
2013-06-07 69.0 67.0
2013-06-08 68.0 69.0
2013-06-09 67.0 68.0
2013-06-10 68.0 67.0
2013-06-11 69.0 68.0
2013-06-12 66.0 69.0
2013-06-13 69.0 66.0
2013-06-14 70.0 69.0
2013-06-15 57.0 70.0
2013-06-16 60.0 57.0
2013-06-17 64.0 60.0
2013-06-18 69.0 64.0
2013-06-19 70.0 69.0
2013-06-20 65.0 70.0
2013-06-21 63.0 65.0
2013-06-22 60.0 63.0
2013-06-23 64.0 60.0
2013-06-24 69.0 64.0
2013-06-25 70.0 69.0
2013-06-26 70.0 70.0
2013-06-27 70.0 70.0
2013-06-28 73.0 70.0
2013-06-29 71.0 73.0
2013-07-01 70.0 71.0
2013-07-02 69.0 70.0
2013-07-03 71.0 69.0
2013-06-21 70.0 0.00 0.00
2013-06-22 65.0 0.00 0.00
2013-06-23 63.0 0.00 0.00
2013-06-24 60.0 0.01 0.00
2013-06-25 64.0 0.01 0.01
2013-06-26 69.0 0.00 0.01
2013-06-27 70.0 0.00 0.00
2013-06-28 70.0 0.32 0.00
2013-06-29 70.0 0.27 0.32
2013-07-01 73.0 0.00 0.27
2013-07-02 71.0 0.07 0.00
2013-07-03 70.0 0.34 0.07
DailyPrecip_3
DATE
2013-06-01 NaN
2013-06-02 NaN
2013-06-03 NaN
2013-06-04 0.00
2013-06-05 0.19
2013-06-06 2.33
2013-06-07 0.00
2013-06-08 0.03
2013-06-09 1.12
2013-06-10 0.72
2013-06-11 0.00
2013-06-12 0.12
2013-06-13 0.62
2013-06-14 0.00
2013-06-15 0.00
2013-06-16 0.49
2013-06-17 0.00
2013-06-18 0.00
2013-06-19 0.00
2013-06-20 0.24
2013-06-21 0.41
2013-06-22 0.00
2013-06-23 0.00
2013-06-24 0.00
2013-06-25 0.00
2013-06-26 0.01
2013-06-27 0.01
2013-06-28 0.00
2013-06-29 0.00
2013-07-01 0.32
2013-07-02 0.27
2013-07-03 0.00
[32 rows x 25 columns]
In [59]: # Evaluate the distribution of the feature data and transpose it; drop latitude and longitude
spread = new_clt_climate_df.describe().T
spread
max
DailyMaxTemp 100.00
DailyMinTemp 78.00
DailyAvgTemp 88.00
DailyAvgRelHumidity 98.00
DailyAvgDewPointTemp 75.00
DailyPrecip 3.89
DailyMaxTemp_1 100.00
DailyMaxTemp_2 100.00
DailyMaxTemp_3 100.00
DailyMinTemp_1 78.00
DailyMinTemp_2 78.00
DailyMinTemp_3 78.00
DailyAvgTemp_1 88.00
DailyAvgTemp_2 88.00
DailyAvgTemp_3 88.00
DailyAvgRelHumidity_1 98.00
DailyAvgRelHumidity_2 98.00
DailyAvgRelHumidity_3 98.00
DailyAvgDewPointTemp_1 75.00
DailyAvgDewPointTemp_2 75.00
DailyAvgDewPointTemp_3 75.00
DailyPrecip_1 3.89
DailyPrecip_2 3.89
DailyPrecip_3 3.89
DailyAvgRelHumidity_2 DailyAvgRelHumidity_3 \
DATE
2013-06-04 78.0 68.0
2013-06-05 83.0 78.0
2013-06-06 70.0 83.0
2013-06-07 81.0 70.0
2013-06-08 94.0 81.0
DailyAvgDewPointTemp_1 DailyAvgDewPointTemp_2 \
DATE
2013-06-04 67.0 66.0
2013-06-05 64.0 67.0
2013-06-06 67.0 64.0
2013-06-07 69.0 67.0
2013-06-08 68.0 69.0
DailyPrecip_3
DATE
2013-06-04 0.00
2013-06-05 0.19
2013-06-06 2.33
2013-06-07 0.00
2013-06-08 0.03
[5 rows x 25 columns]
In [61]: # Assess the linearity between variables using the Pearson correlation coefficient.
df_linear = new_clt_climate_df.corr()[['DailyAvgTemp']].sort_values('DailyAvgTemp')
df_linear
Out[61]: DailyAvgTemp
DailyPrecip_2 -0.038175
DailyPrecip_3 -0.019563
DailyPrecip_1 -0.010878
DailyPrecip 0.010496
DailyAvgRelHumidity_3 0.208985
DailyAvgRelHumidity_2 0.219423
DailyAvgRelHumidity_1 0.295778
DailyAvgRelHumidity 0.334309
DailyAvgDewPointTemp_3 0.757075
DailyAvgDewPointTemp_2 0.790034
DailyMinTemp_3 0.801019
DailyMaxTemp_3 0.808100
DailyAvgTemp_3 0.826574
DailyMinTemp_2 0.831432
DailyMaxTemp_2 0.845655
DailyAvgTemp_2 0.861624
DailyAvgDewPointTemp_1 0.873512
DailyMinTemp_1 0.898852
DailyMaxTemp_1 0.912172
DailyAvgTemp_1 0.930317
DailyAvgDewPointTemp 0.939037
DailyMaxTemp 0.971456
DailyMinTemp 0.973825
DailyAvgTemp 1.000000
DailyAvgDewPointTemp_2 DailyAvgDewPointTemp_3
DATE
2013-06-04 66.0 65.0
2013-06-05 67.0 66.0
2013-06-06 64.0 67.0
2013-06-07 67.0 64.0
2013-06-08 69.0 67.0
# Loop through the features that will be the predictors to build the plot
# Rearrange data into a 2D array of 4 rows and 3 columns
arr = np.array(predictors).reshape(4, 3)
[Figure output_23_0.png: 4 x 3 grid of predictor-feature plots against DailyAvgTemp]
your final model:
4. Remove the predictor identified in step 3.
5. Fit the model again, this time without the removed variable, and cycle back to step 3.
These steps will help to select statistically meaningful predictors (features). The initial fit is sketched below.
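The first fit itself is not preserved; a minimal sketch, assuming statsmodels and the predictors list built earlier:

import statsmodels.api as sm

# Hypothetical reconstruction of the initial fit (before the rounds below);
# assumes the NaN rows created by the lagged features were dropped beforehand
X = new_clt_climate_df[predictors]
y = new_clt_climate_df['DailyAvgTemp']
X = sm.add_constant(X)  # add the intercept term
model = sm.OLS(y, X).fit()
model.summary()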
In [ ]: # Step 3 (cont.) - Identify the predictor with the greatest p-value and assess if it is greater than the chosen alpha
# Based off the table, DailyAvgTemp_1 has the greatest p-value, and it is greater than alpha, so it is removed
In [ ]: # Repeat steps 1 - 5 to continue identifying predictors with the greatest p-value that are greater than alpha
# ROUND 2
X = X.drop('DailyMinTemp_2', axis=1)
model = sm.OLS(y, X).fit()
model.summary()
In [ ]: # Repeat steps 1 - 5 to continue identifying predictors with the greatest p-value that are greater than alpha
# ROUND 3
X = X.drop('DailyAvgTemp_3', axis=1)
model = sm.OLS(y, X).fit()
model.summary()
In [ ]: # Repeat steps 1 - 5 to continue identifying predictors with the greatest p-value that are greater than alpha
# ROUND 4
X = X.drop('DailyMaxTemp_3', axis=1)
model = sm.OLS(y, X).fit()
model.summary()
In [ ]: # Repeat steps 1 - 5 to continue identifying predictors with the greatest p-value that are greater than alpha
# ROUND 5
X = X.drop('DailyMaxTemp_2', axis=1)
model = sm.OLS(y, X).fit()
model.summary()
1.7 Using the SciKit-Learn Linear Regression Module to Predict the Weather
The training and testing datasets are split into 80% training and 20% testing.
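The cell that performs the split and instantiates the regressor is not preserved; a minimal sketch, assuming scikit-learn's LinearRegression and a 365-day horizon:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

forecast_out = 365  # assumed horizon, matching the 365-day prediction below
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
regressor = LinearRegression()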
# Fit and build the model by fitting the regressor to the training data
regressor.fit(X_train, y_train)
# X contains the last 'n=forecast_out' rows for which we don't have label data
# Put those rows in a different matrix: X_forecast_out = X[-forecast_out:]
X_forecast_out = X[-forecast_out:]
X = X[:-forecast_out]
print ("Length of X_forecast_out:", len(X_forecast_out), "& Length of X :", len(X))
In [ ]: # Predict average temp for the next 365 days using our Model
forecast_prediction = regressor.predict(X_forecast_out)
print(forecast_prediction)
next_unix = last_unix + one_day
for i in forecast_prediction:
    next_date = datetime.datetime.fromtimestamp(next_unix)
    next_unix += 86400
    new_clt_climate_df.loc[next_date] = [np.nan for _ in range(len(new_clt_climate_df.columns) - 1)] + [i]  # completion assumed: NaN for every column except the Forecast value
new_clt_climate_df['DailyAvgTemp'].plot(figsize=(20,7), color="green")
new_clt_climate_df['Forecast'].plot(figsize=(20,7), color="orange")
plt.legend(loc=4)
plt.xlabel('Date')
plt.ylabel('Average Temperature - Fahrenheit')
plt.show()
Create Database through Python SQL - Part 1
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 294522 entries, 0 to 294521
Data columns (total 7 columns):
dt 294522 non-null object
AverageTemperature 294522 non-null float64
AverageTemperatureUncertainty 294522 non-null float64
City 294522 non-null object
Country 294522 non-null object
Latitude 294522 non-null object
Longitude 294522 non-null object
dtypes: float64(2), object(5)
memory usage: 15.7+ MB
In [19]: # Use `declarative_base` from SQLAlchemy to model the table as an ORM class
Base = declarative_base()
class US_Cities(Base):
__tablename__ = 'US_Cities_GLT'
id = Column(Integer, primary_key=True)
dt = Column(Text)
AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
City = Column(Text)
Country = Column(Text)
Latitude = Column(Text)
Longitude = Column(Text)
def __repr__(self):
    return f"id={self.id}"
In [22]: # Data is just a list of dictionaries that represent each row of data
print(data[:5])
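The cells that create the engine, build the table, and bulk-insert the data are not preserved; a minimal sketch, assuming a SQLite database (the file name is an assumption):

from sqlalchemy import create_engine

engine = create_engine('sqlite:///GLT.sqlite')  # assumed database file
Base.metadata.create_all(engine)                # create the US_Cities_GLT table
conn = engine.connect()
conn.execute(US_Cities.__table__.insert(), data)  # bulk-insert the list of dicts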
In [26]: # Test that the insert works by fetching the first 5 rows.
conn.execute("select * from US_Cities_GLT limit 5").fetchall()
Out[26]: [(1, '1918-01-01', 1.2830000000000004, 0.325, 'Abilene', 'United States', '32.95N', '10
(2, '1918-02-01', 9.244, 0.319, 'Abilene', 'United States', '32.95N', '100.53W'),
(3, '1918-03-01', 14.636, 0.41600000000000004, 'Abilene', 'United States', '32.95N', '
(4, '1918-04-01', 16.227999999999998, 0.44299999999999995, 'Abilene', 'United States',
(5, '1918-05-01', 23.049, 0.486, 'Abilene', 'United States', '32.95N', '100.53W')]
Create Database through Python SQL - Part 2
id = Column(Integer, primary_key=True)
dt = Column(Text)
AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
State = Column(Text)
Country = Column(Text)
class Countries_GLT(Base):
__tablename__ = 'GLT_By_Country'
id = Column(Integer, primary_key=True)
dt = Column(Text)
AverageTemperature = Column(Float)
AverageTemperatureUncertainty = Column(Float)
Country = Column(Text)
class GLT_General(Base):
__tablename__ = 'GLT'
id = Column(Integer, primary_key=True)
dt = Column(Text)
LandAverageTemperature = Column(Float)
LandAverageTemperatureUncertainty = Column(Float)
LandMaxTemperature = Column(Float)
LandMaxTemperatureUncertainty = Column(Float)
LandMinTemperature = Column(Float)
LandMinTemperatureUncertainty = Column(Float)
LandAndOceanAverageTemperature = Column(Float)
LandAndOceanAverageTemperatureUncertainty = Column(Float)
def __repr__(self):
    return f"id={self.id}"
In [5]: Base.metadata.create_all(engine)
data2 = state_GLT_df.to_dict(orient='records')
data3 = country_GLT_df.to_dict(orient='records')
data4 = GT_df.to_dict(orient='records')
In [6]: print(data2[:5])
print(data3[:5])
print(data4[:5])
Out[11]: [(1, '1918-01-01', -5.4339999999999975, 0.5579999999999999, 'Åland'),
(2, '1918-02-01', -2.636, 0.449, 'Åland'),
(3, '1918-03-01', -1.0500000000000005, 0.612, 'Åland'),
(4, '1918-04-01', 2.615, 0.418, 'Åland'),
(5, '1918-05-01', 7.162999999999999, 0.343, 'Åland')]
CSV-to-SQLite-dB Part III
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 233 entries, 0 to 232
Data columns (total 6 columns):
Name 233 non-null object
Disaster 233 non-null object
BeginDate 233 non-null int64
EndDate 233 non-null int64
Total_CPI_Adjusted_Cost_Millions 233 non-null float64
Deaths 233 non-null int64
dtypes: float64(1), int64(3), object(2)
memory usage: 11.0+ KB
id = Column(Integer, primary_key=True)
Name = Column(Text)
Disaster = Column(Text)
BeginDate = Column(Integer)
EndDate = Column(Integer)
Total_CPI_Adjusted_Cost_Millions = Column(Float)
Deaths = Column(Integer)
def __repr__(self):
    return f"id={self.id}, name={self.Name}"
In [5]: Base.metadata.create_all(engine)
[{'Name': 'Texas Hail Storm (June 2018)', 'Disaster': 'Severe Storm', 'BeginDate': 20180606, 'En
Out[10]: [(1, 'Texas Hail Storm (June 2018)', 'Severe Storm', 20180606, 20180606, 1150.0, 0),
(2, 'Central and Eastern Severe Weather (May 2018)', 'Severe Storm', 20180513, 2018051
(3, 'Central and Northeastern Severe Weather (May 2018)', 'Severe Storm', 20180501, 20
(4, 'Southeastern Severe Storms and Tornadoes (March 2018)', 'Severe Storm', 20180318,
(5, 'Northeast Winter Storm (March 2018)', 'Winter Storm', 20180301, 20180303, 2216.0,
Data Preview & Prep - Part I
In [3]: us_cities_GLT_df.head()
In [4]: # Set date as the index and drop the extra date column
us_cities_GLT_df = us_cities_GLT_df.set_index(pd.DatetimeIndex(us_cities_GLT_df['dt']))
us_cities_GLT_df.drop('dt', axis=1, inplace=True)
us_cities_GLT_df.head()
1918-03-01 14.636 0.416 Abilene
1918-04-01 16.228 0.443 Abilene
1918-05-01 23.049 0.486 Abilene
In [5]: us_cities_GLT_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 294522 entries, 1918-01-01 to 2013-06-01
Data columns (total 6 columns):
AverageTemperature 294522 non-null float64
AverageTemperatureUncertainty 294522 non-null float64
City 294522 non-null object
Country 294522 non-null object
Latitude 294522 non-null object
Longitude 294522 non-null object
dtypes: float64(2), object(4)
memory usage: 15.7+ MB
In [7]: # Drop rows with NaN values and set date as index
new_state_GLT_df = state_GLT_df.dropna()
# Set date as the index and drop the extra date column
new_state_GLT_df = new_state_GLT_df.set_index(pd.DatetimeIndex(new_state_GLT_df['dt']))
new_state_GLT_df.drop('dt', axis=1, inplace=True)
Out[7]: AverageTemperature AverageTemperatureUncertainty State Country
dt
1918-01-01 24.223 0.573 Acre Brazil
1918-02-01 24.663 1.286 Acre Brazil
1918-03-01 24.882 0.712 Acre Brazil
1918-04-01 25.038 0.461 Acre Brazil
1918-05-01 25.270 0.562 Acre Brazil
In [8]: state_GLT_100_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 276186 entries, 1918-01-01 to 2013-06-01
Data columns (total 4 columns):
AverageTemperature 276186 non-null float64
AverageTemperatureUncertainty 276186 non-null float64
State 276186 non-null object
Country 276186 non-null object
dtypes: float64(2), object(2)
memory usage: 10.5+ MB
In [9]: state_GLT_100_df.index
Country
dt
1918-01-01 United States
1918-02-01 United States
1918-03-01 United States
1918-04-01 United States
1918-05-01 United States
In [13]: # Drop rows with NaN values and set date as index
new_country_GLT_df = country_GLT_df.dropna()
# Set date as the index and drop the extra date column
new_country_GLT_df = new_country_GLT_df.set_index(pd.DatetimeIndex(new_country_GLT_df['dt']))
new_country_GLT_df.drop('dt', axis=1, inplace=True)
In [14]: country_GLT_100_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 276596 entries, 1918-01-01 to 2013-06-01
Data columns (total 3 columns):
AverageTemperature 276596 non-null float64
AverageTemperatureUncertainty 276596 non-null float64
Country 276596 non-null object
dtypes: float64(2), object(1)
memory usage: 8.4+ MB
In [15]: country_GLT_100_df.index
Out[15]: DatetimeIndex(['1918-01-01', '1918-02-01', '1918-03-01', '1918-04-01',
'1918-05-01', '1918-06-01', '1918-07-01', '1918-08-01',
'1918-09-01', '1918-10-01',
...
'2012-09-01', '2012-10-01', '2012-11-01', '2012-12-01',
'2013-01-01', '2013-02-01', '2013-03-01', '2013-04-01',
'2013-05-01', '2013-06-01'],
dtype='datetime64[ns]', name='dt', length=276596, freq=None)
LandMinTemperatureUncertainty LandAndOceanAverageTemperature \
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
4 NaN NaN
LandAndOceanAverageTemperatureUncertainty
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
In [18]: # Drop rows with NaN values and set date as index
new_GT_df = GT_df.dropna()
new_GT_df = new_GT_df.set_index(pd.DatetimeIndex(new_GT_df['dt']))  # line assumed, following the pattern of the earlier cells
new_GT_df.drop('dt', axis=1, inplace=True)
LandMaxTemperature LandMaxTemperatureUncertainty \
dt
1918-01-01 7.552 0.261
1918-02-01 8.256 0.314
1918-03-01 10.704 0.197
1918-04-01 13.706 0.255
1918-05-01 16.480 0.332
LandMinTemperature LandMinTemperatureUncertainty \
dt
1918-01-01 -3.802 0.371
1918-02-01 -3.568 0.344
1918-03-01 -1.267 0.310
1918-04-01 1.426 0.380
1918-05-01 4.386 0.359
LandAndOceanAverageTemperature \
dt
1918-01-01 13.129
1918-02-01 13.312
1918-03-01 14.034
1918-04-01 14.794
1918-05-01 15.733
LandAndOceanAverageTemperatureUncertainty
dt
1918-01-01 0.141
1918-02-01 0.156
1918-03-01 0.147
1918-04-01 0.148
1918-05-01 0.156
In [19]: GT_100_df.info()
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 1146 entries, 1918-01-01 to 2013-06-01
Data columns (total 8 columns):
LandAverageTemperature 1146 non-null float64
LandAverageTemperatureUncertainty 1146 non-null float64
LandMaxTemperature 1146 non-null float64
LandMaxTemperatureUncertainty 1146 non-null float64
LandMinTemperature 1146 non-null float64
LandMinTemperatureUncertainty 1146 non-null float64
LandAndOceanAverageTemperature 1146 non-null float64
LandAndOceanAverageTemperatureUncertainty 1146 non-null float64
dtypes: float64(8)
memory usage: 80.6 KB
In [20]: GT_100_df.index
Data Preview & Prep - Part II
In [3]: atlantic_df.head()
Moderate Wind NW High Wind NE High Wind SE High Wind SW High Wind NW
0 -999 -999 -999 -999 -999
1 -999 -999 -999 -999 -999
2 -999 -999 -999 -999 -999
3 -999 -999 -999 -999 -999
4 -999 -999 -999 -999 -999
[5 rows x 22 columns]
In [4]: pacific_df.tail()
High Wind NW
26132 0
26133 0
26134 0
26135 0
26136 0
[5 rows x 22 columns]
2 Tsunami Datasets
In [5]: sources_df = pd.read_csv("All Datasets/seismic-waves/sources.csv", low_memory=False)
waves_df = pd.read_csv("All Datasets/seismic-waves/waves.csv", low_memory=False)
In [6]: sources_df.head()
Out[6]: SOURCE_ID YEAR MONTH DAY HOUR MINUTE CAUSE VALIDITY FOCAL_DEPTH \
0 1 -2000 NaN NaN NaN NaN 1.0 1.0 NaN
1 3 -1610 NaN NaN NaN NaN 6.0 4.0 NaN
2 4 -1365 NaN NaN NaN NaN 1.0 1.0 NaN
3 5 -1300 NaN NaN NaN NaN 0.0 2.0 NaN
4 6 -760 NaN NaN NaN NaN 0.0 2.0 NaN
HOUSE_DESTRUCTION_TOTAL
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
[5 rows x 45 columns]
In [7]: waves_df.tail()
STATE/PROVINCE LOCATION LATITUDE \
26198 NaN D55023 BPR, ETD CORAL SEA -14.8030
26199 NaN TAREKUKURE, TARO ISLAND -6.6928
26200 AYSEN PUERTO MELINKA -43.8983
26201 NaN SUVA, KING'S WHARF -18.1330
26202 NaN TAREKUKURE, TARO ISLAND -6.6928
HOUSE_DESTRUCTION_ESTIMATE
26198 NaN
26199 NaN
26200 NaN
26201 NaN
26202 NaN
[5 rows x 30 columns]
3 Tornadoes Dataset
In [8]: tornadoes_df = pd.read_csv("All Datasets/1950-2017_all_tornadoes.csv", low_memory=False)
tornadoes_df.head()
4 3 1950 1 3 1/3/50 16:00:00 3 OH 39 1 ... 0.1 10 1 1
sg f1 f2 f3 f4 fc
0 1 0 0 0 0 0
1 2 189 0 0 0 0
2 2 119 0 0 0 0
3 1 135 0 0 0 0
4 1 161 0 0 0 0
[5 rows x 29 columns]
In [9]: tornadoes_df.tail()
len wid ns sn sg f1 f2 f3 f4 fc
63676 3.05 600 1 1 1 19 0 0 0 0
63677 4.70 600 1 1 1 19 0 0 0 0
63678 0.17 50 1 1 1 147 0 0 0 0
63679 3.17 16 1 1 1 111 0 0 0 0
63680 3.17 125 1 1 1 199 0 0 0 0
[5 rows x 29 columns]