ON THE AWID WIFI ATTACK DATASET
Tools used
Dataset details
❑ Feature selection on the full dataset would have taken over a week of
runtime, so the reduced dataset was used instead.
❑ The following steps were performed for feature selection:
1. All columns containing only one value were removed
2. All columns containing MAC addresses were removed
• There are over 7,000 MAC addresses in the reduced dataset alone (and
likely many times more in the full dataset), so a one-hot-encoded
version of them would be too large (it would need over 7,000 dummy variables)
3. All buckets created in the pre-processing step were examined using
tables and the columns with high entropy were selected
❑ In the end, 18 of the 155 variables were selected
• Some high-entropy variables were not selected because they were
essentially duplicates of other high-entropy variables
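The three selection steps above can be sketched in Python with pandas. (The actual pipeline ran on Spark via sparklyr; the column names, MAC-column list, and entropy threshold here are illustrative assumptions, not the author's values.)

```python
import pandas as pd
import numpy as np

def entropy(col: pd.Series) -> float:
    # Shannon entropy (bits) of the column's value distribution
    p = col.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

def select_features(df: pd.DataFrame, mac_cols, threshold=0.5):
    # Step 1: remove columns containing only one value
    df = df.loc[:, df.nunique() > 1]
    # Step 2: remove MAC-address columns (too many levels to one-hot encode)
    df = df.drop(columns=[c for c in mac_cols if c in df.columns])
    # Step 3: keep only columns whose value tables show high entropy
    keep = [c for c in df.columns if entropy(df[c]) > threshold]
    return df[keep]
```

For example, a constant column is dropped in step 1, a MAC column in step 2, and a near-constant column would fail the entropy cut in step 3.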
Example 1 of feature selection by entropy
❑ The following variable was selected because its table showed high entropy
Example 2 of feature selection by entropy
❑ The following variable was selected because its table showed high entropy
Example 3 of feature selection by entropy
❑ The following variable was NOT selected because its table showed low entropy
Example 4 of feature selection by entropy
❑ The following variable was NOT selected because its table showed low entropy
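The selection rule in these examples can be made concrete with a small Python sketch: a variable whose frequency table is spread evenly has high entropy, while one dominated by a single value has entropy near zero. (The data below is toy data, not the actual AWID columns.)

```python
import numpy as np
import pandas as pd

def table_entropy(values) -> float:
    # Shannon entropy (bits) of a column's frequency table
    p = pd.Series(values).value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

# Values spread evenly over four levels -> high entropy -> would be selected
high = table_entropy([0, 1, 2, 3] * 25)   # 2.0 bits
# One dominant value -> low entropy -> would be rejected
low = table_entropy([0] * 99 + [1])       # ~0.08 bits
```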
K-Means Clustering
❑ Clustering is a type of machine learning algorithm in which the rows of a
table are partitioned into K groups based on common characteristics
• Where K is a parameter supplied by the user of the clustering algorithm
❑ The K-Means algorithm assigns each row to the nearest of K cluster centers
• Each center (centroid) is the mean of the rows assigned to it, and
"nearest" is measured by Euclidean distance over the variables
❑ I chose K-Means because it is sensitive to outliers and my goal was to
cluster outliers (the attacks)
• I also chose it because sparklyr offers only a limited set of clustering algorithms
❑ To perform K-Means, all that was needed was to select a K
and provide the one-hot-encoded dummy variables to the algorithm
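A minimal sketch of this step, using pandas and scikit-learn in place of the sparklyr calls actually used (the column names, values, and K below are illustrative assumptions):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy frames standing in for the selected AWID columns (names are made up)
frames = pd.DataFrame({
    "frame_type": ["mgmt", "data", "data", "mgmt", "data", "ctrl"],
    "channel":    ["1", "6", "6", "1", "6", "11"],
})

# One-hot encode the categorical columns into dummy variables
X = pd.get_dummies(frames).astype(float)

# Pick a K and run K-Means; each row is assigned to the nearest centroid
k = 2
model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
labels = model.labels_  # cluster id per row
```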
Clustering with a cluster of Spark servers
❑ Clustering was performed on a cluster of 3 Linux servers running Spark
• Spark operates using Map/Reduce Operations
• Each server had at least 32 GB of memory and 6 cores with 2 threads per core
• The servers are located in the Bigdata Laboratory at CSU
❑ During processing it was common for 250% to 1150% CPU to be in use per
server (i.e., 2.5 to 11.5 cores busy at once)
• There was less CPU usage during operations that wrote to the tables
❑ The data was too large to cache in memory, so it was processed from disk.
❑ It took about 5.5 hours to find one batch of clusters and graph them,
including the time to pre-process the data
Clustering Results
Conclusion