
Clustering with Spark
ON THE AWID WIFI ATTACK DATASET
Tools used
❑ Apache Spark and sparklyr (R)
Dataset details

❑ The AWID WiFi log dataset comes in two forms:
• Reduced: about 770 MB in one file (this is the one I used for feature selection)
• Full: about 66 GB across roughly 70 files (this is the one I used for clustering)
❑ Both are in .CSV table format
• But the full dataset's ~70 files first need to be merged into one table by concatenating them (or read together, as sketched below)
❑ The dataset contains 155 columns
• The columns are derived from WiFi packet logs
• The columns include a class (normal/attack type) column
❑ Each row is one packet, classified as either normal or an attack
❑ The rows contain MANY NULL values
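As a rough sketch of the loading step: sparklyr's spark_read_csv() accepts a glob path, so the full dataset's files can be read as a single Spark table without hand-concatenation. The master URL, file path, and the "?" null marker below are illustrative assumptions, not taken from the slides:

```r
library(sparklyr)

# Connect to the Spark cluster (master URL is a placeholder).
sc <- spark_connect(master = "spark://master-host:7077")

# A glob path reads all ~70 CSV files of the full dataset as one
# table, avoiding a manual concatenation step.
awid <- spark_read_csv(
  sc,
  name       = "awid",
  path       = "hdfs:///data/awid/full/*.csv",  # assumed location
  header     = TRUE,
  null_value = "?",     # assumed missing-value marker; adjust to the data
  memory     = FALSE    # too large to cache; leave it on disk
)
```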
Pre-processing

❑ The following steps were performed for pre-processing (a sketch follows below):
1. Each column was converted into up to 20 equal-width buckets.
• During bucketing, NULL values were placed into their own bucket
• The buckets were reused later for feature selection
2. The bucketed variables were then converted into one-hot-encoded dummy variables
• This puts both categorical values (including the NULL bucket) and numeric values into a format usable by a clustering algorithm that normally accepts only numeric input
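A minimal sketch of these two steps in sparklyr, for a single illustrative column name (frame_len); it assumes ft_bucketizer's handle_invalid = "keep" option is what routes NULLs into their own bucket, and that a recent sparklyr version (with input_cols/output_cols vectors for the encoder) is in use:

```r
library(sparklyr)
library(dplyr)

# Step 1: equal-width buckets for one column ("frame_len" is
# illustrative); in practice this loops over every kept column.
rng <- awid %>%
  summarise(lo = min(frame_len, na.rm = TRUE),
            hi = max(frame_len, na.rm = TRUE)) %>%
  collect()

splits <- seq(rng$lo, rng$hi, length.out = 21)  # 21 cut points = 20 buckets

bucketed <- awid %>%
  ft_bucketizer(
    input_col      = "frame_len",
    output_col     = "frame_len_bucket",
    splits         = splits,
    handle_invalid = "keep"  # NULL/invalid values get their own extra bucket
  )

# Step 2: one-hot encode the bucket index into dummy variables
# so the clusterer sees only numeric input.
encoded <- bucketed %>%
  ft_one_hot_encoder(
    input_cols  = "frame_len_bucket",
    output_cols = "frame_len_ohe"
  )
```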
Feature selection

❑ Feature selection on the full dataset would have taken over a week of runtime, so the reduced dataset was used instead.
❑ The following steps were performed for feature selection (an entropy sketch follows below):
1. All columns containing only one value were removed
2. All columns containing MAC addresses were removed
• The reduced dataset alone contains over 7,000 MAC addresses (the full dataset likely has many times more), so a one-hot encoding of them would be far too large (over 7,000 dummy variables)
3. The buckets created during pre-processing were examined with frequency tables, and the columns with high entropy were selected
❑ In the end, 18 of the 155 variables were selected
• Some high-entropy variables were not selected because they were essentially duplicates of other high-entropy variables
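A sketch of the entropy screen, assuming the Shannon entropy of each column's bucket frequency table is the selection score (the slides do not spell out the exact scoring, and the column name is illustrative):

```r
library(dplyr)

# Frequency table for one bucketed column, computed in Spark
# and pulled back as a small local table.
freq <- bucketed %>%
  count(frame_len_bucket) %>%
  collect()

# Shannon entropy in bits: near 0 when one bucket dominates
# (uninformative), higher when rows spread across many buckets.
p <- freq$n / sum(freq$n)
entropy <- -sum(p * log2(p))

# Repeating this per column gives a ranking; the top-entropy
# columns are kept, skipping near-duplicates of ones already chosen.
```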
Example 1 of feature selection by entropy
❑ The following variable was selected because its table showed high entropy
Example 2 of feature selection by entropy
❑ The following variable was selected because its table showed high entropy
Example 3 of feature selection by entropy
❑ The following variable was NOT selected because its table showed low entropy
Example 4 of feature selection by entropy
❑ The following variable was NOT selected because its table showed low entropy
K-Means Clustering
❑ Clustering is a type of machine learning algorithm in which the rows of a table are partitioned into K groups based on common characteristics.
• K is a parameter supplied by the user of the clustering algorithm
❑ The K-Means algorithm looks for K groups of similar rows
• Each group is represented by its centroid (the mean of its members); rows are assigned to the centroid at the smallest Euclidean distance, and centroids are recomputed until the assignments stabilize
❑ I chose K-Means because it is sensitive to outliers, and my goal was to cluster outliers (the attacks)
• I also chose it because sparklyr does not offer many clustering algorithms
❑ To run K-Means, all that was needed was to select a K and provide the one-hot-encoded dummy variables to the algorithm (see the sketch below)
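A sketch of the K-Means step in sparklyr, assuming the one-hot columns are first assembled into a single features vector; the two column names stand in for the 18 selected variables, and K = 10 matches the cluster count reported in the conclusion:

```r
library(sparklyr)

# Combine the one-hot dummy columns into one feature vector
# (column names are illustrative placeholders).
input <- encoded %>%
  ft_vector_assembler(
    input_cols = c("frame_len_ohe", "radiotap_len_ohe"),
    output_col = "features"
  )

# Fit K-Means with a user-chosen K.
model <- ml_kmeans(input, k = 10, features_col = "features", seed = 42)

# Attach a cluster label to every packet, for graphing and for
# checking which attacks land in the same cluster.
assigned <- ml_predict(model, input)
```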
Clustering with a cluster of Spark servers
❑ Clustering was performed on a cluster of 3 Linux servers running Spark
• Spark operates using MapReduce-style operations
• Each server had at least 32 GB of memory and 6 cores with 2 threads per core
• The servers are located in the Bigdata Laboratory at CSU
❑ During processing, 250%–1150% CPU utilization per server was common (i.e., roughly 2.5 to 11.5 cores busy at once)
• CPU usage was lower during operations that wrote to the tables
❑ The data was too large to cache in memory, so it was operated on from disk (see the configuration sketch below)
❑ Finding one batch of clusters and graphing them took about 5.5 hours, including the time to pre-process the data
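A sketch of a connection configuration matching the hardware described above, plus explicit disk-only persistence; the memory size, core count, and master URL are assumptions, not values from the slides:

```r
library(sparklyr)

# Hypothetical settings for 3 workers with >= 32 GB RAM and
# 12 hardware threads each; tune to the actual machines.
conf <- spark_config()
conf$spark.executor.memory <- "24g"
conf$spark.executor.cores  <- 12

sc <- spark_connect(master = "spark://master-host:7077", config = conf)

# The full dataset cannot be cached in RAM, so keep it on disk:
# DISK_ONLY avoids evicting memory the clusterer itself needs.
encoded_disk <- sdf_persist(encoded, storage.level = "DISK_ONLY")
```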
Clustering Results
❑ (Ten slides of cluster graphs followed; the figures are not reproduced in this text.)
Conclusion
❑ The AWID WiFi Attack Dataset was clustered into 10 clusters
• Similar attacks ended up in the same clusters
❑ Feature selection was performed using the entropy of the buckets
❑ A lot of pre-processing work was needed before this data could be clustered:
• Bucketization
• One-hot encoding
❑ Spark allowed a large dataset, one that would not otherwise have fit into main memory (on the computers I have access to), to be clustered in a reasonable amount of time.
Thank you.
