Pyspark Material


PySpark MLlib Tutorial : Machine Learning with PySpark
Kurt
Published on Mar 11, 2019
Machine learning has seen many recent developments and grows more popular by the day. People from all domains, including Computer Science, Mathematics, and Management, use it in their projects to find hidden information in data. It was only a matter of time before Apache Spark jumped into the game of machine learning with Python, using its MLlib library. Many now prefer this combination of Python and Apache Spark over Scala for Spark, which has made PySpark Certification a widely sought-after skill in the market today. So, in this PySpark MLlib Tutorial, I’ll be discussing the following topics:
What is Machine Learning?
What is PySpark MLlib?
Machine Learning (Python) Industrial Use Cases
Machine Learning Lifecycle
PySpark MLlib Features and Algorithms
Finding Hackers with PySpark MLlib
Customer Churn Prediction with PySpark MLlib

What is Machine Learning?
Machine learning is a method of data analysis that automates analytical model building. Using algorithms that iteratively learn from data, machine learning allows computers to find hidden insights without being explicitly programmed where to look. It focuses on the development of computer programs that can teach themselves to grow and change when exposed to new data. Machine learning uses the data to detect patterns in a dataset and adjust program actions accordingly.
Most industries working with large amounts of data have recognized the value of machine learning technology. By gleaning insights from this data, often in real time, organizations are able to work more efficiently or gain an advantage over competitors. To know more about machine learning and its various types, you can refer to the What is Machine Learning? blog.
Now that you have a brief idea of what machine learning is, let’s move forward with this PySpark MLlib Tutorial blog and understand what MLlib is and what its features are.
What is PySpark MLlib?
PySpark MLlib is a machine-learning library. It is a wrapper over PySpark Core to do data analysis
using machine-learning algorithms. It works on distributed systems and is scalable. We can find
implementations of classification, clustering, linear regression, and other machine-learning
algorithms in PySpark MLlib.
Machine Learning (Python) Industrial Use Cases
Machine learning algorithms, applications, and platforms are helping manufacturers find new business models, fine-tune product quality, and optimize manufacturing operations down to the shop-floor level. So let’s continue our PySpark MLlib Tutorial and understand how various industries are using machine learning.
Government:
Government agencies such as public safety and utilities have a particular need for
machine learning. They use it for face detection, security and fraud
detection. Public sector agencies are making use of machine learning for
government initiatives to gain vital insights into policy data.
Marketing and E-commerce:
The number of purchases made online is steadily increasing, which allows companies to gather detailed data on the whole customer experience. Websites recommending items you might like based on previous purchases are using machine learning to analyze your buying history and promote other items you’d be interested in.
Transportation:
Analyzing data to identify patterns and trends is key to the transportation industry, which relies on making routes more efficient and predicting potential problems to increase profitability. Companies use ML to enable an efficient ride-sharing marketplace, identify suspicious or fraudulent accounts, and suggest optimal pickup and drop-off points.
Finance:
Today, machine learning has come to play an integral role in many phases of the
financial ecosystem, from approving loans to managing assets, to assessing
risks. Banks and other businesses in the financial industry use machine learning
technology to prevent fraud.
Healthcare:
Machine learning is a fast-growing trend in the healthcare industry, thanks to the advent of wearable devices and sensors that can use data to assess a patient’s health in real time. Google has developed a machine learning algorithm to help identify cancerous tumors on mammograms. Stanford is using a deep learning algorithm to identify skin cancer.
Now that you have an idea of what machine learning is and where it is used across industries, let’s continue our PySpark MLlib Tutorial and understand what a typical Machine Learning Lifecycle looks like.
Machine Learning Lifecycle
A typical machine learning cycle involves two major phases:
Training
Testing
In machine learning, we try to create a model that predicts well on data it has not seen. So, we use the training data to fit the model and the testing data to test it. The dataset is therefore divided into a training set and a test set, so that metrics such as accuracy and precision can be checked on data that was held out from training.
1. Training Set: Here, you have the complete training dataset. You can extract features from it and train a model to fit them.
2. Testing Set: Here, once the model is obtained, you use the model fitted on the training set to predict on the test set and compare the predictions with the known results.
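To make the two phases concrete, here is a minimal plain-Python sketch of a train/test split. It is only an illustration: the train_test_split helper, the 70/30 ratio, the seed, and the toy data are assumptions made for this example and are not part of MLlib.

```python
import random


def train_test_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of rows and split it into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: (feature, label) pairs.
data = [(x, int(x > 5)) for x in range(10)]
train, test = train_test_split(data)

# The model would be fitted on `train` only and then scored on `test`.
print(len(train), "training rows,", len(test), "testing rows")
```

Fixing the seed keeps the split reproducible, which makes it easier to compare models trained at different times.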
Now that you have an idea of how a typical Machine Learning Lifecycle works, let’s move forward with our PySpark MLlib Tutorial blog and look at MLlib’s features and algorithms.

MLlib Features and Algorithms
We know that PySpark is good for iterative algorithms. Many machine-learning algorithms are iterative, and they have been implemented on top of it in PySpark MLlib. Apart from PySpark’s efficiency and scalability, the PySpark MLlib APIs are very user-friendly.


Finding Hackers with MLlib
A company’s system was hacked and a lot of data was stolen. Fortunately, metadata for each session the hackers used to connect was recorded and is available to us. There are 3 potential hackers, or even more.
A common practice among hackers is to trade off jobs. This means that each hacker does roughly the same number of hacks. So here we are going to use clustering to find out the number of hackers.
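Before turning to Spark, it may help to see what clustering does on a tiny scale. The following is a minimal plain-Python sketch of the k-means idea (Lloyd's algorithm), not MLlib's implementation: the kmeans helper, the sample points, and the starting centroids are all made up for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, centroids, iterations=10):
    """Alternate between assigning points and recomputing centroids."""
    clusters = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        centroids = [
            tuple(sum(axis) / len(members) for axis in zip(*members))
            if members else old
            for members, old in zip(clusters, centroids)
        ]
    return centroids, clusters


# Hypothetical 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (8.0, 8.0)])
print(centroids)
print([len(c) for c in clusters])
```

The number of clusters is fixed up front by the number of starting centroids, which is why the tutorial has to compare different values of K.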

Initializing Spark Session
First, we need to initialize a Spark session.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('find hacker').getOrCreate()
Importing KMeans Library and Loading the Dataset
We will be using the KMeans algorithm for our analysis. For that, we need to import the KMeans class, and then we’ll load our dataset with the spark.read method.
    from pyspark.ml.clustering import KMeans

    dataset = spark.read.csv("file:///home/edureka/Downloads/hack data.csv",head
Schema of the Data retrieved
Let’s have a look at the schema of data to get a better understanding of what we are dealing with.
    dataset.printSchema()
Importing VectorAssembler and creating our Features
We must transform our data using the VectorAssembler function to a single column where each row
of the DataFrame contains a feature vector. In order to create our clusters, we need to select columns
based on which we will then create our features column. Here we are using the columns:
Session_Connection_Time
Bytes Transferred
Kali_Trace_Used

Servers_Corrupted
Pages_Corrupted
WPM_Typing_Speed : Words Per Minute
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.feature import VectorAssembler

    # Columns to combine into a single feature vector
    feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
                 'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed']

    vec_assembler = VectorAssembler(inputCols=feat_cols, outputCol='features')

    # Adds a 'features' column holding one vector per row
    final_data = vec_assembler.transform(dataset)
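Conceptually, assembling features just means collecting the chosen columns of each row into one vector. The plain-Python sketch below illustrates that idea for a single row; the assemble helper, the sample values, and the extra Location field are hypothetical and are not taken from the dataset.

```python
feat_cols = ['Session_Connection_Time', 'Bytes Transferred', 'Kali_Trace_Used',
             'Servers_Corrupted', 'Pages_Corrupted', 'WPM_Typing_Speed']

# Hypothetical session record; the values are made up for illustration.
session = {
    'Session_Connection_Time': 8.0,
    'Bytes Transferred': 391.09,
    'Kali_Trace_Used': 1,
    'Servers_Corrupted': 2.96,
    'Pages_Corrupted': 7.0,
    'WPM_Typing_Speed': 72.37,
    'Location': 'Somewhere',  # hypothetical non-feature field, left out
}


def assemble(row, cols):
    """Collect the chosen columns of one row into a single list of floats."""
    return [float(row[c]) for c in cols]


features = assemble(session, feat_cols)
print(features)
```

Only the columns named in feat_cols end up in the vector, in the order they are listed.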
Importing the StandardScaler Library and Creating Scaler
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. The mean and standard deviation are then stored so that the transform method can apply them to later data.
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not look more or less like standard normally distributed data.
    from pyspark.ml.feature import StandardScaler

    scaler = StandardScaler(inputCol="features", outputCol="scaledFeatures", wit
Computing Summary Statistics
Let’s compute summary statistics by fitting the StandardScaler, and then normalize each feature to have unit standard deviation.
    scalerModel = scaler.fit(final_data)

    cluster_final_data = scalerModel.transform(final_data)

    kmeans3 = KMeans(featuresCol='scaledFeatures', k=3)
    kmeans2 = KMeans(featuresCol='scaledFeatures', k=2)
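To see what scaling to unit standard deviation does to one feature, here is a small plain-Python sketch. It assumes the scaler only divides by the standard deviation and does not center the data; the exact settings used in this tutorial cannot be read from the scaler call. The scale_to_unit_std helper and the sample values are made up for illustration.

```python
import statistics


def scale_to_unit_std(column):
    """Divide every value by the column's sample standard deviation."""
    std = statistics.stdev(column)  # sample standard deviation (n - 1)
    if std == 0:
        # A constant column carries no information; map it to zeros here.
        return [0.0 for _ in column]
    return [x / std for x in column]


# Hypothetical raw values of one feature.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = scale_to_unit_std(values)

print(statistics.stdev(values))  # spread before scaling
print(statistics.stdev(scaled))  # spread after scaling, approximately 1
```

After scaling, every feature varies on a comparable scale, so no single column dominates the distances that k-means computes.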
Building KMeans Model and Calculating WSSSE (Within Set Sum of Squared Errors)
We first have to build our model, passing the desired number of clusters to the algorithm. We then compute the Within Set Sum of Squared Errors (WSSSE) and use the values to figure out whether we have 2 or 3 hackers.
    model_k3 = kmeans3.fit(cluster_final_data)
    model_k2 = kmeans2.fit(cluster_final_data)

    # computeCost is the Spark 2.x API; it was removed in Spark 3.0
    wssse_k3 = model_k3.computeCost(cluster_final_data)
    wssse_k2 = model_k2.computeCost(cluster_final_data)

    print("With K=3")
    print("Within Set Sum of Squared Errors = " + str(wssse_k3))
    print('--'*30)
    print("With K=2")
    print("Within Set Sum of Squared Errors = " + str(wssse_k2))
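For intuition, WSSSE is the sum, over all points, of the squared distance from each point to its nearest cluster center. The plain-Python sketch below computes it by hand; the wssse helper, the points, and the centroids are hypothetical and only illustrate the quantity that the Spark call reports.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def wssse(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)


# Hypothetical points and cluster centers.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
two_centers = [(0.0, 1.0), (10.0, 1.0)]
one_center = [(5.0, 1.0)]

print(wssse(points, two_centers))  # 4.0
print(wssse(points, one_center))   # 104.0
```

Adding centers can only keep the value the same or lower it, which is why the raw number alone cannot tell us the right K.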
Checking the Elbow Point (WSSSE)
We’ll check the values of WSSSE for 2 to 8 and see if we have an elbow in the list.
    for k in range(2, 9):
        kmeans = KMeans(featuresCol='scaledFeatures', k=k)
        model = kmeans.fit(cluster_final_data)
        wssse = model.computeCost(cluster_final_data)
        print("With K={}".format(k))
        print("Within Set Sum of Squared Errors = " + str(wssse))
        print('--'*30)

Here we can see that the value of WSSSE decreases continuously and there is no elbow. So most probably the value of K is 2 and not 3. Let’s continue this PySpark MLlib Tutorial and get to the verdict.
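One simple way to look for an elbow is to compare how much the WSSSE drops, in relative terms, each time K grows by one. The sketch below is a plain-Python illustration; the relative_drops helper and the WSSSE values are hypothetical and are not the numbers produced by the dataset above.

```python
# Hypothetical WSSSE values keyed by K (not from the tutorial's data).
wssse_by_k = {2: 800.0, 3: 600.0, 4: 450.0, 5: 340.0}


def relative_drops(costs):
    """Fraction by which the cost falls when moving to each larger K."""
    ks = sorted(costs)
    return {k: (costs[prev] - costs[k]) / costs[prev]
            for prev, k in zip(ks, ks[1:])}


drops = relative_drops(wssse_by_k)
for k in sorted(drops):
    print("K={}: WSSSE fell by {:.1%}".format(k, drops[k]))
```

If the drops stay roughly equal from one K to the next, as in these made-up numbers, there is no clear elbow, and the choice of K needs another check.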
Final Check for the Number of Hackers
Let’s find out how many hackers were involved based on the number of hacks done.
    model_k3.transform(cluster_final_data).groupBy('prediction').count().show()
    model_k2.transform(cluster_final_data).groupBy('prediction').count().show()
So, here we can see that for 3 hackers, our model has produced clusters of 167, 79, and 88 hacks. This is unlikely, as hackers usually divide the tasks evenly between them. With K = 2, we get 167 hacks for each of the two hackers. Hence, there were only 2 hackers involved.
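The even-split reasoning can be written down as a small check. This plain-Python sketch uses the cluster sizes quoted above; the is_even_split helper and its 5% tolerance are assumptions chosen for illustration.

```python
def is_even_split(sizes, tolerance=0.05):
    """True if every cluster size is within `tolerance` of the mean size."""
    mean = sum(sizes) / len(sizes)
    return all(abs(size - mean) <= tolerance * mean for size in sizes)


sizes_k3 = [167, 79, 88]  # cluster sizes reported for K = 3
sizes_k2 = [167, 167]     # cluster sizes reported for K = 2

print("K=3 even split:", is_even_split(sizes_k3))
print("K=2 even split:", is_even_split(sizes_k2))
```

Only the K = 2 clustering passes the check, which matches the conclusion that two hackers shared the work.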
Let’s continue our PySpark MLlib Tutorial blog and solve another problem faced by many companies, i.e., customer churn.
Customer Churn Prediction with MLlib
Churn prediction is big business. It minimizes customer defection by predicting which customers are likely to cancel a subscription to a service. Though originally used within the telecommunications industry, it has become common practice across banks, ISPs, insurance firms, and other verticals.
The prediction process is heavily data-driven and often utilizes advanced machine learning techniques. Here, we’ll take a look at what types of customer data are typically used, do some preliminary analysis of the data, and generate churn prediction models, all with PySpark and its machine learning frameworks.
A marketing agency has many customers that use its service to produce ads for their websites. The agency has noticed quite a bit of churn among its clients. Right now it assigns account managers more or less at random, but it wants you to create a machine learning model that predicts which customers will churn (stop buying the service), so that the customers most at risk of churning can be assigned an account manager. Luckily, the agency has some historical data.
So, can you help them out?
Loading the libraries
Let’s load up the required libraries. Here we are going to use Logistic Regression.
    from pyspark.ml.classification import LogisticRegression
Reading the training and testing Data

Let’s load up the training data and the testing data (incoming data for testing purposes)
    input_data=spark.read.csv('file:///home/edureka/Downloads/customer_churn.csv

    test_data=spark.read.csv('file:///home/edureka/Downloads/new customers.csv',
Schema of Data

We are going to have a look at the schema of data to get a better understanding of what we are dealing
with.
    input_data.printSchema()  # training data
Here we have the column Churn. Let’s see the schema of the testing data as well.
    test_data.printSchema()  # testing data
Using VectorAssembler
    from pyspark.ml.linalg import Vectors
    from pyspark.ml.feature import VectorAssembler

    assembler=VectorAssembler(inputCols=['Age','Total_Purchase','Account_Manager

    output_data=assembler.transform(input_data)
Schema of Output Data
Let’s have a look at the schema of the output data.
    output_data.printSchema()
As you can see, we now have a features column, on which the classification will be based.
Using Logistic Regression on the data
    final_data = output_data.select('features', 'churn')  # creating final

    train, test = final_data.randomSplit([0.7, 0.3])  # splitting data

    model = LogisticRegression(labelCol='churn')  # creating model

    model = model.fit(train)  # fitting model on training dataset

    summary = model.summary

    summary.predictions.describe().show()  # summary of the predictions
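For readers who want to see what fitting a logistic regression involves, here is a minimal plain-Python sketch with a single feature, trained by batch gradient descent. It is not how MLlib fits its model: the fit_logistic helper, the learning rate, the epoch count, and the toy data are all assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) with batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        w -= lr * sum(e * x for e, x in zip(errors, xs)) / n
        b -= lr * sum(errors) / n
    return w, b


# Hypothetical data: one feature, label 1 means the customer churned.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(xs, ys)


def predict(x):
    """Return 1 if the predicted churn probability is at least 0.5."""
    return int(sigmoid(w * x + b) >= 0.5)


print([predict(x) for x in xs])
```

The model outputs a probability, and the 0.5 threshold in predict turns it into the hard 0/1 label that ends up in a prediction column.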
Importing BinaryClassificationEvaluator Library and Testing
The evaluation of binary classifiers compares two methods of assigning a binary attribute, one of which is usually a standard method while the other is being investigated. There are many metrics that can be used to measure the performance of a classifier or predictor; different fields prefer different metrics because their goals differ.
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    predictions = model.evaluate(test)
Next, we’ll create an evaluator and use the Binary Classification Evaluator to evaluate the churn predictions.
    evaluator=BinaryClassificationEvaluator(rawPredictionCol='prediction',labelC

    evaluator.evaluate(predictions.predictions)

    # Refit on all of the labeled data before scoring the new customers
    model1=LogisticRegression(labelCol='churn')
    model1=model1.fit(final_data)

    test_data=assembler.transform(test_data)
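By default, BinaryClassificationEvaluator reports the area under the ROC curve. One way to read that number is as the chance that a randomly chosen churner is scored higher than a randomly chosen non-churner, with ties counted as half. The plain-Python sketch below computes it that way; the area_under_roc helper and the sample scores and labels are hypothetical, and this is not Spark's implementation.

```python
def area_under_roc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(positives) * len(negatives))


# Hypothetical scores and true labels (1 = churned).
labels = [1, 1, 1, 0, 0]
probabilities = [0.9, 0.8, 0.4, 0.6, 0.2]
hard_predictions = [1, 1, 0, 0, 0]

print(area_under_roc(probabilities, labels))
print(area_under_roc(hard_predictions, labels))
```

Passing hard 0/1 predictions, as the evaluator call above does, still yields a valid number, but it reflects a single threshold rather than the full ranking of customers.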

Finding Results
Now we are going to use the model to make predictions on the new data.
    results=model1.transform(test_data)

    results.select('Company','prediction').show()
So here we can see the potential clients that may leave the organization. With this analysis, we come to the end of this PySpark MLlib Tutorial blog.
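Once the new customers are scored, the agency only needs the ones flagged as likely to churn. The plain-Python sketch below shows that final filtering step on made-up rows; the company names and predictions are hypothetical and do not come from the dataset.

```python
# Hypothetical scored rows; company names and predictions are made up.
results = [
    {'Company': 'Example Co A', 'prediction': 0.0},
    {'Company': 'Example Co B', 'prediction': 1.0},
    {'Company': 'Example Co C', 'prediction': 1.0},
    {'Company': 'Example Co D', 'prediction': 0.0},
]

# Keep only the companies the model flags as likely to churn.
at_risk = [row['Company'] for row in results if row['prediction'] == 1.0]

print(at_risk)  # ['Example Co B', 'Example Co C']
```

These are the customers that would be assigned an account manager first.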
I hope you enjoyed this PySpark MLlib Tutorial blog. If you are reading this, congratulations! You are no longer a newbie to PySpark MLlib. Try out these simple examples on your systems now.
Now that you have understood the basics of PySpark MLlib, check out the Python Spark Certification Training using PySpark by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. Edureka’s Python Spark Certification Training using PySpark is designed to provide you with the knowledge and skills that are required to become a successful Spark Developer using Python and prepare you for the Cloudera Hadoop and Spark Developer Certification Exam (CCA175).
Got a question for us? Please mention it in the comments section and we will get back to you.
About Author
Kurt
Kurt is a Big Data and Data Science Expert, working as a Research Analyst at
Edureka. He is keen to work with Machine Learning, Big Data and Cloud
Technologies like R, Hadoop, Spark, Cassandra, and much more.

© 2014 Brain4ce Education Solutions Pvt. Ltd.
