
INDUSTRIAL TRAINING BTIT507

TABLE OF CONTENTS

1. ABSTRACT
2. ACKNOWLEDGEMENT
3. INTRODUCTION
4. WORKING
5. APPLICATIONS
6. FEASIBILITY STUDY
7. SYSTEM DESIGN
8. SETUP GUIDE
9. SYSTEM IMPLEMENTATION
10. CASE STUDIES
11. PROJECT
12. MAINTENANCE
13. CONCLUSION
14. REFERENCES

ABSTRACT

People looking to buy a new home tend to be conservative with their budgets and
market strategies. The existing system involves calculating house prices without
the necessary prediction of future market trends and price increases. The aim of
this project was to develop a real estate web application using machine learning,
with the help of the Jupyter Notebook. The system gives buyers the ability to
search for houses by features or address: when the user searches for a property,
both the original property value and the predicted property value are displayed.
Future prices are predicted by analysing previous market trends and price ranges,
as well as upcoming developments. For the price prediction we use a regression
algorithm. This application will help customers invest in an estate without
approaching an agent, and it also decreases the risk involved in the transaction.
Thus, there is a need to predict efficient house pricing for real estate customers
with respect to their budgets and priorities.

We will be attempting to predict the price of homes in the Seattle area, given a
few data points about each house at the time of sale, such as the number of
bedrooms, the number of bathrooms, the square footage of living space
(sqft_living), etc.

The dataset we will be using has an interesting property: it has relatively few
data points, 20,610 in total, split between 16,488 training samples and 4,122
test samples, and each "feature" in the input data (e.g. the square footage of
living space) has a different scale.

We have 16,488 training samples and 4,122 test samples. The data comprises six
features. The features in the input data are as follows:

1. square footage of living space (sqft_living)
2. square footage of the lot (sqft_lot)
3. number of bedrooms
4. number of bathrooms
5. number of floors
6. zipcode

The target is the sale price of each home, in dollars.

ACKNOWLEDGEMENT

Completing a task is never a one-person effort. It is the result of the direct or
indirect contributions of a number of individuals, which help in shaping and
achieving the result. This acknowledgement is a genuine opportunity to thank all
those people without whose active support this project would not have been possible.

It gives me a great sense of pleasure to present the report of the B. Tech Project
undertaken during the B. Tech. second year. I owe a special debt of gratitude to Dr.
Dinesh Kumar, HOD, Department of Information Technology, DAV Institute of
Engineering & Technology, Jalandhar, for his constant support and guidance
throughout the course of our work. His sincerity, thoroughness and perseverance
have been a constant source of inspiration for me. It is only through his cognizant
efforts that my endeavors have seen the light of day.

I am thankful to Dr. Carlos Guestrin and Dr. Emily Fox, Amazon Professors of
Machine Learning, University of Washington, for providing education and
enhancing skills in various Information Technology related fields through their
online course on the Coursera platform.

Finally, we are thankful to the Almighty God, who gave us the power, good
sense and confidence to complete this project successfully. We also thank our
parents, who were a constant source of encouragement. Their moral support was
indispensable.

Name: GARBHIT GOEL

Roll No. : 1604526



INTRODUCTION

What is machine learning?


 A branch of artificial intelligence, concerned with the design and
development of algorithms that allow computers to evolve behaviors based
on empirical data.

 As intelligence requires knowledge, it is necessary for computers to
acquire knowledge.

 Because of new computing technologies, machine learning today is not like
machine learning of the past. It was born from pattern recognition and the
theory that computers can learn without being programmed to perform
specific tasks; researchers interested in artificial intelligence wanted to see if
computers could learn from data.

 The iterative aspect of machine learning is important because as models are
exposed to new data, they are able to independently adapt. They learn from
previous computations to produce reliable, repeatable decisions and results.
It’s a science that’s not new – but one that has gained fresh momentum.

Why is machine learning important?


 Resurging interest in machine learning is due to the same factors that have
made data mining and Bayesian analysis more popular than ever: growing
volumes and varieties of available data, computational processing that is
cheaper and more powerful, and affordable data storage.

 All of these things mean it's possible to quickly and automatically produce
models that can analyze bigger, more complex data and deliver faster, more
accurate results – even on a very large scale. And by building precise
models, an organization has a better chance of identifying profitable
opportunities – or avoiding unknown risks.

What's required to create good machine learning systems?

 Data preparation capabilities.
 Algorithms – basic and advanced – and automated, iterative processes.
 Scalability.
 Ensemble modeling.

Did you know?


 In machine learning, a target is called a label.
 In statistics, a target is called a dependent variable.
 A variable in statistics is called a feature in machine learning.
 A transformation in statistics is called feature creation in machine learning.

The machine learning pipeline: Data → ML Method → Intelligence

Differences between data mining, machine learning and deep learning

Data mining
Data mining can be considered a superset of many different methods to extract
insights from data. It might involve traditional statistical methods and machine
learning. Data mining applies methods from many different areas to identify
previously unknown patterns from data. This can include statistical algorithms,
machine learning, text analytics, time series analysis and other areas of
analytics. Data mining also includes the study and practice of data storage and
data manipulation.

Machine Learning

The main difference with machine learning is that just like statistical models,
the goal is to understand the structure of the data – fit theoretical distributions
to the data that are well understood. So, with statistical models there is a
theory behind the model that is mathematically proven, but this requires that
data meets certain strong assumptions too. Machine learning has developed
based on the ability to use computers to probe the data for structure, even if
we do not have a theory of what that structure looks like. The test for a
machine learning model is a validation error on new data, not a theoretical
test that proves a null hypothesis. Because machine learning often uses an
iterative approach to learn from data, the learning can be easily automated.
Passes are run through the data until a robust pattern is found.

Deep learning
Deep learning combines advances in computing power and special types of
neural networks to learn complicated patterns in large amounts of data.
Deep learning techniques are currently state of the art for identifying
objects in images and words in sounds. Researchers are now looking to
apply these successes in pattern recognition to more complex tasks such as
automatic language translation, medical diagnoses and numerous other
important social and business problems.

WORKING
Popular machine learning methods
Two of the most widely adopted machine learning methods are supervised
learning and unsupervised learning – but there are also other methods of
machine learning. Here's an overview of the most popular types.

Supervised learning algorithms are trained using labeled examples, such as an
input where the desired output is known. For example, a piece of equipment could
have data points labeled either “F” (failed) or “R” (runs). The learning algorithm
receives a set of inputs along with the corresponding correct outputs, and the
algorithm learns by comparing its actual output with correct outputs to find errors.
It then modifies the model accordingly. Through methods like classification,
regression, prediction and gradient boosting, supervised learning uses patterns to
predict the values of the label on additional unlabeled data. Supervised learning is
commonly used in applications where historical data predicts likely future events.
For example, it can anticipate when credit card transactions are likely to be
fraudulent or which insurance customer is likely to file a claim.
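
To make the idea concrete, here is a minimal illustrative sketch (an addition to
this report, not part of the original text) of supervised learning on equipment
data labeled "F" (failed) or "R" (runs), using scikit-learn; the sensor readings
below are invented for the example:

# Supervised learning sketch: fit a classifier on labeled examples, then
# predict labels for new, unseen readings. All data here is synthetic.
from sklearn.linear_model import LogisticRegression

X_train = [[70, 0.2], [85, 0.9], [60, 0.1], [90, 1.1], [75, 0.3], [88, 1.0]]  # [temp, vibration]
y_train = ["R", "F", "R", "F", "R", "F"]  # the known, correct outputs

model = LogisticRegression()
model.fit(X_train, y_train)  # learns by comparing its outputs with the correct labels

print(model.predict([[65, 0.15], [92, 1.2]]))  # predicts labels for unlabeled readings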

Unsupervised learning is used against data that has no historical labels. The
system is not told the "right answer." The algorithm must figure out what is being
shown. The goal is to explore the data and find some structure within.
Unsupervised learning works well on transactional data. For example, it can
identify segments of customers with similar attributes who can then be treated
similarly in marketing campaigns. Or it can find the main attributes that separate
customer segments from each other. Popular techniques include self-organizing
maps, nearest-neighbor mapping, k-means clustering and singular value
decomposition. These algorithms are also used to segment text topics, recommend
items and identify data outliers.

Semisupervised learning is used for the same applications as supervised learning.
But it uses both labeled and unlabeled data for training – typically a small amount
of labeled data with a large amount of unlabeled data (because unlabeled data is
less expensive and takes less effort to acquire). This type of learning can be used
with methods such as classification, regression and prediction. Semisupervised
learning is useful when the cost associated with labeling is too high to allow for a
fully labeled training process. Early examples of this include identifying a person's
face on a web cam.
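
As a brief illustrative sketch (an addition to this report), scikit-learn's
SelfTrainingClassifier captures this idea: a handful of labeled samples are
combined with many unlabeled ones, which are marked with the label -1. The data
below is synthetic.

# Semi-supervised sketch: only 20 of 200 samples keep their labels.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, random_state=0)
y_partial = np.copy(y)
y_partial[20:] = -1  # -1 marks a sample as "unlabeled"

model = SelfTrainingClassifier(SVC(probability=True))  # base estimator must expose predict_proba
model.fit(X, y_partial)  # iteratively pseudo-labels confident unlabeled points
print(model.predict(X[:5]))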

Reinforcement learning is often used for robotics, gaming and navigation. With
reinforcement learning, the algorithm discovers through trial and error which
actions yield the greatest rewards. This type of learning has three primary
components: the agent (the learner or decision maker), the environment
(everything the agent interacts with) and actions (what the agent can do). The
objective is for the agent to choose actions that maximize the expected reward over
a given amount of time. The agent will reach the goal much faster by following a
good policy. So the goal in reinforcement learning is to learn the best policy.
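
A tiny Q-learning sketch (an illustrative addition; the corridor environment,
reward and hyperparameters are made up for the example) shows the agent,
environment and actions in code:

# Q-learning on a 1-D corridor of 5 cells: the agent learns to walk right
# to reach a reward at the far end.
import random

n_states, actions = 5, [0, 1]          # action 0 = left, 1 = right
Q = [[0.0, 0.0] for _ in range(n_states)]
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

for episode in range(200):
    s = 0
    while s != n_states - 1:           # episode ends at the rightmost cell
        a = random.choice(actions) if random.random() < epsilon else Q[s].index(max(Q[s]))
        s2 = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
        r = 1.0 if s2 == n_states - 1 else 0.0
        # Q-update: move the estimate toward reward + discounted future value
        Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
        s = s2

print([q.index(max(q)) for q in Q])    # learned policy per state: mostly 1 ("go right")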

How it works?
To get the most value from machine learning, you have to know how to pair the
best algorithms with the right tools and processes. SAS (Statistical Analysis
System) combines a rich, sophisticated heritage in statistics and data mining with
new architectural advances to ensure your models run as fast as possible – even in
huge enterprise environments.

Algorithms: SAS graphical user interfaces help you build machine learning
models and implement an iterative machine learning process. You don't have to be
an advanced statistician. Our comprehensive selection of machine learning
algorithms can help you quickly get value from your big data and are included in
many SAS products. SAS machine learning algorithms include:

 Neural networks

 Decision trees

 Random forests

 Associations and sequence discovery

 Gradient boosting and bagging

 Support vector machines

 Nearest-neighbor mapping

 k-means clustering

 Self-organizing maps

 Multivariate adaptive regression splines

 Bayesian networks

 Kernel density estimation

 Principal component analysis

 Singular value decomposition

 Gaussian mixture models

Tools and Processes: As we know by now, it’s not just the algorithms.
Ultimately, the secret to getting the most value from your big data lies in
pairing the best algorithms for the task at hand with:

 Comprehensive data quality and management
 GUIs for building models and process flows
 Interactive data exploration and visualization of model results
 Comparisons of different machine learning models to quickly identify the best one
 Automated ensemble model evaluation to identify the best performers
 Easy model deployment so you can get repeatable, reliable results quickly
 An integrated, end-to-end platform for the automation of the data-to-decision process

APPLICATIONS
Machine Learning Applications in Healthcare
1. Drug Discovery/Manufacturing

Manufacturing or discovering a new drug is an expensive and lengthy process,
as thousands of compounds need to be subjected to a series of tests, and
only a single one might result in a usable drug. Machine learning can
speed up one or more of the steps in this lengthy multi-step process.

Machine Learning Examples in Healthcare for Drug Discovery

 Pfizer is using IBM Watson in its immuno-oncology (a technique that uses the
body’s immune system to help fight cancer) research. This is one of the
most significant uses of IBM Watson for drug discovery. Pfizer has been
using machine learning for years to sift through data to facilitate
research in the areas of drug discovery (particularly combinations of
multiple drugs) and to determine the best participants for a clinical trial.

2. Personalized Treatment/Medication

Imagine that you walk in to visit your doctor with some kind of an ache
in your stomach. You have an MRI, and a computer helps the radiologist
detect problems that could be too small for the human eye to see.
In the end, a computer scans all your health records and family medical
history and compares them to the latest research to advise a treatment protocol
that is particularly tailored to your problem.

Personalized treatment has great potential for growth in the future, and
machine learning could play a vital role in finding which kinds of genetic
markers and genes respond to a particular treatment or medication.
Personalized medication or treatment based on individual health records
paired with analytics is a hot research area, as it provides better disease
assessment. In the future, with increased usage of sensor-integrated devices and
mobile apps offering sophisticated remote monitoring and health-measurement
capabilities, there will be another data deluge that can be used to assess
treatment efficacy. Personalized treatment facilitates health optimization
and also reduces overall healthcare costs.

 A major problem that drug manufacturers often face is that a potential
drug sometimes works only on a small group in a clinical trial, or it could be
considered unsafe because a small percentage of people developed serious
side effects. Genentech, a member of the Roche Group, collaborated with
GNS Healthcare to innovate solutions and treatments using biomedical
data. Genentech will make use of GNS's Reverse Engineering and Forward
Simulation to look for patient response markers based on genes, which
could lead to providing targeted therapies for patients.

Machine Learning Applications in Finance


Machine Learning Examples in Finance for Fraud Detection

You are watching “Game of Thrones” when you get a call from your bank asking
if you have swiped your card for “$X” at a store in your city to buy a gadget. It
was not you who bought the expensive gadget using your card – in fact, the card
has been in your pocket all afternoon. How did the bank flag this purchase as
fraudulent? All thanks to machine learning! Financial fraud costs $80 billion
annually, of which Americans alone are exposed to a risk worth $50 billion per annum.

One of the core machine learning use cases in the banking/finance domain is to
combat fraud. Machine learning is well suited for this use case, as it can scan
through huge amounts of transactional data and identify any unusual
behaviour. Every transaction a customer makes is analysed in real time and given
a fraud score that represents the likelihood of the transaction being fraudulent. If
the fraud score is above a particular threshold, a rejection is triggered
automatically. This would be difficult without the application of machine learning
techniques, as humans cannot review thousands of data points in seconds and
make a decision.
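
A minimal sketch of this scoring-and-threshold logic (illustrative only; the
model object, feature layout and threshold value are assumptions for the example,
not any real bank's system):

# Score each transaction with a trained binary classifier; auto-reject above a threshold.
def review_transactions(model, transactions, threshold=0.9):
    decisions = []
    for tx in transactions:
        fraud_score = model.predict_proba([tx])[0][1]  # probability of the "fraud" class
        decisions.append("REJECT" if fraud_score > threshold else "APPROVE")
    return decisions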

 Citibank has collaborated with the Portugal-based fraud detection company
Feedzai, which works in real time to identify and eliminate fraud in online
and in-person banking by alerting the customer.
 PayPal is using machine learning to fight money laundering. PayPal has
several machine learning tools that compare billions of transactions and
can accurately differentiate between legitimate and fraudulent
transactions among buyers and sellers.

What the future holds for AI and machine learning in banking and finance?

We can expect robots to give sound investing advice as companies like
Betterment and Wealthfront attempt to automate the best practices of
seasoned investors and provide them to customers at far lower costs than
traditional fund managers.

Machine Learning Applications in Retail


Machine learning in retail is more than just the latest trend. Retailers are
implementing big data technologies like Hadoop and Spark to build big data
solutions, and quickly realizing that this is only the start. They need a
solution that can analyse the data in real time and provide valuable insights
that translate into tangible outcomes like repeat purchasing. Machine
learning algorithms process this data intelligently and automate the analysis to
make this ambitious goal possible for retail giants like Amazon, Target,
Alibaba and Walmart.

Machine Learning Examples in Retail for Product Recommendations

According to The Realities of Online Personalisation Report, 42% of retailers are
using personalized product recommendations powered by machine learning technology.
It is no secret that customers always look for personalized shopping experiences,
and these recommendations increase conversion rates for the retailers,
resulting in substantial revenue.

 The moment you start browsing for items on Amazon, you see
recommendations for products you are interested in, such as “Customers Who
Bought this Product Also Bought” and “Customers who viewed this
product also viewed”, as well as specifically tailored product recommendations on
the home page and through email. Amazon uses an artificial neural
network machine learning algorithm to generate these recommendations
for you.
 To make smart personalized recommendations, Alibaba has developed an “E-
commerce Brain” that makes use of real-time online data to build machine
learning models for predicting what customers want and recommending
the relevant products based on their recent order history, bookmarking,
commenting, browsing history, and other actions.


Machine Learning Applications in Travel


Machine Learning Examples in Travel for Dynamic Pricing

How does Uber determine the price of your ride?

How does Uber enable ridesharing by optimally matching you with other passengers
to minimize roundabout routes?

How does Uber minimize the wait time once you book a car?

The answer to all these questions is Machine Learning.

One of Uber’s biggest uses of machine learning comes in the form of surge
pricing, via a machine learning model nicknamed “Geosurge” at Uber. If you are
getting late for a meeting and you need to book an Uber in a crowded area, get
ready to pay twice the normal fare. In 2011, during New Year’s Eve in New
York, Uber charged $37 to $135 for a one-mile journey. Uber leverages predictive
modelling in real time based on traffic patterns, supply and demand, and has
acquired a patent on surge pricing. However, customer backlash against surge
pricing is strong, so Uber is using machine learning to predict where demand will
be high, so that drivers can prepare in advance to meet the demand and surge
pricing can be reduced to a great extent.

Machine Learning Examples in Travel for Sentiment Analysis

According to Amadeus IT group, 90% of American travellers with a smartphone


share their photos and travel experience on social media and review services.
TripAdvisor gets about 280 reviews from travellers every minute. With a large
pool of valuable data from 390 million unique visitors and 435 million reviews,
TripAdvisor analyses this information to enhance its service. Machine learning
techniques at TripAdvisor focus on analysing brand-related reviews.

Machine Learning Applications in Social Media


Machine learning offers the most efficient means of engaging billions of social
media users. From personalizing news feeds to rendering targeted ads, machine
learning is at the heart of all social media platforms, for their own and their
users' benefit. Social media and chat applications have advanced to such an extent
that users no longer pick up the phone or use email to communicate with brands –
they leave a comment on Facebook or Instagram, expecting a speedier reply than
through the traditional channels.

Here are some machine learning examples that you are probably using and loving in
your social media accounts without knowing that these interesting
features are machine learning applications:

 Earlier, Facebook used to prompt users to tag their friends, but nowadays
its artificial neural network machine learning algorithm
identifies familiar faces from your contact list. The ANN algorithm mimics the
structure of the human brain to power facial recognition.
 The professional network LinkedIn knows where you should apply for
your next job, whom you should connect with, and how your skills stack up
against your peers as you search for a new job.

Other important applications of machine learning

 Oil and gas: Finding new energy sources. Analyzing minerals in the ground.
Predicting refinery sensor failure. Streamlining oil distribution to make it more
efficient and cost-effective. The number of machine learning use cases for
this industry is vast – and still expanding.

 Government: Government agencies such as public safety and utilities have a
particular need for machine learning, since they have multiple sources of data that
can be mined for insights. Analyzing sensor data, for example, identifies ways
to increase efficiency and save money. Machine learning can also help
detect fraud and minimize identity theft.

FEASIBILITY STUDY
A feasibility study is a test of a system proposal according to its workability,
impact on the organization, ability to meet user needs, and effective use of
resources. The objective of a feasibility study is not to solve a problem but to
acquire a sense of its scope. During the study, the problem definition is
crystallized and the aspects of the problem to be included in the system are
determined. The initial investigation of the system helped in carrying out an
in-depth study of the existing system, understanding its strengths and weaknesses,
and establishing the requirements for the newly proposed system. The feasibility
study was done in three phases, documented below.

It would be problematic to feed into a neural network values that all take wildly
different ranges. The network might be able to automatically adapt to such
heterogeneous data, but it would definitely make learning more difficult. A
widespread best practice to deal with such data is to do feature-wise normalization:
for each feature in the input data (a column in the input data matrix), you subtract
the mean of the feature and divide by the standard deviation, so that the feature is
centered around 0 and has a unit standard deviation.

Note that the quantities that we use for normalizing the test data have been
computed using the training data. We should never use in our workflow any
quantity computed on the test data, even for something as simple as data
normalization.
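
A short sketch of this feature-wise normalization (an illustrative addition; the
NumPy arrays train_data and test_data are assumed to hold the samples as rows and
the features as columns):

# Feature-wise normalization: center each feature at 0 with unit standard
# deviation, using statistics computed on the TRAINING data only.
import numpy as np

mean = train_data.mean(axis=0)  # per-feature mean from the training set
std = train_data.std(axis=0)    # per-feature standard deviation

train_data = (train_data - mean) / std
test_data = (test_data - mean) / std  # reuse training statistics on the test set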

SYSTEM DESIGN
The most creative and challenging phase of the SDLC is system design. The term design
describes the final system and the process by which it is developed. It includes
construction of programs and program testing. The purpose of the design phase is to
plan a solution to the problem specified by the requirements document. This phase
is the first step in moving from the problem domain to the solution domain.
Starting with what is needed, design takes us towards how to satisfy the needs. The
design of the system is perhaps the most critical factor affecting the quality of the
software. It has a major impact on the later phases, particularly testing and
maintenance. The output of this phase is the design document. This document is
similar to a blueprint or plan for the solution and is used later during implementation,
testing and maintenance. A systematic method is needed to achieve a beneficial result
at the end; it involves starting with a rough idea and developing it through a series
of steps. The steps for successful system development are given below:
 Study the problem completely, because first of all we should know the goal
that has to be achieved.
 Decide what kind of output we require and what kind of input we must give so
that we can get the desired output from the system. This is a very
challenging step of system development.
 According to the output requirements of the system, the structure of the
various databases should be designed.
 Next, we should decide what kind of programs we should develop, which will
lead us to the final goal.
 Then we write these individual programs, which, when later joined together,
solve the problem.
 Then we test these programs and make the necessary corrections in them to
achieve the target of the program.
 At last, all these programs are combined into the complete software package.

The two main objectives which the designer has to bear in mind are:
 How fast the design will do the user's work, given particular hardware
resources.
 The extent to which the design is secure against human errors and
machine malfunctions.

SETUP GUIDE FOR JUPYTER (IPYTHON) NOTEBOOK

System Requirements
 Windows 7+ or Windows Server 2012 R2
 64-bit architecture
 at least 4 GB of RAM
 at least 2 GB of free disk space

Getting started with Python, IPython Notebook & GraphLab Create


It's important to emphasize that this specialization is not about providing training
for a specific software package. The goal of the specialization is for your effort to
be spent on learning the fundamental concepts and algorithms behind machine
learning in a hands-on fashion. These concepts transcend any single package. What
you learn here you can use whether you write code from scratch, use any existing
ML packages out there, or any that may be developed in the future.

The learning approach in this specialization is to start from use cases and then dig
into algorithms and methods, what we call a case-studies approach. We are very
excited about this approach, since it has worked well in several other courses. The
first course is focused on understanding how ML can be used in various cases
studies, and the follow on courses will dig into the details of algorithms and
methods for each of the main ML areas. In the first course, you will not be
implementing algorithms from scratch, but rather building intelligent applications
that use ML. In the subsequent course, we will be implementing and comparing a
wide range of algorithms. To make it easy to implement the use cases we will be
covering, we are recommending a particular set of software tools, but you can
successfully complete the course with other tools out there.

Why Python
In this course, we are going to use the Python programming language to build
several intelligent applications that use machine learning. Python is a simple
scripting language that makes it easy to interact with data. Furthermore, Python has
a wide range of packages that make it easy to get started and build applications,
from the simplest ones to the most complex. Python is widely used in industry, and
is becoming the de facto language for data science in industry. (R is another
alternative language. However, R tends to be significantly less scalable and has
very few deployment tools, so it is seldom used for production code in
industry. It is possible, but highly discouraged, to use R in this specialization.)

We will also use the IPython Notebook in our videos. The IPython Notebook is a
simple interactive environment for programming with Python, which makes it
really easy to share your results. Think about it as a combination of a Python
terminal and a wiki page. Thus, you can combine code, plots and text to explain
what you did. (You are not required to use IPython Notebook in the assignments,
and should have no problem using straight up Python if you prefer.)

Why SFrame & GraphLab Create


There are many excellent machine learning libraries in Python. One of the most
popular ones today is scikit-learn. Similarly, there are many tools for data
manipulation in Python; a popular example is Pandas. However, most of these
tools do not scale to large datasets, including some we will tackle in this
Specialization. In addition, in this specialization, we will cover a wide range of ML
models, feature engineering transformation, and evaluation metrics. With most
existing packages, you will have to install a combination of packages to get the
tools that we need to tackle the use cases in this course. This is possible, but
requires advanced knowledge of Python, which we feel will slow down most
people's learning of the core concepts.

The main goal of this course is to learn core ML concepts, not how to use a
specific software package. Thus, in this course, we recommend you use GraphLab
Create, a package we have been working on for many years now, and has seen an
exciting adoption curve, especially in industry with folks building real
applications. GraphLab Create is a highly scalable machine learning library for
Python, which also includes the SFrame, a highly-scalable library for data
manipulation. A huge advantage of SFrame over Pandas is that with SFrame, you
are not limited to datasets that fit in memory, which allows you to deal with large
datasets, even on a laptop. (The SFrame API is very similar to Pandas' API.) To
download Jupyter on Windows, execute the following steps.

Step 1: Download Anaconda2 v4.0.0



Installing Jupyter on Windows

Jupyter requires Python to be installed (it is based on the Python language). There
are a couple of tools that will automate the installation of Jupyter (and optionally
Python) from a GUI. In this case, we are showing how to install using Anaconda,
which is a Python tool for distributing software. You first have to install Anaconda.
It is available on Windows and Mac environments. Download the executable from
https://www.continuum.io/ (company that produces Anaconda) and run it to install
Anaconda. The software provides a regular installation setup process.

The installation process goes through the regular steps, including making you
agree to the distribution rights license.

The standard Windows installation allows you to decide whether all users on the
machine can run the new software or not. If you are sharing a machine with
different levels of users, then you can decide the appropriate action.

After clicking on Next, it will ask for a destination for the software to reside (I
almost always keep the default paths).

And, most importantly, make sure that the installer registers Anaconda as your
default Python.

Step 2: Install Anaconda

# Run Anaconda2 v4.0.0 installer.


# Double-click the .exe file to install Anaconda and follow the instructions on the
screen.

Step 3: Create conda environment

# Create a new conda environment with Python 2.7.x


conda create -n gl-env python=2.7 anaconda=4.0.0

# Activate the conda environment


activate gl-env

Step 4: Ensure pip version >= 7

# Ensure pip is updated to the latest version


# miniconda users may need to install pip first, using 'conda install pip'
conda update pip

Step 5: Install GraphLab Create



# Install your licensed copy of GraphLab Create


pip install --upgrade --no-cache-dir https://get.graphlab.com/GraphLab-Create/2.1/your registered email address here/your product key here/GraphLab-Create-License.tar.gz

Step 6: Ensure installation of IPython and IPython Notebook

# Install or update IPython and IPython Notebook


conda install ipython-notebook

Where should my files go?


1. Launch the GraphLab Create Launcher.

2. Click "IPYTHON NOTEBOOK" to launch IPython notebook. You will be


greeted with the main page:

3. From the top right, find the button labeled "New▾". Click the button to get a
drop-down menu, and select "Python 2" under the sub-heading "Notebooks." This
should create a new notebook inside the home directory of IPython notebook.

4. In the new notebook, run



import os
print os.getcwd()

to obtain the full path of the home directory of IPython Notebook. This path is
where your files should go. Highlight the path and copy it.

5. Place any files (notebooks and datasets) under the home directory. You may
organize your files using sub-folders.

6. All files and folders placed inside the home folder will appear on the main page.

SYSTEM IMPLEMENTATION AND TESTING

Implementation Issues
The implementation phase of software development is concerned with translating
the design specifications into source code. After the system has been designed
comes the stage of putting it into actual use, known as the implementation of the
system. This involves putting the theoretically designed system to actual practical
use. The primary goal of implementation is to write the source code and the
internal documentation so that conformance of the code to its specifications can
easily be verified, and so that debugging, modification and testing are eased. This
goal can be achieved by making the source code as clear and as straightforward as
possible. Simplicity, elegance and clarity are the hallmarks of good programs,
whereas complexity is an indication of inadequate design and misdirected thinking.
System implementation is a fairly complex and expensive task requiring numerous
interdependent activities. It involves the effort of a number of groups of people:
the users, the programmers, the computer operating staff and so on. This needs
proper planning to carry out the task successfully. Thus it involves the following
activities:
§ Writing and testing of programs individually
§ Testing the system as a whole using live data
§ Training and education of the users and supervisory staff

Source code clarity is enhanced by using structured coding techniques, by an
efficient coding style, by appropriate supporting documents, by efficient internal
comments and by the features provided in modern programming languages.
The following are the structured coding techniques:
1) Single entry, single exit
2) Data encapsulation
3) Using recursion for appropriate problems

Testing
The most important activity at the implementation stage is system testing, with
the objective of validating the system against the design criteria. During the
development cycle, the user was involved in all the phases: analysis, design and
coding. After each phase the user was asked whether he was satisfied with the
output, and the desired rectification was done at that point. During coding, a
bottom-up technique is generally used: first the lower-level modules are coded,
and then they are integrated together. Thus, before implementation comes the
testing of the system. The testing phase involves testing first of separate parts of
the system and then of the system as a whole. Each independent module is
tested first and then the complete system is tested. This is the most important phase
of system development. The user carries out this testing, and test data is also
prepared by the user to check for all possible combinations of correct data as well
as wrong data that should be trapped by the system. So the testing phase consists
of the following steps:
 Unit testing: In the bottom-up coding technique, each module is tested
individually. First the module is tested with some test data that covers all
the possible paths, and then the actual data is fed in to check the results.
 Integration testing: After all the modules are ready and duly tested, they
have to be integrated into the application. This integrated application was
again tested, first with the test data and then with the actual data.
 Parallel testing: The third in the series of tests, before handing the
system over to the user, is the parallel running of the old and the new system.
At this stage, complete and thorough testing is done, and any event that
goes wrong is sorted out. This provides better practical support to the
persons using the system for the first time, who may be uncertain or even
nervous about using it.
During testing, the following aspects were checked (a unit-test sketch follows this list):
1) Clerical procedures for collection and disposal of results
2) Flow of data within the organization
3) Accuracy of report output
4) Software testing, which involves testing all the programs together. This
involves testing the system software utilities being used as well as the specifically
developed application software.
5) Incomplete data formats
6) Halts due to various reasons and the restart procedures
7) Range of items and incorrect formats
8) Invalid combinations of data records
9) Access control mechanisms used to prevent unauthorized access to the system
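
As a hedged illustration of unit testing in this project's context (the helper
function, its validation rule and the test cases are assumptions added for this
report, though the coefficients echo the sqft_model results shown later), a
minimal Python unit test might look like:

# Unit-test sketch for a toy price-prediction helper.
import unittest

def predict_price(sqft_living, intercept=-45687.85, slope=281.25):
    # Toy stand-in for the trained model: price = intercept + slope * sqft.
    if sqft_living <= 0:
        raise ValueError("living area must be positive")
    return intercept + slope * sqft_living

class TestPredictPrice(unittest.TestCase):
    def test_larger_house_costs_more(self):
        self.assertGreater(predict_price(3000), predict_price(1000))

    def test_rejects_non_positive_area(self):
        with self.assertRaises(ValueError):
            predict_price(-50)

if __name__ == "__main__":
    unittest.main()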

CASE STUDIES
Case Study 1: LINEAR REGRESSION
What is linear regression?

Before knowing what is linear regression, let us get ourselves accustomed to


regression. Regression is a method of modelling a target value based on
independent predictors. This method is mostly used for forecasting and finding out
cause and effect relationship between variables. Regression techniques mostly
differ based on the number of independent variables and the type of relationship
between the independent and dependent variables.

Simple linear regression is a type of regression analysis where the number of
independent variables is one and there is a linear relationship between the
independent (x) and dependent (y) variable. In a scatter plot of the data, the
best-fit straight line is the line that models the given data points best. The line
can be modelled by the linear equation shown below.

y = a_0 + a_1 * x

The motive of the linear regression algorithm is to find the best values for a_0 and
a_1. Before moving on to the algorithm, let’s have a look at two important
concepts you must know to better understand linear regression.

Cost Function

The cost function helps us figure out the best possible values for a_0 and a_1
that would provide the best-fit line for the data points. Since we want the best
values for a_0 and a_1, we convert this search problem into a minimization
problem where we minimize the error between the predicted value and the actual
value:

J = (1/n) * Σ (pred_i - y_i)^2  (summing over all n data points)

We choose the above function to minimize. The difference between the predicted
values and the ground truth measures the error. We square the error
difference, sum over all data points, and divide that value by the total number of
data points. This gives the average squared error over all the data points, so
this cost function is also known as the Mean Squared Error (MSE) function.
Now, using this MSE function, we change the values of a_0 and a_1 such that
the MSE value settles at its minimum.

Gradient Descent

The next important concept needed to understand linear regression is gradient


descent. Gradient descent is a method of updating a_0 and a_1 to reduce the cost
function(MSE). The idea is that we start with some values for a_0 and a_1 and
then we change these values iteratively to reduce the cost. Gradient descent helps
us on how to change the values.
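
A minimal sketch of gradient descent for simple linear regression (an
illustrative addition; the data points, learning rate and iteration count are
made up for the example):

# Gradient descent for y = a_0 + a_1 * x, minimizing the MSE cost.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 9.1, 10.8]  # synthetic data, roughly y = 1 + 2x

a_0, a_1 = 0.0, 0.0
lr, n = 0.01, len(x)

for step in range(5000):
    preds = [a_0 + a_1 * xi for xi in x]
    errors = [p - yi for p, yi in zip(preds, y)]
    # Partial derivatives of MSE with respect to a_0 and a_1
    grad_a0 = (2.0 / n) * sum(errors)
    grad_a1 = (2.0 / n) * sum(e * xi for e, xi in zip(errors, x))
    a_0 -= lr * grad_a0  # step against the gradient
    a_1 -= lr * grad_a1

print((a_0, a_1))  # converges near intercept ~1.1 and slope ~2.0 for this data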

Case Study 2: SENTIMENT ANALYSIS


The best businesses understand the sentiment of their customers – what people are
saying, how they’re saying it, and what they mean. Sentiment analysis is the
domain of understanding these emotions with software, and it’s a must-understand
for developers and business leaders in a modern workplace.

As with many other fields, advances in machine learning have brought sentiment
analysis into the foreground of cutting-edge algorithms. Today we use natural
language processing, statistics, and text analysis to extract and identify the
sentiment of text, classifying it into positive, negative, or neutral categories.

Sentiment Analysis Use Cases

 Sentiment Analysis for Brand Monitoring

One of the most well documented uses of Sentiment Analysis is to get a full 360
view of how your brand, product, or company is viewed by your customers and
stakeholders. Widely available media, like product reviews and social, can reveal
key insights about what your business is doing right or wrong. Companies can also
use sentiment analysis to measure the impact of a new product, ad campaign, or
consumer’s response to recent company news on social media. Private companies
like Unamo offer this as a service.

 Sentiment Analysis for Customer Service

Customer service agents often use sentiment analysis to automatically sort


incoming user email into “urgent” or “not urgent” buckets based on the sentiment
of the email, proactively identifying frustrated users. The agent then directs their
time toward resolving the users with the most urgent needs first. As customer
service becomes more and more automated through machine learning,
understanding the sentiment of a given case becomes increasingly important.

 Sentiment Analysis for Market Research and Analysis

Sentiment analysis is used in business intelligence to understand the subjective


reasons why consumers are or are not responding to something (e.g., why are
consumers buying a product? What do they think of the user experience? Did
customer service support meet their expectations?). Sentiment analysis can also be
used in the areas of political science, sociology, and psychology to analyze trends,
ideological bias, opinions, gauge reactions, and so on.

A lot of these applications are already up and running. Bing recently integrated
sentiment analysis into its Multi-Perspective Answers product. Hedge funds are
almost certainly using the technology to predict price fluctuations based on public
sentiment. And companies like CallMiner offer sentiment analysis for customer
interactions as a service.

Sentiment Analysis Algorithms

Sentiment Analysis can be used to quickly analyze the text of research papers,
news articles, social media posts like tweets and more.

Social Sentiment Analysis is an algorithm that is tuned to analyze the sentiment of
social media content, like tweets and status updates. The algorithm takes a string
and returns sentiment ratings for the “positive,” “negative,” and “neutral”
categories. In addition, this algorithm provides a compound result, which is the
general overall sentiment of the string.

The algorithm takes an input string and returns a rating from 0 to 4, which
corresponds to the sentiment being very negative, negative, neutral, positive, or
very positive.
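
As an illustrative sketch (an addition to this report; it uses NLTK's VADER
analyzer, a freely available tool analogous to, but not the same as, the service
described above), scoring the sentiment of a string can look like:

# VADER returns "pos", "neg", "neu" and an overall "compound" score,
# analogous to the per-category ratings plus compound result described above.
# Requires: pip install nltk, plus a one-time lexicon download.
import nltk
nltk.download('vader_lexicon')
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I love this product, it works great!")
print(scores)  # a dict with 'neg', 'neu', 'pos' and 'compound' entries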

Case Study 3: CLUSTERING


Clustering is basically a type of unsupervised learning method. An unsupervised
learning method is a method in which we draw references from datasets consisting
of input data without labeled responses. Generally, it is used as a process to find
meaningful structure, explanatory underlying processes, generative features, and
groupings inherent in a set of examples.

Clustering is the task of dividing the population or data points into a number of
groups such that data points in the same groups are more similar to other data
points in the same group and dissimilar to the data points in other groups. It is
basically a collection of objects on the basis of similarity and dissimilarity between
them.

For example, data points that lie close together in a scatter plot can be grouped
into one single cluster; visually, one can often distinguish the separate clusters
directly.

It is not necessary for clusters to be spherical. Data points may instead be
clustered using the basic concept that each point lies within a given distance
constraint from its cluster center. Various distance measures and techniques are
used for identifying outliers.

Why Clustering ?

Clustering is important because it determines the intrinsic grouping among
the unlabeled data present. There are no universal criteria for a good clustering;
it depends on the user and on what criteria satisfy their need. For instance,
we could be interested in finding representatives for homogeneous groups (data
reduction), finding “natural clusters” and describing their unknown properties
(“natural” data types), finding useful and suitable groupings (“useful” data
classes), or finding unusual data objects (outlier detection). Every clustering
algorithm must make some assumptions about what constitutes the similarity of
points, and different assumptions produce different, and equally valid, clusters.

Clustering Methods :

1. Density-Based Methods: These methods consider clusters to be dense regions
with some similarity, distinct from the sparser regions of the space. These
methods have good accuracy and the ability to merge two clusters. Examples:
DBSCAN (Density-Based Spatial Clustering of Applications with Noise),
OPTICS (Ordering Points To Identify the Clustering Structure), etc.

2. Hierarchical-Based Methods: The clusters formed in this method form a tree-
type structure based on the hierarchy. New clusters are formed using previously
formed ones. It is divided into two categories:
-> Agglomerative (bottom-up approach)
-> Divisive (top-down approach)
Examples: CURE (Clustering Using Representatives), BIRCH (Balanced Iterative
Reducing and Clustering using Hierarchies), etc.

3. Partitioning Methods: These methods partition the objects into k clusters, with
each partition forming one cluster. Each partition is formed by optimizing an
objective criterion or similarity function, for example when distance is the major
parameter. Examples: K-means, CLARANS (Clustering Large Applications based
upon Randomized Search), etc.

4. Grid-Based Methods: In these methods the data space is divided into a finite
number of cells that form a grid-like structure. All the clustering operations done
on these grids are fast and independent of the number of data objects. Examples:
STING (Statistical Information Grid), WaveCluster, CLIQUE (Clustering In Quest).

Clustering Algorithms:

K-means clustering – K-means is the simplest unsupervised learning algorithm
that solves the clustering problem. The algorithm partitions n observations into k
clusters, where each observation belongs to the cluster with the nearest mean,
which serves as a prototype of the cluster (see the sketch below).
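
A short scikit-learn sketch of k-means (an illustrative addition to this report;
the six points below are made up):

# Illustrative k-means example: partition 6 points into k=2 clusters.
from sklearn.cluster import KMeans

points = [[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]]
kmeans = KMeans(n_clusters=2, random_state=0).fit(points)

print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # the two cluster means (prototypes)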

PROJECT: PREDICTION OF HOUSE PRICE THROUGH LINEAR REGRESSION

Fire up GraphLab Create

 import graphlab

Load some house sales data

 sales = graphlab.SFrame('home_data.gl/')
 sales

Exploring the data for housing sales

graphlab.canvas.set_target('ipynb')

sales.show(view="Scatter Plot", x="sqft_living", y="price")



Create a simple regression model of sqft_living to price

train_data, test_data = sales.random_split(.8, seed=0)

Build the regression model

sqft_model = graphlab.linear_regression.create(train_data, target='price', features=['sqft_living'])

Evaluate the simple model

 print test_data['price'].mean()
543054.042563

 print sqft_model.evaluate(test_data)
{'max_error': 4149118.5001014257, 'rmse': 255176.56433446918}

Let's show what our predictions look like

 import matplotlib.pyplot as plt

%matplotlib inline

 plt.plot(test_data['sqft_living'], test_data['price'], '.', test_data['sqft_living'], sqft_model.predict(test_data), '-')

 sqft_model.get('coefficients')

name         index  value           stderr
(intercept)  None   -45687.8487598  5028.55792979
sqft_living  None   281.250692483   2.20956763456

Exploring other features in data

 features = ['bedrooms','bathrooms','sqft_living','sqft_lot','floors','zipcode']
 sales[features].show()

 sales.show(view='BoxWhisker Plot',x='zipcode',y='price')

Build a regression model with more features

 features_model = graphlab.linear_regression.create(train_data, target='price', features=features)
Linear regression:
--------------------------------------------------------
Number of examples          : 16488
Number of features          : 6
Number of unpacked features : 6
Number of coefficients      : 115
Starting Newton Method
--------------------------------------------------------
| Iteration | Passes | Elapsed Time | Training-max_error | Validation-max_error | Training-rmse | Validation-rmse |
|         1 |      2 |     0.062400 |     2593719.371933 |       3834931.820422 | 180540.247964 |   206634.094708 |

 print sqft_model.evaluate(test_data)

print features_model.evaluate(test_data)
{'max_error': 4149118.5001014257, 'rmse': 255176.56433446918}
{'max_error': 3543180.6009335285, 'rmse': 179779.09588756037}

Apply learned models to predict prices of 2 houses

 house1 = sales[sales['id']=='5309101200']
 house1

 print house1['price']
[620000L, ... ]

 print sqft_model.predict(house1)
[629313.8131997312]

 print features_model.predict(house1)
[722914.1214590659]

Prediction for a second, fancier house

 house2 = sales[sales['id']=='1925069082']
 house2

 print sqft_model.predict(house2)
[1259315.364362002]

 print features_model.predict(house2)
[1460311.9229361552]

MAINTENANCE

Maintenance Environment
The proper maintenance of the new system is very important for its smooth
working. The maintenance of the software is to be done by the system analysts and
programmers in the organization, but for hardware maintenance an engineer may
be called in from the vendor from which the hardware was purchased.

Operations and Maintenance Phase


The system operation is ongoing. The system is monitored for continued
performance in accordance with user requirements and needed system
modifications are incorporated. Operations continue as long as the system can be
effectively adapted to respond to the organization’s needs. When modifications or
changes are identified, the system may reenter the planning phase. The purpose of
this phase is to:
 Operate, maintain, and enhance the system.
 Certify that the system can process sensitive information.
 Conduct periodic assessments of the system to ensure the functional
requirements continue to be satisfied.
 Determine when the system needs to be modernized, replaced or retired.

CONCLUSION
Here’s what you should take away from this example:

 Regression is done using different loss functions than
classification; Mean Squared Error (MSE) is a commonly used loss
function for regression.
 Similarly, evaluation metrics to be used for regression differ from
those used for classification; naturally the concept of “accuracy”
does not apply for regression. A common regression metric is
Mean Absolute Error (MAE).
 When features in the input data have values in different ranges,
each feature should be scaled independently as a preprocessing
step.
 When there is little data available, using Newton’s method with
GraphLab Create is a great way to reliably evaluate a model.
 When little training data is available, it is preferable to use a small
network with very few hidden layers (typically only one or two), in
order to avoid severe overfitting.

REFERENCES
www.google.com

www.coursera.org

www.geeksforgeeks.org

www.sas.com
