
InsideBIGDATA Guide to

Predictive Analytics
by Daniel D. Gutierrez


Contents

Predictive Analytics Defined
The History of Predictive Analytics
Business Uses of Predictive Analytics
Classes of Predictive Analytics
Predictive Analytics Software
R as the Choice for Predictive Analytics
Data Access for Predictive Analytics
Exploratory Data Analysis (EDA)
Predictive Modeling
Production Deployment
Conclusion

Predictive Analytics Defined

Predictive analytics, sometimes called advanced analytics, is a term used to describe a range of analytical and statistical techniques to predict future actions or behaviors. In business, predictive analytics is used to make proactive decisions and determine actions, by using statistical models to discover patterns in historical and transactional data to uncover likely risks and opportunities.

Predictive analytics incorporates a range of activities which we will explore in this paper, including data access, exploratory data analysis and visualization, developing assumptions and data models, applying predictive models, then estimating and/or predicting future outcomes.

The History of Predictive Analytics


Modern day predictive analytics has its origins in the 1940s, when governments started using the first computational models — Monte Carlo simulations, computational models for neural networks, and linear programming — for decoding German messages in WWII, automating the targeting of anti-aircraft weapons against enemy planes, and running computer simulations to predict the behavior of nuclear chain reactions for the Manhattan Project. In the 1960s, corporations and research institutions began the era of commercializing analytics with nonlinear programming and computer-based heuristic problem solving — for the first models to forecast the weather, solving the "shortest path problem" to improve air travel and logistics, and applying predictive modeling to credit risk decisions. Then in the 1970s–1990s analytics was used more broadly in organizations, and tech start-ups made real-time and prescriptive analytics a reality. However, predictive analytics mainly remained in the hands of corporate statisticians, brought to the business only in static, batch-driven reports.

Today predictive analytics has finally arrived in the corporate mainstream, being used by everyday business users for a broadening set of use cases. This growing phenomenon has been driven by the realities of a global economy, i.e., organizations continually looking for competitive advantage, and enabled by strong technological innovations.

These technological innovations include more scalable computing power, relational databases, new Big Data technologies such as Hadoop, and self-service analytics software that puts data and predictive models in the hands of front-line decision makers. All of this has allowed organizations to compete based on analytic innovation.

First, organizations embraced simple data discovery analytics to understand the state of their business, and to look deeply into the data to understand the "whys" behind it. But as they become more comfortable with the data, they see the opportunity to further outperform their competitors using advanced analytics.


Business Uses of Predictive Analytics


The need for predictive analytics in the enterprise is clear, as it can provide smarter analysis for better decision making, increased market competitiveness, a direct path to taking advantage of market opportunities and threats, a way to reduce uncertainty and manage risk, an approach to proactively plan and act, discovery of meaningful patterns, and the means to anticipate and react to emerging trends.

Advanced quantitative analysis has demonstrated benefits in a wide cross section of industries and domains. Many classes of business problems can be solved with predictive analytics; here are just a few:

• Sales forecasting – predict what you should expect to book this month or this quarter, taking into account your historical conversion rates, i.e. what your sales team's winning percentage on similar opportunities has been in the past, coupled with your current sales pipeline, i.e. the number of opportunities your team is working on within this time window (a simple sketch of this calculation follows this list).

• Fraud detection – find inaccurate credit applications, fraudulent transactions both offline and online, identity theft, false insurance claims, etc.

• Retail campaign optimization – allows marketers to model outcomes of campaigns based on a deep analysis of customer behaviors, preferences, and profile data.

• Marketing & customer analytics – collect data from digital marketing, social media, call centers, mobile apps, etc. and use the information on what customers have done in the past to gauge what they may do in the future.

• HR analytics – enable organizations to analyze the past and look forward to spot trends in key factors related to voluntary termination, absences and other sources of risk, as well as identifying trends in required skill sets vs. current resources.

• Risk management – predict the best portfolio to maximize return in capital asset pricing models, and apply probabilistic risk assessment to yield accurate forecasts.
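As a concrete illustration of the sales forecasting use case, the sketch below computes an expected-bookings figure by weighting each open opportunity by a stage-based historical win rate. The data frame, stage names, win rates, and deal amounts are all invented for illustration; they are not drawn from the guide.

```r
# Hypothetical pipeline: open opportunities with deal size and sales stage
pipeline <- data.frame(
  opportunity = c("A", "B", "C", "D"),
  amount      = c(50000, 120000, 75000, 30000),
  stage       = c("demo", "proposal", "negotiation", "demo"),
  stringsAsFactors = FALSE
)

# Historical conversion (win) rates by stage, estimated from past closed deals
win_rate <- c(demo = 0.20, proposal = 0.45, negotiation = 0.70)

# Expected bookings = deal amount weighted by the chance of winning that deal
pipeline$expected <- pipeline$amount * win_rate[pipeline$stage]
sum(pipeline$expected)   # forecasted bookings for this period
```

The same weighted-sum idea underlies more formal forecasting models; a regression fit on historical opportunities would simply learn the weights from data instead of tabulating them by stage.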

Predictive analytics for sales forecasting delivers targeted, relevant predictions to a broad spectrum of business users to improve decision making. Image courtesy of TIBCO Spotfire.


Classes of Predictive Analytics


At its core, predictive analytics relies on capturing relationships between past data points, and using those relationships to predict future outcomes. In order to make predictions based on a given data set, one or more predictor variables are used to predict a response variable. In its simplest form, predictive analytics assists with developing forecasts for business decision making. To handle more complex requirements, advanced predictive analytics techniques are applied to drive critical business processes. In this section we will provide a high-level view of the primary classes of predictive analytics: supervised learning and unsupervised learning.

Supervised learning is divided into two broad categories: regression for responses that are quantitative (a numeric value), such as miles per gallon for a particular car, and classification for responses that can take just a few known values, such as 'true' or 'false'. Both are illustrated with a short R sketch at the end of this subsection.

• Regression – Regression is the most common form of predictive analytics. With regression, there is a quantitative response variable (what you're trying to predict), like the sale price of a home, based on a series of predictor variables such as the number of square feet, number of bedrooms, and average income in the neighborhood according to census data. The relationship between sale price and the predictors in the training set would provide a predictive model. There are many types of regression methods, including multivariate linear regression, polynomial regression, and regression trees, to mention a few.

• Classification – Classification is another popular type of predictive analytics. With classification, there is a categorical response variable, such as income bracket, which could be partitioned into three classes or categories: high income, middle income, and low income. The classifier examines a data set where each observation contains information on the response variable as well as the predictor variables. For example, suppose an analyst would like to be able to classify the income brackets of persons not in the data set, based on characteristics associated with each person, such as age, gender, and occupation. This classification task would proceed as follows: examine the data set containing both the predictor variables and the already classified response variable, income bracket. In this way, the algorithm learns which combinations of variables are associated with which income brackets. This data set is called the training set. Then the algorithm would look at new observations for which no information about income bracket is available. Based on the classifications in the training set, the algorithm would assign classifications to the new observations. For example, a 51-year-old female marketing director might be classified in the high-income bracket. There are many types of classification methods, such as logistic regression, decision trees, support vector machines, Random Forests, k-Nearest Neighbors, and naïve Bayes, to mention a few.
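To make the two supervised-learning categories concrete, here is a minimal R sketch: lm() fits a regression for a quantitative response (home sale price) and glm() fits a logistic-regression classifier for a categorical response (income bracket). The data frames, variable names, and values are invented for illustration only.

```r
# --- Regression: predict a quantitative response (home sale price) ---
homes <- data.frame(
  price    = c(250000, 310000, 420000, 180000, 520000),
  sqft     = c(1400, 1800, 2400, 1100, 3000),
  bedrooms = c(3, 3, 4, 2, 5)
)
price_model <- lm(price ~ sqft + bedrooms, data = homes)
predict(price_model, newdata = data.frame(sqft = 2000, bedrooms = 3))

# --- Classification: predict a categorical response (high income: yes/no) ---
people <- data.frame(
  high_income = factor(c("no", "yes", "no", "yes", "yes", "no")),
  age         = c(25, 48, 33, 51, 29, 44),
  hours_week  = c(35, 50, 42, 55, 41, 49)
)
income_model <- glm(high_income ~ age + hours_week,
                    data = people, family = binomial)
# Estimated probability that a 51-year-old working 50 hours/week is high income
predict(income_model, newdata = data.frame(age = 51, hours_week = 50),
        type = "response")
```

In both cases the fitted model captures the relationship between the predictors and the response in the training data, and predict() applies that relationship to new observations.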


Unsupervised learning is used to draw inferences from datasets consisting of input data without labeled responses. The most common unsupervised learning method is cluster analysis, which is used for exploratory data analysis to find hidden patterns or groupings in data.

• Clustering – Using unsupervised techniques like clustering, we can seek to understand the relationships between the variables or between the observations by determining whether observations fall into relatively distinct groups. For example, in a customer segmentation analysis we might observe multiple variables: gender, age, zip code, income, etc. Our belief may be that the customers fall into different groups, like frequent shoppers and infrequent shoppers. A classification analysis would be possible if the customer shopping history were available, but this is not the case in unsupervised learning — we don't have response variables telling us whether a customer is a frequent shopper or not. Instead, we can attempt to cluster the customers on the basis of the variables in order to identify distinct customer groups (a brief kmeans() sketch follows the figure below).

There are other types of unsupervised statistical learning, including k-means clustering, hierarchical clustering, principal component analysis, etc.

Clustering shows the relationships between the variables or observations by determining whether they fall into
relatively distinct groups. Image courtesy of TIBCO Spotfire.
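The following is a minimal sketch of the clustering idea in R, using the base functions kmeans() and hclust() on an invented customer table; the two variables and the choice of two clusters are assumptions made purely for illustration.

```r
set.seed(42)  # make the random cluster assignments reproducible

# Hypothetical customers: annual store visits and average basket size
customers <- data.frame(
  visits = c(52, 48, 60, 5, 8, 3, 45, 6),
  basket = c(35, 40, 30, 80, 95, 70, 38, 85)
)

# Standardize so both variables contribute equally, then ask for two groups
km <- kmeans(scale(customers), centers = 2, nstart = 25)
km$cluster          # which group each customer fell into
table(km$cluster)   # group sizes, e.g. frequent vs. infrequent shoppers

# Hierarchical clustering on the same data, cut into two groups for comparison
hc <- hclust(dist(scale(customers)))
cutree(hc, k = 2)
```

Note that no response variable is supplied anywhere: the groups emerge from the structure of the predictor variables alone, which is exactly what distinguishes unsupervised from supervised learning.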


Predictive Analytics Software

There is a vast array of predictive analytics tools, but not all are created equal. Software differs widely in terms of capability and usability — not all solutions can address all types of advanced analytics needs. There are different classes of analytics users — some need to build statistical models, others just need to use them.

For the advanced user, the importance of tool selection centers on the ability to put proprietary models into the hands of business users (front-line decision makers) so that they can act competitively with predictive analytics, hiding the complexity of these proprietary models under the hood.

Business users possess the domain knowledge necessary to understand the business answer they are looking for from predictive analytics, but at the same time they don't need to, want to, or can't develop the models themselves. So the optimal tool provides an easy method of putting the data scientists' expertise in the hands of front-line decision makers, oftentimes in the form of a guided analytic application, with the predictive model encapsulated under the covers. This enables best-practices use of advanced analytics (i.e. taking the risk out of having business people try to develop their own models), and broad deployment of secret-sauce analytics.

When selecting the right tool for your organization, you need to ensure you choose a tool which has the depth and breadth of capability, from simple out-of-the-box functionality for the easiest problems to the most advanced statistical capability to support data scientists, so that competitive models can be embedded in business users' analytic dashboards for day-to-day use.

Effective predictive analytics tools provide a wide variety of algorithms and methods to support all the data characteristics and business problems that users encounter, as well as the ability to flexibly apply these algorithms as needed. The extensibility to easily integrate new analytic methods as they become available is also critical for maximizing competitive advantage. An important criterion when selecting the right tool is to make sure the feature set matches your business data characteristics and that the tool will benefit your data analysts. The right tool typically combines powerful data integration and transformation capabilities, exploratory features, and analytic algorithms, all with an intuitive interface. In essence there are three important ingredients providing a recipe for success in utilizing predictive analytics: (i) the data scientist builds the most competitive model, (ii) the analytic application author embeds the competitive model into the analytic application, and (iii) the business user engages the competitive model as part of the regular flow of business.

Here is a short list of characteristics and considerations to focus on when evaluating a predictive analytics tool:

• Consider the processing capabilities of the analytics tool for addressing the needs of the predictive analytics cycle — data munging, exploratory data analysis, predictive modeling techniques such as forecasting, clustering, and scoring, as well as model evaluation.

• Find a tool that supports combining the analyst's business and data knowledge with predefined procedures and tools, and graphical workflows to simplify and streamline the path from preparation to prediction.

• A good tool must easily integrate with the data sources required to answer critical business questions.

• The tool should be readily usable by all classes of users: business users, business analysts, data analysts, data scientists, application developers and system administrators.

• Consider tools that serve to minimize the need for IT professionals and data scientists to set up integration with multiple data sources.


The goal in selecting a robust tool is to secure a broad range of predictive analytics capabilities — from the simplest, such as trend lines and a forecast tool, all the way through to leveraging an entire ecosystem of statistical capabilities where you have the full depth of capability in creating and executing any type of statistical model or algorithm. Out-of-the-box/standard algorithms aren't going to gain you a competitive advantage once your competitors start using those same tools. You need the tools to create your own proprietary models that will allow you to build that competitive advantage by leveraging your enterprise data assets.

A best-practice choice is a solution that integrates predictive analytics within the entire analytic decision-making process, allowing it to be incorporated where appropriate into self-service dashboards and exploratory data discovery. This orientation provides advanced analytics access to all analytic users, giving them the tools necessary to spot new opportunities, manage risks, and swiftly react to unforeseen events. Further, professionals managing mission-critical departments and global processes have the ability to immediately and intuitively ask questions and get answers from their data — anticipating what's next, taking quick and educated actions.

R as the Choice for Predictive Analytics

Although there are many choices for performing tasks related to data analysis, data modeling, and predictive analytics, R has become the overwhelming favorite today. This is due to the widespread use of R in academia over commercial products like SAS and SPSS, where new graduates enter industry with a firm knowledge of R.

There are currently spirited debates between the R user community and both the SAS and Python communities as to what is the best tool for data science. R has compelling justifications, including the availability of free open source R, a widely used extensible analytical environment, over 5,000 packages available on CRAN to extend the functionality of R, and top-rated visualization capabilities using ggplot2. In addition, R enjoys a thriving user community flush with local Meetup groups, online courses, and specialty blogs (see top blogs via consolidator: r-bloggers.com).

Open source R is a logical first choice for predictive analytics modeling as the statistical environment contains a number of algorithms in the base R package as well as additional packages that have extended functionality (a few of these are exercised in the sketch at the end of this section):

• Linear regression using lm()
• Logistic regression using glm()
• Regression with regularization using the glmnet package
• Neural networks using nnet()
• Support vector machines using the e1071 package
• Naïve Bayes models using the e1071 package
• K-nearest-neighbors classification using the knn() function from the class package
• Decision trees using tree()
• Ensembles of trees using the randomForest package
• Gradient boosting using the gbm package
• Clustering using kmeans(), hclust()

The only issue with the open source R engine is its inherent limitation as a scalable production environment. R is notoriously memory-based, meaning it can only run within the confines of its compute environment. A good best-practices policy for implementing R in production would be to leverage a commercial, enterprise-grade platform for running the R language, such as TERR (TIBCO Enterprise Runtime for R), in order to get the great benefits of R while avoiding the scalability challenges.
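Assuming the relevant CRAN packages are installed, the fragment below exercises a few of the functions named in the list above on R's built-in iris and mtcars data sets. It is a sketch of basic package usage, not a recommended modeling workflow.

```r
library(randomForest)  # ensembles of decision trees
library(glmnet)        # regression with regularization
library(e1071)         # support vector machines and naive Bayes

# Random forest classifier on the built-in iris data
rf <- randomForest(Species ~ ., data = iris, ntree = 200)
print(rf)

# Lasso-regularized regression on mtcars (glmnet expects a predictor matrix)
x <- as.matrix(mtcars[, c("wt", "hp", "disp")])
y <- mtcars$mpg
fit <- cv.glmnet(x, y, alpha = 1)    # cross-validation picks the penalty
coef(fit, s = "lambda.min")

# Support vector machine classifier, with a quick confusion table
sv <- svm(Species ~ ., data = iris)
table(predicted = predict(sv, iris), actual = iris$Species)
```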


Data Access for Predictive Analytics

Enterprise data assets are what feed the predictive analytics process, and any tool must facilitate easy integration with all the different types of data sources required to answer critical business questions. Robust predictive analytics needs to access analytical and relational databases, OLAP cubes, flat files, and enterprise applications. The following data integration areas may be required by predictive analytics:

• Structured data sources such as traditional SQL databases and data warehouses already in use by the enterprise.

• Unstructured data sources such as social media, e-mail, etc.

• External third-party data included from vendors such as Salesforce.

Tools for predictive analytics should make integration with multiple data sources quick and straightforward without the need for exhaustive work by IT professionals and data scientists.

Users should have the flexibility to quickly combine their own private data stores, such as Excel spreadsheets or Access databases, with corporate data stores, such as Hadoop or cloud application connectors (e.g. Hadoop/Hive, Netezza, HANA, Teradata, and many others). Support for in-database, in-memory and on-demand analytics via direct connectors are all features gaining steam in the predictive analytics arena.

Open source R offers a low cost of entry for ample enterprise data access, as it possesses many packages providing access to a wide range of data sources including ODBC databases, Excel, CSV, Twitter, and Google Analytics, just to name a few.

Best practice is to ensure your predictive analytics solution provides access to all types of data sources so you can combine and mash up data in any variety of ways in order to get a holistic view of the business — preferably without coding and without requiring IT involvement. This capability will empower users to derive powerful insights and make educated business decisions in real time.
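As a sketch of the kind of data access described above, the snippet below pulls data into R from a flat file, an Excel workbook, and an ODBC database, then joins a private spreadsheet with corporate data. The file names, DSN, credentials, table, and column names are placeholders, and the readxl and RODBC packages are assumed to be installed.

```r
# Flat files: base R reads CSVs directly
sales <- read.csv("sales_history.csv", stringsAsFactors = FALSE)

# Excel workbooks via the readxl package
library(readxl)
accounts <- read_excel("accounts.xlsx", sheet = 1)

# Relational databases via ODBC (DSN, credentials, and query are placeholders)
library(RODBC)
conn   <- odbcConnect("EnterpriseDW", uid = "analyst", pwd = "secret")
orders <- sqlQuery(conn, "SELECT customer_id, order_date, amount FROM orders")
odbcClose(conn)

# Combine a private spreadsheet with corporate data for one holistic view
combined <- merge(orders, accounts, by = "customer_id", all.x = TRUE)
```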
Exploratory Data Analysis (EDA)

An integral step in preparing for predictive analytics is to become intimately familiar with the data, a process known as exploratory data analysis (EDA). A clear understanding of the data provides the foundation for model selection, i.e. choosing the appropriate predictive analytics algorithm to solve your business problem. Various types of software can be used by different users for an initial exploration of data.

One way to gain this level of familiarity is to utilize the many features of the R statistical environment to support this effort — numeric summaries, plots, aggregations, distributions, densities, reviewing all the levels of factor variables and applying general statistical methods. Other tools can also be used effectively for EDA, including TIBCO Spotfire, SAS, SPSS, Statistica, and Matlab, among many others. Statistical software such as R enables a user to very flexibly explore and visualize data, but requires a high level of knowledge of scripting.

Open source R has many visualization mechanisms for EDA, including histograms, boxplots, barplots, scatterplots, heatmaps, and many others using the ggplot2 library. Using these tools allows for a deep understanding of the data being employed for predictive analytics. With thorough EDA you can gain important insights into the story your data is telling and how best to utilize the data to make accurate predictions.
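A brief sketch of EDA in R using the built-in mtcars data set shows the kinds of numeric summaries and plots described above; ggplot2 is assumed to be installed for the last plot.

```r
data(mtcars)

summary(mtcars)                                   # per-column numeric summaries
str(mtcars)                                       # structure: types and sample values
aggregate(mpg ~ cyl, data = mtcars, FUN = mean)   # group-wise means

# Base graphics: distributions and relationships
hist(mtcars$mpg, main = "Distribution of fuel economy", xlab = "mpg")
boxplot(mpg ~ cyl, data = mtcars, xlab = "cylinders", ylab = "mpg")
pairs(mtcars[, c("mpg", "wt", "hp")])             # scatterplot matrix

# ggplot2 for richer visuals
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg, colour = factor(cyl))) +
  geom_point(size = 3) +
  labs(x = "weight (1000 lbs)", y = "miles per gallon", colour = "cylinders")
```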


Another method for EDA is data discovery software. Data discovery tools, such as TIBCO Spotfire, enable a wide variety of users to visually explore and understand their data, without requiring deep statistical knowledge. Users can perform enhanced EDA tasks that add an additional layer of insights without having to request assistance from their IT departments or data scientists.

Combining data discovery and predictive analytics capabilities on the same analytics platform is a best practice that gives analytics users a seamless experience as they move from one analytics task to another, in addition to providing a more sound total cost of ownership.

Data discovery software offers a rich, interactive analytic interface for EDA including accessing and manipulating data,
and composing analyses. Image courtesy of TIBCO Spotfire.


Predictive Modeling
Using predictive analytics involves understanding and preparing the data, defining the predictive model, and following the predictive process. Predictive models can assume many shapes and sizes, depending on their complexity and the application for which they are designed. The first step is to understand what questions you are trying to answer for your organization. The level of detail and complexity of your questions will increase as you become more comfortable with the analytic process.

The most important steps in the predictive analytics process are as follows (a compact R walk-through of these steps appears at the end of this section):

• Define the project outcomes and deliverables, state the scope of the effort, establish business objectives, and identify the data sets to be used.

• Undertake data collection and data understanding.

• Perform data munging – the process of inspecting, cleaning, and transforming the data.

• Utilize exploratory data analysis (EDA) – use graphical techniques with the objective of discovering useful information and arriving at conclusions. Apply statistics to validate the assumptions and hypotheses, and test using standard statistical techniques.

• Apply modeling principles to provide the ability to automatically create accurate predictive models about the future.

• Evaluate the model, allowing you to verify the robustness of the chosen model and make mid-course corrections. Test models on existing data and apply predictions to new data.

• Select a deployment option to open up the analytical results to everyday decision making and to get results by automating the decisions based on the modeling.

Each of the above steps can be considered iterative and may be revisited as needed. It should be noted that the data munging step is often very time-consuming depending on the cleanliness of the incoming data, and can take up to 70% of the overall project timeline.

Characteristics of the data can often help you determine what predictive modeling techniques might best meet the data analyst's needs. Here are a number of points to consider when determining which technique to use based on your data and the problem you wish to solve.

• When the data is grouped by observations, tools such as cluster analysis, association rules, and k-nearest neighbors usually provide the best results.

• Use classification to separate the data into classes based on the response variable – both binary classes like True or False, as well as multi-class situations.

• Use single, multiple and polynomial regression when attempting to make a prediction rather than a classification.

• In poor quality or limited data situations, A/B testing is appropriate. As an example, A/B tests are statistical experiments that help you decide whether a change is actually making a significant impact on your product.

The Predictive Analytics Process (summarized from the process diagram):
1. Define the problem to be solved – state goals and business objectives, and identify data sets.
2. Data collection – perform data collection and data understanding.
3. Data munging – cleanse and transform data in preparation for analytics.
4. Exploratory data analysis – use plots to discover useful insights; apply statistics.
5. Data modeling – create accurate predictive models about the future.
6. Evaluate model – verify robustness of model and make adjustments.
7. Deployment – deploy model in production environment.
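Tying the steps above together, here is a compact, hypothetical walk-through of the process in R: simulate and munge a small customer table, explore it, fit and evaluate a churn model, and score a new observation. The column names, the simulated relationships, and the 70/30 split are assumptions made for illustration, not part of the guide.

```r
set.seed(7)

# 1. Data collection and munging: simulate a small customer table
n <- 200
customers <- data.frame(
  tenure_months = sample(1:60, n, replace = TRUE),
  monthly_spend = round(runif(n, 20, 120), 2),
  support_calls = rpois(n, 2)
)
customers$churned <- factor(ifelse(
  customers$support_calls * 10 - customers$tenure_months +
    rnorm(n, sd = 15) > 0, "yes", "no"))

# 2. Exploratory data analysis
summary(customers)
boxplot(tenure_months ~ churned, data = customers)

# 3. Modeling: hold out a test set, then fit a logistic regression
train_idx <- sample(seq_len(n), size = 0.7 * n)
train <- customers[train_idx, ]
test  <- customers[-train_idx, ]
model <- glm(churned ~ tenure_months + monthly_spend + support_calls,
             data = train, family = binomial)

# 4. Evaluation: simple accuracy on the held-out data
probs <- predict(model, newdata = test, type = "response")
pred  <- ifelse(probs > 0.5, "yes", "no")
mean(pred == test$churned)

# 5. Deployment / scoring: apply the model to a new observation
predict(model, newdata = data.frame(tenure_months = 3,
                                    monthly_spend = 95,
                                    support_calls = 6),
        type = "response")
```

In a real project each step would of course be far more involved, and, as noted above, the munging step alone can consume most of the timeline.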

Production Deployment
The final step in the predictive analytics project timeline is to determine how best to deploy the solution to a production environment. Of primary concern is using open source R on larger data sets where performance is important. The open source R engine was not built for enterprise usage. Deploying open source R can be problematic for the following reasons:

• Poor memory management – R does not reclaim memory well, so memory use can grow quickly, leading to out-of-memory crashes, as well as non-linear performance due to increased garbage collection requests and increased swapping.

• Risk of deploying open source with a GPL license – software vendors are forbidden to embed or redistribute open source R as a part of any commercial closed-source software.

In order to avoid these issues, analysts often will opt to convert their working R solution to a different programming environment like C++ or Python. This path, however, is far from optimal since it requires recoding and significant retesting.

Best practice would be to use a commercial, enterprise-grade R solution, like TIBCO Software's Enterprise Runtime for R (TERR), to resolve the above limitations and to yield a robust production environment. Because many corporations already have legacy predictive models in house, it is also recommended that you ensure your analytics platform supports TERR, open source R, S+, MATLAB and SAS models, in order to take advantage of an ecosystem of predictive analytics.

Conclusion

In this Guide we have reviewed how predictive analytics helps your organization predict with confidence what will happen next so that you can make smarter decisions and improve business outcomes. It is important to adopt a predictive analytics solution that meets the specific needs of different users and skill sets, from beginners, to experienced analysts, to data scientists.

With predictive analytics software you can:

• Transform data into predictive insights to guide front-line decisions and interactions.
• Predict what customers want and will do next to increase profitability and retention.
• Maximize the productivity of your people and processes.
• Increase the value of your data assets.
• Detect and avoid security threats and fraud before they affect your organization.
• Perform statistical analysis including regression analysis, classification, and cluster analysis.
• Measure the social media impact of your products, services and marketing campaigns.

About TIBCO Spotfire

TIBCO Spotfire® is the analytics solution from infrastructure and business intelligence giant TIBCO Software. From interactive dashboards and data discovery to predictive and real-time analytics, Spotfire's intuitive software provides an astonishingly fast and flexible environment for visualizing and analyzing your data. As your analytics needs increase, our enterprise-class capabilities can be seamlessly layered on, helping you to be first to insight — and first to action.

TIBCO Spotfire has a long, rich history in predictive analytics. With Spotfire you can develop your own proprietary models and leverage your investments in R, S+, SAS, MATLAB, and in-database analytics of Big Data sources, such as Teradata Aster. Spotfire also offers a commercial-grade R environment, TERR (TIBCO Enterprise Runtime for R), which was built from the ground up to extend the reach of R to the enterprise, making R faster, more scalable, and able to handle memory much more efficiently than the open source R engine.

TIBCO regularly contributes to the R community, including feedback to the R Core team, and offers broad compatibility with R functions and a growing number of CRAN packages, currently 1800+. The company regularly tests TERR with a wide variety of R packages, and continues to extend TERR to greater R coverage. TERR can be used in RStudio, the popular R IDE, and also integrates fully with TIBCO Spotfire, as well as TIBCO Complex Event Processing products, such as TIBCO Streambase.

Learn more about TIBCO Spotfire and TERR at spotfire.com

