
Session 49 PD, Free Tools for Healthcare Analytics: R and Python – A New Paradigm!

Moderator:
David L. Snell, ASA, MAAA

Presenters:
Brian D. Holland, FSA, MAAA
Dihui Lai, Ph.D.
Sheamus Kee Parkes, FSA, MAAA
Python for Actuaries

Brian Holland, FSA, MAAA


2015 SOA Health Meeting
Atlanta, GA
Disclaimer:
Any views or opinions discussed or shown in this presentation are solely those
of the author and do not represent those of AIG or any of its subsidiaries or
employees.

Why learn Python?
• We hear a ton about machine learning, data science, big data.
• To do these things yourself, you need the technical skills, programming / hacking skills included.
• Python has a lot of traction in data science applications and is now quite
popular. You don’t have to look long before seeing it.
• Some data science companies are Python shops.

Why not learn, or learn about, Python?


• You don’t program or manage programmers or programming.
• You can get by in a spreadsheet or with VBA.
• You have no interest in doing or trying advanced analytics.

Fair warning: this is a presentation about a programming language.

Purpose today: shake hands with Python
See what you might want to dig into
What is Python?
• an object-oriented language
• with extensive scientific, numeric libraries
• with many special-purpose libraries
• with an expanding user base
• that is designed for readability
• Forced indentation; many places to comment work in accessible ways
• around since 1991
• in two active versions: 2 and 3
For new work, there is not much of a case for sticking with 2 now: the big libraries are ported to 3.
• named after Monty Python, not the snake
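
As a small illustration of the readability point, the sketch below shows how indentation alone marks the body of a function and of an `if` block. The function name `lapse_rate` and the sample numbers are hypothetical, and Python 3 division is assumed.

```python
def lapse_rate(lapses, exposure):
    """Return lapses per unit of exposure, or None when there is no exposure."""
    # The indented lines below belong to the function body.
    if exposure == 0:
        # This deeper indentation belongs to the `if` block.
        return None
    return lapses / exposure


print(lapse_rate(5, 8))
print(lapse_rate(0, 0))
```

There are no braces or `end` keywords: removing or changing the indentation changes (or breaks) the program, which is what keeps the layout and the logic in step.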

Applications for actuaries

• A general-purpose master tool, with libraries for special purposes


Can manipulate R, MS Office, and other Windows objects
• Data munging:
Easily read spreadsheets, text files, databases, scrape web (with library BeautifulSoup)
• Process automation and documentation
• Data visualization
• Statistical modeling / machine learning / data science / predictive modeling
• Presentations

Ways to use Python
System command: for scripts

Command line environment

Ways to use Python: IPython notebooks

Edit browser-based documents – saved in JSON

Mix formatted text and computation


• Typeset math
• Section headings, HTML, markdown
• Graphics inline with the flow of text, computed as you go

Run remote servers through the web – also grids

Convert the notebooks easily to slides, HTML, plain Python files; on to MS Word

Note: IPython notebooks recently folded into Jupyter project


• Front-end for many other back-end computations, including R, Julia
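
Because a notebook file is plain JSON, it can be inspected with the standard library alone. The sketch below builds a tiny notebook-like document in memory and lists its cells. The field names follow the version 4 notebook format as commonly documented, which is an assumption here, and the cell contents are made up for illustration.

```python
import json

# A hand-written, minimal notebook-like document (contents are illustrative).
notebook_text = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {},
         "source": ["# Lapse study\n", "Some formatted text with $e^{x}$."]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["rate = 5 / 8\n", "rate"]},
    ],
})

# Reading it back is ordinary JSON parsing.
notebook = json.loads(notebook_text)
cell_types = [cell["cell_type"] for cell in notebook["cells"]]
code_source = "".join(notebook["cells"][1]["source"])

print(cell_types)
print(code_source)
```

Text cells and code cells sit side by side in one list, which is why a notebook can mix formatted text and computation.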

Ways to use Python: IPython notebooks

Could you do that in a spreadsheet?
I could not, not reasonably.

What is “knowing Python” ?

Language: syntax, and Python standard library


• The Python Standard Library by Example, Doug Hellmann, 2011

Libraries to do what you need


• BeautifulSoup: to read and manipulate HTML/XML, scraping web

• PyODBC – to talk to databases

• NumPy, Pandas, Scikit-Learn: essential for machine learning and computation generally

Graphics libraries:
Death by choice
• Bokeh for interactive plots in browser
• Seaborn
• GGPLOT port for R fans and experts;
• VisPy – bleeding edge, GPU, interactive, 2d, 3d, wow
• Matplotlib – the main one

Tip: come to the afternoon session to see what these LTC exhibits are.

Data I/O with Pandas
The Pandas library can import many document types directly into a DataFrame object (similar to R’s)
• Fixed-width text
• Delimited text
• Spreadsheets
• HTML, JSON
• SQL queries, using an open connection to the DB
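
A minimal sketch of the delimited-text case, assuming pandas is installed. The sample rows are copied from the lapse table that appears later in this session; reading from an in-memory string instead of a file is a choice made only to keep the example self-contained.

```python
import io

import pandas as pd

# Whitespace-delimited sample rows copied from the lapse table in this deck.
raw = """STUDY_YEAR ISSUE_AGE POLICY_YEAR EXPOSURE LAPSE_CNT
2009-2010 33-37 10 1 1
2009-2010 63-67 10 1 0
2008-2009 28-32 10 2 2
2008-2009 53-57 10 2 1
2009-2010 38-42 10 1 1
2008-2009 23-27 10 1 0
"""

# read_csv handles delimited text; sep=r"\s+" splits on runs of whitespace.
df = pd.read_csv(io.StringIO(raw), sep=r"\s+")

# Total exposure and lapses per study year.
by_year = df.groupby("STUDY_YEAR")[["EXPOSURE", "LAPSE_CNT"]].sum()

print(df.dtypes)
print(by_year)
```

The other formats on the list have their own readers in pandas (for example `read_fwf`, `read_excel`, `read_json`, `read_sql`), all of which return the same kind of DataFrame.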

Machine learning: scikit-learn – the “killer app”?
Many examples at http://scikit-learn.org/stable/auto_examples/index.html.
A very small sample from the page:

Cooperation with other software:
RPy2 in a Notebook
“R Magic”: (there are many “magic” functions in IPython or Jupyter notebooks)
• Allow commands to other tools directly in the notebook

More on RPy2: accessing R objects

PypeR: another way to talk to R
PypeR uses pipes to communicate with R.

Good luck, have fun!

Thanks for your interest.

Brian Holland, FSA, MAAA

R for Actuarial Science

Dihui Lai, PhD


Reinsurance Group of America, Incorporated
Outline

• R, WHATS and WHYS?
• Use R for Actuarial Science
• R Demo
• Conquer Big Data with R


R, WHATS and WHYS?

• Open source project since 1995
• Active community (>2 million users and developers)
• Incorporates features of object-oriented and functional programming
• Powerful analytic toolkits (text mining, SVM, nnet, deepnet)


R, WHATS and WHYS?
Easy data manipulation
STUDY_YEAR ISSUE_AGE POLICY_YEAR EXPOSURE LAPSE_CNT
2009-2010 33-37 10 1 1
2009-2010 63-67 10 1 0
2008-2009 28-32 10 2 2
2008-2009 53-57 10 2 1
2009-2010 38-42 10 1 1
2008-2009 23-27 10 1 0

Statistical toolkits

Cutting edge analytics

Database

Integrate advanced data tech

Visualization tools
Use R for Actuarial Science
Example: Term Tail Lapse Study
load("LapseData.Rdata")

head(LapseData)
## STUDY_YEAR ISSUE_AGE POLICY_YEAR EXPOSURE LAPSE_CNT FA_BAND
## 9 2009-2010 33-37 10 1 1 B. 100k-249k
## 71 2009-2010 63-67 10 1 0 B. 100k-249k
## 121 2008-2009 28-32 10 2 2 C. 250k-999k
## 210 2008-2009 53-57 10 2 1 B. 100k-249k
## 223 2009-2010 38-42 10 1 1 C. 250k-999k
## 237 2008-2009 23-27 10 1 0 B. 100k-249k

summary(LapseData)
## STUDY_YEAR ISSUE_AGE POLICY_YEAR EXPOSURE
## 2010-2011:98630 33-37 :92930 Min. :10.00 Min. : 0.002732
## 2011-2012:88353 38-42 :91723 1st Qu.:10.00 1st Qu.: 1.000000
## 2009-2010:83321 43-47 :76142 Median :10.00 Median : 1.000000
## 2008-2009:77505 28-32 :69777 Mean :10.87 Mean : 1.226270
## 2007-2008:59968 48-52 :57920 3rd Qu.:11.00 3rd Qu.: 1.000000
## 2006-2007:41000 53-57 :41278 Max. :19.00 Max. :26.000000
## (Other) :64476 (Other):83483
## LAPSE_CNT FA_BAND
## Min. : 0.000 A. < 100k : 39121
## 1st Qu.: 0.000 B. 100k-249k :230897
## Median : 1.000 C. 250k-999k :208131
## Mean : 0.615 D. 1M - 1.99M: 26042
## 3rd Qu.: 1.000 E. 2M+ : 7232
## Max. :24.000 D. 1M-1.99M : 1830
Use R for Actuarial Science
Example: Term Tail Lapse Study
Model1 <- glm(LAPSE_CNT ~ offset(log(EXPOSURE)) + FA_BAND, family = poisson(), data = LapseData)
summary(Model1)
##
## Call:
## glm(formula = LAPSE_CNT ~ offset(log(EXPOSURE)) + FA_BAND, family = poisson(),
##     data = LapseData)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.6517 -0.9669 -0.2003 0.6752 2.8462
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.987363 0.007434 -132.81 <2e-16 ***
## FA_BANDB. 100k-249k 0.226844 0.007926 28.62 <2e-16 ***
## FA_BANDC. 250k-999k 0.372967 0.007905 47.18 <2e-16 ***
## FA_BANDD. 1M - 1.99M 0.488017 0.010462 46.65 <2e-16 ***
## FA_BANDE. 2M+ 0.615627 0.015559 39.57 <2e-16 ***
## FA_BANDD. 1M-1.99M 0.857298 0.020445 41.93 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 413195 on 513252 degrees of freedom
## Residual deviance: 408135 on 513247 degrees of freedom
## AIC: 951877
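
With a log link and the `offset(log(EXPOSURE))` term, each coefficient is on the log scale, so exponentiating gives lapse rates per unit of exposure and multipliers relative to the baseline band. The sketch below does that arithmetic for a few estimates copied from the summary above. Reading the omitted level `A. < 100k` as the baseline is an assumption based on R's default treatment coding.

```python
import math

# Estimates copied from the printed model summary.
intercept = -0.987363
band_effects = {
    "B. 100k-249k": 0.226844,
    "C. 250k-999k": 0.372967,
    "E. 2M+": 0.615627,
}

# Baseline lapse rate per unit of exposure (assumed to be band "A. < 100k").
baseline_rate = math.exp(intercept)

# Each band's multiplier relative to the baseline, and its implied rate.
multipliers = {band: math.exp(beta) for band, beta in band_effects.items()}
rates = {band: baseline_rate * m for band, m in multipliers.items()}

print("baseline rate: {:.4f}".format(baseline_rate))
for band in band_effects:
    print("{}: x{:.3f} -> rate {:.4f}".format(band, multipliers[band], rates[band]))
```

Since the estimates increase with face amount band, the implied lapse rates do as well.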
Use R for Actuarial Science
Example: Hierarchical Clustering
Use R for Actuarial Science
Examples: Other Potentials
SVM

Text Mining Map

Have Fun
R Demo

Use R for Twitter Streaming


Conquer Big Data with R
Example: use generalized linear model (GLM) for large data set*

Function   Elapsed Time (s)   Memory (Mb)
glm        184.78             2408.3
glm.fit    78.28              1056.20
bigglm     208.74             2.3
rxGlm      28.78              43.5

Approach          Elapsed Time (s)   Memory (Mb)
lapply*+glm       Memory overflow
snowfall+glm      Memory overflow
lapply+bigglm     1003.97            <10
snowfall+bigglm   497.33             <10
lapply+rxGlm      54.41              44.4
snowfall+rxGlm    43.23              204.3

* Reference: Xu R and Lai D; Forecasting & Futurism, Issue 9; July 2014
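
The low memory figures for `bigglm` are consistent with working through the data in pieces rather than loading it all at once. The sketch below illustrates only that general idea, not `bigglm`'s actual algorithm. It streams rows in chunks and accumulates the two totals needed for the simplest case, an intercept-only Poisson model with a log-exposure offset, whose fitted rate is total lapses divided by total exposure. The helper names are hypothetical, and the sample (exposure, lapse count) pairs are taken from the lapse table shown earlier.

```python
from itertools import islice


def chunks(rows, size):
    """Yield successive lists of at most `size` rows from any iterable."""
    it = iter(rows)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def streaming_lapse_rate(rows, chunk_size=2):
    """Accumulate totals chunk by chunk; only one chunk is held at a time."""
    total_exposure = 0.0
    total_lapses = 0
    for block in chunks(rows, chunk_size):
        total_exposure += sum(exposure for exposure, _ in block)
        total_lapses += sum(lapses for _, lapses in block)
    return total_lapses / total_exposure


# (EXPOSURE, LAPSE_CNT) pairs from the six sample rows.
sample = [(1, 1), (1, 0), (2, 2), (2, 1), (1, 1), (1, 0)]
rate = streaming_lapse_rate(sample)
print(rate)
```

Memory use depends on the chunk size rather than on the number of rows, at the cost of a pass over the data that cannot be skipped.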


Conquer Big Data with R
R packages for big data

• Memory allocation: ff, bigmemory
• Integrate R with clusters: RHadoop, SparkR
• Parallel computing package: snowfall, multicore
• Commercial distribution: Revolution R
Summary - Do You Want the Toolbox?
Easy data manipulation
STUDY_YEAR ISSUE_AGE POLICY_YEAR EXPOSURE LAPSE_CNT
2009-2010 33-37 10 1 1
2009-2010 63-67 10 1 0
2008-2009 28-32 10 2 2
2008-2009 53-57 10 2 1
2009-2010 38-42 10 1 1
2008-2009 23-27 10 1 0

Statistical toolkits

Cutting edge analytics

Database

Integrate advanced data tech

Visualization tools
Questions ?
R vs Python
SOA Health Meeting – June 2015

Presented by
Shea Parkes, FSA, MAAA
Limitations
The views expressed in this presentation are those of the
presenter, and not those of Milliman. Nothing in this
presentation is intended to represent a professional opinion
or be an interpretation of actuarial standards of practice.

Data Science – A Useful Perspective

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

3 June 27, 2011



=Actuarial Student/Analyst Self-Assessment



Bending your brain

The more you use Python, the better you are able to think about programming.

The more you use R, the better you are able to think about data analysis.



Both are multi-paradigm… but…

Python: Functions are first class objects, but “lambda”s are constrained and an awkward “nonlocal” statement was only recently introduced.

R: 3+ ways to do Object Oriented Programming, but none of them are simple and easy to use.
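
A minimal sketch of the two Python points, assuming Python 3 (where `nonlocal` is available). The names `square` and `make_counter` are illustrative only.

```python
# A lambda may contain only a single expression, no statements.
square = lambda x: x * x


def make_counter():
    """Return a function that counts how many times it has been called."""
    count = 0

    def increment():
        # Without `nonlocal`, `count += 1` would treat `count` as a new
        # local variable and raise UnboundLocalError.
        nonlocal count
        count += 1
        return count

    return increment


counter = make_counter()
counter()
print(square(4), counter())
```

Anything longer than one expression needs a named `def`, and rebinding a variable of an enclosing function needs the explicit `nonlocal` declaration.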



Both could use a little help…



“Recent” growth – coming together

Python:
• Data Science stack – Pandas + scikit-learn + statsmodels + IPython
• Cutting edge modeling – Theano and PyStan

R:
• RStudio + devtools + more… encouraging best software development practices
• dplyr + magrittr = more readable code = faster development



But what should I use?

Will you need to integrate with other systems at all?

Is analyzing data 80%+ of what you will be doing?

Whichever your colleagues have experience in!



