Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
Meera Ragunathan
NOVEMBER 2017
Features
Binning Methods
Agenda
Syntax
• Nominal (categorical)
take distinct values that assume no order or scale relationship between them
ex. Province
• Nominal (categorical)
take distinct values that assume no order or scale relationship between them
ex. Province
• Ordinal (rank)
similar to nominal variable but with order defined on categories
ex. Level of education
• Nominal (categorical)
take distinct values that assume no order or scale relationship between them
ex. Province
• Ordinal (rank)
similar to nominal variable but with order defined on categories
ex. Level of education
• Continuous
take values represented on real number scale
ex. Income
max(x) – min(x)
_________________
L=
n
PROC HPBIN
DATA = SAS-data-set <options>; Specifies data set being used
INPUT variables-list; Specifies interval variable(s) to be binned
RUN;
Option Description
DATA= Specifies input data set
Basic
OUTPUT= Specifies output data set
Binning level NUMBIN= Specifies global number of bins for all binning variables
BUCKET Specifies bucket binning method
Binning method
QUANTILE Specifies quantile binning method
WOE Computes WOE and IVs ; must specify target variable
• BinInfo
– Includes information about binning method, number of bins, number of variables
• Mapping
– Level starts at 1 and increases to value of NUMBINS
– Bin level can be less than NUMBINS if input data are small
• PerformanceInfo
– Information about execution mode
• Single-machine
• Distributed-mode
• WOE
– Provides level mapping information, binning information, weight of evidence and information
value for each bin
– When WOE table is printed, mapping table is not printed
– Other values: event/non-event counts and event/non-event rates for each bin
• InfoValue
– Generates Information Value for each variable
Id Age
1
…
…
…
…
32,000
Bad Distribution I
_________________
WOE = ln
Good Distribution I
Number of Bad i
• Nominal variable Bad Distribution I = ___________________
i represents category Total number of Bad
Defintion of events:
Bad
consumers with score less than 700
Good
consumers with score greater than 700
1,188,423
Bad Dist = ___________ = 0.189
6,281,156
1,188,423
Bad Dist = ___________ = 0.189
6,281,156
0.189
WOE = ln ___________ = 0.0364
0.182
Age Ins
of customer in years Did customer buy insurance product?
(Binary target variable)
Income 1 = Yes 0 = No
in thousands of dollars
Customized binning
levels across Information Values
variables
• “SAS® Enterprise Miner™ High-Performance Data Mining Procedures and Macro Reference for SAS®
9.3”
https://support.sas.com/documentation/onlinedoc/miner/emhp71/emhpprcref.pdf
• Upadhyay, Roopam. “Information Value (IV) and Weight of Evidence (WOE) – A Case Study from
Banking (Part 4)”
http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/
Contact Information
Meera Ragunathan
Data Scientist, TransUnion Canada
Work:
mraguna@transunion.com
Personal:
https://www.linkedin.com/in/meeraragunathan/
m.ragunathan@icloud.com