
1. Executive Summary

This project aims to predict the ELO loyalty score at the user level and to explore which credit card features tend to lead to higher loyalty. The train dataset contains 201,917 lines of customer records with their card ID, card features, first active month and Target. The transaction data have two parts: historical data from the past six months and new-period data from the most recent three months. These data contain the card id, the card holder's demographic features such as city and state, the purchase amount and time, the authorized status and the number of installments of the purchase.

For interpretation purposes, we first build a simple decision tree to examine how the tree nodes split the data into groups, with Target as the dependent variable and feature 1, feature 2 and feature 3 as the three independent variables. The RMSE is 3.88, which is larger than the standard deviation of Target.

We aggregate the purchase records that share the same card ID. To extract more information from the variables, we derive new features -- the maximum, minimum, mean, sum and standard deviation across a card's records, chosen according to the nature of each variable -- and aggregate the purchase records into these newly created features. In this way, the prediction accuracy of the machine learning models increases. The dimension reduction method PCA and several machine learning methods are applied, together or individually, to the aggregated dataset, and LightGBM gives the best prediction result with an RMSE of 3.658.

Optimized promotion timing is recommended to the merchants. For ELO, feature 1 and feature 2 are more effective than feature 3 at classifying and detecting customer loyalty, and the design of feature 3 needs to be improved.

2. Introduction and Background

Nowadays, many credit card companies invest heavily in merchant discount recommendation. Discounts can be sent to card holders through the credit card mobile app, the official website or offline channels according to the card owner's geographic and demographic data. It is a win-win strategy for both merchants and card holders, so it is widely used in credit card products to attract individual and business customers and to differentiate one card product from another.

Elo, one of the largest payment brands in Brazil, wants to figure out the best way to do merchant discount recommendation. It has collected customer-level data with activation time, consumption features and loyalty score. Through its long-standing partnerships with merchants, Elo also has purchase-level data containing merchant information and transaction details. The team is given these data to help Elo identify and serve the most relevant opportunities to individuals by predicting the individual-level loyalty score.

3. Data and Methodology

The datasets available to the team are the train dataset, the transaction dataset and the test dataset (Appendix I). The train dataset contains 201,917 lines of customer records with their card ID, card features, first active month and Target. The object to predict, Target, ranges from -33 to 18; the higher the value, the more loyal the customer. All features are anonymized as feature_1, feature_2 and feature_3. For feature_1, most customers chose choice 3; for feature_2, most customers chose choice 1 (Appendix II, Figure 1); for feature_3, most customers chose choice 1. No significant correlation between the features and Target appears in the descriptive visualizations. The first active month ranges from November 2011 to February 2018 (Appendix II, Figure 3). The test dataset has the same structure as the train dataset, without Target, which is the object to predict.
The transaction data have two parts: historical data from the past six months and new-period data from the most recent three months. These data contain the card id, the card holder's demographic features such as city and state, the purchase amount and time, the authorized status and the number of installments of the purchase (Appendix II, Figure 4). Besides that, they also contain merchant and product details such as merchant id, merchant category id, and the product features category_1, category_2 and category_3 (Appendix II, Figure 2).

Data Exploration

For interpretation purposes, a decision tree model was built with Target as the dependent variable and feature 1, feature 2 and feature 3 as the three independent variables.

library(rpart)

# Regression tree on the three anonymized card features; the small
# complexity parameter allows the tree to grow deeper splits
tree1 <- rpart(target ~ feature_1 + feature_2 + feature_3, data = tr,
               method = 'anova', control = rpart.control(cp = 0.0001))

As the tree plot shows (Appendix III), data points whose feature_1 equals 5 are classified into a group whose average target value is -0.507; this group makes up 20.1% of the whole dataset. Data points whose feature_1 equals 1, 2, 3 or 4 are classified into a group whose average target value is -0.36; this group makes up 79.9% of the dataset. Decomposing this group further, if feature_2 equals 1 or 2, the data points are assigned to a group covering 62.1% of the dataset with an average target value of -0.319. If feature_2 equals 3 and feature_1 equals 1 or 3, 13.5% of the dataset is assigned to a group whose average target value is -0.556. If feature_2 equals 3 and feature_1 equals 2 or 4, the group's average target value is -0.337 and its size is 5.2% of the dataset.

Prediction with this tree model gives an RMSE (root mean squared error) of 3.88, which is higher than the standard deviation of Target (3.85), showing that more information is needed to predict the Target value.
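For reference, this comparison takes only a few lines of R. The sketch below assumes a held-out evaluation frame tr_test (a hypothetical name; the report does not state how the evaluation split was made):

pred_tree <- predict(tree1, newdata = tr_test)   # predictions from the fitted tree
sqrt(mean((tr_test$target - pred_tree)^2))       # RMSE, about 3.88 in this project
sd(tr_test$target)                               # standard deviation of Target, about 3.85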
Feature Engineering

The new transaction records and the historical transaction records are given at the transaction level. If they were merged with the train dataset directly, transactions made at different merchant stores with different purchase amounts would all carry the same target value, which is why a direct merge is not reasonable. Therefore, the transactions belonging to the same card id need to be aggregated into card-level data.

The historical transaction data was divided into two datasets: history data, with authorized_flag equal to 'N', meaning the transactions were not approved, and authorized data, with authorized_flag equal to 'Y', meaning the transactions were approved. Aggregation is then performed within three datasets -- the new transaction data, the authorized transaction data and the history transaction data. Different aggregation methods are used for different variables (Appendix V). Taking purchase amount as an example, it is aggregated into the maximum, minimum, mean, sum and standard deviation of a card's records, and each of these values becomes a new variable in the dataset.
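As an illustration, the card-level aggregation of purchase_amount could be written as the dplyr sketch below; new_transactions is an assumed name for the raw new-period transaction table, and the team's actual code may differ:

library(dplyr)

# Collapse transaction-level rows into one row per card, creating
# max / min / mean / sum / std features for purchase_amount
# (sd() is NA for cards with a single record; those gaps are filled
# later by the median imputation described below)
new_agg <- new_transactions %>%
  group_by(card_id) %>%
  summarise(new_purchase_amount_max  = max(purchase_amount),
            new_purchase_amount_min  = min(purchase_amount),
            new_purchase_amount_mean = mean(purchase_amount),
            new_purchase_amount_sum  = sum(purchase_amount),
            new_purchase_amount_std  = sd(purchase_amount))

The same pattern applies to the authorized and history transaction tables, producing variables with auth_ and hist_ prefixes.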

Each variable's missing values are replaced by its median. After aggregation, data cleaning and merging the datasets together by card id, the cleaned training dataset contains 184 variables and 201,917 observations. With the cleaned dataset at hand, multiple machine learning techniques were applied for further prediction.
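A minimal sketch of the merge and imputation steps, assuming train is the original train table and new_agg, auth_agg and hist_agg are the aggregated tables from the previous step (the report does not show the exact code):

library(dplyr)

# Merge the card-level aggregates onto the train data by card_id,
# then replace each numeric variable's missing values with its median
impute_median <- function(x) { x[is.na(x)] <- median(x, na.rm = TRUE); x }
clean_train <- train %>%
  left_join(new_agg,  by = "card_id") %>%
  left_join(auth_agg, by = "card_id") %>%
  left_join(hist_agg, by = "card_id") %>%
  mutate(across(where(is.numeric), impute_median))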

Besides, in order to observe the relationship between time periods and purchases, we merged the new merchant and train datasets, split the purchase time into second, day, month and year, and created new time variables, including "purchase_weekday" and a time-of-day bucket (Morning: 5 am to 12 pm EST, Afternoon: 12 pm to 5 pm EST, Evening: 5 pm to 9 pm EST, Night: 9 pm to 5 am EST).
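These time features can be derived along the following lines; merged stands for the merged new-merchant/train table and purchase_date is assumed to be a POSIXct timestamp already expressed in EST:

library(dplyr)
library(lubridate)

merged <- merged %>%
  mutate(purchase_hour    = hour(purchase_date),
         purchase_weekday = wday(purchase_date, label = TRUE),
         purchase_month   = month(purchase_date),
         purchase_year    = year(purchase_date),
         time_of_day      = case_when(
           purchase_hour >= 5  & purchase_hour < 12 ~ "Morning",
           purchase_hour >= 12 & purchase_hour < 17 ~ "Afternoon",
           purchase_hour >= 17 & purchase_hour < 21 ~ "Evening",
           TRUE                                     ~ "Night"))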

Modeling

Because the aggregation introduces multicollinearity, principal component analysis (PCA) is used to transform the data into principal components that are not correlated with each other. According to how much variance each component explains (Appendix IV), 36 principal components were selected and used to transform the dataset. After the transformation, linear regression, lasso regression, ridge regression, ensemble methods, random forest and k-NN are used in the modelling process. However, with this data transformation the RMSE values are even higher than that of the previous decision tree, which shows that the PCA approach does not have good predictive power in this case, and that tree methods tend to give better results (Appendix VI). Therefore, the team further explores tree methods for prediction.
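A sketch of the PCA step, assuming X holds the aggregated numeric predictors of the cleaned training data (the variable names here are illustrative):

# Principal components on the standardized predictors; the first 36
# components explain roughly 80% of the variance (Appendix IV)
pca    <- prcomp(X, center = TRUE, scale. = TRUE)
summary(pca)$importance[3, 1:36]     # cumulative proportion of variance explained
X_pca  <- pca$x[, 1:36]              # component scores fed to the downstream models
fit_lm <- lm(target ~ ., data = data.frame(target = clean_train$target, X_pca))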

A Gradient Boosting Machine (GBM) is then chosen by the team to determine the significance / relative importance of the different variables on the target. It is used here because it assigns more weight to the misclassified data to emphasize the difficult cases, so that the subsequent learners focus on them during training, which results in better performance and improved prediction accuracy. The last 50% of the dataset is used for training and the first 50% is used for testing. Out of the 184 variables obtained from feature engineering and data cleaning, 4 variables that are unique identifiers and do not cause any variation in the GBM are eliminated. The GBM model is then run on the remaining 180 variables with the following settings:

library(gbm)   # gradient boosting machine
gbmfit <- gbm(target ~ ., data = data_train, distribution = "gaussian",
              n.trees = 1000, interaction.depth = 1, shrinkage = 0.1)

A Gaussian distribution is used because the dependent variable, target, takes values ranging from -33 to 18. The number of trees is initially set to 1,000, and the learning rate / shrinkage is set to 0.1, since a smaller shrinkage value (a slower learning rate) is preferable when a large number of trees is built.

The optimal number of iterations is determined using the out-of-bag (OOB) estimate. Here, the best number of iterations for this model is 194 (Appendix IX and Appendix X). The model is then re-run with the optimal number of trees and the GBM summary is obtained. The RMSE obtained for this model on the testing dataset is 3.8554. This RMSE is still relatively high, and the performance of boosting is further improved by using LightGBM.
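The OOB-based selection and test-set evaluation can be sketched as follows, with data_test standing for the testing half described above (a hypothetical name):

# gbm.perf() with method = "OOB" returns the iteration with the best
# estimated out-of-bag improvement -- 194 in this run (Appendix X)
best_iter <- gbm.perf(gbmfit, method = "OOB")
pred_gbm  <- predict(gbmfit, newdata = data_test, n.trees = best_iter)
sqrt(mean((data_test$target - pred_gbm)^2))   # test RMSE, 3.8554 in this project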

LightGBM is also a tree-based machine learning algorithm under the gradient boosting framework. The main difference between LightGBM and a traditional GBM is the way the trees are grown. While other algorithms grow trees level-wise (horizontally), LightGBM grows trees leaf-wise (vertically). When growing from the same leaf, this leaf-wise algorithm can therefore extract the most important information and reduce the loss more than a level-wise algorithm (Appendix XII, Figure 1).

Based on this property, LightGBM is used to predict the loyalty score of the 201,917 unique card IDs, as it provides higher accuracy and faster training, advantages that matter here given the large scale of the dataset.

After feature engineering, 182 of the 184 variables are used for modeling; the remaining 2 variables are eliminated because they are unique IDs and carry no predictive information. The whole LightGBM process is as follows, with a code sketch given after the list:
1. Parameter setting: To limit the extra complexity and overfitting that the leaf-wise algorithm of LightGBM can introduce, max_depth is set to 9 to restrict the splitting depth. After comparing several combinations, the RMSE was found to be not very sensitive to the parameters in this case, so the remaining parameters are kept at their default values.
2. Cross-validation setting: The prepared dataset is randomly split into five folds; each fold in turn serves as the validation set while the remaining folds are used for training, which improves the stability of the model assessment.
3. Iteration: The number of trees is initially set to 10,000. To reduce training time, early stopping is set to 200 rounds, which guarantees that training stops when the validation score does not improve for 200 rounds. In this way, the RMSE of each fold is obtained at its optimal boosting round (Appendix XII, Figure 2).
The final CV score of the LightGBM model is the average validation RMSE across folds, which is 3.658, the best performance among the 16 other methods we tried.
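The report does not include the LightGBM code itself; a minimal sketch using the R lightgbm interface is given below (the team may have used a different interface, and field names such as best_iter can vary slightly across package versions):

library(lightgbm)

# X: matrix of the 182 card-level predictors, y: loyalty target
dtrain <- lgb.Dataset(data = as.matrix(X), label = y)
params <- list(objective = "regression", metric = "rmse", max_depth = 9)
cv <- lgb.cv(params = params, data = dtrain, nrounds = 10000, nfold = 5,
             early_stopping_rounds = 200, verbose = -1)
cv$best_iter                                      # optimal boosting round under early stopping
cv$record_evals$valid$rmse$eval[[cv$best_iter]]   # mean validation RMSE, about 3.658 here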

4. Key Findings

From the decision tree model, feature 1 and feature 2 play the major role in classifying customers. Customers in the following groups have an average target value higher than the overall average: (1) feature 1 equal to 1, 2, 3 or 4 and feature 2 equal to 1 or 2; (2) feature 2 equal to 3 and feature 1 equal to 2 or 4. A higher target value indicates that these customers are more loyal than the average customer.

From the GBM model, the relative importance of the different variables is determined (Appendix VII). Out of the 180 variables, around 20 have a high relative importance (greater than 1) and around 118 have a non-zero influence on the model (Appendix VIII). Some of the significant variables are purchase date, purchase amount, month lag and purchase month. This shows that the timing of a purchase and the purchase amount are key factors in measuring loyalty.

The LightGBM model also finds that purchase date and purchase amount are important variables. After feature engineering, features related to purchase date occupy 5 of the top 10 features; the most important one is the maximum purchase_date from authorized transactions, followed by the maximum purchase_date from new transactions. Among the purchase amount related features, the maximum purchase amount from new transactions (new_purchase_amount_max) and the mean of the maximum purchase amount also have high importance. However, the three features from the original train dataset do not boost the performance of LightGBM significantly (Appendix XII, Figure 3).

According to the time series analysis (Appendix XIII), the number of activated cards rose steeply in 2017, and the bulk of the sales volume is tied to three specific time frames: morning and afternoon (5:00 am to 5:00 pm EST), Mondays and Tuesdays, and January through April.

5. Recommendations

Based on the results of the study, there are several recommendations for ELO and its merchants. Firstly, ELO is advised to cooperate with merchants and offer promotions or discounts to cardholders in the morning and afternoon (5:00 am to 5:00 pm EST), on Mondays and Tuesdays, and from January to April, since the business can take advantage of these periods to improve promotional effectiveness. In addition, to optimize its marketing strategy, we suggest that ELO study its 2017 marketing performance further, identify the drivers of the booming sales, and make full use of them.

Secondly, the tree plot shows that feature 1 and feature 2 are more effective than feature 3 at grouping and detecting customer loyalty. Elo is recommended to improve the design of feature 3 so that it matters to customer loyalty, if that is what Elo intends.

Thirdly, customers with feature 1 equal to 5, and customers with feature 1 equal to 1 or 3 and feature 2 equal to 3, form two groups whose loyalty scores are well below the average level. Elo is therefore advised to manage its relationship with these two groups so that their loyalty scores can increase over time.

Fourthly, customers with feature 1 not equal to 5 and feature 2 equal to 1 or 2 have higher loyalty scores than average, as do customers with feature 1 equal to 2 or 4 and feature 2 equal to 3. Customers with these features tend to be more loyal, so Elo is advised to promote cards with these features in order to gain more loyal customers.

There are limitations in the project as well. First, the card IDs in the transaction data and the train data do not match completely, and several records appear in only one of the datasets, which complicates the prediction steps. Second, the features of the customer cards and the purchased products are both anonymized, with little variance across different target values, which limits the interpretation of the prediction models and the insights gained from the analysis. Third, the calculation of Target is vague; the team needs more information to give suggestions about the usage of this score.
6. Appendices: Tables, Exhibits, Figures

Appendix I. Data Introduction


Train

card_id Unique card identifier

first_active_month 'YYYY-MM', month of first purchase

feature_1 Anonymized card categorical feature

feature_2 Anonymized card categorical feature

feature_3 Anonymized card categorical feature

target Loyalty numerical score calculated 2 months after historical and evaluation period

Historical_transactions

card_id Card identifier

month_lag month lag to reference date

purchase_date Purchase date

authorized_flag 'Y' if approved, 'N' if denied

category_3 anonymized category

installments number of installments of purchase

category_1 anonymized category

merchant_category_id Merchant category identifier (anonymized)


subsector_id Merchant category group identifier (anonymized)

merchant_id Merchant identifier (anonymized)

purchase_amount Normalized purchase amount

city_id City identifier (anonymized)

state_id State identifier (anonymized)

category_2 anonymized category

New_period Transactions

card_id Card identifier

month_lag month lag to reference date

purchase_date Purchase date

authorized_flag 'Y' if approved, 'N' if denied

category_3 anonymized category

installments number of installments of purchase

category_1 anonymized category

merchant_category_id Merchant category identifier (anonymized)

subsector_id Merchant category group identifier (anonymized)

merchant_id Merchant identifier (anonymized)

purchase_amount Normalized purchase amount


city_id City identifier (anonymized)

state_id State identifier (anonymized)

category_2 anonymized category

Merchants

merchant_id Unique merchant identifier

merchant_group_id Merchant group (anonymized)

merchant_category_id Unique identifier for merchant category (anonymized)

subsector_id Merchant category group (anonymized)

numerical_1 anonymized measure

numerical_2 anonymized measure

category_1 anonymized category

most_recent_sales_range Range of revenue (monetary units) in last active month --> A > B > C > D > E

most_recent_purchases_range Range of quantity of transactions in last active month --> A > B > C > D > E

avg_sales_lag3 Monthly average of revenue in last 3 months divided by revenue in last active month

avg_purchases_lag3 Monthly average of transactions in last 3 months divided by transactions in last active month

active_months_lag3 Quantity of active months within last 3 months


avg_sales_lag6 Monthly average of revenue in last 6 months divided by revenue in last active month

avg_purchases_lag6 Monthly average of transactions in last 6 months divided by transactions in last active month

active_months_lag6 Quantity of active months within last 6 months

avg_sales_lag12 Monthly average of revenue in last 12 months divided by revenue in last active month

avg_purchases_lag12 Monthly average of transactions in last 12 months divided by transactions in last active month

active_months_lag12 Quantity of active months within last 12 months

category_4 anonymized category

city_id City identifier (anonymized)

state_id State identifier (anonymized)

category_2 anonymized category


Appendix II. Data Visualization-Descriptive Analysis

Figure 1. Feature1-3 & count of card ID

Figure 2. Category1-3 & count of card ID


Figure 3. First active month & Count of card ID
Figure 4. Installment & count of card ID

Appendix III. Tree Plot


Appendix IV. Cumulative sum of eigenvalues (% variance explained)
First 1 components: 11.51%
First 2 components: 20.59%
First 3 components: 27.58%
First 4 components: 32.66%
First 5 components: 36.34%
First 6 components: 39.65%
First 7 components: 42.35%
First 8 components: 44.69%
First 9 components: 46.92%
First 10 components: 49.14%
First 11 components: 51.05%
First 12 components: 52.87%
First 13 components: 54.69%
First 14 components: 56.46%
First 15 components: 58.19%
First 16 components: 59.84%
First 17 components: 61.46%
First 18 components: 63.06%
First 19 components: 64.57%
First 20 components: 66.04%
First 21 components: 67.41%
First 22 components: 68.68%
First 23 components: 69.81%
First 24 components: 70.85%
First 25 components: 71.84%
First 26 components: 72.81%
First 27 components: 73.65%
First 28 components: 74.49%
First 29 components: 75.29%
First 30 components: 76.09%
First 31 components: 76.82%
First 32 components: 77.52%
First 33 components: 78.21%
First 34 components: 78.86%
First 35 components: 79.48%
First 36 components: 80.09%
First 37 components: 80.69%

First 176 components: 100.00%
First 177 components: 100.00%
First 178 components: 100.00%
First 179 components: 100.00%
First 180 components: 100.00%
First 181 components: 100.00%
Appendix V. Aggregation functions

Appendix VI. Machine Learning RMSE results


Appendix VII. Gradient Boosting Machine - Relative Importance Graph
Appendix VIII. Gradient Boosting Machine - Variables and their Relative Importance
Tabulation

Variables Relative Importance

auth_purchase_date_max 31.946767

new_purchase_amount_std 11.0910439

hist_category_1_sum 6.6192641

new_purchase_amount_max 4.4416875

auth_month_lag_max 4.3413756

auth_purchase_month_std 4.0498309

new_category_1_sum 3.5434138

new_purchase_month_max 3.3074587

new_month_lag_mean 2.8122073

new_purchase_date_min 2.2832693

new_purchase_month_mean 2.0865649

city_id_purchase_amount_max 1.994331

auth_category_1_sum 1.6601309

hist_purchase_amount_sum 1.5600257

hist_purchase_date_ptp 1.4680911

new_purchase_month_min 1.3256561
auth_purchase_date_ptp 1.3116186

new_purchase_amount_sum 1.2307739

auth_purchase_amount_sum 1.1962774

new_category_1_mean 1.1309463

hist_installments_sum 0.9489174

auth_purchase_date_min 0.9205837

new_merchant_category_id_nunique 0.907367

purchase_amount_max_mean 0.8170171

hist_transactions_count 0.6969533

auth_category_1_mean 0.6756207

new_transactions_count 0.5261769

hist_category_1_mean 0.4743934

category_1_purchase_amount_max 0.4319488

new_purchase_amount_mean 0.3774726

new_installments_min 0.3674045

hist_purchase_amount_min 0.3564642

new_purchase_date_max 0.3063942

auth_purchase_amount_max 0.2858968

hist_state_id_nunique 0.2722762
auth_purchase_amount_std 0.2609926

category_1_installments_min 0.2144762

purchase_amount_mean_mean 0.2097846

city_id_purchase_amount_min 0.1868164

new_purchase_amount_min 0.174537

city_id_purchase_amount_mean 0.1634363

auth_purchase_amount_mean 0.1602825

purchase_amount_max_std 0.1567252

new_category_3_B_mean 0.1562249

hist_purchase_date_max 0.1450873

new_merchant_id_nunique 0.1435751

new_subsector_id_nunique 0.1355075

purchase_amount_count_mean 0.1269334

purchase_amount_sum_std 0

purchase_amount_sum_mean 0

purchase_amount_std_std 0

purchase_amount_std_mean 0

purchase_amount_min_std 0

purchase_amount_min_mean 0
purchase_amount_mean_std 0

purchase_amount_count_std 0

new_state_id_nunique 0

new_purchase_month_std 0

new_purchase_date_ptp 0

new_month_lag_std 0

new_month_lag_min 0

new_month_lag_max 0

new_installments_sum 0

new_installments_std 0

new_installments_mean 0

new_installments_max 0

new_count_installments 0

new_count_category_3_C 0

new_count_category_3_B 0

new_count_category_3_A 0

new_count_category_2_5 0

new_count_category_2_4 0

new_count_category_2_3 0
new_count_category_2_2 0

new_count_category_2_1 0

new_count_category_1_Y 0

new_count_category_1_N 0

new_city_id_nunique 0

new_category_3_C_mean 0

new_category_3_A_mean 0

new_category_2_5_mean 0

new_category_2_4_mean 0

new_category_2_3_mean 0

new_category_2_2_mean 0

new_category_2_1_mean 0

new_avg_month_diff 0

month_lag_std 0

month_lag_mean 0

installments_sum_std 0

installments_sum_mean 0

installments_std_std 0

installments_std_mean 0
installments_purchase_amount_std 0

installments_purchase_amount_min 0

installments_purchase_amount_mean 0

installments_purchase_amount_max 0

installments_min_std 0

installments_min_mean 0

installments_mean_std 0

installments_mean_mean 0

installments_max_std 0

installments_max_mean 0

installments_count_std 0

installments_count_mean 0

hist_subsector_id_nunique 0

hist_purchase_month_std 0

hist_purchase_month_min 0

hist_purchase_month_mean 0

hist_purchase_month_max 0

hist_purchase_date_min 0

hist_purchase_amount_std 0
hist_purchase_amount_mean 0

hist_purchase_amount_max 0

hist_month_lag_std 0

hist_month_lag_min 0

hist_month_lag_mean 0

hist_month_lag_max 0

hist_merchant_id_nunique 0

hist_merchant_category_id_nunique 0

hist_installments_std 0

hist_installments_min 0

hist_installments_mean 0

hist_installments_max 0

hist_city_id_nunique 0

hist_category_3_C_mean 0

hist_category_3_B_mean 0

hist_category_3_A_mean 0

hist_category_2_5_mean 0

hist_category_2_4_mean 0

hist_category_2_3_mean 0
hist_category_2_2_mean 0

hist_category_2_1_mean 0

feature_3 0

feature_2 0

feature_1 0

count_installments 0

count_category_3_C 0

count_category_3_B 0

count_category_3_A 0

count_category_2_5 0

count_category_2_4 0

count_category_2_3 0

count_category_2_2 0

count_category_2_1 0

count_category_1_Y 0

count_category_1_N 0

city_id_purchase_amount_std 0

category_1_purchase_amount_std 0

category_1_purchase_amount_min 0

category_1_purchase_amount_mean 0

category_1_installments_std 0

category_1_installments_mean 0

category_1_installments_max 0

avg_month_diff 0

auth_transactions_count 0

auth_subsector_id_nunique 0

auth_state_id_nunique 0

auth_purchase_month_min 0

auth_purchase_month_mean 0

auth_purchase_month_max 0

auth_purchase_amount_min 0

auth_month_lag_std 0

auth_month_lag_min 0

auth_month_lag_mean 0

auth_merchant_id_nunique 0

auth_merchant_category_id_nunique 0

auth_installments_sum 0

auth_installments_std 0
auth_installments_min 0

auth_installments_mean 0

auth_installments_max 0

auth_city_id_nunique 0

auth_category_3_C_mean 0

auth_category_3_B_mean 0

auth_category_3_A_mean 0

auth_category_2_5_mean 0

auth_category_2_4_mean 0

auth_category_2_3_mean 0

auth_category_2_2_mean 0

auth_category_2_1_mean 0

Appendix IX. GBM Summary

gbm(formula = target ~ ., distribution = "gaussian", data = data_train,
    n.trees = 1000, interaction.depth = 1, shrinkage = 0.1)
A gradient boosted model with gaussian loss function.
1000 iterations were performed.
There were 180 predictors of which 118 had non-zero influence.

Appendix X. OOB Estimates - For Optimal Number of Iterations

Initially we ran the model for 1,000 trees, but from the OOB estimates the best number of iterations is determined to be 194:

[1] 194
attr(,"smoother")
Call:
loess(formula = object$oobag.improve ~ x, enp.target = min(max(4, length(x)/10), 50))

Number of Observations: 1000
Equivalent Number of Parameters: 40
Residual Standard Error: 0.001170247

Appendix XII. LightGBM


Figure.1 Difference between algorithms under GBM framework
Figure.2 Best Iteration of each fold

Figure.3 Feature Importance


Appendix XIII. Data Visualization – Recommendation

Relationship between purchase amount and time of day

Relationship between purchase amount and time of month


Relationship between purchase amount and weekday

Relationship between purchase amount and time of year
