net/publication/327337318
Banage T. G. S. Kumara
Sabaragamuwa University of Sri Lanka
All content following this page was uploaded by Banage T. G. S. Kumara on 01 March 2019.
2018 International Conference On Business Innovation (ICOBI), 25-26 August 2018, NSBM, Colombo, Sri Lanka
maintained by telecom companies for each of their customers, which keep changing rapidly due to the competitive environment. The information includes details about billing, calls and network data. The huge availability of information opens up scope for using data mining techniques on the telecom database. The available information can be analyzed from different perspectives to give operators various ways to predict and reduce churn. Only the details from the given information that are relevant and contribute to the study are used in the analysis. Data mining techniques are used to discover interesting patterns within the data, which helps to learn to predict whether a customer will churn or not based on the customer's data stored in the database.

C. Research Objectives
The main objective of this research is to produce a predictive model with better results that assesses the customer churn rate of telecommunication companies using the predictive analytics algorithms of data mining.

The supporting objectives examined are to:
i. Cluster customers into various categories to enhance marketing and promotional activities.
ii. Mine the relevant patterns embedded in the

4 | Tariffs | The type of customer, whether a prepaid or post-paid customer
5 | Tenure | Length of time a customer has been with a particular subscriber
6 | Credit purchase amount (CpM) | Approximates the amount used to purchase call credits a month in rupees
7 | Data purchase amount (DpM) | Approximates the amount used to purchase data bundles a month in rupees
8 | Internet usage | Identifies whether the customer has used the internet facility or not
9 | Product innovation | Determines whether product innovation is necessary for sustaining customers
10 | Churn | Identifies whether the customer has changed networks or not

B. Data Pre-processing
The training and testing datasets used in this research may include missing, repeated or inconsistent data. Data pre-processing is done to handle missing data and to remove duplicated data values. The RapidMiner tool is used at this stage to pre-process the data for analysis and mining. For the cluster analysis, the Pearson chi-square test and predictive model building, the data types have to be converted into numerical values.

TABLE 2: CODES FOR ALTERNATIVES
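The pre-processing steps described in Section B (removing duplicates, handling missing values and converting categories to numerical codes) can be sketched as follows. This is only an illustration in plain Python, not the RapidMiner process used in the study: the sample records, the choice of filling a missing value with the most frequent value, and the resulting integer codes are all assumptions and do not reproduce the codes of Table 2.

```python
from collections import Counter


def remove_duplicates(rows):
    """Keep the first occurrence of each identical record."""
    seen = set()
    unique = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def fill_missing(rows, column):
    """Replace None in one column with that column's most frequent value (assumed strategy)."""
    present = [row[column] for row in rows if row[column] is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [{**row, column: mode if row[column] is None else row[column]} for row in rows]


def encode_column(rows, column):
    """Map each category to an integer code (hypothetical codes, in alphabetical order)."""
    codes = {value: i + 1 for i, value in enumerate(sorted({row[column] for row in rows}))}
    return [{**row, column: codes[row[column]]} for row in rows], codes


# Hypothetical survey records, invented for illustration only.
records = [
    {"Gender": "Male", "Tariffs": "Prepaid", "Churn": "Yes"},
    {"Gender": "Female", "Tariffs": None, "Churn": "No"},
    {"Gender": "Male", "Tariffs": "Prepaid", "Churn": "Yes"},   # exact duplicate
    {"Gender": "Female", "Tariffs": "Postpaid", "Churn": "No"},
    {"Gender": "Male", "Tariffs": "Prepaid", "Churn": "No"},
]

clean = remove_duplicates(records)
clean = fill_missing(clean, "Tariffs")
code_books = {}
for column in ("Gender", "Tariffs", "Churn"):
    clean, code_books[column] = encode_column(clean, column)
```

After these steps every field in `clean` is an integer, which is the form required by the cluster analysis, the chi-square test and the predictive models.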
means clustering produced four clusters from the 200 collected records.
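Assuming the clustering referred to here is k-means (the start of the sentence is cut off), the idea of partitioning 200 records into four clusters can be sketched with a minimal implementation of Lloyd's algorithm. The two-dimensional synthetic points below are invented purely for illustration; the survey variables actually used for clustering are not reproduced here, and the function names are hypothetical.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, max_iterations=100, seed=0):
    """Lloyd's algorithm: alternate between assigning points and updating centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Synthetic stand-in for the 200 collected records: 50 points around each of 4 centres.
rng = random.Random(42)
centres = [(0, 0), (0, 5), (5, 0), (5, 5)]
points = [(rng.gauss(cx, 0.5), rng.gauss(cy, 0.5)) for cx, cy in centres for _ in range(50)]

labels, centroids = k_means(points, k=4)
cluster_sizes = [labels.count(c) for c in range(4)]
```

In practice the categorical survey answers would first be converted to numerical codes, since the distance computation only works on numbers.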
Pattern Growth (FP-Growth) algorithm was used to mine associations between variables that result in a churn decision, with particular interest in and focus on confidence. The generated association rules model is presented in Figure 2.

Figure 2: Association rules model

Table 6 shows the ten (10) generated association rules, which were selected by filtering for rules whose conclusion is a churn decision of yes and sorted in descending order of confidence. The sorted rules have a maximum confidence of 95.5 percent and a minimum of 84.6 percent.

No | Premises | Conclusion | Support | Confidence | Laplace
1 | InternetUsage=Yes, Gender=Male and Tenure=3-5 years | Churn_Yes | 0.105 | 0.955 | 0.995
2 | Tariffs=Prepaid, Gender=Male and Tenure=3-5 years | Churn_Yes | 0.100 | 0.952 | 0.995
3 | Gender=Male and Tenure=3-5 years | Churn_Yes | 0.130 | 0.929 | 0.991
4 | MobileNetworkOftenUsed=Mobitel and Tenure=3-5 years | Churn_Yes | 0.115 | 0.920 | 0.991
5 | InternetUsage=Yes and MobileNetworkOftenUsed=Mobitel | Churn_Yes | 0.105 | 0.913 | 0.991
6 | Tariffs=Prepaid and Tenure=3-5 years | Churn_Yes | 0.230 | 0.902 | 0.980
7 | InternetUsage=Yes, Tariffs=Prepaid and Tenure=3-5 years | Churn_Yes | 0.190 | 0.884 | 0.979
8 | Tenure=3-5 years | Churn_Yes | 0.270 | 0.871 | 0.969
9 | Tariffs=Prepaid, Gender=Female and Tenure=3-5 years | Churn_Yes | 0.130 | 0.867 | 0.983
10 | InternetUsage=Yes, Tariffs=Prepaid and Gender=Female | Churn_Yes | 0.110 | 0.846 | 0.982

B. Predictive Model Building
Using the valid variables identified in the Pearson Chi-square test, four predictive models are created with the IBM SPSS Modeler 18.0 data mining software. Four classification modeling techniques are used to create the predictive models: the C5.0 tree, the Bayesian network, the Neural Network and Logistic regression. The optimal model is recommended based on the individual models and their performance metrics.

An auto classifier was applied to the C5.0 tree model created in Figure 3 to test whether the selected C5.0 algorithm would be identified as one of the best algorithms for creating the predictive model.

Figure 3: C5.0 algorithm tree model

The C5.0 algorithm was listed among the suggested churn algorithms that were applied to the data. In Figure 3, a matrix was applied to create a table showing the relationship between the fields Churn and $C-Churn. In the model created above, analysis and evaluation are used to create a report and a chart for comparing the accuracy of the predictive models.

Figure 4: Bayesian network model

Figure 5: Neural network model

Figure 6: Logistic regression model
and effectiveness of the model in predicting customer churn in Telecommunication.

Generating the logistic regression model produced a statistical model that consists of two mathematical equations for calculating the probability that a person is a churner or a non-churner.

Equation 1: Calculating Y'

\[ Y' = 0.0682\,\mathrm{Gender} + (-0.00182)\,\mathrm{Age} + 0.04558\,\mathrm{Occupation} + 0.00001458\,\mathrm{MonthlyIncome} + (-0.7214)\,\mathrm{Tenure} + (-0.2053)\,\mathrm{Tariffs} + 0.00001024\,\mathrm{CpM} + 2.659 \]

Equation 2: Calculating P(1)

\[ P(1) = \frac{\exp(Y')}{1 + \exp(Y')} \]

Equation 1 consists of the most relevant variables, those that most affect the churn decision. The values of the variables are substituted into this equation to calculate the value of Y'. The calculated Y' value is then substituted into Equation 2 to calculate the value of P(1). The prediction of whether a customer is a churner or a non-churner depends on this P(1) value.

If the P(1) value is equal to or greater than 0.5, the prediction result is positive and the person will be a churner. If the P(1) value is less than 0.5, the result is closer to 0 (zero), which means the prediction result is negative and the person will be a non-churner.

C. Model Evaluation
The four models are evaluated by testing the significance of the predictive model generated. The performance metrics of all the models were compared for optimal performance using the Area Under the Receiver Operating Characteristic Curve (AUROC).

TABLE 7: CONFUSION MATRIX WITH TRAINING DATA
Model | Class | no | yes | % correct
C5.0 | no | 45 | 20 | 69.2
C5.0 | yes | 10 | 125 | 92.5
C5.0 | Overall Percentage | | | 85%
BN | no | 47 | 18 | 72.3
BN | yes | 24 | 111 | 82.2
BN | Overall Percentage | | | 79%
LR | no | 27 | 38 | 41.5
LR | yes | 18 | 117 | 86.6
LR | Overall Percentage | | | 72%
NN | no | 21 | 44 | 32.3
NN | yes | 16 | 119 | 88.1
NN | Overall Percentage | | | 70%

TABLE 8: ACCURACY AND AUC VALUE OF EACH MODEL
Model | Accuracy (%) | AUC Value
C5.0 algorithm model | 85 | 0.888
BN model | 79 | 0.886
LR model | 72 | 0.762
NN model | 70 | 0.759

Contrasting the four models, the C5.0 decision tree algorithm proved to be the optimal model, with 85% accuracy and an AUC value of 0.888, for customer churn analysis and prediction in Telecommunication based on the chosen variables and attributes.

D. Model Testing
The optimal model identified by the evaluation is tested on the dataset designed to test the model. The C5.0 algorithm model was used because it was identified as the most optimal among the models, and it was tested using the test data collected from customers. The test data has 50 observations and 7 variables, coded in the same way as in Table 2. The test dataset covers the gender, age, monthly income, occupation and the other demographic and operational variables used to develop the model. Predictions are then made to indicate which customers are likely to churn and which are not. The predictor variables and the target variable used in building the predictive churn model were tested for significance.
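As a worked illustration of how a fitted model turns customer attributes into a churn prediction, the logistic regression of Equations 1 and 2 can be evaluated directly, since it is the only model given in closed form (the C5.0 tree used for testing is not reproduced in the text). The coefficients below are copied from Equation 1; the example customer and its coded values are hypothetical, because the codes of Table 2 are not available here, and the function names are assumptions made for this sketch.

```python
import math

# Coefficients and constant term copied from Equation 1.
COEFFICIENTS = {
    "Gender": 0.0682,
    "Age": -0.00182,
    "Occupation": 0.04558,
    "MonthlyIncome": 0.00001458,
    "Tenure": -0.7214,
    "Tariffs": -0.2053,
    "CpM": 0.00001024,
}
INTERCEPT = 2.659


def linear_score(customer):
    """Equation 1: Y' as the weighted sum of the coded variables plus the constant."""
    return INTERCEPT + sum(COEFFICIENTS[name] * customer[name] for name in COEFFICIENTS)


def churn_probability(customer):
    """Equation 2: P(1) = exp(Y') / (1 + exp(Y')), written in the equivalent sigmoid form."""
    return 1.0 / (1.0 + math.exp(-linear_score(customer)))


def predict_churn(customer, threshold=0.5):
    """A customer is predicted to churn when P(1) is at least 0.5."""
    return churn_probability(customer) >= threshold


# Hypothetical coded customer record, for illustration only.
example = {
    "Gender": 1,
    "Age": 30,
    "Occupation": 2,
    "MonthlyIncome": 50000,
    "Tenure": 2,
    "Tariffs": 1,
    "CpM": 500,
}

y = linear_score(example)
p = churn_probability(example)
is_churner = predict_churn(example)
```

Because the Tenure coefficient is negative, a higher tenure code lowers P(1) when all other values are held fixed.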
Figure 7: Test model for C5.0 algorithm

The test data is applied by mapping the dataset to the model designed by the C5.0 algorithm, as indicated in Figure 7. Further model screening and applications are initiated to define the output in determining the likelihood of churn. The test results show that the model predicted that 36 customers will churn, with confidence ranging from 100% down to 55.6%. The results further show that over 62% of the predicted churn customers have a confidence above 80%. According to Figure 8, the results also indicate that the churn customers have stayed with their network for more than 5 years. Because it is more expensive to acquire new customers than to retain existing ones, the prediction of churners and the reasons proffered earlier need close attention. The top 10 churners and non-churners predicted by the model are presented in Figures 8 and 9 respectively. The source of the test dataset can be connected to the database or server of the company to produce a real-time output of churn results for decision making.

Figure 9: Results of test predictions_No

V. DISCUSSION AND CONCLUSION

Data mining is a symbolic tool in the Telecommunication industry that can exploit the large volume of data generated for pattern analysis. The recent increasing embrace of the predictive algorithms of data mining has given companies room to assess their future success, challenges, and targets. This research brings to the fore the relevant untapped customer data and knowledge for churn prediction and customer classification for better decision making. Customer clusters were developed in this research to determine the involvement of customers, their interest areas and the reasons for the churn decision. The results of the cluster analysis can be used for promotional and direct marketing purposes to assess marketing strategies in the industry. In addition, the association rule mining provided significant results that present relevant knowledge of the factors that have a huge influence on the revenues and growth of the Telecommunication companies. Telecommunication companies must grasp this finding and work to retain their clients. The C5.0 Decision tree model, the Bayesian Network model, the Logistic Regression model and the Neural Network model were used and compared to find the most optimal model that predicts accurately. The C5.0 decision tree algorithm proved optimal among the models, with 85 percent accuracy and an AUC value of 0.888. The C5.0 decision tree model can therefore be recommended for churn management. The models can be used by industry with the IBM SPSS Modeler or any other appropriate tool with the same algorithm. The Telecommunication companies can connect the models directly to their servers or databases to produce real-time results.