
Received: January 1, 2019 1

Poverty Prediction Data Processing Based on E-Commerce Data Using the K-Nearest Neighbor Method
and Information Theoretical Based Feature Selection

Tiara Fatehana Aulia, Dedy Rahman Wijaya, Elis Hernawati


School of Applied Science, Telkom University
Email: tiarafatehana@gmail.com

Abstract: The Central Statistics Agency (BPS) is a government agency that works in the field of household economic and
social needs. Every two years BPS conducts Susenas (the National Socio-Economic Survey) to estimate poverty levels
in Indonesia. Every year BPS is tasked with providing information on the economic and social condition of the
community. With today's rapid development, there are many methods for predicting poverty levels. One of them draws
on the rapid growth of e-commerce in Indonesia, whose data can indicate the current level of poverty. Therefore, the
authors built an application to complement BPS in predicting poverty levels in an area: a poverty rate prediction
application based on e-commerce data that uses the K-Nearest Neighbor method and Information Theoretical Based
feature selection. The application was built following the waterfall model, using the Python programming language and
a MySQL database. It is expected to complement the BPS Census and Susenas in predicting poverty levels in an area.

Keywords: The Central Statistics Agency (BPS), waterfall, Python, MySQL, K-Nearest Neighbor, Information
Theoretical Based.

International Journal of Intelligent Engineering and Systems

1. Introduction

1.1 Background

Poverty is a condition in which people are unable to fulfill basic needs such as food, clothing, shelter, education and health. The Indonesian government has made progress in reducing poverty in recent years. However, many Indonesians remain vulnerable because they live only marginally above the national poverty line. In March 2019, the Central Statistics Agency noted that the poverty rate in Indonesia reached 25.14 million people (9.41 percent) of the total population. By contrast, in 2007 the poverty rate was at its highest, 37.17 million people (20.37 percent) [1].

Poverty data are obtained by the Central Statistics Agency (BPS) through the National Socio-Economic Survey (Susenas). Susenas is a survey that collects data on household needs in the fields of education, health and nutrition, housing, socio-cultural activities, household income and household expenditure. Susenas has been carried out by BPS since 1963. Every three years, BPS conducts Susenas for survey module data in Indonesia. However, Susenas takes a long time to carry out, because households are interviewed using the Consumption and Expenditure questionnaire, and it incurs significant costs [2].

E-commerce in Indonesia developed very rapidly in 2018. Indonesia is a market with attractive e-commerce growth from year to year. Census data from the Central Statistics Agency (BPS) also indicate that the e-commerce industry in Indonesia grew by 17 percent over the last 10 years, with the total number of e-commerce businesses reaching 26.2 million units. Over the past 4 years, e-commerce growth in Indonesia has reached 500 percent. Besides these data, the potential of the e-commerce industry in Indonesia is also influenced by online shopping habits, such as those of the millennial generation [3].

Given these problems, one way to address the long duration and high cost of Susenas is to use the kNN (K-Nearest Neighbor) machine learning method together with Information Theoretical Based feature selection. Machine learning is used here specifically to predict poverty levels in an area from the e-commerce data generated by Indonesians. E-commerce data is useful for predicting poverty rates because it reflects the average history of purchases of goods or houses in an area. The e-commerce data was obtained from Pulse Lab Jakarta - United Nations Global Pulse, using a dataset taken from the OLX e-commerce platform (olx.com). This machine learning approach can help produce poverty predictions from e-commerce data that are more accurate and faster to obtain, and that can complement the census in predicting poverty levels in an area.

1.2 Formulation of the Problem

Based on the background described above, the problems are:

1. How can e-commerce data be used to complement the survey and census results for an area in Indonesia so that less time and money are required?

2. How can the results of poverty prediction based on e-commerce data be presented by applying the K-Nearest Neighbor and Information Theoretical Based methods?

3. How can the items that influence poverty prediction be identified?

1.3 Purpose

The purpose of this Final Project is to build an application that can:

1. Implement the K-Nearest Neighbor method and Information Theoretical Based feature selection to predict poverty based on e-commerce data.

2. Display graphs of the poverty level prediction results based on e-commerce data, using the K-Nearest Neighbor machine learning method and the Information Theoretical Based approach.

3. Apply a feature selection algorithm, namely Information Theoretical Based.

2. Related Work

2.1 Poverty

Table 1 Other methods for predicting an area's poverty level

| No | Dataset | Method | Country | Result | References |
|----|---------|--------|---------|--------|------------|
| 1 | Landsat 7 satellite, 2000-2010 | CNN to predict sunlight intensity classes (0, 1, or 2) | Malawi, Nigeria, Rwanda, Tanzania, and Uganda | VGG-F model and GBT2 model | A. Perez, C. Yeh, G. Azzari, M. Burke, D. Lobell, and S. Ermon, "Poverty Prediction with Public Landsat 7 Satellite Imagery and Machine Learning," no. Nips, 2017. |
| 2 | Cell phone records, 2010 | Calculate feature vectors for each BTS (Base Transceiver Station) | Latin America | Voronoi tessellation | V. Soto, V. Frias-Martinez, J. Virseda, and E. Frias-Martinez, "Prediction of socioeconomic levels using cell phone records," Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 6787 LNCS, no. 1, pp. 377–388, 2011. |
| 3 | Satellite data, 2009 | OLS (Operational Linescan System) data collection | Taiwan, South Korea, Saudi Arabia, Japan, Belgium, the Netherlands, Italy, United Kingdom, United States, Canada, and Czech | LandScan 2004 | C. D. Elvidge et al., "A global poverty map derived from satellite data," Comput. Geosci., vol. 35, no. 8, pp. 1652–1660, 2009. |
2.2 K-Nearest Neighbor

Table 2 Predictions for various problems using the K-Nearest Neighbor method

| No | Name and Year | Problem | Method | Result |
|----|---------------|---------|--------|--------|
| 1 | Khalid Alkhatib et al., 2013 [4] | Fluctuations in stock prices | K-Nearest Neighbor | Time series data prediction, with prediction results that are close to the original price. |
| 2 | Yisheng Lv, 2009 [5] | Traffic accidents | K-Nearest Neighbor | Prediction of traffic accident triggers based on real-time data. |
| 3 | Ricky Imanuel Ndaumanu, 2014 [6] | Resignation of students | K-Nearest Neighbor | Analysis of predicted student resignation rates. |
| 4 | Nihru Nafi, Indriati, and Budi Darma Setiawan, May 2017 [7] | Recruitment of prospective teachers and employees | K-Nearest Neighbor | Prediction of the acceptance of prospective teachers and employees based on the values of each criterion. |

3. Proposed Methodology

The following is an overview of the proposed preprocessing system.

Figure 1 Pre-processing stages of the proposed method

As Figure 1 shows, the data pass through several processes before the data processing stage of the e-commerce dataset-based application for predicting poverty levels using the K-Nearest Neighbor method and the Information Theoretical Based algorithm.

1. Data Process

a) Business Understanding Phase

To determine the poverty rate, BPS (Statistics Indonesia) uses data from the National Socio-Economic Survey (Susenas), which requires a very long time and a great deal of effort because approximately 300,000 households spread across 514 cities or districts in Indonesia must be interviewed. During Susenas, field staff spend approximately 2 hours interviewing each household. For this reason, an Information Theoretical Based algorithm and the K-Nearest Neighbor machine learning method are applied to the OLX e-commerce dataset from Pulse Lab Jakarta to complement the survey data needed by BPS.

b) Data Understanding Phase

The application is developed using Information Theoretical Based algorithms and machine learning on the OLX e-commerce dataset from Pulse Lab Jakarta, which covers listings for cars, motorcycles, houses, apartments, and land. The features of the dataset are listed in Table 3.

Table 3 Features in the e-commerce dataset

No Feature
1 sum_price_car
2 avg_price_car
3 std_price_car
4 sum_sold_car
5 avg_sold_car
6 std_sold_car
7 sum_viewer_car
8 avg_viewer_car
9 std_viewer_car
10 sum_buyer_car
11 avg_buyer_car
12 std_buyer_car

13 sum_price_motor
14 avg_price_motor
15 std_price_motor
16 sum_sold_motor
17 avg_sold_motor
18 std_sold_motor
19 sum_viewer_motor
20 avg_viewer_motor
21 std_viewer_motor
22 sum_buyer_motor
23 avg_buyer_motor
24 std_buyer_motor
25 sum_price_rumah_sell
26 avg_price_rumah_sell
27 std_price_rumah_sell
28 sum_sold_rumah_sell
29 avg_sold_rumah_sell
30 std_sold_rumah_sell
31 sum_viewer_rumah_sell
32 avg_viewer_rumah_sell
33 std_viewer_rumah_sell
34 sum_buyer_rumah_sell
35 avg_buyer_rumah_sell
36 std_buyer_rumah_sell
37 sum_price_rumah_rent
38 avg_price_rumah_rent
39 std_price_rumah_rent
40 sum_sold_rumah_rent
41 avg_sold_rumah_rent
42 std_sold_rumah_rent
43 sum_viewer_rumah_rent
44 avg_viewer_rumah_rent
45 std_viewer_rumah_rent
46 sum_buyer_rumah_rent
47 avg_buyer_rumah_rent
48 std_buyer_rumah_rent
49 sum_price_apt_sell
50 avg_price_apt_sell
51 std_price_apt_sell
52 sum_sold_apt_sell
53 avg_sold_apt_sell
54 std_sold_apt_sell
55 sum_viewer_apt_sell
56 avg_viewer_apt_sell
57 std_viewer_apt_sell
58 sum_buyer_apt_sell
59 avg_buyer_apt_sell
60 std_buyer_apt_sell
61 sum_price_apt_rent
62 avg_price_apt_rent
63 std_price_apt_rent
64 sum_sold_apt_rent
65 avg_sold_apt_rent
66 std_sold_apt_rent
67 sum_viewer_apt_rent
68 avg_viewer_apt_rent
69 std_viewer_apt_rent
70 sum_buyer_apt_rent
71 avg_buyer_apt_rent
72 std_buyer_apt_rent
73 sum_price_land_sell
74 avg_price_land_sell
75 std_price_land_sell
76 sum_sold_land_sell
77 avg_sold_land_sell
78 std_sold_land_sell
79 sum_viewer_land_sell
80 avg_viewer_land_sell
81 std_viewer_land_sell
82 sum_buyer_land_sell
83 avg_buyer_land_sell
84 std_buyer_land_sell
85 sum_price_land_rent
86 avg_price_land_rent
87 std_price_land_rent
88 sum_sold_land_rent
89 avg_sold_land_rent
90 std_sold_land_rent
91 sum_viewer_land_rent
92 avg_viewer_land_rent
93 std_viewer_land_rent
94 sum_buyer_land_rent
95 avg_buyer_land_rent
96 std_buyer_land_rent

From the features of the e-commerce dataset listed above, feature selection is performed to find the features that yield the highest prediction accuracy.

2. Normalization

After data processing, the data are normalized so that all features are scaled to a common range of values. This normalization uses the Rescaling method (min-max normalization). The basic formula of the Rescaling method is given below.
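Min-max rescaling subtracts a feature's minimum from each value, divides by the feature's range, and multiplies by the upper bound of the target scale. The following is a minimal Python sketch of that idea, not the authors' implementation: the function name `min_max_scale`, the handling of constant features, and the sample values are assumptions made for illustration, and the upper bound of 10 follows the 0 to 10 scale used in the paper.

```python
def min_max_scale(values, upper=10.0):
    """Rescale a list of numbers to the range [0, upper] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature has no spread, so map it to 0
        # instead of dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) * upper for v in values]


# Hypothetical values for one feature column (not taken from the OLX dataset).
avg_price_car = [120.0, 80.0, 200.0, 140.0]
scaled = min_max_scale(avg_price_car)
print(scaled)
```

The smallest value maps to 0 and the largest to 10, so features with very different units (prices, view counts) end up on a comparable scale before distances are computed.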

$$\mathit{MinMax} = \frac{x - \min(x)}{\max(x) - \min(x)} \times 10$$

where:

x = the value of the feature.

min(x) = the lowest value of each feature.

max(x) = the highest value of each feature.

3. Information Theoretical Based Feature Selection

After normalization, the data enter the feature selection process, for which several Information Theoretical Based algorithms are provided. The feature selection process assigns each feature its own score. The scored features then pass to the feature filtering process, which chooses the most relevant features to use in predicting the poverty level. The algorithms used are CIFE (Conditional Infomax Feature Extraction), MRMR (Minimum Redundancy Maximum Relevance), and DISR (Double Input Symmetrical Relevance).

4. K-Nearest Neighbor

After the relevant features have been selected, the data are passed to the machine learning method, which is k-nearest neighbor. KNN uses the following distance formula.

$$d(x, y) = \sqrt{\sum_{i=1}^{m} (x_i - y_i)^2}$$

where d(x, y) denotes the distance between x and y. The idea of this formula comes from the Pythagorean formula.

$$c = \sqrt{a^2 + b^2}$$

5. Evaluation

Several performance metrics are available for regression. Two regression metrics are used here, namely R2 (R-squared) and RMSE (root mean squared error). R2 measures the proportion of the variance in the actual values that the regression model can predict: if R2 = 1 the model predicts the values perfectly, whereas a negative R2 means the model predicts worse than simply using the mean of the actual values. RMSE measures the error between the actual and predicted vectors. The lower the RMSE, the smaller the difference between the actual and predicted values.

4. Result and Discussion

After the analysis and design stages, the implementation (coding) phase of the application is carried out. In this prediction application, the code uses 3 different feature selection algorithms with different characteristics.

4.1 Data Preparation

At the data preparation stage, several steps are carried out in order to obtain the best results. Data preparation consists of pre-processing and data normalization. The coding stages for the application are as follows:

1. Data Pre-processing Stage

At the data pre-processing stage, data cleaning is carried out to eliminate ambiguous data that do not meet expectations, disturbing data such as -2 values, and inconsistent data, all of which can hinder the next process. This process also changes null values to 0.

2. Data Normalization Stage

After the data have been cleaned and null values changed to 0, the data normalization process follows. This process rescales all data to the range 0-10.

4.2 Implementation of Feature Selection

Before making predictions, feature selection with the Information Theoretical Based algorithms is needed to determine which features are most relevant for predicting poverty data, so that the prediction is more accurate.
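To make the idea of information-theoretic feature ranking concrete, the following is a minimal pure-Python sketch of a greedy MRMR-style selection. It is an illustration under stated assumptions, not the implementation used in the paper: values are discretized into equal-width bins (two by default), the score is relevance minus mean redundancy, and the function names (`discretize`, `mutual_information`, `mrmr_select`) and all sample values are hypothetical.

```python
import math
from collections import Counter


def discretize(values, bins=2):
    """Map numeric values to integer bin labels using equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete label sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log((c * n) / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def mrmr_select(features, target, k, bins=2):
    """Greedily pick k features with high relevance to the target
    and low redundancy with the features already selected."""
    y = discretize(target, bins)
    binned = {name: discretize(col, bins) for name, col in features.items()}
    relevance = {name: mutual_information(col, y) for name, col in binned.items()}
    selected = []
    while len(selected) < min(k, len(binned)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(binned[name], binned[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        candidates = [name for name in binned if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Hypothetical per-area feature values and poverty rates (illustration only).
features = {
    "avg_price_car": [10, 12, 11, 30, 32, 31, 29, 33],
    "sum_price_car": [20, 24, 22, 60, 64, 62, 58, 66],  # duplicates avg_price_car
    "avg_viewer_car": [5, 6, 20, 5, 21, 6, 22, 20],
}
poverty_rate = [12.0, 13.0, 12.5, 11.0, 5.0, 4.0, 6.0, 5.5]
top_features = mrmr_select(features, poverty_rate, k=2)
print(top_features)
```

In this toy data, `sum_price_car` carries the same information as `avg_price_car`, so the redundancy penalty pushes it behind `avg_viewer_car` in the second pick. CIFE and DISR rank features with different scoring criteria built on the same kind of information-theoretic quantities.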
1) CIFE Feature Selection

Figure 2 shows the feature ranking results of CIFE.

Figure 2 Feature ranking result of CIFE feature selection

2) MRMR Feature Selection

Figure 3 shows the feature ranking results of MRMR.

Figure 3 Feature ranking result of MRMR feature selection

3) DISR Feature Selection

Figure 4 shows the feature ranking results of DISR.

Figure 4 Feature ranking result of DISR feature selection

4.3 Implementation of K-Nearest Neighbor

The data prediction stage builds on feature selection: the accuracy (R2) and RMSE values of the predictions are computed using the n best features from the feature selection results. Figure 5 shows the graphical implementation using KNN with MRMR feature selection.

Figure 5 Implementation of kNN

4.4 Evaluation

The trained machine learning models described above need to be tested. This section presents the test results of the K-Nearest Neighbor model. The KNN model is tested with several different feature selection algorithms: CIFE, DISR, and MRMR. The test measures the accuracy of the predictions obtained with each algorithm.

1. Prediction Testing with CIFE Feature Selection

Figure 6 shows the accuracy test with R2 and RMSE for kNN machine learning using CIFE feature selection.

Figure 6 RMSE and R-squared of CIFE feature selection

The accuracy obtained with each different set of features produces a different graph. Figure 7 shows the prediction graph.

Figure 7 RMSE and R-squared graph of CIFE feature selection

2. Prediction Testing with MRMR Feature Selection

Figure 8 shows the accuracy test with R2 and RMSE for kNN machine learning using MRMR feature selection.
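The testing procedure described in this section (predict with kNN, then score the predictions with R2 and RMSE) can be sketched in pure Python as follows. This is a minimal illustration under assumptions, not the authors' code: the prediction is the mean target of the k nearest training rows, k = 3 is an arbitrary choice, and the function names and all data values are hypothetical.

```python
import math


def euclidean(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_x, train_y, query, k=3):
    """Predict a value as the mean target of the k nearest training rows."""
    rows = sorted(zip(train_x, train_y), key=lambda row: euclidean(row[0], query))
    nearest = rows[:k]
    return sum(target for _, target in nearest) / len(nearest)


def rmse(actual, predicted):
    """Root mean squared error between actual and predicted values."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def r_squared(actual, predicted):
    """Coefficient of determination: 1 minus residual over total sum of squares."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


# Hypothetical areas described by two selected features on a 0-10 scale,
# with their poverty rates in percent (illustration only).
train_x = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [8.0, 9.0], [9.0, 8.0], [10.0, 10.0]]
train_y = [20.0, 18.0, 19.0, 6.0, 5.0, 4.0]
test_x = [[1.0, 1.0], [9.0, 9.0]]
test_y = [18.0, 6.0]

predictions = [knn_predict(train_x, train_y, q, k=3) for q in test_x]
print(predictions, rmse(test_y, predictions), r_squared(test_y, predictions))
```

Repeating this evaluation while varying the number n of top-ranked features is one way to obtain the kind of R2 and RMSE curves the figures in this section report.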

Figure 8 RMSE and R-squared of MRMR feature selection

The accuracy obtained with each different set of features produces a different graph. Figure 9 shows the prediction graph.

Figure 9 RMSE and R-squared graph of MRMR feature selection

3. Prediction Testing with DISR Feature Selection

Figure 10 shows the accuracy test with R2 and RMSE for kNN machine learning using DISR feature selection.

Figure 10 RMSE and R-squared of DISR feature selection

The accuracy obtained with each different set of features produces a different graph. Figure 11 shows the prediction graph.

Figure 11 RMSE and R-squared graph of DISR feature selection

5. Conclusion

After carrying out the stages of application development with the chosen method (waterfall), namely needs analysis, design, system design, program code implementation, and testing of the Poverty Prediction Application Based on E-Commerce Data Using the KNN Method and Information Theoretical Based Algorithms, the following conclusions can be drawn:

1. This application meets the needs of users in poverty prediction using the KNN method and the Information Theoretical Based algorithm.

2. This application is able to display graphs of the prediction results and of the accuracy of poverty data prediction based on e-commerce data.

References

[1] B. P. Statistik, "Badan Pusat Statistik," [Online]. Available: https://www.bps.go.id/subject/23/kemiskinan-dan-ketimpangan.html. [Accessed 29 September 2019].

[2] B. P. Statistik, "Survey Sosial Ekonomi Nasional," p. 1, 2007.

[3] N. Rahayu, "WE Online," 19 February 2019. [Online]. Available: https://www.wartaekonomi.co.id. [Accessed 23 October 2019].

[4] Khalid Alkhatib, Hasan Najadat, Ismail Hmeidi, and Mohammed Ali Shatnawi, "Stock Prediction Using K-Nearest Neighbor (kNN) Algorithm," International Journal of Business, Humanities and Technology, vol. 3, pp. 32-45, 2013.


[5] Lv Yisheng and Shuming Tang, "Realtime Highway Traffic Accident Prediction Based on the k-Nearest Neighbor Method," in International Conference on Measuring Technology and Mechatronics Automation, 2009.

[6] Basu Swastha, Manajemen Penjualan. Yogyakarta: Badan Penerbit Fakultas Ekonomi, Universitas Gajah Mada, 2001.

[7] Nihru Nafi', Indrianti and Budi Darma, "Penerapan Metode K-Nearest Neighbor (kNN) dalam Penerimaan Calon Guru dan Karyawan Tata Usaha (Studi Kasus: SMP Muhammadiyah 2 Kediri)," May 2017.

