
Department of Computer Engineering

ZEAL College of Engineering and Research

Mini Project Report


on
“Analysis of Electronic-Card Transaction
Services”
By
Ms. Sanchita Mandlik    Roll No: B211058
Ms. Kajal Kamble        Roll No: B211045

Under the guidance of


Prof. R. T. Waghmode

Department Of Computer Engineering


Zeal College of Engineering and Research
SAVITRIBAI PHULE PUNE UNIVERSITY
2019-2020
Department of Computer Engineering
ZEAL College of Engineering and Research

CERTIFICATE

This is to certify that

Sanchita Mandlik (B211058)
Kajal Kamble (B211045)

of class BE Computer have successfully completed the mini project work on

"Analysis of Electronic-Card Transaction Services" at Zeal College
of Engineering and Research, Pune, in partial fulfillment of the
graduate degree course in BE at the Department of Computer
Engineering in the academic year 2019-2020, Semester VII, as
prescribed by the Savitribai Phule Pune University.

Prof. R. T. Waghmode                          Dr. Sangave
Guide                                         Head of Department

(Department of Computer Engineering)
ZEAL College of Engineering and Research

INDEX

1.1 Name of Project
1.2 Outcomes
1.3 Dataset used and its link
Introduction
Software Requirement Specification
    3.1 Hardware Requirements
    3.2 Software Requirements
    3.3 Work Flow
        3.3.1 Data Pre-processing
Classification Algorithm
    4.1 Classification Algorithms used
        4.1.1 Decision Tree
        4.1.2 K-Nearest Neighbour
        4.1.3 Random Forest
Output Screens
Data Analysis Graphs
Comparison of Classification
Conclusion
Bibliography

Name of Project:

Analysis of Electronic-Card-Transaction Services

Outcomes:

• Classification of the services of Electronic-Card-Transaction.

• Calculation of accuracy, precision, recall and F1-score.

Introduction:

Nowadays, data mining plays a vital role in various fields and is one of the
most important areas of research, with the objective of finding meaningful
information in data stored in huge data sets. Data mining, or knowledge
discovery, has become an area of growing significance because it helps in
analyzing data from different perspectives and summarizing it into useful
information. Data mining is defined as extracting information from huge sets
of data.

Here we use the Electronic-Card-Transaction data set with 600 samples. We
classify the services of the data set using Data_value, Period, Magnitude,
Units and Status. Data preprocessing is a data mining technique that involves
transforming raw data into an understandable format. Real-world data is often
incomplete, inconsistent and lacking in certain behaviors or trends, and is
likely to contain many errors. Data preprocessing is a proven method of
resolving such issues. After data preprocessing, we apply classifiers to
predict the services of the data set: the Decision Tree classifier (ID3),
KNN and the Random Forest classifier.

Dataset used and its Link:

• Data set: Electronic-Card-Transaction.

• Link: http://groups.google.com/group/get-theinfo/www.kdnuggets.com/datasets/electronic_card_transaction
Series Reference | Period  | Data Value | Status | Units   | Magnitude | Series title 1 | Series title 2 (y)
-----------------|---------|------------|--------|---------|-----------|----------------|----------------------
ECTA.S19A1       | 2001.03 | NaN        | C      | Dollars | 6         | Actual         | RTS total industries
ECTA.S19A1       | 2002.03 |            | C      | Dollars | 6         | Actual         | RTS total industries
ECTA.S19A9       | 2001.03 |            | C      | Dollars | 6         | Actual         | Total
ECTA.S19A9       | 2004.03 | 36878.7    | C      | Dollars | 6         | Actual         | Total
ECTA.S4AWP       | 2018.03 | 50.4       | F      | Percent | 0         | Actual         | Debit card
ECTA.S1GA7       | 2012.03 | 6174.4     | F      | Dollars | 6         | Actual         | Fuel
ECTA.S1GA8       | 2001.03 |            | C      | Dollars | 6         | Actual         | Non-retail
ECTA.S29A1       | 2012.03 | 1038177483 | F      | Number  | 0         | Actual         | RTS total industries
ECTA.S29AW       | 2010.03 | 251415302  | F      | Number  | 0         | Actual         | Credit
ECTA.S1GA5       | 2005.03 | 2421.6     | F      | Dollars | 0         | Actual         | Apparel

Sample Dataset

SRS (Software Requirement Specification):

• Software Requirements:
  PyCharm IDE
  Python 3.7
• Hardware Requirements:
  Laptop/PC, 4 GB RAM, Windows-based 64-bit OS.

Data Preprocessing Done:

Data preprocessing is a data mining technique used to transform raw data
into a useful and efficient format.

Import CSV dataset:

The basic process of loading data from a CSV file into a Pandas DataFrame
is achieved using the "read_csv" function in Pandas.
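As a minimal sketch of this step (the file name electronic_card_transaction.csv is an assumption; substitute the actual path of the downloaded dataset):

    import pandas as pd

    # Load the dataset into a DataFrame; the file name is an assumption.
    dataset = pd.read_csv("electronic_card_transaction.csv")
    print(dataset.head())   # inspect the first few rows
    print(dataset.shape)    # (number of rows, number of columns)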

Convert into array (Independent and Dependent variables):

The variables can be classified as independent (X) and dependent (Y)
variables. The independent variables are used to determine the dependent
variable, which in our dataset is the last column.

In our dataset, we consider the independent variables Period, Data_value,
STATUS, UNITS and Magnitude, and the dependent variable Series_title_2,
which represents the services we are going to classify.
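A sketch of this separation (the column names follow the dataset description above and may need adjusting to the exact CSV headers):

    # Independent variables (X) and dependent variable (y).
    feature_cols = ["Period", "Data_value", "STATUS", "UNITS", "Magnitude"]
    X = dataset[feature_cols].values        # features as a NumPy array
    y = dataset["Series_title_2"].values    # target: the service to classify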

Steps Involved in Data Preprocessing:

1. Label Encoding:

Label encoding refers to converting the labels into numeric form so as to
convert them into a machine-readable form.

In our dataset, two columns of X have string values. We convert them into
numeric form using the label encoder function. The STATUS column has 4
values, i.e. F, P, C, R, and the UNITS column has 3 values, i.e. Dollars,
Percent, Number.

Label encoding converts the data into machine-readable form by assigning a
unique number (starting from 0) to each class of data. This may lead to a
priority issue during training: a label with a high value may be considered
to have higher priority than a label with a lower value.
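A sketch of this encoding with scikit-learn's LabelEncoder (the column positions of STATUS and UNITS inside X follow the feature order assumed above):

    from sklearn.preprocessing import LabelEncoder

    # Encode the STATUS column (values F, P, C, R) and the UNITS column
    # (values Dollars, Percent, Number) as integers starting from 0.
    X[:, 2] = LabelEncoder().fit_transform(X[:, 2])   # STATUS
    X[:, 3] = LabelEncoder().fit_transform(X[:, 3])   # UNITS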

2. Handling Missing Values:

Missing data is always a problem in real-life scenarios. Machine learning
and data mining models face severe issues in the accuracy of their
predictions because of poor data quality caused by missing values. In these
areas, missing-value treatment is a major point of focus for making models
more accurate and valid.

Here we use the imputer function for handling the missing values. In our
dataset, the Data_value column has missing values as blank spaces. These
are read as NaN values, and each NaN value is replaced by the mean of the
Data_value column.
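A sketch of the mean imputation (newer scikit-learn versions replace the old Imputer class with SimpleImputer, which is used here):

    import numpy as np
    from sklearn.impute import SimpleImputer

    # Replace NaN entries in the Data_value column (index 1 in the assumed
    # feature order) with the mean of that column.
    imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
    X[:, 1:2] = imputer.fit_transform(X[:, 1:2].astype(float))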

3. Transformation:

This step transforms the data into forms appropriate for the mining
process. It is done in order to scale the data values into a specified
range (-1.0 to 1.0 or 0.0 to 1.0).

Here, we use the Standard Scaler for data scaling.

The Standard Scaler assumes the data is normally distributed within each
feature and scales it so that the distribution is centred around 0 with a
standard deviation of 1.

The mean and standard deviation are calculated for each feature, and the
feature is then scaled as:

    x_scaled = (x_i - mean(x)) / stdev(x)
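A sketch of the scaling step (in this report the whole feature matrix is scaled before splitting, as described above):

    from sklearn.preprocessing import StandardScaler

    # Centre each feature at 0 and scale it to unit standard deviation:
    # x_scaled = (x - mean(x)) / stdev(x)
    scaler = StandardScaler()
    X = scaler.fit_transform(X.astype(float))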

Splitting into Testing and Training datasets:

The data we use is usually split into training data and test data. The
training set contains a known output, and the model learns on this data in
order to generalize to other data later on. We keep the test dataset (or
subset) in order to test our model's predictions on unseen samples.

From sklearn's model_selection sub-library we import train_test_split and
use it to split the independent and dependent variables into training and
testing sets, i.e. X_trainset, X_testset, Y_trainset, Y_testset.

The test_size=0.3 argument of the function indicates the fraction of the
data that should be held out for testing. Based on this split, we predict
the values of Y using the training set.
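A sketch of the split (random_state is an assumption, fixed here so the split is reproducible):

    from sklearn.model_selection import train_test_split

    # Hold out 30% of the samples for testing.
    X_trainset, X_testset, Y_trainset, Y_testset = train_test_split(
        X, y, test_size=0.3, random_state=0)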

Classification Algorithms Used:

A. Decision Tree (ID3):

A decision tree is a flowchart-like tree structure where an internal node
represents a feature (or attribute), a branch represents a decision rule,
and each leaf node represents the outcome. The topmost node in a decision
tree is known as the root node. The tree learns to partition the data on
the basis of attribute values, partitioning recursively in a process called
recursive partitioning. This flowchart-like structure helps in decision
making. Its visualization, like a flowchart diagram, easily mimics
human-level thinking, which is why decision trees are easy to understand
and interpret.

Algorithm:
1. Compute the entropy of the data set.
2. For every attribute/feature:
   a. Calculate the entropy for all categorical values.
   b. Take the average information entropy for the current attribute.
3. Pick the attribute with the highest information gain.
4. Repeat until the desired tree is obtained.

Attribute Selection Measures

An attribute selection measure (ASM) is a heuristic for selecting the
splitting criterion that partitions the data in the best possible manner.
It is also known as a splitting rule because it helps determine breakpoints
for tuples on a given node. An ASM provides a rank to each feature (or
attribute) by explaining the given dataset; the attribute with the best
score is selected as the splitting attribute. In the case of a
continuous-valued attribute, split points for branches also need to be
defined. The most popular selection measures are Information Gain, Gain
Ratio, and Gini Index.

Entropy

A decision tree is built top-down from a root node and involves
partitioning the data into subsets that contain instances with similar
values (homogeneous). The ID3 algorithm uses entropy to calculate the
homogeneity of a sample. If the sample is completely homogeneous, the
entropy is zero; if the sample is equally divided, it has an entropy of one.
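Formally, for a sample S containing c classes, entropy is computed as

    Entropy(S) = - Σ p_i · log2(p_i),  i = 1, ..., c

where p_i is the proportion of examples in S that belong to class i.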

Entropy controls how a decision tree decides to split the data; it affects
where a decision tree draws its boundaries. Entropy is a measure of the
impurity, disorder or uncertainty in a set of examples.

Information Gain

Shannon invented the concept of entropy, which measures the impurity of an
input set. In physics and mathematics, entropy refers to the randomness or
impurity in a system; in information theory, it refers to the impurity in a
group of examples. Information gain is the decrease in entropy: it computes
the difference between the entropy before a split and the average entropy
after the split of the dataset based on the given attribute values. The ID3
(Iterative Dichotomiser) decision tree algorithm uses information gain:

    Gain(T, X) = Entropy(T) - Entropy(T, X)
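A sketch of the classifier used here: scikit-learn's DecisionTreeClassifier is CART-based, but setting criterion="entropy" makes it choose splits by information gain, matching the ID3 idea described above. X_trainset etc. come from the split shown earlier.

    from sklearn.tree import DecisionTreeClassifier

    # criterion="entropy" selects splits by information gain, as in ID3.
    dt = DecisionTreeClassifier(criterion="entropy", random_state=0)
    dt.fit(X_trainset, Y_trainset)
    dt_pred = dt.predict(X_testset)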

B. K-Nearest Neighbor:

K-Nearest Neighbor is a simple algorithm that stores all the available
cases and classifies new data or cases based on a similarity measure. It
searches the pattern space for the k training tuples that are closest to
the unknown test tuple. Closeness is calculated using the Euclidean
distance. The KNN classifier can be extremely slow when classifying test
tuples, since each classification requires O(n) distance computations. KNN
is a type of instance-based learning, or lazy learning, where the function
is only approximated locally and all computation is deferred until
classification.

For both classification and regression, a useful technique is to assign
weights to the contributions of the neighbors, so that the nearer neighbors
contribute more to the average than the more distant ones. For example, a
common weighting scheme consists in giving each neighbor a weight of 1/d,
where d is the distance to the neighbor. The neighbors are taken from a set
of objects for which the class (for KNN classification) or the object
property value (for KNN regression) is known. This can be thought of as the
training set for the algorithm, though no explicit training step is
required.

In the classification phase, k is a user-defined constant, and an unlabeled
vector (a query or test point) is classified by assigning the label which
is most frequent among the k training samples nearest to that query point.

Parameter Selection

The best choice of k depends upon the data; generally, larger values of k
reduce the effect of noise on the classification but make the boundaries
between classes less distinct. A good k can be selected by various
heuristic techniques (see hyperparameter optimization). The special case
where the class is predicted to be the class of the closest training sample
(i.e. when k = 1) is called the nearest neighbor algorithm. The accuracy of
the KNN algorithm can be severely degraded by the presence of noisy or
irrelevant features, or if the feature scales are not consistent with their
importance. Much research effort has been put into selecting or scaling
features to improve classification, for example using evolutionary
algorithms to optimize feature scaling, or scaling features by the mutual
information of the training data with the training classes.

Properties

KNN is a special case of a variable-bandwidth, kernel-density "balloon"
estimator with a uniform kernel. The naive version of the algorithm is easy
to implement by computing the distances from the test example to all stored
examples, but it is computationally intensive for large training sets.
Using an approximate nearest neighbor search algorithm makes KNN
computationally tractable even for large data sets. Many nearest neighbor
search algorithms have been proposed over the years; these generally seek
to reduce the number of distance evaluations actually performed.

KNN has some strong consistency results. As the amount of data approaches
infinity, the two-class KNN algorithm is guaranteed to yield an error rate
no worse than twice the Bayes error rate (the minimum achievable error rate
given the distribution of the data). Various improvements to KNN speed are
possible by using proximity graphs.

Classification

A case is classified by a majority vote of its neighbors, with the case
being assigned to the class most common amongst its k nearest neighbors,
as measured by a distance function. If k = 1, then the case is simply
assigned to the class of its nearest neighbor.

Distance Function

The Euclidean distance between two points X = (x_1, ..., x_n) and
Y = (y_1, ..., y_n) is:

    d(X, Y) = sqrt((x_1 - y_1)^2 + ... + (x_n - y_n)^2)

Example: Consider the following data set and predict the service for the
given new tuple.

Data value | Period | Service
-----------|--------|--------------
36422      | 2007   | RTS industry
16307      | 2004   | Credit
80468      | 2017   | Total
62614      | 2012   | Total
22272      | 2007   | Credit

New tuple: X = (20210, 2006), with k = 3.

By Euclidean distance:

1) d(X, 1) = sqrt((20210 - 36422)^2 + (2006 - 2007)^2) ≈ 16212

2) d(X, 2) = sqrt((20210 - 16307)^2 + (2006 - 2004)^2) ≈ 3903

3) d(X, 3) = sqrt((20210 - 80468)^2 + (2006 - 2017)^2) ≈ 60258

4) d(X, 4) = sqrt((20210 - 62614)^2 + (2006 - 2012)^2) ≈ 42404

5) d(X, 5) = sqrt((20210 - 22272)^2 + (2006 - 2007)^2) ≈ 2062

Since k = 3, the three smallest distances are:

a. 2062 - Credit
b. 3903 - Credit
c. 16212 - RTS industry

The majority class is Credit. Hence the new tuple X is assigned the
service Credit.

Algorithm:

1. Store the training samples in an array of data points arr[], where each
   element of the array represents a tuple (x, y).
2. For i = 0 to m: calculate the Euclidean distance d(arr[i], query point).
3. Make a set S of the K smallest distances obtained. Each of these
   distances corresponds to an already classified data point.
4. Return the majority label among S.
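A sketch with scikit-learn's KNeighborsClassifier (k = 3 matches the worked example above; Euclidean distance is the default metric):

    from sklearn.neighbors import KNeighborsClassifier

    # k = 3 nearest neighbours, Euclidean distance (the default metric).
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_trainset, Y_trainset)
    knn_pred = knn.predict(X_testset)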

C. Random Forest:

A random forest classifier creates a set of decision trees from randomly
selected subsets of the training set. It then aggregates the votes from the
different decision trees to decide the final class of the test object. The
Random Forest classifier is an ensemble algorithm.

Ensemble algorithms combine more than one algorithm of the same or
different kinds for classifying objects, for example running predictions
with Naive Bayes, SVM and a Decision Tree and then taking a vote to decide
the final class of the test object. The basic parameters of a Random Forest
classifier are the total number of trees to be generated and
decision-tree-related parameters such as minimum split, split criteria,
etc. Random forest is like a bootstrapping algorithm with the Decision Tree
(CART) model.
Say we have 1000 observations in the complete population, with 10
variables. Random forest tries to build multiple CART models with different
samples and different initial variables. For instance, it will take a
random sample of 100 observations and 5 randomly chosen initial variables
to build a CART model. It will repeat the process (say) 10 times and then
make a final prediction on each observation. The final prediction is a
function of the individual predictions; it can simply be the mean of each
prediction.

Algorithm:
1. Randomly select "k" features from the total "m" features, where k << m.
2. Among the "k" features, calculate the node "d" using the best split
   point.
3. Split the node into daughter nodes using the best split.
4. Repeat steps 1 to 3 until "l" nodes have been reached.
5. Build the forest by repeating steps 1 to 4 "n" times to create "n" trees.

Algorithm (Prediction):

1. Take the test features, use the rules of each randomly created decision
   tree to predict the outcome, and store the predicted outcome (target).
2. Calculate the votes for each predicted target.
3. Take the most-voted predicted target as the final prediction of the
   random forest algorithm.
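A sketch with scikit-learn's RandomForestClassifier (n_estimators is the number of trees, "n" in the algorithm above; 100 is an assumed, commonly used value):

    from sklearn.ensemble import RandomForestClassifier

    # Build "n" = 100 trees and aggregate their votes.
    rf = RandomForestClassifier(n_estimators=100, random_state=0)
    rf.fit(X_trainset, Y_trainset)
    rf_pred = rf.predict(X_testset)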

Presentation:

Output Screens:

(Screenshots of the classifier runs.)

Data Analysis Plots (Graphs):

(Plots produced during data analysis.)

Comparison of classification algorithms:


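A minimal sketch of how such a comparison can be produced, assuming the fitted classifiers and predictions from the previous sections:

    from sklearn.metrics import accuracy_score, classification_report

    # Compare the three classifiers on the held-out test set.
    for name, pred in [("Decision Tree", dt_pred),
                       ("KNN", knn_pred),
                       ("Random Forest", rf_pred)]:
        print(name, "accuracy:", accuracy_score(Y_testset, pred))
        print(classification_report(Y_testset, pred))  # precision, recall, F1-score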

Conclusion:

In this project, we analyzed a dataset of electronic card services. We
applied data preprocessing to this dataset and used three classifiers,
Decision Tree, KNN and Random Forest, for classification. We calculated the
precision, recall and F1 score of each classifier. As a result, it was
observed that the Decision Tree classifier is superior to the other
classifiers in terms of accuracy.

Bibliography:

https://www.geeksforgeeks.org/data-preprocessing-in-data-mining/

https://scikit-learn.org/stable/modules/preprocessing.html

https://www.datacamp.com/community/tutorials/k-nearest-neighbor-classification-scikit-learn
