
A SURVEY OF DATA MINING TECHNIQUES IN BUSINESS APPLICATIONS

Dr. Th. Shanta Kumar


Principal, Scholars Institute of Technology and Management, Guwahati, Assam.

Abstract
With the advancement of Information Technology, entrepreneurs are exploring every
possible direction to extract information about their customers and products in order
to increase their sales. As transactional data accumulates over time, it becomes too
large to analyse by hand. Data mining is the study of extracting hidden information
from a large repository of data, making it an efficient tool for modern business
organisations in making and improving business decisions. With it, organisations can
differentiate between normal and valuable customers, predict future sales and identify
fast-moving items, helping management make proactive, knowledge-based decisions.
Data mining techniques readily provide solutions that were time-consuming to obtain
in the past. There are different techniques of data mining, each with its own
advantages and disadvantages. Organisations can select techniques depending on their
business applications. This paper surveys these techniques from a business perspective.
Keywords: Associations, Business applications, Classification, Clustering, Data mining,
Regression

1. Introduction
These days data storage is finding its niche in business applications. Modern organisations strive
to maintain databases of their products, sales, customers etc., and use all possible techniques to
make the best of them. New business techniques are emerging in which the study of the economics
of products, sales and customers is given first priority in order to stay ahead in the present
market.
2. Significance of the study
There are several reasons why this study receives attention today and did not in the past
(Berry & Linoff, 2004).
First, before Information Technology (IT) was used, transactional databases were not
maintained. As people started using IT in business applications, the size of these transactional
databases grew exponentially. People who make organisational decisions treat this large dataset
as raw material for knowledge extraction. In reality, decision making needs a large amount of
data in order to study trends in the organisation. The growing use of point-of-sale scanners in
big malls, automatic teller machines, credit and debit cards, e-business etc. has contributed an
enormous amount of data.
Second, computers used to be costly and limited to a few scientific laboratories. Now
there is a continued decrease in the price of hardware, including processors, disks and memory.
Vendors also compete to increase I/O bandwidth, and the successful introduction of parallel
processing has taken data analysis to a higher level.
Third, interest in Customer Relationship Management has increased among
organisations. They have come to realise that their customers are their asset, and the study of
customer information is their highest priority.
3. Review of Literature
Rygielski, Wang and Yen (2002) have shown how the different techniques of data mining can
be used in customer relationship management. They compared the evolutionary stages of data
mining over the decades, along with the enabling technologies, product providers and their
characteristics. They also showed how these technologies answered different questions in
business applications.
Joseph (2013) demonstrated the significant role of data warehousing combined with data
mining technology in business applications. He showed how a data warehouse can serve as a
repository of relational databases designed for query and analysis, helping business
organisations consolidate data from different sources. Data mining techniques are applied to
these warehouses in order to extract the hidden information.
Karahoca, Karahoca and Sanver (2012) presented a detailed study of the application of
data mining techniques in business. Their survey lists researchers along with their areas of work
from 1996 to 2012, segmented into categories such as predictive methods, time series analysis
and descriptive methods.
This article is a survey of some of the important techniques in data mining which can be
used in business applications. It starts with an introduction to data mining, then its applications in
business, followed by the popular techniques of data mining. Through one case study, I
try to offer a closer look at k-means clustering and show how it can be used to extract the hidden
trends within a transactional database for enhanced business. Finally, conclusions based on the
discussion are drawn.
4. Framework of the Study
The framework of the study has been organized under four sub-sections.
4.1 Data Mining
4.2 Applications of Data Mining
4.3 Data Mining Techniques
4.4 Case Study
4.1 Data mining
Data mining refers to extracting or mining knowledge from large amounts of data (Han &
Kamber, 2003). The goal of data mining is to help administrators and decision makers with
information regarding marketing, sales and customer support operations through a better
understanding of their customers. Data mining employs combinations of techniques from
statistics, computer science and machine learning. Selection of a particular technique depends
on the problem under study, the availability of data and the expertise of the miner. It should be
noted that the patterns and information generated by data mining cannot be trusted without
verification. Data mining simply helps business analysts by generating hypotheses, which need
to be validated. Data mining is broadly divided into two categories: directed and undirected.
Directed techniques are supervised learning techniques that use training data consisting of
training examples. In this learning, the examples are pairs of {input object, desired output value}.
Directed learning techniques analyse the training data and produce an inferred rule called a
classifier, which should predict the correct output value for any valid input object.
Undirected techniques are unsupervised and determine how the data are organised without the
help of any training data.

4.2 Applications of data mining


Data mining is used to construct different types of models for solving business problems:
classification, regression, clustering, association analysis, and sequence discovery (Edelstein,
2014). Classification and regression are used to make predictions, while object
behavior can be studied through association and sequence discovery. Clustering can be used for
either forecasting or description. It can also be used for outlier analysis.
4.2.1 Sales and Marketing: As data mining extracts the hidden information accumulated in the
database over a long period of time, it helps in expanding the business or launching new
products. Zentut (2014) suggests different ways of applying data mining in sales and marketing:
i) It is used in market basket analysis, which analyses customers' buying habits by finding
the associations among the items. Such a finding can help decision makers develop
marketing strategies based on items which are frequently purchased together.
ii) It can sometimes serve as a reminder to customers of items they may have missed.
iii) The retailer can gain in-depth knowledge of customers' behavior in terms of
buying patterns.
4.2.2 Banking and Finance
i) There has been extensive research in distributed data mining, which has been developed to
help detect fraudulent credit card activity.
ii) It can also be used to identify customers' loyalty to the organization by studying their
purchasing behavior, such as frequency of visits and amounts spent.
These characteristics can be compared over a period of time. By analyzing these parameters,
a report can be generated for individual customers. The higher the relative measure,
the more loyal the customer.
iii) Data mining can also help the bank in retaining credit card customers. By analyzing past
records it can predict whether a customer is likely to change their credit card affiliation. In
such situations the bank can recommend special offers to retain these customers.
iv) Transactions made by group customers can also be recognized by using data mining.
v) Stock trading rules can be identified from historical market data.
4.2.3 Health Care and Insurance: The success of the insurance industry depends, to a great
extent, on its ability to convert data and information into knowledge. An organization needs to
study its customers, competitors and markets. Of late, data mining has come into use
within insurance companies and has proved quite successful. Some of its applications are:
i) It can predict which customers are likely to buy new policies.
ii) By studying customers' behavior patterns, data mining identifies risky customers.
iii) Data mining can be used in the analysis of claims, which helps in identifying the medical
procedures that are claimed together.
iv) Data mining also helps in finding fraudulent activities by customers.
4.2.4 Medicine: Of late the field of medical science has started to explore the powers of data
mining. It is being applied to handle different queries of knowledge discovery in the health
sector.
i) Data mining can be used in error handling. As human beings are prone to errors, errors
can occur accidentally. Hospitals use sophisticated instruments handled by
doctors, nurses and other staff. A database can be maintained to keep track of these
errors, and correlation analysis can be applied in order to avoid such accidents in the near future.
ii) Researchers have started using classification algorithms to facilitate early detection of
different diseases, including heart disease, which is a major concern all over the world
(Cheng et al., 2006). Data mining can also be used to monitor trends in clinical trials of
cancer vaccines. By using data mining combined with visualization, researchers have shown
that patterns and anomalies are seen more easily than with a set of tabulated data alone
(Cao et al., 2008).
4.2.5 Transportation
i) There are complex situations in which determining the distribution schedules among
warehouses and outlets becomes difficult. Data mining helps in such circumstances. It also
analyses loading patterns.
ii) Leu et al. (2000) examined the possibility of using data mining to predict
tunnel support stability with the help of a data mining algorithm called artificial
neural networks (ANN). Since data mining requires a large database of earlier cases for
the study, applications in construction must be limited to those where such
databases are readily available.
4.3 Data mining techniques
Four broad techniques of data mining are covered here. There are a number of other
algorithms, as well as enhancements of the algorithms described. These algorithms are frequently
used to solve real-world problems. The techniques are:
4.3.1 Association Analysis
4.3.2 Classification
4.3.3 Clustering
4.3.4 Regression

Figure-1: Techniques of Data mining (diagram: Data Mining branches into Association,
Classification, Clustering and Regression)

4.3.1 Association Analysis: Association analysis is a technique for discovering
interesting relationships hidden in large data sets (Tan, Steinbach and Kumar, 2005). The
uncovered relationships can be represented in the form of association rules or sets of frequent
items. A rule has two parts, an antecedent (if) and a consequent (then). The antecedent is an item
selected from the available dataset, while the consequent is one or more items found
together with the antecedent. These are found by applying an algorithm to the dataset.
An example of an association rule would be "If a customer buys bread, he is 90% likely to also
purchase butter."
Association Rule: An association rule is an implication expression of the form $X \rightarrow Y$,
where $X$ and $Y$ are disjoint itemsets, i.e., $X \cap Y = \emptyset$. Two parameters indicate the
strength of an association rule: support and confidence. Support determines how often
a rule is applicable to a given data set, while confidence determines how frequently items in $Y$
appear in transactions that contain $X$. The formal definitions of these metrics are
Support, $s(X \rightarrow Y) = \sigma(X \cup Y) / N$    (1)
Confidence, $c(X \rightarrow Y) = \sigma(X \cup Y) / \sigma(X)$    (2)
where $\sigma(\cdot)$ is the number of transactions that contain the given itemset and $N$ is the
total number of transactions.
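As a minimal sketch of equations (1) and (2), the following Python snippet computes the support and confidence of a single rule. The five transactions and the helper names (support_count, support, confidence) are hypothetical and chosen only for illustration; they are not taken from the dataset used in this paper.

```python
# Hypothetical transactions, each a set of purchased items (illustration only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"butter", "milk"},
    {"bread", "butter", "jam"},
]


def support_count(itemset, transactions):
    """sigma(itemset): number of transactions that contain every item of itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """Equation (1): s(X -> Y) = sigma(X u Y) / N."""
    return support_count(x | y, transactions) / len(transactions)


def confidence(x, y, transactions):
    """Equation (2): c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(x | y, transactions) / support_count(x, transactions)


x, y = {"bread"}, {"butter"}
print(f"support    = {support(x, y, transactions):.2f}")
print(f"confidence = {confidence(x, y, transactions):.2f}")
```

In this toy data, bread and butter occur together in 3 of 5 transactions and bread occurs in 4, so the rule bread -> butter has support 0.60 and confidence 0.75. The confidence helper assumes the antecedent occurs at least once; otherwise the division is undefined.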
After studying the transactions of items being sold in a shop, we can create a dataset of the
items under study, which we call items. Table 1 below illustrates four rules generated by
applying the algorithm. If the minimum confidence threshold is set to 75%, then only
rules 2, 3 and 4 are retained. Accordingly, knowledge-based decisions
can be made about which items go well together. The storekeeper can also check the
availability of these items.
Table-1: Rules generated for items dataset
Rule   Antecedent   Consequent   Confidence (%)
1      I1^I2        I5           55
2      I2^I5        I1           93
3      I4           I1^I3        90
4      I2^I4        I3           80
Source: Compiled data
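The selection of rules by a minimum confidence threshold can be sketched in a few lines of Python. The rules are copied from Table-1; the tuple layout and the names rules, MIN_CONFIDENCE and selected are assumptions made for this illustration.

```python
# Rules copied from Table-1: (rule number, antecedent, consequent, confidence in %).
rules = [
    (1, ("I1", "I2"), ("I5",), 55),
    (2, ("I2", "I5"), ("I1",), 93),
    (3, ("I4",), ("I1", "I3"), 90),
    (4, ("I2", "I4"), ("I3",), 80),
]

MIN_CONFIDENCE = 75  # minimum confidence threshold in %, as in the text

# Keep only the rules whose confidence reaches the threshold.
selected = [rule for rule in rules if rule[3] >= MIN_CONFIDENCE]

for number, antecedent, consequent, conf in selected:
    print(f"Rule {number}: {' ^ '.join(antecedent)} -> {' ^ '.join(consequent)} ({conf}%)")
```

With the 75% threshold, rule 1 (55%) is dropped and rules 2, 3 and 4 remain.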
Some of the popular algorithms are:
i) Apriori Algorithm
ii) Frequent-Pattern Growth (FP Growth)

4.3.2 Classification: Classification is the task of assigning an object
to one of several predefined categories. This analysis helps provide a better
understanding of large data. It uses two datasets. The first, called the training set, is a
small dataset used by a classification algorithm to generate the rules. The second, the prediction
or test set, is used to verify the accuracy of the classification rules already generated. If the
accuracy is acceptable, the rules are used to classify tuples into the classes.
Suppose a travel agent would like to classify his clients into the classes of Travel Mode
{Bus, Car, Train}. Certain important attributes are selected, such as Gender, Own Car, Travel
Cost and Income, and their relationship to the classes is studied, followed by generation of the
rules. Once this is done, the rules can be applied to the prediction set to find the class into which
each tuple falls, as shown in Table-2 and Table-3.
Table-2: Training set for travel database
Gender   Own Car   Travel Cost   Income   Travel mode (class)
Male     No        Cheap         Low      Bus
Male     Yes       Cheap         Medium   Bus
Female   Yes       Expensive     High     Car
Female   Yes       Standard      Medium   Train

Table-3: Prediction set for travel database
Gender   Own Car   Travel Cost   Income   Travel mode (class)
Male     Yes       Standard      Medium   ?
Female   No        Standard      Low      ?


Using the example above, a rule which predicts some of the tuples may be represented as:
If (Travel Cost=Cheap AND Gender=Male)
OR (Travel Cost=Cheap AND Gender=Female
AND Own Car=No)
Then Travel mode=Bus
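The rule above can be written as a small Python function and checked against the tuples of Table-2 and Table-3. This is only a sketch: the function name predict_travel_mode and the tuple layout are assumptions, and returning None for tuples the rule does not cover is a choice made here for illustration.

```python
def predict_travel_mode(gender, own_car, travel_cost):
    """Return 'Bus' when the rule fires, otherwise None (tuple not covered by the rule)."""
    if (travel_cost == "Cheap" and gender == "Male") or (
        travel_cost == "Cheap" and gender == "Female" and own_car == "No"
    ):
        return "Bus"
    return None


# Table-2 rows: (Gender, Own Car, Travel Cost, Income, Travel mode).
training_set = [
    ("Male", "No", "Cheap", "Low", "Bus"),
    ("Male", "Yes", "Cheap", "Medium", "Bus"),
    ("Female", "Yes", "Expensive", "High", "Car"),
    ("Female", "Yes", "Standard", "Medium", "Train"),
]

# Table-3 rows: (Gender, Own Car, Travel Cost, Income); the class is unknown.
prediction_set = [
    ("Male", "Yes", "Standard", "Medium"),
    ("Female", "No", "Standard", "Low"),
]

# The rule does not use Income, so only the first three attributes are passed.
covered = [row for row in training_set if predict_travel_mode(*row[:3]) is not None]
correct = [row for row in covered if predict_travel_mode(*row[:3]) == row[4]]
print(f"covered {len(covered)} of {len(training_set)} training tuples, {len(correct)} correct")

for row in prediction_set:
    print(row, "->", predict_travel_mode(*row[:3]))
```

The rule covers the two Cheap tuples of the training set and labels both correctly, but it says nothing about the Standard-cost tuples in the prediction set, which is why it predicts only some of the tuples; further rules would be needed for the Car and Train classes.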
Some of the methods of classification are:
i) Classification by Decision Tree Induction. Example: ID3, C4.5, CART.
ii) Bayesian classification. Example: Naive Bayesian Classification
iii) Rule-Based Classification
iv) Lazy Learners. Example: k-Nearest-Neighbour

4.3.3 Cluster Analysis

A cluster is a collection of data objects that are similar to one another within the same cluster and
dissimilar to the objects in other clusters. Thus objects in the same cluster behave alike,
while differing from the objects in other clusters.
If a vendor wants to assess the risk of selling items on credit to his
customers, he can study old records and find the clusters. Most clustering algorithms
require the number of clusters to be specified. In this case there can be two clusters,
Risky and No Risk. If a particular customer falls within the No Risk cluster, the vendor can go
ahead with the credit.
Table-4: Customer database for clustering
Customer   Income (x)   Credit_rate (y)
A          Low          Fair
B          Low          Good
C          Medium       Excellent
D          High         Excellent
Some of the methods of clustering are:
i) Partitioning Methods. Example: k-Means clustering, k-medoids clustering
ii) Hierarchical Methods. Example: AGNES, DIANA
iii) Density-Based Methods. Example: DBSCAN
iv) Grid-Based Methods. Example: STING, WaveCluster, CLIQUE
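To make the credit example concrete, the sketch below runs a plain k-means with two clusters on the four customers of Table-4. The ordinal encodings (Low=0, Medium=1, High=2 and Fair=0, Good=1, Excellent=2), the choice of customers A and D as starting centroids, and the function names are all assumptions made for illustration; a real study would use numeric attributes and a library implementation.

```python
# Ordinal encodings are an assumption made for this sketch.
INCOME = {"Low": 0, "Medium": 1, "High": 2}
CREDIT = {"Fair": 0, "Good": 1, "Excellent": 2}

# Customers from Table-4: name -> (Income, Credit_rate).
customers = {
    "A": ("Low", "Fair"),
    "B": ("Low", "Good"),
    "C": ("Medium", "Excellent"),
    "D": ("High", "Excellent"),
}
points = {name: (INCOME[inc], CREDIT[cr]) for name, (inc, cr) in customers.items()}


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and centroid update until assignments stop changing."""
    centroids = list(centroids)
    assignment = {}
    for _ in range(max_iter):
        new_assignment = {
            name: min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for name, p in points.items()
        }
        if new_assignment == assignment:
            break
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [points[n] for n, c in assignment.items() if c == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


assignment, centroids = k_means(points, [points["A"], points["D"]])
print(assignment)
print(centroids)
```

Under these assumptions, customers A and B fall into one cluster and C and D into the other. Deciding which cluster corresponds to Risky and which to No Risk is left to the analyst, since k-means only groups similar records.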
4.3.4 Regression Method: Regression is normally used to estimate future values
from past values by fitting a curve to those points. It is one of the most widely used
techniques for numeric prediction and models the relationship between one or more
independent (or predictor) variables and a dependent (or response) variable. The common
formula for a linear regression is:
$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$    (3)
where $w_0, w_1, \dots, w_n$ are regression coefficients.
Suppose a hotel manager wants to find out how many customers (in thousands) he should expect
for the year 2015, given the information in Table-5. This can be estimated using
a regression model.
Table-5: Database for regression method
Year                      2010   2011   2012   2013   2014   2015
Customer (in thousands)   2.2    2.8    3.1    4.2    4.8    ?
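As an illustration of how the missing value in Table-5 could be estimated, the following Python sketch fits equation (3) with a single predictor (the year) by ordinary least squares. Using a straight line is an assumption made here, since the paper does not state which regression model the manager would use; the names fit_line and forecast_2015 are likewise illustrative.

```python
years = [2010, 2011, 2012, 2013, 2014]
customers = [2.2, 2.8, 3.1, 4.2, 4.8]  # in thousands, from Table-5


def fit_line(xs, ys):
    """Least-squares estimates of w0 and w1 for the model y = w0 + w1 * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    w1 = sxy / sxx
    w0 = y_mean - w1 * x_mean
    return w0, w1


w0, w1 = fit_line(years, customers)
forecast_2015 = w0 + w1 * 2015
print(f"slope w1 = {w1:.2f} thousand customers per year")
print(f"estimated customers in 2015 = {forecast_2015:.2f} thousand")
```

For the data in Table-5 the fitted slope is 0.66 thousand customers per year, which gives an estimate of about 5.4 thousand customers for 2015. The figure is only as good as the linear-trend assumption behind it.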
4.4 Case Study
Let us see an application of k-means clustering. The parameter k (sometimes written c)
stands for the number of clusters. A six-month transactional database (14,690 records) of a shop,
which I refer to as myshop, has been collected. This case study is not intended to show the
detailed working of the k-means algorithm, but rather how the output of the algorithm can be
analysed from the business point of view.
The buying detail records associated with a customer are summarised into a single record
that describes the customer's buying behavior. The choice of summary variables (features) is
critical in order to obtain a useful description of the customer. To define the features, one can
think of the smallest set of variables that describes the complete behavior of a customer.
Keywords like what, when, where, how often, who, etc. can help with this process.
Based on these keywords, a list of features is obtained that can be used as a summary description
of a customer's buying behavior over some time period P. Some of the
attributes are: gender, time of purchase, item, amount spent etc.
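The summarisation step can be sketched as follows. The transaction records, the field order and the chosen summary features (visits, total_spent, favourite_item, usual_hour) are hypothetical; the actual feature set used for myshop is not listed in full in this paper.

```python
from collections import Counter, defaultdict

# Hypothetical transaction records: (customer_id, gender, hour, item, amount).
records = [
    ("C1", "F", 12, "TEA", 20.0),
    ("C1", "F", 15, "BREADS", 35.0),
    ("C2", "M", 18, "COLD DRINK", 15.0),
    ("C1", "F", 12, "TEA", 20.0),
]


def summarise(records):
    """Collapse all records of a customer into one feature record."""
    grouped = defaultdict(list)
    for rec in records:
        grouped[rec[0]].append(rec)

    summary = {}
    for customer_id, recs in grouped.items():
        summary[customer_id] = {
            "gender": recs[0][1],                                   # who
            "visits": len(recs),                                    # how often
            "total_spent": sum(r[4] for r in recs),                 # how much
            "favourite_item": Counter(r[3] for r in recs).most_common(1)[0][0],  # what
            "usual_hour": Counter(r[2] for r in recs).most_common(1)[0][0],      # when
        }
    return summary


summary = summarise(records)
for customer_id, features in summary.items():
    print(customer_id, features)
```

Each customer ends up as a single record, which is the form a clustering algorithm such as k-means expects. Categorical features like favourite_item would still need a numeric encoding before clustering.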

Figure-2: Visualisation of customers visit per hour

Before running the algorithm, we plot the time of visit against the number of
customers in Figure-2. Customer visits peak at around 12 o'clock and 3 p.m., and
the number of customers rises again from 6 p.m. Better decisions can be made by keeping
more staff during these peak hours.
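A count of visits per hour, of the kind plotted in Figure-2, can be produced with a few lines of Python. The visit hours below are made up for illustration and are not the myshop data; the names visit_hours, visits_per_hour and peak_hours are assumptions.

```python
from collections import Counter

# Hypothetical hour of day (24-hour clock) for each recorded visit.
visit_hours = [9, 12, 12, 12, 13, 15, 15, 15, 16, 18, 18, 19, 19]

visits_per_hour = Counter(visit_hours)

# Hours that share the highest visit count.
peak = max(visits_per_hour.values())
peak_hours = sorted(hour for hour, count in visits_per_hour.items() if count == peak)

# Simple text histogram, one row per hour.
for hour in sorted(visits_per_hour):
    print(f"{hour:02d}:00 {'#' * visits_per_hour[hour]}")
print("peak hours:", peak_hours)
```

With this toy data the peak hours are 12 and 15, mirroring the pattern described for Figure-2.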
Next, k-means clustering is run on the dataset with the number of clusters c=3.
To visualise the result, a technique called Principal Component Analysis (PCA) is
used. A random sample of 150 tuples is selected for the projection shown in Figure-3,
because too many tuples would result in an overcrowded plot.
Figure-3: Result of K-means algorithm using PCA with c=3

The three clusters can be analysed as:

Cluster 1: Consists of REGULAR customers who SPEND A LOT of money; they
buy mostly COLD DRINK, CHOCOLATES VARIETIES, ICE-CREAM, TEA
& COFFEE.
Cluster 2: Consists of customers who come SOMETIMES and SPEND an AVERAGE
amount; they buy DESSERTS & BREADS. This segment of customers is the least
likely to buy CAKES & CHOCOLATES VARIETIES.
Cluster 3: Consists of customers who come SOMETIMES and SPEND LESS;
they buy BREADS, CHEESE PRODUCTS & TEA.

5. Conclusion
The customer life cycle, which refers to the different stages in the relationship between a
business and a customer, should be clearly understood and safeguarded, as it is key to
the success of the business. Organisations can examine their actual requirements and see
which data mining technique can answer their queries. Of late, data mining has gained
popularity in both the private and public sectors. Industrial sectors such as banking, insurance,
manufacturing and retailing have started to use data mining in order to reduce costs, increase
sales and retain present customers while trying to attract new ones. With the rise of
entrepreneurs in the last decade, many application areas of predictive methods have become
popular. These techniques of data mining will continue to grow in popularity in the coming
decades.

References
1. Berry M. J. A. and Linoff G. S. (2004) Data Mining Techniques for Marketing,
Sales and Customer Relationship Management, Indiana.
2. Cheng T.H., Wei C.P. and Tseng, V.S. (2006) Feature Selection for Medical Data
Mining: Comparisons of Expert Judgment and Automatic Approaches,
Proceedings of the 19th IEEE Symposium on Computer-Based Medical Systems
(CBMS'06).
3. Edelstein H. (2014) Data mining: exploiting the hidden trends in your data. DB2
Online Magazine.
Online available: http://www.db2mag.com/9701edel.htm
Accessed on 16-08-2014
4. Freeman M. (1999) The 2 customer lifecycles, Intelligent Enterprise, Volume-2,
Issue-9.
5. Han J. and Kamber M., (2003). Data Mining Concepts and Techniques, (2nd ed.).
Elsevier Publication, India.
6. Joseph M.V. (2013) Significance of Data Warehousing and Data Mining in
Business Applications, International Journal of Soft Computing and Engineering
(IJSCE), Volume-3, Issue-1.
7. Karahoca A., Karahoca D. and Sanver M. (2012) Survey of Data Mining and
Applications (Review from 1996 to Now).
8. Khaled N. (2007) Application of Data Mining to State Transportation Agencies
Online available: http://www.itcon.org/data/works/att/2007_8. content.06891.pdf
Accessed on 10-08-2014
9. Leu S., Chee N. and Shiu-Lin C. (2000) Data mining for tunnel support: neural
network approach, Journal of automation in construction, Volume 10, Number 4,
pp. 429-441(13).
10. Rygielski C., Wang J. C. and Yen D. C. (2002) Data mining techniques for
customer relationship management, Journal of Technology in Society, Volume-24, Issue-2, pp. 483-502.
11. Tan P. N., Steinbach M. and Kumar V. (2005). Introduction to Data Mining, (2nd
ed.). Addison Wesley Publications.
12. Teknomo K. (2008) Market Basket Analysis.
Online available: http://people.revoledu.com/kardi/tutorial/marketbasket/
Accessed on 08-07-2014
13. Voznika F. and Viana L. (2014) Data Mining Classification.
Online available: http://www.ggram.com/pdf/data-mining-classification.html
Accessed on 25-08-2014
14. Zentut M. (2014) Data Mining Applications.
Online available: http://www.zentut.com/data-mining/data-mining-applications/
Accessed on 22-08-2014
