Documenti di Didattica
Documenti di Professioni
Documenti di Cultura
net/publication/274195064
CITATIONS READS
0 45
3 authors, including:
Some of the authors of this publication are also working on these related projects:
WSMC12 12th Workshop on Statistics, Mathematics and Computation In Honour of Professor Carlos Braumann November
9th-10th, 2018 Covilhã - Portugal View project
All content following this page was uploaded by Teresa Oliveira on 12 August 2019.
CONTENTS
4.1 Introduction to Data Mining: Literature Review.....................................84
4.2 Data Mining and Statistics.......................................................................... 85
4.3 Data Mining Major Algorithms.................................................................. 86
4.3.1 Classification...................................................................................... 86
4.3.2 Statistical Learning........................................................................... 87
4.3.3 Association Analysis........................................................................ 87
4.3.4 Link Mining....................................................................................... 87
4.3.5 Clustering........................................................................................... 88
4.3.6 Bagging and Boosting...................................................................... 88
4.3.7 Sequential Patterns........................................................................... 88
4.3.8 Integrated Mining............................................................................. 89
4.4 Software on Data Mining............................................................................ 89
4.5 Service Industry and Challenges............................................................... 89
4.6 Data Mining and Quality in Service Industry.........................................90
4.7 Application of Data Mining to Service Industry:
Example of Clustering.................................................................................. 91
4.7.1 Case Study......................................................................................... 92
4.7.2 The Methodology.............................................................................. 95
4.7.2.1 Basic Concepts.................................................................... 96
4.7.2.2 Methodology of Conceptual Characterization
by Embedded Conditioning (CCEC)............................... 98
4.7.3 Application and Results................................................................... 99
4.7.3.1 Hierarchical Clustering..................................................... 99
4.7.3.2 CCEC Interpretation of Classes..................................... 102
4.7.3.3 Discussion......................................................................... 104
4.8 Considerations and Future Research....................................................... 104
Acknowledgments............................................................................................... 105
References.............................................................................................................. 105
83
84 Decision Making in Service Industries: A Practical Approach
1. Cleaning the data—A stage where noise and inconsistent data are
removed.
2. Data integration—A stage where different data sources can be com-
bined, producing a single data repository.
3. Selection—A stage where user attributes of interest are selected.
4. Data transformation—A phase where data is processed into a format
suitable for application of mining algorithms (e.g., through opera-
tions aggregation).
5. Prospecting—An essential step consisting of the application of intel-
ligent techniques in order to extract patterns of interest.
6. Evaluation or postprocessing—A stage where interesting patterns are
identified according to some criterion of the user.
7. Display of results—A stage of knowledge representation techniques,
showing how to present the extracted knowledge to the user.
Data Mining and Quality in Service Industry 85
mining and most frequent statistical techniques, the goal of this chapter is to
review an application of data mining in the service industry.
4.3.1 Classification
• C4.5 is an extension of Quinlan’s earlier ID3 algorithm. The decision
trees generated by C4.5 can be used for classification, and for this rea-
son, C4.5 is often referred to as a statistical classifier. Details on this
algorithm can be seen in Quinlan (1993, 1996) and in Kotsiantis (2007).
• Classification and regression tree (CART) analysis is a nonparametric
decision tree learning technique that produces either classification
or regression trees, depending on whether the dependent variable
is categorical or numeric, respectively. We refer to Loh (2008) and to
Wu and Kumar (2009) for further developments.
• k-Nearest neighbor (kNN) is a method for classifying objects based on
closest training examples in the feature space. This is a nonparametric
method, which estimates the value of the probability density function
Data Mining and Quality in Service Industry 87
4.3.5 Clustering
• k-Means is a simple iterative method to partition a given data set into
a user-specified number of clusters, k. It is also referred to as Lloyd’s
algorithm, particularly in the computer science community, since
it was introduced in Lloyd (1957), being published later in Lloyd
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
service industries’ future (priorities: linking past and future) and the authors
emphasize the role of industry services on the growth of the international
economy. They refer to the ease with which services can be implemented in
certain activities with notably less capital than that required for most indus-
tries as well as the impact on the job market, which are factors that have
brought about belief and reliance on this sector to provide the recovery from
any international economic crisis.
based on a DM algorithm. This algorithm has three steps that calculate the
data transaction quality. Ograjenšek (2002) emphasizes that to provide a sat-
isfactory definition of services, some authors like Mudie and Cottam (1993),
Hope and Mühlemann (1997), and Kasper et al. (1999) characterize them with
these important features: intangibility, inseparability, variability, and perish-
ability. Murdick et al. (1990) define quality of a service as the degree to which
the bundle of services attributes as a whole satisfies the customer; it involves
a comparison between customer expectations (needs, wishes) and the reality
(what is really offered). Whenever a person pays for a service or product, the
main aim is getting quality. There are two particular reasons why customers
must be given quality service:
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
In a quality service one faces a special challenge since there is the need to
meet the customer requests while keeping the process economically com-
petitive. To ensure quality assurance of the companies there are shadow
shopping companies who act as quality control laboratories. They develop
experiments, draw inference, and provide solutions for improvement. In the
case of the service industry the quality assurance service is provided by the
various surveys and inspections that are carried out from time to time.
Another survey by KDnuggets about fields where the experts applied
data mining can be seen at www.kdnuggets.com/polls/2010/analytics-data-
mining-industries-applications.html. We see that the biggest increases from
2009 to 2010 were for the following fields: manufacturing (141.9%), education
(124.1%), entertainment/music (93.3%), government/military (56.5%). The
biggest declines were for social policy/survey analysis (–44.8%), credit scor-
ing (–48.8%), travel/hospitality (–49.7%), security/anti-terrorism (–62.4%).
work worse. The methodology benefits from the hierarchical structure of the
target classification to induce concepts iterating with binary divisions from
dendrogram (or hierarchical tree, see Figure 4.3), so that, based on the vari-
ables that describe the objects belonging to a certain domain, the specifics
of each class can be found, thus contributing to the automatic interpreta-
tion of the conceptual description of clustering. Industrial wastewater treat-
ment covers the mechanisms and processes used to treat waters that have
been contaminated in some way by anthropogenic industrial or commercial
activities, prior to its release into the environment or its reuse. Most indus-
tries produce some wet waste although recent trends in the developed world
have been to minimize such production or recycle such waste within the
production process. However, many industries remain dependent on pro-
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
1 Qair 2P
33–34 V2
QT Air
3 NH4-N
4 TN LT
5 TOC
6 Inhibition 15 L
M2 27–28
QT QT QT QT
FT 10 ORP 8 NO3–N 12 DO 13 DO TT
16–20 SC 1 7Q M4 30 11 ORP 9 NH4-N 14 T
M3 29
31–32 V1 Outflow
600 m3
Input 88 m 3 88 m3 130 m3 117 m3 115 m3
after Mechanical M1 26
Treatment
FIGURE 4.1
MBBR (moving bed biofilm reactor) pilot plant with sensors and actuators.
93
94 Decision Making in Service Industries: A Practical Approach
I Q–E = I1 I2 I3 I DBO–D = I1 I2 I3
C392 C391
C393 C389
FIGURE 4.2
Boxplot-based discretization for two classes partition.
a settler. The total airflow to both aerobic tanks can be manipulated online so
that oxygen concentration in the first aerobic tank is controlled at the desired
value. The wastewater rich with nitrate is recycled with the constant flow
rate from the fifth tank back to the first tank. The influent to the pilot plant is
wastewater after mechanical treatment, which is pumped to the pilot plant.
The inflow is kept constant to fix the hydraulic retention time. The influent
flow rate can be adjusted manually to observe the plant performance at dif-
ferent hydraulic retention times. The database that was used in this study
consisted of 365 daily averaged observations that were taken from June 1,
2005, to May 31, 2006. Every observation includes measurements of the 16
variables that are relevant for the operation of the pilot plant. The variables
that were measured were:
From a theoretical point of view, the main focus of this methodology has
been to present a hybrid methodology that combines tools and techniques
of statistics and artificial intelligence in a cooperative way, using a trans-
versal and multidisciplinary approach combining elements of the induction
96 Decision Making in Service Industries: A Practical Approach
ing with binary divisions from dendrogram, so that, based on the variables
that describe the objects belonging to a certain domain, the specifics of each
class can be found, thus contributing to the automatic interpretation of the
conceptual description of clustering.
TABLE 4.1
Summary Statistics for Q-E versus P 2 (Left) and MQ–E, Set of Extreme Values
of Q-E|C ∈ P 2 and Z Q–E, Corresponding Ascendant Sorting (Right)
Class C393 C392
Q–E Q–E
Var. N = 396 nc = 390 nc = 6 M Z
Q-E X 42,112.9453 29,920 20,500
S 4,559.2437 1,168.8481 20,500 23,662.9004
Min 29,920 20,500 54,088.6016 29,920
Max 54,088.6016 23,662.9004 23,662.9004 54,088.6016
N* 1 0
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
card {c}
x x12 … x1K −1 x1K C(1,Pξ )
11
x21 x22 … x2 K −1 x2 K C( 2 ,Pξ )
χ p =
xn−11 xn−12 xn−1K −1 xn−1K C(n − 1K ),Pξ )
xn1 xn 2 … xnK −1 xnK C(n ,P )
100 Decision Making in Service Industries: A Practical Approach
cr363
4836.922
2418.461
cr361
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
cr362
cr353
cn78
cr358
cr360
cr357
cm36
FIGURE 4.3
Hierarchical tree to case study.
Classer 358 93
Classer 357 50
7.775 28.126 48.477 0 10.141 20.282 2.98 5.8825 8.785 3.91 6.5675 9.225 28.604 49.251 69.898
Classe nc Q-Air h-w w Q-Influent FR1-DOTOK Frequ-rec
Classer 358 93
Classer 357 50
697.829 1449.5496 2201.27 2.66 2.879 3.098 49.706 67.603 85.5 39.153 44.943 50.733 23.862 33.931 44
Classer 358 93
Classer 357 50
0 41.896 83.792 0 17.4335 34.867 8.217 15.4 22.583 0 177.5 355 0 26.854 53.708
101
FIGURE 4.4
Class panel graph of P4.
102 Decision Making in Service Industries: A Practical Approach
TABLE 4.2
,G
Summary of Knowledge Base for Cr361 and Cr361 from P 2EnW
Lj 3, R 2
Concep Knowledge Base Cr361 (172 Days) and Cr362 (193 Days) Cov Cov R
A 2 , NH 4 −influent
C361 r NH 4 − influent
3, Cr 361 (
: xNH 4− influent, i ∈ 40.541, 48.477 → Cr361 1.0
5 2.91%
,O 2−1 aerobic
AC2361 r3O, C2r−361
1 aerobic
(
: xO 2−1 aerobic, i ∈ 8.371, 8.785 1
.0
→ Cr361 1 0.58%
,O 2− 2 aerobic
AC2361 r1O,C2r−3612 aerobic : xO 2− 2 aerobic, i ∈ 3.91 , 4.94 ) → Cr361
1.0
2 1.16%
A 2,Value − air
C361 r3Value − air
,Cr 361 ( 1.0
: xValue− air , i ∈ 54.777 , 69.898 → Cr361 28 16.28%
Q − air
AC2,361 r3Q,C−rair
361
( .0
: xQ− air, i ∈ 2030.77 , 2201.27 1 → Cr361 27 15.70%
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
4.7.3.3 Discussion
Concepts associated with classes are built taking advantage of the hierar-
chical structure of the underlying clustering. CCEC is a quick and effective
method that generates a conceptual model of the domain, which will be of
great support to the later decision-making process based on a combination
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
Acknowledgments
We would like to thank the staff of the Domale-Kamnik WWTP for providing
us with the data from the pilot plant and the referees for their helpful comments.
We also thank the Project HAROSA Knowledge Community. Research was par-
tially sponsored by national funds through the Fundação Nacional para a Ciência
e Tecnologia, Portugal–FCT under the project PEst-OE/MAT/UI0006/2011.
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
References
Agrawal, R., and Srikant, R. (1994). Fast algorithms for mining association rules in
large databases. Proceedings of the 20th International Conference on Very Large Data
Bases, VLDB, pp. 487–499, Santiago, Chile.
Aldana, W. A. (2000). Data mining industry: Emerging trends and new opportunities.
Master thesis, Massachusetts Institute of Technology.
Antunes, M. (2010). CRM e Prospecção de Dados–ao seu serviço. Boletim da Sociedade
Portuguesa de Estatística, Fernando Rosado, ed., Primavera de 2010, 34–39.
Branco, J. A. (2010). Estatísticos e mineiros (de dados) inseparáveis de costas voltadas?
Boletim da Sociedade Portuguesa de Estatística, Fernando Rosado, ed., Primavera
de 2010, 40–43.
Castro, L. M., Montoro-Sanchez, A., and Ortiz-De-Urbina-Criado, M. (2011). Inno-
vation in services industries: Current and future trends. The Service Industries
Journal, 31(1): 7–20.
Chatfield, C. (1997). Data mining. Royal Statistical Society News, 25(3): 1–2.
Farzi, S., and Dastjerdi, A. B. (2010). Data quality measurement using data mining.
International Journal of Computer Theory and Engineering, 2(1): 1793–8201.
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. (1996). From data mining to knowl-
edge discovery: An overview. In Advances in Knowledge Discovery and Data
Mining, Fayyad, U., Piatetsky-Shapiro, G., Smyth, P., and Uthurusamy, R., eds.,
pp. 1–36, AAAI Press/MIT Press, Menlo Park, CA.
Frank, E., Trigg, L., Holmes, G., and Witten, I. H. (2000). Naive Bayes for regression.
Machine Learning, 41(1): 5–15.
Frawley, W., Piatetsky-Shapiro, G., and Mathews, C. (1992). Knowledge discovery in
databases: An overview. AI Magazine, 213–228.
Freund, Y., and Schapire, R. E. (1997). A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System Sciences,
55(1): 119–139.
Friedman, J. H. (1998). Data mining and statistics: What is the connection? www-stat.
stanford.edu/~jhf/ftp/dm-stat.ps.
Gibert, K., Nonell, R., Velarde, J. M., and Colillas, M. M. (2005). Knowledge discovery
with clustering: Impact of metrics and reporting phase by using klass. Neural
Network World, 15(4): 319–326.
106 Decision Making in Service Industries: A Practical Approach
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning.
Springer, New York.
Hirate, Y., and Yamana, H. (2006). Generalized sequential pattern mining with item
intervals. Journal of Computers, 1(3): 51–60.
Hope, C., and Mühlemann, A. (1997). Service Operations Management: Strategy, Design
and Delivery. Prentice Hall, London.
Kasper, H., van Helsdingen, P., and de Vries, W. Jr. (1999). Services Marketing
Management: An International Perspective. John Wiley & Sons, Chichester.
Kotsiantis, S. B. (2007). Supervised machine learning: A review of classification tech-
niques. Informatica 31: 249–268.
Kuonen, D. (2004). Data mining and statistics: What is the connection? The Data
Administrative Newsletter.
Leamer, E. E. (1978). Specification Searches: Ad Hoc Inference with Nonexperimental Data.
John Wiley & Sons, New York.
Li, L., Shang, Y., and Zhang, W. (2002). Improvement of HITS-based algorithms on web
documents. Proceedings of the 11th International World Wide Web Conference
(WWW 2002). Honolulu, HI.
Liu, B., Hsu, W., Chen, S., and Ma, Y. (2000). Analyzing the subjective interestingness
of association rules. IEEE Intelligent Systems, 47–55.
Liu, B., Hsu, W., and Ma, Y. (1998). Integrating classification and association rule mining.
Proceedings of the Fourth International Conference on Knowledge Discovery
and Data Mining (KDD-98, Plenary Presentation), New York.
Lloyd, S. P. (1957). Least square quantization in PCM. Bell Telephone Laboratories Paper.
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information
Theory 28(2): 129–137.
Loh, W.-Y. (2008). Classification and regression tree methods. In Encyclopedia of
Statistics in Quality and Reliability, Ruggeri, F., Kenett, R., and Faltin, F., eds., pp.
315–323, Wiley, Chichester.
Mahajan, M., Nimbhorkar, P., and Varadarajan, K. (2009). The planar k-means prob-
lem is NP-hard. Lecture Notes in Computer Science, 5431: 274–285.
Mudie, P., and Cottam, A. (1993). The Management and Marketing of Services.
Butterworth-Heinemann, Oxford.
Murdick, R. G., Ross, J. E., and Claggett, J. R. (1990). Information System for Modern
Management. Prentice Hall, New Jersey.
Ograjenšek, I. (2002). Applying statistical tools to improve quality in the service sec-
tor. Developments in Social Science Methodology (Metodološki zvezki), 18: 239–251.
Data Mining and Quality in Service Industry 107
Oliveira, T. A., and Oliveira, A. (2010). Data mining and quality in service industry:
Review and some applications. Presentation in IN3-HAROSA Workshop, Media-
TIC, Barcelona, Spain, November 22–23. http://dpcs.uoc.edu/joomla/images/
stories/workshop2010/Harosa_2010_Teresa_1.pdf.
Page, L., Brin, S., Motwani, R., and Winograd, T. et al. (1999). The PageRank cita-
tion ranking: Bringing order to the web. Available at http://ilpubs.stanford.
edu:8090/422/1/1999-66.pdf.
Parellada, F. S., Soriano, D. R., and Huarng, K.-H. (2011). An overview of the ser-
vice industries’ future (priorities: linking past and future). The Service Industries
Journal, 31(1): 1–6.
Pei, J., Han, J., Mortazavi-Asl, B., Pinto, H., Chen, Q., Dayal, U., and Hsu, M.-C. (2001).
PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth.
ICDE ’01 Proceedings of the 17th International Conference on Data Engineering.
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014
Zhao, C. M., and Luan, J. L. (2006). Data mining: Going beyond traditional statistics.
New Directions for Institutional Research, 131: 7–16.
Zwikael, O. (2007). Quality management: A key process in the service industries. The
Service Industries Journal, 27(8): 1007–1020.
Downloaded by [Teresa Oliveira] at 07:28 18 June 2014