Abstract Data mining is the process of extracting relevant information from a data warehouse. It also refers to the
analysis of data using pattern-matching techniques. With the continuous and extensive use of databases for
storage, there arises a need for database management and for retrieval of the required information. This paper
discusses the data mining techniques used for knowledge discovery in databases. It also surveys various
data mining algorithms for the optimized mining of information.
Keywords Data Mining, Apriori algorithm, KDD, k-means algorithm, AdaBoost algorithm
I. INTRODUCTION
With the rapid growth and development of society, the need to store data has risen, leading to the creation of a
huge number of databases. A large number of databases gives way to the creation of data warehouses. A data
warehouse is a central repository created by integrating data from one or more databases. It stores both
current and historical data, which helps in creating trend reports from the stored information. Data
warehouses are subdivided into data marts. A data mart is a subset of the warehouse that holds related information.
In essence, the goal of data mining is to extract knowledge from data. Data mining is an interdisciplinary field whose
core lies at the intersection of machine learning, statistics and databases. We emphasize that in data mining, unlike
in classical statistics for example, the goal is to discover knowledge that is not only accurate but also comprehensible to
the user. Comprehensibility is important whenever discovered knowledge will be used to support a decision made by
a human user. After all, if discovered knowledge is not comprehensible, the user will not be able to interpret
and validate it. In that case, the user will probably not trust the discovered knowledge enough to use it for decision
making, which can lead to wrong decisions.
There are several data mining tasks, including classification, regression, clustering, dependence modelling, etc. Each
of these tasks can be regarded as a kind of problem to be solved by a data mining algorithm. Therefore, the first step in
designing a data mining algorithm is to define which task the algorithm will address.
With the continuous development [1] of database technology and the extensive application of database management
systems, the volume of data stored in databases increases rapidly, and much important information is hidden in these
large amounts of data. If this information can be extracted from the database, it can create considerable potential profit
for companies. The technology of mining information from massive [2] databases is known as data mining.
Data can now be stored in many different types of databases. One database architecture that has recently emerged is the
data warehouse, a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in
order to facilitate management decision making. Data warehouse technology includes data cleansing, data integration,
and On-Line Analytical Processing (OLAP), that is, analysis techniques with functionalities such as summarization,
consolidation and aggregation, as well as the ability to view information from different angles.
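The summarization and aggregation functions mentioned above can be illustrated with a small sketch. The sales records, the dimension names and the roll_up helper are hypothetical and chosen only for illustration; a real warehouse would run such queries through an OLAP engine rather than over an in-memory list.

```python
from collections import defaultdict

# Hypothetical fact table: (region, quarter, product, sales amount).
SALES = [
    ("North", "Q1", "laptop", 1200.0),
    ("North", "Q1", "phone", 800.0),
    ("North", "Q2", "laptop", 1500.0),
    ("South", "Q1", "laptop", 700.0),
    ("South", "Q2", "phone", 900.0),
]

# Names of the first three columns, in order (assumed for this sketch).
DIMENSIONS = ("region", "quarter", "product")


def roll_up(rows, group_by):
    """Sum the sales measure, keeping only the dimensions named in group_by."""
    positions = [DIMENSIONS.index(name) for name in group_by]
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[i] for i in positions)
        totals[key] += row[3]
    return dict(totals)


# The same data viewed at three levels of detail.
by_region_quarter = roll_up(SALES, ["region", "quarter"])
by_region = roll_up(SALES, ["region"])
grand_total = roll_up(SALES, [])
```

Grouping by fewer dimensions gives a coarser summary, and grouping by none gives the overall total; this is what viewing the same information from different angles amounts to in this simplified setting.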
Data mining tools can forecast future trends and activities to support people's decisions. For example, by
analysing the company's whole database system, data mining tools [3] can answer questions such as "Which
customer is most likely to respond to our company's e-mail marketing activities, and why?" Some data mining
tools can also solve traditional problems that used to consume much time, because they
can rapidly browse the entire database and find useful information that experts had not noticed.
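As a rough illustration of the e-mail marketing question, the sketch below ranks customer segments by their observed response rate in past campaigns. The campaign history, the segment names and both helper functions are hypothetical; this is a deliberately simple frequency estimate, not the method of any particular data mining tool.

```python
from collections import defaultdict

# Hypothetical campaign history: (customer segment, responded to e-mail?).
HISTORY = [
    ("student", True), ("student", False), ("student", True),
    ("retiree", False), ("retiree", False),
    ("professional", True), ("professional", False),
    ("professional", False), ("professional", False),
]


def response_rates(history):
    """Return the observed response rate for each segment."""
    sent = defaultdict(int)
    responded = defaultdict(int)
    for segment, did_respond in history:
        sent[segment] += 1
        if did_respond:
            responded[segment] += 1
    return {segment: responded[segment] / sent[segment] for segment in sent}


def rank_segments(history):
    """Order segments from most to least likely to respond."""
    rates = response_rates(history)
    return sorted(rates, key=rates.get, reverse=True)


rates = response_rates(HISTORY)
ranking = rank_segments(HISTORY)
```

A practical tool would use many more customer attributes and a trained classifier, but the idea is the same: past behaviour stored in the database is used to estimate who is likely to respond.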
The rest of this paper is organized as follows. Section II discusses the concepts of data mining and describes
the process of knowledge discovery from data. Section III discusses the emerging algorithms for knowledge extraction
and highlights various algorithms used for knowledge extraction with a number of security solutions. Finally, Section IV
presents the conclusions and future work.
REFERENCES
[1] Agrawal R, Srikant R (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th VLDB
conference, pp. 487-499
[2] Ahmed S, Coenen F, Leng PH (2006) Tree-based partitioning of data for association rule mining. Knowl Inf Syst
10(3):315-331
[3] Banerjee A, Merugu S, Dhillon I, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res
6:1705-1749
[4] Bezdek JC, Chuah SK, Leep D (1986) Generalized k-nearest neighbour rules. Fuzzy Sets Syst 18(3):237-256.
http://dx.doi.org/10.1016/0165-0114(86)90004-7
[5] Bloch DA, Olshen RA, Walker MG (2002) Risk estimation for classification trees. J Comput Graph Stat
11:263-288
[6] Bonchi F, Lucchese C (2006) On condensed representations of constrained frequent patterns. Knowl Inf Syst
9(2):180-201
[7] Breiman L (1968) Probability theory. Addison-Wesley, Reading. Republished (1991) in Classics of mathematics.
SIAM, Philadelphia
[8] Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth, Belmont
[9] Brin S, Page L (1998) The anatomy of a large-scale hypertextual Web search engine. Comput Networks
30(1-7):107-117
[10] Cheung DW, Han J, Ng V, Wong CY (1996) Maintenance of discovered association rules in large databases: an
incremental updating technique. In: Proceedings of the ACM SIGMOD international conference on management
of data, pp. 13-23
[11] Chi Y, Wang H, Yu PS, Muntz RR (2006) Catch the moment: maintaining closed frequent itemsets over a data
stream sliding window. Knowl Inf Syst 10(3):265-294
[12] Cost S, Salzberg S (1993) A weighted nearest neighbour algorithm for learning with symbolic features. Mach
Learn 10:57-78 (PEBLS: Parallel Exemplar-Based Learning System)
[13] Kuramochi M, Karypis G (2005) Gene classification using expression profiles: a feasibility study. Int J Artif
Intell Tools 14(4):641-660
[14] Langville AN, Meyer CD (2006) Google's PageRank and beyond: the science of search engine rankings.
Princeton University Press, Princeton
[15] Leung CW-k, Chan SC-f, Chung F-L (2006) A collaborative filtering framework based on fuzzy association rules
and multiple-level similarity. Knowl Inf Syst 10(3):357-381
[16] Li T, Zhu S, Ogihara M (2006) Using discriminant analysis for multi-class classification: an experimental
investigation. Knowl Inf Syst 10(4):453-472