Data Mining Versus Knowledge Discovery in Databases
Dr. Kishori Lal Bansal (1), Satish Sood (2)
(1) Associate Professor, Dept. of Computer Science, HP University, Summer Hill, Shimla, HP
(2) Research Scholar, Dept. of Computer Science, HP University, Summer Hill, Shimla, HP
satishdsala@gmail.com
Abstract— The terms knowledge discovery in databases (KDD) and data mining are often used interchangeably. In fact, many other names have been given to this process of discovering useful (hidden) patterns in data: knowledge extraction, information discovery, exploratory data analysis, information harvesting, and unsupervised pattern recognition. Over the last few years KDD has been used to refer to a process consisting of many steps, while data mining is only one of these steps. This paper focuses on data mining and knowledge discovery in databases.

Keywords— Data Mining; Knowledge Discovery; Databases

I. INTRODUCTION
Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data. Data mining is the use of algorithms to extract the information and patterns derived by the KDD process. The KDD process is often said to be nontrivial; however, we take the larger view that KDD is an all-encompassing concept. A traditional SQL database query can be viewed as the data mining part of a KDD process. Indeed, this may be viewed as somewhat simple and trivial. However, this was not the case 30 years ago. If we were to advance 30 years into the future, we might find that processes thought of today as nontrivial and complex will be viewed as equally simple. KDD is a process that involves many different steps. The input to this process is the data, and the output is the useful information desired by the users. However, the objective may be unclear or inexact. The process itself is interactive and may require much elapsed time. To ensure the usefulness and accuracy of the results of the process, interaction throughout the process with both domain experts and technical experts might be needed.

II. KNOWLEDGE DISCOVERY PROCESS STEPS
The KDD process consists of the following five steps.

• Selection: The data needed for the data mining process may be obtained from many different and heterogeneous data sources. This first step obtains the data from various databases, files, and non-electronic sources.
• Pre-processing: The data to be used by the process may contain incorrect or missing values. There may be anomalous data from multiple sources involving different data types and metrics. Many different activities may be performed at this time. Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted, often using data mining tools.
• Transformation: Data from different sources must be converted into a common format for processing. Some data may be encoded or transformed into more usable formats. Data reduction may be used to reduce the number of possible data values being considered.
• Data Mining: Based on the data mining task being performed, this step applies algorithms to the transformed data to generate the desired results.
• Interpretation/evaluation: How the data mining results are presented to the users is extremely important because the usefulness of the results depends on it. Various visualization and GUI strategies are used at this last step.

Transformation techniques are used to make the data easier to mine and more useful, and to provide more meaningful results. The actual distribution of the data may be modified to facilitate use by techniques that require specific types of data distributions. Some attribute values may be combined to provide new values, thus reducing the complexity of the data. For example, current date and birth date could be replaced by age. One attribute could be substituted for another. An example would be replacing a sequence of actual attribute values with the differences between consecutive values. Real-valued attributes may be more easily handled by partitioning the values into ranges and using these discrete range values. Some data values may be removed altogether; outliers, extreme values that occur infrequently, are a common case. The data may be transformed by applying a function to the values. A common transformation function is to use the log of the value rather than the value itself. These techniques make the mining task easier by reducing the dimensionality (number of attributes) or by reducing the variability of the data values. The removal of outliers can actually improve the quality of the results. As with all steps in the KDD process, however, care must be used in performing transformation. If used incorrectly, the transformation could change the data such that the results of the data mining step are inaccurate.

III. VISUALIZATION
Visualization refers to the visual presentation of data. The old expression “a picture is worth a thousand words” certainly is true when examining the structure of data. For example, a line graph that shows the distribution of a data variable is easier to understand and perhaps more informative than the formula for the corresponding distribution. The use of visualization techniques allows users to summarize, extract, and grasp more complex results than a more mathematical or textual description of the results would. Visualization techniques include:

• Graphical: Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.
• Geometric: Geometric techniques include the box plot and scatter diagram techniques.
• Icon-based: Using figures, colors, or other icons can improve the presentation of the results.
• Pixel-based: With these techniques each data value is shown as a uniquely colored pixel.
• Hierarchical: These techniques hierarchically divide the display area (screen) into regions based on data values.
• Hybrid: The preceding approaches can be combined into one display.
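As a rough illustration of the graphical and geometric techniques listed above, the following Python sketch computes the bin counts behind a histogram and the five-number summary that a box plot draws. The sample ages, the bin width, and the function names are hypothetical and chosen only for illustration; in practice these numbers would normally be handed to a plotting library rather than printed as text.

```python
import statistics
from collections import Counter


def histogram_counts(values, bin_width):
    """Count how many values fall into each fixed-width bin (keyed by bin start)."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


def five_number_summary(values):
    """Return minimum, lower quartile, median, upper quartile, and maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


def text_histogram(counts, bin_width):
    """Render the bin counts as a simple text bar chart, one line per bin."""
    lines = []
    for start, count in counts.items():
        lines.append(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count}")
    return "\n".join(lines)


# Hypothetical sample data: ages of ten customers.
ages = [23, 25, 31, 35, 36, 41, 44, 47, 52, 68]

counts = histogram_counts(ages, 10)
summary = five_number_summary(ages)

print(text_histogram(counts, 10))
print(summary)
```

The text bars stand in for a histogram, and the five summary values are the ones a box plot would mark as whiskers, box edges, and median line.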

Any of these approaches may be two-dimensional or three-dimensional. Visualization tools can be used to summarize data as a data mining technique itself. In addition, visualization can be used to show the complex results of data mining tasks.

IV. THE DEVELOPMENT OF DATA MINING
The current evolution of data mining functions and products is the result of years of influence from many disciplines, including databases, information retrieval, statistics, algorithms, and machine learning. Another computer science area that has had a major impact on the KDD process is multimedia and graphics. A major goal of KDD is to be able to describe the results of the KDD process in a meaningful manner. Because many different results are often produced, this is a nontrivial problem. Visualization techniques often involve sophisticated multimedia and graphics presentations. In addition, data mining techniques can be applied to multimedia applications. Developments in artificial intelligence, information retrieval, databases, and statistics have all led to the current view of data mining. These different historical influences, which have led to the development of the data mining area as a whole, have given rise to different views of what data mining functions actually are:

• Induction: Induction is used to proceed from very specific knowledge to more general information. This type of technique is often found in artificial intelligence applications.
• Compression: Because the primary objective of data mining is to describe some characteristics of a set of data by a general model, this approach can be viewed as a type of compression. Here the detailed data within the database are abstracted and compressed to a smaller description of the data characteristics that are found in the model.
• Querying: The data mining process itself can be viewed as a type of querying of the underlying database. Indeed, an ongoing direction of data mining research is how to define a data mining query and whether a query language (like SQL) can be developed to capture the many different types of data mining queries.
• Approximation: Describing a large database can be viewed as using approximation to help uncover hidden information about the data.
• Search: When dealing with large databases, the impact of size and the efficiency of developing an abstract model can be thought of as a type of search problem.

V. CONCLUSION
The term KDD should be employed to describe the whole process of extracting knowledge from data. In this context, knowledge means relationships and patterns between data elements. The term data mining should be used exclusively for the discovery stage of the KDD process. A more or less official definition of KDD is: ‘the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data.’ So the knowledge must be new, not obvious, and one must be able to use it. KDD is not a new technique but rather a multi-disciplinary field of research; machine learning, statistics, database technology, expert systems, and data visualization all make a contribution.
