Databases
Dr. Kishori Lal Bansal¹, Satish Sood²
¹ Associate Professor, Dept. of Computer Science, HP University, Summer Hill, Shimla, HP
² Research Scholar, Dept. of Computer Science, HP University, Summer Hill, Shimla, HP
satishdsala@gmail.com
Abstract— The terms knowledge discovery in databases (KDD) and data mining are often used interchangeably. In fact, many other names have been given to this process of discovering useful (hidden) patterns in data: knowledge extraction, information discovery, exploratory data analysis, information harvesting, and unsupervised pattern recognition. Over the last few years, KDD has been used to refer to a process consisting of many steps, while data mining is only one of these steps. This paper focuses on data mining and knowledge discovery in databases.

Keywords— Data Mining; Knowledge Discovery; Databases

I. INTRODUCTION
Knowledge discovery in databases (KDD) is the process of finding useful information and patterns in data. Data mining is the use of algorithms to extract the information and patterns derived by the KDD process. The KDD process is often said to be nontrivial; however, we take the larger view that KDD is an all-encompassing concept. A traditional SQL database query can be viewed as the data mining part of a KDD process. Indeed, this may be viewed as somewhat simple and trivial. However, this was not the case 30 years ago. If we were to advance 30 years into the future, we might find that processes thought of today as nontrivial and complex will be viewed as equally simple. KDD is a process that involves many different steps. The input to this process is the data, and the output is the useful information desired by the users. However, the objective may be unclear or inexact. The process itself is interactive and may require much elapsed time. To ensure the usefulness and accuracy of the results of the process, interaction throughout the process with both domain experts and technical experts might be needed.

II. KNOWLEDGE DISCOVERY PROCESS STEPS
The KDD process consists of the following five steps.

Selection: The data needed for the data mining process may be obtained from many different and heterogeneous data sources. This first step obtains the data from various databases, files, and nonelectronic sources.
Pre-processing: The data to be used by the process may have incorrect or missing data. There may be anomalous data from multiple sources involving different data types and metrics. There may be many different activities performed at this time. Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted, often using the data mining tools.
Transformation: Data from different sources must be converted into a common format for processing. Some data may be encoded or transformed into more usable formats. Data reduction may be used to reduce the number of possible data values being considered.
Data Mining: Based on the data mining task being performed, this step applies algorithms to the transformed data to generate the desired results.
Interpretation/evaluation: How the data mining results are presented to the users is extremely important because the usefulness of the results is dependent on it. Various visualization and GUI strategies are used at this last step.

Transformation techniques are used to make the data easier to mine and more useful, and to provide more meaningful results. The actual distribution of the data may be modified to facilitate use by techniques that require specific types of data distributions. Some attribute values may be combined to provide new values, thus reducing the complexity of the data. For example, current date and birth date could be replaced by age. One attribute could be substituted for another. An example would be replacing a sequence of actual attribute values with the differences between consecutive values. Real-valued attributes may be more easily handled by partitioning the values into ranges and using these discrete range values. Some data values may be removed altogether; outliers, extreme values that occur infrequently, are one example. The data may be transformed by applying a function to the values. A common transformation function is to use the log of the value rather than the value itself. These techniques make the mining task easier by reducing the dimensionality (number of attributes) or by reducing the variability of the data values. The removal of outliers can actually improve the quality of the results. As with all steps in the KDD process, however, care must be used in performing transformation. If used incorrectly, the transformation could change the data such that the results of the data mining step are inaccurate.

III. VISUALIZATION
Visualization refers to the visual presentation of data. The old expression “a picture is worth a thousand words” certainly is true when examining the structure of data. For example, a line graph that shows the distribution of a data variable is easier to understand and perhaps more informative than the formula for the corresponding distribution. The use of visualization techniques allows users to summarize, extract, and grasp more complex results than a more mathematical or text-based description of the results would allow. Visualization techniques include:

Graphical: Traditional graph structures including bar charts, pie charts, histograms, and line graphs may be used.
Geometric: Geometric techniques include box plots and scatter diagrams.
Icon-based: Using figures, colors, or other icons can improve the presentation of the results.
Pixel-based: With these techniques each data value is shown as a uniquely colored pixel.
Hierarchical: These techniques hierarchically divide the display area (screen) into regions based on data values.
Hybrid: The preceding approaches can be combined into one display.

Any of these approaches may be two-dimensional or three-dimensional. Visualization tools can be used to summarize data as a data mining technique itself. In addition, visualization can be used to show the complex results of data mining tasks.

V. CONCLUSION
The term KDD should be employed to describe the whole process of extraction of knowledge from data. In this context, knowledge means relationships and patterns between data elements. The term data mining should be used exclusively for the discovery stage of the KDD process. A more or less official definition of KDD is: ‘the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data.’ So the knowledge must be new, not obvious, and one must be able to use it. KDD is not a new technique but rather a multi-disciplinary field of research; machine learning, statistics, database technology, expert systems, and data visualization all make a contribution.
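
As a supplementary illustration of the pre-processing and transformation techniques described in Section II, the following Python sketch shows one possible form each technique could take. It is not taken from the paper: the function names, the mean-imputation rule for missing data, the range boundaries, and the two-standard-deviation outlier threshold are all assumptions chosen for the example.

```python
import math
from datetime import date
from statistics import mean, stdev


def impute_missing(values):
    """Supply missing data (None) with the mean of the observed values.
    Mean imputation is an assumed, deliberately simple rule."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def age_in_years(birth, today):
    """Combine two attributes (birth date, current date) into one (age)."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def differences(values):
    """Replace a sequence with the differences between consecutive values."""
    return [b - a for a, b in zip(values, values[1:])]


def to_range(value, edges):
    """Discretize a real value into the index of the range it falls in.
    `edges` are ascending upper bounds; larger values go to the last range."""
    for i, edge in enumerate(edges):
        if value < edge:
            return i
    return len(edges)


def drop_outliers(values, k=2.0):
    """Remove values more than k sample standard deviations from the mean.
    The threshold k is an assumption; the right choice depends on the data."""
    m, s = mean(values), stdev(values)
    return [v for v in values if abs(v - m) <= k * s]


def log_transform(values):
    """Use the natural log of each (strictly positive) value."""
    return [math.log(v) for v in values]


readings = [10, 11, 9, 10, 12, 8, 10, 100]
print(drop_outliers(readings))           # the extreme value 100 is dropped
print(differences([10, 12, 15, 15]))     # [2, 3, 0]
print(to_range(37.5, [18, 40, 65]))      # 1, i.e. the range [18, 40)
```

Each function changes the data that the mining step will see, which is why the caution in Section II applies: a threshold or range boundary chosen carelessly here can distort the final results.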
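
The claim in Section III that a graph of a variable's distribution is easier to grasp than a formula can be illustrated with a small text histogram. This is only a sketch under stated assumptions: the sample `ages` list, the bin width of 10, and the function names `bin_counts` and `render` are hypothetical, and a real system would more likely use a plotting library than character bars.

```python
from collections import Counter


def bin_counts(values, bin_width):
    """Return (bin_start, count) pairs for equal-width bins, including empty bins."""
    counts = Counter(int(v // bin_width) for v in values)
    first, last = min(counts), max(counts)
    return [(k * bin_width, counts[k]) for k in range(first, last + 1)]


def render(bins, bin_width):
    """Draw one bar of '#' characters per bin."""
    return "\n".join(
        f"[{start}, {start + bin_width}) {'#' * count}" for start, count in bins
    )


# Hypothetical sample data, used only to show the summary.
ages = [23, 25, 31, 34, 35, 36, 38, 41, 42, 47, 58]
print(render(bin_counts(ages, 10), 10))
```

The counts per range summarize the data, and the bar lengths make the shape of the distribution visible at a glance.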