Kusrini
Definition of Data Mining
Gartner Group:
“Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting through
large amounts of data stored in repositories, using
pattern recognition technologies as well as statistical and
mathematical techniques.”
Hand et al.:
“Data mining is the analysis of (often large) observational
data sets to find unsuspected relationships and to
summarize the data in novel ways that are both
understandable and useful to the data owner”
Evangelos Simoudis in Cabena et al:
“Data mining is an interdisciplinary field bringing
together techniques from machine learning, pattern
recognition, statistics, databases, and visualization to
address the issue of information extraction from large
data bases”
Berry and Linoff:
“Data mining is the process of exploration and analysis, by
automatic or semi-automatic means, of large quantities of
data in order to discover meaningful patterns and rules”
Why data mining?
Data warehouses
Transactional databases
Advanced database systems
Spatial and Temporal
Time-series
Multimedia, text
WWW
…
Knowledge Discovery in Databases
1. Data cleaning (to remove noise and inconsistent data)
2. Data integration (where multiple data sources may be combined)
3. Data selection (where data relevant to the analysis task are retrieved
from the database)
4. Data transformation (where data are transformed or consolidated
into forms appropriate for mining by performing summary or aggregation
operations, for instance)
5. Data mining (an essential process where intelligent methods are
applied in order to extract data patterns)
6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness measures)
7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined knowledge to the
user)
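As a rough illustration, steps 1 to 4 can be sketched in plain Python. The two source lists, the field names, and the cleaning rules below are hypothetical and serve only to show the order of the operations; a real pipeline would usually rely on a database or a data-frame library.

```python
# Hypothetical records from two sources; None marks a missing value.
store_a = [
    {"customer": "c1", "age": 25, "amount": 120.0},
    {"customer": "c2", "age": None, "amount": 80.0},   # missing age (noise)
    {"customer": "c1", "age": 25, "amount": 30.0},
]
store_b = [
    {"customer": "c3", "age": 41, "amount": 200.0},
    {"customer": "c3", "age": 41, "amount": -5.0},     # inconsistent amount
]

def clean(records):
    """Step 1: drop records with missing or inconsistent values."""
    return [r for r in records if r["age"] is not None and r["amount"] >= 0]

def integrate(*sources):
    """Step 2: combine several data sources into one list."""
    return [r for source in sources for r in source]

def select(records, fields):
    """Step 3: keep only the fields relevant to the analysis task."""
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    """Step 4: aggregate the data into one total amount per customer."""
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0.0) + r["amount"]
    return totals

combined = integrate(clean(store_a), clean(store_b))
prepared = transform(select(combined, ["customer", "amount"]))
print(prepared)
```

The resulting dictionary is the kind of consolidated input that the mining step (step 5) would then work on.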
Tasks in Data Mining
Classification
Estimation
Prediction
Clustering
Association
Classification
Classification is the process of finding a model (or
function) that describes and distinguishes data
classes or concepts, for the purpose of being able
to use the model to predict the class of objects
whose class label is unknown.
The derived model is based on the analysis of a set
of training data (i.e., data objects whose class label
is known).
In classification, there is a target categorical variable,
such as income bracket, which, for example, could be
partitioned into three classes or categories: high income,
middle income, and low income. The data mining model
examines a large set of records, each record containing
information on the target variable as well as a set of
input or predictor variables.
Algorithm:
k-nearest neighbor classification
Decision tree
Education proximity
System design
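A minimal sketch of k-nearest neighbor classification for the income-bracket example follows. The training records, the two predictor variables (age and years of education), and the choice of k = 3 are all hypothetical; the sketch uses unscaled Euclidean distance, whereas in practice predictors are usually normalized first.

```python
import math
from collections import Counter

# Hypothetical training records: (age, years of education) -> income bracket.
training = [
    ((25, 12), "low"),
    ((30, 12), "low"),
    ((35, 16), "middle"),
    ((40, 16), "middle"),
    ((45, 18), "high"),
    ((50, 20), "high"),
]

def knn_classify(query, records, k=3):
    """Return the majority class among the k records closest to query."""
    by_distance = sorted(records, key=lambda rec: math.dist(query, rec[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Classify a new record whose income bracket is unknown.
print(knn_classify((38, 16), training))
```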
Decision Tree
A decision tree is a structure that can be
used to divide a large collection of data
into successively smaller sets of records
by applying a sequence of decision rules.
With each successive division, the members
of the resulting sets become more and more
similar to one another
(Berry, Michael J.A., Linoff, Gordon S.,
2004)
A decision tree model consists of a set
of rules for dividing a large
heterogeneous population into smaller,
more homogeneous groups with respect
to the target variable
1. Choose an attribute as the root
2. Create a branch for each of its values
3. Split the cases among the branches
4. Repeat the process for each branch until all cases in the
branch have the same class
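The procedure above can be sketched as a short recursive function. The text does not say how the attribute is chosen at each level; this sketch assumes the attribute with the lowest weighted entropy after the split (the idea behind ID3-style information gain). The toy records and attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(rows, target):
    """Entropy of the class distribution in rows."""
    counts = Counter(r[target] for r in rows)
    total = len(rows)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def partition(rows, attr):
    """Group the cases by their value of attr (one group per branch)."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    return groups

def build_tree(rows, attributes, target):
    classes = Counter(r[target] for r in rows)
    # Stop when all cases share one class, or no attribute is left to split on.
    if len(classes) == 1 or not attributes:
        return classes.most_common(1)[0][0]

    # Choose the attribute (assumption: lowest weighted entropy after the split).
    def split_entropy(attr):
        groups = partition(rows, attr).values()
        return sum(len(g) / len(rows) * entropy(g, target) for g in groups)

    best = min(attributes, key=split_entropy)
    remaining = [a for a in attributes if a != best]
    # Create one branch per value, split the cases, and repeat per branch.
    return {best: {value: build_tree(group, remaining, target)
                   for value, group in partition(rows, best).items()}}

# Hypothetical training cases.
rows = [
    {"student": "yes", "credit": "fair", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "yes", "credit": "excellent", "buys": "yes"},
    {"student": "no", "credit": "fair", "buys": "yes"},
    {"student": "no", "credit": "excellent", "buys": "no"},
    {"student": "no", "credit": "excellent", "buys": "no"},
]
tree = build_tree(rows, ["student", "credit"], "buys")
print(tree)
```

Leaves are class labels and inner nodes are dictionaries keyed by attribute name, so the nested structure can be read directly as a set of decision rules.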
Bayesian Classification
A Bayesian classifier is a statistical classifier
that can be used to predict the probability that
an object belongs to a given class.
Bayesian classification is based on Bayes' theorem,
and its classification performance is comparable
to that of decision trees and neural networks.
Bayesian classifiers have been shown to achieve
high accuracy and speed when applied to
large databases.
X = (age = “<=30”, income = “medium”, student = “yes”, credit_rating = “fair”)
Class: buys_computer?
We need to maximize P(X|Ci) P(Ci) for i = 1, 2.
P(Ci) is the prior probability of each class, based on the sample data:
P(buys_computer = “yes”) = 9/14 = 0.643
P(buys_computer = “no”) = 5/14 = 0.357
Compute P(X|Ci) for i = 1, 2, assuming the attributes are conditionally independent given the class:
P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.600
P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
P(income = “medium” | buys_computer = “no”) = 2/5 = 0.400
P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
P(student = “yes” | buys_computer = “no”) = 1/5 = 0.200
P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.400
P(X | buys_computer = “yes”) = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
P(X | buys_computer = “no”) = 0.600 x 0.400 x 0.200 x 0.400 = 0.019
P(X | buys_computer = “yes”) P(buys_computer = “yes”) = 0.044 x 0.643 = 0.028
P(X | buys_computer = “no”) P(buys_computer = “no”) = 0.019 x 0.357 = 0.007
Conclusion: buys_computer = “yes”
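The arithmetic of this example can be checked with a few lines of Python. The sketch hard-codes the prior and conditional probabilities quoted in the example rather than counting them from the 14 training records, which are not reproduced here; the dictionary layout and the function name `score` are illustrative choices.

```python
# Prior probabilities P(Ci), taken from the example.
priors = {"yes": 9 / 14, "no": 5 / 14}

# Conditional probabilities P(attribute value | class) for the record X,
# taken from the example.
likelihoods = {
    "yes": {"age": 2 / 9, "income": 4 / 9, "student": 6 / 9, "credit_rating": 6 / 9},
    "no": {"age": 3 / 5, "income": 2 / 5, "student": 1 / 5, "credit_rating": 2 / 5},
}

def score(label):
    """Return P(X|Ci) * P(Ci), treating the attributes as independent given the class."""
    p = priors[label]
    for value in likelihoods[label].values():
        p *= value
    return p

scores = {label: score(label) for label in priors}
prediction = max(scores, key=scores.get)

for label, value in scores.items():
    print(f"buys_computer={label}: {value:.3f}")
print("prediction:", prediction)
```

The class with the larger product wins, so no normalization by P(X) is needed to reach the conclusion.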
Department: systems, age: 26-30, salary: 46-50K
Status?
Estimation
Estimation is similar to classification except that the
target variable is numerical rather than categorical.
Models are built using “complete” records, which
provide the value of the target variable as well as the
predictors.
Example:
Estimating the grade-point average (GPA) of a graduate
student, based on that student’s undergraduate GPA.
Estimating the systolic blood pressure reading of a hospital
patient, based on the patient’s age, gender, body-mass
index, and blood sodium levels.
Method:
simple linear regression and correlation
multiple regression
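A minimal sketch of simple linear regression for the GPA example, using the ordinary least-squares formulas for slope and intercept. The five (undergraduate GPA, graduate GPA) pairs are hypothetical “complete” records invented for illustration.

```python
# Hypothetical complete records: undergraduate GPA (predictor) and
# graduate GPA (numerical target).
undergrad = [2.0, 2.5, 3.0, 3.5, 4.0]
graduate = [2.6, 2.9, 3.1, 3.4, 3.6]

def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(undergrad, graduate)

def estimate(x):
    """Estimate the target value for a new predictor value."""
    return intercept + slope * x

print(f"graduate GPA = {intercept:.2f} + {slope:.2f} * undergraduate GPA")
print(f"estimate for 3.2: {estimate(3.2):.2f}")
```

Multiple regression follows the same idea with several predictors, as in the blood-pressure example.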
Prediction
Prediction is similar to classification and estimation,
except that for prediction, the results lie in the future.
Example:
Predicting the price of a stock three months into the future
Predicting the percentage increase in traffic deaths next
year if the speed limit is increased
Predicting whether a particular molecule in drug discovery
will lead to a profitable new drug for a pharmaceutical
company
Method:
simple linear regression and correlation
multiple regression
Decision Tree