
DATA MINING

Kusrini
Definition of Data Mining
 Gartner Group:
“Data mining is the process of discovering meaningful
new correlations, patterns and trends by sifting through
large amounts of data stored in repositories, using
pattern recognition technologies as well as statistical and
mathematical techniques.”
 Hand et al.:
“Data mining is the analysis of (often large) observational
data sets to find unsuspected relationships and to
summarize the data in novel ways that are both
understandable and useful to the data owner”
 Evangelos Simoudis, in Cabena et al.:
“Data mining is an interdisciplinary field bringing
together techniques from machine learning, pattern
recognition, statistics, databases, and visualization to
address the issue of information extraction from large
data bases”
 Berry and Linoff:
“Data mining is the process of exploration and analysis, by
automatic or semi-automatic means, of large quantities of
data in order to discover meaningful patterns and rules.”
Why data mining?
 The explosive growth in data collection
 The storing of data in data warehouses
 The availability of increased access to data from Web navigation and intranets
 We have to find a more effective way to use these data in the decision-support process than just using traditional query languages
On what kind of data?

 Data warehouses
 Transactional databases
 Advanced database systems
 Spatial and temporal
 Time-series
 Multimedia, text
 WWW
…
Knowledge Discovery in Databases (KDD)
 1. Data cleaning (to remove noise and inconsistent data)
 2. Data integration (where multiple data sources may be combined)
 3. Data selection (where data relevant to the analysis task are retrieved
from the database)
 4. Data transformation (where data are transformed or consolidated
into forms appropriate for mining by performing summary or aggregation
operations, for instance)
 5. Data mining (an essential process where intelligent methods are
applied in order to extract data patterns)
 6. Pattern evaluation (to identify the truly interesting patterns
representing knowledge based on some interestingness measures)
 7. Knowledge presentation (where visualization and knowledge
representation techniques are used to present the mined knowledge to the
user)
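As an illustration only, steps 1-4 might look like the following in Python with pandas; the file names and columns (customers.csv, transactions.csv, id, amount, and so on) are hypothetical and not part of the original material.

```python
# A minimal sketch of KDD steps 1-4 using pandas; file and column names
# are invented purely to illustrate the flow.
import pandas as pd

# Two hypothetical data sources
customers = pd.read_csv("customers.csv")        # id, age, income, region
transactions = pd.read_csv("transactions.csv")  # id, amount, date

# 1. Data cleaning: remove rows with missing values and duplicate rows
customers = customers.dropna().drop_duplicates()
transactions = transactions.dropna().drop_duplicates()

# 2. Data integration: combine the sources on a shared key
data = customers.merge(transactions, on="id", how="inner")

# 3. Data selection: keep only the attributes relevant to the mining task
data = data[["id", "age", "income", "amount"]]

# 4. Data transformation: aggregate to one summary record per customer
summary = data.groupby(["id", "age", "income"], as_index=False)["amount"].sum()

# Steps 5-7 (mining, pattern evaluation, presentation) would take `summary`
# as input, e.g. a classifier or clustering algorithm plus reporting.
print(summary.head())
```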
Tasks in Data Mining

 Classification
 Estimation
 Prediction
 Clustering
 Association
Classification
 Classification is the process of finding a model (or
function) that describes and distinguishes data
classes or concepts, for the purpose of being able
to use the model to predict the class of objects
whose class label is unknown.
 The derived model is based on the analysis of a set
of training data (i.e., data objects whose class label
is known).
 In classification, there is a target categorical variable,
such as income bracket, which, for example, could be
partitioned into three classes or categories: high income,
middle income, and low income. The data mining model
examines a large set of records, each record containing
information on the target variable as well as a set of
input or predictor variables.
 Algorithms (see the sketch below):
 k-nearest neighbor classification
 Decision trees
 Naïve Bayesian classification
 Support vector machines
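The sketch below is illustrative only: it shows that the four classifiers listed above share the same train-then-predict pattern in scikit-learn. The tiny data set (age and years of education as predictors, an income bracket as the class label) is invented.

```python
# Illustrative only: the listed classifiers share one fit/predict interface.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Training records: [age, years_of_education] with a known income-bracket label
X_train = [[25, 12], [38, 16], [47, 18], [52, 20], [29, 14], [61, 12]]
y_train = ["low", "middle", "high", "high", "low", "middle"]

# A new record whose class label is unknown
X_new = [[42, 17]]

models = {
    "k-nearest neighbor": KNeighborsClassifier(n_neighbors=3),
    "decision tree": DecisionTreeClassifier(random_state=0),
    "naive Bayes": GaussianNB(),
    "support vector machine": SVC(),
}

for name, model in models.items():
    model.fit(X_train, y_train)               # learn from the labelled training data
    print(name, "->", model.predict(X_new))   # predict the unknown class
```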
Nearest Neighbor (KNN)
 Nearest Neighbor is an approach to finding a matching case by computing the similarity between a new case and the old cases, based on a weighted match over a number of existing features.
 Suppose, for example, that we want to find a solution for a new patient by reusing the solutions of previous patients.
 To decide which old patient case to use, the similarity of the new patient's case to every old patient case is computed.
 The old patient case with the greatest similarity is the one whose solution is reused for the new patient.
 As shown in the figure, there are two old patients, A and B.
 When a new patient arrives, the solution adopted is the one belonging to the old patient closest to the new patient.
 Suppose d1 is the similarity between the new patient and patient A, while d2 is the similarity between the new patient and patient B.
 Because d2 indicates a closer match than d1, patient B's solution is the one used for the new patient.
Similarity formula
 Similarity values usually lie between 0 and 1.
 A value of 0 means the two cases are completely dissimilar; a value of 1 means the cases match exactly.
Weights between variables, and similarity values for Religion, Gender, and Education (tables shown on the original slides)
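The exact formula and the weight/similarity tables were given as figures on the original slides and are not reproduced here. The sketch below therefore assumes a common weighted form, similarity(new, old) = sum(w_i * s_i) / sum(w_i), with each per-attribute similarity s_i in the range 0..1; the weights and table values are invented placeholders.

```python
# Weighted case similarity, as described above. Weights and per-attribute
# similarity values are invented; the real ones were in the slide tables.

weights = {"religion": 1, "gender": 1, "education": 3}   # assumed weights

def attribute_similarity(attribute, a, b):
    """Similarity of two values of one attribute, on a 0..1 scale."""
    if a == b:
        return 1.0
    partial = {"education": {("bachelor", "master"): 0.75}}  # hypothetical table
    table = partial.get(attribute, {})
    return table.get((a, b)) or table.get((b, a)) or 0.0

def case_similarity(new_case, old_case):
    """Weighted average of per-attribute similarities, in the range 0..1."""
    total = sum(w * attribute_similarity(attr, new_case[attr], old_case[attr])
                for attr, w in weights.items())
    return total / sum(weights.values())

old_cases = [
    {"id": "A", "religion": "islam", "gender": "male",
     "education": "bachelor", "solution": "treatment 1"},
    {"id": "B", "religion": "islam", "gender": "female",
     "education": "master", "solution": "treatment 2"},
]
new_case = {"religion": "islam", "gender": "female", "education": "bachelor"}

# Retrieve the most similar old case and reuse its solution for the new case
best = max(old_cases, key=lambda c: case_similarity(new_case, c))
print(best["id"], round(case_similarity(new_case, best), 2), "->", best["solution"])
```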
System Design
Decision Tree
 A decision tree is a structure that can be used to divide a large collection of data into smaller sets of records by applying a series of decision rules. With each successive split, the members of the resulting sets become more and more similar to one another (Berry and Linoff, 2004).
 A decision tree model consists of a set of rules for dividing a large heterogeneous population into smaller groups that are more homogeneous with respect to the target variable.
 Choose an attribute as the root
 Create a branch for each of the attribute's values
 Split the cases among the branches
 Repeat the process for each branch until all cases in the branch belong to the same class (see the sketch below)
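As an illustration of these steps, here is a compact ID3-style sketch that chooses the splitting attribute by information gain and recurses until each branch is pure; the toy weather records are invented.

```python
# ID3-style decision tree induction: pick an attribute, branch on each of its
# values, split the cases, and repeat until a branch holds a single class.
import math
from collections import Counter

def entropy(rows, target):
    counts = Counter(r[target] for r in rows)
    n = len(rows)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def information_gain(rows, attribute, target):
    n = len(rows)
    remainder = 0.0
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        remainder += len(subset) / n * entropy(subset, target)
    return entropy(rows, target) - remainder

def build_tree(rows, attributes, target):
    classes = {r[target] for r in rows}
    if len(classes) == 1 or not attributes:        # branch is pure (or no attribute left)
        return Counter(r[target] for r in rows).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(rows, a, target))
    tree = {best: {}}
    for value in {r[best] for r in rows}:          # one branch per attribute value
        subset = [r for r in rows if r[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, rest, target)
    return tree

records = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "overcast", "windy": "yes", "play": "yes"},
]
print(build_tree(records, ["outlook", "windy"], "play"))
```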
Bayesian Classification
 Bayesian classification is a statistical classifier that can be used to predict the probability that a sample belongs to a particular class.
 It is based on Bayes' theorem and offers classification performance comparable to decision trees and neural networks.
 Bayesian classifiers have been shown to achieve high accuracy and speed when applied to large databases.
 X = (age = “<=30”, income = “medium”, student = “yes”, credit_rating = “fair”); which class of buys_computer?
 We need to maximize P(X|Ci) P(Ci) for i = 1, 2.
 P(Ci) is the prior probability of each class, estimated from the training data:
 P(buys_computer = “yes”) = 9/14 = 0.643
 P(buys_computer = “no”) = 5/14 = 0.357
 Compute P(X|Ci) for i = 1, 2:
 P(age = “<=30” | buys_computer = “yes”) = 2/9 = 0.222
 P(age = “<=30” | buys_computer = “no”) = 3/5 = 0.600
 P(income = “medium” | buys_computer = “yes”) = 4/9 = 0.444
 P(income = “medium” | buys_computer = “no”) = 2/5 = 0.400
 P(student = “yes” | buys_computer = “yes”) = 6/9 = 0.667
 P(student = “yes” | buys_computer = “no”) = 1/5 = 0.200
 P(credit_rating = “fair” | buys_computer = “yes”) = 6/9 = 0.667
 P(credit_rating = “fair” | buys_computer = “no”) = 2/5 = 0.400
 P(X|buys_computer = “yes”) = 0.222 × 0.444 × 0.667 × 0.667 = 0.044
 P(X|buys_computer = “no”) = 0.600 × 0.400 × 0.200 × 0.400 = 0.019
 P(X|buys_computer = “yes”) P(buys_computer = “yes”) = 0.044 × 0.643 = 0.028
 P(X|buys_computer = “no”) P(buys_computer = “no”) = 0.019 × 0.357 = 0.007
 Conclusion: buys_computer = “yes”
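The calculation above can be reproduced with a short script. The probabilities are copied from the slide; the underlying 14-record buys_computer training set from which they were counted is not reproduced here.

```python
# Naive Bayes hand calculation for X = (age<=30, income=medium,
# student=yes, credit_rating=fair), using the probabilities given above.
priors = {"yes": 9 / 14, "no": 5 / 14}

likelihoods = {
    "yes": {"age=<=30": 2 / 9, "income=medium": 4 / 9,
            "student=yes": 6 / 9, "credit_rating=fair": 6 / 9},
    "no":  {"age=<=30": 3 / 5, "income=medium": 2 / 5,
            "student=yes": 1 / 5, "credit_rating=fair": 2 / 5},
}

x = ["age=<=30", "income=medium", "student=yes", "credit_rating=fair"]

scores = {}
for c in priors:
    p_x_given_c = 1.0
    for feature in x:                       # naive conditional-independence assumption
        p_x_given_c *= likelihoods[c][feature]
    scores[c] = p_x_given_c * priors[c]     # P(X|Ci) * P(Ci)
    print(c, round(p_x_given_c, 3), round(scores[c], 3))

print("prediction: buys_computer =", max(scores, key=scores.get))
# Output: yes 0.044 0.028, no 0.019 0.007, prediction: buys_computer = yes
```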
 Exercise: Department = systems, age = 26–30, salary = 46K–50K
 Status = ?
Estimation
 Estimation is similar to classification except that the
target variable is numerical rather than categorical.
 Models are built using “complete” records, which
provide the value of the target variable as well as the
predictors.
 Example:
 Estimating the grade-point average (GPA) of a graduate
student, based on that student’s undergraduate GPA.
 Estimating the systolic blood pressure reading of a hospital
patient, based on the patient’s age, gender, body-mass
index, and blood sodium levels
 Methods (see the sketch below):
 simple linear regression and correlation
 multiple regression
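As a minimal illustration of simple linear regression for the GPA example, the sketch below fits a least-squares line with numpy; the data points are invented.

```python
# Simple linear regression: estimate graduate GPA from undergraduate GPA.
# The six (x, y) pairs are invented for illustration.
import numpy as np

undergrad_gpa = np.array([2.8, 3.0, 3.2, 3.5, 3.7, 3.9])   # predictor
graduate_gpa  = np.array([3.0, 3.1, 3.3, 3.5, 3.6, 3.8])   # numerical target

# Fit graduate_gpa ~ slope * undergrad_gpa + intercept (ordinary least squares)
slope, intercept = np.polyfit(undergrad_gpa, graduate_gpa, deg=1)

new_student = 3.4
print(f"estimated graduate GPA: {slope * new_student + intercept:.2f}")

# Correlation between predictor and target, as in the methods list above
print(f"correlation: {np.corrcoef(undergrad_gpa, graduate_gpa)[0, 1]:.3f}")
```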
Prediction
 Prediction is similar to classification and estimation,
except that for prediction, the results lie in the future
 Example:
 Predicting the price of a stock three months into the future
 Predicting the percentage increase in traffic deaths next
year if the speed limit is increased
 Predicting whether a particular molecule in drug discovery
will lead to a profitable new drug for a pharmaceutical
company
 Methods:
 simple linear regression and correlation
 multiple regression
 k-nearest neighbor classification
 Decision trees
 Naïve Bayesian classification
 Support vector machines