
 Recoding

 Missing value detection

 Outlier detection
1. Be able to detect missing values and know how to handle them
2. Be able to detect outliers, both univariate and multivariate
 Use Frequencies for categorical variables
 Use Descriptive Statistics for measurement variables
 For categorical variables:
▪ If less than 5% of values are missing, use the listwise option
▪ If 5% or more are missing, define the missing value as a new category
 For measurement variables:
▪ If less than 5% of values are missing, use the listwise option
▪ If between 5% and 15% are missing, use Transform > Replace Missing Values.
Replacing less than 15% of the data has little effect on the outcome
▪ If more than 15% are missing, consider dropping the variable or the subject
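The thresholds above can be sketched in R. This is only an illustration under stated assumptions, not the SPSS procedure from the slides: `missing_action` is a hypothetical helper, the toy data frame is made up, and treating factor or character columns as "categorical" is an assumption.

```r
# Hypothetical helper: suggest a missing-data action per variable,
# following the 5% / 15% rules of thumb listed above.
missing_action <- function(df) {
  pct <- colMeans(is.na(df)) * 100   # percent missing per column
  # Assumption: factor/character columns are categorical, the rest are measurement
  categorical <- vapply(df, function(v) is.factor(v) || is.character(v), logical(1))
  action <- ifelse(
    pct < 5, "listwise deletion",
    ifelse(categorical, "define missing as a new category",
           ifelse(pct <= 15, "replace missing values",
                  "consider dropping variable or subject"))
  )
  data.frame(variable = names(df), pct_missing = pct, action = action,
             row.names = NULL, stringsAsFactors = FALSE)
}

# Toy data (made up), 20 rows
toy <- data.frame(
  age    = 21:40,                                         # no missing values
  income = c(rep(NA, 2), 3:20),                           # 2 of 20 missing
  score  = c(rep(NA, 5), 1:15),                           # 5 of 20 missing
  region = factor(c(NA, NA, rep(c("east", "west"), 9)))   # 2 of 20 missing
)

result <- missing_action(toy)
print(result)
```

The helper only reports a suggestion per variable; the actual deletion, recoding, or replacement is left to the analyst.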
 Example : HBAT_MISSING.SAV
 www.prenhall.com/hair
 Univariate outlier detection
- scatter plot
- box plot
- standardized data

 Multivariate outlier detection

See page 75 of Multivariate Data Analysis (Hair et al., 2006)
 Example : HBAT.SAV
 Dependent var : X19
 Independent var : X6 – X18
 One way to identify univariate outliers is to
convert all of the scores for a variable to
standard scores.
 If the sample size is small (80 or fewer cases), a
case is an outlier if its standard score is ±2.5 or
beyond.
 If the sample size is larger than 80 cases, a case
is an outlier if its standard score is ±3.0 or
beyond
1. Convert to Zscore
2. Sort descending
Example (outlying cases per variable):
 X7: 13, 22, 90
 X8: 87
 X9-X10: no cases
 X11: 7
 X12: 90
 X13: no cases
 X14: 77
 X15: 6, 53
 X16: 24
 X17: no cases
 X18: 7, 84
 X19: 22
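The z-score screening described above (convert to standard scores, apply the sample-size cutoff, sort) can be sketched in R. This is a minimal illustration on made-up data, not the HBAT analysis; `flag_univariate_outliers` is a hypothetical helper name.

```r
# Hypothetical helper: flag univariate outliers by standard score.
# Cutoff follows the rule of thumb above: 2.5 for 80 or fewer cases, else 3.0.
flag_univariate_outliers <- function(x) {
  n <- sum(!is.na(x))
  cutoff <- if (n <= 80) 2.5 else 3.0
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)   # standard scores
  out <- data.frame(case = seq_along(x), value = x, z = z)
  out <- out[!is.na(out$z) & abs(out$z) >= cutoff, ]
  out[order(-abs(out$z)), ]                                # largest |z| first
}

# Toy data (made up): 29 typical values plus one extreme value as case 30
x <- c(rep(c(48, 50, 52), length.out = 29), 95)
flagged <- flag_univariate_outliers(x)
print(flagged)
```

With 30 cases the 2.5 cutoff applies, and only the extreme value in case 30 is reported.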
 Mahalanobis D² is a multidimensional version of a z-score.

 A case is a multivariate outlier if:

 D² / df > 2.5

 df = number of variables

 Mahalanobis D² requires that the variables be metric.
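A rough R sketch of this rule follows. It uses base R's `mahalanobis()`, which returns the squared distance D²; `flag_multivariate_outliers`, the column names, and the toy data are assumptions for illustration and do not reproduce the HBAT example.

```r
# Hypothetical helper: flag multivariate outliers using D^2 / df,
# where df is the number of (metric) variables.
flag_multivariate_outliers <- function(X, cutoff = 2.5) {
  X <- as.matrix(X)
  d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))  # squared distances
  df <- ncol(X)
  out <- data.frame(case = seq_len(nrow(X)), d2 = d2, d2_df = d2 / df)
  out <- out[out$d2_df > cutoff, ]
  out[order(-out$d2_df), ]                                  # largest ratio first
}

# Toy data (made up): two strongly correlated variables for cases 1-20,
# plus case 21, which is ordinary on x1 alone but breaks the x1-x2 pattern.
x1 <- c(1:20, 10)
x2 <- c(1:20 + rep(c(0.5, -0.5), 10), 20)
mv_flagged <- flag_multivariate_outliers(cbind(x1, x2))
print(mv_flagged)
```

Case 21 illustrates why the multivariate check is needed: its x1 value sits near the mean, so it is the combination of values, not any single one, that makes it unusual.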
Data : HBAT.SAV
 Dependent var : X19
 Independent var : X6 – X18
 Df = 13

 Multivariate outlier :
 Case : 98, 36
 Sort on the “mah_2” column
# Check for missing values in the data
nrows <- nrow(data)
ncomplete <- sum(complete.cases(data))
ncomplete            # number of rows with no missing values
nrows - ncomplete    # number of rows with at least one missing value
# Check the structure of the data and its variables
str(data)
library(ggplot2)
# Plot the distribution of the dependent variable
pl1 <- ggplot(data, aes(y))
pl1 + geom_density(fill = "red", alpha = 0.7)
# The distribution looks similar to a half-normal distribution.
# On closer inspection, there is a sudden spike towards the right end of the graph.
# This may be a sentinel value.
# A sentinel value is a special kind of bad numerical value: a number used to encode "yes", "no", or some other special case in numeric data.
# One way to detect sentinel values is to look for sudden jumps in an otherwise smooth distribution of values.
# The summary of the "y" variable helps confirm this.
summary(data$y)
summary(data)
hist(data$age)
pie(table(data$y))
pie(table(data$loan))
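As a complement to inspecting the density plot by eye, a sentinel value can sometimes be spotted by tabulating how often each value occurs. The sketch below is an assumption-laden illustration, not part of the original analysis: `suspect_sentinels`, the 5% share threshold, and the toy data are all hypothetical.

```r
# Hypothetical helper: list values of a numeric variable that account for an
# unusually large share of the observations (candidate sentinel values).
suspect_sentinels <- function(x, min_share = 0.05) {
  x <- x[!is.na(x)]
  counts <- table(x)
  out <- data.frame(
    value = as.numeric(names(counts)),
    n     = as.integer(counts),
    share = as.numeric(counts) / length(x)
  )
  out <- out[out$share >= min_share, ]
  out[order(-out$share), ]
}

# Toy data (made up): values 1..90 once each, plus 999 repeated 10 times
x_toy <- c(1:90, rep(999, 10))
suspects <- suspect_sentinels(x_toy)
print(suspects)
```

A flagged value is only a candidate; whether it really encodes a special case has to be checked against the data dictionary.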
