Preparing Data

1. The document discusses methods for detecting missing values and outliers in data. 2. It provides guidance on how to handle missing values depending on the percentage of missing data, including using list-wise deletion or replacing values. 3. Detection of outliers is described both for individual (univariate) variables using standardized scores and for multiple (multivariate) variables using Mahalanobis distance, identifying cases 98 and 36 as multivariate outliers.

Uploaded by Alfu Alfinnikmah

• Recoding

• Missing value detection

• Outlier detection

1. Be able to detect missing values and know how to handle them
2. Be able to detect outliers, both univariate and multivariate
• To find missing values in categorical variables, use Frequencies
• To find missing values in measurement variables, use Descriptive Statistics
• For categorical variables:
▪ If fewer than 5% of values are missing, use the list-wise option
▪ If 5% or more are missing, define the missing value as a new category
• For measurement variables:
▪ If fewer than 5% of values are missing, use the list-wise option
▪ If between 5% and 15% are missing, use Transform > Replace Missing Values. Replacing less than 15% of the data has little effect on the outcome
▪ If more than 15% are missing, consider dropping the variable or the subject
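The thresholds above can be sketched in R. This is a minimal illustration, not the procedure used in the slides: the `survey` data frame, its column names, and the `suggest_action` helper are hypothetical. Mean replacement is one of several possible replacement methods and is assumed here only for the example.

```r
# Hypothetical toy data; the column names are illustrative only.
survey <- data.frame(
  income = c(52, NA, 61, 48, 57, 63, NA, 55, 60, 49,
             51, 58, 62, 47, 53, 59, 64, 50, 56, 54),
  region = c("N", "S", NA, "E", "W", "N", "S", "E", "W", "N",
             "S", "E", "W", "N", "S", "E", "W", "N", "S", "E"),
  stringsAsFactors = FALSE
)

# Percentage of missing values per variable
pct_missing <- 100 * colSums(is.na(survey)) / nrow(survey)

# Map a percentage to the action suggested by the rules above
suggest_action <- function(pct, is_categorical) {
  if (pct == 0) return("nothing to do")
  if (pct < 5) return("list-wise deletion")
  if (is_categorical) return("code missing as a new category")
  if (pct <= 15) return("replace missing values")
  "consider dropping the variable or the cases"
}

is_cat <- sapply(survey, function(col) is.character(col) || is.factor(col))
actions <- mapply(suggest_action, pct_missing, is_cat)
print(data.frame(pct_missing, action = actions))

# One simple replacement for a metric variable (assumed here): the variable mean
survey$income[is.na(survey$income)] <- mean(survey$income, na.rm = TRUE)
```
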
• Example: HBAT_MISSING.SAV
• www.prenhall.com/hair
• Univariate outlier detection
- scatter plot
- box plot
- standardized data

• Multivariate outlier detection

See page 75 of Multivariate Data Analysis (Hair et al., 2006)
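As a rough illustration of the box plot approach, base R's `boxplot.stats` reports the values that fall beyond the whiskers. The vector `x` is hypothetical, and the 1.5 × IQR whisker rule is R's default rather than a rule stated in the slides.

```r
# Hypothetical measurements with one extreme value in the last position
x <- c(12, 14, 13, 15, 14, 13, 16, 15, 14, 40)

# Values beyond the box plot whiskers (default: 1.5 times the hinge spread)
out_values <- boxplot.stats(x)$out

# Case numbers of the flagged values
out_cases <- which(x %in% out_values)
out_cases

# Draw the box plot; flagged values appear as separate points
boxplot(x, horizontal = TRUE)
```
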
• Example: HBAT.SAV
• Dependent variable: X19
• Independent variables: X6 – X18
• One way to identify univariate outliers is to convert all of the scores for a variable to standard scores (z-scores).
• If the sample size is small (80 or fewer cases), a case is an outlier if its standard score is ±2.5 or beyond.
• If the sample size is larger than 80 cases, a case is an outlier if its standard score is ±3.0 or beyond.
1. Convert the variable to z-scores
2. Sort in descending order
Example (outlying cases by variable):
X7: cases 13, 22, 90
X8: case 87
X9 – X10: no cases
X11: case 7
X12: case 90
X13: no cases
X14: case 77
X15: cases 6, 53
X16: case 24
X17: no cases
X18: cases 7, 84
X19: case 22
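The two steps above (convert to z-scores, then sort) can be sketched in R as follows. The function name `flag_univariate_outliers` and the `scores` vector are hypothetical; the cutoffs are the ones stated in the slides.

```r
# Flag univariate outliers from standard scores.
# Cutoff follows the rule above: 2.5 for n <= 80, otherwise 3.0.
flag_univariate_outliers <- function(x) {
  z <- as.numeric(scale(x))                 # (x - mean) / sd
  cutoff <- if (length(x) <= 80) 2.5 else 3.0
  hits <- which(abs(z) >= cutoff)
  # Sort flagged cases by descending |z|
  hits <- hits[order(abs(z[hits]), decreasing = TRUE)]
  data.frame(case = hits, z = round(z[hits], 2))
}

# Hypothetical scores: 19 ordinary values and one extreme value (case 20)
scores <- c(5.1, 4.8, 5.3, 5.0, 4.9, 5.2, 5.1, 4.7, 5.0, 5.4,
            4.9, 5.2, 5.0, 4.8, 5.1, 5.3, 4.9, 5.0, 5.2, 9.5)
res <- flag_univariate_outliers(scores)
res
```

Applying the function to each column of a data frame, for example with `lapply`, would give a per-variable list like the one in the example above.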
• Mahalanobis D² is a multidimensional version of a z-score.

• A case is a multivariate outlier if:

D²/df > 2.5

• df = number of variables

• Mahalanobis D² requires that the variables be metric.
Data: HBAT.SAV
• Dependent variable: X19
• Independent variables: X6 – X18
• df = 13

• Multivariate outliers:
• Cases: 98, 36
• Sort on the "mah_2" column
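A comparable check can be sketched in R with the base `mahalanobis` function, which returns the squared distance D². The matrix `X` below is simulated toy data, not HBAT, and the D²/df > 2.5 cutoff is assumed from the rule described above.

```r
# Mahalanobis D^2 for each case, scaled by df = number of variables.
set.seed(1)                                  # reproducible toy data
X <- matrix(rnorm(50 * 3), ncol = 3)         # hypothetical metric variables
X[50, ] <- c(6, -6, 6)                       # plant one extreme case

d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))
df <- ncol(X)
ratio <- d2 / df

# Sort cases by descending D^2 and keep those above the cutoff
ranked <- order(d2, decreasing = TRUE)
outliers <- ranked[ratio[ranked] > 2.5]
outliers
```

Sorting by D² mirrors the step of sorting on the saved distance column, so the most extreme cases appear first.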
# Assumes `data` is a data frame that has already been loaded
# Check for missing values
nrows <- nrow(data)
ncomplete <- sum(complete.cases(data))
ncomplete
nrows - ncomplete   # number of rows with at least one missing value
# Check the structure of the data and its variables
str(data)
library(ggplot2)
# Plot the distribution of the dependent variable
pl1 <- ggplot(data, aes(x = y))
pl1 + geom_density(fill = "red", alpha = 0.7)
# Here we can see that the distribution looks similar to a half-normal distribution.
# On closer inspection, there is a sudden spike towards the right end of the graph.
# This might be a sentinel value.
# A sentinel value is a special kind of bad numerical value: a value that is used to
# represent "yes", "no", or some other special case in numeric data.
# One way to detect sentinel values is to look for sudden jumps in an otherwise smooth distribution of values.
# We can now look at the summary of the "y" variable to confirm this
summary(data$y)
summary(data)
hist(data$age)
pie(table(data$y))
pie(table(data$loan))
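To make the sentinel-value idea concrete, here is a small self-contained sketch. The `days` vector and the use of 999 as a code for "unknown" are hypothetical and are not taken from the dataset above.

```r
# Hypothetical numeric column where 999 was used to mean "unknown"
days <- c(3, 5, 2, 8, 4, 6, 999, 7, 3, 999, 5, 4, 999, 6, 2)

# A sentinel often shows up as an extreme value repeated many times
top_value <- max(days)
share_at_max <- mean(days == top_value)

summary(days)      # a maximum far above the third quartile hints at a sentinel
share_at_max       # proportion of cases sitting exactly at the maximum

# Recode the sentinel as NA before analysis
days_clean <- ifelse(days == 999, NA, days)
summary(days_clean)
```
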
