Garbage in, garbage out is a saying that dates back to the early days of computers but is still true
today, perhaps even more so. If the numbers you use in a statistical analysis are incorrect
(garbage), so too will be the results. That’s why so much effort has to go into getting the
numbers right. The process of checking data values can be
divided into two parts — verification and validation.
Verification addresses the issue of whether each value in the
dataset is identical to the value that was originally generated.
Twenty years ago, verification amounted to a check of the
data entry process. Did the keypunch operator enter all the
data as it appeared on the hard copy from the person who
generated the data? Today, it includes making sure
automated devices and instruments generate data as they
were designed to. Sometimes, for example, electrical
interference can cause instrumentation to record or output erroneous values.
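When both the original instrument export and the re-entered copy are available in machine-readable form, this kind of verification can be scripted. Here is a minimal sketch in pandas; the file names and the sample_id key column are hypothetical, and both files are assumed to share the same columns.

```python
import pandas as pd

# Hypothetical files: the instrument's own export and the re-keyed copy,
# assumed to share the same columns and a sample_id key.
original = pd.read_csv("instrument_export.csv").set_index("sample_id").sort_index()
entered = pd.read_csv("entered_data.csv").set_index("sample_id").sort_index()

# Samples present in one file but not the other.
missing = original.index.symmetric_difference(entered.index)
print("IDs in only one file:", list(missing))

# Cell-by-cell comparison of the rows both files share.
shared = original.index.intersection(entered.index)
diffs = original.loc[shared].compare(entered.loc[shared])
print(diffs if not diffs.empty else "All shared values match.")
```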
Spreadsheet Tricks
This is the point when using a spreadsheet to assemble your dataset really pays off. Most
spreadsheet programs have a variety of capabilities that are well suited to data scrubbing. Here
are a few tricks:
Marker Rows — Before you do any scrubbing, create marker rows throughout the
dataset. You can do this by coloring the cell fill for the entire row with a unique color.
You don’t need a lot; just spread them through the dataset. If you make any mistakes in
sorting that corrupt the rows, you'll be able to tell. You could do the same thing with
columns, but columns usually aren't sorted.
Original Order — Insert a column with the original order of the rows. This will allow
you to get back to the original order the data was in if you need to.
Sorting — One at a time, sort your dataset by each of your variables. Check the top and
the bottom of the column for entries with leading blanks and nonnumeric text. Then
check within each column for misspellings, ID variants, analyte aliases, non-numbers,
bad classifications, and incorrect dates.
Reformatting — Change fonts to detect character errors, such
as O and 0. Change the format on any date, time, currency, or
percentage fields, and incorrect entries may pop out. For example, a
percentage entered as 50% instead of 0.50 would be a text field that
could not be processed by a statistical package. This trick works
especially well with incorrect dates. Conditional formatting can also be
used to find data that fall outside a range of acceptable values. For
example, identify percentages greater than 1 by conditionally
formatting them with a red font.
Formulas — Write formulas to check proportions, sums, differences, and any other
relationships between variables in your dataset. Use cell information functions (e.g.,
VALUE and ISNUMBER in Excel) to verify that all your values are numbers and not
alphanumerics. Also, check to see if two columns are identical, in which case one can be
deleted. This problem occurs often with datasets that have been merged. Several of these
checks are also sketched in code after this list.
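If your dataset also lives in a script-friendly format, the same tricks translate directly to code. Here is a minimal pandas sketch of a few of them (the original-order column, the ISNUMBER-style check, the out-of-range percentages, and the duplicate-column check); the file name and column names are hypothetical.

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file

# Original Order: keep a column so the starting row order can always be restored.
df.insert(0, "orig_order", range(1, len(df) + 1))

# Formulas / ISNUMBER: flag entries that will not convert to numbers.
numeric_cols = ["lead_ppm", "ph", "recovery_pct"]  # hypothetical measurement columns
for col in numeric_cols:
    converted = pd.to_numeric(df[col], errors="coerce")
    bad = df.loc[converted.isna() & df[col].notna(), ["orig_order", col]]
    if not bad.empty:
        print(f"Non-numeric entries in {col}:\n{bad}")
    df[col] = converted

# Conditional formatting analogue: percentages outside the 0-1 range.
out_of_range = df[(df["recovery_pct"] < 0) | (df["recovery_pct"] > 1)]
print("Percentages outside 0-1:", len(out_of_range))

# Identical columns (common after merges): keep only one of each pair.
dup_mask = df.T.duplicated()
print("Duplicate columns:", dup_mask[dup_mask].index.tolist())
```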
Descriptive Statistics
Even before getting involved in the data analysis, you can use descriptive statistics to find errors.
Here are a few things to look for.
Counts — Make sure you have the same number of samples for all the variables.
Otherwise, you have missing or censored data to contend with. Count the number of data
values that are censored for each variable. If all the values are censored for a variable, the
variable can be removed from the dataset. Also count the number of samples in all levels
of grouping variables to see if you have any misclassifications.
Sums — If some measurements are supposed to sum to a constant, like 100%, you can
usually find errors pretty easily. Fixing them can be another matter. If it looks like just an
addition error, fix the entries by multiplying them by
{what the sum should be} divided by {what the incorrect sum is}
For example, if the sum should be 100% and the entries add up to 90%, multiply all the
entries by 1.0/0.9 (1.11) and then they’ll all add up to 100%. There will be situations
though, especially in opinion surveys, when you’ll have to try to divine the intent of the
respondent. If someone entered 1%, 30%, 49%, did he mean 1%, 50%, 49%, or 1%, 30%,
69%, or even 21%, 30%, 49%? It’s like being in Florida during November of 2000. You
want to use as much of the data as possible but you just have to be sure it’s the right data.
Min/Max — Look at the minimum and maximum for each variable to make sure there
are no anomalously high or low values.
Dispersion — Calculate the variance or standard deviation for each variable. If any are
zero, you can delete the variable because it will add nothing to your statistical analysis.
Correlations — Look at a correlation matrix for your variables. Look for correlations
between independent variables that are near 1. These variables are statistical duplicates in
that they convey the same information even if the numbers are different. You won’t need
both so delete one.
There are also other checks you could do depending on your dataset, for example,
recalculating any variables that were derived from other variables and confirming they
match. A pandas sketch of several of the checks above follows.
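As an illustration only, here is how several of these descriptive checks might look in pandas. The file, the grouping variable site, the composition columns, and the 0.95 correlation cutoff are all hypothetical, and the rescaling should be applied only once you're satisfied the discrepancy really is an addition error.

```python
import pandas as pd

df = pd.read_csv("dataset.csv")  # hypothetical file

# Counts: non-missing values per variable and samples per group level.
print(df.count())
print(df["site"].value_counts())  # hypothetical grouping variable

# Sums: composition columns that should add to 100% (stored here as fractions of 1).
parts = ["sand_pct", "silt_pct", "clay_pct"]  # hypothetical composition columns
row_sum = df[parts].sum(axis=1)
bad_rows = df[(row_sum - 1.0).abs() > 0.001].index
print("Rows not summing to 100%:", list(bad_rows))
# Rescale only if the error is clearly additive: multiply by target / actual sum.
df.loc[bad_rows, parts] = df.loc[bad_rows, parts].mul(1.0 / row_sum[bad_rows], axis=0)

# Min/Max and dispersion: anomalous extremes and zero-variance variables.
print(df[parts].agg(["min", "max", "std"]))
zero_var = [c for c in df.select_dtypes("number") if df[c].std() == 0]
print("Zero-variance variables:", zero_var)

# Correlations: pairs of variables with |r| near 1 are statistical duplicates.
corr = df.select_dtypes("number").corr()
pairs = corr.where(abs(corr) > 0.95).stack()
print(pairs[pairs.index.get_level_values(0) < pairs.index.get_level_values(1)])
```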
Plotting
Whatever plotting you do at this point is preliminary; you’re not looking to interpret the data,
only to find anomalies. These plots won't make it into your final report, so don't spend a lot of time
on them. Here are a few key graphics to look at.
Bivariate plots — Plot the relationships between independent variables having high
correlations to confirm whether they really are statistically redundant. Redundant variables can be
eliminated. Check plots of the dependent variable versus the independent variables for
outliers.
Time-series plots — If you have any data collected over time at the same sampling
point, plot the time-series. Look for incorrect dates and possible outliers in the data
series.
Maps — If you collected any spatially dependent samples or measurements, plot the
location coordinates on a map. Have field personnel review the map to see if there are
any obvious errors. If your surveyor made a mistake, this is where you’ll find it.
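A quick way to generate these three diagnostic plots is sketched below with matplotlib; the column names, the site ID MW-1, and the coordinate fields are hypothetical, and the figures are meant to be thrown away once the anomalies have been checked.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file with a date column, a site ID, and coordinate fields.
df = pd.read_csv("dataset.csv", parse_dates=["sample_date"])

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Bivariate plot: a highly correlated pair of independent variables.
axes[0].scatter(df["lead_ppm"], df["zinc_ppm"], s=10)
axes[0].set_xlabel("lead_ppm")
axes[0].set_ylabel("zinc_ppm")

# Time series at one sampling point: look for bad dates and outliers.
one_point = df[df["site"] == "MW-1"].sort_values("sample_date")
axes[1].plot(one_point["sample_date"], one_point["lead_ppm"], marker="o")
axes[1].set_xlabel("date")
axes[1].set_ylabel("lead_ppm")

# Map check: plot coordinates so obvious survey errors stand out.
axes[2].scatter(df["easting"], df["northing"], s=10)
axes[2].set_xlabel("easting")
axes[2].set_ylabel("northing")

fig.tight_layout()
plt.show()
```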
http://statswithcats.wordpress.com/2010/10/17/the-data-scrub-3/