
THE DATA SCRUB

Garbage in, garbage out is a saying that dates back to the early days of computers but is still true
today, perhaps even more so. If the numbers you use in a statistical analysis are incorrect
(garbage), so too will be the results. That’s why so much effort has to go into getting the
numbers right. The process of checking data values can be
divided into two parts — verification and validation.
Verification addresses the issue of whether each value in the
dataset is identical to the value that was originally generated.
Twenty years ago, verification amounted to a check of the
data entry process. Did the keypunch operator enter all the
data as it appeared on the hard copy from the person who
generated the data? Today, it includes making sure
automated devices and instruments generate data as they
were designed to. Sometimes, for example, electrical
interference can cause instrumentation to record or output erroneous values. Verification is a simple concept embodied in simple yet time-consuming processes.

[Cartoon caption: I'm the oversight contractor. Give me all your QA/QC records … and a can of tuna.]

RULES OF THE DATA SCRUB
1st Rule: You should talk about data scrubbing.
2nd Rule: You DO document what you did in data scrubbing.
3rd Rule: If a client says stop or doesn't pay bills, data scrubbing is over.
4th Rule: Only two eyes to a file.
5th Rule: One dataset at a time.
6th Rule: No effort, no clues.
7th Rule: Data scrubbing goes on as long as it has to.
8th Rule: If this is your first try at data scrubbing, you have to look for everything.

Validation addresses the issue of whether the data were generated in accordance with quality assurance specifications. In other words, validation is a process to determine if the data are of a known level of quality that is appropriate for the analyses to be conducted and the decisions that will be made as a result. An example of a validation process is the assignment of data quality flags to each analyte concentration reported by an analytical chemistry laboratory. Validation is a complex concept embodied in a complex, time-consuming process.
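As a loose illustration of that validation step, here is a minimal pandas sketch that assigns quality flags to reported lab concentrations. The flag codes, reporting limit, and column names are assumptions made for this example, not any laboratory's actual convention.

```python
import pandas as pd

# Hypothetical lab report; column names, reporting limit, and flag codes
# are invented for this illustration, not any laboratory's convention.
lab = pd.DataFrame({
    "sample_id": ["S-01", "S-02", "S-03"],
    "analyte": ["lead"] * 3,
    "conc_mg_l": [0.004, 0.120, None],
    "reporting_limit": [0.005, 0.005, 0.005],
})

def assign_flag(row):
    """Assign a simple data-quality flag (illustrative rules only)."""
    if pd.isna(row["conc_mg_l"]):
        return "R"   # result missing or rejected
    if row["conc_mg_l"] < row["reporting_limit"]:
        return "U"   # detected below the reporting limit (censored)
    return ""        # no qualifier

lab["flag"] = lab.apply(assign_flag, axis=1)
print(lab)
```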
Changes in data points create a dilemma for statisticians. Change one data point and you’ve
changed the entire analysis. It probably won't change your conclusions, but all those means, variances, and other statistics, not to mention the graphs and maps, will be incorrect, or at least inconsistent with the final database. This is the most important warning I issue to my clients
when I start a job. Still, on the majority of projects, some data point changes after I am well into
the data analysis. It’s inevitable.
Finding Data Problems
Lots of bad things can happen in a dataset, so it’s useful to have a few tricks, tips, and tools for
finding those mistakes and anomalies.

Spreadsheet Tricks
This is the point when using a spreadsheet to assemble your dataset really pays off. Most
spreadsheet programs have a variety of capabilities that are well suited to data scrubbing. Here
are a few tricks:
Marker Rows — Before you do any scrubbing, create marker rows throughout the
dataset. You can do this by coloring the cell fill for the entire row with a unique color.
You don’t need a lot; just spread them through the dataset. If you make any mistakes in
sorting that corrupt the rows, you’ll be able to tell. You could do the same thing with
columns but columns usually aren’t sorted.
Original Order — Insert a column with the original order of the rows. This will allow
you to get back to the original order the data was in if you need to.
Sorting — One at a time, sort your dataset by each of your variables. Check the top and
the bottom of the column for entries with leading blanks and nonnumeric text. Then
check within each column for misspellings, ID variants, analyte aliases, non-numbers,
bad classifications, and incorrect dates.
Reformatting — Change fonts to detect character errors, such as O and 0. Change the format on any date, time, currency, or percentage field, and incorrect entries may pop out. For example, a percentage entered as 50% instead of 0.50 would be a text field that could not be processed by a statistical package. This trick works especially well with incorrect dates. Conditional formatting can also be used to find data that fall outside a range of acceptable values. For example, identify percentages greater than 1 by conditionally formatting them with a red font.
Formulas — Write formulas to check proportions, sums, differences, and any other relationships between variables in your dataset. Use cell information functions (e.g., VALUE and ISNUMBER in Excel) to verify that all your values are numbers and not alphanumerics. Also, check to see if two columns are identical, in which case one can be deleted. This problem occurs often with datasets that have been merged. A pandas sketch of several of these checks follows this list.

[Cartoon caption: Is that a 0 or an O?]
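If a copy of the dataset also lives outside the spreadsheet, several of these tricks can be approximated in pandas. This is a sketch only; the file name and column names (mydata.csv, result) are placeholders you would swap for your own.

```python
import pandas as pd

# Placeholder file and column names for illustration only.
df = pd.read_csv("mydata.csv")

# Original order: keep a column so any sort can be undone later.
df["orig_order"] = range(len(df))

# Sorting-style check: find non-numeric text in a column that should be numeric.
as_num = pd.to_numeric(df["result"], errors="coerce")
print("Non-numeric entries:")
print(df.loc[as_num.isna() & df["result"].notna(), ["orig_order", "result"]])

# Conditional-format-style check: percentages entered as 50 instead of 0.50.
print("Percentages greater than 1:")
print(df.loc[as_num > 1, ["orig_order", "result"]])

# Identical columns (common after merges); one of each pair can be dropped.
dupes = [(a, b)
         for i, a in enumerate(df.columns)
         for b in df.columns[i + 1:]
         if df[a].equals(df[b])]
print("Identical column pairs:", dupes)

# Restore the original row order after any sorting.
df = df.sort_values("orig_order")
```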

Descriptive Statistics
Even before getting involved in the data analysis, you can use descriptive statistics to find errors.
Here are a few things to look for; a pandas sketch of these checks follows the list.
Counts — Make sure you have the same number of samples for all the variables.
Otherwise, you have missing or censored data to contend with. Count the number of data
values that are censored for each variable. If all the values are censored for a variable, the
variable can be removed from the dataset. Also count the number of samples in all levels
of grouping variables to see if you have any misclassifications.
Sums — If some measurements are supposed to sum to a constant, like 100%, you can
usually find errors pretty easily. Fixing them can be another matter. If it looks like just an
addition error, fix the entries by multiplying them by
{what the sum should be} divided by {what the incorrect sum is}
For example, if the sum should be 100% and the entries add up to 90%, multiply all the
entries by 1.0/0.9 (1.11) and then they’ll all add up to 100%. There will be situations
though, especially in opinion surveys, when you’ll have to try to divine the intent of the
respondent. If someone entered 1%, 30%, 49%, did he mean 1%, 50%, 49%, or 1%, 30%,
69%, or even 21%, 30%, 49%? It’s like being in Florida during November of 2000. You
want to use as much of the data as possible but you just have to be sure it’s the right data.
Min/Max — Look at the minimum and maximum for each variable to make sure there
are no anomalously high or low values.
Dispersion — Calculate the variance or standard deviation for each variable. If any are
zero, you can delete the variable because it will add nothing to your statistical analysis.
Correlations — Look at a correlation matrix for your variables. Look for correlations
between independent variables that are near 1. These variables are statistical duplicates in
that they convey the same information even if the numbers are different. You won’t need
both so delete one.
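Assuming again that the data sit in a pandas DataFrame, these descriptive checks take only a few lines. The file name and the 0.95 correlation cutoff below are placeholders, not recommendations.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("mydata.csv")           # placeholder file name
num = df.select_dtypes("number")         # numeric variables only

# Counts: unequal counts point to missing or censored values.
print(num.count())

# Min/Max: scan for anomalously high or low values.
print(num.agg(["min", "max"]))

# Dispersion: zero-variance variables add nothing to the analysis.
print("Zero variance:", num.columns[num.std() == 0].tolist())

# Correlations: near-1 correlations between independent variables are
# statistical duplicates; one of each pair can usually be dropped.
corr = num.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print("Near-duplicates:")
print(pairs[pairs > 0.95])               # arbitrary cutoff for this sketch
```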
There are also other calculations that you could do depending on your dataset, for example,
recalculating data derived from other variables.
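For instance, here is a minimal sketch of two such recalculations, using hypothetical part_a, part_b, and total columns: recomputing a derived column to flag disagreements, and applying the rescale-to-the-target-sum fix from the Sums item above.

```python
import pandas as pd

# Hypothetical components that should sum to 1.0, plus a derived total column.
df = pd.DataFrame({
    "part_a": [0.30, 0.50, 0.10],
    "part_b": [0.60, 0.50, 0.80],
    "total":  [0.90, 1.00, 0.95],
})

# Recalculate the derived column and flag rows where it disagrees.
recalc = df["part_a"] + df["part_b"]
print("Derived-column mismatches:")
print(df[(recalc - df["total"]).abs() > 1e-9])

# Rescale rows that miss the target sum: multiply by target / actual sum.
target = 1.0
parts = df[["part_a", "part_b"]]
df[["part_a", "part_b"]] = parts.mul(target / parts.sum(axis=1), axis=0)
print(df)
```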

Plotting
Whatever plotting you do at this point is preliminary; you’re not looking to interpret the data,
only find anomalies. These plots won’t make it into your final report, so don’t spend a lot of time
on them. Here are a few key graphics to look at; a quick matplotlib sketch follows the list.
Bivariate plots — Plot the relationships between independent variables having high
correlations to be sure they are not statistically redundant. Redundant variables can be
eliminated. Check plots of the dependent variable versus the independent variables for
outliers.
Time-series plots — If you have any data collected over time at the same sampling
point, plot the time-series. Look for incorrect dates and possible outliers in the data
series.
Maps — If you collected any spatially dependent samples or measurements, plot the
location coordinates on a map. Have field personnel review the map to see if there are
any obvious errors. If your surveyor made a mistake, this is where you’ll find it.
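Here is one way these screening plots might look in matplotlib, assuming placeholder column names (x, y, date, result, easting, northing). None of it is report-quality, which is the point.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder file and column names; swap in your own.
df = pd.read_csv("mydata.csv", parse_dates=["date"])

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Bivariate plot: check highly correlated independent variables for redundancy.
axes[0].scatter(df["x"], df["y"], s=10)
axes[0].set(title="Bivariate check", xlabel="x", ylabel="y")

# Time-series plot: look for incorrect dates and possible outliers.
axes[1].plot(df["date"], df["result"], marker=".")
axes[1].set(title="Time-series check", xlabel="date", ylabel="result")

# Map: plot sample coordinates so field personnel can spot obvious errors.
axes[2].scatter(df["easting"], df["northing"], s=10)
axes[2].set(title="Location check", xlabel="easting", ylabel="northing")

fig.tight_layout()
plt.show()
```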

The Right Data for the Job


As it turns out, you’ll probably have to rebuild the dataset more than once, especially if your
analysis is complex. You’ll add and discard variables more times than you can count (especially
if you don’t document it). Plus, different analyses within a single project can require different
data structures. Graphics especially are prone to requiring some special format to get the picture
the way you want it. That’s just a part of the game of data analysis. It’s like mowing the lawn.
You’re only finished temporarily.
When you do update a dataset, it usually involves creating new variables but not samples. Data for the existing samples might change; for example, you might average replicates, fill in values for missing and censored data, or accommodate outliers, but you normally won't add data from
new samples. This is because the original samples are your best guess at a representation of the
population. Change the samples and you change what you think the population is like. That
might not sound too bad, but from a practical view, there are consequences. If you change
samples you may have to redo a lot of data scrubbing. The new samples may affect outliers,
linear relationships, and so on. It’s not just the new data you have to examine, it’s also how all
the data, new and old, fit together. You have to revisit all those data scrubbing steps.
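As one example of such an update, here is a minimal pandas sketch that averages hypothetical field replicates. The location and analyte names are made up, and notice that the set of samples stays the same; only the data for those samples change.

```python
import pandas as pd

# Hypothetical replicate measurements at the same sampling points.
df = pd.DataFrame({
    "location": ["MW-1", "MW-1", "MW-2", "MW-2", "MW-2"],
    "analyte":  ["lead"] * 5,
    "conc":     [0.010, 0.012, 0.250, 0.240, 0.260],
})

# Average replicates so each location/analyte pair keeps a single value.
averaged = df.groupby(["location", "analyte"], as_index=False)["conc"].mean()
print(averaged)
```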
Data scrubbing isn’t a standardized process. Some data analysts
do a painstakingly thorough job, looking at each variable and
sample to ensure they are appropriate for the analysis. Others just
remove systematic problems with automated queries and filters,
preferring to let the original data speak for themselves. That’s
why it’s important to document everything you do. You may
have to justify your data scrubbing decisions to a client or a
reviewer. You may also need the documentation for contract
modifications, presentations, and any similar projects you plan to
pursue.
And that's why it takes so long and costs so much to prepare a dataset for a statistical analysis. Be prepared: it will probably consume the majority of your project budget and schedule, but you have to do it so that your analysis isn't just garbage out.

[Cartoon caption: Scrubbing takes time but you have to do it.]

Join the Stats with Cats group on Facebook.

http://statswithcats.wordpress.com/2010/10/17/the-data-scrub-3/
