Faculty of Economics and Business Administration
Department of Accounting, Information Systems and Statistics
Data Analysis & Data Science with R
By Marin Fotache
Introduction to Data Processing
Why Do We Need Data Processing?
Data is gathered from different sources:
- Databases (PostgreSQL, Oracle, MongoDB, etc.)
- Spreadsheets (MS Excel, Google Spreadsheets)
- HTML/XML tables, web pages
- Other statistical packages
It is extremely rare that imported data is ready for analysis (a few of these problems are sketched in R after this list):
- There are missing values (NAs)
- Data was gathered with errors
- Variables must be renamed, recoded, or removed
- Data are spread among many datasets
- Aggregation is needed
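A minimal sketch of how some of these problems surface in R; the data frame raw_df and its columns are invented for illustration:

# hypothetical imported data frame showing typical problems
raw_df <- data.frame(
  prod_code = c(194392, 194392, NA, 200117),
  prod_name = c("white socks", "White Socks", "black socks", NA),
  price     = c("9.99", "9,99", "12.50", "11.00"),  # numbers read as strings
  stringsAsFactors = FALSE
)
colSums(is.na(raw_df))         # missing values (NAs) per column
str(raw_df)                    # spot variables imported with the wrong type
names(raw_df)[2] <- "product"  # renaming a variable
# recoding: fix the decimal comma, then convert to numeric
raw_df$price <- as.numeric(sub(",", ".", raw_df$price, fixed = TRUE))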
According to some authors, 80% of data analysis time is spent on data preparation/processing.
Topics covered
- Other basic functions (in base R and in packages) for dealing with NAs, NULLs, strings, dates, etc.
- Options for renaming, recoding and removing variables in data frames
- Subsetting
- Data bind and join
- Aggregation
- Reshaping
- Basics of programming in R
- Group operations: the "apply" family, packages "plyr" and "dplyr" (a small taste follows this list)
- Case studies
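As a small taste of the group operations listed above, a hedged sketch using the built-in mtcars data set (the dplyr package must be installed):

library(dplyr)

# base R "apply" family: mean mpg for each number of cylinders
tapply(mtcars$mpg, mtcars$cyl, mean)

# the same aggregation, dplyr style
mtcars %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg))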
Data cleaning best practices
Source: http://www.quora.com/What-are-some-of-the-best-practices-for-data-cleaning (Leon Polovets)
- Sort the data by different attributes. Negative numbers, strings that start with obscure symbols, and other outliers will often appear at the top of your dataset after a sort.
- Look at summary statistics (mean, standard deviation, number of missing values, etc.) for each column. These can be used to quickly zero in on the most common problems. (Both practices are sketched in R after this list.)
- Keep track of every cleaning operation you do so that you can modify, repeat, or remove operations as necessary.
- For large, messy datasets, reaching 100% cleanliness is next to impossible. If you have a list of cleanliness issues, sort them by estimated frequency and attack the most common problems until your dataset is good enough.
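The first two practices translate into a few lines of R; a minimal sketch, where df is a hypothetical data frame:

# hypothetical data with a suspicious negative value and a missing one
df <- data.frame(amount = c(120, -3, 45, NA, 78))

# sort by an attribute: outliers surface at the top
df[order(df$amount), , drop = FALSE]

# summary statistics per column (summary() also counts NAs)
summary(df)
sd(df$amount, na.rm = TRUE)  # standard deviation, ignoring NAs
colSums(is.na(df))           # number of missing values per column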
Data cleaning best practices (cont.)
Source: http://www.quora.com/What-are-some-of-the-best-practices-for-data-cleaning (Leon Polovets)
- You can also use sampling to test data quality. For example, the acceptance test for your dataset could be something like, "If you pick 250 random rows, no more than 5 of them should have formatting issues." (A possible R version follows this list.)
- Tools like OpenRefine can save you a lot of time by performing many simple cleanings with very little user input.
- Become an expert at using regular expressions.
- Create a set of utility functions/scripts/tools to handle common cleaning tasks. These might include: regex search-and-replace, remapping values based on a SQL database or CSV file, blanking out all values that don't match a regex or an allowed list of values, deleting rows that do/don't match some filter, and so on.
- If you find yourself repeating several steps over and over, write a script to handle all of the steps at once.
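A hedged R sketch of the sampling acceptance test and one regex-based utility; the column name order_date and the date pattern are assumptions:

set.seed(42)
# hypothetical data: dates stored as text, a few of them malformed
orders <- data.frame(
  order_date = c("2023-01-15", "15/01/2023", "2023-02-30x",
                 rep("2023-03-01", 297)),
  stringsAsFactors = FALSE
)

# acceptance test: in 250 random rows, at most 5 may fail the format check
sample_rows <- orders[sample(nrow(orders), 250), , drop = FALSE]
bad <- !grepl("^\\d{4}-\\d{2}-\\d{2}$", sample_rows$order_date)
sum(bad) <= 5

# reusable utility: blank out values that do not match an allowed pattern
blank_non_matching <- function(x, pattern) ifelse(grepl(pattern, x), x, NA)
orders$order_date <- blank_non_matching(orders$order_date, "^\\d{4}-\\d{2}-\\d{2}$")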
Data cleaning best practices (cont.)
- Check how values were imported: numbers may have been read as strings, dates as ranges (for example 1-10 instead of 10/1).
- Inconsistencies. For example, say you have a list of product code numbers and the names of the products (194392: white socks). You should check that you don't have any other product names associated with 194392, and no other product code for white socks.
- Also, when adding labels to records based on a certain value (e.g. Small, Medium, Large for company size, based on the number of employees), avoid the practice of using an if statement with logic such as "if none of the previous n-1 labels, then it's the nth label". That would be a catch-all that leads to mistakes. (Both checks are sketched in R below.)
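A sketch of both checks in R; the product table and the size thresholds are invented for illustration:

# hypothetical product table with an inconsistency: two names for one code
products <- data.frame(
  code = c(194392, 194392, 205511),
  name = c("white socks", "White Socks", "black socks"),
  stringsAsFactors = FALSE
)

# codes with more than one distinct name (and vice versa) are suspects
tapply(products$name, products$code, function(x) length(unique(x)))
tapply(products$code, products$name, function(x) length(unique(x)))

# explicit conditions for every label instead of a catch-all "else" branch
label_size <- function(employees) {
  ifelse(employees < 50,  "Small",
  ifelse(employees < 250, "Medium",
  ifelse(employees >= 250, "Large", NA)))  # unmatched values stay NA
}
label_size(c(10, 120, 800, NA))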