
BASIC DATA EXPLORATION IN R

The most common type of dataset in R is a data frame. A data frame typically consists of a
set of observations (as the rows) and variables (as the columns). Data files in most common
formats can be read into R as a data frame, and all kinds of data analysis can then be
conducted on the data frame itself.

Before conducting any data analysis like correlation, regression or advanced time series
analysis, it is always a good idea to explore the dataset. Now, what is meant by exploring
the dataset? It can be likened to watching a movie. Before watching the movie, we
are interested in a brief sketch of the plot, the characters, the actors, the makers etc. in
order to know better what to expect from it. We like to see the trailers, read the
reviews etc. before actually going to watch the movie. Similarly, before actually
working on the dataset by applying various statistical, econometric or even simple
arithmetic techniques, we would like to know our data better: what are we actually
working on? What does it look like, what are the key variables, and what are the
characteristics of those variables? In the same way, before jumping into the river to swim,
we feel it with our hands or toes to see how cold the water is.

This is data exploration: getting a feel for what you are going to work with, and getting to
know it better before working on it.

Right, let's get right down to how we go about exploring our data frame at a very basic
level.

1. The first thing we have to do is to load the package containing the dataset. There are
some datasets which come pre-loaded with R, so you can straightaway get on with
them, but many datasets come with certain packages. What is meant by loading
a package? Simply put, loading a package makes its functions and datasets available
in your current R session.

To load a package into R (after you have downloaded and installed it), type the
following command in your console:

library(package name)

So, let's say you want to load the package datasets. You would type:

library(datasets)

R will not show you any output after this command when it succeeds; it knows that the
package has been loaded, and so should you!
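
Since R stays silent on success, it can be reassuring to check that the package really is attached. The sketch below is one way to do that. Note that the datasets package ships with R and is normally attached at startup, so the library() call is mostly a formality for it; for an add-on package it is the step that matters. The search() check and the listing of available datasets are additions for illustration, not part of the steps above.

```r
# Attach the datasets package (normally already attached in a standard R session).
library(datasets)

# search() lists attached packages and environments; an attached package
# appears as "package:<name>".
is_attached <- "package:datasets" %in% search()
print(is_attached)

# data(package = ...) describes the datasets a package provides.
# Its $results component is a character matrix with "Item" and "Title" columns.
available <- data(package = "datasets")$results[, c("Item", "Title")]
print(head(available))
```

If is_attached prints TRUE, the datasets in the package can be used directly by name.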

2. Right, now that you have loaded the package, time to watch the trailer. You would
want to know the basic structure of your dataset, that is, how many observations
(rows) and how many variables (columns) it contains. For this,
type:

dim(data frame name)

Let us say we want to explore the structure of the dataset called longley contained
in the package datasets. longley is a data frame with 7 macroeconomic variables
like GNP, number of unemployed etc. observed yearly from 1947 to 1962. So we type:

dim(longley)

What do you get?

[1] 16  7

TA DA! In one stroke, R has told us that the longley dataset contains 16 observations and
7 variables. So remember the sequence: first observations, then variables.
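
A short sketch of the same step, for anyone who wants the two numbers separately. nrow() and ncol() are standard base R functions; the variable names size, n_obs and n_vars are just illustrative choices.

```r
library(datasets)

# dim() returns an integer vector: c(number of rows, number of columns).
size <- dim(longley)
print(size)

# nrow() and ncol() return the two pieces one at a time.
n_obs  <- nrow(longley)  # observations (rows)
n_vars <- ncol(longley)  # variables (columns)
cat("observations:", n_obs, " variables:", n_vars, "\n")
```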

3. Right, now that we know the height and width of the dataset, we want
to see its face and its feet. What did I just say? Well, longley isn't that, well, long, but
there are big datasets out there (after all, we are dealing with big data!), so it is
not practical for R to display the whole of the data in one frame. What you would
want to do (particularly with such big data frames) is to look at the first few rows and
the last few rows of the dataset to see what it actually looks like.

To see the face or the first few rows of the dataset, type:

head(data frame name)

and to see the feet or the last few rows of the dataset, type:

tail(data frame name)

Continuing with our beloved longley,

head(longley)

gives you the first six rows (the years 1947 to 1952), and

tail(longley)

gives you the last six rows (the years 1957 to 1962).
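
Both functions show six rows by default, but that can be changed. The sketch below uses the n argument of head() and tail(), which is part of base R; the choice of 3 and -10 is arbitrary and only for illustration.

```r
library(datasets)

# Ask for a specific number of rows with n.
first_three <- head(longley, n = 3)
last_three  <- tail(longley, n = 3)
print(first_three)
print(last_three)

# A negative n drops rows instead: here, everything except the last 10 rows.
all_but_last_ten <- head(longley, n = -10)
print(nrow(all_but_last_ten))
```

On a really big data frame, a small n keeps the console output readable.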

4. Sometimes you would just want to see the names of the variables (columns) in the
data. For that, you type

names(data frame name)

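So names(longley) gives us the seven column names: GNP.deflator, GNP, Unemployed,
Armed.Forces, Population, Year and Employed.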
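
Once the names are known, they can be used to pick out individual columns. A minimal sketch, assuming we are interested in the Employed column (an arbitrary choice for illustration):

```r
library(datasets)

# names() returns a character vector of column names.
vars <- names(longley)
print(vars)

# For a data frame, colnames() returns the same vector.
print(identical(vars, colnames(longley)))

# A name can be used to extract a single column as a vector.
employed <- longley[["Employed"]]
print(head(employed))
```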

5. Now that we know the names of the variables, we would like to know more about
the nature of these variables, because they are the key elements of the data frame
that we will have to work with. We have two commands for it:

summary(data frame name) gives us simple descriptive statistics such as the mean,
median, minimum and maximum (along with the first and third quartiles) for EACH
numeric variable. These stats tell you about the nature of the variable, which goes a
long way in understanding your data and applying the appropriate technique of analysis.

str(data frame name) gives us the structure of the data, i.e. the number of
observations and variables and the type of each variable (integer, numeric or
character/string), and it also gives us a glimpse of the actual values in each variable.

Let us see the results of applying both these commands to longley:

summary(longley)

As promised, we can see summary descriptive statistics that, well, describe our variables
and give us a hint of their nature.
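
To see where the numbers in the summary come from, one can reproduce them for a single column. This is only a sketch; the Employed column is an arbitrary choice, and min(), median(), mean() and max() are the usual base R functions.

```r
library(datasets)

employed <- longley$Employed

# summary() on one numeric column reports:
# Min., 1st Qu., Median, Mean, 3rd Qu., Max.
col_summary <- summary(employed)
print(col_summary)

# The same quantities computed one by one.
by_hand <- c(min = min(employed), median = median(employed),
             mean = mean(employed), max = max(employed))
print(by_hand)
```

The printed summary may show fewer digits than the individual functions, since summary output is formatted for display.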

str(longley)

The first column gives the name of each variable, the second its type and the third
the first few actual observations. The first row tells us that our dataset is a data frame
and that it contains 16 observations and 7 variables. These commands can also be used on
other types of objects, such as matrices, and they give results accordingly.
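
The sketch below collects the variable types into a single vector and then tries the same commands on a matrix. Converting longley with as.matrix() is just a convenient way to get a matrix for illustration; sapply() and class() are standard base R functions.

```r
library(datasets)

# Compact structure: one line per variable.
str(longley)

# The type of each column as a named character vector.
col_types <- sapply(longley, class)
print(col_types)

# The same commands on a matrix (here, longley converted to a numeric matrix).
m <- as.matrix(longley)
print(dim(m))
str(m)
print(summary(m[, 1:2]))  # summary of the first two columns only
```

For a matrix, str() reports a single type for the whole object rather than one per column, because all elements of a matrix share the same type.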

Hope you liked learning the simple ways of exploring your data frame. Do try these on other
datasets for practice. Happy R-ing!

Visit www.ceekh.com
