Sei sulla pagina 1di 2

Set Sail

When the Titanic sank, 1502 of the 2224 passengers and crew got killed. One of the
main reasons for this high level of casualties was the lack of lifeboats on this
supposedly unsinkable ship.

Those that have seen the movie know that some individuals were more likely to
survive the sinking (lucky Rose) than others (poor Jack). In this course you wil apply
machine learning techniques to predict a passenger's chance of surviving using R.

Let's start with loading in the training and testing set into your R environment. You
will use the training set to build your model, and the test set to validate it. The data is
stored on the web as CSV files; their URLs are already available as character strings
in the sample code. You can load this data with the read.csv() function.

 Use read.csv() on train_url to create train. This is the train data.


 Use read.csv() on test_url to create test. This is the test data.
 Print out train and test to the console to have a look.

Understanding your data


Before starting with the actual analysis, it's important to understand the structure of
your data. Both test and train are data frames, R's way of representing a dataset.
You can easily explore a data frame using the str() function. str() gives you
information such as the data types in the data frame (e.g. int for integer), the
number of observations, and the number of variables.

The training and test set that you've imported before are already available in the
workspace, as train and test. Which of the following statements is correct?
Rose vs Jack, or Female vs Male
How many people in your training set survived the disaster with the Titanic? To see
this, you can use the table() command in combination with the $-operator to select
a single column of a data frame:

# absolute numbers
table(train$Survived)

# proportions
prop.table(table(train$Survived))

If you run these commands in the console, you'll see that 549 individuals died (62%)
and 342 survived (38%). A simple prediction heuristic could thus be "majority wins":
you predict every unseen observation to not survive.

In general, the table() command can help you to explore which variables have


predictive value. For example, maybe gender could play a role as well? For a two-
way comparison, also including gender, you can use

table(train$Sex, train$Survived)

To get proportions, you can again wrap prop.table() around table(), but you'll


have to specify whether you want row-wise or column-wise proportions: This is done
by setting the second argument of prop.table(), called margin, to 1 or 2,
respectively.

Potrebbero piacerti anche