
DATA SCIENCE

TRAINING

Prepared by:
Ms. KATRINA D. ELIZON
Faculty Member
Department of Mathematics and Statistics
WHY DATA SCIENCE?
By Thomas H. Davenport and D.J. Patil
WHAT IS DATA
SCIENCE?
DATA SCIENCE

Data Science is a blend of various tools, algorithms, and machine
learning principles with the goal of discovering hidden patterns in
raw data.
WHY DO WE NEED DATA
SCIENCE?

https://www.edureka.co/blog/what-is-data-science
REQUIRED SKILLS FOR A
DATA SCIENTIST

https://www.edureka.co/blog/what-is-data-science
DATAFICATION

Datafication is a modern
technological trend turning many
aspects of our life into data,
which is subsequently transformed
into information realised as a
new form of value.
Basics
CONSOLE PANE
The console is the
heart of RStudio. You
can type commands
directly into the
console whenever you
see the flashing cursor.
Output and error
messages are displayed
in the console.
R SCRIPT OR SOURCE PANE
The script or source pane is
where you can type and save
your commands and make
notes to yourself about
projects. When you run a
command from the source
pane, the command is sent
over to the console pane to be
executed. It is possible to have
multiple sources or scripts
appear in the source pane,
and they will each have their
own tab at the top of the pane.
ENVIRONMENT AND HISTORY
PANE

The environment and
history pane is where
you will see the
different objects you
create or the different
datasets you import.
FINAL PANE

The final pane
contains
everything else
including help,
plots, packages,
etc.
WHAT ARE VARIABLES?
Variables in Statistics

A variable is any characteristic,
number, or quantity that can be
measured or counted.
Variables can be classified as:
Qualitative Variables
Quantitative Variables
WHAT ARE VARIABLES?
Variables in Programming

Variables are the names you give to computer
memory locations which are used to
store values in a computer program.

They are a way of labelling data with a
descriptive name, so our programs can
be understood more clearly by the
reader and ourselves.
ASSIGNING VALUE TO
VARIABLES
Make sure that the name you assign your
variable is accurately descriptive and
understandable to another reader.

The operators for assigning a value to an object:

= or <-
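
A minimal sketch of both assignment operators. The variable names and values here are hypothetical, chosen only to illustrate descriptive naming.

```r
# Hypothetical variables for illustration.
respondent_age <- 25       # assignment with <-
respondent_name = "Ana"    # assignment with =

# Typing the name of an object prints its value.
respondent_age
respondent_name
```

Both operators work at the top level of a script; many R style guides prefer <- for assignment.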
BASIC DATA TYPES IN R
R works with numerous data types. Some of
the most basic types to get started are:

Decimal values like 3.6 are called numerics.
Whole numbers like 7 are called integers (typed as 7L in R). Integers
are also numerics.
Boolean values (TRUE or FALSE) are called logicals.
Text (or string) values are called characters.
WHAT’S THAT DATA
TYPE?

To check the data type of a
variable, use the class()
function.
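
As a small sketch, class() can be applied to one value of each basic type. The variable names are hypothetical.

```r
# One hypothetical value of each basic data type.
my_numeric   <- 3.6
my_integer   <- 7L        # the L suffix makes the value an integer
my_logical   <- TRUE
my_character <- "data science"

class(my_numeric)     # "numeric"
class(my_integer)     # "integer"
class(my_logical)     # "logical"
class(my_character)   # "character"

# Without the L suffix, a whole number is stored as a numeric.
class(7)              # "numeric"
```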
Data Structures in R
Four Basic Data Structures
defined in R:
Vector
Matrices
List
Data Frame
WHAT IS VECTOR?

A vector is a type of array that is one
dimensional.
Vectors are a basic element in
programming languages and are
used for storing data.
HOW TO CREATE A
VECTOR IN R?

In R, you create a vector with
the combine function c().

Take note that all elements must
be of the same type.
The quotation marks
indicate that “a”, “b”, “c”,
“d”, and “e” are characters.
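
A short sketch of c() with hypothetical vector names. The last line shows what happens when types are mixed: R converts every element to a common type.

```r
# Vectors of a single type, built with c().
numeric_vector   <- c(1, 10, 49)
character_vector <- c("a", "b", "c", "d", "e")
logical_vector   <- c(TRUE, FALSE, TRUE)

# Mixing types does not fail; the elements are coerced to character.
mixed_vector <- c(1, "a", TRUE)
class(mixed_vector)   # "character"
```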
If we want to create a vector of
consecutive numbers, the : operator
is very helpful.
More complex sequences can be
created using the seq( ) function,
like defining number of points in an
interval, or the step size.
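
The following sketch shows the : operator and the two seq() styles mentioned above. The variable names and ranges are assumptions for illustration.

```r
# Consecutive numbers with the : operator.
one_to_ten <- 1:10

# seq() with a step size (by).
even_numbers <- seq(2, 20, by = 2)

# seq() with a fixed number of points in an interval (length.out).
five_points <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00
```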
EXAMPLE
After one week in City of Dreams Manila and still zero
Ferraris in your garage, you decide that it is time to start
using your data analytical superpowers.
Before doing a first analysis, you decide to first collect all
the winnings and losses for the last week:

Day         Poker          Roulette
Monday      won P14,000    lost P2,400
Tuesday     lost P5,000    lost P5,000
Wednesday   won P2,000     won P10,000
Thursday    lost P12,000   lost P35,000
Friday      won P24,000    won P1,000
NAMING A VECTOR

You can give names to the
elements of a vector with the
names() function.
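
A sketch of the casino example as named vectors. The names poker_vector, roulette_vector, and days_vector are assumptions; winnings are positive and losses are negative.

```r
# Winnings (positive) and losses (negative) from Monday to Friday, in pesos.
poker_vector    <- c(14000, -5000, 2000, -12000, 24000)
roulette_vector <- c(-2400, -5000, 10000, -35000, 1000)

# Attach the day of the week to each element.
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector)    <- days_vector
names(roulette_vector) <- days_vector

poker_vector
roulette_vector
```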
Now that you have the poker and roulette winnings nicely
as named vectors, you can start doing some data analytical
magic.

You want to find out the following type of information:

How much has been your overall profit or loss per day
of the week?

Are you winning/losing money on poker or on roulette?

Have you lost money over the week in total?

To get the answers, you have to do arithmetic calculations
on vectors.
One of the simpler uses of RStudio is to use it like a calculator:

Addition +
Subtraction -
Multiplication *
Division /
Exponentiation ^
A function that helps you answer
the last question is sum(). It calculates
the sum of all elements of a vector.
There are two ways to calculate the
overall winnings: get the sum of the
total_daily vector, or add
total_poker and total_roulette.
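
The two approaches can be sketched as follows, using the figures from the example table. The vector names are assumptions.

```r
# Winnings (positive) and losses (negative) from Monday to Friday, in pesos.
poker_vector    <- c(14000, -5000, 2000, -12000, 24000)
roulette_vector <- c(-2400, -5000, 10000, -35000, 1000)

# Element-wise addition gives the profit or loss per day.
total_daily <- poker_vector + roulette_vector

# Totals per game.
total_poker    <- sum(poker_vector)      # 23000
total_roulette <- sum(roulette_vector)   # -31400

# Two equivalent ways to get the overall result for the week.
total_week_a <- sum(total_daily)               # -8400
total_week_b <- total_poker + total_roulette   # -8400
```

A negative total means money was lost over the week.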
VECTOR SELECTION

Our goal is to select specific elements
of the vector. To select elements of a
vector you can use square brackets [ ].
Between the square brackets, you
indicate what elements to select.
Another way to select elements
from a vector is by using the
names of the vector elements
instead of numeric position.
To select multiple elements from a vector,
place a vector of positions or names between
the square brackets, for example c(2, 3, 4).
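
A sketch of the three selection styles, by position, by name, and with several elements at once. The vector poker_vector is an assumed name holding the poker figures from the example.

```r
# Named vector of poker results, in pesos.
poker_vector <- c(Monday = 14000, Tuesday = -5000, Wednesday = 2000,
                  Thursday = -12000, Friday = 24000)

# Select by numeric position.
poker_wednesday <- poker_vector[3]

# Select by name.
poker_friday <- poker_vector["Friday"]

# Select several elements with a vector of positions or names.
poker_midweek   <- poker_vector[c(2, 3, 4)]
poker_start_end <- poker_vector[c("Monday", "Friday")]
```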
WHAT’S A FACTOR?
The term factor refers to a statistical data
type used to store categorical variables.
The difference between a categorical
variable and a continuous variable is that
a categorical variable can belong to a
limited number of categories. A
continuous variable, on the other hand,
can correspond to an infinite number of
values.
NOMINAL AND ORDINAL

Two types of categorical variables:


Nominal categorical variable
Ordinal categorical variable
A nominal variable is a categorical variable
without an implied order. This means that it is
impossible to say that 'one is worth more
than the other'.
In contrast, ordinal variables do have a
natural ordering.
HOW TO CREATE A
FACTOR IN R?

To create factors in R, you make use of
the function factor().

The first thing that you have to do is create a
vector that contains all the observations
that belong to a limited number of
categories.
It is clear that there are two categories,
or in R-terms 'factor levels', at work
here: Male and Female.
To create an ordered factor, you have to add two
additional arguments: ordered and levels. By setting the
argument ordered to TRUE in the function factor ( ), you
indicate that the factor is ordered. With the argument
levels you give the values of the factor in the correct order.
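
A sketch of both kinds of factor. The observations in sex_vector and temperature_vector are made up for illustration.

```r
# Nominal factor: no implied order between the categories.
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex <- factor(sex_vector)
levels(factor_sex)   # "Female" "Male" (alphabetical by default)

# Ordinal factor: ordered = TRUE plus levels listed from lowest to highest.
temperature_vector <- c("High", "Low", "High", "Low", "Medium")
factor_temperature <- factor(temperature_vector,
                             ordered = TRUE,
                             levels = c("Low", "Medium", "High"))

# Ordered factors support comparisons between elements.
factor_temperature[2] < factor_temperature[1]   # TRUE: Low < High
```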
EXAMPLE:

1. Out of 10 respondents considered in the survey, four
of them indicated that they are single, four of them
are married and only two of them are separated.
Create a factor, based on the information given.

2. I asked 10 students if they like watching television.
Seven of them answered “Most of the Time”, two of
them answered “Sometimes” and only one student
answered the question as “Hardly Ever”. Create a
factor, based on the information given.
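
One possible way to approach the two exercises is sketched below. The variable names are assumptions, and treating the television answers as ordered from "Hardly Ever" to "Most of the Time" is a modelling choice rather than something the exercise states.

```r
# Exercise 1: civil status is nominal, so a plain factor is enough.
status_vector <- c(rep("Single", 4), rep("Married", 4), rep("Separated", 2))
status_factor <- factor(status_vector)
table(status_factor)

# Exercise 2: frequency of watching television, treated here as ordinal.
tv_vector <- c(rep("Most of the Time", 7), rep("Sometimes", 2), "Hardly Ever")
tv_factor <- factor(tv_vector,
                    ordered = TRUE,
                    levels = c("Hardly Ever", "Sometimes", "Most of the Time"))
table(tv_factor)
```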
CONVERT NUMERIC TO FACTOR

cut(x, breaks, labels = NULL) divides
the range of x into intervals and codes
the values in x according to the
interval in which they fall.
Apply table() to determine the frequency.
Age
25 years old and below
26 to 30 years old
31 to 35 years old
36 to 40 years old
41 years old and above
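
A sketch of cut() producing the age groups listed above. The ages are hypothetical and assumed to be whole numbers; the breaks use -Inf and Inf so that the first and last groups are open-ended.

```r
# Hypothetical ages of eight respondents.
ages <- c(22, 25, 26, 30, 33, 38, 41, 52)

# By default each interval includes its upper break, e.g. (25, 30].
age_group <- cut(ages,
                 breaks = c(-Inf, 25, 30, 35, 40, Inf),
                 labels = c("25 years old and below",
                            "26 to 30 years old",
                            "31 to 35 years old",
                            "36 to 40 years old",
                            "41 years old and above"))

# Frequency of each age group.
age_table <- table(age_group)
age_table
```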
WHAT IS LIST IN R?
A list is an object that contains
elements of different types, such as
strings, numbers, vectors, and even
another list.

An R list can also contain a matrix
or a function as one of its elements.
HOW TO CREATE A
LIST IN R?

A list is created using the list()
function in R.
EXAMPLE:

Create a list that contains a numeric
vector (1 to 30), a sequence of numbers
from 1 to 10 with a step size of 0.5, and
a character vector that contains your
top 3 favourite subjects.
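
One possible answer, as a sketch. The element names and the three subjects are placeholders, not part of the exercise.

```r
# A list whose elements have different types and lengths.
my_list <- list(
  numbers  = 1:30,
  sequence = seq(1, 10, by = 0.5),
  subjects = c("Statistics", "Calculus", "Programming")   # placeholder subjects
)

# Elements can be accessed by name or by position.
my_list$subjects
my_list[[2]]
```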
WHAT IS MATRIX?
A matrix is a two-dimensional array of
numbers, having a fixed number of rows
and columns, and containing a number at
the intersection of each row and each
column.
By storing values in a matrix rather than as
individual variables, a program can access
and perform operations on the data more
efficiently.
HOW TO CREATE A
MATRIX IN R?

matrix(data = NA, nrow = 1, ncol = 1,
       byrow = FALSE, dimnames = NULL)
dimnames: a dimnames attribute for the matrix,
either NULL or a list of length 2 giving the
row and column names respectively.
EXAMPLE:

Create a 6 x 5 matrix that contains the
numbers from 1 to 30. Fill the matrix by
rows. Make sure that the names of the
rows are A, B, C, D, E, and F, and the
names of the columns are Blue, Red,
White, Green, and Yellow.
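
A sketch of one way to build this matrix. The name colour_matrix is an assumption.

```r
# 6 x 5 matrix filled row by row, with row and column names.
colour_matrix <- matrix(
  data = 1:30,
  nrow = 6,
  ncol = 5,
  byrow = TRUE,
  dimnames = list(
    c("A", "B", "C", "D", "E", "F"),
    c("Blue", "Red", "White", "Green", "Yellow")
  )
)

colour_matrix
```

Because byrow = TRUE, row A holds 1 to 5, row B holds 6 to 10, and so on.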
EXTRACT A ROW AND
COLUMN FROM A MATRIX
An element at the m-th row, n-th column of a
matrix can be accessed by the expression
matrixname[m,n].

The entire m-th row of a matrix can be extracted
as matrixname[m,].

Similarly, the n-th column of a matrix can be
extracted by matrixname[,n].
EXTRACT A ROW AND
COLUMN FROM A MATRIX
To extract more than one row or
column at a time, pass a vector of indices:
Multiple rows:
matrixname[c( ),]
Multiple columns:
matrixname[,c( )]
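
The indexing forms above can be sketched on a small matrix. The matrix m and its values are hypothetical.

```r
# A 3 x 4 matrix filled by rows: 1-4, 5-8, 9-12.
m <- matrix(1:12, nrow = 3, byrow = TRUE)

single_element <- m[2, 3]        # 2nd row, 3rd column: 7
second_row     <- m[2, ]         # 5 6 7 8
third_column   <- m[, 3]         # 3 7 11

rows_1_and_3    <- m[c(1, 3), ]  # two rows, all columns
columns_2_and_4 <- m[, c(2, 4)]  # all rows, two columns
```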
HOW TO CREATE A
DATA FRAME IN R?

To create a data frame, use the data.frame()
function.
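
A minimal sketch of data.frame(). The respondent data and column names are invented for illustration; each argument becomes one column, and all columns must have the same length.

```r
# Hypothetical respondent profile.
profile <- data.frame(
  name = c("Ana", "Ben", "Carla", "Dino"),
  sex  = c("Female", "Male", "Female", "Male"),
  age  = c(24, 31, 45, 38)
)

profile
```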
TO DETERMINE THE
STRUCTURE OF YOUR DATA SET
The function str() shows you the structure
of your data set.
For a data frame it tells you:
The total number of observations
The total number of variables
A full list of the variable names
The data type of each variable
The first few observations
EXTRACT A VARIABLE
FROM A DATA FRAME

If you want to extract a particular
variable from a data frame, use
dataname$variablename.
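
A sketch of str() and the $ operator on a small, made-up data frame.

```r
# Hypothetical respondent profile.
profile <- data.frame(
  name = c("Ana", "Ben", "Carla", "Dino"),
  age  = c(24, 31, 45, 38)
)

# Structure: number of observations and variables, names, types, first values.
str(profile)

# Extract a single variable as a vector.
ages <- profile$age
mean_age <- mean(ages)   # 34.5
```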
SUBSETTING A DATA
FRAME
To take a subset from a data frame, use the
subset command and assign the result to a
new data frame:
subset(dataname, condition)
THE (LOGICAL) COMPARISON
OPERATORS

< for less than
> for greater than
<= for less than or equal
>= for greater than or equal
== for equal to each other
!= not equal to each other
The operator & is used to
combine multiple conditions.
EXAMPLE:

Create a data frame for the profile
of the respondents. Include only
the male respondents that are
single and at most 40 years old,
also with highest educational
attainment of college graduate.
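
A sketch of this exercise with subset() and the & operator. The data, column names, and category labels are all hypothetical.

```r
# Hypothetical respondent profile.
profile <- data.frame(
  sex          = c("Male", "Female", "Male", "Male", "Male", "Male"),
  civil_status = c("Single", "Single", "Married", "Single", "Single", "Single"),
  age          = c(28, 35, 39, 45, 40, 40),
  education    = c("College Graduate", "College Graduate", "College Graduate",
                   "College Graduate", "High School Graduate", "College Graduate")
)

# Keep only rows that satisfy all four conditions.
selected <- subset(profile,
                   sex == "Male" &
                     civil_status == "Single" &
                     age <= 40 &
                     education == "College Graduate")

selected
```

Only the first and last rows meet every condition, so selected has two rows.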
CONDITIONS

If/else statements
In R, we can write a vectorised conditional if/else
statement as follows:
ifelse(condition on data, value returned if true,
value returned if false)
EXAMPLE:
Suppose we want to create a variable
called grades that is assigned as follows:
E for score less than or equal to 60
D for score 61 to 70
C for score 71 to 80
B for score 81 to 90
A for score at least 91
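
The grading rule can be sketched with nested ifelse() calls. The scores are hypothetical and assumed to be whole numbers.

```r
# Hypothetical scores.
score <- c(55, 60, 61, 75, 90, 91, 100)

# Each ifelse() handles one cut-off; the "no" branch passes on to the next.
grades <- ifelse(score <= 60, "E",
          ifelse(score <= 70, "D",
          ifelse(score <= 80, "C",
          ifelse(score <= 90, "B", "A"))))

grades   # "E" "E" "D" "C" "B" "A" "A"
```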
STRING
OPERATIONS
IN R
You can create strings with either
single quotes or double quotes.
Base R contains many functions to
work with strings but we’ll avoid
them because they can be
inconsistent, which makes them
hard to remember. Instead we’ll
use functions from stringr.
Apply str_length() to determine
the number of characters in each
element of a string vector.
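
A short sketch, assuming the stringr package is installed. The contents of string_vector are made up.

```r
library(stringr)

# Strings can use single or double quotes.
string_vector <- c("data", 'science', "R")

# Number of characters in each element.
string_lengths <- str_length(string_vector)   # 4 7 1
```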
EXAMPLE:

1. Out of 10 respondents considered in the survey, four
of them indicated that they are single, four of them
are married and only two of them are separated.
Create a vector, based on the information given.

2. I asked 10 students if they like watching television.
Seven of them answered “Most of the Time”, two of
them answered “Sometimes” and only one student
answered the question as “Hardly Ever”. Create a
vector, based on the information given.
To combine two or more strings,
use str_c( ).
Use the sep argument to control
how they’re separated.
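
A sketch of str_c() with and without sep, assuming the stringr package is installed. The example strings are arbitrary.

```r
library(stringr)

# Without sep, the strings are joined directly.
joined_plain <- str_c("Data", "Science")              # "DataScience"

# sep controls what is placed between the strings.
joined_space <- str_c("Data", "Science", sep = " ")   # "Data Science"
joined_comma <- str_c("poker", "roulette", sep = ", ")
```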
SUBSETTING STRINGS
To extract parts of a string use:

str_sub(string, start = 1L, end = -1L)

It takes start and end arguments
which give the position of the
substring.
To change the text to lower case
and upper case use: tolower( )
and toupper( ).
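
A sketch of substring extraction and case conversion, assuming the stringr package is installed. Negative positions in str_sub() count from the end of the string.

```r
library(stringr)

phrase <- "Data Science"

first_word <- str_sub(phrase, 1, 4)     # "Data"
last_word  <- str_sub(phrase, -7, -1)   # "Science"

lower_case <- tolower(phrase)           # "data science"
upper_case <- toupper(phrase)           # "DATA SCIENCE"
```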
COMPUTATION OF
MEAN, MEDIAN
AND MODE USING
R
Mean
It is calculated by taking the sum of the values
and dividing by the number of values in a
data series.
The function mean( ) is used to calculate this
in R.
If there are missing values, then the mean
function returns NA.
To drop the missing values from the
calculation use na.rm = TRUE, which means
remove the NA values.
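
A sketch of mean() with and without missing values. The data are hypothetical.

```r
# Hypothetical data series with one missing value.
x <- c(10, 20, 30, NA)

mean_with_na    <- mean(x)                 # NA, because of the missing value
mean_without_na <- mean(x, na.rm = TRUE)   # (10 + 20 + 30) / 3 = 20
```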
