
Part X

Code Basics & Data Manipulation with R

Literature:
Wickham & Grolemund, R for Data Science, ch. 3, 16

Code basics

Very basics:
• The simple calculator: 5 / 200 * 30
• Type pi and press enter
Functions:
• Functions are written as function(<value>)
• Call a built-in function: sin(pi/2) or log(2.718)
Objects:
• Create objects with x <- 1 and watch your global environment window
• Inspect an object by typing its name, e.g. x
• Create a vector with v <- c(1, 5)
• Create a vector with v2 <- c('r', 'f')
• Inspect the vectors
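The basics above can be tried directly at the R console; a minimal sketch:

```r
# Arithmetic, a built-in constant, built-in functions
5 / 200 * 30        # 0.75
pi                  # 3.141593
sin(pi / 2)         # 1

# Objects: assignment with <-, vectors with c()
x  <- 1
v  <- c(1, 5)       # numeric vector
v2 <- c("r", "f")   # character vector
x; v; v2            # typing a name prints (inspects) the object
```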
Lectures in Data Mining Winter 2018 44
Code basics

• Create a vector with v2 <- c('r', 'f')


• Create a vector with v <- c(1, 5)

• Create a vector v <- seq(1, 5)


• Create a vector v2 <- seq(1, 10, length.out = 5)
• Inspect the vector
• Try (v <- seq(1, 10, length.out = 5)). What happens?

• Inspect the third element of vector v with v[3]
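A runnable sketch of the seq() variants and indexing above:

```r
v  <- seq(1, 5)                     # 1 2 3 4 5
v2 <- seq(1, 10, length.out = 5)    # 1.00 3.25 5.50 7.75 10.00
(v3 <- seq(1, 10, length.out = 5))  # wrapping in ( ) assigns AND prints
v[3]                                # third element: 3
```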



Code basics
- useful functions -

Compute a lagged version of a time series, shifting the time base back by
a given number of observations:
lag() and lead()

(x <- seq(1, 10))


lag(x)

Drawbacks:
• They need exactly one sort variable, or an already sorted data set.
• They do not support partitioning.
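A short sketch of the shift functions; this assumes the dplyr versions of lag() and lead(), which work on plain vectors:

```r
library(dplyr)  # provides lag() and lead() for vectors

(x <- seq(1, 10))
lag(x)          # NA 1 2 ... 9   shifted back by one, NA fills the gap
lead(x)         # 2 3 ... 10 NA  shifted forward by one
lag(x, n = 2)   # shift back by two observations
```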



Code basics
- remember tibbles -

tibble( seq(1, 10), seq(10, 1) )


tibble(a = seq(1,10), b = a*a, c = b - lag(b))

Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(),
cummax(), and cummean(); see the RcppRoll package for rolling windows.

tibble(a = sample(1:10, 20, replace = TRUE), b = cumsum(a))

Rankings: see min_rank(), dense_rank(), percent_rank()
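A runnable sketch of the cumulative and ranking functions; it assumes dplyr is loaded (for tibble(), cummean(), and the rank functions), and note that sampling 20 values from 1:10 needs replace = TRUE:

```r
library(dplyr)

# cumulative aggregates inside a tibble
tibble(a = sample(1:10, 20, replace = TRUE),
       b = cumsum(a),
       c = cummean(a))

# ranking functions on a vector with ties
y <- c(10, 5, 1, 5, 5)
min_rank(y)      # 5 2 1 2 2  ties share the lowest rank, gaps follow
dense_rank(y)    # 3 2 1 2 2  no gaps after ties
percent_rank(y)  # min_rank rescaled to [0, 1]
```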



Code basics
- tibbles and cross table -

Columns in tables can be addressed by using <TABLENAME>$<COLUMN>

Try:
tibble(mpg$manufacturer, mpg$drv)
tibble(a = sample(1:10, 20, replace = TRUE), b = cumsum(a))$b

What is a cross table?


A cross table is a two-way table consisting of columns and rows. It is also
known as a pivot table or a multi-dimensional table. Its greatest strength
is its ability to structure, summarize and display large amounts of data.
Cross tables can also be used to determine whether there is a relation
between the row variable and the column variable or not.



Code basics
- cross table -

table() uses the cross-classifying factors to build a contingency table of
the counts at each combination of factor levels.

Try:
table(mpg$manufacturer)
table(mpg$manufacturer, mpg$drv)
table(mpg$manufacturer, mpg$drv, mpg$cyl)

Exercise:
table(mpg$manufacturer, front = mpg$drv == 'f')

Try out and explain!
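A sketch of the table() calls above; it assumes the mpg data set from ggplot2. Naming an argument labels that dimension, and a comparison like mpg$drv == 'f' turns it into a FALSE/TRUE factor:

```r
library(ggplot2)  # mpg data set

table(mpg$drv)                    # one-way counts per drive train
table(mpg$manufacturer, mpg$drv)  # two-way cross table
# named second dimension built from a logical condition
table(mpg$manufacturer, front = mpg$drv == "f")
```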



Data manipulation in R
with sqldf (SELECT)

• Install the package and load the library with:


install.packages("sqldf")

library(sqldf)

• Remember: Install a package once but load the library


each time you restart R.
• Test:
sqldf("select manufacturer, cyl, hwy from mpg limit 30")

• The RODBC library allows connecting R to an Oracle database.



The dplyr package

• Great for data exploration and transformation


• Intuitive to write and easy to read, especially when using the
“pipelining” syntax (covered below)
• Fast on data frames

dplyr functionality
• Five basic verbs: filter, select, arrange, mutate, summarize (plus
group_by)
• Can work with data stored in databases and data tables
• Joins: inner join, left join, semi-join, anti-join
• Window functions for calculating ranking, offsets, and more



The dplyr package
- filter -

Similar syntax of all dplyr functions:


• The first argument is the data set
• Subsequent arguments describe what to do with the data set
• Functions return a modified data set

General syntax for filter():

filter( data = #DATA, <CONDITION>, <MORE_CONDITIONS>)

filter(mpg, cyl == 6)
filter(mpg, cyl == 6, drv == 'r')



The dplyr package
- filter -

• filter() uses conditions:
– Equals (==)
– Larger (>) and larger or equal (>=)
– Smaller (<) and smaller or equal (<=)
• Comparisons with NA always return NA (neither FALSE nor TRUE)
– Try NA == NA
– Try is.na(NA)
• Conditions may use logical operators and brackets
– And: &, Or: |, Not: !
– In: %in% <VECTOR>
• Try using condition & instead of ,
• Replace & by | (or) and look what happens.
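A runnable sketch of the condition operators; it assumes dplyr and the mpg data set from ggplot2:

```r
library(dplyr)
library(ggplot2)  # mpg

filter(mpg, cyl == 6 & drv == "r")               # & behaves like the comma
filter(mpg, cyl == 6 | drv == "r")               # OR: either condition holds
filter(mpg, manufacturer %in% c("audi", "ford")) # set membership

NA == NA    # NA: a comparison with NA is unknown
is.na(NA)   # TRUE: the correct way to test for NA
```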



The dplyr package
- arrange -

• arrange() sorts the data set by one or more attributes.

• General syntax for arrange():

arrange( data = #DATA, <ATTRIBUTE>, <MORE_ATTRIBUTES>)

• Use desc(<ATTRIBUTE>) to sort in descending order (ascending is the
default).

arrange(mpg, hwy)
arrange(mpg, desc(hwy))



The dplyr package
- select -

• select() takes a subset of a data set by specifying the desired
attributes (columns).
• General syntax of select():

select( data = #DATA, <ATTRIBUTE1>, <ATTRIBUTES2>, …)

• To find columns in tables with many attributes, use the helper functions
starts_with(), contains(), or ends_with() to define the subset.

select(mpg, 1, 2, 9)
select(mpg, manufacturer, model, hwy)
select(mpg, "manufacturer", "model", "hwy")
select(mpg, starts_with("m"), hwy)



The dplyr package
- select -

• select() takes a subset of a data set by specifying the desired
attributes (columns).
• General syntax of select():

select( data = #DATA, <ATTRIBUTE_SET> …)

• To select several consecutive columns at once, use the colon in
select() to define the subset.

select(mpg, 1 : 4)
select(mpg, manufacturer : year)



The dplyr package
- select -

• select() can alias attributes by specifying new attribute names.


• General syntax of select():
select( data = #DATA, <NEW_NAME> = <ATTRIBUTE1>, …)
• To keep the remaining columns in tables use the function
everything().
• To remove a column you can use minus (- <ATTRIBUTE>)

Try the following statements and look what happens:


select(mpg, maker = manufacturer, model, hwy)
select(mpg, maker = manufacturer, everything())
select(mpg, maker = manufacturer, everything(), category = class)
select(mpg, maker = manufacturer, everything(), - cyl)
The dplyr package
- select and rename-

• rename() also aliases attributes by specifying new attribute names,
and it keeps all remaining columns.

• General syntax of rename():


rename( data = #DATA, <NEW_NAME> = <ATTRIBUTE1>, …)

Try and look what happens:

rename(mpg, maker = manufacturer)



The dplyr package
- mutate -

• Before we start let’s switch the example and prepare the data:

install.packages("nycflights13")
library(nycflights13)
flights

(flights_sml <- select(flights,


year:day,
ends_with("delay"),
distance, air_time
))



The dplyr package
- mutate -

• mutate() adds attributes (columns) that are functions of existing
attributes to the end of the data set.

• General syntax of mutate():


mutate( data = #DATA, <NEW_NAME> = <FUNCTION>, …)

Try:
mutate( flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60)

Arithmetic operations: +, -, *, /, ^
Try transmute with the same parameters and look what happens.
The dplyr package
- mutate -

• mutate() allows reusing newly created variables:


mutate( flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain/hours)



The dplyr package
- mutate functions -

Modular arithmetic:
• %/% - integer division
• %% - remainder
mutate( flights_sml, air_time,
air_hours = air_time %/% 60,
air_mins = air_time %% 60)
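A self-contained sketch of the two modular operators, using a few made-up air times instead of flights_sml:

```r
air_time <- c(45, 90, 227)   # minutes
air_time %/% 60              # hours:   0 1 3
air_time %%  60              # minutes: 45 30 47
```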



The dplyr package
- Grouping and summarize -

• summarize() collapses a data set to a single row.
• summarize() can be paired with group_by(); the aggregation
functions then operate per group of the grouped data set
(here: #DATA)
#NEW_DATA <- group_by( data = #DATA, <ATTRIBUTES>)
summarize( #NEW_DATA,
<NEW_NAME> = <AGGR_FUNCTION>, …)

by_dest <- group_by(flights, dest)


delay <- summarize( by_dest, count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = T))
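A self-contained variant of the grouping idea above, using a small made-up tibble instead of flights so it runs without nycflights13:

```r
library(dplyr)

# tiny stand-in for the flights example
d <- tibble(dest  = c("A", "A", "B", "B", "B"),
            delay = c(10, NA, 5, 15, 10))

by_dest <- group_by(d, dest)
(delays <- summarize(by_dest,
                     count      = n(),
                     mean_delay = mean(delay, na.rm = TRUE)))
# one row per group; na.rm = TRUE ignores the missing delay in group A
```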



The dplyr package
- Grouping and summarize -

• Let’s visualize:

by_dest <- group_by(flights, dest)


delay <- summarize( by_dest, count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = T))
delay <- filter( delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +


geom_point(aes(size = count), alpha = 1/2) +
geom_smooth(se = FALSE)



The dplyr package
- Piping -

• Instead of creating intermediate data sets, it is possible to send data
from one command to the next by using %>%.
#NEW_DATA <- <FUNCTION>(#DATA, <ATTRIBUTES>) %>%
<NEXT_FUNCTION>(<ATTRIBUTES>)

Applied to our example:


delay <- group_by( flights, dest) %>%
summarize( count = n(),
dist = mean(distance, na.rm = T),
delay = mean(arr_delay, na.rm = T)) %>%
filter( count > 20, dest != "HNL")



The dplyr package
- Piping -

• Alternatively, it is possible to move the filter from the summary to the
input data:
delay <- filter(flights, !is.na(arr_delay)) %>%
group_by(dest) %>%
summarize( count = n(),
dist = mean(distance, na.rm = T),
delay = mean(arr_delay, na.rm = T)) %>%
filter( count > 20, dest != "HNL")

• To count only non-empty values, use
count = sum(!is.na(arr_delay))



The dplyr package
- Piping -

• To make adding commands less work, start with the data set and pipe
it to the first command by using %>%.

Applied to our example:


delay <- flights %>%
filter(!is.na(arr_delay)) %>%
group_by(dest) %>%
summarize( count = n(),
dist = mean(distance),
delay = mean(arr_delay)) %>%
filter( count > 20, dest != "HNL")



The dplyr package
- More functions -

• Some other functions:


– lag() and lead() return the preceding or following values within the data set
flights %>%
group_by(month) %>%
summarise(flight_count = n()) %>%
arrange(month) %>%
mutate(change = flight_count - lag(flight_count))
– sample_n(#x) returns #x randomly sampled rows from
the data set.
– sample_frac(#frac) returns #frac * (number of rows) rows
from the data set. The default is sampling without replacement.

flights %>% sample_n(10)

flights %>% sample_frac(0.01, replace = TRUE)



Power Supplier
- Switch Process -

[Flow diagram: the parties are the Energy Grid, the New Supplier, the
Current Supplier, and the Customer (you). The labelled steps are: request
delivery and declare consumption; announce delivery; request
deregistration; check deregistration request; answer deregistration;
confirm deregistration; confirm registration; consumption correction or
confirmation of declaration; confirm delivery.]


Exercise or Homework
What data is available

• SWITCHING_REASON: The reason of the new contract: Move into a new home (MOVE_IN)
or change of supplier (SUPPLIER_SWITCH)
• DECLARATION_CONSUMP_CUST: The energy consumption the customer expects to use
and thus declares to the new supplier.
• DECLARATION_CONSUMP_GRID: The grid has its own estimate of the energy
consumption. If it differs (significantly) from the value above, the supplier
receives it in the delivery confirmation.
• CORRECTIVE_VALUE: Pre-calculated by your lecturer as “hypothetical” correction of
consumption if the grid is correct. (to make it easier for you)

• MONTH_BILLING: The month when the customer received its first bill from the new
supplier.
• DAYS_INVOICED: Number of days between delivery start and billing.
• CONSUMPTION_INVOICED: Consumption as invoiced in the first bill of the new supplier.
• READ_OUT_CAT: The first-year consumption can have different sources. Depending on the
situation it is a value from a meter reading or an estimation (see categories).
• BILLING_TYPE: 'Final' if the customer quits, otherwise 'rotational'.
Exercise or homework

• Import the data set HH_SAMPLE_POWER_CONSUMPTION_GRID_A.csv
into R Studio
• Download the zip file from moodle
(topic "Visualisation" → "Examples")
and extract it to a local folder
• library(readr)
consumption <- read_delim(
"C:/PATH_TO_FILE/HH_SAMPLE_POWER_CONSUMPTION_GRID_A.csv",
";", na = "empty", trim_ws = TRUE)



Exercise or Homework
Rename the dataset and the columns

• Create a new table or tibble with the name hh_spc:


– month_billing = MONTH_BILLING
– swt_reason = SWITCHING_REASON
– cust_decl_kwh = DECLARATION_CONSUMP_CUST
– grid_decl_kwh = DECLARATION_CONSUMP_CUST + CORRECTIVE_VALUE
– bill = BILLING_TYPE
– inv_kwh_365 = CONSUMPTION_INVOICED * 365 / DAYS_INVOICED
• Use round function to cut decimals

• Limit the selection to:
– inv_kwh_365 < 10000
– cust_decl_kwh < 7500
– cust_decl_kwh > 0
• Order by month_billing
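A possible skeleton for this transformation. It uses a tiny made-up stand-in for the CSV so the sketch runs on its own; in the exercise, consumption comes from read_delim() as shown on the import slide:

```r
library(dplyr)

# made-up stand-in for the CSV (hypothetical values, same column names)
consumption <- tibble(
  MONTH_BILLING            = c(3, 1, 2),
  SWITCHING_REASON         = c("MOVE_IN", "SUPPLIER_SWITCH", "MOVE_IN"),
  DECLARATION_CONSUMP_CUST = c(2500, 3000, 9000),
  CORRECTIVE_VALUE         = c(100, -200, 0),
  BILLING_TYPE             = c("rotational", "final", "rotational"),
  CONSUMPTION_INVOICED     = c(1200, 2900, 8000),
  DAYS_INVOICED            = c(180, 365, 300))

hh_spc <- consumption %>%
  transmute(month_billing = MONTH_BILLING,
            swt_reason    = SWITCHING_REASON,
            cust_decl_kwh = DECLARATION_CONSUMP_CUST,
            grid_decl_kwh = DECLARATION_CONSUMP_CUST + CORRECTIVE_VALUE,
            bill          = BILLING_TYPE,
            # scale to a 365-day year, round() cuts the decimals
            inv_kwh_365   = round(CONSUMPTION_INVOICED * 365 / DAYS_INVOICED)) %>%
  filter(inv_kwh_365 < 10000,
         cust_decl_kwh < 7500,
         cust_decl_kwh > 0) %>%
  arrange(month_billing)
hh_spc
```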



Exercise or Homework

• Plot a sample of 400 of these points, including a smoothing function


– Compare customers’ declared consumption and (invoiced) 365 day consumption
– Separate switching reason by color
– Use se = FALSE for the smoothing

• Create a boxplot with switching reason and 365 day consumption by


using a 10% sample.
(Hint: If it does not work change x and y attributes)

