Literature:
Wickham & Grolemund; R for Data Science; ch. 3, 16
Code basics
Very basics:
• The simple calculator: 5 / 200 * 30
• Type pi and press enter
Functions:
• Functions are written as function(<value>)
• Call a built-in function: sin(pi/2) or log(2.718)
Objects:
• Create objects with x <- 1 and watch your global environment window
• Inspect an object by typing its name, e.g. x
• Create a numeric vector with v <- c(1, 5)
• Create a character vector with v2 <- c('r', 'f')
• Inspect the vectors
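A minimal sketch of what inspecting the two vectors can look like. It uses only base R functions; which inspections to run is an assumption made here, not something the slide prescribes.

```r
# Numeric and character vectors, created as on the slide
v  <- c(1, 5)
v2 <- c('r', 'f')

# Inspect type and size
class(v)     # "numeric"
class(v2)    # "character"
length(v)    # 2

# Arithmetic is applied element by element
doubled <- v * 2        # 2 10

# Built-in functions accept whole vectors as well
roots <- sqrt(c(4, 9))  # 2 3
```

Typing a name such as doubled at the console prints the object, exactly as with x above.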
Lectures in Data Mining Winter 2018 44
Compute a lagged version of a time series, shifting the time base back by
a given number of observations:
lag() and lead()
Drawbacks:
• They need exactly one variable for sorting, or an already sorted data set.
• They do not support partitioning.
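The following sketch shows lag() and lead() from dplyr on a small toy table. The table and its values are hypothetical. The sorting variable is passed through the order_by argument, and group_by() is used as one possible workaround for the missing partitioning, which is an assumption of this example and not part of the slide.

```r
library(dplyr)

# Toy data (hypothetical values): two series, rows deliberately unsorted
sales <- tibble(
  shop  = c("A", "B", "A", "B", "A", "B"),
  month = c(2, 1, 1, 3, 3, 2),
  units = c(20, 5, 10, 9, 30, 7)
)

lagged <- sales %>%
  group_by(shop) %>%                                 # one partition per shop
  mutate(
    prev_units = lag(units, order_by = month),       # value of the previous month
    next_units = lead(units, order_by = month)       # value of the following month
  ) %>%
  ungroup() %>%
  arrange(shop, month)

lagged
```

The first month of each shop has no predecessor and the last month has no successor, so those cells are NA.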
Try:
tibble(mpg$manufacturer, mpg$drv)
tibble(a = sample(1:10, 20, replace = TRUE), b = cumsum(a))$b
Try:
table(mpg$manufacturer)
table(mpg$manufacturer, mpg$drv)
table(mpg$manufacturer, mpg$drv, mpg$cyl)
Exercise:
table(mpg$manufacturer, front = mpg$drv == 'f')
library(sqldf)
dplyr functionality
• Five basic verbs: filter, select, arrange, mutate, summarize (plus
group_by)
• Can work with data stored in databases and data tables
• Joins: inner join, left join, semi-join, anti-join
• Window functions for calculating ranking, offsets, and more
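The joins and window functions in the list above can be sketched on two toy tables. The tables car_models and makers and all their values are hypothetical and only serve as illustration.

```r
library(dplyr)

# Toy tables (hypothetical values)
car_models <- tibble(model = c("a4", "mustang", "civic"),
                     maker = c("audi", "ford", "honda"))
makers     <- tibble(maker   = c("audi", "ford"),
                     country = c("Germany", "USA"))

inner <- inner_join(car_models, makers, by = "maker")  # only matching rows, country added
left  <- left_join(car_models, makers, by = "maker")   # all car_models rows, NA where no match
semi  <- semi_join(car_models, makers, by = "maker")   # matching rows, columns of car_models only
anti  <- anti_join(car_models, makers, by = "maker")   # rows of car_models without a match

# Window function: rank values, highest first; ties share a rank
ranked <- mutate(tibble(hwy = c(29, 31, 29)), rank = min_rank(desc(hwy)))
```

Semi-join and anti-join only filter the first table; they never add columns from the second one.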
filter(mpg, cyl == 6)
filter(mpg, cyl == 6, drv == 'r')
arrange(mpg, hwy)
arrange(mpg, desc(hwy))
• select() takes a subset of the columns (attributes) of a data set; you
specify the columns to keep by position, by name, or with a helper function.
• General syntax of select():
select(mpg, 1, 2, 9)
select(mpg, manufacturer, model, hwy)
select(mpg, "manufacturer", "model", "hwy")
select(mpg, starts_with("m"), hwy)
select(mpg, 1 : 4)
select(mpg, manufacturer : year)
• Before we start let’s switch the example and prepare the data:
install.packages("nycflights13")
library(nycflights13)
flights
Try:
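# flights_sml is not defined on this slide; assumed to be a narrowed copy of
# flights as in R for Data Science:
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)
mutate(flights_sml,
       gain = arr_delay - dep_delay,
       speed = distance / air_time * 60)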
Arithmetic operations: +, -, *, /, ^
Try transmute with the same parameters and look what happens.
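To see the difference without loading the full flights table, here is a small sketch on a toy stand-in. The tibble toy uses the same column names as the example above, but its values are hypothetical.

```r
library(dplyr)

# Toy stand-in for flights_sml (hypothetical values, same column names)
toy <- tibble(
  dep_delay = c(2, -4),
  arr_delay = c(11, -10),
  distance  = c(1400, 700),
  air_time  = c(210, 100)
)

# mutate() keeps the existing columns and appends the new ones
added <- mutate(toy,
                gain  = arr_delay - dep_delay,
                speed = distance / air_time * 60)

# transmute() keeps only the newly computed columns
only_new <- transmute(toy,
                      gain  = arr_delay - dep_delay,
                      speed = distance / air_time * 60)

names(added)     # original four columns plus gain and speed
names(only_new)  # "gain" "speed"
```

Both calls compute identical values; they differ only in which columns end up in the result.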
The dplyr package
- mutate -
• Let’s visualize:
• To make adding commands less work, start with the data set and pipe
it to the first command by using %>%.
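A short sketch of the same chain of commands written twice, once as nested calls and once with the pipe. It assumes the mpg data set from the ggplot2 package; the particular steps (filter by year, average hwy per manufacturer) are chosen here for illustration only.

```r
library(dplyr)
library(ggplot2)  # loaded only for the mpg data set

# Nested calls: must be read from the inside out
nested <- arrange(
  summarize(
    group_by(filter(mpg, year == 2008), manufacturer),
    mean_hwy = mean(hwy)
  ),
  desc(mean_hwy)
)

# The same steps as a pipeline: start with the data, then add commands
piped <- mpg %>%
  filter(year == 2008) %>%
  group_by(manufacturer) %>%
  summarize(mean_hwy = mean(hwy)) %>%
  arrange(desc(mean_hwy))

# Both versions give the same result
all.equal(nested, piped)
```

With the pipe, each further command is one more line at the end instead of one more layer of parentheses around everything.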
[Figure: process diagram (Energy, Grid) with the steps: request, deregistration check, answer to deregistration, confirm registration and deregistration, consumption correction or confirmation of declaration, confirm delivery]
• SWITCHING_REASON: The reason for the new contract: moving into a new home (MOVE_IN)
or a change of supplier (SUPPLIER_SWITCH).
• DECLARATION_CONSUMP_CUST: The energy consumption the customer expects to use
and thus declares to the new supplier.
• DECLARATION_CONSUMP_GRID: The grid has its own estimate of the energy
consumption. If it differs (significantly) from the customer's declared value, the supplier
receives the grid's estimate in the delivery confirmation.
• CORRECTIVE_VALUE: Pre-calculated by your lecturer as a "hypothetical" correction of
the consumption, assuming the grid's estimate is correct (to make it easier for you).
• MONTH_BILLING: The month in which the customer received the first bill from the new
supplier.
• DAYS_INVOICED: Number of days between delivery start and billing.
• CONSUMPTION_INVOICED: Consumption as invoiced in the first bill of the new supplier.
• READ_OUT_CAT: The first-year consumption can come from different sources. Depending on the
situation it is a value from a meter reading or an estimate (see categories).
• BILLING_TYPE: Final (if the customer quits) or rotational.
Exercise or homework