
Part X

Code Basics & Data Manipulation with R

Literature:
Wickham & Grolemund, R for Data Science, ch. 3, 16

Code basics

Very basics:
• The simple calculator: 5 / 200 * 30
• Type pi and press enter
Functions:
• Functions are written as function(<value>)
• Call a built-in function: sin(pi/2) or log(2.718)
Objects:
• Create objects with x <- 1 and watch your global environment window
• Inspect an object by typing its name, e.g. x
• Create a vector with v <- c(1, 5)
• Create a vector with v2 <- c('r', 'f')
• Inspect the vectors
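The basics above can be tried directly at the R console; a minimal sketch:

```r
# Arithmetic, a built-in constant, built-in functions
5 / 200 * 30        # 0.75
pi                  # 3.141593
sin(pi / 2)         # 1

# Objects: assignment with <-, vectors with c()
x  <- 1
v  <- c(1, 5)       # numeric vector
v2 <- c("r", "f")   # character vector
x; v; v2            # typing a name prints (inspects) the object
```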
Lectures in Data Mining Winter 2018 44
Code basics

• Create a vector with v2 <- c('r', 'f')


• Create a vector with v <- c(1, 5)

• Create a vector v <- seq(1, 5)


• Create a vector v2 <- seq(1, 10, length.out = 5)
• Inspect the vector
• Try (v <- seq(1, 10, length.out = 5)). What happens?

• Inspect the third element of vector v with v[3]
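A runnable sketch of the seq() variants and indexing above:

```r
v  <- seq(1, 5)                     # 1 2 3 4 5
v2 <- seq(1, 10, length.out = 5)    # 1.00 3.25 5.50 7.75 10.00
(v3 <- seq(1, 10, length.out = 5))  # wrapping in ( ) assigns AND prints
v[3]                                # third element: 3
```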



Code basics
- useful functions -

Compute a lagged version of a time series, shifting the time base back by
a given number of observations:
lag() and lead()

(x <- seq(1, 10))


lag(x)

Drawbacks:
• They need exactly one sort variable, or an already sorted data set.
• They do not support partitioning.
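A short sketch of the shift functions; this assumes the dplyr versions of lag() and lead(), which work on plain vectors:

```r
library(dplyr)  # provides lag() and lead() for vectors

(x <- seq(1, 10))
lag(x)          # NA 1 2 ... 9   shifted back by one, NA fills the gap
lead(x)         # 2 3 ... 10 NA  shifted forward by one
lag(x, n = 2)   # shift back by two observations
```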



Code basics
- remember tibbles -

tibble( seq(1, 10), seq(10, 1) )


tibble(a = seq(1,10), b = a*a, c = b - lag(b))

Cumulative and rolling aggregates: cumsum(), cumprod(), cummin(),
cummax(), and cummean(); see the RcppRoll package for rolling windows.

tibble(a = sample(1:10, 20, replace = TRUE), b = cumsum(a))

Rankings: see min_rank(), dense_rank(), percent_rank()
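A runnable sketch of the cumulative and ranking functions; it assumes dplyr is loaded (for tibble(), cummean(), and the rank functions), and note that sampling 20 values from 1:10 needs replace = TRUE:

```r
library(dplyr)

# cumulative aggregates inside a tibble
tibble(a = sample(1:10, 20, replace = TRUE),
       b = cumsum(a),
       c = cummean(a))

# ranking functions on a vector with ties
y <- c(10, 5, 1, 5, 5)
min_rank(y)      # 5 2 1 2 2  ties share the lowest rank, gaps follow
dense_rank(y)    # 3 2 1 2 2  no gaps after ties
percent_rank(y)  # min_rank rescaled to [0, 1]
```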



Code basics
- tibbles and cross table -

Columns in tables can be addressed by using <TABLENAME>$<COLUMN>

Try:
tibble(mpg$manufacturer, mpg$drv)
tibble(a = sample(1:10, 20, replace = TRUE), b = cumsum(a))$b

What is a cross table?


A cross table is a two-way table consisting of columns and rows. It is also
known as a pivot table or a multi-dimensional table. Its greatest strength
is its ability to structure, summarize and display large amounts of data.
Cross tables can also be used to determine whether there is a relation
between the row variable and the column variable or not.



Code basics
- cross table -

table() uses the cross-classifying factors to build a contingency table of
the counts at each combination of factor levels.

Try:
table(mpg$manufacturer)
table(mpg$manufacturer, mpg$drv)
table(mpg$manufacturer, mpg$drv, mpg$cyl)

Exercise:
table(mpg$manufacturer, front = mpg$drv == 'f')

Try out and explain!
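A sketch of the table() calls above; it assumes the mpg data set from ggplot2. Naming an argument labels that dimension, and a comparison like mpg$drv == 'f' turns it into a FALSE/TRUE factor:

```r
library(ggplot2)  # mpg data set

table(mpg$drv)                    # one-way counts per drive train
table(mpg$manufacturer, mpg$drv)  # two-way cross table
# named second dimension built from a logical condition
table(mpg$manufacturer, front = mpg$drv == "f")
```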



Data manipulation in R
with sqldf (SELECT)

• Install the package and load the library with:


install.packages("sqldf")

library(sqldf)

• Remember: Install a package once but load the library


each time you restart R.
• Test:
sqldf("select manufacturer, cyl, hwy from mpg limit 30")

• The RODBC library allows connecting R to an Oracle database.



The dplyr package

• Great for data exploration and transformation


• Intuitive to write and easy to read, especially when using the
“pipelining” syntax (covered below)
• Fast on data frames

dplyr functionality
• Five basic verbs: filter, select, arrange, mutate, summarize (plus
group_by)
• Can work with data stored in databases and data tables
• Joins: inner join, left join, semi-join, anti-join
• Window functions for calculating ranking, offsets, and more



The dplyr package
- filter -

Similar syntax of all dplyr functions:


• The first argument is the data set
• Subsequent arguments describe what to do with the data set
• Functions return a modified data set

General syntax for filter():

filter( data = #DATA, <CONDITION>, <MORE_CONDITIONS>)

filter(mpg, cyl == 6)
filter(mpg, cyl == 6, drv == 'r')



The dplyr package
- filter -

• filter() uses conditions:
– Equals (==)
– Larger (>) and larger or equal (>=)
– Smaller (<) and smaller or equal (<=)
• Comparisons with NA always return NA (neither FALSE nor TRUE)
– Try NA == NA
– Try is.na(NA)
• Conditions may use logical operators and brackets
– And: &, Or: |, Not: !
– In: %in% <VECTOR>
• Try using condition & instead of ,
• Replace & by | (or) and look what happens.
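A runnable sketch of the condition operators; it assumes dplyr and the mpg data set from ggplot2:

```r
library(dplyr)
library(ggplot2)  # mpg

filter(mpg, cyl == 6 & drv == "r")               # & behaves like the comma
filter(mpg, cyl == 6 | drv == "r")               # OR: either condition holds
filter(mpg, manufacturer %in% c("audi", "ford")) # set membership

NA == NA    # NA: a comparison with NA is unknown
is.na(NA)   # TRUE: the correct way to test for NA
```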



The dplyr package
- arrange -

• arrange() sorts the data set by one or more attributes.

• General syntax for arrange():

arrange( data = #DATA, <ATTRIBUTE>, <MORE_ATTRIBUTES>)

• Use desc(<ATTRIBUTE>) to sort in descending order (ascending is the
default).

arrange(mpg, hwy)
arrange(mpg, desc(hwy))



The dplyr package
- select -

• select() takes a subset of a data set by specifying the desired
attributes (columns).
• General syntax of select():

select( data = #DATA, <ATTRIBUTE1>, <ATTRIBUTES2>, …)

• To find columns in tables with many attributes, use the helper functions
starts_with(), contains(), or ends_with() to define the subset.

select(mpg, 1, 2, 9)
select(mpg, manufacturer, model, hwy)
select(mpg, "manufacturer", "model", "hwy")
select(mpg, starts_with("m"), hwy)



The dplyr package
- select -

• select() takes a subset of a data set by specifying the desired
attributes (columns).
• General syntax of select():

select( data = #DATA, <ATTRIBUTE_SET> …)

• To select several consecutive columns at once, use the colon in
select() to define the subset.

select(mpg, 1 : 4)
select(mpg, manufacturer : year)



The dplyr package
- select -

• select() can alias attributes by specifying new attribute names.


• General syntax of select():
select( data = #DATA, <NEW_NAME> = <ATTRIBUTE1>, …)
• To keep the remaining columns in tables use the function
everything().
• To remove a column you can use minus (- <ATTRIBUTE>)

Try the following statements and look what happens:


select(mpg, maker = manufacturer, model, hwy)
select(mpg, maker = manufacturer, everything())
select(mpg, maker = manufacturer, everything(), category = class)
select(mpg, maker = manufacturer, everything(), - cyl)
The dplyr package
- select and rename-

• rename() also aliases attributes by specifying new attribute names,
and it keeps all remaining columns.

• General syntax of rename():


rename( data = #DATA, <NEW_NAME> = <ATTRIBUTE1>, …)

Try and look what happens:

rename(mpg, maker = manufacturer)



The dplyr package
- mutate -

• Before we start let’s switch the example and prepare the data:

install.packages("nycflights13")
library(nycflights13)
flights

(flights_sml <- select(flights,


year:day,
ends_with("delay"),
distance, air_time
))



The dplyr package
- mutate -

• mutate() adds attributes (columns) that are functions of existing
attributes to the end of the data set.

• General syntax of mutate():


mutate( data = #DATA, <NEW_NAME> = <FUNCTION>, …)

Try:
mutate( flights_sml,
gain = arr_delay - dep_delay,
speed = distance / air_time * 60)

Arithmetic operations: +, -, *, /, ^
Try transmute with the same parameters and look what happens.
The dplyr package
- mutate -

• mutate() allows reusing newly created variables:


mutate( flights_sml,
gain = arr_delay - dep_delay,
hours = air_time / 60,
gain_per_hour = gain/hours)



The dplyr package
- mutate functions -

Modular arithmetic:
• %/% - integer division
• %% - remainder
mutate( flights_sml, air_time,
air_hours = air_time %/% 60,
air_mins = air_time %% 60)
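A self-contained sketch of the two modular operators, using a few made-up air times instead of flights_sml:

```r
air_time <- c(45, 90, 227)   # minutes
air_time %/% 60              # hours:   0 1 3
air_time %%  60              # minutes: 45 30 47
```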



The dplyr package
- Grouping and summarize -

• summarize() collapses a data set to a single row.
• summarize() can be paired with group_by(); the aggregation
functions then operate per group of the grouped data set
(here: #DATA)
#NEW_DATA <- group_by( data = #DATA, <ATTRIBUTES>)
summarize( #NEW_DATA,
<NEW_NAME> = <AGGR_FUNCTION>, …)

by_dest <- group_by(flights, dest)


delay <- summarize( by_dest, count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = T))
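A self-contained variant of the grouping idea above, using a small made-up tibble instead of flights so it runs without nycflights13:

```r
library(dplyr)

# tiny stand-in for the flights example
d <- tibble(dest  = c("A", "A", "B", "B", "B"),
            delay = c(10, NA, 5, 15, 10))

by_dest <- group_by(d, dest)
(delays <- summarize(by_dest,
                     count      = n(),
                     mean_delay = mean(delay, na.rm = TRUE)))
# one row per group; na.rm = TRUE ignores the missing delay in group A
```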



The dplyr package
- Grouping and summarize -

• Let’s visualize:

by_dest <- group_by(flights, dest)


delay <- summarize( by_dest, count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = T))
delay <- filter( delay, count > 20, dest != "HNL")

ggplot(data = delay, mapping = aes(x = dist, y = delay)) +


geom_point(aes(size = count), alpha = 1/2) +
geom_smooth(se = FALSE)



The dplyr package
- Piping -

• Instead of creating intermediate data sets, it is possible to send data
from one command to the next by using %>%.
#NEW_DATA <- <FUNCTION>(#DATA, <ATTRIBUTES>) %>%
<NEXT_FUNCTION>(<ATTRIBUTES>)

Applied to our example:


delay <- group_by( flights, dest) %>%
summarize( count = n(),
dist = mean(distance, na.rm = T),
delay = mean(arr_delay, na.rm = T)) %>%
filter( count > 20, dest != "HNL")



The dplyr package
- Piping -

• Alternatively, it is possible to move the filter from the summary to the
input data:
delay <- filter(flights, !is.na(arr_delay)) %>%
group_by(dest) %>%
summarize( count = n(),
dist = mean(distance, na.rm = T),
delay = mean(arr_delay, na.rm = T)) %>%
filter( count > 20, dest != "HNL")

• To count only non-empty values, use
count = sum(!is.na(arr_delay))



The dplyr package
- Piping -

• To make adding commands less work, start with the data set and pipe
it to the first command by using %>%.

Applied to our example:


delay <- flights %>%
filter(!is.na(arr_delay)) %>%
group_by(dest) %>%
summarize( count = n(),
dist = mean(distance),
delay = mean(arr_delay)) %>%
filter( count > 20, dest != "HNL")



The dplyr package
- More functions -

• Some other functions:


– lag() and lead() return the preceding or following values within the data set
flights %>%
group_by(month) %>%
summarise(flight_count = n()) %>%
arrange(month) %>%
mutate(change = flight_count - lag(flight_count))
– sample_n(#x) returns #x randomly sampled rows from
the data set.
– sample_frac(#frac) returns #frac * (number of rows) rows
from the data set. The default is sampling without replacement.

flights %>% sample_n(10)

flights %>% sample_frac(0.01, replace = TRUE)



Power Supplier
- Switch Process -

[Flow diagram: the parties are the Energy Grid, the New Supplier, the
Current Supplier, and the Customer (you). The labelled steps are: request
delivery and declare consumption; announce delivery; request
deregistration; check deregistration request; answer deregistration;
confirm deregistration; confirm registration; consumption correction or
confirmation of declaration; confirm delivery.]


Exercise or Homework
What data is available

• SWITCHING_REASON: The reason of the new contract: Move into a new home (MOVE_IN)
or change of supplier (SUPPLIER_SWITCH)
• DECLARATION_CONSUMP_CUST: The energy consumption the customer expects to use
and thus declares to the new supplier.
• DECLARATION_CONSUMP_GRID: The grid has its own estimate of the energy
consumption. If it differs (significantly) from the value above, the supplier
receives it in the delivery confirmation.
• CORRECTIVE_VALUE: Pre-calculated by your lecturer as “hypothetical” correction of
consumption if the grid is correct. (to make it easier for you)

• MONTH_BILLING: The month when the customer received its first bill from the new
supplier.
• DAYS_INVOICED: Number of days between delivery start and billing.
• CONSUMPTION_INVOICED: Consumption as invoiced in the first bill of the new supplier.
• READ_OUT_CAT: The first-year consumption can have different sources. Depending on the
situation it is a value from a meter reading or an estimation (see categories).
• BILLING_TYPE: 'Final' if the customer quits, otherwise 'rotational'.
Exercise or homework

• Import the data set HH_SAMPLE_POWER_CONSUMPTION_GRID_A.csv
into R Studio
• Download the zip file from moodle
(topic "Visualisation" → "Examples")
and extract it to a local folder
• library(readr)
consumption <- read_delim(
"C:/PATH_TO_FILE/HH_SAMPLE_POWER_CONSUMPTION_GRID_A.csv",
";", na = "empty", trim_ws = TRUE)



Exercise or Homework
Rename the dataset and the columns

• Create a new table or tibble with the name hh_spc:


– month_billing = MONTH_BILLING
– swt_reason = SWITCHING_REASON
– cust_decl_kwh = DECLARATION_CONSUMP_CUST
– grid_decl_kwh = DECLARATION_CONSUMP_CUST + CORRECTIVE_VALUE
– bill = BILLING_TYPE
– inv_kwh_365 = CONSUMPTION_INVOICED * 365 / DAYS_INVOICED
• Use round function to cut decimals

• Limit the selection to:
– inv_kwh_365 < 10000
– cust_decl_kwh < 7500
– cust_decl_kwh > 0
• Order by month_billing
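A possible skeleton for this transformation. It uses a tiny made-up stand-in for the CSV so the sketch runs on its own; in the exercise, consumption comes from read_delim() as shown on the import slide:

```r
library(dplyr)

# made-up stand-in for the CSV (hypothetical values, same column names)
consumption <- tibble(
  MONTH_BILLING            = c(3, 1, 2),
  SWITCHING_REASON         = c("MOVE_IN", "SUPPLIER_SWITCH", "MOVE_IN"),
  DECLARATION_CONSUMP_CUST = c(2500, 3000, 9000),
  CORRECTIVE_VALUE         = c(100, -200, 0),
  BILLING_TYPE             = c("rotational", "final", "rotational"),
  CONSUMPTION_INVOICED     = c(1200, 2900, 8000),
  DAYS_INVOICED            = c(180, 365, 300))

hh_spc <- consumption %>%
  transmute(month_billing = MONTH_BILLING,
            swt_reason    = SWITCHING_REASON,
            cust_decl_kwh = DECLARATION_CONSUMP_CUST,
            grid_decl_kwh = DECLARATION_CONSUMP_CUST + CORRECTIVE_VALUE,
            bill          = BILLING_TYPE,
            # scale to a 365-day year, round() cuts the decimals
            inv_kwh_365   = round(CONSUMPTION_INVOICED * 365 / DAYS_INVOICED)) %>%
  filter(inv_kwh_365 < 10000,
         cust_decl_kwh < 7500,
         cust_decl_kwh > 0) %>%
  arrange(month_billing)
hh_spc
```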



Exercise or Homework

• Plot a sample of 400 of these points, including a smoothing function


– Compare customers’ declared consumption and (invoiced) 365 day consumption
– Separate switching reason by color
– Use se = FALSE for the smoothing

• Create a boxplot with switching reason and 365 day consumption by


using a 10% sample.
(Hint: If it does not work change x and y attributes)

