Data Transformation Cheatsheet

Data Transformation
with dplyr Cheat Sheet
dplyr functions work with pipes and expect tidy data. In tidy data:
A B C
&
Each variable is
in its own column
A B C
pipes
x %>% f(y)
becomes f(x, y)
Each observation, or
case, is in its own row
Summarise Cases
These apply summary functions to
columns to create a new table.
Summary functions take vectors as
input and return one value (see back).
w
w
w
w
w
w
summary
function
Manipulate Cases
Manipulate Variables
Extract Cases
Extract Variables
Row functions return a subset of rows as a new table. Use a variant

that ends in _ for non-standard evaluation friendly code.
Column functions return a set of columns as a new table. Use a

variant that ends in _ for non-standard evaluation friendly code.
filter(.data, )
w
w
ww
w
w
w
w
ww
w
w
Remove rows with duplicate values. Also

distinct_(). distinct(iris, Species)
w
w
ww
w
w
Randomly select fraction of rows.

sample_frac(iris, 0.5, replace = TRUE)
weight = NULL, .env = parent.frame())

Randomly select size rows.
sample_n(iris, 10, replace = TRUE)
w
w
ww
w
w
Select rows by position. Also slice_().

slice(iris, 10:15)
top_n(x, n, wt)
Select and order top n entries (by group if
grouped data). top_n(iris, 5, Sepal.Width)
summarise_all() - Apply funs to every column.

summarise_at() - Apply funs to every column.
summarise_if() - Apply funs to all cols of one type.
<
>
<=
>=
is.na()
!is.na()
ungroup(x, ...)
Returns ungrouped copy of table.

ungroup(g_iris)
|
&
xor()
See ?base::logic and ?Comparison for help.
Arrange Cases
w
w
ww
w
w
Compute new column(s).

mutate(mtcars, gpm = 1/mpg)
w
w
w
Compute new column(s), drop others.

transmute(mtcars, gpm = 1/mpg)
w
ww
w
ww
w
mutate_all(.tbl, .funs, ...)
Apply funs to every column. Use with

funs(). mutate_all(faithful, funs(log(.),
log2(.)))
mutate_at(.tbl, .cols, .funs, ...)
Apply funs to specific columns. Use with

funs() and the helper functions for
select().
mutate_at(iris, -Species, funs(log(.)))
Apply funs to all columns of one type. Use
with funs().
mutate_if(iris, is.numeric, funs(log(.)))
add_column(.data, ..., .before =

NULL, .after = NULL)
w
w
ww
w
w
Add Cases
RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com
vectorized
function
mutate(.data, )
w
w
ww
w
w
Order rows by values of a column (low to high),

use with desc() to order from high to low.
arrange(mtcars, mpg)
arrange(mtcars, desc(mpg))
add_row(.data, ..., .before = NULL,

.after = NULL)
w
w
ww
w
w
:, e.g. mpg:cyl
-, e.g, -Species
mutate_if(.tbl, .predicate, .funs, ...)
arrange(.data, ...)
group_by(.data, ..., add = FALSE)
Returns copy of table grouped by

g_iris <- group_by(iris, Species)
%in%
!
num_range(prefix, range)
one_of()
starts_with(match)
Make New Variables
Logical and boolean operators to use with filter()
Use group_by() to created a "grouped" copy of a table. dplyr

functions will manipulate each "group" separately and then
combine the results.
w
w
w
w
ww
w
w
w
contains(match)
ends_with(match)
matches(match)
transmute(.data, )
slice(.data, )
Variations
mtcars %>%
group_by(cyl) %>%
summarise(avg = mean(mpg))
e.g. select(iris, starts_with("Sepal"))
These apply vectorized functions to

columns. Vectorized funs take vectors
as input and return vectors of the
same length as output (see back).
sample_n(tbl, size, replace = FALSE,
count(x, ..., wt = NULL, sort = FALSE)
Extract columns by name. Also select_if()

select(iris, Sepal.Length, Species)
Use these helpers with select(),
sample_frac(tbl, size = 1, replace = FALSE,

weight = NULL, .env = parent.frame())
Compute table of summaries. Also

summarise_().
summarise(mtcars, avg = mean(mpg))
Group Cases
w
w
ww
distinct(.data, ..., .keep_all = FALSE)
summarise(.data, )
Count number of rows in each group defined

by the variables in Also tally().
count(iris, Species)
Extract rows that meet logical criteria. Also

filter_(). filter(iris, Sepal.Length > 7)
select(.data, )
Add one or more rows to a table.

add_row(faithful, eruptions = 1, waiting = 1)
Add new column(s).

add_column(mtcars, new = 1:32)
rename(.data, )
w
w
ww
w
Rename columns.
rename(iris, Length = Sepal.Length)
Learn more with browseVignettes(package = c("dplyr", "tibble")) dplyr 0.5.0 tibble 1.2.0 Updated: 11/16
Vectorized Functions
Summary Functions
to use with mutate()
to use with summarise()
mutate() and transmute() apply vectorized

functions to columns to create new columns.
Vectorized functions take vectors as input and
return vectors of the same length as output.
summarise() applies summary functions to

columns to create a new table. Summary
functions take vectors as input and return single
values as output.
vectorized
function
summary
function
Counts
Osets
dplyr::lag() - Oset elements by 1
dplyr::lead() - Oset elements by -1
dplyr::n() - number of values/rows

dplyr::n_distinct() - # of uniques
sum(!is.na()) - # of non-NAs
Cumulative Aggregates
Location
dplyr::cumall() - Cumulative all()

dplyr::cumany() - Cumulative any()
cummax() - Cumulative max()
dplyr::cummean() - Cumulative mean()
cummin() - Cumulative min()
cumprod() - Cumulative prod()
cumsum() - Cumulative sum()
Rankings
dplyr::cume_dist() - Proportion of all values <=
dplyr::dense_rank() - rank with ties = min, no
gaps
dplyr::min_rank() - rank with ties = min
dplyr::ntile() - bins into n bins
dplyr::percent_rank() - min_rank scaled to [0,1]
dplyr::row_number() - rank with ties = "first"
mean() - mean, also mean(is.na())

median() - median
Logicals
mean() - Proportion of TRUEs
sum() - # of TRUEs
Position/Order
dplyr::first() - first value
dplyr::last() - last value
dplyr::nth() - value in nth location of vector
Rank
quantile() - nth quantile
min() - minimum value
max() - maximum value
Spread
IQR() - Inter-Quartile Range
mad() - mean absolute deviation
sd() - standard deviation
var() - variance
Math
+, - , *, ?, ^, %/%, %% - arithmetic ops
log(), log2(), log10() - logs
<, <=, >, >=, !=, == - logical comparisons
Misc
dplyr::between() - x > right & x < left
dplyr::case_when() - multi-case if_else()
dplyr::coalesce() - first non-NA values by
element across a set of vectors
if_else() - element-wise if() + else()
dplyr::na_if() - replace specific values with NA
pmax() - element-wise max()
pmin() - element-wise min()
dplyr::recode() - Vectorized switch()
dplyr::recode_factor() - Vectorized switch() for
factors
Row names
Tidy data does not use rownames, which store
a variable outside of the columns. To work with
the rownames, first move them into a column.
C
1
2
3
A
a
b
c
B
t
u
v
C
1
2
3
A
a
b
c
B
t
u
v
rownames_to_column()
Move row names into col.
a <- rownames_to_column(iris,
var = "C")
column_to_rownames()
Move col in row names.
column_to_rownames(a,
var = "C")
Also has_rownames(), remove_rownames()
A
a
b
c
B
t
u
v
C
1
2
3
C
1
2
3
A
a
b
c
B
t
u
v
RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com
Combine Tables
Combine Variables
x
A
A
aa
bb
cc
Combine Cases
B
B
tt
uu
vv
C
C
C
11
22
33
A
A
A
aaa
bbb
ddd
B
B
B
tt
uuu
w
w
w
D
D
33
22
11
=
+
Use bind_cols() to paste tables beside each other as

they are.
Returns tables placed side by
side as a single table.
BE SURE THAT ROWS ALIGN.
a t 1 a t 3
b u 2 b u 2
c v 3 d w 1
Use a "Mutating Join" to join one table to columns

from another, matching values with the rows that
they correspond to. Each join retains a dierent
combination of values from the tables.
a
b
c
t
u
v
A B C D
a
b
d
t 1 3
u 2 2
w NA 1
t
u
1
2
t 1
u 2
v 3
w NA
A
a
b
c
c
d
B
t
u
v
v
w
C
1
2
3
3
4
bind_rows(, .id = NULL)
Returns tables one on top of the other

as a single table. Set .id to a column
name to add a column of the original
table names (as pictured)
copy=FALSE, suix=c(.x,.y),)
Join matching values from y to x.
A B C
c v 3
intersect(x, y, )
right_join(x, y, by = NULL, copy =

FALSE, suix=c(.x,.y),)
A
aa
bb
B
B
tt
uu
C
C
11
22
setdi(x, y, )
A
A
aa
bb
c
d
B
B
tt
u
v
w
C
C
1
1
2
3
4
union(x, y, )
inner_join(x, y, by = NULL, copy =

FALSE, suix=c(.x,.y),)
3
2
Join data. Retain only rows with

matches.
full_join(x, y, by = NULL,
A B C D
a
b
c
d
DF
x
x
x
z
z
Rows that appear in both x and z.
Rows that appear in both x but not z.
Join matching values from x to y.
A B C D
a
b
B C
v 3
w 4
Use bind_rows() to paste tables below each other as

they are.
left_join(x, y, by = NULL,
1 3
2 2
3 NA
A
c
d
3
2
Rows that appear in x or z. (Duplicates

removed). union_all() retains
duplicates.
Use setequal() to test whether two data sets contain

the exact same rows (in any order).
copy=FALSE, suix=c(.x,.y),)
Join data. Retain all values, all
rows.
NA
1
Extract Rows
x
A B.x C B.y D
a
b
c
t
u
v
1
2
3
t
3
u
2
NA NA
A.x B.x C A.y B.y

t
u
v
1
2
3
d
b
a
w
u
t
A
a
b
cc
Use by = c("col1", "col2") to

specify the column(s) to match
on.
left_join(x, y, by = "A")
a
b
c
Use a named vector, by =

c("col1" = "col2"), to match on
columns with dierent names in
each data set.
a
b
c
t
u
v
1
2
3
A2 B2
d
b
a
w
u
t
Use suix to specify suix to give

to duplicate column names.
left_join(x, y, by = c("C" = "D"),

suix = c("1", "2"))
B
t
u
vv
y
C
C
1
2
3
A
a
b
d
B
t
uu
w
D
C
33
22
11
Use a "Filtering Join" to filter one table against the

rows of another.
A
a
b
B
t
u
C
1
2
semi_join(x, y, by = NULL, )
A
c
B
v
C
3
anti_join(x, y, by = NULL, )
left_join(x, y, by = c("C" = "D"))

A1 B1 C
C
1
2
3
bind_cols()
A B C A B D
A B C D
A B
a t
b u
c v
Return rows of x that have a match in y.

USEFUL TO SEE WHAT WILL BE JOINED.
Return rows of x that do not have a

match in y. USEFUL TO SEE WHAT WILL
NOT BE JOINED.
Learn more with browseVignettes(package = c("dplyr", "tibble")) dplyr 0.5.0 tibble 1.2.0 Updated: 12/16

Data Transformation Cheatsheet

Caricato da

Informazioni sul documento

Copyright

Formati disponibili

Condividi questo documento

Condividi o incorpora il documento

Opzioni di condivisione

Hai trovato utile questo documento?

Questo contenuto è inappropriato?

Copyright:

Formati disponibili

Data Transformation Cheatsheet

Caricato da

Copyright:

Formati disponibili

Data Transformation

with dplyr Cheat Sheet

Row functions return a subset of rows as a new table. Use a variant

Column functions return a set of columns as a new table. Use a

Remove rows with duplicate values. Also

Randomly select fraction of rows.

weight = NULL, .env = parent.frame())

Select rows by position. Also slice_().

summarise_all() - Apply funs to every column.

Returns ungrouped copy of table.

See ?base::logic and ?Comparison for help.

Compute new column(s).

Compute new column(s), drop others.

mutate_all(.tbl, .funs, ...)

Apply funs to every column. Use with

mutate_at(.tbl, .cols, .funs, ...)

Apply funs to specific columns. Use with

add_column(.data, ..., .before =

RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com

Order rows by values of a column (low to high),

add_row(.data, ..., .before = NULL,

mutate_if(.tbl, .predicate, .funs, ...)

group_by(.data, ..., add = FALSE)

Returns copy of table grouped by

Make New Variables

Logical and boolean operators to use with filter()

Use group_by() to created a "grouped" copy of a table. dplyr

e.g. select(iris, starts_with("Sepal"))

These apply vectorized functions to

sample_n(tbl, size, replace = FALSE,

count(x, ..., wt = NULL, sort = FALSE)

Extract columns by name. Also select_if()

Use these helpers with select(),

sample_frac(tbl, size = 1, replace = FALSE,

Compute table of summaries. Also

distinct(.data, ..., .keep_all = FALSE)

Count number of rows in each group defined

Extract rows that meet logical criteria. Also

Add one or more rows to a table.

Add new column(s).

to use with mutate()

to use with summarise()

mutate() and transmute() apply vectorized

summarise() applies summary functions to

dplyr::n() - number of values/rows

dplyr::cumall() - Cumulative all()

mean() - mean, also mean(is.na())

RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com

Use bind_cols() to paste tables beside each other as

Use a "Mutating Join" to join one table to columns

bind_rows(, .id = NULL)

Returns tables one on top of the other

right_join(x, y, by = NULL, copy =

inner_join(x, y, by = NULL, copy =

Join data. Retain only rows with

Rows that appear in both x and z.

Rows that appear in both x but not z.

Join matching values from x to y.

Use bind_rows() to paste tables below each other as

Rows that appear in x or z. (Duplicates

Use setequal() to test whether two data sets contain

A.x B.x C A.y B.y

Use by = c("col1", "col2") to

Use a named vector, by =