
Data Import :: CHEAT SHEET

R's tidyverse is built around tidy data stored in tibbles, which are enhanced data frames.
The front side of this sheet shows how to read text files into R with readr.
The reverse side shows how to create tibbles with tibble and to lay out tidy data with tidyr.

OTHER TYPES OF DATA
Try one of the following packages to import other types of files:
  haven - SPSS, Stata, and SAS files
  readxl - Excel files (.xls and .xlsx)
  DBI - databases
  jsonlite - JSON
  xml2 - XML
  httr - Web APIs
  rvest - HTML (Web Scraping)

Read Tabular Data - These functions share the common arguments:
  read_*(file, col_names = TRUE, col_types = NULL, locale = default_locale(), na = c("", "NA"),
         quoted_na = TRUE, comment = "", trim_ws = TRUE, skip = 0, n_max = Inf,
         guess_max = min(1000, n_max), progress = interactive())

  Comma Delimited Files
    read_csv("file.csv")
    To make file.csv run:
    write_file(x = "a,b,c\n1,2,3\n4,5,NA", path = "file.csv")

  Semi-colon Delimited Files
    read_csv2("file2.csv")
    write_file(x = "a;b;c\n1;2;3\n4;5;NA", path = "file2.csv")

  Files with Any Delimiter
    read_delim("file.txt", delim = "|")
    write_file(x = "a|b|c\n1|2|3\n4|5|NA", path = "file.txt")

  Fixed Width Files
    read_fwf("file.fwf", col_positions = fwf_widths(c(2, 2, 2)))
    write_file(x = "a b c\n1 2 3\n4 5 NA", path = "file.fwf")

  Tab Delimited Files
    read_tsv("file.tsv") Also read_table().
    write_file(x = "a\tb\tc\n1\t2\t3\n4\t5\tNA", path = "file.tsv")

  Each example file holds a header row and two data rows; the sheet pictures the result as a table with columns A, B, C and rows 1 2 3 and 4 5 NA.

USEFUL ARGUMENTS
  Example file
    write_file("a,b,c\n1,2,3\n4,5,NA","file.csv")
    f <- "file.csv"
  Skip lines
    read_csv(f, skip = 1)
  No header
    read_csv(f, col_names = FALSE)
  Read in a subset
    read_csv(f, n_max = 1)
  Provide header
    read_csv(f, col_names = c("x", "y", "z"))
  Missing Values
    read_csv(f, na = c("1", "."))

Read Non-Tabular Data
  Read a file into a single string
    read_file(file, locale = default_locale())
  Read a file into a raw vector
    read_file_raw(file)
  Read each line into its own string
    read_lines(file, skip = 0, n_max = -1L, na = character(), locale = default_locale(), progress = interactive())
  Read each line into a raw vector
    read_lines_raw(file, skip = 0, n_max = -1L, progress = interactive())
  Read Apache style log files
    read_log(file, col_names = FALSE, col_types = NULL, skip = 0, n_max = -1, progress = interactive())

Save Data
Save x, an R object, to path, a file path, as:
  Comma delimited file
    write_csv(x, path, na = "NA", append = FALSE, col_names = !append)
  File with arbitrary delimiter
    write_delim(x, path, delim = " ", na = "NA", append = FALSE, col_names = !append)
  CSV for Excel
    write_excel_csv(x, path, na = "NA", append = FALSE, col_names = !append)
  String to file
    write_file(x, path, append = FALSE)
  String vector to file, one element per line
    write_lines(x, path, na = "NA", append = FALSE)
  Object to RDS file
    write_rds(x, path, compress = c("none", "gz", "bz2", "xz"), ...)
  Tab delimited files
    write_tsv(x, path, na = "NA", append = FALSE, col_names = !append)

Data types
readr functions guess the types of each column and convert types when appropriate (but will NOT convert strings to factors automatically).
A message shows the type of each column in the result.
  ## Parsed with column specification:
  ## cols(
  ##   age = col_integer(),      (age is an integer)
  ##   sex = col_character(),    (sex is a character)
  ##   earn = col_double()       (earn is a double (numeric))
  ## )

1. Use problems() to diagnose problems
    x <- read_csv("file.csv"); problems(x)

2. Use a col_ function to guide parsing
    col_guess() - the default
    col_character()
    col_double(), col_euro_double()
    col_datetime(format = "") Also col_date(format = ""), col_time(format = "")
    col_factor(levels, ordered = FALSE)
    col_integer()
    col_logical()
    col_number(), col_numeric()
    col_skip()

    x <- read_csv("file.csv", col_types = cols(
      A = col_double(),
      B = col_logical(),
      C = col_factor()))

3. Else, read in as character vectors then parse with a parse_ function.
    parse_guess()
    parse_character()
    parse_datetime() Also parse_date() and parse_time()
    parse_double()
    parse_factor()
    parse_integer()
    parse_logical()
    parse_number()

    x$A <- parse_number(x$A)
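The three type-handling steps (problems(), col_ functions, parse_ functions) can be tried together on a throwaway file. This is only a sketch: the file contents, the column names id, price and flag, and the "n/a" missing-value marker are made up for illustration and do not come from the sheet.

```r
library(readr)

# Hypothetical example file; names and values are illustrative only.
f <- tempfile(fileext = ".csv")
write_file("id,price,flag\n1,$1.50,TRUE\n2,$20,FALSE\n3,n/a,TRUE", f)

# Step 2: guide parsing with col_ functions instead of relying on guessing.
x <- read_csv(f, col_types = cols(
  id    = col_integer(),
  price = col_character(),  # keep as text for now, parse it below
  flag  = col_logical()
))

# Step 1: list any values that failed to parse during the read.
p <- problems(x)

# Step 3: parse the character column afterwards with a parse_ function.
# parse_number() drops the currency symbol; "n/a" is declared as missing.
x$price <- parse_number(x$price, na = "n/a")
```

Reading a messy column as character first and parsing it afterwards keeps the read itself free of parsing failures, at the cost of one extra line per column.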
RStudio is a trademark of RStudio, Inc. CC BY RStudio info@rstudio.com 844-448-1212 rstudio.com Learn more with tidyverse.org readr 1.1.0 tibble 1.2.12 tidyr 0.6.0 Updated: 2017-01
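As a recap of the front side, the reading arguments (skip, n_max, col_names, na) and a write/read round trip can be exercised on the sheet's three-line example file. This is a sketch under a few assumptions: the file lives in a temporary location rather than "file.csv", arguments to the write_ functions are passed by position, and col_types = cols() is used only to keep guessing while silencing the column specification message. The object names are hypothetical.

```r
library(readr)

# Same contents as the sheet's example file, written to a temporary path.
f <- tempfile(fileext = ".csv")
write_file("a,b,c\n1,2,3\n4,5,NA", f)

all_rows  <- read_csv(f, col_types = cols())                     # header a, b, c
no_header <- read_csv(f, col_names = FALSE, col_types = cols())  # header row becomes data
first_row <- read_csv(f, n_max = 1, col_types = cols())          # read in a subset
renamed   <- read_csv(f, skip = 1, col_names = c("x", "y", "z"),
                      col_types = cols())                        # skip header, provide names
with_na   <- read_csv(f, na = c("1", "."), col_types = cols())   # treat "1" and "." as missing

# Round trip: save as tab delimited, then read it back.
out <- tempfile(fileext = ".tsv")
write_tsv(all_rows, out)
back <- read_tsv(out, col_types = cols())
```

Note that once na is overridden, the literal string "NA" in column c is no longer treated as missing, so that column is read as character.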
Tibbles - an enhanced data frame
The tibble package provides a new S3 class for storing tabular data, the tibble. Tibbles inherit the data frame class, but improve three behaviors:
  Subsetting - [ always returns a new tibble, [[ and $ always return a vector.
  No partial matching - You must use full column names when subsetting.
  Display - When you print a tibble, R provides a concise view of the data that fits on one screen.

A large table to display

tibble display
  # A tibble: 234 × 6
     manufacturer model      displ
     <chr>        <chr>      <dbl>
   1 audi         a4           1.8
   2 audi         a4           1.8
   3 audi         a4           2.0
   4 audi         a4           2.0
   5 audi         a4           2.8
   6 audi         a4           2.8
   7 audi         a4           3.1
   8 audi         a4 quattro   1.8
   9 audi         a4 quattro   1.8
  10 audi         a4 quattro   2.0
  # ... with 224 more rows, and 3 more variables: year <int>, cyl <int>, trans <chr>

data frame display
  156 1999 6 auto(l4)
  157 1999 6 auto(l4)
  158 2008 6 auto(l4)
  159 2008 8 auto(s4)
  160 1999 4 manual(m5)
  161 1999 4 auto(l4)
  162 2008 4 manual(m5)
  163 2008 4 manual(m5)
  164 2008 4 auto(l4)
  165 2008 4 auto(l4)
  166 1999 4 auto(l4)
  [ reached getOption("max.print") -- omitted 68 rows ]

Control the default appearance with options:
  options(tibble.print_max = n, tibble.print_min = m, tibble.width = Inf)
View full data set with View() or glimpse()
Revert to data frame with as.data.frame()

Tidy Data with tidyr
Tidy data is a way to organize tabular data. It provides a consistent data structure across packages.
A table is tidy if:
  Each variable is in its own column
  Each observation, or case, is in its own row
Tidy data:
  Makes variables easy to access as vectors
  Preserves cases during vectorized operations (A * B -> C)

Reshape Data - change the layout of values in a table
Use gather() and spread() to reorganize the values of a table into a new layout.

gather(data, key, value, ..., na.rm = FALSE, convert = FALSE, factor_key = FALSE)
Gather moves column names into a key column, gathering the column values into a single value column.

  table4a
  country  1999  2000
  A        0.7K  2K
  B        37K   80K
  C        212K  213K

  gather(table4a, `1999`, `2000`, key = "year", value = "cases")

  country  year  cases
  A        1999  0.7K
  B        1999  37K
  C        1999  212K
  A        2000  2K
  B        2000  80K
  C        2000  213K

spread(data, key, value, fill = NA, convert = FALSE, drop = TRUE, sep = NULL)
Spread moves the unique values of a key column into the column names, spreading the values of a value column across the new columns.

  table2
  country  year  type   count
  A        1999  cases  0.7K
  A        1999  pop    19M
  A        2000  cases  2K
  A        2000  pop    20M
  B        1999  cases  37K
  B        1999  pop    172M
  B        2000  cases  80K
  B        2000  pop    174M
  C        1999  cases  212K
  C        1999  pop    1T
  C        2000  cases  213K
  C        2000  pop    1T

  spread(table2, type, count)

  country  year  cases  pop
  A        1999  0.7K   19M
  A        2000  2K     20M
  B        1999  37K    172M
  B        2000  80K    174M
  C        1999  212K   1T
  C        2000  213K   1T

Split Cells
Use these functions to split or combine cells into individual, isolated values.

separate(data, col, into, sep = "[^[:alnum:]]+", remove = TRUE, convert = FALSE, extra = "warn", fill = "warn", ...)
Separate each cell in a column to make several columns.

  table3
  country  year  rate
  A        1999  0.7K/19M
  A        2000  2K/20M
  B        1999  37K/172M
  B        2000  80K/174M
  C        1999  212K/1T
  C        2000  213K/1T

  separate(table3, rate, into = c("cases", "pop"))

  country  year  cases  pop
  A        1999  0.7K   19M
  A        2000  2K     20M
  B        1999  37K    172M
  B        2000  80K    174M
  C        1999  212K   1T
  C        2000  213K   1T

separate_rows(data, ..., sep = "[^[:alnum:].]+", convert = FALSE)
Separate each cell in a column to make several rows. Also separate_rows_().

  separate_rows(table3, rate)

  country  year  rate
  A        1999  0.7K
  A        1999  19M
  A        2000  2K
  A        2000  20M
  B        1999  37K
  B        1999  172M
  B        2000  80K
  B        2000  174M
  C        1999  212K
  C        1999  1T
  C        2000  213K
  C        2000  1T
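The reshaping and cell-splitting calls above can be tried on small stand-in tables. This is a sketch, not the sheet's own data: the tibbles below only mimic the shape of table4a and table3, the numeric values are invented, and sep = "/" is passed explicitly as an assumption so that the split does not depend on the default separator pattern.

```r
library(tibble)
library(tidyr)

# Stand-in for table4a: one column per year (values are illustrative).
table4a <- tibble(
  country = c("A", "B"),
  `1999`  = c(700, 37000),
  `2000`  = c(2000, 80000)
)

# gather(): the column names `1999` and `2000` become values of a key column.
long <- gather(table4a, `1999`, `2000`, key = "year", value = "cases")

# spread(): the reverse move, key values go back to being column names.
wide <- spread(long, year, cases)

# Stand-in for table3: two values packed into one cell.
table3 <- tibble(
  country = c("A", "A"),
  year    = c(1999, 2000),
  rate    = c("700/19000000", "2000/20000000")
)

# separate(): one column becomes several columns.
split_cols <- separate(table3, rate, into = c("cases", "pop"),
                       sep = "/", convert = TRUE)

# separate_rows(): one cell becomes several rows.
split_rows <- separate_rows(table3, rate, sep = "/")
```

Here gather() followed by spread() returns the original layout, which is a quick way to check that the key and value columns were named as intended.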
unite(data, col, ..., sep = "_", remove = TRUE)
Collapse cells across several columns to make a single column.

  table5
  country  century  year
  Afghan   19       99
  Afghan   20       00
  Brazil   19       99
  Brazil   20       00
  China    19       99
  China    20       00

  unite(table5, century, year, col = "year", sep = "")

  country  year
  Afghan   1999
  Afghan   2000
  Brazil   1999
  Brazil   2000
  China    1999
  China    2000

Handle Missing Values

drop_na(data, ...)
Drop rows containing NAs in the ... columns.

fill(data, ..., .direction = c("down", "up"))
Fill in NAs in the ... columns with most recent non-NA values.

replace_na(data, replace = list(), ...)
Replace NAs by column.

  x          drop_na(x, x2)   fill(x, x2)   replace_na(x, list(x2 = 2))
  x1  x2     x1  x2           x1  x2        x1  x2
  A   1      A   1            A   1         A   1
  B   NA     D   3            B   1         B   2
  C   NA                      C   1         C   2
  D   3                       D   3         D   3
  E   NA                      E   3         E   2

Expand Tables - quickly create tables with combinations of values

complete(data, ..., fill = list())
Adds to the data missing combinations of the values of the variables listed in ...
  complete(mtcars, cyl, gear, carb)

expand(data, ...)
Create new tibble with all possible combinations of the values of the variables listed in ...
  expand(mtcars, cyl, gear, carb)

CONSTRUCT A TIBBLE IN TWO WAYS

tibble()
Construct by columns.
  tibble(x = 1:3, y = c("a", "b", "c"))

tribble()
Construct by rows.
  tribble( ~x, ~y,
            1, "a",
            2, "b",
            3, "c")

Both make this tibble:
  A tibble: 3 × 2
        x     y
    <int> <chr>
  1     1     a
  2     2     b
  3     3     c

as_tibble(x, ...) Convert data frame to tibble.
enframe(x, name = "name", value = "value") Convert named vector to a tibble
is_tibble(x) Test whether x is a tibble.
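The construction helpers and the missing-value, expand and unite functions from this side can be combined in one short session. A minimal sketch: x reproduces the five-row table from the missing-values illustration, while the other tables (sales, t5) and all object names are hypothetical and exist only for this example.

```r
library(tibble)
library(tidyr)

# Construct by rows: the x table used in the missing-values illustration.
x <- tribble(
  ~x1, ~x2,
  "A", 1,
  "B", NA,
  "C", NA,
  "D", 3,
  "E", NA
)

dropped  <- drop_na(x, x2)               # keeps only rows A and D
filled   <- fill(x, x2)                  # carries the last non-NA value down
replaced <- replace_na(x, list(x2 = 2))  # NAs in x2 become 2

# Construct by columns, then add the missing grp/yr combination.
sales <- tibble(grp = c("a", "a", "b"), yr = c(1, 2, 1), n = c(10, 20, 30))
completed <- complete(sales, grp, yr)    # adds row (b, 2) with n = NA
combos    <- expand(sales, grp, yr)      # all grp/yr combinations only

# Collapse two columns into one.
t5 <- tibble(country = "A", century = "19", year = "99")
united <- unite(t5, century, year, col = "year", sep = "")

# Convert a named vector to a tibble and check the class.
named <- enframe(c(first = 1, second = 2))
```

complete() keeps the other columns and fills them with NA for the new rows, whereas expand() returns only the columns that were listed.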