SQL using R

1. INTRODUCTION

So far, we have dealt with small datasets that easily fit into your computer’s memory. But
what about datasets that are too large for your computer to handle as a whole? In this case,
storing the data outside of R and organizing it in a database is helpful. Connecting to the
database allows you to retrieve only the chunks needed for the current analysis.
Even better, many large datasets are already available in public or private databases. You
can query them without having to download the data first.
R can connect to almost any existing database type. Most common database types have R
packages that allow you to connect to them (e.g., RSQLite, RMySQL, etc.). Furthermore,
the dplyr package you used in the previous chapter, in conjunction with dbplyr, supports
connecting to the widely used open source databases SQLite, MySQL and PostgreSQL, as
well as Google’s BigQuery, and it can also be extended to other database types (a vignette
in the dplyr package explains how to do it). RStudio has created a website that provides
documentation and best practices for working with database interfaces.
Interfacing with databases using dplyr focuses on retrieving and analyzing datasets by
generating SELECT SQL statements, but it doesn’t modify the database itself. dplyr does
not offer functions to UPDATE or DELETE entries. If you need these functionalities, you
will need to use additional R packages (e.g., RSQLite). Here we will demonstrate how to
interact with a database using dplyr, using both dplyr’s verb syntax and the SQL
syntax.
SQL:
SQL is a language to operate databases; it includes database creation, deletion, fetching
rows, modifying rows, etc. SQL is an ANSI (American National Standards Institute)
standard language, but there are many different versions of the SQL language.

What is SQL?
SQL is Structured Query Language, which is a computer language for storing,
manipulating and retrieving data stored in a relational database.

SQL is the standard language for relational database systems. All Relational
Database Management Systems (RDBMS) such as MySQL, MS Access, Oracle, Sybase,
Informix, Postgres and SQL Server use SQL as their standard database language.

SKIT, Dept.of.CSE 1

They also use different dialects, such as:

● MS SQL Server uses T-SQL,

● Oracle uses PL/SQL,

● the MS Access version of SQL is called JET SQL (native format), etc.

Why SQL?

SQL is widely popular because it offers the following advantages:

● Allows users to access data in relational database management systems.

● Allows users to describe the data.

● Allows users to define the data in a database and manipulate that data.

● Allows SQL to be embedded within other languages using SQL modules, libraries and
pre-compilers.

● Allows users to create and drop databases and tables.

● Allows users to create views, stored procedures and functions in a database.

● Allows users to set permissions on tables, procedures and views.

A Brief History of SQL

1970 - Dr. Edgar F. "Ted" Codd of IBM, known as the father of relational databases,
described a relational model for databases.

1974 - Structured Query Language appeared.

1978 - IBM worked to develop Codd's ideas and released a product named System/R.

1986 - IBM developed the first prototype of a relational database, and it was
standardized by ANSI. The first relational database was released by Relational
Software, which later came to be known as Oracle.

R programming:

R is a programming language and software environment for statistical analysis,
graphics representation and reporting. R was created by Ross Ihaka and Robert
Gentleman at the University of Auckland, New Zealand, and is currently developed by the
R Development Core Team. The core of R is an interpreted computer language which
allows branching and looping as well as modular programming using functions. R allows
integration with procedures written in the C, C++, .Net, Python or FORTRAN
languages for efficiency. R is freely available under the GNU General Public License, and
pre-compiled binary versions are provided for various operating systems like Linux,
Windows and Mac. R is free software distributed under a GNU-style copyleft, and an
official part of the GNU project called GNU S.


2. SQL (Structured Query Language)

SQL Process
When you execute an SQL command for any RDBMS, the system
determines the best way to carry out your request, and the SQL engine figures out how
to interpret the task.

There are various components included in this process.

These components are −

● Query Dispatcher

● Optimization Engines

● Classic Query Engine

● SQL Query Engine, etc.

A classic query engine handles all the non-SQL queries, but a
SQL query engine won't handle logical files.

[Figure: simple diagram of the SQL architecture]


SQL Commands

The standard SQL commands to interact with relational
databases are CREATE, SELECT, INSERT, UPDATE, DELETE and DROP.
These commands can be classified into the following groups
based on their nature:

DDL - Data Definition Language

Sr.No.   Command   Description
1.       CREATE    Creates a new table, a view of a table, or other object in the database.
2.       ALTER     Modifies an existing database object, such as a table.
3.       DROP      Deletes an entire table, a view of a table or other objects in the database.

DML - Data Manipulation Language

Sr.No.   Command   Description
1.       SELECT    Retrieves certain records from one or more tables.
2.       INSERT    Creates a record.
3.       UPDATE    Modifies a record.
4.       DELETE    Deletes a record.

DCL - Data Control Language

Sr.No.   Command   Description
1.       GRANT     Gives a privilege to a user.
2.       REVOKE    Takes back a privilege granted to a user.
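The DDL and DML commands listed above can be tried from R. The following is only a minimal sketch, assuming the DBI and RSQLite packages are installed and using a throwaway in-memory SQLite database; the customers table and its rows are illustrative. SQLite does not provide GRANT or REVOKE, so the DCL commands are not shown.

```r
library(DBI)

# Assumption: ":memory:" gives a temporary SQLite database that vanishes on disconnect.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# DDL: CREATE a table.
dbExecute(con, "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, age INTEGER, salary REAL)")

# DML: INSERT two records, then UPDATE one of them.
dbExecute(con, "INSERT INTO customers (id, name, age, salary) VALUES (1, 'Ramesh', 32, 2000.00)")
dbExecute(con, "INSERT INTO customers (id, name, age, salary) VALUES (2, 'Khilan', 25, 1500.00)")
dbExecute(con, "UPDATE customers SET salary = 1800.00 WHERE id = 2")

# DML: SELECT returns the matching records as an R data frame.
young <- dbGetQuery(con, "SELECT name, salary FROM customers WHERE age < 30")
print(young)

# DML: DELETE one record and count what is left.
dbExecute(con, "DELETE FROM customers WHERE id = 1")
remaining <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM customers")$n

# DDL: DROP the table again.
dbExecute(con, "DROP TABLE customers")
tables_left <- dbListTables(con)

dbDisconnect(con)
```

dbExecute() is used for statements that change the database, and dbGetQuery() for statements that return rows.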

3. SQL - RDBMS Concepts


What is RDBMS?
RDBMS stands for Relational Database Management System. RDBMS is the basis for
SQL, and for all modern database systems like MS SQL Server, IBM DB2, Oracle,
MySQL, and Microsoft Access.

A relational database management system (RDBMS) is a database management system
(DBMS) that is based on the relational model as introduced by E. F. Codd.

What is a table?
The data in an RDBMS is stored in database objects called tables. A table
is basically a collection of related data entries and it consists of numerous columns and
rows.

The following is an example of a CUSTOMERS table:

+----+----------+-----+-----------+----------+
| ID | NAME     | AGE | ADDRESS   | SALARY   |
+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
|  2 | Khilan   |  25 | Delhi     |  1500.00 |
|  3 | kaushik  |  23 | Kota      |  2000.00 |
|  4 | Chaitali |  25 | Mumbai    |  6500.00 |
|  5 | Hardik   |  27 | Bhopal    |  8500.00 |
|  6 | Komal    |  22 | MP        |  4500.00 |
|  7 | Muffy    |  24 | Indore    | 10000.00 |
+----+----------+-----+-----------+----------+

What is a field?
Every table is broken up into smaller entities called fields. The fields in the
CUSTOMERS table consist of ID, NAME, AGE, ADDRESS and SALARY.

A field is a column in a table that is designed to maintain specific information about
every record in the table.

What is a Record or a Row?

A record, also called a row of data, is each individual entry
that exists in a table. For example, there are 7 records in the
above CUSTOMERS table. Following is a single row of data or
record in the CUSTOMERS table:

+----+----------+-----+-----------+----------+
|  1 | Ramesh   |  32 | Ahmedabad |  2000.00 |
+----+----------+-----+-----------+----------+

A record is a horizontal entity in a table.

What is a column?
A column is a vertical entity in a table that contains all information associated with a
specific field in a table.

For example, a column in the CUSTOMERS table is ADDRESS, which
represents location description and would be as shown below:

+-----------+
| ADDRESS   |
+-----------+
| Ahmedabad |
| Delhi     |
| Kota      |
| Mumbai    |
| Bhopal    |
| MP        |
| Indore    |
+-----------+
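The same ideas of table, field, record and column map onto an R data frame. The sketch below rebuilds the CUSTOMERS table in base R; the variable names (customers, fields, first_record, address_column) are illustrative only.

```r
# The CUSTOMERS table from above as an R data frame.
customers <- data.frame(
  ID      = 1:7,
  NAME    = c("Ramesh", "Khilan", "kaushik", "Chaitali", "Hardik", "Komal", "Muffy"),
  AGE     = c(32, 25, 23, 25, 27, 22, 24),
  ADDRESS = c("Ahmedabad", "Delhi", "Kota", "Mumbai", "Bhopal", "MP", "Indore"),
  SALARY  = c(2000, 1500, 2000, 6500, 8500, 4500, 10000),
  stringsAsFactors = FALSE
)

fields         <- names(customers)   # the fields (column names)
n_records      <- nrow(customers)    # number of records (rows)
first_record   <- customers[1, ]     # one record: a horizontal slice
address_column <- customers$ADDRESS  # one column: a vertical slice

print(first_record)
print(address_column)
```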

What is a NULL value?


A NULL value in a table is a value in a field that appears to be blank, which means a
field with a NULL value is a field with no value.

It is very important to understand that a NULL value is different from a zero value or a
field that contains spaces. A field with a NULL value is one that has been left blank
during record creation.
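The difference between NULL, zero and spaces can be checked from R. This is a small sketch under the assumption that DBI and RSQLite are available; the table t and its columns are made up for the illustration. A NULL stored in SQLite comes back to R as NA.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbExecute(con, "CREATE TABLE t (id INTEGER, salary REAL, address TEXT)")

# Record 1 stores a zero and a string of spaces: both are real values.
dbExecute(con, "INSERT INTO t VALUES (1, 0, '   ')")
# Record 2 leaves salary and address blank, so they are stored as NULL.
dbExecute(con, "INSERT INTO t (id) VALUES (2)")

# Only record 2 matches IS NULL; the zero in record 1 does not.
null_ids <- dbGetQuery(con, "SELECT id FROM t WHERE salary IS NULL")$id

# In the returned data frame, NULL values appear as NA.
rows <- dbGetQuery(con, "SELECT * FROM t ORDER BY id")
print(rows)

dbDisconnect(con)
```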

SQL Constraints
Constraints are the rules enforced on the data columns of a table. These are used to limit the
type of data that can go into a table. This ensures the accuracy and reliability of the data
in the database.

Constraints can either be column level or table level. Column level constraints are applied
only to one column, whereas table level constraints are applied to the entire table.

Following are some of the most commonly used constraints available in SQL:

● NOT NULL Constraint − Ensures that a column cannot have a NULL value.

● DEFAULT Constraint − Provides a default value for a column when none is specified.

● UNIQUE Constraint − Ensures that all the values in a column are different.

● PRIMARY Key − Uniquely identifies each row/record in a database table.

● FOREIGN Key − Uniquely identifies a row/record in another database table.

● CHECK Constraint − Ensures that all values in a column satisfy certain conditions.

● INDEX − Used to create and retrieve data from the database very quickly.
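Several of these constraints can be seen in action with a small experiment. The sketch below assumes DBI and RSQLite are installed; the email column and the helper is_rejected() are hypothetical names introduced only for this example.

```r
library(DBI)

con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Column-level constraints: PRIMARY KEY, NOT NULL, UNIQUE, CHECK and DEFAULT.
dbExecute(con, "
  CREATE TABLE customers (
    id     INTEGER PRIMARY KEY,
    name   TEXT    NOT NULL,
    email  TEXT    UNIQUE,
    age    INTEGER CHECK (age >= 18),
    salary REAL    DEFAULT 5000.00
  )")

# A valid record; salary is omitted, so the DEFAULT value is used.
dbExecute(con, "INSERT INTO customers (id, name, email, age)
                VALUES (1, 'Ramesh', 'r@example.com', 32)")

# Hypothetical helper: TRUE if the database rejects the statement.
is_rejected <- function(sql) {
  tryCatch({
    dbExecute(con, sql)
    FALSE
  }, error = function(e) TRUE)
}

not_null_violation <- is_rejected(
  "INSERT INTO customers (id, name, age) VALUES (2, NULL, 25)")
unique_violation <- is_rejected(
  "INSERT INTO customers (id, name, email, age) VALUES (3, 'Komal', 'r@example.com', 22)")
check_violation <- is_rejected(
  "INSERT INTO customers (id, name, age) VALUES (4, 'Muffy', 12)")

default_salary <- dbGetQuery(con, "SELECT salary FROM customers WHERE id = 1")$salary

dbDisconnect(con)
```

Each rejected INSERT raises an error in R, which is why the helper wraps the call in tryCatch().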

Data Integrity
The following categories of data integrity exist with each RDBMS:

● Entity Integrity − There are no duplicate rows in a table.

● Domain Integrity − Enforces valid entries for a given column by restricting the type,
the format, or the range of values.

● Referential Integrity − Rows that are used by other records cannot be deleted.

● User-Defined Integrity − Enforces some specific business rules that do not fall into
entity, domain or referential integrity.

Database Normalization

Database normalization is the process of efficiently organizing
data in a database. There are two reasons for normalization:

● Eliminating redundant data, for example, storing the same data in more than one
table.

● Ensuring data dependencies make sense.

Normalization guidelines are divided into normal forms; think of a form as the format or
the way a database structure is laid out. The aim of normal forms is to organize the
database structure, so that it complies with the rules of first normal form, then second
normal form and finally the third normal form.

It is your choice to take it further and go to the fourth normal form, fifth normal form and
so on, but in general, the third normal form is more than enough.


● First Normal Form (1NF)

● Second Normal Form (2NF)

● Third Normal Form (3NF)

4. SQL - RDBMS Databases


MySQL
MySQL is an open source SQL database, developed by the Swedish company
MySQL AB. MySQL is pronounced "my ess-que-ell," in contrast with SQL,
pronounced "sequel."

MySQL supports many different platforms including Microsoft Windows, the major
Linux distributions, UNIX, and Mac OS X.

MySQL has free and paid versions, depending on its usage (non-commercial/commercial)
and features. MySQL comes with a very fast, multi-threaded, multi-user and robust SQL
database server.

History

● Development of MySQL by Michael Widenius & David Axmark beginning in 1994.

● First internal release on 23rd May 1995.

● Windows version was released on the 8th January 1998 for Windows 95 and NT.

● Version 3.23: beta from June 2000, production release January 2001.

● Version 4.0: beta from August 2002, production release March 2003 (unions).

● Version 4.01: beta from August 2003, Jyoti adopts MySQL for database tracking.

● Version 4.1: beta from June 2004, production release October 2004.

● Version 5.0: beta from March 2005, production release October 2005.

● Sun Microsystems acquired MySQL AB on the 26th February 2008.

● Version 5.1: production release 27th November 2008.

Features

● High Performance and High Availability.

● Scalability and Flexibility (run anything), etc.

MS SQL Server

MS SQL Server is a Relational Database Management System
developed by Microsoft. Its primary query languages are:

● T-SQL

● ANSI SQL

History

1987 - Sybase releases SQL Server for UNIX.

1988 - Microsoft, Sybase, and Ashton-Tate port SQL Server to OS/2.

1989 - Microsoft, Sybase, and Ashton-Tate release SQL Server 1.0 for OS/2.

1990 - SQL Server 1.1 is released with support for Windows 3.0 clients. Ashton-Tate
drops out of SQL Server development.

2000 - Microsoft releases SQL Server 2000.

2001 - Microsoft releases XML for SQL Server Web Release 1 (download).

2002 - Microsoft releases SQLXML 2.0 (renamed from XML for SQL Server).

2002 - Microsoft releases SQLXML 3.0.

2005 - Microsoft releases SQL Server 2005 on November 7th, 2005.


Features

● High Performance and High Availability

● XML integration, TRY...CATCH and Database Mail, etc.

ORACLE
It is a very large multi-user database management system. Oracle is a relational
database management system developed by 'Oracle Corporation'.

Oracle works to efficiently manage its resources, a database of information, among the
multiple clients requesting and sending data in the network. It is an excellent database
server choice for client/server computing. Oracle supports all major operating systems for
both clients and servers, including MSDOS, NetWare, UnixWare, OS/2 and most UNIX
flavors.

History
Oracle began in 1977 and celebrated 32 years in the industry (from 1977 to 2009).

1977 - Larry Ellison, Bob Miner and Ed Oates founded Software Development
Laboratories to undertake development work.

1979 - Version 2.0 of Oracle was released and it became the first commercial relational
database and first SQL database. The company changed its name to Relational Software
Inc. (RSI).

1981 - RSI started developing tools for Oracle.

1982 - RSI was renamed to Oracle Corporation.

1983 - Oracle released version 3.0, rewritten in the C language and running on multiple
platforms.

1984 - Oracle version 4.0 was released. It contained features like concurrency control,
multi-version read consistency, etc.

2007 - Oracle released Oracle11g. The new version focused on better partitioning, easy
migration, etc.

Features

● Concurrency, Read Consistency, Locking Mechanisms and Quiesce Database.

● Portability, Self-managing database, SQL*Plus, ASM and Scheduler.

● Resource Manager, Data Warehousing, Materialized views and Bitmap indexes.

● Table compression, Parallel Execution, Analytic SQL, Data mining and Partitioning.

MS ACCESS:
This is one of the most popular Microsoft products. Microsoft Access is an entry-level
database management software.

The MS Access database is not only inexpensive but also a powerful database for small-scale
projects. MS Access uses the Jet database engine, which utilizes a specific SQL language
dialect (sometimes referred to as Jet SQL). MS Access comes with the professional
edition of the MS Office package. MS Access has an easy-to-use, intuitive graphical interface.

1992 - Access version 1.0 was released.

1993 - Access 1.1 was released to improve compatibility, with the inclusion of the Access
Basic programming language.

The most significant transition was from Access 97 to Access 2000.

2007 - Access 2007 introduced a new database format, ACCDB, which supports complex data
types such as multi-valued and attachment fields.

Features:

● Users can create tables, queries, forms and reports and connect them together with
macros.

● Option of importing and exporting the data to many formats including
Excel, Outlook, ASCII, dBase, Paradox, FoxPro, SQL Server, Oracle, ODBC, etc.

● There is also the Jet Database format (MDB or ACCDB in Access 2007),
which can contain the application and data in one file. This makes it very
convenient to distribute the entire application to another user, who can run it in
disconnected environments.

● Microsoft Access offers parameterized queries. These queries and Access tables
can be referenced from other programs like VB6 and .NET through DAO or ADO.

● The desktop editions of Microsoft SQL Server can be used with Access as
an alternative to the Jet Database Engine.

● Microsoft Access is a file server-based database. Unlike the client-server relational
database management systems (RDBMS), Microsoft Access does not implement database
triggers, stored procedures or transaction logging.

5. R - OVERVIEW
Evolution of R:
R was initially written by Ross Ihaka and Robert Gentleman at the Department of
Statistics of the University of Auckland in Auckland, New Zealand. R made its first
appearance in 1993.

● A large group of individuals has contributed to R by sending code and bug reports.

● Since mid-1997 there has been a core group (the "R Core Team") who can modify
the R source code archive.

Features of R:

As stated earlier, R is a programming language and software
environment for statistical analysis, graphics representation
and reporting. The following are the important features of R:

● R is a well-developed, simple and effective programming language which
includes conditionals, loops, user-defined recursive functions and input and output
facilities.

● R has an effective data handling and storage facility.

● R provides a suite of operators for calculations on arrays, lists, vectors and
matrices.

● R provides a large, coherent and integrated collection of tools for data analysis.

● R provides graphical facilities for data analysis and display, either directly on the
computer or printed on paper.

In conclusion, R is one of the world’s most widely used statistical programming languages.
It is a popular choice among data scientists and is supported by a vibrant and talented
community of contributors. R is taught in universities and deployed in mission-critical
business applications. This tutorial introduces R programming
with suitable examples in simple and easy steps.

Local Environment Setup:


If you want to set up a local environment for R, you can follow the steps given
below.

Windows Installation
You can download the Windows installer version of R from R-3.2.2 for Windows (32/64
bit) and save it in a local directory.

Linux Installation
R is available as a binary for many versions of Linux at the location R Binaries. The
instructions to install R on Linux vary from flavor to flavor.

R - Basic Syntax:


Depending on the needs, you can program either at the R command prompt or you can use an
R script file to write your program. Let's check both one by one.

R Command Prompt

Once you have the R environment set up, it’s easy to start the
R command prompt by typing the following command at
your command prompt:

$ R

This will launch the R interpreter and you will get a prompt > where
you can start typing your program as follows:

> myString <- "Hello, World!"
> print(myString)
[1] "Hello, World!"


Here the first statement defines a string variable myString, to which we assign the string
"Hello, World!", and the next statement uses print() to print the value stored in the variable
myString.

R Script File

Usually, you will do your programming by writing your programs
in script files and then execute those scripts at your
command prompt with the help of the R interpreter called Rscript.
So let's start by writing the following code in a text file called
test.R:

# My first program in R Programming
myString <- "Hello, World!"
print(myString)

Save the above code in a file test.R and execute it at the Linux command prompt as given
below. Even if you are using Windows or another system, the syntax will remain the same.


$ Rscript test.R

When we run the above program, it produces the following result.

[1] "Hello, World!"

Comments

Comments are like helping text in your R program and they are
ignored by the interpreter while executing your actual program.
A single-line comment is written using # at the beginning of the
statement as follows:

# My first program in R Programming

R does not support multi-line comments but you can perform a
trick which is something as follows:

if(FALSE) {
   "This is a demo for multi-line comments and it should be put inside either a
   single OR double quote"
}

myString <- "Hello, World!"

6. R - Data Types

In contrast to other programming languages like C and Java, in R
the variables are not declared as some data type. The variables
are assigned R-objects and the data type of the R-object
becomes the data type of the variable. There are many types of
R-objects. The frequently used ones are:

● Vectors


● Lists

● Matrices

● Arrays

● Factors

● Data Frames

The simplest of these objects is the vector object and there are six data types of these
atomic vectors, also termed as six classes of vectors. The other R-Objects are built upon
the atomic vectors.

Data type   Example                               Verify                                     Result
Logical     TRUE, FALSE                           v <- TRUE; print(class(v))                 [1] "logical"
Numeric     12.3, 5, 999                          v <- 23.5; print(class(v))                 [1] "numeric"
Integer     2L, 34L, 0L                           v <- 2L; print(class(v))                   [1] "integer"
Complex     3 + 2i                                v <- 2+5i; print(class(v))                 [1] "complex"
Character   'a', "good", "TRUE", '23.4'           v <- "TRUE"; print(class(v))               [1] "character"
Raw         "Hello" is stored as 48 65 6c 6c 6f   v <- charToRaw("Hello"); print(class(v))   [1] "raw"

In R programming, the very basic data types are the R-objects called vectors which hold
elements of different classes as shown above. Please note in R the number of classes is
not confined to only the above six types. For example, we can use many atomic vectors
and create an array whose class will become array.

Vectors:

When you want to create a vector with more than one element, you should use the c() function,
which combines the elements into a vector.

Lists:

A list is an R-object which can contain many different types of elements inside it like
vectors, functions and even another list inside it.

Matrices:

A matrix is a two-dimensional rectangular data set. It can be created using a vector input
to the matrix function.


Arrays:

While matrices are confined to two dimensions, arrays can be of any number of
dimensions. The array function takes a dim attribute which creates the required number
of dimensions. For example, dim = c(3, 3, 2) creates an array with two elements which are
3x3 matrices each.

Factors:

Factors are the R-objects which are created using a vector. A factor stores the vector along
with the distinct values of the elements in the vector as labels. The labels are always
character, irrespective of whether the input vector is numeric, character or Boolean.
Factors are useful in statistical modeling.

Factors are created using the factor() function. The nlevels() function gives the
count of levels.

Data Frames:

Data frames are tabular data objects. Unlike in a matrix, each column of a data frame can
contain a different mode of data. The first column can be numeric while the second
column can be character and the third column can be logical. A data frame is a list of
vectors of equal length.

Data Frames are created using the data.frame() function.
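The six kinds of R-objects described above can be created as follows. This is a short base R sketch; all object names and values are illustrative.

```r
# Vector: c() combines elements into one vector.
apple <- c("red", "green", "yellow")

# List: elements of different types, here a vector, a number and a function.
list1 <- list(c(2, 5, 3), 21.3, sin)

# Matrix: a two-dimensional rectangular data set built from a vector.
M <- matrix(c("a", "a", "b", "c", "b", "a"), nrow = 2, ncol = 3, byrow = TRUE)

# Array: dim = c(3, 3, 2) gives two 3x3 matrices.
a <- array(c("green", "yellow"), dim = c(3, 3, 2))

# Factor: stores the vector together with its distinct values as levels.
apple_colors <- c("green", "green", "yellow", "red", "red", "red", "green")
factor_apple <- factor(apple_colors)
print(nlevels(factor_apple))

# Data frame: columns of equal length that may hold different modes of data.
BMI <- data.frame(
  gender = c("Male", "Male", "Female"),
  height = c(152, 171.5, 165),
  weight = c(81, 93, 78),
  age    = c(42L, 38L, 26L)
)
print(BMI)
```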

R – Operator:

An operator is a symbol that tells the interpreter to perform specific mathematical or
logical manipulations. The R language is rich in built-in operators and provides the following
types of operators.

Types of Operators:

We have the following types of operators in R programming:

● Arithmetic Operators

● Relational Operators

● Logical Operators

● Assignment Operators

● Miscellaneous Operators

Arithmetic Operators:

Following table shows the arithmetic operators supported by R language. The operators
act on each element of the vector.

Operator   Description
+          Adds two vectors
-          Subtracts the second vector from the first
*          Multiplies both vectors
/          Divides the first vector by the second
%%         Gives the remainder of the first vector divided by the second
%/%        Gives the result of integer division of the first vector by the second (quotient)
^          Raises the first vector to the exponent of the second vector
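A brief base R sketch of the arithmetic operators acting element by element; the vectors v and w are arbitrary example values.

```r
v <- c(2, 5.5, 6)
w <- c(8, 3, 4)

add_vw  <- v + w    # 10.0  8.5 10.0
sub_vw  <- v - w    # -6.0  2.5  2.0
mul_vw  <- v * w    # 16.0 16.5 24.0
div_vw  <- v / w    # 2/8, 5.5/3 and 6/4
rem_vw  <- v %% w   # remainders: 2.0 2.5 2.0
quot_vw <- v %/% w  # integer quotients: 0 1 1
pow_vw  <- v ^ w    # 256.000 166.375 1296.000

print(add_vw)
print(rem_vw)
print(pow_vw)
```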

Relational Operators:

Following table shows the relational operators supported by R language. Each element of
the first vector is compared with the corresponding element of the second vector. The
result of comparison is a Boolean value.

Operator   Description
>          Checks if each element of the first vector is greater than the corresponding
           element of the second vector.
<          Checks if each element of the first vector is less than the corresponding
           element of the second vector.
==         Checks if each element of the first vector is equal to the corresponding
           element of the second vector.
<=         Checks if each element of the first vector is less than or equal to the
           corresponding element of the second vector.
>=         Checks if each element of the first vector is greater than or equal to the
           corresponding element of the second vector.
!=         Checks if each element of the first vector is unequal to the corresponding
           element of the second vector.

Logical Operators:

Following table shows the logical operators supported by R language. They are applicable
only to vectors of type logical, numeric or complex. All non-zero numbers are
treated as the logical value TRUE, and zero is treated as FALSE.

Operator   Description
&          Called the element-wise Logical AND operator. It combines each element of the
           first vector with the corresponding element of the second vector and gives
           TRUE if both elements are TRUE.
|          Called the element-wise Logical OR operator. It combines each element of the
           first vector with the corresponding element of the second vector and gives
           TRUE if at least one of the elements is TRUE.
!          Called the Logical NOT operator. Takes each element of the vector and gives the
           opposite logical value.

The logical operators && and || work on single values rather than element by element and
give a single-element result. Older versions of R silently used only the first element of
longer vectors; R 4.3.0 and later raise an error instead.

Operators  Description
&&         Called the Logical AND operator. Takes a single value from each side and gives
           TRUE only if both are TRUE.
||         Called the Logical OR operator. Takes a single value from each side and gives
           TRUE if at least one of them is TRUE.
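The relational and logical operators can be tried on small vectors. This base R sketch uses arbitrary example values; the && and || calls use single values on both sides so that they run on any R version.

```r
v <- c(2, 5.5, 6, 9)
w <- c(8, 2.5, 14, 9)

# Relational operators compare element by element.
gt <- v > w     # FALSE  TRUE FALSE FALSE
eq <- v == w    # FALSE FALSE FALSE  TRUE
le <- v <= w    #  TRUE FALSE  TRUE  TRUE
ne <- v != w    #  TRUE  TRUE  TRUE FALSE

# Logical operators on numeric vectors: non-zero counts as TRUE, zero as FALSE.
a <- c(3, 0, 1, 2)
b <- c(4, 0, 0, 5)
and_ab <- a & b   #  TRUE FALSE FALSE  TRUE
or_ab  <- a | b   #  TRUE FALSE  TRUE  TRUE
not_a  <- !a      # FALSE  TRUE FALSE FALSE

# && and || with single values on each side.
scalar_and <- TRUE && FALSE   # FALSE
scalar_or  <- TRUE || FALSE   # TRUE

print(gt)
print(and_ab)
```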

Assignment Operators:

These operators are used to assign values to vectors.

Operators        Description
<-, = or <<-     Called Left Assignment
-> or ->>        Called Right Assignment

Miscellaneous Operators:

These operators are used for specific purposes and not for general mathematical or logical
computation.

Operators   Description
:           Colon operator. It creates a series of numbers in sequence for a vector.
%in%        This operator is used to identify if an element belongs to a vector.
%*%         This operator is used to multiply two matrices, for example a matrix with its
            transpose.
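The assignment and miscellaneous operators can be sketched in base R as follows; the names v1, v2, v3, s, M and P are illustrative.

```r
# Left assignment with <- and =, right assignment with ->.
v1 <- c(3, 1)
v2 = c(3, 1)
c(3, 1) -> v3

# Colon operator: a sequence of numbers.
s <- 2:8

# %in%: does an element belong to a vector?
in_s   <- 8 %in% s    # TRUE
not_in <- 12 %in% s   # FALSE

# %*%: matrix multiplication, here a 2x3 matrix with its 3x2 transpose.
M <- matrix(c(2, 6, 5, 1, 10, 4), nrow = 2, ncol = 3, byrow = TRUE)
P <- M %*% t(M)
print(P)
```

The product P is a 2x2 matrix because a 2x3 matrix is multiplied by a 3x2 matrix.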

7. SQL DATABASE and R

Connecting to databases:


We can point R to the portal_mammals.sqlite database, stored in the data folder, using:

library(dplyr)
library(dbplyr)
mammals <- DBI::dbConnect(RSQLite::SQLite(), "data/portal_mammals.sqlite")

This command uses 2 packages that help dbplyr and dplyr talk to the SQLite database.
DBI is not something that you’ll use directly as a user. It allows R to send commands to
databases irrespective of the database management system used. The RSQLite package
allows R to interface with SQLite databases.

This command does not load the data into the R session (as the read_csv() function did).
Instead, it merely instructs R to connect to the SQLite database contained in the
portal_mammals.sqlite file.

Using a similar approach, you could connect to many other database management systems
that are supported by R including MySQL, PostgreSQL, BigQuery, etc.

Let’s take a closer look at the mammals database we just connected to:

src_dbi(mammals)

Just like a spreadsheet with multiple worksheets, a SQLite database can contain multiple
tables. In this case three of them are listed in the tbls row in the output above:

plots

species

surveys
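If the portal_mammals.sqlite file is not at hand, the same connection steps can be tried on a throwaway database. The sketch below assumes DBI and RSQLite are installed; it uses an in-memory database and a few made-up rows that only mimic the three table names, not the real data.

```r
library(DBI)

# Assumption: ":memory:" creates a temporary database instead of opening a file.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up toy rows standing in for the real tables.
dbWriteTable(con, "plots",
             data.frame(plot_id = 1:2, plot_type = c("Control", "Exclosure")))
dbWriteTable(con, "species",
             data.frame(species_id = c("DM", "NL"), taxa = c("Rodent", "Rodent")))
dbWriteTable(con, "surveys",
             data.frame(record_id = 1:3, plot_id = c(1L, 2L, 2L),
                        species_id = c("NL", "DM", "DM")))

# List the tables in the database and the fields of one table.
tables <- dbListTables(con)
survey_fields <- dbListFields(con, "surveys")
print(tables)
print(survey_fields)

dbDisconnect(con)
```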

Now that we know we can connect to the database, let’s explore how to get the data from
its tables into R.

Querying the database with the SQL syntax:


To connect to tables within a database, you can use the tbl() function from dplyr. This
function can be used to send SQL queries to the database. To demonstrate this
functionality, let’s select the columns “year”, “species_id”, and “plot_id” from the
surveys table:

tbl(mammals, sql("SELECT year, species_id, plot_id FROM surveys"))

With this approach you can use any of the SQL queries we have seen in the database
lesson.

Querying the database with the dplyr syntax

One of the strengths of dplyr is that the same operation can be done using dplyr’s verbs
instead of writing SQL. First, we select the table on which to do the operations by
creating the surveys object, and then we use the standard dplyr syntax as if it were a data
frame:

surveys <- tbl(mammals, "surveys")

surveys %>%
  select(year, species_id, plot_id)

In this case, the surveys object behaves like a data frame. Several functions that can be
used with data frames can also be used on tables from a database. For instance, the head()
function can be used to check the first 10 rows of the table:

head(surveys, n = 10)

The output of the head() command looks just like a regular data.frame: the table has 9
columns and the head() command shows us the first 10 rows. Note that the columns
plot_type, taxa, genus, and species are missing. These are now located in the tables plots
and species, which we will join together in a moment.


However, some functions don’t work quite as expected. For instance, let’s check how
many rows there are in total using nrow():

nrow(surveys)

That’s strange - R doesn’t know how many rows the surveys table contains - it returns NA
instead. You might have already noticed that the first line of the head() output included ??
indicating that the number of rows wasn’t known.

The reason for this behavior highlights a key difference between using dplyr on datasets
in memory (e.g. loaded into your R session via read_csv()) and those provided by a
database. To understand it, we take a closer look at how dplyr communicates with our
SQLite database.

SQL translation:

Relational databases typically use a special-purpose language, Structured Query Language
(SQL), to manage and query data.

For example, the following SQL query returns the first 10 rows from the surveys table:

SELECT *
FROM `surveys`
LIMIT 10

Behind the scenes, dplyr:

1. translates your R code into SQL

2. submits it to the database

3. translates the database’s response into an R data frame

To lift the curtain, we can use dplyr’s show_query() function to show which SQL
commands are actually sent to the database:


show_query(head(surveys, n = 10))

The output shows the actual SQL query sent to the database; it matches our manually
constructed SELECT statement above.

Instead of having to formulate the SQL query ourselves - and having to mentally switch
back and forth between R and SQL syntax - we can delegate this translation to dplyr.
(You don’t even need to know SQL to interact with a database via dplyr!)

dplyr, in turn, doesn’t do the real work of subsetting the table, either. Instead, it merely
sends the query to the database, waits for its response and returns it to us.

That way, R never gets to see the full surveys table - and that’s why it could not tell us
how many rows it contains. On the bright side, this allows us to work with large datasets -
even too large to fit into our computer’s memory.

dplyr can translate many different query types into SQL allowing us to, e.g., select()
specific columns, filter() rows, or join tables.

To see this in action, let’s compose a few queries with dplyr.

Simple database queries:

First, let’s only request rows of the surveys table in which weight is less than 5 and keep
only the species_id, sex, and weight columns.

surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

dplyr delays doing any work until the last possible moment - it collects together everything
you want to do and then sends it to the database in one step.

When you construct a dplyr query, you can connect multiple verbs into a single pipeline.
For example, we combined the filter() and select() verbs using the %>% pipe.
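The lazy behaviour described above can be observed directly. The following is a sketch, assuming dplyr, dbplyr, DBI and RSQLite are installed; it uses an in-memory database with a handful of made-up survey rows rather than the real portal data.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up toy rows with the three columns used in the query.
DBI::dbWriteTable(con, "surveys", data.frame(
  species_id = c("DM", "PF", "PF", "NL"),
  sex        = c("M", "F", "M", "F"),
  weight     = c(40, 4, 3, 180)
))

surveys <- tbl(con, "surveys")

# Building the pipeline only describes the query; no rows are fetched yet.
light <- surveys %>%
  filter(weight < 5) %>%
  select(species_id, sex, weight)

# Show the SQL that dplyr generated for the whole pipeline.
show_query(light)

# The row count of a lazy query is unknown to R.
n_lazy <- nrow(light)

# collect() runs the query and brings the result into R as a local data frame.
light_df <- collect(light)
n_local <- nrow(light_df)

DBI::dbDisconnect(con)
```

Once the result has been collected, ordinary data frame functions such as nrow() work as usual.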


Complex database queries:

dplyr enables database queries across one or multiple database tables, using the same
single- and multiple-table verbs you encountered previously. This means you can use the
same commands regardless of whether you interact with a remote database or local
dataset! This is a really useful feature if you work with large datasets: you can first
prototype your code on a small subset that fits into memory, and when your code is ready,
you can change the input dataset to your full database without having to change the
syntax.

On the other hand, being able to use SQL queries directly can be useful if your
collaborators have already put together complex queries to prepare the dataset that you
need for your analysis.

To illustrate how to use dplyr with these complex queries, we are going to join the plots
and surveys tables. The plots table in the database contains information about the different
plots surveyed by the researchers. To access it, we point the tbl() command to it:

plots <- tbl(mammals, "plots")

plots
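A join between plots and surveys can be sketched with the same two-table verbs used on data frames. The example below assumes dplyr, dbplyr, DBI and RSQLite are installed and uses made-up rows in an in-memory database; the join column plot_id comes from the text, everything else is illustrative.

```r
library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up toy rows for the two tables.
DBI::dbWriteTable(con, "plots",
                  data.frame(plot_id = 1:2, plot_type = c("Control", "Exclosure")))
DBI::dbWriteTable(con, "surveys",
                  data.frame(record_id = 1:3, plot_id = c(1L, 2L, 2L),
                             species_id = c("NL", "DM", "DM")))

plots   <- tbl(con, "plots")
surveys <- tbl(con, "surveys")

# The join is translated to SQL and executed inside the database.
joined <- inner_join(plots, surveys, by = "plot_id")
show_query(joined)

# Bring the joined rows into R.
joined_df <- collect(joined)
print(joined_df)

DBI::dbDisconnect(con)
```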

Creating a new SQLite database

So far, we have used a previously prepared SQLite database. But we can also use R to
create a new database, e.g. from existing csv files. Let’s recreate the mammals database
that we’ve been working with, in R. First let’s download and read in the csv files. We’ll
import tidyverse to gain access to the read_csv() function.

download.file("https://ndownloader.figshare.com/files/3299483",
              "data/species.csv")
library(tidyverse)
species <- read_csv("data/species.csv")
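Once a table is available as a data frame, it can be written into a new SQLite file. The sketch below is one possible way to do this with DBI::dbWriteTable(), which may differ from the approach in the original lesson; it uses a temporary file and a small stand-in data frame so that it runs without the download, and the rows shown are illustrative only.

```r
library(DBI)

# Stand-in for the species data frame read from the csv file.
species <- data.frame(
  species_id = c("AB", "DM"),
  genus      = c("Amphispiza", "Dipodomys"),
  taxa       = c("Bird", "Rodent")
)

# Assumption: a temporary file path; replace it with a real path to keep the database.
db_file <- tempfile(fileext = ".sqlite")

# Connecting to a file that does not exist yet creates a new, empty database.
my_db <- dbConnect(RSQLite::SQLite(), db_file)

# Write the data frame into the database as a table named "species".
dbWriteTable(my_db, "species", species)

tables <- dbListTables(my_db)
n_species <- dbGetQuery(my_db, "SELECT COUNT(*) AS n FROM species")$n

dbDisconnect(my_db)
db_exists <- file.exists(db_file)
```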


8. CONCLUSION

Finally, SQL and R can be combined and used together, which gives data analysts many
opportunities to store and analyze data about consumers and other people, and to
produce better results based on that data. Interfacing with databases
using dplyr focuses on retrieving and analyzing datasets by generating SELECT SQL
statements, but it doesn’t modify the database itself. dplyr does not offer functions to
UPDATE or DELETE entries. If you need these functionalities, you will need to use
additional R packages. Furthermore, the dplyr package you used in the previous chapter,
in conjunction with dbplyr, supports connecting to the widely used open source databases
SQLite, MySQL and PostgreSQL, as well as Google’s BigQuery, and it can also be
extended to other database types (a vignette in the dplyr package explains how to do it).


9. REFERENCES

1. R Overview:
https://www.tutorialspoint.com/r/r_overview.htm
2. SQL Overview:
https://www.tutorialspoint.com/sql/sql-overview.htm
3. SQL Databases and R:
https://datacarpentry.org/R-ecology-lesson/05-r-and-databases.html
