# ---- Workspace Setup ----
library('nycflights13') # install.packages('nycflights13')
library('DBI') # install.packages('DBI')
library('dplyr') # install.packages('dplyr')
library('dbplyr') # install.packages('dbplyr')
# Load the sqrrl package
# devtools::isntall_github('gadenbuie/sqrrl')
library('sqrrl')
# Alias to create nice tables
<- function(...) knitr::kable(..., format = 'html') as_table
🐿️ Sqrrl
UPDATE 10/17/2017: Don’t use this! I made it for myself so it works for what I needed it for. But you probably shouldn’t use this package. There are better ways of building SQL queries that are safer and better (and probably even easier). For now, let me just point you in the direction of db.rstudio.com, dplyr/dbplyr, and the recently added glue_sql()
function in the glue package.
Project Links: source, documentation
sqrrl
is a small collection of utility functions that help build text-based SQL queries in an R-style native-feeling and functional manner.
Unlike other packages that build SQL queries using an object-oriented style, sqrrl
provides small functions that produce SQL snippets and can be linked together to compose SQL queries. The result is that the code to produce the SQL statement reads much like the SQL statement iteself. On the other hand, sqrrl
doesn’t know anything about your database and can’t help you out with completions, etc.
Where this package is most useful is with Shiny web apps that interact with a MySQL backend. The utilities are all built so that queries can be built using column names and values stored inside ordinary R data structures.
The following is a quick demonstration of how the package works using the nyclights13
dataset. For more information on sqrrl
, check out the package documentation.
Setup flights
database
To demonstrate the features in sqrrl
, let’s set up an in-memory SQLite database using the nycflights13
dataset featured in dplyr
and dbplyr
.
First, load (or install) the pacakges and functions that we need.
Then load the flights
data frame from nycflights13
into the in-memory SQLite database (this code comes direclty from the dbplyr documentation).
# ---- Example Setup ----
# Create an in-memory SQLite database
<- dbConnect(RSQLite::SQLite(), path = ":memory:")
con
# Use dplyr/dbplyr to copy flights table to the temp db
copy_to(con, nycflights13::flights, "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
# Show first 5 rows
dbGetQuery(con, 'SELECT * FROM flights LIMIT 5') %>%
as_table
year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 |
2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | 1416 | 5 | 29 | 1357034400 |
2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1089 | 5 | 40 | 1357034400 |
2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | 1576 | 5 | 45 | 1357034400 |
2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | N668DN | LGA | ATL | 116 | 762 | 6 | 0 | 1357038000 |
Querying flights
Often, when I’m working with a database, I’ll create an alias for dbGetQuery
with the database or table name. Inside the alias function I usually add any data type modifications that might need to be applied, and I suppress the warning messages that DBI
outputs about data type conversions.
<- function(query, ...) {
flights suppressWarnings(dbGetQuery(con, query, ...))
}
Now we can repeat the above SELECT
statement using sqrrl
, this time limiting the columns selected.
<- c('year', 'month', 'day',
flight_cols 'carrier', 'flight', 'tailnum')
SELECT(flight_cols) %+%
FROM('flights') %+%
LIMIT(5) %>%
%>%
flights as_table
year | month | day | carrier | flight | tailnum |
---|---|---|---|---|---|
2013 | 1 | 1 | UA | 1545 | N14228 |
2013 | 1 | 1 | UA | 1714 | N24211 |
2013 | 1 | 1 | AA | 1141 | N619AA |
2013 | 1 | 1 | B6 | 725 | N804JB |
2013 | 1 | 1 | DL | 461 | N668DN |
Note that sqrrl
provides the %+%
infix operator, which is essentially just an alias for paste(x, y)
.
'a' %+% 'b'
[1] "a b"
# or PHP style without a padded space: paste0
'a' %.% 'b'
[1] "ab"
We can also do more complicated queries, like finding the average arrival delay, grouped by tail number:
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay')) %+%
LIMIT(10) %>%
%>%
flights as_table
tailnum | delay | n |
---|---|---|
N844MH | 320.0000 | 1 |
N911DA | 294.0000 | 1 |
N922EV | 276.0000 | 1 |
N587NW | 264.0000 | 1 |
N851NW | 219.0000 | 1 |
N928DN | 201.0000 | 1 |
N7715E | 188.0000 | 1 |
N654UA | 185.0000 | 1 |
N665MQ | 174.6667 | 6 |
N427SW | 157.0000 | 1 |
sqrrl
also provides a wrapper around the python utility sqlformat
that can be used to pretty-print SQL formats.
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay')) %+%
LIMIT(10) %>%
%>% cat sqlformat
SELECT tailnum,
avg(arr_delay) AS delay,
count(*) AS n
FROM flights
GROUP BY tailnum
ORDER BY delay DESC
LIMIT 10
Let’s use the above as an inner query and filter on n > 100
:
<- SELECT(
query_all_arr_delay 'tailnum', delay = 'avg(arr_delay)', n = 'count(*)'
%+%
) FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay'))
SELECT() %+%
FROM(delay = parens(query_all_arr_delay)) %+%
WHERE(gt(n = 100)) %+%
LIMIT(10) %>%
%>%
flights as_table
tailnum | delay | n |
---|---|---|
N11119 | 30.30657 | 148 |
N16919 | 29.88745 | 251 |
N14998 | 27.92202 | 230 |
N15910 | 27.61132 | 280 |
N13123 | 25.97345 | 121 |
N11192 | 25.85235 | 154 |
N14950 | 25.28780 | 219 |
N21130 | 24.96610 | 126 |
N24128 | 24.91803 | 129 |
N22971 | 24.74766 | 230 |
Queries are just strings
Notice that unlike other packages, sqrrl
can’t build the nested queries for you. You still need to understand the structure of the database and the structure of the query.
But when compared with the final output of the query, the sqrrl
version looks a lot like SQL transliterated into R functions.
SELECT() %+%
FROM(delay = parens(
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay'))
%+%
)) WHERE(gt(n = 100)) %+%
LIMIT(10) %>%
sqlformat() %>%
cat()
SELECT *
FROM (
SELECT tailnum,
avg(arr_delay) AS delay,
count(*) AS n
FROM flights
GROUP BY tailnum
ORDER BY delay DESC
) delayWHERE n>100
LIMIT 10
For me, at least, where the goal is to write SQL queries as bare strings, sqrrl
lets me write in R and think in SQL without having to add a huge number of paste
and paste0
functions.
Everything in sqrrl
takes input data from regular R data types and outputs an SQL snippet.
For an example of nearly everything each of the functions can do, see the Getting Started section in the documentation.
A more complicated SELECT query
As a final example, here is a fully-loaded select query.
SELECT('`year`', 'carrier', 'flight', 'dest',
n = 'count(*)',
avg_dist = 'avg(distance)',
avg_air_time = 'avg(air_time)') %+%
FROM(f = 'flights') %+%
WHERE(
BETWEEN('month', 6, 12),
'carrier' %IN% c("UA", "AA", "US", "WN"),
geq('dep_time' = 800),
leq('air_time' = 120),
'origin' %LIKE% 'JFK'
%+%
) GROUP_BY('`year`', 'carrier', 'flight', 'dest') %+%
ORDER_BY(DESC('n')) %+%
LIMIT(10) %>%
sqlformat(.) %>% cat; . } %>%
{ %>%
flights as_table
SELECT `year`,
carrier,
flight,
dest,count(*) AS n,
avg(distance) AS avg_dist,
avg(air_time) AS avg_air_time
FROM flights f
WHERE `month` BETWEEN 6 AND 12
AND carrier IN ("UA", "AA", "US", "WN")
AND dep_time>=800
AND air_time<=120
AND origin LIKE("JFK")
GROUP BY `year`,
carrier,
flight,
destORDER BY n DESC
LIMIT 10
year | carrier | flight | dest | n | avg_dist | avg_air_time |
---|---|---|---|---|---|---|
2013 | US | 1831 | CLT | 178 | 541 | 86.95506 |
2013 | US | 425 | CLT | 126 | 541 | 84.92857 |
2013 | AA | 178 | BOS | 119 | 187 | 37.94118 |
2013 | AA | 256 | BOS | 117 | 187 | 39.13675 |
2013 | AA | 2314 | BOS | 115 | 187 | 37.85217 |
2013 | US | 1802 | CLT | 112 | 541 | 87.23214 |
2013 | AA | 84 | BOS | 101 | 187 | 37.95049 |
2013 | AA | 1850 | BOS | 94 | 187 | 38.46809 |
2013 | AA | 1838 | BOS | 93 | 187 | 37.83871 |
2013 | AA | 1762 | BOS | 86 | 187 | 38.47674 |
This query and table select the most popular flights from JFK between June and December of 2013 from the carriers UA
, AA
, US
, and WN
that depart JFK after 8:00 AM and have an air time of less than 2 hours.
Learn more
There’s more that the package can do, like JOIN
s, INSERT
s, and UPDATE
s that I haven’t gone into here.
There are also a number of wrappers, comparison operators and concatenators that can be used for wrapping strings in quotes — e.g. quotes()
— comparing columns to values — e.g. geq()
, eq()
, lt()
, neq()
— and stringing together statements — e.g. AND()
, OR()
, %LIKE%
, %IN%
, BETWEEN()
.
There’s an example of nearly every single function and each of it’s possible configurations in the package documentation.
Hopefully this package is useful to someone other than myself (like you!). If you run into any problems, let me know or submit an issue on GitHub.