# ---- Workspace Setup ----
library('nycflights13') # install.packages('nycflights13')
library('DBI') # install.packages('DBI')
library('dplyr') # install.packages('dplyr')
library('dbplyr') # install.packages('dbplyr')
# Load the sqrrl package
# devtools::isntall_github('gadenbuie/sqrrl')
library('sqrrl')
# Alias to create nice tables
as_table <- function(...) knitr::kable(..., format = 'html')🐿️ Sqrrl
UPDATE 10/17/2017: Don’t use this! I made it for myself so it works for what I needed it for. But you probably shouldn’t use this package. There are better ways of building SQL queries that are safer and better (and probably even easier). For now, let me just point you in the direction of db.rstudio.com, dplyr/dbplyr, and the recently added glue_sql() function in the glue package.
Project Links: source, documentation
sqrrl is a small collection of utility functions that help build text-based SQL queries in an R-style native-feeling and functional manner.
Unlike other packages that build SQL queries using an object-oriented style, sqrrl provides small functions that produce SQL snippets and can be linked together to compose SQL queries. The result is that the code to produce the SQL statement reads much like the SQL statement iteself. On the other hand, sqrrl doesn’t know anything about your database and can’t help you out with completions, etc.
Where this package is most useful is with Shiny web apps that interact with a MySQL backend. The utilities are all built so that queries can be built using column names and values stored inside ordinary R data structures.
The following is a quick demonstration of how the package works using the nyclights13 dataset. For more information on sqrrl, check out the package documentation.
Setup flights database
To demonstrate the features in sqrrl, let’s set up an in-memory SQLite database using the nycflights13 dataset featured in dplyr and dbplyr.
First, load (or install) the pacakges and functions that we need.
Then load the flights data frame from nycflights13 into the in-memory SQLite database (this code comes direclty from the dbplyr documentation).
# ---- Example Setup ----
# Create an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), path = ":memory:")
# Use dplyr/dbplyr to copy flights table to the temp db
copy_to(con, nycflights13::flights, "flights",
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
# Show first 5 rows
dbGetQuery(con, 'SELECT * FROM flights LIMIT 5') %>%
as_table| year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 1357034400 |
| 2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | 1416 | 5 | 29 | 1357034400 |
| 2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1089 | 5 | 40 | 1357034400 |
| 2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | 1576 | 5 | 45 | 1357034400 |
| 2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | N668DN | LGA | ATL | 116 | 762 | 6 | 0 | 1357038000 |
Querying flights
Often, when I’m working with a database, I’ll create an alias for dbGetQuery with the database or table name. Inside the alias function I usually add any data type modifications that might need to be applied, and I suppress the warning messages that DBI outputs about data type conversions.
flights <- function(query, ...) {
suppressWarnings(dbGetQuery(con, query, ...))
}Now we can repeat the above SELECT statement using sqrrl, this time limiting the columns selected.
flight_cols <- c('year', 'month', 'day',
'carrier', 'flight', 'tailnum')
SELECT(flight_cols) %+%
FROM('flights') %+%
LIMIT(5) %>%
flights %>%
as_table| year | month | day | carrier | flight | tailnum |
|---|---|---|---|---|---|
| 2013 | 1 | 1 | UA | 1545 | N14228 |
| 2013 | 1 | 1 | UA | 1714 | N24211 |
| 2013 | 1 | 1 | AA | 1141 | N619AA |
| 2013 | 1 | 1 | B6 | 725 | N804JB |
| 2013 | 1 | 1 | DL | 461 | N668DN |
Note that sqrrl provides the %+% infix operator, which is essentially just an alias for paste(x, y).
'a' %+% 'b'[1] "a b"
# or PHP style without a padded space: paste0
'a' %.% 'b'[1] "ab"
We can also do more complicated queries, like finding the average arrival delay, grouped by tail number:
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay')) %+%
LIMIT(10) %>%
flights %>%
as_table| tailnum | delay | n |
|---|---|---|
| N844MH | 320.0000 | 1 |
| N911DA | 294.0000 | 1 |
| N922EV | 276.0000 | 1 |
| N587NW | 264.0000 | 1 |
| N851NW | 219.0000 | 1 |
| N928DN | 201.0000 | 1 |
| N7715E | 188.0000 | 1 |
| N654UA | 185.0000 | 1 |
| N665MQ | 174.6667 | 6 |
| N427SW | 157.0000 | 1 |
sqrrl also provides a wrapper around the python utility sqlformat that can be used to pretty-print SQL formats.
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay')) %+%
LIMIT(10) %>%
sqlformat %>% catSELECT tailnum,
avg(arr_delay) AS delay,
count(*) AS n
FROM flights
GROUP BY tailnum
ORDER BY delay DESC
LIMIT 10Let’s use the above as an inner query and filter on n > 100:
query_all_arr_delay <- SELECT(
'tailnum', delay = 'avg(arr_delay)', n = 'count(*)'
) %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay'))
SELECT() %+%
FROM(delay = parens(query_all_arr_delay)) %+%
WHERE(gt(n = 100)) %+%
LIMIT(10) %>%
flights %>%
as_table| tailnum | delay | n |
|---|---|---|
| N11119 | 30.30657 | 148 |
| N16919 | 29.88745 | 251 |
| N14998 | 27.92202 | 230 |
| N15910 | 27.61132 | 280 |
| N13123 | 25.97345 | 121 |
| N11192 | 25.85235 | 154 |
| N14950 | 25.28780 | 219 |
| N21130 | 24.96610 | 126 |
| N24128 | 24.91803 | 129 |
| N22971 | 24.74766 | 230 |
Queries are just strings
Notice that unlike other packages, sqrrl can’t build the nested queries for you. You still need to understand the structure of the database and the structure of the query.
But when compared with the final output of the query, the sqrrl version looks a lot like SQL transliterated into R functions.
SELECT() %+%
FROM(delay = parens(
SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
FROM('flights') %+%
GROUP_BY('tailnum') %+%
ORDER_BY(DESC('delay'))
)) %+%
WHERE(gt(n = 100)) %+%
LIMIT(10) %>%
sqlformat() %>%
cat()SELECT *
FROM (
SELECT tailnum,
avg(arr_delay) AS delay,
count(*) AS n
FROM flights
GROUP BY tailnum
ORDER BY delay DESC
) delay
WHERE n>100
LIMIT 10For me, at least, where the goal is to write SQL queries as bare strings, sqrrl lets me write in R and think in SQL without having to add a huge number of paste and paste0 functions.
Everything in sqrrl takes input data from regular R data types and outputs an SQL snippet.
For an example of nearly everything each of the functions can do, see the Getting Started section in the documentation.
A more complicated SELECT query
As a final example, here is a fully-loaded select query.
SELECT('`year`', 'carrier', 'flight', 'dest',
n = 'count(*)',
avg_dist = 'avg(distance)',
avg_air_time = 'avg(air_time)') %+%
FROM(f = 'flights') %+%
WHERE(
BETWEEN('month', 6, 12),
'carrier' %IN% c("UA", "AA", "US", "WN"),
geq('dep_time' = 800),
leq('air_time' = 120),
'origin' %LIKE% 'JFK'
) %+%
GROUP_BY('`year`', 'carrier', 'flight', 'dest') %+%
ORDER_BY(DESC('n')) %+%
LIMIT(10) %>%
{ sqlformat(.) %>% cat; . } %>%
flights %>%
as_tableSELECT `year`,
carrier,
flight,
dest,
count(*) AS n,
avg(distance) AS avg_dist,
avg(air_time) AS avg_air_time
FROM flights f
WHERE `month` BETWEEN 6 AND 12
AND carrier IN ("UA", "AA", "US", "WN")
AND dep_time>=800
AND air_time<=120
AND origin LIKE("JFK")
GROUP BY `year`,
carrier,
flight,
dest
ORDER BY n DESC
LIMIT 10| year | carrier | flight | dest | n | avg_dist | avg_air_time |
|---|---|---|---|---|---|---|
| 2013 | US | 1831 | CLT | 178 | 541 | 86.95506 |
| 2013 | US | 425 | CLT | 126 | 541 | 84.92857 |
| 2013 | AA | 178 | BOS | 119 | 187 | 37.94118 |
| 2013 | AA | 256 | BOS | 117 | 187 | 39.13675 |
| 2013 | AA | 2314 | BOS | 115 | 187 | 37.85217 |
| 2013 | US | 1802 | CLT | 112 | 541 | 87.23214 |
| 2013 | AA | 84 | BOS | 101 | 187 | 37.95049 |
| 2013 | AA | 1850 | BOS | 94 | 187 | 38.46809 |
| 2013 | AA | 1838 | BOS | 93 | 187 | 37.83871 |
| 2013 | AA | 1762 | BOS | 86 | 187 | 38.47674 |
This query and table select the most popular flights from JFK between June and December of 2013 from the carriers UA, AA, US, and WN that depart JFK after 8:00 AM and have an air time of less than 2 hours.
Learn more
There’s more that the package can do, like JOINs, INSERTs, and UPDATEs that I haven’t gone into here.
There are also a number of wrappers, comparison operators and concatenators that can be used for wrapping strings in quotes — e.g. quotes() — comparing columns to values — e.g. geq(), eq(), lt(), neq() — and stringing together statements — e.g. AND(), OR(), %LIKE%, %IN%, BETWEEN().
There’s an example of nearly every single function and each of it’s possible configurations in the package documentation.
Hopefully this package is useful to someone other than myself (like you!). If you run into any problems, let me know or submit an issue on GitHub.