🐿️ Sqrrl

Easily build bespoke SQL queries programmatically in R

Published

June 30, 2017

UPDATE 10/17/2017: Don’t use this! I made it for myself so it works for what I needed it for. But you probably shouldn’t use this package. There are better ways of building SQL queries that are safer and better (and probably even easier). For now, let me just point you in the direction of db.rstudio.com, dplyr/dbplyr, and the recently added glue_sql() function in the glue package.

Project Links: source, documentation

sqrrl is a small collection of utility functions that help build text-based SQL queries in an R-style native-feeling and functional manner.

Unlike other packages that build SQL queries using an object-oriented style, sqrrl provides small functions that produce SQL snippets and can be linked together to compose SQL queries. The result is that the code to produce the SQL statement reads much like the SQL statement iteself. On the other hand, sqrrl doesn’t know anything about your database and can’t help you out with completions, etc.

Where this package is most useful is with Shiny web apps that interact with a MySQL backend. The utilities are all built so that queries can be built using column names and values stored inside ordinary R data structures.

The following is a quick demonstration of how the package works using the nyclights13 dataset. For more information on sqrrl, check out the package documentation.

Setup `flights` database

To demonstrate the features in sqrrl, let’s set up an in-memory SQLite database using the nycflights13 dataset featured in dplyr and dbplyr.

First, load (or install) the pacakges and functions that we need.

# ---- Workspace Setup ----
library('nycflights13') # install.packages('nycflights13')
library('DBI')          # install.packages('DBI')
library('dplyr')        # install.packages('dplyr')
library('dbplyr')       # install.packages('dbplyr')

# Load the sqrrl package
# devtools::isntall_github('gadenbuie/sqrrl')
library('sqrrl')

# Alias to create nice tables
as_table <- function(...) knitr::kable(..., format = 'html')

Then load the flights data frame from nycflights13 into the in-memory SQLite database (this code comes direclty from the dbplyr documentation).

# ---- Example Setup ----
# Create an in-memory SQLite database
con <- dbConnect(RSQLite::SQLite(), path = ":memory:")

# Use dplyr/dbplyr to copy flights table to the temp db
copy_to(con, nycflights13::flights, "flights",
  temporary = FALSE,
  indexes = list(
    c("year", "month", "day"),
    "carrier",
    "tailnum",
    "dest"
  )
)

# Show first 5 rows
dbGetQuery(con, 'SELECT * FROM flights LIMIT 5') %>%
  as_table

year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
2013	1	1	517	515	2	830	819	11	UA	1545	N14228	EWR	IAH	227	1400	5	15	1357034400
2013	1	1	533	529	4	850	830	20	UA	1714	N24211	LGA	IAH	227	1416	5	29	1357034400
2013	1	1	542	540	2	923	850	33	AA	1141	N619AA	JFK	MIA	160	1089	5	40	1357034400
2013	1	1	544	545	-1	1004	1022	-18	B6	725	N804JB	JFK	BQN	183	1576	5	45	1357034400
2013	1	1	554	600	-6	812	837	-25	DL	461	N668DN	LGA	ATL	116	762	6	0	1357038000

Querying `flights`

Often, when I’m working with a database, I’ll create an alias for dbGetQuery with the database or table name. Inside the alias function I usually add any data type modifications that might need to be applied, and I suppress the warning messages that DBI outputs about data type conversions.

flights <- function(query, ...) {
  suppressWarnings(dbGetQuery(con, query, ...))
}

Now we can repeat the above SELECT statement using sqrrl, this time limiting the columns selected.

flight_cols <- c('year', 'month', 'day',
                 'carrier', 'flight', 'tailnum')
SELECT(flight_cols) %+%
  FROM('flights') %+%
  LIMIT(5) %>%
  flights %>%
  as_table

year	month	day	carrier	flight	tailnum
2013	1	1	UA	1545	N14228
2013	1	1	UA	1714	N24211
2013	1	1	AA	1141	N619AA
2013	1	1	B6	725	N804JB
2013	1	1	DL	461	N668DN

Note that sqrrl provides the %+% infix operator, which is essentially just an alias for paste(x, y).

'a' %+% 'b'

[1] "a b"

# or PHP style without a padded space: paste0
'a' %.% 'b'

[1] "ab"

We can also do more complicated queries, like finding the average arrival delay, grouped by tail number:

SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
  FROM('flights') %+%
  GROUP_BY('tailnum') %+%
  ORDER_BY(DESC('delay')) %+%
  LIMIT(10) %>%
  flights %>%
  as_table

tailnum	delay	n
N844MH	320.0000	1
N911DA	294.0000	1
N922EV	276.0000	1
N587NW	264.0000	1
N851NW	219.0000	1
N928DN	201.0000	1
N7715E	188.0000	1
N654UA	185.0000	1
N665MQ	174.6667	6
N427SW	157.0000	1

sqrrl also provides a wrapper around the python utility sqlformat that can be used to pretty-print SQL formats.

SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
  FROM('flights') %+%
  GROUP_BY('tailnum') %+%
  ORDER_BY(DESC('delay')) %+%
  LIMIT(10) %>%
  sqlformat %>% cat

SELECT tailnum,
       avg(arr_delay) AS delay,
       count(*) AS n
  FROM flights
 GROUP BY tailnum
 ORDER BY delay DESC
 LIMIT 10

Let’s use the above as an inner query and filter on n > 100:

query_all_arr_delay <- SELECT(
  'tailnum', delay = 'avg(arr_delay)', n = 'count(*)'
) %+%
  FROM('flights') %+%
  GROUP_BY('tailnum') %+%
  ORDER_BY(DESC('delay'))

SELECT() %+%
  FROM(delay = parens(query_all_arr_delay)) %+%
  WHERE(gt(n = 100)) %+%
  LIMIT(10) %>%
  flights %>%
  as_table

tailnum	delay	n
N11119	30.30657	148
N16919	29.88745	251
N14998	27.92202	230
N15910	27.61132	280
N13123	25.97345	121
N11192	25.85235	154
N14950	25.28780	219
N21130	24.96610	126
N24128	24.91803	129
N22971	24.74766	230

Queries are just strings

Notice that unlike other packages, sqrrl can’t build the nested queries for you. You still need to understand the structure of the database and the structure of the query.

But when compared with the final output of the query, the sqrrl version looks a lot like SQL transliterated into R functions.

SELECT() %+%
  FROM(delay = parens(
    SELECT('tailnum', delay = 'avg(arr_delay)', n = 'count(*)') %+%
      FROM('flights') %+%
      GROUP_BY('tailnum') %+%
      ORDER_BY(DESC('delay'))
  )) %+%
  WHERE(gt(n = 100)) %+%
  LIMIT(10) %>%
  sqlformat() %>%
  cat()

SELECT *
  FROM (
        SELECT tailnum,
               avg(arr_delay) AS delay,
               count(*) AS n
          FROM flights
         GROUP BY tailnum
         ORDER BY delay DESC
       ) delay
 WHERE n>100
 LIMIT 10

For me, at least, where the goal is to write SQL queries as bare strings, sqrrl lets me write in R and think in SQL without having to add a huge number of paste and paste0 functions.

Everything in sqrrl takes input data from regular R data types and outputs an SQL snippet.

For an example of nearly everything each of the functions can do, see the Getting Started section in the documentation.

A more complicated SELECT query

As a final example, here is a fully-loaded select query.

SELECT('`year`', 'carrier', 'flight', 'dest',
       n = 'count(*)',
       avg_dist = 'avg(distance)',
       avg_air_time = 'avg(air_time)') %+%
  FROM(f = 'flights') %+%
  WHERE(
    BETWEEN('month', 6, 12),
    'carrier' %IN% c("UA", "AA", "US", "WN"),
    geq('dep_time' = 800),
    leq('air_time' = 120),
    'origin' %LIKE% 'JFK'
  ) %+%
  GROUP_BY('`year`', 'carrier', 'flight', 'dest') %+%
  ORDER_BY(DESC('n')) %+%
  LIMIT(10) %>%
  { sqlformat(.) %>% cat; . } %>%
  flights %>%
  as_table

SELECT `year`,
       carrier,
       flight,
       dest,
       count(*) AS n,
       avg(distance) AS avg_dist,
       avg(air_time) AS avg_air_time
  FROM flights f
 WHERE `month` BETWEEN 6 AND 12
   AND carrier IN ("UA", "AA", "US", "WN")
   AND dep_time>=800
   AND air_time<=120
   AND origin LIKE("JFK")
 GROUP BY `year`,
          carrier,
          flight,
          dest
 ORDER BY n DESC
 LIMIT 10

year	carrier	flight	dest	n	avg_dist	avg_air_time
2013	US	1831	CLT	178	541	86.95506
2013	US	425	CLT	126	541	84.92857
2013	AA	178	BOS	119	187	37.94118
2013	AA	256	BOS	117	187	39.13675
2013	AA	2314	BOS	115	187	37.85217
2013	US	1802	CLT	112	541	87.23214
2013	AA	84	BOS	101	187	37.95049
2013	AA	1850	BOS	94	187	38.46809
2013	AA	1838	BOS	93	187	37.83871
2013	AA	1762	BOS	86	187	38.47674

This query and table select the most popular flights from JFK between June and December of 2013 from the carriers UA, AA, US, and WN that depart JFK after 8:00 AM and have an air time of less than 2 hours.

Learn more

There’s more that the package can do, like JOINs, INSERTs, and UPDATEs that I haven’t gone into here.

There are also a number of wrappers, comparison operators and concatenators that can be used for wrapping strings in quotes — e.g. quotes() — comparing columns to values — e.g. geq(), eq(), lt(), neq() — and stringing together statements — e.g. AND(), OR(), %LIKE%, %IN%, BETWEEN().

There’s an example of nearly every single function and each of it’s possible configurations in the package documentation.

Hopefully this package is useful to someone other than myself (like you!). If you run into any problems, let me know or submit an issue on GitHub.

Setup flights database

Querying flights

Queries are just strings

A more complicated SELECT query

Learn more

Setup `flights` database

Querying `flights`