Working with SQL in R
Suggested Code Snippets
TOC
Connect to Postgres
Follow instructions on the cori_db wiki to set up your .Renviron
file.
<- connect_to_db('sch_layer') con
The cori_db wiki also documents several helper functions for DB interaction in the coriverse.
Read SQL files
<- cori.db::read_sql('path/to/sql.file')
query
# read_sql() is vectorized, so you can read multiple files into a vector of queries
<- cori.db::read_sql(c('path/to/file1.sql', 'path/to/file2.sql')) queries
Execute a query on Postgres (no return value)
execute_on_postgres(query)
# execute_on_postgres() is vectorized, so you can execute multiple queries sequentially
execute_on_postgres(queries)
Read data from a query
<- connect_to_db('schema')
con <- DBI::dbGetQuery(con, "select * from table limit 5")
dta ::dbDisconnect(con) # important ! DBI
Read Data
CSV
- Use
readr::read_csv()
for files with < 100K rows. For larger files, usedata.table::fread()
Google Sheet
.SHP, .GPKG, or other spatial file
sf::st_read()
ESRI Server
::install_github('yonghah/esri2sf')
remotes
library(esri2sf)
esri2sf(url)
Airtable
::install_github('matthewjrogers/rairtable')
remotes
set_airtable_api_key('MY_KEY_HERE')
<- airtable('Table Name', "BASE_ID")
atbl
<- read_airtable(atbl) dta
Write Data
NON-SPATIAL
To CSV
- Use
readr::write_csv()
for data sets of <100K rows,data.table::fwrite()
for larger data sets
To Postgres
Google Sheets
library(googlesheets4)
options(httr_oob_default = TRUE)
gs4_auth()
# make a new workbook
gs4_create("name", sheets = list(sheet_name = df, ...))
# existing workbook
write_sheet(df, ss = "workbook_url", sheet = "worksheet_name_to_write_to)
To Airtable
::install_github('matthewjrogers/rairtable')
remotes
set_airtable_api_key('MY_KEY_HERE')
# requires that table already exists with correct column names and types
insert_records()
# same requirements as insert, plus you need data with airtable record ids
update_records()
Also consider Airtable Python Wrapper, as it is somewhat faster.
SPATIAL
To Postgres
::st_write(df, con, "name") sf
To File
::st_write(df, "path/name.gpkg") sf
CARTO
library(coriverse)
write_carto(sf_df, 'name_of_layer_on_carto', Rto::carto())
Disconnect from a database
Validate a left join
# check that the result of the join has the same number of rows as the left hand side of the join # If the numbers are identical, the join was performed on a unique key or set of keys = dplyr::left_join(old_df1, old_df2, by = 'key_var') new_df stopifnot(nrow(old_df1) == nrow(new_df))
Explore a dataset
- Use
dplyr::glimpse()
to get an overview of large data sets (many columns) head()
will work well for data with fewer columns- For numeric variables of interest,
summary()
provides a good overview - For categorical variables, use
group_by()
andcount()
fromdplyr
to get counts by group - Get count of
NA
values across all columns withlapply(df, function(c) sum(is.na(c)))
Visualize data on a map
plot()
is the simplest option. Useplot(table$geography_field)
to just see a single map.ggplot()
and geom_sf() are useful when you want to see multiple layers overlayed.e.g.
r ggplot() + geom_sf(data = county_polygon) + geom_sf(data = served_polygon, color = "green", fill = "green") + geom_sf(data = county_roads_union) + geom_sf(data = served_roads, color = "red")
leaflet() is great for seeing data on top of a basemap that lets you zoom in and out, add popups, etc. The documentation is pretty good.
e.g.
r leaflet() %>% addTiles() %>% addPolygons(data = albany_boundary, color = "black") %>% addPolylines(data = albany_roads, color = "yellow") %>% addPolylines(data = albany_served_roads, color = "red") %>% addMarkers(data = Carroll_County_Addresses_g %>% filter(Town_Name == "Albany"))
Compare 2 dataframes
(Useful when updating data and you want to see what changed)