Arrow is a standardized in-memory columnar representation of data: it enables “zero copy” sharing (pointers passed at a low level instead of serialized copies) and supports OLAP (Online Analytical Processing) workloads, including from R, implementing deferred or “lazy” materialization (queries can be optimized before any data is read).
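A minimal sketch of that deferred materialization in R, assuming the arrow and dplyr packages and a hypothetical "june23" directory of Parquet files:

```r
library(arrow)
library(dplyr)

# Opening a dataset reads only metadata, not the data itself.
ds <- open_dataset("june23")  # "june23" is a hypothetical Parquet directory

# dplyr verbs build a query plan; still nothing is read.
q <- ds |>
  filter(state_abbr == "VT") |>
  select(location_id, technology)

# collect() materializes only the rows/columns the plan actually needs.
collect(q)
```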
DuckDB is an implementation (an engine) of an OLAP database that runs in-process.
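As a sketch of using DuckDB as that engine from R (assuming the duckdb package and Parquet files under a hypothetical "june23" directory):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# DuckDB can scan a partitioned Parquet directory directly,
# pushing filters and projections down to the file scan.
dbGetQuery(con, "
  SELECT technology, count(*) AS n
  FROM read_parquet('june23/**/*.parquet', hive_partitioning = true)
  GROUP BY technology
")

dbDisconnect(con, shutdown = TRUE)
```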
Parquet is a open source columnar file format storage designed for fast I/O
“hive style”, a key=value naming scheme: "year=2019/month=1/file.parquet"
“directory style”, values only: "2019/01/file.parquet"
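A small sketch of the difference using arrow::write_dataset (hive_style is the default; mtcars and the output paths are just an illustration):

```r
library(arrow)

# Hive-style layout: key=value directories, e.g. "cyl=4/part-0.parquet"
write_dataset(mtcars, "mtcars_hive", format = "parquet", partitioning = "cyl")

# Directory-style layout: values only, e.g. "4/part-0.parquet"
write_dataset(mtcars, "mtcars_dir", format = "parquet",
              partitioning = "cyl", hive_style = FALSE)

list.files("mtcars_hive", recursive = TRUE)
list.files("mtcars_dir", recursive = TRUE)
```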
Hadley Wickham suggests:
As a rough guide, arrow suggests that you avoid files smaller than 20 MB and larger than 2 GB, and avoid partitions that produce more than 10,000 files.
The postgres extension ATTACHes to our PG instance using a connection string (a URI is also an option); in this case the credentials come from a single-line ~/.pgpass file. The connection is aliased “db” and, for this case, opened READ_ONLY.
Optional: an example of how to inspect information_schema.tables; DuckDB’s implementation is similar to PG’s.
Use COPY with a SELECT statement (referencing “catalog.schema.table”) TO write a file.
Disconnect: closing the DuckDB connection seems to DETACH the PG connection as well.
Code
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL postgres")
dbExecute(con, "LOAD postgres")

# ~/.pgpass is a single line: hostname:port:database:username:password
pgpass <- readLines("~/.pgpass")
cred <- unlist(strsplit(pgpass, ":"))
attach_string <- sprintf(
  "ATTACH 'dbname=%s user=%s password=%s host=%s' AS db (TYPE POSTGRES, READ_ONLY)",
  cred[3], cred[4], cred[5], cred[1]
)
dbExecute(con, attach_string)

dbGetQuery(con, "SELECT * FROM information_schema.tables WHERE table_schema = 'staging'")
# dbListTables(con) works too, but is not that useful

start <- Sys.time()
dbExecute(con, "
  COPY (
    SELECT frn, location_id, technology,
           max_advertised_download_speed, max_advertised_upload_speed,
           business_residential_code, state_abbr, block_geoid, low_latency
    FROM db.staging.june23
  ) TO 'june23' (FORMAT 'parquet', PARTITION_BY (state_abbr, technology))
")
end <- Sys.time()

DBI::dbDisconnect(con, shutdown = TRUE)  # also shuts down the embedded engine