Arrow is a standardized in-memory columnar representation of data: it enables “zero copy” sharing (pointers passed at a low level instead of serialized copies) and supports OLAP (Online Analytical Processing) workloads, including from R, implementing deferred or “lazy” materialization (queries can be optimized before any data is read).
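A minimal sketch of that deferred materialization in R, assuming the arrow and dplyr packages and a hypothetical "june23" directory of Parquet files:

```r
library(arrow)
library(dplyr)

# Opening a dataset reads only metadata, not the data itself.
ds <- open_dataset("june23")  # "june23" is a hypothetical Parquet directory

# dplyr verbs build a query plan; still nothing is read.
q <- ds |>
  filter(state_abbr == "VT") |>
  select(location_id, technology)

# collect() materializes only the rows/columns the plan actually needs.
collect(q)
```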
DuckDB is an implementation (an engine) of an OLAP database that runs in-process.
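As a sketch of using DuckDB as that engine from R (assuming the duckdb package and Parquet files under a hypothetical "june23" directory):

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())

# DuckDB can scan a partitioned Parquet directory directly,
# pushing filters and projections down to the file scan.
dbGetQuery(con, "
  SELECT technology, count(*) AS n
  FROM read_parquet('june23/**/*.parquet', hive_partitioning = true)
  GROUP BY technology
")

dbDisconnect(con, shutdown = TRUE)
```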
Parquet is a open source columnar file format storage designed for fast I/O
“hive style”, a key=value naming scheme: "year=2019/month=1/file.parquet"
“directory style”, values only: "2019/01/file.parquet"
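A small sketch of the difference using arrow::write_dataset (hive_style is the default; mtcars and the output paths are just an illustration):

```r
library(arrow)

# Hive-style layout: key=value directories, e.g. "cyl=4/part-0.parquet"
write_dataset(mtcars, "mtcars_hive", format = "parquet", partitioning = "cyl")

# Directory-style layout: values only, e.g. "4/part-0.parquet"
write_dataset(mtcars, "mtcars_dir", format = "parquet",
              partitioning = "cyl", hive_style = FALSE)

list.files("mtcars_hive", recursive = TRUE)
list.files("mtcars_dir", recursive = TRUE)
```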
Hadley Wickham suggests:
As a rough guide, arrow suggests that you avoid files smaller than 20 MB and larger than 2 GB, and avoid partitions that produce more than 10,000 files.
The postgres extension ATTACHes to our PG instance using a connection string (a URI is also an option); in this case the credentials come from a single-line ~/.pgpass file. The connection is aliased “db” and, for this case, opened READ_ONLY.
Optional: an example of how to inspect information_schema.tables; DuckDB’s implementation is similar to PG’s.
Use COPY with a SELECT statement (referencing “catalog.schema.table”) TO write a file.
Disconnect: closing the DuckDB connection seems to DETACH the PG connection as well.
Code
library(DBI)
library(duckdb)

con <- dbConnect(duckdb())
dbExecute(con, "INSTALL postgres")
dbExecute(con, "LOAD postgres")

# ~/.pgpass is a single line: hostname:port:database:username:password
pgpass <- readLines("~/.pgpass")
cred <- unlist(strsplit(pgpass, ":"))
attach_string <- sprintf(
  "ATTACH 'dbname=%s user=%s password=%s host=%s' AS db (TYPE POSTGRES, READ_ONLY)",
  cred[3], cred[4], cred[5], cred[1]
)
dbExecute(con, attach_string)

dbGetQuery(con, "SELECT * FROM information_schema.tables WHERE table_schema = 'staging'")
# dbListTables(con) works too, but is not that useful

start <- Sys.time()
dbExecute(con, "
  COPY (
    SELECT frn, location_id, technology,
           max_advertised_download_speed, max_advertised_upload_speed,
           business_residential_code, state_abbr, block_geoid, low_latency
    FROM db.staging.june23
  ) TO 'june23' (FORMAT 'parquet', PARTITION_BY (state_abbr, technology))
")
end <- Sys.time()

DBI::dbDisconnect(con, shutdown = TRUE)  # also shuts down the embedded engine