PostgreSQL-RDS-Management
- What kind of data is “legacy” data? What kind of data should be “versioned”? Here we need a stable structure; what could be a candidate for that -> FCC county? FCC annualized (standardized?) it -> to what extent are we supporting this pipeline?
- Storing archives in a “cheap format” that can be queried outside of PG (options are an OLAP DB or the filesystem)? -> paws / targets
- Maybe do not use the acs schema and just use it for/inside TED
- Caching -> S3 / not team-wide
- Archive -> team-wide
🏗️ This page is under construction! 🏗️
Database Migration Update 2023-11-20
https://docs.google.com/document/d/1nuYftMYqEKbH2i_wKhf63Kika7RtEoHb-A7NZYaSPwQ/edit
Goal Overview:
Use data from the single database instance in our old RDS cluster (cori-risi) to populate the following four new databases instances in our new RDS cluster (cori-risi-ad-postgresql):
api-dev - Database used as a development environment whose main purpose is “promoting” data to api-prod; serves as a test environment for the CORI Data API.
api-prod - Database used to serve CORI Data API with data. Highest level of security with limited access. Products will or can be public facing.
data-prod - Database used to house vetted, documented, and production level datasets. Potential for very limited data sharing (with collaborators outside the org) with support.
data - Database used to support exploratory project work, research data, datasets under development.
We will do this in sequential steps, populating each new database instance one-at-a-time, with the database instance called “data” being the final one (equivalent to the original “data” instance in the old cori-risi RDS cluster). Some schemas will be database-specific from now on, while others (i.e., “metadata”) will be common to all instances.
Glossary:
Database: Collection of schemas (data domains), each with a specific set of data tables
RDS: AWS service for hosting and administering networked database (DB) clusters
RDS (cluster): Also used to reference a single cluster of databases
Role: a generic grantee that can be given privileges in the DB (example: read_only). For administration purposes, some “group” roles will be collections of multiple users (example: bb_team)
User: a role that can log in (example: Olivier)
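To illustrate the distinction, a minimal SQL sketch (role names come from the examples above; the schema in the GRANT is a placeholder):

```sql
-- A role is a grantee of privileges; NOLOGIN means it cannot connect itself.
CREATE ROLE read_only NOLOGIN;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO read_only;  -- placeholder schema

-- A "group" role collects privileges for a team.
CREATE ROLE bb_team NOLOGIN;
GRANT read_only TO bb_team;

-- A user is simply a role that can log in.
CREATE ROLE olivier LOGIN PASSWORD 'change-me';
GRANT bb_team TO olivier;
```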
Basic activities plan:
- “Dumping” existing schemas in old RDS (pg_dump) (Oct 25-Nov 1, 2023)
Schemas intended for data-prod will be manually dumped/restored (by Olivier Nov 1, 2023)
(MDA Team:) Should we restore these schemas? If yes, into which database?
Historical_census_data
Used in RWJF story 2
Not actively developed
Will not restore
sch_layers
Carto layers…
Version of higher ed drive times (last run?)
Will not restore
sch_source
Dumping ground during ETL
Practice question: how should we deal with storing intermediate data in db?
- Utilize S3; document bucket/path in code README
Will not restore; absence of schema will break some scripts
vt_broadband
Old VT 10 year work
Will not restore
Creating group roles (and cleaning up a bit manually) for administrative management in the new RDS (Oct 31, 2023)
Creating new database instances in new RDS (Active Directory secured)
Populate them in this order:
api-dev (Nov 1, 2023)
-> api-prod (Nov 2, 2023)
-> data-prod (Nov 6-9, 2023)
This will use schemas manually dumped/restored (by Olivier). We actually used ansible for both dumping schemas and restoring them to the individual database instances.
-> data (Nov 13, 2023)
Check in with AK and the BB team about downtime (November 9)
Check in with Nora and the RII team about downtime (November 9)
Notify the org about app downtime (Nov 2, 2023)
Set time with Nahum (MF) for API re-deploy/testing (Nov 2, 2023?)
Re-deploy Broadband Climate Risk Mitigation Tool with downtime notification
- Creating users with role assignments to access the new database instances
Maintaining a version-controlled YAML file as the source of truth on who can connect to what (and how)
db_password should be a shell environment variable.
What happens if users already exist and have already set a password? (We want re-runs to be idempotent and to avoid resetting passwords; the no_password_changes option takes that into account.)
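For reference, the same guard can be sketched in plain SQL (this is not what the playbook runs; the user name is illustrative):

```sql
-- Create the user only if it does not exist yet, so re-runs never
-- touch an existing password (the SQL analogue of no_password_changes).
DO $$
BEGIN
  IF NOT EXISTS (SELECT 1 FROM pg_roles WHERE rolname = 'olivier') THEN
    CREATE ROLE olivier LOGIN PASSWORD 'initial-password';
  END IF;
END
$$;
```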
- (Re-)Connecting “client” apps/tools to each new database instance (Nov 6-10, 2023)
Prerequisite: updating cori.db (remember to increment the version)
Set credential function (Nov 6, 2023)
Connect to db functions (Nov 6, 2023)
Specify search path in connection params (see the SQL sketch after this list)
Adding one function to connect to data-prod (November 6)
R processes/scripts (including DoctR => TED data pipeline)
Re-deploy some of our shiny apps (TED, WIM, CORI Explorer) to shinyapps.io (November 9)
- Required: uses .Renviron for db credentials
Re-deploy API (November 2)
Re-deploy BCAT and Broadband Risk Mitigation (CH) (TBD)
Testing functionality (November 6; ongoing)
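The search path can be set per session by the client, or pinned server-side per role and database so clients don’t have to remember it; a sketch with illustrative role/schema names:

```sql
-- Per-session: what a client sets right after connecting.
SET search_path TO acs, public;

-- Server-side alternative: pin the search path for one role in one
-- database (role and schema names are illustrative).
ALTER ROLE shinybot1 IN DATABASE "data-prod" SET search_path = acs, public;
```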
Onboard BB team
- BB <=> MDA Database onboarding (2.0) (Nov 29, 2023)
Shut down the old RDS cluster (November 20)
Sharing (developing) best practices going forward (TBD)
Writing/Updating documentation:
Architecture overview
cori.db documentation (README)
Data content/variables overview
Adding users
- Self-managed passwords (procedure for when a password is forgotten/lost)?
Moving data
Examples:
Data set handling: data that is “in-process/progress” stays in the data db instance. Metadata is essential and required for any data set that will be moved to the api-prod or data-prod db instances.
Queries: avoid SELECT *; use aliases (AS) and WHERE clauses to help homogenize input in R or other scripting tools. For example:
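A sketch, with hypothetical table and column names:

```sql
-- Prefer explicit columns over SELECT *: stable column names and order
-- keep downstream R code from breaking when the table changes.
SELECT
  geoid      AS county_geoid,   -- hypothetical columns
  pct_served AS bb_pct_served
FROM acs.county_broadband       -- hypothetical table
WHERE state_abbr = 'VT';
```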
Appendix:
Some examples of role inheritance:
- Drew (user) <- mda_team (“group”) <- read_and_write (role)
- Kirstin (user) <- bb_team (“group”) <- read_only (role)
- Shinybot1 (user) <- bot (“group”) <- read_only (role)
- Collaborator_bill (user) <- group <- read_only (role)
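In SQL, each chain above is just two GRANTs; a sketch assuming the privilege roles already exist:

```sql
-- Drew (user) <- mda_team ("group") <- read_and_write (role)
GRANT read_and_write TO mda_team;
GRANT mda_team TO drew;

-- Kirstin (user) <- bb_team ("group") <- read_only (role)
GRANT read_only TO bb_team;
GRANT bb_team TO kirstin;
```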
Follow-up:
- November 2023
- November 2022
- Compare change in RDS costs with Drew Rosebush
Database Migration 2023-03-10
Goals:
- List remaining challenges
- Process to update data
- Process to consolidate/clean/document previous data
- Process to archive data
- Documentation
- Merge branches / clean the Ansible repo
- Individual password resets from Ansible
- Onboard the BB team -> meeting Nov 29
- Drop the previous instance (last dump?)
Drew’s list is a great start; how can we improve it?
I think it is outside the scope of the DB migration, but we should at least link the two, and we should add a specific ticket/process for building a better version of it.
What is important is how long we should keep data from each source. If I am correct, we are tracking the DB with a Shiny app that targets the metadata schema. In my opinion this is a good idea; one question I am unclear on is when we update metadata: right after writing anything? After the data is in “production”? (Some other point?) When we store past data we should also think about how we clean up metadata.
The Ansible repo can be used as documentation of how the schemas were dumped and restored:
- **Schema dumps**: https://github.com/ruralinnovation/ansible/blob/dce9174cbcd65bf6533a6a4516764d6760ea902c/playbooks/cori-risi-old-db/README.md
- **Schemas per database**: https://github.com/ruralinnovation/ansible/tree/dce9174cbcd65bf6533a6a4516764d6760ea902c/playbooks/cori-risi-ad-postgresql/vars
Data lifecycle?
Currently, we just have one DB and multiple schemas.
I would divide it into at least:
- projects
  - schemas for specific projects
  - staging area at the schema level
  - can be called by bots
  - no limitations on writing/reading
- sources
  - mostly raw data; could be redundant with data
  - can’t be called by bots
  - more limited writing
- data
  - data that we need frequently and that has been processed by us
  - can be called by bots?
  - no limitations on writing/reading
These three seem to have different lifecycles (i.e., when we update them, store them, or delete parts of them).
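A sketch of how the different bot/write rules above could translate into schema-level grants (schema names follow the proposal; bot and mda_team are the group roles from the appendix):

```sql
-- projects: staging area, open to bots, unrestricted read/write.
GRANT USAGE, CREATE ON SCHEMA projects TO bot, mda_team;

-- sources: raw data, not callable by bots, more limited writing.
REVOKE ALL ON SCHEMA sources FROM bot;
GRANT USAGE ON SCHEMA sources TO mda_team;

-- data: processed data we need frequently; bot access still undecided.
GRANT USAGE ON SCHEMA data TO mda_team;
```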
We also need a list of schemas and tables that can be in the prod environment.
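A starting inventory for that list can be pulled from the catalog, for example:

```sql
-- Enumerate current schemas and tables as a starting point for the
-- "allowed in prod" list (system schemas excluded).
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_schema NOT IN ('pg_catalog', 'information_schema')
ORDER BY table_schema, table_name;
```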
Rules:
Data in prod must have metadata before being pushed into prod.
Metadata is:
required:
- A readme.md in a repo with a description of the data (dimensions and what a “row” is) and the items below (should be at the schema level)
- A “cookbook” at the table level -> metadata:
  - Sources (link, date, last_updated)
  - Licenses (PD, MIT, Apache 2, etc.)
  - “Executive summary description”: a one-liner providing a quick overview of what the data is
optional:
- S3 bucket with the raw data
- Code that produces the table
- Validation tests
- Summary of validation tests
- Testing metadata columns
TODO stuff:
- Summary of validation tests
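One way the required “cookbook” fields could live in the shared metadata schema; a hypothetical table sketch, not the current layout:

```sql
-- Hypothetical "cookbook" table: one row per production table.
CREATE TABLE IF NOT EXISTS metadata.cookbook (
  table_schema text NOT NULL,
  table_name   text NOT NULL,
  source_link  text,           -- Sources: link
  source_date  date,           -- Sources: date
  last_updated date,           -- Sources: last_updated
  license      text,           -- PD, MIT, Apache 2, etc.
  summary      text,           -- one-line "executive summary description"
  PRIMARY KEY (table_schema, table_name)
);
```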
TED
shinyapp:
- entry point: app.R
- it is using cori.apps for components
- it is using doctR, and doctR needs fonts
- renv.lock is used by shinyapps.io to download the correct packages and versions
- it needs core_data