The ACS data pipeline is almost feature complete and is driven entirely by the YAML parameters file and the ACS codebook. Documentation on how to modify the ACS pipeline is available in the data-acs wiki.
As of July 22 there are two outstanding tasks on this pipeline.
The step calculating the diversity score is not currently idempotent (that is, running that step in isolation multiple times consecutively will calculate and append a diversity score variable each time – in other words, after the second run there will be two diversity score variables in the DB, after the third three, etc. This is avoided if all steps are run sequentially (which drops and re-creates the table) but each step should be idempotent). This should be addressed
A final step should be added to update the DED data as part of the ACS pipeline
We are currently storing the following ACS years: 2020, 2019, 2015, 2014
When the time comes to ETL the 2021/2016 ACS data, the 2019 and 2014 data should be retired and removed from the database. If there is a need for extensive historical ACS data, it should not come through the pipeline.
Scripts: data-acs
DB Schema: acs
Level of Measure: County, place, state, national
Update Cadence: Annual
BEA is a good resource for many economic data points, including employment, GDP, and earnings flow and income. To date, CORI mainly ingests four data sets from the BEA:
Employment by NAICS: CAEMP25N
GDP by County and NAICS: CAGDP2
Gross Flow of Earnings: CAINC91
Population/Personal Income: CAINC1
Of note for the BEA data sets, Virginia counties and independent cities are always combined.
Scripts: data-bea
DB Schema: employment_data
Products: DED
Update Cadence: Annual
BLS is a good resource for employment related statistics and CORI uses the Local Area Unemployment Statistics (LAUS) data to understand the employment, labor force, unemployment, etc. of US counties. In particular, we source the county labor force participation rate from the BLS for the DED.
Because the demand for BLS data has been low in the past two years, the process remains rough.
DB Schema: employment_data
Products: DED
DED Script (Ugly): ded_bls.R
ETL Script (Butt Ugly…): source_bls_data_annual.R
A historical note: You will notice in this repo (and elsewhere) the widespread use of ‘x.’ as a prefix for variables in R code. The initial data work done by the MDA team was done by Alex Tenenbaum (AT), one of the earliest members of this organization, which was subsequently handed off to Chen Chen. AT encouraged the use of common prefix for objects such that he could remove objects from memory using rm() without needing to ever restart his R session, a practice Chen followed. This is not a good idea and should not be emulated moving forward.
Broadbandnow is a dataset released by Microsoft that contains speed and usage information derived from proprietary Microsoft data. We use it very little, except for Broadband availability and access metric in the DED
Script: data-broadbandnow
Source: BroadbandNow Github
Products: DED
Level of Measure: ZIP Code
County Business Pattern (CBP) is another data source for employment by NAICS codes. The advantage of CBP over other employment data sources is that CBP provides 6-digit NAICS code breakdown, so if you want to check the employment for a specific job category in a county, CBP is a better choice than BEA or BLS. The CBP also provides the number of establishments along with employment.
For years prior to 2016, it was necessary to impute suppressed data for a fully useful data set. Because the CBP is so specific, there is a great deal of data suppression. After 2016, the CBP changed their methodology for dealing with small samples, using a data perturbation method for privacy protection. This renders the imputation both impossible and unnecessary.
Because of some initial confusion surrounding imputation, this data was not fully updated for the DED effort in Spring 2022. Some rough work was done in this script which will need to be updated if we intend to continue using this data.
Products: DED
Level of Measure: County
County population characteristics data. We currently only use the annual population in the DED report.
Script: ded/ded_one_off_data_updates/cpc_data_processing.R
Products: DED
FCC Form 477 data is a provider-service-block level table provided by
the FCC. This table is likely to be replaced by an address level data
set in late 2022. The current process should thus be treated essentially
as a legacy product. There are three major table patterns,
source_fcc_477_{month}{year}_{version}
which is unprocessed
source data, broadband_f477_{year}
which is processed
source data, and broadband_f477_byblock_{year}
which is the
processed block level tables. There is some variation in these patterns
(particularly around the number of underscores separating words) driven
by historical changes to the pipeline.
Two notes:
Unlike most of our ETL pipelines, this pipeline is written in Python. This is due to the FCC’s use of the service Box to store the data. The existing R interfaces to the box API are clunky and ineffective, but Box provides a Python SDK that works seamlessly
The FCC has been compressing the files using deflate64 compression, which defeats Python’s built in unzip functionality as well as the available dedicated deflate64 compression module. Linux unzip, however, deals with it handily. Thus, the initial ETL script needs to be run in a Linux environment.
Because this data set is being retired soon, any effort spent on the current pipeline is burned time. It may be worth using a manual process to limp through the last few updates.
Historically, we have retained all versions of each release. We should retain only the latest version of each release.
Scripts: fcc_form_477
DB Schema: sch_broadband
Products: BCAT
Level of Measure: Block + Provider + Technology (Raw), Block (Processed)
Update Cadence: Bi-Annual
Form D data is a relatively new data source for RISI as of July 2022. While the data is incomplete, in that it systematically under counts offerings, it has the advantage of offering very recent data, often lagged by less than a quarter.
Because this data is so dynamic, we have elected not store the data
in our database in a general sense. Instead, we use the
dform
R package to access the data dynamically.
Scripts: data_venture_capital
Package: dform
Level of Measure: Offering
Update Cadence: Quarterly
Higher education data is one of our more complex data processes. There are several moving pieces:
IPEDS and College Scorecard data. These are point data sets representing colleges and universities across the country
IPEDS completions data, which tracks degrees by SOC code
A CIP/SOC code crosswalk that allows us to track tech related degrees
A drivetime analysis that allows us to associate colleges and universities with geographies
Because the drivetimes are unlikely to change, we should avoid re-running that process wholesale as it is incredibly time consuming. One possible method would be to create a process that checks for new institutions, calculates the drivetimes for those institutions, then appends the data to an existing table. The association of drivetimes to blocks is massive, and is best done in Google BigQuery or a similar environment.
Pipeline: data_higher_ed_pipeline
Legacy code (for reference on drivetimes): data-high-education
DB Schema: higher_ed
Products: DED
Level of Measure: County
Update Cadence: Annual
LEHD data is used in two primary ways. First, the LODES geographic crosswalk is perhaps our most used crosswalk as it links blocks to a wide range of child geographies. Second, the LEHD data provides a reasonably detailed breakdown of employment by variables such as firm age, and as such we have historically used it for calculating variables such as the percent of employment in young firms.
Code: data-lodes
Package: lehdr (also available on
CRAN via install.packages
)
DB Schema: employment_data
Level of Measure: Block
Update Cadence: Annual
Location Analysis (LA) is a holdover from the early days of our team and describes a process rather than a specific set of data. The LA process exploits the fact that all U.S. census geographies are composed of Census blocks to calculate metrics across a range of geographies using block level estimates.
Typically this has been percentages of blocks, households, or population in a given area meeting a characteristic (e.g. percent of blocks served by 25/3 internet per the FCC).
This process is convenient, but has not been well utilized in 2021-2022. If this process is to be maintained, it should be re-evaluated for usefulness. For instance, using population weighted averages or something may ultimately provide more value.
For instructions on how to add a data set to the location analysis process, see the location_analysis wiki.
Scripts: location_analysis
Package: aggregatoR
DB Schema: location_analysis
The TIGER/Line data is the central source of truth for geographic boundaries in the US. There are two types of boundaries available from the TIGER/Line data, TIGER/Line boundaries and cartographic boundaries. Cartographic boundaries are simplified for representation on maps and should be used only for display and never for analysis.
Repo: data-tiger-line
Discussion of TIGER and CB files: CORIverse Wiki Page
Level of Measure: Block, County, Place, State (etc.)
Update Cadence: Annual
Primary responsibility for database management should fall to the Data Engineer.
Data that must be shared should be written to the database
Data that is cumbersome to create should be written to the database
Disposable data should not be written to the database
Tests should not be written to the database
Schemas should be thematic – they should collect tables from a single source or that reference a single concept.
A schema should contain more than one table
Schemas should be re-evaluated for their usefulness at least once per year. Given the typical cadence of work at CORI/RISI, this should likely be targeted for Spring (when things tend to be least busy).
A significant project should have a database schema. The bar or level of scrutiny for writing something to a project schema is lower than for a core schema.
Tables in project schemata should be considered fundamentally ephemeral. When starting a new project, you should never reference tables in an another project schema.
All ACS data tables
table_schema | table_name |
---|---|
acs | acs_5_yr_county |
acs | acs_5_yr_place |
acs | acs_5_yr_state |
acs | acs_5_yr_us |
All tables for Broadband County Assessment Tool
table_schema | table_name |
---|---|
bcat | acp_adoption_county |
bcat | bcat_auction_904_subsidy_awards |
bcat | bcat_broadband_100_20_unserved_blocks |
bcat | bcat_broadband_unserved_blocks |
bcat | bcat_county_broadband_farm_bill_eligibility |
bcat | bcat_county_broadband_pending_rural_dev |
bcat | bcat_county_ilecs |
bcat | bcat_county_ilecs_geo |
bcat | bcat_county_provider_summary |
bcat | bcat_county_rural_dev_broadband_protected_borrowers |
bcat | bcat_county_summary |
bcat | bcat_fiber_cable_unserved_blocks |
bcat | bcat_incumbent_electric_providers |
bcat | bcat_incumbent_electric_providers_geo |
bcat | bcat_tn_areas_wo_100x20_2022_01_17 |
bcat | bcat_tn_areas_wo_25x3_2022_01_17 |
bcat | bcat_tn_bb_grant_2022_01_17 |
bcat | bcat_tn_ebb_fund_2022_01_17 |
bcat | bcat_tn_reconnect_2022_01_17 |
bcat | bcat_wo_25x3_2022_01_17 |
bcat | county_adjacency_crosswalk |
bcat | hifld_electric_retail_service_territories |
Core data encompasses a variety of essential data sets. Additions to this schema should be 1) core to our work (e.g. the census), 2) not fit neatly into another existing schema, and 3) not merit their own new schema (e.g. a single table goes here, five tables get a new schema)
table_schema | table_name |
---|---|
core_data | county_county_agg_crosswalk |
core_data | fcc_staff_estimates |
core_data | rural_definitions_block_level_codes |
core_data | rural_definitions_block_level_flags |
core_data | rural_definitions_county_codes |
core_data | rural_definitions_county_flags |
core_data | rural_definitions_tract_codes |
core_data | rural_definitions_tract_flags |
core_data | rural_definitions_tract_level_codes |
core_data | rural_definitions_tract_level_flags |
core_data | source_census_2020_redist_h1_occupancy_status |
core_data | source_census_2020_redist_p1_race |
core_data | source_census_2020_redist_p2_hisp_lat_by_race |
core_data | source_census_2020_redist_p3_race_for_pop_18_plus |
core_data | source_census_2020_redist_p4_hisp_lat_by_race_for_pop_18_plus |
core_data | source_census_2020_redist_p5_group_quarters_pop_by_gq_type |
core_data | tract_xwalk_00_10 |
Project schema for the EDA Who is Missing analysis project (WIM) (concluded Winter/Spring 2022).. Should be removed by EOY 2023
table_schema | table_name |
---|---|
eda | acs_need_variables_co |
eda | acs_need_variables_pl |
eda | aggregated_inst_level_he_data_2sfca |
eda | broadband_service_25_3_FCC2020jun |
eda | cbsa_cims |
eda | cims_build_to_scale_awardees |
eda | cims_cbsa_geo |
eda | cims_county_geo |
eda | cims_rii_applicants |
eda | cims_rin_ecosystem |
eda | county_cims |
eda | county_indx_table |
eda | county_indx_table_w_outliers |
eda | cousub_cims |
eda | distress_ed_ds_24_month_unemployment_rate |
eda | distress_ed_ds_acs_per_capita_money_income |
eda | distress_ed_ds_bea_per_capita_personal_income |
eda | distress_metadata |
eda | distress_metros_micros_24_month_unemployment_rate |
eda | distress_metros_micros_acs_per_capita_money_income |
eda | distress_metros_micros_bea_per_capita_personal_income |
eda | distress_tracts_acs_per_capita_money_income |
eda | distress_tracts_acs_unemployment_rate |
eda | distress_us_states_and_counties_24_month_unemployment_rate |
eda | distress_us_states_and_counties_acs_per_capita_money_income |
eda | distress_us_states_and_counties_bea_per_capita_personal_income |
eda | godaddy_cbsa_long |
eda | godaddy_cbsa_wide |
eda | godaddy_county_long |
eda | godaddy_county_wide |
eda | higher_ed_cs_grad_inst_counts_county |
eda | higher_ed_cs_grad_inst_counts_place |
eda | higher_ed_inst_counts_county |
eda | higher_ed_inst_counts_place |
eda | moodys_employ_eda |
eda | moodys_employment_data_raw_src |
eda | moodys_industry_sector_eda |
eda | moodys_industry_sector_raw_src |
eda | place_cims |
eda | place_county_indx_table |
eda | place_county_indx_table_2500_75000 |
eda | place_indx_table |
eda | place_level_2sfca |
eda | places_to_institution_catchment_30min |
eda | places_to_institution_catchment_30min_ALL |
eda | puma_crosswalk |
eda | stats_america_distress_variables_co |
eda | tribal_leader_county_indicator |
eda | tribal_leader_directory_src |
Schema for employment data from all sources (e.g. BLS, BEA, LODES, etc.). BLS has not been systematically updated due to low demand, and should live on this schema moving forward.
table_schema | table_name |
---|---|
employment_data | bea_caemp25n |
employment_data | bea_cagdp2 |
employment_data | bea_cainc1 |
employment_data | bea_cainc4 |
employment_data | lodes_firm_age |
employment_data | lodes_geography_crosswalk |
employment_data | lodes_od_county_all_jobs |
employment_data | lodes_wac_all_jobs |
employment_data | source_lodes_geography_crosswalk_2018 |
employment_data | source_lodes_od_jt00_2017 |
employment_data | source_lodes_od_jt00_2018 |
employment_data | source_lodes_wac_jt00_2017 |
employment_data | source_lodes_wac_jt00_2018 |
employment_data | source_lodes_wac_jt02_2017 |
employment_data | source_lodes_wac_jt02_2018 |
Schema for Location Analysis tables
table_schema | table_name |
---|---|
location_analysis | field_metadata_la |
location_analysis | la_cbsa |
location_analysis | la_county |
location_analysis | la_county_sub |
location_analysis | la_place |
location_analysis | la_state |
location_analysis | source_metadata_la |
Home for metadata tables
table_schema | table_name |
---|---|
metadata | acs_metadata |
metadata | field_metadat |
metadata | field_metadata |
metadata | pipeline_diagnostics |
metadata | product_metadata |
metadata | source_metadata |
metadata | table_metadata |
Home for RII Diagnostic tables
table_schema | table_name |
---|---|
rii_diagnostic | county_diagnostic_variables |
rii_diagnostic | county_inst_map |
rii_diagnostic | diagnostic_visualizations |
rii_diagnostic | diagnostic_visualizations_bup_20220425 |
rii_diagnostic | diagnostic_visualizations_bup_20220826 |
rii_diagnostic | national_diagnostic_variables |
rii_diagnostic | rin_counties |
rii_diagnostic | state_diagnostic_data |
Project schema for the RWJF project
table_schema | table_name |
---|---|
rwjf | acp_typology |
rwjf | aiannh_2020_county_dta |
rwjf | aiannh_2020_tract_dta |
rwjf | census_block_population_2010 |
rwjf | colonia_counties |
rwjf | colonia_tracts |
rwjf | decoding_rural_america_variables |
rwjf | decoding_rural_america_variables_county |
rwjf | decoding_rural_america_variables_tract |
rwjf | deep_disadvantage_codebook |
rwjf | deep_disadvantage_counties |
rwjf | hud_colonia_communities |
rwjf | nhgis_census_pop_1790_2020 |
rwjf | nhgis_census_race_1980_2020 |
rwjf | persistent_poverty_counties |
rwjf | reinvisioning_rural_typology |
rwjf | social_vulnerability_index |
rwjf | ur_america_typology |
Schema for Broadband related tables (particularly the FCC Form 477).
table_schema | table_name |
---|---|
sch_broadband | broadband_f477_2014dec |
sch_broadband | broadband_f477_2016dec |
sch_broadband | broadband_f477_2017dec |
sch_broadband | broadband_f477_2018dec |
sch_broadband | broadband_f477_2018dec_v2 |
sch_broadband | broadband_f477_2018jun |
sch_broadband | broadband_f477_2019dec |
sch_broadband | broadband_f477_2019jun |
sch_broadband | broadband_f477_2020december_v1 |
sch_broadband | broadband_f477_2020june_v1 |
sch_broadband | broadband_f477_2021june_v1 |
sch_broadband | broadband_f477_byblock_2014dec |
sch_broadband | broadband_f477_byblock_2015dec |
sch_broadband | broadband_f477_byblock_2015jun |
sch_broadband | broadband_f477_byblock_2016dec |
sch_broadband | broadband_f477_byblock_2016jun |
sch_broadband | broadband_f477_byblock_2017dec |
sch_broadband | broadband_f477_byblock_2017jun |
sch_broadband | broadband_f477_byblock_2018dec |
sch_broadband | broadband_f477_byblock_2018dec_v2 |
sch_broadband | broadband_f477_byblock_2018jun |
sch_broadband | broadband_f477_byblock_2019dec |
sch_broadband | broadband_f477_byblock_2019jun |
sch_broadband | broadband_f477_byblock_2020december_v1 |
sch_broadband | broadband_f477_byblock_2020june_v1 |
sch_broadband | broadband_f477_byblock_2021june_v1 |
sch_broadband | county_cable_providers |
sch_broadband | county_fiber_providers |
sch_broadband | county_providers |
sch_broadband | fcc_477_tech_code_crosswalk |
sch_broadband | source__fcc_477_dec14_v2_20190125 |
sch_broadband | source__fcc_477_dec15_v2 |
sch_broadband | source__fcc_477_dec15_v3_20190125 |
sch_broadband | source__fcc_477_dec16_v1_20190125 |
sch_broadband | source__fcc_477_dec17_v1_20190125 |
sch_broadband | source__fcc_477_dec17_v1_20190508 |
sch_broadband | source__fcc_477_dec17_v2_20191024 |
sch_broadband | source__fcc_477_dec18_v1_20200204 |
sch_broadband | source__fcc_477_dec18_v2_20200824 |
sch_broadband | source__fcc_477_dec19_v1_20201216 |
sch_broadband | source__fcc_477_jun15_v3 |
sch_broadband | source__fcc_477_jun15_v4_20190125 |
sch_broadband | source__fcc_477_jun16_v2 |
sch_broadband | source__fcc_477_jun16_v3_20190125 |
sch_broadband | source__fcc_477_jun17_v1 |
sch_broadband | source__fcc_477_jun17_v2_20190125 |
sch_broadband | source__fcc_477_jun18_v1_20191024 |
sch_broadband | source__fcc_477_jun19_v1_20200610 |
sch_broadband | source_fcc_477_dec2014_v3 |
sch_broadband | source_fcc_477_dec2015_v4 |
sch_broadband | source_fcc_477_dec2016_v2 |
sch_broadband | source_fcc_477_dec2017_v3 |
sch_broadband | source_fcc_477_dec2018_v3 |
sch_broadband | source_fcc_477_december_2020_v1 |
sch_broadband | source_fcc_477_jun_2020_v1 |
sch_broadband | source_fcc_477_june_2020_v1 |
Schema for TIGER/Line data. Tables flagged as ‘cb’ are cartographic boundary files for mapping, as opposed to ‘tiger’ tables for spatial analysis.
table_schema | table_name |
---|---|
sch_census_tiger | bcat_county_summary |
sch_census_tiger | source_cb_2019_cbsa |
sch_census_tiger | source_cb_2019_county |
sch_census_tiger | source_cb_2019_cousub |
sch_census_tiger | source_cb_2019_place |
sch_census_tiger | source_cb_2019_puma |
sch_census_tiger | source_cb_2019_state |
sch_census_tiger | source_cb_2019_uac |
sch_census_tiger | source_tiger_2019_blockgroup |
sch_census_tiger | source_tiger_2019_blocks |
sch_census_tiger | source_tiger_2019_cbsa |
sch_census_tiger | source_tiger_2019_county |
sch_census_tiger | source_tiger_2019_cousub |
sch_census_tiger | source_tiger_2019_place |
sch_census_tiger | source_tiger_2019_puma |
sch_census_tiger | source_tiger_2019_state |
sch_census_tiger | source_tiger_2019_tract |
sch_census_tiger | source_tiger_2019_uac |
sch_census_tiger | tiger_2019_cbsa |
sch_census_tiger | tiger_2019_county |
sch_census_tiger | tiger_2019_cousub |
sch_census_tiger | tiger_2019_place |
sch_census_tiger | tiger_2019_state |
sch_census_tiger | tiger_aiannh_areas_2018 |
This is a legacy database schema, previously used for processed mappable tables (i.e. ‘layers’). Do not write new data to this schema. Do not transfer it to the new database. Remove by EOY 2023 or sooner.
table_schema | table_name |
---|---|
sch_layer | bea_caemp25n_county_01_18 |
sch_layer | bea_caemp25n_county_01_19 |
sch_layer | bea_caemp25n_state_01_18 |
sch_layer | bea_caemp25n_state_01_19 |
sch_layer | bea_caemp25n_us_01_18 |
sch_layer | bea_caemp25n_us_01_19 |
sch_layer | bea_cagdp2_county_01_18 |
sch_layer | bea_cagdp2_us_01_18 |
sch_layer | bea_cainc4_county_69_18 |
sch_layer | bea_cainc4_state_69_18 |
sch_layer | bea_cainc4_us_69_18 |
sch_layer | bea_gross_flow_earnings_90_18 |
sch_layer | bea_imputed_emp_2002_2018 |
sch_layer | bea_imputed_emp_2018 |
sch_layer | bea_info_employment_01_18 |
sch_layer | bea_mark_variables_2018 |
sch_layer | bea_self_employment_01_18 |
sch_layer | bea_total_employment_01_18 |
sch_layer | bea_total_employment_01_19 |
sch_layer | bea_total_gdp_01_18 |
sch_layer | bea_total_info_gdp_01_18 |
sch_layer | bea_total_pop_70_18 |
sch_layer | bl_geoid_lookup |
sch_layer | bl_geoid_lookup_all |
sch_layer | bls_laus_county_annual_90_19 |
sch_layer | bls_laus_county_annual_90_20 |
sch_layer | bls_laus_county_monthly |
sch_layer | broadbandnow |
sch_layer | cbp_software_publishers |
sch_layer | cbp_tech_emp_2018 |
sch_layer | census2010_bg_race |
sch_layer | census2010_tract_race |
sch_layer | computer_science_graduates_15_17_drive_time |
sch_layer | computer_science_graduates_15_17_pt |
sch_layer | computer_science_graduates_16_18_drive_time |
sch_layer | computer_science_graduates_16_18_pt |
sch_layer | computer_science_graduates_by_county |
sch_layer | computer_science_graduates_by_county_2021 |
sch_layer | county_population_characteristics_2000_2019 |
sch_layer | county_race_score |
sch_layer | county_zip_to_city_xwalk |
sch_layer | covid_19_county_latest |
sch_layer | covid_19_county_long |
sch_layer | covid_county_chris |
sch_layer | cpc_2010_2019_age_race_sex |
sch_layer | cpc_county_year |
sch_layer | eda_investments_fy12_to_fy17_app |
sch_layer | fdic_bank_info_06_2021 |
sch_layer | godaddy_county |
sch_layer | high_ed_drive_time_30_60_90_mins |
sch_layer | high_ed_drive_time_30_mins |
sch_layer | high_ed_drive_time_60_mins |
sch_layer | high_ed_drive_time_90_mins |
sch_layer | high_ed_field_of_study_ipeds |
sch_layer | high_ed_points |
sch_layer | higher_ed_40_min_drivetimes |
sch_layer | higher_ed_blocks_2021 |
sch_layer | higher_ed_blocks_40min_2021 |
sch_layer | higher_ed_inst_and_block_level |
sch_layer | imputed_cbp_high_tech_emp_12_16 |
sch_layer | imputed_cbp_high_tech_emp_county_2016 |
sch_layer | la_highed_by_block_2021jun |
sch_layer | lodes_county_2017 |
sch_layer | lodes_county_by_industry_2002_2018 |
sch_layer | lodes_od_county |
sch_layer | lodes_wac_bg |
sch_layer | microsoft_county_geo |
sch_layer | natural_amenity_score_county |
sch_layer | non_metro_od_with_dist_mi |
sch_layer | race_ethnicity_breakdown |
sch_layer | rii_cohort_applicants |
sch_layer | rural_by_state_cbsa |
sch_layer | rural_by_state_cmsrhc |
sch_layer | rural_by_state_eda |
sch_layer | rural_by_state_far |
sch_layer | rural_by_state_fhfa |
sch_layer | rural_by_state_forhp |
sch_layer | rural_by_state_lisc |
sch_layer | rural_by_state_msa |
sch_layer | rural_by_state_ruca |
sch_layer | rural_by_state_rucc |
sch_layer | rural_by_state_uc |
sch_layer | rural_by_state_uic |
sch_layer | rural_by_state_usda |
sch_layer | rural_definitions_all |
This is a legacy database schema used to store raw source data. Its continued use is optional. It is most useful for large datasets such as the form 477, where the time cost to load the data into postgres is high. If the new FCC process moves out of Postgres, as it likely should, this schema should be retired and emptied.
table_schema | table_name |
---|---|
sch_source | acp_county_type |
sch_source | ami_rural_definition |
sch_source | bea_caemp25n_01_18 |
sch_source | bea_caemp25n_01_19 |
sch_source | bea_cainc4_69_18 |
sch_source | bea_gdp_01_18 |
sch_source | bea_gross_flow_earnings_90_18 |
sch_source | bea_total_emp_01_18 |
sch_source | bea_total_emp_01_19 |
sch_source | bls_laus_90_19 |
sch_source | bls_laus_90_20 |
sch_source | bls_laus_county_monthly |
sch_source | broadband_auction_903_subsidy_awards |
sch_source | broadband_auction_904_subsidy_awards |
sch_source | broadband_farm_bill_eligibility |
sch_source | broadbandnow_20201216 |
sch_source | business_stops_and_starts |
sch_source | cbp_all_emp_estab_county_1986_2018 |
sch_source | cbp_all_emp_estab_state_1986_2018 |
sch_source | cbp_all_emp_estab_us_1986_2018 |
sch_source | cbp_ict_naics_county_2012_2018 |
sch_source | cbp_ict_naics_state_2012_2018 |
sch_source | cbp_ict_naics_us_2012_2018 |
sch_source | county_business_patterns_2018 |
sch_source | cpc_2010_2019_age_race_sex |
sch_source | effingham_served_roads |
sch_source | fdic_institutions |
sch_source | high_ed_field_of_study |
sch_source | high_ed_field_of_study_ipeds |
sch_source | high_education_college_scorecard |
sch_source | ilecs_geo |
sch_source | microsoft_broadband_20201221 |
sch_source | nber_county_adjacency |
sch_source | opportunity_zone_eligibility_pg |
sch_source | prison_boundaries_base |
sch_source | rural_dev_broadband_pending |
sch_source | source_bea_caemp25_2020 |
sch_source | source_bea_cagdp2_2020 |
sch_source | source_bea_cainc1_2020 |
sch_source | source_bea_cainc4_2020 |
sch_source | source_bea_cainc91_2020 |
sch_source | source_fcc_477_june_2021_v1 |
sch_source | source_rural_definition_cbsa_2010 |
sch_source | source_rural_definition_cbsa_2013 |
sch_source | source_rural_definition_cbsa_2014 |
sch_source | source_rural_definition_cbsa_2019 |
sch_source | source_rural_definition_cbsa_2020 |
sch_source | source_rural_definition_cdc |
sch_source | source_rural_definition_far |
sch_source | source_rural_definition_fhfa |
sch_source | source_rural_definition_forhp |
sch_source | source_rural_definition_ruca |
sch_source | source_rural_definition_rucc |
sch_source | source_rural_definition_uic |
Project schema for the a broadband project in VT concluded in 2021. It should be destroyed by EOY 2023
table_schema | table_name |
---|---|
vt_broadband | buffer_att_cell_signal_0_5_mile |
vt_broadband | buffer_att_towers_5_mile |
vt_broadband | buffer_tmobile_cell_signal_0_5_mile |
vt_broadband | buffer_tower_248a_3_5_mile |
vt_broadband | buffer_tower_nrb_3_5_mile |
vt_broadband | buffer_uscellular_cell_signal_0_5_mile |
vt_broadband | buffer_verizon_cell_signal_0_5_mile |
vt_broadband | buffer_vtel_reconnect_5_mile |
vt_broadband | cable_fiber_label_buffer |
vt_broadband | cabled_roads_southern_vermont_cud |
vt_broadband | cell_all_providers_buffer_count |
vt_broadband | cell_att_buffer_count |
vt_broadband | cell_towers_buffer_count |
vt_broadband | cell_verizon_buffer_count |
vt_broadband | connectivity_initiative_grant_2020 |
vt_broadband | connectivity_initiative_grant_2021 |
vt_broadband | CTC_cell_propagation_700MHz_Outdoor |
vt_broadband | fiber_cable_road_pct_by_town |
vt_broadband | non_residential_fiber_flag |
vt_broadband | psd_address_cable_fiber_label |
vt_broadband | raw_addresses |
vt_broadband | raw_cable_segments |
vt_broadband | raw_cell_drive_test_att |
vt_broadband | raw_cell_drive_test_tmobile |
vt_broadband | raw_cell_drive_test_uscellular |
vt_broadband | raw_cell_drive_test_verizon |
vt_broadband | raw_connectivity_initiative_grant_2020 |
vt_broadband | raw_connectivity_initiative_grant_2020_20210224 |
vt_broadband | raw_e911_roads |
vt_broadband | raw_fiber_segments |
vt_broadband | raw_road_centerline |
vt_broadband | raw_road_premises |
vt_broadband | raw_tower_248_pt |
vt_broadband | raw_tower_nrb_pt |
vt_broadband | raw_towns |
vt_broadband | raw_vtel_reconnect_program |
vt_broadband | road_premises_in_propagation |
vt_broadband | tower_248a_nrb_buffer_count |
vt_broadband | vermont_700mhz_outdoor_union |
vt_broadband | vt_cable_segments |
vt_broadband | vt_cud_boundaries |
vt_broadband | vt_cud_boundaries_combined |
vt_broadband | vt_data_broadband_status_2019 |
vt_broadband | vt_data_broadband_status_2019_1 |
vt_broadband | vt_data_broadband_status_2019_1_unserved |
vt_broadband | vt_roads |
vt_broadband | vtel_reconnect_buffer_count |
Schema for venture capital data.
table_schema | table_name |
---|---|
venture_capital | formd_22Q1_19Q1_combined |
venture_capital | formd_county |
venture_capital | formd_county_year |
venture_capital | venture_capital_modeling_datav1 |
Schema for all higher education related data
table_schema | table_name |
---|---|
higher_ed | cip_soc_xwalk |
higher_ed | college_scorecard |
higher_ed | county_cs_grad_percentile_by_cbsa_type |
higher_ed | cs_grads |
higher_ed | degrees_by_inst_and_cip |
higher_ed | field_of_study |
higher_ed | high_ed_drive_time_30_60_90_mins |
higher_ed | ipeds_completions |
higher_ed | ipeds_inst_characteristics |
higher_ed | total_degrees_by_inst |
Schema for historical census tables. Experimental. If empty by EOY 2022, remove.
table_schema | table_name |
---|---|
historical_census_data | historical_census_county_crosswalk |
historical_census_data | pop_race_nhgis_1840_county |
historical_census_data | pop_race_nhgis_1850_county |
historical_census_data | pop_race_nhgis_1860_county |
historical_census_data | pop_race_nhgis_1870_county |
historical_census_data | pop_race_nhgis_1880_county |
historical_census_data | pop_race_nhgis_1890_county |
historical_census_data | pop_race_nhgis_1900_county |
historical_census_data | pop_race_nhgis_1910_county |
historical_census_data | pop_race_nhgis_1920_county |
historical_census_data | pop_race_nhgis_1930_county |
historical_census_data | pop_race_nhgis_1940_county |
historical_census_data | pop_race_nhgis_1950_county |
historical_census_data | pop_race_nhgis_1960_county |
historical_census_data | pop_race_nhgis_1970_county |
historical_census_data | pop_race_nhgis_1980_county |
historical_census_data | pop_race_nhgis_1990_county |
historical_census_data | pop_race_nhgis_2000_county |
historical_census_data | pop_race_nhgis_2010_county |
historical_census_data | pop_race_nhgis_2020_county |
If it is necessary to use Airtable, rairtable
is the
best interface for our use. Install with
install.packages('rairtable')
.
However, Airtable should not be supported as an MDA data access tool under any circumstances. It is time consuming to work with in the best of cases. Worse, database compatible names are typically unacceptable to client teams, rendering direct updates inconvenient at best and impossible at worst. There are no access controls and manual data edits are simple, so changes – intentional or otherwise – made by client teams will persist indefinitely.
Airtable was fine for quick prototyping in the early days of our team, but it is fundamentally incompatible with data integrity and storage best practices and should be disavowed with vigor. In the best case, we should delete all MDA data from Airtable and forcefully discontinue support. If this happens I’ll eat my hat.
BCAT is an application that integrates a variety of county and national level broadband metrics. It was originally developed for a project in Tennessee, but a national level version also exists. John Hall is the best reference for the inner workings of the BCAT.
The pipeline for this tool is rough and in some cases assumes that now-defunct architecture and tables exist. It should receive an overhaul, not least to accommodate the deployment of the API designed and built for our team by Merging Futures in the summer of 2022.
DB Schema: bcat
Pipeline: broadband_county_assessment_tool_pipeline
App Repo: broadband_county_assessment_tool
The RII Diagnostic report (as of August 22 called the Digital Economy Diagnostic, or DED; previously known as the DEE-R or Digital Economy Ecosystem Report) is one of the MDA team’s main data products.
It consists, in essence, of a data visualization API housed in a Shiny app, and generates data and graphic assets for the construction of the RII diagnostic reports.
Notes:
As of 8/22, maps are not well supported as DED output. Adding new maps or modifying the existing map output is a moderate to large lift (depending on the ask). This will need to be supported by one or more MDA software developers
The higher education map is not updated in the same way as any of
the other data visualizations. All of the processing happens in doctR::make_map()
and is dependent on a handful of external tables. Additional development
on this feature should not be taken lightly.
Outstanding tasks in this process include the following:
Integrating data updates into their respective data pipelines
Updating data (continuous need)
Detailed documentation on how to use the system is in the doctR wiki linked below.
Recording of Overview Meeting w/ RII team: https://drive.google.com/file/d/1Gv5uC6bZD-B7D_nATU7RH8FxMYhLuqgc/view?usp=sharing
Recording of Technical Zoom Meeting: https://drive.google.com/drive/folders/1jUTQ2gMPpZ--foFQPGZi7Jy_-nT7Vw0-?usp=sharing
Viz Definitions + Misc Code: ded
Package: doctR
Package Docs: doctR wiki
DED Portal Code: clinic
DB Schema: rii_diagostic
Legacy Code (non-functional): dee-r
NB: the ‘doctor’ puns make the most sense if you pronounce DED as “dead”. What do you do when something might be DED? Call the doctR. I couldn’t help myself. MR 8/10/22
The T3 is a high traffic, high value tool for the organization. However, it is woefully out of date and a bastion of tech debt. It was built quickly on a shoestring budget without the benefit of development best practices.
The current T3 codebase should receive no new development.
The script that created the data for this tool lives in the
temp
folder, and is titled
shiny_data_prep.R
.