DATA

ACS

The ACS data pipeline is almost feature complete and is driven entirely by the YAML parameters file and the ACS codebook. Documentation on how to modify the ACS pipeline is available in the data-acs wiki.

As of July 22 there are two outstanding tasks on this pipeline.

The step calculating the diversity score is not currently idempotent (that is, running that step in isolation multiple times consecutively will calculate and append a diversity score variable each time – in other words, after the second run there will be two diversity score variables in the DB, after the third three, etc. This is avoided if all steps are run sequentially (which drops and re-creates the table) but each step should be idempotent). This should be addressed
A final step should be added to update the DED data as part of the ACS pipeline

We are currently storing the following ACS years: 2020, 2019, 2015, 2014

When the time comes to ETL the 2021/2016 ACS data, the 2019 and 2014 data should be retired and removed from the database. If there is a need for extensive historical ACS data, it should not come through the pipeline.

Scripts: data-acs

DB Schema: acs

Level of Measure: County, place, state, national

Update Cadence: Annual

BEA

BEA is a good resource for many economic data points, including employment, GDP, and earnings flow and income. To date, CORI mainly ingests four data sets from the BEA:

Employment by NAICS: CAEMP25N
GDP by County and NAICS: CAGDP2
Gross Flow of Earnings: CAINC91
Population/Personal Income: CAINC1

Of note for the BEA data sets, Virginia counties and independent cities are always combined.

Scripts: data-bea

DB Schema: employment_data

Products: DED

Update Cadence: Annual

BLS

BLS is a good resource for employment related statistics and CORI uses the Local Area Unemployment Statistics (LAUS) data to understand the employment, labor force, unemployment, etc. of US counties. In particular, we source the county labor force participation rate from the BLS for the DED.

Because the demand for BLS data has been low in the past two years, the process remains rough.

DB Schema: employment_data

Products: DED

DED Script (Ugly): ded_bls.R

ETL Script (Butt Ugly…): source_bls_data_annual.R

A historical note: You will notice in this repo (and elsewhere) the widespread use of ‘x.’ as a prefix for variables in R code. The initial data work done by the MDA team was done by Alex Tenenbaum (AT), one of the earliest members of this organization, which was subsequently handed off to Chen Chen. AT encouraged the use of common prefix for objects such that he could remove objects from memory using rm() without needing to ever restart his R session, a practice Chen followed. This is not a good idea and should not be emulated moving forward.

Broadbandnow

Broadbandnow is a dataset released by Microsoft that contains speed and usage information derived from proprietary Microsoft data. We use it very little, except for Broadband availability and access metric in the DED

Script: data-broadbandnow

Source: BroadbandNow Github

Products: DED

Level of Measure: ZIP Code

CBP

County Business Pattern (CBP) is another data source for employment by NAICS codes. The advantage of CBP over other employment data sources is that CBP provides 6-digit NAICS code breakdown, so if you want to check the employment for a specific job category in a county, CBP is a better choice than BEA or BLS. The CBP also provides the number of establishments along with employment.

For years prior to 2016, it was necessary to impute suppressed data for a fully useful data set. Because the CBP is so specific, there is a great deal of data suppression. After 2016, the CBP changed their methodology for dealing with small samples, using a data perturbation method for privacy protection. This renders the imputation both impossible and unnecessary.

Because of some initial confusion surrounding imputation, this data was not fully updated for the DED effort in Spring 2022. Some rough work was done in this script which will need to be updated if we intend to continue using this data.

Products: DED

Level of Measure: County

CPC

County population characteristics data. We currently only use the annual population in the DED report.

Script: ded/ded_one_off_data_updates/cpc_data_processing.R

Products: DED

FCC Form 477

FCC Form 477 data is a provider-service-block level table provided by the FCC. This table is likely to be replaced by an address level data set in late 2022. The current process should thus be treated essentially as a legacy product. There are three major table patterns, source_fcc_477_{month}{year}_{version} which is unprocessed source data, broadband_f477_{year} which is processed source data, and broadband_f477_byblock_{year} which is the processed block level tables. There is some variation in these patterns (particularly around the number of underscores separating words) driven by historical changes to the pipeline.

Two notes:

Unlike most of our ETL pipelines, this pipeline is written in Python. This is due to the FCC’s use of the service Box to store the data. The existing R interfaces to the box API are clunky and ineffective, but Box provides a Python SDK that works seamlessly
The FCC has been compressing the files using deflate64 compression, which defeats Python’s built in unzip functionality as well as the available dedicated deflate64 compression module. Linux unzip, however, deals with it handily. Thus, the initial ETL script needs to be run in a Linux environment.

Because this data set is being retired soon, any effort spent on the current pipeline is burned time. It may be worth using a manual process to limp through the last few updates.

Historically, we have retained all versions of each release. We should retain only the latest version of each release.

Scripts: fcc_form_477

DB Schema: sch_broadband

Products: BCAT

Level of Measure: Block + Provider + Technology (Raw), Block (Processed)

Update Cadence: Bi-Annual

Form D

Form D data is a relatively new data source for RISI as of July 2022. While the data is incomplete, in that it systematically under counts offerings, it has the advantage of offering very recent data, often lagged by less than a quarter.

Because this data is so dynamic, we have elected not store the data in our database in a general sense. Instead, we use the dform R package to access the data dynamically.

Scripts: data_venture_capital

Package: dform

Level of Measure: Offering

Update Cadence: Quarterly

Higher Education

Higher education data is one of our more complex data processes. There are several moving pieces:

IPEDS and College Scorecard data. These are point data sets representing colleges and universities across the country
IPEDS completions data, which tracks degrees by SOC code
A CIP/SOC code crosswalk that allows us to track tech related degrees
A drivetime analysis that allows us to associate colleges and universities with geographies

Because the drivetimes are unlikely to change, we should avoid re-running that process wholesale as it is incredibly time consuming. One possible method would be to create a process that checks for new institutions, calculates the drivetimes for those institutions, then appends the data to an existing table. The association of drivetimes to blocks is massive, and is best done in Google BigQuery or a similar environment.

Pipeline: data_higher_ed_pipeline

Legacy code (for reference on drivetimes): data-high-education

DB Schema: higher_ed

Products: DED

Level of Measure: County

Update Cadence: Annual

LODES/LEHD

LEHD data is used in two primary ways. First, the LODES geographic crosswalk is perhaps our most used crosswalk as it links blocks to a wide range of child geographies. Second, the LEHD data provides a reasonably detailed breakdown of employment by variables such as firm age, and as such we have historically used it for calculating variables such as the percent of employment in young firms.

Code: data-lodes

Package: lehdr (also available on CRAN via install.packages)

DB Schema: employment_data

Level of Measure: Block

Update Cadence: Annual

Location Analysis

Location Analysis (LA) is a holdover from the early days of our team and describes a process rather than a specific set of data. The LA process exploits the fact that all U.S. census geographies are composed of Census blocks to calculate metrics across a range of geographies using block level estimates.

Typically this has been percentages of blocks, households, or population in a given area meeting a characteristic (e.g. percent of blocks served by 25/3 internet per the FCC).

This process is convenient, but has not been well utilized in 2021-2022. If this process is to be maintained, it should be re-evaluated for usefulness. For instance, using population weighted averages or something may ultimately provide more value.

For instructions on how to add a data set to the location analysis process, see the location_analysis wiki.

Scripts: location_analysis

Package: aggregatoR

DB Schema: location_analysis

TIGER/Line

The TIGER/Line data is the central source of truth for geographic boundaries in the US. There are two types of boundaries available from the TIGER/Line data, TIGER/Line boundaries and cartographic boundaries. Cartographic boundaries are simplified for representation on maps and should be used only for display and never for analysis.

Repo: data-tiger-line

Discussion of TIGER and CB files: CORIverse Wiki Page

Level of Measure: Block, County, Place, State (etc.)

Update Cadence: Annual

DATABASE

Primary responsibility for database management should fall to the Data Engineer.

Philosophy

Data that must be shared should be written to the database
Data that is cumbersome to create should be written to the database
Disposable data should not be written to the database
Tests should not be written to the database
Schemas should be thematic – they should collect tables from a single source or that reference a single concept.
A schema should contain more than one table
Schemas should be re-evaluated for their usefulness at least once per year. Given the typical cadence of work at CORI/RISI, this should likely be targeted for Spring (when things tend to be least busy).
A significant project should have a database schema. The bar or level of scrutiny for writing something to a project schema is lower than for a core schema.
Tables in project schemata should be considered fundamentally ephemeral. When starting a new project, you should never reference tables in an another project schema.

acs

All ACS data tables

table_schema	table_name
acs	acs_5_yr_county
acs	acs_5_yr_place
acs	acs_5_yr_state
acs	acs_5_yr_us

bcat

All tables for Broadband County Assessment Tool

table_schema	table_name
bcat	acp_adoption_county
bcat	bcat_auction_904_subsidy_awards
bcat	bcat_broadband_100_20_unserved_blocks
bcat	bcat_broadband_unserved_blocks
bcat	bcat_county_broadband_farm_bill_eligibility
bcat	bcat_county_broadband_pending_rural_dev
bcat	bcat_county_ilecs
bcat	bcat_county_ilecs_geo
bcat	bcat_county_provider_summary
bcat	bcat_county_rural_dev_broadband_protected_borrowers
bcat	bcat_county_summary
bcat	bcat_fiber_cable_unserved_blocks
bcat	bcat_incumbent_electric_providers
bcat	bcat_incumbent_electric_providers_geo
bcat	bcat_tn_areas_wo_100x20_2022_01_17
bcat	bcat_tn_areas_wo_25x3_2022_01_17
bcat	bcat_tn_bb_grant_2022_01_17
bcat	bcat_tn_ebb_fund_2022_01_17
bcat	bcat_tn_reconnect_2022_01_17
bcat	bcat_wo_25x3_2022_01_17
bcat	county_adjacency_crosswalk
bcat	hifld_electric_retail_service_territories

core_data

Core data encompasses a variety of essential data sets. Additions to this schema should be 1) core to our work (e.g. the census), 2) not fit neatly into another existing schema, and 3) not merit their own new schema (e.g. a single table goes here, five tables get a new schema)

table_schema	table_name
core_data	county_county_agg_crosswalk
core_data	fcc_staff_estimates
core_data	rural_definitions_block_level_codes
core_data	rural_definitions_block_level_flags
core_data	rural_definitions_county_codes
core_data	rural_definitions_county_flags
core_data	rural_definitions_tract_codes
core_data	rural_definitions_tract_flags
core_data	rural_definitions_tract_level_codes
core_data	rural_definitions_tract_level_flags
core_data	source_census_2020_redist_h1_occupancy_status
core_data	source_census_2020_redist_p1_race
core_data	source_census_2020_redist_p2_hisp_lat_by_race
core_data	source_census_2020_redist_p3_race_for_pop_18_plus
core_data	source_census_2020_redist_p4_hisp_lat_by_race_for_pop_18_plus
core_data	source_census_2020_redist_p5_group_quarters_pop_by_gq_type
core_data	tract_xwalk_00_10

eda

Project schema for the EDA Who is Missing analysis project (WIM) (concluded Winter/Spring 2022).. Should be removed by EOY 2023

table_schema	table_name
eda	acs_need_variables_co
eda	acs_need_variables_pl
eda	aggregated_inst_level_he_data_2sfca
eda	broadband_service_25_3_FCC2020jun
eda	cbsa_cims
eda	cims_build_to_scale_awardees
eda	cims_cbsa_geo
eda	cims_county_geo
eda	cims_rii_applicants
eda	cims_rin_ecosystem
eda	county_cims
eda	county_indx_table
eda	county_indx_table_w_outliers
eda	cousub_cims
eda	distress_ed_ds_24_month_unemployment_rate
eda	distress_ed_ds_acs_per_capita_money_income
eda	distress_ed_ds_bea_per_capita_personal_income
eda	distress_metadata
eda	distress_metros_micros_24_month_unemployment_rate
eda	distress_metros_micros_acs_per_capita_money_income
eda	distress_metros_micros_bea_per_capita_personal_income
eda	distress_tracts_acs_per_capita_money_income
eda	distress_tracts_acs_unemployment_rate
eda	distress_us_states_and_counties_24_month_unemployment_rate
eda	distress_us_states_and_counties_acs_per_capita_money_income
eda	distress_us_states_and_counties_bea_per_capita_personal_income
eda	godaddy_cbsa_long
eda	godaddy_cbsa_wide
eda	godaddy_county_long
eda	godaddy_county_wide
eda	higher_ed_cs_grad_inst_counts_county
eda	higher_ed_cs_grad_inst_counts_place
eda	higher_ed_inst_counts_county
eda	higher_ed_inst_counts_place
eda	moodys_employ_eda
eda	moodys_employment_data_raw_src
eda	moodys_industry_sector_eda
eda	moodys_industry_sector_raw_src
eda	place_cims
eda	place_county_indx_table
eda	place_county_indx_table_2500_75000
eda	place_indx_table
eda	place_level_2sfca
eda	places_to_institution_catchment_30min
eda	places_to_institution_catchment_30min_ALL
eda	puma_crosswalk
eda	stats_america_distress_variables_co
eda	tribal_leader_county_indicator
eda	tribal_leader_directory_src

employment_data

Schema for employment data from all sources (e.g. BLS, BEA, LODES, etc.). BLS has not been systematically updated due to low demand, and should live on this schema moving forward.

table_schema	table_name
employment_data	bea_caemp25n
employment_data	bea_cagdp2
employment_data	bea_cainc1
employment_data	bea_cainc4
employment_data	lodes_firm_age
employment_data	lodes_geography_crosswalk
employment_data	lodes_od_county_all_jobs
employment_data	lodes_wac_all_jobs
employment_data	source_lodes_geography_crosswalk_2018
employment_data	source_lodes_od_jt00_2017
employment_data	source_lodes_od_jt00_2018
employment_data	source_lodes_wac_jt00_2017
employment_data	source_lodes_wac_jt00_2018
employment_data	source_lodes_wac_jt02_2017
employment_data	source_lodes_wac_jt02_2018

location_analysis

Schema for Location Analysis tables

table_schema	table_name
location_analysis	field_metadata_la
location_analysis	la_cbsa
location_analysis	la_county
location_analysis	la_county_sub
location_analysis	la_place
location_analysis	la_state
location_analysis	source_metadata_la

metadata

Home for metadata tables

table_schema	table_name
metadata	acs_metadata
metadata	field_metadat
metadata	field_metadata
metadata	pipeline_diagnostics
metadata	product_metadata
metadata	source_metadata
metadata	table_metadata

rii_diagnostic

Home for RII Diagnostic tables

table_schema	table_name
rii_diagnostic	county_diagnostic_variables
rii_diagnostic	county_inst_map
rii_diagnostic	diagnostic_visualizations
rii_diagnostic	diagnostic_visualizations_bup_20220425
rii_diagnostic	diagnostic_visualizations_bup_20220826
rii_diagnostic	national_diagnostic_variables
rii_diagnostic	rin_counties
rii_diagnostic	state_diagnostic_data

rwjf

Project schema for the RWJF project

table_schema	table_name
rwjf	acp_typology
rwjf	aiannh_2020_county_dta
rwjf	aiannh_2020_tract_dta
rwjf	census_block_population_2010
rwjf	colonia_counties
rwjf	colonia_tracts
rwjf	decoding_rural_america_variables
rwjf	decoding_rural_america_variables_county
rwjf	decoding_rural_america_variables_tract
rwjf	deep_disadvantage_codebook
rwjf	deep_disadvantage_counties
rwjf	hud_colonia_communities
rwjf	nhgis_census_pop_1790_2020
rwjf	nhgis_census_race_1980_2020
rwjf	persistent_poverty_counties
rwjf	reinvisioning_rural_typology
rwjf	social_vulnerability_index
rwjf	ur_america_typology

sch_broadband

Schema for Broadband related tables (particularly the FCC Form 477).

table_schema	table_name
sch_broadband	broadband_f477_2014dec
sch_broadband	broadband_f477_2016dec
sch_broadband	broadband_f477_2017dec
sch_broadband	broadband_f477_2018dec
sch_broadband	broadband_f477_2018dec_v2
sch_broadband	broadband_f477_2018jun
sch_broadband	broadband_f477_2019dec
sch_broadband	broadband_f477_2019jun
sch_broadband	broadband_f477_2020december_v1
sch_broadband	broadband_f477_2020june_v1
sch_broadband	broadband_f477_2021june_v1
sch_broadband	broadband_f477_byblock_2014dec
sch_broadband	broadband_f477_byblock_2015dec
sch_broadband	broadband_f477_byblock_2015jun
sch_broadband	broadband_f477_byblock_2016dec
sch_broadband	broadband_f477_byblock_2016jun
sch_broadband	broadband_f477_byblock_2017dec
sch_broadband	broadband_f477_byblock_2017jun
sch_broadband	broadband_f477_byblock_2018dec
sch_broadband	broadband_f477_byblock_2018dec_v2
sch_broadband	broadband_f477_byblock_2018jun
sch_broadband	broadband_f477_byblock_2019dec
sch_broadband	broadband_f477_byblock_2019jun
sch_broadband	broadband_f477_byblock_2020december_v1
sch_broadband	broadband_f477_byblock_2020june_v1
sch_broadband	broadband_f477_byblock_2021june_v1
sch_broadband	county_cable_providers
sch_broadband	county_fiber_providers
sch_broadband	county_providers
sch_broadband	fcc_477_tech_code_crosswalk
sch_broadband	source__fcc_477_dec14_v2_20190125
sch_broadband	source__fcc_477_dec15_v2
sch_broadband	source__fcc_477_dec15_v3_20190125
sch_broadband	source__fcc_477_dec16_v1_20190125
sch_broadband	source__fcc_477_dec17_v1_20190125
sch_broadband	source__fcc_477_dec17_v1_20190508
sch_broadband	source__fcc_477_dec17_v2_20191024
sch_broadband	source__fcc_477_dec18_v1_20200204
sch_broadband	source__fcc_477_dec18_v2_20200824
sch_broadband	source__fcc_477_dec19_v1_20201216
sch_broadband	source__fcc_477_jun15_v3
sch_broadband	source__fcc_477_jun15_v4_20190125
sch_broadband	source__fcc_477_jun16_v2
sch_broadband	source__fcc_477_jun16_v3_20190125
sch_broadband	source__fcc_477_jun17_v1
sch_broadband	source__fcc_477_jun17_v2_20190125
sch_broadband	source__fcc_477_jun18_v1_20191024
sch_broadband	source__fcc_477_jun19_v1_20200610
sch_broadband	source_fcc_477_dec2014_v3
sch_broadband	source_fcc_477_dec2015_v4
sch_broadband	source_fcc_477_dec2016_v2
sch_broadband	source_fcc_477_dec2017_v3
sch_broadband	source_fcc_477_dec2018_v3
sch_broadband	source_fcc_477_december_2020_v1
sch_broadband	source_fcc_477_jun_2020_v1
sch_broadband	source_fcc_477_june_2020_v1

sch_census_tiger

Schema for TIGER/Line data. Tables flagged as ‘cb’ are cartographic boundary files for mapping, as opposed to ‘tiger’ tables for spatial analysis.

table_schema	table_name
sch_census_tiger	bcat_county_summary
sch_census_tiger	source_cb_2019_cbsa
sch_census_tiger	source_cb_2019_county
sch_census_tiger	source_cb_2019_cousub
sch_census_tiger	source_cb_2019_place
sch_census_tiger	source_cb_2019_puma
sch_census_tiger	source_cb_2019_state
sch_census_tiger	source_cb_2019_uac
sch_census_tiger	source_tiger_2019_blockgroup
sch_census_tiger	source_tiger_2019_blocks
sch_census_tiger	source_tiger_2019_cbsa
sch_census_tiger	source_tiger_2019_county
sch_census_tiger	source_tiger_2019_cousub
sch_census_tiger	source_tiger_2019_place
sch_census_tiger	source_tiger_2019_puma
sch_census_tiger	source_tiger_2019_state
sch_census_tiger	source_tiger_2019_tract
sch_census_tiger	source_tiger_2019_uac
sch_census_tiger	tiger_2019_cbsa
sch_census_tiger	tiger_2019_county
sch_census_tiger	tiger_2019_cousub
sch_census_tiger	tiger_2019_place
sch_census_tiger	tiger_2019_state
sch_census_tiger	tiger_aiannh_areas_2018

sch_layer

This is a legacy database schema, previously used for processed mappable tables (i.e. ‘layers’). Do not write new data to this schema. Do not transfer it to the new database. Remove by EOY 2023 or sooner.

table_schema	table_name
sch_layer	bea_caemp25n_county_01_18
sch_layer	bea_caemp25n_county_01_19
sch_layer	bea_caemp25n_state_01_18
sch_layer	bea_caemp25n_state_01_19
sch_layer	bea_caemp25n_us_01_18
sch_layer	bea_caemp25n_us_01_19
sch_layer	bea_cagdp2_county_01_18
sch_layer	bea_cagdp2_us_01_18
sch_layer	bea_cainc4_county_69_18
sch_layer	bea_cainc4_state_69_18
sch_layer	bea_cainc4_us_69_18
sch_layer	bea_gross_flow_earnings_90_18
sch_layer	bea_imputed_emp_2002_2018
sch_layer	bea_imputed_emp_2018
sch_layer	bea_info_employment_01_18
sch_layer	bea_mark_variables_2018
sch_layer	bea_self_employment_01_18
sch_layer	bea_total_employment_01_18
sch_layer	bea_total_employment_01_19
sch_layer	bea_total_gdp_01_18
sch_layer	bea_total_info_gdp_01_18
sch_layer	bea_total_pop_70_18
sch_layer	bl_geoid_lookup
sch_layer	bl_geoid_lookup_all
sch_layer	bls_laus_county_annual_90_19
sch_layer	bls_laus_county_annual_90_20
sch_layer	bls_laus_county_monthly
sch_layer	broadbandnow
sch_layer	cbp_software_publishers
sch_layer	cbp_tech_emp_2018
sch_layer	census2010_bg_race
sch_layer	census2010_tract_race
sch_layer	computer_science_graduates_15_17_drive_time
sch_layer	computer_science_graduates_15_17_pt
sch_layer	computer_science_graduates_16_18_drive_time
sch_layer	computer_science_graduates_16_18_pt
sch_layer	computer_science_graduates_by_county
sch_layer	computer_science_graduates_by_county_2021
sch_layer	county_population_characteristics_2000_2019
sch_layer	county_race_score
sch_layer	county_zip_to_city_xwalk
sch_layer	covid_19_county_latest
sch_layer	covid_19_county_long
sch_layer	covid_county_chris
sch_layer	cpc_2010_2019_age_race_sex
sch_layer	cpc_county_year
sch_layer	eda_investments_fy12_to_fy17_app
sch_layer	fdic_bank_info_06_2021
sch_layer	godaddy_county
sch_layer	high_ed_drive_time_30_60_90_mins
sch_layer	high_ed_drive_time_30_mins
sch_layer	high_ed_drive_time_60_mins
sch_layer	high_ed_drive_time_90_mins
sch_layer	high_ed_field_of_study_ipeds
sch_layer	high_ed_points
sch_layer	higher_ed_40_min_drivetimes
sch_layer	higher_ed_blocks_2021
sch_layer	higher_ed_blocks_40min_2021
sch_layer	higher_ed_inst_and_block_level
sch_layer	imputed_cbp_high_tech_emp_12_16
sch_layer	imputed_cbp_high_tech_emp_county_2016
sch_layer	la_highed_by_block_2021jun
sch_layer	lodes_county_2017
sch_layer	lodes_county_by_industry_2002_2018
sch_layer	lodes_od_county
sch_layer	lodes_wac_bg
sch_layer	microsoft_county_geo
sch_layer	natural_amenity_score_county
sch_layer	non_metro_od_with_dist_mi
sch_layer	race_ethnicity_breakdown
sch_layer	rii_cohort_applicants
sch_layer	rural_by_state_cbsa
sch_layer	rural_by_state_cmsrhc
sch_layer	rural_by_state_eda
sch_layer	rural_by_state_far
sch_layer	rural_by_state_fhfa
sch_layer	rural_by_state_forhp
sch_layer	rural_by_state_lisc
sch_layer	rural_by_state_msa
sch_layer	rural_by_state_ruca
sch_layer	rural_by_state_rucc
sch_layer	rural_by_state_uc
sch_layer	rural_by_state_uic
sch_layer	rural_by_state_usda
sch_layer	rural_definitions_all

sch_source

This is a legacy database schema used to store raw source data. Its continued use is optional. It is most useful for large datasets such as the form 477, where the time cost to load the data into postgres is high. If the new FCC process moves out of Postgres, as it likely should, this schema should be retired and emptied.

table_schema	table_name
sch_source	acp_county_type
sch_source	ami_rural_definition
sch_source	bea_caemp25n_01_18
sch_source	bea_caemp25n_01_19
sch_source	bea_cainc4_69_18
sch_source	bea_gdp_01_18
sch_source	bea_gross_flow_earnings_90_18
sch_source	bea_total_emp_01_18
sch_source	bea_total_emp_01_19
sch_source	bls_laus_90_19
sch_source	bls_laus_90_20
sch_source	bls_laus_county_monthly
sch_source	broadband_auction_903_subsidy_awards
sch_source	broadband_auction_904_subsidy_awards
sch_source	broadband_farm_bill_eligibility
sch_source	broadbandnow_20201216
sch_source	business_stops_and_starts
sch_source	cbp_all_emp_estab_county_1986_2018
sch_source	cbp_all_emp_estab_state_1986_2018
sch_source	cbp_all_emp_estab_us_1986_2018
sch_source	cbp_ict_naics_county_2012_2018
sch_source	cbp_ict_naics_state_2012_2018
sch_source	cbp_ict_naics_us_2012_2018
sch_source	county_business_patterns_2018
sch_source	cpc_2010_2019_age_race_sex
sch_source	effingham_served_roads
sch_source	fdic_institutions
sch_source	high_ed_field_of_study
sch_source	high_ed_field_of_study_ipeds
sch_source	high_education_college_scorecard
sch_source	ilecs_geo
sch_source	microsoft_broadband_20201221
sch_source	nber_county_adjacency
sch_source	opportunity_zone_eligibility_pg
sch_source	prison_boundaries_base
sch_source	rural_dev_broadband_pending
sch_source	source_bea_caemp25_2020
sch_source	source_bea_cagdp2_2020
sch_source	source_bea_cainc1_2020
sch_source	source_bea_cainc4_2020
sch_source	source_bea_cainc91_2020
sch_source	source_fcc_477_june_2021_v1
sch_source	source_rural_definition_cbsa_2010
sch_source	source_rural_definition_cbsa_2013
sch_source	source_rural_definition_cbsa_2014
sch_source	source_rural_definition_cbsa_2019
sch_source	source_rural_definition_cbsa_2020
sch_source	source_rural_definition_cdc
sch_source	source_rural_definition_far
sch_source	source_rural_definition_fhfa
sch_source	source_rural_definition_forhp
sch_source	source_rural_definition_ruca
sch_source	source_rural_definition_rucc
sch_source	source_rural_definition_uic

vt_broadband

Project schema for the a broadband project in VT concluded in 2021. It should be destroyed by EOY 2023

table_schema	table_name
vt_broadband	buffer_att_cell_signal_0_5_mile
vt_broadband	buffer_att_towers_5_mile
vt_broadband	buffer_tmobile_cell_signal_0_5_mile
vt_broadband	buffer_tower_248a_3_5_mile
vt_broadband	buffer_tower_nrb_3_5_mile
vt_broadband	buffer_uscellular_cell_signal_0_5_mile
vt_broadband	buffer_verizon_cell_signal_0_5_mile
vt_broadband	buffer_vtel_reconnect_5_mile
vt_broadband	cable_fiber_label_buffer
vt_broadband	cabled_roads_southern_vermont_cud
vt_broadband	cell_all_providers_buffer_count
vt_broadband	cell_att_buffer_count
vt_broadband	cell_towers_buffer_count
vt_broadband	cell_verizon_buffer_count
vt_broadband	connectivity_initiative_grant_2020
vt_broadband	connectivity_initiative_grant_2021
vt_broadband	CTC_cell_propagation_700MHz_Outdoor
vt_broadband	fiber_cable_road_pct_by_town
vt_broadband	non_residential_fiber_flag
vt_broadband	psd_address_cable_fiber_label
vt_broadband	raw_addresses
vt_broadband	raw_cable_segments
vt_broadband	raw_cell_drive_test_att
vt_broadband	raw_cell_drive_test_tmobile
vt_broadband	raw_cell_drive_test_uscellular
vt_broadband	raw_cell_drive_test_verizon
vt_broadband	raw_connectivity_initiative_grant_2020
vt_broadband	raw_connectivity_initiative_grant_2020_20210224
vt_broadband	raw_e911_roads
vt_broadband	raw_fiber_segments
vt_broadband	raw_road_centerline
vt_broadband	raw_road_premises
vt_broadband	raw_tower_248_pt
vt_broadband	raw_tower_nrb_pt
vt_broadband	raw_towns
vt_broadband	raw_vtel_reconnect_program
vt_broadband	road_premises_in_propagation
vt_broadband	tower_248a_nrb_buffer_count
vt_broadband	vermont_700mhz_outdoor_union
vt_broadband	vt_cable_segments
vt_broadband	vt_cud_boundaries
vt_broadband	vt_cud_boundaries_combined
vt_broadband	vt_data_broadband_status_2019
vt_broadband	vt_data_broadband_status_2019_1
vt_broadband	vt_data_broadband_status_2019_1_unserved
vt_broadband	vt_roads
vt_broadband	vtel_reconnect_buffer_count

venture_capital

Schema for venture capital data.

table_schema	table_name
venture_capital	formd_22Q1_19Q1_combined
venture_capital	formd_county
venture_capital	formd_county_year
venture_capital	venture_capital_modeling_datav1

higher_ed

Schema for all higher education related data

table_schema	table_name
higher_ed	cip_soc_xwalk
higher_ed	college_scorecard
higher_ed	county_cs_grad_percentile_by_cbsa_type
higher_ed	cs_grads
higher_ed	degrees_by_inst_and_cip
higher_ed	field_of_study
higher_ed	high_ed_drive_time_30_60_90_mins
higher_ed	ipeds_completions
higher_ed	ipeds_inst_characteristics
higher_ed	total_degrees_by_inst

historical_census_data

Schema for historical census tables. Experimental. If empty by EOY 2022, remove.

table_schema	table_name
historical_census_data	historical_census_county_crosswalk
historical_census_data	pop_race_nhgis_1840_county
historical_census_data	pop_race_nhgis_1850_county
historical_census_data	pop_race_nhgis_1860_county
historical_census_data	pop_race_nhgis_1870_county
historical_census_data	pop_race_nhgis_1880_county
historical_census_data	pop_race_nhgis_1890_county
historical_census_data	pop_race_nhgis_1900_county
historical_census_data	pop_race_nhgis_1910_county
historical_census_data	pop_race_nhgis_1920_county
historical_census_data	pop_race_nhgis_1930_county
historical_census_data	pop_race_nhgis_1940_county
historical_census_data	pop_race_nhgis_1950_county
historical_census_data	pop_race_nhgis_1960_county
historical_census_data	pop_race_nhgis_1970_county
historical_census_data	pop_race_nhgis_1980_county
historical_census_data	pop_race_nhgis_1990_county
historical_census_data	pop_race_nhgis_2000_county
historical_census_data	pop_race_nhgis_2010_county
historical_census_data	pop_race_nhgis_2020_county

TOOLS

Airtable

If it is necessary to use Airtable, rairtable is the best interface for our use. Install with install.packages('rairtable').

However, Airtable should not be supported as an MDA data access tool under any circumstances. It is time consuming to work with in the best of cases. Worse, database compatible names are typically unacceptable to client teams, rendering direct updates inconvenient at best and impossible at worst. There are no access controls and manual data edits are simple, so changes – intentional or otherwise – made by client teams will persist indefinitely.

Airtable was fine for quick prototyping in the early days of our team, but it is fundamentally incompatible with data integrity and storage best practices and should be disavowed with vigor. In the best case, we should delete all MDA data from Airtable and forcefully discontinue support. If this happens I’ll eat my hat.

BCAT (Broadband County Assessment Tool)

BCAT is an application that integrates a variety of county and national level broadband metrics. It was originally developed for a project in Tennessee, but a national level version also exists. John Hall is the best reference for the inner workings of the BCAT.

The pipeline for this tool is rough and in some cases assumes that now-defunct architecture and tables exist. It should receive an overhaul, not least to accommodate the deployment of the API designed and built for our team by Merging Futures in the summer of 2022.

DB Schema: bcat

Pipeline: broadband_county_assessment_tool_pipeline

App Repo: broadband_county_assessment_tool

RII Diagnostic Reports

The RII Diagnostic report (as of August 22 called the Digital Economy Diagnostic, or DED; previously known as the DEE-R or Digital Economy Ecosystem Report) is one of the MDA team’s main data products.

It consists, in essence, of a data visualization API housed in a Shiny app, and generates data and graphic assets for the construction of the RII diagnostic reports.

Notes:

As of 8/22, maps are not well supported as DED output. Adding new maps or modifying the existing map output is a moderate to large lift (depending on the ask). This will need to be supported by one or more MDA software developers
The higher education map is not updated in the same way as any of the other data visualizations. All of the processing happens in doctR::make_map() and is dependent on a handful of external tables. Additional development on this feature should not be taken lightly.

Outstanding tasks in this process include the following:

Integrating data updates into their respective data pipelines
Updating data (continuous need)

Detailed documentation on how to use the system is in the doctR wiki linked below.

Recording of Overview Meeting w/ RII team: https://drive.google.com/file/d/1Gv5uC6bZD-B7D_nATU7RH8FxMYhLuqgc/view?usp=sharing

Recording of Technical Zoom Meeting: https://drive.google.com/drive/folders/1jUTQ2gMPpZ--foFQPGZi7Jy_-nT7Vw0-?usp=sharing

Viz Definitions + Misc Code: ded

Package: doctR

Package Docs: doctR wiki

DED Portal Code: clinic

DB Schema: rii_diagostic

Legacy Code (non-functional): dee-r

NB: the ‘doctor’ puns make the most sense if you pronounce DED as “dead”. What do you do when something might be DED? Call the doctR. I couldn’t help myself. MR 8/10/22

T3 (Tech Talent Tracker)

The T3 is a high traffic, high value tool for the organization. However, it is woefully out of date and a bastion of tech debt. It was built quickly on a shoestring budget without the benefit of development best practices.

The current T3 codebase should receive no new development.

The script that created the data for this tool lives in the temp folder, and is titled shiny_data_prep.R.

Data Documentation

Matt Rogers

2022-07-25

DATA

ACS

BEA

BLS

Broadbandnow

CBP

CPC

FCC Form 477

Form D

Higher Education

LODES/LEHD

Location Analysis

TIGER/Line

DATABASE

Philosophy

acs

bcat

core_data

eda

employment_data

location_analysis

metadata

rii_diagnostic

rwjf

sch_broadband

sch_census_tiger

sch_layer

sch_source

vt_broadband

venture_capital

higher_ed

historical_census_data

TOOLS

Airtable

BCAT (Broadband County Assessment Tool)

RII Diagnostic Reports

T3 (Tech Talent Tracker)