Code
source("R/table_with_options.R")
January 2, 2025
We are starting our first exploratory data analysis around ISPs in the FCC NBM data set. It should be kept in mind that an ISP can be listed multiple times at the same location (offering multiple service).
Our goal is being able to take FCC data and 1. correctly identify the single ISP that is providing each reported service (i.e., deduplication of ISP names) 2. correctly identify that ISP over time (from the same program and from other FCC products).
How should we define an ISP? How can we define coverage (should a service 0/0 be considered as part of the extent of an ISP’s coverage)?
We shifted a bit from exploring to trying to classify the quality of information we have from FCC about ISP.
The query that generated the data set is here:
select
frn,
provider_id,
brand_name,
count(distinct location_id) as cnt_locations,
count(*) as cnt_locations_services,
bool_or(case when technology = 10 then true else false end) as has_copperwire,
bool_or(case when technology = 40 then true else false end) as has_coaxial_cable,
bool_or(case when technology = 50 then true else false end) as has_fiber,
bool_or(case when technology >= 70 then true else false end) as has_wireless,
bool_or(case when technology = 60 or technology = 61 then true else false end) as has_satel,
array_agg(distinct state_abbr)
from staging.june23
group by frn, provider_id, brand_name
The name of the columns match the FCC’s documented descriptions.
We are adding:
cnt_locations_services
: count of services, in one location you can have multiple services with different providers, technology and speeds provides (sometimes one providers can have multiple technology and/or multiple speeds)
cnt_locations
: count of locations covered by this specific set of brand_name, provider_id, state_abbr and technology (here if a provider declare providing different speed in that location it will not be counted)
a serie of flag (has
) telling if this “combo” is proving said technology
an array listing in which states are present our “combo”
We have from FCC:
Source: https://us-fcc.app.box.com/v/bdc-data-downloads-output page 4
frn
FCC Registration Number; “number of the entity that submited the data”. It is supposed to be a string of 10 characters (with padding 0).
provider_id
: “unique identifier for the fixed service provider”
brand_name
: “Name of the entity or service advertised or offered to consumers.”
Every row is matching a combination of unique FRN, Provider ID and brand name.
get_me_isp <- function(path) {
isp <- read.csv(path,
colClasses = c(frn ="character",
provider_id = "character",
brand_name = "character",
cnt_locations = "character",
cnt_locations_services = "character",
has_copperwire = "logical",
has_coaxial_cable = "logical",
has_fiber = "logical",
has_wireless = "logical",
has_satel = "logical",
array_agg = "character"))
isp[["cnt_locations"]] <- as.numeric(isp[["cnt_locations"]])
isp[["cnt_locations_services"]] <- as.numeric(isp[["cnt_locations_services"]])
return(isp)
}
isp <- get_me_isp("data/isp.csv")
Number of unique frn
: 2879
Number of unique provider_id
: 2184
Number of unique brand name pre cleaning: 2902
Removing all capitalization and change underscore for white space help lower the number of unique brand names to: 2878
FRN
can be wrong or not meaningfullprovider_id
can be wrongOne case:
frn | provider_id | brand_name | cnt_locations | cnt_locations_services | has_copperwire | has_coaxial_cable | has_fiber | has_wireless | has_satel | array_agg | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0003738655 | 130432 | “EATEL Corp.” | 83537 | 86548 | true | true | true | false | false | {LA} | ||
0009873712 | 131103 | “EATEL Corp.” | 34494 | 34497 | false | true | true | false | false | {LA} |
Other case:
frn | provider_id | brand_name | cnt_locations | cnt_locations_services | has_copperwire | has_coaxial_cable | has_fiber | has_wireless | has_satel | array_agg | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|
0002626984 | 130008 | Acentek | 47 | 47 | true | false | true | false | false | {MN} | ||
0002626984 | 130008 | ACENTEK | 1395 | 1395 | true | false | false | false | false | {MN} | ||
0002645927 | 130008 | Acentek | 19521 | 26636 | true | false | true | true | false | {IA,MN} |
Number of locations | Number of cases |
---|---|
1 | 43 |
2 | 9 |
3 | 12 |
4 | 5 |
5 | 2 |
6 | 4 |
7 | 8 |
8 | 5 |
9 | 3 |
10 | 3 |
Potential solutions:
We can decide to not keep those rows
Merge them with either other rows that is matching provider_id
or frn
(when this is an option)
temp <- sapply(split(isp$frn, isp$provider_id), function(x) length(unique(x)))
dta <- data.frame(provider_id = names(temp), frn_by_provider_id = temp)
# correct lower/upper case / replace underscore by " " / some white space on both side
# triming whispace did not improve for this dataset but I should keept doing it
dta[["unique_brand_name_by_provider_id"]] <- sapply(split(isp$brand_name, isp$provider_id),
count_and_clean)
dta[["same_state_by_provider_id"]] <- sapply(split(isp$array_agg, isp$provider_id), function(x) length(unique(x)))
isp <- merge(isp, dta, by.x = "provider_id", by.y = "provider_id",
all.x = TRUE, all.y = TRUE)
more_frn_than_provider <- subset(isp, isp$frn_by_provider_id > 1)
table_with_options(more_frn_than_provider)
Unique provider_id + brand_name are kind of “green” (for one time frame):
Number of green isp: 1972
We can have one provider_id
with multiple frn
and same or not brand_name
(see TSC for example / 150266)
It seems:
Windstream has 37 different frn
: we can maybe test if it has windstream in it’s name ..
Otelco/GoNet (18 cases)
Rally Networks/Oregon Telephone Company, is their frn wrongs ? (17 cases)
Same Provider for differents frn
and brand_name
in Minesota (MN) (16 cases)
160127
I do not see any kind of specific pattern for this one
131486
seems to be RiverStreet Networks with various frn
(13 cases) -> will be catch by unique_brand_name_by_provider_id
190233
multiple brand name and frn but seems to be in Texas and Oklahoma (13)
131226
seems to be Fastwyre Broadband divided by technology and states (12 cases) -> will be catch by unique_brand_name_by_provider_id
130804
seems to be Mediacom (+ Bolt) with different states and names indicating their states (11 cases)
Google Fiber (240041
) seems to be have frn
split by states (with a weird ‘Webpass, Inc.’) (11 cases) -> will be catch by unique_brand_name_by_provider_id (except Webpass, Inc which is weird, technology is 70 that I should correct)
AT&T Inc (130077
) multiple frn (10 cases) -> filter by unique_brand_name_by_provider_id
130079 = Astound_Broadband (10 cases) -> will be catch by unique_brand_name_by_provider_id
Verizon -> filter by unique_brand_name_by_provider_id
long ling (130757 ) & co are problematics (multiples names / one provider / 3 states )
130906
is also hard to fix
Assuming that same name (clean version) + same provider_id
provide us with a unique ISP, it helps move to greensih:
This is removing 194 out of 1170.
Not too sure about this one.
This is the case for “EATEL”.
temp <- sapply(split(isp$provider_id, isp$clean_name),
function(x) length(unique(x)))
dta <- data.frame(clean_name = names(temp), provider_id_by_clean_name = temp)
isp <- merge(isp, dta, by.x = "clean_name", by.y = "clean_name",
all.x = TRUE, all.y = TRUE)
provider_id_by_clean_name <- subset(isp, isp$provider_id_by_clean_name > 1)
table_with_options(provider_id_by_clean_name)
This could also move 63 cases in the greenish spot. -> nop
# if it has unique brand name and frn by provider id id should be good
temp <- isp$frn_by_provider_id + isp$unique_brand_name_by_provider_id
isp[["rdy_to_go"]] <- ifelse(temp == 2, "green", "not green")
# cases where we have a unique frn and provider id but not unique brand name
# default of olive is that we need to pick a name out of more than one
temp <- ifelse(isp$frn_by_provider_id == 1, 1, 0) + ifelse(isp$unique_brand_name_by_provider_id > 1, 1, 0)
isp[temp == 2, "rdy_to_go"] <- "olive"
# the few locations should be "red" and maybe dropped later
isp[which(isp$few_locations == "few locations"), "rdy_to_go"] <- "red"
table_with_options(isp)
A good example could be 131167
and how we can discriminate Orbitel communications. We can also prob. raise the bar of “few locations”.
A quick summary of where we are:
The data was generated from June 23 FCC release and assumed that an FRN = ISP.
Can we guess who is a small ISP?
# con <- cori.db::connect_to_db("proj_calix")
# bob <- DBI::dbReadTable(con, "frn_desc")
# DBI::dbDisconnect(con)
# bob[["is_calix"]] <- NULL
# write.csv(bob, "data/frn_desc.csv")
frn_desc <- read.csv("data/frn_desc.csv")
#classInt::classIntervals(cnt_locations, n = 20, style = "jenks")
# not that great
table_with_options(frn_desc)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 1384 5476 253046 19588 114863490
Some ISP are declaraing covering a huge number of locations. Some low counts are probably errors.
Count of FRN with a less than 10 locations: 36
Count of FRN with more than 500 000 locations: 74
If we filter them out (removing 110 cases):
List of ISP that the Broadband team that are good reference of small provider:
Name | FRN | Locations |
---|---|---|
Newport Utilities, TN | 0027152438 | 15383 |
SandyNet, OR | 0004119376 | 4439 |
ECFiber, VT | 0027379676 | 22926 |
Maple Broadband, VT | 0032366692 | 315 |
Uplink | 0026218602 | 1611 |
Black Bear Fiber | 0025132648 | 1583 |
QCOL | 0019663095 | 5610 |
Salsgiver | 0011167079 | 29941 |
All Points Broadband | 0023524705 | 107803 |
Marquette-Adams Telephone co-op | 0003774023 | 130783 |
USI fiber | 0017096538 | 71466 |
Scott county telephone co | 0002069862 | 7829 |
PANGAEA | 0016202236 | 8410 |
Blue Mountain Networks | 0005450507 | 310013 |
Side notes:
Newport Utilities = NUconnect
SandyNet, OR = City of Sandy, OR
USI FIber =
Blue Mountain Networks = Blue Ridge Mountain Electric Membership Corporation