Download, unzip, read, clean the Facility Registry Service dataset
Usage
frs_get(
only_essential_cols = TRUE,
folder = NULL,
downloaded_and_unzipped_already = FALSE,
zfile = "national_single.zip",
zipbaseurl = "https://ordsext.epa.gov/FLA/www3/state_files/",
csvname = "NATIONAL_SINGLE.CSV",
date = Sys.Date()
)
Arguments
- only_essential_cols
TRUE by default. Passed to frs_read(), which then imports only the most useful columns.
- folder
NULL by default which means it downloads to and unzips in a temporary folder
- downloaded_and_unzipped_already
If set to TRUE, looks in folder for csv file instead of trying to download/unzip. Looks in working directory if folder not specified.
- zfile
Name of the zip file. Use the default unless EPA changes it.
- zipbaseurl
Base URL of the folder that holds the zip file. Use the default unless EPA changes it.
- csvname
Name of the csv file inside the zip file. Use the default unless EPA changes it.
- date
Sys.Date() (today) by default. Stored as an attribute on the results to record their vintage, such as the date the FRS data were downloaded.
Details
Used by frs_update_datasets()
Uses frs_download(), frs_unzip(), frs_read(), frs_clean()
See examples for how package maintainer might use this.
See source code of this function for more notes.
For a developer updating the frs datasets in this package,
see frs_update_datasets()
frs_get() invisibly returns the table of data, as a data.table. It will download, unzip, read, clean, and set metadata for the data.
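As a rough illustration of what "invisibly returns" and "set metadata" mean in practice, here is a minimal sketch on toy data. The helper get_toy_table() and the attribute name "vintage" are assumptions made up for this sketch, not the package's actual code or attribute name, and a plain data.frame stands in for the data.table so that only base R is needed.

```r
# Hypothetical stand-in for frs_get(): builds a tiny table instead of downloading.
get_toy_table <- function(date = Sys.Date()) {
  x <- data.frame(
    REGISTRY_ID = c("110000307695", "110000307784"),
    lat = c(18.37269, 17.98615),
    lon = c(-66.14207, -66.61845),
    stringsAsFactors = FALSE
  )
  # "vintage" is an assumed attribute name, chosen only for this sketch
  attr(x, "vintage") <- date
  # invisible() means nothing is printed when the call is not assigned
  invisible(x)
}

toy <- get_toy_table(date = as.Date("2022-10-01"))
attr(toy, "vintage")
```

Because the result is returned invisibly, calling the function without assigning it prints nothing. Assign the result, or wrap the call in parentheses, to see the table.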
This function gets the full national dataset from a single file,
NATIONAL_SINGLE.CSV, inside https://ordsext.epa.gov/FLA/www3/state_files/national_single.zip
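The default arguments combine into that URL as sketched below. The helper fetch_national_single() is hypothetical and uses only base R. It is not how frs_download(), frs_unzip(), or frs_read() are implemented, and it is defined here without being called because the file is large.

```r
zipbaseurl <- "https://ordsext.epa.gov/FLA/www3/state_files/"
zfile      <- "national_single.zip"
csvname    <- "NATIONAL_SINGLE.CSV"

# The full download URL is the base URL followed by the zip file name
zipurl <- paste0(zipbaseurl, zfile)

# Hypothetical helper (not part of the package): download, unzip, read.
fetch_national_single <- function(url, csvname, folder = tempdir()) {
  zpath <- file.path(folder, basename(url))
  utils::download.file(url, destfile = zpath, mode = "wb")
  utils::unzip(zpath, files = csvname, exdir = folder)
  # Read every column as character so ID-like fields keep leading zeros
  utils::read.csv(file.path(folder, csvname), colClasses = "character")
}
```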
Other files and related information:
https://www.epa.gov/frs/epa-frs-facilities-state-single-file-csv-download
Individual files covering parts of this information can also be downloaded from ECHO. See https://echo.epa.gov/tools/data-downloads/frs-download-summary for a description of the related files available from EPA's ECHO.
This function creates the following:
> head(frs_by_programid)
lat lon REGISTRY_ID program pgm_sys_id
1: 44.13415 -104.12563 110012799846 STATE #5005
2: 41.16163 -80.07847 110057783590 PA-EFACTS ++++
3: 41.21463 -111.96224 110020117862 CIM 0
4: 29.62889 -83.10833 110040716473 LUST-ARRA 0
5: 40.71490 -74.00316 110019246163 FIS 0-0000-01097
6: 40.76395 -73.97037 110019163359 FIS 0-0000-01103
> frs_by_naics[1:2, ]
lat lon REGISTRY_ID NAICS
1: 30.33805 -87.15616 110002524055 0
2: 48.77306 -104.56154 110007654038 0
> names(frs)
"lat" "lon" "REGISTRY_ID" "PRIMARY_NAME" "NAICS" "PGM_SYS_ACRNMS"
> head(frs[,1:4]) # looks something like this:
lat lon REGISTRY_ID PRIMARY_NAME
1: 18.37269 -66.14207 110000307695 xyz CHEMICALS INCORPORATED
x: 17.98615 -66.61845 110000307784 ABC INC
x: 17.94930 -66.23170 110000307800 COMPANY QRSTU
**WHICH SITES ARE ACTIVE VS INACTIVE SITES**
See frs_active_ids() or frs_inactive_ids()
Approx. 4.6 million rows total as of 10/2022.
table(is.na(frs$lat))
table(is.na(frs$NAICS))
It is not entirely clear how to simply identify
which ones are active vs inactive sites.
See inst folder for notes on that.
As of 2/10/23, the approach used here is not exactly how ECHO/OECA defines "active".
**WHICH SITES HAVE LAT LON INFO**
As of 2022-01-31: Among all including inactive sites,
1/3 have no latitude or longitude.
Even those with lat lon have some problems:
Some are not in the USA.
Some have errors in country code.
Some use alternate ways of specifying USA.
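A quick way to check the share of sites without coordinates is sketched below on a made-up three-row table. The column names lat and lon follow the frs table shown above, while the values and IDs are invented for illustration.

```r
# Toy stand-in for the frs table (made-up values)
toy <- data.frame(
  REGISTRY_ID = c("A", "B", "C"),
  lat = c(18.37, NA, 44.13),
  lon = c(-66.14, NA, -104.13)
)

# A site counts as lacking a location if either coordinate is missing
no_latlon <- is.na(toy$lat) | is.na(toy$lon)
table(no_latlon)

# Fraction of rows with no usable lat/lon
share_missing <- mean(no_latlon)
share_missing
```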
**WHICH SITES HAVE NAICS OR SIC INDUSTRY CODES**
Only 1/4 have both location and some industry code (27
2/3 lack industry code (have no NAICS and no SIC).
NAICS vs SIC codes:
11 percent have both NAICS and SIC,
9.5 percent have just NAICS
(so about 21 percent have NAICS in total).
12.5 percent have just SIC.
2/3 have neither NAICS nor SIC.
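A breakdown like the one above could be tabulated roughly as follows. This is a sketch on made-up rows, and it assumes that both NA and empty strings mean "no code", which may not match how the package counts them.

```r
# Toy data: made-up codes in the NAICS_CODES and SIC_CODES columns
toy <- data.frame(
  NAICS_CODES = c("325199", NA, "", "221112", NA, NA),
  SIC_CODES   = c("2869", "4911", NA, NA, NA, ""),
  stringsAsFactors = FALSE
)

# Assumption for this sketch: NA and "" both count as "no code"
has_code  <- function(x) !is.na(x) & nzchar(x)
has_naics <- has_code(toy$NAICS_CODES)
has_sic   <- has_code(toy$SIC_CODES)

category <- ifelse(has_naics & has_sic, "both",
            ifelse(has_naics, "NAICS only",
            ifelse(has_sic, "SIC only", "neither")))

# Percent of rows in each category
round(100 * prop.table(table(category)), 1)
```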
**WHICH COLUMNS TO IMPORT AND KEEP**
Approx. 39 columns if all are imported, but by default only the 10 most useful are kept:
[1] "REGISTRY_ID" "PRIMARY_NAME" "PGM_SYS_ACRNMS"
[4] "INTEREST_TYPES" "NAICS_CODES" "NAICS_CODE_DESCRIPTIONS"
[7] "SIC_CODES" "SIC_CODE_DESCRIPTIONS" "LATITUDE83"
[10] "LONGITUDE83"
Some fields are csv lists actually, to be split into separate rows
to enable queries on NAICS code or program system id:
PGM_SYS_ACRNMS = 'c', # csv format like "AIR:AK999, AIRS/AFS:123,
NPDES:AK0020630, RCRAINFO:AK6690360312, RCRAINFO:AKR000206516"
INTEREST_TYPES = 'c', # eg "AIR SYNTHETIC MINOR, ICIS-NPDES NON-MAJOR"
NAICS_CODES = 'c', # csv of NAICS
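Splitting such a csv-list field into one row per entry could look roughly like this in base R. It is a sketch only: the REGISTRY_ID values are made up, and the package's own frs_make_programid_lookup() may work differently.

```r
# Toy data: made-up REGISTRY_ID values, program strings in the format shown above
toy <- data.frame(
  REGISTRY_ID = c("110000000001", "110000000002"),
  PGM_SYS_ACRNMS = c("AIR:AK999, NPDES:AK0020630", "RCRAINFO:AK6690360312"),
  stringsAsFactors = FALSE
)

# Split each comma-separated string into its entries
parts <- strsplit(toy$PGM_SYS_ACRNMS, ",\\s*")

# Repeat each REGISTRY_ID once per entry to get one row per program id
long <- data.frame(
  REGISTRY_ID = rep(toy$REGISTRY_ID, lengths(parts)),
  entry = unlist(parts),
  stringsAsFactors = FALSE
)

# Each entry looks like "PROGRAM:ID"; split on the first colon only
long$program    <- sub(":.*$", "", long$entry)
long$pgm_sys_id <- sub("^[^:]*:", "", long$entry)
long$entry <- NULL
long
```

With one row per program id, a lookup by program or pgm_sys_id becomes a plain row filter instead of a string search.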
Examples
# \donttest{
# These steps in the examples are all done by frs_update_datasets()
# (a function not exported by the package)
# Note these take a long time to run, for downloads and processing.
frs <- frs_get()
# keep only if an active site, or unclear whether active. Remove clearly inactive ones.
closedidlist <- frs_inactive_ids()
frs <- frs_drop_inactive(frs, closedid = closedidlist)
frs_by_programid <- frs_make_programid_lookup(x = frs) # another super slow step
frs_by_naics <- frs_make_naics_lookup(x = frs) # NAs introduced by coercion
usethis::use_data(frs, overwrite = TRUE)
usethis::use_data(frs_by_programid, overwrite = TRUE)
usethis::use_data(frs_by_naics, overwrite = TRUE)
# }