Introducing echor

The U.S. Environmental Protection Agency (EPA) provides access to facility and compliance information for registered permit holders under the Clean Air Act, Clean Water Act. The primary way for non-governmental entities to obtain this data is through the EPA Environmental and Compliance History Online (ECHO) website. Data is housed under “media-specific” programs. Relevant to this post, the National Pollutant Discharge Elimination Systems (NPDES) maintains data on pollutant discharges to waterways and the Air Facility Service (AFS) maintain data on emission to air. Gibbs and Simpson (2009) assess the strengths and weakness of the data collated by ECHO and provide an example of assessing environmental crime rates. While a discussion about the merits of EPA’s environmenatal data collection efforts and methodology are warranted, this post will discuss a new package that provides API access to the ECHO database.

Introduction and need

I primarily use ECHO to obtain discharge monitoring records for wastewater and industrial discharges. Until recently, my workflow was to call or email the state environmental agency and ask for all the available permit numbers in the watershed. Some states maintain and provide a GIS file with spatial locations (this was preffered, but finding out when that file was last updated was can be difficult). Once I obtained the permit numbers, I log onto ECHO and type the permits numbers in, and individually retrieve disharge records for each facility. This requires quite a bit of clicking and typing, and is prone to error. Furthermore, there is no way to verify the records I received are correct. If I mistyped a number or received a wrong record from the agency, I have little way of catching the error.

Thankfully, ECHO provides web access through “GET” and REST services to provide some level of automated and reproducible data access. I recently wrote the echor package to provide access to these service in R. This was my first attempt at developing an R package and my first attempt at utilizing data APIs in a programatic way.https://media.giphy.com/media/elFj7gYCIxVxC/giphy.gif

echor provides functions to search for and obtain facility data under the NPDES, AFS, and Safe Drinking Water (SDW) program areas. Functions are also available to retrieve facility specific pollutant discharge data under the NPDES and AFS programs. Under the NPDES program, pollutant discharge reporting intervals are specific to the permit, so you may get records on quarterly, monthly, or perhaps daily basis. Records under the AFS program are returned as annual emissions.

Use cases

Download facility specific discharge records

If you have a facility permit number, downloading the discharge records is pretty simple. I made every attempt to return records in “tidy” long format to make plotting and further analysis easy. If you haven’t already, install echor from CRAN.

install.packages("echor")

echoGetEffluent() will return the facility reported discharges. Arguments are available to specify the permit number, date range, and parameter code (the pollutant of interest). A function is also provided to search for parameter codes: echoWaterGetParams(). In the following example I want to identify the parameter code for daily flow, then get the reported discharge for a facility that I already have a permit number for.

## Load package
library(echor)

## Find the parameter code
echoWaterGetParams(term = "Flow, in conduit")
## # A tibble: 2 x 2
##   ValueCode ValueDescription                        
##   <chr>     <chr>                                   
## 1 51725     Flow, in conduit or thru treatment plant
## 2 50050     Flow, in conduit or thru treatment plant

Here, I get two parameter codes. I happen to know I want to use parameter code 50050. Searching for the right terminology and correct code can be difficult and might require some trial and error.

The next step is to obtain the discharge records:

df <- echoGetEffluent(p_id = 'tx0119407', parameter_code = '50050',
                      start_date = "01/01/2017", end_date = "08/30/2018")
tibble::glimpse(df)
## Observations: 38
## Variables: 38
## $ Name                    <chr> "SKIDMORE WSC WWTP", "SKIDMORE WSC WWT...
## $ Outfall                 <chr> "001", "001", "001", "001", "001", "00...
## $ ID                      <chr> "TX0119407", "TX0119407", "TX0119407",...
## $ RegistryID              <chr> "110009771693", "110009771693", "11000...
## $ Location                <chr> "1000'  N OF THE END OF BLACK RANCH RD...
## $ City                    <chr> "SKIDMORE", "SKIDMORE", "SKIDMORE", "S...
## $ State                   <chr> "TX", "TX", "TX", "TX", "TX", "TX", "T...
## $ Zip                     <chr> "78387", "78387", "78387", "78387", "7...
## $ Status                  <chr> "78387", "78387", "78387", "78387", "7...
## $ LimitBeginDate          <date> 2015-08-01, 2015-08-01, 2015-08-01, 2...
## $ LimitEndDate            <date> 2020-03-01, 2020-03-01, 2020-03-01, 2...
## $ LimitValueNmbr          <dbl> 0.131, NA, 0.131, NA, 0.131, NA, 0.131...
## $ LimitUnitCode           <chr> "03", "03", "03", "03", "03", "03", "0...
## $ LimitUnitDesc           <chr> "MGD", "MGD", "MGD", "MGD", "MGD", "MG...
## $ StdUnitCode             <chr> "MGD", "MGD", "MGD", "MGD", "MGD", "MG...
## $ StdUnitDesc             <chr> "MGD", "MGD", "MGD", "MGD", "MGD", "MG...
## $ LimitValueStdUnit       <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ StatisticalBaseCode     <chr> "DB", "DD", "DB", "DD", "DB", "DD", "D...
## $ StatisticalBaseDesc     <chr> "DAILY AV", "DAILY MX", "DAILY AV", "D...
## $ StatisticalBaseTypeCode <list> ["AVG", "MAX", "AVG", "MAX", "AVG", "...
## $ StatisticalBaseTypeDesc <list> ["Average", "Maximum", "Average", "Ma...
## $ DMREventId              <chr> "3600838828", "3600838828", "360083883...
## $ MonitoringPeriodEndDate <date> 2017-01-31, 2017-01-31, 2017-02-28, 2...
## $ DMRFormValueId          <chr> "3610877668", "3610877673", "361087778...
## $ ValueTypeCode           <list> ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2", ...
## $ ValueTypeDesc           <list> ["Quantity1", "Quantity2", "Quantity1...
## $ DMRValueId              <chr> "3634771541", "3634771542", "363603181...
## $ DMRValueNmbr            <dbl> 0.0603, 0.0878, 0.0534, 0.0709, 0.0513...
## $ DMRUnitCode             <list> ["03", "03", "03", "03", "03", "03", ...
## $ DMRUnitDesc             <list> ["MGD", "MGD", "MGD", "MGD", "MGD", "...
## $ DMRValueStdUnits        <dbl> 0.0603, 0.0878, 0.0534, 0.0709, 0.0513...
## $ DMRQualifierCode        <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ValueReceivedDate       <date> 2017-02-16, 2017-02-16, 2017-03-20, 2...
## $ DaysLate                <int> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ NODICode                <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NODEDesc                <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ ExceedancePct           <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...
## $ NPDESViolations         <list> [NA, NA, NA, NA, NA, NA, NA, NA, NA, ...

The returned data_frame includes a row for each reporting period from the facility and a row for each statistical type (daily average and daily max in this case). Please note, that if you include the available start_date and end_date arguments in the function, they must be entered as "mm/dd/yyyy".I might consider changing the date format in the future, but for now this is the default format requested by ECHO. Currently, echor is a low-level implementation of the API. Because of the sheer number of available parameter arguments I opted to not make assumptions about argument inputs. Unfortunately, this puts the requirement on users to become somewhat familiar with the numerous available arguments and formats requested by ECHO.

Use dplyr to do some data tidying and ggplot2 to make a quick plot of this data:

library(dplyr)
library(ggplot2)

ggplot(df) +
  geom_line(aes(MonitoringPeriodEndDate, DMRValueNmbr, color = StatisticalBaseDesc)) +
  geom_point(aes(MonitoringPeriodEndDate, DMRValueNmbr, color = StatisticalBaseDesc)) +
  scale_y_log10() +
  labs(x = "Date", y = "Flow (MGD)", 
       title = "Facility Reported Discharge", 
       subtitle = "Jan-2017 through Aug-2018",
       caption = "source: EPA ECHO")

Find facilities and data

Searching ECHO for for permit holders by location or facility characterisitics is really valuable, albeit potentially verbose. The functions available for facility serachs have a long list of available search arguments. Here I will demonstrate a search by hydrologic unit code, a useful search area for those in hydrology.

df <- echoWaterGetFacilityInfo(p_huc = "12100401")
tibble::glimpse(df)
## Observations: 47
## Variables: 26
## $ CWPName                  <chr> "3280 FARM", "ALAMO CONCRETE PRODUCTS...
## $ SourceID                 <chr> "TXG130056", "TXG110081", "TX0004715"...
## $ CWPStreet                <chr> "1297 FM 3280", "W OF ST HWY 35 17MI ...
## $ CWPCity                  <chr> "PALACIOS", "PALACIOS", "POINT COMFOR...
## $ CWPState                 <chr> "TX", "TX", "TX", "TX", "TX", "TX", "...
## $ CWPStateDistrict         <chr> "", "", "", "", "", "", "", "", "", "...
## $ CWPZip                   <chr> "774651779", "77465", "77978", "77465...
## $ MasterExternalPermitNmbr <chr> "TXG130000", "TXG110000", "", "TXG130...
## $ RegistryID               <chr> "110045502639", "110009784732", "1100...
## $ CWPCounty                <chr> "Jackson", "Matagorda", "Calhoun", "M...
## $ CWPEPARegion             <chr> "06", "06", "06", "06", "06", "06", "...
## $ FacDerivedHuc            <chr> "12100401", "12100401", "12100401", "...
## $ FacLat                   <dbl> 28.69808, 28.72175, 28.65083, 28.6889...
## $ FacLong                  <dbl> -96.32411, -96.21038, -96.56222, -96....
## $ CWPTotalDesignFlowNmbr   <dbl> NA, NA, 92.320, NA, NA, NA, NA, NA, 2...
## $ CWPActualAverageFlowNmbr <dbl> NA, NA, 92.320, NA, NA, 0.024, NA, 8....
## $ ReceivingMs4Name         <chr> "", "", "", "", "", "", "", "", "", "...
## $ AssociatedPollutant      <chr> "", "", "", "", "", "", "", "", "", "...
## $ MsgpPermitType           <chr> "", "", "", "", "", "", "", "", "", "...
## $ CWPPermitStatusDesc      <chr> "Terminated", "Effective", "Effective...
## $ CWPPermitTypeDesc        <chr> "General Permit Covered Facility", "G...
## $ CWPIssueDate             <date> 2012-03-13, 2017-03-14, 2015-12-21, ...
## $ CWPEffectiveDate         <date> 2012-04-01, 2017-04-01, 2016-01-01, ...
## $ CWPExpirationDate        <date> 2016-04-17, 2021-11-07, 2020-08-01, ...
## $ CWPSNCStatusDate         <date> 2018-06-30, 2018-06-30, 2018-06-30, ...
## $ CWPStateWaterBodyCode    <chr> "", "", "121004010260", "", "", "2456...

I found 47 permits, some are terminated some active. By default, the facility search function return a data_frame with variables I deemed useful. However, ECHO provides a a whole host of possible variables (in my state, these are often unfortunately left blank). echoWaterGetParams() will return a dataframe with these variable names, description and ColumnID number. Use the qcolumn argument in the search function to specify what variables you would like returned. According to echoWaterGetParams() there are 299 possible return variables. Certain variables are always returned regardless of the qcolumns argument specified. If I want to find all the POTW (public wastewater treatment plants), I need to specify qcolumn 27, which I cfound using the echoWaterGetMeta() function.

meta <- echoWaterGetMeta()
tibble::glimpse(meta)
## Observations: 299
## Variables: 6
## $ ColumnName  <chr> "CWP_NAME", "SOURCE_ID", "CWP_STREET", "CWP_CITY",...
## $ DataType    <chr> "VARCHAR2", "VARCHAR2", "VARCHAR2", "VARCHAR2", "C...
## $ DataLength  <chr> "200", "30", "200", "100", "2", "40", "10", "9", "...
## $ ColumnID    <chr> "1", "2", "3", "4", "5", "6", "7", "8", "9", "10",...
## $ ObjectName  <chr> "CWPName", "SourceID", "CWPStreet", "CWPCity", "CW...
## $ Description <chr> "Facility or permit holder name, as maintained in ...
df <- echoWaterGetFacilityInfo(p_huc = "12100401", qcolumns = "1,2,14,23,24,27")
df
## # A tibble: 47 x 6
##    CWPName         SourceID FacDerivedHuc FacLat FacLong CWPFacilityTypeI~
##    <chr>           <chr>    <chr>          <dbl>   <dbl> <chr>            
##  1 3280 FARM       TXG1300~ 12100401        28.7   -96.3 NON-POTW         
##  2 ALAMO CONCRETE~ TXG1100~ 12100401        28.7   -96.2 NON-POTW         
##  3 ALCOA POINT CO~ TX00047~ 12100401        28.7   -96.6 NON-POTW         
##  4 APEKS AQUACULT~ TXG1300~ 12100401        28.7   -96.4 NON-POTW         
##  5 BLESSING        TXG1300~ 12100401        28.9   -96.2 NON-POTW         
##  6 BOCA CHICA SEC~ TX00982~ 12100401        28.7   -96.4 NON-POTW         
##  7 BON L. CAMPO, ~ TX00305~ 12100401        29.1   -96.2 NON-POTW         
##  8 BOWERS SHRIMP ~ TX01126~ 12100401        28.7   -96.2 NON-POTW         
##  9 CITY OF EL CAM~ TX00214~ 12100401        29.2   -96.3 POTW             
## 10 CITY OF PALACI~ TX00230~ 12100401        28.7   -96.2 POTW             
## # ... with 37 more rows

I can use this data_frame and purrr::pmap to retrieve discharge information. I will look up bacteria concentrations this time.

df <- df %>%
  filter(CWPFacilityTypeIndicator == "POTW")

reports <- df %>%
  mutate(DMR = purrr::pmap(., ~echoGetEffluent(p_id = ..2,
                                               parameter_code = "51040",
                                               start_date = "01/01/2010",
                                               end_date = "12/30/2017"))) %>%
  tidyr::unnest(DMR)
tibble::glimpse(df)                          
## Observations: 6
## Variables: 6
## $ CWPName                  <chr> "CITY OF EL CAMPO WWTF", "CITY OF PAL...
## $ SourceID                 <chr> "TX0021474", "TX0023051", "TX0105104"...
## $ FacDerivedHuc            <chr> "12100401", "12100401", "12100401", "...
## $ FacLat                   <dbl> 29.18397, 28.70369, 28.84582, 28.8356...
## $ FacLong                  <dbl> -96.25656, -96.23397, -96.46138, -96....
## $ CWPFacilityTypeIndicator <chr> "POTW", "POTW", "POTW", "POTW", "POTW...

Very quickly, we just found all the POTWs that discharge treated wastewater in the watershed of interest and pulled in their reported discharges. Now we can plot the data and call it a day.

ggplot(reports) +
  geom_line(aes(MonitoringPeriodEndDate, DMRValueNmbr, color = SourceID), alpha = 0.5) +
  geom_point(aes(MonitoringPeriodEndDate, DMRValueNmbr, color = SourceID), alpha = 0.5) +
  facet_wrap(~StatisticalBaseDesc) +
  scale_y_log10() +
  labs(x = "Date", y = "Bacteria (MPN/100mL)")

ggplot(reports) +
  geom_density(aes(DMRValueNmbr, fill = SourceID, color = SourceID), alpha = 0.5, trim = TRUE) +
  facet_wrap(~StatisticalBaseDesc) +
  scale_x_log10(labels = scales::comma) +
  labs(x = "Bacteria (MPN/100mL)", y = "Density")

I was able to quickly generate some time series and density plots without ever touching the ECHO online user interface. There is plenty more documentation at:

https://mps9506.github.io/echor/index.html

If you have suggestions or problems, please report it at:

https://github.com/mps9506/echor/issues

References

Gibbs, Carole, and Sally S. Simpson. 2009. “Measuring Corporate Environmental Crime Rates: Progress and Problems.” Crime, Law and Social Change 51 (1):87–107. https://doi.org/10.1007/s10611-008-9145-1.