@mpschramm: Predicting estuarine salinity using simple statistical models part 1

Michael Schramm

For some of my projects I need to predict daily mean salinity in tidal rivers/estuaries. There are process models that are capable of doing this at daily and sub-daily time steps but require a lot of information. I rely on simple statistical methods for identifying the relationship between freshwater inflow and salinity. This article will demonstrate nonlinear least squares. The data used in this example is from USGS and the Texas Water Development Board. The dataRetrieval is a must have package if you routinely retrieve USGS stream gage data. TWDB provides a robust web API for calling data they have collated in the Water Data for Texas dashboard. minpack.lm is used for fitting nonlinear least squares. I use the tidyverse set of packages for data wrangling and plotting.

library(dataRetrieval)
library(hrbrthemes)
library(tidyverse)
library(patchwork)
library(minpack.lm)

## downloads mean daily discharge stream gage data from USGS
flow <- readNWISdv(siteNumbers = "08041780",
                   parameterCd = "72137", # technically this is tidally filtered discharge
                                          # 00060 is discharge
                   startDate = "2008-01-01",
                   endDate = "2020-07-31") %>%
  dplyr::select(Date, X_72137_00003) %>%
  dplyr::rename(date = Date, 
                discharge = X_72137_00003)


## downloads hourly salinity data from TWDB
## and summarizes it to mean daily salinity
salinity <- read_csv("https://waterdatafortexas.org/coastal/api/stations/SAB1/data/seawater_salinity?start_date=2008-01-01&end_date=2020-07-31&binning=hour&output_format=csv",
                     comment = "#",
                     col_types = list(col_datetime(),
                                      col_number())) %>%
  mutate(datetime = as.Date(datetime)) %>%
  group_by(datetime) %>%
  summarize(salinity = mean(value))

df <- left_join(flow, salinity, by = c("date" = "datetime")) %>%
  dplyr::filter(!is.na(salinity))

With the data downloaded, a quick visualization of the data below shows the expected relationship. Increased salinity with decreased freshwater inflow and possible long term trends in flow and salinity. The salinity-flow relationship appears to be a logistic function. This makes sense since, we know salinity (at least for most estuaries under normal conditions) will have a maximum value (typically around 36 psu) and a minimum value (maybe at or above zero).

## a little data exploration

p1 <- ggplot(df) +
  geom_point(aes(date, log1p(discharge), color = "mean daily discharge (cfs)"), alpha = 0.25, shape = 16) +
  geom_point(aes(date, salinity, color = "mean daily salinity (psu)"), alpha = 0.25, shape = 16) +
  geom_smooth(aes(date, log1p(discharge), color = "mean daily discharge (cfs)"), method = "lm", se = FALSE) +
  geom_smooth(aes(date, salinity, color = "mean daily salinity (psu)"), method = "lm", se = FALSE) +
  scale_color_ipsum() +
  theme_ipsum_pub() +
  labs(y = "") +
  theme(legend.position = "bottom")

p2 <- ggplot(df) +
  geom_point(aes(log1p(discharge), salinity), alpha = 0.25, shape = 16) +
  scale_color_ipsum() +
  theme_ipsum_pub()

p1 + p2

Figure 1: Flow and salinity scatterplots

Linear regression

Since we observe the logistic function in the streamflow salinity relationship, a linear regression probably isn’t the best choice. But let’s fit one anyways.

## Go ahead a transform the predictor variable
## I am also going to add a tiny amount to the single zero value
df <- df %>%
  mutate(discharge = case_when(
    discharge <= 0 ~ 0.0001,
    discharge > 0 ~ discharge),
    log_discharge = log(discharge))

m.lm <- lm(salinity ~ log_discharge,
           data = df)
summary(m.lm)


Call:
lm(formula = salinity ~ log_discharge, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-29.6060  -3.4776  -0.6134   3.1222  16.0620 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    26.6522     0.3579   74.46   <2e-16 ***
log_discharge  -2.1864     0.0424  -51.56   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.727 on 3888 degrees of freedom
Multiple R-squared:  0.4061,    Adjusted R-squared:  0.406 
F-statistic:  2659 on 1 and 3888 DF,  p-value: < 2.2e-16

The model summary indicates a fairly low adjusted r squared. I know I don’t want to use this model, but we can plot the model residuals and predictions to see where the issue is. Basically we observe over-predictions at low flows and potential under predictions at higher flows. We will move on to a nonlinear least square approach.

df_lm <- df %>%
  mutate(residuals = resid(m.lm),
         fits = predict(m.lm, type = "response"))

p1 <- ggplot(df_lm) +
  geom_histogram(aes(residuals)) +
  scale_fill_ipsum() +
  theme_ipsum_pub() +
  labs(subtitle = "histogram of residuals")

p2 <- ggplot(df_lm) +
  geom_point(aes(log_discharge, residuals), alpha = 0.25) +
  geom_hline(yintercept = 0) +
  scale_color_ipsum() +
  theme_ipsum_pub() +
  labs(subtitle = "residuals vs predictor")

p3 <- ggplot(df_lm) +
  geom_point(aes(salinity, fits), alpha = 0.2) +
  geom_abline(slope = 1) +
  labs(x = "measured", y = "predicted",
       subtitle = "predicted vs measured") +
  scale_color_ipsum() +
  theme_ipsum_pub()


p4 <- ggplot(df_lm) +
  geom_point(aes(log_discharge, salinity), alpha = 0.2) +
  geom_line(aes(log_discharge, fits)) +
  scale_color_ipsum() +
  theme_ipsum_pub() +
  labs(subtitle = "measured and fit")


(p1 + p2) / (p3 + p4)

Nonlinear Least Squares

A clear sigmoid curve or logistic function is evident in figure… The logistic function is defined by the formula:

\[ f(x) = \frac{L}{1 + e^{-k(x-x_0)}} \]

where \(L\) = the maximum value of the curve, \(x_0\) = the midpoint of the curve, and \(k\) is the logistic growth rate. Nonlinear least squares can be used to parameterize the model. The starting values in the list are eyeballed from figure. Instead of nls, I am using the propagate package and the predictNLS function to also obtain the 95% confidence intervals.

f_S <- formula(salinity ~ L / (1 + exp(-k * (log_discharge - x_0))))

m.ls <- nls(f_S,
              data = df,
              start = list(L = 20, k = -3, x_0 = 2))

summary(m.ls)


Formula: salinity ~ L/(1 + exp(-k * (log_discharge - x_0)))

Parameters:
    Estimate Std. Error t value Pr(>|t|)    
L   18.46682    0.26986   68.43   <2e-16 ***
k   -1.19123    0.03164  -37.65   <2e-16 ***
x_0  8.21522    0.03256  252.32   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.503 on 3887 degrees of freedom

Number of iterations to convergence: 8 
Achieved convergence tolerance: 2.818e-06

df_ls <- propagate::predictNLS(model = m.ls, 
                      newdata = data.frame(log_discharge = seq(min(df$log_discharge),max(df$log_discharge), by = .1)),
                      interval = "confidence", nsim = 10000, alpha = 0.05)

df_predict <- tibble(
  log_discharge = seq(min(df$log_discharge),max(df$log_discharge), by = .1),
  fit = df_ls$summary$Prop.Mean.1,
  upr = df_ls$summary$`Prop.97.5%`,
  lwr = df_ls$summary$`Prop.2.5%`)


ggplot() +
  geom_point(data = df, aes(log_discharge, salinity, color = "Measured Values"), alpha = 0.25) +
  geom_line(data = df_predict, aes(log_discharge, fit, color = "Least Squares Fit"), color = "black") +
  geom_ribbon(data = df_predict, aes(x = log_discharge, ymin = lwr, ymax = upr), alpha = 0.5) +
  scale_color_ipsum() +
  theme_ipsum_pub()

The model summary provides the parameter estimates, \(f(x) = \frac{18.467}{1 + e^{--1.191(x-8.215)}}\) The plot generally appears to follow reality. However, when we look at the model residuals below, it is evident that as salinity approaches zero, the model over predicts. If you prefer a numeric model metric, hydroGOF provides a range of model metrics to choose from below. Additional model terms to account for covariates such as season or long term trends might result in a better model. In future posts I will look at fitting this data with beta regression and generalized additive models. The NLS approach is appealing because it provides a simple model with relatively good fit. However, if we want better predictive performance I suspect a multiple regression approach would offer some advantages.

hydroGOF::gof(df_ls$salinity, df_ls$fits)

          [,1]
ME       -0.06
MAE       2.69
MSE      12.26
RMSE      3.50
NRMSE %  71.10
PBIAS %  -0.70
RSR       0.71
rSD       1.25
NSE       0.49
mNSE      0.36
rNSE      0.09
d         0.89
md        0.71
rd        0.81
cp      -15.11
r         0.82
R2        0.67
bR2       0.67
KGE       0.70
VE        0.69

Predicting estuarine salinity using simple statistical models part 1

Linear regression

Nonlinear Least Squares

References

Corrections

Reuse

Citation