---
title: "Summarizing and Visualizing Sampling Frames, Design Sites, and Analysis Data"
author: "Michael Dumelle, Tom Kincaid, Anthony Olsen, and Marc Weber"
output: 
  html_document:
    theme: flatly
    number_sections: true
    highlighted: default 
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: no
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Summaries and Visualizations}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>", 
  warning = FALSE, 
  message = FALSE
)
```

If you have yet not read the "Start Here" vignette, please do so by running
```{r, eval = FALSE}
vignette("start-here", "spsurvey")
```

# Introduction 

Before proceeding, we load spsurvey by running
```{r}
library(spsurvey)
```

The `summary()` and `plot()` functions in spsurvey are used to summarize and visualize sampling frames, design sites, and analysis data. Both functions use a formula argument that specifies the variables to summarize or visualize. These functions behave differently for one-sided and two-sided formulas. To learn more about formulas in R, run `?formula`. Only the core functionality of `summary()` and `plot()` will be covered in this vignette, so to learn more about these functions, run `?summary` and `?plot`. The `sp_summary()` and `sp_plot()` functions can equivalently be used in place of `plot()` and `summary()`, respectively (`sp_summary()` and `sp_plot()` are currently maintained for backwards compatibility with previous spsurvey versions).

The `plot()` function in spsurvey is built on the `plot()` function in sf. spsurvey's `plot()` function accommodates all the arguments in sf's `plot()` function and adds a few additional features.  To learn more about the `plot()` function in sf, run `?plot.sf()`. 

# Sampling frames

Summarizing and visualizing the sampling frame is often helpful to better understand your data and inform additional survey design options (e.g. stratification). To use `plot()` or `sp_summarize()`, sampling frames must either be an `sf` object or a data frame with x-coordinates, y-coordinates, and a crs (coordinate reference system).

The `NE_Lakes` data in spsurvey is a sampling frame (as an `sf` object) that contains lakes from the Northeastern United States. There are three variables in `NE_Lakes` you will use next:

1. `AREA_CAT`: lake area categories (small and large)
2. `ELEV`: lake elevation (a continuous variable)
3. `ELEV_CAT`: lake elevation categories (low and high)

Before summarizing or visualizing a sampling frame, turn it into an \code{sp_frame} object using `sp_frame()`:
```{r}
NE_Lakes <- sp_frame(NE_Lakes)
```


## One-sided formulas

One-sided formulas are used to summarize and visualize the distributions of variables. The variables of interest should be placed on the right-hand side of the formula. To summarize the distribution of `ELEV`, run
```{r}
summary(NE_Lakes, formula = ~ ELEV)
```
The output contains two columns: `total` and `ELEV`. The `total` column returns the total number of lakes, functioning as an "intercept" to the formula (it can by removed by supplying `- 1` to the formula). The `ELEV` column returns a numerical summary of lake elevation. To visualize `ELEV`, run
```{r, eval = FALSE}
plot(NE_Lakes, formula = ~ ELEV)
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ~ ELEV, key.pos = 4)
```

To summarize the distribution of `ELEV_CAT`, run
```{r}
summary(NE_Lakes, formula = ~ ELEV_CAT)
```
The `ELEV_CAT` column returns the number of lakes in each elevation category. To visualize `ELEV_CAT`, run
```{r, eval = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT, key.width = lcm(3), key.pos = 4)
```

The `key.width` argument extends the plot's margin to fit the legend text nicely within the plot. The plot's default title is the `formula` argument, though this is changed using the `main` argument to `plot()`.

The formula used by `summary()` and `plot()` is quite flexible. Additional variables are included using `+`:
```{r}
summary(NE_Lakes, formula = ~ ELEV_CAT + AREA_CAT)
```
The `plot()` function returns two plots -- one for `ELEV_CAT` and another for `AREA_CAT`:
```{r, eval = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT + AREA_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT + AREA_CAT, key.width = lcm(3), key.pos = 4)
```

Interactions are included using the interaction operator, `:`. The interaction operator returns the interaction between variables and is most useful when used with categorical variables. To summarize the interaction between `ELEV_CAT` and `AREA_CAT`, run
```{r}
summary(NE_Lakes, formula = ~ ELEV_CAT:AREA_CAT)
```
Levels of each variable are separated by `:`. For example, there are 86 lakes that are in the low elevation category and the small area category. To visualize this interaction, run
```{r, eval = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT:AREA_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT:AREA_CAT, key.width = lcm(3), key.pos = 4)
```

The formula accommodates the `*` operator, which combines the `+` and `:` operators. For example, `ELEV_CAT*AREA_CAT` is shorthand for `ELEV_CAT + AREA_CAT + ELEV_CAT:AREA_CAT`. The formula also accommodates the `.` operator, which is shorthand for all variables separated by `+`.

## Two-sided formulas

Two-sided formulas are used to summarize the distribution of a left-hand side variable for each level of each right-hand side variable. To summarize the distribution of `ELEV` for each level of `AREA_CAT`, run
```{r}
summary(NE_Lakes, formula = ELEV ~ AREA_CAT)
```
To visualize the distribution of `ELEV` for each level of `AREA_CAT`, run
```{r, eval = FALSE}
plot(NE_Lakes, formula = ELEV ~ AREA_CAT)
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ELEV ~ AREA_CAT, key.pos = 4)
```

To only summarize or visualize a particular level of a single right-hand side variable, use the `onlyshow` argument:
```{r}
summary(NE_Lakes, formula = ELEV ~ AREA_CAT, onlyshow = "small")
```

```{r, eval = FALSE}
plot(NE_Lakes, formula = ELEV ~ AREA_CAT, onlyshow = "small")
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ELEV ~ AREA_CAT, onlyshow = "small", key.pos = 4)
```

To summarize the distribution of `ELEV_CAT` for each level of `AREA_CAT`, run
```{r}
summary(NE_Lakes, formula = ELEV_CAT ~ AREA_CAT)
```
To visualize the distribution of `ELEV_CAT` for each level of `AREA_CAT`, run
```{r, eval = FALSE}
plot(NE_Lakes, formula = ELEV_CAT ~ AREA_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ELEV_CAT ~ AREA_CAT, key.width = lcm(3), key.pos = 4)
```


## Adjusting graphical parameters

There are three arguments in `plot()` that can adjust graphical parameters:

1. `var_args` adjusts graphical parameters simultaneously for all levels of a variable 
2. `varlevel_args` adjusts graphical parameters uniquely for each level of a variable
3. `...` adjusts graphical parameters for simultaneously for all levels of all variables
    
The `var_args` and `varlevel_args` arguments take lists whose names match variable names in the formula. For `varlevel_args`, each list element must have an element named `levels` that matches the variable's levels. The following example combines all three graphical parameter adjustment arguments:
```{r, eval = FALSE}
list1 <- list(main = "Elevation Categories", pal = rainbow)
list2 <- list(main = "Area Categories")
list3 <- list(levels = c("small", "large"), pch = c(4, 19))
plot(
  NE_Lakes,
  formula = ~ ELEV_CAT + AREA_CAT,
  var_args = list(ELEV_CAT = list1, AREA_CAT = list2),
  varlevel_args = list(AREA_CAT = list3),
  cex = 0.75,
  key.width = lcm(3)
)
```

```{r, echo = FALSE}
list1 <- list(main = "Elevation Categories", pal = rainbow)
list2 <- list(main = "Area Categories")
list3 <- list(levels = c("small", "large"), pch = c(4, 19))
plot(
  NE_Lakes,
  formula = ~ ELEV_CAT + AREA_CAT,
  var_args = list(ELEV_CAT = list1, AREA_CAT = list2),
  varlevel_args = list(AREA_CAT = list3),
  cex = 0.75,
  key.width = lcm(3),
  key.pos = 4
)
```

`var_args` uses `list1` to give the `ELEV_CAT` visualization a new title and color palette; `var_args` uses `list2` to give the `AREA_CAT` visualization a new title; `varlevel_args` uses `list3` to give the `AREA_CAT` visualization different shapes for the small and large levels; `...` uses `cex = 0.75` to reduce the size of all points; and `...` uses `key.width` to adjust legend spacing for all visualizations. 

If a two-sided formula is used, it is possible to adjust graphical parameters of the left-hand side variable for all levels of a right-hand side variable. This occurs when a sublist matching the structure of `varlevel_args` is used as an argument to `var_args`. In this next example, different shapes are used for the small and large levels of `AREA_CAT` for all levels of `ELEV_CAT`:
```{r,, eval = FALSE}
sublist <- list(AREA_CAT = list3)
plot(
  NE_Lakes,
  formula = AREA_CAT ~ ELEV_CAT,
  var_args = list(ELEV_CAT = sublist),
  key.width = lcm(3)
)
```

```{r, echo = FALSE}
sublist <- list(AREA_CAT = list3)
plot(
  NE_Lakes,
  formula = AREA_CAT ~ ELEV_CAT,
  var_args = list(ELEV_CAT = sublist),
  key.width = lcm(3),
  key.pos = 4
)
```

# Design sites

Design sites (output from the `grts()` or `irs()` functions) can be summarized and visualized using `summary()` and `plot()` very similarly to how sampling frames were summarized and visualized in the previous section. Soon you will use the `grts()` function to select a spatially balanced sample. The `grts()` function does incorporate randomness, so to match your results with this output exactly you will need to set a reproducible seed by running
```{r}
set.seed(51)
```

First we will obtain some design sites: To select an equal probability GRTS sample of size 50 with 10 reverse hierarchically ordered replacement sites, run
```{r}
eqprob_rho <- grts(NE_Lakes, n_base = 50, n_over = 10)
```

Similar to `summary()` and `plot()` for sampling frames, `summary()` and `plot()` for design sites uses a formula. The formula should include `siteuse`, which is the name of the variable in the design sites object that indicates the type of each site. The default formula for `summary()` and `plot()` is `~siteuse`, which summarizes or visualizes the `sites` objects in the design sites object. By default, the formula is applied to all non-`NULL` `sites` objects (in `eqprob_rho`, the non`NULL` sites objects are `sites_base` (for the base sites) and `sites_over` (for the reverse hierarchically ordered replacement sites)).
```{r}
summary(eqprob_rho)
```


```{r, eval = FALSE}
plot(eqprob_rho, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(eqprob_rho, key.width = lcm(3), key.pos = 4)
```

The sampling frame may be included as an argument to the `plot()` function:
```{r, eval = FALSE}
plot(eqprob_rho, NE_Lakes, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(eqprob_rho, NE_Lakes, key.width = lcm(3), key.pos = 4)
```

When you include `siteuse` as a left-hand side variable (`siteuse` is treated as a categorical variable), you can summarize and visualize the `sites` object for each level of each right-hand side variable:
```{r}
summary(eqprob_rho, formula = siteuse ~ AREA_CAT)
```

```{r, eval = FALSE}
plot(eqprob_rho, formula = siteuse ~ AREA_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(eqprob_rho, formula = siteuse ~ AREA_CAT, key.width = lcm(3), key.pos = 4)
```


You can also summarize and visualize a left-hand side variable for each level of `siteuse`:
```{r}
summary(eqprob_rho, formula = ELEV ~ siteuse)
```

```{r, eval = FALSE}
plot(eqprob_rho, formula = ELEV ~ siteuse)
```

```{r, echo = FALSE}
plot(eqprob_rho, formula = ELEV ~ siteuse, key.pos = 4)
```


# Analysis data

`sp_summarize()` and `plot()` work for analysis data the same way they do for sampling frames.
The `NLA_PNW` analysis data in spsurvey is analysis data (as an `sf` object) from lakes in California, Oregon, and Washington. There are two variables in `NLA_PNW` you will use next:

1. `STATE`: state name (`California`, `Washington`, and `Oregon`)
2. `NITR_COND` : nitrogen content categories (`Poor`, `Fair`, and `Good`)

Before summarizing or visualizing a sampling frame, turn it into an object using  `sp_frame()`:
```{r}
NLA_PNW <- sp_frame(NLA_PNW)
```

To summarize and visualize `NITR_COND` across all states, run
```{r}
summary(NLA_PNW, formula = ~ NITR_COND)
```

```{r, eval = FALSE}
plot(NLA_PNW, formula = ~ NITR_COND, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NLA_PNW, formula = ~ NITR_COND, key.width = lcm(3), key.pos = 4)
```

Suppose the sampling design was stratified by `STATE`. To summarize and visualize `NITR_COND` by `STATE`, run
```{r}
summary(NLA_PNW, formula = NITR_COND ~ STATE)
```

```{r, eval = FALSE}
plot(NLA_PNW, formula = NITR_COND ~ STATE, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NLA_PNW, formula = NITR_COND ~ STATE, key.width = lcm(3), key.pos = 4)
```