---
title: "Summarizing and Visualizing Sampling Frames, Design Sites, and Analysis Data"
author: "Michael Dumelle, Tom Kincaid, Anthony Olsen, and Marc Weber"
output:
  html_document:
    theme: flatly
    number_sections: true
    highlighted: default
    toc: yes
    toc_float:
      collapsed: no
      smooth_scroll: no
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Summaries and Visualizations}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)
```

If you have not yet read the "Start Here" vignette, please do so by running

```{r, eval = FALSE}
vignette("start-here", "spsurvey")
```

# Introduction

Before proceeding, we load spsurvey by running

```{r}
library(spsurvey)
```

The `summary()` and `plot()` functions in spsurvey are used to summarize and visualize sampling frames, design sites, and analysis data. Both functions use a formula argument that specifies the variables to summarize or visualize. These functions behave differently for one-sided and two-sided formulas. To learn more about formulas in R, run `?formula`. Only the core functionality of `summary()` and `plot()` is covered in this vignette, so to learn more about these functions, run `?summary` and `?plot`. The `sp_summary()` and `sp_plot()` functions can equivalently be used in place of `summary()` and `plot()`, respectively (`sp_summary()` and `sp_plot()` are currently maintained for backwards compatibility with previous spsurvey versions).

The `plot()` function in spsurvey is built on the `plot()` function in sf. spsurvey's `plot()` function accommodates all the arguments in sf's `plot()` function and adds a few additional features. To learn more about the `plot()` function in sf, run `?plot.sf`.

# Sampling frames

Summarizing and visualizing the sampling frame is often helpful to better understand your data and inform additional survey design options (e.g. stratification).
To use `summary()` or `plot()`, sampling frames must either be an `sf` object or a data frame with x-coordinates, y-coordinates, and a crs (coordinate reference system).

The `NE_Lakes` data in spsurvey is a sampling frame (as an `sf` object) that contains lakes from the Northeastern United States. There are three variables in `NE_Lakes` you will use next:

1. `AREA_CAT`: lake area categories (small and large)
2. `ELEV`: lake elevation (a continuous variable)
3. `ELEV_CAT`: lake elevation categories (low and high)

Before summarizing or visualizing a sampling frame, turn it into an `sp_frame` object using `sp_frame()`:

```{r}
NE_Lakes <- sp_frame(NE_Lakes)
```

## One-sided formulas

One-sided formulas are used to summarize and visualize the distributions of variables. The variables of interest should be placed on the right-hand side of the formula. To summarize the distribution of `ELEV`, run

```{r}
summary(NE_Lakes, formula = ~ ELEV)
```

The output contains two columns: `total` and `ELEV`. The `total` column returns the total number of lakes, functioning as an "intercept" to the formula (it can be removed by supplying `- 1` to the formula). The `ELEV` column returns a numerical summary of lake elevation. To visualize `ELEV`, run

```{r, eval = FALSE}
plot(NE_Lakes, formula = ~ ELEV)
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ~ ELEV, key.pos = 4)
```

To summarize the distribution of `ELEV_CAT`, run

```{r}
summary(NE_Lakes, formula = ~ ELEV_CAT)
```

The `ELEV_CAT` column returns the number of lakes in each elevation category. To visualize `ELEV_CAT`, run

```{r, eval = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT, key.width = lcm(3), key.pos = 4)
```

The `key.width` argument extends the plot's margin to fit the legend text nicely within the plot. The plot's default title is the `formula` argument, though this can be changed using the `main` argument to `plot()`.
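
The "intercept" behavior described above follows R's general formula conventions, so it can be checked with base R's `terms()` without involving spsurvey at all. The following is a minimal sketch that only shows how R parses the formula; it does not reproduce how spsurvey computes its summaries.

```r
# Base R only: inspect how R parses one-sided formulas.
with_total <- terms(~ ELEV)
without_total <- terms(~ ELEV - 1)

# The "intercept" attribute is 1 when the intercept is kept
# and 0 when it has been removed with `- 1`.
attr(with_total, "intercept")
attr(without_total, "intercept")

# The right-hand side variable is the same in both cases.
attr(with_total, "term.labels")
attr(without_total, "term.labels")
```
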
The formula used by `summary()` and `plot()` is quite flexible. Additional variables are included using `+`:

```{r}
summary(NE_Lakes, formula = ~ ELEV_CAT + AREA_CAT)
```

The `plot()` function returns two plots -- one for `ELEV_CAT` and another for `AREA_CAT`:

```{r, eval = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT + AREA_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT + AREA_CAT, key.width = lcm(3), key.pos = 4)
```

Interactions are included using the interaction operator, `:`. The interaction operator returns the interaction between variables and is most useful when used with categorical variables. To summarize the interaction between `ELEV_CAT` and `AREA_CAT`, run

```{r}
summary(NE_Lakes, formula = ~ ELEV_CAT:AREA_CAT)
```

Levels of each variable are separated by `:`. For example, there are 86 lakes that are in the low elevation category and the small area category. To visualize this interaction, run

```{r, eval = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT:AREA_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ~ ELEV_CAT:AREA_CAT, key.width = lcm(3), key.pos = 4)
```

The formula accommodates the `*` operator, which combines the `+` and `:` operators. For example, `ELEV_CAT*AREA_CAT` is shorthand for `ELEV_CAT + AREA_CAT + ELEV_CAT:AREA_CAT`. The formula also accommodates the `.` operator, which is shorthand for all variables separated by `+`.

## Two-sided formulas

Two-sided formulas are used to summarize the distribution of a left-hand side variable for each level of each right-hand side variable.
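
To make "for each level of each right-hand side variable" concrete, here is a rough base R analogy. It assumes a small hypothetical data frame (`toy_lakes`, with made-up values that are not taken from `NE_Lakes`) and mimics the grouping idea only; spsurvey's own output is formatted differently.

```r
# Hypothetical toy data; these values are not taken from NE_Lakes.
toy_lakes <- data.frame(
  ELEV = c(10, 20, 30, 40, 50, 60),
  AREA_CAT = factor(c("small", "small", "small", "large", "large", "large"))
)

# Pull the left- and right-hand side variable names out of the formula.
f <- ELEV ~ AREA_CAT
vars <- all.vars(f)
lhs <- vars[1]
rhs <- vars[2]

# Split the left-hand side variable by the levels of the right-hand side variable.
elev_by_area <- split(toy_lakes[[lhs]], toy_lakes[[rhs]])

# One numerical summary of ELEV per level of AREA_CAT.
lapply(elev_by_area, summary)
```
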
To summarize the distribution of `ELEV` for each level of `AREA_CAT`, run

```{r}
summary(NE_Lakes, formula = ELEV ~ AREA_CAT)
```

To visualize the distribution of `ELEV` for each level of `AREA_CAT`, run

```{r, eval = FALSE}
plot(NE_Lakes, formula = ELEV ~ AREA_CAT)
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ELEV ~ AREA_CAT, key.pos = 4)
```

To only summarize or visualize a particular level of a single right-hand side variable, use the `onlyshow` argument:

```{r}
summary(NE_Lakes, formula = ELEV ~ AREA_CAT, onlyshow = "small")
```

```{r, eval = FALSE}
plot(NE_Lakes, formula = ELEV ~ AREA_CAT, onlyshow = "small")
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ELEV ~ AREA_CAT, onlyshow = "small", key.pos = 4)
```

To summarize the distribution of `ELEV_CAT` for each level of `AREA_CAT`, run

```{r}
summary(NE_Lakes, formula = ELEV_CAT ~ AREA_CAT)
```

To visualize the distribution of `ELEV_CAT` for each level of `AREA_CAT`, run

```{r, eval = FALSE}
plot(NE_Lakes, formula = ELEV_CAT ~ AREA_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NE_Lakes, formula = ELEV_CAT ~ AREA_CAT, key.width = lcm(3), key.pos = 4)
```

## Adjusting graphical parameters

There are three arguments in `plot()` that can adjust graphical parameters:

1. `var_args` adjusts graphical parameters simultaneously for all levels of a variable
2. `varlevel_args` adjusts graphical parameters uniquely for each level of a variable
3. `...` adjusts graphical parameters simultaneously for all levels of all variables

The `var_args` and `varlevel_args` arguments take lists whose names match variable names in the formula. For `varlevel_args`, each list element must have an element named `levels` that matches the variable's levels.
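
The list structure described above can be sketched and sanity-checked in base R before it is passed to `plot()`. The helper `check_varlevel_args()` below is hypothetical and not part of spsurvey. It also assumes that every other element of a `varlevel_args` sublist supplies exactly one value per level, which is an assumption of this sketch rather than a documented requirement.

```r
# Variables named in the formula, extracted with base R.
formula_vars <- all.vars(~ ELEV_CAT + AREA_CAT)

# var_args: one named sublist of graphical parameters per variable.
var_args <- list(ELEV_CAT = list(main = "Elevation Categories"))

# varlevel_args: each sublist carries a `levels` element plus parameters
# that vary by level (assumed here to have one value per level).
varlevel_args <- list(
  AREA_CAT = list(levels = c("small", "large"), pch = c(4, 19))
)

# Hypothetical pre-flight check (not an spsurvey function).
check_varlevel_args <- function(args, vars) {
  # Every sublist name must be a variable in the formula.
  names_ok <- all(names(args) %in% vars)
  # Every sublist must have `levels`, and all elements must match its length.
  levels_ok <- all(vapply(args, function(x) {
    !is.null(x$levels) && all(lengths(x) == length(x$levels))
  }, logical(1)))
  names_ok && levels_ok
}

check_varlevel_args(varlevel_args, formula_vars)
```

A check like this only catches structural slips such as a missing `levels` element or a misspelled variable name; it does not verify that the listed levels actually occur in the data.
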
The following example combines all three graphical parameter adjustment arguments:

```{r, eval = FALSE}
list1 <- list(main = "Elevation Categories", pal = rainbow)
list2 <- list(main = "Area Categories")
list3 <- list(levels = c("small", "large"), pch = c(4, 19))
plot(
  NE_Lakes,
  formula = ~ ELEV_CAT + AREA_CAT,
  var_args = list(ELEV_CAT = list1, AREA_CAT = list2),
  varlevel_args = list(AREA_CAT = list3),
  cex = 0.75,
  key.width = lcm(3)
)
```

```{r, echo = FALSE}
list1 <- list(main = "Elevation Categories", pal = rainbow)
list2 <- list(main = "Area Categories")
list3 <- list(levels = c("small", "large"), pch = c(4, 19))
plot(
  NE_Lakes,
  formula = ~ ELEV_CAT + AREA_CAT,
  var_args = list(ELEV_CAT = list1, AREA_CAT = list2),
  varlevel_args = list(AREA_CAT = list3),
  cex = 0.75,
  key.width = lcm(3),
  key.pos = 4
)
```

`var_args` uses `list1` to give the `ELEV_CAT` visualization a new title and color palette; `var_args` uses `list2` to give the `AREA_CAT` visualization a new title; `varlevel_args` uses `list3` to give the `AREA_CAT` visualization different shapes for the small and large levels; `...` uses `cex = 0.75` to reduce the size of all points; and `...` uses `key.width` to adjust legend spacing for all visualizations.

If a two-sided formula is used, it is possible to adjust graphical parameters of the left-hand side variable for all levels of a right-hand side variable. This occurs when a sublist matching the structure of `varlevel_args` is used as an argument to `var_args`.
In this next example, different shapes are used for the small and large levels of `AREA_CAT` for all levels of `ELEV_CAT`:

```{r, eval = FALSE}
sublist <- list(AREA_CAT = list3)
plot(
  NE_Lakes,
  formula = AREA_CAT ~ ELEV_CAT,
  var_args = list(ELEV_CAT = sublist),
  key.width = lcm(3)
)
```

```{r, echo = FALSE}
sublist <- list(AREA_CAT = list3)
plot(
  NE_Lakes,
  formula = AREA_CAT ~ ELEV_CAT,
  var_args = list(ELEV_CAT = sublist),
  key.width = lcm(3),
  key.pos = 4
)
```

# Design sites

Design sites (output from the `grts()` or `irs()` functions) can be summarized and visualized using `summary()` and `plot()` in much the same way as sampling frames were in the previous section. Soon you will use the `grts()` function to select a spatially balanced sample. The `grts()` function incorporates randomness, so to match your results with this output exactly you will need to set a reproducible seed by running

```{r}
set.seed(51)
```

First, obtain some design sites. To select an equal probability GRTS sample of size 50 with 10 reverse hierarchically ordered replacement sites, run

```{r}
eqprob_rho <- grts(NE_Lakes, n_base = 50, n_over = 10)
```

Similar to `summary()` and `plot()` for sampling frames, `summary()` and `plot()` for design sites use a formula. The formula should include `siteuse`, which is the name of the variable in the design sites object that indicates the type of each site. The default formula for `summary()` and `plot()` is `~siteuse`, which summarizes or visualizes the `sites` objects in the design sites object. By default, the formula is applied to all non-`NULL` `sites` objects. In `eqprob_rho`, these are `sites_base` (the base sites) and `sites_over` (the reverse hierarchically ordered replacement sites).
```{r}
summary(eqprob_rho)
```

```{r, eval = FALSE}
plot(eqprob_rho, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(eqprob_rho, key.width = lcm(3), key.pos = 4)
```

The sampling frame may be included as an argument to the `plot()` function:

```{r, eval = FALSE}
plot(eqprob_rho, NE_Lakes, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(eqprob_rho, NE_Lakes, key.width = lcm(3), key.pos = 4)
```

When you include `siteuse` as a left-hand side variable (`siteuse` is treated as a categorical variable), you can summarize and visualize the `sites` object for each level of each right-hand side variable:

```{r}
summary(eqprob_rho, formula = siteuse ~ AREA_CAT)
```

```{r, eval = FALSE}
plot(eqprob_rho, formula = siteuse ~ AREA_CAT, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(eqprob_rho, formula = siteuse ~ AREA_CAT, key.width = lcm(3), key.pos = 4)
```

You can also summarize and visualize a left-hand side variable for each level of `siteuse`:

```{r}
summary(eqprob_rho, formula = ELEV ~ siteuse)
```

```{r, eval = FALSE}
plot(eqprob_rho, formula = ELEV ~ siteuse)
```

```{r, echo = FALSE}
plot(eqprob_rho, formula = ELEV ~ siteuse, key.pos = 4)
```

# Analysis data

`summary()` and `plot()` work for analysis data the same way they do for sampling frames. The `NLA_PNW` data in spsurvey is analysis data (as an `sf` object) from lakes in California, Oregon, and Washington. There are two variables in `NLA_PNW` you will use next:

1. `STATE`: state name (`California`, `Washington`, and `Oregon`)
2. `NITR_COND`: nitrogen content categories (`Poor`, `Fair`, and `Good`)

Before summarizing or visualizing analysis data, turn it into an `sp_frame` object using `sp_frame()`:

```{r}
NLA_PNW <- sp_frame(NLA_PNW)
```

To summarize and visualize `NITR_COND` across all states, run

```{r}
summary(NLA_PNW, formula = ~ NITR_COND)
```

```{r, eval = FALSE}
plot(NLA_PNW, formula = ~ NITR_COND, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NLA_PNW, formula = ~ NITR_COND, key.width = lcm(3), key.pos = 4)
```

Suppose the sampling design was stratified by `STATE`. To summarize and visualize `NITR_COND` by `STATE`, run

```{r}
summary(NLA_PNW, formula = NITR_COND ~ STATE)
```

```{r, eval = FALSE}
plot(NLA_PNW, formula = NITR_COND ~ STATE, key.width = lcm(3))
```

```{r, echo = FALSE}
plot(NLA_PNW, formula = NITR_COND ~ STATE, key.width = lcm(3), key.pos = 4)
```
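
As a rough base R analogy for a two-sided formula with a categorical left-hand side, the within-group distribution can be thought of as a cross-tabulation. The sketch below uses a small hypothetical data frame (`toy_nla`, with made-up values that are not taken from `NLA_PNW`); spsurvey's actual output is laid out differently.

```r
# Hypothetical toy data; these values are not taken from NLA_PNW.
toy_nla <- data.frame(
  STATE = c("California", "California", "Oregon",
            "Oregon", "Washington", "Washington"),
  NITR_COND = factor(
    c("Good", "Poor", "Fair", "Good", "Good", "Good"),
    levels = c("Poor", "Fair", "Good")
  )
)

# Count NITR_COND categories within each STATE (rows are states).
counts <- table(toy_nla$STATE, toy_nla$NITR_COND)
counts

# Row-wise proportions give the distribution of NITR_COND within each state.
props <- prop.table(counts, margin = 1)
props
```
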