--- title: "Chemical API" author: "Center for Computational Toxicology and Exposure" output: prettydoc::html_pretty: theme: architect toc: yes toc_depth: 6 vignette: > %\VignetteIndexEntry{Chemical} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{css, echo=FALSE} .scroll-200 { max-height: 200px; overflow-y: auto; } .scroll-300 { max-height: 300px; overflow-y: auto; } .noticebox { padding: 1em; background: lightgray; color: blue; border: 2px solid black; border-radius: 10px; } ``` ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(httptest) start_vignette("2") ``` ```{r setup, echo=FALSE, message=FALSE, warning=FALSE} if (!library(ccdR, logical.return = TRUE)){ devtools::load_all() } old_options <- options("width") ``` ```{r, echo=FALSE, warning=FALSE} # Used to visualize data in a variety of plot designs library(ggplot2) library(gridExtra) ``` ```{r setup-print, echo = FALSE} # Redefining the knit_print method to truncate character values to 25 characters # in each column and to truncate the columns in the print call to prevent # wrapping tables with several columns. #library(ccdR) knit_print.data.table = function(x, ...) { y <- data.table::copy(x) y <- y[, lapply(.SD, function(t){ if (is.character(t)){ t <- strtrim(t, 25) } return(t) })] print(y, trunc.cols = TRUE) } registerS3method( "knit_print", "data.table", knit_print.data.table, envir = asNamespace("knitr") ) ``` ## Introduction In this vignette, [CTX Chemical API](https://api-ccte.epa.gov/docs/chemical.html) will be explored. ::: {.noticebox data-latex=""} **NOTE:** Please see the introductory vignette for an overview of the *ccdR* package and initial set up instruction with API key storage. ::: The foundation of toxicology, toxicokinetics, and exposure is embedded in the physics and chemistry of chemical-biological interactions. The accurate characterization of chemical structure linked to commonly used identifiers, such as names and Chemical Abstracts Service Registry Numbers (CASRNs), is essential to support both predictive modeling of the data as well as dissemination and application of the data for chemical safety decisions. With cheminformatics as the backbone for research efforts, sources of available data through the CTX Chemical API include: - Chemical structures, nomenclature, synonyms, IDs, list associations, physicochemical property, environmental fate and transport data from the Distributed Structure-Searchable Toxicity ([DSSTox](https://www.epa.gov/comptox-tools/distributed-structure-searchable-toxicity-dsstox-database)) database. For early references, see [(Richard, A. et al. 2002)](https://doi.org/10.1016/S0027-5107(01)00289-5), [(Richard, A. et al. 2006)](https://files.toxplanet.com/cpdb/pdfs/structure_tox_on_web.pdf), and [(Richard, A. et al 2008)](https://doi.org/10.1080/15376510701857452). - Predictions from Toxicity Estimation Software Tool ([TEST](https://www.epa.gov/comptox-tools/toxicity-estimation-software-tool-test)) suite of QSAR models. For early references, see [(Martin, T. et al. 2001)](https://pubs.acs.org/doi/10.1021/tx0155045), [(Martin, T. et al. 2007)](https://doi.org/10.1080/15376510701857353), and [(Young, D. et al. 2008)]( https://doi.org/10.1002/qsar.200810084). More information on Chemicals and Chemistry Data can be found here: . ## Functions Several ccdR functions are used to access the CTX Chemical API data. ### Chemical Details Resource #### Get chemical data `get_chemical_details()` retrieves chemical detail data either using the chemical identifier DTXSID or DTXCID. Alternate parameter "projection" determines the type of data returned. Examples for each are provided below: ##### By DTXSID ```{r ccdR dtxsid data chemical, message=FALSE, eval=FALSE} res_dt <- get_chemical_details(DTXSID = 'DTXSID7020182') ``` ##### By DTXCID ```{r ccdR dtxcid data chemical, message=FALSE, eval=FALSE} res_dt <- get_chemical_details(DTXCID = 'DTXCID30182') ``` ### Chemical Property Resource `get_chemical_by_property_range()` retrieves data for chemicals that have a specified property within the input range. ```{r ccdR property range chemical, message=FALSE, eval=FALSE} res_dt <- get_chemical_by_property_range(start = 1.311, end = 1.313, property = 'Density') ``` `get_chem_info()` retrieves specific chemical information for an input chemical. This includes both experimental and predicted values by default, but providing "experimental" or "predicted" to the type parameter will return the specific associated information. ```{r ccdR info chemical, message=FALSE, eval=FALSE} res_dt <- get_chem_info(DTXSID = 'DTXSID7020182') ``` ### Chemical Fate Resource `get_fate_by_dtxsid()` retrieves chemical fate data. ```{r ccdR fate data chemical, message=FALSE, eval=FALSE} res_dt <- get_fate_by_dtxsid(DTXSID = 'DTXSID7020182') ``` ### Chemical Search Resource Chemicals can be searched using string values. Examples for each are provided by the following: #### By starting value ```{r ccdR starting value chemical, message=FALSE, eval=FALSE} res_dt <- chemical_starts_with(word = 'DTXSID70201') ``` #### By exact value ```{r ccdR exact value chemical, message=FALSE, eval=FALSE} res_dt <- chemical_equal(word = 'DTXSID7020182') ``` #### By substring value ```{r ccdR substring value chemical, message=FALSE, eval=FALSE} res_dt <- chemical_contains(word = 'DTXSID702018') ``` #### Subset for MS-Ready Structures MS-Ready data can be retrieved using a variety of input information. Examples for each are provided below: ##### Mass Range ```{r ccdR mass range ms ready chemical, message=FALSE, eval=FALSE} res_dt <- get_msready_by_mass(start = 200.9, end = 200.95) ``` ##### Chemical Formula ```{r ccdR chemical formula ms ready chemical, message=FALSE, eval=FALSE} res_dt <- get_msready_by_formula(formula = 'C16H24N2O5S') ``` ##### DTXCID ```{r ccdR dtxcid ms ready chemical, message=FALSE, eval=FALSE} res_dt <- get_msready_by_dtxcid(DTXCID = 'DTXCID30182') ``` ### List Resource There are several lists of chemicals one can access. These can be filtered by the type, name, inclusion of a specific chemical, or name of list. #### All lists by type ```{r ccdR all list types chemical, message=FALSE, eval=FALSE} res_dt <- get_chemical_lists_by_type(type = 'federal') ``` #### List by name ```{r ccdR list by name chemical, message=FALSE, eval=FALSE} res_dt <- get_public_chemical_list_by_name(listname = 'CCL4') ``` #### Lists containing a specific chemical `get_lists_containing_chemical()` retrieves a list of names of chemical lists, each of which contains the specified chemical. ```{r ccdR lists containing chemical, message=FALSE, eval=FALSE} res_dt <- get_lists_containing_chemical(DTXSID = 'DTXSID7020182') ``` #### Chemicals in a specific list `get_chemicals_in_list()` retrieves the specific chemical information for each chemical contained in the specified list. ```{r ccdR chemical in list chemical, message=FALSE, eval=FALSE} res_dt <- get_chemicals_in_list(list_name = 'CCL4') ``` ### Chemical File Resource There a mrv, mol, and image files that can be accessed using either the DTXSID or DTXCID. Examples are provided below: #### Get mrv by DTXSID or DTXCID `get_chemical_mrv()` retrieves mrv file information for a chemical specified either by DTXSID or DTXCID. ```{r ccdR mrv by dtxsid dtxcid chemical, message=FALSE, eval=FALSE} res_dt <- get_chemical_mrv(DTXSID = 'DTXSID7020182') res_dt <- get_chemical_mrv(DTXCID = 'DTXCID30182') ``` #### Get mol by DTXSID or DTXCID `get_chemical_mol()` retrieves mol file information for a chemical specified either by DTXSID or DTXCID. ```{r ccdR mol by dtxsid dtxcid chemical, message=FALSE, eval=FALSE} res_dt <- get_chemical_mol(DTXSID = 'DTXSID7020182') res_dt <- get_chemical_mol(DTXCID = 'DTXCID30182') ``` #### Get structure image by DTXSID or DTXCID `get_chemical_image()` retrieves image file information for a chemical specified either by DTXSID or DTXCID. ```{r ccdR image by dtxsid dtxcid chemical, message=FALSE, eval=FALSE} res_dt <- get_chemical_image(DTXSID = 'DTXSID7020182') res_dt <- get_chemical_image(DTXCID = 'DTXCID30182') ``` ### Chemical Synonym Resource `get_chemical_synonym()` retrieves synonyms for the specified chemical. ```{r ccdR synonym by dtxsid chemical, message=FALSE, eval=FALSE} res_dt <- get_chemical_synonym(DTXSID = 'DTXSID7020182') ``` ## Example Use Case: Comparing Physico-chemical Properties Across Chemical Lists The fourth Drinking Water Contaminant Candidate List (CCL4) is a set of chemicals that "...are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems...." Moreover, this list "...was announced on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants...." The National-Scale Air Toxics Assessments (NATA) is "... EPA's ongoing comprehensive evaluation of air toxics in the United States... a state-of-the-science screening tool for State/Local/Tribal agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks... use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them...." These lists can be found in the CCD at [CCL4](https://comptox.epa.gov/dashboard/chemical-lists/CCL4) with additional information at [CCL4 information](https://www.epa.gov/ccl/contaminant-candidate-list-4-ccl-4-0) and [NATADB](https://comptox.epa.gov/dashboard/chemical-lists/NATADB) with additional information at [NATA information](https://www.epa.gov/national-air-toxics-assessment). The quotes from the previous paragraph were excerpted from list detail descriptions found using the CCD links. In this example use case, physico-chemical Properties data will be compared between a water contaminant priority and an air toxics list. ### Obtain Lists of Chemicals First, confirm the chemical list to query. ```{r} options(width = 100) ccl4_information <- get_public_chemical_list_by_name('CCL4') print(ccl4_information, trunc.cols = TRUE) natadb_information <- get_public_chemical_list_by_name('NATADB') print(natadb_information, trunc.cols = TRUE) ``` Next, retrieve the list of chemicals associated with each list. ```{r} ccl4 <- get_chemicals_in_list('ccl4') ccl4 <- data.table::as.data.table(ccl4) natadb <- get_chemicals_in_list('NATADB') natadb <- data.table::as.data.table(natadb) ``` We examine the dimensions of the data, the column names, and display a single row for illustrative purposes. ```{r, eval=FALSE} dim(ccl4) dim(natadb) colnames(ccl4) head(ccl4, 1) ``` ### Access Physico-chemical Property Data for Chemical Lists Next, physico-chemical properties for all chemicals in each list can be retrieved. The function `get_chem_info()` will be used to batch search for a list of DTXSIDs. ```{r} ccl4_phys_chem <- get_chem_info_batch(ccl4$dtxsid) natadb_phys_chem <- get_chem_info_batch(natadb$dtxsid) ``` Observe that this returns a single data.table for each query, and the data.table contains the physico-chemical properties available from the CompTox Chemicals Dashboard for each chemical in the query. Note, a warning message was triggered, `Warning: Setting type to ''!`, which indicates the the parameter `type` was not given a value. A default value is set within the function and more information can be found in the associated documentation. We examine the set of physico-chemical properties for the first chemical in CCL4. Before any deeper analysis, consider the dimensions of the data and the column names. ```{r, eval=FALSE} dim(ccl4_phys_chem) colnames(ccl4_phys_chem) ``` Next, we display the unique values for the columns `propertyID` and `propType`. ```{r} ccl4_phys_chem[, unique(propertyId)] ccl4_phys_chem[, unique(propType)] ``` Let's explore this further by examining the mean of the "boiling-point" and "melting-point" data. ```{r} ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value))] ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value)), by = .(propType)] ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value))] ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value)), by = .(propType)] ``` These results tell us about some of the reported physico-chemical properties of the data sets. The mean "boiling-point" is 252.6593 degrees Celsius for CCL4, with mean values of 250.5943 and 253.8196 for experimental and predicted, respectively. The mean "melting-point" is 34.91613 degrees Celsius for CCL4, with mean values of 23.18876 and 49.99417 for experimental and predicted, respectively. To explore **all** the values of the physico-chemical properties and calculate their means, we can do the following procedure. First we look at all the physico-chemical properties individually, then group them by each property ("boiling-point", "melting-point", etc...), and then additionally group those by property type ("experimental" vs "predicted"). In the grouping, we look at the columns `value`, `unit`, `propertyID` and `propType`. We also demonstrate how take the mean of the values for each grouping. ```{r fig.align='center',class.source="scroll-300",message=FALSE} head(ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], ]) ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], .(propType, value, unit), by = .(propertyId)] ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], .(value, unit), by = .(propertyId, propType)] ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], .(Mean_value = sapply(.SD, mean)), by = .(propertyId, unit), .SDcols = c("value")] ccl4_phys_chem[dtxsid == ccl4$dtxsid[[1]], .(Mean_value = sapply(.SD, mean)), by = .(propertyId, unit, propType), .SDcols = c("value")][order(propertyId)] ``` ### Review Physico-Chemical Properties Across Chemical Lists We consider exploring the differences in mean predicted and experimental values for a variety of physico-chemical properties in an effort to understand better the CCL4 and NATADB lists. In particular, we examine "vapor-pressure", "henrys-law", and "boiling-point" and plot the means by chemical for these using boxplots. We then compare the values by grouping by both data set and `propType` value. #### Vapor Pressure Begin by grouping pulled data by DTXSID, and also by DTXSID and property type. ```{r fig.align='center',class.source="scroll-300",message=FALSE} ccl4_vapor_all <- ccl4_phys_chem[propertyId %in% 'vapor-pressure', .(mean_vapor_pressure = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid)] natadb_vapor_all <- natadb_phys_chem[propertyId %in% 'vapor-pressure', .(mean_vapor_pressure = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid)] ccl4_vapor_grouped <- ccl4_phys_chem[propertyId %in% 'vapor-pressure', .(mean_vapor_pressure = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid, propType)] natadb_vapor_grouped <- natadb_phys_chem[propertyId %in% 'vapor-pressure', .(mean_vapor_pressure = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid, propType)] ``` Examine summary statistics. ```{r fig.align='center',class.source="scroll-300",message=FALSE} summary(ccl4_vapor_all) summary(ccl4_vapor_grouped) summary(natadb_vapor_all) summary(natadb_vapor_grouped) ``` With such a large range of values covering several orders of magnitude, log transform the data. This data from both chemical lists can also be plotted individually and by property type. ```{r fig.align='center',class.source="scroll-300",message=FALSE} ccl4_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)] ccl4_vapor_grouped[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)] natadb_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)] natadb_vapor_grouped[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)] ``` ```{r, fig.align='center', echo=FALSE, eval=FALSE} ggplot(ccl4_vapor_all, aes(log_transform_mean_vapor_pressure)) + geom_boxplot() + coord_flip() ggplot(ccl4_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) + geom_boxplot() ``` ```{r, fig.align='center', echo=FALSE, eval=FALSE} ggplot(natadb_vapor_all, aes(log_transform_mean_vapor_pressure)) + geom_boxplot() + coord_flip() ggplot(natadb_vapor_grouped, aes(propType, log_transform_mean_vapor_pressure)) + geom_boxplot() ``` Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`. ```{r fig.align='center',class.source="scroll-300",message=FALSE} ccl4_vapor_grouped[, set := 'CCL4'] natadb_vapor_grouped[, set := 'NATADB'] all_vapor_grouped <- rbind(ccl4_vapor_grouped, natadb_vapor_grouped) vapor_box <- ggplot(all_vapor_grouped, aes(set, log_transform_mean_vapor_pressure)) + geom_boxplot(aes(color = propType)) vapor <- ggplot(all_vapor_grouped, aes(log_transform_mean_vapor_pressure)) + geom_boxplot((aes(color = set))) + coord_flip() ``` Plot the combined data. Boxplots are colored based on the property type, with mean log transformed vapor pressure plotted for each chemical list and property type, or by chemical list alone. ```{r fig.align='center', class.source="scroll-200", echo=FALSE} gridExtra::grid.arrange(vapor_box, vapor, ncol=2) ``` In the box plots above, a general trend indicates that that the NATADB chemical list has a higher mean vapor pressure than the CCL4 chemical list. #### Henry's Law constant Henry's Law constant can be explored in a similar fashion. Begin by grouping data by DTXSID, and also by DTXSID and property type. ```{r fig.align='center',class.source="scroll-300",message=FALSE} ccl4_hlc_all <- ccl4_phys_chem[propertyId %in% 'henrys-law', .(mean_hlc = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid)] natadb_hlc_all <- natadb_phys_chem[propertyId %in% 'henrys-law', .(mean_hlc = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid)] ccl4_hlc_grouped <- ccl4_phys_chem[propertyId %in% 'henrys-law', .(mean_hlc = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid, propType)] natadb_hlc_grouped <- natadb_phys_chem[propertyId %in% 'henrys-law', .(mean_hlc = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid, propType)] ``` Examine summary statistics. ```{r fig.align='center',class.source="scroll-300",message=FALSE} summary(ccl4_hlc_all) summary(ccl4_hlc_grouped) summary(natadb_hlc_all) summary(natadb_hlc_grouped) ``` Again, log transform the data as it is positive and covers several orders of magnitude. ```{r fig.align='center',class.source="scroll-300",message=FALSE} ccl4_hlc_all[, log_transform_mean_hlc := log(mean_hlc)] ccl4_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)] natadb_hlc_all[, log_transform_mean_hlc := log(mean_hlc)] natadb_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)] ``` Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`. ```{r fig.align='center',class.source="scroll-300",message=FALSE} ccl4_hlc_grouped[, set := 'CCL4'] natadb_hlc_grouped[, set := 'NATADB'] all_hlc_grouped <- rbind(ccl4_hlc_grouped, natadb_hlc_grouped) hlc_box <- ggplot(all_hlc_grouped, aes(set, log_transform_mean_hlc)) + geom_boxplot(aes(color = propType)) hlc <- ggplot(all_hlc_grouped, aes(log_transform_mean_hlc)) + geom_boxplot(aes(color = set)) + coord_flip() ``` ```{r fig.align='center',class.source="scroll-200", echo=FALSE} gridExtra::grid.arrange(hlc_box, hlc, ncol=2) ``` Again, in both grouping by `propType` and aggregating all results together by chemical list, NATADB chemicals generally higher mean Henry's Law Constant value than CCL4 chemicals. #### Boling Point Boiling Point data be explored. Begin by grouping data by DTXSID, and also by DTXSID and property type. ```{r fig.align='center',class.source="scroll-300",message=FALSE} ccl4_boiling_all <- ccl4_phys_chem[propertyId %in% 'boiling-point', .(mean_boiling_point = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid)] natadb_boiling_all <- natadb_phys_chem[propertyId %in% 'boiling-point', .(mean_boiling_point = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid)] ccl4_boiling_grouped <- ccl4_phys_chem[propertyId %in% 'boiling-point', .(mean_boiling_point = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid, propType)] natadb_boiling_grouped <- natadb_phys_chem[propertyId %in% 'boiling-point', .(mean_boiling_point = sapply(.SD, mean)), .SDcols = c('value'), by = .(dtxsid, propType)] ``` Calculate summary statistics. ```{r fig.align='center',class.source="scroll-300",message=FALSE} summary(ccl4_boiling_all) summary(ccl4_boiling_grouped) summary(natadb_boiling_all) summary(natadb_boiling_grouped) ``` Since some of the boiling point values have negative values, log transformation of these values will result in warnings as NaNs are produced. Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function `rbind()`. ```{r fig.align='center',class.source="scroll-300",message=FALSE} ccl4_boiling_grouped[, set := 'CCL4'] natadb_boiling_grouped[, set := 'NATADB'] all_boiling_grouped <- rbind(ccl4_boiling_grouped, natadb_boiling_grouped) boiling_box <- ggplot(all_boiling_grouped, aes(set, mean_boiling_point)) + geom_boxplot(aes(color = propType)) boiling <- ggplot(all_boiling_grouped, aes(mean_boiling_point)) + geom_boxplot(aes(color = set)) + coord_flip() ``` ```{r fig.align='center',class.source="scroll-200", echo=FALSE} gridExtra::grid.arrange(boiling_box, boiling, ncol=2) ``` A visual inspection of this set of graphs is not as clear as in the previous cases. Note that the predicted values for each data set tend to be higher than the experimental. The mean of CCL4, by predicted and experimental appears to be greater than the corresponding means for NATADB, as does the overall mean, but the interquartile ranges of these different groupings yield slightly different results. This gives us a sense that the picture for boiling point is not as clear cut between experimental and predicted for these two chemical lists as it was in the previous physico-chemical properties investigated. To summarize the observations, across the various physico-chemical properties for chemicals in these chemical lists, there are indeed differences between the mean values of various physico-chemical properties when grouped by predicted or experimental. - For "vapor-pressure", the means of predicted values tend to be a little lower than experimental, though they are much closer in the case of NATADB than CCL4. - The trend of lower predicted means compared to experimental means is more clearly demonstrated for "henrys-law" values in both data sets. - In the case of "boiling-point", the predicted values are greater than the experimental values, though this is much more pronounced in CCL4 while the set of means for NATADB are again fairly close. ## Conclusion In this vignette, a variety of functions that access different types of data found in the `Chemical` endpoints of the CTX APIs were explored. While this exploration was not exhaustive, it provides a basic introduction to how one may access data and work with it. Additional endpoints and corresponding functions exist and we encourage the user to explore these while keeping in mind the examples contained in this vignette. ```{r breakdown, echo = FALSE, results = 'hide'} # This chunk will be hidden in the final product. It serves to undo defining the # custom print function to prevent unexpected behavior after this module during # the final knitting process and restores original option values. knit_print.data.table = knitr::normal_print registerS3method( "knit_print", "data.table", knit_print.data.table, envir = asNamespace("knitr") ) options(old_options) ``` ```{r, include=FALSE} end_vignette() ```