ctxR: Chemical API

Introduction

In this vignette, CTX Chemical API will be explored.

The foundation of toxicology, toxicokinetics, and exposure is embedded in the physics and chemistry of chemical-biological interactions. The accurate characterization of chemical structure linked to commonly used identifiers, such as names and Chemical Abstracts Service Registry Numbers (CASRNs), is essential to support both predictive modeling of the data as well as dissemination and application of the data for chemical safety decisions.

With cheminformatics as the backbone for research efforts, sources of available data through the CTX Chemical API include:

More information on Chemicals and Chemistry Data can be found here: https://www.epa.gov/comptox-tools/downloadable-computational-toxicology-data#SCD.

NOTE: Please see the introductory vignette for an overview of the ctxR package and initial set up instruction with API key storage.

Several ctxR functions can be used to access the CTX Chemical API data, as described in the following sections.Tables output in each example have been filtered to only display the first few rows of data.

Chemical Details Resource

Get chemical data

get_chemical_details() retrieves chemical detail data either using the chemical identifier DTXSID or DTXCID. Alternate parameter “projection” determines the type of data returned. Examples for each are provided below:

By DTXSID

chemical_details_by_dtxsid <- get_chemical_details(DTXSID = 'DTXSID7020182')

By DTXCID

chemical_details_by_dtxcid <- get_chemical_details(DTXCID = 'DTXCID30182')

Chemical Property Resource

get_chemical_by_property_range() retrieves data for chemicals that have a specified property within the input range.

chemical_by_property_range <- get_chemical_by_property_range(start = 1.311, 
                                         end = 1.313, 
                                         property = 'Density')

get_chem_info() retrieves specific chemical information for an input chemical. This includes both experimental and predicted values by default, but providing “experimental” or “predicted” to the type parameter will return the specific associated information.

chemical_info <- get_chem_info(DTXSID = 'DTXSID7020182')

Chemical Fate Resource

get_fate_by_dtxsid() retrieves chemical fate data.

fate_by_dtxsid <- get_fate_by_dtxsid(DTXSID = 'DTXSID7020182')

Chemical Search Resource

Chemicals can be searched using string values. Examples for each are provided by the following:

By starting value

search_starts_with <- chemical_starts_with(word = 'DTXSID70201')

By exact value

search_exact <- chemical_equal(word = 'DTXSID7020182')

By substring value

search_contains <- chemical_contains(word = 'DTXSID702018')

Subset for MS-Ready Structures

MS-Ready (McEachran, A. et al. 2018) data can be retrieved using a variety of input information. Examples for each are provided below:

By Mass Range

msready_by_mass <- get_msready_by_mass(start = 200.9, 
                              end = 200.95)

By Chemical Formula

msready_by_formula <- get_msready_by_formula(formula = 'C16H24N2O5S')

By DTXCID

msready_by_dtxcid <- get_msready_by_dtxcid(DTXCID = 'DTXCID30182')

List Resource

There are several lists of chemicals one can access using the (CCD list search). These can be filtered by the type, name, inclusion of a specific chemical, or name of list.

Get all list types

get_all_list_types()

All lists by type

chemical_lists_by_type <- get_chemical_lists_by_type(type =  'federal')

List by name

public_chemical_list_by_name <- get_public_chemical_list_by_name(listname = 'CCL4')

Lists containing a specific chemical

get_lists_containing_chemical() retrieves a list of names of chemical lists, each of which contains the specified chemical.

lists_containing_chemical <- get_lists_containing_chemical(DTXSID = 'DTXSID7020182')

DTXSIDs for chemical list and starting value

get_chemicals_in_list_start() retrieves a list of DTXSIDs for a given starting character string in a specified list of chemicals.

chemicals_in_ccl4_start <- get_chemicals_in_list_start(list_name = 'CCL4', word = 'Bi')

DTXSIDs for chemical list and exact value

get_chemicals_in_list_exact() retrieves a list of DTXSIDs matching exactly a given character string in a specified list of chemicals.

chemicals_in_ccl4_exact <- get_chemicals_in_list_exact(list_name = 'BIOSOLIDS2021', word = 'Bisphenol A')

DTXSIDs for chemical list and containing value

get_chemicals_in_list_contain() retrieves a list of DTXSIDs that contain a given character string in a specified list of chemicals.

chemicals_in_ccl4_contain <- get_chemicals_in_list_contain(list_name = 'CCL4', word = 'Bis')

Chemicals in a specific list

get_chemicals_in_list() retrieves the specific chemical information for each chemical contained in the specified list.

chemicals_in_list <- get_chemicals_in_list(list_name = 'CCL4')

Chemical File Resource

There are mrv, mol, and image files that can be accessed using either the DTXSID or DTXCID. Examples are provided below:

Get mrv by DTXSID or DTXCID

get_chemical_mrv() retrieves mrv file information for a chemical specified either by DTXSID or DTXCID.

chemical_mrv_by_dtxsid <- get_chemical_mrv(DTXSID = 'DTXSID7020182')
chemical_mrv_by_dtxcid <- get_chemical_mrv(DTXCID = 'DTXCID30182')

Get mol by DTXSID or DTXCID

get_chemical_mol() retrieves mol file information for a chemical specified either by DTXSID or DTXCID.

chemical_mol_by_dtxsid <- get_chemical_mol(DTXSID = 'DTXSID7020182')
chemical_mol_by_dtxcid <- get_chemical_mol(DTXCID = 'DTXCID30182')

Get structure image by DTXSID, DTXCID, or SMILES

get_chemical_image() retrieves image file information for a chemical specified either by DTXSID or DTXCID. To visualize the returned array of image information, the user may use either the png::writePNG() or countcolors::plotArrayAsImage() functions, among many choices.

chemical_image_by_dtxsid <- get_chemical_image(DTXSID = 'DTXSID7020182')
chemical_image_by_dtxcid <- get_chemical_image(DTXCID = 'DTXCID30182')
chemical_image_by_smiles <- get_chemical_image(SMILES = 'CC(C)(C1=CC=C(O)C=C1)C1=CC=C(O)C=C1')

countcolors::plotArrayAsImage(chemical_image_by_dtxsid)
countcolors::plotArrayAsImage(chemical_image_by_dtxcid)
countcolors::plotArrayAsImage(chemical_image_by_smiles)

Chemical Synonym Resource

get_chemical_synonym() retrieves synonyms for the specified chemical.

chemical_synonym <- get_chemical_synonym(DTXSID = 'DTXSID7020182')

Example Use Case: Comparing Physico-chemical Properties Across Chemical Lists

The fourth Drinking Water Contaminant Candidate List (CCL4) is a set of chemicals that “…are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems….” Moreover, this list “…was announced on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants….” The National-Scale Air Toxics Assessments (NATA) is “… EPA’s ongoing comprehensive evaluation of air toxics in the United States… a state-of-the-science screening tool for State/Local/Tribal agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks… use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them….”

These lists can be found in the CCD at CCL4 with additional information at CCL4 information and NATADB with additional information at NATA information. The quotes from the previous paragraph were excerpted from list detail descriptions found using the CCD links.

In this example use case, physico-chemical Properties data will be compared between a water contaminant priority and an air toxics list. Note, the following code chunks use the data.table object, which is an extension of the data.frame object and has slightly different syntax. For more information, please refer to data.table

Obtain Lists of Chemicals

First, confirm the chemical list to query.

options(width = 100)
ccl4_information <- get_public_chemical_list_by_name('CCL4')
print(ccl4_information, trunc.cols = TRUE)
#>    id    type                                    label visibility
#> 1 443 federal WATER|EPA: Chemical Contaminants - CCL 4     PUBLIC
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             longDescription
#> 1 The Contaminant Candidate List (CCL) is a list of contaminants that, at the time of publication, are not subject to any proposed or promulgated national primary drinking water regulations, but are known or anticipated to occur in public water systems. Contaminants listed on the CCL may require future regulation under the Safe Drinking Water Act (SDWA). EPA announced the <a href='https://www.epa.gov/ccl/contaminant-candidate-list-4-ccl-4-0' target='_blank'>fourth Drinking Water Contaminant Candidate List (CCL 4)</a> on November 17, 2016. The CCL 4 includes 97 chemicals or chemical groups and 12 microbial contaminants. The group of cyanotoxins on CCL 4 includes, but is not limited to: anatoxin-a, cylindrospermopsin, microcystins, and saxitoxin. The CCL Chemical Candidate Lists are versioned iteratively and this description navigates between the various versions of the lists. The list of substances displayed below represents only the chemical CCL 4 contaminants. For the versioned lists, please use the hyperlinked lists below.<br/><br/> \r\n\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL5' target='_blank'>CCL5 - November 2022</a> <br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL4' target='_blank'>CCL4 - November 2016</a> \r\n This list<br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL3' target='_blank'>CCL3 - October 2009</a> <br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL2' target='_blank'>CCL2 - February 2005</a><br/><br/>\r\n<a href='https://comptox.epa.gov/dashboard/chemical_lists/CCL1' target='_blank'>CCL1 - March 1998</a><br/><br/> 
#>              updatedAt listName chemicalCount            createdAt
#> 1 2022-10-26T21:14:27Z     CCL4           100 2017-12-28T17:58:36Z
#>                                                                                                                                              shortDescription
#> 1 The Contaminant Candidate List (CCL) is a list of contaminants that are known or anticipated to occur in public water systems. Version 4 is known as CCL 4.

natadb_information <- get_public_chemical_list_by_name('NATADB')
print(natadb_information, trunc.cols = TRUE)
#>    id    type                                            label visibility
#> 1 454 federal EPA: National-Scale Air Toxics Assessment (NATA)     PUBLIC
#>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              longDescription
#> 1 The National-Scale Air Toxics Assessment (NATA) is EPA's ongoing comprehensive evaluation of air toxics in the United States. EPA developed the NATA as a state-of-the-science screening tool for State/Local/Tribal Agencies to prioritize pollutants, emission sources and locations of interest for further study in order to gain a better understanding of risks.  NATA assessments do not incorporate refined information about emission sources but, rather, use general information about sources to develop estimates of risks which are more likely to overestimate impacts than underestimate them.\r\n\r\nNATA provides estimates of the risk of cancer and other serious health effects from breathing (inhaling) air toxics in order to inform both national and more localized efforts to identify and prioritize air toxics, emission source types and locations which are of greatest potential concern in terms of contributing to population risk.  This in turn helps air pollution experts focus limited analytical resources on areas and or populations where the potential for health risks are highest.  Assessments include estimates of cancer and non-cancer health effects based on chronic exposure from outdoor sources, including assessments of non-cancer health effects for Diesel Particulate Matter (PM). Assessments provide a snapshot of the outdoor air quality and the risks to human health that would result if air toxic emissions levels remained unchanged.
#>              updatedAt listName chemicalCount            createdAt
#> 1 2018-11-16T21:42:01Z   NATADB           163 2018-02-21T12:04:16Z
#>                                                                                                                shortDescription
#> 1 The National-Scale Air Toxics Assessment (NATA) is EPA's ongoing comprehensive evaluation of air toxics in the United States.

Next, retrieve the list of chemicals associated with each list.

ccl4 <- get_chemicals_in_list('ccl4')
ccl4 <- data.table::as.data.table(ccl4)

natadb <- get_chemicals_in_list('NATADB')
natadb <- data.table::as.data.table(natadb)

We examine the dimensions of the data, the column names, and display a single row for illustrative purposes.

dim(ccl4)
dim(natadb)
colnames(ccl4)
head(ccl4, 1)

Batch Search Physico-chemical Property Data by DTXSID for Chemical Lists

Next, physico-chemical properties for all chemicals in each list can be retrieved. The function get_chem_info() will be used to batch search for a list of DTXSIDs.

ccl4_phys_chem <- get_chem_info_batch(ccl4$dtxsid)
#> Warning in get_chem_info_batch(ccl4$dtxsid): Setting type to ''!
natadb_phys_chem <- get_chem_info_batch(natadb$dtxsid)
#> Warning in get_chem_info_batch(natadb$dtxsid): Setting type to ''!

Observe that this returns a single data.table for each query, and the data.table contains the physico-chemical properties available from the CompTox Chemicals Dashboard for each chemical in the query. Note, a warning message was triggered, Warning: Setting type to ''!, which indicates the the parameter type was not given a value. A default value is set within the function and more information can be found in the associated documentation. We examine the set of physico-chemical properties for the first chemical in CCL4.

Before any deeper analysis, consider the dimensions of the data and the column names.

dim(ccl4_phys_chem)
colnames(ccl4_phys_chem)

Next, we display the unique values for the columns propertyID and propType.

ccl4_phys_chem[, unique(propertyId)]
#>  [1] "boiling-point"        "logkow-octanol-water" "melting-point"        "vapor-pressure"      
#>  [5] "water-solubility"     "density"              "flash-point"          "henrys-law"          
#>  [9] "index-of-refraction"  "logkoa-octanol-air"   "molar-refractivity"   "molar-volume"        
#> [13] "polarizability"       "surface-tension"      "thermal-conductivity" "viscosity"           
#> [17] "pka-acidic-apparent"  "pka-basic-apparent"
ccl4_phys_chem[, unique(propType)]
#> [1] "experimental" "predicted"

Let’s explore this further by examining the mean of the “boiling-point” and “melting-point” data.

ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value))]
#>        Mean
#>       <num>
#> 1: 252.6593
ccl4_phys_chem[propertyId == 'boiling-point', .(Mean = mean(value)),
               by = .(propType)]
#>        propType     Mean
#>          <char>    <num>
#> 1: experimental 250.5943
#> 2:    predicted 253.8196

ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value))]
#>        Mean
#>       <num>
#> 1: 34.91613
ccl4_phys_chem[propertyId == 'melting-point', .(Mean = mean(value)),
               by = .(propType)]
#>        propType     Mean
#>          <char>    <num>
#> 1: experimental 23.18876
#> 2:    predicted 49.99417

These results tell us about some of the reported physico-chemical properties of the data sets.

The mean “boiling-point” is 252.6593 degrees Celsius for CCL4, with mean values of 250.5943 and 253.8196 for experimental and predicted, respectively. The mean “melting-point” is 34.91613 degrees Celsius for CCL4, with mean values of 23.18876 and 49.99417 for experimental and predicted, respectively.

To explore all the values of the physico-chemical properties and calculate their means, we can do the following procedure. First we look at all the physico-chemical properties individually, then group them by each property (“boiling-point”, “melting-point”, etc…), and then additionally group those by property type (“experimental” vs “predicted”). In the grouping, we look at the columns value, unit, propertyID and propType. We also demonstrate how take the mean of the values for each grouping, using the chemical identifier ‘DTXSID1037567’ for this example, the 25th chemical in CCL4.

head(ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], ])
#>             name       value       id   source               description  propType        unit
#>           <char>       <num>    <int>   <char>                    <char>    <char>      <char>
#> 1: Boiling Point 4.85890e+02 58771702 EPISUITE <a href='https://www.epa. predicted          °C
#> 2: Boiling Point 3.14450e+02 56038082    OPERA OPERA is a free and open  predicted          °C
#> 3: Boiling Point 4.04135e+02 55985095     TEST The <a href='https://www. predicted          °C
#> 4:       Density 1.24300e+00 48458607 ACD/Labs ACD/Labs physicochemical  predicted      g/cm^3
#> 5:       Density 1.32900e+00 55296157     TEST The <a href='https://www. predicted      g/cm^3
#> 6:   Henry's Law 4.33964e-10 45640026    OPERA OPERA is a free and open  predicted atm-m3/mole
#> 3 variable(s) not shown: [propertyId <char>, dtxsid <char>, dtxcid <char>]
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(propType, value, unit),
               by = .(propertyId)]
#>               propertyId  propType       value        unit
#>                   <char>    <char>       <num>      <char>
#>  1:        boiling-point predicted 4.85890e+02          °C
#>  2:        boiling-point predicted 3.14450e+02          °C
#>  3:        boiling-point predicted 4.04135e+02          °C
#>  4:              density predicted 1.24300e+00      g/cm^3
#>  5:              density predicted 1.32900e+00      g/cm^3
#>  6:           henrys-law predicted 4.33964e-10 atm-m3/mole
#>  7:  index-of-refraction predicted 1.55100e+00        <NA>
#>  8:   logkoa-octanol-air predicted 1.06255e+01        <NA>
#>  9: logkow-octanol-water predicted 7.08000e-01        <NA>
#> 10: logkow-octanol-water predicted 1.68000e-01        <NA>
#> 11: logkow-octanol-water predicted 1.86625e+00        <NA>
#> 12: logkow-octanol-water predicted 1.69000e+00        <NA>
#> 13:        melting-point predicted 2.06010e+02          °C
#> 14:        melting-point predicted 1.55804e+02          °C
#> 15:        melting-point predicted 1.89090e+02          °C
#> 16:   molar-refractivity predicted 8.44810e+01        cm^3
#> 17:         molar-volume predicted 2.64941e+02        cm^3
#> 18:       polarizability predicted 3.34910e+01         Å^3
#> 19:      surface-tension predicted 5.00550e+01      dyn/cm
#> 20:       vapor-pressure predicted 4.63364e-08        mmHg
#> 21:     water-solubility predicted 7.98701e-07       mol/L
#> 22:     water-solubility predicted 1.90108e-03       mol/L
#> 23:     water-solubility predicted 6.81847e-03       mol/L
#> 24:     water-solubility predicted 2.08200e+00       mol/L
#>               propertyId  propType       value        unit
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(value, unit), 
               by = .(propertyId, propType)]
#>               propertyId  propType       value        unit
#>                   <char>    <char>       <num>      <char>
#>  1:        boiling-point predicted 4.85890e+02          °C
#>  2:        boiling-point predicted 3.14450e+02          °C
#>  3:        boiling-point predicted 4.04135e+02          °C
#>  4:              density predicted 1.24300e+00      g/cm^3
#>  5:              density predicted 1.32900e+00      g/cm^3
#>  6:           henrys-law predicted 4.33964e-10 atm-m3/mole
#>  7:  index-of-refraction predicted 1.55100e+00        <NA>
#>  8:   logkoa-octanol-air predicted 1.06255e+01        <NA>
#>  9: logkow-octanol-water predicted 7.08000e-01        <NA>
#> 10: logkow-octanol-water predicted 1.68000e-01        <NA>
#> 11: logkow-octanol-water predicted 1.86625e+00        <NA>
#> 12: logkow-octanol-water predicted 1.69000e+00        <NA>
#> 13:        melting-point predicted 2.06010e+02          °C
#> 14:        melting-point predicted 1.55804e+02          °C
#> 15:        melting-point predicted 1.89090e+02          °C
#> 16:   molar-refractivity predicted 8.44810e+01        cm^3
#> 17:         molar-volume predicted 2.64941e+02        cm^3
#> 18:       polarizability predicted 3.34910e+01         Å^3
#> 19:      surface-tension predicted 5.00550e+01      dyn/cm
#> 20:       vapor-pressure predicted 4.63364e-08        mmHg
#> 21:     water-solubility predicted 7.98701e-07       mol/L
#> 22:     water-solubility predicted 1.90108e-03       mol/L
#> 23:     water-solubility predicted 6.81847e-03       mol/L
#> 24:     water-solubility predicted 2.08200e+00       mol/L
#>               propertyId  propType       value        unit

ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(Mean_value = sapply(.SD, mean)),
               by = .(propertyId, unit), .SDcols = c("value")]
#>               propertyId        unit   Mean_value
#>                   <char>      <char>        <num>
#>  1:        boiling-point          °C 4.014917e+02
#>  2:              density      g/cm^3 1.286000e+00
#>  3:           henrys-law atm-m3/mole 4.339640e-10
#>  4:  index-of-refraction        <NA> 1.551000e+00
#>  5:   logkoa-octanol-air        <NA> 1.062550e+01
#>  6: logkow-octanol-water        <NA> 1.108062e+00
#>  7:        melting-point          °C 1.836347e+02
#>  8:   molar-refractivity        cm^3 8.448100e+01
#>  9:         molar-volume        cm^3 2.649410e+02
#> 10:       polarizability         Å^3 3.349100e+01
#> 11:      surface-tension      dyn/cm 5.005500e+01
#> 12:       vapor-pressure        mmHg 4.633640e-08
#> 13:     water-solubility       mol/L 5.226801e-01
ccl4_phys_chem[dtxsid == ccl4$dtxsid[[25]], .(Mean_value = sapply(.SD, mean)), 
               by = .(propertyId, unit, propType), 
               .SDcols = c("value")][order(propertyId)]
#>               propertyId        unit  propType   Mean_value
#>                   <char>      <char>    <char>        <num>
#>  1:        boiling-point          °C predicted 4.014917e+02
#>  2:              density      g/cm^3 predicted 1.286000e+00
#>  3:           henrys-law atm-m3/mole predicted 4.339640e-10
#>  4:  index-of-refraction        <NA> predicted 1.551000e+00
#>  5:   logkoa-octanol-air        <NA> predicted 1.062550e+01
#>  6: logkow-octanol-water        <NA> predicted 1.108062e+00
#>  7:        melting-point          °C predicted 1.836347e+02
#>  8:   molar-refractivity        cm^3 predicted 8.448100e+01
#>  9:         molar-volume        cm^3 predicted 2.649410e+02
#> 10:       polarizability         Å^3 predicted 3.349100e+01
#> 11:      surface-tension      dyn/cm predicted 5.005500e+01
#> 12:       vapor-pressure        mmHg predicted 4.633640e-08
#> 13:     water-solubility       mol/L predicted 5.226801e-01

Review Physico-Chemical Properties Across Chemical Lists

We consider exploring the differences in mean predicted and experimental values for a variety of physico-chemical properties in an effort to understand better the CCL4 and NATADB lists. In particular, we examine “vapor-pressure”, “henrys-law”, and “boiling-point” and plot the means by chemical for these using boxplots. We then compare the values by grouping by both data set and propType value.

Vapor Pressure

Begin by grouping pulled data by DTXSID, and also by DTXSID and property type.

ccl4_vapor_all <- ccl4_phys_chem[propertyId %in% 'vapor-pressure', 
                                 .(mean_vapor_pressure = sapply(.SD, mean)), 
                                 .SDcols = c('value'), by = .(dtxsid)]
natadb_vapor_all <- natadb_phys_chem[propertyId %in% 'vapor-pressure', 
                                     .(mean_vapor_pressure = sapply(.SD, mean)),
                                     .SDcols = c('value'), by = .(dtxsid)]
ccl4_vapor_grouped <- ccl4_phys_chem[propertyId %in% 'vapor-pressure', 
                                     .(mean_vapor_pressure = sapply(.SD, mean)),
                                     .SDcols = c('value'), 
                                     by = .(dtxsid, propType)]
natadb_vapor_grouped <- natadb_phys_chem[propertyId %in% 'vapor-pressure', 
                                         .(mean_vapor_pressure = 
                                             sapply(.SD, mean)), 
                                         .SDcols = c('value'), 
                                         by = .(dtxsid, propType)]

Examine summary statistics.

summary(ccl4_vapor_all)
#>     dtxsid          mean_vapor_pressure
#>  Length:83          Min.   :   0.000   
#>  Class :character   1st Qu.:   0.000   
#>  Mode  :character   Median :   0.002   
#>                     Mean   : 224.118   
#>                     3rd Qu.:   3.866   
#>                     Max.   :6007.242
summary(ccl4_vapor_grouped)
#>     dtxsid            propType         mean_vapor_pressure
#>  Length:140         Length:140         Min.   :   0.000   
#>  Class :character   Class :character   1st Qu.:   0.000   
#>  Mode  :character   Mode  :character   Median :   0.069   
#>                                        Mean   : 265.007   
#>                                        3rd Qu.:  10.125   
#>                                        Max.   :7252.700
summary(natadb_vapor_all)
#>     dtxsid          mean_vapor_pressure
#>  Length:145         Min.   :   0.000   
#>  Class :character   1st Qu.:   0.007   
#>  Mode  :character   Median :   1.001   
#>                     Mean   : 225.872   
#>                     3rd Qu.:  89.931   
#>                     Max.   :9298.713
summary(natadb_vapor_grouped)
#>     dtxsid            propType         mean_vapor_pressure
#>  Length:267         Length:267         Min.   :   0.000   
#>  Class :character   Class :character   1st Qu.:   0.008   
#>  Mode  :character   Mode  :character   Median :   1.521   
#>                                        Mean   : 243.404   
#>                                        3rd Qu.:  98.721   
#>                                        Max.   :9412.390

With such a large range of values covering several orders of magnitude, log transform the data. This data from both chemical lists can also be plotted individually and by property type.

ccl4_vapor_all[, log_transform_mean_vapor_pressure := log(mean_vapor_pressure)]
#>             dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
#>             <char>               <num>                             <num>
#>  1:  DTXSID0020153        1.436092e+00                         0.3619259
#>  2:  DTXSID0020446        1.211530e-06                       -13.6236263
#>  3:  DTXSID0020573        1.016181e-08                       -18.4046290
#>  4:  DTXSID0020600        1.060439e+03                         6.9664387
#>  5:  DTXSID0020814        1.713402e-08                       -17.8821999
#>  6:  DTXSID0021541        3.623345e+03                         8.1951529
#>  7:  DTXSID0021917        1.447043e+02                         4.9746920
#>  8:  DTXSID0024052        2.579787e-07                       -15.1703889
#>  9:  DTXSID1020437        2.153143e+02                         5.3720986
#> 10:  DTXSID1021407        9.093477e-04                        -7.0027830
#> 11:  DTXSID1021740        6.808950e+00                         1.9182379
#> 12:  DTXSID1021798        5.779040e-02                        -2.8509326
#> 13:  DTXSID1024174        5.616588e-06                       -12.0897863
#> 14:  DTXSID1024338        7.125250e-08                       -16.4570359
#> 15:  DTXSID1026164        2.444570e-01                        -1.4087159
#> 16:  DTXSID1037484        4.124746e-07                       -14.7010911
#> 17:  DTXSID1037486        4.206599e-07                       -14.6814412
#> 18:  DTXSID1037567        4.633640e-08                       -16.8873380
#> 19:  DTXSID2020684        3.733735e-03                        -5.5903463
#> 20:  DTXSID2021028        1.127656e+00                         0.1201408
#> 21:  DTXSID2021317        1.563302e+01                         2.7493857
#> 22:  DTXSID2021731        2.360838e+02                         5.4641866
#> 23:  DTXSID2022333        1.685188e+00                         0.5218768
#> 24:  DTXSID2037506        8.393291e-06                       -11.6880778
#> 25:  DTXSID2052156        3.878108e-09                       -19.3679185
#> 26:  DTXSID3020203        1.876293e+03                         7.5370530
#> 27:  DTXSID3020833        2.420272e+02                         5.4890503
#> 28:  DTXSID3020964        2.336795e-01                        -1.4538048
#> 29:  DTXSID3024366        5.501190e+01                         4.0075495
#> 30:  DTXSID3024869        1.885188e-02                        -3.9711425
#> 31:  DTXSID3031864        2.479965e-06                       -12.9072661
#> 32:  DTXSID3032464        1.727766e-06                       -13.2686813
#> 33:  DTXSID3042219        3.106110e+00                         1.1333711
#> 34:  DTXSID3074313        1.481020e-11                       -24.9357050
#> 35:  DTXSID4020533        4.036368e+01                         3.6979302
#> 36:  DTXSID4021503        1.722775e+02                         5.1491065
#> 37:  DTXSID4022361        3.226160e-06                       -12.6442179
#> 38:  DTXSID4022367        1.643378e-08                       -17.9239267
#> 39:  DTXSID4022448        1.773129e-05                       -10.9401797
#> 40:  DTXSID4022991        1.440972e-10                       -22.6605333
#> 41:  DTXSID4032611        5.263403e-04                        -7.5495627
#> 42:  DTXSID4034948        3.744327e-08                       -17.1004389
#> 43:  DTXSID5020023        2.545125e+02                         5.5393500
#> 44:  DTXSID5020576        5.737673e-09                       -18.9762122
#> 45:  DTXSID5020601        1.428200e+00                         0.3564147
#> 46:  DTXSID5021207        4.463567e+02                         6.1011185
#> 47:  DTXSID5024182        7.171160e+00                         1.9700674
#> 48:  DTXSID5039224        7.963770e+02                         6.6800727
#> 49: DTXSID50867064        1.522060e-03                        -6.4876906
#> 50:  DTXSID6020301        6.007243e+03                         8.7007211
#> 51:  DTXSID6020856        2.909907e-01                        -1.2344638
#> 52:  DTXSID6021030        3.562765e-05                       -10.2423885
#> 53:  DTXSID6021032        8.306000e-01                        -0.1856069
#> 54:  DTXSID6022422        1.921524e-06                       -13.1623921
#> 55:  DTXSID6024177        7.035691e-02                        -2.6541743
#> 56:  DTXSID6037483        4.847560e-08                       -16.8422053
#> 57:  DTXSID6037485        4.976030e-08                       -16.8160484
#> 58:  DTXSID6037568        2.875770e-07                       -15.0617752
#> 59:  DTXSID7020005        6.776980e-02                        -2.6916386
#> 60:  DTXSID7020637        1.836866e+03                         7.5158163
#> 61:  DTXSID7021029        3.323853e+00                         1.2011248
#> 62:  DTXSID7024241        2.031631e-06                       -13.1066715
#> 63:  DTXSID7047433        1.073489e-08                       -18.3497663
#> 64:  DTXSID8020044        2.447645e+01                         3.1977114
#> 65:  DTXSID8020090        5.073765e-01                        -0.6785019
#> 66:  DTXSID8020597        2.037895e-01                        -1.5906676
#> 67:  DTXSID8020832        1.520092e+03                         7.3265265
#> 68:  DTXSID8021062        1.180510e-01                        -2.1366388
#> 69:  DTXSID8022292        1.528695e-08                       -17.9962663
#> 70:  DTXSID8022377        1.016181e-08                       -18.4046290
#> 71:  DTXSID8023846        1.702425e-06                       -13.2834569
#> 72:  DTXSID8023848        8.277807e-06                       -11.7019325
#> 73:  DTXSID8025541        1.949228e-05                       -10.8454918
#> 74:  DTXSID8031865        8.386721e-01                        -0.1759355
#> 75:  DTXSID9020243        8.999120e-08                       -16.2235539
#> 76:  DTXSID9021390        4.408660e+00                         1.4835708
#> 77:  DTXSID9021427        3.637282e-01                        -1.0113483
#> 78:  DTXSID9022366        1.506794e-09                       -20.3132814
#> 79:  DTXSID9023380        1.149256e-08                       -18.2815660
#> 80:  DTXSID9023914        3.131080e-04                        -8.0689624
#> 81:  DTXSID9024142        2.739923e-09                       -19.7153359
#> 82:  DTXSID9032113        2.981382e-08                       -17.3282936
#> 83:  DTXSID9032329        7.041032e-07                       -14.1663408
#>             dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
ccl4_vapor_grouped[, log_transform_mean_vapor_pressure := 
                     log(mean_vapor_pressure)]
#>             dtxsid     propType mean_vapor_pressure log_transform_mean_vapor_pressure
#>             <char>       <char>               <num>                             <num>
#>   1: DTXSID0020153 experimental        1.229990e+00                         0.2070060
#>   2: DTXSID0020153    predicted        1.504793e+00                         0.4086556
#>   3: DTXSID0020446 experimental        6.899220e-08                       -16.4892724
#>   4: DTXSID0020446    predicted        1.592376e-06                       -13.3502831
#>   5: DTXSID0020573    predicted        1.016181e-08                       -18.4046290
#>  ---                                                                                 
#> 136: DTXSID9024142    predicted        2.059810e-09                       -20.0006521
#> 137: DTXSID9032113 experimental        1.279970e-08                       -18.1738441
#> 138: DTXSID9032113    predicted        3.548520e-08                       -17.1541501
#> 139: DTXSID9032329 experimental        8.000180e-07                       -14.0386316
#> 140: DTXSID9032329    predicted        6.721317e-07                       -14.2128116
natadb_vapor_all[, log_transform_mean_vapor_pressure := 
                   log(mean_vapor_pressure)]
#>             dtxsid mean_vapor_pressure log_transform_mean_vapor_pressure
#>             <char>               <num>                             <num>
#>   1: DTXSID0020153        1.436092e+00                         0.3619259
#>   2: DTXSID0020448        4.899910e+01                         3.8918019
#>   3: DTXSID0020523        2.824828e-04                        -8.1718931
#>   4: DTXSID0020529        7.603507e-04                        -7.1817307
#>   5: DTXSID0020600        1.060439e+03                         6.9664387
#>  ---                                                                    
#> 141: DTXSID9020293        3.124725e-02                        -3.4658239
#> 142: DTXSID9020299        1.177018e-06                       -13.6525264
#> 143: DTXSID9020827        2.930390e-06                       -12.7403750
#> 144: DTXSID9021138        3.619153e-03                        -5.6215152
#> 145: DTXSID9041522        6.433997e-05                        -9.6513295
natadb_vapor_grouped[, log_transform_mean_vapor_pressure := 
                       log(mean_vapor_pressure)]
#>             dtxsid     propType mean_vapor_pressure log_transform_mean_vapor_pressure
#>             <char>       <char>               <num>                             <num>
#>   1: DTXSID0020153 experimental        1.229990e+00                         0.2070060
#>   2: DTXSID0020153    predicted        1.504793e+00                         0.4086556
#>   3: DTXSID0020448 experimental        5.329670e+01                         3.9758744
#>   4: DTXSID0020448    predicted        4.756657e+01                         3.8621301
#>   5: DTXSID0020523 experimental        3.900320e-04                        -7.8492818
#>  ---                                                                                 
#> 263: DTXSID9020299 experimental        2.199890e-06                       -13.0271032
#> 264: DTXSID9020299    predicted        8.360607e-07                       -13.9945647
#> 265: DTXSID9020827    predicted        2.930390e-06                       -12.7403750
#> 266: DTXSID9021138    predicted        3.619153e-03                        -5.6215152
#> 267: DTXSID9041522    predicted        6.433997e-05                        -9.6513295

Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function rbind().

ccl4_vapor_grouped[, set := 'CCL4']
#>             dtxsid     propType mean_vapor_pressure log_transform_mean_vapor_pressure    set
#>             <char>       <char>               <num>                             <num> <char>
#>   1: DTXSID0020153 experimental        1.229990e+00                         0.2070060   CCL4
#>   2: DTXSID0020153    predicted        1.504793e+00                         0.4086556   CCL4
#>   3: DTXSID0020446 experimental        6.899220e-08                       -16.4892724   CCL4
#>   4: DTXSID0020446    predicted        1.592376e-06                       -13.3502831   CCL4
#>   5: DTXSID0020573    predicted        1.016181e-08                       -18.4046290   CCL4
#>  ---                                                                                        
#> 136: DTXSID9024142    predicted        2.059810e-09                       -20.0006521   CCL4
#> 137: DTXSID9032113 experimental        1.279970e-08                       -18.1738441   CCL4
#> 138: DTXSID9032113    predicted        3.548520e-08                       -17.1541501   CCL4
#> 139: DTXSID9032329 experimental        8.000180e-07                       -14.0386316   CCL4
#> 140: DTXSID9032329    predicted        6.721317e-07                       -14.2128116   CCL4
natadb_vapor_grouped[, set := 'NATADB']
#>             dtxsid     propType mean_vapor_pressure log_transform_mean_vapor_pressure    set
#>             <char>       <char>               <num>                             <num> <char>
#>   1: DTXSID0020153 experimental        1.229990e+00                         0.2070060 NATADB
#>   2: DTXSID0020153    predicted        1.504793e+00                         0.4086556 NATADB
#>   3: DTXSID0020448 experimental        5.329670e+01                         3.9758744 NATADB
#>   4: DTXSID0020448    predicted        4.756657e+01                         3.8621301 NATADB
#>   5: DTXSID0020523 experimental        3.900320e-04                        -7.8492818 NATADB
#>  ---                                                                                        
#> 263: DTXSID9020299 experimental        2.199890e-06                       -13.0271032 NATADB
#> 264: DTXSID9020299    predicted        8.360607e-07                       -13.9945647 NATADB
#> 265: DTXSID9020827    predicted        2.930390e-06                       -12.7403750 NATADB
#> 266: DTXSID9021138    predicted        3.619153e-03                        -5.6215152 NATADB
#> 267: DTXSID9041522    predicted        6.433997e-05                        -9.6513295 NATADB

all_vapor_grouped <- rbind(ccl4_vapor_grouped, natadb_vapor_grouped)

vapor_box <- ggplot(all_vapor_grouped, 
                    aes(set, log_transform_mean_vapor_pressure)) + 
                    geom_boxplot(aes(color = propType))
vapor <- ggplot(all_vapor_grouped, aes(log_transform_mean_vapor_pressure)) +
                     geom_boxplot((aes(color = set))) + 
                     coord_flip()

Plot the combined data. Boxplots are colored based on the property type, with mean log transformed vapor pressure plotted for each chemical list and property type, or by chemical list alone.

In the box plots above, a general trend indicates that that the NATADB chemical list has a higher mean vapor pressure than the CCL4 chemical list.

Henry’s Law constant

Henry’s Law constant can be explored in a similar fashion. Begin by grouping data by DTXSID, and also by DTXSID and property type.

ccl4_hlc_all <- ccl4_phys_chem[propertyId %in% 'henrys-law', 
                               .(mean_hlc = sapply(.SD, mean)), 
                               .SDcols = c('value'), by = .(dtxsid)]
natadb_hlc_all <- natadb_phys_chem[propertyId %in% 'henrys-law', 
                                   .(mean_hlc = sapply(.SD, mean)), 
                                   .SDcols = c('value'), by = .(dtxsid)]
ccl4_hlc_grouped <- ccl4_phys_chem[propertyId %in% 'henrys-law', 
                                   .(mean_hlc = sapply(.SD, mean)), 
                                   .SDcols = c('value'), 
                                   by = .(dtxsid, propType)]
natadb_hlc_grouped <- natadb_phys_chem[propertyId %in% 'henrys-law', 
                                       .(mean_hlc = sapply(.SD, mean)), 
                                       .SDcols = c('value'), 
                                       by = .(dtxsid, propType)]

Examine summary statistics.

summary(ccl4_hlc_all)
#>     dtxsid             mean_hlc        
#>  Length:82          Min.   :0.0000000  
#>  Class :character   1st Qu.:0.0000000  
#>  Mode  :character   Median :0.0000007  
#>                     Mean   :0.0076308  
#>                     3rd Qu.:0.0000216  
#>                     Max.   :0.4922550
summary(ccl4_hlc_grouped)
#>     dtxsid            propType            mean_hlc        
#>  Length:110         Length:110         Min.   :0.0000000  
#>  Class :character   Class :character   1st Qu.:0.0000000  
#>  Mode  :character   Mode  :character   Median :0.0000020  
#>                                        Mean   :0.0063874  
#>                                        3rd Qu.:0.0001217  
#>                                        Max.   :0.4922550
summary(natadb_hlc_all)
#>     dtxsid             mean_hlc        
#>  Length:144         Min.   :0.0000000  
#>  Class :character   1st Qu.:0.0000002  
#>  Mode  :character   Median :0.0000328  
#>                     Mean   :0.0086311  
#>                     3rd Qu.:0.0012426  
#>                     Max.   :0.4922550
summary(natadb_hlc_grouped)
#>     dtxsid            propType            mean_hlc        
#>  Length:210         Length:210         Min.   :0.0000000  
#>  Class :character   Class :character   1st Qu.:0.0000003  
#>  Mode  :character   Mode  :character   Median :0.0001028  
#>                                        Mean   :0.0073349  
#>                                        3rd Qu.:0.0028132  
#>                                        Max.   :0.4922550

Again, log transform the data as it is positive and covers several orders of magnitude.

ccl4_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
#>             dtxsid     mean_hlc log_transform_mean_hlc
#>             <char>        <num>                  <num>
#>  1:  DTXSID0020153 2.918720e-03             -5.8366101
#>  2:  DTXSID0020446 1.527830e-09            -20.2994174
#>  3:  DTXSID0020573 3.748870e-06            -12.4940561
#>  4:  DTXSID0020600 1.479610e-04             -8.8185618
#>  5:  DTXSID0020814 2.049460e-07            -15.4005193
#>  6:  DTXSID0021541 8.825620e-03             -4.7300964
#>  7:  DTXSID0021917 4.922550e-01             -0.7087584
#>  8:  DTXSID0024052 2.338400e-10            -22.1763840
#>  9:  DTXSID1020437 5.639950e-03             -5.1778801
#> 10:  DTXSID1021407 1.768600e-06            -13.2453223
#> 11:  DTXSID1021740 8.824295e-06            -11.6380018
#> 12:  DTXSID1021798 8.664450e-06            -11.6562821
#> 13:  DTXSID1024174 2.964310e-07            -15.0314514
#> 14:  DTXSID1024338 1.646800e-10            -22.5270169
#> 15:  DTXSID1026164 1.977445e-06            -13.1337050
#> 16:  DTXSID1037484 1.196270e-09            -20.5440575
#> 17:  DTXSID1037486 1.185530e-09            -20.5530759
#> 18:  DTXSID1037567 4.339640e-10            -21.5580595
#> 19:  DTXSID2020684 3.194620e-06            -12.6540424
#> 20:  DTXSID2021028 3.662790e-06            -12.5172854
#> 21:  DTXSID2021317 2.493330e-03             -5.9941361
#> 22:  DTXSID2021731 4.553975e-06            -12.2995101
#> 23:  DTXSID2022333 8.026420e-03             -4.8250167
#> 24:  DTXSID2037506 1.313660e-09            -20.4504487
#> 25:  DTXSID2052156 3.785200e-10            -21.6947522
#> 26:  DTXSID3020203 4.129470e-02             -3.1870211
#> 27:  DTXSID3020833 5.902805e-04             -7.4349127
#> 28:  DTXSID3020964 2.389895e-05            -10.6416760
#> 29:  DTXSID3024366 1.485460e-04             -8.8146159
#> 30:  DTXSID3024869 1.776260e-06            -13.2410005
#> 31:  DTXSID3031864 1.803350e-11            -24.7387900
#> 32:  DTXSID3032464 8.835860e-06            -11.6366921
#> 33:  DTXSID3042219 1.047975e-02             -4.5583105
#> 34:  DTXSID3074313 2.046920e-11            -24.6120998
#> 35:  DTXSID4020533 4.847905e-06            -12.2369639
#> 36:  DTXSID4021503 3.771570e-03             -5.5802639
#> 37:  DTXSID4022361 2.458220e-08            -17.5212432
#> 38:  DTXSID4022367 1.047830e-09            -20.6765445
#> 39:  DTXSID4022448 9.004310e-09            -18.5255625
#> 40:  DTXSID4022991 1.238900e-11            -25.1142121
#> 41:  DTXSID4032611 1.477090e-05            -11.1228515
#> 42:  DTXSID4034948 1.537700e-09            -20.2929780
#> 43:  DTXSID5020023 1.214215e-04             -9.0162426
#> 44:  DTXSID5020576 9.440170e-08            -16.1757068
#> 45:  DTXSID5020601 3.864950e-09            -19.3713171
#> 46:  DTXSID5021207 1.374810e-04             -8.8920248
#> 47:  DTXSID5024182 3.307185e-07            -14.9219983
#> 48:  DTXSID5039224 6.639505e-05             -9.6198881
#> 49: DTXSID50867064 1.183860e-08            -18.2519005
#> 50:  DTXSID6020301 4.061025e-02             -3.2037348
#> 51:  DTXSID6020856 3.217130e-09            -19.5547762
#> 52:  DTXSID6021030 9.187830e-07            -13.9002159
#> 53:  DTXSID6021032 3.249700e-04             -8.0317777
#> 54:  DTXSID6022422 1.067460e-07            -16.0528137
#> 55:  DTXSID6024177 2.553800e-07            -15.1805132
#> 56:  DTXSID6037483 5.542140e-10            -21.3134702
#> 57:  DTXSID6037485 5.623550e-10            -21.2988878
#> 58:  DTXSID6037568 8.289390e-09            -18.6082895
#> 59:  DTXSID7020005 8.830950e-08            -16.2424181
#> 60:  DTXSID7020637 3.432355e-07            -14.8848490
#> 61:  DTXSID7021029 3.650960e-05            -10.2179353
#> 62:  DTXSID7024241 3.283070e-06            -12.6267316
#> 63:  DTXSID7047433 6.785440e-08            -16.5059016
#> 64:  DTXSID8020044 4.999450e-06            -12.2061827
#> 65:  DTXSID8020090 2.019270e-06            -13.1127745
#> 66:  DTXSID8020597 5.999740e-08            -16.6289646
#> 67:  DTXSID8020832 7.367270e-03             -4.9107081
#> 68:  DTXSID8021062 1.159700e-05            -11.3647641
#> 69:  DTXSID8022292 2.387120e-08            -17.5505931
#> 70:  DTXSID8022377 3.748870e-06            -12.4940561
#> 71:  DTXSID8023846 4.938440e-09            -19.1262163
#> 72:  DTXSID8023848 9.954840e-09            -18.4252070
#> 73:  DTXSID8025541 4.150640e-07            -14.6948331
#> 74:  DTXSID8031865 1.916920e-10            -22.3751312
#> 75:  DTXSID9021390 3.432295e-04             -7.9771112
#> 76:  DTXSID9021427 5.556870e-07            -14.4030607
#> 77:  DTXSID9022366 5.167830e-09            -19.0808130
#> 78:  DTXSID9023380 4.156230e-09            -19.2986574
#> 79:  DTXSID9023914 5.035540e-11            -23.7119153
#> 80:  DTXSID9024142 2.692760e-06            -12.8249439
#> 81:  DTXSID9032113 3.098320e-07            -14.9872356
#> 82:  DTXSID9032329 1.566850e-06            -13.3664433
#>             dtxsid     mean_hlc log_transform_mean_hlc
ccl4_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
#>             dtxsid     propType    mean_hlc log_transform_mean_hlc
#>             <char>       <char>       <num>                  <num>
#>   1: DTXSID0020153    predicted 2.91872e-03              -5.836610
#>   2: DTXSID0020446    predicted 1.52783e-09             -20.299417
#>   3: DTXSID0020573    predicted 3.74887e-06             -12.494056
#>   4: DTXSID0020600 experimental 1.48000e-04              -8.818298
#>   5: DTXSID0020600    predicted 1.47922e-04              -8.818825
#>  ---                                                              
#> 106: DTXSID9023914 experimental 5.03002e-11             -23.713012
#> 107: DTXSID9023914    predicted 5.04106e-11             -23.710820
#> 108: DTXSID9024142    predicted 2.69276e-06             -12.824944
#> 109: DTXSID9032113    predicted 3.09832e-07             -14.987236
#> 110: DTXSID9032329    predicted 1.56685e-06             -13.366443

natadb_hlc_all[, log_transform_mean_hlc := log(mean_hlc)]
#>             dtxsid     mean_hlc log_transform_mean_hlc
#>             <char>        <num>                  <num>
#>   1: DTXSID0020153 2.918720e-03              -5.836610
#>   2: DTXSID0020448 2.806465e-03              -5.875830
#>   3: DTXSID0020523 8.611785e-08             -16.267549
#>   4: DTXSID0020529 5.417975e-08             -16.730959
#>   5: DTXSID0020600 1.479610e-04              -8.818562
#>  ---                                                  
#> 140: DTXSID9020293 2.943670e-06             -12.735853
#> 141: DTXSID9020299 4.846860e-10             -21.447520
#> 142: DTXSID9020827 2.040880e-07             -15.404715
#> 143: DTXSID9021138 5.672420e-08             -16.685065
#> 144: DTXSID9041522 8.981400e-06             -11.620355
natadb_hlc_grouped[, log_transform_mean_hlc := log(mean_hlc)]
#>             dtxsid     propType    mean_hlc log_transform_mean_hlc
#>             <char>       <char>       <num>                  <num>
#>   1: DTXSID0020153    predicted 2.91872e-03              -5.836610
#>   2: DTXSID0020448 experimental 2.82000e-03              -5.871018
#>   3: DTXSID0020448    predicted 2.79293e-03              -5.880664
#>   4: DTXSID0020523 experimental 8.59999e-08             -16.268920
#>   5: DTXSID0020523    predicted 8.62358e-08             -16.266180
#>  ---                                                              
#> 206: DTXSID9020299    predicted 4.84686e-10             -21.447520
#> 207: DTXSID9020827 experimental 2.03000e-07             -15.410060
#> 208: DTXSID9020827    predicted 2.05176e-07             -15.399398
#> 209: DTXSID9021138    predicted 5.67242e-08             -16.685065
#> 210: DTXSID9041522    predicted 8.98140e-06             -11.620355

Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function rbind().

ccl4_hlc_grouped[, set := 'CCL4']
#>             dtxsid     propType    mean_hlc log_transform_mean_hlc    set
#>             <char>       <char>       <num>                  <num> <char>
#>   1: DTXSID0020153    predicted 2.91872e-03              -5.836610   CCL4
#>   2: DTXSID0020446    predicted 1.52783e-09             -20.299417   CCL4
#>   3: DTXSID0020573    predicted 3.74887e-06             -12.494056   CCL4
#>   4: DTXSID0020600 experimental 1.48000e-04              -8.818298   CCL4
#>   5: DTXSID0020600    predicted 1.47922e-04              -8.818825   CCL4
#>  ---                                                                     
#> 106: DTXSID9023914 experimental 5.03002e-11             -23.713012   CCL4
#> 107: DTXSID9023914    predicted 5.04106e-11             -23.710820   CCL4
#> 108: DTXSID9024142    predicted 2.69276e-06             -12.824944   CCL4
#> 109: DTXSID9032113    predicted 3.09832e-07             -14.987236   CCL4
#> 110: DTXSID9032329    predicted 1.56685e-06             -13.366443   CCL4
natadb_hlc_grouped[, set := 'NATADB']
#>             dtxsid     propType    mean_hlc log_transform_mean_hlc    set
#>             <char>       <char>       <num>                  <num> <char>
#>   1: DTXSID0020153    predicted 2.91872e-03              -5.836610 NATADB
#>   2: DTXSID0020448 experimental 2.82000e-03              -5.871018 NATADB
#>   3: DTXSID0020448    predicted 2.79293e-03              -5.880664 NATADB
#>   4: DTXSID0020523 experimental 8.59999e-08             -16.268920 NATADB
#>   5: DTXSID0020523    predicted 8.62358e-08             -16.266180 NATADB
#>  ---                                                                     
#> 206: DTXSID9020299    predicted 4.84686e-10             -21.447520 NATADB
#> 207: DTXSID9020827 experimental 2.03000e-07             -15.410060 NATADB
#> 208: DTXSID9020827    predicted 2.05176e-07             -15.399398 NATADB
#> 209: DTXSID9021138    predicted 5.67242e-08             -16.685065 NATADB
#> 210: DTXSID9041522    predicted 8.98140e-06             -11.620355 NATADB

all_hlc_grouped <- rbind(ccl4_hlc_grouped, natadb_hlc_grouped)

hlc_box <- ggplot(all_hlc_grouped, aes(set, log_transform_mean_hlc)) + 
  geom_boxplot(aes(color = propType))

hlc <- ggplot(all_hlc_grouped, aes(log_transform_mean_hlc)) +
  geom_boxplot(aes(color = set)) +
  coord_flip()

Again, in both grouping by propType and aggregating all results together by chemical list, NATADB chemicals generally higher mean Henry’s Law Constant value than CCL4 chemicals.

Boling Point

Boiling Point data be explored. Begin by grouping data by DTXSID, and also by DTXSID and property type.

ccl4_boiling_all <- ccl4_phys_chem[propertyId %in% 'boiling-point', 
                                   .(mean_boiling_point = sapply(.SD, mean)), 
                                   .SDcols = c('value'), by = .(dtxsid)]
natadb_boiling_all <- natadb_phys_chem[propertyId %in% 'boiling-point', 
                                       .(mean_boiling_point = 
                                           sapply(.SD, mean)), 
                                       .SDcols = c('value'), by = .(dtxsid)]
ccl4_boiling_grouped <- ccl4_phys_chem[propertyId %in% 'boiling-point', 
                                       .(mean_boiling_point = 
                                           sapply(.SD, mean)), 
                                       .SDcols = c('value'), 
                                       by = .(dtxsid, propType)]
natadb_boiling_grouped <- natadb_phys_chem[propertyId %in% 'boiling-point', 
                                           .(mean_boiling_point = 
                                               sapply(.SD, mean)), 
                                           .SDcols = c('value'), 
                                           by = .(dtxsid, propType)]

Calculate summary statistics.

summary(ccl4_boiling_all)
#>     dtxsid          mean_boiling_point
#>  Length:91          Min.   : -34.92   
#>  Class :character   1st Qu.: 167.40   
#>  Mode  :character   Median : 306.29   
#>                     Mean   : 350.26   
#>                     3rd Qu.: 390.63   
#>                     Max.   :3377.66
summary(ccl4_boiling_grouped)
#>     dtxsid            propType         mean_boiling_point
#>  Length:143         Length:143         Min.   : -40.78   
#>  Class :character   Class :character   1st Qu.: 115.13   
#>  Mode  :character   Mode  :character   Median : 208.03   
#>                                        Mean   : 318.70   
#>                                        3rd Qu.: 383.44   
#>                                        Max.   :4825.00
summary(natadb_boiling_all)
#>     dtxsid          mean_boiling_point
#>  Length:152         Min.   :-38.45    
#>  Class :character   1st Qu.: 85.98    
#>  Mode  :character   Median :187.18    
#>                     Mean   :186.92    
#>                     3rd Qu.:275.57    
#>                     Max.   :584.48
summary(natadb_boiling_grouped)
#>     dtxsid            propType         mean_boiling_point
#>  Length:295         Length:295         Min.   :-87.78    
#>  Class :character   Class :character   1st Qu.: 82.20    
#>  Mode  :character   Mode  :character   Median :180.00    
#>                                        Mean   :181.67    
#>                                        3rd Qu.:262.46    
#>                                        Max.   :685.00

Since some of the boiling point values have negative values, log transformation of these values will result in warnings as NaNs are produced.

Finally, compare both chemical lists simultaneously. To accomplish this, add a column to each data.table denoting to which chemical list the rows correspond and then combine the rows from both data sets together using the function rbind().

ccl4_boiling_grouped[, set := 'CCL4']
#>             dtxsid     propType mean_boiling_point    set
#>             <char>       <char>              <num> <char>
#>   1: DTXSID0020153 experimental           178.9778   CCL4
#>   2: DTXSID0020153    predicted           180.1880   CCL4
#>   3: DTXSID0020446 experimental           182.5000   CCL4
#>   4: DTXSID0020446    predicted           340.0280   CCL4
#>   5: DTXSID0020573    predicted           398.2052   CCL4
#>  ---                                                     
#> 139: DTXSID9024142    predicted           452.0113   CCL4
#> 140: DTXSID9032113    predicted           384.5745   CCL4
#> 141: DTXSID9032119 experimental           990.0000   CCL4
#> 142: DTXSID9032119    predicted           482.9800   CCL4
#> 143: DTXSID9032329    predicted           448.4297   CCL4
natadb_boiling_grouped[, set := 'NATADB']
#>             dtxsid     propType mean_boiling_point    set
#>             <char>       <char>              <num> <char>
#>   1: DTXSID0020153 experimental          178.97780 NATADB
#>   2: DTXSID0020153    predicted          180.18800 NATADB
#>   3: DTXSID0020448 experimental           96.14168 NATADB
#>   4: DTXSID0020448    predicted           97.64523 NATADB
#>   5: DTXSID0020523 experimental          113.00000 NATADB
#>  ---                                                     
#> 291: DTXSID9021138    predicted          263.44175 NATADB
#> 292: DTXSID9021261 experimental          685.00000 NATADB
#> 293: DTXSID9021261    predicted          483.96000 NATADB
#> 294: DTXSID9041522 experimental          340.00000 NATADB
#> 295: DTXSID9041522    predicted          334.18150 NATADB

all_boiling_grouped <- rbind(ccl4_boiling_grouped, natadb_boiling_grouped)

boiling_box <- ggplot(all_boiling_grouped, aes(set, mean_boiling_point)) + 
  geom_boxplot(aes(color = propType))
boiling <- ggplot(all_boiling_grouped, aes(mean_boiling_point)) +
  geom_boxplot(aes(color = set)) + 
  coord_flip()

A visual inspection of this set of graphs is not as clear as in the previous cases. Note that the predicted values for each data set tend to be higher than the experimental. The mean of CCL4, by predicted and experimental appears to be greater than the corresponding means for NATADB, as does the overall mean, but the interquartile ranges of these different groupings yield slightly different results. This gives us a sense that the picture for boiling point is not as clear cut between experimental and predicted for these two chemical lists as it was in the previous physico-chemical properties investigated.

To summarize the observations, across the various physico-chemical properties for chemicals in these chemical lists, there are indeed differences between the mean values of various physico-chemical properties when grouped by predicted or experimental.

  • For “vapor-pressure”, the means of predicted values tend to be a little lower than experimental, though they are much closer in the case of NATADB than CCL4.
  • The trend of lower predicted means compared to experimental means is more clearly demonstrated for “henrys-law” values in both data sets.
  • In the case of “boiling-point”, the predicted values are greater than the experimental values, though this is much more pronounced in CCL4 while the set of means for NATADB are again fairly close.

Conclusion

In this vignette, a variety of functions that access different types of data found in the Chemical endpoints of the CTX APIs were explored. While this exploration was not exhaustive, it provides a basic introduction to how one may access data and work with it. Additional endpoints and corresponding functions exist and we encourage the user to explore these while keeping in mind the examples contained in this vignette.