Introduction to DataSpaceR

Ju Yeong Kim

2019-04-03

This package provides a thin wrapper around Rlabkey and connects to the CAVD DataSpace database, making it easier to fetch datasets from specific studies.

Configuration

First, go to DataSpace and create an account.

To connect to the CAVD DataSpace via DataSpaceR, you need a netrc file in your home directory that contains a machine name (the DataSpace hostname), a login, and a password. There are two ways to create a netrc file.

Creating a netrc file with writeNetrc

On your R console, create a netrc file using a function from DataSpaceR:

writeNetrc(
  login = "yourEmail@address.com", 
  password = "yourSecretPassword",
  netrcFile = "/your/home/directory/.netrc" # use getNetrcPath() to get the default path 
)

This will create a netrc file in your home directory. Make sure you have a valid login and password.

Manually creating a netrc file

Alternatively, you can manually create a netrc file.

The following three lines must be included in the .netrc or _netrc file, separated by either white space (spaces, tabs, or newlines) or commas. Multiple such blocks can exist in one file.

machine dataspace.cavd.org
login myuser@domain.com
password supersecretpassword

See here for more information about netrc.
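
To check that a netrc file has the expected shape, it can be parsed with base R. The following is a minimal sketch, not part of DataSpaceR: it writes a throwaway file to a temporary path (an assumption made so the real file in your home directory is untouched), handles a single machine block only, and uses illustrative variable names.

```r
# Write a throwaway netrc file to a temporary path (not the real home directory).
netrcPath <- tempfile(fileext = ".netrc")
writeLines(
  c(
    "machine dataspace.cavd.org",
    "login myuser@domain.com",
    "password supersecretpassword"
  ),
  netrcPath
)

# Tokens may be separated by spaces, tabs, newlines, or commas,
# so split on any run of those characters.
raw <- paste(readLines(netrcPath), collapse = "\n")
tokens <- strsplit(raw, "[[:space:],]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Tokens alternate between a key and its value (single block assumed).
keys <- tokens[seq(1, length(tokens), by = 2)]
values <- tokens[seq(2, length(tokens), by = 2)]
entry <- setNames(values, keys)

# Confirm the three required keys are present without printing the password.
required <- c("machine", "login", "password")
hasAllKeys <- all(required %in% names(entry))
hasAllKeys
entry[["machine"]]
```

A file with several machine blocks would need the tokens grouped per block before this check.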

Initiate a connection

We’ll be looking at study cvd256. If you want to use a different study, change that string. You can instantiate multiple connections to different studies simultaneously.

library(DataSpaceR)
#> By exporting data from the CAVD DataSpace, you agree to be bound by the Terms of Use available on the CAVD DataSpace sign-in page at https://dataspace.cavd.org
con <- connectDS()
con
#> <DataSpaceConnection>
#>   URL: https://dataspace.cavd.org
#>   User: jkim2345@scharp.org
#>   Available studies: 249
#>     - 65 studies with data
#>     - 4499 subjects
#>     - 288603 data points
#>   Available groups: 6

The call to connectDS instantiates the connection. Printing the object shows where it’s connected and the available studies.

knitr::kable(head(con$availableStudies))
study_name short_name title type status stage species start_date strategy
cvd232 Parks_RV_232 Limiting Dose Vaginal SIVmac239 Challenge of RhCMV-SIV vaccinated Indian rhesus macaques. Pre-Clinical NHP Inactive Assays Completed Rhesus macaque 2009-11-24 Vector vaccines (viral or bacterial)
cvd234 Zolla-Pazner_Mab_test1 Study Zolla-Pazner_Mab_Test1 Antibody Screening Inactive Assays Completed Non-Organism Study 2009-02-03 Prophylactic neutralizing Ab
cvd235 mAbs potency Weiss mAbs potency Antibody Screening Inactive Assays Completed Non-Organism Study 2008-08-21 Prophylactic neutralizing Ab
cvd236 neutralization assays neutralization assays Antibody Screening Active In Progress Non-Organism Study 2009-02-03 Prophylactic neutralizing Ab
cvd238 Gallo_PA_238 HIV-1 neutralization responses in chronically infected individuals Antibody Screening Inactive Assays Completed Non-Organism Study 2009-01-08 Prophylactic neutralizing Ab
cvd239 CAVIMC-015 Lehner_Thorstensson_Allovac Pre-Clinical NHP Inactive Assays Completed Rhesus macaque 2009-01-08 Protein and peptide vaccines

con$availableStudies shows the available studies in the CAVD DataSpace. Check out the reference page of DataSpaceConnection for all available fields and methods.

cvd256 <- con$getStudy("cvd256")
cvd256
#> <DataSpaceStudy>
#>   Study: cvd256
#>   URL: https://dataspace.cavd.org/CAVD/cvd256
#>   Available datasets:
#>     - BAMA
#>     - Demographics
#>     - NAb

con$getStudy creates a connection to the study cvd256. Printing the object shows where it’s connected, to what study, and the available datasets.

knitr::kable(cvd256$availableDatasets)

| name | label | n |
|---|---|---|
| BAMA | Binding Ab multiplex assay | 6740 |
| Demographics | Demographics | 121 |
| NAb | Neutralizing antibody | 1419 |

knitr::kable(cvd256$treatmentArm)

| arm_id | arm_part | arm_group | arm_name | randomization | coded_label | last_day | description |
|---|---|---|---|---|---|---|---|
| cvd256-NA-A-A | NA | A | A | Vaccine | Group A Vaccine | 168 | DNA-C 4 mg administered IM at weeks 0, 4, and 8 AND NYVAC-C 10^7pfu/mL administered IM at week 24 |
| cvd256-NA-B-B | NA | B | B | Vaccine | Group B Vaccine | 168 | DNA-C 4 mg administered IM at weeks 0 and 4 AND NYVAC-C 10^7pfu/mL administered IM at weeks 20 and 24 |

The available datasets and the treatment arm information for the connection can be accessed through the availableDatasets and treatmentArm fields.

Fetching datasets

We can grab any of the datasets listed in the connection (availableDatasets).

NAb <- cvd256$getDataset("NAb")
dim(NAb)
#> [1] 1419   29
colnames(NAb)
#>  [1] "ParticipantId"          "ParticipantVisit/Visit"
#>  [3] "visit_day"              "assay_identifier"      
#>  [5] "summary_level"          "specimen_type"         
#>  [7] "antigen"                "antigen_type"          
#>  [9] "virus"                  "virus_type"            
#> [11] "virus_insert_name"      "clade"                 
#> [13] "neutralization_tier"    "tier_clade_virus"      
#> [15] "target_cell"            "initial_dilution"      
#> [17] "titer_ic50"             "titer_ic80"            
#> [19] "response_call"          "nab_lab_source_key"    
#> [21] "lab_code"               "exp_assayid"           
#> [23] "titer_ID50"             "titer_ID80"            
#> [25] "nab_response_ID50"      "nab_response_ID80"     
#> [27] "slope"                  "vaccine_matched"       
#> [29] "study_prot"

The cvd256 object is an instance of an R6 class, so it behaves like an object in a conventional object-oriented language. Functions such as getDataset are members of the object, which is why they are accessed with the $ operator.
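
For readers new to R6, the toy class below illustrates the same `$` access pattern. It is only a sketch: ToyStudy, its study field, and its describe method are hypothetical names invented for illustration and are not part of DataSpaceR.

```r
library(R6)

# ToyStudy is a hypothetical class for illustration; it is not part of DataSpaceR.
ToyStudy <- R6Class(
  "ToyStudy",
  public = list(
    # A field holds data on the object.
    study = NULL,
    initialize = function(study) {
      self$study <- study
    },
    # A method is a function stored on the object and called with `$`.
    describe = function() {
      paste("Study:", self$study)
    }
  )
)

toy <- ToyStudy$new("cvd256")
toy$study       # field access
toy$describe()  # method call
```

Fields and methods are both reached through `$`, which is the pattern cvd256$availableDatasets and cvd256$getDataset("NAb") follow.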

We can get detailed variable information using getDatasetDescription.

knitr::kable(cvd256$getDatasetDescription("NAb"))

| fieldName | caption | type | description |
|---|---|---|---|
| ParticipantId | Participant ID | Text (String) | Subject identifier |
| antigen | Antigen name | Text (String) | The name of the antigen (virus) being tested. |
| antigen_type | Antigen type | Text (String) | The standardized term for the type of virus used in the construction of the nAb antigen. |
| assay_identifier | Assay identifier | Text (String) | Name identifying assay |
| clade | Virus clade | Text (String) | The clade (gene subtype) of the virus (antigen) being tested. |
| exp_assayid | Experimental Assay Design Code | Integer | Unique ID assigned to the experiment design of the assay for tracking purposes. |
| initial_dilution | Initial dilution | Number (Double) | Indicates the initial specimen dilution. |
| lab_code | Lab ID | Text (String) | A code indicating the lab performing the assay. |
| nab_lab_source_key | Data provenance | Integer | Details regarding the provenance of the assay results. |
| nab_response_ID50 | Response call ID50 | True/False (Boolean) | Indicates if neutralization is detected based on ID50 titer. |
| nab_response_ID80 | Response call ID80 | True/False (Boolean) | Indicates if neutralization is detected based on ID80 titer. |
| neutralization_tier | Neutralization tier | Text (String) | A classification specific to HIV NAb assay design, in which an antigen is assessed for its ease of neutralization (1=most easily neutralized, 3=least easily neutralized) |
| response_call | Response call | True/False (Boolean) | Indicates if neutralization is detected. |
| slope | Slope | Number (Double) | The slope calculated using the difference between 50% and 80% neutralization. |
| specimen_type | Specimen type | Text (String) | The type of specimen used in the assay. For nAb assays, this is generally serum or plasma. |
| study_prot | Study Protocol | Text (String) | Study protocol |
| summary_level | Data summary level | Text (String) | Defines the level at which the magnitude or response has been summarized (e.g. summarized at the isolate level). |
| target_cell | Target cell | Text (String) | The cell line used in the assay to determine infection (lack of neutralization). Generally TZM-bl or A3R5, but can also be other cell lines or non-engineered cells. |
| tier_clade_virus | Neutralization tier + Antigen clade + Virus | Text (String) | A combination of neutralization tier, antigen clade, and virus used for filtering. |
| titer_ID50 | Titer ID50 | Number (Double) | The adjusted value of 50% maximal inhibitory dilution (ID50). |
| titer_ID80 | Titer ID80 | Number (Double) | The adjusted value of 80% maximal inhibitory dilution (ID80). |
| titer_ic50 | Titer IC50 | Number (Double) | The half maximal inhibitory concentration (IC50). |
| titer_ic80 | Titer IC80 | Number (Double) | The 80% maximal inhibitory concentration (IC80). |
| vaccine_matched | Antigen vaccine match indicator | True/False (Boolean) | Indicates if the interactive part of the antigen was designed to match the immunogen in the vaccine. |
| virus | Virus name | Text (String) | The term for the virus (antigen) being tested. |
| virus_insert_name | Virus insert name | Text (String) | The amino acid sequence inserted in the virus construct. |
| virus_type | Virus type | Text (String) | The type of virus used in the construction of the nAb antigen. |
| visit_day | Visit Day | Integer | Target study day defined for a study visit. Study days are relative to Day 0, where Day 0 is typically defined as enrollment and/or first injection. |

To get only a subset of the data and speed up the download, filters can be passed to getDataset. The filters are created using the makeFilter function of the Rlabkey package.

cvd256Filter <- makeFilter(c("visit_day", "EQUAL", "0"))
NAb_day0 <- cvd256$getDataset("NAb", colFilter = cvd256Filter)
dim(NAb_day0)
#> [1] 709  29

See ?makeFilter for more information on the syntax.
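
Several conditions can be combined by passing more than one filter vector to makeFilter. The sketch below assumes the Rlabkey package is installed and uses its GREATER_THAN, LESS_THAN, and IN operators. The cut-off values and the variable names rangeFilter and visitFilter are illustrative choices, not values taken from cvd256.

```r
library(Rlabkey)

# Each filter is a character vector: column name, operator, value.
# Passing several vectors yields one filter per vector; all must hold.
rangeFilter <- makeFilter(
  c("visit_day", "GREATER_THAN", "0"),
  c("visit_day", "LESS_THAN", "100")
)

# The IN operator takes a semicolon-separated list of values.
visitFilter <- makeFilter(c("visit_day", "IN", "0;14;28"))

# With a live connection, either filter could be passed as colFilter:
# NAb_range <- cvd256$getDataset("NAb", colFilter = rangeFilter)
```

The getDataset call is left commented out because it requires a live connection to DataSpace.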

Creating a connection to all studies

To fetch data from multiple studies, create a connection at the project level.

cavd <- con$getStudy("")

This will instantiate a connection at the CAVD level. Most functions work on cross-study connections just as they do on single-study connections.

You can get a list of datasets available across all studies.

cavd
#> <DataSpaceStudy>
#>   Study: CAVD
#>   URL: https://dataspace.cavd.org/CAVD
#>   Available datasets:
#>     - BAMA
#>     - Demographics
#>     - ELISPOT
#>     - ICS
#>     - NAb

knitr::kable(cavd$availableDatasets)

| name | label | n |
|---|---|---|
| BAMA | Binding Ab multiplex assay | 86289 |
| Demographics | Demographics | 4499 |
| ELISPOT | Enzyme-Linked ImmunoSpot | 5610 |
| ICS | Intracellular Cytokine Staining | 150910 |
| NAb | Neutralizing antibody | 45794 |

In an all-study connection, getDataset combines the requested dataset across studies. Note that in most cases the combined datasets have too many subjects for quick data transfer, which makes filtering a necessity. The colFilter argument can be used here, as described in the Fetching datasets section.

conFilter <- makeFilter(c("species", "EQUAL", "Human"))
human <- cavd$getDataset("Demographics", colFilter = conFilter)
dim(human)
#> [1] 2754   36
colnames(human)
#>  [1] "SubjectId"                       "SubjectVisit/Visit"             
#>  [3] "species"                         "subspecies"                     
#>  [5] "sexatbirth"                      "race"                           
#>  [7] "ethnicity"                       "country_enrollment"             
#>  [9] "circumcised_enrollment"          "bmi_enrollment"                 
#> [11] "agegroup_range"                  "agegroup_enrollment"            
#> [13] "age_enrollment"                  "study_label"                    
#> [15] "study_start_date"                "study_first_enr_date"           
#> [17] "study_fu_complete_date"          "study_public_date"              
#> [19] "study_network"                   "study_last_vaccination_day"     
#> [21] "study_type"                      "study_part"                     
#> [23] "study_group"                     "study_arm"                      
#> [25] "study_arm_summary"               "study_arm_coded_label"          
#> [27] "study_randomization"             "study_product_class_combination"
#> [29] "study_product_combination"       "study_short_name"               
#> [31] "study_grant_pi_name"             "study_strategy"                 
#> [33] "study_prot"                      "genderidentity"                 
#> [35] "studycohort"                     "bmi_category"

Check out the reference page of DataSpaceStudy for all available fields and methods.
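
Once a cross-study dataset is in memory, ordinary R tools apply to it. The sketch below summarizes subjects per study using base R. To stay runnable without a connection, it builds a small mock data frame: the column names are taken from the Demographics output above, but every row is invented for illustration.

```r
# Mock stand-in for the result of cavd$getDataset("Demographics", ...).
# Column names come from the output above; the rows are invented.
human <- data.frame(
  SubjectId = c("s1", "s2", "s3", "s4", "s5"),
  species = "Human",
  study_prot = c("cvd256", "cvd256", "vtn505", "vtn505", "vtn505"),
  sexatbirth = c("Female", "Male", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

# Number of subjects per study.
subjectsPerStudy <- table(human$study_prot)
subjectsPerStudy

# Cross-tabulate study by sex at birth.
studyBySex <- table(human$study_prot, human$sexatbirth)
studyBySex
```

The same calls should work on the real table, since only the column names study_prot and sexatbirth are relied on.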

Connect to a saved group

A group is a curated collection of participants, defined by filtering on treatments, products, studies, or species. Groups are created in the DataSpace App.

Suppose you are using the App to filter and visualize data and want to save the selection for later, or to explore it in R with DataSpaceR. You can save a group by clicking the Save button on the Active Filter Panel.

We can browse the available saved groups, as well as the groups curated by the DataSpace Team, via availableGroups.

knitr::kable(con$availableGroups)

| id | label | originalLabel | description | createdBy | shared | n | studies |
|---|---|---|---|---|---|---|---|
| 216 | mice | mice | NA | readjk | FALSE | 75 | c("cvd468", "cvd483", "cvd316", "cvd331") |
| 217 | CAVD 242 | CAVD 242 | This is a fake group for CAVD 242 | readjk | FALSE | 30 | cvd242 |
| 220 | NYVAC durability comparison | NYVAC_durability | Compare durability in 4 NHP studies using NYVAC-C (vP2010) and NYVAC-KC-gp140 (ZM96) products. | ehenrich | TRUE | 78 | c("cvd281", "cvd434", "cvd259", "cvd277") |
| 224 | cvd338 | cvd338 | NA | readjk | FALSE | 36 | cvd338 |
| 228 | HVTN 505 case control subjects | HVTN 505 case control subjects | Participants from HVTN 505 included in the case-control analysis | drienna | TRUE | 189 | vtn505 |
| 230 | HVTN 505 polyfunctionality vs BAMA | HVTN 505 polyfunctionality vs BAMA | Compares ICS polyfunctionality (CD8+, Any Env) to BAMA mfi-delta (single Env antigen) in the HVTN 505 case control cohort | drienna | TRUE | 170 | vtn505 |

To fetch data from a saved group, create a connection at the project level with a group ID. For example, we can use getGroup to connect to the “NYVAC durability comparison” group, which has group ID 220.

nyvac <- con$getGroup(220)
nyvac
#> <DataSpaceStudy>
#>   Group: NYVAC durability comparison
#>   URL: https://dataspace.cavd.org/CAVD
#>   Available datasets:
#>     - BAMA
#>     - Demographics
#>     - ELISPOT
#>     - ICS
#>     - NAb

Retrieving a dataset is the same as before.

NAb_nyvac <- nyvac$getDataset("NAb")
dim(NAb_nyvac)
#> [1] 4281   29

Reference Tables

The following tables list all fields and methods of DataSpaceConnection and DataSpaceStudy objects and can serve as a quick reference.

DataSpaceConnection

| Name | Description |
|---|---|
| availableStudies | The table of available studies. |
| availableGroups | The table of available groups. |
| getStudy | Create a DataSpaceStudy object by study. |
| getGroup | Create a DataSpaceStudy object by group. |

DataSpaceStudy

| Name | Description |
|---|---|
| study | The study name. |
| group | The group name. |
| availableDatasets | The table of datasets available in the study object. |
| treatmentArm | The table of treatment arm information for the connected study. Not available for all-study connections. |
| studyInfo | Stores the information about the study. |
| getDataset | Get a dataset from the connection. |
| getDatasetDescription | Get variable information. |

Session information

sessionInfo()
#> R version 3.5.2 (2018-12-20)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS Mojave 10.14.2
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] C/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] DataSpaceR_0.6.3
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_1.0.0        assertthat_0.2.1  digest_0.6.18    
#>  [4] R6_2.4.0          jsonlite_1.6      magrittr_1.5     
#>  [7] evaluate_0.13     highr_0.7         httr_1.4.0       
#> [10] stringi_1.3.1     curl_3.3          data.table_1.12.0
#> [13] rmarkdown_1.12    tools_3.5.2       stringr_1.4.0    
#> [16] Rlabkey_2.2.5     xfun_0.5          yaml_2.2.0       
#> [19] compiler_3.5.2    htmltools_0.3.6   knitr_1.22