This vignette is an explanation on how to use NIPTeR. It starts with a short introduction, the second chapter explains the two most important objects in this package, `NIPTSample`

and `NIPTControlgroup`

. The third chapter explains all the analysis functions, the arguments the function take and their output. The last chapter shows some workflow examples.

Non-Invasive Prenatal Testing (NIPT) is a method based on analysis of cell-free fetal (cff) DNA in maternal blood using next-generation sequencing. The basic approach is relatively straightforward: cell-free DNA is isolated from the mothers blood plasma and the obtained DNA fragments are sequenced. Subsequently the number of DNA fragments originating from the different chromosomes is counted, since in case of a fetal trisomy a relative increase of the fraction of reads of the trisomic chromosome is expected. Various analysis methods for trisomy prediction, variation reduction and quality control have been described in the literature. These methods can be combined to improve the sensitivity of the analysis. The NIPTeR package makes these methods available and allows for flexibility in workflow design. NIPTeR uses BAM files as input and calculates chromosomal fractions, which are used for trisomy prediction. These fractions are based on read counts of the different chromosomes. The chromosomes are divided in bins to allow for read count correction.

The atomic unit of analysis in this package is a `NIPTSample`

object

A `NIPTSample`

object is a named list of length 5, having the following fields:

`autosomal_chromosome_reads`

A list containing the bin count matrices for the autosomal chromosomes. If forward and reverse reads are counted separately the list contains two matrices, if the reads are counted together it contains one. In these matrices the columns represent bins and the 22 rows represent the chromosomes. For a small excerpt of such a matrix see Table 1.`correction_status_autosomal_chromosomes`

Correction status of the autosomal chromosomes. Default is*Uncorrected*`sex_chromosome_reads`

List containing bin count matrices for the sex chromosomes. If forward and reverse reads are counted separately the list contains two matrices, if the reads are counted together it contains one. In these matrices the columns represent bins and the two represent rows the chromosomes. For a small excerpt of such a matrix see Table 2.`correction_status_sex_chromosome`

Correction status of the sex chromosomes. Default is*Uncorrected*`sample_name`

The name of the sample. Default is the filename of the .bam it originates from with the path prefix and .bam suffix

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|

1 |
308 | 303 | 270 | 157 | 148 | 99 | 146 | 236 |

2 |
243 | 263 | 262 | 269 | 225 | 225 | 285 | 259 |

3 |
0 | 189 | 287 | 245 | 264 | 251 | 258 | 242 |

4 |
624 | 637 | 257 | 240 | 229 | 241 | 251 | 216 |

5 |
448 | 249 | 274 | 262 | 239 | 205 | 241 | 244 |

6 |
0 | 174 | 213 | 142 | 231 | 324 | 355 | 383 |

7 |
247 | 266 | 234 | 217 | 142 | 88 | 264 | 230 |

8 |
163 | 216 | 244 | 209 | 243 | 255 | 259 | 261 |

1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | |
---|---|---|---|---|---|---|---|---|

X |
0 | 160 | 6 | 266 | 127 | 85 | 182 | 235 |

Y |
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |

The panel of control samples to which the sample of interest can be compared is also modelled in an object, `NIPTControlGroup`

. A control group is a named list and composed of `NIPTSample`

objects. It has the following fields:

`samples`

The`NIPTSample`

objects`correction_status_autosomal_chromosomes`

Correction status of the autosomal chromosomes of the samples. Default is*Uncorrected*`correction_status_sex_chromosomes`

Correction status of the sex chromosomes of the samples. Default is*Uncorrected*`Description`

Description of the control group. This can either be*Generic control group*or*Fitted to sample name*

To load a bam file, use the `bin_bam_sample`

function. The bam file does not need to be sorted, but its recommended to use pre-sorted bam files

`bin_bam_sample(bam_filepath, do_sort = F, separate_strands = F, custom_name = NULL)`

The arguments `bin_bam_sample`

takes:

`bam_filepath`

The location and filename on the file system where the bam file is stored.`do_sort`

Sort the bam file? If the bam is unsorted set to true, but the use of pre-sorted bam files is recommended.`separate_strands`

If set to true, reads from forward and reverse strands are counted and stored separately. This option should be used if you are planning on using regression, since this doubles the number of predictors (F+R) and distributes predictive power more equally over prediction sets since F and R strand from the same chromosome cannot be both in one prediction set.`custom_name`

The sample name. Default samplename is the filename of the bam file without the .bam suffix and filepath prefix.

GC content bias is the correlation between the number of reads mapped to a specific genomic region and the GC content of this region. In NIPTeR, two GC bias correction algorithms have been implemented, the LOESS based method introduced by Chen et al. (2011) and the bin weight based method described by Fan and Quake (2010).

`gc_correct(nipt_object, method, include_XY = F, span = 0.75)`

`nipt_object`

The object that will be corrected. This can either be a`NIPTSample`

or a`NIPTControlGroup`

object.`method`

To select the LOESS based method use “*LOESS*”, to select the bin weights based method use “*bin*”.`include_XY`

Also perform correction on the X and Y chromosomes?`span`

The span for the LOESS fit. Only applicable when LOESS method is used.`ref_genome`

The reference genome used. Either “*hg37*” or “*hg38*” default = “*hg37*”

The output of the gc_correct function is a similar object as the input; when the input is a `NIPTSample`

the output will also be a (corrected) `NIPTSample`

, and the same applies for a `NIPTControlGroup`

object. . After correction the correction status of the object is set to *GC_corrected*.

Code snippit:

```
#Correct NIPTSample object using LOESS method
loess_corrected_sample <- gc_correct(nipt_object = sample_of_interest, method = "LOESS",
include_XY = F, span = 0.75)
#Correct NIPTControlGroup object using bin method
gc_bin_corrected_control_group <- gc_correct(nipt_object = control_group, method = "bin",
include_XY = T)
```

The `matchcontrolgroup`

function determines how well an `NIPTSample`

fits within the `NIPTControlGroup`

and, if needed, makes a subset `NIPTControlGroup`

of length *n*.

```
matchcontrolgroup(nipt_sample, nipt_control_group, mode, n_of_samples,
include_chromosomes = NULL, exclude_chromosomes = NULL)
```

Explanation of the arguments `matchcontrolgroup`

takes:

`nipt_sample`

The`NIPTSample`

object that is the focus of the analysis.`nipt_control_group`

The`NIPTControlGroup`

object used in the analysis.`mode`

The function mode. This can either be “*subset*” or “*report*”. Mode “*subset*” means the return value will be a new`NIPTControlGroup`

object containing*n*samples. When mode “*report*” is used the output is a matrix containing the sum of squares score of the differences between the chromosomal fractions of the sample and the control for every control sample, sorted in increasing score.`n_of_samples`

The length of the resulting NIPTControlGroup. Only applicable if mode “*subset*” is used.`include_chromosomes`

Include potential trisomic chromosomes into the comparison? Default = NULL, meaning chromosomes 13, 18 and 21 are not included.`exclude_chromosomes`

Exclude other autosomal chromosomes besides chromosomes 13, 18 and 21? Default = NULL.

The output for mode *subset* is a new `NIPTControlGroup`

composed of *n* samples. The output for mode *report* is a matrix with a single column containing the sum of squares in ascending order.

Sum_of_squares | |
---|---|

Sample 6 |
6.766e-07 |

Sample 8 |
2.159e-06 |

Sample 1 |
3.101e-06 |

Sample 10 |
3.292e-06 |

Sample 9 |
3.588e-06 |

Sample 3 |
3.839e-06 |

Sample 5 |
4.432e-06 |

Sample 2 |
7.892e-06 |

Sample 7 |
9.737e-06 |

Sample 4 |
1.126e-05 |

Code snippit:

```
#Run matchcontrolgroup in mode "report"
scores_control_group <- matchcontrolgroup(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = gc_LOESS_corrected_control_group,
mode = "report", include_chromosomes = c(13,18))
#Run matchcontrolgroup in mode "subset" and select 50 best matching samples
subset_loess_corrected_control_group <- matchcontrolgroup(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = loess_corrected_control_group,
mode = "subset", n_of_samples = 50)
```

The chi-squared based variation reduction identifies overdispersed bins within the control group and corrects these bins in both the sample of interest and the control group. The function takes in a `NIPTSample`

and a `NIPTControlGroup`

object, both to be corrected. For every corresponding bin in the control group a chi-squared score is calculated and this total score is converted to a normal distribution. Corresponding bins with a normalized score above *chi_cutoff* (default 3.5) are corrected by dividing the number of reads by the total chi-squared score divided by degrees of freedom

`chi_correct(nipt_sample, nipt_control_group, chi_cutoff = 3.5, include_XY = F)`

`nipt_sample`

The`NIPTSample`

object that is the focus of the analysis.`nipt_control_group`

The`NIPTControlGroup`

object used in the analysis.`chi_cutoff`

The normalized cut-off threshold.`include_XY`

Also perform correction on the X and Y chromosomes?

The output of this function is a named list containing two fields, the first field is the corrected sample (`$sample`

) and the second the corrected control group (`$control_group`

).

Code snippit:

```
#Apply chi-squared based variation reduction method
chi_corrected_data <- chicorrect(nipt_sample = gc_LOESS_corrected_sample,
nipt_control_group = subset_loess_corrected_control_group)
#Extract sample and control group
loess_chi_corrected_sample <- chi_corrected_data$sample
subset_loess_chi_corrected_control_group <- chi_corrected_data$control_group
```

In the Z-score approach, introduced by Chiu et al in 2008, the chromosomal fraction of interest of a sample is compared to the chromosomal fractions of interest of the reference samples, the `NIPTControlGroup`

object.

`calculate_z_score(nipt_sample, nipt_control_group, chromo_focus)`

This function takes three arguments:

`nipt_sample`

The`NIPTSample`

object that is the focus of the analysis.`nipt_control_group`

The`NIPTControlGroup`

object used in the analysis.`chromo_focus`

The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.

The output of the function is an object of class `ZscoreResult`

. It is a named list containing seven fields:

`sample_Zscore`

A numeric, The Z score for the sample of interest for the sample of interest.`control_group_statistics`

Named num of length 3, the first field being the mean (name mean), the second field is the standard deviation (name SD) and the third field is the P value of the Shapiro-Wilk test (name Shapiro_P_value).`control_group_Zscores`

Matrix containing the Z scores of the chromosome of interest for all used control samples.`focus_chromosome`

The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.`control_group_sample_names`

The sample names of all control group samples used in the analysis.`correction_status`

The correction status of the control group.`sample_name`

The sample_name of the sample of interest.

Code snippit:

```
#Calculate Z score for chromosome 13
z_score_result_13 <- calculate_z_score(nipt_sample = loess_chi_corrected_sample,
nipt_control_group = subset_loess_chi_corrected_control_group,
chromo_focus = 13)
```

The regression based Z-score builds *n* models with *m* predictors using stepwise regression with forward selection. The models are used to predict the chromosomal fraction of interest, for the sample and for the control group. The observed fractions are then divided by the expected fraction, and Z-scores are calculated over these ratios. The Z-score is calculated by subtracting one from the ratio of the sample and dividing this result by the coefficient of variation. The coefficient of variation (CV) can either be the *Practical* or *Theoretical* CV. The Theoretical CV is the standard error multiplied by the overdispersion. Theoretically, the CV cannot be lower than the standard error of the mean. If it is case the CV is lower than Theoretical CV, then the Theoretical CV is used.

```
perform_regression(nipt_sample, nipt_control_group, chromo_focus,
n_models = 4, n_predictors = 4, exclude_chromosomes = NULL,
include_chromosomes = NULL, use_test_train_set = T,
size_of_train_set = 0.6, overdispersion_rate = 1.15,
force_practical_vc = F)
```

`nipt_sample`

The`NIPTSample`

object that is the focus of the analysis.`nipt_control_group`

The`NIPTControlGroup`

object used in the analysis.`chromo_focus`

The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.`n_models`

Number of linear models to be made. Default setting is 4 models.`n_predictors`

The number of predictors each model contains. Default is 4.`exclude_chromosomes`

Exclude which autosomal chromosomes as potential predictors? Default potential trisomic chromosomes 13, 18 and 21 are exluded.`include_chromosomes`

Include potential trisomic chromosomes? Options are: chromosomes 13, 18 and 21.`use_test_train_set`

Use a test and train set to build the models? Default is TRUE.`size_of_train_set`

The size of the train set expressed in a decimal. Default is 0.6 (60% of the control samples).`overdispersion_rate`

The standard error of the mean is multiplied by this factor.`force_practical_vc`

Ignore the theoretical CV and always use the practical CV.

The output of this function is an object of type `RegressionResult`

, a named list containing:

`prediction_statistics`

A dataframe with 7 rows and a column for every model. Table 3 is an example of such a dataframe.

Prediction_set_1 | Prediction_set_2 | |
---|---|---|

Z_score_sample |
-0.171248675025602 | -0.799303637284865 |

CV |
0.00615122686158584 | 0.00599131304613039 |

cv_types |
Practical_CV | Practical_CV |

P_value_shapiro |
0.583158688871931 | 0.503865450973465 |

Predictor_chromosomes |
11 2 5 12 | 10 19 17 16 |

Mean_test_set |
1.00050469190991 | 1.00011356353248 |

CV_train_set |
0.00335713985926308 | 0.00394053718934269 |

`control_group_Zscores`

A matrix containing the regression based Z-scores for the control sample`focus_chromosome`

The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.`correction_status`

The correction status of the control group autosomes.`control_group_sample_names`

The sample names of the test set group.`models`

List of the summary.lm() output for every model.`potential_predictors`

The total pool of chromosomes where the predictors are selected from.`all_control_group_Z_scores`

Z-scores for every sample using theoretical and practical VCs.`additional_statistics`

Statistics for both the practical and theoretical CVs for every prediction set.

The Normalized Chromosome Value or NCV (Sehnert et al., 2011) method selects a subset of chromosomes to calculate the chromosomal fractions. The ‘best’ subset is the set which yields the lowest coefficient of variation for the chromosomal fractions of the chromosome of interest in the control group. Because a brute force approach is used to determine the best subset, which can be computationally intensive,this method is divided into two functions, `prepare_ncv`

and `calculate_ncv`

. `prepare_ncv`

returns a template object (NCVTemplate) for a given chromosome of interest and the control group used. This template can be used for any number of analyses. If the control group or chromosome of interest changes, a new template must be made.

`calculate_ncv`

uses a NCVTemplate and a NIPTSample to calculate the NCV score for the NIPTSample.

```
prepare_ncv(nipt_control_group, chr_focus, max_elements,
exclude_chromosomes = NULL, include_chromosomes = NULL,
use_test_train_set = T, size_of_train_set = 0.6)
```

`nipt_control_group`

The`NIPTControlGroup`

object used in the analysis.`chr_focus`

The chromosome of interest. Most commonly chromosome 13, 18 or 21. However, every autosomal chromosome can be predicted.`max_elements`

The maximum number of denominator chromosomes.`exclude_chromosomes`

Exclude which autosomal chromosomes as potential predictors? Default potential trisomic chromosomes 13, 18 and 21 are exluded.`include_chromosomes`

Include potential trisomic chromosomes? Options are: chromosomes 13, 18 and 21.`use_test_train_set`

Use a test and train set to to build the models? Default is TRUE.`size_of_train_set`

The size of the train set expressed in a decimal. Default is 0.6 (60% of the control group samples).

The output of this function is an object of class NCVTemplate, a named list.

`denominators`

The set of ‘best’ denominators.`focus_chromosome`

The chromosome of interest used for this`NCVTemplate`

object.`control_group_sample_names`

The sample names of the test set samples.`correction_status`

The correction status of the`NIPTControlGroup`

used.`control_group_Zscores`

The NCV scores for the test set samples.`potential_denominators`

The total pool of denominators the best denominators are selected from.`control_group_statistics`

Named num of length 3, the first field being the mean (name mean), the second field is the standard deviation (name SD) and the third field is the P value of the Shapiro-Wilk test (name Shapiro_P_value).`sample_names_train_set`

The names of the samples used in the train set.`train_set_statistics`

Named num of length 3, the first field being the mean of the train set (name mean), the second field is the standard deviation of the train set (name SD) and the third field is the P value of the Shapiro-Wilk test of the train set (name Shapiro_P_value).`train_set_Zscores`

Matrix containing the Z scores for the train set samples.

`calculate_ncv_score(nipt_sample, ncv_template)`

`nipt_sample`

The`NIPTSample`

object that is the focus of the analysis.`ncv_template`

An object of class NCVTemplate that is produced by function`prepare_ncv`

The output of this function is an object of class NCVResult, which is basically the same as a NCVTemplate object, the only difference being the NCV score of the sample appended to the list.

This functions converts a list of `NIPTSample`

objects to a `NIPTControlGroup`

object and return this object.

`as_control_group(nipt_samples)`

1 `nipt_samples`

A list of `NIPTSample`

objects.

This functions adds `NIPTSample`

objects to an existing control group and returns a new `NIPTControlGroup`

object.

`add_samples_controlgroup(nipt_control_group, samples_to_add)`

`nipt_control_group`

The`NIPTControlGroup`

to add the samples to.`samples_to_add`

A list with sample(s) to add. This always needs to be a list.

This functions removes duplicate samples from the control group based on name. It returns a new `NIPTControlGroup`

object.

`remove_duplicates_controlgroup(nipt_control_group)`

`nipt_control_group`

The`NIPTControlGroup`

object to remove duplicates from.

This function removes a sample from the `NIPTControlGroup`

object by name. Note that this function uses a regular expression, and if more sample_names satisfy the regular expression, they will also be removed. It returns a new `NIPTControlGroup`

object.

`remove_sample_controlgroup(samplename, nipt_control_group)`

`samplename`

samplename Regular expression string. All matching samplenames are removed from the control group.`nipt_control_group`

`NIPTControlGroup`

object where the samples are removed from.

This function computes a regular Z-score for every chromosome of every sample in the `NIPTControlGroup`

object. It returns a named list with diagnostics information.

`diagnose_control_group(nipt_control_group)`

`nipt_control_group`

The`NIPTControlGroup`

to diagnose.

The function returns a named list with 3 fields:

`Z_scores`

A matrix containing Z-scores for every sample and every chromosome.`abberant_scores`

Dataframe with samplename and chromosome of Z-scores outside -3 3 range.`control_group_statistics`

Matrix with mean, standard deviation and P value of Shapiro-Wilk test.

Since NIPT is based on the comparison of the sample of interest to a set of samples with no known aneuploidies, the first thing that should be done before any samples can be analyzed is making this `NIPTControlGroup`

. A `NIPTControlGroup`

is an object and is composed of `NIPTSample`

objects. First, all bam file paths have to be collected in a vector. This tutorial assumes all .bam files are in the same directory. To collect all bam files the following command can be used:

`bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)`

Subsequently, the filepaths must be fed to the function `bin_bam_sample`

. This function loads bam files and converts them to `NIPTSample`

objects. For example:

```
list_NIPT_samples <- lapply(X = bam_filepaths, bin_bam_sample, do_sort = FALSE,
separate_strands = T, custom_name = NULL)
```

In this example we count the forward and reverse reads separately and a pre-sorted bam file is used, so the argument do_sort is ommited since the default is F.

Using the above command all samples will be loaded with the same default settings. To load the files with for instance custom names, store the filenames in a vector of the same length as the bam filepath vector and use `mapply`

. For example:

`list_NIPT_samples <- mapply(FUN = bin_bam_sample, bam_filepaths, name = names, SIMPLIFY = F)`

To convert the list of `NIPTSamples`

to a `NIPTControlGroup`

object use:

`control_group <- as_control_group(nipt_samples = list_NIPT_samples)`

of course, the fastest and simplest way is to wrap the `as_control_group`

function around the loading of samples like this:

```
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = T))
```

or when applying custom names:

```
control_group <- as_control_group(nipt_samples = mapply(FUN = bin_bam_sample, bam_filepaths,
custom_name = names, SIMPLIFY = FALSE))
```

To store the control group on disk for later use:

`saveRDS(object = control_group, file = "/Path/to/directory/control_group.rds")`

So, in conclusion, the data control group preparation script might look something like this:

```
library(NIPTeR)
#Retrieve filenames
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
#Load files and convert to control group
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = FALSE))
#Save control group for later
saveRDS(object = control_group, file = "/Path/to/directory/control_group.rds")
```

Assuming a `NIPTControlGroup`

object has been prepared and stored on disk we can analyse our first sample. The script starts with loading the `NIPTeR`

package:

`library(NIPTeR)`

After that, we can load a sample of interest with the `bin_bam_sample`

function:

`sample_of_interest <- bin_bam_sample(bam_filepath = "/path/to/sample.bam", separate_strands = T)`

```
library(NIPTeR)
#Gather all bam filepaths in a vector. Corresponds to 1a in figure
bam_filepaths <- list.files(path = "/Path/to/bamfiles/", pattern = ".bam", full.names = T)
#Load all bam files using lapply and feed the results to function as_control_group,
#converting the NIPTSamples to a NIPTControlGroup object. Corresponds to 1b in figure
control_group <- as_control_group(nipt_samples = lapply(X = bam_filepaths, bin_bam_sample,
do_sort = F, separate_strands = F))
#apply a gc LOESS correction to all samples. Since this can take up to 30 seconds
#sample, doing this once for a control group and store the results can be rewarding
#in terms of analysis time. Corresponds to 2a in figure
gc_control_group <- gc_correct(nipt_object = control_group, method = "LOESS")
#Retrieve control group diagnostics. Corresponds with 3a in figure
control_group_diagnostics <- diagnose_control_group(nipt_control_group = control_group)
#Retrieve samplenames with an abberant Z score for any chromosome and remove these samples
#from the control group. Corresponds with 3b in figure
abberant_sample_names <- unique(control_group_diagnostics$abberant_scores$Sample_name)
for (i in 1:length(abberant_sample_names)){
control_group <- remove_sample_controlgroup(samplename = abberant_sample_names[i],
nipt_control_group = control_group)
}
#Save the gc_corrected control groups to disk. Corresponds to 4a in figure
saveRDS(object = control_group, file = "/path/to/controlgroup.rds")
```

```
library(NIPTeR)
#Load sample. Corresponds with 1a in figure
sample_of_interest <- bin_bam_sample(bam_filepath = "/Path/to/bamfile.bam", separate_strands = T)
#Load control group. Corresponds with 1b in figure
control_group <- readRDS("/Path/to/control_group.rds")
#Perform a chi-square based variation reduction on new trimmed control group and sample and
#extract data. Corresponds with 2a in figure
chi_data <- chi_correct(nipt_sample = sample_of_interest, nipt_control_group = control_group)
sample_of_interest <- chi_data$sample
control_group <- chi_data$control_group
#Perform regression for chromosome 21 with default settings, so:
#Create 4 models with 4 predictors each
#All chromosomes are potential predictors except the potential trisomic chromosomes 13, 18 and 21
#Use a test and train set where the size of the train set is 60% of the control group
#Assume at least 15% overdispersion in the data
#Dont force practical CV, so if the CV is below 1.15 * standard error of the mean use this as VC
#Corresponds with 3a in figure
regression_score_21 <- perform_regression(nipt_sample = sample_of_interest,
nipt_control_group = control_group, chromo_focus = 21)
### Gather relevant data from the objects on the workspace #######
```