# COVID19.Analytics


## Introduction

The “covid19.analytics” R package allows users to obtain live* worldwide data from the novel Coronavirus Disease originally reported in 2019 (CoViD-19), as published by the JHU CSSE repository [1], and provides basic analysis tools and functions to investigate these datasets.

The goal of this package is to make the latest data promptly available to researchers and the scientific community.

The following sections briefly describe some of the main features of the covid19.analytics package. We strongly recommend users to read our paper “covid19.analytics: An R Package to Obtain, Analyze and Visualize Data from the Corona Virus Disease Pandemic” (https://arxiv.org/abs/2009.01091), where further details about the package are presented and discussed.

## covid19.analytics Main Features

### Data Accessibility

The covid19.data() function allows users to obtain real-time data about the CoViD-19 reported cases from JHU’s CSSE repository, in the following modalities:

• “aggregated” data for the latest day, with a great ‘granularity’ of geographical regions (i.e. cities, provinces, states, countries)

• “time series” data for larger accumulated geographical regions (provinces/countries)

• “deprecated”: we also include the original data style in which these datasets were reported initially.

The datasets also include information about the different categories (status) “confirmed”/“deaths”/“recovered” of the cases reported daily per country/region/city.

This data-acquisition function will first attempt to retrieve the data directly from the JHU repository with the latest updates. If for whatever reason this fails (e.g. problems with the connection), the package will load a preserved “image” of the data, which is not the latest one but will still allow the user to explore this older dataset. In this way, the package offers a more robust and resilient approach to the quite dynamic situation with respect to data availability and integrity.
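This retrieve-or-fallback logic can be sketched with base R’s tryCatch(); the snippet below is a simplified illustration with stand-in functions, not the package’s actual implementation:

```r
# hypothetical sketch of the retrieve-or-fallback strategy described above
get.data.robustly <- function(retrieve.live, load.local.image) {
  tryCatch(retrieve.live(),
           error = function(e) {
             message("live retrieval failed, loading preserved local image...")
             load.local.image()
           })
}

# demo with stand-in functions: the "live" source fails, so the preserved image is used
live.src  <- function() stop("connection problem")
local.img <- function() data.frame(Country.Region = "Uruguay", cases = 42)
fallback.data <- get.data.robustly(live.src, local.img)
```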

#### Data retrieval options

| argument | description |
| --- | --- |
| aggregated | latest number of cases aggregated by country |
| **Time Series data** | |
| ts-confirmed | time series data of confirmed cases |
| ts-deaths | time series data of fatal cases |
| ts-recovered | time series data of recovered cases |
| ts-ALL | all time series data combined |
| **Deprecated data formats** | |
| ts-dep-confirmed | time series data of confirmed cases as originally reported (deprecated) |
| ts-dep-deaths | time series data of deaths as originally reported (deprecated) |
| ts-dep-recovered | time series data of recovered cases as originally reported (deprecated) |
| **Combined** | |
| ALL | all of the above |
| **Time Series data for specific locations** | |
| ts-Toronto | time series data of confirmed cases for the city of Toronto, ON - Canada |
| ts-confirmed-US | time series data of confirmed cases for the US detailed per state |
| ts-deaths-US | time series data of fatal cases for the US detailed per state |

### Data Structure

The TimeSeries data is organized in a specific manner, with a given set of fields or columns, which resembles the following structure:

 “Province.State” “Country.Region” “Lat” “Long” … seq of dates …

#### Using your own data and/or importing new data sets

If you have data structured in a data.frame organized as described above, then most of the functions provided by the “covid19.analytics” package for analyzing TimeSeries data will work with your data. In this way it is possible to add new datasets to the ones that can be loaded using the repositories predefined in this package, and to extend the analysis capabilities to these new datasets.

Be sure also to check the compatibility of these datasets using the Data Integrity and Consistency Checks functions described in the following section.
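As an illustration, a minimal data.frame following this structure could be built as follows; the locations, coordinates and counts below are made up for demonstration, and the date-column names are only meant to mimic the repository’s headers:

```r
# build a minimal data.frame mimicking the TimeSeries structure described above
dates <- format(seq(as.Date("2020-03-01"), as.Date("2020-03-05"), by = "day"), "%m.%d.%y")

my.ts <- data.frame(Province.State = c("", "Ontario"),
                    Country.Region = c("Uruguay", "Canada"),
                    Lat  = c(-32.52, 51.25),
                    Long = c(-55.77, -85.32),
                    stringsAsFactors = FALSE)

# cumulative counts, one column per date
counts <- rbind(c(0, 1, 2, 4, 8),
                c(3, 5, 9, 14, 20))
colnames(counts) <- dates
my.ts <- cbind(my.ts, counts)
```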

### Data Integrity and Consistency Checks

Due to the ongoing and rapidly changing situation with the CoViD-19 pandemic, the reported data has sometimes been detected to change its internal format or even to show some “anomalies” or “inconsistencies” (see https://github.com/CSSEGISandData/COVID-19/issues/).

For instance, in some cumulative quantities reported in the time series datasets, it has been observed that these quantities, instead of continuously increasing, sometimes decrease, which is something that should not happen (see, for instance, https://github.com/CSSEGISandData/COVID-19/issues/2165). We refer to this as an inconsistency of “type II”.

Some negative values have been reported in the data as well, which is also not possible or valid; we call this an inconsistency of “type I”.

When this occurs, it happens at the level of the origin of the dataset, in our case the one obtained from the JHU/CSSE repository [1]. In order to make the user aware of this, we implemented two consistency- and integrity-checking functions:

• consistency.check(): this function attempts to determine whether there are consistency issues within the data, such as negative reported values (inconsistency of “type I”) or anomalies in the cumulative quantities of the data (inconsistency of “type II”)

• integrity.check(): this function determines whether there are integrity issues within the datasets, or changes to the structure of the data

Alternatively, we provide a data.checks() function that will run both checks on a specified dataset.
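The idea behind the two consistency checks can be sketched in base R; the helper below is hypothetical and only illustrates the definitions above, it is not the package’s actual implementation:

```r
# toy sketch of the two consistency checks described above (not the package's code)
find.inconsistencies <- function(x) {
  # x: cumulative counts for one location, ordered by date
  type.I  <- which(x < 0)              # negative values are invalid counts ("type I")
  type.II <- which(diff(x) < 0) + 1    # a cumulative series should never decrease ("type II")
  list(type.I = type.I, type.II = type.II)
}

# made-up series containing both kinds of anomalies
series <- c(0, 2, 5, 5, 4, 7, -1, 9)
chk <- find.inconsistencies(series)
```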

#### Data Integrity

It is highly unlikely that you would face a situation where the internal structure of the data, or its actual integrity, is compromised. But if you think that this is the case, or the integrity.check() function reports it, we urge you to contact the developer of this package (https://github.com/mponce0/covid19.analytics/issues).

#### Data Consistency

Data consistency issues and/or anomalies in the data have been reported several times; see https://github.com/CSSEGISandData/COVID-19/issues/. In most of the cases these are claimed to be misreported data, and usually they represent just an insignificant fraction of the total cases. Having said that, we believe that the user should be aware of these situations, and we recommend using the consistency.check() function to verify the dataset you will be working with.

#### Nullifying Spurious Data

In order to deal with the different scenarios arising from incomplete, inconsistent or misreported data, we provide the nullify.data() function, which will remove any entry in the data suspected of these incongruities. In addition, the function accepts an optional argument stringent=TRUE, which will also prune any incomplete cases (e.g. with NAs present).
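The pruning of incomplete cases performed with stringent=TRUE is analogous in spirit to base R’s complete.cases(); the snippet below is just a sketch with made-up records, not the package’s implementation:

```r
# made-up records with a missing value
records <- data.frame(region = c("A", "B", "C"),
                      cases  = c(10, NA, 25),
                      stringsAsFactors = FALSE)

# keep only the rows without any NAs
clean <- records[complete.cases(records), ]
```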

### Genomics Data

The covid19.analytics package also provides access to a good number of the genomics datasets currently available.

The covid19.genomic.data() function allows users to obtain CoViD-19’s genomics data from NCBI’s databases [3]. The types of genomics data accessible from the package are described in the following table.

| type | description | source |
| --- | --- | --- |
| genomic | a composite list containing different indicators and elements of the SARS-CoV-2 genomic information | https://www.ncbi.nlm.nih.gov/sars-cov-2/ |
| genome | genetic composition of the reference sequence of the SARS-CoV-2 from GenBank | https://www.ncbi.nlm.nih.gov/nuccore/NC_045512 |
| fasta | genetic composition of the reference sequence of the SARS-CoV-2 from a fasta file | https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2?report=fasta |
| ptree | phylogenetic tree as produced by NCBI data servers | https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/precomptree |
| nucleotide / protein | list and composition of nucleotides/proteins from the SARS-CoV-2 virus | https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/ |
| nucleotide-fasta / protein-fasta | FASTA sequences files for nucleotides, proteins and coding regions | https://www.ncbi.nlm.nih.gov/labs/virus/vssi/#/ |
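As an aside, the FASTA content behind the fasta-type entries is plain text: a header line starting with “>” followed by the sequence split across lines. A minimal base-R parsing sketch, using a short made-up sequence rather than real data:

```r
# minimal FASTA parsing sketch (illustrative only; the sequence below is made up)
fasta.txt <- c(">NC_045512.2 example header",
               "ATGTTTGTTT",
               "TTCTTGTTTT")

fasta.header   <- sub("^>", "", fasta.txt[1])
fasta.sequence <- paste(fasta.txt[-1], collapse = "")
nchar(fasta.sequence)   # length of the assembled sequence
```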

Although the package attempts to provide the latest available genomic data, there are a few important details and differences with respect to the reported-cases data. To begin with, the amount of genomic information available is much larger than the data reporting the number of cases, which adds some additional constraints when retrieving this data. In addition to that, the hosting servers for the genomic databases impose certain limits on the rates and amounts of downloads.

In order to mitigate these factors, the covid19.analytics package employs a few different strategies, as summarized below:

• most of the data will be attempted to be retrieved live from the NCBI databases (same as using src='livedata')

• if that is not possible, the package keeps a local version of some of the largest datasets (i.e. genomes, nucleotides and proteins), which might not be up-to-date (same as using src='repo')

• the package will also attempt to obtain the data from a mirror server, with the datasets updated on a regular basis but not necessarily with the latest updates (same as using src='local')

### Analytical & Graphical Indicators

In addition to the access and retrieval of the data, the package includes some basic functions to estimate totals per region/country/city, growth rates and daily changes in the reported number of cases.

### Overview of the Main Functions from the “covid19.analytics” Package

| Function | Description | Main Type of Output |
| --- | --- | --- |
| **Data Acquisition** | | |
| covid19.data | obtain live* worldwide data for the covid19 virus, from JHU’s CSSE repository [1] | return dataframes/list with the collected data |
| covid19.Toronto.data | obtain live* data for covid19 cases in the city of Toronto, ON Canada, from the City of Toronto reports [2] | return dataframe/list with the collected data |
| covid19.US.data | obtain live* US-specific data for the covid19 virus, from JHU’s CSSE repository [1] | return dataframe with the collected data |

### Some basic analysis

#### Summary Report

# a quick function to overview top cases per region for time series and aggregated records
report.summary()

# save the tables into a text file named 'covid19-SummaryReport_CURRENTDATE.txt'
# where *CURRENTDATE* is the actual date
report.summary(saveReport=TRUE)

# summary report for a specific location with the default number of entries
report.summary(geo.loc="Canada")

# summary report for a specific location with the top 5 entries
report.summary(Nentries=5, geo.loc="Canada")

# it can combine several locations
report.summary(Nentries=30, geo.loc=c("Canada","US","Italy","Uruguay","Argentina"))

#### Totals per Country/Region/Province

# totals for confirmed cases for "Ontario"
tots.per.location(covid19.confirmed.cases,geo.loc="Ontario")

# totals for confirmed cases for "Canada"
tots.per.location(covid19.confirmed.cases,geo.loc="Canada")

# total nbr of deaths for "China"
tots.per.location(covid19.TS.deaths,geo.loc="China")

# total nbr of confirmed cases in Hubei including a confidence band based on a moving average
tots.per.location(covid19.confirmed.cases,geo.loc="Hubei", confBnd=TRUE)

The figures show the total number of cases for different cities (provinces/regions) and countries: on the upper panel in log-scale, with a linear fit to an exponential law, and in linear scale on the bottom panel. Details about the models are included in the plot, in particular the growth rate, which in several cases appears to be around 1.2+, as predicted by some models. Notice that in the case of Hubei the value is closer to 1, as the dispersion of the virus has reached its logistic asymptote, while in other cases (e.g. Germany and Italy, for the presented dates) it is still well above 1, indicating exponential growth.

IMPORTANT Please notice that the “linear exponential” modelling function implements a simple (naive) and straightforward linear regression model, which is not optimal for exponential fits. The reason is that the errors for large values of the dependent variable weigh much more than those for small values when applying the exponential function to go back to the original model. Nevertheless it is OK for the sake of a quick interpretation, but one should bear in mind the implications of this simplification.

We also provide two additional models, as shown in the figures above, using the Generalized Linear Model glm() function with Poisson and Gamma family functions. In particular, the tots.per.location() function will determine when it is possible to automatically generate each model, display the information in the plot, and report the details of the models in the console.

# read the time series data for all the cases
all.data <- covid19.data('ts-ALL')

# run on all the cases
tots.per.location(all.data,"Japan")

It is also possible to run the tots.per.location() (and growth.rate()) functions on the whole dataset, for which a quite large but complete mosaic figure will be generated, e.g.

# total for death cases for "ALL" the regions
tots.per.location(covid19.TS.deaths)

# or just
tots.per.location(covid19.data("ts-confirmed"))

#### Growth Rate

# read time series data for confirmed cases
TS.data <- covid19.data("ts-confirmed")

# compute changes and growth rates per location for all the countries
growth.rate(TS.data)

# compute changes and growth rates per location for 'Italy'
growth.rate(TS.data,geo.loc="Italy")

# compute changes and growth rates per location for 'Italy' and 'Germany'
growth.rate(TS.data,geo.loc=c("Italy","Germany"))

The previous figures show on the upper panel the number of changes on a daily basis in linear scale (thin line, left y-axis) and log scale (thicker line, right y-axis), while the bottom panel displays the growth rate for the given country/region/city.
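For intuition, one common way to define such quantities is to take the daily changes as the differences of the cumulative series, and the growth rate as the ratio between consecutive daily changes; the base-R sketch below illustrates this idea with made-up numbers (it is not necessarily the package’s exact definition):

```r
# sketch: daily changes and growth-rate estimates from a cumulative series (made-up data)
cumulative    <- c(1, 2, 4, 8, 12, 14, 15)
daily.changes <- diff(cumulative)                                # new cases per day
growth        <- daily.changes[-1] / head(daily.changes, -1)     # ratio of consecutive changes
round(growth, 2)
```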

Combining multiple geographical locations:

# obtain Time Series data
TSconfirmed <- covid19.data("ts-confirmed")

# explore different combinations of regions/cities/countries
# when combining different locations, heatmaps will also be generated comparing the trends among these locations

growth.rate(TSconfirmed,geo.loc=c("Hubei","Italy","Spain","US","Canada","Ontario","Quebec","Uruguay"))

#### Visualization Tools

# retrieve time series data
TS.data <- covid19.data("ts-ALL")

# static and interactive plot
totals.plt(TS.data)
# totals for Ontario and Canada, without displaying totals and one plot per page
totals.plt(TS.data, c("Canada","Ontario"), with.totals=FALSE, one.plt.per.page=TRUE)

# totals for Ontario, Canada, Italy and Uruguay; including global totals with the linear and semi-log plots arranged one next to the other
totals.plt(TS.data, c("Canada","Ontario","Italy","Uruguay"), with.totals=TRUE, one.plt.per.page=FALSE)

# totals for all the locations reported on the dataset, interactive plot will be saved as "totals-all.html"
totals.plt(TS.data, "ALL", fileName="totals-all")
# retrieve aggregated data
data <- covid19.data("aggregated")

# interactive map of aggregated cases -- with more spatial resolution
live.map(data)

# or
live.map()

# interactive map of the time series data of the confirmed cases with less spatial resolution, i.e. aggregated by country
live.map(covid19.data("ts-confirmed"))

Interactive examples can be seen at https://mponce0.github.io/covid19.analytics/

#### Simulating the Virus Spread

# read time series data for confirmed cases
data <- covid19.data("ts-confirmed")

# run a SIR model for a given geographical location
generate.SIR.model(data,"Hubei", t0=1,t1=15)
generate.SIR.model(data,"Germany",tot.population=83149300)
generate.SIR.model(data,"Uruguay", tot.population=3500000)
generate.SIR.model(data,"Ontario",tot.population=14570000)

# the function will aggregate data for a geographical location, like a country with multiple entries
generate.SIR.model(data,"Canada",tot.population=37590000)

# modelling the spread for the whole world, storing the model and generating an interactive visualization
world.SIR.model <- generate.SIR.model(data,"ALL", t0=1,t1=15, tot.population=7.8e9, staticPlt=FALSE)
# plotting and visualizing the model
plt.SIR.model(world.SIR.model,"World",interactiveFig=TRUE,fileName="world.SIR.model")
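For reference, the classic SIR model behind these functions couples the susceptible (S), infected (I) and recovered (R) populations through dS/dt = -beta S I / N, dI/dt = beta S I / N - gamma I and dR/dt = gamma I. The base-R sketch below integrates these equations with a simple Euler scheme and made-up beta and gamma values, just to illustrate the dynamics; the package itself fits the model to the actual data:

```r
# minimal Euler-step integration of the SIR equations (illustrative only)
sir.sim <- function(N, I0, beta, gamma, days, dt = 0.1) {
  S <- N - I0; I <- I0; R <- 0
  for (step in seq_len(days / dt)) {
    dS <- -beta * S * I / N
    dI <-  beta * S * I / N - gamma * I
    dR <-  gamma * I
    S <- S + dS * dt; I <- I + dI * dt; R <- R + dR * dt
  }
  c(S = S, I = I, R = R)
}

# made-up parameters: R0 = beta/gamma = 2.5
final.state <- sir.sim(N = 1e6, I0 = 10, beta = 0.5, gamma = 0.2, days = 120)
round(final.state)
```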

## References

(*) Data can be up to 24 hours delayed with respect to the latest updates.

[1] 2019 Novel CoronaVirus CoViD-19 (2019-nCoV) Data Repository by Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE) https://github.com/CSSEGISandData/COVID-19

[2] COVID-19: Status of Cases in Toronto – City of Toronto https://www.toronto.ca/home/covid-19/covid-19-latest-city-of-toronto-news/covid-19-status-of-cases-in-toronto/

[3] Severe acute respiratory syndrome coronavirus 2 isolate Wuhan-Hu-1, complete genome NCBI Reference Sequence: NC_045512.2 https://www.ncbi.nlm.nih.gov/nuccore/NC_045512.2

### How to Cite this Package

If you are using this package please cite our main publication about the covid19.analytics package:

https://arxiv.org/abs/2009.01091

You can also ask for this citation information in R:

> citation("covid19.analytics")

To cite covid19.analytics in publications use:

Marcelo Ponce, Amit Sandhel (2020). covid19.analytics: An R Package
to Obtain, Analyze and Visualize Data from the Corona Virus Disease
Pandemic. URL https://arxiv.org/abs/2009.01091

A BibTeX entry for LaTeX users is

@Article{,
title = {covid19.analytics: An R Package to Obtain, Analyze and Visualize Data from the Corona Virus Disease Pandemic},
author = {Marcelo Ponce and Amit Sandhel},
journal = {pre-print},
year = {2020},
url = {https://arxiv.org/abs/2009.01091},
}

## Further Resources
