epitweetr: user documentation

European Centre for Disease Prevention and Control (ECDC)

Description

The epitweetr package allows you to automatically monitor trends of tweets by time, place and topic. This automated monitoring aims at the early detection of public health threats through the detection of signals (e.g. an unusual increase in the number of tweets for a specific time, place and topic). The epitweetr package was designed to focus on infectious diseases, but it can be extended to all hazards or other fields of study by modifying the topics and keywords.

The general principle behind epitweetr is that it collects tweets and related metadata from the Twitter Standard Search API version 1.1 (https://developer.twitter.com/en/docs/twitter-api/v1/tweets/search/overview) and the Twitter Recent Search API version 2.0 (https://developer.twitter.com/en/docs/twitter-api/tweets/search/api-reference/get-tweets-search-recent) according to specified topics, and stores these tweets on your computer in a database that can be used to calculate statistics or as a search engine. epitweetr geolocalises the tweets and collects information on keywords, URLs and hashtags within a tweet, as well as entities and context detected by the Twitter API 2.0. Tweets are aggregated according to topic and geographical location. Next, a signal detection algorithm identifies when the number of tweets (by topic and geographical location) exceeds what is expected for a given day. If the number of tweets exceeds what is expected, epitweetr sends out email alerts to notify those who need to further investigate these signals following the epidemic intelligence processes (filtering, validation, analysis and preliminary assessment).

The package includes an interactive web application (Shiny app) with six pages:

  • The dashboard, where you can visualise and explore tweets (Fig 1)

  • The alerts page, where you can view the current alerts and train machine learning models for alert classification on user-defined categories (Fig 2)

  • The geotag page, where you can evaluate the geolocation algorithm and provide annotations for improving its performance (Fig 3)

  • The data protection page, where you can search, anonymise and delete tweets from the epitweetr database to support data deletion requests (Fig 4)

  • The configuration page, where you can change settings and check the status of the underlying processes (Fig 5)

  • The troubleshoot page, with automatic checks and hints for using epitweetr with all its functionalities (Fig 6)

On the dashboard, you can view the aggregated number of tweets over time, the location of these tweets on a map and the most frequent elements found in or extracted from these tweets (words, hashtags, URLs, contexts and entities). These visualisations can be filtered by the topic, location and time period you are interested in. Other filters let you adjust the time unit of the timeline, whether retweets/quotes should be included, which geolocation types you are interested in, the sensitivity of the prediction interval for the signal detection, and the number of days used to calculate the threshold for signals. This information can also be downloaded directly from this interface in the form of data, pictures, and/or reports.

More information on the methodology used is available here

Shiny app dashboard:

Fig 1: Shiny app dashboard figure

Shiny app alerts page:

Fig 2: Shiny app alerts page

Shiny app geotag evaluation page:

Fig 3: Shiny app geotag evaluation page

Shiny app data protection page:

Fig 4: Shiny app data protection page

Shiny app configuration page:

Fig 5: Shiny app configuration page

Shiny app troubleshoot page:

Fig 6: Shiny app troubleshoot page

Background

Epidemic Intelligence at ECDC

Article 3 of the European Centre for Disease Prevention and Control (ECDC) Founding Regulation and Decision No 1082/2013/EU on serious cross-border threats to health have established the detection of public health threats as a core activity of ECDC.

ECDC performs Epidemic Intelligence (EI) activities aiming at rapidly detecting and assessing public health threats, focusing on infectious diseases, to ensure the EU’s health security. ECDC uses social media as one of its sources for the early detection of signals of public health threats. Until 2020, the monitoring of social media was mainly performed through the screening and analysis of posts from pre-selected experts or organisations, mainly on Twitter and Facebook.

More information and an online tutorial are available:

EI sources

EI tutorial

Objectives of epitweetr

The primary objective of epitweetr is to use the Twitter Standard Search API version 1.1 and Twitter Recent Search API version 2 in order to detect early signals of potential threats by topic and by geographical unit.

Its secondary objective is to enable the user to explore, through an interactive web interface, the trend of tweets by time, geographical location and topic, including information on top words and numbers of tweets from trusted users, using charts and tables.

Hardware requirements

The minimum and suggested hardware requirements for the computer are in the table below:

Hardware requirements                 Minimum   Suggested
RAM needed                            8 GB      16 GB
CPU needed                            4 cores   12 cores
Space needed for 3 years of storage   3 TB      5 TB

The CPU and RAM usage can be configured on the Shiny app configuration page (see section The interactive user application (Shiny app)>The configuration page). The RAM, CPU and space needed depend on the number and size of the topics you request in the collection process.

Installation

epitweetr is designed to be platform independent, working on Windows, Linux and Mac. We recommend that you use epitweetr on a computer that can run continuously. You can switch the computer off, but you may miss some tweets if the downtime is long enough, which will have implications for the alert detection.

If you need to reinstall epitweetr after activating its tasks, you must restart the machine running epitweetr first.

Before using epitweetr, the following items need to be installed:

Prerequisites for running epitweetr

Prerequisites for some of the functionalities in epitweetr

Extra prerequisites for R developers

If you would like to develop epitweetr further, then the following development tools are needed:

External dependencies

epitweetr will need to download some dependencies in order to work. The tool will do this automatically the first time the alert detection process is launched. The Shiny app configuration page will allow you to change the target URLs of these dependencies, which are the following:

Please note that during the dependency download you will be prompted first to stop the embedded database and then to enable it again. If you are on Windows and you have activated the tasks using the ‘activate’ buttons on the configuration page, you can perform these tasks by disabling and enabling the tasks in the ‘Windows Task Scheduler’. For more information, see the section ‘Setting up tweet collection and the alert detection loop’.

Installing epitweetr from CRAN

After installing all required dependencies listed in the section “Prerequisites for running epitweetr”, you can install epitweetr by running install.packages("epitweetr") in the R console.

Environment variables

Additionally, the R environment needs to know where the Java installation home is. To check this, type Sys.getenv("JAVA_HOME") in the R console.

If the command returns nothing or an empty string, you will need to set the Java Home environment variable (JAVA_HOME) for your operating system (OS); please see the instructions for your specific OS. In some cases, epitweetr can work without setting the Java Home environment variable.

The first time you run the application, if the tool cannot identify a secure password store provided by the operating system, you will see a pop-up window requesting a keyring password (Linux and Mac). This is a password necessary for storing encrypted Twitter credentials. Please choose a strong password and remember it. You will be asked for this password each time you run the tool. You can avoid this by setting a system environment variable named ecdc_twitter_tool_kr_password containing the chosen password.
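As a rough illustration, both variables mentioned in this section can be inspected from the R console. The helper name env_is_set is hypothetical and not part of epitweetr; JAVA_HOME is the conventional name of the Java Home variable, and ecdc_twitter_tool_kr_password is the name given above. The check only reports whether each variable is non-empty and never prints its value.

```r
# Hypothetical helper (not part of epitweetr): TRUE when the named
# environment variable is defined and non-empty in this R session.
env_is_set <- function(name) {
  nzchar(Sys.getenv(name, unset = ""))
}

# Report the status of the two variables discussed in this section
# without printing their values (the keyring password is a secret).
for (name in c("JAVA_HOME", "ecdc_twitter_tool_kr_password")) {
  status <- if (env_is_set(name)) "set" else "not set"
  message(name, ": ", status)
}
```

If either variable is reported as not set, define it at the operating system level and restart R so that the new value is picked up.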

Launching the epitweetr Shiny app

You can launch the epitweetr Shiny app from an R session by calling epitweetr::epitweetr_app("data_dir") in the R console. Replace “data_dir” with the designated data directory, which is a local folder you choose to store tweets, time series and configuration files in.

Please note that the data directory entered in R should use ‘/’ instead of ‘\’ (an example of a correct path would be ‘C:/user/name/Documents’). This applies especially on Windows if you copy the path from the File Explorer.
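A minimal sketch of preparing a Windows path before launching the app. The helper to_forward_slashes is hypothetical and not part of epitweetr; epitweetr::epitweetr_app() is the launch function named in this section, and the example path is the one given above. The launch call is guarded so that it only runs in an interactive session with epitweetr installed.

```r
# Hypothetical helper (not part of epitweetr): replace every backslash
# in a path with a forward slash.
to_forward_slashes <- function(path) {
  gsub("\\", "/", path, fixed = TRUE)
}

# A path copied from the Windows File Explorer; in R source code each
# backslash has to be written as "\\".
data_dir <- to_forward_slashes("C:\\user\\name\\Documents")
print(data_dir)

# Launch the Shiny app only when running interactively and when the
# package is available.
if (interactive() && requireNamespace("epitweetr", quietly = TRUE)) {
  epitweetr::epitweetr_app(data_dir)
}
```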

Alternatively, you can use a launcher. In an executable .bat or shell file, type the following (replacing “data_dir” with the designated data directory):

R --vanilla -e "epitweetr::epitweetr_app('data_dir')"

You can check that all requirements are properly installed in the troubleshoot page. More information is available in section The interactive user application (Shiny app)>Dashboard:The interactive user interface for visualisation>The troubleshoot page

Migrating to epitweetr v2

Migrating epitweetr from previous versions to version 2.0 or higher is possible without any data loss. This section describes the steps needed to perform the migration.

This migration is not necessary if you are installing epitweetr for the first time.

In epitweetr v2, we have redesigned how tweets and series are stored. In previous versions, tweets were saved as compressed JSON files and series as RDS data frames in the ‘tweets’ and ‘series’ folders, respectively. We have moved to a different storage system that allows epitweetr to work as a search engine and supports efficient updates, deletions and faster aggregation. To achieve this, data are stored using Apache Lucene indexes in the ‘fs’ folder. Note that during migration, Twitter data are moved to the ‘fs’ folder and series are left as they are. epitweetr reports will combine data from the old and new storage systems.

If you have an existing installation that contains data in the previous format, you have to migrate it following the steps detailed in this section. This applies to any epitweetr version before v2.0.0. You can also check this by looking in ‘tweets/geo’ or ‘tweets/search’ folders. If there is a json.gz file, migration is needed.
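The folder check described above can be sketched in R as follows. The helper names find_legacy_tweet_files and needs_migration are hypothetical and not part of epitweetr, and searching sub-folders recursively is an assumption about how the old files may be laid out. The demonstration runs against a throwaway directory rather than a real installation.

```r
# Hypothetical helper (not part of epitweetr): list files in the pre-v2
# format, i.e. any *.json.gz file under tweets/geo or tweets/search.
find_legacy_tweet_files <- function(data_dir) {
  legacy_dirs <- file.path(data_dir, "tweets", c("geo", "search"))
  list.files(
    legacy_dirs[dir.exists(legacy_dirs)],
    pattern = "\\.json\\.gz$",
    recursive = TRUE,  # assumption: files may sit in sub-folders
    full.names = TRUE
  )
}

# TRUE when at least one old-format file is present.
needs_migration <- function(data_dir) {
  length(find_legacy_tweet_files(data_dir)) > 0
}

# Demonstration on a temporary directory that mimics an old installation.
demo_dir <- file.path(tempdir(), "epitweetr_demo")
dir.create(file.path(demo_dir, "tweets", "search"),
           recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_dir, "tweets", "search", "example.json.gz"))
print(needs_migration(demo_dir))
```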

The migration steps are the following:

Setting up tweet collection and the alert detection loop

In order to use epitweetr, you will need to collect and process tweets, run the epitweetr database and run the requirements and alerts pipeline. Further details are also available in subsequent sections of the user documentation. A summary of the steps needed is as follows:

For more details you can go through the section How does it work? General architecture behind epitweetr, which describes the underlying processes behind the tweet collection and the signal detection. Also, the section “The interactive Shiny application (Shiny app)>The configuration page” describes the different settings on the configuration page.

How does it work? General architecture behind epitweetr

The following sections describe in detail the above general principles. The settings of many of these elements can be configured in the Shiny app configuration page, which is explained in the section The interactive Shiny application (Shiny app)>The configuration page.

Collection of tweets

Use of the Twitter Standard Search API version 1.1 and Twitter Recent Search API version 2.0

epitweetr uses the Twitter Standard Search API version 1.1 and/or the Twitter Recent Search API version 2.0. The advantage of these APIs is that Twitter provides them free of charge. The search API is not meant to be an exhaustive source of tweets: it searches against a sample of recent tweets published in the past 7 days and focuses on relevance rather than completeness. This means that some tweets and users may be missing from search results.

While this may be a limitation in other fields of public health or research, the epitweetr development team believe that for the objective of signal detection a sample of tweets is sufficient to detect potential threats of importance in combination with other types of sources.

Other attributes of the Twitter Standard Search API version 1.1 include:

  • Only tweets from the last 5–8 days are indexed by Twitter

  • A maximum of 180 requests every 15 minutes are supported by the Twitter Standard Search API (450 requests every 15 minutes if you are using the Twitter developer app credentials; see next section)

  • Each request returns a maximum of 100 tweets and/or retweets

Other attributes of the Twitter Recent Search API version 2.0 include:

  • Only tweets from the last seven days are indexed by Twitter

  • A maximum of 300 requests every 15 minutes are supported

  • Each request returns a maximum of 100 tweets and/or retweets

  • A maximum of 500,000 tweets per month can be retrieved at the Essential access level

If you are using both endpoints, epitweetr will alternate between them when the limits are hit.
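To put the limits listed above into perspective, the following sketch works out the resulting upper bounds. It only uses the figures quoted in this section; the variable names are illustrative, and real throughput will be lower whenever requests return fewer than 100 tweets.

```r
# Figures quoted in the lists above.
tweets_per_request <- 100
requests_per_window <- c(
  v1_user_account  = 180, # Standard Search API 1.1, Twitter account
  v1_developer_app = 450, # Standard Search API 1.1, developer app
  v2_recent_search = 300  # Recent Search API 2.0
)

# Upper bound on tweets retrievable per 15-minute window.
max_tweets_per_window <- requests_per_window * tweets_per_request
print(max_tweets_per_window)

# Number of full requests that would use up the monthly v2 allowance
# at the Essential access level.
monthly_cap_v2 <- 500000
full_requests_to_cap <- monthly_cap_v2 / tweets_per_request
print(full_requests_to_cap)

# Number of saturated 15-minute windows needed to reach that cap.
windows_to_cap <- monthly_cap_v2 / max_tweets_per_window[["v2_recent_search"]]
print(windows_to_cap)
```

Under these assumptions, the monthly allowance of version 2.0 could be used up in about 17 saturated windows, a little over four hours of collection at the maximum rate, which is one reason to keep topics and queries focused.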

Twitter authentication

You can authenticate the collection of tweets by using a Twitter account (this approach utilises the rtweet package app) or by using a Twitter application. For the latter, you will need a Twitter developer account, which can take some time to obtain, due to verification procedures. We recommend using a Twitter account via the rtweet package for testing purposes and short-term use, and the Twitter developer application for long-term use.

  • Using a Twitter account: delegated via rtweet (user authentication)

    • You will need a Twitter account (username and password)

    • The rtweet package will send a request to Twitter, so it can access your Twitter account on your behalf

    • A pop-up window will appear where you can enter your Twitter user name and password to confirm that the application can access Twitter on your behalf. The resulting token is sent each time you access tweets. If you are already logged in to Twitter, this pop-up window may not appear and the credentials of the ‘active’ Twitter account on the machine will be used automatically

    • You can only use Twitter API version 1.1

  • Using a Twitter developer app: via epitweetr (app authentication)

    • If you have not done so already, you will need to create a Twitter developer account: https://developer.twitter.com/en/docs/twitter-api/getting-started/getting-access-to-the-twitter-api

    • Follow the instructions and answer the questions to activate the Twitter API v2 using Essential access.

    • Next, you will create a Project and an associated developer App during the onboarding process, which will provide you with a set of credentials that you will use to authenticate all requests to the API.

    • Make a note of your OAuth settings

      • Add them to the configuration page in the Shiny app (see image below)

      • With this information, epitweetr can request a token directly from Twitter at any time. The advantage of this method is that the token is not connected to any user information and tweets are returned independently of any user context.

      • With this app, you can perform 450 requests every 15 minutes instead of the 180 requests every 15 minutes that a Twitter account allows.

      • You can activate Twitter API version 2.0 on the configuration page

Topics and tweet collection queries

After the Twitter authentication, you need to specify a list of topics in epitweetr to indicate which tweets to collect. For each topic, you have one or more queries that epitweetr uses to collect the relevant tweets (e.g. several queries for a topic using different terminology and/or languages).

A query consists of keywords and operators that are used to match tweet attributes. Keywords separated by a space indicate an AND clause. You can also use an OR operator. A minus sign before a keyword (with no space between the sign and the keyword) indicates that the keyword should not be in the tweet attributes. While queries can be up to 512 characters long, best practice is to limit each query to 10 keywords and operators and to keep its complexity low. If a topic needs more than this, split it into several queries.
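The query rules above can be illustrated with a small sketch. The helpers build_query and check_query are hypothetical and not part of epitweetr, and the keywords are purely illustrative. The check simply counts whitespace-separated keywords and operators and compares them with the 512-character limit and the suggested limit of 10.

```r
# Hypothetical helper (not part of epitweetr): join alternative keywords
# with OR and append excluded keywords, each prefixed by a minus sign.
build_query <- function(any_of, exclude = character(0)) {
  included <- paste(any_of, collapse = " OR ")
  excluded <- if (length(exclude) > 0) {
    paste0("-", exclude, collapse = " ")
  } else {
    ""
  }
  trimws(paste(included, excluded))
}

# Hypothetical helper: compare a query with the limits described above.
check_query <- function(query, max_chars = 512, max_tokens = 10) {
  tokens <- strsplit(query, "\\s+")[[1]]
  list(
    n_chars = nchar(query),
    n_tokens = length(tokens),
    within_limits = nchar(query) <= max_chars && length(tokens) <= max_tokens
  )
}

# Illustrative topic query in two languages with one excluded keyword.
query <- build_query(c("measles", "rougeole"), exclude = "vaccine")
print(query)
str(check_query(query))
```

A query that fails this check is a candidate for being split into several queries under the same topic.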

epitweetr comes with a default list of topics as used by the ECDC Epidemic Intelligence team at the date of package generation (15th of December, 2021). You can view details of the list of topics on the Shiny app configuration page (see screenshot below). In addition, the colour coding in the downloadable file allows users to see whether the query for a topic is too long (red colour), in which case the topic should be split into several queries.