mlbgameday: Data Automation

Kris Eberwein


Creating A Database

The mlbgameday package is able to work with several databases. Many examples use SQLite databases, but for long-term persistent storage, something a bit more stable is useful. Any database supported by the DBI package can be used.

If using more full featured database systems such as MySQL or PostgreSQL, we would have to first create an empty database and sign in to that database with a username and password. For brevity, we will use SQLite.

Below, create an empty database and populate it with one day of data, just to test. In this case, we populate with the first day of the 2018 MLB regular season.


# Create an empty database instance.
con <- dbConnect(RSQLite::SQLite(), dbname = "mlbgameday.sqlite3")

# Load the first day of the 2018 season.
get_payload(start = "2018-03-30", end = "2018-03-30", db_con = con)
## Gathering Gameday data, please be patient...
## Processing data chunk 1 of 1
## Warning: executing %dopar% sequentially: no parallel backend registered
## Transaction complete, disconnecting from the database. 2019-04-02 14:08:43

Check The Default Instance

# Re-connect to database since the get_payload() function closed the connection.
con <- DBI::dbConnect(RSQLite::SQLite(), dbname = "mlbgameday.sqlite3")

# Take a look at our tables.
## [1] "action" "atbat"  "pitch"  "po"     "runner"

Setting Up Automated Data Collection

Once you have a database in-place, you can get started quickly. The mlbgameday package will work if your current database was gathered using the pitchRx package.


# Log into your database and retreive the most recent date.
con <- dbConnect(RSQLite::SQLite(), dbname = "mlbgameday.sqlite3")

# Find the last day of data in the current database.
db_end <- dbGetQuery(con, "SELECT MAX(date) FROM atbat")

# Use the max date +1 as the start date and today -1 for the end date for your new payload.
get_payload(start = as.Date(db_end[1,1]) + 1, end = Sys.Date() - 1, db_con = con)

Task Scheduling

I prefer to pull the day’s data early in the morning (for the day before.) What ever time you choose, you want to consider time zones and allow enough additional time to cover rain delays for late games, as not to miss any information. There are various task scheduling tools, depending on your operating system.