Get started with vroom

The vroom package contains one main function, vroom(), which is used to read all types of delimited files. A delimited file is any file in which the data is separated (delimited) by one or more characters.

The most common types of delimited files are CSV (Comma Separated Values) and TSV (Tab Separated Values) files, which typically have a .csv or .tsv suffix, respectively.

library(vroom)

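This vignette covers the following topics:

- Reading files, including multiple, compressed and remote files
- Selecting columns with col_select
- Reading fixed width files
- Specifying column types
- Repairing column names
- Writing delimited and compressed files
- Reading from standard input and writing to standard output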

Reading files

To read a CSV or other type of delimited file with vroom, pass the file to vroom(). The delimiter will be guessed automatically if it is a common one, e.g. "," "\t" " " "|" ":" ";". If the guessing fails, or you are using a less common delimiter, specify it with the delim parameter (e.g. delim = ",").

We have included an example CSV file in the vroom package for use in examples and tests. Access it with vroom_example("mtcars.csv").

# See where the example file is stored on your machine
file <- vroom_example("mtcars.csv")
file
#> [1] "/private/var/folders/9x/_8jnmxwj3rq1t90mlr6_0k1w0000gn/T/RtmpyUZfc6/Rinst4e3033160a6d/vroom/extdata/mtcars.csv"

# Read the file. By default, vroom guesses the delimiter automatically.
vroom(file)
#> Rows: 32
#> Columns: 12
#> Delimiter: ","
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# You can also specify it explicitly, which is (slightly) faster, and safer if
# you know how the file is delimited.
vroom(file, delim = ",")
#> Rows: 32
#> Columns: 12
#> Delimiter: ","
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows
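When the delimiter is not one of the commonly guessed characters, it has to be supplied. The following is a minimal sketch of that case; the "~" delimiter, the temporary file and its contents are all made up for illustration.

```r
library(vroom)

# Hypothetical example data delimited by "~", a character that is not
# in the list of commonly guessed delimiters.
tmp <- tempfile(fileext = ".txt")
writeLines(c("id~fruit", "1~apple", "2~banana", "3~cherry"), tmp)

# Supplying `delim` skips the guessing step; supplying `col_types`
# also quiets the column specification message.
fruit <- vroom(tmp, delim = "~", col_types = c(id = "i", fruit = "c"))
fruit
```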

Reading multiple files

If you are reading a set of files which all have the same columns, you can pass the filenames directly to vroom() and it will combine them into one result.

First we will create some files to read by splitting the mtcars dataset by number of cylinders (it is OK if you don’t currently understand this code).

mt <- tibble::rownames_to_column(mtcars, "model")
purrr::iwalk(
  split(mt, mt$cyl),
  ~ vroom_write(.x, glue::glue("mtcars_{.y}.csv"), "\t")
)

Then we can efficiently read them into one table by passing the filenames directly to vroom.

files <- fs::dir_ls(glob = "mtcars*csv")
files
#> mtcars_4.csv mtcars_6.csv mtcars_8.csv
vroom(files)
#> Rows: 32
#> Columns: 12
#> Delimiter: "\t"
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 12
#>   model        mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>      <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Datsun 710  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#> 2 Merc 240D   24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3 Merc 230    22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> # … with 29 more rows

Often the filename or the directory where the files are stored contains information. The id parameter can be used to add an extra column to the result with the full path to each file (in this case we name the column path).

vroom(files, id = "path")
#> Rows: 32
#> Columns: 13
#> Delimiter: "\t"
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 13
#>   path   model   mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 mtcar… Dats…  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#> 2 mtcar… Merc…  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3 mtcar… Merc…  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> # … with 29 more rows
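Once the path is available as a column, information encoded in the file name can be pulled back out. The sketch below is one way to do that with base R; the scratch directory, the mtcars_<cyl>.tsv naming scheme and the cyl_from_path column are assumptions made for this example.

```r
library(vroom)

# Write one tab-delimited file per cylinder count into a scratch directory.
# The directory and file naming scheme are assumptions made for this sketch.
dir <- tempfile("vroom-id-example-")
dir.create(dir)
mt <- data.frame(
  model = rownames(mtcars), mtcars,
  row.names = NULL, stringsAsFactors = FALSE
)
for (cyl in unique(mt$cyl)) {
  vroom_write(mt[mt$cyl == cyl, ], file.path(dir, paste0("mtcars_", cyl, ".tsv")))
}

# Read all of the files at once, recording the source file in `path`.
files <- list.files(dir, pattern = "^mtcars_[0-9]+[.]tsv$", full.names = TRUE)
combined <- vroom(files, id = "path", col_types = list())

# Recover the cylinder count encoded in each file name.
combined$cyl_from_path <- as.numeric(
  sub("^mtcars_([0-9]+)[.]tsv$", "\\1", basename(combined$path))
)
```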

Reading compressed files

vroom supports reading zip, gz, bz2 and xz compressed files automatically. Just pass the filename of the compressed file to vroom().

file <- vroom_example("mtcars.csv.gz")

vroom(file)
#> Rows: 32
#> Columns: 12
#> Delimiter: ","
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

vroom() decompresses, indexes and writes the decompressed data to a file in the temp directory in a single stream. The temporary file is used to lazily look up the values and will be automatically cleaned up when all values in the object have been fully read, the object is removed, or the R session ends.

Reading individual files from a multi-file zip archive

If you are reading a zip file that contains multiple files with the same format, you can use a wrapper function like this:

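read_all_zip <- function(file, ...) {
  # List the archive's contents without extracting anything
  filenames <- unzip(file, list = TRUE)$Name
  # Open a connection to each member and read them all into one table
  vroom(purrr::map(filenames, ~ unz(file, .x)), ...)
}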

Reading remote files

vroom can read files directly from the internet as well by passing the URL of the file to vroom.

file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv"
vroom(file)

It can even read gzipped files from the internet (although not the other compressed formats).

file <- "https://raw.githubusercontent.com/r-lib/vroom/master/inst/extdata/mtcars.csv.gz"
vroom(file)

Column selection

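vroom provides the same interface for column selection and renaming as dplyr::select(). This provides very efficient, flexible and readable selections. For example, you can select columns by name, by position, or with selection helpers such as starts_with(), and you can rename columns as you select them: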

file <- vroom_example("mtcars.csv.gz")

vroom(file, col_select = c(model, cyl, gear))
#> Rows: 32
#> Columns: 3
#> Delimiter: ","
#> chr [1]: model
#> dbl [2]: cyl, gear
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 3
#>   model           cyl  gear
#>   <chr>         <dbl> <dbl>
#> 1 Mazda RX4         6     4
#> 2 Mazda RX4 Wag     6     4
#> 3 Datsun 710        4     4
#> # … with 29 more rows
vroom(file, col_select = c(1, 3, 11))
#> Rows: 32
#> Columns: 3
#> Delimiter: ","
#> chr [1]: model
#> dbl [2]: cyl, gear
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 3
#>   model           cyl  gear
#>   <chr>         <dbl> <dbl>
#> 1 Mazda RX4         6     4
#> 2 Mazda RX4 Wag     6     4
#> 3 Datsun 710        4     4
#> # … with 29 more rows
vroom(file, col_select = starts_with("d"))
#> Rows: 32
#> Columns: 2
#> Delimiter: ","
#> dbl [2]: disp, drat
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 2
#>    disp  drat
#>   <dbl> <dbl>
#> 1   160  3.9 
#> 2   160  3.9 
#> 3   108  3.85
#> # … with 29 more rows
vroom(file, col_select = list(car = model, everything()))
#> Rows: 32
#> Columns: 12
#> Delimiter: ","
#> chr [ 1]: model
#> dbl [11]: mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb
#> 
#> Use `spec()` to retrieve the guessed column specification
#> Pass a specification to the `col_types` argument to quiet this message
#> # A tibble: 32 x 12
#>   car            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows
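These selection styles can also be mixed in a single call. The following sketch combines a rename, a plain column name and a selection helper; the variable name picked is only illustrative, and col_types = list() is passed just to quiet the column specification message.

```r
library(vroom)

file <- vroom_example("mtcars.csv")

# Rename `model` to `car`, keep `mpg`, and add every column starting with "d".
picked <- vroom(
  file,
  col_select = list(car = model, mpg, starts_with("d")),
  col_types = list()
)
names(picked)
```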

Reading fixed width files

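A fixed width file can be a very compact representation of numeric data. Unfortunately, it’s also often painful to read because you need to describe the length of every field. vroom aims to make it as easy as possible by providing a number of different ways to describe the field structure. Use vroom_fwf() in conjunction with one of the following helper functions to read the file:

- fwf_empty() guesses the fields from the empty columns between them
- fwf_widths() takes the width of each field
- fwf_positions() takes paired vectors of start and end positions
- fwf_cols() takes named arguments giving either widths or start and end positions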

fwf_sample <- vroom_example("fwf-sample.txt")
cat(readLines(fwf_sample))
#> John Smith          WA        418-Y11-4111 Mary Hartford       CA        319-Z19-4341 Evan Nolan          IL        219-532-c301
vroom_fwf(fwf_sample, fwf_empty(fwf_sample, col_names = c("first", "last", "state", "ssn")))
#> # A tibble: 3 x 4
#>   first last     state ssn         
#>   <chr> <chr>    <chr> <chr>       
#> 1 John  Smith    WA    418-Y11-4111
#> 2 Mary  Hartford CA    319-Z19-4341
#> 3 Evan  Nolan    IL    219-532-c301
vroom_fwf(fwf_sample, fwf_widths(c(20, 10, 12), c("name", "state", "ssn")))
#> # A tibble: 3 x 3
#>   name          state ssn         
#>   <chr>         <chr> <chr>       
#> 1 John Smith    WA    418-Y11-4111
#> 2 Mary Hartford CA    319-Z19-4341
#> 3 Evan Nolan    IL    219-532-c301
vroom_fwf(fwf_sample, fwf_positions(c(1, 30), c(20, 42), c("name", "ssn")))
#> # A tibble: 3 x 2
#>   name          ssn         
#>   <chr>         <chr>       
#> 1 John Smith    418-Y11-4111
#> 2 Mary Hartford 319-Z19-4341
#> 3 Evan Nolan    219-532-c301
vroom_fwf(fwf_sample, fwf_cols(name = 20, state = 10, ssn = 12))
#> # A tibble: 3 x 3
#>   name          state ssn         
#>   <chr>         <chr> <chr>       
#> 1 John Smith    WA    418-Y11-4111
#> 2 Mary Hartford CA    319-Z19-4341
#> 3 Evan Nolan    IL    219-532-c301
vroom_fwf(fwf_sample, fwf_cols(name = c(1, 20), ssn = c(30, 42)))
#> # A tibble: 3 x 2
#>   name          ssn         
#>   <chr>         <chr>       
#> 1 John Smith    418-Y11-4111
#> 2 Mary Hartford 319-Z19-4341
#> 3 Evan Nolan    219-532-c301
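The same helpers apply to files you generate yourself. As a minimal sketch, the code below builds a small fixed width file with sprintf() and reads it back with fwf_cols(); the record layout (a 10-character item, a 4-character quantity, 2 characters of padding and a 2-character state code) is invented for this example.

```r
library(vroom)

# Hypothetical inventory records with a fixed layout:
# columns 1-10 item, 11-14 quantity, 15-16 padding, 17-18 state.
records <- sprintf(
  "%-10s%-4d  %s",
  c("apple", "banana"), c(12L, 340L), c("NY", "CA")
)
tmp <- tempfile(fileext = ".txt")
writeLines(records, tmp)

# Describe each field by its start and end position; the padding
# columns are simply not listed, so they are not read.
inventory <- vroom_fwf(
  tmp,
  fwf_cols(item = c(1, 10), qty = c(11, 14), state = c(17, 18)),
  col_types = list(
    item = col_character(),
    qty = col_integer(),
    state = col_character()
  )
)
inventory
```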

Column types

vroom guesses the data types of columns as they are read. However, sometimes the guessing fails and it is necessary to explicitly set the type of one or more columns.

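The available specifications are (with single letter abbreviations in quotes):

- col_logical() 'l'
- col_integer() 'i'
- col_big_integer() 'I'
- col_double() 'd'
- col_character() 'c'
- col_factor(levels, ordered) 'f'
- col_date(format = "") 'D'
- col_time(format = "") 't'
- col_datetime(format = "") 'T'
- col_number() 'n'
- col_skip() '_' or '-'
- col_guess() '?'

You can tell vroom which column types to use with the col_types argument in a number of ways.

If you only need to override a single column, the most concise way is to use a named vector.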

# read the 'hp' column as an integer
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i"))
#> # A tibble: 32 x 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>        <dbl> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# also skip reading the 'cyl' column
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_"))
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1     4     4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1     4     4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1     4     1
#> # … with 29 more rows

# also read the gears as a factor
vroom(vroom_example("mtcars.csv"), col_types = c(hp = "i", cyl = "_", gear = "f"))
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows

You can read all the columns with the same type by using the .default name, for example to read everything as character.

vroom(vroom_example("mtcars.csv"), col_types = c(.default = "c"))
#> # A tibble: 32 x 12
#>   model        mpg   cyl   disp  hp    drat  wt    qsec  vs    am    gear  carb 
#>   <chr>        <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Mazda RX4    21    6     160   110   3.9   2.62  16.46 0     1     4     4    
#> 2 Mazda RX4 W… 21    6     160   110   3.9   2.875 17.02 0     1     4     4    
#> 3 Datsun 710   22.8  4     108   93    3.85  2.32  18.61 1     1     4     1    
#> # … with 29 more rows

However, you can also use the col_*() functions in a list.

vroom(
  vroom_example("mtcars.csv"),
  col_types = list(hp = col_integer(), cyl = col_skip(), gear = col_factor())
)
#> # A tibble: 32 x 11
#>   model           mpg  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>         <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4      21     160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda RX4 Wag  21     160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun 710     22.8   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows

This is most useful when a column type needs additional information, such as for categorical data when you know all of the levels of a factor.

vroom(
  vroom_example("mtcars.csv"),
  col_types = list(gear = col_factor(levels = c("3", "4", "5")))
)
#> # A tibble: 32 x 12
#>   model          mpg   cyl  disp    hp  drat    wt  qsec    vs    am gear   carb
#>   <chr>        <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <dbl>
#> 1 Mazda RX4     21       6   160   110  3.9   2.62  16.5     0     1 4         4
#> 2 Mazda RX4 W…  21       6   160   110  3.9   2.88  17.0     0     1 4         4
#> 3 Datsun 710    22.8     4   108    93  3.85  2.32  18.6     1     1 4         1
#> # … with 29 more rows
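Dates are another case where the column type needs additional information, namely the format. The sketch below reads day/month/year dates with col_date(); the file contents and column names are invented for this example.

```r
library(vroom)

# Hypothetical data with dates written as day/month/year.
tmp <- tempfile(fileext = ".csv")
writeLines(c("event,when", "launch,31/12/2020", "review,15/01/2021"), tmp)

# Supply the date format explicitly instead of relying on guessing.
events <- vroom(
  tmp,
  delim = ",",
  col_types = list(
    event = col_character(),
    when = col_date(format = "%d/%m/%Y")
  )
)
events
```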

Name repair

Often the names of columns in the original dataset are not ideal to work with. vroom() uses the same .name_repair argument as tibble, so you can use one of the default name repair strategies or provide a custom function. A great approach is to use the janitor::make_clean_names() function as the input. This will automatically clean the names to use whatever case you specify; here it is set to use ALLCAPS names.

vroom(
  vroom_example("mtcars.csv"),
  .name_repair = ~ janitor::make_clean_names(., case = "all_caps")
)
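If janitor is not installed, any function that takes a character vector of names and returns a repaired one can be used instead. As a minimal sketch, the following passes base R's toupper(); the variable name shouty is illustrative only.

```r
library(vroom)

# Use a plain base R function as the name repair strategy.
shouty <- vroom(
  vroom_example("mtcars.csv"),
  col_types = list(),
  .name_repair = toupper
)
names(shouty)
```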

Writing delimited files

Use vroom_write() to write delimited files. The default delimiter is tab, which produces TSV files. Writing to TSV by default has the following benefits:

- Avoids the issue of whether to use ; (common in Europe) or , (common in the US)
- Unlikely to require quoting in fields, as very few fields contain tabs
- More easily and efficiently ingested by Unix command line tools such as cut, perl and awk

vroom_write(mtcars, "mtcars.tsv")

Writing CSV delimited files

However, you can also use delim = "," to write CSV files, which are common as inputs to GUI spreadsheet tools like Excel or Google Sheets.

vroom_write(mtcars, "mtcars.csv", delim = ",")
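Row names are not a column of the data frame, so for mtcars it can help to move them into a column before writing, as the multi-file example does with tibble::rownames_to_column(). The sketch below does the same with base R and then inspects the first lines of the output; the output path is a temporary file chosen for illustration.

```r
library(vroom)

# Move the row names into a regular `model` column so they are written too.
cars <- data.frame(
  model = rownames(mtcars), mtcars,
  row.names = NULL, stringsAsFactors = FALSE
)

out <- tempfile(fileext = ".csv")
vroom_write(cars, out, delim = ",")

# Inspect the header and the first record of the written file.
readLines(out, n = 2)
```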

Writing compressed files

For gzip, bzip2 and xz compression the outputs will be automatically compressed if the filename ends in .gz, .bz2 or .xz.

vroom_write(mtcars, "mtcars.tsv.gz")

vroom_write(mtcars, "mtcars.tsv.bz2")

vroom_write(mtcars, "mtcars.tsv.xz")
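One way to confirm that the output really is compressed is to look at its first bytes and then read it back. This is a minimal sketch using a temporary file; it relies on the fact that gzip streams begin with the bytes 1f 8b.

```r
library(vroom)

out <- tempfile(fileext = ".tsv.gz")
vroom_write(mtcars, out)

# gzip files start with the two magic bytes 0x1f 0x8b.
magic <- readBin(out, what = "raw", n = 2)
magic

# vroom() decompresses transparently when reading the file back.
back <- vroom(out, col_types = list())
dim(back)
```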

It is also possible to use other compressors by using pipe() with vroom_write() to create a pipe connection to command line utilities such as pigz, a parallel implementation of gzip.

The parallel compression versions can be considerably faster for large output files and generally vroom_write() is fast enough that the compression speed becomes the bottleneck when writing.

vroom_write(mtcars, pipe("pigz > mtcars.tsv.gz"))

Reading and writing from standard input and output

vroom supports reading from and writing to the C-level stdin and stdout of the R process by using stdin() and stdout(). For example, from a shell prompt you can pipe to and from vroom directly.

cat inst/extdata/mtcars.csv | Rscript -e 'vroom::vroom(stdin())'

Rscript -e 'vroom::vroom_write(iris, stdout())' | head

Note that this interpretation of stdin() and stdout() differs from that used elsewhere by R. However, we believe it better matches most users’ expectations for this use case.

Further reading