On macOS and Windows, when you
install.packages("arrow"), you get a binary package that contains Arrow’s C++ dependencies along with it. On Linux,
install.packages() retrieves a source package that has to be compiled locally, and C++ dependencies need to be resolved as well. Generally for R packages with C++ dependencies, this requires either installing system packages, which you may not have privileges to do, or building the C++ dependencies separately, which introduces all sorts of additional ways for things to go wrong.
Our goal is to make
install.packages("arrow") “just work” for as many Linux distributions, versions, and configurations as possible. This document describes how it works and the options for fine-tuning Linux installation. The intended audience for this document is
arrow R package users on Linux, not developers. If you’re contributing to the Arrow project, you’ll probably want to manage your C++ installation more directly. Note also that if you use
conda to manage your R environment, this document does not apply. You can
conda install -c conda-forge --strict-channel-priority r-arrow and you’ll get the latest official release of the R package along with any C++ dependencies.
Having trouble installing
arrow? See the “Troubleshooting” section below.
Install the latest release of
arrow from CRAN with install.packages("arrow").
Daily development builds, which are not official releases, can be installed from the Ursa Labs repository:
install.packages("arrow", repos = "https://arrow-r-nightly.s3.amazonaws.com")
or for conda users via:
conda install -c arrow-nightlies -c conda-forge --strict-channel-priority r-arrow
You can also install the R package from a git checkout:
git clone https://github.com/apache/arrow
cd arrow/r
R CMD INSTALL .
If you don’t already have the Arrow C++ libraries on your system, installing the R package from source will also download and build the Arrow C++ libraries for you. To speed up installation, you can set
LIBARROW_BINARY=true to look for C++ binaries prebuilt for your Linux distribution/version. Alternatively, you can set
LIBARROW_MINIMAL=false to build the Arrow libraries with optional features such as compression libraries enabled. This will increase the build time but provides many useful features. Prebuilt binaries are built with this flag enabled, so you get the full functionality by using them as well.
Both of these variables are also set this way if you have the
NOT_CRAN=true environment variable set.
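As a minimal sketch of the settings just described, the shell snippet below exports both variables and wraps the install call in a function. The function name install_arrow_cran and the CRAN mirror URL are illustrative assumptions, not part of the arrow package; the sketch also assumes Rscript is on your PATH.

```shell
# Look for a prebuilt C++ binary instead of compiling everything locally.
export LIBARROW_BINARY=true
# If a source build is still needed, include the optional features.
export LIBARROW_MINIMAL=false

# Hypothetical helper: install arrow from CRAN in a non-interactive R session.
# The mirror URL is an assumption; any CRAN mirror should work.
install_arrow_cran() {
  Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")'
}
```

Call install_arrow_cran afterwards in the same shell so that the R process inherits the exported variables.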
If you already have
arrow installed and want to upgrade to a different version, install a development build, or try to reinstall and fix issues with Linux C++ binaries, you can call install_arrow().
install_arrow() provides some convenience wrappers around the various environment variables described below. This function is part of the
arrow package, and it is also available as a standalone script, so you can use it without first installing the package.
install_arrow() will install from CRAN, while
install_arrow(nightly = TRUE) will give you a development build.
install_arrow() does not require environment variables to be set in order to satisfy C++ dependencies.
Note that, unlike packages such as
blogdown and others that require external dependencies, you do not need to run
install_arrow() after a successful
arrow installation.
The arrow package allows you to work with data in AWS S3 or in other cloud storage systems that emulate S3. However, support for working with S3 is not enabled in the default build, and it has additional system requirements. To enable it, set the environment variable
NOT_CRAN=true to choose the full-featured build, or more selectively set
ARROW_S3=ON. You also need the following system dependencies:
gcc >= 4.9 or
clang >= 3.3; note that the default compiler on CentOS 7 is gcc 4.8.5, which is not sufficient
The prebuilt C++ binaries come with S3 support enabled, so you will need to meet these system requirements in order to use them; the package will not install without them. If you’re building everything from source, the install script will check for these dependencies and turn off S3 support in the build if the prerequisites are not met: installation will succeed, but without S3 functionality. If you install the missing system requirements afterwards, you’ll need to reinstall the package to enable S3 support.
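The compiler requirement above can be checked before installing. The following is a rough sketch, not something the install script provides: the helper names version_ge and check_gcc_for_s3 are made up for illustration, and the comparison relies on the -V (version sort) option of GNU sort.

```shell
# Return success if version $1 is at least version $2 (uses GNU sort -V).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Hypothetical helper: report whether the default gcc meets the
# gcc >= 4.9 requirement for S3 support.
check_gcc_for_s3() {
  if ! command -v gcc >/dev/null 2>&1; then
    echo "gcc not found"
    return 1
  fi
  gcc_version=$(gcc -dumpversion)
  if version_ge "$gcc_version" 4.9; then
    echo "gcc $gcc_version is new enough for S3 support"
  else
    echo "gcc $gcc_version is too old for S3 support (need >= 4.9)"
    return 1
  fi
}
```

Running check_gcc_for_s3 on a stock CentOS 7 system should report that the compiler is too old, which matches the note above about gcc 4.8.5.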
In order for the
arrow R package to work, it needs the Arrow C++ library. There are a number of ways you can get it: a system package; a library you’ve built yourself outside of the context of installing the R package; or, if you don’t already have it, the R package will attempt to resolve it automatically when it installs.
If you are authorized to install system packages and you’re installing a CRAN release, you may want to use the official Apache Arrow release packages corresponding to the R package version. See the Arrow project installation page to find pre-compiled binary packages for some common Linux distributions, including Debian, Ubuntu, and CentOS. You’ll need to install
libparquet-dev on Debian and Ubuntu, or
parquet-devel on CentOS. This will also automatically install the Arrow C++ library as a dependency.
When you install the
arrow R package on Linux, it will first attempt to find the Arrow C++ libraries on your system using the
pkg-config command. This will find either installed system packages or libraries you’ve built yourself. In order for
install.packages("arrow") to work with these system packages, you’ll need to install them before installing the R package.
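To see what the R package would find at this step, you can query pkg-config yourself. This is a small sketch under the assumption that the Arrow C++ library registers itself with pkg-config under the module name arrow; the function name find_system_arrow is hypothetical.

```shell
# Hypothetical helper: print the version of the system Arrow C++ library,
# if pkg-config can find one. The module name "arrow" is an assumption
# based on the arrow.pc file that Arrow C++ installs.
find_system_arrow() {
  if ! command -v pkg-config >/dev/null 2>&1; then
    echo "pkg-config is not installed"
    return 1
  fi
  if pkg-config --exists arrow; then
    echo "found system Arrow C++ $(pkg-config --modversion arrow)"
  else
    echo "no system Arrow C++ found"
    return 1
  fi
}
```

If the function reports a version, compare it with the version of the R package you are about to install before proceeding.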
If no Arrow C++ libraries are found on the system, the R package installation script will next attempt to download prebuilt static Arrow C++ libraries that match both your local operating system and
arrow R package version. C++ libraries (source or binary) will only be retrieved if you have set the environment variable
NOT_CRAN. If found, they will be downloaded and bundled when your R package compiles. For a list of supported distributions and versions, see the arrow-r-nightly project.
If no binary is found, it will download the Arrow C++ source that matches the R package version (CRAN release or nightly build) and attempt to build it locally. If no matching source bundle is found, it will also look to see if you are in a checkout of the
apache/arrow git repository and thus have the C++ source there. Depending on your system, building Arrow C++ from source will likely be slow; consequently, it is designed to happen only when you run
R CMD INSTALL but not when running
R CMD check, unless you’ve set the
NOT_CRAN=true environment variable.
For the mechanics of how all this works, see the R package
configure script, which calls
tools/nixlibs.R. If the C++ library is built from source,
inst/build_arrow_static.sh is executed. This build script is also what is used to generate the prebuilt binaries.
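The resolution order described above can be summarized as a simplified model. The function below is only an illustration written for this document, not the logic in configure or tools/nixlibs.R: it looks at four of the environment variables, ignores LIBARROW_DOWNLOAD and local git checkouts, and the name arrow_resolution_plan is hypothetical.

```shell
# Lowercase a value so that TRUE, True and true compare equal.
lower() { printf '%s' "$1" | tr '[:upper:]' '[:lower:]'; }

# Hypothetical helper: print, in order, the steps the installation would try
# for the current environment. This is a simplified model of the text above.
arrow_resolution_plan() {
  use_pkg_config=$(lower "${ARROW_USE_PKG_CONFIG:-}")
  binary=$(lower "${LIBARROW_BINARY:-}")
  build=$(lower "${LIBARROW_BUILD:-}")
  not_cran=$(lower "${NOT_CRAN:-}")

  # NOT_CRAN=true turns on the binary lookup unless LIBARROW_BINARY is set.
  if [ "$not_cran" = "true" ] && [ -z "$binary" ]; then
    binary=true
  fi

  if [ "$use_pkg_config" != "false" ]; then
    echo "system libraries via pkg-config"
  fi
  # Any value other than empty or false counts, including a distro-version.
  if [ -n "$binary" ] && [ "$binary" != "false" ]; then
    echo "prebuilt binary download"
  fi
  if [ "$build" != "false" ]; then
    echo "source build"
  fi
}
```

For example, with ARROW_USE_PKG_CONFIG=FALSE and LIBARROW_BINARY=true the model skips the pkg-config step and lists the binary download followed by the source build as a fallback.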
The intent is that
install.packages("arrow") will just work and handle all C++ dependencies, but depending on your system, you may have better results if you tune one of several parameters. Here are some known complications and ways to address them.
If you get an error like
Cannot call io___MemoryMappedFile__Open(). See https://arrow.apache.org/docs/r/articles/install.html for help installing Arrow C++ libraries.
on any arrow function you call, that means that installing the package failed to retrieve or build C++ libraries compatible with the current version of the R package.
It is expected that C++ dependencies should be built successfully on all Linux distributions, so you should not see this message. If you do, please check the “Known installation issues” below to see if any apply. If none apply, retry the installation with
arrow::install_arrow(verbose = TRUE) so that details on what failed are shown, then please report an issue and include the full verbose installation output.
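When preparing an issue report, it helps to keep the verbose output in a file. The following is one possible way to do that from a shell; the function name retry_arrow_install_verbose and the default log file name are arbitrary choices for illustration, and Rscript is assumed to be on your PATH.

```shell
# Hypothetical helper: rerun the installation verbosely and keep a copy of
# all output (stdout and stderr) in a log file.
retry_arrow_install_verbose() {
  log_file="${1:-arrow-install.log}"
  Rscript -e 'arrow::install_arrow(verbose = TRUE)' 2>&1 | tee "$log_file"
}
```

Calling retry_arrow_install_verbose with no argument writes the log to arrow-install.log in the current directory; you can attach that file to the issue.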
If a system library or other installed Arrow is found but it doesn’t match the R package version (for example, you have libarrow 1.0.0 on your system and are installing R package 2.0.0), it is likely that the R bindings will fail to compile. Because the Apache Arrow project is under active development, it is essential that versions of the C++ and R libraries match. When
install.packages("arrow") has to download the C++ libraries, the install script ensures that you fetch the C++ libraries that correspond to your R package version. However, if you are using Arrow libraries already on your system, version match isn’t guaranteed.
To fix version mismatch, you can either update your system packages to match the R package version, or set the environment variable
ARROW_USE_PKG_CONFIG=FALSE to tell the configure script not to look for system Arrow packages. (The latter is the default of
install_arrow().) System packages are available corresponding to all CRAN releases but not for nightly or dev versions, so depending on the R package version you’re installing, system packages may not be an option.
Note also that once you have a working R package installation based on system (shared) libraries, if you update your system Arrow, you’ll need to reinstall the R package to match its version. Similarly, if you’re using Arrow system libraries, running
update.packages() after a new release of the
arrow package will likely fail unless you first update the system packages.
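The second fix, skipping system packages, might look like the sketch below. It sets the variable only for the single R invocation rather than exporting it for the whole session. The function name and the CRAN mirror URL are assumptions made for this example.

```shell
# Hypothetical helper: reinstall arrow while telling the configure script
# not to look for system Arrow packages. The variable is set only for this
# one command. The mirror URL is an assumption; any CRAN mirror should work.
reinstall_arrow_ignoring_system() {
  ARROW_USE_PKG_CONFIG=false \
    Rscript -e 'install.packages("arrow", repos = "https://cloud.r-project.org")'
}
```

Scoping the variable to one command avoids affecting later installations where you may want the system libraries to be used again.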
If the R package finds and downloads a prebuilt binary of the C++ library, but then the
arrow package can’t be loaded, perhaps with “undefined symbols” errors, please report an issue. This is likely a compiler mismatch and may be resolvable by setting some environment variables to instruct R to compile the packages to match the C++ library.
A workaround would be to set the environment variable
LIBARROW_BINARY=FALSE and retry installation: this value instructs the package to build the C++ library from source instead of downloading the prebuilt binary. That should guarantee that the compiler settings match.
If a prebuilt binary wasn’t found for your operating system but you think it should have been, check the logs for a message that says
*** Unable to identify current OS/version, or a message that says
*** No C++ binaries found for an invalid OS. If you see either, please report an issue. You may also set the environment variable
ARROW_R_DEV=TRUE for additional debug messages.
A workaround would be to set the environment variable
LIBARROW_BINARY to a
distribution-version that exists in the Ursa Labs repository. Setting
LIBARROW_BINARY is also an option when there’s not an exact match for your OS but a similar version would work, such as if you’re on
ubuntu-18.10 and there’s only a binary for a nearby Ubuntu release.
If that workaround works for you, and you believe that it should work for everyone else too, you may propose adding an entry to this lookup table. This table is checked during the installation process and tells the script to use binaries built on a different operating system/version because they’re known to work.
If building the C++ library from source fails, check the error message. The install script attempts to install any necessary build dependencies, but it’s possible that some operating systems may require additional ones. You may be able to install them and retry. Regardless, if the C++ library fails to compile, please report an issue so that we can attempt to improve the script.
On CentOS, if you are using a more modern
devtoolset, you may need to set the environment variables
CC and CXX either in the shell or in R’s
Makeconf. For CentOS 7 and above, both the Arrow system packages and the C++ binaries for R are built with the default system compilers. If you want to use either of these and you have a
devtoolset installed, set
CC=/usr/bin/gcc CXX=/usr/bin/g++ to use the system compilers instead of the
devtoolset. Alternatively, if you want to build
arrow with the newer
devtoolset compilers, set both
ARROW_USE_PKG_CONFIG and LIBARROW_BINARY to false so that you build the Arrow C++ from source using those compilers. Compiler mismatch between the Arrow system libraries and the R package may cause R to segfault when
arrow package functions are used. See discussions here and here.
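For the first option, pointing the build at the system compilers, a shell setup could look like this sketch. The paths are the ones given above; the loop that prints the compiler versions is only a convenience added for illustration.

```shell
# Use the system compilers rather than the devtoolset ones, so that the R
# package is compiled with the same toolchain as the Arrow system packages
# and the prebuilt C++ binaries.
export CC=/usr/bin/gcc
export CXX=/usr/bin/g++

# Show which compilers the variables now point to.
for compiler in "$CC" "$CXX"; do
  if [ -x "$compiler" ]; then
    "$compiler" --version | head -n 1
  else
    echo "$compiler not found"
  fi
done
```

Run the R package installation from the same shell afterwards so that these settings are inherited.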
If you have multiple versions of
zstd installed on your system, installation by building the C++ from source may fail with an undefined symbols error. Workarounds include (1) setting
LIBARROW_BINARY to use a C++ binary; (2) setting
ARROW_WITH_ZSTD=OFF to build without
zstd; or (3) uninstalling the conflicting
zstd. See discussion here.
Some features are optional when you build Arrow from source. With the exception of
ARROW_S3, these are all
ON by default in the bundled C++ build, but you can set them to
OFF to disable them.
ARROW_S3: If set to
ON, S3 support will be built as long as the dependencies are met; if they are not met, the build script will turn this OFF
ARROW_WITH_RE2 for the RE2 regular expression library, used in some string compute functions
ARROW_WITH_UTF8PROC for the UTF8Proc string library, used in many other string compute functions
There are a number of other variables that affect the
configure script and the bundled build script. By default, these are all unset. All boolean variables are case-insensitive.
ARROW_USE_PKG_CONFIG: If set to
false, the configure script won’t look for Arrow libraries on your system and instead will look to download/build them. Use this if you have a version mismatch between installed system libraries and the version of the R package you’re installing.
LIBARROW_DOWNLOAD: Unless set to
false, the build script will attempt to download C++ binary or source bundles. If you’re in a checkout of the
apache/arrow git repository and want to build the C++ library from the local source, make this false.
LIBARROW_BINARY: If set to
true, the script will try to download a binary C++ library built for your operating system. You may also set it to some other string, a related “distro-version” that has binaries built that work for your OS. If no binary is found, installation will fall back to building C++ dependencies from source.
LIBARROW_BUILD: If set to
false, the build script will not attempt to build the C++ from source. This means you will only get a working
arrow R package if a prebuilt binary is found. Use this if you want to avoid compiling the C++ library, which may be slow and resource-intensive, and ensure that you only use a prebuilt binary.
LIBARROW_MINIMAL: If set to
false, the build script will enable some optional features, including compression libraries, S3 support, and additional alternative memory allocators. This will increase the source build time but results in a more fully functional library.
NOT_CRAN: If this variable is set to
true, as the
devtools package does, the build script will set
LIBARROW_BINARY=true and LIBARROW_MINIMAL=false unless those environment variables are already set. This provides for a more complete and fast installation experience for users who already have
NOT_CRAN=true as part of their workflow, without requiring additional environment variables to be set.
ARROW_R_DEV: If set to
true, more verbose messaging will be printed in the build script.
arrow::install_arrow(verbose = TRUE) sets this. This variable is also needed if you’re modifying C++ code in the package: see “Editing C++ code” in the README.
LIBARROW_DEBUG_DIR: If the C++ library building from source fails (
cmake), there may be messages telling you to check some log file in the build directory. However, when the library is built during R package installation, that location is in a temp directory that is already deleted. To capture those logs, set this variable to an absolute (not relative) path and the log files will be copied there. The directory will be created if it does not exist.
CMAKE: When building the C++ library from source, you can specify a
/path/to/cmake to use a different version than whatever is found on the PATH.
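As an illustration of the debugging variables in this list, the snippet below sets LIBARROW_DEBUG_DIR and ARROW_R_DEV before a source build. The directory name arrow-build-logs is an arbitrary example, and the case statement is only a guard added here to catch a relative path.

```shell
# Keep build logs in an absolute directory that survives the installation.
# $PWD is always absolute; the directory name is an arbitrary example.
export LIBARROW_DEBUG_DIR="$PWD/arrow-build-logs"
# Print more verbose messages from the build script.
export ARROW_R_DEV=true

# Guard against accidentally passing a relative path.
case "$LIBARROW_DEBUG_DIR" in
  /*) echo "logs will be copied to $LIBARROW_DEBUG_DIR" ;;
  *)  echo "LIBARROW_DEBUG_DIR must be an absolute path" >&2 ;;
esac
```

After a failed source build, look in that directory for the log files that the error messages refer to.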
As mentioned above, please report an issue if you encounter ways to improve this. If you find that your Linux distribution or version is not supported, we welcome the contribution of Docker images (hosted on Docker Hub) that we can use in our continuous integration. These Docker images should be minimal, containing only R and the dependencies it requires. (For reference, see the images that R-hub uses.)
You can test the
arrow R package installation using the
docker-compose setup included in the
apache/arrow git repository. For example,
R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose build r
R_ORG=rhub R_IMAGE=ubuntu-gcc-release R_TAG=latest docker-compose run r
installs the arrow R package, including the C++ source build, on the rhub/ubuntu-gcc-release image.