# Safely Deploy R to Production
Data products built in R, such as dashboards, web applications, reports, and APIs, are increasingly deployed to production. While specific definitions of production vary, everyone agrees that production content should be stable. Most production systems achieve this stability with some variation of the snapshot and restore strategy, and this page focuses on examples of production systems that apply it. It also describes staging environments, acceptable differences between development and production environments, and strategies for upgrading production environments.
This page focuses on reproducible environments for production content. Other important concerns for placing R code in production are outside its scope.
Production systems come in different shapes and sizes. Some organizations store code in Git and use continuous integration tools like Jenkins to deploy content. Other organizations use containers and an orchestration tool like Kubernetes. Still others use infrastructure-as-code tooling like Chef or Puppet to deploy products onto physical, virtual, or cloud servers.
Regardless of the specific implementation, there are three basic steps required to deploy R environments to production:

1. **Snapshot** the package dependencies used by the content in the development environment.
2. **Isolate** the deployed content in production by giving it its own package environment.
3. **Restore** the recorded package versions into that isolated environment.
These steps are the heart of the “snapshot and restore” strategy for reproducing environments. The following two examples showcase implementations of this strategy in production systems. These examples are not exhaustive, and you can certainly design other processes. All implementations should meet the key requirements of snapshot, isolate, and restore.
## Connect

| Step | Connect Implementation |
|---|---|
| Snapshot | A manifest file is created during publication |
| Isolate | Connect creates an isolated package library for each piece of content |
| Restore | Connect installs the packages listed in the manifest into the isolated library |
Summary of the Snapshot and Restore Strategy Applied in Connect
Connect is a publishing platform for data products that automatically implements the snapshot and restore strategy when users publish content. If you’re using Connect, you don’t need to manage this process manually. Here is what happens when a data product is deployed:
1. A manifest file is created by running `rsconnect::writeManifest()` from within your project’s working directory in the development environment. Here is a sample from a manifest file:

    ```json
    {
      "version": 1,
      "locale": "en_US",
      "platform": "3.4.4",
      "metadata": {
        "appmode": "api",
        "primary_rmd": null,
        "primary_html": null,
        "content_category": null,
        "has_parameters": false
      },
      "packages": {
        "BH": {
          "Source": "CRAN",
          "Repository": "https://cran.rstudio.com/",
          "description": {
            "Package": "BH",
            "Type": "Package",
            "Title": "Boost C++ Header Files",
            "Version": "1.66.0-1",
            "Date": "2018-02-12",
            "Author": "Dirk Eddelbuettel, John W. Emerson and Michael J. Kane",
            "Maintainer": "Dirk Eddelbuettel <edd@debian.org>",
            ...
    ```
2. The manifest file, application code, and supporting files are sent to the production Connect server.

3. The Connect server creates an isolated library for each piece of content.

4. The required packages are restored into the isolated library using the manifest file. For example, if content A depends on `ISLR` 1.0 and content B depends on `ISLR` 2.0, the appropriate version is installed into each content library. Connect maintains a cache so that packages are reused when possible.
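For reference, here is a minimal sketch of the user-facing side of this process, assuming a Connect account is already configured; the server name is a placeholder:

```r
library(rsconnect)

# record the project's package dependencies in manifest.json
# (run from the project's working directory)
writeManifest()

# alternatively, push-button publishing sends the code and the
# dependency snapshot straight to the Connect server
deployApp(appDir = ".", server = "connect.example.com")
```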
## renv and Docker

In this example, a Docker container is used to isolate the data product, and `renv` is used to snapshot and restore the appropriate package environment. More details on using R with Docker are available here.
| Step | Docker + renv Implementation |
|---|---|
| Snapshot | `renv::snapshot()` creates a lock file from the development environment |
| Isolate | The Docker container creates an isolated environment for the data product |
| Restore | `renv::restore()` is run in the Docker container to recreate the package environment |
Summary of the Snapshot and Restore Strategy Applied with Docker and renv
1. In the development environment, create a `renv.lock` file for the project by running `renv::snapshot()`. The lock file records the versions of the R packages in use. Commit this lock file alongside the code.
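    A minimal sketch of this step, assuming `renv` is already installed:

    ```r
    # one-time setup: give the project its own renv-managed library
    renv::init()

    # record the current package versions in renv.lock
    renv::snapshot()
    ```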
2. Create a Dockerfile, starting with the appropriate version of R:

    ```dockerfile
    # start with the appropriate version of R
    FROM rstudio/r-base:3.4-bionic

    # install git
    RUN apt-get update && apt-get install -y git

    # clone the code base
    RUN git clone https://ourgit.example.com/user/project.git

    # install renv
    RUN R -e 'install.packages("renv", repos = "https://r-pkgs.example.com")'

    # restore the package environment
    RUN R -e 'setwd("./project"); renv::restore()'

    # run the data product
    CMD ...
    ```
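Building the image (for example, `docker build -t my-data-product .`; the tag name is arbitrary) runs the restore step, so the resulting container holds exactly the package versions recorded in the lock file.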
In this example, the version of R is controlled by the base image, here an image provided by Posit that includes R. Other alternatives also work, such as including the commands to install R in the Dockerfile. You can determine the required version of R from the `renv` lock file:
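For example, the top of a `renv.lock` file records the R version used in development; the version shown here matches the base image above:

```json
{
  "R": {
    "Version": "3.4.4",
    "Repositories": [
      {
        "Name": "CRAN",
        "URL": "https://cran.rstudio.com"
      }
    ]
  }
}
```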
The focus so far has been deploying R environments to production systems. With proper record keeping and environment isolation, there is a high chance that deployed content will work as expected. However, for systems that require minimal downtime, it is still imperative to test content in a pre-production system before officially deploying to production. The concept is simple:
```mermaid
graph LR
  a[Dev] --> b[Staging]
  b --> c[Prod]
  style b fill:lightgrey
  style c fill:lightgrey
```
Staging and Production should be identical clones!
While conceptually simple, in practice there are two challenges:

1. Creating a staging environment that is identical to production.
2. Promoting content from staging into production.
To solve these problems, most organizations use either containers or infrastructure-as-code. Luckily, the idea is straightforward: instead of running this process manually, automate as much as possible by writing explicit code that accomplishes steps 1 and 2. The implementation details could fill an entire website of their own, but most R users do not need to re-invent this process. Typically, organizations have a “DevOps” team or strategy in place for staging content. The main task for the R user is explaining how those tools should be adapted for data products built with R. The adaptation is simply including the snapshot, isolate, and restore steps.
In this example, a DevOps team maintains staging and production servers using Chef. They also maintain an enterprise Git application and use Jenkins for continuous integration. The current DevOps strategy relies on Git branches: branches of a repository are automatically deployed to staging, whereas the master branch is deployed to production. To integrate R-based data products:
1. The DevOps team should create Chef recipes responsible for installing multiple versions of R onto the staging and production servers.

2. The DevOps team should create a Chef recipe to install and configure Connect.

3. The DevOps team should configure Jenkins to deploy a repository’s branches to the staging environment, and the master branch to production. In both cases, the Jenkins pipeline consists of bash shell scripts that clone the Git repository, create a tar file, and then call Connect API endpoints to deploy. Example shell scripts are available; a sketch of the deployment calls appears after the diagram below.

4. When the R user is ready to deploy content, they should start by running `rsconnect::writeManifest()` inside the development environment. The resulting manifest file should be included alongside the application code in a Git commit to a staging branch.

5. Following the commit, Jenkins will deploy the code to the staging Connect environment, using the automatic process described above. The R user should confirm the content looks correct.

6. The user or an admin can merge the staging branch into the master branch. This merge triggers a deployment to the production server.
```mermaid
graph TB
  a[Dev]
  b["Dependency Manifest"]
  c[Code]
  d["Git Branch"]
  f[Connect Staging]
  g(Approval)
  h[Git Master]
  i["Jenkins (staging CI/CD)"]
  j[Connect Prod]
  k["Jenkins (production CI/CD)"]
  a --> b
  a --> c
  c --> d
  b --> d
  d --> i
  i --> f
  f --> g
  g --> h
  h --> k
  k --> j
  style b fill:lightgrey
  style c fill:lightgrey
  style g fill:lightgreen
```
More details are available here.
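To illustrate, here is a rough R sketch of what such a deployment script does. The server URL, API key variable, content GUID, and file name are placeholders, and the endpoint paths follow the Connect Server API but should be verified against your version of Connect:

```r
library(httr)

connect <- "https://connect.example.com"          # placeholder server
api_key <- Sys.getenv("CONNECT_API_KEY")
guid    <- "00000000-0000-0000-0000-000000000000" # hypothetical content GUID

# upload a tar file containing the code and manifest as a new bundle
bundle <- POST(
  paste0(connect, "/__api__/v1/content/", guid, "/bundles"),
  add_headers(Authorization = paste("Key", api_key)),
  body = upload_file("project.tar.gz")
)

# deploy the uploaded bundle
POST(
  paste0(connect, "/__api__/v1/content/", guid, "/deploy"),
  add_headers(Authorization = paste("Key", api_key)),
  body = list(bundle_id = content(bundle)$id),
  encode = "json"
)
```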
We’ve now described three environments: development, staging/testing, and production. The key to success is keeping these environments as similar as possible. However, what happens if your development environment is a Windows desktop and production is a Linux server? This section outlines two differences that are acceptable: the R patch version and the source from which packages are installed.
R’s version scheme has three components: the major version, the minor version, and the patch version. For example, R version 3.5.2 has:
- Major version: 3
- Minor version: 5
- Patch version: 2
Major versions are released rarely. Minor versions are released once a year in the spring. Patch versions are released on a regular, as-needed basis. R packages are compatible across patch versions, but not across major or minor versions. For example, a package built on version 3.5.1 will work on 3.5.2, but is not guaranteed to work on 3.6.0.
For this reason, we recommend that development and production systems have the same major.minor version of R available, though the patch version can vary. For example, content created in the development environment using R version 3.5.1 could be deployed to a production environment using 3.5.2.
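To make this rule concrete, here is a small hypothetical helper that checks whether two R versions agree on major.minor:

```r
# hypothetical helper: do two R versions agree on major.minor?
# (the patch component is allowed to differ)
same_minor_version <- function(dev, prod) {
  dev  <- unlist(strsplit(dev,  ".", fixed = TRUE))
  prod <- unlist(strsplit(prod, ".", fixed = TRUE))
  identical(dev[1:2], prod[1:2])
}

same_minor_version("3.5.1", "3.5.2")  # TRUE:  only the patch differs
same_minor_version("3.5.1", "3.6.0")  # FALSE: minor versions differ
```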
R packages do not follow these same rules, and package versions should match exactly!
It is possible for development and production environments to have different operating systems. For example, development could be performed on a Windows desktop, while production lives on a Linux server.
While possible, this setup is not recommended. Instead, many organizations prefer to standardize on a single operating system, usually Linux. RStudio Server and Posit Workbench make it easy for R users to develop in an environment that more closely resembles production.
In the case where operating systems vary, the source of an R package may vary as well. Using the scenario above, R packages on Windows are typically installed from pre-compiled CRAN binaries. When the same packages are restored on a Linux system, they are normally installed from source. This difference will not impact behavior, but it explains why information about the package (name, version, repository) is transferred as opposed to transferring the installed package library.
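As a quick illustration of the difference (the package name is arbitrary):

```r
# the same call behaves differently depending on the platform
install.packages("ISLR")
# on Windows or macOS this typically installs a pre-compiled CRAN binary;
# on Linux it typically downloads the source tarball and compiles it
```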
In production, one does not simply upgrade packages or system dependencies! These tips can enable successful maintenance of your production system over time:
- Avoid upgrading R in place through a system package manager like `apt` or `yum`, where routine updates can silently replace the version of R that deployed content depends on. Instead, install specific versions of R side-by-side. See R Installations for details.