Connect & targets

targets

Example GitHub repository: https://github.com/sol-eng/targets-deployment-rsc

What is targets?

targets can be described as an orchestration package for R.

From the targets user manual:

The targets package is a Make-like pipeline toolkit for statistics and data science in R. With targets, you can maintain a reproducible workflow without repeating yourself. targets:

  • learns how your pipeline fits together
  • skips costly runtime for tasks that are already up to date
  • runs only the necessary computation
  • supports implicit parallel computing
  • abstracts files as R objects
  • shows tangible evidence that the results match the underlying code and data.

How does targets work?

To showcase how targets works, we will discuss an example pipeline that reads data from a CSV file, trains a linear regression model using tidymodels, and deploys an R Markdown report with information about the model.

targets has one main file named _targets.R. This file specifies the dependencies of the project in addition to the steps that the pipeline will execute.

The first four lines of the _targets.R file in the example repository look as follows:

# _targets.R
library(targets)
options(tidyverse.quiet = TRUE)
source("R/connect_helpers.R")
tar_option_set(packages = c("tidymodels", "tidyverse", "connectapi", "tarchetypes"))

Every _targets.R file needs library(targets), and most also call tar_option_set(packages = c(...)), which declares the packages that must be available in the R environment where the pipeline executes. The other two lines are specific to this example pipeline.

The file ends with a list of tar_target() calls, one per target (a step in the pipeline). Each call gives the target a name and the computation that produces it. In our example pipeline, the first two targets look as follows:

list(
  tar_target(
    raw_data_url,
    "https://tidymodels.org/start/models/urchins.csv",
    format = "url"
  ),
  tar_target(
    raw_data,
    read_csv(raw_data_url, col_types = cols())
  ),
  ...
)

In the first target, we specify that we want a target named raw_data_url, which holds a URL string. In the second target, we use the raw_data_url string to read the data and store the output in a target named raw_data. Because the second target refers to raw_data_url by name, targets infers that raw_data depends on raw_data_url. targets uses all of these relationships to build a dependency graph. When executing the pipeline, targets consults the dependency graph before each step so that it only computes the steps that are out of date. But how does targets know what is out of date and what isn't?

The _targets cache is how targets handles this problem. By default, each target is stored in the cache as an R object. If you don't configure the location, targets creates the cache locally in the _targets folder. However, the cache can live elsewhere, such as a separate folder or an S3 bucket. This flexibility becomes important when deploying a pipeline to Connect.
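The skipping behaviour described above can be tried on a toy pipeline. The following is a minimal sketch, not part of the example repository: the directory name and the two targets (numbers and total) are hypothetical, and it assumes the targets package is installed.

```r
library(targets)

# Hypothetical throwaway project directory, so no real project is touched.
demo_dir <- file.path(tempdir(), "targets-skip-demo")
dir.create(demo_dir, showWarnings = FALSE)
old_wd <- setwd(demo_dir)

# tar_script() writes the expression below to _targets.R in the working directory.
tar_script({
  list(
    tar_target(numbers, c(1, 2, 3)),
    tar_target(total, sum(numbers))
  )
}, ask = FALSE)

tar_make()                            # builds any target that is not yet up to date
total_value <- tar_read(total)        # reads a stored target back from the cache
outdated_after_run <- tar_outdated()  # names of targets that would rerun now
tar_make()                            # nothing changed, so the targets are skipped

setwd(old_wd)
```

After the first tar_make(), the _targets folder inside the demo directory holds the stored values and metadata that let the second call skip both targets.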

Deploying a targets pipeline to Connect

To host a targets pipeline in Connect, you need to add an R Markdown document that executes the pipeline. In our example, this file is called driver.Rmd and, in line 39, it calls the tar_make() function to execute the pipeline. One caveat of deploying an R Markdown document to Connect is that whenever the document is re-executed, it spawns a new process that only contains the files included in the deployment bundle. As a result, the _targets folder where the cache lives is lost on every re-execution, and with it the main benefit of targets. To avoid this, configure a path for the targets cache in Connect that lives outside the folder where the application is deployed (e.g., /mnt/data).
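As a rough illustration of keeping the cache outside the project folder, the sketch below points the store at a separate directory before running a one-target pipeline. Both directory names and the answer target are hypothetical; on Connect the store would be a persistent path such as the one mentioned above rather than a temporary folder.

```r
library(targets)

# Hypothetical locations: a project folder and a separate folder for the cache.
project_dir <- file.path(tempdir(), "targets-project-demo")
store_dir <- file.path(tempdir(), "targets-external-store")
dir.create(project_dir, showWarnings = FALSE)
old_wd <- setwd(project_dir)

# A one-target pipeline, written to _targets.R in the project folder.
tar_script(list(tar_target(answer, 6 * 7)), ask = FALSE)

# Point the cache at the external folder; the setting is saved to _targets.yaml.
tar_config_set(store = store_dir)

tar_make()
answer_value <- tar_read(answer)       # read back from the external store
store_exists <- dir.exists(store_dir)  # the cache was created at store_dir

setwd(old_wd)
```

Because the store lives outside the project folder, replacing the project files does not discard previously built targets.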

We can use the config package to handle different cache paths across development, staging, and production environments. driver.Rmd contains the following code chunk:

# driver.Rmd
targets_settings <- config::get('targets')
store_path <- targets_settings$path
tar_config_set(store = store_path)

This code chunk sets the cache path according to the active configuration, which the config package reads from the R_CONFIG_ACTIVE environment variable. By default, in Connect, the value of this environment variable is rsconnect. We use this in our config.yml to configure the path for the cache in the following manner:

# config.yml
rsconnect:
  targets:
    path: '/mnt/shared-data/targets-test/'
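To see how config::get() picks a path per environment, here is a minimal sketch that writes a two-environment config.yml to a temporary file and reads it back. The default entry and its '_targets' value are assumptions added for illustration; only the rsconnect entry comes from the example. The config argument is passed explicitly here, whereas in driver.Rmd it is taken from R_CONFIG_ACTIVE.

```r
# Write a hypothetical config.yml with a local default and a Connect override.
config_file <- file.path(tempdir(), "config.yml")
writeLines(c(
  "default:",
  "  targets:",
  "    path: '_targets'",  # assumed local cache path, not from the example
  "rsconnect:",
  "  targets:",
  "    path: '/mnt/shared-data/targets-test/'"
), config_file)

# Select each configuration by name and extract the cache path.
local_path <- config::get("targets", config = "default", file = config_file)$path
connect_path <- config::get("targets", config = "rsconnect", file = config_file)$path
```

With this layout, the same driver.Rmd chunk resolves to the local folder during development and to the shared path once deployed.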

With these steps, you can host your targets pipeline reliably within Connect. You can also host a Shiny app to monitor a pipeline, as shown in this demo repository: https://github.com/wlandau/targets-shiny
