Strategy Maps

Strategies to Reproduce Environments Over Time

Reproducing data science work is the main objective of environment management. This site details three strategies for reproducing R environments over time. To select a strategy, you will need to answer two questions:

  1. Who is responsible for managing the environment?
  2. How open is the environment?

At first these two questions might seem similar, but separating the two uncovers common “danger zones” or “anti-strategies”. The map below depicts these danger zones as well as three successful strategies. Use the map and the two questions above to determine where your organization currently operates and identify which strategy to move towards.

Figure: strategy map. The x-axis shows who is responsible, from admins on the far left to users on the far right. The y-axis shows package access, from locked down at the bottom to open at the top. Three strategies fall into a green zone along these axes, moving upward from left to right: Validated, Shared Baseline, and Snapshot.

The three strategies (Snapshot and Restore, Shared Baseline, and Validated) are each outlined in detail elsewhere on this site.

In addition to these three strategies, the strategy map above details a set of danger zones: areas where “who” is in control and “what” can be installed are misaligned, creating painful environments that cannot be reliably recreated. Recognizing that you’re in a danger zone can help you find a “nearby” strategy to move towards.

Wild West

The wild west scenario occurs when users are given free rein to install packages with no strategy for reproducing package environments.

Recommendations:

  • If you are a single data scientist, or in a team of experienced data scientists, consider moving to the snapshot and restore strategy.

  • If you are working with a group of newer users, consider working with IT to set up the shared baseline strategy. Be careful not to slip into the ticket system scenario, which occurs if you ask IT to restrict the system without teaching them how to manage shared baselines. It might make sense to use the shared baseline strategy by default, and allow experienced users to step into the snapshot strategy.
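
As a rough illustration of the snapshot and restore recommendation, the sketch below uses the renv package. This page does not name a specific tool, so treat renv as an assumption here; any tool that records and reinstalls exact package versions follows the same pattern.

```r
# A minimal sketch, assuming the renv package is already installed
# (install.packages("renv")). Run these from the project directory.

# Run once per project: creates a project-local library and an renv.lock lockfile
renv::init()

# After installing or updating packages, record the exact versions in renv.lock
renv::snapshot()

# Later, or on another machine: reinstall the versions recorded in renv.lock
renv::restore()
```

Committing renv.lock alongside the project code is what lets the environment be recreated later, because the lockfile is the record of which package versions the work depended on.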

Ticket System

The ticket system scenario occurs when administrators are involved in package installation, but they do not have a strategy for ensuring consistent and safe package updates; for example:

  1. A user wants a new package installed, so they submit a ticket to have the package added
  2. An admin receives the ticket, and manually installs the new package into the system library

This scenario is problematic because it encourages partial upgrades, is often slow, and still results in broken environments!

Recommendations:

  • If your organization requires admin involvement for practical reasons (e.g. you’re working on an offline server), consider adopting the shared baseline strategy.

  • If your organization requires admin involvement for strategic reasons (e.g. you have concerns about package licenses), consider adopting the validation strategy.
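
To make the contrast with ticket-by-ticket installs concrete, here is a hedged sketch of one way an admin might pin every user to the same set of package versions under a shared baseline. It assumes the admin has access to a repository frozen at a point in time; the URL is a placeholder, not a real endpoint.

```r
# Rprofile.site: the site-wide R startup file, found under R_HOME/etc
# and maintained by an admin.
# Assumption: packages.example.com is a hypothetical host serving a
# date-frozen copy of CRAN. Substitute your organization's repository.
local({
  options(repos = c(CRAN = "https://packages.example.com/cran/2024-01-15"))
})
```

With the repository fixed this way, install.packages() resolves against the same package versions for everyone, so adding one package does not pull in newer dependencies than the rest of the library expects.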

Blocked

The blocked scenario occurs when servers are locked down, but there is no strategy in place for R package access. This scenario often leads R users to “backdoor” approaches to package access, such as manually copying over installed packages.

In this scenario, it is important for R users to level-set with IT on why R packages are essential to successful data science work. You may need to refer to the validation section of the site, which helps explain where packages come from and address issues around trust.

Come to this discussion prepared to advocate for either the shared baseline or validation strategy. It may also help your admin team to know that there are supported products, like Package Manager, designed to help them help you!
