Dev/Test/Prod with Posit Team
It is common for an analysis project to lead into a second phase. In this second phase, one or several data products are developed. A data product could be a dashboard, report, API, or ETL process. A data product takes the insights gathered during the analysis phase and makes them available to stakeholders on a permanent basis.
Unlike the experimental nature of data analysis, a data product has to work consistently when consumed. This means that the code for the data product will need to be developed in a more formal manner. Development can occur in three basic stages:
- The product is developed and tested by the developer
- One, or a few, stakeholders test the product for functionality
- The product is made available to all stakeholders
Each of these stages occurs in separate environments, respectively referred to as:
- Development
- Testing
- Production
After the product is successfully tested in a stage, the code is promoted to the next stage.
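The promotion rule above can be written down as a minimal sketch. The names `Stage` and `promote` are hypothetical and exist only for illustration; they are not part of any Posit product.

```python
from enum import IntEnum


class Stage(IntEnum):
    """The three environments, listed in promotion order."""
    DEVELOPMENT = 1
    TESTING = 2
    PRODUCTION = 3


def promote(current: Stage, tests_passed: bool) -> Stage:
    """Return the next stage only if testing in the current stage succeeded."""
    if not tests_passed or current is Stage.PRODUCTION:
        # Failed tests keep the code where it is; Production is the last stage.
        return current
    return Stage(current + 1)


if __name__ == "__main__":
    stage = promote(Stage.DEVELOPMENT, tests_passed=True)
    print(stage.name)
```

The point of the sketch is that code never skips a stage: it moves forward one environment at a time, and only after passing the tests of the environment it is in.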
Code promotion in Posit Team
Development
With Posit Team, code development and testing are done within two of our products. As illustrated in this section’s diagram, development and unit testing happen in Workbench. Only developers, such as the data scientists or data analysts, need access to Workbench. They perform unit testing of the product before making it available to other stakeholders.
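As an illustration of the kind of unit test a developer might run in Workbench before deploying, here is a small sketch using Python's standard `unittest` module. The function `monthly_totals` and its record layout are assumptions made up for this example; a real data product would test its own logic.

```python
import unittest


def monthly_totals(records):
    """Sum the 'amount' field for each 'month' (hypothetical product logic)."""
    totals = {}
    for record in records:
        month = record["month"]
        totals[month] = totals.get(month, 0) + record["amount"]
    return totals


class MonthlyTotalsTest(unittest.TestCase):
    def test_sums_amounts_per_month(self):
        records = [
            {"month": "2024-01", "amount": 10},
            {"month": "2024-01", "amount": 5},
            {"month": "2024-02", "amount": 7},
        ]
        self.assertEqual(
            monthly_totals(records), {"2024-01": 15, "2024-02": 7}
        )

    def test_empty_input_gives_empty_result(self):
        self.assertEqual(monthly_totals([]), {})


if __name__ == "__main__":
    # Run the test case explicitly so the script does not exit the interpreter.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(MonthlyTotalsTest)
    unittest.TextTestRunner().run(suite)
```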
Testing
Once the data product is ready for review, the developer deploys it to Connect. The stakeholders who are responsible for making sure that the product works as expected are then able to access it via Connect. This is called user acceptance testing (UAT).
The data product may depend on external assets, such as databases or shared drives. It is important to make sure that they are still accessible to the data product once it is deployed to Connect. This is called integration testing.
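A check that external assets are reachable can be sketched with the Python standard library alone. The helper names below are hypothetical, and a plain TCP connection only shows that a host and port respond; it does not verify credentials or permissions, which a real check would also need to cover.

```python
import os
import socket


def path_is_readable(path: str) -> bool:
    """True if the path (for example, a shared drive mount) exists and is readable."""
    return os.path.exists(path) and os.access(path, os.R_OK)


def host_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port can be opened within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_assets(paths, endpoints):
    """Return a mapping of asset label to whether that asset is accessible."""
    results = {}
    for path in paths:
        results[f"path:{path}"] = path_is_readable(path)
    for host, port in endpoints:
        results[f"tcp:{host}:{port}"] = host_is_reachable(host, port)
    return results
```

Calling `check_assets` with the product's own shared paths and database endpoints, once in Workbench and once from the deployed content, makes it easy to spot an asset that is visible from one server but not the other.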
Production
After all testing is completed, the data product is made available to all stakeholders for consumption. In some cases, such as when the data product is a script that performs data transformation or ETL, the last step is to schedule how often the script runs. These steps are completed within Connect.
Deployment with Connect
There are a few ways to deploy content to Connect. By deployment, we mean moving the code, its dependent files, and metadata about the R and/or Python version and the packages that the data product uses. To learn about the available options for deploying to Connect, see our article on deployments.
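To make the idea of environment metadata concrete, the sketch below collects the running Python version and the installed versions of a few named packages. This is an illustration only: `describe_environment` is a hypothetical helper, and the dictionary it returns is not the format that Connect or its deployment tools actually use.

```python
import sys
from importlib import metadata


def describe_environment(package_names):
    """Collect the interpreter version and installed versions of named packages."""
    packages = {}
    for name in package_names:
        try:
            packages[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            packages[name] = None  # not installed in this environment
    return {
        "python": "{}.{}.{}".format(*sys.version_info[:3]),
        "packages": packages,
    }


if __name__ == "__main__":
    # The package list is an example; use the packages the data product imports.
    print(describe_environment(["pip"]))
```

Recording this information alongside the code is what allows the target server to rebuild the same environment the developer tested against.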
Package Manager
Here are two scenarios in which using Package Manager is needed for a successful promotion of code:
Some organizations do not allow servers to have access to the Internet. Actions such as patching and upgrades are performed offline. This is called an air-gapped environment. In such an environment, Workbench and Connect are not able to download packages on demand. Package Manager allows someone in the enterprise to download CRAN manually and then perform the update offline. Package Manager then becomes the source of packages for the other two products.
Many organizations use a combination of Workbench and RStudio Desktop, the open-source desktop version of RStudio. The package sources available to software running on someone’s laptop can differ from those available to a central server. Using Package Manager ensures that both are able to access the exact same packages.
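In both scenarios, every client is pointed at the same package source. For Python packages, one way to sketch this is to build a `pip install` command that uses pip's `--index-url` option. The URL below is a placeholder, not a real server, and `pip_install_command` is a hypothetical helper; substitute the repository URL that your own Package Manager instance exposes.

```python
import sys

# Placeholder URL (assumption): replace with your Package Manager repository URL.
PACKAGE_MANAGER_INDEX = "https://packagemanager.example.com/pypi/latest/simple"


def pip_install_command(packages, index_url=PACKAGE_MANAGER_INDEX):
    """Build a pip command that installs only from the given package index."""
    return [
        sys.executable, "-m", "pip", "install",
        "--index-url", index_url,
        *packages,
    ]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(" ".join(pip_install_command(["pandas"])))
```

Because a laptop and a server would both run the same command against the same index, they resolve packages from the same source. The resulting list can be passed to `subprocess.run` when the install should actually be performed.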
Server Environments
Minimal
We recommend that each component of Posit Team be installed in its own, independent server environment. Server environment here refers to a single server, or a cluster of multiple servers, such as those used to provide High Availability. There should be at minimum three server environments, one for each product. In this setup, the Test and Production stages occur in the same server environment.
Separate Test and Production
A preferable setup may be to have separate server environments for Test and Production. This ensures that resources needed to serve data products that are already in Production will not be impacted by ongoing tests. Another reason to have separate server environments is to limit who can publish data products to Production. For example, the developer is able to deploy a data product to the Test server environment, but will need to request that IT deploy the final product to the Production server environment. That ensures that no changes reach the official version of the data product without being fully tested and approved.
Testing server upgrades
Eventually, the servers themselves will need to be patched or upgraded. For example, the Posit software installed on the server may need to be upgraded. Before upgrading the servers used for code development and deployment, it is a good idea to test the changes in a separate server environment. The servers in that environment are called staging servers, and they are meant to mirror the servers that are in regular use. The staging servers are infrequently used, and usually only IT and maybe some developers will have access to them. They are meant only to confirm that software upgrades were successful.
Appendix
Why not a cron job inside Workbench?
There are cases when an R or Python script needs to run on a regular basis for the foreseeable future. It is very common that over time those scripts grow, both in number and importance. Depending on a single developer to run all of the scripts becomes a problem. The solution is to automate the scripts.
Please be aware that at this point, those scripts are no longer considered to be “in-development.” When the enterprise, or a team in the enterprise, depends on these scripts to run on a regular and consistent basis, they are Production scripts. As such, they should be moved to Connect.
There is also a practical reason to move the scripts to Connect. A cron job depends on the same user account, the same version of R or Python, and the same package versions remaining available every time the script runs. Connect handles all the dependencies and the scheduling in a safe and consistent manner.
Connect isolates each data product that is deployed to it, so there are no issues with some data products using one version of a given package, while other data products use a different version of the same package. Connect makes sure that no package version collisions occur.
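To see why isolation matters, consider two data products that pin different versions of the same package. The sketch below finds such conflicts. The function name, the product names, and the version numbers are made up for illustration, and this is not how Connect itself is implemented.

```python
def find_version_conflicts(pins_a, pins_b):
    """Return packages that the two products pin to different versions."""
    shared = sorted(pins_a.keys() & pins_b.keys())
    return {
        name: (pins_a[name], pins_b[name])
        for name in shared
        if pins_a[name] != pins_b[name]
    }


if __name__ == "__main__":
    # Hypothetical pins for two data products.
    dashboard = {"pandas": "1.5.3", "requests": "2.31.0"}
    etl_script = {"pandas": "2.2.0", "requests": "2.31.0"}
    print(find_version_conflicts(dashboard, etl_script))
```

In a single shared environment, such as the one a cron job typically runs in, only one of the two `pandas` versions can be installed at a time, so one of the products would break. When each product has its own isolated environment, the conflict never arises.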