3. User Environments
In this section, you will learn:
- what the administrator needs to provide for users to do their work
- why developers require multiple versions of R and Python
- the relationship between language and package versions
- best practices for managing language versions
- configuring where package installations come from
- identifying operating system dependency requirements for packages/libraries
Environment Needs from Data Scientists
Administrators will need to partner with the data science team to establish and maintain the necessary environments for development in Workbench. Initial requirements will include:
- Having one or more versions of the R or Python programming language installed on the server
- Establishing source(s) for downloading packages
- Establishing location(s) for installed packages
- Ensuring any underlying system dependencies required for package installation or runtime are installed on the server
As new versions of R, Python, and packages are released, the data science team may have requirements to add these to the server, while still maintaining previous environments to support any historical work.
Workbench allows you to serve multiple versions of R and Python. When you need a new version of R or Python, add the new version and leave existing versions in place. This allows developers to choose which version to run for each session and ensures that scripts that depend on a specific version will still run.
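To illustrate why older versions should stay in place, here is a minimal sketch of a script that checks the interpreter it is running on against the version it was developed for. The function name `matches_pinned_version` and the pinned version `(3, 10)` are hypothetical and chosen only for illustration.

```python
import sys


def matches_pinned_version(pinned, current=None):
    """Return True when the major.minor of `current` equals `pinned`.

    `pinned` is a (major, minor) tuple. `current` defaults to the
    interpreter that is running this script.
    """
    if current is None:
        current = sys.version_info
    return tuple(current[:2]) == tuple(pinned)


if __name__ == "__main__":
    # Hypothetical pin for an example project; adjust to your own needs.
    pinned = (3, 10)
    running = ".".join(str(part) for part in sys.version_info[:3])
    if matches_pinned_version(pinned):
        print(f"Python {running} matches the pinned version")
    else:
        print(f"Python {running} differs from the pinned {pinned[0]}.{pinned[1]}")
```

A script with a check like this keeps working only as long as the pinned version remains installed on the server.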
Adding new versions of R and Python
We recommend you install desired versions of R and Python on Workbench by following the instructions here:
Managing R and Python packages
The strength of the R and Python languages lies in the packages that extend their core functionality. Packages are specialized modules of code built for a specific language. Packages frequently depend on other packages and sometimes have system-level dependencies. Packages are updated independently of the version of R or Python.
To provide packages to your users, you will need to partner with the data science team to address two key questions:
- Where are packages installed from? (Repository)
- Where do installed packages go on the server? (Library)
Repositories: Where packages are installed from
Repositories are file servers with a defined structure for R or Python packages. A repository may be public or hosted privately within your organization. Common sources for repositories include:
- PyPI for Python (public)
- CRAN for R (public)
- Posit Public Package Manager, a public mirror for PyPI and CRAN
- Posit Package Manager, a private repository source for PyPI, CRAN, curated subsets, and internal packages
In general, a repository should be comprehensive, offering many packages and many package versions, to service the needs of developers.
R Package Installation
Users typically install R packages inside an R session using a function like install.packages().
Users can specify the address of the desired repository source for the package in the function. If that is not specified, the package will be installed from the first repository configured on the server where the package is found. You can list the repositories configured on the server by running the following in R:
options('repos')
As the administrator, you can configure the default repositories for R. You may choose to do this because:
- You have your own internal or validated repository (e.g., Posit Package Manager)
- You want users to install the pre-compiled package binaries from Posit Public Package Manager or an on-premise Posit Package Manager (binaries install significantly faster and avoid compilation errors caused by missing system dependencies)
There are three options for where default R repositories can be configured on Workbench:
File | Applies to… | Use |
---|---|---|
/etc/rstudio/rsession.conf | All RStudio Pro sessions on the server | Configuring a single default repo server-wide |
/etc/rstudio/repos.conf | All RStudio Pro sessions on the server | Configuring multiple default repos server-wide |
Rprofile.site or Renviron.site | All R sessions for a specific version of R | Configuring different repos per R version or setting a default repo per version of R across all R sessions |
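As a rough illustration of the first two options, the sketch below renders the text that might go into each file. The setting name `r-cran-repos` and the `name=url` layout of repos.conf are assumptions based on the Workbench Administration Guide and should be verified there for your version. The repository URLs and function names are hypothetical.

```python
# Hypothetical repository URLs; replace with your own repository addresses.
REPOS = {
    "CRAN": "https://packagemanager.example.com/cran/latest",
    "Internal": "https://packagemanager.example.com/internal/latest",
}


def render_rsession_conf(url):
    """Render one server-wide default repo as a line for rsession.conf.

    Assumes the setting is named `r-cran-repos`.
    """
    return f"r-cran-repos={url}\n"


def render_repos_conf(repos):
    """Render several named default repos, one name=url line per repo.

    Assumes repos.conf uses a simple name=url layout.
    """
    return "".join(f"{name}={url}\n" for name, url in repos.items())


if __name__ == "__main__":
    print(render_rsession_conf(REPOS["CRAN"]), end="")
    print(render_repos_conf(REPOS), end="")
```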
Users can override where their package installs come from in three places:
- Using their own .Rprofile.
- Changing settings in the RStudio Pro IDE.
- Changing the value in the R console.
The only way to definitively control where users can download packages from is by restricting the Workbench server’s connectivity and only allowing access to an on-premise repository such as Posit Package Manager.
Python packages
pip is a package manager for Python that handles the installation of packages. In contrast to R, pip is called from the terminal instead of within the Python REPL. The default repository for installations is set inside the pip.conf file. The pip.conf file can exist in several locations:
- /etc/pip.conf for global settings
- /opt/python/3.10.4/pip.conf for site settings
- /home/$USER/.pip/pip.conf for user-specific settings
- /home/$USER/.config/pip/pip.conf for user-specific settings
You can see these settings and confirm their location by running:
pip config -v list
You may want to customize the install source for Python for similar reasons to those listed for R above.
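As a minimal sketch of what a pip.conf that sets the default repository might contain, the following builds the file text with Python's standard `configparser` module and reads the value back. The index URL and the function names are hypothetical. `index-url` under the `[global]` section is the standard pip option for the default package index.

```python
import configparser
import io

# Hypothetical index URL; substitute your own repository address.
INDEX_URL = "https://packagemanager.example.com/pypi/latest/simple"


def render_pip_conf(index_url):
    """Build pip.conf text that sets the default package index."""
    config = configparser.ConfigParser()
    config["global"] = {"index-url": index_url}
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


def read_index_url(pip_conf_text):
    """Return the index-url from pip.conf text, or None if it is not set."""
    config = configparser.ConfigParser()
    config.read_string(pip_conf_text)
    return config.get("global", "index-url", fallback=None)


if __name__ == "__main__":
    text = render_pip_conf(INDEX_URL)
    print(text)
    print("Default index:", read_index_url(text))
```

Writing the rendered text to one of the locations listed above would set the default at that level (global, site, or user).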
Libraries: Where packages are installed to
The package installation process downloads a package from the repository and places the package in a library for use. Where a repository is ideally comprehensive, with many packages and versions, a library is deliberately narrower. For any one version of R or Python, a library can hold only one version of a given package.
On the Workbench server, a library can exist at the system level, at the user level, and even at the project level. Determining the appropriate strategy for how and where packages are installed requires partnership with the data science team. There are multiple successful patterns that are described on the Solutions Site under the Environments Management Strategy Map.
A System library makes packages available server-wide after installation by a sudo user. The library install is specific to a version of R or Python, and there can only be one version of a package in that library. A few pros and cons of having packages in a system library:
Pros:
- Reduces duplication of popular packages
- Provides a base environment of working packages
- Good for locked-down environments and more static environments
Cons:
- Can be time-intensive for admin to manage
- Challenging to upgrade safely
- Unlikely to meet all package needs with one library as requirements vary across users, projects, and time
Unless access to external package repositories is locked down, users will be able to install packages in addition to those in the system library. By default, these packages will be installed into the user library in the user’s home directory. For example, in R the user library path might look like:
/home/user_name/R/x86_64-pc-linux-gnu-library/4.2
For Python it might look like:
/home/user_name/.local/lib/python3.10/site-packages
In a running R session you can see where R packages will be installed by running .libPaths(). By default, R will install a package into the first directory given by .libPaths() that it can write to.
In Python you can find the path a package is installed at by running pip show package_name.
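The same information can also be inspected from inside a Python session. The sketch below uses only the standard library. The function names are hypothetical, and the "interpreter" entry is the site-packages directory of whichever interpreter runs the code, which is the environment's own directory when run inside a virtual environment.

```python
import importlib.util
import site
import sysconfig


def library_locations():
    """Return package directories for the running interpreter.

    "interpreter" is the site-packages directory of this Python install
    (or of the active virtual environment); "user" is the per-user library.
    """
    return {
        "interpreter": sysconfig.get_paths()["purelib"],
        "user": site.getusersitepackages(),
    }


def module_location(module_name):
    """Return where a top-level module would be imported from, or None."""
    spec = importlib.util.find_spec(module_name)
    return spec.origin if spec is not None else None


if __name__ == "__main__":
    for label, path in library_locations().items():
        print(f"{label}: {path}")
    print("json is loaded from:", module_location("json"))
```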
Project libraries enable data scientists to better isolate projects. The developer can use a package such as renv for R or venv or virtualenv for Python to manage project-specific libraries.
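For Python, a project library can be sketched with the standard `venv` module. The `.venv` directory name and the function name `create_project_library` are conventions assumed for this example, not requirements.

```python
import pathlib
import tempfile
import venv


def create_project_library(project_dir):
    """Create a virtual environment in <project_dir>/.venv and return its path."""
    env_dir = pathlib.Path(project_dir) / ".venv"
    # with_pip=False keeps this sketch fast and offline; use True for real projects.
    venv.create(env_dir, with_pip=False)
    return env_dir


if __name__ == "__main__":
    # A temporary directory stands in for a real project folder.
    with tempfile.TemporaryDirectory() as project:
        env = create_project_library(project)
        print((env / "pyvenv.cfg").read_text())
```

Packages installed while the environment is active go into its own site-packages directory rather than the system or user library.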
System Dependencies
Some packages require specific operating system or third-party software. If software (OS or third-party) is missing, package installation will fail until you install the dependencies. There are a few ways you may be able to discover this:
- Review requirements listed in package documentation
- Check system dependencies listed for the package in Posit Package Manager
- Install the package and review the error for any insight on missing dependencies
If you or a user is having trouble installing a package, confirming that all dependencies are installed is a good place to start troubleshooting. Using pre-compiled package binaries from Posit Public Package Manager or an on-premise Posit Package Manager repository avoids compilation errors caused by missing system dependencies, although any dependencies needed at runtime must still be installed on the server.
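As a starting point for troubleshooting, a quick check for shared libraries and command-line tools can be sketched with the Python standard library. The library and command names below are hypothetical examples, not the requirements of any particular package, and a `False` result only means the lookup did not find the item on this system.

```python
import ctypes.util
import shutil


def check_system_dependencies(libraries=(), commands=()):
    """Report whether shared libraries and commands can be found.

    Returns a dict mapping "lib:<name>" or "cmd:<name>" to True or False.
    """
    report = {}
    for name in libraries:
        # find_library returns None when the linker lookup finds nothing.
        report[f"lib:{name}"] = ctypes.util.find_library(name) is not None
    for name in commands:
        # shutil.which returns None when the command is not on PATH.
        report[f"cmd:{name}"] = shutil.which(name) is not None
    return report


if __name__ == "__main__":
    # Hypothetical examples; use the names from the package's documentation.
    result = check_system_dependencies(
        libraries=("xml2", "curl"),
        commands=("gcc", "make"),
    )
    for item, found in result.items():
        print(f"{item}: {'found' if found else 'missing'}")
```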
🚀 Launch the exercise environment!
In the exercise environment you will get experience:
- adding new versions of R and Python
- configuring default sources for package installs